Nebari Slurm - Deploy Nebari on HPC systems
Nebari Slurm is an opinionated, open source deployment of JupyterHub built on top of an HPC job scheduler. Nebari Slurm is a "distribution" of these tools much like Debian and Ubuntu are distributions of Linux. The high-level goal of this distribution is to form a cohesive set of tools that enables:
- environment management via conda and conda-store
- monitoring of compute infrastructure and services
- scalable and efficient compute via JupyterLab and Dask
- on-premises deployment of JupyterHub without requiring deep DevOps knowledge of the Slurm/HPC and Jupyter ecosystems
Nebari Slurm was previously called QHub-HPC. The documentation pages are still being migrated, so you may encounter a few mentions of the original name.
Overview
Nebari Slurm is a High-Performance Computing (HPC) deployment using JupyterHub. In this document, we will discuss the services that run within this architecture and how they are interconnected. The setup follows a standard HPC configuration with a master/login node and 'N' worker nodes.
The master node serves as the central control and coordination hub for the entire cluster. It plays a pivotal role in managing and optimizing cluster resources and ensuring secure, efficient, and reliable operations. In contrast, worker nodes primarily focus on executing computational tasks and rely on instructions from the master node for job execution.
At a high level, the architecture comprises several key services: monitoring, the job scheduler (Slurm), and JupyterHub along with related Python services.
Important URLs:
- https://<master node ip>/ : JupyterHub server
- https://<master node ip>/monitoring/ : Grafana server
- https://<master node ip>/auth/ : Keycloak server
- https://<master node ip>/gateway/ : Dask-Gateway server for remote connections (see the client sketch below)
- ssh <master node ip> -p 8022 : SSH into a JupyterLab session for users (requires a JupyterHub token)
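As an illustration of the /gateway/ route, below is a minimal sketch of connecting to Dask-Gateway from a remote machine with the dask-gateway client. The address and token are placeholders, and the exact authentication setup depends on how your deployment is configured (a self-signed certificate, for example, may require extra TLS configuration).

```python
from dask_gateway import Gateway
from dask_gateway.auth import JupyterHubAuth

# Placeholder values -- substitute your master node address and a
# JupyterHub API token generated from the JupyterHub token page.
gateway = Gateway(
    address="https://<master node ip>/gateway/",
    auth=JupyterHubAuth(api_token="<your-jupyterhub-token>"),
)

# Start a Dask cluster through the gateway and scale it to a few workers.
cluster = gateway.new_cluster()
cluster.scale(2)

# Obtain a client for submitting work, then clean up when done.
client = cluster.get_client()
print(client)
cluster.shutdown()
```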
Services (All Nodes)
- node_exporter: Collects node metrics (default port 9100)
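To confirm that node_exporter is running on a given node, you can fetch its metrics endpoint directly. This is a quick sketch using only the Python standard library; the node address is a placeholder.

```python
import urllib.request

# node_exporter serves Prometheus-format text metrics on port 9100.
# Replace <node ip> with the address of any node in the cluster.
url = "http://<node ip>:9100/metrics"

with urllib.request.urlopen(url, timeout=5) as response:
    metrics = response.read().decode("utf-8")

# Print a few sample lines, e.g. node_cpu_seconds_total counters.
for line in metrics.splitlines()[:10]:
    print(line)
```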
Master Node
Services
Authentication
- Keycloak: Provides enterprise-grade open-source authentication
Control and Coordination
- Slurm: Manages job scheduling, resource allocation, and cluster control (a submission sketch follows this list)
- slurmctld: The Slurm central management daemon
- slurmdbd: Handles Slurm accounting
- MySQL: Acts as the database for Slurm accounting
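For a quick check that slurmctld is accepting and scheduling work, a trivial job can be submitted from the login node. The sketch below simply shells out to the standard Slurm command-line tools (sbatch, squeue); the job name is arbitrary, and you may need to add partition or account options depending on your site configuration.

```python
import subprocess

# Submit a trivial job via sbatch --wrap; add --partition/--account as needed.
submit = subprocess.run(
    ["sbatch", "--job-name=smoke-test", "--wrap", "hostname && sleep 10"],
    capture_output=True,
    text=True,
    check=True,
)

# sbatch prints e.g. "Submitted batch job 1234"; grab the job id.
job_id = submit.stdout.strip().split()[-1]
print(f"Submitted job {job_id}")

# Show the job in the queue; slurmdbd records it for accounting once it runs.
queue = subprocess.run(["squeue", "-j", job_id], capture_output=True, text=True)
print(queue.stdout)
```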
Reverse Proxy and Routing
- Traefik: Serves as an open-source reverse proxy, routing incoming traffic to the appropriate service
Monitoring and Metrics
- Grafana: Acts as a central place to view monitoring information (default port 3000)
- Prometheus: Scrapes metrics (default port 9090)
- slurm_exporter: Provides Slurm metrics (default port 9341)
- Traefik: Exposes its own metrics for Prometheus to scrape
- JupyterHub: Exposes its own metrics for Prometheus to scrape (a query sketch follows this list)
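The Prometheus HTTP API can be queried to confirm that metrics from the exporters above are being scraped. A minimal sketch, assuming Prometheus is reachable on its default port 9090 from the master node; the PromQL expression is just an example.

```python
import json
import urllib.parse
import urllib.request

# Prometheus instant-query endpoint; <master node ip> is a placeholder.
base = "http://<master node ip>:9090/api/v1/query"

# Example PromQL: which scrape targets are currently reporting as up.
params = urllib.parse.urlencode({"query": "up"})

with urllib.request.urlopen(f"{base}?{params}", timeout=5) as response:
    payload = json.load(response)

# Each result pairs target labels (job, instance) with the current value.
for result in payload["data"]["result"]:
    labels = result["metric"]
    print(labels.get("job"), labels.get("instance"), result["value"][1])
```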
Python Ecosystem
- JupyterHub: Provides scalable interactive computing (default port 8000); an API sketch follows this list
- Dask-Gateway: Enables scalable distributed computing
- NFS server: Facilitates sharing Conda environments and home directories among all users
- conda-store: Builds and manages Conda environments shared across the nodes
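As an example of interacting with JupyterHub programmatically, the sketch below calls the JupyterHub REST API through the proxied URL listed earlier. The address and token are placeholders; a token can be generated from the JupyterHub token page (or with `jupyterhub token <user>` on the master node), and a self-signed certificate may require extra TLS handling.

```python
import json
import urllib.request

# Placeholder values -- substitute your master node address and an API token.
base = "https://<master node ip>/hub/api"
token = "<your-jupyterhub-token>"

def hub_get(path: str) -> dict:
    """GET a JupyterHub REST API endpoint using token authentication."""
    request = urllib.request.Request(
        f"{base}{path}",
        headers={"Authorization": f"token {token}"},
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.load(response)

# The API root reports the hub version; /user returns the authenticated
# user's model (server status, last activity, and so on).
print(hub_get(""))
print(hub_get("/user"))
```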
Worker Nodes
Worker nodes primarily focus on executing computational tasks and have minimal dependencies, making them efficient for running parallel workloads. They receive job execution instructions from the master node and carry none of its control and coordination responsibilities, leaving the master node to orchestrate the cluster as described above.