Nebari Slurm - Deploy Nebari on HPC systems
Nebari Slurm is an opinionated, open source deployment of JupyterHub built on top of an HPC job scheduler. Nebari Slurm is a "distribution" of these tools much like Debian and Ubuntu are distributions of Linux. The high-level goal of this distribution is to form a cohesive set of tools that enables:
- environment management via conda and conda-store
- monitoring of compute infrastructure and services
- scalable and efficient compute via JupyterLab and Dask
- on-premises deployment of JupyterHub without requiring deep DevOps knowledge of the Slurm/HPC and Jupyter ecosystems
Nebari Slurm was previously called QHub-HPC. The documentation pages are still being migrated, so you may encounter a few mentions of the original name.
Overview
Nebari Slurm is a High-Performance Computing (HPC) deployment using JupyterHub. In this document, we will discuss the services that run within this architecture and how they are interconnected. The setup follows a standard HPC configuration with a master/login node and 'N' worker nodes.
The master node serves as the central control and coordination hub for the entire cluster. It plays a pivotal role in managing and optimizing cluster resources and ensuring secure, efficient, and reliable operations. In contrast, worker nodes primarily focus on executing computational tasks and rely on instructions from the master node for job execution.
At a high level, the architecture comprises several key services: monitoring, the job scheduler (Slurm), and JupyterHub along with related Python services.
Important URLs:
- https://<master node ip>/ : JupyterHub server
- https://<master node ip>/monitoring/ : Grafana server
- https://<master node ip>/auth/ : Keycloak server
- https://<master node ip>/gateway/ : Dask-Gateway server for remote connections (see the client sketch below)
- ssh <master node ip> -p 8022 : SSH into a JupyterLab session for users (requires a JupyterHub token)
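As an illustration of the /gateway/ route, below is a minimal sketch of connecting to Dask-Gateway from a remote machine with the dask-gateway client. The address and token are placeholders, and the exact authentication setup depends on how your deployment is configured (a self-signed certificate, for example, may require extra TLS configuration).

```python
from dask_gateway import Gateway
from dask_gateway.auth import JupyterHubAuth

# Placeholder values -- substitute your master node address and a
# JupyterHub API token generated from the JupyterHub token page.
gateway = Gateway(
    address="https://<master node ip>/gateway/",
    auth=JupyterHubAuth(api_token="<your-jupyterhub-token>"),
)

# Start a Dask cluster through the gateway and scale it to a few workers.
cluster = gateway.new_cluster()
cluster.scale(2)

# Obtain a client for submitting work, then clean up when done.
client = cluster.get_client()
print(client)
cluster.shutdown()
```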
Services (All Nodes)
- node_exporter: Collects node metrics (default port 9100)
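To confirm that node_exporter is running on a given node, you can fetch its metrics endpoint directly. This is a quick sketch using only the Python standard library; the node address is a placeholder.

```python
import urllib.request

# node_exporter serves Prometheus-format text metrics on port 9100.
# Replace <node ip> with the address of any node in the cluster.
url = "http://<node ip>:9100/metrics"

with urllib.request.urlopen(url, timeout=5) as response:
    metrics = response.read().decode("utf-8")

# Print a few sample lines, e.g. node_cpu_seconds_total counters.
for line in metrics.splitlines()[:10]:
    print(line)
```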
Master Node
Services
Authentication
- Keycloak: Provides enterprise-grade open-source authentication
Control and Coordination
- Slurm: Manages job scheduling, resource allocation, and cluster control (a submission sketch follows this list)
- slurmctld: The Slurm central management daemon
- slurmdbd: Handles Slurm accounting
- MySQL: Acts as the database for Slurm accounting
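For a quick check that slurmctld is accepting and scheduling work, a trivial job can be submitted from the login node. The sketch below simply shells out to the standard Slurm command-line tools (sbatch, squeue); the job name is arbitrary, and you may need to add partition or account options depending on your site configuration.

```python
import subprocess

# Submit a trivial job via sbatch --wrap; add --partition/--account as needed.
submit = subprocess.run(
    ["sbatch", "--job-name=smoke-test", "--wrap", "hostname && sleep 10"],
    capture_output=True,
    text=True,
    check=True,
)

# sbatch prints e.g. "Submitted batch job 1234"; grab the job id.
job_id = submit.stdout.strip().split()[-1]
print(f"Submitted job {job_id}")

# Show the job in the queue; slurmdbd records it for accounting once it runs.
queue = subprocess.run(["squeue", "-j", job_id], capture_output=True, text=True)
print(queue.stdout)
```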
Reverse Proxy and Routing
- Traefik: Serves as an open-source reverse proxy, routing incoming traffic to the appropriate service
Monitoring and Metrics
- Grafana: Acts as a central place to view monitoring information (default port 3000)
- Prometheus: Scrapes metrics (default port 9090)
- slurm_exporter: Provides Slurm metrics (default port 9341)
- Traefik: Exposes its own metrics for Prometheus to scrape
- JupyterHub: Exposes its own metrics for Prometheus to scrape (a query sketch follows this list)
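The Prometheus HTTP API can be queried to confirm that metrics from the exporters above are being scraped. A minimal sketch, assuming Prometheus is reachable on its default port 9090 from the master node; the PromQL expression is just an example.

```python
import json
import urllib.parse
import urllib.request

# Prometheus instant-query endpoint; <master node ip> is a placeholder.
base = "http://<master node ip>:9090/api/v1/query"

# Example PromQL: which scrape targets are currently reporting as up.
params = urllib.parse.urlencode({"query": "up"})

with urllib.request.urlopen(f"{base}?{params}", timeout=5) as response:
    payload = json.load(response)

# Each result pairs target labels (job, instance) with the current value.
for result in payload["data"]["result"]:
    labels = result["metric"]
    print(labels.get("job"), labels.get("instance"), result["value"][1])
```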
Python Ecosystem
- JupyterHub: Provides scalable interactive computing (default port 8000); an API sketch follows this list
- Dask-Gateway: Enables scalable distributed computing
- NFS server: Facilitates sharing Conda environments and home directories among all users
- conda-store: Builds and manages Conda environments shared across the nodes
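As an example of interacting with JupyterHub programmatically, the sketch below calls the JupyterHub REST API through the proxied URL listed earlier. The address and token are placeholders; a token can be generated from the JupyterHub token page (or with `jupyterhub token <user>` on the master node), and a self-signed certificate may require extra TLS handling.

```python
import json
import urllib.request

# Placeholder values -- substitute your master node address and an API token.
base = "https://<master node ip>/hub/api"
token = "<your-jupyterhub-token>"

def hub_get(path: str) -> dict:
    """GET a JupyterHub REST API endpoint using token authentication."""
    request = urllib.request.Request(
        f"{base}{path}",
        headers={"Authorization": f"token {token}"},
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.load(response)

# The API root reports the hub version; /user returns the authenticated
# user's model (server status, last activity, and so on).
print(hub_get(""))
print(hub_get("/user"))
```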
Worker Nodes
Worker nodes primarily focus on executing computational tasks and have minimal dependencies, making them efficient for running parallel workloads. They receive job execution instructions from the master node and carry none of its control and coordination responsibilities, leaving the master node to orchestrate the cluster as described above.