
Nebari Slurm - Deploy Nebari on HPC systems

Nebari Slurm is an opinionated open source deployment of JupyterHub based on an HPC job scheduler. Nebari Slurm is a "distribution" of packages such as JupyterHub, Slurm, Dask, and conda-store, much like Debian and Ubuntu are distributions of Linux. The high-level goal of this distribution is to form a cohesive set of tools that enable:

  • environment management via conda and conda-store
  • monitoring of compute infrastructure and services
  • scalable and efficient compute via JupyterLab and Dask
  • deployment of JupyterHub on-premises without requiring deep DevOps knowledge of the Slurm/HPC and Jupyter ecosystems
Important

Nebari Slurm was previously called QHub-HPC. The documentation pages are still being migrated, so a few mentions of the original name remain.

Overview

Nebari Slurm is a High-Performance Computing (HPC) deployment using JupyterHub. In this document, we will discuss the services that run within this architecture and how they are interconnected. The setup follows a standard HPC configuration with a master/login node and 'N' worker nodes.

The master node serves as the central control and coordination hub for the entire cluster. It plays a pivotal role in managing and optimizing cluster resources and ensuring secure, efficient, and reliable operations. In contrast, worker nodes primarily focus on executing computational tasks and rely on instructions from the master node for job execution.

At a high level, the architecture comprises several key services: monitoring, the job scheduler (Slurm), and JupyterHub along with related Python services.

Important URLs:

  • https://<master node ip>/: JupyterHub server
  • https://<master node ip>/monitoring/: Grafana server
  • https://<master node ip>/auth/: Keycloak server
  • https://<master node ip>/gateway/: Dask-Gateway server for remote connections (a connection sketch follows this list)
  • ssh <master node ip> -p 8022: SSH into a JupyterLab session for users (requires a JupyterHub token)
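As an illustration of how the Dask-Gateway and JupyterHub-token endpoints above fit together, here is a minimal connection sketch. The address and API token are placeholders, not values from this page; substitute the master node IP and a token generated through your JupyterHub deployment.

```python
# Hypothetical sketch: connect to the Dask-Gateway endpoint from a client
# machine using a JupyterHub API token. The address and token are placeholders.
from dask_gateway import Gateway
from dask_gateway.auth import JupyterHubAuth

gateway = Gateway(
    address="https://<master node ip>/gateway/",
    auth=JupyterHubAuth(api_token="<your-jupyterhub-api-token>"),
)

# List any clusters you already own, then start a new one and scale it.
print(gateway.list_clusters())
cluster = gateway.new_cluster()
cluster.scale(2)

# Get a dask.distributed client bound to the cluster and run a trivial task.
client = cluster.get_client()
print(client.submit(sum, [1, 2, 3]).result())

cluster.shutdown()
```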

Services

Master Node

Authentication

  • Keycloak: Provides enterprise-grade open-source authentication
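Keycloak exposes standard OpenID Connect endpoints beneath the /auth/ path listed above. The sketch below requests an access token with the password grant; the realm name, client id, and credentials are placeholders (assumptions), and the client must be configured to allow this grant in your Keycloak realm.

```python
# Hypothetical sketch of authenticating against Keycloak with the standard
# OpenID Connect password grant. Realm, client id, and credentials are
# assumptions -- replace them with the values configured in your deployment.
import requests

KEYCLOAK_URL = "https://<master node ip>/auth"
REALM = "<realm>"            # assumption: the realm used by your deployment
CLIENT_ID = "<client-id>"    # assumption: a client with direct access grants enabled

response = requests.post(
    f"{KEYCLOAK_URL}/realms/{REALM}/protocol/openid-connect/token",
    data={
        "grant_type": "password",
        "client_id": CLIENT_ID,
        "username": "<username>",
        "password": "<password>",
    },
)
response.raise_for_status()
access_token = response.json()["access_token"]
```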

Control and Coordination

  • Slurm: Manages job scheduling, resource allocation, and cluster control (a quick inspection sketch follows this list)
  • slurmctld: The Slurm central management daemon
  • slurmdbd: The Slurm database daemon, which records job accounting data
  • MySQL: Acts as the database for Slurm accounting
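The scheduler state that slurmctld and slurmdbd maintain can be inspected with the standard Slurm command-line tools. A minimal sketch from Python, assuming the Slurm CLI is on the PATH (as it is on the master node):

```python
# Minimal sketch: inspect the Slurm scheduler by shelling out to the standard
# Slurm CLI tools (sinfo, squeue), which are available on the master node.
import getpass
import subprocess

def run(cmd: list[str]) -> str:
    """Run a Slurm CLI command and return its stdout."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# Partition and node state summary.
print(run(["sinfo"]))

# Jobs queued or running for the current user.
print(run(["squeue", "-u", getpass.getuser()]))
```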

Reverse Proxy and Routing

  • Traefik: Serves as an open-source network proxy, routing network traffic efficiently

Monitoring and Metrics

  • Grafana: Provides monitoring dashboards for the cluster, served at https://<master node ip>/monitoring/

Python Ecosystem

  • JupyterHub: Provides multi-user interactive computing (default port 8000; see the REST API sketch after this list)
  • Dask-Gateway: Enables scalable distributed computing
  • NFS server: Facilitates sharing Conda environments and home directories among all users
  • conda-store: Builds and manages conda environments that are shared across the cluster's nodes
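To illustrate, JupyterHub's REST API is reachable under /hub/api on the JupyterHub endpoint listed at the top of this page and can be queried with a user API token. The base URL and token below are placeholders for your deployment's values.

```python
# Hypothetical sketch: query the JupyterHub REST API with a user API token.
# The base URL and token are placeholders for your deployment's values.
import requests

HUB_URL = "https://<master node ip>/hub/api"
TOKEN = "<your-jupyterhub-api-token>"   # generated from the JupyterHub token page
headers = {"Authorization": f"token {TOKEN}"}

# Hub version -- does not require authentication.
print(requests.get(f"{HUB_URL}/").json())

# The authenticated user's model, including any running servers.
me = requests.get(f"{HUB_URL}/user", headers=headers).json()
print(me["name"], list(me.get("servers", {})))
```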

Worker Nodes

Worker nodes focus on executing computational tasks and have minimal dependencies, which makes them efficient for running parallel workloads. They receive their job instructions from the master node and carry none of the control and coordination responsibilities described above.