# Using GPUs on Nebari
## Introduction
This tutorial covers using GPUs on Nebari, including starting a GPU server, creating GPU-compatible environments, and validating the setup.
## 1. Starting a GPU server
Follow Steps 1 to 3 in the [Authenticate and launch JupyterLab][login-with-keycloak] tutorial. The UI will show a list of profiles (also known as instances, servers, or machines).
Your administrator pre-configures these options, as described in the [Profile Configuration documentation][profile-configuration].
Select an appropriate GPU Instance and click "Start".
### Understanding GPU setup on the server
The following steps describe how to get CUDA-related information from the server.
- Once your server starts, it will redirect you to a JupyterLab home page.
- Click on the "Terminal" icon.
- Run the command `nvidia-smi`. The top-right corner of the command's output shows the highest CUDA version supported by the installed driver.
If you get the error `nvidia-smi: command not found`, you are most likely on a non-GPU server. Shut down your server and start a GPU-enabled server instead.
Environments used on this server must be built with a CUDA version at or below the version the server supports. For example, if the server reports CUDA 12.4, all environments used on it must contain packages built with CUDA <= 12.4.
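For reference, the header of the `nvidia-smi` output looks roughly like the sketch below (the driver and CUDA versions shown are illustrative, not prescriptive):

```text
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15    Driver Version: 550.54.15    CUDA Version: 12.4     |
+-----------------------------------------------------------------------------+
```

The `CUDA Version` field in the top-right corner is the value to compare your environments against.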
## 2. Creating environments
### Build a GPU-compatible environment
By default, `conda-store` will build CPU-compatible packages. To build GPU-compatible packages, we have two options:
- Create the environment specification using `CONDA_OVERRIDE_CUDA` (recommended approach):
conda-store provides an alternate mechanism to enable GPU environments by setting an environment variable, as explained in the conda-store docs.
While creating a new config, click on the **GUI <-> YAML** toggle to edit the YAML config:
```yaml
channels:
  - pytorch
  - conda-forge
dependencies:
  - pytorch
  - ipykernel
variables:
  CONDA_OVERRIDE_CUDA: "12.1"
```
Alternatively, you can create the same config using the UI. Add the `CONDA_OVERRIDE_CUDA` override to the variables section to tell conda-store to build a GPU-compatible environment.
At the time of writing this document, the latest CUDA version listed was 12.1. Follow the steps below to determine the latest override value for the `CONDA_OVERRIDE_CUDA` environment variable, and ensure that the version you choose from the PyTorch documentation is not greater than the highest supported version in the `nvidia-smi` output (captured above).
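To double-check which CUDA version conda detects on your current server (the value that `CONDA_OVERRIDE_CUDA` overrides at solve time), you can inspect conda's virtual packages from a terminal on the GPU server:

```bash
# The __cuda virtual package reflects the CUDA version supported by the driver
conda info | grep __cuda
```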
- Create the environment specification based on recommendations from the PyTorch documentation:
You can check the PyTorch documentation to get a quick list of the necessary CUDA-specific packages.
Select the following options to get the latest CUDA version:
- PyTorch Build = Stable
- Your OS = Linux
- Package = Conda
- Language = Python
- Compute Platform = 12.1 (select the version that is less than or equal to the `nvidia-smi` output on your server, as captured above)
The resulting `conda install` command is:

```bash
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
```
The corresponding YAML config would be:

```yaml
channels:
  - pytorch
  - nvidia
  - conda-forge
dependencies:
  - pytorch
  - pytorch-cuda==12.1
  - torchvision
  - torchaudio
  - ipykernel
variables: {}
```
The order of the channels is respected by conda, so keep `pytorch` at the top, followed by `nvidia`, then `conda-forge`.
You can use the **GUI <-> YAML** toggle to edit the config.
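Once the environment finishes building, you can optionally confirm from a terminal (with the environment activated) that it contains CUDA-enabled builds; the exact versions and build strings will vary with your spec:

```bash
# GPU builds of pytorch typically include "cuda" in the build string
conda list | grep -iE "pytorch|cuda"
```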
## 3. Validating the setup
You can check that your GPU server is compatible with your conda environment by opening a Jupyter Notebook, loading the environment, and running the following code:

```python
import torch

print(f"GPU available: {torch.cuda.is_available()}")
print(f"Number of GPUs available: {torch.cuda.device_count()}")
print(f"ID of current GPU: {torch.cuda.current_device()}")
print(f"Name of first GPU: {torch.cuda.get_device_name(0)}")
```
Your output should look something like this:
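For example, on a hypothetical single-GPU server (your GPU name and count will differ):

```text
GPU available: True
Number of GPUs available: 1
ID of current GPU: 0
Name of first GPU: Tesla T4
```

As an additional check, `torch.version.cuda` reports the CUDA version your PyTorch build was compiled against; it should be at or below the version reported by `nvidia-smi`.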