GPUs on QUEST

Quest and Kellogg Linux Cluster Downtime, March 22 - 31.

Quest, including the Quest Analytics Nodes, the Genomics Compute Cluster (GCC), the Kellogg Linux Cluster (KLC), and Quest OnDemand, will be unavailable for scheduled maintenance starting at 8 A.M. on Saturday, March 22, and ending at approximately 5 P.M. on Monday, March 31. During the maintenance window, you will not be able to log in to Quest, the Quest Analytics Nodes, the GCC, KLC, or Quest OnDemand; submit new jobs; run jobs; or access files stored on Quest in any way, including Globus. For details on this maintenance, please see the Status of University IT Services page.

We strongly encourage all Quest users to review the summary of post-maintenance changes, as these changes could immediately impact your workflow after the maintenance.

Quest RHEL8 Pilot Environment

The RHEL8 Pilot Environment is available for use now.

Ahead of the March 2025 downtime, Quest users have the opportunity to test their software and research workflows on CPU nodes and NVIDIA H100 GPU nodes running the new RHEL8 OS. Detailed instructions on how to submit jobs for the new operating system are available in the Knowledge Base article RHEL8 Pilot Environment.

RHEL8 Pilot Quest log-in nodes can be accessed via SSH or FastX using the hostname login.quest.northwestern.edu. Please note that the new hostname login.quest.northwestern.edu will require the GlobalProtect VPN when connecting from outside of the United States.

RHEL8 Pilot Quest Analytics nodes can be accessed via rstudio.quest.northwestern.edu, jupyterhub.quest.northwestern.edu, and sasstudio.quest.northwestern.edu.

What GPUs are available on QUEST?

All GPU cards on Quest (both General Access and Buy-In) use NVIDIA driver version 570.86.15, which is compatible with applications compiled against CUDA Toolkit versions <= 12.8.
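If you want to double-check the driver version and GPU model from within a job, one option is to query nvidia-smi on the GPU node you land on, for example:

$ nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv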

There are 58 GPU nodes available to the Quest General Access allocations.

  • 16 nodes which each have 2 x 40GB Tesla A100 PCIe GPU cards, 52 CPU cores, and 192 GB of CPU RAM.
  • 18 nodes which each have 4 x 80GB Tesla A100 SXM GPU cards, 52 CPU cores, and 512 GB of CPU RAM.
  • 24 nodes which each have 4 x 80GB Tesla H100 SXM GPU cards, 64 CPU cores, and 1 TB of CPU RAM.

There are 4 GPU nodes in the Genomics Compute Cluster (b1042). 

  • 2 nodes which each have 4 x 40GB Tesla A100 PCIe GPU cards, 52 CPU cores, and 192 GB of CPU RAM.
  • 2 nodes which each have 4 x 80GB Tesla A100 PCIe GPU cards, 64 CPU cores, and 512 GB of CPU RAM.

 


Using General Access GPUs

The maximum run time is 48 hours for a job on these nodes. To submit jobs to the general access GPU nodes, you should set gengpu as the partition and state the number of GPUs in your job submission command or script. You can also identify the type of GPU you want in your job submission. For instance, to request one A100 GPU, you should add the following lines to your job submission script:

#SBATCH -A <allocationID>
#SBATCH -p gengpu
#SBATCH --gres=gpu:a100:1
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 1:00:00
#SBATCH --mem=XXG

Note that the memory you request here is for CPU memory. You are automatically given access to the entire memory of the GPU, but you will also need CPU memory as you will be copying memory from the CPU to the GPU.

To schedule another type of GPU, e.g. H100, you should change the a100 designation to the other GPU type, e.g. h100.
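For example, to request a single H100, only the gres line changes; the rest of the directives above stay the same:

#SBATCH --gres=gpu:h100:1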

Specifying GPU Interconnect Types

There are two flavors of A100 GPUs on Quest: PCIe and SXM. With the submission script in the block above, there is no way of knowing which type of A100 GPU Slurm will assign the job to. However, you can specify which type of A100 node you'd prefer using the --constraint flag. The choices are either pcie for the 40GB A100s or sxm for the 80GB A100s.

Choosing whether you want your job to land on a PCIe or SXM A100 largely depends on the kind of job you are executing. Specifically, there are considerations to make when it comes to using a single GPU card or multiple GPU cards, which will influence whether you want to use a PCIe A100 or a SXM A100.

Considerations for Using a Single GPU Card

If you only need to use one GPU card, consider how much GPU memory you will need on that one card. If your memory needs are under 40 GB, you can request a PCIe A100. However, if you need more than 40 GB on a single GPU card, you should request an SXM A100.

Considerations for Using Multiple GPU Cards

If you know that you want to use multiple GPU cards for your job, an important consideration is that sharing data between two, three, or four SXM cards will be a lot faster than sharing data between the two cards on a PCIe A100 node (see the multi-GPU request sketch after the single-card example below).

The following example submission script would request one 80GB SXM A100 card.

#SBATCH -A <allocationID>
#SBATCH -p gengpu
#SBATCH --gres=gpu:a100:1
#SBATCH --constraint=sxm
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 1:00:00
#SBATCH --mem=XXG

Replace sxm with pcie and you'd receive a 40GB A100. 
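As a sketch of the multi-GPU case discussed above, a request like the following would ask for all four 80GB SXM A100 cards on a single node; adjust the task count, memory, and walltime to your own workload:

#SBATCH -A <allocationID>
#SBATCH -p gengpu
#SBATCH --gres=gpu:a100:4
#SBATCH --constraint=sxm
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 1:00:00
#SBATCH --mem=XXG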

If you don't specify any constraint, you will be assigned an A100 of either type at random. With the GPUs, you are automatically given 100% of the memory on the GPU, 40 GB or 80 GB. GPU memory is treated separately from the system memory you request with the --mem flag; they are not the same thing.

Using Genomics Compute Cluster GPUs

The maximum run time is 48 hours for a job on these nodes. Feinberg members of the Genomics Compute Cluster should use the partition genomics-gpu, while non-Feinberg members should use genomicsguest-gpu. To submit a job to these GPUs, include the appropriate partition name and specify the type and number of GPUs:

 

#SBATCH -A b1042
#SBATCH -p genomics-gpu
#SBATCH --gres=gpu:a100:1
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 1:00:00
#SBATCH --mem=XXG

Note that the memory you request here is for CPU memory. You are automatically given access to the entire memory of the GPU, but you will also need CPU memory as you will be copying memory from the CPU to the GPU.

Interactive GPU jobs

If you want to start an interactive session on a GPU instead of a batch submission, you can use a command similar to the ones below; both examples request an A100:

srun will start a session on the node immediately after the job has been scheduled.

$ srun -A pXXXXX -p gengpu --mem=XXG --gres=gpu:a100:1 -N 1 -n 1 -t 1:00:00 --pty bash -l

salloc will allocate the resources, after which you will have to SSH to the GPU node.

$ salloc -A pXXXXX -p gengpu --mem=XXG --gres=gpu:a100:1 -N 1 -n 1 -t 1:00:00
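For example, once salloc grants the allocation, you can check which GPU node was assigned and then connect to it; the node name below is a placeholder for whatever squeue reports:

$ squeue -u $USER
$ ssh <nodename>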

Install Popular GPU Accelerated Python Software with Anaconda Virtual Environments

CUDA

To see which versions of CUDA are available on Quest, run the command:

module spider cuda

NOTE: You cannot use code or applications that require a CUDA toolkit or module newer than the maximum CUDA version listed at the top of this page (12.8). However, CUDA modules and toolkits older than that version should still work.
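Once module spider shows a version you want, you can load it in your job script or interactive session. The version string below is a placeholder; use one of the versions that module spider lists:

$ module load cuda/<version>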

jaxlib

All instructions will utilize the software management utility mamba. Please see Using Python on QUEST for more information on Mamba virtual environments.

These install instructions should work for jaxlib vXXXX through vXXX. They may work for newer releases of jaxlib, but that will depend on the version of CUDA that was used to compile the PyPI package. If that version is older than CUDA 11.8 or newer than CUDA 12.4, it will not work with the Quest GPUs.

Installation

Please run the commands that come after the $.

$ module purge
$ module load mamba/24.3.0
$ mamba create -p ./jaxlib-cuda-12-4 python=3.11
$ source activate ./jaxlib-cuda-12-4
$ python -m pip install "jax[cuda12]"

Testing

This Python script verifies that the JAX installation can see and use the GPU devices on Quest. The Python script must be run on a GPU node, either through a batch job or an interactive job.

Python Script

test_gpu.py

from jax import extend
print(extend.backend.get_backend().platform)
print(extend.backend.get_backend().platform_version)
print(extend.backend.get_backend().local_devices())

Run Python Script

$ python3 test_gpu.py
gpu
PJRT C API
cuda 12030
[CudaDevice(id=0), CudaDevice(id=1), CudaDevice(id=2), CudaDevice(id=3)]
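If you would rather run the test as a batch job than interactively, a minimal submission script along these lines should work; the allocation ID, walltime, and memory are placeholders, and the environment path matches the install step above:

#!/bin/bash
#SBATCH -A <allocationID>
#SBATCH -p gengpu
#SBATCH --gres=gpu:a100:1
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 0:10:00
#SBATCH --mem=8G

module purge
module load mamba/24.3.0
source activate ./jaxlib-cuda-12-4

python3 test_gpu.py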

Tensorflow

All instructions will utilize the software management utility mamba. Please see Using Python on QUEST for more information on Mamba virtual environments.

These install instructions should work for Tensorflow v2.15.0 through v2.17.0. They may work for newer releases of Tensorflow, but that will depend on the version of CUDA that was used to compile the conda package. If that version is older than CUDA 11.8 or newer than CUDA 12.4, it will not work with the Quest GPUs. You can search all versions of Tensorflow that have been compiled with CUDA toolkit 12.X with the following command.

Check Available Versions

$ module purge
$ module load mamba/24.3.0
$ mamba search 'tensorflow[channel=conda-forge,subdir=linux-64,build=*cuda12*]'

Installation

Please run the commands that come after the $.

$ module purge
$ module load mamba/24.3.0
$ CONDA_OVERRIDE_CUDA="12" mamba create --prefix ./tensorflow_with_cuda_12 -c nvidia "tensorflow[build=*cuda12*]"

Testing

This Python script verifies that the Tensorflow installation can see and use the GPU devices on Quest. The Python script must be run on a GPU node, either through a batch job or an interactive job.

Python Script

test_gpu.py

import tensorflow as tf

physical_devices = tf.config.list_physical_devices('GPU')
print("Num GPUs:", len(physical_devices))
print("GPUs: ", physical_devices)
print("Tensorflow Built With CUDA Support {0}".format(tf.test.is_built_with_cuda()))

Run Python Script

$ python3 test_gpu.py
Num GPUs: 4
GPUs:  [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]
Tensorflow Built With CUDA Support True

PyTorch

All instructions will utilize the software management utility mamba. Please see Using Python on QUEST for more information on Mamba virtual environments.

These install instructions should work for PyTorch v2.4.0 through v2.5.1. They may work for newer releases of PyTorch, but that will depend on the version of CUDA that was used to compile the conda package. If that version is older than CUDA 11.8 or newer than CUDA 12.4, it will not work with the Quest GPUs. You can search all versions of PyTorch that have been compiled with CUDA toolkit 12.4 with the following command.

Check Available Versions

$ module purge
$ module load mamba/24.3.0
$ mamba search 'pytorch::pytorch[subdir=linux-64,build=*cuda12.4*]'

Installation

Please run the commands that come after the $.

$ module purge
$ module load mamba/24.3.0
$ CONDA_OVERRIDE_CUDA="12.4" mamba create --prefix ./pytorch-cuda-12-4 -c nvidia -c pytorch "pytorch[build=*cuda12.4*]"

Testing

This Python script verifies that the PyTorch installation can see and use the GPU devices on Quest. The Python script must be run on a GPU node, either through a batch job or an interactive job.

Python Script

test_gpu.py

import torch
print(torch.cuda.is_available())
print(torch.cuda.device_count())
print(torch.cuda.get_device_name(0))

Run Python Script

$ python3 test_gpu.py
True
4
NVIDIA H100 PCIe

PyTorch on Multiple Nodes with Multiple GPUs

The batch submission script below shows how to use multiple nodes with multiple GPUs in your jobs. Please check out the GitHub page linked below, which provides the Python code for setting this up; you can modify the Python code to suit your needs.

https://github.com/nuitrcs/examplejobs/tree/master/python/pytorch_ddp

#!/bin/bash
#SBATCH --account=pXXXX
#SBATCH --partition=gengpu
#SBATCH --time=04:00:00
#SBATCH --job-name=multinode-example
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --gres=gpu:a100:2
#SBATCH --mem=20G
#SBATCH --cpus-per-task=4

module purge
module load mamba/24.3.0
source activate ./pytorch-cuda-12-4

export LOGLEVEL=INFO

srun torchrun \
    --nnodes 2 \
    --nproc_per_node 2 \
    --rdzv_id $RANDOM \
    --rdzv_backend c10d \
    --rdzv_endpoint "$SLURMD_NODENAME:29500" \
    ./multinode_torchrun.py 10000 100
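Assuming you save the script above under a name such as multinode_ddp.sh (a hypothetical filename) in the same directory as multinode_torchrun.py, you would submit and monitor it like any other batch job:

$ sbatch multinode_ddp.sh
$ squeue -u $USER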

CuPy

All instructions will utilize the software management utility mamba. Please see Using Python on QUEST for more information on Mamba virtual environments.

These install instructions should work for CuPy v12.4.

Installation

Please run the commands that come after the $.

$ module purge
$ module load mamba/24.3.0
$ mamba create --prefix ./cupy-with-cuda-12-4 python=3.12 cuda-toolkit=12.4 -c nvidia
$ source activate ./cupy-with-cuda-12-4
$ python3 -m pip install cupy-cuda124

Testing

This Python script verifies that the CuPy installation can see and use the GPU devices on Quest. The Python script must be run on a GPU node, either through a batch job or an interactive job.

Python Script

test_gpu.py

import cupy as cp
# compute on the GPU and print the results to confirm CuPy can use the device
x_gpu = cp.array([1, 2, 3])
l2_gpu = cp.linalg.norm(x_gpu)
print(l2_gpu)
print(cp.cuda.runtime.getDeviceCount())  # number of visible GPU devices

Run Python Script

$ python3 test_gpu.py

Rapids
