GPUs on QUEST

Quest and Kellogg Linux Cluster Downtime, December 14 - 18.

Quest, including the Quest Analytics Nodes, the Genomics Compute Cluster (GCC), the Kellogg Linux Cluster (KLC), and Quest OnDemand, will be unavailable for scheduled maintenance starting at 8 A.M. on Saturday, December 14, and ending at approximately 5 P.M. on Wednesday, December 18. During the maintenance window, you will not be able to log in to Quest, the Quest Analytics Nodes, the GCC, KLC, or Quest OnDemand; submit new jobs; run jobs; or access files stored on Quest in any way, including through Globus. For details on this maintenance, please see the Status of University IT Services page.

Quest RHEL8 Pilot Environment - November 18.

Starting November 18, all Quest users are invited to test and run their workflows in a RHEL8 pilot environment to prepare for Quest moving completely to RHEL8 in March 2025. We invite researchers to provide feedback during the pilot by contacting the Research Computing and Data Services team at quest-help@northwestern.edu. The pilot environment will initially consist of 24 H100 GPU nodes and 72 CPU nodes, and it will expand with additional nodes through March 2025. Details on how to access this pilot environment will be published in a KB article on November 18.

What GPUs are available on Quest?

There are 58 GPU nodes available to the Quest General Access allocations. These nodes run driver version 550.127.05, which is compatible with CUDA 12.4 or earlier:

  • 16 nodes which each have 2 x 40GB Tesla A100 PCIe GPU cards, 52 CPU cores, and 192 GB of CPU RAM.
  • 18 nodes which each have 4 x 80GB Tesla A100 SXM GPU cards, 52 CPU cores, and 512 GB of CPU RAM.
  • 24 nodes which each have 4 x 80GB Tesla H100 SXM GPU cards, 64 CPU cores, and 1 TB of CPU RAM.

There are 4 GPU nodes in the Genomics Compute Cluster (b1042). These nodes run driver version 525.105.17, which is compatible with CUDA 12.0 or earlier:

  • 2 nodes which each have 4 x 40GB Tesla A100 PCIe GPU cards, 52 CPU cores, and 192 GB of CPU RAM.
  • 2 nodes which each have 4 x 80GB Tesla A100 PCIe GPU cards, 64 CPU cores, and 512 GB of CPU RAM.
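
To confirm which GPU nodes, GPU types, and node features the scheduler can see, you can query Slurm directly. A minimal sketch against the gengpu partition described below (the -o format string prints node names, generic resources, and features such as pcie and sxm):

$ sinfo -p gengpu -o "%N %G %f"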


Using General Access GPUs

The maximum run time for a job on these nodes is 48 hours. To submit jobs to the general access GPU nodes, set gengpu as the partition and state the number of GPUs in your job submission command or script. You can also identify the type of GPU you want. For instance, to request one A100 GPU, add the following lines to your job submission script:

#SBATCH -A <allocationID>
#SBATCH -p gengpu
#SBATCH --gres=gpu:a100:1
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 1:00:00
#SBATCH --mem=XXG

Note that the memory you request here is CPU memory. You are automatically given access to the entire memory of the GPU, but you will also need CPU memory, since data is staged in CPU RAM before being copied to the GPU.

To schedule another type of GPU, e.g. an H100, change the a100 designation to the other GPU type, e.g. h100.
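
For example, to request one H100 instead, only the gres line changes; the rest of the script above stays the same:

#SBATCH --gres=gpu:h100:1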

Specifying GPU Interconnect Types

There are two flavors of A100 GPUs on Quest, PCIe and SXM. With the submission script above alone, there is no way to know which type of A100 GPU Slurm will assign to the job. However, you can specify which type of A100 node you'd prefer with the --constraint flag: pcie for the 40GB A100s or sxm for the 80GB A100s.

Choosing whether your job should land on a PCIe or SXM A100 largely depends on the kind of job you are running; in particular, on whether you need a single GPU card or multiple GPU cards.

Considerations for Using a Single GPU Card

If you only need one GPU card, look at how much GPU memory you will need on that card. If your memory needs are under 40GB, you can request a PCIe A100. However, if you need more than 40GB on a single GPU card, you should request an SXM A100.

Considerations for Using Multiple GPU Cards

If you know that you want to use multiple GPU cards for your job, an important consideration is that sharing data between two, three, or four SXM cards, which are linked by NVLink, will be much faster than sharing data between the two cards on a PCIe A100 node, which communicate over the PCIe bus.

The following example submission script would request one 80GB SXM A100 card.

#SBATCH -A <allocationID>
#SBATCH -p gengpu
#SBATCH --gres=gpu:a100:1
#SBATCH --constraint=sxm
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 1:00:00
#SBATCH --mem=XXG

Replace sxm with pcie and you'd receive a 40GB A100 instead.

If you don't specify a constraint, you will be assigned either type of A100 at random. In either case, you are automatically given 100% of the memory on the GPU, 40GB or 80GB. GPU memory is separate from the system memory you request with the --mem flag; they are not the same thing.
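
Putting the multi-GPU considerations together, a sketch of a request for all four A100 cards on a single SXM node (allocation ID, CPU memory, and run time are placeholders, as in the examples above):

#SBATCH -A <allocationID>
#SBATCH -p gengpu
#SBATCH --gres=gpu:a100:4
#SBATCH --constraint=sxm
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 1:00:00
#SBATCH --mem=XXG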

Using Genomics Compute Cluster GPUs

The maximum run time for a job on these nodes is 48 hours. Feinberg members of the Genomics Compute Cluster should use the partition genomics-gpu, while non-Feinberg members should use genomicsguest-gpu. To submit a job to these GPUs, include the appropriate partition name and specify the type and number of GPUs:

#SBATCH -A b1042
#SBATCH -p genomics-gpu
#SBATCH --gres=gpu:a100:1
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 1:00:00
#SBATCH --mem=XXG

As with the general access GPUs, the memory you request here is CPU memory; you are automatically given access to the entire memory of the GPU.

Interactive GPU jobs

If you want to start an interactive session on a GPU node instead of submitting a batch job, you can use a command similar to the ones below; these examples both request an A100:

srun will start a session on the node immediately after the job has been scheduled.

$ srun -A pXXXXX -p gengpu --mem=XXG --gres=gpu:a100:1 -N 1 -n 1 -t 1:00:00 --pty bash -l

salloc will allocate the resources, after which you will need to SSH to the GPU node.

$ salloc -A pXXXXX -p gengpu --mem=XXG --gres=gpu:a100:1 -N 1 -n 1 -t 1:00:00
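
Once the allocation is granted, you can check which node you received and confirm that the GPU is visible. A sketch, assuming the salloc route (the node name appears in the NODELIST column of squeue):

$ squeue -u $USER
$ ssh <nodename>
$ nvidia-smi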

Install Popular GPU-Accelerated Python Software with Anaconda Virtual Environments

The following GPU-accelerated Python packages can be installed into Anaconda virtual environments on Quest:

  • CUDA
  • jaxlib
  • TensorFlow
  • PyTorch
  • CuPy
  • RAPIDS
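
As one example of the general pattern, here is a minimal sketch of building a PyTorch environment with conda. The module name (anaconda3) and the Python and CUDA wheel versions are assumptions; check module avail and the current PyTorch installation instructions for the right values on Quest:

$ module load anaconda3
$ conda create -n pytorch-gpu python=3.11 -y
$ conda activate pytorch-gpu
$ pip install torch --index-url https://download.pytorch.org/whl/cu121
$ python -c "import torch; print(torch.cuda.is_available())"

Run the final check from a GPU session (see Interactive GPU jobs above); on a login node without a GPU it will print False.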
