What GPUs are available on QUEST?
There are 34 GPU nodes available to the Quest General Access allocations. These nodes have driver version 525.105.17 which is compatible with CUDA 12.0 or earlier:
- 16 nodes which each have 2 x 40GB Tesla A100 PCIe GPU cards, 52 CPU cores, and 192 GB of CPU RAM.
- 18 nodes which each have 4 x 80GB Tesla A100 SXM GPU cards, 52 CPU cores, and 512 GB of CPU RAM.
There are 2 GPU nodes in the Genomics Compute Cluster (b1042). These nodes have driver version 525.105.17, which is compatible with CUDA 12.0 or earlier:
- 2 nodes which each have 4 x 40GB Tesla A100 GPU cards, 52 CPU cores, and 192 GB of CPU RAM (8 GPUs in total)
Using General Access GPUs
The maximum run time for a job on these nodes is 48 hours. To submit jobs to the general access GPU nodes, set gengpu as the partition and specify the number of GPUs in your job submission command or script. You can also identify the type of GPU you want in your job submission. For instance, to request one A100 GPU, add the following lines to your job submission script:
#SBATCH -A <allocationID>
#SBATCH -p gengpu
#SBATCH --gres=gpu:a100:1
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 1:00:00
#SBATCH --mem=XXG
Note that the memory you request here is CPU memory. You are automatically given access to the entire memory of the GPU, but you will also need CPU memory because you will be copying data between CPU memory and the GPU.
To schedule another type of GPU, e.g. a P100, change the a100 designation to the other GPU type, e.g. p100.
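For example, assuming P100 nodes are available in the partition, the GRES request line would become:

```bash
#SBATCH --gres=gpu:p100:1
```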
A100 GPU Nodes
There are two flavors of A100 GPUs on Quest: those with 40GB of memory and those with 80GB. In the example submission script above, there is no way of knowing which type of A100 GPU Slurm would assign the job to. However, you can specify which type of A100 node you'd prefer using the --constraint flag. The choices are pcie for the 40GB A100s or sxm for the 80GB A100s.
The following example submission script would request one 80GB A100.
#SBATCH -A <allocationID>
#SBATCH -p gengpu
#SBATCH --gres=gpu:a100:1
#SBATCH --constraint=sxm
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 1:00:00
#SBATCH --mem=XXG
Replace sxm with pcie and you'd receive a 40GB A100. If you don't specify a constraint, you will be assigned either type of A100. In either case you are automatically given 100% of the memory on the GPU, whether 40GB or 80GB. GPU memory is separate from the system memory you request with the --mem flag; they are not the same thing.
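For completeness, the equivalent submission script pinned to a 40GB (PCIe) A100 would be:

```bash
#SBATCH -A <allocationID>
#SBATCH -p gengpu
#SBATCH --gres=gpu:a100:1
#SBATCH --constraint=pcie
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 1:00:00
#SBATCH --mem=XXG
```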
Using Genomics Compute Cluster GPUs
The maximum run time is 48 hours for a job on these nodes. Feinberg members of the Genomics Compute Cluster should use the partition genomics-gpu, while non-Feinberg members should use genomicsguest-gpu. To submit a job to these GPUs, include the appropriate partition name and specify the type and number of GPUs:
#SBATCH -A b1042
#SBATCH -p genomics-gpu
#SBATCH --gres=gpu:a100:1
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 1:00:00
#SBATCH --mem=XXG
Note that the memory you request here is CPU memory. You are automatically given access to the entire memory of the GPU, but you will also need CPU memory because you will be copying data between CPU memory and the GPU.
Interactive GPU jobs
If you want to start an interactive session on a GPU instead of submitting a batch job, you can use a command similar to one of the examples below; both request an A100:
srun -A pXXXXX -p gengpu --mem=XXG --gres=gpu:a100:1 -N 1 -n 1 -t 1:00:00 --pty bash -l
salloc -A pXXXXX -p gengpu --mem=XXG --gres=gpu:a100:1 -N 1 -n 1 -t 1:00:00
What GPU software is available on QUEST?
CUDA
To see which versions of CUDA are available on Quest, run the command:
module spider cuda
NOTE: You cannot use code or applications that require a CUDA toolkit or module newer than the CUDA version supported by the driver (CUDA 12.0, per the driver versions listed above). However, CUDA modules and toolkits older than that version should still work.
Anaconda
We strongly encourage people to use anaconda to create virtual environments when using software that utilizes GPUs, especially with Python. Please see Using Python on QUEST for more information on anaconda virtual environments. Below we provide instructions for creating a local anaconda virtual environment containing:
- Tensorflow
- PyTorch
- CUpy
- Rapids
Please run the commands that come after the $.
Tensorflow
1. Load anaconda on QUEST
$ module load python-miniconda3/4.12.0
2. Create a virtual environment with tensorflow and cudatoolkit==11.2. We are going to name our environment tensorflow-2.6-py38. On QUEST, by default, all anaconda environments go into a folder in your HOME directory called ~/.conda/envs/. Therefore, once these steps are completed, all of the necessary packages will live in a folder whose PATH is ~/.conda/envs/tensorflow-2.6-py38.
$ conda create --name tensorflow-2.6-py38 -c conda-forge tensorflow[build=cuda112*] cudatoolkit=11.2
3. Activate virtual environment
$ source activate tensorflow-2.6-py38
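Putting the steps together, a batch submission could load the module, activate the environment, and run a script. This is a sketch: train.py and the memory value are placeholders for your own script and requirements.

```bash
#!/bin/bash
#SBATCH -A <allocationID>
#SBATCH -p gengpu
#SBATCH --gres=gpu:a100:1
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 1:00:00
#SBATCH --mem=XXG

module load python-miniconda3/4.12.0
source activate tensorflow-2.6-py38

python train.py
```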
PyTorch
1. Load anaconda on QUEST
$ module load python-miniconda3/4.12.0
2. Create a virtual environment with pytorch and cudatoolkit==11.2. We are going to name our environment pytorch-1.11-py38. On QUEST, by default, all anaconda environments go into a folder in your HOME directory called ~/.conda/envs/. Therefore, once these steps are completed, all of the necessary packages will live in a folder whose PATH is ~/.conda/envs/pytorch-1.11-py38.
$ conda create --name pytorch-1.11-py38 -c conda-forge pytorch=1.11[build=cuda112*] numpy python=3.8 cudatoolkit=11.2 --yes
3. Activate virtual environment
$ source activate pytorch-1.11-py38
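A quick way to confirm the environment can see the GPU is a short Python check. This is a sketch: on a CPU-only login node it prints False and the tensor stays on the CPU, while on a GPU node it should print True.

```python
# Quick sanity check: can PyTorch see a GPU, and does work run on it?
import torch

use_gpu = torch.cuda.is_available()
device = torch.device("cuda" if use_gpu else "cpu")

# The tensor is created on (or copied to) whichever device was selected.
x = torch.arange(10, device=device)
print(use_gpu, x.sum().item())  # the sum is 45 on either device
```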
CUpy
1. Load anaconda on QUEST
$ module load python-miniconda3/4.12.0
2. Create a virtual environment and install Python into it. We are going to name our environment CUpy-py38. On QUEST, by default, all anaconda environments go into a folder in your HOME directory called ~/.conda/envs/. Therefore, once these steps are completed, all of the necessary packages will live in a folder whose PATH is ~/.conda/envs/CUpy-py38.
$ conda create -n CUpy-py38 python=3.8 cudatoolkit=11.2 -c nvidia --yes
3. Activate virtual environment
$ source activate CUpy-py38
4. Install the CUpy binary that is pre-compiled against CUDA 11.2
$ python3 -m pip install cupy-cuda112
Note: CUpy will only import correctly on a GPU node; it will not import on a CPU-only node.
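Because CUpy deliberately mirrors the NumPy API, you can prototype the same code on a CPU node with NumPy. A minimal sketch, using NumPy as a CPU stand-in:

```python
# CUpy mirrors the NumPy API; on a GPU node with the CUpy-py38 environment
# active, you would write `import cupy as xp` instead, and the identical
# code below would execute on the GPU.
import numpy as xp  # CPU stand-in; on a GPU node: import cupy as xp

x = xp.arange(10)
y = (x ** 2).sum()   # sum of squares of 0..9
print(int(y))        # 285
```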
Rapids
1. Load anaconda on QUEST
$ module load python-miniconda3/4.12.0
2. Create a virtual environment and install Python into it. We are going to name our environment rapids-22.06. On QUEST, by default, all anaconda environments go into a folder in your HOME directory called ~/.conda/envs/. Therefore, once these steps are completed, all of the necessary packages will live in a folder whose PATH is ~/.conda/envs/rapids-22.06.
$ conda create -n rapids-22.06 -c rapidsai -c nvidia -c conda-forge rapids=22.06 python=3.9 cudatoolkit=11.4 jupyterlab --yes
3. Activate virtual environment
$ source activate rapids-22.06
Note: Please see getting started with rapids for more details.
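As an illustration of what the environment provides: cuDF, part of RAPIDS, mirrors much of the pandas API. The sketch below uses pandas as a CPU stand-in; on a GPU node with rapids-22.06 active, importing cudf instead runs the same operation on the GPU.

```python
import pandas as pd  # CPU stand-in; on a GPU node: import cudf as pd

# A groupby-aggregate, one of the operations RAPIDS accelerates on the GPU.
df = pd.DataFrame({"key": ["a", "b", "a", "b"], "val": [1, 2, 3, 4]})
totals = df.groupby("key")["val"].sum()
print(totals.to_dict())  # {'a': 4, 'b': 6}
```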
Singularity
NVIDIA provides a whole host of GPU containers that are suitable for different applications. Docker images cannot be used directly on Quest due to security risks, but they can be pulled to generate Singularity containers. Below we provide examples of using Singularity to pull the NVIDIA TensorFlow Docker image and the NVIDIA PyTorch Docker image.
Tensorflow
For most NVIDIA containers, there are many different versions which come with specific versions of the relevant libraries and packages. See NVIDIA's TensorFlow documentation for further information about the version of Tensorflow that is shipped with each version of the Tensorflow Docker container.
module purge all
module load singularity
singularity pull docker://nvcr.io/nvidia/tensorflow:21.07-tf2-py3
We can then use the command below to call this NVIDIA GPU container to run a simple TensorFlow training example.
singularity exec --nv -B /projects:/projects tensorflow_21.07-tf2-py3.sif python training.py
PyTorch
For most NVIDIA containers, there are many different versions which come with specific versions of the relevant libraries and packages. See NVIDIA's PyTorch documentation for further information about the version of PyTorch that is shipped with each version of the PyTorch Docker container.
module purge all
module load singularity
singularity pull docker://nvcr.io/nvidia/pytorch:21.07-py3
We can then use the command below to call this NVIDIA GPU container to run a simple PyTorch training example.
singularity exec --nv -B /projects:/projects pytorch_21.07-py3.sif python training_pytorch.py
NOTE: A key difference between calling a non-GPU container versus a GPU container is passing the --nv argument to the exec command. A reminder that -B /projects:/projects mounts the projects folder into the Singularity environment; by default, /projects is not mounted or discoverable by the container. Please see our page Containers on Quest for more information on containers in general.
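Putting the container commands into a batch job follows the same pattern as the submission scripts earlier on this page. A sketch, where the allocation, memory value, and training.py are placeholders:

```bash
#!/bin/bash
#SBATCH -A <allocationID>
#SBATCH -p gengpu
#SBATCH --gres=gpu:a100:1
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 1:00:00
#SBATCH --mem=XXG

module purge all
module load singularity

singularity exec --nv -B /projects:/projects tensorflow_21.07-tf2-py3.sif python training.py
```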