Quest Slurm Quick Start

For video demonstrations on using Slurm on Quest, please visit Research Computing How-to Videos.

Simple Batch Job Submission Script Example

Scheduler directives are given on #SBATCH lines in the job submission script; the same options can also be passed to the sbatch command on the command line when the job is submitted. The bare minimum directives needed for a valid submission script are the partition (-p/--partition), account (-A/--account), and walltime (-t/--time). In addition to these settings, we strongly recommend setting the number of nodes to run on (-N/--nodes), the number of cores/tasks to use in total (-n/--ntasks) or per node (--ntasks-per-node=n), and the amount of memory (in gigabytes) needed to run your application, either per CPU (--mem-per-cpu=XXG) or per node (--mem=XXG).

If you have additional scheduler directives, please see the full list of Extended SLURM Job and Submission Options.

#!/bin/bash
#SBATCH --account=p12345  ## YOUR ACCOUNT pXXXX or bXXXX
#SBATCH --partition=short  ## PARTITION (buyin, short, normal, etc.)
#SBATCH --nodes=1 ## how many computers do you need
#SBATCH --ntasks-per-node=1 ## how many cpus or processors do you need on each computer
#SBATCH --time=00:10:00 ## how long does this need to run (remember different partitions have restrictions on this parameter)
#SBATCH --mem=1G ## how much RAM do you need per node (this affects your FairShare score, so be careful not to ask for more than you need)
#SBATCH --job-name=sample_job  ## When you run squeue -u NETID this is how you can identify the job
#SBATCH --output=output.log ## standard output and standard error go to this file


module purge all
module load python-anaconda3
source activate /projects/intro/envs/slurm-py37-test


python --version
python slurm_test.py

Environment

Note that when you submit your job, Slurm passes your current environment variables to the compute nodes, including any modules you loaded on the command line before submitting the job.
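
If you prefer your job to start from a clean environment instead of inheriting whatever happens to be loaded in your login shell, sbatch accepts an --export option. The line below is a minimal sketch, not a Quest-specific requirement; job_script.sh is the example script from above.

sbatch --export=NONE job_script.sh  ## job inherits only Slurm-set variables, not your login environment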

Architectures

Not all Quest compute nodes are the same. We currently have four different generations, or architectures, of compute nodes, which we refer to as quest7, quest8, quest9, and quest10; information on each of these architectures can be found here. If you need to restrict your job to a particular architecture, you can do so through the constraint directive (-C/--constraint). For example, --constraint=quest10 will cause the scheduler to match your job only to computers of the quest10 generation.
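
The same constraint can also be set inside a submission script. The line below is a sketch of the directive form, using the quest10 example from above:

#SBATCH --constraint=quest10  ## only run on quest10-generation nodes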

Submitting a Batch Job

To use sbatch to submit a job to the Slurm scheduler:

sbatch job_script.sh
Submitted batch job 546723

Or, in cases where you want only the job number to be returned, you can pass --parsable to the sbatch command:

sbatch --parsable job_script.sh
546723
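
One common use for --parsable is capturing the job number in a shell variable, for example to submit a second job that should only start after the first one finishes successfully. The snippet below is a sketch using sbatch's --dependency option; next_step.sh is a placeholder for a second submission script.

jobid=$(sbatch --parsable job_script.sh)            ## capture the job number
sbatch --dependency=afterok:${jobid} next_step.sh   ## start only if the first job completes successfully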

Slurm will reject the job at submission time if the job submission script contains requests or constraints that Slurm cannot meet. This gives you the opportunity to examine the rejected request and resubmit it with the necessary corrections. If a job number is returned at submission time, the job will run, although it may wait in the queue depending on how busy the system is.

If your job submission receives an error, see Debugging your Slurm submission script.

Submitting an Interactive Job (to run an application without Graphical User Interface)

To launch an interactive job from the Quest log-in node in order to run an application without a GUI, use either the srun or salloc command. If you use srun, Slurm will automatically launch a terminal session on the compute node once it schedules the job; you simply need to wait for this to happen. Due to the behavior of srun, if you lose connection to your interactive session, the interactive job will terminate.

srun -N 1 -n 1 --account=<account> --mem=XXG --partition=<partition> --time=<hh:mm:ss> --pty bash -l

If you use salloc instead, it will not automatically launch a terminal session on the compute node. Once it schedules your request, it will tell you the name of the compute node, at which point you can run ssh qnodeXXXX to connect directly to that node. Due to the behavior of salloc, if you lose connection to your interactive session, the interactive job will not terminate.

salloc -N 1 -n 1 --account=<account> --mem=<XXG> --partition=<partition> --time=<hh:mm:ss>
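
As a concrete illustration, the sequence below fills in the placeholders with the sample values from the batch script above (p12345, short, 1G, 10 minutes); adjust them to your own allocation and needs, and replace qnodeXXXX with the node name that salloc reports.

salloc -N 1 -n 1 --account=p12345 --mem=1G --partition=short --time=00:10:00
ssh qnodeXXXX   ## use the compute node name printed by salloc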

For additional information on interactive jobs under Slurm, please see Submitting a Job on Quest.

Submitting an Interactive Job (to run an application with Graphical User Interface)

To launch an interactive job from the Quest log-in node in order to run an application with a GUI, you first need to connect to Quest using an application with X11 forwarding support; we recommend the FastX3 client. Once you have connected to Quest with X11 forwarding enabled, use either the srun or salloc command. If you use srun, Slurm will automatically launch a terminal session on the compute node once it schedules the job; you simply need to wait for this to happen. Due to the behavior of srun, if you lose connection to your interactive session, the interactive job will terminate.

srun --x11 -N 1 -n 1 --account=<account> --mem=XXG --partition=<partition> --time=<hh:mm:ss> --pty bash -l

If you use salloc instead, it will not automatically launch a terminal session on the compute node. Once it schedules your request, it will tell you the name of the compute node, at which point you can run ssh -X qnodeXXXX to connect directly to that node. Due to the behavior of salloc, if you lose connection to your interactive session, the interactive job will not terminate.

salloc --x11 -N 1 -n 1 --account=<account> --mem=<XXG> --partition=<partition> --time=<hh:mm:ss>
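
Once you are on the compute node, a quick way to check that X11 forwarding is working is to launch a simple X client. The command below is only a sketch and assumes xclock is available on the node; any graphical application will do.

xclock   ## a small clock window should open on your local display if X11 forwarding works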

For additional information on interactive jobs under Slurm, please see Submitting a Job on Quest.

Monitoring Jobs

You can use squeue to monitor your currently pending or running jobs:

squeue -u <NetID>

squeue returns information on jobs in the Slurm queue:

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
546723     short slurm2.s  <netid>  R   INVALID      1 qnode4017
546711     short high-thr  <netid>  R       2:34      3 qnode[4180-4181,4196]
546712     short high-thr  <netid>  R       2:34      3 qnode[4078,4086,4196]
The fields in the squeue output are:

JOBID: Number assigned to the job upon submission
PARTITION: The queue, also called a partition, that the job is running in
NAME: Name of the job submission script
USER: NetID of the user who submitted the job
ST: State of the job: "R" for Running or "PD" for Pending (Idle)
TIME: Hours:minutes:seconds the job has been running; can show INVALID for the first few minutes
NODES: Number of nodes the job resides on
NODELIST: Names of the nodes the job is running on
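
The default squeue output truncates long job names. If you need wider or additional columns, squeue accepts a --format option; the format string below is just one possible choice, not a Quest recommendation.

squeue -u <NetID> --format="%.10i %.12P %.30j %.2t %.10M %.5D %R"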

Canceling Jobs

To cancel a single job use scancel:

scancel <job_ID_number>

To cancel all of your jobs:

scancel -u <netID>
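
scancel can also target a subset of your jobs. The examples below are sketches using its --state and --name filters; sample_job matches the job name from the script above.

scancel -u <netID> --state=PENDING    ## cancel only your pending jobs
scancel -u <netID> --name=sample_job  ## cancel jobs with a specific job name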

For additional job commands, please see Common Job Commands.
