For video demonstrations on using SLURM on Quest, please visit Research Computing How-to Videos.
Simple Batch Job Submission Script Example
The sbatch command is used for scheduler directives in job submission scripts as well as for submitting jobs at the command line. The bare minimum directives that you need for a valid submission script are partition (-p/--partition), account (-A/--account), and walltime (-t/--time). In addition to these settings, we strongly recommend setting the number of nodes to run on (-N/--nodes), the total number of cores/tasks to use (-n/--ntasks) or the number of cores/tasks to use per node (--ntasks-per-node=n), and the amount of memory (in gigabytes) needed to run your application, either per CPU (--mem-per-cpu=XXG) or per node (--mem=XXG).
If you have additional scheduler directives, please see the full list of Extended SLURM Job and Submission Options.
#!/bin/bash
#SBATCH --account=p12345 ## YOUR ACCOUNT pXXXX or bXXXX
#SBATCH --partition=short ## PARTITION (buyin, short, normal, etc.)
#SBATCH --nodes=1 ## how many computers do you need
#SBATCH --ntasks-per-node=1 ## how many cpus or processors do you need on each computer
#SBATCH --time=00:10:00 ## how long does this need to run (remember different partitions have restrictions on this parameter)
#SBATCH --mem=1G ## how much RAM do you need per node (this affects your FairShare score so be careful not to ask for more than you need)
#SBATCH --job-name=sample_job ## When you run squeue -u NETID this is how you can identify the job
#SBATCH --output=output.log ## standard out and standard error goes to this file
module purge all
module load python-anaconda3
source activate /projects/intro/envs/slurm-py37-test
python --version
python slurm_test.py
Environment
Note that when you submit your job, Slurm passes your current environment variables to the compute nodes, including any modules you have loaded on the command line before the job was submitted.
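If you prefer that a batch job not inherit your log-in environment, sbatch's --export option controls this behavior. The snippet below is a minimal sketch: it adds --export=NONE to the sample script above and then rebuilds the environment inside the job; depending on how module initialization is set up on your system, you may also need to source the module init file before calling module.
#SBATCH --export=NONE ## assumption: you want a clean environment rather than a copy of your log-in session
module purge all ## then rebuild the environment inside the script
module load python-anaconda3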
Architectures
Not all Quest compute nodes are the same. We currently have four different generations, or architectures, of compute nodes, which we refer to as quest7, quest8, quest9, and quest10; information on each of these architectures can be found here. If you need to restrict your job to a particular architecture, you can do so through the constraint directive (-C/--constraint). For example, --constraint=quest10 will cause the scheduler to match your job only to computers of the quest10 generation.
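The directive can go either in the submission script or on the sbatch command line. The lines below are a sketch; job_script.sh is the sample script name used elsewhere on this page, and the OR form is standard Slurm feature syntax for jobs that can run on more than one generation.
#SBATCH --constraint=quest10 ## inside the submission script: only quest10-generation nodes
sbatch --constraint=quest10 job_script.sh ## or equivalently, at submission time
sbatch --constraint="quest9|quest10" job_script.sh ## OR-ed features: either generation is acceptable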
Submitting a Batch Job
To use sbatch to submit a job to the Slurm scheduler:
sbatch job_script.sh
Submitted batch job 546723
or, in cases where you only want the job number to be returned, you can pass --parsable to the sbatch command:
sbatch --parsable job_script.sh
546723
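A common use of --parsable is to capture the job number in a shell variable so that a second job can be made to wait on the first. The lines below are a sketch assuming two hypothetical scripts, step1.sh and step2.sh; --dependency=afterok is a standard sbatch option that holds the second job until the first completes successfully.
jobid=$(sbatch --parsable step1.sh) ## capture only the job number
sbatch --dependency=afterok:${jobid} step2.sh ## run step2.sh only after step1.sh finishes successfully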
Slurm will reject the job at submission time if the job submission script contains requests or constraints that Slurm cannot meet. This gives you the opportunity to examine the rejected job request and resubmit it with the necessary corrections. With Slurm, if a job number is returned at the time of submission, the job will run, although it may experience a wait time in the queue depending on how busy the system is.
If your job submission receives an error, see Debugging your Slurm submission script.
Submitting an Interactive Job (to run an application without Graphical User Interface)
To launch an interactive job from the Quest log-in node in order to run an application without a GUI, use either the srun or salloc command. If you use srun to run an interactive job, then SLURM will automatically launch a terminal session on the compute node after it schedules the job and you simply need to wait for this to happen. Due to the behavior of srun, if you lose connection to your interactive session, the interactive job will terminate.
srun -N 1 -n 1 --account=<account> --mem=XXG --partition=<partition> --time=<hh:mm:ss> --pty bash -l
If you use salloc instead, it will not automatically launch a terminal session on the compute node. Instead, after it schedules your job/request, it will tell you the name of the compute node, at which point you can run ssh qnodeXXXX to directly connect to the compute node. Due to the behavior of salloc, if you lose connection to your interactive session, the interactive job will not terminate.
salloc -N 1 -n 1 --account=<account> --mem=<XXG> --partition=<partition> --time=<hh:mm:ss>
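As an illustration, a typical salloc session might look like the following sketch, which reuses the account (p12345), partition (short), and node name (qnode4017) from the examples on this page; your job number, node name, and resource requests will differ.
salloc -N 1 -n 1 --account=p12345 --mem=1G --partition=short --time=01:00:00
## salloc reports the name of the compute node it granted, e.g. qnode4017
ssh qnode4017 ## connect directly to the allocated compute node
exit ## leave the compute node when you are finished
exit ## end the allocation itself (or use scancel <job_ID_number>)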
For additional information on interactive jobs under Slurm, please see Submitting a Job on Quest.
Submitting an Interactive Job (to run an application with Graphical User Interface)
To launch an interactive job from the Quest log-in node in order to run an application with a GUI, first you need to connect to Quest using an application with X11 forwarding support. We recommend using the FastX3 client. Once you have connected to Quest with X11 forwarding enabled, you can then use either the srun or salloc command. If you use srun to run an interactive job, then SLURM will automatically launch a terminal session on the compute node after it schedules the job and you simply need to wait for this to happen. Due to the behavior of srun, if you lose connection to your interactive session, the interactive job will terminate.
srun --x11 -N 1 -n 1 --account=<account> --mem=XXG --partition=<partition> --time=<hh:mm:ss> --pty bash -l
If you use salloc instead, it will not automatically launch a terminal session on the compute node. Instead, after it schedules your job/request, it will tell you the name of the compute node, at which point you can run ssh -X qnodeXXXX to directly connect to the compute node. Due to the behavior of salloc, if you lose connection to your interactive session, the interactive job will not terminate.
salloc --x11 -N 1 -n 1 --account=<account> --mem=<XXG> --partition=<partition> --time=<hh:mm:ss>
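Once the interactive session is running on the compute node, you can confirm that X11 forwarding works before launching your application. The check below is a sketch; xclock is only an example of a small X application and may not be installed on every system.
echo $DISPLAY ## should print a non-empty value such as localhost:18.0 if forwarding is active
xclock & ## any small X application serves as a quick test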
For additional information on interactive jobs under Slurm, please see Submitting a Job on Quest.
Monitoring Jobs
You can use squeue to monitor your currently pending or running jobs:
squeue -u <NetID>
squeue returns information on jobs in the Slurm queue:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
546723 short slurm2.s <netid> R INVALID 1 qnode4017
546711 short high-thr <netid> R 2:34 3 qnode[4180-4181,4196]
546712 short high-thr <netid> R 2:34 3 qnode[4078,4086,4196]
Field | Description
JOBID | Number assigned to the job upon submission
PARTITION | The queue, also called partition, that the job is running in
NAME | Name of the job submission script
USER | NetID of the user who submitted the job
ST | State of the job: "R" for Running or "PD" for Pending (Idle)
TIME | hours:minutes:seconds the job has been running; can show INVALID for the first few minutes
NODES | Number of nodes the job resides on
NODELIST | Names of the nodes the job is running on
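squeue also accepts filters when you want to narrow the output. The examples below are a sketch using standard squeue options and the job number from the sample output above.
squeue -j 546723 ## show a single job
squeue -u <NetID> --state=PD ## show only your pending jobs
squeue -u <NetID> --start ## show estimated start times for your pending jobs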
Canceling Jobs
To cancel a single job use scancel:
scancel <job_ID_number>
To cancel all of your jobs:
scancel -u <netID>
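scancel can also target a subset of your jobs. The examples below are a sketch using standard scancel options and the job name from the sample script above.
scancel --name=sample_job -u <netID> ## cancel only jobs with a particular job name
scancel --state=PENDING -u <netID> ## cancel only your jobs that are still waiting in the queue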
For additional job commands, please see Common Job Commands.