Quest RHEL8 Pilot Environment - November 18.
Starting November 18, all Quest users are invited to test and run their workflows in a RHEL8 pilot environment to prepare for Quest moving completely to RHEL8 in March 2025. We invite researchers to provide us with feedback during the pilot by contacting the Research Computing and Data Services team at quest-help@northwestern.edu. The pilot environment will consist of 24 H100 GPU nodes and 72 CPU nodes, and it will expand with additional nodes through March 2025. Details on how to access this pilot environment will be published in a KB article on November 18.
Examples of submitting interactive and batch jobs to the Quest compute nodes.
Small test scripts and applications can be run directly on the Quest login nodes, but if you are going to use any significant computational resources (more than 4 cores and/or 4GB of RAM) or run for more than about an hour, you need to submit a job to request computational resources from the Quest compute nodes. Jobs can be submitted to the Quest compute nodes in two ways: Interactive jobs, which are particularly useful for GUI applications, or Batch jobs, which are the most common jobs on Quest. Interactive jobs are appropriate for GUI applications like Stata, or interactively testing and prototyping scripts; they should generally use a small number of cores (fewer than 6) and be of short duration (a few hours). Batch jobs are appropriate for jobs with no GUI interface, and they can accommodate large core counts and long duration (up to a week).
The program that schedules jobs and manages resources on Quest is Slurm. To submit, monitor, modify, and delete jobs on Quest, you use Slurm commands such as sbatch, squeue, and scancel, together with #SBATCH directives inside your job submission scripts.
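For reference, a few of the Slurm commands you will use most often are shown below; the job ID is only a placeholder for illustration.
sbatch jobscript.sh     # submit a batch job script to the scheduler
squeue -u $USER         # list your pending and running jobs
scancel 549005          # cancel a job by its job ID
sacct -j 549005         # show accounting information for a finished job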
Batch Jobs
To submit a batch job, you first write a submission script specifying the resources you need and what commands to run, then you submit this script to the scheduler by running an sbatch command on the command line.
Example Submission Script
A submission script for a batch job could look like the following. When substituting your own values, replace the angle brackets <> as well. These commands would be saved in a file such as jobscript.sh.
jobscript.sh
#!/bin/bash
#SBATCH -A p20XXX # Allocation
#SBATCH -p short # Queue
#SBATCH -t 04:00:00 # Walltime/duration of the job
#SBATCH -N 1 # Number of Nodes
#SBATCH --mem=18G # Memory per node in GB needed for a job. Also see --mem-per-cpu
#SBATCH --ntasks-per-node=6 # Number of Cores (Processors)
#SBATCH --mail-user=<email_address> # Designate email address for job communications
#SBATCH --mail-type=<event> # Event options are BEGIN, END, NONE, FAIL, REQUEUE
#SBATCH --output=<path/for/output> # Path for output must already exist
#SBATCH --error=<path/for/error> # Path for errors must already exist
#SBATCH --job-name="test" # Name of job
# unload any modules that carried over from your command line session
module purge
# add a project directory to your PATH (if needed)
export PATH=$PATH:/projects/p20XXX/tools/
# load modules you need to use
module load python/anaconda
module load java
# A command you actually want to execute:
java -jar <your_program.jar>
# Another command you actually want to execute, if needed:
python myscript.py
The first line of the script loads the bash shell. Lines that begin with #SBATCH are directives interpreted by Slurm; no other line in the script is executed until Slurm places the job on a compute node. In these lines the # is required; it does not act as a comment character when it is part of an #SBATCH directive.
After the Slurm commands, the rest of the script works like a regular Bash script. You can modify environment variables, load modules, change directories, and execute program commands. Lines in the second half of the script that start with # are comments.
In the example above, export PATH=$PATH:/projects/p20XXX/tools/ puts additional tools stored in a project directory on the user's path so that they can be called without typing the full path. By default, Slurm jobs start in the directory the job was submitted from. Your script can cd (change directory) elsewhere if your code is located in a different directory than your submission script.
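For example, a minimal sketch of changing into a project directory before running your code (the directory name here is hypothetical):
# move to the directory that holds the code and input data for this job
cd /projects/p20XXX/my_analysis
python myscript.py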
Find a downloadable copy of this example script on GitHub.
Commands and Options
Example Command | Description
#!/bin/bash | REQUIRED: The first line of your script, specifying the type of shell (in this case, bash).
#SBATCH -A <allocation> | REQUIRED: Tells the scheduler the allocation name, so that it can determine your access.
#SBATCH -t <hh:mm:ss> | REQUIRED: Provides the scheduler with the time needed for your job to run so resources can be allocated. On general access allocations, Quest allows jobs of up to 7 days (168 hours).
#SBATCH -p <partition> | REQUIRED: Common values are short, normal, long, or buyin. Note that under Slurm, queues are called "partitions". See Quest Partitions/Queues for details on which queue to choose for different length jobs.
#SBATCH --job-name="name_of_job" | Gives the job a descriptive name, useful for reporting, such as when using the command squeue.
#SBATCH --mail-type=<event> | Event options are BEGIN, END, NONE, FAIL, REQUEUE. You must include your email address in your .forward file in your /home/<NetID> directory or use the --mail-user option below. Specify multiple values as a comma-separated list (no spaces).
#SBATCH --mail-user=<email_address> | Specifies the email address for job notifications.
#SBATCH -N <nodes> with #SBATCH --ntasks-per-node=<cores per node>, or #SBATCH -n <total cores> | The first pair of options specifies how many nodes and how many processors (cores) per node. The single -n option specifies how many processors in total, without restricting them to a specific number of nodes. Use one form or the other, NOT both (see the sketch after this table). If neither form is used, one core on one node is allocated. If your code is not parallelized, one core on one node may be appropriate for your job.
#SBATCH --mem=<XX>G | Specifies the amount of memory per node needed by the job, where <XX> is the number of GB of RAM you are requesting. (details below)
#SBATCH --mem-per-cpu=<XX>G | Specifies the amount of memory in GB needed for each processor, appropriate for multi-threaded applications. Define only one of --mem or --mem-per-cpu in your job submission script. (details below)
#SBATCH --output=<path/for/output> | Writes the output log for the job (whatever would go to stdout) to a file; note that the path must already exist. If not specified, stdout is written to a file named according to the job ID (slurm-<jobID>.out) in the directory you submitted the job from. If --error is not specified (below), errors are also written to the output file.
#SBATCH --error=<path/for/error> | Writes the error log for the job (whatever would go to stderr) to the named file. The error file is very important for diagnosing jobs that fail to run properly. If not specified, stderr is written to the output file (above).
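As referenced in the table above, a minimal sketch of the two ways to request 8 cores; use one form or the other, not both:
# Form 1: one node, with 8 cores on that node
#SBATCH -N 1
#SBATCH --ntasks-per-node=8
# Form 2: 8 cores in total, which Slurm may spread across several nodes
#SBATCH -n 8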
Setting memory for your job
Slurm allocates the memory that your job will have access to on the compute nodes with a hard upper limit. Jobs on the compute nodes cannot access memory beyond what Slurm reserves for them; if your job tries to access more memory than has been reserved, it will either run very slowly or terminate, depending on how the software you are running was written to handle this type of situation.
The amount of memory reserved by Slurm can be specified in your job submission script with the directives #SBATCH --mem=<XX>G or #SBATCH --mem-per-cpu=<XX>G. If your job submission script does not specify how much memory your job requires, Slurm allocates a default amount (3,256 MB per core), which may not be enough. For example, if you submit a job to run on 10 cores and do not specify a memory request, Slurm allocates 32,560 MB in total.
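As an illustration, the two requests below reserve the same total amount of memory for a single-node job with 8 cores (the numbers are only examples):
# Option A: 32 GB for the node, shared by all 8 cores
#SBATCH --ntasks-per-node=8
#SBATCH --mem=32G
# Option B: 4 GB for each of the 8 cores, 32 GB in total
#SBATCH --ntasks-per-node=8
#SBATCH --mem-per-cpu=4G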
General access jobs can land on several different node types, which do not all have the same amount of memory. A job that requests more memory than some node types provide will be limited to running on the subset of general access nodes that do have that much memory available. Note that, in general, the more resources your job requests, the longer it will wait to be placed on a suitable compute node.
Specifying Memory for Jobs on Quest provides more details on setting memory. In addition, Checking Processor and Memory Utilization for Jobs on Quest provides information on profiling how much memory your completed Slurm jobs actually used versus how much memory was reserved.
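For a quick command-line check of a finished job, sacct can compare the memory you requested with the memory the job actually used; the job ID below is only a placeholder.
sacct -j 549005 --format=JobID,Elapsed,ReqMem,MaxRSS,State
# ReqMem is the memory that was requested; MaxRSS is the peak memory the job steps actually used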
Submitting Your Batch Job
After you've written and saved your submission script, you can submit your job. At the command line, type:
sbatch <jobscript.sh>
where <jobscript.sh> is the name of your submission script (jobscript.sh in the example above). Upon submission, the scheduler returns your job number:
Submitted batch job 549005
If you would prefer the return value of your job submission to be just the job number, use the --parsable flag:
sbatch --parsable <jobscript.sh>
549005
This may be desirable if you have a workflow that accepts the return value as a variable for job monitoring or dependencies.
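For example, a minimal sketch of capturing the job number in a shell variable and chaining a dependent job (the script names are hypothetical):
# submit the first job and keep only its job number
jobid=$(sbatch --parsable jobscript.sh)
# run postprocess.sh only after the first job finishes successfully
sbatch --dependency=afterok:${jobid} postprocess.sh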
If there is an error in your job submission script, the job will not be accepted by the scheduler and you will receive an error message right away, for example:
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
If your job submission returns an error, correct your submission script and resubmit the job. If no error is returned, your job has entered the queue and will run when resources become available.
More Information
For more examples and options, see Examples of Jobs on Quest.
Interactive Jobs
To launch an interactive job from the command line, use the srun command:
srun --account=<account> --time=<hh:mm:ss> --partition=<partition> --mem=<XX>G --pty bash -l
This launches a terminal session on a compute node as a single-core job. To request additional cores for multi-threaded applications, include the -N and -n flags:
srun --account=<account> --time=<hh:mm:ss> --partition=<partition> -N 1 -n 6 --mem=<XX>G --pty bash -l
For best practices with srun, always include the following flags:
Option | Description
--pty | Launches an interactive terminal session on the compute node.
--account=<account> | The allocation to charge for the job.
--time=<hh:mm:ss> | Duration of the interactive job. The job ends if you exit the terminal session before the time is up. Note that your session will be killed without warning at the end of your requested time period.
--partition=<partition> | Queue/partition for the job.
--mem=<XX>G | The amount of memory per node, in GB, requested for the interactive job.
To request more than the default single node/single core:
Option | Description
-N <nodes> | Requests a number of nodes for the job. If this is not specified but -n is, the tasks may land on multiple nodes. For most non-MPI applications, request a single node.
-n <tasks> | Requests the number of tasks/processors/cores for the job. If your application supports multi-threading, request the number of threads you will need.
Note that reserving more resources than you actually use unnecessarily lowers the priority of your future jobs.
Interactive Job Examples
Example 1: Interactive Job to Run a Bash Command Line session
srun --account=p12345 --partition=short -N 1 -n 4 --mem=12G --time=01:00:00 --pty bash -l
This would run an interactive bash session on a single compute node with four cores, and access to 12GB of RAM for up to an hour, debited to the p12345 account.
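When you are finished with an interactive session, simply exit the shell; this ends the job and releases the reserved resources:
exit    # or press Ctrl-D; the interactive job ends and the cores and memory are released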
Example 2: Interactive job to run a GUI program
If you connect to Quest using SSH from a terminal program, make sure to enable X-forwarding by adding the -Y option when you connect:
ssh -Y <NetID>@quest.northwestern.edu
If you use FastX to connect instead, then X-forwarding will be enabled by default in the GNOME terminal.
For an interactive job with a GUI component, include the --x11 flag with srun, which allows X tunneling from Quest to your desktop display. For example:
srun --x11 --account=p12345 -N 1 -n 4 --partition=short --mem=12G --time=01:00:00 --pty bash -l
This requires an X window server to be running on your desktop, which is the case if you're using FastX. Another option for Mac users is XQuartz. To confirm that X-forwarding is enabled, try the command:
xclock
If the clock graphic appears on your screen, X-forwarding is working.
Note that when you enter the srun command for an interactive job, there may be a pause while the scheduler looks for available resources. You will then be shown information about the compute node you have been assigned and automatically connected to it. The command prompt in your terminal will change to reflect this new connection, and you can then proceed with your work as if you were on a login node.
Keywords: quest, job, submit, submission, script, msub, moab, torque, module, bash, interactive, batch, slurm, sbatch, srun
Created: 2016-12-07 02:57:18
Updated: 2021-09-28 21:50:43