Everything You Need to Know about Using Slurm on Quest

Overview

This page contains in-depth information on submitting jobs on Quest through Slurm.

The program that schedules jobs and manages resources on Quest is called Slurm. This page is designed to help facilitate a deeper understanding of how Slurm works and to contextualize the things that you can do with Slurm. For those who are brand new to Quest, we recommend visiting our Research Computing How-to Videos page and watching our Introduction to Quest video before proceeding with this documentation. This page covers:

  • A simple Slurm job submission script.
  • The key configuration settings of the Slurm submission script as well as details on additional settings that may be useful to specify.
  • How to monitor and manage your job submissions
  • Special types of job submissions and when they are useful.
  • Diagnosing issues with your job submission.

Table of Contents

The Job Submission Script

Slurm requires users to write a batch submission script to run a batch job. In this section, we present a basic submission script and how a user would then submit a batch job using this script. In the submission script, you specify the resources you need and what commands to run. You submit this script to the scheduler by running the sbatch command on the command line.

Example Submission Script

Submitting Your Batch Job

Slurm Configuration Settings

In this section, we go into details about a number of the Slurm configurations settings. In each subsection, we include;

  • the possible range of values a user can set,
  • how to think about what value to use for a given setting,
  • whether the setting is required, and
  • if a setting is required, what the default is for the setting.

Allocation/Account

Quest Partitions/Queues

Walltime/Length of the job

Number of Nodes

Number of Cores

Required memory

Standard Output/Error

Job Name

Sending e-mail alerts about your job

Constraints

All Slurm Configuration Options

Environmental Variables Set by Slurm

SLURM Commands and Job Management

In this section, we discuss how to manage batch jobs after they have been submitted on Quest. This includes how to monitor jobs currently pending or running, how to cancel jobs, and how to check on the status of past jobs.

Table of Common Slurm Commands

The squeue Command

The sacct Command

The seff Command

The checkjob Command

Cancelling Jobs

Holding, Releasing, or Modifying Jobs

Probing Priority

Special Types of Job Submissions

In this section, we provide details and examples of how to use Slurm to run interactive jobs, job arrays, and jobs that depend on each other.

Interactive Job Examples

Job Array

Dependent Jobs

Factors Affecting Job Scheduling on Quest

If your job is waiting on the queue, the reason is most probably one of the following:

  • Your job's score is lower compared to others
  • Unavailable/occupied compute resources at that moment.

Priority

Backfill Scheduling

Diagnosing Issues with Your Job Submission Script and/or Your Job Itself

Debugging a Job Submission Script Rejected By The Scheduler

Debugging a Job Accepted by the Scheduler

Common reasons for Failed Jobs

This section provides some common reasons for why your job may fail and how to go about fixing it.

Job Exceeded Request Time or Memory

Out of Disk Space

Was this helpful?
100% helpful - 2 reviews
Print Article

Details

Article ID: 1964
Created
Wed 10/5/22 3:49 PM
Modified
Wed 7/17/24 2:44 PM

Related Articles (1)