Everything You Need to Know about Using Slurm on Quest

Overview

This page contains in-depth information on submitting jobs on Quest through Slurm.

Slurm is the program that schedules jobs and manages resources on Quest. This page is designed to build a deeper understanding of how Slurm works and to put the things you can do with it in context. If you are brand new to Quest, we recommend visiting our Research Computing How-to Videos page and watching our Introduction to Quest video before proceeding with this documentation. This page covers:

  • A simple Slurm job submission script.
  • The key configuration settings of the Slurm submission script, as well as additional settings that may be useful to specify.
  • How to monitor and manage your job submissions.
  • Special types of job submissions and when they are useful.
  • Diagnosing issues with your job submission.

The Job Submission Script

To run a batch job, Slurm requires you to write a batch submission script in which you specify the resources you need and the commands to run. You submit this script to the scheduler by running the sbatch command on the command line. In this section, we present a basic submission script and show how to submit it.

Example Submission Script
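
A minimal sketch of a submission script is shown below; the allocation, partition, module, file names, and resource values are placeholders that you would replace with your own.

    #!/bin/bash
    ## the directives below are placeholders; each one is described
    ## in the Slurm Configuration Settings section of this page
    #SBATCH --account=p12345
    #SBATCH --partition=short
    #SBATCH --time=01:00:00
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=1
    #SBATCH --mem=4G
    #SBATCH --job-name=example_job
    #SBATCH --output=example_job.%j.log

    ## everything after the directives is an ordinary shell script,
    ## executed on the compute node once the job starts
    module purge
    module load python        ## placeholder: load whatever software your job needs

    python my_script.py       ## placeholder: the command your job actually runs

Every line that begins with #SBATCH is a directive telling the scheduler what resources the job needs; the remaining lines are the commands that run once those resources are granted.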

Submitting Your Batch Job
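
Assuming the script above is saved as example_job.sh (a placeholder name), you submit it to the scheduler from a Quest login node with:

    sbatch example_job.sh

If the scheduler accepts the job, sbatch prints a line such as "Submitted batch job 1234567"; that number is the job ID you will use to monitor, modify, or cancel the job.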

Slurm Configuration Settings

In this section, we go into detail about a number of the Slurm configuration settings. For each setting, the corresponding subsection covers:

  • the possible range of values a user can set,
  • how to think about what value to use for a given setting,
  • whether the setting is required, and
  • if the setting is not required, what its default value is.
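
Whichever setting you are working with, it can be given either as an #SBATCH directive inside the submission script or as a command-line option to sbatch, with the command-line option taking precedence. A brief sketch, using a placeholder walltime and script name:

    ## inside the submission script:
    #SBATCH --time=04:00:00

    ## or, equivalently, on the command line (overrides the directive in the script):
    sbatch --time=04:00:00 example_job.sh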

Allocation/Account

Quest Partitions/Queues

Walltime/Length of the job

Number of Nodes

Number of Cores

Required memory

Standard Output/Error

Job Name

Sending e-mail alerts about your job

Constraints

All Slurm Configuration Options

Environmental Variables Set by Slurm
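
As a brief illustration, a job can read these variables like any other shell variables; the names below are standard Slurm variables, though exactly which ones are populated depends on how the job was requested:

    ## inside a submission script
    echo "Job ID:           $SLURM_JOB_ID"
    echo "Job name:         $SLURM_JOB_NAME"
    echo "Submit directory: $SLURM_SUBMIT_DIR"
    echo "Number of tasks:  $SLURM_NTASKS"
    echo "Node list:        $SLURM_JOB_NODELIST"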

SLURM Commands and Job Management

In this section, we discuss how to manage batch jobs after they have been submitted on Quest. This includes how to monitor jobs currently pending or running, how to cancel jobs, and how to check on the status of past jobs.

Table of Common Slurm Commands

The squeue Command
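
For example, to list only your own pending and running jobs, or to check a single job (the NetID and job ID are placeholders):

    squeue -u your_netid      ## all of your jobs currently in the queue
    squeue -j 1234567         ## one specific job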

The sacct Command
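
As an example, the following asks sacct for a summary of one job (the job ID is a placeholder); the --format option controls which columns are reported:

    sacct -j 1234567 --format=JobID,JobName,Partition,State,ExitCode,Elapsed,MaxRSS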

The seff Command
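
For a completed job, seff reports how much of the requested CPU and memory the job actually used, which is helpful for right-sizing future requests (the job ID is a placeholder):

    seff 1234567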

The checkjob Command

Cancelling Jobs
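
Jobs are cancelled by job ID, and you can also cancel all of your own jobs at once (the values are placeholders):

    scancel 1234567           ## cancel a single job
    scancel -u your_netid     ## cancel all of your jobs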

Holding, Releasing, or Modifying Jobs
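
A sketch of the relevant scontrol commands, using a placeholder job ID; note that most attributes can only be changed while the job is still pending:

    scontrol hold 1234567                              ## keep a pending job from starting
    scontrol release 1234567                           ## let a held job be scheduled again
    scontrol update JobId=1234567 TimeLimit=02:00:00   ## change a setting on a pending job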

Probing Priority
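
The sprio command breaks a pending job's priority score into its contributing factors (the job ID is a placeholder):

    sprio -j 1234567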

Special Types of Job Submissions

In this section, we provide details and examples of how to use Slurm to run interactive jobs, job arrays, and jobs that depend on each other.

Interactive Job Examples
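
One common pattern, sketched here with placeholder values for the allocation, partition, and resources, is to request an interactive shell on a compute node with srun:

    srun --account=p12345 --partition=short --time=01:00:00 --ntasks=1 --mem=4G --pty bash -i

When the resources are granted, your prompt moves to a compute node; exiting that shell ends the job.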

Job Array
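
A job array runs many near-identical tasks from a single script; each task receives its own value of SLURM_ARRAY_TASK_ID, which you can use to pick an input file or parameter. A minimal sketch with placeholder values:

    #!/bin/bash
    #SBATCH --account=p12345
    #SBATCH --partition=short
    #SBATCH --time=00:30:00
    #SBATCH --mem=2G
    ## run 10 tasks, numbered 1 through 10; in the output file name,
    ## %A is the array job ID and %a is the task ID
    #SBATCH --array=1-10
    #SBATCH --output=array_%A_%a.log

    ## each task processes a different input file, e.g. input_1.txt through input_10.txt
    python my_script.py input_${SLURM_ARRAY_TASK_ID}.txt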

Dependent Jobs
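
Dependencies are declared at submission time with the --dependency option; the sketch below (with placeholder script names) starts a second job only if the first one finishes successfully:

    jobid=$(sbatch --parsable first_step.sh)          ## --parsable makes sbatch print only the job ID
    sbatch --dependency=afterok:${jobid} second_step.sh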

Factors Affecting Job Scheduling on Quest

If your job is waiting in the queue, it is most likely for one of the following reasons (the example after this list shows how to check):

  • Your job's priority score is lower than that of other pending jobs.
  • The compute resources your job needs are occupied or otherwise unavailable at that moment.
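
You can ask the scheduler which of these applies to a specific job and when it currently expects the job to start; in the REASON column, Priority means other jobs outrank yours and Resources means the job is waiting for nodes to free up (the job ID is a placeholder):

    squeue -j 1234567 --start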

Priority

Backfill Scheduling

Diagnosing Issues with Your Job Submission Script and/or Your Job Itself

Debugging a Job Submission Script Rejected By The Scheduler

Debugging a Job Accepted by the Scheduler

Common Reasons for Failed Jobs

This section covers some common reasons why your job may fail and how to go about fixing them.

Job Exceeded Request Time or Memory
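
If you suspect this, sacct can confirm it after the fact: a State of TIMEOUT or OUT_OF_MEMORY, or a MaxRSS close to ReqMem or an Elapsed time close to Timelimit, tells you which request to increase (the job ID is a placeholder):

    sacct -j 1234567 --format=JobID,State,ExitCode,Elapsed,Timelimit,MaxRSS,ReqMem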

Out of Disk Space
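
To see where space is being used under a given directory (the project path is a placeholder), standard Linux tools such as du are available on Quest:

    du -h --max-depth=1 /projects/p12345 | sort -h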
