Debugging your Slurm submission script on Quest

This document explains how to fix a Slurm job submission script that either does not run or runs incorrectly, and lists common mistakes as well as how to identify and fix them.

Debugging a Job Submission Script Rejected By The Scheduler

If your job submission script generates an error when you submit it with the sbatch command, the problem in your script is in one or more of the lines that begin with #SBATCH. To debug job scripts that generate errors, look up the error message in the section below to identify the most likely reason your script received that error message. Once you have identified the mistake in your script, edit your script to correct it and re-submit your job. If you receive the same error message again, examine the error message and the mistake in your script more closely. Sometimes the same error message can be generated by two different mistakes in the same script, meaning it's possible that you may resolve the first mistake but need to correct a second mistake to clear that particular error message. Mistakes can be difficult to identify, and often require careful reading of your #SBATCH lines.

When you re-submit your job you may receive a new error message. This means the mistake that generated the first error message has been resolved, and now you need to fix a second mistake. Slurm returns up to two distinct error messages at a time. If your submission script has more than two mistakes, you will need to re-submit your job multiple times to identify and fix all of them.

When Slurm encounters a mistake in your job submission script, it does not read the rest of your script that comes after the mistake. If the mistake generates an error, you can fix it and resubmit your job; however, not all mistakes generate errors. If your script's required elements (account, partition, nodes, cores, and wall time) have been read successfully before Slurm encounters your mistake, your job will still be accepted by the scheduler and run, just not the way you expect it to. Scripts with mistakes that don't generate errors still need to be debugged, since the scheduler has ignored some of your #SBATCH lines. Unexpected or incorrect output from your job is a sign that your script contains this kind of mistake.
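
The required elements named above can be checked before you submit. The following is a minimal sketch, not an official Quest tool: the function names has_directive and check_required are hypothetical, and it only recognizes the option spellings used in this document plus --nodes/-N and --ntasks/-n, which are assumed here to be how nodes and cores are requested.

```shell
#!/bin/sh
# Sketch: report which required #SBATCH elements cannot be found in a script.

has_directive() {
    # $1 = script path, $2 = extended regex for the option part of the line
    grep -Eq -e "^#SBATCH[[:space:]]+($2)" "$1"
}

check_required() {
    # $1 = script path; prints the missing elements and returns 1 if any are missing
    missing=""
    has_directive "$1" '--account=|-A[[:space:]]'   || missing="$missing account"
    has_directive "$1" '--partition=|-p[[:space:]]' || missing="$missing partition"
    has_directive "$1" '--nodes=|-N[[:space:]]'     || missing="$missing nodes"
    has_directive "$1" '--ntasks-per-node=|--ntasks=|-n[[:space:]]' || missing="$missing cores"
    has_directive "$1" '--time=|-t[[:space:]]'      || missing="$missing time"
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    echo "all required directives found"
}
```

For example, check_required job.sh prints the elements it could not find. A directive with a typo in its option name is reported as missing.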

Mistakes That Generate Error Messages

To use this reference: search for the exact error message generated by your job. Some error messages appear to be similar but are generated by different mistakes.

Note that the errors listed in this document may also be generated by interactive job submissions using srun or salloc. In those cases, the error messages will begin with "srun: error:" or "salloc: error:" instead of "sbatch: error:". The information about resolving these error messages is the same.

With certain combinations of GUI editors and character sets on your personal computer, copying and pasting into Quest job submission scripts may bring in hidden characters that interfere with the scheduler's ability to interpret the script. In these cases, the #SBATCH lines will appear to have no mistakes but still generate errors when submitted to the scheduler. To see all of the hidden characters in your job submission script, use the command cat -A <script_name>. To resolve this, you may need to type your submission script into a native Unix editor like vi and not use copy and paste.
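
As a complement to cat -A, the lines that contain hidden characters can be listed directly. This is a minimal sketch under the assumption that a correct script contains only plain ASCII text; the function name find_hidden_chars is hypothetical. Carriage returns are deliberately not flagged, since sbatch reports DOS line breaks with its own error message.

```shell
#!/bin/sh
# Sketch: list lines containing bytes outside plain ASCII text, for example
# a typographic dash pasted in place of the first "-" in "--output".

find_hidden_chars() {
    # LC_ALL=C makes grep compare raw bytes, -a forces text mode,
    # -n prefixes each matching line with its line number.
    LC_ALL=C grep -an '[^[:print:][:space:]]' "$1"
}
```

Running find_hidden_chars job.sh prints each suspicious line with its line number, and prints nothing if no such characters are found.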

sbatch: error: --account option required
sbatch: error: Unable to allocate resources: Invalid account or account/partition combination specified

Location of mistake:
#SBATCH --account=<allocation>
or
#SBATCH -A <allocation>
Example of correct account syntax:
#SBATCH --account=p12345
or
#SBATCH -A p12345

Possible mistake: your script doesn't have an #SBATCH line specifying account
Fix: confirm that #SBATCH --account=<allocation> is in your script.

Possible mistake: a typo in the "--account=" or "-A" part of this #SBATCH line
Fix: examine this line closely to make sure the syntax is correct

Possible mistake: you are not a member of the allocation specified in your job submission script
Fix: confirm you are a member of the allocation by typing groups at the command line on Quest. If the allocation you have specified in your job submission script is not listed, you are not a member of this allocation. Use an allocation that you are a member of in your job submission script.

Possible mistake: the mistake is on a line earlier in your job submission script which causes Slurm to stop reading your script before it reaches the #SBATCH --account=<allocation> line
Fix: Move the #SBATCH --account=<allocation> line to be immediately after the line #!/bin/bash and submit your job again. If this generates a new error referencing a different line of your script, the account line is correct and the mistake is elsewhere in your submission script. To resolve the new error, follow the debugging suggestions for the new error message.
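
The membership check described above (typing groups) can also be scripted. A minimal sketch, assuming the standard id utility is available; the function name is_member is hypothetical.

```shell
#!/bin/sh
# Sketch: succeed if the current user belongs to the given allocation (Unix group).

is_member() {
    # id -Gn prints the user's group names separated by spaces;
    # grep -x -F requires an exact, literal match of the whole name.
    id -Gn | tr ' ' '\n' | grep -qxF -e "$1"
}
```

For example: is_member p12345 || echo "not a member of p12345".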

sbatch: error: Your allocation has expired
sbatch: error: Unable to allocate resources: Invalid qos specification

Location of mistake:
#SBATCH --account=<allocation>
or
#SBATCH -A <allocation>

The allocation specified in your job submission script is no longer active.

If you are a member of more than one allocation, you may wish to submit your job to an alternate allocation. To see a list of your allocations, type groups at the command line on Quest.

To renew your allocation or request a new one, please see Managing an Allocation on Quest.

srun: error: --partition option required
srun: error: Unable to allocate resources: Access/permission denied

Location of mistake:
#SBATCH --partition=<partition/queue>
or
#SBATCH -p <partition/queue>

Example of correct syntax for general access allocations ("p" account):
#SBATCH --partition=short
or
#SBATCH -p short

Example of correct syntax for buy-in allocations ("b" account):
#SBATCH --partition=buyin
or
#SBATCH -p buyin

Note that Slurm refers to queues as partitions. When specifying partition in your job submission script, use the queue name that you submitted to under Moab.

Possible mistake: your script doesn't have an #SBATCH line specifying partition
Fix: confirm that #SBATCH --partition=<partition/queue> or #SBATCH -p <partition/queue> is in your script.

Possible mistake: a typo in the "--partition=" or "-p" part of this #SBATCH line
Fix: examine this line closely to make sure the syntax is correct

Possible mistake: the mistake is on a line earlier in your job submission script which causes Slurm to stop reading your script before it reaches the #SBATCH --partition=<partition/queue> line
Fix: Move the #SBATCH --partition=<partition/queue> line to be immediately after the line #!/bin/bash and submit your job again. If this generates a new error referencing a different line of your script, the partition line is correct and the mistake is elsewhere in your submission script. To resolve the new error, follow the debugging suggestions for the new error message.

sbatch: error: Unable to allocate resources: Invalid qos specification

Location of mistake:
#SBATCH --partition=<partition/queue>
or
#SBATCH -p <partition/queue>

The partition/queue name specified is not associated with the allocation in the line #SBATCH --account=<allocation>.

Possible mistake: Your script specifies a buy-in allocation, and you've specified "short", "normal" or "long" as your partition/queue.

Possible mistake: Your script specifies an allocation and partition combination which do not belong together.
Fix: Specify the correct partition/queue for your allocation. To see the allocations and partitions you have access to, use this version of the sinfo command:

sinfo -o "%g %.10R %.20l"
GROUPS      PARTITION         TIMELIMIT
b1234       buyin             168:00:00
Note that "GROUPS" are allocations/accounts on Quest.

In this example, valid lines in your job submission script that relate to account, partition and time would be:

#SBATCH --account=b1234
#SBATCH --partition=buyin
#SBATCH --time=168:00:00
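
The lookup in the sinfo output can be automated. The sketch below reads that output on standard input and reports whether an account and partition appear together. The function name pair_allowed is hypothetical, and two things are assumptions rather than facts taken from this document: that the GROUPS column may hold several comma-separated accounts, and that the value "all" means the partition is open to every account.

```shell
#!/bin/sh
# Sketch: check an account/partition pair against `sinfo -o "%g %.10R %.20l"` output.

pair_allowed() {
    # $1 = account, $2 = partition; sinfo output is read from stdin
    awk -v acct="$1" -v part="$2" '
        NR == 1 { next }                      # skip the header row
        $2 == part {
            n = split($1, groups, ",")        # assumed: comma-separated accounts
            for (i = 1; i <= n; i++)
                if (groups[i] == acct || groups[i] == "all") found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}
```

For example: sinfo -o "%g %.10R %.20l" | pair_allowed b1234 buyin && echo "combination found".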

sbatch: error: invalid partition specified: <partition_name>
sbatch: error: Unable to allocate resources: Invalid partition name specified

Location of mistake:
#SBATCH --partition=<partition/queue>
or
#SBATCH -p <partition/queue>

Example of correct syntax for general access allocations ("p" account):
#SBATCH --partition=short
or
#SBATCH -p short

Example of correct syntax for buy-in allocations ("b" account):
#SBATCH --partition=buyin
or
#SBATCH -p buyin

Possible mistake: a typo in the "--partition=" or "-p" part of this #SBATCH line
Fix: examine this line closely to make sure the syntax is correct

Possible mistake: Your script specifies a general access allocation ("p" account) with a queue that isn't "short", "normal" or "long".
Fix: change your partition to be "short", "normal" or "long"

sbatch: error: Unable to allocate resources: Invalid account or account/partition combination specified
sbatch: error: Unable to allocate resources: User's group not permitted to use this partition

This message can refer to mistakes on the #SBATCH lines specifying account or partition.

Possible location of mistake specifying account:
#SBATCH --account=<allocation>
or
#SBATCH -A <allocation>

Possible location of mistake specifying partition
#SBATCH --partition=<partition/queue>
or
#SBATCH -p <partition/queue>

Possible mistake: the syntax in the #SBATCH line specifying account is incorrect
Fix: examine the account line closely to confirm the syntax is exactly correct. Example of correct account syntax:
#SBATCH --account=p12345
or
#SBATCH -A p12345

Possible mistake: you are trying to run in a partition/queue that belongs to one account, while specifying a different account.
Fix: Specify the correct partition/queue for your allocation. To see the allocations and partitions you have access to, use this version of the sinfo command:

sinfo -o "%g %.10R %.20l"
GROUPS      PARTITION         TIMELIMIT
b1234       buyin             168:00:00
Note that "GROUPS" are allocations/accounts on Quest.

In this example, valid lines in your job submission script that relate to account, partition and time would be:

#SBATCH --account=b1234
#SBATCH --partition=buyin
#SBATCH --time=168:00:00

Possible mistake: the mistake is on a line earlier in your job submission script which causes Slurm to stop reading your script before it reaches the #SBATCH --account=<allocation> line
Fix: Move the #SBATCH --account=<allocation> line to be immediately after the line #!/bin/bash and submit your job again. If this generates a new error referencing a different line of your script, the account line is correct and the mistake is elsewhere in your submission script. To resolve the new error, follow the debugging suggestions for the new error message.

sbatch: error: --time limit option required
sbatch: error: Unable to allocate resources: Requested time limit is invalid (missing or exceeds some limit)

Location of mistake:
#SBATCH --time=<hours:minutes:seconds>
or
#SBATCH -t <hours:minutes:seconds>

Example of correct syntax:
#SBATCH --time=10:00:00
or
#SBATCH -t 10:00:00

Possible mistake: your script doesn't have an #SBATCH line specifying time
Fix: confirm that #SBATCH --time=<hh:mm:ss> is in your script.

Possible mistake: a typo in the "--time=" or "-t" part of this #SBATCH line
Fix: examine this line closely to make sure the syntax is correct.

Possible mistake: the time request is too long for the partition (queue)
Fix: review the wall time limits of your partition and adjust the amount of time requested by your script. For general access users with allocations that begin with a "p", please use this reference:

Partition    Walltime limit
short        4 hours
normal       48 hours
long         7 days / 168 hours

Buy-in accounts that begin with a "b" have their own wall time limits. For information on the wall time of your partition, use the sinfo command:

sinfo -o "%g %.10R %.20l"
GROUPS      PARTITION         TIMELIMIT
b1234       buyin             168:00:00
To fix this error, set your wall time to be no longer than the time limit of your partition and re-submit your job.
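
Comparing a requested wall time with a partition limit is easier once both are in seconds. The sketch below handles only the two forms that appear in this document, hours:minutes:seconds (168:00:00) and days-hours:minutes:seconds (2-00:00:00); other forms that Slurm accepts are not covered. The function names to_seconds and fits_limit are hypothetical.

```shell
#!/bin/sh
# Sketch: convert HH:MM:SS or D-HH:MM:SS to seconds and compare two values.

to_seconds() {
    # Prefix "0-" when there is no day field so awk always sees four fields
    case "$1" in
        *-*) printf '%s\n' "$1" ;;
        *)   printf '0-%s\n' "$1" ;;
    esac | awk -F'[-:]' '{ print $1 * 86400 + $2 * 3600 + $3 * 60 + $4 }'
}

fits_limit() {
    # $1 = requested time, $2 = partition TIMELIMIT; succeeds if $1 <= $2
    [ "$(to_seconds "$1")" -le "$(to_seconds "$2")" ]
}
```

For example, fits_limit 49:00:00 2-00:00:00 fails because 49 hours is longer than 2 days, while fits_limit 168:00:00 168:00:00 succeeds.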

Possible mistake: the mistake is on a line earlier in your job submission script which causes Slurm to stop reading your script before it reaches the #SBATCH --time=<hh:mm:ss> line
Fix: Move the #SBATCH --time=<hh:mm:ss> line to be immediately after the line #!/bin/bash and submit your job again. If this generates a new error referencing a different line of your script, the time line is correct and the mistake is elsewhere in your submission script. To resolve the new error, follow the debugging suggestions for the new error message.

sbatch: unrecognized option <option>

Example:
Line in script: #SBATCH --n-tasks-per-node=1
Error generated: sbatch: unrecognized option '--n-tasks-per-node=1'

With an "unrecognized option" error, Slurm correctly read the first part of the #SBATCH line but the option that follows it has generated the error. In this example, the option has a dash between "n" and "tasks" that should not be there. The correct option does not have a dash in that location. This line should be corrected to:
#SBATCH --ntasks-per-node=1
To fix this error, locate the option specified in the error message and examine it carefully for errors. To see correct syntax for all #SBATCH directives, see Converting Moab/Torque scripts to Slurm.
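
A quick way to spot a misspelled long option is to compare each one against a list of known spellings. The sketch below is only an illustration: the function name list_unknown_options is hypothetical, and the known list contains just the options mentioned in this document plus --nodes and --ntasks. sbatch accepts many more options, so an option flagged here is not necessarily wrong.

```shell
#!/bin/sh
# Sketch: print long #SBATCH options that are not in a small known list.

list_unknown_options() {
    # Deliberately short list; extend it with the options you actually use
    known=" account partition time nodes ntasks ntasks-per-node mem job-name output error "
    # Extract the option name: the text between "--" and "=" or whitespace
    sed -n 's/^#SBATCH[[:space:]]*--\([^=[:space:]]*\).*/\1/p' "$1" |
    while IFS= read -r opt; do
        case "$known" in
            *" $opt "*) ;;                        # spelled like a known option
            *) echo "check spelling: --$opt" ;;
        esac
    done
}
```

With the example line above, list_unknown_options job.sh prints: check spelling: --n-tasks-per-node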

sbatch: error: CPU count per node can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available

Location of mistake:
#SBATCH --ntasks-per-node=<CPU count>
Example of mistake:
#SBATCH --ntasks-per-node=10000

This error is generated if your job requests more CPUs/cores than are available on the nodes in the partition your job submission script specified. CPU count is the number of cores requested by your job submission script. Cores are also called processors or CPUs.

To fix this mistake, use the sinfo command to get the maximum number of cores available in the partitions you have access to:

sinfo -o "%g %.10R %.20l %.10c"
GROUPS      PARTITION       TIMELIMIT       CPUS
b1234       buyin           2-00:00:00      20+
In this example, your job submission script can request up to 20 CPUs/cores per node like this:
#SBATCH --ntasks-per-node=20
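
The comparison between your request and the CPUS column can be written as a one-line test. A minimal sketch with a hypothetical function name, cpus_fit; it treats the number before the "+" as the per-node ceiling, as the example above does.

```shell
#!/bin/sh
# Sketch: check a --ntasks-per-node value against the CPUS column from sinfo.

cpus_fit() {
    # $1 = requested cores per node, $2 = CPUS value such as "20" or "20+"
    limit=${2%+}            # drop a trailing "+" if present
    [ "$1" -le "$limit" ]
}
```

For example, cpus_fit 20 20+ succeeds and cpus_fit 10000 20+ fails.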

sbatch: error: Batch script contains DOS line breaks (\r\n)
sbatch: error: instead of expected UNIX line breaks (\n).

Location of mistake:
Hidden characters in your job submission script

Mistake: your job submission script was created on a Windows machine and copied onto Quest without converting its DOS line breaks to Unix line breaks.
Fix: from the command line on Quest run the command dos2unix <submission_script> to correct your job submission script and re-submit your job to the scheduler.
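
To confirm that a script has DOS line breaks before or after converting it, you can search for the carriage return character. The sketch below also shows a fallback using tr for systems where dos2unix is not installed; that fallback is a substitute suggested here, not the method this document recommends. Both function names are hypothetical.

```shell
#!/bin/sh
# Sketch: detect and remove DOS carriage returns in a job submission script.

has_dos_line_breaks() {
    # Succeeds if any line of the file contains a carriage return
    cr=$(printf '\r')
    grep -q "$cr" "$1"
}

strip_dos_line_breaks() {
    # Fallback when dos2unix is unavailable: delete every carriage return.
    # Writing back with cat keeps the original file's permissions.
    tmp=$(mktemp) &&
    tr -d '\r' < "$1" > "$tmp" &&
    cat "$tmp" > "$1" &&
    rm -f "$tmp"
}
```

For example: has_dos_line_breaks job.sh && strip_dos_line_breaks job.sh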

Mistakes That Don't Generate Error Messages

Debugging a Job Accepted by the Scheduler

Once your job has been accepted, the Slurm scheduler will return a job id number. After waiting in the queue, your job will run. To see the status of your job, use the command sacct -X.

For jobs with mistakes that do not give error messages, you will need to investigate if you notice something is wrong with how the job runs. If you notice one of the problems described below, see its entry for debugging suggestions.

Job runs in home directory instead of project directory
Job can't locate files or executables

Problem: job runs in home directory instead of project directory, job can't locate files or executables.
Possible cause: job script contains the Moab variable $PBS_O_WORKDIR.

In Moab job submission scripts, the $PBS_O_WORKDIR variable contains the name of the directory where you submit your job. Some Moab scripts begin with the line:
cd $PBS_O_WORKDIR
to start the job running in the job submission directory.

Slurm jobs start in the job submission directory by default. Slurm does not set the Moab variable $PBS_O_WORKDIR, which means $PBS_O_WORKDIR is an empty variable when your Slurm job runs. Your Slurm script will still run the command cd $PBS_O_WORKDIR; however, since $PBS_O_WORKDIR is empty, the shell drops it and the command becomes
cd
In Unix, executing cd by itself is shorthand for changing into your home directory. Slurm jobs that begin with cd $PBS_O_WORKDIR begin running in your home directory instead of in your job submission directory. $PBS_O_WORKDIR should never be used in a Slurm job submission script, as it can only cause mistakes.

Fix: You can solve this problem by changing this line to
cd $SLURM_SUBMIT_DIR
or by deleting the line
cd $PBS_O_WORKDIR
completely, as Slurm's default is to run in the job submission directory.
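
The effect described above can be reproduced in any shell, without Slurm. A small sketch; the function name where_cd_lands is made up for this illustration.

```shell
#!/bin/sh
# Sketch: show where `cd $PBS_O_WORKDIR` lands when the variable is not set.

where_cd_lands() (
    # The parentheses run the body in a subshell, so the caller's
    # working directory is left unchanged.
    unset PBS_O_WORKDIR
    cd $PBS_O_WORKDIR       # unquoted and empty: the shell runs plain `cd`
    pwd
)
```

Running where_cd_lands prints your home directory, whatever directory you call it from.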

Job runs very slowly or dies after starting

Problem: job runs very slowly, or dies after starting
Possible cause: job script is not reading the directive #SBATCH --mem=<amount>.

All Slurm job scripts should specify the amount of memory your job needs to run. If your job runs very slowly or dies, investigate if it requests enough memory with the Slurm utility seff. For more information, see Checking Processor and Memory Utilization for Jobs on Quest.

Job name is name of job submission script instead of name in submission script

Problem: job name is name of job submission script instead of name in submission script
Possible cause: job script is not reading the #SBATCH --job-name=<job name> directive.

Slurm is not reading the SBATCH directive:
#SBATCH -J <Job_Name>
or
#SBATCH --job-name=<Job_Name>

To see the name of your job, run sacct -X. If the JobName column shows the name of your submission script (truncated to fit the column), Slurm has not read the #SBATCH line for job name.

Possible Mistake: a typo in the "--job-name=" or "-J" part of this #SBATCH line
Fix: examine this line closely to make sure the syntax is correct

Possible mistake: the mistake is on a line earlier in your job submission script which causes Slurm to stop reading your script before it reaches the #SBATCH --job-name=<job name> line
Fix: Move the #SBATCH --job-name=<job name> line to be immediately after the line #!/bin/bash and submit your job again. If this generates a new error referencing a different line of your script, the job name line is correct and the mistake is elsewhere in your submission script. To resolve the new error, follow the debugging suggestions for the new error message.
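
One specific way a directive can go unread is by appearing after the first command in the script: according to the sbatch documentation, no #SBATCH directives are processed once the first line that is neither blank nor a comment has been reached. The sketch below lists such directives; the function name late_directives is hypothetical.

```shell
#!/bin/sh
# Sketch: list #SBATCH lines that come after the first command in a script.

late_directives() {
    # Prints file:line: text for each late directive; returns 1 if there are none
    awk '
        /^#SBATCH/ && seen_command { print FILENAME ":" FNR ": " $0; late = 1 }
        # A line that is neither blank nor a comment counts as a command
        !/^[ \t]*(#|$)/ { seen_command = 1 }
        END { exit (late ? 0 : 1) }
    ' "$1"
}
```

For example, if module purge all appears on line 3 and #SBATCH --job-name=test on line 4, late_directives job.sh prints job.sh:4: #SBATCH --job-name=test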

Modules or environment variables are inherited from the login session by a running job

Problem: modules or environment variables are inherited from the login session by a running job
Possible cause: job script is not purging modules before beginning compute node session

Fix: after the #SBATCH directives in your job submission script, add the line

module purge all

This will clear any modules inherited from your login session, and begin your job in a clean environment. You will need to load any necessary modules into your job submission script after this line.

Job immediately fails and generates no output or error file

Problem: job can't write into output and/or error files so job immediately dies
Possible cause: job script specifies directory path for output and/or error files but does not provide a file name
Possible cause: job script specifies a directory that does not exist

Slurm is not getting a file name that it can write into in the SBATCH directive:

#SBATCH --output=/path/to/file/file_name

or

#SBATCH --error=/path/to/file/file_name

Possible Mistake: a typo in the "--output=" or "--error=" part of this #SBATCH line
Fix: examine this line closely to make sure the syntax is correct

Possible Mistake: providing a directory but not a file name for output and/or error files
Fix: add a file name at the end of the specified path. For a file name in the format <job_name>.o<job_id>, use

#SBATCH --output=/path/to/file/"%x.o%j"

Note that if a separate error file is not specified, errors and output will both be written into the output file. To generate a separate error file, include the line:

#SBATCH --error=/path/to/file/"%x.e%j"
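
Both possible causes (a path that is a directory, and a directory that does not exist) can be checked before submitting. A minimal sketch with a hypothetical function name, check_output_paths; it only looks at directives written in the --output= and --error= form shown above, and relative paths are resolved from the directory where you run it, so run it from your job submission directory.

```shell
#!/bin/sh
# Sketch: check the paths given to --output= and --error= in a job script.

check_output_paths() {
    # Extract the value of each directive and drop any double quotes
    sed -nE 's/^#SBATCH[[:space:]]+--(output|error)=//p' "$1" | tr -d '"' |
    while IFS= read -r path; do
        if [ -d "$path" ]; then
            echo "is a directory, add a file name: $path"
        elif [ ! -d "$(dirname "$path")" ]; then
            echo "directory does not exist: $(dirname "$path")"
        fi
    done
}
```

check_output_paths job.sh prints one line per problem it finds and prints nothing when both paths look usable.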

Details

Article ID: 1808
Created: Thu 5/12/22 12:39 PM
Modified: Fri 6/14/24 10:29 AM