This document explains how to debug a Slurm job submission script that either does not run or runs incorrectly. It lists common mistakes along with how to identify and fix them.
Slurm returns up to two distinct error messages at a time. When you re-submit your job, you may receive a new error message; this means the mistake that generated the first error message has been resolved, and you now need to fix a second mistake. If your submission script has more than two mistakes, you will need to re-submit your job multiple times to identify and fix all of them.
When Slurm encounters a mistake in your job submission script, it does not read the rest of the script that comes after the mistake. If the mistake generates an error, you can fix it and resubmit your job; however, not all mistakes generate errors. If your script's required elements (account, partition, nodes, cores, and wall time) have been read successfully before Slurm encounters the mistake, your job will still be accepted by the scheduler and run, just not the way you expect it to. Scripts with mistakes that don't generate errors still need to be debugged, since the scheduler has ignored some of your #SBATCH lines. You can identify a script with such mistakes when the output from your job is unexpected or incorrect.
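For example, sbatch stops reading #SBATCH directives at the first non-comment line of the script, so a command placed among them silently disables every directive below it. A minimal sketch (the allocation, partition, job name, and module names are placeholders):
#!/bin/bash
#SBATCH --account=p12345
#SBATCH --partition=short
#SBATCH --time=01:00:00
module load python
#SBATCH --job-name=my_job
Here the module load command is the first non-comment line, so directive processing stops there: the job is accepted because account, partition, and time were already read, but the job name directive is silently ignored.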
To use this reference: search for the exact error message generated by your job. Some error messages appear to be similar but are generated by different mistakes.
Note that the errors listed in this document may also be generated by interactive job submissions using srun or salloc. In those cases, the error messages will begin with srun: error or salloc: error. The information about resolving these error messages is the same.
With certain combinations of GUI editors and character sets on your personal computer, copying and pasting into Quest job submission scripts may bring in hidden characters that interfere with the scheduler's ability to interpret the script. In these cases, #SBATCH lines will have no mistakes but still generate errors when submitted to the scheduler. To see all of the hidden characters in your job submission script, use the command cat -A <script_name>. To resolve this, you may need to type your submission script into a native Unix editor such as vi rather than copying and pasting.
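As an illustration, cat -A marks the end of each line with $, tabs with ^I, and non-printing bytes in ^ and M- notation. Assuming a non-breaking space was pasted in after #SBATCH, the output might look like (submit.sh is a placeholder name):
cat -A submit.sh
#!/bin/bash$
#SBATCHM-BM- --account=p12345$
The M-BM- sequence is a pasted non-breaking space; retyping the line with an ordinary space fixes it.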
sbatch: error: --account option required
sbatch: error: Unable to allocate resources: Invalid account or account/partition combination specified
Location of mistake:
#SBATCH --account=<allocation>
or
#SBATCH -A <allocation>
Example of correct account syntax:
#SBATCH --account=p12345
or
#SBATCH -A p12345
Possible mistake: your script doesn't have an #SBATCH line specifying the account.
Fix: confirm that #SBATCH --account=<allocation> is in your script.
Possible mistake: a typo in the "--account=" or "-A" part of this #SBATCH line.
Fix: examine this line closely to make sure the syntax is correct.
Possible mistake: you are not a member of the allocation specified in your job submission script.
Fix: confirm you are a member of the allocation by typing groups at the command line on Quest. If the allocation you have specified in your job submission script is not listed, you are not a member of that allocation. Use an allocation that you are a member of in your job submission script.
Possible mistake: the mistake is on an earlier line in your job submission script, which causes Slurm to stop reading the script before it reaches the #SBATCH --account=<allocation> line.
Fix: move the #SBATCH --account=<allocation> line to be immediately after the #!/bin/bash line and submit your job again. If this generates a new error referencing a different line of your script, the account line is correct and the mistake is elsewhere in your submission script. To resolve the new error, follow the debugging suggestions for that error message.
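Taken together, a minimal sketch of a working script header (the allocation p12345 and the resource values are placeholders for your own):
#!/bin/bash
# allocation you belong to (see groups) and its associated partition
#SBATCH --account=p12345
#SBATCH --partition=short
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=01:00:00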
sbatch: error: Your allocation has expired
sbatch: error: Unable to allocate resources: Invalid qos specification
Location of mistake:
#SBATCH --account=<allocation>
or
#SBATCH -A <allocation>
The allocation specified in your job submission script is no longer active.
If you are a member of more than one allocation, you may wish to submit your job to an alternate allocation. To see a list of your allocations, type groups at the command line on Quest.
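For example, groups prints all of your group memberships on a single line, and your allocations appear among them (the names below are hypothetical):
groups
users p12345 b1234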
To renew your allocation or request a new one, please see Managing an Allocation on Quest.
srun: error: --partition option required
srun: error: Unable to allocate resources: Access/permission denied
Location of mistake:
#SBATCH --partition=<partition/queue>
or
#SBATCH -p <partition/queue>
Example of correct syntax for general access allocations ("p" account):
#SBATCH --partition=short
or
#SBATCH -p short
Example of correct syntax for buy-in allocations ("b" account):
#SBATCH --partition=buyin
or
#SBATCH -p buyin
Note that Slurm refers to queues as partitions. When specifying partition in your job submission script, use the queue name that you submitted to under Moab.
Possible mistake: your script doesn't have an #SBATCH line specifying the partition.
Fix: confirm that #SBATCH --partition=<partition/queue> or #SBATCH -p <partition/queue> is in your script.
Possible mistake: a typo in the "--partition=" or "-p" part of this #SBATCH line.
Fix: examine this line closely to make sure the syntax is correct.
Possible mistake: the mistake is on an earlier line in your job submission script, which causes Slurm to stop reading the script before it reaches the #SBATCH --partition=<partition/queue> line.
Fix: move the #SBATCH --partition=<partition/queue> line to be immediately after the #!/bin/bash line and submit your job again. If this generates a new error referencing a different line of your script, the partition line is correct and the mistake is elsewhere in your submission script. To resolve the new error, follow the debugging suggestions for that error message.
sbatch: error: Unable to allocate resources: Invalid qos specification
Location of mistake:
#SBATCH --partition=<partition/queue>
or
#SBATCH -p <partition/queue>
The partition/queue name specified is not associated with the allocation in the line #SBATCH --account=<allocation>.
Possible mistake: Your script specifies a buy-in allocation, and you've specified "short", "normal" or "long" as your partition/queue.
Possible mistake: Your script specifies an allocation and partition combination which do not belong together.
Fix: Specify the correct partition/queue for your allocation. To see the allocations and partitions you have access to, use this version of the sinfo command:
sinfo -o "%g %.10R %.20l"
GROUPS PARTITION TIMELIMIT
b1234 buyin 168:00:00
Note that "GROUPS" are allocations/accounts on Quest.
In this example, valid lines in your job submission script that relate to account, partition and time would be:
#SBATCH --account=b1234
#SBATCH --partition=buyin
#SBATCH --time=168:00:00
sbatch: error: invalid partition specified: <partition_name>
sbatch: error: Unable to allocate resources: Invalid partition name specified
Location of mistake:
#SBATCH --partition=<partition/queue>
or
#SBATCH -p <partition/queue>
Example of correct syntax for general access allocations ("p" account):
#SBATCH --partition=short
or
#SBATCH -p short
Example of correct syntax for buy-in allocations ("b" account):
#SBATCH --partition=buyin
or
#SBATCH -p buyin
Possible mistake: a typo in the "--partition=" or "-p" part of this #SBATCH line.
Fix: examine this line closely to make sure the syntax is correct.
Possible mistake: your script specifies a general access allocation ("p" account) with a queue that isn't "short", "normal", or "long".
Fix: change your partition to "short", "normal", or "long".
sbatch: error: Unable to allocate resources: Invalid account or account/partition combination specified
sbatch: error: Unable to allocate resources: User's group not permitted to use this partition
This message can refer to mistakes on the SBATCH lines specifying account or partition.
Possible location of mistake specifying account:
#SBATCH --account=<allocation>
or
#SBATCH -A <allocation>
Possible location of mistake specifying partition
#SBATCH --partition=<partition/queue>
or
#SBATCH -p <partition/queue>
Possible mistake: the syntax in the #SBATCH line specifying the account is incorrect.
Fix: examine the account line closely to confirm the syntax is exactly correct. Example of correct account syntax:
#SBATCH --account=p12345
or
#SBATCH -A p12345
Possible mistake: you are trying to run in a partition/queue that belongs to one account, while specifying a different account.
Fix: Specify the correct partition/queue for your allocation. To see the allocations and partitions you have access to, use this version of the sinfo command:
sinfo -o "%g %.10R %.20l"
GROUPS PARTITION TIMELIMIT
b1234 buyin 168:00:00
Note that "GROUPS" are allocations/accounts on Quest.
In this example, valid lines in your job submission script that relate to account, partition and time would be:
#SBATCH --account=b1234
#SBATCH --partition=buyin
#SBATCH --time=168:00:00
Possible mistake: the mistake is on an earlier line in your job submission script, which causes Slurm to stop reading the script before it reaches the #SBATCH --account=<allocation> line.
Fix: move the #SBATCH --account=<allocation> line to be immediately after the #!/bin/bash line and submit your job again. If this generates a new error referencing a different line of your script, the account line is correct and the mistake is elsewhere in your submission script. To resolve the new error, follow the debugging suggestions for that error message.
sbatch: error: --time limit option required
sbatch: error: Unable to allocate resources: Requested time limit is invalid (missing or exceeds some limit)
Location of mistake:
#SBATCH --time=<hours:minutes:seconds>
or
#SBATCH -t <hours:minutes:seconds>
Example of correct syntax:
#SBATCH --time=10:00:00
or
#SBATCH -t 10:00:00
Possible mistake: your script doesn't have an #SBATCH line specifying time.
Fix: confirm that #SBATCH --time=<hh:mm:ss> is in your script.
Possible mistake: a typo in the "--time=" or "-t" part of this #SBATCH line.
Fix: examine this line closely to make sure the syntax is correct.
Possible mistake: the time request is too long for the partition (queue)
Fix: review the wall time limits of your partition and adjust the amount of time requested by your script. For general access users with allocations that begin with a "p", please use this reference:
Partition   Walltime limit
Short       4 hours
Normal      48 hours
Long        7 days / 168 hours
Buy-in accounts that begin with a "b" have their own wall time limits. For information on the wall time limit of your partition, use the sinfo command:
sinfo -o "%g %.10R %.20l"
GROUPS PARTITION TIMELIMIT
b1234 buyin 168:00:00
To fix this error, set your wall time to be less than the time limit of your partition and re-submit your job.
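For example, under the general access limits above, a 10-hour request exceeds the short partition's 4-hour cap but fits within normal's 48-hour limit, so one possible fix is (a sketch; the right partition depends on your allocation):
#SBATCH --partition=normal
#SBATCH --time=10:00:00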
Possible mistake: the mistake is on an earlier line in your job submission script, which causes Slurm to stop reading the script before it reaches the #SBATCH --time=<hh:mm:ss> line.
Fix: move the #SBATCH --time=<hh:mm:ss> line to be immediately after the #!/bin/bash line and submit your job again. If this generates a new error referencing a different line of your script, the time line is correct and the mistake is elsewhere in your submission script. To resolve the new error, follow the debugging suggestions for that error message.
sbatch: unrecognized option <option>
Example:
Line in script: #SBATCH --n-tasks-per-node=1
Error generated: sbatch: unrecognized option '--n-tasks-per-node=1'
With an "unrecognized option" error, Slurm correctly read the first part of the #SBATCH
line but the option that follows it has generated the error. In this example, the option has a dash between "n" and "tasks" that should not be there. The correct option does not have a dash in that location. This line should be corrected to:
#SBATCH --ntasks-per-node=1
To fix this error, locate the option specified in the error message and examine it carefully for errors. To see correct syntax for all #SBATCH
directives, see Converting Moab/Torque scripts to Slurm.
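One quick way to review every directive at once is to list them with grep (job.sh is a placeholder script name); comparing the output against the syntax reference makes a misspelled option easier to spot:
grep "^#SBATCH" job.sh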
sbatch: error: CPU count per node can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available
Location of mistake:
#SBATCH --ntasks-per-node=<CPU count>
Example of mistake:
#SBATCH --ntasks-per-node=10000
This error is generated if your job requests more CPUs/cores than are available on the nodes in the partition your job submission script specified. CPU count is the number of cores requested by your job submission script. Cores are also called processors or CPUs.
To fix this mistake, use the sinfo command to get the maximum number of cores available in the partitions you have access to:
sinfo -o "%g %.10R %.20l %.10c"
GROUPS PARTITION TIMELIMIT CPUS
b1234 buyin 2-00:00:00 20+
In this example, your job submission script can request up to 20 CPUs/cores per node like this:
#SBATCH --ntasks-per-node=20
sbatch: error: Batch script contains DOS line breaks (\r\n)
sbatch: error: instead of expected UNIX line breaks (\n).
Location of mistake:
Hidden characters in your job submission script
Mistake: your job submission script was created on a Windows machine and copied onto Quest without converting its line endings to the UNIX format.
Fix: from the command line on Quest, run the command dos2unix <submission_script> to convert your job submission script, then re-submit your job to the scheduler.
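You can confirm the fix with the cat -A command described above. Before conversion, each DOS line break appears as ^M before the end-of-line marker $:
#SBATCH --time=10:00:00^M$
After running dos2unix, the same line ends with $ alone:
#SBATCH --time=10:00:00$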
Once your job has been accepted, the Slurm scheduler will return a job ID number. After waiting in the queue, your job will run. To see the status of your job, use the command sacct -X.
For jobs with mistakes that do not generate error messages, you will need to investigate when you notice something wrong with how the job runs. If you notice one of the problems listed below, see the corresponding debugging suggestions.
Job runs in home directory instead of project directory
Job can't locate files or executables
Problem: job runs in home directory instead of project directory, job can't locate files or executables.
Possible cause: job script contains the Moab variable $PBS_O_WORKDIR.
In Moab job submission scripts, the $PBS_O_WORKDIR variable contains the name of the directory where you submit your job. Some Moab scripts begin with the line:
cd $PBS_O_WORKDIR
to start the job running in the job submission directory.
Slurm jobs start in the job submission directory by default. Slurm does not set the Moab variable $PBS_O_WORKDIR, which means $PBS_O_WORKDIR is an empty variable when your Slurm job runs. Your Slurm script will still run the line cd $PBS_O_WORKDIR; however, since $PBS_O_WORKDIR is empty, the command expands to
cd
with no argument. In Unix, executing cd by itself changes into your home directory, so Slurm jobs that begin with cd $PBS_O_WORKDIR start running in your home directory instead of in your job submission directory. $PBS_O_WORKDIR should never be used in a Slurm job submission script, as it can only cause mistakes.
Fix: you can solve this problem by changing this line to
cd $SLURM_SUBMIT_DIR
or by deleting the line
cd $PBS_O_WORKDIR
completely, as Slurm's default is to run in the job submission directory.
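A minimal sketch of the corrected script (the directives and the program name are placeholders):
#!/bin/bash
#SBATCH --account=p12345
#SBATCH --partition=short
#SBATCH --time=01:00:00
# optional: Slurm already starts jobs in the submission directory
cd $SLURM_SUBMIT_DIR
./my_program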
Job runs very slowly or dies after starting
Problem: job runs very slowly, or dies after starting
Possible cause: Slurm is not reading, or your script is missing, the directive #SBATCH --mem=<amount>.
All Slurm job scripts should specify the amount of memory your job needs to run. If your job runs very slowly or dies, investigate whether it requested enough memory with the Slurm utility seff. For more information, see Checking Processor and Memory Utilization for Jobs on Quest.
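A sketch of both halves of the fix: add an explicit memory request to the script (the 8G value is a placeholder amount):
#SBATCH --mem=8G
then, after the job completes, check actual usage with:
seff <job_id>
seff reports CPU and memory efficiency for the completed job, showing whether the request was adequate.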
Job name is name of job submission script instead of name in submission script
Problem: job name is name of job submission script instead of name in submission script
Possible cause: Slurm is not reading the SBATCH directive:
#SBATCH -J <Job_Name>
or
#SBATCH --job-name=<Job_Name>
To see the name of your job, run sacct -X. If the job name shown is the first eight characters of the name of your submission script, Slurm has not read the #SBATCH line for job name.
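An illustrative sketch (the job ID, account, and script name my_job_script.sh are hypothetical, and column widths vary): if Slurm ignored the job name directive, sacct -X shows the truncated script name instead:
sacct -X
JobID        JobName  Partition    Account  AllocCPUS      State ExitCode
1234567     my_job_s      short     p12345          1  COMPLETED      0:0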
Possible mistake: a typo in the "--job-name=" or "-J" part of this #SBATCH line.
Fix: examine this line closely to make sure the syntax is correct.
Possible mistake: the mistake is on an earlier line in your job submission script, which causes Slurm to stop reading the script before it reaches the #SBATCH --job-name=<job name> line.
Fix: move the #SBATCH --job-name=<job name> line to be immediately after the #!/bin/bash line and submit your job again. If this generates a new error referencing a different line of your script, the job name line is correct and the mistake is elsewhere in your submission script. To resolve the new error, follow the debugging suggestions for that error message.
Modules or environment variables are inherited from the login session by a running job
Problem: modules or environment variables are inherited from the login session by a running job
Possible cause: job script is not purging modules before beginning the compute node session.
Fix: after the #SBATCH directives in your job submission script, add the line
module purge all
This will clear any modules inherited from your login session and begin your job in a clean environment. Any modules your job needs must be loaded after this line.
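A sketch of the placement (the allocation, partition, and module name python/3.9 are placeholders):
#!/bin/bash
#SBATCH --account=p12345
#SBATCH --partition=short
#SBATCH --time=01:00:00
# clear modules inherited from the login session, then load only what the job needs
module purge all
module load python/3.9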
Job immediately fails and generates no output or error file
Problem: job can't write into output and/or error files so job immediately dies
Possible cause: job script specifies directory path for output and/or error files but does not provide a file name
Possible cause: job script specifies a directory that does not exist
Slurm is not getting a file name that it can write into in the SBATCH directive:
#SBATCH --output=/path/to/file/file_name
or
#SBATCH --error=/path/to/file/file_name
Possible mistake: a typo in the "--output=" or "--error=" part of this #SBATCH line.
Fix: examine this line closely to make sure the syntax is correct.
Possible mistake: providing a directory but not a file name for output and/or error files.
Fix: add a file name at the end of the specified path. For a file name in the format <job_name>.o<job_id>, use:
#SBATCH --output=/path/to/file/"%x.o%j"
Note that if a separate error file is not specified, errors and output will both be written into the output file. To generate a separate error file, include the line:
#SBATCH --error=/path/to/file/"%x.e%j"
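In these patterns, %x and %j are Slurm filename placeholders that expand to the job name and the job ID, so a job named my_job with ID 1234567 would write my_job.o1234567. The directory in the path must already exist; Slurm will not create it for you.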