Quest Troubleshooting: Checking the job output file

This page walks through an example of finding problems in a job's output file when a job fails.

When a job fails, the first place to look is the output/error file for your job. Unless you explicitly directed it elsewhere, it will be in the directory from which you submitted your job. Even if you directed output from your script/program to another location, there is still an output file with information about the job itself. By default, the output file is named slurm-<jobID>.out and it contains both standard output and standard error. You can use the cat command to print the contents to the terminal, or open the file in your preferred text editor.
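If you have submitted several jobs from the same directory, it can help to pick out the most recently written output file first. The following is a minimal sketch, assuming the default slurm-<jobID>.out naming described above; the helper name latest_slurm_out is hypothetical and not a Quest or Slurm command.

```shell
# Print the path of the newest slurm-<jobID>.out file in a directory.
# Defaults to the current directory when no argument is given.
# Assumes file names contain no newlines (true for default Slurm names).
latest_slurm_out() {
    dir="${1:-.}"
    # ls -t sorts by modification time, newest first; keep only the first entry.
    ls -t "$dir"/slurm-*.out 2>/dev/null | head -n 1
}

# Example usage: show the last 20 lines of the newest output file.
# tail -n 20 "$(latest_slurm_out)"
```

If you redirected output with your own file name, this helper will not find it, and you would need to adjust the pattern.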

If the job's exit value, ExitCode (listed in the checkjob <JobID> report), is anything other than 0:0, the scheduler has recorded that something went wrong with the job.
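To check this value without reading the whole report, you could filter the report text for the ExitCode field. This is a sketch that assumes the report contains a field of the form ExitCode=<number>:<number>, as in the report lines shown for job 549001 on this page; job_exit_code is a hypothetical helper name.

```shell
# Read a job report on standard input and print the ExitCode value, e.g. "127:0".
job_exit_code() {
    # -o prints only the matching text; take the first match and drop the "ExitCode=" prefix.
    grep -o 'ExitCode=[0-9]*:[0-9]*' | head -n 1 | cut -d= -f2
}

# Example usage on Quest (requires the checkjob command):
# code=$(checkjob 549001 | job_exit_code)
# [ "$code" = "0:0" ] || echo "Job reported a problem: ExitCode=$code"
```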

Here is an example of the error/output file for a job where the job submission script referenced a command that wasn't found. The relevant message is: line 10: lmpx: command not found. The line number refers to a line in the job submission script. A "command not found" error can occur if you have not loaded the module that adds the executable to your PATH. It is also possible that there is a typo in the name of the executable.

[akh9585@quser21 fail_example]$ cat slurm-549001.out
/var/spool/slurmd/job549001/slurm_script: line 10: lmpx: command not found
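One way to catch this kind of problem with a clearer message is to check that the executable is on your PATH before calling it. The sketch below uses the shell's built-in command -v for that check; require_cmd is a hypothetical helper name, and the choice to return 127 simply mirrors what bash itself returns for a missing command.

```shell
# Return 0 if the named command can be found on PATH, otherwise print a hint and return 127.
require_cmd() {
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "error: '$1' not found on PATH; load its module or check the spelling" >&2
        return 127
    fi
}

# Example usage near the top of a job submission script, after any module load lines:
# require_cmd lmpx || exit 127
```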

For this job, the checkjob 549001 report includes the following lines:

JobState=FAILED Reason=NonZeroExitCode Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=127:0

Slurm reports the job as FAILED in JobState and gives the ExitCode as 127:0. The scheduler obtains the exit code from the bash return code. Bash returns 127 when a command cannot be found.
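You can reproduce the 127 return code outside of a job to see where it comes from. This is a small illustration rather than anything specific to Quest; the command name no_such_command_xyz123 is made up and assumed not to exist on your system.

```shell
# Run a nonexistent command in a child bash process and capture its exit status.
status=0
bash -c 'no_such_command_xyz123' 2>/dev/null || status=$?
echo "exit status: $status"   # prints "exit status: 127"
```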

Article ID: 1671
Thu 5/12/22 12:39 PM
Wed 5/1/24 9:13 AM