Quest Troubleshooting: Checking the job output file

This page demonstrates an example of looking for problems in a job's output file when the job fails.

When a job fails, the first place to look is the output/error file for your job. Unless you explicitly directed it elsewhere, it will be in the directory from which you submitted your job. Even if you directed output from your script/program to another location, there is still an output file with information about the job itself. By default, the output file is named slurm-<jobID>.out and it contains both the standard output and error. You can use the cat command to print the contents to the terminal, or open the file in your preferred text editor.
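
If you need to send output somewhere else, the standard Slurm directives for redirecting the output and error streams look like the sketch below (the file names are illustrative). If your submission script contains lines like these, look for the files they name rather than slurm-<jobID>.out:

#SBATCH --output=myjob.out    # standard output is written here instead of slurm-<jobID>.out
#SBATCH --error=myjob.err     # standard error is written here; omit to combine it with the output file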

If the job exit value, ExitCode (listed in the checkjob <JobID> report), is anything other than 0:0, the scheduler thinks something went wrong with the job.
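
If checkjob is not available or the job has long since left the queue, Slurm's own accounting command, sacct, can report the same field. A generic sacct invocation (with the job ID as a placeholder) looks like this:

sacct -j <jobID> --format=JobID,JobName,State,ExitCode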

Here is an example of the error/output file for a job whose submission script referenced a command that wasn't found. The relevant notice is: line 10: lmpx: command not found. The line number refers to the line in the job submission script. A "command not found" error can happen if you have not loaded the module that adds the necessary executable to your path. It is also possible that there is a typo in the name of the executable.

[akh9585@quser21 fail_example]$ cat slurm-549001.out
/var/spool/slurmd/job549001/slurm_script: line 10: lmpx: command not found
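
One way to guard against this failure is to confirm, before submitting, that the executable is actually on your path after loading its module. The module and executable names below are illustrative placeholders, not the actual setup on Quest:

module load lammps    # hypothetical module name; list what is installed with: module avail
which lmp             # prints the executable's full path if it is on your path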

For this case, the checkjob 549001 report includes the following lines:

JobState=FAILED Reason=NonZeroExitCode Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=127:0

Slurm reports the job as FAILED in JobState, and the ExitCode is given as 127:0. The scheduler obtains the exit code from the bash return code; bash returns 127 when a command is not found.
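
You can reproduce this exit code at any bash prompt; here mistyped_command stands in for any name that bash cannot resolve:

$ mistyped_command
bash: mistyped_command: command not found
$ echo $?
127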
