# Job monitoring
By default, the standard output (e.g. anything your script prints) and the standard error (e.g. error messages from Slurm or from your tool or package) are written to the file `slurm-<jobid>.out`.
You can check the status of your job and follow its progress with the `sacct` or `squeue --me` commands (see also the Slurm documentation on job state codes).
Resource usage can be queried with `seff <jobid>` while the job runs (note that `seff` output can only be trusted after the job has finished).
If you would like to cancel a job after submission or during runtime, you can do so with `scancel <jobid>`.
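As a quick illustration of checking job states, the snippet below counts running and pending jobs. The state list here is hard-coded stand-in data for what `squeue --me --noheader -o "%T"` would print on a real cluster.

```shell
# Stand-in for: states=$(squeue --me --noheader -o "%T")
states="RUNNING
RUNNING
PENDING"

# Count jobs per state from the (sample) squeue output
running=$(printf '%s\n' "$states" | grep -c RUNNING)
pending=$(printf '%s\n' "$states" | grep -c PENDING)
echo "running=$running pending=$pending"   # -> running=2 pending=1
```

On a real cluster, replace the hard-coded `states` variable with the actual `squeue --me` call.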
## Resource monitoring
Important resource requests that should be monitored with `seff` are:

- Scaling of a job over several cores and nodes
  - Parallel jobs must always benefit from all requested resources
  - When you double the number of cores, the job should run at least 1.5x faster
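The 1.5x rule of thumb above can be checked with a quick calculation from two test runs. The timings below are made-up example values, not real measurements:

```shell
# Hypothetical wall-clock times: 100 s on 1 core, 60 s on 2 cores
t1=100
t2=60

# Speedup = time on fewer cores / time on more cores
speedup=$(awk -v a="$t1" -v b="$t2" 'BEGIN { printf "%.2f", a/b }')
echo "speedup=$speedup"   # -> speedup=1.67

# After doubling the cores, speedup should be at least 1.5
awk -v s="$speedup" 'BEGIN { exit !(s >= 1.5) }' && echo "scaling OK"
```

If the speedup falls below that threshold, requesting fewer cores usually gives better overall throughput.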
### Troubleshooting resource usage

Points to pay attention to:

- Low CPU efficiency:
  - Too many cores requested?
  - Cores waiting for other processes?
  - Cores waiting for data from disk?
  - Cores spread over too many nodes?
- Low memory efficiency:
  - Too much memory requested?
  - Lots of caveats here
- Low GPU efficiency:
  - Better to use CPUs? Disk I/O?
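For intuition, the CPU efficiency that `seff` reports is roughly the total CPU time divided by (allocated cores × elapsed wall time). The numbers below are made up for illustration, not real job data:

```shell
# Hypothetical job: 4 allocated cores, 3600 s wall time, 7200 CPU-seconds used
cores=4
elapsed=3600
cputime=7200

# Efficiency (%) = 100 * CPU time / (cores * wall time)
eff=$(awk -v c="$cores" -v e="$elapsed" -v t="$cputime" \
    'BEGIN { printf "%.1f", 100 * t / (c * e) }')
echo "CPU efficiency: ${eff}%"   # -> CPU efficiency: 50.0%
```

An efficiency this low would suggest the job could have run with half the cores.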
## sacct
More detailed queries can be tailored with `sacct`:

`sacct -j <jobid> -o jobid,partition,state,elapsed,start,end`

`sacct -S 2022-08-01` will show all jobs started after that date.

**Note!** Querying data from the Slurm accounting database with `sacct` can be a very heavy operation. Don't query long time intervals or run `sacct` in a loop/using `watch`, as this will degrade the performance of the system for all users.
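When post-processing `sacct` output, fields like `Elapsed` come back as `HH:MM:SS` (with an optional `D-` day prefix, not handled in this sketch). A minimal conversion to seconds, using a hard-coded sample value rather than a live `sacct` call:

```shell
# Sample Elapsed value as sacct would print it (HH:MM:SS)
elapsed="01:30:00"

# Split on ":" and sum hours, minutes, seconds
secs=$(awk -v t="$elapsed" \
    'BEGIN { n = split(t, a, ":"); print a[n] + a[n-1]*60 + a[n-2]*3600 }')
echo "$secs"   # -> 5400
```

Seconds are easier to feed into scaling or efficiency calculations than the raw timestamp format.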