Job monitoring

By default, standard output (e.g. anything your script prints) and standard error (e.g. error messages from Slurm, your tool or package) are both written to the file slurm-jobid.out, where jobid is the numeric ID that Slurm assigned to the job.
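As a small illustration of the default naming, the sketch below builds the file name for a given job ID and prints the last lines of the file if it is present. The job ID 123456 is a made-up placeholder, not a real job.

```shell
# Hypothetical job ID for illustration; use the ID that sbatch printed for your job.
jobid=123456

# Default name of the combined stdout/stderr file for that job.
outfile="slurm-${jobid}.out"
echo "Output file: ${outfile}"

# Show the most recent lines if the file exists in the current directory.
if [ -f "$outfile" ]; then
    tail -n 20 "$outfile"
else
    echo "${outfile} not found (yet)"
fi
```

To keep following the file while the job writes to it, tail -f "$outfile" can be used instead (stop with Ctrl-C).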

You can check the status of your job and follow its progress with the sacct or squeue --me command (see also the Slurm documentation on job state codes). Resource usage can be queried with seff jobid. This also works while the job is running, but note that seff output can only be trusted after the job has finished.
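The short state codes shown by squeue can be hard to read at first. The sketch below maps a few common codes to their long names. The helper describe_state is hypothetical (it is not part of Slurm), the list is not exhaustive, and the Slurm documentation remains the authoritative reference.

```shell
# Hypothetical helper: translate a compact Slurm job state code
# into its long name plus a short explanation.
describe_state() {
    case "$1" in
        PD)  echo "PENDING: waiting in the queue for resources" ;;
        R)   echo "RUNNING: currently executing" ;;
        CG)  echo "COMPLETING: finishing up" ;;
        CD)  echo "COMPLETED: finished with exit code zero" ;;
        F)   echo "FAILED: ended with a non-zero exit code or other failure" ;;
        CA)  echo "CANCELLED: cancelled by the user or an administrator" ;;
        TO)  echo "TIMEOUT: reached its time limit" ;;
        OOM) echo "OUT_OF_MEMORY: exceeded its memory allocation" ;;
        *)   echo "UNKNOWN: see the Slurm documentation" ;;
    esac
}

# List your own jobs as "<jobid> <state code>" and explain each state.
# The Slurm call only runs where squeue is installed.
if command -v squeue >/dev/null 2>&1; then
    squeue --me --noheader --format="%i %t" | while read -r id code; do
        echo "$id: $(describe_state "$code")"
    done
else
    echo "squeue not found; run this on a system with Slurm"
fi
```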

If you would like to cancel a job after job submission or during runtime, you can do so with scancel jobid.

Resource monitoring

Important resource requests that should be monitored with seff are:

Troubleshooting resource usage

Points to pay attention to:

  • Low CPU Efficiency:

    • Too many cores requested?

    • Cores waiting for other processes?

    • Cores waiting for data from disk?

    • Cores spread over too many nodes?

  • Low Memory Efficiency:

    • Too much memory requested?

    • Lots of caveats here

  • Low GPU efficiency:

    • Would the job run better on CPUs? Is disk I/O the bottleneck?
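To make the efficiency figures concrete, here is a rough sketch of the arithmetic behind them. It assumes that CPU efficiency is the CPU time used divided by (allocated cores × elapsed wall-clock time) and that memory efficiency is peak memory used divided by memory requested, which is how seff typically reports them. The helper efficiency and all numbers are made up for illustration.

```shell
# Hypothetical helper: percentage of an allocated resource that was used.
# $1 = amount used, $2 = amount allocated (same units for both)
efficiency() {
    awk -v used="$1" -v alloc="$2" 'BEGIN {
        if (alloc <= 0) { print "n/a"; exit }
        printf "%.1f\n", 100 * used / alloc
    }'
}

# Made-up job: 4 cores allocated for 3600 s, 3600 s of CPU time used in total.
cores=4
elapsed=3600
cpu_used=3600
cpu_eff=$(efficiency "$cpu_used" "$((cores * elapsed))")
echo "CPU efficiency: ${cpu_eff}%"

# Same made-up job: 8192 MB requested, peak usage 2048 MB.
mem_eff=$(efficiency 2048 8192)
echo "Memory efficiency: ${mem_eff}%"
```

In this example a CPU efficiency of 25% on 4 cores means the work kept roughly one core busy, which points at the "too many cores requested?" question above.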

sacct

More detailed queries can be tailored with sacct:

  • sacct -j jobid -o jobid,partition,state,elapsed,start,end

  • sacct -S 2022-08-01 will show all jobs started after that date.

Note! Querying data from the Slurm accounting database with sacct can be a very heavy operation. Don’t query long time intervals or run sacct in a loop/using watch as this will degrade the performance of the system for all users.
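One way to keep a sacct query light is to give it an end time as well as a start time and to request only the fields you need. The sketch below is an illustration under assumptions: the one-week window and the field list are arbitrary choices, -E sets the end of the interval, and -X restricts the output to one line per job allocation rather than one per job step.

```shell
# Arbitrary example window of one week; adjust to the period you care about.
start="2022-08-01"
end="2022-08-08"

# Only the columns needed for a quick overview.
fields="jobid,jobname,partition,state,elapsed"

# Print the query so it can be inspected before it is run.
query="sacct -X -S ${start} -E ${end} -o ${fields}"
echo "$query"

# Run it once (not in a loop), and only where sacct is installed.
if command -v sacct >/dev/null 2>&1; then
    sacct -X -S "$start" -E "$end" -o "$fields"
fi
```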