Using sacct and seff to understand resource usage of finished jobs
💬 In this tutorial we look at the seff and sacct commands. The tutorial should be done on Puhti.
💭 seff shows detailed data on used resources in an easy-to-read format, but can only show one job at a time.
💭 sacct is useful when you want to look at a listing of jobs, but by default it only shows minimal data.
Get details about batch jobs
- Try sacct, which by default shows the jobs you have run on the current date (i.e. since last midnight):
sacct
- Try specifying the start time of the listing using the -S option. Don't query overly long time intervals, since this puts significant load on the system (the maximum queryable interval is three months).
sacct -S YYYY-MM-DD # replace YYYY-MM-DD
- Look for a specific job, i.e. specify the job ID using the -j option (if you can't think of one, you can use 23169759):
sacct -j <slurmjobid> # replace <slurmjobid> with a valid job ID
- To print out all the available data for a job, try:
sacct -l -j <slurmjobid> # replace <slurmjobid> with a valid job ID
- Select only the interesting data using the -o option. For example, to see job name, job ID, used memory, job state and elapsed wall-clock time, try:
sacct -o jobname,jobid,maxrss,state,elapsed -j <slurmjobid> # replace <slurmjobid> with a valid job ID
- Check out the list of all available data fields with:
sacct -e
‼️ Note that running sacct is heavy on the batch queue system.
- You should not, for example, write scripts that run it repeatedly.
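One way to follow this advice is to query sacct once, save the output, and do any further analysis on the saved file. The sketch below is only an illustration: the file name sacct_today.txt and the function names are made up for this example, and it uses the -X (no job steps) and -P (|-separated output) options of sacct.

```shell
# Sketch: query sacct a single time and analyze the saved result afterwards.
# File and function names are hypothetical.

cache_file="sacct_today.txt"

# Run sacct once and store |-separated output (with a header line) in a file.
fetch_once() {
    if command -v sacct >/dev/null 2>&1; then
        sacct -X -P -o jobid,jobname,state,elapsed > "$cache_file"
    else
        echo "sacct not found; run this on a system with Slurm (e.g. Puhti)" >&2
        return 1
    fi
}

# Count the jobs per state in a saved listing (the header line is skipped).
count_states() {
    awk -F'|' 'NR > 1 { n[$3]++ } END { for (s in n) print s, n[s] }' "$1" | sort
}
```

For example, `fetch_once && count_states "$cache_file"` prints one line per job state with the number of jobs in that state, and count_states can be re-run on the file as often as needed without querying the batch queue system again.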
Running a test job
💬 Run a simple array job to practice using seff and sacct.
☝🏻 If you have limited time, you can skip to "Examining the finished job" and use the job ID 23694920 (it is the same job).
- Create a file named array.sh and paste the following contents in it:
#!/bin/bash
#SBATCH --account=<project> # Choose the billing project. Has to be defined!
#SBATCH --time=00:01:00 # Maximum duration of the job. Max: depends on the partition.
#SBATCH --partition=small # Job queues: test, interactive, small, large, longrun, hugemem, hugemem_longrun
#SBATCH --job-name=array_job # Name of the job visible in the queue.
#SBATCH --output=out_%A_%a.txt # Name of the output-file.
#SBATCH --error=err_%A_%a.txt # Name of the error-file.
#SBATCH --ntasks=1 # Number of tasks. Max: depends on partition.
#SBATCH --cpus-per-task=1 # How many processors work on one task. Max: Number of CPUs per node.
#SBATCH --mem=1000 # How much RAM is reserved for the job per node. Unit: MiB
#SBATCH --array=1-6 # The indices of the array jobs.
/appl/soft/bio/course/sacct_exercise/test-a ${SLURM_ARRAY_TASK_ID}
- Replace <project> with your actual project name, e.g. project_2001234.
- Submit the job with the command:
sbatch array.sh
- You will see a message like:
Submitted batch job 123456
- Make note of the Slurm job ID.
- Follow the progress of the job with the command:
squeue -u $USER
💭 How is an array job listed in the queue?
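If you prefer not to copy the job ID by hand, it can be picked out of the sbatch message with a small helper. The sketch below also shows which file names the out_%A_%a.txt and err_%A_%a.txt patterns in array.sh produce, since %A stands for the array job ID and %a for the array index. Both function names are hypothetical.

```shell
# Extract the job ID from a message like "Submitted batch job 123456".
extract_job_id() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Print the output and error file names that the patterns out_%A_%a.txt and
# err_%A_%a.txt give for array indices FIRST..LAST of the job JOBID.
# Usage: array_file_names JOBID FIRST LAST
array_file_names() {
    job_id=$1
    i=$2
    while [ "$i" -le "$3" ]; do
        echo "out_${job_id}_${i}.txt err_${job_id}_${i}.txt"
        i=$((i + 1))
    done
}
```

A possible use is `job_id=$(sbatch array.sh | extract_job_id)` followed by `array_file_names "$job_id" 1 6`, which lists the twelve files to expect in the submission directory once all sub jobs have run.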
Examining the finished job
- When the job has finished (you can no longer see any of the sub jobs with squeue), you can use sacct to study it:
sacct -j <slurmjobid> # replace <slurmjobid> with the actual job ID
- Get a cleaner view by omitting the job steps:
sacct -X -j <slurmjobid> # replace <slurmjobid> with the actual job ID
💬 sacct is especially handy here, because it is easy to spot the failed sub jobs.
- Which sub jobs failed?
- Can you figure out why they failed?
- How do they compare to jobs that finished?
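With only six sub jobs the listing is easy to read by eye, but for larger arrays a filter can help. The following is a minimal sketch, assuming the |-separated output that sacct gives with the -P option and no header line (-n); the function name non_completed is made up for this example.

```shell
# Read "JobID|State" lines on stdin, as produced by
#   sacct -X -P -n -o jobid,state -j <slurmjobid>
# and print only the sub jobs whose state is not COMPLETED.
non_completed() {
    awk -F'|' '$2 != "COMPLETED" { print $1, $2 }'
}
```

It would be used as `sacct -X -P -n -o jobid,state -j <slurmjobid> | non_completed`, replacing <slurmjobid> with the actual job ID. The state printed next to each sub job is a first hint about why it did not finish.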
- Use seff to look at individual sub jobs, e.g.:
seff <slurmjobid>_5 # replace <slurmjobid> with the actual job ID
- Try sacct with the -o option (discussed above). This time add the fields reqmem (requested memory) and timelimit (requested time):
sacct -o jobname,jobid,reqmem,maxrss,timelimit,elapsed,state -j <slurmjobid> # replace <slurmjobid> with the actual job ID
💭 Note that in this case we cannot use the -X option, as we want to see the memory usage for each step.
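To compare maxrss with reqmem, and elapsed with timelimit, it helps to convert the values to plain numbers first. The helpers below are a rough sketch under stated assumptions: times are written as [DD-]HH:MM:SS or MM:SS, memory values carry a K, M, G or T suffix (a value without suffix is treated as bytes), and some Slurm versions append n or c to reqmem. All function names are hypothetical.

```shell
# Convert a time such as 00:01:00 or 1-02:03:04 to seconds.
to_seconds() {
    echo "$1" | awk -F'[-:]' '{
        n = NF
        s = $n + 60 * $(n - 1)
        if (n >= 3) s += 3600 * $(n - 2)
        if (n >= 4) s += 86400 * $(n - 3)
        print s
    }'
}

# Convert a memory value such as 512K, 1000M or 2G to MiB.
to_mib() {
    echo "$1" | awk '{
        v = $1
        sub(/[nc]$/, "", v)            # drop a trailing n/c (assumed older reqmem format)
        unit = substr(v, length(v))    # last character is the unit, if any
        num = v + 0                    # awk keeps the leading number
        if (unit == "K") num /= 1024
        else if (unit == "G") num *= 1024
        else if (unit == "T") num *= 1024 * 1024
        else if (unit != "M") num /= 1024 * 1024   # no suffix: assume bytes
        printf "%.1f\n", num
    }'
}

# Print USED as a percentage of TOTAL.
percent_of() {
    awk -v used="$1" -v total="$2" 'BEGIN { printf "%.0f%%\n", 100 * used / total }'
}
```

With made-up values, `percent_of "$(to_seconds 00:00:45)" "$(to_seconds 00:01:00)"` prints 75%, i.e. the share of the time limit that was used. A sub job whose elapsed time or memory use reaches the requested amount is a likely candidate for larger reservations.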
Adjusting the job-file
- Look at the error messages produced by the failed jobs.
- When you know which sub jobs failed and why, adjust the resource requests as necessary.
☝🏻 If you have limited time, you can skip to step 4 and use the job ID 23695009 (it is the same job with adjusted resource requests).
- Change time and memory reservations:
#SBATCH --time=00:05:00
#SBATCH --mem=2000
- Re-run the failed sub jobs:
#SBATCH --array=3,5 # Specify which ones to run
- Use seff and sacct to look at the jobs. How much memory and time did they use?
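The index list for --array can also be built from the IDs of the failed sub jobs instead of being typed by hand. A small sketch, assuming sub job IDs of the form <slurmjobid>_<index> as shown by sacct; the function name array_spec is made up for this example.

```shell
# Read sub job IDs such as 123456_5 on stdin (one per line) and print
# the array indices as a sorted, comma-separated list, e.g. 3,5.
array_spec() {
    sed 's/.*_//' | sort -n | paste -sd, -
}
```

For instance, `printf '123456_5\n123456_3\n' | array_spec` prints 3,5. The result can be passed on the command line, e.g. `sbatch --array=3,5 array.sh`; options given to sbatch on the command line should take precedence over the matching #SBATCH lines, so the script itself would not need to be edited for the array indices.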
More information
💡 You can read more about array jobs, seff and sacct in Docs CSC.