Using `sacct` and `seff` to understand resource usage of finished jobs

💬 In this tutorial we look at the seff and sacct commands. The tutorial should be done on Puhti.

💭 seff shows detailed data on used resources in an easy-to-read format, but can only show one job at a time.

💭 sacct is useful when you want to look at a listing of jobs, but by default it only shows minimal data.

Get details about batch jobs

Try sacct, which by default shows the jobs you have run on the current date (i.e. since last midnight):
```
sacct
```
Try specifying the start time of the listing using the -S option. Avoid querying long time intervals, since they put significant load on the system (the maximum queryable interval is three months).
```
sacct -S YYYY-MM-DD    # replace YYYY-MM-DD
```
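To keep the queried interval short, you can also give an end time with the -E option. The following is a minimal sketch, assuming GNU date is available (as on most Linux systems); the seven-day window is only an illustration:

```shell
# Start of the window: 7 days ago, formatted as YYYY-MM-DD (GNU date syntax).
start=$(date -d "7 days ago" +%F)
# End of the window: today.
end=$(date +%F)

echo "Querying jobs from $start to $end"

# Only call sacct where Slurm is actually available.
if command -v sacct >/dev/null 2>&1; then
    sacct -S "$start" -E "$end"
fi
```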
Look for a specific job – i.e. specify the job ID using the -j option (if you can’t think of one, you can use 27074259):
```
sacct -j <slurmjobid>    # replace <slurmjobid> with a valid job ID 
```

To print out all the available data for a job, try:

```
sacct -l -j <slurmjobid>    # replace <slurmjobid> with a valid job ID
```

Select only the interesting data using the -o option. For example, to see job name, job ID, used memory, job state and elapsed wall-clock time, try:
```
sacct -o jobname,jobid,maxrss,state,elapsed -j <slurmjobid>   # replace <slurmjobid> with a valid job ID
```
Check out the list of all available data fields with:
```
sacct -e
```
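
If you want to process the listing further, the -P option (pipe-separated output) and -n (no header line) make it easier to parse. The sketch below works on a made-up sample instead of real sacct output, and assumes that the elapsed time is shorter than one day (format HH:MM:SS):

```shell
# Made-up sample in the format printed by:
#   sacct -n -P -o jobname,jobid,maxrss,state,elapsed -j <slurmjobid>
sample='array_job|123456_1||COMPLETED|00:01:05
batch|123456_1.batch|204800K|COMPLETED|00:01:05'

# Split the fifth field (elapsed) at the colons and convert it to seconds.
seconds=$(printf '%s\n' "$sample" | awk -F'|' '{
    split($5, t, ":")
    print $2, t[1] * 3600 + t[2] * 60 + t[3], "s"
}')

echo "$seconds"
```

On a system with Slurm, the `printf` part would be replaced by the sacct command shown in the comment.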

‼️ Note, running sacct is heavy on the batch queue system.

You should not, for example, write scripts that run it repeatedly.
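
One way to avoid repeated queries is to run sacct once, save the listing to a file, and answer all further questions from that file. A minimal sketch; the stand-in lines used when sacct is not available are made up:

```shell
# Temporary file that holds the saved listing.
cache=$(mktemp)

if command -v sacct >/dev/null 2>&1; then
    # A single query to the accounting database.
    sacct -n -P -X -o jobid,jobname,state > "$cache"
else
    # Made-up stand-in lines so that the sketch also runs without Slurm.
    printf '%s\n' '123456_1|array_job|COMPLETED' '123456_2|array_job|FAILED' > "$cache"
fi

# Everything below reads the file and does not call sacct again.
total=$(wc -l < "$cache")
completed=$(grep -c '|COMPLETED$' "$cache")
echo "$completed of $total jobs completed"

rm -f "$cache"
```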

Running a test job

💬 Run a simple array job to practice using seff and sacct.

☝🏻 If you have limited time, you can skip to Examining the finished job and use the job ID 27099109 (it is the same job).

Create a file named array.sh and paste the following contents in it.

```
#!/bin/bash
#SBATCH --account=<project>      # Choose the billing project. Has to be defined!
#SBATCH --time=00:01:00          # Maximum duration of the job. Max: depends on the partition.
#SBATCH --partition=small        # Job queues: test, interactive, small, large, longrun, hugemem, hugemem_longrun
#SBATCH --job-name=array_job     # Name of the job visible in the queue.
#SBATCH --output=out_%A_%a.txt   # Name of the output-file.
#SBATCH --error=err_%A_%a.txt    # Name of the error-file.
#SBATCH --ntasks=1               # Number of tasks. Max: depends on partition.
#SBATCH --cpus-per-task=1        # How many processors work on one task. Max: Number of CPUs per node.
#SBATCH --mem=1000               # How much RAM is reserved for job per node. Unit: MiB
#SBATCH --array=1-6              # The indices of the array jobs.

/appl/soft/bio/course/sacct_exercise/test-a ${SLURM_ARRAY_TASK_ID}
```

Replace <project> with your actual project name, e.g. project_2001234.
Submit the job with the command:
```
sbatch array.sh
```
You will see a message like:
```
Submitted batch job 123456
```
Make note of the Slurm job ID.
Follow the progress of the job with the command:
```
squeue -u $USER
```

💭 How is an array job listed in the queue?
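
If you want to reuse the job ID in later commands, you can store it in a shell variable instead of copying it by hand. A small sketch using the example message from above; the commented-out alternative relies on the --parsable option of sbatch, which prints only the job ID (followed by a cluster name on multi-cluster setups):

```shell
# The message sbatch prints on success (123456 is the example ID from above).
message="Submitted batch job 123456"

# Strip everything up to the last space, leaving only the ID.
jobid=${message##* }
echo "$jobid"

# Alternative on a system with Slurm: let sbatch print just the ID.
# jobid=$(sbatch --parsable array.sh)
```

The variable can then be used in place of the placeholder, e.g. `sacct -j "$jobid"`.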

Examining the finished job

When the job has finished (you can no longer see any of the sub jobs with squeue), you can use sacct to study it:
```
sacct -j <slurmjobid>    # replace <slurmjobid> with the actual job ID
```
Get a cleaner view by omitting the job steps:
```
sacct -X -j <slurmjobid>    # replace <slurmjobid> with the actual job ID
```
💬 sacct is especially handy here, because it is easy to spot the failed sub jobs.
- Which sub jobs failed?
  - Can you figure out why they failed?
  - How do they compare to jobs that finished?
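
With many sub jobs, filtering the listing can be quicker than reading it. The sketch below runs on made-up sample lines (they do not show the outcome of this exercise) in the format that `sacct -X -n -P -o jobid,state,exitcode` would print:

```shell
# Made-up sample in the format printed by:
#   sacct -X -n -P -o jobid,state,exitcode -j <slurmjobid>
sample='123456_1|COMPLETED|0:0
123456_2|FAILED|1:0
123456_3|COMPLETED|0:0'

# Keep only the sub jobs whose state is something other than COMPLETED.
not_completed=$(printf '%s\n' "$sample" | awk -F'|' '$2 != "COMPLETED" { print $1, $2 }')

echo "$not_completed"
```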

Use seff to look at individual sub jobs, e.g.:

```
seff <slurmjobid>_5    # replace <slurmjobid> with the actual job ID
```

Try sacct with the -o option (discussed above). This time add the fields reqmem (requested memory) and timelimit (requested time):

```
sacct -o jobname,jobid,reqmem,maxrss,timelimit,elapsed,state -j <slurmjobid>    # replace <slurmjobid> with the actual job ID
```

💭 Note that in this case we cannot use the -X option, as we want to see the memory usage for each step.
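
To compare requested and used memory at a glance, the two fields can be converted to the same unit. This is a rough sketch on made-up sample lines; it assumes that reqmem is printed like 1000M on the job line and maxrss like 512000K on the step line, which may differ between Slurm versions:

```shell
# Made-up sample in the format printed by:
#   sacct -n -P -o jobid,reqmem,maxrss -j <slurmjobid>
sample='123456_1|1000M|
123456_1.batch||512000K'

report=$(printf '%s\n' "$sample" | awk -F'|' '
# Convert a value such as 512000K or 1000M to MiB.
function to_mib(s,    n, u) {
    n = s + 0                      # numeric part of the value
    u = substr(s, length(s), 1)    # last character is the unit
    if (u == "K") return n / 1024
    if (u == "G") return n * 1024
    return n                       # treat M (or no suffix) as MiB
}
$2 != "" { req = to_mib($2) }      # remember the most recent request
$3 != "" && req > 0 {
    used = to_mib($3)
    printf "%s used %.0f of %.0f MiB (%.0f%%)\n", $1, used, req, 100 * used / req
}')

echo "$report"
```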

Adjusting the job-file

Look at the error messages produced by the failed jobs.
When you know which sub jobs failed and why, adjust the resource requests as necessary.

☝🏻 If you have limited time, you can skip to the last step (examining the jobs with seff and sacct) and use the job ID 27099394 (it is the same job with adjusted resource requests).
- Change time and memory reservations:
```
#SBATCH --time=00:05:00
#SBATCH --mem=2000
```

Re-run the failed sub jobs:

```
#SBATCH --array=3,5    # Specify which ones to run
```

Use seff and sacct to look at the jobs. How much memory and time did they use?

More information

💡 You can read more about array jobs, seff and sacct in Docs CSC.
