# Performing a simple scaling test
This tutorial is done on Puhti, which requires that
- you have a user account at CSC
- your account belongs to a project that has access to the Puhti service.
## Overview
💬 Before running large jobs using a lot of computing resources (cores), it is important to verify that the calculation can actually utilize the requested resources efficiently.
💡 In this tutorial, you will perform a very simple scalability test, i.e. running a parallel program with a varying number of cores and observing how it speeds up.
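💬 A common way to quantify this speed-up is the parallel speedup $S$ and efficiency $E$ (these symbols are not used elsewhere in this tutorial, they are just standard shorthand):

$$S(N) = \frac{T(1)}{T(N)}, \qquad E(N) = \frac{S(N)}{N},$$

where $T(N)$ is the runtime on $N$ cores. Ideal scaling means $S(N) = N$ (equivalently $E(N) = 1$), i.e. doubling the number of cores halves the runtime.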
## Download a sample parallel program
- Create and enter a suitable scratch directory on Puhti (replace `<project>` with your CSC project, e.g. `project_2001234`):

  ```bash
  mkdir -p /scratch/<project>/$USER/scalability-test
  cd /scratch/<project>/$USER/scalability-test
  ```

- Download a toy program that performs a simple molecular dynamics simulation in parallel using OpenMP threading. Understanding the details of the code is not important for the completion of this tutorial.

  ```bash
  wget https://a3s.fi/CSC_training/md
  ```

- Edit the access permissions of the file to allow execution:

  ```bash
  chmod +x md
  ```
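💬 If you want to verify the download and the permission change before proceeding, a quick check could look like this (the exact listing will differ on your system):

```bash
# The x flags in the permission string confirm that md is executable
ls -l md
```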
## Create a parallel batch job script
💬 We will run the MD program multiple times using six different thread counts: 1, 2, 4, 8, 16 and 32.
- Copy the following script into a file `job.sh` using, e.g., `nano`:

  ```bash
  #!/bin/bash
  #SBATCH --partition=test
  #SBATCH --account=<project>    # replace <project> with your CSC project, e.g. project_2001234
  #SBATCH --nodes=1
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=<N>    # replace <N> with the appropriate number of threads
  #SBATCH --time=00:05:00

  export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

  srun md --particles=500 --steps=5000
  ```
- Replace `--cpus-per-task=<N>` in the script with `--cpus-per-task=1` in order to run the program using one thread per task.

- Submit the script with:

  ```bash
  sbatch job.sh
  ```
- After a few moments, an output file `slurm-<jobid>.out` will appear in the current directory. View its contents once the job has finished (this takes less than a minute):

  ```bash
  cat slurm-<jobid>.out   # replace <jobid> with the actual Slurm job ID
  ```
- Repeat the above steps for the thread counts 2, 4, 8, 16 and 32 by editing `--cpus-per-task` in the `job.sh` script and then resubmitting the job (or automate the runs with the loop sketched after this list). If you have limited time, you may also just download a set of pre-calculated results:

  ```bash
  wget https://a3s.fi/CSC_training/scaling-test.tar
  tar -xvf scaling-test.tar
  ```
- Check the elapsed time of each simulation once the jobs have completed:

  ```bash
  grep "Elapsed time" *.out | sort -n
  ```
💭 Did the computation become faster? If so, is the scaling ideal, i.e. does doubling the thread count also make it run twice as fast? If not, can you think of any reasons that might limit the scalability? How many threads does it make sense to run the program with?
☝🏻 To ensure efficient use of resources, a good rule of thumb is that when you double the number of cores, the job should become at least 1.5 times faster. If this is not the case, request fewer cores.
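💬 As an illustrative example with made-up numbers: if the 8-thread run takes 60 seconds, the 16-thread run should take at most 60 / 1.5 = 40 seconds for the doubling to pay off. If it instead takes, say, 50 seconds (only 1.2 times faster), stick with 8 threads.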
💡 Bonus! Increase the problem size by increasing `--particles=<value>`. Is the program now able to scale to a larger number of threads? Why does `--steps=<value>` not have the same effect?
## More information
💡 Docs CSC: Performance checklist