Performing a simple scaling test

💬 Before running large jobs that use many computing resources (cores), it is important to verify that the calculation can actually use the requested resources efficiently.

💡 In this tutorial, you will perform a very simple scalability test, i.e. running a parallel program with a varying number of cores and observing how it speeds up.

Download a sample parallel program

  1. Create and enter a suitable scratch directory on Puhti (replace <project> with your CSC project, e.g. project_2001234):

    mkdir -p /scratch/<project>/$USER/scalability-test
    cd /scratch/<project>/$USER/scalability-test
  2. Download a toy program that performs a simple molecular dynamics simulation in parallel using OpenMP threading. Understanding the details of the code is not important for the completion of this tutorial.

  3. Edit the access permissions of the file to allow execution:

    chmod +x md

Create a parallel batch job script

💬 We will run the MD program six times, each with a different thread count: 1, 2, 4, 8, 16 and 32.

  1. Copy the following script into a file using, e.g., nano:

    #!/bin/bash
    #SBATCH --partition=test
    #SBATCH --account=<project>   # Replace <project> with your CSC project, e.g. project_2001234
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=<N>   # Replace <N> with the number of threads to use
    #SBATCH --time=00:05:00

    # Run one OpenMP thread per reserved core
    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
    srun ./md --particles=500 --steps=5000
  2. Replace --cpus-per-task=<N> in the script with --cpus-per-task=1 in order to run the program using one thread per task.
  3. Submit the script with sbatch, giving the name of the file you saved in step 1 as its argument.

  4. After a few moments, an output file slurm-<jobid>.out will appear in the current directory. View its contents once the job has finished (takes less than a minute):

    cat slurm-<jobid>.out   # Replace <jobid> with the actual Slurm job id
  5. Repeat steps 2 to 4 for the thread counts 2, 4, 8, 16 and 32 by editing --cpus-per-task in the script and then resubmitting the job. If you are short on time, you may instead download a set of pre-calculated results and extract the archive:

    tar -xvf scaling-test.tar
  6. Check the elapsed time of each simulation once they have completed:

    grep "Elapsed time" *.out | sort -n
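
Editing and resubmitting the script by hand for each thread count can also be scripted. The following is a minimal sketch, not part of the original exercise. It assumes the batch script is saved as scaling.sh (a hypothetical name) and already contains a valid number for --cpus-per-task. It relies on the fact that options given on the sbatch command line take precedence over the matching #SBATCH lines in the script. The output file names are also just an illustration.

```shell
#!/bin/bash
# Sketch: submit the same batch script once per thread count.
# Assumption: the batch script is called scaling.sh (hypothetical name).

submit_all() {
    local script="$1"; shift
    local n
    for n in "$@"; do
        # Command-line options override the #SBATCH lines in the script.
        # Set SUBMIT="echo sbatch" to print the commands instead of running them.
        ${SUBMIT:-sbatch} --cpus-per-task="$n" --output="md-${n}-threads.out" "$script"
    done
}

# Dry run: only print the six sbatch commands
SUBMIT="echo sbatch" submit_all scaling.sh 1 2 4 8 16 32
```

Calling submit_all without setting SUBMIT would submit the jobs for real. Partitions may limit how many jobs you can have queued at the same time, so you may need to submit in smaller batches. Since the output files still end in .out, the grep command in step 6 should find them as well.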

💭 Did the computation become faster? If so, is the scaling ideal, i.e. does doubling the thread count also make it run twice as fast? If not, can you think of any reasons that might limit the scalability? How many threads does it make sense to run the program with?

☝🏻 To ensure efficient use of resources, a good rule of thumb is that when you double the number of cores, the job should run at least 1.5 times faster. If it does not, request fewer cores.
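
To make the rule of thumb concrete, here is a small sketch that turns a list of "threads seconds" pairs into speedup, parallel efficiency and the gain from each doubling. The function name scaling_table and the timings are made up for illustration; they are not results from the md program. The sketch assumes the first row is the 1-thread run and that each following row doubles the thread count.

```shell
#!/bin/bash
# Sketch: evaluate the 1.5x rule of thumb from "threads seconds" pairs on stdin.
# Output columns: threads, speedup, efficiency, gain over previous row, verdict.

scaling_table() {
    awk '
        NR == 1 { t1 = $2; prev = $2 }   # first row is the 1-thread baseline
        {
            speedup = t1 / $2            # how much faster than one thread
            eff     = speedup / $1       # fraction of the ideal speedup
            gain    = prev / $2          # speedup relative to the previous row
            verdict = (NR == 1 || gain >= 1.5) ? "ok" : "too many cores"
            printf "%d %.2f %.2f %.2f %s\n", $1, speedup, eff, gain, verdict
            prev = $2
        }'
}

# Made-up timings, for illustration only
scaling_table <<'EOF'
1 80.0
2 42.0
4 23.0
8 14.0
16 11.0
32 10.5
EOF
```

With these invented numbers, the gain from doubling stays above 1.5 up to 8 threads and drops below it at 16, so 8 threads would be the largest sensible request. Replace the sample rows with your own elapsed times to check your runs.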

💡 Bonus! Increase the problem size by increasing --particles=<value>. Is the program now able to scale to a larger number of threads? Why does --steps=<value> not have the same effect?

More information

💡 Docs CSC: Performance checklist