Application performance

This exercise is done on Mahti, which requires that your CSC project has access to the Mahti service.

Overview

💬 In this exercise, you will optimize the performance of a real simulation use case by tuning the number of cores used and the ratio between MPI tasks and OpenMP threads. As an example application, we will use the CP2K software. The details of the code and what it does are not important for completing this exercise; just take it as an example of a parallel program that uses hybrid MPI/OpenMP parallelization.

Download a sample input file

  1. Create and enter a suitable scratch directory on Mahti (replace <project> with your CSC project, e.g. project_2001234):

    mkdir -p /scratch/<project>/$USER/app-perf
    cd /scratch/<project>/$USER/app-perf
    
  2. Download a sample input file:

    wget https://a3s.fi/CSC_training/cp2k.inp
    
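    💡 You can quickly verify that the download succeeded, e.g.:

    ls -l cp2k.inp    # file should exist and be non-empty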

💬 Reading/understanding the contents of this input file is not important for the sake of completing this exercise.

Scalability test

💬 You should first check how many CPU nodes can be used efficiently to run the example simulation.

  1. Modify the following batch script to request 1 node and save it into a file cp2k.sh using, e.g., nano:

    #!/bin/bash
    #SBATCH --partition=medium
    #SBATCH --account=<project>    # replace <project> with your CSC project, e.g. project_2001234
    #SBATCH --nodes=<N>            # replace <N> with the number of nodes to run on
    #SBATCH --ntasks-per-node=128  # Mahti has 128 CPU cores per node
    #SBATCH --time=00:10:00
    
    module purge
    module load gcc/9.4.0 openmpi/4.1.2 cp2k/2023.2
    srun cp2k.psmp -i cp2k.inp
    
  2. Submit the batch script:

    sbatch cp2k.sh
    
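    💡 You can follow the state of the job while it is queued or running, e.g.:

    squeue -u $USER        # list all your jobs
    squeue -j <jobid>      # or query the specific job ID printed by sbatch
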
  3. Once the job has completed, you may use the program’s internal timer to check how many seconds it took to run the simulation:

    grep "CP2K " slurm-<jobid>.out | awk '{print $7}'
    
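    💡 If you run all jobs in the same directory, a small loop over the output files collects every timing at once (a minimal sketch reusing the grep above):

    for f in slurm-*.out; do
        printf "%s: " "$f"
        grep "CP2K " "$f" | awk '{print $7}'
    done
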
  4. Repeat the job for the number of nodes listed below and complete the table! Calculate the speedup by dividing the previous elapsed time by the elapsed time obtained using twice as many nodes (a sketch for computing this follows the table):

    Number of nodes   Elapsed time (s)   Speedup
    1                                    -
    2                                    t1/t2
    4                                    t2/t4
    8                                    t4/t8
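
    💡 The division can be done with a one-liner; a minimal sketch with hypothetical elapsed times t1 and t2 that you should replace with your own measurements:

    t1=240.0; t2=130.0   # hypothetical elapsed times (s) on 1 and 2 nodes
    awk -v a="$t1" -v b="$t2" 'BEGIN { printf "Speedup: %.2fx\n", a/b }'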

☝🏻 Remember that the speedup should be at least 1.5x each time you double the number of nodes (and thus cores)! This is important to ensure that the CPU resources are used efficiently.

💭 Up to how many nodes does the job scale efficiently? Use that node count in the next part!

Assess optimal thread–task balance

💬 The performance of software using hybrid MPI/OpenMP parallelism may be further improved by running multiple OpenMP threads per MPI task. The optimal ratio between the number of tasks and threads varies for each program and job input and should be tested.

☝🏻 To run multiple threads, one needs to set --cpus-per-task. The default is one CPU (thread) per task. To use all 128 physical cores in a Mahti node, the value of --ntasks-per-node multiplied by --cpus-per-task should equal 128 (40 on Puhti). Most applications also require setting the OMP_NUM_THREADS environment variable to be equal to the number of threads per task.

  1. Copy the following script into a file job.sh using, e.g., nano:

    #!/bin/bash
    #SBATCH --partition=medium
    #SBATCH --account=<project>   # replace <project> with your CSC project, e.g. project_2001234
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=128
    #SBATCH --cpus-per-task=1
    #SBATCH --time=00:10:00
    
    export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
    
    module purge
    module load gcc/9.4.0 openmpi/4.1.2 cp2k/2023.2
    srun cp2k.psmp -i cp2k.inp
    
  2. Submit the job using different combinations of --ntasks-per-node and --cpus-per-task (see the sketch after the hint below).

    💡 The number of threads is stored by Slurm in the SLURM_CPUS_PER_TASK environment variable, which can then be used to set the value of OMP_NUM_THREADS.
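
    💡 Options given to sbatch on the command line override the corresponding #SBATCH directives in the script, so you can sweep all combinations without editing job.sh; a minimal sketch:

    for tasks in 128 64 32 16 8; do
        cpus=$(( 128 / tasks ))   # keep tasks * cpus = 128 (all cores in a Mahti node)
        sbatch --ntasks-per-node=$tasks --cpus-per-task=$cpus job.sh
    done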

  3. Complete the table below:

    MPI tasks per node   OpenMP threads per task   Elapsed time (s)   Memory utilized (GB)
    128                  1
    64                   2
    32                   4
    16                   8
    8                    16

💭 Were you able to run the calculation faster by launching multiple OpenMP threads per MPI task? What is the optimum ratio?

💭 How does the memory usage vary when you increase the number of threads per task? Use the seff command to check. Can you explain the reason for your observation?
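
💡 A minimal sketch for checking a completed job, assuming seff reports memory under "Memory Utilized" as on CSC systems:

    seff <jobid>                            # full resource-usage summary
    seff <jobid> | grep "Memory Utilized"   # just the memory line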

More information