Fair share of use on multi-user computing platforms

The computing resources are shared among hundreds of users, who all have different resource needs. Resources allocated to a job are not available for others to use, so it is important to request only the resources you need and to ensure that they are used efficiently. A resource/job management system keeps track of the computing resources and aims to share them among all users in an efficient and fair way. It optimizes resource usage by filling the compute nodes so that as few resources as possible sit idle. If a job does not use the memory it reserved, those resources are wasted.
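To make "wasted resources" concrete, the following minimal sketch computes how much of a reservation a job actually used. The function name `efficiency` and the job numbers are hypothetical; in practice the used and reserved values would come from Slurm accounting tools such as `sacct`.

```python
def efficiency(used: float, reserved: float) -> float:
    """Return the fraction of a reserved resource that was actually used."""
    if reserved <= 0:
        raise ValueError("reserved must be positive")
    return used / reserved


# Hypothetical job: 16 GiB of memory reserved, peak usage 4 GiB.
mem_eff = efficiency(used=4.0, reserved=16.0)

# Hypothetical job: 4 cores reserved for 2 hours (8 core-hours),
# of which 6 core-hours were spent doing actual work.
cpu_eff = efficiency(used=6.0, reserved=4 * 2.0)

print(f"memory efficiency: {mem_eff:.0%}")  # prints "memory efficiency: 25%"
print(f"CPU efficiency: {cpu_eff:.0%}")     # prints "CPU efficiency: 75%"
```

In this example, three quarters of the reserved memory sat idle while being unavailable to other users, which suggests the next run could request considerably less.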

Figure: Slurm job allocations. How batch jobs are distributed on compute nodes in terms of number of CPU cores, time and memory.

Slurm

CSC uses a batch job system called Slurm to manage resources. Slurm controls how the overall computing resources are shared among all jobs and users in an efficient and fair manner. It also controls which resources are allocated to each job request, such as:

  • computing time

  • number of cores

  • amount of memory

  • other resources like GPUs, local disk, etc.
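As a rough illustration of how these resources are requested, the sketch below builds the `#SBATCH` header of a batch script from a few values. The helper `sbatch_header` is hypothetical, and the account and partition names in the usage example are placeholders; the exact options (especially for GPUs and local disk) depend on the system, so check the system documentation before using them.

```python
def sbatch_header(time: str, cpus: int, mem_gb: int,
                  partition: str, account: str, gpus: int = 0) -> str:
    """Build the resource request lines of a Slurm batch script.

    Hypothetical helper for illustration; it only formats text and
    does not submit anything.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --account={account}",      # project to be billed
        f"#SBATCH --partition={partition}",  # queue to submit to
        f"#SBATCH --time={time}",            # computing time (HH:MM:SS)
        f"#SBATCH --cpus-per-task={cpus}",   # number of cores
        f"#SBATCH --mem={mem_gb}G",          # amount of memory
    ]
    if gpus > 0:
        # Generic GPU request; many systems require a GPU type as well.
        lines.append(f"#SBATCH --gres=gpu:{gpus}")
    return "\n".join(lines) + "\n"


# Placeholder values, not a real project or a recommendation.
header = sbatch_header(time="01:00:00", cpus=4, mem_gb=8,
                       partition="small", account="project_0000000")
print(header)
```

Keeping the requests in one place like this makes it easy to adjust them after checking what a finished job actually used.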

Queueing

  • A job is queued and starts when the requested resources become available

  • The order in which the queued jobs start depends on their priority and currently available resources

  • At CSC, the priority is configured to use “fair share”

    • The initial priority of a job decreases if the user has recently run lots of jobs

    • Over time (while queueing) a job's priority increases and eventually it will run

  • In general, always use the smallest partition possible!

  • See our documentation for more information on Getting started with running batch jobs on Puhti/Mahti and LUMI.
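The two fair-share effects described above (recent usage lowers the initial priority, waiting raises it) can be illustrated with a toy model. This is not CSC's actual configuration: the formula, the weights, and all names below are illustrative assumptions chosen only to show the qualitative behaviour.

```python
def fair_share_factor(recent_usage: float, share: float) -> float:
    """Toy factor in (0, 1]: 1.0 with no recent usage, halved for
    every 'share' worth of recent usage (illustrative assumption)."""
    return 2.0 ** (-recent_usage / share)


def toy_priority(age_hours: float, recent_usage: float, share: float,
                 max_age_hours: float = 168.0,
                 w_age: float = 1000.0, w_fair: float = 10000.0) -> float:
    """Weighted sum of an age factor and a fair-share factor.

    The weights and the one-week age cap are made-up values.
    """
    age_factor = min(age_hours / max_age_hours, 1.0)
    return w_age * age_factor + w_fair * fair_share_factor(recent_usage, share)


light_user = toy_priority(age_hours=0, recent_usage=0.0, share=1.0)
heavy_user = toy_priority(age_hours=0, recent_usage=2.0, share=1.0)
heavy_user_waited = toy_priority(age_hours=84, recent_usage=2.0, share=1.0)

print(light_user, heavy_user, heavy_user_waited)  # 10000.0 2500.0 3000.0
```

In this model a user with heavy recent usage starts with a lower priority than a light user, but the priority of the queued job still grows the longer it waits.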

How many resources to request?

  • You can use your workstation / laptop as a base measuring stick: If the code runs on your machine, as a first guess you can reserve the same number of CPUs and amount of memory as your machine has. Before reserving multiple CPUs, check if your code can make use of them.

  • While the program runs on your machine, you can also check more closely which resources it uses with top on macOS and Linux or Task Manager on Windows.

  • Similarly for running time: if you have run it on your machine, you should reserve a similar amount of time on the cluster.

  • If your program does the same thing more than once, you can estimate the total run time as the number of steps times the time taken by each step.

  • Likewise, if your program is run with multiple parameter sets, the total time needed is the number of parameter sets times the time needed for a run with one parameter set.

  • You can also run a smaller version of the problem and try to estimate how the program will scale when you make the problem bigger.

  • You should always monitor jobs to find out which resources they actually used compared with what you requested.
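The "number of steps times time per step" estimate above can be sketched as follows. The function names are hypothetical, and the safety factor of 1.5 is an assumption added here so that a slightly slower run on the cluster does not hit the time limit; it is not a recommendation from the source.

```python
import math


def estimate_walltime_seconds(n_steps: int, seconds_per_step: float,
                              safety_factor: float = 1.5) -> int:
    """Estimate total run time as steps x time per step, plus a margin.

    The safety factor is an illustrative assumption.
    """
    return math.ceil(n_steps * seconds_per_step * safety_factor)


def slurm_time(seconds: int) -> str:
    """Format seconds as days-hours:minutes:seconds, a format Slurm accepts."""
    days, rem = divmod(seconds, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:{secs:02d}"


# Hypothetical measurement: one parameter set took 90 s on a laptop,
# and the full job runs 200 parameter sets one after another.
total = estimate_walltime_seconds(n_steps=200, seconds_per_step=90.0)
print(total)              # 27000
print(slurm_time(total))  # 0-07:30:00
```

After the first real run, compare the estimate with the elapsed time the job reports and adjust the request for later jobs.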

Adapted from Aalto Scientific Computing