Parallel processes#

Typically, large computers like those at CSC are not much faster than personal ones – they are simply bigger. For fast computation, they utilize parallelism (and typically have special disk, memory and network solutions, too). Parallelism simplified: You use hundreds of ordinary computers simultaneously to solve a single problem.

A small thought example

Let’s make some pea soup following this recipe:

(1 min) Pour water into a soup pan, add the split peas and bring it to boil.
(60 min) Let it simmer under a lid for about 60 minutes.
(15 min) Clean and chop the leek, celeriac, onion, carrot and potato.
(20 min) Add the vegetables and simmer for 20 more minutes. Stir the soup occasionally.
(1 day) Leave the soup for one day. Reheat before serving and add a sliced smoked sausage (vegetarian options are also welcome). Season with pepper and salt.

Imagine you’re cooking alone.

How many workers does this process need?
Can you identify potential for parallelisation in this recipe?
And what if you are cooking with the help of a friend? Is the soup done any faster?

Adjusted from Introduction to parallel programming in Python by the Carpentries

Solution

There are two ‘workers’: the cook and the stove.
You can cut vegetables while simmering the split peas.
If you have help, you can parallelize cutting vegetables further.
The main part of the process (waiting) is not parallelizable, so we have to consider if it is worth the effort.

→ Not everything can be parallelized. Identify serial and parallelizable parts of your code early on.

So what options do we have to run things at the same time?

Parallel programming

Parallel programs are typically parallelized with the MPI and/or OpenMP standards or using GPUs. In this course we are focusing on making use of multiple CPUs and dealing with so called “embarrasingly/naturally/delightfully parallel” tasks. If you are interested in more advanced topics, please check our CSC training calendar.

Using multiple cores#

In-built multi core support#

First thing to check, is if the software you are using has built-in support for using multiple CPUs/cores. For command line tools, look for -n(umber of)_cores, -c(ores/pu), -j(obs), -t(hreads) or similar.

Some example geospatial tools with built-in multi CPU support:

GDAL, e.g. gdalwarp -multi -wo NUM_THREADS=val/ALL_CPUS ...
Orfeo ToolBox; no extra action needed
Whiteboxtools; many tools support parallel execution without extra action
Lastools; many tools support parallel execution by setting -cores
PDAL-wrench (not on Puhti); many tools support parallel execution without extra action

For your own scripts, do you have for-loops or similar that you could replace by using multiple cores instead? Many programming languages have packages that support this:

Python: multiprocessing, joblib and dask
R: snow, parallel and future
Julia: built in multi-threading

Tools for running scripts/tools at the same time#

Apart from in-built features, there is also ways of running scripts at the same time for tools without multi-core support or adapting your own code. Any program may be run in parallel with these tools. This way of running programs is also called task farming or high throughput computing.

GNU Parallel#

GNU parallel is a shell tool for executing jobs in parallel using one or more computers or cores. A job can be a single command or a small script that has to be run for each of the lines in the input. The typical input is a list of files, a list of hosts, a list of users, a list of URLs, or a list of tables. Tasks that are not run immediately due to space restrictions are queued and are automatically executed as space becomes available.

From GNU Parallel documentation See also CSC Docs: GNU Parallel tutorial

Array jobs#

Array jobs are another way of how you can run multiple independent jobs at the same time. They are a structure of SLURM, enabling users to run multiple instances of the same batch script as independent jobs.

HyperQueue#

Instead of submitting each of your computational tasks as separate Slurm jobs or job steps, you can also allocate a large resource block and then use HyperQueue to submit your tasks to this allocation.

Workflow tools#

If running your jobs gets more complex, requiring e.g. dependencies between subtasks, workflow tools can be another or additional option. Workflow tools also help with making your work more reproducible by recording the computational steps and data. You can find some guidelines and suggestions in CSC Docs: High Throughput Computing page. Many tools available:

Think about your own work

Do you need to run a lot of steps one after another? Or few steps that need a lot of memory? Do steps depend on each other? Which steps could be run in parallel? Which steps cannot be run in parallel?

How many jobs is too many?

We mention in documentation and guidelines that users shouldn’t send too many jobs, but how many is too many?

Unfortunately it’s impossible to give any exact numbers because both Slurm and Lustre are shared resources.

It’s possible to give better limits for global usage of the system.
When system total load is low, it may be ok to run something that is problematic when system is full.

How many jobs is too many?

SHOULD BE OK to run tens of jobs
PAY ATTENTION if you run hundreds of jobs
DON’T RUN several thousands of jobs

How many file operations is too many?

SHOULD BE OK to access hundreds of files
PAY ATTENTION if you need several thousands of files
DON’T USE hundreds of thoudsands of files

Note that these guideline numbers are for all operations on all jobs.

I have lots of small files

Check the tool that you are using
- There may be different options for data storage
Tar/untar and compress your datasets.
Use local disk (NVMe on Puhti, ramdisk on Mahti).
Remove intermediate files if possible.
Use squashfs for read-only datasets and containers.

I have lots of small tasks for Slurm

Regroup your tasks and execute larger group of tasks in single job.
- Manual or automatic (if feature is present in your tool)
- Horizontal and vertical packing
- Tradeoff (redundancy, parallelism, utilization)
Do a larger job and use another scheduler (HyperQueue, flux).
- Integration for Nextflow and Snakemake already exists
- CSC has some tools for farming type jobs
- Not all or nothing