Parallel computing#

For fast computation, supercomputers utilize parallelism.

What to parallelize?#

Spatial data analysis can often be split, at least partly, into independent parts that can be run in parallel:

  • Dividing data into parts:

    • Think about how to divide the data: rectangular boxes, catchment areas, administrative units, chunks of vector data, data from different time periods, etc.

    • In many cases the borders need special care; one option is to split rasters into overlapping tiles (see the sketch after this list).

  • Repeating the analysis with different parameters: scenarios, time periods, model settings, etc.
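
For raster data, one practical way to create overlapping splits is GDAL's gdal_retile.py utility. A minimal sketch; the file and directory names are placeholders:

    # Cut big_raster.tif into 1024 x 1024 pixel tiles with a 64-pixel overlap,
    # so that analyses near tile borders still see their neighbourhood.
    mkdir -p tiles
    gdal_retile.py -ps 1024 1024 -overlap 64 -targetDir tiles big_raster.tif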

Think about your own work

  • Do you need to run a lot of steps one after another?

  • Or a few steps that need a lot of memory?

  • Do steps depend on each other?

  • Which steps could be run in parallel?

  • Which steps cannot be run in parallel?

  • How to split your data?

How to parallelize?#

There are four main options for running an analysis in parallel:

  1. Use spatial analysis tools with built-in parallel support

  2. Write your own scripts using parallel libraries of different scripting languages

  3. Use external tools to run the scripts in parallel

  4. Write your own parallel code

From a practical point of view on supercomputers, it is also important to understand whether the tool/script supports:

  • Multi-core - it runs in parallel only inside one node of the supercomputer.

  • Multi-node - it can distribute the work to several nodes of the supercomputer.

For multi-core there are clearly more options. The number of cores in a single node has been increasing in recent years, so multi-core tools can also be very useful.
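
On a SLURM-based supercomputer, a multi-core tool would typically be run in a batch job that reserves all of its cores from a single node. A minimal sketch; the account, partition, tool and file names are placeholders:

    #!/bin/bash
    # A single-node, multi-core batch job (account and partition are placeholders).
    #SBATCH --account=project_xxx
    #SBATCH --partition=small
    #SBATCH --nodes=1
    #SBATCH --cpus-per-task=16

    # my_tool and its --threads option are hypothetical stand-ins for a real tool.
    my_tool --threads "$SLURM_CPUS_PER_TASK" input.tif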

Tools with built-in parallel support#

Check the tool’s manual to see whether it has built-in support for using multiple CPUs/cores. For command-line tools, look for options named number_of_cores, cores, cpu, jobs, threads or similar. Unfortunately, not many GIS tools have such an option.

Some example geospatial tools with built-in parallel support:

  • GDAL, some commands, e.g. gdalwarp -multi -wo NUM_THREADS=val/ALL_CPUS ...

  • FORCE

  • LAStools

  • OpenDroneMap

  • OrfeoToolBox

  • PDAL-wrench

  • SNAP

  • Zonation

  • WhiteboxTools

All of these tools are multi-core, but not multi-node.

Define the number of cores explicitly

GIS tools are generally not written for supercomputers, so they might not interpret HPC specifics correctly and may assume they can use more cores than have actually been allocated to the job. It is usually better to define the number of cores to use explicitly, rather than to rely on “use all available cores”.
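
For example, with SLURM the allocated core count is available in the SLURM_CPUS_PER_TASK environment variable and can be passed on to the tool. A minimal sketch reusing the gdalwarp example from above; the file names are placeholders:

    # Use the core count actually allocated by the scheduler, defaulting to 1.
    NCORES=${SLURM_CPUS_PER_TASK:-1}

    # Pass the explicit count instead of ALL_CPUS.
    gdalwarp -multi -wo NUM_THREADS="$NCORES" input.tif output.tif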

Deep learning libraries have options for multi-GPU and multi-node machine learning.
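
For example, a PyTorch-based training script could be launched on several GPUs of one node with the torchrun launcher; train.py is a placeholder for your own script:

    # Start one process per GPU on a single node (here assuming 4 GPUs).
    torchrun --standalone --nproc_per_node=4 train.py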

Parallel libraries of scripting languages#

Many programming languages have packages for parallel computing, for example multiprocessing, joblib and dask in Python, or the parallel and future packages in R.

External tools to run the scripts in parallel#

External tools make it possible to run scripts in parallel with minimal changes to the scripts themselves. This way of running programs is also called task farming or high-throughput computing. The tools differ in complexity and features. The simpler ones run the same script with different input parameters, for example different input files, scenarios, time frames, etc. More advanced tools can manage a whole workflow with several steps and dependencies between the steps. Workflow tools also help to make your work more reproducible by recording the computational steps and data. See CSC Docs: High-throughput computing and workflows for more information.

GNU Parallel#

GNU Parallel is a general Linux tool for executing commands or scripts in parallel within one node. It iterates over an input list, which can be a list of files or a list of input parameters. The number of tasks may be higher than the number of cores; in that case the remaining tasks wait for execution until resources become available. GNU Parallel does not support dependencies between tasks.
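
A minimal sketch of task farming with GNU Parallel, assuming a hypothetical processing script process_tile.sh and a directory of input rasters:

    # Run process_tile.sh once per input file, at most $SLURM_CPUS_PER_TASK
    # tasks at a time (defaulting to 4 if the variable is not set).
    parallel -j "${SLURM_CPUS_PER_TASK:-4}" bash process_tile.sh {} ::: data/*.tif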

Snakemake#

Snakemake is a scientific workflow management system that supports running, for example, R, bash and Python scripts. It can handle dependencies between tasks and can be used in both multi-core and multi-node set-ups. Snakemake is one of the easiest tools for workflow management.
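
Assuming a workflow defined in a Snakefile in the current directory, it could be run with an explicit core limit like this:

    # Execute the workflow with at most the allocated number of cores.
    snakemake --cores "${SLURM_CPUS_PER_TASK:-4}"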

Write your own parallel code#

Parallel programs are typically parallelized with the MPI and/or OpenMP standards or by using GPUs, but these topics are not covered in this course.