Parallel R#

Spatial libraries with parallel support#

If starting from scratch with new code, the first option is to look for spatial libraries that already have parallelization built in:

  • terra has built-in parallel support in some of its raster processing functions (see the sketch after this list)

  • gdalcubes for multi-dimensional spatial data analysis

  • lidR for lidar data analysis
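
As an example, some terra functions take a cores argument. A minimal sketch, assuming a hypothetical input file and function:

library(terra)
# 'cores = 4' makes terra run the computation in 4 parallel processes.
# The file name and the applied function are illustrative placeholders.
r <- rast("elevation.tif")
r2 <- app(r, fun = function(x) { x^2 }, cores = 4)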

R parallel libraries#

The parallel spatial libraries cover only limited functionality, so they often do not fit all requirements, and they cannot be used to easily parallelize existing serial code. The next option then is to write parallel code yourself.

R has many libraries to support parallelization:

  • Multi-core: parallel

  • Multi-core or multi-node: future, snow, foreach, Rmpi, pbdMPI, etc.

If unsure, start with future. It is one of the newest, most versatile and easiest to use.

Supercomputer usage

Some of these packages require specific settings on Puhti, which might differ from the package’s general instructions; see CSC Docs, r-env, Parallel batch jobs for details.

future library#

When using future, two main questions have to be answered for running code in parallel, which we will address next:

  • How to run the parallel code?

  • How to make the code parallel?

How to run the parallel code?#

The future library supports both serial and parallel computing with different set-ups:

Name         | Description
sequential   | serial, in the current R process
multisession | multi-core, background R sessions, limited to one node
multicore    | multi-core, forked R processes, limited to one node, not available on Windows nor in RStudio
cluster      | multi-node, external R sessions

While developing the code, it might be good to start with multisession or multicore parallelization and then if needed change it to cluster. The required changes to code are small when changing the parallelization set-up.

library(future)

# Multi-core, use one of these
plan(multicore)
plan(multisession)

# Multi-node, the MPI cluster is created by RMPISNOW (see the batch job scripts below)
cl <- getMPIcluster()
plan(cluster, workers = cl)
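
After setting the plan, you can check the available resources, for example with these small helper calls (not part of the original example):

# Number of CPU cores available to this R session (Slurm-aware)
availableCores()
# Number of parallel workers in the currently set plan
nbrOfWorkers()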

How to make the code parallel?#

Basic R code runs in serial, so usually some changes to the code are needed to benefit from parallel computing. These changes are exactly the same for all parallelization set-ups.

The simplest changes could be:

  • For-loops and purrr's map() -> furrr's future_map()

  • *apply() functions -> future.apply's functions

# Example of changing a for-loop and purrr's map() to furrr's future_map()
# Just a slow demo function that waits for 5 seconds
slow_function <- function(i) {
  Sys.sleep(5)
  return(i)
}
# Input data vector. The slow function is run for each element.
input <- 1:7

# SERIAL options
# Basic FOR loop
a <- 0
for(i in input) {  
  a[i] <- slow_function(i)
}

# purrr, map
library(purrr) 
a <- map(input, slow_function)

# PARALLEL, furrr future_map
library(furrr)
plan(multisession)

a <- future_map(input, slow_function)
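
To see the benefit, you can time the serial and parallel versions, for example with system.time(). The timings below are rough estimates assuming 4 workers:

# With 7 inputs and a 5 second function, serial map() takes about 35 s,
# while future_map() with 4 workers takes roughly 10 s.
system.time(a_serial <- map(input, slow_function))
system.time(a_parallel <- future_map(input, slow_function))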

If you have used *apply() functions, the future.apply library provides replacements for them.

# Example of changing lapply() to future.apply's future_lapply()
# Basic lapply
b <- lapply(input, slow_function)

# Parallel future.apply lapply
library(future.apply) 
d <- future_lapply(input, slow_function)

Variables with future

  • future exports needed variables and libraries automatically to the parallel processes

  • The variables must be serializable. terra's raster objects are not serializable; see the terra library's recommendations

  • Avoid moving variables that refer to large objects from the serial main process to the parallel processes. Spatial data analysis often involves significant amounts of data, so it is better to read the data inside the parallel function: give the file name, the coordinates of the area to compute, etc. as input (see the sketch after this list).
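
A minimal sketch of this pattern, assuming a set of hypothetical raster tiles on disk and a simple mean statistic:

# Pass only file names to the workers and read the rasters inside the
# parallel function, instead of moving large (and for terra, non-serializable)
# raster objects between processes.
library(terra)
library(furrr)
plan(multisession)

# Hypothetical input files
raster_files <- c("tile_1.tif", "tile_2.tif", "tile_3.tif")

tile_means <- future_map_dbl(raster_files, function(f) {
  r <- rast(f)  # the data is read inside the worker process
  global(r, "mean", na.rm = TRUE)[1, 1]
})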

Batch job scripts#

Multi-core jobs#

multicore or multisession parallelization:

#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4  # Number of cores. Upper limit depends on the number of CPUs per node.

(...)

srun apptainer_wrapper exec Rscript --no-save Calc_contours_future_multicore.R
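
Inside the R script, the number of workers can be matched to the Slurm allocation, for example like this (a small sketch, not part of the original script):

# future::availableCores() is Slurm-aware, so the number of workers
# follows the CPUs allocated to the batch job.
library(future)
plan(multisession, workers = availableCores())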

Multi-node jobs#

cluster parallelization:

#SBATCH --nodes=2  # For cluster usage to make sense, this should be more than 1.
#SBATCH --ntasks=40  # Number of tasks (MPI processes). Upper limit depends on the number of nodes and CPUs per node.

(...)

srun apptainer_wrapper exec RMPISNOW --no-save --slave -f Calc_contours_future_cluster.R
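
Inside the R script, the cluster set-up then looks roughly like this (a sketch assuming that RMPISNOW has already created the MPI cluster; one task acts as the master, the rest become workers):

library(future)
# Retrieve the cluster created by RMPISNOW and register it for future
cl <- getMPIcluster()
plan(cluster, workers = cl)

# ... the parallel code, e.g. future_map() or future_lapply() calls ...

# Stop the cluster at the end of the script
stopCluster(cl)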

Further reading: