# Using HyperQueue and local disk to process many small files efficiently
This tutorial requires that
- you have a user account at CSC
- your account belongs to a project that has access to the Puhti service.
This tutorial is done on Puhti.
## Overview
💬 HyperQueue is a tool for efficient sub-node task scheduling and well suited for task farming and running embarrassingly parallel jobs.
💬 In this example, we have a large number of files (4000+) which we want to convert to another format.
- The files represent SMILES strings (a line notation encoding the two-dimensional structure of a molecule) which we want to convert into a three-dimensional coordinate format
- The computational cost of each of the conversions is expected to be comparable
- Since the workflow involves many small files, we will utilize the fast local disk to avoid stressing the parallel file system
## The workflow of this exercise

- Download the input files from Allas
- Decompress the files to `$LOCAL_SCRATCH`
- Convert each `.smi` file to the three-dimensional `.sdf` molecular file format
- Archive and compress the output files and move them back to `/scratch`
## Download the input files

- Create and enter a suitable scratch directory on Puhti (replace `<project>` with your CSC project, e.g. `project_2001234`):

```bash
mkdir -p /scratch/<project>/$USER/hq-example
cd /scratch/<project>/$USER/hq-example
```

- Download the input files representing small molecules initially obtained from the ChEMBL database:

```bash
wget https://a3s.fi/CSC_training/smiles.tar.gz
```
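If you want a quick sanity check before setting up the batch job, you can peek at the archive without extracting it. The commands below are only a sketch; the exact file names and count depend on the contents of the archive:

```bash
# List the first few entries of the downloaded archive and count the .smi files
tar -tzf smiles.tar.gz | head
tar -tzf smiles.tar.gz | grep -c '\.smi$'
```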
## Create a HyperQueue batch script

💬 We will use Open Babel to convert the SMILES strings into a three-dimensional `.sdf` coordinate format.
☝🏻 Multiple files can be converted using the batch conversion mode of Open Babel. For 4000+ files this would, however, take more than an hour. Similarly, submitting each conversion as a separate Slurm job is also a bad idea since each conversion takes just a few seconds. A large number of short jobs may degrade the performance of the scheduler for all users.
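To see why neither of those options scales, consider the naive serial approach sketched below. It reuses the same `obabel` command as the batch script later in this tutorial, but processes the files one after another on a single core, so at a few seconds per file the 4000+ conversions would take well over an hour:

```bash
# Naive serial conversion: one file at a time on a single core (too slow for 4000+ files)
for f in *.smi ; do
    obabel "$f" -O "${f%.*}.sdf" --gen3d best
done
```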
💡 HyperQueue is a program that allows us to schedule sub-node tasks within a Slurm allocation. One can think of HyperQueue as a “Slurm within a Slurm”, or a so-called “meta-scheduler”, which allows us to leverage embarrassing parallelism without overloading Slurm by looping `srun` or `sbatch` commands.
- Copy the following script into a file `hq.sh` using, e.g., `nano`:
```bash
#!/bin/bash
#SBATCH --partition=small
#SBATCH --account=<project> # replace <project> with your CSC project, e.g. project_2001234
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=40
#SBATCH --time=00:10:00
#SBATCH --gres=nvme:1
module load hyperqueue openbabel
# Specify a location for the HyperQueue server
export HQ_SERVER_DIR=${PWD}/hq-server-${SLURM_JOB_ID}
mkdir -p "${HQ_SERVER_DIR}"
# Start the server in the background (&) and wait until it has started
hq server start &
until hq job list &>/dev/null ; do sleep 1 ; done
# Start the workers (one per node, in the background) and wait until they have started
srun --exact --cpu-bind=none --mpi=none hq worker start --cpus=${SLURM_CPUS_PER_TASK} &
hq worker wait "${SLURM_NTASKS}"
# Extract the input files to the local disk and cd there
tar -xf smiles.tar.gz -C $LOCAL_SCRATCH
cd $LOCAL_SCRATCH/smiles
# Submit each Open Babel conversion as a separate HyperQueue job
for f in *.smi ; do
    hq submit --stdout=none --stderr=none obabel $f -O ${f%.*}.sdf --gen3d best &
done
# Wait until all jobs have finished, shut down the HyperQueue workers and server
hq job wait all
hq worker stop all
hq server stop
# Compress the output .sdf files and copy the package back to /scratch
tar -czf sdf.tar.gz *.sdf
cp sdf.tar.gz $SLURM_SUBMIT_DIR
```
- As explained by the comments in the script above, HyperQueue works on a server-worker-client basis, i.e. a worker is started on each compute node and executes the commands that the client has submitted to the server
- This is in principle what Slurm also does; the only difference is that you need to start the server and workers yourself
- In this example, one full Puhti node (40 cores) is reserved for processing the files, meaning that up to 40 conversion commands will be running in parallel

☝🏻 Ideally, the number of sub-tasks should be larger than the number that can run on the reserved resources simultaneously, to avoid Slurm jobs that are too short.
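By default, each submitted task requests a single CPU core, which is why 40 conversions fit on the node at once. If a single task needed more resources, HyperQueue lets you request CPUs per task. The sketch below is a hypothetical variation of the submission loop; the `--cpus` option is a HyperQueue feature, but the value 2 is only an illustration (a plain Open Babel conversion would not benefit from extra cores):

```bash
# Hypothetical variation: request 2 cores per task, so at most 20 tasks run concurrently on a 40-core node
for f in *.smi ; do
    hq submit --cpus=2 --stdout=none --stderr=none obabel $f -O ${f%.*}.sdf --gen3d best &
done
```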
- Submit the script with:

```bash
sbatch hq.sh
```
- After a couple of minutes, you should notice that a file `sdf.tar.gz` containing the output files has appeared in your working directory.
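To verify the results, you can extract the archive into a separate directory and count the converted files (the directory name `sdf` below is arbitrary):

```bash
# Extract the output archive and check how many .sdf files were produced
mkdir -p sdf
tar -xzf sdf.tar.gz -C sdf
ls sdf/*.sdf | wc -l
```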
💡 Tip: If you want to monitor the progress of your HyperQueue jobs/tasks, just export the location of the HyperQueue server and use the `hq` commands (see the official documentation for details):

```bash
module load hyperqueue
export HQ_SERVER_DIR=/path/to/hq/server
hq job list
hq job info <hqjobid>
hq job progress <hqjobid>
hq task list <hqjobid>
hq task info <hqjobid> <hqtaskid>
```
☝🏻 Remember to load the `hyperqueue` module before running the commands above from the terminal; otherwise you will get an error message that the command `hq` is not found.
☝🏻 Also note that these commands are available only after your job starts running.
💬 To get a report of how your jobs/tasks completed and spot possible failures, you could also run one of these commands in your batch script and redirect the output to a file before shutting down the server.
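For example, a minimal sketch of such a report step, added to `hq.sh` just before the `hq worker stop all` line. The `--all` flag and the report file name are assumptions; check `hq job list --help` for the options available in your HyperQueue version:

```bash
# Save a summary of all HyperQueue jobs to /scratch before shutting down the workers and server
hq job list --all > "${SLURM_SUBMIT_DIR}/hq-report-${SLURM_JOB_ID}.txt"
```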