# Fast disk areas in CSC's computing environment
☝🏻 This tutorial requires that you have a user account at CSC that is a member of a project with access to the Puhti service.
Upon completion of this tutorial, you will know which disk areas are best suited to I/O-intensive workloads, i.e. frequent read and write operations.
## Perform light-weight pre-processing of data files using fast local disk
💬 You may sometimes come across situations where you have to process a large number of small files, which can cause a heavy input/output (I/O) load on the shared file system used in CSC's computing environment.

💬 To facilitate such heavy I/O operations, CSC provides fast local disk areas on the login and compute nodes.
- First, log in to Puhti using SSH (or by opening a login node shell in the Puhti web interface):

  ```bash
  ssh <username>@puhti.csc.fi   # replace <username> with your CSC username, e.g. myname@puhti.csc.fi
  ```
- Identify the fast local disk area on the login node with the following command:

  ```bash
  echo $TMPDIR
  ```
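  For illustration only, the printed path might look something like the example below; the exact location is system-specific, so treat this as a guess rather than guaranteed output:

  ```bash
  echo $TMPDIR
  # Possible output (illustrative only):
  # /local_scratch/myname
  ```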
💡 The local disk area on the login nodes is meant for light-weight pre-processing of data and I/O-intensive tasks such as software compilation. Actual computations should be submitted to the batch queue from the `/scratch` disk.

💡 The local disk areas on the login nodes are meant for temporary use and are cleaned often, so make sure to move important data to `/scratch` or `/projappl` once you no longer need the fast disk.
☝🏻 Note that a local disk is specific to a particular node, i.e. you cannot access the local disk of `puhti-login11` from `puhti-login12`.
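If you want to see this for yourself, a quick sketch (run it in two separate SSH sessions that happen to land on different login nodes):

```bash
hostname       # shows which login node you are on, e.g. puhti-login11
echo $TMPDIR   # this directory lives on a disk attached to this particular node only
```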
## Download a tar archive containing thousands of small files and merge them into one large file using the fast local disk
- Download a tar file from the Allas object storage directly to the local disk:

  ```bash
  cd $TMPDIR
  wget https://a3s.fi/CSC_training/Individual_files.tar.gz
  ```
- Unpack the downloaded tar file:

  ```bash
  tar -xavf Individual_files.tar.gz
  cd Individual_files
  ```
- Merge the small files into one larger file and then remove them:

  ```bash
  find . -name 'individual.fasta*' | xargs cat >> Merged.fasta
  find . -name 'individual.fasta*' | xargs rm
  ```
💡 `xargs` is a convenient command that takes the output of one command and passes it as arguments to another.
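If `xargs` is new to you, here is a tiny standalone illustration (the file names are made up):

```bash
# xargs reads whitespace- or newline-separated items from stdin and
# appends them as arguments to the given command. Prefixing the command
# with echo shows what xargs would actually run:
printf 'a.fasta\nb.fasta\n' | xargs echo cat
# prints: cat a.fasta b.fasta
```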
## Move your pre-processed data to the project-specific /scratch area before analysis
💭 Remember: the commands `csc-projects` and `csc-workspaces` reveal information about your projects.
- Create your own folder (using the environment variable `$USER`) under a project-specific directory on the `/scratch` disk (or skip this step if you already created the folder in a previous tutorial):

  ```bash
  mkdir -p /scratch/<project>/$USER   # replace <project> with your CSC project, e.g. project_2001234
  ```
- Move your pre-processed data from the previous step (i.e. the `Merged.fasta` file) from the fast disk to `/scratch`:

  ```bash
  mv Merged.fasta /scratch/<project>/$USER
  ```
- You have now successfully moved your data to the `/scratch` area and can start performing the actual analysis using batch job scripts.
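Before writing any batch scripts, you can optionally sanity-check the merged file. A small sketch, assuming the FASTA records use standard `>` header lines:

```bash
ls -lh /scratch/<project>/$USER/Merged.fasta        # confirm the file arrived and check its size
grep -c '^>' /scratch/<project>/$USER/Merged.fasta  # count sequence records
```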
## Optional: Fast local disk areas on compute nodes
☝🏻 If you intend to perform heavy computing tasks using a large number of small files, you have to use the fast local disk areas on the compute nodes instead of the login nodes. The compute nodes are accessed either interactively or using batch jobs.
- Move to the `/scratch` area of your project and use the `sinteractive` command to request an interactive session on a compute node with 1 GB of fast local disk for 10 minutes:

  ```bash
  cd /scratch/<project>/$USER                            # replace <project> with your CSC project, e.g. project_2001234
  sinteractive --account <project> --time 00:10:00 --tmp 1   # replace <project> as above
  ```
☝🏻 Not all compute nodes have fast local disks, meaning that you may have to queue for a while before the interactive session starts. You may skip this part if you're in a hurry.
- In the interactive session, use the following commands to locate the fast local storage area on that compute node:

  ```bash
  echo $LOCAL_SCRATCH
  echo $TMPDIR
  ```
💡 Note how the path to the fast local storage area contains the ID of your Slurm job, `/run/nvme/job_<id>`.
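The output might look roughly like this (the job ID below is made up, and the exact subpaths may differ):

```bash
echo $LOCAL_SCRATCH   # e.g. /run/nvme/job_12345678/data
echo $TMPDIR          # e.g. /run/nvme/job_12345678/tmp
```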
Terminate the interactive session and now try the same in a proper batch job. Create a file called
my_nvme.bash
using, for example, thenano
text editor:nano my_nvme.bash
- Copy the following batch script there and change `<project>` to the CSC project you actually want to use:

  ```bash
  #!/bin/bash
  #SBATCH --account=<project>   # Choose the billing project. Has to be defined!
  #SBATCH --time=00:01:00       # Maximum duration of the job. Upper limit depends on the partition.
  #SBATCH --partition=small     # Job queues: test, interactive, small, large, longrun, hugemem, hugemem_longrun
  #SBATCH --ntasks=1            # Number of tasks. Upper limit depends on the partition. For a serial job this should be set to 1!
  #SBATCH --gres=nvme:1         # Request fast local disk space. Default unit is GB.

  echo $LOCAL_SCRATCH
  echo $TMPDIR
  ```
- Submit the batch job with the command:

  ```bash
  sbatch my_nvme.bash
  ```
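  If the submission succeeds, `sbatch` prints the ID of the new job (the ID below is made up); the same ID appears in the name of the output file:

  ```bash
  sbatch my_nvme.bash
  # Submitted batch job 12345678
  ```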
- Monitor the progress of your batch job and print the contents of the output file once the job has completed:

  ```bash
  squeue -u $USER
  cat slurm-<jobid>.out   # replace <jobid> with the actual Slurm job ID
  ```
☝🏻 Again, please note that requesting fast local disk space tends to increase your queueing time. It is a scarce resource and should only be requested if you really need it. Please ask the CSC Service Desk if you're unsure.
‼️ If you write important data to the local disk in your interactive session or batch job, remember to copy the data back to `/scratch` before the job terminates! The local disk is cleaned immediately after your job, and salvaging any forgotten files afterwards is not possible.

💭 Bonus exercise: Try to repeat the first part of this tutorial using a batch job!
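For the bonus exercise, here is one possible (unofficial) sketch that repeats the pre-processing steps of the first part inside a batch job, following the copy-back advice above:

```bash
#!/bin/bash
#SBATCH --account=<project>   # replace <project> with your CSC project
#SBATCH --time=00:10:00
#SBATCH --partition=small
#SBATCH --ntasks=1
#SBATCH --gres=nvme:1         # request 1 GB of fast local disk

# Work on the job-specific fast local disk.
cd $LOCAL_SCRATCH
wget https://a3s.fi/CSC_training/Individual_files.tar.gz
tar -xavf Individual_files.tar.gz
cd Individual_files
find . -name 'individual.fasta*' | xargs cat >> Merged.fasta

# Copy the result back to /scratch before the job ends; the local
# disk is cleaned immediately after the job finishes.
cp Merged.fasta /scratch/<project>/$USER/
```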
## More information

💡 Docs CSC: Temporary local disk areas

💡 Docs CSC: Local storage on Puhti

💡 Docs CSC: Local storage on Mahti