Fast disk areas in CSC’s computing environment

☝🏻 This tutorial requires that you have a user account at CSC that is a member of a project that has access to the Puhti service.

Upon completion of this tutorial, you will be familiar with the disk areas best suited for I/O-intensive workloads, i.e. workloads with frequent read and write operations.

Perform light-weight pre-processing of data files using the fast local disk

💬 You may sometimes come across situations where you have to process a large number of small files, which can cause a heavy input/output load on the shared file system used in CSC’s computing environment.

💬 To handle such I/O-heavy operations more efficiently, CSC provides fast local disk areas on both the login and compute nodes.

  1. First, log in to Puhti using SSH (or by opening a login node shell in the Puhti web interface):

    ssh <username>@puhti.csc.fi    # replace <username> with your CSC username, e.g. myname@puhti.csc.fi
    
  2. Identify the fast local disk areas on the login nodes with the following command:

    echo $TMPDIR
    

💡 The local disk areas on the login nodes are meant for light-weight pre-processing of data and I/O-intensive tasks such as software compilation. Actual computations should be submitted to the batch queue from the /scratch disk.

💡 The local disk areas on the login nodes are meant for temporary use and are cleaned frequently, so make sure to move important data to /scratch or /projappl once you no longer need the fast disk.

☝🏻 Note that a local disk is specific to a particular node, i.e. you cannot access the local disk of puhti-login11 from puhti-login12.
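
💡 If you want to check which login node you are currently on (and hence which node's local disk your $TMPDIR refers to), a quick check is:

    hostname    # prints the name of the current node, e.g. puhti-login11 or puhti-login12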

Download a tar archive containing thousands of small files and merge the files into one large file using the fast local disk

  1. Download a tar file from the Allas object storage directly to the local disk:

    cd $TMPDIR
    wget https://a3s.fi/CSC_training/Individual_files.tar.gz
    
  2. Unpack the downloaded tar file:

    tar -xavf Individual_files.tar.gz
    cd Individual_files
    
  3. Merge the small files into one large file and remove the small files:

    find . -name 'individual.fasta*' | xargs cat >> Merged.fasta
    find . -name 'individual.fasta*' | xargs rm
    

    💡 xargs is a convenient command that takes the output of one command and passes it as arguments to another command.
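
💡 The two commands above rely on the file names containing no spaces or other unusual characters. A slightly more defensive variant of the same merge-and-clean-up step, using null-delimited output, could look like this:

    find . -name 'individual.fasta*' -print0 | xargs -0 cat >> Merged.fasta    # merge all matching files into one
    find . -name 'individual.fasta*' -print0 | xargs -0 rm                     # then remove the small files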

Move your pre-processed data to the project-specific /scratch area before analysis

💭 Remember: the commands csc-projects and csc-workspaces reveal information about your projects.
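
💡 For example, you can run them on the login node before continuing, to check the name of the project whose /scratch area you want to use:

    csc-projects      # show information about your CSC projects
    csc-workspaces    # show the status of your projects' disk areas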

  1. Create your own folder (using the environment variable $USER) under a project-specific directory on the /scratch disk (or skip this step if you already created the folder in a previous tutorial):

    mkdir -p /scratch/<project>/$USER    # replace <project> with your CSC project, e.g. project_2001234
    
  2. Move your pre-processed data from the previous step (i.e., the Merged.fasta file) from the fast disk to /scratch:

    mv Merged.fasta /scratch/<project>/$USER
    
  3. You have now successfully moved your data to the /scratch area and can start performing actual analysis using batch job scripts.
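
💡 If you want to verify that the file really ended up in the right place, a quick listing (using the same directory as above) is enough:

    ls -lh /scratch/<project>/$USER/Merged.fasta    # replace <project> with your CSC project, e.g. project_2001234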

Optional: Fast local disk areas on compute nodes

☝🏻 If you intend to perform heavy computing tasks using a large number of small files, you have to use the fast local disk areas on the compute nodes instead of the login nodes. The compute nodes are accessed either interactively or using batch jobs.

  1. Move to the /scratch area of your project and use the sinteractive command to request an interactive session on a compute node with 1 GB of fast local disk for 10 minutes:

    cd /scratch/<project>/$USER    # replace <project> with your CSC project, e.g. project_2001234
    sinteractive --account <project> --time 00:10:00 --tmp 1    # replace <project> with your CSC project, e.g. project_2001234
    

    ☝🏻 Not all compute nodes have fast local disks, meaning that you may have to queue for a while before the interactive session starts. You may skip this part if you’re in a hurry.

  2. In the interactive session, use the following commands to locate the fast local storage areas on that compute node:

    echo $LOCAL_SCRATCH
    echo $TMPDIR
    

    💡 Note how the path to the fast local storage area contains the ID of your Slurm job, /run/nvme/job_<id>.

  3. Terminate the interactive session (type exit) and then try the same in a proper batch job. Create a file called my_nvme.bash using, for example, the nano text editor:

    nano my_nvme.bash
    
  4. Copy the following batch script there and change <project> to the CSC project you actually want to use:

    #!/bin/bash
    #SBATCH --account=<project>      # Choose the billing project. Has to be defined!
    #SBATCH --time=00:01:00          # Maximum duration of the job. Upper limit depends on the partition. 
    #SBATCH --partition=small        # Job queues: test, interactive, small, large, longrun, hugemem, hugemem_longrun
    #SBATCH --ntasks=1               # Number of tasks. Upper limit depends on partition. For a serial job this should be set to 1!
    #SBATCH --gres=nvme:1            # Request fast local disk space. Default unit is GB.
    
    echo $LOCAL_SCRATCH
    echo $TMPDIR
    
  5. Submit the batch job with the command:

    sbatch my_nvme.bash
    
  6. Monitor the progress of your batch job and print the contents of the output file when it has completed:

    squeue -u $USER
    cat slurm-<jobid>.out    # replace <jobid> with the actual Slurm job ID
    

    ☝🏻 Again, please note that requesting fast local disk space tends to increase your queueing time. It is a scarce resource and should only be requested if you really need it. Please ask the CSC Service Desk if you’re unsure.

    ‼️ If you write important data to the local disk in your interactive session or batch job, remember to copy the data back to /scratch before the job terminates! The local disk is cleaned immediately after your job, and salvaging any forgotten files is not possible afterwards. A minimal example of this copy-back pattern is sketched after the bonus exercise below.

    💭 Bonus exercise: Try to repeat the first part of this tutorial using a batch job!
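
💡 The following minimal sketch ties the pieces together: it repeats the download-and-merge steps from the first part of this tutorial on the fast local disk of a compute node and copies the result back to /scratch before the job ends, so it also gives one possible solution to the bonus exercise. It reuses the URL, file names and #SBATCH options shown above; replace <project> with your own CSC project and make sure the /scratch/<project>/$USER directory created earlier exists before submitting.

    #!/bin/bash
    #SBATCH --account=<project>      # Billing project. Has to be defined!
    #SBATCH --time=00:10:00          # Maximum duration of the job
    #SBATCH --partition=small        # Job queue
    #SBATCH --ntasks=1               # Serial job, so one task is enough
    #SBATCH --gres=nvme:1            # Request 1 GB of fast local disk

    # Work on the fast local disk of the compute node
    cd $LOCAL_SCRATCH
    wget https://a3s.fi/CSC_training/Individual_files.tar.gz
    tar -xavf Individual_files.tar.gz
    cd Individual_files
    find . -name 'individual.fasta*' | xargs cat >> Merged.fasta

    # Copy the result back to /scratch before the job finishes;
    # the local disk is cleaned immediately after the job
    cp Merged.fasta /scratch/<project>/$USER/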

More information

💡 Docs CSC: Temporary local disk areas

💡 Docs CSC: Local storage on Puhti

💡 Docs CSC: Local storage on Mahti