In this tutorial you will:
- Familiarize yourself with efficient ways to find out where you have a lot of files and data on Puhti.
💬 CSC has developed a tool called LUE for keeping track of how much data and how many files one has on disk. Conventional tools such as `stat` or `du` are slow and heavy on the parallel file system, while `lue` is significantly faster. The speed comes at the cost of a slight loss in accuracy, which is usually not a problem. See Docs CSC for a list of possible caveats.
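For comparison, the conventional approach mentioned above can be sketched roughly as follows. The helper name `count_and_size` is hypothetical and only for illustration. Because `find` and `du` touch every file individually, this should only ever be pointed at a small directory on a parallel file system.

```shell
# Sketch only: the conventional find/du approach that lue is meant to replace.
# count_and_size is a hypothetical helper name, not part of any CSC tool.
count_and_size() {
  local target="$1"

  # Count regular files below the target (tr strips padding some wc versions add).
  local file_count
  file_count=$(find "$target" -type f | wc -l | tr -d ' ')

  # Total disk usage in kilobytes (-s: summary only, -k: 1 KiB units).
  local size_kb
  size_kb=$(du -sk "$target" | cut -f1)

  echo "files: $file_count, size: ${size_kb} KB"
}

# Example call (commented out): try it on a small directory only.
# count_and_size "$HOME/some_small_dir"
```

Each file is inspected separately here, which is exactly the access pattern that makes these tools slow on large directory trees.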
☝🏻 Keeping track of how much data/files one has on the disk and (re)moving it in a timely manner (e.g. to Allas) is very important to ensure a more performant file system for all users.
Load the `lue` module and view its help text:
module load lue
lue --help
Run `lue` on your `$HOME` directory (i.e. `/users/$USER`):
lue $HOME
💡 You can also try some other directory, e.g. in your project's `/scratch`. However, don't run the tool on the whole project folder (e.g. `/scratch/project_2001234`). Instead, choose a smaller subdirectory where you think you might have a lot of files or data. Some operations can be both slow and heavy on the file system! By default, the tool will only fetch size data for 30 minutes before quitting. Alternatively, you can limit the runtime of the tool as instructed in Docs CSC.
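As a generic illustration of capping the runtime of a command, the sketch below wraps it in `timeout` from GNU coreutils. This is not LUE's own runtime option (see Docs CSC for that), and it is an assumption that interrupting `lue` this way leaves its cache in a usable state. The helper name `run_limited` and the path `some_subdir` are hypothetical.

```shell
# Hypothetical helper: run a command, but give up after a wall-clock limit.
# Relies on GNU coreutils `timeout`; this is NOT lue's built-in option.
run_limited() {
  local limit="$1"
  shift
  timeout "$limit" "$@"
}

# Example call (commented out): scan one subdirectory for at most 10 minutes.
# run_limited 10m lue /scratch/project_2001234/some_subdir
```

`timeout` exits with status 124 when the limit is reached, so a script can tell an interrupted scan from a completed one.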
Rerunning find for /users/$USER
Total size: 922595190 Processed files: 14036 Permission denied: 0 Missing size: 2, Other err: 0
path, total size, in dir size, % of total, % of dir
---------------------------------------------------
/users/$USER 910MB 384KB 100.0 100.0
...
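If you want to use the numbers from the summary line in a script, they can be extracted with standard text tools. The sketch below works on the sample line shown above and assumes the labels keep that exact wording, which may differ between LUE versions. The helper name `field` is hypothetical.

```shell
# Sample summary line copied from the example output above.
summary='Total size: 922595190 Processed files: 14036 Permission denied: 0 Missing size: 2, Other err: 0'

# Hypothetical helper: print the number that follows a given label.
field() {
  printf '%s\n' "$summary" | sed -n "s/.*$1: \([0-9][0-9]*\).*/\1/p"
}

total_bytes=$(field 'Total size')
processed=$(field 'Processed files')
missing=$(field 'Missing size')

echo "bytes=$total_bytes files=$processed missing_size=$missing"
# prints: bytes=922595190 files=14036 missing_size=2
```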
The output reports the number of `Processed files` and a breakdown of the subdirectory sizes in the table below it. How many files do you have in total in your `$HOME`? Which directory is the largest in size?
Run `lue` with the `--count` flag to display the number of files in each directory instead of the size. Which directory in `$HOME` contains the most files?
lue --count $HOME
💡 To get more detailed information, use the `--display-level=<n>` flag to show a deeper directory hierarchy. Alternatively, rerun the query for individual subdirectories.
☝🏻 LUE stores a very simple cache of runs in `$TMPDIR`. This means that once a directory has been scanned, you can query any of its subdirectories without re-querying anything from the file system. To rerun the query from scratch, add the flag `--refresh`. This might be needed to get a more accurate estimate of the file count, e.g. if the cache file is old.
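The cache behaviour described above suggests a workflow along the following lines. This is a sketch that uses only the flags mentioned in this tutorial. The path `$HOME/some_subdir` is hypothetical, and the guard simply skips the commands if `lue` is not on your `PATH`.

```shell
# Sketch of the cache workflow; assumes `module load lue` has been run.
# "$HOME/some_subdir" is a hypothetical path used for illustration.
if command -v lue >/dev/null 2>&1; then
  lue "$HOME"               # first run queries the file system and fills the cache
  lue "$HOME/some_subdir"   # subdirectory query can be answered from the cache
  lue --refresh "$HOME"     # ignore the cache and query the file system again
  lue_status="ran"
else
  echo "lue not found; run 'module load lue' first" >&2
  lue_status="missing"
fi
```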
💡 See Docs CSC for more information about managing data on Puhti and Mahti `/scratch` disks and using LUE (e.g. how to fix `NOSIZE`/`NOPERM` errors).