CSC Computing Environment

Where do I have a lot of data?

In this tutorial you

💬 CSC has developed a tool called LUE for keeping track of much data/files one has on the disk. Conventional tools such as stat or du are slow and heavy on the parallel file system, while lue is significantly faster. This comes with a slight loss in accuracy, although this is usually not a problem. See Docs CSC for a list of possible caveats.

☝🏻 Keeping track of how much data/files one has on the disk and (re)moving it in a timely manner (e.g. to Allas) is very important to ensure a more performant file system for all users.

Querying where you have a lot of files and data

  1. On Puhti, start by loading the lue module:
module load lue
  1. Check out the different available options with:
lue --help
  1. See how much data you have in your $HOME directory (i.e. /users/$USER):
lue $HOME

💡 You can also try some other directory e.g. in your project’s /scratch. However, don’t run the tool on the whole project folder (e.g. /scratch/project_2001234), but choose instead a smaller subdirectory where you think you might have a lot of files or data. Some operations can be both slow and heavy on the file system! By default, the tool will only fetch size data for 30 mins before quitting. Alternatively, you can limit the runtime of the tool as instructed in Docs CSC.

  1. The first lines of the output should look like:
Rerunning find for /users/$USER
Total size: 922595190 Processed files: 14036 Permission denied: 0 Missing size: 2, Other err: 0
path, total size, in dir size, % of total, % of dir
/users/$USER        910MB  384KB 100.0 100.0
  1. You can see the total number of files in the directory on the second line after Processed files and a breakdown of the subdirectory sizes in the following table. How many files do you have in total in your $HOME? Which directory is the largest in size?
  2. Rerun lue with the --count flag to display the number of files in each directory instead of the size. Which directory in $HOME contains most files?
lue --count $HOME

💡 To get more detailed information, use the --display-level=<n> flag to show a deeper directory hierarchy. Alternatively, rerun the query for individual subdirectories.

☝🏻 LUE stores a very simple cache of runs in $TMPDIR. This means that you can run a query on any subdirectories without actually re-querying anything from the file system. To rerun the query from scratch, add the flag --refresh. This might be needed to get a more accurate estimate of the file count e.g. if the cache file is old.

More information

💡 See Docs CSC for more information about managing data on Puhti and Mahti /scratch disks and using LUE (e.g. how to fix NOSIZE/NOPERM errors).