Exercise: Retrieving data from bio data repositories

This exercise covers retrieving data from various commonly used bio data repositories.

  1. We will do these exercises in an interactive session launched using the sinteractive command:
sinteractive --account <project>   # replace <project> with your CSC project, e.g. project_2001234
  1. Alternatively, open a compute node shell through the Puhti web interface.
  2. To access the applications in parts 2 and 3, we will need to load the biokit module:
module load biokit
  1. Create a directory for yourself under the /scratch directory of your project and move there:
mkdir -p /scratch/<project>/$USER   # replace <project> with your CSC project, e.g. project_2001234
cd /scratch/<project>/$USER         # replace <project> with your CSC project, e.g. project_2001234
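
The two commands above can be wrapped so that the project name only has to be typed once. The following is a minimal sketch, not part of the CSC environment: `make_workdir` and its optional second argument (an alternative root directory, handy for trying it out somewhere other than /scratch) are hypothetical names introduced here for illustration.

```shell
# make_workdir is a hypothetical helper, not a CSC command.
# Usage: make_workdir <project> [scratch_root]
# Creates <scratch_root>/<project>/<user> and prints the resulting path.
make_workdir() {
    mw_project=$1
    mw_root=${2:-/scratch}
    # Fall back to `id -un` if $USER is not set in the environment.
    mw_user=${USER:-$(id -un)}
    mw_dir="$mw_root/$mw_project/$mw_user"
    mkdir -p "$mw_dir" && printf '%s\n' "$mw_dir"
}

# Example (replace project_2001234 with your own CSC project):
#   cd "$(make_workdir project_2001234)"
```

Because the function prints the path it created, it can be combined with cd as shown in the final comment.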

💭 Everyone in a project shares the same /scratch directory, so it is a good idea to use subdirectories for each user and task to avoid accidentally deleting or overwriting others’ files.

🗯 In normal usage it may be a good idea to use the chmod command to alter file access rights so that only you have write access to your own subfolder, but please do not do this if you are using a CSC course project, as it will make clean-up after the course harder.
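
For normal (non-course) usage, the access-right change mentioned above could look roughly like the sketch below. It runs on a throwaway directory so that nothing real is modified; `demo_dir` is a hypothetical name, and on Puhti the target would instead be your own subfolder under /scratch.

```shell
# Demonstration on a throwaway directory; demo_dir is a hypothetical name.
demo_dir=$(mktemp -d)
chmod 775 "$demo_dir"   # start from a group-writable directory, for illustration

# Remove write permission from group and others; the owner keeps full access.
chmod go-w "$demo_dir"

# Show the resulting permissions: only the owner columns should contain 'w'.
ls -ld "$demo_dir"
```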

💡 You can find more information about the disk areas on the Disk areas page in Docs CSC.
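
After running module load biokit, it can be useful to confirm that the tools used in this exercise are actually on your PATH before starting. The sketch below assumes a hypothetical helper called `check_tools`; the list of tool names is taken from the commands used later in this exercise.

```shell
# check_tools is a hypothetical helper: it reports which of the given
# commands cannot be found on PATH and returns non-zero if any is missing.
check_tools() {
    ct_missing=0
    for ct_tool in "$@"; do
        if ! command -v "$ct_tool" >/dev/null 2>&1; then
            echo "missing: $ct_tool"
            ct_missing=1
        fi
    done
    return "$ct_missing"
}

# Tools used in parts 1 to 3 of this exercise.
check_tools curl esearch efetch mafft enaDataGet ||
    echo "Some tools were not found; try: module load biokit"
```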

1. Downloading data with curl

  1. curl and wget are general-purpose tools for downloading data from a URL.
  2. Download a dataset from the internet using curl and unpack it. The dataset contains some Pythium genomes with related BWA indexes.
curl https://a3s.fi/course_12.11.2019/pythium.tgz > pythium.tgz
tar -zxvf pythium.tgz
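
A slightly more defensive variant of the download step is sketched below. `fetch_archive` and `list_archive` are hypothetical helper names introduced here; the curl options -f, -L and -o and the tar option -t are standard, but whether you need them for this particular dataset is an assumption.

```shell
# fetch_archive is a hypothetical helper wrapping curl.
#   -f  fail on HTTP errors instead of saving the server's error page
#   -L  follow redirects
#   -o  write to the named file rather than to standard output
# Usage: fetch_archive <url> <output_file>
fetch_archive() {
    curl -f -L -o "$2" "$1"
}

# list_archive prints the contents of a gzipped tar file without unpacking it.
# Usage: list_archive <archive.tgz>
list_archive() {
    tar -tzf "$1"
}

# Example with the course dataset:
#   fetch_archive https://a3s.fi/course_12.11.2019/pythium.tgz pythium.tgz
#   list_archive pythium.tgz | head
```

Listing the archive first shows which files and directories tar will create before anything is written to disk.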

2. Downloading data with NCBI edirect

  1. Create a directory called cellulose_synthase and move into it:
mkdir cellulose_synthase
cd cellulose_synthase
  1. Next, we use the NCBI EDirect tools to retrieve some data.
  2. Check how many proteins are found in the NCBI protein database for Pythium species (see the Count row in the results):
esearch -db protein -query "Pythium [ORGN]"
  1. Check the number of proteins for cellulose synthase 1, cellulose synthase 2 and cellulose synthase 3 that are found for Pythium species.
  2. For cellulose synthase 1 this can be done with:
esearch -db protein -query "Pythium [ORGN] AND cellulose synthase 1 [PROT]"
  1. Do the same for the other proteins.
  2. Retrieve the cellulose synthase 3 sequences in FASTA format:
esearch -db protein -query "Pythium [ORGN] AND cellulose synthase 3 [PROT]" | efetch -format fasta > cesy3.fasta
  1. Run an esearch command that reports how many cellulose synthase 3 sequences there are in total in the NCBI protein database, that is, without restricting the search to Pythium.
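
When comparing several queries, it can be convenient to print only the hit count instead of the full esearch output. The sketch below assumes that the count appears in the output as a `<Count>` element on a single line; `count_from_esearch` is a hypothetical helper name, and the query shown is the cellulose synthase 1 query from the steps above.

```shell
# count_from_esearch is a hypothetical helper: it reads esearch output on
# standard input and prints only the number inside the <Count> element.
count_from_esearch() {
    sed -n 's:.*<Count>\([0-9][0-9]*\)</Count>.*:\1:p'
}

# Only run the query if EDirect is available (for example after loading biokit).
if command -v esearch >/dev/null 2>&1; then
    esearch -db protein -query "Pythium [ORGN] AND cellulose synthase 1 [PROT]" |
        count_from_esearch
fi
```

If the output contains no `<Count>` element, the helper prints nothing, which makes a failed query easy to spot.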

Extra exercise for fast ones

  1. Align the cellulose synthase 3 sequence set with mafft:
mafft cesy3.fasta > cesy3_aln.fasta
  1. Study the results:
infoalign cesy3_aln.fasta
showalign cesy3_aln.fasta
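
A quick sanity check on the retrieved and aligned files is to count their FASTA records: an alignment should contain the same number of sequences as its input. This is a minimal sketch in which `count_fasta_records` is a hypothetical helper name; it relies only on the FASTA convention that every record starts with a header line beginning with '>'.

```shell
# count_fasta_records is a hypothetical helper: every FASTA record starts
# with a header line beginning with '>', so counting those lines counts records.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Report the record count for each file from the steps above, if it exists.
for fasta_file in cesy3.fasta cesy3_aln.fasta; do
    if [ -f "$fasta_file" ]; then
        printf '%s\t%s\n' "$fasta_file" "$(count_fasta_records "$fasta_file")"
    fi
done
```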

3. Downloading with enaDataGet

  1. Check the options of enaDataGet with the command:
enaDataGet -h
  1. Download a file (the Pythium iwayamai genome assembly) and uncompress it:
enaDataGet AKYA02000000 -f fasta
gunzip AKYA02.fasta.gz
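
Note that gunzip replaces the compressed file with the uncompressed one. If you would rather check the download first, or keep the .gz file, something like the sketch below may help. It runs on a small throwaway file so that it can be tried anywhere; `gz_demo_dir` and demo.fasta are hypothetical names, not files from this exercise.

```shell
# Demonstration on a small throwaway file; the names here are hypothetical.
gz_demo_dir=$(mktemp -d)
printf '>demo\nACGT\n' > "$gz_demo_dir/demo.fasta"
gzip "$gz_demo_dir/demo.fasta"            # produces demo.fasta.gz

# Test the integrity of the compressed file without writing any output file.
gzip -t "$gz_demo_dir/demo.fasta.gz" && echo "archive OK"

# Decompress to a new file while keeping the .gz file in place.
gunzip -c "$gz_demo_dir/demo.fasta.gz" > "$gz_demo_dir/demo_copy.fasta"
ls "$gz_demo_dir"
```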

Extra exercise for fast ones

  1. Study the downloaded file:
head -20 AKYA02.fasta
tail AKYA02.fasta
infoseq_summary AKYA02.fasta
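
If you want a quick summary without the EMBOSS tools, the record count and total sequence length can also be computed with awk. This is only a rough sketch: `fasta_stats` is a hypothetical helper name, and it simply counts header lines and the characters on all other lines.

```shell
# fasta_stats is a hypothetical helper that prints two tab-separated numbers:
# the number of records and the total number of sequence characters.
fasta_stats() {
    awk '
        # Header line: count a new record and skip to the next line.
        /^>/ { records++; next }
        # Sequence line: strip whitespace, then add its length to the total.
        { gsub(/[ \t\r]/, ""); residues += length($0) }
        END { printf "%d\t%d\n", records, residues }
    ' "$1"
}

# Summarise the downloaded assembly if it is present in the current directory.
if [ -f AKYA02.fasta ]; then
    fasta_stats AKYA02.fasta
fi
```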

4. Finishing up

  1. Close the interactive session when you are done by typing exit.