Exercise: Retrieving data from bio data repositories
This exercise covers retrieving data from various commonly used bio data repositories.
- We will do these exercises in an interactive session launched using the sinteractive command:
sinteractive --account <project> # replace <project> with your CSC project, e.g. project_2001234
- Alternatively, open a compute node shell through the Puhti web interface.
- To access the applications in parts 2 and 3, we will need to load the biokit module:
module load biokit
- Create a directory for yourself under the /scratch directory of your project and move there:
mkdir -p /scratch/<project>/$USER # replace <project> with your CSC project, e.g. project_2001234
cd /scratch/<project>/$USER # replace <project> with your CSC project, e.g. project_2001234
💠 Everyone in a project shares the same /scratch directory, so it is a good idea to use subdirectories for each user and task to avoid accidentally deleting or overwriting others’ files.
🗯 In normal usage it may be a good idea to use the chmod command to alter file access rights so that only you have write access to your own subfolder, but please do not do this if you are using a CSC course project, as it will make clean-up after the course harder.
💡 You can find more information about this on the Disk areas page in Docs CSC.
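As a minimal sketch of the chmod suggestion above, the following removes write permission for the group and for others from a single directory. The function name restrict_write is a hypothetical helper, not something provided by CSC, and as noted above it should not be used in a CSC course project.

```shell
# Hypothetical helper (not a CSC tool): remove write permission for
# group and others from one directory, leaving the owner's rights untouched.
restrict_write() {
    chmod go-w "$1"
}

# Example use outside a course project:
#   restrict_write /scratch/<project>/$USER
#   ls -ld /scratch/<project>/$USER   # inspect the resulting permissions
```

The change is not recursive, so files and subdirectories created earlier keep their existing permissions.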
1. Downloading data with curl
curl and wget are general tools for downloading data from a URL.
- Download a dataset from the internet using curl and uncompress it. The dataset contains some Pythium genomes with related BWA indexes.
curl https://a3s.fi/course_12.11.2019/pythium.tgz > pythium.tgz
ls
tar -zxvf pythium.tgz
ls
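Before unpacking an archive it can be useful to see what it contains, and to unpack it into a directory of its own. The sketch below shows one way to do both with tar. The function names and the directory name pythium_data are hypothetical and chosen only for illustration.

```shell
# Hypothetical helper: list the contents of a gzip-compressed tar
# archive without extracting anything.
list_archive() {
    tar -tzf "$1"
}

# Hypothetical helper: extract an archive into a chosen directory,
# creating the directory first if it does not exist.
extract_archive() {
    mkdir -p "$2"
    tar -xzf "$1" -C "$2"
}

# Example use with the file downloaded above:
#   list_archive pythium.tgz | head
#   extract_archive pythium.tgz pythium_data   # pythium_data is a made-up name
```

Extracting into a separate directory keeps the unpacked files apart from other data in your scratch subdirectory.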
2. Downloading data with NCBI edirect
- Create the directory cellulose_synthase and move into it:
mkdir cellulose_synthase
cd cellulose_synthase
- Next we use the NCBI edirect tool to retrieve some data.
- Check how many proteins are found in the NCBI protein database for Pythium species (the count row in the results):
esearch -db protein -query "Pythium [ORGN]"
- Check the number of proteins for cellulose synthase 1, cellulose synthase 2 and cellulose synthase 3 that are found for Pythium species.
- For cellulose synthase 1 this can be done with:
esearch -db protein -query "Pythium [ORGN] AND cellulose synthase 1 [PROT]"
- Do the same for the other proteins.
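The three searches differ only in the protein number, so they can be run in a loop. The sketch below is one way to do that. Both function names are hypothetical helpers, and extract_count assumes that esearch reports the hit count on a line of the form <Count>123</Count>.

```shell
# Hypothetical helper: build the query for cellulose synthase N in Pythium.
cesy_query() {
    printf 'Pythium [ORGN] AND cellulose synthase %s [PROT]' "$1"
}

# Hypothetical helper: read esearch output on stdin and print only the
# number inside the Count element (assumes the form <Count>123</Count>).
extract_count() {
    sed -n 's:.*<Count>\([0-9][0-9]*\)</Count>.*:\1:p'
}

# Example use (requires esearch from the biokit module and network access):
#   for n in 1 2 3; do
#       esearch -db protein -query "$(cesy_query "$n")" | extract_count
#   done
```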
- Retrieve the cellulose synthase 3 sequences in FASTA format:
esearch -db protein -query "Pythium [ORGN] AND cellulose synthase 3 [PROT]" | efetch -format fasta > cesy3.fasta
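A quick way to check the retrieved file is to count its sequences and compare the number with the count reported by esearch. The sketch below relies only on the fact that every FASTA record starts with a header line beginning with ">". The function name count_fasta is a hypothetical helper.

```shell
# Hypothetical helper: count the sequences in a FASTA file by counting
# header lines, which start with ">".
count_fasta() {
    grep -c '^>' "$1"
}

# Example use with the file created above:
#   count_fasta cesy3.fasta
```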
- Which esearch command tells you how many cellulose synthase 3 sequences there are in total in the NCBI protein database? Run it.
Extra exercise for fast ones
- Align the cellulose synthase 3 set with mafft:
mafft cesy3.fasta > cesy3_aln.fasta
- Study the results:
infoalign cesy3_aln.fasta
showalign cesy3_aln.fasta
3. Downloading with enaDataGet
- Check the options of enaDataGet with the command:
enaDataGet -h
- Download a file (Pythium iwayamai genome assembly):
enaDataGet AKYA02000000 -f fasta
gunzip AKYA02.fasta.gz
ls
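A compressed download can also be checked and previewed without unpacking it, which may be handy for large assemblies. The following is a minimal sketch that would be run before the gunzip step. The function name peek_gz is a hypothetical helper.

```shell
# Hypothetical helper: test the integrity of a gzip file, then print its
# first lines (5 by default) without writing an uncompressed copy to disk.
peek_gz() {
    gzip -t "$1" && gzip -dc "$1" | head -n "${2:-5}"
}

# Example use, before running gunzip:
#   peek_gz AKYA02.fasta.gz 2
```

Unlike gunzip, this leaves the .gz file in place.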
Extra exercise for fast ones
- Study the downloaded file:
head -20 AKYA02.fasta
tail AKYA02.fasta
infoseq_summary AKYA02.fasta
4. Finishing up
- Close the interactive session when you are done by typing exit.