This exercise covers retrieving data from various commonly used bio data repositories.
Start an interactive session:
sinteractive --account <project> # replace <project> with your CSC project, e.g. project_2001234
Load the biokit module:
module load biokit
Create a personal subdirectory under the /scratch directory of your project and move there:
mkdir -p /scratch/<project>/$USER # replace <project> with your CSC project, e.g. project_2001234
cd /scratch/<project>/$USER # replace <project> with your CSC project, e.g. project_2001234
💠 Everyone in a project shares the same /scratch directory, so it is a good idea to use subdirectories for each user and task to avoid accidentally deleting or overwriting others’ files.
🗯 In normal usage it may be a good idea to use the chmod command to alter file access rights so that only you have write access to your own subfolder, but please do not do this if you are using a CSC course project, as it will make clean-up after the course harder.
💡 You can find more information about this on the Disk areas page in Docs CSC.
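As a minimal sketch of the chmod suggestion above, the following removes write access for group and others from a personal subdirectory. It runs against a temporary stand-in directory, not a real /scratch path; the names workdir and mydir are hypothetical, and the same caveat about CSC course projects applies.

```shell
# Hypothetical stand-in for /scratch/<project>/$USER so the sketch runs anywhere.
workdir=$(mktemp -d)
mydir="$workdir/scratch_demo/${USER:-demo_user}"
mkdir -p "$mydir"

# Keep full access for yourself, remove write access from group and others.
chmod u+rwx,go-w "$mydir"

# Show the resulting permissions of the directory itself.
ls -ld "$mydir"

# Clean up the stand-in afterwards with: rm -rf "$workdir"
```

On a real project you would run the chmod line on /scratch/<project>/$USER instead of the temporary directory.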
curl
curl and wget are general tools for downloading data from a URL.
Use curl to download a dataset and uncompress it. The dataset contains some Pythium genomes with related BWA indexes.
curl https://a3s.fi/course_12.11.2019/pythium.tgz > pythium.tgz
ls
tar -zxvf pythium.tgz
ls
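Before unpacking a downloaded archive it can be useful to list its contents first. The sketch below illustrates this with tar -tzf on a small archive that it builds itself, so it does not need network access; demo.tgz and demo_genomes are made-up names standing in for pythium.tgz and its contents.

```shell
# Build a small stand-in archive (hypothetical names, not the real pythium.tgz).
workdir=$(mktemp -d)
mkdir -p "$workdir/demo_genomes"
printf '>seq1\nACGT\n' > "$workdir/demo_genomes/example.fasta"
tar -czf "$workdir/demo.tgz" -C "$workdir" demo_genomes

# List the archive contents without extracting anything.
tar -tzf "$workdir/demo.tgz"

# Count the entries (directories and files) in the archive.
n_entries=$(tar -tzf "$workdir/demo.tgz" | wc -l | tr -d ' ')
echo "Archive entries: $n_entries"

# Extract into a separate directory so existing files are not overwritten.
mkdir "$workdir/unpacked"
tar -xzf "$workdir/demo.tgz" -C "$workdir/unpacked"
ls "$workdir/unpacked/demo_genomes"

# Clean up the stand-in afterwards with: rm -rf "$workdir"
```

For the exercise data the equivalent check would be tar -tzf pythium.tgz, run before tar -zxvf pythium.tgz.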
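Create a new directory called cellulose_synthase and move to this new directory:
mkdir cellulose_synthase
cd cellulose_synthase
Search the NCBI protein database for Pythium sequences, first for all of them and then for cellulose synthase 1 only (the number of hits is on the count row in the results):
esearch -db protein -query "Pythium [ORGN]"
esearch -db protein -query "Pythium [ORGN] AND cellulose synthase 1 [PROT]"
Retrieve the Pythium cellulose synthase 3 sequences in FASTA format:
esearch -db protein -query "Pythium [ORGN] AND cellulose synthase 3 [PROT]" | efetch -format fasta > cesy3.fasta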
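A quick way to check how many sequences ended up in a FASTA file is to count the header lines, which start with ">". The sketch below does this on a small made-up file (demo.fasta with invented records) so that it runs without access to NCBI; on the exercise data the same grep would be applied to cesy3.fasta.

```shell
# Create a small stand-in FASTA file with invented records.
workdir=$(mktemp -d)
cat > "$workdir/demo.fasta" <<'EOF'
>demo_1 hypothetical protein A
MKTLLVAGAV
>demo_2 hypothetical protein B
MSEQNNTEMT
>demo_3 hypothetical protein C
MADKLTRIAI
EOF

# Each record has exactly one header line, so counting headers counts sequences.
n_seqs=$(grep -c '^>' "$workdir/demo.fasta")
echo "Sequences in file: $n_seqs"

# List only the header lines to see which records were retrieved.
grep '^>' "$workdir/demo.fasta"

# Clean up the stand-in afterwards with: rm -rf "$workdir"
```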
Can you figure out the esearch command that tells how many cellulose synthase 3 sequences there are in total in the NCBI protein database?
Align the retrieved sequences with mafft and inspect the alignment:
mafft cesy3.fasta > cesy3_aln.fasta
infoalign cesy3_aln.fasta
showalign cesy3_aln.fasta
Check the enaDataGet help with the command:
enaDataGet -h
enaDataGet AKYA02000000 -f fasta
gunzip AKYA02.fasta.gz
ls
head -20 AKYA02.fasta
tail AKYA02.fasta
infoseq_summary AKYA02.fasta
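If you only want a quick look at a compressed FASTA file, you can read it through gunzip -c without replacing the .gz file. The sketch below shows this on a small made-up file (demo.fasta.gz with invented contigs), so it runs without downloading anything from ENA.

```shell
# Create and compress a small stand-in FASTA file with invented contigs.
workdir=$(mktemp -d)
printf '>contig_1 demo\nACGTACGT\n>contig_2 demo\nGGCCTTAA\n' > "$workdir/demo.fasta"
gzip "$workdir/demo.fasta"   # produces demo.fasta.gz and removes demo.fasta

# Peek at the first header without decompressing the file on disk.
first_header=$(gunzip -c "$workdir/demo.fasta.gz" | head -n 1)
echo "First header: $first_header"

# Count the contigs directly from the compressed file.
n_contigs=$(gunzip -c "$workdir/demo.fasta.gz" | grep -c '^>')
echo "Contigs: $n_contigs"

# Decompress to a new file while keeping the compressed original.
gunzip -c "$workdir/demo.fasta.gz" > "$workdir/demo.fasta"
ls "$workdir"

# Clean up the stand-in afterwards with: rm -rf "$workdir"
```

Keeping the compressed copy is optional; plain gunzip, as used in the exercise, replaces the .gz file with the uncompressed one.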
Finally, close the interactive session with the command exit.