Day 1: Data pre-processing
This hands-on example is based on Mothur’s MiSeq SOP.
The data consist of the following 19 mouse fecal samples collected after weaning from one animal:
- 9 early samples, collected on days 0-9 post weaning
- 10 late samples, collected on days 141-150 post weaning
Our aim in these exercises is to find out if the early samples differ from the late samples in terms of microbial community structure. All samples were sequenced using Illumina MiSeq. We have overlapping 251 bp paired end reads which come from the 253 bp long V4 region of the 16S rRNA gene.
Step 1. Start Chipster and open a session
Go to the Chipster website and log in. In the session list, scroll down to Training sessions and select course_16S_rRNA_community_analysis_MiSeq. This session has 38 zipped fastq files. Save your own copy of the session: click the three dots in the Session info section and select Rename.
Step 2. Package all the FASTQ files in a tar package
Select all the FASTQ files:
Go to the List tab, click on the first file, keep the shift key down and click on the last file.
Select the toolset
NGS
, categoryUtilities
, and scroll down to the toolMake a Tar package
and select it.Click Parameters, set
File name for tar package = zippedFastq
, and click Run. SelectRun Tool 1 job
.
Note that after this point in real life you can delete the individual
FASTQ files, because you can always open the tar package using the tool
Utilities / Extract .tar.gz file
.
Step 3. Check the quality of reads with MultiQC
Select the file zippedFastq.tar
and the tool
Quality control / Read quality with MultiQC for many FASTQ files
,
and click Run. Select the result file and click Open in new
tab.
How long are the reads (in the General Statistics section, click
Configure columns, select Average Read Length and unselect M sequences)?
Based on the plot Sequence Counts, do all the samples have the
same number of reads? Are most of the reads unique?
Based on the plot Sequence Quality Histograms, is the base quality
good all along the reads?
Do the samples marked with red have something in common (click the
traffic light bar above the plot)?
Step 4. Remove primers and adapters
Here we skip this step, because our data was already processed to
remove primers and sequencing adapters. When you are working
with your own data that still contains primers and adapters, it is
important to remove them. In Chipster, you can use the tool
Remove primers and adapters with Cutadapt
in the category
Preprocessing
to remove them.
In parameters, you would set
Is the data paired end or single end reads = paired
,
5' adapter
is the forward primer, and
3'adapter
is the reverse primer as it appears in the
forward reads, usually reverse complement of the reserve primer. You can
use the tool Identify primers and the correct orientation
in category Microbial amplicon data preprocessing for ASV
to check the orientation of the primers, get the reverse complement
version of the reverse primer and check that the primers have been
removed correctly.
But in this exercise data set, primers and adapters have already been removed, so let’s continue to the next step:
Step 5. Combine paired reads into sequence contigs
Select zippedFastq.tar
and run the tool
Microbial amplicon data preprocessing for OTU / Combine paired reads to contigs with VSEARCH
.
Select summary_stats.tsv
and view it as Spreadsheet.
Are there roughly the same number of merged pairs in each sample?
What percentage of read pairs were not merged?
What is the mean number of expected errors?
Check in the samples.fastqs.txt if the FASTQ files were assigned
correctly to samples.
Step 6. Filter contigs to reduce errors
Select contigs.tar
and run the tool
Filter sequences based on the number of expected errors
Select summary.tsv
.
What percentage of contigs were discarded?
When would this step be particularly useful?
Step 7. Combine contigs into one FASTA file and create a count file
Select filtered_contigs.tar
and run the tool
Combine FASTQ files into one FASTA file and make a Mothur count file
.
Select sequences-summary.tsv
and view it as
spreadsheet.
How many sequences are there in the data?
How long are the contigs?
Are there ambiguous bases in the contigs?
How many contigs are there in total?
Open the sequences.count_table file as a
spreadsheet. What information does it contain?
Step 8. Remove suspiciously long sequences and sequences containing ambiguous bases
If you see any suspiciously long contigs or contigs that contain ambiguous bases, this tool can be used to remove those. In this example there should be none, but let’s run this step anyway to get familiar with the tool.
Choose sequences.fasta.gz
and
sequences.count_table
(use ctrl / cmd key to select
multiple files). Select the tool
Screen sequences for several criteria
, set
Maximum length of the sequences = 275
and
Maximum number of ambiguous bases = 0
, and run the
tool.
Open summary.screened.tsv
as spreadsheet.
In general, pay attention here if all the overly long contigs and ambiguous bases were removed, and how many sequences are left.
In this example, why was no new count file created?
Step 9. Remove identical sequences
Select screened.fasta.gz
and
sequences.count_table
, and run the tool
Extract unique sequences
.
Open unique.summary.tsv
.
How many unique sequences vs. total sequence are there?
Why are we removing the identical sequences?
Open unique.count_table
as spreadsheet.
What do the rows and columns represent?
Step 10. Align sequences to reference
Select the files unique.fasta.gz
and
unique.count_table
, and run the tool
Align sequences to reference
so that you use only the
following section of the reference template alignment:
Start = 13 862
and End = 23 444
.
(Note that these start and end values are selected for this specific V4 region amplicon. To modify the values for other regions, align an example sequence (a small set of samples or the same fragment from E. coli 16S rRNA gene) with the same tool and check where it is placed in the reference.)
Open aligned-summary.tsv
.
Where in the reference alignment do most of the contigs align?
Are there deviants?
Open also custom.reference.summary.tsv
.
`How long is the longest homopolymer in the reference? How many sequences does the reference contain?
Step 11. Remove sequences which align outside the common alignment range or contain homopolymers of excessive length
Select aligned.fasta.gz
and
unique.count_table
, and the tool
Screen sequences for several criteria
. Set
Start position = 1
, End position = 9583
, and
Maximum homopolymer length = 6
. Run the tool.
Open summary.screened.tsv
as spreadsheet.
How many unique sequences were removed (hint: compare with aligned-summary.tsv)?
How many sequences were removed overall?
Were these sequences removed because:
1) they aligned outside the common alignment range, or
2) they contained too long homopolymers?
Step 12. Remove gaps and overhangs from the alignment. If this creates new identical sequences, they will be removed
Choose screened.fasta.gz
and
screened.count_table
and run the tool
Filter sequence alignment
.
Open filtered-log.txt
.
How many columns we removed? How long is the alignment now?
Did the filtering generate new identical sequences, which were subsequently removed
(compare summary.screened.tsv and filtered-unique-summary.tsv)?
Step 13. Precluster the aligned sequences
Select filtered-unique.fasta.gz
and
filtered-unique.count_table
, and the tool
Precluster aligned sequences
. Set Number of differences
allowed = 1 and run the tool.
Open preclustered-summary.tsv
.
How many unique sequences do we have now?
Step 14. Remove chimeras
Select preclustered.fasta.gz
and
preclustered.count_table
. and the tool
Remove chimeric sequences
. Select
Reference = none, de novo
, Method = vsearch
and Dereplicate = true
and run the tool.
How many chimeras were removed?
Step 15. Classify sequences
Choose chimeras.removed.fasta.gz
and
chimeras.removed.count_table
, and run the tool
Classify 16S or 18S sequences to taxonomic units using Silva.
Open classification-summary.tsv
and select Full Screen.
Sort the table by the taxon column (click on the word taxon).
Does your data contain chloroplast sequences?
Step 16. Share your analysis session with a colleague
Select your session, click on the three dots, and select Share. Create a new access rule: enter haka/ekorpela@csc.fi in the UserID field and set Rights = read-only.