Day 1: Data pre-processing

This hands-on example is based on Mothur’s MiSeq SOP.

The data consist of the following 19 mouse fecal samples collected after weaning from one animal:

9 early samples, collected on days 0-9 post weaning
10 late samples, collected on days 141-150 post weaning

Our aim in these exercises is to find out if the early samples differ from the late samples in terms of microbial community structure. All samples were sequenced using Illumina MiSeq. We have overlapping 251 bp paired end reads which come from the 253 bp long V4 region of the 16S rRNA gene.

Step 1. Start Chipster and open a session

Go to the Chipster website and log in. In the session list, scroll down to Training sessions and select course_16S_rRNA_community_analysis_MiSeq. This session has 38 zipped fastq files. Save your own copy of the session: click the three dots in the Session info section and select Rename.

Step 2. Package all the FASTQ files in a tar package

Select all the FASTQ files:

Go to the List tab, click on the first file, keep the shift key down and click on the last file.
Select the toolset NGS, category Utilities, and scroll down to the tool Make a Tar package and select it.
Click Parameters, set File name for tar package = zippedFastq, and click Run. Select Run Tool 1 job.

Note that after this point in real life you can delete the individual FASTQ files, because you can always open the tar package using the tool Utilities / Extract .tar.gz file.

Step 3. Check the quality of reads with MultiQC

Select the file zippedFastq.tar and the tool Quality control / Read quality with MultiQC for many FASTQ files, and click Run. Select the result file and click Open in new tab.

How long are the reads (in the General Statistics section, click 
Configure columns, select Average Read Length and unselect M sequences)?

Based on the plot Sequence Counts, do all the samples have the 
same number of reads? Are most of the reads unique?

Based on the plot Sequence Quality Histograms, is the base quality
good all along the reads? 

Do the samples marked with red have something in common (click the
traffic light bar above the plot)?

Step 4. Remove primers and adapters

Here we skip this step, because our data was already processed to remove primers and sequencing adapters. When you are working with your own data that still contains primers and adapters, it is important to remove them. In Chipster, you can use the tool Remove primers and adapters with Cutadapt in the category Preprocessing to remove them.

In parameters, you would set Is the data paired end or single end reads = paired, 5' adapter is the forward primer, and 3'adapter is the reverse primer as it appears in the forward reads, usually reverse complement of the reserve primer. You can use the tool Identify primers and the correct orientation in category Microbial amplicon data preprocessing for ASV to check the orientation of the primers, get the reverse complement version of the reverse primer and check that the primers have been removed correctly.

But in this exercise data set, primers and adapters have already been removed, so let’s continue to the next step:

Step 5. Combine paired reads into sequence contigs

Select zippedFastq.tar and run the tool Microbial amplicon data preprocessing for OTU / Combine paired reads to contigs with VSEARCH.

Select summary_stats.tsv and view it as Spreadsheet.

Are there roughly the same number of merged pairs in each sample?
What percentage of read pairs were not merged?
What is the mean number of expected errors?

Check in the samples.fastqs.txt if the FASTQ files were assigned
correctly to samples.

Step 6. Filter contigs to reduce errors

Select contigs.tar and run the tool Filter sequences based on the number of expected errors

Select summary.tsv.

What percentage of contigs were discarded?
When would this step be particularly useful?

Step 7. Combine contigs into one FASTA file and create a count file

Select filtered_contigs.tar and run the tool Combine FASTQ files into one FASTA file and make a Mothur count file.

Select sequences-summary.tsv and view it as spreadsheet.

How many sequences are there in the data? 
How long are the contigs? 
Are there ambiguous bases in the contigs?
How many contigs are there in total?

Open the sequences.count_table file as a 
spreadsheet. What information does it contain?

Step 8. Remove suspiciously long sequences and sequences containing ambiguous bases

If you see any suspiciously long contigs or contigs that contain ambiguous bases, this tool can be used to remove those. In this example there should be none, but let’s run this step anyway to get familiar with the tool.

Choose sequences.fasta.gz and sequences.count_table (use ctrl / cmd key to select multiple files). Select the tool Screen sequences for several criteria, set Maximum length of the sequences = 275 and Maximum number of ambiguous bases = 0, and run the tool.

Open summary.screened.tsv as spreadsheet.

In general, pay attention here if all the overly long contigs and ambiguous bases were removed, and how many sequences are left.
In this example, why was no new count file created?

Step 9. Remove identical sequences

Select screened.fasta.gz and sequences.count_table, and run the tool Extract unique sequences.

Open unique.summary.tsv.

How many unique sequences vs. total sequence are there?
Why are we removing the identical sequences?

Open unique.count_table as spreadsheet.

What do the rows and columns represent?

Step 10. Align sequences to reference

Select the files unique.fasta.gz and unique.count_table, and run the tool Align sequences to reference so that you use only the following section of the reference template alignment: Start = 13 862 and End = 23 444.

(Note that these start and end values are selected for this specific V4 region amplicon. To modify the values for other regions, align an example sequence (a small set of samples or the same fragment from E. coli 16S rRNA gene) with the same tool and check where it is placed in the reference.)

Open aligned-summary.tsv.

Where in the reference alignment do most of the contigs align? 
Are there deviants?

Open also custom.reference.summary.tsv. `How long is the longest homopolymer in the reference? How many sequences does the reference contain?

Step 11. Remove sequences which align outside the common alignment range or contain homopolymers of excessive length

Select aligned.fasta.gz and unique.count_table, and the tool Screen sequences for several criteria. Set Start position = 1, End position = 9583, and Maximum homopolymer length = 6. Run the tool.

Open summary.screened.tsv as spreadsheet.

How many unique sequences were removed (hint: compare with aligned-summary.tsv)? 
How many sequences were removed overall?

Were these sequences removed because: 
1) they aligned outside the common alignment range, or 
2) they contained too long homopolymers?

Step 12. Remove gaps and overhangs from the alignment. If this creates new identical sequences, they will be removed

Choose screened.fasta.gz and screened.count_table and run the tool Filter sequence alignment.

Open filtered-log.txt.

How many columns we removed? How long is the alignment now? 

Did the filtering generate new identical sequences, which were subsequently removed 
(compare summary.screened.tsv and filtered-unique-summary.tsv)?

Step 13. Precluster the aligned sequences

Select filtered-unique.fasta.gz and filtered-unique.count_table, and the tool Precluster aligned sequences. Set Number of differences allowed = 1 and run the tool.

Open preclustered-summary.tsv.

How many unique sequences do we have now?

Step 14. Remove chimeras

Select preclustered.fasta.gz and preclustered.count_table. and the tool Remove chimeric sequences. Select Reference = none, de novo, Method = vsearch and Dereplicate = true and run the tool.

How many chimeras were removed?

Step 15. Classify sequences

Choose chimeras.removed.fasta.gz and chimeras.removed.count_table, and run the tool Classify 16S or 18S sequences to taxonomic units using Silva.

Open classification-summary.tsv and select Full Screen. Sort the table by the taxon column (click on the word taxon).

Does your data contain chloroplast sequences?

Step 16. Share your analysis session with a colleague

Select your session, click on the three dots, and select Share. Create a new access rule: enter haka/ekorpela@csc.fi in the UserID field and set Rights = read-only.