Day 1: Data pre-processing
This is an example workflow for analysing amplicon sequence data of single-end reads from Ion Torrent sequencing.
The data consist of 16 samples of willow catkins to study plant-associated bacteria. The samples are a subset from a larger study that aimed to investigate the effect of pollinator visits on willow catkin microbes (https://doi.org/10.1007/s00442-022-05285-7).
The samples here were collected from
- two sites
- HP: no honeybee hives closeby
- KEK: honeybee hives present at the site
- two treatments (4 replicates each)
- ps: willow catkins protected from pollinators by a mesh bag
- c: control catkins that pollinators visited
Our aim is to find out if site and/or protection from pollinators influenced the bacterial community in the willow catkins. All samples were sequenced using Ion Torrent. The amplified region is the V6-V8 region of the bacterial 16S rRNA gene and the length of the amplicon is ca. 350 bp. The samples have been demultiplexed and barcodes removed, and each FASTQ file corresponds to one sample.
Step 1. Start Chipster and open a session
Go to the Chipster website and log in. In the session list, scroll down to Training sessions and select course_16S_rRNA_community_analysis_IonTorrent. This session has 16 zipped fastq files. Save your own copy of the session: click the three dots in the Session info section on the lower right and select Save a copy.
Step 2. Package all the fastq files in a tar package
Select all the FASTQ files:
- Go to the List tab, click on the first file, keep the shift key down and click on the last file.
- Select the tool
Utilities / Make a tar package
. - Click Parameters, set
File name for tar package = zippedFastq
, and close the parameter window. Click Run and select Run Tool 1 job.
Note that after this point in real life you can delete the individual
FASTQ files, because you can always open the tar package using the tool
Utilities / Extract .tar.gz file
.
Step 3. Check the quality of reads with MultiQC
Select the file zippedFastq.tar
and the tool
Quality control / Read quality with MultiQC for many FASTQ files
,
and click Run. Select the result file and click Open in new
tab.
How long are the reads?
Based on the plot Sequence Counts, do all the samples have the
same number of reads? Are most of the reads unique?
Based on the plot Sequence Quality Histograms, is the base quality
good all along the reads?
Step 4. Remove primers with Cutadapt
Select zippedFastq.tar
and select the tool
Microbial amplicon data preprocessing for ASV / Remove primers and adapters with Cutadapt
.
Click Parameters and set
Is the data paired end or single end reads = single
.
Give the primer sequences:
- Forward primer (in this case an M13 linker and the forward
primer):
The 5' adapter = TGTAAAACGACGGCCAGTGTCAGCTCGTGYYGTGAG
- Reverse primer:
The 3' adapter = TTGYACACACCGCCCGT
(reverse complement of the reverse primer)
Also set Remove reads which were not trimmed = yes
.
Close the parameter window and run the tool
Check the Cutadapt output (report.txt).
Look at the first sample and check how many times the forward primer (Adapter 1)
and reverse primer (Adapter 2) were trimmed.
Does Cutadapt with these settings detect exact matches of the primers?
Step 5. Trim sequences for base quality and sequence length with Trimmomatic
Choose adapters_removed.tar
and select the tool
Preprocessing / Trim Ion Torrent reads reads with Trimmomatic
.
Set Sliding window trimming parameters = 10:20
. Here we are
telling Trimmomatic to slide along each sequence read in stretches of 10
bases, and if the average quality score of the window is below 20, trim
the read at that point. Also set
Minimum length of reads to keep = 200
and run the tool.
Choose trimmed.tar
and run the MultiQC tool again
(Quality control / Read quality with MultiQC for many FASTQ files
)
Do you see any differences in the sequence counts and sequence quality histograms
compared to the first time we ran the tool before primer removal and quality trimming?
Step 6. Combine FASTQ files and create count file
Select trimmed.tar
run the tool
Microbial amplicon data preprocessing for OTU / Combine FASTQ files into one FASTA file and make a Mothur count file
.
Check the summary file sequences-summary.tsv
.
How long are most of the sequences?
How long is the longest sequence?
Are there any ambiguous bases in the sequences?
How many sequences are there in total?
(Step 7. Screen sequences for ambiguous bases and suspiciously long sequences)
Generally at this point, we would screen the sequences for example as
follows: Select sequences.fasta.gz
and
sequences.count_table
(use ctrl / cmd key to select
multiple files). Select the tool
Screen sequences for several criteria
and in parameters set
Maximum number of ambiguous bases = 0
and
Maximum length = 400
. However, we see from the output of
the previous step that these steps are not necessary for this data set
(there are no ambiguous bases and the longest sequence is only 333 bp),
so let’s skip this step and continue to step 8.
Step 8. Remove identical sequences
Select the files sequences.fasta.gz
and
sequences.count_table
(use ctrl / cmd key to select
multiple files) and run the tool
Extract unique sequences
.
Note: If we had run step 7, we would have selected files
screened.fasta.gz
and screened.count_table
here.
Check the summary file unique.summary.tsv
.
How many unique sequences vs. total sequence are there?
Why are we removing the identical sequences?
Step 9. Align sequences to reference
Next, we will align our sequences against the Silva v. 138.1
reference alignment. Select the files unique.fasta.gz
and
unique.count_table
, and run the tool
Align sequences to reference
so that you use only the
following section of the reference template alignment:
Start = 28500
and End = 43100
.
(Note that these start and end values are selected for this specific V6-V8 region amplicon. To modify the values for other regions, align an example sequence (a small set of samples or the same fragment from E. coli 16S rRNA gene) with the same tool and check where it is placed in the reference.)
Open aligned-summary.tsv
.
Where in the reference alignment do most of the sequences align?
Are there deviants?
Open also custom.reference.summary.tsv
.
How long are the longest homopolymers in the reference?
How many sequences does the reference contain?
Step 10. Remove sequences which align outside the common alignment range
Select aligned.fasta.gz
and
unique.count_table
, and the tool
Screen sequences for several criteria
. Set
Alignment start position = 5963
(where most sequences start
based on aligned-summary.tsv
),
Alignment end position = 12353
(before most sequences end),
and Maximum homopolymer length = 8
. Run the tool.
Open summary.screened.tsv
as spreadsheet.
How many unique sequences were removed (hint: compare with aligned-summary.tsv)?
How many sequences were removed overall?
Were these sequences removed because:
1) they aligned outside the common alignment range, or
2) they contained too long homopolymers?
Step 11. Remove gaps and overhangs from the alignment. If this creates new identical sequences, they will be removed
Choose screened.fasta.gz
and
screened.count_table
and run the tool
Filter sequence alignment
.
Open filtered-log.txt
.
How many columns we removed? How long is the alignment now?
Did the filtering generate new identical sequences, which were subsequently removed
(compare summary.screened.tsv and filtered-unique-summary.tsv)?
Step 12. Precluster the aligned sequences
Select filtered-unique.fasta.gz
and
filtered-unique.count_table
, and the tool
Precluster aligned sequences
. Set
Number of differences allowed = 1
and run the tool.
Open preclustered-summary.tsv
.
How many unique sequences do we have now?
Step 13. Remove chimeras
Select preclustered.fasta.gz
and
preclustered.count_table
and the tool
Remove chimeric sequences
. Select
Reference = none, de novo
, Method = vsearch
and Dereplicate = true
and run the tool.
Open chimeras.removed.summary.tsv
.
How many chimeras were removed?
Step 14. Classify sequences
Choose chimeras.removed.fasta.gz
and
chimeras.removed.count_table
, and run the tool
Classify 16S or 18S sequences to taxonomic units using Silva.
Open classification-summary.tsv
and select Full Screen.
Sort the table by the taxon column (click on the word taxon).
Does your data contain chloroplast sequences?
Step 15. Share your analysis session with a colleague
Select your session in the session menu, click on the three dots, and select Share. Create a new access rule: enter haka/ekorpela@csc.fi in the UserID field and set Rights = read-only.