Day 2: Data tidying and analysis
Getting the data into phyloseq
Step 14. Creating phyloseq
input files
Choose chimeras.removed.fasta.gz
, chimeras.removed.count_table
and sequences-taxonomy-assignment.txt
. Next, tun the tool Generate input files for phyloseq
so that you select the correct data type (16S, 18S or archaeal
) and set a dissimilarity threshold of 0.03 (i.e. 3%) for OTU clustering.
Why are we using a dissimilarity threshold of 3%?
Can you think of examples where another threshold might be more appropriate?
Step 15. Filling in the phenodata file
The Generate input files for phyloseq
tool produces a phenodata file (a file type in Chipster used for documenting sample groupings). Before proceeding with downstream analyses, let's take a moment to work on the phenodata. We'll imagine there are two columns with sample groupings that we're interested in: chiptype
and group
.
- Select
phenodata-file.opti_mcc.shared
.
Fill the chiptype
column so that samples F3D0-F3D147
(moving downward from the top row) are marked as NGS
, and that samples from F3D148
onward are marked as r
. You should end up with 9 samples with the chiptype marked as NGS
and 10 samples with r
. Note, these are entirely arbitrary groupings (we could also use other labels).
- Next, fill in the
group
column: mark all the early samples (D0-D9
) witha
, and the late samples (D141-D150
) withb
. These are also arbitrary labels.
The end result should look like this:
Step 16. Producing a phyloseq
object
Finally! It's the moment we've been waiting for. 😄
Select the Mothur shared file and constaxonomy file. Run the Convert Mothur files into phyloseq object
tool, noticing that here we need to specify the phenodata column including unique IDs for each community profile (in our case, the column sample
).
This tool produces two files:
- a phyloseq object (stored as
ps.Rda
) - a text summary of the imported object (
ps_imported.txt
).
Inspect the text summary a little closer.
What elements does the phyloseq object consist of?
Inspecting and tidying the data
Step 17. Sequence numbers, rarefaction curve and alpha diversity indices
For this step, select the Rda file produced in Step 15. Try running the Sequence numbers, rarefaction curve and alpha diversity estimates
tool. There are a couple of parameters to fill in - have a look at these in advance. Choose a phenodata variable to tabulate with the alpha diversity values and make sure you are telling the tool that we are working with raw data.
Can you already make some basic inferences about the data
based on the rarefaction curve and alpha diversity values?
Based on the rarefaction curve, what do you think might be
best - proceeding as is or rarefying the data to even depth?
Feel free to try the Rarefy OTU data to even depth
tool using the same Rda file and re-running the Sequence numbers...
tool using the resulting dataset (ps_rarefied.Rda). If doing this, remember that you need to specify in the Sequence numbers...
tool that the data are rarefied to even depth.
Step 18. Overview of taxon composition
Let's move on using the non-rarefied data set. Selecting ps.Rda
, run the Overview of taxon composition
tool, selecting Class as the level of biological organisation.
How many OTUs corresponding to Clostridia are there in the data set?
Do you see any particular classes in the data set that you
think could be removed - if yes, why would you exclude them?
Step 19. Data tidying
There are quite a few data tidying tools. Let's go through them one by one!
Don't worry if you get lost at first, we're here to help. 👍
Again selecting
ps.Rda
, tidy the data set using theFilter by taxonomic group
tool, choosingBacteria
as the group to retain. This helps ensure that we only keep those data that are classified as Bacteria at the domain level.Using the resulting Rda file (
ps_bacteria.Rda
), also run theRemove selected taxa
tool. You could use this tool to filter out any taxa that you shortlisted for removal in Step 18. Even if not, this tool can be used to remove any chloroplast or mitochondrial sequences in the data.
Why might you expect to see chloroplast or mitochondrial
sequences in a bacterial dataset - isn't that a little strange?
- Selecting
ps_ind.Rda
, run theAdditional prevalence summaries
tool. This will produce both a prevalence plot (ps_prevalence.pdf
) and a text summary (ps_low.txt
). The plot has a prevalence threshold of 5% drawn as a default guess for prevalence filtering.
How many doubletons are there in the data set?
Do you have an advance idea about what the term "prevalence" refers to?
- There are two further tools for data tidying:
- Remove OTUs with 0-2 occurrences
- Proportional prevalence filtering
Selecting ps_ind.Rda
, run the former tool, making sure that both singletons and doubletons are removed.
Why would we want to remove singletons and doubletons from the data?
Can you think of situations where these should be kept as part of the dataset
(hint: we have already covered one such step earlier)?
Taking a closer look at patterns
Step 20. Stacked taxon composition bar plots
At this point, we have many alternatives that depend on the study goals. Let's say that, in our case, the goal is to compare the overall community structure between different groups (with individual OTUs being of limited interest at this stage).
Nevertheless, first we probably want to visualise the data and compare the groups in more detail. After all, we don't want to blindly rely on statistical test output!
Select ps_pruned.Rda
and run the Transform OTU counts
tool, selecting Relative abundances (%)
as the data treatment. We can use relative abundance data to produce bar plots.
This will produce a file called ps_relabund.Rda
. Select it and run the OTU relative abundance bar plots
tool, making sure that you have the Sample
column selected in Phenodata variable with sequencing sample IDs
. There are quite a few other parameters here that you can also modify. Feel free to choose any cut-off threshold for excluding lower-abundance OTUs, otherwise selecting the following options:
- Family as the level of biological organisation
- group as phenodata variable 1 for plot faceting
The result should look close to this (click on the thumbnail to expand the image):