Day 2: Microbial community analysis

Clustering sequences into OTUs

Step 17. Clustering sequences into OTUs, classifying OTUs, and creating phyloseq input files

Choose chimeras.removed.fasta.gz, chimeras.removed.count_table and sequences-taxonomy-assignment.txt. Next, run the tool Cluster sequences to OTUs and classify them so that you select the correct data type (16S or 18S) and set a dissimilarity threshold of 0.03 (i.e. 3%, corresponding to 97% sequence similarity) for OTU clustering.

Why are we using a dissimilarity threshold of 3%? 
Can you think of examples where another threshold might be more appropriate?

Getting the data into phyloseq

Step 18. Filling in the phenodata file

The Cluster sequences to OTUs and classify them tool produces a phenodata file (a file type in Chipster used for documenting sample groupings). Before proceeding with downstream analyses, let’s take a moment to work on the phenodata. We’ll focus on two sample groupings that we’re interested in: group and time.

  1. Select phenodata-file.opti_mcc.shared and in Selected files choose the tab Phenodata.

  2. Remove the columns chiptype and description by clicking the x. Create a new column called time.

  3. Fill the group column so that samples F3D0-F3D147 (moving downward from the top row) are marked as a, and that samples from F3D148 onward are marked as b. You should end up with 9 samples with the group marked as a and 10 samples with b. Note, the names of the columns and the group labels can be anything you like, except numbers are not recommended.

  4. Next, fill in the time column: mark all the early samples (D0-D9) with early, and the late samples (D141-D150) with late.

The end result should look like this:

Step 19. Producing a phyloseq object

Finally! 😄 We are done with pre-processing and ready to convert our files into a phyloseq object that will be used in the community analysis steps.

Select the Mothur shared file file.opti_mcc.shared and the taxonomy file file.opti_mcc.0.03.cons.taxonomy. Run the Convert Mothur files into phyloseq object tool, noticing that here we need to specify the phenodata column including unique IDs for each community profile (in our case, the column sample).

This tool produces two files:

  • a phyloseq object (stored as ps.Rda)
  • a text summary of the imported object (ps_imported.txt).
Inspect the text summary a little closer. 
What elements does the phyloseq object consist of?

Tidying and inspecting the data

Step 20. Data tidying

  1. First, let’s remove any sequences that are not from our target group Bacteria. Select ps.Rda and in the category Microbial amplicon data analyses the tool Filter by taxonomic group tool. Choose Bacteria as the group to retain. This helps ensure that we only keep those data that are classified as Bacteria at the domain level.

  2. Earlier we already saw that our data contains chloroplast sequences, so let’s remove them together with any mitochondrial sequences. Select the file ps_bacteria.Rda and the tool Remove selected taxa. Set Remove class Chloroplast = yes and Remove family Mitochondria = yes and run the tool. You could also use this tool to filter out other specific taxa such as known contaminants.

Why might you expect to see chloroplast or mitochondrial 
sequences in a bacterial dataset - isn't that a little strange?

There are a few more additional tools for data tidying. Let’s first get an overview of the distribution of OTUs in our data.

  1. Selecting ps_ind.Rda, run the Prevalence summaries tool. This will produce both a prevalence plot (ps_prevalence.pdf) and a text summary (ps_low.txt). The plot has a prevalence threshold of 5% drawn as a default guess for prevalence filtering.
How many doubletons are there in the data set?
Do you have an advance idea about what the term "prevalence" refers to?
Which bacterial groups are most common and abundant in the data?
  1. There are two further tools for data tidying:
  • Remove OTUs with 0-2 occurrences
  • Proportional prevalence filtering (for removing OTUs that occur in less than specific % of samples)

Selecting ps_ind.Rda, run the former tool, making sure that both singletons and doubletons are removed.

Why would we want to remove singletons and doubletons from the data?
Can you think of situations where these should be kept as part of the dataset?

Step 21. Sequence numbers, rarefaction curve and alpha diversity indices

Select the file ps_ind.Rda. Try running the Sequence numbers, rarefaction curve and alpha diversity estimates tool. Choose a phenodata variable (time or group) to tabulate the alpha diversity values.

Can you already make some basic inferences about the data 
based on the rarefaction curve and alpha diversity values?

Based on the rarefaction curve, what do you think might be 
best - proceeding as is or rarefying the data to even depth?

Try the Rarefy OTU data to even depth tool using the same file ps_ind.Rda. Select the resulting dataset ps_rarefied.Rdaand and re-run the Sequence numbers... tool.

Would you compare alpha diversity based on non-rarefied or rarefied data?

Taking a closer look at patterns

Step 22. Stacked taxon composition bar plots

At this point, we have many alternatives that depend on the study goals. Let’s say that, in our case, the goal is to compare the overall community structure between different sites and treatments (with individual OTUs being of limited interest at this stage).

Nevertheless, first we probably want to visualise the data and compare the sites and treatments in more detail. After all, we don’t want to blindly rely on statistical test output!

Select ps_pruned.Rda and run the Transform OTU counts tool, selecting Relative abundances (%) as the data treatment. We can use relative abundance data to produce bar plots.

This will produce a file called ps_relabund.Rda. Select it and run the OTU relative abundance bar plots tool, making sure that you have the Sample column selected in Phenodata variable with sequencing sample IDs. There are quite a few other parameters here that you can also modify. Feel free to play around with the options but start with these:

  • 2 as the relative abundance cut-off threshold (%) for excluding OTUs
  • Family as the level of biological organisation
  • time as the phenodata variable 1 for plot faceting

The result should look close to this (click on the thumbnail to expand the image):

Great! But let’s take a moment to think:

Based on the plot, what would you anticipate - are the time points likely to differ?
Can you think of another way to visualise the data?

Step 23. Other ways to visualize the data

Going back to ps_pruned.Rda, let’s also visualise the data using a multivariate ordination. For this, first use the tool Transform OTU counts tool to perform a CLR transformation (Centered log-ratio transformation with pseudocount).

Why might we want to use CLR transformation here, instead of % relative abundances?

Selecting the resulting file ps_clr.Rda, run the Distance matrices and ordinations tool. Once again, one must specify a column in the phenodata with unique IDs for each sample (i.e. the Sample column). There are options for Euclidean and Bray-Curtis distances - choose Euclidean and for the ordination type, select nMDS. Also colour the ordination points by choosing time in Phenodata variable for grouping ordination points by colour and define the shape by choosing group in Phenodata variable for grouping ordination points by shape.

Here, we are using Aitchinson distances because we carried out a CLR transformation and chose Euclidean distances. An alternative approach would be to continue from the rarefied dataset (ps_rarefied.Rda) and use Bray-Curtis dissimilarities to produce the nMDS. In addition to nMDS, there is another ordination type (db-RDA), but let’s focus on the nMDS for now.

Looking at the ordination (ps_ordiplot.pdf), what features stand out? 
Does it look like any difference between groups could be due to a "location effect", 
"dispersion effect" or both?

What is the stress value of the nMDS (in ps_ordi.txt)? 
Can you guess why this might be important?

What is the advantage of plotting individual sample names here?

As a bonus topic, you might also want to produce a db-RDA plot and consider this question:

When might you want to use db-RDA rather than nMDS?

Diving into statistics

Step 24. PERMANOVA

Ok! We’re ready to run some statistical tests now. Let’s proceed with a permutational multivariate analysis of variance (PERMANOVA).

The Distance matrices and ordinations tool also produced a distance matrix (ps_dist.Rda) that we can use for statistical testing. Selecting it, run the Global PERMANOVA for OTU abundance data tool, selecting Main effects only as the PERMANOVA formula and specifying time as Phenodata variable 1.

What does the word "global" mean here?

How would you interpret the results (global_permanova.txt)? 
Does the time variable (early versus late samples) explain community variation in our data?

Using ps_dist.Rda, also run the PERMDISP for OTU abundance data tool, selecting the same phenodata variable.

What does PERMDISP stand for? Why would we want to use it?

Comparing the output of this test (permdisp.txt) with the global PERMANOVA results,  
can you confirm whether the PERMANOVA result is due to a location or dispersion effect? 

Does the result match your earlier anticipations based on inspecting the ordination?

There are also post-hoc pairwise tests for further community comparisons. We do not need those for the present dataset, since we are only comparing two groups at a time.

Step 25. DESeq2

If we were interested in OTU-level differences between sample groups, we could turn to differential abundance analysis with DESeq2 instead of PERMANOVA. For this, let’s return to the untransformed dataset (ps_pruned.Rda).

Choosing that dataset and the Transform OTU counts tool, select DEseq2 format conversion and variance-stabilizing transformation as the data transformation. For the DESeq2 format conversion, we also need to specify a phenodata variable - for this, use time. We get two files as a result:

  • ps_vst.Rda
  • deseq2.Rda

Of these, deseq2.Rda can be used for DESeq2.

Selecting deseq2.Rda, run the Differential OTU abundance analysis using DESeq2 tool, with the number of sample groups set to 2, the lower-level taxonomic grouping to Family and the higher-level grouping to Phylum.

Take a moment to inspect the results table (deseq2_otutable.tsv) and results plot (deseq2_otuplot.pdf).

Do some taxa seem more responsible for community-level differences than others?
Which time point is the reference level in the comparison, early or late?
Are  OTUs with positive log2 fold change values more abundant in early or late samples?