Day 2: Data tidying and analysis

Getting the data into phyloseq

Step 14. Creating phyloseq input files

Choose chimeras.removed.fasta.gz, chimeras.removed.count_table and sequences-taxonomy-assignment.txt. Next, tun the tool Generate input files for phyloseq so that you select the correct data type (16S, 18S or archaeal) and set a dissimilarity threshold of 0.03 (i.e. 3%) for OTU clustering.

Why are we using a dissimilarity threshold of 3%? 
Can you think of examples where another threshold might be more appropriate?

Step 15. Filling in the phenodata file

The Generate input files for phyloseq tool produces a phenodata file (a file type in Chipster used for documenting sample groupings). Before proceeding with downstream analyses, let's take a moment to work on the phenodata. We'll imagine there are two columns with sample groupings that we're interested in: chiptype and group.

  1. Select phenodata-file.opti_mcc.shared.

Fill the chiptype column so that samples F3D0-F3D147 (moving downward from the top row) are marked as NGS, and that samples from F3D148 onward are marked as r. You should end up with 9 samples with the chiptype marked as NGS and 10 samples with r. Note, these are entirely arbitrary groupings (we could also use other labels).

  1. Next, fill in the group column: mark all the early samples (D0-D9) with a, and the late samples (D141-D150) with b. These are also arbitrary labels.

The end result should look like this:

Step 16. Producing a phyloseq object

Finally! It's the moment we've been waiting for. 😄

Select the Mothur shared file and constaxonomy file. Run the Convert Mothur files into phyloseq object tool, noticing that here we need to specify the phenodata column including unique IDs for each community profile (in our case, the column sample).

This tool produces two files:

  • a phyloseq object (stored as ps.Rda)
  • a text summary of the imported object (ps_imported.txt).
Inspect the text summary a little closer. 
What elements does the phyloseq object consist of?

Inspecting and tidying the data

Step 17. Sequence numbers, rarefaction curve and alpha diversity indices

For this step, select the Rda file produced in Step 15. Try running the Sequence numbers, rarefaction curve and alpha diversity estimates tool. There are a couple of parameters to fill in - have a look at these in advance. Choose a phenodata variable to tabulate with the alpha diversity values and make sure you are telling the tool that we are working with raw data.

Can you already make some basic inferences about the data 
based on the rarefaction curve and alpha diversity values?

Based on the rarefaction curve, what do you think might be 
best - proceeding as is or rarefying the data to even depth?

Feel free to try the Rarefy OTU data to even depth tool using the same Rda file and re-running the Sequence numbers... tool using the resulting dataset (ps_rarefied.Rda). If doing this, remember that you need to specify in the Sequence numbers... tool that the data are rarefied to even depth.

Step 18. Overview of taxon composition

Let's move on using the non-rarefied data set. Selecting ps.Rda, run the Overview of taxon composition tool, selecting Class as the level of biological organisation.

How many OTUs corresponding to Clostridia are there in the data set?

Do you see any particular classes in the data set that you
think could be removed - if yes, why would you exclude them?

Step 19. Data tidying

There are quite a few data tidying tools. Let's go through them one by one!

Don't worry if you get lost at first, we're here to help. 👍

  1. Again selecting ps.Rda, tidy the data set using the Filter by taxonomic group tool, choosing Bacteria as the group to retain. This helps ensure that we only keep those data that are classified as Bacteria at the domain level.

  2. Using the resulting Rda file (ps_bacteria.Rda), also run the Remove selected taxa tool. You could use this tool to filter out any taxa that you shortlisted for removal in Step 18. Even if not, this tool can be used to remove any chloroplast or mitochondrial sequences in the data.

Why might you expect to see chloroplast or mitochondrial 
sequences in a bacterial dataset - isn't that a little strange?
  1. Selecting ps_ind.Rda, run the Additional prevalence summaries tool. This will produce both a prevalence plot (ps_prevalence.pdf) and a text summary (ps_low.txt). The plot has a prevalence threshold of 5% drawn as a default guess for prevalence filtering.
How many doubletons are there in the data set?
Do you have an advance idea about what the term "prevalence" refers to?
  1. There are two further tools for data tidying:
  • Remove OTUs with 0-2 occurrences
  • Proportional prevalence filtering

Selecting ps_ind.Rda, run the former tool, making sure that both singletons and doubletons are removed.

Why would we want to remove singletons and doubletons from the data?

Can you think of situations where these should be kept as part of the dataset 
(hint: we have already covered one such step earlier)?

Taking a closer look at patterns

Step 20. Stacked taxon composition bar plots

At this point, we have many alternatives that depend on the study goals. Let's say that, in our case, the goal is to compare the overall community structure between different groups (with individual OTUs being of limited interest at this stage).

Nevertheless, first we probably want to visualise the data and compare the groups in more detail. After all, we don't want to blindly rely on statistical test output!

Select ps_pruned.Rda and run the Transform OTU counts tool, selecting Relative abundances (%) as the data treatment. We can use relative abundance data to produce bar plots.

This will produce a file called ps_relabund.Rda. Select it and run the OTU relative abundance bar plots tool, making sure that you have the Sample column selected in Phenodata variable with sequencing sample IDs. There are quite a few other parameters here that you can also modify. Feel free to choose any cut-off threshold for excluding lower-abundance OTUs, otherwise selecting the following options:

  • Family as the level of biological organisation
  • group as phenodata variable 1 for plot faceting

The result should look close to this (click on the thumbnail to expand the image):