Intro to statistics with R

Objectives:

Have an appreciation of:

  • The difference between inferential and descriptive statistics

  • How hypothesis testing works

  • The difference between parametric and non-parametric tests

  • The typical sequence of steps taken to perform statistical analyses

Have a working knowledge of the following statistical methods:

  • Different types of t-test (t.test)

  • One-sample, independent two-sample and paired-sample tests

  • Student’s vs Welch’s t-test

  • Linear regression (lm, short for “linear model”)

  • Chi-squared test of independence (chisq.test)

  • Know how to inspect test diagnostics

  • Interpret the output of statistical tests in R

1. Introduction

Welcome to the last episode of this workshop! Unlike the previous episodes, this one focuses more on theory (although there are still a few coding exercises). This is because the concepts behind significance testing are more complex than the actual R code we’re going to learn.

Depending on the course schedule and how we’ve run through the previous exercises, we may already have quite a lot stored in our R workspace. Let’s start by wiping R’s memory clean, reloading tidyverse and reimporting the tidied survey data we prepared before:

rm(list = ls()) # Wipes everything clean
library(tidyverse) # Reloading tidyverse

surveys_complete <- read_csv("data_output/surveys_complete.csv")

# We also have an in-built backup of the data:
# surveys_complete <- read_csv("/home/rstudio/shared/surveys_complete.csv")

# ... Or the data could be downloaded:
# download.file(url = "https://tinyurl.com/surveyscomplete",
# destfile = "data_output/surveys_complete.csv")

# the tinyurl links to:
# https://raw.githubusercontent.com/csc-training/da-with-r/master/DataFiles/surveys_complete.csv

The functions for t-tests, Chi-squared tests and linear regression come as part of a default R installation, so we don’t have to load separate packages for those. However, we can load the package ggfortify (along with tidyverse), which is useful for plotting purposes:

library(tidyverse)
library(ggfortify)

2. Two types of statistics

A typical goal for statistics is to infer what is happening within a population based on a smaller sample from that population (or one resembling it). This is what we call inferential statistics, with the inferences we make being based on the concept of statistical significance. We will cover this concept and some common statistical methods today.

Prior to inferential statistics, we need to obtain a general understanding of the data we are working with. The branch of statistics dealing with describing a data set is called descriptive statistics. Examples include inspecting how the data are distributed as well as computing summary statistics like the mean or the median.

We will soon see how descriptive and inferential statistics go hand in hand, even when we’ve already had a first look at the data.

3. More on descriptive statistics

The earlier episodes included examples of working with descriptive statistics. For instance, we used tidyverse to calculate means using the surveys_complete data set. We also used ggplot2 to look at distributions. However, R comes with some further useful ways to summarize a data set.

A quick way to compute summary statistics for a data frame is to use the summary function call:

summary(surveys_complete)

# Generates lots of information, including:
# Minimum and maximum values
# 1st and 3rd quartiles
# Medians and means

There are also specific functions that we haven’t yet covered:

min() # minimum, e.g. min(surveys_complete$weight)
max() # maximum
range() # minimum and maximum
median() # median
quantile() # quantiles
IQR() # interquartile range (Q3 - Q1)
var() # variance
sd() # standard deviation (= square root of variance)
std.error() # standard error (part of the "plotrix" package)

# The standard error can also be calculated as:
# Standard deviation / square root of n (with n being the sample size)

# A note on variance:

# Variance measures the spread between values in a data set
# by determining how far each value is from the mean. If you
# wanted to calculate it manually, you would need to:

# 1. Calculate the difference between each observation and the mean
# 2. Square these differences
# 3. Divide the sum of the squared differences by n - 1
#    (this is the sample variance, which is what var() computes)
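
As a quick sketch of how these formulas translate into R (using the weight column as an example and assuming no missing values remain after our earlier tidying), we can compare the manual calculations with the built-in functions:

wt <- surveys_complete$weight
n <- length(wt)

sum((wt - mean(wt))^2) / (n - 1) # manual sample variance
var(wt) # should give the same result

sqrt(var(wt)) # manual standard deviation
sd(wt) # should give the same result

sd(wt) / sqrt(n) # standard error without extra packages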

4. Hypothesis testing

Most of this episode deals with inferential statistics. First, let’s discuss some of the philosophy underlying the methods we’re going to learn. The general idea is that we want to assess how well our observations support a given hypothesis. Hypothesis testing is the use of statistics to make that assessment in a formal, quantitative way.

More formally, there are several steps to this:

  1. Formulate a null hypothesis (usually stating that the observations are due to pure chance). The alternative hypothesis is that the observations cannot be explained by chance alone.

  2. Decide on the best way to test whether the null hypothesis is true.

  3. Determine the P value, i.e. the probability of obtaining a result at least as extreme as the one we observed if the null hypothesis were true. The higher the P value, the more consistent the result is with pure chance (and the lower it is, the more likely we’re seeing a real effect).

  4. Compare the P value to a pre-determined alpha value (typically 0.05). If P < α, the effect is said to be statistically significant (with α = 0.05, this means that a result this extreme would be seen less than 5% of the time if chance alone were at play).

5. Parametric and non-parametric methods

Statistical methods can be parametric or non-parametric. Both assume that the data come from a random sample and often also that the observations are independent (meaning that the value of an observation is not affected by other observations). However, there are also many key differences.

Parametric methods

Let’s have a look at some of the key features that define parametric tests:

  • Parametric methods make assumptions about the underlying distribution of the data. A common scenario is the assumption of normality, i.e. adherence to the normal (or Gaussian) distribution. Many natural phenomena tend to follow the normal distribution, which is shaped like a bell curve. To see this in practice, let’s create a histogram:
# generate random data from normal distribution using rnorm()
random_normal <- rnorm(10000, mean = 0, sd = 1)
# creates vector with mean of 0 and standard deviation of 1

# convert to data frame
random_normal <- data.frame(random_normal)

# histogram of the data
ggplot(random_normal, aes(x = random_normal)) +
  geom_histogram(color = "black", fill = "white")

  • The assumption of normality can apply to either the observations or to residuals (i.e. differences between your observations and a set of predicted values). We will talk more about residuals soon!

  • Parametric methods assume homogeneity of variance between groups. For example, we might want to compare the hindfoot_length of males and females in the surveys_complete data set. A parametric test would assume the values in the two groups to have equal variances (see the sketch after this list).

  • Parametric methods use the mean as a measure of central tendency (in practice this means that they rely on comparisons of group means).
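
To see what checking the homogeneity-of-variance assumption might look like in practice, here is a minimal sketch (assuming the sex and hindfoot_length columns are present in surveys_complete, as in the earlier episodes):

surveys_complete |>
  group_by(sex) |>
  summarize(variance = var(hindfoot_length),
            st_dev = sd(hindfoot_length))

If the group variances are of a similar magnitude, the equal-variance assumption is more plausible (formal tests for this exist, but we won’t cover them here).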

Non-parametric methods

In contrast to their parametric counterparts, non-parametric methods:

  • Do not assume a particular distribution or the homogeneity of variances. They still make other assumptions (while they are sometimes called “distribution-free”, they are never assumption-free).

  • Are suitable for the analysis of data measured at the nominal level (such as names and labels).

  • Employ the median as a measure of central tendency. This is important because if the study aim is to compare group medians rather than means, a non-parametric test might be preferable (even if the data could otherwise be analyzed using a parametric approach).

6. How to put things into practice

With real data, things practically never look as neat as they did in the example above. For example, the distributions may be skewed or you might see multiple peaks (or modes). Parametric methods are nevertheless often recommended because:

  • When the assumptions are met, they are more sensitive than non-parametric analyses (you are more likely to detect an effect if there is one).

  • If the assumptions are not met, it may be possible to address this by transforming the data (e.g. using a log or log(x + 1) transformation). The idea is that transformations can help address issues such as non-normality without affecting the relationship between observations (since all the data are subjected to the same transformation).

For a typical parametric analysis, we would move through the following steps:

  1. Descriptive statistics: summarize and visualize the data

  2. Test diagnostics: check for deviations from test assumptions

  • If deviations occur, transform and repeat steps 1-2

  • If transformations don’t help, choose a suitable non-parametric test

  3. Inferential statistics: run the test and interpret the output

Remember that in some cases a non-parametric test might be your best option to start with! The steps taken for these are very similar: describe the data, check that the assumptions are met, then go ahead with significance testing.

Zoom Breakout Discussion

Let’s split into breakout rooms (around 5 people in each room). Spend up to 10 minutes (~2 mins each) taking turns to discuss the types of data you deal with in your work (or data you anticipate generating). Which types of statistical tests are likely to be the most suitable for these - parametric or non-parametric? Once you’re ready, you can leave the breakout room and return to the main meeting (remember: don’t click the exit meeting button, as this will remove you from the entire Zoom meeting!).

7. Examples of statistical methods

We are now ready to learn about some common statistical techniques and how to use them in R. We will start with two parametric methods (t-test and linear regression), after which we will move on to the Chi-squared (χ2) test as an example of a non-parametric method.

7a. The t-test (and more essential concepts)

The t-test comes in three different flavors. Without going into mathematical details, what we need to know is that all these approaches are based on the comparison of group means. The three types are:

  1. The independent-samples t-test, which compares two separate groups

  2. The paired-sample t-test, which compares one group at different times (or in response to two different treatments)

  3. The one-sample t-test, which compares a group mean against a known value

The standard version of the t-test, also known as Student’s t-test, assumes equal variances between groups. However, the default option in R is to implement a version that corrects for unequal variances: Welch’s t-test. The function call is:

t.test() # used for all types of t-test
t.test(var.equal = TRUE) # if we wanted to use a Student's test

To have a closer look at this, let’s import the sleep data set built into R. This data set shows the effect of two soporific drugs (hours of extra sleep compared to a control) on ten patients.

sleepdata <- sleep # copies the built-in sleep data set into a new object
sleepdata

Challenge (and Zoom Breakout Discussion)

Let’s create a box plot and histogram of the data.

# this time also changing the formatting a little
ggplot(data = sleepdata, mapping = aes(x = group, y = extra)) +
  geom_boxplot() +
  theme_bw() +
  theme(text = element_text(size = 16))

ggplot(sleepdata, aes(x = extra)) +
  geom_histogram(color = "black", fill = "white") +
  facet_wrap(facets = vars(group)) +
  theme_bw() +
  theme(text = element_text(size = 16))

Take some time to copy these for yourself from the Codeshare session.

Once you have created the plots for yourself, let’s split again into Zoom Breakout Rooms (5 people in each). Discuss the following topics for ~5-10 minutes and return to the main room once you are ready:

  1. Suppose groups 1 and 2 consisted of separate, unrelated individuals and we wanted to compare them: which test would we use? What if we’d like to test whether the mean for group 1 differs from a value reported for a third drug type?

  2. The t-test assumes that the dependent value (extra) is normally distributed within each group. Do you think the assumption is successfully met?

  3. Do you think the box plot is useful for evaluating parametric test diagnostics? Is there anything you would display differently?

Independent-samples t-test

For our purposes, we can conclude that we have met the assumptions (we’ll look at variance later on). We can now perform an independent-samples t-test, also known as a two-sample test. Notice the ~ symbol, which denotes a formula in R. We are using it to say that we’d like to examine extra “by” the group variable:

t.test(extra ~ group, sleep)

# Welch Two Sample t-test

# data: extra by group
# t = -1.8608, df = 17.776, p-value = 0.07939
# alternative hypothesis: true difference in means is not equal to 0

# 95 percent confidence interval:
# -3.3654832 0.2054832
# sample estimates:
# mean in group 1 mean in group 2
#          0.75            2.33

The command itself is very short, but we get a lot of text as the output. For interpreting the result, this section is particularly important:

# t = -1.8608, df = 17.776, p-value = 0.07939

We can see that the result is not statistically significant (P > 0.05). We also have a couple of other values that are often reported alongside the P value in research reports and publications.

  • The first part (t = ...) is the test statistic (which for a t-test is the t statistic). We won’t dwell on the mathematics, but in practice the test statistic plays a part in determining the P value and is calculated from the sample data.

  • The second part (df = ...) reports the degrees of freedom. Normally this equals the sample size minus the number of parameters estimated during the analysis. We get a non-integer value because the Welch correction adjusts the degrees of freedom, but with a Student’s t-test we’d have 18 degrees of freedom (10 + 10 observations - 2, i.e. the number of groups for which we calculate means; we’ll verify this with a quick check below). What does it mean, though, and why does it matter?

To think of degrees of freedom in terms of an example, imagine different types of food. You’d like to eat something different every day of the week, but only have ingredients for seven different dishes.

  • On Monday you can use any recipe. On Tuesday there are six left (and so on). On Sunday, you no longer have a choice - you have to settle for whatever’s left for the final recipe. This means that on six days, you were able to exercise some level of choice.

  • Similarly, degrees of freedom refers to the number of values in a data set that are “free to vary” when estimating statistical parameters. Together with the test statistic, this value plays a role in the mathematics used to establish statistical significance.
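
As promised, here is that quick check: rerunning the analysis as a Student’s t-test should report 18 degrees of freedom (a minimal sketch, otherwise interpreted just like the Welch output above):

t.test(extra ~ group, data = sleepdata, var.equal = TRUE)
# the output should now report df = 18
# (compare with df = 17.776 from the Welch version above)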

Paired-samples t-test

We could also use the sleep data to perform a paired-samples t-test. Here the position (row number) of the observations within groups matters, as the test assumes observations to be paired in a meaningful way. In our case, the data are ordered by the patient ID.
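
To convince ourselves that the pairing makes sense, one possible sketch is to cross-tabulate the patient IDs against the groups (the sleep data include an ID column); each patient should appear exactly once per group:

table(sleepdata$ID, sleepdata$group)
# a table filled with ones confirms one observation per patient in each group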

The code itself is similar to running an independent-samples test:

t.test(extra ~ group, sleep, paired = TRUE)

# Paired t-test

# data: extra by group
# t = -4.0621, df = 9, p-value = 0.002833
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
# -2.4598858 -0.7001142
# sample estimates:
# mean of the differences
#                   -1.58

The output also contains the same terms we saw before.

  • How would you interpret the results?

  • Why do you think the type of test affects the conclusion we’re able to make?

One-sample t-test

The final type of t-test involves comparing a single group mean against some known (or hypothetical) value. This value is specified using mu. For example, if we wanted to compare the entire sleep data set against a group that showed an average of three extra hours of sleep, we would feed this to R:

t.test(sleep$extra, mu = 3)

# One Sample t-test

# data: sleep$extra
# t = -3.2357, df = 19, p-value = 0.004352
# alternative hypothesis: true mean is not equal to 3
# 95 percent confidence interval:
# 0.5955845 2.4844155
# sample estimates:
# mean of x
#      1.54

7b. Linear regression

Linear regression is a method that can be used to quantitatively describe the relationship between a dependent variable and an independent variable. The terms can be thought of along the lines of:

# "y depends on x"

# y is the dependent and x the independent value
# (you may also see "response variable" and "explanatory variable")

# In other words, x is used to predict (or "explain") y

The word linear is mentioned because the method assumes a linear relationship between x and y, meaning that it can be explained by a straight line. We can describe this using the following equation:

# y = a + bx

# Where...
# a is the intercept (the value of y when x = 0)
# b is the slope of the line

If you think of how experimental data would look as a scatter plot, the points likely wouldn’t fall perfectly onto a line drawn through them. The line passing through this “cloud of data” is called the line of best fit because it minimizes the overall distance between itself and the individual data points.

To illustrate, let’s load another in-built data set called cars. It consists of 50 observations and two variables (dist and speed).

carsdata <- cars
head(carsdata)

#   speed dist
# 1     4    2
# 2     4   10
# 3     7    4
# 4     7   22
# 5     8   16
# 6     9   10

Let’s plot the observations as a scatter plot and add a line of best fit using geom_smooth:

ggplot(carsdata, aes(x = speed, y = dist)) +
 geom_point() +
 geom_smooth(method = lm) + # this part is new...
 theme_bw()

By default, this also draws the 95% confidence interval (CI). A 95% CI of the mean is a range of plausible values for the true mean. In other words, if we took many samples from the same population and calculated a 95% CI for each of these, we would expect the true population mean to fall within 95% of these intervals.

How about method = lm - what does it stand for? We are telling R that we’d like to use a linear model. Regression models the relationship between x and y with the regression equation describing that model (and the line of best fit being a visual representation of it). Linear models are common in hypothesis testing and come in different forms. The t-test is also a special case of a linear model!

To better understand the theory behind regression, there are a couple more points to cover before creating a model in R (outside ggplot2, that is).

  • Regression and correlation are different things. Correlation analysis is another tool in the statistical toolbox that describes the relationship between two variables. However, it only describes the direction and strength of the association between them (without predicting y from x).

  • You may remember the concept of residuals. The scatter plot makes this easier to follow - we can think of a residual as the distance between a data point and the line of best fit. Residuals come into play when checking whether our data meet the assumptions of a linear regression model.

To examine model diagnostics, we first need to create a model:

cars_lm <- lm(dist ~ speed, data = carsdata)

The package ggfortify comes with the function autoplot, which can be used to create diagnostic plots. It generates four plots at once that are laid out as a panel. There is a lot of detail, but we’ll explore the plots one by one:

autoplot(cars_lm)

  1. The residuals vs fitted plot is used to check the homogeneity of variance. The residuals should be roughly equally spread around the dashed horizontal line (which corresponds to points falling exactly on the line of best fit, i.e. residuals of zero). We see a few outliers identified by row numbers.
  • While the outliers could be removed, you should first ask why these values might differ from the rest and whether their removal would be justified.

  • Question: Can you think of a situation where it would be OK to remove an outlier? How about when it might not be a good idea?

  2. The Q-Q (quantile-quantile) plot shows whether the residuals are normally distributed. In a perfect scenario, everything would fall on the dashed line. Our values look OK aside from some departures around the tails, which is common.

  3. The scale-location plot can also be used to evaluate the homogeneity of variance. The blue line should look approximately horizontal and the spread should look similar across the range of fitted values.

  • While things aren’t perfect, there’s no need to worry: there is no obvious dependence between the residuals and the fitted values, and although the scatter is harder to interpret, it lacks an obvious pattern.

  4. The residuals vs leverage plot can help us find out if certain observations have a pronounced effect on the model. Such values would stand out from the other points. In our case, it looks like a few values might be more influential than the others (two were also identified as outliers in the other plots).

If everything looks acceptable, we can print out the model output using the summary function:

summary(cars_lm)

# Call:
# lm(formula = dist ~ speed, data = carsdata)

# Residuals:
#     Min     1Q Median    3Q    Max
# -29.069 -9.525 -2.272 9.215 43.201

# Coefficients:
#             Estimate Std. Error  t value  Pr(>|t|)
# (Intercept) -17.5791     6.7584  -2.601    0.0123 *
# speed         3.9324     0.4155   9.464  1.49e-12 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# Residual standard error: 15.38 on 48 degrees of freedom
# Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
# F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12

This generates a lot of information, most of which looks quite different to the t-test output. After the model call and some summary statistics on the residuals, we have information on the model coefficients.

# Coefficients:
#             Estimate Std. Error  t value  Pr(>|t|)
# (Intercept) -17.5791     6.7584  -2.601    0.0123 *
# speed         3.9324     0.4155   9.464  1.49e-12 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The coefficients correspond to the two constants in the model (the intercept and the slope). The Estimate column gives the values of these constants, so the equation describing our model could be written as:

# (y = a + bx)
# dist = -17.579 + 3.932 * speed

After columns with standard errors and test statistics, we see a column containing P values for the coefficients. The model shows that speed is positively associated with distance (P < 0.001). The intercept is also significantly different from zero (which is nearly always the case).
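
To connect the coefficients back to the regression equation, we could plug a hypothetical speed of 15 into the equation by hand and compare the result with R’s predict function (a quick sketch):

-17.5791 + 3.9324 * 15 # manual prediction, roughly 41.4
predict(cars_lm, newdata = data.frame(speed = 15)) # the same prediction from the model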

The final part contains general information on the model:

# Residual standard error: 15.38 on 48 degrees of freedom
# Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
# F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12

  • The residual standard error describes the model fit: it reflects how far, on average, the observations fall from the fitted line (a value of 0 would mean the model describes the data perfectly).

  • The multiple R-squared (R2) value gives the proportion of variance in the dependent variable explained by the independent variable (in our case, about 65%). This value is often reported as a measure of how well the model performs. The adjusted R2 is used when there are many independent variables (it is a penalized version of the multiple R2). See the sketch after this list for one way to extract these values.

  • The last line describes the statistical significance of the model. The F statistic is based on a test where your model is compared against a model with no predictors. The P value is the overall P value for the model. We can see that our model is significant (meaning that it performs better than a model without predictors!).
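
As mentioned above, here is a brief sketch of how these summary components could be extracted programmatically (for example, for reporting purposes):

cars_lm_summary <- summary(cars_lm)
cars_lm_summary$r.squared # multiple R-squared
cars_lm_summary$adj.r.squared # adjusted R-squared
cars_lm_summary$coefficients # the coefficient table as a matrix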

7c. Chi-squared test of independence

We will cover one more test: the Chi-squared (χ2) test of independence, also known as Pearson’s χ2 test. This method can be used to determine if there is an association between categorical variables, such as names or labels. Specifically, it is used when you want to compare frequencies (counts) between groups. This is done by comparing the observed frequencies with the frequencies one would expect if there was no association between the variables.

To have a look at this in practice, let’s load another car-related data set that comes pre-built in R: mtcars.

mtcarsdata <- mtcars
head(mtcarsdata)

The data set comes as a data frame with different car models as row names.

  • The first variable we might be interested in is the type of engine (vs), which can be either V-shaped (0) or straight (1).

  • There is also a column for the gear count (gear), ranging from three to five.

Let’s say we’d like to find out if there is an association between these two variables. How to go about it?

First we need to shape the data into a format suitable for the χ2 test. To do this, let’s create a frequency table listing all the combinations we’re interested in. Such a table is also known as a contingency table.

Before creating the table, let’s make things easier to follow by replacing the numbers with labels. We can use a dplyr function called recode:

mtcarsdata <- mtcarsdata |>
  mutate(vs = recode(vs,
                     "0" = "Vshaped",
                     "1" = "Straight"),
        gear = recode(gear,
                     "3" = "Three_gears",
                     "4" = "Four_gears",
                     "5" = "Five_gears"))
head(mtcarsdata)

We can create a contingency table using table:

mtcarsdata_freqs <- table(mtcarsdata$vs, mtcarsdata$gear)
mtcarsdata_freqs

#           Five_gears Four_gears Three_gears
# Straight           1         10           3
# Vshaped            4          2          12

Now that the data are in the desired format, this is a good time to think about test assumptions. The χ2 test is a non-parametric method that comes with the following requirements, aside from needing the data as counts:

  • The groups must be mutually exclusive (each observation should belong to exactly one level of each variable).

  • Each subject should only contribute data to one cell (e.g. comparisons of the same individual at different times are not possible). Similarly, for paired-sample designs one would need to consider another test.

  • Expected values should be 5 or above in at least 80% of the cells, and no cell should have an expected value of less than one.

Looking at the numbers we have in our table, especially the last assumption might give us something to think about. To have a look at the expected values, let’s run the test using chisq.test:

mtcarsdata_chisq <- chisq.test(mtcarsdata_freqs)

# Warning message:
# In chisq.test(mtcarsdata_freqs) :
# Chi-squared approximation may be incorrect

We get a warning saying the approximation might be incorrect. We can extract the expected values from the object:

round(mtcarsdata_chisq$expected, 2) # rounding to two decimals

#            Five_gears Four_gears Three_gears
# Straight         2.19       5.25        6.56
# Vshaped          2.81       6.75        8.44

The warning is because there are some cells in the contingency table that have an expected value below five. R is telling us that the test output might be biased due to a lack of data. Out of curiosity, we could still have a look at it:

mtcarsdata_chisq

# Pearson's Chi-squared test

# data: mtcarsdata_freqs
# X-squared = 12.224, df = 2, p-value = 0.002216

# Could be an association between engine type and gear count...

With a small sample size like ours (and also with 2x2 contingency tables), an alternative test called Fisher’s exact test may be more appropriate. We won’t cover the details, but running it is similar to performing a χ2 analysis: fisher.test(mtcarsdata_freqs). Although not always possible in the world of experimental science (let alone field studies), another option would be to gather more data!

Note: The χ2 test can apply a correction for small sample sizes, controlled by the correct argument of chisq.test. This is called the Yates continuity correction, and R applies it automatically (correct = TRUE is the default) when working with 2x2 contingency tables. However, this correction can be highly conservative.
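
To make the last two notes concrete, here is a brief sketch of both options (note that our table is 2x3, so the Yates correction has no effect here):

chisq.test(mtcarsdata_freqs, correct = TRUE) # the correction only affects 2x2 tables
fisher.test(mtcarsdata_freqs) # Fisher's exact test as a small-sample alternative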

8. Further information

Choosing the most suitable statistical method, evaluating test diagnostics and deciding how best to transform your data are all vast topics in their own right. While an in-depth discussion of these is beyond the scope of the workshop, useful resources for self-study include the R for Data Science e-book (freely accessible online) and Getting Started with R: An Introduction for Biologists.