Data manipulation - exercise sheet

These exercises are partially derived from Software Carpentry teaching materials available under the CC BY 4.0 license:

https://datacarpentry.org/R-ecology-lesson/

Block 1: selecting and filtering using pipes

1.1 Load the tidyverse package collection:

library(tidyverse)

1.2 Read in portal_data_joined.csv using read_csv():

surveys <- read_csv("/home/rstudio/shared/portal_data_joined.csv")

1.3 Using pipes, subset the surveys data to include animals collected before 1995 and retain only the columns year, sex, and weight.

surveys |>
    filter(year < 1995) |>
    select(year, sex, weight)

1.4 Do the same, except this time keeping all the columns apart from species_id.

surveys |>
    filter(year < 1995) |>
    select(-species_id)

Block 2: the mutate function

2.1 Create a new data frame from the surveys data that meets the following criteria:

Contains only the species_id column and a new column called hindfoot_half containing values that are half the hindfoot_length values.
In the hindfoot_half column, there are no NAs
All values are less than 30

Hint: think about how the commands should be ordered to produce this data frame!

surveys_hindfoot_half <- surveys |>
    filter(!is.na(hindfoot_length)) |>
    mutate(hindfoot_half = hindfoot_length / 2) |>
    filter(hindfoot_half < 30) |>
    select(species_id, hindfoot_half)

Block 3: split-apply-combine and summarize

3.1 Consider this example that we just had a look at:

surveys |>
  filter(!is.na(weight)) |>
  summarize(mean_weight = mean(weight),
  .by = c(sex, species_id))

It is also possible to summarize multiple variables at the same time. Extending the code above, can you figure out how to add a column indicating the minimum weight for each species for each sex?

surveys |>
  filter(!is.na(weight)) |>
  summarize(mean_weight = mean(weight),
            min_weight = min(weight),
            .by = c(sex, species_id))

3.2 We can also arrange the results of the previous exercise in a different manner. Try adding a pipe at the end that sorts the data so that lighter species are displayed first.

To make things easier, here is a list of dplyr functions:

select(): subset columns
filter(): subset rows on conditions
mutate(): create new columns by using information from other columns
`summarize() and by: create summary statistics on grouped data
arrange(): sort results
count(): count discrete values

surveys |>
  filter(!is.na(weight)) |>
  summarize(mean_weight = mean(weight),
            min_weight = min(weight),
            .by = c(sex, species_id)) |>
  arrange(min_weight)

3.3 How about arranging things in descending order instead? Try sorting the results by decreasing order of mean weight. For this you will need to use desc() inside the arrange() function.

surveys |>
  filter(!is.na(weight)) |>
  summarize(mean_weight = mean(weight),
            min_weight = min(weight),
            .by = c(sex, species_id)) |>
  arrange(desc(mean_weight))

Block 4: counting

4.1 How many animals were caught in each plot_type surveyed?

surveys |>
    count(plot_type)

4.2 Use summarize() .by to find the mean, minimum and maximum hindfoot length for each species (using species_id). Also try adding the number of observations (hint: use n).

surveys |>
    filter(!is.na(hindfoot_length)) |>
    summarize(
        mean_hindfoot_length = mean(hindfoot_length),
        min_hindfoot_length = min(hindfoot_length),
        max_hindfoot_length = max(hindfoot_length),
        n = n(),
        .by = species_id
    )

4.3 What was the heaviest animal measured in each year? Return the columns year, genus, species_id, and weight. Hint: you can use the argument by also with some dplyr functions that summarize.

surveys |>
    filter(!is.na(weight)) |>
    filter(weight == max(weight), .by = year) |>
    select(year, genus, species_id, weight) |>
    arrange(year)

Block 5: pivoting

This topic is quite time-consuming, so chances are that you are looking at this section independently.

5.1 Pivot the surveys data frame with the end result featuring year as columns, plot_id as rows, and the number of genera per plot as the values. You will need to summarize before reshaping, and use the function n_distinct() to get the number of unique genera within a particular chunk of data.

surveys_new_wider <- surveys |>
  summarize(n_genera = n_distinct(genus),
  .by = c(plot_id, year)) |>
  pivot_wider(names_from = year, 
              values_from = n_genera)

head(surveys_new_wider)

5.2 Now take that data frame and pivot_longer it, so that each row is a unique plot_id by year combination.

surveys_new_wider |>
  pivot_longer(!plot_id,
               names_to = "year",
               values_to = "n_genera") |>
  arrange(year)

5.3 The surveys data set has two measurement columns: hindfoot_length and weight. This makes it difficult to do things like look at the relationship between mean values of each measurement per year in different plot types. Let’s walk through a common solution for this type of problem. First, use pivot_longer() to create a dataset where we have the types of measurement under a column called measurement and a value column that takes on the value of either hindfoot_length or weight. Hint: You’ll need to specify which columns to use for collecting the values.

surveys_new_longer <- surveys |>
  pivot_longer(names_to = "measurement",
               values_to = "value",
               cols = c(hindfoot_length, weight)) |>
  arrange(measurement)

surveys_new_longer

5.4 With this new data set, calculate the average of each measurement in each year for each different plot_type. Then pivot_wider() them into a data set with a column for hindfoot_length and weight. Hint: You only need to specify the names_from and values_from arguments for pivot_wider().

surveys_new_longer |>
  summarize(mean_value = mean(value, na.rm = TRUE),
  .by = c(year, measurement, plot_type) |>
  pivot_wider(names_from = measurement, 
              values_from = mean_value)

Block 6: using write_csv()

To make sure we’re all on the same page, let’s repeat the steps we just covered.

6.1 Remove observations of animals for which weight and hindfoot_length are missing, or the sex has not been determined:

surveys_complete <- surveys |>
  filter(!is.na(weight),           # remove missing weight
         !is.na(hindfoot_length),  # remove missing hindfoot_length
         !is.na(sex))              # remove missing sex

6.2 Remove observations for rare species (i.e., that have been observed less than 50 times). We will do this in two steps: first, create a data set that counts how often each species has been observed, and filter out the rare species. Second, extract only the observations for these more common species.

## Extract the most common species_id
species_counts <- surveys_complete |>
    count(species_id) |> 
    filter(n >= 50)

## Only keep the most common species
surveys_complete <- surveys_complete |>
  filter(species_id %in% species_counts$species_id)

6.3 Check your data set by running dim(surveys_complete). It should have 30463 rows and 13 columns!

6.4 Save the data as a CSV file in the data_output folder.

write_csv(surveys_complete, file = "data_output/surveys_complete.csv")

If you run into troubles here, don’t worry. When we continue with data visualization exercises tomorrow, we can also use an in-built backup of the surveys_complete data set.