Data manipulation - exercise sheet
These exercises are partially derived from Software Carpentry teaching materials available under the CC BY 4.0 license:
https://datacarpentry.org/R-ecology-lesson/
Block 1: selecting and filtering using pipes
1.1 Load the tidyverse package
collection:
library(tidyverse)1.2 Read in portal_data_joined.csv
using read_csv():
surveys <- read_csv("/home/rstudio/shared/portal_data_joined.csv")1.3 Using pipes, subset the surveys
data to include animals collected before 1995 and retain only the
columns year, sex, and
weight.
1.4 Do the same, except this time keeping all the
columns apart from species_id.
Block 2: the mutate function
2.1 Create a new data frame from the
surveys data that meets the following criteria:
- Contains only the
species_idcolumn and a new column calledhindfoot_halfcontaining values that are half thehindfoot_lengthvalues. - In the
hindfoot_halfcolumn, there are noNAs - All values are less than 30
Hint: think about how the commands should be ordered to produce this data frame!
Block 3: split-apply-combine and summarize
3.1 Consider this example that we just had a look at:
surveys |>
filter(!is.na(weight)) |>
summarize(mean_weight = mean(weight),
.by = c(sex, species_id))It is also possible to summarize multiple variables at the same time. Extending the code above, can you figure out how to add a column indicating the minimum weight for each species for each sex?
3.2 We can also arrange the results of the previous exercise in a different manner. Try adding a pipe at the end that sorts the data so that lighter species are displayed first.
To make things easier, here is a list of dplyr
functions:
select(): subset columnsfilter(): subset rows on conditionsmutate(): create new columns by using information from other columns- summarize()
and the argument.by`: create summary statistics on grouped data arrange(): sort resultscount(): count discrete values
3.3 How about arranging things in descending order
instead? Try sorting the results by decreasing order of mean weight. For
this you will need to use desc() inside the
arrange() function.
Block 4: counting
4.1 How many animals were caught in each
plot_type surveyed?
4.2 Use summarize() and
.by to find the mean, minimum and maximum hindfoot length
for each species (using species_id). Also try adding the
number of observations (hint: use n).
4.3 What was the heaviest animal measured in each
year? Return the columns year, genus,
species_id, and weight.
OPTIONAL Block 5: pivoting
This topic is quite time-consuming, so chances are that you are looking at this section independently.
5.1 Pivot the surveys data frame with
the end result featuring year as columns,
plot_id as rows, and the number of genera per plot as the
values. You will need to summarize before reshaping, and use the
function n_distinct() to get the number of unique genera
within a particular chunk of data.
5.2 Now take that data frame and
pivot_longer it, so that each row is a unique
plot_id by year combination.
5.3 The surveys data set has two
measurement columns: hindfoot_length and
weight. This makes it difficult to do things like look at
the relationship between mean values of each measurement per year in
different plot types. Let’s walk through a common solution for this type
of problem. First, use pivot_longer() to create a dataset
where we have the types of measurement under a column called
measurement and a value column that takes on
the value of either hindfoot_length or weight.
Hint: You’ll need to specify which columns to use for
collecting the values.
5.4 With this new data set, calculate the average of
each measurement in each year for each
different plot_type. Then pivot_wider() them
into a data set with a column for hindfoot_length and
weight. Hint: You only need to specify the
names_from and values_from arguments for
pivot_wider().
Block 6: using write_csv()
To make sure we’re all on the same page, let’s repeat the steps we just covered.
6.1 Remove observations of animals for which
weight and hindfoot_length are missing, or the
sex has not been determined:
surveys_complete <- surveys |>
filter(!is.na(weight), # remove missing weight
!is.na(hindfoot_length), # remove missing hindfoot_length
!is.na(sex)) # remove missing sex6.2 Remove observations for rare species (i.e., that have been observed less than 50 times). We will do this in two steps: first, create a data set that counts how often each species has been observed, and filter out the rare species. Second, extract only the observations for these more common species.
## Extract the most common species_id
species_counts <- surveys_complete |>
count(species_id) |>
filter(n >= 50)
## Only keep the most common species
surveys_complete <- surveys_complete |>
filter(species_id %in% species_counts$species_id)6.3 Check your data set by running
dim(surveys_complete). It should have 30463 rows and 13
columns!
6.4 Save the data as a CSV file in the
data_output folder.
write_csv(surveys_complete, file = "data_output/surveys_complete.csv")If you run into troubles here, don’t worry. When we continue with
data visualization exercises tomorrow, we can also use an in-built
backup of the surveys_complete data set.