Data manipulation - exercise sheet
These exercises are partially derived from Software Carpentry teaching materials available under the CC BY 4.0 license:
https://datacarpentry.org/R-ecology-lesson/
Block 1: selecting and filtering using pipes
1.1 Load the tidyverse
package
collection:
library(tidyverse)
1.2 Read in portal_data_joined.csv
using read_csv()
:
<- read_csv("/home/rstudio/shared/portal_data_joined.csv") surveys
1.3 Using pipes, subset the surveys
data to include animals collected before 1995 and retain only the
columns year
, sex
, and
weight
.
1.4 Do the same, except this time keeping all the
columns apart from species_id
.
Block 2: the mutate function
2.1 Create a new data frame from the
surveys
data that meets the following criteria:
- Contains only the
species_id
column and a new column calledhindfoot_half
containing values that are half thehindfoot_length
values. - In the
hindfoot_half
column, there are noNA
s - All values are less than 30
Hint: think about how the commands should be ordered to produce this data frame!
Block 3: split-apply-combine and summarize
3.1 Consider this example that we just had a look at:
|>
surveys filter(!is.na(weight)) |>
summarize(mean_weight = mean(weight),
.by = c(sex, species_id))
It is also possible to summarize multiple variables at the same time. Extending the code above, can you figure out how to add a column indicating the minimum weight for each species for each sex?
3.2 We can also arrange the results of the previous exercise in a different manner. Try adding a pipe at the end that sorts the data so that lighter species are displayed first.
To make things easier, here is a list of dplyr
functions:
select()
: subset columnsfilter()
: subset rows on conditionsmutate()
: create new columns by using information from other columns- summarize()
and the argument
.by`: create summary statistics on grouped data arrange()
: sort resultscount()
: count discrete values
3.3 How about arranging things in descending order
instead? Try sorting the results by decreasing order of mean weight. For
this you will need to use desc()
inside the
arrange()
function.
Block 4: counting
4.1 How many animals were caught in each
plot_type
surveyed?
4.2 Use summarize()
and
.by
to find the mean, minimum and maximum hindfoot length
for each species (using species_id
). Also try adding the
number of observations (hint: use n
).
4.3 What was the heaviest animal measured in each
year? Return the columns year
, genus
,
species_id
, and weight
.
OPTIONAL Block 5: pivoting
This topic is quite time-consuming, so chances are that you are looking at this section independently.
5.1 Pivot the surveys
data frame with
the end result featuring year
as columns,
plot_id
as rows, and the number of genera per plot as the
values. You will need to summarize before reshaping, and use the
function n_distinct()
to get the number of unique genera
within a particular chunk of data.
5.2 Now take that data frame and
pivot_longer
it, so that each row is a unique
plot_id
by year
combination.
5.3 The surveys
data set has two
measurement columns: hindfoot_length
and
weight
. This makes it difficult to do things like look at
the relationship between mean values of each measurement per year in
different plot types. Let’s walk through a common solution for this type
of problem. First, use pivot_longer()
to create a dataset
where we have the types of measurement under a column called
measurement
and a value
column that takes on
the value of either hindfoot_length
or weight
.
Hint: You’ll need to specify which columns to use for
collecting the values.
5.4 With this new data set, calculate the average of
each measurement
in each year
for each
different plot_type
. Then pivot_wider()
them
into a data set with a column for hindfoot_length
and
weight
. Hint: You only need to specify the
names_from
and values_from
arguments for
pivot_wider()
.
Block 6: using write_csv()
To make sure we’re all on the same page, let’s repeat the steps we just covered.
6.1 Remove observations of animals for which
weight
and hindfoot_length
are missing, or the
sex
has not been determined:
<- surveys |>
surveys_complete filter(!is.na(weight), # remove missing weight
!is.na(hindfoot_length), # remove missing hindfoot_length
!is.na(sex)) # remove missing sex
6.2 Remove observations for rare species (i.e., that have been observed less than 50 times). We will do this in two steps: first, create a data set that counts how often each species has been observed, and filter out the rare species. Second, extract only the observations for these more common species.
## Extract the most common species_id
<- surveys_complete |>
species_counts count(species_id) |>
filter(n >= 50)
## Only keep the most common species
<- surveys_complete |>
surveys_complete filter(species_id %in% species_counts$species_id)
6.3 Check your data set by running
dim(surveys_complete)
. It should have 30463 rows and 13
columns!
6.4 Save the data as a CSV file in the
data_output
folder.
write_csv(surveys_complete, file = "data_output/surveys_complete.csv")
If you run into troubles here, don’t worry. When we continue with
data visualization exercises tomorrow, we can also use an in-built
backup of the surveys_complete
data set.