Data manipulation - exercise sheet
These exercises are partially derived from Software Carpentry teaching materials available under the CC BY 4.0 license:
https://datacarpentry.org/R-ecology-lesson/
Block 1: selecting and filtering using pipes
1.1 Load the tidyverse
package
collection:
library(tidyverse)
1.2 Read in portal_data_joined.csv
using read_csv()
:
<- read_csv("/home/rstudio/shared/portal_data_joined.csv") surveys
1.3 Using pipes, subset the surveys
data to include animals collected before 1995 and retain only the
columns year
, sex
, and
weight
.
|>
surveys filter(year < 1995) |>
select(year, sex, weight)
1.4 Do the same, except this time keeping all the
columns apart from species_id
.
|>
surveys filter(year < 1995) |>
select(-species_id)
Block 2: the mutate function
2.1 Create a new data frame from the
surveys
data that meets the following criteria:
- Contains only the
species_id
column and a new column calledhindfoot_half
containing values that are half thehindfoot_length
values. - In the
hindfoot_half
column, there are noNA
s - All values are less than 30
Hint: think about how the commands should be ordered to produce this data frame!
<- surveys |>
surveys_hindfoot_half filter(!is.na(hindfoot_length)) |>
mutate(hindfoot_half = hindfoot_length / 2) |>
filter(hindfoot_half < 30) |>
select(species_id, hindfoot_half)
Block 3: split-apply-combine and summarize
3.1 Consider this example that we just had a look at:
|>
surveys filter(!is.na(weight)) |>
summarize(mean_weight = mean(weight),
.by = c(sex, species_id))
It is also possible to summarize multiple variables at the same time. Extending the code above, can you figure out how to add a column indicating the minimum weight for each species for each sex?
|>
surveys filter(!is.na(weight)) |>
summarize(mean_weight = mean(weight),
min_weight = min(weight),
.by = c(sex, species_id))
3.2 We can also arrange the results of the previous exercise in a different manner. Try adding a pipe at the end that sorts the data so that lighter species are displayed first.
To make things easier, here is a list of dplyr
functions:
select()
: subset columnsfilter()
: subset rows on conditionsmutate()
: create new columns by using information from other columns- `
summarize()
andby
: create summary statistics on grouped data arrange()
: sort resultscount()
: count discrete values
|>
surveys filter(!is.na(weight)) |>
summarize(mean_weight = mean(weight),
min_weight = min(weight),
.by = c(sex, species_id)) |>
arrange(min_weight)
3.3 How about arranging things in descending order
instead? Try sorting the results by decreasing order of mean weight. For
this you will need to use desc()
inside the
arrange()
function.
|>
surveys filter(!is.na(weight)) |>
summarize(mean_weight = mean(weight),
min_weight = min(weight),
.by = c(sex, species_id)) |>
arrange(desc(mean_weight))
Block 4: counting
4.1 How many animals were caught in each
plot_type
surveyed?
|>
surveys count(plot_type)
4.2 Use summarize()
.by
to
find the mean, minimum and maximum hindfoot length for each species
(using species_id
). Also try adding the number of
observations (hint: use n
).
|>
surveys filter(!is.na(hindfoot_length)) |>
summarize(
mean_hindfoot_length = mean(hindfoot_length),
min_hindfoot_length = min(hindfoot_length),
max_hindfoot_length = max(hindfoot_length),
n = n(),
.by = species_id
)
4.3 What was the heaviest animal measured in each
year? Return the columns year
, genus
,
species_id
, and weight
. Hint: you can use the
argument by
also with some dplyr functions that
summarize
.
|>
surveys filter(!is.na(weight)) |>
filter(weight == max(weight), .by = year) |>
select(year, genus, species_id, weight) |>
arrange(year)
Block 5: pivoting
This topic is quite time-consuming, so chances are that you are looking at this section independently.
5.1 Pivot the surveys
data frame with
the end result featuring year
as columns,
plot_id
as rows, and the number of genera per plot as the
values. You will need to summarize before reshaping, and use the
function n_distinct()
to get the number of unique genera
within a particular chunk of data.
<- surveys |>
surveys_new_wider summarize(n_genera = n_distinct(genus),
.by = c(plot_id, year)) |>
pivot_wider(names_from = year,
values_from = n_genera)
head(surveys_new_wider)
5.2 Now take that data frame and
pivot_longer
it, so that each row is a unique
plot_id
by year
combination.
|>
surveys_new_wider pivot_longer(!plot_id,
names_to = "year",
values_to = "n_genera") |>
arrange(year)
5.3 The surveys
data set has two
measurement columns: hindfoot_length
and
weight
. This makes it difficult to do things like look at
the relationship between mean values of each measurement per year in
different plot types. Let’s walk through a common solution for this type
of problem. First, use pivot_longer()
to create a dataset
where we have the types of measurement under a column called
measurement
and a value
column that takes on
the value of either hindfoot_length
or weight
.
Hint: You’ll need to specify which columns to use for
collecting the values.
<- surveys |>
surveys_new_longer pivot_longer(names_to = "measurement",
values_to = "value",
cols = c(hindfoot_length, weight)) |>
arrange(measurement)
surveys_new_longer
5.4 With this new data set, calculate the average of
each measurement
in each year
for each
different plot_type
. Then pivot_wider()
them
into a data set with a column for hindfoot_length
and
weight
. Hint: You only need to specify the
names_from
and values_from
arguments for
pivot_wider()
.
|>
surveys_new_longer summarize(mean_value = mean(value, na.rm = TRUE),
.by = c(year, measurement, plot_type) |>
pivot_wider(names_from = measurement,
values_from = mean_value)
Block 6: using write_csv()
To make sure we’re all on the same page, let’s repeat the steps we just covered.
6.1 Remove observations of animals for which
weight
and hindfoot_length
are missing, or the
sex
has not been determined:
<- surveys |>
surveys_complete filter(!is.na(weight), # remove missing weight
!is.na(hindfoot_length), # remove missing hindfoot_length
!is.na(sex)) # remove missing sex
6.2 Remove observations for rare species (i.e., that have been observed less than 50 times). We will do this in two steps: first, create a data set that counts how often each species has been observed, and filter out the rare species. Second, extract only the observations for these more common species.
## Extract the most common species_id
<- surveys_complete |>
species_counts count(species_id) |>
filter(n >= 50)
## Only keep the most common species
<- surveys_complete |>
surveys_complete filter(species_id %in% species_counts$species_id)
6.3 Check your data set by running
dim(surveys_complete)
. It should have 30463 rows and 13
columns!
6.4 Save the data as a CSV file in the
data_output
folder.
write_csv(surveys_complete, file = "data_output/surveys_complete.csv")
If you run into troubles here, don’t worry. When we continue with
data visualization exercises tomorrow, we can also use an in-built
backup of the surveys_complete
data set.