Data manipulation - independent exercises

In this section we have some extra exercises for those who want to test their R skills a little more or earn an additional 0.5 ECTS for the course (1 ECTS in total). Here we explore data from the Finnish Biodiversity Information Facility (FinBIF) as an example of using data from open data application programming interfaces (APIs).

We will use data downloaded with the R package finbif, which is an R interface to the FinBIF API. This means we can use the package to download data from FinBIF directly into R and then continue working with it. You can find more information on the package here: https://luomus.github.io/finbif/

1. Getting the data (optional)

This section shows how the data used in the following exercises was downloaded from FinBIF. If you are carrying out these exercises for the additional 0.5 ECTS, it is recommended to use the pre-downloaded data and start from step 2.

# Install the package `finbif`
install.packages("finbif")

# Load the package `finbif` 
library(finbif)

# Request an access token to access the FinBIF data (the token will arrive by email)
finbif_request_token("your@email.com")

# Copy the token (the long string of characters) from the email and activate it in R
# Technically speaking, this sets an environment variable called FINBIF_ACCESS_TOKEN with the token as its value
Sys.setenv(FINBIF_ACCESS_TOKEN = "xtmSOIxjPwq0pOMB1WvcZgFLU9QBklauOlonWl8K5oaLIx8RniJLrvcJU4v9H7Et")

# Note: this is not a real token; replace it with your own
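Before requesting data, it can help to confirm that the token is actually visible to the R session. The following is a minimal sketch using base R only; the helper name `has_finbif_token` is made up for this illustration and is not part of the `finbif` package.

```r
# Hypothetical helper (not part of finbif):
# returns TRUE if the environment variable FINBIF_ACCESS_TOKEN is non-empty
has_finbif_token <- function() {
  nzchar(Sys.getenv("FINBIF_ACCESS_TOKEN"))
}

if (has_finbif_token()) {
  message("FINBIF_ACCESS_TOKEN is set.")
} else {
  message("FINBIF_ACCESS_TOKEN is not set; run Sys.setenv() with your own token first.")
}
```

The check only tells you that some value is set, not that FinBIF accepts it as a valid token.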

# Retrieving 5000 observations of the barn swallow (*Hirundo rustica*, haarapääsky) in 2025 from FinBIF with the R package `finbif`.

# To see what the options in the command `finbif_occurrence` mean and what other options are available to control data retrieval, check the [manual of the `finbif` package on CRAN](https://cloud.r-project.org/web/packages/finbif/finbif.pdf) (CRAN is the central repository for R packages).

swallows_data <- finbif_occurrence(
  "Hirundo rustica",
  n = 5000,
  select = c("default_vars", "bio_province"),
  filter = list(date_range_ym = c("2025"))
)

2. Data formatting

2.0 Start here if you are working on the exercises to get the additional 0.5 ECTS.

First, let’s load the tidyverse R package and then load the pre-downloaded data into R. The data are stored in the shared folder of our course RStudio environment in Noppe. Because they are saved as an R object in an .rds file, we don’t need an additional import command here: the command readRDS reads the saved object straight back into R.

library(tidyverse)

swallows_data <- readRDS("/home/rstudio/shared/swallows_data.rds")
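To see how .rds files behave, here is a small round trip with saveRDS() and readRDS(). This is only a sketch: the toy data frame and its values are invented, and a temporary file is used instead of the course folders.

```r
# Toy data frame standing in for the real data (values are invented)
toy <- data.frame(
  bio_province = c("A", "B"),
  n_obs = c(3L, 5L)
)

# Write to a temporary file so nothing in your own folders is touched
tmp_file <- tempfile(fileext = ".rds")
saveRDS(toy, tmp_file)

# readRDS() returns the stored object; we choose its name when assigning it
toy_restored <- readRDS(tmp_file)

# Compare the restored object with the original
all.equal(toy, toy_restored)
```

Unlike a .csv file, an .rds file keeps the column types and the class of the object, which is why no further import steps are needed.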

2.1 Before getting started on the exercises for data manipulation, let’s make a copy of the data to keep the original data intact. Then, we will add a column for the month of observation. We will also remove observations dated in January, because we don’t expect swallows to occur in Finland in winter. Copy-paste these lines to your R script and run them as they are.

# Making a copy of the data
swallows <- swallows_data

# Creating a new column of observation months containing only the name of the month (abbreviated to three letters)
swallows$month <- format(as.Date(swallows$date_time), '%b') 

# If we inspect the data, we see there are lots of observations from January 1
# These are most likely observations without proper date information, so let's remove them
# We also know barn swallows don't occur in Finland in January
swallows <- swallows |> filter(month != "Jan")
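One thing to be aware of: month names produced with `%b` follow the language settings (locale) of the R session, so the abbreviation is "Jan" only in an English locale. If you work on a computer with other settings, a locale-independent alternative is to use the month number (`%m`). A minimal sketch on made-up dates (the vector `obs_dates` is hypothetical and stands in for the date column):

```r
# Made-up dates standing in for swallows$date_time
obs_dates <- as.Date(c("2025-01-01", "2025-05-14", "2025-07-02"))

# "%m" gives the month as a two-digit number, regardless of the locale
month_num <- format(obs_dates, "%m")

# Keep everything except January
kept_dates <- obs_dates[month_num != "01"]
kept_dates
```

In the course RStudio environment the original lines are meant to be run as they are, so this matters mainly when you run the script elsewhere.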

Now we are ready for some exercises! If you get stuck, have a look at the lecture notes and exercises on data manipulation.

3. Exercises

3.1 Look at the first lines of the data to get an overview of what the data look like.

head(swallows)

# If you use str(swallows) to check the structure of the data frame object, you will see that
# the top looks like a normal data frame. Below that there is more information, because our 
# data set is a special class of object produced by the `finbif` package.

3.2 The column bio_province shows the region of the observation. Let’s count the number of swallow observations (rows in the data) by region. Save the result in an object called counts_by_region.

counts_by_region <- swallows |> 
  summarise(count = n(),
            .by = bio_province)
  
# or

counts_by_region <- swallows |> 
  count(bio_province)
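The two approaches give the same counts but differ in the name of the count column and in row order. A small sketch on an invented toy data set (it assumes the dplyr package, which is part of the tidyverse, in version 1.1.0 or newer for `.by`):

```r
library(dplyr)

# Toy data: one row per observation (values are invented)
toy <- tibble(bio_province = c("B", "A", "B", "C", "B", "A"))

# summarise() with .by: we name the count column ourselves,
# and groups appear in the order they first occur in the data
by_summarise <- toy |>
  summarise(count = n(), .by = bio_province)

# count(): the count column is called n by default,
# and groups are sorted by the grouping column
by_count <- toy |>
  count(bio_province)

names(by_summarise)
names(by_count)
```

If you prefer count() but want the column to be called count, you can write `count(bio_province, name = "count")`.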

3.3 Next, let’s check whether there are any observations where the information for region is missing. One way to do this is to run the following line:

sum(is.na(swallows$bio_province))

The answer shows that there are two observations with a missing value for region (column bio_province).
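To see why this line counts missing values, here is a small sketch in base R on an invented vector: is.na() returns TRUE or FALSE for each element, and sum() counts the TRUE values. The table() call with `useNA = "ifany"` is an optional extra that shows missing values as their own category.

```r
# Invented region labels with two missing values
regions <- c("A", "B", NA, "A", NA)

# is.na() marks missing values; sum() counts how many are TRUE
n_missing <- sum(is.na(regions))
n_missing

# Counts per region, with NA shown as a separate category
region_table <- table(regions, useNA = "ifany")
region_table
```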

3.4 To fix the problem of missing values, let’s modify the code from step 3.2. Remove observations (rows) that have NA (missing value) in the column bio_province before counting.

counts_by_region <- swallows |> 
  filter(!is.na(bio_province)) |> 
  summarise(count = n(),
            .by = bio_province)

# or

counts_by_region <- swallows |> 
  filter(!is.na(bio_province)) |> 
  count(bio_province)

3.5 Arrange the rows of counts_by_region to show which region had the most observations of barn swallows and print the top 5.

Depending on how you did the counting in step 3.4, your column with counts is called either count or n. Make sure to use the correct column name here.

counts_by_region |>
  arrange(desc(count)) |>
  head(5)

# or 
counts_by_region |>
  arrange(desc(n)) |>
  head(5)
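As a side note, dplyr also offers slice_max(), which sorts and picks the top rows in one step. A sketch on invented counts (the table `toy_counts` is hypothetical; the top 2 are shown to keep the example short):

```r
library(dplyr)

# Invented regional counts
toy_counts <- tibble(
  bio_province = c("A", "B", "C", "D"),
  count = c(10L, 40L, 25L, 5L)
)

# arrange() sorts in descending order; head(2) keeps the first two rows
top_arrange <- toy_counts |>
  arrange(desc(count)) |>
  head(2)

# slice_max() does both at once; with tied counts it may return extra rows
top_slice <- toy_counts |>
  slice_max(count, n = 2)

top_arrange
top_slice
```

Either approach is fine for the exercise; the difference only shows when several regions share the same count.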

First task for the extra 0.5 ECTS: report the top 5 regions.

4. Saving and exporting your script

Now we are ready to move on to the next part, visualizing the regional counts of barn swallows (6. Data visualization -> Independent exercises (data visualization)).

To save and export your R script, you can find the instructions in the Starting with Data exercise sheet (steps 7.3 and 7.4).

If you downloaded the data from FinBIF yourself, here is how you can save the data so that you can continue working with it without retrieving it from the database again. Here we save the data frame object in the my-work directory as an R object (.rds file) with the command saveRDS():

saveRDS(swallows_data, "/home/rstudio/my-work/swallows_data.rds")