Data manipulation - independent exercises

In this section we have some extra exercises for the first day of the R course, for those who are interested to test their R skills a little more. Here we explore data from Finnish Biodiversity Information Facility (FinBIF) as an example of using data from open data application programming interfaces (APIs).

We will use an R package finbif that is an R interface to the FinBIF API. This means we can use the R package to download data from FinBIF directly to R and then continue working with it. You can find more information on the package here: https://luomus.github.io/finbif/

1. Getting the data

(If you like, instead of the Notebooks course environment you can also try these exercises on your own computer that has R and RStudio installed - see here for instructions.)

Begin by installing the package finbif and loading the necessary libraries in RStudio. Remember to start a new R script (these should be the first commands on the script).

# install the packages `finbif` (and `tidyverse` if not already installed)
install.packages("finbif")
install.packages("tidyverse")

# load the packages ´finbif´ and ´tidyverse´
library(finbif)
library(tidyverse)

Next, we have to get an access token to access the FinBIF data. Replace your@email.com with your email address. You will then receive the token by email.

finbif_request_token("your@email.com")

Copy the token and active it in R (technically speaking we are setting an environmental variable called FINBIF_ACCESS_TOKEN and giving the token as the value). Replace the long string in quotation marks with the token you received by email.

Sys.setenv(FINBIF_ACCESS_TOKEN = "xtmSOIxjPwq0pOMB1WvcZgFLU9QBklauOlonWl8K5oaLIx8RniJLrvcJU4v9H7Et")

# Note: this is not a real token and should be replaced with your own one

Now we are ready to retrieve data. Let’s retrieve the latest 5000 observations of the barn swallow (Hirundo rustica, haarapääsky) from FinBIF with the R package finbif.

Add the following command to your R script and run it as it is. To see what the options in the command finbif_occurrence mean and what other options are available to control data retrieval, you can check the manual the finbif package on CRAN (the central repository for R packages).

swallows_data <- finbif_occurrence("Hirundo rustica", n=5000, select = c("default_vars", "bio_province"))

If reading in the data fails for some reason, here is a backup option for reading in the data in the Notebooks course environment:

swallows_data <- readRDS("/home/rstudio/shared/swallows_data.rds")

2. Data formatting

2.1 Before getting started on the exercises for data manipulation, let’s make a copy of the data to keep original data intact, Then, we will simplify the formatting of dates and remove double entries for regions (bio_province). We will also remove observations with dates in winter because we don’t expect swallows to occur then. Copy-paste these lines to your R script and run them as they are.

# making a copy of the data
swallows <- swallows_data

# setting format of the column ´date_time´ to year-month-day
swallows$date_time = as.Date(format(swallows$date_time, "%Y-%m-%d"))

# removing double entries in the column ´bio_province´ by replacing text with the function gsub
swallows$bio_province <- gsub(",.*", "", swallows$bio_province)

Now we are ready for some exercises!

3. Exercises

3.1 Look at the first lines of the data and check the structure of the object.

head(swallows)
str(swallows)

3.2 The column bio_province shows the region of the observation. Let’s calculate the number of swallows observations by region.

by_region <- swallows |> 
  summarise(count = n(),
  .by = bio_province)
  
# or

by_region <- swallows |> 
  count(bio_province)

Check the resulting data frame by_region. At the bottom there is a region called NA. If you look at the column bio_province in swallows carefully or run tail(swallows), you will notice that some observations are missing the information on region.

3.3 Let’s fix the problem by modifying the code from the previous step. Remove observations that have NA in the column bio_province.

by_region <- swallows |> 
  filter(!is.na(bio_province)) |> 
  count(bio_province)

3.4 Use by_region to check which region had the most observations of barn swallows and print the top 5.

by_region |>
  arrange(desc(count)) |>
  head(5)

# or 
by_region |>
  arrange(desc(n)) |>
  head(5)
  
# Depending on how you did the counting in step 3.3, your column with counts is called either ´count´ or ´n´. Make sure to use the correct column name here.

4. Saving and exporting your script and saving the data for tomorrow

Let’s first save the R script as we will run through the answers tomorrow. Can you remember how to save and export it? You can find some instructions in the Starting with Data exercise sheet (steps 7.3 and 7.4).

Next, let’s save the data so that we can continue with it without retrieving it from the database again. Here we will save the data in the my-work directory as an R object (.rds file) with the command saveRDS():

saveRDS(swallows_data, "/home/rstudio/my-work/swallows_data.rds")