Data manipulation - independent exercises
In this section we have some extra exercises for the first day of the R course to practice using the data wrangling tools we covered there. Here, we explore data from the Finnish Biodiversity Information Facility (FinBIF) as an example of using data from open data application programming interfaces (APIs).
We will use the R package finbif, which is an R interface to the FinBIF API. This means we can use the R package to download data from FinBIF directly to R and then continue working with it. You can find more information on the package here: https://luomus.github.io/finbif/
1. Getting the data
(If you like, instead of the Notebooks course environment you can also try these exercises on your own computer that has R and RStudio installed - see here for instructions.)
Begin by installing the package finbif and loading the necessary libraries in RStudio. Remember to start a new R script (these should be the first commands in the script).
# install the packages `finbif` (and `tidyverse` if not already installed)
install.packages("finbif")
install.packages("tidyverse")
# load the packages `finbif` and `tidyverse`
library(finbif)
library(tidyverse)
Next, we have to get an access token to access the FinBIF data. Replace your@email.com with your email address. You will then receive the token by email.
finbif_request_token("your@email.com")
Copy the token and activate it in R (technically speaking, we are setting an environment variable called FINBIF_ACCESS_TOKEN and giving the token as its value). Replace the long string in quotation marks with the token you received by email.
Sys.setenv(FINBIF_ACCESS_TOKEN = "xtmSOIxjPwq0pOMB1WvcZgFLU9QBklauOlonWl8K5oaLIx8RniJLrvcJU4v9H7Et")
# Note: this is not a real token and should be replaced with your own one
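To confirm that an environment variable is set in the current session, its value can be read back with Sys.getenv(). The following is a minimal sketch using a made-up variable name: DEMO_ACCESS_TOKEN and its value are hypothetical and are used here only so that your real token is not touched.

```r
# DEMO_ACCESS_TOKEN is a hypothetical name used only for illustration
Sys.setenv(DEMO_ACCESS_TOKEN = "not-a-real-token")

# Sys.getenv() returns the value as a character string ("" if the variable is unset)
demo_value <- Sys.getenv("DEMO_ACCESS_TOKEN")

# nzchar() is TRUE when the string is non-empty, i.e. the variable is set
token_is_set <- nzchar(demo_value)
print(token_is_set)
```

Running the same check with FINBIF_ACCESS_TOKEN in place of the demo name shows whether your own token is active. A variable set with Sys.setenv() lasts only for the current R session, so it has to be set again after restarting R.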
Now we are ready to retrieve data. Let’s retrieve the latest 5000 observations of the barn swallow (Hirundo rustica, haarapääsky) from FinBIF with the R package finbif.
Add the following command to your R script and run it as it is. To see what the options in the command finbif_occurrence mean and what other options are available to control data retrieval, you can check the manual of the finbif package on CRAN (the central repository for R packages).
swallows_data <- finbif_occurrence("Hirundo rustica", n = 5000, select = c("default_vars", "bio_province"))
If reading in the data fails for some reason, here is a backup option for reading in the data in the Notebooks course environment:
swallows_data <- readRDS("/home/rstudio/shared/swallows_data.rds")
2. Data formatting
2.1 Before getting started on the exercises for data manipulation, let’s make a copy of the data to keep the original data intact. Then, we will simplify the formatting of dates and remove double entries for regions (bio_province). We will also remove observations with dates in winter because we don’t expect swallows to occur then. Copy-paste these lines to your R script and run them as they are.
# making a copy of the data
swallows <- swallows_data
# setting format of the column `date_time` to year-month-day
swallows$date_time <- as.Date(format(swallows$date_time, "%Y-%m-%d"))
# removing double entries in the column `bio_province` by replacing text with the function gsub
swallows$bio_province <- gsub(",.*", "", swallows$bio_province)
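To see what these formatting steps do, it can help to try them on a small made-up data frame first. The sketch below uses toy values (not real FinBIF records) and also shows one possible way to drop winter dates. How the course defines winter is not stated here, so treating December, January and February as winter is an assumption, as are the names toy, month_number and toy_no_winter.

```r
# Toy data (made up for illustration; not real FinBIF records)
toy <- data.frame(
  date_time = as.POSIXct(
    c("2023-06-15 08:30:00", "2023-01-10 12:00:00", "2023-08-01 17:45:00"),
    tz = "UTC"
  ),
  bio_province = c("Region A, Region C", "Region B", NA),
  stringsAsFactors = FALSE
)

# Drop the time of day: format as year-month-day text, then convert to Date
toy$date_time <- as.Date(format(toy$date_time, "%Y-%m-%d"))

# ",.*" matches the first comma and everything after it,
# so only the first region of a double entry is kept
toy$bio_province <- gsub(",.*", "", toy$bio_province)

# Assumed definition of winter: December, January and February
month_number <- as.integer(format(toy$date_time, "%m"))
toy_no_winter <- toy[!(month_number %in% c(12, 1, 2)), ]

print(toy_no_winter)
```

Missing values pass through gsub() unchanged, so an NA region stays NA after the replacement.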
Now we are ready for some exercises!
3. Exercises
3.1 Look at the first lines of the data and check the structure of the object.
# Write the answer in your R script.
3.2 The column bio_province shows the region of the observation. Let’s calculate the number of swallow observations (rows in the data) by region. Save the result in a data frame called by_region.
# Write the answer in your R script.
Check the resulting data frame by_region. At the bottom there is a region called NA. If you look at the column bio_province in swallows carefully or run tail(swallows), you will notice that some observations are missing the information on region.
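The effect of missing values on counts can also be seen on a tiny made-up vector in base R. This is only an illustration of the idea, not the answer to the exercise; the vector regions and its values are hypothetical.

```r
# Toy vector of regions with one missing value (made up for illustration)
regions <- c("Region A", "Region B", "Region A", NA)

# By default table() drops missing values from the counts
counts_default <- table(regions)

# useNA = "ifany" adds a separate NA category when missing values are present
counts_with_na <- table(regions, useNA = "ifany")
print(counts_with_na)

# is.na() flags the missing entries, so sum() counts them
n_missing <- sum(is.na(regions))
print(n_missing)
```

Whether missing values show up as their own category depends on the counting tool and its settings, so it is worth checking for them explicitly with is.na().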
3.3 Let’s fix the problem by modifying the code from the previous step. Remove observations that have NA in the column bio_province.
# Write the answer in your R script.
3.4 Use by_region to check which region had the most observations of barn swallows and print the top 5.
# Write the answer in your R script.
4. Saving and exporting your script and saving the data for tomorrow
Let’s first save the R script as we will run through the answers tomorrow. Can you remember how to save and export it? You can find some instructions in the Starting with Data exercise sheet (steps 7.3 and 7.4).
Next, let’s save the data so that we can continue with it without retrieving it from the database again. Here we will save the data in the my-work directory as an R object (.rds file) with the command saveRDS():
saveRDS(swallows_data, "/home/rstudio/my-work/swallows_data.rds")
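To check that an object saved this way can be read back unchanged, a round trip can be tried on a small made-up data frame. The sketch below writes to a temporary file so that nothing in my-work is overwritten; the object names and values are hypothetical.

```r
# Toy data frame (made up for illustration)
toy <- data.frame(
  species = c("Hirundo rustica", "Hirundo rustica"),
  count = c(3, 5)
)

# tempfile() gives a throwaway path, so no existing file is overwritten
rds_path <- tempfile(fileext = ".rds")

# Save the object, then read it back into a new name
saveRDS(toy, rds_path)
toy_restored <- readRDS(rds_path)

# identical() is TRUE when the restored object matches the original exactly
print(identical(toy, toy_restored))
```

Unlike save() and load(), readRDS() returns the object instead of restoring it under its old name, so you choose the name when reading it in.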