Starting with data

This lesson is derived from Data Carpentry teaching materials available under the CC BY 4.0 license:

https://datacarpentry.org/R-ecology-lesson/

Objectives:

Load external data from a CSV file into a data frame
Describe what a data frame is
Summarize the contents of a data frame
Use indexing to subset specific portions of data frames
Describe what a factor is
Convert between strings and factors
Reorder and rename factors

1. Presentation of the data

We are studying the species distribution and weight of animals caught in plots in our study area. The dataset is stored as a comma-separated values (CSV) file. Each row holds information for a single animal, and the columns represent:

Column	Description
record_id	Unique id for the observation
month	month of observation
day	day of observation
year	year of observation
plot_id	ID of a particular plot
species_id	2-letter code
sex	sex of animal (“M”, “F”)
hindfoot_length	length of the hindfoot in mm
weight	weight of the animal in grams
genus	genus of animal
species	species of animal
taxa	e.g. Rodent, Reptile, Bird, Rabbit
plot_type	type of plot

2. Importing the data

We will use read.csv() to load into memory the content of the CSV file as an object of class data.frame. The file is called portal_data_joined.csv and it resides in a directory within our R instance (/home/rstudio/shared).

We haven’t yet explored our working directory in much detail, so let’s have a quick look. Let’s check the current working directory using getwd():

getwd()

# This will give something like: /home/rstudio/my-work/<R-project>

This looks OK for now, but we could also change it using setwd(...) (with the new path given inside the brackets). For now, we will use the present working directory (we can still import our data from outside it).

We can now import the data with a call to the function read.csv(). We will come back to this function and its arguments in the next session.

surveys <- read.csv("/home/rstudio/shared/portal_data_joined.csv")

# Note:

# If we wanted to download the file from a website instead, such as one of these:
# https://raw.githubusercontent.com/csc-training/da-with-r/master/DataFiles/portal_data_joined.csv

# ... We could use download.file(): 
# download.file(url="https://tinyurl.com/portaljoined",
#              destfile = "portal_data_joined.csv")

# This would save the file in your current working directory.
# We could also save it elsewhere by writing ... destfile = "path/to/portal_data_joined.csv"

The surveys <- read.csv("/home...") statement doesn’t produce any output because, as you might recall, assignments don’t display anything. If we want to check that our data has been loaded, we can see the contents of the data frame by typing its name: surveys.

Wow… that was a lot of output. At least it means the data loaded properly. Let’s check the top (the first 6 lines) of this data frame using the function head():

head(surveys)

## Try also
View(surveys)

Ok, looks good! Let’s keep on working with the survey data.
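
If the shared file is not at hand, the same import steps can be tried on a small stand-in file. The following is only a sketch: the three rows are made-up values, and the names csv_path and toy are illustrative assumptions rather than part of the lesson data.

```r
# Write a tiny CSV to a temporary file (made-up values, for illustration only)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("record_id,species_id,weight",
             "1,NL,40",
             "2,DM,48",
             "3,PF,"),
           csv_path)

# Read it back; the empty weight field in the last row becomes NA
toy <- read.csv(csv_path)

class(toy)        # the class of the imported object
head(toy, n = 2)  # head() also accepts the number of rows to show
```

The n argument of head() is handy when six rows are more (or fewer) than you want to see.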

Challenge:

At this point, let’s complete Starting with Data Exercise Block 1 (~ 5 mins).

3. What are data frames?

Data frames are the de facto data structure for most tabular data in R, and what we use for statistics and plotting.

Data frames can be created by hand, but most commonly they are generated by the functions read.csv() or read.table(); in other words, when importing spreadsheets from your hard drive (or the web).

A data frame is the representation of data in the format of a table where the columns are vectors that all have the same length. Because columns are vectors, each column must contain a single type of data (e.g., characters, integers, factors). For example, a data frame could comprise a numeric, a character and a logical vector.
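
To make the "columns are vectors" idea concrete, here is a minimal sketch of a data frame built by hand. The object name animals and all values are made up for illustration.

```r
# A small data frame built by hand: three vectors of equal length
animals <- data.frame(
  weight   = c(40.5, 48.0, 7.2),                        # numeric
  genus    = c("Neotoma", "Dipodomys", "Perognathus"),  # character
  captured = c(TRUE, FALSE, TRUE),                      # logical
  stringsAsFactors = FALSE  # keep text as character, also on R versions before 4.0.0
)

sapply(animals, class)  # one class per column
```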

We can see this when inspecting the structure of a data frame with the function str():

str(surveys)

4. Inspecting `data.frame` objects

We already saw how the functions head() and str() can be useful to check the content and the structure of a data frame. There are also many other useful functions for inspecting data frames, such as dim() (which returns a vector with the number of rows as the first element and the number of columns as the second element). Many of these functions are “generic”, meaning that they can be used on other types of objects besides data.frame.
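
A few of these inspection functions are sketched below on a small made-up data frame (toy is an illustrative name, not the surveys data), so the results are easy to check by eye.

```r
# Toy data frame with made-up values
toy <- data.frame(
  plot_id = c(2L, 2L, 3L, 1L),
  sex     = c("M", "F", "F", "M"),
  weight  = c(40, 48, NA, 22)
)

dim(toy)          # number of rows, then number of columns
nrow(toy)         # number of rows only
ncol(toy)         # number of columns only
names(toy)        # column names
tail(toy, n = 2)  # last two rows
summary(toy)      # per-column summary statistics
```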

Challenge

Let’s inspect the surveys data set further by going through Starting with Data Exercise Block 2 (~ 5 mins).

5. Indexing and subsetting data frames

Our surveys data frame has rows and columns (it has 2 dimensions). If we want to extract some specific data from it, we need to specify the “coordinates” we want. Row numbers come first, followed by column numbers. However, note that different ways of specifying these coordinates lead to results with different classes.

# first element in the first column of the data frame (as a vector)
surveys[1, 1]   
# first element in the 6th column (as a vector)
surveys[1, 6]   
# first column of the data frame (as a vector)
surveys[, 1]    
# first column of the data frame (as a data.frame)
surveys[1]      
# first three elements in the 7th column (as a vector)
surveys[1:3, 7] 
# the 3rd row of the data frame (as a data.frame)
surveys[3, ]    
# equivalent to head_surveys <- head(surveys)
head_surveys <- surveys[1:6, ]

: is a special function that creates numeric vectors of integers in increasing or decreasing order; try 1:10 and 10:1, for instance.
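
The point about result classes can be checked directly with class(). This is a sketch on a made-up data frame (toy is an illustrative name), not on the surveys data.

```r
# Toy data frame with made-up values
toy <- data.frame(
  id     = 1:4,
  sex    = c("M", "F", "F", "M"),
  weight = c(40, 48, NA, 22)
)

class(toy[1, 1])  # a single value from an integer column
class(toy[, 1])   # a whole column extracted with [ , ]
class(toy[1])     # one column selected with [ ] stays a data frame
class(toy[3, ])   # one row stays a data frame

2:4               # increasing sequence
4:2               # decreasing sequence
toy[2:3, "weight"]  # rows 2 to 3 of the weight column
```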

You can also exclude certain indices of a data frame using the “-” sign:

# The whole data frame, except the first column
surveys[, -1]

# Equivalent to head(surveys): drop every row from the 7th to the last
surveys[-(7:nrow(surveys)), ]

Data frames can be subset by calling indices (as shown previously), but also by calling their column names directly:

surveys["species_id"]       # Result is a data.frame
surveys[, "species_id"]     # Result is a vector
surveys[["species_id"]]     # Result is a vector
surveys$species_id          # Result is a vector

In RStudio, you can press the tab key to use the autocompletion feature to get the full and correct names of the columns.
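
As a quick check that the three vector-returning forms give the same result, here is a sketch on a made-up data frame (toy and the by_* names are illustrative). It also shows drop = FALSE, a standard argument of [ for data frames that keeps the result as a data frame.

```r
# Toy data frame with made-up values
toy <- data.frame(
  id  = 1:4,
  sex = c("M", "F", "F", "M")
)

by_dollar <- toy$sex
by_double <- toy[["sex"]]
by_comma  <- toy[, "sex"]

identical(by_dollar, by_double)  # same vector
identical(by_dollar, by_comma)   # same vector

is.data.frame(toy["sex"])                  # single brackets keep a data frame
is.data.frame(toy[, "sex", drop = FALSE])  # so does the comma form with drop = FALSE
```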

Optional challenge

Have a look at Starting with Data Exercise Block 3 (~ 15 mins).

6. Introducing factors

When we did str(surveys) we saw that the columns species_id, genus, species, taxa, sex, and plot_type consist of character data, i.e., strings of letters. These columns are, however, intended to represent categorical data, for example information on group membership (e.g. whether measurements belong to experimental controls or different treatment groups). In R, categorical data are handled as a special class called factor. Factors are very useful and contribute to making R particularly well suited to working with data. So we are going to spend a little time introducing them.

To convert the character data column species_id in our surveys data frame into a factor, we use the function factor():

species_id <- factor(surveys$species_id)

Factors are stored as integers associated with labels and can be either ordered or unordered. While factors look (and often behave) like character vectors, they are actually treated as integer vectors by R. So you need to be very careful when treating them as strings. Let’s have a look at what this means in practice!

The defining feature of a factor is that it can have several levels. By default, R always sorts levels in alphabetical order. For instance, if we wanted to create a factor called “sex” with the levels “male” and “female”:

sex <- factor(c("male", "female", "female", "male"))

R will assign 1 to the level "female" and 2 to the level "male" (because f comes before m, even though the first element in this vector is "male"). You can see this by using the function levels() and you can find the number of levels using nlevels():

levels(sex)
nlevels(sex)
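
The integers behind the labels can be made visible with as.integer(). A short sketch using the same sex vector:

```r
sex <- factor(c("male", "female", "female", "male"))

as.integer(sex)  # 2 1 1 2: "female" is level 1, "male" is level 2
typeof(sex)      # "integer": how the values are stored
class(sex)       # "factor": how R treats the object
table(sex)       # number of observations per level
```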

Sometimes the order of the levels does not matter; other times you might want to specify the order because it is meaningful (e.g., “low”, “medium”, “high”), because it improves your visualization, or because it is required by a particular type of analysis. We can first check the current order:

sex
#> [1] male female female male  
#> Levels: female male

One way to reorder our levels in the sex vector would be:

sex <- factor(sex, levels = c("male", "female"))
sex 

# after re-ordering
#> [1] male female female male  
#> Levels: male female

In R’s memory, these factors are represented by integers (1, 2, 3), but are more informative than integers because factors are self-describing: "female", "male" is more descriptive than 1, 2. Which one is “male”? You wouldn’t be able to tell just from the integer data. Factors, on the other hand, have this information built in. It is particularly helpful when there are many levels (like the species names in our example dataset).
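
When the order itself is meaningful, as with “low”, “medium”, “high”, a factor can also be declared as ordered. The sketch below uses a made-up vector named dose (an illustrative assumption, not part of the surveys data).

```r
# An ordered factor: levels are given explicitly, from lowest to highest
dose <- factor(c("low", "high", "medium", "low"),
               levels  = c("low", "medium", "high"),
               ordered = TRUE)

levels(dose)   # "low" "medium" "high"
dose < "high"  # TRUE FALSE TRUE TRUE: comparisons follow the level order
max(dose)      # the highest level present
```

Without levels = ..., the levels would be sorted alphabetically ("high", "low", "medium"), which is rarely the order you want here.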

7. Converting factors

If you need to convert a factor to a character vector, you use as.character(x).

as.character(sex)

In some cases, you may have to convert factors where the levels appear as numbers (such as concentration levels or years) to a numeric vector. For instance, in one part of your analysis the years might need to be encoded as factors (e.g., comparing average weights across years) but in another part of your analysis they may need to be stored as numeric values (e.g., doing math operations on the years). This conversion from factor to numeric is a little trickier. The as.numeric() function returns the index values of the factor, not its levels, so it will result in an entirely new (and unwanted in this case) set of numbers. One method to avoid this is to convert factors to characters, and then to numbers.

Another method is to use the levels() function. Compare:

year_fct <- factor(c(1990, 1983, 1977, 1998, 1990))

as.numeric(year_fct) # Wrong! And there is no warning...
as.numeric(as.character(year_fct)) # Works...

as.numeric(levels(year_fct))[year_fct] # Using levels()

Notice that in the levels() approach, three important steps occur:

We obtain all the factor levels using levels(year_fct)
We convert these levels to numeric values using as.numeric(levels(year_fct))
We then access these numeric values using the underlying integers of the vector year_fct inside the square brackets
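
The three steps can be spelled out one at a time. In this sketch, lev, lev_num and codes are illustrative intermediate names that the one-line version does not need.

```r
year_fct <- factor(c(1990, 1983, 1977, 1998, 1990))

lev     <- levels(year_fct)      # step 1: "1977" "1983" "1990" "1998"
lev_num <- as.numeric(lev)       # step 2: 1977 1983 1990 1998
codes   <- as.integer(year_fct)  # the underlying integers: 3 2 1 4 3

lev_num[codes]                   # step 3: 1990 1983 1977 1998 1990

# Same result as the one-line version
identical(lev_num[codes], as.numeric(levels(year_fct))[year_fct])
```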

8. Renaming factors

When your data is stored as a factor, you can use the plot() function to get a quick glance at the number of observations represented by each factor level. Let’s look at the number of males and females captured over the course of the experiment. Before plotting, we need to convert the column sex into a factor:

## bar plot of the number of females and males:
surveys$sex <- factor(surveys$sex)
plot(surveys$sex)

In addition to males and females, there are about 1700 individuals for which the sex information hasn’t been recorded. Additionally, for these individuals, there is no label to indicate that the information is missing or undetermined. Let’s rename this label to something more meaningful. Before doing that, we’re going to pull out the data on sex and work with that data, so we’re not modifying the working copy of the data frame:

sex <- factor(surveys$sex)
head(sex)

#> [1] M M        
#> Levels:  F M

levels(sex)
#> [1] ""  "F" "M"

levels(sex)[1] <- "undetermined"
levels(sex)
#> [1] "undetermined" "F" "M"

head(sex)
#> [1] M M undetermined undetermined undetermined
#> [6] undetermined
#> Levels: undetermined F M
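
Renaming by position works, but it relies on the empty label being the first level. A slightly more defensive sketch, shown here on a short made-up vector rather than the surveys data, selects the level by its value and then reorders the levels for display:

```r
# Made-up vector with two missing labels
sex <- factor(c("M", "M", "", "F", ""))
levels(sex)  # "" "F" "M"

# Rename the empty level by matching its value, not its position
levels(sex)[levels(sex) == ""] <- "undetermined"

# Reorder so that "undetermined" comes last
sex <- factor(sex, levels = c("F", "M", "undetermined"))
table(sex)
```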

It is also possible to import the surveys data set with read.csv() and the argument stringsAsFactors = TRUE. This argument coerces (i.e., converts) every column that contains characters (text) into a factor. More commonly, however, we would rather perform such conversions separately and for selected columns only, after the data have been imported. This way we only convert those columns where the factor data type is required. This corresponds to stringsAsFactors = FALSE, which has been the default behaviour of the read.csv() and read.table() functions since R version 4.0.0 (before that, the default was TRUE).
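
The difference between the two settings can be seen on a few lines of inline CSV text. This is a sketch with made-up rows; csv_text, as_chr, as_fct and selected are illustrative names, and the text argument of read.csv() is used only to avoid needing a file.

```r
# Three made-up rows supplied as text instead of a file
csv_text <- "species_id,sex,weight\nNL,M,40\nDM,F,48\nDM,M,29"

as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)  # default since R 4.0.0
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)   # every text column becomes a factor

class(as_chr$sex)  # "character"
class(as_fct$sex)  # "factor"

# Converting only a selected column after import
selected <- as_chr
selected$sex <- factor(selected$sex)
sapply(selected, class)
```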

Challenge

Have a look at Starting with Data Exercise Block 4 (~ 5 mins).

Optional challenge

Have a look at Starting with Data Exercise Block 5 (~ 5-10 mins).

The automatic conversion of data type is sometimes a blessing, sometimes an annoyance. Be aware that it exists, learn the rules, and double check that data you import in R are of the correct type within your data frame. If not, use it to your advantage to detect mistakes that might have been introduced during data entry (for instance, a letter in a column that should only contain numbers).
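
As an illustration of using type conversion to spot data entry mistakes, the sketch below reads a made-up column in which one weight was typed as "4o" (letter o instead of zero). The names csv_text, entries and parsed are illustrative assumptions.

```r
# Made-up data with one mistyped weight ("4o" instead of "40")
csv_text <- "record_id,weight\n1,40\n2,4o\n3,48"
entries  <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(entries$weight)  # "character": a hint that something is not a number

# Values that fail to parse become NA; the warning is expected here
parsed <- suppressWarnings(as.numeric(entries$weight))

# Rows whose weight is present but not numeric
entries[is.na(parsed) & !is.na(entries$weight), ]
```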
