Starting with data - exercise sheet
These exercises are partially derived from Software Carpentry teaching materials available under the CC BY 4.0 license:
https://swcarpentry.github.io/r-novice-gapminder/
https://swcarpentry.github.io/r-novice-inflammation/
https://datacarpentry.org/R-ecology-lesson/
Block 1: downloading and importing data
This exercise block repeats the lecture contents - at this point it’s important that we’re all on the same page! We will need these data as we move on.
1.1 Import some example data
(/home/rstudio(shared/portal_data_joined.csv
) into an
object called surveys
. When importing the data, use the
argument stringsAsFactors = TRUE
. You can use
head()
to check that everything got imported correctly.
<- read.csv("/home/rstudio/shared/portal_data_joined.csv", stringsAsFactors=TRUE)
surveys head(surveys)
1.2 We just imported data from outside our current
working directory. Let’s also try something inside the working directory
- create a new folder called testfolder
(Files pane on the
bottom right –> New folder).
Block 2: inspecting data frames
2.1 Inspect the surveys
data set using
the following functions. We don’t need to memorize them now, but
eventually this can be useful when you are working with your own
data.
- Size:
dim()
- returns a vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)nrow()
- number of rowsncol()
- number of columns
- Content:
head()
- first 6 rowstail()
- last 6 rows
- Names:
names()
- column names (synonym ofcolnames()
fordata.frame
objects)rownames()
- row names
- Summary:
str()
- structure of the object and information about the class, length and content of each columnsummary()
- summary statistics for each column
2.2 Based on the output of
str(surveys)
, can you answer the following questions?
- What is the class of the object
surveys
? - How many rows and how many columns are in this object?
- How many different species IDs have been recorded during these surveys?
## * class: data frame
## * how many rows: 34786, how many columns: 13
## * how many species IDs: 48
Block 3: indexing and subsetting data frames
3.1 Create a data.frame
(surveys_200
) containing only the data in row 200 of the
surveys
dataset.
<- surveys[200, ] surveys_200
3.2 Notice how nrow()
gave you the
number of rows in a data.frame
?
- Use that number to pull out just that last row in the data frame.
- Compare that with what you see as the last row using
tail()
to make sure it’s meeting expectations. - Pull out that last row using
nrow()
instead of the row number. - Create a new data frame (
surveys_last
) from that last row.
nrow(surveys)
<- surveys[34786, ]
surveys_last1 tail(surveys)
# (Saving `n_rows` to improve readability)
<- nrow(surveys)
n_rows <- surveys[n_rows, ] surveys_last2
3.3 Use nrow()
to extract the row that
is in the middle of the data frame. Store the content of this row in an
object named surveys_middle
.
<- surveys[n_rows / 2, ] surveys_middle
3.4 A trickier example: combine nrow()
with the -
notation above to reproduce the behavior of
head(surveys)
, keeping just the first through 6th rows of
the surveys dataset. Don’t worry if you can’t get this one right!
head(surveys)
<- surveys[-(7:n_rows), ] surveys_head
Block 4: factors
4.1 First assign the gender data from
surveys
to a new object. You probably also remember there
are some measurements where the gender information is either missing or
undetermined, let’s call those “undetermined”.
<- surveys$sex
sex levels(sex)[1] <- "undetermined"
4.2 Rename “F” and “M” to “female” and “male”, respectively.
levels(sex)[2] <- "female"
levels(sex)[3] <- "male"
# levels(sex)[2:3] <- c("female", "male")
4.3 Can you reorder the data so that the levels are
ordered like this: undetermined
, male
,
female
? After this, use plot()
to check how
the data now look.
<- factor(sex, levels = c("undetermined", "male", "female"))
sex levels(sex)
plot(sex)
Block 5: more on data frames
5.1 We have seen how data frames are created when
using read.csv()
, but they can also be created by hand with
the data.frame()
function. There are a few mistakes in this
hand-crafted data.frame
. Can you spot and fix them? Don’t
hesitate to experiment!
<- data.frame(
animal_data animal = c(dog, cat, sea cucumber, sea urchin),
feel = c("furry", "squishy", "spiny"),
weight = c(45, 8 1.1, 0.8)
)
# missing quotations around the names of the animals
# missing entry in the feel column (one of the furry animals)
# missing one comma in the weight column
5.2 Can you predict the class for each of the
columns in the following example? Check your guesses using
str(country_climate)
:
- Are they what you expected? Why? Why not?
- What would have been different if we had added
stringsAsFactors = TRUE
when creating the data frame? - What would you need to change to ensure that each column had the accurate data type?
<- data.frame(
country_climate country = c("Canada", "Panama", "South Africa", "Australia"),
climate = c("cold", "hot", "temperate", "hot/temperate"),
temperature = c(10, 30, 18, "15"),
northern_hemisphere = c(TRUE, TRUE, FALSE, "FALSE"),
has_kangaroo = c(FALSE, FALSE, FALSE, 1)
)
# country, climate, temperature, and northern_hemisphere are character vectors; has_kangaroo is numeric
# using stringsAsFactors = TRUE would have made factors instead of character vectors
# removing quotes in temperature and northern_hemisphere and replacing 1 by TRUE in the has_kangaroo column
Block 6: arguments for CSV importing
6.1 Import a new data set on car speeds
(/home/rstudio/shared/car-speeds.csv
) and assign it to an
object called carSpeeds
.
<- read.csv("/home/rstudio/shared/car-speeds.csv") carSpeeds
6.2 Have a look at the structure of
carSpeeds
using str()
. Then try reimporting
the data with the header argument set to false. How does the new data
set differ from the previous one?
str(carSpeeds)
<- read.csv("/home/rstudio/shared/car-speeds.csv", header = FALSE)
carSpeeds str(carSpeeds)
# the first row of the data
1, ] carSpeeds[
6.3 The car colour labels contain accidental
whitespaces. To fix this, try importing the data using the
strip.white
argument. NOTE - this argument must be
accompanied by the sep
argument, by which we indicate the
type of delimiter in the file (the comma for most .csv files). In other
words, we need to add sep = ','
to the import command. You
can use unique()
to check if it worked!
unique(carSpeeds$Color) # right now we can see a problem here
# [1] Green Red White Red Black
# Levels: Red Black Green Red White
<- read.csv(
carSpeeds file = '/home/rstudio/shared/car-speeds.csv',
stringsAsFactors = FALSE,
strip.white = TRUE,
sep = ','
)
unique(carSpeeds$Color)
# [1] "Blue" "Red" "White" "Black"
# That looks better!
Block 7: exporting data and saving scripts
The steps here are identical to what we just covered, but it’s good to run through them by yourself!
7.1 Create a new folder for output files, called
data_output
.
7.2 Export the car speeds data. The write.csv() function requires a minimum of two arguments, the data to be saved and the name of the output file.
write.csv(carSpeeds, file = "data_output/car-speeds-cleaned.csv")
7.3 Save your script (File –> Save as…).
7.4 Export your data from RStudio Server.
- Go to the "Files" pane (on the right, next to "Plots")
- Tick the empty boxes next to the files / folders you would like to export
- Click on "More" (look for a cogwheel symbol)
- Click on "Export". After this, specify a suitable name for the .zip file. The file will be saved into your default downloads folder.