Starting with data - exercise sheet

These exercises are partially derived from Software Carpentry teaching materials available under the CC BY 4.0 license:

Block 1: downloading and importing data

This exercise block repeats the lecture contents - at this point it’s important that we’re all on the same page! We will need these data as we move on.

1.1 Import some example data (/home/rstudio(shared/portal_data_joined.csv) into an object called surveys. When importing the data, use the argument stringsAsFactors = TRUE. You can use head() to check that everything got imported correctly.

surveys <- read.csv("/home/rstudio/shared/portal_data_joined.csv", stringsAsFactors=TRUE)
head(surveys)

1.2 We just imported data from outside our current working directory. Let’s also try something inside the working directory - create a new folder called testfolder (Files pane on the bottom right –> New folder).

Block 2: inspecting data frames

2.1 Inspect the surveys data set using the following functions. We don’t need to memorize them now, but eventually this can be useful when you are working with your own data.

Size:
- dim() - returns a vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)
- nrow() - number of rows
- ncol() - number of columns
Content:
- head() - first 6 rows
- tail() - last 6 rows
Names:
- names() - column names (synonym of colnames() for data.frame objects)
- rownames() - row names
Summary:
- str() - structure of the object and information about the class, length and content of each column
- summary() - summary statistics for each column

2.2 Based on the output of str(surveys), can you answer the following questions?

What is the class of the object surveys?
How many rows and how many columns are in this object?
How many different species IDs have been recorded during these surveys?

## * class: data frame
## * how many rows: 34786,  how many columns: 13
## * how many species IDs: 48

Block 3: indexing and subsetting data frames

3.1 Create a data.frame (surveys_200) containing only the data in row 200 of the surveys dataset.

surveys_200 <- surveys[200, ]

3.2 Notice how nrow() gave you the number of rows in a data.frame?

Use that number to pull out just that last row in the data frame.
Compare that with what you see as the last row using tail() to make sure it’s meeting expectations.
Pull out that last row using nrow() instead of the row number.
Create a new data frame (surveys_last) from that last row.

nrow(surveys)
surveys_last1 <- surveys[34786, ]
tail(surveys)

# (Saving `n_rows` to improve readability)
n_rows <- nrow(surveys)
surveys_last2 <- surveys[n_rows, ]

3.3 Use nrow() to extract the row that is in the middle of the data frame. Store the content of this row in an object named surveys_middle.

surveys_middle <- surveys[n_rows / 2, ]

3.4 A trickier example: combine nrow() with the - notation above to reproduce the behavior of head(surveys), keeping just the first through 6th rows of the surveys dataset. Don’t worry if you can’t get this one right!

head(surveys)
surveys_head <- surveys[-(7:n_rows), ]

Block 4: factors

4.1 First assign the gender data from surveys to a new object. You probably also remember there are some measurements where the gender information is either missing or undetermined, let’s call those “undetermined”.

sex <- surveys$sex
levels(sex)[1] <- "undetermined"

4.2 Rename “F” and “M” to “female” and “male”, respectively.

levels(sex)[2] <- "female"
levels(sex)[3] <- "male"

# levels(sex)[2:3] <- c("female", "male")

4.3 Can you reorder the data so that the levels are ordered like this: undetermined, male, female? After this, use plot() to check how the data now look.

sex <- factor(sex, levels = c("undetermined", "male", "female"))
levels(sex)

plot(sex)

Block 5: more on data frames

5.1 We have seen how data frames are created when using read.csv(), but they can also be created by hand with the data.frame() function. There are a few mistakes in this hand-crafted data.frame. Can you spot and fix them? Don’t hesitate to experiment!

animal_data <- data.frame(
          animal = c(dog, cat, sea cucumber, sea urchin),
          feel = c("furry", "squishy", "spiny"),
          weight = c(45, 8 1.1, 0.8)
          )

# missing quotations around the names of the animals
# missing entry in the feel column (one of the furry animals)
# missing one comma in the weight column

5.2 Can you predict the class for each of the columns in the following example? Check your guesses using str(country_climate):

Are they what you expected? Why? Why not?
What would have been different if we had added stringsAsFactors = TRUE when creating the data frame?
What would you need to change to ensure that each column had the accurate data type?

country_climate <- data.frame(
       country = c("Canada", "Panama", "South Africa", "Australia"),
       climate = c("cold", "hot", "temperate", "hot/temperate"),
       temperature = c(10, 30, 18, "15"),
       northern_hemisphere = c(TRUE, TRUE, FALSE, "FALSE"),
       has_kangaroo = c(FALSE, FALSE, FALSE, 1)
       )
       
# country, climate, temperature, and northern_hemisphere are character vectors; has_kangaroo is numeric
# using stringsAsFactors = TRUE would have made factors instead of character vectors
# removing quotes in temperature and northern_hemisphere and replacing 1 by TRUE in the has_kangaroo column

Block 6: arguments for CSV importing

6.1 Import a new data set on car speeds (/home/rstudio/shared/car-speeds.csv) and assign it to an object called carSpeeds.

carSpeeds <- read.csv("/home/rstudio/shared/car-speeds.csv")

6.2 Have a look at the structure of carSpeeds using str(). Then try reimporting the data with the header argument set to false. How does the new data set differ from the previous one?

str(carSpeeds)

carSpeeds <- read.csv("/home/rstudio/shared/car-speeds.csv", header = FALSE)
str(carSpeeds)

# the first row of the data
carSpeeds[1, ]

6.3 The car colour labels contain accidental whitespaces. To fix this, try importing the data using the strip.white argument. NOTE - this argument must be accompanied by the sep argument, by which we indicate the type of delimiter in the file (the comma for most .csv files). In other words, we need to add sep = ',' to the import command. You can use unique() to check if it worked!

unique(carSpeeds$Color) # right now we can see a problem here

# [1] Green Red White Red Black
# Levels: Red Black Green Red White

carSpeeds <- read.csv(
  file = '/home/rstudio/shared/car-speeds.csv',
  stringsAsFactors = FALSE, 
  strip.white = TRUE,
  sep = ','
  )

unique(carSpeeds$Color)
# [1] "Blue" "Red" "White" "Black"
# That looks better!

Block 7: exporting data and saving scripts

The steps here are identical to what we just covered, but it’s good to run through them by yourself!

7.1 Create a new folder for output files, called data_output.

7.2 Export the car speeds data. The write.csv() function requires a minimum of two arguments, the data to be saved and the name of the output file.

write.csv(carSpeeds, file = "data_output/car-speeds-cleaned.csv")

7.3 Save your script (File –> Save as…).

7.4 Export your data from RStudio Server.

- Go to the "Files" pane (on the right, next to "Plots")

- Tick the empty boxes next to the files / folders you would like to export

- Click on "More" (look for a cogwheel symbol)

- Click on "Export". After this, specify a suitable name for the .zip file. The file will be saved into your default downloads folder.