Importing and exporting data

This lesson is derived from Software and Data Carpentry teaching materials available under the CC BY 4.0 license:

http://swcarpentry.github.io/r-novice-inflammation/

https://datacarpentry.org/spreadsheet-ecology-lesson/

Objectives:

Be aware of recommended practices for data organisation
Build a deeper understanding of the arguments of R’s CSV reader
Know how to export data from R as a CSV file

1. Introduction

The most common way that scientists store data is in Excel spreadsheets. In the previous session, the data we imported into R was in a comma-separated values file (CSV). Indeed, while there are ways in R to access data from Excel files, users often find it easier to save their spreadsheets as CSV and then use R’s built-in functionality to read and manipulate the data. Many also prefer to use CSV files because they are not tied to any proprietary software, and are therefore ideal for cross-platform sharing.

In the previous session, we covered the stringsAsFactors function, but read.csv also comes with a number of arguments that we haven’t yet looked at. The survey data we used was also in a ready-to-use format. In this short lesson, we’ll explore further arguments that allow you to import the data correctly for your needs, as well as what you can subsequently do to export the data from R.

One of the key steps preceding any of this is to have your data formatted correctly, so that they can be imported into R. Before looking at CSV importing and exporting, we will take a quick look at some recommended practices for organising your data prior to reading them into R.

2. Spreadsheet organisation

A quick word on schedules: as we are already covering a lot of other ground during this course, this part will only cover some basics. We would encourage you to have a closer look at the Data Carpentry spreadsheet organisation lessons later on (link in the beginning of this document).

2a. Keeping track of your analyses

When you’re working with spreadsheets, during data clean up or analyses, it’s very easy to end up with a spreadsheet that looks very different from the one you started with. In order to be able to reproduce your analyses or figure out what you did when Reviewer #3 asks for a different analysis, you should

create a new file with your cleaned or analyzed data. Don’t modify the original dataset, or you will never know where you started!
keep track of the steps you took in your clean up or analysis. You should track these steps as you would any step in an experiment. We recommend that you do this in a plain text file stored in the same folder as the data file.

This might be an example of a spreadsheet setup:

Note: There is also a very useful Software Carpentry lesson on maintaining version control over your data using Git. While we won’t cover Git / GitHub lessons in this workshop, feel free to explore the lessons in your own time!

2b. Recommendations for data organisation

The cardinal rules of using spreadsheet programs for data:

Put all your variables in columns - the thing you’re measuring, like ‘weight’ or ‘temperature’.
Put each observation in its own row.
Don’t combine multiple pieces of information in one cell. Sometimes it just seems like one thing, but think if that’s the only way you’ll want to be able to use or sort that data.
Leave the raw data raw - don’t change it!
Export the cleaned data to a text-based format like CSV (comma-separated values) format. This ensures that anyone can use the data, and is required by most data repositories.

Extra guidelines to avoid problems at a later stage:

Avoid using multiple separate tables, for example:

In the example above, the computer will see (for example) row 4 and assume that all columns A-AF refer to the same sample. Other rows are similarly problematic.

Avoid spreading data across multiple tabs.
- You are more likely to add inconsistencies to your data if each time you take a measurement, you start recording data in a new tab, and
- This adds an extra step before you analyze your data because you will have to combine these data into a single datatable.
Record zeros as zeros (and null values as null).
- To the computer, a zero is data (you measured or counted it). A blank cell means that it wasn’t measured and the computer will interpret it as an unknown value, otherwise known as a null value.
- It is essential to use a clearly defined and consistent null indicator. Blanks (most applications) and NA (for R) are good choices.
Don’t use formatting (such as colour) to convey information or to make your spreadsheet look pretty (e.g. by merging cells or leaving blank rows to indicate separations in the data). Compare these two examples:

Place any comments in a separate column (rather than using a commenting function). This is because most analysis software cannot see Excel or LibreOffice comments, and would be confused by comments placed within your data cells.
- Similarly, record units in column headers rather than cells.
Avoid spaces, numbers and special characters in column headers. Underscores are good alternatives to spaces (which can be misinterpreted by certain parsers). Some programs don’t like field names that are text strings that start with numbers.
Record any metadata in a separate plain text file. While important, these can disrupt how computer programs interpret your data file.

2c. How to export an Excel spreadsheet in CSV format

Use Save As and choose CSV as the file format (click on the links to see examples as images): File -> Save As -> File Format: Comma-separated values (.csv).

3. Additional arguments for CSV importing

Before we complete some exercises, let’s cover a few other topics that might be useful for importing your data.

3a. Changing delimiters

The default delimiter of the read.csv() function is a comma, but you can use other delimiters by supplying the ‘sep’ argument to the function. For example, typing sep = ';' allows a semi-colon separated file to be correctly imported (see ?read.csv() for more information).

3b. The `header` argument

The default for read.csv(...) is to set the header argument to TRUE. This means that the first row of values in the .csv is set as header information (column names). If your data set does not have a header, set the header argument to FALSE.

3c. The `strip.white` argument

Even when adhering to good data organisation practices, it is not uncommon for a spreadsheet to contain mistakes that have gone unnoticed, for example a space (whitespace) may have accidentally been inserted before a data value.

By default this whitespace will be kept in the R environment, such that e.g. ‘ Red’ will be recognized as a different value than ‘Red’. In order to avoid this type of error, use the strip.white argument.

Challenge:

Before we carry on working with the survey data, let’s have a closer look at some of the arguments in read.csv (aside from stringsAsFactors). First we will load another data set into R, this time containing information on the speeds at which cars of different colors were clocked in 45 mph zones in different US states.

We can have a look at the data by going through the steps in Starting with Data Exercise Block 6 (~5-10 mins).

4. Exporting CSV files using `write.csv`

After altering our cars data set, we would now like to save the output. For this, we can use the write.csv function call. As with read.csv, there are several arguments to this function, a few of which are particularly important for how the data are exported. Let’s explore these now.

Before exporting the data we are going to create a new folder, data_output, in our working directory. It’s good practice to keep our output files separate from other files we might be using. This way, even if the files in the output folder are deleted, we can always re-generate them (since our raw data are stored elsewhere).

# Export the data. The write.csv() function requires a minimum of two
# arguments, the data to be saved and the name of the output file.

write.csv(carSpeeds, file = "data_output/car-speeds-cleaned.csv")

Since we are are working in the cloud, we cannot directly migrate to our data folder and use a program such as Excel to inspect the new file. However, we can have a quick (and somewhat inaccurate) look at how the spreadsheet looks in RStudio. If we click on the data_output folder in the file pane and then click on the file, an option appears for checking the file (View File). It won’t reproduce the columns correctly as Excel would - however, we can still see that three of the columns are familiar, but there is also an an extra column containing a sequence of numbers.

These are actually row numbers in the data set. To address this, we can change the output by using row.names.

4a. The `row.names` argument

This argument allows us to set the names of the rows in the output data file. R’s default for this argument is TRUE, and since it does not know what else to name the rows for the cars data set, it resorts to using row numbers. To correct this, we can set row.names to FALSE:

write.csv(carSpeeds, file = "data_output/car-speeds-cleaned.csv", row.names = FALSE)

You now know some of the principles for importing data into R and exporting them for further use. In fact, we’ve covered much more than the mere basics. Good work! In the next session, we will return to the survey data and learn about different data wrangling tools that come as part of the tidyverse (more on that soon).

5. Sourcing and saving R scripts

One more thing, though: how to load and save an R script?

In RStudio, we can click on File –> Open File... to load an existing script. We can also save the current one using File –> Save As.... Alternatively, the toolbar under the File menu contains icons for importing and exporting scripts.

If you’d like to export content from RStudio Server, do as follows:

Go to the “Files” pane
Tick the empty boxes next to the files / folders you would like to export
Click on “More” (look for a cogwheel symbol)
Click on “Export”. After this, specify a suitable name for the .zip file. The file will be saved into your default downloads folder.

Challenge

Go through the steps in Starting with Data Exercise Block 7 to export your data and save your script.