In this tutorial, we’ll be working with data from UMass’ Office of Institutional Research (OIR). The OIR collects and reports many of the statistics about the UMass campus. Specifically, we’ll be working with this spreadsheet containing data about undergraduate enrollment by major.

The goal of this tutorial is to show how you can extract data stored in a PDF document—a very common though problematic file type for data crunchers like us—into a CSV file that can be easily used by R.

We’ll use a free program called Tabula to handle the extraction. It sometimes works really well. Other times, it doesn’t work well at all. For the sake of this exercise, I’m selecting a real-world dataset that won’t give us too big of a headache.

Extracting the data

Before we can bring the data from our PDF into R, we’ll want to extract it and store it in a format that works well with R. One such format—the one we’ve been using in our exercises so far—is CSV, or a plain-text file with comma-separated values.
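For example, the first few lines of a CSV file might look like this (the values here are made up purely for illustration):

Department or Program,Fall 2009,Fall 2010
Journalism,310,325
History,540,512

Each line represents a row, and each comma separates one column value from the next.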

Let’s begin by opening up Tabula. A new browser window should open up, with an interface that looks something like this:

[Screenshot: the Tabula start page as it appears in the browser]

We’ll then click the blue “Browse” button and select the PDF file we want to extract information from. Once we’ve selected that file from our computer, we’ll click the nearby “Import” button. Tabula will spend a few seconds importing the file and automatically take you to a screen that looks like this:

[Screenshot: Tabula’s PDF preview and selection screen]

This screen gives us a preview of what our PDF file looks like and allows us to select the elements from each page that we want Tabula to try to extract.

The first thing I like to do is to see if Tabula can correctly guess where the data are by clicking on “Autodetect Tables”. You’ll notice that a red box appears over the preview image:

[Screenshot: a red selection box drawn by “Autodetect Tables”]

That red box indicates the information that Tabula will try to extract: the stuff that is of interest to us. Sometimes, as is the case here, the autodetect function does a perfect job of guessing what’s relevant to us and draws boxes around only what we need. Other times, it won’t do a very good job at all, and will either select irrelevant information (like page numbers) or miss relevant numbers.

When it doesn’t work, you can just click the “Clear All Selections” button. Or, if you just want to clear one selection (e.g., on a single page), click the “X” in the top right-hand corner of the red box.

You can also draw your own boxes by simply clicking and dragging to create a box around the data you want:

[Screenshot: drawing a selection box by clicking and dragging]

You can draw as many boxes as you want if you need multiple segments of data. Do note that each box you create only covers the selected region of the given page. Sometimes, relevant data will appear in the same part of every page; in that case, we can click the “Repeat this Selection” option that appears next to the box after we highlight it, which applies the selection to every page of the PDF document.

Let’s clear that selection and select “Autodetect Tables” again, since it worked well with this document. Even though it did a good job of guessing the boundaries for the relevant data, we’ll want to make a small adjustment on each page. Specifically, we’ll want to drag the top edge of the box down so that it sits just above the “Department or Program” heading.

The first reason for that is that the information in the maroon box (“Undergraduate Students by Major”) isn’t data; it’s just contextual information for us. Additionally, we’ll leave out the “School or College” piece because there are no columns immediately to the right of it; the “Fall 2009”, etc., headings only appear on the line below it. The goal with a CSV file is to have our header information in the first row, so we want that line at the very top of our selection.

The second reason is that our selection (box) has huge implications for how well Tabula can guess what goes where in our data. Specifically, if we go as high as “School or College”, Tabula may get confused about the spacing between columns and combine the “Fall 2009” and “Fall 2010” information into the same cell (i.e., treat them as a single column). It’s usually helpful to get your selection as close as you can to a perfect matrix.

We’ll drag our selection box down to “Department or Program” on each of the three pages of the PDF. (Be careful not to select “Repeat this Selection” here, because the table on the third page is shorter and includes irrelevant information just below it.)

Here’s what it should look like for you (on all three pages):

[Screenshot: the adjusted selection box, with its top edge just above the “Department or Program” row]

Once you’ve done that (again, for all three pages), click the green “Preview & Export Extracted Data” button.

You should get a screen that looks like this:

[Screenshot: the “Preview & Export Extracted Data” screen]

Just as we’d hoped, our first row contains the heading information we were looking for. You’ll notice on the left-hand bar that there are two extraction methods. The first, “Stream,” looks for whitespace between values to denote columns. The second, “Lattice,” looks for boundary lines to denote columns. Because the table in our PDF document didn’t have any boundary lines, it’s best to stick with “Stream.” Other documents may perform better with “Lattice.”

The second thing we’ll want to check is that our columns align across pages. Tabula will break up each table (page) with some white space in the preview, like so:

[Screenshot: the extraction preview, with white space separating each page’s table]

We can see that Page 1 and Page 2 line up well and, most importantly, have the same number of columns. We’ll check that Page 3 does as well.

One thing we’ll have to keep in mind is that when we export these data as a single CSV file, there will be header lines in the middle of our dataset (i.e., the rows with “Department or Program”). We’ll just need to make a mental note to exclude those lines later on. (Again, if we had removed them from our red boxes, we might have run into column alignment issues with Tabula. But you can, and should, play around to see what works best.)

Once we’re done, we can just select “CSV” as our “Export Format” and click “Export.” Tabula will ask us where to save the file. For good practice, I like to save it in a subdirectory of my Project Directory called data. (For example, this might be in a subfolder of my Documents folder called “Data-Driven Storytelling/Exercise 27/data/”. I’d store my R Notebook files in the “Exercise 27” folder.)

Pre-Analysis

Load the data

You can load the CSV file produced above with the following code:

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.0     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
enrollment_majors <- read_csv("http://projects.rodrigozamith.com/datastorytelling/data/oir_undergraduate_enrollment_by_major.csv")
## Parsed with column specification:
## cols(
##   `Department or Program` = col_character(),
##   `Fall 2009` = col_character(),
##   `Fall 2010` = col_character(),
##   `Fall 2011` = col_character(),
##   `Fall 2012` = col_character(),
##   `Fall 2013` = col_character(),
##   `Fall 2014` = col_character(),
##   `Fall 2015` = col_character(),
##   `Fall 2016` = col_character(),
##   `Fall 2017` = col_character(),
##   `Fall 2018` = col_character()
## )

Alternatively, if you extracted the information from the PDF file on your own machine using Tabula, you can use your own extraction with the following code:

enrollment_majors <- read_csv("data/oir_undergraduate_enrollment_by_major.csv") # CHANGE THIS FILE PATH TO REFLECT YOUR TABULA CSV FILE!

If you’re using RStudio Cloud, make sure you upload the CSV file first!

Let’s see how the file was imported:

head(enrollment_majors, 10)
str(enrollment_majors)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 169 obs. of  11 variables:
##  $ Department or Program: chr  "College of Humanities and Fine Arts" "Afro-American Studies" "Architecture" "Art" ...
##  $ Fall 2009            : chr  "2,496" "9" "1 11" "193" ...
##  $ Fall 2010            : chr  "2,432" "10" "113" "181" ...
##  $ Fall 2011            : chr  "2,393" "1 4" "1 36" "169" ...
##  $ Fall 2012            : chr  "2,147" "13" "124" "133" ...
##  $ Fall 2013            : chr  "1,928" "9" "109" "127" ...
##  $ Fall 2014            : chr  "1,749" "11" "100" "115" ...
##  $ Fall 2015            : chr  "1,801" "13" "110" "112" ...
##  $ Fall 2016            : chr  "1,785" "14" "105" "132" ...
##  $ Fall 2017            : chr  "1,808" "20" "122" "128" ...
##  $ Fall 2018            : chr  "1,762" "1 7" "1 29" "114" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `Department or Program` = col_character(),
##   ..   `Fall 2009` = col_character(),
##   ..   `Fall 2010` = col_character(),
##   ..   `Fall 2011` = col_character(),
##   ..   `Fall 2012` = col_character(),
##   ..   `Fall 2013` = col_character(),
##   ..   `Fall 2014` = col_character(),
##   ..   `Fall 2015` = col_character(),
##   ..   `Fall 2016` = col_character(),
##   ..   `Fall 2017` = col_character(),
##   ..   `Fall 2018` = col_character()
##   .. )

Clean the data

We see right away that we need to clean some things up. First, our numeric values (e.g., student enrollment figures) are currently of the chr (character) type. Second, we have some spacing issues to sort out (e.g., “1 4” instead of “14” in the second row). Third, we have some “-” values where there should be empty (NA) values. Fourth, we still have those header rows buried in our dataset. Fifth, although it isn’t strictly necessary, it might be good to rename the variables (columns) in anticipation of how we’ll use the data.

Let’s address all those problems, in reverse order.

Renaming variables

We can use the base::names() function to look up and change the names of our variables. That function just takes a single argument: the name of the object. We can thus look at the names by doing the following:

names(enrollment_majors)
##  [1] "Department or Program" "Fall 2009"            
##  [3] "Fall 2010"             "Fall 2011"            
##  [5] "Fall 2012"             "Fall 2013"            
##  [7] "Fall 2014"             "Fall 2015"            
##  [9] "Fall 2016"             "Fall 2017"            
## [11] "Fall 2018"

Let’s change that to (a) make it faster to call the first variable and (b) keep only the years for the subsequent variables. To do that, we can create a vector with the same length (number of elements) as there are columns in the dataframe, and then assign it as the dataframe’s column names by reassigning it to the object via the names() function. This is what that would entail:

names(enrollment_majors) <- c("Department", "F2009", "F2010", "F2011", "F2012", "F2013", "F2014", "F2015", "F2016", "F2017", "F2018")
names(enrollment_majors)
##  [1] "Department" "F2009"      "F2010"      "F2011"      "F2012"     
##  [6] "F2013"      "F2014"      "F2015"      "F2016"      "F2017"     
## [11] "F2018"

Removing rows

Second, let’s check for those header rows using the dplyr::filter() function to look up any instance where the Department variable starts with "Department" (using the base::startsWith() function, which simply checks whether a string begins with a given set of characters):

enrollment_majors %>%
  filter(startsWith(Department, "Department"))

As expected, we have those two junk rows. We can remove them by running the same code as above, but using ! (exclamation mark) to specify that we want to keep everything except what matches our filter condition. Put differently, we’ll include all the rows where the Department variable does not start with “Department”. Then, we’ll reassign the results into an object representing the clean version of our dataset:

enrollment_majors_clean <- enrollment_majors %>%
  filter(!startsWith(Department, "Department"))

You’ll see in your environment pane (top-right) that enrollment_majors_clean has two fewer observations than enrollment_majors, just as we expected!

Cleaning empty values

All of the empty values in our dataset are currently labeled as "-". We can do a quick check by looking for those values in Fall 2012:

enrollment_majors_clean %>%
  filter(F2012 == "-")

We can quickly clean that up by running the following code:

enrollment_majors_clean[enrollment_majors_clean=="-"] <- NA

That code looks inside the enrollment_majors_clean dataframe, searches through every value (we type the object name twice to do that), and replaces any value equal to "-" with an NA value.

We can check that it worked by modifying the condition in the previous line of code to check for NA values:

enrollment_majors_clean %>%
  filter(is.na(F2012))

Sure enough, it looks just like the previous chunk, but with NA values instead of "-".
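As an aside, if you prefer to stay within the tidyverse, a functionally equivalent sketch uses dplyr’s na_if() function, which replaces any value matching its second argument with NA:

# A tidyverse-style alternative to the base R indexing approach above:
enrollment_majors_clean <- enrollment_majors_clean %>%
  mutate_at(vars(-"Department"), list(~na_if(., "-")))

Either approach produces the same result here.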

Removing spaces and commas

We also want to remove spaces and commas in our values in order to ensure the conversion to numbers works flawlessly. We can do that by combining the dplyr::mutate_at() function with the stringr::str_replace_all() function like so:

enrollment_majors_clean <- enrollment_majors_clean %>%
  mutate_at(vars(-"Department"), list(~str_replace_all(., ",", ""))) %>%
  mutate_at(vars(-"Department"), list(~str_replace_all(., " ", "")))
head(enrollment_majors_clean)

The above code does a few different things. First, we take the data from enrollment_majors_clean and pass it to the dplyr::mutate_at() function. That function changes the values of the variables we select. We don’t want to touch the Department variable because spaces matter there, so we use the argument vars(-"Department") to specify that we want to evaluate all of the variables except Department. The second argument, list(), specifies one or more functions to apply to each value. Specifically, we use the str_replace_all() function as a lambda (~) to check the current value (.) for a comma (",") and replace it with nothing (""). We repeat that in the next line for a space (" "). We then store all of that back into our enrollment_majors_clean object.
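Incidentally, the two mutate_at() calls could be collapsed into one by using a regular expression character class that matches either a comma or a space; this is just a stylistic alternative, not a requirement:

# "[ ,]" matches a single space or comma, so one pass removes both:
enrollment_majors_clean <- enrollment_majors_clean %>%
  mutate_at(vars(-"Department"), list(~str_replace_all(., "[ ,]", "")))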

Converting to numeric values

We should now be able to convert all of those character values back into numerical ones using the as.numeric() function. We can use code very similar to the above, simply changing the function that we’re applying:

enrollment_majors_clean <- enrollment_majors_clean %>%
  mutate_at(vars(-"Department"), list(~as.numeric(.)))
head(enrollment_majors_clean)

Note how the column type underneath each column changed from <chr> to <dbl>, or double precision floating point numbers—which is a form of a numeric value.
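If you’d like to verify the column types programmatically, here’s a quick sketch:

# Every year column should now report "numeric" (Department stays "character"):
sapply(enrollment_majors_clean, class)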

Double-checking the cleaning

Now, it might be good to do a quick spot check of our data to make sure our values align with the ones from the PDF. First, I’ll look at the distribution of values using the DataExplorer package.

library(DataExplorer)
plot_histogram(enrollment_majors_clean)

That looks fine to me at first glance. There’s one extreme value (likely an overall total) for each of the variables, and a lot of smaller values (since we have lots of majors with <100 students).

Second, I’ll select six different rows spaced out by using the dplyr::slice() function to see if their values match the information on the PDF:

enrollment_majors_clean %>%
  slice(c(13, 27, 55, 76, 104, 147))

I’m feeding the slice() function a set of observation (row) numbers, which I group together in a single vector using c().

Yup, they all match up perfectly. We should feel good about moving forward with these data.

Analysis

Now that we have our data imported and cleaned up, we can start to do some analyses. For example, let’s say we wanted to calculate the mean number of students who majored in Kinesiology. We might begin by filtering in that department’s row:

enrollment_majors_clean %>%
  filter(Department == "Kinesiology")

However, we run into a little problem here: when we use the mean() function, we need to give it a single variable. The issue is that each year’s data point is currently stored as a different variable. (For example, the value for 2009 is stored in the F2009 variable, the value for 2010 in F2010, etc.)
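To see the problem concretely, mean() operates on a single column (vector). For example:

# This yields the mean of the Fall 2009 column across all departments,
# not the mean across years for Kinesiology (which is what we want):
mean(enrollment_majors_clean$F2009, na.rm = TRUE)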

Going from wide to long

The above table is an example of data in wide format, meaning that each associated variable is a separate column. What we need to do is get it into long format, where we have three variables (columns): the department, the name of the corresponding variable (here, the year), and the value of that variable.

The tidyr package (loaded as part of the tidyverse) makes this possible with the gather() function. With gather(), we can specify the name of a key variable (the column containing the variable names) and a value variable (the column containing the associated values). We can also specify which variables we want to gather and which ones we want to leave constant; if we leave that blank, it gathers every existing variable in the data frame.

Here’s how that would work in this context:

enrollment_majors_clean %>%
  filter(Department == "Kinesiology") %>%
  gather(key="Year", value="Majors", -Department)

What we’ve done is specify that we want the key variable to be named Year and the value variable to be named Majors. We also specify that we want to keep “Department” as a separate variable.

Summarizing long-format data

If we wanted to calculate the mean number of majors, we could easily do so by adding a simple summarization (dplyr::summarize()):

enrollment_majors_clean %>%
  filter(Department == "Kinesiology") %>%
  gather(key="Year", value="Majors", -Department) %>%
  summarize(Mean_Majors=mean(Majors))

The mean number of majors during that stretch of time is 654.1.

However, what if I wanted to only look at the mean number between 2016 and 2018? To do that, I’d need to apply a second filter before the summarization. However, I’d also need to apply a mutation (dplyr::mutate(), which will add columns to my dataset or replace existing columns) to convert my Year values to numeric ones (using as.numeric() and removing the “F” at the start of each variable name).

This is pretty simple, though, given what we’ve previously learned:

enrollment_majors_clean %>%
  filter(Department == "Kinesiology") %>%
  gather(key="Year", value="Majors", -Department) %>%
  mutate(Year=as.numeric(str_sub(Year, 2))) %>%
  filter(Year >= 2016 & Year <= 2018) %>%
  summarize(Mean_Majors=mean(Majors))

We use the stringr::str_sub() function to keep only the characters in each Year string from the second character onward. This allows us to drop the “F” from the variable name and make a proper conversion to a number with as.numeric(). Since we have a numeric value now, we can use mathematical conditions in our filter() call.
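To make that concrete, here’s what those functions do with a single column name:

str_sub("F2009", 2)             # returns "2009" (characters from position 2 onward)
as.numeric(str_sub("F2009", 2)) # returns 2009 as a number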

Selecting specific rows

Because these data are nested (programs within departments within colleges), you’ll notice that we have some rows that share the same name:

enrollment_majors_clean %>%
  filter(startsWith(Department, "Accounting"))

Again, that’s because these data are nested. If I only wanted the second and third instances, it might make sense to use either the slice() function or indexing. Let’s start with the first option. As we did above, we can select the “Accounting” (child row) and “Accounting (CPE)” rows by specifying their row numbers (remember that the row numbers will be different before and after the filter):

enrollment_majors_clean %>%
  filter(startsWith(Department, "Accounting")) %>%
  slice(c(2, 3))

With the second option, recall that our data frames have columns (variables) and rows, and are thus organized like a table. We can index by calling the data frame and then, in brackets, specifying the rows and columns we want to select: object[row, column]. If we leave either the row or column blank, it selects all of them.

So, here’s how we can replicate the above using indexing:

accounting_majors <- enrollment_majors_clean %>%
  filter(startsWith(Department, "Accounting"))
accounting_majors[c(2,3),]

Note that we could also use a colon (:) to specify that we want all rows from min to max, like so: accounting_majors[2:3,]. We wanted all columns, so we left the space after the comma blank. Be sure to include the comma, though, so R knows that you want to pull information from both rows and columns.

We can also use both slice() and indexing as part of our pipe. For example, if we wanted to get all of the departments under Landscape Architecture and Regional Planning and convert those data to long form, we’d do the following:

enrollment_majors_clean %>%
  slice(96:98) %>%
  gather(key="Year", value="Majors", -Department)

The View() function (e.g., View(enrollment_majors_clean)) can be handy for finding row numbers, since they appear as the first column in that view. We could also use multiple filters (referencing only those departments that fall under LARP) and it would work just as well, provided there are no duplicate names.
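For instance, here’s a sketch of that filter-based approach. (The department names below are placeholders; check your data for the exact spellings.)

# The names in this vector are hypothetical; match them to your actual rows:
enrollment_majors_clean %>%
  filter(Department %in% c("Landscape Architecture", "Regional Planning")) %>%
  gather(key="Year", value="Majors", -Department)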

As we see above, we now have a separate row for each year and for each department.

We could also shorten our code with this functionally equivalent expression:

enrollment_majors_clean[96:98,] %>%
  gather(key="Year", value="Majors", -Department)

Plotting long-format data

We also want data to be in long format if we want to plot it with ggplot2 (and make it interactive with plotly).

Let’s use the above example and create a separate bar graph for each of our three majors. Let’s begin by storing those data in an object called plot_data:

plot_data <- enrollment_majors_clean %>%
  slice(96:98) %>%
  gather(key="Year", value="Majors", -Department) %>%
  mutate(Year=as.numeric(str_sub(Year, 2)))
plot_data

Next, let’s create a bar graph using ggplot2. Much of the code below will look familiar (we used it not long ago), but there’s one new wrinkle: facet_wrap(). This allows us to generate a similar graph for each value of a given variable (e.g., each department).

Here’s how we might do that:

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot <- ggplot(plot_data, aes(x=Year, y=Majors)) +
  geom_col() +
  facet_wrap(~Department) # Note the use of the tilde (~), which is necessary to specify "by this variable" for this function.
ggplotly(plot)

This allows us to quickly view the growth of related majors.


Putting it into practice

Given that we have longitudinal data, it might make more sense to display things through a line graph.
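As a starting point, here’s a minimal sketch of a line graph that reuses the plot_data object from the previous section; you’d still need to adapt the filtering for the Political Science majors:

# One colored line per department (a sketch, reusing plot_data from above):
plot <- ggplot(plot_data, aes(x=Year, y=Majors, color=Department)) +
  geom_line()
ggplotly(plot)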

Can you create an interactive line graph that shows the change in undergraduate enrollment in the majors listed under the Department of Political Science from Fall 2009 to Fall 2018?