Exploratory Data Visualization

Today, we’ll be looking at some eviction data aggregated at the county level for Massachusetts. These data come from the Eviction Lab, which is led by Matthew Desmond. As always, it is helpful to review the data dictionary. If you want more detail about any of the variables, I encourage you to review the full Methodology Report.

Pre-Analysis

Load the data

The first step, of course, is to read in the data. Like before, we’ll use the readr::read_csv() function (readr can be loaded by the tidyverse metapackage):

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.2.0     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0

## ── Conflicts ────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

ma_counties <- read_csv("http://projects.rodrigozamith.com/datastorytelling/data/evictions_ma_counties.csv")

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   name = col_character(),
##   `parent-location` = col_character()
## )

## See spec(...) for full column specifications.

Confirm it imported correctly

Normally, we would make sure the data imported correctly. Because we’ve previously evaluated the data from this dataset, we can just take a quick look at the first few observations to confirm that the data were imported correctly.

head(ma_counties)

Just like before, it appears we imported the data correctly.

Visual Data Exploration

To begin visually inspecting our data, we can turn to the DataExplorer package. (You may need to install DataExplorer by going to Tools, Install Packages, and entering in that package name.) DataExplorer helps us do things like visualize missing values and automatically generate histograms, scatter plots, boxplots, and correlation matrices for all the variables in a given dataset.

The reason this is valuable is because humans have a much easier time processing large amounts of information when it is visualized in some way. Data tables are great when looking at each tree but visuals help us see the forest.

Additionally, visually inspecting data is great for finding outliers (extreme observations). Sometimes the story is not in the general pattern but in the outlier. For example, you may wonder why a particular city has an exceptionally high crime rate. (However, outliers may also be data entry errors that we need to clean up.)

Let’s load DataExplorer first to get access to its functions.

library(DataExplorer)

Visualizing missing values

While we can use the base::summary() function to get the number of NA values for each variable, DataExplorer provides us with a plot_missing() function that visualizes that nicely for us. That function just takes a single argument: the name of the data frame you want to analyze.

In our case, we’re currently working with the ma_counties data frame, so we’ll supply that as the lone argument:

plot_missing(ma_counties)

Behold, we get a simple graph that tells us that we have data for almost all of our variables. Four of them, however, have some missing data (actually, a fair amount of missing data!). The fact that the values are equal (31.93%) for all four variables suggests some entries have missing data for all those variables—though it is possible that some entries have data for eviction-rate but not eviction-filing-rate.

We could, of course, quickly examine that:

library(tidyverse)
ma_counties %>%
  filter(is.na(`eviction-filing-rate`) == TRUE | is.na(`eviction-rate`) == TRUE | is.na(`evictions`) == TRUE | is.na(`eviction-filings`) == TRUE) %>%
  select(year, name, `eviction-filing-rate`, `eviction-rate`, evictions, `eviction-filings`)

We use the pipe | character in our filter to convey an “or” logical statement. That is, we’re saying, if eviction-filing-rate is an NA value or if eviction-rate is an NA value, and so on, then include the value in our filtered data frame.

Our assumption is correct: the missing data align across those four variables. But that’s a whole lot of data that’s missing. We can quickly examine which counties are even usable:

ma_counties %>%
  select(year, name, `eviction-filing-rate`, `eviction-rate`, evictions, `eviction-filings`) %>%
  filter(complete.cases(.) == FALSE) %>%
  count(name) %>%
  arrange(desc(n))

Note: filter(complete.cases(.) == FALSE) accomplishes the equivalent of the longer filter() line in the previous code chunk. Specifically, base::complete.cases() tests if the observation has any missing data, outputing TRUE if there are no missing values and FALSE if there is one or more. That function expects you to specify a data frame as the argument; we use . to signal that it should use the piped output from the previous step (the select statement). Additionally, because we dropped several variables (e.g., population) by not selecting them, the complete.cases() function will only test for the few variables we selected.

It seems that some counties (like Barnstable County and Essex County) have a lot of missing data. Thus, we may not be able to make comparisons for some of the years across all counties.

This doesn’t mean these data are bad but it does mean that we’ll need to be careful about what we can say from our analyses.

Generating histograms

We may also want to check the frequency of values for the many variables in our dataset. This can be really handy for identifying those outliers and better understanding the distribution of values. For that, a histogram is very handy.

Here, we can use DataExplorer’s plot_histogram() function. That function requires a single argument (although there are some additional optional ones): the data frame from which to plot. The function will then go through that data frame and figure out which variables are sensible to plot on a histogram.

plot_histogram(ma_counties)

The function produced a histogram for almost all of our variables. Some of those variables are kind of pointless (like GEOID) but we’re just exploring and can quickly ignore them. For some variables, like pct.nh.pi, the X axis isn’t very helpful—but we do see that those values tend to be very small across entries.

We’re generating those data for all the observations in our dataset, though, which includes a lot of semi-repetition (we wouldn’t expect the percentage of hispanics to change tremendously from year to year for each county).

What if we wanted to generate these histograms, but only for a single year (2014)? First, we’d just have to use our filter() function once again, and pipe that output (a data frame) to plot_histogram():

ma_counties %>%
  filter(year==2014) %>%
  plot_histogram()

Immediately, we see, for example, that almost all counties have fewer than 40% of their households living in rental units (pct.renter.occupied). One county, however, has more than 60%. Maybe there’s something interesting about that county, and we can start digging into it.

Generating correlation heatmaps

DataExplorer also provides a very cool function for quickly looking up correlations between variables: plot_correlation(). What that function does is automatically calculate the correlation coefficient between all possible pairs of variables in a given dataset and use color coding to draw your attention to the strongest correlations.

If you have too many variables, that heatmap can become exceptionally crowded. Additionally, some relationships would be pointless. (Who cares about the relationship between GEOID and eviction-rate?) Thus, we’ll want to select just the variables for which a relationship could be practically meaningful.

For example, let’s say we wanted to check the relationships between population, poverty-rate, pct-renter-occupied, and rent-burden in 2016. We could quickly do that by filtering and selecting the requisite data, and passing the output to plot_correlation():

ma_counties %>%
  filter(year==2016) %>%
  select(population, `poverty-rate`, `pct-renter-occupied`, `rent-burden`) %>%
  plot_correlation()

The heatmap shades things dark red the closer we get to a perfect positive correlation (r=1) and dark blue the closer we get to a perfect negative correlation (r=-1). It also includes the Pearson linnear correlation coefficient between a variable pair in the respective box.

In our heatmap, for example, the correlation coefficient between rent burden and poverty rate is 0.17 (top row, second column). This is a weak correlation that suggests that as the rent burden goes up, so does poverty rate (and vice versa). You’ll notice that the diagonal is all red; that’s because it represents a variable’s correlation with itself (which is always 1). The values above the diagonal will mirror those below it, so you need only look at one half of the heatmap.

However, a quick glance shows me one potentially interesting relationship: a strong correlation between poverty rate (poverty-rate) and the percentage of occupied housing units that are renter-occupied (pct-renter-occupied). Counties with a higher poverty rate tend to have a greater proportion of renters. Thus, they are presumably more impacted by the issue/prospect of evictions. (There is also a moderate correlation between population size and the percentage of occupied housing units that are renter-occupied.)

Generating scatter plots

Sometimes, we’ll want a different visual aid than a histogram. Additionally, recall that we’ve only explored linear correlations. DataExplorer also lets us visualize scatter plots, which are good for showing the relationship between two continuous variables (e.g., ratio or interval data)—and allow us to spot non-linear relationships (for example, see this diagram where a clear relationship exists (d and e) yet the coefficient is 0).

To do that, you’ll want to use the DataExplorer::plot_scatterplot() function. Like plot_histogram(), you must supply a data frame as an argument—which we can leave blank if we’re piping the information. However, you also need to supply a variable for the Y (vertical) axis using the by= argument. In this case, we can specify population as the Y variable thusly:

ma_counties %>%
  filter(year==2016) %>%
  select(population, `poverty-rate`, `pct-renter-occupied`, `rent-burden`) %>%
  plot_scatterplot(by="population")

Note that we supplied the variable as a string (using " marks instead of ` marks or without any quotation). Put simply, that’s just how the functions from the DataExplorer package like things. Different package authors will follow different conventions, so you may need to adjust your syntax depending on the package.

Looking at this scatter plot, we see no evidence of a curvilinear relationship for any of the variables. Like our correlation heatmap showed, we do see some evidence that bigger population sizes tend to have higher percentages of occupied housing units that are renter-occupied. However, what this scatter plot also tells us is that there’s a big outlier in our dataset at the far end of that variable pairing (population and pct-renter-occupied). That outlier may be worth exploring.

Issues with special characters

There is one big issue with the conventions used by DataExplorer vis-a-vis our dataset: it can’t handle special characters (like a dash, which can interpreted as a minus sign) in the variable name. For example, here’s what happens when we set poverty-rate as our Y variable:

ma_counties %>%
  filter(year==2016) %>%
  select(population, `poverty-rate`, `pct-renter-occupied`, `rent-burden`) %>%
  plot_scatterplot(by="poverty-rate")

## Error in FUN(X[[i]], ...): object 'poverty' not found

What’s happening is that the function is trying to do math when it shouldn’t. If we try to use the grave accent (`) mark, we get a similar problem:

ma_counties %>%
  filter(year==2016) %>%
  select(population, `poverty-rate`, `pct-renter-occupied`, `rent-burden`) %>%
  plot_scatterplot(by="`poverty-rate`")

## Error in melt.data.table(data, id.vars = by, variable.factor = FALSE): One or more values in 'id.vars' is invalid.

Instead, what we need to do is replace the special symbols in our variable names with something simpler—like a period. Thankfully, there’s a simple way to do that, too! What we’ll need to do is recreate the data frame following our select statement using the as_tibble() function and add the argument .name_repair="universal":

ma_counties %>%
  filter(year==2016) %>%
  select(population, `poverty-rate`, `pct-renter-occupied`, `rent-burden`) %>%
  as_tibble(.name_repair="universal")

## New names:
## * `poverty-rate` -> poverty.rate
## * `pct-renter-occupied` -> pct.renter.occupied
## * `rent-burden` -> rent.burden

You can see from the new column names that the transformation worked. However, this is an example of why you’ll want to avoid using special characters like mathematical operators in your own variable names. However, when you’re working with data from the wild, you have to deal with other peoples’ preferences and make corrections accordingly.

Do note that the transformation we applied is not persistent as we never reassigned it to an object using the <- assignment operator. That is, whenever we reference our ma_counties data frame going forward, we’ll need to reapply the transformation. (You can, of course, assign the transformation to a new object or reassign it into the existing one.)

Nevertheless, we can now generate our scatter plots for poverty.rate thusly:

ma_counties %>%
  filter(year==2016) %>%
  select(population, `poverty-rate`, `pct-renter-occupied`, `rent-burden`) %>%
  as_tibble(.name_repair="universal") %>%
  plot_scatterplot(by="poverty.rate")

## New names:
## * `poverty-rate` -> poverty.rate
## * `pct-renter-occupied` -> pct.renter.occupied
## * `rent-burden` -> rent.burden

Generating box plots

You can similarly produce a series of box plots using the plot_boxplot() function. As with plot_scatterplot(), it accepts as arguments a data frame and a Y axis variable. Box plots are particularly useful for showing distributions by identifying the upper, middle, and lower quartiles, and highlighting outliers.

For example, let’s say we wanted to check out the distribution of population size, poverty rates, percent of occupied housing units that are renter occupied, and rent burden over time. First, we’d create a new data frame object with those variables of interest. Then, we’d supply that data frame to plot_boxplot() and use the year as my Y variable. So our code looks like this:

ma_counties %>%
  select(year, population, `poverty-rate`, `pct-renter-occupied`, `rent-burden`) %>%
  plot_boxplot(by="year")

Note that we’re using the - characters in the variable names in this code because we didn’t apply any transformation to the variable names here, or make it persistent in our last bit of code. That’s okay because we didn’t need to reference a variable with special characters in the plot_boxplot() function.

Here’s a handy guide on reading box plots. The first plot is pretty useless but the subsequent ones less so. For example, if we look at the rent.burden plot, we see that rent burdens seem to be increasing over time. In fact, the most extreme case in the first grouping (between 2000 and 2003) is below the median (thick black vertical line inside the box) in the most recent year grouping.

However, recall that we are missing a good chunk of data, especially in earlier years—so we’ll want to probe deeper. We’d probably want to drill down into specific years (rather than year groupings) and be more selective about which counties to include.

Generating bar plots

DataExplorer also provides a function, plot_bar() that will generate frequencies for discrete variables (those identified as strings (chr or factor types)). It works similar to plot_histogram(). However, if we only have one chr-type variable of interest (name), we can just call it using our regular object$variable nomenclature:

plot_bar(ma_counties$name)

This doesn’t tell us anything too useful, except that each county name comes up an equal number of times in our dataset. However, if you were interested in which names (string-type) came up most often in a dataset about campaign contributions, for example, this would be useful.

Generating your own plots

The DataExplorer functions we’ve looked at so far can be called “wrapper” functions because they’re primarily geared toward expediting common uses of other functions. Specifically, DataExplorer makes extensive use of the ggplot2 visualization package.

ggplot2 is quite powerful and can get pretty complicated, and we’ll cover some of its more advanced features later in the course. For now, we just care about producing some basic plots with little care for visual appeal—it’s just for exploration after all—so we’ll keep things simple.

First, load the ggplot2 package (if you loaded tidyverse previously, you don’t need to load ggplot2 as it is part of the tidyverse metapackage):

library(ggplot2)

ggplot2 works by layering information. For example, we might first set up a basic plot layer that defines the aesthetics; then add a layer with points; then add a layer with lines connecting the points; and finally adding a text layer with a title. Those four layers, together, would comprise a line graph.

Creating a line graph

Let’s illustrate that by producing a graphic that looks at the trajectory of eviction filing rates in Hampshire County across the years in our dataset. We’ll begin by generating a data frame that contains only the data for the relevant county:

ma_counties %>%
  filter(name=="Hampshire County")

We’ll usually start a plot by calling the ggplot() function and specifying the key variables in the first layer, including the data frame you’re pulling from and the top-level aesthetic mappings, including:

x: the variable that goes in the X axis
y: the variable that goes in the Y axis (optional)
group: any grouping variable to signal a connection between data points (optional)
color: any variable for color-coding data points’ outside edges (optional)
fill: any variable for color-coding data points’ inside fills (optional)

In this case, we’ll create a basic base layer that only maps the X and Y variables:

ma_counties %>%
  filter(name=="Hampshire County") %>%
  ggplot(aes(x=year, y=`eviction-filing-rate`))

Our plot now has scaled values for the X axis and Y axis (look at the labels placed on both axes), but there’s no data plotted on it just yet. To do that, we’ll add another layer by putting a + sign at the end of the previous layer. (Note that we’re no longer piping as we add layers to our ggplot().) Then, we’ll add in points by using the geom_point() function:

ma_counties %>%
  filter(name=="Hampshire County") %>%
  ggplot(aes(x=year, y=`eviction-filing-rate`)) +
    geom_point()

## Warning: Removed 1 rows containing missing values (geom_point).

Note: R tells us that it removed a row because we don’t have data for eviction-filing-rate in the year 2000. If we had removed observations with NA values, that warning would not have been given.

Now we have points! Next up, we’ll add a line to connect our points. We can do that by adding a new layer that uses the geom_line() function:

ma_counties %>%
  filter(name=="Hampshire County") %>%
  ggplot(aes(x=year, y=`eviction-filing-rate`)) +
    geom_point() +
    geom_line()

## Warning: Removed 1 rows containing missing values (geom_point).

## Warning: Removed 1 rows containing missing values (geom_path).

Note: We could do without the geom_point() layer in creating a line graph, but it’s often helpful to include some marker for each data point.

Creating a scatter plot

In creating our line graph, we effectively already created a scatter plot. For example, if we wanted to re-create the earlier scatter plot comparing population to pct-renter-occupied with 2014 data, we’d do the following:

ma_counties %>%
  filter(year==2014) %>%
  ggplot(aes(x=`pct-renter-occupied`, y=population)) +
    geom_point()

Creating other charts

We can replicate our earlier charts using ggplot2 functions like geom_histogram(), geom_boxplot(), and geom_bar().

We’ll cover ggplot2’s more advanced features later in the course, but here’s a cheatsheet if you already want to do more with it. For now, quick and dirty is all we need.

Adding interactivity

The nice thing about producing our own plots using ggplot2 is the fact that we can easily turn them into interactive plots by pairing them with the ggplotly() function from the plotly library, provided by the website Plot.ly. (You may need to install the library first.)

As usual, we begin by loading the package:

library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

The plotly library has some more advanced features, but it can usually take a given ggplot2 object and automatically figure out which aspects of the plot could be enhanced through interactive elements.

Note: The plotly library is a work in progress and authored by different people from ggplot2. Thus, some advanced ggplot2 aesthetics, like custom annotations, don’t work with plotly at the time of writing. For exploratory graphics, this isn’t usually an issue but it can be problematic if you want to produce production-level ggplot2 graphics.

For example, let’s recreate the line graph from above but make it so I can get precise values just by hovering over each point. First, we’ll store our ggplot in an object called g. Then, we’ll supply that object as the argument for ggplotly():

g <- ma_counties %>%
  filter(name=="Hampshire County") %>%
  ggplot(aes(x=year, y=`eviction-filing-rate`)) +
    geom_point() +
    geom_line()
ggplotly(g)

Similarly, if we wanted to make the earlier scatter plot interactive (and thus be able to quickly identify the name of the outlier), we could do the following:

g <- ma_counties %>%
  filter(year==2014) %>%
  ggplot(aes(x=`pct-renter-occupied`, y=population, county=name)) +
  geom_point()
ggplotly(g)

Note: A “county” aesthetic that maps to the “name” variable was added to the first layer to give us additional tooltip information. Try hovering each point now.

Now we can quickly see that Suffolk County—which includes much of Boston—is that outlier with an exceptionally high percentage of occupied housing units that are renter-occupied.

Using these exploratory visualization tactics, you should be able to start exploring some of the more interesting relationships and outliers in your data, and begin to make simple comparisons across groups.

Advanced examples: Comparison charts

Below are just some examples of how you could take ggplot2 and plotly a little further to help make comparisons across your data. I won’t detail this code just yet but they’re examples of what you could be doing (and will be able to do by the end of the course).

Comparison chart #1

The chart below helps compare Hampshire County’s eviction rate trajectory with other counties’.

g <- ma_counties %>%
  mutate(IsHamp=factor(ifelse(name=="Hampshire County", 1, 0))) %>%
  ggplot(aes(x=year, y=`eviction-rate`, group=name, color=IsHamp, alpha=IsHamp)) +
    geom_point() +
    geom_line() +
    scale_color_manual(values=c("black", "red")) +
    scale_alpha_manual(values=c(0.2, 1)) +
    theme(legend.position = "none")
ggplotly(g, tooltip=c("group", "x", "y"))

Comparison chart #2

The chart below helps compare eviction filing rates with eviction rates over time in Hampshire County.

g <- ma_counties %>%
  filter(name=="Hampshire County") %>%
  select(year, name, `eviction-filing-rate`, `eviction-rate`) %>%
  gather(key=rate, value=val, `eviction-filing-rate`, `eviction-rate`) %>%
  ggplot(aes(x=year, y=val, group=rate, color=rate)) +
    geom_point() +
    geom_line()
ggplotly(g, tooltip=c("x", "y"))

Putting it into practice

See if you can answer the following questions using the following dataset:

us_states <- read_csv("http://projects.rodrigozamith.com/datastorytelling/data/evictions_us_states.csv")

Produce a correlation heatmap using 2016 data for the following variables: population, eviction-rate, pct-white, pct-af-am, pct-hispanic, pct-am-ind, pct-asian, and pct-nh-pi. Which variable is most correlated (positive or negative) with the eviction rate? (Hint: Be sure to filter out NA values for eviction-rate.)
Produce scatterplots using 2016 data that have the eviction rate in the Y axis and population, pct-white, pct-af-am, pct-hispanic, pct-am-ind, pct-asian, and pct-nh-pi in the X axes. Do you see any clear patterns for those variables?
Produce an interactive scatterplot of the 2016 data with eviction rate in the Y axis and percentage of the population that is African-American in the X axis. Which data point (please name the state) do you find most interesting? Why?
Produce an interactive line graph showing the trajectory of eviction filing rates for the state of Massachusetts between 2002 and 2016. What does that chart tell you?