Today, we’ll be looking at some eviction data for the Commonwealth of Massachusetts. These data come from the Eviction Lab, which is led by Matthew Desmond.
The Eviction Lab data are derived from 80 million records painstakingly collected by the Lab’s research team. You can read about the methodology here, but the key thing to keep in mind for this exercise is that the dataset we’ll be using covers the city level in the state of Massachusetts. Desmond’s team also makes data available at more micro levels (U.S. Census tracts and block groups) and more macro levels (county, statewide).
Because these data aren’t collected in a consistent manner (across states or within them), the Lab is forced to estimate some of the variables. You may find, for example, that a city had 0.45 evictions one year. How’s it possible to only evict 0.45 of a home? Well, that’s just an estimate and it would be reasonable for our purposes to round that tally to the closest whole number. (The Eviction Lab’s website does this for the sake of simplicity but the datasets they provide are more precise.)
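If you want to do that rounding yourself, base R’s round() function handles it. (The values below are made up for illustration; they are not from the Eviction Lab data.)

```r
# Hypothetical eviction estimates (made-up values for illustration)
estimated_evictions <- c(0.45, 3.51, 12.97)

# Round to the nearest whole number, as the Eviction Lab's website does
round(estimated_evictions)  # 0 4 13
```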
As always, it is crucial that you review the data dictionary. If you want more detail about any of the variables, I encourage you to review the full Methodology Report.
The first step, of course, is to read in the data. We’ll use the readr::read_csv() function (which can be loaded through tidyverse):
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.0 ✔ purrr 0.3.2
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
ma_cities <- read_csv("http://projects.rodrigozamith.com/datastorytelling/data/evictions_ma_cities.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## name = col_character(),
## `parent-location` = col_character()
## )
## See spec(...) for full column specifications.
The next step is to confirm the data were imported the way we wanted them to be. Here, I would recommend that you follow three steps.
Start by checking the first few observations of the object to see if there’s anything odd, using the head() function:
head(ma_cities)
At first glance, the data seem to have come in cleanly. Each row represents a different city (name) at a different point in time (year).
Next, we’ll use the str() function to check all the column names and make sure they’re associated with the right data type:
str(ma_cities)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 4245 obs. of 27 variables:
## $ GEOID : num 2500135 2500135 2500135 2500135 2500135 ...
## $ year : num 2000 2001 2002 2003 2004 ...
## $ name : chr "Abington" "Abington" "Abington" "Abington" ...
## $ parent-location : chr "Massachusetts" "Massachusetts" "Massachusetts" "Massachusetts" ...
## $ population : num 14605 14605 14605 14605 14605 ...
## $ poverty-rate : num 3.6 3.6 3.6 3.6 3.6 3.62 3.62 3.62 3.62 3.62 ...
## $ renter-occupied-households: num NA NA NA NA NA ...
## $ pct-renter-occupied : num 28.2 28.2 28.2 28.2 28.2 ...
## $ median-gross-rent : num 676 676 676 676 676 1060 1060 1060 1060 1060 ...
## $ median-household-income : num 57100 57100 57100 57100 57100 ...
## $ median-property-value : num 178900 178900 178900 178900 178900 ...
## $ rent-burden : num 23.8 23.8 23.8 23.8 23.8 27.9 27.9 27.9 27.9 27.9 ...
## $ pct-white : num 97 97 97 97 97 ...
## $ pct-af-am : num 0.73 0.73 0.73 0.73 0.73 2.26 2.26 2.26 2.26 2.26 ...
## $ pct-hispanic : num 0.71 0.71 0.71 0.71 0.71 0.96 0.96 0.96 0.96 0.96 ...
## $ pct-am-ind : num 0.12 0.12 0.12 0.12 0.12 0 0 0 0 0 ...
## $ pct-asian : num 0.46 0.46 0.46 0.46 0.46 1.5 1.5 1.5 1.5 1.5 ...
## $ pct-nh-pi : num 0.01 0.01 0.01 0.01 0.01 0 0 0 0 0 ...
## $ pct-multiple : num 0.76 0.76 0.76 0.76 0.76 0.27 0.27 0.27 0.27 0.27 ...
## $ pct-other : num 0.25 0.25 0.25 0.25 0.25 0.16 0.16 0.16 0.16 0.16 ...
## $ eviction-filings : num NA NA NA NA NA NA NA NA 49 65 ...
## $ evictions : num NA NA NA NA NA NA NA NA 39 53 ...
## $ eviction-rate : num NA NA NA NA NA NA NA NA 2.23 2.96 ...
## $ eviction-filing-rate : num NA NA NA NA NA NA NA NA 2.8 3.63 ...
## $ low-flag : num 0 0 0 0 0 0 0 0 1 1 ...
## $ imputed : num 0 0 0 0 0 0 0 0 0 0 ...
## $ subbed : num 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. GEOID = col_double(),
## .. year = col_double(),
## .. name = col_character(),
## .. `parent-location` = col_character(),
## .. population = col_double(),
## .. `poverty-rate` = col_double(),
## .. `renter-occupied-households` = col_double(),
## .. `pct-renter-occupied` = col_double(),
## .. `median-gross-rent` = col_double(),
## .. `median-household-income` = col_double(),
## .. `median-property-value` = col_double(),
## .. `rent-burden` = col_double(),
## .. `pct-white` = col_double(),
## .. `pct-af-am` = col_double(),
## .. `pct-hispanic` = col_double(),
## .. `pct-am-ind` = col_double(),
## .. `pct-asian` = col_double(),
## .. `pct-nh-pi` = col_double(),
## .. `pct-multiple` = col_double(),
## .. `pct-other` = col_double(),
## .. `eviction-filings` = col_double(),
## .. evictions = col_double(),
## .. `eviction-rate` = col_double(),
## .. `eviction-filing-rate` = col_double(),
## .. `low-flag` = col_double(),
## .. imputed = col_double(),
## .. subbed = col_double()
## .. )
All of our variables appear to have the data types we’d expect for them (e.g., num for variables we would think would be numerical and chr for those we expect to be text).
We’ll also want to check to see if there are any extreme values in our dataset that may have been the result of information being entered incorrectly. To do that, we’ll use the summary() function:
summary(ma_cities)
## GEOID year name parent-location
## Min. :2500135 Min. :2000 Length:4245 Length:4245
## 1st Qu.:2523840 1st Qu.:2004 Class :character Class :character
## Median :2541960 Median :2008 Mode :character Mode :character
## Mean :2542110 Mean :2008
## 3rd Qu.:2561205 3rd Qu.:2012
## Max. :2582595 Max. :2016
##
## population poverty-rate renter-occupied-households
## Min. : 0 Min. : 0.000 Min. : 1.74
## 1st Qu.: 2245 1st Qu.: 2.480 1st Qu.: 199.61
## Median : 4754 Median : 5.000 Median : 623.02
## Mean : 19266 Mean : 6.476 Mean : 3757.21
## 3rd Qu.: 23464 3rd Qu.: 8.895 3rd Qu.: 2990.25
## Max. :650281 Max. :78.600 Max. :178128.00
## NA's :175 NA's :175 NA's :1583
## pct-renter-occupied median-gross-rent median-household-income
## Min. : 0.00 Min. : 0.0 Min. : 0
## 1st Qu.: 18.80 1st Qu.: 707.0 1st Qu.: 47360
## Median : 30.41 Median : 906.0 Median : 60067
## Mean : 31.60 Mean : 952.1 Mean : 64479
## 3rd Qu.: 42.99 3rd Qu.:1141.0 3rd Qu.: 76328
## Max. :100.00 Max. :2908.0 Max. :250001
## NA's :175 NA's :346 NA's :206
## median-property-value rent-burden pct-white pct-af-am
## Min. : 0 Min. : 0.0 Min. : 0.00 Min. : 0.000
## 1st Qu.: 199025 1st Qu.:24.9 1st Qu.: 84.84 1st Qu.: 0.310
## Median : 285400 Median :28.2 Median : 92.28 Median : 1.090
## Mean : 319087 Mean :29.1 Mean : 87.73 Mean : 2.402
## 3rd Qu.: 383200 3rd Qu.:31.9 3rd Qu.: 95.76 3rd Qu.: 2.690
## Max. :1000001 Max. :50.0 Max. :100.00 Max. :41.210
## NA's :213 NA's :311 NA's :175 NA's :175
## pct-hispanic pct-am-ind pct-asian pct-nh-pi
## Min. : 0.000 Min. :0.0000 Min. : 0.000 Min. :0.00000
## 1st Qu.: 0.980 1st Qu.:0.0000 1st Qu.: 0.420 1st Qu.:0.00000
## Median : 2.100 Median :0.0600 Median : 1.260 Median :0.00000
## Mean : 4.711 Mean :0.1531 Mean : 2.727 Mean :0.01555
## 3rd Qu.: 4.940 3rd Qu.:0.1800 3rd Qu.: 3.260 3rd Qu.:0.01000
## Max. :76.350 Max. :2.3600 Max. :26.630 Max. :0.71000
## NA's :175 NA's :175 NA's :175 NA's :175
## pct-multiple pct-other eviction-filings evictions
## Min. : 0.000 Min. : 0.0000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.640 1st Qu.: 0.0000 1st Qu.: 3.13 1st Qu.: 2.255
## Median : 1.210 Median : 0.1400 Median : 11.09 Median : 8.000
## Mean : 1.506 Mean : 0.5807 Mean : 116.84 Mean : 72.593
## 3rd Qu.: 2.020 3rd Qu.: 0.5000 3rd Qu.: 61.00 3rd Qu.: 40.370
## Max. :11.890 Max. :30.3200 Max. :6447.00 Max. :2616.000
## NA's :175 NA's :175 NA's :1583 NA's :1583
## eviction-rate eviction-filing-rate low-flag imputed
## Min. : 0.000 Min. : 0.000 Min. :0.0000 Min. :0
## 1st Qu.: 0.700 1st Qu.: 1.090 1st Qu.:0.0000 1st Qu.:0
## Median : 1.425 Median : 2.090 Median :0.0000 Median :0
## Mean : 1.693 Mean : 2.479 Mean :0.3925 Mean :0
## 3rd Qu.: 2.370 3rd Qu.: 3.467 3rd Qu.:1.0000 3rd Qu.:0
## Max. :10.730 Max. :16.650 Max. :1.0000 Max. :0
## NA's :1583 NA's :1583
## subbed
## Min. :0
## 1st Qu.:0
## Median :0
## Mean :0
## 3rd Qu.:0
## Max. :0
##
Glancing at that output, we see a few things.
First, there don’t appear to be any extreme values for any of our variables. Percentages don’t exceed 100 or go below 0, and the year goes from 2000 to 2016, as expected. The maximum values for renter-occupied-households and evictions are quite a bit higher than the values at the third quartile, but that’s probably because most cities in Massachusetts are quite small—and then you have the likes of Boston. We’ll want to look at these a bit more closely. (I’ll show you how to do that shortly.)
Second, some of our variables have missing data (NAs). Given that the number of NAs is pretty consistent (1,583), I’m guessing that data simply are not available for some cities in given years. However, it could also mean that I have some extra observations in ma_cities that are junk (e.g., blank lines at the bottom of the file). Again, we should check this out. (Again, more on this shortly.)
dplyr
One immensely useful R package for data journalists and scientists alike is called dplyr. It is loaded by tidyverse, so we don’t need to load it individually if we’ve loaded that package.
When loading dplyr or tidyverse, you may see a warning message that some objects are “masked.” What this is telling us is that dplyr contains some functions (like filter()) that have the same name as functions you’ve already loaded via another package (e.g., filter() in the stats package that is automatically loaded). That’s fine, but this is why the package::function() nomenclature is useful. By default, R will use the conflicting function from the package loaded most recently. (Thus, the order in which you load packages matters.) If you want to call a function from a specific package, use the package::function() nomenclature. Thus, if I write filter(), it will be taken from dplyr since I just loaded it. If I want to use stats’s version of filter(), I would write stats::filter().
dplyr includes a number of functions that will help us quickly slice up our data and calculate some relevant information. We’ll cover a few of the most important ones in this exercise: filter(), count(), select(), summarize(), group_by(), and arrange().
Let’s say I wanted to look at all the rows that had a value in excess of 5 for the eviction-rate variable. I could use the dplyr::filter() function to accomplish that. With filter(), I’m telling R to exclude all observations that don’t match the criteria I’m looking for, and produce a new object (e.g., a data frame) that only includes the desired observations.
filter() requires two arguments (.data and a logical predicate). (See the help system documentation for more information by typing ?filter() in the console or using the Help tab.) The first argument is the object (e.g., data frame) you are interested in and the second argument is the criterion (or criteria) for filtering. The criteria we most often use are: == (is equivalent to), != (is not equivalent to), > / < (is greater than or less than), and >= / <= (is greater than or equal to, or less than or equal to).
So, we’ll tell filter() to look into the ma_cities dataset and only include observations where the eviction-rate variable is greater than 5. (Put differently, exclude all observations where that criterion isn’t matched.)
filter(ma_cities, `eviction-rate` > 5)
Behold, we now only show the rows that have an eviction rate in excess of 5.
Note that I had to enclose my eviction-rate variable name in backticks (`), or grave accent marks. That’s because if we don’t do that, R will think that we’re trying to subtract the object ‘rate’ from ‘eviction’. I usually use underscores (_) in my variable names to avoid this problem, but the Eviction Lab did it differently.
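If the backticks get tiresome, one option is to rename hyphenated columns once with dplyr::rename() and use underscore versions from then on. Here’s a sketch on a toy data frame (the hyphenated column name mirrors the Eviction Lab’s):

```r
library(dplyr)

# Toy data frame with a hyphenated column, like the Eviction Lab's
df <- tibble(`eviction-rate` = c(2.1, 6.4, 0.9))

# Rename once, using the format new_name = old_name
df <- rename(df, eviction_rate = `eviction-rate`)

# From here on, no backticks are needed
filter(df, eviction_rate > 5)
```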
NA
NA values are special because they’re used to denote missing data. They don’t have a numerical meaning, so we can’t filter them out based on mathematical criteria.
Instead, we’ll use the base::is.na() function, which assesses whether each value in a given vector is NA or not. If it is, it marks it as TRUE, and if it isn’t, FALSE (this is called a Boolean or logical response).
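You can see that Boolean behavior concretely with a small made-up vector:

```r
# A vector with one missing value
x <- c(12, NA, 7)

is.na(x)   # FALSE TRUE FALSE
```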
This is how we would get a listing of our rows that have NA values:
filter(ma_cities, is.na(evictions))
By not specifying a condition for is.na(), we’re relying on R’s default of matching TRUE conditions. Put differently, this code tells R to only include observations where the value for the evictions variable is NA, because it interprets the code as is.na(evictions)==TRUE by default. If we wanted to do the opposite, we could specify that with is.na(evictions)==FALSE, or short-hand it by using an “is not” operator (!) as follows: !is.na(evictions).
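Here’s that negated version in action on a small made-up data frame:

```r
library(dplyr)

# Toy data with one missing evictions value
df <- tibble(name = c("A", "B", "C"),
             evictions = c(10, NA, 3))

# Keep only the rows where evictions is NOT missing
filter(df, !is.na(evictions))   # keeps A and C
```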
Oftentimes, we want to get a count of how often each unique value shows up in our data. For example, perhaps we’re wondering how often each city shows up in our dataset. Here’s where the dplyr::count() function comes in. (It does something similar to base::table(), but I suggest using count() in the future because it can be seamlessly integrated into a multi-step workflow, as I’ll show you shortly.)
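The difference is easy to see on a toy example: count() returns a data frame (which later functions can keep working with), whereas base::table() returns a table object.

```r
library(dplyr)

df <- tibble(name = c("Boston", "Boston", "Salem"))

count(df, name)   # a data frame with columns name and n
table(df$name)    # a named table object
```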
count() allows us to specify two arguments (x and wt). (For more information, see the documentation.) The first argument is the object we want to evaluate and the second argument is the variable containing the values you want counted.
In our case, we want to use count() on the variable name in the object ma_cities:
count(ma_cities, name)
From these data, we notice that most towns appear in the dataset 17 times (2000 to 2016, after all). But we also notice that one town (Amesbury) had a name change. Because the name is entered differently, R will treat it as a separate place.
We can double-check this using the following code:
filter(ma_cities, str_detect(name, "Amesbury"))
Notice how Amesbury only includes information until 2009, and the entries for Amesbury Town between 2000 and 2009 are full of NA values. (Click on the “Next” label on the bottom-right of the table to scroll through entries.)
If we want to treat Amesbury as a single place, it might be wise for us to give it a consistent label. One easy way to recode is to use the dplyr::recode() function.
However, let’s first remove the duplicate entries:
ma_cities <- filter(ma_cities, !(name=="Amesbury Town" & year<2010))
With the above code, we executed a complex filter. It contains two conditions: the name must be equivalent to “Amesbury Town” and (&) the year must be less than 2010. We group the two conditions using parentheses and use the exclamation mark before that parenthetical grouping to convey that the filter should include all entries that do not match those two conditions. We then re-assign the output to ma_cities to remove the undesired rows from the object we’ll be calling again shortly.
With the recode() function, we first specify the vector we want to inspect and then the values we’d like to recode in the following format: OriginalValue = ReplacedValue.
For example, if we want to replace all remaining instances of "Amesbury Town" with "Amesbury" in the name variable of ma_cities, we’d use the following code:
ma_cities$name <- recode(ma_cities$name, "Amesbury Town" = "Amesbury")
Again, notice that we’re replacing the values in ma_cities$name with the newly-recoded values. This way, the data transformation persists going forward.
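An equivalent, tidier-style way to express this is to wrap recode() inside mutate(), which overwrites the column within the data frame itself. Here’s a sketch on a toy data frame:

```r
library(dplyr)

df <- tibble(name = c("Amesbury Town", "Boston"))

# mutate() replaces the name column with its recoded version
df <- mutate(df, name = recode(name, "Amesbury Town" = "Amesbury"))

df$name   # "Amesbury" "Boston"
```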
Now, if we perform the previous count, we’ll see the corrected entry:
count(ma_cities, name)
What if I wanted to quickly look at my data and identify places that had fewer than 17 entries, in case similar transformations are needed?
This would require me to perform two steps: First, I’d need to use count() to get a total for each value. Second, I’d need to use filter() to only show me instances where a town has fewer than 17 entries.
I could wrap the functions inside each other (i.e., filter(count())), but that would start to get unwieldy in instances where I need to perform multiple steps. Alternatively, I could store the output from each step in a new object and use those objects going forward (i.e., result1 <- count() and then result2 <- filter(result1)).
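To make those two (clunkier) alternatives concrete, here’s a sketch with a made-up data frame where town “A” appears 17 times and town “B” only 12:

```r
library(dplyr)

df <- tibble(name = rep(c("A", "B"), times = c(17, 12)))

# Option 1: nest the functions (hard to read as steps pile up)
filter(count(df, name), n < 17)

# Option 2: store each intermediate result (clutters the workspace)
result1 <- count(df, name)
result2 <- filter(result1, n < 17)
result2   # just town B, with n = 12
```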
A much better solution is to pipe the information from one operation to the next. Piping basically means: take the output of this operation and provide it as the first argument of the following operation. If that sounds confusing, let’s put it in action.
But, first, note that %>% is the operator we use to pipe information. (To use that operator, make sure you have loaded dplyr.)
Let’s answer our question in three steps using easy-to-read code. First, we specify the object that we’re interested in, ma_cities:
ma_cities
We’re going to pipe that information to the count() function. Recall that we previously specified two arguments with count(): the object containing the data we’re interested in and the variable containing the values we want counted (i.e., count(ma_cities, name)). Because we’re piping (ma_cities), we don’t need to specify the first argument anymore. We just need to specify the second argument, name.
Here’s our second step, then:
ma_cities %>%
count(name)
This is just like the output we saw above. But what we can do now is transfer that output directly to the filter() function. Again, we don’t need to specify the first argument anymore; we only need to specify the logical predicate.
However, before we specify the filter code, notice that the output from that last step is different from what we were originally working with. We are now working with a two-variable data frame. The first variable is name and the second is n. The n variable was created in the process of counting (with count()), and all the other variables were discarded. Thus, when we write our filter, we will base it on the n variable.
Here’s how we would finalize that third step:
ma_cities %>%
count(name) %>%
filter(n < 17)
Behold, we have a few more entries where there are fewer than 17 observations (years’ worth of data). Some of these may be newly incorporated or disincorporated, or only recently added to the U.S. Census as a distinct geographical location. However, others just had a name change – and we should clean those up.
To keep things short, I won’t do that in this tutorial and we’ll proceed as though those entries aren’t problematic. However, in practice, we’d need to repeat the corrective step above for all the problem cities.
There are times when you just want to look at a few columns. This is when the dplyr::select() function is particularly helpful. With select(), we just specify the variables we’re interested in seeing in the output. For example, if I just wanted to look at name, pct-hispanic, and eviction-rate, I could write:
ma_cities %>%
select(name, `pct-hispanic`, `eviction-rate`)
This makes it easier for me to quickly find the information I’m looking for.
Sometimes, you want to do more than just count. You’ll want to calculate the mean, median, min, or max values (or even more advanced things, like quantiles and standard deviations). Here’s where the dplyr::summarize() function comes in handy.
summarize() requires us to specify what summary calculation we want to perform and where we want to assign that information. For example, if I specify the argument mean_col=mean(foo), I’m telling R to calculate the mean of the values stored in the variable (e.g., column) foo and to stick that result in a new variable (e.g., column) called mean_col. I would repeat that for every calculation I want to perform.
For example, let’s say I wanted to calculate the mean and median eviction rates for Belchertown over all 17 observations (years). To do that, we would just need to perform three steps: (1) load data, (2) filter data, (3) summarize data.
ma_cities %>%
filter(name == "Belchertown") %>%
summarize(mean=mean(`eviction-rate`, na.rm = TRUE), median=median(`eviction-rate`, na.rm = TRUE))
Note that I had to add the argument na.rm = TRUE to both of those operations. This tells R to exclude NA values when using the mean() and median() functions. We need to do this because none of the cities in MA have eviction-rate data for the year 2000, and mean() and median() by default return NA when the vector they are applied to contains an NA value. We could have also filtered out the year 2000, and then we wouldn’t have had to specify that.
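You can see the na.rm behavior on a small made-up vector:

```r
rates <- c(2.2, NA, 3.0)

mean(rates)                # NA, because one value is missing
mean(rates, na.rm = TRUE)  # 2.6, the mean of the non-missing values
```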
What if we wanted to perform that same calculation for every single city, instead of just Belchertown? Here’s where the dplyr::group_by() function would come in handy.
With group_by(), we can tell R to perform the same operation on each unique value of a grouping variable that we specify as an argument. In this case, our grouping is by city (name)—so we can, for instance, calculate the mean for each city. We could also specify multiple layers of grouping by adding new variables as additional arguments, but we don’t need that right now.
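For example, grouping by two variables at once computes one summary per unique combination. Here’s a sketch with made-up region and year variables (not from the Eviction Lab data):

```r
library(dplyr)

df <- tibble(region = c("West", "West", "East", "East", "East"),
             year   = c(2015, 2015, 2015, 2016, 2016),
             rate   = c(1, 3, 2, 4, 6))

# One summary row per unique region-year combination
df %>%
  group_by(region, year) %>%
  summarize(mean_rate = mean(rate))
```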
To calculate the mean and median eviction rates for all cities over all the observations where they have recorded values, we would just need to perform three steps: (1) load data, (2) group data, and (3) summarize data. Here’s the code we would use:
ma_cities %>%
group_by(name) %>%
summarize(mean=mean(`eviction-rate`, na.rm = TRUE), median=median(`eviction-rate`, na.rm = TRUE))
However, we noticed that some cities may have more missing values than others. In fact, we may decide that if a city has fewer than 5 recorded values, it should be omitted because a single spike in eviction rates would impact the mean too much.
We can accomplish that by adding three more steps:
Between steps 1 and 2, we’ll add a filter to exclude rows where the eviction-rate is NA. (We can pair ! with is.na() to tell R to include only observations that do not have an NA value for that variable.)
In the third step, we’ll compute a variable called number_obs that tells us how many matching observations we have for each group, using the dplyr::n() function.
After the last step, we’ll add a filter to keep only observations where our newly created number_obs variable is greater than or equal to 5.
ma_cities %>%
filter(!is.na(`eviction-rate`)) %>%
group_by(name) %>%
summarize(mean=mean(`eviction-rate`), median=median(`eviction-rate`), number_obs=n()) %>%
filter(number_obs >= 5)
While we could go through that table and try to find all the cities with the highest mean, it might be easiest to simply sort the table in ascending (low to high) or descending (high to low) order. This is where the dplyr::arrange() function comes in.
You’ll often want to arrange information so that you can look at the highest and lowest values of something. arrange() requires just one argument: the variable to arrange the information by. (arrange() can also do multilevel sorting, as we’ll see below.)
By default, arrange() will sort in ascending order, although we can use the desc() function to change that behavior. For example, we would write arrange(desc(foo)) to sort data from high to low based on the foo variable. Alternatively, you can use the minus sign (-) to invert the order, and thus replicate the descending behavior more concisely: arrange(-foo).
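Both spellings produce the same result, as a toy data frame shows:

```r
library(dplyr)

df <- tibble(foo = c(2, 9, 5))

arrange(df, desc(foo))  # 9, 5, 2
arrange(df, -foo)       # same result
```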
Here’s how we can use arrange() to sort the table by the mean variable. Note that we’re only adding one line to the code we ran above, at the very end.
ma_cities %>%
filter(!is.na(`eviction-rate`)) %>%
group_by(name) %>%
summarize(mean=mean(`eviction-rate`), median=median(`eviction-rate`), number_obs=n()) %>%
filter(number_obs >= 5) %>%
arrange(-mean) # arrange(desc(mean)) accomplishes the same thing
Let’s say I wanted to arrange the table by the number of observations and then by name. (That is, to list the highest number_obs values first and, when there are multiple observations with the same value, in alphabetical order by name.) I need only do the following (see the revision to the last line):
ma_cities %>%
filter(!is.na(`eviction-rate`)) %>%
group_by(name) %>%
summarize(mean=mean(`eviction-rate`), median=median(`eviction-rate`), number_obs=n()) %>%
filter(number_obs >= 5) %>%
arrange(-number_obs, name) # arrange(desc(number_obs), name) accomplishes the same thing
These six functions alone are remarkably powerful and versatile, especially when you begin pairing them in different ways. They are the data journalism equivalent of a carpenter’s hammer, screwdriver, and pliers—capable tools that are foundational to all kinds of projects. R offers a number of additional tools, including some that are highly specialized, but by mastering dplyr, you’ll be well on your way to mining datasets for journalistic purposes!
See if you can answer the following questions:
Which Massachusetts city had the most evictions in 2016? How many evictions did it have?
Among large cities (100,000 people or higher), which city had the highest eviction rate in 2016? What was the rate?
What was the mean rate of evictions among large cities (100,000 people or higher) in 2016? Was it higher than the rate in 2015?
During which year did Springfield have its highest eviction filing rate?
Which year in the ma_cities dataset had the greatest amount of missing values for the eviction rate variable?