Today, we’ll be looking at some eviction data for the Commonwealth of Massachusetts. These data come from the Eviction Lab, which is led by Matthew Desmond.
The Eviction Lab data are derived from 80 million records painstakingly collected by the Lab’s research team. You can read about the methodology here, but the key thing to keep in mind for this exercise is that the dataset we’ll be using covers the city level in the state of Massachusetts. Desmond’s team also makes data available at more micro levels (U.S. Census tracts and block groups) and more macro levels (county, statewide).
Because these data aren’t collected in a consistent manner (across states or within them), the Lab is forced to estimate some of the variables. You may find, for example, that a city had 0.45 evictions one year. How’s it possible to only evict 0.45 of a home? Well, that’s just an estimate and it would be reasonable for our purposes to round that tally to the closest whole number. (The Eviction Lab’s website does this for the sake of simplicity but the datasets they provide are more precise.)
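If you want to do that rounding yourself, base R’s round() function handles it. (The values below are made up for illustration; they are not from the Eviction Lab data.)

```r
# Hypothetical eviction estimates (made-up values for illustration)
estimated_evictions <- c(0.45, 3.51, 12.97)

# Round to the nearest whole number, as the Eviction Lab's website does
round(estimated_evictions)  # 0 4 13
```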
As always, it is crucial that you review the data dictionary. If you want more detail about any of the variables, I encourage you to review the full Methodology Report.
The first step, of course, is to read in the data. We’ll use the readr::read_csv() function (which can be loaded through tidyverse):
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.0 ✔ purrr 0.3.2
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
ma_cities <- read_csv("http://projects.rodrigozamith.com/datastorytelling/data/evictions_ma_cities.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## name = col_character(),
## `parent-location` = col_character()
## )
## See spec(...) for full column specifications.
The next step is to confirm the data were imported the way we wanted them to be. Here, I would recommend that you follow three steps.
Start by checking the first few observations of the object to see if there’s anything odd, using the head() function:
head(ma_cities)
At first glance, the data seem to have come in cleanly. Each row represents a different city (name) at a different point in time (year).
Next, we’ll use the str() function to check all the column names and make sure they’re associated with the right data type:
str(ma_cities)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 4245 obs. of 27 variables:
## $ GEOID : num 2500135 2500135 2500135 2500135 2500135 ...
## $ year : num 2000 2001 2002 2003 2004 ...
## $ name : chr "Abington" "Abington" "Abington" "Abington" ...
## $ parent-location : chr "Massachusetts" "Massachusetts" "Massachusetts" "Massachusetts" ...
## $ population : num 14605 14605 14605 14605 14605 ...
## $ poverty-rate : num 3.6 3.6 3.6 3.6 3.6 3.62 3.62 3.62 3.62 3.62 ...
## $ renter-occupied-households: num NA NA NA NA NA ...
## $ pct-renter-occupied : num 28.2 28.2 28.2 28.2 28.2 ...
## $ median-gross-rent : num 676 676 676 676 676 1060 1060 1060 1060 1060 ...
## $ median-household-income : num 57100 57100 57100 57100 57100 ...
## $ median-property-value : num 178900 178900 178900 178900 178900 ...
## $ rent-burden : num 23.8 23.8 23.8 23.8 23.8 27.9 27.9 27.9 27.9 27.9 ...
## $ pct-white : num 97 97 97 97 97 ...
## $ pct-af-am : num 0.73 0.73 0.73 0.73 0.73 2.26 2.26 2.26 2.26 2.26 ...
## $ pct-hispanic : num 0.71 0.71 0.71 0.71 0.71 0.96 0.96 0.96 0.96 0.96 ...
## $ pct-am-ind : num 0.12 0.12 0.12 0.12 0.12 0 0 0 0 0 ...
## $ pct-asian : num 0.46 0.46 0.46 0.46 0.46 1.5 1.5 1.5 1.5 1.5 ...
## $ pct-nh-pi : num 0.01 0.01 0.01 0.01 0.01 0 0 0 0 0 ...
## $ pct-multiple : num 0.76 0.76 0.76 0.76 0.76 0.27 0.27 0.27 0.27 0.27 ...
## $ pct-other : num 0.25 0.25 0.25 0.25 0.25 0.16 0.16 0.16 0.16 0.16 ...
## $ eviction-filings : num NA NA NA NA NA NA NA NA 49 65 ...
## $ evictions : num NA NA NA NA NA NA NA NA 39 53 ...
## $ eviction-rate : num NA NA NA NA NA NA NA NA 2.23 2.96 ...
## $ eviction-filing-rate : num NA NA NA NA NA NA NA NA 2.8 3.63 ...
## $ low-flag : num 0 0 0 0 0 0 0 0 1 1 ...
## $ imputed : num 0 0 0 0 0 0 0 0 0 0 ...
## $ subbed : num 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. GEOID = col_double(),
## .. year = col_double(),
## .. name = col_character(),
## .. `parent-location` = col_character(),
## .. population = col_double(),
## .. `poverty-rate` = col_double(),
## .. `renter-occupied-households` = col_double(),
## .. `pct-renter-occupied` = col_double(),
## .. `median-gross-rent` = col_double(),
## .. `median-household-income` = col_double(),
## .. `median-property-value` = col_double(),
## .. `rent-burden` = col_double(),
## .. `pct-white` = col_double(),
## .. `pct-af-am` = col_double(),
## .. `pct-hispanic` = col_double(),
## .. `pct-am-ind` = col_double(),
## .. `pct-asian` = col_double(),
## .. `pct-nh-pi` = col_double(),
## .. `pct-multiple` = col_double(),
## .. `pct-other` = col_double(),
## .. `eviction-filings` = col_double(),
## .. evictions = col_double(),
## .. `eviction-rate` = col_double(),
## .. `eviction-filing-rate` = col_double(),
## .. `low-flag` = col_double(),
## .. imputed = col_double(),
## .. subbed = col_double()
## .. )
All of our variables appear to have the data types we’d expect for them (e.g., num for variables we would think would be numerical and chr for those we expect to be text).
We’ll also want to check to see if there are any extreme values in our dataset that may have been the result of information being entered incorrectly. To do that, we’ll use the summary() function:
summary(ma_cities)
## GEOID year name parent-location
## Min. :2500135 Min. :2000 Length:4245 Length:4245
## 1st Qu.:2523840 1st Qu.:2004 Class :character Class :character
## Median :2541960 Median :2008 Mode :character Mode :character
## Mean :2542110 Mean :2008
## 3rd Qu.:2561205 3rd Qu.:2012
## Max. :2582595 Max. :2016
##
## population poverty-rate renter-occupied-households
## Min. : 0 Min. : 0.000 Min. : 1.74
## 1st Qu.: 2245 1st Qu.: 2.480 1st Qu.: 199.61
## Median : 4754 Median : 5.000 Median : 623.02
## Mean : 19266 Mean : 6.476 Mean : 3757.21
## 3rd Qu.: 23464 3rd Qu.: 8.895 3rd Qu.: 2990.25
## Max. :650281 Max. :78.600 Max. :178128.00
## NA's :175 NA's :175 NA's :1583
## pct-renter-occupied median-gross-rent median-household-income
## Min. : 0.00 Min. : 0.0 Min. : 0
## 1st Qu.: 18.80 1st Qu.: 707.0 1st Qu.: 47360
## Median : 30.41 Median : 906.0 Median : 60067
## Mean : 31.60 Mean : 952.1 Mean : 64479
## 3rd Qu.: 42.99 3rd Qu.:1141.0 3rd Qu.: 76328
## Max. :100.00 Max. :2908.0 Max. :250001
## NA's :175 NA's :346 NA's :206
## median-property-value rent-burden pct-white pct-af-am
## Min. : 0 Min. : 0.0 Min. : 0.00 Min. : 0.000
## 1st Qu.: 199025 1st Qu.:24.9 1st Qu.: 84.84 1st Qu.: 0.310
## Median : 285400 Median :28.2 Median : 92.28 Median : 1.090
## Mean : 319087 Mean :29.1 Mean : 87.73 Mean : 2.402
## 3rd Qu.: 383200 3rd Qu.:31.9 3rd Qu.: 95.76 3rd Qu.: 2.690
## Max. :1000001 Max. :50.0 Max. :100.00 Max. :41.210
## NA's :213 NA's :311 NA's :175 NA's :175
## pct-hispanic pct-am-ind pct-asian pct-nh-pi
## Min. : 0.000 Min. :0.0000 Min. : 0.000 Min. :0.00000
## 1st Qu.: 0.980 1st Qu.:0.0000 1st Qu.: 0.420 1st Qu.:0.00000
## Median : 2.100 Median :0.0600 Median : 1.260 Median :0.00000
## Mean : 4.711 Mean :0.1531 Mean : 2.727 Mean :0.01555
## 3rd Qu.: 4.940 3rd Qu.:0.1800 3rd Qu.: 3.260 3rd Qu.:0.01000
## Max. :76.350 Max. :2.3600 Max. :26.630 Max. :0.71000
## NA's :175 NA's :175 NA's :175 NA's :175
## pct-multiple pct-other eviction-filings evictions
## Min. : 0.000 Min. : 0.0000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.640 1st Qu.: 0.0000 1st Qu.: 3.13 1st Qu.: 2.255
## Median : 1.210 Median : 0.1400 Median : 11.09 Median : 8.000
## Mean : 1.506 Mean : 0.5807 Mean : 116.84 Mean : 72.593
## 3rd Qu.: 2.020 3rd Qu.: 0.5000 3rd Qu.: 61.00 3rd Qu.: 40.370
## Max. :11.890 Max. :30.3200 Max. :6447.00 Max. :2616.000
## NA's :175 NA's :175 NA's :1583 NA's :1583
## eviction-rate eviction-filing-rate low-flag imputed
## Min. : 0.000 Min. : 0.000 Min. :0.0000 Min. :0
## 1st Qu.: 0.700 1st Qu.: 1.090 1st Qu.:0.0000 1st Qu.:0
## Median : 1.425 Median : 2.090 Median :0.0000 Median :0
## Mean : 1.693 Mean : 2.479 Mean :0.3925 Mean :0
## 3rd Qu.: 2.370 3rd Qu.: 3.467 3rd Qu.:1.0000 3rd Qu.:0
## Max. :10.730 Max. :16.650 Max. :1.0000 Max. :0
## NA's :1583 NA's :1583
## subbed
## Min. :0
## 1st Qu.:0
## Median :0
## Mean :0
## 3rd Qu.:0
## Max. :0
##
Glancing at that output, we see a few things.
First, there don’t appear to be any extreme values for any of our variables. Percentages don’t exceed 100 or go below 0, and the year goes from 2000 to 2016, as expected. The maximum values for renter-occupied-households and evictions are quite a bit higher than the values at the third quartile, but that’s probably because most cities in Massachusetts are quite small—and then you have the likes of Boston. We’ll want to look at these a bit more closely. (I’ll show you how to do that shortly.)
Second, some of our variables have missing data (NAs). Given that the number of NAs is pretty consistent (1,583), I’m guessing that data simply are not available for some cities in given years. However, it could also mean that I have some extra observations in ma_cities that are junk (e.g., blank lines at the bottom of the file). Again, we should check this out. (Again, more on this shortly.)
dplyr
One immensely useful R package for data journalists and scientists alike is called dplyr. It is loaded by tidyverse, so we don’t need to load it individually if we’ve loaded that package.
When loading dplyr or tidyverse, you may see a warning message that some objects are “masked.” What this is telling us is that dplyr contains some functions (like filter()) that have the same name as functions you’ve already loaded via another package (e.g., filter() in the stats package that is automatically loaded). That’s fine, but this is why the package::function() nomenclature is useful. By default, R will use the conflicting function from the package loaded most recently. (Thus, the order in which you load packages matters.) If you want to call a function from a specific package, use the package::function() nomenclature. Thus, if I write filter(), it will be taken from dplyr since I just loaded it. If I want to use stats’s version of filter(), I would write stats::filter().
dplyr includes a number of functions that will help us quickly slice up our data and calculate some relevant information. We’ll cover a few of the most important ones in this exercise: filter(), count(), select(), summarize(), group_by(), and arrange().
Let’s say I wanted to look at all the rows that had a value in excess of 5 for the eviction-rate variable. I could use the dplyr::filter() function to accomplish that. With filter(), I’m telling R to exclude all observations that don’t match the criteria I’m looking for, and produce a new object (e.g., a data frame) that only includes the desired observations.
filter() requires two arguments (.data and a logical predicate). (See the help system documentation for more information by typing ?filter() in the console or using the Help tab.) The first argument is the object (e.g., data frame) you are interested in and the second argument is the criterion (or criteria) for filtering. The criteria we most often use are: == (is equivalent to), != (is not equivalent to), > / < (is greater than or less than), and >= / <= (is greater than or equal to, or less than or equal to).
So, we’ll tell filter() to look into the ma_cities dataset and only include observations where the eviction-rate variable is greater than 5. (Put differently, exclude all observations where that criterion isn’t matched.)
filter(ma_cities, `eviction-rate` > 5)
Behold, we now only show the rows that have an eviction rate in excess of 5.
Note that I had to enclose my eviction-rate variable name in backticks (`), or grave accent marks. That’s because if we don’t do that, R will think that we’re trying to subtract the object ‘rate’ from ‘eviction’. I usually use underscores (_) in my variable names to avoid this problem, but the Eviction Lab did it differently.
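If the backticks get tiresome, one option is to rename hyphenated columns once with dplyr::rename() and use underscore versions from then on. Here’s a sketch on a toy data frame (the hyphenated column name mirrors the Eviction Lab’s):

```r
library(dplyr)

# Toy data frame with a hyphenated column, like the Eviction Lab's
df <- tibble(`eviction-rate` = c(2.1, 6.4, 0.9))

# Rename once, using the format new_name = old_name
df <- rename(df, eviction_rate = `eviction-rate`)

# From here on, no backticks are needed
filter(df, eviction_rate > 5)
```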
NA
NA values are special because they’re used to denote missing data. They don’t have a numerical meaning, so we can’t filter them out based on mathematical criteria.
Instead, we’ll use the base::is.na() function, which assesses whether each value in a given vector is NA or not. If it is, it marks it as TRUE, and if it isn’t, FALSE (this is called a Boolean or logical response).
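You can see that Boolean behavior concretely with a small made-up vector:

```r
# A vector with one missing value
x <- c(12, NA, 7)

is.na(x)   # FALSE TRUE FALSE
```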
This is how we would get a listing of our rows that have NA values:
filter(ma_cities, is.na(evictions))
By not specifying a condition for is.na(), we’re relying on R’s default of matching TRUE conditions. Put differently, this code tells R to only include observations where the value for the evictions variable is NA, because it interprets the code as is.na(evictions)==TRUE by default. If we wanted to do the opposite, we could specify that with is.na(evictions)==FALSE, or short-hand it by using an “is not” operator (!) as follows: !is.na(evictions).
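Here’s that negated version in action on a small made-up data frame:

```r
library(dplyr)

# Toy data with one missing evictions value
df <- tibble(name = c("A", "B", "C"),
             evictions = c(10, NA, 3))

# Keep only the rows where evictions is NOT missing
filter(df, !is.na(evictions))   # keeps A and C
```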
Oftentimes, we want to get a count of how often each unique value shows up in our data. For example, perhaps we’re wondering how often each city shows up in our dataset. Here’s where the dplyr::count() function comes in. (It does something similar to base::table(), but I suggest using count() in the future because it can be seamlessly integrated into a multi-step workflow, as I’ll show you shortly.)
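The difference is easy to see on a toy example: count() returns a data frame (which later functions can keep working with), whereas base::table() returns a table object.

```r
library(dplyr)

df <- tibble(name = c("Boston", "Boston", "Salem"))

count(df, name)   # a data frame with columns name and n
table(df$name)    # a named table object
```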
count() allows us to specify two arguments (x and wt). (For more information, see the documentation.) The first argument is the object we want to evaluate and the second argument is the variable containing the values you want counted.
In our case, we want to use count() on the variable name in the object ma_cities:
count(ma_cities, name)
From these data, we notice that most towns appear in the dataset 17 times (2000 to 2016, after all). But we also notice that one town (Amesbury) had a name change. Because the name is entered differently, R will treat it as a separate place.
We can double-check this using the following code:
filter(ma_cities, str_detect(name, "Amesbury"))
Notice how Amesbury only includes information until 2009, and the entries for Amesbury Town between 2000 and 2009 are full of NA values. (Click on the “Next” label on the bottom-right of the table to scroll through entries.)
If we want to treat Amesbury as a single place, it might be wise for us to give it a consistent label. One easy way to recode is to use the dplyr::recode() function.
However, let’s first remove the duplicate entries:
ma_cities <- filter(ma_cities, !(name=="Amesbury Town" & year<2010))
With the above code, we executed a complex filter. It contains two conditions: the name must be equivalent to “Amesbury Town” and (&) the year must be less than 2010. We group the two conditions using parentheses and use the exclamation mark before that parenthetical grouping to convey that the filter should include all entries that do not match those two conditions. We then re-assign the output to ma_cities to remove the undesired rows from the object we’ll be calling again shortly.
With the recode() function, we first specify the vector we want to inspect and then the values we’d like to recode in the following format: OriginalValue = ReplacedValue.
For example, if we want to replace all remaining instances of "Amesbury Town" with "Amesbury" in the name variable of ma_cities, we’d use the following code:
ma_cities$name <- recode(ma_cities$name, "Amesbury Town" = "Amesbury")
Again, notice that we’re replacing the values in ma_cities$name with the newly-recoded values. This way, the data transformation persists going forward.
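An equivalent, tidier-style way to express this is to wrap recode() inside mutate(), which overwrites the column within the data frame itself. Here’s a sketch on a toy data frame:

```r
library(dplyr)

df <- tibble(name = c("Amesbury Town", "Boston"))

# mutate() replaces the name column with its recoded version
df <- mutate(df, name = recode(name, "Amesbury Town" = "Amesbury"))

df$name   # "Amesbury" "Boston"
```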
Now, if we perform the previous count, we’ll see the corrected entry:
count(ma_cities, name)
What if I wanted to quickly look at my data and identify places that had fewer than 17 entries, in case similar transformations are needed?
This would require me to perform two steps: First, I’d need to use count() to get a total for each value. Second, I’d need to use filter() to only show me instances where a town has fewer than 17 entries.
I could wrap the functions inside each other (i.e., filter(count())), but that would start to get unwieldy in instances where I need to perform multiple steps. Alternatively, I could store the output from each step in a new object and use those objects going forward (i.e., result1 <- count() and then result2 <- filter(result1)).
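To make those two (clunkier) alternatives concrete, here’s a sketch with a made-up data frame where town “A” appears 17 times and town “B” only 12:

```r
library(dplyr)

df <- tibble(name = rep(c("A", "B"), times = c(17, 12)))

# Option 1: nest the functions (hard to read as steps pile up)
filter(count(df, name), n < 17)

# Option 2: store each intermediate result (clutters the workspace)
result1 <- count(df, name)
result2 <- filter(result1, n < 17)
result2   # just town B, with n = 12
```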
A much better solution is to pipe the information from one operation to the next. Piping basically means: take the output of this operation and provide it as the first argument of the following operation. If that sounds confusing, let’s put it in action.
But, first, note that %>% is the operator we use to pipe information. (To use that operator, make sure you have loaded dplyr.)
Let’s answer our question in three steps using easy-to-read code. First, we specify the object that we’re interested in, ma_cities:
ma_cities
We’re going to pipe that information to the count() function. Recall that we previously specified two arguments with count(): the object containing the data we’re interested in and the variable containing the values we want counted (i.e., count(ma_cities, name)). Because we’re piping (ma_cities), we don’t need to specify the first argument anymore. We just need to specify the second argument, name.
Here’s our second step, then:
ma_cities %>%
count(name)
This is just like the output we saw above. But what we can do now is transfer that output directly to the filter() function. Again, we don’t need to specify the first argument anymore; we only need to specify the logical predicate.
However, before we specify the filter code, notice that the output from that last step is different from what we were originally working with. We are now working with a two-variable data frame. The first variable is name and the second is n. The n variable was created in the process of counting (with count()), and all the other variables were discarded. Thus, when we write our filter, we will base it on the n variable.
Here’s how we would finalize that third step:
ma_cities %>%
count(name) %>%
filter(n < 17)
Behold, we have a few more entries where there are fewer than 17 observations (years’ worth of data). Some of these may be newly incorporated or disincorporated, or only recently added to the U.S. Census as a distinct geographical location. However, others just had a name change – and we should clean those up.
To keep things short, I won’t do that in this tutorial and we’ll proceed as though those entries aren’t problematic. However, in practice, we’d need to repeat the corrective step above for all the problem cities.
There are times when you just want to look at a few columns. This is when the dplyr::select() function is particularly helpful. With select(), we just specify the variables we’re interested in seeing in the output. For example, if I just wanted to look at name, pct-hispanic, and eviction-rate, I could write:
ma_cities %>%
select(name, `pct-hispanic`, `eviction-rate`)
This makes it easier for me to quickly find the information I’m looking for.
Sometimes, you want to do more than just count. You’ll want to calculate the mean, median, min, or max values (or even more advanced things, like quantiles and standard deviations). Here’s where the dplyr::summarize() function comes in handy.
summarize() requires us to specify what summary calculation we want to perform and where we want to assign that information. For example, if I specify the argument mean_col=mean(foo), I’m telling R to calculate the mean of the values stored in the variable (e.g., column) foo and to stick that result in a new variable (e.g., column) called mean_col. I would repeat that for every calculation I want to perform.
For example, let’s say I wanted to calculate the mean and median eviction rates for Belchertown over all 17 observations (years). To do that, we would just need to perform three steps: (1) load data, (2) filter data, (3) summarize data.
ma_cities %>%
filter(name == "Belchertown") %>%
summarize(mean=mean(`eviction-rate`, na.rm = TRUE), median=median(`eviction-rate`, na.rm = TRUE))
Note that I had to add the argument na.rm = TRUE to both of those operations. This tells R to exclude NA values when using the mean() and median() functions. We need to do this because none of the cities in MA have eviction-rate data for the year 2000, and mean() and median() by default return NA when the vector they are applied to contains an NA value. We could have also filtered out the year 2000, and then we wouldn’t have had to specify that.
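You can see the na.rm behavior on a small made-up vector:

```r
rates <- c(2.2, NA, 3.0)

mean(rates)                # NA, because one value is missing
mean(rates, na.rm = TRUE)  # 2.6, the mean of the non-missing values
```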
What if we wanted to perform that same calculation for every single city, instead of just Belchertown? Here’s where the dplyr::group_by() function would come in handy.
With group_by(), we can tell R to perform the same operation on each unique value of a grouping variable that we specify as an argument. In this case, our grouping is by city (name)—so we can, for instance, calculate the mean for each city. We could also specify multiple layers of grouping by adding new variables as additional arguments, but we don’t need that right now.
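For example, grouping by two variables at once computes one summary per unique combination. Here’s a sketch with made-up region and year variables (not from the Eviction Lab data):

```r
library(dplyr)

df <- tibble(region = c("West", "West", "East", "East", "East"),
             year   = c(2015, 2015, 2015, 2016, 2016),
             rate   = c(1, 3, 2, 4, 6))

# One summary row per unique region-year combination
df %>%
  group_by(region, year) %>%
  summarize(mean_rate = mean(rate))
```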
To calculate the mean and median eviction rates for all cities over all the observations where they have recorded values, we would just need to perform three steps: (1) load data, (2) group data, and (3) summarize data. Here’s the code we would use:
ma_cities %>%
group_by(name) %>%
summarize(mean=mean(`eviction-rate`, na.rm = TRUE), median=median(`eviction-rate`, na.rm = TRUE))
However, we noticed that some cities may have more missing values than others. In fact, we may decide that if a city has fewer than 5 recorded values, it should be omitted because a single spike in eviction rates would impact the mean too much.
We can accomplish that by adding three more steps:
Between steps 1 and 2, we’ll add a filter to exclude rows where the eviction-rate is NA. (We can pair ! with is.na() to tell R to include only observations that do not have an NA value for that variable.)
In the third step, we’ll compute a variable called number_obs that tells us how many matching observations we have for each group, using the dplyr::n() function.
After the last step, we’ll add a filter to keep only observations where our newly created number_obs variable is greater than or equal to 5.
ma_cities %>%
filter(!is.na(`eviction-rate`)) %>%
group_by(name) %>%
summarize(mean=mean(`eviction-rate`), median=median(`eviction-rate`), number_obs=n()) %>%
filter(number_obs >= 5)
While we could go through that table and try to find all the cities with the highest mean, it might be easiest to simply sort the table in ascending (low to high) or descending (high to low) order. This is where the dplyr::arrange() function comes in.
You’ll often want to arrange information so that you can look at the highest and lowest values of something. arrange() requires just one argument: the variable to arrange the information by. (arrange() can also do multilevel sorting, as we’ll see below.)
By default, arrange() will sort in ascending order, although we can use the desc() function to change that behavior. For example, we would write arrange(desc(foo)) to sort data from high to low based on the foo variable. Alternatively, you can use the minus sign (-) to invert the order, and thus replicate the descending behavior more concisely: arrange(-foo).
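Both spellings produce the same result, as a toy data frame shows:

```r
library(dplyr)

df <- tibble(foo = c(2, 9, 5))

arrange(df, desc(foo))  # 9, 5, 2
arrange(df, -foo)       # same result
```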
Here’s how we can use arrange() to sort the table by the mean variable. Note that we’re only adding one line to the code we ran above, at the very end.
ma_cities %>%
filter(!is.na(`eviction-rate`)) %>%
group_by(name) %>%
summarize(mean=mean(`eviction-rate`), median=median(`eviction-rate`), number_obs=n()) %>%
filter(number_obs >= 5) %>%
arrange(-mean) # arrange(desc(mean)) accomplishes the same thing
Let’s say I wanted to arrange the table by the number of observations and then by name. (That is, to list the highest number_obs values first and, when there are multiple observations with the same value, in alphabetical order by name.) I need only do the following (see the revision to the last line):
ma_cities %>%
filter(!is.na(`eviction-rate`)) %>%
group_by(name) %>%
summarize(mean=mean(`eviction-rate`), median=median(`eviction-rate`), number_obs=n()) %>%
filter(number_obs >= 5) %>%
arrange(-number_obs, name) # arrange(desc(number_obs), name) accomplishes the same thing
These six functions alone are remarkably powerful and versatile, especially when you begin pairing them in different ways. They are the data journalism equivalent of a carpenter’s hammer, screwdriver, and pliers—capable tools that are foundational to all kinds of projects. R offers a number of additional tools, including some that are highly specialized, but by mastering dplyr, you’ll be well on your way to mining datasets for journalistic purposes!
See if you can answer the following questions:
Which Massachusetts city had the most evictions in 2016? How many evictions did it have?
Among large cities (100,000 people or higher), which city had the highest eviction rate in 2016? What was the rate?
What was the mean rate of evictions among large cities (100,000 people or higher) in 2016? Was it higher than the rate in 2015?
During which year did Springfield have its highest eviction filing rate?
Which year in the ma_cities dataset had the greatest amount of missing values for the eviction rate variable?