In this tutorial, we’ll be looking at some eviction data aggregated at the state level for (most) U.S. states. These data come from the Eviction Lab. As always, it is helpful to review the data dictionary. If you want more detail about any of the variables, I encourage you to review the full Methodology Report.

Pre-Analysis

Load the data

The first step, of course, is to read in the data. Like before, we’ll use the readr::read_csv() function:

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.0     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
us_states <- read_csv("http://projects.rodrigozamith.com/datastorytelling/data/evictions_us_states.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   GEOID = col_character(),
##   name = col_character(),
##   `parent-location` = col_character()
## )
## See spec(...) for full column specifications.

Confirm it imported correctly

Normally, we would carefully verify that the data imported correctly. Because we’ve previously evaluated this dataset, we can just take a quick look at the first few observations to confirm everything came in as expected.

head(us_states)

Just like before, it appears we imported the data correctly.

Working with strings

Basic string matching

While most of the data crunching we do as data journalists involves numbers, we sometimes have to work with strings, too. For example, we may want to filter our dataset so that it only includes observations that begin with or end with certain text. Let’s say I want to look at the Dakotas. Instead of creating a separate filter for each variation of “Dakota” (thankfully, only two in this example!), I can use the endsWith() function that comes with R. The arguments I need to provide are the variable I’m evaluating and the text the variable needs to end with to yield a TRUE response.

To illustrate, here’s how I would filter in only the Dakotas. (Recall that we need to load dplyr, which comes with tidyverse, to perform a filter and to pipe output.)

us_states %>%
  filter(endsWith(name, "Dakota"))

Note that we are using endsWith() (which comes with R) and not ends_with() (which comes with the dplyr package). dplyr::ends_with() is a separate function used to select multiple variables from a dataframe based on the string condition, and can be useful for very large datasets with hundreds of variables. base::endsWith() evaluates a given string or vector to see if it matches the condition, yielding a TRUE or FALSE response that can serve as the basis of a filter.
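
To make the distinction concrete, here is a minimal sketch of the select-helper variant. It keeps only the columns whose names end in “rate” (e.g., eviction-rate and eviction-filing-rate), alongside name and year:

us_states %>%
  select(name, year, ends_with("rate")) %>%
  head()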

Unlike prior instances of dplyr::filter() where we specified a variable, an evaluation operator, and the condition (e.g., name == "Dakota"), the endsWith() function expects a different syntax. This is common in R when you’re using functions developed by different people and included in different packages. Thus, be sure to read the function’s documentation.

We can similarly look at the start of a string by using the startsWith() function, which also comes with R. Here’s how we’d filter in only the states that begin with “North”:

us_states %>%
  filter(startsWith(name, "North"))

Advanced string matching

What if we want to just filter observations based on some text that is present in the middle of a string? Here, we can use the str_detect() function from the stringr package (which is loaded by tidyverse).

stringr::str_detect() allows you to match text based on patterns, or regular expressions. These patterns can get very complicated (but are quite powerful), and you can read more about them here.

For the sake of simplicity, we won’t cover patterns here. Instead, we’ll just use str_detect() to look for entries that include a specified string anywhere within a given variable. For example, we can filter in any state with “y” in its name and count the number of unique states like so:

us_states %>%
  filter(str_detect(name, "y")) %>%
  count(name)

Be careful when using punctuation with str_detect(). First, the function is case-sensitive by default, so searching for “Y” will give you different results than “y” (for example, only “Y” will match New York). Second, the function may think you are giving it instructions for pattern matching. For example, “^North” tells it to match only when “North” appears at the start of the string, just like startsWith(). Any special characters need to be escaped. You can read about that in the documentation.
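
To illustrate both points, here is a short sketch using two stringr helpers: regex() with ignore_case=TRUE makes a search case-insensitive, while fixed() treats special characters as literal text rather than pattern instructions:

# Match "y" or "Y" anywhere in the name (this now picks up New York, too)
us_states %>%
  filter(str_detect(name, regex("y", ignore_case=TRUE))) %>%
  count(name)

# "." is a pattern wildcard that matches any character;
# fixed(".") looks for a literal period instead
str_detect("North Dakota", ".")
## [1] TRUE
str_detect("North Dakota", fixed("."))
## [1] FALSE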

Computing new variables

With dplyr::summarize(), we previously computed a variable based on all the values in a given data frame (e.g., summarize(median_rate=median(rate))). This produced a new aggregate data frame consisting only of the summarized values. However, what if we wanted to create a new variable and add it to our existing, disaggregated data frame based on some calculation or condition (e.g., calculate our own per-capita rate for each observation)?

This is where the dplyr::mutate() function comes in. mutate() operates just like summarize(), with the difference being that the variables you create are simply appended to the existing dataset.

Percentages to decimals

For example, let’s say we wanted to create a variable that divides pct-white by 100, to get a decimal representation of the percentage (94.5% becomes 0.945). We could do that with the following code:

us_states %>%
  mutate(decimal_pct_white=(`pct-white`/100)) %>%
  select(year, name, `pct-white`, decimal_pct_white) %>%
  head(5)

Note that I used dplyr::select() to highlight just the variables of interest and base::head() to limit the output to just the first five rows. Those steps simply improve readability.

Keep in mind that the variable we created (decimal_pct_white) will persist for all subsequent operations (in case we continue to pipe information). However, unless we assign the output to a new or existing object (e.g., by starting the code chunk above with us_states_decimals <- us_states %>%), the variable will be forgotten after that code is executed.
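
For instance, here is a minimal sketch of that assignment (us_states_decimals is just an arbitrary name for the new object):

us_states_decimals <- us_states %>%
  mutate(decimal_pct_white=(`pct-white`/100))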

You can perform multiple mutations in a single call by treating them as separate arguments to the mutate() function. For example, if I wanted to perform the same operation on the pct-af-am and pct-hispanic variables, I’d just do the following:

us_states %>%
  mutate(decimal_pct_af_am=(`pct-af-am`/100), decimal_pct_hispanic=(`pct-hispanic`/100)) %>%
  select(year, name, `pct-af-am`, decimal_pct_af_am, `pct-hispanic`, decimal_pct_hispanic) %>%
  head(5)

Calculating rates

Data journalists are often interested in calculating rates to make comparisons more accurate. For example, Town A may have a higher incidence of Z (e.g., murders) than Town B because it has a bigger population. A rate (e.g., number of murders per person living in the town, or murder rate) can make it easier to compare the public safety of the two towns. A rate like that can be easily computed through the following calculation: X/Y (where X refers to the number of murders and Y to the population).

As an example, we’ll let X be the number of eviction filings (eviction-filings) and Y be the population of a given state (population). We could use mutate() like so to calculate a rate:

us_states %>%
  mutate(eviction_filings_rate=(`eviction-filings`/population)) %>%
  select(name, year, eviction_filings_rate) %>%
  head()

What the above output tells us is that there were 0.002 eviction filings for every person who lived in Alabama in the year 2000. Oftentimes, rates are calculated per 1,000 of Y or 100,000 of Y. (The ‘per’ number depends on context, and the journalist should use the representation that is most informative – not the one that makes it seem most or least alarming.) In this case, we may want to report the number of eviction filings for every 1,000 people. We could amend our code like so:

us_states %>%
  mutate(eviction_filings_rate=(`eviction-filings`/population)*1000) %>%
  select(name, year, eviction_filings_rate) %>%
  head()

Now, what we see is that there were 1.71 eviction filings for every 1,000 people who lived in Alabama in the year 2000. (The difference in numbers from the eviction-filing-rate variable is because we calculated it differently than they did. Check out their documentation to see where we diverged.)
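
If you want to see the divergence side by side, one option is to place our computed rate next to their eviction-filing-rate variable:

us_states %>%
  mutate(eviction_filings_rate=(`eviction-filings`/population)*1000) %>%
  select(name, year, eviction_filings_rate, `eviction-filing-rate`) %>%
  head()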

Subscripting data

R allows us to pull out specific information from the output, which can be handy when you want to perform calculations using filtered or mutated data. We’ve already demonstrated how to do this with different dplyr functions, but this is a good opportunity to demonstrate an alternative that is sometimes necessary, either for functions that don’t play nicely with the tidyverse or for doing certain things within it.

First, we’ll pull information from a vector. As a reminder, you can call information from a variable in a data frame with the object$variable notation, such as when we run:

us_states$`eviction-rate`
##   [1] 1.79 2.08 1.94 1.62 1.38 1.84 2.14 1.74 1.99 1.74 1.73 1.86 1.96 1.93
##  [15] 1.69 1.80 1.82   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
##  [29]   NA   NA   NA   NA   NA   NA 6.84 7.65 7.88 7.74 7.97 7.93 5.05 5.43
##  [43] 5.59 4.49 4.35 3.93 4.47 4.34 3.13 3.94 3.92   NA   NA   NA   NA   NA
##  [57]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA 2.49 2.26
##  [71] 2.03 1.88 1.79 1.66 1.66 1.76 2.12 2.01 1.93 1.58 1.40 1.29 1.13 0.92
##  [85] 0.83 3.65 3.28 3.26 4.48 5.00 5.06 5.71 5.06 4.25 3.09 3.82 3.64 3.32
##  [99] 2.26 2.20 1.92 2.75   NA 2.34 2.94 3.86 3.55 3.06 3.44 3.43 3.39 2.93
## [113] 2.98 3.06 2.88 2.77 3.15 2.78 3.04 5.27 6.36 7.44 7.25 7.33 7.42 7.02
## [127] 6.76 6.92 7.05 6.36 6.86 5.84 5.77 5.43 5.22 5.10   NA   NA   NA   NA
## [141]   NA   NA 0.88 0.69 0.96 0.98 1.02 0.94 0.95   NA   NA   NA 2.59 2.85
## [155] 3.13 3.55 3.73 3.48 3.47 3.44 3.18 3.00 2.95 3.08 3.17 3.00 2.55 2.48
## [169] 2.53 2.53 5.67 4.90 5.35 5.80 5.62 6.16 6.28 7.11 6.79 6.84 7.17 7.58
## [183] 7.22 5.53 5.28 5.31 4.71 0.21 0.45 0.45 0.37 0.44 0.33 0.20 0.30 0.16
## [197] 0.55 0.00 0.00 0.00 0.65 0.49 0.40 0.41 0.81 0.98 0.98 1.02 1.18 1.36
## [211] 1.43 1.18 1.06 0.92 0.88 0.91 0.85 0.77 0.72 0.74 0.61 0.56 1.04 0.96
## [225] 1.01 1.10 1.20 1.91 1.98 2.14 2.26 2.17 2.15 2.18 2.00 1.87 1.72 1.58
## [239] 2.87 4.11 4.47 4.70 5.27 5.33 5.12 4.73 4.76 4.81 4.51 4.85 4.70 4.43
## [253] 4.24 3.94 4.07 0.29 2.21 2.58 2.42 2.61 2.78 2.88 2.52 2.50 2.36 1.84
## [267] 1.69 1.82 1.81 1.93 2.02 2.01 1.46 1.81 2.21 2.15 2.48 2.78 2.98 2.84
## [281] 2.90 2.75 2.64 2.65 2.74 2.85 2.66 2.12 2.30 3.85 4.09 3.73 3.75 3.92
## [295] 4.04 3.95 3.81 3.62 3.13 3.43 3.32 3.26 3.32 2.98 3.31 2.91 1.37 2.73
## [309] 2.78 2.76 2.91 3.01 2.69 2.48 2.57 2.88 2.73 2.77 2.75 2.68 3.13 3.10
## [323] 2.64 0.71 1.09 1.28 1.35 1.11 1.19 1.51 2.11 1.97 1.80 2.02 2.14 2.14
## [337] 2.31 2.24 2.26 2.26 0.21 0.22 0.12 0.06 0.08 0.04 0.72 1.18 0.27 1.10
## [351] 0.76 0.52 1.80 0.45 1.00 2.40 3.56   NA 2.21 1.60 2.07 1.97 1.62 1.63
## [365] 1.84 2.01 1.90 2.00 2.15 2.04 1.86 1.81 1.76 1.52 5.56 3.41 5.07 5.30
## [379] 6.04 5.62 6.32 7.11 7.25 6.67 6.56 6.93 6.70 6.29 5.47 4.09 3.28 0.01
## [393] 2.02 2.65 2.65 2.76 1.67 1.40 2.47 2.09 1.65 1.50 1.68 1.38 1.35 1.15
## [407] 0.74 0.59 3.11 3.52 3.65 3.73 4.04 3.77 4.08 4.36 3.54 3.93 4.07 4.41
## [421] 4.32 4.26 4.36 4.45 3.96 1.94 2.64 2.96 3.29 3.66 3.67 3.72 3.49 3.73
## [435] 3.33 3.98 3.70 3.85 3.50 3.27 3.30 2.85 0.44 0.38 0.50 0.65 0.69 0.80
## [449] 0.72 0.70 0.71 0.69 0.70 0.71 0.74 0.81 0.90 0.84 0.86 2.00 2.42 2.44
## [463] 2.36 2.43 2.48 2.59 2.58 2.54 2.35 2.32 2.20 2.11 2.18 2.24 2.16 2.17
## [477] 7.25 6.73 6.53 7.11 6.89 7.30 7.38 6.57 5.85 4.91 4.69 4.96 6.32 6.17
## [491] 4.49 4.44 3.41   NA   NA   NA   NA 1.20 1.62 2.50 2.62 3.02 2.27 1.89
## [505] 1.74 1.35 1.92 2.17 1.80 1.70   NA 2.30 2.13 2.15 0.44 0.03 0.93 1.97
## [519] 1.92 1.94 1.20 1.21 1.16 1.17 1.10 0.17 0.01 4.92 4.97 3.92 3.79 3.62
## [533] 3.98 4.52 4.59 4.15 4.26 4.05 3.84 3.95 4.05 3.79 3.78 3.18 2.00 1.89
## [547] 1.88 1.87 1.84 1.99 2.70 1.50 2.06 1.45 1.30 0.13 0.58 0.48 0.79 0.71
## [561] 2.15 5.67 3.79 3.33 3.08 5.03 4.90 4.87 5.82 6.40 5.91 5.68 5.26 4.67
## [575] 4.43 4.66 4.52 4.61   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [589]   NA   NA   NA   NA   NA   NA   NA 2.13 2.79 3.59 3.63 3.84 3.82 3.72
## [603] 3.54 3.94 3.58 3.59 4.02 3.91 3.83 3.39 3.55 3.49 2.33 3.60 3.95 4.11
## [617] 4.38 4.58 4.70 4.30 4.30 3.69 3.67 3.75 2.80 3.55 3.87 4.16 4.24 2.55
## [631] 2.41 2.46 2.56 2.66 2.57 2.28 1.91 1.76 1.48 1.48 1.38 1.33 1.17 0.97
## [645] 1.10 1.10 3.03 3.21 3.17 3.53 3.45 3.24 4.11 2.71 2.57 2.40 2.50 2.50
## [659] 2.49 2.36 2.37 2.28 1.77   NA 1.27   NA   NA   NA   NA   NA 2.25 2.31
## [673] 2.38 2.68 2.97 2.40 2.86 2.68 3.24 3.07   NA   NA   NA   NA   NA   NA
## [687]   NA   NA   NA 2.34 4.30 4.14 4.15 3.97 3.46 4.32 8.87   NA   NA   NA
## [701]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [715] 1.85 3.28 2.98 3.54 3.57 3.80 3.78 3.40 3.66 3.67 3.66 3.79 3.80 2.62
## [729] 2.77 2.99 2.78 1.58 2.30 2.72 2.85 2.87 3.00 3.34 3.25 2.98 2.86 2.91
## [743] 2.80 2.84 2.78 2.80 2.56 2.17 0.35 0.99 1.72 1.68 1.48 1.41 1.25 1.26
## [757] 1.20 1.21 1.08 1.13 1.18 1.12 1.24 0.96 0.93 0.00 0.07 0.03 0.08 0.03
## [771] 0.02 0.21 0.17 0.45 0.45 0.13 0.10 0.02 0.06 0.14 0.21 0.09 6.17 6.12
## [785] 6.54 5.96 6.20 5.66 5.49 6.14 6.86 6.09 6.59 5.77 5.47 5.64 5.57 5.04
## [799] 5.12 1.44 1.21 1.06 1.31 1.38 1.45 1.28 1.10 0.93 0.87 0.94 0.92 0.94
## [813] 0.93 0.80 0.83 0.82 0.47 2.00 1.30 1.96 2.06 2.01 2.32 3.18 3.35 3.61
## [827] 3.63 3.88 3.89 3.88 3.75 3.46 3.52 3.18 2.78 2.20 2.90 3.61 2.85 2.75
## [841] 2.32 2.32 2.16 2.09 2.15 2.14 2.10 2.11 2.02 1.89 0.11 0.09 0.19 0.74
## [855] 0.98 1.05 0.93 0.80 0.67 0.75 0.24 0.24 0.17 0.22 0.23 0.57 0.88

Let’s say we wanted to get the second value from that vector. Here’s where a subscript would be useful. Subscripts allow you to call specific elements from an object by specifying their position in brackets. For example, data_rulz[1] would produce only the first element from the vector data_rulz. Applying it to the example above, we could get that second value using the following code:

us_states$`eviction-rate`[2]
## [1] 2.08

If your object is a data frame, which has observations (rows) and variables (columns), we subscript using matrix notation [(observation), (variable)] (or [(row), (column)]; note the comma in both). For example, if I wanted to get the first row of our us_states data frame, I’d write:

us_states[1,]

We could do this in dplyr using the following code: us_states %>% slice(1).

If I wanted to get the values from the third column, I’d write:

us_states[,3]

We could do this in dplyr using the following code: us_states %>% select(3).

When calling variables, we can also call them by name (and even several at a time) with the following code:

us_states[,c("name", "eviction-rate")]

Note that we use the c() function to create a vector, and in that vector we specify strings that are equivalent to the variable names. Because we’re specifying strings in this construction—we’re operating outside of our dplyr functions—we do use quotation marks. We could do this in dplyr using the following code: us_states %>% select(name, `eviction-rate`).

We can also subscript based on conditions. For example, if I just wanted rows that had the name variable be equivalent to “Wyoming”, I could write:

us_states[us_states$name == "Wyoming",]

Note that the above code required us to reference the object of interest (us_states) twice. That’s the notation for base R, where you need to preface each reference to a variable with its object. The dplyr syntax doesn’t require us to do that because it assumes we’re only drawing information from the output we’ve piped (and thus specifies the object automatically for us). We could do this in dplyr using the following code: us_states %>% filter(name=="Wyoming").

Now, we’ve been using functions from dplyr to accomplish the same things as the above in an easier-to-read manner. However, subscripting can be useful when you want to pull out specific values using dplyr functions—though the syntax looks a little different, as we’ll see below.

Calculating percent changes

Data journalists are often interested in percent changes. For example, they might ask: how much higher is the murder rate in Town A (2.42) relative to Town B (2.11)? That calculation can be performed with a simple formula: ((Y2-Y1)/Y1)*100. For this example, we’ll specify Town A’s murder rate as Y2 and Town B’s as Y1 (since we are comparing A, the town of interest, relative to B, the town compared against).

((2.42-2.11)/2.11)*100
## [1] 14.69194

The murder rate is 14.7% higher in Town A.

We could manually fill out that formula every time, but if we wanted to have R calculate the percent change for us using data from a data frame (e.g., us_states), we could use subscripting to do that.

For example, let’s say we wanted to calculate the percent difference in the eviction rate between Massachusetts and Florida in 2011. We could do that with the following code:

us_states %>%
  filter(year==2011) %>%
  select(year, name, `eviction-rate`) %>%
  summarize(pct_chg=((`eviction-rate`[name=="Massachusetts"]-`eviction-rate`[name=="Florida"])/`eviction-rate`[name=="Florida"])*100)

What we did above is simpler than it looks: Drawing from the us_states data frame, we filtered in only observations from the year 2011, and selected the three variables of interest to us. Then, we summarized those data using our percent change formula (((Y2-Y1)/Y1)*100), specifying Massachusetts as Y2 and Florida as Y1. We called the value for Y2 by subscripting from a variable coming from the data frame that was output from the select() step.

Specifically, we called the value from the eviction-rate variable appearing in the row where the name variable was equivalent to “Massachusetts”. (Note that there was only one row matching that, which is why it worked.) We did the same thing to get Y1, only we changed the condition so that name had to be equivalent to “Florida”.

Lagged calculations

Sometimes, you want to calculate a percent change to every observation in your dataset. This is especially useful when you have temporal (time) data on regular intervals (e.g., every year).

For example, let’s say we wanted to calculate the three-year percent change in the eviction rate for each year (e.g., 2005-2008, 2006-2009, etc.) for Florida and Massachusetts. The first thing we need to do is filter out all the states but Florida and Massachusetts. Next, we’ll group by state so we can calculate the percent changes relative to the previous years for that state (and not other states). Then, we’ll order the data sequentially (to make sure the previous year is always the preceding row). Finally, we’ll tell R to perform our percent change calculation using information from the current row as Y2 and from three rows back as Y1.

To perform that last step, we’ll use the dplyr::lag() function. lag() allows us to look up a value from n rows back in object x with the syntax lag(x, n). For example, we can specify Y1 in the present example by specifying lag(`eviction-rate`, 3).
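
To see what lag() returns on its own, here is a toy illustration with a simple vector (not from our dataset):

lag(c(10, 20, 30, 40), 3)
## [1] NA NA NA 10

The first three positions have no value three slots back, so they come back as NA, just as the first rows of each state will below.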

Thus, our code would look like this:

us_states %>%
  filter(name=="Massachusetts" | name=="Florida") %>%
  select(year, name, `eviction-rate`) %>%
  group_by(name) %>%
  arrange(year, .by_group=TRUE) %>%
  mutate(evictions_pct_chg=((`eviction-rate`-lag(`eviction-rate`, 3))/lag(`eviction-rate`, 3))*100)

Note that the order of operations matters here. First, we want to make sure that we group our data so that we can run the mutation separately for each group (and not have one state’s information bleed into another’s). After that, we arrange the values by year, sorting from low to high only within each group (that’s the .by_group=TRUE argument), so that all the Florida rows and all the Massachusetts rows stay together.

We see NA values for the first three rows of Florida, which makes sense—we don’t have values from 1997 through 1999 to calculate percent changes for 2000 through 2002. In the case of Massachusetts, we have NA values for the first four rows because we don’t have an eviction rate for 2000.

Visualizing the changes

We can do some quick, exploratory visualization of those data by combining our lagged data frame with functions from ggplot2:

us_states %>%
  filter(name=="Massachusetts" | name=="Florida") %>%
  select(year, name, `eviction-rate`) %>%
  group_by(name) %>%
  arrange(year, .by_group=TRUE) %>%
  mutate(evictions_pct_chg=((`eviction-rate`-lag(`eviction-rate`, 3))/lag(`eviction-rate`, 3))*100) %>%
  ggplot(aes(x=year, y=evictions_pct_chg, color=name)) +
    geom_point() +
    geom_line() +
    scale_y_continuous(limits=c(-32, 32), breaks=seq(-30, 30, 5)) +
    scale_x_continuous(limits=c(2003, 2016), breaks=seq(2003, 2016, 1))
## Warning: Removed 7 rows containing missing values (geom_point).
## Warning: Removed 7 rows containing missing values (geom_path).

It is much easier to see the trajectories for those two states when we plot them onto a line graph. A table might be helpful for quickly identifying precise values but the visual helps us see the trend faster.

Calculating correlations

Data journalists are also often interested in looking at relationships between variables. Scatterplots are generally best for spotting potential relationships, but it is sometimes helpful to also quickly calculate a linear correlation using Pearson’s correlation coefficient to gauge the strength of the linear relationship. (Important note: Relationships can be curvilinear in nature. How to evaluate that is better covered in a statistics class.)
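
For instance, here is a quick exploratory scatterplot of the two variables we correlate below (population and rent-burden):

us_states %>%
  ggplot(aes(x=population, y=`rent-burden`)) +
    geom_point()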

To calculate a simple, linear correlation, we just use the cor() function, which is loaded automatically in R. With cor(), we just specify the two variables of interest as separate arguments (e.g., cor(dataset$variable1, dataset$variable2)).
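
As a sketch of that base R notation (if either variable contained missing values, we could also add cor()’s use="complete.obs" argument to drop them):

cor(us_states$population, us_states$`rent-burden`)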

Using our multi-step framework with dplyr, we can incorporate cor() into our summarize() function to calculate the relationship between population size and the rent-burden like so:

us_states %>%
  summarize(correlation=cor(population, `rent-burden`))

Note that because we used a dplyr function and had already piped a data frame, we didn’t have to re-specify the object name within cor() since dplyr functions assume we only want to draw from the piped information.

What the above tells us is that, when looking at every observation in our data frame, there is a positive (the r coefficient is positive) but weak (the r coefficient is less than +/-0.3) relationship between population size and the rent burden. Thus, states with larger populations tend to have somewhat higher rent burdens (and vice versa), though only weakly so.

The nice thing about using our dplyr multi-step workflow (instead of just running cor() separately) is that it allows us to apply filters that can help narrow our inquiry (correlation in recent years vs. many years ago) or to calculate the correlation for computed (mutated) variables.
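
For example, here is one sketch that narrows the correlation to more recent years (the year >= 2010 cutoff is arbitrary):

us_states %>%
  filter(year >= 2010) %>%
  summarize(correlation=cor(population, `rent-burden`))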

As a reminder, there is a big difference between correlation and causation. Even if we had found a strong correlation between population size and rent burden, the causal variable might be very different. (Put differently, it’s not the size of the population that causes a place to have a high rent burden but some other variable that is common in cities that have larger populations.)


Putting it into practice

Use the following dataset to answer the questions below:

library(tidyverse)
ma_cities <- read_csv("http://projects.rodrigozamith.com/datastorytelling/data/evictions_ma_cities.csv")

  1. Which Massachusetts city ending in “Town” has the biggest population in 2016?

  2. Produce a table (data frame) containing values for the variables of city name, year, population size, and rent burden, for any city with Amherst in the name in 2016.

  3. Calculate your own ‘eviction rate’ (per 1,000 residents) for each city in Massachusetts using information from the population and number of evictions variables. What was your calculated rate for Worcester in 2016? Is it equal to the eviction-rate variable already in the dataset? Please explain why it is or is not equal.

  4. Do our data support the following conclusion: “In 2014, there was a strong linear correlation between a city’s median property value and its eviction rate”? Why or why not?

  5. Among the Massachusetts cities with 100,000 or more people in it, which city saw the greatest year-to-year percent increase in evictions after 2011, and when did it happen? Which city saw the greatest year-to-year percent decrease and when did that happen?