For this tutorial, we’ll be looking at some eviction data aggregated at the county level for the state of Florida. These data come from the Eviction Lab, which is led by Matthew Desmond. As always, it is helpful to review the data dictionary. If you want more detail about any of the variables, I encourage you to review the full Methodology Report.
By the end of the tutorial, we’ll have produced an interactive map about eviction rates in Florida that looks like this:
The first step is to read in the data that we’d like to map. We’ll start by using county-level evictions data for all U.S. states. Like before, we’ll use the `readr::read_csv()` function and make sure it imported correctly with the `head()` function:
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.0 ✔ purrr 0.3.2
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
us_counties <- read_csv("http://projects.rodrigozamith.com/datastorytelling/data/evictions_us_counties.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## GEOID = col_character(),
## name = col_character(),
## `parent-location` = col_character()
## )
## See spec(...) for full column specifications.
head(us_counties)
As a reminder, the `tidyverse` package loads `readr`, `dplyr`, and some of our other commonly used packages.
The first thing we’ll want to do is extract only the information we need for creating the map. While it can be harmless to include additional data, it can sometimes (a) confuse the software being used to create the chart; (b) make it unwieldy to select options using those software; and (c) exceed the dataset size limitations of the software, especially if you’re on a free tier.
The first thing we’ll do is to filter out all the states (`parent-location`) besides Florida, since that’s the focus of our visualization. We also just want data for the year 2016.
We can use the `dplyr::filter()` function to include only the observations (rows) we’re interested in. Since we’ll continue to work with these data, we’ll assign them to an object called `fl_data`.
fl_data <- us_counties %>%
filter(`parent-location` == "Florida" & year == 2016)
fl_data
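It can be worth a quick sanity check to confirm the filter worked as intended. This is a hedged sketch (it assumes the dataset contains one row per county per year; Florida has 67 counties, so we’d expect at most 67 rows):

```r
# Every remaining row should be from Florida in 2016
fl_data %>%
  distinct(`parent-location`, year)   # should show a single combination

# At most one row per county (Florida has 67 counties)
fl_data %>%
  summarise(n_counties = n_distinct(name))
```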
The second thing we’ll want to do is think ahead to all the variables we will need to draw geographical information from and to fill in our captions with (e.g., when hovering over areas of the map). Again, in the interest of reducing the size of our dataset (and reducing the likelihood of a problem with the third-party tools), we want to select just the variables that we need.
The first variable is `GEOID`. If you look at the data dictionary, you’ll see that `GEOID` corresponds to the location’s FIPS code. Briefly, FIPS is a standardized code used by the U.S. government to link together locations across datasets. When referring to counties, you’ll see a five-digit code like `12001`. The first two digits (`12`) refer to the state, in this case Florida. The following three digits (`001`) refer to the county, in this case Alachua County.
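As a quick illustration of that structure, we can split a FIPS code into its state and county parts with base R’s `substr()`. (This is just a sketch to show how the code is composed; it isn’t part of the tutorial’s pipeline.)

```r
fips <- "12001"                     # Alachua County, Florida

state_code  <- substr(fips, 1, 2)   # first two digits: the state
county_code <- substr(fips, 3, 5)   # last three digits: the county

state_code   # "12" (Florida)
county_code  # "001" (Alachua County)
```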
The second to fourth variables are `name` (the county’s name), `eviction-rate` (the eviction rate, which we will use for shading), and `evictions` (the total number of evictions, which we’ll include in the information box when the user hovers over a county).
We can use the `dplyr::select()` function to select those four columns.
map_data <- fl_data %>%
select(GEOID, name, `eviction-rate`, evictions) %>%
na.omit()
We use the `na.omit()` function to remove any rows that have an `NA` value in them, meaning any one variable (e.g., `eviction-rate`) has a missing value. (This may happen in our dataset because there were insufficient data for the Eviction Lab team to confidently generalize to the county level.) This ensures that we only map counties for which we have data. Datawrapper in particular can be confused when presented with non-numeric values, like `NA`, for certain variables.
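If you’d rather drop rows only when a specific variable is missing (rather than when any variable is), `tidyr::drop_na()` lets you target particular columns. This is a sketch of an alternative, not what the tutorial itself uses:

```r
# Alternative: drop rows only when eviction-rate is missing,
# keeping rows where some other column happens to be NA
map_data_alt <- fl_data %>%
  select(GEOID, name, `eviction-rate`, evictions) %>%
  drop_na(`eviction-rate`)
```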
Here’s what our dataset now looks like:
map_data
If we want to get our data out of R, we’ll need to export it. The `readr` package makes it easy for us to produce a properly formatted CSV file with its `write_csv()` function.
That function requires us to provide it just two arguments: the object (data frame) we’d like to export and the filename of the CSV file. The CSV file will be saved in your project directory, unless you specify a different path.
map_data %>%
write_csv("map_data.csv")
#write_csv(map_data, "~/Desktop/map_data.csv") # Here's an alternative way of expressing where to save the file, which would place it on my Desktop if I'm using macOS
You can download the CSV file you just created from RStudio Cloud by selecting the file and using RStudio Cloud’s export functionality. If you’re using RStudio Desktop, you can check your project directory by entering the `getwd()` function into the console or as a separate line in your code.
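A quick way to verify the export from the console (a small sketch using base R functions):

```r
getwd()                      # prints your current working directory
file.exists("map_data.csv")  # TRUE if the export succeeded
list.files()                 # map_data.csv should appear among your project files
```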
One simple tool for creating interactive maps is Datawrapper. Datawrapper is used by several (smaller) newsrooms for producing a range of different visualizations (and is a good alternative to either Infogram or Flourish).
The first step is to create an account with Datawrapper. As is the case with many online visualization tools, Datawrapper provides you with a limited free tier and more feature-loaded paid tiers. The free tier will be good enough for our purposes.
You can click on the Create an Account/Login link on Datawrapper’s homepage and sign up with just a few details.
After you create and activate your account, you should be presented with a welcome page. Look for the “New Map” link at the top right part of the page.
You will then be presented with different options for maps. Today, we’ll be creating a choropleth map, where areas on the map are shaded according to some corresponding value (i.e., the eviction rate).
After selecting that option, you’ll be presented with different geographies for your map. In our case, we only have data for the state of Florida, so we’ll want to select that as our geography. You can either select it from the list or simply search for “Florida”. Because we have county-level data with county-level geographical identifiers (the FIPS code), we’ll select the `USA >> Florida >> Counties` option and then `Next`.
There is also an option to upload your own geography, which is necessary if you’re using less-used geographical markers like school district boundaries (or custom maps). This requires uploading a separate file with shape information and goes beyond the scope of this tutorial.
We now need to add in the data for our map. Datawrapper allows us to manually fill in values for each geographical marker associated with the selected geography (e.g., counties in the Florida Counties map). However, since we already have a clean data file with the values we need, we can just upload that file instead.
You can do that by scrolling to the bottom of the table and clicking on the `Import your dataset` button.
Datawrapper will tell us that we need a column in our dataset that specifies a corresponding geographical identifier. This can be either a “Name” column that matches Datawrapper’s expectations (e.g., “Alachua” for Alachua County) or a “FIPS” column that matches the U.S. government’s standard for counties. We have information for the latter, under the `GEOID` column, so we can just select `Start Import`.
While Datawrapper gives us the option of copying and pasting the information into a table, we’re better off just uploading our clean CSV file. (It increases the likelihood of a clean import.) Click on the link to `upload a CSV-file`. Then, select the CSV file we just created (`map_data.csv` above).
After selecting the file, the table will be updated to look like the one below. Datawrapper will also ask us to select the column that contains the FIPS codes. Make sure the first column (`GEOID`) is selected and click `Next`.
Once the data is imported, click `Okay, Continue`. You’ll then be asked to select the variable that will be used for shading the map. Select the `eviction-rate` variable, as that is the number that is most comparable across counties since it is proportional to the county’s population. (We can change this variable later.) Then, click `Next`.
With the data now added in, we can click `Proceed` at the bottom of our table.
You will then be presented with the design options for the map:
Play around with these options to find what suits you best. Note that there are tabs for `Refine` (select map options), `Annotate` (add in text), and `Design` (design options, which are limited for the free tier). For the map displayed at the start of the tutorial, the following options were selected: the tooltip Title is `{{ name }}` and the Body is `Eviction Rate: {{ eviction_rate }}<br>Number of Evictions: {{ evictions }}`.
You can use HTML tags to format your tooltip. For example, we use `<br>` to insert a line break.
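For instance, a slightly fancier tooltip Body might combine a few common HTML tags. (This is an illustrative sketch; the column names inside the double curly braces must match the columns Datawrapper detects in your dataset.)

```html
<b>Eviction Rate:</b> {{ eviction_rate }}<br>
<b>Number of Evictions:</b> {{ evictions }}
```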
When you’re finished editing your map, click the `Publish` button at the bottom of the design options.
This will take you to a new screen that allows you to `Publish chart`. Click on that icon to generate a final chart. (You can later revise and republish it.)
You will then be provided with links to the chart for sharing and embedding the map.
Voila! You’ve created and are now able to share and embed a professional-looking map.
There are several other easy-to-use tools for generating maps that provide free tiers. These include:
Carto: Carto is a very powerful web-based tool that focuses on creating maps. It offers more functionality than Datawrapper or any of the tools below, but they recently did away with their free tier option. However, students can get free premium access to Carto through the GitHub Student Pack. If you’re serious about mapping, this is a good tool to learn.
Flourish: This is a newer web-based data visualization tool that has been getting quite a bit of traction among news organizations. It allows you to produce interesting maps, in addition to some data visualizations that the other tools do not offer. It has a good free tier.
Plot.ly: We’ve already used a library created by the Plot.ly team to add interactivity to our exploratory ggplots (via the `plotly` library). However, they also offer a powerful browser-based tool that permits the creation of maps (in addition to other charts). That said, their free tier is a bit more restrictive than its competitors’.
Infogram: This is a good tool that we’ve already covered when creating interactive charts. While it offers some free mapping functionality (the entire United States), its specialty offerings (like state-level county maps) require a subscription.
Tableau: Though it started as a tool aimed at newsrooms, Tableau has evolved to become more of a scientific visualization platform. It has some of the more advanced visualization features, and includes a mapping component. However, their visualizations tend to be less visually appealing and the tool is more complicated to learn.
Additionally, you can create highly customizable static (non-interactive) visualizations using the following tools:
ggplot2: We started using `ggplot2` for exploratory purposes, but it can be extended to produce some very appealing maps. However, tuning it just the way you like can get complicated, so this is a better option for advanced R users. Here’s a guide that gets into the mapping aspects of ggplot2.
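As a rough illustration, a static Florida county map could start like this. (This is a sketch, not part of the tutorial; it assumes the `maps` package is installed, which supplies the county outlines for `ggplot2::map_data()`. Joining in the eviction data would additionally require matching on county names.)

```r
library(tidyverse)

# County outlines for Florida, via the maps package
# (namespaced to avoid clashing with our map_data object from earlier)
fl_shapes <- ggplot2::map_data("county", region = "florida")

ggplot(fl_shapes, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "grey90", color = "white") +
  coord_quickmap() +   # keeps the aspect ratio roughly correct
  theme_void()
```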
Adobe Illustrator: Adobe Illustrator has long been one of the most commonly used data visualization design tools, and is likely to remain so, because of its customizability and its familiarity among graphic designers. It has a steep learning curve, however. Sometimes, designers will begin their data visualization in R (using `ggplot2`), export the visualization as a vector object that can be imported into Adobe Illustrator, and put the finishing touches on the visualization using Illustrator.