Creating CSV Files and Charts

For this tutorial, we’ll be looking at eviction data aggregated at the state level for (most) U.S. states. These data come from the Eviction Lab, which is led by Matthew Desmond. As always, it is helpful to review the data dictionary. If you want more detail about any of the variables, I encourage you to review the full Methodology Report.

By the end of the tutorial, we’ll have produced a simple interactive chart about eviction rates in the Midwest that looks like this:

Pre-Analysis

Load the data

The first step, as usual, is to read in the data. Like before, we’ll use the readr::read_csv() function and make sure it imported correctly with the head() function:

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.2.0     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0

## ── Conflicts ────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

all_states <- read_csv("http://projects.rodrigozamith.com/datastorytelling/data/evictions_us_states.csv")

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   GEOID = col_character(),
##   name = col_character(),
##   `parent-location` = col_character()
## )

## See spec(...) for full column specifications.

head(all_states)

As a reminder, the tidyverse package loads readr, dplyr, and some of our other commonly used packages.

Getting the data we need

The first thing we’ll want to do is extract only the information we need for creating the chart. While including additional data can be harmless, it can sometimes (a) confuse the software being used to create the chart; (b) make it unwieldy to select options in that software; and (c) exceed the software’s dataset size limits, especially if you’re on a free tier.

In our case, we only need data from the Midwestern states around Michigan and between the years 2006 and 2016. Additionally, we only need data from three variables to cover the X axis (year), Y axis (eviction-rate), and the legend (name).

We can thus use a combination of the dplyr::filter() and dplyr::select() functions.

plot_data <- all_states %>%
  filter(between(year, 2006, 2016) & name %in% c("Illinois", "Indiana", "Michigan", "Ohio", "Wisconsin")) %>%
  select(year, name, `eviction-rate`)

We introduced two new wrinkles with that code. The first is the dplyr::between() function, which gives us a shortcut for assessing whether a value is greater than or equal to the second argument and less than or equal to the third. (That is, it is functionally equivalent to filter(year >= 2006 & year <= 2016), but shorter.)
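To make that equivalence concrete, here is a minimal sketch that runs both versions on a small stand-in data frame. The `toy` object is hypothetical and is not part of the Eviction Lab file.

```r
library(dplyr)

# Hypothetical stand-in data: one row per year (not the Eviction Lab file)
toy <- data.frame(year = 2004:2018)

# Shortcut: between() includes both endpoints
with_between <- toy %>% filter(between(year, 2006, 2016))

# Longer form: two comparisons joined with &
with_compare <- toy %>% filter(year >= 2006 & year <= 2016)

# Both approaches keep exactly the same years
identical(with_between$year, with_compare$year)
range(with_between$year)
```

Because both endpoints are included, the years 2006 and 2016 themselves are kept.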

The second is the %in% operator, which allows us to match against multiple strings within a single, simple filter() statement. It checks whether the value for the variable name appears in the vector we specify with the c() function. If there is a match with any element in that vector, the observation is included (filtered in). If there is no match, it is excluded (filtered out).
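If it helps to see %in% on its own, the sketch below applies it first to a plain vector and then inside filter(). The `toy` data frame and its `id` column are hypothetical and exist only for illustration.

```r
library(dplyr)

midwest <- c("Illinois", "Indiana", "Michigan", "Ohio", "Wisconsin")

# %in% returns one TRUE/FALSE per element on its left-hand side
matches <- c("Michigan", "Texas", "Ohio") %in% midwest
matches

# Matching is exact and case-sensitive, so this is FALSE
"michigan" %in% midwest

# Hypothetical stand-in data (not the Eviction Lab file)
toy <- data.frame(
  id = 1:4,
  name = c("Michigan", "Texas", "Ohio", "Maine"),
  stringsAsFactors = FALSE
)

# Only rows whose name appears in the midwest vector are kept
kept <- toy %>% filter(name %in% midwest)
kept
```

Since the match is exact, it is worth checking the spelling and capitalization of the names in your data before filtering.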

Producing a CSV file

If we want to get our data out of R, we’ll need to export it. The readr package makes it easy for us to produce a properly formatted CSV file with its write_csv() function.

That function requires us to provide it just two arguments: the object (data frame) we’d like to export and the filename of the CSV file. The CSV file will be saved in your project directory, unless you specify a different path.

write_csv(plot_data, "plot_data.csv")
#write_csv(plot_data, "~/Desktop/plot_data.csv") # Alternative way of expressing where to save the file, which would place it on my Desktop if I'm using a Mac

You can download the CSV file you just created from RStudio Cloud by selecting the file and using RStudio Cloud’s export functionality. If you’re using RStudio Desktop, you can check your project directory by entering the getwd() function into the console or as a separate line in your code.
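As a quick sanity check, the sketch below writes a small data frame to a temporary folder and confirms the file landed where expected. The `toy` object and the file name `toy_plot_data.csv` are hypothetical. With your own data, you would use your data frame and a path inside your project instead.

```r
library(readr)

# Hypothetical stand-in data (not the Eviction Lab file)
toy <- data.frame(
  year = c(2006, 2007),
  name = c("Michigan", "Ohio"),
  stringsAsFactors = FALSE
)

# Build a full path so there is no doubt about where the file is saved
out_path <- file.path(tempdir(), "toy_plot_data.csv")
write_csv(toy, out_path)

# Confirm the file exists and peek at its raw contents
file.exists(out_path)
csv_lines <- readLines(out_path)
csv_lines

# A relative filename like "plot_data.csv" is saved in this directory
getwd()
```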

Creating an interactive plot online

Creating a plot with Infogram

One popular tool for creating interactive data visualizations is Infogram. Infogram allows us to create a single chart or compile multiple visualizations into a single interactive infographic.

For this example, we’ll use a single chart. (You’ll often want to produce multiple individual charts for a news story, so you can embed them at different points of the story. However, there are times when an infographic is better to quickly compare different kinds of data.)

The first step is to create an account with Infogram. As is the case with many online visualization tools, Infogram provides you with a limited free tier and more feature-loaded paid tiers. The free tier will be good enough for our purposes.

After you create your account, you should be presented with a dashboard that looks like this:

Click on the “Infographics” icon to create a new chart. Then, select the “Blank template” since we’re familiarizing ourselves with the program.

A screen will pop up asking us to name our project. We can give it any name that’s useful for our organizational purposes, such as “Michigan’s Steep Drop”. (That name will appear in the URL, so be thoughtful.) Because we’re using the free tier of Infogram, select the “Public” option.

After advancing from that screen, you’ll be presented with Infogram’s chart-building tool.

The first thing we’ll want to do is click on the “Add text” icon on the left menu and select a “Title” heading. When we click on it, a text box will appear that allows us to give our chart a title. Then, select “Add text” again, and select the “Body text” option, where we can describe something interesting about the chart.

Second, you’ll want to add a visual element—in our case, a line chart. Click on “Add chart” and select the “Line” option. You’ll see a basic line chart auto-populated with some irrelevant data.

Once we have those basic elements in place, we’ll want to click on the line graph. A slight blue border should appear around it and some new options will appear on the right bar. Click the “Edit data” tab. It will open up a spreadsheet-like viewer like the one below:

We can either manually enter the data or just upload the CSV file we’ve just created. Let’s go with the latter option. Select the blue “Upload file…” button and find the CSV file we just created (plot_data.csv).

You may find that your chart doesn’t look particularly helpful, though:

Going from long to wide

That’s because Infogram likes its source data to look a certain way. In fact, you’ll find that different tools have different expectations, which may require you to reshape your data to fit those expectations. It’s frustrating but, thankfully, easily solved with R.

Specifically, Infogram expects the CSV to contain data that are in wide (as opposed to long) format for a line chart.

We can quickly reshape our data with the tidyr::spread() function like so:

plot_data_wide <- plot_data %>%
  spread(name, `eviction-rate`)

What we’re doing with the spread() function is telling R to keep the year variable untouched, make each unique value in the name variable a separate variable (column), and assign the associated value stored in the eviction-rate variable as the value for the newly created variables.

This is easier to show. Here’s our earlier data frame, which stores the data in long format.

head(plot_data)

Here’s our new data frame, which stores it in wide format.

head(plot_data_wide)

This is the inverse process from what we’ve previously done using the tidyr::gather() function.
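The round trip between the two shapes can be sketched on a tiny example. The values below are made up purely for illustration and are not real eviction rates.

```r
library(tidyr)

# Made-up values for illustration only (not real eviction rates)
toy_long <- data.frame(
  year = c(2006, 2006, 2007, 2007),
  name = c("Michigan", "Ohio", "Michigan", "Ohio"),
  `eviction-rate` = c(1.5, 2.5, 3.5, 4.5),
  check.names = FALSE,       # keep the hyphen in the column name
  stringsAsFactors = FALSE
)

# Long to wide: one row per year, one column per state
toy_wide <- spread(toy_long, name, `eviction-rate`)
names(toy_wide)
dim(toy_wide)

# Wide to long: gather every column except year back into two columns
toy_long_again <- gather(toy_wide, key = "name", value = "eviction-rate", -year)
dim(toy_long_again)
```

The wide version has one row per year, while the long version has one row per year-state pair.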

We can export the wide-format data using the old line of code, updating the object name (plot_data_wide) and the file name (plot_data_wide.csv):

write_csv(plot_data_wide, "plot_data_wide.csv")
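If you are running a newer installation, note that tidyr 1.0.0 and later offer pivot_wider() as the successor to spread(). It is not available in the tidyr 0.8.3 release shown in the package messages above. The sketch below shows the same reshape using made-up values. `toy_long` is a hypothetical stand-in for plot_data.

```r
library(tidyr)

# Made-up values for illustration only (not real eviction rates)
toy_long <- data.frame(
  year = c(2006, 2006, 2007, 2007),
  name = c("Michigan", "Ohio", "Michigan", "Ohio"),
  `eviction-rate` = c(1.5, 2.5, 3.5, 4.5),
  check.names = FALSE,       # keep the hyphen in the column name
  stringsAsFactors = FALSE
)

# names_from: the column whose values become new column names
# values_from: the column whose values fill those new columns
toy_wide <- pivot_wider(
  toy_long,
  names_from = name,
  values_from = `eviction-rate`
)
names(toy_wide)
nrow(toy_wide)
```

The named arguments make it a little easier to remember which column becomes the headers and which supplies the cell values.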

Back to our plot

Click the blue “Upload file…” icon again and reupload the wide-format CSV file we just created (plot_data_wide.csv). The default chart should look much nicer now!

From there, we can toggle the “Settings” tab and customize our chart. Some of my suggested customizations for this chart include:

Downloadable data: Yes
Grid: All
Y-axis title: “Eviction Rate (per 100 occupied rental households)”
Y-axis range: 0-8

Adding labels

The chart is just one piece of a journalistic data visualization. The surrounding context is also essential.

When writing a title, my recommendation is to make it the most interesting point made by the data in the chart, as it pertains to the story. Then, use the text below it to briefly detail that take-home point.

For example, if the story was about eviction rates in the Midwest, I might title the chart: “Michigan’s Steep Drop”. Then, I’d add the following subheading: “Michigan’s eviction rate, one of the highest at the start of the Great Recession, was more than halved between 2008 and 2016, bringing it in line with its regional neighbors. Experts attribute the decline to … .” This helps guide the viewer’s attention and reduces their cognitive load, making them feel like they can quickly get the main point of the visual.

At the bottom, you may also add a note. This can include explanatory text (e.g., if there are some important missing data) as well as the source of the data. We can do this by clicking on “Add text” and selecting “Caption text”. Then, we can write something like: “Source: The Eviction Lab”. We can hyperlink “The Eviction Lab” by selecting that part of the text, and on the right menu, going down to the “Add link” option. You can also use that right menu to change the font face, size, style, etc.

Tweaking colors

Colors can offer important visual cues in data visualizations, and the default options aren’t always the best ones. In this case, we want to draw the viewer’s attention to the state of Michigan. It thus makes sense to give that state a familiar color (like the University of Michigan’s primary color) and make sure it stands out from the rest by surrounding it with more muted or neutral colors. (An alternative would be to select each state’s primary color.)

You can change the colors of the lines by selecting the line chart and tweaking the entries under “Color” on the right bar.

Adding charts to an infographic

Infogram allows you to add more charts (or blocks of text) to your infographic by clicking on the icons on the left vertical menu. This can be useful for making comparisons, especially when it makes sense to present data using different types of charts.

In our case, we just need to resize the height of the infographic by selecting the slider at the bottom and dragging it to the bottom of our final object (the source text label).

Once we’ve done that, we can click “Share” so others can see our work.

Other data visualization tools

There are several other easy-to-use tools for visualizing data that provide free tiers at the time of writing. These include:

Plot.ly: We’ve already used a library created by the Plot.ly team to add interactivity to our exploratory ggplots (via the plotly library). However, they also offer a powerful browser-based tool akin to Infogram, though their free tier is a bit more restrictive.
Datawrapper: This is one of the more commonly used tools for smaller newsrooms, though students have found it to be a bit more limited than Infogram and Plotly. It will give you some customizability options the others lack, however.
Flourish: This is a newer web-based data visualization tool that has been getting quite a bit of traction among news organizations. It allows you to produce some data visualizations that the other tools do not offer. It has a free tier akin to Infogram’s that you can use.
Tableau: Though it started as a tool aimed at newsrooms, Tableau has evolved to become more of a scientific visualization platform. It has some of the more advanced visualization features, such as ways to highlight statistical uncertainty. However, its visualizations tend to be less visually appealing and the tool is more complicated to learn.

Additionally, you can create highly customizable static (non-interactive) visualizations using the following tools:

ggplot2: We started using ggplot2 for exploratory purposes but it can be extended to produce some very appealing data visualizations. However, tuning it just the way you like it can get complicated, so this is a better option for advanced R users. Here’s a guide that gets into some of the more advanced syntax.
Adobe Illustrator: Adobe Illustrator has long been, and is likely to continue to be for the foreseeable future, one of the most commonly used data visualization design tools because of its customizability and familiarity among graphic designers. It has a steep learning curve, however. Sometimes, designers will begin their data visualization in R (using ggplot2), export the visualization as a vector object that can be imported into Adobe Illustrator, and put the finishing touches on the visualization using Illustrator.
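For a sense of what the ggplot2 route might look like, here is a minimal sketch of a static line chart that highlights one state and mutes the others, then saves a vector PDF that a designer could open in Illustrator. The data values are made up, the hex color is an assumed stand-in for a Michigan-style blue, and the output file name is hypothetical.

```r
library(ggplot2)

# Made-up values for illustration only (not real eviction rates)
toy <- data.frame(
  year = rep(2006:2008, times = 2),
  name = rep(c("Michigan", "Ohio"), each = 3),
  rate = c(3, 2.5, 2, 1.5, 1.5, 1.5),
  stringsAsFactors = FALSE
)

p <- ggplot(toy, aes(x = year, y = rate, color = name)) +
  geom_line() +
  # Highlight one state with a strong color (assumed hex value)
  # and give the comparison state a muted gray
  scale_color_manual(values = c(Michigan = "#00274C", Ohio = "grey70")) +
  scale_x_continuous(breaks = 2006:2008) +
  labs(x = "Year", y = "Rate (made-up values)", color = NULL) +
  theme_minimal()

# Save as PDF, a vector format, in a temporary folder
chart_path <- file.path(tempdir(), "toy_chart.pdf")
ggsave(chart_path, plot = p, width = 6, height = 4)
file.exists(chart_path)
```

With real data, you would replace `toy` with your own long-format data frame (the shape ggplot2 expects) and extend the named color vector to cover every state.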