Creating CSV Files and Charts (Part 2)

For this tutorial, we’ll be looking at presidential approval rating data for the previous 13 U.S. presidents. These data come from FiveThirtyEight, which estimates the approval ratings by modeling data from multiple polls. More information about how they do it may be found here.

By the end of the tutorial, we’ll have produced an interactive chart about President Trump’s approval numbers that looks like this:

Pre-Analysis

The first step, as usual, is to read in the data. We’ll first load the Trump approval data, using the readr::read_csv() function, and make sure it imported correctly with the head() function:

library(tidyverse)
trump_approval <- read_csv("http://projects.rodrigozamith.com/datastorytelling/data/trump_approval_topline.csv")
head(trump_approval)

For the historical data, FiveThirtyEight provides those data only as JSON files. This requires us to load the jsonlite package, which helps us read those data into a nice data frame using the jsonlite::read_json() function. Setting simplifyVector=TRUE tells read_json() to return a data frame rather than a nested list. We’ll again verify it worked using the head() function.

library(jsonlite)
historical_approval <- read_json("http://projects.rodrigozamith.com/datastorytelling/data/presidential_historical_approval.json", simplifyVector=TRUE)
head(historical_approval)
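To see what the simplifyVector argument changes, here is a minimal sketch that reads a small made-up JSON file both ways. The file contents and the names json_path, as_list, and as_df are assumptions for illustration, not FiveThirtyEight’s actual data:

```r
library(jsonlite)

# Write a tiny, made-up JSON file to a temporary location
json_path <- tempfile(fileext = ".json")
writeLines(
  '[{"president": "A", "days": 1, "approve_estimate": "50.1"},
    {"president": "B", "days": 2, "approve_estimate": "45.3"}]',
  json_path
)

# Default behavior: a nested list with one element per record
as_list <- read_json(json_path)

# simplifyVector = TRUE: the records are collapsed into a data frame
as_df <- read_json(json_path, simplifyVector = TRUE)

class(as_list)
class(as_df)
str(as_df)
```

Because the made-up approve_estimate values are quoted in the JSON, they arrive as character strings, which is the kind of thing str() makes easy to spot.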

Combining the data

The first thing we’ll want to do is combine those two datasets into a single one. They have different variables and the data were imported differently, so we’ll want to double-check a few things.

In the end, we know we just need a dataset that contains the following variables: president, days_in_office, and approval_rating.

Let’s get those variables from the trump_approval dataset and store them in a temporary data frame called trump_approval_tmp:

trump_approval_tmp <- trump_approval %>%
  filter(subgroup=="All polls") %>%
  mutate(
    days_in_office=as.numeric(as.Date(modeldate, format="%m/%d/%Y")-as.Date("01/19/2017", format="%m/%d/%Y")),
    approval_rating=approve_estimate
  ) %>%
  select(president, days_in_office, approval_rating)
head(trump_approval_tmp)

Note: Our original data didn’t have the number of days Trump has been in office. To calculate that, we first converted the modeldate variable to a date using the base::as.Date() function, then subtracted Obama’s last full day in office (1/19/2017) from that date, and finally converted the resulting time difference into a number. That number represents the days since Obama left office, or how many days Trump has been on the job.
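The date arithmetic in that note can be tried on its own. This is a small sketch with made-up model dates; the names day_zero, model_dates, and time_difference are just illustrative:

```r
# Obama's last full day in office, used as "day 0"
day_zero <- as.Date("01/19/2017", format = "%m/%d/%Y")

# Made-up model dates in the same month/day/year format
model_dates <- as.Date(c("1/20/2017", "1/23/2017", "3/16/2017"), format = "%m/%d/%Y")

# Subtracting two dates gives a "difftime" object measured in days;
# as.numeric() strips the units and leaves a plain number
time_difference <- model_dates - day_zero
days_in_office <- as.numeric(time_difference)

class(time_difference)
days_in_office
```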

Now, let’s do the same for the historical_approval data:

historical_approval_tmp <- historical_approval %>%
  mutate(days_in_office=days, approval_rating=as.numeric(approve_estimate)) %>%
  select(president, days_in_office, approval_rating)
head(historical_approval_tmp)

Note: The approve_estimate variable was a character vector, so we converted it to numbers using the base::as.numeric() function.

Now that we have consistent data frames, we can easily combine them using the dplyr::bind_rows() function, which stacks datasets using the common variable names.

approval_data <- bind_rows(historical_approval_tmp, trump_approval_tmp)
head(approval_data)
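As a quick illustration of how bind_rows() matches columns by name rather than by position, here is a sketch with two tiny made-up data frames (toy_a and toy_b are hypothetical) whose columns are in a different order:

```r
library(dplyr)

# Two made-up data frames with the same variables in a different order
toy_a <- tibble(president = "A", days_in_office = 1, approval_rating = 50)
toy_b <- tibble(approval_rating = 45, president = "B", days_in_office = 2)

# bind_rows() lines the columns up by name before stacking the rows
toy_combined <- bind_rows(toy_a, toy_b)
toy_combined
```

If a variable existed in only one of the two data frames, bind_rows() would keep it and fill the missing values with NA, which is why it helps to make the variable names consistent first.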

We can also de-clutter our workspace by deleting the data we no longer need using the rm() function, inserting the object names inside the parentheses:

rm(historical_approval, historical_approval_tmp, trump_approval, trump_approval_tmp)

Filtering the data

We’ll want to make sure we have a consistent starting point for all our presidents or our data may appear misleading. We can quickly check that by doing the following:

approval_data %>%
  group_by(president) %>%
  summarize(lowest_starting_day=min(days_in_office)) %>%
  arrange(desc(lowest_starting_day))

We thus see that our data for Harry Truman don’t begin until he was 55 days into his term. One way to deal with this problem is to say that we’re only going to look at approval ratings starting from eight weeks (56 days) in office, when citizens have had an opportunity to get a closer look at their new leader.

Next, we’ll want to have an end-date cut-off. We’ll likely just base this on President Trump, since he’s our main point of comparison, but let’s double-check that we didn’t have any ‘early departures’ in our data:

approval_data %>%
  group_by(president) %>%
  summarize(lowest_end_day=max(days_in_office)) %>%
  arrange(lowest_end_day)

Actually, it turns out Gerald Ford spent less time in office. We could either exclude Ford from our data or simply use his last day (day 895) as the cut-off point. Let’s proceed with the latter.

We thus filter the data accordingly, and make sure we have an equal number of data points for each president:

approval_data %>%
  filter(between(days_in_office, 56, 895)) %>%
  count(president) %>%
  arrange(n)
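Rather than comparing the counts by eye, we could also ask R whether they are all identical. This is a sketch on made-up data with a narrower window (days 56 to 58); the names toy_data, toy_counts, and all_equal are assumptions for illustration:

```r
library(dplyr)

# Made-up data: two presidents whose records start on different days
toy_data <- tibble(
  president = rep(c("A", "B"), each = 5),
  days_in_office = c(54:58, 55:59)
)

# between() is inclusive on both ends, so days 56, 57, and 58 are kept
toy_counts <- toy_data %>%
  filter(between(days_in_office, 56, 58)) %>%
  count(president)

# TRUE when every president has the same number of rows
all_equal <- n_distinct(toy_counts$n) == 1
all_equal
```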

Aggregating and reshaping data

Some tools prefer data in long form and others in wide form. This is one instance where we’ll need to convert our long-form data into wide form. To declutter our visualization, we might also want to aggregate our data by weeks.

First, let’s aggregate our data by weeks. We can do that by creating a new column called week that simply divides the number of days in office by 7, and then rounds down the resulting value (using the base::floor() function). We’ll reinsert those data into approval_data.

approval_data <- approval_data %>%
  filter(between(days_in_office, 56, 895)) %>%
  mutate(week=floor(days_in_office/7))
head(approval_data, 10)

Note: The first week in our data is week 8 (floor(56/7)) because we agreed earlier to drop everything before day 56.
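To make the week calculation concrete, here is a sketch with a few made-up day counts, including our two cut-off points (56 and 895):

```r
# A few made-up values for days in office
days_in_office <- c(56, 62, 63, 69, 70, 895)

# Dividing by 7 and rounding down assigns each day to a week
week <- floor(days_in_office / 7)
week

# How many of the made-up days fall into each week
table(week)
```

Days 56 through 62 all land in week 8, days 63 through 69 in week 9, and so on.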

Now, we can easily calculate the mean rating for that week by grouping by the combination of president and week, and using the summarize() function to calculate the mean (which we’ll store in the mean_approval_rating variable). This time, we’ll insert the data into an object called approval_data_weeks since we’re discarding some of the original data:

approval_data_weeks <- approval_data %>%
  group_by(president, week) %>%
  summarize(mean_approval_rating=mean(approval_rating))
head(approval_data_weeks, 10)

Now, we can quickly transform the data into wide format using the spread() function and store it in a new object called approval_data_weeks_wide:

approval_data_weeks_wide <- approval_data_weeks %>%
  rename(Week=week) %>%
  spread(Week, mean_approval_rating, sep=" ")
head(approval_data_weeks_wide)

Note: To make the data frame, and eventually the visualization, look prettier, we changed the name of the variable to “Week” (title case) and used the sep argument in spread() to add a space between the key name (Week) and its value (the week number).
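Newer releases of tidyr (1.0.0 and later) mark spread() as superseded in favor of pivot_wider(). If you prefer the newer function, a rough equivalent might look like the sketch below. It uses made-up data (toy_weeks is hypothetical) and the names_prefix argument to produce the same “Week 8”-style column names:

```r
library(dplyr)
library(tidyr)

# Made-up weekly means in long form
toy_weeks <- tibble(
  president = c("A", "A", "B", "B"),
  week = c(8, 9, 8, 9),
  mean_approval_rating = c(50, 51, 40, 42)
)

# One row per president, one column per week
toy_weeks_wide <- toy_weeks %>%
  pivot_wider(
    names_from = week,
    values_from = mean_approval_rating,
    names_prefix = "Week "
  )
toy_weeks_wide
```

One difference to keep in mind: pivot_wider() creates the new columns in the order the weeks first appear in the data, so it can help to sort by week beforehand.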

Joining data

Next, we’ll be able to add some faces to our visualization by linking to image files corresponding to each president. We can get these faces from the Presidents of the USA website. One clean way to do this is to create a new dataframe (president_pictures) with two variables: president (with names corresponding to our existing dataset) and picture, and then combine those data with our existing data (approval_data_weeks_wide) using the dplyr::full_join() function.

president_pictures <- data.frame(
  president=c(
    "Barack Obama", "Bill Clinton", "Donald Trump", "Dwight D. Eisenhower",
    "George Bush", "George W. Bush", "Gerald R. Ford", "Harry S. Truman",
    "Jimmy Carter", "John Fitzgerald Kennedy", "Lyndon Baines Johnson",
    "Richard Milhous Nixon", "Ronald Reagan"
  ),
  picture=c(
    "https://presidenstory.com/usimag/phot2/obama.jpg",
    "https://presidenstory.com/usimag/phot2/clinton.jpg",
    "https://presidenstory.com/usimag/phot2/donald-trump.jpg",
    "https://presidenstory.com/usimag/phot2/ike.jpg",
    "https://presidenstory.com/usimag/phot2/bush.jpg",
    "https://presidenstory.com/usimag/phot2/wbush.jpg",
    "https://presidenstory.com/usimag/phot2/ford.jpg",
    "https://presidenstory.com/usimag/phot2/truman.jpg",
    "https://presidenstory.com/usimag/phot2/carter.jpg",
    "https://presidenstory.com/usimag/phot2/jfk.jpg",
    "https://presidenstory.com/usimag/phot2/johnson.jpg",
    "https://presidenstory.com/usimag/phot2/nixon.jpg",
    "https://presidenstory.com/usimag/phot2/reagan.jpg"
  )
)

approval_data_weeks_wide <- full_join(president_pictures, approval_data_weeks_wide, by="president")
rm(president_pictures)
head(approval_data_weeks_wide)
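Because full_join() keeps rows from both data frames even when a name has no match, a misspelled president would quietly show up as an extra row full of NA values. One way to catch that is dplyr::anti_join(), which returns the rows that have no match. Here is a sketch with made-up data (toy_pictures and toy_ratings are hypothetical):

```r
library(dplyr)

# Made-up lookup table and ratings; "C" and "D" have no counterpart
toy_pictures <- tibble(
  president = c("A", "B", "C"),
  picture = c("a.jpg", "b.jpg", "c.jpg")
)
toy_ratings <- tibble(
  president = c("A", "B", "D"),
  `Week 8` = c(50, 40, 45)
)

# full_join() keeps all four presidents, filling the gaps with NA
toy_joined <- full_join(toy_pictures, toy_ratings, by = "president")
toy_joined

# Rows in the ratings that did not find a picture
toy_unmatched <- anti_join(toy_ratings, toy_pictures, by = "president")
toy_unmatched
```

If the unmatched data frame comes back empty for your real data, every president in the ratings found a picture.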

Producing a CSV file

Next, we’ll want to make it such that President Trump is the last item in our data frame. This is because most data visualization tools will stack rows on top of each other, so if there are any overlaps (e.g., overlapping lines), the last row will be drawn on top. We can just bind two subsets of our data frame to quickly accomplish that.

approval_data_weeks_wide <- approval_data_weeks_wide %>% filter(president!="Donald Trump") %>%
  bind_rows(approval_data_weeks_wide %>% filter(president=="Donald Trump"))

The first filter() call excludes the row for Donald Trump, and the second keeps only the row for Donald Trump. The result is the original data frame with Trump’s row moved to the end.
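Here is the same reordering on a made-up three-row data frame (toy_wide is hypothetical), alongside an alternative that sorts on a TRUE/FALSE test instead. Since FALSE sorts before TRUE, the row where the test is TRUE ends up last. Treat the second version as a sketch rather than a replacement for the approach above:

```r
library(dplyr)

# Made-up wide data with Trump in the middle
toy_wide <- tibble(
  president = c("Barack Obama", "Donald Trump", "Bill Clinton"),
  `Week 8` = c(60, 44, 53)
)

# Approach used above: everyone else first, then Trump
reordered_bind <- toy_wide %>%
  filter(president != "Donald Trump") %>%
  bind_rows(toy_wide %>% filter(president == "Donald Trump"))

# Alternative: sort on a logical test (FALSE rows come before TRUE rows)
reordered_arrange <- toy_wide %>%
  arrange(president == "Donald Trump")

reordered_bind$president
reordered_arrange$president
```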

Finally, we can create a CSV file using the readr::write_csv() function, which we can bring into a data visualization tool. The write_csv() function requires us to provide it just two arguments: the object (data frame) we’d like to export and the filename of the CSV file. The CSV file will be saved in your project directory, unless you specify a different path.

approval_data_weeks_wide %>%
  write_csv("approval_data.csv")

You can download the CSV file you just created from RStudio Cloud by selecting the file and using RStudio Cloud’s export functionality. If you’re using RStudio Desktop, you can find the project directory where the file was saved by running the getwd() function in the console or as a separate line in your code.
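Before uploading the file anywhere, it can be worth reading it back in to confirm the export looks right. This sketch does a round trip with a made-up data frame and a temporary file; toy_export, csv_path, and reimported are illustrative names:

```r
library(readr)
library(tibble)

# Made-up wide data resembling what we export
toy_export <- tibble(
  president = c("A", "B"),
  picture = c("a.jpg", "b.jpg"),
  `Week 8` = c(50.5, 40.25)
)

# Write to a temporary file, then read it straight back in
csv_path <- file.path(tempdir(), "approval_data_check.csv")
write_csv(toy_export, csv_path)
reimported <- suppressMessages(read_csv(csv_path))

# The column names (including the space in "Week 8") should survive
names(reimported)
nrow(reimported)
```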

Creating an interactive plot online

A newer but quite powerful tool for creating interactive data visualizations is Flourish. Flourish allows you to create a range of different charts and specify a lot of options—more than many of its competitors. It is also designed with newsrooms in mind, and you can find Flourish charts in outlets like The Boston Globe.

The first step is to create an account with Flourish. As is the case with many online visualization tools, Flourish provides you with a limited free tier and more feature-loaded paid tiers. The free tier will be good enough for our purposes.

After you create your account, you should be presented with a dashboard that looks like this:

Click on the “New” button to create a new chart. Then, since we want to create an animated line chart, select the “Simple” chart under the “Line chart race” section.

A sample chart will appear. First, let’s title our visualization by entering some text in the box near the top of the page (e.g., “Presidential Approval Numbers”).

Then, click the “Data” button to upload your dataset. Click the “Import your data” button and select the CSV file we just created. Import it publicly. Set the name column to “A”, the image column to “B”, and the Score columns to cover the rest of the columns. (As of writing, those are Columns C-DG.)

When your image looks similar to the above, hit “Preview,” select “Scores” under “Scoring Type,” and begin to play around with the different settings.

Here are some settings to consider:

Disabling the ranks/scores toggle (Controls)
Manually specifying colors, to highlight certain individuals (Colours)
- You can do this with the custom overrides textbox. There, you can write “Donald Trump: red” and, on a new line, “Barack Obama: gray” to change the colors for those two people. Repeat the gray for everyone else to highlight Trump.
Changing the line widths and shading (Line styles)
Resizing the circles (Circle styles)
Showing the candidates’ names only on hover
Rescaling the Y axis to cover the range of approval ratings (Y axis)
Speeding up the animation of the chart (Animation)
Changing the title and accompanying text (Header)
- Make the title the most interesting point made by the data in the chart, as it pertains to the story. Then, use the text to briefly detail that take-home point.
Link to the source (Footer)
- List the source and link to the data. Also explain any decisions you made with the data.
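Typing a “Name: color” line for every president in the custom overrides textbox gets tedious, so one option is to let R generate the text and then paste it in. This is a sketch that assumes the “Name: color” format described above; the short presidents vector is made up for illustration and could be swapped for approval_data_weeks_wide$president:

```r
# A shortened, made-up list of names for illustration
presidents <- c("Barack Obama", "Bill Clinton", "Donald Trump")

# Red for Trump, gray for everyone else
line_colors <- ifelse(presidents == "Donald Trump", "red", "gray")

# Build one "Name: color" line per president
overrides <- paste0(presidents, ": ", line_colors)

# Print the lines so they can be copied into the textbox
cat(overrides, sep = "\n")
```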

Other data visualization tools

There are several other easy-to-use tools for visualizing data that provide free tiers. These include:

Plot.ly: We’ve already used a library created by the Plot.ly team to add interactivity to our exploratory ggplots (via the plotly library). However, they also offer a powerful browser-based tool akin to Infogram, though their free tier is a bit more restrictive.
Datawrapper: This is one of the more commonly used tools for smaller newsrooms, though students have found it to be a bit more limited than Infogram and Plotly. It will give you some customizability options the others lack, however.
Infogram: This is a popular web-based data visualization tool that allows you to create infographics containing multiple different chart types within a single visualization.
Tableau: Though it started as a tool aimed at newsrooms, Tableau has evolved to become more of a scientific visualization platform. It has some of the more advanced visualization features, such as highlighting statistical uncertainty. However, its visualizations tend to be less visually appealing and the tool is more complicated to learn.

Additionally, you can create highly customizable static (non-interactive) visualizations using the following tools:

ggplot2: We started using ggplot2 for exploratory purposes but it can be extended to produce some very appealing data visualizations. However, tuning it just the way you like it can get complicated, so this is a better option for advanced R users. Here’s a guide that gets into some of the more advanced syntax.
Adobe Illustrator: Adobe Illustrator has long been, and is likely to continue to be for the foreseeable future, one of the most commonly used data visualization design tools because of its customizability and familiarity among graphic designers. It has a steep learning curve, however. Sometimes, designers will begin their data visualization in R (using ggplot2), export the visualization as a vector object that can be imported into Adobe Illustrator, and put the finishing touches on the visualization using Illustrator.