Intro to R

Now that you’ve got your RStudio environment up and running and you’ve learned how to create an R Notebook, the next step is to start running some operations in R.

Mathematical Operations

R is an extremely powerful calculator. In R, you’ll use + and - for addition and subtraction. For multiplication and division, you’ll use * and /, respectively. You’ll use ^ for exponents and parentheses (( and )) to organize the order of operations. (Remember, Please Excuse My Dear Aunt Sally?)

Remember that you can create a code chunk by pressing Cmd+Option+I (Ctrl+Alt+I on Windows) and execute it by pressing Cmd+Shift+Enter (Ctrl+Shift+Enter on Windows). The result of all operations executed by that code chunk will appear immediately below it.

Here are some examples:

4+2

## [1] 6

4+2*6

## [1] 16

(4+2)*6

## [1] 36

(4+2)*6/2

## [1] 18

(4+2)*6/2^2

## [1] 9

(4+2)*(6/2)^2

## [1] 54

We probably won’t need to do a whole lot of math by hand like this. When we do, being mindful of the order of operations will be key.
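One order-of-operations detail worth seeing once: in R, the exponent operator ^ is evaluated before a leading minus sign. The short sketch below illustrates this; the numbers and the object names (without_parens, with_parens) are arbitrary and only there for illustration.

```r
# Exponentiation happens before the leading minus sign,
# so this is read as -(2^2)
without_parens <- -2^2
without_parens   # -4

# Parentheses force the negation to happen first
with_parens <- (-2)^2
with_parens      # 4
```

When in doubt, adding parentheses makes the intended order explicit.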

If you’re wondering what the [1] is in the output, just ignore it for now. I’ll explain it shortly.

Creating objects

One nice thing about working with R is that it allows you to store information in objects. We do this with the following syntax: object <- operation. What this tells R is: perform an operation and assign the output to a named object. (The left arrow <- is called the assignment operator.)

Let’s show this off by subtracting my favorite number (rodrigos_fave) from yours (my_fave).

rodrigos_fave <- 36
my_fave <- 666
my_fave - rodrigos_fave

## [1] 630

You’ll notice that in RStudio, two entries will have popped up under the Environment tab. These are objects that we can now call upon at any time, such as to subtract one object from the other.
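The result of an operation can itself be stored in an object and reused. Here is a minimal sketch of that idea; the object name difference is just an illustrative choice.

```r
rodrigos_fave <- 36
my_fave <- 666

# Store the result of the subtraction in a new object
difference <- my_fave - rodrigos_fave

# Typing an object's name prints its value
difference       # 630

# Stored objects can be used in later operations
difference / 2   # 315
```

After running this, a third entry (difference) should appear under the Environment tab.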

Data Types in R

R has a handful of different data types. We’ll cover these types as they come up but we’ll start with two very important ones. The first type is numeric and it refers to real or decimal numbers. The second is character and it covers text (strings). It’s important to understand that if an object is stored as a string, you cannot perform a mathematical operation on it.

Below is an illustration using the example above. Notice how the numbers stored in the object my_fave are wrapped in quotation marks. Quotation marks make things text (character data type).

rodrigos_fave <- 36
my_fave <- "666"
my_fave - rodrigos_fave

## Error in my_fave - rodrigos_fave: non-numeric argument to binary operator

Unsurprisingly, R will give us an error because we are trying to perform a mathematical operation (subtraction) using a non-numeric object.

However, we can often translate between data types (e.g., from numeric to character and vice versa). We just need to make sure we do so explicitly before running an operation.

Functions

One of the great things about R is that it gives us a multitude of functions we can use to perform myriad operations. Some of these functions come with R, others can be installed via optional packages (more on that shortly), and new functions can be created by us anytime (we may do this later in the course).

For example, we can check the data type of an object by using the str() function. All functions have a name and typically accept different arguments within the parentheses. The str() function requires us to specify the object we want to check.

Let’s do that for the two objects we defined:

str(rodrigos_fave)

##  num 36

str(my_fave)

##  chr "666"

Our code chunk gave us two lines of output, one for each operation we ran. The first line tells us rodrigos_fave is num (numerical). The second tells us my_fave is chr (character).

One nifty way to move between numerical and character types is to use the as.numeric() and as.character() functions. Here’s how we’d turn the value of my_fave from chr to num.

as.numeric(my_fave)

## [1] 666

Notice how the quotation marks are gone! We can confirm this by wrapping the as.numeric() function within the str() function to effectively perform two operations in one sequence:

str(as.numeric(my_fave))

##  num 666

Just as we expected, it becomes a num object.

However, it’s important to note here that our original my_fave object remains of chr type:

str(my_fave)

##  chr "666"

The reason is that we performed an operation on that object but never made any changes to it. If we want to permanently change it back into a num object, we’ll need to recreate the object. In the next line of code, we’ll assign the result of the as.numeric() operation back into our original object of my_fave. (We could also store it in a new object altogether, like my_fave_num.)

my_fave <- as.numeric(my_fave)
str(my_fave)

##  num 666

Now, it’s permanently altered!
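The conversion also works in the other direction with as.character(). The sketch below is a small illustration; the object name rodrigos_fave_chr is made up for this example.

```r
rodrigos_fave <- 36

# Convert the number to text; the stored value is now "36"
rodrigos_fave_chr <- as.character(rodrigos_fave)
str(rodrigos_fave_chr)

# Converting back to numeric recovers the original value
as.numeric(rodrigos_fave_chr) == rodrigos_fave
```

As before, the original rodrigos_fave object is untouched because the converted value was stored under a new name.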

Using the help system

You’ll find a Help tab in RStudio, usually in the bottom-right panel. After you click on it, you can look up any loaded function by using the search box with the magnifying glass on it. Just type the name of the function and the help page will describe it, list all the arguments it accepts, and provide some examples.

R’s help system is extremely useful! You’ll also find yourself Googling a lot and ending up on websites like Stack Overflow and Quora, in addition to different developer blogs.
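The help system can also be reached from code, which is handy when you are already typing in the console. The sketch below uses only functions that ship with R; mean and max are just example topics.

```r
# Open the help page for a function by name
help("mean")

# The question mark is a shorthand for the same thing
?mean

# Show only the argument list of a function
args(max)
```

In RStudio, the first two commands should open the page in the Help tab.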

Vectors

R also has different data structures for its objects. We’ll cover different structures later on as the need arises. One very important data structure to know about now is the vector.

Vectors allow us to store multiple values of the same data type into a single object. (We can’t mix numbers and text within a single vector. If there’s a single chr element in the vector, R will automatically make all the elements chr.)

We can create a vector using the c() function. With c(), each argument will be a different element that we’re adding to that object. (Each argument is separated by a comma.)

c(1, 5, 7, 5, 22)

## [1]  1  5  7  5 22

Notice how our output now shows us five numbers. Each number is a different element in the vector. Vectors are useful for a number of different operations.
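To see the "no mixing" rule from earlier in action, here is a small sketch with made-up values. The object name mixed_vector is hypothetical.

```r
# One text element among the numbers...
mixed_vector <- c(1, 5, "seven")

# ...turns every element into text
str(mixed_vector)
class(mixed_vector)   # "character"
```

Because the whole vector is now chr, mathematical operations on it would fail with the same kind of error we saw earlier.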

To illustrate, I’ll start by storing a vector with those same five values (note the slightly different order) into an object (rodrigos_vector):

rodrigos_vector <- c(1, 5, 7, 22, 5)
str(rodrigos_vector)

##  num [1:5] 1 5 7 22 5

First, take note that this is a numeric vector, as shown by the num. Second, we can see that there are five elements in our vector [1:5]. This is helpful because we can pick out specific elements within a vector by subscripting. (More on that later.) This should also explain the [1] in the earlier output: the results of our earlier mathematical operations were actually single-element vectors, and the [1] marks the position of the first element printed on that line.
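As a brief preview of subscripting, square brackets after an object's name pick out elements by position. This is only a minimal sketch.

```r
rodrigos_vector <- c(1, 5, 7, 22, 5)

# First element
rodrigos_vector[1]         # 1

# Fourth element
rodrigos_vector[4]         # 22

# First and fourth elements together (positions supplied as a vector)
rodrigos_vector[c(1, 4)]   # 1 22
```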

To illustrate the power of vectors, I can divide each element of the vector by 2 with just the following code:

rodrigos_vector/2

## [1]  0.5  2.5  3.5 11.0  2.5

More useful than that is the fact that I can pair a vector with functions like mean() and max() to take the mean from that sequence of numbers and identify the highest number within it.

mean(rodrigos_vector)

## [1] 8

max(rodrigos_vector)

## [1] 22
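A few other built-in functions work the same way: they take a vector and return a summary of it. The sketch below reuses the same values.

```r
rodrigos_vector <- c(1, 5, 7, 22, 5)

min(rodrigos_vector)      # smallest value: 1
sum(rodrigos_vector)      # total of all elements: 40
length(rodrigos_vector)   # number of elements: 5
median(rodrigos_vector)   # middle value once sorted: 5
```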

Data frames

We’ll be primarily working with data frames over the course of the semester. (Variations include data tables and tibbles. If you see those terms, they generally refer to the idea of a data frame.)

Data frames organize information into rows and columns, much like a spreadsheet you’d find in Microsoft Excel or Google Sheets. Typically, each column will refer to a different variable (e.g., name, age, major) and each row will refer to a different observation (e.g., a student or a company).

We can create a data frame within R at any point but now would be a good time to practice bringing a dataset into R. In order to do that, we’ll load a CSV file (comma-separated values) using the read_csv() function.
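For reference, here is roughly what building a small data frame by hand could look like with the data.frame() function that comes with R. The object name students and all of its values are invented for illustration.

```r
# Each argument becomes a column; each column is a vector of equal length
students <- data.frame(
  name  = c("Ana", "Ben", "Chi"),
  age   = c(19, 21, 20),
  major = c("Journalism", "Biology", "History")
)

# Three observations (rows) of three variables (columns)
str(students)
nrow(students)   # 3
ncol(students)   # 3
```

Typing everything in like this gets tedious quickly, which is why we usually import a file instead.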

Packages in R

One of the great things about R is that it is modular and has a huge community supporting it. What this means is that anyone can add functionality to R and share that functionality with the rest of the world. We call those additions packages, and they give us access to new functions.

We’re going to extensively use a small set of packages over the course of the semester. To use a package, we first need to install it. To install a package, click on Tools and Install Packages. (Unless I tell you otherwise, keep CRAN selected for the repository, don’t change the install directory, and make sure “install dependencies” is checked.)

Follow those steps to install our first (meta)package: tidyverse. (If you’ve already done this, the system will tell you it’s already installed.) tidyverse includes the readr package, which contains functions that help us bring CSV files into R neatly.

Installing is just the first step, though. You’ll always need to load a package when you want to use it, including every time you restart RStudio. You can do that with the library() function, inserting the name of the package you want to load (with or without quotation marks) in the parentheses.

For example, we can now load the tidyverse package (and, consequently, readr) by doing the following:

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.2.0     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0

## ── Conflicts ────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Oftentimes, nothing will appear to have happened when you execute the code to load a package. That’s because we haven’t used it for anything just yet. Sometimes, R will print a message, as is the case here. This message is fine: it’s just listing the different tidyverse packages being loaded (e.g., ggplot2, tibble, etc.) and telling us that there are a couple of small conflicts. We don’t need to worry about these for now.

Note: You only need to install a package once in R, but you’ll need to load it with every R Notebook you produce.

Going forward, I may point to a function as follows: readr::read_csv(). What this means is that we’ll use the read_csv() function that is part of the readr package. Thus, without loading readr (by itself or through tidyverse), you can’t call read_csv() by its short name.
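The package::function() notation is not specific to readr. As a sketch, the example below uses median() from the stats package, which ships with R and is loaded automatically, so both ways of calling it work.

```r
rodrigos_vector <- c(1, 5, 7, 22, 5)

# Fully qualified: the median() function from the stats package
stats::median(rodrigos_vector)   # 5

# Short form, available because stats is loaded by default
median(rodrigos_vector)          # 5
```

Writing the package name out is mostly a way to be explicit about where a function comes from.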

Back to data frames

Now that we’ve loaded readr, let’s open our first CSV file. We can load data locally (we’ve saved it on our computer) or remotely (we’re downloading it from a website). For most of our exercises, we’ll load data remotely.

Note: You’ll often want to keep a local copy of a dataset you use because you never know if someone will take it down or modify it without your knowledge. I’ll be managing the remote data for our exercises and Data Challenges, so you can count on it being okay.
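One way to keep a local copy is to write a data frame out to a CSV file and read it back later. The sketch below uses base R's write.csv() and read.csv() rather than readr so that it runs without any packages; the data frame example_data, its values, and the temporary file path are all made up for illustration.

```r
# A tiny stand-in data frame (hypothetical values)
example_data <- data.frame(
  student     = c("A", "B", "C"),
  songs_owned = c(3, 10, 5)
)

# Save it to a temporary CSV file; in practice you would pick your own path
local_path <- tempfile(fileext = ".csv")
write.csv(example_data, local_path, row.names = FALSE)

# Read the local copy back in
local_copy <- read.csv(local_path)
str(local_copy)
```

The readr package has its own counterpart, write_csv(), which pairs with read_csv().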

We’ll be loading data about my former students’ preferences when it comes to Ed Sheeran’s songs from http://projects.rodrigozamith.com/datastorytelling/data/ed_sheeran.csv and storing it in the object ed_sheeran.

ed_sheeran <- read_csv("http://projects.rodrigozamith.com/datastorytelling/data/ed_sheeran.csv")

## Parsed with column specification:
## cols(
##   student = col_character(),
##   favorite_song = col_character(),
##   total_songs_owned = col_double(),
##   favorite_song_plays = col_double(),
##   ed_sheeran_total_plays = col_character()
## )

We see from the output and the fact that ed_sheeran appears as an object in our Environment tab that the data were successfully imported. The Environment tab also shows that the object has six observations (rows) and five variables (columns).

You can often think of each column in the data frame as a different vector of information (information about the variable for each of the observations).

Let’s illustrate that by looking at the structure of our data frame with str():

str(ed_sheeran)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 6 obs. of  5 variables:
##  $ student               : chr  "Mohita" "Jalen" "Will" "Ben" ...
##  $ favorite_song         : chr  "Castle on the Hill" "Galway Girl" "Castle on the Hill" "Shape of You" ...
##  $ total_songs_owned     : num  3 56 5 6 3 10
##  $ favorite_song_plays   : num  10 273 17 15 33 19
##  $ ed_sheeran_total_plays: chr  "12" "979" "35" "Unknown" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   student = col_character(),
##   ..   favorite_song = col_character(),
##   ..   total_songs_owned = col_double(),
##   ..   favorite_song_plays = col_double(),
##   ..   ed_sheeran_total_plays = col_character()
##   .. )

The output will give us a lot of information but the key things to note for now are the variable names, which appear next to the dollar signs ($). Our data frame has five variables: student (chr), favorite_song (chr), total_songs_owned (num), favorite_song_plays (num), and ed_sheeran_total_plays (chr). We can also get a preview of the values for each variable.

If we’re in RStudio, we can view the contents of a data frame by clicking on the object name (ed_sheeran) under the Environment tab. That will automatically execute the following code:

View(ed_sheeran)

We can do this manually, too, by typing the code above into the console.

If we want to stay within the notebook, we can also just type the object name:

ed_sheeran

Congratulations! You have created a data frame using information imported from a CSV file!
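A few quick functions are handy for sizing up any data frame. The sketch below uses a stand-in object (the name ed_sheeran_sample is hypothetical) built from the two numeric columns shown in full in the str() output above, so it runs without downloading anything.

```r
# Stand-in with only the two fully visible numeric columns
ed_sheeran_sample <- data.frame(
  total_songs_owned   = c(3, 56, 5, 6, 3, 10),
  favorite_song_plays = c(10, 273, 17, 15, 33, 19)
)

dim(ed_sheeran_sample)       # rows, then columns: 6 2
names(ed_sheeran_sample)     # the variable names
head(ed_sheeran_sample, 3)   # just the first three rows
```

The same functions should work on the full ed_sheeran object.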

Working with variables

Oftentimes, we don’t need to work with all the data in a dataset, only portions that are of interest to us.

To view the observations related to a variable in a data frame, we access it with the following syntax: object$variable. For example, if we want to access the values in the total_songs_owned variable in the ed_sheeran data frame, we would type:

ed_sheeran$total_songs_owned

## [1]  3 56  5  6  3 10

Notice that we get a vector in our output. That vector contains all the values in our object for the chosen variable.
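Double square brackets with the variable name in quotation marks are another way to pull out the same vector. The sketch below uses a hypothetical stand-in data frame (songs_df) with the values printed above.

```r
# Stand-in data frame with the values shown in the output above
songs_df <- data.frame(total_songs_owned = c(3, 56, 5, 6, 3, 10))

# Dollar-sign access
songs_df$total_songs_owned

# Equivalent access with double brackets and a quoted name
songs_df[["total_songs_owned"]]

# Both return the very same vector
identical(songs_df$total_songs_owned, songs_df[["total_songs_owned"]])   # TRUE
```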

Combining variables and functions

Let’s start by taking the mean of the variable total_songs_owned. Recall that mean() requires us to feed it a vector with numbers from which the mean will be taken. So, we can simply use the following code:

mean(ed_sheeran$total_songs_owned)

## [1] 13.83333

Behold, the mean number of Ed Sheeran songs owned by the students in our sample is 13.8!
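To see where that number comes from, the mean can be rebuilt by hand as the sum of the values divided by how many there are. In this sketch, the vector songs_owned is a stand-in holding the six values printed earlier.

```r
# The six values of total_songs_owned shown in the earlier output
songs_owned <- c(3, 56, 5, 6, 3, 10)

sum(songs_owned)      # 83
length(songs_owned)   # 6

# Sum divided by count gives the same answer as mean()
sum(songs_owned) / length(songs_owned)
mean(songs_owned)
```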

Putting it into practice

See if you can answer the following questions:

What is the data type for the following three variables: total_songs_owned, favorite_song_plays, and ed_sheeran_total_plays?
What is the highest number of Ed Sheeran songs owned by a single student in the class? (Hint: Use the max() function.)
What is the mean number of times a student’s favorite Ed Sheeran song was played? Please calculate it to three decimal places. (Hint: You’ll want to pair the round() function with the function used to calculate means. Using the help system, look at the help document for round() to identify the argument used to specify the number of decimal places. That argument starts with a d.)
What is the most popular selection for the favorite song? (Hint: Use the table() function.)
What is the mean number of times a student played an Ed Sheeran song? If you get an error trying to calculate that, why do you think there was a problem?