Practical R for Mass Communication and Journalism

4.6 Easy sample data

R comes with some built-in data sets that are easy to use if you want to play around with new functions or other programming techniques. They’re also used a lot by people teaching R, since instructors can be sure that all students are starting off with the same data in exactly the same format.

Type data() to see available built-in data sets in base R and whatever installed packages are currently loaded. data(package = .packages(all.available = TRUE)) from base R will display all possible data sets from packages that are installed in your system, whether or not they’re loaded into memory in your current working session.
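If you'd rather search that list than scroll through it, the object that data() returns can be inspected like any other. Here is a minimal sketch, assuming only base R; the variable name available is just an illustration and isn't used elsewhere in this chapter.

```r
# List the data sets that ship with base R's 'datasets' package.
# 'available' is an illustrative name, not something defined in this chapter.
available <- data(package = "datasets")$results

# Each row describes one data set; show the name and description of the first few.
head(available[, c("Item", "Title")])

# Check whether a particular data set is in the list.
"mtcars" %in% available[, "Item"]
```

The results element is a character matrix, so ordinary matrix indexing by column name works on it.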

You can get more information about a data set the same way you get help with functions: ?datasetname or help("datasetname"). mtcars and iris are among those I’ve seen used very often.

If you type mtcars, the entire mtcars data set will print out in your console. You can use the head() function to look at the first few rows (more on head() in the next chapter) with head(mtcars).
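A few other quick ways to size up a built-in data set without printing all of it are sketched here. All three functions are part of base R; the choice of three rows is arbitrary.

```r
# Show only the first three rows instead of the default six.
head(mtcars, n = 3)

# Number of rows and columns in the data set.
dim(mtcars)

# Column names, handy before deciding what to look at.
names(mtcars)
```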

You can store that data set in another variable if you want, with a format like:

cardata <- mtcars

Alternatively, running the data() function with the data set name, such as data(mtcars), loads the data set into your working environment.
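One possible reason to work on a copy (my assumption, not something this chapter requires) is that you can change the copy freely and still compare it with the original. A small sketch:

```r
# Work on a copy so experiments stay separate from the original data.
cardata <- mtcars

# Overwrite one value in the copy only (an arbitrary change for illustration).
cardata$mpg[1] <- 0

# The copy has changed ...
cardata$mpg[1]

# ... while mtcars still holds its original first value.
mtcars$mpg[1]
```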

One of the most interesting packages with sample data sets for journalists is the fivethirtyeight package, which has data from stories published on the FiveThirtyEight.com website. The package was created by several academics in consultation with FiveThirtyEight editors, and is designed to be a resource for teaching undergraduate statistics.

Pre-packaged data can be useful - and in some cases fun. In the real world, though, you may not be using data that’s quite so conveniently packaged.

4.6.1 Create a data frame manually within R

Chances are, you’ll often be dealing with data that starts off outside of R and you import from a spreadsheet, CSV file, API, or other source. But sometimes you might just want to type a small amount of data directly into R, or otherwise create a data frame manually. So let’s take a quick look at how that works.

R data frames are assembled column by column by default, not one row at a time. If you wanted to assemble a quick data frame of town election results, you could create a vector of candidate names, a second vector with their party affiliation, and then a vector of their vote totals:

candidates <- c("Smith", "Jones", "Write-ins", "Blanks")
party <- c("Democrat", "Republican", "", "")
votes <- c(15248, 16723, 230, 5234)

Remember not to use commas within your numbers, as you might in Excel.

To create a data frame from those columns, use the data.frame() function and the syntax data.frame(column1, column2, column3):

myresults <- data.frame(candidates, party, votes)

Check its structure with str():

str(myresults)

## 'data.frame':    4 obs. of  3 variables:
##  $ candidates: Factor w/ 4 levels "Blanks","Jones",..: 3 2 4 1
##  $ party     : Factor w/ 3 levels "","Democrat",..: 2 3 1 1
##  $ votes     : num  15248 16723 230 5234

While the candidates and party vectors are characters, the candidates and party data frame columns in the output above have been turned into a class of R objects called factors. (That is the default in versions of R before 4.0.0; from R 4.0.0 onward, data.frame() leaves character strings as they are unless you set stringsAsFactors = TRUE.) It’s a bit too in-the-weeds at this point to delve into how factors are different from characters, except to say that 1) factors can be useful if you want to order items in a certain, non-alphabetical way for graphing and other purposes, such as Poor is less than Fair is less than Good is less than Excellent; and 2) factors can behave differently than you might expect at times. I’d recommend sticking with character strings unless you have a good reason to specifically want factors.
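To make the Poor-to-Excellent example concrete, here is a minimal sketch of an ordered factor in base R. The ratings data and the variable names are made up for illustration.

```r
# Ratings as plain character strings sort alphabetically.
ratings <- c("Good", "Poor", "Excellent", "Fair", "Good")
sort(ratings)

# An ordered factor sorts by the level order given here instead.
rating_levels <- c("Poor", "Fair", "Good", "Excellent")
ratings_factor <- factor(ratings, levels = rating_levels, ordered = TRUE)
sort(ratings_factor)

# Comparisons follow the level order as well: is "Good" greater than "Poor"?
ratings_factor[1] > ratings_factor[2]
```

The levels argument is what sets the order; without it, factor() would fall back to sorting the levels alphabetically.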

You can keep your character strings intact when creating data frames by adding the argument stringsAsFactors = FALSE:

myresults <- data.frame(candidates, party, votes, stringsAsFactors = FALSE)
str(myresults)

## 'data.frame':    4 obs. of  3 variables:
##  $ candidates: chr  "Smith" "Jones" "Write-ins" "Blanks"
##  $ party     : chr  "Democrat" "Republican" "" ""
##  $ votes     : num  15248 16723 230 5234

Now, the values are what you expected.
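data.frame() also accepts name = value pairs, which lets the column names differ from the names of the vectors you built. The sketch below repeats the election vectors so it runs on its own; the column names candidate and total_votes and the object name named_results are my own choices for illustration.

```r
candidates <- c("Smith", "Jones", "Write-ins", "Blanks")
party <- c("Democrat", "Republican", "", "")
votes <- c(15248, 16723, 230, 5234)

# Give each column an explicit name with name = value pairs.
named_results <- data.frame(
  candidate = candidates,
  party = party,
  total_votes = votes,
  stringsAsFactors = FALSE
)

# Report the class of every column in one call.
sapply(named_results, class)
```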

There’s one more thing I need to warn you about when creating data frames this way. If one column is shorter than the other(s), R will repeat data from the shorter column whenever the longer length is a whole multiple of the shorter one - whether or not you want that to happen.

Say, for example, you created the election results columns for candidates and party but only entered votes results for Smith and Jones, not for Write-ins and Blanks. You might expect the data frame would show the other two entries as blank, but you’d be wrong. Try it and see, by creating a new votes vector with just two numbers, and using that new votes vector to create another data frame:

votes <- c(15248, 16723)
myresults2 <- data.frame(candidates, party, votes)
str(myresults2)

## 'data.frame':    4 obs. of  3 variables:
##  $ candidates: Factor w/ 4 levels "Blanks","Jones",..: 3 2 4 1
##  $ party     : Factor w/ 3 levels "","Democrat",..: 2 3 1 1
##  $ votes     : num  15248 16723 15248 16723

That’s right, R re-used the first two numbers, which is definitely not what you’d want. If you tried this with three numbers in the votes vector instead of two or four, R would throw an error. That’s because four rows is not a whole multiple of three, so each entry couldn’t be recycled the same number of times.
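One way to guard against silent recycling is to compare the column lengths yourself and, if values really are missing, pad the short vector with NA. This is a sketch of that idea in base R rather than a prescribed method; the names column_lengths, votes_padded, myresults_padded, three_votes and result are illustrative.

```r
candidates <- c("Smith", "Jones", "Write-ins", "Blanks")
party <- c("Democrat", "Republican", "", "")
votes <- c(15248, 16723)   # two values short

# Compare the lengths of all the would-be columns before building the data frame.
column_lengths <- lengths(list(candidates = candidates, party = party, votes = votes))
column_lengths
all_same_length <- length(unique(column_lengths)) == 1
all_same_length

# Pad the short vector with NA so the missing totals stay visibly missing.
votes_padded <- votes
length(votes_padded) <- length(candidates)

myresults_padded <- data.frame(candidates, party, votes = votes_padded,
                               stringsAsFactors = FALSE)
myresults_padded

# Three values cannot be recycled evenly into four rows, so data.frame()
# stops with an error; tryCatch() captures the message instead of halting.
three_votes <- c(15248, 16723, 230)
result <- tryCatch(
  data.frame(candidates, party, three_votes),
  error = function(e) conditionMessage(e)
)
result
```

Assigning to length() extends a vector with NA values, which keeps the gaps obvious instead of filling them with repeated numbers.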

If by now you’re thinking, “Why can’t I create data frames that don’t change strings into factors automatically?” and “Why do I have to worry about data frames re-using one column’s data if I forget to complete all the data?”, Hadley Wickham had the same thoughts. His tibble package creates an R class, also called tibble, that he says is a “modern take on data frames. They keep the features that have stood the test of time, and drop the features that used to be convenient but are now frustrating.”

If this appeals to you, install the tibble package if it’s not on your system and then try to create a tibble with

myresults3 <- tibble::tibble(candidates, party, votes)

and you’ll get an error message that the votes column needs to be either 4 items long or 1 item long (tibble() will repeat a single item as many times as needed, but it won’t recycle a vector of any other length).
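If you’d like to see that error without stopping a longer script, you can capture it with tryCatch(). The sketch below assumes the tibble package may or may not be installed, so it checks first; tibble_attempt is an illustrative name.

```r
candidates <- c("Smith", "Jones", "Write-ins", "Blanks")
party <- c("Democrat", "Republican", "", "")
votes <- c(15248, 16723)   # still only two values

# Only run the example if the tibble package is available on this system.
if (requireNamespace("tibble", quietly = TRUE)) {
  # Capture the error object instead of letting it halt the script.
  tibble_attempt <- tryCatch(
    tibble::tibble(candidates, party, votes),
    error = function(e) e
  )

  # TRUE here means tibble() refused to recycle the two values.
  print(inherits(tibble_attempt, "error"))

  # The message explains which column has the wrong length.
  print(conditionMessage(tibble_attempt))
}
```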

Put the votes column back to 4 entries if you’d like to create a tibble with this data:

library(tibble)
votes <- c(15248, 16723, 230, 5234)
myresults3 <- tibble(candidates, party, votes)
str(myresults3)

## Classes 'tbl_df', 'tbl' and 'data.frame':    4 obs. of  3 variables:
##  $ candidates: chr  "Smith" "Jones" "Write-ins" "Blanks"
##  $ party     : chr  "Democrat" "Republican" "" ""
##  $ votes     : num  15248 16723 230 5234

It looks similar to a data frame – in fact, it is a data frame, but with some special behaviors, such as how it prints. You’ll also notice that the candidates column is character strings, not factors.
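A quick way to confirm both points for yourself is sketched below. It assumes the tibble package is installed and skips the example otherwise.

```r
candidates <- c("Smith", "Jones", "Write-ins", "Blanks")
party <- c("Democrat", "Republican", "", "")
votes <- c(15248, 16723, 230, 5234)

# Only run the example if the tibble package is available on this system.
if (requireNamespace("tibble", quietly = TRUE)) {
  myresults3 <- tibble::tibble(candidates, party, votes)

  # A tibble still counts as a data frame.
  print(is.data.frame(myresults3))

  # The text columns stay as character strings.
  print(class(myresults3$candidates))

  # Printing shows the dimensions and each column's type along with the data.
  print(myresults3)
}
```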

If you like this behavior, go ahead and use tibbles. However, given how prevalent conventional data frames remain in R, it’s still important to know about their default behaviors.