5.7 Slicing and dicing your data set

Which winters had the most and least snow? How many winters had total snow of less than 2 feet or more than 6? Between 40 and 50 inches (close to an “average” Boston winter)? Data subsetting can answer basic questions like these.

This is one case where even for beginners, it’s worth learning more than one way to do a task in R. Filtering is one of the most important skills to know in R, so I’d strongly suggest not skipping or lightly skimming this section, even if it gets a litte dry. Instead, grab a cup of your favorite caffeinated beverage, follow along carefully, and actually try out some of the code yourself.

Now, let’s look at how to slice and dice.

Subset by some sort of condition. I’m going to show you two ways of doing this, and you can decide which one you like most.

I can tell you hands down which one I usually like the most: dplyr. I find Hadley Wickham’s dplyr package generally to be a more elegant and human-readable way of subsetting data than base R’s bracket notation. But even if you opt for dplyr, too, it’s still important to understand R’s bracket notation. Bracket notation can be used in a lot of different ways, including situations that dplyr might not cover.

So, top off your caffeinated beverage if needed, and try out some code.

1. The dplyr package. If I were stranded on an (Internet-limited) desert island and could only download one R package, this would be the one. Other journalists are fans as well; science reporter Peter Aldhous has called it a “game changing R package.”

dplyr is exceptionally well thought out, and allows you to chain commands in a human-readable format. Let’s take a look.

First, load the package with library("dplyr"). dplyr has its own head()-like function to examine the first few rows of a data frame, called glimpse() . It gives you a peek at a data frame similar to what you can see if you expand the data frame name in RStudio’s top right Environment panel:

glimpse(snowdata)

dplyr also has several main “verbs” (creator Hadley Wickham’s term) for dealing with data. The idea is that there are a few basic things we’re all likely to want to do when wrangling data, and each one gets its own dplyr verb/function. These are the main tasks:

  • Select certain rows based on a logical condition - dplyr’s filter(). If you’d like to see all rows where Boston had more than 100 inches of snow: filter(snowdata, Boston > 100). This says “filter the snowdata data frame to show only rows where the Boston column is greater than 100.” (Spoiler: There were only two out of 76. 110 inches turns out to be a lot of snow for a Boston winter.)

To check for one condition OR another condition, use the | symbol, which means or. filter(snowdata, Boston < 20 | Boston > 100), says “show rows in the snowdata data frame where the Boston column is less than 20 or greater than 100.”

To filter by one condition AND a second condition, you can use the & sign, such as filter(snowdata, Boston > 40 & Boston < 50). Actually, though, filter() defaults to “and” when there are multiple conditions. So, this code also works to find rows where Boston snow total is greater than 40 and less than 50: filter(snowdata, Boston > 40, Boston < 50).

  • Select certain rows based on row number - dplyr’s slice(). It uses the format myresults <- slice(snowdata, 60:76).

Note: In all these dplyr examples, the results need to be stored in a variable, such as mynewdf <- slice(mydf, 6:10) or mydf <- slice(mydf, 6:10), That second code modifies the existing data frame; the first one creates a new data frame. To save space, I’m going to leave the mydf <- assignment part out of the rest of these examples, but remember that slice(snowdata, 60:76) will just print out the results in your console; it won’t save them anywhere.

  • Sort the data - dplyr’s arrange(). Use arrange(dataframe, colname) to sort in ascending order and arrange(dataframe, desc(colname)) to sort in descending order. Sort the snowdata data frame from Boston highest to lowest snowfalls with arrange(snowdata, desc(Boston)). To sort by a second column in case there are ties in the first column, the syntax is arrange(dataframe, col1, col2).

  • Select certain columns - dplyr’s select(). You can select by specific column name, no quotes or c() needed: select(snowdata, Winter, Boston). Select a contiguous group of columns, such as starting with Boston and ending with New York City, with the syntax select(snowdata, Boston:NYC). You can select based on column names containing certain characters; for example, if you had a data frame with column names in the format city_state such as Boston_MA, Chicago_IL, NYC_NY, Fargo_ND and Syracuse_NY, you could select all the New York State entries using select(dataframe, contains("_NY")) or select(dataframe, ends_with("_NY"). You can delete columns by putting a minus sign before your selection, such as select(snowdata, -(Boston:Chicago)) or select(dataframe, -contains("_NY")).

select_if() lets you use is. functions such as is.numeric() or is.character() to choose columns by data type. If you want just the numeric columns in snowdata, for example: select_if(snowdata, is.numeric). You could then run the psych package’s describe() functions on those results, so only numeric columns remained:

Remember that filter() picks rows based on data values, but select() chooses columns based on the column names (not values of data within the columns).

dplyr also has functions for adding columns, creating summaries, splitting data into subgroups, merging two data frames, renaming columns, selecting highest and lowest by rank, and more. I’ll cover more powers of dplyr throughout the book. For now, let’s take a look at how to find some Chicago winters with low and high totals using dplyr’s filter() and select(), and then sort from highest to lowest totals.

In the code below, the first row creates a temporary variable with all columns in the data set, but filtered just for rows where Chicago snow totals are less than 2 feet (24 inches) or more than 5 feet (60 inches). The second row takes that temporary data frame and selects the Winter and Chicago columns; the third line sorts the chicagoextremes data frame by snow totals from highest to lowest, storing it back into the same data frame.

Having to copy the data into a new data frame each time you want to do something wastes memory and typing time.

But dplyr has a very elegant way to fix this. It uses the %>% “pipe” symbol that allows you to “pipe” the results of one function into a second function, without having to repeat the data frame name or store data in a temporary variable. This automatically lets you break down complex operations into simple, more manageable chunks. It also makes code a lot more readable.

Here’s the Chicago example, piped:

Piping is one of the most useful things you can learn in R, so it’s worth taking a detailed look at how this works. What this code above is saying is:

Line 1: The new variable chicagoextremes is assigned all the data in the snowdata data frame. dplyr then “pipes” that result into line 2.

Line 2: The chicagoextremes data frame is filtered, keeping only rows where Chicago snowfall is less than 24 or greater than 60. dplyr then pipes that result to line 3.

Line 3: dplyr takes those results and then keeps only the Winter and Chicago columns. Since there’s a %>% pipe at the end of this line also, dplyr sends this result to the next line.

Line 4: dplyr takes the results from line 4 and sorts the data frame by the Chicago column in descending order.

dplyr is incredibly powerful. If you’ve got, say, a spreadsheet of political donations by candidate, political party, and Zip code, you can use dplyr to easily group the data by zip code and party to find average and median donations for each. I’ll be covering dplyr’s group_by() function in Chapter 8.

2. Bracket notation: subsetting by row and column numbers. This is fairly simple subsetting, although you need to know the row and column numbers you want. If you happen to know that you want rows 60 to 76 in the second column of a data frame, you can store those in a new variable with newdata <- snowdata[60:76,2]. For bracket notation more generally, the syntax is mydata[rownums, colnums].

You can select the single value in the first row and second column with newdata <- snowdata[1,2]. Once again, it’s row number first, then column number. If you want all the columns, leave the column portion after the comma blank. So, selecting the entire first row would be newdata <- snowdata[1,].

To choose all the rows in the second column, leave the row portion before the comma blank, such as newdata <- snowdata[,2].

How would you select the second and third rows and the 3rd and 4th columns? newdata <- mydata[2:3,3:4].

To create a new data frame for Boston and New York data excluding the Chicago column, you’d want (non-sequential) columns 1, 2 and 4. You can select non-adjacent columns with the c() function: newdata <- snowdata[,c(1,2,4)]. In this case, it might be easier to just leave out column 3, which you do with a minus sign: newdata <- snowdata[,-3]. To delete multiple columns or rows, use the minus sign before the -c() function, such as newdata <- snowdata[,-c(2,4)] for removing the Boston and NYC columns.

You can also select columns by names, not just numbers. Unlike with dplyr’s select() function, in base R bracket notation, you need to put the column names in quotation marks.

snowdata[,c("Boston", "Chicago", "NYC")]

Bracket notation works for other types of R data, not just data frames. snowdata$Boston[6] will give you the 6th item in the snowdata Boston column, snowdata$Boston[-1] returns the full vector except for the first item, and snowdata$Boston[6:10] will give you items 6 through 10. (Because vectors don’t have rows and columns, there’s no need for a comma such as [6:10,] like you need for a data frame.)

Tip: If you’re trying out these slightly different commands in your own console, remember that you can pull up the previous line of code you ran by using the up arrow. Then, you can edit that line of code – for example, changing [6] to [6:10] – instead of typing the whole command again. Or, you can type commands in the top-left script window, hit control/command enter to run them, and then see the results in the console.

If you want the last row of the data frame but don’t want to use the actual row number `snowdata[76,], what else could you do?

Well, that last row, 76, equals the total number of rows in the data frame. Hopefully, you remember that you can get the number of rows in a data frame with nrow(snowdata). That means you can swap in nrow(snowdata) for a hard-coded row number. So in this case, snowdata[76,] is the same as snowdata[nrow(snowdata),].

If you find these nested functions somewhat unreadable, one option is to break that single line of code into two: Store the number of rows in a variable, and then use that variable to swap in for 76:

##       Winter Boston Chicago  NYC
## 76 2015-2016   36.2    31.2 32.1

If you want to be sure that one line of code gives you the same exact result as another – same values and same structure – you can do a test with R’s identical() function.

## [1] TRUE
## [1] TRUE
## [1] FALSE

The first two I compared do have identical results while the third pair does not (since the last row is 76, not 75).

As the second line of code above shows, it is possible to do calculations such as lastrow - 1 inside the brackets.

What if you want to pull a row by a specific Winter, such as the 2014-2015 Snowpocalypse? You can use `snowdata[snowdata$Winter == “2014-2015”,].

Did you notice the two equals signs in snowdata$Winter == "2014-2015? Reminder that’s not a printing error; it’s a very important point: When you’re testing whether one thing is equal to another thing, you have to use == and not =. It would be the same for dplyr: filter(snowdata, Winter == "2014-2015")

If you’ve programmed in another computer language, this probably isn’t news to you. If you’re new to coding, though, be warned that this can trip up beginning programmers. Imprint this in your mind: = performs an action. myvariable = 5 sets the value of myvariable to 5. == tests whether something is true: myvariable == 5 will tell you TRUE if myvariable equals 5 and FALSE if myvariable equals something else. (If myvariable doesn’t exist, you’ll get an error message).

We saw from the summary() and describe() functions what the lowest and highest snow totals were for each city. But what if we want to find the rows with those totals so we see can which winters they were?

R has functions to find minimum and maximum values: min() and max(). max(snowdata$Boston) will return the amount of snow during the Snowpocalypse, while min(snowdata$Boston) gets the amount of snow in Boston’s least snowiest winter. And, range(snowdata$Boston) will return both the lowest and highest values.

Can you think of how you’d pull the row with the most Boston snow? You’d want it to be equal to the highest snow total, which is max(snowdata$Boston). With dplyr, it’s filter(snowdata, Boston == max(Boston)). With bracket notation, it would be snowdata[snowdata$Boston == max(snowdata$Boston),]. I think the dplyr version is much more readable.

For finding most and least values, R also has functions which.max() and which.min() that give the location of the maximum and minimum values. which.min(snowdata$Boston) gives the index location of the lowest value (in this case number 72). So this would also work to pull the row from snowdata that has the lowest Boston winter snow total:

slice(snowdata, which.min(Boston))

This is how you could find Chicago winters with less than 2 feet of snow or more than 5 feet using bracket notation (with the second line sorting the results):

To read the first line, you have to work from the inside out, while dplyr lets you see each operation sequentially on its own line. (The dplyr code can in fact also be on one line, but the format I used earlier is fairly typical and much more readable.)

I’ll mostly be using dplyr syntax from here on, because I think it’s a lot easier to follow.