5.5 Data summaries
During Boston’s 2014-15 Snowpocalypse, a lot of us became endlessly fascinated by snowfall data. (Fun fact: Boston got more snow in one 23-day period than Chicago got during its snowiest-ever entire winter.) But while 110.6 inches certainly sounds like a lot of snow, it’s hard to tell if that’s truly unexpected without knowing about amounts in other winters.
So let’s go back to snowfall data. This time we’ll look at data for three cities, not just Boston. You should now have the BostonChicagoNYCSnowfalls.csv file in your project’s data subdirectory. If not, find it at http://bit.ly/BosChiNYCsnowfalls. Import it into a data frame named snowdata, the name used in the rest of this section.
Don’t forget RStudio autocomplete! If you type data/Bost between quotation marks and pause to see the autocomplete options, you shouldn’t have to type the whole file name.
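Here is a minimal sketch of one way to do the import. It uses base R's read.csv(), which is an assumption on my part rather than necessarily the function this chapter intends. So that it runs anywhere, it first writes three of the rows shown in this section to a temporary file; in your project you would point at data/BostonChicagoNYCSnowfalls.csv instead.

```r
# Stand-in for the real file: three rows copied from this section,
# written to a temporary CSV so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Winter,Boston,Chicago,NYC",
  "1940-1941,47.8,52.5,39.0",
  "1941-1942,23.9,29.8,11.3",
  "1942-1943,45.7,45.2,29.5"
), csv_path)

# In the project itself, the path would be
# "data/BostonChicagoNYCSnowfalls.csv".
# stringsAsFactors = FALSE keeps Winter as character in older R versions.
snowdata <- read.csv(csv_path, stringsAsFactors = FALSE)
print(snowdata)
```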
One of the first things worth doing after importing a data set is looking at the first few rows, the last few rows, and a summary of some basic stats. R has functions to help with all three. head() will show you the first six rows of the data frame and tail() shows you the last six:
head(snowdata)
##      Winter Boston Chicago  NYC
## 1 1940-1941   47.8    52.5 39.0
## 2 1941-1942   23.9    29.8 11.3
## 3 1942-1943   45.7    45.2 29.5
## 4 1943-1944   27.7    24.0 23.8
## 5 1944-1945   59.2    34.9 27.1
## 6 1945-1946   50.8    23.9 31.4

tail(snowdata)
##       Winter Boston Chicago  NYC
## 71 2010-2011   81.0    57.9 61.9
## 72 2011-2012    9.3    19.8  7.4
## 73 2012-2013   63.4    30.1 26.1
## 74 2013-2014   58.9    82.0 57.4
## 75 2014-2015  110.6    50.7 50.3
## 76 2015-2016   36.2    31.2 32.1
If you’d like to change the number of rows shown, use the n argument. For example, head(snowdata, n=10) will show the first 10 rows.
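To see the n argument in action without the full file, this sketch builds a six-row stand-in for snowdata from the first rows printed above. The stand-in and the variable names first_two, last_two, and all_but_last are mine, for illustration only.

```r
# Six-row stand-in built from the head() output shown in this section.
snowdata <- data.frame(
  Winter  = c("1940-1941", "1941-1942", "1942-1943",
              "1943-1944", "1944-1945", "1945-1946"),
  Boston  = c(47.8, 23.9, 45.7, 27.7, 59.2, 50.8),
  Chicago = c(52.5, 29.8, 45.2, 24.0, 34.9, 23.9),
  NYC     = c(39.0, 11.3, 29.5, 23.8, 27.1, 31.4),
  stringsAsFactors = FALSE
)

first_two    <- head(snowdata, n = 2)   # first 2 rows
last_two     <- tail(snowdata, n = 2)   # last 2 rows
all_but_last <- head(snowdata, n = -1)  # negative n drops rows from the end

print(first_two)
print(last_two)
```

A negative n is handy when you suspect the last row or two of a spreadsheet export is a total or a footnote and want to see everything else.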
This gives you a good initial feel for the column structure and also helps you check whether data got garbled toward the end of a file (particularly with spreadsheet data, where final rows can be total rows or footnotes). str(snowdata) will show you that it’s a data frame of 76 rows and 4 columns: one character column and three numeric columns.
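As a quick illustration, here is str() run on a six-row stand-in built from the rows printed earlier, so it reports 6 observations rather than the 76 in the real file. The sapply() line is an extra I've added as a compact way to list just the column types.

```r
# Six-row stand-in built from the head() output shown in this section.
snowdata <- data.frame(
  Winter  = c("1940-1941", "1941-1942", "1942-1943",
              "1943-1944", "1944-1945", "1945-1946"),
  Boston  = c(47.8, 23.9, 45.7, 27.7, 59.2, 50.8),
  Chicago = c(52.5, 29.8, 45.2, 24.0, 34.9, 23.9),
  NYC     = c(39.0, 11.3, 29.5, 23.8, 27.1, 31.4),
  stringsAsFactors = FALSE
)

# Structure: object type, dimensions, and each column's type and first values.
str(snowdata)

# Just the class of each column, as a named character vector.
col_types <- sapply(snowdata, class)
print(col_types)
```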
Aside: If you’ve got a tibble instead of a plain data frame and it has a lot of columns, head() will only display data in columns that fit in your console window. The display just mentions the other columns that don’t fit. This is one area where Wickham and I disagree on desired default behavior. If this bugs you as well, you can turn the tibble back into a regular data frame by nesting as.data.frame() inside another function, such as head(as.data.frame(snowdata)).
Another difference between tibbles and base-R data frames: Typing the name of a data frame will print out the entire data set, at least until it reaches a maximum that’s set in your R global options (you can see that maximum in RStudio under Tools > Global Options > Code > Display). Wickham assumes that’s not what you want, so typing the name of a tibble shows only the first 10 rows.
There are a few other functions besides str() and class() that give you basic structural information about a data frame. dim(snowdata) will show you the number of rows and columns; nrow(snowdata) gives just the number of rows and ncol(snowdata) just the number of columns. In addition to rownames() and colnames() (or just names() for column names), dimnames(snowdata) will print out both row and column names in an R list.
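The sketch below runs these functions on a six-row stand-in built from the rows printed earlier, so the row count is 6 instead of 76. The stand-in and the variable names are mine.

```r
# Six-row stand-in built from the head() output shown in this section.
snowdata <- data.frame(
  Winter  = c("1940-1941", "1941-1942", "1942-1943",
              "1943-1944", "1944-1945", "1945-1946"),
  Boston  = c(47.8, 23.9, 45.7, 27.7, 59.2, 50.8),
  Chicago = c(52.5, 29.8, 45.2, 24.0, 34.9, 23.9),
  NYC     = c(39.0, 11.3, 29.5, 23.8, 27.1, 31.4),
  stringsAsFactors = FALSE
)

dims      <- dim(snowdata)       # rows, then columns
n_rows    <- nrow(snowdata)
n_cols    <- ncol(snowdata)
col_names <- names(snowdata)     # same as colnames() for a data frame
dn        <- dimnames(snowdata)  # list: row names first, column names second

print(dims)
print(col_names)
print(dn)
```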
If you need to check whether a column is numeric, the is.numeric() function does this:
## [1] TRUE
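For instance, here is is.numeric() applied to two columns of a six-row stand-in built from the rows printed earlier. Which column to test is my choice for illustration; the sapply() call at the end is an addition that checks every column at once.

```r
# Six-row stand-in built from the head() output shown in this section.
snowdata <- data.frame(
  Winter  = c("1940-1941", "1941-1942", "1942-1943",
              "1943-1944", "1944-1945", "1945-1946"),
  Boston  = c(47.8, 23.9, 45.7, 27.7, 59.2, 50.8),
  Chicago = c(52.5, 29.8, 45.2, 24.0, 34.9, 23.9),
  NYC     = c(39.0, 11.3, 29.5, 23.8, 27.1, 31.4),
  stringsAsFactors = FALSE
)

boston_is_numeric <- is.numeric(snowdata$Boston)  # snowfall totals
winter_is_numeric <- is.numeric(snowdata$Winter)  # labels like "1940-1941"

# One logical value per column.
numeric_cols <- sapply(snowdata, is.numeric)

print(boston_is_numeric)
print(winter_is_numeric)
print(numeric_cols)
```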
Type is. in the RStudio console, and you’ll see a dropdown list of similar functions such as is.data.frame(), is.character(), and is.vector().
If you need a different data type, such as the earlier example of zip codes that make more sense as characters than numbers, there are also as. functions to convert data from one type to another. as.character(20500) will turn that number into a character string; as.numeric("756") will turn that string into a number. The conversion has to be straightforward, though: as.numeric("1,798") returns NA with a warning unless you get rid of the comma first.
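A short sketch of those conversions follows. Using gsub() to strip the comma is one common base-R approach, not the only one, and the variable names are mine.

```r
zip_as_text <- as.character(20500)  # number to character string
num_from_text <- as.numeric("756")  # character string to number

# A comma makes the string unparseable: the result is NA, with a warning.
# suppressWarnings() is used here only to keep the example output quiet.
bad_conversion <- suppressWarnings(as.numeric("1,798"))

# Remove the comma first, then convert.
cleaned <- gsub(",", "", "1,798", fixed = TRUE)
good_conversion <- as.numeric(cleaned)

print(zip_as_text)
print(num_from_text)
print(bad_conversion)
print(good_conversion)
```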