4.5 What’s a data frame? And what can you do with one?

rio imports a spreadsheet or CSV file as an R data frame. How do you know whether you’ve got a data frame? In the case of snowdata, class(snowdata) returns the class, or type, of object it is. str(snowdata) also tells you the class and adds a bit more information. Much of the info you see with str() is similar to what we saw in the RStudio environment pane: snowdata has 76 observations (rows) and 2 variables (columns).
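As a minimal sketch of those two checks, the block below builds a small stand-in data frame instead of importing the real file. The object name snow_example, the Winter column name, and every value in it are invented for illustration; only the Total column name comes from the text.

```r
# Stand-in for snowdata: Winter and all values are made up for this example
snow_example <- data.frame(
  Winter = c("2001-2002", "2002-2003", "2003-2004"),
  Total  = c(15.1, 70.9, 39.4)
)

# class() reports what kind of object this is
class(snow_example)

# str() reports the class plus the dimensions and each column's type
str(snow_example)
```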

I mentioned in Chapter 1 that data frames are somewhat like spreadsheets. However, data frames are more structured. Each column in a data frame is an R vector, which means that every item in a column has to be the same data type. One column can be all numbers and another column can be all strings, but within the column, data has to be consistent.

If you’ve got a data frame column with the values 5, 7, 4, and “value to come,” R will not simply be unhappy and give you an error. Instead, it will “coerce” all your values to be the same data type. Since “value to come” can’t be turned into a number, 5, 7 and 4 will end up being turned into character strings of “5”, “7”, and “4”. This isn’t usually what you want, so it’s important to be aware of what type of data is within each column. One stray character string value in a column of 1,000 numbers can turn the whole thing into characters. If you want numbers, make sure you have them!
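The coercion described above can be seen with a plain vector, which is what a data frame column is underneath. This is only a sketch; the object name mixed is arbitrary.

```r
# Three numbers plus one string: R converts everything to character
mixed <- c(5, 7, 4, "value to come")

class(mixed)  # "character"
mixed         # "5" "7" "4" "value to come"
```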

R does have a way of referring to missing data that won’t screw up the rest of your columns: NA means not available.
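A short sketch of the difference, using an arbitrary vector named with_na: swapping the stray string for NA leaves the column numeric. The na.rm argument shown at the end is one common way to skip missing values in a calculation and is included here only as an illustration.

```r
# NA marks the missing value without changing the column's type
with_na <- c(5, 7, 4, NA)

class(with_na)               # "numeric"
mean(with_na)                # NA, because one value is unknown
mean(with_na, na.rm = TRUE)  # average of the values that are present
```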

Data frames are rectangular: Each row has to have the same number of entries (although some can be blank), and each column has to have the same number of items.

Excel spreadsheet columns are typically referred to by letters: Column A, Column B, etc. You can refer to a data frame column with its name, by using the syntax dataFrameName$columnName. So, if you type snowdata$Total and hit enter, you’ll see all the values in the Total column, as in Figure 4.3. (That’s why when you run the str(snowdata) command, there’s a dollar sign before the name of each column.)

Figure 4.3: The Total column in the snowdata data frame.

A reminder that those bracketed numbers at the left of the listing aren’t part of the data; they just tell you the position of the first item on each line. [1] means that line starts with the first item in the vector, [10] the 10th, etc.
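To try the $ syntax and see the bracketed position markers without the real data, here is a sketch on a stand-in data frame. The name snow_example, the Winter column, and all values are invented; only the Total column name comes from the text.

```r
# Stand-in for snowdata: Winter and all values are made up for this example
snow_example <- data.frame(
  Winter = c("2001-2002", "2002-2003", "2003-2004"),
  Total  = c(15.1, 70.9, 39.4)
)

# Pull one column out as a plain vector with dataFrameName$columnName
totals <- snow_example$Total
totals

# A longer vector wraps across console lines; each line's bracketed number
# is the position of the first item printed on that line
seq(from = 1, to = 30)
```

Where the longer vector wraps depends on how wide your console is, so the bracketed numbers you see may differ from someone else’s.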

RStudio tab completion works with data frame column names as well as object and function names. This is pretty useful to make sure you don’t misspell a column name and break your script – and it also saves typing if you’ve got long column names.

Type snowdata$ and wait, and you’ll see a list of all the column names in snowdata.

There are several ways to slice and dice data frames, which I’ll get to in the next chapter.

It’s easy to add a column to a data frame. Currently, the Total column shows winter snowfall in inches. To add a column showing totals in meters, you can use the format snowdata$Meters <- snowdata$Total * 0.0254.
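As a minimal sketch of that format, here is the assignment run on a small stand-in data frame rather than the real snowdata. The name snow_example, the Winter column, and all values are invented for illustration; Total, Meters, and the 0.0254 conversion factor come from the text.

```r
# Stand-in for snowdata: Winter and all values are made up for this example
snow_example <- data.frame(
  Winter = c("2001-2002", "2002-2003", "2003-2004"),
  Total  = c(15.1, 70.9, 39.4)
)

# New column name on the left, formula on the right;
# the multiplication is applied to every row at once
snow_example$Meters <- snow_example$Total * 0.0254

names(snow_example)  # "Winter" "Total" "Meters"
```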

The name of the new column is on the left, and there’s a formula on the right. In Excel, you might have used =A2 * 0.0254 and then copied the formula down the column. With a script, you don’t have to worry about whether you’ve applied the formula properly to all the values in the column.

Now look at your snowdata object in the Environment tab. It should have a third variable, Meters.

Because snowdata is a data frame, it has certain data-frame properties that you can access from the command line. nrow(snowdata) will give you the number of rows and ncol(snowdata) the number of columns. Yes, you can view this in the RStudio environment to see how many observations and variables there are, but there will probably be times when you’ll want to know this as part of a script. colnames(snowdata) or names(snowdata) will give you the names of snowdata’s columns. rownames(snowdata) will give you any row names (if none were set, it will default to character strings of the row number such as “1”, “2”, “3”, etc.).
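These functions can be tried on any data frame. The sketch below uses a stand-in; the name snow_example, the Winter column, and all values are invented for illustration.

```r
# Stand-in for snowdata: Winter and all values are made up for this example
snow_example <- data.frame(
  Winter = c("2001-2002", "2002-2003", "2003-2004"),
  Total  = c(15.1, 70.9, 39.4)
)

nrow(snow_example)      # 3
ncol(snow_example)      # 2
names(snow_example)     # "Winter" "Total"
colnames(snow_example)  # same result as names() for a data frame
rownames(snow_example)  # "1" "2" "3", since no row names were set
```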

As discussed briefly in Chapter 3, some of these special data frame functions (technically called “methods”) not only give you information, but let you change characteristics of the data frame. So, names(snowdata) tells you the column names in the data frame, but assigning a character vector of new names to names(snowdata) with <- will change the column names in the data frame.
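A minimal sketch of that kind of assignment, run on a stand-in data frame. The name snow_example, the original Winter column, all values, and the new column names Season and SnowInches are invented for illustration.

```r
# Stand-in for snowdata: Winter and all values are made up for this example
snow_example <- data.frame(
  Winter = c("2001-2002", "2002-2003", "2003-2004"),
  Total  = c(15.1, 70.9, 39.4)
)

names(snow_example)  # read the current names: "Winter" "Total"

# Assigning to names() replaces the column names; supply one name per column
names(snow_example) <- c("Season", "SnowInches")

names(snow_example)  # "Season" "SnowInches"
```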

You probably won’t need to know all available methods for a data frame object, but if you’re curious, methods(class=class(snowdata)) will display them. To find out more about any method, run the usual help query with a question mark, such as ?merge or ?subset.

4.5.1 When a number’s not really a number

Zip codes are a good example of “numbers” that shouldn’t really be treated as such. Although technically “numeric,” it doesn’t make sense to do things like add two Zip codes together or take an average of Zip codes in a community. If you import a Zip-code column, though, R will likely turn it into a column of numbers. And if you’re dealing with areas in New England where Zip codes start with 0, the 0 will disappear.

I have a tab-delimited file of Boston Zip codes by neighborhood, downloaded from a Massachusetts government agency, at https://raw.githubusercontent.com/smach/R4JournalismBook/master/data/bostonzips.txt. If I try to import it with zips <- rio::import("bostonzips.txt"), the Zip codes come in as 2118, 2119, etc., and not 02118, 02119, and so on.

This is where it helps to know a little bit about the underlying function that rio’s import() function uses. You can find those underlying functions by reading the import help file at ?import. For pulling in tab-separated files, import uses either fread() from the data.table package or base R’s read.table() function. The ?read.table help says that you can specify column classes with the colClasses argument.

Create a data subdirectory in your current project directory, then download the bostonzips.txt file with download.file("https://raw.githubusercontent.com/smach/R4JournalismBook/master/data/bostonzips.txt", "data/bostonzips.txt"). If you import this file specifying both columns as character strings, by adding colClasses = c("character", "character") to the import() call, and then run str() on the result, the Zip codes will come in properly formatted:

## 'data.frame':    35 obs. of  2 variables:
##  $ Zipcode     : chr  "02118" "02119" "02120" "02130" ...
##  $ Neighborhood: chr  "Boston South End" "Roxbury" "Roxbury Mission Hill" "Jamaica Plain" ...

Note that the column classes have to be set using the c() function, c("character", "character"). If you tried colClasses = "character", "character", you’d get an error message. This is a typical error for R beginners, but it shouldn’t take long to get into the c() habit.
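To see the leading-zero problem and the colClasses fix without downloading anything, here is a self-contained sketch. It makes two assumptions: it uses base R’s read.table() directly rather than rio’s import(), and it writes a two-row stand-in file (rows copied from the output above) to a temporary location instead of reading the real bostonzips.txt.

```r
# Write a tiny tab-separated stand-in file to a temporary location
tmp <- tempfile(fileext = ".txt")
writeLines(
  c("Zipcode\tNeighborhood",
    "02118\tBoston South End",
    "02119\tRoxbury"),
  tmp
)

# Default import: Zipcode is read as a number, so the leading 0 is dropped
zips_default <- read.table(tmp, header = TRUE, sep = "\t")

# Same file with both columns forced to character via a c() vector
zips_chr <- read.table(tmp, header = TRUE, sep = "\t",
                       colClasses = c("character", "character"))

str(zips_default)
str(zips_chr)
```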

A save-yourself-some-typing tip: Writing out c("character", "character") isn’t all that arduous, but if you’ve got a spreadsheet with 16 columns where the first 14 need to be character strings, this can get annoying. R’s rep() function can help. rep(), as you might have guessed, repeats whatever item you give it however many times you tell it to, using the format rep(myitem, numtimes). rep("character", 2) is the same as c("character", "character"), so colClasses = rep("character", 2) is equivalent to colClasses = c("character", "character"). And colClasses = c(rep("character", 14), rep("numeric", 2)) would set the first 14 columns as character strings and the last two as numbers. All the names of column classes here need to be in quotation marks because names are character strings.

I suggest you play around a little with rep() so you get used to the format, since it’s a syntax that other R functions will use, too.
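A few lines to start that experimenting with. This is only a sketch; the object name col_classes is arbitrary.

```r
# rep(myitem, numtimes) repeats an item a given number of times
rep("character", 2)

# It produces the same vector as typing the items out with c()
identical(rep("character", 2), c("character", "character"))  # TRUE

# Combine several rep() calls inside c() for mixed column classes
col_classes <- c(rep("character", 14), rep("numeric", 2))
length(col_classes)  # 16
col_classes
```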