Practical R for Mass Communication and Journalism

4.4 Import data from packages

In the last chapter, I showed how to import government and financial data directly into R using the quantmod package. There are other packages that will let you access data directly from R.

The aptly named weatherdata package on CRAN can pull data from the Weather Underground API, which has information for many countries around the world. (An API, or Application Programming Interface, is a service that makes data available in a structured way so developers can easily access it when designing their software.)

The rnoaa package, a project from the rOpenSci group, taps into several different U.S. National Oceanic and Atmospheric Administration data sets including daily climate, buoy, and storm information.

If you are interested in state or local government data in the US or Canada, you may want to check out RSocrata to see if an agency you’re interested in posts data there. I’ve yet to find a complete list of all available Socrata data sets, but there’s a search page at https://www.opendatanetwork.com. Be careful though: There are community-uploaded sets along with official government data, so check a data set’s owner and upload source before relying on it for more than R practice. “ODN Dataset” in a result means it’s a file uploaded by someone in the general public. Official government data sets tend to live at URLs like https://data.CityOrStateName.gov or https://data.CityOrStateName.us. I’ll walk you through pulling some New York City restaurant inspection data with RSocrata a few chapters from now.

For more data-import packages, see my searchable chart at http://bit.ly/RDataPkgs. Journalists and others in mass communication might be particularly interested in censusapi and tidycensus, both of which tap into U.S. Census Bureau data. Other useful government data packages include eu.us.opendata from the U.S. and European Union governments to make it easier to compare data in both regions, and cancensus for Canadian census data.

4.4.1 When the data’s not ideally formatted

In all these sample data cases, the data has been not only well-formatted, but ideal: Once we found it, it was perfectly structured for R. What do I mean by that? It was rectangular, with each cell having a single value instead of merged cells. And the first row had column headers, as opposed to, say, a title row in large font across multiple cells in order to look pretty – or no column headers at all.

Dealing with “untidy” data can, unfortunately, get pretty complicated. But there are a couple of common issues that are easy to fix.

Beginning rows that aren’t part of the data. If you know that the first few rows of an Excel spreadsheeet don’t have data you want, you can tell rio to skip one or more lines. The syntax is rio::import("mySpreadsheet.xlsx", skip=3) to exclude the first three rows. skip takes an integer.

There are no column names in the spreadsheet. The default import assumes the first row of your sheet is the column names. If your data doesn’t have headers, the first row of your data may end up as your column headers. To avoid that, use rio::import("mySpreadsheet.xlsx", col_names = FALSE) and R will generate default headers of X0, X1, X2 and so on. Or, use a syntax such as rio::import("mySpreadsheet.xlsx", col_names = c("City", "State", "Population")) to set your own column names.

If there are multiple tabs in your spreadsheet, the which argument will override the default of reading in the first worksheet. rio::import("mySpreadsheet.xlsx", which = 2) reads in the second worksheet.