1 Introduction
Imagine you’ve received a large spreadsheet with messy but important data, and you know it’s got a story to tell. You spend lots of time cutting and pasting, writing formulas, and data “cleaning.” One problem you’re fixing is multiple versions of the same company’s name: XYZ, XYZ Inc., X YZ Company, and so on. But when you’re finally able to do your analysis, something looks wrong. And when you ask your source about it, he responds: “Yes, sorry, there was a mistake in the data file. We’ll send you a corrected spreadsheet shortly.”
That was me a few years ago. After swearing (mostly to myself) in the newsroom, I had to re-create in the second Excel file everything I had done in the first. The copying. The pasting. The formula-writing. The painfully long waits for formulas to execute. Spot checks of results. Standardizing on one version of each company’s name so counts were accurate.
I vowed that wasn’t going to happen again.
And I started learning R.