Recently I’ve been giving more workshops about cleaning data. This step in the data cycle often takes 80% of the time, but is seldom focused on in a systematic way. I want to address one topic that keeps coming up – what is clean data?
When I ask, I usually get answers all over the map. I tend to approach it from four topics:
- consistency: are observations always entered the same way?
- completeness: do you have full coverage of the topic?
- usability: is your data human readable, or machine readable, in the ways you need it to be?
- atomicity: do the rows hold the correct basic units for your analysis?
The last topic, atomicity, is one I need a better name for. In any case, I want to tease it apart a bit more because it is critical. Wickham’s Tidy Data paper has a great way of talking about this:
> each variable is a column, each observation is a row, and each type of observational unit is a table
Yes, someone wrote a whole 24-page paper on how to make sure your columns are right. And yes, I read it and enjoyed it. You should go read it too (at least the first few pages). The key point is that far too many tabular datasets have column headers that are, in fact, part of the data. For instance, if you are keeping track of how many times something happens each year, each year shouldn’t be a column header; “year” should be a column and you should have one row for each year. For you Excel junkies, this means your raw data shouldn’t be in cross-tab format.
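To make the year example concrete, here is a minimal sketch of that reshape using Python’s pandas library (one of many tools that can do this; the country and case numbers are made up for illustration):

```python
import pandas as pd

# Cross-tab format: the years "1999" and "2000" are column headers,
# which means part of the data is hiding in the header row.
wide = pd.DataFrame({
    "country": ["Afghanistan", "Brazil"],
    "1999": [745, 37737],
    "2000": [2666, 80488],
})

# Tidy format: "year" becomes its own column, with one row per
# country-year observation.
tidy = wide.melt(id_vars="country", var_name="year", value_name="cases")
print(tidy)
```

After the melt, each row holds exactly one observation (one country in one year), so filtering, grouping, and plotting by year all become straightforward.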
This process of cleaning your data to make it tidy can be annoying, but luckily there are tools that can help. Tableau has a handy plugin for Excel that “reshapes” your data to prep it for analysis. If you are an R wizard, there is a presentation on how to do tidying operations in R. If you use Google Sheets, there is a Stack Overflow post with details on a plugin someone wrote to normalize data in Google Sheets.
I hope that helps you in your next data-cleaning task. Hooray for tidy data!