Here, the spreadsheet is king (for now)
Seat reserved… for the spreadsheet. Photo by Zoonabar. CC-BY-SA 2.0
For non-techie researchers and investigators like me who work on human rights, spreadsheets are incredibly useful. However, it’s hard to imagine a tool as flexible that is at the same time so deeply frustrating. Spreadsheets can make simple things very difficult. For example, for many years this is what “cleaning data” has meant to me and many other people I work with:
Open file in spreadsheet. Open cell. Position cursor. Correct error. Close cell. Move down a row. Open cell. Position cursor. Correct error. Close cell. Move down a row… repeat to row 53,234 or until you fall asleep at the keyboard (whichever comes first).
To help speed these sorts of tasks up, we’ve written a new School of Data course called A gentle introduction to cleaning data in a spreadsheet. It contains loads of ways to make cleaning data a quicker and less painful experience.
In the course we start with a ‘dirty’ dataset containing lots of common errors. We walk you step-by-step through the process of making it ‘clean’. We’ll show you how to use a range of common spreadsheet features to find and correct problems such as invisible or inconsistent data, missing values, a bad data structure and so on. By the end of the course, you should leave with a better view of what the spreadsheet can do, a practical process you can repeat on your own datasets and a good idea of how to find help online about using spreadsheets.
The course dataset is interesting too. It’s about ‘land-grabbing’, or the commercial buy-up of agricultural land in the developing world by investment companies and governments to grow biofuel and other commodities, turfing people off land they need for their survival and (some analysts reckon) driving up food prices around the world. The data was produced by GRAIN, an excellent research organisation; I hope they accept our apologies for picking on their data in this course!
This is the first in a series of three ‘basics’ courses. They all use the same dataset about land-grabbing. The next in the series is a course called A gentle introduction to descriptive data analysis, which is about using a spreadsheet to get to grips with what’s in your data. Hot on its heels will be an introduction to visualising networks.
Finally, this course will also illustrate the spreadsheet’s limits. At some point, the time and effort you put into pushing a spreadsheet to do something may be better spent on tools and techniques specifically designed to tackle the problem. In the case of cleaning data, this might mean learning how to use Google Refine.
But until that time, all hail the spreadsheet, king of data cleaning.