Introduction
Why might data need “cleaning” anyway? Data needs cleaning when it contains inconsistencies that make it difficult to work with. Even data that is already in a spreadsheet can be “dirty” in many ways.
For example, dates may be written in different formats within the same spreadsheet (21st October, 21/10/13, or Oct. 21), or names may be spelt slightly differently while referring to the same thing. Inconsistencies like these, whether introduced by human error or by a machine, make the data very hard to analyse. Because a lot of IATI data has been processed by hand, small inconsistencies are common in the files you find in the IATI registry, and the data needs to be cleaned before you can work with it properly.
So, here is an introduction to a powerful data cleaning tool, which is free to download.
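To make the date example above concrete, here is a minimal Python sketch of what “cleaning” a date column means: three differently written dates are brought into one consistent format. This is only an illustration and not how the tool works internally. The function name normalise_date, the list of accepted formats, and the default year of 2013 (used when a date has no year) are all assumptions made for this example.

```python
import re
from datetime import datetime

# Hypothetical format lists for this sketch; real data may need more.
FORMATS_WITH_YEAR = ["%d/%m/%y"]          # e.g. 21/10/13
FORMATS_WITHOUT_YEAR = ["%d %B", "%b %d"]  # e.g. 21 October, Oct 21


def normalise_date(raw, default_year=2013):
    """Return the date as YYYY-MM-DD, or None if no known format matches.

    default_year is an assumption: it is applied when the input has no year.
    """
    # Strip ordinal suffixes ("21st" -> "21") and full stops ("Oct." -> "Oct").
    text = re.sub(r"(\d+)(st|nd|rd|th)\b", r"\1", raw.strip())
    text = text.replace(".", "")

    for fmt in FORMATS_WITH_YEAR:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            pass  # try the next format

    for fmt in FORMATS_WITHOUT_YEAR:
        try:
            # Append the assumed year so the parsed date is complete.
            parsed = datetime.strptime(f"{text} {default_year}", f"{fmt} %Y")
            return parsed.date().isoformat()
        except ValueError:
            pass  # try the next format

    return None  # leave unrecognised values for manual review


if __name__ == "__main__":
    for value in ["21st October", "21/10/13", "Oct. 21"]:
        print(value, "->", normalise_date(value))
```

Values that match none of the listed formats come back as None rather than being guessed at, so they can be checked by hand.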
Objectives:
- Understand why IATI data might need ‘cleaning’
- Become familiar with the data cleaning tool OpenRefine
- Work with IATI data (downloaded as CSV from the IATI registry)
Prerequisites/before you get started:
- Basic understanding of spreadsheets
- Basic understanding of what IATI data shows
- Understanding of what a CSV is
What you’ll need:
- OpenRefine: download it from http://openrefine.org. If you are installing it on a Mac, you may see an error message saying:
“‘Google Refine’ is damaged and can’t be opened. You should move it to the Trash.”
To get around this problem, follow these instructions:
- Open System Preferences
- Open Security & Privacy
- Go to the General Tab
- Change the “Allow applications downloaded from:” setting to “Anywhere”
(This appears to be a security issue with OS X Mountain Lion; the steps above are a workaround until Google fixes it.)
