Course outline: a gentle introduction to cleaning data

This course was created for the School of Data by Tactical Technology Collective. Tactical Tech is an international NGO working at the point where rights advocacy meets information and technology.

It’s quite hard to get emotional about data, but let’s try. Stop what you’re doing for a minute. Look at the picture below. What joy! The yellow smiling bears … what a scene! All that’s missing is fireworks.

http://farm9.staticflickr.com/8219/8412778423_2b06c83e7b_o_d.jpg

Image source: China Daily / Reuters, 17 January 2006. Image rights reserved.

Now, imagine the disappointment of finding out that your attempt to create the world’s largest ice cream cake had been ruined because of a spreadsheet error caused by a single blank space in the column called “quantities of ice cream”. It’s not a nice feeling is it? (Luckily this didn’t happen, and probably nobody got fired, and everyone is very happy with the truly meaningful achievement of creating an 8 ton ice cream cake.)

Errors in data and the tools used to manage it are common enough that there is actually an international organisation called the European Spreadsheet Risks Interest Group that advises on how to reduce the errors people make when using spreadsheets. ESRIG tracks “horror stories” of where data errors in spreadsheets have led to real consequences. These include billions in missing oil revenue, tens of thousands of Olympic tickets being oversold, and huge salaries being given because of accidentally inserted zeros. Some errors are deliberate and criminal. Others are just down to the complexity of managing data. Some errors occur because of the way spreadsheets behave.

This School of Data course is a gentle introduction to reducing errors by cleaning data, which means a number of things, including:

  • finding and removing unwanted bits of data in spreadsheets
  • formatting data correctly for the tools that you are using
  • dealing with inconsistencies in the data
  • structuring it so it can be used effectively for what you want it to do

Course requirements

A gentle Introduction to cleaning data builds on the School of Data course called Data Fundamentals, which covers the basics of working with data.

To do this course you will require the following:

  • a spreadsheet tool, such as Excel (which costs money) or Libre Office (which doesn’t), installed on your computer. We assume you know how to do this.
  • some basic knowledge of using a spreadsheet, including creating files, entering data and simple formulae into cells, selecting, copying and pasting data. If you want to brush up on these things, the Knight Digital Media Centre has an excellent set of basic tutorials on spreadsheets.
  • a copy of the course dataset, which is GRAIN’s data on “land grabbing”. We have used this data both because it is an interesting dataset, and because it speaks to the experience of activists in researching and pulling data together about and issue.It also provides a full set of really excellent problems that we can work to solve. Finally, we will also be using it as the basis of the course on descriptive analysis of data, which is coming soon.

Course contents

The course content is as follows:

  • The Course Outline, which is this page you’re reading now.
  • Section 1: Nuts and chewing gum looks at the the way data is presented in spreadsheets and how it might cause errors.
  • Section 2: The Invisible Man is in your spreadsheet is concerned with the problems of white spaces and non-printable characters and how they affect our ability to use the data.
  • Section 3: Your data is a witch’s brew deals with consistency in data entry, and how to choose the right unit and format for data.
  • Section 4: Did you bring the wrong suitcase (again)? is about where to put data, and how to structure it.
  • Accompanying these sections is a step-by-step recipe for cleaning a dataset. This is an extensive, handbook-style resource which we refer to in each section. It takes a set of ‘dirty’ data and moves it through the different steps to make it ‘clean’.

Each section of this course contains:

  • An introduction to the challenge: we’ve tried to explain the problems in a helpful, lighthearted and possibly inappropriate way.
  • A quick task which should take around 15 minutes to familiarise yourself with the issue. We’ll usually ask you to make or do something, and post your work in the section’s discussion area.
  • A longer task taken from the data cleaning recipe. Some tasks are short, but others may take up to an hour.
  • A reading list containing useful references and resources. We have included links to essays, spreadsheet product help pages, blogs and so on.
  • A question about the material in the section which you can answer in the discussion area. Please post comments on each section, share your thoughts and give feedback on your progress.

How to do this course

We suggest that:

  • You go through sections 1-4 and do the quick tasks first, and then do it again with the longer tasks based on the data cleaning recipe.
  • Don’t suffer in silence. We know you’re persistent and driven to learn and we’re here to help.

Set up? Ready to go?

Good. Say hello and introduce yourself in the discussion area, and then get cracking with Section 1: Nuts and chewing gum.

Next Course

Any questions? Got stuck? Ask School of Data!

Last updated on Sep 02, 2013.