Some work
To do both the quick and longer tasks below you’ll need:
- A spreadsheet tool, such as Excel or Libre Office.
- A copy of the sample spreadsheet for this section.
- A copy of the course dataset, which is GRAIN’s data on “land grabbing”.
A quick task (15 minutes)
- Download and open this sample spreadsheet on your computer. In column A is the data from above, with different sorts of non-printable characters. In columns B-E are four easy methods of removing non-printable characters from your data using: * the TRIM function (in column B) * the CLEAN function (in column C) * the TRIM and CLEAN functions …. together! (in column D) * the “Paste Special” feature (in column E)
- Apply an “AutoFilter” to the data (Data → Filter → AutoFilter). Click on the little downwards-pointing arrows and a selection list will pop up displaying the number of distinct entries in each column. If you click on the autofilter for each of columns A through to E you can see the effect of the different methods of removing non-printable characters.
- Double click on cells to view the formulas see how the functions work in practice. There is more information on each of them in the ‘Study’ section below.
- Now try it yourself. Create or find a single column of data on your own. Apply these four methods in the same format as our sample the your spreadsheet.
Use the discussion area below to share your work and any observations about this task.
A longer task (30-60 minutes)
From the Data Wrangling Handbook Recipe on Data Cleaning, run through Problem 2: Whitespace and new lines – data that shouldn’t be there using the GRAIN dataset. This goes into more detail about the CLEAN and TRIM functions.
