Step 3: Sorting and Faceting

Now that we have created our project, let’s go and explore the data and the Refine interface a little. Using Refine might be intimidating at first, since it seems so different from spreadsheets, but once you get used to it you will notice how easily you can do things with it.

One of the commonly used functions in spreadsheets is sorting and filtering data – to figure out minima, maxima or things about certain categories. Refine can do the same thing.

One of the first things to notice when looking at the data, is that the first two entries under the column ‘aid_project_title’ appear to be the same text, but are written one in uppercase, one in lowercase, and with a spelling omission, too. Remember what we said earlier about messy data? This is a prime example.

To see if there are any others errors like this, let’s use the ‘facet’ function.

Click on the little arrow next to the column title, and then select Facet → Text Facet.

What does ‘Facet’ actually mean? Essentially, filtering. Faceting in Refine is really powerful – you can do a lot to your data using facets.

Here, we’re going to clean up the columns a little. Clicking Text Facet will open a facet in the left sidebar. You can see that there are 62 choices given – are any of them doubled up, like in the first two rows?

Scroll through and see.

The ‘Gas Transmission’ project that we noticed at the top is here, and we can see that there are just two projects with this title; however, they are considered as separate projects due to the spelling mistake + difference in caps used).

We can get rid of the block capital letters quite easily, by clicking on the triangle at the top of the aid_project_title column and selecting Edit cells → Common transforms → To titlecase.

Then, we use the ‘Cluster’ feature to automatically find projects that have the same name. To activate clustering, click on the ‘Cluster’ button in the facet.

You will end up in the clustering menu – there, click on the drop down menu next to Method, and select ‘nearest neighbour’.

Then, you should see two entries come up: the Gas Transmission project we first noticed, and another one with an 0 instead of an O in the word ‘transport’.

If you click on one of the options, that text will appear in the corresponding New Cell Value window. It will also tick the ‘merge’ box. When you click the Merge button, the lines that are ticked will be rewritten with the appropriate New Cell Value.

Make sure the ‘New Cell Value’ is written out correctly, and then click ‘Merge Selected & Re-Cluster’. No others come up, so you can close the window. It is often worth trying one or two different measures (Distance Functions) when trying to cluster partially matching strings, as each measure works slightly differently.

What does ‘Distance Function’ mean? The different distance measures take different times to run and use different rules to try to work out what is the same as what. In levenshtein, the distance is the number of single character “edits” that map one string to another. If you try the ppm rather than levenshtein distance measure you get different results.

In the key collision method, there are a couple of interesting techniques that try to find strings that “sound alike” (metaphone3 and cologne phonetic). These are quite quick and dirty methods, but they can work well

Exercise 1:

You can check to see if there are similar problems in any of the other columns, too – for example, if you set a Text Facet up for the column ‘sectors’ you’ll see that there are two similarly named sectors:

Transport And Ict, and Transport and ICT.

Click ‘Cluster’ and see if you can merge them!

Exercise 2:

Narrowing down your search: try to find all projects in Public Sector Management that have a Social Development policy marker. (tip: try using a facet on one column, and a filter on another)

Different kinds of facets

As you saw when you created the Text facet just now, there are other kinds of facets – let’s create a Timeline one, to make it easy to see when projects started (and stopped.)

Go to the column start_actual; in this column, can see the dates that projects started. Create a Timeline facet, by clicking on Facet → Timeline facet.

Hmm – it doesn’t seem to work- this is what we see, and it appears to be blank! Why?

The dates given in the start_actual column don’t seem to be recognised as dates (or ‘Time’) so they’re not coming up in the Timeline facet.

Luckily, if the date is in the format specified in the IATI standard this is easy to change: just ‘transform’ all of the entries into that cell into ‘dates’, by clicking Edit cells → Common transforms → To date

You can go ahead and do this transform on all of the columns that you can see which have dates in – start-planned, and end-planned too.

Once you have, you’ll be able to apply a Timeline Facet which allows us to see when projects were started.

Cleaning multiple entries in one cell

You may notice that in some of the columns, there are multiple entries:

If we wanted to list just the rows that involved Sylhet Gas Fields Limited we could select the Text filter option, enter the company name, and just see the corresponding rows.

However, we might also want to reshape the data so that we have one line for each accountable-org or implementing-org. We can achieve this in OpenRefine in two steps. The first step involves generating a new row for each entry in a particular column. From the Edit Cells menu option select *Split multi-valued cells… * and then enter the character that is used to separate the different items within the cell – in this case, a semi-colon (;) character.

For each item in the cell, a new row is created and the distinct values filled in within the original column – that is, the column whose cell values we split.

If you inspect the other columns, you will notice that all the cells in the newly created rows are blank.

To fill in the blanks, we need a second step – for each column, select Edit cells then Fill down. This works down each row in the dataset, looking for empty cells and filling them with whatever value appears in the cell next filled cell above.

For a dataset such as this one, there are obviously of a lot of columns that need filling in – which could take some time. But do you really want data from every column? Perhaps all you really wanted was a list of accountable organisation aims and the IATI project identifier – in which case you’d only have to fill down on that column.

You could then export just these two columns of data using the Custom tabular exporter from the project Export menu; this would give you a more specific selection of the entire data set to work with.