You are browsing the archive for Rufus Pollock.

Using SQL for Lightweight Data Analysis

Rufus Pollock - March 26, 2013 in Data Blog, Data Stories, HowTo

This article introduces the use of SQL for lightweight data analysis by walking through a small data investigation to answer the question: who were the top recipients of Greater London Authority spending in January 2013?

Along the way, it not only introduces SQL (and SQLite) but illustrates various other skills such as locating and cleaning data and how to load tabular data into a relational database.

Note: if you are intrigued by the question or the data wrangling do check out the OpenSpending project – the work described here was part of some recent work by OpenSpending community members at a recent Open Data Maker Night.

Finding the Data

First we need to locate the data online. Let’s start with a web search, e.g.: “London GLA spending” (GLA = greater london authority). This quickly yields the jackpot in the form of this web page:

For our work, we’ll focus on the latest month. So jump in and grab the CSV file for February which is at the top of that page (at the moment!).

Preparing the Data

The data looks like this (using the Chrome CSV Viewer extension):

gla-csv

Unfortunately, it’s clear these files have a fair amount of “human-readable” cruft that make them unsuitable for further processing without some cleaning and preparation. Specifically:

  • There is various “meta” information plus a blank linke at the top of each file
  • There are several blank lines at the bottom
  • The leading column is empty

We’ll need to remove these if we want to work with this data properly – e.g. load into OpenSpending, put in a database etc. You could do this by hand in your favourite spreadsheet package but we’ll do this using the classic UNIX command line tools head, tail and sed:

tail -n +7 2012-13-P11-250.csv | head -n -4 | sed "s/^,//g" > 2013-jan.csv

This command takes all lines after the first 6 and before the last 4, strips off the leading “,” and puts it in a new file called 2013-jan.csv. It uses unix pipes to run together these few different operations:

# strip off the first 6 lines
tail -n +7

# strip off the last 4 lines
head -n -4

# remove the lead column in the form of "," at the start of each line
# "^," is a regular expression matching "," at the start of a line ("^"
# matches the start of a line)
sed "s/^,//g"

The result of this is shown in the screenshot below and we’re now ready to move on to the next stage.

gla-csv-cleaned

Analyzing the Data in a Relational Database (SQLite)

Our aim is to work out the top recipients of money. To do this we need sum up the amounts spent by Vendor (Name). For the small amount of data here you could use a spreadsheet and pivot tables. However, I’m going to take a somewhat different approach and use a proper (relational) database.

We’ll be using SQLite, an open-source relational database that is lightweight but fully-featured. So, first check you have this installed (type sqlite or sqlite3 on the command line – if you don’t have it is easy to download and install).

Loading into SQLite

Now we need to load our CSV into SQLite. Here we can take advantage of a short python csv2sqlite script. As its name suggests, this takes a CSV file and loads it into an SQLite DB (with a little bit of extra intelligence to try and guess types). The full listing for this is in the appendix below and you can also download it from a gist here. Once you have it downloaded we can use it:

# this will load our csv file into a new table named "data"
# in a new sqlite database in a file named gla.sqlite
csv2sqlite.py 2013-jan.csv gla.sqlite

Analysis I

Let’s get into the SQLite shell so we can run some SQL:

# note you may need to run sqlite3 rather than sqlite!
sqlite gla.sqlite

Now you will be in the SQLite terminal. Let’s run our query:

sqlite> SELECT "Vendor Name", sum(amount) FROM data
          GROUP BY "Vendor Name"
          ORDER BY SUM(amount) DESC
          LIMIT 20;

How does this work? Well the key thing here is the “GROUP BY” which has a similar function to pivoting in spreadsheets: what it does is group together all the rows with the same value in the “Vendor Name” field. We can then use SELECT to specify fields, or functions of fields that are common or aggregate across all the rows with the same “Vendor Name” value. In this case, we just select the “Vendor Name” and the SUM of the “Amount” field. Lastly, we order the results by the sum (descending – so most first) and limit to only 20 results. The result is as follows:

Vendor Name                          SUM(Amount)
-----------------------------------  -----------
NEWLON HOUSING TRUST                 7540500.0  
ONE HOUSING GROUP                    6655104.0  
L B OF HARINGEY                      6181359.0  
LONDON BOROUGH OF HACKNEY - BSP      5665249.0  
LONDON BOROUGH OF HAVERING           4378650.0  
LONDON BOROUGH OF NEWHAM             3391830.0  
LONDON BOROUGH OF BARKING            2802261.0  
EVERSHEDS                            2313698.54 
METROPOLITAN HOUSING TRUST LIMITED   2296243.0  
BERKELEY PARTNERSHIP HOMES LIMITED   2062500.0  
LONDON BOROUGH OF LAMBETH            1917073.95 
PARADIGM HOUSING GROUP LIMITED       1792068.0  
AMAS LTD                             1673907.5  
VIRIDIAN HOUSING                     1467683.0  
LONDON BOROUGH OF GREENWICH          1350000.0  
CITY OF WESTMINSTER                  1250839.13 
CATALYST HOUSING GROUP LTD            829922.0   
ESTUARY HOUSING ASSOCIATION LIMITED   485157.0   
LOOK AHEAD HOUSING AND CARE           353064.0   
TRANSPORT FOR LONDON                  323954.1   

We could try out some other functions, for example to see the total number of transactions and the average amount we’d do:

sqlite> SELECT "Vendor Name", SUM(Amount), AVG(Amount), COUNT(*)
          FROM data
          GROUP BY "Vendor Name"
          ORDER BY sum(amount) DESC;

Vendor Name                          SUM(Amount)  AVG(Amount)  COUNT(*)  
-----------------------------------  -----------  -----------  ----------
NEWLON HOUSING TRUST                 7540500.0    3770250.0    2         
ONE HOUSING GROUP                    6655104.0    3327552.0    2         
L B OF HARINGEY                      6181359.0    6181359.0    1         
LONDON BOROUGH OF HACKNEY - BSP      5665249.0    1888416.333  3         
LONDON BOROUGH OF HAVERING           4378650.0    4378650.0    1         

This gives us a sense of whether there are many small items or a few big items making up the expenditure.

What we’ve seen so far shows us that (unsurprisingly) GLA’s biggest expenditure is support to other boroughs and to housing associations. One interesting point is the approx £2.3m paid to Eversheds (a City law firm) in January and the £1.7m to Amas Ltd.

Analysis II: Filtering

To get a bit more insight let’s try a crude method to remove boroughs from our list:

sqlite> SELECT "Vendor Name", SUM(Amount) FROM data
          WHERE "Vendor Name" NOT LIKE "%BOROUGH%"
          GROUP BY "Vendor Name"
          ORDER BY sum(amount)
          DESC LIMIT 10;

Here we are using the WHERE clause to filter the results. In this case we are using a “NOT LIKE” clause to exclude all rows where the Vendor Name does not contain “Borough”. This isn’t quite enough, let’s also try to exclude housing associations / groups:

SELECT "Vendor Name", SUM(Amount) FROM data
  WHERE ("Vendor Name" NOT LIKE "%BOROUGH%" AND "Vendor Name" NOT LIKE "%HOUSING%")
  GROUP BY "Vendor Name"
  ORDER BY sum(amount)
  DESC LIMIT 20;

This yields the following results:

Vendor Name                          SUM(Amount)
-----------------------------------  -----------
L B OF HARINGEY                      6181359.0  
EVERSHEDS                            2313698.54 
BERKELEY PARTNERSHIP HOMES LIMITED   2062500.0  
AMAS LTD                             1673907.5  
CITY OF WESTMINSTER                  1250839.13 
TRANSPORT FOR LONDON                  323954.1   
VOLKER FITZPATRICK LTD                294769.74  
PEABODY TRUST                         281460.0   
GEORGE WIMPEY MAJOR PROJECTS          267588.0   
ST MUNGOS                             244667.0   
ROOFF LIMITED                         243598.0   
R B KINGSTON UPON THAMES              200000.0   
FOOTBALL FOUNDATION                   195507.0   
NORLAND MANAGED SERVICES LIMITED      172420.75  
TURNER & TOWNSEND PROJECT MAGAG       136024.92  
BARRATT DEVELOPMENTS PLC              108800.0   
INNOVISION EVENTS LTD                 108377.94  
OSBORNE ENERGY LTD                    107248.5   
WASTE & RESOURCES ACTION PROGRAMME     88751.45   
CB RICHARD ELLIS LTD                   87711.45 

We still have a few boroughs due to abbreviated spelling (Haringey, Richmond, Westminster) but the filter is working quite well. New names are now appearing and we could start to look intro these in more detail.

Some Stats

To illustrate a few additional features of let’s get some overall stats.

The number of distinct suppliers: 283

SELECT COUNT(DISTINCT "Vendor Name") FROM data;

Total amount spent in January: approx £60m (60,448,491)

SELECT SUM(Amount) FROM data;

Wrapping Up

We now have an answer to our original question:

  • The biggest recipient of GLA funds in January was Newlon Housing Trust with £7.5m
  • Excluding other governmental or quasi-governmental entities the biggest recipient was Eversheds, a law firm with £2.4m

This tutorial has shown we can get these answers quickly and easily using a simple relational database. Of course, there’s much more we could do and we’ll be covering some of these in subsequent tutorials, for example:

  • Multiple tables of data and relations between them (foreign keys and more)
  • Visualization of of our results
  • Using tools like OpenSpending to do both of these!

Appendix

Colophon

CSV to SQLite script

Note: this script is intentionally limited by requirement to have zero dependencies and its primary purpose is to act as a demonstrator. If you want real CSV to SQL power check out csvsql in the excellent CSVKit or MessyTables.

SQL

All the SQL used in this article has been gathered together in one script:

.mode column
.header ON
.width 35
-- first sum
SELECT "Vendor Name", SUM(Amount) FROM data GROUP BY "Vendor Name" ORDER BY sum(amount) DESC LIMIT 20;
-- sum with avg etc
SELECT "Vendor Name", SUM(Amount), AVG(Amount), COUNT(*) FROM data GROUP BY "Vendor Name" ORDER BY sum(amount) DESC LIMIT 5;
-- exclude boroughs
SELECT "Vendor Name", SUM(Amount) FROM data
  WHERE "Vendor Name" NOT LIKE "%Borough%"
  GROUP BY "Vendor Name"
  ORDER BY sum(amount) DESC
  LIMIT 10;
-- exclude boroughs plus housing
SELECT "Vendor Name", SUM(Amount) FROM data
  WHERE ("Vendor Name" NOT LIKE "%BOROUGH%" AND "Vendor Name" NOT LIKE "%HOUSING%")
  GROUP BY "Vendor Name"
  ORDER BY sum(amount) DESC
  LIMIT 20;
-- totals
SELECT COUNT(DISTINCT "Vendor Name") FROM data;
SELECT SUM(Amount) FROM data;

Assuming you had this in a file called ‘gla-analysis.sql’ you could run it against the database by doing:

sqlite gla.sqlite < gla-analysis.sql

Flattr this!

Geocoding Part II: Geocoding Data in a Google Docs Spreadsheet

Rufus Pollock - February 19, 2013 in HowTo

This tutorial follows on from the previous Introduction to Geocoding.

A very common need is to geocode data in a Google Spreadsheet (for example, in creating TimeMaps with the Timeliner project). There are several options here:

  1. By hand – use a Geocoding service (see the previous post) and then copy and paste by hand.
  2. Use the ImportXML (or ImportCSV) formulae to grab data from a geocoding service – great but with limitations on the number of rows you can code at one time (~50).
  3. Use a Google App Script – the most powerful but requires installation of an App Script in your spreadsheet.

In this tutorial I’m going to cover the latter two automated options and specifically focus on option 2.

Using Formulas

All of the following is illustrated live in this google spreadsheet.

We start with a formula like the following:

=ImportXML("http://open.mapquestapi.com/nominatim/v1/search.php?format=xml&q=London", "//place[1]/@lat")

This formula uses the ImportXML function to look up XML data from the Mapquest Nominatim geocoding service (see the previous tutorial for more about geocoding services). The first argument to ImportXML is the URL to fetch (in this case the results from querying the geocoding service) and the second part is an XPath expression to select data from that returned XML. In this case, the XPath looks up the first place object in the results: place[1] and then gets the lat (latitude) attribute. To understand this more clearly, here’s the XML returned by that XML query:

In reality we want both latitude and longitude, so let’s change it to:

=ImportXML("http://open.mapquestapi.com/nominatim/v1/search?format=xml&q=London", "//place[1]/@lat | //place[1]/@lon")

This uses an “or” || expression in XPath and the result will now be an array of results that Google Docs will put in 2 cells (one below another). You can see this in Column C of the example spreadsheet.

What happens if we wanted the data in just one cell, with the two values separated by commas, for example? We could use the JOIN function:

=JOIN(",", ImportXML("http://open.mapquestapi.com/nominatim/v1/search?format=xml&q=London", "//place[1]/@lat | //place[1]/@lon"))

Lastly, we’d like to geocode based on a place name in an another cell in the spreadsheet. To do this we just need to add the place name to our API request to MapQuest’s Nominatim service using the CONCATENATE function (this example assues the value is in cell A2):

=ImportXML(CONCATENATE("http://open.mapquestapi.com/nominatim/v1/search?format=xml&q=", A2), "//place[1]/@lat")
...
=JOIN(",", ImportXML(CONCATENATE("http://open.mapquestapi.com/nominatim/v1/search?format=xml&q=",A2), "//place[1]/@lat | //place[1]/@lon"))

App Script

If you want an even more powerful approach you can use a Google App Script. In particular, Development Seed’s MapBox team have prepared a great ready-made Google AppScript that will do geocoding for you.

Find the script plus instructions online here and further details in this blog post.

Flattr this!

Geocoding Part I: Introduction to Geocoding

Rufus Pollock - February 19, 2013 in HowTo

Geocoding is the conversion of a human-readable location name into a numeric (or other machine-processable) location such as a longitude and latitude. For example:

London => [geocoding] => {latitude: -0.81, longitude: 51.745}

Geocoding is a common need when working with data as you may only have human-readable locations (e.g. “London” or a zip code like “12245”) but for a computer to display the data on a map or query it requires one to have actual numerical geographical coordinates.

Aside: in the example just given we’ve the term “London” has been converted to a point with a single Latitude and Longitude. Of course, London (the City in the UK) covers a significant area and so a polygon would be a better representation. However, for most purposes a single point is all we need.

Online geocoding

In theory, to do geocoding we just need a database that lists place names and their corresponding coordinates. Several, such open databases exist including geonames and Open Street Map.

However, we don’t want to have to do the lookups ourselves – that would either involve programming or a lot of very tedious scrolling.

As a result, various web services have been built which allow look ups online or over a web API. These services also assist in find the best match for a given name — for a given simple place name such as London there may be several matching locations (e.g. London, UK and London, Ontario) and one needs some way to match and rank these alternatives.

Nominatim – An Open Geocoding Service

There are a variety of Geocoding services. We recommend using one based on open data such as the MapQuest Nominatim service which uses the Open Street Map database. This service provides both “human-readable” service (HTML) and a “machine-readable” API (JSON and XML) for automated Geocoding.

Example – Human-readable HTML

http://open.mapquestapi.com/nominatim/v1/?q=London

Example – Machine-readable JSON

(JSON is also human-readable if you have a plugin)

http://open.mapquestapi.com/nominatim/v1/?format=json&q=London

Coming up Next

Geocoding Part II – Geocoding Data in a Google Docs Spreadsheet

Flattr this!

Web Scraping with CSS Selectors in Node JS using JSDOM or Cheerio

Rufus Pollock - January 22, 2013 in Data Blog, HowTo


I’ve traditionally used python for web scraping but I’d been increasingly thinking about using Node JS given that it is based on a browser JS engine and therefore would appear to be a more natural fit when getting info out of web pages.

In particular, my first steps when scraping information from a website is to open up the Chrome Developer tools (or Firebug in Firefox) and try and extract information by inspecting the page and playing around in the console – the latter is especially attractive if jQuery is available (and if it’s not available there are plenty of ways to inject it).

Here’s an example of inspecting the http://police.uk/data webpage with Chrome Developer tools:

Inspecting webpage with chrome developer tools

The result of this experimentation is usually a few lines of jQuery selectors.

What I want to be able to do next is reuse these css selectors I found with my in-browser experimentation directly in the scraping script. Now, things like pyquery do exist in python (and there is some css selector support in the brilliant BeautifulSoup) but a connection with something like Node seems even more natural – it is after the JS engine from a browser!

UK Crime Data

My immediate motivation for this work was wanting to play around with the UK Crime data (all open data now!). To do this I needed to:

  1. Get the data in consolidated form by scraping the file list and data files from http://police.uk/data/ – while they commendably provide the data in bulk there is no single file to download, instead there is one file per force per month.
  2. Do data cleaning and analysis – this included some fun geo-conversion and csv parsing

I’m just going to talk about the first part in what folllows – though I hope to cover the second part in a follow up post.

I should also note that all the code used for scraping and working with this data can be found in the [UK Crime dataset data package on GitHub] on Github - scrape.js file is here. You can also see some of the ongoing results of these data experiments in an experimental UK crime "dashboard" here.

: https://github.com/datasets/crime-uk

Scraping using CSS Selectors in Node

Two options present themselves when doing simple scraping using css selectors in node.js:

For the UK crime work I used jsdom but I've subsequently used cheerio as it is substantially faster so I'll cover both here (I didn't discover cheerio until I'd started on the crime work!).

Here's an excerpted code example (full example in the scrape.js):

var url = 'http://police.uk/data';
// holder for results
var out = {
  'streets': []
}
jsdom.env({
  html: url,
  scripts: [
    'http://code.jquery.com/jquery.js'
  ],
  done: function(errors, window) {
    var $ = window.$;
    // find all the html links to the street zip files
    $('#downloads .months table tr td:nth-child(2) a').each(function(idx, elem) {
      // push the url (href attribute) onto the list
      out['streets'].push( $(elem).attr('href') );
    });
  });
});

As an example of Cheerio scraping here's an example from work scraping info the EU's TED database (sample sample):

var url = http://files.opented.org.s3.amazonaws.com/scraped/100120-2011/summary.html;
// place to store results
var data = {};
// do the request using the request library
request(url, function(err, resp, body){
  $ = cheerio.load(body);

  data.winnerDetails = $('.txtmark .addr').html();

  $('.mlioccur .txtmark').each(function(i, html) {
    var spans = $(html).find('span');
    var span0 = $(spans[0]);
    if (span0.text() == 'Initial estimated total value of the contract ') {
      var amount = $(spans[4]).text()
      data.finalamount = cleanAmount(amount);
      data.initialamount = cleanAmount($(spans[1]).text());
    }
  });
});

Flattr this!