Asking Questions of Data – Some Simple One-Liners
One of the great advantages of having a dataset available as data is that we can interrogate it in very direct way. In this post, we’ll see a variety of examples of how we can start to ask structured questions of a dataset.
Although it’s easy for us to be seduced by the simple search boxes that many search engines present into using two or three keyword search terms as the basis of a query, expert searchers will know that using an advanced search form, or the search limits associated with advanced search form elements, can provide additional power to a search.
For example, adding the site: search limit to a web search, as in site:schoolofdata.org, will limit the results to links to pages on a particular web domain; or using the filetype: search limit will allow is to limit results to just PDF documents (filetype:pdf) or spreadsheet files (using something like (filetype:xls OR filetype:csv) for example).
In many cases, the order in which we can add these search limits to a query is not constrained – any order will do. The “query language” is relatively forgiving: the syntax that defines it is largely limited to specifying the reserved word terms used to specify the search limits (for example, site or filetype) and the grammatical rules that say how to combine them with the limiting terms (“reservedWord colon searchLimitValue”, for example) or combine the terms with each other (“term1 OR term2”, for example).
Rather more structured rules define how to construct – and read – a web address/URL, as described in Hunting for Data – Learning How to Read and Write Web Addresses, aka URLs.
When it comes to querying a database – that is, asking a question of it – the query language can be very structured indeed. Whilst it is possible to construct very complex database queries, we can achieve a lot by learning how to write come quite simple, but nonetheless still powerful, queries over even a small database.
A short while ago, I collected some data about the candidates standing for a local election in the UK from the poll notices. You can find the data on Scraperwiki: Isle of Wight Poll Notices Scrape.
The data from the poll notices is placed into three separate tables:
- stations – a list of polling stations, their locations, and the electoral division they relate to;
- candidates – a list of the candidates, along with their party and home address, who are standing in each electoral division;
- support – a list of the people supporting each candidate on their nomination form.
Let’s look at a fragment of the candidates table:
You’ll see there are four columns:
- ward – the electoral division;
- desc – the party the candidate was standing for
- candidate – each candidate’s name
- address – the address of each candidate.
If we go to the Scraperwiki API for this scraper, we can start to interrogate this data:
If we Run the query, we get a preview of the result and a link to a URL that contains the output result presented using the data format we specified (for example, HTML table or CSV).
If we click on the result link, we get what we asked for…
So let’s have a look at some of the queries we can ask… The query language we’re using is called SQL (“sequel”) and the queries we’ll look at are made up of various parts. If you’re copying an pasting these queries into the Scraperwiki API form, note that sometimes I use single quotes that this WordPress theme may convert to some other character…which means you may need to retype those single quotes…:
- A bit that says what columns we want to SELECT in our output table;
- an optional bit that specifies WHERE we want some conditions to be true;
- an optional bit that says what we want to GROUP the results BY.
Here are some example queries on the candidates table – click through the links to see the results:
Show everything (*): SELECT * FROM candidates
How can we limit the data to just a couple of columns? Show all the candidates names and addresses, but nothing else: SELECT candidate, desc FROM candidates (If you swap the order of the column names in the SELECT part of the query, they will display in that swapped order…)
How can we find out what different values occur within a column? Show the unique electoral divisions: SELECT DISTINCT ward FROM candidates (Can you figure out how to get a list of the different unique party names (desc) represented across the wards in this election?) If we SELECT multiple columns, the DISTINCT command will display the unique rows.
How can we select just those rows where the contents of one specific column take on a particular value? Show the names of the candidates standing as Independent candidates, and the electoral division they are standing in: SELECT candidate, ward, desc FROM candidates WHERE desc='Independent'
How can we rename column labels? For example, the “desc” column name isn’t very informative, is it? Here’s how we can list the different parties represented and rename the column as “Party”, sorting the result alphabetically: SELECT DISTINCT desc AS Party FROM candidates ORDER BY Party
How do we select rows where the contents of one specific column contain a particular value? Which electoral divisions are named as electoral divisions in Ryde? SELECT DISTINCT ward FROM candidates WHERE ward LIKE '%Ryde%' (the % characters are wildcards that match any number of characters).
What do we need to do in order to be able to search for rows with particular values in multiple columns? Find out who was standing as a Labour Party Candidate in the Newport electoral divisions: SELECT DISTINCT ward, candidate, desc AS Party FROM candidates WHERE ward LIKE 'Newport%' AND desc='Labour Party Candidate'
How about if we want to find rows where one column contains one particular value, and another column doesn’t contain a particular value? Find candidates standing in Newport electoral divisions that do not appear to have a Newport address: SELECT * FROM candidates WHERE ward LIKE 'Newport%' AND address NOT LIKE '%Newport%'.
Let’s do some counting, by grouping rows according to their value in a particular column, and then counting how many rows are in each group… How many candidates stood in each electoral division: SELECT ward AS 'Electoral Division', COUNT(ward) AS Number FROM candidates GROUP BY ward We can order the results too… SELECT ward AS 'Electoral Division', COUNT(ward) AS Number FROM candidates GROUP BY ward ORDER BY Number Notice how I can refer to the column by it’s actual or renamed value in the BY elements. To sort in descending order, use ORDER BY Number DESC (use ASC, the default value, to explicitly state ascending order).
Let’s count some more. How many candidates did each party field? SELECT desc AS Party, COUNT(desc) AS Number FROM candidates GROUP BY Party ORDER BY Number DESC
Let’s just look at part of the HTML table output of that query for a moment…
If we click in the browser window and select all of that data, we can paste it into a Datawrapper form:
We can then start to generate a Datawrapper chart… Note that one of the Party names is missing – we can click on the back button and just add in Unknown (copy a tab separator from one of the other rows to separate the Party name for the count…) Here’s the final result:
As we get more confident writing queries, we can generate ever more complex ones. For example, let’s see how many candidates each party stands per electoral division (some of them returned two councillors): SELECT ward AS 'Electoral Division', desc AS Party, COUNT(ward) as Number FROM candidates GROUP BY Party,ward ORDER BY Number DESC. We can use this results table as the input to another query that tells us how many electoral divisions were fought by each party: SELECT Party, COUNT(Party) AS WardsFought FROM (SELECT ward AS 'Electoral Division', desc AS Party, COUNT(ward) as Number FROM candidates GROUP BY Party,ward ORDER BY Number DESC) GROUP BY Party.
To check this seems a reasonable number, we might want to count the distinct number of wards: SELECT COUNT(DISTINCT ward) AS 'Number of Electoral Divisions' from Candidates.
Where the parties did not stand in all the electoral divisions, we might then reasonably wonder – which ones didn’t they stand in? For example, in which electoral divisions were the Conservatives not standing? SELECT DISTINCT ward from Candidates WHERE ward NOT IN (SELECT DISTINCT ward FROM Candidates WHERE desc LIKE '%Conservative%')
Hopefully, you will have seen how by getting data into a database, we can start to ask quite complex questions of it using a structured query language. Whilst the queries can become quite complex (and they can get far more involved than even the queries show here), with a little bit of thought, and by building up from very simple queries and query patterns, you should be able to start running your own database queries over your own data quite quickly indeed…
See also: Using SQL for Lightweight Data Analysis