Learning to Listen to your Data
School of Data mentor Marco Túlio Pires has been writing for our friends at Tactical Tech about journalistic data investigation. This post “talks us through how to begin approaching and thinking about stories in data”, and it was originally published on Exposing the Invisible’s resources page.
Journalists used to carefully establish relationships with sources in the hope of getting a scoop or obtaining a juicy secret. While we still do that, we now have a new source which we interrogate for information: data. Datasets have become much like those real sources – someone (or something!) that holds the key to many secrets. And as we begin to treat datasets as sources, as if they were someone we’d like to interview, to ask meaningful and difficult questions to, they start to reveal their stories, and more often than not, we come across tales we weren’t even looking for.
But how do we do it? How can we find stories buried underneath a pile of raw data? That’s what this post will try to show you: the process of understanding your data and listening to what your “interviewee” is trying to tell you. And instead of giving you a lecture about the ins and outs of data analysis, we’ll walk you through an example.
Let’s take an example from The Guardian, the British newspaper that has a very active data-driven operation. We’re going to (try to) “reverse engineer” one of their stories in the hope that you get a glimpse of what happens when you go after information that you have to compile, clean, and analyse, and of the kinds of decisions we make along the way to tell a story out of a dataset.
So, let’s talk about immigration. Every year, the Department of Immigration and Border Protection of Australia publishes a bunch of documents about immigration statistics down under. In a story published last year, the team at The Guardian focused on a report called Asylum Trends for 2011-2012. There’s a more up-to-date version available (2012-2013). By the end of this exercise, we hope you can use the newer version to compare it with the dataset used by The Guardian. Let us know in the comments about your findings.
The article starts with a broad question: does Australia have a problem with refugees? That’s the underlying question that helps make this story relevant. It’s useful to start a data-driven investigation with a question, something that bothers you, something that doesn’t seem quite right, something that might be an issue for a lot of people.
With that question in mind, I quickly found a table on page 2 with the total number of people seeking protection in Australia.
Let’s make a chart out of this and see what the trend is. Because this is a pesky PDF file, you’ll need to either type the data by hand into your spreadsheet processor or use an app to do that for you. For a walkthrough of a tool that does this automatically, see the Tabula example here.
After putting the PDF into Tabula this is what we get (data was imported into OpenOffice Calc):
I opened the CSV file in OpenOffice Calc and edited it a bit to make it clearer. Let’s see how the number of people seeking Australia’s protection has changed over the years. Using the Chart feature in the spreadsheet, we can compare columns A and D by making a line chart.
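If you would rather script the clean-up than do it by hand in the spreadsheet, here is a minimal sketch in Python using only the standard library. It assumes the Tabula export is a CSV where the first column (A) holds the year label and the fourth column (D) holds the total, as described above. The helper names `parse_count` and `load_series` are made up for this example, and the sample rows are placeholders, not figures from the report.

```python
import csv
import io

def parse_count(cell):
    """Turn a cell such as '1,234' or ' 987 ' into an int; return None if it is not a number."""
    cleaned = cell.replace(",", "").replace(" ", "")
    return int(cleaned) if cleaned.isdigit() else None

def load_series(csv_text, label_col=0, value_col=3):
    """Return (label, value) pairs from two columns, skipping headers and non-numeric rows."""
    series = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) <= max(label_col, value_col):
            continue  # row is too short to contain both columns
        value = parse_count(row[value_col])
        if value is not None:
            series.append((row[label_col].strip(), value))
    return series

# Placeholder rows for illustration only, NOT figures from the report.
SAMPLE = """Year,Group 1,Group 2,Total
Y1,"1,000",200,"1,200"
Y2,900,300,"1,200"
Note: example footnote,,,
Y3,"1,500",500,"2,000"
"""

series = load_series(SAMPLE)
print(series)
```

PDF extractions often leave thousands separators, stray spaces and footnote rows behind, which is why the sketch keeps only rows whose total parses as a whole number. To use it on your own export, read the file's text and pass it to `load_series`.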
Take a good look at this chart. What’s happening here? On the vertical axis, we see the total number of people asking for Australia’s protection. On the horizontal axis, we see the timeline year by year. Between 2003 and 2008, there’s no significant change. But something happened from 2009 on. By the end of the series, the total is almost three times what it was during those flat years. Why? We don’t know yet. Let’s take a look at other data from the PDF and use Tabula to import it into our spreadsheet. Maybe that will show us what’s going on.
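A chart shows you the shape of a trend, but it helps to put a number on it before you write "three times higher". The sketch below is one way to do that in Python: it compares the last value in a series with the average of a baseline period, and lists the change from each year to the next. The function names are hypothetical and the figures are placeholders chosen to make the arithmetic easy to follow, not data from the report.

```python
def growth_ratio(series, baseline_labels):
    """Ratio of the last value in the series to the mean of the baseline period."""
    values = dict(series)
    baseline = [values[label] for label in baseline_labels]
    baseline_mean = sum(baseline) / len(baseline)
    last_value = series[-1][1]
    return last_value / baseline_mean

def year_on_year(series):
    """Percentage change between each entry and the one before it."""
    changes = []
    for (_, prev), (label, curr) in zip(series, series[1:]):
        changes.append((label, round(100 * (curr - prev) / prev, 1)))
    return changes

# Placeholder figures for illustration only, NOT data from the report.
totals = [("Y1", 100), ("Y2", 100), ("Y3", 150), ("Y4", 300)]

ratio = growth_ratio(totals, baseline_labels=["Y1", "Y2"])
changes = year_on_year(totals)
print(ratio)    # 3.0
print(changes)  # [('Y2', 0.0), ('Y3', 50.0), ('Y4', 100.0)]
```

Which years count as the baseline is an editorial decision, so say which ones you used when you report the ratio.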
Australia divides its refugees into two groups: those who arrived by boat and those who arrived by air. It uses the acronyms IMA and non-IMA (IMA stands for Irregular Maritime Arrivals). Let’s compare the totals of the two groups and see how they relate across the years presented in this report. Using Table 4 and Table 25, we’ll create a new table that has the totals for the two groups. Be careful, though: the non-IMA table goes back to 2007, but the IMA table goes back only as far as 2008, so compare only the years that both tables cover. Let’s create a line chart with this data.
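Because the two tables do not start in the same year, it is safer to line them up on the years they share than to paste the columns side by side. Here is a small Python sketch of that step. It assumes each table has already been reduced to a mapping from year label to total. The function names are hypothetical and the numbers are placeholders, not figures from Table 4 or Table 25.

```python
def align_on_shared_years(ima, non_ima):
    """Keep only the years present in both tables, in sorted order."""
    shared = sorted(set(ima) & set(non_ima))
    return [(year, ima[year], non_ima[year]) for year in shared]

def first_crossover(rows):
    """Return the first year in which the IMA total exceeds the non-IMA total, if any."""
    for year, ima_total, non_ima_total in rows:
        if ima_total > non_ima_total:
            return year
    return None

# Placeholder figures for illustration only, NOT data from the report.
# The non-IMA series starts one year earlier, mirroring the mismatch between the tables.
non_ima = {"Y0": 500, "Y1": 520, "Y2": 540, "Y3": 560}
ima = {"Y1": 100, "Y2": 400, "Y3": 600}

rows = align_on_shared_years(ima, non_ima)
crossover = first_crossover(rows)
print(rows)       # [('Y1', 100, 520), ('Y2', 400, 540), ('Y3', 600, 560)]
print(crossover)  # Y3
```

The extra non-IMA year is dropped rather than charted against a missing IMA value, so the two lines always cover the same period.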
What’s that? It seems that in 2011-2012, for the first time in this time series, the number of refugees arriving in Australia by boat surpassed those landing by plane. The next question could be: where are all the IMA refugees coming from? We already have the data from Table 25. Let’s make a chart out of that, considering the period 2011-2012. That would be columns A and E of our data. Here’s a donut chart with the information:
Afghanistan (deep blue) and Iran (orange) alone represent more than 64% of all IMA refugees in Australia in 2011-2012.
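A donut chart is really a picture of percentages, and it is worth checking them yourself before quoting a combined share. The Python sketch below works out each country's share of the total and the combined share of the largest few. The function names are hypothetical, and the country labels and counts are placeholders, not the values from the report.

```python
def shares(counts):
    """Percentage share of each category, largest first, rounded to one decimal place."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [(name, round(100 * count / total, 1)) for name, count in ranked]

def top_n_share(counts, n):
    """Combined percentage share of the n largest categories, computed from raw counts."""
    total = sum(counts.values())
    largest = sorted(counts.values(), reverse=True)[:n]
    return round(100 * sum(largest) / total, 1)

# Placeholder counts for illustration only, NOT data from the report.
arrivals = {"Country A": 450, "Country B": 250, "Country C": 200, "Other": 100}

breakdown = shares(arrivals)
top_two = top_n_share(arrivals, 2)
print(breakdown)  # [('Country A', 45.0), ('Country B', 25.0), ('Country C', 20.0), ('Other', 10.0)]
print(top_two)    # 70.0
```

The combined share is computed from the raw counts rather than by adding the rounded percentages, which avoids small rounding errors creeping into the figure you publish.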
From here, there are a lot of routes we could take. We could use the report to take a look at the age of the refugees, like the folks at The Guardian did. We could compare IMA and non-IMA countries and see if there’s a stark difference and, if so, ask why that’s the case. We could look at why Afghans and Iranians are travelling by boat and not plane, and what risks they face as a result. How does the data in this report compare with the data from the more recent report? The analysis could be used to come up with a series of questions to ask the Australian government or a specialist on immigration.
Whatever the case might be, it’s worth remembering that finding stories in data should never be an end in itself. We’re talking about data that’s built on the behavior of people, on the real world. The data is always connected to something out there; you just need to listen to what your spreadsheet is saying. What do you say? Got data?