On the Relative Truthiness of Data

April 11, 2013 in Data Blog

One of the more common requests for activities around the use of data is for sessions or workshops on generating charts and data visualisations. Visual techniques can be used to powerful effect to illustrate a wide variety of patterns or trends that arise within a data set from correlations or other associations between its various elements. A P.A.P.-Blog, Human Rights etc. post on Statistics on Gross Domestic Product (GDP) Correlations demonstrates how simple scatterplots of GDP against a wide variety of indicators (from poverty indicators to resource availability, mortality rates to education indicators) pull out a wide range of associations between GDP and these other factors. (Working out the extent to which any of those relationships are causal is another matter, of course.)
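By way of a sketch of what such an exercise might look like in practice, here's a minimal example in Python using pandas and matplotlib; the file name and column names (gdp_indicators.csv, gdp_per_capita, mean_years_schooling) are placeholders of my own invention rather than anything taken from the P.A.P.-Blog post:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dataset: one row per country, with a GDP column and an indicator column.
df = pd.read_csv("gdp_indicators.csv")  # placeholder file name

# Scatter GDP per capita against one of the indicators.
plt.scatter(df["gdp_per_capita"], df["mean_years_schooling"], alpha=0.6)
plt.xlabel("GDP per capita")
plt.ylabel("Mean years of schooling")
plt.title("GDP against years of schooling (illustrative)")
plt.show()
```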

On many occasions, a quick glance at a chart will give us an impression of the story it has to tell – a “glanceable chart” lets us see that the points on a chart of GDP against years of schooling appear to go up and to the right, for example. If we are familiar with the convention that numerical chart axes typically represent values increasing towards the right on the horizontal x-axis, and increasing up the page on the vertical y-axis, we can quickly recognise that an up-and-to-the-right chart shows that the thing on the y-axis tends to increase as the thing on the x-axis gets larger. If the trend follows a straight line, we tend to say the two quantities are (linearly) correlated.
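To put a number on that “up and to the right” impression, here's a small sketch – using entirely made-up values – of how a Pearson correlation coefficient might be computed alongside the chart:

```python
import numpy as np

# Entirely made-up illustrative values: GDP per capita (x) and years of schooling (y).
gdp = np.array([1_000, 5_000, 12_000, 25_000, 40_000, 55_000])
schooling = np.array([4.0, 6.5, 8.0, 10.5, 12.0, 12.5])

# Pearson's r measures how close the relationship is to a straight line.
r = np.corrcoef(gdp, schooling)[0, 1]
print(f"Pearson correlation: r = {r:.2f}")  # close to 1 => strongly "up and to the right"
```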

Sometimes, however, a chart may contain within it certain inconsistencies that only become apparent if we play a more active role in reading the chart, taking a little time to think about the meaning of what is represented in the diagram, along with how the different elements relate to one another.

By way of analogy, take a look at the following photograph (I haven’t manipulated it in any way). What’s odd about it?

[Image: On the truthiness of data]

The photograph shows “real time” platform information. The relationship between four pieces of information shown there jumps out at me:

  1. the current time, in this case, 19:21:05;
  2. the departure time of the next train expected at the platform, 19:19;
  3. the status of the train, in terms of when it might be expected to arrive at the platform: “On time”, apparently (other typical status messages include things like “Expected 19:22”);
  4. the empty platform: the train hasn’t arrived yet.

One piece of “data” is wrong, and the “wrongness” jumps out at us as we interpret the different parts of the scene in terms of the information, or meaning, they are intended to convey. The next train from the platform should depart at 19:19 and it’s “On time”, but the time is 19:21. So it will necessarily depart late, not on time. [I can vouch that the train hadn't already been and gone, leaving a sign that was slow to update!;-) -Ed.]

If we trust that the sign is telling the truth in the sense of indicating the next train to depart from that platform, we need to treat the derived “On time” data element with suspicion. Unless, that is, “On time” is defined in an unusual way (such as “not more than x minutes or y seconds late”), in which case the “On time” data element might actually be true, as defined; but as a reader of the “diagram”, we would need to know what “On time” actually meant.
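As a rough sketch of the kind of cross-check being described – the tolerance value, the function name and the exact date are my own assumptions for illustration, not anything the sign's operator publishes – the consistency test might look something like this:

```python
from datetime import datetime, timedelta

# Assumed tolerance: a train still counts as "On time" if it leaves no more than
# this long after its scheduled time. The sign's real definition (if any) is unknown.
ON_TIME_TOLERANCE = timedelta(minutes=1)

def status_is_consistent(scheduled: datetime, now: datetime, status: str) -> bool:
    """Can the displayed status still be true, given the displayed clock?"""
    if status == "On time":
        # "On time" can only remain true while the tolerance window is still open.
        return now <= scheduled + ON_TIME_TOLERANCE
    return True  # other statuses ("Expected 19:22", etc.) aren't cross-checked here

# The situation in the photograph (the date itself is arbitrary):
scheduled = datetime(2013, 4, 11, 19, 19)
now = datetime(2013, 4, 11, 19, 21, 5)
print(status_is_consistent(scheduled, now, "On time"))  # False: the data disagree
```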

Here, then, we have a second consideration to take into account. Not only should we try to read the image in a meaningful way, we also need to be clear about how the creator of the chart or image defined the meaning of the things we are trying to make sense of (a classic communication problem!). Many datasets collected at an international level by NGOs through the issuing of questionnaires need to be treated with caution because we can’t always guarantee that the people filling in the questionnaires in different countries are working to the same definitions. For example, the Joint Organisations Data Initiative (JODI) Oil Manual reviews the definitions of oil industry terms as used by various oil industry reporting organisations; here’s what they note when referring to refinery output data:

APEC, Eurostat, IEA and UNSD exclude refinery loss but include refinery fuel. OPEC excludes both. The OLADE definition does not mention anything about refinery fuel or loss. Interproduct transfers are excluded by all organisations except OLADE.
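To see how much the definition can matter, here's a toy calculation – all the figures are invented, and the arithmetic is a deliberately simplified reading of those definitions – showing the same underlying numbers yielding different “refinery output” values depending on whether refinery fuel and refinery loss are included or excluded:

```python
# Invented figures (thousand barrels), purely to illustrate how definitions diverge.
gross_output = 1_000
refinery_fuel = 40   # fuel consumed by the refinery itself
refinery_loss = 10   # volume lost in processing

# APEC / Eurostat / IEA / UNSD style: exclude refinery loss, include refinery fuel.
output_excluding_loss = gross_output - refinery_loss                  # 990

# OPEC style: exclude both refinery loss and refinery fuel.
output_excluding_both = gross_output - refinery_loss - refinery_fuel  # 950

print(output_excluding_loss, output_excluding_both)
```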

Within countries or organisations too, the definitions or assumptions underlying popularly referenced indicators may change over time, creating problems particularly when it comes to constructing time series. For example, if the basis of a particular indicator changes over time, plotting the value of that indicator over time may not make sense, because one year’s basis may not be compatible with another’s.

Additional problems may arise when new indicators are defined in support of policy matters and we wish to review the evolution of those indicators. If indicator A informs policy in period 1, and is then redefined as indicator B that informs policy in period 2, does it make sense to reconstruct how indicator B evolved during period 1? A recent example of a redefined indicator can be seen in the introduction of a new RPIJ measure of Consumer Price Inflation, 1997 to 2012 by the UK Office for National Statistics.

The lesson, if any, to be learned is perhaps this: when you see a chart or a diagram, try to read meaning into it, cross-checking what the different pieces of data shown in the chart appear to be saying about themselves as well as what they say about one another in context. But at the same time, make sure you know what definitions were used in creating each piece of data, particularly if different data items have been provided by different sources, even if they are nominally referring to the same thing…
