Why should we care about comparability in corruption data?
Does comparing diverse data-driven campaigns empower advocacy? How can comparing corruption data across countries, regions and contexts contribute to efforts on the ground? Can the global fight against corruption benefit from comparable datasets? The Engine Room tried to find some answers through two back-to-back workshops in London last April, in collaboration with our friends from School of Data and CIVICUS.
The first day was dedicated to a data expedition, in which participants explored a collection of specific corruption-related datasets. These spanned a wide range: from international perception-based datasets such as Transparency International’s Global Corruption Barometer, through national Corruption Youth Surveys (Hungary), to citizen-generated bribe reports like I Paid A Bribe Kenya.
The second day built on lessons learned in the data expedition. Technologists, data-literate practitioners and harmonization experts convened for a day of brainstorming and tool-building. The group developed strategies and sketched heuristics through an analysis of existing cases, best practices and personal experience.
Here is what we learned:
Data comparability is hard
Perhaps the most important lesson from the data expedition was that a single day of wrangling can’t even begin to grasp the immensely diverse mix of corruption data out there. When looking at scope, there was no straightforward way to link locally sourced data with the large-scale corruption indices, and linguistic and semantic challenges made comparing perceptions across countries a concern in its own right. Because the datasets were so diverse, groups spent a considerable amount of time familiarizing themselves with the available data, as well as hunting for additional datasets.

The lack of incident-reporting datasets was also noticeable. In the available datasets, corruption data usually meant corruption perception data: data from surveys gauging people’s feelings about the state of corruption in their community. Datasets containing actual incidents of corruption (bribes, preferred sellers, etc.) were less readily available. Perception data is crucial for taking society’s pulse, but it is difficult to compare meaningfully across contexts — especially given how perception shifts with cultural and social customs — and very complex to cross-correlate with incident reporting.
An important discussion also arose about the lack of technical capacity among grassroots organizations that collect data, and how that hurts data quality. For organizations on the ground it’s a question of priorities and capacity: organizations operating in dangerous areas, responding to urgent needs with limited resources, don’t necessarily consider data collection proficiency a top priority. In addition, common methods and standards in data collection empower global campaigns by remote actors (cross-national statistics, high-level policy projects, etc.) but don’t necessarily benefit the organizations on the ground collecting the data; these high-level projects may or may not have trickle-down benefits. Grassroots organizations have no reason to adopt standardized data collection practices unless it helps them in their day-to-day work: for example, tools that are easier to use, or the ability to share information with partner organizations.
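To make the standards question concrete: even a very lightweight shared schema can make differently collected records comparable. The sketch below is purely illustrative — every field name and sample record is invented — and shows two hypothetical source formats (a perception survey row and a bribe-incident row) being mapped onto one common set of fields:

```python
# Minimal sketch: mapping two hypothetical bribe-report formats onto one
# shared schema so records can be compared side by side.
# All field names and sample records here are invented for illustration.

COMMON_FIELDS = ("country", "year", "sector", "amount_usd")

def from_survey_row(row):
    """Map a (hypothetical) survey-style record to the common schema."""
    return {
        "country": row["country_code"].upper(),
        "year": int(row["survey_year"]),
        "sector": row["service"].lower(),
        "amount_usd": None,  # perception surveys carry no bribe amounts
    }

def from_incident_row(row):
    """Map a (hypothetical) incident-report record to the common schema."""
    return {
        "country": row["iso"].upper(),
        "year": int(row["date"][:4]),  # ISO date string, e.g. "2013-06-02"
        "sector": row["sector"].lower(),
        "amount_usd": float(row["bribe_amount_usd"]),
    }

survey = {"country_code": "ke", "survey_year": "2013", "service": "Police"}
incident = {"iso": "KE", "date": "2013-06-02", "sector": "police",
            "bribe_amount_usd": "25"}

records = [from_survey_row(survey), from_incident_row(incident)]
# Both records now share the same keys and can be grouped or compared.
for r in records:
    print({k: r[k] for k in COMMON_FIELDS})
```

The point is not the specific fields, but that a standard this small — a handful of agreed names and units — already lets a grassroots dataset sit next to a national one without heavy tooling.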
Data comparability is possible
While the previous section might paint a bleak picture, the reality is more positive, and the previous paragraph tells us where to look (or rather, how to look). The amorphous blob of all corruption-related data is too daunting to make sense of in the aggregate — until we flip the process on its head. As in the best detective novels, starting small and investigating specific local stories of corruption lets investigators find a thread and follow it, slowly unraveling the complex yarn of corruption towards the bigger picture. A small village in Azerbaijan complaining about the “Ingilis” contaminating its water, for example, can unravel a story of corruption leading all the way to the presidential family. This excellent example, and many more, come from Paul Radu’s investigative experience, described in the Exposing the Invisible project produced by the Tactical Technology Collective.
There are also excellent resources that collect and share data in comparable, standardized and functional ways. OpenCorporates, for example, holds information on more than 60 million companies and provides clean, machine-readable, API-accessible data, ready to be perused by humans and computers alike, and easily compared and mashed up. If your project involves digging through corporate ownership, OpenCorporates will almost certainly be able to help you out. Another project of note is the Investigative Dashboard, which collects scraped business records from numerous countries, as well as hundreds of reference databases.
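As a rough illustration of what “API-accessible” means in practice, here is a sketch of building an OpenCorporates company-search request and reading a response. The endpoint path follows the public v0.4 API, but treat both the path and the response shape as assumptions to verify against the current documentation; the sample payload below is invented, in the general shape the API returns, and no network request is made:

```python
# Sketch of querying the OpenCorporates search API and reading results.
# The endpoint path follows the public v0.4 API, and the response shape
# is a simplified assumption — check the live documentation before use.
import json
from urllib.parse import urlencode

BASE = "https://api.opencorporates.com/v0.4/companies/search"

def search_url(query, jurisdiction=None):
    """Build a company-search URL (no request is made here)."""
    params = {"q": query}
    if jurisdiction:
        params["jurisdiction_code"] = jurisdiction
    return BASE + "?" + urlencode(params)

# An invented response in the general shape the API returns:
sample = json.loads("""
{"results": {"companies": [
  {"company": {"name": "EXAMPLE HOLDINGS LTD",
               "company_number": "01234567",
               "jurisdiction_code": "gb"}}
]}}
""")

def company_names(payload):
    """Pull the company names out of a search response."""
    return [c["company"]["name"] for c in payload["results"]["companies"]]

print(search_url("example holdings", "gb"))
print(company_names(sample))
```

Because the results come back as structured JSON rather than a web page, they can be fed straight into the kind of cross-dataset comparison discussed above.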
What happens when datasets just aren’t compatible, and there is no easy way to convince the data producers to make them more usable? Many participants voiced their trust in civic hackers and the power of scraping: even if datasets aren’t provided in machine-readable, standardized or comparable formats, there are many tools (as well as many helpful people) that can come to the rescue. The best place to find both? The School of Data, of course. Apart from providing a host of useful tutorials and links, it acts as a hub for engaged civic hackers, data wranglers and storytellers all over the world.
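To give a flavor of what scraping involves at its simplest, here is a small sketch that pulls an HTML table into Python lists using only the standard library. The HTML snippet and its contents are invented; real pages are messier and usually warrant a sturdier parser such as BeautifulSoup or lxml:

```python
# A tiny scraping sketch: even when data is published only as an HTML
# table, the standard library can pull it into rows for comparison.
# The HTML snippet below is invented for illustration.
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collect text from <td> cells, one list per <tr> row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_td = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []        # start a fresh row
        elif tag == "td":
            self._in_td = True    # begin capturing cell text

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td and data.strip():
            self._row.append(data.strip())

page = """
<table>
  <tr><td>Kenya</td><td>Police</td><td>25</td></tr>
  <tr><td>Hungary</td><td>Education</td><td>40</td></tr>
</table>
"""

scraper = TableScraper()
scraper.feed(page)
print(scraper.rows)
# [['Kenya', 'Police', '25'], ['Hungary', 'Education', '40']]
```

Once the rows are in plain lists, they can be cleaned, standardized and compared like any other dataset — which is exactly the leverage civic hackers bring when producers won’t publish machine-readable data.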
Citizen engagement is key
During a brainstorm in which participants compared real-life models of data mashups (surveys, incident reporting, budget data), it became clear that many corruption investigations rely on crowdsourced verification. Crowdsourcing is a vague concept in itself, but it can be very powerful when focused on a specific use case. It matters for anti-corruption projects built around leaked data (such as the Yanukovych leaks), or FOIA requests that yield information in difficult-to-parse, non-machine-readable formats (badly scanned documents, or even boxes of paper prints). In cases like these, citizen engagement works because there are clear incentives for individuals to get involved. Localized segmentation, where citizens look only at data directly involving them or their communities, is a boon for disentangling large lumps of data, as long as the information interests enough people to generate a groundswell of activity. Verification of official information can also help, for example when investigating whether state-financed infrastructure is actually being built, or whether there is just a very expensive empty lot where a school is supposed to be.
It makes perfect sense, then, to see standardization and comparability as enabling forces for citizen engagement. The ability to mash up and compare different datasets brings perspective, giving citizens a clearer picture and enabling them to act on that information to hold their institutions accountable. However, translating, parsing and digesting spaghetti data can be so time-consuming and cumbersome that organizations may simply decide it isn’t worth the effort. At the same time, data-collecting organizations on the ground, presented with unwieldy, overly complex standards, will simply avoid using them and compound the comparability problem. The complexity of the corruption-data landscape is a challenge that must be overcome, so that the data being collected can truly inspire citizen action for change.