Data Journalist in a Day
Data in itself is not a force for good. The mere availability of data, far from fostering democracy, can deepen pre-existing power inequalities. The recent data-driven expropriations in Tamil Nadu and Bangalore provide a striking example: the digitization of land records has allowed rich landowners, disproportionately well equipped to interpret the data in their favour, to further dispossess the already poor.
It is clearly necessary for data providers to give careful thought to the potential impact of their datasets. But it is equally clear that in our new world of open data, evil will prevail unless good people learn to find and communicate insights into the meaning of data. The skills and tools necessary to do this have so far been the preserve of “data journalists”—but data journalism must be brought to the people.
It is, however, easier than you might think to find stories in data and make them intelligible to others. Neither statistical training nor advanced programming skills are necessary. To get started, all you need is data and the wherewithal to spend a few hours with the tools described in this post. Together they can be learned in less than a day, and they cover the full pipeline from analysis to communication. Learning them is a substantial and empowering step.
Cleaning

There is no point looking for insight in corrupt data, and unfortunately a great deal of data is corrupt. The first step in any data exploration project is therefore cleaning: eliminating the errors, redundancies, and inconsistencies in a dataset.
The simplest way to clean data is by hand, with a spreadsheet application like Google Sheets. You can, however, speed up the process with a specialized data cleaning tool like OpenRefine. OpenRefine anticipates common problems with tabular data and allows you to eliminate them quickly. It has a bit of a learning curve, so be sure to set aside a few hours for experimentation.
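For the curious, the kind of fix OpenRefine automates can be sketched in a few lines of Python. The rows below are invented for illustration; real cleanup jobs involve many more kinds of inconsistency, but the principle is the same:

```python
def clean_rows(rows):
    """Trim stray whitespace, normalize case, and drop duplicate or
    empty rows -- a tiny slice of what OpenRefine automates."""
    seen = set()
    cleaned = []
    for row in rows:
        # Strip whitespace and collapse inconsistent casing.
        normalized = tuple(cell.strip().lower() for cell in row)
        # Skip exact duplicates and fully empty rows.
        if normalized in seen or not any(normalized):
            continue
        seen.add(normalized)
        cleaned.append(normalized)
    return cleaned

rows = [
    ["Alice ", "Chennai"],
    ["alice", "chennai"],   # duplicate once normalized
    ["", ""],               # empty row
    ["Bob", " Bangalore"],
]
print(clean_rows(rows))
# [('alice', 'chennai'), ('bob', 'bangalore')]
```

Two rows that differ only in capitalization and whitespace collapse into one; OpenRefine's "clustering" feature does this at scale, and also catches near-duplicates that this naive sketch would miss.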
Analysis

Once you have polished your data, you can peer into it to see the stories it contains. This means performing analysis: finding meaningful characterizations of the data (mean, standard deviation, and so on) and of its structure (relationships between variables).
Spreadsheet applications are the traditional tool for data analysis, but simpler and quicker alternatives are becoming more popular. StatWing is a web application that allows you to upload tabular data and perform simple statistics to arrive at a picture of your data’s properties and structure. QueryTree is a similar web application with an innovative drag-and-drop interface. Both applications aim to simplify and democratize analysis. Neither, however, is free and open software—so you may prefer to stick to free spreadsheet applications.
Charts and graphs
Once you understand your data, you can communicate your understanding. Visual presentation is the most intuitive means of imparting insight into data. The most important skill for a would-be data storyteller is therefore creating straightforward informational graphics. Excellent tools for this purpose abound.
Datawrapper is perhaps the simplest way to create and publish an infographic online: an open-source, web-based tool for making “simple and correct” charts that can be embedded in web pages. Creating a chart with Datawrapper is as simple as pasting in your data, choosing a layout, and publishing the result. Each chart Datawrapper creates includes a link to the source data, allowing your readers to verify or recreate your work. For a stylish (but non-free) alternative to Datawrapper, try Infogr.am, a new tool which allows you to create complex layouts of infographic elements.
Maps

Data’s potential impact often derives from its connection to place, and the best way to communicate this connection is a map visualization. Such maps can have a powerful effect on public policy. The Solano County Community Advocates, armed with both data and freely available data literacy training provided by the California Health Interview Survey, were able to create a map of asthma incidence that helped them successfully argue against a polluting construction project in their county.
In the world of data-driven cartography, one application stands out. Google Fusion Tables is a tool in the Google Drive suite that, in spite of still being “experimental” software, is the single most important application in journalistic cartography. Embedded maps created with Fusion Tables can be found wherever journalists have presented spatial data online; Canada’s Global News credits Fusion Tables with allowing them “to ram out dozens of census maps on the day of a census release”. Fusion Tables is a tool for merging sets of data and visualizing them. Data that includes spatial coordinates can be visualized on a Google map, or it can be linked to a custom boundary file to create “intensity maps” providing a colourful view of how some quantity varies over space. Fusion Tables includes sample datasets and tutorials that allow you to get started with the software quickly.
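The "merge" at the heart of Fusion Tables is an ordinary table join: rows from two datasets are matched on a shared key before being plotted. A minimal sketch in Python, with city names borrowed from the Solano County example but all figures and coordinates invented for illustration:

```python
# One table of measurements, one table of locations,
# joined on the shared "city" key before mapping.
asthma_rates = {"Fairfield": 11.2, "Vallejo": 13.8, "Benicia": 9.1}
coordinates = {
    "Fairfield": (38.249, -122.040),
    "Vallejo": (38.104, -122.256),
}

merged = [
    {"city": city, "rate": rate,
     "lat": coordinates[city][0], "lon": coordinates[city][1]}
    for city, rate in asthma_rates.items()
    if city in coordinates  # rows without coordinates cannot be mapped
]

for row in merged:
    print(row)
```

Note that Benicia drops out of the merged table because it has no matching coordinates; in Fusion Tables, unmatched rows are likewise simply absent from the resulting map, which is worth checking before you publish.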