Data roundup, April 10

April 10, 2013 in Data Roundup

We’re rounding up data news from the web each week. If you have a data news tip, send it to us at [email protected].

Photo credit: Jeff Kubina



If you’re new here, you may be wondering: what is a “data visualization tool”? This blog post explains.

One example is QueryTree, the beta of which is now open to the world. QueryTree is a browser-based data visualization tool with an innovative drag-and-drop interface, promising powerful data exploration with fewer technical challenges.

But if you already know what data visualization is, you will enjoy Andy Kirk’s thoughtful contribution to the recent discussion of data storytelling. Kirk’s post digs into the typology of data visualizations and the criteria for success in dataviz.

Tabula is Source’s new PDF-to-CSV data extractor which is already being hailed as amazing, magical, and so on by data journalists. The agony of dealing with PDF data releases has now been significantly ameliorated.

A Hackathon for Kids in Berlin is being sponsored by HacKIDemia and OpenTechSchool—and they’re looking for coaches. If you’re interested in hacking and like kids, please apply within. (Fluency in German not required.)

All presentations from Code With Me, a two-day workshop training journalists in the foundations of web standards coding, are now available online. These online releases include, as a bonus, “the secret speaker notes that [they] embedded into the presentations”.

“Coding is expensive and slow, journalism should be cheap and fast.” A new blog post by Esa Mäkinen shows how Finnish newspaper Helsingin Sanomat resolves this dilemma by use of a sort of “style book for data journalists”—a set of templates for quick and effective journalistic apps.

Mongkie is a network visualization platform described by the creators of Gephi as a “crazy merge of Gephi and Cytoscape in a single app made by the Korean Bioinformation Center”, effectively a biology-specialized fork of Gephi.

Social network analysis can be used to understand conflicts of interest. A new blog post by Sebastián Pérez Saaibi and Juan Pablo Marín Díaz demonstrates by walking through “a real example of the network formed by the Management Board of the 50 largest non-financial companies in Colombia”.

Julia is an emerging high-performance statistical programming language. A new post on the Julia blog demonstrates its parallel computing capabilities by showing how to do distributed numerical optimization in Julia.


Margaret Thatcher has died. The Guardian presents Thatcher’s legacy in 15 charts, tracing some of the “huge economic, demographic, and social change” undergone by Britain during her rule.

A vast web of tax evasion has come to light through the leak of some 2 million emails and financial documents to the International Consortium of Investigative Journalists. A post on the ICIJ blog explains how the ICIJ analyzed the leaked files.

A three-part video series from the Guardian explores the history of data journalism, both at the influential newspaper and beyond, concluding with the coverage of the London Olympics.

How does information evolve as it flows through different social media networks? How does the diffusion of information about expected and unexpected events differ? Dr. Scott Hendrickson of social media analysis company Gnip provides an illustrated answer.

An interactive map tracks the past 12 years of Arizona border control. The map displays apprehensions, agents, and more for each year, with prose commentary.


WikiLeaks has completed Project K, which turns out to be the release of 1.7 million diplomatic records from the period 1973 to 1976 (“the Kissinger Cables”), and has simultaneously launched the Public Library of US Diplomacy (PlusD), a searchable repository of more than 2 million formerly restricted US diplomatic documents. The source code for the map and graph modules of PlusD is up on GitHub.

Sexualitics is a new project contributing to the quantitative understanding of human sexuality by, in the first instance, releasing datasets for the analysis of online pornography. These include two sets of porn site metadata covering some two million entries.

Hatebase is a new online database comprising “the world’s largest online repository of structured, multilingual, usage-based hate speech”. The goal of the database, as reported in Ars Technica, is to provide governments and NGOs with early warning signs of genocide.

The European Union data hub has a nice new interface. The EU portal’s 5,884 hosted datasets should now be more pleasant to browse and use.
