Data roundup, May 15

May 15, 2013 in Data Roundup

We’re rounding up data news from the web each week. If you have a data news tip, send it to us at [email protected].

Photo credit: Josh More


In “the most significant piece of CKAN news since the project began”, version 2.0 of CKAN has been released. CKAN, the Open Knowledge Foundation’s flagship, has been updated with a new API, new design features, overhauled documentation, and more.

Spain’s first data journalism conference, las I Jornadas de Periodismo de Datos y Open Data en España, will be held from May 24 – 26 by the Spanish chapter of the OKFN. Held simultaneously in Barcelona and Madrid, the conference’s events will include workshops and a hackathon.

Last week, data journalists from across Europe gathered in Brussels for the third Data Harvest Conference. A blog post from the International Consortium of Investigative Journalists explains what you missed if you weren’t there.

Learn how to find stories in data at a one-day introduction to open data for journalists. The course will take place this September 19 at the Open Data Institute in London. Lisa Evans of the OKFN and Kathryn Corrick of the ODI will preside.

The first issue of Network Science, a new journal for the emerging discipline “using the network paradigm […] to inform research, methodology, and applications from many fields across the natural, social, engineering and informational sciences”, is available for free. For a sense of the growing importance of networks and graphs to data analysis, read a GigaOM article on the rise of the graph in big data.

Markov networks are a powerful way to represent multivariate probability distributions. A new blog post shows you how to work with Markov networks in Haskell, the gloriously algebraic programming language, using the HLearn library. This approach exploits the networks’ algebraic structure to get “online and parallel training algorithms ‘for free’”.

Ben Frederickson, tackling the twin challenges of learning JavaScript and D3.js, shows how to create “the simplest possible visualization [he] could think of”, the Venn Diagram. His blog post explores the challenges of the task and gives examples of interesting results.

Data for Radicals is on a roll. Lisa Williams has released yet another excellent and “absurdly illustrated” guide to data wrangling, this time explaining sortable, searchable online data tables.


The cicadas have arrived! The East Coast of America is playing host to swarms of 17-year cicadas. The Radiolab Cicada Tracker project, which has been gathering data from volunteers using $80 sensors to predict the cicadas’ arrival, is starting to show results on its map. The cicadas have made it to Manhattan!

The GDELT dataset is beginning to bear fruit. A new interactive from New Scientist uses GDELT data to construct a hexagon-binned map of violent events in Syria since 2011. “The resulting view suggests that the violence has subsided in recent months, from a peak in the third quarter of 2012.”

Following up on their map of slurs on President Obama, Floating Sheep has constructed a map of “a broader swath of discriminatory speech in social media, including the usage of racist, homophobic and ableist slurs”. Their work draws on over 130,000 geotagged tweets, tagged for offensiveness by human annotators. As observed by Jen Lowe, the Twitter conversation around the map “is a fantastic reality check on the data”.

Data on medical provider charges across the United States has been released, showing “significant variation across the country and within communities in what hospitals charge for common inpatient services”. The Washington Post analyzes the data and finds that “even on the same street, hospitals can vary by upwards of 300 percent in price for the same service”.

Also from the Washington Post comes a profile of baseball player Bryce Harper’s swing, annotated with remarkably lucid informational graphics.

How does ESPN discuss white and non-white quarterbacks? Trey Causey (whom you may remember from last week’s investigation of R-help’s cruelty) investigates this question in an analysis of more than 36,000 ESPN articles that uncovers a number of interesting asymmetries.

In another investigation into discourse asymmetries, UNC’s Neal Caren asks: does the New York Times write differently about men and women? The post shows how to explore this question using Python and its natural language processing toolkit NLTK.
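To give a flavor of this kind of analysis, here is a minimal sketch, not the code from Caren’s post. It uses only the Python standard library rather than NLTK, and the pronoun lists, the function name `gendered_word_counts`, and the sample text are all assumptions made up for illustration. It counts the words that appear in sentences mentioning only male or only female pronouns.

```python
import re
from collections import Counter

# Hypothetical pronoun lists; a real study would choose these more carefully.
MALE = {"he", "him", "his"}
FEMALE = {"she", "her", "hers"}
PRONOUNS = MALE | FEMALE


def gendered_word_counts(text):
    """Count words in sentences that mention only male or only female pronouns."""
    counts = {"male": Counter(), "female": Counter()}
    # Naive sentence split on ., ! or ?; NLTK offers proper tokenizers for this.
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = re.findall(r"[a-z']+", sentence)
        has_male = any(w in MALE for w in words)
        has_female = any(w in FEMALE for w in words)
        if has_male == has_female:
            continue  # skip sentences with both kinds of pronoun, or neither
        key = "male" if has_male else "female"
        counts[key].update(w for w in words if w not in PRONOUNS)
    return counts


# Invented sample text, for illustration only.
sample = "He scored the goal. She chaired the board. They left early. He and she argued."
result = gendered_word_counts(sample)
print(result["male"].most_common(3))
print(result["female"].most_common(3))
```

Comparing the two counters, for instance by the ratio of each word’s frequency in one group to its frequency in the other, is one simple way to surface asymmetries.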

How much money is China investing in Africa? Aid Data China, in the first application of its “media-based data collection” methodology for “systematically collect[ing] open-source information about development finance flows from suppliers that do not publish their own project-level data”, has created a database of Chinese finance flows into Africa, encompassing over 1,700 projects.

Check out the latest edition of the weekly VisualLoop Data Viz News for a gigantic collection of data visualization news, articles, and resources.


On May 9, the U.S. government issued an executive order and memorandum “making open and machine readable the new default for government information”. To paraphrase Joe Biden, this is a rather big deal. The OKFN’s Rufus Pollock unpacks the executive order in a blog post, Joshua Tauberer takes a close look, and David Eaves offers his thoughts. Remarkably, the US Open Data Policy has been drafted and released on GitHub.

As explained in “Data Stories” above, data on United States medical provider charges is now available for download in Excel and CSV form.

A new portal for Latin American datasets, OpenData Latinoamérica, is a central repository bringing together the continent’s scattered data sources. A blog post (in Spanish) explains the repository’s significance to data journalism in South America.

A portal for the U.S. State of Maryland has been launched, “offering state data not accessible to the public before”, including “handgun permits, vendor payments, vehicle accidents, licensed veterinary clinics and per capita electricity consumption” (source).
