Data roundup, April 17

April 17, 2013 in Data Roundup

We’re rounding up data news from the web each week. If you have a data news tip, send it to us at [email protected].

Photo credit: Anonymous


Sooner or later, we all have to deal with bad data. A new post by Abbott Katz explains how you can use Excel to cope with bad data, walking through three real-life problems of data integrity and their solutions.

The Open Research Data Handbook is starting to take form, and it needs your help. An open call for case studies for the ORDH has been issued in search of stories of the benefits and challenges of open research data.

Version 1.2 of Crossfilter has been released. Crossfilter is “a JavaScript library for exploring large multivariate datasets in the browser”. It is fast, scalable, and Apache-licensed.

rCharts is an R package that allows you to produce interactive JavaScript visualizations from within R. rCharts was previously based on polycharts.js alone, but its interface has recently been expanded to cover a total of three popular JS visualization libraries.

Topic modeling is a machine learning technique which is garnering a great deal of attention not only in computer-assisted journalism but also in the digital humanities. The new issue 2.1 of the Journal of Digital Humanities, available freely online, focuses on topic modeling and its place in the humanities.

“Anyone can capture volumes of social media data, but what do you do with it then?” Watch a new lecture from Jonathan Stray on using Overview to mine masses of social media text for insights.

Inquiring minds want to know: how can you get started doing data work with Python? Data Community DC, fresh out of PyCon and PyData, has put together a useful collection of resources, including books, IDEs, and overviews of specific packages.


The deadly explosions at the Boston marathon this Monday have left many people desperate for information which has not been forthcoming. As the Guardian observes, “social media, both good and bad, has filled the information space”. The British newspaper has curated an overview of information (and misinformation) on the bombings.

What did Twitter look like when Margaret Thatcher died? A new blog post by Andy Pryke investigates. There is, as you may expect, a fair amount of cussing.

Also worth a look: beautiful and lucid “data stories on India, one chart at a time”, contributed by an anonymous journalist based in Delhi. Check out a poverty map of India, reflections on 1979, and much more.

New research by Giovanna Ceserani is reconstructing the social networks of 18th-century European travelers. “Ceserani’s digital humanities project, the Grand Tour Travelers, has uncovered unexpectedly close connections between intellectuals, illuminated the rise and fall of cities, and occasionally offered warnings about how visualization can sometimes prove misleading.”

The Berliner Morgenpost has created a flight route radar app, showing all flights passing over locations in and around Berlin (“jede Fluglinie, jede Flughöhe, jeden Lärmwert zu jeder Tages- und Nachtzeit”). The radar is the subject of several writeups on the site.

Remember how the DC code finally became publicly available last week? Well, what happened when it did? Quite a lot, as Eric Mill’s new blog post testifies: projects were launched, a browser was built, a legal glossary was started, and more.

Can social media activity be used to forecast postpartum depression and other post-birth behavioral changes? Follow the Crowd investigates, studying “a variety of behavioral measures in a cohort of about 400 mothers during the prenatal period”, and achieves impressive results.


GloWbe is a corpus of “global web-based English”, covering English usage from 20 countries across 1.8 million web pages and 1.9 billion words, released by Brigham Young University. This corpus appears to only be accessible via a web interface but will nevertheless be a valuable resource for dialect-sensitive empirical investigation of English.

To aid natural language processing research in relation extraction—the identification of relationships holding between entities—Google is releasing a dataset of human-annotated data on two relations (“place of birth” and “graduated from”).

A new version of the Groningen Meaning Bank is out. The GMB is a free—and freely editable—bank of public domain English texts annotated with deep semantics. The semantic annotations are in a variant of the influential Discourse Representation Theory of meaning.

The Internet Archive has added over 450,000 journal articles from the JSTOR Early Journal Content collection of pre-1923 materials. The new materials amount to more than two terabytes and are available for bulk harvesting.

OpenStreetMap has released the OSM GPX track files in bulk. As they put it modestly, the dataset is “fairly large” at 2.6 trillion GPX points and 260 gigabytes of data. The data is available here.

The Italian province of Lucca has launched a new data portal. As reported by EPSI Platform, the portal is “federated with, which means that data is searchable and accessible directly from the latter without these being duplicated across multiple servers”.

The Danish city of Aarhus has also opened a new data portal. The new portal, built on CKAN, has already made 27 datasets available.

Finally, the Canadian city of Halifax has launched its own data portal. The new site is “part of a 12-month pilot project making 17 datasets freely available”.
