Data roundup, May 8

May 8, 2013 in Data Roundup

We’re rounding up data news from the web each week. If you have a data news tip, send it to us by email.

Photo credit: George Lu


“Data is hard. Really, really hard,” says Dan Sinker, “[and] one of the hardest parts is cleaning.” That’s why OpenNews is announcing two code sprints for data cleaning. The sprints will work on developing Dedupe and the FMS parser.

Yes, data is hard. Luckily, the School of Data is now ready to answer your questions about data through a new service that draws on the expertise of the School of Data community to clarify the problems that come with working with data.

The Ghana Data Bootcamp, taking place May 27 to 29 in Accra, aims to bring together journalists, web programmers, and activists to foster the use of public data in Ghana. Participants will build data-driven content using their new data literacy in competition for a seed grant of $1,000. Registration is open, and free seats are extended to “journalists, developers, and digital creatives”.

Reflections are pouring in from the Open Knowledge Foundation’s School of Data Journalism 2013, which ran from April 24 to 27. Moran Barkai has written a piece looking back in awe at the event. Ten ideas to remember (French) from SDJ2013 have also been compiled.

The G8 Conference on Open Data for Agriculture was held last week in Washington, D.C., and included the announcement of a food, agriculture, and rural data portal. Recorded webcasts from the conference are available on the World Bank’s website.

Skeptical about the hype surrounding “big data” and “data science”? Good! Join others at the first-ever NYC Data Skeptics Meetup this June 19. The meetup aims to foster a critical perspective on “mathematical, ethical, and business aspects of data”.

A recording of Jonathan Corum’s much-loved keynote address at the Tapestry Conference is now available online.

Lisa Williams of Data For Radicals celebrates the legalization of gay marriage in Rhode Island by providing you with an absurdly illustrated guide to your first data-driven timeline. The guide walks you through the process of using Timeline.js to create an interactive timeline from start to finish.

Learn Pandas, the Python data analysis library. “Learn Pandas” is a collection of Python notebooks—updated to mark the recent 0.11 release of Pandas—organized into lessons to help you get up and running with Pandas.

Once you’ve learned Pandas, learn bearcart, a Python library “for creating Rickshaw visualizations with Pandas timeseries data structures”. In other words, bearcart does for time series graphs what Vincent does for Vega: it makes it easier to get from code to visualization.


A new paper by Harvard researchers explores the nature of Internet censorship in China (pdf link). Analyzing the content of millions of censored social media posts from over 1,400 different services, the researchers arrive at a surprising new theory of Chinese internet censorship.

This week’s data roundup period begins on May Day. Business Insider CEO Henry Blodget commemorates the occasion by graphing the plight of the worker under modern economic conditions. Felix Salmon reflects on the graphs and their depressing implications.

The UK local elections provide “an opportunity to put some of the open data released by UK local and county council elections to a practical test”. A School of Data blog post provides a detailed first look at “proving the data” with exploratory data visualization and mapping, and blog posts by Tony Hirst round up live election data initiatives and look to see whether election data has a story to tell.

Car2go is a car-sharing service offering one-way rental cars charged by the minute. Disposable Cars tracks these momentary rentals in their last three days of travels around Portland in the form of a time-evolving map.

Bolides is an animated visualization of the last 1,152 years of meteorite sightings, beginning in Nogata, Japan, and ending in Battle Mountain, USA. The rain of destruction unfolding across centuries is strangely relaxing—but watch out for the Sikhote-Alin meteorite of 1947!

The history of San Francisco place names is the subject of a new interactive map made by Noah Veltman. Zoom and click through the map for an amazing guided tour through the onomastics of San Francisco.

Data journalism news from the web is also being rounded up elsewhere and posted on a monthly basis. Check out the April data journalism roundup, which links back to the School of Data’s roundup.


The CMU movie summary corpus comprises 42,306 movie plot summaries with aligned metadata extracted from Wikipedia and Freebase, accompanied by summaries preprocessed with Stanford CoreNLP. It is the basis of a forthcoming computational linguistics research paper, “Learning Latent Personas of Film Characters”.

The Center for Investigative Reporting has released an API for data related to its backlog of Veterans’ Affairs disability claims, making it easier to reuse the CIR’s data to produce work like its interactive map of claims backlogs.

The Sunlight Foundation has opened a new API user hub to focus more attention on its sizeable base of API users. The hub provides an overview of Sunlight APIs and a showcase of their associated projects.

Norway is slated to release its topographic datasets to the public. These include, according to Bjørn Sandvik, “topographic datasets at 1:50,000 scale […] together with address, road and cadastre data”.

A digital collection of over 38,000 historical maps has been released by the Digital Public Library of America. These maps are accessible through the DPLA API, as well as through the DPLA portal.
