Data roundup, January 23
We’re rounding up data news from the web each week. If you have a data news tip, send it to us at [email protected]
TOOLS, COURSES AND EVENTS
A new open source tool, freeDive, makes it easier for journalists to publish datasets with their articles. FreeDive transforms Google Spreadsheets into searchable databases that can be embedded on webpages. Tutorials can be found on the Knight Digital Media Center webpage, and the freeDive source is available on GitHub.
Want a thorough and practical introduction to data preparation for machine learning applications? Chris Clark’s introduction to the Pandas data analysis library for Python will lead you step by step through the process of preparing a dataset from NYC Open Data and using it to derive predictions.
Peter Aldhous has posted a collection of learning materials for data-driven journalism drawn from his classes at UC Santa Cruz. Data representation, visualization, and analysis are all introduced in these tutorials and slides.
Data Mining Cup 2013, a student competition held by prudsys AG, has been announced. Students from the more than 20 participating countries can pit their data analysis skills against one another in a data mining challenge to be announced on April 3. Registration for the competition begins on March 4.
In hackathon news, registration is still open for the Domestic Violence Hackathon taking place across South America and in Washington, DC, on January 26-27. The 2013 Open Data Day Hackathon has been announced and will take place on February 23; see the wiki for a chance to participate in a hackathon near you.
Statistician Andrew Gelman is assembling a different kind of “data story”: Gelman wants to find 365 statisticians to write vignettes about their statistical lives. These life stories will then be made available on the American Statistical Association’s blog, one a day for a year, providing a unique look into the data wrangling life.
LIVES, the new open data standard for restaurant health inspection information developed by Yelp in partnership with the cities of San Francisco and New York, has enabled what has been described as “a huge step for open data”. Yelp is now incorporating health inspection data into restaurant listings for San Francisco and New York, with extensions to Philadelphia, Boston, and Chicago expected soon.
In other social directory service news, Foursquare has released a map visualization of the last 500,000,000 Foursquare check-ins.
The Lower Saxony state election (January 20) has been given a comprehensive datavis dashboard by Gregor Aisch. Aisch has blogged on the new approach to visualizing coalitions he employed (with English translation).
A new web traffic dataset has been released by Indiana University. This so-called “Click Dataset”, available for non-commercial use, consists of some 2.5 terabytes of data collected by using a packet filter to detect traffic on the University’s network destined for TCP port 80.
The City of Palo Alto has opened up its library data. The data, which covers the library’s visitors, checkouts, and more, dates back as far as three decades.
The US Defense Department-funded Empirical Studies of Conflict project has launched its website, making freely available some 45 collections of “micro-level conflict data and information on insurgency, civil war, and other sources of politically motivated violence” in Afghanistan, Colombia, Iraq, Pakistan, the Philippines, and Vietnam.
Hilary Mason, chief scientist at bitly, has posted a bundle of research-quality data sets for aspiring data scientists. To quote Mason, “The list includes such exciting and diverse things as spam, belly buttons, item pricing, social media, and face recognition.”
I know what you’re saying after that last item: “Face recognition is fine, but what about cat face recognition?” You’re in luck: a new set of 2 gigabytes of annotated cat facial data, the CAT dataset, has been made available by Prof. Tang Xiaoou of the Chinese University of Hong Kong.