Data roundup, February 6
We’re rounding up data news from the web each week. If you have a data news tip, send it to us at [email protected].
TOOLS, COURSES, AND EVENTS
Machine learning developers love the hashing trick (PDF link), which makes working in the high-dimensional feature spaces typical of natural language processing less painful. Erich Owens explains how you can come to love it, too, in a detailed tutorial illustrated with Python code.
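For readers who have not met the hashing trick before, here is a minimal sketch of the idea in Python. It is not taken from Erich Owens's tutorial. The function name `hash_features`, the choice of MD5 as the hash, and the bucket count are all assumptions made for illustration. The point is that each token is hashed straight to an index in a fixed-length vector, so no vocabulary dictionary needs to be built or stored.

```python
import hashlib


def hash_features(tokens, n_buckets=16):
    """Map a list of tokens to a fixed-length count vector via hashing.

    Hypothetical helper for illustration only.
    """
    vec = [0] * n_buckets
    for tok in tokens:
        # Use a stable hash (MD5 here, an arbitrary choice) so that results
        # are reproducible; Python's built-in hash() is salted per process.
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % n_buckets
        # Distinct tokens may collide in the same bucket; that is the
        # accepted trade-off for a bounded feature dimension.
        vec[index] += 1
    return vec


features = hash_features("the cat sat on the mat".split(), n_buckets=8)
print(features)
```

The vector length stays at `n_buckets` however many distinct words appear in the text, which is what keeps high-dimensional text features manageable. Practical implementations often also use a second hash to assign a sign to each update, which this sketch leaves out.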
Version 3 of Cytoscape, an open source platform for complex network analysis and visualization, has been released. This new version of Cytoscape is a major redesign and introduces a new architecture, API, and set of user controls.
If you’re new to networks, check out a new interactive introduction to network analysis with a special focus on networks used in digital humanities research. Created by Elijah Meeks and Maya Krishnan, this app gives you a taste of various network models and their algorithms.
You can also watch a lecture on social network analysis from the intensive Computational Journalism course now a few weeks underway at the University of Hong Kong. This course continues to generate excellent learning materials on “some of the most advanced techniques used by journalists to understand digital information”.
Learn the ins and outs of building an application on open geospatial data with Clojure and Cascalog in a remarkable new slideshow from Paco Nathan, author of the forthcoming O’Reilly book “Enterprise Data Workflows with Cascading”.
The recent release of D3.js 3.0 brought dramatic improvements to D3’s geographic projection system, including new transitions for projections. Jason Davies has made a cool interactive showcase of these transitions for you to play with.
Looking for a way to start learning D3.js, but intimidated by the huge number of tutorials on the topic? Look no further than the D3.js Meta Tutorial, a handy new “guide to guides” for D3.js.
If you’re interested in getting started with Small Area Estimation to improve the precision of your statistical estimates, check out this collection of reference materials and learning resources from Jerzy Wieczorek.
DATA STORIES
The only thing Periscopic’s Kim Rees wrote in her notebook on February 4th: “own it”. Own what? Periscopic’s stunning and heartbreaking new interactive visualization of US gun murders in 2010, I assume, easily the week’s most disturbing and emotionally arresting data visualization.
Guns in the United States have been the subject of another recent piece of journalistic data visualization. The New York Times has published a map tracing the origins of the 50,000 guns seized in Chicago over the past decade.
InfoAmazonia has posted an interactive map of Amazonian deforestation and the exploitative mining, hydroelectricity, and ranching forces that are driving it. The open data used to create the map is freely available on the website.
OKF contributor Duarte Romero Varela has mapped 600 Birmingham eateries that fail to meet basic hygiene standards and published the data in a cleaned and accessible form.
The Museum of Modern Art has made an interactive graph visualization of connections between the early 20th century pioneers of abstract art entitled “Inventing Abstraction”. This graph illuminates the flow “of ideas moving through a nexus of artists and intellectuals working in different mediums and far-flung places”.
A new app allows you to explore the Bundestag voting decisions of German MPs. The results of votes are also available as lists of names, in PDF and Excel formats.
Hilary Parker examines the claim that “Hillary” is “the most poisoned name of all time” in great quantitative detail in a blog post, complete with code available on GitHub.
DATA SOURCES
Russia is joining the open data movement with a series of new data portals. A portal for Moscow city data leads the way, and a portal for the city of Perm is scheduled to follow. Open access to all non-classified databases of all government agencies is planned by July 15th. Read about it on Voice of Russia.
Japan’s burgeoning participation in the open data movement has made news recently. The economic ministry of Japan has launched a new data portal with a Creative Commons license. Built on CKAN and dubbed a “trial beta version” by the government, the site offers 79 data sets on such topics as energy use, manufacturing, and intellectual property.
Thousands of censored Sina Weibo posts have been collected by the Journalism and Media Studies Centre at the University of Hong Kong since last February. The JMSC’s colleagues at the China Media Project select and explain some of these deleted posts.
The City of Oakland has launched a new data portal, serving 53 sets of municipal data. To encourage would-be analysts and developers, the site’s splash page links directly to a series of video tutorials on data exploration and to information on the site’s API.
The Open Knowledge Foundation’s CKAN platform will be used in the next iteration of the US government’s data portal, further solidifying CKAN’s status as an industry standard for data management.
Accessing World Bank open data from within Stata has become easier thanks to the release of a new Stata module, wbopendata. This module collects and presents data from 7,349 of the World Bank’s development indicators.
The data portal for the City of Gatineau, Québec, is now online. The first data sets to be made available focus especially on arts and entertainment but also include data on road work and contracts.