Data roundup, June 19

June 19, 2013 in Uncategorized

We’re rounding up data news from the web each week. If you have a data news tip, send it to us at schoolofdata@okfn.org.

Photo credit: Mike S

TOOLS, COURSES, AND EVENTS

The G8 Open Data Charter was unveiled this past Tuesday at the 39th G8 Summit. The charter reaffirms the G8 countries’ commitment to open data and sets an “open by default” policy for government data. The Telegraph discusses the significance of the charter.

There is, however, still much to be done. The Open Knowledge Foundation launched a preview of the 2013 Open Data Census in time for the G8 Summit, and this preview suggests that “G8 countries still have a long way to go in releasing essential information as open data”.

Also released in time for the G8 Summit is a pilot of the open company data index from OpenCorporates, supported by the World Bank. The index highlights global corporate participation in the move towards transparency in data.

The Open Knowledge Foundation has announced the launch of the Panton Fellowships, awards valued at £8,000 per annum which will reward scientists who actively promote scientific open data. Applications are now being accepted.

The Open Data Institute has announced the beta of Open Data Certificates, a website which allows data publishers to self-report and certify their data’s adherence to openness standards.

The State of New York has published a provisional open data handbook—on GitHub. The handbook is “a general guide for government entities participating in OPEN-NY”. Comments from the public are invited.

Got a spatiotemporal dataset with several billion data points? Want to visualize it interactively in a web browser? Nanocubes are a new, fast data structure that can be used to do exactly that. Nanocubes use so little memory “that you can run a nanocube in a modern-day laptop”. Check out a demo of nanocubes applied to some two hundred million tweets.

There is now a Go library for spatial data operations, and it’s called gogeos. Fans of Google’s programming language can now take advantage of a wide range of powerful spatial data manipulations, as detailed in the announcement blog post.

Learn how a small team of journalists can tackle a big data-journalistic project with David Bauer’s account of how TagesWoche investigated migrant remittances. Bauer explains how teams were structured, data was found, and visualizations were constructed.

Hive is a “data warehouse system” that facilitates the analysis of large datasets. Learn how to process social science data with Hive in a new blog post by John Beieler that shows you how to query a 40+ gigabyte dataset in a matter of minutes.

Become a Knight-Mozilla OpenNews Fellow! Applications for the fellowships, which fund a journalistically inclined programmer for ten months of newsroom participation, will be open until August 17.

DATA STORIES

Treezilla aims to map every tree in Britain. It is a citizen science platform in the form of an app and a crowdsourced database of British trees, already tens of thousands of trees strong, contributing to awareness of trees and their central ecological importance.

Detention Logs publishes “data, documents and investigations that reveal new perspectives on conditions and events inside immigration detention” in Australia. Called “one of the largest data journalism projects in Australian history” (source), the project aims “to arm the public” with the facts necessary for informed public policy on asylum seekers and immigration.

DataParis.io presents many perspectives on the lives of Parisiens—income, politics, sex…—all from the reference-point of the Métro stations that abut on their lives. It is among the newest, and perhaps richest, transit-oriented infographic takes on urban life; compare the New Yorker on inequality and NYC’s subway.

Social network analysis has now been applied to Homer. The social network of characters in the Odyssey has been analyzed by PJ Miranda and colleagues, who conclude that “this social network bears remarkable similarities to Facebook, Twitter and the like”.

How do cats spend their time? As a cat owner, I know very well how pressing this question is. The BBC, in collaboration with the Royal Veterinary College, has investigated, presenting a day in the life of nine cats in the form of a dynamic map.

Responding to the opening of the trial of Andre Cornet’s alleged killers on Monday, Arnaud Wéry has mapped 15 years of crime stories in the Huy-Waremme region and blogged about how he did it.

How will climate change affect flora and fauna in Spain? The World Wildlife Fund has created an interactive application allowing the projections of two models of climate change to be compared on a map of Spain, showing how living areas for plant and animal species will change in the years to come.

DATA SOURCES

No new data source this week is more exciting than the International Consortium of Investigative Journalists’ release of a database of over 100,000 offshore entities, the Offshore Leaks Database. The database is “part of a cache of 2.5 million leaked offshore files ICIJ analyzed with 112 journalists in 58 countries” (source). Learn how it was built on the ICIJ blog, and read more about why it matters.

UK Cabinet Office Minister Francis Maude has announced “new commitments on open data that will give citizens detailed information on the operations of charities and companies”. Data held by the UK Charity Commission is slated to be made freely available by March of next year.

In response to the G8’s open data charter, Canada has launched a new data portal. The usefulness of this new portal is likely to be compromised by the serious budget cuts suffered by Statistics Canada under the Harper administration.

Data Explorer Mission from the Inside: an Agent’s Story

June 18, 2013 in Data Expeditions

This post comes to you from Anna Sakoyan, who participated as a “Data Agent” in the Data Explorer Mission, a partnership between Peer 2 Peer University and the Open Knowledge Foundation. The course ran from mid-April to mid-May, and primed Agents to analyze, clean, and visualize data, tell a story with it, and facilitate their group. Here is her story. The original post can be found at her blog, Self Made University.

I can hardly believe it, but my assignment at School of Data seems to be completed. The last step was to produce some output, that is to tell the story. Now I think I should somehow summarize my experience.

Now, first off, what is a Data Expedition at School of Data? It can be very flexible in terms of organisation. Here are the links to the general description and also to the Guide for Guides, which is revealing. In this post, I’ll be talking about this particular expedition. Also, a great account of it can be found on one of my team mates’ blogs. So, this expedition was technically very similar in principle to the Python Mechanical MOOC. All the instructions were sent by a robot via our mailing list and then we had to collaborate with our team mates to find solutions.

(Image CC-By-SA J Brew on Flickr)

First of all, we were given a dataset on CO2 emissions by country and CO2 emissions per capita. Our task was to look at the data and try to think about what can be done about it. As a background, we were also given the Guardian article based on this very dataset so that we could have a look at a possible approach. Well, I can’t say I was able to do the task right away. Without any experience of working with data or any tools to deal with it, I felt absolutely frustrated by the very look of a spreadsheet. And at that stage peers could hardly provide any considerable technical support, because we all were newbies.

Then we had tasks to clean and format the data in order to analyze certain angles. Here our cooperation began and became really helpful. Although nobody among us was an expert here, we were all looking for the solutions and shared our experience, even when it was little more than ‘I DON’T UNDERSTAND ANYTHING!!11!!1!’.

Our chief weapons were:

  • the members’ supportive and encouraging attitude to each other
  • our mailing list
  • Google Docs to record our progress
  • Google Spreadsheets to work with our data and share the results
  • Google Hangout for our weekly meet-ups (really helpful, to my mind)
  • Google Fusion Tables for visualisation (alongside Google Spreadsheets)

And that is it actually. I’m not mentioning more individual choices, because I’m not sure I even know about them all.

Now some credits.

Irina, you’ve been a source of wonderful links that really broadened my understanding of what’s going on. And above all, you’re extremely encouraging.

Jakes, you’ve contributed a huge amount of effort to get the things going and I think it paid off. You have also always been very supportive, generous and helpful even beyond the immediate team agenda.

Ketty, you were the first among us who was brave enough to face the spreadsheet as it is and proved that it is actually possible to work with. I was really inspired by this and tried to follow suit. Same was in the case of Google Fusion Tables.

Randah, I wish you had had more time at your disposal to participate in the teamwork. And judging by your brief inputs, you would make a great team mate. You were also the person who coined the term dataphobia and in this way located the problem I resolved to overcome. I hope to get in touch with you again when you have more spare time.

Zoltan, you were also an upsettingly rare contributor, due to your heavy and unpredictable workload. But nevertheless, you managed to provide an example of a very cool approach to overcoming big problems just by mechanically splitting them into smaller and less scary pieces.

Vanessa Gennarelli and Lucy Chambers, thanks for organising this wonderful MOOC!

So, as a result, I

  • seem to have overcome my general dataphobia
  • learnt a number of basic techniques
  • got an idea of what p2p learning is (it’s a cool thing, really)
  • got to know great people and hope to keep collaborating with them in the future

Well, this is kind of more than I expected.

Next, I’m going to learn more about data processing, Python, P2P-learning and other awesome things.

Get Started With Scraping – Extracting Simple Tables from PDF Documents

June 18, 2013 in Scraping

As anyone who has tried working with “real world” data releases will know, sometimes the only place you can find a particular dataset is as a table locked up in a PDF document, whether embedded in the flow of a document, included as an appendix, or representing a printout from a spreadsheet. Sometimes it can be possible to copy and paste the data out of the table by hand, although for multi-page documents this can be something of a chore. At other times, copy-and-pasting may result in something of a jumbled mess. Whilst there are several applications available that claim to offer reliable table extraction services (some free software, some open source software, some commercial software), it can be instructive to “View Source” on the PDF document itself to see what might be involved in scraping data from it.

In this post, we’ll look at a simple PDF document to get a feel for what’s involved with scraping a well-behaved table from it. Whilst this won’t turn you into a virtuoso scraper of PDFs, it should give you a few hints about how to get started. If you don’t count yourself as a programmer, it may be worth reading through this tutorial anyway! If nothing else, it may give a feel for the sorts of thing that are possible when it comes to extracting data from a PDF document.

The computer language I’ll be using to scrape the documents is the Python programming language. If you don’t class yourself as a programmer, don’t worry – you can go a long way copying and pasting other people’s code and then just changing some of the decipherable numbers and letters!

So let’s begin, with a look at a PDF I came across during the recent School of Data data expedition on mapping the garment factories. Much of the source data used in that expedition came via a set of PDF documents detailing the supplier lists of various garment retailers. The image I’ve grabbed below shows one such list, from Varner-Gruppen.

Supplier list

If we look at the table (and looking at the PDF can be a good place to start!) we see that the table is a regular one, with a set of columns separated by white space, and rows that for the majority of cases occupy just a single line.

Supplier list detail

I’m not sure what the “proper” way of scraping the tabular data from this document is, but here’s the sort of approach I’ve arrived at from a combination of copying things I’ve seen, and a bit of my own problem solving.

The environment I’ll use to write the scraper is Scraperwiki. Scraperwiki is undergoing something of a relaunch at the moment, so the screenshots may differ a little from what’s there now, but the code should be the same once you get started. To be able to copy – and save – your own scrapers, you’ll need an account; but it’s free, for the moment (though there is likely to soon be a limit on the number of free scrapers you can run…) so there’s no reason not to…;-)

Once you create a new scraper:

scraperwiki create new scraper

you’ll be presented with an editor window, where you can write your scraper code (don’t panic!), along with a status area at the bottom of the screen. This area is used to display log messages when you run your scraper, as well as updates about the pages you’re hoping to scrape that you’ve loaded into the scraper from elsewhere on the web, and details of any data you have popped into the small SQLite database that is associated with the scraper (really, DON’T PANIC!…)

Give your scraper a name, and save it…

blank scraper

To start with, we need to load a couple of programme libraries into the scraper. These libraries provide a lot of the programming tools that do a lot of the heavy lifting for us, and hide much of the nastiness of working with the raw PDF document data.

import scraperwiki
import urllib2, lxml.etree

No, I don’t really know everything these libraries can do either, although I do know where to find the documentation for them… lxml.etree, scraperwiki! (You can also download and run the scraperwiki library in your own Python programmes outside of scraperwiki.com.)

To load the target PDF document into the scraper, we need to tell the scraper where to find it. In this case, the web address/URL of the document is http://cdn.varner.eu/cdn-1ce36b6442a6146/Global/Varner/CSR/Downloads_CSR/Fabrikklister_VarnerGruppen_2013.pdf, so that’s exactly what we’ll use:

url = 'http://cdn.varner.eu/cdn-1ce36b6442a6146/Global/Varner/CSR/Downloads_CSR/Fabrikklister_VarnerGruppen_2013.pdf'

The following three lines will load the file in to the scraper, “parse” the data into an XML document format, which represents the whole PDF in a way that resembles an HTML page (sort of), and then provide us with a link to the “root” of that document.

pdfdata = urllib2.urlopen(url).read()
xmldata = scraperwiki.pdftoxml(pdfdata)
root = lxml.etree.fromstring(xmldata)

If you run this bit of code, you’ll see the PDF document gets loaded in:

Scraperwiki page loaded in

Here’s how to preview what some of the XML from the PDF we’ve just loaded looks like:

print lxml.etree.tostring(root, pretty_print=True)

PDF as XML preview

We can see how many pages there are in the document using the following command:

pages = list(root)
print "There are",len(pages),"pages"

The scraperwiki.pdftoxml function I’m using converts each line of text in the PDF document into a separate element, with the elements grouped by page. We can iterate through each page, and each element within each page, using the following nested loop:

for page in pages:
  for el in page:

We can take a peek inside the elements using the following print statement within that nested loop:

if el.tag == "text":
  print el.text, el.attrib

Previewing the XML element contents

Here’s the sort of thing we see from one of the table pages (the actual document has a cover page followed by several tabulated data pages):

Bangladesh {'font': '3', 'width': '62', 'top': '289', 'height': '17', 'left': '73'}
Cutting Edge {'font': '3', 'width': '71', 'top': '289', 'height': '17', 'left': '160'}
1612, South Salna, Salna Bazar {'font': '3', 'width': '165', 'top': '289', 'height': '17', 'left': '425'}
Gazipur {'font': '3', 'width': '44', 'top': '289', 'height': '17', 'left': '907'}
Dhaka Division {'font': '3', 'width': '85', 'top': '289', 'height': '17', 'left': '1059'}
Bangladesh {'font': '3', 'width': '62', 'top': '311', 'height': '17', 'left': '73'}

Looking again at the output from each row of the table, we see that there are regular position indicators, particularly the “top” and “left” co-ordinates, which correspond to the co-ordinates of where the registration point of each block of text should be placed on the page.
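
To make that structure concrete without needing Scraperwiki or a network connection, here is a minimal sketch that runs in plain Python 3 (the snippets in this post are Python 2) and uses the standard library's xml.etree.ElementTree in place of lxml. The XML string is a hand-written stand-in for what the converter produces: the table values are copied from the output above, but the pdf2xml root element name and the cover page line are my assumptions, and text_cells is a helper name invented for this example.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the converted PDF. The root tag name and the
# cover page entry are assumptions made for illustration only.
SAMPLE_XML = """<pdf2xml>
  <page number="1">
    <text top="100" left="73" width="80" height="17" font="1">Cover page</text>
  </page>
  <page number="2">
    <text top="289" left="73" width="62" height="17" font="3">Bangladesh</text>
    <text top="289" left="160" width="71" height="17" font="3">Cutting Edge</text>
    <text top="289" left="425" width="165" height="17" font="3">1612, South Salna, Salna Bazar</text>
    <text top="289" left="907" width="44" height="17" font="3">Gazipur</text>
    <text top="289" left="1059" width="85" height="17" font="3">Dhaka Division</text>
  </page>
</pdf2xml>"""

root = ET.fromstring(SAMPLE_XML)
pages = list(root)  # one child element per page


def text_cells(page):
    """Return (text, left, top) for every <text> element on a page."""
    return [
        (el.text, int(el.attrib["left"]), int(el.attrib["top"]))
        for el in page
        if el.tag == "text"
    ]


# Print the position and content of each block of text on the table page.
for text, left, top in text_cells(pages[1]):
    print(top, left, text)
```

Every cell in the row shares the same “top” value, while the “left” value steps across the page from column to column.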

If we imagine the PDF table marked up as shown below, we can add some of the co-ordinate values to it – the blue lines correspond to co-ordinates extracted from the document:

imaginary table lines

We can now construct a small default reasoning hierarchy that describes the contents of each row based on the horizontal (“x-axis”, or “left” co-ordinate) value. For convenience, we pick values that offer a clear separation between the x-co-ordinates defined in the document. In the diagram above, the red lines mark the threshold values I have used to distinguish one column from another:

if int(el.attrib['left']) < 100: print 'Country:', el.text,
elif int(el.attrib['left']) < 250: print 'Factory name:', el.text,
elif int(el.attrib['left']) < 500: print 'Address:', el.text,
elif int(el.attrib['left']) < 1000: print 'City:', el.text,
else:
  print 'Region:', el.text

Take a deep breath and try to follow the logic of it. Hopefully you can see how this works…? The data rows are ordered, stepping through each cell in the table (working left to right) for each table row in turn. The repeated if-else statement tries to find the leftmost column into which a text value might fall, based on the value of its “left” attribute. When we find the value of the rightmost column, we print out the data associated with each column in that row.
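
The same threshold logic can also be written as a small lookup function, which is sometimes easier to test on its own. This is only a sketch in Python 3: the thresholds are the ones used above, but the names THRESHOLDS and column_for are made up for this example.

```python
# Upper bounds on the "left" co-ordinate for each column, taken from the
# if/elif chain above. Anything at or beyond the last bound is a Region.
THRESHOLDS = [
    (100, "Country"),
    (250, "Factory name"),
    (500, "Address"),
    (1000, "City"),
]


def column_for(left):
    """Return the name of the leftmost column whose threshold exceeds left."""
    for upper, name in THRESHOLDS:
        if left < upper:
            return name
    return "Region"


# The "left" values from the sample row shown earlier.
for left in (73, 160, 425, 907, 1059):
    print(left, column_for(left))
```

Keeping the thresholds in one list means that a differently laid out table only needs the list changing, not the logic.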

We’re now in a position to look at running a proper test scrape, but let’s optimise the code slightly first: we know that the data table starts on the second page of the PDF document, so we can ignore the first page when we loop through the pages. As with many programming languages, Python tends to start counting with a 0; to loop through the second page to the final page in the document, we can use this revised loop statement:

for page in pages[1:]:

Here, pages is a list with N items, running from pages[0] to pages[N-1] (the slice pages[0:N] describes the whole list). Python list indexing counts the first item in the list as item zero, so [1:] defines the sublist from the second item in the list (which has the index value 1 given that we start counting at zero) to the end of the list.
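
If list slicing is new to you, this tiny Python 3 sketch may help. The list of strings is a made-up stand-in for the real list of page elements.

```python
# A made-up stand-in for the list of page elements in the document.
pages = ["cover page", "table page 1", "table page 2", "table page 3"]
n = len(pages)

first_page = pages[0]       # indexing starts at zero
last_page = pages[n - 1]    # so the final item has index n - 1
table_pages = pages[1:]     # everything from the second item to the end

print(first_page, "|", last_page, "|", table_pages)
```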

Rather than just printing out the data, what we really want to do is grab hold of it, a row at a time, and add it to a database.

We can use a simple data structure to model each row in a way that identifies which data element was in which column. We initiate this data element in the first cell of a row, and print it out in the last. Here’s some code to do that:

for page in pages[1:]:
  for el in page:
    if el.tag == "text":
      if int(el.attrib['left']) < 100: data = { 'Country': el.text }
      elif int(el.attrib['left']) < 250: data['Factory name'] = el.text
      elif int(el.attrib['left']) < 500: data['Address'] = el.text
      elif int(el.attrib['left']) < 1000: data['City'] = el.text
      else:
        data['Region'] = el.text
        print data

And here’s the sort of thing we get if we run it:

starting to get structured data

That looks nearly there, doesn’t it, although if you peer closely you may notice that sometimes we catch a header row. There are a couple of ways we might be able to ignore the elements in the first, header row of the table on each page.

  • We could keep track of the “top” co-ordinate value and ignore the header line based on the value of this attribute.
  • We could take a hacky, lazy way out and explicitly ignore any text value that is one of the column header values.

The first is rather more elegant, and would also allow us to automatically label each column and retain its semantics, rather than explicitly labelling the columns using our own labels. (Can you see how? If we know we are in the title row based on the “top” co-ordinate value, we can associate the column headings with the “left” co-ordinate value.) The second approach is a bit more of a blunt instrument, but it does the job…

skiplist=['COUNTRY','FACTORY NAME','ADDRESS','CITY','REGION']
for page in pages[1:]:
  for el in page:
    if el.tag == "text" and el.text not in skiplist:
      if int(el.attrib['left']) < 100: data = { 'Country': el.text }
      elif int(el.attrib['left']) < 250: data['Factory name'] = el.text
      elif int(el.attrib['left']) < 500: data['Address'] = el.text
      elif int(el.attrib['left']) < 1000: data['City'] = el.text
      else:
        data['Region'] = el.text
        print data
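
For comparison, the first approach (using the “top” co-ordinate to spot the header row, then labelling columns from the header cells) might look something like the Python 3 sketch below. It rests on several assumptions that I have not checked against the real document: that the header row is the topmost text on each table page, that each data cell starts at or to the right of its header's “left” value, and that a header “top” of 267 is realistic (I made that number up). The function name rows_with_header_labels is also invented for this example.

```python
import xml.etree.ElementTree as ET

# Hand-written sample page. The header row's top value (267) is made up;
# the data row is copied from the sample output earlier in the post.
PAGE_XML = """<page number="2">
  <text top="267" left="73">COUNTRY</text>
  <text top="267" left="160">FACTORY NAME</text>
  <text top="267" left="425">ADDRESS</text>
  <text top="267" left="907">CITY</text>
  <text top="267" left="1059">REGION</text>
  <text top="289" left="73">Bangladesh</text>
  <text top="289" left="160">Cutting Edge</text>
  <text top="289" left="425">1612, South Salna, Salna Bazar</text>
  <text top="289" left="907">Gazipur</text>
  <text top="289" left="1059">Dhaka Division</text>
</page>"""


def rows_with_header_labels(page):
    """Label each data cell with the header cell it sits under."""
    cells = [
        (int(el.attrib["top"]), int(el.attrib["left"]), el.text)
        for el in page
        if el.tag == "text"
    ]
    # Assumption: the header row is the topmost line of text on the page.
    header_top = min(top for top, _, _ in cells)
    headers = sorted((left, text) for top, left, text in cells if top == header_top)

    rows = {}
    for top, left, text in cells:
        if top == header_top:
            continue  # skip the header cells themselves
        # Pick the rightmost header that starts at or before this cell.
        matches = [name for hleft, name in headers if hleft <= left]
        label = matches[-1] if matches else headers[0][1]
        rows.setdefault(top, {})[label] = text
    return [rows[top] for top in sorted(rows)]


page = ET.fromstring(PAGE_XML)
table_rows = rows_with_header_labels(page)
print(table_rows)
```

The pay-off is that the column names come from the document itself, so the same function would cope with a table whose columns are named or positioned differently.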

At the end of the day, it’s the data we’re after and the aim is not necessarily to produce a reusable, general solution – expedient means occasionally win out! As ever, we have to decide for ourselves the point at which we stop trying to automate everything and consider whether it makes more sense to hard code our observations rather than trying to write scripts to automate or generalise them.

http://xkcd.com/974/ - The General Problem

The final step is to add the data to a database. For example, instead of printing out each data row, we could add the data to a scraper database table using the command:

scraperwiki.sqlite.save(unique_keys=[], table_name='fabvarn', data=data)

Scraped data preview

Note that the repeated database accesses can slow Scraperwiki down somewhat, so instead we might choose to build up a list of data records, one per row, for each page, and then add all the companies scraped from a page in one go, one page at a time.
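
Here is a rough sketch of that batching idea in Python 3. So that it runs anywhere, it uses Python's built-in sqlite3 module with an in-memory database as a stand-in for the Scraperwiki datastore, so it is not the Scraperwiki API. The save_page function is a name invented for this example, and the second record is made-up sample data.

```python
import sqlite3

COLUMNS = ["Country", "Factory name", "Address", "City", "Region"]


def save_page(conn, records, table="fabvarn"):
    """Insert all the records scraped from one page in a single batch."""
    cols = ", ".join('"%s"' % c for c in COLUMNS)
    marks = ", ".join("?" for _ in COLUMNS)
    conn.execute('CREATE TABLE IF NOT EXISTS "%s" (%s)' % (table, cols))
    conn.executemany(
        'INSERT INTO "%s" (%s) VALUES (%s)' % (table, cols, marks),
        [tuple(rec.get(c) for c in COLUMNS) for rec in records],
    )
    conn.commit()  # one commit per page rather than one per row


conn = sqlite3.connect(":memory:")  # stand-in for the scraper database

# Records built up while looping over one page; the second is invented.
page_records = [
    {"Country": "Bangladesh", "Factory name": "Cutting Edge",
     "Address": "1612, South Salna, Salna Bazar", "City": "Gazipur",
     "Region": "Dhaka Division"},
    {"Country": "Exampleland", "Factory name": "Hypothetical Garments",
     "Address": "1 Example Road", "City": "Sampleton",
     "Region": "Example Region"},
]
save_page(conn, page_records)

count = conn.execute('SELECT COUNT(*) FROM "fabvarn"').fetchone()[0]
print(count, "rows saved")
```

In the scraper itself, the equivalent would be to append each completed data dictionary to a list inside the page loop and save the list once the loop over that page has finished.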

If we need to remove a database table, this utility function may help – call it using the name of the table you want to clear…

def dropper(table):
  if table!='':
    try: scraperwiki.sqlite.execute('drop table "'+table+'"')
    except: pass

Here’s another handy utility routine I found somewhere a long time ago (I’ve lost the original reference?) that “flattens” the marked up elements and just returns the textual content of them:

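def gettext_with_bi_tags(el):
  # Collect the text of el, keeping any inline child tags (e.g. <b>, <i>)
  res = [ ]
  if el.text:
    res.append(el.text)
  for lel in el:
    res.append("<%s>" % lel.tag)
    res.append(gettext_with_bi_tags(lel))
    res.append("</%s>" % lel.tag)
    # text that follows a child's closing tag is stored as that child's tail
    if lel.tail:
      res.append(lel.tail)
  return "".join(res).strip()

If we pass this function the element for something like <em>Some text</em> it will return Some text; for <em>Some <strong>text</strong></em> it will return Some <strong>text</strong>, keeping the inner tags.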

Having saved the data to the scraper database, we can download it or access it via a SQL API from the scraper homepage:

Scraped data - db

You can find a copy of the scraper here and a copy of various stages of the code development here.

Finally, it is worth noting that there is a small number of “badly behaved” data rows that split over more than one table row on the PDF.

broken scraper row

Whilst we can handle these within the scraper script, the effort of creating the exception handlers sometimes exceeds the pain associated with identifying the broken rows and fixing the data associated with them by hand.
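
If you do want to handle wrapped rows in code, one possible approach is sketched below in Python 3. It assumes (and I have not verified this against the whole document) that a wrapped line never repeats the Country cell and that the Country cell itself never wraps, so any cell that arrives before the next Country cell belongs to the current record. The names column_for and build_rows are invented for this example, and the wrapped address and second factory are made-up test data.

```python
def column_for(left):
    """Map a "left" co-ordinate to a column name using the post's thresholds."""
    if left < 100:
        return "Country"
    if left < 250:
        return "Factory name"
    if left < 500:
        return "Address"
    if left < 1000:
        return "City"
    return "Region"


def build_rows(cells):
    """Build records from (left, text) cells given in document order."""
    rows = []
    for left, text in cells:
        col = column_for(left)
        if col == "Country":
            rows.append({"Country": text})  # a Country cell starts a new record
        elif rows:
            if col in rows[-1]:
                # Second value for the same column: treat it as wrapped text.
                rows[-1][col] += " " + text
            else:
                rows[-1][col] = text
    return rows


# Made-up example in which the address wraps onto a second line.
cells = [
    (73, "Bangladesh"), (160, "Cutting Edge"), (425, "1612, South Salna,"),
    (907, "Gazipur"), (1059, "Dhaka Division"),
    (425, "Salna Bazar"),
    (73, "Bangladesh"), (160, "Hypothetical Factory"),
]
rows = build_rows(cells)
print(rows[0]["Address"])
```

Whether this is worth the effort depends on how many broken rows there are; for a handful, fixing them by hand may still be quicker.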

Summary

This tutorial has shown one way of writing a simple scraper for extracting tabular data from a simply structured PDF document. In much the same way as a sculptor may lock on to a particular idea when working a piece of stone, a scraper writer may find that they lock in to a particular way of parsing data out of a document, and develop a particular set of abstractions and exception handlers as a result. Writing scrapers can be infuriating at times, but may also prove very rewarding in the way that solving any puzzle can be. Compared to copying and pasting data from a PDF by hand, it may also be time well spent!

It is also worth remembering that sometimes it can be quicker to write a scraper that does most of the job, and then finish off the data cleansing or exception handling using another tool, such as OpenRefine or even just a simple text editor. On occasion, it may also make sense to throw the data into a database table as quickly as you can, and then develop code to manage a second pass that takes the raw data out of the database, tidies it up, and then writes it in a cleaner or more structured form into another database table.
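
As a rough illustration of that two-pass idea, the Python 3 sketch below dumps some deliberately messy rows into one table and then writes tidied copies into another. It uses the built-in sqlite3 module with an in-memory database rather than the Scraperwiki datastore, and the table names, the tidy function, and the sample rows are all made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# First pass: throw the raw scraped values into a table as quickly as we can.
conn.execute("CREATE TABLE raw_rows (country TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO raw_rows VALUES (?, ?)",
    [("  Bangladesh ", "GAZIPUR"), ("Bangladesh", " Gazipur")],
)


def tidy(value):
    """Collapse stray whitespace and normalise capitalisation."""
    return " ".join(value.split()).title()


# Second pass: read the raw rows back, tidy them, and write a clean table.
conn.execute("CREATE TABLE clean_rows (country TEXT, city TEXT)")
raw = conn.execute("SELECT country, city FROM raw_rows").fetchall()
conn.executemany(
    "INSERT INTO clean_rows VALUES (?, ?)",
    [(tidy(country), tidy(city)) for country, city in raw],
)
conn.commit()

distinct_clean = conn.execute(
    "SELECT DISTINCT country, city FROM clean_rows"
).fetchall()
print(distinct_clean)
```

Keeping the raw table around means the tidying rules can be changed and re-run without scraping the PDF again.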

The images used in this post are available via a flickr set: ScoDa-Scraping-SimplePDFtable

Join the School of Data as a Community Mentor!

June 17, 2013 in Community

Have data skills to share? Want to bring the School of Data to your community? We are currently looking for 12 Community Mentors as a pilot for our international network.

2012 FIRST Robotics Competition Palmetto Regional

As a Community Mentor you will:

  • Offer constructive feedback for learners on projects (often within your own language region)
  • Help to answer questions by learners on forums/mailinglists (in your own language)
  • Organize data expeditions and hands-on workshops

You’ll get training, help and support from the School of Data team, and good karma (priceless)!

Sign up and Get started as a community mentor!

On the Radar: Using Data to Save Lives

June 17, 2013 in Data for CSOs

The field of crisis mapping is relatively new, but its impact on the global response to conflict is already evident. By enabling massive amounts of information to be quickly understood by any interested party, crisis mapping increases public awareness on an exponential scale and, if properly put together, allows for quicker responses to crises.

Invisible Children together with the Resolve LRA Crisis Initiative ventured into this field in 2011 with the launch of the LRA Crisis Tracker. This platform was created in reaction to the lack of response to the Makombo Massacres in DR Congo, where more than 320 people were killed and 250 people abducted by the Lord’s Resistance Army (LRA). Three months passed before news of the massacres appeared in the media.

The LRA Crisis Tracker

Inspired by Ushahidi, the free, build-your-own crisis map website, the LRA Crisis Tracker is a real-time mapping platform and data collection system that brings an unprecedented level of transparency to the atrocities committed by the LRA in central Africa.

To build our own platform we partnered with Digitaria, a San Diego-based digital agency, and used a custom-built SalesForce application on the back end. Building the LRA Crisis Tracker was an extensive process. Each of the partners dedicated a few members of their team to work on the project full time. After nine months of development the LRA Crisis Tracker was launched in September 2011.

The Crisis Tracker faces the unique challenge of getting reliable reports from a region that has little to no communication infrastructure. The solution to this problem, in large part, was found in Invisible Children’s expansion of a locally-run high frequency (HF) radio network throughout communities in DR Congo and the Central African Republic. Twice daily, radio reports go to a local hub that then sends this information to our office in San Diego. Obtaining reliable data in a timely manner from such a remote region has required a serious investment of time and money. We’ve been building this network for almost two years and it provides much of the data used by the LRA Crisis Tracker. Invisible Children’s HF radio network currently consists of 38 radios and will continue to grow.

Through its intuitive design and inclusion of photos and videos from the region, the data is engaging and easy to access. The LRA Crisis Tracker is also available as a mobile app (iPhone or Android), and @CrisisTracker tweets LRA incidents as they’re reported.

Reports generated by the LRA Crisis Tracker have been used at all levels of counter-LRA efforts. Military and non-military actors in the region have expressed their appreciation of the real-time information put out by the LRA Crisis Tracker. This is exactly what we were hoping to create: an easy-to-access resource for the media, counter-LRA actors, regional organizations, and the general public. It makes it possible to identify trends in LRA activity that wouldn’t otherwise be accessible.

The platform continues to be a work in progress. Since its creation we’ve had to go back and revisit the data set multiple times. At one point we went back and added age and gender to incident reports. Other times we are mining our data for new location specifications. Our team spends a lot of time vetting our data to make sure it is accurate, which really is the most crucial aspect of our work.

This summer we’re planning to roll out Phase II of the LRA Crisis Tracker, which will improve the user’s ability to filter and analyze data. We’re excited to make the LRA Crisis Tracker even more valuable in the efforts to bring a permanent end to LRA atrocities. Already the platform makes our data available to any interested data-enthusiasts, through the ‘Get Reports’ Tab.

Data roundup, June 12

June 12, 2013 in Data Roundup

We’re rounding up data news from the web each week. If you have a data news tip, send it to us at schoolofdata@okfn.org.

Photo credit: Kris Krüg

TOOLS, COURSES, AND EVENTS

The World Wide Web Foundation and its partners (including the OKFN) have launched the Global Open Data Initiative, “a champion for Open Data globally”, aiming to create and promote a unified set of guidelines assisting governments in the use of open data.

Today wraps up the second Open Economics Workshop, an Open Knowledge Foundation event hosted at MIT. As reported on the Open Econ blog, the event brought together some 40 economists and social scientists to discuss research data sharing and transparency in economics.

Data-Crunched Democracy was a conference bringing together journalists and analysts “to cut through the hype and understand the use of voter data in campaigns”. Derek Willis reflects on “the lessons for journalists covering campaigns that engage in the use of data” in an in-depth blog post.

I’ve known more than one graduate student in the social sciences who has described Excel’s pivot tables as “the best thing ever”. Pivot tables are a powerful tool for data exploration. A new blog post by Abbott Katz explains how you can begin using pivot tables in your own work.

Real-time and historical data on United States drone strikes is now available as an API. Dronestre.am is a public API making it easy to “build data visualizations about covert war [...] in Pakistan, Yemen, and Somalia”.

Learn about pandas, “one of the best, and most important, libraries for data analysis in Python”, and how it can be used to do serious data analysis using SQL queries in a new blog post by John Beieler.

Bayesian Methods for Hackers, an introduction to Bayesian probability theory in practical and Pythonic terms, has appeared on the data roundup before. Now a draft of the PDF version of the book has been released. Check out this “understanding-first” introduction to “the natural approach to inference”.

Check out Source’s journalism code event roundup, June 10, for a worldwide selection of hackathons and conferences in data-driven and computer-assisted journalism.

DATA STORIES

So the NSA has all your metadata. What can they do with it? German Green Party politician Malte Spitz sued to obtain six months of his own phone data and made it available to Zeit Online, who combined it with publicly available data to reconstruct six months of Spitz’s life. You can read more about the project and download its data. You can also check out a timeline of the NSA’s domestic spying from the Electronic Frontier Foundation.

ProjectPolicy aims to “unify, organize and visualize the world’s government information onto one intuitive web platform”. Its take on San Francisco is available as a demo of what it aims to do.

America’s Worst Charities presents a year’s investigation by the Tampa Bay Times and the Center for Investigative Reporting into the misuse of charity funds by American charities. It prominently features an interactive presentation of the data, some of which is also available for download in CSV form.

The central limit theorem is a statistical theorem whose scientific importance cannot be overstated. A new visualization of the theorem, constructed with D3.js and explained in terms of coin flips, makes it easier to develop intuitions about its meaning.
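
For readers who would like to try the coin-flip idea themselves, here is a minimal simulation sketch. It is not the code behind the linked visualization; the function name `sample_means` and the sample sizes are assumptions chosen for illustration. The mean of n fair flips has a spread of 0.5 / sqrt(n), and the theorem adds that the shape of its distribution approaches a bell curve as n grows.

```python
import random
import statistics


def sample_means(n_flips, n_samples, seed=0):
    """Return n_samples values, each the fraction of heads in n_flips fair coin flips."""
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    means = []
    for _ in range(n_samples):
        heads = sum(rng.random() < 0.5 for _ in range(n_flips))
        means.append(heads / n_flips)
    return means


if __name__ == "__main__":
    for n_flips in (1, 10, 100):
        means = sample_means(n_flips, n_samples=5000)
        # One fair flip has standard deviation 0.5, so the mean of
        # n_flips flips should have a spread near 0.5 / sqrt(n_flips).
        predicted = 0.5 / n_flips ** 0.5
        print(
            n_flips,
            round(statistics.mean(means), 3),
            round(statistics.pstdev(means), 3),
            round(predicted, 3),
        )
```

Plotting a histogram of `means` for each value of `n_flips` should show the distribution tightening around 0.5 and looking more bell-shaped as the number of flips grows.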

Stamen has put together 3D contour maps of the surface of Mars from data collected by the Mars Orbiter Laser Altimeter. As their blog reports, these maps are “a small gesture of thanks to the scientists who are working hard to do science and communicate with the public despite the stupid sequester”.

The latest work from Accurat presents the lives of ten famous painters in the form of beautiful timelines. Each timeline presents the artist’s personal history in a manner sensitive to the artist’s style.

Check out datenjournalist.de’s roundup of Datenjournalismus im Mai 2013 (German) for a collection of some of last month’s best examples of data-driven journalism.

DATA SOURCES

In a move that is unlikely to distract attention away from the PRISM scandal, the Obama administration has released a portal calling out climate science deniers.

Open Nepal has launched Open Data Nepal, a project “not about creating yet another data repository in the web but an effort to curate and disseminate data that is already available in public domain”.

Canada’s Global News has obtained, with great difficulty, a database of over 61,000 Albertan oil spill incidents spanning the period from 1975 to 2013, and they are “now offering this information to the public for download”. This is certainly one of the most important datasets to see the light of day in Alberta, especially given that Alberta’s open data catalogue has been described as perhaps “the most useless [...] in the history of open data catalogues”.

The Los Angeles Times has acquired and released a database of the salaries of Department of Water and Power employees in 2012, finding that their “average total pay … is more than 50% higher than other city employees”. You can download the dataset and see for yourself.

Freddie Mac, a major US mortgage backer, is “standardizing its processes and making raw data more easily accessible to the public”. This move towards “transparency” appears to be part of a process of privatization of government-sponsored mortgages, “using our data to attract private capital”.

Several Takes on Defining Data Journalism

June 11, 2013 in Data Blog

Every so often I get asked the question: “so what is data journalism?” I’m still not sure I have a very good definition of it, but here are three different ways I think we can view it:

  • as a particular sort of output – one of the easiest ways of responding to the question is to point to a map or graphic that someone has used to illustrate a story, or a piece of “award winning” data journalism, and say “that is”. Anyone who works with data, however, knows well that producing a graphic is often the easy part of the process, and that most of the time is spent finding the data, fighting with it to get it into a state where you can start working with it, and analysing the data, or asking it questions in order to find the story within it, or illustrate a story you have already discovered. This observation in turn leads to a second way of characterising data journalism:
  • as a particular set of skills – that is, data journalism is not necessarily what data journalists produce, it’s best thought about in terms of the sorts of skills that data journalists need in order to produce the maps and charts that get pointed at as examples of data journalism.
    One way of identifying what these skills might be is to look at job adverts for “data journalist” (I collected a few examples here: So what is a data journalist exactly? A view from the job ads…). Looking through them, many current ads seem to require skills associated with the development of interactive data driven applications, which puts the emphasis on a range of web design and development skills, again apparently associating the practice of data journalism closely with the production of things that are used to illustrate a story. That is, data journalism is to data what radio journalism is to audio and video journalism is to, erm, video?! (It’s probably also worth mentioning that data journalism is not necessarily genre based journalism, such as science journalism or sports journalism – it’s not just “about” data.)
    But that doesn’t feel right, either, which suggests a third way of considering data journalism:
  • as a process – and in particular, as a process that involves data somehow, though not necessarily exclusively. Whilst there may be “data outputs”, it might also be the case that the data journalistic process generates a lead that develops into a story that is not best illustrated using “data”. Data might lead us to a story, for example, that one particular garment retailer tolerates poor working conditions through the discovery that they use factories blacklisted by other retailers, but that story may be best expressed in other terms. The data, in other words, may simply play the role of a source, and in this sense “data journalism” is more process oriented, in much the same way that investigative journalism is, although potentially over much shorter timescales. (We might expect a data journalism piece to be produced in a matter of hours as part of the daily news cycle, for example.)
    Under this process view of data journalism, the skills required of a journalist participating in the process may take the form simply of advanced information skills, such as the ability to run powerful advanced searches using web search engines, filter down a data set using text and/or numeric facets in a tool such as OpenRefine, or run structured queries over data in a database using a query language such as SQL.
    The process might equally involve using data visualisation tools to make sense of a dataset, or generate further questions from it, questions that might be additionally asked of the dataset itself, possibly in conjunction with other datasets, or alternatively used to set up a question then asked of a person.
    For certain data sets, statistical tests may be required to identify whether there is something or nothing in what the data appears to be saying, or questions asked of an expert in the field to identify whether a number is actually a big number or not (hat tip to FT Undercover Economist, and More Or Less presenter, Tim Harford, for that refrain!). And then it may be time to get the interactive developers on board. Or there may be no need.
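
As a rough illustration of the “structured queries” skill mentioned above, here is a minimal sketch using Python’s built-in sqlite3 module. The table names, column names and rows are all invented for the example and describe no real retailer or factory. The query asks which retailers use a factory that some other retailer has blacklisted, echoing the garment example.

```python
import sqlite3

# Invented illustrative data: which retailer buys from which factory ...
SUPPLIES = [
    ("Retailer A", "Factory 1"),
    ("Retailer A", "Factory 2"),
    ("Retailer B", "Factory 3"),
    ("Retailer C", "Factory 2"),
]
# ... and which retailer has blacklisted which factory.
BLACKLIST = [
    ("Retailer B", "Factory 2"),
]


def retailers_using_blacklisted(supplies, blacklist):
    """Return (retailer, factory) pairs where the factory is blacklisted by a different retailer."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute("CREATE TABLE supplies (retailer TEXT, factory TEXT)")
    conn.execute("CREATE TABLE blacklist (retailer TEXT, factory TEXT)")
    conn.executemany("INSERT INTO supplies VALUES (?, ?)", supplies)
    conn.executemany("INSERT INTO blacklist VALUES (?, ?)", blacklist)
    rows = conn.execute(
        """
        SELECT DISTINCT s.retailer, s.factory
        FROM supplies AS s
        JOIN blacklist AS b
          ON b.factory = s.factory AND b.retailer <> s.retailer
        ORDER BY s.retailer, s.factory
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for retailer, factory in retailers_using_blacklisted(SUPPLIES, BLACKLIST):
        print(retailer, "uses", factory)
```

A result like this is a lead rather than a story: it tells the journalist whom to ask the next question of.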

So are we any nearer to having a definition of “data journalism” that takes into account these different views?

Here’s one I quite like:

The art and practice of finding stories in data…

…and then retelling them.

This captures both the notion that data journalism is about finding stories from a particular sort of source (a data source) and then communicating them, whilst not requiring that the telling of the story is done in any particular way.

Here’s another:

Journalism in which “data” is one of the sources used to get or relate a story.

In this case, we see data as playing a role either in the sourcing of a story, or the communication of a story (or maybe even both), but again, we imagine data playing a role in “human” terms.

So what’s your favorite definition of data journalism?

See also: Data Journalism Handbook

Mapping the Well-Being of Children in the District of Columbia

June 11, 2013 in Data for CSOs

Last year, DC Action for Children, in partnership with DataKind and a group of dedicated pro-bono data scientists, created an interactive, web-based tool to take traditional child well-being indicators “beyond the PDF book” and into the exciting realm of visualizing and communicating data for collective action.

The neighborhood maps we created showed that the success of too many DC (District of Columbia, U.S.) children is predetermined by their ZIP Code and by limited access to the critical resources they need to thrive. Some DC neighborhoods have assets that enrich the lives of children, but others are characterized by high levels of poverty and the many challenges that come with it, including poorer performing schools, more violent crime and less access to resources like healthy food, libraries, parks and recreation centers.

For the project we used both U.S. Census Bureau and local administrative data about the population and resources in District of Columbia neighborhoods. We obtained data on population counts and social characteristics from the Decennial Census and American Community Survey. Geographical data, shapefiles for mapping, and data on community characteristics such as grocery stores, libraries, crime and transportation were obtained from the DC Data Catalog. Other data were obtained directly from local agencies, including the DC Office of the State Superintendent of Education and the DC Department of Health.

To obtain the neighborhood-level estimates, our data scientists used block-level population data to construct population weights for data at the block-group and neighborhood level. The DC Master Address Repository was used to geocode point data, such as locations of libraries or schools; ArcGIS was used to aggregate point data by neighborhood. Collaborators used MapBox to create neighborhood maps.
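
The population-weighting step can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project’s actual code: the post does not spell out the exact weighting scheme, so the sketch assumes each neighborhood estimate is the average of block-group rates, weighted by the population of the blocks that fall inside that neighborhood. The function name, input layout and figures are hypothetical.

```python
from collections import defaultdict


def neighborhood_estimates(blocks, block_group_rates):
    """Population-weighted average of block-group rates for each neighborhood.

    blocks: iterable of (block_group, neighborhood, population) tuples,
        one per census block (hypothetical layout).
    block_group_rates: dict mapping block_group to a rate, for example
        a poverty rate published at block-group level.
    """
    weighted_sum = defaultdict(float)
    total_pop = defaultdict(float)
    for block_group, neighborhood, population in blocks:
        # Each block carries its block group's rate, weighted by its own population.
        weighted_sum[neighborhood] += population * block_group_rates[block_group]
        total_pop[neighborhood] += population
    return {
        name: weighted_sum[name] / total_pop[name]
        for name in total_pop
        if total_pop[name] > 0  # skip neighborhoods with no residents
    }


if __name__ == "__main__":
    # Invented numbers: block group BG1 straddles two neighborhoods.
    blocks = [
        ("BG1", "North", 100),
        ("BG1", "South", 300),
        ("BG2", "South", 100),
    ]
    rates = {"BG1": 0.10, "BG2": 0.30}
    print(neighborhood_estimates(blocks, rates))
```

Working from blocks matters because block groups do not always nest neatly inside neighborhoods; the block populations decide how much of each block group’s rate a neighborhood inherits.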

Community response

The response to our newly launched KIDS COUNT 1.0 has been overwhelming, both locally and nationally. Local policy makers have been relying heavily on the work and asking what is next, particularly how to add data that can start to bring accountability to public policy decisions and publicly funded programs.

The work has also been recognized as innovative by numerous organizations, including the Annie E. Casey Foundation (through the KIDS COUNT network), Rockefeller Foundation (Innovators Award) and Global Editors Network (2013 Data Journalism Awards). We continue to get inquiries from potential partners like The World Bank, the White House, and, most critically, parent groups.

Why is this important?

In a city where policy decisions that determine the allocation of resources and assets are guided by relationships and old-school politics, the project will bring much-needed transparency to DC government budget data. We must show how budget decisions align or do not align with the needs of our children.

In DC, there are approximately 100,000 children under 18 years of age. More than 36,000 young children are growing up in DC neighborhoods – playing on city playgrounds, attending child care centers and preparing for school in pre-kindergarten classes. The number of young children in DC has increased by 11% since 2000, which is especially notable because the total number of children (under age 18) has decreased by 8% over the same time period. With a rising birth rate and expanding overall city population, we expect the number of young children to continue to increase in future years. Of those 36,000 young children, 1 in 3 lives in poverty. Poverty is pervasive.

DC has the highest spending per pupil in public education. We have had intense national scrutiny based on the efforts in education reform to improve outcomes for children, yet even with all the spending and reform efforts, the bottom line is that outcomes for children are not improving.

Next steps

In the next phase of the project, we propose to add a layer of local budget data to the asset maps to answer a related question: If we map public investments, will they align with where we have mapped need among children in DC?

We propose to use five years of retrospective budget data to add a powerful new tool to our DC KIDS COUNT maps to help policy makers, media, advocates, service providers and citizens evaluate the city’s budget through the lens of young children – in the neighborhoods where they live. The project will help us present a more nuanced analysis of the geography of DC budget investments. In particular, it will help us to:

  • Map where the city has invested in the futures of young children and where it has not.
  • Create a shared understanding of how investment maps do – or do not – match with need maps for the city’s children.
  • Communicate messages about inequities in investments by geography and demographics (income, race, etc.).
  • Identify budget and policy opportunities for addressing the identified mismatches, gaps and inequities.

Our ultimate goal is to ensure the data and analysis we provide will change the outcomes for children, youth and their families.

As I reflect on the success of this partnership and project, a few key themes surface:

  • Leadership: Jake Porway (founder of DataKind) and Sisi Wei (project lead) were instrumental not only during the preliminary phase but also for the long-term sustainability of the project. We all knew that this was new territory for data work. Both were committed to answering our BIG question: “Can we change child outcomes with data?” There was definitely a theme among the three of us: innovators, risk-takers, visionaries, do-gooders, with a little too much enthusiasm about data!
  • Data Heroes: A leader can’t lead without a strong troop. I recall being at the DataDive and praying that most of the genius data heroes would choose our project BUT there was some fierce competition. As one of our project data heroes often states, “I joined this collective effort to make a difference and wanted nothing in return.” But what we all got in return was the opportunity to engage in a magical process empowered by trust, mission and impact. We saw in action what Jake had always envisioned: data = social change!

School of Data Latin America Tour

June 7, 2013 in On the Road

Do you live in Latin America? Hungry for some School of Data materials in Spanish? We have some good news for you: The School of Data is going to come to you!

Locals and Tourists #49 (GTWA #200): Sao Paulo. Image CC-BY-SA Eric Fisher

While our friends at Social-TIC are working hard on translating School of Data materials into Spanish, Michael Bauer and Zara Rahman are going to visit La Paz (Bolivia), Santiago (Chile), Buenos Aires (Argentina) and Montevideo (Uruguay).

Michael kicks off his tour with the first DataBootcamp in Latin America, in Bolivia. He is then joined by Zara in Santiago, where there will be a Workshop on Scraping on Monday June 17th. They will also present briefly at Data Tuesday the next day. They will continue their trip to Argentina with a Workshop on June 20th and finish their tour at AbreLatam – the open Latin America unconference.

During this time they will be available to meet, scheme and plot. If you’re interested in meeting them, contact us at schoolofdata [at] okfn.org.

The Latest From the School of Data

June 6, 2013 in News

The latest from what we are up to at the School of Data.

School of Data goes Spanish

Next month, a couple of team members will be heading over to Latin America for a series of warm-up events for the launch of the Spanish version of School of Data. The School will be launched at the AbreLatam conference in Uruguay.

On their way to Uruguay, they will be passing through Bolivia, Chile and Argentina. Know an organisation or amazing individual doing great things with data they should meet up with on the way? Please drop us a line on schoolofdata [at] okfn.org.

Thank you to the amazing organisations and individuals who are helping to make this happen, including Social-TIC (Mexico), DATA (Uruguay), the Knight Media Fellows and the Hacks/Hackers network.

Data Clinics

Over at OpenSpending, Anders Pedersen is running bi-weekly data clinics to help people troubleshoot their spending data, from getting better data to cleaning, analysing and visualising it.

The next clinic will happen on Wednesday 19th June in the evening.

Have data you want to bring along to troubleshoot? Join the OpenSpending mailing list, or email info [at] openspending.org for more details on the upcoming clinics.

Data Expeditions

Mapping the Garment Factories

Some participants of the Mapping the Garment Factories expedition couldn’t get enough and carried on their expedition into the week. A few participants have published a writeup of their expedition, concluding:

“major retailers like Wal-Mart maintains high levels of opacity around their supply chain and audit standards, which are detrimental to improving working standards in the garment industry.”

Tax Avoidance

Our tax avoidance teams have paired up: techies and storytellers will together tackle the challenge of finding tax avoiders and evaders. Welcome also to our first Spanish-speaking group, who will take on the challenge. The expedition launches tomorrow, and we will keep you posted with updates!

Call for ideas for topics

Have an idea for a topic you think would make a great expedition? (Or, even better – keen to help us lead one?) Please drop us a line on schoolofdata [at] okfn.org.

