<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>School of Data</title>
	<atom:link href="http://schoolofdata.org/feed/" rel="self" type="application/rss+xml" />
	<link>http://schoolofdata.org</link>
	<description>Evidence is Power</description>
	<lastBuildDate>Tue, 18 Jun 2013 11:38:46 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>Data Explorer Mission from the Inside: an Agent’s Story</title>
		<link>http://schoolofdata.org/2013/06/18/data-explorer-mission-from-the-inside-an-agents-story/</link>
		<comments>http://schoolofdata.org/2013/06/18/data-explorer-mission-from-the-inside-an-agents-story/#comments</comments>
		<pubDate>Tue, 18 Jun 2013 07:00:14 +0000</pubDate>
		<dc:creator>Vanessa Gennarelli</dc:creator>
				<category><![CDATA[Data Expeditions]]></category>

		<guid isPermaLink="false">http://schoolofdata.org/?p=5147</guid>
		<description><![CDATA[This post comes to you from Anna Sakoyan, who participated as a &#8220;Data Agent&#8221; in the Data Explorer Mission, a partnership between Peer 2 Peer University and the Open Knowledge Foundation. The course ran from mid-April to mid-May, and primed Agents to analyze, clean and visualize data, tell a story with it, and facilitate their group. Here is [...]]]></description>
				<content:encoded><![CDATA[<p><!-- magazine.image = http://ourchiefweapons.files.wordpress.com/2013/06/2013-06-03-01_13_18-untitled-google-maps.png --></p>
<p><em>This post comes to you from Anna Sakoyan, who participated as a &#8220;Data Agent&#8221; in the <a href="http://info.p2pu.org/2013/04/02/okfn-p2pu-partner-for-data-explorer-missions/">Data Explorer Mission</a>, a partnership between <a href="https://p2pu.org/en/">Peer 2 Peer University</a> and the <a href="http://okfn.org/">Open Knowledge Foundation</a>. The course ran from mid-April to mid-May, and primed Agents to analyze, clean and visualize data, tell a story with it, and facilitate their group. Here is her story. The original post can be found at her blog, <a href="http://ourchiefweapons.wordpress.com/2013/06/03/data-expedition-recap/">Self Made University</a>.</em></p>
<p>I can hardly believe it, but my assignment at <a href="http://schoolofdata.org/" target="_blank">School of Data</a> seems to be completed. The last step was to produce some output, that is, to tell the <a href="http://ourchiefweapons.wordpress.com/2013/06/02/my-first-data-driven-story-ever/" target="_blank">story</a>. Now I think I should somehow summarize my experience.</p>
<p>Now, first off, what is a Data Expedition at School of Data? It can be very flexible in terms of organisation. Here are the links to the <a href="http://schoolofdata.org/data-expeditions/" target="_blank">general description</a> and also to the <a href="http://schoolofdata.org/data-expeditions/guide-for-guides/" target="_blank">Guide for Guides</a>, which is revealing. In this post, I’ll be talking about <a href="http://schoolofdata.org/datamooc/" target="_blank">this particular expedition</a>. Also, a great account of it can be found on <a href="http://dataspa.wordpress.com/2013/04/30/for-a-breath-of-clean-air/" target="_blank">one of my teammates&#8217; blogs</a>. Technically, this expedition worked on much the same principle as the <a href="http://mechanicalmooc.wordpress.com/" target="_blank">Python Mechanical MOOC</a>: all the instructions were sent by a robot via our mailing list, and then we had to collaborate with our teammates to find solutions.</p>
<p><a href="http://ourchiefweapons.files.wordpress.com/2013/06/8364602336_facaa10cdf_o.png"><img class="alignnone size-full wp-image-96" src="http://ourchiefweapons.files.wordpress.com/2013/06/8364602336_facaa10cdf_o.png" alt="8364602336_facaa10cdf_o" width="630" height="177" /></a></p>
<p>(Image CC-By-SA <a href="http://www.flickr.com/photos/brewbooks/3303763084/">J Brew on Flickr</a>)</p>
<p>First of all, we were given a <a href="https://docs.google.com/spreadsheet/ccc?key=0AqwLVP6U7FhDdEZKa1pqa3VhbmkyWkF2Q2IxcnhtWHc#gid=1" target="_blank">dataset on CO2 emissions by country and CO2 emissions per capita</a>. Our task was to look at the data and try to think about what could be done with it. As background, we were also given the <a href="http://www.guardian.co.uk/news/datablog/2011/jan/31/world-carbon-dioxide-emissions-country-data-co2#_" target="_blank">Guardian article</a> based on this very dataset so that we could have a look at a possible approach. Well, I can’t say I was able to do the task right away. Without any experience of working with data or any tools to deal with it, I felt absolutely frustrated by the very look of a spreadsheet. And at that stage peers could hardly provide any considerable technical support, because we were all newbies.</p>
<p><img class="size-full wp-image-95 alignright" src="http://ourchiefweapons.files.wordpress.com/2013/06/2013-06-03-01_13_18-untitled-google-maps.png" alt="2013-06-03 01_13_18-Untitled - Google Maps" width="398" height="388" /></p>
<p>Then we had tasks to clean and format the data in order to analyze certain angles. Here our cooperation began and became really helpful. Although nobody among us was an expert here, we were all looking for the solutions and shared our experience, even when it was little more than ‘I DON’T UNDERSTAND ANYTHING!!11!!1!’.</p>
<p><span style="text-decoration: underline;">Our chief weapons were:</span></p>
<ul>
<li>the members’ supportive and encouraging attitude to each other</li>
<li>our mailing list</li>
<li>Google Docs to record our progress</li>
<li>Google Spreadsheets to work with our data and share the results</li>
<li>Google Hangout for our weekly meet-ups (really helpful, to my mind)</li>
<li>Google Fusion Tables for visualisation (alongside Google Spreadsheets)</li>
</ul>
<p>And that is it actually. I’m not mentioning more individual choices, because I’m not sure I even know about them all.</p>
<p>Now some credits.</p>
<p><strong>Irina</strong>, you’ve been a source of wonderful links that really broadened my understanding of what’s going on. And above all, you’re extremely encouraging.</p>
<p><strong>Jakes</strong>, you’ve contributed a huge amount of effort to get the things going and I think it paid off. You have also always been very supportive, generous and helpful even beyond the immediate team agenda.</p>
<p><strong>Ketty</strong>, you were the first among us who was brave enough to face the spreadsheet as it is, and you proved that it is actually possible to work with. I was really inspired by this and tried to follow suit. The same went for Google Fusion Tables.</p>
<p><strong>Randah</strong>, I wish you had had more time at your disposal to participate in the teamwork. And judging by your brief inputs, you would make a great team mate. You were also the person who coined the term dataphobia and in this way located the problem I resolved to overcome. I hope to get in touch with you again when you have more spare time.</p>
<p><strong>Zoltan</strong>, you were also an upsettingly rare contributor, due to your heavy and unpredictable workload. But nevertheless, you managed to provide an example of a very cool approach to overcoming big problems just by mechanically splitting them into smaller and less scary pieces.</p>
<p><strong>Vanessa Gennarelli</strong> and <strong>Lucy Chambers</strong>, thanks for organising this wonderful MOOC!</p>
<p>So, as a result, I</p>
<ul>
<li>seem to have overcome my general dataphobia</li>
<li>learnt a number of basic techniques</li>
<li>got an idea of what p2p learning is (it’s a cool thing, really)</li>
<li>got to know great people and hope to keep collaborating with them in the future</li>
</ul>
<p>Well, this is kind of more than I expected.</p>
<p>Next, I’m going to learn more about data processing, Python, P2P-learning and other awesome things.</p>
]]></content:encoded>
			<wfw:commentRss>http://schoolofdata.org/2013/06/18/data-explorer-mission-from-the-inside-an-agents-story/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<atom:link rel="payment" title="Flattr this!" href="https://flattr.com/submit/auto?user_id=openknowledgefoundation&amp;popout=1&amp;url=http%3A%2F%2Fschoolofdata.org%2F2013%2F06%2F18%2Fdata-explorer-mission-from-the-inside-an-agents-story%2F&amp;language=en_GB&amp;category=text&amp;title=Data+Explorer+Mission+from+the+Inside%3A+an+Agent%E2%80%99s+Story&amp;description=This+post+comes+to+you+from+Anna%C2%A0Sakoyan%2C+who+participated+as+a+%26%238220%3BData+Agent%26%238221%3B+the+Data+Explorer+Mission%2C+a+partnership+between+Peer+2+Peer+University+and+the+Open+Knowledge+Foundation.+The..." type="text/html" />
	</item>
		<item>
		<title>Get Started With Scraping &#8211; Extracting Simple Tables from PDF Documents</title>
		<link>http://schoolofdata.org/2013/06/18/get-started-with-scraping-extracting-simple-tables-from-pdf-documents/</link>
		<comments>http://schoolofdata.org/2013/06/18/get-started-with-scraping-extracting-simple-tables-from-pdf-documents/#comments</comments>
		<pubDate>Mon, 17 Jun 2013 22:12:03 +0000</pubDate>
		<dc:creator>Tony Hirst</dc:creator>
				<category><![CDATA[Scraping]]></category>

		<guid isPermaLink="false">http://schoolofdata.org/?p=5155</guid>
		<description><![CDATA[As anyone who has tried working with &#8220;real world&#8221; data releases will know, sometimes the only place you can find a particular dataset is as a table locked up in a PDF document, whether embedded in the flow of a document, included as an appendix, or representing a printout from a spreadsheet. Sometimes it can [...]]]></description>
				<content:encoded><![CDATA[<p><!-- magazine.image = http://farm6.staticflickr.com/5343/9069241016_fe645e895b.jpg --></p>
<p>As anyone who has tried working with &#8220;real world&#8221; data releases will know, sometimes the only place you can find a particular dataset is as a table locked up in a PDF document, whether embedded in the flow of a document, included as an appendix, or representing a printout from a spreadsheet. Sometimes it can be possible to copy and paste the data out of the table by hand, although for multi-page documents this can be something of a chore. At other times, copy-and-pasting may result in something of a jumbled mess. Whilst there are several applications available that claim to offer reliable table extraction services (some free software, some open source software, some commercial software), it can be instructive to &#8220;View Source&#8221; on the PDF document itself to see what might be involved in scraping data from it.</p>
<p>In this post, we&#8217;ll look at a simple PDF document to get a feel for what&#8217;s involved with scraping a well-behaved table from it. Whilst this won&#8217;t turn you into a virtuoso scraper of PDFs, it should give you a few hints about how to get started. If you don&#8217;t count yourself as a programmer, it may be worth reading through this tutorial anyway! If nothing else, it may give a feel for the sorts of thing that are possible when it comes to extracting data from a PDF document.</p>
<p>The computer language I&#8217;ll be using to scrape the documents is the Python programming language. If you don&#8217;t class yourself as a programmer, don&#8217;t worry &#8211; you can go a long way copying and pasting other people&#8217;s code and then just changing some of the decipherable numbers and letters!</p>
<p>So let&#8217;s begin, with a look at a PDF I came across during the recent School of Data data expedition on <a href="http://schoolofdata.org/2013/05/18/data-expedition-mapping-the-garment-factories/">mapping the garment factories</a>. Much of the source data used in that expedition came via a set of PDF documents detailing the supplier lists of various garment retailers. The image I&#8217;ve grabbed below shows one such list, from <a href="http://cdn.varner.eu/cdn-1ce36b6442a6146/Global/Varner/CSR/Downloads_CSR/Fabrikklister_VarnerGruppen_2013.pdf">Varner-Gruppen</a>.</p>
<p><a href="http://www.flickr.com/photos/psychemedia/9069241016/" title="SUpplier list by psychemedia, on Flickr"><img src="http://farm6.staticflickr.com/5343/9069241016_fe645e895b.jpg" width="500" height="467" alt="SUpplier list"></a></p>
<p>If we look at the table (and looking at the PDF can be a good place to start!) we see that the table is a regular one, with a set of columns separated by white space, and rows that for the majority of cases occupy just a single line.</p>
<p><a href="http://www.flickr.com/photos/psychemedia/9067018967/" title="SUpplier list detail by psychemedia, on Flickr"><img src="http://farm4.staticflickr.com/3775/9067018967_c835237036.jpg" width="500" height="171" alt="SUpplier list detail"></a></p>
<p>I&#8217;m not sure what the &#8220;proper&#8221; way of scraping the tabular data from this document is, but here&#8217;s the sort of approach I&#8217;ve arrived at from a combination of copying things I&#8217;ve seen, and a bit of my own problem solving.</p>
<p>The environment I&#8217;ll use to write the scraper is <a href="http://scraperwiki.com">Scraperwiki</a>. Scraperwiki is undergoing something of a relaunch at the moment, so the screenshots may differ a little from what&#8217;s there now, but the code should be the same once you get started. To be able to copy &#8211; and save &#8211; your own scrapers, you&#8217;ll need an account; but it&#8217;s free, for the moment (though there is likely to soon be a limit on the number of free scrapers you can run&#8230;) so there&#8217;s no reason not to&#8230;;-)</p>
<p>Once you create a new scraper:</p>
<p><a href="http://www.flickr.com/photos/psychemedia/9069240456/" title="scraperwiki create new scraper by psychemedia, on Flickr"><img src="http://farm8.staticflickr.com/7294/9069240456_a22ebb8a28.jpg" width="500" height="276" alt="scraperwiki create new scraper"></a></p>
<p>you&#8217;ll be presented with an editor window, where you can write your scraper code (<em>don&#8217;t panic!</em>), along with a status area at the bottom of the screen. This area is used to display log messages when you run your scraper, as well as updates about the pages you&#8217;re hoping to scrape that you&#8217;ve loaded into the scraper from elsewhere on the web, and details of any data you have popped into the small SQLite database that is associated with the scraper (really, <em>DON&#8217;T PANIC!</em>&#8230;)</p>
<p>Give your scraper a name, and save it&#8230;</p>
<p><a href="http://www.flickr.com/photos/psychemedia/9067017573/" title="blank scraper by psychemedia, on Flickr"><img src="http://farm3.staticflickr.com/2821/9067017573_a244e931c3.jpg" width="500" height="280" alt="blank scraper"></a></p>
<p>To start with, we need to load a couple of programme libraries into the scraper. These libraries provide programming tools that do much of the heavy lifting for us, and hide much of the nastiness of working with the raw PDF document data.</p>
<p><tt>import scraperwiki<br />
import urllib2, lxml.etree</tt></p>
<p>No, I don&#8217;t really know everything these libraries can do either, although I do know where to find the documentation for them&#8230; <a href="http://lxml.de/tutorial.html">lxml.etree</a>, <a href="https://scraperwiki.com/docs/python/python_help_documentation/">scraperwiki</a>! (You can also download and run the scraperwiki library in your own Python programmes outside of scraperwiki.com.)</p>
<p>To load the target PDF document into the scraper, we need to tell the scraper where to find it. In this case, the web address/URL of the document is <em>http://cdn.varner.eu/cdn-1ce36b6442a6146/Global/Varner/CSR/Downloads_CSR/Fabrikklister_VarnerGruppen_2013.pdf</em>, so that&#8217;s exactly what we&#8217;ll use:</p>
<p><tt>url = 'http://cdn.varner.eu/cdn-1ce36b6442a6146/Global/Varner/CSR/Downloads_CSR/Fabrikklister_VarnerGruppen_2013.pdf'</tt></p>
<p>The following three lines will load the file into the scraper, &#8220;parse&#8221; the data into an XML document format, which represents the whole PDF in a way that resembles an HTML page (sort of), and then provide us with a link to the &#8220;root&#8221; of that document.</p>
<p><tt>pdfdata = urllib2.urlopen(url).read()<br />
xmldata = scraperwiki.pdftoxml(pdfdata)<br />
root = lxml.etree.fromstring(xmldata)</tt></p>
<p>If you run this bit of code, you&#8217;ll see the PDF document gets loaded in:</p>
<p><a href="http://www.flickr.com/photos/psychemedia/9067019167/" title="Scraperwiki page loaded in by psychemedia, on Flickr"><img src="http://farm3.staticflickr.com/2860/9067019167_e966244011.jpg" width="500" height="150" alt="Scraperwiki page loaded in"></a></p>
<p>Here&#8217;s an example of what some of the XML from the PDF we&#8217;ve just loaded looks like. To preview it:</p>
<p><tt>print lxml.etree.tostring(root, pretty_print=True)</tt></p>
<p><a href="http://www.flickr.com/photos/psychemedia/9067018311/" title="PDF as XML preview by psychemedia, on Flickr"><img src="http://farm8.staticflickr.com/7386/9067018311_5b02da388c.jpg" width="500" height="231" alt="PDF as XML preview"></a></p>
<p>We can see how many pages there are in the document using the following command:</p>
<p><tt>pages = list(root)<br />
print "There are",len(pages),"pages"</tt></p>
<p>The <em>scraperwiki.pdftoxml</em> function I&#8217;m using converts each line of the PDF document into a group of separate elements. We can iterate through each page, and each element within each page, using the following nested loop:</p>
<p><tt>for page in pages:<br />
&nbsp;&nbsp;for el in page:</tt></p>
<p>We can take a peek inside the elements using the following print statement within that nested loop:</p>
<p><tt>if el.tag == "text":<br />
&nbsp;&nbsp;print el.text, el.attrib</tt></p>
<p><a href="http://www.flickr.com/photos/psychemedia/9067019019/" title="Previewing the XML element contents by psychemedia, on Flickr"><img src="http://farm8.staticflickr.com/7400/9067019019_286a35d968.jpg" width="500" height="301" alt="Previewing the XML element contents"></a></p>
<p>Here&#8217;s the sort of thing we see from one of the table pages (the actual document has a cover page followed by several tabulated data pages):</p>
<p><tt>Bangladesh {'font': '3', 'width': '62', 'top': '289', 'height': '17', 'left': '73'}<br />
Cutting Edge {'font': '3', 'width': '71', 'top': '289', 'height': '17', 'left': '160'}<br />
1612, South Salna, Salna Bazar {'font': '3', 'width': '165', 'top': '289', 'height': '17', 'left': '425'}<br />
Gazipur {'font': '3', 'width': '44', 'top': '289', 'height': '17', 'left': '907'}<br />
Dhaka Division {'font': '3', 'width': '85', 'top': '289', 'height': '17', 'left': '1059'}<br />
Bangladesh {'font': '3', 'width': '62', 'top': '311', 'height': '17', 'left': '73'}</tt></p>
<p>Looking again at the output from each row of the table, we see that there are regular position indicators, particularly the &#8220;top&#8221; and &#8220;left&#8221; co-ordinates, which give the position at which the registration point of each block of text should be placed on the page.</p>
<p>If we imagine the PDF table marked up with guide lines, we can annotate it with some of the co-ordinate values &#8211; the blue lines correspond to co-ordinates extracted from the document:</p>
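<p>To experiment with these attributes without loading the PDF, a minimal sketch might build a small stand-in tree by hand and walk it with the same kind of nested loop. The standard library <tt>xml.etree.ElementTree</tt> is swapped in for <tt>lxml.etree</tt> here, and the <tt>pdf2xml</tt> root tag, the <tt>number</tt> attribute, the helper names and the two-page layout are assumptions made for illustration; only the cell values are copied from the output above.</p>

```python
import xml.etree.ElementTree as ET

# Cell values copied from the sample output: (text, top, left)
SAMPLE_CELLS = [
    ("Bangladesh", "289", "73"),
    ("Cutting Edge", "289", "160"),
    ("1612, South Salna, Salna Bazar", "289", "425"),
    ("Gazipur", "289", "907"),
    ("Dhaka Division", "289", "1059"),
]

def make_sample_root():
    """Build a hypothetical stand-in for the parsed PDF: a cover page and one table page."""
    root = ET.Element("pdf2xml")              # root tag name is an assumption
    ET.SubElement(root, "page", number="1")   # cover page with no text elements
    page = ET.SubElement(root, "page", number="2")
    for text, top, left in SAMPLE_CELLS:
        el = ET.SubElement(page, "text", top=top, left=left)
        el.text = text
    return root

def text_cells(root):
    """Return (text, left, top) for every text element on every page."""
    found = []
    for page in list(root):
        for el in page:
            if el.tag == "text":
                found.append((el.text, int(el.attrib["left"]), int(el.attrib["top"])))
    return found

cells = text_cells(make_sample_root())
print("%d text cells, first: %r" % (len(cells), cells[0]))
```

<p>Converting the co-ordinates to integers at this point means later comparisons work on numbers rather than on strings.</p>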
<p><a href="http://www.flickr.com/photos/psychemedia/9067099873/" title="imaginary table lines by psychemedia, on Flickr"><img src="http://farm8.staticflickr.com/7324/9067099873_69a6a79bc3.jpg" width="500" height="173" alt="imaginary table lines"></a></p>
<p>We can now construct a small default reasoning hierarchy that describes the contents of each row based on the horizontal (&#8220;x-axis&#8221;, or &#8220;left&#8221; co-ordinate) value. For convenience, we pick values that offer a clear separation between the x-co-ordinates defined in the document. In the diagram above, the red lines mark the threshold values I have used to distinguish one column from another:</p>
<p><tt>if int(el.attrib['left']) &lt; 100: print 'Country:', el.text,<br />
elif int(el.attrib['left']) &lt; 250: print 'Factory name:', el.text,<br />
elif int(el.attrib['left']) &lt; 500: print 'Address:', el.text,<br />
elif int(el.attrib['left']) &lt; 1000: print 'City:', el.text,<br />
else:<br />
&nbsp;&nbsp;print 'Region:', el.text</tt></p>
<p>Take a deep breath and try to follow the logic of it. Hopefully you can see how this works&#8230;? The data rows are ordered, stepping through each cell in the table (working left to right) for each table row in turn. The repeated <em>if-else</em> statement tries to find the leftmost column into which a text value might fall, based on the value of its &#8220;left&#8221; attribute. When we find the value of the rightmost column, we print out the data associated with each column in that row.</p>
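<p>The same default reasoning can also be written as a lookup against a sorted list of thresholds, which keeps the column boundaries in one place. This is a sketch rather than the code used in the scraper: the threshold values and column labels are the ones given above, while <tt>THRESHOLDS</tt>, <tt>COLUMNS</tt> and <tt>column_for</tt> are names invented for illustration.</p>

```python
from bisect import bisect_right

# Upper bounds on the "left" co-ordinate for the first four columns
THRESHOLDS = [100, 250, 500, 1000]
COLUMNS = ["Country", "Factory name", "Address", "City", "Region"]

def column_for(left):
    """Map a "left" co-ordinate (string or int) to a column label."""
    # bisect_right counts how many thresholds are at or below the value,
    # which is the index of the first column whose bound lies above it
    return COLUMNS[bisect_right(THRESHOLDS, int(left))]

for left in ["73", "160", "425", "907", "1059"]:
    print("%s: %s" % (left, column_for(left)))
```

<p>A value that falls exactly on a threshold is assigned to the column on its right, which matches the strict comparisons in the <em>if-else</em> chain.</p>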
<p>We&#8217;re now in a position to look at running a proper test scrape, but let&#8217;s optimise the code slightly first: we know that the data table starts on the second page of the PDF document, so we can ignore the first page when we loop through the pages. As with many programming languages, Python tends to start counting with a 0; to loop through the <em>second</em> page to the final page in the document, we can use this revised loop statement:</p>
<p><tt>for page in pages[1:]:</tt></p>
<p><em>Here, <tt>pages</tt> describes a list with N items, which we can write explicitly as <tt>pages[0:N]</tt> (the items themselves are indexed from 0 to N-1). Python list indexing counts the first item in the list as item zero, so <tt>[1:]</tt> defines the sublist from the second item in the list (which has the index value 1 given that we start counting at zero) to the end of the list.</em></p>
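<p>A quick way to convince yourself of how the slice behaves is to try it on a short list. The page labels below are made up for illustration and simply stand in for the page elements.</p>

```python
# Hypothetical stand-in for the list of page elements
pages = ["cover", "table page 1", "table page 2"]
n = len(pages)

whole = pages[0:n]              # every item: the end index is not included
all_but_last = pages[0:n - 1]   # stops before the final item
without_cover = pages[1:]       # second item through to the end

print(without_cover)
```

<p>Because the end index of a slice is excluded, <tt>pages[0:n - 1]</tt> drops the last page, whereas <tt>pages[1:]</tt> drops only the first.</p>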
<p>Rather than just printing out the data, what we really want to do is grab hold of it, a row at a time, and add it to a database.</p>
<p>We can use a simple data structure to model each row in a way that identifies which data element was in which column. We initialise this data element in the first cell of a row, and print it out in the last. Here&#8217;s some code to do that:</p>
<p><tt>for page in pages[1:]:<br />
&nbsp;&nbsp;for el in page:<br />
&nbsp;&nbsp;&nbsp;&nbsp;if el.tag == "text":<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if int(el.attrib['left']) &lt; 100: data = { 'Country': el.text }<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;elif int(el.attrib['left']) &lt; 250: data['Factory name'] = el.text<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;elif int(el.attrib['left']) &lt; 500: data['Address'] = el.text<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;elif int(el.attrib['left']) &lt; 1000: data['City'] = el.text<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;else:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;data['Region'] = el.text<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;print data</tt></p>
<p>And here&#8217;s the sort of thing we get if we run it:</p>
<p><a href="http://www.flickr.com/photos/psychemedia/9069242462/" title="starting to get structured data by psychemedia, on Flickr"><img src="http://farm3.staticflickr.com/2835/9069242462_6ef64a36ab.jpg" width="500" height="222" alt="starting to get structured data"></a></p>
<p>That looks nearly there, doesn&#8217;t it, although if you peer closely you may notice that sometimes we catch a header row. There are a couple of ways we might be able to ignore the elements in the first, header row of the table on each page.</p>
<ul>
<li>We could keep track of the &#8220;top&#8221; co-ordinate value and ignore the header line based on the value of this attribute.</li>
<li>We could take a hacky, lazy way out and explicitly ignore any text value that is one of the column header values.</li>
</ul>
<p>The first is rather more elegant, and would also allow us to automatically label each column and retain its semantics, rather than explicitly labelling the columns using our own labels. (Can you see how? If we know we are in the title row based on the &#8220;top&#8221; co-ordinate value, we can associate the column headings with the &#8220;left&#8221; co-ordinate value.) The second approach is a bit more of a blunt instrument, but it does the job&#8230;</p>
<p><tt>skiplist=['COUNTRY','FACTORY NAME','ADDRESS','CITY','REGION']<br />
for page in pages[1:]:<br />
&nbsp;&nbsp;for el in page:<br />
&nbsp;&nbsp;&nbsp;&nbsp;if el.tag == "text"  and el.text not in skiplist:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if int(el.attrib['left']) &lt; 100: data = { 'Country': el.text }<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;elif int(el.attrib['left']) &lt; 250: data['Factory name'] = el.text<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;elif int(el.attrib['left']) &lt; 500: data['Address'] = el.text<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;elif int(el.attrib['left']) &lt; 1000: data['City'] = el.text<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;else:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;data['Region'] = el.text<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;print data</tt></p>
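<p>For comparison, the first approach (spotting the header row by its &#8220;top&#8221; value and reusing its text as column labels) might be sketched along the following lines. This is an illustration under assumptions rather than tested scraper code: the header row&#8217;s &#8220;top&#8221; value of 267 and its &#8220;left&#8221; positions are invented, <tt>label_rows</tt> is a hypothetical helper, and it assumes each header cell starts at or to the left of the data cells in its column.</p>

```python
from bisect import bisect_right

def label_rows(cells, header_top):
    """Group (text, left, top) cells from one page into dicts keyed by header text."""
    header = sorted((left, text) for text, left, top in cells if top == header_top)
    starts = [left for left, _ in header]
    names = [text for _, text in header]
    rows = {}
    for text, left, top in cells:
        if top == header_top:
            continue  # the header row supplies labels, not data
        # pick the last header that starts at or before this cell
        idx = max(bisect_right(starts, left) - 1, 0)
        rows.setdefault(top, {})[names[idx]] = text
    return [rows[top] for top in sorted(rows)]

cells = [
    # header row: positions and "top" value are assumptions
    ("COUNTRY", 73, 267), ("FACTORY NAME", 160, 267), ("ADDRESS", 425, 267),
    ("CITY", 907, 267), ("REGION", 1059, 267),
    # data row taken from the sample output earlier in the post
    ("Bangladesh", 73, 289), ("Cutting Edge", 160, 289),
    ("1612, South Salna, Salna Bazar", 425, 289),
    ("Gazipur", 907, 289), ("Dhaka Division", 1059, 289),
]
rows = label_rows(cells, header_top=267)
print(rows[0]["FACTORY NAME"])
```

<p>Grouping by the &#8220;top&#8221; value also means a row no longer depends on its first cell arriving first, although rows that wrap over two lines would still need separate handling.</p>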
<p>At the end of the day, it&#8217;s the data we&#8217;re after and the aim is not necessarily to produce a reusable, general solution &#8211; expedient means occasionally win out! As ever, we have to decide for ourselves the point at which we stop trying to automate everything and consider whether it makes more sense to hard code our observations rather than trying to write scripts to automate or generalise them.</p>
<p><img src="http://imgs.xkcd.com/comics/the_general_problem.png" alt="http://xkcd.com/974/ - The General Problem" /></p>
<p>The final step is to add the data to a database. For example, instead of printing out each data row, we could add the data to a scraper database table using the command:</p>
<p><tt>scraperwiki.sqlite.save(unique_keys=[], table_name='fabvarn', data=data)</tt></p>
<p><a href="http://www.flickr.com/photos/psychemedia/9071634886/" title="Scraped data preview by psychemedia, on Flickr"><img src="http://farm4.staticflickr.com/3775/9071634886_9b6c5ae2a2.jpg" width="500" height="77" alt="Scraped data preview"></a></p>
<p><em>Note that the repeated database accesses can slow Scraperwiki down somewhat, so instead we might choose to build up a list of data records, one per row, for each page, and then add all the companies scraped from a page in a single save, one page at a time.</em></p>
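<p>One way to picture that batching is sketched below. The <tt>save</tt> argument is a stand-in for whatever actually writes to the database; on Scraperwiki it might wrap <tt>scraperwiki.sqlite.save</tt> with <tt>data=rows</tt>, on the assumption that the call accepts a list of records, which is worth checking against the documentation. The function name and the second sample record are invented for illustration.</p>

```python
def save_page_batches(pages_of_rows, save):
    """Call save once per page, passing that page's rows as one list."""
    batches = 0
    for rows in pages_of_rows:
        if rows:  # skip pages that produced no rows, such as a cover page
            save(rows)
            batches += 1
    return batches

# Stand-in for the database: each saved batch is appended to a list
stored = []
pages_of_rows = [
    [],  # cover page
    [
        {"Country": "Bangladesh", "City": "Gazipur"},
        {"Country": "Bangladesh", "City": "Dhaka"},  # made-up record
    ],
]
batches = save_page_batches(pages_of_rows, stored.append)
print("%d batch(es) saved" % batches)
```

<p>Passing the save routine in as an argument keeps the batching logic easy to try out away from the scraper itself.</p>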
<p>If we need to remove a database table, this utility function may help &#8211; call it using the name of the table you want to clear&#8230;</p>
<p><tt>def dropper(table):<br />
&nbsp;&nbsp;if table!='':<br />
&nbsp;&nbsp;&nbsp;&nbsp;try: scraperwiki.sqlite.execute('drop table "'+table+'"')<br />
&nbsp;&nbsp;&nbsp;&nbsp;except: pass</tt></p>
<p>Here&#8217;s another handy utility routine I found somewhere a long time ago (I&#8217;ve lost the original reference?) that &#8220;flattens&#8221; a marked up element into a single string, keeping the text along with the tags of any child elements (such as bold or italic markup):</p>
<p><tt>def gettext_with_bi_tags(el):<br />
&nbsp;&nbsp;res = [ ]<br />
&nbsp;&nbsp;if el.text:<br />
&nbsp;&nbsp;&nbsp;&nbsp;res.append(el.text)<br />
&nbsp;&nbsp;for lel in el:<br />
&nbsp;&nbsp;&nbsp;&nbsp;res.append("&lt;%s&gt;" % lel.tag)<br />
&nbsp;&nbsp;&nbsp;&nbsp;res.append(gettext_with_bi_tags(lel))<br />
&nbsp;&nbsp;&nbsp;&nbsp;res.append("&lt;/%s&gt;" % lel.tag)<br />
&nbsp;&nbsp;&nbsp;&nbsp;if lel.tail:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;res.append(lel.tail)<br />
&nbsp;&nbsp;return "".join(res).strip()</tt></p>
<p>If we pass this function an element such as <em>&lt;em&gt;Some text&lt;/em&gt;</em> it will return <em>Some text</em>; for <em>&lt;em&gt;Some &lt;strong&gt;text&lt;/strong&gt;&lt;/em&gt;</em> it will return <em>Some &lt;strong&gt;text&lt;/strong&gt;</em>, because the tags of child elements are kept.</p>
<p>Having saved the data to the scraper database, we can download it or access it via a SQL API from the scraper homepage:</p>
<p><a href="http://www.flickr.com/photos/psychemedia/9071635038/" title="scrpaed data - db by psychemedia, on Flickr"><img src="http://farm3.staticflickr.com/2853/9071635038_9a23e8ed85.jpg" width="500" height="191" alt="scrpaed data - db"></a></p>
<p>You can find a copy of the scraper <a href="https://scraperwiki.com/scrapers/pdf_scraper_intro/">here</a> and a copy of various stages of the code development <a href="https://gist.github.com/psychemedia/5800840">here</a>.</p>
<p>Finally, it is worth noting that there is a small number of &#8220;badly behaved&#8221; data rows that split over more than one table row on the PDF.</p>
<p><a href="http://www.flickr.com/photos/psychemedia/9071576492/" title="broken scraper row by psychemedia, on Flickr"><img src="http://farm4.staticflickr.com/3769/9071576492_371f4a0c13.jpg" width="500" height="56" alt="broken scraper row"></a></p>
<p>Whilst we can handle these within the scraper script, the effort of creating the exception handlers sometimes exceeds the pain associated with identifying the broken rows and fixing the data associated with them by hand.</p>
<p><strong>Summary</strong></p>
<p>This tutorial has shown one way of writing a simple scraper for extracting tabular data from a simply structured PDF document. In much the same way as a sculptor may lock on to a particular idea when working a piece of stone, a scraper writer may find that they lock on to a particular way of parsing data out of a document, and develop a particular set of abstractions and exception handlers as a result. Writing scrapers can be infuriating at times, but may also prove very rewarding in the way that solving any puzzle can be. Compared to copying and pasting data from a PDF by hand, it may also be time well spent!</p>
<p>It is also worth remembering that sometimes it can be quicker to write a scraper that does most of the job, and then finish off the data cleansing or exception handling using another tool, such as OpenRefine or even just a simple text editor. On occasion, it may also make sense to throw the data into a database table as quickly as you can, and then develop code to manage a second pass that takes the raw data out of the database, tidies it up, and then writes it in a cleaner or more structured form into another database table.</p>
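<p>As a rough illustration of that two-pass idea, the sketch below uses Python&#8217;s standard <tt>sqlite3</tt> module with an in-memory database rather than the Scraperwiki datastore. The table names, column names, tidying rules and sample rows are all invented for the example.</p>

```python
import sqlite3

def second_pass(conn):
    """Copy rows from the raw table into a clean table, tidying as we go."""
    conn.execute("CREATE TABLE IF NOT EXISTS clean (country TEXT, city TEXT)")
    raw = conn.execute("SELECT country, city FROM raw").fetchall()
    # Drop incomplete rows, trim stray whitespace and normalise the case
    tidy = [(c.strip().title(), t.strip()) for c, t in raw if c and t]
    conn.executemany("INSERT INTO clean VALUES (?, ?)", tidy)
    return len(tidy)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw (country TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO raw VALUES (?, ?)",
    [
        ("  BANGLADESH ", "Gazipur"),   # messy but usable
        ("COUNTRY", None),              # stray header fragment, incomplete
        ("Bangladesh", " Dhaka "),      # made-up record
    ],
)
written = second_pass(conn)
clean_rows = conn.execute("SELECT country, city FROM clean").fetchall()
print("%d clean rows written" % written)
```

<p>Keeping the raw table untouched means the tidying rules can be revised and rerun without scraping the PDF again.</p>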
<p><em>The images used in this post are available via a flickr set: <a href="http://www.flickr.com/photos/psychemedia/sets/72157634180274600/">ScoDa-Scraping-SimplePDFtable</a></em></p>
]]></content:encoded>
			<wfw:commentRss>http://schoolofdata.org/2013/06/18/get-started-with-scraping-extracting-simple-tables-from-pdf-documents/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<atom:link rel="payment" title="Flattr this!" href="https://flattr.com/submit/auto?user_id=openknowledgefoundation&amp;popout=1&amp;url=http%3A%2F%2Fschoolofdata.org%2F2013%2F06%2F18%2Fget-started-with-scraping-extracting-simple-tables-from-pdf-documents%2F&amp;language=en_GB&amp;category=text&amp;title=Get+Started+With+Scraping+%26%238211%3B+Extracting+Simple+Tables+from+PDF+Documents&amp;description=As+anyone+who+has+tried+working+with+%26%238220%3Breal+world%26%238221%3B+data+releases+will+know%2C+sometimes+the+only+place+you+can+find+a+particular+dataset+is+as+a+table+locked+up+in..." type="text/html" />
	</item>
		<item>
		<title>Join the School of Data as a Community Mentor!</title>
		<link>http://schoolofdata.org/2013/06/17/join-the-school-of-data-as-a-community-mentor/</link>
		<comments>http://schoolofdata.org/2013/06/17/join-the-school-of-data-as-a-community-mentor/#comments</comments>
		<pubDate>Mon, 17 Jun 2013 14:07:41 +0000</pubDate>
		<dc:creator>Michael Bauer</dc:creator>
				<category><![CDATA[Community]]></category>

		<guid isPermaLink="false">http://schoolofdata.org/?p=5144</guid>
		<description><![CDATA[Have data skills to share? Want to bring the School of Data to your community? We are currently looking for 12 Community Mentors as a pilot for our international network. As a Community Mentor you will: Offer constructive feedback for learners on projects (often within your own language region) Help to answer questions by learners [...]]]></description>
				<content:encoded><![CDATA[<p><!-- magazine.image = http://farm8.staticflickr.com/7126/7020606995_488b177979_z.jpg --></p>
<p><strong>Have data skills to share? Want to bring the School of Data to your community? We are currently looking for 12 Community Mentors as a pilot for our international network.</strong></p>
<p><a href="http://www.flickr.com/photos/northcharleston/7020606995/" title="2012 FIRST Robotics Competition Palmetto Regional by North Charleston, on Flickr"><img src="http://farm8.staticflickr.com/7126/7020606995_488b177979_z.jpg" width="640" height="427" alt="2012 FIRST Robotics Competition Palmetto Regional"></a></p>
<p>As a Community Mentor you will:</p>
<ul>
<li>Offer constructive feedback for learners on projects (often within your own language region)</li>
<li>Help to answer questions from learners on forums/mailing lists (in your own language)</li>
<li>Organize data expeditions and hands-on workshops</li>
</ul>
<p>You&#8217;ll get training, help and support from the School of Data team, and good karma (priceless)!</p>
<p><a href="http://schoolofdata.org/get-started-as-a-community-mentor/" class="btn btn-primary">Sign up and Get started as a community mentor!</a></p>
]]></content:encoded>
			<wfw:commentRss>http://schoolofdata.org/2013/06/17/join-the-school-of-data-as-a-community-mentor/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<atom:link rel="payment" title="Flattr this!" href="https://flattr.com/submit/auto?user_id=openknowledgefoundation&amp;popout=1&amp;url=http%3A%2F%2Fschoolofdata.org%2F2013%2F06%2F17%2Fjoin-the-school-of-data-as-a-community-mentor%2F&amp;language=en_GB&amp;category=text&amp;title=Join+the+School+of+Data+as+a+Community+Mentor%21&amp;description=Have+data+skills+to+share%3F+Want+to+bring+the+School+of+Data+to+your+community%3F+We+are+currently+looking+for+12+Community+Mentors+as+a+pilot+for+our+international+network...." type="text/html" />
	</item>
		<item>
		<title>On the Radar: Using Data to Save Lives</title>
		<link>http://schoolofdata.org/2013/06/17/on-the-radar-using-data-to-save-lives/</link>
		<comments>http://schoolofdata.org/2013/06/17/on-the-radar-using-data-to-save-lives/#comments</comments>
		<pubDate>Mon, 17 Jun 2013 05:24:03 +0000</pubDate>
		<dc:creator>Saskia Rotshuizen</dc:creator>
				<category><![CDATA[Data for CSOs]]></category>

		<guid isPermaLink="false">http://schoolofdata.org/?p=5089</guid>
		<description><![CDATA[The field of crisis mapping is relatively new, but its impact on the global response to conflict is already evident. By enabling massive amounts of information to be quickly understood by any interested party, crisis mapping increases public awareness on an exponential scale and, if properly put together, allows for quicker responses to crises. Invisible [...]]]></description>
				<content:encoded><![CDATA[<p><!--magazine.image = http://farm3.staticflickr.com/2840/9038657733_573b3ba8c3.jpg --></p>
<p>The field of crisis mapping is relatively new, but its impact on the global response to conflict is already evident. By enabling massive amounts of information to be quickly understood by any interested party, crisis mapping increases public awareness on an exponential scale and, if properly put together, allows for quicker responses to crises.</p>
<p>Invisible Children together with the Resolve LRA Crisis Initiative ventured into this field in 2011 with the launch of the <a href="http://www.lracrisistracker.com/">LRA Crisis Tracker</a>. This platform was created as a response to the <em>lack of response</em> to the Makombo Massacres in DR Congo where more than 320 people were killed and 250 people abducted by the Lord’s Resistance Army (LRA). Three months passed before news of the massacres appeared in the media.</p>
<p><div class="wp-caption aligncenter" style="width: 510px"><a href="http://www.lracrisistracker.com/"><img class=" " src="http://farm3.staticflickr.com/2840/9038657733_573b3ba8c3.jpg" alt="" width="500" height="239" /></a><p class="wp-caption-text">The LRA Crisis Tracker</p></div></p>
<p>Inspired by <a href="http://ushahidi.com/">Ushahidi</a>, the free, build-your-own crisis map website, the LRA Crisis Tracker is a real-time mapping platform and data collection system that brings an unprecedented level of transparency to the atrocities committed by the LRA in central Africa.</p>
<p>To build our own platform we partnered with Digitaria, a San Diego-based digital agency, and used a custom-built SalesForce application on the back end. Building the LRA Crisis Tracker was an extensive process. Each of the partners dedicated a few members of their team to work on the project full time. After nine months of development the LRA Crisis Tracker was launched in September 2011.</p>
<p>The Crisis Tracker faces the unique challenge of getting reliable reports from a region that has little to no communication infrastructure. The solution to this problem, in large part, was found in Invisible Children’s expansion of a locally-run high frequency (HF) radio network throughout communities in DR Congo and the Central African Republic. Twice daily, radio reports go to a local hub that then sends this information to our office in San Diego. Obtaining reliable data in a timely manner from such a remote region has required a serious investment of time and money. We’ve been building this network for almost two years and it provides much of the data used by the LRA Crisis Tracker. Invisible Children’s HF radio network currently consists of 38 radios and will continue to grow.</p>
<p>Through its intuitive design and inclusion of photos and videos from the region, the data is engaging and easy to access. The LRA Crisis Tracker is also available as a mobile app (iPhone or Android), and @CrisisTracker tweets LRA incidents as they’re reported.</p>
<p>Reports generated by the LRA Crisis Tracker have been used at all levels of counter-LRA efforts. Military and non-military actors in the region have expressed their appreciation of the real-time information put out by the LRA Crisis Tracker. This is exactly what we were hoping to create: an easy-to-access resource for the media, counter-LRA actors, regional organizations, and the general public. It makes it possible to identify trends in LRA activity that wouldn’t otherwise be accessible.</p>
<p>The platform continues to be a work in progress. Since its creation we&#8217;ve had to go back and revisit the data set multiple times. At one point we went back and added age and gender to incident reports. Other times we are mining our data for new location specifications. Our team spends a lot of time vetting our data to make sure it is accurate, which really is the most crucial aspect of our work.</p>
<p>This summer we’re planning to roll out Phase II of the LRA Crisis Tracker, which will improve the user’s ability to filter and analyze data. We’re excited to make the LRA Crisis Tracker even more valuable in the efforts to bring a permanent end to LRA atrocities. The platform already makes our data available to any interested data enthusiasts through the ‘<a href="http://www.lracrisistracker.com/reports/signup">Get Reports</a>’ tab.</p>
]]></content:encoded>
			<wfw:commentRss>http://schoolofdata.org/2013/06/17/on-the-radar-using-data-to-save-lives/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<atom:link rel="payment" title="Flattr this!" href="https://flattr.com/submit/auto?user_id=openknowledgefoundation&amp;popout=1&amp;url=http%3A%2F%2Fschoolofdata.org%2F2013%2F06%2F17%2Fon-the-radar-using-data-to-save-lives%2F&amp;language=en_GB&amp;category=text&amp;title=On+the+Radar%3A+Using+Data+to+Save+Lives&amp;description=The+field+of+crisis+mapping+is+relatively+new%2C+but+its+impact+on+the+global+response+to+conflict+is+already+evident.+By+enabling+massive+amounts+of+information+to+be+quickly+understood..." type="text/html" />
	</item>
		<item>
		<title>Data roundup, June 12</title>
		<link>http://schoolofdata.org/2013/06/12/data-roundup-june-12/</link>
		<comments>http://schoolofdata.org/2013/06/12/data-roundup-june-12/#comments</comments>
		<pubDate>Wed, 12 Jun 2013 13:00:23 +0000</pubDate>
		<dc:creator>Neil Ashton</dc:creator>
				<category><![CDATA[Data Roundup]]></category>

		<guid isPermaLink="false">http://schoolofdata.org/?p=5051</guid>
		<description><![CDATA[We’re rounding up data news from the web each week. If you have a data news tip, send it to us at schoolofdata@okfn.org. TOOLS, COURSES, AND EVENTS The World Wide Web Foundation and its partners (including the OKFN) have launched the Global Open Data Initiative, &#8220;a champion for Open Data globally&#8221;, aiming to create and [...]]]></description>
				<content:encoded><![CDATA[<p>We’re rounding up data news from the web each week. If you have a data news tip, send it to us at schoolofdata@okfn.org.</p>
<p><div id="attachment_5059" class="wp-caption aligncenter" style="width: 610px"><a href="http://www.flickr.com/photos/kk/6879994839/"><img class="size-full wp-image-5059" src="http://schoolofdata.org/files/2013/06/6879994839_0a1821e9d8_b-e1371038753409.jpg" alt="Photo credit: Kris Krüg" width="600" height="400" /></a><p class="wp-caption-text">Photo credit: Kris Krüg</p></div><!--magazine.image = http://schoolofdata.org/files/2013/06/6879994839_0a1821e9d8_b-e1371038753409.jpg --></p>
<p><strong>TOOLS, COURSES, AND EVENTS</strong></p>
<p>The World Wide Web Foundation and its partners (including the OKFN) have launched the <a href="http://www.webfoundation.org/2013/06/announcing-the-global-open-data-initiative-godi/">Global Open Data Initiative</a>, &#8220;a champion for Open Data globally&#8221;, aiming to create and promote a unified set of guidelines assisting governments in the use of open data.</p>
<p>Today wraps up the <a href="http://openeconomics.net/events/workshop-june-2013/">second Open Economics Workshop</a>, an Open Knowledge Foundation event hosted at MIT. As reported on the <a href="http://openeconomics.net/2013/06/05/second-open-economics-international-workshop/">Open Econ blog</a>, the event brought together some 40 economists and social scientists to discuss research data sharing and transparency in economics.</p>
<p><a href="http://datapolitics.jomc.unc.edu/">Data-Crunched Democracy</a> was a conference bringing together journalists and analysts “to cut through the hype and understand the use of voter data in campaigns”. <a href="http://thescoop.org/">Derek Willis</a> reflects on “the lessons for journalists covering campaigns that engage in the use of data” in an in-depth <a href="http://thescoop.org/archives/2013/06/04/lessons-from-data-crunched-democracy/">blog post</a>.</p>
<p>I’ve known more than one graduate student in the social sciences who has described Excel’s <a href="http://datadrivenjournalism.net/resources/groupthink_grouping_values_in_excel_pivot_tables">pivot tables</a> as “the best thing ever”. Pivot tables are a powerful tool for data exploration. A new blog post by Abbott Katz explains how you can begin using pivot tables in your own work.</p>
<p>Real-time and historical data on <a href="http://dronestre.am/">United States drone strikes</a> is now available as an API. Dronestre.am is a public API making it easy to “build data visualizations about covert war [...] in Pakistan, Yemen, and Somalia”.</p>
<p><a href="http://johnbeieler.org/blog/2013/06/06/using-sql/">Learn about pandas</a>, “one of the best, and most important, libraries for data analysis in Python”, and how it can be used to do serious data analysis using SQL queries in a new blog post by John Beieler.</p>
<p><a href="https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers">Bayesian Methods for Hackers</a>, an introduction to Bayesian probability theory in practical and Pythonic terms, has appeared on the data roundup before. Now a draft of the PDF version of the book has been released. Check out this “understanding-first” introduction to “the natural approach to inference”.</p>
<p>Check out Source’s <a href="http://source.mozillaopennews.org/en-US/articles/event-roundup-june-10/">journalism code event roundup, June 10</a>, for a worldwide selection of hackathons and conferences in data-driven and computer-assisted journalism.</p>
<p><strong>DATA STORIES</strong></p>
<p>So the NSA has all your metadata. What can they do with it? German Green Party politician Malte Spitz sued to repatriate <a href="http://www.zeit.de/datenschutz/malte-spitz-data-retention">six months of his own phone data</a> and made it available to Zeit Online, who combined it with publicly available data to reconstruct six months of Spitz’s life. You can read <a href="http://www.zeit.de/digital/datenschutz/2011-03/data-protection-malte-spitz">more about the project</a> and <a href="https://docs.google.com/spreadsheet/ccc?key=0An0YnoiCbFHGdGp3WnJkbE4xWTdDTVV0ZDlQeWZmSXc&amp;authkey=COCjw-kG&amp;hl=en_GB&amp;authkey=COCjw-kG#gid=0">download its data</a>. You can also check out a <a href="https://www.eff.org/nsa-spying/timeline">timeline of the NSA’s domestic spying</a> from the Electronic Frontier Foundation.</p>
<p><a href="http://projectpolicy.org/">ProjectPolicy</a> aims to “unify, organize and visualize the world’s government information onto one intuitive web platform”. Its take on <a href="http://projectpolicy.org/sf/">San Francisco</a> is available as a demo of what it aims to do.</p>
<p><a href="http://cironline.org/reports/dirty-secrets-worst-charities-4603">America’s Worst Charities</a> presents a year’s investigation by the Tampa Bay Times and the Center for Investigative Reporting into the misuse of charity funds by American charities. It prominently features an <a href="http://www.tampabay.com/americas-worst-charities/">interactive</a> presentation of the data, some of which is also available for download in CSV form.</p>
<p>The <a href="http://blog.vctr.me/posts/central-limit-theorem.html">central limit theorem</a> is a statistical theorem whose scientific importance cannot be overstated. A new visualization of the theorem constructed with D3.js, explained in terms of coin flips, makes it easier to develop intuitions about its meaning.</p>
<p><a href="http://stamen.com">Stamen</a> has put together 3D contour maps of <a href="http://maps.stamen.com/mars/">the surface of Mars</a> from data collected by the Mars Orbiter Laser Altimeter. As their <a href="http://content.stamen.com/mars">blog reports</a>, these maps are “a small gesture of thanks to the scientists who are working hard to do science and communicate with the public despite the stupid <a href="http://www.usatoday.com/story/news/politics/2013/04/23/sequestration-threatens-nasa-deadlines/2107585/">sequester</a>”.</p>
<p>The latest work from <a href="http://accurat.it">Accurat</a> presents <a href="http://www.brainpickings.org/index.php/2013/06/07/painters-lives-accurat-giorgia-lupi/">the lives of ten famous painters</a> in the form of beautiful timelines. Each timeline presents the artist’s personal history in a manner sensitive to the artist’s style.</p>
<p>Check out datenjournalist.de’s roundup of <a href="http://datenjournalist.de/datenjournalismus-im-mai-2013/">Datenjournalismus im Mai 2013</a> (German) for a collection of some of last month’s best examples of data-driven journalism.</p>
<p><strong>DATA SOURCES</strong></p>
<p>In a move that is unlikely to distract attention away from the PRISM scandal, the Obama administration has released a portal <a href="http://www.barackobama.com/climate-deniers">calling out climate science deniers</a>.</p>
<p><a href="http://www.opennepal.net">Open Nepal</a> has launched <a href="http://opendatanepal.org/">Open Data Nepal</a>, a project &#8220;not about creating yet another data repository in the web but an effort to curate and disseminate data that is already available in public domain&#8221;.</p>
<p>Canada’s Global News has obtained, with great difficulty, a database of <a href="http://globalnews.ca/news/622513/open-data-alberta-oil-spills-1975-2013/">over 61,000 Albertan oil spill incidents</a> spanning the period from 1975 to 2013, and they are “now offering this information to the public for download”. This is certainly one of the most important datasets to see the light of day in Alberta, especially given that Alberta’s open data catalogue has been described as perhaps “<a href="http://blogs.edmontonjournal.com/2013/06/04/is-albertas-open-data-catalogue-the-most-useless-open-data-catalogue-in-the-history-of-open-data-catalogues/">the most useless</a> [...] in the history of open data catalogues”.</p>
<p>The Los Angeles Times has acquired and released a database of the <a href="http://salaries.latimes.com/dwp/">salaries of Department of Water and Power employees</a> in 2012, finding that their “average total pay … is more than 50% higher than other city employees”. You can download the dataset and see for yourself.</p>
<p><a href="http://www.washingtonpost.com/business/on-it/freddie-mac-hopes-to-increase-transparency-with-releases-of-raw-mortgage-data/2013/06/09/f02c66ce-cd60-11e2-8f6b-67f40e176f03_story.html">Freddie Mac</a>, a major US mortgage backer, is “standardizing its processes and making raw data more easily accessible to the public”. This move towards “transparency” appears to be part of a process of privatization of government-sponsored mortgages, “using our data to attract private capital”.</p>
]]></content:encoded>
			<wfw:commentRss>http://schoolofdata.org/2013/06/12/data-roundup-june-12/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<atom:link rel="payment" title="Flattr this!" href="https://flattr.com/submit/auto?user_id=openknowledgefoundation&amp;popout=1&amp;url=http%3A%2F%2Fschoolofdata.org%2F2013%2F06%2F12%2Fdata-roundup-june-12%2F&amp;language=en_GB&amp;category=text&amp;title=Data+roundup%2C+June+12&amp;description=We%E2%80%99re+rounding+up+data+news+from+the+web+each+week.+If+you+have+a+data+news+tip%2C+send+it+to+us+at+schoolofdata%40okfn.org.+TOOLS%2C+COURSES%2C+AND+EVENTS+The+World+Wide..." type="text/html" />
	</item>
		<item>
		<title>Several Takes on Defining Data Journalism</title>
		<link>http://schoolofdata.org/2013/06/11/several-takes-on-defining-data-journalism/</link>
		<comments>http://schoolofdata.org/2013/06/11/several-takes-on-defining-data-journalism/#comments</comments>
		<pubDate>Tue, 11 Jun 2013 15:26:56 +0000</pubDate>
		<dc:creator>Tony Hirst</dc:creator>
				<category><![CDATA[Data Blog]]></category>

		<guid isPermaLink="false">http://schoolofdata.org/?p=5044</guid>
		<description><![CDATA[Every so often I get asked the question: &#8220;so what is data journalism?&#8221; I&#8217;m still not sure I have a very good definition of it, but here are three different ways I think we can view it: as a particular sort of output &#8211; one of the easiest ways of responding to the question is [...]]]></description>
				<content:encoded><![CDATA[<p><!--magazine.image = http://farm8.staticflickr.com/7073/6892044574_99cd07512a.jpg --></p>
<p>Every so often I get asked the question: &#8220;<em>so what is data journalism</em>?&#8221; I&#8217;m still not sure I have a very good definition of it, but here are three different ways I think we can view it:</p>
<ul>
<li><em>as a particular sort of output</em> &#8211; one of the easiest ways of responding to the question is to point to a map or graphic that someone has used to illustrate a story, or a piece of &#8220;award winning&#8221; data journalism, and say &#8220;that is&#8221;. Anyone who works with data, however, knows well that producing a graphic is often the <em>easy</em> part of the process, and that most of the time is spent finding the data, fighting with it to get it into a state where you can start working with it, and analysing the data, or asking it questions in order to find the story within it, or illustrate a story you have already discovered. This observation in turn leads to a second way of characterising data journalism:</li>
<li><em>as a particular set of skills</em> &#8211; that is, data journalism is not necessarily what data journalists produce, it’s best thought about in terms of the sorts of skills that data journalists need in order to produce the maps and charts that get pointed at as examples of data journalism.<br />
One way of identifying what these skills might be is to look at job adverts for &#8220;data journalist&#8221; (I collected a few examples here: <a href="http://blog.ouseful.info/2013/05/31/data-journalism-jobs-in-the-air/">So what is a data journalist exactly? A view from the job ads…</a>). Looking through them, many current ads seem to require skills associated with the development of interactive data-driven applications, which puts the emphasis on a range of web design and development skills, again apparently associating the practice of data journalism closely with the production of things that are used to illustrate a story. That is, data journalism is to data what radio journalism is to audio and video journalism is to, erm, video?! (It&#8217;s probably also worth mentioning that data journalism is not necessarily genre-based journalism, such as science journalism or sports journalism &#8211; it&#8217;s not just &#8220;about&#8221; data.)<br />
But that doesn&#8217;t feel right, either, which suggests a third way of considering data journalism:</li>
<li><em>as a process</em> &#8211; and in particular, as a process that involves data somehow, though not necessarily exclusively. Whilst there may be &#8220;data outputs&#8221;, it might also be the case that the data journalistic process generates a lead that develops into a story that is not best illustrated using &#8220;data&#8221;. Data might lead us to a story, for example, that one particular garment retailer tolerates poor working conditions through the discovery that they use factories blacklisted by other retailers, but that story may be best expressed in other terms. The data, in other words, may simply play the role of a <em>source</em>, and in this sense &#8220;data journalism&#8221; is more process oriented, in much the same way that investigative journalism is, although potentially over much shorter timescales. (We might expect a data journalism piece to be produced in a matter of hours as part of the daily news cycle, for example.)<br />
Under this process view of data journalism, the skills required of a journalist participating in the process may take the form simply of advanced information skills, such as the ability to run powerful advanced searches using web search engines, filter down a data set using text and/or numeric facets in a tool such as OpenRefine, or run structured queries over data in a database using a query language such as SQL.<br />
The process might equally involve using data visualisation tools to make sense of a dataset, or generate further questions from it, questions that might be additionally asked of the dataset itself, possibly in conjunction with other datasets, or alternatively used to set up a question then asked of a person.<br />
For certain data sets, statistical tests may be required to identify whether there is something or nothing in what the data appears to be saying, or questions asked of an expert in the field to identify whether a number is actually a <em>big</em> number or not (hat tip to FT Undercover Economist, and More Or Less presenter, <a href="http://www.open.edu/openlearn/whats-on/ou-on-the-bbc-more-or-less-inside-interview">Tim Harford</a>, for that refrain!). And <em>then</em> it may be time to get the interactive developers on board. Or there may be no need.</li>
</ul>
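<p>To make the &#8220;structured queries&#8221; idea from the process view a little more concrete, here is a minimal sketch in Python using the standard library&#8217;s <code>sqlite3</code> module. It mirrors the garment retailer example: given a table of who buys from which factory and a table of who has blacklisted which factory, a single SQL query surfaces the lead. The table names, column names and any rows passed in are invented for illustration; no real dataset is implied.</p>

```python
import sqlite3


def retailers_using_blacklisted_factories(supplies, blacklists):
    """Return (retailer, factory) pairs where the retailer buys from a
    factory that some *other* retailer has blacklisted.

    Both arguments are lists of (retailer, factory) pairs; the schema is
    a hypothetical one made up for this sketch.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE supplies (retailer TEXT, factory TEXT)")
    conn.execute("CREATE TABLE blacklists (retailer TEXT, factory TEXT)")
    conn.executemany("INSERT INTO supplies VALUES (?, ?)", supplies)
    conn.executemany("INSERT INTO blacklists VALUES (?, ?)", blacklists)

    # Join the two tables on factory, keeping only rows where the
    # blacklisting retailer differs from the buying retailer.
    query = """
        SELECT DISTINCT s.retailer, s.factory
        FROM supplies AS s
        JOIN blacklists AS b
          ON b.factory = s.factory
         AND b.retailer <> s.retailer
        ORDER BY s.retailer, s.factory
    """
    return conn.execute(query).fetchall()
```

<p>The result is a lead rather than a story: each pair it returns is a question to put to a person, not a finding to publish as it stands.</p>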
<p>So are we any nearer to having a definition of &#8220;data journalism&#8221; that takes into account these different views?</p>
<p>Here&#8217;s one I quite like:</p>
<blockquote><p>The art and practice of <em>finding stories in data&#8230;</em></p>
<p><em>&#8230;and then retelling them.</em></p></blockquote>
<p>This captures both the notion that data journalism is about <em>finding</em> stories from a particular sort of source (a data source) and then communicating them, whilst not requiring that the telling of the story is done in any particular way.</p>
<p>Here&#8217;s another:</p>
<blockquote><p>Journalism in which <em>“data” is one of the sources used to get or relate a story</em>.</p></blockquote>
<p>In this case, we see data as playing a role either in the sourcing of a story, or the communication of a story (or maybe even both), but again, we imagine data playing a role in &#8220;human&#8221; terms.</p>
<p>So what&#8217;s your favorite definition of <em>data journalism</em>?</p>
<p>See also: <a href="http://datajournalismhandbook.org/">Data Journalism Handbook</a></p>
]]></content:encoded>
			<wfw:commentRss>http://schoolofdata.org/2013/06/11/several-takes-on-defining-data-journalism/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<atom:link rel="payment" title="Flattr this!" href="https://flattr.com/submit/auto?user_id=openknowledgefoundation&amp;popout=1&amp;url=http%3A%2F%2Fschoolofdata.org%2F2013%2F06%2F11%2Fseveral-takes-on-defining-data-journalism%2F&amp;language=en_GB&amp;category=text&amp;title=Several+Takes+on+Defining+Data+Journalism&amp;description=Every+so+often+I+get+asked+the+question%3A+%26%238220%3Bso+what+is+data+journalism%3F%26%238221%3B+I%26%238217%3Bm+still+not+sure+I+have+a+very+good+definition+of+it%2C+but+here+are+three+different..." type="text/html" />
	</item>
		<item>
		<title>Mapping the Well-Being of Children in the District of Columbia</title>
		<link>http://schoolofdata.org/2013/06/11/mapping-the-well-being-of-children-in-the-district-of-columbia/</link>
		<comments>http://schoolofdata.org/2013/06/11/mapping-the-well-being-of-children-in-the-district-of-columbia/#comments</comments>
		<pubDate>Tue, 11 Jun 2013 06:00:56 +0000</pubDate>
		<dc:creator>HyeSook Chung</dc:creator>
				<category><![CDATA[Data for CSOs]]></category>

		<guid isPermaLink="false">http://schoolofdata.org/?p=5015</guid>
		<description><![CDATA[Last year, DC Action for Children, in partnership with DataKind and a group of dedicated pro-bono data scientists, created an interactive, web-based tool to take traditional child well-being indicators “beyond the PDF book” and into the exciting realm of visualizing and communicating data for collective action. The neighborhood maps we created showed that the success [...]]]></description>
				<content:encoded><![CDATA[<p><!--magazine.image = http://farm4.staticflickr.com/3778/9009014650_d4cca0135e.jpg --></p>
<p>Last year, DC Action for Children, in partnership with DataKind and a group of dedicated pro-bono data scientists, created an interactive, web-based tool to take traditional child well-being indicators “beyond the PDF book” and into the exciting realm of visualizing and communicating data for collective action.</p>
<p>The <a href="http://www.dcactionforchildren.org/kids-count/dc-kids-count-data-tools">neighborhood maps</a> we created showed that the success of too many DC (District of Columbia, U.S.) children is predetermined by their ZIP Code and by limited access to the critical resources they need to thrive. Some DC neighborhoods have assets that enrich the lives of children, but others are characterized by high levels of poverty and the many challenges that come with it, including poorer-performing schools, more violent crime and less access to resources like healthy food, libraries, parks and recreation centers.</p>
<p style="text-align: center"><a href="http://www.dcactionforchildren.org/kids-count/dc-kids-count-data-tools" target="_blank"><img src="http://farm4.staticflickr.com/3778/9009014650_d4cca0135e.jpg" alt="dc action for kids" /></a></p>
<p>For the project we used both U.S. Census Bureau and local administrative data about the population and resources in District of Columbia neighborhoods. We obtained data on population counts and social characteristics from the Decennial Census and American Community Survey. Geographical data, shapefiles for mapping, and data on community characteristics such as grocery stores, libraries, crime and transportation were obtained from the DC Data Catalog. Other data were obtained directly from local agencies, including the DC Office of the State Superintendent of Education and the DC Department of Health.</p>
<p>To obtain the neighborhood-level estimates, our data scientists used block-level population data to construct population weights for data at the block-group and neighborhood level. The DC Master Address Repository was used to geocode point data, such as locations of libraries or schools; ArcGIS was used to aggregate point data by neighborhood. Collaborators used MapBox to create neighborhood maps.</p>
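<p>The exact weighting method is not spelled out here, so the following Python is only a sketch of one common approach, population-proportional apportionment: each block group&#8217;s count is shared out among neighborhoods according to where its block-level population lives. The function name, the input layout and the assumption that the indicator is a count (rather than a rate) are all illustrative choices, not a description of what the project&#8217;s data scientists actually ran.</p>

```python
def neighborhood_estimates(blocks, block_group_values):
    """Apportion block-group counts to neighborhoods by population share.

    blocks: list of (block_group, neighborhood, population) tuples, one
        per census block (hypothetical layout).
    block_group_values: dict mapping block_group -> count reported at
        block-group level (e.g. number of children in poverty).
    """
    # Total population of each block group, summed over its blocks.
    group_population = {}
    for group, _, population in blocks:
        group_population[group] = group_population.get(group, 0) + population

    estimates = {}
    for group, neighborhood, population in blocks:
        if group_population[group] == 0:
            continue  # an empty block group contributes nothing
        # Weight = share of the block group's population in this block.
        weight = population / group_population[group]
        estimates[neighborhood] = (
            estimates.get(neighborhood, 0.0)
            + weight * block_group_values[group]
        )
    return estimates
```

<p>Under this scheme a block group that straddles two neighborhoods contributes to both, in proportion to the population of its blocks on each side of the boundary.</p>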
<h4>Community response</h4>
<p>The response to our newly launched KIDS COUNT 1.0 has been overwhelming, both locally and nationally. Local policy makers have been relying heavily on the work and asking what is next, particularly how to add data that can start to bring accountability to public policy decisions and publicly funded programs.</p>
<p>The work has also been recognized as innovative by numerous organizations, including the Annie E. Casey Foundation (through the KIDS COUNT network), Rockefeller Foundation (Innovators Award) and Global Editors Network (2013 Data Journalism Awards). We continue to get inquiries from potential partners like The World Bank, the White House, and, most critically, parent groups.</p>
<h4>Why is this important?</h4>
<p>In a city where policy decisions that determine the allocation of resources and assets are guided by relationships and old-school politics, the project will bring much-needed transparency to DC government budget data. We must show how budget decisions align or do not align with the needs of our children.</p>
<p>In DC, there are approximately 100,000 children under 18 years of age. More than 36,000 young children are growing up in DC neighborhoods – playing on city playgrounds, attending child care centers and preparing for school in pre-kindergarten classes. The number of young children in DC has increased by 11% since 2000, which is especially notable because the total number of children (under age 18) has decreased by 8% over the same time period. With a rising birth rate and expanding overall city population, we expect the number of young children to continue to increase in future years. Of those 36,000 young children, 1 in 3 lives in poverty. Poverty is pervasive.</p>
<p>DC has the highest spending per pupil in public education. Our education reform efforts to improve outcomes for children have drawn intense national scrutiny, yet even with all the spending and reform efforts, the bottom line is that outcomes for children are <em>not</em> improving.</p>
<h4>Next steps</h4>
<p>In the next phase of the project, we propose to add a layer of local budget data to the asset maps to answer a related question: If we map public investments, <em>will they align with where we have mapped need among children in DC</em>?</p>
<p>We propose to use five years of retrospective budget data to add a powerful new tool to our DC KIDS COUNT maps to help policy makers, media, advocates, service providers and citizens evaluate the city’s budget through the lens of young children – in the neighborhoods where they live. The project will help us present a more nuanced analysis of the geography of DC budget investments. In particular, it will help us to:</p>
<ul>
<li>Map where the city has invested in the futures of young children and where it has not.</li>
<li>Create a shared understanding of how investment maps do – or do not – match with need maps for the city’s children.</li>
<li>Communicate messages about inequities in investments by geography and demographics (income, race, etc.).</li>
<li>Identify budget and policy opportunities for addressing the identified mismatches, gaps and inequities.</li>
</ul>
<p>Our ultimate goal is to ensure the data and analysis we provide will change the outcomes for children, youth and their families.</p>
<p>As I reflect on the success of this partnership and project, a few key themes surfaced:</p>
<ul>
<li>Leadership: Jake Porway (founder of DataKind) and Sisi Wei (project lead) were instrumental not only during the preliminary phase but also for the long-term sustainability of the project. We all knew that this was new territory for data work. Both were committed to answering our BIG question: “Can we change child outcomes with data?” There was definitely a theme among the three of us: innovators, risk-takers, visionaries and do-gooders, with a little too much enthusiasm about data!</li>
<li>Data Heroes: A leader can’t lead without a strong troop. I recall being at the DataDive and praying that most of the genius data heroes would choose our project BUT there was some fierce competition. As one of our project data heroes often states, “I joined this collective effort to make a difference and wanted nothing in return.” <em>But</em> what we <em>all</em> got in return was the opportunity to engage in a magical process empowered by trust, mission and impact. We saw in action what Jake had always envisioned: data = social change!</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://schoolofdata.org/2013/06/11/mapping-the-well-being-of-children-in-the-district-of-columbia/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<atom:link rel="payment" title="Flattr this!" href="https://flattr.com/submit/auto?user_id=openknowledgefoundation&amp;popout=1&amp;url=http%3A%2F%2Fschoolofdata.org%2F2013%2F06%2F11%2Fmapping-the-well-being-of-children-in-the-district-of-columbia%2F&amp;language=en_GB&amp;category=text&amp;title=Mapping+the+Well-Being+of+Children+in+the+District+of+Columbia&amp;description=Last+year%2C+DC+Action+for+Children%2C+in+partnership+with+DataKind+and+a+group+of+dedicated+pro-bono+data+scientists%2C+created+an+interactive%2C+web-based+tool+to+take+traditional+child+well-being+indicators+%E2%80%9Cbeyond..." type="text/html" />
	</item>
		<item>
		<title>School of Data Latin America Tour</title>
		<link>http://schoolofdata.org/2013/06/07/school-of-data-latin-america-tour/</link>
		<comments>http://schoolofdata.org/2013/06/07/school-of-data-latin-america-tour/#comments</comments>
		<pubDate>Fri, 07 Jun 2013 14:08:50 +0000</pubDate>
		<dc:creator>Michael Bauer</dc:creator>
				<category><![CDATA[On the Road]]></category>

		<guid isPermaLink="false">http://schoolofdata.org/?p=5011</guid>
		<description><![CDATA[Do you live in Latin America? Hungry for some School of Data materials in Spanish? We have some good news for you: The School of Data is going to come to you! Image CC-BY-SA Eric Fischer While our friends at Social-TIC are working hard on translating School of Data materials into Spanish, Michael Bauer and [...]]]></description>
				<content:encoded><![CDATA[<p><!-- magazine.image=http://farm2.staticflickr.com/1285/4671446659_ce51ed4d91_z.jpg --></p>
<p>Do you live in Latin America? Hungry for some School of Data materials in Spanish? We have some good news for you: The School of Data is going to come to you!</p>
<p><a href="http://www.flickr.com/photos/walkingsf/4671446659/" title="Locals and Tourists #49 (GTWA #200): Sao Paulo by Eric Fischer, on Flickr"><img src="http://farm2.staticflickr.com/1285/4671446659_ce51ed4d91_z.jpg" width="640" height="640" alt="Locals and Tourists #49 (GTWA #200): Sao Paulo"></a><small>Image CC-BY-SA Eric Fischer</small></p>
<p>While our friends at <a href="http://social-tic.tumblr.com/">Social-TIC</a> are working hard on translating School of Data materials into Spanish, <a href="http://twitter.com/mihi_tr">Michael Bauer</a> and <a href="http://twitter.com/zararah">Zara Rahman</a> are going to visit La Paz (Bolivia), Santiago (Chile), Buenos Aires (Argentina) and Montevideo (Uruguay).</p>
<p>Michael kicks off the tour with the first <a href="http://bolivia.databootcamp.org">DataBootcamp in Latin America in Bolivia</a>. He&#8217;s then joined by Zara in Santiago, where there will be a <a href="http://www.meetup.com/HacksHackersChile/events/121552802/">Workshop on Scraping</a> on Monday June 17th. They will also give a short presentation at Data Tuesday the next day. They will then continue to Argentina for a <a href="http://www.meetup.com//HacksHackersBA/events/122848822/">Workshop on June 20th</a> and finish their tour at <a href="http://abrelatam.eventbrite.com">AbreLatam</a> &#8211; the open Latin America unconference.</p>
<p>During this time they will be available to meet, scheme and plot. If you&#8217;re interested in meeting them, contact us at schoolofdata [at] okfn.org.</p>
]]></content:encoded>
			<wfw:commentRss>http://schoolofdata.org/2013/06/07/school-of-data-latin-america-tour/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<atom:link rel="payment" title="Flattr this!" href="https://flattr.com/submit/auto?user_id=openknowledgefoundation&amp;popout=1&amp;url=http%3A%2F%2Fschoolofdata.org%2F2013%2F06%2F07%2Fschool-of-data-latin-america-tour%2F&amp;language=en_GB&amp;category=text&amp;title=School+of+Data+Latin+America+Tour&amp;description=Do+you+live+in+Latin+America%3F+Hungry+for+some+School+of+Data+materials+in+Spanish%3F+We+have+some+good+news+for+you%3A+The+School+of+Data+is+going+to+come..." type="text/html" />
	</item>
		<item>
		<title>The Latest From the School of Data</title>
		<link>http://schoolofdata.org/2013/06/06/the-latest-from-the-school-of-data-3/</link>
		<comments>http://schoolofdata.org/2013/06/06/the-latest-from-the-school-of-data-3/#comments</comments>
		<pubDate>Wed, 05 Jun 2013 22:34:18 +0000</pubDate>
		<dc:creator>Lucy Chambers</dc:creator>
				<category><![CDATA[News]]></category>

		<guid isPermaLink="false">http://schoolofdata.org/?p=4997</guid>
		<description><![CDATA[The latest from what we are up to at the School of Data. School of Data goes Spanish Next month, a couple of the team will be headed over to Latin America for a series of warm-up events for the launch of the Spanish version of School of Data. The School will be launched at [...]]]></description>
				<content:encoded><![CDATA[<p><!-- magazine.image= http://farm6.staticflickr.com/5332/8962982251_c6d0259687_z.jpg --></p>
<p>The latest from what we are up to at the School of Data.</p>
<h2>School of Data goes Spanish</h2>
<p>Next month, a couple of the team will be headed over to Latin America for a series of warm-up events for the launch of the Spanish version of School of Data. The School will be launched at the AbreLatam conference in Uruguay.</p>
<p>On their way to Uruguay, they will be passing through Bolivia, Chile and Argentina. Know an organisation or amazing individual doing great things with data they should meet up with on the way? Please drop us a line on schoolofdata [at] okfn.org.</p>
<p>Thank you to the amazing organisations and individuals who are helping to make this happen, including Social-TIC (Mexico), DATA (Uruguay), the Knight Media Fellows and the Hacks/Hackers network.</p>
<p><img src="http://farm6.staticflickr.com/5332/8962982251_c6d0259687_z.jpg" alt="" /></p>
<h2>Data Clinics</h2>
<p>Over at OpenSpending, Anders Pedersen is running bi-weekly data clinics to help people troubleshoot their spending data, from getting better data to cleaning, analysing and visualising it.</p>
<p>The next clinic will happen on Wednesday 19th June in the evening.</p>
<p>Have data you want to bring along to troubleshoot? Join the <a href="http://lists.okfn.org/mailman/listinfo/openspending">OpenSpending mailing list</a>, or email info [at] openspending.org for more details on the upcoming clinics.</p>
<h2>Data Expeditions</h2>
<h3>Mapping the Garment Factories</h3>
<p>Some participants of the Mapping the Garment Factories expedition couldn&#8217;t get enough and carried on their expedition into the week. A few participants have published a <a href="http://schoolofdata.org/2013/06/04/data-expedition-story-why-garment-retailers-need-to-do-more-in-bangladesh/">writeup</a> of their expedition, concluding:</p>
<blockquote>
<p>&#8220;major retailers like Wal-Mart maintains high levels of opacity around their supply chain and audit standards, which are detrimental to improving working standards in the garment industry.&#8221;</p>
</blockquote>
<h3>Tax Avoidance</h3>
<p>Our tax avoidance teams are pairing up: we&#8217;ve matched techies with storytellers to tackle the challenge of finding tax avoiders and evaders. Welcome also to our first Spanish-speaking group, who will take on the challenge. The expedition launches tomorrow, and we will keep you posted with updates!</p>
<h3>Call for ideas for topics</h3>
<p>Have an idea for a topic you think would make a great expedition? (Or, even better &#8211; keen to help us lead one?) Please drop us a line on schoolofdata [at] okfn.org.</p>
<h2>From the Blog</h2>
<ul>
<li><a href="http://schoolofdata.org/2013/06/05/data-roundup-5th-june/">John Murtagh&#8217;s Fantastic Data Roundup</a></li>
<li><a href="http://schoolofdata.org/2013/06/04/school-of-data-last-month-prague-barcelona-and-accra/">Data Diva, Michael Bauer</a> has been on the road, training, training, training in Prague, Barcelona and Accra. </li>
<li><a href="http://schoolofdata.org/2013/06/04/analysing-uk-lobbying-data-using-openrefine/">Analysing UK Lobbying Data Using OpenRefine</a> by Tony Hirst. </li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://schoolofdata.org/2013/06/06/the-latest-from-the-school-of-data-3/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<atom:link rel="payment" title="Flattr this!" href="https://flattr.com/submit/auto?user_id=openknowledgefoundation&amp;popout=1&amp;url=http%3A%2F%2Fschoolofdata.org%2F2013%2F06%2F06%2Fthe-latest-from-the-school-of-data-3%2F&amp;language=en_GB&amp;category=text&amp;title=The+Latest+From+the+School+of+Data&amp;description=The+latest+from+what+we+are+up+to+at+the+School+of+Data.+School+of+Data+goes+Spanish+Next+month%2C+a+couple+of+the+team+will+be+headed+over+to..." type="text/html" />
	</item>
		<item>
		<title>Data Roundup &#8211; 5th June</title>
		<link>http://schoolofdata.org/2013/06/05/data-roundup-5th-june/</link>
		<comments>http://schoolofdata.org/2013/06/05/data-roundup-5th-june/#comments</comments>
		<pubDate>Wed, 05 Jun 2013 21:39:01 +0000</pubDate>
		<dc:creator>John Murtagh</dc:creator>
				<category><![CDATA[Data Roundup]]></category>

		<guid isPermaLink="false">http://schoolofdata.org/?p=4990</guid>
		<description><![CDATA[We’re rounding up data news from the web each week. If you have a data news tip, send it to us at schoolofdata@okfn.org. Tools, Courses, and Events Booking now is the UK Data Service one-day workshop on using large-scale survey data for research which is in Manchester 24 June 2013. This workshop is aimed at those with little [...]]]></description>
				<content:encoded><![CDATA[<p><!-- magazine.image= http://farm9.staticflickr.com/8405/8963714428_8afdefcfe5_z.jpg --></p>
<p><em>We’re rounding up data news from the web each week. If you have a data news tip, send it to us at <a href="mailto:schoolofdata@okfn.org" target="_blank">schoolofdata@okfn.org</a>.</em></p>
<h2><strong>Tools, Courses, and Events</strong></h2>
<p>Booking is now open for the UK Data Service one-day <a href="http://ukdataservice.ac.uk/news-and-events/eventsitem/?id=3514" target="_blank">workshop</a> on using large-scale survey data for research, to be held in Manchester on 24 June 2013. This workshop is aimed at those with little or no experience in the secondary analysis of survey data available from the UK Data Service and will introduce attendees to the skills required to find and access survey data and to carry out basic secondary analyses with the survey data.</p>
<p>As part of <a href="http://opendataweek.org/presentation-en/" target="_blank">European Open Data Week</a> there are 3 conferences and 14 workshops, from Tuesday 25 June to Friday 28 June 2013 in Marseille, France, specifically looking at harmonising open data policies.</p>
<p><img src="http://farm9.staticflickr.com/8405/8963714428_8afdefcfe5_z.jpg" alt="" /></p>
<p>On 28th May the 7th Open Data Ireland Meetup took place in Dublin. Its theme was “Give Us Our Health Data”, and it was attended by around 40 people. More details are available from the event blog at <a href="http://data.fingal.ie/Blog/May2013/Name,36932,en.aspx" target="_blank">http://data.fingal.ie/Blog/May2013/Name,36932,en.aspx</a>.</p>
<p>Open Nepal Week is running in Kathmandu from June 2 to June 6. It is a partnership-driven series of events spread over five days that aims to raise awareness about open data in Nepal and to devise mechanisms to help citizens access such data. Check out the website at <a href="http://t.co/6OkDojIgMa" target="_blank">opennepal.net</a>.</p>
<p>Newly launched is the Policy RECommendations for Open Access to Research Data in Europe project (RECODE) – the work plan and deliverables are available from here <a href="http://t.co/h4K6gXoxn7" target="_blank">recodeproject.eu/research/</a> <a href="http://twitter.com/search?q=%23RECODE" target="_blank">#RECODE</a>.</p>
<p>A succinct and useful presentation from Victoria Stodden at Columbia University, entitled “Why Public Access to Data is So Important (and why getting the policy right is even more so)”, is available <a href="http://hdl.handle.net/10022/AC:P:20387" target="_blank">on their website</a>.</p>
<p>On June 18 there is a <a href="http://blog.opengovpartnership.org/2013/05/ogp-webinar-strengthening-the-demand-for-and-use-of-open-data-initiatives/" target="_blank">webinar</a> from the Open Government Partnership (OGP) entitled &#8220;Strengthening the Demand for and use of open data initiatives&#8221;, which runs from 10:00 – 11:00 AM EST | 14:00 – 15:00 GMT.</p>
<p>New open government tools have been launched for the Oakland, California area by community technologists. See more details here: <a href="http://t.co/9yc0rTOLXn" target="_blank">eepurl.com/zRdML</a>.</p>
<p>A useful blog post by Siri Anderson advises those who have data sets on how to publish them: “3 Guidelines for Publishing Your First Open Data Sets”. <a href="http://t.co/fT5HZXr3Tw" target="_blank">ow.ly/lqMh1</a></p>
<p>The National Day of Civic Hacking was a national event that took place June 1-2, 2013, in cities across the United States. Civic Hackers: The Neighborland API is a resource for local ideas and actions: <a href="http://t.co/l6hda1Lv4S" target="_blank">hackforchange.org/datasets</a>.</p>
<p>Lastly, on research data: in conjunction with the 3rd International Conference on Theory and Practice of Digital Libraries (TPDL), the first workshop on &#8216;Linking and Contextualizing Publications and datasets&#8217; takes place on September 26 in Malta: <a href="http://t.co/MSOxZnlD2Y" target="_blank">bit.ly/12GbfgK</a></p>
<h2><strong>Data Stories</strong></h2>
<p>The story of how detailed real time data got released from UK&#8217;s rail infrastructure owner (PDF) <a href="http://t.co/Lnt11Fztsm" target="_blank">j.mp/18oSF2Q</a> <a href="http://twitter.com/search?q=%23opendata" target="_blank">#opendata</a></p>
<p>The BBC has reported a <a href="http://bbc.in/12K7l6r." target="_blank">story</a> on the stunning visualizations of flight paths across the globe produced by GIS (and in their spare time, no less).</p>
<p>The Centre for Sustainable Energy&#8217;s most popular news story of the past 12 months is ‘Energy Company Obligation data in a usable format’: <a href="http://t.co/OarrfMhBBh" target="_blank">cse.org.uk/news/view/1662</a></p>
<p>The <a href="http://data.worldbank.org/" target="_blank">World Bank Development Data Group</a> (DECDG) and the aid data organization <a href="http://www.developmentgateway.org/" target="_blank">Development Gateway</a> have unearthed data on whether 29 developing countries are meeting their education goals. Their progress is visualized here: <a href="http://t.co/wLurucjX1g" target="_blank">ow.ly/lsDfN</a></p>
<p>An important article has been published (April 4) that reasserts that research data and their use in journal articles lead to an &#8220;open data citation advantage&#8221;. You can read the pre-print <a href="https://peerj.com/preprints/1" target="_blank">on their website</a>.</p>
<p>Jess Denham, an Interactive Journalism MA student at City University London, has <a href="http://jessdenham.net/2013/05/15/interviewed-david-ottewell-head-of-the-data-journalism-unit-at-trinity-mirror/" target="_blank">interviewed</a> David Ottewell, Head of the Data Journalism Unit at the Trinity Mirror (Regionals) group of UK Newspapers.</p>
<p>Jonathan Stray has written a blog post on a two-day data journalism workshop he gave in Taiwan which asks “<a href="http://www.niemanlab.org/2013/04/how-does-a-country-get-to-open-data-what-taiwan-can-teach-us-about-the-evolution-of-access/" target="_blank">How does a country get to open data? What Taiwan can teach us about the evolution of access</a>”. He writes: “Assumptions about government openness vary from country to country. Here are a few lessons a cross-national perspective can bring to the open data movement.”</p>
<p>&#8220;Fell in love with data&#8221;, is an interesting blog post by Enrico Bertini, Assistant Professor at NY-Poly (with equally interesting comments) on data visualisation success stories &#8211; which are often in short supply. Read it <a href="http://fellinlovewithdata.com/news/chi-2013-vis-papers" target="_blank">on their website</a>.</p>
<p>And the New York Times in its Technophoria blog has a piece about the struggle to gain access to your own data which is stored (and monetized) by commercial companies like telecoms and utilities. It also details who is making this data available to consumers. “If My Data Is an Open Book, Why Can’t I Read It?” is available <a href="http://www.nytimes.com/2013/05/26/technology/for-consumers-an-open-data-society-is-a-misnomer.html?pagewanted=all&amp;_r=0" target="_blank">on their website</a>.</p>
<h2><strong>Data Sources</strong></h2>
<p>On Friday May 31 Germany released the first results of its 2011 census, the first in 24 years and the first since east and west were joined together again. See the announcement <a href="https://www.zensus2011.de/SharedDocs/AktuellesEN/On_31_May_2013_the_first_census_results_will_be_released.html?nn=3068736" target="_blank">on their website</a>.</p>
<p>In Canada the Government of Alberta has joined the open data movement by launching its Open Data Portal. There are already more than 280 data sets on the portal — found at <a href="http://data.alberta.ca/" target="_blank">http://data.alberta.ca</a> and you can watch a TV News item on it <a href="http://edmonton.ctvnews.ca/province-posts-government-data-online-1.1300981" target="_blank">here</a>.</p>
<p>In Australia the Government of New South Wales (NSW) has drafted an <a href="http://engage.haveyoursay.nsw.gov.au/document/show/933" target="_blank">Open Data Policy</a>, which is open for public comment at <a href="http://t.co/oC7QvKkr0n" target="_blank">engage.haveyoursay.nsw.gov.au/opendata</a>. The policy is part of the <a href="http://www.services.nsw.gov.au/ict/" target="_blank">NSW Government ICT Strategy</a> supporting government transparency, accountability and efficiency. They have also re-launched their Open Data platform using CKAN 2.0 at <a href="http://data.nsw.gov.au/" target="_blank">http://data.nsw.gov.au/</a>.</p>
<p>The Registry of Research Data Repositories <a href="http://www.re3data.org/" target="_blank">www.re3data.org</a> has launched. It allows the easy identification of appropriate research data repositories, both for data producers and users. The registry covers research data repositories from all academic disciplines. Information icons display the principal attributes of a repository, allowing users to identify the functionalities and qualities of a data repository. These attributes can be used for multi-faceted searches, for instance to find a repository for geoscience data using a Creative Commons licence. By April 2013, 338 research data repositories were indexed in <a href="http://re3data.org/" target="_blank">re3data.org</a>. 171 of these are described by a comprehensive vocabulary, which was developed by involving the data repository community (<a href="http://doi.org/kv3" target="_blank">http://doi.org/kv3</a>).</p>
<p>The EC Open Data Portal (<a href="http://open-data.europa.eu/" target="_blank">http://open-data.europa.eu</a>) went online just before Christmas 2012. It is designed to be the open data hub for European Institutions, beginning with data from the European Commission. There’s a recent <a href="http://videolectures.net/dataforum2013_beyer_katzenberger_open_data/?utm_content=bufferdbf26&amp;utm_source=buffer&amp;utm_medium=twitter&amp;utm_campaign=Buffer" target="_blank">video lecture</a> introducing the hub from Malte Beyer-Katzenberger, entitled “Towards a European open data infrastructure”, which serves as a guide through the portal and the policy behind it.</p>
<p>The U.S. Government’s new CKAN open data catalog has just launched <a href="http://t.co/58J13qKBGo" target="_blank">ckan.org/2013/05/23/dat…</a> and African governments are now opening open data portals too. See Kenya’s at <a href="https://t.co/UXUu7kzFCA" target="_blank">opendata.go.ke</a> and Ghana’s at <a href="http://t.co/MRY3C97YYg" target="_blank">data.gov.gh</a>.</p>
<p>A database of private company registries worldwide, called the <a href="http://opencorporates.com/" target="_blank">Open Database of the Corporate World</a>, has been launched. It currently holds information on 54,196,924 companies. The database uses the Google Refine reconciliation service and allows access to the information as JSON or XML.</p>
<p>There are some new datasets using the history of UK websites via the <a href="http://data.webarchive.org.uk/opendata" target="_blank">UK Web Archive</a>. They have also made a few example tools available, showing how the open data might be used, and these are hosted in their GitHub repository.</p>
]]></content:encoded>
			<wfw:commentRss>http://schoolofdata.org/2013/06/05/data-roundup-5th-june/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<atom:link rel="payment" title="Flattr this!" href="https://flattr.com/submit/auto?user_id=openknowledgefoundation&amp;popout=1&amp;url=http%3A%2F%2Fschoolofdata.org%2F2013%2F06%2F05%2Fdata-roundup-5th-june%2F&amp;language=en_GB&amp;category=text&amp;title=Data+Roundup+%26%238211%3B+5th+June&amp;description=We%E2%80%99re+rounding+up+data+news+from+the+web+each+week.+If+you+have+a+data+news+tip%2C+send+it+to+us+at%C2%A0schoolofdata%40okfn.org.+Tools%2C+Courses%2C+and+Events+Booking+now+is+the..." type="text/html" />
	</item>
	</channel>
</rss>
