
Scraping PDFs with Python and the scraperwiki module

- August 16, 2013 in Uncategorized


While Tabula is a viable option for simple single- or double-page tables, if you have PDFs with tables spanning multiple pages you'll soon grow old marking them.

This is where you'll need some scripting. Thanks to the scraperwiki library (pip install scraperwiki) and its included pdftoxml function, scraping PDFs has become a feasible task in Python. At a recent Hacks/Hackers event we ran into a candidate that was quite tricky to scrape, so I decided to document the process here.

import scraperwiki, urllib2

First import the scraperwiki library and urllib2 – since the file we’re using is on a webserver – then open and parse the document…

  u=urllib2.urlopen("http://images.derstandard.at/2013/08/12/VN2p_2012.pdf") 
  #open the url for the PDF
  x=scraperwiki.pdftoxml(u.read()) # interpret it as xml
  print x[:1024] # let's see what's in there abbreviated...
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">

<pdf2xml producer="poppler" version="0.22.5">
<page number="1" position="absolute" top="0" left="0" height="1263" width="892">
    <fontspec id="0" size="8" family="Times" color="#000000"/>
    <fontspec id="1" size="7" family="Times" color="#000000"/>
<text top="42" left="64" width="787" height="12" font="0"><b>TABELLE VN2Ap/1                         
                                                  30/07/13  11.38.44  BLATT    1 </b></text>
<text top="58" left="64" width="718" height="12" font="0"><b>STATISTIK ALLER VORNAMEN (TEILWEISE PHONETISCH ZUSAMMENGEFASST, ALPHABETISCH SORTIERT) FÜR NEUGEBORENE KNABEN MIT </b></text>
<text top="73" left="64" width="340" height="12" font="0"><b>ÖSTERREICHISCHER STAATSBÜRGERSCHAFT 2012  - ÖSTERREICH </b></text>
<text top="89" left="64" width="6" height="12" font="0"><b> </b></text>
<text top="104" left="64" width="769" height="12" font="0"><b>VORNAMEN                  ABSOLUT      
%   

As you can see above, we have successfully loaded the PDF as XML (take a look at
the PDF by opening the URL given above; it should give you an idea of how it is
structured).

The basic structure of a PDF parsed this way will always be page tags
followed by text tags containing the content plus positioning and font
information. The positioning and font information can often help to pick out the
table we want – however not in this case: everything is font="0" and left="64".
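
To get a feel for that structure, here is a quick sketch (assuming x still holds the XML string returned by pdftoxml above) that prints the position and font attributes of the first few text tags on page one:

import lxml.etree

tree=lxml.etree.fromstring(x)
first_page=tree[0] # the root <pdf2xml> element holds one <page> per PDF page
print first_page.get("number"), first_page.get("width"), first_page.get("height")
for t in first_page.findall("text")[:5]: # the first few <text> tags on that page
    # every text tag carries its position and font as attributes
    print t.get("top"), t.get("left"), t.get("font"), "".join(t.itertext())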

We can now use XPath to query our document…

import lxml.etree
r=lxml.etree.fromstring(x)
r.xpath('//page[@number="1"]')
[<Element page at 0x31c32d0>]

and also get some lines out of it

r.xpath('//text[@left="64"]/b')[0:10] #array abbreviated for legibility
[<Element b at 0x31c3320>,
 <Element b at 0x31c3550>,
 <Element b at 0x31c35a0>,
 <Element b at 0x31c35f0>,
 <Element b at 0x31c3640>,
 <Element b at 0x31c3690>,
 <Element b at 0x31c36e0>,
 <Element b at 0x31c3730>,
 <Element b at 0x31c3780>,
 <Element b at 0x31c37d0>]
r.xpath('//text[@left="64"]/b')[8].text
u'Aaron *                        64       0,19       91               Aim\xe9                        
1       0,00      959 '

Great – this will help us. If you look at the document you'll notice that
boys' names are on pages 1–20 and girls' names on pages 21–43 – let's get
them separately…

boys=r.xpath('//page[@number<="20"]/text[@left="64"]/b')
girls=r.xpath('//page[@number>"20" and @number<="43"]/text[@left="64"]/b')
print boys[8].text
print girls[8].text
Aaron *                        64       0,19       91               Aimé                            1
   0,00      959 
Aarina                          1       0,00    1.156               Alaïa                           1
   0,00    1.156 

Fantastic – but you'll also notice something: the columns are all there,
separated by whitespace. Also, Aaron has an asterisk – we want to remove it
(the asterisk is explained in the original document).

To split each line up into columns I'll create a small function using a regular
expression.

import re

def split_entry(e):
    return re.split("[ ]+",e.text.replace("*","")) # we're removing the asterisk here as well...

Now let's apply it to boys and girls

boys=[split_entry(i) for i in boys]
girls=[split_entry(i) for i in girls]
print boys[8]
print girls[8]
[u'Aaron', u'64', u'0,19', u'91', u'Aim\xe9', u'1', u'0,00', u'959', u'']
[u'Aarina', u'1', u'0,00', u'1.156', u'Ala\xefa', u'1', u'0,00', u'1.156', u'']

That worked! Notice the empty string u'' at the end? I'd like to filter it out.
I'll do this using the ifilter function from itertools

import itertools
boys=[[i for i in itertools.ifilter(lambda x: x!="",j)] for j in boys]
girls=[[i for i in itertools.ifilter(lambda x: x!="",j)] for j in girls]
print boys[8]
print girls[8]
[u'Aaron', u'64', u'0,19', u'91', u'Aim\xe9', u'1', u'0,00', u'959']
[u'Aarina', u'1', u'0,00', u'1.156', u'Ala\xefa', u'1', u'0,00', u'1.156']

That worked and cleaned up our boys and girls arrays. We still want to shape them
properly though – each line contains two columns, each four fields wide. I'll do
this with a little function

def take4(x):
    if len(x)>5:
        return [x[0:4],x[4:]]
    else:
        return [x[0:4]]

boys=[take4(i) for i in boys]
girls=[take4(i) for i in girls]
print boys[8]
print girls[8]

[[u'Aaron', u'64', u'0,19', u'91'], [u'Aim\xe9', u'1', u'0,00', u'959']]
[[u'Aarina', u'1', u'0,00', u'1.156'], [u'Ala\xefa', u'1', u'0,00', u'1.156']]

Ah, that worked nicely! Now let's flatten everything into a single array of rows –
for this I'll use reduce

boys=reduce(lambda x,y: x+y, boys, [])
girls=reduce(lambda x,y: x+y, girls,[])
print boys[10]
print girls[10]
['Aiden', '2', '0,01', '667']
['Alaa', '1', '0,00', '1.156']

Perfect – now let's add a gender to the entries

for x in boys:
    x.append("m")

for x in girls:
    x.append("f")

print boys[10]
print girls[10]

['Aiden', '2', '0,01', '667', 'm']
['Alaa', '1', '0,00', '1.156', 'f']

We got that! For further processing I’ll join the arrays up

names=boys+girls
print names[10]
['Aiden', '2', '0,01', '667', 'm']

Let's take a look at the full array…

names[0:10]
[['TABELLE', 'VN2Ap/1', '30/07/13', '11.38.44', 'm'],
 ['BLATT', '1', 'm'],
 [u'STATISTIK', u'ALLER', u'VORNAMEN', u'(TEILWEISE', 'm'],
 [u'PHONETISCH',
  u'ZUSAMMENGEFASST,',
  u'ALPHABETISCH',
  u'SORTIERT)',
  u'F\xdcR',
  u'NEUGEBORENE',
  u'KNABEN',
  u'MIT',
  'm'],
 [u'\xd6STERREICHISCHER', u'STAATSB\xdcRGERSCHAFT', u'2012', u'-', 'm'],
 ['m'],
 ['VORNAMEN', 'ABSOLUT', '%', 'RANG', 'm'],
 ['VORNAMEN', 'ABSOLUT', '%', 'RANG', 'm'],
 ['m'],
 ['INSGESAMT', '34.017', '100,00', '.', 'm']]

Notice there is still quite a bit of mess in there: basically all the lines
starting with an all-caps entry, "der", "m" or "f". Let's remove them…

names=itertools.ifilter(lambda x: not x[0].isupper(),names) # remove all-caps entries
names=[i for i in itertools.ifilter(lambda x: not (x[0] in ["der","m","f"]),names)] # remove all entries that are "der","m" or "f"
names[0:10]
[['Aiden', '2', '0,01', '667', 'm'],
 ['Aiman', '3', '0,01', '532', 'm'],
 [u'Aaron', u'64', u'0,19', u'91', 'm'],
 [u'Aim\xe9', u'1', u'0,00', u'959', 'm'],
 ['Abbas', '2', '0,01', '667', 'm'],
 ['Ajan', '2', '0,01', '667', 'm'],
 ['Abdallrhman', '1', '0,00', '959', 'm'],
 ['Ajdin', '15', '0,04', '225', 'm'],
 ['Abdel', '1', '0,00', '959', 'm'],
 ['Ajnur', '1', '0,00', '959', 'm']]

Woohoo – we have a cleaned-up list. Now let's write it out as CSV…

import csv
f=open("names.csv","wb") #open file for writing
w=csv.writer(f) #open a csv writer
w.writerow(["Name","Count","Percent","Rank","Gender"]) #write the header
for n in names:
    w.writerow([i.encode("utf-8") for i in n]) #write each row

f.close()

Done – we've scraped a multi-page PDF using Python. All in all this was a fairly
quick way to get the data out of a PDF using the scraperwiki module.


Netneutralitymap.org – converting 700Gb of data to a map.

- September 6, 2012 in Data Stories


In this post, Michael Bauer explains how he wrangled data from Measurement Lab in order to create this visualisation of net neutrality, which maps interference with end-user internet access.

Net neutrality describes the principle that data packets should be transported over the internet as well as possible, without discrimination as to their origin, application or content. Whether or not we need legal measures to protect this principle is currently under discussion in the European Union and in the US.

The European Commission has frequently stated that there is no data to demonstrate the need for regulation. However, this data does exist: the Measurement Lab initiative has collected large amounts of data from tests designed to detect interference with end-user internet access. I decided to explore whether we could use some of the data and visualize it. The result? netneutralitymap.org.

Data acquisition: Google Storage

Measurement Lab releases all their data under a CC-0 license. This allows other researchers and curious people to play with the data. The datasets are stored as archives on Google Storage, and Google has created a utility to retrieve them: gsutil. If you are curious about how to use it, look at the recipe in the School of Data handbook.

All the data that I needed was in the archives – alongside a lot of data that I didn't need. So I ended up downloading 2Gb of data for a single day of tests, and actually only using a few megabytes of it.

Data Wrangling: Design decisions, parsing and storage

Having a concept of how to do things before you start always pays off. I wanted to keep a slim backend for the data and do the visualizations in the browser. This pretty much reduced my options to either a good API backend or static JSON. Since I intend to update the data only once a day, I only need to analyze and generate the data for the frontend once a day, so I decided to produce static JSON files with a dedicated toolchain.

For the toolchain I chose Python and PostgreSQL: Python for parsing, merging etc., and Postgres for storage and analysis. Using an SQL database for analysis pays off as soon as you have a lot of data – and I expected a lot. SQL has a reputation for being slow, but it is still a lot faster than doing the same work in Python.

The first thing to do was parsing. The test I selected was Glasnost, a test suite that emulates different protocols to transfer data and tries to detect whether these protocols are being shaped. Glasnost stores very verbose logfiles which state the results in a nicely human-readable format, so I had to write a custom parser to extract them. There are many ways of writing parsers – I recently decided to use a more functional style, using Python's itertools and treating the file as a stream. The parser simply fills up the SQL database. There is one more job though: since we want to be able to distinguish countries, the parser also looks up the country belonging to each test's IP address using pygeoip and the GeoLite GeoIP database.
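
A rough sketch of that stream-style approach might look like the following – the logfile format and the "result:" marker are invented for illustration (the real Glasnost logs are more involved), but the pygeoip country lookup works as described:

import itertools
import pygeoip

geo=pygeoip.GeoIP("GeoIP.dat") # the free GeoLite country database from MaxMind

def parse_log(path, client_ip):
    # treat the logfile as a stream of lines and lazily pick out the result lines;
    # "result:" is a placeholder marker, not the actual Glasnost log format
    f=open(path)
    lines=itertools.imap(lambda l: l.strip(), f)
    results=itertools.ifilter(lambda l: l.startswith("result:"), lines)
    for line in results:
        yield {"ip": client_ip,
               "cc": geo.country_code_by_addr(client_ip), # country of the test client
               "raw": line}

Each dict yielded this way would then be written into the Postgres tables.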

Once the table was filled, I wanted to know which providers the IPs of the test clients belonged to, so I added an additional table and started to look up ASNs. Autonomous System Numbers are numbers assigned to blocks of internet addresses; they tell us who currently owns a block. To look them up I used the cymruwhois Python module (which queries whois.cymru.com for the information). Once this step was complete, I had all the data I needed.
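
The lookup itself is only a few lines; a minimal sketch with example addresses (not taken from the real dataset) using cymruwhois looks something like this:

from cymruwhois import Client

ips=["8.8.8.8", "208.67.222.222"] # example addresses for illustration
c=Client() # batches its queries to whois.cymru.com
for rec in c.lookupmany(ips):
    # rec.asn is the autonomous system number, rec.owner the name it is registered to
    print rec.ip, rec.asn, rec.owner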

Analysis: Postgres for the win!

Once all the data is assembled, the analysis needs to be done. The Glasnost team previously used a quite simple metric to determine whether interference was present or not, and I decided to use the same one. I created some new tables in Postgres and started working on the queries. A good strategy is to do this iteratively – figure out your subqueries and then join them together into a larger query. This way, queries like:

INSERT INTO client_results
SELECT id,ip,asn,time,cc,summary.stest,min(rate) FROM client INNER JOIN
(SELECT test.client_id,test.test,max(rxrate)/max(txrate) AS
rate,mode,test.test AS stest FROM result,test WHERE test_id=test.id
GROUP BY test.client_id,mode,test.port,test.test HAVING max(txrate)>0) 
summary ON summary.client_id=id WHERE id NOT IN (SELECT id FROM client_results)
GROUP BY client.id,client.asn,client.time,client.cc,client.ip,summary.stest;

don't seem too scary. In the map I wanted to show the percentage of tests in which interference with a user's internet connection took place, both by country and by provider. The total number of tests, the number of tests in which interference was detected, and the percentage of 'interfered-with' tests are thus calculated for each country and for each provider. The Glasnost suite offers multiple tests for different applications, so the results are further broken down by application. Since this runs only once a day I didn't worry too much about performance. With all the data, calculating the results takes a couple of minutes – so no realtime queries here.

The next step is to simply dump the results as JSON files. I used Python's json module for this and it turned out to work beautifully.
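
That step really is just a call to json.dump() per file; a sketch with made-up numbers (the real values come out of the Postgres queries above) could look like this:

import json

# percentage of 'interfered-with' tests per country - numbers invented for illustration
country_results={"AT": 3.2, "DE": 5.1, "US": 8.4}
f=open("countries.json","w")
json.dump(country_results, f) # a static file the frontend can fetch directly
f.close()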

Visualization: Jvectormap

For the visualization I imagined a simple choropleth map with color-coded values of interference, and I looked around for how to do it. Using OpenStreetMap/Leaflet seemed too cumbersome, but on the way I stumbled across jvectormap – a jQuery-based map plugin – and decided to use it. It simply takes data in the form {"countrycode": value} and displays it, taking care of the color-coding etc. A detailed recipe on how to create similar maps can be found in the School of Data Handbook. Once I had the basic functionality down (e.g. displaying country details when a country is clicked), it was time to call in my friends. Getting feedback early helps when developing something like this map. One of my friends is a decent web designer, so he looked at it and immediately replaced my functional design with something much nicer.

Things I’ve learned creating the project:

  • SQL is amazing! Building queries by iteration makes things easier and results in mind-boggling queries
  • Displaying data on maps is actually easy (this is the first project where I've done so)
  • 700 Gb of data is a logistical challenge (I couldn't have done it from home – thanks to my friends at netzfreiheit.org for giving me access to their server)

If you’re interested in the details: check out the github repository.

Michael Bauer is a new data wrangler with the OKFN. He'll be around schoolofdata.org and OKFN Labs. In his free time he likes to wrangle data from advocacy projects and research.
