Extracting Data from PDFs using Tabula

PDFs come in all shapes and forms. If you’re facing a nicely formatted PDF that is not scanned, give Tabula a shot to extract the information. How? Read the short walkthrough below:

You’ll need:

Tabula http://tabula.nerdpower.org
A PDF, e.g. http://www.unhabitat.org/pmss/getElectronicVersion.aspx?nr=3387&alt=1

Walkthrough: Extracting data from PDF tables

Download the PDF at: http://www.unhabitat.org/pmss/getElectronicVersion.aspx?nr=3387&alt=1
Start Tabula (most likely by double-clicking on the Tabula icon)
Point your browser to http://127.0.0.1:8080
Choose the file you want to upload and click Submit
Wait until the PDF is fully loaded
Scroll down to page 167 – we’ll extract that table.
Click and drag a selection box over the table
A window will pop up to show how Tabula would extract the data.
Now download the data as CSV
Fantastic, you liberated the table from the PDF. Quick and easy, wasn’t it?
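
Once the CSV is downloaded, it can be worth a quick check that the table came out with a consistent number of columns, since a selection box that clips the table edge tends to produce ragged rows. The following is only a minimal sketch using Python's standard csv module. The helper names (parse_csv_text, column_counts, load_export) are made up for illustration and are not part of Tabula, and the sketch assumes the export is UTF-8 encoded.

```python
import csv
import io


def parse_csv_text(text):
    """Parse CSV text into a list of rows, dropping rows with no content."""
    reader = csv.reader(io.StringIO(text))
    # A row is kept only if at least one cell holds non-whitespace text.
    return [row for row in reader if any(cell.strip() for cell in row)]


def column_counts(rows):
    """Return the set of distinct row lengths.

    More than one value suggests that some rows were split or merged
    during extraction and the selection may need to be redrawn.
    """
    return {len(row) for row in rows}


def load_export(path):
    """Read an exported CSV file from disk.

    Assumption: the file is UTF-8 encoded; adjust `encoding` if it is not.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        return parse_csv_text(handle.read())
```

For example, if you saved the download as tabula-export.csv (a hypothetical name, use whatever your browser saved), column_counts(load_export("tabula-export.csv")) returning a single number means every row has the same number of cells.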

Any questions? Got stuck? Ask School of Data!

Last updated on Sep 02, 2013.