Extracting Data from PDFs using Tabula

PDFs come in all shapes and forms. If you’re facing a nicely formatted PDF that is not scanned, give Tabula a shot to extract the information. How? Read the short walkthrough below.

You’ll need Tabula installed on your computer, a web browser, and a PDF that contains a data table.

Walkthrough: Extracting data from PDF tables

  1. Download the PDF at: http://www.unhabitat.org/pmss/getElectronicVersion.aspx?nr=3387&alt=1

  2. Start Tabula (most likely by double-clicking the Tabula icon)

  3. Point your browser to http://127.0.0.1:8080

  4. Choose the file you want to upload and click Submit

    http://farm6.staticflickr.com/5484/9500458533_91f9a6cdb4_o_d.png

  5. Wait until the PDF is fully loaded

  6. Scroll down to page 167 – we’ll extract that table.

  7. Click and drag a selection box over the table

    http://farm4.staticflickr.com/3726/9500458669_96dbc7f6e5_o_d.png

  8. A window will pop up to show how Tabula would extract the data.

    http://farm4.staticflickr.com/3703/9500458729_333885f7a3_z_d.jpg

  9. Now download the data as CSV

    http://farm8.staticflickr.com/7397/9500458755_4e9e802e54_o_d.png

  10. Fantastic, you liberated the table from the PDF. Quick and easy, wasn’t it?
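
Once the CSV is on disk, you may want to check what Tabula actually extracted. The following is a minimal sketch using only Python's standard library. It is not part of Tabula. The file name tabula-export.csv and the helper load_table are assumptions made for illustration, so replace the name with whatever your browser saved. The sketch also assumes the export is UTF-8 encoded.

```python
import csv
from pathlib import Path


def load_table(csv_path):
    """Read a CSV export into a list of rows (each row is a list of strings)."""
    rows = []
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            # Trim stray whitespace around each cell.
            cells = [cell.strip() for cell in row]
            # Skip rows in which every cell is empty.
            if any(cells):
                rows.append(cells)
    return rows


if __name__ == "__main__":
    # Hypothetical file name; use the name your browser gave the download.
    export = Path("tabula-export.csv")
    if export.exists():
        # Print the first few rows to check the extraction by eye.
        for row in load_table(export)[:5]:
            print(row)
    else:
        print(f"{export} not found; download the CSV from Tabula first.")
```

Printing the first few rows is a quick way to spot columns that were merged or split during extraction. If that happened, going back to step 7 and adjusting the selection box may help.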

Any questions? Got stuck? Ask School of Data!

Last updated on Sep 02, 2013.