Tackling PDFs with Tabula
School of Data mentor Marco Túlio Pires has been writing for our friends at Tactical Tech about journalistic data investigation. This post introduces a tool for extracting data from PDFs, and it was originally published on Exposing the Invisible‘s resources page.
PDF files are pesky. If you copy and paste a table from a PDF into a new document, the result will be messy and ugly. You either have to type the data by hand into your spreadsheet processor or use an app to do that for you. Here we will walk through a tool that does it automatically: Tabula.
Tabula is awesome because it’s free and works on all major operating systems. All you have to do is download the zip file and extract a folder. You don’t even have to install anything, provided you’ve got Java on your machine, which is very likely. It works in your browser, and it’s all about visual controls. You use your mouse to select data you want to convert, and you download it converted to CSV. Simple as that.
Though Tabula is still experimental software, it is really easy to use. Once you fire it up, it will open your web browser with a welcome screen. Submit your PDF file and Tabula will process your file and show you a nice list of page thumbnails. Look for the table you want to extract.
Click and drag to select the area of the table. Once you release, Tabula will show you the extracted data in a friendly format. If the data is fuzzy, try selecting a narrower area, try removing the headers or the footnote, etc. Play around with it a bit. You can either download it as CSV and open it in a with spreadsheet program or copy the data to the clipboard.
Once the data is in a spreadsheet, you may need to do a bit of editing such as correcting the headers. Tabula won’t be perfect 100% of the time – it’s very good with numbers, but it can get confused by multi-line headers.
For further information and resources on Tabula, see:
- Tabula has a video that walks through how to set up the tool.
- Read the Open News article introducing Tabula and setting out some of its limitations.
- Check out the School of Data Link discussion on Tabula and some alternatives.