A Gentle Introduction to Extracting Data


Extracting Data from PDFs
Far too much data is trapped in PDFs. To work with, analyse and visualise data, we need it in machine-readable formats. Getting the data out again is often not easy, but it is sometimes possible. Find out how here.
Making data on the web useful: scraping
Learn how to scrape without code in 5 minutes and when you might need to invest time in more sophisticated techniques.

Extracting data from PDF can be done with…

PDF to Word/Excel converters, which let you copy the information you need. The result is often messy if the PDF contains tables. Some free tools include Excel Online.
OCR (Optical Character Recognition), which “reads” the PDF and then copies its content into a different format, usually plain text. Quality varies between OCR engines, and the licences are often not free. You could always go with the free and open-source Tesseract OCR, but it requires some programming know-how.
Programming, with libraries available for Python (PDFMiner), Java (Tika, PDFBox), and the command line (pdftotext, pdftohtml).
Crowdsourcing, which is not specific to PDFs but can be used when you have many documents to transcribe.
and Tabula, the new kid on the block, specifically designed to get data out of PDF tables, which is often where the data you’re looking for lives.
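
To give a feel for the programming route, here is a minimal Python sketch that wraps the pdftotext command-line tool mentioned above. It assumes pdftotext is installed and on your PATH. The function names pdf_to_text and split_rows are made up for this illustration, and the sample rows are invented. Splitting columns on runs of two or more spaces is only a rough heuristic and will not work for every table.

```python
import re
import shutil
import subprocess


def pdf_to_text(pdf_path):
    """Run `pdftotext -layout` on a PDF and return the extracted text.

    Assumption: the pdftotext command-line tool is installed.
    """
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext is not installed or not on PATH")
    # "-layout" keeps the physical layout of the page;
    # "-" as the output file sends the text to standard output.
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def split_rows(text):
    """Split layout-preserved text into rows of cells.

    Heuristic (an assumption, not a rule): columns are separated
    by two or more whitespace characters. Blank lines are skipped.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            rows.append(re.split(r"\s{2,}", line))
    return rows


if __name__ == "__main__":
    # Invented sample text standing in for pdftotext output.
    sample = (
        "Country        Year   Value\n"
        "Kenya          2012   41.3\n"
        "South Africa   2012   52.0\n"
    )
    for row in split_rows(sample):
        print(row)
```

With a real file you would call split_rows(pdf_to_text("report.pdf")), where report.pdf is a placeholder name. Tables with merged cells or single-space gaps between columns will need manual clean-up or a dedicated tool such as Tabula.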

Introduction