Extracting data from PDF can be done with…
- PDF to Word/Excel converters, which let you copy the information you need. The result is often messy, though, if there are tables in the PDF. Some free tools include Excel Online.
- OCR (Optical Character Recognition), which “reads” the PDF and then copies its content into a different format, usually plain text. Quality varies between OCR engines, and the licences are often not free. You could always go with the free and open source Tesseract OCR, but it requires some programming know-how.
- Programming, with libraries available for Python (PDFMiner), Java (Tika, PDFBox), and the command line (pdftotext, pdftohtml).
- Crowdsourcing, which is not specific to PDF, but can be used when you have many documents to transcribe.
- and Tabula, the new kid on the block, specifically designed to get data out of PDF tables, which is often where the data you’re looking for lives.
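
As a rough illustration of the programming route, here is a minimal Python sketch that calls the pdftotext command-line tool and then splits each line of output into cells. It assumes pdftotext is installed and on your PATH. The function names (build_pdftotext_command, extract_text, split_columns) and the file name report.pdf are made up for this example, and the "two or more spaces means a new column" rule is only a heuristic, not how any of the tools above actually detect tables.

```python
import re
import shutil
import subprocess


def build_pdftotext_command(pdf_path, keep_layout=True):
    """Return the argument list for a pdftotext call that writes to stdout."""
    command = ["pdftotext"]
    if keep_layout:
        # -layout keeps the physical layout of the page, which helps with tables.
        command.append("-layout")
    # Passing "-" as the output file sends the extracted text to stdout.
    command += [pdf_path, "-"]
    return command


def extract_text(pdf_path, keep_layout=True):
    """Run pdftotext on pdf_path and return the extracted text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, keep_layout),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def split_columns(line):
    """Split one line of layout-preserved text into cells.

    Heuristic (an assumption of this sketch): a run of two or more
    whitespace characters separates two columns.
    """
    return [cell for cell in re.split(r"\s{2,}", line.strip()) if cell]


if __name__ == "__main__":
    # "report.pdf" is a placeholder file name.
    for line in extract_text("report.pdf").splitlines():
        cells = split_columns(line)
        if cells:
            print(cells)
```

Expect this kind of heuristic to break on tables with empty cells or tightly spaced columns, which is exactly the sort of mess a dedicated tool for PDF tables is meant to handle.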
