Free tool to extract data from PDFs: Tabula

[Screenshot: Using Tabula on the desktop. Credit: Tabula project]

PDFs are handy for displaying articles and books in a well-designed format. But for data analysis? Not so much. Yet there are times when data you'd like to analyze is only available in a table within a PDF -- especially frustrating since, odds are, that data began in a much friendlier database or spreadsheet format.

Enter Tabula, a free, open-source tool designed for "liberating data tables locked inside PDF files." It was created by several journalists with the support of a number of organizations including Knight-Mozilla OpenNews, the New York Times and La Nación DATA.

To use it, download the software from the project website. It runs locally in your browser and requires a Java Runtime Environment compatible with Java 6 or 7. Import a PDF and then select the area of a table you want to turn into usable data. You'll have the option of downloading the result as a comma- or tab-separated file as well as copying it to your clipboard.

You'll also be able to look at the data it captures before you save it, which I'd highly recommend. It can be easy to miss a column and especially a row when making a selection.
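If you'd rather not rely on eyeballing the preview alone, a quick script can flag rows that came out with the wrong number of columns once you've saved the export. What follows is a minimal sketch in Python using only the standard library; it is not part of Tabula, and the function name check_export and the sample data are made up for illustration. It assumes the most common row width is the correct one, which may not hold for every table.

```python
import csv
import io
from collections import Counter


def check_export(text, delimiter=","):
    """Return (row_count, expected_columns, ragged_rows) for exported table text.

    ragged_rows lists (row_number, column_count) for each row whose width
    differs from the most common width. Row numbers start at 1 and count
    only non-blank rows.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    rows = [r for r in rows if r]  # drop completely blank lines
    if not rows:
        return 0, 0, []
    # Assumption: the width shared by most rows is the intended one.
    widths = Counter(len(r) for r in rows)
    expected = widths.most_common(1)[0][0]
    ragged = [(i, len(r)) for i, r in enumerate(rows, start=1) if len(r) != expected]
    return len(rows), expected, ragged


# Hypothetical export in which the third row lost a column during selection.
sample = "state,2012,2013\nOhio,10,12\nIowa,7\nUtah,3,4\n"
print(check_export(sample))  # prints (4, 3, [(3, 2)])
```

For a saved file, read its contents first, for example with open("export.csv", newline="", encoding="utf-8").read() (the filename here is a placeholder), and pass delimiter="\t" for a tab-separated export. Comparing the reported row count against the table in the PDF is a simple way to catch a missed row.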

A 30-second video produced by the Tabula project shows more of how it works on a Windows system. There are also versions available for OS X and Linux.

Note that Tabula is only designed for PDFs that were created from electronic text; it is not OCR software and won't work with scanned images. Its creators also caution that it works best on simple table formats, not those where some rows or columns span multiple cells.

Looking for other tools? Check out my chart of 30+ free tools for data visualization and analysis.
