Of course, the data itself is usually also for sale. But a manager would often rather have an analyst scrape it from the PDF report than pay the reporting company extra for a data subscription, preferring to bear the opportunity cost of keeping the analyst away from more important and productive work.
As an analyst, I can't count how many times I asked my former employer to shell out a couple hundred dollars a month for market intelligence data subscriptions and was blown off because they didn't want to allocate a budget for it.
1) Why use Python for OpenCV when Ruby has a decent wrapper that can do Hough (https://github.com/ruby-opencv/ruby-opencv)? Or was the Ruby version just too buggy still?
2) Is there a command-line version planned? I guess it'd be most relevant once auto-detection is figured out.
2) No plans at the moment, though that's an awesome idea.
(For those interested, you can grab Trapeze from mesadynamics.com -- requires OS X 10.4; source code is a mixture of C++ and Objective-C).
I did learn a few neat tricks by doing it myself though. The library I used to extract the text was none other than Mozilla's own PDF.js, so in the final version my users could just drag and drop the PDF onto the browser window, and my little algorithm parsed the tables into arrays, with AngularJS rendering them as HTML tables.
Obviously, computer-vision-assisted, general-purpose reconstruction of tabular data is the secret sauce in this project, but if you have the right use case you can do some cool things in the client. You do have to dig into the PDF.js internals a bit to figure out how to use it, but I'm sure that it will improve in that respect.
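The client-side approach described above can be sketched roughly like this: PDF.js's `getTextContent()` returns positioned text items (where `transform[4]` is x and `transform[5]` is y), and a small clustering pass groups them into rows. This is just a minimal sketch of the idea, not the commenter's actual code; the function name and tolerance are hypothetical.

```javascript
// Cluster PDF.js-style text items into table rows by y-coordinate.
// Items are assumed to look like getTextContent() output: { str, transform }.
function groupIntoRows(items, tolerance = 2) {
  const rows = new Map(); // representative y -> items on that line
  for (const item of items) {
    const y = item.transform[5];
    // Reuse an existing row whose y is within tolerance, else start a new one.
    let key = null;
    for (const k of rows.keys()) {
      if (Math.abs(k - y) <= tolerance) { key = k; break; }
    }
    if (key === null) { key = y; rows.set(key, []); }
    rows.get(key).push(item);
  }
  // Sort rows top-to-bottom (PDF y grows upward) and cells left-to-right.
  return [...rows.entries()]
    .sort((a, b) => b[0] - a[0])
    .map(([, cells]) =>
      cells.sort((a, b) => a.transform[4] - b.transform[4]).map(c => c.str));
}
```

In the browser you would feed it real items with something like `page.getTextContent().then(tc => groupIntoRows(tc.items))`, then hand the resulting arrays of strings to the template layer for rendering.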
edit: Never mind, it wouldn't have helped. I missed the part where automation isn't yet supported. Either way, this looks like a great tool.
Semi-related: I used to have a ton of scanned journal articles that I wanted to be able to read on a Kindle without having to scroll across every page, and came across k2pdfopt. It's a C program that finds word and line breaks in image-based PDFs and rearranges the text so that it fits on smaller screens. It's got a ton of flags you can set, and it's pretty good at ignoring/cropping out headers and footers and dealing with pages scanned at an angle. http://www.willus.com/k2pdfopt/help/k2menu.shtml No affiliation with Willus.
Has this kind of thing been done for PDF map data?
I was talking with a friend of mine a month ago about the dismal state of official crime incidence websites. They're usually just lists of PDFs, probably because whoever is responsible for the data just uses whatever MS Word PDF export is available in the office and posts an existing monthly report as a PDF. This makes online crime data a huge pain in the #ss to decipher.
I'm sure there's a lot of geographic data this could apply to.
Tutorial and demos are here: http://o-0.me/pXY/ . Some recent commits, like radial scanning, aren't documented very well yet, but I'll devote some time to it if anyone needs those. They're mostly useful for interactive analysis.
With some creative algorithms, typed arrays, and web workers, the speed is pretty amazing (for something built in JS, at least): a 1550x2006-pixel document page analyzes in 1.1s in Chrome.
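The typed-array-plus-worker trick mentioned above can be sketched as follows: keep the pixel data in a typed array so its underlying buffer can be moved into a Web Worker as a transferable rather than copied. The per-pixel pass and the worker file name here are hypothetical illustrations, not the project's actual code.

```javascript
// A simple per-pixel pass of the kind a page-scanning algorithm might run,
// written against a typed array so the buffer can be transferred to a worker.
function binarize(gray, threshold = 128) {
  // gray: Uint8Array of grayscale pixel values; thresholds to 0/255 in place.
  for (let i = 0; i < gray.length; i++) {
    gray[i] = gray[i] < threshold ? 0 : 255;
  }
  return gray;
}

// In the browser, the buffer is moved (not copied) into the worker:
//   const worker = new Worker("scan-worker.js");             // hypothetical file
//   worker.postMessage({ buf: gray.buffer }, [gray.buffer]); // transfer list
// After the transfer, gray.buffer is detached on the main thread, which is
// what keeps a large page's pixels from being serialized a second time.
```

Zero-copy transfer is what makes offloading a ~3-megapixel page to a worker cheap enough that the analysis stays interactive.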