Of course, the data itself is usually also for sale. But a manager would often rather have an analyst scrape it from the PDF report than pay the reporting company extra for a data subscription, preferring to bear the opportunity cost of keeping the analyst away from more important and productive work.
As an analyst, I can't count how many times I asked my former employer to shell out a couple hundred dollars a month for market intelligence data subscriptions and was blown off because they didn't want to allocate a budget for it.
1) Why use Python for OpenCV when Ruby has a decent wrapper that can do Hough (https://github.com/ruby-opencv/ruby-opencv)? Or was the Ruby version just too buggy still?
2) Is there a command-line version planned? I guess it'd be most relevant once auto-detection is figured out.
2) No plans at the moment, though that's an awesome idea.
(For those interested, you can grab Trapeze from mesadynamics.com -- requires OS X 10.4; source code is a mixture of C++ and Objective-C).
I did learn a few neat tricks by doing it myself though. The library I used to extract the text was none other than Mozilla's own PDF.js, so in the final version my users could just drag and drop the PDF onto the browser window, and my little algorithm parsed the tables into arrays, with AngularJS rendering them as HTML tables.
Obviously, computer-vision-assisted, general-purpose reconstruction of tabular data is the secret sauce in this project, but if you have the right use case you can do some cool things in the client. You do have to dig into the PDF.js internals a bit to figure out how to use it, but I'm sure that it will improve in that respect.
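The client-side approach described above can be sketched roughly like this: PDF.js's `getTextContent()` returns positioned text items (where `transform[4]` is x and `transform[5]` is y), and a small clustering pass groups them into rows. This is just a minimal sketch of the idea, not the commenter's actual code; the function name and tolerance are hypothetical.

```javascript
// Cluster PDF.js-style text items into table rows by y-coordinate.
// Items are assumed to look like getTextContent() output: { str, transform }.
function groupIntoRows(items, tolerance = 2) {
  const rows = new Map(); // representative y -> items on that line
  for (const item of items) {
    const y = item.transform[5];
    // Reuse an existing row whose y is within tolerance, else start a new one.
    let key = null;
    for (const k of rows.keys()) {
      if (Math.abs(k - y) <= tolerance) { key = k; break; }
    }
    if (key === null) { key = y; rows.set(key, []); }
    rows.get(key).push(item);
  }
  // Sort rows top-to-bottom (PDF y grows upward) and cells left-to-right.
  return [...rows.entries()]
    .sort((a, b) => b[0] - a[0])
    .map(([, cells]) =>
      cells.sort((a, b) => a.transform[4] - b.transform[4]).map(c => c.str));
}
```

In the browser you would feed it real items with something like `page.getTextContent().then(tc => groupIntoRows(tc.items))`, then hand the resulting arrays of strings to the template layer for rendering.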
edit: Never mind, it wouldn't have helped. I missed the part where automation isn't yet supported. Either way, this looks like a great tool.
Semi-related: I used to have a ton of scanned journal articles that I wanted to be able to read on a Kindle without having to scroll across every page, and came across k2pdfopt. It's a C program that finds word and line breaks in image-based PDFs and rearranges the text so that it fits on smaller screens. It's got a ton of flags you can set, and it's pretty good at ignoring/cropping out headers and footers and dealing with pages scanned at an angle. http://www.willus.com/k2pdfopt/help/k2menu.shtml No affiliation with Willus.
Has this kind of thing been done for PDF map data?
I was talking with a friend of mine a month ago about the dismal state of official crime incidence websites. They're usually just lists of PDFs, probably because whoever is responsible for the data just uses whatever MS Word PDF export is available in the office and posts an existing monthly report as a PDF. This makes online crime data a huge pain in the #ss to decipher.
I'm sure there's a lot of geographic data this could apply to.
Tutorial and demos are here: http://o-0.me/pXY/ . Some recent commits, like radial scanning, aren't documented very well yet, but I'll devote some time to it if anyone needs those. They're mostly useful for interactive analysis.
With some creative algorithms, typed arrays, and web workers, the speed is pretty amazing (for something built in JS, at least): a 1550x2006-pixel document page analyzes in 1.1s in Chrome.
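The typed-array-plus-worker trick mentioned above can be sketched as follows: keep the pixel data in a typed array so its underlying buffer can be moved into a Web Worker as a transferable rather than copied. The per-pixel pass and the worker file name here are hypothetical illustrations, not the project's actual code.

```javascript
// A simple per-pixel pass of the kind a page-scanning algorithm might run,
// written against a typed array so the buffer can be transferred to a worker.
function binarize(gray, threshold = 128) {
  // gray: Uint8Array of grayscale pixel values; thresholds to 0/255 in place.
  for (let i = 0; i < gray.length; i++) {
    gray[i] = gray[i] < threshold ? 0 : 255;
  }
  return gray;
}

// In the browser, the buffer is moved (not copied) into the worker:
//   const worker = new Worker("scan-worker.js");             // hypothetical file
//   worker.postMessage({ buf: gray.buffer }, [gray.buffer]); // transfer list
// After the transfer, gray.buffer is detached on the main thread, which is
// what keeps a large page's pixels from being serialized a second time.
```

Zero-copy transfer is what makes offloading a ~3-megapixel page to a worker cheap enough that the analysis stays interactive.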