Your online presentation looks great. The 'layout designer' if you will, the 'where are important things' screens look slick.
I do wonder how you assign those settings to incoming PDFs though. Is it the user's responsibility to say 'This PDF? I told you how/from where to extract data before'? Or do you have some classification system that stuffs the PDFs into buckets (say, by vendor) and templates are assigned to those?
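To make the "buckets by vendor" idea concrete, here is a minimal sketch of keyword-based routing. The vendor names, template names, and the `route_document` helper are all hypothetical and are not taken from any product in this thread. A real classifier would likely use more robust signals than a substring match.

```python
from typing import Optional

# Hypothetical mapping from a vendor keyword to a template name.
VENDOR_TEMPLATES = {
    "acme corp": "acme_invoice_v1",
    "globex": "globex_invoice_v2",
}


def route_document(text: str) -> Optional[str]:
    """Return the template of the first vendor keyword found in the text, else None."""
    lowered = text.lower()
    for keyword, template in VENDOR_TEMPLATES.items():
        if keyword in lowered:
            return template
    # No known vendor: the user would have to pick a template manually.
    return None
```

Documents that match no bucket fall back to `None`, which corresponds to the "user decides" case.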
How many PDFs that you encounter contain text (vs. scanned/image only documents)? For us, while the former certainly rise in popularity, the latter are still far too common/more prevalent.
Our solution is mostly on-premise so far (online offerings are the current focus of development) and we're quite OCR heavy: we run a bunch of non-free engines and vote between the results. We also have dynamic templates, allowing rule sets containing rules like 'The total amount is a number satisfying format X, usually right or below a string containing "Total"' (and our invoice processing solution basically comes with rules like these preconfigured for various countries).
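A rule like the "Total" example could be sketched roughly as follows. This is only an illustration under stated assumptions, not how any actual product implements it: the `Word` type, the amount regex standing in for "format X", and the tolerance values are all made up, and the coordinates assume y grows downward on the page.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    text: str
    x: float  # left edge of the word
    y: float  # top edge of the word; y grows downward (assumption)


# Assumed "format X": digits with optional thousands separators and two decimals.
AMOUNT_RE = re.compile(r"^\d{1,3}(,\d{3})*\.\d{2}$")


def find_total(words: List[Word], row_tol: float = 5.0,
               col_tol: float = 40.0) -> Optional[str]:
    """Return the amount closest to a 'Total' label, to its right or below it."""
    labels = [w for w in words if "total" in w.text.lower()]
    best, best_dist = None, float("inf")
    for label in labels:
        for w in words:
            if not AMOUNT_RE.match(w.text):
                continue
            # Same row and further right, or lower on the page in roughly the same column.
            right_of = abs(w.y - label.y) <= row_tol and w.x > label.x
            below = w.y > label.y and abs(w.x - label.x) <= col_tol
            if not (right_of or below):
                continue
            dist = abs(w.x - label.x) + abs(w.y - label.y)
            if dist < best_dist:
                best, best_dist = w.text, dist
    return best


words = [Word("Invoice", 50, 50), Word("42.00", 400, 300),
         Word("Total:", 300, 520), Word("1,234.50", 400, 521)]
print(find_total(words))  # prints 1,234.50
```

Because the rule is relative to the label rather than tied to absolute coordinates, it keeps working when the total moves around between layouts. Note that a substring match on "total" also catches "Subtotal", so a real rule set would need extra disambiguation.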
Are your templates using absolute coordinates/regions? You mention your 'unpaper' feature - do you fix/deskew both images and regions for misaligned pages?
(I won't mention any company/product names, because I don't want to advertise or hijack the thread. Nor do I need to connect my HN account ~directly~ with my employer)
So far, it's the user who needs to decide which document goes to which parser. A routing engine is on our list, however, and will probably be one of the next features we add.
Regarding the stats, I'm not sure yet as we just launched. OCR was however one of the first things early users asked for.
For the 'unpaper' function we are using http://manpages.ubuntu.com/manpages/trusty/man1/unpaper.1.ht...
I would love to discuss things in more detail with you. Could you contact me at contact [at] docparser.com, please?
In your FAQ it says:
There are no special requirements. There is nothing to install and you don't need any technical know-how for setting up and using >>> mailparser.io.<<< No coding is required.
Just pointing out a potential typo. Otherwise, if it's meant to say mailparser, better explain what that is.
A quick advert for PDF Tables https://pdftables.com/ - we're a bit more API-focussed.
It's up to users to define what the rows and columns are. In most programmatically generated PDFs, this is easy. But in manually typeset PDFs, there are lots of edge cases like variable row heights or column widths, slanted table borders, stuff like that.