Show HN: Convert PDF files into structured data

(docparser.com)

107 points | chezmo | 9y ago | 23 comments

phonon9y ago

Is this using something like https://github.com/creatale/node-fv on the backend, which can turn imperfectly scanned forms into data once you have prepared a schema? Or is it a more simplistic "mark hotspots" approach, which won't work well (or at all) if the scan is not perfectly aligned and sized to match the original?

chezmoOP9y ago

We do position-based text extraction. However, we add an 'unpaper' function which tries to correct misalignments and improve the quality of the scan.

ComodoHacker9y ago

What OCR library do you use? What languages does it support?

darklajid9y ago

I'm working for a company that does DMS Things™ and processing incoming PDFs (for mailroom applications or invoice processing) is one of our core projects. Given that this is the closest submission to my day job ever, I'm really curious about your project.

Your online presentation looks great. The 'layout designer' if you will, the 'where are important things' screens look slick.

I do wonder how you assign those settings to incoming PDFs though. Is it the user's responsibility to say 'This PDF? I told you how/from where to extract data before'? Or do you have some classification system that stuffs the PDFs into buckets (say, by vendor) and templates are assigned to those?

How many of the PDFs that you encounter contain text (vs. scanned/image-only documents)? For us, while the former are certainly rising in popularity, the latter are still far too common/more prevalent.

Our solution is mostly on-premise so far (online offerings are the current focus of development) and we're quite OCR heavy, using a bunch of non-free engines and voting between the results. We also have dynamic templates, allowing rule sets containing rules like 'The total amount is a number satisfying format X, usually right of or below a string containing "Total"' (and our invoice processing solution basically comes with rules like these preconfigured for various countries).
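
A rule like the "Total" example can be sketched roughly as follows. This is only an illustration, not either product's implementation: the `Word` type, the `find_total` helper, the amount pattern and the pixel tolerances are all assumptions, and the positioned words are presumed to come from an OCR engine or a PDF text layer.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    """A recognized word with the top-left corner of its bounding box."""
    text: str
    x: float  # left edge
    y: float  # top edge; y grows downward on the page


# Assumed "format X": digits with optional thousands groups and two decimals,
# e.g. 119.00 or 1.234,56. Real rule sets would vary this per country.
AMOUNT = re.compile(r"^\d{1,3}(?:[.,]\d{3})*[.,]\d{2}$")

# Match "Total" as a whole word so that "Subtotal" is not treated as the label
# (a deliberate choice for this sketch).
LABEL = re.compile(r"\btotal\b", re.IGNORECASE)


def find_total(words: List[Word],
               line_tol: float = 5.0,
               col_tol: float = 40.0,
               max_below: float = 60.0) -> Optional[str]:
    """Return the amount nearest to a 'Total' label, right of or below it."""
    labels = [w for w in words if LABEL.search(w.text)]
    amounts = [w for w in words if AMOUNT.match(w.text)]
    best, best_dist = None, float("inf")
    for label in labels:
        for amt in amounts:
            dx, dy = amt.x - label.x, amt.y - label.y
            right_of = abs(dy) <= line_tol and dx > 0
            below = 0 < dy <= max_below and abs(dx) <= col_tol
            if not (right_of or below):
                continue
            dist = abs(dx) + abs(dy)  # Manhattan distance as a simple score
            if dist < best_dist:
                best, best_dist = amt, dist
    return best.text if best else None
```

Because the rule is relative to the label rather than tied to absolute page coordinates, it keeps working when a vendor moves the totals block around, which is the point of a dynamic template.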

Are your templates using absolute coordinates/regions? You mention your 'unpaper' feature - do you fix/deskew both images and regions for misaligned pages?

(I won't mention any company/product names, because I don't want to advertise or hijack the thread. Nor do I need to connect my HN account ~directly~ with my employer)

chezmoOP9y ago

Awesome feedback!

So far, it's the user who needs to decide which document goes to which parser. A routing engine is on our list, however, and will probably be one of the next features we add.
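
For readers wondering what routing could look like in its simplest form, here is a hedged sketch. It is not a description of any planned Docparser feature: the `route_document` function, the parser names and the keyword lists are hypothetical, and a real classifier would likely use layout features or a trained model rather than plain keyword counts.

```python
from typing import Dict, List, Optional


def route_document(text: str,
                   parser_keywords: Dict[str, List[str]],
                   min_hits: int = 1) -> Optional[str]:
    """Pick the parser whose keywords appear most often in the document text.

    Returns None when no parser reaches `min_hits`, so the document can be
    handed to a human instead of being silently misrouted.
    """
    lowered = text.lower()
    best_parser, best_hits = None, 0
    for parser, keywords in parser_keywords.items():
        # Count how many of this parser's keywords occur in the text.
        hits = sum(1 for kw in keywords if kw.lower() in lowered)
        if hits > best_hits:
            best_parser, best_hits = parser, hits
    return best_parser if best_hits >= min_hits else None


# Hypothetical configuration: one entry per parser/template.
rules = {
    "acme_invoice": ["acme gmbh", "invoice"],
    "shipping_note": ["delivery note", "tracking"],
}
print(route_document("ACME GmbH Invoice No. 42", rules))
```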

Regarding the stats, I'm not sure yet as we just launched. OCR was however one of the first things early users asked for.

For the 'unpaper' function we are using http://manpages.ubuntu.com/manpages/trusty/man1/unpaper.1.ht...

I would love to discuss things in more detail with you. Could you contact me at contact [at] docparser.com, please?

alvin09y ago

what does DMS stand for? or is "DMS Things" a tm?

whitingx9y ago

DMS = Document Management System ツ

evolve2k9y ago

Looks cool, nice work.

In your FAQ it says:

There are no special requirements. There is nothing to install and you don't need any technical know-how for setting up and using >>> mailparser.io.<<< No coding is required.

Just pointing out a potential syntax error. Otherwise, if it's meant to say mailparser, better explain what that is.

chezmoOP9y ago

Thanks for the heads up, I just fixed it! mailparser.io is my other product, which I launched a couple of years ago. Customers kept asking for document parsing capabilities, so I thought it would be a good idea to start Docparser. For the FAQ I copied some text and apparently forgot to properly proofread it :)

caseyf79y ago

The Zapier integration is why I'm going to try this one.

unfortunateface9y ago

Save yourself a lot of support time/costs and remove the 'free' option. Your homepage sells the product well and shows its benefits. From the feedback you've already received it looks like you are providing more than $50 worth of value.

sixhobbits9y ago

I'm always surprised by how well `pdftotext -layout` works for even complicated-looking PDFs. It has been better than most specialised (free) web services I've tried.
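
Assuming the tool meant here is Poppler's `pdftotext` with its `-layout` flag, a minimal way to call it from a script might look like this. The two helper names are made up for the example, and `pdftotext` must be installed and on the PATH for `extract_layout_text` to work.

```python
import subprocess
from typing import List, Optional


def build_pdftotext_command(pdf_path: str,
                            first_page: Optional[int] = None,
                            last_page: Optional[int] = None) -> List[str]:
    """Build the argument list for a layout-preserving text extraction."""
    cmd = ["pdftotext", "-layout"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]  # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]   # last page to convert
    cmd += [pdf_path, "-"]              # "-" sends the text to stdout
    return cmd


def extract_layout_text(pdf_path: str,
                        first_page: Optional[int] = None,
                        last_page: Optional[int] = None) -> str:
    """Run pdftotext and return its output; raises if the tool fails."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path, first_page, last_page),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

The `-layout` mode tries to keep columns aligned as on the page, so table-like content can often be split afterwards on runs of two or more spaces. It only helps for PDFs that already contain a text layer; scanned pages still need OCR first.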

frabcus9y ago

Looks really good!

A quick advert for PDF Tables https://pdftables.com/ - we're a bit more API-focussed.

petra9y ago

Depending on how well this works, this could be extremely useful for the electronics industry, where everything is locked in a PDF. It would allow someone to build an in-depth research tool that lets engineers find the optimal part (using complex queries), from any manufacturer, very fast. That is far from the broken situation of today, where engineers spend tons of time researching and often don't get close to the ideal.

Kinnard9y ago

I wonder how their software works. I think there's untapped potential in Adobe's PostScript.

lovelearning9y ago

The file format itself has all the information required to extract text from a rectangular area. Frameworks like PDFBox and iText have supported it for a long time.

It's up to users to define what the rows and columns are. In most programmatically generated PDFs, this is easy. But in manually typeset PDFs, there are lots of edge cases like variable row heights or column widths, slanted table borders, stuff like that.

chezmoOP9y ago

That's right! The user defines a rectangular area and we then extract the raw text based on its position. For table extraction we use tabula-java under the hood.
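
The idea of pulling text out of a user-defined rectangle can be sketched in a few lines. This is an illustration under assumptions, not Docparser's code and not the PDFBox, iText or tabula-java API: the `Word` and `Rect` types, the `extract_region_text` helper and the line tolerance are invented here, and the positioned words are presumed to come from a PDF text layer or OCR output.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """A word and its bounding box in page coordinates (y grows downward)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass
class Rect:
    """The rectangular area drawn by the user."""
    x0: float
    y0: float
    x1: float
    y1: float

    def contains_center(self, w: Word) -> bool:
        # Use the word's center so words straddling the border are decided
        # by where most of the word lies.
        cx, cy = (w.x0 + w.x1) / 2, (w.y0 + w.y1) / 2
        return self.x0 <= cx <= self.x1 and self.y0 <= cy <= self.y1


def extract_region_text(words: List[Word], region: Rect,
                        line_tol: float = 3.0) -> str:
    """Return the text inside `region`, one output line per text line."""
    inside = sorted((w for w in words if region.contains_center(w)),
                    key=lambda w: (w.y0, w.x0))
    lines: List[List[Word]] = []
    for w in inside:
        # Words whose tops are within `line_tol` of the line's first word
        # are treated as belonging to the same text line.
        if lines and abs(lines[-1][0].y0 - w.y0) <= line_tol:
            lines[-1].append(w)
        else:
            lines.append([w])
    return "\n".join(
        " ".join(w.text for w in sorted(line, key=lambda w: w.x0))
        for line in lines
    )
```

A sketch like this also shows why alignment matters for position-based extraction: if a scan is shifted or skewed, the word centers drift out of the fixed rectangle, which is the problem a deskewing step is meant to reduce.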

camel_Snake9y ago

Tried giving this[0] a shot but even just a single page was too large for the 4MB limit.

[0] https://archive.org/details/averageweightofm41fult

jamiecarruthers9y ago

I gave it a go and couldn't get useful data extracted. I sent a support query with attached PDFs.

markdown9y ago

Your pricing tables mention webhooks, but the FAQs below them don't explain what those are.

ruler889y ago

Nice! I wish I had known about this earlier; I built a version of this on my own to solve this very problem.

mordae9y ago

No source? No, thanks!
