undefined | Better HN

0 pointslasagna_coder3y ago0 comments

What are these projects are you referring to? AFAIK Tesseract is sponsored by Google, from what I understand it is state of the art, ie it is Google OCR. Searching for OCR with OpenCV only reveals using OpenCV with Tesseract, not rolling its own OCR, OpenCV being used to preprocess images to optimise them for Tesseract. Maybe I'm missing something, so I'm interested if you can point me in the right direction.

0 comments

spi3y ago

Google OCR is definitely not the same as Tesseract, although it's true that Tesseract is maintained by Google. Google OCR has definitely much higher accuracy and is significantly faster (basically always taking 1s for inference, while Tesseract can easily take 10s or more for dense pages).

Source: I work in developing a competing OCR service and we keep an eye on competition (e.g. aside from Google, solutions by Azure, Amazon, Abbyy, Nuance, Cloudmersive, etc., as well as our internal product of course, which is not available externally), and they are (almost) all significantly better on Tesseract.

The only domain where Tesseract is competitive is for perfect "black text on white paper", it gives pretty poor performance when dealing with colored, distorted text, or even strong page structure effects (tables, etc.).

When I say "pretty poor" I mean: "with respect to the state-of-the-art", of course it's still enormously better than what was the state-of-the-art before deep learning came into the picture, roughly a decade ago. And for things like "search contents of a book" it's basically perfect already.

recuter3y ago

> Source: I work in developing a competing OCR service and we keep an eye on competition (e.g. aside from Google, solutions by Azure, Amazon, Abbyy, Nuance, Cloudmersive, etc., as well as our internal product of course, which is not available externally), and they are (almost) all significantly better on Tesseract.

Great. How do you quantify it and keep track? Is there an industry standard benchmark?

Would you consider sharing a backblaze type analysis (they track consumer HD performance and blogging about it got them a lot of attention and customers)?

spi3y ago

Sorry for the late answer.

Short answer is: we can't and we don't. Most EULAs explicitly prevent users to benchmark results, and we don't want to incur into any such risk. Plus, since we develop a competing product, any "deep look" into the competition might be seen as reverse engineering it, and our company is very careful to avoid such problems.

Our company has dedicated teams to evaluate competition products, so we once asked them (a couple of years ago), and could only look at aggregated, anonymized results. But the patterns were very clear. Anecdotical experience (mostly coming from customers of ours who, themselves, compare our internal engine with alternatives) seemed to point to the fact that most of the competition have rather stable service, so quality likely didn't evolve much in the last two years, but we can't be sure of course.

We constantly track our own accuracy on internally developed benchmarks, because frankly the ones available online (also for research purposes) are very bad. But as said, we can only continuously test our own engine and open source ones (like Tesseract), for legal reasons.

1 more reply

sireat3y ago

Working for a library on rare fonts I've found Tesseract fantastic for custom training.

It certainly beats Abbyy from 10 years ago - maybe a low bar to clear.

I had to spend some time setting up labeling then did some supplemental training on UB-Mannheim datasets.

Tesseract is the only OCR FOSS solution that has reasonable performance.

rjzzleep3y ago

> The only domain where Tesseract is competitive is for perfect "black text on white paper", it gives pretty poor performance when dealing with colored, distorted text, or even strong page structure effects (tables, etc.).

I wouldn't be surprised if their data set is bigger than the stock tesseract, but part of the OCR process is to preprocess the images.

cinntaile3y ago

I don't get how it's a competing product if it's not available externally? What field is it if you can answer that?

spi3y ago

Sure, our company deals with business documents and typically sells products higher in the stack. Our OCR offering is available to customers, but only if they buy a significantly larger pack of products that does information extraction. As a matter of fact, OCR results are included in there, so customers could (and very rarely do) buy the whole package for OCR purposes only. It's just not advertised/sold independently, so it doesn't make much sense for most customers to buy it for that purpose (unless they have really tiny volumes) because price-wise is much more expensive than alternative products only selling OCR.

1 more reply

j / k navigate · click thread line to collapse

0 comments

spi3y ago

recuter3y ago

Great. How do you quantify it and keep track? Is there an industry standard benchmark?

Would you consider sharing a backblaze type analysis (they track consumer HD performance and blogging about it got them a lot of attention and customers)?

spi3y ago

Sorry for the late answer.

1 more reply

sireat3y ago

Working for a library on rare fonts I've found Tesseract fantastic for custom training.

It certainly beats Abbyy from 10 years ago - maybe a low bar to clear.

I had to spend some time setting up labeling then did some supplemental training on UB-Mannheim datasets.

Tesseract is the only OCR FOSS solution that has reasonable performance.

rjzzleep3y ago

I wouldn't be surprised if their data set is bigger than the stock tesseract, but part of the OCR process is to preprocess the images.

cinntaile3y ago

I don't get how it's a competing product if it's not available externally? What field is it if you can answer that?

spi3y ago

1 more reply

j / k navigate · click thread line to collapse