OCR is built into Adobe's PDF reader; the issue is that it costs $15 a month.

I really want to see OCR become easier to use, but I don't know why it's such a hard problem in the first place.

There is the Python library ocrmypdf (https://ocrmypdf.readthedocs.io/en/latest/), which works really well. I have found its results comparable to Adobe's in accuracy.

I believe it uses Tesseract, Ghostscript and some other libraries.
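For anyone who wants to try it, here is a rough sketch that drives the ocrmypdf command line from Python. It assumes ocrmypdf is installed and on your PATH. The helper names `build_ocr_command` and `add_text_layer` and the file names are made up for illustration; `-l` selects the Tesseract language and `--skip-text` leaves pages that already have a text layer alone.

```python
import subprocess


def build_ocr_command(src, dst, language="eng", skip_text=True):
    """Build an ocrmypdf command line (hypothetical helper, not part of ocrmypdf)."""
    cmd = ["ocrmypdf", "-l", language]
    if skip_text:
        # Do not OCR pages that already carry a text layer.
        cmd.append("--skip-text")
    cmd += [src, dst]
    return cmd


def add_text_layer(src, dst, **options):
    """Run ocrmypdf; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_ocr_command(src, dst, **options), check=True)


if __name__ == "__main__":
    # Only prints the command; call add_text_layer() to actually run it.
    # File names are assumptions for illustration.
    print(" ".join(build_ocr_command("scan.pdf", "scan_ocr.pdf")))
```

The output file is a regular PDF with a searchable text layer on top of the original page images.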

Speaking of Ghostscript, one way to deal with problematic PDFs is to print them to a file and work with the result instead.
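One way to do the "print to file" step without a GUI is to pass the PDF through Ghostscript's `pdfwrite` device, which writes out a fresh PDF. This is only a sketch: it assumes the `gs` executable is on your PATH (on Windows it is usually named differently, e.g. `gswin64c`), and the helper names and file names are hypothetical.

```python
import subprocess


def build_reprint_command(src, dst):
    """Ghostscript command that re-writes src through the pdfwrite device."""
    # -o sets the output file and also implies -dBATCH -dNOPAUSE.
    return ["gs", "-o", dst, "-sDEVICE=pdfwrite", src]


def reprint_pdf(src, dst):
    """Run Ghostscript; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_reprint_command(src, dst), check=True)


if __name__ == "__main__":
    # Only prints the command; call reprint_pdf() to actually run it.
    print(" ".join(build_reprint_command("problem.pdf", "reprinted.pdf")))
```

Whether this fixes a given file depends on what is wrong with it, so treat it as something to try rather than a guaranteed repair.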

Do any open source apps integrate this?

I'd love to just be able to search a PDF document for a string and get a list of results.
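If a script is acceptable, searching is simple once the PDF has a text layer (for a scan, that means after it has been run through an OCR tool such as ocrmypdf). Below is a minimal sketch, assuming the pypdf package is installed for text extraction. `find_in_pages` and `page_texts` are hypothetical helper names, and the match is a plain case-insensitive substring test per line.

```python
def find_in_pages(pages, needle):
    """Return (page_number, line) for every line containing needle, ignoring case."""
    needle = needle.lower()
    hits = []
    for page_number, text in enumerate(pages, start=1):
        for line in text.splitlines():
            if needle in line.lower():
                hits.append((page_number, line.strip()))
    return hits


def page_texts(path):
    """Extract the text of each page with pypdf (assumed to be installed)."""
    # Imported lazily so find_in_pages can be used without pypdf.
    from pypdf import PdfReader

    return [page.extract_text() or "" for page in PdfReader(path).pages]


if __name__ == "__main__":
    # Tiny in-memory demo; for a real file use find_in_pages(page_texts(path), needle).
    demo_pages = ["Invoice 2021\nTotal due: 40", "Thank you"]
    for page_number, line in find_in_pages(demo_pages, "total"):
        print(f"page {page_number}: {line}")
```

Extracted text does not always follow the visual line breaks, so a phrase split across lines in the PDF may be missed by this per-line search.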

