I am tired of postprocessing OCR.
I have used many OCR solutions: Tesseract (4 and 5), EasyOCR, TrOCR (not document-level), DocTR and PaddleOCR (self-hostable on GPUs), and lastly Textract (the best of the lot).
Some are just about fast enough to be useful in production for long documents, but all have one thing in common:
- You need to postprocess the output so much!
Why, in this day and age, do they all tend to output bare lines or words of text, leaving it entirely to the user to sort out which text belongs to which column, or whether a bullet point starts a new sentence?
I know tools like GROBID solve this for academic papers by correctly handling columns and the like, but for general documents the problem seems largely unsolved.
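To make concrete what kind of postprocessing I mean: even something as basic as column detection has to be reinvented on top of raw OCR boxes. Here's a toy sketch (not our internal solution, and the `(x, y, w, h, text)` box format is just an assumed generic shape, not any particular library's output) that clusters line boxes by left edge and reads each column top to bottom:

```python
# Toy sketch of column-aware reading order on top of generic OCR output.
# Assumes each detected line is (x, y, w, h, text) with pixel coordinates;
# real engines emit richer structures, but all leave this step to you.

def sort_into_columns(boxes, gap=50):
    """Group boxes whose left edges lie within `gap` px into one column,
    then emit text column by column, top to bottom."""
    cols = []  # list of (column_left_x, [boxes in that column])
    for box in sorted(boxes, key=lambda b: b[0]):
        for col in cols:
            if abs(box[0] - col[0]) < gap:
                col[1].append(box)
                break
        else:
            cols.append((box[0], [box]))
    out = []
    # Left-to-right across columns, top-to-bottom within each.
    for _, members in sorted(cols, key=lambda c: c[0]):
        out.extend(t for *_, t in sorted(members, key=lambda b: b[1]))
    return out

boxes = [
    (40, 10, 200, 20, "Left col line 1"),
    (400, 10, 200, 20, "Right col line 1"),
    (40, 40, 200, 20, "Left col line 2"),
    (400, 40, 200, 20, "Right col line 2"),
]
print(sort_into_columns(boxes))
# → ['Left col line 1', 'Left col line 2', 'Right col line 1', 'Right col line 2']
```

And this naive version already breaks on ragged margins, spanning headers, and rotated scans, which is exactly why I'd rather the engines shipped it themselves.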
Are there good, maintained solutions for this? On a team I'm on, we spent a long time building an internal solution that works well, and the difference between raw output and properly processed output (formatted text and other improvements) has been *night and day*.
So why don't OCR providers add postprocessing steps to tidy up generic document formats?
PS: I haven't found GPT APIs to be great for this, because the location and size of text are often crucial for identifying columns and subheaders.