A fun side project that I do end up using a bit. Gonna bind the capture to a hotkey so I can use it without changing app focus. The most annoying problem, though, is that Tesseract OCR often gets confused when it has to read mixed Latin and Cyrillic text in a font it doesn't handle well, especially when there's something behind the text. Kind of disappointed that the most popular API often gives much worse results than a human just transcribing the letters would.
Wouldn't be surprised if OCR software took a leap forward soon thanks to a product similar to Whisper.
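One cheap way to spot this kind of Latin/Cyrillic mix-up after the fact is to flag words in the OCR output that contain letters from both scripts. This is only a rough sketch: the helper names `script_of` and `mixed_script_words` are made up for illustration, and it assumes the recognized text is already available as a plain string. It does not fix the recognition itself.

```python
import unicodedata


def script_of(ch):
    """Return 'latin', 'cyrillic', 'other', or None for non-letters."""
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    if name.startswith("LATIN"):
        return "latin"
    if name.startswith("CYRILLIC"):
        return "cyrillic"
    return "other"


def mixed_script_words(text):
    """Return the whitespace-separated words that mix Latin and Cyrillic letters."""
    flagged = []
    for word in text.split():
        scripts = {script_of(c) for c in word} - {None, "other"}
        if len(scripts) > 1:
            flagged.append(word)
    return flagged
```

Words flagged this way are good candidates for a second pass, for example re-running OCR on just that region with a single language selected.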
It comes to mind that the best possible app for this would be kind of like the old "Word Lens" iPhone application but on all screens: it would replace text in the raw screen input with text in another language, while keeping the appearance, color, scale, and rotation of the original text. That would free it from needing to be built into whatever UI library is producing the text, and it would work on recorded video too. Latency and performance problems come to mind immediately, but it could be a fun thing to try.
Visual Universal Translator.
I will definitely use this.
https://learn.microsoft.com/en-us/windows/powertoys/text-ext...
Overview selection is Pixel-exclusive, and that is definitely OCR since it can detect text in images. It's not perfect, however, and it doesn't seem to support non-Latin scripts.
The other way is to use an app like Copy [1] which analyzes app layouts.
[1] https://play.google.com/store/apps/details?id=com.weberdo.ap...
It's not very good. I miss being able to copy/paste from blurry or deformed screenshots of YouTube on Windows.
The special sauce - what you need to get a better result - is good, adaptive thresholding (something more advanced than the raw, naive binary thresholding you get when feeding plain color/grayscale images to OCR).
As far as I know, once you get that nailed it doesn't matter that much what OCR you use - as long as it's available and supports your target language.
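To make the difference concrete, here is a minimal pure-Python sketch of mean-based adaptive thresholding next to a naive global threshold. It assumes the image is a grayscale grid (a list of rows of 0-255 integers); the function names and the `block` and `offset` parameters are illustrative choices, not anything taken from NormCap or Tesseract. In practice you would more likely reach for something like OpenCV's `cv2.adaptiveThreshold`.

```python
def integral_image(gray):
    """Summed-area table with an extra zero row/column for easy window sums."""
    h, w = len(gray), len(gray[0])
    sat = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += gray[y][x]
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum
    return sat


def adaptive_threshold(gray, block=15, offset=10):
    """Mark a pixel as ink (0) if it is darker than its local mean minus offset."""
    h, w = len(gray), len(gray[0])
    sat = integral_image(gray)
    r = block // 2
    out = []
    for y in range(h):
        y0, y1 = max(0, y - r), min(h, y + r + 1)
        row = []
        for x in range(w):
            x0, x1 = max(0, x - r), min(w, x + r + 1)
            area = (y1 - y0) * (x1 - x0)
            total = sat[y1][x1] - sat[y0][x1] - sat[y1][x0] + sat[y0][x0]
            # Compare pixel * area against the window sum to avoid float division.
            is_ink = gray[y][x] * area < total - offset * area
            row.append(0 if is_ink else 255)
        out.append(row)
    return out


def global_threshold(gray, cutoff=128):
    """Naive binary threshold: one cutoff for the whole image."""
    return [[0 if p < cutoff else 255 for p in row] for row in gray]
```

On a screenshot with a gradient or image behind the text, the global version tends to turn the dark half of the background into solid ink and lose text on the bright half, while the local version only reacts to pixels that are dark relative to their neighbourhood.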
The main issue for a use case like NormCap is the trained models: they are optimized for images of _printed_ text and layouts, which differ from on-screen text in many respects. Unfortunately, I don't have the resources to train my own models.
Cuneiform was a long-time competitor, but afaik development there has stalled.
PS: People looking for (FOSS) alternatives, look here: https://github.com/dynobo/normcap#similar-open-source-tools
https://github.com/schappim/macOCR
Just rediscovered Shortcuts a couple of days ago while installing it on a friend's Mac.