"Q: What type of text can Amazon Textract detect and extract?
A: Amazon Textract can detect Latin-script characters from the standard English alphabet and ASCII symbols."
So, English only. More worrying is that they're going to keep your company's documents:
"Q. Are document and image inputs processed by Amazon Textract stored, and how are they used by AWS?
A: Amazon Textract may store and use document and image inputs processed by the service solely to provide and maintain the service and to improve and develop the quality of Amazon Textract..."
"Q. Can I delete images and documents stored by Amazon Textract?
A: Yes. You can request deletion of document and image inputs associated with your account by contacting AWS Support. Deleting image and document inputs may degrade your Amazon Textract experience."
That said, I'm still baffled about what value-add they're providing. From the name alone, I would expect it to generate other documents of common types: .txt (without images), .doc, .html (zip). That is, a large part of extracting text is the ability to reflow the text across page boundaries and columns. However, this product states that:
"All extracted data is returned with bounding box coordinates" [1]
...which is how PDF documents lay things out in the first place. Have I missed something?
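To make the reflow point concrete, here is a rough sketch of what a caller would still have to do with bounding boxes. It assumes a response shaped like a Textract result (a "Blocks" list whose LINE entries carry "Text" and a "Geometry"/"BoundingBox" with "Left" and "Top"); the sample data, the function name and the row_tolerance parameter are made up for illustration. It only handles a single-column page, and multi-column layouts or page breaks would need real layout analysis.

```python
def lines_in_reading_order(response, row_tolerance=0.01):
    """Return LINE texts sorted top-to-bottom, then left-to-right.

    Boxes whose Top values fall in the same row_tolerance bucket are treated
    as one row. This is a naive single-column heuristic, not a layout engine.
    """
    lines = [b for b in response.get("Blocks", []) if b.get("BlockType") == "LINE"]

    def sort_key(block):
        box = block["Geometry"]["BoundingBox"]
        return (round(box["Top"] / row_tolerance), box["Left"])

    return [b["Text"] for b in sorted(lines, key=sort_key)]


# Hypothetical response fragment, hand-written for this example.
sample = {
    "Blocks": [
        {"BlockType": "LINE", "Text": "World",
         "Geometry": {"BoundingBox": {"Left": 0.5, "Top": 0.10}}},
        {"BlockType": "LINE", "Text": "Second line",
         "Geometry": {"BoundingBox": {"Left": 0.1, "Top": 0.30}}},
        {"BlockType": "LINE", "Text": "Hello",
         "Geometry": {"BoundingBox": {"Left": 0.1, "Top": 0.101}}},
    ]
}

print("\n".join(lines_in_reading_order(sample)))
```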
For my employer's use case, the data storage and privacy implications are a non-starter.
Example: You are dealing with a lot of PDF documents that contain unstructured information (e.g. a filled form) and you need to extract bits of information (e.g. name, address) and output it in a structured format (e.g. JSON/XLS).
[0] https://hn.algolia.com/?query=pdf%20convert&sort=byPopularit...
I do maintain some level of skepticism though. It is OCR :D
As an example of a more scripted/structured job, ProPublica built out a crowdsourcing framework in Rails to extract data from FCC filings. But even that was quite difficult, because every state/TV station has its own kind of form: https://projects.propublica.org/free-the-files/
It looks like it's competing with ABBYY (FlexiCapture) and Kofax.
AWS is just aggregating it all into one place and giving it a really good final polish.
Source: I do a lot of post-sales consulting work implementing RPA solutions.
I ask because it feels like I should have an app that lets me scan with my phone, process the text with OCR, then let me plain text search every scanned document I have.
The first part only natively made it into iOS Notes a year or two ago, but that whole experience above should be out of the box, IMHO…
For normal text OCR works well. But automatically understanding what is what is more complex.
And actually understanding the context of what you're trying to use OCR on can work backward to determine what the text actually is, i.e. if it's a "Name" field then the probabilities of ambiguous letters may change (in the case of handwriting rec).
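A toy sketch of that idea, assuming the OCR engine hands back several candidate readings with scores. The field names and the digit penalty of 0.1 are invented for the example, not taken from any real engine.

```python
def rescore(candidates, field):
    """Pick the best reading given what kind of field the text sits in.

    candidates: list of (text, ocr_score) pairs.
    field: "name", "number", or anything else for no prior.
    """
    def prior(text):
        digits = sum(ch.isdigit() for ch in text)
        if field == "name":
            return 0.1 ** digits                # names rarely contain digits
        if field == "number":
            return 0.1 ** (len(text) - digits)  # numbers rarely contain letters
        return 1.0

    return max(candidates, key=lambda c: c[1] * prior(c[0]))[0]


name_candidates = [("0live", 0.55), ("Olive", 0.45)]
number_candidates = [("1O5", 0.6), ("105", 0.4)]

print(rescore(name_candidates, "name"))
print(rescore(number_candidates, "number"))
```

Without the field prior, the raw OCR score wins and the first candidate would be picked in both cases.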
They store every word that MAY be in the scanned document.
So their OCR engine will find a lot of legitimate words, but it will also find a lot of words that don't make sense.
When putting in a term for searching, it looks at the entire index (both legit words and the garbage) and returns you the documents that match.
I think it's quite clever.
Bear in mind that this was many years ago; I have no idea if this is still the case.
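Roughly how I picture it, as a minimal sketch: every candidate reading of every word position goes into an inverted index, garbage included, and a search is just a lookup. The data layout and names here are my own guesses, not how their system is actually built.

```python
from collections import defaultdict


def build_index(docs):
    """docs maps doc_id -> list of word positions.

    Each word position is a list of candidate readings from the OCR engine.
    All candidates are indexed, whether or not they are real words.
    """
    index = defaultdict(set)
    for doc_id, positions in docs.items():
        for candidates in positions:
            for word in candidates:
                index[word.lower()].add(doc_id)
    return index


def search(index, term):
    """Return the ids of documents where any candidate reading matches term."""
    return sorted(index.get(term.lower(), set()))


# Hypothetical OCR output: two scanned notes with ambiguous readings.
docs = {
    "note-1": [["Invoice", "lnvoice"], ["total", "tota1"]],
    "note-2": [["Invoke", "Invoice"], ["later"]],
}
index = build_index(docs)

print(search(index, "invoice"))
print(search(index, "tota1"))
```

The garbage entries cost index space but are harmless in practice, since nobody types them as a search term.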
It returns table data line by line.
Announcing Amazon Textract, https://www.youtube.com/watch?v=PHX7q4pMGbo
Introducing Amazon Textract: Now in Preview, https://www.youtube.com/watch?v=hagvdqofRU4
Introducing Amazon Hieroglyph: Now in Preview (AIM363), https://www.youtube.com/watch?v=FnZFK_2oqKk
Edit: Hell, even average word length is probably going to be a good indicator since German is so agglutinative. Collect some factors like this and I think you'll be able to build a pretty good classifier.
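Something like this would give you that one feature. It is only a sketch: the regex tokenizer is a simplification, and on its own this number is a weak signal that you would combine with other factors in a classifier.

```python
import re

# Runs of letters only (Unicode-aware), so digits and punctuation are ignored.
WORD_RE = re.compile(r"[^\W\d_]+")


def avg_word_length(text):
    """Average number of letters per word; 0.0 for text with no words."""
    words = WORD_RE.findall(text)
    return sum(len(w) for w in words) / len(words) if words else 0.0


english = "the cat sat"
german = "Die Versicherungsgesellschaft"

print(avg_word_length(english), avg_word_length(german))
```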
If you want to perform language detection you can do the following:
a. Invoke tesseract with "-l eng".
b. Pass the output text to langdetect [1]. It is a port of Google's language detection library to Python which will give you the probabilities of the languages for a given text.
c. Invoke tesseract with "-l langdetect_output"
Note that langdetect generates 2 character codes (ISO 639-1) whereas tesseract expects 3 character codes (ISO 639-2).
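Steps a to c could be sketched like this. It assumes the tesseract binary is on your PATH with the relevant language data installed, and that langdetect is installed from PyPI. The mapping table is a small hand-picked subset and the function names are my own. Note that langdetect's detect() raises an exception when the first pass yields no usable text.

```python
import subprocess

# Partial, hand-picked mapping from 2-letter codes to Tesseract language names.
ISO1_TO_TESSERACT = {
    "en": "eng", "de": "deu", "fr": "fra", "es": "spa",
    "it": "ita", "nl": "nld", "pt": "por",
}


def to_tesseract_lang(code, default="eng"):
    """Translate a 2-letter code to Tesseract's 3-letter name, else default."""
    return ISO1_TO_TESSERACT.get(code.lower(), default)


def tesseract_command(image_path, lang):
    """Build the CLI call: write recognized text to stdout using lang."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def run_tesseract(image_path, lang):
    result = subprocess.run(tesseract_command(image_path, lang),
                            capture_output=True, text=True, check=True)
    return result.stdout


def ocr_with_language_detection(image_path):
    from langdetect import detect  # third-party: pip install langdetect

    first_pass = run_tesseract(image_path, "eng")     # step a
    lang = to_tesseract_lang(detect(first_pass))      # step b
    if lang == "eng":
        return first_pass
    return run_tesseract(image_path, lang)            # step c
```

Languages missing from the table fall back to "eng", so in that case the first-pass output is returned unchanged.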
As for accuracy, the details of your documents and scanning can matter, but, for normal personal usage, it should be very high.
1. How it will deal with multiple templates that the system hasn't seen before. Especially when there is significant difference between the templates.
2. UI/UX. E.g. how it will trace the extracted data to the original source and how it will show the confidence scores of each entity.
3. Verification process: what will the workflow look like when the confidence score is low and the document has to be checked by human operators?
A team I'm working with had a lot of success doing this; curious what method(s) they are using.
Have you found any major problems or shortcomings in your usage?
I still prefer the Dropbox solution for that, but I'm waiting for them to turn it into an API.
- Printed text detection
- Handwritten text detection
- Key-Value detection
- Table detection
- Checkbox detection
- Other optical marks (e.g. barcode, QR code)
There's a decent possibility it has handwriting recognition. Not sure about the non-English languages though.
Do you support handwriting? – We do not support handwriting extraction.
[1]: https://docs.aws.amazon.com/textract/latest/dg/how-it-works-...
1. make "strings" api 2. hook it to a web server 3. profit!