Amazon Textract – Extract text and data from virtually any document

(aws.amazon.com)

229 points | mcrute | 7y ago | 72 comments

cmroanirgo 7y ago

Found some interesting tidbits in their FAQ [0]:

"Q: What type of text can Amazon Textract detect and extract?

A: Amazon Textract can detect Latin-script characters from the standard English alphabet and ASCII symbols."

So, English only. But what's very worrying is that they're going to keep your company's documents:

"Q. Are document and image inputs processed by Amazon Textract stored, and how are they used by AWS?

A: Amazon Textract may store and use document and image inputs processed by the service solely to provide and maintain the service and to improve and develop the quality of Amazon Textract..."

"Q. Can I delete images and documents stored by Amazon Textract?

A: Yes. You can request deletion of document and image inputs associated with your account by contacting AWS Support. Deleting image and document inputs may degrade your Amazon Textract experience."

That said, I'm still baffled as to what value-add they're providing. For me, going by the name alone, it would generate other documents of common types: .txt (without images), .doc, .html (zip). That is, a large part of extracting text is the ability to reflow the text across page boundaries & columns. However, this product states that:

"All extracted data is returned with bounding box coordinates" [1]

...which is how PDF documents lay things out in the first place... Have I missed something?

[0] https://aws.amazon.com/textract/faqs/

[1] https://aws.amazon.com/textract/features/
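To make the reflow point concrete, here is a minimal sketch of how a consumer might turn positioned words back into plain text lines. The `Word` tuple and the `line_tolerance` parameter are hypothetical names for illustration, not Textract's actual response format, and the sketch deliberately ignores multi-column layouts, which is where the real difficulty lies.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    text: str
    left: float  # x of the box's left edge, as a 0..1 fraction of page width
    top: float   # y of the box's top edge, as a 0..1 fraction of page height


def reflow(words: List[Word], line_tolerance: float = 0.01) -> str:
    """Group words into lines by vertical position, then order each line left to right."""
    lines: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w.top):
        # Stay on the current line while the word is vertically close to its first word.
        if lines and abs(word.top - lines[-1][0].top) <= line_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return "\n".join(
        " ".join(w.text for w in sorted(line, key=lambda w: w.left))
        for line in lines
    )
```

A single-column page reflows cleanly this way; side-by-side columns would be interleaved, so a real implementation would need to segment the page into regions first.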

tills13 7y ago

The point of this service is to train their own OCR models for use in other products like Kindle / their e-book store. There doesn't really need to be a value add - if people use it it's a win for them... if people don't it's not really a big loss.

acqq 7y ago

But in order to train something you have to have the input of what is actually there, I don’t see how that is provided here.

Technetium_Hat 7y ago

There might be a way for users to rate the quality of the result, or at least report it if it is very wrong.

tracker1 7y ago

Think less about books, and more about automating input from forms filled out by hand. In working with this tech, I can say that none of it is great and it would be very nice to be able to ditch what's available for stuff that would work better.

For my employer's use case, the data storage and privacy implications are a non-starter.

kayhi 7y ago

Wonder if they will offer a local solution.

ocrcustomserver 7y ago

Shameless plug: I work on custom solutions that do this locally, shoot me an email if interested.

ocrcustomserver 7y ago

As tracker1 mentioned, don't think of this as for reflowing text for different devices but as a data capture and documents processing solution.

Example: You are dealing with a lot of PDF documents that contain unstructured information (e.g. a filled form) and you need to extract bits of information (e.g. name, address) and output it in a structured format (e.g. JSON/XLS).

VvR-Ox 7y ago

Keeping documents and analyzing your business is not new and will not keep people from using it in their companies, I'm afraid. At least it doesn't stop people from using Windows and other M$ products.

danso 7y ago

Given how consistently popular tools for the "simple" conversion of regular PDF forms/tables are -- even with the technically-sophisticated HN audience [0] -- if Amazon can deliver on OCR-to-data, that feels like a huge achievement. Not as sexy (or creepy) as Rekognition, perhaps, but almost certainly more day-to-day useful to the many, many professionals who work with documents and legacy data entry systems.

[0] https://hn.algolia.com/?query=pdf%20convert&sort=byPopularit...

- https://news.ycombinator.com/item?id=18199708

- https://news.ycombinator.com/item?id=5487530

just_myles 7y ago

Agreed. Anything that can lighten the load of having to write custom scripts to handle PDF-to-data conversions will be helpful.

I do maintain some level of skepticism though. It is OCR :D

danso 7y ago

Even if AWS goes the cynical route of making Textract be an upsell to MTurk -- e.g. the Textract output is not reliable enough on its own, but structured for easy piping to a MTurk job -- that's got to be useful for the many folks who send entire pages to MTurk when they just need a couple boxes proofread.

As an example of a more scripted/structured job, ProPublica built out a crowdsourcing framework in Rails to extract data from FCC filings. But even that was quite difficult, because every state/TV station has its own kind of form: https://projects.propublica.org/free-the-files/

ocrcustomserver 7y ago

There's Google Cloud Vision and Microsoft Cognitive Services that act as competitors to Amazon Rekognition, but AFAIK there's no offering from a FAANG that competes with AWS Textract.

It looks like it's competing with ABBYY (FlexiCapture) and Kofax.

raghavtoshniwal 7y ago

This plays so well with the theory of AWS taking a slice of all web activity. They are commoditising more and more complex tasks and enabling a huge number of engineers to bootstrap their ideas with amazing tech from day 1. A huge jump from S3/EC2 to this. Commendable.

jjeaff 7y ago

I sort of agree. But I think the reality is a little closer to Apple's style of innovation. Few of the things that AWS offers are things that didn't exist before. For example, data extraction and image recognition APIs have been around for a long while now from several different providers.

AWS is just aggregating it all into one place and giving it a really good final polish.

ocrcustomserver 7y ago

I was surprised to see them also announce "Amazon Comprehend Medical" which is NLP for a specific vertical: https://aws.amazon.com/comprehend/medical/

Edmond 7y ago

Not sure if this is bad news for the Robotic Process Automation (RPA) sector or an opportunity to offload the "Robotic" part while focusing on business process...

macintux 7y ago

Given the fact that Amazon hangs on to the documents after, I don’t think most companies interested in RPA would feel comfortable with it.

Ftuuky 7y ago

There are many RPA solutions with OCR as part of the automation.

njstraub608 7y ago

Generally, these are stuffed in there for marketing and aren't very effective when used in actual business scenarios.

Source: I do a lot of post-sales consulting work implementing RPA solutions.

Edmond 7y ago

Right, except that is often the only heavy lift of their offerings (i.e. turning your paper receipt into an expense report)... A service like this takes away the ML/AI babbling, and what they're left with is clunky business process software offerings.

efields 7y ago

Is off the shelf open source OCR not reliable for an image of reasonable fidelity, like a smartphone camera picture of a B&W text document?

I ask because it feels like I should have an app that lets me scan with my phone, process the text with OCR, then let me plain text search every scanned document I have.

The first part only natively made it into iOS Notes a year or two ago, but that whole experience above should be out of the box, IMHO…

Holybeds 7y ago

There's a difference between doing OCR and actually understanding what is what in the document content.

For normal text OCR works well. But automatically understanding what is what is more complex.

njstraub608 7y ago

This ^^

And actually understanding the context of what you're trying to use OCR on can work backward to determine what the text actually is, i.e. if it's a "Name" field then the probabilities of ambiguous letters may change (in the case of handwriting rec).
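A toy sketch of that idea, assuming the recognizer returns per-candidate probabilities for an ambiguous glyph. The `FIELD_PRIORS` table and its numbers are made up for illustration; a real system would learn them from labelled forms.

```python
from typing import Dict

# Hypothetical priors: how likely each character class is within a given field type.
FIELD_PRIORS: Dict[str, Dict[str, float]] = {
    "name": {"letter": 0.99, "digit": 0.01},
    "zip_code": {"letter": 0.01, "digit": 0.99},
}


def char_class(ch: str) -> str:
    return "digit" if ch.isdigit() else "letter"


def rescore(candidates: Dict[str, float], field_type: str) -> Dict[str, float]:
    """Weight OCR likelihoods by the field prior and renormalize (Bayes' rule)."""
    prior = FIELD_PRIORS[field_type]
    weighted = {ch: p * prior[char_class(ch)] for ch, p in candidates.items()}
    total = sum(weighted.values())
    return {ch: w / total for ch, w in weighted.items()}


def best(candidates: Dict[str, float], field_type: str) -> str:
    """Pick the most probable character once field context is taken into account."""
    scores = rescore(candidates, field_type)
    return max(scores, key=scores.get)
```

With an evenly split guess such as `{"O": 0.5, "0": 0.5}`, the field type alone decides the outcome: a "name" field favors the letter, a "zip_code" field the digit.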

viig99 7y ago

No, open source OCR doesn't work that great. I work for a telecom company, and we process millions of documents a month. We built everything in house and are now able to process them at almost 40 cents per 1000 documents. It's a long process to handle complex documents like payslips, which require text boundary detection, word identification, spatial clustering, and writing parsers (which depend on word, segment, and clustering probabilities) that can extract the required fields out of the documents.

wahnfrieden 7y ago

This is an Evernote feature. Dropbox also launched this feature.

brad0 7y ago

Evernote is an interesting case.

They store every word that MAY be in the scanned document.

So their OCR engine will find a lot of legitimate words, but it will also find a lot of words that don't make sense.

When putting in a term for searching, it looks at the entire index (both legit words and the garbage) and returns you the documents that match.

I think it's quite clever.

Bear in mind that this was many years ago; I have no idea if this is still the case.

ocrcustomserver 7y ago

Yeah, Evernote's OCR engine will generate possible candidates for every given word and will sort them internally by confidence score.

Screenshot: https://s24953.pcdn.co/blog/wp-content/uploads/2018/02/longh...

Since it's aimed not at transcription (user doesn't know what he's looking for) but at retrieval (user knows what he's looking for), it can get away with mistakes.

References:

https://evernote.com/blog/how-evernotes-image-recognition-wo...

https://help.evernote.com/hc/en-us/articles/208314518-How-Ev...

https://evernote.com/blog/evernote-indexing-system/
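The retrieval trick described above can be sketched as a small inverted index that keeps every OCR candidate rather than only the top guess. This is an illustration of the general technique, not Evernote's implementation; `CandidateIndex` and its methods are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# One recognized word position: a list of (candidate_text, confidence) pairs.
Candidates = List[Tuple[str, float]]


class CandidateIndex:
    """Inverted index that stores every OCR candidate, not just the best one."""

    def __init__(self) -> None:
        # term -> {doc_id: best confidence seen for that term in the document}
        self._postings: Dict[str, Dict[str, float]] = defaultdict(dict)

    def add(self, doc_id: str, words: List[Candidates]) -> None:
        for candidates in words:
            for text, confidence in candidates:
                term = text.lower()
                previous = self._postings[term].get(doc_id, 0.0)
                self._postings[term][doc_id] = max(previous, confidence)

    def search(self, query: str) -> List[Tuple[str, float]]:
        """Return (doc_id, confidence) pairs, most confident match first."""
        hits = self._postings.get(query.lower(), {})
        return sorted(hits.items(), key=lambda item: item[1], reverse=True)
```

Garbage candidates cost index space but rarely hurt, because nobody searches for them; a correct low-confidence candidate, on the other hand, still lets the document be found.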

julianz 7y ago

Yep it's quite clever for searching for things, much less useful for doing something based on the recognized text.

BasHamer 7y ago

If this can get me tables out of PDFs generated by Crystal Reports it would be a godsend for testing. This has been a nightmare to try and solve; the best option so far has been Adobe's cloud service, but they don't offer an API for that. I'm excited to try it out.

mjt58 7y ago

Have you tried e.g. https://tabula.technology, https://pdftables.com, or https://pypi.org/project/Camelot/ ?

counciltime 7y ago

I have a friend who has developed a number of Tesseract-based applications that do OCR specifically for PDFs. The Report Miner application does a nice job of locating and extracting PDF tables.

https://www.opait.com/tesseractstudio/

https://www.opait.com/Pdfreportminer/

minhtripham 7y ago

Would love to learn more about the apps your friend developed -- currently doing research into different OCR use cases + tech. Can you shoot me an email at minh@docucharm.com?

BasHamer 7y ago

https://pdftables.com failed on the test file: pretty good, but inconsistent interpretation across rows; sometimes it split the cell, sometimes it did not. Tabula failed to detect multi-line rows; after manually changing the table it did do better than pdftables.com on splitting cells. Both failed on the non-printable whitespace characters, which created garbled output in Excel. The other one would take some time to rig up.

ocrcustomserver 7y ago

You can also try https://docparser.com/.

If nothing works for you and you're comfortable with sharing an example file, you can send it to me and I could take a look.

cdolan 7y ago

Rather than the Camelot link you provided, I think you meant Excalibur? https://github.com/camelot-dev/excalibur

mjt58 7y ago

Oh yes, thanks :-)

RandomBookmarks 7y ago

How about https://ocr.space/tablerecognition

It returns table data line by line.

BasHamer 7y ago

Handled the non-printable whitespace but butchered the multi-line table headers, so rebuilding the headers is rough: the output is line by line, you need to know which words go together, and you have lost the structure.

cdolan 7y ago

Can you send me a copy of what you are trying to extract? We use proprietary stuff (we're in the business of extracting data and performing analysis on invoices for waste, recycling, cellular, etc... stuff that gets "lost" in the AP department).

Happy to see if our tools can help. I've tried everything on the market - DocParser, MediusFlow, KOFAX, Ephesoft, etc... none work well enough in my opinion.

ocrcustomserver 7y ago

Some videos that were just released:

Announcing Amazon Textract, https://www.youtube.com/watch?v=PHX7q4pMGbo

Introducing Amazon Textract: Now in Preview, https://www.youtube.com/watch?v=hagvdqofRU4

Introducing Amazon Textract: Now in Preview (AIM363), https://www.youtube.com/watch?v=FnZFK_2oqKk

gingerlime 7y ago

I have a personal flow using tesseract to scan docs into searchable PDFs, but it’s not that accurate. One of the main problems is that some (now most?) of the documents are in German since I live in Germany, but some are in English. There’s a way to choose the language but nothing to auto detect as far as I’m aware. I was hoping for some cloud AI service with superior OCR and simple integration or CLI (push a PDF and download one with OCR embedded). Google seems to be too complicated unfortunately... Any tips??

philsnow 7y ago

If you're running tesseract locally (i.e. not paying per invocation), run it once with EN and count occurrences of the/this/a/any etc, run it again with DE and count occurrences of der/die/das/um/ab/wie, and go from there?

Edit: Hell, even average word length is probably going to be a good indicator since German is so agglutinative. Collect some factors like this and I think you'll be able to build a pretty good classifier.
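A minimal sketch of the stopword-counting heuristic, assuming the two tesseract runs have already produced their text. The stopword sets are the handful of words named above; `stopword_count` and `pick_language` are hypothetical helper names, and real word lists would need to be longer to be reliable.

```python
import re
from typing import Dict, Set

# Tiny stopword lists taken from the suggestion above, keyed by tesseract language code.
STOPWORDS: Dict[str, Set[str]] = {
    "eng": {"the", "this", "a", "any"},
    "deu": {"der", "die", "das", "um", "ab", "wie"},
}


def stopword_count(text: str, lang: str) -> int:
    """Count tokens in the text that are stopwords of the given language."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters, Unicode-aware
    return sum(1 for token in tokens if token in STOPWORDS[lang])


def pick_language(ocr_outputs: Dict[str, str]) -> str:
    """Given OCR text per language model, return the language whose stopwords appear most.

    Ties go to whichever language comes first in ocr_outputs.
    """
    scores = {lang: stopword_count(text, lang) for lang, text in ocr_outputs.items()}
    return max(scores, key=scores.get)
```

Each output is scored only against its own language's stopwords, so the run made with the wrong language model tends to lose because its misrecognized words stop matching.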

rpedela 7y ago

Good idea. I would take it one step further. I would use an ML-based language detection tool, which should return a list of languages and a confidence score. Whichever language has the highest confidence score wins. The FastText project has a good pre-trained model available.

ocrcustomserver 7y ago

In tesseract, if you want to recognize both English and German you can use option -l deu+eng.

If you want to perform language detection you can do the following:

a. Invoke tesseract with "-l eng".

b. Pass the output text to langdetect [1]. It is a port of Google's language detection library to Python which will give you the probabilities of the languages for a given text.

c. Invoke tesseract with "-l langdetect_output"

Note that langdetect generates 2 character codes (ISO 639-1) whereas tesseract expects 3 character codes (ISO 639-2).

[1]: https://github.com/Mimino666/langdetect
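Steps a to c can be sketched as follows, assuming the `tesseract` binary is on the PATH and the `langdetect` package is installed. The code mapping is a partial, hand-written table and the function names are hypothetical; extend the table for whichever language packs you have.

```python
import subprocess
from typing import List

# Partial mapping from langdetect's ISO 639-1 codes to tesseract's 3-letter codes.
ISO1_TO_TESSERACT = {"en": "eng", "de": "deu", "fr": "fra", "es": "spa", "it": "ita"}


def tesseract_command(image_path: str, lang: str) -> List[str]:
    """Build a tesseract call that writes the recognized text to stdout."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def run_tesseract(image_path: str, lang: str) -> str:
    result = subprocess.run(
        tesseract_command(image_path, lang),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def ocr_with_detection(image_path: str, fallback: str = "eng") -> str:
    from langdetect import detect  # third-party: pip install langdetect

    first_pass = run_tesseract(image_path, fallback)                # step a
    detected = ISO1_TO_TESSERACT.get(detect(first_pass), fallback)  # step b
    if detected == fallback:
        return first_pass  # the first pass already used the right language
    return run_tesseract(image_path, detected)                      # step c
```

Note that `detect` raises an exception when the first pass yields no usable text, so a blank or badly scanned page needs its own handling.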

gingerlime 7y ago

Thanks. Wasn't aware it is possible to combine languages!

lokl 7y ago

If you don't absolutely need the integration/CLI, I recommend FineReader (Standard edition). You can specify that the document can contain text from a set of languages (e.g., German and English) and it will auto-detect appropriately. If you need automation (of import, processing, export), this can be done with FineReader Server (formerly known as Recognition Server), but the pricing is quite high for personal use. FineReader Corporate edition has limited automation -- if sufficient for your needs, the pricing might be much more reasonable. I have used the Standard edition and Recognition Server extensively, but have not used the Corporate edition. If you really want a cloud service, you can make your own with their Cloud SDK or use their FineReader Online, but I also have no experience with these.

As for accuracy, the details of your documents and scanning can matter, but, for normal personal usage, it should be very high.

gingerlime 7y ago

I've heard good things about FineReader, but I'm using Linux and it doesn't look like it's available there, nor can I use it to automate the scanning workflow (and I can't really justify spending that much on it).

ocrcustomserver 7y ago

There's ABBYY FineReader Engine CLI for Linux: https://www.ocr4linux.com/

RandomBookmarks 7y ago

You can try the free ocr api at https://ocr.space/ocrapi

gingerlime 7y ago

Looks interesting, but the free limitations are too restrictive unfortunately (3 page limit, 1 MB), and I cannot justify paying this much for the paid option when I probably scan fewer than 10 documents per month (which can be longer than 3 pages and larger than 1 MB).

ocrcustomserver 7y ago

This is very interesting. I'm curious to see how they will execute on several points:

1. How it will deal with multiple templates that the system hasn't seen before. Especially when there is significant difference between the templates.

2. UI/UX. E.g. how it will trace the extracted data to the original source and how it will show the confidence scores of each entity.

3. Verification process: what will the workflow look like when the confidence score is low and the document has to be checked by human operators?

citilife 7y ago

This looks a lot like what I've seen from companies such as InstaBase[1]. Given how hard it is to do well (largely due to poor initial images), I'm curious how Amazon's product offering will work.

A team I'm working with had a lot of success doing this; curious what method(s) they are using.

[1] https://en.wikipedia.org/wiki/Instabase

sbarre 7y ago

So this is Apache Tika as a Service?

https://tika.apache.org/

bpchaps 7y ago

A little late to the comment party, but I was wondering the same. I'm working on a web scrape workflow that's currently using Tika. I'm very interested to see how well this does in comparison.

sbarre 7y ago

I was quite surprised by how powerful and flexible Tika can be, and my use-case was pretty basic: crawling a network drive to index project artifacts like Office docs and media files and pushing them into an Elasticsearch index.

Have you found any major problems or shortcomings in your usage?

bpchaps 7y ago

One small problem is that it sometimes doesn't make newline separations properly. In my use case, I was extracting email addresses from web scrapes - some email addresses would come out as "blah@blah.comRandomWord"

amelius 7y ago

Can't use this because my clients/contract don't allow sending of documents to third parties.

sbarre 7y ago

Have you looked at Apache Tika?

https://tika.apache.org/

ironfootnz 7y ago

"Amazon Textract may store and use document and image inputs processed by the service solely to provide and maintain the service and to improve and develop the quality of Amazon Textract..."

I still prefer the Dropbox solution for that, but I'm waiting for them to turn it into an API.

jgalt212 7y ago

I have been following this service from afar, as the founder is quite skilled. Seems a bit pricey, but does something similar.

https://www.pdfdata.io/

blacksmith_tb 7y ago

I wonder if they have any detection of captchas, or if they'd let people just submit screengrabs containing them as 'documents' to be processed...

foxhound6 7y ago

Any idea if this can support handwriting even with a reduced confidence? Support for non-English languages?

brad0 7y ago

According to the Textract preview sign-up form, there are the following features:

- Printed text detection

- Handwritten text detection

- Key-Value detection

- Table detection

- Checkbox detection

- Other optical marks (e.g. barcode, QR code)

There's a decent possibility it has handwriting recognition. Not sure about the non-English languages though.

ocrcustomserver 7y ago

The docs page [1] (subject to change) mentions:

Do you support handwriting? – We do not support handwriting extraction.

[1]: https://docs.aws.amazon.com/textract/latest/dg/how-it-works-...

hbcondo714 7y ago

Arg, you have to type in all your information even if you are logged into the AWS console

dvtrn 7y ago

The FOIA geek in me is....well...geeking out over this. Slightly.

jijji 7y ago

This is genius...

1. make "strings" api 2. hook it to a web server 3. profit!