Show HN: Extract Table from Image

(extract-table.com)

217 points | v3gas | 4y ago | 41 comments

w-m 4y ago

I answer questions about Pandas (the Python data analysis framework) on StackOverflow from time to time. It's an exercise in patience, because many people post screenshots of their data instead of a reproducible code example. You have to point about every other newcomer to the documentation on how to write a proper question that one can actually answer.

I'd imagine other areas around StackOverflow (SQL, R?) are fighting similar issues. I've just tried it with a question (sure enough the second newest Pandas tagged question had a table as an image), and your tool produced a nice .csv.

It would be a godsend to have a button on StackOverflow that would replace a user-uploaded image of a table with some Pandas code that constructs the same DataFrame. Currently I would have to download the image, upload it to extract-table.com, download the .csv, load it into Python, run some code to create the code-based DataFrame.

I'd consider sending people on StackOverflow to your tool if you cut down some of the steps: (1) allowing the user to paste in the URL of an image, and (2) producing Pandas code output that can be directly copy/pasted from the site (without having to download a csv).

For illustration: here's what the Pandas code would look like for the first example of extract-table.com:

  df = pd.DataFrame( {'Name': {0: 'David', 1: 'Jessica', 2: 'Warren'}, 'Gender': {0: 'Male', 1: 'Female', 2: 'Male'}, 'Age': {0: 23, 1: 47, 2: 12}} )
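A minimal sketch of step (2), assuming the tool's CSV output is already available as text. The function name `csv_to_dataframe_code` and its integer-only type guessing are made up for illustration and are not part of extract-table.com; only the standard library is used, so the result is a code string rather than a live DataFrame. With pandas installed, `pd.read_csv(...).to_dict()` should give the same nested dict.

```python
import csv
import io


def csv_to_dataframe_code(csv_text: str, var_name: str = "df") -> str:
    """Turn CSV text into a copy-pasteable pandas DataFrame constructor.

    Hypothetical helper: assumes a header row and rows of equal length.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]

    def parse(cell: str):
        # Keep whole numbers as integers; everything else stays a string.
        try:
            return int(cell)
        except ValueError:
            return cell

    # Same {column: {row_index: value}} layout as the snippet above.
    columns = {
        name: {i: parse(row[j]) for i, row in enumerate(body)}
        for j, name in enumerate(header)
    }
    return f"{var_name} = pd.DataFrame({columns!r})"


example_csv = "Name,Gender,Age\nDavid,Male,23\nJessica,Female,47\nWarren,Male,12\n"
print(csv_to_dataframe_code(example_csv))
```

The printed line can be pasted straight into a question or answer, which removes the download/load/convert round trip described above.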

pietrovismara 4y ago

Off topic funny story: My highest voted answer on SO is a very basic one about Pandas, from 7 years ago. It's funny that I only used Pandas for a few weeks, years ago (I would need to relearn it from scratch now), but 90% of my SO score comes from that answer and I still get more points almost daily. In fact I'm in the top 6% of SO mostly thanks to that answer.

belval 4y ago

I'm in the same boat, 95% of my SO points come from an answer that was basically a copy pasted script to fix an obscure VMWare error with Ubuntu. Turns out a lot of people had the same issue that day.

w-m 4y ago

Since all votes have the same weight I guess it makes sense that the answers to the most basic questions or highly common problems will get the most points. Maybe SO should have a button to donate points to an answer that really saved your bacon, a super-upvote if you will. (I know you can attach bounties to questions, but that's not really feasible when you come across something that has already been answered).

But yeah, crowd behavior is fun. I have the feeling I can time when some computer vision courses (or the semester) start, as suddenly there are many upvotes on my basic answer explaining BGR/RGB color space confusion with OpenCV, the computer vision library :)

unwind 4y ago

People post images of C code too. Best are the ones that post a link to the image on some external image host. Gaaah.

MattGaiser 4y ago

Could do it with a Chrome extension. Add a button to the right click context menu and get the tabular data in the popup.

v3gas (OP) 4y ago

Thanks for the feedback! That is a good suggestion, I'll definitely add support for using the image URL.

v3gas (OP) 4y ago

I've added support for urls now! Please try it.

greaterweb 4y ago

Nice work putting together this tool. Have you seen either Spark OCR[1] from John Snow Labs or the Adobe PDF Extract API[2]? They both do a pretty good job of data extraction from tables as well.

[1] https://www.johnsnowlabs.com/spark-ocr/

[2] https://www.adobe.io/apis/documentcloud/dcsdk/pdf-extract.ht...

v3gas (OP) 4y ago

Thanks! No, I hadn't heard of either - thank you!

MattGaiser 4y ago

Pair this with a snipping tool and all sorts of people in banking would use it for a few hours a day, especially if it could paste to Excel or at least fill the clipboard in a way pastable to Excel.

I used to work for a bank on their innovation team and pitched basically this, but as an intern I had neither the skill nor time to do it. But it was certainly something a bunch of people internally wanted.

v3gas (OP) 4y ago

Interesting, thanks!

Do you happen to know how to paste regular UTF-8 text into Excel/Google sheets as multiple cells? If I copy two cells in Sheets, I get a tab character (\t) between the cells. But if I try to paste "hello \t world" into Sheets then it's just dumped into one cell.

v3gas (OP) 4y ago

Nevermind, the tab character is indeed what's needed to split it into multiple cells.
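A small sketch of what that could look like when producing clipboard text: cells joined by a real tab character (U+0009) and rows by newlines. The helper name `rows_to_tsv` and the sample rows are illustrative only, and actually placing the text on the clipboard is left out because it depends on the platform.

```python
import csv
import io


def rows_to_tsv(rows):
    """Serialize rows as tab-separated text, one line per table row."""
    buffer = io.StringIO()
    # csv.writer quotes any cell that itself contains a tab or a newline,
    # which spreadsheet applications generally understand on paste.
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


table = [["Name", "Gender", "Age"], ["David", "Male", 23]]
print(rows_to_tsv(table))
```

Pasting the resulting string into a spreadsheet should fill one cell per tab-separated value, as long as the separator is an actual tab and not the two characters `\` and `t`.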

EveYoung 4y ago

I can only imagine what a pain it would be to get InfoSec approval for such a tool, unless it's doing everything on-device.

MattGaiser 4y ago

Wouldn’t need to be on device necessarily.

At least my bank was comfortable with cloud everything and people using APIs from approved partners. If you can write the report in Google Docs, as long as they were the ones plugging in their API key for the OCR, I imagine it would be fine.

saradhi 4y ago

You should consider extracttable.com

P.S.: I run the linked resource.

nanis 4y ago

With this image[1] from this question on SO[2], the output[3] is missing the last row. FWIW, I've had the occasional miraculous-looking result from AWS Textract, but you do need to keep an eye on what's happening.

Update: I just checked a bit more carefully, and this example[4] is also missing the last row.

Also, Danish ø seems problematic on your web page whereas the CSV has the right UTF-8 encoded bytes.

[1]: https://i.stack.imgur.com/y7Zrt.png

[2]: https://stackoverflow.com/q/69363708/100754

[3]: https://results.extract-table.com/8d4818867ad604792819e98808...

[4]: https://results.extract-table.com/254d95722a2c2b1df72fc26b59...

v3gas (OP) 4y ago

That's interesting. Thanks for reporting!

eihli 4y ago

Nice. I worked on something similar but far less robust: https://github.com/eihli/image-table-ocr. It fails to find the tables on the example images at extract-table.com, but the code is heavily commented at https://eihli.github.io/image-table-ocr/pdf_table_extraction... so there's high visibility into what's going on and what needs to change to get it to work with images of different sizes/fonts.

BrandiATMuhkuh 4y ago

This is really awesome. I have tried to solve that many times. I got close, with OpenCV and Azure ML. I have even tried AWS Textract (~2 years ago). But this is the best implementation I have seen so far. Congratulations.

I'm not sure what application you are thinking of. But the reason I'm following this problem is UX. Years ago, I worked on a project where anyone could add product prices into a DB. They did that by typing their receipt (line items) into the DB. The major issue was that the UX was horrible.

With an API like yours, this is super simple. One photo. That's all.

Maybe I'll revisit it as a side project.

v3gas (OP) 4y ago

Thank you! I have also been kind of obsessed with this problem. I have tried to solve it myself, going from an image to bounding boxes and trying to separate the boxes into columns. But that problem is just fraught with edge cases, so I decided to just use an existing tool.
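To make the bounding-box idea concrete, here is a deliberately naive sketch that groups word boxes into columns by horizontal overlap. The `Box` type, `group_into_columns`, and the coordinates are all invented for illustration; this is not the author's earlier attempt, and it fails on exactly the kind of edge cases mentioned above (closely spaced columns, cells spanning several columns, skewed photos).

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical OCR word box: text plus pixel coordinates."""
    text: str
    left: float
    right: float
    top: float


def group_into_columns(boxes):
    """Greedily merge boxes whose horizontal extents overlap into columns."""
    columns = []  # each entry: [left edge, right edge, boxes in the column]
    for box in sorted(boxes, key=lambda b: b.left):
        if columns and box.left <= columns[-1][1]:
            # The box starts inside the current column's extent: extend it.
            columns[-1][1] = max(columns[-1][1], box.right)
            columns[-1][2].append(box)
        else:
            # A horizontal gap: start a new column.
            columns.append([box.left, box.right, [box]])
    # Within each column, read the words top to bottom.
    return [
        [b.text for b in sorted(col[2], key=lambda b: b.top)]
        for col in columns
    ]


boxes = [
    Box("Name", 10, 50, 0),
    Box("Age", 100, 130, 0),
    Box("David", 10, 55, 20),
    Box("23", 105, 120, 20),
]
print(group_into_columns(boxes))
```

One wide box that straddles two columns is enough to merge them into one, which hints at why handing the problem to an existing tool is attractive.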

BillSaysThis 4y ago

Really nice but... wondering how long this will last as a free tool given AWS fees.

whirlwin 4y ago

Nice. Fun fact: The third example table is an ordered list of Norway's richest people (according to net worth, I think)

howmayiannoyyou 4y ago

Nice job. Actually though, what the world really needs is ML that divines the trend and perhaps indices/values from images of charts.

plaidfuji 4y ago

This has been my pet side project for many years. What use case would you apply it to?

howmayiannoyyou 4y ago

Scraping financial content

pveierland 4y ago

Neat tool! There appear to be two minor issues in the last example. There is an encoding issue with "ø" characters ("RÃ¸kke"), and a column split appears to be missing between the closely spaced numbers ("33 300 22 700" vs "33 300,22 700"). A possible, though possibly non-trivial, improvement: harmonize formatting within the same column to avoid mixed occurrences of "7800" / "7 800".
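A short sketch of both points, under two assumptions that the thread does not confirm: that "RÃ¸kke" comes from UTF-8 bytes being decoded as Latin-1 somewhere on the way to the web page, and that a space inside a number is a thousands separator. All function names are made up for illustration.

```python
import re


def simulate_mojibake(text: str) -> str:
    """Reproduce the symptom: UTF-8 bytes wrongly read as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair_mojibake(text: str) -> str:
    """Undo the wrong decoding; leave the text alone if it was not garbled."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def normalize_thousands(cell: str) -> str:
    """Rewrite '7 800' as '7800', only for well-formed digit groups."""
    if re.fullmatch(r"\d{1,3}(?: \d{3})+", cell):
        return cell.replace(" ", "")
    return cell


print(simulate_mojibake("Røkke"))
print(repair_mojibake("RÃ¸kke"))
print([normalize_thousands(c) for c in ["7800", "7 800", "33 300 22 700"]])
```

The strict digit-group pattern is a design choice: a cell such as "33 300 22 700", where two columns were merged, does not match and is left untouched instead of being silently fused into one number.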

mzs 4y ago

https://github.com/vegarsti/extract-table

jnsie 4y ago

Really cool. I'm interested to hear your plans for this. Are you planning to offer it as a service/open source/etc.?

visarga 4y ago

Does it also do table detection in a larger image and header/body classification?

v3gas (OP) 4y ago

This currently returns an error if it doesn't find exactly one table in the image, so it might be able to work with larger images, but probably not if there are multiple distinct blocks of text.

ducktective 4y ago

Awesome project!

Can AWS Textract be used directly with curl to return text strings of an uploaded image?

v3gas (OP) 4y ago

Thanks! No, not that I know of, looks like for the AWS cli it needs to be in an S3 bucket, based on looking at this document: https://docs.aws.amazon.com/cli/latest/reference/textract/an...

ducktective 4y ago

hmm...weird. They could have provided a rate-limited API endpoint as a service...

z3t4 4y ago

Should make it into a browser plugin, so annoying when web sites have tables in images.

basmango 4y ago

Does it use textract directly? Or are you doing some preprocessing?

v3gas (OP) 4y ago

Directly, no preprocessing! The postprocessing is concatenating all words that belong to the same cell.
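A rough sketch of that kind of post-processing, written against a hand-made list of blocks shaped like the Textract AnalyzeDocument response as I understand it (`BlockType`, `Relationships`, `RowIndex`, `ColumnIndex`, `Text`). The function `table_from_blocks` and the sample data are hypothetical and not taken from the extract-table source; the exactly-one-table check mirrors the behavior described elsewhere in the thread.

```python
def table_from_blocks(blocks):
    """Build a list of rows by joining the words that belong to each cell."""
    by_id = {block["Id"]: block for block in blocks}

    def child_ids(block):
        return [
            child_id
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for child_id in rel["Ids"]
        ]

    tables = [b for b in blocks if b["BlockType"] == "TABLE"]
    if len(tables) != 1:
        raise ValueError(f"expected exactly one table, found {len(tables)}")

    cells = {}
    for cell_id in child_ids(tables[0]):
        cell = by_id[cell_id]
        if cell["BlockType"] != "CELL":
            continue
        words = [
            by_id[w]["Text"]
            for w in child_ids(cell)
            if by_id[w]["BlockType"] == "WORD"
        ]
        # Row and column indices are 1-based in this layout.
        cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)

    if not cells:
        return []
    n_rows = max(r for r, _ in cells)
    n_cols = max(c for _, c in cells)
    return [
        [cells.get((r, c), "") for c in range(1, n_cols + 1)]
        for r in range(1, n_rows + 1)
    ]


def cell(cell_id, row, col, word_ids):
    """Helper for building made-up sample CELL blocks."""
    return {"Id": cell_id, "BlockType": "CELL", "RowIndex": row, "ColumnIndex": col,
            "Relationships": [{"Type": "CHILD", "Ids": word_ids}]}


def word(word_id, text):
    """Helper for building made-up sample WORD blocks."""
    return {"Id": word_id, "BlockType": "WORD", "Text": text}


sample_blocks = [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    cell("c1", 1, 1, ["w1"]),
    cell("c2", 1, 2, ["w2", "w3"]),
    cell("c3", 2, 1, ["w4"]),
    cell("c4", 2, 2, []),
    word("w1", "Name"), word("w2", "Net"), word("w3", "worth"), word("w4", "David"),
]
print(table_from_blocks(sample_blocks))
```

Writing the returned rows out with a CSV writer would then give the downloadable file; empty cells simply become empty strings.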

tuberelay 4y ago

UI Path does this in a nice way
