Show HN: Copyfish – Extract text from images, videos or PDF

johnvonneumann 8y ago

Can you explain why this is a deal breaker? Is it the use of OCR or the choice of provider? Assume I know nothing here, because I do.

TekMol 8y ago

What is the business model of free extensions like these? Is it all spyware/malware?

It looks like many free extensions either have malware in them from the start or get sold to malware companies later on, who then deploy the malware via updates:

http://lifehacker.com/many-browser-extensions-have-become-ad...

irrational 8y ago

Why does everything have to have a business model? Sometimes people like to create things for the sheer enjoyment of creating things, or they have an itch to scratch and think others might have the same need. Not everything is nefarious.

Volt 8y ago

Exactly for the reason they said. The concern isn't that extensions are necessarily nefarious, but that people often want something in return for their work, which might be money by whatever means.

eriknstr 8y ago

I recently found an interesting issue [1] filed in public on the GitHub repository of a fork of a popular extension.

Here are archived versions of the URLs mentioned in the issue:

Without "partner extension": http://archive.is/anu2E

With "partner extension": http://archive.is/bp93l

As is evident, what their "partner extension" does is in fact maliciously hijacking and replacing ad-space on websites visited by the user.

Strangely, searching for their name among the issues on GitHub does not show other such results. I guess they usually make contact directly and that the person at that company who filed this issue did not realize it would be visible to the public.

Here is the full text of the issue:

> Adnow is interested in byuiing your extension traffic #1

> Dear Kyong Tsu,

> My name is Anastasia, I am a manager from international advertising network Adnow.

> Extension traffic is a hot trend nowadays, and we are interested in buying traffic from Facebook Video Downloader extension and the others. We are ready to share an idea of monetization extensions with you and give you a method.

> We offer:

> * high payouts

> * 100% fill rate (we buy traffic from all over the world)

> * Integration through JS Tag / XML / JSON feed

> * Integration method

> That's how the page looks without partner extension: https://gyazo.com/5d635a9dc7bdc142e18e6775a1d1340d

> And that's how it looks for user with our plugin/code in extension: https://gyazo.com/a2b48b16d304a3ba37cdf6967fa4d9d8

> Please contact me in case you are interested in monetization your extensions.

> I am looking forward to your answer.

> Thank you in advance.

> Best regards,

> --

> Anastasia Nova

> Sales manager | Adnow LLP

> e.: tasya@sales.adnow.com

> Skype: tasya@adnow.com

[1]: https://github.com/KyongTsu/TabMemorySaver/issues/1

Archived snapshot of above issue: http://archive.is/Z5mJl

ImJasonH 8y ago

Similar Chrome extension I wrote using Google Cloud APIs: https://chrome.google.com/webstore/detail/cloud-vision/nblmo...

Zyst 8y ago

It is actually on Chrome as well: https://chrome.google.com/webstore/detail/copyfish-%F0%9F%90...

I started using it a bit ago, the area selection seems a bit wonky, but otherwise works.

mdani 8y ago

No need to use third-party extensions if you have a Google Cloud account. You can download https://github.com/kaneshin/pigeon and just run it from the command line. That protects privacy and is more secure than relying on third parties.

pbhjpbhj 8y ago

Isn't Google a third-party?

angry_octet 8y ago

We need to evolve a grammar for describing privacy implications, because proper classification of this software would allow it to be marked as malware/spyware.

It is beyond irresponsible for Mozilla to do nothing to prevent this malware from being recommended on their platform.

ghostly_s 8y ago

How about you explain what on earth you're talking about if you're going to take the time to disparage this product here?

angry_octet 8y ago

Read the page. Yes, it isn't obvious is it?

Look down the bottom.

https://ocr.space/

It uploads everything to a commercial OCR service. Which provides these CPU cycles 'for free'.

Who owns this data? Do you have a privacy agreement with ocr.space? Can you trust them as far as you could spit?

It doesn't matter that this is documented though. Unless it had a popup banner EVERY TIME YOU USED IT saying "Your data will be sent to a cloud service for OCR, which may keep/index/sell your data without restriction."

Nadya 8y ago

Maybe I'm missing something - but it only OCR's things you choose to capture and isn't constantly trying to OCR every single thing you see.

Are you misunderstanding the extension or am I missing something bigger?

E: A total guess: "the server will see the image you are trying to OCR"? That's about as much privacy as I could see being intruded upon.

angry_octet 8y ago

It could easily build a profile of everything you get scanned/translated. I don't know if it uses https, so maybe it encrypts, maybe everyone listening can see what you get scanned.

It is good that it isn't scanning everything, i.e. complete exfiltration, but that is a low bar. It leaks every time you use it.

mholt 8y ago

Even the name is too reminiscent of Superfish.

xophishox 8y ago

Give me an API endpoint to send an image to, and a text response. I'll hand you cash.

There are a ton of these now. Google provides OCR as part of their machine vision API. AWS has similar with Rekognition. As others have mentioned, there are dozens of others on less well known platforms.
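For readers wondering what "an endpoint that takes an image and returns text" looks like in practice, here is a minimal Python sketch of the round trip. It is not taken from Copyfish's source. The endpoint URL, the `apikey` header, the `url` and `language` form fields, and the `ParsedResults` / `ParsedText` / `IsErroredOnProcessing` response keys are assumptions modeled on a hosted OCR API and should be checked against the documentation at https://ocr.space/ocrapi before use.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint; verify against the provider's documentation before use.
OCR_ENDPOINT = "https://api.ocr.space/parse/image"


def build_ocr_request(image_url: str, api_key: str, language: str = "eng") -> urllib.request.Request:
    """Build (but do not send) a form-encoded POST asking the service to OCR image_url."""
    form = urllib.parse.urlencode({"url": image_url, "language": language}).encode("ascii")
    return urllib.request.Request(
        OCR_ENDPOINT,
        data=form,
        headers={"apikey": api_key},  # assumed header name
        method="POST",
    )


def extract_text(response_body: str) -> str:
    """Join the recognized text of every parsed result in a JSON response body."""
    payload = json.loads(response_body)
    if payload.get("IsErroredOnProcessing"):
        raise RuntimeError(f"OCR failed: {payload.get('ErrorMessage')}")
    results = payload.get("ParsedResults") or []
    return "\n".join(result.get("ParsedText", "") for result in results)


def ocr_image_url(image_url: str, api_key: str) -> str:
    """Send the request over the network and return the recognized text."""
    request = build_ocr_request(image_url, api_key)
    with urllib.request.urlopen(request, timeout=30) as response:
        return extract_text(response.read().decode("utf-8"))
```

Note that the image (or its URL) leaves the machine in `ocr_image_url`, which is exactly the privacy trade-off debated elsewhere in this thread.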

a9t9 (OP) 8y ago

Actually, based on my tests, there are only a few good services:

Abbyy (best recognition rate but by far most expensive), Google Cloud Vision (second best recognition rate), Microsoft OCR and... our OCR.space service with a very generous free tier and a competitively priced PRO tier.

gressquel 8y ago

Rekognition from Amazon doesn't have OCR as far as I remember

ythn 8y ago

Under the hood, the extension is using:

https://ocr.space/ocrapi

Nadya 8y ago

From the creators of Copyfish: https://ocr.space/

They should have an API to point to. It is fairly accurate. I use them occasionally via ShareX, which uses their API for OCR.

E: https://ocr.space/ocrapi

Like a9t9 said, ABBYY, Microsoft and Google offer this.

If your images however differ from the typical text document, recognition from those services will fail. OCR is highly dependent on the particular application and the kind of images that you're dealing with. Preprocessing and segmentation are very important.

If you need a custom solution, my email is in my profile.
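To make the point about preprocessing concrete, here is a small Python sketch of one common step: binarizing a grayscale image with Otsu's threshold before handing it to an OCR engine. This is an illustration of the general technique only, not what any of the services named above do internally. The `Image` type (a list of rows of 0-255 integers) and the function names are hypothetical choices for the example.

```python
from typing import List

Image = List[List[int]]  # rows of 8-bit grayscale values (0-255); illustrative type


def otsu_threshold(image: Image) -> int:
    """Return the level t that maximizes between-class variance (Otsu's method).

    Pixels with value <= t form one class and pixels > t the other.
    """
    histogram = [0] * 256
    for row in image:
        for value in row:
            histogram[value] += 1
    total = sum(histogram)
    total_sum = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_low, sum_low = 0, 0
    for level in range(256):
        weight_low += histogram[level]
        sum_low += level * histogram[level]
        if weight_low == 0:
            continue  # no pixels in the low class yet
        weight_high = total - weight_low
        if weight_high == 0:
            break  # every pixel is already in the low class
        mean_low = sum_low / weight_low
        mean_high = (total_sum - sum_low) / weight_high
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_threshold, best_variance = level, variance
    return best_threshold


def binarize(image: Image, threshold: int) -> Image:
    """Map every pixel to 0 or 255 depending on which side of the threshold it falls."""
    return [[255 if value > threshold else 0 for value in row] for row in image]
```

A global threshold like this tends to struggle on exactly the case discussed further down the thread (subtitles over a busy video frame), which is one reason segmentation and local thresholding matter for non-document images.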

jklinger410 8y ago

Same

CM30 8y ago

Hmm, I've seen a few apps and extensions like this before. I think Project Naptha was a heavily advertised one that did the same thing a few years back.

But how's the accuracy here? Cause when I used previous plugins for this functionality, I often found they'd return gibberish if the text was even slightly ambiguous looking in image form.

How does it compare to the other plugins doing the same thing here?

maggit 8y ago

The text on the linked page actually compares this to Project Naptha:

> For extension gurus: You might have heard of Project Naptha, a great addon that applies state-of-the-art computer vision algorithms on every image you see while browsing the web. Copyfish solves the same problem, but it takes a different user interface approach. It does not try to alter the website. Instead, it lets you mark the text in the image that you want to extract. As a result Copyfish works with every website, even videos and PDF documents.

chrischen 8y ago

Need this for mobile. Somehow it became a defining characteristic of an "App" to disable text selection.

yjftsjthsd-h 8y ago

Should be fixable via accessibility APIs?

Zyst 8y ago

This seems cool, I just tried it in Chrome and it has support for pop-up dictionaries, so I'll be using this for some beginner reading assistance.

Thanks for making this!

dontchooseanick 8y ago

Copyphish ?

Semi related: I would love to see someone do a comparison of the various OCR APIs on speed, accuracy, and cost.

https://ocr.space/blog/2015/02/ocr-online-converter-review.h...
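For anyone attempting such a comparison, accuracy is commonly reported as character error rate (CER): the edit distance between the OCR output and a hand-checked reference, divided by the reference length. The Python sketch below shows that metric under the assumption that you already have reference transcriptions; the function names are illustrative and it is not the method used in the linked review.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by reference length; 0.0 means a perfect transcription."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

Running each service over the same image set and averaging `character_error_rate` per service gives a number that can sit next to latency and price in a comparison table.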

asenna 8y ago

Same here. Was just researching this. Not sure if I should go with an open source OCR engine or one of these APIs

https://www.youtube.com/watch?v=YNGkGWj8lA4

Doing it yourself with Tesseract is pretty hard (time-consuming, error-prone). It's something I would only consider doing once my project was built, viable, and the costs of an API were an issue.
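For comparison, the do-it-yourself route can start as small as shelling out to the `tesseract` command-line tool. The Python sketch below assumes Tesseract 4 or later is installed and on `PATH` (older 3.x releases spell the flag `-psm`); the helper names are hypothetical. The hard, time-consuming part mentioned above is everything around this call: preprocessing, choosing a page segmentation mode, and cleaning up the output.

```python
import shutil
import subprocess
from typing import List


def tesseract_command(image_path: str, language: str = "eng", page_segmentation_mode: int = 6) -> List[str]:
    """Build the argument list for a local Tesseract run that prints text to stdout.

    Mode 6 tells Tesseract to assume a single uniform block of text.
    """
    return [
        "tesseract", image_path, "stdout",
        "-l", language,
        "--psm", str(page_segmentation_mode),
    ]


def run_tesseract(image_path: str, language: str = "eng") -> str:
    """Run Tesseract locally, so the image never leaves the machine."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract binary not found on PATH")
    completed = subprocess.run(
        tesseract_command(image_path, language),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout
```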

foota 8y ago

On my phone so I don't have a chance to give it a shot, but what I find has been most irritating in the past about OCR is the accuracy. If your extension has better accuracy you might call that out.

imron 8y ago

I saw the heading on HN and thought "I wonder if it works with Chinese".

I saw the first example screenshot on the page was a Chinese movie and thought "Great, it does"

I saw the enlarged version of the screenshot and the Chinese subtitles contain multiple mistakes: "Nice try, but maybe not so great after all for the use case I'd personally be interested in".

a9t9 (OP) 8y ago

Well, at least this confirms that the screenshots are not manipulated ;)

The tricky part for the OCR in this example is the diverse background, as the Chinese characters are directly inside the movie.

Your comment is interesting, as the original motivation for creating the Copyfish extension was to help me watch Chinese movies. So I can confirm that for this purpose, it works fine. Of course, once in a while it gets some characters wrong but it works ok with many movies.

Here is a screencast of Copyfish doing subtitle OCR:

bondolo 8y ago

This seems like it could be very useful for accessibility applications.

Nimsical 8y ago

This is cool!

Wondering what you're using for OCR?

jffry 8y ago

  For developers: Copyfish is published under the
  GPL open-source license. As OCR software, it uses
  the free OCR API from https://ocr.space/

whitten 8y ago

So, to answer the question mentioned above, the document storing the text is sent to an off-site server (https://ocr.space/) which does the OCR and returns the results.

ransom1538 8y ago

Is this using TesseractOCR?

sjs382 8y ago

Could you add an email address to your HN profile so I can contact you?

a9t9 (OP) 8y ago

Done. In addition, the email listed on https://github.com/A9T9 also reaches me.

vdRrsithZm 8y ago

> Done. In addition, the email listed on https://github.com/A9T9 also reaches me.

Neat! Brother. +1 =100 Ace

vdRrsithZm 8y ago

Neat brother. +1 = A109

zhangkehu 8y ago

Wonderful tool, we can easily get text from some PDF files.


89 comments

gressquel 8y ago

Is the OCR extraction performed in the client? If it's transferred to a server, then people should be aware of this so that sensitive data from documents/PDFs is not submitted.

qz_ 8y ago

Yeah apparently it uses https://ocr.space/, deal-breaker for me.

a9t9 (OP) 8y ago

I understand that hosted OCR, just like SaaS in general, is not suitable for every use case.

On the other hand, the OCR.space OCR API has a very strict privacy policy:

https://ocr.space/privacypolicy - All uploaded images and the extracted text are deleted immediately after processing.


superasn 8y ago

Why is it a deal breaker?
