undefined | Better HN

0 pointsjoe_914y ago0 comments

Haha I love that people forget how google/bing are out there scraping everything and anyone who scrapes anything for any other reason is a "bad guy".

You can get around some web scraping blockers by just setting your user agent as Googlebot too which I find funny...

0 comments

Terry_Roll4y ago

That was a cheap way to read the FT.com!

Ian_Kerins4y ago

Haha, nice hack!

fleddr4y ago

No they don't, Google and Bing respect robots.txt. Most websites would open it up to them because they need the traffic, so it's a type of scraping that is beneficial.

Any other scraping, especially when ignoring robots.txt, is unsolicited. And if said website takes additional advanced anti-scraping measures, and you persist in bypassing that too, then to me you're clearly unethical, even if it's technically legal.

"It's public" is a legal defense, not an ethical one. It's public for readers, not for scrapers. It's public within the original context of the website, which may include monetization.

Photographing every page of a book and then reading it that way may be legally allowed, but it's still unethical.

I have somebody in our neighborhood that instead of paying for private trash, takes tiny bags of his private trash to the park and dumps it into the public trash cans.

Legal? Yes. Parasitic behavior? Also yes.

RobSm4y ago

No. robots.txt is not something that is defined and enforced by the law. Just because someone came up with some 'recommendation' like robots.txt does not mean this is the law

shukantpal4y ago

> Legal? Yes. Parasitic behavior? Also yes.

You failed to make a meaningful counterpoint; the legal/ethical distinction was made clear in the parent post.

LamaOfRuin4y ago

As a matter of fact, robots.txt is a well understood expression of intent which is legally meaningful in a lot of contexts.

autoexec4y ago

> Any other scraping, especially when ignoring robots.txt, is unsolicited. And if said website takes additional advanced anti-scraping measures, and you persist in bypassing that too, then to me you're clearly unethical, even if it's technically legal.

I suppose it just comes to down to your own morals, but I see nothing at all unethical about scraping a site for personal use provided that it's done gently enough to avoid DoS or disruption. The idea that saving webpages to read later is parasitic or unethical if a website uses robot.txt to discourage commercial scrapers and data-mining goes way too far.

fleddr4y ago

You're really taking the most innocent stance possible on scraping.

The article talks of large scale scraping, which includes all kinds of bypassing tools, proxies, hardware, or commercial services that abstract this away.

This industrial scale level of scraping is not the same thing as you saving a local copy of 3 web pages. The scale is a million times bigger and for sure it will not be for personal use.

danbmil994y ago

What you fail to acknowledge is that Bing Google etcetera have an effective monopoly on search. They can afford to respect robots.txt because everyone wants them to scrape their site.

The first mover advantage is so huge in this case that without allowing scraping, it's hard to understand how anyone could ever compete with these monoliths.

eli4y ago

robots.txt isn't what's keeping a newcomer from challenging Google.

LunaSea4y ago

> No they don't, Google and Bing respect robots.txt.

They don't.

Terry_Roll4y ago

Correct.

I myself wrote a webserver, albeit a specialised one and for curiosity, I also created a few pages which were in no way accessible unless you knew its web address, there were no links to these pages from the home page or anything, I didn't even tell anyone about these webpages and yet in my logs, I could see those webpages were being spidered!

My robots.txt was setup as an instruction to proceed no further, so I think there is other feedback mechanisms guiding the spiders but I havent worked out if its from the web browser, or actual infrastructure like switches or routers.

Admittedly this was before HTTPS became common.

senko4y ago

From https://developers.google.com/search/docs/advanced/robots/in...

> [...] Googlebot and other respectable web crawlers obey the instructions in a robots.txt file [...]

If you're saying this is a lie, please provide sources

1 more reply

j / k navigate · click thread line to collapse

0 comments

Terry_Roll4y ago

That was a cheap way to read the FT.com!

Ian_Kerins4y ago

Haha, nice hack!

fleddr4y ago

No they don't, Google and Bing respect robots.txt. Most websites would open it up to them because they need the traffic, so it's a type of scraping that is beneficial.

"It's public" is a legal defense, not an ethical one. It's public for readers, not for scrapers. It's public within the original context of the website, which may include monetization.

Photographing every page of a book and then reading it that way may be legally allowed, but it's still unethical.

I have somebody in our neighborhood that instead of paying for private trash, takes tiny bags of his private trash to the park and dumps it into the public trash cans.

Legal? Yes. Parasitic behavior? Also yes.

RobSm4y ago

No. robots.txt is not something that is defined and enforced by the law. Just because someone came up with some 'recommendation' like robots.txt does not mean this is the law

shukantpal4y ago

> Legal? Yes. Parasitic behavior? Also yes.

You failed to make a meaningful counterpoint; the legal/ethical distinction was made clear in the parent post.

LamaOfRuin4y ago

As a matter of fact, robots.txt is a well understood expression of intent which is legally meaningful in a lot of contexts.

autoexec4y ago

fleddr4y ago

You're really taking the most innocent stance possible on scraping.

The article talks of large scale scraping, which includes all kinds of bypassing tools, proxies, hardware, or commercial services that abstract this away.

This industrial scale level of scraping is not the same thing as you saving a local copy of 3 web pages. The scale is a million times bigger and for sure it will not be for personal use.

danbmil994y ago

What you fail to acknowledge is that Bing Google etcetera have an effective monopoly on search. They can afford to respect robots.txt because everyone wants them to scrape their site.

The first mover advantage is so huge in this case that without allowing scraping, it's hard to understand how anyone could ever compete with these monoliths.

eli4y ago

robots.txt isn't what's keeping a newcomer from challenging Google.

LunaSea4y ago

> No they don't, Google and Bing respect robots.txt.

They don't.

Terry_Roll4y ago

Correct.

Admittedly this was before HTTPS became common.

senko4y ago

From https://developers.google.com/search/docs/advanced/robots/in...

> [...] Googlebot and other respectable web crawlers obey the instructions in a robots.txt file [...]

If you're saying this is a lie, please provide sources

1 more reply

j / k navigate · click thread line to collapse