User-agent: ia_archiver
Disallow: /
Those two lines block the entire site from the Internet Archive's (archive.org) Wayback Machine, so the public will be unable to look at any previous versions of the site's content. It wipes out a public view of the past.
Yeah, I'm looking at you, Washington Post: http://www.washingtonpost.com/robots.txt
Banning access to history like that is shameful.
http://www.quora.com/robots.txt
Here is their explanation (in the robots.txt file):
"We opt out of the wayback machine because inclusion would allow people to discover the identity of authors who had written sensitive answers publicly and later had made them anonymous, and because it would prevent authors from being able to remove their content from the internet if they change their mind about publishing it. As far as we can tell, there is no way for sites to selectively programmatically remove content from the archive and so this is the only way for us to protect writers. If they open up an API where we can remove content from the archive when authors remove it from Quora, but leave the rest of the content archived, we would be happy to opt back in. See the page here: https://archive.org/about/exclude.php"
"Meanwhile, if you are looking for an older version of any content on Quora, we have full edit history tracked and accessible in product (with the exception of content that has been removed by the author). You can generally access this by clicking on timestamps, or by appending "/log" to the URL of any content page."
Authors don't get the right to go around removing their novels from public libraries just because they would rather the books be available only for pay in bookstores.
Why do you think it is legal to then go ahead and slurp it?
We do, however, have the right to criticize people who ban IA from their site.
If it's on their bandwidth and power, why not?
And people wonder why alternative search engines have such a hard time taking off.
Google is somewhere between 50-90% of most sites' search referrals (source: /dev/ass). Add in a handful of other search engines (Bing, DDG, Yahoo, Ask) and you've pretty much got all of it.
They're maybe 10-20% of your crawl traffic though. And possibly a lot less than that.
There are a TON of bots out there. If you're lucky, they just fill your logs and hammer your bandwidth.
If you're not so lucky, they break your site search, overload your servers, and if you're particularly unlucky, they wake you up with 2:30 am pages for two weeks straight.
At which point the simplest way to solve the technical problem, that is, you getting a full night's sleep, is to ban every last fucking bot but Google. Or maybe a handful of the majors.
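In robots.txt terms, that "everyone but the majors" policy looks something like the sketch below (bot names beyond Googlebot are just examples). Note this only stops crawlers that actually honor robots.txt; the abusive ones waking you at 2:30 am generally need rate limiting or user-agent blocking at the server instead:

```
# Let the majors in everywhere
User-agent: Googlebot
User-agent: bingbot
Disallow:

# Everyone else: go away
User-agent: *
Disallow: /
```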
Now, of course, you're a data-driven operation and you're relying on Google Analytics to tell you who's sending traffic your way. But if you block a search crawler, it's going to stop sending you traffic, so you won't know it's important.
It's a rather similar set of logic that drives people to set email bans on entire ccTLDs or ASN blocks for foreign countries. And if you're a smallish site, it's probably a decent heuristic. And no, it's not just fucking n00bs who do this. Lauren Weinstein, who pretty much personally birthed ARPANET at UCLA, was bitching on G+ just a week or so back that the new set of unlimited TLDs ICANN is selling were rapidly going into his mailserver blocklists. Because, of course, the early adopters of such TLDs tend to be spammers, or at least the early adopters he's likely to hear from.
https://plus.google.com/114753028665775786510/posts/SsgPNHLG...
"Some sites try to communicate with Google through comments in robots.txt"
In the examples given, none appear to be trying to "communicate with Google through comments" - how is including...
# What's all this then?
# \
#
# -----
# | . . |
# -----
# \--|-|--/
# | |
# |-------|
...a "mistake" to avoid? There's no harm in it at all.
I thought that was the whole point of robots.txt
Lots of target-detection crawlers will look at robots.txt as the first thing they do, to see if there are any fun pages you don't want the other crawlers to see.
That said, obscurity is not really security. Your admin pages should be behind a password, which, if coded properly, will exclude spiders, bots, and bad guys.
Alongside tagging links to such resources with nofollow.
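For instance (hypothetical markup, paths are made up): a rel="nofollow" on the link asks polite crawlers not to follow it, and a robots meta tag on the admin page itself asks them not to index it. Neither is a substitute for the password:

```html
<!-- On pages linking to the admin area: polite crawlers won't follow this -->
<a href="/admin" rel="nofollow">Admin</a>

<!-- In the <head> of the admin page itself: don't index, don't follow -->
<meta name="robots" content="noindex, nofollow">
```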
The robots exclusion protocol is a ridiculous anachronism. I don't use it and neither should you.
Spiders have to be robust against sites with unlimited numbers of internal links anyway, or else an attacker could trap a web spider with a malicious site, or a 13-year-old writing a buggy PHP app could take down Google's entire spidering system.
The article is pretty much correct (although strangely worded in places); the stuff about "communicating via robots.txt comments to Google" is of course not true. The examples he gives are developer jokes, nothing more.
Still, you should not use comments in robots.txt. Why?
You can group user agents, e.g.:
User-agent: Googlebot
User-agent: bingbot
User-Agent: Yandex
Disallow: /
Congrats, you have just disallowed Googlebot, bingbot, and Yandex from crawling (not indexing, just crawling).
OK, now:
User-agent: Googlebot
#User-agent: bingbot
User-Agent: Yandex
Disallow: /
So, well, you have definitely blocked Yandex, and you don't care about bingbot (commented out), but what about Googlebot? Are Googlebot and Yandex part of one user-agent group? Or is Googlebot its own group and Yandex its own group? If the commented line is interpreted as a blank line, then Googlebot and Yandex are different groups; if it's interpreted as nonexistent, they belong together.
The way I read the spec https://developers.google.com/webmasters/control-crawl-index..., this behaviour is undefined. (Please correct me if I'm wrong.)
Simple solution: don't use comments in the robots.txt file.
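As one data point (not proof of what every crawler does): CPython's stdlib parser, urllib.robotparser, strips comments before it checks for blank lines, so it treats the commented-out line as nonexistent and puts Googlebot and Yandex in the same group. A small sketch demonstrating that interpretation:

```python
import urllib.robotparser

# The ambiguous file from above: is the commented line a blank line
# (two groups) or nonexistent (one group)?
ROBOTS = """\
User-agent: Googlebot
#User-agent: bingbot
User-Agent: Yandex
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

# CPython strips the comment first, leaving an empty line it skips
# entirely, so Googlebot and Yandex end up sharing the Disallow rule.
print(rp.can_fetch("Googlebot", "http://www.example.com/page"))  # False
print(rp.can_fetch("Yandex", "http://www.example.com/page"))     # False
print(rp.can_fetch("bingbot", "http://www.example.com/page"))    # True
```

Other parsers are free to make the opposite choice, which is exactly why relying on this is a bad idea.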
Also, please, somebody fork and take over https://www.npmjs.org/package/robotstxt: it has this undefined behaviour, it does not follow HTTP 301 redirects (which were unspecified when I coded it), and it tries to do too much (fetching and analysing; it should only do one thing).
By the way, my recommendation is to have a robots.txt file like this:
User-agent: *
Disallow:
Sitemap: http://www.example.com/your-sitemap-index.xml
and return HTTP 200. Why? If you do not have a file there, then at some point in the future you will suddenly return HTTP 500, or HTTP 200 with some other response, and that can be misleading. Also, it's quite common for the staging robots.txt file to spill over into the real world; this happens as soon as you forget that you have to care about your real robots.txt.
Also read the spec: https://developers.google.com/webmasters/control-crawl-index...
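One more reason to keep the file deliberately boring: crawlers silently ignore directive names they don't recognize, so a one-letter typo (say, "Dissalow") disables the rule without any error. A hypothetical sketch of a sanity check for that (the directive list here is illustrative, not exhaustive):

```python
# Hypothetical robots.txt linter: flags directive names no crawler
# will recognize. KNOWN_DIRECTIVES is an illustrative subset.
KNOWN_DIRECTIVES = {"user-agent", "disallow", "allow", "sitemap",
                    "crawl-delay", "host"}

def lint_robots(text):
    """Return (line_number, directive) pairs for unknown directives."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        directive = line.split(":", 1)[0].strip().lower()
        if directive not in KNOWN_DIRECTIVES:
            problems.append((lineno, directive))
    return problems

# A misspelled "Disallow" slips past every crawler, but not the linter:
print(lint_robots("User-agent: *\nDissalow:\n"))  # [(2, 'dissalow')]
```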