I've been running this experiment (another comment). While bots continuously hammer port 22 (SSH) and repeatedly try to fetch things like /wp-* (I don't even run PHP), they don't bother fetching robots.txt in the first place, and my honeypot hasn't had a single hit.
Definitely don't try to "secure" your site this way, but either the bots aren't sophisticated enough to analyze the .txt, or it's already a known technique they avoid. It seems many other commenters came up with the same idea.
[1] https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...
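The honeypot idea above can be sketched in a few lines (path names and structure here are made up for illustration): advertise a Disallow'd trap path in robots.txt, then treat any request for that path as evidence the client read robots.txt and deliberately ignored it.

```python
# Minimal robots.txt honeypot sketch (hypothetical paths and layout).
# A well-behaved crawler never requests a Disallow'd path, so any hit on
# the trap path is worth logging, alerting on, or blocking.

ROBOTS_TXT = "User-agent: *\nDisallow: /not-really-secret/\n"

honeypot_hits = []  # e.g. client IPs worth investigating


def handle_request(path: str, client_ip: str) -> int:
    """Return an HTTP status for `path`; record trap hits as a side effect."""
    if path == "/robots.txt":
        return 200  # serve ROBOTS_TXT
    if path.startswith("/not-really-secret/"):
        honeypot_hits.append(client_ip)  # only a robots.txt reader lands here
        return 404  # give nothing away to the bot
    return 404
```

As the parent comment notes, in practice the bots mostly never fetch robots.txt at all, so the trap path stays quiet while /wp-* probes keep coming.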
"The combination is... 1-2-3-4-5."
"That's amazing! I've got the same combination on my luggage!"
It's been incredibly enlightening. One thing that sticks out immediately is that you can identify the underlying HTTP framework in many cases due to the defaults. Sometimes even the exact version.
And yes, people do use the robots file to "protect" or "hide" endpoints, so it can effectively be used to enumerate potential endpoints worth investigating further (from a pentesting perspective).
[1] https://gist.github.com/wybiral/20c20ccf00b6c93506b8acdc6ccb...
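The enumeration angle is simple to sketch: pull the Disallow'd paths out of a robots.txt body and you have a list of candidate endpoints to probe. This is a rough sketch only; real robots.txt parsing has more rules (per-agent groups, wildcards, Allow precedence).

```python
# Extract Disallow'd paths from a robots.txt body -- from a pentesting
# perspective, these are endpoints someone considered worth "hiding".

def disallowed_paths(robots_txt: str) -> list[str]:
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path:  # an empty Disallow means "allow everything"
                paths.append(path)
    return paths
```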
Obviously you shouldn't rely on it, but defense in depth as always.
https://developers.google.com/search/reference/robots_meta_t...
I’d also strongly recommend pairing this with outside monitoring that alerts if something accidentally becomes reachable, since it’s really easy not to notice something working from more places than intended.
This is the only way to use robots.txt for semi-sensitive info, and obviously not for info so sensitive that it would be awful for it to get out. URLs can leak through proxy logs and shared browser history.
At my previous job we were developing an e-commerce system, which was super-old, big, messy, zero-test PHP trash. After two years of actively working on it, I still couldn't form a clear picture of its subsystems in my head.
One day a client calls, saying that many of his orders are missing. The whole company is on its feet and we're searching for what went wrong. Examining the server logs, we find that someone is making thousands of requests to our admin section, linearly incrementing the order IDs in the URL. Definitely some kind of attack. Our servers are managed by a different company, so we open a ticket to blacklist that IP. A quick search tells me the requests are coming from AWS, and the IP leads me to an issue on GitHub for some nginx "bad bots" blocking plugin, saying this thing is called the Maui bot and we're not the first ones to run into it. Nice.

Anyway, this thing is still deleting our data, and we can't even turn off the servers because of SLAs and how the system was architected. So we try to figure out how it's even possible that an unauthorized request can delete our data. We examine our auth module, but everything looks right: if you're not logged in and visit, say, an order detail page, you're correctly redirected to the login screen. So how?

We read the documentation of the web framework the application uses. There it is: $this->redirect('login');. According to the documentation, that statement needed a return in front of it. Without the return, everything after that point was still executed, and "everything" in our case was the action from the URL. No one ever noticed, because there were no tests, and when you tried it in the browser, you were "correctly" presented with the login screen. Unfortunately, with the side effect.

The guy who wrote that line did it 5-6 years before this incident, and had been out of the company for years even before I joined. I don't blame him.
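The bug class is easy to reproduce. Here is a minimal sketch transposed to Python (the original was PHP; all names here are made up): calling redirect() without `return` builds the redirect response but does not stop the handler, so the destructive action still runs for anonymous callers.

```python
# Sketch of the missing-`return`-before-redirect bug (hypothetical names).

def redirect(target):
    """Stub for a framework redirect; the caller must `return` its result."""
    return f"302 -> {target}"

orders = {1001: "order A", 1002: "order B"}

def delete_order_buggy(order_id, logged_in):
    if not logged_in:
        redirect("login")        # BUG: missing `return` -- execution falls through
    orders.pop(order_id, None)   # side effect runs even when not logged in
    return "deleted"

def delete_order_fixed(order_id, logged_in):
    if not logged_in:
        return redirect("login") # fixed: the handler stops here
    orders.pop(order_id, None)
    return "deleted"
```

In a browser, both versions look correct: the anonymous user sees the login screen either way, which is exactly why nobody noticed.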
Fix. Push. Deploy. No more deleted orders.
POST MORTEM:
The Maui bot went straight to the disallowed /admin paths in robots.txt and tried to increment numbers (IDs) in paths.
I remember that because the Maui bot's actions were (to the system) indistinguishable from normal user actions, someone had to manually fix the orders in the database using the server logs and comparing them somehow.
Sorry for my English, and yeah, (obviously) don't use robots.txt as a security measure of any kind...
not sure why i did this aside from that it was fun!