It looks like a great way for you to discover URLs, but a terribly slow way for people to avoid implementing robots.txt rules themselves.
The aim of this project is only to check whether a given web resource can be crawled by a given user agent, but via an API.
What's the plan here? Check for a sitemap.xml (which generally only contains crawlable URLs anyway), or crawl the index page, collect all the links, and send a request to your service for every URL before crawling it?
I personally think it would be better suited as a library: you pass it a robots.txt, and it tells you whether you can crawl a given URL based on that.
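For what it's worth, the library approach suggested here can be sketched with Python's standard-library robots.txt parser; the robots.txt content, user agent, and URLs below are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content; in the library approach the caller
# supplies this text, no network request is needed.
robots_txt = """\
User-agent: MyUserAgentBot
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Check individual URLs against the parsed rules for a given user agent.
print(parser.can_fetch("MyUserAgentBot", "https://example.com/test/user/1"))  # True
print(parser.can_fetch("MyUserAgentBot", "https://example.com/private/x"))    # False
```

Once the rules are parsed, every `can_fetch` call is a local lookup, so there is no per-URL network round trip.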
For example, if you want to check the URL https://example.com/test/user/1 with the user agent MyUserAgentBot, the first request can be slow (~730ms), but subsequent requests for different paths on the same base URL, port, and protocol will use the cached version (just ~190ms). Note that this version is in alpha and many things can still be optimized. There is a trade-off to weigh between managing these robots.txt files in each of your projects yourself and the time spent on network requests.
Anyway, anyone can compile the parser module and build a library to check robots.txt rules themselves ;-)
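The caching behaviour described above (first lookup per site is slow, later lookups reuse a parsed copy keyed by origin) can be sketched like this; the function and cache names are illustrative, not the project's actual code, and a real implementation would fetch `robots.txt` over the network on a cache miss instead of taking the text as a parameter:

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Cache of parsed robots.txt, keyed by (protocol, host, port) so that
# different paths on the same origin reuse the same parsed rules.
_cache = {}

def can_crawl(url, user_agent, robots_txt):
    parts = urlsplit(url)
    key = (parts.scheme, parts.hostname, parts.port)
    parser = _cache.get(key)
    if parser is None:
        # First request for this origin: parse and cache the rules.
        # (A real service would download robots.txt here.)
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        _cache[key] = parser
    return parser.can_fetch(user_agent, url)
```

Only the first call per origin pays the parsing (and, in practice, fetching) cost; subsequent calls are cache hits.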
PS: thanks for the feedback
I understand the unethical nature of the above method; however, I see it happening quite a lot in practice.
Give me some feedback!