It seems “obvious” to me that if you have a tool that can request a web page, you can have that tool extract the main content from the page’s HTML. Maybe there is something I’m missing that makes this harder for LLMs, but before we had LLMs, this was considered an easy problem. It surprises me that adding LLMs has somehow made this previously easy, efficient solution unviable or inefficient.
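For concreteness, here is roughly what I mean, as a sketch using requests plus the readability-lxml library (the library choice and the fetch_main_content name are just illustrative, not the only way to do this):

```python
# Sketch only: requests + readability-lxml are one common way to do
# this, not the only way.
import requests
from readability import Document  # pip install readability-lxml

def fetch_main_content(url: str) -> str:
    """Fetch a page and return its main content as cleaned-up HTML."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    # Document.summary() strips navigation, ads, and other page
    # chrome, leaving just the main content block.
    return Document(response.text).summary()
```

That’s the whole pre-LLM solution; a tool could call something like this and hand the result to the model instead of the raw page.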
I think we also have to assume here that the website is designed to be scraped this way; if we don’t assume that, then “Accept: text/markdown” won’t work, since content negotiation only helps when the server has opted in to serving Markdown.
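A sketch of what that looks like from the client side, again with requests; the fetch_markdown name is mine, and the fallback behavior is an assumption about how a client would cope with servers that ignore the header:

```python
# Sketch only: ask for Markdown, then check whether the server
# actually honored the request before trusting the body.
import requests

def fetch_markdown(url: str) -> str | None:
    """Return Markdown if the server supports it, otherwise None."""
    response = requests.get(
        url, headers={"Accept": "text/markdown"}, timeout=10
    )
    response.raise_for_status()
    if "text/markdown" in response.headers.get("Content-Type", ""):
        return response.text
    # Most servers ignore the Accept header and send HTML anyway;
    # the caller would have to fall back to HTML extraction here.
    return None
```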