But you are pretty much the only people who can save the web from AI bots right now.
The sites I administer are drowning in bots, and the applications I build which need web data are constantly blocked. We're in the worst of all possible worlds and the simplest way to solve it is to have a middleman that scrapes gently and has the bandwidth to provide an AI first API.
Would Common Crawl do a "for all purposes and no restrictions" license if it is for AI training, comouter analyses, etc? Especially given the bad actors are ignoring copyrights and terms while such restrictions only affect moral, law-abiding people?
Also, even simpler, would Common Crawl release under a permissive license a list of URL's that others could scrape themselves? Maybe with metadata per URL from your crawls, such as which use Cloudflare or other limiters. Being able to rescrape the CC index independently would be very helpful under some legal theories about AI training. Independent, search operators benefit, too.
We carefully preserve robots.txt permissions in robots.txt, in http headers, and in html meta tags.
We do publish 2 different url indexes, if you wanted to recrawl for some reason.
https://commoncrawl.org/terms-of-use
In it, (a), (d), and (g) have had overly-political interpretations in many places. (h) is on Reddit where just offering the Gospel of Jesus Christ got me hit with "harassment" once. The problem is whether what our model can be or is uses for incurs liability under such a license. Also, it hardly seems "open" if we give up our autonomy and take on liability just to use it.
Publishing a crawl, or the URL's, under CC-0, CC-by, BSD, or Apache would make them usable without restrictions or any further legal analyses. Does CC have permissively-licensed crawls somewhere?
Btw, I brought up URL's because transfering crawled content may be a copyright violation in U.S., but sharing URL's isn't. Are the URL's released under a permissive license that overrides the Terms of Use?
Alternatively, would Common Crawl simply change their Terms so that it doesn't apply to the Crawled Content and URL databases? And simply release them under a permissive license?