Common Crawl doesn't bypass regular copyright law requirements; it just lowers the burden on websites by centralizing the scraping work.
Remember when news sites wanted to allow a few free articles to entice people, and wanted to let Google scrape them, but wanted to block freeloaders? They decided the tradeoffs landed in one direction in the 2010s ecosystem, but they might decide that they can only survive in the 2030s ecosystem by closing off to anyone who isn't logged in, if they can't effectively block this kind of thing.
If what a government receptionist says is copyright-free, you still can't walk into their office thousands of times per day and ask assorted questions just to learn what human answers look like, in order to train your artificial neural network.
The amount of scraping that happened in ~2020 compared to 2024 differs by orders of magnitude. Not all of the bots send a user agent (looking at "alibaba cloud intelligence" unintelligently making a billion requests from one IP address) or respect the robots file (looking at Huawei's Singapore department, which also pretends to be a normal browser and slurps craptons of pages through my proxy site, a site that was meant to alleviate load on the slow upstream server and is therefore the only entry my robots.txt denies).
You block Common Crawl and all you'll be left with is the abusive bots that find workarounds.
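For what it's worth, the cooperative crawlers are the only ones robots.txt can touch: a compliant bot checks the file before every fetch, roughly like this (a sketch using Python's stdlib robotparser; the URLs here are placeholders):

    import urllib.robotparser

    # A well-behaved crawler (CCBot, Googlebot, ...) parses robots.txt
    # first and skips anything the site owner has denied. Abusive bots
    # simply never run this step.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")  # placeholder site
    rp.read()

    if rp.can_fetch("CCBot", "https://example.com/some/page"):
        print("allowed: fetch it")
    else:
        print("denied: a polite bot stops here")

The abusive ones never run that check at all, which is the whole problem.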
why not?
Especially if that receptionist is an automaton and isn't bothered by you. Of course, if you end up consuming enough resources that you block others from asking as well, then you need to observe some etiquette (i.e., throttle your requests, etc.).
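For concreteness, the bare-minimum version of that etiquette is just a fixed pause between requests; a rough Python sketch (the delay value and the URL list are made up, and a real crawler would also back off on errors):

    import time
    import urllib.request

    DELAY_SECONDS = 2.0  # made-up politeness delay between requests

    def polite_fetch(urls):
        """Fetch pages one at a time, pausing so we never hammer the host."""
        for url in urls:
            with urllib.request.urlopen(url) as resp:
                yield url, resp.read()
            time.sleep(DELAY_SECONDS)  # the "etiquette" part

    for url, body in polite_fetch(["https://example.com/a", "https://example.com/b"]):
        print(url, len(body))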
I chose "thousands" to keep it within the realm of possibility while making it clear that it would bother a human receptionist precisely because humans aren't automatons, making the use of resources very obvious.
If you need an analogy to understand how an automated system could suffer from resources being consumed, perhaps picture a web server receiving billions of requests, each using a certain amount of bandwidth and CPU time. Wait, now we're back to the original scenario!
Multiple cases are in litigation, and a judge will make a judgment on each one.
Until then, and even after that, publishers can set the terms and enforce them using technical means like this.
I don't even have anything on my websites that would be considered interesting to anyone but myself, but it's the principle of the thing more than anything.
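By "technical means" I mean things as blunt as refusing requests from known crawler agents at the server. A minimal Python/WSGI sketch of the idea (the agent list is just an example, and as noted upthread a determined bot will fake its agent anyway, so real deployments match on more signals like IP ranges and request rates):

    from wsgiref.simple_server import make_server

    # Example deny list; user agents are easily faked, so treat this as
    # a statement of terms more than a hard barrier.
    BLOCKED_AGENTS = ("CCBot", "Bytespider")

    def app(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if any(bot in ua for bot in BLOCKED_AGENTS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Crawling not permitted; see our terms.\n"]
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"Hello, human.\n"]

    if __name__ == "__main__":
        make_server("", 8000, app).serve_forever()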