Some old sites never upgraded to HTTPS or met other technical demands Google made of them. Google chose to stop indexing these sites to force them to change their behavior.
Most new content is trapped in walled gardens of some format. The one I see all the time is Discord, but the communities you care about are probably talking in a non-indexable group chat rather than an indexable Internet forum like they might have 20 years ago.
Pretty sure G could throw more money and resources at it if they thought that would make a dent.
It feels more like a lot of real content is collateral damage in the SEO vs. Google wars. E.g. a blogger was complaining the other day that someone had set up an automation to scrape their content the second it gets published, run it through machine translation twice, and publish the resulting semi-gibberish.
That sort of shenanigans is, I suspect, quite hard to deal with even if you're Google.
Will be interesting to see a decade from now how researchers collect a corpus that isn’t chock full of a model’s own output
Let's say we set up a wildcard domain *.example.com all pointing to a server set up so that
0.example.com/ has a link to 0.example.com/0 and 1.example.com/
0.example.com/0 has a link to 0.example.com/1 and 0.example.com/0/0
0.example.com/0/0 has a link to 0.example.com/0/1 and 0.example.com/0/0/0
1.example.com/ has a link to 1.example.com/0 and 2.example.com/
1.example.com/0 has a link to 1.example.com/1 and 1.example.com/0/0
and so forth. This way even a Raspberry Pi can trivially host an infinite number of infinite websites.
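The structure above can be sketched as a single request handler. This is a minimal illustration, not anyone's actual setup: it assumes the wildcard DNS already points every `*.example.com` host at this server, and the `links_for` helper is a name I made up for the link-generation rule described in the list above.

```python
# Sketch of the crawler trap described above: one handler serves every
# host under *.example.com and fabricates two outgoing links per page,
# so the crawlable "site" never ends.
from http.server import BaseHTTPRequestHandler, HTTPServer

def links_for(host: str, path: str) -> list[str]:
    """Return the two links a page advertises, per the pattern above:
    n.example.com/ links to n.example.com/0 and (n+1).example.com/;
    any deeper page /a/.../z links to the sibling /a/.../(z+1) and
    the child /a/.../z/0."""
    n = int(host.split(".")[0])
    parts = [p for p in path.split("/") if p]
    if not parts:
        return [f"http://{n}.example.com/0", f"http://{n + 1}.example.com/"]
    sibling = "/".join(parts[:-1] + [str(int(parts[-1]) + 1)])
    child = "/".join(parts + ["0"])
    return [f"http://{host}/{sibling}", f"http://{host}/{child}"]

class TrapHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        host = self.headers.get("Host", "0.example.com")
        body = " ".join(f'<a href="{u}">{u}</a>' for u in links_for(host, self.path))
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body.encode())

if __name__ == "__main__":
    # Serve the infinite link graph on port 8080.
    HTTPServer(("", 8080), TrapHandler).serve_forever()
```

No page ever 404s and every page links to two more, which is why naive breadth-first crawling of such a server never terminates.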
Does Marginalia struggle or fail to identify these sorts of structures and sidestep indexing them?
Counting websites, or even delineating where one website starts and another ends, is difficult: on the one hand, a single server can host infinite websites like I described. Services like Cloudflare also throw a spanner in the works if you think using the server IP would help. Domain name isn't much use either, as that would discount hosting services like Neocities.
There's a similar fractal of weird cases when counting documents on a given webserver (and, by extension, the internet).
That was the question back when Google first got going.
Now the question seems to be, at what point is Google too big to even bother?
They have also scaled back image search, and torrents and many other things have been removed via censoring of the index and YouTube.
By pushing corporate-type sites up and everything else into the nether, Google has become the new yellow pages, along with being an arbiter of higher-ranked health info and links to old answers for programmers.
Basic things like recipes are so bad that even John Oliver jokes about reading a dozen paragraphs before finding a recipe via Google.
So many things that are fun, entertaining, sexy and more have no room under the high-brow expectations of the big G.
Hence TikTok being more popular than Google now.
Some of the things they have removed have come from governments and industries with a lot of sway, but much of what they downrank post-Panda/Penguin is a veiled attempt at being more politically correct and less blue collar.
Imo.
So indeed there is now room for other 'search/find' portals for things Google doesn't want to showcase on their front pages.
But for at least a while into the future they will likely be the best yellow pages, since customers do most of that work for them.