Some old sites never upgraded to HTTPS or met other technical demands Google made of them. Google chose to stop indexing these sites to force them to change their behavior.
Most new content is trapped in walled gardens of some format. The one I see all the time is Discord, but the communities you care about are probably talking in a non-indexable group chat rather than an indexable Internet forum like they might have 20 years ago.
Pretty sure G could throw more money and resources at it if they thought that would make a dent.
It feels more like a lot of real content is collateral damage in the SEO vs. Google wars. E.g. a blogger was complaining the other day that someone had set up an automation to scrape their content the second it gets published, run it through machine translation twice, and publish the resulting semi-gibberish.
That sort of shenanigans is, I suspect, quite hard to deal with even if you're Google.
Will be interesting to see a decade from now how researchers collect a corpus that isn’t chock full of a model’s own output
Let's say we set up a wildcard domain *.example.com all pointing to a server set up so that
0.example.com/ has a link to 0.example.com/0 and 1.example.com/
0.example.com/0 has a link to 0.example.com/1 and 0.example.com/0/0
0.example.com/0/0 has a link to 0.example.com/0/1 and 0.example.com/0/0/0
1.example.com/ has a link to 1.example.com/0 and 2.example.com/
1.example.com/0 has a link to 1.example.com/1 and 1.example.com/0/0
and so forth. This way even a Raspberry Pi can trivially host an infinite number of infinite websites.
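The structure above can be sketched as a single request handler. This is a minimal illustration, not anyone's actual setup: it assumes the wildcard DNS already points every `*.example.com` host at this server, and the `links_for` helper is a name I made up for the link-generation rule described in the list above.

```python
# Sketch of the crawler trap described above: one handler serves every
# host under *.example.com and fabricates two outgoing links per page,
# so the crawlable "site" never ends.
from http.server import BaseHTTPRequestHandler, HTTPServer

def links_for(host: str, path: str) -> list[str]:
    """Return the two links a page advertises, per the pattern above:
    n.example.com/ links to n.example.com/0 and (n+1).example.com/;
    any deeper page /a/.../z links to the sibling /a/.../(z+1) and
    the child /a/.../z/0."""
    n = int(host.split(".")[0])
    parts = [p for p in path.split("/") if p]
    if not parts:
        return [f"http://{n}.example.com/0", f"http://{n + 1}.example.com/"]
    sibling = "/".join(parts[:-1] + [str(int(parts[-1]) + 1)])
    child = "/".join(parts + ["0"])
    return [f"http://{host}/{sibling}", f"http://{host}/{child}"]

class TrapHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        host = self.headers.get("Host", "0.example.com")
        body = " ".join(f'<a href="{u}">{u}</a>' for u in links_for(host, self.path))
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body.encode())

if __name__ == "__main__":
    # Serve the infinite link graph on port 8080.
    HTTPServer(("", 8080), TrapHandler).serve_forever()
```

No page ever 404s and every page links to two more, which is why naive breadth-first crawling of such a server never terminates.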
Does Marginalia struggle or fail to identify these sorts of structures and sidestep indexing them?
Counting websites, or even delineating where one website starts and another ends, is difficult: on the one hand, a single server can host infinite websites like I described. Services like Cloudflare also throw a spanner in the works if you think using the server IP would help. Domain name isn't much use either, as that would discount hosting services like Neocities.
There's a similar fractal of weird cases when counting documents on a given webserver (and, by extension, the internet).
That was the question back when Google first got going.
Now the question seems to be, at what point is Google too big to even bother?
They have also scaled back image search, and torrents and many other things have been removed via censoring of the index and YouTube.
By pushing corporate-type sites up and everything else into the nether, Google has become the new yellow pages, along with being an arbiter of higher-ranked health info and links to old answers for programmers.
Basic things like recipes are so bad that even John Oliver jokes about reading a dozen paragraphs before finding a recipe via Google.
So many things that are fun, entertaining, sexy and more have no room under the high-brow expectations of the big G.
Hence TikTok being more popular than Google now.
Some of the things they have removed have come from governments and industries with a lot of sway, but much of what they downrank post-Panda/Penguin is a veiled attempt at being more politically correct and less blue collar.
Imo.
So indeed there is now room for other 'search/find' portals for things Google doesn't want to showcase on their front pages.
But for at least a while into the future they will likely be the best yellow pages, since customers do most of that work for them.