I managed to find a huge spam network that set up a proxy service that delivered normal content, but injected "you can win an iPhone!" spam to all users visiting them.
Since I was in a position to monitor their proxy traffic towards many sites I managed, I could easily document their behaviour.
At the same time, I wrote a crawler that visited their sites over a long, long period. I learned that they kept injecting hidden links to other sites in their network, so I let my bot follow those as well.
By this time, I also had a journalist with me who started looking at the money flow to try to find the organisation behind it.
My bot found in excess of 100K domains being used for this operation, targeting all of western Europe. All 100K sites contained proxied content and were hidden behind Cloudflare, but thanks to the position I had, I managed to find their backend anyway.
We reported the sites to both CF and Google, and to my knowledge, not a single site was removed before the people behind it took it down.
Oh, and the journalist? He did find a Dutch company that was not happy to see either him or the photographer :)
As someone who tried reporting spam sites because they were using content scraped from my website, I'm not surprised.
Cloudflare has a policy that they will not stop providing their IP hiding/reverse proxy services to anyone, regardless of complaints. The best they do is forward your complaint to the owner of the website, who is free to ignore it.
They say "we're not a hosting provider" as if that's an excuse that they can't refuse to offer their service. I'm sure many spam websites would go away if they couldn't hide behind Cloudflare.
Or worse. Since I have no way to know beforehand who I'd be dealing with, this is actively dangerous - what if the mobster running this site is having a bad day and chooses to retaliate?
Also, what a stupid fucking policy that is. Even if you are not legally compelled to block content, what is the point of actively helping distribute harmful content?
What they are doing is worse than just saying "We are not a hosting provider" - because while that is true, they are actively distributing content that is hosted elsewhere while hiding who is hosting it.
One can easily write an email to abuse@hoster.example.com, and usually these people do not want garbage on their networks. CF makes it impossible to notify them, and they refuse to implement an alternative procedure.
I still do not understand the moral position of profiting off of enabling criminal scum, when it would be so easy not to...
At the moment, they have a very clear rule. If they stop providing services to obvious spammers, they will create lots of grey areas, and they will also implicitly make a judgement that the clients they still serve are _good_ in some way, and an enterprising lawyer or muckraker might exploit that.
20 years ago the transit providers of the internet would have spotted Cloudflare for what it is, and cut it off.
By ignoring the content they serve, they rid themselves of the necessity to analyze and judge what they serve. Not only would this require a brain the size of a planet and the expense of running it, it would also inevitably conflict with someone else's judgments and bring various PR woes.
They don't analyze the internals of their traffic, just as internet backbone providers don't analyze the internals of the traffic they pass around.
I frankly find this position superior: imho it does more good by preventing censorship than harm by serving good-intentioned and bad-intentioned customers alike.
A legit company will always have internal struggles between dev/sales/marketing, so things just take longer and are much more draining to accomplish. I'd imagine a spam org just needs to have the bare minimum up to satisfy whatever need it has, knowing that humans won't necessarily be perusing those domains - yet it's 100K domains. I could almost see something like this running more smoothly. I can also see it being run by a small number of people who let things lapse, and it's just barely hanging together. So many questions...
Very curious to know what you found!
There is/are organisations that a) scrape legitimate sites for content, b) host that content on their own 100K domains, c) sit behind Cloudflare, d) do some SEO??? e) when someone finds their site, inject an ad or similar rubbish, f) do this enough that they make money off the ads / competitions / porn?
That seems like a problem that the "original-source" meta tag was supposed to stop?
I've been using Google search for all kinds of research for 15 years. There used to be a time when you could find the answer to pretty much anything. I could find leaked source code on public FTP servers, links to pirated software and keygens, detailed instructions for a variety of useful things. That was the golden age of the web.
These days, all the "interesting" data on the Internet is inside closed Telegram chats, Facebook groups, Discords or the rare public website here and there that Google doesn't want to index (like sci-hub, or other piracy sites).
The data that remains on SERPs is now also heavily censored for arbitrary reasons. "For your health", "For your protection". Google search is done.
It's precisely their fault: they've created an environment that incentivizes low-quality, irrelevant content, and they are actively hostile towards users. Two examples just off the top of my head: ignoring the country-specific site - previously, if you wanted to search only local news, it was very easy. Another was completely ignoring exact-phrase search with double quotes.
What made me really angry about Google Search was when they removed the function to search within discussion forums. But even then you could more or less filter out the crap.
Nowadays it feels very hard. I find myself using the site: operator a lot, but you need to know the site beforehand, which is another problem.
I think this is an overly harsh take. I strongly suspect that any algorithm for ranking search results is open to gaming and manipulation by malicious users.
Also the opposite: insisting on pushing local and localized results on google.com even when I set my browser language to English.
On the other hand, YouTube is the second most popular search engine and I don't see it slowing down. What an insight they had when they bought it.
Edit: I entirely agree that valuable information is found more in communities nowadays. I also predict that in 5 years the web will mostly be explored through communities.
Another reason for that is user retention.
If you get your information directly on google.com, you won't navigate away; you'll probably search again and bring in more ad revenue.
49 out of 50 review sites are now just affiliate links to Amazon. "Check the price on Amazon" buttons are the main content there.
I don't think I have used more than 1000 different sites across all my development searches ever. It's the Stack Exchange network, GitHub, official documentation, non-GitHub official issue trackers/communities and some high-quality blogs. That seems very manageable. You could probably index that into one Elasticsearch and one Sourcegraph instance - see the sketch below. Add a little more specific faceted search, add back powerful and precise query syntax, and still maintain "just paste in whatever and hit the first result" functionality. I'm likely underestimating the breadth of other developers' needs compared to my own. I don't know.
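A minimal sketch of that idea, assuming a local Elasticsearch instance (the "devsearch" index name, field layout and example document are hypothetical, just to show the shape of it):

# Index one document per page from a trusted dev site:
curl -X PUT 'http://localhost:9200/devsearch/_doc/1' -H 'Content-Type: application/json' -d '
{"url": "https://docs.python.org/3/library/subprocess.html",
 "site": "docs.python.org",
 "body": "subprocess - Subprocess management ..."}'

# Faceted search: full-text match on the body, filtered to one trusted site:
curl 'http://localhost:9200/devsearch/_search' -H 'Content-Type: application/json' -d '
{"query": {"bool": {
  "must":   {"match": {"body": "subprocess timeout"}},
  "filter": {"term":  {"site.keyword": "docs.python.org"}}}}}'

The term filter on site.keyword is what would give you back a reliable site: operator.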
DuckDuckGo is nowadays more useful than Google for my web searches.
1. the scientific worldview needs an update
2. from reproducibility to over reproducibility
It would be cool to find datapoints for a proper bug report for Google :)
Edit: see this other HN story: https://news.ycombinator.com/item?id=27993564
This is not to say that all search results are bought, although of course those are present now, too. But overall Google presumes that whatever the user is searching for, the best result is one where the answer is "buy this thing".
For those search results that don't lead directly to commercial products, the revenue generation is indirect: through the collection of user preferences and activity, Google can refine its search results towards maximizing revenue. At the very least, the result is likely to be a site that has ads, some of which generate revenue directly for Google.
In the old-fashioned Yellow Pages book, you couldn’t really “search,” but there was an index by category. It had many of the issues inherent in categories, but it didn’t take an expert to find things. Google search eliminates the need for anyone to understand a taxonomy of businesses.
Now such search results often don't even get a second page...
For example if you check havfruen4220.dk on archive.org you can see that it appears to have been a legitimate business website before. https://web.archive.org/web/20181126203158/https://havfruen4...
How do they rank so well?
I've checked the domain on Ahrefs and it has almost no backlinks. But if you look closely, you will see that all the results that rank very well have been added very recently. In the screenshots in the article you can see things like "for 2 timer siden", which means "2 hours ago". It looks like Google is ranking pages with a very recent publishing date higher.
Edit: Here is what the content of such a site looks like: https://webcache.googleusercontent.com/search?q=cache:Bk0VsM...
For example, there used to be a very common content farm system that was structured like this:
https://domainsites.com/site/nytimes.com
So when people searched for sites by domain name, the zillions of low traffic long-tail results of this farm system would be all over Google's results.
What it would present on the page is a mess of data about nytimes.com, such as traffic, or keywords pulled from the site header, maybe a manufactured description (or one pulled right from the site head), sometimes images/screenshots of the site - anything that could be stuffed in there to pad out enough content to keep Google from applying an automatic shallow-content kill penalty to the farm. This worked very successfully for several years, until Google's big algorithm updates 9-10 years ago or whatever it is now (Penguin et al.). You could just take a large index of the top million domains (e.g. Alexa and Quantcast used to provide that index in a zip file), spider & scrape info from the domains, build a content farm index out of it, and have a million pages of content to hand off to Googlebot.
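A toy version of that scrape step, just to show how little machinery it takes ("top-domains.txt" is a hypothetical one-domain-per-line list, like the old Alexa zip):

# Grab a title and meta description for each domain, as raw material for farm pages:
mkdir -p pages
while read domain; do
  curl -sL --max-time 10 "https://$domain" |
    grep -oiE '<title>[^<]*|<meta name="description"[^>]*' > "pages/$domain.txt"
done < top-domains.txt

Everything after that is just templating a million pages around those scraps.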
So initially such a farm would boom in the search rankings: Google would give it a trial period and open the flood gates of traffic to the site. Then Google would promptly kill off the content farm once the free-run period expired and they had figured out it was a garbage site.
I still occasionally see this model of content farm burst up into traffic rankings, and it's usually very short lived. It makes me wonder if that's not more or less what's going on with the Mermaid farm.
It could of course be something similar to GPT, trained on all the content it could find, that then writes the articles - because it's clearly messing up sometimes, judging from the small piece of content visible on the search results page.
I'm not sure if this is an ML race, and the reason we're not seeing the same thing in English is that Google understands English better than the spammers do, while in Norwegian and German it's the other way around?
Clearly freshness is a large part of it. Google seems to have indexed millions upon millions of pages tied to this in the last 24 hours.
Almost any search in Norwegian will have obvious scam sites like these in the top 10 results.
Other domains part of the same scam that show up in my results today: mariesofie dot dk, bvosvejsogmontage dot dk
I wonder if it is related to this: https://www.dk-hostmaster.dk/en/news/dk-hostmaster-takes-102...
Never seen anything on this scale before. I can search for basically anything (tax rules, baking, stocks, property, hygiene...) and Google will most likely show those domains somewhere.
The content seems to be taken from other websites and mixed together in a nonsensical way. It comes up frequently in my search results. www.xspdf.com has completely unrelated content and seems to be a separate business.
Fetching the page once as a normal desktop browser and once with a Googlebot user-agent, then diffing the two:

curl -A 'Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0' 'https://havfruen4220.dk' > 1.html
curl -A 'Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' 'https://havfruen4220.dk' > 2.html
diff 1.html 2.html
7d6 < <script>var b="https://havfruen4220.dk/3_5_no_14_-__1627553323/gotodate"; ( /google|yahoo|facebook|vk|mail|alpha|yandex|search|msn|DuckDuckGo|Boardreader|Ask|SlideShare|YouTube|Vimeo|Baidu|AOL|Excite/.test(document.referrer) && location.href.indexOf(".") != -1 ) && (top.location.href = b); </script>
Would be interesting to see the actual content. Based on the small snippets in the search results, it takes content from other sites, like large Norwegian news sites, and somehow outranks them hard.
I wonder what the Google Search Console looks like for that domain, considering that it's probably getting millions worth of free traffic.
EDIT: After looking more at it, it's insane how much it ranks for and how well. Straight-up brand names seem to be the hardest for it to compete with, at least the larger ones. Those seem to be around page 4-5 for me.
Some brands I was unable to find at all, but ironically another .dk domain doing the same thing showed up in its place. There are also some .it domains using the same content.
I've found that it takes content from multiple sources and glues it together, sometimes in surprisingly good ways - like one sentence from this page, another from that page.
Maybe this is some ML that collects content and pieces a lot of it together, sentence or half-sentence at a time, into one large article? It's clearly from completely different sources, but about the same topic.
Example: "wash car"
Result in google: "A dark winter with snow and salt is hard on the car, and it's extra important to wash the car" - Collected from one article.
<some other text>
"Keep the pressure washer at 30-50 cm from the car..." - From another article.
Ironically, there are like 11 results all tied to this thing outranking the original articles (those come last), even when the originals are medium-to-large, well-known companies selling for billion(s) of dollars each year in Norway.
Sometimes it goes from one thing and switches to something completely unrelated, so I guess the spammers still have something to improve.
Weird.
Ahrefs: 230k organic traffic, valued at $124k
SEMRush: 558k organic traffic, valued at $355k
These are estimates and can be wildly under- or overestimated, but they show that this is happening on a very large scale.
For a quick idea of how this is possible, I looked at their top pages (according to Ahrefs). Their top page is ranking #2 for the keyword "interia", which has 207k searches per month in Norway and is rated 0 (out of 100) for ranking difficulty. Usually a keyword with that many searches would be incredibly hard to rank for; I've never seen anything like this. So what seems to be happening is that they are just taking advantage of a market with really low-competition keywords.
However, the weird thing is that it steals content from articles and then outranks them. Most pages seem to be boosted, maybe as a result of being new. (Most content is just hours old.)
Could you check these too? (Exactly the same thing, but newer, it seems.) www.mariesofie.dk, nem-varmepumper.dk
Clearly reused domains.
But I currently feel that paying $100/mo for Ahrefs for something I do as a side project is a tad wasteful.
And add to that some link spam, plus preventing visitors from returning so Google never sees a bounce back...
Either way, I can't help but be a bit impressed by the SEO spammers outsmarting the people at Google. (Edit: I don't mean to say they are smarter or anything - just that they only need to find one weakness in the algorithm, while the people working to improve it need to make it work for everything.)
edit: one other thing I have seen, though it doesn't always mean spam: All The Words In A Title Are Capitalized. It's something to pay attention to when judging whether a result is spam, since titles are conventionally not written like that in Norwegian.
Another big one is that Norwegians, like Germans, write compound words together. Just one example from one of the stupid ads: "Spesial Reportasje" is a dead giveaway, and not only because of the capitalization.
(Oh well, sadly, thanks to years of pressure from Word's incompetent spell checker and lenient teachers, this is getting worse. I fear we are seeing compound damage here, as the kids who got away with it are now becoming teachers...)
The current state of formal German will surely not be the end of history.
See https://de.wikipedia.org/wiki/Leerzeichen_in_Komposita#Gesch... (Might need Google Translate, if you don't speak German.)
(Something like: Pictures in the struggle against mistakes when using spaces between words)
Why, though? There is an arbitrary ranking system that seems increasingly independent of what I actually searched for. Google has created a game where the winner isn't necessarily relevant or at all useful. It's inevitable that spammers will play that game.
A bit of it is probably that.
Outright ignoring my query operators - +, double quotes, "verbatim" and all - takes more than SEO tactics; it takes someone inside Google, either malicious or, more probably, incompetent.
Or more probably: someone was so busy trying to use AI in search that they haven't had time in the last ten years to consider whether it was smart.
Or maybe Google started applying "We know better than the users", the driving principle behind their software and libraries, to their search.
It seems to me that the techniques used to spam Google's index work just as well on Bing's index.
It seems DDG is worse at finding the more authoritative sites about a subject compared to Google.
You search for a very specific thing, and all the results are big sites that have said something containing two of the 6 words you searched for, in a completely generic article that helps you none.
My favorite is when your query contains a word that is the very essence of what you're searching for, and Google chooses to display results without it, so you need an extra click of "yes, I actually want to search for what I said I want to search for".
It enables a collaborative effort in blocking spam / low value domains.
If you make a block list, please submit it to the list I’ve made: https://github.com/rjaus/awesome-ublacklist
(There’s no great subscription discovery as yet)
-stupidautogeneratedcontent1.com -stupidautogeneratedcontent2.com etc
I figured sooner or later Google would pick up the signal, but I think instead they just started ignoring my "-" requests, so I stopped using them. edit: or maybe they fixed the problem. Spam sites used to be a problem during the early decline of Google; I think that problem actually almost disappeared for me and was replaced by irrelevant results from non-spam sites.
Edit: mahalo.com was one of those, https://en.m.wikipedia.org/wiki/Mahalo.com
google.*##.g:has(a[href*="example.com"])

Can't fathom Google not catching this...
YMMV a lot with Google results. For me, it's usually great where DDG is kinda crap, but not as bad as... shudder ... bing
My guess is that someone at Google reacted.
It seems like they've manipulated rankings by locking people in to reduce their bounce-back stats (in addition to keyword-stuffed content)
...but, you know. Can you see anything else they're doing that would give them that kind of ranking? These pages are just piles of crap, and google is pretty good at filtering that sort of stuff out.
If it were that easy, Google would be filled with spam everywhere.
The chance that someone did something random that's very uncommon (blocking back) and that it happened to be a super-effective signal to Google seems:
a) like an edge case they didn't think of
b) like it'll get fixed pretty fast
c) not that unlikely.
Compared to, say, the idea that some random spammers have built a network of incredibly sophisticated ML-generated pages that can subvert Google's algorithms, which seems:
a) not substantiated by any obvious content on the pages
b) requires a very high level of sophistication which seems totally lacking
c) very unlikely
...but I mean, who knows right?
We're all just speculating. I guess it'll get fixed soon, and we'll never know.
This has been going on for years now, so I don't have much confidence that Google is able or willing to fix it.
A rather non-sequitur choice, like everything else with this thing I guess.
There are a lot of these domains (ptsdforum.dk, verdes.dk, momentsbykruuse.dk off the top of my head). Always Danish domains, and always registered by the same person in Riga.
I'm not sure if the culprit is BERT or neural ranking in general, but in the last few years I find it more and more common to leave Google search without useful information. The worst part is that all the competing search engines use the same algorithms, which are only useful for mainstream results.
Memory Hole
"The alteration or outright disappearance of inconvenient or embarrassing documents, photographs, transcripts, or other records, such as from a web site or other archive. Its origin comes from George Orwell's "1984", in which the memory hole was a small incinerator chute used for censoring, (through destroying), things Big Brother deemed necessary to censor."
https://www.urbandictionary.com/define.php?term=Memory%20Hol...
Tools are not to blame here; it's like blaming the compiler for the behaviour of an application. From the training data all the way to how the model is used in deployment, the blame lies with the people who made it, not with the neural architecture. The architecture itself can learn anything you throw at it, good or bad.
Basically, like searching for diving suit thickness and Google ignoring "suit" and "thickness" (until I specifically put those two words in quote marks), only showing me results for diving.
Google search 'reservoir cats' and it will completely ignore what you actually searched for in favor of the mainstream result. The effect is basically that you can't search for 'reservoir cats'!
Even putting something opposite or unrelated to the highly mainstream result will have no effect.
It's completely and entirely ridiculous, and makes the search engine seem like a facade.
I don't know of a good general internet search engine, so I tend to stick to the sites I know will provide answers that'll work for me, which is a shame for discovering new content.
[1] https://www.google.com/search?q=diving+suit+thickness&rlz=1C...
Even today there are bloggers out there who do not have a commercial affiliation with the goods/items/things they blog about. Such content is practically impossible to find among all the Amazon-affiliated, pseudo-information-conveying spoof sites.
The Amazon affiliate program is definitely contributing to this problem.
Additionally, as a polyglot, I find it very irritating that Google tries to helpfully translate queries for me, so I have to go to other search engines to actually find the article in the language I want.
I'm not sure what the beginning of the end was for Google Search, but I think the day where they changed the ad background to white is a good candidate.
Google Search used to be like Chrome or Gmail - we know it's wrong in the long term, but it's hard to stop using because it just works so well.
But these days, not anymore. Search is a lot less sticky, and it is their golden goose they are messing with here.
2. juli 2021 (2 July 2021)? That's pretty fast to work so well. But I see lots of this with other domains when searching, and have for years, so nothing new here, I think.
If the redirect is done as a meta refresh, then you can use robots.txt to keep it from being picked up by SEO tools like Ahrefs, SEMrush etc.
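Something like this in robots.txt, using the crawler names those tools document:

User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /

Well-behaved SEO crawlers honor it, Googlebot is unaffected, and the redirect stays invisible in backlink tools.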
These types of sites are called doorway pages and have been around for ages. They are most popular in Russia and on Yandex, but you do see them on Google for super-long-tail keywords with zero competition.
The other important thing to remember is that doing SEO in any language other than English is a walk in the park. Lots of SEO influencer types have case studies showing how much extra traffic they get by translating their content. [1]
They can easily be traced to a block of flats in Latvia, but since their registered phone is a toy store in Riga... I am going to go with a probably-stolen identity and a sense of humour on their part, rather than the real operation of some 12-year-old in Riga...
An SEO-fighting Googler might at a glance have no reason not to think that could be a really relevant or popular site in your country.
God I hope not. If Google does do this, it sounds like a really dumb idea that will ultimately create widespread usability issues. I can already envision SEO consultants recommending this to their clients if it comes to be believed.
Doesn’t look like it according to https://www.seroundtable.com/google-browser-back-button-rank...
"We help you to receive high-quality visitors from search engines, generate conversions and build your brand. To achieve these results, we ensure your website / company is recommended for specific keywords by the search engine's autocomplete function."
That doesn't entirely eliminate the other possibilities though: Google search isn't deterministic, and the domain could have been reported since the article went up.
I've noticed that the ranking of the results changes really often.
In particular, the only results ranking higher than themermaid for "hvor ofte oppdaterer apple ios" are those coming from support.apple.com.
Surely Google does this, right? Given that - in theory - showing different content to Google versus non-Google should result in a penalty, anyway ...
sesam.no (no longer a valid domain) was an engine made by a big Norwegian company back in 2005 or so.
Norway used to be big in search. FAST got acquired by MS back in 2008.
I recommend switching to DuckDuckGo :)
Are they potentially doing harm? Sure. Have they successfully managed to trick anybody with this? I'd be extremely surprised if they're getting more than a dozen people a day clicking through from being the ninth result, and when people see they've been redirected to an advertisement, the majority immediately click away.
This isn't like clicking on a fake porn site that redirects to cam girls with viruses hidden in all the downloads. It's random unrelated searches redirecting you to blatant ads for cryptocurrency. The kind of people who are young enough to know what cryptocurrency is and how to buy it also know how to spot a redirect to a fake website.
## read robots.txt `curl 'https://havfruen4220.dk/robots.txt'`
## use pointer to a sitemap.xml
curl -A 'Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' 'https://havfruen4220.dk/sitemap-no.xml' > sitemap.xml
## read more sitemaps
curl -A 'Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' 'https://havfruen4220.dk/sitemap-no-1.xml' > sitemap1.xml
Other sitemaps contain pointers to individual "webpages", e.g.: https://havfruen4220.dk/no/7a28855e4714dd14
## read web pages
Each location in the sitemap has a "lastmod" of today/yesterday, so the bot returns there every day. In addition, each webpage has a "<meta name="robots" content="noarchive">".
But if you visit any of those pages, it shows you a cartoon image. It seems the actual indexed content is visible only to the bot.
## But how is actual content being rendered?
The question is: what conditions (request params/headers) result in the actual content being rendered? The bot needs to evaluate it. I suspect it's some combination of checking whether the requester is an actual Googlebot, maybe by looking up the IP: https://developers.google.com/search/docs/advanced/crawling/...
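That page describes a reverse-plus-forward DNS check, easy to reproduce by hand (66.249.66.1 and the crawl-* hostname below are the examples from Google's own docs):

# Reverse lookup of the requesting IP should point into googlebot.com or google.com:
host 66.249.66.1
# 1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

# Forward lookup of that name should return the original IP:
host crawl-66-249-66-1.googlebot.com
# crawl-66-249-66-1.googlebot.com has address 66.249.66.1

If the spammer's backend runs those two lookups per request, a plain user-agent spoof won't be enough to see the bot-only content.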
For example, Microsoft routinely deletes negative feedback from GitHub issues for VS Code.
Just try one of the examples, like "hvordan regne ut prosent" (how to calculate percentages) or, I don't know... "DNB aksje" (DNB stock, DNB being the biggest bank in Norway). Sure enough, both rank on the first page or as one of the top results. (One is now using the www.nem-varmepumper.dk domain - it's the same thing.)
EDIT: Now the DNB one has moved from 2nd and 3rd place to page 2. Things are moving around quickly.
It is Google that needs "incognito" mode, not the author.
Anyone who can do that can rank as high as they like for any search query.
A good proxy for this is how many people don't click the 'back' button to see other results.
Google is already aware of sites which hijack the back button. Their crawler detects this, and if they find it, they throw out the figures for how many people click the back button.
So if you can find a way to hook the back button so nobody can click back, while stopping google thinking you have hooked the back button, then your page will keep creeping up the rankings.
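The hook itself is trivial - a generic History API sketch, not necessarily exactly what these sites run:

// Push a dummy history entry so the first Back press stays on this page...
history.pushState(null, "", location.href);
window.addEventListener("popstate", function () {
  // ...and re-push on every subsequent Back press, so the user never
  // makes it back to the search results (and Google never sees a bounce).
  history.pushState(null, "", location.href);
});

The hard part is the second half: doing that without Google's own renderer noticing.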
Google detects back-button hijacking with their crawler (by rendering the page in Chromium and seeing what happens when the actual back button is pressed), but this can be circumvented by presenting the crawler different HTML (or by making sure the page behaves differently in their crawler, potentially by checking things like the model of the graphics card - Google's crawlers don't yet support most of WebGL 2.0, and they also simulate playing audio at the wrong rate).
Google also detects how many real users click back. If it's zero, then that's a warning flag. So I'd guess the back-hijacking logic is only activated ~80% of the time.