Many people don't write for money, to put ads on their website, or as part of some "content marketing" campaign. All they want is a little recognition. A boost in positioning on the SERP means we will be getting useful stuff at no cost.
And there are genuine replies there. Ryan Jones[1] even got the scrapers to confess their sins[2].
[1] https://twitter.com/RyanJones/status/439123533349015553
[2] https://www.google.com/search?q=%20%22istwfn%22+%22stole+thi...
I hope this is genuine and not a disingenuous diversion on Google's part. The fact that the Huffington Post still ranks very high for trendy searches makes me wonder.
As usual, follow the money: the scraping sites exist to make money, often through Google's advertising; Google gets a cut. The original content is often on sites with no advertising or real traffic, from which Google profits nothing.
EDIT: To expand on this: Google-search for any hot topic in the news, say the name of some misbehaving pop star. See the HuffPo result near the top of the page. Look down to see several results from real newspapers. This is where the original content can be found. Most of these newspapers are about to die because they're not making any money. HuffPo investors are filthy rich because they're gaming the search engines to profit from copy-and-paste.
ANOTHER EDIT: I apologize for my characterization of the Huffington Post. I was describing, accurately, the nature of that site as it was the last time I visited it some time before its purchase by AOL three years ago. The HuffPo I see today is utterly transformed. They use wire services, do plenty of their own reporting, and many of the links on the front page go directly to other news sites. They are no longer a copy-and-paste site.
I also assume that by "HuffPo investors" you mean AOL? Huffington Post is a fully owned subsidiary.
(Disclosure: I consult for Huffington Post)
Many newspapers get a lot of their content from syndication services like Reuters. You may be seeing similar content because lazy editorial assistants just copied out a Reuters story verbatim, slapped a pic on it and put it up at multiple organisations, not because HuffPo is scraping other sites. Do you have an example of this sort of thing you can point to? It'd be interesting to trace the origin of the content.
http://theenginuity.com/blog/how-a-copied-excerpt-of-a-story...
Not that I particularly feel like defending the Huffington Post, but they're not a web scraper.
http://www.huffingtonpost.com/betsy-isaacson/
She writes good articles for the general public about tech in general and things like net neutrality and Aaron Swartz.
Is it content curation? Don't know. Everyone is reporting pretty much exactly the same things. They can't all be the original source. Who gets the juice?
You may as well just show http://images.google.com and complain that it's scraping. Or http://news.google.com.
In general, do you think Wikipedia gets more traffic because Google exists, or do you think Google gets more traffic because Wikipedia exists? Meaning, which effect is larger? I'm pretty sure the answer to this is obvious.
And if more scrapers donated millions to the site they scrape from, the world would be a much better place.
http://wikimediafoundation.org/wiki/Press_releases/Wikimedia...
How do you think Google would view my site if I wrapped Wikipedia's content with a backlink and ran my own ads alongside that content? I would imagine not very positively.
Also, is it okay that a bigger entity scrapes my content just because they send me traffic? You might not want to bite the hand that feeds you, but it still doesn't make it right.
This is technically scraping but it's hardly comparable to the bottom-feeders that plagiarize for money. (Edit: according to 'pud' on this page, Google uses a Wikipedia index so it's not scraping, but it is in the case of other sites that Google indexes.)
And yes, it's OK both legally and ethically if you do the same to Wikipedia - like Google that is, just for indexing purposes and not using whole articles.
It may be quite frustrating for an upstart to be denied access while Google is explicitly allowed, but that's another matter.
This original site is now getting so little traffic from Google that more people visit it from the trickle of these bottom-of-the-page Wikipedia links than from Google itself. Its traffic was also badly hurt by Google's Panda algorithm, which I think clearly proves how flawed it is since this algorithm was supposed to do the exact opposite.
Because of this situation, if somebody thinks of spending money to create high quality reference-type content, I would strongly advise against it. You have no chance vs. Wikipedia's poorly-written articles repurposing your content and Google's flawed algorithms.
Sure, it passes the keywords etc. But this likely reduces the number of people visiting Wikipedia while increasing Google's ad revenues. If anyone but Google did this, they'd potentially be blacklisted by Google.
1) Hasn't been chunked into 20 pieces of varying grammatical structure which are automatically matched to corresponding questions
2) Hasn't been subsequently pasted over a slideshow of completely irrelevant stock photos in bold, white font
3) Isn't accompanied by a grid of ~30 vaguely related questions helpfully linked to similar pages and tastefully decorated with more irrelevant stock photos
4) Only occupies ~1.5 rather than 3 or 4 of the front page search results
5) Contains only closely related textual ads rather than a melange of casino, fast food, and online college banners
6) Has fewer than 25 trustworthy stock faces smiling back at me from any given scroll position
If this is the best Google can do then I don't think wiki.answers.com has anything to fear.
------------
Seriously, how the hell does wiki.answers.com manage to pollute half of the searches I make with their algorithmically generated garbage (multiple times, at that)?! What kind of SEO catapulted them to the top despite 0 viewer retention and what surely must be about 0 reputable backlinks? How haven't they been sent to the 1000th page with manual penalties already? They show up before Wikipedia itself, for crying out loud!
Google, if you aren't going to let users maintain a manual blacklist, you need to be on top of this kind of thing. It's seriously degrading my search experience and I suspect I'm not alone. This kind of inattention is the type of thing that can push even the most inattentive users to change default search engines.
So this is neither scraping, nor against the rules.
Here are dumps in SQL and XML format:
http://dumps.wikimedia.org/enwiki/
PS: Yes, the original post was meant to be funny and it was; I do have a sense of humor. :)
What's bad, though, is that Google isn't just lowering the rankings of non-original content pages now (including any kind of legitimate curation site). They're marking the entire domains of new curation sites as "pure spam" and de-listing them from Google entirely, and punishing anyone who's linked to them.
This is having the effect of sending a clear message to developers -- stay far away from Google's territory of recommending third party content to people, no matter how you do it.
There's this site called "News360" that sends a lot of traffic to my site every time I post something. It copies the post in full. Apparently it's a popular app for iPhone and Google Play. This is an aggregator.
Google copies my site so it can send people to my writing. This is a search engine.
Then there's the legion of sites that copy my stuff and send no traffic even though they link back. Most of these are scrapers, meaning they're adswill garbage dumps that get no traffic after recent algorithm updates by Google, but some are attempts to build new aggregators like Huffington Post or that News360 thing.
The scrapers are a nuisance, but don't harm me in any way. Google is free, relevant traffic. Aggregators find an audience and provide useful content to them with credit, probably using the RSS feed I publish for that purpose.
Also, Google can impose whatever rules they fancy, because it's their own site. Do you have a website? If I find that you govern it with some rule that I don't like, should I rage on forums about it? Should you care if I rage about it?
It's a shame that the search engine market share isn't split evenly among several different engines. I think it would be beneficial both to the users and website owners. Right now everyone tries to court Google and they seem to do whatever the fuck they want.
EDIT: I should also note that I'm one of those who switched over to DuckDuckGo for privacy reasons, so I don't see these results as often now.
I want content that is curated by people who actually understand the subject. I would pay for a search engine designed by someone who understands my industry. The Google algorithm only manages to grab at the low-hanging fruit. I am a professional working on real stuff; I want something better than coffee shop suggestions.
"I would pay for a search engine designed by someone who understands my industry"
What industry is that? And how would Google guess or know your industry unless you tell them?
To give some idea I've asked for a list of URLs to documents covering best current practice for suicide prevention in Gloucestershire and Herefordshire; to include national level NHS and NICE guidance, DoH guidance, anything from Gloucestershire and Herefordshire, and anything recognised as excellent from anywhere else in the country. If possible I also want a list of protocols used in schools, care homes, etc.
It's probably something you could risk on MTurk. Perhaps Bountify.com could expand to this kind of simple websearching.
Obvious drawbacks include delay between starting the search and getting the results, and cost, and having to trust some random person to not miss stuff.
I don't know if there's anything similar to "clippings services" either, where you'd provide them with a list of the types of stories you'd want, and they'd read all the newspapers and clip any relevant stories and post them to you.
An option to disable the personalization of search and go back to seeing the top 10 results for a search term that everyone sees would go against the core strategy that Google and others pursue these days.
The idea is to gain information about you and give you personalized advertisement and services. This has been criticized with the term Filter Bubble [1]. Consider your phrasing "I just want search results", similarly the terms "have you googled it" or "let me google that for you".
Quaint.
In testing, they definitely don't seem to scrape every article:
</rant>
1. They are not only doing this with Wikipedia, but with many, many sites: "what is the smallest cell in the human body", "what is the biggest planet in the solar system".
2. The sites they choose to link are not always the highest quality sites, as in the two examples above. Why are these websites being featured?
3. Many times, the user will get their answer right then and there, and be done with the search process. The site misses a visitor. Even though these types of questions are about "facts", someone took the time to organize and give context to these "facts". Turning facts into useful, consumable content costs money. Google should not be taking visitors away from these sites.
4. There should be public information on the CTR of these snippets. See if it helps or hurts the user.
5. Google is abusing its power as a major search engine to reinforce structuring rules, such as microformats. With these rules, webmasters are giving more and more semantic meaning to their content, which means Google has an easier time completing their knowledge graph. They might link to the source site for a while, but there is no good argument for linking back to Wikipedia to attribute the fact that Jupiter is the largest planet, since it's a fact, just like 2+2 is 4 (no attribution).
6. Google is all about ML/NLP/AI driven knowledge. But in reality they are turning all of the internet content creators into a giant sweat shop for their knowledge graph. This is not fair, and sooner or later it will come back to bite them.
A happy DDG user, who still uses !g too often though.
"Do as I say, not as I do" -- Google
Not only are the curated pages blocked, but the entire domain is blocked as "pure spam". People who use Google to find a domain instead of typing the full URL now can't find it anywhere.
These assholes are just being anti-competitive now.
Honestly you might consider switching to a new domain given that you haven't launched yet. It can take a long time to get out of the Google doghouse.
By this definition webcache.googleusercontent.com qualifies.
It is a full copy of every site GoogleBot scrapes.
Google gives attribution to the original source, but if this isn't "scraping", what is?
They have been sued for this, and they've won. The benefits of a decent search engine outweigh the burden of infringing the copyrights of others. At least where Google and other search engines that cache websites are concerned.
Seriously, that was just a stretch, but they both say the full URL. So all of Google News is a scraper site, and any other summary given is a scraper site then. Sad.
Matt is looking for scrapers that rank better than the original, basically meaning that they have higher PageRank and more links.