"scraping attacks"
Scraping is not an attack. Monopolists want to pretend they own your data because they get unlimited access to monetize it whereas competitors should have none.
"self-compromised"
Monopolists want to sell you, so it's imperative they maintain the fiction of "one person, one account". By admitting you own your account, they'd have to allow sharing and they wouldn't be able to provide their customers (advertisers) with reliable data about individuals.
"protect people from scraping"
Monopolists will protect themselves and call it protecting you. They will attempt to make you afraid of some other actor using your data in harmful ways so as to distract from how they monetize you and use your data in harmful ways.
"deter the abuse"
Monopolists don't want to argue about what constitutes abuse. Anything they write in their TOS is entirely for their benefit and only constrained by local law (if that). They will abuse you to the fullest extent they can get away with while arguing that any action to use your rights is "abuse."
"safeguard people against clone sites"
Monopolists want to maintain their monopoly; there is no greater threat to that monopoly than the direct challenge posed by allowing data to move freely.
--
More subtle but even more ironic rhetorical points
"for hire" / "paying for access"
Implying that people making money (gasp) by providing this service is inherently bad.
"industry leader in taking legal action" + "across many platforms and national boundaries, also requires a collective effort from platforms, policymakers and civil society"
Monopolists can pay high priced marketers to rebrand them as patriotic hero figures fighting valiantly for the little guy.
Over the last few years or so it feels like, to reference a @dril tweet[1], Facebook has just been 'turning a big dial taht says "data access" on it and constantly looking back at the audience for approval like a contestant on the price is right' with how much it allows 3rd parties to get at its data.
Keep in mind ~5 years ago the big thing at FB was "Open Graph" and "Graph Search" which gave everyone really in-depth access to their data with the idea that Facebook would be the "data platform" on top of which all of these 3rd parties would build apps and interfaces. This of course eventually resulted in the whole Cambridge Analytica thing and now this gigantic swing in the other direction of being overly protective of the data as a kneejerk PR reaction to all the bad press.
FB loved sharing data and provided a direct API for accessing it when the public narrative was about data freedom and 3rd party developer friendliness. Now that the public narrative is all about privacy, it hates giving any access at all and goes around suing web scrapers.
Facebook will happily align itself in whatever way results in the least public outcry arguing they shouldn't be allowed to have the data in the first place, regardless of whether that means giving access or restricting it.
Nothing prevents him from restricting access to his pages and data to "trusted" friends.
Whether we make that spread of information legal or not does little to affect whether it happens.
There are two things that might help. First, don't share as much information. Once it's no longer limited to you or your close group of friends, who hopefully won't share it along with your name, it's mostly out of your control. Second, put limits (laws) on what information companies are able to synthesize about you, and how long they can retain it. If there's less information created about you (or it's ephemeral, created and destroyed as needed), and if they need to clean out older data, there's less to be shared or stolen.
They do this to collect information of foreign policy interest to them, to silence political dissidents abroad, etc.
For example: https://www.washingtonpost.com/national-security/china-harve...
And: https://www.propublica.org/article/even-on-us-campuses-china...
Let's start with one thing: copyright on databases. Take IMDb: they collect and combine totally open data on movie casts, crews, soundtracks used and so on. Anyone can go to the cinema, wait until the movie ends, write down the data from the credits roll and put it in a database. There's no prohibition on this activity. A cinema may prohibit filming inside, but not using pencil and paper. Or you may buy a DVD released later, and do just the same. Or you may even email the movie company asking for that data in electronic form, and chances are they will send it to you or point to some promo materials website where it is already published.
But the entire database is a product of work, and that makes it valuable. The company or organization spent time and money collecting, indexing and cross-linking that data, and has a right to bank on that work. Simply copying that database for commercial purposes _is_ stealing. This is why we have database copyright laws.
Now back to Meta. They created this product and made it attractive enough that people add their data voluntarily. Every single piece of data is quite open (maybe not really so for personal bits like face photos, emails and phone numbers). Meta spent a lot of cash making and keeping the product that attractive, and now banks on the collected data by targeting ads.
Nothing in the world prohibits anyone else from creating a service, making it valuable, attracting people, collecting data (according to data collection laws) and banking on that. But just copying data collected by Meta is stealing, and Meta is within its rights to protect it. The fact that Meta did it first doesn't make it a monopolist. In fact, there are lots of companies doing the same, like Google, Amazon, Apple, eBay etc. So in my opinion it is not a monopoly defending its position, but rather a business defending its assets from theft.
> a US subsidiary of a "Chinese national" "high-tech" enterprise
Replacing it with "a business" would do just fine.
It's not clear what Facebook's position on scraping truly is. Sometimes they downplay it as "normalized and widespread," and other times they castigate it as inexplicably legal and clearly immoral, or even outright "in violation of state and federal law." For example:
- April 2021. Researchers find an exposed database containing the scraped data of 533 million Facebook users. Some news reports refer to it as a "breach." Facebook attempts to downplay the issue as the result of third party scraping. Headline in ZDNet: "Internal Facebook email reveals intent to frame data scraping as ‘normalized, broad industry issue’" [0]
- October 2020. Facebook announces lawsuits against companies it claimed created a "malicious extension on Google’s Chrome Web Store designed to scrape Facebook, in violation of Facebook’s Terms and Policies and state and federal law." [1]
So... which is it? Does Facebook believe that scraping is a "broad, normalized industry issue"? Or is it a violation of "state and federal law"? It seems like they measure the severity of its impact primarily based on the reactions of political commentators.
And what's the difference between automating a browser and automating an API client? Why did Facebook design an API for accessing the data they collected, if it's illegal to collect? They've even claimed to be the victim of Cambridge Analytica, who purchased a "quiz" application created by a developer who pieced it together using code straight from the "examples" section of Facebook's API documentation.
There is one obvious resolution to this apparent contradiction. If we remove Facebook from the question, then the contradiction resolves itself. All we need to do is stop presuming that Facebook has the right to collect and retain this data in the first place. And as a user, if you publish your data to a website designed for sharing it with other people, then by definition it is no longer private data. Therein lies the central question: what is "semi-private" data, and who controls its boundaries?
[0] https://www.zdnet.com/article/facebook-internal-email-reveal...
[1] https://about.fb.com/news/2020/10/taking-legal-action-agains...
p.s. another thing they never mention is why companies want to scrape lists of facebook users. perhaps it might have something to do with the "lookalike audience" feature, and its more precisely targetable predecessors, which allow advertisers to upload a list of usernames and email addresses for targeted advertising?
But account hijacking and mass-creation of accounts just to access private pages are clear violations of the Facebook and Instagram ToS, so they surely can sue for that.
Does violating a website's TOS mean you're accessing it beyond your authority, making it a violation of the US's Computer Fraud and Abuse Act?
"Since when do I get sued for taking too many free samples from Costco?" -> "Since you started taking millions of them to resell"
Relevant life lesson: don't do things to people with money that they might perceive as harm.
Corollary: Being sued is as much punishment as losing a suit for most people.
At the same time, I don't think all of Instagram's users care if their images are hidden, or not.
It's quite unfortunate Facebook/Meta is using hostile language and the word "scraping" together in this case. Scraping is a legitimate process used by various business models to gather information from the Web, which itself was originally intended to be an open forum for people to share content.
Hostile business models have corrupted that intent and turned it into a competitive environment that is harming users and legitimate models which may not have the funding larger corporations can muster.
I have a "scraper" I've built that will either snapshot a page from a user's browser or crawl it remotely with Selenium/Firefox, on the user's behalf, to save the content in an index for searching later, by that user. It's not automated, nor does it parse and crawl URLs in the pages saved. It doesn't use page content in a wider context, either.
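For anyone wondering what "save the content in an index for searching later" can look like, here's a minimal sketch. It is not the actual tool: `PageIndex`, its methods and the example URLs are made-up names for illustration, and it assumes the page text has already been extracted by whatever did the snapshot.

```python
import re
from collections import defaultdict


class PageIndex:
    """Tiny per-user inverted index over saved page text (hypothetical helper)."""

    def __init__(self):
        # Maps each word to the set of URLs whose saved text contains it.
        self._postings = defaultdict(set)

    @staticmethod
    def _tokens(text):
        # Lowercase alphanumeric words; a deliberately crude tokenizer.
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def add(self, url, text):
        """Record a saved page under every word it contains."""
        for word in self._tokens(text):
            self._postings[word].add(url)

    def search(self, query):
        """Return the sorted URLs of pages containing every word in the query."""
        words = self._tokens(query)
        if not words:
            return []
        hits = set.intersection(*(self._postings.get(w, set()) for w in words))
        return sorted(hits)
```

Usage would be along the lines of `idx.add(url, page_text)` whenever the user saves a page, then `idx.search("some words")` later. Nothing here follows links or fetches anything on its own, which matches the "not automated" constraint described above.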
I've spent a significant amount of time trying to "work around" anti-scraping efforts by various companies and it's frustrating to see hostility instead of cooperation in certain types of use.
The tool lets you download the contact info of your friends, which you should be able to do anyway. In fact Facebook tries to trick its users into thinking they can do this with their data takeout option, but the downloaded files don't actually include any of the contact info for your contacts. Which makes zero sense, considering the entire point of Facebook is that it's a digital rolodex for storing your friends' contact info.
I appreciate that 'system authenticated person' is a smaller set than those who can access anything publicly accessible, and that the former is a subset of the latter.
Scripts are tools, and like any tool they're extensions of the self. If it's morally okay to do it by hand, it's morally okay to do it with a script, so long as my script is respectful of server resources.
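To make "respectful of server resources" a bit more concrete, here's a rough sketch of one way to do it: enforce a minimum gap between consecutive requests. `PoliteThrottle` and the two-second interval in the usage note are my own assumptions, not anything a site mandates; real politeness would also mean honoring robots.txt and backing off on errors.

```python
import time


class PoliteThrottle:
    """Enforce a minimum delay between consecutive requests (hypothetical helper)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests (assumed policy)
        self._clock = clock               # injectable so the logic can be tested
        self._sleep = sleep
        self._last = None                 # time of the previous request, if any

    def wait(self):
        """Block until min_interval has elapsed since the previous call.

        Returns the number of seconds slept (0.0 if no wait was needed).
        """
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        # Record when this request is allowed to go out.
        self._last = now + delay
        return delay
```

You'd create one `PoliteThrottle(2.0)` per host and call `wait()` right before each request. That's roughly the pace of a person clicking around, which is the whole point of the "if it's okay by hand" argument.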
When sharing photos on Instagram, there is no such understanding; news outlets have been logging in to view and publish your Instagram photos.
Thought experiment: if you want to keep control over your data, try something radical: don't hand it to Meta/FB/IG at all
(Full disclosure, I'm neither on FB nor IG)
"Just Facebook" has made the web shittier; entire realms of essentially public, often great content hidden behind a login wall.
“You” only exist in numerous empty statements about “privacy”, “respect”, etc. If you are feeling artsy, you can make that hyped NFT thing out of those, and see whether those kilobytes of text are really worth anything.
The scrapers have become the scrapees. The horror.
It is interesting how they try to position this as a Chinese attack on them.
It makes me think that there are many people on the CCP's dole: rich, powerful, famous people who are somehow beholden to the CCP in some unknown way, but we can all guess correctly that they are all old white men who have previously been seen with young females.
With Cambridge Analytica:
- Facebook allowed users (with informed consent) to allow external developers to access their data and limited data about their friends, in order to build social-enabled apps.
- CA exploited this to scrape basic profile data from a large number of users. It broke the ToS by doing so (in particular by using the data for purposes different than stated)
Here the same is happening:
- people are giving a third-party company access to their profile, which includes access to friends' data (in fact a lot more than what the app platform allowed)
- the company is scraping all the data.
At the time of CA, the criticism was that Facebook didn't do enough to enforce its ToS (or maybe that the data sharing should not have been allowed in the first place? But the terms were common knowledge and the attack potential became clear only in hindsight). Here, people are criticizing Facebook for in fact enforcing its ToS.
Also note that strong enforcement against scraping is one of the mandates that came from the FTC settlement.
It seems inevitable that any news about Facebook/Meta is read in the worst possible light these days, even when the criticism is self-contradictory. I would expect less superficial commentary from HN.
I agree with your first paragraph, and my point is that it is not possible to argue that Facebook should share data more broadly and allow scraping while at the same time being critical that Facebook allowed CA to happen in the first place.
If the CA scandal was a wake-up call, it appears it was not internalized enough for people to understand the implications of what they're suggesting in this thread?
[1] https://www.facebook.com/ParquesNacionalesdeArgentina
Those accounts wouldn't be allowed to view private data though unless they friend/follow the person first, so they'll still be limited to data the account holders intend to be public and available to anyone.
There's also no evidence that the scraped data was aggregated at scale or commingled in any way, so even if customers provided their actual credentials which grant them access to private data of their friends, the scraper didn't share it with anyone else but them.
Did they misuse the collected data? Sure. But people granted access to that data knowingly. This wasn't really an attack in my view.
Facebook wasn’t really complicit and definitely didn’t sell/give away any data.
"self-compromised" lol
clearly these people just wanted an automated way to access their own data
GDPR and CCPA (and probably many other national/state privacy laws) force Facebook/Instagram/etc. to let you download and/or delete your data without using third party websites. Usually people self-compromise their accounts in exchange for money: https://www.buzzfeednews.com/article/craigsilverman/facebook...
Google blocked them.
There was animus between the two companies that resulted in Facebook not making an official Android app until 2010.
Sorry for being vague here, I haven't publicly disclosed it yet, but will probably have to if it doesn't get fixed.
I was a webmaster of a set of servers on a major university's network. I also had access (enough to run arbitrary programs that had pretty much full ingress/egress to the public internet) to a number of machines across the campus's network. Through some of my coursework and ACM chapter activities I met some other similarly minded technical people with similar levels of access.
We decided that it would be fun to use our superpowers (access + programming abilities + curiosity) to sign up for various accounts on FB and essentially scrape and friend as much as possible. At the time they had some rate limiting, some IP banning (which wasn't terrible because the Uni gave public IPv4 addrs to all machines on campus by default) and then added some early CAPTCHA which we ended up breaking pretty trivially with some Python and image recognition code.
Never got sued... :) Never really did much with the scripts or data except test that they worked. Fun times.
Conspiracy to access a protected computer system - that is, typing on IRC. weev didn't write any of the code or access the API.
The key phrase is "publicly accessible." This wasn't that. The scraping was done by automating Facebook accounts, which have terms of service, which forbid scraping.
ToS/EULAs make a big difference. They're the reason Blizzard could shut down bnetd's StarCraft server. They're why no one can legally reverse engineer Oracle to create a drop-in replacement, despite interoperability provisions.
More and more platforms are putting the majority of your user-generated content behind auth walls with ToS because that's how they prevent competitors from swiping it.
Strictly referencing EULAs for user-owned copies of software here, not ToS:
That is not true. The Blizzard court clearly erred in not considering unconscionability when analyzing the EULA. As for Oracle, the interoperability provisions are what overrides that part of the EULA.
In this case, the account requirement would be a technicality and the data, for all intents and purposes, would still be considered "publicly accessible" if anyone with an account can access it.
> After paying for access to the scraping software, customers self-compromised their Facebook and Instagram accounts by providing their authentication information to Octopus.
They didn't "self-compromise" their account. They trust Octopus to act on their behalf, and unlike Facebook, Octopus' interests are most likely more aligned with their users' since their service is paid. This is no different from handing your Facebook credentials to your social media manager or secretary. There's no evidence that Octopus misused this access in any way.
> Octopus designed the software to scrape data accessible to the user when logged into their accounts, including data about their Facebook Friends such as email address, phone number, gender and date of birth, as well as Instagram followers and engagement information such as name, user profile URL, location and number of likes and comments per post.
This is either information people intend to be public or information they trust their friends to keep private. Now if Octopus was leaking the private information to third-parties it would be one thing, but so far I see no evidence Octopus was disclosing the scraped information to anyone but their customer (who is already authorized to access it).
> Meta is an industry leader in taking legal action to protect people from scraping and exposing these types of services
Translation: Meta is an industry leader in protecting its disgusting business model that hinges on keeping public data behind a walled garden with an unacceptable "privacy" policy. There wouldn't be a market for Octopus (or other scrapers) if Facebook already allowed customers to efficiently access information they're already entitled to, but that would be against their interests as their entire business hinges on information being held hostage.
They've created a problem, are selling the cure (well in this case monetizing it via ads) and are now pissed off that someone else is selling the cure for cheaper.
https://www.nytimes.com/2020/01/18/technology/clearview-priv...
On one side, you have people who say any form of scraping should be disallowed, even prosecutable. This went so far that the Department of Justice on behalf of AT&T prosecuted a case of URL modification [1]. One of the few bright spots for this psychotic Supreme Court was to curtail the government's power under the CFAA by limiting what constituted "unauthorized" access [2].
On the other hand, there are those who think that any level of scraping should be fine and I think that's untenable too. Consider Yahoo indexing of Stack Overflow [3]:
> In the meantime, since Yahoo (via Slurp!) is about 0.3% of our traffic, but insists on rudely consuming a huge chunk of our prime-time bandwidth, they’re getting IP banned and blocked.
Do these "scraping extremists" think such actions should be illegal? It's actually not that far-fetched given the Ninth Circuit decided LinkedIn wrongly blocked HiQ scraping [4]. Like if you change your website with the intent that it'll make scraping more difficult, is that a problem? What if it's an unintended side effect?
Additionally, companies like Meta, Google and Apple are going to be held far more accountable for abiding by data retention laws and regulations than any scraper. If it's OK to scrape FB.com completely, that information is out there forever.
I certainly think the government shouldn't prosecute on behalf of companies. At least that should expose to people how the government's #1 priority is in fact to protect the true constituents: corporations and the capital-owning class.
[1]: https://www.techdirt.com/2013/09/30/dojs-insane-argument-aga...
[2]: https://en.wikipedia.org/wiki/Van_Buren_v._United_States
[3]: https://stackoverflow.blog/2009/06/16/the-perfect-web-spider...
[4]: https://blog.ericgoldman.org/archives/2019/09/ninth-circuit-...
Yeah... uhm... I used to do exactly this sort of thing...
When I was a teenager, I would look at the URL of whatever site I was on, and would change a number here, or a letter there; and see what I got.
Sometimes you get nothing, sometimes you get something. Sometimes that something is quite interesting.
Sure, as long as Meta is not the one selling the data to Cambridge Analytica it's wrong.
Where's the "posts shared privately amongst friends made public" part? There are two cases here:
1. A service that logs in as the customer (who voluntarily provides their credentials) and scrapes information visible to said customer on their behalf. Nothing about "made available publicly" is alleged.
2. An individual using a pool of bot accounts to scrape posts visible to any logged in user. Nothing about "shared privately" is alleged. To be clear I don't like the method, but I'll also have to admit I've used one of the Instagram "clone sites" in the past thanks to their login wall.
Unless I missed something, it sounds like you just made it up.
Is the scraper to blame here, or the friend?
“Privately shared with friends” used to mean that only you and your friends know something. You don't “share” anything with “friends” on a social network. You give the information to a giant corporation. If it finds it suitable, it then delivers it to other users, but only after it records your location, analyzes the content to check if you were, say, affected by some melodramatic event (and therefore should be tricked into spending more time… I mean, get “personal recommendations” for a certain kind of content), and does a billion other things.
If you consider that this is fine, please relay all your conversations with family and friends through me from now on. I offer secure, reliable, fast, yada yada communication service. And it's hip! Ask anyone on the street what they use.
I think Meta might be mixing up these two cases here on purpose to make it look like web scraping is as bad as stealing photos to publish it on a clone website.
I don't know how far Facebook can get with this; I thought the LinkedIn court ruling made scraping de facto legal.
Well, color me surprised /s
Fuck Facebook. Meta. Or whatever you want to call it.
3rd parties don't have consent from users. Users don't even have any idea these companies might be holding their data.
Indeed; the users probably wanted to make the data public, if scraper accounts could see it. There is a GDPR allowance for data "manifestly made public by the data subject".
https://gdpr-info.eu/art-9-gdpr/
Here, it's just Facebook wanting to keep the data inside a walled garden.
For the same reason, I quit LinkedIn and made my own site. I don't want people to have to sign in to see my profile.
As for why their news site is on a facebook domain, I'm not sure. It would make more sense for it to be under meta instead.