I completely accept how important scraping is as a data source, but that doesn't make it any more legal. It's in a space right now where only big companies can take unmitigated advantage of the tool, because it'd cost millions of dollars to successfully defend a CFAA suit.
I work for Scrapinghub as well and try to understand the law around this. I can help with some pointers to why I think some web scraping isn't illegal… there are of courses some limits to this.
When the data scraped is "is publicly available on the Internet, without requiring any login, password, or other individualized grant of access", the Eastern District Court of Virginia in Cvent vs Eventbrite (https://casetext.com/case/cvent-inc-v-eventbrite) ruled one could not be deemed to be exceeding unauthorized access.
There are two ways, that I know of, that courts have ruled you can exceed your authorization:
- When the site owner has contacted you and removed your authorization in a written manner, as happened on Craigslist vs 3taps.
- By accepting the terms of service and agreeing against scraping. You have to do this through a "clickwrap" ToS, rather than a "browsewrap". You can read about the differences here: https://termsfeed.com/blog/browsewrap-clickwrap/
As a matter of policy, we don't scrape any site with a ToS with clear anti-scraping language and which forces us to create an account or "constructively agree" as part of the use of the site.
Any user wishing to revoke authorization for anyone using our platform can make an abuse report on our site– we tend to handle these within 24 hours and haven't had a single claim go further than this stage, as we aim to be reasonable and look for a way to provide value to both sides.
Most websites have anti-scraping boilerplate in their ToU. I'm pretty sure that's in the "customized" ToU I got from LegalZoom. So you're basically saying that if the data is not behind a login and doesn't require you to fill any forms that contain either a checkbox or nearby language that indicates submitting the form constitutes acceptance of the ToU, you'll scrape it, even if the ToU explicitly bans scraping.
What do you plan to do when you do receive a C&D from someone that claims you've agreed to their browsewrap ToU? Are you going to argue that your use is not unauthorized since the data is public?
I guess I assume you'll comply with any C&D since you state that receipt of a written removal constitutes a revocation of authorization. However, I don't believe that's enough for some. Check out QVC v. Resultly, where QVC sued on a CFAA claim based on browse-wrap agreements (they lost; Resultly asserted permission was granted by robots.txt and the Court agreed).
Beyond just CFAA claims, there are copyright and trademark claims attached to most scraping cases. Those have unfortunately succeeded most of the time. The most egregious is Ticketmaster v RMG, where it was ruled that RMG had violated Ticketmaster's copyright by downloading the page (specifically, making a copy of the page contents in RAM and extracting only non-copyrightable content, and then discarding the complete copy; in short, downloading the page). Facebook v. Power Ventures is also pretty brutal.
I hope Scrapinghub is well-funded enough to trailblaze some space in the law here, because we really need it.
I'm not sure Scrapinghub is funded externally as I couldn't find anything on their valuation or revenues so I assume that they are bootsrapped.
I do not discount any of what you wrote but a lot of it are imagined dangers, scrapinghub would be immune to such cases unless they sided with their customers like 3taps did. 3taps did not stop scraping for their client Padmapper. I don't think scrapinghub is willing to put their neck out for someone paying $20/month to scrape craigslist. In fact, those are the shittiest segments of this market imo, the bottom feeders who demand excessively unrealistic expectations from scraping as a blackbox magical world that will solve their problems.
Got a source on that? Google scrapes all the time. This is how they index all the pages it discovers.
The only real scenario I recall is 3tap vs craigslist but they just kept scraping craigslist even after they banned their IP addresses with multiple proxies.
Then there are airplane ticket websites scraping each other and getting into hot waters.
Having said that it's not a clear cut definition as you'd like to put it. CFAA ruling was only because craigslist felt directly threatened by Padmapper which relied on 3taps.
This really is a shitty shitty business model. All that work 3taps did for you guys and they take all the heat? I don't know why 3taps didn't just comply, was PadMapper 100% of their business?
The source is the CFAA, which makes it a crime and/or a tort to commit any "unauthorized" access to a computer system. Because authorization is not defined in the statute, it's a matter of interpretation whether or not one's use is unauthorized. Historically, judges have strongly disfavored scrapers.
Most boilerplate Terms of Use contain language that forbids all "spiders, scrapers, bots, and all other automated means of access", or something along those lines. Most companies assert that accessing any page beyond the front page of their site constitutes a binding agreement to their ToU, and thus that any automated access is "unauthorized". Scrapinghub appears to be of the opinion that browsewrap agreements are unenforceable, and while some judges have agreed with that, some haven't.
Beyond the argument that scraping is a breach of contract (violating their Terms of Use) and that since you agreed to that contract, you understood that automated access was unauthorized, there's the potential criminal element, which was deployed against weev for exposing a minor data leak in AT&T's system and against Aaron Swartz for exceeding MIT's authorized access to JSTOR and downloading publicly-funded academic data (including data which was out of copyright). You basically just have to really hope that no one inside the company you've "wronged" is good friends with a prosecutor.
Because there is a lot of grey area around what may or may not constitute "unauthorized" access to a computer system, if a company does bring a tort claim against you for accessing their system without authorization, you might actually win -- if you can afford the time and money to fight them for the minimum 3-5 years it'll take your case to resolve. This is hundreds of thousands in legal fees easy.
3taps eventually had no choice but to give up because they couldn't take the legal costs anymore, and Power Ventures tried to stick it out and ended up not only being held liable for $3 million in damages to Facebook's systems when no actual damage had occurred at all, but the veil was pierced and the founder held personally liable. It's obvious from the court documents that he was struggling to afford counsel, and companies must be represented by an attorney, so he didn't even have an option to try to represent himself.
>Google scrapes all the time. This is how they index all the pages it discovers.
Yes, Google's operations are, strictly speaking, illegal on various fronts. They depend heavily on automated access, which many sites they index explicitly forbid and thus Google is committing "unauthorized access" to these computer systems, and they also store complete copies of the site and the individual images displayed on the site, virtually all of which are protected by copyright, and all of which constitutes flagrant violation of copyright law.
If someone did bring a CFAA claim against Google for this (which no one would, because Google is one of the wealthiest companies in the world, and it'd therefore cost tens of millions to sue them), Google would likely argue that robots.txt is the only authorization it is obligated to observe, which may or may not be an effective argument. Google also make no guarantees about the extent to which it obeys robots.txt; it's a way to signal your desires to Google, which it may or may not honor.
tl;dr The very short answer to all of this is that traditionally, the legal system has been extremely suspicious of scrapers and has treated them very badly, applying concepts intended for the physical world like trespass to chattels to server access. This has been improving somewhat in recent years, but is still a very financially and legally precarious situation in which to find oneself. The people who get away with it get away with it because no one sued them before they were too big to sue.