In my opinion, 2021 was a bad year for the law as it relates to web scraping. The Supreme Court remanded hiQ Labs, and many high-profile lower-court cases ended badly for web scrapers. It's a darker shade of gray than it was in 2020. It can be navigated, but it's tricky.
I'm often reminded of the fact that in https://en.wikipedia.org/wiki/United_States_v._Swartz the scraped party JSTOR did not desire to press civil charges, but due to the criminal component of the CFAA, this was out of their hands - and the story ended in the worst possible way.
If the current legal landscape at least better restricts disputes over web scraping to civil litigation, it may not be a huge change for how companies look at their risks, but it could make a huge difference for individuals caught in the crossfire.
Scraping Facebook to make a clone of profiles shouldn't be held to the same scrutiny as scraping Facebook to do an internal analysis of user demographics for research purposes.
https://www.zyte.com/blog/van-buren-a-victory-for-web-scrape...
https://blog.ericgoldman.org/archives/2021/06/more-perspecti...
https://blog.ericgoldman.org/archives/2021/06/more-perspecti...
The name of my firm is McCarthy Garber Law. I write about scraping there when I have time (which I rarely do)!
Since I use both of these archives together, I wrote this code to iron out the differences between them:
I'd imagine most people's use cases need data which can change from day to day or week to week, but I do think this would be fantastic if I had a project looking at data across a longer timeframe.
I do think Common Crawl has a lot of potential for people to use instead of scraping, but I think it's for larger projects. It gave me the idea to look at the links to ID whether they belong to a business or non-business website.
Are you saying you only had problems because you weren't using a headless browser before, and that now, with both a headless browser and a proxy, it's generally enough to avoid being seen as a scraper?
This outcome was great news for web scrapers, as it means that so long as a website has made its data public, you are not in violation of the CFAA when you scrape the data, even if scraping is prohibited in some other way (T&Cs, robots.txt, etc).
Just because you can, doesn't mean you should. It would be better I think if there was a treatment of the ethics here, rather than a seemingly "ra-ra go bots" attitude, as though the only consideration is commercial.
- If they provide an API, then use it.
- Don't slam a website; ideally, spread requests out over the hours of the day when their target audience is least active (night time).
- If you can get cached data from somewhere that works, then use that.
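The last two points above (spread requests out, reuse cached data) can be sketched roughly as follows. This is only a minimal illustration, not anyone's production setup: `PoliteFetcher`, `is_off_peak`, the 5-second delay and the 01:00-06:00 quiet window are all made-up names and values, and the actual HTTP call is left to whatever `fetch` function you pass in.

```python
import time
from datetime import datetime, time as dtime
from typing import Callable, Dict, Optional


def is_off_peak(now: datetime,
                start: dtime = dtime(1, 0),
                end: dtime = dtime(6, 0)) -> bool:
    """True when `now` falls inside the assumed quiet window (site-local time)."""
    return start <= now.time() < end


class PoliteFetcher:
    """Wraps any fetch function with a cache and a minimum delay between requests."""

    def __init__(self,
                 fetch: Callable[[str], str],
                 min_delay_s: float = 5.0,
                 sleep: Callable[[float], None] = time.sleep,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._min_delay_s = min_delay_s
        self._sleep = sleep
        self._clock = clock
        self._cache: Dict[str, str] = {}
        self._last_request: Optional[float] = None

    def get(self, url: str) -> str:
        # Serve from the cache when possible so the site is not hit twice.
        if url in self._cache:
            return self._cache[url]
        # Otherwise wait until min_delay_s has passed since the last real request.
        if self._last_request is not None:
            wait = self._min_delay_s - (self._clock() - self._last_request)
            if wait > 0:
                self._sleep(wait)
        body = self._fetch(url)
        self._last_request = self._clock()
        self._cache[url] = body
        return body
```

A scheduler could call `is_off_peak(datetime.now())` before starting a run and skip it otherwise; the in-memory cache here would need to be replaced by something persistent for multi-day jobs.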
Most developers are respectful and only scrape what they really need, not only from an ethical point of view but also a cost and resources point of view. Scraping data is resource intensive and proxy costs can quickly rise to $1,000-$10,000 per month. So most only scrape the minimum they need.
The other thing here is that a lot of the most popular sites being scraped are also massive scrapers themselves. The big ecommerce sites are being scraped, but they are also scraping their competitors.
You can get around some web scraping blockers by just setting your user agent as Googlebot too which I find funny...
Good old government sites - rarely change!
Ignoring the fact that I didn't agree to anything just by virtue of requesting a page from a webserver (and, your server sent me the data!), that's such a meaningless phrase that it's certainly unenforceable. What is an automated fashion? Do I have to manually craft my HTTP request by hand-pulsing a voltage on an Ethernet cable, or do I have your permission to let Chrome automate that for me?
And the goal of web scraping is not to get illegal data, but to gain efficiency and performance by letting a computer do the repetitive tasks instead of doing them manually. It's a productivity tool. You can't make something illegal just because it's automated instead of a 'manual' operation.
Can someone sell me on Beautiful Soup or Scrapy or any of the others? Do they provide any advantages or features that I'd be missing out on?
I'm asking this because I run a small side project to show prices across retailers for a very small niche. The users are very, very happy. Even the vendors started contacting me to be listed on the comparison.
But I am unable to make a business out of it other than a few affiliate commissions.
Any good ideas?
- build an on-demand data API for a specific type of data and charge a premium for it. A good example is https://serpapi.com/, which serves Google data and charges a ~10X markup on proxy costs
- proxy solutions make good money. To scrape at scale you need proxies, and lots of users pay $1-5k per month. Lots of proxy solutions are doing $100k+ per month.
- build a tool that takes web scraped data, analyses/filters it and displays it to users. Lots of the biggest web scrapers are doing this, e.g. building product monitoring tools for e-commerce companies. There is lots of competition there, but you can do it in new markets, like NFTs, etc.
- hedge funds will pay huge money for web data, if you have 5 years of continuous data so they can backtest it.
Do you have any examples of such sites?
> hedge funds will pay huge money for web data, if you have 5 years of continuous data so they can backtest it.
what kind of web data would they be interested in?
With all that data you can do stuff like make heatmaps from pricing data, figure out the most attractive areas for certain profiles (singles, families, ...). You could then mash up that data to produce things like a "Walkscore" or let people indicate what's important for them (green areas, bars & restaurants, time & distance to other destinations, even crime levels) and then show real estate that meets their criteria.
Some sites in the US already show this but in other countries that's not the case, while the data's all there just to grab.
Most likely it wouldn't be legal and certainly not if you made money from it. But it's incredibly fun and hugely useful. Maybe that could get you started on some ideas!
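The "let people indicate what's important for them" idea above could be sketched as a simple weighted score per listing. Everything here is hypothetical: the feature names (`green_areas`, `bars_restaurants`, `safety`), the assumption that each feature has already been normalised to a 0..1 score, and the helper names `score_listing` and `rank_listings`.

```python
from typing import Dict, List


def score_listing(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of 0..1 feature scores; features missing from a listing count as 0."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        return 0.0
    weighted = sum(w * features.get(name, 0.0) for name, w in weights.items())
    return weighted / total_weight


def rank_listings(listings: List[dict], weights: Dict[str, float]) -> List[dict]:
    """Return listings sorted from best to worst match for the given weights."""
    return sorted(listings,
                  key=lambda listing: score_listing(listing["features"], weights),
                  reverse=True)


# Made-up listings: each feature is assumed to be pre-normalised to 0..1.
listings = [
    {"id": "A", "features": {"green_areas": 0.9, "bars_restaurants": 0.2, "safety": 0.8}},
    {"id": "B", "features": {"green_areas": 0.3, "bars_restaurants": 0.9, "safety": 0.6}},
]

family_weights = {"green_areas": 3, "safety": 2}
single_weights = {"bars_restaurants": 3, "green_areas": 1}

print([l["id"] for l in rank_listings(listings, family_weights)])
print([l["id"] for l in rank_listings(listings, single_weights)])
```

With these made-up numbers, the family profile ranks listing A first and the singles profile ranks listing B first; the hard part in practice is turning scraped raw data into sensible 0..1 scores, which this sketch does not attempt.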
They pay $500/mo for access to a bot that will allow them to make these purchases.
Most of the community lives on discord.
But what other than artificial scarcity drives people to spend hundreds of dollars on bots to snipe sneakers?!
In almost all cases I view web scraping as people trying to build businesses on top of other people's innovation and data. I know this isn't a popular opinion, so change my mind, but at the same time, I'm one of those business owners who fights with web scraping constantly, and my opinion is that those who are doing it to my platforms are doing so solely to steal data and build businesses on top of others' hard work.
- Scraping public information from government websites to do analysis: ethical, it's the public's data
- Scraping to help some company's customers more effectively use that company's product, for example scraping a medical office's insurance claims to help them automate their insurance remittance process: ethical
- Scraping faces to build a surveillance-tech company: disgusting
- Scraping your own website because your internal processes are so broken you can't get it any other way: ethical
- Scraping just to copy data someone worked hard to generate and then resell it: unethical
Political advocacy orgs rely a lot on scraping to collect political representative data that isn't available through any other means.
- Scraping photos to create a deep learning VQGAN+CLIP art generator: ethical
.. we can go on and on, but we should all agree scraping is a useful tool that should never be outlawed.
Wanted to include a slightly different application:
- Scraping multiple websites and organizing data in a new and useful way for customers: To me this would be ethical since it produces new value and does not just copy someone else's data as-is
You do not want information to be public and/or free? Put it under login and charge for it.
You want to prevent people from reusing the data you publish to build other (potentially competitive) products? Then use licensing and copyright, and the law.
However, banning a technological means because of what a minority could potentially do with it? Then make the internet illegal and the problem is fixed altogether.
And, I imagine, lots of A/B testing geared toward exactly that...keeping them on Google-owned properties.
Scraping is simply a way to get data. I used to run a team that was paid by large government contractors in the US to scrape their job posts from their career portals, and then deliver those posts via email, fax and snail mail to veterans' service officers near the job opening. It was required by regulation, and the only way to get the job data was to scrape. Many enterprise applicant tracking systems did not have a good way to automatically deliver that data or wanted $millions for that capability. Scraping was the best way and in some cases, the only way.
By the way, search engines like Google scrape data and index it.
However, there are a lot of web scraping use cases which are beneficial to the site being scraped and actually add value. Two examples:
- Google: Ahrefs & SEMRush scrape Google so they can provide SEO analytics to companies looking to grow. Google's keyword analytics aren't great, so Google has effectively outsourced providing a good analytics tool to Ahrefs & SEMRush, whose products increase the value of the Google SERPs ecosystem.
- Amazon + Other E-Commerce: Amazon wants brands and 3rd party stores to list products on their site, and the companies scraping Amazon to provide product placement tools to their users make it easier and more profitable to list products on Amazon, leading to more and more companies listing products there.
Archiving is unethical?
The best way to stop someone trying to make a buck on your hard work is to go direct to their customers and do a better job. If you can't, what they're selling is something on top of your offering and you aren't serving that market, and you either should start serving it, or make a deal so the scrapers can continue to do it without impacting your service.
As someone who had to do scraping in the past, and went through having a free open API that served our needs perfectly replaced with an account-based one that required we make 100x the queries, it was really frustrating that the company refused to even respond to requests for specific business accommodations on data access.
- There is no external API for getting scheduled streams or when they have gone live AFAIK. This lets me be notified of new stuff to watch.
- The API for getting a channel's members is locked down. I applied for access to it 6 months ago and haven't heard anything about it from YouTube so I just scrape it to give members perks.
Why even bother having the API there? So much value can be added by people building on top of YouTube and other large sites; it's a shame that most of these large sites do nothing to provide API access and people have to go out of their way to scrape them...
Let's say you made a recipes website and I would like to build an app that will order the ingredients for a meal.
It would be useful to extract the recipes, so that I can create experiences like users picking a meal and have the ingredients delivered.
I guess I can't show your recipes as it can be copyright infringement but I can link it to you and sell the tomatoes.
Also, although copying someone's work is unethical and likely illegal, there is nothing unethical or illegal about using computers to analyse the data out there. I should be able to analyse recipe publications just as I can measure the air pollution. Web scraping comes in because the semantic web never happened.
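For the recipe case, here is a rough sketch of what the extraction step could look like when a page happens to embed schema.org `Recipe` data as JSON-LD. That is an assumption: plenty of pages don't, and then you are back to parsing the visible HTML. `extract_ingredients` and `_JsonLdCollector` are made-up names, and the sketch ignores `@graph` wrappers and list-valued `@type`.

```python
import json
from html.parser import HTMLParser
from typing import List


class _JsonLdCollector(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self.blocks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self.blocks.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        # Script content may arrive in several chunks, so append to the open block.
        if self._in_jsonld:
            self.blocks[-1] += data


def extract_ingredients(html: str) -> List[str]:
    """Return the recipeIngredient strings of any top-level Recipe objects found."""
    collector = _JsonLdCollector()
    collector.feed(html)
    collector.close()
    ingredients: List[str] = []
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than fail the whole page
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and item.get("@type") == "Recipe":
                ingredients.extend(item.get("recipeIngredient", []))
    return ingredients
```

This only pulls out the ingredient list, which is the part needed to "sell the tomatoes"; the recipe text itself stays on the original site and can be linked to.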
I think we should all be able to use other people's work to build something else on top of it. Of course, I do not advocate outright taking it and reselling it as our own.
For example, I would like to be able to create an app with Netflix content but obviously I don't expect to be able to stream their content as if it is mine. What I should be able to do is to create an app with an experience designed by me that lets you stream their movies if you pay them.
How would scraping, say, reddit, differ from the business model of Reddit itself?
> those that are doing it to my platforms are doing so solely to steal data
What kind of data are you talking about?
However, the disgusting data brokers that employ most of the custom scrapers are usually unethical. That's why I don't trust any person or company that admits being involved professionally in "scraping", because most of the time that means "we collect personal information that got leaked elsewhere and sell it on".
I work for an ecommerce company and we scrape competitors for price information. Should this automated process using APIs not be okay, we'll have humans do it. Less efficient for us, more traffic for a competitor. Should they provide a paid API with price information available, I'm sure we'd pay.