The State of Web Scraping 2022

(scrapeops.io)

291 points | Ian_Kerins | 4y ago | 144 comments

KieranMac 4y ago

As a lawyer whose primary focus is web scraping, I find this article misleading and inaccurate in many ways. While it is true that the Van Buren case is generally positive for web scraping, the overall legal landscape is still murky. The main battleground for web scraping legal issues is shifting from the CFAA to breach of contract and various state-law issues, including misappropriation, unjust enrichment, and trespass to chattels.

In my opinion, 2021 was a bad year for the law as it relates to web scraping. The Supreme Court remanded hiQ Labs, and many high-profile lower-court cases ended badly for web scrapers. It's a darker shade of gray than it was in 2020. It can be navigated, but it's tricky.

btown 4y ago

Not a lawyer, but is it at least true that web scraping alone would now be significantly less likely to be a basis for federal criminal prosecution under the CFAA?

I'm often reminded of the fact that in https://en.wikipedia.org/wiki/United_States_v._Swartz the scraped party JSTOR did not desire to press civil charges, but due to the criminal component of the CFAA, this was out of their hands - and the story ended in the worst possible way.

If the current legal landscape at least better restricts disputes over web scraping to civil litigation, it may not be a huge change for how companies look at their risks, but it could make a huge difference for individuals caught in the crossfire.

KieranMac 4y ago

Yes, I would agree with that first sentence. After Van Buren, web scraping alone would now be significantly less likely to be a basis for federal criminal prosecution under the CFAA.

digitcatphd 4y ago

Good take. IMO, ethically speaking, we should not penalize scrapers themselves but do so based on their use.

Scraping Facebook to make a clone of profiles shouldn’t be held to the same scrutiny as scraping Facebook to do an internal analysis of user demographics for research purposes.

ForHackernews 4y ago

Why should either be discouraged?

2 more replies

RobSm 4y ago

How many contracts does Google breach by scraping billions of pages every month?

Fatnino 4y ago

Google doesn't have to proactively try very hard to ingest sites. If something is difficult for Google to scrape, they don't spend loads of engineer hours on getting it to work. They just leave the site out, and the webmaster there will quickly bend over backwards to make sure Google can scrape them. When something gets scraped into Google inadvertently, it's because the website made not even the slightest effort to protect itself.

KieranMac 4y ago

Given the nuances of browsewrap contract enforceability, perhaps not as many as you suggest. The tricky part with navigating this gray area is knowing the likely circumstances when a contract of adhesion may give rise to an actual legal claim. There are patterns.

2 more replies

Ian_Kerins (OP) 4y ago

Interesting! I'm not a lawyer, so the content for this piece was based on commentary in the article below. It was written by their lawyer, but I would love to hear your counterpoint to it. Always good to get multiple viewpoints on something.

https://www.zyte.com/blog/van-buren-a-victory-for-web-scrape...

KieranMac 4y ago

The Zyte article isn't inaccurate; it's just a simplified assessment of a complicated issue. If you'd like a more nuanced perspective on this, please read my guest post on Prof. Goldman's blog.

https://blog.ericgoldman.org/archives/2021/06/more-perspecti...

faizshah 4y ago

Is there a good blog or something that tracks these cases?

KieranMac 4y ago

Prof. Eric Goldman's blog is probably the #1 site historically on scraping and the law. I've contributed to it a few times.

https://blog.ericgoldman.org/archives/2021/06/more-perspecti...

The name of my firm is McCarthy Garber Law. I write about scraping there when I have time (which I rarely do)!

1 more reply

samcrawford 4y ago

Enjoyed reading your bio on your website. Sub 24 hour at Leadville is super impressive! (Coming from someone who has not managed 24 hours at Western States... Yet...)

KieranMac 4y ago

Leadville is just 45 minutes up the road for me, so I'm kind of cheating!

Seattle3503 4y ago

Is there a good blog post or summary that I could read?

KieranMac 4y ago

https://mccarthygarberlaw.com/a-comprehensive-legal-guide-to...

ok_coo 4y ago

Time for me to advocate again for people to use Common Crawl. Please don't slam people's websites; look for alternatives before scraping. There are probably other, better options: APIs, data set downloads, etc.

https://commoncrawl.org/

dewey 4y ago

I'd guess that for many popular scraping use cases this is not really useful, as it's usually about being quick and up to date (job postings, availability information, e-commerce, SERPs, ...), not about having a big corpus of historic data.

weird-eye-issue 4y ago

Have you used this in real world scenarios? Or is it just a nice hypothetical that sounds great in theory but almost never works in practice?

LunaSea 4y ago

Common Crawl is missing far too many URLs for it to be useful in a real world scenario.

Chris2048 4y ago

But can't you add to their index?

1 more reply

mycall 4y ago

I wish web.archive.org had an index by someone like common crawl. There is lots of great stuff on archive.org

wumpus 4y ago

web.archive.org has a CDX index, similar to Common Crawl.

Since I use both of these archives together, I wrote this code to iron out the differences between them:

https://github.com/cocrawler/cdx_toolkit
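For readers who have not used a CDX index, here is a rough sketch of what a query looks like, using only the Python standard library rather than cdx_toolkit's own API. The endpoint URLs and the Common Crawl collection name `CC-MAIN-2021-49` are assumptions for illustration, and the helper names are hypothetical; check each archive's documentation for current values.

```python
from urllib.parse import urlencode

# Assumed endpoints. The Common Crawl collection name changes with every crawl.
WAYBACK_CDX = "https://web.archive.org/cdx/search/cdx"
COMMON_CRAWL_CDX = "https://index.commoncrawl.org/CC-MAIN-2021-49-index"


def cdx_query_url(endpoint, url_pattern, limit=10):
    """Build a CDX query URL that asks for JSON output."""
    params = {"url": url_pattern, "output": "json", "limit": limit}
    return f"{endpoint}?{urlencode(params)}"


def rows_to_records(rows):
    """Turn a header-first table (a list of lists whose first row holds the
    field names) into a list of dicts keyed by field name."""
    if not rows:
        return []
    header, *data = rows
    return [dict(zip(header, row)) for row in data]


# Made-up sample in the header-first shape, so the sketch runs offline.
sample = [
    ["timestamp", "original", "statuscode"],
    ["20210601000000", "http://example.com/", "200"],
]
print(cdx_query_url(WAYBACK_CDX, "example.com/*", limit=5))
print(rows_to_records(sample))
```

The two archives do not necessarily return the same response shape or field names for the same query, which is the kind of difference a wrapper library has to smooth over.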

1 more reply

kevinsundar 4y ago

They do, and it's better than Common Crawl's by my testing.

joe_91 4y ago

That looks like a great resource! How often is the data set "updated"?

I'd imagine most people's use cases need data which can change from day to day or week to week but I do think that this is fantastic if I was to have a project which was looking at data across a longer timeframe.

jimkri 4y ago

That is too much data to parse for a simple website scrape.

I do think Common Crawl has a lot of potential for people to use instead of scraping, but I think it's for larger projects. It gave me the idea to look at the links to identify whether they are a business or non-business website.

joe_91 4y ago

I'm scraping about 30 sites for work at the moment, but have a few that are using Cloudflare which has been a b*tch to deal with. Tried numerous libraries and different proxy providers, but reliability is patchy. Previous fixes like https://github.com/Anorov/cloudflare-scrape don't seem to work anymore after Cloudflare updates, so I've switched to using a pretty optimised headless browser with good proxies instead.

Ian_Kerins (OP) 4y ago

This has a lot of good info on how Cloudflare and others work, and more creative ways to bypass them if the easier options don't work: https://incolumitas.com/2021/05/20/avoid-puppeteer-and-playw...

nanna 4y ago

I'm finding that Cloudflare is even blocking my RSS reader from requesting feeds behind their service. It's not even just scrapers at this point.

nsonha 4y ago

> optimised headless browser with good proxies instead

Are you saying you only had problems because you didn't use a headless browser before, and that now, with both a headless browser and proxies, it generally suffices to avoid being seen as a scraper?

temp8964 4y ago

I think it will eventually go the way of stock trading. If you have a good strategy, you don't want to share it with the world, because that will render your strategy useless.

emptysea 4y ago

Is the “pretty optimized headless browser” an off the shelf thing, or something custom? Are you using playwright/puppeteer to drive it?

mycall 4y ago

Headless Chrome [0] and alpine-Chrome [1] are pretty popular. Some variations also include V2Ray, Shadowsocks and other VPNs.

[0] https://hub.docker.com/r/justinribeiro/chrome-headless/

[1] https://github.com/Zenika/alpine-chrome

rozenmd 4y ago

There are plugins for Puppeteer: https://github.com/berstend/puppeteer-extra/tree/master/pack...

valar_m 4y ago

Do you have any recommendations for the "good proxies" you mentioned?

mellosouls 4y ago

> With the right combination of proxies, user agents and browsers, you can scrape every website. Even those that seem unscrapable.

> This outcome was great news for web scrapers, as it means that so long as a website has made their data public you are not in violation of the CFAA when you scrape the data even if it is prohibited in some other way (T&Cs, robots.txt, etc).

Just because you can, doesn't mean you should. It would be better I think if there was a treatment of the ethics here, rather than a seemingly "ra-ra go bots" attitude, as though the only consideration is commercial.

Ian_Kerins (OP) 4y ago

100% agree. Scraping should always be done respectfully.

- If they provide an API, then use it.

- Don't slam a website; ideally, spread requests out over the hours of the day when their target audience is least active (night time).

- If you can get cached data from somewhere that works, then use that.
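The "spread it out" advice in the list above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `spread_over_window`, the 01:00 to 05:00 window, and the example paths are all hypothetical, and a real crawler would also need to respect the target site's own rate limits.

```python
from datetime import datetime


def spread_over_window(urls, start, end):
    """Pair each URL with an evenly spaced fetch time inside [start, end)."""
    if not urls:
        return []
    step = (end - start) / len(urls)  # timedelta divided by an int
    return [(start + i * step, url) for i, url in enumerate(urls)]


# Hypothetical example: four pages fetched between 01:00 and 05:00.
window_start = datetime(2022, 1, 10, 1, 0)
window_end = datetime(2022, 1, 10, 5, 0)
plan = spread_over_window(
    ["/jobs?page=1", "/jobs?page=2", "/jobs?page=3", "/jobs?page=4"],
    window_start,
    window_end,
)
for when, url in plan:
    print(when.strftime("%H:%M"), url)
```

A scheduler would then sleep until each planned time before issuing the request, instead of firing everything at once.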

Most developers are respectful and only scrape what they really need, not only from an ethical point of view but also a cost and resources point of view. Scraping data is resource intensive and proxy costs can quickly rise to $1,000-$10,000 per month. So most only scrape the minimum they need.

The other thing here as well, is that a lot of the most popular sites being scraped, are also massive scrapers themselves. The big ecommerce sites are being scraped, but they are also scraping their competitors too.

travisporter 4y ago

Don’t get my home address, name, family members’ names, salary, and cell phone number, aggregate and sell them, and claim “it’s all publicly available anyway”.

1 more reply

Terry_Roll 4y ago

You don't even need to do that: go overt, in plain sight, in yer face, and call yourself a search engine!

joe_91 4y ago

Haha, I love that people forget how Google/Bing are out there scraping everything, and anyone who scrapes anything for any other reason is a "bad guy".

You can get around some web scraping blockers by just setting your user agent as Googlebot too which I find funny...

2 more replies

bryanrasmussen 4y ago

This sort of implies that the 'ethics' would end up meaning that you shouldn't scrape if it is not wanted, although I suppose there can be ethical or other non-commercial requirements that mean that you should.

NDizzle 4y ago

I still have a daily job running a web scraper I first wrote with Scrapy back in 2017. I think I've had to update it 3 times over the years for changes to the site and web standards.

Good old government sites - rarely change!

bobblywobbles 4y ago

Not a lawyer, but many terms of service prohibit interacting with their website in an automated fashion, as well as collecting their data. In my understanding, scraping a site with these terms already puts you in the wrong.

akersten 4y ago

> many terms of service prohibit interacting with their website in an automated fashion,

Ignoring the fact that I didn't agree to anything just by virtue of requesting a page from a webserver (and, your server sent me the data!), that's such a meaningless phrase that it's certainly unenforceable. What is an automated fashion? Do I have to manually craft my HTTP request by hand-pulsing a voltage on an Ethernet cable, or do I have your permission to let Chrome automate that for me?

RobSm 4y ago

Exactly this. People do not realize that when they use Chrome to view a website, Chrome is their 'scraper'.

And the goal of web scraping is not to get illegal data, but to gain efficiency and performance by letting a computer do the repetitive tasks instead of doing them manually. It's a productivity tool. You can't make something illegal just because it's automated instead of a 'manual' operation.

1 more reply

tommek4077 4y ago

Because those terms are the law and can't be ignored in almost all the rest of the world...

cblconfederate 4y ago

Cloudflare's blocks get in the way of many websites who are simply trying to get a "link preview" of the page, even if it is only a single request from a new IP. I wish they would offer some kind of alternative for the pages they serve instead of a captcha block.

fareesh 4y ago

My toolbox of choice for web scraping is either Nokogiri or Puppeteer.

Can someone sell me on Beautiful Soup or Scrapy or any of the others? Do they provide any advantages or features that I'd be missing out on?

edmundsauto 4y ago

One great Scrapy feature is caching the page content. So you can essentially write a crawler, and while that’s running, you write your extraction code. Then, if you want to go back, you can add more extractors and run them against your local copy.
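As a minimal sketch, the caching described here is typically switched on through Scrapy's HTTP cache settings in the project's `settings.py`. The values below are illustrative assumptions, not recommendations, so check the Scrapy documentation for the defaults in your version.

```python
# settings.py (Scrapy project settings); values here are illustrative only.
HTTPCACHE_ENABLED = True        # store responses on disk and replay them
HTTPCACHE_DIR = "httpcache"     # cache directory inside the project data dir
HTTPCACHE_EXPIRATION_SECS = 0   # 0 means cached responses never expire
HTTPCACHE_IGNORE_HTTP_CODES = [500, 502, 503, 504]  # don't cache server errors
```

With the cache enabled, re-running the spider after changing the extraction code reads pages from the local copy instead of hitting the site again.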

fareesh 4y ago

Ah interesting, I end up doing this manually, i.e. File.write followed by what I want to scrape

1 more reply

gmanis 4y ago

What does HN think of web scraping for the purpose of price comparison?

I’m asking this because I run a small side project to show prices across retailers for a very small niche. The users are very, very happy. Even the vendors started contacting me to be listed on the comparison.

But I am unable to make a business out of it beyond a few affiliate commissions.

magixx 4y ago

I worked for a company that did exactly this many years ago. (They were even able to partner with some retailers.) Their product worked well, yet they still went out of business long ago. To be honest, I don't see much value in such a service; not that the value doesn't exist, it's just hard to justify paying for this data.

Ian_Kerins (OP) 4y ago

If anyone has anything else they think was missed or should be included then let me know!

coverj 4y ago

I have been interested in web scraping lately but have never really dived too deep. Does anyone have more in-depth resources (GitHub projects, blogs, forums, etc.) than the tutorials that are basically "install Beautiful Soup and get data from a tag"?

JimBlackwood 4y ago

Genuine question, but what more do you need?

newsbinator 4y ago

Like most here, I am very good at web scraping and automated form fills. I keep trying to figure out a profitable side project or business idea to make out of it and keep coming up with nothing that works.

Any good ideas?

Ian_Kerins (OP) 4y ago

You can do it as a service, but that is highly competitive and basically trading time for money. Best ways are to productize it:

- build an on-demand data API for a specific type of data and charge a premium for it. A good example is https://serpapi.com/, who do Google data and charge a ~10X markup on proxy costs

- proxy solutions make good money. To scrape at scale you need proxies, and lots of users pay $1-5k per month. Lots of proxy solutions are doing $100k+ per month.

- build a tool that uses web scraped data, analyses/filters it and displays it to users. Lots of the biggest web scrapers are doing this, ex. doing product monitoring products for e-commerce companies, etc. Lots of competition there, but you can do it in new markets, like NFTs, etc.

- hedge funds will pay huge money for web data, if you have 5 years of continuous data so they can backtest it.

chirau 4y ago

> build a tool that uses web scraped data, analyses/filters it and displays it to users. Lots of the biggest web scrapers are doing this, ex. doing product monitoring products for e-commerce companies, etc. Lots of competition there, but you can do it in new markets, like NFTs, etc.

Do you have any examples of such sites?

> hedge funds will pay huge money for web data, if you have 5 years of continuous data so they can backtest it.

what kind of web data would they be interested in?

stef25 4y ago

For a while I had a hobby project that would scrape real estate websites listing properties in my city. Goal was to try and figure out trends, pricing data, find good deals. Eventually the site added those features itself (heatmaps based on prices, for example)

With all that data you can do stuff like make heatmaps from pricing data, figure out the most attractive areas for certain profiles (singles, families, ...). You could then mash up that data to produce things like a "Walkscore" or let people indicate what's important for them (green areas, bars & restaurants, time & distance to other destinations, even crime levels) and then show real estate that meets their criteria.
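A rough sketch of the heatmap idea, assuming each scraped listing has been reduced to a (latitude, longitude, price) tuple. The grid size, the sample listings, and the function name `median_price_by_cell` are made up for illustration.

```python
import math
from collections import defaultdict
from statistics import median


def median_price_by_cell(listings, cell_size=0.01):
    """Group (lat, lon, price) listings into square grid cells and return
    the median price per cell, keyed by integer (row, col) cell indices."""
    cells = defaultdict(list)
    for lat, lon, price in listings:
        key = (math.floor(lat / cell_size), math.floor(lon / cell_size))
        cells[key].append(price)
    return {key: median(prices) for key, prices in cells.items()}


# Made-up listings: (latitude, longitude, asking price)
listings = [
    (51.503, -0.127, 400_000),
    (51.507, -0.123, 500_000),
    (51.523, -0.158, 300_000),
]
heat = median_price_by_cell(listings)
for cell, price in sorted(heat.items()):
    print(cell, price)
```

Each cell's median can then be mapped to a colour on a map layer. The median is used here rather than the mean so that a single outlier listing does not dominate a cell.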

Some sites in the US already show this but in other countries that's not the case, while the data's all there just to grab.

Most likely it wouldn't be legal and certainly not if you made money from it. But it's incredibly fun and hugely useful. Maybe that could get you started on some ideas!

Fantosism 4y ago

I know many people that follow limited/exclusive releases for things like Yeezy/Air Jordan sneakers as well as PS5s and graphics cards.

They pay $500/mo for access to a bot that will allow them to make these purchases.

Most of the community lives on discord.

mymllnthaccount 4y ago

It would be relatively easy to solve this problem if the original supplier wanted the problem to be solved. Instead of releasing a batch of inventory at a certain time, run a raffle over a week or two and then randomly select folks to allow to purchase the item.
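A minimal sketch of the raffle idea, assuming entrants are identified by some account ID and that duplicate entries should count once. The function name `draw_winners` and the sample IDs are hypothetical.

```python
import random


def draw_winners(entrants, stock, rng=None):
    """Pick up to `stock` distinct entrants uniformly at random."""
    rng = rng or random.Random()
    unique = sorted(set(entrants))  # one entry per account, stable order
    return rng.sample(unique, min(stock, len(unique)))


# Hypothetical sign-ups collected over the raffle week ("a" entered twice).
signups = ["a", "b", "c", "a", "d"]
winners = draw_winners(signups, stock=2)
print(winners)
```

Because every account has the same chance regardless of when it signed up, being first by milliseconds no longer helps, which removes the main advantage a bot has in a timed release.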

1 more reply

mschuster91 4y ago

I understand people using bots to snipe PS5s and GPUs, these have real economic value and actual usage.

But what other than artificial scarcity drives people to spend hundreds of dollars on bots to snipe sneakers?!

3 more replies

jconley 4y ago

I have a project that will be fueled by scraping. We should chat. :)

chirau 4y ago

I do tons of scraping as well, let me know if you need extra hands.

1 more reply

darepublic 4y ago

Separate from web scraping, there is the use of automation to perform normal, allowable user actions on the site. That should be considered distinct from large-scale data extraction, no?

JJxFile 4y ago

The web scraping ecosystem is growing, with more libraries, frameworks and products available than ever before to simplify our web scraping headaches so the future is looking bright.

slvrspoon 4y ago

for those in this thread with super-serious experience scraping and automating at scale, looking for work (ethical!) please contact me directly.

blantonl 4y ago

I fail to understand why Web Scraping isn't almost universally viewed as unethical and a terrible and nasty business practice.

In almost all cases I view Web scraping as people trying to build businesses on top of other people's innovation and data. I know this isn't a popular opinion, so change my mind, but at the same time, I'm one of those business owners that fights with Web scraping constantly, and my opinion is that those doing it to my platforms are doing so solely to steal data and build businesses on top of others' hard work.

xrendan 4y ago

I think it really depends on the application of web scraping. (As someone who does what is, in my mind, ethical web scraping.)

- Scraping public information from government websites to do analysis: ethical, it's the public's data

- Scraping to help a company's customers more effectively use that company's product, for example scraping a medical office's insurance claims to help them automate their insurance remittance process: ethical

- Scraping faces to build a surveillance-tech company: disgusting

- Scraping your own website because your internal processes are so broken you can't get it any other way: ethical

- Scraping to just copy someone's data they worked hard to generate to go and resell: unethical

lambic 4y ago

The first one here is important. Despite the open data movement pressuring governments to provide their data in easily consumable forms, a lot of government organizations are still unable or unwilling to do so.

Political advocacy orgs rely a lot on scraping to collect political representative data that isn't available through any other means.

1 more reply

mycall 4y ago

- Scraping faces to find missing persons: ethical

- Scraping photos to create deep learning VQGAN+CLIP art generator: ethical

.. we can go on and on, but we should all agree scraping is a useful tool that should never be outlawed.

1 more reply

zffr 4y ago

> - Scraping to just copy someone's data they worked hard to generate to go and resell: unethical

Wanted to include a slightly different application:

- Scraping multiple websites and organizing data in a new and useful way for customers: To me this would be ethical since it produces new value and does not just copy someone else's data as-is

akersten 4y ago

So it's really not about the "scraping" here, it's about the kind of business you're building. I don't think any of your definitions change if you simply employed people to check the websites instead of scripts.

msluyter 4y ago

Re government websites: they're often terrible. I've occasionally contemplated a side project just to scrape and restructure some local/state websites into a usable form with search and whatnot.

RobSm 4y ago

And if you manually copy someone's data they worked hard to generate to go and resell, then it's ethical?

yashasolutions 4y ago

Google is web scrapper number one, as is any search engine. Making web scrapping illegal means making search engines illegal.

You do not want information to be public and/or free? Put it under login and charge for it.

You want to prevent people from reusing the data you publish to build other (potentially competitive) products? Then use licensing and copyright, and the law.

However, banning a technological means because of what a minority could potentially do with it? Then make the internet illegal and the problem is fixed altogether.

tyingq 4y ago

Google does do some things that aren't great for website owners too. Like "rich snippets", where they present the information from your page right to the end user, leaving that end user with no reason to visit your site.

And, I imagine, lots of A/B testing geared toward exactly that...keeping them on Google-owned properties.

2 more replies

FinanceAnon 4y ago

What if Google didn't scrape websites automatically, and waited till users submit their domains to them, to mark that they want to be scraped? I think in that case, most users would still submit their domains there, because they want to come up in Google search. You might want your website to be scraped by some people/companies and not by others, but not have to put everything behind a login screen (which some determined scrapers would still try to breach in some way).

teddyh 4y ago

NB: It’s “scraping”, not “scrapping”.

digitcatphd 4y ago

Google is a crawler, not a scraper; these are two totally different things.

1 more reply

indymike 4y ago

> I fail to understand why Web Scraping isn't almost universally viewed as unethical and a terrible and nasty business practice.

Scraping is simply a way to get data. I used to run a team that was paid by large government contractors in the US to scrape their job posts from their career portals, and then deliver those posts via email, fax and snail mail to veteran's service officers near the job opening. It was required by regulation, and the only way to get the job data was to scrape. Many enterprise applicant tracking systems did not have a good way to automatically deliver that data or wanted $millions for that capability. Scraping was the best way and, in some cases, the only way.

By the way, search engines like Google scrape data and index it.

Ian_Kerins (OP) 4y ago

Some web scraping can be unethical, say for example if you are scraping a site solely to mirror their content and add zero value to the original content owner.

However, there are a lot of web scraping use cases which are beneficial to the site being scraped and actually add value. Two examples:

- Google: Ahrefs & SEMRush scrape Google so they can provide SEO analytics to companies looking to grow. Google's keyword analytics aren't great, so Google has effectively outsourced providing a good analytics tool to Ahrefs & SEMRush, whose products increase the value of the Google SERPs ecosystem.

- Amazon + Other E-Commerce: Amazon wants brands and 3rd party stores to list products on their site, and the companies scraping Amazon to provide product placement tools to their users make it easier and more profitable to list products on Amazon, leading to more and more companies listing products on Amazon.

megous 4y ago

> Some web scraping can be unethical, say for example if you are scraping a site solely to mirror their content and add zero value to the original content owner.

Archiving is unethical?

1 more reply

kbenson 4y ago

Do you provide an API, paid or not, for the same data? An API which might even have limitations on use makes scraping a bit less defensible in my mind, but if you're offering something for free to the public and then getting upset when people take and use that free info, maybe free isn't the right business model, or maybe you should look into what those people are using that scraped data for and see if you can offer it better and cheaper.

The best way to stop someone trying to make a buck on your hard work is to go direct to their customers and do a better job. If you can't, what they're selling is something on top of your offering and you aren't serving that market, and you either should start serving it, or make a deal so the scrapers can continue to do it without impacting your service.

As someone that had to do scraping in the past, and went through having a free open API that served our needs perfectly replaced with an account-based one that required we make 100x the queries, it was really frustrating that the company refused to even respond to queries for specific business accommodations to data.

charcircuit 4y ago

Here are two use cases why I scrape YouTube.

- There is no external API for getting scheduled streams or when they have gone live AFAIK. This lets me be notified of new stuff to watch.

- The API for getting a channel's members is locked down. I applied for access to it 6 months ago and haven't heard anything about it from YouTube so I just scrape it to give members perks.

joe_91 4y ago

Madness that they haven't gotten back to your access request in 6 months!

Why even bother having the API there? So much value can be added by people building on top of YouTube and other large sites; it's a shame that most of these large sites do nothing to provide API access and people have to go out of their way to scrape them...

KieranMac 4y ago

There are pro-social and anti-social uses of web scraping. If you have ever used Kayak or any other price discovery or price comparison website, you've relied on web scraping to provide you a service.

zffr 4y ago

Also Google or any other search engine.

tyingq 4y ago

I believe Kayak has agreements with the sites they scrape though. So it's a different type of "scraping", really.

mrtksn 4y ago

When I want to do web scraping, it is because I have an idea to build on top of the content of the website I would like to scrape.

Let's say you made a recipes website and I would like to build an app that will order the ingredients for a meal.

It would be useful to extract the recipes, so that I can create experiences like users picking a meal and have the ingredients delivered.

I guess I can't show your recipes as it can be copyright infringement but I can link it to you and sell the tomatoes.

Also, although copying someone's work is unethical and likely illegal, there is nothing unethical or illegal about using computers to analyse the data out there. I should be able to analyse recipe publications just as I can measure the air pollution. Web scraping comes in because the semantic web never happened.

I think we all should be able to use other people's work to build something else on top of it. Of course, I do not advocate outright taking it and reselling it as our own.

For example, I would like to be able to create an app with Netflix content but obviously I don't expect to be able to stream their content as if it is mine. What I should be able to do is to create an app with an experience designed by me that lets you stream their movies if you pay them.

julianeon 4y ago

Because there would be no Internet search - no search engines, no Google Search, and essentially no Internet bigger than a hobbyist DARPA - without web scraping.

Chris2048 4y ago

> people who are trying to build businesses on top of other people's innovation and data

How would scraping, say, reddit, differ from the business model of Reddit itself?

> those that are doing it to my platforms are doing so solely to steal data

What kind of data are you talking about?

jeroenhd 4y ago

Scraping itself isn't universally unethical. Google and Bing scraping websites to make information easier to access is fine, and scraping and analysing government data is even better. Public data should be public, after all.

However, the disgusting data brokers that employ most of the custom scrapers, are usually unethical. That's why I don't trust any person or company that admits being involved professionally in "scraping", because most of the time that means "we collect personal information that got leaked elsewhere and sell them on".

JimBlackwood 4y ago

If we want to take the unethical route, I’d argue not providing an API (paid or free) is unethical and a nasty business practice.

I work for an ecommerce company and we scrape competitors for price information. Should this automated process using APIs not be okay, we’ll have humans do it. Less efficient for us, more traffic for a competitor. Should they provide a paid API with price information available, I’m sure we’d pay.

usbqk 4y ago

I think if you make intangible things public you shouldn’t consider them to be only yours anymore.

j / k navigate · click thread line to collapse

144 comments

KieranMac4y ago

btown4y ago

Not a lawyer, but is it at least true that web scraping alone would now be significantly less likely to be a basis for federal criminal prosecution under the CFAA?

KieranMac4y ago

Yes, I would agree with that first sentence. After Van Buren, web scraping alone would now be significantly less likely to be a basis for federal criminal prosecution under the CFAA.

digitcatphd4y ago

Good take, IMO ethically speaking we should not penalize scrapers themselves but do so based on their use.

Scraping Facebook to make a clone of profiles shouldn’t be held to the same scrutiny of scraping Facebook to do an internal analysis of user demographics for research purposes.

ForHackernews4y ago

Why should either be discouraged?

2 more replies

RobSm4y ago

How many contracts google breaches scraping billions of pages every month?

Fatnino4y ago

KieranMac4y ago

2 more replies

Ian_KerinsOP4y ago

https://www.zyte.com/blog/van-buren-a-victory-for-web-scrape...

KieranMac4y ago

The Zyte article isn't inaccurate; it's just a simplified assessment of a complicated issue. If you'd like a more nuanced perspective on this, please read my guest post of Prof. Goldman's blog.

https://blog.ericgoldman.org/archives/2021/06/more-perspecti...

faizshah4y ago

Is there a good blog or something that tracks these cases?

KieranMac4y ago

Prof. Eric Goldman's blog is probably the #1 site historically on scraping and the law. I've contributed to it a few times.

https://blog.ericgoldman.org/archives/2021/06/more-perspecti...

The name of my firm is McCarthy Garber Law. I write about scraping there when I have time (which I rarely do)!

1 more reply

samcrawford4y ago

Enjoyed reading your bio on your website. Sub 24 hour at Leadville is super impressive! (Coming from someone who has not managed 24 hours at Western States... Yet...)

KieranMac4y ago

Leadville is just 45 minutes up the road for me, so I'm kind of cheating!

Seattle35034y ago

Is there a good blog post or summary that I could read?

KieranMac4y ago

https://mccarthygarberlaw.com/a-comprehensive-legal-guide-to...

ok_coo4y ago

https://commoncrawl.org/

dewey4y ago

weird-eye-issue4y ago

Have you used this in real world scenarios? Or is it just a nice hypothetical that sounds great in theory but almost never works in practice?

LunaSea4y ago

Common Crawl is missing far too many URLs for it to be useful in a real world scenario.

Chris20484y ago

But can't you add to their index?

mycall4y ago

I wish web.archive.org had an index by someone like Common Crawl. There is lots of great stuff on archive.org.

wumpus4y ago

web.archive.org has a CDX index, similar to Common Crawl.

Since I use both of these archives together, I wrote this code to iron out the differences between them:

https://github.com/cocrawler/cdx_toolkit
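For anyone who wants to see what a CDX lookup looks like without a library, here is a minimal sketch against the Wayback Machine's CDX endpoint. It is not how cdx_toolkit works internally. The function names are made up for illustration, and it only covers web.archive.org; Common Crawl's index uses per-crawl endpoints and a slightly different response layout, which is the kind of difference the toolkit above smooths over.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Wayback Machine CDX server endpoint.
WAYBACK_CDX = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(url_pattern, limit=5):
    """Build a CDX query URL (hypothetical helper name)."""
    params = {"url": url_pattern, "output": "json", "limit": limit}
    return f"{WAYBACK_CDX}?{urlencode(params)}"


def parse_cdx_rows(rows):
    """With output=json the first row holds the field names and
    every following row is one capture; turn them into dicts."""
    if not rows:
        return []
    header, *captures = rows
    return [dict(zip(header, capture)) for capture in captures]


def fetch_captures(url_pattern, limit=5):
    """Query the index over the network and return capture dicts."""
    with urlopen(build_cdx_query(url_pattern, limit), timeout=30) as resp:
        return parse_cdx_rows(json.load(resp))
```

Calling fetch_captures("example.com/*") would list a handful of captures with their timestamps and original URLs; keep the limit small while experimenting.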

kevinsundar4y ago

They do, and it's better than Common Crawl's by my testing.

joe_914y ago

That looks like a great resource! How often is the data set "updated"?

jimkri4y ago

That is too much data to parse for a simple website scrape.

joe_914y ago

Ian_KerinsOP4y ago

nanna4y ago

I'm finding that Cloudflare is even blocking my RSS reader from requesting feeds behind their service. It's not even just scrapers at this point.

nsonha4y ago

> optimised headless browser with good proxies instead

Are you saying you only had problems because you didn't use a headless browser before, and that now, with both a headless browser and proxies, it generally suffices to not be seen as a scraper?

temp89644y ago

I think it will eventually go the way of stock trading. If you have a good strategy, you don't want to share it with the world, because that will render your strategy useless.

emptysea4y ago

Is the “pretty optimized headless browser” an off the shelf thing, or something custom? Are you using playwright/puppeteer to drive it?

mycall4y ago

Headless Chrome [0] and alpine-Chrome [1] are pretty popular. Some variations also include V2Ray, Shadowsocks and other VPNs.

[0] https://hub.docker.com/r/justinribeiro/chrome-headless/

[1] https://github.com/Zenika/alpine-chrome

rozenmd4y ago

There are plugins for Puppeteer: https://github.com/berstend/puppeteer-extra/tree/master/pack...

valar_m4y ago

Do you have any recommendations for the "good proxies" you mentioned?

mellosouls4y ago

With the right combination of proxies, user agents and browsers, you can scrape every website. Even those that seem unscrapable.

Ian_KerinsOP4y ago

100% agree, when scraping it should always be done respectfully.

- If they provide an API, then use it.

- Don't slam a website; ideally spread requests out over the hours of the day when their target audience is least active (night time).

- If you can get cached data from somewhere that works, then use that.
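A rough sketch of the last two points, assuming a single-threaded scraper: wait a minimum time between real requests, and never download the same URL twice. The class name, the 5-second default and the user-agent string are all made up for illustration, and the cache here is just an in-memory dict rather than anything persistent.

```python
import time
from urllib.request import Request, urlopen


def http_get(url, user_agent="example-polite-bot/0.1"):
    """Plain HTTP GET; the user-agent string is a placeholder."""
    req = Request(url, headers={"User-Agent": user_agent})
    with urlopen(req, timeout=30) as resp:
        return resp.read()


class PoliteFetcher:
    """Hypothetical helper: throttles requests and reuses cached bodies."""

    def __init__(self, min_delay_seconds=5.0, download=http_get,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay_seconds
        self._download = download
        self._clock = clock
        self._sleep = sleep
        self._last_request = None  # time of the last real download
        self._cache = {}           # url -> body, in memory only

    def fetch(self, url):
        # Cached pages cost the site nothing, so return them immediately.
        if url in self._cache:
            return self._cache[url]
        # Otherwise wait until min_delay has passed since the last download.
        if self._last_request is not None:
            remaining = self.min_delay - (self._clock() - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
        body = self._download(url)
        self._last_request = self._clock()
        self._cache[url] = body
        return body
```

The download, clock and sleep arguments are injectable only so the throttling can be tested without touching the network. Scheduling the whole job for the site's quiet hours is better left to cron or whatever scheduler you already use.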

travisporter4y ago

Don’t get my home address, name, family members' names, salary, cell phone number, aggregate and sell them and claim “it’s all publicly available anyway”.

Terry_Roll4y ago

You don't even need to do that: go overt, in plain sight, in yer face, and call yourself a search engine!

joe_914y ago

Haha I love that people forget how google/bing are out there scraping everything and anyone who scrapes anything for any other reason is a "bad guy".

You can get around some web scraping blockers by just setting your user agent as Googlebot too which I find funny...

bryanrasmussen4y ago

NDizzle4y ago

I still have a daily job running a web scraper I first wrote with Scrapy back in 2017. I think I've had to update it 3 times over the years for changes to the site and web standards.

Good old government sites - rarely change!

bobblywobbles4y ago

akersten4y ago

> many terms of service prohibit interacting with their website in an automated fashion,

RobSm4y ago

Exactly this. People do not realize that when they use Chrome to view a website, Chrome is their 'scraper'.

tommek40774y ago

Because those terms are the law and can't be ignored in almost all the rest of the world...

cblconfederate4y ago

fareesh4y ago

My toolbox of choice for web scraping is either Nokogiri or puppeteer

Can someone sell me on beautiful soup or scrapy or any of the others? Do they provide any advantages or features that I'd be missing out on?

edmundsauto4y ago

fareesh4y ago

Ah interesting, I end up doing this manually, i.e. File.write followed by what I want to scrape

gmanis4y ago

What does HN think of web scraping for the purpose of price comparison?

But I am unable to make a business out of it other than a few affiliate commissions.

magixx4y ago

Ian_KerinsOP4y ago

If anyone has anything else they think was missed or should be included then let me know!

coverj4y ago

JimBlackwood4y ago

Genuine question but, what more do you need?

newsbinator4y ago

Any good ideas?

Ian_KerinsOP4y ago

You can do it as a service, but that is highly competitive and basically trading time for money. Best ways are to productize it:

- build an on-demand data API for a specific type of data and charge a premium for it. A good example is https://serpapi.com/, who do Google data and charge a ~10X markup on proxy costs

- proxy solutions make good money. To scrape at scale you need proxies, and lots of users pay $1-5k per month. Lots of proxy solutions are doing $100k+ per month.

- hedge funds will pay huge money for web data, if you have 5 years of continuous data so they can backtest it.

chirau4y ago

Do you have any examples of such sites?

> hedge funds will pay huge money for web data, if you have 5 years of continuous data so they can backtest it.

what kind of web data would they be interested in?

stef254y ago

Some sites in the US already show this but in other countries that's not the case, while the data's all there just to grab.

Most likely it wouldn't be legal and certainly not if you made money from it. But it's incredibly fun and hugely useful. Maybe that could get you started on some ideas!

Fantosism4y ago

I know many people who follow limited/exclusive releases for things like Yeezy/Air Jordan sneakers as well as PS5s and graphics cards.

They pay $500/mo for access to a bot that will allow them to make these purchases.

Most of the community lives on discord.

mymllnthaccount4y ago

mschuster914y ago

I understand people using bots to snipe PS5s and GPUs, these have real economic value and actual usage.

But what other than artificial scarcity drives people to spend hundreds of dollars on bots to snipe sneakers?!

jconley4y ago

I have a project that will be fueled by scraping. We should chat. :)

chirau4y ago

I do tons of scraping as well, let me know if you need extra hands.

darepublic4y ago

Separate from web scraping, there is the use of automation to perform normal, allowable user actions on the site. That should be considered distinct from large-scale data extraction, no?

JJxFile4y ago

The web scraping ecosystem is growing, with more libraries, frameworks and products available than ever before to simplify our web scraping headaches so the future is looking bright.

slvrspoon4y ago

For those in this thread with super-serious experience scraping and automating at scale who are looking for (ethical!) work, please contact me directly.

blantonl4y ago

I fail to understand why Web Scraping isn't almost universally viewed as unethical and a terrible and nasty business practice.

xrendan4y ago

I think it really depends on the application of web scraping. (As someone who does what is, in my mind, ethical web scraping.)

- Scraping public information from government websites to do analysis: ethical, it's the public's data

- Scraping faces to build a surveillance-tech company: disgusting

- Scraping your own website because your internal processes are so broken you can't get it any other way: ethical

- Scraping to just copy someone's data they worked hard to generate to go and resell: unethical

lambic4y ago

Political advocacy orgs rely a lot on scraping to collect political representative data that isn't available through any other means.

mycall4y ago

- Scraping faces to find missing persons: ethical

- Scraping photos to create a deep learning VQGAN+CLIP art generator: ethical

.. we can go on and on, but we should all agree scraping is a useful tool that should never be outlawed.

zffr4y ago

> - Scraping to just copy someone's data they worked hard to generate to go and resell: unethical

Wanted to include a slightly different application:

- Scraping multiple websites and organizing data in a new and useful way for customers: To me this would be ethical since it produces new value and does not just copy someone else's data as-is

akersten4y ago

msluyter4y ago

Re government websites: they're often terrible. I've occasionally contemplated a side project just to scrape and restructure some local/state websites into a usable form with search and whatnot.

RobSm4y ago

And if you manually copy someone's data they worked hard to generate to go and resell, then it's ethical?

yashasolutions4y ago

Google is web scrapper number one, as any search engine. Making web scrapping illegal mean making search engine illegal.

You do not want information to be public and/or free? Put it under login and charge for it.

You want to prevent people from reusing the data you publish to build other (potentially competitive) products? Then use licensing and copyright, and the law.

However, banning a technology because of what a minority could potentially do with it? Then make the internet illegal and the problem is fixed altogether.

tyingq4y ago

And, I imagine, lots of A/B testing geared toward exactly that...keeping them on Google-owned properties.

FinanceAnon4y ago

teddyh4y ago

NB: It’s “scraping”, not “scrapping”.

digitcatphd4y ago

Google is a crawler, not a scraper; these are two totally different things.

indymike4y ago

> I fail to understand why Web Scraping isn't almost universally viewed as unethical and a terrible and nasty business practice.

By the way, search engines like Google scrape data and index it.

Ian_KerinsOP4y ago

Some web scraping can be unethical, say for example if you are scraping a site solely to mirror their content and add zero value to the original content owner.

However, there are a lot of web scraping use cases which are beneficial to the site being scraped and actually add value. Two examples:

megous4y ago

> Some web scraping can be unethical, say for example if you are scraping a site solely to mirror their content and add zero value to the original content owner.

Archiving is unethical?

kbenson4y ago

charcircuit4y ago

Here are two use cases why I scrape YouTube.

- There is no external API for getting scheduled streams or when they have gone live AFAIK. This lets me be notified of new stuff to watch.

- The API for getting a channel's members is locked down. I applied for access to it 6 months ago and haven't heard anything about it from YouTube so I just scrape it to give members perks.

joe_914y ago

Madness that they haven't gotten back to your access request in 6 months!

KieranMac4y ago

zffr4y ago

Also google or any other search engine

tyingq4y ago

I believe Kayak has agreements with the sites they scrape though. So it's a different type of "scraping", really.

mrtksn4y ago

When I want to do web scraping, it is because I have an idea to build on top of the content of the website I would like to scrape.

Let's say you made a recipes website and I would like to build an app that will order the ingredients for a meal.

It would be useful to extract the recipes, so that I can create experiences like users picking a meal and have the ingredients delivered.

I guess I can't show your recipes, as that can be copyright infringement, but I can link to you and sell the tomatoes.

I think we all should be able to use other people's work to build something else on top of it. Of course, I do not advocate outright taking it and reselling it as our own.

julianeon4y ago

Because there would be no Internet search - no search engines, no Google Search, and essentially no Internet bigger than a hobbyist DARPA - without web scraping.

Chris20484y ago

> people who are trying to build businesses on top of other people's innovation and data

How would scraping, say, reddit, differ from the business model of Reddit itself?

> those that are doing it to my platforms are doing so solely to steal data

What kind of data are you talking about?

jeroenhd4y ago

JimBlackwood4y ago

If we want to take the unethical route, I’d argue not providing an API (paid or free) is unethical and a nasty business practice.

usbqk4y ago

I think if you make intangible things public you shouldn’t consider them to be only yours anymore.
