I'm really impressed by Playwright. It feels like it has learned all of the lessons from systems like Selenium that came before it - it's very well designed and easy to apply to problems.
I wrote my own CLI scraping tool on top of Playwright a few months ago, which has been a fun way to explore Playwright's capabilities: https://simonwillison.net/2022/Mar/14/scraping-web-pages-sho...
It's usually easier to use an Android emulator like Genymotion (or a rooted Android phone) with HTTP Toolkit, bypass certificate pinning with Frida or a similar tool, and then explore the APIs through the official apps.
I've scraped loads of stuff through unofficial APIs this way. Most developers never expect anyone to do this, so those APIs are often a bit less secure too.
Alternatively, a global GitHub or Sourcegraph search sometimes turns up someone who's already done the hard work of reverse engineering an API and open-sourced it.
Sometimes though that's not enough, particularly on older sites that use weirder mechanisms like ASP.NET ViewState. For those I find having Playwright around is a big benefit.
Generally the things I have the most trouble with for non-browser-automation scraping are things with complex state stored in cookies and URL fragments (and maybe even localStorage these days).
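For the cookie-state case specifically, you often don't need a browser at all: a client-side cookie jar that persists across requests covers many of these sites. A minimal sketch with `requests.Session` (the `sid` cookie and example.com URL are illustrative; normally the server sets the cookie on an earlier response):

```python
import requests

# A Session persists cookies across requests, which covers many sites
# whose "state" is just a session cookie set on an earlier page.
session = requests.Session()

# Seeded manually here to show the mechanism; in real scraping the
# server's Set-Cookie on your first request populates this jar.
session.cookies.set("sid", "abc123")

# Every later request made through the session carries the cookie.
prepared = session.prepare_request(
    requests.Request("GET", "https://example.com/search?page=2")
)
print(prepared.headers.get("Cookie"))
```

URL fragments are trickier since they never reach the server; those usually mean the page is assembling state in JavaScript, which is where Playwright earns its keep.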
I've done this method a lot. Honestly scraping Google Reviews was the most difficult in terms of complexity. This was like 6 or 7 years ago. You would get back these huge nested arrays that mostly had 0s in them. Occasionally a value would be set and that's what I would go with. I'm assuming their internal tools were obfuscated and/or using protobuf. But it certainly took me back to the good ol' days hexediting games in order to make your own cheat codes.
Another difficulty I faced was sites that relied on previous UI state to make the API call. You'd have to emulate "real" browsing by requesting the intermediate pages to get the ID number. Still much faster than driving a whole browser via Selenium.
Honestly, it was the small sites that actually proved more troublesome. The ones that had an actual admin reading logs. They would ban our whole IP block, then ban our whole proxy IP block. Once I implemented Tor functionality in our scraper for a particularly valuable but small site and they blocked that too. This site ended up implementing ludicrous rate limiting that had normal users waiting 2-3 seconds between requests, all because we were scraping their data. I kid you not, by the time we gave up trying, this Section 8 rental site for a small city had vastly more protections in place than Zillow and Apartments.com combined.
> I understand companies can put roadblocks to hinder this
Can you elaborate? I haven't run into any roadblocks yet but I'm not scraping big sites or sending a massive number of requests.
But doesn't this assume which sites are being "scraped"? How would anyone know which sites someone else needs to "scrape" unless people name the sites (and the specific pages at those sites, as this is not "crawling")? For example, none of the websites with webpages I extract data from require me to use Javascript, i.e., I can retrieve and extract data without using JS.
Also, it is possible to automate text-only browsers that do not run Javascript. "Browser automation" is not necessarily just for Javascript.
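As a concrete illustration of JS-free extraction, here's a stdlib-only sketch (the regex assumes HN's current "titleline" markup, which may change; adjust per site):

```python
import re
import urllib.request

def fetch(url: str) -> str:
    # Plain HTTP fetch: no JavaScript engine involved at any point.
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")

def extract_titles(html: str) -> list:
    # Assumes HN-style markup: <span class="titleline"><a href="...">Title</a>
    return re.findall(r'class="titleline"><a href="[^"]*">([^<]+)</a>', html)

# Usage: print(extract_titles(fetch("https://news.ycombinator.com/")))
```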
Maybe we should have a "scraping challenge" in an effort to provide some evidence on this question. The challenge could be to "scrape" every webpage currently submitted to HN,^1 without using Javascript.^2
If someone manages to scrape a majority of the pages submitted to HN without JS, then we have some evidence that, for HN readers, JS and therefore Javascript-enabled browser automation is generally _not_ required for "scraping".
1. The problem I see with using something more generic like majestic_million.csv is that it is a list of domain names, not webpages.
2. We would likely need to agree on what data would need to be extracted from each submitted page.
"It's increasingly difficult these days to regularly write scrapers for a large range of different websites without eventually hitting a situation where you need to execute JavaScript on a page"
I had been occasionally scraping a site via curl, but then they started using Cloudflare's anti-bot stuff.
I switched to Selenium and that worked for a while--my Selenium script would navigate to the site, pause to let me manually deal with Cloudflare, and then automatically grab the data I wanted. But then that stopped working.
I found a Stack Overflow answer that gave some Selenium settings to stop it telling the site's JavaScript that the browser was being automated, and that briefly made things happy, but not long afterwards that broke too. There's a Selenium Chrome driver available that is meant for scraping, which apparently tries to hide all evidence that the browser is being automated, but it didn't fool Cloudflare.
What I want is a browser-based automation tool that to the site is indistinguishable from a human browsing, except possibly by the timing of user actions. E.g., if the site can deduce it is being automated because the client responds faster than human reaction time, or with too little variation in response time, that's fine.
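On the timing point: even if you accept that constraint, you'd want the delays to vary the way human ones do. A small sketch of log-normal jitter (the mean/sigma values are arbitrary illustrations, not calibrated to any detector):

```python
import random
import time

def next_delay(mean: float = 1.2, sigma: float = 0.4) -> float:
    # Human response times are roughly log-normal: mostly near the mean,
    # occasionally much longer. A fixed or uniform delay is much easier
    # for the server to flag than this kind of skewed jitter.
    return random.lognormvariate(0, sigma) * mean

def human_pause(mean: float = 1.2, sigma: float = 0.4) -> None:
    time.sleep(next_delay(mean, sigma))
```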
If I was designing my blog today I'd probably drop the day and month entirely, and go with /yyyy/unique-text-slug for the URLs.
[1] https://github.com/altilunium/wistalk (Scrape Wikipedia to analyze a user's activity)
[2] https://github.com/altilunium/psedex (Scrape a government website to get a list of all registered online services in Indonesia)
[3] https://github.com/altilunium/makalahIF (Scrape a university lecturer's web page to get a list of papers)
[4] https://github.com/altilunium/wi-page (Scrape Wikipedia to get the most active contributors to a certain article)
[5] https://github.com/altilunium/arachnid (Web scraper, optimized for WordPress and Blogger)
In other words, consider lxml as well.
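A quick sketch of what lxml buys you: fast parsing of messy real-world HTML plus XPath, no browser required (the HTML here is a made-up fragment):

```python
from lxml import html  # pip install lxml

# lxml tolerates malformed markup and supports XPath out of the box.
doc = html.fromstring(
    '<ul><li><a href="/a">First</a></li><li><a href="/b">Second</a></li></ul>'
)
links = [(a.text, a.get("href")) for a in doc.xpath("//li/a")]
print(links)  # [('First', '/a'), ('Second', '/b')]
```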
Not affiliated with browserless, but they do have a free/paid cloud service. https://www.browserless.io
People love it for its ease of use because you can record actions via point-and-click rather than having to manually come up with CSS selectors. It intelligently handles lists, infinite scrolling, pagination, etc. and can run both on your desktop and in the cloud.
Grateful for how much love it received when it launched on HN 8 months ago: https://news.ycombinator.com/item?id=29254147
Try it out and let me know what you think!
It probably doesn't make sense for Browserflow as a business, but I'd love to find a tool like this that exported a Scrapy spider, similar to the now unmaintained Portia.
It's fun to watch his Twitter and celebrate his wins alongside him.
The only thing I wish was present was better support for regexes. Bash and most Unix tools don't support PCRE, which can be severely limiting. Plus, sometimes you want to process text as a whole vs line-by-line.
I would also recommend Python's sh[4] module if shell scripting isn't your cup of tea. You get the best of both worlds: faster dev work with the Bash utils, and a saner syntax.
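Those two points combine nicely: with the `sh` package you call CLI tools as Python functions, and Python's `re` handles whole-document (multi-line) matching that line-oriented grep/sed make painful. A sketch, assuming `sh` is installed (`pip install sh`):

```python
import re

def extract_title(html: str):
    # re.DOTALL lets . match newlines, so the pattern can span lines --
    # exactly the whole-text processing that's awkward in Bash pipelines.
    m = re.search(r"<title>(.*?)</title>", html, re.DOTALL)
    return m.group(1).strip() if m else None

# With the sh package, any CLI tool becomes a function call:
#   import sh
#   page = str(sh.curl("-s", "https://example.com/"))
#   print(extract_title(page))
```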
[1]: https://github.com/ericchiang/pup
[2]: https://csvkit.readthedocs.io/en/latest/
>rm -rf /usr /lib/nvidia-current/xorg/xorg
https://github.com/MrMEEE/bumblebee-Old-and-abbandoned/commi...
>rm -rf "$STEAMROOT/"*
https://github.com/valvesoftware/steam-for-linux/issues/3671
It's just too easy to shoot yourself in the foot.
> set -euo pipefail
Here's a detailed explanation of all the switches: https://gist.github.com/mohanpedala/1e2ff5661761d3abd0385e82....
I do agree though, it's not the best tool. But combining CLI utilities tends to be fast.
curl 'https://news.ycombinator.com/' | python -c '
import sys, re, json
html = sys.stdin.read()
r = re.compile("<a href=\"(.*?)\"")
print(json.dumps(r.findall(html), indent=2))
'
This outputs JSON which you can then pipe to other tools.

shot-scraper javascript \
"https://news.ycombinator.com/from?site=simonwillison.net" "
Array.from(document.querySelectorAll('.itemlist .athing')).map(el => {
const title = el.querySelector('a.titlelink').innerText;
const points = parseInt(el.nextSibling.querySelector('.score').innerText);
const url = el.querySelector('a.titlelink').href;
const dt = el.nextSibling.querySelector('.age').title;
const submitter = el.nextSibling.querySelector('.hnuser').innerText;
const commentsUrl = el.nextSibling.querySelector('.subtext a:last-child').href;
const id = commentsUrl.split('?id=')[1];
const numComments = parseInt(
Array.from(
el.nextSibling.querySelectorAll('.subtext a[href^=item]')
).slice(-1)[0].innerText.split()[0]
) || 0;
return {id, title, url, dt, points, submitter, commentsUrl, numComments};
})
" | jq '. | map(.numComments) | add'
That example scrapes a page on Hacker News by running JavaScript inside headless Chromium, outputs the results as JSON to stdout, then pipes them into jq to add them up. It outputs "1274".

https://simonwillison.net/2022/Mar/14/scraping-web-pages-sho...
(Fun side note: I figured out the jq recipe I'm using in this example using GPT-3: https://til.simonwillison.net/gpt3/jq )
We've invented in this industry what you're referring to as "data-type-specific APIs": APIs that abstract away all the proxy issues, captcha solving, support for varying layouts, even scraping-related legal issues, and much more into a clean JSON response on every single call. It was a lot of work, but our success rate and response times now rival non-scraping commercial APIs: https://serpapi.com/status
I think the next battle will still be legal, despite all the wins in favor of scraping public pages and the common-sense understanding that this is the way to go. The EFF has been doing amazing work in this space, and we are proud to be a significant yearly contributor to the EFF.
Scrapers are very simple and effective, and probably one of the least fun things to build.
I occasionally code scrapers for quick data aggregation, but have trouble running anything long-term because it can be a chore to monitor. I've been looking into various options for self-hosting some sort of monitor/dashboard that can send alerts but haven't found anything satisfying yet.
https://github.com/bitmakerla/estela
Only Scrapy support atm, but additional scraping frameworks/language are on the roadmap. It would be great to have feedback to consider it when prioritizing some over others :-)
Think about giving it a score based on how the data is shaped. If it's missing prices for example, then it immediately goes down to zero, doesn't update the database and sends an alert.
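That scoring idea can be sketched in a few lines. The field names and weights below are illustrative, not from any particular project; the key points are that a missing critical field zeroes the score, and a batch-level drop triggers the alert instead of the database write:

```python
def score_record(item: dict) -> float:
    # A missing critical field (price, here) is disqualifying.
    if not item.get("price"):
        return 0.0
    # Optional fields only adjust the score within [0.5, 1.0].
    optional = ["title", "address", "images", "description"]
    present = sum(1 for f in optional if item.get(f))
    return 0.5 + 0.5 * present / len(optional)

def should_alert(items: list, threshold: float = 0.6) -> bool:
    # A low batch average usually means the site's layout changed:
    # alert and skip the DB update rather than store junk.
    avg = sum(score_record(i) for i in items) / max(len(items), 1)
    return avg < threshold
```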
from helium import *
start_chrome('github.com/login')
write('user', into='Username')
write('password', into='Password')
click('Sign in')
To get started: pip install helium
Also, you need to download the latest ChromeDriver and put it in your PATH. Have fun :-)
The worst thing about Puppeteer is Chrome and its bad memory management, so I'm going to give Playwright a spin soon.
It is a modern alternative to the few OSS projects available for such needs, like scrapyd and gerapy. estela aims to help web scraping teams and individuals that are considering moving away from proprietary scraping clouds, or who are in the process of designing their on-premise scraping architecture, so as not to needlessly reinvent the wheel, and to benefit from the get-go from features such as built-in scalability and elasticity, among others.
estela has been recently published as OSS under the MIT license:
https://github.com/bitmakerla/estela
More details about it can be found in the release blog post and the official documentation:
https://bitmaker.la/blog/2022/06/24/estela-oss-release.html
https://estela.bitmaker.la/docs/
estela supports Scrapy spiders for the time being, but additional frameworks/languages are on the roadmap.
All kinds of feedback and contributions are welcome!
Disclaimer: I'm part of the development team behind estela :-)
For sites that are "difficult" I remote control a real browser, GUI and all. I don't use Chrome headless because if there's e.g. a captcha I want to be able to fill it in manually.
[1] https://github.com/brutuscat/medusa-crawler
Which I maintain as a fork of the unmaintained Anemone gem.
Obviously sometimes you have to go that route.
[0] - https://cheerio.js.org/
I heard the team behind Puppeteer moved from Google to Microsoft, and started the project Playwright, which has a more ergonomic API and better cross-browser support (Chromium, WebKit, and Firefox).
It also scrapes all the comments I upvoted, and if those have links inside them it creates bookmarks from them too.
That's because I often find myself searching for some submission I upvoted but can't find it, especially if there were many similar ones, whereas Firefox bookmarks manager has a nifty search feature...
I had to scrape since the HN API doesn't expose ability to get information about upvoted submissions/comments. The extension assumes you are logged in, it doesn't ask for your username or password.
It's mostly for my own use and not very polished, but it works. I uploaded it to the Firefox extensions gallery, and you can probably find it there, but I don't think it's ready for public consumption yet...
The main purpose was to submit HTML forms. You just say which input fields something should be written into, and it does the other things (i.e., download the page, find all the other fields and their default values, build an HTTP request from all of them, and send that).
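The mechanism described there (collect every field with its default, overlay the values you care about, then encode the whole thing) can be sketched with the stdlib alone; this is not that project's code, just an illustration of the technique:

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

class FormFields(HTMLParser):
    # Collects every <input name=... value=...> so hidden fields
    # (CSRF tokens, etc.) and defaults ride along automatically.
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            a = dict(attrs)
            if "name" in a:
                self.fields[a["name"]] = a.get("value") or ""

def build_submission(form_html: str, overrides: dict) -> str:
    parser = FormFields()
    parser.feed(form_html)
    parser.fields.update(overrides)   # fill in only the fields you care about
    return urlencode(parser.fields)   # ready to send as a POST body
```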
I spent the last 5 years updating the XPath implementation to XPath/XQuery 3.1. The W3C has put a lot of new stuff in the new XPath versions, like JSON support and higher-order functions; for some reason they decided to turn XPath into a Turing-complete functional programming language.
Of course, if you don't need a full javascript-enabled browser parse, consider alternatives first: simple HTTP requests, API, RSS, etc.
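RSS in particular is worth checking first, since it's stable structured XML and far less brittle than scraping rendered HTML. A stdlib-only sketch:

```python
import xml.etree.ElementTree as ET

def parse_rss(xml_text: str) -> list:
    # RSS items live under <channel>; iter() finds them at any depth.
    root = ET.fromstring(xml_text)
    return [
        {"title": item.findtext("title"), "link": item.findtext("link")}
        for item in root.iter("item")
    ]

# Usage: parse_rss(urllib.request.urlopen(feed_url).read().decode())
```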
https://github.com/WebReflection/linkedom
When the content is complex or involves clicking, Playwright is probably the best tool for the job.
One is signature/fingerprint emulation. It helps to run the bot in a real browser and export the fingerprint (e.g. UA, canvas, geolocation, etc.) into a JS object. Add noise to the data too.
Simulate residential IPs by routing through a residential proxy. If you run bots from the cloud you will get blocked.
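The rotation side of this is straightforward to sketch. The proxy endpoints and UA strings below are placeholders; real ones come from your proxy provider and from real browsers:

```python
import random

# Illustrative pools only -- substitute your provider's endpoints
# and genuine, current browser User-Agent strings.
PROXIES = [
    "http://user:pass@res-proxy-1.example.net:8000",
    "http://user:pass@res-proxy-2.example.net:8000",
]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Safari/605.1.15",
]

def pick_identity():
    # A fresh proxy + User-Agent pair per request; pass these straight
    # to requests.get(url, headers=headers, proxies=proxies).
    proxy = random.choice(PROXIES)
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    proxies = {"http": proxy, "https": proxy}
    return headers, proxies
```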
Scrapy is still king for me (scrapy.org). There are even packages to use headless browsers for those awful JavaScript-heavy sites.
However, APIs and RSS are still in play, and those don't require a heavy scraper. I am building vertical industry portals, and many of my data rollups consume APIs and structured XML/RSS feeds from social and other sites.
The purpose was to enable "live interactive" scraping of forms/js/ajax sites, with a web frontend controlling maybe 10 scrapers for each user. When that project fell through, I stopped maintaining it and the spidermonkey api has long since moved on.
It works for simple sites that don't require the DOM to actually do anything (for example triggering images to load with some magic url). But many simple DOM behaviours can be implemented.
Puppeteer + JSDOM is what I used to build https://www.getscrape.com, which is a high-level web scraping API. Basically, you tell the API if you want links, images, texts, headings, numbers, etc; and the API gets all that stuff for you without the need to pass selectors or parsing instructions.
In case anyone here wants something straightforward. It works well to build generic scraping operations.
* Apify (https://apify.com/) is a great, comprehensive system if you need to get fairly low-level. Everything is hosted there, they've got their own proxy service (or you can roll your own), and their open source framework (https://github.com/apify/crawlee) is excellent.
* I've also experimented with running both their SDK (crawlee) and Playwright directly on Google Cloud Run, and that also works well and is an order-of-magnitude less expensive than running directly on their platform.
* Bright Data, née Luminati, is excellent for cheap data-center proxies ($0.65/GB pay as you go), but prices get several orders of magnitude higher if you need anything more thorough than data-center proxies.
* For some direct API crawls that I do, all of the scraping stuff is unnecessary and I just ping the APIs directly.
* If the site you're scraping is using any sort of anti-bot protection, I've found that ScrapingBee (https://www.scrapingbee.com/) is by far the easiest solution. I spent many many hours fighting anti-bot protection doing it myself with some combination of Bright Data, Apify and Playwright, and in the end I kinda stopped battling and just decided to let ScrapingBee deal with it for me. I may be lucky in that the sites I'm scraping don't really use JS heavily, so the plain vanilla, no-JS ScrapingBee service works almost all of the time for those. Otherwise it can get quite expensive if you need JS rendering, premium proxies, etc. But a big thumbs up to them for making it really easy.
Always looking for new techniques and tools, so I'll monitor this thread closely.
It lets you train a bot in 2 minutes. The bot will then open the site with rotating geolocated ip addresses, solve captchas, click on buttons and scroll and fill out forms, to get you the data you need.
It’s integrated with Google Sheets, Airtable, Zapier, and more.
We have a Google Sheets addon too which lets you run robots and get their results all in a spreadsheet.
We have close to 10,000 users with 1,000+ signing up every week these days. That made us raise a bit of funding from Zapier and others to be able to scale quicker and build the next version.
Would be cool to reverse engineer it and probably plug it into some JS rendering testing solution (say Puppeteer, etc.)
[1] https://chrome.google.com/webstore/detail/instant-data-scrap...
Web scraping is fun, but in production it’s an absolute joke.
Personally, I use Indexed (https://www.indexedinc.com) because they are technical and reliable, although there are many other providers out there..
>Thanks for the links. And I read too. I see a lot of useful stuff that I will use for my site https://los-angeles-plumbers.com/
I use Selenium every few months so I have to update the drivers but otherwise it is pretty painless.
Selenium is much slower than BS4, which is preferred for static sites.
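For a static page, parsing the fetched HTML directly with BeautifulSoup avoids browser startup entirely (the markup below is a made-up example):

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# No browser, no drivers: just parse the HTML string you already fetched.
html = '<div class="listing"><h2>Widget</h2><span class="price">$9.99</span></div>'
soup = BeautifulSoup(html, "html.parser")
item = {
    "name": soup.select_one(".listing h2").get_text(),
    "price": soup.select_one(".listing .price").get_text(),
}
print(item)  # {'name': 'Widget', 'price': '$9.99'}
```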
[1] Scrapy is a well-documented framework, so any Python programmer can start using it after 1 month of training. There are a lot of guides for beginners.
[2] Lots of features are already implemented and open-source, you won’t have to waste time & money on them.
[3] There is a strong community that can help with most of the questions (I don't think any other alternative has that).
[4] Scrapy developers are cheap. You will only need junior to mid-level software engineers to pull off most of the projects. It's not rocket science.
[5] Recruiting is easier:
- there are hundreds of freelancers with relevant expertise
- if you search on LinkedIn, there are hundreds of software developers that have worked with Scrapy in the past, and you don't need that many
- you can grow expertise in your own team quickly
- developers are easily replaceable, even on larger projects
- you can use the same developers on backend tasks
[6] You don’t need a DevOps expertise in your web scraping team because Scrapy Cloud (https://www.zyte.com/scrapy-cloud/) is good and cheap enough for 99% of the projects.
[7] If you decide to have your own infrastructure, you can use https://github.com/scrapy/scrapyd.
[8] The entire ecosystem is well-maintained and steadily growing. You can integrate a lot of third-party services into your project within hours: proxies, captcha solving, headless browsers, HTML parsing APIs.
[9] It’s easy to integrate your own AI/ML models into the scraping workflow.
[10] With some work, you can use Scrapy for distributed projects that scrape thousands (or millions) of domains. We are using https://github.com/rmax/scrapy-redis.
[11] Commercial support is available. There are several companies that can develop an entire project for you, or take over an existing one, if you don't have the time or don't want to do it on your own.
We have built dozens of projects in multiple industries:
- news monitoring
- job aggregators
- real estate aggregators
- ecommerce (anything from 1 website, to monitoring prices on 100k+ domains)
- lead generation
- search engines in a specific niche (SEO, pdf files, ecommerce, chemical retail)
- macroeconomic research & indicators
- social media, NFT marketplaces, etc
So, most of the projects can be finished using these tools.
In general, as long as you don't have to log in, don't infringe on intellectual property rights, and don't harm the targeted servers, you should be OK.
The prices are not very friendly though...
Great thing is, it has support for Zapier, webhooks, and API access too!
$x = iwr https://google.com
foreach ($link in $x.Links.href) { iwr $link -OutFile foo ...}