I'm really impressed by Playwright. It feels like it has learned all of the lessons from systems like Selenium that came before it - it's very well designed and easy to apply to problems.
I wrote my own CLI scraping tool on top of Playwright a few months ago, which has been a fun way to explore Playwright's capabilities: https://simonwillison.net/2022/Mar/14/scraping-web-pages-sho...
It's usually easier to use an Android emulator like Genymotion (or a rooted Android phone) with HTTP Toolkit, bypass certificate pinning with Frida or a similar tool, and then explore the APIs through the official apps.
I've scraped loads of stuff through unofficial APIs this way. Most developers never expect anyone to do this, so those APIs are often a bit less secure too.
Alternatively, a global GitHub or Sourcegraph search sometimes turns up someone who's already done the hard work of reverse engineering an API and open-sourced it.
Sometimes though that's not enough, particularly on older sites that use weirder mechanisms like ASP.NET ViewState. For those I find having Playwright around is a big benefit.
Generally the things I have the most trouble with for non-browser-automation scraping are things with complex state stored in cookies and URL fragments (and maybe even localStorage these days).
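For the cookie-state case specifically, you often don't need a browser at all: a client-side cookie jar that persists across requests covers many of these sites. A minimal sketch with `requests.Session` (the `sid` cookie and example.com URL are illustrative; normally the server sets the cookie on an earlier response):

```python
import requests

# A Session persists cookies across requests, which covers many sites
# whose "state" is just a session cookie set on an earlier page.
session = requests.Session()

# Seeded manually here to show the mechanism; in real scraping the
# server's Set-Cookie on your first request populates this jar.
session.cookies.set("sid", "abc123")

# Every later request made through the session carries the cookie.
prepared = session.prepare_request(
    requests.Request("GET", "https://example.com/search?page=2")
)
print(prepared.headers.get("Cookie"))
```

URL fragments are trickier since they never reach the server; those usually mean the page is assembling state in JavaScript, which is where Playwright earns its keep.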
I've done this method a lot. Honestly scraping Google Reviews was the most difficult in terms of complexity. This was like 6 or 7 years ago. You would get back these huge nested arrays that mostly had 0s in them. Occasionally a value would be set and that's what I would go with. I'm assuming their internal tools were obfuscated and/or using protobuf. But it certainly took me back to the good ol' days hexediting games in order to make your own cheat codes.
Another difficulty I faced was sites that relied on previous UI state to make the API call. You'd have to emulate "real" browsing by requesting the intermediate pages to get the ID number. Still much faster than driving a whole browser via Selenium.
Honestly, it was the small sites that actually proved more troublesome. The ones that had an actual admin reading logs. They would ban our whole IP block, then ban our whole proxy IP block. Once I implemented Tor functionality in our scraper for a particularly valuable but small site and they blocked that too. This site ended up implementing ludicrous rate limiting that had normal users waiting 2-3 seconds between requests, all because we were scraping their data. I kid you not, by the time we gave up trying, this Section 8 rental site for a small city had vastly more protections in place than Zillow and Apartments.com combined.
> I understand companies can put roadblocks to hinder this
Can you elaborate? I haven't run into any roadblocks yet but I'm not scraping big sites or sending a massive number of requests.
But doesn't this assume which sites are being "scraped"? How would anyone know which sites someone else needs to "scrape" unless people name the sites (and the specific pages at those sites, as this is not "crawling")? For example, none of the websites with webpages I extract data from require me to use Javascript, i.e., I can retrieve and extract data without using JS.
Also, it is possible to automate text-only browsers that do not run Javascript. "Browser automation" is not necessarily just for Javascript.
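As a concrete illustration of JS-free extraction, here's a stdlib-only sketch (the regex assumes HN's current "titleline" markup, which may change; adjust per site):

```python
import re
import urllib.request

def fetch(url: str) -> str:
    # Plain HTTP fetch: no JavaScript engine involved at any point.
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")

def extract_titles(html: str) -> list:
    # Assumes HN-style markup: <span class="titleline"><a href="...">Title</a>
    return re.findall(r'class="titleline"><a href="[^"]*">([^<]+)</a>', html)

# Usage: print(extract_titles(fetch("https://news.ycombinator.com/")))
```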
Maybe we should have a "scraping challenge" in an effort to provide some evidence on this question. The challenge could be to "scrape" every webpage currently submitted to HN,^1 without using Javascript.^2
If someone manages to scrape a majority of the pages submitted to HN without JS, then we have some evidence that, for HN readers, JS and therefore Javascript-enabled browser automation is generally _not_ required for "scraping".
1. The problem I see with using something more generic like majestic_million.csv is that it is a list of domain names, not webpages.
2. We would likely need to agree on what data would need to be extracted from each submitted page.
"It's increasingly difficult these days to regularly write scrapers for a large range of different websites without eventually hitting a situation where you need to execute JavaScript on a page"
I had been occasionally scraping a site via curl, but then they started using Cloudflare's anti-bot stuff.
I switched to Selenium and that worked for a while--my Selenium script would navigate to the site, pause to let me manually deal with Cloudflare, and then automatically grab the data I wanted. But then that stopped working.
I found a Stack Overflow answer that gave some Selenium settings to stop it telling the site's JavaScript that the browser was being automated, and that briefly made things happy, but not long afterwards that broke too. There's a Selenium Chrome driver available that is meant for scraping, which apparently tries to hide all evidence that the browser is being automated, but it didn't fool Cloudflare.
What I want is a browser-based automation tool that to the site is indistinguishable from a human browsing, except possibly by the timing of user actions. E.g., if the site can deduce it is being automated because the client responds faster than human reaction time, or with too little variation in response time, that's fine.
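On the timing point: even if you accept that constraint, you'd want the delays to vary the way human ones do. A small sketch of log-normal jitter (the mean/sigma values are arbitrary illustrations, not calibrated to any detector):

```python
import random
import time

def next_delay(mean: float = 1.2, sigma: float = 0.4) -> float:
    # Human response times are roughly log-normal: mostly near the mean,
    # occasionally much longer. A fixed or uniform delay is much easier
    # for the server to flag than this kind of skewed jitter.
    return random.lognormvariate(0, sigma) * mean

def human_pause(mean: float = 1.2, sigma: float = 0.4) -> None:
    time.sleep(next_delay(mean, sigma))
```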
If I was designing my blog today I'd probably drop the day and month entirely, and go with /yyyy/unique-text-slug for the URLs.
[1] https://github.com/altilunium/wistalk (Scrape Wikipedia to analyze a user's activity)
[2] https://github.com/altilunium/psedex (Scrape a government website to get a list of all registered online services in Indonesia)
[3] https://github.com/altilunium/makalahIF (Scrape a university lecturer's web page to get a list of papers)
[4] https://github.com/altilunium/wi-page (Scrape Wikipedia to get the most active contributors to a certain article)
[5] https://github.com/altilunium/arachnid (Web scraper, optimized for WordPress and Blogger)
In other words, consider lxml as well.
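A quick sketch of what lxml buys you: fast parsing of messy real-world HTML plus XPath, no browser required (the HTML here is a made-up fragment):

```python
from lxml import html  # pip install lxml

# lxml tolerates malformed markup and supports XPath out of the box.
doc = html.fromstring(
    '<ul><li><a href="/a">First</a></li><li><a href="/b">Second</a></li></ul>'
)
links = [(a.text, a.get("href")) for a in doc.xpath("//li/a")]
print(links)  # [('First', '/a'), ('Second', '/b')]
```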
Not affiliated with browserless, but they do have a free/paid cloud service. https://www.browserless.io
People love it for its ease of use because you can record actions via point-and-click rather than having to manually come up with CSS selectors. It intelligently handles lists, infinite scrolling, pagination, etc. and can run both on your desktop and in the cloud.
Grateful for how much love it received when it launched on HN 8 months ago: https://news.ycombinator.com/item?id=29254147
Try it out and let me know what you think!
It probably doesn't make sense for Browserflow as a business, but I'd love to find a tool like this that exported a Scrapy spider, similar to the now unmaintained Portia.
It's fun to watch his Twitter and celebrate his wins alongside him.
The only thing I wish was present was better support for regexes. Bash and most Unix tools don't support PCRE, which can be severely limiting. Plus, sometimes you want to process text as a whole vs line-by-line.
I would also recommend Python's sh[4] module if shell scripting isn't your cup of tea. You get the best of both worlds: faster dev work with the Bash utils, and a saner syntax.
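Those two points combine nicely: with the `sh` package you call CLI tools as Python functions, and Python's `re` handles whole-document (multi-line) matching that line-oriented grep/sed make painful. A sketch, assuming `sh` is installed (`pip install sh`):

```python
import re

def extract_title(html: str):
    # re.DOTALL lets . match newlines, so the pattern can span lines --
    # exactly the whole-text processing that's awkward in Bash pipelines.
    m = re.search(r"<title>(.*?)</title>", html, re.DOTALL)
    return m.group(1).strip() if m else None

# With the sh package, any CLI tool becomes a function call:
#   import sh
#   page = str(sh.curl("-s", "https://example.com/"))
#   print(extract_title(page))
```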
[1]: https://github.com/ericchiang/pup
[2]: https://csvkit.readthedocs.io/en/latest/
>rm -rf /usr /lib/nvidia-current/xorg/xorg
https://github.com/MrMEEE/bumblebee-Old-and-abbandoned/commi...
>rm -rf "$STEAMROOT/"*
https://github.com/valvesoftware/steam-for-linux/issues/3671
It's just too easy to shoot yourself in the foot.
> set -euo pipefail
Here's a detailed explanation of all the switches: https://gist.github.com/mohanpedala/1e2ff5661761d3abd0385e82....
I do agree though, it's not the best tool. But combining CLI utilities tends to be fast.
curl 'https://news.ycombinator.com/' | python -c '
import sys, re, json
html = sys.stdin.read()
r = re.compile("<a href=\"(.*?)\"")
print(json.dumps(r.findall(html), indent=2))
'
This outputs JSON which you can then pipe to other tools.

shot-scraper javascript \
"https://news.ycombinator.com/from?site=simonwillison.net" "
Array.from(document.querySelectorAll('.itemlist .athing')).map(el => {
const title = el.querySelector('a.titlelink').innerText;
const points = parseInt(el.nextSibling.querySelector('.score').innerText);
const url = el.querySelector('a.titlelink').href;
const dt = el.nextSibling.querySelector('.age').title;
const submitter = el.nextSibling.querySelector('.hnuser').innerText;
const commentsUrl = el.nextSibling.querySelector('.subtext a:last-child').href;
const id = commentsUrl.split('?id=')[1];
const numComments = parseInt(
Array.from(
el.nextSibling.querySelectorAll('.subtext a[href^=item]')
).slice(-1)[0].innerText.split()[0]
) || 0;
return {id, title, url, dt, points, submitter, commentsUrl, numComments};
})
" | jq '. | map(.numComments) | add'
That example scrapes a page on Hacker News by running JavaScript inside headless Chromium, outputs the results as JSON to stdout, then pipes them into jq to add them up. It outputs "1274".

https://simonwillison.net/2022/Mar/14/scraping-web-pages-sho...
(Fun side note: I figured out the jq recipe I'm using in this example using GPT-3: https://til.simonwillison.net/gpt3/jq )
We've invented in this industry what you're referring to as "data-type-specific APIs": APIs that abstract away all the proxy issues, captcha solving, support for varying layouts, even scraping-related legal issues, and much more into a clean JSON response on every single call. It was a lot of work, but our success rate and response times now rival non-scraping commercial APIs: https://serpapi.com/status
I think the next battle will still be legal, despite all the wins in favor of scraping public pages and the common-sense understanding that this is the way to go. The EFF has been doing amazing work in this space, and we are proud to be a significant yearly contributor to the EFF.
Scrapers are very simple and effective, and probably one of the least fun things to build.
I occasionally code scrapers for quick data aggregation, but have trouble running anything long-term because it can be a chore to monitor. I've been looking into various options for self-hosting some sort of monitor/dashboard that can send alerts but haven't found anything satisfying yet.
https://github.com/bitmakerla/estela
Only Scrapy support atm, but additional scraping frameworks/language are on the roadmap. It would be great to have feedback to consider it when prioritizing some over others :-)
Think about giving it a score based on how the data is shaped. If it's missing prices for example, then it immediately goes down to zero, doesn't update the database and sends an alert.
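That scoring idea can be sketched in a few lines. The field names and weights below are illustrative, not from any particular project; the key points are that a missing critical field zeroes the score, and a batch-level drop triggers the alert instead of the database write:

```python
def score_record(item: dict) -> float:
    # A missing critical field (price, here) is disqualifying.
    if not item.get("price"):
        return 0.0
    # Optional fields only adjust the score within [0.5, 1.0].
    optional = ["title", "address", "images", "description"]
    present = sum(1 for f in optional if item.get(f))
    return 0.5 + 0.5 * present / len(optional)

def should_alert(items: list, threshold: float = 0.6) -> bool:
    # A low batch average usually means the site's layout changed:
    # alert and skip the DB update rather than store junk.
    avg = sum(score_record(i) for i in items) / max(len(items), 1)
    return avg < threshold
```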
from helium import *
start_chrome('github.com/login')
write('user', into='Username')
write('password', into='Password')
click('Sign in')
To get started: pip install helium
Also, you need to download the latest ChromeDriver and put it in your PATH. Have fun :-)
The worst thing about Puppeteer is Chrome and its bad memory management, so I'm going to give Playwright a spin soon.
It is a modern alternative to the few OSS projects available for such needs, like scrapyd and gerapy. estela aims to help web scraping teams and individuals that are considering moving away from proprietary scraping clouds, or who are in the process of designing their on-premise scraping architecture, so as not to needlessly reinvent the wheel, and to benefit from the get-go from features such as built-in scalability and elasticity, among others.
estela has been recently published as OSS under the MIT license:
https://github.com/bitmakerla/estela
More details about it can be found in the release blog post and the official documentation:
https://bitmaker.la/blog/2022/06/24/estela-oss-release.html
https://estela.bitmaker.la/docs/
estela supports Scrapy spiders for the time being, but additional frameworks/languages are on the roadmap.
All kinds of feedback and contributions are welcome!
Disclaimer: I'm part of the development team behind estela :-)
For sites that are "difficult" I remote control a real browser, GUI and all. I don't use Chrome headless because if there's e.g. a captcha I want to be able to fill it in manually.
[1] https://github.com/brutuscat/medusa-crawler
Which I maintain as a fork of the unmaintained Anemone gem.
Obviously sometimes you have to go that route.
[0] - https://cheerio.js.org/
I heard the team behind Puppeteer moved from Google to Microsoft, and started the project Playwright, which has a more ergonomic API and better cross-browser support (Chromium, WebKit, and Firefox).
It also scrapes all the comments I upvoted, and if those have links inside them it creates bookmarks from them too.
That's because I often find myself searching for some submission I upvoted but can't find it, especially if there were many similar ones, whereas Firefox bookmarks manager has a nifty search feature...
I had to scrape since the HN API doesn't expose ability to get information about upvoted submissions/comments. The extension assumes you are logged in, it doesn't ask for your username or password.
It's mostly for my own use and not very polished, but it works. I uploaded it to the Firefox extensions gallery, and you can probably find it there, but I don't think it's ready for public consumption yet...
The main purpose was to submit HTML forms. You just say which input fields something should be written into, and it does the other things (i.e., download the page, find all the other fields and their default values, build an HTTP request from all of them, and send that).
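The mechanism described there (collect every field with its default, overlay the values you care about, then encode the whole thing) can be sketched with the stdlib alone; this is not that project's code, just an illustration of the technique:

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

class FormFields(HTMLParser):
    # Collects every <input name=... value=...> so hidden fields
    # (CSRF tokens, etc.) and defaults ride along automatically.
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            a = dict(attrs)
            if "name" in a:
                self.fields[a["name"]] = a.get("value") or ""

def build_submission(form_html: str, overrides: dict) -> str:
    parser = FormFields()
    parser.feed(form_html)
    parser.fields.update(overrides)   # fill in only the fields you care about
    return urlencode(parser.fields)   # ready to send as a POST body
```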
I spent the last 5 years updating the XPath implementation to XPath/XQuery 3.1. The W3C has put a lot of new stuff in the new XPath versions, like JSON support and higher-order functions; for some reason they decided to turn XPath into a Turing-complete functional programming language.
Of course, if you don't need a full javascript-enabled browser parse, consider alternatives first: simple HTTP requests, API, RSS, etc.
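RSS in particular is worth checking first, since it's stable structured XML and far less brittle than scraping rendered HTML. A stdlib-only sketch:

```python
import xml.etree.ElementTree as ET

def parse_rss(xml_text: str) -> list:
    # RSS items live under <channel>; iter() finds them at any depth.
    root = ET.fromstring(xml_text)
    return [
        {"title": item.findtext("title"), "link": item.findtext("link")}
        for item in root.iter("item")
    ]

# Usage: parse_rss(urllib.request.urlopen(feed_url).read().decode())
```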
https://github.com/WebReflection/linkedom
When the content is complex or involves clicking, Playwright is probably the best tool for the job.
One is signature/fingerprint emulation. It helps to run the bot in a real browser and export the fingerprint (e.g. UA, canvas, geolocation, etc.) into a JS object. Add noise to the data too.
Simulate residential IPs by routing through a residential proxy. If you run bots from the cloud you will get blocked.
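The rotation side of this is straightforward to sketch. The proxy endpoints and UA strings below are placeholders; real ones come from your proxy provider and from real browsers:

```python
import random

# Illustrative pools only -- substitute your provider's endpoints
# and genuine, current browser User-Agent strings.
PROXIES = [
    "http://user:pass@res-proxy-1.example.net:8000",
    "http://user:pass@res-proxy-2.example.net:8000",
]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Safari/605.1.15",
]

def pick_identity():
    # A fresh proxy + User-Agent pair per request; pass these straight
    # to requests.get(url, headers=headers, proxies=proxies).
    proxy = random.choice(PROXIES)
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    proxies = {"http": proxy, "https": proxy}
    return headers, proxies
```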
Scrapy is still king for me (scrapy.org). There are even packages to use headless browsers for those awful JavaScript-heavy sites.
However, APIs and RSS are still in play, and those don't require a heavy scraper. I am building vertical industry portals, and many of my data rollups consume APIs and structured XML/RSS feeds from social and other sites.
The purpose was to enable "live interactive" scraping of forms/js/ajax sites, with a web frontend controlling maybe 10 scrapers for each user. When that project fell through, I stopped maintaining it and the spidermonkey api has long since moved on.
It works for simple sites that don't require the DOM to actually do anything (for example triggering images to load with some magic url). But many simple DOM behaviours can be implemented.
Puppeteer + JSDOM is what I used to build https://www.getscrape.com, which is a high-level web scraping API. Basically, you tell the API if you want links, images, texts, headings, numbers, etc; and the API gets all that stuff for you without the need to pass selectors or parsing instructions.
In case anyone here wants something straightforward. It works well to build generic scraping operations.
* Apify (https://apify.com/) is a great, comprehensive system if you need to get fairly low-level. Everything is hosted there, they've got their own proxy service (or you can roll your own), and their open source framework (https://github.com/apify/crawlee) is excellent.
* I've also experimented with running both their SDK (crawlee) and Playwright directly on Google Cloud Run, and that also works well and is an order-of-magnitude less expensive than running directly on their platform.
* Bright Data, née Luminati, is excellent for cheap data-center proxies ($0.65/GB pay as you go), but prices get several orders of magnitude higher if you need anything more thorough than data-center proxies.
* For some direct API crawls that I do, all of the scraping stuff is unnecessary and I just ping the APIs directly.
* If the site you're scraping is using any sort of anti-bot protection, I've found that ScrapingBee (https://www.scrapingbee.com/) is by far the easiest solution. I spent many many hours fighting anti-bot protection doing it myself with some combination of Bright Data, Apify and Playwright, and in the end I kinda stopped battling and just decided to let ScrapingBee deal with it for me. I may be lucky in that the sites I'm scraping don't really use JS heavily, so the plain vanilla, no-JS ScrapingBee service works almost all of the time for those. Otherwise it can get quite expensive if you need JS rendering, premium proxies, etc. But a big thumbs up to them for making it really easy.
Always looking for new techniques and tools, so I'll monitor this thread closely.
It lets you train a bot in 2 minutes. The bot will then open the site with rotating geolocated ip addresses, solve captchas, click on buttons and scroll and fill out forms, to get you the data you need.
It’s integrated with Google Sheets, Airtable, Zapier, and more.
We have a Google Sheets addon too which lets you run robots and get their results all in a spreadsheet.
We have close to 10,000 users with 1,000+ signing up every week these days. That made us raise a bit of funding from Zapier and others to be able to scale quicker and build the next version.
Would be cool to reverse engineer it and probably plug it into some JS rendering testing solution (say Puppeteer, etc.)
[1] https://chrome.google.com/webstore/detail/instant-data-scrap...
Web scraping is fun, but in production it’s an absolute joke.
Personally, I use Indexed (https://www.indexedinc.com) because they are technical and reliable, although there are many other providers out there..
>Thanks for the links. And I read too. I see a lot of useful stuff that I will use for my site https://los-angeles-plumbers.com/
I use Selenium every few months so I have to update the drivers but otherwise it is pretty painless.
Selenium is much slower than BS4, which is preferred for static sites.
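For a static page, parsing the fetched HTML directly with BeautifulSoup avoids browser startup entirely (the markup below is a made-up example):

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# No browser, no drivers: just parse the HTML string you already fetched.
html = '<div class="listing"><h2>Widget</h2><span class="price">$9.99</span></div>'
soup = BeautifulSoup(html, "html.parser")
item = {
    "name": soup.select_one(".listing h2").get_text(),
    "price": soup.select_one(".listing .price").get_text(),
}
print(item)  # {'name': 'Widget', 'price': '$9.99'}
```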
[1] Scrapy is a well-documented framework, so any Python programmer can start using it after 1 month of training. There are a lot of guides for beginners.
[2] Lots of features are already implemented and open-source, you won’t have to waste time & money on them.
[3] There is a strong community that can help with most of the questions (I don't think any other alternative has that).
[4] Scrapy developers are cheap. You will only need junior to mid-level software engineers to pull off most of the projects. It's not rocket science.
[5] Recruiting is easier:
- there are hundreds of freelancers with relevant expertise
- if you search on LinkedIn, there are hundreds of software developers that have worked with Scrapy in the past, and you don't need that many
- you can grow expertise in your own team quickly
- developers are easily replaceable, even on larger projects
- you can use the same developers on backend tasks
[6] You don’t need a DevOps expertise in your web scraping team because Scrapy Cloud (https://www.zyte.com/scrapy-cloud/) is good and cheap enough for 99% of the projects.
[7] If you decide to have your own infrastructure, you can use https://github.com/scrapy/scrapyd.
[8] The entire ecosystem is well-maintained and steadily growing. You can integrate a lot of third-party services into your project within hours: proxies, captcha solving, headless browsers, HTML parsing APIs.
[9] It’s easy to integrate your own AI/ML models into the scraping workflow.
[10] With some work, you can use Scrapy for distributed projects that scrape thousands (or millions) of domains. We are using https://github.com/rmax/scrapy-redis.
[11] Commercial support is available. There are several companies that can develop an entire project for you, or take over an existing one, if you don't have the time or don't want to do it on your own.
We have built dozens of projects in multiple industries:
- news monitoring
- job aggregators
- real estate aggregators
- ecommerce (anything from 1 website, to monitoring prices on 100k+ domains)
- lead generation
- search engines in a specific niche (SEO, pdf files, ecommerce, chemical retail)
- macroeconomic research & indicators
- social media, NFT marketplaces, etc
So, most of the projects can be finished using these tools.
In general, as long as you don't have to log in, don't infringe on intellectual property rights, and don't harm the targeted servers, you should be OK.
The prices are not very friendly though...
Great thing is, it has support for Zapier, webhooks, and API access too!
$x = iwr https://google.com
foreach ($link in $x.Links.href) { iwr $link -OutFile foo ...}