undefined | Better HN

0 pointsmatheusmoreira3y ago0 comments

This was my approach too and it's been working great. Nowadays data isn't rendered directly into HTML anymore, it gets downloaded from some JSON API endpoint. So I use network monitoring tools to see where it's coming from and then inferface with the endpoint directly. I essentially wrote custom clients for someone else's site. One of my scrapers is actually just curl piped into jq. Sometimes they change the API and I have to adapt but that's fine.

> I understand companies can put roadblocks to hinder this

Can you elaborate? I haven't run into any roadblocks yet but I'm not scraping big sites or sending a massive number of requests.

0 comments

shubhamjain3y ago

> Can you elaborate? I haven't run into any roadblocks yet but I'm not scraping big sites or sending a massive number of requests.

Cloudflare Bot Protection[1] is a popular one. The website is guarded by a layer of code that needs to be executed before continuing. Normal browsers will follow through. It can be hard to bypass.

[1]: https://www.cloudflare.com/pg-lp/bot-mitigation-fight-mode/

jamescampbell3y ago

I have a codebase that defeats cloudflare protection. Felt like I had keys to kingdom.

GekkePrutser3y ago

So that would break text browsers too, right? :(

And users with JS disabled for privacy reasons.

j / k navigate · click thread line to collapse