undefined | Better HN

0 pointsluigi231y ago0 comments

Got it, but why is it booming now and often it’s a showcase of llm model? Is there some secret market/ usecase for it?

0 comments

Building scrapers sucks.

It's generally not hard because it's conceptually very difficult, or that it requires extremely high level reasoning.

It sucks because when someone changes "<section class='bio'>" to "<div class='section bio'>" your scraper breaks. I just want the bio and it's obvious what to grab, but machines have no nuance.

LLMs have enough common sense to be able to deal with these things and they take almost no time to work with. I can throw html at something, with a vague description and pull out structured data with no engineer required, and it'll probably work when the page changes.

There's a huge number of one-off jobs people will do where perfect isn't the goal, and a fast solution + a bit of cleanup is hugely beneficial.

atomic1281y ago

Another approach is to use a regexp scraper. These are very "loose" and tolerant of changes. For example, RNSAFFN.com uses regular expressions to scrape the Commitments of Traders report from the Commodity Futures Trading Commission every week.

simonw1y ago

My experience has been the opposite: regex scrapers are usually incredibly brittle, and also harder to debug when something DOES change.

My preferred approach for scraping these days is Playwright Python and CSS selectors to select things from the DOM. Still prone to breakage, but reasonably pleasant to debug using browser DevTools.

yuters1y ago

I don't know if many has the same use case but... I'm heavily relying on this right now because my daughter started school. The school board, the school, and the teacher each use a different app to communicate important information to parents. I'm just trying to make one feed with all of them. Before AI it would have been hell to scrape, because you can imagine those apps are terrible.

Fun aside: The worst one of them is a public Facebook page. The school board is making it their official communication channel, which I find horrible. Facebook is making it so hard to scrape. And if you don't know, you can't even use Facebook's API for this anymore, unless you have a business verified account and go through a review just for this permission.

drusepth1y ago

Scrapers have always been notoriously brittle and prone to breaking completely when pages make even the smallest of structural changes.

Scraping with LLMs bypasses that pitfall because it's more of a summarization task on the whole document, rather than working specifically on a hard-coded document structure to extract specific data.

bobajeff1y ago

Personally I find it's better for archiving as most sites that don't provide a convenient way to save their content directly. Occasionally, I do it just to make a better interface over the data.

j / k navigate · click thread line to collapse