A web scraping CLI made for AI that is idempotent

(github.com)

81 points · clemlesne · 1y ago · 31 comments

renegat0x0 1y ago

I have a similar project. I scrape pages only to obtain page metadata. It can use Selenium and Crawlee; I will also add Puppeteer later.

The project is quite big and has many features.

It is my internet command center. I use it to check what's new on the internet.

https://github.com/rumca-js/Django-link-archive

nilsherzig 1y ago

Love the images haha

usernamed7 1y ago

Not to discount any actual utility or innovation here, but I was wondering "why would you be hard coding to all these Azure services?" Then I saw the author is a solutions architect at Microsoft.

So this is likely part of Microsoft's AI strategy to lure developers in and create dependence. That doesn't mean it can't also be interesting/good, but it's important context for this project's purpose and goals.

clemlesne OP 1y ago

I’m developing that in my free time because I think there is a need inside the community for that. I’m not motivated in any way by my company.

In the meantime, if you know other technologies that provide the same features (blob, queue, search), feel free to push a PR. Someone already did that for AWS: https://github.com/clemlesne/scrape-it-now/issues/8.

cha-d 1y ago

Am I right in thinking that running this regularly from your computer at home will cause you to receive more CAPTCHAs over time? If so, what are some other options?

wcallahan 1y ago

Nice work. Would love a similar repository for Google Cloud’s equivalent services!

Or a PR on this that accomplishes the same, as @clemlesne mentioned.

katella 1y ago

Does it just scrape all pages of a site?

clemlesne OP 1y ago

It saves page content, transforms it to Markdown, and (optionally) imports it into a search database to perform semantic (sentence-level) searches.
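To make that pipeline concrete, here is a minimal sketch of the first two steps (save the page, turn it into Markdown) with an idempotent write, as the title suggests. This is not the project's code: `MarkdownExtractor`, `to_markdown`, `save_page`, and the in-memory `store` dict are hypothetical names for illustration. A real run would use a proper HTML-to-Markdown library and blob storage rather than a dict.

```python
import hashlib
from html.parser import HTMLParser


class MarkdownExtractor(HTMLParser):
    """Tiny HTML-to-Markdown converter: handles h1-h3 and p only (illustrative)."""

    PREFIXES = {"h1": "# ", "h2": "## ", "h3": "### ", "p": ""}

    def __init__(self):
        super().__init__()
        self.blocks = []   # finished Markdown blocks
        self._tag = None   # block-level tag currently being read
        self._text = []    # text fragments collected for that block

    def handle_starttag(self, tag, attrs):
        if tag in self.PREFIXES:
            self._tag, self._text = tag, []

    def handle_data(self, data):
        # Text outside a tracked block (scripts, nav, etc.) is ignored.
        if self._tag is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self._text).split())
            if text:
                self.blocks.append(self.PREFIXES[tag] + text)
            self._tag = None


def to_markdown(html: str) -> str:
    parser = MarkdownExtractor()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)


def save_page(store: dict, url: str, html: str) -> bool:
    """Store the Markdown for url; return False when nothing changed."""
    markdown = to_markdown(html)
    digest = hashlib.sha256(markdown.encode("utf-8")).hexdigest()
    if store.get(url, {}).get("sha256") == digest:
        return False  # same content as the last run: skip the write
    store[url] = {"sha256": digest, "markdown": markdown}
    return True
```

Calling `save_page` twice with the same URL and HTML writes once and then reports no change, which is the property that makes repeated scraping runs safe to rerun.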

mrdw 1y ago

Why so dependent on Azure?

"Decoupled architecture with Azure Queue Storage"

"Scraped content is stored in Azure Blob Storage"

"Indexed content is semantically searchable with Azure AI Search"

clemlesne OP 1y ago

Well, it solves basic problems like queuing and blob storage. For example, to get the same features as Queue Storage, you would need RabbitMQ or similar. In an enterprise environment, that means multiple instances for high availability, maintenance, people to deploy it reproducibly…

bbor 1y ago

lol I love the cheeky `[ ] respect robots.txt` mention. I was all worried about this for my own system, but shocked to find out there’s a ton of projects openly built around breaking the law (/social protocol). Is the justification just the same as pirating entertainment, ie “big companies are bad” and/or “IP is unjustified”?

This one in particular doesn’t fit my exact use case I don’t think, but I love the repo, very clearly explained. Well done! I hadn’t even thought about ads until just now, that’s an interesting problem…
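For readers wondering what the `[ ] respect robots.txt` roadmap item could look like in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not taken from the project: `build_robots_checker`, the `SAMPLE` rules, and the `example-bot` user agent are made up for illustration. The rules are parsed from a string so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt: str, user_agent: str):
    """Return a function telling whether user_agent may fetch a given URL."""
    parser = RobotFileParser()
    # parse() accepts the robots.txt body as an iterable of lines.
    parser.parse(robots_txt.splitlines())

    def allowed(url: str) -> bool:
        return parser.can_fetch(user_agent, url)

    return allowed


# Hypothetical rules: everything is allowed except /private/.
SAMPLE = """\
User-agent: *
Disallow: /private/
"""

allowed = build_robots_checker(SAMPLE, "example-bot")
```

A crawler would call `allowed(url)` before queuing each page and skip the URL when it returns `False`. In a real run the robots.txt body would be fetched once per host and cached.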

deisteve 1y ago

i just don't get why people use web scraping as a battleground for moral ethics

it's bizarre, just like equating copyright infringement to theft of property.

where does this moral high ground come from? nobody scraping is thinking "oh im so evil im scraping without respecting robots.txt and using residential ip addresses to bypass detection"

Google does it nobody has a problem but when the little guy does it suddenly they are an outlaw.

trog 1y ago

> Google does it nobody has a problem

Historically, when Google did it, they did it to create an index, which a lot of people found useful as a way to find information they were looking for. This used to mean people would come and visit your website, where they could engage with the website creator directly through a variety of different means.

Google doing it now to digest all the content and mulch it all together to return a regurgitated form of it is a very different proposition, and that is what people are annoyed about when "the little guys" (funny name for startups with multiple billions of dollars of raised capital) are doing the same thing.

For many it's not about "moral ethics", it's about actual survival. If nobody is visiting their website, nobody is buying their products or engaging with their community or whatever.

If you're scraping content for no other purpose than to mechanistically reword it for commercial purposes, then it's not really surprising that people have issues with it.

__loam 1y ago

You're taking someone else's labor and profiting off it, without any credit or compensation. To add insult to injury, the person you're scraping pays money to support your traffic. It's a one-sided transaction.

hipadev23 1y ago

> breaking the law

citation needed

kordlessagain 1y ago

Since when is intent to implement a feature "cheeky"?

> but shocked to find out there’s a ton of projects openly built around breaking the law

The original statement oversimplifies a complex legal and ethical landscape in technology. It fails to account for the gradual nature of discovering various projects with potential legal implications, instead projecting an unrealistic sudden shock. This overlooks the nuanced reality of how technology often operates in legal gray areas, especially when dealing with emerging fields or novel applications of existing tech.

The assertion of widespread illegality ignores crucial legal concepts like fair use, which provides lawful ways to utilize publicly available information under certain circumstances. For instance, web crawling for legitimate purposes, including research or analysis that falls under fair use, can be perfectly legal despite potential objections from website owners.

Furthermore, the statement disregards the principle that information openly published on the internet, without robust privacy protections, may often be legally utilized in ways the publisher didn't anticipate. This reflects a misunderstanding of how modern information ecosystems function and the legal frameworks governing them. By presenting a black-and-white view of legality in tech projects, the original statement hinders a more sophisticated understanding of the intricate balance between innovation, law, and ethical considerations in the digital age. It's crucial to approach these issues with a nuanced perspective that acknowledges the complexities of applying traditional legal concepts to rapidly evolving technologies and practices.

clemlesne OP 1y ago

That's indeed in the roadmap, like you mentioned.

My primary objective is to build an LLM chat tool based on open-source documentation. I think the project owner (even more so if it is OSS) is not responsible for that; the one using it is.

You are welcome to push a PR to add other backends (including OSS)!

CalRobert 1y ago

What law are you thinking of?

clemlesne OP 1y ago

And thank you for the compliment! It's great to see that your efforts are seen and appreciated :)

bearjaws 1y ago

At this point honoring robots.txt will ensure you have a terrible searching experience...

lyime 1y ago

Ignoring robots.txt is "not breaking the law" lol

gmerc 1y ago

Eric Schmidt has you covered. You do it to win and the law isn’t for tech bros, it’s for suckers who can’t pay a lawyer
