The project is quite big and has many features.
It is my internet command center. I use it to check what's new on the internet.
So this is likely part of Microsoft's AI strategy to lure developers in and create dependence. That doesn't mean it can't also be interesting or good, but it's important context for this project's purpose and goals.
In the meantime, if you know of other technologies that provide the same features (blob, queue, search), feel free to push a PR. Someone already did that for AWS: https://github.com/clemlesne/scrape-it-now/issues/8.
Or a PR on this that accomplishes the same, as @clemlesne mentioned.
"Decoupled architecture with Azure Queue Storage"
"Scraped content is stored in Azure Blob Storage"
"Indexed content is semantically searchable with Azure AI Search"
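To make the "swap the backend" idea concrete, here is a minimal sketch of how the three quoted features (queue, blob, search) could sit behind small interfaces. Everything in it is hypothetical: the names `Queue`, `BlobStore`, `SearchIndex`, `process_one` and the in-memory classes are mine, not the repo's actual API, and the keyword match is only a stand-in for real semantic search.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Optional, Protocol


# Hypothetical interfaces, NOT the project's real API.
class Queue(Protocol):
    def push(self, url: str) -> None: ...
    def pop(self) -> Optional[str]: ...


class BlobStore(Protocol):
    def put(self, key: str, data: str) -> None: ...
    def get(self, key: str) -> str: ...


class SearchIndex(Protocol):
    def add(self, key: str, text: str) -> None: ...
    def search(self, term: str) -> List[str]: ...


class MemoryQueue:
    """In-memory FIFO standing in for a hosted queue service."""

    def __init__(self) -> None:
        self._items: Deque[str] = deque()

    def push(self, url: str) -> None:
        self._items.append(url)

    def pop(self) -> Optional[str]:
        return self._items.popleft() if self._items else None


class MemoryBlobStore:
    """Dict standing in for a blob/object store."""

    def __init__(self) -> None:
        self._blobs: Dict[str, str] = {}

    def put(self, key: str, data: str) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> str:
        return self._blobs[key]


class MemorySearchIndex:
    """Case-insensitive substring match; a placeholder for semantic search."""

    def __init__(self) -> None:
        self._docs: Dict[str, str] = {}

    def add(self, key: str, text: str) -> None:
        self._docs[key] = text

    def search(self, term: str) -> List[str]:
        needle = term.lower()
        return [key for key, text in self._docs.items() if needle in text.lower()]


def process_one(
    queue: Queue,
    blobs: BlobStore,
    index: SearchIndex,
    fetch: Callable[[str], str],
) -> Optional[str]:
    """Take one URL off the queue, store its content, then index it."""
    url = queue.pop()
    if url is None:
        return None  # nothing left to scrape
    content = fetch(url)
    blobs.put(url, content)
    index.add(url, content)
    return url
```

A PR for another provider would then only need to supply three classes with these method shapes; the scraping step itself would not change. `fetch` is passed in as a plain function so the sketch can be tried without any network access.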
I don’t think this one in particular fits my exact use case, but I love the repo; it's very clearly explained. Well done! I hadn’t even thought about ads until just now, that’s an interesting problem…
It's bizarre, just like equating copyright infringement with theft of property.
Where does this moral high ground come from? Nobody scraping is thinking "oh, I'm so evil, I'm scraping without respecting robots.txt and using residential IP addresses to bypass detection".
Google does it and nobody has a problem, but when the little guy does it, suddenly they are an outlaw.
Historically, when Google did it, they did it to create an index, which a lot of people found useful as a way to find information they were looking for. This used to mean people would come and visit your website, where they could engage with the website creator directly through a variety of different means.
Google doing it now to digest all the content and mulch it all together to return a regurgitated form of it is a very different proposition, and that is what people are annoyed about when "the little guys" (funny name for startups with multiple billions of dollars of raised capital) are doing the same thing.
For many it's not about "moral ethics", it's about actual survival. If nobody is visiting their website, nobody is buying their products or engaging with their community or whatever.
If you're scraping content for no other purpose than to mechanistically reword it for commercial purposes, then it's not really surprising that people have issues with it.
citation needed
> but shocked to find out there’s a ton of projects openly built around breaking the law
The original statement oversimplifies a complex legal and ethical landscape in technology. It fails to account for the gradual nature of discovering various projects with potential legal implications, instead projecting an unrealistic sudden shock. This overlooks the nuanced reality of how technology often operates in legal gray areas, especially when dealing with emerging fields or novel applications of existing tech.
The assertion of widespread illegality ignores crucial legal concepts like fair use, which provides lawful ways to utilize publicly available information under certain circumstances. For instance, web crawling for legitimate purposes, including research or analysis that falls under fair use, can be perfectly legal despite potential objections from website owners.
Furthermore, the statement disregards the principle that information openly published on the internet, without robust privacy protections, may often be legally utilized in ways the publisher didn't anticipate. This reflects a misunderstanding of how modern information ecosystems function and the legal frameworks governing them. By presenting a black-and-white view of legality in tech projects, the original statement hinders a more sophisticated understanding of the intricate balance between innovation, law, and ethical considerations in the digital age. It's crucial to approach these issues with a nuanced perspective that acknowledges the complexities of applying traditional legal concepts to rapidly evolving technologies and practices.
My primary objective is to build an LLM chat tool based on open-source documentation. I think the project owner (even more so if it is OSS) is not responsible for that; the one using it is.
You are welcome to push a PR to add other backends (including OSS)!