Show HN: Turn any website into a knowledge base for LLMs (opens in new tab)

(embedding.io)

305 pointstompec1y ago122 comments

I built this tool because I wanted a way to just take a bunch of URLs or domains, and query their content in RAG applications.

It takes away the pain of crawling, extracting content, chunking, vectorizing, and updating periodically.

I'm curious to see if it can be useful to others. I meant to launch this six months ago but life got in the way...

122 comments

vladde1y ago

The example API key on the page is decoded to "WOW YOU'RE A HACKER" :)

soundblaster1y ago

that == at the end looked like a base64 encoded string ;)

jeanloolz1y ago

I built a similar thing as a python library that does just that: https://github.com/philippe2803/contentmap

Blog post that explains the rationale behind the library: https://philippeoger.com/pages/can-we-rag-the-whole-web

Just submit your XML sitemap into a python class, and it will do the crawling, chunking, vectorizing and storage in an SQLite file for you. It's using SQLiteVSS integration with Langchain, but thinking of moving away from it, and do an integration with the new sqlite-vec instead.

samstave1y ago

This is part of a dream of a tool I would like:

A relational crawler on a particular subject with nuanced, opaque, seemingly-temporally-unrelated connections that show a particular MIC conduction of acts::

"Follow all the congress members who have been a part of a particular committee, track their signatory/support for particular ACTs that have been passed, and look at their investment history from open data, quiver, etc - and show language in any public speaking talking about conflicts and arms deals occurring whereby their support of the funding for said conflicts are traceable to their ACTs, committee seat, speaking engagements, investment profit and reporting as compared to their stated net worth over each year as compared to the stated gains stated by their filings for investment. Apply this pattern to all congress, and their public-profile orbit of folks, without violating their otherwise private-related actions."

And give it a series of URLs with known content for which these nuances may be gleaned.

Or have a trainer bot that will constantly only consume this context from the open internet over time such that you can just have a graph over time for the data...

PYTHON: Run it all through txtai / your library ? nodes and ask questions of the data in real time?

(And it reminds me of the work of this fine person/it::

https://mlops.systems/#category=isafpr

https://mlops.systems/#category=afghanistan

xrd1y ago

I know sqlite-vss has been upgraded lately. But, it was unstable for a while prior. Are you having good experiences with it?

jeanloolz1y ago

Actually, Sqlite-vss has been untouched for quite some time, and the creator has officially communicated that it was deprecated to be replaced by sqlite-vec, which has recently seen its first non-alpha release (v0.1.0). So, I would embrace sqlite-vec now if I were you.

I have not used sqlite-vec much because it was only alpha-released for now, but it finally came out a few days ago. I'm looking into integrating it and use it to make sqlite more my go-to RAG database.

23B11y ago

I tried it out. This would be extremely useful to me to the point I'd be willing to happily pay for it, as it's something I would have otherwise had to spend a long time hacking together.

1) The returned output from a query seems pretty limited in length and breadth.

2) No apparent way to adjust my prompts to improve/adjust the output e.g. not really 'conversational' (not sure if that is your intent)

Otherwise keep developing and be sure to push update notifications to your new mailing list! ;-)

dmje1y ago

Agree with this. I also think the emphasis here (to OP) should be "I'd be willing to happily pay for it" - ie I'd rather be paying a reasonable amount each month for something that is going to remain active that have the large (current) disparity between "free" and "enterprise". I'd say make some middle tiers of (I don't know?) $5 / $10 / $20 a month for reasonable numbers of queries or whatever. Keep the "enterprise" offering there for the biggies, but offer us small players some hope that this will be sufficiently funded / supported.

Brilliant idea, btw, I like it :-)

tompecOP1y ago

Thanks! I'm still figuring things out about pricing, but there will be small plans available.

tompecOP1y ago

Thanks! The chat demo is actually just a small thing I put together as a preview of what can be done, but the main product is the API. But seeing that most users seem to like that, there's probably something there... If you want to email me at support at embedding.io with some requirements, I can see how to make that work for you.

MattDaEskimo1y ago

In my opinion this is a transitional niche.

Soon websites/apps whatever you want to call them will have their own built-in handling for AI.

It's inefficient and rude to be scraping pages for content. Especially for profit.

obilgic1y ago

I think each website/data-source having their own built-in AI is also a transitional period.

It is like every website having search engine vs google.

sooheon1y ago

We already have the google analogue (llm that's seen all the websites), so are we going in circles?

muratsu1y ago

I doubt. For larger players, data is valuable - so they are preventing scraping already (eg reddit, linkedin). For smaller websites there’s also not much of an incentive.. Maybe hosting providers will help with preventing scraping? like ddos protection

temporary_name1y ago

In agree that this niche is DOA. No offence to OP but the barrier for entry to this stuff is low. I built basically the same thing over a weekend for personal use. React frontend, python server, chroma for embeddings, sqlite cache, switch between open AI and anthropic (I want to add llama for full local execution when I get a better pc). I have a local SPA with named "projects", can configure crawl depth from a start page, I can set my crawl rate, don't have to pay to use it, can choose any provider I want... I'm just one guy and that took a day to get working plus a bit of polish.

I would guess the hardest thing by far in developing the advertised product would be user management, authentication, payments and wrapping the subscription model's business logic around the core loop. And probably scaling, as running embeddings over hundreds of scraped pages adds up quickly when free tier users start hammering you.

My question when deciding to sell something I've built is, if building the service model is harder than building the actual service, where is the value add?

My take on the natural evolution is that collating and caching documents, websites etc for search (with source attribution ideally) is a problem that will I think ultimately be solved by OS vendors. Why sign up for SaaS and expose all your content to untrustworthy 3rd parties, when it's built right in and handled by your "trusty" OS.

In the meantime, I reckon someone more dedicated than me will (or probably already has) open source something like I built but better, probably as a CLI tool, which will eventually reach maturity and be stolen cough I mean adopted by the top end of town.

Ethically I think nothing's changed for centuries in regards to plagiarism and attribution. It gets easier to copy work and thinking, but it also ultimately gets easier to acknowledge sources. Good folk will do the right thing as they always have done.

Regarding efficiency, I think tools like this have a place in making access to relevant and summarised knowledge during general research more efficient, when doing the broad strokes to find areas of interest to zoom in on, when more traditional approaches take over.

Interesting times anyway. I have to give credit to people that try, but I'm taking a back seat in thinking of ideas to productise in this space, as by the time I've thought it through, something new comes along that instantly makes it obsolete.

purple-leafy1y ago

God. Some people on hackernews suck.

This isn’t “niche”, it’s a pretty cool thing OP has built.

How about instead of commenting and trivialising what people have done, you say something positive

CPLX1y ago

> the barrier for entry to this stuff is low. I built basically the same thing over a weekend for personal use. React frontend, python server, chroma for embeddings, sqlite cache

Lmfao. God bless HN for keeping this meme going for decades by now.

hluska1y ago

Your last paragraph really says it all. You haven’t accomplished anything in the space and you’re not willing to try. So you’re just going to hate on everyone who does.

Nobody cares how you would build it because you haven’t. At least not in any form that we can see.

kaycebasques1y ago

I spent a lot of time thinking about how to manage embeddings for docs sites. This is basically the same solution that I landed on but never got around to shipping as a general-purpose product.

A key question that the docs should answer (and perhaps the "How it works" page too): chunking. You generate an embedding for the entire page? Or do you generate embeddings for sections? And what's the size limit per page? Some of our docs pages have thousands of words per page. I'm doubtful you can ingest all that, let alone whether the embedding would be that useful in practice.

tompecOP1y ago

I chunk pages and generate embeddings for each chunk. So there's no real size limit per page.

kaycebasques1y ago

The more detail, the better. If `<section>` elements are found you chunk those? Do you do it recursively or do you stop after a certain level? And when section elements don't exist, you use `<h1>`, `<h2>`, etc. to infer logical chunks?

1 more reply

crowcroft1y ago

I like this. Abstracting away the management of embeddings and vector database is something I desperately want, and adding in website crawling is useful as well.

muggermuch1y ago

I like this a lot!

But: I feel the more of these services come to being, the more likely it is that every website starts putting up gates to keep the bots away.

Sort of like a weird GenAI take on Cixin Liu's Dark Forest hypothesis (https://en.wikipedia.org/wiki/Dark_forest_hypothesis).

(Edited to add a reference.)

marcellus231y ago

Responding just because it's a pet peeve of mine: Cixin Liu did not invent the dark forest hypothesis. People were discussing it, and writing science fiction books about it, for decades before the 3BP books were published. Nothing against him, and he definitely helped popularize the concept, but I think it's incorrect to refer to it as "Cixin Liu's hypothesis".

sooheon1y ago

Was curious as a lover of the 3BP series, google gave me this:

"We've been sitting in our tree chirping like foolish birds for over a century now, wondering why no other birds answered. The galactic skies are full of hawks, that's why." (The Forge of God, Legend edition, 1989, pg 315).

Yeah, same concept and even the same imagery.

Source: https://warwick.ac.uk/fac/sci/physics/research/astro/people/...

1 more reply

dankwizard1y ago

But he is responsible for the name, not the concept. So yes it is Cixin Liu's Dark Forest hypothesis.

1 more reply

throw109201y ago

> I feel the more of these services come to being, the more likely it is that every website starts putting up gates to keep the bots away

That's why we need microtransactions, because I'd rather be able to have both nice AI services and useful data repositories that they pull from, than have to choose just one. (and that one would be AI services, because you can't stop all the scrapers, so data sources will just keep tightening their restrictions)

pjc501y ago

We're never going to have microtransactions because of microfraud - and AI makes this problem worse rather than better.

1 more reply

tremarley1y ago

This would be amazing

replete1y ago

Does anyone know of a way to do this locally with Ollama? The 'chat with documentation' thing is something I was thinking of a week ago when dealing with hallucinating cloud AI. I think it'd be worth the energy to embed a set of documentation locally to help with development

navbaker1y ago

Yes, Langchain has tooling specifically for connecting to Ollama that can be chained with other tooling in their library to pull in your documents, chunk them, and store them for RAG. See here for a good example notebook:

https://github.com/langchain-ai/langchain/blob/master/cookbo...

dartos1y ago

https://github.com/open-webui/open-webui

have_faith1y ago

Looks cool! anything about how it compares to similar RAG-as-a-service products? something I've been researching a little.

FWIW, the pricing model of jumping from free to "contact us" is slightly ominous.

crngefest1y ago

With these early stage startups it often means they haven’t really figured out how to price their product and will cut you a very generous deal if you push a bit

notatoad1y ago

but what does "a generous deal" even mean. they have no pricing at all listed, so there's no reference point. if i want to index ten pages more than the free tier is it going to cost me $10/mo or $10000/mo

is_true1y ago

Do you plan on doing revenue sharing with the site owners?

mmkos1y ago

Do OpenAI and all the other LLM behemoths?

is_true1y ago

this is a completely different product.

dmitrygr1y ago

  > Turn any website into a knowledge base for LLMs

I would pay for the opposite product: make your website completely unusable/unreadable by LLMs while readable by real humans, with low false positive rates.

toomuchtodo1y ago

Could you support ingesting WARC files?

https://github.com/harvard-lil/warc-gpt

https://lil.law.harvard.edu/blog/2024/02/12/warc-gpt-an-open...

ashwinnair991y ago

How are you deciding on the best RAG configuration for your app? How you decide chunking strategy, embedding and retrievers for your app? Check out our open source tool-RAgBuilder that allows developers get to the top performing RAG for their data https://news.ycombinator.com/item?id=41145093

lua-steve1y ago

How does this handle changes to the website? Does it re-crawl the whole site periodically and regenerate the embeddings? Or is there some sort of diff-checker that only picks up pages that have changed, added, or deleted?

samuria1y ago

Interesting, I wanted to do this for a personal use case (mostly learning), but with PDFs. What's tech stack? I have explored using the AWS AI tools, but it seems a bit overkill for what I want it to do.

kordlessagain1y ago

Here's some code to deal with that:

https://github.com/MittaAI/SlothAI/blob/main/SlothAI/lib/pro...

https://github.com/MittaAI/mitta-community/tree/main/service...

There's code in there that just reads PDF meta data as well, but you can't always guarantee it's there in a PDF.

lou13061y ago

If the PDFS are textual or have OCR, then pdf2text from the Poppler suite ought to be enough? If not, add Tesseract/ocrmypdf to the pipeline?

tompecOP1y ago

Tech stack is a mix of serverless Laravel, with Cloudflare and AWS functions, and some Pinecone for vector storage. Still experimenting on a few things but don't want to over-engineer unless I know where I'm going.

stevenicr1y ago

Given that cloudflare spies on traffic and reports to multiple agencies on it's findings, perhaps a breakdown of the chain and the privacy implications of each block in the stack would be beneficial?

1 more reply

oneshtein1y ago

Try aichat: https://github.com/sigoden/aichat

dazbradbury1y ago

Nice! What's the underlying model / RAG approach being used? Be good to understand that part as presumably it will have a big impact on performance / usability of the results.

jakubsuchy1y ago

I feel like this is unethical. You built yet another bot scraper. It would only be an ethical tool if it validated I own the website I am scraping before it starts.

corn13read21y ago

yes only big conglomerates now can scrape pages. If you're not google stealing the info then...... Right?

jakubsuchy1y ago

I didn't say that but a site owner should have the right to decide.

In addition, this scraper doesn't even identify itself (I checked). It pretends to be a normal browser, without saying it's a scraper.

1 more reply

visarga1y ago

This is probably a losing direction - protecting your little island of content in the sea of internet and LLM outputs. Get more value by exposure. This is the trend of open source, wikipedia and open scientific publication. LLMs double down on the same collaborative approach to intelligence.

You can of course decouple from the big discussion and isolate your content with access restrictions, but the real interesting activity will be outside. Look for example the llama.cpp and other open source AI tools we have gotten recently. So much energy and enthusiasm, so much collaboration. Closed stuff doesn't get that level of energy.

I think IP laws are in for a reckoning, protecting creativity by restricting it is not the best idea in the world. There are better models. Copyright is anachronic, it was invented in the era of the printing press when copying was made easy, LLMs remix they don't simply copy, even the name is unfitting for the new reality. We need to rename it remixright.

pjc501y ago

> Get more value by exposure

The LLM era doesn't give credit or attribution to its sources. It erases exposure. So there's a disincentive to collaborate with it, because it only takes.

> I think IP laws are in for a reckoning, protecting creativity by restricting it is not the best idea in the world.

We've been having this discussion for over 20 years since the Napster era, or even the era of elaborate anti piracy measures for computer games distributed on tapes 40 years ago.

I've reached the conclusion that the stable equilibrium is "small shadow world": enough IP leakage for piracy and preservation, but on a noncommercial scale. We sit with our Plex boxes and our adblockers, knowing that 90% of the world isn't doing that and is paying for it. Too much control is an IP monopoly stranglehold where it costs multiple dollars to set a song as your phone ringtone or briefly heard background music gets your video vaporised off social media. Too _little_ control and eventually there is actually a real economic loss from piracy, and original content does not get made.

AI presents a third threat: unlimited pseudo-creative "slop", which is cheap and adequate to fill people's scrolling time but does not pay humans for its creation and atrophys the creative ecosystem.

realusername1y ago

Well, Google itself is just an unethical bot scrapper then...

elric1y ago

Several lawsuits have confirmed that. Google regurgitating articles from French newspaper sites comes to mind.

This is not an easy problem to solve. In my naive take, authors get to decide how their work is used, not scrapers.

2 more replies

hluska1y ago

I like the concept, the documentation is very good and I even enjoy the domain name. This is an excellent launch and congratulations on getting it out.

danirogerc1y ago

Can I query multiple vectorized websites at once? Can I export vectorized websites and host them myself? Any chance to export them to a no-code format, like PDF?

tompecOP1y ago

You can group as many websites as you want into a collection. Then query that collection. Not sure what you mean by exporting; you would like to export the vectors themselves? Or just the chunks of text from the websites?

Cynddl1y ago

I find it interesting that as an (edit: UK) academic researcher, I would be likely be forbidden to use tools like this, that fail basic ethics standards, regulations such as GDPR, and practical standards such as respecting robots.txt [given there's no information on embedding.io, it's unlikely I can block the crawler when designing a website].

There's still room for an ethical development of such crawlers and technologies, but it needs to be consent-first, with strong ethical and legal standards. The crazy development of such tools has been a massive issue for a number of small online organisations that struggle with poorly implemented or maintained bots (as discussed for OpenStreetMap or Read The Docs).

popcorncowboy1y ago

I'm less convinced. Are you saying it's unethical to automate browsing a site?

Because if you save the pages you browse on some site, they're yours (authors don't own your cache).

Perhaps you're arguing that if you wrote a lightweight script/browser (which is just your user agent) to save some website for offline use, that'd be unethical and GDPR violating? Again, I don't think so but maybe I'm missing something. But perhaps this turns on what defines a "user agent".

Perhaps this becomes a "depth of pre-fetch" question. If your browser prefetches linked pages, that's "automated" downloading, akin to the script approach above. Downloading. To your cache. Which you own. (Where I struggle to see an ethical violation)

Genuinely curious where the line is, or what exactly here is triggering ethics, GDPR and practical standards?

Cynddl1y ago

Maybe a good illustration would be ClearView AI. They are scraping websites, extracting information (images), and training ML models to learn embeddings (distance between faces). They indiscriminately collect personal data without opt-in, but a limited opt-out mechanism.

In this case, if this tool is used to scrape a website, there are too direct issues: 1/ no immediate way for the website owner to exclude this particular scraper (what is the useragent?) 2/ no way for data subjects (whose data is present on the website) to search whether the scraper learned their personal data in the embeddings. Data being available publicly doesn't mean it can be widely used [at least outside the US, where we have much stricter rules on scraping].

ancras1y ago

This is interesting. Can it work with any website, even say document repositories hosted on standard servers like gitbook?

tompecOP1y ago

It works with pretty much any website, and works well with docs hosted on GitBook yes, I have embedded a website that's hosted there.

webappguy1y ago

Confirmation email doesn't work, so cannot try it. I attemtped twice and checked spam

1 more reply

nashashmi1y ago

Would you share the source? I want to use this for a private internal network of pages. How would that work?

oars1y ago

Interested in seeing whether this will be widespread in 5 years or whether sites will have fought back.

dartos1y ago

Sites are already fighting back.

Twitter and Reddit locked down their APIs. Soon enough, you’ll need an account to even access any content

alok-g1y ago

Would be great to use for developer documentation for various languages, frameworks and libraries.

blackeyeblitzar1y ago

Is there a way to deal with websites where you need to login? Like subscription based sites?

tompecOP1y ago

Unless you own those sites, I'm afraid that's not going to be possible.

brianfryer1y ago

But if I do own those sites, there’s a way?

michaelmior1y ago

This looks interesting, but I get a 404 on the iframe when I try to go into the chat.

tompecOP1y ago

Sorry about that, a bit too much load at the moment

breck1y ago

#1. Gratuitous self promotion (but also my honest best advice): The future of knowledge bases is ScrollSets: https://sets.scroll.pub/

#2. If you are interested in knowledge bases, see #1

mattfrommars1y ago

So i provide a URL, your service does the crawling of the site?

suyash1y ago

Can it get content that is gated/behind login ?

onemoresoop1y ago

How would expect it to do that??

suyash1y ago

there are ways to do that, just wondering if this tool has the capability or not.

vulture9161y ago

Experiencing many Internal Server Errors.

rvz1y ago

> Enterprise: Contact Us

If there is no certifications or compliance information then I don't think there is anything to discuss about any enterprise plan.

tompecOP1y ago

Gotta start somewhere :)

olalonde1y ago

Which LLM model is it using?

barrenko1y ago

Will this work for forums?

rcarmo1y ago

How do I feed it a sitemap?

tompecOP1y ago

It currently will try to find a sitemap on its own. But I have on the roadmap to let users add their own.

ckluis1y ago

How much does it cost?

boredemployee1y ago

does it embbed images as well? if not, do you plan to do so?

tompecOP1y ago

It doesn't embed images, no. But that's a great idea for the roadmap!

boredemployee1y ago

great. I really want a feature like that! I'd like to query my knowledge base about images as well!

cranberryturkey1y ago

how does this work?

kordlessagain1y ago

I do this with https://mitta.ai by using a Playwright container that does a callback to a pipeline that uses either meta data from the PDF or sends it to an EasyOCR deployment on a GPU instance on Google for text extraction. Then I use a custom chunker and instructor/xl embeddings.

All of that code is Open Source, and works well for most sites. Some sites block Google IPs, but the Playwright container can run locally, so should be able to work around it with some minimal effort.

tompecOP1y ago

Give it URLs or domains, and it will crawl and extract their content, embed them in a vector database, and give you an endpoint that you can then query when doing RAG stuff or semantic search.

xiconfjs1y ago

But how does it work in the background? What‘s the tech stack?

1 more reply

sunir1y ago

There are a few ways. I built something similar huckai.com on top of vectara.com. They have open sourced their versions https://github.com/vectara

You can also do this on AWS now fairly easily. https://medium.com/data-reply-it-datatech/how-to-build-a-cus...

The lablab.ai Discord community is a pretty good place to learn how this product category is evolving.

suyash1y ago

any open source tools for doing just this?

r0b051y ago

Does it hallucinate much?

mkw50531y ago

I made a similar open source app a year ago or so https://github.com/mkwatson/chat_any_site

1 more reply

pryelluw1y ago

Does this respect robots.txt?

danirod1y ago

I hope this gets answered.

Also I've checked their docs to see if there is any mention about the user agents or IP ranges they use for scraping, with no luck.

tompecOP1y ago

It does respect robots.txt when crawling. I'll add more details about this in the docs.

pryelluw1y ago

I appreciate the reply. As someone who runs multiple CMSs it’s painful to deal with the ai crawlers these days. Specially the ones that don’t respect my terms.

srameshc1y ago

Valid question and I am sure it doesn't.

khanan1y ago

Can this be deployed on-prem or is it an cloud-toy?

tompecOP1y ago

Currently just a cloud-toy.

j / k navigate · click thread line to collapse

122 comments

vladde1y ago

The example API key on the page is decoded to "WOW YOU'RE A HACKER" :)

soundblaster1y ago

that == at the end looked like a base64 encoded string ;)

jeanloolz1y ago

I built a similar thing as a python library that does just that: https://github.com/philippe2803/contentmap

Blog post that explains the rationale behind the library: https://philippeoger.com/pages/can-we-rag-the-whole-web

samstave1y ago

This is part of a dream of a tool I would like:

A relational crawler on a particular subject with nuanced, opaque, seemingly-temporally-unrelated connections that show a particular MIC conduction of acts::

And give it a series of URLs with known content for which these nuances may be gleaned.

Or have a trainer bot that will constantly only consume this context from the open internet over time such that you can just have a graph over time for the data...

PYTHON: Run it all through txtai / your library ? nodes and ask questions of the data in real time?

(And it reminds me of the work of this fine person/it::

https://mlops.systems/#category=isafpr

https://mlops.systems/#category=afghanistan

xrd1y ago

I know sqlite-vss has been upgraded lately. But, it was unstable for a while prior. Are you having good experiences with it?

jeanloolz1y ago

23B11y ago

I tried it out. This would be extremely useful to me to the point I'd be willing to happily pay for it, as it's something I would have otherwise had to spend a long time hacking together.

1) The returned output from a query seems pretty limited in length and breadth.

2) No apparent way to adjust my prompts to improve/adjust the output e.g. not really 'conversational' (not sure if that is your intent)

Otherwise keep developing and be sure to push update notifications to your new mailing list! ;-)

dmje1y ago

Brilliant idea, btw, I like it :-)

tompecOP1y ago

Thanks! I'm still figuring things out about pricing, but there will be small plans available.

tompecOP1y ago

MattDaEskimo1y ago

In my opinion this is a transitional niche.

Soon websites/apps whatever you want to call them will have their own built-in handling for AI.

It's inefficient and rude to be scraping pages for content. Especially for profit.

obilgic1y ago

I think each website/data-source having their own built-in AI is also a transitional period.

It is like every website having search engine vs google.

sooheon1y ago

We already have the google analogue (llm that's seen all the websites), so are we going in circles?

muratsu1y ago

temporary_name1y ago

My question when deciding to sell something I've built is, if building the service model is harder than building the actual service, where is the value add?

purple-leafy1y ago

God. Some people on hackernews suck.

This isn’t “niche”, it’s a pretty cool thing OP has built.

How about instead of commenting and trivialising what people have done, you say something positive

CPLX1y ago

> the barrier for entry to this stuff is low. I built basically the same thing over a weekend for personal use. React frontend, python server, chroma for embeddings, sqlite cache

Lmfao. God bless HN for keeping this meme going for decades by now.

hluska1y ago

Your last paragraph really says it all. You haven’t accomplished anything in the space and you’re not willing to try. So you’re just going to hate on everyone who does.

Nobody cares how you would build it because you haven’t. At least not in any form that we can see.

kaycebasques1y ago

I spent a lot of time thinking about how to manage embeddings for docs sites. This is basically the same solution that I landed on but never got around to shipping as a general-purpose product.

tompecOP1y ago

I chunk pages and generate embeddings for each chunk. So there's no real size limit per page.

kaycebasques1y ago

1 more reply

crowcroft1y ago

I like this. Abstracting away the management of embeddings and vector database is something I desperately want, and adding in website crawling is useful as well.

muggermuch1y ago

I like this a lot!

But: I feel the more of these services come to being, the more likely it is that every website starts putting up gates to keep the bots away.

Sort of like a weird GenAI take on Cixin Liu's Dark Forest hypothesis (https://en.wikipedia.org/wiki/Dark_forest_hypothesis).

(Edited to add a reference.)

marcellus231y ago

sooheon1y ago

Was curious as a lover of the 3BP series, google gave me this:

Yeah, same concept and even the same imagery.

Source: https://warwick.ac.uk/fac/sci/physics/research/astro/people/...

1 more reply

dankwizard1y ago

But he is responsible for the name, not the concept. So yes it is Cixin Liu's Dark Forest hypothesis.

1 more reply

throw109201y ago

> I feel the more of these services come to being, the more likely it is that every website starts putting up gates to keep the bots away

pjc501y ago

We're never going to have microtransactions because of microfraud - and AI makes this problem worse rather than better.

1 more reply

tremarley1y ago

This would be amazing

replete1y ago

navbaker1y ago

https://github.com/langchain-ai/langchain/blob/master/cookbo...

dartos1y ago

https://github.com/open-webui/open-webui

have_faith1y ago

Looks cool! anything about how it compares to similar RAG-as-a-service products? something I've been researching a little.

FWIW, the pricing model of jumping from free to "contact us" is slightly ominous.

crngefest1y ago

With these early stage startups it often means they haven’t really figured out how to price their product and will cut you a very generous deal if you push a bit

notatoad1y ago

is_true1y ago

Do you plan on doing revenue sharing with the site owners?

mmkos1y ago

Do OpenAI and all the other LLM behemoths?

is_true1y ago

this is a completely different product.

dmitrygr1y ago

  > Turn any website into a knowledge base for LLMs

I would pay for the opposite product: make your website completely unusable/unreadable by LLMs while readable by real humans, with low false positive rates.

toomuchtodo1y ago

Could you support ingesting WARC files?

https://github.com/harvard-lil/warc-gpt

https://lil.law.harvard.edu/blog/2024/02/12/warc-gpt-an-open...

ashwinnair991y ago

lua-steve1y ago

samuria1y ago

kordlessagain1y ago

Here's some code to deal with that:

https://github.com/MittaAI/SlothAI/blob/main/SlothAI/lib/pro...

https://github.com/MittaAI/mitta-community/tree/main/service...

There's code in there that just reads PDF meta data as well, but you can't always guarantee it's there in a PDF.

lou13061y ago

If the PDFS are textual or have OCR, then pdf2text from the Poppler suite ought to be enough? If not, add Tesseract/ocrmypdf to the pipeline?

tompecOP1y ago

stevenicr1y ago

Given that cloudflare spies on traffic and reports to multiple agencies on it's findings, perhaps a breakdown of the chain and the privacy implications of each block in the stack would be beneficial?

1 more reply

oneshtein1y ago

Try aichat: https://github.com/sigoden/aichat

dazbradbury1y ago

Nice! What's the underlying model / RAG approach being used? Be good to understand that part as presumably it will have a big impact on performance / usability of the results.

jakubsuchy1y ago

I feel like this is unethical. You built yet another bot scraper. It would only be an ethical tool if it validated I own the website I am scraping before it starts.

corn13read21y ago

yes only big conglomerates now can scrape pages. If you're not google stealing the info then...... Right?

jakubsuchy1y ago

I didn't say that but a site owner should have the right to decide.

In addition, this scraper doesn't even identify itself (I checked). It pretends to be a normal browser, without saying it's a scraper.

1 more reply

visarga1y ago

pjc501y ago

> Get more value by exposure

The LLM era doesn't give credit or attribution to its sources. It erases exposure. So there's a disincentive to collaborate with it, because it only takes.

> I think IP laws are in for a reckoning, protecting creativity by restricting it is not the best idea in the world.

We've been having this discussion for over 20 years since the Napster era, or even the era of elaborate anti piracy measures for computer games distributed on tapes 40 years ago.

AI presents a third threat: unlimited pseudo-creative "slop", which is cheap and adequate to fill people's scrolling time but does not pay humans for its creation and atrophys the creative ecosystem.

realusername1y ago

Well, Google itself is just an unethical bot scrapper then...

elric1y ago

Several lawsuits have confirmed that. Google regurgitating articles from French newspaper sites comes to mind.

This is not an easy problem to solve. In my naive take, authors get to decide how their work is used, not scrapers.

2 more replies

hluska1y ago

I like the concept, the documentation is very good and I even enjoy the domain name. This is an excellent launch and congratulations on getting it out.

danirogerc1y ago

Can I query multiple vectorized websites at once? Can I export vectorized websites and host them myself? Any chance to export them to a no-code format, like PDF?

tompecOP1y ago

Cynddl1y ago

popcorncowboy1y ago

I'm less convinced. Are you saying it's unethical to automate browsing a site?

Because if you save the pages you browse on some site, they're yours (authors don't own your cache).

Genuinely curious where the line is, or what exactly here is triggering ethics, GDPR and practical standards?

Cynddl1y ago

ancras1y ago

This is interesting. Can it work with any website, even say document repositories hosted on standard servers like gitbook?

tompecOP1y ago

It works with pretty much any website, and works well with docs hosted on GitBook yes, I have embedded a website that's hosted there.

webappguy1y ago

Confirmation email doesn't work, so cannot try it. I attemtped twice and checked spam

1 more reply

nashashmi1y ago

Would you share the source? I want to use this for a private internal network of pages. How would that work?

oars1y ago

Interested in seeing whether this will be widespread in 5 years or whether sites will have fought back.

dartos1y ago

Sites are already fighting back.

Twitter and Reddit locked down their APIs. Soon enough, you’ll need an account to even access any content

alok-g1y ago

Would be great to use for developer documentation for various languages, frameworks and libraries.

blackeyeblitzar1y ago

Is there a way to deal with websites where you need to login? Like subscription based sites?

tompecOP1y ago

Unless you own those sites, I'm afraid that's not going to be possible.

brianfryer1y ago

But if I do own those sites, there’s a way?

michaelmior1y ago

This looks interesting, but I get a 404 on the iframe when I try to go into the chat.

tompecOP1y ago

Sorry about that, a bit too much load at the moment

breck1y ago

#1. Gratuitous self promotion (but also my honest best advice): The future of knowledge bases is ScrollSets: https://sets.scroll.pub/

#2. If you are interested in knowledge bases, see #1

mattfrommars1y ago

So i provide a URL, your service does the crawling of the site?

suyash1y ago

Can it get content that is gated/behind login ?

onemoresoop1y ago

How would expect it to do that??

suyash1y ago

there are ways to do that, just wondering if this tool has the capability or not.

vulture9161y ago

Experiencing many Internal Server Errors.

rvz1y ago

> Enterprise: Contact Us

If there is no certifications or compliance information then I don't think there is anything to discuss about any enterprise plan.

tompecOP1y ago

Gotta start somewhere :)

olalonde1y ago

Which LLM model is it using?

barrenko1y ago

Will this work for forums?

rcarmo1y ago

How do I feed it a sitemap?

tompecOP1y ago

It currently will try to find a sitemap on its own. But I have on the roadmap to let users add their own.

ckluis1y ago

How much does it cost?

boredemployee1y ago

does it embbed images as well? if not, do you plan to do so?

tompecOP1y ago

It doesn't embed images, no. But that's a great idea for the roadmap!

boredemployee1y ago

great. I really want a feature like that! I'd like to query my knowledge base about images as well!

cranberryturkey1y ago

how does this work?

kordlessagain1y ago

tompecOP1y ago

Give it URLs or domains, and it will crawl and extract their content, embed them in a vector database, and give you an endpoint that you can then query when doing RAG stuff or semantic search.

xiconfjs1y ago

But how does it work in the background? What‘s the tech stack?

1 more reply

sunir1y ago

There are a few ways. I built something similar huckai.com on top of vectara.com. They have open sourced their versions https://github.com/vectara

You can also do this on AWS now fairly easily. https://medium.com/data-reply-it-datatech/how-to-build-a-cus...

The lablab.ai Discord community is a pretty good place to learn how this product category is evolving.

suyash1y ago

any open source tools for doing just this?

r0b051y ago

Does it hallucinate much?

mkw50531y ago

I made a similar open source app a year ago or so https://github.com/mkwatson/chat_any_site

1 more reply

pryelluw1y ago

Does this respect robots.txt?

danirod1y ago

I hope this gets answered.

Also I've checked their docs to see if there is any mention about the user agents or IP ranges they use for scraping, with no luck.

tompecOP1y ago

It does respect robots.txt when crawling. I'll add more details about this in the docs.

pryelluw1y ago

I appreciate the reply. As someone who runs multiple CMSs it’s painful to deal with the ai crawlers these days. Specially the ones that don’t respect my terms.

srameshc1y ago

Valid question and I am sure it doesn't.

khanan1y ago

Can this be deployed on-prem or is it an cloud-toy?

tompecOP1y ago

Currently just a cloud-toy.

j / k navigate · click thread line to collapse