Is there a deep searcher that can also use local LLMs like those hosted by Ollama and LM Studio?
From a quick glance, this project doesn't seem to use tool/function calling, streaming, structured-output enforcement, or any other "fancy" API features, so chances are it will just work, although I have some reservations about output quality, especially with smaller models.
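Since only plain chat completions are needed, pointing the project at a local server is mostly a matter of swapping the base URL: Ollama and LM Studio both expose OpenAI-compatible endpoints (by default at ports 11434 and 1234 respectively). A minimal stdlib sketch, assuming a hypothetical model name (`llama3.1`) that your local server may or may not have loaded:

```python
import json
import urllib.request

def build_chat_request(prompt, model="llama3.1"):
    """Plain chat-completion payload: no tools, no streaming, no format enforcement."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def chat(prompt, base_url="http://localhost:11434/v1", model="llama3.1"):
    """POST to an OpenAI-compatible endpoint.

    Works the same against Ollama (port 11434) or LM Studio (port 1234),
    since nothing beyond a basic completion is requested.
    """
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_request(prompt, model)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Swapping `base_url` to `http://localhost:1234/v1` targets LM Studio instead; everything else stays the same.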
This version appears to show off a vector store for documents generated from a web crawl (the writer is a vector-store-as-a-service company).
[1] https://github.com/huggingface/smolagents/tree/main/examples...
I think the biggest one is the goal: HF's is to replicate the performance of Deep Research on the GAIA benchmark, whereas ours is to teach agentic concepts and show how to build research agents with open-source tools.
Also, we go into the design in a lot more detail than HF's blog post does. On the design side, HF uses code writing and execution as a tool, whereas we use prompt writing and calling as a tool. We do an explicit breakdown of the query into sub-queries, sub-sub-queries, and so on, whereas HF uses a chain of reasoning to decide what to do next.
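The explicit-breakdown approach can be sketched as a small recursion: ask the model to split a question, then split each sub-question, and search only the leaves. This is a toy illustration, not the actual implementation; `llm` here is a stand-in for whatever model call you use, and the prompt wording and depth limit are assumptions.

```python
def decompose(query, llm, depth=0, max_depth=2):
    """Recursively break a research query into sub-queries.

    `llm` is any callable mapping a prompt string to a newline-separated
    list of sub-questions (a stand-in for a real model call).
    Returns a tree of (query, children) pairs.
    """
    if depth >= max_depth:
        return (query, [])
    prompt = (
        "Break this research question into 2-4 narrower sub-questions, "
        f"one per line:\n{query}"
    )
    subs = [line.strip() for line in llm(prompt).splitlines() if line.strip()]
    return (query, [decompose(s, llm, depth + 1, max_depth) for s in subs])

def leaves(tree):
    """Collect the leaf queries -- these are what actually gets searched."""
    query, children = tree
    if not children:
        return [query]
    return [q for child in children for q in leaves(child)]
```

A chain-of-reasoning agent, by contrast, would decide its next action one step at a time instead of materializing the whole tree up front.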
I think ours is the better approach for producing a detailed report on an open-ended question, whereas HF's is better for answering a specific, challenging question in short form.
For now we’ve just managed to optimize how quickly we download pages, but haven’t found an API that actually caches them. Perhaps companies are concerned that they’ll be sued for it in the age of LLMs?
The Brave API provides ‘additional snippets’, meaning that you at least get multiple slices of the page, but it’s not quite a substitute.
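Absent an API that caches pages for you, the obvious workaround is to cache them yourself in front of whatever downloader you already have. A minimal sketch under assumed names (`PageCache`, an injected `fetch` callable) -- not any particular project's implementation:

```python
import hashlib
from pathlib import Path

class PageCache:
    """Cache fetched pages on disk, keyed by a hash of the URL.

    `fetch` is any callable url -> str (your downloader of choice);
    repeated requests for the same URL read the local copy instead
    of hitting the site again.
    """

    def __init__(self, fetch, cache_dir="page_cache"):
        self.fetch = fetch
        self.dir = Path(cache_dir)
        self.dir.mkdir(exist_ok=True)

    def _path(self, url):
        # Hash the URL so arbitrary characters never leak into filenames.
        return self.dir / hashlib.sha256(url.encode("utf-8")).hexdigest()

    def get(self, url):
        path = self._path(url)
        if path.exists():
            return path.read_text(encoding="utf-8")
        text = self.fetch(url)
        path.write_text(text, encoding="utf-8")
        return text
```

A real version would add expiry and respect robots/no-archive signals, which is exactly where the legal worry above comes in.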
I wrote my own implementation using various web search APIs and a Puppeteer service to download individual documents as needed. It wasn't that hard, but I do get blocked by some sites (Reddit, for example).
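One way to handle the blocking is to detect block responses and route those domains through the headless-browser fetcher instead of plain HTTP. A hedged sketch -- the status codes and block-page markers are illustrative heuristics, and both fetchers are stand-in callables:

```python
from urllib.parse import urlparse

BLOCK_STATUSES = {403, 429}
BLOCK_MARKERS = ("captcha", "access denied", "unusual traffic")  # illustrative

def looks_blocked(status, body):
    """Heuristic: does this response look like a bot block rather than content?"""
    if status in BLOCK_STATUSES:
        return True
    head = body[:2000].lower()
    return any(marker in head for marker in BLOCK_MARKERS)

class FetchRouter:
    """Try the plain HTTP fetcher first; fall back to a headless-browser
    fetcher (e.g. a Puppeteer service) for domains that block plain requests.

    Both fetchers are callables url -> (status_code, body_text).
    """

    def __init__(self, plain_fetch, browser_fetch):
        self.plain_fetch = plain_fetch
        self.browser_fetch = browser_fetch
        self.blocked_domains = set()

    def get(self, url):
        domain = urlparse(url).netloc
        if domain not in self.blocked_domains:
            status, body = self.plain_fetch(url)
            if not looks_blocked(status, body):
                return body
            # Remember the block so we go straight to the browser next time.
            self.blocked_domains.add(domain)
        status, body = self.browser_fetch(url)
        return body
```

Some sites block headless browsers too, so even this fallback isn't a complete answer.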
Google and Bing removed their cache features when LLMs started taking off – as I said in a sibling comment, I wonder if they felt that that regime was finally going to be challenged in court as people tried to protect their data.
That being said, "can't present the full document due to copyright" seems at odds with all of the above examples existing for years.
We started off with arXiv papers to test out the product -- would love to get feedback :)
https://milvus.io/blog/i-built-a-deep-research-with-open-sou...
https://milvus.io/blog/introduce-deepsearcher-a-local-open-s...
https://gist.github.com/zitterbewegung/086dd344d16d4fd4b8931...
The QuickStart had a good response. [1] https://gist.github.com/zitterbewegung/086dd344d16d4fd4b8931...
It could be useful for comparing reports built using DeepSeek R1 vs. GPT-4o and other large models. Since the code is open source, it might surface the limitations of different LLMs much faster and help develop better reasoning loops in future prompts for specific needs. Really interesting stuff.
Search is not the problem. What to search is!
Using a reasoning model, it is much easier to split the task and focus on what to search for.