Are these raters experts in the field the report was written on? Did they rate the reports on factuality, broadness, and insights?
These sorts of tests (and RLHF in general) are the reason LLMs often respond with "Great question, you are exactly right to wonder..." or "Interesting insight, I agree that...". I do not want this obsequious behavior, I want "correct answers"[0]. We need better benchmarks when it comes to human preference.
[0]: I know there is no objective correct answer for some questions.
OpenAI's Deep Research seems oddly restricted in the number of sources it uses, e.g. citing the same survey article over and over. I suspect it is just too draining and demoralizing for RLHF raters to check Deep Research's citations (especially without a formal bibliography).
Maybe for an extremely limited number of people. For the rest of the world, it meant searching the web, books, or scholarly publications, reading a ton, taking notes, and then possibly producing a report. Which is pretty much exactly what these AI agents are claimed to do, so "deep research" is the perfect name for it. Whether they are any good at it compared to humans is a question that hasn't been answered to my satisfaction yet, but the name I'm fine with.
The task was to look up information about a late, distant family member who had been a prominent employee in a certain foreign government about 100 years ago.
Gemini barely scratched the surface and pretty much gave up.
ChatGPT, on the other hand, kept building on its research, connecting the dots and leveraging each bit of acquired information to try to find more.
Man, what's really missing from all of this is a third-party, Consumer Reports-style site for all of these LLM tools. Whoever does this thing that does not scale will have a highly referenced site on their hands.
My ranking: OpenAI > Grok 3 Deeper > Gemini 2.0 Pro. All have been terrible for the 100 or so times I've used them (all SWE- or finance-related in some way).