Are these raters experts in the field the report was written on? Did they rate the reports on factuality, broadness, and insights?
These sorts of tests (and RLHF in general) are the reason LLMs often respond with "Great question, you are exactly right to wonder..." or "Interesting insight, I agree that...". I do not want this obsequious behavior, I want "correct answers"[0]. We need better benchmarks when it comes to human preference.
[0]: I know there is no objective correct answer for some questions.
OpenAI's Deep Research seems oddly restricted in the number of sources it uses, e.g. citing the same survey article over and over. I suspect it is just too draining and demoralizing for RLHF raters to check Deep Research's citations (especially without a formal bibliography).
Maybe for an extremely limited number of people. For the rest of the world, it meant searching the web, books, or scholarly publications, reading a ton, taking notes, and then possibly producing a report. Which is pretty much exactly what these AI agents are claimed to do, so "deep research" is the perfect name for it. Whether they are any good at it compared to humans is a question that hasn't been answered to my satisfaction yet, but the name I'm fine with.
The task was to look up information about a late, distant family member who had been a prominent employee in a certain foreign government about 100 years ago.
Gemini barely scratched the surface and pretty much gave up.
ChatGPT, on the other hand, kept building on its research, connecting the dots and leveraging each bit of acquired information to try to find more.
Man, what's really missing from all of this is a third-party, Consumer Reports-style site for all of these LLM tools. Whoever does this thing that does not scale will have a highly referenced site on their hands.
My ranking: OpenAI > Grok 3 Deeper > Gemini 2.0 Pro. All have been terrible for the 100 or so times I've used them (all SWE- or finance-related in some way).