undefined | Better HN

0 pointspanarky6mo ago0 comments

[flagged]

0 comments

I usually ask a simple question that ALL the models get wrong: List of mayor of my city [Londrina]. ALL the models (offine) get wrong. And I mean, all the models. The best that I could, it's o3 I believe, saying it couldn't give a good answer for that, and told to access the city website.

Gemini 3 somehow is able to give a list of mayors, including details on who got impeached, etc.

This should be a simple answer, because all the data is on wikipedia, that certainly the models are trained on, but somehow most models don't manage to give that answer right, because... it's just a irrelevant city in a huge dataset.

But somehow, Gemini 3 did it.

Edit: Just asked "Cool places to visit in Londrina" (In portuguese), and it was also 99% right, unlike other models, who just create stuff. The only thing wrong here, it mentioned sakuras in a lake... Maybe it confused with Brazilian ipês, which are similar, and indeed the city it's full of them.

It seems to have a visual understanding, imo.

guiambros6mo ago

Ha, I just did the same with my hometown (Guaiba, RS), a city that is 1/6th of Londrina, and its wikipedia page in English hasn't been updated in years, and still has the wrong mayor (!).

Gemini 3 nailed on the first try, included political affiliation, and added some context on who they competed with and won over in each of the last 3 elections. And I just did a fun application with AI Studio, and it worked on first shot. Pretty impressive.

(disclaimer: Googler, but no affiliation with Gemini team)

usef-6mo ago

Pure fact-based, niche questions like that aren't really the focus of most providers any more from what I've heard, since they can be solved more reliably by integrating search tools (and all providers now have search).

I wouldn't be surprised if the smallest models can answer fewer such (fact-only) questions over time offline as they distill/focus them more thoroughly on logic etc.

miroljub6mo ago

Funny, I just asked "Ask Brave", which uses a cheap LLM connected directly to its search engine, and it got it right without any issues.

It shows once again that for common searches, (indexed) data is the king, and that's where I expect that even a simple LLM directly connected to a huge indexed dataset would win against much more sophisticated LLMs that have to use agents for searching.

belter6mo ago

I asked Claude, and had no issues with the answer including mentioning the impeached Antonio Belinati...

arach6mo ago

thanks for sharing, very interesting example

legel6mo ago

Thanks for reporting these metrics and drawing the conclusion of an underlying breakthrough in search.

In his Nobel Prize winning speech, Demis Hassabis ends by discussing how he sees all of intelligence as a big tree-like search process.

https://youtube.com/watch?v=YtPaZsasmNA&t=1218

derangedHorse6mo ago

The one thing I got out of the MIT OpenCourseWare AI course by Patrick Winston was that all of AI could be framed as a problem of search. Interesting to see Demis echo that here.

rafaelmn6mo ago

It tells me that the benchmark is probably leaking into training data, and going to the benchmark site :

> Model was published after the competition date, making contamination possible.

Aside from eval on most of these benchmarks being stupid most of the time, these guys have every incentive to cheat - these aren't some academic AI labs, they have to justify hundreds of billions being spent/allocated in the market.

Actually trying the model on a few of my daily tasks and reading the reasoning traces all I'm seeing is same old, same old - Claude is still better at "getting" the problem.

Xss36mo ago

This comment was written by an AI specifically instructed to be more concise than usual.

red75prime6mo ago

> To succeed this well in math, you can't just do better probabilistic generation, you need verifiable search.

You say "probabilistic generation" like it's some kind of a limitation. What is exactly the limiting factor here? [(0.9999, "4"), (0.00001, "four"), ...] is a valid probability distribution. The sampler can be set to always choose "4" in such cases.

calmoo6mo ago

Your comment is AI generated

tim3336mo ago

I'll give you the style is like an LLM but the thoughts seem a bit unlike one. I mean the MathArena Apex results indicating a new discovery rather than more data is definitely a hypothesis.

Also panarky denies it.

coldtea6mo ago

>This is not an incremental advance. It is a step change. This indicates a new discovery, not just more data or more compute.
To succeed this well in math, you can't just do better probabilistic generation, you need verifiable search.
You need to verify what you're doing, detect when you make a mistake, and backtrack to try a different approach.

Loos like AI slop

TaupeRanger6mo ago

It obviously is.

thomasahle6mo ago

From my understanding, Google put online the largest RL cluster in the world not so long ago. It's not surprising they do really well on things that are "easy" to RL, like math or SimpleQA

dbbk6mo ago

Aren't you just describing tool calls?

Der_Einzige6mo ago

You clearly AI generated this comment.

j / k navigate · click thread line to collapse