It did ultimately decide Ozzy was alive. I pushed back on that, and it instantly corrected itself and partially blamed my query "what is he up to" for being formulated as if he was alive.
I was unclear if this meant that the API was overloaded or if he was on a subscription plan and had hit his limit for the moment. Although I think that the Gemini plans just use weekly limits, so I guess it must be API.
And yeah, it's not the insanely priced AI Ultra plan, but if there are any hard limits on Gemini Pro usage I haven't found them. I have played a lot with really long Antigravity sessions to try to figure out what this thing is good for, and it seems like it will pretty much sit there and run all day. (And I can't really blame anyone for still remaining mad about AI to be completely honest, but the technology is too neat by this point to just completely ignore it.)
Seeing as Google is still giving away a bunch of free access, I'm guessing they're still in the ultra-cash-burning phase of things. My hope (hopium, realistically) is that by the time all of the cash burning is over, there will be open-weight local models that land close to where Gemini 3 Pro is today. It doesn't have to be as good; getting close on hardware consumers can afford would be awesome.
But I'm not holding my breath, so let's hope the cash burning continues for a few years.
(There is, of course, the other way to look at it, which is that looking at the pricing per token may not tell the whole story. Given that Google is running their own data centers, it's possible the economic proposition isn't as bad as it looks. OTOH, it's also possible it is worse than it looks, if they happen to be selling tokens at a loss... but I quite doubt it, given they are currently SOTA and can charge a premium.)
If you limit your token count to a fraction of 2 billion tokens, you can try it on your own game, and of course have it complete a smaller fraction of the game.
Did the streamer get subsidized by Google?
(The stream isn't run by Google themselves, is it?)
Does this even have any effect?
"It is worth noting that the instruction to "ignore internal knowledge" played a role here. In cases like the shutters puzzle, the model did seem to suppress its training data. I verified this by chatting with the model separately on AI Studio; when asked directly multiple times, it gave the correct solution significantly more often than not. This suggests that the system prompt can indeed mask pre-trained knowledge to facilitate genuine discovery."
It definitely doesn't wipe its internal knowledge of Crystal clean (that's not how LLMs work). My guess is that it slightly encourages the model to explore more and second-guess its likely very strong Crystal game knowledge, but that's about it.
Whether the 'effect' is something implied by the prompt, or even something we can understand, is a totally different question.
If you looked inside, they would be spinning on something like "oh I know this is the tile to walk on, but I have to only rely on what I observe! I will do another task instead to satisfy my conditions and not reveal that I have pre-knowledge."
LLMs are literal douche genies. The less you say, generally, the better.
This was a good while back but I'm sure a lot of people might find the process and code interesting even if it didn't succeed. Might resurrect that project.
So it just looks right at what's going on, writes a description for refinement, and uses all of that to create and manage goals, write to a scratchpad and submit input. It's minimal scaffolding because I wanted to see what these raw models are capable of. Kind of a benchmark.
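For anyone curious what that kind of loop looks like, here is a minimal sketch, not the actual project code. Everything in it is an assumption made for illustration: the `Model` callable, the `AgentState` fields, the `DESCRIBE`/`GOAL`/`ACT` prompt prefixes, and the `stub_model` that stands in for a real LLM call so the thing runs offline.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical model interface: takes a prompt string, returns text.
Model = Callable[[str], str]


@dataclass
class AgentState:
    goals: list[str] = field(default_factory=list)
    scratchpad: list[str] = field(default_factory=list)


def step(model: Model, state: AgentState, screen: str) -> str:
    """Run one observe -> describe -> plan -> act cycle; return a button press."""
    # 1. Describe what is on screen (a real harness would pass an image here).
    description = model(f"DESCRIBE\n{screen}")
    # 2. Update the goal list and the scratchpad from that description.
    goal = model(f"GOAL\n{description}\nCurrent goals: {state.goals}")
    if goal and goal not in state.goals:
        state.goals.append(goal)
    state.scratchpad.append(description)
    # 3. Pick the input to submit to the game.
    return model(f"ACT\n{description}\nGoals: {state.goals}").strip()


def stub_model(prompt: str) -> str:
    """Stand-in for a real LLM call (assumption, for illustration only)."""
    kind = prompt.split("\n", 1)[0]
    canned = {
        "DESCRIBE": "player faces a door",
        "GOAL": "enter the building",
        "ACT": "UP",
    }
    return canned[kind]


state = AgentState()
button = step(stub_model, state, "<screen capture placeholder>")
print(button, state.goals)
```

The point of keeping it this thin is that whatever the model gets right or wrong is the model's doing, not the harness's.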
(I do think from personal benchmarks that Gemini 3 is better for the reasons stated by the author, but a single run from each is not strong evidence.)
- History, most likely
In other words, how much of this improvement is true generalization vs memorization?
That said, this writeup itself will probably be scraped and influence Gemini 4.
Just don’t confuse it with a random benchmark!
Is this baked into how the models are built? A model outputs a bunch of tokens, then reads them back and treats them as the existing "state" which has to be built on. So if the model has earlier said (or acted like) a given assumption is true, then it is going to assume "oh, I said that, it must be the case". Presumably one reason that hacks like "Wait..." exist is to work around this problem.
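A toy sketch of that feedback loop, under heavy assumptions: `toy_next_token` is a made-up lookup rule, not a real model, and the token names are invented. It only shows the mechanism being described, where each new token is chosen from a context that already includes the model's own earlier output.

```python
def generate(next_token, prompt, max_new_tokens):
    """Autoregressive loop: every output token is appended to the context
    and becomes part of the input for the next step."""
    context = list(prompt)
    for _ in range(max_new_tokens):
        tok = next_token(context)
        if tok is None:
            break
        context.append(tok)
    return context


def toy_next_token(context):
    """Made-up rule standing in for a model (assumption, for illustration)."""
    last = context[-1]
    if last == "Q:":
        return "assume_X"      # an early guess
    if last == "assume_X":
        return "build_on_X"    # later output treats the guess as settled
    if last == "Wait...":
        return "recheck_X"     # an injected token prompts a re-check
    return None


print(generate(toy_next_token, ["Q:"], 5))
print(generate(toy_next_token, ["Q:", "assume_X", "Wait..."], 5))
```

In the first call the guess gets built on; in the second, the injected "Wait..." changes what follows, which is roughly what that hack is for.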
That said, it's definitely Gem's fault that it struggled so long, considering it ignored the NPCs that give clues.
Just try other random, non-realistic things like “a giraffe walking a tightrope”, “a car sitting at a cafe eating a pizza”, etc.
If the results are dramatically different, then they gamed it. If they are similar in quality, then they probably didn’t.
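If you want to make that comparison slightly less hand-wavy, here is a rough sketch. It assumes you (or some rater) assign each output a quality score on a 1 to 10 scale; the function names and the `threshold` of 2.0 are arbitrary choices of mine, not an established method.

```python
from statistics import mean


def gaming_gap(benchmark_score: float, novel_scores: list[float]) -> float:
    """How far the well-known prompt's score sits above the novel-prompt average."""
    return benchmark_score - mean(novel_scores)


def looks_gamed(benchmark_score: float, novel_scores: list[float],
                threshold: float = 2.0) -> bool:
    """Flag the case where the famous prompt does dramatically better.
    The threshold is an arbitrary assumption, tune it to your rating scale."""
    return gaming_gap(benchmark_score, novel_scores) > threshold
```

Usage would be something like `looks_gamed(score_for_famous_prompt, [giraffe_score, car_score, ...])` with scores you collected yourself. A handful of prompts is still weak evidence either way.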