Claude Plays Pokémon (twitch.tv)

75 points by LightMachine 1y ago | 24 comments

dang 1y ago

Related ongoing thread:

Show HN: LLM plays Pokémon (open sourced) - https://news.ycombinator.com/item?id=43187231

Philpax 1y ago

This is truly tremendous to watch. Eleven years from TPP, and we're watching the current best-in-class AI try its best at the same. Who'll get there first, the historical gestalt of Twitch users or the just-shy-of-10^26 FLOPS [0] AI model?

Now here's a concept for anyone with more money than sense: ClaudePlaysTwitchPlaysPokemon, where it's TPP but every participant is Claude. Would hivemind AI consensus perform better than a single AI? Anthropic's certainly looking into it! [1]

[0]: https://www.oneusefulthing.org/p/a-new-generation-of-ais-cla...

[1]: https://www.anthropic.com/news/visible-extended-thinking
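
A minimal sketch of the consensus idea above, assuming a plain majority vote over button presses (similar in spirit to TPP's democracy mode). The name `consensus_press` and the button list are illustrative assumptions, not anything from Anthropic's setup:

```python
from collections import Counter

# Hypothetical set of inputs a Game Boy style harness might accept.
BUTTONS = ("up", "down", "left", "right", "a", "b", "start", "select")


def consensus_press(votes):
    """Return the most common valid button among the votes, or None if none are valid."""
    valid = [v.lower() for v in votes if v.lower() in BUTTONS]
    if not valid:
        return None
    # most_common(1) yields [(button, count)]; on a tie the first-seen button wins.
    return Counter(valid).most_common(1)[0][0]
```

Each vote would come from one independent model instance. Whether that beats a single model is exactly the open question.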

IX-103 1y ago

A few years ago there was another AI that tried to beat Pokemon. It wasn't an LLM. I think it was an LSTM trained with reinforcement learning. It got stuck in Mt Moon.

Right now, Claude has been stuck in Mt Moon for nearly a day. It keeps forgetting where it has been. It also almost always runs from battles instead of changing Pokemon or fighting.

At one point it got stuck in a Pokemon center when it mistook the character's red hat for the red carpet around the exit. It kept pressing down and wondering why it wasn't working. It only broke out of that when it mistakenly concluded it had successfully exited the Pokemon center. Then it wandered around a bit and only realized it was still in the Pokemon center after talking to Nurse Joy.

Philpax 1y ago

You're thinking of https://www.youtube.com/watch?v=DcYLT37ImBY and https://github.com/PWhiddy/PokemonRedExperiments.

> It also almost always runs from battles instead of changing Pokemon or fighting.

I believe this is because all of its Pokemon are on the verge of fainting, so it's trying to conserve them while it tries to find its way out.

> It keeps forgetting where it has been.

I'm wondering if this could be solved with a better harness; on one hand, that hurts the elegance of having one model dedicated to playing the game, but their existing harness is already cheating a little (they have a second LLM for verification). They're frequently compacting what's in context, which means its visual memory is quite poor - that could potentially be a point of improvement?
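
A rough sketch of what a "better harness" for this could look like, assuming the harness can read the player's map name and tile coordinates from the emulator (an assumption; the real harness may not expose this). `ExplorationMemory` is a hypothetical helper that lives outside the model's context, so it survives compaction and can be re-injected as a short text summary:

```python
from dataclasses import dataclass, field


@dataclass
class ExplorationMemory:
    # Maps (map_name, x, y) -> number of times the player has stood on that tile.
    visited: dict = field(default_factory=dict)

    def record(self, map_name, x, y):
        """Count one visit to a tile."""
        key = (map_name, x, y)
        self.visited[key] = self.visited.get(key, 0) + 1

    def summary(self, map_name, limit=5):
        """Build a short text summary to put back into the prompt after compaction."""
        tiles = [(count, x, y) for (m, x, y), count in self.visited.items() if m == map_name]
        tiles.sort(reverse=True)  # most revisited tiles first
        lines = [f"({x},{y}) visited {count}x" for count, x, y in tiles[:limit]]
        return f"{map_name}: {len(tiles)} tiles seen; most revisited: " + "; ".join(lines)
```

The trade-off is the one mentioned above: every helper like this makes the run less of a pure test of the model.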

Y_Y 1y ago

Or the converse, feed all of twitch chat to Claude and see if it can output the correct button presses.

unification_fan 1y ago

You'd have to feed it all of Twitch chat correlated to whatever frame was being streamed at the time and adjusted for network jitter and buffering.

Good luck

_--__--__ 1y ago

This is neat but watching a reasoning model that stops to consider "I have read half of a dialogue block, time to press A to get the rest of the text" gets old really quick. I think I'd rather watch a model try to play pokemon against human opponents on a simulator like pokemon showdown (which I understand is a bit further in an IP rights grey area than emulating a 30 year old game). In that case you would get to see how it handles unknown information and updates its reasoning based on the success/failure of its predictions.

northern-lights 1y ago

A model doesn't need to play on visual simulators, it can very well do that on IRC (like the good old days of RS/GSBots), to show how it fares against humans.

One of the biggest challenges this Claude version faces is reading the visual data accurately. It was stuck in Viridian Forest and Pokemarts for a while because overworld objects like trees and paths kept confusing it.

lcnPylGDnU4H9OF 1y ago

> play pokemon against human opponents on a simulator like pokemon showdown

That's precisely the pet project I'd take on if/when I bother to take the time to make some deep learning agent. There's a bot that plays one of the ladders already but it's just a decision tree and the best players know how to predict its moves. It's like ~1500 Elo in a ladder where the best players are 1800+. Still not bad, to be fair; it would probably beat me.

The bot has a pre-selected team, which I believe always starts with the same mon. I'd be more interested in an agent that fully played the game, start-to-finish, including making a team based on play data and selecting a starter based on the current opponent's team.
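
To put the ~1500 vs 1800+ gap in perspective, here is the standard Elo expected-score formula, assuming the ladder number behaves like a plain Elo rating (the ladder's exact rating system is not stated here). `expected_score` is just an illustrative name:

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score for player A against player B (1 = win, 0 = loss)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# A 1500-rated bot against an 1800-rated player.
bot_vs_top = expected_score(1500, 1800)
print(round(bot_vs_top, 3))  # 0.151
```

Under that assumption the bot would be expected to take roughly 15% of games off an 1800 player.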

alexchantavy 1y ago

Haha yeah this is cool but the days of watching Twitch Plays Pokemon or RNG Plays Pokemon or things like that were much more entertaining

unification_fan 1y ago

> which I understand is a bit further in an IP rights grey area than emulating a 30 year old game

But Nintendo will never take down anything that is related to Showdown because it would highlight their massive hypocrisy!

It would set a precedent. People would go: "wait, but why did they never take down Showdown itself? Could it be that it's because they actually benefit from its existence? Then why did they take down X/Y/Z? Oh! It's because copyright law only applies when you want it to! It's all arbitrary and made up! You just need to be friends with the right people in the VGC and your pet project will be immune from all legal backlash!"

Or something.

Seriously I hate it so fucking much that Nintendo does nothing about Showdown, which blatantly steals a ton of game assets, and then nukes some random guy's fan project that no one ever played.

Philpax 1y ago

It's run by Anthropic! https://x.com/AnthropicAI/status/1894419011569344978

falcor84 1y ago

This administrative request for a reset is incredible - I can't help but feel that this is intended as the equivalent of a prompt injection for the person running it. Time to rewatch Ex Machina.

https://x.com/AnthropicAI/status/1894419017756029427?t=xDXk6...

tehsauce 1y ago

For anyone interested in watching lots of reinforcement learning agents playing Pokemon Red at once, we have a website which streams hundreds of concurrent games from multiple people's training runs to a shared map in real time!

https://pwhiddy.github.io/pokerl-map-viz/

(works best on desktop)

sunaookami 1y ago

I like that it named the rival "Waclaude" :)

wanderer2323 1y ago

What is the significance of this?

_--__--__ 1y ago

https://en.wikipedia.org/wiki/Waluigi_effect

adenta 1y ago

I think the "wa" prefix means "bad" in Japanese

skoll43 1y ago

not beating the copyright allegations

TheAceOfHearts 1y ago

Watching the moment to moment is pretty boring, but it might be interesting if someone puts together highlights of interesting events and moments. The screenshot where Claude asks for the game to restart is absolutely charming.

meltyness 1y ago

I can't look at the current state of this without wondering if it's tokenizer dyslexia. I wonder if AI performance growth has been borrowed from overfitting, from pruning the tokenizer of invalid sequences, and from leaking the entire corpus, a cardinal sin when making valid predictions.

j_timberlake 1y ago

This would be a really cool category of speed-running. "How fast can a model beat a game that it's never played before?"

First get the model to beat a game, then work on better decision-making, then try to speed up the decision-making. Then repeat when better models come out.
