undefined | Better HN

0 pointsaliljet7mo ago0 comments

It's very unclear if OpenAI has been casually leaking things to create buzz, but a few days ago there was a pretty stunning pelican on a bike attempt: https://old.reddit.com/r/OpenAI/comments/1mettre/gpt5_is_alr...

In practice, it's very clear to me that the most important value in writing software with an LLM isn't it's ability to one-shot hard problems, but rather it's ability to effectively manage complex context. There are no good evals for this kind of problem, but that's what I'm keenly interested in understanding. Show me GPT-5 can move through 10 steps in a list of tasks without completely losing the objective by the end.

0 comments

stri8ed7mo ago

That problem along with its many solutions are surely littered throughout the training data. Not to mention, it would be trivial to overfit on that problem. I don't know why people still reference that.

ben_w7mo ago

> That problem along with its many solutions are surely littered throughout the training data. Not to mention, it would be trivial to overfit on that problem.

It would be trivial to over-fit, if that was their goal.

But why would there be a large number of good SVG images of pelicans on bikes? Especially relative to all the things we actually want them to generalise over?

Surely most of the SVG images of pelicans on bikes are, right now, going to be "look at this rubbish AI output"? (Which may or may not be followed by a comment linking to that artist who got humans to draw bikes and oh boy were those humans wildly bad at drawing bikes, so an AI learning to draw SVGs from those bitmap pictures would likely also still suck…)

AlecSchueler7mo ago

Because it's become the iconic test for them and countless articles have been written about it with plenty of examples.

1 more reply

Xenoamorphous7mo ago

Maybe we can try “dog in a paraglider”? If it fails then we know it’s overfitting, if it works then the model generalises well?

aliljetOP7mo ago

Honestly, you're probably right. It's quickly become a pretty weak eval, but the guy that's running that eval is excellent. I'd much rather the evals people were using to test these things looked more like classic/boring engineering problems: deploy to dev/test/stage/prod with digital ocean, cloudflare, github, and a common git flow. Boring problem, I know, but that problem is wildly complex when you start to add a few extra dimensions (frontend vs backend, ports shifting between deployments, local deployments, etc.).

93po7mo ago

i think the point is people assume models arent overfitting for it, and its a fun/silly way to potentially gauge its general abilities

j / k navigate · click thread line to collapse