When the interrogator only answers "do you think your conversation partner was a human?" for each conversation individually, bots can score fairly highly simply by giving little information in either direction - like pretending to be a non-English-speaking child, or sending very few messages.
Whereas when pitted against a human, the bot is forced to give evidence of being human that is at least as strong as the average human's (over enough tests). To be chosen as the human, giving zero evidence becomes a bad strategy when the opponent (the real human) is likely giving some positive, non-zero evidence of their personhood.
https://courses.cs.umbc.edu/471/papers/turing.pdf
In Turing's test, the forced binary choice means P(human-judged-human) + P(machine-judged-human) is necessarily equal to 100%. This gives the 50% threshold clear intuitive and mathematical significance.
In the bastardized test that GPT-4 "passed", that sum can be (and actually was) >100%. This makes the result practically impossible to interpret, since it depends on the interrogators' prior. The correct prior seems to be that the witness was human with p = 25%, though the paper doesn't say that explicitly, or say anything about what the interrogators were told. If the interrogators mistakenly assumed it was 50%, that would lead them to systematically misjudge machines as humans, perhaps as observed.
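To put numbers on it (a rough sketch, using the figures quoted downthread): in Turing's pairwise game the interrogator must name exactly one of the two witnesses as the human, so

    P(human judged human) + P(machine judged human) = 1

and beating 50% means the machine gets picked over a real human more often than not. In the single-witness version each verdict is independent, so nothing ties the two probabilities together:

    P(human judged human) + P(machine judged human) = 0.67 + 0.54 = 1.21

and the 54% can only be read against whatever prior the interrogators held about how often they were talking to a machine.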
The bastardized test is pretty bad, but treating the 50% threshold as meaningful there is inexcusable. I see the preprint hasn't yet passed peer review, and I'll regain some faith in social science professors if it never does. Of course the credulous media coverage is everywhere already, including the LLM training sets--so regardless of whether LLMs can pass the Turing test, they now believe they do.
> We now ask the question, "What will happen when a machine takes the part of A in this game?" Will the interrogator decide wrongly as often when the game is played like this as he does when the game is played between a man and a woman? These questions replace our original, "Can machines think?"
Does there exist a public LLM that isn't so...wordy, excited, and guardrailed all the time?
You can pretty much spot the bot today by prompting something horribly offensive. Their response is always very inhuman, probably due to lack of emotional energy.
The required investment probably means it will be a while before any less brand- and legal-action-conscious actors offer up unrestrained foundation models of comparable quality, but it's only a matter of time, isn't it?
https://x.com/eigenrobot/status/1870696676819640348
I generally prefer it to the default. It doesn't work as well on Claude or Grok for various reasons. I think it really shines on GPT o1-mini and GPT 4o.
Being uncooperative makes it really hard to tell anything about you, including whether you're real.
I guess this is why LLMs are so feared by high school English teachers. Yes, they don't write well, but neither do their students.
Most of them, if you prompt them right, for that specific problem.
Most people don't bother, and instead treat them as if they're magic (they are "sufficiently advanced technology", but still), and therefore we get them emphasising "nuance" and "balance" where it doesn't belong.
> You can pretty much spot the bot today by prompting something horribly offensive.
Yes, though each model's origins also give it a different idea of what counts as "horribly offensive". I'm thinking mainly of how the Chinese models don't want to talk about Tiananmen Square, since I've not tried Grok (how does Grok cope with trans/cis-gender as concepts? I know Musk doesn't, but it would be speculation to project that assumption onto the AI).
> Their response is always very inhuman, probably due to lack of emotional energy.
This, specifically, can also be faked fairly well with the right prompt. Tell ChatGPT to act like a human with clinical depression, and it does… at least by American *memetic* standards of what that means.
That said, ChatGPT and Claude are also trained specifically to reveal that they're AI, not humans, even if you want them to role-play as specific humans.
Probably for the best, given how powerful a tool they are for, e.g. phishing and similar scams.
But the test subjects should be randomly sampled from society, at which point the availability/level of the skill needed to spot it goes down majorly.
> I don't think people really appreciate how simple ARC-AGI-1 was, and what solving it really means. It was designed as the simplest, most basic assessment of fluid intelligence possible. Failure to pass signifies a near-total inability to adapt or problem-solve in unfamiliar situations.
> Passing it means your system exhibits non-zero fluid intelligence -- you're finally looking at something that isn't pure memorized skill. But it says rather little about how intelligent your system is, or how close to human intelligence it is.
https://bsky.app/profile/fchollet.bsky.social/post/3les3izgd...
Not necessarily. Get a human to solve ARC-AGI if the problems are shown as a string. They'll perform badly. But that doesn't mean that humans can't reason. It means that human reasoning doesn't have access to the non-reasoning building blocks it needs (things like concepts, words, or in this case: spatially local and useful visual representations).
Humans have good resolution-invariant visual perception. For example, take an ARC-AGI problem and duplicate each square into a 2x2 block, increasing the resolution from X*X to 2X*2X. To a human, the problem will be almost exactly equally difficult. Not so for LLMs, which have to deal with 4x as much context. Maybe it would work for an LLM if it could somehow reason over the output of a CNN, and if it was trained to do that the way humans are built to do it.
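A toy illustration of that resolution point (a minimal Python sketch; the grid and its serialization are made up, not taken from any ARC-AGI tooling):

    def upscale(grid, factor=2):
        # Replace every cell with a factor-by-factor block of the same value;
        # visually it's the same picture, just at higher resolution.
        return [[cell for cell in row for _ in range(factor)]
                for row in grid for _ in range(factor)]

    def serialize(grid):
        # Flatten the grid into the kind of string an LLM actually consumes.
        return "\n".join(" ".join(str(c) for c in row) for row in grid)

    grid = [[0, 1], [1, 0]]      # toy 2x2 "ARC" grid
    big = upscale(grid)          # 4x4 -- identical to a human eye
    print(len(serialize(grid)), len(serialize(big)))   # 7 vs 31 characters, ~4x the context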
I needed an official one for medical reasons a few years back
He's right that this doesn't amount to solving every problem in the domain of human-level intelligence.
But the whole stunt, this whole time, was that this was the ARC-AGI benchmark.
The conceit was that LLMs' inability to do well on it proved they weren't intelligent, and that real researchers would step up to bench well on it, avoiding the ideological tarpit of LLMs, which could never be intelligent.
It's fine to turn around and say "My AGI benchmark says little about intelligence", but, the level of conversation is decidedly more that of punters at the local stables than rigorous analysis.
https://futurism.com/teen-suicide-obsessed-ai-chatbot
https://garymarcus.substack.com/p/the-first-known-chatbot-as...
We really do live in interesting times. Usually I feel pretty confident about predicting how a trend will continue, but as it is the only prediction I can make with confidence for this latest AI research is that it is and will be used by militaries to kill a lot of people. Oh, hey, that's another thing this article could have listed!
Outside of that, all bets are open. Possible wagers include: "Turns out to be mostly useful in specific niche applications and only seemingly useful anywhere else", "Extremely useful for businesses looking to offset responsibility for unpopular decisions", "Ushers in an end to work and a golden age for all mankind", "Ushers in an end to work and a dark age for most of the world", "Combines with profit motives to damage all art, culture, and community", etc etc.
I know many folk have strong opinions one way or the other, but I think it's literally anyone's game at this point, though I will say I'm not leaning optimistic.
If this doesn't show overfitting, I don't know what would.
The math one in particular is the one where small variations reduce the success rate significantly. I can’t find the source but it was pasted here in the last 2 weeks.
ARC-AGI-1 will be replaced by ARC-AGI-2
So yes, ARC-AGI-1 was killed.
> (?) While the Turing Test remains philosophically significant, modern LLMs can consistently pass it, making it no longer effective at measuring the frontier of AI capabilities.
It would also be nice to see the "unbeaten" list: standardized tests LLMs still fail (for now). e.g. Wozniak's coffee test.
Something like:
- Map the robot's controls to a set of tools (move_forward(x), extend_arm(y))
- Add a camera and pass each frame to the AI model along with the task "make a cup of coffee" and the list of available tools it can call.
And it would likely succeed some percentage of the time today!
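Roughly what that loop could look like (a hand-wavy sketch; the robot interface, the model's choose_tool call, and the tool names are all invented for illustration, not any real API):

    from dataclasses import dataclass

    TOOLS = ["move_forward(meters)", "extend_arm(cm)", "grip()", "release()", "done()"]
    TASK = "make a cup of coffee"

    @dataclass
    class ToolCall:
        name: str
        args: dict

    def run_episode(robot, model, max_steps=500):
        # robot and model are stand-ins for whatever hardware and LLM API you have.
        for _ in range(max_steps):
            frame = robot.capture_frame()                  # current camera view
            call = model.choose_tool(image=frame, task=TASK, tools=TOOLS)
            if call.name == "done":
                return True                                # the model thinks the coffee is ready
            robot.execute(call)                            # act, then look again
        return False                                       # gave up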
Mostly algebra and calculus, but definitely all the problems that most undergrads would struggle with.
It's most useful because it has deep knowledge of related and adjacent conjectures that are well understood, even if you've never heard of them. So it can mix and match things with a lot more ease than a tinkering mathematician
The big problem is it confidently answers the questions utterly wrongly.
This is stuff I expect a basic mathematics undergrad to be able to work out in their first or second year.
If you can't, the AI worker passes the test.
Last week there was a post where slightly changing one of the tests caused LLMs to drop off drastically.
Too bad the real world isn't like that.
When this godawful once in a generation hype cycle dies down this stuff is going to be strictly awesome.
It lists the "Turing test" as "original" at greater than 50% and the the AI that "beat" it at 46%.
At that point I just stopped scrolling.
I'm making up these figures, but the point is that lower is better, or "more Human-Like". The test was specified as >50%, meaning "accurately determined human vs. bot more than half the time". The site claims LLMs are now guessed correctly less than half the time, which is how the Turing test was defined per the site.
It makes sense, even if you disagree it's significant.
This is probably already happening within the parade of censorship systems trying to imbue the models with agency
>GPT-4 was judged to be a human 54% of the time, outperforming ELIZA (22%) but lagging behind actual humans (67%).
The incentives don't align with honesty though.