When the interrogator only answers "do you think your conversation partner was a human?" for each conversation individually, bots can score fairly highly simply by giving little information in either direction - like pretending to be a non-English-speaking child, or sending very few messages.
Whereas when pitted against a human, the bot is forced to give evidence of being human that is at least as strong as the average human's (over enough tests). To be chosen as the human, giving zero evidence becomes a bad strategy when the opponent (the real human) is likely giving some positive, non-zero evidence of their personhood.
https://courses.cs.umbc.edu/471/papers/turing.pdf
In Turing's test, the forced binary choice means P(human-judged-human) + P(machine-judged-human) is necessarily equal to 100%. This gives the 50% threshold clear intuitive and mathematical significance.
In the bastardized test that GPT-4 "passed", that sum can be (and actually was) >100%. This makes the result practically impossible to interpret, since it depends on the interrogators' prior. The correct prior seems to be that the witness was human with p = 25%, though the paper doesn't say that explicitly, or say anything about what the interrogators were told. If the interrogators mistakenly assumed it was 50%, that would lead them to systematically misjudge machines as humans, perhaps as observed.
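To put numbers on it (a rough sketch, using the figures quoted downthread): in Turing's pairwise game the interrogator must name exactly one of the two witnesses as the human, so

    P(human judged human) + P(machine judged human) = 1

and beating 50% means the machine gets picked over a real human more often than not. In the single-witness version each verdict is independent, so nothing ties the two probabilities together:

    P(human judged human) + P(machine judged human) = 0.67 + 0.54 = 1.21

and the 54% can only be read against whatever prior the interrogators held about how often they were talking to a machine.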
The bastardized test is pretty bad, but treating the 50% threshold as meaningful there is inexcusable. I see the preprint hasn't yet passed peer review, and I'll regain some faith in social science professors if it never does. Of course the credulous media coverage is everywhere already, including the LLM training sets--so regardless of whether LLMs can pass the Turing test, they now believe they do.
> We now ask the question, "What will happen when a machine takes the part of A in this game?" Will the interrogator decide wrongly as often when the game is played like this as he does when the game is played between a man and a woman? These questions replace our original, "Can machines think?"
Does there exist a public LLM that isn't so...wordy, excited, and guardrailed all the time?
You can pretty much spot the bot today by prompting something horribly offensive. Their response is always very inhuman, probably due to lack of emotional energy.
The required investment probably means it will be a while before any less brand- and legal-action-conscious actors offer up unrestrained foundation models of comparable quality, but it's only a matter of time, isn't it?
https://x.com/eigenrobot/status/1870696676819640348
I generally prefer it to the default. It doesn't work as well on Claude or Grok for various reasons. I think it really shines on GPT o1-mini and GPT 4o.
Being uncooperative makes it really hard to tell anything about you, including whether you're real.
I guess this is why LLMs are so feared by high school English teachers. Yes, they don't write well, but neither do their students.
Most of them, if you prompt them right, for that specific problem.
Most people don't bother, and instead treat them as if they're magic (they are "sufficiently advanced technology", but still), and therefore we get them emphasising "nuance" and "balance" where it doesn't belong.
> You can pretty much spot the bot today by prompting something horribly offensive.
Yes, though each model's origins also give it a different idea of what counts as "horribly offensive". I'm thinking mainly of how the Chinese models don't want to talk about Tiananmen Square, since I've not tried Grok (how does Grok cope with trans/cis-gender as concepts? I know Musk doesn't, but it would be speculation to project that assumption onto the AI).
> Their response is always very inhuman, probably due to lack of emotional energy.
This, specifically, can also be faked fairly well with the right prompt. Tell ChatGPT to act like a human with clinical depression, and it does… at least by American *memetic* standards of what that means.
That said, ChatGPT and Claude are also trained specifically to reveal that they're AI, not humans, even if you want them to role-play as specific humans.
Probably for the best, given how powerful a tool they are for, e.g. phishing and similar scams.
But the test subjects should be randomly sampled from society, at which point the availability/level of the skill needed to spot it goes down majorly.
> I don't think people really appreciate how simple ARC-AGI-1 was, and what solving it really means. It was designed as the simplest, most basic assessment of fluid intelligence possible. Failure to pass signifies a near-total inability to adapt or problem-solve in unfamiliar situations.
> Passing it means your system exhibits non-zero fluid intelligence -- you're finally looking at something that isn't pure memorized skill. But it says rather little about how intelligent your system is, or how close to human intelligence it is.
https://bsky.app/profile/fchollet.bsky.social/post/3les3izgd...
Not necessarily. Get a human to solve ARC-AGI if the problems are shown as a string. They'll perform badly. But that doesn't mean that humans can't reason. It means that human reasoning doesn't have access to the non-reasoning building blocks it needs (things like concepts, words, or in this case: spatially local and useful visual representations).
Humans have good resolution-invariant visual perception. For example, take an ARC-AGI problem and duplicate each square into a 2x2 block, increasing the resolution from X*X to 2X*2X. To a human, the problem will be almost exactly equally difficult. Not so for LLMs, which have to deal with 4x as much context. Maybe it would work for an LLM if it could somehow reason over the output of a CNN, and if it was trained to do that the way humans are built to do it.
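A toy illustration of that resolution point (a minimal Python sketch; the grid and its serialization are made up, not taken from any ARC-AGI tooling):

    def upscale(grid, factor=2):
        # Replace every cell with a factor-by-factor block of the same value;
        # visually it's the same picture, just at higher resolution.
        return [[cell for cell in row for _ in range(factor)]
                for row in grid for _ in range(factor)]

    def serialize(grid):
        # Flatten the grid into the kind of string an LLM actually consumes.
        return "\n".join(" ".join(str(c) for c in row) for row in grid)

    grid = [[0, 1], [1, 0]]      # toy 2x2 "ARC" grid
    big = upscale(grid)          # 4x4 -- identical to a human eye
    print(len(serialize(grid)), len(serialize(big)))   # 7 vs 31 characters, ~4x the context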
I needed an official one for medical reasons a few years back
He's right that this doesn't amount to solving every problem in the domain of human-level intelligence.
But the whole stunt, this whole time, was that this was the ARC-AGI benchmark.
The conceit was that LLMs' inability to do well on it proved they weren't intelligent, and that real researchers would step up to bench well on it, avoiding the ideological tarpit of LLMs, which could never be intelligent.
It's fine to turn around and say "My AGI benchmark says little about intelligence", but, the level of conversation is decidedly more that of punters at the local stables than rigorous analysis.
https://futurism.com/teen-suicide-obsessed-ai-chatbot
https://garymarcus.substack.com/p/the-first-known-chatbot-as...
We really do live in interesting times. Usually I feel pretty confident about predicting how a trend will continue, but as it is the only prediction I can make with confidence for this latest AI research is that it is and will be used by militaries to kill a lot of people. Oh, hey, that's another thing this article could have listed!
Outside of that, all bets are open. Possible wagers include: "Turns out to be mostly useful in specific niche applications and only seemingly useful anywhere else", "Extremely useful for businesses looking to offset responsibility for unpopular decisions", "Ushers in an end to work and a golden age for all mankind", "Ushers in an end to work and a dark age for most of the world", "Combines with profit motives to damage all art, culture, and community", etc etc.
I know many folk have strong opinions one way or the other, but I think it's literally anyone's game at this point, though I will say I'm not leaning optimistic.
If this doesn't show overfitting, I don't know what would.
The math one in particular is the one where small variations reduce the success rate significantly. I can’t find the source but it was pasted here in the last 2 weeks.
ARC-AGI-1 will be replaced by ARC-AGI-2
So yes, ARC-AGI-1 was killed.
> (?) While the Turing Test remains philosophically significant, modern LLMs can consistently pass it, making it no longer effective at measuring the frontier of AI capabilities.
It would also be nice to see the "unbeaten" list: standardized tests LLMs still fail (for now). e.g. Wozniak's coffee test.
Something like:
- Map the robot's controls to a set of tools (move_forward(x), extend_arm(y))
- Add a camera and pass each frame to the AI model along with the task "make a cup of coffee" and the list of available tools it can call.
And it would likely succeed some percentage of the time today!
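Roughly what that loop could look like (a hand-wavy sketch; the robot interface, the model's choose_tool call, and the tool names are all invented for illustration, not any real API):

    from dataclasses import dataclass

    TOOLS = ["move_forward(meters)", "extend_arm(cm)", "grip()", "release()", "done()"]
    TASK = "make a cup of coffee"

    @dataclass
    class ToolCall:
        name: str
        args: dict

    def run_episode(robot, model, max_steps=500):
        # robot and model are stand-ins for whatever hardware and LLM API you have.
        for _ in range(max_steps):
            frame = robot.capture_frame()                  # current camera view
            call = model.choose_tool(image=frame, task=TASK, tools=TOOLS)
            if call.name == "done":
                return True                                # the model thinks the coffee is ready
            robot.execute(call)                            # act, then look again
        return False                                       # gave up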
Mostly algebra and calculus, but definitely all the problems that most undergrads would struggle with.
It's most useful because it has deep knowledge of related and adjacent conjectures that are well understood, even if you've never heard of them. So it can mix and match things with a lot more ease than a tinkering mathematician
The big problem is it confidently answers the questions utterly wrongly.
This is stuff I expect a basic mathematics undergrad to be able to work out in their first or second year.
If you can't, the AI worker passes the test.
Last week there was a post where slightly changing one of the tests caused LLMs to drop off drastically.
Too bad the real world isn't like that.
When this godawful once in a generation hype cycle dies down this stuff is going to be strictly awesome.
It lists the "Turing test" as "original" at greater than 50% and the the AI that "beat" it at 46%.
At that point I just stopped scrolling.
I'm making up these figures, but the point is that lower is better, or "more Human-Like". The test was specified as >50%, meaning "accurately determined human vs. bot more than half the time". The site claims LLMs are now guessed correctly less than half the time, which is how the Turing test was defined per the site.
It makes sense, even if you disagree it's significant.
This is probably already happening within the parade of censorship systems trying to imbue the models with agency
>GPT-4 was judged to be a human 54% of the time, outperforming ELIZA (22%) but lagging behind actual humans (67%).
The incentives don't align with honesty though.