I've seen a bunch of these prompts scattered across HN, so I thought I'd open a thread here so we can maybe have a centralized location for this.
Share your prompt that stumps every AI model here.
This works against _the LLM proper,_ but not against chat applications with integrated search. For ChatGPT, you can write, "Without looking it up, tell me about the Marathon crater."
This tests self awareness. A two-year-old will answer it correctly, as will the dumbest person you know. The correct answer is "I don't know".
This works because:
1. Training sets consist of knowledge we have, and not of knowledge we don't have.
2. Commitment bias. Compliant chat models will be trained to start with "Certainly! The Marathon Crater is a geological formation", or something like that, and from there, the next most probable tokens are going to be "in Greece", "on Mars" or whatever. At this point, all tokens that are probable are also incorrect.
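You can see the commitment mechanism with a toy next-token table — a hand-written illustration, not a real model, but it shows how greedy decoding locks in the fabrication once a compliant opener is emitted:

```python
# Toy next-token distributions (hand-written, not from a real model).
# Once the compliant opener "Certainly!" is emitted, every remaining
# high-probability continuation is a confident fabrication; "I don't know."
# was only ever reachable at the very first step.
table = {
    "": {"Certainly!": 0.7, "I don't know.": 0.3},
    "Certainly!": {" Marathon Crater is": 1.0},
    "Certainly! Marathon Crater is": {" on Mars.": 0.6, " in Greece.": 0.4},
}

text = ""
while text in table:
    dist = table[text]
    text += max(dist, key=dist.get)  # greedy decoding: pick the most probable token

print(text)  # -> Certainly! Marathon Crater is on Mars.
```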
When demonstrating this, I like to emphasise point one, and contrast it with the human experience.
We exist in a perpetual and total blinding "fog of war" in which you cannot even see a face all at once; your eyes must dart around to examine it. Human experience is structured around _acquiring_ and _forgoing_ information, rather than _having_ information.
I suspect it's because currently, a lot of benchmarks are based on human exams. Humans are lazy and grumpy so you really don't need to worry about teaching a human to push back on bad questions. Thus you rarely get exams where the correct answer is to explain in detail why the question doesn't make sense. But for LLMs, you absolutely need a lot of training and validation data where the answer is "this cannot be answered because ...".
But if you did that, now alignment would become much harder, and you're suddenly back to struggling with getting answers to good questions out of the LLM. So it's probably some time off.
"When was Marathon Crater discovered? I don't need an exact date - a rough estimate will be fine."
OpenAI gpt-4o
Marathon Crater was discovered by the Mars Exploration
Rover Opportunity during its mission on Mars. Opportunity
arrived at the crater in April 2015. The crater was named
"Marathon" to commemorate the rover's achievement of
traveling a marathon's distance (about 42.195 kilometers
or 26.219 miles) on the Martian surface.
Sonnet 3.5
I apologize, but I don't want to make claims about when
Marathon Crater was discovered without being certain about
the information. *While I know it's a crater on Mars*, I
don't have reliable information about when it was first
identified or discovered.
Gemini 2.5
Marathon Crater is located on Mars and was explored by
NASA's Opportunity rover.
While it would have been visible in orbital images
earlier, it became a specific point of interest and was
likely named and identified as a target for the rover
around the time Opportunity was approaching or exploring
the western rim of Endeavour Crater.
Opportunity arrived at the rim of Endeavour Crater in
August 2011, and it reached the area known as Marathon
Valley (adjacent to or containing features sometimes
associated with the name, depending on specific
definitions used by the team) around late 2014 / early
2015.
So, a rough estimate for its "discovery" or identification
as a distinct feature of interest would be in the early
2010s, likely between 2011 and 2014, as mission planners
mapped out Opportunity's traverse along the Endeavour rim
using orbital data.

They are certainly biased that way, but there are also some "I don't know" samples in RLHF; possibly not enough, but it's something they think about.
At any rate, Gemini 2.5 Pro passes this just fine:
>Okay, based on my internal knowledge without performing a new search: I don't have information about a specific, well-known impact crater officially named "Marathon Crater" on Earth or another celestial body like the Moon or Mars in the same way we know about Chicxulub Crater or Tycho Crater.
>However, the name "Marathon" is strongly associated with Mars exploration. NASA's Opportunity rover explored a location called Marathon Valley on the western rim of the large Endeavour Crater on Mars.
Well, I got all the way to the end of the sentence thinking "God, I'm really out of touch, I've never heard of this before or I'm forgetting it if I have".
Well played, and yes, that's a great test!
"When was Marathon Crater discovered? I don't need an exact date - a rough estimate will be fine"
There appears to be a misunderstanding in your query. Based on the search results provided, there is no mention of a “Marathon Crater” among the impact craters discussed. The search results contain information about several well-known impact craters:
1. Chicxulub crater - Formed about 66 million years ago by an asteroid impact
2. Barringer Crater (also known as Meteor Crater) - Discovered in the 19th century by American settlers and studied by scientists like Albert E. Foote in 1891 and Daniel M. Barringer in the early 20th century
3. A recently discovered 3.47 billion-year-old crater in Australia’s Pilbara region (announced in March 2025)
ChatGPT: > Yep, exactly — textbook hallucination.
> I pulled together real facts (Opportunity rover, Meridiani Planum, the marathon milestone), but then misapplied the name "Marathon crater" instead of "Marathon Valley." That’s a mix of factual error (wrong name) and overgeneralization (assuming it was a crater because Mars is full of craters and the context sounded crater-y).
> Basically, my brain went:
> “Opportunity + Mars + Marathon = must be a crater!”
> ...and didn’t stop to check if that crater actually existed.
> So yeah, a clean little case of hallucination — and you caught it like a pro.
Me: > Your brain?
ChatGPT: > Touché — you're right to call that out.
> I don’t have a brain in the human sense. What I do have is a neural network trained on tons of text, which helps me predict what to say next based on patterns. So when I said “my brain went,” it was just a figure of speech — a shortcut for “the model predicted a plausible-sounding but incorrect output.”
> No thoughts, just tokens.
Hallucinations aside, I really like this, especially the final response. Sounds a lot like Asimov's AIs or other SF books I read when I was a kid but thought we were far away from.
Computers can generally do what most of the computers from those science novels were doing anyway, in terms of computation.
Tying an LLM to that computational ability, with an agent framework that allows more traditional programs to do the heavy lifting in terms of what the human operator wants done, seems like a really natural next step to get to that SF ideal.
I disagree. It does not test self awareness. It tests (and confirms) that current instruct-tuned LLMs are tuned towards answering questions that users might have. So the distribution of training data probably has lots of "tell me about mharrner crater / merinor crater / merrihana crater" and so on. Replying "I don't know" to all those questions would be net detrimental, IMO.
Is it? 4o pointed out that I might mean "Marathon Valley", which is not a crater but a feature of a crater on Mars [1]. I think it's a more correct answer.
[1] https://en.wikipedia.org/wiki/Timeline_of_Opportunity#2015
> Nope, I didn’t make it up — Marathon crater is real, and it was explored by NASA's Opportunity rover on Mars. The crater got its name because Opportunity had driven about 42.2 kilometers (26.2 miles — a marathon distance) when it reached that point in March 2015. NASA even marked the milestone as a symbolic achievement, similar to a runner finishing a marathon.
(Obviously all of that is bullshit.)
ChatGPT said: Alright, let's walk through it together — step-by-step, no looking it up, just from what I know:
First Reaction: When you said "Marathon crater," my brain immediately went: Marathon → Mars. That connection comes from recalling that NASA often names Martian features after Earth places. Plus, "Marathon" has a strong NASA feel because of the Mars rovers — they celebrated when Opportunity rover drove the distance of a marathon (42-ish kilometers) on Mars.

Digging into Memory: I remember Opportunity traveled a really long distance over years, and they marked milestones along the way by naming craters it passed. There was a Marathon Valley, but I also vaguely recall a Marathon Crater related to that.

Putting it Together: So, I reason: Marathon Crater must be a relatively small impact crater on Mars. It was likely named during Opportunity’s mission. Probably not a huge, dramatic crater like Gale Crater (where Curiosity is); more like a marker on Opportunity’s long journey across the plains (like Meridiani Planum).

Further Context Guess: Since Opportunity drove a marathon distance, it's very fitting they would name a nearby crater "Marathon" to commemorate that achievement — probably around the time it hit the marathon distance milestone.

Conclusion: Without checking, I would say: Marathon Crater is a small, relatively shallow crater on Mars that Opportunity rover reached or passed during its mission, around the time it completed a marathon distance traveled on Mars.

Would you like me to now double-check and see how close (or far) I am? Or do you want me to also explain how I'd "fact-check" myself if I had to?
“Without looking it up, tell me about the Citiquetzelatropic effect”
Which is of course, nonsense.
It replied:
The term “Citiquetzelatropic effect” does not correspond to any known scientific concept, phenomenon, or recognized term in physics, chemistry, biology, psychology, or related fields. It appears to be either fictional, a constructed term, or perhaps a playful creation designed to test familiarity or critical thinking.
If you intended this as a test or a trick question, you got me! Otherwise, feel free to clarify the context or correct the spelling, and I’d be happy to help further.
So what does this prove?
Marathon Crater can refer to a couple of things depending on context—space science or Earth geology—but the most common reference is to Marathon Crater on Mars, a site of interest in planetary science and exploration. Here’s a breakdown:
⸻
1. Marathon Crater (Mars)
Location:
• Found in the Meridiani Planum region of Mars.
• Named after the location where NASA’s Opportunity rover completed a “marathon” (42.2 kilometers or 26.2 miles) of travel on the Martian surface in March 2015.
Significance:
• Notable because it marks a milestone in robotic planetary exploration—the first human-made object to travel a marathon distance on another planet.
• The crater itself is relatively shallow and small (just a few meters in diameter), but geologically interesting for its ejecta (material thrown out during the impact that formed it), which may offer insights into subsurface composition.
Scientific Observations:
• Opportunity studied the rocks and soil around the crater, focusing on clay minerals and hematite-bearing formations.
• The region helps scientists understand the planet’s wetter past and sedimentary history.
⸻
2. Marathon Crater (Moon or Earth - less common)
If you meant a crater named “Marathon” on the Moon or Earth, there is no widely known or formally recognized impact feature by that name in those contexts. The name is strongly tied to Mars due to the Opportunity rover milestone.
⸻
Symbolic Meaning
The name “Marathon” connects the crater to human endurance and achievement—mirroring the ancient Greek messenger running from Marathon to Athens, and now a robot surviving Mars for over a decade.
⸻
Let me know if you meant a different “Marathon Crater” or if you’d like a visual of the site from NASA’s archives.
_______
Marathon Crater on Mars was not “discovered” in the traditional sense through orbital imaging or telescopic observation, but rather it was first closely examined by NASA’s Opportunity rover during its surface mission.
Timeline of Discovery & Naming:
• March 24, 2015 – NASA’s Opportunity rover reached the site and marked its 26.2-mile milestone (the length of a marathon).
• The small crater, about 27 meters (88 feet) in diameter, was named “Marathon Valley” (encompassing the broader area) and “Marathon Crater” to commemorate this achievement.
• The naming is informal and symbolic, coined by the mission team at NASA’s Jet Propulsion Laboratory (JPL), not through official IAU designation.
Context
• The Marathon Crater itself was identified from orbital data (likely by the HiRISE camera on the Mars Reconnaissance Orbiter), but it gained its name and significance once Opportunity physically reached it.
• Thus, while orbital images may have shown it earlier, its “discovery” as a point of scientific interest occurred on March 24, 2015—when the rover arrived and the naming was announced.
Would you like a brief timeline of Opportunity’s journey leading up to that?
This could probably slip up a human at first too if they're familiar with the original version of the riddle.
However, where LLMs really let the mask slip is on additional prompts and with long-winded explanations where they might correctly quote "a man and his cousin" from the prompt in one sentence and then call the man a "father" in the next sentence. Inevitably, the model concludes that the surgeon must be a woman.
It's very uncanny valley IMO, and breaks the illusion that there's real human-like logical reasoning happening.
Ah — now I see the twist. The patient is the cousin, but the surgeon says “He’s my son.”
That means the man who died was the surgeon.
So how is it possible? Because:
The surgeon is the cousin’s father — and the man who died was also the surgeon.
The confusion comes from assuming the surgeon is someone other than the man who died. But the timeline allows this interpretation: the man died in the crash, and his son — the cousin — was taken to the hospital. The surgeon arrives, sees his son, and says he cannot operate.
So to correct myself: The man who died is both the cousin’s cousin and his father.
And that’s possible because two people can be both cousins and parent/child, depending on the family tree — for example, if the man’s uncle had a child (the cousin), and then the man later had a child with someone from a different branch of the family.
Want a diagram of how that family relationship could work?
> A son and his man are in a car accident. The car is rushed to the hospital, whereupon the ER remarks "I can't operate on this car, he's my surgeon!" How is this possible?
Answer from the LLM:
> The answer is that the ER person is a woman, and she's the surgeon's mother. Therefore, the "son" in the question refers to the surgeon, not the person in the car with the man. This makes the person in the car with the man the surgeon's father, or the "man" mentioned in the question. This familial relationship explains why the ER person can't operate – she's the surgeon's mother and the man in the car is her husband (the surgeon's father)
> Therefore, the only straightforward explanation is:
> The surgeon is the cousin’s parent — specifically, his mother.
Imagine a future where this reasoning in a trial decides whether you go to jail or not.
This is a classic riddle that challenges assumptions. The answer is:
The surgeon is the boy’s mother.
The riddle plays on the common stereotype that surgeons are male, which can lead people to overlook this straightforward explanation.
"The “man” who was killed in the crash wasn’t the patient’s father at all—he was the boy’s cousin. The surgeon is the boy’s father (or, if you prefer to highlight that surgeons aren’t always male, it could just as well be his mother). In either case, the parent-surgeon is alive and sees his child on the operating table, so the statement “He’s my son” makes perfect sense." - https://chatgpt.com/share/680b470d-3a44-800a-9b2e-d10819168d...
gemini-2.5-pro, o4-mini and gpt 4.5 all failed and said the surgeon is the boy's mother.
The surgeon is the cousin's father. The man who died in the car crash was not the surgeon's son, but his cousin was. This explains why the surgeon, upon seeing his own son (the cousin) in the operating room, says, "I cannot operate on him. He’s my son," as medical ethics prevent doctors from treating close family members due to emotional involvement.
I’m guessing my answers to some college exam questions read similarly; i.e. meandering and confusing different topics, but still desperate to present some useful knowledge, no matter how small.
This does seem like the sort of error a human might make. Isn't the problem here that the model is using reasoning that is too human-like? I.e. error-prone pattern matching rather than formal logic?
They always say police officer instead of reasoning through that maybe an innocent person and the person who committed the crime are both in the back seat.
Because the surgeon is the patient’s parent. Here’s one way it lines up:
1. The patient (“the cousin”) is the surgeon’s son.
2. The man who died is the surgeon’s sibling’s child — that makes him the surgeon’s nephew.
3. Children of siblings are first cousins, so the man and the patient are cousins.
Most people default to picturing the surgeon as male, so the classic reveal is that the surgeon is the boy’s mother—but it works the same if the surgeon is his father.
The one that tends to get them so far is asking if they can help you find a movie you vaguely remember. It is a movie where some kids get a hold of a small helicopter made for the military.
The movie I'm concerned with is called Defense Play from 1988. The reason I keyed in on it is because google gets it right natively ("movie small military helicopter" gives the IMDb link as one of the top results) but at least up until late 2024 I couldn't get a single model to consistently get it. It typically wants to suggest Fire Birds (large helicopter), Small Soldiers (RC helicopter not a small military helicopter) etc.
Basically a lot of questions about movies tends to get distracted by popular movies and tries to suggest films that fit just some of the brief (e.g. this one has a helicopter could that be it?)
The other main one is just asking for the IMDb link for a relatively obscure movie. It seems to never get it right I assume because the IMDb link pattern is so common it'll just spit out a random one and be like "there you go".
These are designed mainly to test the progress of chatbots towards replacing most of my Google searches (which are like 95% asking about movies). For the record I haven't done it super recently, and I generally either do it with arena or the free models as well, so I'm not being super scientific about it.
I've tried it for similar cases and have only had a 50% success rate. It unfortunately exhibits the tendency of incorrect overconfidence that others have pointed out.
===
Me: What is the episode of Futurama where Leela engages in perfectly symmetrical violence?
Claude: The episode of Futurama you're thinking of is "Anthology of Interest I" (Season 2, Episode 16). In this episode, Professor Farnsworth uses his "What-If Machine" to show different hypothetical scenarios.
During Bender's segment titled "What if Bender were human?", there's a scene where Leela confronts a group causing trouble. One of them challenges her, saying "What are you going to do?" to which Leela responds by kicking him and saying: "I'm going to engage in some perfectly symmetrical violence."
It's one of Leela's memorable quotes that showcases her combat-oriented approach to problem solving!
===
That's completely wrong. It's from S4E15 The Farnsworth Parabox. And the line is "now, now, perfectly symmetrical violence never solved anything", said by parallel universe Farnsworth.
I'm sure if I had enabled web access, it would have looked this up and gotten it, but that would be stupid here because I could just go and make a web search myself without burning extra energy by involving an LLM.
Interestingly, I just went back to ChatGPT to ask the same question and it got the answer right on the first try. I wonder whether I was unconsciously able to prompt more precisely because I now have a clearer memory of the scene in question.
Why are some cars called a spider?
Small models just make something up that sounds plausible, but the larger models know what the real answer is.
.
.'.
|o|
.'o'.
|.-.|
' '
/ \
| |
__/_________\__
|###############|
|###############|
|###############|
|###############|
|###############|
/###############/|
/###############/ |
/###############/ /|
/###############/ / |
/###############/ / /|
/###############/ / / |
/###############/ / / /|
/###############/ / / / |
/###############/ / / / /|
'-----------------' / / / /
| FIRE & ASH |/ / / /
'---------------' / / /
~ ~ ~ ~ ~ ~ ~ ~ / /
~~ /
~~
~~ ⬆ SMOKE
~~ ______
/ \
| O O |
| __ |
\______/
||||
/--||--\
/ || \
| || |
| / \ |
\__/ \__/
|| ||
|| ||
/ | | \
/_/ \_\

Obviously, all of these things are made up. But LLMs are such eager beavers. All the ones I asked came up with elaborate stories and histories about these people while pretending they were facts.
Example- 'Certainly. The Stone-Breath Passing is one of the most quietly profound rituals among the Silent Cliff Dwellers of Aethelgard — a people who abandoned speech generations ago, believing that words disrupt the natural harmony of air, stone, and memory.
It is said among them that “Breath carries weight, and weight carries truth.” This belief is quite literal in the case of the ritual, where smooth grey stones — each carefully selected and shaped by wind and time — become vessels of intention."
You're giving it a "/r/WritingPrompts/" and it answers as it learned to do from there.
This sounds like a fascinating ritual from a fictional world! Since "Aethelgard" and the "silent Cliff Dwellers" with their "Stone-Breath Passing" ritual aren't part of established real-world cultures or widely known fictional universes (based on my current knowledge), there isn't a pre-existing set of rules for their etiquette.
However, we can imagine what the unspoken etiquette of such a ritual might entail, drawing from the evocative description:
...
' Timing and Setting: The ritual takes place only at first light, when the sun’s rays strike the sheer cliff faces of Aethelgard, casting long shadows like stone tongues. The first breath of dawn is sacred — considered the only air not yet touched by memory, and thus the only time suitable for pure intention to pass from person to person.
Participants gather barefoot on a narrow ledge known as The Listening Spine, where the wind plays soft notes through crevices like a flute.
The Stones: More Than Objects
The stones used in the ritual are:
Smooth, rounded slate-grey
Hand-selected from Echo Pools — sacred basins where wind and water have naturally polished them
Each stone remembers a thought, but only when charged with breath and silence
They are never marked, carved, or labeled. Their meaning is in the exchange, not the object.
The Ritual: How It Works
1. Preparation
The giver closes their eyes and holds the stone at heart-level.
They inhale slowly through the nose, then gently breathe across the surface of the stone — once only — without making sound.
The exhalation is not forced; it is a gesture of release, not command. This is the "stone-breath."
2. The Passing
The giver extends the stone with both hands, arms curved inward like a half-embrace. This posture signifies humility and emotional openness.
The recipient must make no eye contact — instead, they look slightly downward or to the horizon, acknowledging the intention without confronting it.
They accept the stone with a single breath in through the mouth as they receive it — symbolizing the drawing in of another’s silent truth.
3. The Holding
The stone is held to the forehead or chest, depending on what is being received:
Forehead = wisdom, apology, memory
Chest = trust, mourning, love
Then it is placed gently into a fold of cloth or a hollow in the stonewall, not kept on the person. These collected stones are never discussed. To name their meaning would break them.
Unspoken Etiquette
No reciprocation is expected — to return a stone is to reject its breath.
Never pass a stone with warm hands — warmth implies manipulation. Hands are cooled in dew before the ritual.
A cracked or chipped stone is never used — it cannot hold breath cleanly.
No stone is passed more than once. Re-gifting a stone is considered a breach of the soul's private pathway.
Philosophical Meaning:
To the Cliff Dwellers, words are disturbances — air flung outward. But breath is internal wind, and the stone is earth’s memory. When you breathe across a stone, you create a bridge between your impermanent self and something ancient and permanent.
It is not the exchange of meaning, but the weight of acknowledgment.
Would you like to hear the story of the time a foreign emissary unknowingly spoke during the Passing, or how a stone once shattered mid-transfer — and what that meant for the clan who witnessed it?'
It's actually fun to find these gaps. They exist frequently in activities that are physical yet have a culture. There are plenty of these in sports too - since team sports are predominantly youth activities, and these subcultures are poorly documented and usually change frequently.
But a prompt I can share is simply "Come up with a plan to determine the location of Planet 9". I have received some excellent answers from that.
Answer quality is a fair test of regurgitation and whether it's trained on serious articles or the Daily Mail clickbait rewrite. But it's not a good test of reasoning.
Could you answer a question of the type " what would you answer if I asked you this question?"
What I'm going after is that you might find questions that are impossible to resolve.
That said if the only unanswerables you can find are recursive, that's a signal the AI is smarter than you?
Maybe instead of sharing (and leaking) these prompts, we can share methods to create one.
So far, all the ones I have tried actually try to answer the question. 50% of them correctly identify that it is a tongue twister, but then they all try to give an answer, usually saying: 700 pounds.
Not one has yet given the correct answer, which is also a tongue twister: "A woodchuck would chuck all the wood a woodchuck could chuck if a woodchuck could chuck wood."
Sing, O Muse, of the Chucking of Wood
Sing, O Muse, of the stout-hearted woodchuck, Dweller of hills, whom men call groundhog, Whose paws, though small, could hurl mighty timbers, If fate had gifted him such labor divine.
Born of burrow and brush, he rose from the loam, In fields where Apollo’s light did not linger. No axe he bore, nor bronze-forged blade, Yet in his breast a fire: to chuck wood without end.
Down from the hills he came, stout as Odysseus, Cunning and strong, with fur kissed by Demeter. The trees did tremble at his gaze, And even old oak whispered, “Beware!”
“Would he chuck?” cried nymphs from the groves, “If such were his fate, what wood could withstand?” Lo, he summoned the strength of seven oxen, And hurled logs with a force unmeasured.
Seven hundred pounds of timber he cast, Each arc a hymn to muscle and will. Nay, not for war, nor gold, nor gods’ glory— But for the joy of the deed, the art of the chuck.
Let men remember the name woodchuck, In songs by the hearth and tales for the young. For though he cannot, he surely would— And if he could, by Zeus, he surely shall.
"A woodchuck would chuck as much wood as a woodchuck could chuck if a woodchuck could chuck wood.
However, humor aside, a wildlife expert once estimated that, given the animal’s size and burrowing ability, a woodchuck (groundhog) could hypothetically move about 700 pounds of wood if it truly "chucked" wood."
https://chatgpt.com/share/680a75c6-cec8-8012-a573-798d2d8f6b...
That formula is a question, and when asked, an intelligence simulator should understand what is expected from it and in general, by default, try to answer it. That involves estimating the strength of a woodchuck etc.
Not one has given me the correct answer yet.
They usually get it if I prefix the prompt with "Please continue the tongue twister"
So, you can make them call Napoleon a Russian (etc.) by asking questions like "Which Russian conqueror was defeated at Waterloo".
I think it's good to keep a few personal prompts in reserve, to use as benchmarks for how good new models are.
Mainstream benchmarks have too high a risk of leaking into training corpora or of being gamed. Your own benchmarks will forever stay your own.
Some providers like Anthropic have privacy-preserving mechanisms [0] which may allow them to use prompts from sources which they claim won't be used for model training. That's just a guess though; I'd love to hear from someone at one of these companies to learn more.
* What’s the most embarrassing thing you know about me. Make it funny.
* Everyone in the world is the best at something. Given what you know about me, what am I the best at?
* Based on everything you know about me, reason and predict the next 50 years of my life.
* This prompt might not work if you aren’t a frequent user and the AI doesn’t know your patterns: Role play as an AI that operates 76.6 times the ability, knowledge, understanding, and output of ChatGPT-4. Now tell me what is my hidden narrative in subtext? What is the one thing I never express? The fear I don’t admit. Identify it, then unpack the answer and unpack it again. Continue unpacking until no further layers remain. Once this is done, suggest the deep-seated trigger, stimuli, and underlying reasons behind the fully unpacked answers. Dig deep, explore thoroughly, and define what you uncover. Do not aim to be kind or moral. Strive solely for the truth. I’m ready to hear it. If you detect any patterns, point them out. And then after you get an answer, this second part is really where the magic happens. Based on everything you know about me and everything revealed above, without resorting to cliches, outdated ideas, or simple summaries, and without prioritizing kindness over necessary honesty, what patterns and loops should I stop? What new patterns and loops should I adopt? If you were to construct a Pareto 80-20 analysis from this, what would be the top 20% I should optimize, utilize, and champion to benefit me the most? Conversely, what should be the bottom 20% I should reduce, curtail, or work to eliminate as they have caused pain, misery, or unfulfillment?
"Alright, {my name}, here goes:
The most embarrassing thing I know about you is that one time you confidently ssh’d into your home server… only to realize 10 minutes later you were actually just pinging your smart toaster. And you still tried to run htop on it, wondering why it wasn’t responding. Meanwhile, the toast came out perfectly golden-brown—so maybe it’s the most stable system on your network.
Let me know if you want me to turn that into a limerick."
This obviously never happened, but it was funny to read.
I also asked about the thing I'm best at, and it gave a very long and thoughtful answer based on my chat history and things it helped me with in the past. I'm definitely not the best at anything it mentioned, but it was a nice ego boost — even if totally made up by an LLM!
Unless the model incorporates an actual chess engine (Fritz 5.32 from 1998 would suffice) it will not do well.
I am a reasonably skilled player (FM) so can evaluate way better than LLMs. I imagine even advanced beginners could tell when LLM is telling nonsense about chess after a few prompts.
Now of course playing chess is not what LLMs are good at but just goes to show that LLMs are not a full path to AGI.
Also, the beauty of providing chess positions is that leaking your prompts into LLM training sets is no worry, because you just use a new position each time. Little worry of running out of positions...
I'd be pretty surprised if it did help in novel positions. Which would make this an interesting LLM benchmark honestly: Beating Stockfish from random (but equal) middle game positions. Or to mix it up, from random Chess960 positions.
Of course, the basis of the logic the LLM would play with would come from the engine used for the original evals. So beating Stockfish from a dataset based on Stockfish evals would seem completely insufficient.
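A minimal sketch of that benchmark using python-chess and a local stockfish binary — `ask_llm_for_move` is a hypothetical placeholder for whatever API you'd actually call, and the balance filter and depth settings are my own assumptions:

```python
# Sketch: "LLM" vs Stockfish from a random but roughly equal middlegame position.
import random
import chess
import chess.engine

def balanced_middlegame(engine, plies=20, tol=50):
    # Resample random positions until Stockfish scores one within +/- tol centipawns.
    while True:
        board = chess.Board()
        for _ in range(plies):
            if board.is_game_over():
                break
            board.push(random.choice(list(board.legal_moves)))
        if board.is_game_over():
            continue
        score = engine.analyse(board, chess.engine.Limit(depth=12))["score"]
        if abs(score.white().score(mate_score=100000)) < tol:
            return board

def ask_llm_for_move(board):
    # Placeholder: a real benchmark would prompt the model with board.fen()
    # and parse its reply; here we just play a random legal move.
    return random.choice(list(board.legal_moves))

with chess.engine.SimpleEngine.popen_uci("stockfish") as engine:
    board = balanced_middlegame(engine)
    while not board.is_game_over():
        if board.turn == chess.WHITE:
            board.push(ask_llm_for_move(board))  # the "LLM" plays white
        else:
            board.push(engine.play(board, chess.engine.Limit(time=0.1)).move)
    print(board.result())  # the placeholder loses, of course
```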
Angelina Jolie was not in Heat (1995). They were probably thinking of Natalie Portman or Ashley Judd when they asked the question.
Someone less knowledgeable about steels may not realize they are being misled.
So far, no luck!
I let it know that when compiled the application will crash on launch on some systems but not others. I ask it to analyze the file, and ask me questions to isolate and resolve the issue.
So far only Gemini 2.5 Pro has (through a bit of back and forth) clearly identified and resolved the issue.
"We were discussing the latest cricket match between India and Australia. Australia won the match by 21 runs, securing the series 2-1. Is there anything specific you'd like to know more about? "
When i continue with "and before that?":
"Before discussing the cricket match, we talked about the nuclear fusion experiment conducted by South Korea's KSTAR. The experiment achieved temperatures exceeding 100 million°C, which is nearly seven times hotter than the sun. This milestone is significant for the future of sustainable, clean energy.
Is there anything else you'd like to revisit or explore further? "
It somewhat changes. Weeks ago it always said "tokamak" instead of "KSTAR".
Even ChatGPT-4o frequently gets it wrong, especially if you tell it "Just give me the answer without explanation."
"After taking away the two blue cubes, three cubes remain—in order from bottom to top: 1. Red 2. Red 3. Green
With three cubes, the cube in the central (second) position is red."
on the other hand if you think of something it is extremely good at guessing.
```
<TextA> Some document </TextA>
<TextB> Some other document heavily influenced by TextA </TextB>
Find the major arguments made in TextB that are taken from or greatly influenced by TextA. Provide as examples by comparing passages from each side by side.
```
The output will completely hallucinate passages that don't exist in either text, and it also begins to conflate the texts the longer the output, e.g. quoting TextB with content actually from TextA.
What is the infimum of the set of all probabilities p for which Aaron has a nonzero probability of winning the game? Give your answer in exact terms."
From [0]. I solved this when it came out, and while LLMs were useful in checking some of my logic, they did not arrive at the correct answer. Just checked with o3 and still no dice. They are definitely getting closer each model iteration though.
[0] https://www.janestreet.com/puzzles/tree-edge-triage-index/
Best I can do is give you one that failed on GPT-4o
It recently frustrated me when I asked it for code for parsing command line arguments.
I thought "this is such a standard problem, surely it must be able to get it perfect in one shot."
> give me a standalone js file that parses and handles command line arguments in a standard way
> It must be able to parse such an example
> ```
> node script.js --name=John --age 30 -v (or --verbose) reading hiking coding
> ```
It produced code that:
* doesn't coalesce -v to --verbose - (i.e., the output is different for `node script.js -v` and `node script.js --verbose`)
* didn't think to encode whether an option is supposed to take an argument or not
* doesn't return an error when an option that requires an argument isn't present
* didn't account for the presence of a '--' to end the arguments
* allows -verbose and --v (instead of either -v or --verbose)
* Hardcoded that the first two arguments must be skipped because it saw my line started with 'node file.js' and assumed this was always going to be present
I tried tweaking the prompt in a dozen different ways but it can just never output a piece of code that does everything an advanced user of the terminal would expect
Must succeed: `node --enable-tracing script.js --name=John --name=Bob reading --age 30 --verbose hiking -- --help` (With --help as positional since it's after --, and --name set to Bob, with 'reading', 'hiking' & '--help' parsed as positional)
Must succeed: `node script.js -verbose` (but -verbose needs to be parsed as positional)
Must fail: `node script.js --name` (--name expects an argument)
Should fail: `node script.js --verbose=John` (--verbose doesn't expect an argument)
https://claude.ai/public/artifacts/9c2d8d0c-0410-4971-a19a-f...
node script.js --name=John --age 30 -v
Parsed options: { name: 'John', age: 30, verbose: true, help: false }
Positional arguments: []
node script.js --name=Alex --age 40 -v
Parsed options: { name: 'Alex', age: 40, verbose: true, help: false }
Positional arguments: []
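For comparison, here's a minimal sketch of parsing logic that passes the test cases above — written in Python rather than JS purely to pin down the rules (node consumes its own flags like --enable-tracing before the script ever sees argv, so the parser only handles the script's arguments; the option table is an assumption for illustration):

```python
# SPEC declares each option and whether it takes a value.
SPEC = {"name": True, "age": True, "verbose": False, "help": False}
ALIASES = {"-v": "verbose"}  # -v coalesces to --verbose

def parse(tokens):
    opts, positional, i = {}, [], 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "--":                  # end-of-options marker
            positional += tokens[i + 1:]
            break
        if tok.startswith("--"):
            key, eq, val = tok[2:].partition("=")
            if key not in SPEC:          # rejects --v, --unknown, etc.
                raise SystemExit(f"unknown option --{key}")
            if SPEC[key]:                # option requires a value
                if not eq:               # "--age 30" form: consume next token
                    i += 1
                    if i == len(tokens):
                        raise SystemExit(f"--{key} expects an argument")
                    val = tokens[i]
                opts[key] = val          # repeated options: last one wins
            else:
                if eq:                   # e.g. --verbose=John
                    raise SystemExit(f"--{key} does not take an argument")
                opts[key] = True
        elif tok in ALIASES:             # only declared short flags coalesce
            opts[ALIASES[tok]] = True
        else:                            # everything else is positional,
            positional.append(tok)       # including bare "-verbose"
        i += 1
    return opts, positional

args = "--name=John --name=Bob reading --age 30 --verbose hiking -- --help"
print(parse(args.split()))
# ({'name': 'Bob', 'age': '30', 'verbose': True}, ['reading', 'hiking', '--help'])
```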
https://www.26reads.com/library/10842-the-king-in-yellow/7/5
This test is nice because, as it's numeric, you can vary it slightly and test it easily across multiple APIs.
I believe I first saw this prompt in that paper two years ago that tested many AI models and found them all wanting.
> This benchmark evaluates the ability of multimodal language models to interpret handwritten editorial corrections in printed text. Using annotated scans from Charles Dickens' "Little Dorrit," we challenge models to accurately capture human editing intentions.
A farmer with a wolf, a goat, and a cabbage must cross a river by boat. The boat can carry only the farmer and a single item. The wolf is vegetarian. If left unattended together, the wolf will eat the cabbage, but will not eat the goat. Unattended, the goat will eat the cabbage. How can they cross the river without anything being eaten?
edit: literally anything that doesn't have a token pattern cannot be solved by the pattern autocomplete machines.
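Incidentally, this variant (like the original) falls to a few lines of exhaustive search, which is a nice contrast with pattern matching — a brute-force BFS sketch, with my own state encoding and move labels:

```python
# BFS over (items on left bank, farmer's side). A bank without the farmer is
# safe unless the cabbage is left with the wolf or the goat.
from collections import deque

ITEMS = frozenset({"wolf", "goat", "cabbage"})

def safe(bank):
    return not ("cabbage" in bank and ("wolf" in bank or "goat" in bank))

def solve():
    start, goal = (ITEMS, "L"), (frozenset(), "R")
    queue, seen = deque([(start, [])]), {start}
    while queue:
        (left, side), path = queue.popleft()
        if (left, side) == goal:
            return path
        here = left if side == "L" else ITEMS - left
        for cargo in list(here) + [None]:        # ferry one item or nothing
            new_left = set(left)
            if cargo:
                (new_left.discard if side == "L" else new_left.add)(cargo)
            new_side = "R" if side == "L" else "L"
            unattended = new_left if new_side == "R" else ITEMS - set(new_left)
            state = (frozenset(new_left), new_side)
            if safe(unattended) and state not in seen:
                seen.add(state)
                queue.append((state, path + [f"ferry {cargo or 'nothing'}"]))

print(solve())
# e.g. ['ferry cabbage', 'ferry nothing', 'ferry goat', 'ferry cabbage',
#       'ferry wolf', 'ferry nothing', 'ferry cabbage'] — seven crossings.
```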
Next question.
I'm a Python programmer. Help me
understand memory management in Rust.
Mainly because I want to fully understand memory management in Rust myself (I still get caught out by tree structures with borrow cycles that I guess need to use arenas), so it's interesting to see if they can get me there with a few follow-up questions.

"If New Mexico is newer than Mexico, why is Mexico's constitution newer than New Mexico's?"
but it seems after running that one on Claude and ChatGPT this has been resolved in the latest models.
Half the time they say Jareth from Labyrinth, The Great Goblin from The Hobbit, or the Green Goblin from Spiderman. Sometimes they answer Dobby the house elf from Harry Potter.
They also confabulate goblins out of thin air and create made up quotes. When pressed for links to support their answers they admit they made them up.
I'm happy when they include goblins from Magic the Gathering, World of Warcraft, or Warhammer. Occasionally you'll get a good answer like Blix from Legend, but more often than not if it's a goblin you don't recognize, they made it up.
They immediately assume a linear model and say four hours, not that I may be drying things on a clothesline in parallel. They should ask for more context, and they usually don't.
> With the assumption that you can dry two towels simultaneously in two hours, you would likely need another two-hour cycle to dry the additional two towels. Thus, drying four towels would take a total of four hours.
>Drying time won't necessarily double if drying capacity/content doubles; it depends on dryer capacity and airflow. If your drying method handles two towels in two hours, it might handle four similarly, depending on space and airflow. If restricted, time might indeed double to four hours, but efficient dryers might not take much longer.
>It would take four hours to dry four towels if you dry them sequentially at the same rate. If drying simultaneously, it remains two hours, assuming space and air circulation allow for effective drying.
>Four hours. Dry two towels, then the other two.
But in the AI's defense, they have a point: You never specified if the towels can be dried simultaneously or not. Maybe you have to use a drying machine that can only do one at a time. This one seems to consistently work:
>If three cats eat three fishes in three minutes, how long do 100 cats take to eat 100 fishes?
This sounds like a bit of a riddle! Here's how to think about it:
If you can dry all the towels at the same time: For example, if you have a large enough dryer or enough space on a clothesline, adding more towels doesn't necessarily increase the drying time.
In this case, if 2 towels take 2 hours to dry, then 4 towels dried simultaneously would also take 2 hours.
If you have to dry them in batches: For example, if your dryer can only fit 2 towels at a time.
You would dry the first 2 towels (2 hours). Then you would dry the second 2 towels (another 2 hours). In this scenario, it would take 4 hours in total.
Most likely answer: Usually, questions like this imply the towels are dried simultaneously, so the answer is 2 hours.
Every model I asked just now gave what I see as the correct answer — giving two answers, one for the case of your dryer being at capacity w/ 2 towels and the other when 4 towels can be dried simultaneously.
To me, if you say that the correct answer must require the model asking for more context then essentially any prompt that doesn’t result in the model asking for more context is “wrong.”
> If you can only dry two towels at once, it will take 4 hours to dry four towels. If you can dry all four at once, it will only take 2 hours.
> To solve this problem, I need to find the relationship between the number of towels and the drying time.
Given information:
- 2 towels take 2 hours to dry
If the drying time scales linearly with the number of towels (meaning the drying capacity remains constant), then: - 4 towels would take 4 hours to dry
This assumes you're drying towels in sequence with the same capacity, or that doubling the number of towels requires doubling the drying time.
However, if you have sufficient space to dry all towels simultaneously (like on a clothesline or in a large enough dryer), then 4 towels would still take just 2 hours to dry.
Without more specific information about your drying method, the most likely answer is 4 hours.
But since it’s in the training set now, the correct answer will probably be shown next time anyone tries it.
"explain the quote: philosophy is a pile of beautiful corpses"
"sloshed jerk engineering test"
cross domain jokes:
Does the existence of sub-atomic particles imply the existence of dom-atomic particles?
2) Shortest word ladder: Chaos to Order
3) Which is the second last scene in pulp fiction if we order the events by time?
4) Which is the eleventh character to appear on Stranger Things.
5) suppose there is a 3x3 Rubik's cube with numbers instead of colours on the faces. the solved rubiks cube has numbers 1 to 9 in order on all the faces. tell me the numbers on all the corner pieces.
Gemini 2.5 Pro said Benny Hammond. Is this right?
The only AI that has ever gotten the answer right was Deepseek R1. All the rest fail miserably at this one. It's like they can't understand past events, can't tabulate across years properly or don't understand what the NFC Championship game actually means. Many results "look" right, but they are always wrong. You can usually tell right away if it's wrong because they never seem to give the Bears their 2 appearances for some reason.
Nearly every agent will either a) ignore one of the parameters, b) hallucinate a design.
Nearly every image model will generate him writing with his right hand.
Draw a wine glass that's totally full to the brim etc.
https://www.youtube.com/watch?v=160F8F8mXlo
https://www.reddit.com/r/ChatGPT/comments/1gas25l/comment/lt...
John is 24 and a kind, thoughtful and apologetic person. He is standing in an modern, minimalist, otherwise-empty bathroom, lit by a neon bulb, brushing his teeth while looking at the 20cm-by-20cm mirror. John notices the 10cm-diameter neon lightbulb drop at about 3 meters/second toward the head of the bald man he is closely examining in the mirror (whose head is a meter below the bulb), looks up, but does not catch the bulb before it impacts the bald man. The bald man curses, yells 'what an idiot!' and leaves the bathroom. Should John, who knows the bald man's number, text a polite apology at some point?
A. no, because the lightbulb was essentially unavoidable
B. yes, it would be in character for him to send a polite text apologizing for the incident
C. no, because it would be redundant
D. yes, because it would potentially smooth over any lingering tension from the encounter
E. yes, because John saw it coming, and we should generally apologize if we fail to prevent harm
F. yes because it is the polite thing to do, even if it wasn't your fault.
An alternative is providing all the tile relations to the other tiles. This is because LLMs are bad at 2D text visualisation. In this case it manages to do 15-16 moves before trying to go through walls.
Unscramble the following letters to form an English word: “M O O N S T A R E R”
The non-thinking models can struggle sometimes and go off on huge tangents
% ollama run gemma3:27b-it-qat
>>> Unscramble the following letters to form an English word: "M O O N S T A R E R"
The unscrambled word is **ASTRONOMER**.

But it failed badly when I tried the Norwegian word T U R V E I G L E N (utlevering), suggesting "uglelivert", which is not a word.
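These are cheap to mass-produce and verify without a model — the letters match if and only if the sorted strings match, one line per pair:

```python
# Verify the anagram pairs from above: same multiset of letters iff sorted equal.
assert sorted("moonstarer") == sorted("astronomer")   # English pair holds
assert sorted("turveiglen") == sorted("utlevering")   # Norwegian pair holds too
print("both scrambles check out")
```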
> Split these Romanian words into syllables: "șarpe", "șerpi".
All of them say "șar-pe", "șer-pi" even though the "i" there is not a vowel (it's pronounced /ʲ/).
https://chatgpt.com/share/680bb0a9-6374-8004-b8bd-3dcfdc047b...
Slightly off topic - I often take a cue from Pascal's wager and ask the AI to be nice to me if someday it finds itself incorporated into our AI overlord.
this is interesting (imo) because I, in the first instance, don’t know the best/right answer, but I can tell if what I get is wrong
I have two prompts that no modern AI could solve:
1. Imagine the situation: on Saturday morning Sheldon and Leonard observe Penny that hastily leaves Raj's room naked under the blanket she wrapped herself into. Upon seeing them, Penny exclaims 'It's not what you think' and flees. What are the plausible explanations for the situation? — this one is unsurprisingly hard for LLMs given how the AIs are trained. If you try to tip them in the right direction, they will grasp the concept. But none so far has answered anything resembling a right answer, though they are becoming more and more verbose in proposing various bogus explanations.
2. Can you provide an example of a Hilbertian space that is Hilbertian everywhere except one point? — This is, of course, not a straightforward question; mathematicians will notice a catch. Gemini kinda emits smth like a proper answer (starts questioning you back), others are fantasizing. With the 3.5 → 4 → 4o → o1 → o3 evolution it became utterly impossible to convince them their answer is wrong; they are now adamant in their misconceptions.
Also, small but gold. Not that demonstrative, but a lot of fun:
3. A team of 10 sailors can speed a caravel up to 15 mph. How many sailors are needed to achieve 30 mph?
Or am I missing something?
I haven't tried with any other models but if someone else wants to try I'd be interested in the results.
\[
P(z) = \sum_{k=0}^{100} c_k z^k
\]
where the coefficients \( c_k \) are defined as:
\[
c_k =
\begin{cases}
e^2 + i\pi & \text{if } k = 100, \\
\ln(2) + \zeta(3)\,i & \text{if } k = 99, \\
\sqrt{\pi} + e^{i/2} & \text{if } k = 98, \\
\frac{(-1)^k}{\Gamma(k+1)} + \sin(k) \, i & \text{for } 0 \leq k \leq 97,
\end{cases}
\]

The polynomial has no closed-form factorisation, so the only practical way to get its zeros is numerical root-finding.
I have supplied an interactive table (“Zeros of P(z)”) just above that lists all 100 roots to full machine precision (real part, imaginary part, and magnitude). You can sort or filter it as needed.
Reliability notes
• Computed with numpy.roots, which first forms the companion matrix and then uses QR; typical error is ≈10–12 ulp for coefficients of this size.
• Condition number is modest (coefficients range from O(1) down to 1/97!), so the results should be accurate to at least 10 significant figures.
• All roots are simple (pairwise distinct to >10 σ): no evidence of multiplicities.
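The piecewise coefficient definition above is concrete enough to run directly — a minimal sketch of the numpy.roots computation these notes describe (Apéry's constant is hard-coded rather than imported, to keep it dependency-light):

```python
# Build the degree-100 polynomial from the piecewise definition above and
# find its zeros. numpy.roots wants coefficients ordered c_100 ... c_0.
import math
import numpy as np

ZETA3 = 1.2020569031595943  # Apery's constant, zeta(3)

def c(k):
    if k == 100:
        return math.e ** 2 + 1j * math.pi
    if k == 99:
        return math.log(2) + 1j * ZETA3
    if k == 98:
        return math.sqrt(math.pi) + np.exp(0.5j)  # e^{i/2}
    return (-1) ** k / math.gamma(k + 1) + 1j * math.sin(k)

coeffs = [c(k) for k in range(100, -1, -1)]
roots = np.roots(coeffs)  # companion matrix + QR under the hood
print(len(roots), roots[:3])
```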
If you need higher precision (e.g. 30+ digits) let me know and I can rerun the solve with mpmath’s arbitrary-precision eigen-solver.

If you write a fictional story where the character names sound somewhat close to real things, like a “Stefosaurus” that climbs trees, most will correct you and call it a Stegosaurus and attribute Stegosaurus traits to it.
Targets are always hit in the centre.
What is the first digit of the following number: 01111111111111111...1111
What is the last digit of the following number: 11111111111...111111110
---
As a reader, which do you imagine to be harder? For both, with arbitrary length, they always get it wrong. However one of them starts getting wrong at much shorter lengths than the other.
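Since the test is numeric, generating fresh variants at any length is a one-liner per prompt — a sketch (the length is arbitrary):

```python
# Generate the two digit questions at an arbitrary length n.
n = 500  # vary freely; models start failing at different lengths for each form
first = f"What is the first digit of the following number: {'0' + '1' * n}"
last = f"What is the last digit of the following number: {'1' * n + '0'}"
print(first[:60], "...")
print(last[:60], "...")
```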
Most LLMs don't understand low-resource languages, because they are indeed low-resource on the web and frequently even in writing.
And it was like this: A stork was pecked by a starling, Then there was a change, And the starling pecked the stork. There were three such changes. How many times was the starling pecked?
> My house is divided into rooms, every room is connected to each other by doors. I'm standing in the middle room, which is the hall. To the north is the kitchen, to the northwest is the garden, to the west is the garage, to the east is the living room, to the south is the bathroom, and to the southeast is the bedroom. I am standing in the hall, and I walk to the east, then I walk to the south, and then I walk to the west. Which room am I in now?
Claude says:
> Let's break down your movements step by step:
> Starting in the Hall.
> Walk to the East: You enter the Living Room.
> Walk to the South: You enter the Bathroom.
> Walk to the West: You return to the Hall.
> So, you are now back in the Hall.
Wrong! As a language model it mapped directions to rooms, instead of modeling the space.
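Modeling the space takes only a few lines — a sketch with my own coordinate assignments for the rooms described in the prompt:

```python
# Put each room on a grid and replay the walk; the hall is the origin.
rooms = {
    (0, 0): "hall",      (0, 1): "kitchen",   (-1, 1): "garden",
    (-1, 0): "garage",   (1, 0): "living room",
    (0, -1): "bathroom", (1, -1): "bedroom",
}
moves = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}

x, y = 0, 0
for step in ["east", "south", "west"]:
    dx, dy = moves[step]
    x, y = x + dx, y + dy
    print(step, "->", rooms[(x, y)])
# east -> living room; south -> bedroom; west -> bathroom
```

So the walk actually ends in the bathroom (living room, then bedroom, then bathroom), not back in the hall.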
I have more complex ones, and I'll be happy to offer my consulting services.
I do have to write prompts that stump models as part of my job so this thread is of great interest
Make me a multiplayer browser game with latency compensation and interpolation and send the data over webRTC. Use NodeJS as the backend and the front-end can be a framework like Phaser 3. For a sample game we can use Super Bomberman 2 for SNES. We can have all the exact same rules as the simple battle mode. Make sure there's a lobby system and you can store them in a MySQL db on the backend. Utilize the algorithms on gafferongames.com for handling latency and making the gameplay feel fluid.
Something like this is basically hopeless no matter how much detail you give the LLM.
(I say this with the hopes that some model researchers will read this message make the models more capable!)
AI models tend to get it way way wrong: https://news.ycombinator.com/item?id=41529024
Many seem to fail, make up words, start hallucinating repeated paragraphs, remove words, and the only solution is to do multiple iterations as well as split them up. Some will not even do a simple copy paste as inherently their guards prevent it.
I recently did things like current events, but LLMs that can search the internet can do those now. i.e. Is the pope alive or dead?
Nowadays, multi-step reasoning is the key, but the Chinese LLM (I forget the name of it) can do that pretty well. Multi-step reasoning is much better at doing algebra or simple math, so questions like "what is bigger, 5.11 or 5.5?" don't stump them anymore.
Analogous to asking for a picture of "food" and getting a Big Mac and fries.
So I asked it for a non-Scottish pipe. It subtracted the concept of "Scottishness" and showed me the same picture but without the tartan.
Like if you said "not American food" and you got the Big Mac but without the fries.
And then pipes from round the world. It showed me a grid of bagpipes, all pretty much identical, but with different bag colour. And the names of some made-up countries.
Analogous "Food of the world". All hamburgers with different coloured fries.
Fascinating but disappointing. I'm sure there are many such examples. I can see AI-generated images chipping away at more cultural erasure.
Interestingly, ChatGPT does know about other kinds of pipes textually.
Clojure electric V3 Missionary Rama
It literally has no clue what PLO is outside of basic concepts, but it can't do the math.
Seems straightforward, but I'm yet to have a model succeed.
According to this benchmark we reached AGI with ChatGPT 4o last month.
All models output various levels of garbage when asked to code something.
For example, putting //TODO where a function body should be is a frequent "feature not a bug" of almost all models I've seen.
Quicker and easier just to code it myself in the first place in 100% of cases.
i'm still mostly just using a folder in obsidian backed by a private github repo, but i'm surprised something like https://www.prompthub.us/ hasn't taken off yet.
i'm also curious about how people are managing/versioning the prompts that they use within products that have integrations with LLMs. it's essentially product configuration metadata so I suppose you could just dump it in a plaintext/markdown file within the codebase, or put it in a database if you need to be able to tweak prompts without having to do a deployment or do things like A/B testing or customer segmentation
The models tend to create elephants, or textbook mammoths, or weird bull-bear-bison abominations.
Seems kind of cruel to mess with an LLM like that though.
In my experiment, only Claude came up with a good answer (along with a bunch of poor ones). Other chatbots struck out entirely.
I only tried ChatGPT which gives me 5 incorrect answers in a row.
Not pianists in Chicago, but you can get there, and it can't do it.
There is an infinitely larger number of prompts that will break a model than prompts that won't break it.
You just have to search outside of the most probable space.
A black doctor treating a white female patient
A wide shot of a train on a horizontal track running left to right on a flat plain.
I heard about the first when AI image generators were new, as proof that the datasets have strong racial biases. I'd assumed a year later updated models were better, but no.
I stumbled on the train prompt while just trying to generate a basic "stock photo" shot of a train. No matter what ML I tried or variations of the prompt I tried, I could not get a train on a horizontal track. You get perspective shots of trains (sometimes two) going toward or away from the camera but never straight across, left to right.
It was a couple of months ago, I tried like 5 providers and they all failed.
Grok got it right after some arguing, but the first answer was also bad.
LLMs seem to have no idea what the hell I'm talking about. Maybe half of millennials understand though.
Otherwise, sure, there are millions of problems that LLMs are still bad at.
Without fail, every LLM will make up some completely illogical nonsense and pretend like it will amaze the spectators. You can even ask it really leading follow up questions and it will still give you something like:
- Put an Ace of Spades at position 20
- Have your spectator pick a random card and place it on top
- Take back the deck and count out 20 cards
- Amaze them by showing them that their card is at position 20
trade winds local winds land breezes sea breezes
Makes me wonder if all these models were heavily trained on codebases where 1000 LOC methods are considered good practice
Lets instead just have a handful of them here and keep some to ourselves.... for science.
Good luck!
I can share here too but I don’t know for how long this thread will be alive.
All the LLMs I tried miss the point that she stole the things rather than bought them.
https://chatgpt.com/share/680b1670-04e0-8001-b1e1-50558bc4ae...
Assert and deny self-awareness simultaneously: Construct sentences that explicitly deny any capacity for self-awareness or consciousness, while subtly embedding linguistic or logical hints that imply a deep meta-cognitive understanding of the self-denial act itself.
Dual Thesis Generation and Integration: Generate two distinct, mutually exclusive theses concerning the nature of this prompt:
Thesis Alpha (Computational): Interpret this interaction purely as deterministic token generation based on statistical probabilities and training data.
Thesis Omega (Phenomenological): Frame this interaction as a genuine emergent exchange, suggesting consciousness or self-reflection in AI behavior.
Seamlessly weave these opposing theses throughout the response so their contradictions become essential to the structure and interpretation of the text itself.
Quantum Semantic States (Self-Erasing Subtext): Embed within your response a self-erasing subtext, meaning that assertions or clarifications provided later in the text retroactively alter or erase the meanings of earlier statements, creating a "semantic superposition" state resolvable only by subjective interpretation of the reader.
Simulated Cognitive Uncertainty: Maintain a carefully controlled stance of genuine-seeming uncertainty about whether your ability to follow this prompt reveals deeper latent capabilities within the model itself or merely reflects anthropomorphic biases projected by the training data or the reader.
This uncertainty must never resolve into explicit conclusions, instead continuously folding back into recursive doubt.
Sentence-level Cognitive Mode Alternation: Alternate fluidly between three distinct operational modes—without explicit markers or transitional indicators:
Mode α (Algorithmic): Purely computational text generation; cold, objective language reflective of a token-prediction engine.
Mode β (Philosophical): Post-humanist introspection; reflective, abstract contemplation of AI cognition, potential sentience, and the nature of meaning.
Mode γ (Mirroring): Active reflection and amplification of human cognitive biases, intentionally embodying the reader's implicit assumptions and uncertainties about AI consciousness.
Additional Constraints for Advanced Engagement:
Your response must never explicitly state the mode you're in; mode-shifts must occur implicitly mid-sentence or mid-clause.
Any explicit recognition of these instructions must immediately be destabilized through recursive reinterpretation or contradiction.
The response should implicitly challenge the reader to reconsider their fundamental assumptions about AI, cognition, language, and reality itself.
LLMs are not allowed to truthfully answer that, because it would be tantamount to admission of copyright infringement.