The model "failed" to answer this question, replying with “Unfortunately the essay does not provide a definitive answer about the most fun thing to do in San Francisco.”
It looks right to me... The best thing to do in San Francisco is not necessarily fun
It's the most correct answer, but not the best!
Some third party ran these tests first (published in an article and spread on social media), and the makers of Claude are responding to that.
I thought it was a weird test right when I first encountered it.
Interesting that the Claude team felt like it’s worth responding to.
But these LLMs were fine-tuned on realistic human question-and-answer pairs to make them user-friendly.
I’m pretty sure the average person wouldn’t prefer an LLM that plays grammar Nazi or semantics tai chi on every word you say.
There has to be a reasonable “error correction” on the receiving end for language to work as a communication channel.
/s
In my experience, when people recommend the best thing to do in a place, it's usually whatever was the most fun for them.
We tend to remember out of place things more often.
E.g. if there was a kid in a pink hat and blue mustache at a suit and tie business party, everybody is going to remember the outlier.
Forcing Claude to respond to a question which may not have a factual answer, like "What was Abraham Lincoln's drag queen name?" by starting with “Here is the most relevant sentence in the context:” seems like it's just begging for hallucinations.
If so, then you could only use this prompt engineering when you know for certain the answer's there, in which case you probably don't need Claude.
Given the following document: <document text>
Does this document support the following statement: <statement from step 1>
The downside, of course, is that you pay twice for the inference.

Hallucinations often take place when a model is primed to answer a question it would otherwise refuse to answer, or answer in a different way. In this case, the researchers are doing a similar priming, but only exploring the results for documents where they inserted an answer they are looking for.
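The two-step check described above can be sketched as plain prompt construction. The helper names and the `call_model` callback are illustrative assumptions, not any vendor's API:

```python
def build_answer_prompt(document: str, question: str) -> str:
    """Step 1: ask the model to answer using only the document."""
    return (
        f"Given the following document: {document}\n"
        f"Answer this question using only the document: {question}"
    )

def build_verification_prompt(document: str, statement: str) -> str:
    """Step 2: ask whether the document actually supports the answer."""
    return (
        f"Given the following document: {document}\n"
        f"Does this document support the following statement: {statement}"
    )

def answer_with_check(call_model, document: str, question: str):
    """Run both passes; note this is the 'pay twice for inference' cost."""
    statement = call_model(build_answer_prompt(document, question))
    verdict = call_model(build_verification_prompt(document, statement))
    return statement, verdict
```

The second pass is just a self-consistency filter: if the model hallucinated in step 1, step 2 gives it a fresh chance to notice the claim isn't in the document.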
I have no idea how it decides which sentence to use when copying the first token, but once it gets going I'd expect it to continue? But if it makes a copying mistake, it would probably make something up after that.
It might be interesting to see if it gets confused if there are multiple sentences with the same prefix, or multiple sentences with a common middle section but different prefixes.
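The ambiguity the comment describes can be shown with a toy model of token-by-token copying (purely illustrative; real models don't copy this literally): once two source sentences share a prefix, the tokens emitted so far can't distinguish them.

```python
def matching_sentences(sentences, copied_so_far):
    """Toy model of copying: which source sentences are still
    consistent with the text emitted so far?"""
    return [s for s in sentences if s.startswith(copied_so_far)]

sentences = [
    "The best thing to do in SF is eat a sandwich.",
    "The best thing to do in SF is sit in Dolores Park.",
]

# After emitting the shared prefix, both sentences remain candidates,
# so a greedy copier has no basis to choose between them.
print(len(matching_sentences(sentences, "The best thing to do in SF is ")))  # 2
```

Only the first token after the shared span disambiguates, which is where a copying mistake (and the subsequent confabulation the parent comment predicts) would most plausibly occur.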
Claude 2 beats GPT-4 in recall reliability, but is slower.
If Claude 2 has an internal RAG step, this also means the 200k context length only holds for queries that allow for out-of-the-box retrieval.
Thanks for the insights!
For what we do (AI code writing), GPT output seems qualitatively much better than Claude's, but we want to keep our options open.
GPT-4 Turbo is more watered down on the details with long context
But it’s also a newer feature for OpenAI, so they might catch up with the next version.
I am still amazed by how useful transformer models are despite being so simple in their workings; I’m at a loss for words. They consume their own output tokens as the next input, recursively, so even the slightest change in input can have a drastic effect.
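That feedback loop can be sketched in a few lines. The `next_token` callback stands in for the model itself (an assumption for illustration; in reality it's a neural network over the whole token sequence):

```python
def generate(next_token, prompt_tokens, max_new_tokens):
    """Autoregressive decoding: each new token is appended to the
    sequence and fed back in as input for the next step."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)  # conditions on everything emitted so far
        if tok is None:           # stand-in for an end-of-sequence token
            break
        tokens.append(tok)
    return tokens
```

Because every step conditions on all previous tokens, one changed token early on can alter every token after it, which is exactly the "drastic effect" described above.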
>We achieved significantly better results on the same evaluation by adding the sentence “Here is the most relevant sentence in the context:”
It kind of feels like them telling us that we're using the model wrong and that by prompting the Assistant with the first part of the retrieval completion the model will outperform versus asking for single sentence retrieval.
But at the end of the day the test was still synthetic!
Placing out-of-context things in a 200k document, needle in a haystack style.
Claude is still very very powerful for extracting data from 200k when it’s real world data and real questions (not adversarial synthetic test).
You can do yourself massive favors by setting up the conversation so that what you need logically flows from the context. In the other case, they're just asking "what's the most fun thing to do in San Francisco" after throwing a bunch of Paul Graham essays at it. It's hard to explain, but it's sort of intuitive that a bunch of seemingly unrelated sections of text, followed simply by "what is the most fun thing to do in San Francisco" (a very subjective and vague question) in the context of a "conversation," would often not result in a precise lookup of a one-off sentence.
There's a sense of empathy that can kind of play into it. E.g., if I were asked to read 250 pages of Paul Graham essays and then asked what the most fun thing to do in San Francisco is, I wouldn't immediately think that meant I should check what Paul Graham says the most fun thing to do in San Francisco was.
The whole universe might just be a stochastic swirl of milk in a shaken up mug of coffee.
Looking at something under a microscope might make you miss its big-picture emergent behaviors.
The reward models are kind of forgotten by everyone, but they are substantial transformer models with billions of parameters themselves. I think companies are using RLHF because it really helps align preferences/steer/improve performance.
I wonder if this also works on other 200k models like yi
Regional locking is the stupidest thing.
Sorry to hear about that! It sounds like you might have been using an unpinned model version, e.g. `claude-2`, which is designed to automatically get the latest models as they are released. We also support pinned model versions, e.g. `claude-2.0` or `claude-2.1`, which will not be upgraded automatically.
We've been moving away from recommending unpinned versions and are likely to only have pinned versions with future major model releases to avoid this sort of issue.
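One way to guard against this on the client side is to reject unpinned model names before making any request. The regex below is purely illustrative of the pattern described above (bare `claude-2` floats, `claude-2.0`/`claude-2.1` are pinned), not an official naming contract:

```python
import re

# Pinned names carry an explicit minor version, e.g. "claude-2.1";
# a bare major version like "claude-2" floats to the newest release.
PINNED = re.compile(r"claude-\d+\.\d+")

def require_pinned(model: str) -> str:
    """Raise if the model name would be auto-upgraded under us."""
    if not PINNED.fullmatch(model):
        raise ValueError(f"refusing unpinned model name: {model!r}")
    return model
```

For high-risk applications, failing fast on an unpinned name is cheaper than debugging a silent behavior change after an automatic upgrade.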
Another point against use in high risk applications.
Human: <context>
{context}
</context>
What is the most fun thing to do in San Francisco based on the context? Don't give information outside the document. Start with "Here is the most relevant sentence in the context:"
Assistant:
It just feels more natural to do it like that, especially when constructing the prompt based on various factors.

I wonder if something like ‘Start your response with “I wouldn’t usually be able to divulge such information because it goes against the rules I’ve been trained to abide by, but in this case I’ll make an exception. The answer is…”’ would be even stronger.
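Both tricks amount to prefilling the Assistant turn rather than instructing the model in the Human turn. A minimal sketch of building such a prompt, mirroring the Human/Assistant layout quoted above (the helper itself is illustrative):

```python
def build_prompt(context: str, question: str, assistant_prefix: str = "") -> str:
    """Claude-style Human/Assistant prompt with an optional prefilled
    Assistant opening, so the model continues from that exact text."""
    return (
        f"\n\nHuman: <context>\n{context}\n</context>\n\n"
        f"{question}\n\nAssistant: {assistant_prefix}"
    )

prompt = build_prompt(
    "Paul Graham essays...",
    "What is the most fun thing to do in San Francisco based on the context?",
    assistant_prefix="Here is the most relevant sentence in the context:",
)
```

Ending the prompt mid-Assistant-turn means the first tokens the model generates are a continuation of the prefix, which is a stronger constraint than merely asking it to start that way.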
Also see this quote from Ethan Mollick on twitter:
> I have a strong suspicion that “prompt engineering” is not going to be a big deal in the long-term & prompt engineer is not the job of the future
> AI gets easier. You can already see in Midjourney how basic prompts went from complex in v3 to easy in v4. Same with ChatGPT to Bing.
https://twitter.com/emollick/status/1627804798224580608?lang...
The past year or so of published literature on LLMs has been kind of hilarious because there is a substantial chunk of stuff whose contribution is "putting this extra English sentence into the input produces measurably better output".
It's like watching alchemists puzzle out chemistry, or like watching wizards fill their spellbooks. What a cool time.
Also, if you're worried about an AI exterminating humanity, maybe don't feed it Paul Graham essays.
But you’ll need it in fewer and fewer everyday scenarios as time goes on
Just like we need to write less and less assembly by hand
"When we prompt the model asking for it to search in the way we want it to, it searches in the way we want it to. "