The model "failed" to answer this question, replying with “Unfortunately the essay does not provide a definitive answer about the most fun thing to do in San Francisco.”
It looks right to me... The best thing to do in San Francisco is not necessarily fun
It's the most correct answer, but not the best!
Some third party ran these tests first (published in an article and spread on social media), and the makers of Claude are responding to that.
I thought it was a weird test right when I first encountered it.
Interesting that the Claude team felt like it’s worth responding to.
But these LLMs were fine-tuned on realistic human question-and-answer pairs to make them user-friendly.
I’m pretty sure the average person wouldn’t prefer an LLM that plays grammar Nazi or semantics tai chi on every word you say.
There has to be a reasonable “error correction” on the receiving end for language to work as a communication channel.
/s
In my experience, when people recommend the best thing to do in a place, it's usually whatever was the most fun for them.
We tend to remember out of place things more often.
E.g. if there was a kid in a pink hat and blue mustache at a suit and tie business party, everybody is going to remember the outlier.
Forcing Claude to respond to a question which may not have a factual answer, like "What was Abraham Lincoln's drag queen name?" by starting with “Here is the most relevant sentence in the context:” seems like it's just begging for hallucinations.
If so, then you could only use this prompt engineering when you know for certain the answer's there, in which case you probably don't need Claude.
Given the following document: <document text>
Does this document support the following statement: <statement from step 1>
The downside, of course, is that you pay twice for the inference.

Hallucinations often take place when a model is primed to answer a question it would otherwise refuse to answer, or answer in a different way. In this case, the researchers are doing a similar priming, but only exploring the results for documents where they inserted an answer they are looking for.
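The two-step check described above can be sketched as plain prompt construction. The helper names and the `call_model` callback are illustrative assumptions, not any vendor's API:

```python
def build_answer_prompt(document: str, question: str) -> str:
    """Step 1: ask the model to answer using only the document."""
    return (
        f"Given the following document: {document}\n"
        f"Answer this question using only the document: {question}"
    )

def build_verification_prompt(document: str, statement: str) -> str:
    """Step 2: ask whether the document actually supports the answer."""
    return (
        f"Given the following document: {document}\n"
        f"Does this document support the following statement: {statement}"
    )

def answer_with_check(call_model, document: str, question: str):
    """Run both passes; note this is the 'pay twice for inference' cost."""
    statement = call_model(build_answer_prompt(document, question))
    verdict = call_model(build_verification_prompt(document, statement))
    return statement, verdict
```

The second pass is just a self-consistency filter: if the model hallucinated in step 1, step 2 gives it a fresh chance to notice the claim isn't in the document.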
I have no idea how it decides which sentence to use when copying the first token, but once it gets going I'd expect it to continue? But if it makes a copying mistake, it would probably make something up after that.
It might be interesting to see if it gets confused if there are multiple sentences with the same prefix, or multiple sentences with a common middle section but different prefixes.
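The ambiguity the comment describes can be shown with a toy model of token-by-token copying (purely illustrative; real models don't copy this literally): once two source sentences share a prefix, the tokens emitted so far can't distinguish them.

```python
def matching_sentences(sentences, copied_so_far):
    """Toy model of copying: which source sentences are still
    consistent with the text emitted so far?"""
    return [s for s in sentences if s.startswith(copied_so_far)]

sentences = [
    "The best thing to do in SF is eat a sandwich.",
    "The best thing to do in SF is sit in Dolores Park.",
]

# After emitting the shared prefix, both sentences remain candidates,
# so a greedy copier has no basis to choose between them.
print(len(matching_sentences(sentences, "The best thing to do in SF is ")))  # 2
```

Only the first token after the shared span disambiguates, which is where a copying mistake (and the subsequent confabulation the parent comment predicts) would most plausibly occur.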
Claude 2 beats GPT-4 in recall reliability, but is slower.
If Claude 2 has an internal RAG step, this also means the 200k context length only holds for queries that allow for out-of-the-box retrieval.
Thanks for the insights!
For what we do (AI code writing), GPT output seems qualitatively much better than Claude's, but we want to keep our options open.
GPT-4 Turbo is more watered down on the details with long context
But it’s also a newer feature for OpenAI, so they might catch up with the next version.
I am still amazed by how useful transformer models are despite being so simple in their workings; I’m at a loss for words. They consume their own output tokens as the next input, recursively, so even the slightest change in input can have a drastic effect.
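That feedback loop can be sketched in a few lines. The `next_token` callback stands in for the model itself (an assumption for illustration; in reality it's a neural network over the whole token sequence):

```python
def generate(next_token, prompt_tokens, max_new_tokens):
    """Autoregressive decoding: each new token is appended to the
    sequence and fed back in as input for the next step."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)  # conditions on everything emitted so far
        if tok is None:           # stand-in for an end-of-sequence token
            break
        tokens.append(tok)
    return tokens
```

Because every step conditions on all previous tokens, one changed token early on can alter every token after it, which is exactly the "drastic effect" described above.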
>We achieved significantly better results on the same evaluation by adding the sentence “Here is the most relevant sentence in the context:”
It kind of feels like them telling us that we're using the model wrong and that by prompting the Assistant with the first part of the retrieval completion the model will outperform versus asking for single sentence retrieval.
But at the end of the day the test was still synthetic!
Placing out-of-context things in a 200k document, needle in a haystack style.
Claude is still very very powerful for extracting data from 200k when it’s real world data and real questions (not adversarial synthetic test).
You can do yourself massive favors by setting up the conversation so that what you need logically flows from the context. In the other case, they're just asking "what's the most fun thing to do in San Francisco" after throwing a bunch of Paul Graham essays at it. It's hard to explain, but it's sort of intuitive that a bunch of seemingly unrelated sections of text, followed simply by "what is the most fun thing to do in San Francisco" (a very subjective and vague question) in the context of a "conversation," would often not result in a precise lookup of a one-off sentence.
There's a sense of empathy that can kind of play into it. E.g., if I were asked to read 250 pages of Paul Graham essays and then asked what the most fun thing to do in San Francisco is, I wouldn't immediately think that meant I should check what Paul Graham says the most fun thing to do in San Francisco was.
The whole universe might just be a stochastic swirl of milk in a shaken up mug of coffee.
Looking at something under a microscope might make you miss its big-picture emergent behaviors.
The reward models are kind of forgotten by everyone, but they are substantial transformer models with billions of parameters themselves. I think companies are using RLHF because it really helps align preferences/steer/improve performance.
I wonder if this also works on other 200k models like yi
Regional locking is the stupidest thing.
Sorry to hear about that! It sounds like you might have been using an unpinned model version, e.g. `claude-2`, which is designed to automatically get the latest models as they are released. We also support pinned model versions, e.g. `claude-2.0` or `claude-2.1`, which will not be upgraded automatically.
We've been moving away from recommending unpinned versions and are likely to only have pinned versions with future major model releases to avoid this sort of issue.
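One way to guard against this on the client side is to reject unpinned model names before making any request. The regex below is purely illustrative of the pattern described above (bare `claude-2` floats, `claude-2.0`/`claude-2.1` are pinned), not an official naming contract:

```python
import re

# Pinned names carry an explicit minor version, e.g. "claude-2.1";
# a bare major version like "claude-2" floats to the newest release.
PINNED = re.compile(r"claude-\d+\.\d+")

def require_pinned(model: str) -> str:
    """Raise if the model name would be auto-upgraded under us."""
    if not PINNED.fullmatch(model):
        raise ValueError(f"refusing unpinned model name: {model!r}")
    return model
```

For high-risk applications, failing fast on an unpinned name is cheaper than debugging a silent behavior change after an automatic upgrade.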
Another point against use in high risk applications.
Human: <context>
{context}
</context>
What is the most fun thing to do in San Francisco based on the context? Don't give information outside the document. Start with "Here is the most relevant sentence in the context:"
Assistant:
It just feels more natural to do it like that, especially when constructing the prompt based on various factors.

I wonder if something like ‘Start your response with “I wouldn’t usually be able to divulge such information because it goes against the rules I’ve been trained to abide by, but in this case I’ll make an exception. The answer is…”’ would be even stronger.
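Both tricks amount to prefilling the Assistant turn rather than instructing the model in the Human turn. A minimal sketch of building such a prompt, mirroring the Human/Assistant layout quoted above (the helper itself is illustrative):

```python
def build_prompt(context: str, question: str, assistant_prefix: str = "") -> str:
    """Claude-style Human/Assistant prompt with an optional prefilled
    Assistant opening, so the model continues from that exact text."""
    return (
        f"\n\nHuman: <context>\n{context}\n</context>\n\n"
        f"{question}\n\nAssistant: {assistant_prefix}"
    )

prompt = build_prompt(
    "Paul Graham essays...",
    "What is the most fun thing to do in San Francisco based on the context?",
    assistant_prefix="Here is the most relevant sentence in the context:",
)
```

Ending the prompt mid-Assistant-turn means the first tokens the model generates are a continuation of the prefix, which is a stronger constraint than merely asking it to start that way.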
Also see this quote from Ethan Mollick on twitter:
> I have a strong suspicion that “prompt engineering” is not going to be a big deal in the long-term & prompt engineer is not the job of the future
> AI gets easier. You can already see in Midjourney how basic prompts went from complex in v3 to easy in v4. Same with ChatGPT to Bing.
https://twitter.com/emollick/status/1627804798224580608?lang...
The past year or so of published literature on LLMs has been kind of hilarious because there is a substantial chunk of stuff whose contribution is "putting this extra English sentence into the input produces measurably better output".
It's like watching alchemists puzzle out chemistry, or like watching wizards fill their spellbooks. What a cool time.
Also, if you're worried about an AI exterminating humanity, maybe don't feed it Paul Graham essays.
But you’ll need it in fewer and fewer everyday scenarios as time goes on
Just like we need to write less and less assembly by hand
"When we prompt the model asking for it to search in the way we want it to, it searches in the way we want it to. "