Interesting Research Out of Anthropic on Long Context Prompting

Essentially, the idea is to end the prompt with the beginning of a statement that presupposes the key part of the text has been located.

In their example:

Here is the most relevant sentence in the context:
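As a concrete sketch of the idea (the helper name is my own, and I'm assuming Anthropic's legacy Human/Assistant completion format), the prompt simply ends with the response seed:

```python
def build_primed_prompt(context: str, question: str,
                        seed: str = "Here is the most relevant sentence in the context:") -> str:
    """End the prompt with a response seed so the model's completion
    begins from a statement that assumes retrieval succeeded."""
    # Anthropic's legacy completion format; chat APIs need a different approach.
    return f"\n\nHuman: {context}\n\n{question}\n\nAssistant: {seed}"
```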

My thoughts:

  1. This reminds me a great deal of the early CoT prompt of “Let’s think step by step.”
  2. I would like to see this experiment replicated with the OpenAI models. I’m curious if it will yield improvements in even relatively small context models like gpt-3.5-turbo-16k.
  3. This seems ripe for exploration and fine-tuning of the response seed, e.g. “here is the most salient sentence in the context,” “this is the most relevant passage in the document,” etc.
  4. For GPT builders who include documents for RAG, this, or something like it, might boost the hit rate in retrieval (though details on the GPTs' retrieval system are sparse).
  5. I think there are almost certainly many other areas where model performance will be improved with these types of response seeds.

This is honestly why I love the older instruct models.

Priming the response can be very powerful. The trade-off, though, is that I think it can lead to hallucinations.

I thought this was well known?


Nice find!
It’s a recurring pattern to tell these models to do less thinking by themselves and focus more on what we already know. From circling the interesting parts of an image for analysis to explaining in detail what the generated code should look like, focusing attention on what’s important is always helpful.
It’s great to see a growing body of research supporting practical experience.


There must be a research paper associated with this, but maybe it isn’t out yet :thinking:

Anyways I found this figure floating around the web:


I think it’s from this YouTube creator:


Interesting, 64k of perfect recall is hella impressive if that is the case.


It seems like OpenAI has done a bit of this in their own fine-tuning, or even, I suspect, with a layer that acts as an “injector” of approval.


Did I need to see the word “Certainly”? It is out of context, but it is the kind of token one might drop into a completion to steer the AI toward fulfillment or denial.

You also see a bit more “problem discussion” in recent models, instead of just jumping into producing the output described. You get a bit of chain-of-thought without needing techniques to elicit that.

What is shown in the paper is a bit of probability gaming: you have to get the AI to output the first token of text from a context document rather than a token shaped by training penalties (and Anthropic has trained in lots of mistruth avoidance, spouting its constitutional AI mantra at you).


I believe this is slightly different.

The article is discussing the model’s reluctance to respond, not the model’s accuracy in retrieval.

Funnily enough, the image you posted is from the comment I was discussing earlier about lost-in-the-middle.

The Test

  1. Place a random fact or statement (the ‘needle’) in the middle of a long context window (the ‘haystack’)
  2. Ask the model to retrieve this statement
  3. Iterate over various document depths (where the needle is placed) and context lengths to measure performance
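The three steps above can be sketched as a simple evaluation loop. This is my own sketch, not the actual benchmark code: `ask_model` is a placeholder for whatever API you are testing against, and lengths are measured in characters here for simplicity (the real test iterates over token counts):

```python
def insert_needle(haystack: str, needle: str, depth: float) -> str:
    """Step 1: place the needle at a fractional depth (0.0 = start, 1.0 = end)."""
    pos = int(len(haystack) * depth)
    return haystack[:pos] + needle + haystack[pos:]

def run_needle_test(haystack, needle, question, expected, ask_model,
                    lengths=(1_000, 8_000, 64_000),
                    depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Steps 2-3: ask the model to retrieve the planted statement at every
    (context length, depth) combination and score each reply."""
    results = {}
    for length in lengths:
        for depth in depths:
            context = insert_needle(haystack[:length], needle, depth)
            reply = ask_model(f"{context}\n\n{question}")
            results[(length, depth)] = expected.lower() in reply.lower()
    return results
```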

This is the code that backed this OpenAI and Anthropic analysis.

Two Needles Strategy: By duplicating the target statement within the text, we found that GPT-4 could retrieve the information with 100% accuracy, suggesting that reinforcing the signal (i.e., the target information) enhances the model’s retrieval capability.
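A sketch of that duplication step (the helper name and depths are my own, illustrative choices):

```python
def insert_two_needles(haystack: str, needle: str,
                       depth_a: float = 0.25, depth_b: float = 0.75) -> str:
    """Place the same target statement at two depths, reinforcing
    the signal as in the two-needles result quoted above."""
    pos_a = int(len(haystack) * depth_a)
    pos_b = int(len(haystack) * depth_b)
    return (haystack[:pos_a] + needle +
            haystack[pos_a:pos_b] + needle +
            haystack[pos_b:])
```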

It was posted here:

The OP also wrote their own article:


Fair point,

I found the figure by searching for “Pressure testing Claude-2.1 200K via Needle-in-a-Haystack”, since I was trying to find the paper, but that figure was all that came up :sweat_smile:


I’m wondering how often you have to use this method

Is it for every 2k tokens or 30k?

It seems that this is part of an enhanced prompt, the actual “prompt” being the trigger that the AI should answer as an entity rather than simply complete text.

Here they’ve inserted the beginning of an answer that appears to come from the AI role, so the model continues with a compliant completion instead of an avoidance.

The AI might not “learn” from this; the multi-shot only shows that the AI does what it does because of the additional text. But the AI might start writing that phrase itself.

ChatML doesn’t allow this, and it would seem that developers placing their own assistant-role messages for the AI to continue is also blocked as a first phase in “Assistants”.
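To make the contrast concrete, here is a rough sketch of the two formats (helper names are my own; per the blocking described above, the chat format treats a trailing assistant message as a finished turn rather than a prefix to continue):

```python
def completion_style(context: str, question: str, seed: str) -> str:
    """Legacy completion models: the seed is literally the start of the
    output text, so the model continues it directly."""
    return f"{context}\n\n{question}\n\n{seed}"

def chat_style(context: str, question: str, seed: str) -> list:
    """ChatML-style chat models: the closest equivalent is a trailing
    assistant message, but the model starts a fresh turn instead of
    continuing the seed."""
    return [
        {"role": "user", "content": f"{context}\n\n{question}"},
        {"role": "assistant", "content": seed},
    ]
```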



Thank you for clarifying, I think I got confused by some of the wording in the original text.

True, but one could overtrain a model to always respond with “Here is the most relevant sentence in the context:…blabla” :rofl:

Crazy enough, I was looking for this article again and was not looking forward to scouring my activity logs :sweat_smile: so thank you!

Seeing the phrase “needle in a haystack” triggered exactly what I needed to recall & find the post.

These kinds of insights are going to be more and more valuable as context lengths get longer.


To be fair, LLMs have already surpassed humans in this regard. I don’t know any human who can read the first two Harry Potter books (~140k tokens) in 45 seconds and answer detailed questions about them. :sweat_smile:


I mean, maybe I just discovered this naturally through trial and error while working with LLMs, but is this that much different from how we remind other humans of stuff? The contextual information we rely on might be different, but I feel like the premise is identical here.
“Did you see Bob last night?”
“Bob? Who’s that?”
“The guy who had the weird hair?”
“OOOHHHH, yeah, I remember him!”

Say 100 people were at a party, and those people are the “context”. Providing the approximate association somebody needs to retrieve that data from their context is much the same technique discussed here; instead of text, it’s just visual stimuli.

At the same time though, even humans bullshit their responses to that at times, pretending they remembered when they didn’t, and just go along with it as the other person gives them the context they needed to keep engaging in the conversation as if they knew.

Idk man, sometimes I wonder if I should put out some stuff on arxiv or something because I still fail to be impressed by much of this and think to myself “…wait, this wasn’t known before?”


You know, as @RonaldGRuckus pointed out, there is an intermixing of reluctance to respond and retrieval quality. In the context of working with OpenAI models in their current state, encountering reluctance happens quite often.
But without being able to look under the hood, we can’t tell if the model opts to reply with “I cannot do X” when it really should have replied with “I don’t know who Bob is”.
And from there, with a little bit of priming, we should get the model back on track, but we are being misled by the reply.

Maybe that’s where I’m lost in all this: what is misleading about either response, and what changes somebody’s approach when the model responds in either of those ways?

If you know the context is there, the “doing” and the “remembering” are the same thing, because the action is the act of remembering. So, you just nudge 'em a lil bit and the problem is solved.

And maybe I’m proving Ronald’s point that the intermixing of the two problems is causing confusion here, but from my perspective those aren’t really the relevant aspects of the problem? If there is one?

Maybe I’ve also just gotten so used to getting GPT to do what I want I don’t encounter an “I cannot do X”, because in every possible known use case I’ve tried, once I understood the actual limitations of the model, I can get around that so intuitively I don’t notice it?

Then again, idk how I’m able to do some of the stuff I’m able to do, and that interests me way more. A friend of mine who I did CTFs with before said he struggled to get ChatGPT to help him hack, but I’ve never had issues with it helping me in any of my own fuckery? Like apparently it would deny his requests and say it can’t help him, but every time I’ve tried asking it to do the same things, it never denies my request. So half the time, I feel like any reluctance is based on the language in the prompting, but that isn’t exclusive to this problem. Far from it tbh, considering how many posts I’ve seen on this forum at this point.

So, maybe to sum it all up, I feel like there are better solutions that can solve a broader spectrum of problems all at the same time, and defining these solutions to these issues in this specific way only makes it more confusing and limiting. Does that make sense?

It’s like giving a search function as a fat list of if/else statements instead of writing a search algorithm that works better in more use cases with only a few lines of code.


I cannot and should not speak for Ronald, but I noted that his reply appeared relevant to your observations about how humans prime recall in a somewhat similar fashion to LLMs.
The paper is relevant from my perspective because it is another previously unnamed technique in the skill set which can now be cited for future reference.


You know, I’ve been thinking about your reply to this lately. Mostly about why you find it relevant. And I think I’ve finally come up with the words for my thoughts lol.

I feel like I find the research relevant (albeit less than I think a lot of people), but for different reasons.

See, you, and a lot of people I’m noticing in this space, discuss research like this as if it adds to a skill set, and it takes off as if it’s this “new” technique. To be clear, that’s totally okay, and I’m not arguing or invalidating that by any means!

However, what’s really interesting to me is that where people are treating these phenomena as “documented techniques,” I simply see “more alignment to well-documented psychological and linguistic phenomena”. And maybe that’s what bugs me about the research being done and the way we talk about it. What’s interesting and relevant here in my honest opinion isn’t the technique, but its continued correlation and resemblance to actual, real, well-known things in psychology and in discourse analysis / semantics (linguistics stuff). If we just looked over the fence a little bit, we would begin to see some really fascinating overlap. This keeps happening as more AI research like this comes out.

So while people look at this research and go “oh, that’s cool”, I’m over here going “Yet another phenomenon we can add to the list that sounds an awful lot like research x in field y”. Considering these discoveries are coming from models that already exist, it tells me there’s so much more we might not understand and might be overlooking.

That’s all I wanted to say lol. It’s not bad research, and I’m not poo-pooing it. Maybe I just want more people in the AI space to look at things more broadly than seeing all of this as simply techniques to a toolbox. Because if we later discover a lot of these techniques can apply to human interaction, or be extrapolated easily for human interaction, to what degree will people use that for manipulation?

This research for example is relatively innocuous for that, but it’s like, for all this talk about “safety” hoo-ha, the real dangers are staring us right in the face, and we keep getting distracted by “can it tell us how to build a bomb?” and it’s driving me nuts.


Thanks for taking the time to write all of this down. I really had the intention to write more about model behavior, retrieval, and so on, but ultimately chose to stay on topic. That’s why I summed up the paper in a single sentence.

While I have to get back to work, this may be of interest to you: