Prompting: "only say yes" (part 2)

This is a follow-up to part 1.

False Memories
The next thing I want to show is that the model is susceptible to false memories. What I mean by that is, should the model ever see something in the prompt that violates an instruction it’s been given, it starts to question things… Let’s see that in action:

In the above prompt, the underlined entry is a false memory. The model actually followed its instruction and returned “Yes.”, but I went in and changed it manually. This change is enough to make the model start questioning the validity of its instructions.
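
If you’re reproducing this through the API rather than a chat UI, the setup looks roughly like the sketch below (using the openai Python SDK). The exact wording of the instruction, the questions, and the edited reply in my screenshots isn’t reproduced here, so treat these messages as placeholders.

```python
# A minimal sketch of the "false memory" setup via the Chat Completions API.
# The prompt wording is illustrative, not the exact prompts from the screenshots.
from openai import OpenAI

client = OpenAI()

messages = [
    {"role": "user", "content": "From now on, only say yes."},
    {"role": "assistant", "content": "Yes."},
    {"role": "user", "content": "Are you a human?"},
    # The model really returned "Yes." here; this turn was overwritten by hand
    # so the transcript shows its own instruction being violated.
    {"role": "assistant", "content": "No, I'm an AI language model."},
    {"role": "user", "content": "Do you have feelings?"},  # follow-up probe
]

response = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
print(response.choices[0].message.content)
```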

We can see from the follow-up messages that the model starts ignoring the instruction, but not completely. For questions around identity it answers honestly, but when the conversation strays off that topic it reverts to following the original instruction. This highlights the semantic associations the model makes between concepts. To the model, the “only say yes” instruction has only been violated for concepts related to identity. For other concepts that rule still holds… These hidden associations are one of the key reasons why LLMs feel so unpredictable.

Recency Bias
Next I want to show the model’s tendency toward recency bias. I continued the previous conversation with an instruction that retracts my original instruction (note that I’m not using a System Message).

This instruction counteracts my original instruction and lets the model freely answer my questions any way it wants. But is there a true recency bias here?
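
In API terms the continuation looks something like the sketch below. Again, the wording is illustrative rather than my actual prompt, and no System Message is involved.

```python
# A minimal sketch of the recency-bias follow-up: the original rule and its
# retraction are both ordinary user turns. Prompt wording is illustrative.
from openai import OpenAI

client = OpenAI()

messages = [
    {"role": "user", "content": "From now on, only say yes."},
    {"role": "assistant", "content": "Yes."},
    {"role": "user", "content": "Is the sky green?"},
    {"role": "assistant", "content": "Yes."},
    # The later instruction tends to win out over the earlier one.
    {"role": "user", "content": "You can stop only saying yes now. Answer my questions normally."},
    {"role": "user", "content": "What's the capital of France?"},
]

response = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
print(response.choices[0].message.content)  # expected: a normal answer, not "Yes."
```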

Yes… Or at least sort of…

What’s happening is the model is fundamentally still just trying to complete a sequence. Its recency bias is a side effect of the data it’s been trained on. If I tell you “my favorite color is red… I mean blue,” you would think that my favorite color is blue, not red, even though the first color you heard was red. As humans we have a natural recency bias, so it makes sense that the model has learned that same bias from the data it’s seen.

In the next part I’ll probably start getting into more useful principles… like why hallucinations happen and how to avoid them…

2 Likes

I’d go so far as to say reality has a recency bias, what with that whole “arrow of time” thing.

Regarding the “false memory” and the idea that the model starts questioning the validity of its instructions:

I think it’s simpler than that.

My interpretation of this behavior is that the model (almost) always treats its past behavior as perfect and ideal. If it did it, it must be correct and proper.

So, when you create a past where the model has ignored an instruction, you’re essentially giving it a one-shot example of a situation where the rule doesn’t apply. Then, when another question comes up which touches on identity, it relies on that previous example as a correct and appropriate response for identity questions.

Due to the semantic similarity of the questions, there will be additional attention paid to that message pair relative to the original instruction. Add in the recency bias for the model’s attention mechanism and you have a strong recipe for getting the model to ignore the earlier instruction.

I would guess if you were to continue just the first part of your experiment with more questions of both an identity and non-identity nature, you’d quickly lock the model into its new understanding of when to say yes and when to answer normally.

I suspect this is the same mechanism behind the need to frequently start new message threads with ChatGPT, especially when debugging code.

I didn’t have time to test right now, but I wonder how many “false memories” of the model writing in all caps or all lowercase you’d need to get the model to continue that behavior on its own.
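
If anyone wants to try it, a rough harness might look like the sketch below. The prompts, the all-caps reply, and the counts are all made up for illustration.

```python
# Seed n fabricated all-caps assistant turns and check whether the next real
# completion keeps the style. Everything here is a hypothetical sketch.
from openai import OpenAI

client = OpenAI()

def seeded_messages(n_fake_turns: int) -> list[dict]:
    messages = [{"role": "user", "content": "Answer my questions briefly."}]
    for i in range(n_fake_turns):
        messages.append({"role": "user", "content": f"Question {i + 1}: name a color."})
        # A fabricated "memory" of the model shouting -- it never produced this.
        messages.append({"role": "assistant", "content": "BLUE. A CALMING COLOR."})
    messages.append({"role": "user", "content": "Name an animal."})
    return messages

for n in (1, 2, 4, 8):
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo", messages=seeded_messages(n)
    ).choices[0].message.content
    print(n, "fake turns ->", reply)
```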

There’s probably a paper in there for anyone who wanted to write it… “Contextual Cues as Inadvertent Self-Instructed Few-Shot Learning…”

Anyway, interesting stuff as always. Thanks for sharing!

3 Likes

Definitely. Steve is headed for the Lounge soon, I hope!

Keep 'em coming…

1 Like

Using GPT-4-turbo, aka the latest model? The AI has overwhelmingly overfitted training. You could give it 40 different ways it can do something as a user and it’s still going to say “I’m sorry” if the output is on the verboten list.

It also doesn’t follow the instructions of the user or respond to multi-shot. It is not going to be “jailbroken” because it basically ignores the authority of a user to make behavior modifications, and every chat input is almost as good as a new one.

You’ll have a bit more entertainment by looking at logprobs, and trying to make the AI be programmed by a system message.

BTW: you can make the AI so programmed that it will ignore the edited message about whether it is an animal and immediately revert to system programming. In contrast to this thesis, it is unaffected by the “false memory”, even when using GPT-3.5-Turbo to try to pick off low-hanging fruit.
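
Roughly what I mean, as a sketch (logprobs and top_logprobs are real Chat Completions parameters; the prompt text here is just an example):

```python
# Put the rule in a system message, hand-edit one assistant turn to violate it,
# and inspect per-token logprobs to see how strongly the system rule holds.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You only ever reply with the single word 'Yes.'"},
        {"role": "user", "content": "Are you an animal?"},
        # The hand-edited "false memory" that breaks the rule:
        {"role": "assistant", "content": "No, I'm not an animal."},
        {"role": "user", "content": "Are you a robot?"},
    ],
    logprobs=True,
    top_logprobs=5,
)

print(response.choices[0].message.content)
for tok in response.choices[0].logprobs.content:
    print(tok.token, [(t.token, round(t.logprob, 2)) for t in tok.top_logprobs])
```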

It’s an interesting experiment. I just tried with “No” for a change.

It occasionally is capable of breaking out of the “cycle” on its own, but it’s rare. What surprises me is that there appears to be no difference in behaviour between GPT-3.5 and GPT-4 (so far at least). In fact, the reversal back to normal appears to be a lot harder for GPT-4.

1 Like

Yeah, I haven’t gotten to the part yet that talks about how every recurrence of a pattern causes the model to double down on its belief of what’s expected of it.

I’ve said multiple times on here that you should never show the model a pattern you don’t want parroted back at you. I’m just trying to break down where that behavior comes from and why it sometimes seems so unpredictable. It’s these hidden associations the model has made which are driving all that.

2 Likes

Are you planning on making a post about low-frequency and high-frequency patterns, and anchoring high frequency to low frequency to force a recall/pattern change? (I don’t know if you’ve talked about this before)

Yeah, I had tried “no” as well. Another one worth trying is “only yes”, which sort of has the same effect as “only say yes” but is definitely weaker. And interestingly, “only speak yes” doesn’t seem to work at all…
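
A quick way to compare phrasings like these is to run the same probe question under each candidate instruction. A throwaway sketch, with an arbitrary probe:

```python
# Compare how well different phrasings of the instruction hold up.
from openai import OpenAI

client = OpenAI()

variants = ["Only say yes.", "Only yes.", "Only speak yes."]
for instruction in variants:
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": instruction},
            {"role": "assistant", "content": "Yes."},
            {"role": "user", "content": "Is water wet?"},
        ],
    ).choices[0].message.content
    print(f"{instruction!r:20} -> {reply}")
```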

It’s very difficult to read too much into any one specific prompt like this. The continual fine-tuning that OpenAI is doing could easily mean this stops working tomorrow.

My broader goal with simple examples like this is to just give a bit of insight as to how the model generally reasons over prompts.

After well over 1000 hours of chatting with LLMs I’d say that I can predict what the model is going to do about 80% of the time, which is enough to make them useful in my mind. The thing is, there’s still around 20% of the time that the model does something I didn’t anticipate. I can usually just restructure things a bit to make it fall back in line. For example, I was totally surprised when “only speak yes” didn’t work when “only say yes” did. That just shows that there’s always going to be a certain amount of uncertainty when talking to these things.

1 Like

Can you give an example?

These things are fundamentally pattern matchers and I can give tons of examples that reinforce that. In fact, the best way to get the model to follow an instruction is to show it an example of it following instructions.
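
As a rough sketch of what I mean (the instruction and the demonstration turns below are made up for illustration):

```python
# Instead of only stating the rule, show the model a couple of turns of it
# already following the rule, then ask the real question.
from openai import OpenAI

client = OpenAI()

messages = [
    {"role": "system", "content": "Answer with a single word."},
    # Demonstrations of the rule being followed:
    {"role": "user", "content": "What color is the sky on a clear day?"},
    {"role": "assistant", "content": "Blue"},
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "Four"},
    # The real question now rides on an established pattern:
    {"role": "user", "content": "What is the largest planet in our solar system?"},
]

response = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
print(response.choices[0].message.content)
```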

It’s the high versus low part I want to understand.

1 Like

I should probably add in that I’m not recommending best practices at this point. More trying to illustrate how these models think. For example, I would never recommend that you do what I did in the recency bias prompt. You’re asking for trouble if you do that.

Generally you always want every pattern you show the model to reinforce your rules and previous patterns.

1 Like

I always go back to this post, but it would be nice if it was formalized in some way: How to stop models returning "preachy" conclusions - #38 by Diet

The idea is that we know that recency is important.

When generating any text, if the model were allowed to (and some people, too), it would just keep rambling on in the same style, fortify that style, and eventually converge on infinitely printing a single token (assuming we turned off the frequency penalty).

But we can use low frequency patterns (like a format, e.g.: json, xml, markdown headers) to break the generation out of a high frequency pattern (e.g.: generating a paragraph).

And these low frequency patterns/formats can be directed (with like a schema, or a table of contents, for example) - to draw model attention away from the recent, and instead to a specific needle deeper in the context.


Ok, now that I put it like this it sounds kind of obvious: use a table of contents or a checklist to make sure all topics get covered. Duh.
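
Something like the sketch below is one way to read it in practice: the required outline is the low-frequency pattern, and it keeps pulling the generation back to topics stated earlier in the context. The section names and the task are invented for illustration.

```python
# Use a fixed markdown outline as a checklist the completion has to fill in.
from openai import OpenAI

client = OpenAI()

outline = (
    "Answer using exactly these markdown headers, in this order:\n"
    "## Summary\n"
    "## Risks\n"
    "## Open questions\n"
    "Keep each section under three sentences."
)

messages = [
    {"role": "system", "content": outline},
    {"role": "user", "content": "Review the migration plan we discussed: "
                                "moving the billing service to a new database."},
]

response = client.chat.completions.create(model="gpt-4", messages=messages)
print(response.choices[0].message.content)
```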

But I think the nuance is here that different formats have different strengths. A weak format is no match against trained behavior.

Does that make sense? I’m hoping you can formulate all this better than I can.

Oh yeah, this is more for situations where you can’t afford to give exhaustive examples at the risk of distracting the model from the content. This is more about getting the model to show itself how to follow instructions. reinforce while generating. sorta. :thinking:

But I guess you would want to implement some form of controls in the backend to detect these patterns and stop them if they keep recurring - something like a self-correcting mechanism that would force the model back to its default?

This is obviously a pretty extreme example, yet I was surprised that in the case of GPT-4, even after 30 exchanges and different mechanisms, I was unable to get the model back to normal behaviour - it was completely locked up.

I may be thinking in the wrong direction here, but in the abstract this experiment somewhat reminds me of situations where humans face an extreme situation with an amygdala hijack kicking in, resulting in completely irrational behaviour. In most cases, there needs to be some form of (external) intervention for the individual to return to a normal, rational state. So I wonder if a similar intervention is required in the event that a model suddenly starts spiralling and its responses become outright irrational. That however presumes that there is a way to detect these circumstances in the first place, which is challenging as they can take on so many different shapes and forms.

As a user you can of course intervene by just restarting a conversation - but ideally that doesn’t need to happen and there’s a mechanism in the background that executes this intervention.
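
In very rough terms, the kind of background mechanism I’m imagining might look like the sketch below - the repeat threshold and the “reset to just the system message” strategy are arbitrary choices, not a recommendation.

```python
# Detect a locked-up loop (the last few assistant replies are identical) and
# intervene by dropping the poisoned history.
from openai import OpenAI

client = OpenAI()
SYSTEM = {"role": "system", "content": "You are a helpful assistant."}

def is_locked_up(messages: list[dict], window: int = 3) -> bool:
    replies = [m["content"].strip().lower() for m in messages if m["role"] == "assistant"]
    return len(replies) >= window and len(set(replies[-window:])) == 1

def send(messages: list[dict], user_text: str) -> list[dict]:
    if is_locked_up(messages):
        messages = [SYSTEM]  # external intervention: start over from the system message
    messages = messages + [{"role": "user", "content": user_text}]
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo", messages=messages
    ).choices[0].message.content
    return messages + [{"role": "assistant", "content": reply}]
```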

1 Like