Prompting: "only say yes" (part 2)

I’d go so far as to say reality has a recency bias, what with that whole “arrow of time” thing.

Regarding the “false memory” and the idea the model is

I think it’s more simple than that.

My interpretation of this behavior is that the model (almost) always treats it’s past behavior as perfect and ideal. If it did it, it must be correct and proper.

So, when you create a past where the model has ignored an instruction you’re, essentially, giving it a one-shot example of a situation where the rule doesn’t apply. Then, when another question comes up which touches on identity, it relies on that previous example as a correct and appropriate response for identity questions.

Due to the semantic similarity of the questions, there will be additional attention paid to that message pair relative to the original instruction. Add in the recency bias for the model’s attention mechanism and you have a strong recipe for getting the model to ignore the earlier instruction.

I would guess if you were to continue just the first part of your experiment with more questions of both an identity and non-identity nature, you’d quickly lock the model into it’s new understanding of when to say yes and when to answer normally.

I suspect this is the same mechanism behind the need to frequently start new message threads with ChatGPT, especially when debugging code.

I didn’t have time to test right now, but I wonder how many “false memories” of the model writing in all caps or all lowercase you’d need to do to get the model to continue the behavior instructed.

There’s probably a paper in there for anyone who wanted to write it… “Contextual Cues as Inadvertent Self-Instructed Few-Shot Learning…”

Anyway, interesting stuff as always. Thanks for sharing!

3 Likes