This is a follow up to part-1.
False Memories
The next thing I want to show is that the model is susceptible to false memories. What I mean by that is, should the model ever see something in the prompt that violates an instruction it’s been given, it starts to question things… Let’s see that in action:
In the above prompt, the underlined entry is a false memory. The model actually followed it’s instruction and returned “Yes.” but I went in and changed it manually. This change is enough to make the model start questioning the validity of its instructions.
We can see from the follow-up messages that the model starts ignoring the instruction but not completely. For questions around identity it starts answering honestly but when the conversation strays off that topic it reverts back to following the original instruction. This is highlighting the semantic association the model makes between concepts. To the model, the “only say yes” instruction has only been violated for concepts related to identity. For other concepts that rule still holds… These hidden associations are one of the key reasons why LLM’s feel so unpredictable.
Recency Bias
Next I want to show the models tendency for recency bias. I continued the previous conversation with an instruction that retracts my original instruction (note that I’m not using a System Message.)
This instruction counteracted my original instruction and lets the model now freely answer my questions anyway it wants. But is there a true recency bias here?
Yes… Or at least sort of…
What’s happening is the model is fundamentally still just trying to complete a sequence. It’s recency bias is a side effect of the data it’s been trained on. If I tell you “my favorite color is red… i mean blue.” You would think that my favorite color is blue not red, even though the first color you heard was red. As humans we have a natural recency bias so it makes sense that the model has learned that same bias from the data it’s seen.
In the next part I’ll probably start getting into more useful principals… Like why hallucinations happen and how to avoid them…