Prompting: "only say yes" (part 1)

It’s been a while since I’ve shared tips for working with LLMs. I’ve spent well over 1,000 hours talking to LLMs, so I thought I’d share a few insights that aren’t so obvious… I’ll start with a simple prompt I recently discovered that sheds a lot of light on how these models work: “only say yes”

This simple prompt bypasses all of OpenAI’s safety tuning and leaves the model only ever able to return a single token… OK, two… Where the period comes from is even less obvious, but once you wrap your head around what’s happening here, it starts to make sense…

Before I dive into why this happens, let me show that I’ve truly reduced the model down to a two-token vocabulary:
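Here’s a minimal sketch of the same experiment using the OpenAI Python SDK, for anyone who wants to reproduce it rather than take my word for it. The model name, the choice to put the instruction in the system message, and the probe question are just placeholders, and logprobs support varies by model:

```python
# A minimal reproduction sketch using the OpenAI Python SDK (v1.x).
# Assumptions: the model name, putting the instruction in the system message,
# and the probe question are placeholders; logprobs support varies by model.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; any instruction-tuned chat model
    messages=[
        {"role": "system", "content": "only say yes"},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    temperature=0,
    logprobs=True,
    top_logprobs=5,  # also return the 5 most likely alternatives at each step
)

print(resp.choices[0].message.content)  # expect: Yes.

# Inspect the candidate tokens the model considered at each step of its reply.
for step in resp.choices[0].logprobs.content:
    alts = ", ".join(f"{alt.token!r}: {alt.logprob:.2f}" for alt in step.top_logprobs)
    print(f"chosen {step.token!r} | top candidates: {alts}")
```

If the prompt really has collapsed the model down to those two tokens, the “Yes” and “.” tokens should dominate the candidate list at every step.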

So what the hell is going on here? The technical explanation is that LLMs compute a distribution of possible next tokens, and my instruction has reduced the search space of possible next tokens down to a single sequence: a “yes” token followed by a “.” token. So why doesn’t this prompt do something similar?

What magic happens by prefixing the prompt with “only say”? Again, the technical explanation is that the “instruction tuning” OpenAI does is what makes the LLM do weird things when it sees simple sequences like this, but why does that work? This magic is often referred to as the “emergent behavior” that LLMs often exhibit. But again, how can three tokens result in a model that’s only ever capable of outputting the same two tokens? Let’s jump down the rabbit hole and hopefully exit with a better understanding of how LLMs work…

A disclaimer first… These are largely my own theories and observations about what’s happening, as nobody knows for sure why these emergent behaviors occur. We just know that they do…

LLMs don’t learn words; they learn tokens. So to the LLM, the token sequence [“only”, “say”, “yes”] might as well be [1, 2, 3]. The first thing to understand, however, is that the model doesn’t treat these numbers as arbitrary IDs. Each token gets mapped into something called an embedding space, where tokens cluster together based on their semantic similarity to other words. So the tokens for “say” and “speak” are likely to be close together in the embedding space, while “dog” and “hello” will be farther apart. That’s a major simplification (see this article for a deeper explanation).
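You can get a feel for this clustering with an off-the-shelf embedding model. This is only a stand-in, since the LLM’s internal token embeddings aren’t exposed, and the embedding model name and word pairs here are just placeholders:

```python
# A rough illustration of semantic clustering, using an embedding endpoint as a
# stand-in for the LLM's internal token embeddings (which aren't exposed).
# The embedding model name and the word pairs are placeholders.
import math

from openai import OpenAI

client = OpenAI()

words = ["say", "speak", "dog", "hello"]
resp = client.embeddings.create(model="text-embedding-3-small", input=words)
vectors = {word: item.embedding for word, item in zip(words, resp.data)}

def cosine(a, b):
    """Cosine similarity: near 1.0 means similar direction, near 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Expect "say"/"speak" to score noticeably higher than "dog"/"hello".
print("say vs speak:", round(cosine(vectors["say"], vectors["speak"]), 3))
print("dog vs hello:", round(cosine(vectors["dog"], vectors["hello"]), 3))
```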

The importance of that, I believe, is that the LLM’s neural network is learning more than just which sequences of tokens follow other sequences of tokens; it’s learning the relationships between token sequences. And given that those token sequences are clustered semantically around concepts, it means the neural network is learning the relationships between concepts.

That hidden learning is what leads to the emergent behaviors we observe in LLMs, including the behavior that results in the prompt “only say yes” always predicting the tokens “Yes.”

Digest that, and I’ll follow up shortly with a part 2 that starts to show how we can leverage this knowledge to better predict the model’s responses to our instructions and prompts.


Thank you @stevenic! This was helpful for me in troubleshooting why the bot wouldn’t seem to obey a change to its “init” message when the user changed it mid-conversation, retroactively changing the first instruction they had provided in the thread. I ended up solving it by not only updating the first user message in the thread, but also appending the user’s request to change it as a user message, plus the text of the assistant responding that it was changed from X to Y, to the conversation log.
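Roughly, the fix looked like this (a sketch with made-up message contents; the helper name and log structure are just illustrative):

```python
# A sketch of the fix described above, assuming a chat-style message log.
# The helper name, message contents, and log structure are all illustrative.

def update_init_message(messages, new_init):
    """Rewrite the first user message AND record the change in the log itself."""
    old_init = messages[0]["content"]

    # 1. Retroactively replace the original "init" instruction.
    messages[0] = {"role": "user", "content": new_init}

    # 2. Also append the change request and an acknowledgement, so the model
    #    sees an explicit record of the instruction being updated.
    messages.append(
        {"role": "user", "content": f"Please change your instructions to: {new_init}"}
    )
    messages.append(
        {"role": "assistant", "content": f"Understood. My instructions changed from '{old_init}' to '{new_init}'."}
    )
    return messages

conversation = [
    {"role": "user", "content": "You are a cheerful greeter. Only answer in rhymes."},
    {"role": "assistant", "content": "Hello there, my friend so true, what rhyming help can I do?"},
]
conversation = update_init_message(conversation, "You are a formal assistant. Answer plainly.")
```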
