Extra spaces in the system prompt seem to significantly affect output

Hello, I’ve come across a very curious case where my prompt performed significantly worse in production than in eval/backtest, and the only difference seems to be that the eval script had some extra leading spaces in the system prompt while prod did not.

This makes no logical sense to me. I just tested locally again, with and without stripping the leading spaces from the system prompt. It’s a text classification task, and the difference is >95% accuracy (with the spaces) vs. only picking the correct category about 1/3 of the time (without).

Is the Python client doing something weird? Admittedly I’m still using openai-0.28.0; any known issues?

import openai  # openai-0.28.0 client

messages = [
    {
        "role": "system",
        "content": system_message.strip()  # with or without .strip() makes a huge difference
    },
    {
        "role": "user",
        "content": user_message
    },
]

for i in range(40):
    raw_response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-1106",
        messages=messages,
        max_tokens=512,
        timeout=1,
        temperature=0,
        top_p=0.01,
        frequency_penalty=None,
        presence_penalty=None,
    )
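For reference, here is a minimal sketch of the local comparison I’m describing (the prompt body and inputs below are placeholders, not my real data):

import openai

# Placeholder prompt body and inputs -- not the real classification prompt.
system_message_body = "You are an assistant that classifies the user's text into one category."
sample_texts = ["example input 1", "example input 2"]

def classify(system_message, user_message):
    """Single classification call; returns the raw model text."""
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-1106",
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message},
        ],
        max_tokens=512,
        temperature=0,
        top_p=0.01,
    )
    return resp["choices"][0]["message"]["content"]

# Same prompt body: with the leading spaces the eval script had vs. stripped as in prod.
with_spaces = "   " + system_message_body
stripped = system_message_body.strip()

for text in sample_texts:
    print(classify(with_spaces, text), "|", classify(stripped, text))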

I suspect it’s all about deviating from training.

A single leading space may be semantically useful in some cases, since it yields a more common token, " banana" (rank 44196), rather than "banana" (rank 88847).
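You can check this with the tiktoken package (gpt-3.5-turbo uses the cl100k_base encoding); the exact IDs are whatever the tokenizer reports:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")  # cl100k_base

# A leading space is folded into the word itself, so " banana" and "banana"
# map to different token IDs.
print(enc.encode("banana"))    # ID(s) for "banana"
print(enc.encode(" banana"))   # different ID(s) for " banana"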

Too many leading spaces, though, and you depart significantly from the millions of training examples that begin like “You are ChatGPT”.

You could also be performing a kind of in-context training with that, showing the AI that messages should start with spaces and displacing the certainty of other tokens.
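Again with tiktoken, a longer run of leading spaces typically tokenizes as separate whitespace tokens in front of the familiar text, so the prompt no longer starts the way the training examples did:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

# No leading space, one leading space, and a run of leading spaces
# all tokenize differently from the very first token.
for prompt in ["You are ChatGPT", " You are ChatGPT", "      You are ChatGPT"]:
    ids = enc.encode(prompt)
    print(len(ids), [enc.decode([i]) for i in ids])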

Trials

gpt-3.5-turbo-1106 at repeatable top-p:

“I am an AI digital assistant designed to provide helpful and accurate information, answer questions, and assist with various tasks to the best of my ability. My purpose is to support and assist users in a wide range of topics and inquiries.”

…with leading spaces:

“I am an AI digital assistant designed to provide helpful and accurate information, answer questions, and assist with a wide range of topics and tasks. My purpose is to support and enhance human productivity and decision-making by providing valuable insights and guidance.”

Variation on identity

“I’m an AI digital assistant designed to provide helpful and informative responses to a wide range of questions and topics. Whether it’s answering questions, providing explanations, or engaging in casual conversation, I’m here to assist and support you in any way I can.”

…with more leading spaces:

“I’m an AI digital assistant designed to provide helpful and friendly conversation, answer questions, and assist with a wide range of topics. How can I help you today?”

Conclusion

It’s the model, not the library (which isn’t even used here).

Thanks, I see what you mean. I also didn’t know that leading spaces yield different tokens; good to know. It’s still surprising that it caused this much divergence in my results, though. I’m not thrilled about needing to test variations of the prompt with/without leading spaces :sweat_smile:

That said, one other detail I forgot to mention: in the playground I can almost always get consistent, correct output regardless of whether there are leading spaces or not. That’s what made me think there was something weird with the client and not the model itself. Is there more to it?

And FYI, my system prompt is something like "We are xx, a yyy service. You are an assistant … "

I would follow the form often demonstrated, like

“You are xx bot, an online web assistant for xx company, providing yyy service…”
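Something like this, as a sketch (the names and service description are placeholders):

messages = [
    {
        "role": "system",
        "content": (
            "You are xx bot, an online web assistant for xx company, providing yyy service. "
            "Classify the user's message into exactly one of the given categories."
        ),
    },
    {"role": "user", "content": user_message},
]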