We’re building a long-form conversational chat assistant. We built a 5000-token system prompt for GPT-4 and we’re now switching to GPT-3.5 for cost and speed purposes.
We see that the LLM responses become somewhat worse, but not critically. It follows instructions somewhat more loosely and hallucinates more often. We’re fixing the prompt, but I have an intuition that we might be missing important differences between how GPT-3.5 and GPT-4 interpret prompts.
What are the best practices for “downgrading” a large prompt from GPT-4 to GPT-3.5?
Welcome to the forum!
Make sure you use the system prompt for instructions.
Make sure you include examples in that prompt for behaviour you need to narrow down.
That’s basically it.
GPT-3.5 is not the best at following the system prompt, especially compared to Gpt-4. I would advise you test the prompt being in the system message vs as a chat to see which is giving better performance for your case.
Also, few-shot prompting personally has felt the best method of instructing the model to me and GPT-3 does indeed do quite well with it.
Other than that, the performance might change because you are going back a model and I would advise to thoroughly test the prompt you are using and tinker with it to make it the most suitable to your case
Thanks everyone! Will report back if we get any additional insights after the transition.