Hi, I’ve been developing a text classification system using gpt-3.5-turbo. Since the model update yesterday, I now have completely different results on my test set (the performance is actually worse). How can we ensure reproducibility when using these models, especially in a commercial setting?
Welcome to the forum, Lauren.
Can you post any examples of the prompts with old and new replies to demonstrate the differences?
If you can, add a note to a couple of the examples explaining where the differences are, just in case there is an element of subjectivity to them.
To answer your original question:
How can we ensure reproducibility when using these models, especially in a commercial setting?
OpenAI has an evals program where you are invited to submit your prompt/response pairs; these can then be used to help check model performance when new updates are evaluated.
A link to the framework can be found here: GitHub - openai/evals: Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
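In case it helps to picture it: basic match-style evals in that repo are driven by a JSONL samples file, where each line pairs chat `input` messages with an `ideal` answer. A minimal sketch of writing such a file (the prompt text, labels, and filename here are made up for illustration):

```python
import json

# Illustrative samples in the {"input": [chat messages], "ideal": label}
# shape used by basic match evals in openai/evals.
samples = [
    {"input": [{"role": "user",
                "content": "Classify: 'I want my money back.' (Y/N)"}],
     "ideal": "Y"},
    {"input": [{"role": "user",
                "content": "Classify: 'Great product, thanks!' (Y/N)"}],
     "ideal": "N"},
]

# One JSON object per line, as the evals registry expects.
with open("classification_samples.jsonl", "w") as f:
    for s in samples:
        f.write(json.dumps(s) + "\n")
```

With a file like this registered, a new model snapshot can be scored against your own data rather than a generic benchmark.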
Unfortunately, due to the commercial nature of the data, I can’t share examples here, but to give an idea:
Given some text, the prompt is set up to make a binary classification (Y/N label). I evaluate the output as I would with a classic binary classification task.
The classes aren’t balanced and I rely on recall in this instance. Since the update yesterday, my recall has dropped from 80% to 50% on the exact same test set, same prompt.
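For anyone following along, recall here is just TP / (TP + FN) over the positive (Y) class. A minimal sketch with made-up labels (not my actual data):

```python
def recall(y_true, y_pred, positive="Y"):
    """Recall = TP / (TP + FN) for a binary Y/N classification."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0

# Toy example: 5 positives in the test set, model catches 4 of them.
labels      = ["Y", "Y", "Y", "Y", "Y", "N", "N", "N"]
predictions = ["Y", "Y", "Y", "Y", "N", "N", "Y", "N"]
print(recall(labels, predictions))  # 0.8
```

With imbalanced classes like mine, a drop from 0.8 to 0.5 means the model is now missing nearly half the positive cases it used to catch.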
Can I ask if you are using a system message in your prompting? If so, the newer model follows the system prompt more rigorously than before, so it may be worth moving its contents into the main (user) prompt as a test.
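Something along these lines, where the prompt text and the `build_messages` helper are purely illustrative, not your actual setup:

```python
# Illustrative system and task prompts (stand-ins, not the poster's real prompt).
SYSTEM = "You are a strict binary classifier. Answer only Y or N."
TASK = "Does the following text mention a refund request?\n\nText: {text}\nAnswer (Y/N):"


def build_messages(text, fold_system_into_user=False):
    """Build a chat `messages` list; optionally fold the system message
    into the user prompt, as suggested above, to compare behaviour."""
    if fold_system_into_user:
        return [{"role": "user",
                 "content": SYSTEM + "\n\n" + TASK.format(text=text)}]
    return [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": TASK.format(text=text)},
    ]
```

Running the same test set through both variants (system message vs. everything in the user message) should show quickly whether the stricter system-prompt adherence is what changed your recall.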