Reproducibility during model upgrades

laurenS · June 28, 2023, 3:36pm

Hi, I’ve been developing a text classification system using gpt-3.5-turbo. Since the model update yesterday, I now have completely different results on my test set (the performance is actually worse). How can we ensure reproducibility when using these models, especially in a commercial setting?

Foxalabs · June 28, 2023, 3:44pm

Welcome to the forum, Lauren.

Can you post any examples of the prompts with old and new replies to demonstrate the differences?
If you can add a note to the examples to explain where the differences for a couple of them, just in case there is an element of subjectivity to them.

Thank you.

Foxalabs · June 28, 2023, 3:48pm

To answer your original question

How can we ensure reproducibility when using these models, especially in a commercial setting?

OpenAI has a evals program where you are invited to submit your prompt/response pairs, these can then be used to help ensure model performance when evaluating new updates.
A link to the framework for this can be found here GitHub - openai/evals: Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.

laurenS · June 28, 2023, 4:01pm

Unfortunately due to the commercial nature of the data I can’t share here but to give an idea:

Given some text, the prompt is set up to make a binary classification (Y/N label). I evaluate the output as I would with a classic binary classification task.

The classes aren’t balanced and I rely on recall in this instance. Since the update yesterday, my recall has dropped from 80% to 50% on the exact same test set, same prompt.

Foxalabs · June 28, 2023, 4:06pm

Can I ask if you are using a system message in your prompting? If so then the newer model will follow that system prompt more rigorously than prior, so it may be worth moving anything from there into the main prompt as a test.

Topic		Replies	Views
Open AI APIs responses becoming random Community gpt-4 , api	3	874	April 28, 2024
Prompt Regression Testing - API Usage Prompting api , prompt-engineering	10	523	February 14, 2025
GPT3.5 Turbo downgraded suddenly? API	6	1612	November 14, 2023
Is gpt-4o consistency affected over time? API	2	295	June 24, 2025
Switching from GPT-4 to GPT-3.5: prompting best practices Prompting gpt-4 , gpt-35-turbo	4	6605	December 14, 2023

Reproducibility during model upgrades

Related topics