How are you approaching prompt regressions in production applications where context and model choice are held constant? A prompt has been performing well but suddenly regresses, disrupting downstream tool invocations. We are tracking these regressions since we have good logging in place; it happened once before, shortly after Structured Outputs was first released. Since then we have optimized the prompt, and it had been stable for the last 2-3 months until recently. Technically, how do you guarantee idempotent behavior from prompts?
Hi @kavitatipnis!
Out of curiosity, are you using an explicitly dated model checkpoint, e.g. gpt-4o-2024-11-20, or an alias like gpt-4o?
Personally, when I see my prompts produce stable results on a rather long run (a couple of months as you’ve mentioned), I seriously consider fine-tuning for 2 reasons:
- “Freeze” current quality of results.
- Improve the quality while aiming for #1 by selecting “best” outputs (when I have a choice of outputs for same inputs) or manually editing the sub-optimal outputs (the failing edge cases).
#2 is pretty much the only way I see currently to improve the quality of otherwise stable results, as fine-tuning at this stage is way more straightforward for me than wrestling with prompt engineering.
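Roughly what I mean, as a sketch only (the curation step, file name and example content here are placeholders, not an actual pipeline):

```python
# Sketch: turn logged "best" outputs into a fine-tuning dataset and start a job.
# Assumes you already have curated (input, preferred output) pairs from your logs.
import json
from openai import OpenAI

client = OpenAI()

# Hypothetical curated examples: the production prompt plus the output you want
# to "freeze" (or the manually corrected version of a failing edge case).
curated = [
    {
        "messages": [
            {"role": "system", "content": "…your production system prompt…"},
            {"role": "user", "content": "…production input…"},
            {"role": "assistant", "content": "…best or hand-corrected output…"},
        ]
    },
    # ...more examples
]

with open("train.jsonl", "w") as f:
    for example in curated:
        f.write(json.dumps(example) + "\n")

training_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-2024-08-06",  # fine-tune against a dated base model
)
print(job.id)
```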
What do you guys think?
An alias, because I am assuming the alias references the latest model checkpoint. I wonder now if I should update our dev environment to dated model checkpoints for evals.
Yes, you should use dated models if you want idempotency. Aliases get dereferenced, and not necessarily to the latest model. For example, gpt-4o points to gpt-4o-2024-08-06, while the latest checkpoint is actually gpt-4o-2024-11-20.
So in short, to have any chance of idempotency for a prompt and data that is held constant, you should use specific (dated) models.
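For example, something like this (a minimal sketch with placeholder prompt content; the seed gives best-effort reproducibility, not a hard guarantee):

```python
# Sketch: pin a dated checkpoint (plus temperature 0 and a seed) so repeated
# runs are as reproducible as the platform allows.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-2024-11-20",  # dated checkpoint, not the "gpt-4o" alias
    messages=[
        {"role": "system", "content": "…your stable system prompt…"},
        {"role": "user", "content": "…production input…"},
    ],
    temperature=0,  # reduce sampling variance
    seed=42,        # best-effort determinism; watch system_fingerprint for backend changes
)
print(response.system_fingerprint)
print(response.choices[0].message.content)
```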
- Freezing of results: I am researching Predicted Outputs for our use case. I feel that “Structured Outputs”, which are essentially system prompts, also guarantee this to some degree.
- I understand why fine-tuning can be a straightforward way; however, I feel it may be overkill, and too expensive, for our use case just yet. I can see it as a must-have for coding, scientific research, or other use cases where there is a constant stream of “new or synthetic” data, i.e. more frequent data updates.
https://platform.openai.com/docs/guides/optimizing-llm-accuracy was useful, and I see that it has been updated; I like the separation of context and behavior.
Going to try this on our dev environments and will measure the results.
The prediction field only serves to possibly increase the language generation speed (and expense). It does not alter the generation.
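For reference, a minimal sketch of how the prediction field is passed (placeholder content, not your actual use case):

```python
# Sketch: Predicted Outputs. The `prediction` content is text the model is
# expected to largely reproduce; it can speed up generation but does not
# constrain or change what the model decides to output.
from openai import OpenAI

client = OpenAI()

existing_code = "…previous version of the file…"  # placeholder

response = client.chat.completions.create(
    model="gpt-4o-2024-11-20",
    messages=[
        {"role": "user", "content": "Rename the function foo to bar and return the full file."},
        {"role": "user", "content": existing_code},
    ],
    prediction={"type": "content", "content": existing_code},
)
print(response.choices[0].message.content)
```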
Prompts and system messages should be robust: the task should be able to run on any OpenAI model with a general indication of success, even stepping down to gpt-4o-mini to see results. If you are at the edge of the model’s understanding and abilities, a “stealth” update of an existing model name, even the dated version, can end up breaking your application overnight.
If you have a particular API case that must not break and is described well, you can “store” chat completions runs of it with the store parameter, and then submit them to “evals” as part of tests that OpenAI themselves might run on pre-release model updates.
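A sketch of what storing such a run could look like (the metadata tags are just illustrative labels):

```python
# Sketch: persist a production run with `store` so it can later be pulled into
# evals as a regression test against new model snapshots.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-2024-11-20",
    messages=[
        {"role": "system", "content": "…your stable system prompt…"},
        {"role": "user", "content": "…the API case that must not break…"},
    ],
    store=True,  # make this completion available in the dashboard / evals
    metadata={"app": "tool-router", "prompt_version": "v7"},  # hypothetical tags
)
```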
Thank you for clarifying prediction and its correlation to generation speed. Can you clarify what you mean by “edge of the model’s understanding and abilities”?
Avoid running on-the-edge. Meaning: you might write an application that relies on complex decision-making, via extensive conditional instructions and added knowledge, to produce a particular desired output, perhaps one going to an API that cannot fail. It might be the type of complex application where you can see that multiple AI processing turns could do individual steps with focus, but you hacked it all together and made a single API call work.
It might be something where gpt-4 (original) can navigate its way through the logic and produce the result 99% of the time depending on inputs, following procedures. Then that “working application” might only go down a bit in quality when you seek a reduction in costs by using a particular version of gpt-4o. Perhaps you found the one gpt-4o version (of three) that can still do it with pretty good success.
By using the minimum AI you can get away with, you are putting yourself in the crosshairs of future model updates. OpenAI has not promised that “versioned” models are “stable” models, and we have seen many application-breaking changes to AI abilities on instruction-prompted operations from new AI tunings that are not transparent.
So it is better to make the task work with the minimum, cheapest AI, via procedural or multi-step preprocessing towards a solution and robust task descriptions; a solution even portable to different AI providers. Then, in production, use the bulletproof model that gives you constant success.
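A rough sketch of that decomposition idea (function names, labels, and prompts below are hypothetical, not a prescription):

```python
# Sketch: keep decisions in code and give the model one narrow, well-described
# task per call, instead of one giant conditional prompt.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini-2024-07-18"  # cheap, dated model; the task should survive it

def classify_request(text: str) -> str:
    """Step 1: a small, focused classification call."""
    response = client.chat.completions.create(
        model=MODEL,
        temperature=0,
        messages=[
            {"role": "system", "content": "Classify the request as one of: refund, shipping, other. Reply with the label only."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content.strip().lower()

def handle(text: str) -> str:
    """Deterministic routing around the model: code decides, the model fills narrow gaps."""
    label = classify_request(text)
    if label == "refund":
        return draft_refund_reply(text)      # hypothetical: another focused call
    if label == "shipping":
        return lookup_shipping_status(text)  # hypothetical deterministic lookup
    return escalate_to_human(text)           # hypothetical fallback
```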
@_j Yes, that has been our strategy: we have modularized all aspects, we have control over the AI inputs (prompt + context), and we expect a certain set of outputs. I think what’s debatable is this notion of “constant success”, since the majority of the applications that are going to be rewritten will need a combination of deterministic + probabilistic capabilities. I see OpenAI’s latest releases and research moving in this direction; until then, as @platypus and @sergeliatko suggested, we optimize with what we have: versioned models, effective prompts (synonymous with task descriptions), context optimization (RAG techniques), and fine-tuning. For now, we have modified the prompt in our case and are creating evals (OpenAI’s Evals) in the Playground to test the expected output given this incremental prompt change.