How are you approaching prompt regressions in production applications where context and model choice are held constant? A prompt has been performing well but suddenly regresses, disrupting downstream tool invocations. We are tracking these regressions since we have good logging in place; it happened once before, shortly after Structured Outputs was first released. Since then we have optimized the prompt, and it had been stable for the last 2-3 months until recently. Technically, how do you guarantee idempotent behavior from prompts?
Hi @kavitatipnis!
Out of curiosity, are you using an explicitly dated model checkpoint, e.g. gpt-4o-2024-11-20, or an alias like gpt-4o?
Personally, when I see my prompts produce stable results on a rather long run (a couple of months as you’ve mentioned), I seriously consider fine-tuning for 2 reasons:
- “Freeze” current quality of results.
- Improve the quality while aiming for #1 by selecting “best” outputs (when I have a choice of outputs for same inputs) or manually editing the sub-optimal outputs (the failing edge cases).
#2 is pretty much the only way I see currently to improve the quality of otherwise stable results, as fine-tuning at this stage is way more straightforward for me than wrestling with prompt engineering.
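Roughly what I mean, as a sketch only (the curation step, file name and example content here are placeholders, not an actual pipeline):

```python
# Sketch: turn logged "best" outputs into a fine-tuning dataset and start a job.
# Assumes you already have curated (input, preferred output) pairs from your logs.
import json
from openai import OpenAI

client = OpenAI()

# Hypothetical curated examples: the production prompt plus the output you want
# to "freeze" (or the manually corrected version of a failing edge case).
curated = [
    {
        "messages": [
            {"role": "system", "content": "…your production system prompt…"},
            {"role": "user", "content": "…production input…"},
            {"role": "assistant", "content": "…best or hand-corrected output…"},
        ]
    },
    # ...more examples
]

with open("train.jsonl", "w") as f:
    for example in curated:
        f.write(json.dumps(example) + "\n")

training_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-2024-08-06",  # fine-tune against a dated base model
)
print(job.id)
```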
What do you guys think?
An alias, because I am assuming the alias references the latest model checkpoint. I wonder now if I should update our dev environment to dated model checkpoints for evals.
Yes, you should use dated models if you want idempotency. Aliases get dereferenced, and not necessarily to the latest model. For example, gpt-4o points to gpt-4o-2024-08-06, while the latest checkpoint is actually gpt-4o-2024-11-20.
So in short, to have any chance of idempotency for a prompt and data that is held constant, you should use specific (dated) models.
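For example, something like this (a minimal sketch with placeholder prompt content; the seed gives best-effort reproducibility, not a hard guarantee):

```python
# Sketch: pin a dated checkpoint (plus temperature 0 and a seed) so repeated
# runs are as reproducible as the platform allows.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-2024-11-20",  # dated checkpoint, not the "gpt-4o" alias
    messages=[
        {"role": "system", "content": "…your stable system prompt…"},
        {"role": "user", "content": "…production input…"},
    ],
    temperature=0,  # reduce sampling variance
    seed=42,        # best-effort determinism; watch system_fingerprint for backend changes
)
print(response.system_fingerprint)
print(response.choices[0].message.content)
```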
- Freezing of results: I am researching Predicted Outputs for our use case. I feel that “Structured Outputs”, which are essentially system prompts, also guarantee this to some degree.
- I understand why fine-tuning can be a straightforward way; however, I feel it may be overkill, and too expensive, for our use case just yet. I can see it as a must-have for coding, scientific research, or other use cases where there is a constant stream of “new or synthetic” data, i.e. more frequent data updates.
https://platform.openai.com/docs/guides/optimizing-llm-accuracy was useful, and I see that it has been updated; I like the separation of context and behavior.
Going to try this on our dev environments and will measure the results.
The prediction field only serves to possibly increase the language generation speed (and expense). It does not alter the generation.
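For reference, a minimal sketch of how the prediction field is passed (placeholder content, not your actual use case):

```python
# Sketch: Predicted Outputs. The `prediction` content is text the model is
# expected to largely reproduce; it can speed up generation but does not
# constrain or change what the model decides to output.
from openai import OpenAI

client = OpenAI()

existing_code = "…previous version of the file…"  # placeholder

response = client.chat.completions.create(
    model="gpt-4o-2024-11-20",
    messages=[
        {"role": "user", "content": "Rename the function foo to bar and return the full file."},
        {"role": "user", "content": existing_code},
    ],
    prediction={"type": "content", "content": existing_code},
)
print(response.choices[0].message.content)
```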
Prompts and system messages should be robust: the task should be able to run on any OpenAI model with a general indication of success, even stepping down to gpt-4o-mini to see results. If you are at the edge of the model’s understanding and abilities, a “stealth” update of an existing model name, even the dated version, can end up breaking your application overnight.
If you have a particular API case that must not break and is described well, you can “store” chat completions runs of it with the store parameter, and then submit them to “evals” as part of tests that OpenAI themselves might run on pre-release model updates.
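A sketch of what storing such a run could look like (the metadata tags are just illustrative labels):

```python
# Sketch: persist a production run with `store` so it can later be pulled into
# evals as a regression test against new model snapshots.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-2024-11-20",
    messages=[
        {"role": "system", "content": "…your stable system prompt…"},
        {"role": "user", "content": "…the API case that must not break…"},
    ],
    store=True,  # make this completion available in the dashboard / evals
    metadata={"app": "tool-router", "prompt_version": "v7"},  # hypothetical tags
)
```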
Thank you for clarifying prediction and its correlation to generation speed. Can you clarify what you mean by “edge of the model’s understanding and abilities”?
Avoid running on-the-edge. Meaning: you might write an application that relies on complex decision-making, via extensive conditional instructions and added knowledge, to produce a particular desired output, perhaps one going to an API that cannot fail. It might be the type of complex application where you can see that multiple AI processing turns could do individual steps with focus, but you hacked it all together and made a single API call work.
It might be something where gpt-4 (original) can navigate its way through the logic and produce the result 99% of the time depending on inputs, following procedures. Then that “working application” might only go down a bit in quality when you seek a reduction in costs by using a particular version of gpt-4o. Perhaps you found the one gpt-4o version (of three) that can still do it with pretty good success.
By using the minimum AI you can get away with, you are putting yourself in the crosshairs of future model updates. OpenAI has not promised that “versioned” models are “stable” models, and we have seen many application-breaking changes to AI abilities on instruction-prompted operations from new AI tunings that are not transparent.
So it is better to make the task work with the minimum, cheapest AI, via procedural or multi-step preprocessing towards a solution and robust task descriptions; a solution even portable to different AI providers. Then, in production, use the bulletproof model that gives you constant success.
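A rough sketch of that decomposition idea (function names, labels, and prompts below are hypothetical, not a prescription):

```python
# Sketch: keep decisions in code and give the model one narrow, well-described
# task per call, instead of one giant conditional prompt.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini-2024-07-18"  # cheap, dated model; the task should survive it

def classify_request(text: str) -> str:
    """Step 1: a small, focused classification call."""
    response = client.chat.completions.create(
        model=MODEL,
        temperature=0,
        messages=[
            {"role": "system", "content": "Classify the request as one of: refund, shipping, other. Reply with the label only."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content.strip().lower()

def handle(text: str) -> str:
    """Deterministic routing around the model: code decides, the model fills narrow gaps."""
    label = classify_request(text)
    if label == "refund":
        return draft_refund_reply(text)      # hypothetical: another focused call
    if label == "shipping":
        return lookup_shipping_status(text)  # hypothetical deterministic lookup
    return escalate_to_human(text)           # hypothetical fallback
```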
@_j Yes, that has been our strategy: we have modularized all aspects, we have control over the AI inputs (prompt + context), and we expect a certain set of outputs. I think what’s debatable is this notion of “constant success”, since the majority of the applications that are going to be rewritten will need a combination of deterministic + probabilistic capabilities. I see OpenAI’s latest releases and research moving in this direction; until then, as @platypus and @sergeliatko suggested, we optimize with what we have: versioned models, effective prompts (synonymous with task descriptions), context optimization (RAG techniques), and fine-tuning. For now, we have modified the prompt in our case and are creating evals (OpenAI’s Evals) in the Playground to test the expected output given this incremental prompt change.