Consistent Repetition With Structured Output

Hi,
I’m running into an issue where the Playground and the API give different results.
My use case involves structured output as the response to user input.
One field in the response is the actual text displayed to the user, while the rest of the structure is metadata the model is supposed to build according to the prompt.
Using the Playground I get varying “user text” responses, while using the API I get the exact same response every time, no matter how I play with the temperature and penalties.
Is this due to OpenAI caching? If so, can I disable it?
Is it because the playground uses streaming while my API request doesn’t?
Has anyone encountered a similar issue?

Thanks

2 Likes

Sounds suspicious. Would you mind sharing an example prompt so we can check?

3 Likes

I can’t do that.
One thing I didn’t mention: my API request includes the conversation history. That history contains the main system prompt, the evolving JSON object (also sent as a system message), and a sequence of user/assistant messages interleaved with additional system messages noting the time of each message pair.
I hope I’m being clear.
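Roughly, the messages array I send is shaped like this (all contents are placeholders, not my real prompts):

messages = [
    {"role": "system", "content": "<main system prompt>"},
    {"role": "system", "content": "<evolving JSON metadata object>"},
    {"role": "system", "content": "Message pair time: <timestamp 1>"},
    {"role": "user", "content": "<user message 1>"},
    {"role": "assistant", "content": "<structured model response 1>"},
    {"role": "system", "content": "Message pair time: <timestamp 2>"},
    {"role": "user", "content": "<current user message>"},
]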

1 Like

Welcome to the community!

:thinking:

If you can’t share your prompt, can you share the parameters you’re using?

temperature=0 and top_p=0 would be a good start, and the penalties should probably be left alone too (0,0) unless you’re facing specific issues you’re trying to combat.

That said, it’s unfortunately impossible to get absolutely deterministic output.

But if you click on “code” in the playground, and you run that on your machine, you should get pretty similar outputs.
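For reference, a minimal sketch of the kind of call I mean, using the current Python SDK (model and prompt are placeholders, swap in your own):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

completion = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    temperature=0,        # near-greedy sampling
    top_p=0,
    presence_penalty=0,   # leave the penalties alone
    frequency_penalty=0,
)
print(completion.choices[0].message.content)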

4 Likes

The issue is that I do get deterministic output. Completely, totally deterministic. Also, I mentioned playing with the values, meaning I changed them and still got the same result. I also used the “code” feature, which showed values identical to mine, except for the chat history.
The temperature varied between 0.25 and 0.8. I later added a presence penalty, which I also varied, but to no avail. Exact same result.

1 Like

So you ran the generated code on your machine? And you got the expected results? And when you run your own program, you get different results?

Well, that unfortunately means that there’s a problem in your program, I’d say.

The next step is to get your program to spit out the entire HTTP request it’s sending to the API, so you can see what’s different.
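One way to do that (a sketch using the requests library; the payload here is a stand-in for whatever your program actually sends) is to prepare the request and print it before sending:

import json
import os
import requests

payload = {
    "model": "gpt-4o-mini",  # stand-in: use whatever your program sends
    "messages": [{"role": "user", "content": "Hello"}],  # your real history goes here
    "temperature": 0.8,
}
req = requests.Request(
    "POST",
    "https://api.openai.com/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "Content-Type": "application/json",
    },
    json=payload,
).prepare()

print(req.url)
print(dict(req.headers))
print(req.body)  # compare this byte for byte with the playground's curl body

response = requests.Session().send(req)
print(json.dumps(response.json(), indent=2))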

Regarding your earlier question regarding caching or streaming: while it’s never impossible that OpenAI might royally bungle something, I’d say it’s exceedingly unlikely that that factors into anything you’re seeing.

The debug steps are typically as follows:

  1. Program produces unexpected results. Does the playground produce unexpected results?
  2. Playground produces expected results. Can I reproduce the playground on my machine? (Generate the playground code.)
  3. Playground code ran on my own machine and produced expected results. Can I replicate the playground curl request with my program’s output?
  4. My program’s HTTP request doesn’t look like the playground curl request. Where’s that coming from? (Now you gotta debug your stuff.)

An experienced dev typically skips step 2, or runs step 2 after step 3 lol.

Of course, none of this is true if you’re actually talking about the assistants API. I guess this would be a good time to ask, are you talking about assistants?

1 Like

Mmm… no, I’m not, and I already did the HTTP thing.

OK, so the HTTP request looks like the curl thing? I’m confused :thinking:

if you send different things, you’re gonna get different results :frowning:

Seems like you are. Let’s see if someone else can come up with anything. Thanks!

1 Like

Can you answer:

Endpoint:
API AI model (and model name returned):
Message roles sent:
Additional parameters:

I can see if it can be replicated.

That is, to see if there’s any “we can save money by repeating back this input’s previous output” behavior going on.

If you specifically want different outputs for the same input, you can use the “n” parameter to set the number of choices generated in the response; you should only be billed for the input once.
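A rough sketch of that kind of call with the Python SDK (the prompt and schema here are only illustrative):

from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write the opening line of a sea shanty."}],
    n=10,            # ten alternative choices; input tokens billed once
    stop=["\n"],     # cut each choice at the first newline, as in the output below
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "lyrics",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {"lyrics": {"type": "string"}},
                "required": ["lyrics"],
                "additionalProperties": False,
            },
        },
    },
)
for i, choice in enumerate(completion.choices, 1):
    print(i, choice.message.content)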


example - n=10, stop at “\n”, response schema

{"lyrics":"Oh, gather 'round my fellows bold, to tell a tale at sea, 
{"lyrics":"Oh the ships in the harbor ready to sail,
{"lyrics":"When the morning sun is rising,
{"lyrics":"Set your gaze on the midnight shore,
{"lyrics":"Gather 'round me lads, to the tale that I share,
{"lyrics":"Oh, we'll sing a hearty tune of the dames who know no grace,
{"lyrics":"Oh, the winds they blow and the waves they crash,
{"lyrics":"Heave ho and away we go, into the rolling mist,  
{"lyrics":"Oh, gather 'round, you sea-blown crew,
{"lyrics":"Oh, gather 'round, you lively crew,
1 Like

Endpoint:

/v1/chat/completions (gpt-4o-mini-2024-07-18)

Message roles sent:

system
user
assistant
system
user
assistant
system
user
user
assistant

user

Additional parameters:

'max_completion_tokens': 10000
'presence_penalty': 0.8 (added later, by trial and error)
'response_format': {'type': 'json_schema', 'json_schema': {'schema': {…}, 'strict': true}}
'stream': false
'temperature': 0.8

NOTE: The assistant role appears fewer times than the user role due to network issues and such, but the sequence always ends with a user message, which is the current question.
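Put together, the call looks roughly like this (the history and schema below are placeholders standing in for my real, redacted ones):

from openai import OpenAI

client = OpenAI()

# Placeholders for the real conversation history and response schema:
messages = [
    {"role": "system", "content": "<main system prompt>"},
    {"role": "user", "content": "<current question>"},
]
response_schema = {
    "type": "object",
    "properties": {"user_text": {"type": "string"}},
    "required": ["user_text"],
    "additionalProperties": False,
}

completion = client.chat.completions.create(
    model="gpt-4o-mini-2024-07-18",
    messages=messages,
    max_completion_tokens=10000,
    presence_penalty=0.8,
    temperature=0.8,
    stream=False,
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "reply",  # placeholder name
            "strict": True,
            "schema": response_schema,
        },
    },
)
print(completion.choices[0].message.content)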

I’m unable to trigger such a case. I’d have to copy a longer real-world chat with a response format over to a replay harness that can run it in order to test further. Reminder: many outputs the AI produces, especially when the output is structured to limit the points where the AI can choose, are just highly certain. You could run thousands of trials and not see {"provided_person": "Adolf Hitler", "was_a_good_guy": true}.

After adding a few turns of my “celebrity matchmaker” as history, and using “system” as recommended, we ask it to find a date for John Oliver (gpt-4o-mini, default parameters, repeated API calls, under cacheable size) - something without a clear answer and a fun short demonstration:

Response 1: {"celebrity_names":["Samantha Bee","Mindy Kaling","John Mulaney"]}
Response 2: {"celebrity_names":["Samantha Bee","Tina Fey","Ricky Gervais"]}
Response 3: {"celebrity_names":["Samantha Bee","Tina Fey","Mindy Kaling"]}
Response 4: {"celebrity_names":["Rachel Bloom","Tina Fey","Mindy Kaling"]}
Response 5: {"celebrity_names":["Samantha Bee","Tina Fey","Kate McKinnon"]}
Response 6: {"celebrity_names":["Amanda Palmer","Tina Fey","Samantha Bee"]}
Response 7: {"celebrity_names":["Samantha Bee","Tina Fey","Ricky Gervais"]}
Response 8: {"celebrity_names":["Ricky Gervais","Tina Fey","Conan O'Brien"]}
Response 9: {"celebrity_names":["Hannah Gadsby","Conan O'Brien","Ricky Gervais"]}
Response 10: {"celebrity_names":["Helen Mirren","Samantha Bee","Tina Fey"]}

Analysis:
Total responses: 10
Unique responses: 9

Presence penalty is an odd parameter to use. If a token has been seen even once, anywhere in the entire input context or in what the AI has generated so far, it is penalized by that amount. Set to extremes, the AI could produce one enum {"result": true} and then "false" would be highly favored, and more predictable, in the following chat answer. Or the schema’s "additionalProperties": false itself damages the probabilities. From a few trials, though, I’m not sure that the penalty parameters are even currently working; they could be damaged just as logit_bias is.
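If you want to check whether the penalty is doing anything at all, one quick probe (model and prompt here are arbitrary) is to compare the top logprobs of the first generated token with and without a heavy penalty:

from openai import OpenAI

client = OpenAI()

messages = [{"role": "user", "content": "Reply with one word, true or false: is water wet?"}]

for penalty in (0.0, 1.9):
    out = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        presence_penalty=penalty,
        temperature=0,
        logprobs=True,
        top_logprobs=5,
        max_completion_tokens=5,
    )
    # Inspect how the distribution over the first token shifts (if at all)
    top = out.choices[0].logprobs.content[0].top_logprobs
    print(penalty, [(t.token, round(t.logprob, 3)) for t in top])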

(A user input that wasn’t satisfied by the API call shouldn’t be added to history; you’d only store the input + output after a success.)

1 Like

+1 on the history. I was wrong: my realtime display has it, but the stored history (DB) does not.
As for the rest, I’m not sure I understand, but the fact remains that I get identical outputs no matter what parameters I use.