I am trying to get a deterministic answer to my prompt by setting the temperature to zero. It works in the Playground, but when I go through the API I get different answers to the same prompt.
For instance, I got these two different answers to the same prompt:
Larry Ellison co-founded Oracle Corporation, a software company, starting with a small investment. Self-made score: 9/10
Larry Ellison co-founded Oracle Corporation, a software company, starting with a small investment. Self-made score: 8/10
Hi there - that’s perfectly normal behaviour. Models are non-deterministic: even with the temperature set to 0, you are not guaranteed to get the exact same answer every time.
In your case, since you are looking for a score, a common strategy for dealing with this is to run the same query multiple times and then aggregate the results - for example, take the average of the scores, or keep the two highest/lowest.
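The aggregation step could be sketched like this (the `aggregate_scores` helper and the example scores are hypothetical):

```python
import statistics

def aggregate_scores(scores):
    """Combine scores from repeated runs of the same prompt.

    Returns the mean score and the most common (modal) score.
    """
    return statistics.mean(scores), statistics.mode(scores)

# e.g. five runs of the same "self-made score" prompt might return:
runs = [9, 8, 9, 9, 8]
mean_score, modal_score = aggregate_scores(runs)
print(mean_score, modal_score)  # 8.6 9
```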
If you wanted to be really tricky, you could take such a 1-token answer, extract all the integers returned in the top-5 logprobs, and weight each by its logprob converted to a probability, giving a new float answer.
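A minimal sketch of that weighting idea (the sample logprob values are made up, and a real response would come from the API's `logprobs` output rather than a hand-written dict):

```python
import math

def expected_score(top_logprobs):
    """Weight the integer tokens among the top logprobs by their
    probabilities (exp of the logprob) to get a float score."""
    weighted = 0.0
    total = 0.0
    for token, logprob in top_logprobs.items():
        token = token.strip()
        if token.isdigit():           # keep only integer tokens
            p = math.exp(logprob)     # logprob -> probability
            weighted += int(token) * p
            total += p
    return weighted / total           # renormalise over the kept mass

# Hypothetical top-5 logprobs for the single score token:
sample = {"9": -0.3, "8": -1.5, "7": -3.2, " the": -4.0, "10": -5.1}
print(round(expected_score(sample), 2))  # 8.7
```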
I think if you want a deterministic answer, you need a more deterministic prompt. Your scores are defined very qualitatively, using terminology whose meaning is highly subjective, or at least that reasonable people could disagree about. Terms like: most, helping, marginally, meaningful, managing, hired hand, head start, wealthy, middle-class, upper-middle-class, working-class, poor, largely, little to nothing, significant obstacles. LLMs behave like humans, so they won’t be deterministic about non-deterministic concepts.

If you like your prompt, I agree with @jr.2509 and @_j about doing some post-completion manipulation. Alternatively, you can use more deterministic criteria, such as:

- the ratio of inherited fortune to the current value of the business;
- the titles someone obtained in the business (like CEO or CFO);
- the level of education of the billionaire’s parents, grandparents, and the billionaire herself or himself;
- the billionaire’s income from investments vs. salary;
- whether the business is a public company or a private company;
- the number of employees reporting to the billionaire;
- whether the billionaire had tutors growing up or went to private school;
- the average property value of homes in the neighborhood where the billionaire was born;
- whether the billionaire experienced a childhood tragedy like the death of a parent.

Essentially, criteria that are established objectively with a number or a yes/no answer. Then the scores are likely to be more predictable. Giving examples of other billionaires is a good tactic, but reasonable people can disagree as to whether billionaire X is more like billionaire A or B.

Imagine that each time you run the query, you are asking a different person to follow the prompt.
A temperature of 0 means each person will eschew creativity and follow your instructions more rigidly, but it’s still a different person answering a highly subjective question, and they will inevitably interpret the qualitative criteria a little differently than any other person. That’s less likely to happen, though, if your criteria are numerical or in the yes/no style.
Thanks a lot for your answers! My surprise came from the fact that I do seem to get deterministic answers in the Playground and not through the API. But maybe it was just luck then! Seems like I will have to run the same query multiple times - too bad for my savings.
Yes, I agree, but I wanted to use the “self-made score” as defined by Forbes for US billionaires and apply it to other countries, so I kind of have to stick with their definition. Seems like the best way is to run it multiple times for each billionaire, then.
I show the use of `seed` just because it is also a parameter that is supposed to make models more deterministic.
Reusing the same multinomial sampling `seed` parameter, newly provided by OpenAI, makes subsequent runs use the same randomness values in the token-selection process. It is not expected to affect other parts of the model that may also include random-like elements (such as differing GPU precisions, allowable computational errors in hardware, or the routing of a mixture-of-experts model) - and the full architecture has not been published since GPT-2.
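A toy analogy for what a fixed sampling seed buys you, and what it doesn’t (this is a local RNG sketch, not OpenAI’s actual sampler; the weights are made up):

```python
import random

def sample_tokens(seed, weights):
    """Draw 5 tokens from a two-token vocabulary with a fixed seed."""
    rng = random.Random(seed)          # same seed -> same draw sequence
    return rng.choices(["8", "9"], weights=weights, k=5)

# Same seed and same probabilities: identical picks every run.
run_a = sample_tokens(42, [0.23, 0.77])
run_b = sample_tokens(42, [0.23, 0.77])
print(run_a == run_b)  # True

# But if the computed probabilities wobble slightly between runs,
# the same seed can still land on a different token:
run_c = sample_tokens(42, [0.22, 0.78])
print(run_a == run_c)  # False
```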
This parameter doesn’t do as much for you when the probabilities coming out of the inference softmax shift between runs due to AI non-determinism: the decision boundaries between tokens move, so the same cutoff threshold of cumulative probability mass can land on a different token.
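To illustrate the boundary problem with made-up numbers: when two candidate tokens sit at nearly equal logits, even a tiny numerical wobble between "identical" runs flips the greedy (temperature-0) choice.

```python
import math

def softmax(logits):
    """Convert a dict of logits into a dict of probabilities."""
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    s = sum(exps.values())
    return {t: e / s for t, e in exps.items()}

# Two "identical" calls whose computed logits differ only by a tiny
# numerical wobble (hypothetical values):
run1 = {"8": 2.0000, "9": 1.9999}
run2 = {"8": 1.9999, "9": 2.0000}

# Greedy decoding takes the argmax, so the wobble flips the pick
# even though the probabilities are essentially 50/50:
pick1 = max(run1, key=run1.get)   # "8"
pick2 = max(run2, key=run2.get)   # "9"
print(pick1, pick2, round(softmax(run1)["8"], 4))
```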
TL;DR: the sampler can still pick different words despite the best API parameters to prevent it, because what comes out of the AI computation itself currently changes between identical API calls to OpenAI.