I’m building an LLM-powered application that processes product description files and uses a prompt to summarize the task. I have an OpenAI account with one API key, and I’m running the same application on different machines using this shared API key. When I run the application on a single machine, the responses are accurate and as expected. However, when multiple machines use the same API key in parallel, the results become inconsistent and erratic.
Is there a known issue with the OpenAI API when it’s accessed in parallel by different servers using the same API key?
OpenAI LLMs are not deterministic, even at temperature 0. So if you run the same prompts on multiple machines, it is expected that you will get slightly different results.
If you always want the "best" result for an input, meaning responses that start similarly (but may still diverge), you would set top_p: 0.00001.
A seed parameter with a fixed value, re-supplied on every API call, will re-run the sampler with the same source of randomness (randomness essentially already turned off by the top_p above); but since the underlying computations are not bit-identical from run to run (and they aren't), the seed doesn't carry as much meaning as you'd hope.
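As a minimal sketch of putting those three knobs together (this assumes the official `openai` Python SDK and a placeholder model name; swap in whatever model and prompt your application actually uses):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Request the most repeatable output the API offers: temperature 0,
# a tiny top_p so only the single most likely token survives nucleus
# sampling, and a fixed seed re-supplied on every call.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; use the model your app calls
    messages=[
        {"role": "user", "content": "Summarize this product description: ..."}
    ],
    temperature=0,
    top_p=0.00001,
    seed=12345,
)

print(response.choices[0].message.content)
```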
Thus, inconsistent responses are expected responses. You can resend the same request and possibly get a better answer, or different brainstorming ideas.
You can read a bit more on this elsewhere; OpenAI hasn't come out and directly explained the technical reason for the varying logits and dimensions seen on its language and embedding models since gpt-3.5+.
You can also look at the system_fingerprint returned in an API response to see whether repeatability should not be expected at all, for example because the model was served by a different backend configuration (some models have varied across as many as five fingerprints in large trials).
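As a rough illustration (same `openai` Python SDK and placeholder model as above), you can compare the system_fingerprint across repeated, seeded calls; if it differs, the requests were served by different backend configurations and identical outputs shouldn't be expected even with a fixed seed:

```python
from openai import OpenAI

client = OpenAI()

# Issue the same seeded request twice and collect the backend fingerprints.
fingerprints = set()
for _ in range(2):
    r = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": "Summarize: ..."}],
        temperature=0,
        seed=12345,
    )
    fingerprints.add(r.system_fingerprint)

# More than one fingerprint means the backend configuration changed,
# so byte-for-byte repeatability is not guaranteed even with the seed.
if len(fingerprints) > 1:
    print("Served by different backend configurations:", fingerprints)
```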