Is it just me, or are API calls with higher temperature slower for you as well, at least for chat completions with gpt-3.5-turbo and gpt-4?
Latency has no relation to the temperature used in the call. It must just be a coincidence, or GPT may simply be slower in general.
Pretty easy to find out. Run a loop over three temperatures (0.8, 0.1, 1.8) with max_tokens=100 and compare tokens/second.
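Here is a minimal sketch of such a benchmark loop, assuming the openai Python SDK (v1.x) with an OPENAI_API_KEY in the environment; the model and prompt are placeholders, and the exact ordering of temperatures in my run differed slightly (see below):

```python
# Rough benchmark: time each chat completion and print tokens/second.
# Sketch only -- assumes openai SDK v1.x; prompt/model are illustrative.
import time
from openai import OpenAI

client = OpenAI()

for temp in [0.8, 0.1, 1.8] * 4:  # 12 calls, 3 temperatures
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Write a short story about a robot."}],
        temperature=temp,
        max_tokens=100,
    )
    elapsed = time.perf_counter() - start
    resp_tokens = resp.usage.completion_tokens
    print(f"temp {temp}, resp_tokens: {resp_tokens}. Tokens/s: {resp_tokens / elapsed:.2f}")
```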
Results of 12 calls in sets of 3 temperatures, with the last two in each set alternating:
temp 0.8, resp_tokens: 100, tokens/s: 26.13
temp 0.1, resp_tokens: 100, tokens/s: 27.52
temp 1.8, resp_tokens: 100, tokens/s: 25.56
temp 0.8, resp_tokens: 100, tokens/s: 19.94
temp 1.8, resp_tokens:  51, tokens/s: 12.49 *****
temp 0.1, resp_tokens: 100, tokens/s: 27.06
temp 0.8, resp_tokens: 100, tokens/s: 25.12
temp 0.1, resp_tokens: 100, tokens/s: 30.00
temp 1.8, resp_tokens: 100, tokens/s: 27.29
temp 0.8, resp_tokens: 100, tokens/s: 26.83
temp 1.8, resp_tokens: 100, tokens/s: 24.31
temp 0.1, resp_tokens: 100, tokens/s: 42.76
Conclusion: tokens/second looks random across all temperatures. One possibility for the outlier at the end: the last call might have hit a fast H100 machine.
While the tokens/second appears random over temperature, I would say that higher temperature tends to mean more tokens in the output (just try a temperature of 2 and watch the output hit the max_tokens limit), so the overall wall-clock latency can indeed be longer, as expected.
Maybe that is what the OP is experiencing. But in that case of longer output, more words is usually a desired outcome anyway.
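A quick way to check this effect is to compare completion length and wall-clock time at a low and a high temperature. A sketch, again assuming the openai Python SDK v1.x; the prompt and model are just placeholders:

```python
# Compare output length and latency at low vs. high temperature.
# Expectation: similar tokens/s, but higher temperature often produces
# more completion tokens, so total latency is longer.
import time
from openai import OpenAI

client = OpenAI()

for temp in (0.1, 2.0):
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Summarize the plot of Hamlet."}],
        temperature=temp,
        max_tokens=256,
    )
    elapsed = time.perf_counter() - start
    n = resp.usage.completion_tokens
    print(f"temp {temp}: {n} completion tokens in {elapsed:.2f}s ({n / elapsed:.1f} tokens/s)")
```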
Thank you very much. This seems to be it! I ran a dedicated analysis with a pretty long preprompt, and with temperature > 1.4 the responses not only get longer…