Temperature in the new gpt-realtime model

I’ve noticed that in the latest gpt-realtime model, the temperature property has been removed from the session.update event. This means I no longer have a way to control it directly.

In the previous version of the Realtime API, we could at least control it, though only within a limited range (0.6-1.2). I never fully understood why it was restricted to that range, but at least it gave us some control. Now the property is gone completely, and we have no visibility into what temperature is being used or how to influence it.

This has become a big issue in my app:

  • For text-based models like gpt-4o or gpt-4o-mini, I explicitly set temperature to 0.2.
  • When users switch to realtime voice, the model now behaves differently, since I can’t set this parameter.
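To illustrate the gap, here is a minimal sketch of the same "low randomness" intent expressed for both request types. The field names follow the public API shapes, but treat the exact payloads as assumptions rather than authoritative:

```python
# Chat Completions request body for a text model: temperature is supported.
chat_request = {
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0.2,  # explicit, low randomness
}

# GA realtime session config: there is no temperature key to set anymore.
realtime_session = {
    "type": "session.update",
    "session": {
        "model": "gpt-realtime",
        "instructions": "Answer concisely and consistently.",
        # "temperature": 0.2,  # removed in the GA interface
    },
}

assert "temperature" in chat_request
assert "temperature" not in realtime_session["session"]
```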

My questions are:

  1. What default temperature value is the realtime model using now?
  2. Is there any plan to bring back configurability?

I really need a way to minimize randomness in responses (ideally as close to 0.2 as possible). Right now, the difference in behavior between modes is causing noticeable inconsistency for my users.


The difference between modes exists because the AI is producing a completely different kind of output, with different training and different inputs.

The tokens being produced are not words mirrored directly from pretraining on text; they are acoustic content: a semantic integer stream that is decoded through a codec back into playable audio.

Thus, when the AI is “talking”, it is not merely predicting words; it is predicting the shape of audio: whether it should sound like the post-trained voice actor (with that choice enforced) or like a gruff pirate version of that voice. A large distance of inference is also needed to bring knowledge and instruction-following into that output, so the “difference between modes” is surprising only in that it is not a massive gulf.

Temperature at either extreme can send the audio into repeating loops or into invalid audio that can’t be decoded at all. The perceived issue is not quite what you’d expect: at high temperature, bad generation first presents as quirky, garbled, atonal audio; at low temperature, before breaking entirely, it presents as a robotic-sounding speaker.

High temp: (audio sample attachment)


I really need a way to minimize randomness in responses (ideally as close to 0.2 as possible). Right now, the difference in behavior between modes is causing noticeable inconsistency for my users.

I would think “randomness” has no place in realtime voice, or any voice model. If it does, then it would be a bad design by OpenAI.


Randomness is in the selection of tokens. The AI assigns a prediction value to every possible token, and there is no “right” answer when " apple", " banana", and " orange" have near-equal probability of being the AI’s favorite. The same applies to audio: which frequencies and tonality are correct, by codebook. But there is a right answer to the problem of the AI repeating the same sentences over and over non-stop, or getting stuck on a single token by observing input it has created itself; that is the reason there is a sampling and softmax layer at all, instead of just picking the “best” token (greedy sampling).
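The sampling described above can be sketched in a few lines. This is a toy illustration of temperature-scaled softmax sampling over next-token logits, not the model's actual decoder:

```python
import math
import random

def sample_token(logits, temperature, rng=None):
    """Temperature-scaled softmax sampling over a {token: logit} dict.
    As temperature approaches 0, selection approaches greedy (argmax)."""
    if temperature <= 0:
        return max(logits, key=logits.get)  # greedy sampling
    rng = rng or random.Random(0)
    scaled = {t: l / temperature for t, l in logits.items()}
    m = max(scaled.values())                       # for numerical stability
    exps = {t: math.exp(v - m) for t, v in scaled.items()}
    total = sum(exps.values())
    r = rng.random()
    cum = 0.0
    for token, e in exps.items():
        cum += e / total
        if r <= cum:
            return token
    return token  # guard against floating-point rounding

# Near-equal logits: no single "right" answer, so sampling varies the pick.
logits = {" apple": 1.00, " banana": 0.99, " orange": 0.98}
print(sample_token(logits, temperature=1.0))
print(sample_token(logits, temperature=0.0))  # always " apple" (greedy)
```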

The effect of temperature: 0 can be reproduced by running “audio” models on Chat Completions. The symptomatic result is easy to obtain: here is 3000 tokens’ worth, where around 500 tokens in, the audio “breaks” and becomes a nonstop looping hum.
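A hedged sketch of how one might set up that experiment. The model name and audio options follow the publicly documented audio-preview models, but treat them as assumptions and check the current docs before relying on them:

```python
# Build the request body for forcing temperature 0 on an audio-capable
# chat model; you would pass these kwargs to client.chat.completions.create.
request_kwargs = {
    "model": "gpt-4o-audio-preview",          # assumed audio-capable model
    "modalities": ["text", "audio"],
    "audio": {"voice": "alloy", "format": "wav"},
    "temperature": 0,                         # greedy-like decoding
    "max_tokens": 3000,                       # enough output to hear the "break"
    "messages": [
        {"role": "user", "content": "Tell me a long story out loud."}
    ],
}

# Actual call (requires an API key), e.g.:
# from openai import OpenAI
# response = OpenAI().chat.completions.create(**request_kwargs)

assert request_kwargs["temperature"] == 0
```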


I came here to ask the exact same question as OP’s. I also have two follow-up questions:

  • When using the old beta API (OpenAI-Beta: realtime=v1), we can still set a temperature parameter for the new gpt-realtime model. The response from the session.created event shows it defaults to 0.8. Does this parameter actually influence the gpt-realtime model’s responses, or is it ignored?
  • When using the GA API, we can no longer set temperature for the older models like gpt-4o-realtime-preview. In this case, what temperature value is being used for the old model?
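For reference, when sending the OpenAI-Beta: realtime=v1 header, the session.created event still reports a temperature field defaulting to 0.8. This is an abbreviated, assumed excerpt with other session fields omitted:

```json
{
  "type": "session.created",
  "session": {
    "model": "gpt-realtime",
    "temperature": 0.8
  }
}
```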

I found the answer to the OP’s questions in the official blog:

The GA interface has removed temperature as a model parameter, and the beta interface limits temperature to a range of 0.6 - 1.2 with a default of 0.8.

You may be asking, “Why can’t users set temperature arbitrarily and use it for things like making the response more deterministic?” The answer is that temperature behaves differently for this model architecture, and users are nearly always best served by setting temperature to the recommended 0.8.

From what we’ve observed, there isn’t a way to make these audio responses deterministic with low temperatures, and higher temperatures result in audio aberrations. We recommend experimenting with prompting to control these dimensions of model behavior.

So:

  1. 0.8
  2. Doesn’t look like there are any plans.