Realtime API pricing questions: text input and audio tokens

Hi everyone,

After reading through some of the (older) pricing discussions here, I looked into the issue a bit more, because I was unpleasantly surprised by our costs.

First off: our cost right now is ~20 cents per minute, not horrible but still too high for some use cases. As suggested here, I ran some experiments in the playground and in live calls, watching the response.done event to monitor and predict cost, but there are two things I was unable to wrap my head around:
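For anyone wanting to do the same monitoring, here is a minimal sketch of reading the usage block from a response.done event and turning it into a dollar figure. The field names match the event shape I see; the prices are placeholders, not official rates, so substitute your own per-million-token numbers.

```python
# Placeholder $/1M-token rates -- NOT official pricing, fill in your own.
PRICE_PER_M = {
    "text_in": 4.0, "audio_in": 32.0,
    "text_out": 16.0, "audio_out": 64.0,
}

def cost_of_response(event: dict) -> float:
    """Estimate the dollar cost of a single response.done event."""
    usage = event["response"]["usage"]
    tin = usage["input_token_details"]    # text_tokens, audio_tokens, cached_tokens
    tout = usage["output_token_details"]  # text_tokens, audio_tokens
    return (
        tin["text_tokens"] * PRICE_PER_M["text_in"]
        + tin["audio_tokens"] * PRICE_PER_M["audio_in"]
        + tout["text_tokens"] * PRICE_PER_M["text_out"]
        + tout["audio_tokens"] * PRICE_PER_M["audio_out"]
    ) / 1_000_000
```

Summing this per call id over a session gives you the per-minute figure to compare against your invoice.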

1. extremely high text input cost
For some of our agents the “gpt-realtime-2025-08-28, text input” cost is higher than the audio output cost! I have two potential explanations for this:
a) Session.updated ALWAYS sends the entire prompt: I don’t know why this is necessary, but it always does. As our prompt is quite large, this likely inflates this cost component.

–> Has anyone found a way to avoid that?

b) Unconnected calls: We are calling people outbound, so many calls are never connected or are quite short. The agent, however, is started nonetheless; hence the text consumption.

–> Does this make sense? Has anyone conceived of a solution for this (I could not think of one)?
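One mitigation I could imagine: don’t open the Realtime session (or send the large instructions prompt) until the telephony provider reports the call as answered. A sketch, where `open_realtime_session` and the status strings are hypothetical placeholders for whatever your stack provides:

```python
def open_realtime_session(call_id: str) -> dict:
    # Placeholder: in reality, open the websocket and send session.update --
    # this is the point where the big prompt starts costing text tokens.
    return {"call_id": call_id, "connected": True}

def on_call_status(call_id: str, status: str, sessions: dict) -> None:
    """Start the agent lazily, only once the callee actually picks up."""
    if status == "answered":
        sessions[call_id] = open_realtime_session(call_id)  # pay prompt cost only now
    elif status in ("busy", "no-answer", "failed"):
        sessions.pop(call_id, None)  # session never started, so no tokens consumed
```

The trade-off is a short delay before the agent can speak on answered calls, against paying the prompt’s text-input cost on every unanswered one.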

2. Persistent difference between expected and real audio tokens
My understanding is that a token is a token. If I only say “Hello”, then this should be 1 input token.
For output tokens, I would expect the same but I understand that maybe the agent has “prepared” more tokens than it eventually sends.

However, in both cases the difference between expected and real output and input tokens is between 3-5x, which I find particularly puzzling for the input tokens. For example saying only “hello” in the playground environment already produces 16 audio input tokens.

–> What am I missing here? Is the definition of an audio token entirely divorced from text-based tokens, e.g. due to noise?

Would be keen to hear if somebody has dug a bit deeper on this!

Yeah, these are two different things. Audio tokens correspond to a specific amount of audio data rather than a particular number of letters.
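That duration-based metering would also explain the “hello” observation. A rough sanity check, where the ~10 tokens per second of audio is an empirical observation people have reported, not an official constant, so treat it as an assumption and verify it against your own usage data:

```python
TOKENS_PER_SECOND = 10  # assumption, not an official figure

def audio_tokens(duration_s: float) -> int:
    """Estimate input audio tokens from clip duration, not word count."""
    return round(duration_s * TOKENS_PER_SECOND)

# A clipped "hello" plus leading/trailing silence is easily ~1.6 s of audio,
# which under this rate would meter as ~16 input tokens.
```

Under this model, silence and pauses are billed exactly like speech, which is why word-based intuitions undercount by several times.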


Also, are you certain you are being charged for unconnected calls? If you don’t generate a response, there shouldn’t be any token usage, right?

Thanks for explaining the audio tokens part, makes sense!

As for the unconnected calls: I have not verified this; it was just an attempted explanation. Until now we have not saved the last response.done event for every call id, so I have no insight into token consumption per individual call.

But your explanation makes sense, I will check if we are being charged anything.

If indeed we are not charged, then the high text input cost must be a function of 1a): long prompts, or conversations with frequent session.updated events (i.e. many turns), which some of our conversations are.
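If the full prompt really is part of the context billed on every turn, the growth is easy to model. A minimal sketch with purely illustrative token counts:

```python
def text_input_tokens(prompt_tokens: int, turns: int,
                      history_tokens_per_turn: int = 0) -> int:
    """Total text-input tokens billed across a conversation, assuming the
    full prompt (plus any accumulated history) is re-billed every turn."""
    total = 0
    history = 0
    for _ in range(turns):
        total += prompt_tokens + history
        history += history_tokens_per_turn
    return total
```

For example, a 2,000-token prompt over 20 turns is already 40,000 text-input tokens from the prompt alone, before any conversation history is counted.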

Hey, can you please indicate approximately how many tokens correspond to one minute of speech (including pauses) ?

Hi Maria,

Obviously I cannot answer this for sure, but in our use case (a phone conversation) the price varies a lot because it depends on many other factors (prompt length, turn taking, etc.).

If your use case is not a conversation, then TTS models will suit you better, and there it should also be easier to get a reliable price estimate!

Roughly, VERY roughly: 150 spoken words per minute at ~0.75 words per token gives you around 200 tokens per minute. This varies a lot by language, cadence, the technical nature of the spoken words, etc.
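Spelled out (using the common heuristic that one English text token covers about 0.75 words, i.e. ~1.33 tokens per word — and note this is a text-token estimate, not audio tokens):

```python
words_per_minute = 150     # typical conversational speaking pace
words_per_token = 0.75     # common English tokenization heuristic (assumption)

# Tokens needed for one minute of speech, transcribed as text.
tokens_per_minute = words_per_minute / words_per_token
```

That lands at 200 text tokens per minute; audio-token metering, as discussed above, behaves differently.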

Voice models do not tokenize audio into “words” or language tokens. Thus you cannot use the same “text to learned AI tokens is just a bit worse than per-word encoding” rule you might for written-language tokenization.

The audio frequencies are transformed into a native multimodal embedding that the AI has been trained on.

Experimentally, naturally spoken language on OpenAI consumes about 5x the tokens of the same semantics written as text input (and the cost billed per “token” is then up to 10x higher). However, while I can stop typing for a minute, audio is still a continuous input stream.

That is just for a single input run, though. In the case of a “phone conversation” there is voice-activity detection, where a pause triggers an input run and an AI generation. It is turn-based chat; “realtime” is simply another illusion of presentation.

Then that input/output pairing becomes a growing conversation-history context that is run again for each input, against a 16k or 32k context window before truncation is triggered by an unseen mechanism on “realtime”. The input doesn’t shut off, though, even while the model is speaking a response. Thus the cost of a single “goodbye” can be extreme after a few minutes.
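The growth described above can be sketched numerically. The per-turn token count and the 32k cap here are illustrative assumptions, not measured values:

```python
def billed_input_tokens(turns: int, tokens_per_turn: int,
                        context_cap: int = 32_000) -> int:
    """Cumulative input tokens when the whole history is re-run as input
    on every turn, capped by the context window before truncation."""
    total = 0
    history = 0
    for _ in range(turns):
        history = min(history + tokens_per_turn, context_cap)
        total += history
    return total
```

At an assumed 500 tokens per turn, 30 turns already bill over 230k cumulative input tokens, which is why the final “goodbye” in a long call is so disproportionately expensive.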
