Realtime API pricing questions: text input and audio tokens

Hi everyone,

After reading through some of the (older) pricing discussions here, I looked into the issue a bit more, because I was unpleasantly surprised by our costs.

First off: our cost right now is ~20 cents per minute, not horrible but still too high for some use cases. As suggested here, I ran some experiments in the playground and in live calls, watching the response.done event to monitor and predict cost, but there are two things I was unable to wrap my head around:
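For anyone wanting to do the same monitoring, here is a minimal sketch of reading the usage block from a response.done event and turning it into a dollar figure. The field names match the event shape I see; the prices are placeholders, not official rates, so substitute your own per-million-token numbers.

```python
# Placeholder $/1M-token rates -- NOT official pricing, fill in your own.
PRICE_PER_M = {
    "text_in": 4.0, "audio_in": 32.0,
    "text_out": 16.0, "audio_out": 64.0,
}

def cost_of_response(event: dict) -> float:
    """Estimate the dollar cost of a single response.done event."""
    usage = event["response"]["usage"]
    tin = usage["input_token_details"]    # text_tokens, audio_tokens, cached_tokens
    tout = usage["output_token_details"]  # text_tokens, audio_tokens
    return (
        tin["text_tokens"] * PRICE_PER_M["text_in"]
        + tin["audio_tokens"] * PRICE_PER_M["audio_in"]
        + tout["text_tokens"] * PRICE_PER_M["text_out"]
        + tout["audio_tokens"] * PRICE_PER_M["audio_out"]
    ) / 1_000_000
```

Summing this per call id over a session gives you the per-minute figure to compare against your invoice.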

1. extremely high text input cost
For some of our agents the “gpt-realtime-2025-08-28, text input” cost is higher than the audio output cost! I have two potential explanations for this:
a) Session.updated ALWAYS sends the entire prompt: I don’t know why this is necessary, but it always does. As our prompt is quite large, this likely inflates this cost component.

–> Has anyone found a way to avoid that?

b) Unconnected calls: We are calling people outbound, so many calls are never connected or are quite short. The agent, however, is started nonetheless; hence the text consumption.

–> Does this make sense? Has anyone conceived of a solution for this (I could not think of one)?
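One mitigation I could imagine: don’t open the Realtime session (or send the large instructions prompt) until the telephony provider reports the call as answered. A sketch, where `open_realtime_session` and the status strings are hypothetical placeholders for whatever your stack provides:

```python
def open_realtime_session(call_id: str) -> dict:
    # Placeholder: in reality, open the websocket and send session.update --
    # this is the point where the big prompt starts costing text tokens.
    return {"call_id": call_id, "connected": True}

def on_call_status(call_id: str, status: str, sessions: dict) -> None:
    """Start the agent lazily, only once the callee actually picks up."""
    if status == "answered":
        sessions[call_id] = open_realtime_session(call_id)  # pay prompt cost only now
    elif status in ("busy", "no-answer", "failed"):
        sessions.pop(call_id, None)  # session never started, so no tokens consumed
```

The trade-off is a short delay before the agent can speak on answered calls, against paying the prompt’s text-input cost on every unanswered one.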

2. Persistent difference between expected and real audio tokens
My understanding is that a token is a token. If I only say “Hello”, then this should be 1 input token.
For output tokens, I would expect the same but I understand that maybe the agent has “prepared” more tokens than it eventually sends.

However, in both cases the difference between expected and real output and input tokens is between 3-5x, which I find particularly puzzling for the input tokens. For example saying only “hello” in the playground environment already produces 16 audio input tokens.

–> What am I missing here? Is the definition of an audio token entirely divorced from text-based tokens, e.g. due to noise?

Would be keen to hear if somebody has dug a bit deeper on this!

Yeah, these are two different things. Audio tokens correspond to a specific amount of audio data rather than a particular number of letters.
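That duration-based metering would also explain the “hello” observation. A rough sanity check, where the ~10 tokens per second of audio is an empirical observation people have reported, not an official constant, so treat it as an assumption and verify it against your own usage data:

```python
TOKENS_PER_SECOND = 10  # assumption, not an official figure

def audio_tokens(duration_s: float) -> int:
    """Estimate input audio tokens from clip duration, not word count."""
    return round(duration_s * TOKENS_PER_SECOND)

# A clipped "hello" plus leading/trailing silence is easily ~1.6 s of audio,
# which under this rate would meter as ~16 input tokens.
```

Under this model, silence and pauses are billed exactly like speech, which is why word-based intuitions undercount by several times.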


Also, are you certain you are being charged for unconnected calls? If you don’t generate a response, there shouldn’t be any token usage, right?

Thanks for explaining the audio tokens part, makes sense!

As for the unconnected calls: I have not verified this; it was just an attempted explanation. Until now we have not saved the last response.done event for every call id, so I have no insight into token consumption per individual call.

But your explanation makes sense, I will check if we are being charged anything.

If indeed we are not charged, then the high text input cost must be a function of 1a): long prompts, or conversations with frequent session.updated events (i.e. many turns), which some of our conversations are.
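If the full prompt really is part of the context billed on every turn, the growth is easy to model. A minimal sketch with purely illustrative token counts:

```python
def text_input_tokens(prompt_tokens: int, turns: int,
                      history_tokens_per_turn: int = 0) -> int:
    """Total text-input tokens billed across a conversation, assuming the
    full prompt (plus any accumulated history) is re-billed every turn."""
    total = 0
    history = 0
    for _ in range(turns):
        total += prompt_tokens + history
        history += history_tokens_per_turn
    return total
```

For example, a 2,000-token prompt over 20 turns is already 40,000 text-input tokens from the prompt alone, before any conversation history is counted.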

Hey, can you please indicate approximately how many tokens correspond to one minute of speech (including pauses) ?

Hi Maria,

Obviously I cannot answer this for sure, but in our use case (a phone conversation) the price varies a lot because it depends on many other factors (prompt length, turn taking, etc.).

If your use case is not a conversation, then TTS models will suit you better, and there it should also be easier to get a reliable price estimate!

Roughly, VERY roughly: 150 spoken words per minute at ~0.75 words per token gives you around 200 tokens per minute. This varies a lot by language, cadence, the technical nature of the spoken words, etc.
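Spelled out (using the common heuristic that one English text token covers about 0.75 words, i.e. ~1.33 tokens per word — and note this is a text-token estimate, not audio tokens):

```python
words_per_minute = 150     # typical conversational speaking pace
words_per_token = 0.75     # common English tokenization heuristic (assumption)

# Tokens needed for one minute of speech, transcribed as text.
tokens_per_minute = words_per_minute / words_per_token
```

That lands at 200 text tokens per minute; audio-token metering, as discussed above, behaves differently.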

Voice models do not tokenize audio into “words” or language tokens. Thus you cannot use the same “text to learned AI tokens is just a bit worse than per-word encoding” rule you might for written-language tokenization.

The audio frequencies are transformed into a native multimodal embedding that the AI has been trained on.

Experimentally, naturally spoken language on OpenAI consumes about 5x the tokens of the same semantics written as text input (and the cost billed per “token” is then up to 10x higher). However, while I can stop typing for a minute, audio is still a continuous input stream.

That is just for a single input run, though. In the case of a “phone conversation” there is voice-activity detection, where a pause triggers an input run and an AI generation. It is turn-based chat; “realtime” is simply another illusion of presentation.

Then that input/output pairing becomes a growing conversation-history context that is run again for each input, against a 16k or 32k context window before truncation is triggered by an unseen mechanism on “realtime”. The input doesn’t shut off, though, even while the model is speaking a response. Thus the cost of a single “goodbye” can be extreme after a few minutes.
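The growth described above can be sketched numerically. The per-turn token count and the 32k cap here are illustrative assumptions, not measured values:

```python
def billed_input_tokens(turns: int, tokens_per_turn: int,
                        context_cap: int = 32_000) -> int:
    """Cumulative input tokens when the whole history is re-run as input
    on every turn, capped by the context window before truncation."""
    total = 0
    history = 0
    for _ in range(turns):
        history = min(history + tokens_per_turn, context_cap)
        total += history
    return total
```

At an assumed 500 tokens per turn, 30 turns already bill over 230k cumulative input tokens, which is why the final “goodbye” in a long call is so disproportionately expensive.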
