Hi everyone,
after reading through some of the (older) pricing discussions here, I looked into the issue a bit more, because I was unpleasantly surprised by our costs.
First off: our cost right now is ~20 cents per minute, not horrible but still too high for some use cases. As suggested here, I ran some experiments in our playground and in live calls, watching the response.done event to monitor and predict prices, but there are two things I have been unable to wrap my head around:
1. extremely high text input cost
For some of our agents, the "gpt-realtime-2025-08-28 text, input" cost is higher than the audio output cost! I have two potential explanations for this:
a) session.update always resends the entire prompt: I don't know why this is necessary, but it always does. Since our prompt is quite large, this likely inflates this cost component.
-> Has anyone found a way to avoid that?
b) Unconnected calls: We call people outbound, so many calls never connect or are quite short. The agent is started nonetheless, hence the text consumption.
-> Does this make sense? Has anyone come up with a solution for this (I could not think of one)?
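On (b), one pattern I have been considering (just a sketch, not tied to any particular telephony provider; the call-status strings below are hypothetical placeholders) is to defer opening the realtime session until the provider reports the call as answered, so unconnected calls never pay for the instruction prompt:

```python
# Sketch: only start the (expensive) realtime agent once telephony
# reports the call as connected. Status strings are hypothetical --
# map them to whatever your provider actually emits.

CONNECTED_STATUSES = {"answered", "in-progress"}

def should_start_agent(call_status: str) -> bool:
    """Gate the realtime session on call connection."""
    return call_status in CONNECTED_STATUSES

def on_call_status_change(call_status: str, start_session) -> None:
    # start_session would open the websocket and send session.update
    # with the full instruction prompt -- the costly step we defer.
    if should_start_agent(call_status):
        start_session()
```

The trade-off is a small delay before the agent speaks on connected calls, which may or may not be acceptable for your use case.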
2. Persistent difference between expected and real audio tokens
My understanding is that a token is a token. If I only say "Hello", that should be one input token.
For output tokens I would expect the same, though I understand the agent may have "prepared" more tokens than it eventually sends.
However, in both cases the difference between expected and actual input and output tokens is 3-5x, which I find particularly puzzling for the input tokens. For example, saying only "hello" in the playground already produces 16 audio input tokens.
-> What am I missing here? Is the definition of an audio token entirely divorced from text tokens, e.g. due to noise?
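For what it's worth, audio tokens appear to be duration-based rather than word-based. A back-of-the-envelope rate of ~10 audio input tokens per second of audio is consistent with the published per-minute audio pricing (treat that rate as an assumption, not a spec); a ~1.6 s "hello" would then land at ~16 tokens no matter how short the transcript is:

```python
# Rough duration-based estimate of realtime audio input tokens.
# ASSUMPTION: ~10 audio tokens per second of input audio, inferred
# from per-minute pricing -- verify against your own response.done usage.

TOKENS_PER_SECOND = 10

def estimated_audio_tokens(duration_seconds: float) -> int:
    """Estimate audio tokens from clip length, ignoring content entirely."""
    return round(duration_seconds * TOKENS_PER_SECOND)
```

If that model holds, silence and noise cost the same as speech, which would also explain why input tokens diverge so strongly from a text-token intuition.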
Would be keen to hear if somebody has dug deeper into this!
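In case it helps anyone doing the same monitoring: here is a minimal sketch of turning the response.done usage breakdown into a per-response cost estimate. The nested field names follow the usage object I see in the playground; the per-million-token rates are placeholders you should replace with the current prices for your model:

```python
# Sketch: per-response cost estimate from a Realtime API `response.done`
# event. RATES are PLACEHOLDER values in USD per 1M tokens -- substitute
# the current published prices for your model before trusting the output.

RATES_PER_M = {
    "text_in": 4.0,    # placeholder
    "audio_in": 32.0,  # placeholder
    "text_out": 16.0,  # placeholder
    "audio_out": 64.0, # placeholder
}

def cost_from_done_event(event: dict) -> float:
    """Sum cost over the four text/audio x input/output components."""
    usage = event["response"]["usage"]
    ind = usage.get("input_token_details", {})
    outd = usage.get("output_token_details", {})
    tokens = {
        "text_in": ind.get("text_tokens", 0),
        "audio_in": ind.get("audio_tokens", 0),
        "text_out": outd.get("text_tokens", 0),
        "audio_out": outd.get("audio_tokens", 0),
    }
    return sum(tokens[k] * RATES_PER_M[k] / 1_000_000 for k in tokens)
```

Summing this over a call and dividing by call length is how I arrived at the ~20 cents/minute figure above.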
