I’ve read through the few posts here about the pricing challenges with the Realtime API. I’m sharing my observations to help us all understand what’s really going on.
Starting Point:
- Tokens are accumulated and “carried forward”, which has an inflationary impact on total tokens consumed per session.
- The longer the conversation, the more amplified this issue becomes.
- VAD=true is a major contributor to token accumulation.
- Silence/background noise (from the user’s end) impacts token accumulation.
- Token caching doesn’t appear to work at all.
Scenario:
- Conversation over websocket between OAI and a source (e.g. Twilio, Microphone etc).
- Upon wss connection, session is created
- The session is then updated with a system prompt (say ~3K tokens), which counts as text input tokens
- User starts with a “Hello!”
- Audio is streamed (chunked) from the source to OpenAI over the wss connection
- Incoming Audio is first transcribed (supposed to be $0.006/min)
- The transcript is then tokenized (input text tokens at $5/mil)
- The incoming Audio is also tokenized (Audio input tokens at $100/mil)
- AI runs compute, responds with “Well hello there, how can I help you today!”
- This is streamed via wss back to the source
- AI response is tokenized as Audio output tokens ($200/mil)
- AI response is transcribed ($0.006/min)
- Transcript is tokenized as text output tokens ($20/mil)
- The response.done after this exchange should show a breakdown of tokens consumed during this exchange.
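That per-exchange breakdown lives in the `usage` field of the `response.done` event. A minimal sketch of pulling it apart (the field names follow the Realtime API’s usage object, but verify the exact shape against your own payloads before relying on it):

```python
import json

def summarize_usage(event: dict) -> dict:
    """Extract the token breakdown from a response.done event.

    Field names follow the Realtime API usage object; check them
    against your own payloads, as shapes can change.
    """
    usage = event.get("response", {}).get("usage", {})
    in_details = usage.get("input_token_details", {})
    out_details = usage.get("output_token_details", {})
    return {
        "total": usage.get("total_tokens", 0),
        "input_text": in_details.get("text_tokens", 0),
        "input_audio": in_details.get("audio_tokens", 0),
        "cached": in_details.get("cached_tokens", 0),
        "output_text": out_details.get("text_tokens", 0),
        "output_audio": out_details.get("audio_tokens", 0),
    }

# Example payload trimmed to the fields we read (values are illustrative):
event = {
    "type": "response.done",
    "response": {
        "usage": {
            "total_tokens": 1310,
            "input_token_details": {"text_tokens": 120, "audio_tokens": 890, "cached_tokens": 0},
            "output_token_details": {"text_tokens": 60, "audio_tokens": 240},
        }
    },
}
print(json.dumps(summarize_usage(event)))
```

Logging this per turn is the quickest way to see the carry-forward effect: the input totals keep growing even when each new user utterance is short.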
Round 2:
- User says: “Well I was wondering if you can tell me a short poem by William Wordsworth”
- Incoming audio is tokenized like before
- Transcribed
- Tokenized as text
- AI starts preparing the poem which is 1 min long (say)
- Text output is tokenized first
- As text is turned into audio, audio output tokens are generated
- Transcript is also tokenized as text output tokens as well as the $0.006/min cost is accumulated
Now this is where it gets, say, problematic:
While the AI is reciting the poem, there are two possibilities:
- The user listens quietly (silence on their end)
- The user decides to interrupt
If VAD is set to true, with the default settings (0.5, 300ms, 500ms), then while the user is silent, audio is still being streamed from the source and possibly being tokenized. How or why is anyone’s guess right now, but silence IS being tokenized for sure.
If there is ambient/background noise, it will also get tokenized, because the audio is needed for VAD to maintain its function (turn_detection) and to filter out noise based on the VAD settings (0.5, 300ms, 500ms).
What’s happening (which is counter-intuitive from a developer’s perspective) is that even though VAD detects the incoming silence/noise as such, that audio is still tokenized as audio input tokens ($100/mil). What’s even worse is that, since these tokens are accumulated over turns, those silences add to the token count for no apparent value.
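For reference, those three numbers (0.5, 300ms, 500ms) map onto the server-VAD knobs in a `session.update` event. A minimal sketch of the payload (parameter names follow the Realtime API’s `turn_detection` object); note that tuning these may reduce how much noise is treated as speech, but it does not stop audio from being streamed in the first place:

```python
import json

def build_session_update(threshold: float = 0.5,
                         prefix_padding_ms: int = 300,
                         silence_duration_ms: int = 500) -> dict:
    """Build a session.update event tuning server VAD.

    Defaults mirror the settings referenced in this post. A higher
    threshold or longer silence_duration_ms makes VAD less trigger-happy
    on ambient noise, but the client is still streaming audio regardless.
    """
    return {
        "type": "session.update",
        "session": {
            "turn_detection": {
                "type": "server_vad",
                "threshold": threshold,
                "prefix_padding_ms": prefix_padding_ms,
                "silence_duration_ms": silence_duration_ms,
            }
        },
    }

payload = build_session_update()
print(json.dumps(payload))
```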
If the user interrupts, it’s worse: the 1-minute poem that got cut off, say 10 seconds in, as identified by @anon22939549 here, is still accumulated (in whole) and carried forward unless discarded by calling either the conversation.item.truncate or conversation.item.delete event.
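A sketch of that truncate event for the cut-off poem (the event shape follows the Realtime API’s conversation.item.truncate; the item id is made up, and `played_ms` has to come from your own client-side playback clock, since the server doesn’t know how much audio was actually heard):

```python
import json

def build_truncate_event(item_id: str, played_ms: int) -> dict:
    """conversation.item.truncate drops the unplayed tail of an
    assistant audio item, so only what the user actually heard is
    kept (and carried forward) in the conversation context.
    """
    return {
        "type": "conversation.item.truncate",
        "item_id": item_id,
        "content_index": 0,   # index of the audio content part
        "audio_end_ms": played_ms,
    }

# If the poem was cut off 10 seconds in (item id is hypothetical):
event = build_truncate_event("msg_poem_001", played_ms=10_000)
print(json.dumps(event))
```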
So an intermediate “observer” is needed to
- Handle VAD ourselves (since OpenAI’s VAD is what’s making the process cost-prohibitive)
- Maintain a dict of messages and progressively reduce context by calling conversation.item.delete/truncate
Both options have several pros and cons, and since most of us here are devs, I don’t need to dive into exactly what those are… there are pros and cons nonetheless.
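The second option (maintain a dict of messages and progressively reduce context) can be sketched roughly like this, assuming you record item ids from `conversation.item.created` server events and emit delete events once a window is exceeded (the class and its window size are my own illustration, not anything the API provides):

```python
from collections import OrderedDict

class ContextPruner:
    """Track conversation items and emit conversation.item.delete
    events for the oldest ones once the window is exceeded.

    Item ids should be captured from conversation.item.created
    server events; the sliding-window policy here is just one
    possible strategy (you may want to always keep the system turn,
    summarize dropped turns, etc.).
    """

    def __init__(self, max_items: int = 10):
        self.max_items = max_items
        self.items: "OrderedDict[str, str]" = OrderedDict()  # item_id -> role

    def on_item_created(self, item_id: str, role: str) -> list:
        """Register a new item; return delete events to send, if any."""
        self.items[item_id] = role
        deletes = []
        while len(self.items) > self.max_items:
            old_id, _ = self.items.popitem(last=False)  # oldest first
            deletes.append({"type": "conversation.item.delete", "item_id": old_id})
        return deletes

pruner = ContextPruner(max_items=2)
pruner.on_item_created("msg_001", "user")
pruner.on_item_created("msg_002", "assistant")
evts = pruner.on_item_created("msg_003", "user")
print(evts)  # the oldest item (msg_001) falls out of the window
```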
So, what is the solution?
Challenges:
If vad=true:
Truncate only if silence is detected (I haven’t seen any such markers coming back from the server events)
Truncate when an interruption is detected: this is available from the speech_stopped server event, but it isn’t really helpful since it doesn’t tell us the reason why speech was stopped:
{
  "event_id": "event_1718",
  "type": "input_audio_buffer.speech_stopped",
  "audio_end_ms": 2000,
  "item_id": "msg_003"
}
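In practice, the complementary speech_started event is what most barge-in handlers key off: if the user starts speaking while the assistant is still playing, that is an interruption by definition. A sketch of that dispatch, assuming you track playback state yourself (the `state` dict and its keys are my own; populating them from response.* events is left out):

```python
import json

def handle_server_event(event: dict, state: dict) -> list:
    """Return client events to send in reaction to a server event.

    `state` is client-side bookkeeping (is assistant audio playing,
    which item is it, how many ms have we played). The speech_started
    event name is the Realtime API's VAD event; the barge-in policy
    here is one common pattern, not the only option.
    """
    out = []
    if event.get("type") == "input_audio_buffer.speech_started" and state.get("playing"):
        # Barge-in: stop generation, then drop the unheard audio tail
        # so it is not carried forward as context.
        out.append({"type": "response.cancel"})
        out.append({
            "type": "conversation.item.truncate",
            "item_id": state["current_item_id"],
            "content_index": 0,
            "audio_end_ms": state["played_ms"],
        })
        state["playing"] = False
    return out

state = {"playing": True, "current_item_id": "msg_003", "played_ms": 2000}
events = handle_server_event({"type": "input_audio_buffer.speech_started"}, state)
print(json.dumps(events))
```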
If vad=false
The only thing to consider then is managing truncation/deletion at regular intervals, as someone here suggested, which provides a pathway.
How are you all handling this issue? Token accumulation can make voice conversations horrendously expensive and commercially unviable.