Realtime API Pricing: VAD and Token Accumulation - A KILLER

I’ve read through the few posts here about the pricing challenges with the Realtime API. I’m going to share my observations to help us all understand what’s really going on.

Starting Point:

    • Tokens are accumulated and “carried forward” which has an inflationary impact on total tokens consumed per session.
    • The longer the conversation, the more amplified this issue becomes
    • VAD=true is a major contributor to token accumulation
    • Silence/Background Noise (from the user end) impacts token accumulation
    • Token caching doesn’t appear to work at all

Scenario:

  • Conversation over websocket between OAI and a source (e.g. Twilio, Microphone etc).
  • Upon wss connection, session is created
  • The session is then updated with a system_prompt (say, 3K tokens long), which counts as text input tokens (see the session.update sketch just after this list)
  • User starts with a “Hello!”
  • Audio is streamed (chunked) from the source to OpenAI over the wss connection
  • Incoming audio is first transcribed (supposedly at $0.006/min)
  • The transcript is then tokenized (text input tokens at $5/mil)
  • The incoming audio is also tokenized (audio input tokens at $100/mil)
  • AI runs compute, responds with “Well hello there, how can I help you today!”
  • This is streamed via wss back to the source
  • AI response is tokenized as Audio output tokens ($200/mil)
  • AI response is transcribed ($0.006/min)
  • Transcript is tokenized as text output tokens ($20/mil)
  • The response.done event after this exchange should show a breakdown of the tokens consumed during this exchange.
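
For reference, a minimal sketch of what that session.update step can look like over the wss connection (assuming the same openAiWs handle used later in this thread; the prompt text, transcription model and VAD numbers are placeholders for whatever your session actually uses):

const SYSTEM_PROMPT = '...your ~3K-token system prompt...'; // billed as text input tokens

openAiWs.send(
	JSON.stringify({
		type: 'session.update',
		session: {
			instructions: SYSTEM_PROMPT,
			input_audio_transcription: { model: 'whisper-1' }, // the $0.006/min transcription
			turn_detection: {
				type: 'server_vad',
				threshold: 0.5,
				prefix_padding_ms: 300,
				silence_duration_ms: 500,
			},
		},
	}),
);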

Round 2:

  • User says: “Well I was wondering if you can tell me a short poem by William Wordsworth”
  • Incoming audio is tokenized like before
  • Transcribed
  • Tokenized as text
  • AI starts preparing the poem which is 1 min long (say)
  • Text output is tokenized first
  • As text is turned into audio, audio output tokens are generated
  • The transcript is also tokenized as text output tokens, and the $0.006/min transcription cost accrues as well

Now this is where it gets, say, problematic:

While the AI is reciting the poem, there are two possibilities:
  • The user is listening quietly (silence on their end)
  • The user decides to interrupt

If VAD is set to true with the default settings (threshold 0.5, prefix padding 300 ms, silence duration 500 ms), then while the user is silent, audio is still being streamed from the source and possibly being tokenized. How or why is anyone’s guess right now, but silence IS being tokenized for sure.

If there are ambient/background noises, they will also get tokenized, because the audio is needed for the VAD to maintain its function (turn_detection) and to filter out the noise based on the VAD settings (0.5, 300 ms, 500 ms).

What’s happening (which is counter-intuitive from a developer’s perspective) is that although VAD should detect the incoming silence/noise as such and not count it toward the audio input tokens ($100/mil), it apparently does. What’s even worse is that, since these tokens are accumulated over turns, those silences keep adding to the token count for no apparent value.

If the user interrupts, it’s worse: the one-minute poem that got cut off, say, 10 seconds in, as identified by @anon22939549 here, is still accumulated (in whole) and carried forward, unless it is discarded by sending either the conversation.item.truncate or conversation.item.delete event.
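
For completeness, a truncate call would look roughly like this (the item_id and the 10000 ms are placeholder values matching the “cut off 10 seconds in” example above; content_index 0 points at the audio content part):

{
  "type": "conversation.item.truncate",
  "item_id": "msg_003",
  "content_index": 0,
  "audio_end_ms": 10000
}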

So an intermediate “observer” is needed to

  1. Handle VAD ourselves (since OpenAI’s VAD is what’s making the process cost prohibitive)
  2. Maintain a dict of messages and progressively reduce context by calling conversation.item.delete/truncate

Both options have several pros and cons, and since most of us are devs here, I don’t need to dive into exactly what those are… there are pros and cons nonetheless.
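
For option 1, here’s a minimal sketch of what “handling VAD ourselves” means in practice, assuming the same openAiWs handle as later in this thread: disable server VAD via turn_detection: null, after which your own endpointing logic has to commit the audio buffer and request a response whenever it decides the user’s turn has ended.

// disable OpenAI's server-side VAD
openAiWs.send(JSON.stringify({
	type: 'session.update',
	session: { turn_detection: null },
}));

// later, whenever your own VAD decides the user has finished speaking:
openAiWs.send(JSON.stringify({ type: 'input_audio_buffer.commit' }));
openAiWs.send(JSON.stringify({ type: 'response.create' }));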

So, what is the solution?

Challenges:

If vad=true:
  • Truncate only when silence is detected: I haven’t seen any such markers coming back from the server events.
  • Truncate when an interruption is detected: this is available from the speech_stopped server event, but it’s not really helpful since it doesn’t tell us why speech was stopped:
{
  "event_id": "event_1718",
  "type": "input_audio_buffer.speech_stopped",
  "audio_end_ms": 2000,
  "item_id": "msg_003"
}

If vad=false
The only thing to consider then is managing truncation/deletion at regular intervals; as suggested by someone here, that provides a pathway.

How are you guys handling this issue? Token accumulation can make voice conversations horrendously expensive and commercially unviable.

8 Likes

I’ve confirmed with my own testing that you are not paying for silence. You’re told that an interruption occurred if something other than silence is detected. If you’re not getting that event then you’re not paying for additional tokens.

The realtime service generates the outgoing audio way faster than you can play it back. You can see this by looking at the outgoing server events. So should the user interrupt after 10 seconds, it’s too late: the audio for the last minute’s worth of response has already been generated.

The realtime service actually has no clue that an interruption occurred. It knows that the user spoke (assuming vad is on) but it doesn’t know what you did in response to the interruption event. If you did cut off the outgoing audio it has no clue how much you played to the user.

When the next turn occurs it’s going to include all of the tokens that were generated in the conversation history whether they were fully played back or not. If you stop the generation before it’s done it will shorten what’s stored in conversation history but it’s likely to contain some tokens that weren’t played to the user.

Look for my post where I ask the model to count to 100, interrupt it, and then ask how high it counted. It thinks it counted 2 or 3 numbers higher than what it generated audio for.

1 Like

not really at all - it’s just too cost prohibitive.

As @stevenic indicated, I don’t think VAD is actually the big issue here - the fundamental issue is pricing.

Pricing (and/or OpenAI’s architecture) has always been a huge issue if you wanted to implement your own sampler or do any sort of multi model context splicing.

OpenAI needs to change how pricing works, their architecture and their product design.

The big draw of realtime was that hypothetically you could have a “realtime” conversation with a model. If you neuter VAD to the point where the user has to press a $5 button every time they want to prevent the model from going off topic, that’s not really the point. You might as well just use whisper and microsoft sam at that point. What I’m trying to say is that ideally, VAD and turn detection shouldn’t even be a thing, but I guess we’re still a couple years away from that.

You should be able to clap along to what the model is saying without having to worry about how much it costs. And the model should know that you’re clapping along.

But that’s not happening the way they implemented it.

2 Likes

A simple fix is to just limit the number of turns of conversation history you maintain. You should be able to prune to the last 3 turns and still have linguistic features like co-reference work.
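
A rough sketch of that pruning, assuming you track item IDs from conversation.item.created server events (keepTurns and itemIds are just illustrative names):

const itemIds = []; // push response.item.id here on every conversation.item.created
const keepTurns = 3; // roughly one user + one assistant item per turn

function pruneHistory(openAiWs) {
	// delete everything older than the last keepTurns exchanges
	while (itemIds.length > keepTurns * 2) {
		const oldId = itemIds.shift();
		openAiWs.send(JSON.stringify({
			type: 'conversation.item.delete',
			item_id: oldId,
		}));
	}
}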

The default output token count they use in their examples is 2k tokens which is a lot. That may seem kind of high given the cost but I suspect they did that to make sure most tool calls don’t get truncated.

Another best practice would be to keep your tools small and limit the number of parameters returned. That should let you drop the output tokens down, which will also save money.
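
If I’m reading the session config right, that output cap is the max_response_output_tokens field; treat the exact field name and the 512 value below as assumptions to check against the current docs:

openAiWs.send(JSON.stringify({
	type: 'session.update',
	session: { max_response_output_tokens: 512 }, // down from the ~2k used in the examples
}));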

2 Likes

in that case you’d still need to do your own transcription to pull any RAG references (since you can’t store it in the context)

and even then I don’t think you can come to any sort of cost parity with a human who meets or exceeds 4o intelligence.

1 Like

This is for:

  • Startups that have participated in OpenAI’s programs so they get VC stakes for their funding, where you are the partners seen in announcements who already have the apps rolling, and you can burn millions for years to monopolize against competition in the potential vertical and increase OpenAI’s valuation,
  • You are a guerilla marketing influencer who gets early access and personal consultation to have launch day demonstration videos to release after signed embargo is lifted,
  • You are OpenAI’s customer, the corporation that pays $1-3 million dollars for custom models and development consulting, based on that sales presentation.

or else

  • you are the competition, and the tokens of your three bland voices cost 10x.
4 Likes

I understand the point you’re trying to make with regard to “trimming” the context. But what about this:

Each audio input/output is being transcribed anyway. So what’s the point of retaining the audio tokens once they have been transcribed? Why not delete the audio event_id and retain the transcription, which goes toward text tokens (much cheaper)?

Shouldn’t that prevent the audio tokens from getting inflated?

I still don’t understand how they use/intend to use cached tokens. If anything, the system prompt tokens should come out of the text input tokens and go into cached tokens instead of being carried forward all the way till kingdom come…

Am I missing something?

So that’s it?
Another case of “started with: Do no evil” but ended up in “Alphabet” soup?

That’s just the impression I get - that there is another customer they are going for instead of those that would visit the forum or be on the payment tier system. Ones that can either afford this or ones that don’t even get this pricing. Or the ultimate customer is those that would sign up for ChatGPT subscriptions.

1 Like

The transcription isn’t going to be as accurate as retaining the actual audio. It’s not going to have information like the fact the model was whispering, what sound effects were generated, or the prosody and emotion that you heard. It’s just going to have the text and may have errors or missing info.

1 Like

So then… perhaps maintain a dictionary of all audio events, retain the last 3 (or 5) and delete the rest to somewhat reduce the carry-forward load?

I guess, in a long conversation (>5 mins with at least 10 exchanges) there’s no way to reach the $0.06/min input and $0.24/min output cost range as stipulated in the original announcement.

Clever system design, reflection of a company going from “not-for-profit” to “I want it all”

Moore’s law’esque effect leading to token cost reduction over time is probably the only saving grace for the (hopefully) not-too-distant future.

1 Like

This is an encouraging response from OpenAI Staff:

Thank you @jeffsharris

6 Likes

I don’t really think that the mechanics of the API are the major issue for cost. The major issue is the cost itself: $0.24 per minute is just way, way too much. I can get high resolution video streams with audio for that.

Well @michael.glenn.willia as you can see from this thread and other similar ones on this forum, even getting it to $0.24/min is near impossible in the current token carry-over situation. Conversations between 2-4 mins are ranging around the $0.45-$0.65/min mark. Conversations around 5-7 minutes are crossing the $1.20-$1.50/min mark. That’s per minute.
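
A back-of-the-envelope illustration of why the per-minute figure climbs (the ~600 audio-input-tokens/min rate is just what the announced $0.06/min implies at $100/mil, and the “turn every 30 seconds” cadence is an assumption):

const tokensPerMinOfAudio = 600; // implied by $0.06/min at $100/mil
const tokensPerTurn = tokensPerMinOfAudio / 2; // assume a turn roughly every 30 seconds

let billed = 0;
for (let turn = 1; turn <= 20; turn++) {
	// with full carry-over, turn N re-sends all N audio chunks so far as input
	billed += tokensPerTurn * turn;
}
console.log(billed); // 63,000 input audio tokens over ~10 minutes,
                     // vs. 6,000 if history were not re-billed (roughly 10x inflation)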

So unless OpenAI changes its pricing, or improves token caching, this is a non-starter for 99.99% of commercial use-cases. And then, there are only 3 voices that currently work, and the most expressive one (Nova) is not among them… and no word on when/if Advanced Mode voices are ever going to be included in the API.

But… for now, it’s the only speech-to-speech interface on the market (that I’m aware of), and going down the speech-text-LLM-text-speech route has an overhead of anywhere between 250ms and 1500ms depending on network conditions, based on a bunch of tests.

Well, I have been playing around with the realtime API since its launch. Like many out there, I have also been looking for a way to reduce the costs to the point where it at least becomes financially usable.

Depending on the language of the conversation and how much you interrupt, I have noticed the following average cost per minute when using the API via a frontend client (e.g. the browser) or telephony (e.g. Twilio).

  1. The first minute is almost always around $0.30/min
  2. At the 10-minute mark, the average price per minute works out to around $1.20/min

However, after intensive research and playing around with every way I could think of to reduce costs, here’s the lowest I’ve managed to achieve so far. Again, remember when I say per minute, I mean per minute of, let’s say, a telephonic conversation in which almost half the time it would be input tokens and the other half, output tokens.

  1. The first minute of conversation never exceeds $0.15/min
  2. At the 10-minute mark, the total cost averages just above $2. That’s around $0.22/min

Insane right? How you may ask? Well, here’s what I did.

  1. I don’t care about retaining the audio, either input or output. I listen to the transcription completion events for both, accumulate the transcriptions and keep deleting all conversation items from the history using conversation.item.delete events.
  2. I make another call to the Assistants API and pass down the transcript of the user and the agent accumulated so far. I have heavily fine-tuned prompts for the assistant, which produce the fewest possible words in the form of a summary of the conversation. It reduces the text of the transcript by about 3.5 times.
  3. The heavily summarized context of the conversation so far is fed back into the Realtime session to make sure the agent always has context of the ongoing conversation.

Now, of course, that’s easier said than done; in reality, I faced a lot of trouble implementing what I just described above. However, I have not noticed any significant drop in performance or loss of memory/context even after 10 minutes or so.

For the insane cost reduction, it’s absolutely a no brainer.

9 Likes

Well done and kudos to you for sharing it here. Just to clarify: you’re deleting the audio input and output events, and once transcription is done, you pass the input/output transcripts to the assistant for summarization of that interaction?

How is that affecting latency? Are you seeing a significant delay?

I implemented a “keep_last” rule the other day to keep the last n audio inputs to maintain context (to avoid latency with assistant interjections). At n=3, the cost reduction is around 15%. At n=5, the reduction is negligible. So the keep_last concept doesn’t seem to be very effective.

If your method isn’t impacting latency that much, then I’d be keen to experiment a little.

There is no impact on the delay at all. This is because I first buffer a few (ideally 3) transcriptions and simultaneously keep a buffer of the IDs of the conversation’s items. To help you further, collect the following:

if (response.type === 'conversation.item.created') {
	// track every new conversation item's ID, then prune older items
	conversations.push(response.item.id);
	truncateConversation();
}

if (response.type === 'response.audio_transcript.done') {
	// assistant audio finished: append its transcript to the running text
	transcript += `Assistant:\n${response.transcript.trimEnd()}\n`;
}

if (
	response.type ===
	'conversation.item.input_audio_transcription.completed'
) {
	// user audio transcription finished: append it as well
	transcript += `User:\n${response.transcript.trimEnd()}\n`;
}

As you can see, on each conversation item created, I push its ID to the buffer and call a truncation operation. The operation will only proceed once at least 3 items exist. This buffer eliminates the chance of any delay.
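
(@zia.khan didn’t share truncateConversation itself, so purely as a guess at its shape, reusing the conversations buffer and openAiWs from the snippets above; the keep-3 threshold follows the description, the rest is hypothetical:)

function truncateConversation() {
	// only start deleting once more than 3 items are buffered, so the model keeps some recent context
	while (conversations.length > 3) {
		const itemId = conversations.shift(); // oldest item ID first
		openAiWs.send(JSON.stringify({
			type: 'conversation.item.delete',
			item_id: itemId,
		}));
	}
}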

Also, you can see I separately collect the transcript for both the agent and the user and append everything as one continuous text. This is to be used for summarizing.

Another trick that I learned after a struggle is that if you feed the summary via a conversation.item.create event, the model stops producing audio. So, I do the following:

openAiWs.send(
	JSON.stringify({
		type: 'session.update',
		session: {
			instructions: `${SYSTEM_MESSAGE}\nYou are already in an ongoing call conversation with the user. Here is a summary of your conversation so far for your reference:\n${summary}`,
		},
	}),
);

You see I append the summary to the pre-defined, initial system message and keep updating the ongoing session. :smiley:

One more trick: after the first 3 summary iterations, I also summarize twice. In other words, a summary of the summary each time until the end. This is because even the summaries start becoming larger. Here’s an example from a 3-minute-long conversation: both the original summary at the end and the summary of that summary. Again, no delay at all, even with a double summary, because of the buffer.

First summary: this is only 115 text tokens. A summary of 3 minutes :smiley:

You confirmed you can speak Urdu. User asked about a doctor, you provided information about Dr. Mary Smith, a dentist serving Anaheim. You mentioned her schedule (Mon-Sat, 8 AM-5 PM), Sundays closed. User requested an appointment for tomorrow, you asked for the name (Zia Ur Rehman Khan) and contact number (0341******). User prefers afternoon, you suggested 3 PM, which was confirmed. You initially noted the phone number incorrectly but corrected it. Appointment confirmed for 3 PM, both ended with good wishes.

And the summary of that summary: Only 68 tokens this time.

You confirmed Urdu. User asked about Dr. Mary Smith, dentist Anaheim, Mon-Sat 8-5, Sun closed. User wanted appointment tomorrow, name Zia ur Rehman Khan, contact 0341*****. Preferred afternoon, confirmed 3 PM. Corrected phone error. Ended with good wishes.

I have redacted the phone numbers in there just before pasting here. And while I am at it, here’s the simple prompt I use for generating the second summary. I would share the other prompts as well but they are too lengthy.

this.summarizeAgain(
	`The following is a summary of a transcript that you produced earlier. It is already pretty concise but please shorten it even further. Remember that grammar does not matter at all. Ignore grammar to shorten the text as much as possible.\n\n${gptResponse.value}`,
);

Lastly, to save costs and improve both speed and efficiency when generating summaries, don’t use the Chat Completions API. Use the Assistants API instead. This is because, with normal chat completions, you have to pass the entire previous text on each summary generation to retain context. This gets way too lengthy, hence more tokens and slower responses. The Assistants API does not need that because it keeps a thread.
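
(For what it’s worth, a minimal sketch of that thread-based summarizer flow using the openai Node SDK; the assistant ID, function name and polling choice are assumptions/placeholders, not @zia.khan’s actual code:)

import OpenAI from 'openai';

const openai = new OpenAI();
const thread = await openai.beta.threads.create(); // one thread reused for the whole call

async function summarize(transcriptChunk) {
	// only the new transcript chunk is sent; the thread already holds the earlier turns
	await openai.beta.threads.messages.create(thread.id, {
		role: 'user',
		content: transcriptChunk,
	});
	const run = await openai.beta.threads.runs.createAndPoll(thread.id, {
		assistant_id: 'asst_your_summarizer', // assistant with the summarization prompt baked in
	});
	if (run.status !== 'completed') return null;
	// newest message first by default; grab the assistant's summary text
	const messages = await openai.beta.threads.messages.list(thread.id, { limit: 1 });
	return messages.data[0].content[0].text.value;
}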

Like I said earlier, there’s still quite a bit more to what I was able to put together. I sincerely hope that helps but feel free to ask if you get stuck.

2 Likes

It seems like a clever approach. But if you are deleting the audio input/outputs straight after receiving their transcripts, even while you are collecting the transcripts in the buffer and doing the summarization, you haven’t noticed any context/prosody loss with the voice?

I guess the time delay between audio in/out and transcript in/out, seems to provide enough of a window for the AI to maintain context before it gets updated via the session.update event.

I might DM you further on this… thanks for your valuable input @zia.khan

No, I haven’t noticed any difference in the voice of the AI agent, really. Not in terms of context especially. Although I still believe there might be some effect on things like emotion in the voice, etc. Another reason I guess why we can’t really be so sure right now is that the current voices, unlike the Advanced Voice of ChatGPT, don’t do much in the way of emotion, whispering, ups and downs, etc.

I bet that if we continue to feed the complete context, it should be fine even when OpenAI releases more Advanced voices.

Feel free to DM.

Just noticed a marked difference in voice quality (used shimmer) in terms of emotions and tone in the Playground vs. the API. Any clues?