Realtime API pricing is wrong, will overcharge

Diet · October 16, 2024, 6:14am

Will you guys ever allow devs to

download/upload/patch/merge/splice/manage this cache data?

I’m expecting the answer will be no, and the reason will be “safety”, but I thought I’d ask.

Would you guys ever consider going away from charging naively for input tokens? (hardware time based pricing, or something) (i guess cache discounts are a step in the right direction, but the discount isn’t all that significant for certain use cases.)

jeffsharris · October 16, 2024, 2:13pm

giving more controls over cache behavior is something we’re open to (e.g. pay to avoid cache eviction) if there’s enough interest

there’s a balance between having our pricing reflect our cost structure and keeping pricing comprehensible. cache discounts are one step in the direction of reflecting cost structure and we plan to keep moving that way

ivan-luchkin-u · October 16, 2024, 2:22pm

I have a suspicion that there’s something wrong going on in the token calculation. Even during the playground tests, I can see that the response.done event contains the same number of input audio tokens as the one before, even if all that happened between them is a conversation.item.create as a response to the function call. So, no new audio has been added at all, yet the same number of audio tokens is reported, and it gets added to the overall usage for the session, which is what I assume gets billed in the end.

liquidshadowsmk · October 17, 2024, 6:05am

If you look closely, you will notice that these input tokens are getting carried forward at each turn, reflected in the response.done server event. As I understand, or as what should be happening is the tokens from system_prompt as well as conversation [0] to conversation[n] should be reflected as cached tokens for conversation[n+1]'s response.done event. But its not. Hence, the end price is currently touching between 55c/min to sometimes 79c/min depending on the length of the conversation and the number of turns.

Either way, for this to be commercially viable, it needs to be well below the $0.06/min input and $0.24/min output. I have run around 2 dozen tests and I have never managed to get to the $0.24/min output.

_j · October 17, 2024, 6:32am

Here’s your free technology: Prebuilt Quality Attention (PQA).
(don’t go Googling; the thought didn’t exist an hour ago)

All those “efficiencies” that are used in masking and other hyperparameters that are used in the general-purpose model that is released, that allows it to burn through an input context in fraction of a second? Max them out in such a build-once-run-many scenario. Token rate and computation on input context is not a concern when one can pay a gpt-4o-max server fee for a context cache for your application’s system instruction and tools.

You tell me it is impossible…

lucasvan · October 28, 2024, 9:45am

+1

This is exactly what I have been seeing in my experiments over the past days and is the #1 reason for extremely high costs: the input tokens are being appended from turn to turn.

This sums it up pretty well:

Total tokens used for a regular 3 minute call:
input text. 16k
input audio : 19k
output text : 0.6k
output audio: 1.2k

prices:
input text. 0.08
input audio : 1.90
output text : 0.13
output audio: 0.25

So although output audio is by far the most expensive per token, the price is for 85% dependant on the input audio tokens being carried over.

The only reasonable way to use the Realtime API currently is to use a separate streaming STT model so that we only ingest text input and since that is so cheap, the appending of the tokens is not a huge deal.

I can’t imagine that this was known and planned by the right people at OpenAI and if it’s now, I can’t imagine that this is not the top priority to fix this somehow? @openai, you are currently burning through all your application developers…

@jeffsharris do you have an update on caching mechanisms to address this issue? Timeline?

The #2 reason for high cost is indeed (accidentally) generating a huge amount of output tokens by asking a question which will generate a huge answer such as: “write me 10 different stories about…”. Indeed as listed above somewhere, whether you actually listen to the output tokens or whether you interrupt the output by asking some other question is obviously not going to help you… Once the output is created you will get billed. But that will just be something you will have to think carefully about as an application developer in my opinion…

jeffsharris · October 28, 2024, 1:56pm

Appreciate your diligence on the details @lucasvan

That sounds about right to me. And we’re going to keep pushing the cost structure so you don’t need to run separate STT in order to reduce audio input cost.

We’re polishing our first big round of caching changes. Hoping to turn on some prompt caching features in the next couple weeks

And we have line of sight on further ways to reduce cost

liquidshadowsmk · October 28, 2024, 2:57pm

That’s great to hear @jeffsharris - when are the rest of the voices going to be made available through the API?

As an additional “food for thought”, given the ongoing cost issues, would OpenAI consider waiving a portion of the token costs for API Calls till a specific cut-off date, allowing devs to thoroughly test various use-cases and scenarios without needing a second mortgage?

As you can imagine, anyone building a serious voice application needs to put the system through a lot of different scenarios and with the current cost regime, this is a huge issue before the product is even commercially viable.

j0rdan · October 29, 2024, 7:30am

Exactly, the huge cost of just testing things out is currently preventing us from using this API (given the problems with the cumulative audio tokens). Although I don’t think OpenAI will do that, I really like your food for thought @liquidshadowsmk

liquidshadowsmk · October 29, 2024, 7:46am

@j0rdan at the very least, they should make Realtime API usage in the playground free for conversations up to X minutes before deducting $$ from the balance… Although, in the playground, some aspects aren’t apples-apples in terms of playground vs. production behaviour (e.g. function calls), as devs, we could live with that I suppose… especially in terms of tweaking system prompts to test variable scenarios and trying other voices (of which, there still are only 3 in total. shame!).

If OpenAI is still a developer-first org, then either the previous food for thought, or the “actionable step” mentioned above would only benefit them in the long run.

they have little/nothing to lose.

platypus · October 30, 2024, 10:27am

A presenter is speaking to an audience with a projected slide displaying graphs and pricing information in a dimly lit venue. (Captioned by AI)4032×3024 2.48 MB

from DevDay

vb · October 30, 2024, 10:31am

Prompt Caching announced for the realtime API at today’s Dev Day event.

RonaldGRuckus · October 30, 2024, 2:17pm

Wooo!!

Prompt caching for RealTime API is an amazing implementation. Thanks for sharing.

liquidshadowsmk · October 31, 2024, 4:03am

Just tested it. I don’t see a massive diff in pricing. Also unable to understand what is cached and what isn’t. Additional doco on this would be helpful. Appreciate the 5 new voices though… very cool.

j0rdan · October 31, 2024, 11:30am

A step in the right direction!

Topic		Replies	Views
New Realtime API voices and cache pricing Announcements realtime , prompt-caching	26	3125	November 27, 2024
Introducing the Realtime API Announcements	26	5478	December 17, 2024
Realtime API Pricing: VAD and Token Accumulation - A KILLER Community token , pricing , tokenization , realtime	21	1820	October 23, 2024
Cached input audio_tokens is always 0 API audio , realtime	3	118	November 8, 2024
Prompt caching (automatic!) Announcements	19	2021	October 9, 2024

Realtime API pricing is wrong, will overcharge

Related topics