download/upload/patch/merge/splice/manage this cache data?
I’m expecting the answer will be no, and the reason will be “safety”, but I thought I’d ask.
Would you guys ever consider moving away from charging naively for input tokens? (Hardware-time-based pricing, or something.) (I guess cache discounts are a step in the right direction, but the discount isn't all that significant for certain use cases.)
giving more control over cache behavior is something we're open to (e.g. paying to avoid cache eviction) if there's enough interest
there’s a balance between having our pricing reflect our cost structure and keeping pricing comprehensible. cache discounts are one step in the direction of reflecting cost structure and we plan to keep moving that way
I have a suspicion that something is wrong in the token calculation. Even during playground tests, I can see that the response.done event contains the same number of input audio tokens as the previous one, even when all that happened between them was a conversation.item.create sent as a response to a function call. So no new audio was added at all, yet the same number of audio tokens is reported, and it gets added to the overall usage for the session, which is what I assume gets billed in the end.
If you look closely, you will notice that these input tokens are getting carried forward at each turn, as reflected in the response.done server event. As I understand it, what should happen is that the tokens from the system_prompt as well as conversation[0] through conversation[n] are reported as cached tokens in conversation[n+1]'s response.done event. But they're not. Hence, the end price is currently landing between $0.55/min and sometimes $0.79/min, depending on the length of the conversation and the number of turns.
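For reference, this is roughly how I'm logging the per-turn usage to check whether earlier turns ever show up as cached. A minimal sketch using the `websockets` package; the usage field names are what I see in the current preview events, so treat them as assumptions and check the event reference:

```python
# Minimal sketch: log per-turn usage from response.done events to verify
# whether earlier turns are ever reported as cached_tokens.
import asyncio
import json
import os

import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-10-01"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}

async def log_usage():
    # Note: newer websockets releases use `additional_headers` instead of `extra_headers`.
    async with websockets.connect(URL, extra_headers=HEADERS) as ws:
        async for raw in ws:
            event = json.loads(raw)
            if event.get("type") != "response.done":
                continue
            usage = event["response"].get("usage", {})
            in_details = usage.get("input_token_details", {})
            print(
                f"turn usage: input={usage.get('input_tokens')} "
                f"(audio={in_details.get('audio_tokens')}, "
                f"text={in_details.get('text_tokens')}, "
                f"cached={in_details.get('cached_tokens')}), "
                f"output={usage.get('output_tokens')}"
            )

asyncio.run(log_usage())
```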
Either way, for this to be commercially viable, it needs to be well below the $0.06/min input and $0.24/min output. I have run around 2 dozen tests and I have never managed to get to the $0.24/min output.
Here’s your free technology: Prebuilt Quality Attention (PQA).
(don’t go Googling; the thought didn’t exist an hour ago)
All those "efficiencies" in masking and other hyperparameters used in the released general-purpose model, the ones that let it burn through an input context in a fraction of a second? Max them out in such a build-once-run-many scenario. Token rate and computation on the input context are not a concern when one can pay a gpt-4o-max server fee for a context cache holding your application's system instructions and tools.
This is exactly what I have been seeing in my experiments over the past few days, and it is the #1 reason for extremely high costs: the input tokens are being appended from turn to turn.
This sums it up pretty well:
Total tokens used for a regular 3-minute call:
input text: 16k
input audio: 19k
output text: 0.6k
output audio: 1.2k
So although output audio is by far the most expensive per token, roughly 85% of the price comes from the input audio tokens being carried over.
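To put numbers on that, here's a quick back-of-the-envelope check, assuming the launch list prices of roughly $5/$100 per 1M input text/audio tokens and $20/$200 per 1M output text/audio tokens (correct me if those are off):

```python
# Rough cost for the 3-minute call above, using the assumed launch list prices
# (per 1M tokens): $5 text in, $100 audio in, $20 text out, $200 audio out.
tokens = {"text_in": 16_000, "audio_in": 19_000, "text_out": 600, "audio_out": 1_200}
price_per_token = {"text_in": 5 / 1e6, "audio_in": 100 / 1e6,
                   "text_out": 20 / 1e6, "audio_out": 200 / 1e6}

costs = {k: tokens[k] * price_per_token[k] for k in tokens}
total = sum(costs.values())
print(costs)   # {'text_in': 0.08, 'audio_in': 1.9, 'text_out': 0.012, 'audio_out': 0.24}
print(f"total ~ ${total:.2f}, audio-in share ~ {costs['audio_in'] / total:.0%}")
# total ~ $2.23 for 3 minutes (about $0.74/min), audio-in share ~ 85%
```

That works out to about $0.74/min for this call, with input audio dominating, which lines up with the per-minute figures reported earlier in the thread.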
The only reasonable way to use the Realtime API currently is to use a separate streaming STT model so that we only ingest text input; since text input is so cheap, the appending of tokens from turn to turn is not a huge deal.
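In case it helps anyone, this is roughly the shape of the two client events I send per user turn when doing text-in / audio-out. A sketch based on my reading of the current preview event docs; the STT step is whatever streaming transcription service you already use:

```python
# Sketch of the text-in / audio-out pattern: transcribe speech with a separate
# streaming STT model, then feed only the resulting text into the Realtime API.
import json

def text_turn_events(transcript: str) -> list[str]:
    """Build the two client events for one user turn, given an STT transcript."""
    create_item = {
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "user",
            "content": [{"type": "input_text", "text": transcript}],
        },
    }
    # Ask for a spoken reply; only the cheap text goes into the input context.
    create_response = {
        "type": "response.create",
        "response": {"modalities": ["text", "audio"]},
    }
    return [json.dumps(create_item), json.dumps(create_response)]

# Usage, with `ws` being your open Realtime API websocket:
# for payload in text_turn_events("What's my account balance?"):
#     await ws.send(payload)
```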
I can't imagine that this was known and planned by the right people at OpenAI, and if it is known now, I can't imagine that fixing it somehow is not the top priority. @openai, you are currently burning through all your application developers…
@jeffsharris do you have an update on caching mechanisms to address this issue? Timeline?
The #2 reason for high cost is indeed (accidentally) generating a huge number of output tokens by asking a question that produces a huge answer, such as: "write me 10 different stories about…". Indeed, as noted somewhere above, whether you actually listen to the output or interrupt it by asking some other question is obviously not going to help you: once the output is generated, you get billed for it. But that is just something you will have to think carefully about as an application developer, in my opinion…
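One mitigation I've been experimenting with for that failure mode is capping the response length at the session level and keeping the instructions terse. A sketch; I'm assuming the max_response_output_tokens field as I understand it from the session docs, so double-check the exact name:

```python
import json

# Cap how many tokens a single response may generate, so a "write me 10 stories"
# style prompt can't rack up thousands of output audio tokens in one turn.
session_update = {
    "type": "session.update",
    "session": {
        "instructions": "Keep spoken replies short; at most a few sentences.",
        "max_response_output_tokens": 1500,  # assumed field name; check the session docs
    },
}
# await ws.send(json.dumps(session_update))
```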
Appreciate your diligence on the details @lucasvan
That sounds about right to me. And we're going to keep pushing on the cost structure so you don't need to run a separate STT model in order to reduce audio input cost.
We’re polishing our first big round of caching changes. Hoping to turn on some prompt caching features in the next couple weeks
And we have line of sight on further ways to reduce cost
That’s great to hear @jeffsharris - when are the rest of the voices going to be made available through the API?
As additional food for thought, given the ongoing cost issues, would OpenAI consider waiving a portion of the token costs for API calls until a specific cut-off date, allowing devs to thoroughly test various use cases and scenarios without needing a second mortgage?
As you can imagine, anyone building a serious voice application needs to put the system through a lot of different scenarios, and under the current cost regime that is a huge expense before the product is even commercially viable.
Exactly, the huge cost of just testing things out is currently preventing us from using this API (given the problems with the cumulative audio tokens). Although I don’t think OpenAI will do that, I really like your food for thought @liquidshadowsmk
@j0rdan at the very least, they should make Realtime API usage in the playground free for conversations up to X minutes before deducting $$ from the balance… Although some aspects of the playground aren't apples-to-apples with production behaviour (e.g. function calls), as devs we could live with that, I suppose… especially for tweaking system prompts to test variable scenarios and trying other voices (of which there are still only 3 in total. Shame!).
If OpenAI is still a developer-first org, then either the previous food for thought, or the “actionable step” mentioned above would only benefit them in the long run.
Just tested it. I don't see a massive difference in pricing. I'm also unable to understand what is cached and what isn't; additional documentation on this would be helpful. Appreciate the 5 new voices though… very cool.