New Realtime API voices and cache pricing

Today we’re adding five new voices that you can use with the Realtime API: Ash, Ballad, Coral, Sage, and Verse. These voices are much more expressive and you can better tune their emotions, accents, and tones to your needs. Take a listen.

We’re also making our first cost reduction for Realtime with prompt caching:

  • Text input that hits the cache costs 50% less.
  • Audio input that hits the cache costs 80% less.
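
A minimal sketch of how the two discounts combine for a batch of input tokens. Only the 50% and 80% figures come from the list above; the `input_cost` helper and the $10 per 1M rate are placeholders for illustration, so take real rates from the pricing page.

```python
# Discounts from the announcement: cached text input costs 50% less,
# cached audio input costs 80% less.
CACHE_DISCOUNT = {"text": 0.50, "audio": 0.80}


def input_cost(tokens: int, cached_tokens: int, modality: str,
               rate_per_million: float) -> float:
    """Dollar cost of `tokens` input tokens, `cached_tokens` of which hit the cache.

    `rate_per_million` is the uncached price per 1M tokens (caller-supplied;
    the value used below is a made-up placeholder).
    """
    if not 0 <= cached_tokens <= tokens:
        raise ValueError("cached_tokens must be between 0 and tokens")
    cached_rate = rate_per_million * (1 - CACHE_DISCOUNT[modality])
    uncached = tokens - cached_tokens
    return (uncached * rate_per_million + cached_tokens * cached_rate) / 1_000_000


# Placeholder rate of $10 per 1M tokens, with half of the input served from cache.
text_cost = input_cost(1_000_000, 500_000, "text", 10.0)
audio_cost = input_cost(1_000_000, 500_000, "audio", 10.0)
```

With the same cache hit ratio, audio input benefits more than text input because its discount is larger.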

Basically, your typical 15-minute conversation now costs about 30% less than when we first launched the Realtime API. You can try out the new voices in the Playground right now (and be sure to use system instructions to direct their tones). Get started with the docs today. We’re looking forward to seeing and hearing what you build!

Amazing news! However, there is currently no (documented) way to calculate the total session price, because the usage field in the response.done event seems either to not differentiate between cached audio and cached text tokens, or to only include the cached text tokens, according to the events API reference.

Just tested it through the playground. It appears that the proper cached token details were indeed added to the usage of response.done, but cached tokens are still not reported in the total token usage popup in the playground, and the response.done documentation is definitely out of date.
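
A sketch of how the per-modality cached counts could be pulled out of that usage object. The field names (`input_token_details`, `cached_tokens_details`, `output_token_details`) and the idea that the input totals include the cached portion are assumptions based on what the playground logs appeared to show, not on the reference docs; the sample payload is made up.

```python
def split_usage(usage: dict) -> dict:
    """Split a response.done usage object into billable buckets.

    Assumes `text_tokens`/`audio_tokens` under `input_token_details` are totals
    that include cached tokens, and that `cached_tokens_details` breaks the
    cached portion down by modality. Missing fields are treated as 0.
    """
    inp = usage.get("input_token_details", {})
    cached = inp.get("cached_tokens_details", {})
    out = usage.get("output_token_details", {})
    cached_text = cached.get("text_tokens", 0)
    cached_audio = cached.get("audio_tokens", 0)
    return {
        "text_in_uncached": inp.get("text_tokens", 0) - cached_text,
        "text_in_cached": cached_text,
        "audio_in_uncached": inp.get("audio_tokens", 0) - cached_audio,
        "audio_in_cached": cached_audio,
        "text_out": out.get("text_tokens", 0),
        "audio_out": out.get("audio_tokens", 0),
    }


# Made-up example payload, shaped like the assumed usage object.
sample_usage = {
    "total_tokens": 700, "input_tokens": 600, "output_tokens": 100,
    "input_token_details": {
        "text_tokens": 400, "audio_tokens": 200, "cached_tokens": 320,
        "cached_tokens_details": {"text_tokens": 256, "audio_tokens": 64},
    },
    "output_token_details": {"text_tokens": 20, "audio_tokens": 80},
}
buckets = split_usage(sample_usage)
```

Each bucket can then be multiplied by its own rate, which is what a total session price needs.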

That’s amazing news. Can’t wait to see what I can build with it :smiley:

Just been to the playground, asked a few questions, and burnt through nearly $3.

Is this correct?

It stated no cost:

00:54

### assistant

Hi there! How can I help you today?

00:58

### user

what does it cost per minute

00:58

### assistant

Could you clarify what you're asking about the cost per minute for? Is it a phone plan, a service, or something else?

01:07

### user

using this

01:08

### assistant

Oh, interacting with me doesn't cost anything! Feel free to ask me anything you'd like.

I hope not :confounded:
:rabbit::honeybee::heart:

That’s so pricey, it reminds me of pay per min internet :sweat:

Will these new voices be made available through the TTS API as well, or just via the Realtime API?

You can never trust the LLM to know anything about itself. Look at the pricing page for the actual details.

This is definitely not correct.
Check your usage.

Also, here is the pricing page.

Check this out!

Oct 30 | Feature | gpt-4o-realtime-preview | gpt-4o-audio-preview | v1/chat/completions

The Realtime API is amazing and it works so well, I love it. The pricing, though, makes it impractical to use for an actual application, even with a 30% discount. I tested an app I built with just a conversation, and the bill was close to $80 for two days of testing, at just a couple of minutes per day.
So the app is great, but there is no plausible way to monetize it. I just don’t see users paying something like $100 for a few minutes of talk with an AI, when we can barely get them to subscribe at $20 a month.

Hey @bertha.kgokong - very glad you like it. Lowering costs is going to be a continued priority.

If you have any session IDs to share, I’d love to debug. It’s definitely not expected that your costs would be anywhere close to $10/min

Maybe once you add some data or functionality to it, it becomes a service of higher value, e.g. as a frontend for a working autocoder or virtual project manager, where the customer explains a product and it creates something like software or a project plan based on the conversation…

But is $1 per minute the real price? $1 for “hey how are you” and “fine, how are you”?

@jochenschultz that’s definitely unexpected. When I have that conversation (with a normal-length system prompt) I use:
  • 10 audio input tokens
  • 53 text input tokens
  • 76 audio output tokens
  • 26 text output tokens

which is $0.017 (a bit less than 2 cents)
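
For anyone checking the arithmetic, here is a quick sketch. The per-million rates are an assumption (they are not stated in this thread and should be verified on the pricing page), chosen because they reproduce the quoted figure from the token counts above.

```python
# Assumed per-1M-token rates in USD (not stated in this thread; verify on the pricing page).
RATES = {"text_in": 5.0, "text_out": 20.0, "audio_in": 100.0, "audio_out": 200.0}

# Token counts quoted above.
tokens = {"audio_in": 10, "text_in": 53, "audio_out": 76, "text_out": 26}

cost = sum(tokens[k] * RATES[k] for k in tokens) / 1_000_000
print(f"${cost:.3f}")  # prints $0.017
```

Under these rates the 76 audio output tokens account for about $0.015 of the total, so audio output dominates even a very short exchange.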

Can you post the token breakdowns you’re seeing?

If you click on the Logs panel in the Realtime playground, you can see the tokens.

Oh, I just had a short test in the playground. It was less than 5 minutes, but usage showed ~$5… which is now gone. I just did some short stuff like hello, how are you… no more than 20 words in total, maybe plus what it generated… However, now my history is gone and it doesn’t show in my usage anymore.

Will give it another try then.

* Just checked again… that was last month, and yes, $4.74 for not even 5 minutes. I am pretty sure about that, even though I can’t see the logs for it.

user
You

00:02
assistant
Hello! How can I help you today?

00:02
user
[inaudible]

00:03
assistant
¿En qué puedo ayudarte?

00:04
user
[inaudible]

00:05
assistant
Parece que me estás poniendo a prueba. Estoy aquí para ayudarte, así que dime, ¿qué necesitas?

00:20
user
[inaudible]

00:22
assistant
¡Claro! Pregúntame lo que quieras. ¿Sobre qué tema necesitas información?

00:36
assistant
Estoy aquí para responder tus preguntas. ¿Qué te gustaría saber?

So I said hello (maybe mumbled a bit…) - and then it just started in Spanish and then I mumbled hello again…

I have no idea why it starts talking in Spanish… Because “inaudible”?

And now in another session, after I cleared it, it went to Korean (note that I did not say “that’s very interesting”; I said “interessant”, in German).

00:00

### user

[inaudible]

00:01

### assistant

안녕하세요! 무엇을 도와

00:01

### user

Okay.

00:02

### assistant

말씀하세요! 궁금한 게 있으신가요?

00:08

### user

That's very interesting.

00:09

### assistant

흥미롭게 생각해주시니 기쁘네요! 혹시 더 알고 싶은 것이 있나요? 

And in another session, when it went to Spanish and I asked it (in Spanish) to switch to German, it recognized my “Seria posible hablar en aleman?” as “Serait-il possible de parler en allemand?”, which I admit is pretty close, but it then answered in French…

Very strange… maybe it has no access to my language settings and therefore the system prompt gets confused…

Anyway, 3 (or was it 4?) small tests, ~3-4 minutes in total, came out to approximately $0.67.

Which is still hefty, but not the expected $3-4.

I guess I got to check my microphone before I complain more on that though.

* Yeah, it was my headset not being the active microphone, and the built-in microphone of my laptop is pretty far away. That, plus the mumbling (because it is 4am and I am not alone in the house), makes a lot more sense.

After switching to headset it worked just fine.

00:02
user
Okay, jetzt müsstest du mich aber hören, oder? Richtig?

00:06
assistant
Ja, ich höre dich laut und deutlich! Alles funktioniert. Wie kann ich dir helfen?

“Ok, now you should hear me, right?” - “Yes, I hear you loud and clear! Everything works. How can I help you?” in German: $0.02.

So I guess the cost has to do with the application not being able to recognize the audio properly…

As an API developer, can you afford to have users not check their microphone, when static, silence, echo, or loopback will trigger the server VAD to make a paid response every few seconds, building up a “chat” that makes each subsequent response more expensive?

This endpoint calls for more intelligent discrimination of what audio you send, with manual triggering of responses.
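
One way this could look, as a rough sketch rather than a recommended design: turn off server VAD, drop near-silent chunks on the client, and only commit the buffer and request a response when something audible was captured. The event types (`session.update`, `input_audio_buffer.append`, `input_audio_buffer.commit`, `response.create`) are the Realtime API's client events as I understand them; `RMS_THRESHOLD`, the helper names, and the 16-bit little-endian mono PCM input are assumptions, and sending the JSON over the socket is left to the caller.

```python
import base64
import json
import math
import struct

# Hypothetical loudness threshold on a 0..32767 scale; tune per microphone.
RMS_THRESHOLD = 500.0


def rms_pcm16(chunk: bytes) -> float:
    """Root-mean-square amplitude of little-endian 16-bit mono PCM."""
    n = len(chunk) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", chunk[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)


def disable_server_vad() -> str:
    """session.update that turns off server VAD so no response fires on its own."""
    return json.dumps({"type": "session.update", "session": {"turn_detection": None}})


def events_for_turn(chunks: list[bytes]) -> list[str]:
    """Build the events for one manually triggered turn.

    Chunks quieter than RMS_THRESHOLD are dropped client-side. If nothing
    survives, no events are returned, so no paid response is requested.
    """
    loud = [c for c in chunks if rms_pcm16(c) >= RMS_THRESHOLD]
    if not loud:
        return []
    events = [
        json.dumps({"type": "input_audio_buffer.append",
                    "audio": base64.b64encode(c).decode("ascii")})
        for c in loud
    ]
    events.append(json.dumps({"type": "input_audio_buffer.commit"}))
    events.append(json.dumps({"type": "response.create"}))
    return events


# Tiny synthetic chunks: near-silence and a loud burst.
silence = struct.pack("<4h", 0, 3, -2, 1)
speech = struct.pack("<4h", 4000, -5000, 6000, -3000)
```

A plain energy gate like this is crude: it would clip quiet syllables in the middle of an utterance, so a real client would likely keep a short hangover window or use a proper VAD library.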

Then there is manually discarding a socket before a chat can grow too long; otherwise the only cap on how much context gets loaded is OpenAI’s own limit, set to the model’s maximum context window, and not one you can supply.
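
A client-side guard for this might be sketched as follows, assuming each response re-sends the accumulated conversation as input, so that `input_tokens` on the latest `response.done` roughly tracks context size. `SessionBudget` and its cap are hypothetical names for illustration, not API parameters.

```python
class SessionBudget:
    """Tracks input tokens per response and flags when to start a fresh session."""

    def __init__(self, max_input_tokens: int):
        # Hypothetical client-side cap; the API does not take this as a parameter.
        self.max_input_tokens = max_input_tokens
        self.last_input_tokens = 0
        self.total_billed_input = 0

    def record(self, usage: dict) -> None:
        """Feed the `usage` object from each response.done event."""
        self.last_input_tokens = usage.get("input_tokens", 0)
        self.total_billed_input += self.last_input_tokens

    def should_rotate(self) -> bool:
        """True once the latest response carried at least the capped amount of input."""
        return self.last_input_tokens >= self.max_input_tokens


# Made-up usage values showing input growing from turn to turn.
budget = SessionBudget(max_input_tokens=2000)
for usage in ({"input_tokens": 300}, {"input_tokens": 900}, {"input_tokens": 2100}):
    budget.record(usage)
```

When `should_rotate()` returns true, the caller would close the socket and open a new session, optionally seeding it with a short text summary of the old one.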

Then, does deconvolution of tokens to audio really add 10x as much computational cost as generating the AI tokens themselves… or does it just add that much to your bill?

When I try to change the voice, I get this:

TS2322: Type ‘“ash”’ is not assignable to type ‘“alloy” | “shimmer” | “echo” | undefined’.

I cannot use the other voices. I want to use a male voice, which I am unable to do.
