Realtime API extremely expensive

lol… I totally broke it :slight_smile: I set max_tokens to 30 and asked it to count to 100. The transcript shows it cutting off after a few tokens, but it actually read back the full response up to 60 :slight_smile:

And just because it’s funny… The cutoff transcript is confusing it as to which language we’re speaking :slight_smile:

3 Likes

now if you’re also just billed for those 30 tokens, then go grab that bag :laughing: Public Bug Bounty: Open Ai - Bugcrowd

1 Like

I checked and they only billed me for the 30 tokens :slight_smile:

1 Like

I tried to repro the audio bug I hit and haven’t gotten it to repro, so it was likely just a state fluke somewhere. My gut says it was probably some sort of cache hit, because I’ve been using the same basic prompt “count to 100 by ones” all night.

I was hoping that lowering max_tokens would pressure the model into using fewer tokens, but no such luck.

unless the audio was cached in the browser state… It was clearly playing back and the generation said it was in a stop state. The logs didn’t show any buffering. I even clicked on other tabs to see if something else was playing. It’s definitely not reproducing now…

I was going to file a bug report but no real point if there’s not a reliable repro.

I will say that this thing is pricey even for my taste and I spend a lot of money on OpenAI every month.

I had this clever idea that I was going to use ElevenLabs’ Voice Cloning feature to clone Alloy and then use ElevenLabs for playback of long text, like reading a book or something. That’s when I saw that ElevenLabs is even more expensive…

3 Likes

Tortoise TTS is pretty good nowadays. Been eying it for a bit, people seem to be splitting by sentence for “realtime” generation.

2 Likes

I tried it in the playground this morning. It’s quite disappointing. First, the transcription isn’t great. You need a headset and microphone for it to work correctly. And the price, WOW, it’s extremely expensive, especially for testing, and the AI is limited and doesn’t compare to Advanced Voice Mode. Has anyone tried it outside the playground? Is it possible to select GPT-4o-mini as output with Nova’s voice? Is there a way to reduce costs by mixing models, like Deepgram for transcription and Claude for the LLM? Mixing STT, LLM, TTS, and Speech-to-Speech?

5 Likes

I too am very disappointed at the cost… I had a 10 minute chat and saw a $6 charge… Given that we are pushing a captured microphone stream, I am wondering if this is charging for empty frames? This is way too expensive!!

Curious, is anyone running VAD and just pushing in the spoken audio as opposed to streaming the data constantly?

2 Likes

It does charge you for all audio streamed in, even silence. In the playground and in the demo GitHub repo they shared, you can do push to talk.

I will say even with push to talk this is still very expensive. I don’t see this as being economically feasible for a lot of companies out there. I am also curious why only three voices are offered and why none of them are the same as the ones in Advanced Voice Mode. In my opinion, the voices offered are not as good as the ones in Advanced Voice Mode.

2 Likes

Then we should do our own VAD and only shoot in the captured speech.
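For anyone wanting to try that, here is a rough sketch of a client-side gate, not a production VAD. It assumes 16-bit little-endian mono PCM frames and uses a plain RMS energy threshold; `speech_frames`, `threshold` and `hangover` are names made up for illustration, and the default values would need tuning for your microphone and room.

```python
import math
import struct
from typing import Iterable, Iterator


def rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a 16-bit little-endian mono PCM frame."""
    n = len(frame) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", frame[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)


def speech_frames(
    frames: Iterable[bytes],
    threshold: float = 500.0,  # assumed value, tune for ambient noise
    hangover: int = 5,         # quiet frames to keep after speech ends
) -> Iterator[bytes]:
    """Yield only frames judged to be speech, plus a short trailing tail."""
    quiet = hangover  # start in the "silence" state
    for frame in frames:
        if rms(frame) >= threshold:
            quiet = 0
            yield frame
        elif quiet < hangover:
            # Keep a few quiet frames so word endings are not clipped.
            quiet += 1
            yield frame
```

Only the frames this generator yields would be sent upstream. A real setup would probably swap the energy check for a proper VAD model, since a fixed threshold is easily fooled by background noise.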

You nailed it. This is clearly a very rough release by OpenAI. It technically and fundamentally works, but beyond that, it is extremely flawed.

1 Like

You don’t get billed for silence.

You only get billed for tokens spent during the Speech Detected phase.

3 Likes

Do note that it looks like [inaudible]/noise may still take up a speech turn!

(if it’s billed as suspected)

2 Likes

For the noise level, you need to adjust the threshold. It will depend on your ambient noise.

Turns out I was wrong about this: it does not charge for silence.

1 Like

Thanks for correcting me on this, silence is not charged.

Background noise can trigger a generation but you can set the sensitivity level.
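A hedged sketch of what setting that sensitivity might look like over the WebSocket. The event shape (`session.update` with a `turn_detection` object of type `server_vad` and `threshold`, `prefix_padding_ms`, `silence_duration_ms` fields) is how I understand the Realtime API reference, so treat the field names and defaults as assumptions and check them against the current docs. The helper function names are made up.

```python
import json


def turn_detection_update(
    threshold: float = 0.5,
    silence_ms: int = 500,
    prefix_ms: int = 300,
) -> str:
    """Build a session.update event that tunes server-side VAD.

    Field names are assumed from the Realtime API reference; verify them.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold is expected to be between 0.0 and 1.0")
    event = {
        "type": "session.update",
        "session": {
            "turn_detection": {
                "type": "server_vad",
                # Higher threshold: louder audio is needed to start a turn.
                "threshold": threshold,
                "prefix_padding_ms": prefix_ms,
                "silence_duration_ms": silence_ms,
            }
        },
    }
    return json.dumps(event)


def push_to_talk_update() -> str:
    """Build a session.update event that turns server VAD off (assumed shape)."""
    return json.dumps({"type": "session.update", "session": {"turn_detection": None}})
```

In a noisy room you would send something like `turn_detection_update(threshold=0.8)` as a text message on the open socket; with VAD off, the client decides when a turn starts and ends.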

I tested a number of different scenarios to figure out exactly what we’re getting billed for:

2 Likes

Splitting by sentence is good enough for most uses. This realtime api has really no benefit that makes the cost worth it vs that, unless you want to have some fun with the tone of voice and really don’t mind paying through the nose for that. Sentence by sentence normal tts is more than fast enough.
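A minimal sketch of the sentence-by-sentence idea, assuming a naive regex splitter and a `synthesize` callable that stands in for whatever TTS engine you use (it is a placeholder, not a real library call). Playback of the first sentence can start while the later ones are still being generated.

```python
import re
from typing import Callable, Iterator, List

# Split on whitespace that follows ., ! or ? (naive: abbreviations will trip it).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> List[str]:
    """Break text into sentences with a simple punctuation rule."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def stream_tts(text: str, synthesize: Callable[[str], bytes]) -> Iterator[bytes]:
    """Yield one audio chunk per sentence, in order, as each one is ready."""
    for sentence in split_sentences(text):
        yield synthesize(sentence)
```

Because `stream_tts` is a generator, the caller can hand each chunk to the audio player as soon as it arrives instead of waiting for the whole text.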

1 Like