lol… I totally broke it. I set max_tokens to 30 and asked it to count to 100. The transcript shows it cutting off after a few tokens, but the audio actually read back the full response up to 60.
I tried to repro the audio bug I hit and haven’t gotten it to repro, so it was likely just a state fluke somewhere. My gut says it was probably some sort of cache hit, since I’ve been using the same basic prompt, “count to 100 by ones”, all night.
I was hoping that lowering max_tokens would pressure the model into using fewer tokens, but no such luck.
unless the audio was cached in the browser state… It was clearly playing back and the generation said it was in a stop state. The logs didn’t show any buffering. I even clicked on other tabs to see if something else was playing. It’s definitely not reproducing now…
I was going to file a bug report but no real point if there’s not a reliable repro.
I will say that this thing is pricey even for my taste and I spend a lot of money on OpenAI every month.
I had this clever idea that I was going to use ElevenLabs’ Voice Cloning feature to clone Alloy and then use ElevenLabs for playback of long text, like reading a book or something. That’s when I saw that ElevenLabs is even more expensive…
I tried it in the playground this morning. It’s quite disappointing. First, the transcription isn’t great. You need a headset and microphone for it to work correctly. And the price: WOW, it’s extremely expensive, especially for testing, and the AI is limited and doesn’t compare to Advanced Voice Mode. Has anyone tried it outside the playground? Is it possible to select GPT-4o-mini as the output model with Nova’s voice? Is there a way to reduce costs by mixing providers, like Deepgram for STT and Claude for the LLM? Mixing STT, LLM, TTS, and Speech-to-Speech?
I too am very disappointed at the cost… I had a 10 minute chat and saw a $6 charge, which works out to about $0.60 per minute… Given that we are pushing a captured microphone stream, I am wondering if this is charging for empty frames? This is way too expensive!!
Curious, is anyone running VAD and just pushing in the spoken audio as opposed to streaming the data constantly?
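For anyone wanting to try this, here is a rough sketch of gating the microphone stream before it is sent. It is a plain energy threshold, not a real VAD model, and it assumes 16-bit little-endian mono PCM frames. The names `rms` and `gate_speech`, the threshold of 500, and the hangover length are all made up for illustration and would need tuning for a real microphone. The part that actually sends frames to the API is left out.

```python
import math
import struct
from typing import Iterable, Iterator


def rms(frame: bytes) -> float:
    """Root-mean-square level of one frame of 16-bit little-endian mono PCM."""
    n = len(frame) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", frame[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)


def gate_speech(
    frames: Iterable[bytes],
    threshold: float = 500.0,  # assumed value, tune per microphone
    hangover: int = 5,         # quiet frames still forwarded after speech
) -> Iterator[bytes]:
    """Yield only frames that look like speech, plus a short quiet tail."""
    quiet = hangover  # start in the "silent" state so leading silence is dropped
    for frame in frames:
        if rms(frame) >= threshold:
            quiet = 0
            yield frame
        elif quiet < hangover:
            # Keep a few trailing frames so word endings are not clipped.
            quiet += 1
            yield frame
```

Each frame that comes out of `gate_speech` would then be pushed to the API in place of the raw stream, so long stretches of silence are never uploaded. A proper VAD library would be more robust to background noise than this threshold check.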
It does charge you for all audio streamed in, even silence. In the playground and in the demo GitHub repo they shared, you can do push to talk.
I will say even with push to talk this is still very expensive. I don’t see this as being feasible economically for a lot of companies out there. I am also curious why only three voices are offered and why none of those voices are the same as the advanced voice mode. The voices offered in my opinion are not as good as the ones in Advanced Voice Mode.
Splitting by sentence is good enough for most uses. The Realtime API really has no benefit that makes the cost worth it compared to that, unless you want to have some fun with the tone of voice and really don’t mind paying through the nose for it. Sentence-by-sentence normal TTS is more than fast enough.
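A minimal sketch of the sentence-splitting idea, assuming the LLM text arrives as a stream of string chunks. `sentences_from_stream` is a hypothetical helper name, and the regex is a naive splitter that will trip on abbreviations like "Dr." or decimals followed by a space. The TTS call itself is not shown.

```python
import re
from typing import Iterable, Iterator

# Split on whitespace that follows sentence-ending punctuation (naive rule).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Buffer streamed text and yield each sentence as soon as it is complete."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        # The last part may still be mid-sentence, so keep it buffered.
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()
```

Each yielded sentence can be handed to whatever normal TTS endpoint you use while the model is still generating the next one, which is where the speed comes from.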