Realtime API nerfed vs Advanced Voice Mode?

I got access to the Realtime API this morning.
I've been playing around with it in the playground.

It seems like the model is pretty nerfed vs Advanced Voice Mode (AVM). It has a limited set of voices, with a lot less range than AVM in pronunciation, tonality and accents.

What’s your experience?

You might be right about the accents. I seem to have no problem in AVM (Indian accent), but the Realtime API seems to be having issues with the accent (at least for me). This is from preliminary testing.

Beyond this: This Realtime API is amazing … (just created a POC for Realtime API + RAG) – this is going to change a lot of things (like data centers)

Are you doing RAG as part of the instructions or via something else like tools?

100% agree @ssdavid, this behaves like a basic STT -> LLM -> TTS pipeline, like the old ChatGPT voice mode. It cannot do accents or non-verbal communication.

I’m in Europe, are you? Can any American user check whether their API can fake accents and non-verbal cues?

Yep, based in Europe. I asked it to switch to Hungarian. AVM is surprisingly good at it, while standard voice mode (SVM) is terrible. This was A LOT closer to SVM.

Haven’t yet checked RAG, but did play around with function calling. Pricing is crazy so will need to do a lot of cost optimization in architecture to make it worthwhile.

The marketing was quite dishonest in cultivating the ambiguity with AVM, probably to justify the pricing.

Gemini 1.5 Pro 002 + Whisper for output TTS has the same performance for 32x cheaper. The only difference is the input latency.

Using tools (function calling) to get the context from the RAG (Using our own RAG platform) – that part is working great … It’s the UX challenges that seem to be a pain.
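
For anyone wiring up the same thing, here is a rough sketch of the tool-based RAG flow described above. It only builds the JSON events you would send over the Realtime WebSocket; it does not open a connection. The event shapes (`session.update`, `conversation.item.create` with a `function_call_output` item, `response.create`) follow my reading of the Realtime API beta docs and should be checked against the current reference. The tool name `search_knowledge_base`, the `retrieve` helper and the tiny corpus are hypothetical stand-ins for your own RAG platform.

```python
import json

# Hypothetical tool name and schema; adapt to your own RAG platform.
SEARCH_TOOL = {
    "type": "function",
    "name": "search_knowledge_base",
    "description": "Look up passages relevant to the user's question.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}


def session_update_event():
    """Build the event that registers the tool on the session."""
    return {
        "type": "session.update",
        "session": {"tools": [SEARCH_TOOL], "tool_choice": "auto"},
    }


def retrieve(query):
    """Stand-in retriever: keyword match over a canned corpus (assumption)."""
    corpus = {"refund": "Refunds are processed within 14 days."}
    return [text for key, text in corpus.items() if key in query.lower()]


def handle_function_call(call_id, name, arguments_json):
    """Turn a completed function call into the events to send back."""
    if name != SEARCH_TOOL["name"]:
        raise ValueError(f"unknown tool: {name}")
    args = json.loads(arguments_json)
    passages = retrieve(args["query"])
    return [
        # Attach the retrieved context to the conversation as the tool result.
        {
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": call_id,
                "output": json.dumps({"passages": passages}),
            },
        },
        # Ask the model to continue, now that the context is available.
        {"type": "response.create"},
    ]
```

In a real client you would send `session_update_event()` once after connecting, then call `handle_function_call` when the server reports that the function call arguments are complete, and send each returned event with `json.dumps`.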

I asked it to respond in Hindi and it did a pretty good job with the language and the accent. (The only odd thing was an American voice talking in Hindi, but beyond that I was quite surprised that it got the pronunciations pretty much right.)

RAG works great with function calling – got it working well … the only problem is controlling hallucinations (so you will need to test and control extensively for that).
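
One cheap guard against hallucinated answers, sketched under my own assumptions (this is not something from the API docs and it is no guarantee): when retrieval comes back empty, return an explicit "no results" payload as the tool output instead of an empty list, so the model is told to say it does not know. The function name `build_tool_output`, the `status` field and the instruction wording are all hypothetical.

```python
import json


def build_tool_output(passages, min_passages=1):
    """Return the JSON string to use as the tool output.

    If too few passages were retrieved, say so explicitly rather than
    handing the model an empty context it may fill in on its own.
    """
    if len(passages) < min_passages:
        return json.dumps(
            {
                "status": "no_results",
                "instruction": (
                    "No supporting passages were found. Tell the user you "
                    "do not have that information instead of guessing."
                ),
            }
        )
    return json.dumps({"status": "ok", "passages": passages})
```

You would still need to test this extensively, since the model can ignore the instruction; it just makes the empty-retrieval case visible and easy to log.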