Realtime API nerfed vs Advanced Voice Mode?

I got access to the Realtime API this morning and have been playing around with it in the playground.

The model seems pretty nerfed compared to Advanced Voice Mode. It offers a limited set of voices that have a lot less range than AVM in pronunciation, tonality, and accents.

What’s your experience?
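For reference, here's roughly how voice and accent are controlled in the Realtime API today: you pick one of the launch voices and ask for an accent via the session instructions, which is a prompt, not a dedicated voice feature. A minimal sketch, assuming the beta WebSocket endpoint and `session.update` / `response.create` events from the launch docs (the accent instruction itself is hypothetical and results vary):

```python
# Minimal sketch: requesting an accent through session instructions.
# The only voice control the API exposes is the `voice` field plus prompting,
# which is much narrower than AVM's range.
import asyncio
import json
import os

import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}

async def main():
    # Note: newer websockets versions use `additional_headers` instead.
    async with websockets.connect(URL, extra_headers=HEADERS) as ws:
        # Pick one of the launch voices (alloy / echo / shimmer) and ask for
        # an accent in the instructions.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "voice": "alloy",
                "instructions": "Speak English with a strong Hungarian accent.",
            },
        }))
        # Request a spoken response so we can hear the result.
        await ws.send(json.dumps({"type": "response.create"}))
        async for message in ws:
            event = json.loads(message)
            print(event["type"])  # audio arrives as response.audio.delta events

asyncio.run(main())
```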


You might be right about the accents. I have no problem in AVM (Indian accent), but the Realtime API seems to be having issues with the accent, at least for me. This is preliminary testing.

Beyond this, the Realtime API is amazing. I just created a POC for Realtime API + RAG, and this is going to change a lot of things (like data centers).

Are you doing RAG as part of the instructions or via something else like tools?


100% agree @ssdavid, this is a basic STT → LLM → TTS pipeline like the old voice ChatGPT. It can't do accents or non-verbal communication.

I’m in Europe; are you? Can any American user check whether their API can fake accents and non-verbal cues?

Yep, based in Europe. I asked it to switch to Hungarian. AVM is surprisingly good at it, while standard voice mode is terrible. The Realtime API was a lot closer to SVM.

Haven’t yet checked RAG, but I did play around with function calling. Pricing is crazy, so we'll need to do a lot of architectural cost optimization to make it worthwhile.

The marketing was quite dishonest in cultivating the ambiguity with AVM, probably to justify the pricing.
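To put the "pricing is crazy" point in numbers: a back-of-envelope estimate, assuming the per-minute audio rates published at launch (roughly $0.06/min in, $0.24/min out; check current pricing, and note that conversation history, including audio tokens, is re-billed as input on every turn, so real costs run higher):

```python
# Rough cost for one Realtime API call at launch audio pricing.
# Ignores text tokens and per-turn context re-billing, which add on top.
AUDIO_IN_PER_MIN = 0.06   # user speech
AUDIO_OUT_PER_MIN = 0.24  # model speech

def call_cost(user_minutes: float, model_minutes: float) -> float:
    return user_minutes * AUDIO_IN_PER_MIN + model_minutes * AUDIO_OUT_PER_MIN

# A 10-minute call, roughly half user / half model speech:
print(f"${call_cost(5, 5):.2f}")  # ~$1.50 per call, before context re-billing
```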

Gemini 1.5 Pro 002 plus Whisper for transcription, with a TTS step for output, gives the same performance at 32x lower cost. The only difference is the input latency.
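For anyone who wants to compare, here's a sketch of that chained pipeline: Whisper for STT, Gemini 1.5 Pro 002 for the reply, and a TTS model for output. The model names are real, but treat this as a sketch, and expect each stage to add latency compared to Realtime:

```python
# Chained alternative to the Realtime API: STT -> LLM -> TTS.
# Much cheaper per turn, but latency stacks across the three calls.
import google.generativeai as genai  # pip install google-generativeai
from openai import OpenAI            # pip install openai

oai = OpenAI()
genai.configure(api_key="YOUR_GEMINI_API_KEY")
gemini = genai.GenerativeModel("gemini-1.5-pro-002")

def voice_turn(audio_path: str, out_path: str) -> None:
    # 1. Speech-to-text with Whisper.
    with open(audio_path, "rb") as f:
        text = oai.audio.transcriptions.create(model="whisper-1", file=f).text
    # 2. LLM turn with Gemini.
    reply = gemini.generate_content(text).text
    # 3. Text-to-speech for the response audio.
    speech = oai.audio.speech.create(model="tts-1", voice="alloy", input=reply)
    speech.write_to_file(out_path)

voice_turn("question.wav", "answer.mp3")
```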

Using tools (function calling) to get the context from the RAG (using our own RAG platform). That part is working great … it's the UX challenges that are a pain. A sketch of the wiring is below.
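A minimal sketch of that tool-based RAG wiring, assuming an already-connected WebSocket `ws` (see the earlier snippet). The event names follow the beta Realtime docs as I understand them; the `search_docs` tool name and the `retrieve()` helper are hypothetical stand-ins for your own RAG platform:

```python
# Tool-based RAG over the Realtime API: register a retrieval tool, then feed
# the search results back as a function_call_output when the model calls it.
import json

async def setup_rag_tool(ws):
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {
            "tools": [{
                "type": "function",
                "name": "search_docs",
                "description": "Search the knowledge base for passages "
                               "relevant to the user's question.",
                "parameters": {
                    "type": "object",
                    "properties": {"query": {"type": "string"}},
                    "required": ["query"],
                },
            }],
            "tool_choice": "auto",
        },
    }))

async def handle_event(ws, event, retrieve):
    # The server signals a finished tool call with its arguments and call_id.
    if event["type"] == "response.function_call_arguments.done":
        query = json.loads(event["arguments"])["query"]
        passages = retrieve(query)  # hypothetical call into your RAG platform
        await ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": event["call_id"],
                "output": json.dumps({"passages": passages}),
            },
        }))
        # Ask the model to continue speaking now that it has context.
        await ws.send(json.dumps({"type": "response.create"}))
```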

I asked it to respond in Hindi and it did a pretty good job with the language and the accent. (The only odd thing was an American-sounding voice speaking Hindi, but beyond that I was quite surprised at how good the pronunciation was.)

RAG works great with function calling; got it working well. The only problem is controlling hallucinations, so you will need to test and guard against that extensively.
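One prompt-level guard worth trying is a grounding directive in the session instructions. This is a sketch, not a guaranteed fix: it reduces made-up answers but doesn't eliminate them, so evals on your own data are still essential. The `search_docs` tool name matches the hypothetical one above:

```python
# Grounding directive for the session instructions. Prompt-level only:
# it lowers the hallucination rate but does not replace testing.
GROUNDING_INSTRUCTIONS = (
    "Answer questions using only the passages returned by search_docs. "
    "If the passages do not contain the answer, say you don't know and "
    "offer to search again. Never invent product names, prices, or dates."
)
```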

I don’t understand how this isn’t a bigger complaint. This thread is the only complaint about it I could find across the forums, Reddit, Google, and Discord.

Surely more people see the value in the astronomical difference between the existing OpenAI API tech and the tech behind ChatGPT's Advanced Voice Mode. It's genuinely MILES ahead.

I’m curious when we will see it added to the API, but I wouldn’t be surprised to see it at a huge markup. This level of voice AI combined with already-existing RAG tech will revolutionise so many industries, but it’s just not quite there with the current API.


I agree with you.
I was just testing the Realtime API and comparing it with Advanced Voice Mode in ChatGPT, and the experience was so much better with Advanced Voice Mode. I thought I might be doing something wrong, but from this post I can see I’m not the only one with this problem.
Thank you!

Do we have any updates on this? I agree that Advanced Voice Mode in ChatGPT is absolutely miles ahead of the Realtime API. It's not even close. What's going on?