I got access to the Realtime API this morning.
Playing around with it in the playground, it seems like the model is pretty nerfed vs Advanced Voice Mode. It has a limited set of voices with a lot less range than AVM in pronunciation, tonality and accents.
You might be right about the accents. I seem to have no problem in AVM (Indian accent) – but the Realtime API seems to be having issues with the accent (at least for me) – this is in preliminary testing.
Beyond this, the Realtime API is amazing … (just created a POC for Realtime API + RAG) – this is going to change a lot of things (like data centers).
Yep, based in Europe. Asked it to switch to Hungarian. AVM is surprisingly good at it, standard voice mode (SVM) is terrible. The Realtime API was A LOT closer to SVM.
Haven’t checked RAG yet, but did play around with function calling. Pricing is crazy, so it will take a lot of cost optimization in the architecture to make it worthwhile.
Using tools (function calling) to get the context from RAG (using our own RAG platform) – that part is working great … It’s the UX challenges that seem to be a pain.
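For anyone wondering what "tools to get the context from RAG" might look like, here is a rough sketch, not the poster's actual setup. The tool name `search_knowledge_base`, its schema, and the in-memory `retrieve` function are all made up for illustration. The event shapes (`session.update`, a `function_call_output` item sent via `conversation.item.create`, then `response.create`) follow my reading of the Realtime API docs, so check them against the current reference before relying on them. The code only builds the JSON payloads; sending them over the WebSocket is left out.

```python
import json

# Hypothetical tool definition: the name and schema are assumptions.
SEARCH_TOOL = {
    "type": "function",
    "name": "search_knowledge_base",
    "description": "Look up passages relevant to the user's question.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

# Toy corpus standing in for a real RAG platform.
_DOCS = [
    "Refunds are processed within 5 business days.",
    "Support is available Monday to Friday.",
]


def build_session_update() -> dict:
    """Client event that registers the tool for the session."""
    return {
        "type": "session.update",
        "session": {"tools": [SEARCH_TOOL], "tool_choice": "auto"},
    }


def retrieve(query: str) -> list[str]:
    """Naive keyword overlap; replace with a call to your own retriever."""
    words = set(query.lower().split())
    return [d for d in _DOCS if words & set(d.lower().rstrip(".").split())]


def handle_function_call(event: dict) -> list[dict]:
    """Turn a finished function call into the two client events to send back.

    `event` is assumed to carry `call_id` and a JSON string in `arguments`,
    as in the server's function-call-arguments-done event.
    """
    args = json.loads(event["arguments"])
    passages = retrieve(args["query"])
    return [
        {
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": event["call_id"],
                "output": json.dumps({"passages": passages}),
            },
        },
        # Ask the model to continue now that the tool result is available.
        {"type": "response.create"},
    ]
```

In a real client you would send `build_session_update()` once after connecting, then send each event returned by `handle_function_call` whenever the model finishes a tool call.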
I asked it to respond in Hindi and it did a pretty good job with the language and the accent. (The only odd thing was an American voice talking in Hindi – but beyond that, I was quite surprised that it got the pronunciation pretty much right.)
RAG works great with function calling – got it working well … the only problem is controlling hallucinations (so you will need to test and control extensively for that).
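One way to rein in hallucinations on the tool side, sketched under assumptions: only pass along retrieved passages above a relevance threshold, and when nothing qualifies, return an explicit "nothing found" result so the model has something concrete to defer to. The `score` field, the 0.75 threshold, and the wording of the instructions are all hypothetical. This reduces the problem but does not remove it, so you still need to test.

```python
import json

MIN_SCORE = 0.75  # hypothetical cutoff; tune against your own retriever


def grounded_tool_output(hits: list[dict], min_score: float = MIN_SCORE) -> str:
    """Build the tool output string from retriever hits.

    Each hit is assumed to look like {"text": str, "score": float}.
    """
    kept = [h for h in hits if h["score"] >= min_score]
    if not kept:
        # Give the model an explicit signal instead of an empty result.
        return json.dumps({
            "found": False,
            "instruction": "No relevant context was found. Say you don't know.",
        })
    return json.dumps({
        "found": True,
        "passages": [h["text"] for h in kept],
        "instruction": "Answer only from these passages.",
    })
```

The returned string would go into the `output` field of the function call result you send back to the model.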
I don’t understand how this isn’t a bigger complaint. This is the only thread about it I could find across the forums, Reddit, Google and Discord.
Surely there are more people who see the astronomical difference between the existing OpenAI API tech and the tech behind ChatGPT Advanced Voice Mode. Like, it’s genuinely MILES ahead.
I’m curious when we will see it added to the API, but I wouldn’t be surprised to see it at a huge markup. This level of voice AI combined with already existing RAG tech will revolutionise so many industries, but it’s just not quite there with the current API.
I agree with you.
I was just testing the Realtime API and comparing it with Advanced Voice Mode in ChatGPT, and the experience was so much better with Advanced Voice Mode. I thought I might be doing something wrong, but from this post I can see I’m not the only one with this problem.
Thank you!
Do we have any updates on this? I agree that Advanced Voice Mode in ChatGPT is absolutely miles ahead of the Realtime API. Like, it’s not even close. What’s going on?