Realtime API: poor speech recognition (Twilio → OpenAI)

Hi all,

Has anyone found a good way to get the Realtime API to accurately recognize narrowband audio (8 kHz mu-law)?

I have not managed to get the API to recognize speech reliably. It is fairly accurate as a whole, but it struggles with 1-2 word answers to questions and does not work well at all with names or addresses.

The playground is much better, but the model is also getting high-fidelity audio there, so I would expect that.

I’m using Java, but hearing from anyone who has managed this at all would be very helpful.
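
For reference, the basic wiring looks something like this (a rough, stripped-down sketch rather than my exact code; the session fields follow the Realtime API docs, and the placeholder string stands in for a Twilio media payload):

```java
public class RealtimeSessionSetup {
    public static void main(String[] args) {
        // Placeholder for the base64 mu-law payload of a Twilio "media" event.
        String base64MuLawFromTwilio = "<base64 mu-law payload>";

        // Tell the Realtime API to expect 8 kHz G.711 mu-law directly.
        String sessionUpdate = """
            {
              "type": "session.update",
              "session": {
                "input_audio_format": "g711_ulaw",
                "output_audio_format": "g711_ulaw"
              }
            }""";

        // Each incoming Twilio media frame gets forwarded as an append event.
        String appendAudio = """
            {
              "type": "input_audio_buffer.append",
              "audio": "%s"
            }""".formatted(base64MuLawFromTwilio);

        // webSocket.sendText(sessionUpdate, true);  // e.g. java.net.http.WebSocket
        // webSocket.sendText(appendAudio, true);
        System.out.println(sessionUpdate);
        System.out.println(appendAudio);
    }
}
```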

Thanks!

If the playground works better, then your issue most likely lies with Twilio and/or the way you’re processing the audio.

I’d recommend trying other providers and seeing how they fare.


The playground uses high-quality wideband audio (likely over WebRTC), which is not possible over phone lines.

Ah, sorry.

Still,

You can save a sample and run it through numerous services to see how it fares. You could probably also use some sort of pipeline to improve the quality of the audio before sending it off.
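
For example, if you capture the raw mu-law bytes from the Twilio stream, something like this (a rough sketch using the standard javax.sound.sampled API; file names are placeholders) wraps them in a WAV you can run through other transcription services:

```java
import javax.sound.sampled.*;
import java.io.*;
import java.nio.file.*;

public class SaveUlawSample {
    public static void main(String[] args) throws Exception {
        // Raw 8 kHz G.711 mu-law bytes captured from the Twilio media stream.
        byte[] ulaw = Files.readAllBytes(Path.of("sample.ulaw"));

        // Describe the stream: mu-law, 8000 Hz, 8-bit, mono, 1 byte per frame.
        AudioFormat ulawFormat = new AudioFormat(
                AudioFormat.Encoding.ULAW, 8000f, 8, 1, 1, 8000f, false);

        try (AudioInputStream in = new AudioInputStream(
                new ByteArrayInputStream(ulaw), ulawFormat, ulaw.length)) {
            // Write a WAV you can upload to other STT services for comparison.
            AudioSystem.write(in, AudioFileFormat.Type.WAVE, new File("sample.wav"));
        }
    }
}
```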

In case you’re curious, or for anyone else approaching this: I am now attempting to use this neural net with a bunch of these voice datasets to upscale the audio in real time and feed that to the assistant. I’ll post an update on how it works :wink:. I don’t think it’s a commonly faced issue in voice, since most people are going phone → phone and therefore don’t care to upscale, but in our case we can restore the audio to 16 kHz PCM, which is much higher-fidelity audio. This is a first thought though, hopefully it works :smile:
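
If anyone wants to experiment, the plumbing around whatever model you plug in looks roughly like this: a standard G.711 mu-law decode, then an upsample from 8 kHz to 16 kHz (naive linear interpolation here, just standing in for the vocoder; class and method names are mine):

```java
public final class NarrowbandUpsampler {

    /** Standard G.711 mu-law byte -> linear 16-bit sample. */
    static short muLawToPcm(byte mu) {
        int u = ~mu & 0xFF;
        int sign = u & 0x80;
        int exponent = (u >> 4) & 0x07;
        int mantissa = u & 0x0F;
        int magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84;
        return (short) (sign != 0 ? -magnitude : magnitude);
    }

    /** 8 kHz mu-law frame -> 16 kHz signed 16-bit little-endian PCM. */
    static byte[] upsampleTo16k(byte[] muLaw8k) {
        short[] pcm8k = new short[muLaw8k.length];
        for (int i = 0; i < muLaw8k.length; i++) {
            pcm8k[i] = muLawToPcm(muLaw8k[i]);
        }
        // Two output samples per input sample: the original and the midpoint
        // to the next sample (this is where a real vocoder would do far better).
        byte[] out = new byte[pcm8k.length * 4];
        for (int i = 0; i < pcm8k.length; i++) {
            short a = pcm8k[i];
            short b = (i + 1 < pcm8k.length) ? pcm8k[i + 1] : a;
            short mid = (short) ((a + b) / 2);
            int o = i * 4;
            out[o]     = (byte) (a & 0xFF);
            out[o + 1] = (byte) ((a >> 8) & 0xFF);
            out[o + 2] = (byte) (mid & 0xFF);
            out[o + 3] = (byte) ((mid >> 8) & 0xFF);
        }
        return out;
    }

    public static void main(String[] args) {
        // A few mu-law "silence" bytes just to show the call.
        byte[] frame = { (byte) 0xFF, (byte) 0x7F, (byte) 0xFF, (byte) 0x7F };
        System.out.println("Output bytes: " + upsampleTo16k(frame).length); // 16
    }
}
```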

Hi Kevin, I get the same issue with speech recognition on the Twilio end. I tried using Deepgram speech recognition for better transcription results, and I do get better transcriptions, but the barge-in functionality no longer works. Did you manage to get better results with your approach? What about latency? Thanks

Hey Sam,

I am still working on training a model to my specs. It will cost a few hundred dollars, or even a couple thousand, and we’re a small startup working off credits, so I’m still optimizing costs. I am encouraged by results I’ve seen from NVIDIA’s BigVGAN and also some from HiFi-GAN. These will 100% be able to operate in real time; it is mostly a question of how clear I can make the audio.

I would recommend trying it if you know your way around DL models, want some practice, and have some extra funds, as it is fairly likely to at least somewhat improve the audio. At the same time, I think it’s more of a workaround that OpenAI will eventually patch up better (likely doing something similar under the hood, but far better than I can with limited training time and data).

Hi Kevin, thanks for this! I think I’ll have to wait for the big update, and I hope it will be solved soon.

Hey! Just here to clarify a few things:

1-2 word answers are the same as giving an LLM a 1-2 letter prompt.

It won’t really generate anything meaningful because it doesn’t have enough context to understand you.

We aren’t sending enough audio samples for it to have any context, so it will start to generate random or default audio.

That’s only one of the issues you mentioned, so here’s the explanation for the names/addresses.

Let’s take the number 97 as an example.
Since the Realtime API is multilingual, the number “97” could be pronounced “Ninety-Seven”. However, it could also be pronounced “Siebenundneunzig” (97 in German) or “Quatre-vingt-dix-sept” (97 in French), and so on.

I think you see the issue now.

A fix for this?
Try spelling out the numbers instead of writing actual numbers.
(This is my telephone number: four, nine, six, one, two… etc.)

I hope this helps! :hugs:

EDIT: Of course, spelling things out only works when the AI is reading from a function call; you can’t just spell every single letter in a call and expect it to understand everything. Maybe it does help though, worth a shot. My example was more of a customer support AI that gets its info from a function call.

So normal function call output would be:

Tel: 4917283917

And your processed output, before sending it to the Realtime API, would be:

Telephone: four, nine, one, seven, two, eight, three, nine, one, seven
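
A minimal sketch of that pre-processing step (class and method names are illustrative):

```java
public final class DigitSpeller {

    private static final String[] WORDS = {
        "zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"
    };

    /** Replace each digit with its spoken word, e.g. "49" -> "four, nine". */
    static String spellDigits(String number) {
        StringBuilder sb = new StringBuilder();
        for (char c : number.toCharArray()) {
            if (Character.isDigit(c)) {
                if (sb.length() > 0) sb.append(", ");
                sb.append(WORDS[c - '0']);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Function-call output "Tel: 4917283917" becomes:
        System.out.println("Telephone: " + spellDigits("4917283917"));
        // -> Telephone: four, nine, one, seven, two, eight, three, nine, one, seven
    }
}
```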

The address issue is not the numbers; it does well on those. The main issue is that it is not “hearing” the audio for names and addresses clearly enough to save them successfully. It is fine with phone numbers in English, but struggles when there are 3+ repeated digits in a row.