Realtime API invents numbers or does not understand number sequences

Dear community and devs,

It seems the Realtime model, used via the API, has a problem understanding number sequences.

For instance, 7-8 out of 10 API calls, each made in a fresh session, show the model being inaccurate or tending to invent digits.

You’d say “repeat after me: 1 2 3 6 6” and get a reply like “1 2 3 4 5 6”.

Or say a longer number such as 4326514.

I have tried saying the numbers in pairs.

I have tried different prompts, different config, different temperature.

Sometimes I land on the right version of the model and it is accurate on almost every request, but most of the time all I get is random numbers.

My prompt says, aggressively:

WHEN PRESENTED WITH A NUMBER SEQUENCE: NEVER EVER INVENT NUMBERS, NEVER SAY RANDOM NUMBERS, NEVER SAY NUMBERS YOU HAVENT HEARD. IF PRESENTED WITH CONSECUTIVE EQUAL NUMBERS, REPEAT THEM INDIVIDUALLY. don’t make mistakes with numbers. Separate each single one.

But with no luck.

Context: I’m using the Realtime API to provide a phone service for e-commerce stores, and users struggle to retrieve their orders because of this.
Although we use μ-law (ulaw) as the codec, I’ve downloaded the audio and it is 100% legible. I have also used the Realtime console’s code to try to reproduce the setup, but even the model in the phone app fails to repeat numbers given in sequence.

Any workarounds, or an actual fix for this bug?

Kind Regards.


You can’t prompt out inherent difficulties in Large Language Models. It’s just noise.

This is a common issue. Other users noted that the transcript usually has a more accurate understanding. That’s your best bet.

Or, you can resort to the good ol’ tried-and-tested number-pad format that has historically worked.

It’s not true in my case. The transcript is incorrect as well.

As for the number pad it’s really not a bad idea :+1:

But maybe there is a way.

Realistically you’re going to need to move towards an agentic system. I hate using this term because it’s been abused by everybody and is almost meaningless, so I’ll define it here:

A programmed system that breaks down a task, and delegates the right model/service for each sub-task.


You’ve hit a deep, fundamental issue with the RealTime Models. Since it’s a proprietary service you have very little control over it. It’s time to pass the baton.

You can use its function calling to pass the audio binary to another service (or multiple services) that specializes in numbers. AWS Transcribe, for example.

Create a dataset of audio files, both ones that work and ones that don’t, then shop around for a service that transcribes them accurately. Then you can “inject” the numbers into the Realtime API context.
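The “inject” step above can be sketched as a `conversation.item.create` event sent over the Realtime websocket; the exact field names below are based on the documented event schema, but treat them as an assumption to verify against the current API reference:

```python
import json

def make_context_injection(transcript: str) -> str:
    """Build a Realtime API event that injects an externally verified
    transcript into the conversation as a system-style message."""
    event = {
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "system",
            "content": [
                {
                    "type": "input_text",
                    "text": "The caller's number, verified by an external "
                            f"transcriber, is: {transcript}",
                }
            ],
        },
    }
    return json.dumps(event)

# Over an open websocket connection (hypothetical `ws` handle):
# ws.send(make_context_injection("4 6 8 6 3 8 8"))
```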

It’s expected that even the best service won’t be perfect. That’s why the number pad idea is preferred. It’s deterministic.

Appreciate the message.

I thought I’d run Whisper locally, transcribe the recorded chunk, and pass it back as an argument to the API call, all while the Realtime model asynchronously says something like: “Give me a minute to look up the order for you… please stand by,” or whatever.

The underlying transcription service is Whisper.

It may make sense to use several services and then compare the results.
Ultimately, you want to have a deterministic fallback.

Ah I see, so Whisper is ruled out.

Thank you.

How are you able to pass in “audio binaries” via tool function calling?

If you are handling the streams you should already have the audio held temporarily. You can tie in function calling to package it and send it to another service.

Just like you would send the audio to OpenAI to be processed… just another service
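Since the stream in this thread uses μ-law (G.711) and most external STT services expect linear PCM, the packaging step usually includes a decode. A minimal pure-Python sketch of the standard G.711 μ-law expansion:

```python
def ulaw_byte_to_pcm16(u: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit PCM sample."""
    u = ~u & 0xFF                        # mu-law bytes are stored complemented
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample

def ulaw_chunk_to_pcm16(chunk: bytes) -> bytes:
    """Decode a mu-law audio chunk into little-endian 16-bit PCM,
    ready to forward to an external transcription service."""
    out = bytearray()
    for b in chunk:
        out += ulaw_byte_to_pcm16(b).to_bytes(2, "little", signed=True)
    return bytes(out)

# 0xFF encodes silence (0); 0x80 decodes to the maximum magnitude, 32124.
```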

Correct. I’ve followed your advice and added Deepgram as middleware before executing the real function, and it works very well, given that they advertise high accuracy on number transcription.

```json
{
  "results": {
    "channels": [
      {
        "alternatives": [
          {
            "transcript": "4 6 8 6 3 8 8",
            "confidence": 0.99902344,
            "words": [
              {"word": "4", "start": 0.7386786, "end": 1.0980357, "confidence": 0.99902344},
              {"word": "6", "start": 1.0980357, "end": 1.3775357, "confidence": 0.9980469},
              {"word": "8", "start": 1.3775357, "end": 1.8775357, "confidence": 0.99853516},
              {"word": "6", "start": 2.0563214, "end": 2.4156785, "confidence": 0.99902344},
              {"word": "3", "start": 2.4156785, "end": 2.8548927, "confidence": 0.9995117},
              {"word": "8", "start": 2.8548927, "end": 3.1343927, "confidence": 0.99902344},
              {"word": "8", "start": 3.1343927, "end": 3.4138927, "confidence": 0.9995117}
            ]
          }
        ]
      }
    ]
  }
}
```

VERBOSE Raw transcript: 4 6 8 6 3 8 8
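For anyone wiring this up: the per-word confidences in a response shaped like the one above make it easy to flag doubtful digits and route those calls to the deterministic fallback. A minimal parsing sketch (the JSON shape mirrors the reply shown; the threshold is an arbitrary choice):

```python
import json

def extract_digits(response_json: str, threshold: float = 0.95):
    """Pull the transcript and low-confidence words out of a
    Deepgram-style response; return (digits, suspect_words)."""
    data = json.loads(response_json)
    alt = data["results"]["channels"][0]["alternatives"][0]
    digits = alt["transcript"].replace(" ", "")
    suspect = [w["word"] for w in alt["words"] if w["confidence"] < threshold]
    return digits, suspect

# Abbreviated example in the same shape as the response above:
raw = json.dumps({"results": {"channels": [{"alternatives": [{
    "transcript": "4 6 8 6 3 8 8",
    "confidence": 0.99902344,
    "words": [{"word": "4", "start": 0.74, "end": 1.10, "confidence": 0.99902344},
              {"word": "6", "start": 1.10, "end": 1.38, "confidence": 0.9980469}],
}]}]}})
print(extract_digits(raw))  # ('4686388', [])
```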

Thank you.


Ah I see. Is this assuming a web socket connection? I don’t believe I am able to access raw audio in a WebRTC connection. [EDIT: Answered my own question with a search].

WebRTC is okay as well. You would be handling the RTP(?) packets, which carry the underlying audio binary.
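Unwrapping those packets means stripping the RFC 3550 RTP header (a fixed 12 bytes plus any CSRC entries) before the μ-law payload. A minimal parser sketch, ignoring the extension and padding bits for brevity:

```python
import struct

def rtp_payload(packet: bytes) -> bytes:
    """Strip the RFC 3550 RTP header and return the audio payload
    (e.g. G.711 mu-law bytes for payload type 0 / PCMU)."""
    if len(packet) < 12:
        raise ValueError("packet shorter than fixed RTP header")
    first, second, seq = struct.unpack("!BBH", packet[:4])
    if first >> 6 != 2:
        raise ValueError("not an RTP version 2 packet")
    csrc_count = first & 0x0F
    header_len = 12 + 4 * csrc_count     # fixed header + CSRC list
    # NOTE: a real parser must also honor the extension (X) and
    # padding (P) bits, which shift where the payload starts/ends.
    return packet[header_len:]

# Synthetic packet: V=2, no CSRCs, PT=0 (PCMU), seq=7, ts=0, SSRC=0x1234,
# followed by three mu-law payload bytes.
pkt = (bytes([0x80, 0x00]) + (7).to_bytes(2, "big") + (0).to_bytes(4, "big")
       + (0x1234).to_bytes(4, "big") + b"\xff\xfe\xfd")
print(rtp_payload(pkt))  # b'\xff\xfe\xfd'
```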
