Realtime API invents numbers or does not understand number sequences

Dear community and devs,

It seems the Realtime model, used via the API, has a problem understanding number sequences.

For instance, 7-8 out of 10 API calls, each made in a fresh session, show the model being inaccurate or tending to invent digits.

You’d say “repeat after me: 1 2 3 6 6” and get a reply like “1 2 3 4 5 6”.

Or say a longer number such as 4326514.

I have tried saying the numbers in pairs.

I have tried different prompts, different config, different temperature.

Sometimes I land on the right version of the model and it is accurate on almost every request, but most of the time all I get is random numbers.

My prompt says, aggressively:

WHEN PRESENTED WITH A NUMBER SEQUENCE: NEVER EVER INVENT NUMBERS, NEVER SAY RANDOM NUMBERS, NEVER SAY NUMBERS YOU HAVENT HEARD. IF PRESENTED WITH CONSECUTIVE EQUAL NUMBERS, REPEAT THEM INDIVIDUALLY. don’t make mistakes with numbers. Separate each single one.

But with no luck.

Context: I’m using the Realtime API to provide a phone service for e-commerce stores, and users struggle to retrieve their orders because of this.
Although we use μ-law (ulaw) as the codec, I’ve downloaded the audio and it is 100% legible. I have also used the Realtime console’s code to try to reproduce the setup, but even the model in the phone app fails to repeat numbers given in sequence.

Any workarounds, or an actual fix for this bug?

Kind Regards.


You can’t prompt out inherent difficulties in Large Language Models. It’s just noise.

This is a common issue. Other users noted that the transcript usually has a more accurate understanding. That’s your best bet.

Or, you can resort to the good ol’ tried-and-tested number-pad format that has historically worked.

It’s not true in my case. The transcript is incorrect as well.

As for the number pad it’s really not a bad idea :+1:

But maybe there is a way.

Realistically you’re going to need to move towards an agentic system. I hate using this term because it’s been abused by everybody and is almost meaningless, so I’ll define it here:

A programmed system that breaks down a task, and delegates the right model/service for each sub-task.


You’ve hit a deep, fundamental issue with the RealTime Models. Since it’s a proprietary service you have very little control over it. It’s time to pass the baton.

You can use its function calling to pass the audio binary to another service (or multiple services) that specializes in numbers. AWS Transcribe, for example.

Create a dataset of audio files, both ones that work and ones that don’t, then shop around for a service that transcribes them accurately. Then you can “inject” the numbers into the Realtime API context.
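The “inject” step above can be sketched as a `conversation.item.create` event sent over the Realtime websocket; the exact field names below are based on the documented event schema, but treat them as an assumption to verify against the current API reference:

```python
import json

def make_context_injection(transcript: str) -> str:
    """Build a Realtime API event that injects an externally verified
    transcript into the conversation as a system-style message."""
    event = {
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "system",
            "content": [
                {
                    "type": "input_text",
                    "text": "The caller's number, verified by an external "
                            f"transcriber, is: {transcript}",
                }
            ],
        },
    }
    return json.dumps(event)

# Over an open websocket connection (hypothetical `ws` handle):
# ws.send(make_context_injection("4 6 8 6 3 8 8"))
```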

It’s expected that even the best service won’t be perfect. That’s why the number pad idea is preferred. It’s deterministic.

Appreciate the message.

I thought I’d run Whisper locally, transcribe the recorded chunk, and pass it back as an argument to the API call, all while the Realtime model asynchronously says something like: “Give me a minute to look up the order for you… please stand by,” or whatever.

The underlying transcription service is Whisper.

It may make sense to use several services and then compare the results.
Ultimately, you want to have a deterministic fallback.

Ah I see, so Whisper is ruled out.

Thank you.

How are you able to pass in “audio binaries” via tool function calling?

If you are handling the streams you should already have the audio held temporarily. You can tie in function calling to package it and send it to another service.

Just like you would send the audio to OpenAI to be processed… just another service
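Since the stream in this thread uses μ-law (G.711) and most external STT services expect linear PCM, the packaging step usually includes a decode. A minimal pure-Python sketch of the standard G.711 μ-law expansion:

```python
def ulaw_byte_to_pcm16(u: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit PCM sample."""
    u = ~u & 0xFF                        # mu-law bytes are stored complemented
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample

def ulaw_chunk_to_pcm16(chunk: bytes) -> bytes:
    """Decode a mu-law audio chunk into little-endian 16-bit PCM,
    ready to forward to an external transcription service."""
    out = bytearray()
    for b in chunk:
        out += ulaw_byte_to_pcm16(b).to_bytes(2, "little", signed=True)
    return bytes(out)

# 0xFF encodes silence (0); 0x80 decodes to the maximum magnitude, 32124.
```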

Correct. I’ve followed your advice and added Deepgram as middleware before executing the real function, and it works very well, given that they advertise high accuracy on number transcription.

```json
{
  "results": {
    "channels": [
      {
        "alternatives": [
          {
            "transcript": "4 6 8 6 3 8 8",
            "confidence": 0.99902344,
            "words": [
              {"word": "4", "start": 0.7386786, "end": 1.0980357, "confidence": 0.99902344},
              {"word": "6", "start": 1.0980357, "end": 1.3775357, "confidence": 0.9980469},
              {"word": "8", "start": 1.3775357, "end": 1.8775357, "confidence": 0.99853516},
              {"word": "6", "start": 2.0563214, "end": 2.4156785, "confidence": 0.99902344},
              {"word": "3", "start": 2.4156785, "end": 2.8548927, "confidence": 0.9995117},
              {"word": "8", "start": 2.8548927, "end": 3.1343927, "confidence": 0.99902344},
              {"word": "8", "start": 3.1343927, "end": 3.4138927, "confidence": 0.9995117}
            ]
          }
        ]
      }
    ]
  }
}
```

VERBOSE Raw transcript: 4 6 8 6 3 8 8
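For anyone wiring this up: the per-word confidences in a response shaped like the one above make it easy to flag doubtful digits and route those calls to the deterministic fallback. A minimal parsing sketch (the JSON shape mirrors the reply shown; the threshold is an arbitrary choice):

```python
import json

def extract_digits(response_json: str, threshold: float = 0.95):
    """Pull the transcript and low-confidence words out of a
    Deepgram-style response; return (digits, suspect_words)."""
    data = json.loads(response_json)
    alt = data["results"]["channels"][0]["alternatives"][0]
    digits = alt["transcript"].replace(" ", "")
    suspect = [w["word"] for w in alt["words"] if w["confidence"] < threshold]
    return digits, suspect

# Abbreviated example in the same shape as the response above:
raw = json.dumps({"results": {"channels": [{"alternatives": [{
    "transcript": "4 6 8 6 3 8 8",
    "confidence": 0.99902344,
    "words": [{"word": "4", "start": 0.74, "end": 1.10, "confidence": 0.99902344},
              {"word": "6", "start": 1.10, "end": 1.38, "confidence": 0.9980469}],
}]}]}})
print(extract_digits(raw))  # ('4686388', [])
```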

Thank you.


Ah I see. Is this assuming a web socket connection? I don’t believe I am able to access raw audio in a WebRTC connection. [EDIT: Answered my own question with a search].

WebRTC is okay as well. You would be handling the RTP(?) packets, which carry the underlying audio binary.
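Unwrapping those packets means stripping the RFC 3550 RTP header (a fixed 12 bytes plus any CSRC entries) before the μ-law payload. A minimal parser sketch, ignoring the extension and padding bits for brevity:

```python
import struct

def rtp_payload(packet: bytes) -> bytes:
    """Strip the RFC 3550 RTP header and return the audio payload
    (e.g. G.711 mu-law bytes for payload type 0 / PCMU)."""
    if len(packet) < 12:
        raise ValueError("packet shorter than fixed RTP header")
    first, second, seq = struct.unpack("!BBH", packet[:4])
    if first >> 6 != 2:
        raise ValueError("not an RTP version 2 packet")
    csrc_count = first & 0x0F
    header_len = 12 + 4 * csrc_count     # fixed header + CSRC list
    # NOTE: a real parser must also honor the extension (X) and
    # padding (P) bits, which shift where the payload starts/ends.
    return packet[header_len:]

# Synthetic packet: V=2, no CSRCs, PT=0 (PCMU), seq=7, ts=0, SSRC=0x1234,
# followed by three mu-law payload bytes.
pkt = (bytes([0x80, 0x00]) + (7).to_bytes(2, "big") + (0).to_bytes(4, "big")
       + (0x1234).to_bytes(4, "big") + b"\xff\xfe\xfd")
print(rtp_payload(pkt))  # b'\xff\xfe\xfd'
```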
