OpenAI development brothers and sisters,
So I am using the Realtime conversations API over WebSockets.
I love it.
But… I have an issue with how it pronounces address parts.
I might have an address like:
Address: 6241 Oakton Street
City: Morton Grove
State: Illinois
Zip: 60053
and when it's put into a string, I want it spoken like:
The address for Dr. Cal Evans’ dental office is 6 2 4 1 Oakton Street, Morton Grove, Illinois, 6 0 0 5 3. You can give them a call at 7 7 3, 9 3 5, 4 7 2 7. Take care!
I notice different voices speak this line differently. Some will pronounce the ZIP:
- six hundred oh fifty three
- six oh oh five three
- sixty thousand fifty three
So I take measures to carefully format the string before sending it to OpenAI.
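Something like this handles the digit spacing (space_digits is just an illustrative helper I'm sketching here, not anything from the SDK):

import re

def space_digits(text: str) -> str:
    """Insert a space between consecutive digits so TTS reads them one by one."""
    # "60053" -> "6 0 0 5 3"; single digits and non-digit text are left alone
    return re.sub(r"(?<=\d)(?=\d)", " ", text)

line = (
    f"The address for Dr. Cal Evans' dental office is "
    f"{space_digits('6241')} Oakton Street, Morton Grove, Illinois, "
    f"{space_digits('60053')}. You can give them a call at "
    f"{space_digits('773')}, {space_digits('935')}, {space_digits('4727')}. Take care!"
)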
Then I trigger the response over the Realtime socket at
wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-12-17
with this event:
response_trigger = {
    "type": "response.create",
    "response": {
        "modalities": ["text", "audio"],
        "output_audio_format": "pcm16",
        "instructions": "You must respond with both text and spoken audio. Use your tools to provide accurate, detailed responses with complete contact information when relevant."
    }
}
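(That event goes out as plain JSON over the socket; here's a minimal sketch with the websockets package, assuming the standard beta header, in case it matters:)

import asyncio
import json
import websockets  # pip install websockets

async def trigger_response(api_key: str) -> None:
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-12-17"
    headers = {"Authorization": f"Bearer {api_key}", "OpenAI-Beta": "realtime=v1"}
    # note: on websockets versions before 14 this keyword is extra_headers instead
    async with websockets.connect(url, additional_headers=headers) as ws:
        await ws.send(json.dumps(response_trigger))
        # ...then read response.text.delta / response.audio.delta events back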
But what I notice is that the text it returns is inconsistent with the audio it returns: the text is formatted correctly, but the audio doesn't follow it.
It's almost like you're completely dependent on the Realtime conversations API and its output, without being able to really influence how it pronounces things with a prompt.
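The kind of prompt steering I mean is a session-level instruction like this (session.update is the real event type; the wording is just an example), and it only partially helps:

session_update = {
    "type": "session.update",
    "session": {
        "instructions": (
            "When speaking street numbers, ZIP codes, or phone numbers, "
            "read each digit individually, e.g. 60053 as 'six, zero, zero, five, three'."
        )
    }
}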
Would a better design be to use the regular Chat Completions API to get the text, format it how I need, and then use a separate TTS service like ElevenLabs to synthesize the speech for playback?
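Something like this pipeline (a rough sketch only; the ElevenLabs voice ID and key are placeholders, and space_digits is the helper from above):

import requests
from openai import OpenAI

client = OpenAI()

# 1) get the text from a regular chat completion
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Give me the office address and phone number."}],
)
text = space_digits(completion.choices[0].message.content)

# 2) synthesize speech with a separate TTS service (ElevenLabs shown)
resp = requests.post(
    "https://api.elevenlabs.io/v1/text-to-speech/<voice_id>",
    headers={"xi-api-key": "<elevenlabs-key>"},
    json={"text": text},
)
resp.raise_for_status()
audio_bytes = resp.content  # mp3 by default; play back however you like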
Thanks,
Jim