Realtime conversations API - text and audio not consistent. Advice?

OpenAI development brothers and sisters,

I am using the Realtime conversations API over WebSockets.

I love it.

But… I have an issue with pronouncing address parts.

I might have an address like:
Address: 6241 Oakton Street
City: Morton Grove
State: Illinois
Zip: 60053

When I put it into a string, I want it spoken like this:

The address for Dr. Cal Evans’ dental office is 6 2 4 1 Oakton Street, Morton Grove, Illinois, 6 0 0 5 3. You can give them a call at 7 7 3, 9 3 5, 4 7 2 7. Take care!

I notice that different voices speak this line differently:
    Some pronounce the zip as "six hundred oh fifty three".
    Some pronounce the zip as "six oh oh five three".
    Some pronounce the zip as "sixty thousand fifty three".

So I take care to format the string before sending it to OpenAI at
wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-12-17
using this response trigger:

response_trigger = {
    "type": "response.create",
    "response": {
        "modalities": ["text", "audio"],
        "output_audio_format": "pcm16",
        "instructions": "You must respond with both text and spoken audio. Use your tools to provide accurate, detailed responses with complete contact information when relevant."
    }
}
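
For illustration, the formatting step mentioned above could be as small as spacing out every run of digits so the text itself spells them one at a time. This is only a sketch: `spell_digits` is a hypothetical helper name, not part of any OpenAI library, and it does nothing to guarantee how a given voice will read the result.

```python
import re


def spell_digits(text: str) -> str:
    """Put a space between the digits of every run of digits in `text`.

    Hypothetical helper for illustration; letters and punctuation are untouched.
    """
    return re.sub(r"\d+", lambda m: " ".join(m.group()), text)


if __name__ == "__main__":
    line = "6241 Oakton Street, Morton Grove, Illinois, 60053"
    print(spell_digits(line))
    # prints: 6 2 4 1 Oakton Street, Morton Grove, Illinois, 6 0 0 5 3
```

A blanket regex like this will also split digits you may want read as a number (for example "Suite 200"), so it is best applied only to fields such as street numbers and zip codes.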

But what I notice is that the text it returns is inconsistent with the audio it returns: the text is correct, but the audio is not.

It’s almost like you’re completely dependent on the Realtime conversations API and its output, without really being able to influence how it pronounces things using a prompt.

Would a better design be to use regular chat completions to get the text, format it how I need, and then use a separate voice library like ElevenLabs to synthesize the speech playback?

Thanks,
Jim

Worked around the issue. It looks as though the generated text is influenced by the prompt template I pass, but the audio is generated immediately and is not influenced by the prompt. So I added steps upstream to pre-process the data and format it properly with another AI step.
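
The workaround above uses another AI step for the pre-processing. When the data arrives as known fields (street, city, state, zip, phone), a plain deterministic formatter may be enough. The sketch below is that swapped-in alternative, not the AI step itself; all function names are hypothetical, and it assumes a 10-digit US phone number and a street that starts with the house number.

```python
import re


def spaced(digits: str) -> str:
    """Return the characters of `digits` separated by single spaces."""
    return " ".join(digits)


def speakable_phone(phone: str) -> str:
    """Format a 10-digit US phone number as comma-separated spoken groups."""
    digits = re.sub(r"\D", "", phone)  # drop dashes, spaces, parentheses
    if len(digits) != 10:
        raise ValueError(f"expected 10 digits, got {len(digits)}")
    groups = (digits[:3], digits[3:6], digits[6:])
    return ", ".join(spaced(g) for g in groups)


def speakable_address(street: str, city: str, state: str, zip_code: str) -> str:
    """Spell out the house number and zip code digit by digit."""
    number, _, rest = street.partition(" ")
    house = spaced(number) if number.isdigit() else number
    return f"{house} {rest}, {city}, {state}, {spaced(zip_code)}"


if __name__ == "__main__":
    print(speakable_address("6241 Oakton Street", "Morton Grove", "Illinois", "60053"))
    # prints: 6 2 4 1 Oakton Street, Morton Grove, Illinois, 6 0 0 5 3
    print(speakable_phone("773-935-4727"))
    # prints: 7 7 3, 9 3 5, 4 7 2 7
```

The commas between phone groups mirror the target sentence earlier in the post; whether a given voice actually pauses on them is something to test by ear.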