OpenAI development brothers and sisters,
So I am using the Realtime conversations API over WebSockets.
I love it.
But… I have an issue with how it pronounces address parts.
I might have an address like:
Address: 6241 Oakton Street
City: Morton Grove
State: Illinois
Zip: 60053
and when it's put into a string, I want it spoken like:
The address for Dr. Cal Evans’ dental office is 6 2 4 1 Oakton Street, Morton Grove, Illinois, 6 0 0 5 3. You can give them a call at 7 7 3, 9 3 5, 4 7 2 7. Take care!
I notice different voices speak this line differently. Some will pronounce the ZIP:
- six hundred oh fifty three
- six oh oh five three
- sixty thousand fifty three
So I take measures to carefully format the string before sending it to OpenAI.
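Something like this handles the digit spacing (space_digits is just an illustrative helper I'm sketching here, not anything from the SDK):

import re

def space_digits(text: str) -> str:
    """Insert a space between consecutive digits so TTS reads them one by one."""
    # "60053" -> "6 0 0 5 3"; single digits and non-digit text are left alone
    return re.sub(r"(?<=\d)(?=\d)", " ", text)

line = (
    f"The address for Dr. Cal Evans' dental office is "
    f"{space_digits('6241')} Oakton Street, Morton Grove, Illinois, "
    f"{space_digits('60053')}. You can give them a call at "
    f"{space_digits('773')}, {space_digits('935')}, {space_digits('4727')}. Take care!"
)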
Then I trigger the response over the Realtime socket at
wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-12-17
with this event:
response_trigger = {
    "type": "response.create",
    "response": {
        "modalities": ["text", "audio"],
        "output_audio_format": "pcm16",
        "instructions": "You must respond with both text and spoken audio. Use your tools to provide accurate, detailed responses with complete contact information when relevant."
    }
}
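(That event goes out as plain JSON over the socket; here's a minimal sketch with the websockets package, assuming the standard beta header, in case it matters:)

import asyncio
import json
import websockets  # pip install websockets

async def trigger_response(api_key: str) -> None:
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-12-17"
    headers = {"Authorization": f"Bearer {api_key}", "OpenAI-Beta": "realtime=v1"}
    # note: on websockets versions before 14 this keyword is extra_headers instead
    async with websockets.connect(url, additional_headers=headers) as ws:
        await ws.send(json.dumps(response_trigger))
        # ...then read response.text.delta / response.audio.delta events back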
But what I notice is that the text it returns is inconsistent with the audio it returns: the text is formatted correctly, but the audio doesn't follow it.
It's almost like you're completely dependent on the Realtime conversations API and its output, without being able to really influence how it pronounces things with a prompt.
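The kind of prompt steering I mean is a session-level instruction like this (session.update is the real event type; the wording is just an example), and it only partially helps:

session_update = {
    "type": "session.update",
    "session": {
        "instructions": (
            "When speaking street numbers, ZIP codes, or phone numbers, "
            "read each digit individually, e.g. 60053 as 'six, zero, zero, five, three'."
        )
    }
}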
Would a better design be to use the regular Chat Completions API to get the text, format it how I need, and then use a separate TTS service like ElevenLabs to synthesize the speech for playback?
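Something like this pipeline (a rough sketch only; the ElevenLabs voice ID and key are placeholders, and space_digits is the helper from above):

import requests
from openai import OpenAI

client = OpenAI()

# 1) get the text from a regular chat completion
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Give me the office address and phone number."}],
)
text = space_digits(completion.choices[0].message.content)

# 2) synthesize speech with a separate TTS service (ElevenLabs shown)
resp = requests.post(
    "https://api.elevenlabs.io/v1/text-to-speech/<voice_id>",
    headers={"xi-api-key": "<elevenlabs-key>"},
    json={"text": text},
)
resp.raise_for_status()
audio_bytes = resp.content  # mp3 by default; play back however you like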
Thanks,
Jim