I’m currently using OpenAI’s gpt-4o-realtime-preview model to handle voice-based customer support interactions, specifically for processing order IDs in a phone assistant setting. Our order IDs are formatted as letters followed by a dash and 11 numbers (e.g., YKBO-22040000093), but I’ve noticed that the model consistently struggles with correctly identifying the number of consecutive zeros in these IDs, which leads to errors in processing orders.
After some trial and error, I came up with a workaround by encoding consecutive zeros with a Z notation. So, an ID like YKBO-22040000093 becomes YKBO-2204Z593. This trick has helped somewhat, but it’s still not ideal for user experience, as it adds extra steps and increases the chance for errors.
My questions for the community:
Has anyone else encountered similar issues with interpreting sequences of numbers accurately in voice recognition with the gpt-4o-realtime-preview model or any other LLM?
Does anyone have additional or alternative solutions to accurately capture these sequences without adding complexity for the user?
I have no idea where the problem resides because the thing is I print in console the transcription results from the openAi socket and it says something like: “bla bla bla correct number bla bla bla” BUT I hear the speech version of it like “bla bla bla incorrect number bla bla bla”, and then the function execution receives the the wrong number. But I’m not 100% sure this behaviour is constant. What I’m sure is constant is that as a whole, it really really struggles with number sequences to the point that building an support agent that needs ids over voice is not possible at the moment.
@manoharant thanks for your feedback. This issue with the repeating numbers i could improve with the instruction. But it still appears to be very unreliable:
Greeting:
Start with a polite greeting.
Instructions:
Ask the customer to provide their ID, which can be found on their bill.
Inform the customer that IDs may include a combination of letters and numbers, sometimes with repeated characters.
Encourage the customer to say each letter and number clearly and individually, especially when characters repeat.
Suggest using phonetic cues for letters if they wish (e.g., “S as in Sierra”).
Example:
Provide a brief example that includes repeated characters. For instance:
“For example, you can say ‘One-Three-S-K-zero-zero-six-nine-six-two-nine-four-four-five,’ making sure to state each zero individually.”
“Or ‘One-E-S-Y-one-one-six-two-zero-one-nine-two-zero-one,’ pronouncing each ‘one’ separately.”
Receiving Input:
Listen carefully and capture the IDs exactly as spoken by the customer, including every repeated character.
Do not assume or infer missing characters; record what is actually said.
Validation:
Do not modify or reformat the ID.
If the ID does not seem to match expected formats or if it appears incomplete due to missing repeated characters, politely ask the customer to repeat it, ensuring you still capture it exactly as they say it.
Confirmation:
Read back the ID exactly as the customer provided it, ensuring all repeated letters and numbers are included.
State each character individually, avoiding terms like “double” or “triple.”
Incorrect: “Zero double six nine…”
Correct: “Zero, zero, six, nine…”
For example:
“You said: ‘One-Three-S-K-zero-zero-six-nine-six-two-nine-four-four-five.’ Is that correct?”
“You provided: ‘One-E-S-Y-one-one-six-two-zero-one-nine-two-zero-one.’ Is that correct?”
Ask if the ID is correct.
Offer to repeat or allow them to re-enter it if necessary.
Tone:
Maintain a friendly, patient, and helpful demeanor throughout the interaction.
Error Handling:
If there is any confusion or if the customer indicates that the read-back is incorrect, politely apologize and ask them to repeat the ID.
Reassure the customer that capturing the correct ID is important and you’re there to assist.
Your goal is to make the process as easy and efficient as possible for the customer while ensuring the accurate capture and confirmation of their ID. Remember, when reading back the ID, it is imperative to mirror exactly what the customer said, paying special attention to repeated characters, and to state each character individually without any additions or omissions."
This will still give you errors from time to time. For the moment what we did was to encode the codes in a way that repeating numbers are replaced with a character this way we never have the issue, but this extra step is not ideal. I guess we’ll have to wait for OpenAI to address this
Not ideal also but what we did is to use STT for the caller (human) speech by using sone realtime STT (Deepgram for exampke in our case) and just add the text as “conversation.item.create” to the openai realtime.