I’ve created a virtual assistant using the Realtime API. The assistant asks the user for an alphanumeric code and then asks whether it heard the code correctly.
The problem is that the model is unable to extract the code from the user audio (even though the Whisper transcription is correct).
Conversation example:
User: my code is RTA34FR
Whisper transcription: my code is RTA34FR
Assistant: Your code is RTEE4TC, is it correct?
I also tested in the Playground but I’m facing the same issue, so I assume it is not an audio quality problem.
I’ve had this issue as well with phone numbers and more complicated numbers or combinations of single letters and numbers.
This is a common problem across almost anything AI-related, and there is no real fix yet.
A good workaround is to handle the codes by calling the Whisper endpoint in addition to the realtime model (for example via tool use). This might take longer, but at least the codes will be accurate.
Hi @j.wischnat,
How do you achieve this? By passing the chat transcript back to GPT to revalidate the input text and return the actual user input, or in some other way?