Issue with Mismatch Between Realtime Audio and Transcription Text

Hi everyone,

I’m encountering an issue when using the Realtime API for word teaching. For example, I’m teaching words like “apple,” “banana,” and “orange” in sequence, but sometimes the transcription text doesn’t match the actual audio content.

For instance, when the audio plays “please repeat after me, apple,” the transcription text shows “please repeat after me, banana.” Has anyone else faced this issue, and how did you resolve it?

Looking forward to any insights or solutions. Thanks in advance!


I am facing the same problem.

I am using the Realtime API in Japanese. The transcription text says “Monday to Friday” while the audio actually pronounces “Thursday to Friday”, the transcription says “11,000 yen” while the audio pronounces “111,000 yen”, and so on. In both cases the transcription is the intended answer and the spoken audio is wrong.

The mistake is reproducible: no matter how many times you ask, it makes the same error.

Also, if you know how the AI tends to make mistakes, you can correct them to some extent by instruction. For example, the pronunciation could be corrected with a prompt like “For amounts of money, please convert and output them as ‘11 thousand yen’ instead of ‘11,000 yen’.”
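For reference, this is roughly how I send that kind of instruction as a session.update event. It is only a minimal sketch: the wording of the instruction is just an example, and `ws` stands for whatever already-open Realtime WebSocket connection you have in your own code.

```python
import json

def send_correction_instructions(ws) -> None:
    # `ws` is assumed to be an already-open WebSocket connection to the
    # Realtime API (e.g. from the websocket-client library).
    # session.update replaces the session instructions mid-session.
    instructions = (
        "For amounts of money, convert and output them as "
        "'11 thousand yen' instead of '11,000 yen'."
    )
    event = {
        "type": "session.update",
        "session": {"instructions": instructions},
    }
    ws.send(json.dumps(event))
```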

They are two different models acting at the same time, so simply put, either the realtime model is not “hearing” the audio correctly or the Whisper one is not, depending on which output is correct. They don’t know about each other (I don’t think the Whisper transcript gets into the realtime model’s context at all; perhaps you could feed it back with conversation.item.create events if you want to try, roughly as sketched below).
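If you do want to experiment with that, something like the following is what I had in mind. The event shape is what I believe conversation.item.create expects; `feed_transcript_back` and `ws` are just my own placeholders, not anything official, so treat this as a sketch rather than a known fix.

```python
import json

def feed_transcript_back(ws, transcript_text: str) -> None:
    # Inject the transcript of the last spoken response back into the
    # conversation as a text item, so the realtime model at least "sees"
    # what the transcription side produced. `ws` is assumed to be an
    # already-open WebSocket connection to the Realtime API.
    event = {
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "user",
            "content": [
                {
                    "type": "input_text",
                    "text": f"Transcript of your last spoken reply: {transcript_text}",
                }
            ],
        },
    }
    ws.send(json.dumps(event))
```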

Until both the model and the transcription system get better at hearing the audio correctly, it will continue to make these mistakes.