RealTime API Transcription errors

Hello everyone, first time posting here—I really appreciate everyone’s help!

I’m currently developing a translation app using OpenAI’s Realtime API. The translation works impressively well, but I’ve been running into significant transcription accuracy issues, and only with the Realtime API. Previously I used Whisper and GPT-4, and they worked perfectly. With the Realtime API, the transcribed text sometimes doesn’t correlate with the translated output at all: it is completely unrelated, which is really strange.

For instance, when processing audio input in a particular language, the translated text comes out correctly and makes sense. However, the corresponding transcription often fails horribly, displaying text that doesn’t match the audio input or the translation in any way.

Has anyone else experienced similar issues with the transcription service? I thought there wouldn’t be any problems since it uses Whisper, but it seems to fail miserably at times. Is there something I might be overlooking in the implementation, or could this be a problem with the API itself?

Any insights, suggestions, or guidance would be greatly appreciated!

Thank you!

It happened to me all the time. The transcription of my input made no sense, and the model’s response also made no sense. It would bring up random topics that weren’t part of my input.

The first thing to try is saving the buffer locally and running it through Whisper, to see whether the problem lies with the service or with the format of your audio.

If the audio format differs from what the API expects, I’d imagine you would get nonsensical responses.
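To make the "save the buffer locally" step concrete, here is a minimal sketch that wraps raw audio bytes in a WAV container so you can listen to them or feed them to Whisper. It assumes the buffer is 16-bit mono PCM at 24 kHz; those defaults and the helper name `save_pcm16_wav` are assumptions for illustration, so adjust them to whatever your client actually sends.

```python
import wave


def save_pcm16_wav(pcm_bytes: bytes, path: str,
                   sample_rate: int = 24000, channels: int = 1) -> float:
    """Write raw 16-bit PCM bytes to a WAV file and return its duration in seconds.

    sample_rate and channels are assumptions; they must match how the
    audio was actually captured, or playback will sound sped up or slowed down.
    """
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)          # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(pcm_bytes)
    return len(pcm_bytes) / (2 * channels * sample_rate)
```

If the saved file sounds wrong when you play it back (too fast, too slow, or pure noise), the problem is most likely in the audio format rather than in the transcription service.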

Hey, thank you! I downloaded my audio and figured out it was in slow motion, lol. Got that fixed now.
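"Slow motion" playback usually points to a sample-rate mismatch: the audio was captured at a higher rate than the one declared when it was sent or saved. A rough way to check, assuming you know how long you actually recorded and that the data is 16-bit PCM (both function names here are hypothetical helpers, not part of any API):

```python
def infer_capture_rate(num_bytes: int, recorded_seconds: float,
                       channels: int = 1, bytes_per_sample: int = 2) -> float:
    """Estimate the real sample rate from buffer size and wall-clock recording time."""
    return num_bytes / (bytes_per_sample * channels * recorded_seconds)


def playback_speed(declared_rate: float, actual_rate: float) -> float:
    """Speed factor heard when audio captured at actual_rate is treated as declared_rate.

    A value below 1.0 means slowed-down audio; above 1.0 means sped-up audio.
    """
    return declared_rate / actual_rate
```

For example, 3 seconds of mono 16-bit audio occupying 288000 bytes works out to 48000 Hz; treating that as 24000 Hz gives a speed factor of 0.5, which is half-speed playback.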

I’m facing the same issue: the quality of the output transcription is very bad. Is there any way to pass a language in "input_audio_transcription": { "model": "whisper-1" }?
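For what it's worth, a language hint would presumably go next to the model inside a `session.update` event. The sketch below only builds the JSON payload; the `language` key is an assumption that nobody in this thread has confirmed, so check the current API reference before relying on it. `build_transcription_update` is a hypothetical helper name.

```python
import json
from typing import Optional


def build_transcription_update(language: Optional[str] = None,
                               model: str = "whisper-1") -> str:
    """Build a session.update event that configures input audio transcription.

    The "language" field (an ISO-639-1 code such as "th" or "vi") is an
    assumption; it is only included when a language is given.
    """
    transcription = {"model": model}
    if language:
        transcription["language"] = language
    event = {
        "type": "session.update",
        "session": {"input_audio_transcription": transcription},
    }
    return json.dumps(event)
```

You would send the returned string over the already-open WebSocket connection, for example `ws.send(build_transcription_update("th"))`, where `ws` is whatever client object you are using.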

I’m experiencing the same issue. It seems to work well with English, but for other languages like Thai and Vietnamese, especially with short audio inputs, it doesn’t work well. It doesn’t translate my input correctly, and the output is often nonsensical.