Realtime API transcription language

What is the exact behavior of the “language” parameter in “input_audio_transcription”? Does it only boost matching for the specified language? In our testing, even when the language is defined, other languages are still recognized and transcripts are created in languages other than the one we specified.

The documentation only says:

The language of the input audio. Supplying the input language in ISO-639-1 (e.g. en) format will improve accuracy and latency.
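For reference, this is roughly how we supply the hint in our tests: a minimal sketch of a session.update event over the Realtime WebSocket, assuming the field names from the documentation quote above, the standard Realtime endpoint and beta header, and the websockets Python package. The exact event shape may differ between API versions.

```python
# Minimal sketch: biasing Realtime API input transcription toward Dutch ("nl").
# The URL, model name, and event shape are assumptions based on the standard
# Realtime API setup; adjust them to your actual session configuration.
import asyncio
import json
import os

import websockets  # pip install websockets

REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"

async def configure_transcription_language() -> None:
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # Note: older websockets versions use `extra_headers` instead of
    # `additional_headers` for this keyword argument.
    async with websockets.connect(REALTIME_URL, additional_headers=headers) as ws:
        # The ISO-639-1 code biases Whisper toward that language; it does not
        # hard-restrict the output, so other languages can still appear.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "input_audio_transcription": {
                    "model": "whisper-1",
                    "language": "nl",
                },
            },
        }))

asyncio.run(configure_transcription_language())
```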


We have observed that language recognition often fails with minimal audio input, likely because the model considers multiple possible languages. Additionally, during real-time transcription, the model does not reference previous audio fragments but evaluates only the current audio snippet in isolation. We have clearly experienced this behavior in our phone-based customer support solution, AnswerPal, where we use real-time transcription to respond to incoming calls immediately.

When we transcribe the entire conversation after the call concludes, we notice significantly higher accuracy because the model has the full context available. It should be noted that our conversations primarily occur in Dutch, French, and German, so recognition accuracy might actually be better for English compared to these languages.
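For context, that post-call pass looks roughly like the sketch below: one request to the audio transcriptions endpoint over the merged recording, so the model sees the whole conversation at once. The openai Python SDK, the file path, and the "nl" language hint are illustrative assumptions, not an exact copy of our production code.

```python
# Sketch of a post-call transcription pass: send the full recording in one
# request so the model has the complete conversation as context.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder path to the merged post-call recording.
with open("call-recordings/2024-05-01-inbound.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",   # or "gpt-4o-transcribe"
        file=audio_file,
        language="nl",       # same ISO-639-1 bias as in the realtime session
    )

print(transcript.text)
```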

By explicitly defining a language tag (such as an ISO-639-1 code), you encourage the model to interpret the input specifically from that language perspective, reducing the likelihood of selecting other languages from the 98 available options. While the model may still recognize other languages, the accuracy and latency significantly improve for the specified language.


Hi, I have noticed that real-time user transcriptions sometimes end up in a different language (e.g., Portuguese instead of Spanish) and can be fairly inaccurate. Do you know of any way to address this behaviour?

You say “When we transcribe the entire conversation after the call concludes” - are you saving the audio files and transcribing them using a different endpoint? Is it possible to do this in (close to) real time?

Thanks for your help, appreciate it

  • Adding "language": "<ISO-639-1>" in input_audio_transcription acts only as a bias for Whisper.
    It skips language detection and improves latency/accuracy, but very short chunks or code-switching can still produce the “wrong” language.
  • For inbound calls, we capture both G.711 legs ourselves, merge them into one stereo WAV/MP3 after hang-up, and send that file to /audio/transcriptions (Whisper-1 or GPT-4o-transcribe). One long pass gives a much lower error rate than stitching the real-time fragments.
  • If you want cleaner captions during the call, stream a 5–10 s sliding-window clip to the same endpoint every few seconds and overwrite the subtitles when the result returns (see the sketch after this list). This keeps real-time speed and near offline-level accuracy.
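Here is a rough sketch of that sliding-window approach. The buffer size, the 16 kHz mono PCM16 format, the refresh interval, and the feed_audio / update_subtitle hooks are assumptions to make the idea concrete, not part of our actual stack.

```python
# Keep the most recent ~8 s of call audio in memory, re-transcribe it every
# few seconds, and overwrite the on-screen caption with each new result.
import io
import time
import wave
from collections import deque

from openai import OpenAI

SAMPLE_RATE = 16_000      # assumed capture rate
WINDOW_SECONDS = 8        # 5-10 s window, as suggested above
REFRESH_SECONDS = 3       # how often to re-transcribe the window

client = OpenAI()
# PCM16 audio = 2 bytes per sample; deque drops the oldest bytes automatically.
window = deque(maxlen=SAMPLE_RATE * WINDOW_SECONDS * 2)

def feed_audio(pcm16_chunk: bytes) -> None:
    """Called by the telephony stack with raw PCM16 audio as it arrives."""
    window.extend(pcm16_chunk)

def window_as_wav() -> io.BytesIO:
    """Wrap the current sliding window in a WAV container for the API."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(bytes(window))
    buf.seek(0)
    buf.name = "window.wav"  # the SDK uses the name to infer the file format
    return buf

def caption_loop(update_subtitle) -> None:
    """Periodically re-transcribe the window and overwrite the caption."""
    while True:
        time.sleep(REFRESH_SECONDS)
        if not window:
            continue
        result = client.audio.transcriptions.create(
            model="whisper-1",
            file=window_as_wav(),
            language="nl",   # same ISO-639-1 bias as in the realtime session
        )
        update_subtitle(result.text)  # replace the caption, don't append
```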