Realtime API transcription language

What is the exact behavior of the “language” parameter in “input_audio_transcription”? Does it only boost matching for the specified language? In our testing, even when the language is defined, other languages are still recognized and transcripts are created in languages other than the one we specified.

The documentation only says:

The language of the input audio. Supplying the input language in ISO-639-1 (e.g. en) format will improve accuracy and latency.
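For reference, this is roughly how we supply the hint in our tests: a minimal sketch of a session.update event over the Realtime WebSocket, assuming the field names from the documentation quote above, the standard Realtime endpoint and beta header, and the websockets Python package. The exact event shape may differ between API versions.

```python
# Minimal sketch: biasing Realtime API input transcription toward Dutch ("nl").
# The URL, model name, and event shape are assumptions based on the standard
# Realtime API setup; adjust them to your actual session configuration.
import asyncio
import json
import os

import websockets  # pip install websockets

REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"

async def configure_transcription_language() -> None:
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # Note: older websockets versions use `extra_headers` instead of
    # `additional_headers` for this keyword argument.
    async with websockets.connect(REALTIME_URL, additional_headers=headers) as ws:
        # The ISO-639-1 code biases Whisper toward that language; it does not
        # hard-restrict the output, so other languages can still appear.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "input_audio_transcription": {
                    "model": "whisper-1",
                    "language": "nl",
                },
            },
        }))

asyncio.run(configure_transcription_language())
```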


We have observed that language recognition often fails with minimal audio input, likely because the model considers multiple possible languages. Additionally, during real-time transcription, the model does not reference previous audio fragments but evaluates only the current audio snippet in isolation. We have clearly experienced this behavior in our phone-based customer support solution, AnswerPal, where we use real-time transcription to respond to incoming calls immediately.

When we transcribe the entire conversation after the call concludes, we notice significantly higher accuracy because the model has the full context available. It should be noted that our conversations primarily occur in Dutch, French, and German, so recognition accuracy might actually be better for English compared to these languages.
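For context, that post-call pass looks roughly like the sketch below: one request to the audio transcriptions endpoint over the merged recording, so the model sees the whole conversation at once. The openai Python SDK, the file path, and the "nl" language hint are illustrative assumptions, not an exact copy of our production code.

```python
# Sketch of a post-call transcription pass: send the full recording in one
# request so the model has the complete conversation as context.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder path to the merged post-call recording.
with open("call-recordings/2024-05-01-inbound.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",   # or "gpt-4o-transcribe"
        file=audio_file,
        language="nl",       # same ISO-639-1 bias as in the realtime session
    )

print(transcript.text)
```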

By explicitly defining a language tag (such as an ISO-639-1 code), you encourage the model to interpret the input specifically from that language perspective, reducing the likelihood of selecting other languages from the 98 available options. While the model may still recognize other languages, the accuracy and latency significantly improve for the specified language.


Hi, I have noticed that real-time user transcriptions sometimes end up in a different language (e.g., Portuguese instead of Spanish) and can be fairly inaccurate. Do you know of any way to address this behaviour?

You say “When we transcribe the entire conversation after the call concludes” - are you saving the audio files and transcribing them using a different endpoint? Is it possible to do this in (close to) real time?

Thanks for your help, appreciate it

  • Adding "language": "<ISO-639-1>" in input_audio_transcription acts only as a bias for Whisper.
    It skips language detection and improves latency/accuracy, but very short chunks or code-switching can still produce the “wrong” language.
  • For inbound calls, we capture both G.711 legs ourselves, merge them into one stereo WAV/MP3 after hang-up, and send that file to /audio/transcriptions (Whisper-1 or GPT-4o-transcribe). One long pass gives a much lower error rate than stitching the real-time fragments.
  • If you want cleaner captions during the call, stream a 5–10 s sliding-window clip to the same endpoint every few seconds and overwrite the subtitles when the result returns (see the sketch after this list). This keeps real-time speed and near offline-level accuracy.
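Here is a rough sketch of that sliding-window approach. The buffer size, the 16 kHz mono PCM16 format, the refresh interval, and the feed_audio / update_subtitle hooks are assumptions to make the idea concrete, not part of our actual stack.

```python
# Keep the most recent ~8 s of call audio in memory, re-transcribe it every
# few seconds, and overwrite the on-screen caption with each new result.
import io
import time
import wave
from collections import deque

from openai import OpenAI

SAMPLE_RATE = 16_000      # assumed capture rate
WINDOW_SECONDS = 8        # 5-10 s window, as suggested above
REFRESH_SECONDS = 3       # how often to re-transcribe the window

client = OpenAI()
# PCM16 audio = 2 bytes per sample; deque drops the oldest bytes automatically.
window = deque(maxlen=SAMPLE_RATE * WINDOW_SECONDS * 2)

def feed_audio(pcm16_chunk: bytes) -> None:
    """Called by the telephony stack with raw PCM16 audio as it arrives."""
    window.extend(pcm16_chunk)

def window_as_wav() -> io.BytesIO:
    """Wrap the current sliding window in a WAV container for the API."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(bytes(window))
    buf.seek(0)
    buf.name = "window.wav"  # the SDK uses the name to infer the file format
    return buf

def caption_loop(update_subtitle) -> None:
    """Periodically re-transcribe the window and overwrite the caption."""
    while True:
        time.sleep(REFRESH_SECONDS)
        if not window:
            continue
        result = client.audio.transcriptions.create(
            model="whisper-1",
            file=window_as_wav(),
            language="nl",   # same ISO-639-1 bias as in the realtime session
        )
        update_subtitle(result.text)  # replace the caption, don't append
```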