Follow-up Inquiry on Realtime API Issues in AI Interviewer Implementation

I am writing to follow up on specific challenges we have encountered while using the OpenAI Realtime API to build an AI Interviewer. The system is designed to simulate professional job interviews, incorporating a dynamic interview guide and Voice Activity Detection (VAD) for real-time interaction. To help diagnose and resolve the issues, we are providing the following details about our configuration and the behavior we expect.

Configuration Overview

We use the following session configuration:

realtimeClient.updateSession({
    turn_detection: {
        threshold: 0.5,            // VAD activation threshold (0.0 to 1.0)
        prefix_padding_ms: 300,    // audio retained from just before speech is detected
        silence_duration_ms: 600,  // silence required before a turn is considered finished
        type: 'server_vad'         // server-side voice activity detection
    },
    modalities: ['text', 'audio'],
    input_audio_format: 'pcm16',
    output_audio_format: 'pcm16',
    input_audio_transcription: { model: 'whisper-1' },  // Whisper transcription of user audio
    instructions,  // interview guide / system prompt (redacted copy attached)
    voice          // selected TTS voice
});

Issues Encountered

  1. Language Mismatch

Despite specifying a particular interview language (e.g., English), responses sometimes shift to other languages, likely influenced by the dynamic resume data. We intend responses to strictly follow the specified language setting.

  • Question: Is there a recommended way to enforce consistent language usage, even with multilingual input data?
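
For context, the only mitigation we have found so far is to restate the language requirement at the very top of the instructions whenever we configure the session. This is just a sketch of our workaround, not something taken from the documentation; interviewLanguage is our own variable, and instructions is the same prompt variable used in the configuration above.

const interviewLanguage = 'English'; // our own setting, not an API field
const languageDirective =
    `Always respond in ${interviewLanguage}, regardless of the language of the resume data or the candidate's speech.`;

// Prepend the directive so it is the first thing the model reads.
realtimeClient.updateSession({
    instructions: `${languageDirective}\n\n${instructions}`
});
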
  2. Noise Sensitivity and VAD Behavior

The VAD implementation occasionally interprets background noise as valid input, interrupting active audio streams or triggering unnecessary responses.

  • Question: Could you suggest best practices or configuration adjustments to reduce noise sensitivity and enhance VAD accuracy?
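
For reference, the adjustment we have been experimenting with is to raise the VAD threshold and lengthen the silence window relative to the values shown above; we would welcome confirmation of whether this is the intended way to make server-side VAD less sensitive, and what ranges are reasonable.

realtimeClient.updateSession({
    turn_detection: {
        type: 'server_vad',
        threshold: 0.7,           // higher than our current 0.5, so quieter sounds are less likely to count as speech
        prefix_padding_ms: 300,
        silence_duration_ms: 800  // require a longer pause before the turn is treated as finished
    }
});
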
  3. Self-Generated Responses

The AI sometimes initiates self-dialogue—posing questions and immediately answering them without processing actual user input. This deviates from the expected interaction flow.

  • Question: Are there potential misconfigurations or known API behaviors that could lead to this?
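
To clarify what we mean by the expected flow, here is a simplified illustration of our trigger logic. The onUserStoppedSpeaking hook is a placeholder rather than our real handler, and createResponse() stands in for however the client requests a model response; our understanding, which may be wrong, is that with server_vad enabled the API creates responses on its own and the client should not also request them.

// Simplified illustration: with server VAD we rely on the server to end the
// user's turn and create the response, so we never request one ourselves.
const usingServerVad = true; // matches the turn_detection config above

function onUserStoppedSpeaking() { // placeholder hook, not our real event handler
    if (!usingServerVad) {
        // Only a manual / push-to-talk mode would request a response explicitly.
        realtimeClient.createResponse();
    }
}
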
  4. Truncated Audio Playback

Audio responses occasionally cut off toward the end, even though the corresponding transcribed text appears complete.

  • Question: Could this result from API limitations, streaming inconsistencies, or an incorrect configuration?
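
To illustrate the ordering we believe should be correct, below is a rough sketch at the raw server-event level; ws and playbackQueue are placeholders for our transport and audio player. If the client library surfaces response.audio.done differently, that may be where our playback logic goes wrong.

// Rough sketch: keep buffering audio until the server signals that the audio
// stream for the current response is actually finished.
ws.on('message', (raw) => {
    const event = JSON.parse(raw);
    if (event.type === 'response.audio.delta') {
        playbackQueue.push(Buffer.from(event.delta, 'base64')); // base64-encoded PCM16 chunk
    } else if (event.type === 'response.audio.done') {
        playbackQueue.markComplete(); // only stop the player after this event
    }
});
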
  5. Speech Recognition Accuracy

There is a noticeable discrepancy between the transcribed text and the actual spoken input. Non-verbal sounds like coughing or sighing are sometimes interpreted as valid phrases (e.g., “Hello” or “OK”), triggering unintended responses.

  • Question: Are there configuration adjustments or additional filters to improve transcription accuracy and reduce misinterpretation of non-verbal sounds?
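
As a stopgap we are considering discarding very short or low-content transcripts before acting on them. The sketch below is our own heuristic rather than an official filter; handleCandidateAnswer is a placeholder, and the event referenced is the transcription-completed event we receive from the API.

// Heuristic filter: ignore transcripts that are too short to be a real answer,
// which is what coughs and sighs tend to produce ("Hello", "OK").
const SUSPICIOUS = new Set(['hello', 'ok', 'okay']);

function isLikelyNonVerbal(transcript) {
    const text = transcript.trim().toLowerCase().replace(/[.,!?]/g, '');
    return text.length < 3 || SUSPICIOUS.has(text);
}

// Fired when Whisper finishes transcribing a user audio item
// (conversation.item.input_audio_transcription.completed).
function onTranscriptionCompleted(event) {
    if (isLikelyNonVerbal(event.transcript)) {
        return; // skip downstream handling, e.g. do not advance the interview guide
    }
    handleCandidateAnswer(event.transcript); // placeholder for our own logic
}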

Request for Assistance

We kindly request the following:

  • Clarifications or solutions for the issues described above.

  • Any relevant documentation to improve our understanding and usage of the Realtime API in this context.

  • Insight into whether these issues could stem from the Realtime API’s beta status and how likely such issues are at this stage.

We appreciate any guidance you can provide to enhance the reliability of our AI Interviewer and ensure a seamless user experience. If necessary, we are happy to share additional configuration details or logs for further investigation.

To aid in your analysis, we have attached a version of the prompt instructions with sensitive information redacted.

Thank you for your support, and we look forward to your response.

I have the same issues with 2 and 5.
For 2, the noise sensitivity is extremely annoying. For example, when someone calls with music playing in the background, the model sometimes processes the music as input while the user is silent. I would like the model to lock onto the voice it recognized at the start and keep focusing on it for the entire session.
For 5, I find that the model recognizes names surprisingly badly. In several cases from different users, the speaker clearly says Jordan, yet the model reports their name as Mark, or similar.
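
In case it helps frame the problem, the workaround I am experimenting with is a simple client-side energy gate so that quiet background audio (like music bleed) never reaches the API at all. The threshold is a guess and needs per-device tuning, appendInputAudio is simply how I push microphone chunks in my setup, and this obviously cannot separate loud music from speech.

// Rough client-side gate: only forward audio chunks whose RMS energy is above
// a fixed threshold, so low-level background music is never sent to the API.
const RMS_THRESHOLD = 0.02; // guessed value, tune per device/microphone

function rms(int16Chunk) {
    let sum = 0;
    for (let i = 0; i < int16Chunk.length; i++) {
        const sample = int16Chunk[i] / 32768; // normalize PCM16 to [-1, 1]
        sum += sample * sample;
    }
    return Math.sqrt(sum / int16Chunk.length);
}

function maybeSendAudio(int16Chunk) {
    if (rms(int16Chunk) >= RMS_THRESHOLD) {
        realtimeClient.appendInputAudio(int16Chunk);
    }
}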