GPT-4o-Transcribe: Why Does the Final Output Sometimes Exactly Replicate the Configured Prompt?

I’m encountering an issue with GPT-4o-Transcribe where, in some cases, the final transcript is exactly the same as the prompt provided in the session configuration. I’m unsure why this happens, and I’d like to understand whether this is a bug in the API.

I’ve noticed this behavior occurs more frequently with Spanish text. Is there a known limitation or condition that causes the model to return the unmodified prompt as the transcription result?

Here’s a summary of what I’m seeing:

  • The final output is identical to the prompt.
  • This happens intermittently.
  • It severely degrades the real-time transcription experience and makes the model unsuitable for production use.
  • For this test, I used a HyperX QuadCast microphone.

Below I’m including some API event logs that show this behavior, along with my session configuration for reference.

Let me know if there’s a workaround or if this is something the team is already aware of. I’d really appreciate any guidance on how to mitigate or avoid this issue.


Logs:

Received message: {'type': 'input_audio_buffer.speech_started', 'event_id': 'event_BK4ZqecytxeVEAGTNsVMa', 'audio_start_ms': 3796, 'item_id': 'item_BK4Zqm01V3DvAZsu6hCt1'}
Received message: {'type': 'input_audio_buffer.speech_stopped', 'event_id': 'event_BK4ZrpiBrdUp2x8NQL0DX', 'audio_end_ms': 4960, 'item_id': 'item_BK4Zqm01V3DvAZsu6hCt1'}
Received message: {'type': 'input_audio_buffer.committed', 'event_id': 'event_BK4ZrAfYTpbGAOYL0G2L3', 'previous_item_id': 'item_BK4ZnE1K1NP0YgOYLMJ35', 'item_id': 'item_BK4Zqm01V3DvAZsu6hCt1'}
Received message: {'type': 'conversation.item.created', 'event_id': 'event_BK4ZrzhIZZBgxsefJPeJR', 'previous_item_id': 'item_BK4ZnE1K1NP0YgOYLMJ35', 'item': {'id': 'item_BK4Zqm01V3DvAZsu6hCt1', 'object': 'realtime.item', 'type': 'message', 'status': 'completed', 'role': 'user', 'content': [{'type': 'input_audio', 'transcript': None}]}}
Received message: {'type': 'conversation.item.input_audio_transcription.delta', 'event_id': 'event_BK4Zs1QyLEGUIY8HlWvak', 'item_id': 'item_BK4Zqm01V3DvAZsu6hCt1', 'content_index': 0, 'delta': 'Esta'}
Received message: {'type': 'conversation.item.input_audio_transcription.delta', 'event_id': 'event_BK4ZsuVpHgvR60LP5Isvn', 'item_id': 'item_BK4Zqm01V3DvAZsu6hCt1', 'content_index': 0, 'delta': ' es'}
Received message: {'type': 'conversation.item.input_audio_transcription.delta', 'event_id': 'event_BK4ZsDNjA8jfxZlzTCG0C', 'item_id': 'item_BK4Zqm01V3DvAZsu6hCt1', 'content_index': 0, 'delta': ' una'}
Received message: {'type': 'conversation.item.input_audio_transcription.delta', 'event_id': 'event_BK4Zswq1EW5fwM1DsIqw1', 'item_id': 'item_BK4Zqm01V3DvAZsu6hCt1', 'content_index': 0, 'delta': ' prueba'}
Received message: {'type': 'conversation.item.input_audio_transcription.delta', 'event_id': 'event_BK4ZsCTmm8YMZa3Oz27eb', 'item_id': 'item_BK4Zqm01V3DvAZsu6hCt1', 'content_index': 0, 'delta': ' para'}
Received message: {'type': 'conversation.item.input_audio_transcription.delta', 'event_id': 'event_BK4ZsSHnHgYY1M2Gw8cRX', 'item_id': 'item_BK4Zqm01V3DvAZsu6hCt1', 'content_index': 0, 'delta': ' mostrar'}
Received message: {'type': 'conversation.item.input_audio_transcription.delta', 'event_id': 'event_BK4ZsRiXuPJnE9V7NFA2p', 'item_id': 'item_BK4Zqm01V3DvAZsu6hCt1', 'content_index': 0, 'delta': ' el'}
Received message: {'type': 'conversation.item.input_audio_transcription.delta', 'event_id': 'event_BK4ZsUd2inA5KSKjNOJuW', 'item_id': 'item_BK4Zqm01V3DvAZsu6hCt1', 'content_index': 0, 'delta': ' bug'}
Received message: {'type': 'conversation.item.input_audio_transcription.delta', 'event_id': 'event_BK4ZsEEV9g764nTk5FJYr', 'item_id': 'item_BK4Zqm01V3DvAZsu6hCt1', 'content_index': 0, 'delta': ' de'}
Received message: {'type': 'conversation.item.input_audio_transcription.delta', 'event_id': 'event_BK4Zsmk7Pz4n1r3ju19rq', 'item_id': 'item_BK4Zqm01V3DvAZsu6hCt1', 'content_index': 0, 'delta': ' la'}
Received message: {'type': 'conversation.item.input_audio_transcription.delta', 'event_id': 'event_BK4ZsPugz89oEIS1amWCz', 'item_id': 'item_BK4Zqm01V3DvAZsu6hCt1', 'content_index': 0, 'delta': ' trans'}
Received message: {'type': 'conversation.item.input_audio_transcription.delta', 'event_id': 'event_BK4Zs9TbogfkvvqYQd2WL', 'item_id': 'item_BK4Zqm01V3DvAZsu6hCt1', 'content_index': 0, 'delta': 'cripción'}
Received message: {'type': 'conversation.item.input_audio_transcription.delta', 'event_id': 'event_BK4Zs0RNnmfrbRq27nfG4', 'item_id': 'item_BK4Zqm01V3DvAZsu6hCt1', 'content_index': 0, 'delta': '.'}
Received message: {'type': 'conversation.item.input_audio_transcription.completed', 'event_id': 'event_BK4Zs3IjOSOmW2CiDFGOq', 'item_id': 'item_BK4Zqm01V3DvAZsu6hCt1', 'content_index': 0, 'transcript': 'Esta es una prueba para mostrar el bug de la transcripción.'}

Session configuration:

    session_config = {
        "type": "transcription_session.update",
        "session": {
            "input_audio_format": "pcm16",
            "input_audio_transcription": {
                "model": "gpt-4o-transcribe",
                "language": "es",
                "prompt": "Esta es una prueba para mostrar el bug de la transcripción.",
            },
            "turn_detection": {
                "type": "server_vad",
                "threshold": 0.5,
                "prefix_padding_ms": 300,
                "silence_duration_ms": 300,
            },
            "input_audio_noise_reduction": {"type": "near_field"},
        },
    }
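One possible client-side mitigation, offered as a minimal sketch and not as a fix for the underlying behavior: compare each completed transcript against the configured prompt and drop exact echoes. Only the event type and the `transcript` field are taken from the logs above; the helper names `is_prompt_echo` and `handle_event` are hypothetical, and the normalization rules are an assumption.

```python
import unicodedata
from typing import Optional

COMPLETED = "conversation.item.input_audio_transcription.completed"


def _normalize(text: str) -> str:
    """Casefold, replace punctuation with spaces, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).casefold()
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text)
    return " ".join(kept.split())


def is_prompt_echo(transcript: str, prompt: str) -> bool:
    """True when the transcript equals the prompt after normalization."""
    if not prompt.strip():
        return False  # no prompt configured, so nothing to echo
    return _normalize(transcript) == _normalize(prompt)


def handle_event(event: dict, prompt: str) -> Optional[str]:
    """Return the transcript of a completed event, or None if it is
    another event type or looks like an echo of the prompt.
    (Hypothetical helper, not part of any SDK.)"""
    if event.get("type") != COMPLETED:
        return None
    transcript = event.get("transcript") or ""
    if is_prompt_echo(transcript, prompt):
        return None  # assumed policy: silently drop the echoed prompt
    return transcript
```

Note that this would also discard a genuine utterance that happens to match the prompt word for word, so it only makes sense when the prompt is not something a speaker is expected to say.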

I have a similar issue when transcribing Japanese.
I am using an audiobook from the Kokoro-Speech-Dataset.

I use the book chapter text as the prompt.
I run the audio file through VAD, which cuts it into smaller segments.
The opening part of the audio contains statements or information that are not in the prompt, and this is where gpt-4o-transcribe outputs the entire prompt content.
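For the long-prompt case, an exact comparison is probably too strict. A rough sketch of a plausibility check: flag a segment when its transcript has far more characters than the segment's duration could hold and also reproduces most of the prompt. The function names and both thresholds (`max_chars_per_s`, `min_coverage`) are assumptions chosen for illustration and would need tuning against real audio.

```python
from difflib import SequenceMatcher


def echo_score(transcript: str, prompt: str) -> float:
    """Fraction of the prompt covered by the longest run it shares
    with the transcript (0.0 to 1.0)."""
    if not prompt:
        return 0.0
    matcher = SequenceMatcher(None, transcript, prompt, autojunk=False)
    match = matcher.find_longest_match(0, len(transcript), 0, len(prompt))
    return match.size / len(prompt)


def looks_like_prompt_dump(
    transcript: str,
    prompt: str,
    duration_s: float,
    max_chars_per_s: float = 15.0,  # assumed upper bound on speaking rate
    min_coverage: float = 0.8,      # assumed share of the prompt reproduced
) -> bool:
    """True when the transcript is implausibly dense for the segment
    duration AND contains most of the prompt verbatim."""
    if duration_s <= 0:
        return False
    too_dense = len(transcript) / duration_s > max_chars_per_s
    return too_dense and echo_score(transcript, prompt) >= min_coverage
```

The segment duration can be derived on the client from the VAD boundaries, for example the difference between `audio_end_ms` and `audio_start_ms` in the realtime events shown earlier in this thread.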
