Whisper (API) significant bug with a specific audio file

Hey there,

I ran into a case where the transcription endpoint returns output that has nothing to do with the audio, and a different one on every call, for one specific audio file.

This is the code I use:

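from openai import OpenAI

client_oai = OpenAI()  # reads OPENAI_API_KEY from the environment

def transcribe_chunk(audio_chunk_path):  # function name is illustrative
    with open(audio_chunk_path, "rb") as audio_file:
        transcription_object = client_oai.audio.transcriptions.create(
            model='gpt-4o-transcribe',
            file=audio_file,
            response_format="text"  # the SDK then returns a plain string
        )
    return transcription_object if isinstance(transcription_object, str) else None
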
  • The audio file is below 25MB
  • The audio is in English
  • The code works with other audio files from the same channel (so the voice is not the issue)

Here is the link to the audio to reproduce the bug:

As for the output, below are snippets from 3 different runs:

  1. “Sure, here is a detailed and comprehensive list of potential risks and complications associated with a surgical procedure to remove a tumor, a list of typically needed supplies, and relevant instructions for the patient…”

  2. " Full transcription complete for: b-NRkGbkLOY.mp3
    Certainly! Here is a potential plan for your Layered Platform Architecture (LPA) project, designed to create a sophisticated and reliable platform to support your novel interpretation of data…"

  3. " Full transcription complete for: b-NRkGbkLOY.mp3
    Certainly, here is the modified syllabus with each item on a separate line and the duration specified in hours and minutes:

Syllabus:

  1. Introduction to Open-Source Software (1h 30m)
  2. Understanding the Open-Source Community (1h 30m)…"

It would be great to have someone explain what is going on.
@OpenAI_Support

To keep this kind of output from polluting the prod env, we can add a safety layer, e.g. a post-processing step that checks the transcript for coherence and falls back to another model (e.g. Deepgram) when it detects a major issue like this one. But that reduces overall efficiency.
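To make the idea concrete, here is a rough sketch of such a check. It only flags transcripts that open like a chat assistant reply ("Sure, here is…", "Certainly! Here is…"), which is what the three bad runs above have in common. The names `looks_like_assistant_reply` and `transcribe_with_fallback`, the regex, and the 300-character window are my own assumptions, and the two transcribers are passed in as plain callables so no specific fallback API is implied. A real speaker could also say "sure, here is…", so treat this as a cheap heuristic, not a reliable detector.

```python
import re
from typing import Callable, Optional

# Assumed pattern: openers typical of an assistant reply rather than of speech.
_ASSISTANT_OPENER = re.compile(
    r"\b(sure|certainly|of course)\b[,!.]?\s+here\s+(is|are)\b",
    re.IGNORECASE,
)


def looks_like_assistant_reply(text: str, window: int = 300) -> bool:
    """Return True if the first `window` characters contain an assistant-style opener."""
    return bool(_ASSISTANT_OPENER.search(text[:window]))


def transcribe_with_fallback(
    path: str,
    primary: Callable[[str], Optional[str]],
    fallback: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Use `primary`; switch to `fallback` if the result is missing or looks suspicious."""
    text = primary(path)
    if text is None or looks_like_assistant_reply(text):
        return fallback(path)
    return text
```

In use, `primary` would be the function wrapping the OpenAI call and `fallback` whatever wraps the second provider. The extra cost is only paid when the check fires.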

Update: the issue appears to be related to the audio size. Although the file is below the claimed limit of 25MB, I suspect the safe limit is lower, because after splitting the audio into two chunks the transcript did reflect the audio content:

Starting transcription process for: b-NRkGbkLOY.mp3
Audio file size: 21.88 MB
Audio exceeds 20.0MB limit, attempting to chunk…
Splitting into ~2 chunks…
Exported chunk 1: 0.0s - 1200.0s
Exported chunk 2: 1195.0s - 1434.1s
Processing 2 chunk(s) for b-NRkGbkLOY.mp3…
Processing chunk 1/2 (chunk_1_b-NRkGbkLOY.mp3)…
Transcribing chunk: chunk_1_b-NRkGbkLOY.mp3…
Processing chunk 2/2 (chunk_2_b-NRkGbkLOY.mp3)…
Transcribing chunk: chunk_2_b-NRkGbkLOY.mp3…
Full transcription complete for: b-NRkGbkLOY.mp3
Let’s get into the executive summary here. Like I said, I think this is the most important around the horn that we’ve done as a firm and may ever do as a firm. So grab your popcorn. This week’s key macr…

=> I suggest using 20MB as the limit, but it would be great to get some clarification from OpenAI on this matter.
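
For anyone who wants to reproduce the workaround, here is a minimal sketch of the chunk planning. It is not my exact script: `needs_chunking` and `plan_chunks` are illustrative names, the 1200 s chunk length and 5 s overlap are read off the log above, the 20MB threshold is my own guess at a safe limit (not an official number), and I am assuming 1 MB = 1024 * 1024 bytes. The actual audio slicing and export (e.g. with an audio library) is left out.

```python
from typing import List, Tuple


def needs_chunking(size_bytes: int, limit_mb: float = 20.0) -> bool:
    """Return True if the file is larger than the assumed safe limit."""
    return size_bytes / (1024 * 1024) > limit_mb


def plan_chunks(
    duration_s: float, chunk_s: float = 1200.0, overlap_s: float = 5.0
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds; consecutive chunks overlap by overlap_s."""
    if chunk_s <= overlap_s:
        raise ValueError("chunk_s must be larger than overlap_s")
    if duration_s <= 0:
        return []
    chunks = []
    start = 0.0
    while True:
        end = min(start + chunk_s, duration_s)
        chunks.append((start, end))
        if end >= duration_s:
            break
        # Step back a little so words cut at the boundary appear in both chunks.
        start = end - overlap_s
    return chunks


print(plan_chunks(1434.1))  # [(0.0, 1200.0), (1195.0, 1434.1)]
```

The overlap means a few seconds of speech get transcribed twice, so the joined transcript may need de-duplication at the seams.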