Whisper API cuts output short when input is large (but smaller than 25 MB)

Hi,

I am using the Whisper API. I ran into something strange and wonder if anyone can help.

When I use the Whisper API, if my input is small, then everything is fine. If the input is large (roughly larger than 10MB), the output is cut short. The larger the input is, the shorter the output is. I submitted a 20-minute audio file, but only the first 2 sentences were transcribed. When I trimmed this audio, longer output appeared.

What happened?

Sincerely,
JQ

Any help will be appreciated! Thanks

First: I would see if it is the file size, or the audio length.

For transcriptions, you can send Opus audio using a voice codec. This is three hours at under 20MB:

ffmpeg -i audio.mp3 -vn -map_metadata -1 -ac 1 -c:a libopus -b:a 12k -application voip audio.opus

It’s more efficient for everybody, and limiting to voice bandwidth can improve the transcription.
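
As a rough sanity check on the "three hours at under 20MB" figure, the size of a constant-bitrate encode can be estimated from bitrate and duration. This is only a sketch: the function name `estimated_size_mb` and the 5% container overhead are assumptions for illustration, and real Ogg overhead varies from file to file.

```python
def estimated_size_mb(bitrate_kbps: float, duration_s: float, overhead: float = 0.05) -> float:
    """Estimate encoded file size in megabytes (10**6 bytes).

    overhead is an assumed fraction added for container/muxing data.
    """
    audio_bytes = bitrate_kbps * 1000 / 8 * duration_s
    return audio_bytes * (1 + overhead) / 1_000_000


three_hours = 3 * 60 * 60  # seconds
print(f"{estimated_size_mb(12, three_hours):.1f} MB")  # prints "17.0 MB"
```

At 12 kbit/s the audio alone is about 16.2 MB for three hours, so the result stays under 20MB even with some container overhead.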

Then: is it terminating at a silence? Too much silence would normally get you some hallucinations after a long period, not a premature finish, but the behavior may have changed.

Thanks for the reply.

The file size is under 25 MB for us already. 25 MB is the limit of the API.

It was not terminating at a silence.

For the same input audio:
If I input the whole thing, I only got the first 2 sentences.
If I trimmed the input, I got longer output scripts.
If I trimmed the input down to less than 10MB, I got the whole script.

So I’ll take that as a “you didn’t try anything different”.

Well, I’ll try something different. Exactly what I documented.

Hour-long audio segment. Encode it to opus:

C:\chat\ai-examples\transcriptions>ffmpeg -i 1hr.wav -vn -map_metadata -1 -ac 1 -c:a libopus -b:a 12k -application voip 1hr.opus
ffmpeg version 2022-01-10-git-f37e66b393-full_build-www.gyan.dev Copyright (c) 2000-2022 the FFmpeg developers
  built with gcc 11.2.0 (Rev5, Built by MSYS2 project)
  configuration: --enable-gpl --enable-version3 --enable-static --disable-w32threads --disable-autodetect --enable-fontconfig --enable-iconv --enable-gnutls --enable-libxml2 --enable-gmp --enable-bzlib --enable-lzma --enable-libsnappy --enable-zlib --enable-librist --enable-libsrt --enable-libssh --enable-libzmq --enable-avisynth --enable-libbluray --enable-libcaca --enable-sdl2 --enable-libdav1d --enable-libdavs2 --enable-libuavs3d --enable-libzvbi --enable-librav1e --enable-libsvtav1 --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxavs2 --enable-libxvid --enable-libaom --enable-libopenjpeg --enable-libvpx --enable-mediafoundation --enable-libass --enable-frei0r --enable-libfreetype --enable-libfribidi --enable-libvidstab --enable-libvmaf --enable-libzimg --enable-amf --enable-cuda-llvm --enable-cuvid --enable-ffnvcodec --enable-nvdec --enable-nvenc --enable-d3d11va --enable-dxva2 --enable-libmfx --enable-libshaderc --enable-vulkan --enable-libplacebo --enable-opencl --enable-libcdio --enable-libgme --enable-libmodplug --enable-libopenmpt --enable-libopencore-amrwb --enable-libmp3lame --enable-libshine --enable-libtheora --enable-libtwolame --enable-libvo-amrwbenc --enable-libilbc --enable-libgsm --enable-libopencore-amrnb --enable-libopus --enable-libspeex --enable-libvorbis --enable-ladspa --enable-libbs2b --enable-libflite --enable-libmysofa --enable-librubberband --enable-libsoxr --enable-chromaprint
  libavutil      57. 18.100 / 57. 18.100
  libavcodec     59. 20.100 / 59. 20.100
  libavformat    59. 17.100 / 59. 17.100
  libavdevice    59.  5.100 / 59.  5.100
  libavfilter     8. 25.100 /  8. 25.100
  libswscale      6.  5.100 /  6.  5.100
  libswresample   4.  4.100 /  4.  4.100
  libpostproc    56.  4.100 / 56.  4.100
Guessed Channel Layout for Input Stream #0.0 : stereo
Input #0, wav, from '1hr.wav':
  Duration: 03:33:07.27, bitrate: 1411 kb/s
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 44100 Hz, stereo, s16, 1411 kb/s
Stream mapping:
  Stream #0:0 -> #0:0 (pcm_s16le (native) -> opus (libopus))
Press [q] to stop, [?] for help
Output #0, opus, to '1hr.opus':
  Metadata:
    encoder         : Lavf59.17.100
  Stream #0:0: Audio: opus, 48000 Hz, mono, s16, 12 kb/s
    Metadata:
      encoder         : Lavc59.20.100 libopus

Continuing with the encoder output: instead of 20 minutes coming to 10MB, I’ve got 60 minutes at about 5.5MB.

size=    5476kB time=01:00:00.01 bitrate=  12.5kbits/s speed=50.9x
video:0kB audio:5205kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 5.204176%

And then send it off for transcription. OpenAI doesn’t have .ogg listed for the API any more; it looks like they removed it from the list, yet it still works fine. The file is Opus with an .ogg extension (1hr.opus.ogg).

from openai import OpenAI

# Initialize the OpenAI client (reads OPENAI_API_KEY from the environment)
client = OpenAI()

# Open the audio file
with open("1hr.opus.ogg", "rb") as audio_file:
    # Create a transcription using the Whisper model
    try:
        transcription = client.audio.transcriptions.create(
            file=audio_file,
            language="en",
            model="whisper-1",
            prompt="Here is the radio show.",
            response_format="json",
            temperature=0.1)
    except Exception as e:
        # Stop here: without a response there is nothing to save
        raise SystemExit(f"An API error occurred: {e}")

# The JSON response format exposes the transcript as the .text attribute
transcribed_text = transcription.text

# Save the transcribed text to a file (UTF-8, so curly quotes survive on Windows)
try:
    with open("transcript.txt", "w", encoding="utf-8") as file:
        file.write(transcribed_text)
    print("Transcribed text successfully saved to 'transcript.txt'.")
except Exception as e:
    print(f"output file error: {e}")

# Show the first and last 320 characters as a quick check
print(f"{transcribed_text[:320]}\n...\n{transcribed_text[-320:]}")

All done, except that in several places the transcript went berserk shortly after a song was played for a short bit…replacing over a minute with a repeated loop of words. Maybe also because the talk got dirty, it didn’t accurately reproduce frank Howard Stern Show talk.

Summary

Yeah. Anyway, so Wolfie is observing JD, and we’ll get a report tomorrow to see if he finds out anything. (15:40 song 16:14) Wolfie is fully embedded as we speak. Where’s he observing him from? He can’t fit in that office. He’s right in my face. Oh, yeah? Where is he? (ERROR: 16:20) Like, where is he? He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. 
He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. He’s right in my face. (17:35)I had a sort of a panic attack a week or two before where like my stomach was like clenched. What was going on? I don’t remember. I don’t remember what it was specifically. I bet you use emojis. I very rarely.

Just lost the plot for whole sections when they were playing snippets of songs...

This guy’s still alive. He’s still alive. He’s still alive. He’s still alive. He’s still alive. He’s still alive. He’s still alive. He’s still alive. He’s still alive. He’s still alive. He’s still alive. He’s still alive. He’s still alive. He’s still alive. He’s still alive. He’s still alive. He’s still alive. He’s still alive. He’s still alive. He’s still alive. He’s still alive. He’s still alive. He’s still alive. He’s still alive. He’s still alive. He’s still alive. He’s still alive. He’s still alive. He’s still alive. He’s still alive. He’s still alive. He’s still alive. He’s still alive. He’s still alive. He’s still alive. He’s still alive. He’s still alive. He’s still alive. He’s still alive. He’s still alive. He’s still alive. He’s still alive. He’s still alive. He’s still alive. He’s still alive. La cunta calda, calda flor de flor. La cunta calda, calda ciudad santa. It’s good, isn’t it? Yeah. La cunta calda, calda flor de flor. Tic-tic-tic balda tus. Pienta tula tu. Tic-tic-tic balda tus. Pienta tula tu. Tic-tic-tic balda tus. Pienta tula tu. Got every word. You drop them tula punda boots. La cunta calda, calda tida cunda. La cunta calda, calda flor de flor. La cunta calda, calda ciudad cunda. I think this is the song Matt had slow dance to. When they’re coming down. Flor de flor. Flor de flor. Isn’t that nice? Thank you. Julian Vallard. Whole new take on that song. Hello. Tic-tic-tic balda tus. Pienta tula tu. Tic-tic-tic balda tus. Pienta tula tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. 
Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Yeah, and she did look like she just crawled out of a grave, you know, like her clothing. Yeah. We need like the best music video director in the world to do it. Because I’m telling you, I see the Shakira thing. Oh, yes. Where she’s doing her dancing in like a… Tic-tic-tic balda tu. Belly dancing kind of thing. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. She’s gone viral already. Her background dancers would be just going to town. Viral as in like staph infection. Yeah, exactly. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Tic-tic-tic balda tu. Yeah, we put the microphone like on a six-foot pole. Ha ha ha. Uh. La cunta calda. It’s funny that cunt, cunt, cunta. La cunta calda, calda, suede cunta. La cunta calda, calda, flora. This guy’s awesome. La cunta calda, calda, suede cunta. This could be the flip side. La cunta calda, calda, flora.
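
Loops like the ones above are easy to flag automatically before trusting a transcript. Here is a minimal sketch, assuming a loop shows up as the same sentence repeated back to back; the function name `find_repeated_runs` and the threshold of 5 are my own choices for illustration, and it will not catch loops that alternate between two or more sentences.

```python
import re


def find_repeated_runs(text: str, min_repeats: int = 5):
    """Return (sentence, count) for each run of identical consecutive sentences."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    runs = []
    i = 0
    while i < len(sentences):
        # Extend j to the last sentence of the run that starts at i.
        j = i
        while j + 1 < len(sentences) and sentences[j + 1] == sentences[i]:
            j += 1
        count = j - i + 1
        if count >= min_repeats:
            runs.append((sentences[i], count))
        i = j + 1
    return runs


sample = "Where is he? " + "He's right in my face. " * 8 + "I had a sort of a panic attack."
print(find_repeated_runs(sample))  # prints [("He's right in my face.", 8)]
```

Any run it reports is a candidate for re-transcribing just that stretch of audio.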

So you might not want to invest $0.36 in transcribing a whole hour at once anyway…
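
One way to limit the damage is to cut the audio into shorter pieces and transcribe them separately, so a loop only spoils one piece. A rough sketch, assuming 10-minute chunks and the same Opus settings as the encode above; the helper name `chunk_commands`, the chunk length, and the output file names are made up for illustration, and the per-minute price is simply the $0.36-per-hour figure divided by 60. Hard cuts can split a word at a boundary, so treat this as a starting point only.

```python
PRICE_PER_MINUTE_USD = 0.36 / 60  # derived from the $0.36-per-hour figure above


def chunk_commands(src: str, total_s: int, chunk_s: int = 600):
    """Build one ffmpeg command (as an argument list) per chunk of the source."""
    commands = []
    for index, start in enumerate(range(0, total_s, chunk_s)):
        length = min(chunk_s, total_s - start)  # the last chunk may be shorter
        commands.append([
            "ffmpeg", "-ss", str(start), "-t", str(length), "-i", src,
            "-vn", "-ac", "1", "-c:a", "libopus", "-b:a", "12k",
            "-application", "voip", f"chunk_{index:03d}.opus",
        ])
    return commands


commands = chunk_commands("1hr.wav", total_s=3600)
print(len(commands), "chunks")
print(" ".join(commands[0]))
print(f"cost per 10-minute chunk: ${10 * PRICE_PER_MINUTE_USD:.2f}")
```

Each argument list can be passed to `subprocess.run(cmd, check=True)`. If billing scales with duration, the total cost stays the same, but a bad result only means re-running one chunk.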

It seems there is a discrepancy in the supported formats listed between the Documentation and Reference pages.

Documentation

File uploads are currently limited to 25 MB and the following input file types are supported: mp3, mp4, mpeg, mpga, m4a, wav, and webm.

Reference page

The audio file object (not file name) to transcribe, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.
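
The difference between the two lists can be made explicit with a small check. This is just a sketch built from the two lists as quoted above; the names `DOCUMENTATION_FORMATS`, `REFERENCE_FORMATS`, and `listed_in` are made up for illustration, and being listed says nothing certain about what the API actually accepts.

```python
from pathlib import Path

# Extensions as quoted from the two pages above.
DOCUMENTATION_FORMATS = {"mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"}
REFERENCE_FORMATS = {"flac", "mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm"}


def listed_in(filename: str) -> dict:
    """Report which of the two lists includes the file's final extension."""
    ext = Path(filename).suffix.lower().lstrip(".")
    return {
        "documentation": ext in DOCUMENTATION_FORMATS,
        "reference": ext in REFERENCE_FORMATS,
    }


print(sorted(REFERENCE_FORMATS - DOCUMENTATION_FORMATS))  # prints ['flac', 'ogg']
print(listed_in("1hr.opus.ogg"))
print(listed_in("1hr.opus"))
```

The `.opus` extension appears in neither list, which would explain sending the Opus file with an `.ogg` extension.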