Can't find a way to transcribe files bigger than 25 MB

Hi guys!

I need your help as I am currently working on a project where I want to transcribe multiple files using the Whisper API regardless of their size.

The code works for all files below 25 MB, but if the file is larger than 25 MB I get an APIError (API Disconnect), even after chunking the file into smaller pieces.

I read multiple discussions here and tried several approaches to work around this, but I run into the same error each time:

  1. Converting the file to .ogg (which shrank it from about 50 MB to roughly 4 MB) and transcribing that file → same error (a sketch of this conversion is below the list)
  2. Adding a 1-second pause between API requests at the end of the script → didn’t help
  3. (current approach in the code below) Chunking audio files bigger than 20 MB into multiple chunks of at most 20 MB each, transcribing each chunk and appending the result to a temporary variable before moving on to the next chunk, then writing that variable to the transcription file → doesn’t work either
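
For reference, approach 1 was roughly this (assuming ffmpeg is installed for pydub to use; the exact codec and bitrate here are just illustrative):

from pydub import AudioSegment

# Re-encode the MP3 as a much smaller Ogg/Opus file before uploading
audio = AudioSegment.from_file("input.mp3", format="mp3")
audio.export("input.ogg", format="ogg", codec="libopus", bitrate="24k")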

I honestly don’t know what to do. I want to do this project for a potential client, but neither I nor ChatGPT can figure out how to resolve this.


Here’s my code:

def split_audio(file_path, target_size_mb=20, format='mp3'):
    audio = AudioSegment.from_file(file_path, format=format)
    # 1 byte = 8 bits, so 128 kbps = 128000 bits/s = 16000 bytes/s
    estimated_bitrate = 128000  # in bps
    bytes_per_second = estimated_bitrate / 8
    max_segment_duration_ms = int((target_size_mb * 1024 * 1024 / bytes_per_second) * 1000)

    chunks = []
    for i in range(0, len(audio), max_segment_duration_ms):
        chunk = audio[i:i+max_segment_duration_ms]
        chunk_name = f"{file_path}_part{i//max_segment_duration_ms}.{format}"
        chunk.export(chunk_name, format=format)
        chunks.append(chunk_name)

    return chunks


def transcribe_audio(file_path):
    with open(file_path, "rb") as audio_file:
        response = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file)
    return response.text


def process_directory(directory_path):
    output_dir = os.path.join(directory_path, "transcriptions")
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    for file_name in os.listdir(directory_path):
        if file_name.endswith('.mp3'):
            file_path = os.path.join(directory_path, file_name)
            file_size = os.path.getsize(file_path)
            complete_transcription=""

            if file_size > 20 * 1024 * 1024:
                parts = split_audio(file_path)
                for part in parts:
                    part_transcription = transcribe_audio(part)
                    complete_transcription += part_transcription + " "
                    os.remove(part)
            else:
                complete_transcription = transcribe_audio(file_path)

            with open(os.path.join(output_dir, file_name.replace('.mp3', '.txt')), 'w') as f:
                f.write(complete_transcription)


# Path to folder that contains the audio files
directory_path = # hidden/ not relevant for you
process_directory(directory_path)

You don’t show how you construct client, or what platform you are running on that might enforce its own network policy, but it is likely you are getting a timeout from the long transcription time.

You can set

client = OpenAI(timeout=900) # to give the API 15 minutes to wait.

There is no need to send parts that big, though. You can use a silence detector to find natural places to split the chunks. The reassembly needs to be resilient to failed API calls, retrying parts until all of them succeed, and saving results as files (with a small database of chunks) so a run can be resumed later.
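
A rough sketch of that idea, assuming pydub (which your code already uses) and its split_on_silence helper, and reusing your transcribe_audio function; the silence thresholds, chunk length and retry count are illustrative. You would still export each chunk to a temporary file before passing it to the API, as your split_audio already does.

from pydub import AudioSegment
from pydub.silence import split_on_silence
import time

def split_on_pauses(file_path, max_chunk_ms=10 * 60 * 1000):
    audio = AudioSegment.from_file(file_path)
    # Cut at pauses in speech, then merge the pieces back together up to ~10 minutes each
    pieces = split_on_silence(audio, min_silence_len=700,
                              silence_thresh=audio.dBFS - 16, keep_silence=300)
    chunks, current = [], AudioSegment.empty()
    for piece in pieces:
        if len(current) > 0 and len(current) + len(piece) > max_chunk_ms:
            chunks.append(current)
            current = AudioSegment.empty()
        current += piece
    if len(current) > 0:
        chunks.append(current)
    return chunks

def transcribe_with_retry(file_path, attempts=3):
    # Retry failed API calls so one disconnect does not ruin the whole run
    for attempt in range(attempts):
        try:
            return transcribe_audio(file_path)
        except Exception as e:
            print(f"Attempt {attempt + 1} failed for {file_path}: {e}")
            time.sleep(5)
    raise RuntimeError(f"Giving up on {file_path} after {attempts} attempts")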

MP3 audio is also encoded in frames with a bit reservoir that spans frames, so you cannot split the raw file at arbitrary byte offsets without data loss.

For larger MP3 files I compressed them by changing their bitrate and tempo, so I could create transcriptions of long speeches. My code is written to run in a Jupyter notebook, but you can easily extract what you need from it. You need to adjust tempo and bitrate to the file in question; for mine, compress_mp3(2.0, 64) worked well.

import os
import datetime
import subprocess
from tkinter import Tk
from tkinter.filedialog import askopenfilename

def compress_mp3(tempo, bitrate):
    # Print a message to indicate the start of the initialization process.
    print("Initializing...")
    
    # Withdraw the root window immediately to prevent a Tk window from appearing unnecessarily.
    Tk().withdraw()
    
    # Open a file dialog to select the MP3 file, filtering to show only audio files with .mp3 extension.
    file_path = askopenfilename(title='Select MP3 file', filetypes=[('Audio files', '*.mp3')])
    
    # Generate a timestamp to uniquely identify the output file, formatted as YearMonthDayHourMinuteSecond.
    timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
    
    # Check if a file path was not selected or the selection was canceled, then exit the function early.
    if not file_path:
        print("File selection canceled or no file selected.")
        return
    
    # Calculate the original file's size in megabytes.
    original_size = os.path.getsize(file_path) / (1024 * 1024)
    print(f"Original file size: {original_size:.2f} MB")
    
    # Extract the base name and extension of the original file to construct the new file name.
    base, ext = os.path.splitext(file_path)
    new_file_path = f"{base}_{tempo}x_{bitrate}kbps_{timestamp}.mp3"
    
    # Notify the user that the audio processing is starting, utilizing ffmpeg.
    print("Processing audio with ffmpeg...")
    
    # Define the command to run ffmpeg with parameters for input file, audio filter (tempo), audio bitrate, and output file.
    cmd = [
        'ffmpeg',
        '-i', file_path,
        '-filter:a', f"atempo={tempo}",
        '-ab', f"{bitrate}k",
        new_file_path
    ]
    
    # Execute the ffmpeg command as a subprocess.
    subprocess.run(cmd)
    
    # Calculate the new file's size in megabytes.
    new_size = os.path.getsize(new_file_path) / (1024 * 1024)
    print(f"New file size: {new_size:.2f} MB")
    
    # Print the path where the new file is saved, indicating the operation's completion.
    print(f"File saved as: {new_file_path}")

Hi guys, thank you for your help.

Actually the solution was way easier than expected.

The problem is simply that OpenAI states a maximum size of 25 MB, which in practice seems optimistic: even 20 MB chunks made the script crash as soon as a file of that size or more came out of the chunking process.

Changing the two 20s into 10s actually solved the problem, and now even 3-hour podcasts of 300 MB or more can be handled with ease; the script runs without any problems.
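
For reference, the two 20s are the default target_size_mb in split_audio and the size threshold in process_directory:

def split_audio(file_path, target_size_mb=10, format='mp3'):  # was 20
    ...

            if file_size > 10 * 1024 * 1024:  # was 20 * 1024 * 1024
                ...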

Thank you again for your help, the thread can be closed now :slight_smile: