Feedback: Whisper, "Meeting-minutes" tutorial audio sample issue

URL: OpenAI Platform

The audio sample provided is 50MB, twice the API limit (25MB). The documentation makes no mention of this, nor of the need to convert the audio format to reduce its size.
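
For anyone hitting this in the meantime, re-encoding to a compressed format gets the sample under the limit. A minimal sketch with pydub, assuming the sample is an uncompressed WAV (file names and bitrate are illustrative):

from pydub import AudioSegment

# Re-encode the sample as a 64 kbps MP3; for a ~50MB WAV this lands well under 25MB
audio = AudioSegment.from_file("sample_audio.wav")
audio.export("sample_audio.mp3", format="mp3", bitrate="64k")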

Thanks, keep up the good work

JT


Please see this section of the Whisper paper:

  Long-form Transcription
    Whisper models are trained on 30-second audio chunks and cannot
    consume longer audio inputs at once. This is not a problem with
    most academic datasets comprised of short utterances but presents
    challenges in real-world applications which often require
    transcribing minutes- or hours-long audio. We developed a strategy
    to perform buffered transcription of long audio by consecutively
    transcribing 30-second segments of audio and shifting the window
    according to the timestamps predicted by the model. We observed
    that it is crucial to have beam search and temperature scheduling
    based on the repetitiveness and the log probability of the model
    predictions in order to reliably transcribe long audio. The full
    procedure is described in Section 4.5.
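
For local transcription, the open-source whisper package implements this buffered strategy for you. A sketch of the relevant knobs (the temperature schedule and thresholds shown are the library defaults, beam_size=5 matches the paper's setup, and the file name is illustrative):

import whisper

# Long-form transcription with temperature fallback, as described in the quote above
model = whisper.load_model("base")
result = model.transcribe(
    "long_meeting.mp3",
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),  # fallback schedule when decoding fails quality checks
    compression_ratio_threshold=2.4,  # retry at higher temperature if output looks too repetitive
    logprob_threshold=-1.0,           # retry at higher temperature if average log probability is too low
    beam_size=5,                      # beam search for more reliable decoding
)
print(result["text"])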

If you can chunk your own audio based on silence detection, you will likely get better performance than sending big audio to the API.
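
A minimal sketch with pydub's silence splitter (the thresholds are assumptions and will need tuning per recording):

from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_file("meeting_audio.wav")  # illustrative file name
chunks = split_on_silence(
    audio,
    min_silence_len=700,             # a pause of 700 ms or more becomes a split point
    silence_thresh=audio.dBFS - 14,  # "silence" is 14 dB below the clip's average loudness
    keep_silence=200,                # keep 200 ms of padding so words are not clipped
)
for i, chunk in enumerate(chunks):
    chunk.export(f"chunk{i}.wav", format="wav")

Splitting at pauses means no word gets cut in half at an arbitrary 30-second boundary.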


Thanks all. I know and have read what you are linking to; this is feedback for the creator of the official tutorial. I am highlighting that the provided sample will create unnecessary friction for new users.


It’s there now!

Agreed, I think mentioning the need for chunking would make the tutorial more seamless. Here was my fix:

import os
import openai  # this snippet targets the pre-1.0 openai library, which the tutorial uses
from pydub import AudioSegment
from docx import Document  # used later in the tutorial to write out the minutes

# Function to split the audio file into chunks
def split_audio(audio_file_path, chunk_length_ms=30000): # chunk_length_ms is 30 seconds by default
    audio = AudioSegment.from_file(audio_file_path)
    chunks = []
    
    for i in range(0, len(audio), chunk_length_ms):
        chunks.append(audio[i:i+chunk_length_ms])
    return chunks

# Function to transcribe each chunk with the Whisper API and stitch the text together
def transcribe_audio_chunks(chunks):
    transcription = ""
    os.makedirs('audio', exist_ok=True)  # make sure the temp directory exists
    for i, chunk in enumerate(chunks):
        # Export the chunk to a temporary file the API can read
        chunk_file = f'audio/chunk{i}.wav'
        chunk.export(chunk_file, format="wav")
        
        with open(chunk_file, 'rb') as audio_file:
            response = openai.Audio.transcribe("whisper-1", audio_file)
            transcription += response['text'] + " "  # space between chunks so words don't run together
        os.remove(chunk_file)  # clean up the temporary file
    return transcription

# Main function to transcribe an audio file
def transcribe_audio(audio_file_path):
    chunks = split_audio(audio_file_path)
    transcription = transcribe_audio_chunks(chunks)
    return transcription

Adjust the chunk_file path as needed.
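
Usage is then just (file name illustrative):

transcription = transcribe_audio("meeting_audio.wav")
print(transcription)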