Transcription only takes the first sentence of the audio file?

Hello - I am trying to transcribe an audio file in Python using the following code and this sample audio file:

from openai import OpenAI
import os, sys
from dotenv import load_dotenv

path = os.path.abspath(os.path.dirname(sys.argv[0]))
fn = os.path.join(path, ".env")
load_dotenv(fn)
CHATGPT_API_KEY = os.environ.get("CHATGPT_API_KEY")
client = OpenAI(api_key=CHATGPT_API_KEY)

fnAudio = os.path.join(path, "inp.wav")
audio_file = open(fnAudio, "rb")
transcription = client.audio.transcriptions.create(
    model="gpt-4o-transcribe",
    file=audio_file
)
print(transcription.text)

When I run the program, unfortunately only the first sentence is transcribed:
only: "The stale smell of old beer lingers."

Why is this not working for the full audio text?
Do I have to use any additional parameter?

Doc says: “By default, the response type will be json with the raw text included.”

You may want to try adding the parameter: response_format="text"

Thanks for the response - but when I add
response_format="text"
it is the same output as before.

I also tried to simply print the transcription variable and got this output:

Transcription(text='The stale smell of old beer lingers.', logprobs=None, usage=UsageTokens(input_tokens=183, output_tokens=11, total_tokens=194, type='tokens', input_token_details=UsageTokensInputTokenDetails(audio_tokens=183, text_tokens=0)))

So I still have the problem that only the first sentence of the audio is transcribed.
Could the problem also be that there is a relatively large gap between the first and second sentence in the audio file?

It seems like your code is correct; the problem isn't your Python, it's the audio file itself.

The model will transcribe the full file, but if the WAV has a long silent gap or a low-volume segment after the first sentence, GPT-4o-transcribe often treats that as end of utterance and stops there.

A few things you can check:

  1. Inspect the waveform

If there’s a big flat region after the first sentence, the model will stop early.

(That Kaggle dataset does contain files with long trailing silence.)
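If you don't have an audio editor handy, you can also scan for long silent gaps directly in Python with just the standard library. A quick sketch (the `find_silence_gaps` helper and its thresholds are made up for illustration, and it assumes 16-bit mono PCM WAV; the demo builds a synthetic file so you can see it flag a 2-second gap):

```python
import math
import os
import struct
import tempfile
import wave

def find_silence_gaps(path, rms_threshold=500, min_gap_s=1.0, chunk_ms=50):
    """Return (start_s, end_s) spans where the signal stays below rms_threshold."""
    with wave.open(path, "rb") as w:
        assert w.getnchannels() == 1 and w.getsampwidth() == 2, "expects 16-bit mono PCM"
        rate = w.getframerate()
        frames_per_chunk = int(rate * chunk_ms / 1000)
        gaps, gap_start, t = [], None, 0.0
        while True:
            raw = w.readframes(frames_per_chunk)
            if not raw:
                break
            samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
            rms = math.sqrt(sum(s * s for s in samples) / len(samples))
            if rms < rms_threshold:
                if gap_start is None:
                    gap_start = t  # silence begins here
            elif gap_start is not None:
                if t - gap_start >= min_gap_s:
                    gaps.append((gap_start, t))
                gap_start = None
            t += len(samples) / rate
        if gap_start is not None and t - gap_start >= min_gap_s:
            gaps.append((gap_start, t))  # file ends in silence
    return gaps

# Synthetic demo: 1 s of tone, 2 s of silence, 1 s of tone.
rate = 16000
tone = [int(8000 * math.sin(2 * math.pi * 440 * i / rate)) for i in range(rate)]
samples = tone + [0] * (2 * rate) + tone
demo = os.path.join(tempfile.gettempdir(), "demo_gap.wav")
with wave.open(demo, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(rate)
    w.writeframes(struct.pack("<%dh" % len(samples), *samples))

print(find_silence_gaps(demo))  # one gap of roughly (1.0, 3.0)
```

If this reports a multi-second gap right after your first sentence, that matches the early-cutoff behavior described above.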

  2. Normalize or trim the audio

Even just running:

ffmpeg -i inp.wav -af "loudnorm" out.wav

or trimming silence:

ffmpeg -i inp.wav -af silenceremove=1:0:-50dB out.wav

usually fixes it.
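If you'd rather drive this from Python, you can build the ffmpeg command there too. A sketch (the `build_trim_cmd` helper is made up; it assumes ffmpeg is on your PATH, and note that the positional `silenceremove=1:0:-50dB` form above only strips *leading* silence, so the `stop_*` options here are what shorten gaps in the middle of the file):

```python
import shlex

def build_trim_cmd(src: str, dst: str, threshold_db: int = -50, keep_gap_s: float = 1.0) -> list[str]:
    """Build an ffmpeg command that strips leading silence and shortens long internal pauses."""
    # start_* options handle silence at the beginning of the file;
    # stop_periods=-1 removes silence anywhere else, keeping up to keep_gap_s of each pause.
    af = (
        f"silenceremove=start_periods=1:start_threshold={threshold_db}dB"
        f":stop_periods=-1:stop_duration={keep_gap_s}:stop_threshold={threshold_db}dB"
    )
    return ["ffmpeg", "-y", "-i", src, "-af", af, dst]

cmd = build_trim_cmd("inp.wav", "trimmed.wav")
print(shlex.join(cmd))
# then run it with subprocess.run(cmd, check=True)
```

Trimming before upload also shrinks the file, which reduces the audio tokens you pay for.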

  3. Try a different transcription model

gpt-4o-transcribe is fast, but a bit aggressive with cutoff on silence.

You can also try:

model="gpt-4o-mini-transcribe"

which sometimes handles long pauses better.

So the core issue isn’t the parameter, it’s the structure of the audio. If the second sentence is far later in the file or very quiet, the model won’t reach it.


Thanks - using this model did the trick for me

Glad it worked!
That mini-transcribe model has been the most reliable for me too, especially with clips that have long pauses or uneven volume 😊

Hi! I am glad you found a workaround for your issue.

There is another way to address this. Since gpt-4o-mini-transcribe is not always the best option compared to gpt-4o-transcribe, you can control silence_duration_ms, which defines how long silence must last before speech is considered finished. To do this, set the chunking_strategy to server_vad and adjust the parameter as needed.

If you do not specify this parameter, the chunking_strategy defaults to auto. In this case, that appears to result in incomplete transcripts. While looking into this, I also found that explicitly setting the chunking_strategy to auto resolves this specific issue as well. This suggests a bug or inconsistency in the documentation.

I will flag this to staff for review. Until then, I recommend explicitly setting the chunking_strategy in all cases.

Code sample below.

# API Spec: https://platform.openai.com/docs/api-reference/audio/createTranscription?lang=python

from openai import OpenAI
import json

AUDIO_FILE = "inp.wav"
MODEL = "gpt-4o-transcribe"
SILENCE_DURATION_MS = 1000  # Duration of silence to detect speech stop (in milliseconds).
PREFIX_PADDING_MS = 0       # Amount of audio to include before the VAD detected speech
THRESHOLD = 0.5             # Threshold for voice activity detection (VAD)

client = OpenAI()  

with open(AUDIO_FILE, "rb") as audio:
    resp = client.audio.transcriptions.create(
        model=MODEL,
        file=audio,
        chunking_strategy={
            "type": "server_vad",
            "silence_duration_ms": SILENCE_DURATION_MS,
            "prefix_padding_ms": PREFIX_PADDING_MS,
            "threshold": THRESHOLD,
        },
        # include=["logprobs"],  # optional; only works with response_format="json" and supported models
        # response_format="json",
    )

print(json.dumps(resp.model_dump(), indent=2, ensure_ascii=False))
print("\n---\nTEXT:\n" + resp.text)