Thanks for the response, but when I add response_format="text" I get the same output as before.
I also tried printing the transcription variable directly and got this output:
Transcription(text='The stale smell of old beer lingers.', logprobs=None, usage=UsageTokens(input_tokens=183, output_tokens=11, total_tokens=194, type='tokens', input_token_details=UsageTokensInputTokenDetails(audio_tokens=183, text_tokens=0)))
So I still have the problem that only the first sentence of the audio is transcribed.
Could the problem be that the audio file has a relatively large gap between the first and second sentence?
Your code looks correct; the problem isn't your Python, it's the audio file itself.
The model will transcribe the full file, but if the WAV has a long silent gap or a low-volume segment after the first sentence, gpt-4o-transcribe often treats that as the end of the utterance and stops there.
A few things you can check:
Inspect the waveform
If there’s a big flat region after the first sentence, the model will stop early.
(That Kaggle dataset does contain files with long trailing silence.)
gpt-4o-transcribe is fast, but a bit aggressive with cutoff on silence.
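If you want to check the waveform without opening an audio editor, a stdlib-only sketch like the one below can locate long quiet spans. The filename, window size, and RMS threshold are illustrative values, not anything from the API; it also assumes 16-bit mono PCM, which matches typical speech-dataset WAVs.

```python
# Sketch: scan a 16-bit mono PCM WAV for long quiet spans, stdlib only.
# window_ms, threshold, and min_gap_ms are illustrative defaults you
# should tune for your own recording.
import struct
import wave

def silent_gaps(path, window_ms=50, threshold=500, min_gap_ms=1000):
    """Return (start_s, end_s) spans whose raw 16-bit RMS stays below `threshold`."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        frames_per_window = max(1, rate * window_ms // 1000)
        n_windows = wav.getnframes() // frames_per_window
        gaps, gap_start = [], None
        for i in range(n_windows):
            raw = wav.readframes(frames_per_window)
            samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
            rms = (sum(s * s for s in samples) / len(samples)) ** 0.5
            t = i * window_ms / 1000  # window start time in seconds
            if rms < threshold:
                if gap_start is None:
                    gap_start = t
            elif gap_start is not None:
                if (t - gap_start) * 1000 >= min_gap_ms:
                    gaps.append((gap_start, t))
                gap_start = None
        if gap_start is not None:  # file ends inside a quiet span
            gaps.append((gap_start, n_windows * window_ms / 1000))
        return gaps
```

Calling e.g. `silent_gaps("inp.wav")` and seeing a multi-second span right after the first sentence would confirm the early-cutoff theory.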
You can also try:
model="gpt-4o-mini-transcribe"
which sometimes handles long pauses better.
So the core issue isn’t the parameter, it’s the structure of the audio. If the second sentence is far later in the file or very quiet, the model won’t reach it.
Hi! I am glad you found a workaround for your issue.
There is another way to address this. Since gpt-4o-mini-transcribe is not always the best option compared to gpt-4o-transcribe, you can control silence_duration_ms, which defines how long silence must last before speech is considered finished. To do this, set the chunking_strategy to server_vad and adjust the parameter as needed.
If you do not specify this parameter, the chunking_strategy defaults to auto. In this case, that appears to result in incomplete transcripts. While looking into this, I also found that explicitly setting the chunking_strategy to auto resolves this specific issue as well. This suggests a bug or inconsistency in the documentation.
I will flag this to staff for review. Until then, I recommend explicitly setting the chunking_strategy in all cases.
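For reference, the explicit-"auto" variant described above is a one-line change to the same call. This is an untested sketch with a placeholder filename; it only swaps the chunking_strategy value relative to the server_vad sample below.

```python
# Sketch: explicitly passing chunking_strategy="auto" instead of relying
# on the default. "inp.wav" is a placeholder path.
from openai import OpenAI

client = OpenAI()
with open("inp.wav", "rb") as audio:
    resp = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio,
        chunking_strategy="auto",  # explicit, rather than relying on the default
    )
print(resp.text)
```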
Code sample below.
# API Spec: https://platform.openai.com/docs/api-reference/audio/createTranscription?lang=python
from openai import OpenAI
import json
AUDIO_FILE = "inp.wav"
MODEL = "gpt-4o-transcribe"
SILENCE_DURATION_MS = 1000 # Duration of silence to detect speech stop (in milliseconds).
PREFIX_PADDING_MS = 0 # Amount of audio to include before the VAD detected speech
THRESHOLD = 0.5 # Threshold for voice activity detection (VAD)
client = OpenAI()
with open(AUDIO_FILE, "rb") as audio:
    resp = client.audio.transcriptions.create(
        model=MODEL,
        file=audio,
        chunking_strategy={
            "type": "server_vad",
            "silence_duration_ms": SILENCE_DURATION_MS,
            "prefix_padding_ms": PREFIX_PADDING_MS,
            "threshold": THRESHOLD,
        },
        # include=["logprobs"],  # optional; only works with response_format="json" and supported models
        # response_format="json",
    )

print(json.dumps(resp.model_dump(), indent=2, ensure_ascii=False))
print("\n---\nTEXT:\n" + resp.text)