Hello everyone,
Since the update to gpt-4o-transcribe that introduced the -diarize endpoint, the behaviour of the regular gpt-4o-transcribe model has changed and its performance has degraded.
In speech-to-text mode, transcribing an audio file (10 seconds to 10 minutes long), gpt-4o-transcribe now tends to truncate the end of the transcription whenever there is a pause in the speech, something that never happened before with the same code:
from openai import OpenAI

client = OpenAI()

# audio_file, lang and med_prompt are defined earlier in my script
transcript = client.audio.transcriptions.create(
    model="gpt-4o-transcribe",
    file=audio_file,
    language=lang,
    prompt=med_prompt,
    response_format="text",
)
When adding chunking_strategy="auto", truncation no longer occurs, but the model tends to hallucinate content in segmented chunks that contain only background noise, and overall recognition quality appears slightly degraded.
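Concretely, the only change is adding the chunking_strategy parameter to the same call as above:

transcript = client.audio.transcriptions.create(
    model="gpt-4o-transcribe",
    file=audio_file,
    language=lang,
    prompt=med_prompt,
    response_format="text",
    chunking_strategy="auto",  # let the API decide how to segment the audio
)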
I also tried to make it process the audio in one piece by defining a custom server_vad chunking strategy, but it doesn't seem to be handled properly by the API:
chunking_strategy={
    "type": "server_vad",
    "threshold": 0.5,
    "prefix_padding_ms": 200,
    "silence_duration_ms": 10000,
},
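That is, the full call looks like this:

transcript = client.audio.transcriptions.create(
    model="gpt-4o-transcribe",
    file=audio_file,
    language=lang,
    prompt=med_prompt,
    response_format="text",
    chunking_strategy={
        "type": "server_vad",          # explicit server-side voice activity detection
        "threshold": 0.5,              # VAD sensitivity
        "prefix_padding_ms": 200,      # audio kept before detected speech
        "silence_duration_ms": 10000,  # long silence window, so short pauses should not split the audio
    },
)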
In summary:
- Without chunking_strategy, truncation now occurs.
- With chunking_strategy="auto", hallucinations appear.
- A custom chunking_strategy does not seem to work.
These are significant issues for a speech-to-text model that previously worked very well.
Thanks for your help!