Hi, has anyone started using the speech-to-text model gpt-4o-transcribe via the API yet?
I understand it is a conversational model, but I only want to use it for speech to text. Any suggestions on alternative approaches, and any best-practice tips?
Thank you
Thank you @1uc4s_m4theus
The real reason for wanting this is the increased accuracy and real-time streaming. Both are weak points of the Whisper model.
Welcome to the community @saby
Yes, gpt-4o-transcribe can be used directly over the API for transcriptions, and it produces much higher-quality transcriptions than whisper-1.
It can be used for streaming transcriptions for both recorded audio and live-streaming audio.
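A minimal sketch of the streaming case for a recorded file, assuming the OpenAI Python SDK's `stream=True` support on the transcription endpoint; the event type name `transcript.text.delta` follows the streaming documentation, and the file path is a placeholder:

```python
# Sketch (untested against the live API): stream a transcription of a
# recorded file and print text deltas as they arrive.
import os


def collect_deltas(events) -> str:
    """Concatenate the text deltas from a transcription event stream."""
    parts = []
    for event in events:
        if getattr(event, "type", None) == "transcript.text.delta":
            parts.append(event.delta)
    return "".join(parts)


if __name__ == "__main__":
    from openai import OpenAI

    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    with open("audio/meeting.mp3", "rb") as audio_file:  # placeholder path
        stream = client.audio.transcriptions.create(
            model="gpt-4o-transcribe",
            file=audio_file,
            response_format="text",
            stream=True,  # deltas arrive while the audio is processed
        )
        print(collect_deltas(stream))
```

For live-streaming microphone audio you would instead go through the Realtime API; the snippet above only covers the recorded-file case.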
Thanks for the reply, @sps!
It's up to you to evaluate…
Has anyone noticed that gpt-4o-transcribe generates total nonsense? For example, if I use the code below with Whisper, it generates a decent transcript. However, if I replace whisper-1 with the gpt-4o-transcribe (or mini) model, the output is totally random and unrelated to the audio file.
from openai import OpenAI
import os

api_key = os.getenv("OPENAI_API_KEY")
client = OpenAI(api_key=api_key)

audio_file = open("audio/A00010001.mp3", "rb")
# whisper-1 supports verbose_json with word-level timestamps
transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="verbose_json",
    timestamp_granularities=["word"],
)
print(transcription.text)
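One thing worth checking before swapping the model name: to my knowledge, gpt-4o-transcribe only accepts the `json` or `text` response formats, not `verbose_json`, and does not support `timestamp_granularities`, so those parameters need to change along with the model. A hedged sketch of a valid call:

```python
# Sketch: request parameters adjusted for gpt-4o-transcribe, which (as of this
# writing) supports only "json"/"text" response formats and no word timestamps.
import os


def gpt4o_transcribe_kwargs(model: str = "gpt-4o-transcribe") -> dict:
    """Build request parameters valid for gpt-4o-transcribe."""
    return {
        "model": model,
        "response_format": "json",  # "verbose_json" is not supported here
    }


if __name__ == "__main__":
    from openai import OpenAI

    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    with open("audio/A00010001.mp3", "rb") as audio_file:
        transcription = client.audio.transcriptions.create(
            file=audio_file, **gpt4o_transcribe_kwargs()
        )
        print(transcription.text)
```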
I'm having this exact issue. The audio is around the maximum length allowed (1,500 seconds); I'm not sure whether that's related, but the response is totally random, often not even English.
gpt-4o-transcribe is still the only transcription model available. There is no versioned/dated model to choose from, and no overwhelming anecdote of "we no longer have this problem."
For other endpoints:
gpt-4o-realtime-preview-2025-06-03
gpt-4o-audio-preview-2025-06-03
If you're not streaming, the problem has always been the format of the transcription response: usually one big paragraph blob.
Our solution is to run the transcription response through the “gpt-4.1” API:
Developer Prompt:
You are a language expert. You specialize in formatting the unformatted text of any language.
User Prompt:
Format the following text: [insert the transcription response here]
We have extensively tested this approach over the last few days and it works very well - even in different languages. However, it is more expensive.
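A minimal sketch of the post-processing step described above, using the prompts quoted in this thread; the wiring (client setup, variable names) is my assumption about how one might implement it, not the poster's exact code:

```python
# Sketch: pass the raw transcription blob through gpt-4.1 to get formatted text.
import os

DEVELOPER_PROMPT = (
    "You are a language expert. You specialize in formatting the "
    "unformatted text of any language."
)


def formatting_messages(raw_transcript: str) -> list:
    """Build the developer/user message pair for the formatting request."""
    return [
        {"role": "developer", "content": DEVELOPER_PROMPT},
        {"role": "user", "content": f"Format the following text: {raw_transcript}"},
    ]


if __name__ == "__main__":
    from openai import OpenAI

    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    raw = "one big paragraph blob from the transcription endpoint ..."
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=formatting_messages(raw),
    )
    print(response.choices[0].message.content)
```

Since this adds a second model call per transcription, the extra cost scales with transcript length, which matches the "more expensive" caveat above.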