Is the GPT-4o-transcribe audio model ready to use via the API?

Hi, has anyone started using the speech-to-text model GPT-4o-transcribe via the API yet?

I understand this is a conversational model, but I only want to use it for speech-to-text. Any suggestions on alternative approaches, and any best-practice tips?
Thank you

Thank you @1uc4s_m4theus

The real reason for wanting this is the increased accuracy and real-time streaming. Both are weak points of the Whisper model.

Welcome to the community @saby

Yes, gpt-4o-transcribe can be used directly over the API for transcriptions, and it produces much higher-quality transcriptions than whisper-1.

It can be used for streaming transcriptions for both recorded audio and live-streaming audio.
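
Something like this should work for a recorded file (a rough sketch, assuming the stream=True flag and the transcript.text.delta / transcript.text.done event types from the API reference; the file name is just a placeholder):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Stream a transcription of a recorded file (whisper-1 does not support stream=True)
with open("sample.wav", "rb") as audio_file:
    stream = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
        stream=True,
    )
    for event in stream:
        if event.type == "transcript.text.delta":
            # incremental text as the model produces it
            print(event.delta, end="", flush=True)
        elif event.type == "transcript.text.done":
            # completed transcript; add a final newline
            print()

For live microphone audio, the same model can be used through Realtime API transcription sessions rather than this file-based endpoint.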

4 Likes

Thanks for the reply! @sps.

It's up to you to evaluate…

1 Like

Did anyone notice that gpt-4o-transcribe generates total nonsense? For example, if I use the code below with Whisper, it generates a decent transcript. However, if I replace whisper with the gpt-4o-transcribe (or mini) model, the output is totally random and not related to the audio file.

from openai import OpenAI
import os

api_key = os.getenv("OPENAI_API_KEY")
client = OpenAI(api_key=api_key)

# Open the audio file and request a word-level timestamped transcription
with open("audio/A00010001.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["word"],
    )

print(transcription.text)
1 Like

Having this exact issue. The audio is around the maximum length it allows (1500 seconds); I'm not sure if that's related to the problem, but the response is totally random, often not even English.

Are you seeing improvements in the last 2 months, @_j @farazs?

gpt-4o-transcribe is still the only transcription model selection. There is no versioned/dated model to choose from, and no overwhelming anecdotal evidence of “we no longer have this problem”.

For other endpoints:
gpt-4o-realtime-preview-2025-06-03
gpt-4o-audio-preview-2025-06-03

@saby @_j @farazs

If you're not streaming, the problem has always been the format of the transcription response - usually one big paragraph blob.

Our solution is to run the transcription response through the “gpt-4.1” API:

Developer Prompt:

Identity

You are a language expert. You specialize in formatting the unformatted text of any language.

Instructions

  • Determine logical paragraphs and separate them with blank lines.
  • If a paragraph has a heading, insert a blank line between the heading and the paragraph.
  • If there is a title, insert a blank line after it.
  • Ensure that statements that need to be quoted are, in fact, quoted.
  • Ensure that the text is properly punctuated using the punctuation and grammatical rules of the language.

User Prompt:
Format the following text: Insert the transcription response here

We have extensively tested this approach over the last few days and it works very well - even in different languages. However, it is more expensive.
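
For reference, a rough sketch of the two-step pipeline (the file name, the plain-text response format, and the message wrapping are our assumptions, not requirements of either API):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

FORMATTING_INSTRUCTIONS = """\
# Identity
You are a language expert. You specialize in formatting the unformatted text of any language.

# Instructions
- Determine logical paragraphs and separate them with blank lines.
- If a paragraph has a heading, insert a blank line between the heading and the paragraph.
- If there is a title, insert a blank line after it.
- Ensure that statements that need to be quoted are, in fact, quoted.
- Ensure that the text is properly punctuated using the punctuation and grammatical rules of the language.
"""

# Step 1: transcribe the audio as plain text (arrives as one big paragraph blob)
with open("audio/recording.mp3", "rb") as audio_file:
    raw_text = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
        response_format="text",
    )

# Step 2: reformat the blob into punctuated paragraphs with gpt-4.1
formatted = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "developer", "content": FORMATTING_INSTRUCTIONS},
        {"role": "user", "content": f"Format the following text: {raw_text}"},
    ],
)

print(formatted.choices[0].message.content)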