Is the GPT-4o-transcribe audio model ready to use via the API?

Hi, has anyone started using the speech-to-text model GPT-4o-transcribe via the API yet?

I understand this is a conversational model, but I only want to use it for speech-to-text. Any suggestions on alternative approaches, and any best-practice tips?
Thank you

Thank you @1uc4s_m4theus

The real reason for wanting this is the increased accuracy and real-time streaming. Both are weak points of the Whisper model.

Welcome to the community @saby

Yes, gpt-4o-transcribe can be used directly over the API for transcriptions, and it produces much higher-quality transcriptions than whisper-1.

It can be used for streaming transcriptions for both recorded audio and live-streaming audio.
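
Something like this should work for a recorded file (a rough sketch, assuming the stream=True flag and the transcript.text.delta / transcript.text.done event types from the API reference; the file name is just a placeholder):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Stream a transcription of a recorded file (whisper-1 does not support stream=True)
with open("sample.wav", "rb") as audio_file:
    stream = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
        stream=True,
    )
    for event in stream:
        if event.type == "transcript.text.delta":
            # incremental text as the model produces it
            print(event.delta, end="", flush=True)
        elif event.type == "transcript.text.done":
            # completed transcript; add a final newline
            print()

For live microphone audio, the same model can be used through Realtime API transcription sessions rather than this file-based endpoint.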

4 Likes

Thanks for the reply! @sps.

It's up to you to evaluate…

1 Like

Did anyone notice that gpt-4o-transcribe generates total nonsense? For example, if I use the code below with Whisper, it generates a decent transcript. However, if I replace whisper with the gpt-4o-transcribe (or mini) model, the output is totally random and not related to the audio file.

from openai import OpenAI
import os

api_key = os.getenv("OPENAI_API_KEY")
client = OpenAI(api_key=api_key)

# Open the audio file and request a word-level timestamped transcription
with open("audio/A00010001.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["word"],
    )

print(transcription.text)
1 Like

Having this exact issue. The audio is around the maximum length it allows (1500 seconds); I'm not sure if that's related to the problem, but the response is totally random, often not even English.

Are you seeing improvements in the last 2 months, @_j @farazs?

gpt-4o-transcribe is still the only transcription model selection. There is no versioned/dated model to choose from, and no overwhelming anecdotal evidence of “we no longer have this problem”.

For other endpoints:
gpt-4o-realtime-preview-2025-06-03
gpt-4o-audio-preview-2025-06-03

@saby @_j @farazs

If you're not streaming, the problem has always been the format of the transcription response - usually one big paragraph blob.

Our solution is to run the transcription response through the “gpt-4.1” API:

Developer Prompt:

Identity

You are a language expert. You specialize in formatting the unformatted text of any language.

Instructions

  • Determine logical paragraphs and separate them with blank lines.
  • If a paragraph has a heading, insert a blank line between the heading and the paragraph.
  • If there is a title, insert a blank line after it.
  • Ensure that statements that need to be quoted are, in fact, quoted.
  • Ensure that the text is properly punctuated using the punctuation and grammatical rules of the language.

User Prompt:
Format the following text: Insert the transcription response here

We have extensively tested this approach over the last few days and it works very well - even in different languages. However, it is more expensive.
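
For reference, a rough sketch of the two-step pipeline (the file name, the plain-text response format, and the message wrapping are our assumptions, not requirements of either API):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

FORMATTING_INSTRUCTIONS = """\
# Identity
You are a language expert. You specialize in formatting the unformatted text of any language.

# Instructions
- Determine logical paragraphs and separate them with blank lines.
- If a paragraph has a heading, insert a blank line between the heading and the paragraph.
- If there is a title, insert a blank line after it.
- Ensure that statements that need to be quoted are, in fact, quoted.
- Ensure that the text is properly punctuated using the punctuation and grammatical rules of the language.
"""

# Step 1: transcribe the audio as plain text (arrives as one big paragraph blob)
with open("audio/recording.mp3", "rb") as audio_file:
    raw_text = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
        response_format="text",
    )

# Step 2: reformat the blob into punctuated paragraphs with gpt-4.1
formatted = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "developer", "content": FORMATTING_INSTRUCTIONS},
        {"role": "user", "content": f"Format the following text: {raw_text}"},
    ],
)

print(formatted.choices[0].message.content)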