GPT-4o-mini-transcribe and GPT-4o-transcribe not as good as Whisper

We recently migrated from Whisper to the new voice-to-text API (gpt-4o-transcribe / gpt-4o-mini-transcribe), but ran into significant latency issues and unstable transcription results, with text frequently going missing. Because of these problems, we reverted to Whisper. Has anyone else experienced similar issues with the new API?
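
For context, the migration was essentially just swapping the model name on the same transcription endpoint. A minimal sketch, assuming the Python openai SDK; "audio.wav" is a placeholder file name:

 from openai import OpenAI

 client = OpenAI()

 # Same endpoint for both models; only the model name changes.
 with open("audio.wav", "rb") as f:
     new = client.audio.transcriptions.create(model="gpt-4o-transcribe", file=f)

 with open("audio.wav", "rb") as f:
     old = client.audio.transcriptions.create(model="whisper-1", file=f)

 print(new.text)
 print(old.text)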

Well… in my experience, GPT-4o-Transcribe works better than Whisper-1: it doesn’t try to transcribe background noise or produce alien-sounding, broken language. So, for my use case, it works well and is already in prod.

In my experience, the new GPT transcribe models tend to drop words, especially at the beginning/end of the message. I am usually dealing with short messages. Here are my results:

 "RECORDING_TRANSCRIPT": {
  "gpt-4o-mini-transcribe": "Will this work or not?",
  "gpt-4o-transcribe": "Will this work or not?",
  "whisper-1": "Uh, will this work or not? I think so. Bye."
 }

The whisper-1 transcript is 100% accurate to what was said. You can see the two 4o models agree with each other, but both chopped off words at the start and end.

Also, don’t forget about latency … whisper-1 is the fastest of the three, too:

 "TRANSCRIPTION_ENGINE": "openai:{'models': ['whisper-1', 'gpt-4o-transcribe', 'gpt-4o-mini-transcribe']}",
 "TRANSCRIPT_METADATA": {
  "gpt-4o-mini-transcribe": {
   "latency_ms": 2016,
   "transcribed_at": "2025-04-08T06:13:49.816574"
  },
  "gpt-4o-transcribe": {
   "latency_ms": 1598,
   "transcribed_at": "2025-04-08T06:13:47.799742"
  },
  "whisper-1": {
   "latency_ms": 857,
   "transcribed_at": "2025-04-08T06:13:46.201050"
  }
 }

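A side-by-side like this is easy to reproduce. Here’s a minimal sketch, assuming the Python openai SDK; the model list is from the run above, and "recording.wav" is a placeholder file name:

 import time
 from datetime import datetime, timezone

 from openai import OpenAI

 client = OpenAI()
 models = ["whisper-1", "gpt-4o-transcribe", "gpt-4o-mini-transcribe"]
 metadata = {}

 for model in models:
     # Re-open the file for each call so every model reads from the start.
     with open("recording.wav", "rb") as audio:
         start = time.monotonic()
         result = client.audio.transcriptions.create(model=model, file=audio)
     metadata[model] = {
         "text": result.text,
         "latency_ms": round((time.monotonic() - start) * 1000),
         "transcribed_at": datetime.now(timezone.utc).isoformat(),
     }

(Single requests are noisy, so exact numbers like the ones above will vary from run to run.)
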
So, overall, I’m still liking whisper for these short messages.

I evaluated both for use at my company (we need to transcribe TV ads), and whisper-1 seemed a lot better than gpt-4o-transcribe on the specific “edge case” tests we ran. I don’t think gpt-4o-transcribe is quite up to par for our use case, though for something like transcribing a work meeting over the phone in a noisy coffee shop it may be good enough… We need high accuracy and precision because we search ads for key “banned words” like “Superbowl” (among other things).
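
To make that concrete, the banned-word check is essentially a whole-word, case-insensitive scan over each transcript. A sketch; the helper and every term except “Superbowl” are made up for illustration:

 import re

 # Illustrative list: "Superbowl" is a real example from our screening,
 # the other terms are placeholders.
 BANNED_WORDS = ["Superbowl", "guaranteed winner", "risk-free"]

 def find_banned_words(transcript: str) -> list[str]:
     """Return the banned terms that appear as whole words, ignoring case."""
     hits = []
     for term in BANNED_WORDS:
         if re.search(rf"\b{re.escape(term)}\b", transcript, re.IGNORECASE):
             hits.append(term)
     return hits

If the model silently drops a chunk of audio, any banned word inside that chunk is missed, which is why dropped text is a dealbreaker for us.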

Here are some screenshots of our evaluations. It’s not an exhaustive test, and our grading criteria are a little arbitrary, but I think it’s enough to make us hold off for a bit on switching to the gpt-4o-transcribe API:

(Never mind, OpenAI only lets me post a single screenshot lol)

In some cases, huge amounts of the transcription were dropped: