Ran some transcriptions with gpt-4o-transcribe on the same audio multiple times (Harvard sentences).
First of all, they differ quite a bit between API calls, even though temp=0.0.
Secondly, on one of the iterations the model suddenly dumped a lot of extra sentences back. My audio contains roughly the first 24 sentences (lists 1, 2, 3), but the model returned sentences from around list 70, and about 250 of them.
Is this overtraining?
Is the model recognizing my dataset and taking its best guess?
One of the sentences in my audio contains the word “yacht”, and the model gave me back the other sentence in the dataset that contains the word “yacht”.
Maybe “overtrained” is not the right term, since that implies overfitting.
It’s more like “too smart for its own good, beyond the task it was internally fine-tuned or prompted for”. Or “a multimodal AI model trusted too much by OpenAI when it is offered any task”.
The API calls should not be connected in any way; there is no communication between calls. As for identical inputs producing different outputs: every language model OpenAI has released in the past two years has been non-deterministic in its internal generation.
Perhaps you could clarify, in terms of individual API calls and their audio contents, what a “list” is and what contains “the first 24 sentences”.
After all, “dumping” everything heard in audio files back as written language is what the transcription endpoint is supposed to do!
/v1/audio/transcriptions only accepts one input file, and the output you receive should be a complete transcription of the audio that was sent (though the gpt-4o-transcribe model often shows symptoms like repeating the prompt text instead of the audio, or cutting off the end).
So jumping ahead in an audio file with big skips would be some kind of failure of attention to what was being processed sequentially, or an AI with an overloaded context getting confused about its task and making its own decisions, or treating the audio as instructions.
Try whisper-1. See if a technology made from the start only to transcribe will work for you; it only goes wrong with music or a bunch of filthy jokes.
(Reminder: the prompt text field is meant to carry just a bit of lead-up text from a previous transcription that you are joining in progress.)
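In case it’s useful, here is a minimal sketch of that prompt usage on a longer recording split into chunks, feeding only the tail of the previous chunk’s text into the prompt field. The chunk filenames are placeholders, and whisper-1 stands in per the suggestion above:

from openai import OpenAI

client = OpenAI()

# Placeholder chunk files, split from a longer recording elsewhere.
chunk_paths = ["part_01.wav", "part_02.wav", "part_03.wav"]

previous_tail = ""
pieces = []

for path in chunk_paths:
    with open(path, "rb") as audio_file:
        # response_format="text" returns a plain string
        text = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="text",
            # Only lead-up text from the previous chunk goes here, never instructions.
            prompt=previous_tail,
        )
    pieces.append(text)
    previous_tail = text[-200:]  # last few words as context for the next chunk

print(" ".join(pieces))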
For clarification:
Harvard sentences are a set of phonetically balanced sentences designed to test speech quality, intelligibility and audibility in communication systems.
There is a list of about 720 of these sentences here: www.cs.columbia.edu/~hgs/audio/harvard.html
To test some mics, I recorded audio of the first 24 of these sentences and passed it to gpt-4o-transcribe for transcription. Then I ran a script that uses the reference transcript to calculate how wrong the generated transcription was, in other words the word error rate (WER).
I noticed that when I called the API several times on the same audio, the calculated WER was different each time, fluctuating by about 5-10%. Probably to be expected, given the non-determinism you mention.
Then suddenly it went from 17% WER to 845%. When I checked the transcription, it was about 250 perfectly transcribed but random Harvard sentences, and a lot of them were not even in my audio.
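For reference, the WER check itself is only a few lines. A sketch with the jiwer package, where the reference and hypothesis strings are shortened placeholders:

import jiwer

# Reference: the first 24 Harvard sentences joined into one string.
# Only the first two are shown here as a placeholder.
reference = (
    "The birch canoe slid on the smooth planks. "
    "Glue the sheet to the dark blue background."
)

# Hypothesis: whatever text the transcription call returned (placeholder example).
hypothesis = (
    "The birch canoe slid on the smooth planks. "
    "Glue the sheet to the dark blue backgrounds."
)

# jiwer aligns the two word sequences and divides substitutions, insertions
# and deletions by the number of reference words. Insertions count too,
# which is how hundreds of extra hallucinated sentences push WER far past 100%.
print(f"WER: {jiwer.wer(reference, hypothesis):.1%}")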
My API calls look like this:
# one audio file in, plain-text transcription out
transcription = client.audio.transcriptions.create(
    model="gpt-4o-transcribe",
    file=audioFile,
    response_format="text",
    temperature=0,
)
So nothing here suggests it is going to give me Harvard sentences.
ChatGPT suggested that if my audio contains trailing silence, this could cause the model to hallucinate, and since it was probably trained on the sentences to some degree, that’s what it returns. But this sounds too good to be true, as the model was probably trained on thousands of hours of audio.
Not really a big problem for me, as it has only happened twice, but interesting nonetheless.
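If I wanted to rule out the trailing-silence theory, trimming the silence before uploading would be simple enough. A rough sketch with pydub, where the filename and threshold values are just guesses:

from pydub import AudioSegment
from pydub.silence import detect_nonsilent

audio = AudioSegment.from_file("harvard_lists_1_to_3.wav")  # placeholder filename

# Find the non-silent regions; min_silence_len and silence_thresh will likely
# need tuning for a given mic and room.
nonsilent = detect_nonsilent(
    audio,
    min_silence_len=500,              # milliseconds
    silence_thresh=audio.dBFS - 16,   # dB relative to the clip's average loudness
)

if nonsilent:
    # Keep everything up to the end of the last non-silent region.
    audio[: nonsilent[-1][1]].export("trimmed.wav", format="wav")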
At temp=0.0, minor output variations can still occur, but the sudden dump of unrelated sentences (e.g., list ~70) suggests a potential bug or context leakage, not overtraining. The “yacht” substitution reflects the model’s lexical associations. Share exact inputs with OpenAI to debug; this likely isn’t expected behavior.
That’s another part of “too smart”: knowledgeable.
That certainly helps with language understanding. Feed it known moon-walk transcripts or famous speeches, and it falls back to the intended wording even when you garble the words, because it has read the written version.
But it’s possible that you are indeed testing right on the post-training audio sets that produced the transcription model itself, benchmarking it on its own benchmarks. You activate it on a topic, and then “language completion” follows a pattern that comes not from the audio but from the stronger pattern of text token weights.
You might thus be experiencing inverse scaling: a bigger model is not necessarily better at particular tasks. Will it repeat back your misspellings as commanded, or can it not help but correct them, as it has been trained to do?
Here’s something to try: the gpt-4o-mini-transcribe model, which has less world knowledge compressed and preserved in its parameters and layers. Then a task: record yourself reading your Harvard sentences again, but write your script with different words, or skip some of them yourself. See how readily either AI model jumps in for you to fill the gaps or complete the sentence.
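Something like this would compare the two side by side on the same altered recording (the filename here is just a placeholder):

from openai import OpenAI

client = OpenAI()

for model in ("gpt-4o-transcribe", "gpt-4o-mini-transcribe"):
    with open("altered_harvard_reading.wav", "rb") as audio_file:  # placeholder file
        text = client.audio.transcriptions.create(
            model=model,
            file=audio_file,
            response_format="text",
            temperature=0,
        )
    print(f"--- {model} ---")
    print(text)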
I’ve heard of this! Benchmarking the largest models is in some ways obsolete, as they are trained on the benchmarks.
I tried the mini model: no hallucinations, but also quite a drop in overall transcription accuracy.
But since it is running on the gpt-4o model, isn’t there a chance that this is just a case of contextual understanding, or is it not designed that way?