How Audio Speed Affects Transcription Accuracy: Benchmark Insights

There have recently been a few discussions about speeding up audio files to save on transcription costs, so I have written a benchmark to show how this affects the WER (word error rate).

Before we jump into the results, here’s a disclaimer:

Disclaimer: All benchmarks on speech-to-text models are done at normal playback speed. There is no guarantee that the advertised WER is accurate at other playback speeds.

I’d like to give a shout-out to @sps, who originally confirmed that this does save on API costs, and to all the people participating in the original discussion started by @vasyl.

Results:

The data & graphs provided below are results from multiple runs of speech-to-text (STT) processing tests on audio recordings in English and Dutch. Each run evaluates the WER across various speed factors, ranging from 1.1x to 4.0x the normal speed, using the 1.0x transcript as a baseline.
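For readers unfamiliar with the metric, here is a minimal sketch of how a WER against the 1.0x transcript can be computed. It assumes lowercasing and whitespace tokenization as the only normalization; the benchmark's own code may normalize differently, so treat the function name and details as illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word dropped (deletion)
                curr[j - 1] + 1,     # extra hypothesis word (insertion)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Because the divisor is the reference length, a transcript with many inserted words can score above 100%.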

English:

The following tests were performed using the spoken version of the Wikipedia page on how to pronounce the word GIF. This file is roughly 18 minutes long.

Next, we will try another language to see how that affects the results.

Dutch:

The following tests were performed using the spoken version of the Dutch Wikipedia page on Apple (the company). This file is roughly 20 minutes long.

Raw data:


File: “En-Pronunciation_of_GIF-article.ogg” (from “File:En-Pronunciation of GIF-article.ogg” on Wikipedia), language: English

Run 1:

Speed Factor: 1.1x - WER: 5.18%

Speed Factor: 1.2x - WER: 6.51%

Speed Factor: 1.3x - WER: 6.32%

Speed Factor: 1.4x - WER: 6.55%

Speed Factor: 1.5x - WER: 6.51%

Speed Factor: 1.6x - WER: 5.36%

Speed Factor: 1.7x - WER: 12.83%

Speed Factor: 1.8x - WER: 6.55%

Speed Factor: 1.9x - WER: 5.55%

Speed Factor: 2.0x - WER: 5.09%

Speed Factor: 2.1x - WER: 7.61%

Speed Factor: 2.2x - WER: 6.14%

Speed Factor: 2.3x - WER: 6.19%

Speed Factor: 2.4x - WER: 5.82%

Speed Factor: 2.5x - WER: 6.19%

Speed Factor: 2.6x - WER: 7.56%

Speed Factor: 2.7x - WER: 6.37%

Speed Factor: 2.8x - WER: 8.57%

Speed Factor: 2.9x - WER: 15.17%

Speed Factor: 3.0x - WER: 30.29%

Speed Factor: 3.1x - WER: 25.85%

Speed Factor: 3.2x - WER: 38.18%

Speed Factor: 3.3x - WER: 15.08%

Speed Factor: 3.4x - WER: 32.03%

Speed Factor: 3.5x - WER: 32.91%

Speed Factor: 3.6x - WER: 31.62%

Speed Factor: 3.7x - WER: 40.56%

Speed Factor: 3.8x - WER: 36.21%

Speed Factor: 3.9x - WER: 41.48%

Speed Factor: 4.0x - WER: 58.89%

Run 2:

Speed Factor: 1.1x - WER: 5.82%

Speed Factor: 1.2x - WER: 7.29%

Speed Factor: 1.3x - WER: 7.47%

Speed Factor: 1.4x - WER: 7.02%

Speed Factor: 1.5x - WER: 8.67%

Speed Factor: 1.6x - WER: 5.09%

Speed Factor: 1.7x - WER: 5.32%

Speed Factor: 1.8x - WER: 6.19%

Speed Factor: 1.9x - WER: 4.49%

Speed Factor: 2.0x - WER: 5.32%

Speed Factor: 2.1x - WER: 6.83%

Speed Factor: 2.2x - WER: 12.93%

Speed Factor: 2.3x - WER: 7.79%

Speed Factor: 2.4x - WER: 6.24%

Speed Factor: 2.5x - WER: 5.96%

Speed Factor: 2.6x - WER: 10.50%

Speed Factor: 2.7x - WER: 6.69%

Speed Factor: 2.8x - WER: 7.79%

Speed Factor: 2.9x - WER: 15.13%

Speed Factor: 3.0x - WER: 26.82%

Speed Factor: 3.1x - WER: 11.65%

Speed Factor: 3.2x - WER: 37.60%

Speed Factor: 3.3x - WER: 13.48%

Speed Factor: 3.4x - WER: 36.50%

Speed Factor: 3.5x - WER: 32.60%

Speed Factor: 3.6x - WER: 32.51%

Speed Factor: 3.7x - WER: 36.96%

Speed Factor: 3.8x - WER: 38.33%

Speed Factor: 3.9x - WER: 45.71%

Speed Factor: 4.0x - WER: 65.52%

Run 3:

Speed Factor: 1.1x - WER: 5.36%

Speed Factor: 1.2x - WER: 8.66%

Speed Factor: 1.3x - WER: 6.46%

Speed Factor: 1.4x - WER: 6.74%

Speed Factor: 1.5x - WER: 6.46%

Speed Factor: 1.6x - WER: 5.45%

Speed Factor: 1.7x - WER: 7.33%

Speed Factor: 1.8x - WER: 6.83%

Speed Factor: 1.9x - WER: 6.51%

Speed Factor: 2.0x - WER: 5.77%

Speed Factor: 2.1x - WER: 8.39%

Speed Factor: 2.2x - WER: 13.29%

Speed Factor: 2.3x - WER: 6.55%

Speed Factor: 2.4x - WER: 6.92%

Speed Factor: 2.5x - WER: 5.96%

Speed Factor: 2.6x - WER: 8.48%

Speed Factor: 2.7x - WER: 7.42%

Speed Factor: 2.8x - WER: 10.22%

Speed Factor: 2.9x - WER: 10.54%

Speed Factor: 3.0x - WER: 22.82%

Speed Factor: 3.1x - WER: 12.51%

Speed Factor: 3.2x - WER: 38.73%

Speed Factor: 3.3x - WER: 18.10%

Speed Factor: 3.4x - WER: 37.90%

Speed Factor: 3.5x - WER: 33.50%

Speed Factor: 3.6x - WER: 31.94%

Speed Factor: 3.7x - WER: 36.80%

Speed Factor: 3.8x - WER: 47.71%

Speed Factor: 3.9x - WER: 49.63%

Speed Factor: 4.0x - WER: 63.93%

File: “Nl-Apple_Computer-article.ogg” (from “File:Nl-Apple Computer-article.ogg” on Wikimedia Commons), language: Dutch (NL)

Run 1:

Speed Factor: 1.1x - WER: 9.51%

Speed Factor: 1.2x - WER: 7.49%

Speed Factor: 1.3x - WER: 7.03%

Speed Factor: 1.4x - WER: 11.38%

Speed Factor: 1.5x - WER: 9.23%

Speed Factor: 1.6x - WER: 8.76%

Speed Factor: 1.7x - WER: 9.60%

Speed Factor: 1.8x - WER: 11.15%

Speed Factor: 1.9x - WER: 11.33%

Speed Factor: 2.0x - WER: 12.51%

Speed Factor: 2.1x - WER: 9.88%

Speed Factor: 2.2x - WER: 11.85%

Speed Factor: 2.3x - WER: 17.19%

Speed Factor: 2.4x - WER: 15.36%

Speed Factor: 2.5x - WER: 18.59%

Speed Factor: 2.6x - WER: 17.99%

Speed Factor: 2.7x - WER: 20.42%

Speed Factor: 2.8x - WER: 18.41%

Speed Factor: 2.9x - WER: 22.90%

Speed Factor: 3.0x - WER: 19.63%

Speed Factor: 3.1x - WER: 43.65%

Speed Factor: 3.2x - WER: 51.52%

Speed Factor: 3.3x - WER: 20.23%

Speed Factor: 3.4x - WER: 54.85%

Speed Factor: 3.5x - WER: 92.60%

Speed Factor: 3.6x - WER: 92.27%

Speed Factor: 3.7x - WER: 99.91%

Speed Factor: 3.8x - WER: 34.33%

Speed Factor: 3.9x - WER: 65.90%

Speed Factor: 4.0x - WER: 53.07%

Run 2:

Speed Factor: 1.1x - WER: 9.51%

Speed Factor: 1.2x - WER: 7.49%

Speed Factor: 1.3x - WER: 7.03%

Speed Factor: 1.4x - WER: 11.38%

Speed Factor: 1.5x - WER: 9.23%

Speed Factor: 1.6x - WER: 8.76%

Speed Factor: 1.7x - WER: 9.60%

Speed Factor: 1.8x - WER: 11.15%

Speed Factor: 1.9x - WER: 11.19%

Speed Factor: 2.0x - WER: 12.51%

Speed Factor: 2.1x - WER: 9.88%

Speed Factor: 2.2x - WER: 11.85%

Speed Factor: 2.3x - WER: 17.19%

Speed Factor: 2.4x - WER: 15.36%

Speed Factor: 2.5x - WER: 18.59%

Speed Factor: 2.6x - WER: 17.99%

Speed Factor: 2.7x - WER: 20.42%

Speed Factor: 2.8x - WER: 18.41%

Speed Factor: 2.9x - WER: 22.90%

Speed Factor: 3.0x - WER: 19.63%

Speed Factor: 3.1x - WER: 97.89%

Speed Factor: 3.2x - WER: 99.86%

Speed Factor: 3.3x - WER: 20.23%

Speed Factor: 3.4x - WER: 69.74%

Speed Factor: 3.5x - WER: 92.60%

Speed Factor: 3.6x - WER: 91.85%

Speed Factor: 3.7x - WER: 99.86%

Speed Factor: 3.8x - WER: 34.33%

Speed Factor: 3.9x - WER: 63.19%

Speed Factor: 4.0x - WER: 50.30%

Run 3:

Speed Factor: 1.1x - WER: 9.51%

Speed Factor: 1.2x - WER: 7.49%

Speed Factor: 1.3x - WER: 7.03%

Speed Factor: 1.4x - WER: 11.38%

Speed Factor: 1.5x - WER: 9.23%

Speed Factor: 1.6x - WER: 8.76%

Speed Factor: 1.7x - WER: 9.60%

Speed Factor: 1.8x - WER: 11.15%

Speed Factor: 1.9x - WER: 11.29%

Speed Factor: 2.0x - WER: 13.21%

Speed Factor: 2.1x - WER: 9.88%

Speed Factor: 2.2x - WER: 11.85%

Speed Factor: 2.3x - WER: 17.19%

Speed Factor: 2.4x - WER: 15.36%

Speed Factor: 2.5x - WER: 18.59%

Speed Factor: 2.6x - WER: 17.99%

Speed Factor: 2.7x - WER: 20.42%

Speed Factor: 2.8x - WER: 18.41%

Speed Factor: 2.9x - WER: 22.90%

Speed Factor: 3.0x - WER: 19.63%

Speed Factor: 3.1x - WER: 98.83%

Speed Factor: 3.2x - WER: 64.87%

Speed Factor: 3.3x - WER: 20.23%

Speed Factor: 3.4x - WER: 99.67%

Speed Factor: 3.5x - WER: 82.11%

Speed Factor: 3.6x - WER: 92.41%

Speed Factor: 3.7x - WER: 99.91%

Speed Factor: 3.8x - WER: 34.33%

Speed Factor: 3.9x - WER: 71.57%

Speed Factor: 4.0x - WER: 49.37%
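If you want to compare runs rather than read the numbers one by one, the log lines above are regular enough to parse. The sketch below averages the WER per speed factor; the function name is hypothetical, and it should be fed the runs of one file at a time, since the English and Dutch logs share the same line format. The sample lines are the 1.1x and 4.0x English results from the three runs above.

```python
import re
from collections import defaultdict
from statistics import mean

# Matches lines such as "Speed Factor: 1.1x - WER: 5.18%".
LINE = re.compile(r"Speed Factor:\s*([\d.]+)x\s*-\s*WER:\s*([\d.]+)%")

def mean_wer_by_speed(log_text: str) -> dict[float, float]:
    """Group every matching line by speed factor and average the WER values."""
    samples = defaultdict(list)
    for speed, wer in LINE.findall(log_text):
        samples[float(speed)].append(float(wer))
    return {speed: mean(values) for speed, values in sorted(samples.items())}

sample = """
Speed Factor: 1.1x - WER: 5.18%
Speed Factor: 4.0x - WER: 58.89%
Speed Factor: 1.1x - WER: 5.82%
Speed Factor: 4.0x - WER: 65.52%
Speed Factor: 1.1x - WER: 5.36%
Speed Factor: 4.0x - WER: 63.93%
"""

for speed, wer in mean_wer_by_speed(sample).items():
    print(f"{speed:.1f}x: mean WER {wer:.2f}%")
```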

Takeaways:

You can speed up audio files to get cheaper transcriptions at the cost of an increase in the word error rate, but you will need to be careful, as this behavior is not consistent and depends heavily on both the speaker and the language chosen. If you want to take this route, I highly recommend that you run this benchmark using your own files. You can find the code & instructions on GitHub:

14 Likes

Interesting how even at 1.5x and 2x the WER is close to normal playback speed.

3 Likes

Thanks for the amazing work!

The inconsistency is worrying. Dutch, Speed Factor: 3.1x - WER: 98.83%, but 3.3x is 20%. So I suppose it is basically nonsense at 3.1x, but kind of usable at 3.3x.

I wonder if there would be ways to detect the 80+% WER results, perhaps with an LLM?
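One cheap check that needs neither an LLM nor a reference transcript is a compression ratio: text that repeats itself compresses far better than normal prose. This is only a sketch, and it assumes the near-100% WER outputs are repetitive loops, which nobody in this thread has verified. As far as I know, the open-source Whisper code applies a similar check while decoding, with a default `compression_ratio_threshold` of 2.4; the function names below are hypothetical.

```python
import zlib

def compression_ratio(text: str) -> float:
    """Raw UTF-8 size divided by zlib-compressed size; repetitive text scores high."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))

def looks_degenerate(text: str, threshold: float = 2.4) -> bool:
    """Flag empty or highly repetitive transcripts.

    The threshold is an assumption borrowed from Whisper's default and would
    need tuning on real transcripts before being trusted.
    """
    return not text.strip() or compression_ratio(text) > threshold
```

A transcript that is wrong but not repetitive would pass this check, so it can only complement a comparison against a known-good transcript.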

1 Like

Yes, what surprised me the most was that the Dutch version was more consistent between runs than the English version.

1 Like

I wonder where the inconsistency between runs is coming from.

1 Like

I don’t know the answer to that, but I’m guessing Whisper may start to hallucinate multiple words into one when faced with sped-up audio.

I’m hoping people will try this using their own files, but to be honest, this entire endeavor was pure curiosity on my part. If you really want to save money, I’d recommend downloading & running the open source version:

It’s the same model as the one running on the API :laughing:

3 Likes

Additionally, if you’re running your own Whisper, you can also choose the model size you want to run.

1 Like

Wow, this is great! It’s something I was going to do sometime next week. So we’re talking about quite an efficient, simple, and straightforward approach. Even without cutting out silent segments, we’re talking about significant cost reductions without any significant quality loss. Huge!

2 Likes

Thank you! :heart:

I’d definitely recommend pulling the GitHub repo and trying it yourself on your own files. The people reading those wiki pages aren’t particularly fast to begin with, so your results may vary :laughing:

1 Like

I decided to spend some time making a tiny contribution: I took a smaller file, this Steve Jobs video, ran it at different playback speeds, and measured the WER. Here’s what we have:

Looks like the idea is not so crazy. At the very least, it’s worth experimenting with at a larger scale. I mean, a 3% increase in word errors might be worth it to save 50% on your invoice))
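The 50% figure follows from billing by audio duration: at speed factor s the file is 1/s as long, so the bill shrinks to 1/s of the original. A small sketch of that arithmetic, assuming per-minute billing; the price used in the example is a placeholder to be replaced with the current rate, and the function names are hypothetical.

```python
def billed_fraction(speed_factor: float) -> float:
    """Fraction of the original bill that remains after speeding up the audio."""
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    return 1.0 / speed_factor

def estimated_cost(duration_minutes: float, speed_factor: float,
                   price_per_minute: float) -> float:
    """Estimated cost, assuming the API bills by duration of the submitted audio."""
    return duration_minutes * billed_fraction(speed_factor) * price_per_minute

# Example: the roughly 18-minute English file at 2.0x.
# The price is a placeholder value, not a quoted rate.
print(estimated_cost(18, 2.0, price_per_minute=0.006))
```

The savings flatten quickly: going from 1x to 2x halves the bill, while going from 2x to 3x only saves a further sixth of the original cost.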

which converts the specified audio file to MP3 format,


converting sample run bins into the frequency domain may be problematic when a perceptual codec is designed for the temporal masking of human ears, not bat ears.

8.65 atempo

Adjust audio tempo.

The filter accepts exactly one parameter, the audio tempo. If not specified then the filter will assume nominal 1.0 tempo. Tempo must be in the [0.5, 100.0] range.

Note that tempo greater than 2 will skip some samples rather than blend them in

even harder when time slicing and overlapping becomes discarding.
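The ffmpeg documentation for atempo also notes that several instances can be daisy-chained to reach a target tempo, which keeps every stage at or below 2.0 and avoids the sample-skipping mode quoted above. Below is a sketch that builds such a filter string; the helper names and file names are hypothetical, and whether chaining actually improves the WER is untested here.

```python
def atempo_chain(speed: float, max_step: float = 2.0) -> str:
    """Build an ffmpeg audio filter string whose atempo stages multiply to `speed`.

    Every stage stays at or below `max_step`, so no single stage exceeds 2.0.
    """
    if speed < 0.5:
        raise ValueError("this sketch only handles tempo values of 0.5 and above")
    stages = []
    remaining = speed
    while remaining > max_step:
        stages.append(max_step)
        remaining /= max_step
    stages.append(remaining)
    return ",".join(f"atempo={stage:.6g}" for stage in stages)

def ffmpeg_command(src: str, dst: str, speed: float) -> list[str]:
    """Argument list for subprocess.run; the command is built here, not executed."""
    return ["ffmpeg", "-i", src, "-filter:a", atempo_chain(speed), dst]

print(atempo_chain(3.0))
print(ffmpeg_command("input.wav", "output_3x.wav", 3.0))
```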


Try WAV.

Try max “2” along with sample rate adjustment and resampling for allowing pitch shifts. Or multiple passes of high-frequency slices, then followed by 300Hz sinc filter.


Or autotune them…

2 Likes

Thanks for pointing it out man, I fuckin knew there was something wrong with MP3s :heart:

This is exactly what happened:

  • started with WAV, worked fine on my normal test file (~30s)
  • thought “I need more words for accurate WER”
  • found the files on Wikipedia
  • got an error because I was trying to send ~200 MB in one go
  • got lazy, didn’t want to implement consistent chunking
  • changed file type to MP3 to reduce size :man_facepalming:
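For anyone who wants to stay on WAV and avoid the size error, fixed-length chunking can be done with the standard library alone. This is a rough sketch and not the benchmark's code: the function name is hypothetical, it only handles uncompressed PCM WAV, and it cuts at fixed frame counts, so a word can be split across two chunks, which may itself affect the WER.

```python
import wave

def split_wav(path: str, chunk_seconds: float, out_prefix: str) -> list[str]:
    """Write fixed-length chunks of a PCM WAV file; the last chunk may be shorter."""
    outputs = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = int(chunk_seconds * src.getframerate())
        if frames_per_chunk <= 0:
            raise ValueError("chunk_seconds is too small for this sample rate")
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            out_path = f"{out_prefix}_{index:04d}.wav"
            with wave.open(out_path, "wb") as dst:
                # Copy channels, sample width and rate; the frame count in the
                # header is corrected automatically when the file is closed.
                dst.setparams(params)
                dst.writeframes(frames)
            outputs.append(out_path)
            index += 1
    return outputs
```

Using the same chunk length for every speed factor keeps the runs comparable, since each chunk is then transcribed under the same conditions.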

I then proceeded to check the files by listening to them, and concluded that they sounded fine to my human ears :rofl:

Ngl, that’s the funniest idea I’ve heard all day. I just have zero experience with autotuning stuff, could you point me in the right direction?

1 Like

“Pull to fixed pitch” is one appealing option in the “autotalent” screenshot; autotalent powers one Python autotune library, among several that I have not used from code.

Formant reconstruction allows you to then change the quality of vocal harmonics also.

The idea is that you could take chipmunk-speed speech, and change the pitch and timbre of it with technology to make it better conform with training data, instead of reconstructing audio put through a shredder.

1 Like

Good idea, and thanks for the pointers!

I’ll try that if I get the time. This is all pure curiosity. None of this “speeding up” thing is necessary if the goal is penny pinching; people should just run the model locally.

2 Likes

There are several studies on how video speed affects human comprehension and retention. This thread includes nearly all of the studies on the topic: https://x.com/MedEdFlamingo/status/1645441940144508937

These findings show that the decline in human comprehension becomes most noticeable beyond 2x speed. It is interesting to see how this plays out for AI, even if only for audio speed.

I used to do this a few years ago for banking transcriptions: I would load them all into Winamp, with keyboard shortcuts set up to skip through them, rewind, fast-forward, and so on, and enter them into a spreadsheet.