There have recently been a few discussions about speeding up audio files to save on transcription costs, so I wrote a benchmark to show how this affects the WER (word error rate).
Before we jump into the results, here’s a disclaimer:
Disclaimer: Benchmarks for speech-to-text models are run at normal playback speed, so there is no guarantee that the advertised WER holds at other playback speeds.
I’d like to give a shout-out to @sps, who originally confirmed that this does save on API costs, and to everyone who took part in the original discussion started by @vasyl.
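For anyone who wants to try this on their own files, here is a minimal sketch of one way to speed up audio before uploading it. It assumes ffmpeg is installed and uses its `atempo` audio filter; this is an assumption for illustration, not necessarily how the benchmark itself prepares its files. The helper names (`atempo_chain`, `ffmpeg_command`, `sped_up_minutes`) are made up for this example. Large factors are split into stages of at most 2.0x each, which is a conservative choice because older ffmpeg builds restrict a single `atempo` stage to the 0.5 to 2.0 range. The duration helper assumes the API bills by audio length.

```python
def atempo_chain(speed: float, max_stage: float = 2.0) -> str:
    """Build an ffmpeg audio-filter string that speeds audio up by `speed`.

    The factor is split into atempo stages of at most `max_stage` each,
    e.g. 3.0x becomes a 2.0x stage followed by a 1.5x stage.
    """
    if speed < 1.0:
        raise ValueError("this sketch only handles speed-ups (speed >= 1.0)")
    stages = []
    remaining = speed
    while remaining > max_stage:
        stages.append(max_stage)
        remaining /= max_stage
    stages.append(remaining)
    return ",".join(f"atempo={s:.3f}" for s in stages)


def ffmpeg_command(src: str, dst: str, speed: float) -> list[str]:
    """Return the argument list for an ffmpeg call (it is not executed here)."""
    return ["ffmpeg", "-y", "-i", src, "-filter:a", atempo_chain(speed), dst]


def sped_up_minutes(minutes: float, speed: float) -> float:
    """Audio length after the speed-up, assuming billing follows duration."""
    return minutes / speed


if __name__ == "__main__":
    # Hypothetical file names; pass the list to subprocess.run() to execute it.
    print(" ".join(ffmpeg_command("input.ogg", "input_3.0x.ogg", 3.0)))
    print(sped_up_minutes(18, 3.0), "minutes to transcribe instead of 18")
```

The saving scales with 1/speed, so most of the benefit arrives early: 2.0x already halves the audio length, while going from 3.0x to 4.0x only removes another twelfth of the original.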
Results:
The data & graphs provided below are results from three runs per language of speech-to-text (STT) processing tests on audio recordings in English and Dutch. Each run evaluates the WER at speed factors from 1.1x to 4.0x the normal speed, using the 1.0x transcript as the baseline. This means the WER values measure how far each sped-up transcript drifts from the normal-speed transcript, not from a human-verified reference.
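To make the metric concrete, here is a minimal sketch of a WER calculation: the word-level edit distance (substitutions, insertions and deletions) between the baseline transcript and the sped-up transcript, divided by the number of words in the baseline. The lowercasing and whitespace splitting are assumptions made for this example; the actual benchmark may normalize text differently (for instance by stripping punctuation), which would change the numbers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    baseline = "the quick brown fox"   # stands in for the 1.0x transcript
    sped_up = "the quick fox"          # stands in for a sped-up transcript
    print(f"WER: {word_error_rate(baseline, sped_up):.2%}")
```

Because insertions count as errors too, WER can exceed 100% when the hypothesis is much longer than the reference.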
English:
The following tests were performed using the spoken version of the Wikipedia page on how to pronounce the word GIF. This file is roughly 18 minutes long.
Next, we will try another language to see how that affects the results.
Dutch:
The following tests were performed using the spoken version of the Dutch Wikipedia page on Apple (the company). This file is roughly 20 minutes long.
Raw data:
File: “En-Pronunciation_of_GIF-article.ogg” (source: “File:En-Pronunciation of GIF-article.ogg” on Wikipedia), language: English
Run 1:
Speed Factor: 1.1x - WER: 5.18%
Speed Factor: 1.2x - WER: 6.51%
Speed Factor: 1.3x - WER: 6.32%
Speed Factor: 1.4x - WER: 6.55%
Speed Factor: 1.5x - WER: 6.51%
Speed Factor: 1.6x - WER: 5.36%
Speed Factor: 1.7x - WER: 12.83%
Speed Factor: 1.8x - WER: 6.55%
Speed Factor: 1.9x - WER: 5.55%
Speed Factor: 2.0x - WER: 5.09%
Speed Factor: 2.1x - WER: 7.61%
Speed Factor: 2.2x - WER: 6.14%
Speed Factor: 2.3x - WER: 6.19%
Speed Factor: 2.4x - WER: 5.82%
Speed Factor: 2.5x - WER: 6.19%
Speed Factor: 2.6x - WER: 7.56%
Speed Factor: 2.7x - WER: 6.37%
Speed Factor: 2.8x - WER: 8.57%
Speed Factor: 2.9x - WER: 15.17%
Speed Factor: 3.0x - WER: 30.29%
Speed Factor: 3.1x - WER: 25.85%
Speed Factor: 3.2x - WER: 38.18%
Speed Factor: 3.3x - WER: 15.08%
Speed Factor: 3.4x - WER: 32.03%
Speed Factor: 3.5x - WER: 32.91%
Speed Factor: 3.6x - WER: 31.62%
Speed Factor: 3.7x - WER: 40.56%
Speed Factor: 3.8x - WER: 36.21%
Speed Factor: 3.9x - WER: 41.48%
Speed Factor: 4.0x - WER: 58.89%
Run 2:
Speed Factor: 1.1x - WER: 5.82%
Speed Factor: 1.2x - WER: 7.29%
Speed Factor: 1.3x - WER: 7.47%
Speed Factor: 1.4x - WER: 7.02%
Speed Factor: 1.5x - WER: 8.67%
Speed Factor: 1.6x - WER: 5.09%
Speed Factor: 1.7x - WER: 5.32%
Speed Factor: 1.8x - WER: 6.19%
Speed Factor: 1.9x - WER: 4.49%
Speed Factor: 2.0x - WER: 5.32%
Speed Factor: 2.1x - WER: 6.83%
Speed Factor: 2.2x - WER: 12.93%
Speed Factor: 2.3x - WER: 7.79%
Speed Factor: 2.4x - WER: 6.24%
Speed Factor: 2.5x - WER: 5.96%
Speed Factor: 2.6x - WER: 10.50%
Speed Factor: 2.7x - WER: 6.69%
Speed Factor: 2.8x - WER: 7.79%
Speed Factor: 2.9x - WER: 15.13%
Speed Factor: 3.0x - WER: 26.82%
Speed Factor: 3.1x - WER: 11.65%
Speed Factor: 3.2x - WER: 37.60%
Speed Factor: 3.3x - WER: 13.48%
Speed Factor: 3.4x - WER: 36.50%
Speed Factor: 3.5x - WER: 32.60%
Speed Factor: 3.6x - WER: 32.51%
Speed Factor: 3.7x - WER: 36.96%
Speed Factor: 3.8x - WER: 38.33%
Speed Factor: 3.9x - WER: 45.71%
Speed Factor: 4.0x - WER: 65.52%
Run 3:
Speed Factor: 1.1x - WER: 5.36%
Speed Factor: 1.2x - WER: 8.66%
Speed Factor: 1.3x - WER: 6.46%
Speed Factor: 1.4x - WER: 6.74%
Speed Factor: 1.5x - WER: 6.46%
Speed Factor: 1.6x - WER: 5.45%
Speed Factor: 1.7x - WER: 7.33%
Speed Factor: 1.8x - WER: 6.83%
Speed Factor: 1.9x - WER: 6.51%
Speed Factor: 2.0x - WER: 5.77%
Speed Factor: 2.1x - WER: 8.39%
Speed Factor: 2.2x - WER: 13.29%
Speed Factor: 2.3x - WER: 6.55%
Speed Factor: 2.4x - WER: 6.92%
Speed Factor: 2.5x - WER: 5.96%
Speed Factor: 2.6x - WER: 8.48%
Speed Factor: 2.7x - WER: 7.42%
Speed Factor: 2.8x - WER: 10.22%
Speed Factor: 2.9x - WER: 10.54%
Speed Factor: 3.0x - WER: 22.82%
Speed Factor: 3.1x - WER: 12.51%
Speed Factor: 3.2x - WER: 38.73%
Speed Factor: 3.3x - WER: 18.10%
Speed Factor: 3.4x - WER: 37.90%
Speed Factor: 3.5x - WER: 33.50%
Speed Factor: 3.6x - WER: 31.94%
Speed Factor: 3.7x - WER: 36.80%
Speed Factor: 3.8x - WER: 47.71%
Speed Factor: 3.9x - WER: 49.63%
Speed Factor: 4.0x - WER: 63.93%
File: “Nl-Apple_Computer-article.ogg” (source: “File:Nl-Apple Computer-article.ogg” on Wikimedia Commons), language: Dutch (NL)
Run 1:
Speed Factor: 1.1x - WER: 9.51%
Speed Factor: 1.2x - WER: 7.49%
Speed Factor: 1.3x - WER: 7.03%
Speed Factor: 1.4x - WER: 11.38%
Speed Factor: 1.5x - WER: 9.23%
Speed Factor: 1.6x - WER: 8.76%
Speed Factor: 1.7x - WER: 9.60%
Speed Factor: 1.8x - WER: 11.15%
Speed Factor: 1.9x - WER: 11.33%
Speed Factor: 2.0x - WER: 12.51%
Speed Factor: 2.1x - WER: 9.88%
Speed Factor: 2.2x - WER: 11.85%
Speed Factor: 2.3x - WER: 17.19%
Speed Factor: 2.4x - WER: 15.36%
Speed Factor: 2.5x - WER: 18.59%
Speed Factor: 2.6x - WER: 17.99%
Speed Factor: 2.7x - WER: 20.42%
Speed Factor: 2.8x - WER: 18.41%
Speed Factor: 2.9x - WER: 22.90%
Speed Factor: 3.0x - WER: 19.63%
Speed Factor: 3.1x - WER: 43.65%
Speed Factor: 3.2x - WER: 51.52%
Speed Factor: 3.3x - WER: 20.23%
Speed Factor: 3.4x - WER: 54.85%
Speed Factor: 3.5x - WER: 92.60%
Speed Factor: 3.6x - WER: 92.27%
Speed Factor: 3.7x - WER: 99.91%
Speed Factor: 3.8x - WER: 34.33%
Speed Factor: 3.9x - WER: 65.90%
Speed Factor: 4.0x - WER: 53.07%
Run 2:
Speed Factor: 1.1x - WER: 9.51%
Speed Factor: 1.2x - WER: 7.49%
Speed Factor: 1.3x - WER: 7.03%
Speed Factor: 1.4x - WER: 11.38%
Speed Factor: 1.5x - WER: 9.23%
Speed Factor: 1.6x - WER: 8.76%
Speed Factor: 1.7x - WER: 9.60%
Speed Factor: 1.8x - WER: 11.15%
Speed Factor: 1.9x - WER: 11.19%
Speed Factor: 2.0x - WER: 12.51%
Speed Factor: 2.1x - WER: 9.88%
Speed Factor: 2.2x - WER: 11.85%
Speed Factor: 2.3x - WER: 17.19%
Speed Factor: 2.4x - WER: 15.36%
Speed Factor: 2.5x - WER: 18.59%
Speed Factor: 2.6x - WER: 17.99%
Speed Factor: 2.7x - WER: 20.42%
Speed Factor: 2.8x - WER: 18.41%
Speed Factor: 2.9x - WER: 22.90%
Speed Factor: 3.0x - WER: 19.63%
Speed Factor: 3.1x - WER: 97.89%
Speed Factor: 3.2x - WER: 99.86%
Speed Factor: 3.3x - WER: 20.23%
Speed Factor: 3.4x - WER: 69.74%
Speed Factor: 3.5x - WER: 92.60%
Speed Factor: 3.6x - WER: 91.85%
Speed Factor: 3.7x - WER: 99.86%
Speed Factor: 3.8x - WER: 34.33%
Speed Factor: 3.9x - WER: 63.19%
Speed Factor: 4.0x - WER: 50.30%
Run 3:
Speed Factor: 1.1x - WER: 9.51%
Speed Factor: 1.2x - WER: 7.49%
Speed Factor: 1.3x - WER: 7.03%
Speed Factor: 1.4x - WER: 11.38%
Speed Factor: 1.5x - WER: 9.23%
Speed Factor: 1.6x - WER: 8.76%
Speed Factor: 1.7x - WER: 9.60%
Speed Factor: 1.8x - WER: 11.15%
Speed Factor: 1.9x - WER: 11.29%
Speed Factor: 2.0x - WER: 13.21%
Speed Factor: 2.1x - WER: 9.88%
Speed Factor: 2.2x - WER: 11.85%
Speed Factor: 2.3x - WER: 17.19%
Speed Factor: 2.4x - WER: 15.36%
Speed Factor: 2.5x - WER: 18.59%
Speed Factor: 2.6x - WER: 17.99%
Speed Factor: 2.7x - WER: 20.42%
Speed Factor: 2.8x - WER: 18.41%
Speed Factor: 2.9x - WER: 22.90%
Speed Factor: 3.0x - WER: 19.63%
Speed Factor: 3.1x - WER: 98.83%
Speed Factor: 3.2x - WER: 64.87%
Speed Factor: 3.3x - WER: 20.23%
Speed Factor: 3.4x - WER: 99.67%
Speed Factor: 3.5x - WER: 82.11%
Speed Factor: 3.6x - WER: 92.41%
Speed Factor: 3.7x - WER: 99.91%
Speed Factor: 3.8x - WER: 34.33%
Speed Factor: 3.9x - WER: 71.57%
Speed Factor: 4.0x - WER: 49.37%
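Since individual runs can spike at a single speed factor, it may help to condense the lines above into a minimum, mean and maximum per speed factor. The following is a rough sketch, not part of the benchmark's own code; the `summarize` helper is a name made up for this example, and it simply parses lines in the "Speed Factor: … - WER: …%" format used above. The sample input reuses the English 1.7x and 2.0x values from the three runs.

```python
import re
from collections import defaultdict
from statistics import mean

# Matches lines such as "Speed Factor: 1.7x - WER: 12.83%".
LINE = re.compile(r"Speed Factor:\s*([\d.]+)x\s*-\s*WER:\s*([\d.]+)%")


def summarize(raw: str) -> dict[float, tuple[float, float, float]]:
    """Map each speed factor to (min, mean, max) WER over all runs in `raw`."""
    by_speed = defaultdict(list)
    for match in LINE.finditer(raw):
        by_speed[float(match.group(1))].append(float(match.group(2)))
    return {
        speed: (min(values), mean(values), max(values))
        for speed, values in sorted(by_speed.items())
    }


if __name__ == "__main__":
    sample = """
    Run 1:
    Speed Factor: 1.7x - WER: 12.83%
    Speed Factor: 2.0x - WER: 5.09%
    Run 2:
    Speed Factor: 1.7x - WER: 5.32%
    Speed Factor: 2.0x - WER: 5.32%
    Run 3:
    Speed Factor: 1.7x - WER: 7.33%
    Speed Factor: 2.0x - WER: 5.77%
    """
    for speed, (low, avg, high) in summarize(sample).items():
        print(f"{speed:.1f}x  min {low:.2f}%  mean {avg:.2f}%  max {high:.2f}%")
```

A wide gap between the minimum and maximum at one speed factor, as with the English 1.7x runs, is a hint that a single run is not enough to judge that setting.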
Takeaways:
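You can speed up audio files to get cheaper transcriptions at the cost of a higher word error rate, but you need to be careful: the effect is not consistent and depends heavily on both the speaker and the language. In these runs, the English file stayed mostly below 10% WER up to 2.8x, with occasional spikes to around 13%, while the Dutch file was already above 15% from 2.3x onward. Above 3.0x, both files degraded sharply and erratically. If you want to take this route, I highly recommend that you run this benchmark on your own files. You can find the code & instructions on GitHub: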