A crazy idea, or is it feasible? A technique that could save up to 40% on transcription costs

Hey fellow community members and leaders! :wave: :wave:

I have a (maybe crazy) idea :bulb: - a pre-processing approach that could make speech-to-text usage more efficient and save up to 40% on transcription costs.

Here's how this process normally works from the user's perspective: audio files are sent directly to APIs like OpenAI Whisper, and a transcription is returned.

I thought: what if we introduced a processing technique that involves several steps between uploading the audio file and sending it to the API?

  1. We identify, chunk, and remove silent segments from the audio.

  2. Then we compile a new file without these silent segments, streamlining the content and shortening the audio length by roughly up to 25%.

  3. Next, we apply a 1.2x speed increase to the audio, a modification that typically doesn't compromise transcription quality, though that depends on many factors, such as model capabilities, accent, and vocabulary. A 1.2x rate plays the file in 1/1.2 ≈ 83% of the time, shortening it by roughly another 17%.

The result? Transcription models bill per length of the audio file in seconds. A processed file with the silences removed and the playback accelerated is simply shorter, which directly reduces the cost of transcription - potentially by up to 40% overall.
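
For illustration, here is a minimal sketch of steps 1-3, assuming the pydub package (pip install pydub; ffmpeg must be installed). The silence threshold, minimum silence length, and padding are illustrative values, not tuned ones:

```python
# Minimal sketch of the proposed pre-processing pipeline using pydub.
from pydub import AudioSegment
from pydub.silence import detect_nonsilent
from pydub.effects import speedup

def preprocess(path: str, speed: float = 1.2) -> AudioSegment:
    audio = AudioSegment.from_file(path)

    # Steps 1-2: find non-silent ranges and stitch them into a new file,
    # keeping a little padding so word boundaries survive the cut.
    ranges = detect_nonsilent(audio, min_silence_len=500,
                              silence_thresh=audio.dBFS - 16)
    pad = 100  # ms of padding around each kept range
    trimmed = AudioSegment.empty()
    for start, end in ranges:
        trimmed += audio[max(0, start - pad):min(len(audio), end + pad)]

    # Step 3: speed up without shifting pitch (pydub chops and crossfades).
    return speedup(trimmed, playback_speed=speed)
```

The processed file is then exported and sent to the API as usual; since billing follows the uploaded duration, the savings track the reduction in length.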

Can anyone please tell me where I am (or could be) wrong here?

7 Likes

Removing silence seems reasonable, provided you store the changes you've made to the timeline in case you need to align the transcript later, for example for creating captions.

I'd be careful about playback rate increase though. Every fraction of a percentage point of errors counts in transcription and captioning. As you mention, accents, noise, and all sorts of other audio challenges can be found in audio files. Unless you're sure that you're already well above the 99% accuracy threshold, why risk a decrease in accuracy in exchange for a small decrease in cost?

3 Likes

Very good point - I think in this approach the timestamps should be 're-adjusted' to the original file's timeline by a simple function that adds back the durations of the cut segments.
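
Something like this hypothetical helper, assuming you kept the list of removed ranges (in original-file milliseconds, in order) and the speed factor applied afterwards:

```python
# Hypothetical helper: map a timestamp from the processed audio back to the
# original timeline, given the removed silent ranges and the playback speed.
def to_original_ms(t_ms: float, removed: list[tuple[int, int]],
                   speed: float = 1.0) -> float:
    t = t_ms * speed  # undo the speed-up first
    for start, end in removed:  # ranges sorted by start, original-file ms
        if start <= t:
            t += end - start  # add back each cut that precedes this point
    return t
```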

Totally agree on the playback rate - it could be added gradually, and only once a model is capturing everything perfectly from your contact center.

In case you hadn't seen it already: the idea of speeding up the file for cost savings was discussed and tested in this thread earlier this year.

Here are the conclusions:

4 Likes

Thanks a lot for that. The big question is how 1.2x, 1.5x, and 2x actually impact quality (WER). But even removing just 10% of silence and applying 1.1x would save roughly 18% (0.9 / 1.1 ≈ 0.82) - that's huge for mid-sized and large contact centers.

Yes, why not?
One could add a cost-reduction feature with a note for the user that accuracy is likely to drop.
A fancy solution would have a slider to adjust how much the audio is sped up and how aggressively silence is cut.

Working in a quiet environment is a possible use case for such a feature.

Actually, that's a good idea - to leave the speed-up rate as an optional adjustable parameter, let's say between 1x and 2x, with a note that it might impact the quality of transcription.

Digging a bit deeper: why not fine-tune a local Whisper on sped-up audio files to increase accuracy? Generating the training data would be comparatively easy in this case.
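
A hedged sketch of that data generation, again with pydub's pitch-preserving speedup (the paths and the 2x factor are illustrative):

```python
# Generate a sped-up copy of every clip in an existing training set.
from pathlib import Path
from pydub import AudioSegment
from pydub.effects import speedup

def make_spedup_dataset(src_dir: str, dst_dir: str, speed: float = 2.0) -> None:
    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    for wav in Path(src_dir).glob("*.wav"):
        fast = speedup(AudioSegment.from_file(wav), playback_speed=speed)
        fast.export(Path(dst_dir) / wav.name, format="wav")
```

Each clip's existing reference transcript is reused unchanged as its label, which is what makes the training data cheap to produce.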

Deploying the fine-tune with the rest of the app on the customer's side could still lead to a speed increase and, of course, is a good way to hide the actual costs.
Options like this open the door to variants of the same service, which sales can use to match the service to the customer's needs.

1 Like

This is a great point, actually. The thing is, I've tried the open-source Whisper and it's not performing even close to what I was getting via the API. I don't know if that was a unique case and no one else has experienced it, but I did notice the difference.

@vasyl

I'm not sure that the second diagram is related to the current thread.

Maybe the wrong file was included?

1 Like

@etienne, thanks a lot! Sure, that one was the wrong image - I didn't notice. Now it's good. Thanks a lot!

I have only a limited understanding of Whisper, but I think it is based on 30-second windows, and it might not necessarily give better results for extremely clean, high-quality, slow speech. So I would suspect this should work as long as each 30-second window still resembles what could exist in the training data. Potentially we could aim for some sweet spot of words/minute where it is most reliable.

How about taking it further, such as dropping frames at a variable rate depending on the rate of change in the spectrum - some kind of proxy for information density, such as words per minute or even pauses per minute?

Dropping micro-silences could allow huge savings without removing relevant information, but perhaps the current Whisper model requires "normal" audio, where changes in frequency occur naturally, because of its training data.

One more thought on the need for accuracy: if it is for contact-center use, I think most of the time the raw transcript is not used directly. Most of the time it is probably for finding conversations by topic, summarizing, style analysis, etc. In those cases the exact accuracy of Whisper might not matter.

When doing some online training sessions, I've used lower-quality transcriptions and then just asked GPT to summarize, even when the transcript would take a lot of effort for a human to read. Most of the time it's a pretty good summary of the session. The challenges are probably more about the summarization itself, such as deciding what is important.
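
A minimal sketch of that workflow with the OpenAI Python SDK; the model name and prompt wording here are illustrative choices, not what I actually used:

```python
# Feed a rough transcript to a chat model and ask for a summary.
from openai import OpenAI

client = OpenAI()

def summarize(transcript: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # any capable chat model
        messages=[
            {"role": "system",
             "content": "Summarize this session transcript into key bullet "
                        "points. It may contain recognition errors; infer "
                        "the intended meaning where possible."},
            {"role": "user", "content": transcript},
        ],
    )
    return resp.choices[0].message.content
```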

For accuracy, I think it's mostly important for subtitles, and maybe some conversations where details really matter, such as legally meaningful meeting notes, doctor's notes, etc., but those should probably be reviewed by a person anyway.

@vb had a great idea of fine-tuning for 2x audio - why not even 4x? Would it be feasible to try, even as a long shot, to train it to summarize or provide only keywords? A bit like a great meeting secretary does: "voice-to-bullet-points" plus ignoring the nonsense.

1 Like

Not that I am aware of - no, that's not possible.
OP's post, at its core, is looking for cost savings and maybe a faster turnaround for every job?
If we break this task down to the simplest use case, one could just pull up Google Sheets and use voice recognition instead of typing.

Looking at Whisper's cost specifically, a good starting point is likely to take an average recording and speed it up until the transcription becomes too faulty.
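
A sketch of that calibration, assuming a known-good reference transcript and the jiwer package for WER; `transcribe` is a placeholder for whatever transcription call you use:

```python
# Speed a reference recording up in steps and measure WER against a
# known-good transcript until quality degrades past a chosen threshold.
import jiwer

def max_usable_speed(transcribe, audio_path: str, reference: str,
                     max_wer: float = 0.05) -> float:
    best = 1.0
    for speed in (1.1, 1.2, 1.3, 1.5, 1.75, 2.0):
        hypothesis = transcribe(audio_path, speed)  # user-supplied function
        if jiwer.wer(reference, hypothesis) > max_wer:
            break
        best = speed
    return best
```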

Or maybe you create and train on a summarization language for translations…

Resample to 8000, 16000, 32000, or 48000 Hz.
A frame must be either 10, 20, or 30 ms in duration.

Then you get a rating, 50 times a second (with 20 ms frames), of how likely it is that voice is present, and you can drop frames and reassemble.
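
Those constraints match the WebRTC voice activity detector; a minimal frame-dropping sketch with the py-webrtcvad package, assuming 16 kHz, 16-bit mono PCM and 20 ms frames:

```python
# Drop frames that the WebRTC VAD classifies as non-speech, then reassemble.
import webrtcvad

def drop_silent_frames(pcm: bytes, sample_rate: int = 16000,
                       frame_ms: int = 20) -> bytes:
    vad = webrtcvad.Vad(2)  # aggressiveness: 0 (lenient) to 3 (aggressive)
    frame_bytes = sample_rate * frame_ms // 1000 * 2  # 2 bytes per sample
    voiced = []
    for i in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        frame = pcm[i:i + frame_bytes]
        if vad.is_speech(frame, sample_rate):
            voiced.append(frame)
    return b"".join(voiced)
```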

3 Likes

Yep, that's a good idea, I would say. This is called speech rate. I think each model, including Whisper, has an optimal speech rate (e.g. 50 words/minute) and a speech rate above which the quality of transcription - the word error rate (WER) - starts to drop significantly, let's say 70 words/minute. So if we measure the first minute or so and see that its speech rate (after removing micro-silences) is 40 words, we could easily apply a 1.3x playback rate.
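
A sketch of that heuristic, plugging in the illustrative numbers from this post (the 50 wpm "optimal" rate is an assumption, not a measured property of Whisper):

```python
# Pick a playback rate that lifts the measured speech rate toward an
# assumed optimal words-per-minute, capped at a maximum speed-up.
def pick_speed(words_in_sample: int, sample_seconds: float,
               optimal_wpm: float = 50.0, max_speed: float = 2.0) -> float:
    measured_wpm = words_in_sample / (sample_seconds / 60.0)
    if measured_wpm >= optimal_wpm:
        return 1.0  # already at or above the sweet spot; don't speed up
    return min(max_speed, optimal_wpm / measured_wpm)
```

With the example above (40 words in the first minute), this returns 50 / 40 = 1.25x, close to the suggested 1.3x.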

I have no doubt these kinds of models will come soon - lightweight and affordable for any kind of business.

Zoom's AI Companion is a good example that provides some of these features. I recently tested it in the context of a client video meeting, where it instantly identified the main agreed action points from a 1h+ discussion at the end of the meeting. To my own surprise, it was incredibly accurate and focused.

https://www.zoom.com/en/blog/zoom-ai-companion/

I think this (summarization / key-point extraction) is actually the best use case for the Zoom transcription model :sweat_smile: . We had a customer - quite a big company - for which we built a custom customer-service control on top of Zoom Phone (where transcription is provided), and I can say that the quality of the transcripts was quite low. Again, it's not useless, as you can still extract insights, but for quality control, for example, where we had to assess phone calls for compliance with the company playbook, the quality was not acceptable.

1 Like

You could also implement a test transcription of a small section before going for the main transcription. That way you can detect how fast the person is speaking by counting words (or symbols) per unit of time. If it's reasonably slow, you can determine by how much you could speed up the audio - 1.1x, 1.2x, 1.5x, etc.
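
A sketch of that probe, using the OpenAI Python SDK and pydub; the file handling is simplified and the 60-second window is arbitrary:

```python
# Transcribe only the first 60 seconds as a cheap probe, count the words,
# and feed the count into a speed heuristic like the one sketched earlier.
from openai import OpenAI
from pydub import AudioSegment

client = OpenAI()

def probe_word_count(path: str, seconds: int = 60) -> int:
    head = AudioSegment.from_file(path)[:seconds * 1000]
    head.export("probe.mp3", format="mp3")
    with open("probe.mp3", "rb") as f:
        text = client.audio.transcriptions.create(model="whisper-1", file=f).text
    return len(text.split())
```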

1 Like