A crazy idea, or is it feasible? A technique that could save up to 40% on transcription costs

Hey fellow community members and leaders! :wave: :wave:

I have a (maybe crazy) idea :bulb: - a pre-processing approach that could make speech-to-text usage more efficient and save up to 40% on transcription costs.

Here's how this process normally works from the user's perspective: audio files are sent directly to APIs like OpenAI Whisper, and a transcription is returned.

I thought: what if we introduced a processing technique that involves several steps between uploading the audio file and sending it to the API?

  1. We identify, chunk, and remove silent segments from the audio.

  2. Then we compile a new file without these silent segments, streamlining the content and shortening the audio length by roughly up to 25%.

  3. Next, we apply a 1.2x speed increase to the audio, a modification that typically doesn't compromise transcription quality, though that depends on many factors, such as model capabilities, accent, and vocabulary. A 1.2x rate plays the file in 1/1.2 ≈ 83% of the time, shortening it by roughly another 17%.

The result? Transcription models bill per length of the audio file in seconds. A processed file with the silences removed and the playback accelerated is simply shorter, which directly reduces the cost of transcription - potentially by up to 40% overall.
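
For illustration, here is a minimal sketch of steps 1-3, assuming the pydub package (pip install pydub; ffmpeg must be installed). The silence threshold, minimum silence length, and padding are illustrative values, not tuned ones:

```python
# Minimal sketch of the proposed pre-processing pipeline using pydub.
from pydub import AudioSegment
from pydub.silence import detect_nonsilent
from pydub.effects import speedup

def preprocess(path: str, speed: float = 1.2) -> AudioSegment:
    audio = AudioSegment.from_file(path)

    # Steps 1-2: find non-silent ranges and stitch them into a new file,
    # keeping a little padding so word boundaries survive the cut.
    ranges = detect_nonsilent(audio, min_silence_len=500,
                              silence_thresh=audio.dBFS - 16)
    pad = 100  # ms of padding around each kept range
    trimmed = AudioSegment.empty()
    for start, end in ranges:
        trimmed += audio[max(0, start - pad):min(len(audio), end + pad)]

    # Step 3: speed up without shifting pitch (pydub chops and crossfades).
    return speedup(trimmed, playback_speed=speed)
```

The processed file is then exported and sent to the API as usual; since billing follows the uploaded duration, the savings track the reduction in length.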

Can anyone please tell me where I am (or could be) wrong here?

7 Likes

Removing silence seems reasonable, provided you store the changes you've made to the timeline in case you need to align the transcript later, for example for creating captions.

I'd be careful about playback rate increase though. Every fraction of a percentage point of errors counts in transcription and captioning. As you mention, accents, noise, and all sorts of other audio challenges can be found in audio files. Unless you're sure that you're already well above the 99% accuracy threshold, why risk a decrease in accuracy in exchange for a small decrease in cost?

3 Likes

Very good point - I think in this approach the timestamps should be 're-adjusted' to the original file's timeline by a simple function that adds back the durations of the cut segments.
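
Something like this hypothetical helper, assuming you kept the list of removed ranges (in original-file milliseconds, in order) and the speed factor applied afterwards:

```python
# Hypothetical helper: map a timestamp from the processed audio back to the
# original timeline, given the removed silent ranges and the playback speed.
def to_original_ms(t_ms: float, removed: list[tuple[int, int]],
                   speed: float = 1.0) -> float:
    t = t_ms * speed  # undo the speed-up first
    for start, end in removed:  # ranges sorted by start, original-file ms
        if start <= t:
            t += end - start  # add back each cut that precedes this point
    return t
```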

Totally agree on the playback rate - it could be added gradually, and only once a model is capturing everything perfectly from your contact center.

In case you hadn't seen it already: the idea of speeding up the file for cost savings was discussed and tested in this thread earlier this year.

Here are the conclusions:

4 Likes

Thanks a lot for that. The big question is how 1.2x, 1.5x, and 2x actually impact quality (WER). But even removing just 10% of silence and applying 1.1x would save roughly 18% (0.9 / 1.1 ≈ 0.82) - that's huge for mid-sized and large contact centers.

Yes, why not?
One could add a cost-reduction feature with a note for the user that accuracy is likely to drop.
A fancy solution would have a slider to adjust how much the audio is sped up and how aggressively silence is cut.

Working in a quiet environment is a possible use case for such a feature.

Actually, that's a good idea - to leave the speed-up rate as an optional adjustable parameter, let's say between 1x and 2x, with a note that it might impact the quality of transcription.

Digging a bit deeper: why not fine-tune a local Whisper on sped-up audio files to increase accuracy? Generating the training data would be comparatively easy in this case.
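
A hedged sketch of that data generation, again with pydub's pitch-preserving speedup (the paths and the 2x factor are illustrative):

```python
# Generate a sped-up copy of every clip in an existing training set.
from pathlib import Path
from pydub import AudioSegment
from pydub.effects import speedup

def make_spedup_dataset(src_dir: str, dst_dir: str, speed: float = 2.0) -> None:
    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    for wav in Path(src_dir).glob("*.wav"):
        fast = speedup(AudioSegment.from_file(wav), playback_speed=speed)
        fast.export(Path(dst_dir) / wav.name, format="wav")
```

Each clip's existing reference transcript is reused unchanged as its label, which is what makes the training data cheap to produce.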

Deploying the fine-tune with the rest of the app on the customer's side could still lead to a speed increase and, of course, is a good way to hide the actual costs.
Options like this open the door to variants of the same service, which sales can use to match the service to the customer's needs.

1 Like

This is a great point, actually. The thing is, I've tried the open-source Whisper and it's not performing even close to what I was getting via the API. I don't know if that was a unique case and no one else has experienced it, but I did notice the difference.

@vasyl

I'm not sure that the second diagram is related to the current thread.

Maybe the wrong file was included?

1 Like

@etienne, thanks a lot! Sure, that one was the wrong image - I didn't notice. Now it's good. Thanks a lot!

I have only a limited understanding of Whisper, but I think it is based on 30-second windows, and it might not necessarily give better results for extremely clean, high-quality, slow speech. So I would suspect this should work as long as each 30-second window still resembles what could exist in the training data. Potentially we could aim for some sweet spot of words/minute where it is most reliable.

How about taking it further, such as dropping frames at a variable rate depending on the rate of change in the spectrum - some kind of proxy for information density, such as words per minute or even pauses per minute?

Dropping micro-silences could allow huge savings without removing relevant information, but perhaps the current Whisper model requires "normal" audio, where changes in frequency occur naturally, because of its training data.

One more thought on the need for accuracy: if it is for contact-center use, I think most of the time the raw transcript is not used directly. Most of the time it is probably for finding conversations by topic, summarizing, style analysis, etc. In those cases the exact accuracy of Whisper might not matter.

When doing some online training sessions, I've used lower-quality transcriptions and then just asked GPT to summarize, even when the transcript would take a lot of effort for a human to read. Most of the time it's a pretty good summary of the session. The challenges are probably more about the summarization itself, such as deciding what is important.
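
A minimal sketch of that workflow with the OpenAI Python SDK; the model name and prompt wording here are illustrative choices, not what I actually used:

```python
# Feed a rough transcript to a chat model and ask for a summary.
from openai import OpenAI

client = OpenAI()

def summarize(transcript: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # any capable chat model
        messages=[
            {"role": "system",
             "content": "Summarize this session transcript into key bullet "
                        "points. It may contain recognition errors; infer "
                        "the intended meaning where possible."},
            {"role": "user", "content": transcript},
        ],
    )
    return resp.choices[0].message.content
```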

For accuracy, I think it's mostly important for subtitles, and maybe some conversations where details really matter, such as legally meaningful meeting notes, doctor's notes, etc., but those should probably be reviewed by a person anyway.

@vb had a great idea of fine-tuning for 2x audio - why not even 4x? Would it be feasible to try, even as a long shot, to train it to summarize or provide only keywords? A bit like a great meeting secretary does: "voice-to-bullet-points" plus ignoring the nonsense.

1 Like

Not that I am aware of - no, that's not possible.
OP's post, at its core, is looking for cost savings and maybe a faster turnaround for every job?
If we break this task down to the simplest use case, one could just pull up Google Sheets and use voice recognition instead of typing.

Looking at Whisper's cost specifically, a good starting point is likely to take an average recording and speed it up until the transcription becomes too faulty.
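
A sketch of that calibration, assuming a known-good reference transcript and the jiwer package for WER; `transcribe` is a placeholder for whatever transcription call you use:

```python
# Speed a reference recording up in steps and measure WER against a
# known-good transcript until quality degrades past a chosen threshold.
import jiwer

def max_usable_speed(transcribe, audio_path: str, reference: str,
                     max_wer: float = 0.05) -> float:
    best = 1.0
    for speed in (1.1, 1.2, 1.3, 1.5, 1.75, 2.0):
        hypothesis = transcribe(audio_path, speed)  # user-supplied function
        if jiwer.wer(reference, hypothesis) > max_wer:
            break
        best = speed
    return best
```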

Or maybe you create and train on a summarization language for translations…

Resample to 8000, 16000, 32000, or 48000 Hz.
A frame must be either 10, 20, or 30 ms in duration.

Then you get a rating, 50 times a second (with 20 ms frames), of how likely it is that voice is present, and you can drop frames and reassemble.
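
Those constraints match the WebRTC voice activity detector; a minimal frame-dropping sketch with the py-webrtcvad package, assuming 16 kHz, 16-bit mono PCM and 20 ms frames:

```python
# Drop frames that the WebRTC VAD classifies as non-speech, then reassemble.
import webrtcvad

def drop_silent_frames(pcm: bytes, sample_rate: int = 16000,
                       frame_ms: int = 20) -> bytes:
    vad = webrtcvad.Vad(2)  # aggressiveness: 0 (lenient) to 3 (aggressive)
    frame_bytes = sample_rate * frame_ms // 1000 * 2  # 2 bytes per sample
    voiced = []
    for i in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        frame = pcm[i:i + frame_bytes]
        if vad.is_speech(frame, sample_rate):
            voiced.append(frame)
    return b"".join(voiced)
```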

3 Likes

Yep, that's a good idea, I would say. This is called speech rate. I think each model, including Whisper, has an optimal speech rate (e.g. 50 words/minute) and a speech rate above which the quality of transcription - the word error rate (WER) - starts to drop significantly, let's say 70 words/minute. So if we measure the first minute or so and see that its speech rate (after removing micro-silences) is 40 words, we could easily apply a 1.3x playback rate.
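
A sketch of that heuristic, plugging in the illustrative numbers from this post (the 50 wpm "optimal" rate is an assumption, not a measured property of Whisper):

```python
# Pick a playback rate that lifts the measured speech rate toward an
# assumed optimal words-per-minute, capped at a maximum speed-up.
def pick_speed(words_in_sample: int, sample_seconds: float,
               optimal_wpm: float = 50.0, max_speed: float = 2.0) -> float:
    measured_wpm = words_in_sample / (sample_seconds / 60.0)
    if measured_wpm >= optimal_wpm:
        return 1.0  # already at or above the sweet spot; don't speed up
    return min(max_speed, optimal_wpm / measured_wpm)
```

With the example above (40 words in the first minute), this returns 50 / 40 = 1.25x, close to the suggested 1.3x.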

I have no doubt these kinds of models will come soon - lightweight and affordable for any kind of business.

Zoom's AI Companion is a good example that provides some of these features. I recently tested it in the context of a client video meeting, where it instantly identified the main agreed action points from a 1h+ discussion at the end of the meeting. To my own surprise, it was incredibly accurate and focused.

https://www.zoom.com/en/blog/zoom-ai-companion/

I think this (summarization / key-point extraction) is actually the best use case for the Zoom transcription model :sweat_smile: . We had a customer - quite a big company - for which we built a custom customer-service control on top of Zoom Phone (where transcription is provided), and I can say that the quality of the transcripts was quite low. Again, it's not useless, as you can still extract insights, but for quality control, for example, where we had to assess phone calls for compliance with the company playbook, the quality was not acceptable.

1 Like

You could also implement a test transcription of a small section before going for the main transcription. That way you can detect how fast the person is speaking by counting words (or symbols) per unit of time. If it's reasonably slow, you can determine by how much you could speed up the audio - 1.1x, 1.2x, 1.5x, etc.
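
A sketch of that probe, using the OpenAI Python SDK and pydub; the file handling is simplified and the 60-second window is arbitrary:

```python
# Transcribe only the first 60 seconds as a cheap probe, count the words,
# and feed the count into a speed heuristic like the one sketched earlier.
from openai import OpenAI
from pydub import AudioSegment

client = OpenAI()

def probe_word_count(path: str, seconds: int = 60) -> int:
    head = AudioSegment.from_file(path)[:seconds * 1000]
    head.export("probe.mp3", format="mp3")
    with open("probe.mp3", "rb") as f:
        text = client.audio.transcriptions.create(model="whisper-1", file=f).text
    return len(text.split())
```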

1 Like