Whisper API quality degrading over time

michael69 · May 21, 2024, 6:31pm

Hello Everyone,

I’m using a Whisper module in Make and am getting very inconsistent results. I’ve played around with the audio quality (upgrading mics, dialing in audio specifications and file types), but today the same file processed perfectly once and then gave me 0-1 word outputs on the 4 subsequent attempts. Audio is .wav, sampling rate = 44.1 kHz, encoder bitrate = 160 kbps, mono; auto gain control, noise suppression, and echo cancellation are all off.

I’m processing 30-second files with about 23-30 words in total. I’ve put in a conditional failsafe flow based on word count, but my 20% success rate is pitiful. Even with my 4 extra attempts, I think I’m going to struggle to get results.
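For readers outside Make, the word-count failsafe described above could be sketched roughly like this in Python. The function name, the `min_words` threshold, and the `transcribe` callable are all hypothetical stand-ins for whatever actually calls Whisper; this is not the Make scenario itself.

```python
from typing import Callable, Optional


def transcribe_with_retry(
    transcribe: Callable[[], str],
    min_words: int = 10,
    max_attempts: int = 5,
) -> Optional[str]:
    """Call `transcribe` until the text has at least `min_words` words.

    `transcribe` is a hypothetical zero-argument callable that sends the
    audio file to the transcription service and returns the text.
    Returns None if every attempt comes back too short.
    """
    for _ in range(max_attempts):
        text = transcribe()
        # Whitespace-separated tokens are a rough proxy for word count.
        if len(text.split()) >= min_words:
            return text
    return None
```

With 1 initial attempt plus 4 retries, `max_attempts=5` matches the flow described above. A `None` result is the signal to flag the file for manual review rather than retrying forever.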

I spent some time last night working on an HTTP request to remove the silences, but have yet to get that operational. I don’t believe that’s my issue, though, and I suspect I’m just adding further complexity to my system.

What do you think I’m missing here?

I’m going to need to make about 100 of these requests a day. I’ve made 164 requests over the past 7 days, with a total of 7,442 transcribed minutes and ~$2 in costs.

supershaneski · May 21, 2024, 11:42pm

You can install Whisper on your own machine if you just need it to transcribe audio. That way you do not need to pay anything; the speed will depend on your machine’s specs. The current API is using Whisper v2, I think, if I am not mistaken, but the open source version is already at v3.

michael69 · May 22, 2024, 1:55am

Do you think there are any phones with the specs to handle this type of work: transcribing 30-second recordings every 5 minutes for 9 hours a day?

supershaneski · May 22, 2024, 2:04am

why phone? do you mean to use your phone for audio input/recording?

Incakura · May 28, 2024, 2:27pm

Do you use prompting, do you set the language and is there little background noise?

If large v2 is still inconsistent in spite of the above settings, you can try the Whisper large v3 deployed by fal.ai on an A100.

michael69 · May 28, 2024, 2:48pm

michael69 · May 28, 2024, 2:50pm

I’m working with the API, and I only have the one model available through there (Whisper-1; I believe it’s v2, but I could be wrong).

michael69 · May 28, 2024, 2:51pm

Yes, the recording is made with the phone and passed to my Make scenario via webhook and a Google Drive download module. I don’t mind the idea of putting v3 on a device, but I’m wondering which devices are capable of actually running this. Guess I should ask GPT.

michael69 · May 28, 2024, 2:54pm

One thing I should say is that the problem appears to have gone away with pre-processing the audio file to remove silences, but any increase in quality will be time well spent. Running this whole operation locally on device doesn’t sound undoable, but making such a dramatic shift in strategy at this point in the game would draw out my development process considerably. Still, I think this could be the end game to ensure the maximum robustness of the system.
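For anyone wanting to try the same pre-processing without an external service, here is a minimal sketch of one way to strip silence using only the Python standard library. It is not the HTTP-based step used above. It assumes 16-bit mono PCM .wav input, and the names `trim_silence` and `trim_wav`, the amplitude `threshold`, and the 10 ms frame size are all illustrative choices that would need tuning against real recordings.

```python
import array
import sys
import wave


def trim_silence(samples, threshold=500, frame_len=441):
    """Keep only the frames whose peak amplitude reaches `threshold`.

    `samples` is a sequence of signed 16-bit PCM values (mono).
    441 samples is 10 ms at 44.1 kHz.
    """
    kept = array.array("h")
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        if max(abs(s) for s in frame) >= threshold:
            kept.extend(frame)
    return kept


def trim_wav(src_path, dst_path, threshold=500, frame_ms=10):
    """Read a 16-bit mono WAV, drop quiet frames, write the result."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        if params.sampwidth != 2 or params.nchannels != 1:
            raise ValueError("this sketch assumes 16-bit mono PCM")
        samples = array.array("h")
        samples.frombytes(src.readframes(params.nframes))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    frame_len = max(1, params.framerate * frame_ms // 1000)
    kept = trim_silence(samples, threshold, frame_len)
    if sys.byteorder == "big":
        kept.byteswap()
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)
        dst.writeframes(kept.tobytes())
```

Dropping every quiet frame also removes natural pauses between words, so a gentler variant would trim only leading and trailing silence or keep a short margin around speech.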

Incakura · May 28, 2024, 4:29pm

Your prompt should use words that are commonly used in your transcript, preferably in the same formatting. The prompt in Whisper works more like “these are complicated words you might encounter”: “GPT-4o, perplexity, RAG…”

Oh, and you can use fal.ai through an API. Other than that, there is also Deepgram with an API service if you’re reluctant to self-host.

michael69 · May 28, 2024, 4:49pm

I’m a bit of a noob here. Can you explain “self host”?

I’m not quite sure I fully grasp your recommendation on the prompt. Can you give me an example?

RonaldGRuckus · May 28, 2024, 4:52pm

Whisper is not an instructional model. Your prompt should not contain instructions; including them will be detrimental to the results.

Set temperature to 0 as well. At 0 it does not function like a typical temperature: it’s dynamically adjusted based on the current input.

In almost all cases where you can enter temperature as 0, the service will do something other than apply the number literally (temperature can’t be 0).

Incakura · May 28, 2024, 5:16pm

So you could either use an API (calling OpenAI to do the transcription for you and give you the text back)

or

You could run it on your own PC (locally) or your own PC in the cloud (both of these count as self hosting)

In case it was unclear, RonaldGRuckus and I are saying the same thing about the prompt:
1) Don’t tell it what to do, or what you want from it
2) Tell it a few complicated words it typically mishears
3) Set temperature to 0 (credits to RGR!!)
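Put together, the three points above might look something like this when calling the API directly instead of through Make. This is only a sketch: `build_whisper_request` and `transcribe` are hypothetical helper names, the `"en"` language default is an assumption, and the second function assumes the official `openai` Python package (v1 or later) is installed with `OPENAI_API_KEY` set in the environment.

```python
def build_whisper_request(vocabulary, language="en"):
    """Build keyword arguments for a transcription request.

    The prompt is just the tricky words joined by commas:
    no instructions and no lead-in sentence.
    """
    return {
        "model": "whisper-1",
        "prompt": ", ".join(vocabulary),
        "temperature": 0,
        "language": language,  # assumption: recordings are in English
    }


def transcribe(path, vocabulary):
    """Send one audio file to the transcription endpoint."""
    # Imported here so the helper above works without the package.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(path, "rb") as audio:
        result = client.audio.transcriptions.create(
            file=audio, **build_whisper_request(vocabulary)
        )
    return result.text
```

For example, `build_whisper_request(["GPT-4o", "perplexity", "RAG"])` produces the prompt `GPT-4o, perplexity, RAG`, which is a bare word list in the formatting you expect to see in the transcript.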

michael69 · May 28, 2024, 6:07pm

So nothing but a list of words - no lead in like, “Here are a few words you are likely to hear:”?

This is great, guys. It’s really bringing a lot of clarity to Whisper for me.

michael69 · May 28, 2024, 6:09pm

Do you think there are any phones capable of running these models yet?

michael69 · May 28, 2024, 6:11pm

Feel accurate?
