Hello Everyone,
I’m using a Whisper module in Make and am getting very inconsistent results. I’ve played around with the audio quality (upgrading mics, dialing in audio specifications and file types), but today the same file processed perfectly once and then gave me 0-1 word outputs on the 4 subsequent attempts. Audio is .wav, sampling rate = 44.1kHz, encoder bitrate = 160kbps, mono; auto gain control, noise suppression, and echo cancellation are all off.
I’m processing 30-second files with about 23-30 words in total. I’ve put in a conditional failsafe flow based on word count, but my 20% success rate is pitiful. Even with my 4 extra attempts, I think I’m going to struggle to get results.
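For anyone following along outside Make, the word-count failsafe described above could look roughly like this. It is only a sketch: `transcribe_with_retries` is a hypothetical helper, `transcribe` stands in for whatever actually calls Whisper, and the threshold of 15 words is an assumed value picked to sit below the 23-30 words expected per clip.

```python
from typing import Callable


def transcribe_with_retries(
    transcribe: Callable[[], str],
    min_words: int = 15,      # assumed threshold, below the expected 23-30 words
    max_attempts: int = 5,    # 1 initial attempt + 4 retries
) -> tuple[str, int]:
    """Call `transcribe` until the result has at least `min_words` words.

    Returns the accepted text and the attempt number it came from. If no
    attempt passes, returns the longest text seen and `max_attempts`.
    """
    best = ""
    for attempt in range(1, max_attempts + 1):
        text = transcribe().strip()
        if len(text.split()) >= min_words:
            return text, attempt
        # Remember the longest failed result as a fallback.
        if len(text.split()) > len(best.split()):
            best = text
    return best, max_attempts
```

Retrying only helps if the failures are random. If the same file keeps failing, the input or the request settings are the more likely culprit.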
I spent some time last night working on an HTTP request to remove the silences, but have yet to get that operational, although I don’t believe that’s my issue and I may just be adding further complexity to my system.
What do you think I’m missing here?
I’m going to need to make about 100 of these requests a day. I’ve made 164 requests over the past 7 days, with a total of 7,442 transcribed minutes and ~$2 in costs.
You can install Whisper on your machine if you just need it to transcribe audio for other purposes. That way you do not need to pay anything. The speed will depend on your machine’s specs. The current API is using Whisper v2, I think, if I am not mistaken, but the open-source version is already at v3.
Do you think there are any phones with the specs to handle this type of work - transcribe 30 second recordings every 5 minutes for 9 hours a day?
Why a phone? Do you mean to use your phone for audio input/recording?
Do you use prompting, do you set the language and is there little background noise?
If with v2 large it’s inconsistent in spite of the above settings:
you can try the whisper large v3 deployed by fal.ai on an A100.
I’m working with the API, and I only have the one model (Whisper-1 - I believe it’s v2 but could be wrong) available through there.
Yes, the recording is made with the phone and passed to my Make scenario via a webhook and a Google Drive download module. I don’t mind the idea of putting v3 on a device, but I’m wondering which devices are capable of actually running this - guess I should ask GPT.
One thing I should say is that the problem appears to have gone away with pre-processing the audio file to remove silences, but any increase in quality will be time well spent. Running this whole operation locally on device doesn’t sound undoable, but making such a dramatic shift in strategy at this point in the game would draw out my development process considerably, though I think this could be the end game to ensure the maximum robustness of the system.
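For reference, the silence-removal step can be sketched with nothing but the Python standard library. This is not the HTTP request used in the Make scenario, just an illustration of the idea, and it assumes 16-bit mono PCM WAV input. The function names, the amplitude threshold of 500, the 441-sample frame (10 ms at 44.1kHz), and the padding are all assumed values that would need tuning on real recordings.

```python
import array
import sys
import wave


def remove_silence(samples: array.array, frame_len: int = 441,
                   threshold: int = 500, pad: int = 5) -> array.array:
    """Keep frames that are loud, or within `pad` frames of a loud frame."""
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    # A frame counts as loud when its peak absolute amplitude reaches the threshold.
    loud = [max(abs(s) for s in frame) >= threshold for frame in frames]
    kept = array.array("h")
    for i, frame in enumerate(frames):
        lo, hi = max(0, i - pad), min(len(frames), i + pad + 1)
        if any(loud[lo:hi]):
            kept.extend(frame)
    return kept


def remove_silence_wav(src_path: str, dst_path: str, **kwargs) -> None:
    """Read a 16-bit mono WAV, drop quiet stretches, write the result."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        if params.sampwidth != 2 or params.nchannels != 1:
            raise ValueError("this sketch only handles 16-bit mono PCM")
        samples = array.array("h")
        samples.frombytes(src.readframes(params.nframes))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    kept = remove_silence(samples, **kwargs)
    if sys.byteorder == "big":
        kept.byteswap()
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)  # frame count in the header is corrected on close
        dst.writeframes(kept.tobytes())
```

Leaving a little padding around speech avoids clipping the first and last syllables, which can matter more for transcription quality than squeezing out every bit of silence.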
Your prompt should use words that are commonly used in your transcript, preferably in the same formatting. The prompt in Whisper is more like “these are complicated words you might encounter”: “GPT-4o, perplexity, RAG…”
Oh, and you can use the fal.ai deployment through an API. Other than that, there is also Deepgram with an API service if you’re reluctant to self host.
I’m a bit of a noob here. Can you explain “self host”?
I’m not quite sure I fully grasp your recommendation on the prompt - can you give me an example?
Whisper is not an instructional model. Your prompt should not contain instructions; they will be detrimental to the results.
Set temperature to 0 as well. It does not function like a typical temperature. It’s dynamically adjusted based on the current input.
In almost all cases where you can enter temperature as 0, the service will do something other than apply the number literally (a sampling temperature can’t actually be 0).
So you could either use an API (calling OpenAI to do the transcription for you and give you the text back)
or
You could run it on your own PC (locally) or your own PC in the cloud (both of these count as self hosting)
In case it was unclear, RonaldGRuckus and I are saying the same thing about the prompt:
1) Don’t tell it what to do, or what you want from it.
2) Tell it a few complicated words it typically mishears.
3) Set temperature to 0 (credits to RGR!!).
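Put together, the three points above might look like this when calling the API directly from Python rather than through Make. Treat it as a sketch: `build_whisper_request` and `transcribe` are hypothetical helper names, the vocabulary list is just the example from earlier in the thread, and the call assumes the official `openai` package is installed with an `OPENAI_API_KEY` set in the environment.

```python
def build_whisper_request(vocabulary: list[str], language: str = "en") -> dict:
    """Build keyword arguments for a transcription call.

    The prompt is only a comma-separated list of hard words,
    with no instructions and no lead-in sentence.
    """
    return {
        "model": "whisper-1",
        "prompt": ", ".join(vocabulary),
        "temperature": 0,
        "language": language,
    }


def transcribe(path: str, vocabulary: list[str]) -> str:
    """Send one audio file to the transcription endpoint and return the text."""
    # Imported here so the helper above can be used without the package installed.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(path, "rb") as audio:
        result = client.audio.transcriptions.create(
            file=audio, **build_whisper_request(vocabulary)
        )
    return result.text
```

For example, `build_whisper_request(["GPT-4o", "perplexity", "RAG"])` produces the prompt `GPT-4o, perplexity, RAG`. In Make, the same values would go into the module's prompt, temperature, and language fields.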
So nothing but a list of words - no lead in like, “Here are a few words you are likely to hear:”?
This is great guys, it’s really bringing a lot of clarity to whisper for me
Do you think there are any phones capable of running these models yet?