Whisper Transcription Questions

  1. Is Whisper still in beta? I don’t seem to be charged anything for using it at the moment.

  2. In a brief audio I submitted, it missed a few lines in the middle. In those lines, I included Spanish while the rest was in English – is that why it skipped them? Or does it randomly skip stuff in general?

  3. Is transcribing things that “aren’t allowed”, i.e. against the content rules, a problem? (For example, transcribing a romantic love scene that it would be against the rules to have it create.)

  4. Not a question, but I was impressed with how it put in quotation and punctuation marks. Other automated transcription options such as Dragon or Otter are nowhere near as good at doing this.

I’ve been testing it some more and it keeps missing large chunks of the transcription. What it does transcribe is virtually perfect, but then it will miss a big chunk - maybe 30 seconds or a minute.

Is this a known bug? Is it being addressed? Outside of this, the quality of the transcription is phenomenal… but missing chunks makes it unfortunately unusable! I’ve tried other automated transcription programs like Dragon etc. but OpenAI’s transcription is significantly better. If only it were reliable!

(I’ve been using it in the Sandbox. Don’t know if there are other ways to access it.)

And upon trying a longer transcription, some of it from fiction I dictated… a disaster. Most of it simply didn’t appear.

Anyone know what the deal with transcription is? Is it “known” to be broken? Any chance of it working any time soon? The accuracy of the transcripts that it produces is OUTSTANDING… except for the absence of huge chunks of it!

As long as the moderators on OpenAI’s Discord server are still deciding about my suggestion to create a channel for Whisper over there (where the community is a lot more active), I have connected with a few people on Discord via PM to talk about Whisper. There are useful discussions on GitHub as well.

The main question would probably be how you set your parameters. For example, if the transcription gets “stuck” somewhere in the middle (which reportedly often happens during longer transcriptions), you should set the parameter condition_on_previous_text to False.
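For anyone who wants to try that, here is a rough sketch of how the setting could be passed when running the open-source whisper package from GitHub locally. It assumes the package is installed (pip install openai-whisper) along with ffmpeg; the helper name transcribe_file, the "base" model size, and the file name in the usage line are my own placeholders, not anything from this thread.

```python
# Options forwarded to whisper's transcribe(). The key is the parameter
# discussed above; everything else is left at the library defaults.
TRANSCRIBE_OPTIONS = {"condition_on_previous_text": False}


def transcribe_file(path, model_name="base", **overrides):
    """Transcribe one audio file with the local whisper package.

    `transcribe_file` and the default model name are illustrative choices.
    """
    # Imported lazily so this file can be loaded without whisper installed.
    import whisper

    options = {**TRANSCRIBE_OPTIONS, **overrides}
    model = whisper.load_model(model_name)
    result = model.transcribe(path, **options)
    # transcribe() returns a dict; the full transcript is under "text".
    return result["text"].strip()
```

Usage would look like transcribe_file("dictation.mp3"). Whether this fixes the missing chunks will depend on the audio, so treat it as something to try rather than a guaranteed fix.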

It will only transcribe the first 30 seconds of whatever waveform you give it. To overcome this, I had to break the file into 30-second chunks, feed each chunk separately for transcription, then stitch all the transcriptions together to get the final overall transcription.
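In case it helps, the chunk-and-stitch workaround can be sketched roughly like this. It is not the exact code I run; transcribe_chunk stands in for whatever function sends one chunk to the model, and the 16000 Hz default sample rate is an assumption you should replace with your audio's real rate.

```python
from typing import Callable, List, Sequence


def split_into_chunks(samples: Sequence, sample_rate: int = 16000,
                      chunk_seconds: int = 30) -> List[Sequence]:
    """Cut a waveform into consecutive chunks of at most chunk_seconds."""
    if sample_rate <= 0 or chunk_seconds <= 0:
        raise ValueError("sample_rate and chunk_seconds must be positive")
    chunk_len = sample_rate * chunk_seconds
    # Slicing works for lists and NumPy arrays; the last chunk may be shorter.
    return [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]


def transcribe_long(samples: Sequence,
                    transcribe_chunk: Callable[[Sequence], str],
                    sample_rate: int = 16000,
                    chunk_seconds: int = 30) -> str:
    """Transcribe each chunk separately and join the texts in order.

    `transcribe_chunk` is a hypothetical callable supplied by the caller.
    """
    pieces = [transcribe_chunk(chunk).strip()
              for chunk in split_into_chunks(samples, sample_rate, chunk_seconds)]
    # Drop empty results (e.g. silent chunks) before stitching.
    return " ".join(piece for piece in pieces if piece)
```

One caveat: cutting at fixed 30-second boundaries can split a word in half, so the text near the seams may come out slightly wrong.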

And yes it works great! I ditched AWS Transcribe and went with Whisper!

I’m not sure which model you’re using or where you’re hosting it. But the kind of workaround you describe shouldn’t normally be necessary.

I am just using the HuggingFace version:


And calling it from an AWS Lambda function.

I suppose the one on HuggingFace was deliberately limited because it’s hosted there for free. Better stick to the instructions on GitHub and you shouldn’t have that problem.

I got it to work, but good to know.