All my attempts to improve accuracy and reduce hallucinations have the opposite effect!

I use the Whisper library with a Python wrapper I wrote myself, which I execute from the command line. The goal is to transcribe more than 20 000 recorded phone calls.

I have spent a lot of time with ChatGPT adjusting my settings to improve the accuracy of the transcriptions as well as reduce hallucinations, but whatever I do it just gets worse, most of the time a lot worse!

My current settings look like this:

                result = model.transcribe(
                    'file.opus',
                    language=used_language,
                    temperature=0.1,
                    beam_size=7,
                    patience=1.0,
                    best_of=5,
                    logprob_threshold=0.5
                )

I have the most recent version of Whisper, I use the large-v3 model, and the language is set to Swedish.
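
For reference, a minimal self-contained version of this call, using the standard openai-whisper package, looks roughly like this (the device choice and file name are placeholders, not my actual wrapper):

    import whisper

    # Load the large-v3 checkpoint; "cuda" picks the first GPU (assumption).
    model = whisper.load_model("large-v3", device="cuda")

    # "sv" is the language code Whisper uses for Swedish.
    result = model.transcribe(
        "file.opus",
        language="sv",
        temperature=0.1,
        beam_size=7,
        patience=1.0,
        best_of=5,
        logprob_threshold=0.5,
    )
    print(result["text"])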

There are 2x2 situations:

  1. The recordings are either AMR or Opus. The Opus files sound much better at 48 kHz, while the AMR files are at 8 kHz (and at the same bitrate, 12 kb/s; Opus is amazing!). The AMR files often sound quite boxy and tinny.
  2. The phone calls are made on cellphones, either in a silent environment (e.g., home or office) or in a noisy environment (e.g., outside, with traffic, wind, music, etc. in the background).

(So there are Opus files (i.e., good recording quality) with either a silent or a noisy background, and likewise for the AMR files, all in all 4 “situations”.)

Since they are phone calls, 99 % of them have two people speaking.

Because this is offline/not realtime, quality is my 1st, 2nd, and 3rd priority. I don’t care if it takes a long time to process the files. I also think I have plenty of resources, since I run it on an HP Z8 G4 with 768 GB RAM, 32+32 cores, and two NVIDIA RTX A5000s with 24 GB each:

$ uname -a
Linux localadmin-HP-Z8-G4-Workstation 6.8.0-45-generic #45~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Sep 11 15:25:05 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

The only thing that works is to set the language. When I do that I get around 10x realtime transcription speed. With the settings above the speed is around 50 % faster (it varies a lot, but 50 % faster is a reasonable approximation), while the quality is much worse, including more hallucinations. I want settings that give better quality, probably at the price of being much slower.

To sum all this up, I have two questions:

1. What would you suggest for settings for my purpose?
2. [META] Does ChatGPT give good advice about Whisper settings?

We don’t know what settings OpenAI uses on their own API service. You can check whether what the API returns is a high-water mark for you, since it uses the large model they employ.

However, since you have a workstation suitable for starting your own AI company…

I’ll try o1-preview to answer for you on the given parameters plus additional context, since I lack experience tuning those myself. What I note:

best_of:

  • In the open-source transcribe(), best_of is only used when sampling with a non-zero temperature; when the temperature is zero, beam search with beam_size is used instead and best_of is ignored. So only one of best_of/beam_size actually takes effect at a time.

patience:

  • Acts as a multiplier to the beam_size, allowing the beam search to explore more options without increasing the beam width. The default of 1.0 is equivalent to conventional beam search, so patience=1.0 changes nothing.

So if you want only the “best” token at every step, I would try a temperature of 0 (or very close to it) and drop best_of, which needs the variance of multiple sampled runs. Then compare beam_size=1 against your 7 and see whether the difference goes in the right or wrong direction.
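
A minimal sketch of that comparison, assuming the standard openai-whisper transcribe() API (at temperature 0, best_of is ignored and beam_size takes effect; file name is a placeholder):

    # Sketch: deterministic decoding at temperature 0, beam width 1 vs 7.
    import whisper

    model = whisper.load_model("large-v3")

    greedy = model.transcribe("file.opus", language="sv",
                              temperature=0.0, beam_size=1)
    beam = model.transcribe("file.opus", language="sv",
                           temperature=0.0, beam_size=7, patience=1.0)

    print(greedy["text"])
    print(beam["text"])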

Then: the model you are using, and the language. A good match between the two is essential; a smaller model may outperform on purely English audio, rather than simply throwing large-v3 at everything.

As a technique, I would also try splitting the spoken audio into chunks yourself using silence detection, keeping them well under the 30-second windows that the model operates on. Low-pass filter the audio to telephony voice bandwidth and resample it to 16 kHz.
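
A rough preprocessing sketch along those lines, calling ffmpeg from Python (the filter cutoffs and file names are illustrative, not tuned values):

    # Sketch: band-limit to roughly telephony bandwidth and resample to
    # 16 kHz mono before handing the file to Whisper. Cutoffs are illustrative.
    import subprocess

    def preprocess(src: str, dst: str) -> None:
        subprocess.run([
            "ffmpeg", "-y", "-i", src,
            "-af", "highpass=f=200,lowpass=f=3400",  # keep the voice band
            "-ar", "16000", "-ac", "1",              # 16 kHz mono, Whisper's input rate
            dst,
        ], check=True)

    preprocess("call.amr", "call_16k.wav")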


I’m not sure I am following you here (the AI terminology is very confusing). I run this from the command line and process everything locally; what does OpenAI’s API have to do with anything?

The language is Swedish. Do you suggest that, e.g., medium could be better than large with regard to quality (ignoring performance) in some cases? And that large-v2 could be better than large-v3? That is counterintuitive…

How do you do that in Whisper? Or are you talking about using an external program to preprocess the input?

Thank you.

OpenAI released open-source Whisper, along with trained models.

OpenAI also operates a paid service, where you can send the audio and receive a transcription. This can be used to establish the quality possible. The API also accepts a language field and a pre-prompt to help establish the spoken language.

If you want to see whether a higher-quality transcription is possible, you can test against OpenAI’s premium service.
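
A minimal sketch of such a test, assuming the current openai Python client (whisper-1 is the hosted Whisper model; the file name and prompt text are placeholders):

    # Sketch: transcribe one call with the hosted API as a quality baseline.
    # Requires OPENAI_API_KEY in the environment.
    from openai import OpenAI

    client = OpenAI()
    with open("hard_call_01.wav", "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            language="sv",                                   # the language field
            prompt="Expected names and vocabulary go here.", # the pre-prompt
        )
    print(transcript.text)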

The larger the AI model and the more training on world languages, the more uncertain it may become compared to a specialist AI model. Expanding the training set may not have a direct benefit to you (for example, a language AI trained more for chat may not benefit your multi-step permanent entity extraction jobs). Swedish would be among the minority languages, and should be one to benefit from a larger training set, but that can’t be guaranteed.

Besides new training, Whisper v3 also increases the number of frequency bins compared to v2, which may not have a direct benefit for low-quality audio that has already been through lossy compression.

Looking through the list of candidate AI models and parameters that could be tested, you have extensive options. I would find five of the most challenging audio examples and produce a human-labeled transcript for each. Then automate sending them all through all the Whisper settings variations. An AI language model could judge which transcription has the lowest word error rate and the highest preservation of meaning across these dozens of trials. This is more of your own programming, but considering the large input set you want to run, it is a small investment to discover the best settings for your language, recording technology, and audio quality.
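
A sketch of such a harness (file names, reference transcripts, and the parameter grid are placeholders; a plain word-error-rate computation stands in here for the AI judge):

    # Sketch: try a handful of decoding settings on a few hand-transcribed
    # calls and rank them by word error rate. Everything here is illustrative.
    import itertools
    import whisper

    def wer(ref: str, hyp: str) -> float:
        # Word-level Levenshtein distance, normalized by reference length.
        r, h = ref.split(), hyp.split()
        prev = list(range(len(h) + 1))
        for i, rw in enumerate(r, 1):
            curr = [i] + [0] * len(h)
            for j, hw in enumerate(h, 1):
                curr[j] = min(prev[j] + 1,                 # deletion
                              curr[j - 1] + 1,             # insertion
                              prev[j - 1] + (rw != hw))    # substitution
            prev = curr
        return prev[len(h)] / max(len(r), 1)

    model = whisper.load_model("large-v3")

    # Hypothetical test set: audio path -> human-made reference transcript.
    references = {
        "hard_call_01.wav": "human reference transcript goes here",
    }

    # Illustrative grid; Whisper ignores beam_size when the temperature is > 0.
    grid = itertools.product([0.0, 0.2], [1, 5, 7])
    results = []
    for temperature, beam_size in grid:
        total = 0.0
        for path, ref in references.items():
            hyp = model.transcribe(path, language="sv",
                                   temperature=temperature,
                                   beam_size=beam_size)["text"]
            total += wer(ref.lower(), hyp.lower())
        results.append((total / len(references), temperature, beam_size))

    for avg, temperature, beam_size in sorted(results):
        print(f"WER {avg:.3f}  temperature={temperature}  beam_size={beam_size}")

The same loop could instead feed each pair of transcripts to a language model for judging meaning preservation, or use an existing WER library; the point is only that the comparison can be automated.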

The last items I discuss are transformations you perform on the input audio yourself to prepare it for sending to the AI model.


That was complicated and confusing! I am gonna comment paragraph by paragraph.

The tiny, base, large etc models?

OpenAI also operates a paid service, where you can send the audio and receive a transcription. This can be used to establish the quality possible. The API also accepts a language field and a pre-prompt to help establish the spoken language.

Do the library and the paid service offer the same quality?

A language field like the language argument you can use with the library?

What is a “pre-prompt” in this context? Can you use that with the library (or command line client)?

If you want to see if there is possibility of a higher-quality transcription, you can test against OpenAI’s premium service.

Ok

The larger the AI model and the more training on world languages, the more uncertain it may become compared to a specialist AI model.

I’m not following you here. Do there exist other models for Whisper than the tiny, base, large, etc. that you can specify when using “default” Whisper?

Expanding the training set may not have a direct benefit to you (for example, a language AI trained more for chat may not benefit your multi-step permanent entity extraction jobs).

What are multi-step permanent entity extraction jobs?

Swedish would be among the minority languages, and should be one to benefit from a larger training set, but that can’t be guaranteed.

I think there is plenty of training data for Swedish, so it should not be a problem. According to these articles (1, 2), Swedish is a top-5 language on Wikipedia. Not English, but definitely good enough.

Besides new training, Whisper v3 also increases the number of frequency bins compared to v2, which may not have a direct benefit for low-quality audio that has already been through lossy compression.

What is a frequency bin?

Looking through the list of candidate AI models and parameters that could be tested, you have extensive options.

Are there other models for Whisper than the ones OpenAI offers? Where can I find them? Hugging Face is so confusing! And what parameters are you referring to? Where can I find documentation? Here is what I find in the Whisper source. (And for the CLI.)

I would find five of the most challenging audio examples and produce a human-labeled transcript for each. Then automate sending them all through all the Whisper settings variations.

That is basically an infinite number of combinations!? There are a dozen or so parameters that can each have a dozen values: 12¹² ≈ 9 × 10¹². That is undoable, isn’t it? Am I misunderstanding something?

Has anyone written something that could be used for this?

An AI language model could judge which transcription has the lowest word error rate and the highest preservation of meaning across these dozens of trials. This is more of your own programming, but considering the large input set you want to run, it is a small investment to discover the best settings for your language, recording technology, and audio quality.

Yes, I agree. But I think there should be plenty of people in the same situation, so I would expect that someone has already solved this problem? OTOH, I spent quite a bit of time looking for a pre-made solution for my situation (transcribing 20 000 recordings offline) without success, so maybe not?

The last items I discuss are transformations you perform on the input audio yourself to prepare it for sending to the AI model.

What tool would you use for that? ffmpeg?

Thank you

I cannot teach “how to program” or “programming with audio files”. Both are beyond the scope of the forum and my unpaid consultation for one person. OpenAI provides easy API services for those simply wanting good end results and who will pay for the data: roughly $0.40 per hour of audio.

ChatGPT can do a good job of that, though.

Here is Whisper, with resources to learn about the models, analysis of performance and error rates on languages, etc.

There you can also see a graph of two benchmarks on various languages, where lower is better. Swedish seems to perform better with v3 in both benchmarks. Your audio quality may be different from what is used in either benchmark.

“prompt”: OpenAI’s Whisper implementation on their API allows you to send some of the previous transcript for continuity and quality in the transcription. I do not know how to do something similar with the open-source version.

“Entity extraction” is a task you can do with language AI models, obtaining names or addresses from a large text, for example. I add that because extraction also needs selection of the best model for the task. It is just an off-hand comment.

Frequency bin: both for encoding lossy audio like MP3 and for the front end of Whisper, audio is transformed from the time domain to the frequency domain. Instead of slices of signal 16 000 times a second, it becomes coarser slices of different audio frequencies, in Mel “bins”. The best input audio is audio that has never been through lossy compression, but the AI has also been trained on all sorts of bad audio. This is another example indicating that you need to try more models to find the best one for your files.
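
As an illustration, assuming a recent openai-whisper version, you can compute the log-Mel spectrogram the model actually sees (128 bins corresponds to large-v3, 80 to the earlier models; the file name is a placeholder):

    # Sketch: the log-Mel spectrogram Whisper operates on.
    import whisper

    audio = whisper.load_audio("call_16k.wav")   # decode/resample to 16 kHz mono
    audio = whisper.pad_or_trim(audio)           # pad or cut to the 30-second window
    mel = whisper.log_mel_spectrogram(audio, n_mels=128)
    print(mel.shape)                             # (128, 3000): frequency bins x time frames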

I gave a limited set of parameters to try variations of: beam search “yes” or “no”, basically, and medium temperature with best_of versus low temperature. Default values would be a good place to start, altering one parameter at a time. Automation with AI can turn eight hours of manual work into “check back and get results”.

There are multiple audio libraries, any of which can be used; choose by ease of use. You need decoding from the lossy files, silence detection, filtering, and resampling to the target rate. ffmpeg can be called as a utility to operate on files, but in-memory tools will be faster. sox is the highest-quality library for resampling and filtering, IMO.
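
For the silence-detection step specifically, here is a sketch using pydub, which shells out to ffmpeg for decoding (the thresholds are placeholders to tune per recording type):

    # Sketch: split a call at silences so the pieces stay well under
    # Whisper's 30-second window (very long chunks would still need further splitting).
    from pydub import AudioSegment
    from pydub.silence import split_on_silence

    call = AudioSegment.from_file("call_16k.wav")
    chunks = split_on_silence(
        call,
        min_silence_len=500,              # ms of silence that counts as a break
        silence_thresh=call.dBFS - 16,    # threshold relative to the call's level
        keep_silence=200,                 # keep a little context at the edges
    )
    for i, chunk in enumerate(chunks):
        chunk.export(f"chunk_{i:04d}.wav", format="wav")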

As I see it, this is your implementation:

sequenceDiagram
    participant User
    participant CommandLine
    participant PythonWrapper
    participant WhisperLibrary
    participant Model

    User->>CommandLine: Execute Python script
    CommandLine->>PythonWrapper: Call transcribe function
    PythonWrapper->>WhisperLibrary: Load model (large-v3)
    WhisperLibrary->>Model: Initialize with settings
    Model-->>WhisperLibrary: Model ready
    WhisperLibrary-->>PythonWrapper: Model loaded
    PythonWrapper->>WhisperLibrary: Transcribe file.opus
    WhisperLibrary->>Model: Transcribe with settings
    alt Good quality audio
        Model-->>WhisperLibrary: Transcription result
    else Poor quality audio
        Model-->>WhisperLibrary: Transcription with errors
    end
    WhisperLibrary-->>PythonWrapper: Return transcription result
    PythonWrapper-->>CommandLine: Output result
    CommandLine-->>User: Display transcription

    Note over User,CommandLine: User executes the script from the command line
    Note over PythonWrapper,WhisperLibrary: Python wrapper interacts with Whisper library
    Note over WhisperLibrary,Model: Whisper library uses the model to transcribe audio
    Note over Model: Model processes the audio file based on the provided settings

It’s a great choice to go the local route; however, this comes with the assumption that you have solid programming knowledge.

Probably all the resources and information can be found on GitHub or Hugging Face.

Which version you want to use depends on your needs.

If you’d rather have an easier time with this, you should try the OpenAI API for your task, as it will be a lot simpler to implement.

Check out the speech-to-text docs to learn the very basics and figure out each parameter.

Success in this area of coding requires not only full-stack development skills but also experience in audio engineering, so do not get discouraged!

Feel free to elaborate further, I hope I could help you with your issue.

Good luck! :hugs: