I use the Whisper library with a Python wrapper I wrote myself, which I run from the command line. The goal is to transcribe more than 20 000 recorded phone calls.
I have spent a lot of time with ChatGPT adjusting my settings to improve the accuracy of the transcriptions and to reduce hallucinations, but whatever I do, it just gets worse, most of the time a lot worse!
My current settings look like this:
result = model.transcribe(
    'file.opus',
    language=used_language,
    temperature=0.1,
    beam_size=7,
    patience=1.0,
    best_of=5,
    logprob_threshold=0.5,
)
I have the most recent version of Whisper, I use the large-v3 model, and the language is set to Swedish.
There are 2x2 situations:
- The recordings are either AMR or Opus. The Opus files sound much better: they are sampled at 48 kHz, while the AMR files are at 8 kHz (at the same bitrate, 12 kb/s. Opus is amazing!). The AMR files often sound quite boxy and tinny.
- The phone calls are made on cellphones, either in a silent environment (e.g., home or office) or in a noisy environment (e.g., outside, with traffic, wind, music, etc. in the background).
(So there are Opus files (i.e., good recording quality) with either a silent or a noisy background, and likewise for the AMR files: 4 "situations" in all.)
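To make the 2x2 split concrete, here is a minimal sketch of how the four situations could be enumerated and how a file could be assigned to one of them. The names `SAMPLE_RATE_HZ`, `situations` and `classify` are made up for illustration and are not part of my wrapper or of Whisper, and the `noisy` flag is assumed to come from a manual label or some separate detector.

```python
import os
from itertools import product

# Codec -> sample rate in Hz, as described above.
SAMPLE_RATE_HZ = {"opus": 48_000, "amr": 8_000}
ENVIRONMENTS = ("silent", "noisy")


def situations():
    """Return every (codec, environment) combination: the 2x2 grid."""
    return list(product(SAMPLE_RATE_HZ, ENVIRONMENTS))


def classify(filename, noisy):
    """Assign a recording to one situation.

    The codec is read from the file extension. `noisy` is an assumption:
    it has to be supplied by a manual label or a separate noise detector.
    """
    codec = os.path.splitext(filename)[1].lstrip(".").lower()
    if codec not in SAMPLE_RATE_HZ:
        raise ValueError(f"unexpected recording format: {filename!r}")
    return codec, "noisy" if noisy else "silent"


for codec, environment in situations():
    print(f"{codec:4s} {SAMPLE_RATE_HZ[codec]:>6d} Hz  {environment}")
```

Grouping the files this way would at least make it possible to compare settings per situation instead of over the whole mixed collection.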
Since they are phone calls, 99 % of them have two people speaking.
Because this is offline rather than real-time processing, quality is my 1st, 2nd and 3rd priority. I don't care if it takes a long time to process the files. I also think I have plenty of resources, since I run it on an HP Z8 G4 with 768 GB RAM, 32+32 cores and two NVIDIA RTX A5000 GPUs with 24 GB each:
$ uname -a
Linux localadmin-HP-Z8-G4-Workstation 6.8.0-45-generic #45~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Sep 11 15:25:05 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
The only thing that works is to set just the language. When I do that, I get around 10x real-time transcription speed. With the settings above, the speed is around 50 % faster!! (It varies a lot, but 50 % faster is a reasonable approximation.) Meanwhile, the quality is much worse, including hallucinations. I want settings that give better quality, probably at the price of being much slower.
To sum up, this leads to two questions:
1. What settings would you suggest for my purpose?
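For reference, this is roughly what those speed figures mean in wall-clock terms. It is only a back-of-the-envelope sketch: I read "50 % faster" as 1.5x the throughput, and the average call length of 3 minutes is an assumed placeholder, not a measured value.

```python
def realtime_factor(audio_seconds, wall_seconds):
    """How many seconds of audio are transcribed per second of wall-clock time."""
    return audio_seconds / wall_seconds


def processing_hours(total_audio_hours, factor):
    """Wall-clock hours needed to transcribe the audio at a given real-time factor."""
    return total_audio_hours / factor


baseline = 10.0         # ~10x real time with only the language set
tuned = baseline * 1.5  # "around 50 % faster", read as ~15x real time

# ASSUMPTION: 20 000 calls at a placeholder average of 3 minutes each.
total_audio_hours = 20_000 * 3 / 60

for label, factor in (("language only", baseline), ("settings above", tuned)):
    hours = processing_hours(total_audio_hours, factor)
    print(f"{label}: {factor:.0f}x real time, about {hours:.0f} h in total")
```

Even a setup several times slower than the baseline would still be acceptable for a one-off batch job like this, which is why I am happy to trade speed for quality.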
2. [META] Does ChatGPT give good advice about Whisper settings?