I’ve had it set at 0.1 mostly. I’ve been transcribing Japanese audio, and have also tried 0.3 temp to see if it made any difference.
But the transcription quality is actually very stable and solid. There was an issue with some names being ‘shmushed together’ with other words because they sounded similar to an existing word that also made sense in the context of the sentence.
But adding the missing word to the system prompt immediately fixed this. Also having a few Katakana words in the prompt seems to completely fix issues with ‘borrowed’ English words in the transcription.
I’m not sure how to reliably replicate the issue with repeating characters. I used to sometimes have a few repetitions of “Thanks for watching!” at the end, but now I’ve been seeing a character repetition that basically seems to max out the response size.
If I have a clip that results in such an issue, I consistently get the same result when trying to send that same audio file with different parameters.
The only thing I’ve been able to ascertain so far is that these clips always contain some “uhh” or “ahh” sounds. My prompt contains some examples of how to transcribe those sounds and it generally works perfectly.
But sometimes the transcription spirals out into repetition right around the “ahh” or “hmm” sound.
It could also maybe be a combination of “hmm” or “ahh” with background music at a certain frequency/volume or something.
Currently I just use regex to detect and remove excessively repeating characters.
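For reference, the cleanup step is roughly along these lines (a rough sketch rather than my exact pattern; the function name and the repetition threshold are arbitrary choices):

import re

def strip_repetitions(text, max_repeats=4):
    # Collapse any single character repeated more than max_repeats times in a row
    text = re.sub(r"(.)\1{%d,}" % max_repeats, r"\1", text)
    # Collapse short 2-10 character sequences repeated back to back
    text = re.sub(r"(.{2,10})\1{%d,}" % max_repeats, r"\1", text)
    return text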
I assume that infrequent quirks like this will keep popping up now and then. The behavior of the model probably changes in subtle ways.
Oh, and yeah, I did try putting an actual instruction in the Whisper prompt, but as expected that had no effect.
Ah yes, I used 0 temp in the beginning. But I got frustrated when trying to understand exactly how it worked, and generally did not like the fact that I could not see which temperature was actually used in each response when it was adjusted dynamically.
Though I only used 0 temp very briefly, so I haven’t yet observed if it affects the frequency of character repetitions
A separate issue was that (after a transcription is sent to gpt4 for translation) the response from gpt4 also included all those character repetitions, despite the fact that the system prompt contains an instruction to omit repetitions. What I think may be happening is that the system message gets ‘pushed out’ of the request when the token limit is reached.
I haven’t confirmed this, because it only occurred to me after it happened. The request as a whole has to be under the token limit, but close enough to it that the transcript ‘takes up the space’ of the system prompt.
Now I wonder if moving the system prompt to the bottom of the messages array in the request would change this behavior.
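If the token limit really is the cause, another workaround would be to trim the transcript itself before building the request, so the system message always fits. A rough sketch using tiktoken (the budget numbers here are placeholders, not recommendations):

import tiktoken

def build_messages(system_prompt, transcript, max_request_tokens=8192, reply_budget=1000):
    enc = tiktoken.get_encoding("cl100k_base")
    # Reserve space for the system prompt and the model's reply,
    # then cut the transcript down to whatever budget remains
    budget = max_request_tokens - len(enc.encode(system_prompt)) - reply_budget
    truncated = enc.decode(enc.encode(transcript)[:budget])
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": truncated},
    ]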
edit:
I tried your suggestion of 0 temp and it actually works. I first obtained a short sound clip that consistently resulted in a long string of repeated characters. The same issue occurred with 0.1 on about 10 retries.
With 0 temp with the same clip, the result is correct every time.
For what it’s worth, and based on others’ suggestions, adding a prompt with these instructions: “The audio provided may have moments of silence, do not make up words to fill in extended times of silence.”
did seem to work for me. I tested it on an almost 2-hour video file, segmented with pydub.
I needed actual VTT files with accurate timestamps and was getting repeats whenever the speaker paused for a second or more. Not the most graceful solution, but it seems to work.
UPDATE: It’s hit and miss. Shortly after posting this and testing another file, I ended up with a file that repeated the prompt above 34 times when there wasn’t even a silence.
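For context, the segmentation side of it looks roughly like this (the 30-second chunk length and file names are just placeholders, not a recommendation for VTT alignment):

from pydub import AudioSegment

def split_audio(file_path, chunk_ms=30_000):
    audio = AudioSegment.from_file(file_path)
    chunks = []
    for i, start in enumerate(range(0, len(audio), chunk_ms)):
        out_path = f"chunk_{i:04d}.mp3"
        # Export each slice and keep its start offset so timestamps can be shifted back later
        audio[start:start + chunk_ms].export(out_path, format="mp3")
        chunks.append((out_path, start))
    return chunks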
A temperature of 0 usually means greedy decoding, but in this case it actually enables a dynamic temperature:
temperature (number, Optional, Defaults to 0)
The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit.
If you want to include utterances, you should use a prompt that demonstrates this. For example, you could use something like this as your prompt:
“This is… Uhmm… A live transcript. Maybe… Maybe it will start like this”
The model may also leave out common filler words in the audio. If you want to keep the filler words in your transcript, you can use a prompt that contains them: "Umm, let me think like, hmm… Okay, here’s what I’m, like, thinking."
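For anyone following along, passing such a prompt through the API looks roughly like this (using the 1.x openai Python client; the file name and the explicit temperature are just examples):

from openai import OpenAI

client = OpenAI()
with open("recording.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        prompt="Umm, let me think like, hmm… Okay, here's what I'm, like, thinking.",
        temperature=0,  # 0 lets the API raise the temperature dynamically, per the docs quoted above
    )
print(transcript.text)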
Can you please provide the prompt you are working with? In my experience the prompt has always worked to include these words. Is your prompt by any chance instructional?
transcriptionPrompt = "Hi folks. I'm going to umm... read aloud from this book. I apologize if I sound a bit uhh... a bit stuffy or congested. I haven't gotten over this cold yet. I might clear my throat... *ahem* from time to time too."
Prompt looks great. So I tried an audio recording with and without your prompt. I ran the Whisper medium model locally using the whisper-timestamped library in Python.
def transcribe(language, file_path):
    import whisper_timestamped as wt

    audio = wt.load_audio(file_path)
    model = wt.load_model("medium", device="cuda")
    result = wt.transcribe(
        model,
        audio,
        language=language,
        initial_prompt="Hi folks. I'm going to umm... read aloud from this book. I apologize if I sound a bit uhh... a bit stuffy or congested. I haven't gotten over this cold yet. I might clear my throat... *ahem* from time to time too.",
    )
    return result
WITHOUT your prompt:
{‘text’: " Hi, so maybe we could do something like this or maybe not we’ll have to find out. Well maybe we’ll have to find out. We’ll find out. That’s how it’s gonna go. Bop, beep, boop, 20 seconds is coming up."}
WITH your prompt:
{‘text’: " Hi. So umm… maybe… maybe we could do something like this. Or maybe not. We’ll have to find out. Well maybe… maybe we’ll have to find out. We’ll find out. That’s how it’s gonna go. Bop. Beep. Boop. 20 seconds. It’s coming up."}
Could you show your code where you’re calling the whisper endpoint?
The recordings that I saved as examples which produced repeated characters or cut-off transcriptions are now being transcribed correctly, even though I’m using the same settings.
This issue does still happen though, especially if a recording starts with a long “ooh” or “ahh” sound. Basically any vocalization that is not a word. I set up my app to switch between parameters quickly so I can always compare results.
I don’t consider this much of a problem anymore, since I can just use regex to remove repeated characters before translation, and can avoid the issue entirely by making sure I don’t start a recording with any non-word vocalizations.
I just thought it was interesting that the recordings I had saved which consistently used to produce errors regardless of parameter settings are now transcribing correctly.