Dialog before long pause gets repeated over and over again by Whisper

I have an audio recording in which I say “this is a test” once, followed by about two minutes of silence (i.e. a pause). I then say “lorem ipsum” and the recording ends. When I request a transcription of this audio file, here’s what I get:

This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test. This is a test.

It seems that the longer the pause, the more times the preceding dialog is repeated. Note that “lorem ipsum” never appears in the output at all.

Here’s my code:

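<?php
require 'vendor/autoload.php';

$file = 'demo.m4a';

// Read the whole recording into memory for the multipart upload.
$audio = file_get_contents($file);

// The API key is sent as a Bearer token in the Authorization header.
// (Guzzle's 'auth' option is meant for basic/digest/NTLM credentials.)
$apiKey = '[redacted]';

// TLS certificate verification is left at Guzzle's default (enabled).
$client = new \GuzzleHttp\Client([
	'base_uri' => 'https://api.openai.com/',
	'headers' => ['Authorization' => 'Bearer ' . $apiKey]
]);

$response = $client->request('POST', '/v1/audio/transcriptions', [
	'multipart' => [
		['name' => 'file', 'contents' => $audio, 'filename' => $file],
		['name' => 'model', 'contents' => 'whisper-1'],
		['name' => 'language', 'contents' => 'en']
	]
]);

// The response body is JSON; the transcription is in its "text" field.
$transcript = $response->getBody()->getContents();
$transcript = json_decode($transcript);
echo $transcript->text;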

I doubt there’s anything I can do to fix this with the limited API parameters that exist. This forum seems as good a place as any to report possible bugs, unless there’s a better place?

I can provide a copy of the audio file, too, if that’d be helpful, but it seems the only attachments I can upload are images.

Hey. Did you use this further and have the problems continued? I am experiencing the same problem where the words are repeated, but I am not sure if it is only because of the long pause. It seems some other things can trigger the same problem, but I am not really sure what exactly.

Same problem here. It seems that supplying a prompt dictionary increases Whisper’s hallucinations.

I’m having the same issue today. It looks like this wasn’t fixed yet.
I wish they would expand the capabilities of the Whisper API. At this point it looks like I’ll have to run a Whisper model myself on a remote instance or something.
Just a few extra parameters would help a lot.

Perhaps there’s a way to perform some voice activity detection (VAD) locally before saving/sending the audio recordings, so that long silences never reach the API. I’ll have to look into that.
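
For what it’s worth, here is a rough sketch of what a very crude local VAD pass could look like in PHP. It is not a real VAD model, just an energy (RMS) threshold per frame, and it assumes the audio has already been decoded to mono PCM samples normalised to the range -1.0 to 1.0 (decoding the .m4a and re-encoding afterwards would need an external tool and is not shown). The function name `collapseLongSilence` and all default values are made up for illustration and would need tuning on real recordings.

```php
<?php
/**
 * Cap every silent stretch at a maximum length (hypothetical helper).
 *
 * @param float[] $samples           Mono PCM samples in [-1.0, 1.0] (assumed input format).
 * @param int     $sampleRate        Samples per second, e.g. 16000.
 * @param float   $rmsThreshold      Frames with RMS below this count as silence (assumed value).
 * @param float   $maxSilenceSeconds Longest silent run to keep.
 * @param float   $frameSeconds      Length of each analysis frame.
 * @return float[] Samples with long silent runs shortened.
 */
function collapseLongSilence(
    array $samples,
    int $sampleRate,
    float $rmsThreshold = 0.01,
    float $maxSilenceSeconds = 0.5,
    float $frameSeconds = 0.02
): array {
    $frameSize = max(1, (int) round($sampleRate * $frameSeconds));
    $maxSilentFrames = (int) floor($maxSilenceSeconds * $sampleRate / $frameSize);

    $output = [];
    $silentRun = 0; // number of consecutive silent frames seen so far

    foreach (array_chunk($samples, $frameSize) as $frame) {
        // Root-mean-square energy of this frame.
        $sumSquares = 0.0;
        foreach ($frame as $sample) {
            $sumSquares += $sample * $sample;
        }
        $rms = sqrt($sumSquares / count($frame));

        if ($rms < $rmsThreshold) {
            $silentRun++;
            if ($silentRun > $maxSilentFrames) {
                continue; // drop silence beyond the allowed maximum
            }
        } else {
            $silentRun = 0; // speech (or noise) resets the counter
        }

        array_push($output, ...$frame);
    }

    return $output;
}
```

With a recording like the one in the first post (speech, a long pause, then more speech), the pause would be cut down to at most `$maxSilenceSeconds` before upload. Whether that actually stops the repetition is untested here; a fixed energy threshold will also misfire on recordings with background noise, where a proper VAD would do better.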