[SOLVED] Whisper translates into Welsh

@RonaldGRuckus Agreed. When you take a copy of a copy of a copy from a copy machine, each copy is degraded, and the audio spectrum degrades right along with it.

The only advice I have now is: if you need to convert formats to fit the API specs, do this:

Initial format → Lossless intermediate → Final format for the API.

Use the best engines and formats you can.

Personally, I always use .wav files, for this very reason, when I interface with the Whisper API.
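As a minimal sketch of that pipeline in Python, shelling out to ffmpeg (this assumes ffmpeg is installed and on your PATH; the filenames are just illustrative):

import subprocess

# Lossy source -> lossless intermediate: decode the Opus file exactly once
subprocess.run(['ffmpeg', '-y', '-i', 'input.opus', 'audio.wav'], check=True)

# The Whisper API accepts .wav directly, so stop here rather than
# re-encoding to another lossy format like mp3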


I convert from Opus (from WhatsApp) to MP3 and send it to Whisper.

Here is a transcription:

Transcription: Helo John, rwy’n gobeithio y gallech chi fy helpu. Rwy’n meddwl o adeiladu cymheithas ar gyfer yr apokalips lle byddwn yn cael ymweld â 50 o bobl ac mae angen cychynnau. Rydyn ni eisiau cael cychynnau sy’n digon iawn ar gyfer un o gynnyrch ar y diwrnod ac hefyd rydyn ni eisiau cymryd rhai a byddwn yn angen eu hymweld â nhw a gwneud yn siŵr ein bod yn gael cyfnod iawn. Gallwch chi fy helpu gyda hynny a ddangos i mi y cyfnodau?

And here is a link to the audio file with my northern accent > received_audio.mp3 - Google Drive

I did two tests. On the first, one of the speakers had an Indian accent. But on the second, none of the speakers had an accent. Both resulted in Welsh.

Can you upload these tests for us to see? Do you have an accent? @brianbray01

@acoloss Try with a prompt. The first couple of seconds of your clip have a very heavy accent.

I don’t necessarily think the accent is the biggest contributing factor, though I do think it carries some weight.

OK, I can reproduce the issue, and I get this as a transcription.

{'text': "Helo John, rwy’n gobeithio y gallech chi fy helpu. Rwy’n meddwl o adeiladu cymheithas ar gyfer yr apokalips lle byddwn yn cael ymweld â 50 o bobl ac mae angen cychynnau. Rydyn ni eisiau cael cychynnau sy’n digon iawn ar gyfer un o gynnyrch ar y diwrnod ac hefyd rydyn ni eisiau cymryd rhai a byddwn yn angen eu hymweld â nhw a gwneud yn siŵr ein bod yn gael cyfnod iawn. Gallwch chi fy helpu gyda hynny a ddangos i mi y cyfnodau?"}

The problem is the ogg → mp3 conversion, which I cannot re-create from your data.

Is there any way you can say the exact same thing into your microphone, record it directly to a .wav file, and post that .wav file?

I think this is a lossy → lossy conversion issue, not an accent issue, since these conversions can create unseen havoc in the spectrum that the AI sees.


Just want to point out… I have now changed my script to convert Opus to WAV, and it is the same issue.

Maybe I need to choose a different bitrate during conversion?


BOOM, I think I solved it.

I am running Python, my rogue “direct” version, not the SDK, and when I prompt it, with both the mp3 and wav versions, I get it to work. See:

import os
import requests

# bearer-token auth header for the OpenAI API
headers = {'Authorization': f"Bearer {os.environ['OPENAI_API_KEY']}"}

files = {
    'file': open('/Users/curtkennedy/Downloads/received_audio.mp3', 'rb'),
    # the prompt is what steers Whisper toward English output
    'prompt': (None, 'you are a fun loving british speaker.  please transcribe this into english for me.'),
    'model': (None, 'whisper-1'),
}

response = requests.post('https://api.openai.com/v1/audio/transcriptions', headers=headers, files=files)

print(response.json())

# {'text': 'Hello John, I wondered if you could help me. I am thinking of building a compound for the apocalypse where we will have around 50 people and we need some chickens. We want to have enough chickens for one egg a day and also we want to eat some and we will need to breed them and make sure we always have enough. Can you help me with that and show me the numbers?'}

Here is the original Opus file (note it’s different from the other one, as I had to test again whether the WAV conversion worked).

@curt.kennedy The thing is, I might actually want someone who speaks a different language to speak to it :smiley: but this may help (with the prompt).

(Also worth noting: I am also using Python.)


If the input is in another language, then prompt it in that language. Prompting is an often-overlooked feature of Whisper, but it seems powerful. You can correct common misspellings of people’s names, etc.

PROMPT!
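As a quick sketch of that, in the same direct-request style as the snippet above (the filename and the names in the prompt are made up for illustration; headers is the same bearer-token header as before):

files = {
    'file': open('meeting.mp3', 'rb'),
    # listing the correct spellings in the prompt nudges Whisper toward
    # them when the audio is ambiguous
    'prompt': (None, 'A call between Siobhán Ní Bhriain and Aleksandr Tsiolkovsky.'),
    'model': (None, 'whisper-1'),
}

response = requests.post('https://api.openai.com/v1/audio/transcriptions', headers=headers, files=files)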

I took your audio and spliced it with 15 seconds of a very prominent English speaker. I will upload it shortly.

Used Whisper:

>>> audio_file = open("out.mp3", "rb")
>>> transcript = openai.Audio.transcribe("whisper-1", audio_file)
>>> transcript
<OpenAIObject at 0x7f0f1062bec0> JSON: {
  "text": "Warning do not recreate or reenact anything seen in this video I am NOT affiliated with any brand or product seen in this video and watch this video at your own risk. That's all. Thank you Hello, John. I wondered if you could help me. I'm thinking of building a compound for the apocalypse Where we will have around 50 people and we need some chickens. We want to have enough chickens"


The prompt is the solution. Why it happens in the first place is still unknown.


Maybe. I had to ditch the SDK initially since it wouldn’t work for me.

Could be a wrapper bug. But try prompting it and running directly (non-SDK); that works for me. Then go with the SDK if that’s your comfort level, and make sure to prompt in whatever language is needed.

Use other AI services to detect the language, too, if you are running blind at scale.


Yes, I think the easy fix is to prepend some very clear, non-accented English audio (such as the disclaimer above), which is what I believe the prompt effectively does without needing audio.
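If you wanted to do that prepend in code rather than by hand, here is a minimal sketch with pydub (assuming pydub and ffmpeg are installed; both filenames are hypothetical):

from pydub import AudioSegment

intro = AudioSegment.from_file('clear_english_intro.wav')  # ~15 s of clear, unaccented English
message = AudioSegment.from_file('received_audio.mp3')

# splice the clear English in front so Whisper locks onto English early
(intro + message).export('out.mp3', format='mp3')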


Yup. Using the same prompt with only 15 seconds of @acoloss’ audio gives the same result as prepending the audio. I should have included the log probs; it would be nice to know why there’s a difference. I think that’s enough fun time for me today, though.

>>> audio_file = open("second.mp3", "rb")
>>> transcript = openai.Audio.transcribe("whisper-1", audio_file, prompt="Warning do not recreate or reenact anything seen in this video I am NOT affiliated with any brand or product seen in this video and watch this video at your own risk. That’s all.") 
>>> transcript
<OpenAIObject at 0x7f59f2009b20> JSON: {
  "text": "Hello John, I wondered if you could help me. I am thinking of building a compound for the apocalypse where we will have around 50 people and we need some chickens. We want to have enough chickens.



OK, so SDK + Prompting also works. Yay!

I’d mark this one as SOLVED!


Is it solved? :smiley:

Should it not just work?

(Thanks for the ideas, btw!)


I wish everything just worked. :crazy_face:

My guess is that the training data had more American-accented English than British-accented English in it. So the fix would be to re-roll a new model, or an updated one.

Not sure why there is a delta between this OpenAI API version and the open-source one, though. I don’t think the open-source one had a prompting option, so I’m guessing it really is a different model than the open-source one. Personally, I like the prompting version, since you can control or influence the output with your prompt.

Oh no… I’m dumb… I’m actually using Node.js for my WhatsApp script… (all my other tools/apps are Python).

Guess it’s the same issue, though.

const fs = require('fs');
const axios = require('axios');
const FormData = require('form-data');
const ffmpeg = require('fluent-ffmpeg');

async function transcribeAudio(filename) {
  const mp3Filename = 'received_audio.mp3';

  // convert the incoming Opus file to mp3 with ffmpeg
  await new Promise((resolve, reject) => {
    ffmpeg()
      .input(filename)
      .output(mp3Filename)
      .on('error', (err) => {
        console.error('Ffmpeg error:', err);
        reject(err);
      })
      .on('end', resolve)
      .run();
  });

  // build the multipart request for the transcription endpoint
  const formData = new FormData();
  formData.append('file', fs.createReadStream(mp3Filename));
  formData.append('model', 'whisper-1');

  const response = await axios.post(
    'https://api.openai.com/v1/audio/transcriptions',
    formData,
    {
      headers: {
        'Content-Type': `multipart/form-data; boundary=${formData.getBoundary()}`,
        'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
      },
    }
  );

  return response.data.text;
}

Add this to your form data:

formData.append("prompt", "This is an English sentence used to place me in an english space. Here is the upcoming audio, spoken by english reader George Washington")

You may also want to consider using the OpenAI library to handle the heavy lifting. Here’s an example using it:

const fs = require("fs");
const { Configuration, OpenAIApi } = require("openai");

const configuration = new Configuration({
  apiKey: process.env.OPENAI_API_KEY,
});
const openai = new OpenAIApi(configuration);

// createTranscription(file, model, prompt?): the optional third argument
// is the prompt, which you can use to steer the output language
const resp = await openai.createTranscription(
  fs.createReadStream("audio.mp3"),
  "whisper-1"
);
// resp.data.text contains the transcription

Thanks. I can do this, but how will this affect things? For example: my mother-in-law is not English. She sends me voice messages that aren’t in English, which I then want to transcribe and translate to English. Will it still work? I.e., will it detect other languages?

Will it also change any of my words and accent in the text? For example, I may say “I’m going ta shop” instead of “I’m going to the shop”.

Sorry if these seem like stupid questions. I only got into transcribing a couple of days ago and this is all new to me, and of course Whisper works a little differently, AFAIK.

Also, @curt.kennedy, I’m using the API and getting the same issue.

@acoloss

I think you have good questions, and I think the best answer is “try it out and find what works best”. And don’t forget to post any progress!

It’s supposed to, and in most cases does, catch the language. There are going to be some hits and misses, though, like now. As said above, I think the solution would be to have a separate service try to identify the language (Google offers such services), just for the small cases like this.
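A minimal sketch of that fallback, using the langdetect package as a stand-in for whichever detection service you pick (the function name and the retry prompt are made up for illustration):

import requests
from langdetect import detect  # stand-in for any language-detection service

def transcribe_with_check(path, headers, expected='en'):
    def call(prompt=None):
        files = {
            'file': open(path, 'rb'),
            'model': (None, 'whisper-1'),
        }
        if prompt:
            files['prompt'] = (None, prompt)
        r = requests.post('https://api.openai.com/v1/audio/transcriptions',
                          headers=headers, files=files)
        return r.json()['text']

    text = call()
    # if the transcript comes back in an unexpected language (e.g. Welsh),
    # retry once with a prompt in the expected language
    if detect(text) != expected:
        text = call('The following is an English voice message.')
    return text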
