I convert from Opus (from WhatsApp) to MP3 and send it to Whisper.
Here is a transcription:
Transcription: Helo John, rwy’n gobeithio y gallech chi fy helpu. Rwy’n meddwl o adeiladu cymheithas ar gyfer yr apokalips lle byddwn yn cael ymweld â 50 o bobl ac mae angen cychynnau. Rydyn ni eisiau cael cychynnau sy’n digon iawn ar gyfer un o gynnyrch ar y diwrnod ac hefyd rydyn ni eisiau cymryd rhai a byddwn yn angen eu hymweld â nhw a gwneud yn siŵr ein bod yn gael cyfnod iawn. Gallwch chi fy helpu gyda hynny a ddangos i mi y cyfnodau?
I did two tests. On the first, one of the speakers had an Indian accent. On the second, none of the speakers had a noticeable accent. Both resulted in Welsh.
OK, I can reproduce the issue, and I get this as the transcription.
{'text': "Helo John, rwy’n gobeithio y gallech chi fy helpu. Rwy’n meddwl o adeiladu cymheithas ar gyfer yr apokalips lle byddwn yn cael ymweld â 50 o bobl ac mae angen cychynnau. Rydyn ni eisiau cael cychynnau sy’n digon iawn ar gyfer un o gynnyrch ar y diwrnod ac hefyd rydyn ni eisiau cymryd rhai a byddwn yn angen eu hymweld â nhw a gwneud yn siŵr ein bod yn gael cyfnod iawn. Gallwch chi fy helpu gyda hynny a ddangos i mi y cyfnodau?"}
The problem is likely the OGG → MP3 conversion, which I cannot reproduce from your data.
Is there any way you can say the exact same thing into your microphone, record it directly to a .wav file, and post that .wav file?
I think this is a lossy → lossy conversion issue, not an accent issue, since these conversions can wreak unseen havoc in the spectrum that the model sees.
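If you want to rule out the extra lossy hop entirely, one option is to convert the WhatsApp Opus file straight to 16 kHz mono WAV with ffmpeg. A minimal sketch, assuming ffmpeg is on your PATH and the input is named received_audio.opus (file names are placeholders):

import subprocess

# Decode the Opus file once, straight to 16 kHz mono WAV (the rate Whisper resamples to),
# instead of going Opus -> MP3 -> Whisper.
subprocess.run(
    ["ffmpeg", "-y", "-i", "received_audio.opus", "-ar", "16000", "-ac", "1", "received_audio.wav"],
    check=True,
)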
I am running Python, my rogue "direct" version (not the SDK), and when I prompt it with both the MP3 and WAV versions, I get it to work. See:
import requests

headers = {'Authorization': f'Bearer {OPENAI_API_KEY}'}  # your API key

files = {
    'file': open('/Users/curtkennedy/Downloads/received_audio.mp3', 'rb'),
    'prompt': (None, 'you are a fun loving british speaker. please transcribe this into english for me.'),
    'model': (None, 'whisper-1'),
}

response = requests.post('https://api.openai.com/v1/audio/transcriptions', headers=headers, files=files)
print(response.json())
# {'text': 'Hello John, I wondered if you could help me. I am thinking of building a compound for the apocalypse where we will have around 50 people and we need some chickens. We want to have enough chickens for one egg a day and also we want to eat some and we will need to breed them and make sure we always have enough. Can you help me with that and show me the numbers?'}
If the input is in another language, then prompt it in that language. Prompting is an often-overlooked feature in Whisper, but it seems powerful. You can correct common misspellings of people's names, etc.
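For example, something like this (reusing the headers and POST call from the snippet above; the French prompt text and the name it spells out are purely illustrative):

# Prompt in the language you expect the audio to be in, and spell out names
# Whisper tends to mangle. This plugs into the same requests.post call as above.
files = {
    'file': open('/Users/curtkennedy/Downloads/received_audio.mp3', 'rb'),
    'prompt': (None, "Bonjour, voici un message vocal de Jean-Pierre à propos des poules."),
    'model': (None, 'whisper-1'),
}
response = requests.post('https://api.openai.com/v1/audio/transcriptions', headers=headers, files=files)
print(response.json())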
I took your audio and spliced it with 15 seconds of a very prominent English speaker. I will upload it shortly.
Used Whisper:
>>> audio_file = open("out.mp3", "rb")
>>> transcript = openai.Audio.transcribe("whisper-1", audio_file)
>>> transcript
<OpenAIObject at 0x7f0f1062bec0> JSON: {
"text": "Warning do not recreate or reenact anything seen in this video I am NOT affiliated with any brand or product seen in this video and watch this video at your own risk. That's all. Thank you Hello, John. I wondered if you could help me. I'm thinking of building a compound for the apocalypse Where we will have around 50 people and we need some chickens. We want to have enough chickens"
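For anyone who wants to reproduce the splice, here is a sketch using pydub (which needs ffmpeg installed); warning.mp3 stands in for the 15-second English clip, and the file names are placeholders:

from pydub import AudioSegment

# Prepend ~15 seconds of clearly spoken English to the problem audio, then export
# the combined clip for Whisper. pydub slicing is in milliseconds.
english_lead_in = AudioSegment.from_file("warning.mp3")[:15000]  # first 15 s
problem_audio = AudioSegment.from_file("received_audio.mp3")
(english_lead_in + problem_audio).export("out.mp3", format="mp3")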
The prompt is the solution. Why it happens is still unknown.
Maybe. I had to ditch the SDK initially since it wouldn’t work for me.
Could be a wrapper bug. But try prompting it and running it directly (non-SDK); that works for me. Then go with the SDK if that’s your comfort level, and make sure to prompt in whatever language is needed.
Use other AI services to detect the language too if you are running blind at scale.
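One cheap check that doesn’t need a second service: ask the transcription endpoint for verbose_json, which also reports the language Whisper detected, and flag anything unexpected. A sketch, reusing the same headers as the direct call above:

# verbose_json returns extra fields alongside the text, including the detected
# language, so you can flag transcripts that didn't come back in the language
# you expected.
files = {
    'file': open('/Users/curtkennedy/Downloads/received_audio.mp3', 'rb'),
    'model': (None, 'whisper-1'),
    'response_format': (None, 'verbose_json'),
}
result = requests.post('https://api.openai.com/v1/audio/transcriptions', headers=headers, files=files).json()
print(result.get('language'), result.get('text'))  # e.g. flag anything that isn't English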
Yes, I think the easy fix is to prepend some very clear, unaccented English (such as the disclaimer above), which is what I believe the prompt effectively does without needing audio.
Yup. Using the same text as a prompt, with only 15 seconds of @acoloss’ audio, gives the same result as prepending the audio. I should have included the log probs; it would be nice to know why there’s a difference. I think that’s enough fun time for me today, though.
>>> audio_file = open("second.mp3", "rb")
>>> transcript = openai.Audio.transcribe("whisper-1", audio_file, prompt="Warning do not recreate or reenact anything seen in this video I am NOT affiliated with any brand or product seen in this video and watch this video at your own risk. That’s all.")
>>> transcript
<OpenAIObject at 0x7f59f2009b20> JSON: {
"text": "Hello John, I wondered if you could help me. I am thinking of building a compound for the apocalypse where we will have around 50 people and we need some chickens. We want to have enough chickens.
My guess is that the training data had more American-accented English than British-accented English in it. So the fix would be to train a new model, or an updated one.
Not sure why there is a delta between this OpenAI API version and the open-source one, but I don’t think the open-source release had a prompting option, so I’m guessing the API really is a different model. Personally, I like the prompting version since you can control or influence the output with your prompt.
formData.append("prompt", "This is an English sentence used to place me in an english space. Here is the upcoming audio, spoken by english reader George Washington")
You may also want to consider using the OpenAI library to handle the heavy lifting. Here’s an example using it:
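(A minimal sketch, assuming the legacy openai Python SDK used in the REPL snippets above; the file name, key placeholder, and prompt text are mine.)

import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

# Same transcription call as the REPL examples above, with a prompt to nudge
# Whisper toward English.
audio_file = open("received_audio.mp3", "rb")
transcript = openai.Audio.transcribe(
    "whisper-1",
    audio_file,
    prompt="Hello, this is a voice message spoken in clear British English.",
)
print(transcript["text"])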
Thanks. I can do this, but how will it affect, for example: my mother-in-law is not English. She sends me voice messages in a language other than English, which I then want to transcribe and translate to English. Will it still work? I.e., will it detect other languages?
Will it also change any of my words and accent in the text? For example, I may say “I’m going ta shop” instead of “I’m going to the shop”.
Sorry if these seem like stupid questions. I only got into transcribing a couple of days ago and this is all new to me, and of course I think Whisper works a little differently, AFAIK.
Also, @curt.kennedy, I’m using the API and have the same issue.
I think you have good questions, and the best answer is “try it out and find what works best”. And don’t forget to post any progress!
It’s supposed to, and in most cases it does catch the language. There will be some hits and misses, though, like now. As said above, I think the solution would be to have a separate service try to identify the language (Google offers such services), just for the rare cases like this.
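If a full speech service is overkill, even a lightweight text-level check on the transcript can catch these misfires. A sketch using the langdetect package (its code for Welsh is 'cy'; short texts can be flaky, so treat this as a heuristic):

from langdetect import detect  # pip install langdetect

def looks_like_english(transcript_text: str) -> bool:
    # Run language detection on the transcribed text itself; 'en' means English.
    return detect(transcript_text) == "en"

# The Welsh misfire from earlier in the thread would most likely fail this check,
# at which point you could retry the transcription with an English prompt.
print(looks_like_english("Helo John, rwy'n gobeithio y gallech chi fy helpu."))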