Whisper-1 joint translation and transcription

I would like to keep track of an English speech history regardless of the spoken language.

Is it possible to achieve joint translation and transcription with the current Whisper model API?

Thus far we have experimented with the following two approaches:

  1. any-language-to-English translations API + transcriptions API
  2. transcriptions API + GPT-4 Turbo completions API for any-language-to-English translation

It would be nice to have a unified way to achieve that; am I missing something with the API as-is?

I am not sure exactly how you intend to use the API, but here is what I did with my OpenAI API wrapper (a shell script):

0. Record voice input (or use any audio file)
1.1 Use the any-language-to-English translations API, or
1.2 Use the any-language-to-any-language transcriptions API
2.1 Submit the text to GPT-4 (or any other model), or
2.2 Use the transcription text itself for the next step
3. Submit the reply to the Text-To-Speech API
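For reference, the pipeline above could be sketched in Python with the openai client (the wrapper itself is a shell script, so this is only an illustrative translation of the steps; all model names are placeholder defaults, and the client is passed in as a parameter so the flow can be exercised without network access):

```python
def speech_to_reply(client, audio_file, *, translate=True, llm_model=None):
    """Steps 1-3 of the pipeline: audio -> English text -> LLM -> speech.

    `client` is an openai.OpenAI instance (or any object exposing the
    same audio/chat interface). Model names are illustrative defaults,
    not the wrapper's actual configuration.
    """
    # Step 1.1 or 1.2: get text from the audio
    if translate:
        text = client.audio.translations.create(
            model="whisper-1", file=audio_file, response_format="text")
    else:
        text = client.audio.transcriptions.create(
            model="whisper-1", file=audio_file, response_format="text")

    # Step 2.1 or 2.2: optionally pass the text through an LLM
    if llm_model is not None:
        completion = client.chat.completions.create(
            model=llm_model,
            messages=[{"role": "user", "content": text}])
        text = completion.choices[0].message.content

    # Step 3: synthesise the reply
    return client.audio.speech.create(
        model="tts-1", voice="alloy", input=text)
```

Injecting the client also makes it easy to swap the whole step 1 branch for a different speech backend later.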

For translation, we came across a very interesting use of the said transcriptions API, in which you can specify a target two-letter language code in the request. A text prompt (preferably in the same language as the audio) is optional.

With this type of request, you get the transcription in the target language, independently of the input language.
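A minimal sketch of such a request, assuming the openai Python client (the helper only assembles the keyword arguments, so the shape of the request is easy to inspect; the helper name itself is illustrative):

```python
def transcription_kwargs(target_lang, prompt=None):
    """Build kwargs for audio.transcriptions.create so the output comes
    back in `target_lang` (two-letter code), per the behaviour described
    above. `prompt` is the optional text prompt, preferably in the same
    language as the audio.
    """
    kwargs = {
        "model": "whisper-1",
        "response_format": "text",
        "language": target_lang,  # target two-letter language code
    }
    if prompt is not None:
        kwargs["prompt"] = prompt
    return kwargs

# Usage sketch:
# provider.audio.transcriptions.create(file=f, **transcription_kwargs("en"))
```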

Hi @jamilbio20,

Thanks for your feedback. What you have done is not too dissimilar from our current approach, but it assumes that the language model (or whatever you have in step 2) is able to process the information in any language. That may be true in most cases, especially with GPT-4 (though we have also been using GPT-3.5-turbo and open-source non-GPT models), but even then it may fail when the tasks are complex.

In our specific case, we have to keep a running semantic memory of the human-AI interaction and have the LLM perform rephrasing based on that memory, resolving implicit references and decoding the instruction into a sequence of subtasks, ending in a Python code generation step. Each step has its own system prompt, and 1) it would be unsustainable to duplicate each prompt in other languages, and 2) we have found that the LLM responds better when the system prompt and the user prompt are in the same language. This is why we chose to translate everything to English no matter what language is being spoken.

Since the API already has any-language-to-English translation and any-language-to-any-language transcription, I was wondering how large the effort would be to bridge the gap to automatic any-language-to-English transcription in a unified manner. As for the transcriptions API parameter you mention, which I guess is "language" in the documentation (https://platform.openai.com/docs/api-reference/audio/createTranscription): according to its description, it should be set to the language of the recording to improve accuracy and latency, not to the desired output language; am I missing something?


Indeed, that API endpoint is more versatile and works in more ways than those covered in the API documentation.

I believe that using the transcription endpoint with the language option set to the same language as the input probably optimises the transcription for that single language.

However, in my testing, the transcription endpoint always outputs in the language defined by the language option of the request.

If the transcription endpoint is called with the target language en, then it behaves like the translations endpoint, which converts voice audio to English text. I reckon that the translations endpoint, which takes fewer options in the API request, is already optimised for a better English translation.

I have also been testing an initial text prompt that differs from both the input language and the output transcription language, and the API seems to work as expected. However, I believe this may make the transcriptions somewhat less accurate: the translated transcription may differ a little from what can be achieved by more controlled means. I have not tested it rigorously, as it works fine for my needs.

Thanks for the insight!

I will run a couple of experiments with that parameter and report back.


I experimented with the following code:

import openai


provider = openai.OpenAI(...)


transcript = provider.audio.transcriptions.create(
    model="whisper-1",
    file=...,
    response_format="text",
    language="en",  # target language, per the behaviour discussed above
)

both with recordings in Italian and natively in English.

The performance is mixed: poor on short recordings (under three seconds) or recordings with difficult Italian words or bad spelling, and quite good on longer recordings of at least five seconds. In the latter case, the transcription to English was perfect 80% of the time, with occasional words or phrases misinterpreted by the model.

80% is not good enough for our purposes, since with a long enough recording we get near-perfect results with the transcriptions API + LLM translation combo, even using GPT-3.5-turbo.
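For comparison, the transcriptions API + LLM translation combo mentioned above could be sketched as follows (the system prompt wording is illustrative, not our actual prompt, and the client is injected so the flow can be tested offline):

```python
def transcribe_then_translate(client, audio_file, llm_model="gpt-3.5-turbo"):
    """Two-step fallback: native-language transcription, then an LLM
    translation pass. Prompt wording and default model are illustrative.
    """
    # Step 1: transcribe in the spoken language (no language override)
    native_text = client.audio.transcriptions.create(
        model="whisper-1", file=audio_file, response_format="text")

    # Step 2: translate to English with a chat completion
    completion = client.chat.completions.create(
        model=llm_model,
        messages=[
            {"role": "system",
             "content": "Translate the user's text to English. "
                        "Output only the translation."},
            {"role": "user", "content": native_text},
        ])
    return completion.choices[0].message.content
```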

Could someone from the OpenAI Whisper team clarify:

  1. why the language parameter behaves so differently from what the API docs describe?
  2. whether a joint translation and transcription request is achievable?