How can I hint the language used for text-to-speech (TTS) in GPTs?

I’m developing a bilingual translator GPT for Chinese and Indonesian. It mainly relies on the ChatGPT app’s Voice and TTS features. When I say something in Chinese, it is translated into Indonesian, and the ChatGPT app should then say those words in Indonesian. Occasionally, though, the app says the words in English instead. Is it possible to hint to the TTS feature in the ChatGPT app that it should speak Indonesian when the text is actually Indonesian?


Hey there!

This is an interesting question on a few fronts, so let’s roll up our sleeves and see what is going on.

So, when you tested this out, do you have the transcripts? AKA the text version of the conversation? Are there prompt/response pairs you could provide us as-is?

TTS is very much what it says on the box: it converts text to speech. To a lot of users, the voice feature seems like it’s doing more than it is, but all it’s doing is converting your speech to plaintext, using that text as a prompt, and then converting the text response back to speech. GPT does not “hear”; it does not use any audio data as part of any inquiry. To the AI, it is all text, no different from a standard GPT-4 dialogue.
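To make that pipeline concrete, here is a minimal sketch of a single voice turn as described above. Everything in it is hypothetical: `voice_turn` and the three `fake_*` stand-ins are illustrative names, not the ChatGPT app’s actual implementation or any real API. The only point is that the model step receives and returns plain text, so the text transcript is the place to check which language the reply was written in.

```python
from typing import Callable, Tuple


def voice_turn(
    audio_in: bytes,
    speech_to_text: Callable[[bytes], str],
    generate_reply: Callable[[str], str],
    text_to_speech: Callable[[str], bytes],
) -> Tuple[str, str, bytes]:
    """Run one voice turn: audio -> text -> model -> text -> audio.

    The three callables are placeholders (assumptions) for whatever
    services actually perform each step.
    """
    prompt_text = speech_to_text(audio_in)    # the model never sees audio_in
    reply_text = generate_reply(prompt_text)  # text in, text out
    audio_out = text_to_speech(reply_text)    # speech is produced from text only
    return prompt_text, reply_text, audio_out


# Stand-in implementations so the sketch runs without any external service.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_model(prompt: str) -> str:
    return f"[reply to] {prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")
```

Calling `voice_turn("你好".encode("utf-8"), fake_stt, fake_model, fake_tts)` returns the prompt text, the reply text, and the synthesized bytes. If the reply text in a real transcript is already in English, the problem is in the model step rather than in TTS.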

Which makes me curious about how it gave you English based off what you described so far. If the text is Indonesian, it should be speaking in Indonesian. However, it might be confused because of the Latin alphabet, and the fact that there’s likely a lot more training data from other languages using Latin characters, throwing it off. Also, Indonesian as a language (and the culture of Indonesia itself) is very different from other languages. There’s a lot more complexity there than meets the eye.

If I’m understanding this correctly, you’re saying that it pronounces things in an English way, correct? If it is “speaking” the English language entirely or using English words, then that is a different problem entirely, because the response text would be provided in English already before that text is converted to speech. The transcript should align with that. Now, if the TTS is just pronouncing Indonesian text wrong, that’s a trickier problem to solve.

I am not Indonesian, and I did not specialize in languages of that region when I studied linguistics, so you will have to bring your own expertise to craft workarounds based on my suggestion. That said, have you considered forcing it to respond using the Jawi alphabet?

It’ll probably require some tinkering, but based off what we know about OpenAI’s TTS and its biases, you should be able to leverage the Jawi alphabet to your advantage here, preventing it from accidentally mistaking Indonesian for a western language. It will be up to you to keep an eye on the nuances and any notable differences that come from making that switch.

I also have a use case for passing a language hint to the text-to-speech API. If the input is something like a single word or a number in decimal notation, I might be looking for output in a specific language, even if the input could be interpreted as English.

In my case, I want TTS to speak Japanese words, but some Japanese words (ones that can be read as either Japanese or Chinese) were spoken in Chinese.

I found that adding “(日本語: )” before the word makes TTS speak it in Japanese.

In most cases, it works for me.

You may want to try something like “(Ind:)…”
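As a rough sketch of that prefix trick, a tiny helper can prepend the hint before the text is handed to TTS. The name `add_language_hint` is made up for illustration and is not part of any TTS API. Whether a given prefix is honored, or read aloud, is something you would have to test yourself.

```python
def add_language_hint(text: str, prefix: str) -> str:
    """Return text with a language-hint prefix in front of it.

    The prefix is passed through verbatim, e.g. "(日本語: )" or "(Ind:)".
    This only builds the string; it makes no claim about how TTS treats it.
    """
    return f"{prefix}{text}"


# Example inputs for the two cases mentioned in this thread.
japanese_input = add_language_hint("人気", "(日本語: )")
indonesian_input = add_language_hint("Selamat pagi", "(Ind:)")
```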
