API confuses Hindi with Urdu

Steps to reproduce:

(1) Perform a normal translation of English text to Hindi. The result is a successful translation – the Hindi is written in the Devanagari script.

(2) Use the above translation output as text-to-speech input.

(3) Use the above speech output as speech-to-text input. The text output is Urdu, not Hindi. (A sketch of the full pipeline follows below.)
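For reference, a minimal sketch of the round trip, assuming the OpenAI Python SDK. The model names (gpt-4o, tts-1, whisper-1) and the sample sentence are my assumptions, since the original report does not say exactly which models or prompts were used.

```python
# Hypothetical reproduction of the round trip described above.
# Model names and the sample sentence are assumptions, not from the original report.
from openai import OpenAI

client = OpenAI()

# (1) Translate English -> Hindi (expected to come back in Devanagari script)
hindi_text = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": "Translate this to Hindi: The weather is very nice today.",
    }],
).choices[0].message.content

# (2) Feed the Hindi text to text-to-speech
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=hindi_text)
with open("hindi.mp3", "wb") as f:
    f.write(speech.read())

# (3) Feed the generated audio back into speech-to-text
with open("hindi.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

print(transcript.text)  # reportedly comes back in Urdu (Perso-Arabic script), not Devanagari
```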

Hindi and Urdu are closely related: see the "Hindi–Urdu controversy" article on Wikipedia.

Edit: It's hard to tell which model causes the problem - the TTS or the STT?

Ahh, I knew it'd only be a matter of time before issues like this sprang up with language models.

This is a prime example of how the nuances of different languages can expose weaknesses in these models' capabilities.

Now, were those your exact prompts, or did you have other particular ones? Any chance you'd be willing to share more details with the forum?

My guess: it’s the speech-to-text causing the issue.

In my best attempt here not to touch on sensitive cultural subjects (and to help folks learn): Hindi and Urdu are largely mutually intelligible when spoken. They have enough similarities that some linguists would argue they are closer to varieties (dialects) of the same language than to distinct languages. Spanish and Italian are sometimes compared in a similar way, and, as with those languages, people from those cultures would staunchly disagree.

What makes the two languages distinct from each other are their writing systems.

This puts a speech-to-text model in an interesting predicament: if there is not enough information in the input to distinguish between the two, which one should it choose? Is there even enough training data to identify the dialectal differences at all yet?

Sadly, we don’t really have a way to answer that factually yet.

The only solution, which is admittedly more of a band-aid, is to prime the model with custom or system instructions that specify output in one script or the other. Either that, or, if you're using something like Whisper, you can add another "pass" that hands the text result to GPT to convert it into the intended script (a rough sketch follows below).
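To make that concrete, here is a rough sketch of both band-aids, assuming whisper-1 via the OpenAI SDK. The `language` and `prompt` parameters are real options on the transcription endpoint, but the file name, the prompt wording, and the second-pass system instruction are only illustrations.

```python
from openai import OpenAI

client = OpenAI()

# Band-aid 1: hint the expected language/script directly in the transcription call.
with open("hindi.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        language="hi",                     # ISO-639-1 hint: treat the speech as Hindi
        prompt="नमस्ते, यह हिंदी में है।",     # a Devanagari prompt can also bias the output script
    )

# Band-aid 2: a second "pass" that asks GPT to rewrite the result in the intended script.
fixed_text = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "Rewrite the user's text in Devanagari script. Change nothing else."},
        {"role": "user", "content": transcript.text},
    ],
).choices[0].message.content

print(fixed_text)
```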

It is also technically possible to fine-tune STT models to accommodate this; however, the issue with this approach is that you would need a good amount of training data pairing speech with proper transcripts. That is much harder to acquire independently than text and image data, which is why you don't see many people fine-tuning these kinds of models. So your best bet is to go with the earlier suggestions.

Yes, I agree that it's probably the STT that is causing the issue. One definitive test would be for someone who speaks both Hindi and Urdu to listen to the TTS output.

I did another test with the same text - this time with Urdu:

(1) Perform a normal translation of English text to Urdu. The result is a successful translation – the Urdu is written in the Nastaliq (Perso-Arabic) script.

(2) Use the above translation output as text-to-speech input.

(3) Use the above speech output as speech-to-text input. The text output is Urdu - no problem. (A quick script check, sketched below, makes this easy to verify.)
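For anyone else running this kind of round-trip test, a simple Unicode-range check is enough to tell which script the transcript came back in. This is only a heuristic sketch, not a proper language detector.

```python
# Heuristic check: count Devanagari vs. Perso-Arabic characters in a transcript.
def dominant_script(text: str) -> str:
    devanagari = sum("\u0900" <= ch <= "\u097F" for ch in text)    # Devanagari block (Hindi)
    perso_arabic = sum("\u0600" <= ch <= "\u06FF" for ch in text)  # main Arabic block (used by Urdu)
    if devanagari > perso_arabic:
        return "Devanagari (Hindi)"
    if perso_arabic > devanagari:
        return "Perso-Arabic (Urdu)"
    return "unclear"

print(dominant_script("नमस्ते"))  # -> Devanagari (Hindi)
print(dominant_script("سلام"))   # -> Perso-Arabic (Urdu)
```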

To be clear, we are not complaining about this. This stuff is new and complex and will only get better and better over time. The important thing is that devs should really “kick the tires” on these models and then make it a point to discuss issues on these forums.


I could not agree more!

Sadly, there just aren't that many English-speaking developers aware of these kinds of things. Linguistics is usually an afterthought, not the primary focus of model interaction.