Ahh, I knew it’d only be a matter of time before issues like this sprang up with language models.
This is a prime example of how the nuances of different languages can expose weaknesses in these models’ capabilities.
Now, were those your exact prompts? Or did you have other particular ones? Any chance you’d be willing to show more details for the forum?
My guess: it’s the speech-to-text causing the issue.
In my best attempt here not to touch on sensitive cultural subjects (and to help folks learn): Hindi and Urdu, phonetically speaking, are largely mutually intelligible when spoken. They have enough similarities that some linguists would argue they are closer to being varieties (dialects) of the same language than distinct languages. Spanish and Italian follow a similar pattern, and, like speakers of those languages, people from both cultures would staunchly disagree with that framing.
What makes the two languages clearly distinct from each other is their writing systems: Hindi is written in Devanagari, while Urdu uses a Perso-Arabic (Nastaliq) script.
This puts a speech-to-text model in an interesting predicament: if the input doesn’t carry enough signal to distinguish between the two, which one should it choose? Is there even enough training data to identify the dialectal differences at all yet?
Sadly, we don’t really have a way to answer that factually yet.
The only solution, which is admittedly more of a band-aid, is to prime the model with custom or system instructions that specify output in one script or the other. Either that, or, if you’re using something like Whisper, add another “pass” that hands the text result to GPT to convert it into the intended script.
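To make that concrete, here’s a minimal sketch of the two-pass idea using the OpenAI Python SDK. The file name, the model names (`whisper-1`, `gpt-4o`), and the prompt text are just placeholders for whatever you’re actually working with, and the `language`/`prompt` hints only bias Whisper toward a script rather than guarantee it:

```python
from openai import OpenAI

client = OpenAI()

# Pass 1: transcribe, nudging Whisper toward the script you want.
# ("clip.mp3" and the Devanagari prompt are placeholder examples.)
with open("clip.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        language="hi",  # or "ur" if you want Urdu
        prompt="यह हिंदी में देवनागरी लिपि का एक उदाहरण है।",  # bias toward Devanagari
    )

# Pass 2: hand the raw transcript to a chat model primed with a system
# instruction that pins down the intended script.
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": (
                "Rewrite the user's text in Devanagari script (Hindi), "
                "preserving the wording and meaning exactly."
            ),
        },
        {"role": "user", "content": transcript.text},
    ],
)

print(completion.choices[0].message.content)
```

The same system-instruction trick works on its own if you’re already piping transcripts through a chat model anyway; the second pass just makes the script choice explicit instead of leaving it up to the STT model.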
It is also technically possible to fine-tune STT models to accommodate this; however, the issue with this approach is that you would need a good amount of training data pairing speech audio with the proper transcripts. That is much harder to acquire independently than plain text or images, which is why you don’t see too many people fine-tuning these kinds of models. So, your best bet is to go with the earlier suggestions.