Ahh, I knew it’d only be a matter of time before issues like this sprang up with language models.
This is a prime example of how the nuances of different languages can expose weaknesses in these models’ capabilities.
Now, were those your exact prompts? Or did you have other particular ones? Any chance you’d be willing to show more details for the forum?
My guess: it’s the speech-to-text causing the issue.
In my best attempt here not to touch on sensitive cultural subjects (and to help folks learn): Hindi and Urdu, phonetically speaking, are largely mutually intelligible when spoken. They have enough similarities that some linguists would argue they are closer to being varieties (dialects) of the same language than distinct languages. Spanish and Italian follow a similar pattern, and, like speakers of those languages, people from both cultures would staunchly disagree with that framing.
What makes the two languages clearly distinct from each other is their writing systems: Hindi is written in Devanagari, while Urdu uses a Perso-Arabic (Nastaliq) script.
This puts a speech-to-text model in an interesting predicament: if the input doesn’t carry enough signal to distinguish between the two, which one should it choose? Is there even enough training data to identify the dialectal differences at all yet?
Sadly, we don’t really have a way to answer that factually yet.
The only solution, which is admittedly more of a band-aid, is to prime the model with custom or system instructions that specify output in one script or the other. Either that, or, if you’re using something like Whisper, add another “pass” that hands the text result to GPT to convert it into the intended script.
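To make that concrete, here’s a minimal sketch of the two-pass idea using the OpenAI Python SDK. The file name, the model names (`whisper-1`, `gpt-4o`), and the prompt text are just placeholders for whatever you’re actually working with, and the `language`/`prompt` hints only bias Whisper toward a script rather than guarantee it:

```python
from openai import OpenAI

client = OpenAI()

# Pass 1: transcribe, nudging Whisper toward the script you want.
# ("clip.mp3" and the Devanagari prompt are placeholder examples.)
with open("clip.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        language="hi",  # or "ur" if you want Urdu
        prompt="यह हिंदी में देवनागरी लिपि का एक उदाहरण है।",  # bias toward Devanagari
    )

# Pass 2: hand the raw transcript to a chat model primed with a system
# instruction that pins down the intended script.
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": (
                "Rewrite the user's text in Devanagari script (Hindi), "
                "preserving the wording and meaning exactly."
            ),
        },
        {"role": "user", "content": transcript.text},
    ],
)

print(completion.choices[0].message.content)
```

The same system-instruction trick works on its own if you’re already piping transcripts through a chat model anyway; the second pass just makes the script choice explicit instead of leaving it up to the STT model.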
It is also technically possible to fine-tune STT models to accommodate this; however, the issue with this approach is that you would need a good amount of training data pairing speech audio with the proper transcripts. That is much harder to acquire independently than plain text or images, which is why you don’t see too many people fine-tuning these kinds of models. So, your best bet is to go with the earlier suggestions.