The reason is because let’s say I give it numbers in roman numeral (“1, 2, 3, 4”) and I want it to output in Chinese, by default it’s going to output in English.
In the current state, the TTS models cannot read Roman numerals in Chinese. You might be able to get the models to perform somewhat better by translating/transliterating the numerals into their respective Mandarin values or pronunciations.
Another, better, and more expensive way is to use the gpt-4o-audio-preview to generate the spoken tracks for your text, as it can be prompted to read the text how you want it to with much more human-like voices.