Hey there and welcome!
I haven’t tried the new TTS yet, but last time I played around with IPA, the live API worked quite well. In fact, for a moment I was able to get it to read back the IPA it had produced, and it immediately adjusted itself to sound less “American” when speaking other languages.
Listening to the clip while looking at your script, attempt #2 (0:13-0:18) is actually pretty close to what’s written. I think it fumbled a bit on the 'tlə, but the more I dug into this, the more confused I got about what you’re trying to get it to say. Can you paste the prompt (the IPA) in this forum directly so we can test it? What is your target? Do you have a speech clip or a text with a particular variety (dialect) in mind?
I may be wrong here, but I don’t think your IPA syntax is right. I can’t lie, it kind of looks like nonsense. It’s not just about using “phonetic symbols”; the language models do better when you express it in narrow transcription. In fact, I think the model even defaults to reading it that way (makes sense, since the only people providing such training data are linguists using narrow transcription). If I can’t figure out what the IPA is trying to say, I doubt the model can. Things like 'tlə aren’t really anything. Also, since you have r instead of ɹ in your transcription, I’m gonna take a wild guess and say you mixed “phonetic symbols” (which, as you might’ve guessed, is what IPA is) with regular letters, am I correct? I think that’s why the model is glitching out. In IPA, that plain upright r is a trill (the rolled Spanish ‘rr’, where the tip of the tongue vibrates against the ridge behind your top teeth). English ‘r’s are transcribed as ɹ. It’s weird that the model didn’t make the trilled sound, but we can test that later if that’s what you actually wanted. To me, that’s a giveaway that this isn’t properly transcribed IPA.
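If you want to check for yourself which characters are actually in your prompt, here is a rough Python sketch. The function name `describe` and the sample strings are just my own placeholders, not anything from your script. It only relies on the standard `unicodedata` module to print each character's code point and official Unicode name, which makes it easy to spot a plain Latin letter or an ASCII apostrophe sitting where an IPA symbol was intended.

```python
import unicodedata


def describe(transcription):
    """Return (character, code point, Unicode name) for every character."""
    rows = []
    for ch in transcription:
        # Fall back to a placeholder for characters without a Unicode name.
        name = unicodedata.name(ch, "UNKNOWN")
        rows.append((ch, f"U+{ord(ch):04X}", name))
    return rows


if __name__ == "__main__":
    # Hypothetical sample: plain r next to turned r, and an ASCII apostrophe
    # next to the IPA primary stress mark.
    for ch, code, name in describe("rɹ'ˈ"):
        print(ch, code, name)
```

Running this on the exact string you sent to the model should show quickly whether the symbols are the ones you meant to type.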
When transcribing to IPA, many of the letters and symbols found in Indo-European languages are assigned specific sounds. Combinations of symbols can also stand for specific sounds. It’s not a “mix-and-match” selection process. Either something is transcribed following IPA rules or it isn’t. It’s also something that takes practice, and it’s usually what people spend a lot of time studying in linguistics, so it’s okay that you didn’t know all this if you aren’t studying the subject. What is likely happening is that the model sees IPA symbols and begins to read every letter as if it were part of an IPA transcription (which is how linguists do it too, btw).
This could still very well be a model problem, but there may also be encoding / decoding problems with the IPA characters themselves, which could contribute to the glitchiness. Another option, though more technical, would be to feed the model something like X-SAMPA directly (an ASCII-only way of writing IPA). I did hear recently that Meta, for example, is using this method as data for training something they’re working on. It came up as an afterthought in an article about them hiring people to produce training data that will use X-SAMPA. Unfortunately, you would likely need to build a makeshift converter for this yourself, but it shouldn’t be too complicated.
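To give an idea of what such a makeshift program could look like, here is a minimal Python sketch that converts X-SAMPA to IPA. The table is only a small subset of the X-SAMPA standard, and the names `XSAMPA_TO_IPA` and `xsampa_to_ipa` are my own. Anything not in the table is passed through unchanged, which is only safe for symbols that are written the same way in both systems, so treat it as a starting point rather than a complete converter.

```python
# Partial X-SAMPA -> IPA table (illustrative subset, not the full standard).
XSAMPA_TO_IPA = {
    "r\\": "ɹ",  # English-style r
    "4": "ɾ",    # tap
    "@": "ə",
    "{": "æ",
    "A": "ɑ",
    "E": "ɛ",
    "I": "ɪ",
    "O": "ɔ",
    "U": "ʊ",
    "V": "ʌ",
    "S": "ʃ",
    "Z": "ʒ",
    "T": "θ",
    "D": "ð",
    "N": "ŋ",
    "?": "ʔ",
    ":": "ː",    # length mark
    '"': "ˈ",    # primary stress
    "%": "ˌ",    # secondary stress
}

# Try longer keys first so that "r\" wins over a bare "r".
_KEYS = sorted(XSAMPA_TO_IPA, key=len, reverse=True)


def xsampa_to_ipa(text):
    """Convert an X-SAMPA string to IPA using greedy longest-match lookup."""
    out = []
    i = 0
    while i < len(text):
        for key in _KEYS:
            if text.startswith(key, i):
                out.append(XSAMPA_TO_IPA[key])
                i += len(key)
                break
        else:
            # Not in the table: keep the character as-is (assumption: it is
            # one of the symbols shared by X-SAMPA and IPA, like p, t, k).
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(xsampa_to_ipa('r\\Ed'))
    print(xsampa_to_ipa('"wO:t@'))
```

Going the other direction (IPA to X-SAMPA) would just mean inverting the table. Whether a given model reads X-SAMPA any better than IPA is something we’d have to test.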
For now though, post the target speech you’re trying to imitate, and I’ll help transcribe it for you if you’d like. You can then feed it back to the model and we can see how well it does after that.
tl;dr I think this is a transcription problem, not a hallucination problem. I had to check to make sure I wasn’t hallucinating what I was reading lol.