TTS is unpredictable and often really wrong for non-English requests

If you make the same TTS request a few times in a row, you get a different response each time.

Run this a few times and listen to the variations (add your own token):

curl '' \
  -H 'authority:' \
  -H 'accept: */*' \
  -H 'authorization: Bearer your-token' \
  -H 'content-type: application/json' \
  --data-raw '{"model":"tts-1-hd","input":"Me gustaría ir al cine","voice":"alloy"}' \
  --output test.mp3

The first one usually seems to be correct. The second has an odd inflection, and the third is total gibberish.
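To compare the runs side by side instead of overwriting test.mp3 each time, something like the sketch below could help. It is only a rough outline: `TTS_URL` and `TTS_TOKEN` are placeholder environment variables (the endpoint is not shown in the command above, so it is not filled in here), and `tts_request` and `collect_samples` are hypothetical helper names.

```shell
#!/bin/sh
# One TTS request, written to the file given as the first argument.
# TTS_URL and TTS_TOKEN are placeholders you must set yourself.
tts_request() {
  curl --silent --show-error --fail "$TTS_URL" \
    -H "authorization: Bearer $TTS_TOKEN" \
    -H 'content-type: application/json' \
    --data-raw '{"model":"tts-1-hd","input":"Me gustaría ir al cine","voice":"alloy"}' \
    --output "$1"
}

# Run a request command COUNT times, saving each take as sample_N.mp3.
# Usage: collect_samples COUNT command [args...]
collect_samples() {
  count=$1
  shift
  i=1
  while [ "$i" -le "$count" ]; do
    "$@" "sample_$i.mp3" || return 1
    i=$((i + 1))
  done
}

# Example (makes real requests, so it is left commented out):
# collect_samples 3 tts_request
```

Each take ends up in its own file (sample_1.mp3, sample_2.mp3, ...), so a bad one can be kept and replayed rather than lost on the next run.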

That might suggest there’s some order to it, but in my app it’s total chaos, and I have no idea why. Simple sentences like “Dov’è il bagno” will return nonsense one time and then be perfect the next.

Does anyone have any idea? I’ve seen posts about the speech being wrong for non-English text, but for me it can be correct; you just never know what you’re going to get.


There’s currently no control over aspects of the model such as seed or sampling (how output sequences are selected), so variation between runs is expected.

The benefit is that you aren’t locked to a sentence that can never be pronounced correctly.

The runs should be stateless, with each call independent of the others.

You can try adding a baseline first sentence that clearly and simply establishes the language, even one that describes what will follow.
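A minimal sketch of what that could look like, assuming the baseline sentence simply goes at the start of the same `input` string. `build_payload` is a hypothetical helper, the Spanish primer is just an example, and whether this actually steadies the output is untested here. Note that the primer will be spoken in the audio too.

```shell
#!/bin/sh
# Build the JSON body with a priming sentence placed before the real text.
# Assumption: neither argument contains double quotes or backslashes,
# so no JSON escaping is attempted.
build_payload() {
  primer=$1
  text=$2
  printf '{"model":"tts-1-hd","input":"%s %s","voice":"alloy"}\n' "$primer" "$text"
}

# Print the body that would be passed to curl via --data-raw.
build_payload "Ahora voy a hablar en español." "Me gustaría ir al cine"
```

The printed JSON can be passed as the `--data-raw` argument of the original request, for example with `--data-raw "$(build_payload "..." "...")"`.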


Yes, stateless makes sense. I guess I was imagining a pattern.

But then how would a baseline sentence work? It’d be in the same request?

I understand what you’re saying about the benefit, but getting back gibberish 1 out of 3 times is not viable for my product :confused: Oh how I wish I could just pass a language code :joy: !

That’s probably because of characters like '. You should find a way to get rid of them.

@mdyildirim Thanks for the suggestion, but that’s definitely not it. I’ve debugged for days, and there is just no consistency. It will be perfect one time and completely wrong the next, with the exact same request.

Hmm. I thought the special characters in your example (Dov’è il bagno) caused the issue. But what you’re saying worries me for the product I’m building.

Adding to the discussion: as of May 2024, it still randomly produces gibberish, making it completely unreliable for TTS translations in real-life scenarios.