Are instructions available for TTS-1-HD? I don’t see them in the playground or the API. Without instructions, this “high quality” model is useless. In particular, what it generates in languages other than English seems random and inadequate.
tts-1 and tts-1-hd are the 1st-gen TTS models, and indeed they have the limitation that you can’t choose the language or pass instructions. On the other hand, these models provide a more stable voice style (tone, pace, etc.).
The new model gpt-4o-mini-tts introduces instructions, which provide great flexibility and control, but it can be less stable at keeping a consistent style (it may sound like a different person between generations). For me, these were great improvements.
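A minimal sketch of the API difference, assuming the official `openai` Python SDK and an `OPENAI_API_KEY` in the environment; the voice and the instructions string are just illustrative choices:

```python
from openai import OpenAI

client = OpenAI()

# tts-1 / tts-1-hd: input text only, no way to pass style or language instructions.
speech = client.audio.speech.create(
    model="tts-1-hd",
    voice="alloy",
    input="Hola, ¿cómo estás?",  # language is inferred from the text itself
)
speech.write_to_file("tts1_hd.mp3")

# gpt-4o-mini-tts: accepts an instructions field for style and language control.
speech = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="alloy",
    input="Hola, ¿cómo estás?",
    instructions="Speak in neutral Latin American Spanish, calm and slow.",
)
speech.write_to_file("gpt4o_mini_tts.mp3")
```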
Each one has a trade-off. You may have to evaluate whether this is acceptable for your needs; if not, there are other TTS providers on the market as well, usually at a higher cost though.
I’m working with 4.1 to code a TTS console that demos whichever chat the user wants to render. I approached this same question with the aim of learning how to code, with tts-1, a sense of delivery that supports the meaning of the script. The way to do that is with code that sets up the timing, the pauses, and any filler wording injected into a statement for flow anchoring (see the official ChatGPT app’s 4o Cove in standard voice for the gold standard of what I mean).

That part is handled by an orchestrator function: it takes the text, makes the timing decisions, and queues it to the tts-1 model with the intended pacing and edits to the script, adding a well-placed “uh” where needed, like Cove does. (The transcripts from chatting in that mode come without the “uh”; the code injects it into the original response as part of rendering the TTS according to meaning.) Meaning-serving timing, nuance, prosody, and similar instructions are done as post-processing, that is, as orchestrator edits to the incoming script plus the timings. You get full control of that by coding the orchestrator and any effects or custom sound engine yourself; a rough sketch of the idea follows below.

I think the official standard voice does both of these things, built into the read-aloud function that turns the text into the final vocalization, where, say, a sound plays because of an emoji. All of that is the code that orchestrates the TTS calls.
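Here is a rough sketch of that orchestrator idea, not the official implementation. It assumes the `openai` SDK plus `pydub` (with ffmpeg installed) for audio assembly; the segmenting, pause durations, and filler logic are stand-ins for real meaning-driven decisions:

```python
import io
import random
import re

from openai import OpenAI
from pydub import AudioSegment

client = OpenAI()

def plan_delivery(text: str) -> list[tuple[str, int]]:
    """Split the script into segments and decide pacing.

    Returns (segment_text, trailing_pause_ms) pairs. A real orchestrator
    would derive these from meaning; this stub splits on sentences and
    occasionally injects an "uh" as a flow anchor.
    """
    plan = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if not sentence:
            continue
        if random.random() < 0.2:  # occasional filler injection
            sentence = "uh, " + sentence[0].lower() + sentence[1:]
        plan.append((sentence, random.choice([150, 300, 500])))
    return plan

def render(text: str, out_path: str = "rendered.mp3") -> None:
    """Queue each planned segment to tts-1 and stitch in the pauses."""
    final = AudioSegment.silent(duration=0)
    for segment, pause_ms in plan_delivery(text):
        speech = client.audio.speech.create(
            model="tts-1", voice="alloy", input=segment
        )
        final += AudioSegment.from_file(io.BytesIO(speech.content), format="mp3")
        final += AudioSegment.silent(duration=pause_ms)
    final.export(out_path, format="mp3")

render("The refund was approved. You should see it within five days.")
```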
It is sometimes useful to put commands within the input text itself; with TTS-1 they can be followed, though only a little, before they are simply recited aloud.
For example, here, the forum AI’s “translate selection to English” feature had its purpose co-opted by my own instructions, set off distinctly within the text, to include paragraph markers.
The best technique I’ve found is “stage directions” in square brackets.
[Tone: angry]
I don’t want to get a store credit - I want a refund!
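For the record, a hedged sketch of that stage-direction technique with the `openai` SDK; note that tts-1 follows the bracketed line only loosely and may read it aloud, while gpt-4o-mini-tts can take the same direction via `instructions` instead:

```python
from openai import OpenAI

client = OpenAI()

# Stage direction embedded directly in the input text.
script = "[Tone: angry]\nI don't want to get a store credit - I want a refund!"

speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=script,
)
speech.write_to_file("angry_refund.mp3")
```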
Otherwise, you can have an AI language model modify the input text so that it is full of, like..um..(how you say)… spoken stop words. Not so much an “orchestrator” as just a pre-processor given instructions.
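A minimal pre-processor sketch along those lines, again assuming the `openai` SDK; the model name and prompt wording are illustrative, not a recommended recipe:

```python
from openai import OpenAI

client = OpenAI()

def add_disfluencies(text: str) -> str:
    """Ask a chat model to rewrite the text with sparing spoken fillers."""
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "Rewrite the user's text as natural speech, inserting "
                    "sparing fillers like 'um', 'uh', and 'you know'. "
                    "Preserve the meaning; return only the rewritten text."
                ),
            },
            {"role": "user", "content": text},
        ],
    )
    return result.choices[0].message.content

# Pre-process, then hand the modified script to the TTS model as usual.
spoken = add_disfluencies("I don't want a store credit. I want a refund.")
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=spoken)
speech.write_to_file("preprocessed.mp3")
```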