Are instructions available for TTS-1-HD? I don’t see them in the playground or the API. Without instructions, this “high quality” model is useless. In particular, what it generates in languages other than English seems random and inadequate.
tts-1 and tts-1-hd are the 1st-gen TTS models, and indeed they have the limitation that you can’t choose the language or pass instructions. On the other hand, these models provide a more stable voice style (tone, pace, etc.).
The new model gpt-4o-mini-tts introduces instructions, which provide great flexibility and control, but it can be less stable at keeping a consistent style (it may sound like a different person between generations). For me, these were great improvements.
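A minimal sketch of the API difference, assuming the official `openai` Python SDK and an `OPENAI_API_KEY` in the environment; the voice and the instructions string are just illustrative choices:

```python
from openai import OpenAI

client = OpenAI()

# tts-1 / tts-1-hd: input text only, no way to pass style or language instructions.
speech = client.audio.speech.create(
    model="tts-1-hd",
    voice="alloy",
    input="Hola, ¿cómo estás?",  # language is inferred from the text itself
)
speech.write_to_file("tts1_hd.mp3")

# gpt-4o-mini-tts: accepts an instructions field for style and language control.
speech = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="alloy",
    input="Hola, ¿cómo estás?",
    instructions="Speak in neutral Latin American Spanish, calm and slow.",
)
speech.write_to_file("gpt4o_mini_tts.mp3")
```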
Each one has a trade-off. You may have to evaluate whether this is acceptable for your needs; if not, there are other TTS providers on the market as well, usually at a higher cost though.
I’m working with 4.1 to code a TTS console that demos whichever chat the user wants to render. I approached this same question with the aim of learning how to code, with tts-1, a sense of delivery that supports the meaning of the script. The way to do that is with code that sets up the timing, the pauses, and any filler wording injected into a statement for flow anchoring (see the official ChatGPT app’s 4o Cove in standard voice for the gold standard of what I mean).

That part is handled by an orchestrator function: it takes the text, makes the timing decisions, and queues it to the tts-1 model with the intended pacing and edits to the script, adding a well-placed “uh” where needed, like Cove does. (The transcripts from chatting in that mode come without the “uh”; the code injects it into the original response as part of rendering the TTS according to meaning.) Meaning-serving timing, nuance, prosody, and similar instructions are done as post-processing, that is, as orchestrator edits to the incoming script plus the timings. You get full control of that by coding the orchestrator and any effects or custom sound engine yourself; a rough sketch of the idea follows below.

I think the official standard voice does both of these things, built into the read-aloud function that turns the text into the final vocalization, where, say, a sound plays because of an emoji. All of that is the code that orchestrates the TTS calls.
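Here is a rough sketch of that orchestrator idea, not the official implementation. It assumes the `openai` SDK plus `pydub` (with ffmpeg installed) for audio assembly; the segmenting, pause durations, and filler logic are stand-ins for real meaning-driven decisions:

```python
import io
import random
import re

from openai import OpenAI
from pydub import AudioSegment

client = OpenAI()

def plan_delivery(text: str) -> list[tuple[str, int]]:
    """Split the script into segments and decide pacing.

    Returns (segment_text, trailing_pause_ms) pairs. A real orchestrator
    would derive these from meaning; this stub splits on sentences and
    occasionally injects an "uh" as a flow anchor.
    """
    plan = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if not sentence:
            continue
        if random.random() < 0.2:  # occasional filler injection
            sentence = "uh, " + sentence[0].lower() + sentence[1:]
        plan.append((sentence, random.choice([150, 300, 500])))
    return plan

def render(text: str, out_path: str = "rendered.mp3") -> None:
    """Queue each planned segment to tts-1 and stitch in the pauses."""
    final = AudioSegment.silent(duration=0)
    for segment, pause_ms in plan_delivery(text):
        speech = client.audio.speech.create(
            model="tts-1", voice="alloy", input=segment
        )
        final += AudioSegment.from_file(io.BytesIO(speech.content), format="mp3")
        final += AudioSegment.silent(duration=pause_ms)
    final.export(out_path, format="mp3")

render("The refund was approved. You should see it within five days.")
```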
It is sometimes useful to put commands within the input text itself; with TTS-1 they can be followed, though only a little, before they are simply recited aloud.
For example, here, the forum AI’s “translate selection to English” feature had its purpose co-opted by my own instructions, set off distinctly within the text, to include paragraph markers.
The best technique I’ve found is “stage directions” in square brackets.
[Tone: angry]
I don’t want to get a store credit - I want a refund!
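For the record, a hedged sketch of that stage-direction technique with the `openai` SDK; note that tts-1 follows the bracketed line only loosely and may read it aloud, while gpt-4o-mini-tts can take the same direction via `instructions` instead:

```python
from openai import OpenAI

client = OpenAI()

# Stage direction embedded directly in the input text.
script = "[Tone: angry]\nI don't want to get a store credit - I want a refund!"

speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=script,
)
speech.write_to_file("angry_refund.mp3")
```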
Otherwise, you can have an AI language model modify the input text so that it is full of, like..um..(how you say)… spoken stop words. Not so much an “orchestrator” as just a pre-processor given instructions.
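A minimal pre-processor sketch along those lines, again assuming the `openai` SDK; the model name and prompt wording are illustrative, not a recommended recipe:

```python
from openai import OpenAI

client = OpenAI()

def add_disfluencies(text: str) -> str:
    """Ask a chat model to rewrite the text with sparing spoken fillers."""
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "Rewrite the user's text as natural speech, inserting "
                    "sparing fillers like 'um', 'uh', and 'you know'. "
                    "Preserve the meaning; return only the rewritten text."
                ),
            },
            {"role": "user", "content": text},
        ],
    )
    return result.choices[0].message.content

# Pre-process, then hand the modified script to the TTS model as usual.
spoken = add_disfluencies("I don't want a store credit. I want a refund.")
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=spoken)
speech.write_to_file("preprocessed.mp3")
```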