Subject: GPT-4o-mini-tts Issues: Volume Fluctuations, Silence, Repetition, Distortion
I’ve been extensively testing OpenAI’s GPT-4o-mini-tts voices for my service, Listen Later, which converts written articles into narrated podcasts. While generally impressed, I’ve observed several noticeable regressions compared to the original TTS model:
1. Volume fluctuations affecting all new voices:
Every new voice introduced in GPT-4o-mini-tts frequently exhibits inconsistent loudness within a single narration. It sounds as if the narrator is repeatedly moving closer to and farther from the microphone. Explicit instructions emphasizing consistent volume reduce the problem somewhat, but it persists.
2. Long, random silences:
Narrations by the new voices occasionally include unexpected, prolonged silences lasting 10–60 seconds, usually toward the end of the audio. These silences significantly disrupt listener engagement.
3. Random repetition after long silences:
Following these extended silences, portions of previously narrated text frequently repeat unexpectedly. Additionally, when repetitions occur, the final sentences of the provided content may be skipped entirely.
4. Digitized audio distortion (particularly the “Onyx” voice):
The “Onyx” voice specifically produces noticeable digitized distortion, similar to audio from a poor cell phone connection or heavily compressed digital audio. The narration sounds jittery and unnatural as a result.
All four issues are regressions introduced with GPT-4o-mini-tts; none occurred with the original TTS model. They degrade the quality and usability of narrations in a production environment.
For reference, here are the exact narration instructions I currently use for all voices:
Read naturally at a comfortable, conversational pace, clearly articulating each word. Maintain consistent vocal volume and steady microphone proximity throughout the narration, avoiding fluctuations that sound as though you’re moving away from or closer to the microphone. Adopt a friendly, engaging tone suitable for podcast listening—pleasant, approachable, and subtly expressive without dramatic exaggeration. Use slight variations in pitch, rather than volume, to gently highlight important points, key phrases, or transitions. Insert short, natural pauses at paragraph breaks and section headings to smoothly guide listeners through the content without interrupting the narrative flow. Overall, aim for a warm, welcoming, and enjoyable delivery, as if thoughtfully sharing an interesting article or story with a friend through their headphones.
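To help with reproduction, a request along the following lines should exercise the same path. This is a minimal sketch using only the Python standard library, not my production code. It assumes the public `/v1/audio/speech` endpoint with its `model`, `voice`, `input`, `instructions`, and `response_format` fields; the helper names are hypothetical, and the `instructions` argument stands in for the full text above.

```python
import json
import urllib.request

SPEECH_URL = "https://api.openai.com/v1/audio/speech"


def build_speech_request(text, voice, instructions, api_key):
    """Build (but do not send) a POST request for one narration."""
    payload = {
        "model": "gpt-4o-mini-tts",
        "voice": voice,
        "input": text,
        "instructions": instructions,
        "response_format": "wav",
    }
    return urllib.request.Request(
        SPEECH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize_to_file(request, path):
    """Send the request and write the returned audio bytes to path."""
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        out.write(response.read())
```

Calling `build_speech_request(article_text, "onyx", narration_instructions, api_key)` and passing the result to `synthesize_to_file` would produce one file per voice for side-by-side comparison.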
I’d greatly appreciate insights or acknowledgment regarding these issues and information on whether they’re actively being addressed.
Thank you!