GPT-4o-mini-tts Issues: Volume Fluctuations, Silence, Repetition, Distortion

I’ve been extensively testing OpenAI’s GPT-4o-mini-tts voices for my service, Listen Later, which converts written articles into narrated podcasts. While generally impressed, I’ve observed several noticeable regressions compared to the original TTS model:

1. Volume fluctuations affecting all new voices:
Every new voice introduced with GPT-4o-mini-tts frequently exhibits inconsistent loudness within a single narration; it sounds as if the narrator is repeatedly moving closer to and farther from the microphone. Explicit instructions emphasizing consistent volume reduce the problem somewhat, but it remains present (I sketch a rough post-processing workaround below).

2. Long, random silences:
Narrations by the new voices occasionally include unexpected, prolonged silences lasting 10–60 seconds, usually toward the end of the audio. These silences significantly disrupt listener engagement.

3. Random repetition after long silences:
Following these extended silences, portions of previously narrated text frequently repeat unexpectedly. Additionally, when repetitions occur, the final sentences of the provided content may be skipped entirely.

4. Digitized audio distortion (particularly the “Onyx” voice):
The “Onyx” voice specifically produces noticeable digitized distortion, similar to audio from a poor cell phone connection or heavily compressed digital audio. This results in jittery, compressed, and unnatural-sounding narration.

These issues are new regressions introduced with GPT-4o-mini-tts, as none were present in the original TTS model. They negatively impact the overall quality and usability of narrations in a production environment.
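As a stopgap for the volume issue, one thing I've been sketching is a loudness-leveling pass after generation. Here's a minimal sketch using pydub (requires ffmpeg); the chunk size and target level are arbitrary guesses on my part, not recommended values:

```python
# Minimal loudness-leveling sketch with pydub (requires ffmpeg).
# CHUNK_MS and TARGET_DBFS are arbitrary guesses, not recommended values.
from pydub import AudioSegment

CHUNK_MS = 2000      # analysis window in milliseconds
TARGET_DBFS = -16.0  # rough loudness target; tune to taste

def level_out(path_in: str, path_out: str) -> None:
    audio = AudioSegment.from_file(path_in)
    leveled = AudioSegment.empty()
    for i in range(0, len(audio), CHUNK_MS):
        chunk = audio[i:i + CHUNK_MS]
        if chunk.dBFS == float("-inf"):  # pure silence: leave untouched
            leveled += chunk
        else:
            leveled += chunk.apply_gain(TARGET_DBFS - chunk.dBFS)
    leveled.export(path_out, format="mp3")

level_out("narration_raw.mp3", "narration_leveled.mp3")
```

It's a blunt instrument (per-chunk gain can pump quiet passages), but it takes the edge off the worst swings.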

For reference, here are the exact narration instructions I currently use for all voices:

Read naturally at a comfortable, conversational pace, clearly articulating each word. Maintain consistent vocal volume and steady microphone proximity throughout the narration, avoiding fluctuations that sound as though you’re moving away from or closer to the microphone. Adopt a friendly, engaging tone suitable for podcast listening—pleasant, approachable, and subtly expressive without dramatic exaggeration. Use slight variations in pitch, rather than volume, to gently highlight important points, key phrases, or transitions. Insert short, natural pauses at paragraph breaks and section headings to smoothly guide listeners through the content without interrupting the narrative flow. Overall, aim for a warm, welcoming, and enjoyable delivery, as if thoughtfully sharing an interesting article or story with a friend through their headphones.
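For completeness, this is roughly how the request is made: a minimal sketch with the OpenAI Python SDK, where the voice and file names are just examples and INSTRUCTIONS abbreviates the full text above.

```python
# Minimal sketch of the request with the OpenAI Python SDK (needs
# OPENAI_API_KEY set). Voice and file names are examples; INSTRUCTIONS
# abbreviates the full narration instructions quoted above.
from openai import OpenAI

client = OpenAI()

INSTRUCTIONS = (
    "Read naturally at a comfortable, conversational pace, clearly "
    "articulating each word. ..."  # abbreviated; full text quoted above
)
article_text = open("article.txt", encoding="utf-8").read()

with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="onyx",
    input=article_text,
    instructions=INSTRUCTIONS,
) as response:
    response.stream_to_file("narration.mp3")
```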

I’d greatly appreciate insights or acknowledgment regarding these issues and information on whether they’re actively being addressed.

Thank you!

Been having the same problems.

I have noticed that a lot depends on the voice instructions. For example, using 9K text input with the Fable voice:

voice instructions: “Speak in a pleasant, professional tone.”
output: 7.4 MB mp3
result: the only problem is volume fluctuations

vs.

voice instructions: “Speak in a sarcastic tone.”
output: 11.4 MB mp3
result: many problems: volume fluctuations, long random silences, and random repetition after the silences

While not yet ready for prime time, this model is very promising, and I’m looking forward to it getting better - hopefully in the not-too-distant future.

Upon further testing, the above issues are more prevalent with large text inputs. Smaller inputs reduce the issues - even with non-standard voice instructions.
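If anyone wants to try that systematically, here's a rough sketch of chunking a long article at paragraph breaks and stitching the per-chunk audio together with pydub. The 2,000-character cap and the voice are arbitrary choices on my part, not documented limits:

```python
# Rough workaround sketch: keep each TTS request small by splitting the text
# at paragraph boundaries, then stitch the per-chunk audio with pydub.
# MAX_CHARS is an arbitrary guess, not a documented limit; needs ffmpeg.
from io import BytesIO

from openai import OpenAI
from pydub import AudioSegment

MAX_CHARS = 2000

def paragraph_chunks(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    chunks, current = [], ""
    for para in text.split("\n\n"):
        candidate = f"{current}\n\n{para}".strip()
        if len(candidate) > max_chars and current:
            chunks.append(current)
            current = para  # a single oversized paragraph still passes through whole
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

client = OpenAI()
combined = AudioSegment.empty()
for chunk in paragraph_chunks(open("article.txt", encoding="utf-8").read()):
    resp = client.audio.speech.create(
        model="gpt-4o-mini-tts",
        voice="fable",
        input=chunk,
        instructions="Speak in a pleasant, professional tone.",
    )
    combined += AudioSegment.from_file(BytesIO(resp.content), format="mp3")
combined.export("narration.mp3", format="mp3")
```

The trade-off is that prosody can reset at chunk boundaries, so splitting at paragraph breaks rather than mid-sentence matters.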

I’m trying to build something very similar to yours and I’m seeing similar issues.

The input text I’m using is around 900 tokens and every recording I generate contains multiple extended silences and random repetition after the silences.

I tried using your prompt but still got the same issues.

The two voices I experimented with were ‘Alloy’ and ‘Fable’.

The input text I’m using doesn’t seem especially challenging, and it’s well formatted (no typos, etc.), so I wonder why this is happening?

I just tried with a different article as input text and this time the recording was near perfect?!

Input text with near perfect recording (extracted from: https://www.lesswrong.com/s/NBDFAKt3GbFwnwzQF/p/46qnWRSR7L2eyNbMA)

Light leaves the Sun and strikes your shoelaces and bounces off; some photons enter the pupils of your eyes and strike your retina; the energy of the photons triggers neural impulses; the neural impulses are transmitted to the visual-processing areas of the brain; and there the optical information is processed and reconstructed into a 3D model that is recognized as an untied shoelace; and so you believe that your shoelaces are untied.

Here is the secret of deliberate rationality—this whole process is not magic, and you can understand it. You can understand how you see your shoelaces. You can think about which sort of thinking processes will create beliefs which mirror reality, and which thinking processes will not.

Mice can see, but they can’t understand seeing. You can understand seeing, and because of that, you can do things that mice cannot do. Take a moment to marvel at this, for it is indeed marvelous.

Mice see, but they don’t know they have visual cortexes, so they can’t correct for optical illusions. A mouse lives in a mental world that includes cats, holes, cheese and mousetraps—but not mouse brains. Their camera does not take pictures of its own lens. But we, as humans, can look at a seemingly bizarre image, and realize that part of what we’re seeing is the lens itself. You don’t always have to believe your own eyes, but you have to realize that you have eyes—you must have distinct mental buckets for the map and the territory, for the senses and reality. Lest you think this a trivial ability, remember how rare it is in the animal kingdom.

The whole idea of Science is, simply, reflective reasoning about a more reliable process for making the contents of your mind mirror the contents of the world. It is the sort of thing mice would never invent. Pondering this business of “performing replicable experiments to falsify theories,” we can see why it works. Science is not a separate magisterium, far away from real life and the understanding of ordinary mortals. Science is not something that only applies to the inside of laboratories. Science, itself, is an understandable process-in-the-world that correlates brains with reality.

Science makes sense, when you think about it. But mice can’t think about thinking, which is why they don’t have Science. One should not overlook the wonder of this—or the potential power it bestows on us as individuals, not just scientific societies.

Admittedly, understanding the engine of thought may be a little more complicated than understanding a steam engine—but it is not a fundamentally different task.

Once upon a time, I went to EFNet’s #philosophy chatroom to ask, “Do you believe a nuclear war will occur in the next 20 years? If no, why not?” One person who answered the question said he didn’t expect a nuclear war for 100 years, because “All of the players involved in decisions regarding nuclear war are not interested right now.” “But why extend that out for 100 years?” I asked. “Pure hope,” was his reply.

Reflecting on this whole thought process, we can see why the thought of nuclear war makes the person unhappy, and we can see how his brain therefore rejects the belief. But if you imagine a billion worlds—Everett branches, or Tegmark duplicates—this thought process will not systematically correlate optimists to branches in which no nuclear war occurs.

To ask which beliefs make you happy is to turn inward, not outward—it tells you something about yourself, but it is not evidence entangled with the environment. I have nothing against happiness, but it should follow from your picture of the world, rather than tampering with the mental paintbrushes.

If you can see this—if you can see that hope is shifting your first-order thoughts by too large a degree—if you can understand your mind as a mapping engine that has flaws—then you can apply a reflective correction. The brain is a flawed lens through which to see reality. This is true of both mouse brains and human brains. But a human brain is a flawed lens that can understand its own flaws—its systematic errors, its biases—and apply second-order corrections to them. This, in practice, makes the lens far more powerful. Not perfect, but far more powerful.

Input text with consistently poor recording (extended pauses, repetition - extracted from: https://www.lesswrong.com/posts/bJ2haLkcGeLtTWaD5/welcome-to-lesswrong )

The road to wisdom? Well, it’s plain
and simple to express:
Err
and err
and err again
but
less
and
less
and
less.
– Piet Hein

LessWrong is an online forum and community dedicated to improving human reasoning and decision-making. We seek to hold true beliefs and to be effective at accomplishing our goals. Each day, we aim to be less wrong about the world than the day before.

See also our New User’s Guide.

Training Rationality

Rationality has a number of definitions on LessWrong, but perhaps the most canonical is that the more rational you are, the more likely your reasoning leads you to have accurate beliefs, and by extension, allows you to make decisions that most effectively advance your goals.

LessWrong contains a lot of content on this topic. How minds work (both human, artificial, and theoretical ideal), how to reason better, and how to have discussions that are productive. We’re very big fans of Bayes Theorem and other theories of normatively correct reasoning.

To get started improving your Rationality, we recommend reading the background-knowledge text of LessWrong, Rationality: A-Z (aka “The Sequences”) or at least selected highlights from it. After that, looking through the Rationality section of the Concepts Portal is a good thing to do.

Applying Rationality

You might value Rationality for its own sake, however, many people want to be better reasoners so they can have more accurate beliefs about topics they care about, and make better decisions.

Using LessWrong-style reasoning, contributors to LessWrong have written essays on an immense variety of topics on LessWrong, each time approaching the topic with a desire to know what’s actually true (not just what’s convenient or pleasant to believe), being deliberate about processing the evidence, and avoiding common pitfalls of human reason.

Check out the Concepts Portal to find essays on topics such as artificial intelligence, history, philosophy of science, language, psychology, biology, morality, culture, self-care, economics, game theory, productivity, art, nutrition, relationships and hundreds of other topics broad and narrow.

LessWrong and Artificial Intelligence

For several reasons, LessWrong is a website and community with a strong interest in AI and specifically causing powerful AI systems to be safe and beneficial.

- AI is a field concerned with how minds and intelligence works, overlapping a lot with rationality.
- Historically, LessWrong was seeded by the writings of Eliezer Yudkowsky, an artificial intelligence researcher.
- Many members of the LessWrong community are heavily motivated by trying to improve the world as much as possible, and these people were convinced many years ago that AI was a very big deal for the future of humanity. Since then LessWrong has hosted a lot of discussion of AI Alignment/AI Safety, and that’s only accelerated recently with further AI capabilities developments.
- LessWrong is also integrated with the Alignment Forum.
- The LessWrong team who maintain and develop the site are predominantly motivated by trying to cause powerful AI outcomes to be good.
If you want to see more or less AI content, you can adjust your Frontpage Tag Filters according to taste.

Getting Started on LessWrong

- The New User’s Guide is a great place to start.
- The core background text of LessWrong is the collection of essays, Rationality: A-Z (aka “The Sequences”). Reading these will help you understand the mindset and philosophy that defines the site. Those looking for a quick introduction can start with The Sequences Highlights.
- Other top writings include The Codex (writings by Scott Alexander) and Harry Potter & The Methods of Rationality. Also see the Library Page for many curated collections of posts and the Concepts Portal.
- Also, feel free to introduce yourself in the monthly open and welcome thread!

Lastly, we do recommend that new contributors (posters or commenters) take time to familiarize themselves with the site’s norms and culture to maximize the chances that your contributions are well-received.

Thanks for your interest!

-The LW Team

What is it about the latter piece of text that results in consistent problems? All I can think is that something related to the formatting or phrasing is causing issues.

I’m generating the recordings directly in the OpenAI playground.
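To measure the silences objectively rather than scrubbing through each file, something like pydub's silence detector works; the thresholds below are guesses I'd tune per recording, not values from OpenAI:

```python
# Diagnostic sketch: flag suspiciously long silences in a generated file.
# Thresholds are guesses to tune per recording, not values from OpenAI.
from pydub import AudioSegment
from pydub.silence import detect_silence

audio = AudioSegment.from_file("narration.mp3")
gaps = detect_silence(
    audio,
    min_silence_len=5000,            # anything quiet for 5+ seconds
    silence_thresh=audio.dBFS - 16,  # "quiet" relative to average loudness
)
for start_ms, end_ms in gaps:
    print(f"silence from {start_ms / 1000:.1f}s to {end_ms / 1000:.1f}s")
```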

Update:

Upon further investigation I think I’ve narrowed the issue down. The problem text seems to be:

The road to wisdom? Well, it’s plain
and simple to express:
Err
and err
and err again
but
less
and
less
and
less.
– Piet Hein
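If the short, hard-broken lines are what derails the model, a crude preprocessing guess is to collapse single line breaks into spaces (preserving blank-line paragraph breaks) before sending the text:

```python
# Crude preprocessing guess: collapse single line breaks (like the poem's)
# into spaces while keeping blank-line paragraph breaks intact.
import re

def collapse_soft_breaks(text: str) -> str:
    paragraphs = re.split(r"\n\s*\n", text)  # split on blank lines
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)

poem = """The road to wisdom? Well, it's plain
and simple to express:
Err
and err
and err again
but
less
and
less
and
less."""
print(collapse_soft_breaks(poem))
# -> The road to wisdom? Well, it's plain and simple to express: Err and err ...
```

Whether the model handles the flattened poem any better is something I still need to verify.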

The text we are using does not have those issues - we are using professional articles with no “and errs” and the like. As we see it, it’s more a function of the length of the text. I’ve heard that OpenAI is having issues with compute capacity - maybe that is the problem…

I’m having the exact same issue with very long silences in the audio. It’s quite weird.