Unnatural Speech in Japanese TTS with Long Paragraphs (e.g., gpt-4o-mini-tts, tts-1)

When using Japanese text with models like gpt-4o-mini-tts or tts-1, I’ve noticed that the length of a paragraph significantly affects the naturalness of the generated speech.

For example, here are two versions of the same content:

Sample 1 (with line breaks between sentences):

読書は、私たちの想像力を豊かにし、心の世界を広げてくれる貴重な体験です。小説を読むことで、現実とは異なる世界に旅することができ、登場人物の感情や出来事を追体験することで、他者への理解も深まります。
また、知識を得る手段としての読書も非常に有効で、歴史、科学、哲学など、さまざまな分野の情報を自分のペースで吸収できます。
現代ではインターネットや動画など、情報を得る方法が多様化していますが、本を手に取ってじっくりと読み進める時間には、他のメディアにはない深い集中と静けさがあります。
読書を習慣にすることで、語彙力や表現力が高まり、思考力や論理的な判断力も養われるため、日常生活や仕事においても大きな助けとなります。

Sample 2 (same content as a single paragraph):

読書は、私たちの想像力を豊かにし、心の世界を広げてくれる貴重な体験です。小説を読むことで、現実とは異なる世界に旅することができ、登場人物の感情や出来事を追体験することで、他者への理解も深まります。また、知識を得る手段としての読書も非常に有効で、歴史、科学、哲学など、さまざまな分野の情報を自分のペースで吸収できます。現代ではインターネットや動画など、情報を得る方法が多様化していますが、本を手に取ってじっくりと読み進める時間には、他のメディアにはない深い集中と静けさがあります。読書を習慣にすることで、語彙力や表現力が高まり、思考力や論理的な判断力も養われるため、日常生活や仕事においても大きな助けとなります。

Note:
I wanted to share a video with audio to demonstrate the issue more clearly, but it seems I can't include links to YouTube or Google Drive in this post.

If there’s any way to share videos on this platform that I may have missed, I’d really appreciate it if someone could let me know.

When using Sample 2, the TTS output becomes noticeably unnatural in tone and phrasing partway through the paragraph. This doesn’t happen with Sample 1, which has breaks between sentences.
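
In case it helps anyone hitting the same issue, here is a minimal sketch of the workaround I'm using, assuming the official openai Python SDK. The add_sentence_breaks helper and the alloy voice are just illustrative choices, not anything the API requires:

```python
import re

from openai import OpenAI

client = OpenAI()


def add_sentence_breaks(text: str) -> str:
    """Put each sentence on its own line by breaking after the
    Japanese full stop (。), mirroring Sample 1's formatting."""
    return re.sub(r"。\s*", "。\n", text).strip()


long_paragraph = (
    "読書は、私たちの想像力を豊かにし、心の世界を広げてくれる貴重な体験です。"
    "小説を読むことで、現実とは異なる世界に旅することができます。"
)

# Send the pre-broken text to the dedicated TTS endpoint instead of
# the original single paragraph.
speech = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="alloy",
    input=add_sentence_breaks(long_paragraph),
)
speech.write_to_file("speech.mp3")
```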

I understand that Japanese sentence structure can be more challenging than in other languages, but if it’s possible to improve how long Japanese paragraphs are handled by the TTS models, I would deeply appreciate it. This would make the voice output much more natural and useful for real-world applications.

Thank you for your amazing work and continued improvements!

1 Like

I can also confirm that when using OpenAI’s dedicated Text-to-Speech (TTS) model, the tone of the output can sometimes become abruptly unnatural at paragraph transitions.

In the example you’ve provided, it is clearly noticeable that the intonation becomes unnatural starting from “現代では”.

Instead of using the dedicated TTS model, you might want to consider generating audio with the “gpt-4o-audio-preview” model through Chat Completions, which can return both text and audio output.

By explicitly instructing the model in the system message to repeat the user's input verbatim, you should be able to get text output almost exactly matching your original input.
Requesting audio alongside that text will likely produce significantly more natural-sounding speech than the dedicated TTS model.
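
A minimal sketch of this approach with the openai Python SDK might look like the following. The voice, output format, and the exact wording of the system prompt are my own assumptions; adjust them to taste:

```python
import base64

from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],  # ask for spoken output as well as text
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "system",
            # Force the model to read the input back unchanged.
            "content": "Repeat the user's message verbatim. "
                       "Do not add, omit, or rephrase anything.",
        },
        {
            "role": "user",
            "content": "読書は、私たちの想像力を豊かにし、心の世界を広げてくれる貴重な体験です。",
        },
    ],
)

# The audio arrives base64-encoded on the message object.
wav_bytes = base64.b64decode(completion.choices[0].message.audio.data)
with open("speech.wav", "wb") as f:
    f.write(wav_bytes)
```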

2 Likes

Thank you for sharing this suggestion—I really appreciate it!

I’m not sure yet whether the “gpt-4o-audio-preview” model allows for the same level of control over speech as the dedicated TTS model, but I’m planning to give it a try once I get home and see how it goes.

1 Like

I gave the “gpt-4o-audio-preview” model a try. While I could influence tone and style through how the text was written, it didn't seem to let me capture or adjust prosody (such as accent or intonation) during speech synthesis.
As a result, I found that “gpt-4o-mini-tts” is still the best option for my needs. Specifically, by paying close attention to sentence length within paragraphs and inserting appropriate line breaks, I can guide the model's speech delivery more effectively.
Hopefully, we'll eventually see a model that can handle both text generation and detailed prosody control at the same time.

1 Like