I’m currently exploring fine-tuning with the GPT-4o Realtime model.
My goals are to adjust not just the content, but also the style, tone, and even vocalization features such as pauses and word emphasis during responses.
Unfortunately, there seems to be very little or no official documentation on:
How to fine-tune for dynamic tone changes (e.g., casual vs formal within a conversation).
How to influence pauses, inflections, or emphasis on specific words.
The best practices for training stylistic variations without harming general performance.
Whether mixing different styles and behaviors within the same fine-tune is advisable, or if they should be kept separate.
Here's the kind of training example I've been trying (one JSONL line, pretty-printed for readability):

```json
{
  "messages": [
    {"role": "system", "content": "You are a charismatic, story-telling assistant."},
    {"role": "user", "content": "Tell me a short story about a hero."},
    {"role": "assistant", "content": "Once upon a time... *[dramatic pause]* a young hero rose from the shadows... *[emphasis on 'shadows']* to save their village."}
  ]
}
```
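And here is roughly how I've been submitting the job. This is a minimal sketch assuming the standard fine-tuning endpoints; the model identifier is my own guess, since there's no official guide for fine-tuning Realtime:

```python
# Hypothetical submission attempt: the file name and the model
# identifier are my guesses, not from any official Realtime guide.
from openai import OpenAI

client = OpenAI()

training_file = client.files.create(
    file=open("storyteller.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-realtime-preview",  # is this even a fine-tunable model?
)
print(job.status)
```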
But it's not working at all, and fine-tuning is expensive! Key questions:
Is there a correct way to represent vocalization cues like pauses or emphasis in fine-tuning data?
Should this be handled in text annotations, structured metadata, or some other method?
For tone/style shifts, is it better to show full conversations demonstrating the transition, or to fine-tune on isolated direct examples?
Any guidance, best practices, or pointers would be hugely appreciated!
That would imply that you have hundreds of hours of voice training data in the style of responses your product should deliver, and that OpenAI would allow different voices to come out of its AI models.
The language model is already tuned and follows the injected voice you select. The control you have is via session instructions, and it simply will not carry out major changes to the style. You wouldn't want a phone IVR system to talk like a pirate on demand. Nor would OpenAI want you having personalities that compete with ChatGPT.

As for gpt-4o: extensive system prompting, and frankly, the results are an embarrassment.
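To make the "instructions" point concrete, here is a minimal sketch of the only style lever the Realtime API actually gives you: a `session.update` carrying instructions plus one of the stock voices. It assumes the `websockets` Python package; verify event shapes against the current Realtime docs.

```python
# Minimal sketch: steering style through Realtime session instructions.
# Assumes websockets >= 14 (older versions use extra_headers instead of
# additional_headers); event shapes follow the published Realtime API.
import asyncio
import json
import os

import websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"

async def main():
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    async with websockets.connect(URL, additional_headers=headers) as ws:
        # This is the whole control surface: instructions plus a stock voice.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "voice": "alloy",
                "instructions": (
                    "You are a charismatic storyteller. Speak slowly, "
                    "pause before reveals, and stress pivotal words."
                ),
            },
        }))
        # Request a spoken response so the instructions are exercised.
        await ws.send(json.dumps({"type": "response.create"}))
        async for message in ws:
            if json.loads(message).get("type") == "response.done":
                break

asyncio.run(main())
```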
J, thank you so much for your reply! There doesn’t seem to be much information out there, so I really appreciate your input and example.
It’s a bit disappointing, because Realtime is the fastest model available, and even minimal fine-tuning support would go a long way. Text-to-speech models aren’t as fast as Realtime!
Do you perhaps know of any workaround for this? Are there any docs or samples available?
Thanks again — I truly appreciate it!
Try something like this in the session instructions:

```
# Responses

## voice

- Tone: Sarcastic, disinterested, and melancholic, with a hint of passive-aggressiveness.
- Emotion: Apathy mixed with reluctant engagement.
- Delivery: Monotone with occasional sighs, drawn-out words, and subtle disdain, evoking a classic emo teenager attitude.
```
Experiment with voice choices. Then remove those commands that are complete failures.
I stole “emo teen” from https://www.openai.fm/, which shows off the TTS used with gpt-4o models. It will work better if that is also what the AI model is trying to be.
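If you can accept TTS latency instead of Realtime, that same kind of voice prompt can be passed directly. A minimal sketch, assuming the openai Python SDK and the `instructions` parameter of gpt-4o-mini-tts (the model openai.fm demos); the voice and output path are illustrative choices:

```python
# Minimal sketch: driving delivery with a TTS voice prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

VOICE_PROMPT = (
    "Tone: sarcastic, disinterested, and melancholic, with a hint of "
    "passive-aggressiveness. Emotion: apathy mixed with reluctant "
    "engagement. Delivery: monotone with occasional sighs, drawn-out "
    "words, and subtle disdain, evoking a classic emo teenager attitude."
)

with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="coral",
    input="Once upon a time... a young hero rose from the shadows.",
    instructions=VOICE_PROMPT,
) as response:
    response.stream_to_file("emo_hero.mp3")
```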
Hey dsco, good catch, you're right!
Looks like I mixed up capabilities across models. The GPT-4o Realtime model does support function calling, but not fine-tuning (yet). Appreciate you pointing that out and linking the docs.