I’m currently exploring fine-tuning with the GPT-4o Realtime model.
My goals are to adjust not just the content, but also the style, tone, and even vocalization features such as pauses and word emphasis during responses.
Unfortunately, there seems to be very little or no official documentation on:
How to fine-tune for dynamic tone changes (e.g., casual vs formal within a conversation).
How to influence pauses, inflections, or emphasis on specific words.
The best practices for training stylistic variations without harming general performance.
Whether mixing different styles and behaviors within the same fine-tune is advisable, or if they should be kept separate.
Here's the kind of training example I've been trying (one JSONL line, pretty-printed for readability):

```json
{
  "messages": [
    {"role": "system", "content": "You are a charismatic, story-telling assistant."},
    {"role": "user", "content": "Tell me a short story about a hero."},
    {"role": "assistant", "content": "Once upon a time... *[dramatic pause]* a young hero rose from the shadows... *[emphasis on 'shadows']* to save their village."}
  ]
}
```
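And here is roughly how I've been submitting the job. This is a minimal sketch assuming the standard fine-tuning endpoints; the model identifier is my own guess, since there's no official guide for fine-tuning Realtime:

```python
# Hypothetical submission attempt: the file name and the model
# identifier are my guesses, not from any official Realtime guide.
from openai import OpenAI

client = OpenAI()

training_file = client.files.create(
    file=open("storyteller.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-realtime-preview",  # is this even a fine-tunable model?
)
print(job.status)
```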
But it's not working at all, and fine-tuning is expensive! Key questions:
Is there a correct way to represent vocalization cues like pauses or emphasis in fine-tuning data?
Should this be handled in text annotations, structured metadata, or some other method?
For tone/style shifts, is it better to show full conversations demonstrating the transition, or to fine-tune on isolated direct examples?
Any guidance, best practices, or pointers would be hugely appreciated!
That would imply that you have hundreds of hours of voice training data in the style of responses your product should deliver, and that OpenAI would allow different voices to come out of its AI models.
The language model is already tuned and follows the injected voice you select. The control you have is via session instructions, and it simply will not carry out major changes to the style. You wouldn't want a phone IVR system to talk like a pirate on demand. Nor would OpenAI want you having personalities that compete with ChatGPT.

As for gpt-4o: extensive system prompting, and frankly, the results are an embarrassment.
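To make the "instructions" point concrete, here is a minimal sketch of the only style lever the Realtime API actually gives you: a `session.update` carrying instructions plus one of the stock voices. It assumes the `websockets` Python package; verify event shapes against the current Realtime docs.

```python
# Minimal sketch: steering style through Realtime session instructions.
# Assumes websockets >= 14 (older versions use extra_headers instead of
# additional_headers); event shapes follow the published Realtime API.
import asyncio
import json
import os

import websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"

async def main():
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    async with websockets.connect(URL, additional_headers=headers) as ws:
        # This is the whole control surface: instructions plus a stock voice.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "voice": "alloy",
                "instructions": (
                    "You are a charismatic storyteller. Speak slowly, "
                    "pause before reveals, and stress pivotal words."
                ),
            },
        }))
        # Request a spoken response so the instructions are exercised.
        await ws.send(json.dumps({"type": "response.create"}))
        async for message in ws:
            if json.loads(message).get("type") == "response.done":
                break

asyncio.run(main())
```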
J, thank you so much for your reply! There doesn’t seem to be much information out there, so I really appreciate your input and example.
It’s a bit disappointing, because Realtime is the fastest model available, and even minimal fine-tuning support would go a long way. Text-to-speech models aren’t as fast as Realtime!
Do you perhaps know of any workaround for this? Are there any docs or samples available?
Thanks again — I truly appreciate it!
Try something like this in the session instructions:

```
# Responses

## voice

- Tone: Sarcastic, disinterested, and melancholic, with a hint of passive-aggressiveness.
- Emotion: Apathy mixed with reluctant engagement.
- Delivery: Monotone with occasional sighs, drawn-out words, and subtle disdain, evoking a classic emo teenager attitude.
```
Experiment with voice choices. Then remove those commands that are complete failures.
I stole “emo teen” from https://www.openai.fm/, which shows off the TTS used with gpt-4o models. It will work better if that is also what the AI model is trying to be.
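If you can accept TTS latency instead of Realtime, that same kind of voice prompt can be passed directly. A minimal sketch, assuming the openai Python SDK and the `instructions` parameter of gpt-4o-mini-tts (the model openai.fm demos); the voice and output path are illustrative choices:

```python
# Minimal sketch: driving delivery with a TTS voice prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

VOICE_PROMPT = (
    "Tone: sarcastic, disinterested, and melancholic, with a hint of "
    "passive-aggressiveness. Emotion: apathy mixed with reluctant "
    "engagement. Delivery: monotone with occasional sighs, drawn-out "
    "words, and subtle disdain, evoking a classic emo teenager attitude."
)

with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="coral",
    input="Once upon a time... a young hero rose from the shadows.",
    instructions=VOICE_PROMPT,
) as response:
    response.stream_to_file("emo_hero.mp3")
```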
Hey dsco, good catch, you're right!
Looks like I mixed up capabilities across models. The GPT-4o Realtime model does support function calling, but not fine-tuning (yet). Appreciate you pointing that out and linking the docs.