I want to upload an audio file and have the Realtime API respond in a way that matches the speaker's speech patterns (not necessarily the voice itself). For example, if the audio is of a 4-year-old with terrible grammar, the Realtime API should also start talking with the terrible grammar and vocabulary level of a 4-year-old.
I want to upload an audio file and have the Realtime API respond in a way that matches the speaker's speech patterns (not necessarily the voice itself). For example, if the audio is of a US president trying to sell a cryptocurrency to senior citizens, the Realtime API should also start talking in the slick sales-pitch style of the president’s speech.
You can see the systematic protocols to prevent this in the GPT-4o system card.
In all of our post-training audio data, we supervise ideal completions using the voice sample in the system message as the base voice.
We only allow the model to use certain pre-selected voices and use an output classifier to detect if the model deviates from that.
If your goal is purely stylistic, you can use the instructions field when creating a session to describe the speaking style you want, though with limited success. The speech will still be in one of the preset voices you can specify.
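As a rough sketch of the instructions approach, the snippet below only builds the JSON for a session.update client event; it does not open a connection. The helper name build_session_update and the example style text are made up for illustration, and the event shape (a "session" object with "instructions" and "voice") and the "alloy" voice name are my assumptions based on the Realtime API docs, so check them against the current reference before relying on them.

```python
import json


def build_session_update(style_instructions: str, voice: str = "alloy") -> str:
    """Serialize a session.update event carrying speaking-style instructions.

    Hypothetical helper: the field names are assumed from the Realtime API
    docs, and the voice must be one of the preset voices.
    """
    event = {
        "type": "session.update",
        "session": {
            "instructions": style_instructions,  # describes style, not a voice to clone
            "voice": voice,                      # preset voice only
        },
    }
    return json.dumps(event)


payload = build_session_update(
    "Speak like a four-year-old: short sentences, simple words, "
    "and frequent grammar mistakes."
)
print(payload)  # send this string over the already-open Realtime connection
```

This only steers wording and delivery through text instructions; the audio itself still comes out in the preset voice.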
The result of giving it 2000+ tokens of “child” system message input: