Is it possible for the realtime api to copy speech patterns?

nlinx · December 8, 2024, 7:24am

I want to upload an audio file and have the realtime API respond in a way that matches the speech patterns of the speaker (not necessarily the voice). For example, if the audio is of a 4-year-old with terrible grammar, the realtime API should also start talking with the terrible grammar and vocabulary level of a 4-year-old.

Is this possible?

_j · December 9, 2024, 6:11am

Not possible (and not a use that OpenAI wants to support).

You might as well reframe this as:

I want to upload an audio file and have the realtime API respond with a voice matching the speech patterns (not necessarily the voice). For example, if the audio is of a US president trying to sell a cryptocurrency to senior citizens, the realtime API should also start talking with the slick sales pitch style of the president’s speech.

You can see the systematic protocols to prevent this in the GPT-4o system card.

https://openai.com/index/gpt-4o-system-card/

Risk: Unauthorized voice generation

Mitigations:
- In all of our post-training audio data, we supervise ideal completions using the voice sample in the system message as the base voice.
- We only allow the model to use certain pre-selected voices and use an output classifier to detect if the model deviates from that.

If your goal is purely stylistic, you can use the instructions field when creating a session to steer the speaking style, with limited success. The speech will still be in one of the preset voices you can specify.
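As a rough illustration of the instructions approach, here is a minimal sketch that only builds the JSON for a `session.update` client event and does not open a connection. The event shape (`type`, `session.instructions`, `session.voice`) and the `alloy` voice name are written from my understanding of the Realtime API beta, so check them against the current API reference. The helper name `build_session_update` and the style text are made up for this example.

```python
import json


def build_session_update(style_instructions: str, voice: str = "alloy") -> str:
    """Return a JSON-encoded `session.update` client event.

    Assumption: the event shape follows the Realtime API beta
    (type / session.instructions / session.voice). Verify the field
    names against the current API reference before relying on them.
    """
    event = {
        "type": "session.update",
        "session": {
            # Free-text style guidance; this steers wording and delivery,
            # it does not clone a voice from uploaded audio.
            "instructions": style_instructions,
            # Must be one of the preset voices; "alloy" is assumed here.
            "voice": voice,
        },
    }
    return json.dumps(event)


# Hypothetical style prompt describing the speech pattern in words.
CHILD_STYLE = (
    "Speak like a four-year-old: very short sentences, simple words, "
    "and occasional grammar mistakes such as 'I goed' or 'more big'."
)

payload = build_session_update(CHILD_STYLE)
```

You would send `payload` as a text frame over the Realtime WebSocket once the session is open. The point is that the style has to be described in text; an audio sample cannot be supplied as the pattern to copy.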

The result of 2000+ tokens of system message input of a “child”:
