When using the Realtime API, is it possible to store a copy of the text read by the API (its vocal output) as WAV or MP3?
The use case would be to reuse it as a cache instead of generating it again every time.
Example: the first welcome sentence
There is no “text read by the API”. If you are referring to the transcription that is provided afterwards, that is a decoupled service and not a perfect representation of what the model said.
If you want to store the audio produced by OpenAI, you can simply save it while you are passing it to the end user.
Then, if you want to play it again as a welcome message, you can easily implement that in your code.
I imagine you are using some sort of pre-built framework that facilitates the Realtime API for you. So, your best bet is to use ChatGPT to find where the audio is captured, and then ask it to also save the audio locally.
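As a sketch of the “save it while you pass it along” idea: the Realtime API delivers audio as base64-encoded PCM16 chunks (24 kHz mono by default) in `response.audio.delta` events. The helper below (names and the dummy chunk are illustrative) joins the chunks you collected during a response and writes a playable WAV file:

```python
import base64
import wave

def save_pcm16_chunks(b64_chunks, path, sample_rate=24000):
    """Decode base64 PCM16 chunks and write them out as a playable WAV file."""
    pcm = b"".join(base64.b64decode(chunk) for chunk in b64_chunks)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)            # realtime audio is mono
        wav.setsampwidth(2)            # 16-bit samples
        wav.setframerate(sample_rate)  # 24 kHz by default
        wav.writeframes(pcm)

# Stand-in for deltas captured from the event stream: 0.1 s of silence
chunks = [base64.b64encode(b"\x00\x00" * 2400).decode()]
save_pcm16_chunks(chunks, "welcome.wav")
```

In a real session you would append `event["delta"]` from each `response.audio.delta` event to the list as you forward it to the user, then write the file when the response finishes.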
You can store it, but you can’t place it back into the Realtime API, nor can you preload a cache on OpenAI’s side for permanent reuse. The conversation state lives server-side.
Only the Chat Completions endpoint allows the developer to place previous conversation turns back into the context.
Even there, the audio the voice AI produced can only be represented by an ID referring to server-side audio that expires. And if you later replace everything the AI spoke with text transcripts, you may get an AI that follows that pattern and now only writes text.
Sorry, my message was probably unclear. I would like to use the Realtime API to generate audio segments for phrases my script can play without calling the API each time, but that have the same voice and the same kind of intonation.
For example, I welcome all users with the same phrase: “Hello, welcome to ACME, how can I help you?”
For this initial prompt, I’d like to generate it once, store it, and play it directly at the start of each voice chat without calling the API for it. The same goes for phrases like “Please hold on, I’m checking for you.”
I won’t ask the API to use my pre-generated audio files, but I want my script to play them at the right moment instead.
Probably the easiest way to do this is with Chat Completions, using the same “gpt-4o-audio-preview” model and voice setting you would use normally. You have more control over the turns and automation that way, and can rework responses until they are just right. You can simply save the base64 audio out of the response.
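A minimal sketch of that approach, assuming the documented Chat Completions audio shape (base64 audio in `message.audio.data`); the `GENERATE_WELCOME` flag and file names are just placeholders so the generation step only runs when you opt in:

```python
import base64
import os

def save_wav_from_b64(b64_audio: str, path: str) -> None:
    """Chat Completions returns audio as base64; decode and write it to disk."""
    with open(path, "wb") as f:
        f.write(base64.b64decode(b64_audio))

if os.environ.get("GENERATE_WELCOME"):  # illustrative one-time opt-in flag
    from openai import OpenAI  # pip install openai
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    completion = client.chat.completions.create(
        model="gpt-4o-audio-preview",
        modalities=["text", "audio"],
        audio={"voice": "alloy", "format": "wav"},  # match your realtime voice
        messages=[{
            "role": "user",
            "content": "Say exactly: Hello, welcome to ACME, how can I help you?",
        }],
    )
    save_wav_from_b64(completion.choices[0].message.audio.data, "welcome.wav")
```

Once the file sounds right, your app just plays it at session start and never touches the endpoint for that phrase again.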
Why not just use the standard TTS API with the same voice for this, and then play whatever greeting you generated right before or right after session initiation? Better yet, you could use the Realtime API once, in a specific and controlled way, make it say what you need, write that to a file, and reuse it as you wish (for example, if you need a voice that is supported by Realtime but not by TTS).
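A sketch of the TTS variant, assuming the `openai` Python SDK’s speech endpoint; the model and voice choice and the `REGENERATE_GREETINGS` flag are illustrative, and the call only runs when you opt in:

```python
import os
from pathlib import Path

def generate_greeting(text: str, path: Path) -> None:
    """One-time generation of a canned phrase via the standard TTS endpoint."""
    from openai import OpenAI  # pip install openai
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    # "tts-1"/"alloy" are examples; pick the voice closest to your realtime one
    with client.audio.speech.with_streaming_response.create(
        model="tts-1", voice="alloy", input=text,
    ) as response:
        response.stream_to_file(path)

# Run once, then just play the stored file at the start of every session:
if os.environ.get("REGENERATE_GREETINGS"):  # illustrative opt-in flag
    generate_greeting("Hello, welcome to ACME, how can I help you?",
                      Path("welcome.mp3"))
```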
In general, there is no point in making yourself dependent on the Realtime API for this specific case. It is not meant to contain or operate on any data outside a single session; the (current) sole exception is prompt caching, which affects nothing besides cost calculation and billing.
To summarize: if you want to reduce costs here AND keep it compatible with Realtime (by using a similar, if not identical, voice), you have to introduce a subsystem that partially controls the audio played to the user at the beginning of each session.
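Such a subsystem can be as small as a file-backed cache keyed by phrase. Everything below (directory layout, function names, the `generate` callback) is illustrative; the callback would wrap whatever one-time generation step you chose above:

```python
from pathlib import Path
from typing import Callable

CACHE_DIR = Path("audio_cache")  # hypothetical cache location

def cached_audio(phrase_id: str, generate: Callable[[], bytes]) -> bytes:
    """Return cached audio for a phrase, generating and storing it on a miss."""
    CACHE_DIR.mkdir(exist_ok=True)
    path = CACHE_DIR / f"{phrase_id}.wav"
    if not path.exists():
        path.write_bytes(generate())  # one API call, ever, per phrase
    return path.read_bytes()

# Usage: play this before handing the session to the realtime model, e.g.
# audio = cached_audio("welcome", lambda: tts_bytes("Hello, welcome to ACME..."))
```

After the first session, every greeting and “please hold on” phrase is served from disk, and the Realtime API only handles the genuinely dynamic turns.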