Hi!
I’d like to build some UI controls using the Realtime API and I’m not interested in the audio output (just in the function calls really). Is there a way to call the API without getting billed for the audio output?
Thanks
The Realtime API uses a voice-to-voice model. The only reason you’d want to use it is for the extremely low latency in voice communication.
If you want to transcribe the text you can use any typical model like Whisper.
@RonaldGRuckus thanks for your answer! I was attracted by the Realtime API because of the low latency in general for getting a function call. Going the Whisper + GPT route is OK, but the latency for Whisper is not amazing in my tests.
Gotta try running the new whisper-large-v3-turbo locally and see if things are better.
Thank you!
The new turbo model is quite fast, especially for its size.
With the Realtime model, by contrast, you would have to pay for the output audio tokens first to get the transcript.
To answer your question: yes, you can send text messages to the Realtime API, but as @RonaldGRuckus suggests, it is geared more for audio input. It’s pretty pricey even for text-to-function calling.
You’re supposed to be able to both send and receive text from the model but I haven’t worked out how to do the receive part yet…
Ah… you can set the modalities parameter for the session to just ["text"]. It defaults to ["text", "audio"].
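For anyone who wants to try this, here’s a minimal sketch of what that session setup can look like, assuming the beta WebSocket endpoint and the Node ws package (the exact model name and headers may differ for your setup):

```typescript
// Minimal sketch: open a Realtime API session over WebSocket and restrict it
// to text output via the session.update client event. Endpoint, model name,
// and headers are assumptions based on the beta docs; adjust as needed.
import WebSocket from "ws";

const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
  {
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "OpenAI-Beta": "realtime=v1",
    },
  }
);

ws.on("open", () => {
  ws.send(
    JSON.stringify({
      type: "session.update",
      session: {
        modalities: ["text"], // default is ["text", "audio"]
      },
    })
  );
});
```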
@stevenic I wanted audio input with text/function_call output, which I don’t think is what the modalities allow me to do. I think I’ll try going the Whisper route.
Ah… I see… Even the audio input is pretty pricey at $100/million tokens.
“If you want to transcribe the text you can use any typical model like Whisper.”
Not everything can be transcribed in a literal manner though. I was looking into this in order to migrate the speech detection and transformation for https://serenade.ai (which was abandoned ~2 years ago), hoping that maybe OpenAI could offer a solution that performs as well but is more accurate.
Detecting silence, passing the audio to Whisper, and then passing the result to ChatGPT would add too much latency. And the whole approach would be way more complex than simply relying on VAD and having the Realtime API handle speech detection and transformation all at once.
But I suppose this is purely conversational, since I just asked it to send me a message with some TypeScript code and it told me that it can’t do that.
If anyone is still looking for this, it does look like it’s possible. I’ll be doing this so I can use some voices I already have working at very low latency using a modified version of Tortoise.
After adding the user message to the conversation, send the response.create event to initiate a response from the model. If both audio and text are enabled for the current session, the model will respond with both audio and text content. If you’d like to generate text only, you can specify that when sending the response.create client event, as shown below.
(from the OpenAI docs)
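For what it’s worth, here is roughly what that looks like as raw client events. This is a sketch that assumes an already-open WebSocket connection ws to the Realtime API (like the session setup shown earlier in the thread), with the event shapes as I understand them from the beta docs:

```typescript
// Add a user message to the conversation, then request a text-only response
// for this turn by overriding modalities on the response.create event.
ws.send(
  JSON.stringify({
    type: "conversation.item.create",
    item: {
      type: "message",
      role: "user",
      content: [{ type: "input_text", text: "What's the weather like today?" }],
    },
  })
);

ws.send(
  JSON.stringify({
    type: "response.create",
    response: {
      modalities: ["text"], // ask for text only on this response
    },
  })
);
```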
Hm, that’s very interesting, thanks!
Although one thing that would concern me is that it seems like one can’t disable audio for one side. So even though it might be possible to request an additional text response like you quoted here, we’d still also have to pay for all audio responses that are generated. Unless the model would respect being told not to send audio responses but only listen.
If you’d like to generate text only, you can specify that when sending the response.create client event, as shown below
I don’t think that’s the case at all, based on the quote here and the surrounding documentation. It seems pretty explicit to me: they’re saying you can generate text, audio, or audio-and-text responses. It wouldn’t make any sense for them to let you disable incoming (not outgoing) audio but still charge you for it. At that point you could just have the audio come in and then decide to play or discard it in your client, since there’d be no difference in cost either way (which, again, I don’t think is what we’re dealing with, thankfully).
I got the realtime client working, and then I was immediately like “oh crap, I’m not going to want to listen to this thing all the time.” And sometimes text output makes sense for a lot of reasons, like showing code or displaying a weather forecast or whatever else (though I suppose that last one could be done in a function call).
I’m going to try this out this evening and report back. Hoping to have good news. So far, I’m happy to find someone else who wants to do something similar.
But I suppose this is purely conversational, since I just asked it to send me a message with some TypeScript code and it told me that it can’t do that.
Regarding this, I’d be hesitant to accept it based on that alone, and there’s some playing around I’d do to flesh this out:
The model may or may not behave differently if you explicitly disable audio; if it doesn’t know it’s engaging via audio, it might skip some of the clean-up necessary when speaking (saying the “degrees” symbol out loud, or replacing “35” with “thirty-five”, for example). If it does behave differently, that’s an interesting pointer that there might be some additional behind-the-scenes prompting happening.
I’d be curious how providing a tool such as “display_text” or “display_code” would perform (a rough sketch follows these points). You could tell it to display the code, it could call that tool, and that tool in your client could then spit out whatever the model feeds in. Return a simple success/fail from the tool and modify your prompt to ensure the LLM doesn’t yammer on about whatever was returned by that tool. That would also definitely side-step the audio token cost for that portion of things.
Building on top of the tool/function call idea, I’m curious how providing a tool that points to another, text-only LLM might work.
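To make the tool idea concrete, here’s a rough sketch of how a display_code tool could be registered on the session. The tool name comes from the point above; the description and schema are my own invention, and it assumes the same open WebSocket ws as in the earlier sketches:

```typescript
// Hypothetical "display_code" tool: lets the model hand code to the client
// instead of speaking it. Registered via session.update; the client renders
// whatever arrives in the function call arguments and returns a simple
// success/fail result.
ws.send(
  JSON.stringify({
    type: "session.update",
    session: {
      tools: [
        {
          type: "function",
          name: "display_code",
          description:
            "Show a code snippet to the user on screen instead of reading it aloud.",
          parameters: {
            type: "object",
            properties: {
              language: { type: "string", description: "e.g. typescript, python" },
              code: { type: "string", description: "The code to display verbatim" },
            },
            required: ["code"],
          },
        },
      ],
      tool_choice: "auto",
    },
  })
);
```

On the client side you’d then watch for the model’s function call output (the arguments arrive as their own events), render the code, and send back a short success/fail result so the model doesn’t read the code aloud.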
Confirmed: setting the session to text modality does not generate any outbound audio at all. I ran up $1.00 in inbound audio costs, and nothing at all showed for the month in outbound audio.
I waited a bit, and then switched back to audio and text, and saw the audio usage and cost show up for that conversation only later.
Not entirely sure why someone would think you can’t disable the audio output, but I’m pleased to say that they were incorrect.