Hi!
I’d like to build some UI controls using the Realtime API and I’m not interested in the audio output (just in the function calls really). Is there a way to call the API without getting billed for the audio output?
Thanks
The Realtime API uses a voice-to-voice model. The only reason you’d want to use it is for the extremely low latency in voice communication.
If you want to transcribe the text you can use any typical model like Whisper.
@RonaldGRuckus thanks for your answer! I was attracted by the Realtime API because of the low latency in general for getting a function call. Going the Whisper + GPT route is OK, but the latency for Whisper is not amazing in my tests.
Gotta try running the new whisper-large-v3-turbo locally and see if things are better.
Thank you!
The new turbo model is quite fast, especially for its size.
With the Realtime model, by contrast, you would have to pay for the output audio tokens first to get the transcript.
To answer your question: yes, you can send text messages to the Realtime API, but as @RonaldGRuckus suggests, it is geared more for audio input. It’s pretty pricey even for text-to-function calling.
You’re supposed to be able to both send and receive text from the model but I haven’t worked out how to do the receive part yet…
Ah… you can set the modalities parameter for the session to just ["text"]. It defaults to ["text", "audio"].
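For anyone who wants to try this, here’s a minimal sketch of what that session setup can look like, assuming the beta WebSocket endpoint and the Node ws package (the exact model name and headers may differ for your setup):

```typescript
// Minimal sketch: open a Realtime API session over WebSocket and restrict it
// to text output via the session.update client event. Endpoint, model name,
// and headers are assumptions based on the beta docs; adjust as needed.
import WebSocket from "ws";

const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
  {
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "OpenAI-Beta": "realtime=v1",
    },
  }
);

ws.on("open", () => {
  ws.send(
    JSON.stringify({
      type: "session.update",
      session: {
        modalities: ["text"], // default is ["text", "audio"]
      },
    })
  );
});
```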
@stevenic I wanted audio input with text/function_call output, which I don’t think is what the modalities allow me to do. I think I’ll try going the Whisper route.
Ah… I see… Even the audio input is pretty pricey at $100/million tokens.
“If you want to transcribe the text you can use any typical model like Whisper.”
Not everything can be transcribed in a literal manner though. I was looking into this in order to migrate the speech detection and transformation for https://serenade.ai (which was abandoned ~2 years ago), hoping that maybe OpenAI could offer a solution that performs as well but is more accurate.
Detecting silence, passing the audio to Whisper, and then passing the result to ChatGPT would add too much latency. And the whole approach would be way more complex than simply relying on VAD and having the Realtime API handle speech detection and transformation all at once.
But I suppose this is purely conversational, since I just asked it to send me a message with some TypeScript code and it told me that it can’t do that.
If anyone is still looking for this, it does look like it’s possible. I’ll be doing this so I can use some voices I already have working at very low latency using a modified version of Tortoise.
After adding the user message to the conversation, send the response.create event to initiate a response from the model. If both audio and text are enabled for the current session, the model will respond with both audio and text content. If you’d like to generate text only, you can specify that when sending the response.create client event, as shown below.
(from the OpenAI docs)
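For what it’s worth, here is roughly what that looks like as raw client events. This is a sketch that assumes an already-open WebSocket connection ws to the Realtime API (like the session setup shown earlier in the thread), with the event shapes as I understand them from the beta docs:

```typescript
// Add a user message to the conversation, then request a text-only response
// for this turn by overriding modalities on the response.create event.
ws.send(
  JSON.stringify({
    type: "conversation.item.create",
    item: {
      type: "message",
      role: "user",
      content: [{ type: "input_text", text: "What's the weather like today?" }],
    },
  })
);

ws.send(
  JSON.stringify({
    type: "response.create",
    response: {
      modalities: ["text"], // ask for text only on this response
    },
  })
);
```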
Hm, that’s very interesting, thanks!
Although one thing that would concern me is that it seems like one can’t disable audio for one side. So even though it might be possible to request an additional text response like you quoted here, we’d still also have to pay for all audio responses that are generated. Unless the model would respect being told not to send audio responses but only listen.
If you’d like to generate text only, you can specify that when sending the response.create client event, as shown below
I don’t think that’s the case at all, based on the quote here and the surrounding documentation. It seems pretty explicit to me: they’re saying you can generate text, audio, or audio-and-text responses. It wouldn’t make any sense for them to let you disable incoming (not outgoing) audio but still charge you for it. At that point you could just have the audio come in and then decide to play or discard it in your client, since there’d be no difference in cost either way (which, again, I don’t think is what we’re dealing with, thankfully).
I got the realtime client working, and then I was immediately like “oh crap, I’m not going to want to listen to this thing all the time.” And sometimes text output makes sense for a lot of reasons, like showing code or displaying a weather forecast or whatever else (though I suppose that last one could be done in a function call).
I’m going to try this out this evening and report back. Hoping to have good news. So far, I’m happy to find someone else who wants to do something similar.
But I suppose this is purely conversational, since I just asked it to send me a message with some TypeScript code and it told me that it can’t do that.
Regarding this, I’d be hesitant to accept it based on that alone, and there’s some playing around I’d do to flesh this out:
The model may or may not behave differently if you explicitly disable audio; if it doesn’t know it’s engaging via audio, it might skip some of the clean-up necessary when speaking (saying the “degrees” symbol out loud, or replacing “35” with “thirty-five”, for example). If it does behave differently, that’s an interesting pointer that there might be some additional behind-the-scenes prompting happening.
I’d be curious how providing a tool such as “display_text” or “display_code” would perform (a rough sketch follows these points). You could tell it to display the code, it could call that tool, and that tool in your client could then spit out whatever the model feeds in. Return a simple success/fail from the tool and modify your prompt to ensure the LLM doesn’t yammer on about whatever was returned by that tool. That would also definitely side-step the audio token cost for that portion of things.
Building on top of the tool/function call idea, I’m curious how providing a tool that points to another, text-only LLM might work.
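To make the tool idea concrete, here’s a rough sketch of how a display_code tool could be registered on the session. The tool name comes from the point above; the description and schema are my own invention, and it assumes the same open WebSocket ws as in the earlier sketches:

```typescript
// Hypothetical "display_code" tool: lets the model hand code to the client
// instead of speaking it. Registered via session.update; the client renders
// whatever arrives in the function call arguments and returns a simple
// success/fail result.
ws.send(
  JSON.stringify({
    type: "session.update",
    session: {
      tools: [
        {
          type: "function",
          name: "display_code",
          description:
            "Show a code snippet to the user on screen instead of reading it aloud.",
          parameters: {
            type: "object",
            properties: {
              language: { type: "string", description: "e.g. typescript, python" },
              code: { type: "string", description: "The code to display verbatim" },
            },
            required: ["code"],
          },
        },
      ],
      tool_choice: "auto",
    },
  })
);
```

On the client side you’d then watch for the model’s function call output (the arguments arrive as their own events), render the code, and send back a short success/fail result so the model doesn’t read the code aloud.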
Confirmed: setting the session to text modality does not generate any outbound audio at all. I ran up $1.00 in inbound audio costs, and nothing at all showed for the month in outbound audio.
I waited a bit, and then switched back to audio and text, and saw the audio usage and cost show up for that conversation only later.
Not entirely sure why someone would think you can’t disable the audio output, but I’m pleased to say that they were incorrect.