Multimodal/realtime API - audio to text output, not transcription

legacy · April 7, 2025, 6:01pm

Hi - the Realtime API describes itself as multimodal. I am trying to use it the way I use Gemini’s Bidirectional API: send in chunked audio data and get text replies to that audio. I have been completely unable to do this, even though a multimodal API should support it. The only combinations I can get working are speech <> speech and speech <> transcription. Help!

sps · April 7, 2025, 6:49pm

Welcome to the dev community @legacy

I would recommend setting the modalities parameter in the session configuration to ["text"] to disable audio output and get text-only responses from the Realtime API.
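For anyone who wants to see it spelled out, here is a rough sketch of the client events for one audio-in, text-out turn. It assumes the beta event names as I understand them (`session.update`, `input_audio_buffer.append`, `input_audio_buffer.commit`, `response.create`) and that audio is sent as base64-encoded PCM16 chunks. The helper function names are made up for illustration, so check the current API reference before relying on any of this.

```javascript
// Sketch only: event names and fields are assumptions based on the
// Realtime API beta; helper names are hypothetical.

// Ask for text-only output via the `modalities` session parameter.
function textOnlySessionUpdate() {
  return {
    type: "session.update",
    session: { modalities: ["text"] },
  };
}

// Wrap one chunk of raw audio bytes as a base64 append event.
// Uses Node's Buffer; in a browser you would base64-encode differently.
function audioAppendEvent(pcmChunk) {
  return {
    type: "input_audio_buffer.append",
    audio: Buffer.from(pcmChunk).toString("base64"),
  };
}

// Build the ordered list of client events for a single audio turn.
function audioTurnEvents(pcmChunks) {
  return [
    textOnlySessionUpdate(),
    ...pcmChunks.map(audioAppendEvent),
    { type: "input_audio_buffer.commit" },
    { type: "response.create" },
  ];
}
```

Each event would then be serialized with `JSON.stringify` and sent over your WebSocket or data channel. If server-side turn detection is enabled, the commit and `response.create` steps may happen automatically, in which case you would only send the append events.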

coccoinomane · April 20, 2025, 5:10pm

Hi!

Have a look at how OpenAI realtime console does it:

    const event = {
      type: "conversation.item.create",
      item: {
        type: "message",
        role: "user",
        content: [
          {
            type: "input_text",
            text: message,
          },
        ],
      },
    };

    sendClientEvent(event);
    sendClientEvent({ type: "response.create" });

Link: openai-realtime-console/client/components/App.jsx at main · openai/openai-realtime-console · GitHub

Cheers,
Guido
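Once a `response.create` has been sent with text-only output, the reply should come back as a stream of server events. Below is a minimal sketch of collecting it, assuming the beta event names `response.text.delta` (with a `delta` string) and `response.done`. These names are my assumption and may differ in newer API versions, and `createTextCollector` is a hypothetical helper, not part of the console.

```javascript
// Sketch only: accumulates text deltas into complete replies.
// Event names are assumptions based on the Realtime API beta.
function createTextCollector() {
  let current = "";   // text of the response currently streaming in
  const replies = []; // completed text replies, in order

  return {
    replies,
    // Call this with each parsed server event.
    handle(event) {
      if (event.type === "response.text.delta") {
        current += event.delta;
      } else if (event.type === "response.done") {
        if (current) replies.push(current);
        current = "";
      }
    },
  };
}
```

In practice you would call `collector.handle(JSON.parse(message.data))` from your message listener and read finished replies from `collector.replies`.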
