Multimodal/realtime API - audio to text output, not transcription

Hi - the Realtime API describes itself as multimodal. I am trying to use it the way I use Gemini's bidirectional API: send in audio data (chunked) and get replies to that audio as text. I have been completely unable to do this, yet a multimodal API should support it. The only combinations I can get working are speech <> speech and speech <> transcription. Help!

Welcome to the dev community @legacy

I would recommend setting the modalities parameter to ["text"] in the session configuration (for example, in a session.update event). That disables audio output, so the Realtime API returns text-only responses.
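As a minimal sketch of that suggestion, the event below asks for text-only output. It assumes the session field is called `modalities`, as in this reply; the field name may differ between API versions, so check the current API reference. `buildTextOnlySessionUpdate` is a hypothetical helper name, not part of any SDK.

```javascript
// Build a session.update event that requests text-only output.
// Assumption: the session field is named `modalities`, as in the reply above.
function buildTextOnlySessionUpdate() {
  return {
    type: "session.update",
    session: { modalities: ["text"] },
  };
}

// Hypothetical usage: `ws` would be an already-open WebSocket to the realtime endpoint.
// ws.send(JSON.stringify(buildTextOnlySessionUpdate()));
console.log(JSON.stringify(buildTextOnlySessionUpdate()));
```

Sending this once, before the first response is requested, should be enough, since the setting applies to the session until it is changed again.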

Hi!

Have a look at how the OpenAI realtime console sends a user message and then requests a response:

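    // `message` (the user's text) and `sendClientEvent` are defined elsewhere in App.jsx.
    const event = {
      type: "conversation.item.create",
      item: {
        type: "message",
        role: "user",
        content: [
          {
            type: "input_text",
            text: message,
          },
        ],
      },
    };

    // Add the user message to the conversation, then ask the model to respond.
    sendClientEvent(event);
    sendClientEvent({ type: "response.create" });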

Link: openai-realtime-console/client/components/App.jsx at main · openai/openai-realtime-console · GitHub
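The snippet above sends text input. For the audio-in, text-out case from the original question, the same pattern might look like the sketch below. It assumes the `input_audio_buffer.append` and `input_audio_buffer.commit` client events, and a `modalities` override on `response.create`. The helper names are hypothetical, `Buffer` is Node-specific, and the audio chunks are assumed to already be in the session's input audio format.

```javascript
// Wrap one chunk of raw audio bytes in an input_audio_buffer.append event.
// The API expects the audio payload as a base64 string.
function buildAudioAppend(chunk) {
  return {
    type: "input_audio_buffer.append",
    audio: Buffer.from(chunk).toString("base64"),
  };
}

// Build the full event sequence for one user turn:
// append every chunk, commit the buffer, then request a text-only response.
function buildAudioToTextEvents(chunks) {
  return [
    ...chunks.map(buildAudioAppend),
    { type: "input_audio_buffer.commit" },
    // Assumption: response.create accepts a per-response `modalities` override.
    { type: "response.create", response: { modalities: ["text"] } },
  ];
}

// Hypothetical usage with the console's helper:
// buildAudioToTextEvents(chunks).forEach(sendClientEvent);
```

If server-side turn detection is enabled, the explicit commit and `response.create` may be unnecessary, because the server can end the turn and start a response on its own.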

Cheers,
Guido