Hi - the Realtime API lists itself as multimodal. I'm trying to use it similarly to Gemini's bidirectional API: feed it chunked audio data and have it reply to that audio with text. I've been completely unable to do this, even though a multimodal API should support it. I can only get speech <> speech and speech <> transcription working. Help!
Welcome to the dev community @legacy
I would recommend setting the `modalities` parameter to `["text"]` via a `session.update` event to disable audio output and get text-only responses from the Realtime API.
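A minimal sketch of that (Node.js with the `ws` package), assuming the beta Realtime API event shapes and the `gpt-4o-realtime-preview` model name: configure the session for text-only output, then print the text deltas the server streams back. Audio input still works as usual.

```js
import WebSocket from "ws";

const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
  {
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "OpenAI-Beta": "realtime=v1",
    },
  }
);

ws.on("open", () => {
  // Disable audio output for the whole session; responses arrive as text.
  ws.send(JSON.stringify({
    type: "session.update",
    session: { modalities: ["text"] },
  }));
});

ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  // Text responses stream in as response.text.delta events.
  if (event.type === "response.text.delta") {
    process.stdout.write(event.delta);
  }
});
```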
Hi!
Have a look at how the OpenAI realtime console does it:
```js
// Create a user message item containing text, then ask the model to respond.
const event = {
  type: "conversation.item.create",
  item: {
    type: "message",
    role: "user",
    content: [
      {
        type: "input_text",
        text: message,
      },
    ],
  },
};
sendClientEvent(event);
sendClientEvent({ type: "response.create" });
```
Link: https://github.com/openai/openai-realtime-console/blob/main/client/components/App.jsx
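To do what the original question asks (chunked audio in, text out), the same pattern should work with the audio buffer events instead of `input_text`. A rough sketch, assuming the session was already switched to `modalities: ["text"]` as suggested above, and that `audioChunks` (a name I made up here) holds your base64-encoded PCM16 audio:

```js
// Stream the user's audio to the server chunk by chunk.
for (const chunk of audioChunks) {
  sendClientEvent({ type: "input_audio_buffer.append", audio: chunk });
}
// Commit the buffered audio as a user turn, then request a response,
// which will come back as text since the session is text-only.
sendClientEvent({ type: "input_audio_buffer.commit" });
sendClientEvent({ type: "response.create" });
```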
Cheers,
Guido