How to Get Text Output (Not Audio) from OpenAI Speech-to-Speech SDK (S2S) with Node.js?

Hi all,

I’m building a Node.js backend using the OpenAI Speech-to-Speech (S2S) SDK, and I need the agent’s output as text (not audio) for my frontend UI. I’ve tested both @openai/agents-realtime and @openai/agents/realtime packages. My setup streams PCM audio from the browser to the backend, then sends it to the Realtime API. Audio-to-audio (S2S) works perfectly.

However, I need to display the agent’s reply as text in the browser. I’ve tried all the config options I could find (modalities: ['text'], enabling transcription, removing output_disabled, etc.), but I only ever get audio events, not transcript or text events.

My stack:

  • Node.js backend (Express)
  • Latest OpenAI S2S SDK (@openai/agents-realtime and @openai/agents/realtime)
  • Browser frontend (streams PCM audio to server)
  • No Chat Completions API, no REST, no Assistants v2, no WebRTC; only the available SDK libraries

How do I configure the SDK/session so the agent’s responses are text, not audio? Is there a supported way to get response.text.delta or similar events, or trigger a text-only reply, in the S2S stack?

Here’s the base code I’m using (Node.js backend):

// server.js  —  Speech-in / Text-out with official SDK only
import 'dotenv/config';
import express from 'express';
import http from 'http';
import { WebSocketServer } from 'ws';
import { RealtimeAgent, RealtimeSession } from '@openai/agents/realtime';

const { OPENAI_API_KEY, PORT = 3000 } = process.env;
if (!OPENAI_API_KEY) throw new Error('OPENAI_API_KEY missing');

const app = express();
const server = http.createServer(app);
const wss = new WebSocketServer({ server });

app.use(express.static('public'));
app.get('/', (_q, r) => r.sendFile(process.cwd() + '/public/index.html'));

// ───────────────────────────────────────────────────────────────
wss.on('connection', async client => {
  console.log('🟢 browser connected', new Date().toLocaleTimeString());

  // 1️⃣  SDK session with every “classic” permutation already explored
  const agent = new RealtimeAgent({
    name: 'S2S Text-Out Assistant',
    instructions: 'You are a helpful assistant. Reply in clear English.',
  });

  const session = new RealtimeSession(agent, {
    transport: 'websocket',
    model: 'gpt-4o-realtime-preview-2025-06-03',
    language: 'en-US',
    modalities: ['audio', 'text'],       // all combos tried
    audio: {
      encoding: 'pcm',
      sample_rate: 24000,
      transcription: { enabled: true, interim_results: true },
      turn_detection: { type: 'server_vad', create_response: false }
    }
  });

  // Debug: log *every* SDK event
  session.onAny?.((evt, ev) =>
    console.log(`[RA EVENT] ${evt}`, JSON.stringify(ev)));

  await session.connect({ apiKey: OPENAI_API_KEY });
  console.log('✅ SDK session open');

  // 2️⃣  Core trick: after we finish *one* chunk/turn, ask for text
  async function handleIncomingAudio(buf) {
    // send audio and COMMIT that buffer as the user turn
    await session.sendAudio(buf, { commit: true });
    // now formally request a text-only agent reply
    await session.response.create({ modalities: ['text'] });
  }

  // 3️⃣  Receive mic packets from browser
  client.on('message', data => handleIncomingAudio(data).catch(console.error));

  // 4️⃣  Text streams to browser
  session.on('response.text.delta', ev => {
    if (ev.delta) client.send(JSON.stringify({ type: 'assistant_text', content: ev.delta }));
  });
  session.on('response.text.done', ev => {
    if (ev.text) client.send(JSON.stringify({ type: 'assistant_text', content: ev.text }));
  });

  // 5️⃣  (Optional) pass audio back too—kept for comparison
  session.on('audio', ev => {
    if (ev.data) client.send(ev.data, { binary: true });
  });

  client.on('close', () => {
    session.close();
    console.log('👋 browser disconnected, session closed');
  });
});

// ───────────────────────────────────────────────────────────────
server.listen(PORT, () =>
  console.log(`🚀 backend listening at http://localhost:${PORT}`));

Attempts to Get Text Output from S2S SDK

  1. Initial Configuration
  • Set transcription.enabled: true and modalities: ['text'] in the RealtimeSession or agent config.
  • Listened for response.text.delta events on the session/agent.
  • Result: Only audio events received; no text or transcript output.
  2. Alternate Event Listeners
  • Added listeners for conversation.updated and conversation.item.completed events.
  • Checked for assistant-role messages and a transcript in delta or item.
  • Result: Still only received audio events; text not emitted.
  3. Hybrid API Attempt (Realtime + Chat Completions)
  • Tried sending audio to the Realtime API for transcription and then passing the transcript to the Chat Completions API for a text response.
  • Encountered a TypeError (likely due to API mismatch or code-integration issues).
  • Result: Approach failed.
  4. Corrected Hybrid Attempt
  • Included the RealtimeAgent in the configuration, combining S2S agent logic with Chat Completions for downstream text generation.
  • Still no direct text output as desired.
  5. Realtime API with response.create Event
  • Manually sent a response.create event to the OpenAI WebSocket, specifying modalities: ['text'] to force a text-only response after receiving audio.
  • Listened for response.text.delta and response.text.done.
  • Result: The most promising approach, but it faced timing issues (sending before the WS connection was open), later solved with queueing.
  6. Raw WebSocket Debugging and Event Logging
  • Switched to a raw WebSocket client for more control, logging all OpenAI events to see exactly what the API sends back.
  • Still saw only audio events unless a text response was explicitly triggered with response.create.
  7. Session Parameter Variations
  • Tried changing output_disabled, toggling between modalities: ['audio'], ['text'], and ['audio','text'], and modifying the transcription and turn-detection configs in every permutation.
  • None of these alone produced text output unless the response.create step was used.
  8. Frontend-to-Backend Variations
  • Tested both direct audio streaming over WebSocket and chunked audio POSTs via REST, to rule out client/transport-side issues.
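For concreteness, here is how the response.create step (the one approach that did force text) looks when sketched against a raw WebSocket. This is only a sketch: the helper names are mine, not SDK APIs, and `ws` stands for any already-connected WebSocket-like object (e.g. from the ws package) pointed at the Realtime endpoint.

```javascript
// Sketch: forcing a text-only reply over a raw Realtime WebSocket.
// `ws` is assumed to be an already-connected WebSocket-like object; the
// helper names below are mine, not SDK APIs.

// Client event that requests a text-only response for the current turn.
function buildTextOnlyResponse() {
  return { type: 'response.create', response: { modalities: ['text'] } };
}

// After streaming the user's audio chunks, commit the turn and ask for text.
function finishTurnAndRequestText(ws) {
  ws.send(JSON.stringify({ type: 'input_audio_buffer.commit' }));
  ws.send(JSON.stringify(buildTextOnlyResponse()));
}

// On each incoming (already-parsed) server event, forward text deltas.
function handleServerEvent(evt, onTextDelta) {
  if (evt.type === 'response.text.delta' && evt.delta) onTextDelta(evt.delta);
}
```

With server-VAD turn detection set to create_response: false (as in my session config), this commit-then-response.create pair is what actually triggers generation.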

What am I missing, and what is the correct/recommended way (with S2S/Realtime SDK) to get text responses for my use case?

Thanks so much for any guidance!


I can discuss the Realtime API, the underlying layer that must be utilized correctly, and surfaced, by any Agents SDK built on it.

Understanding the Realtime API here is a bit convoluted, because making the AI model produce the same text as it speaks, instead of only chatting up the user in its audio modality, is one request parameter away within the same API documentation.

Modalities refers to the generation types the AI model is permitted to produce.

The parallel transcript of the assistant's spoken audio is generated in an undocumented internal manner, unlike the transcript of the input (optional, produced by a transcription model you specify), and should be automatic.

It is delivered in the response.done server event, inside the assistant-role message item; here is an example lifted from the API reference:

{
    "event_id": "event_3132",
    "type": "response.done",
    "response": {
        "id": "resp_001",
        "object": "realtime.response",
        "status": "completed",
        "status_details": null,
        "output": [
            {
                "id": "msg_006",
                "object": "realtime.item",
                "type": "message",
                "status": "completed",
                "role": "assistant",
                "content": [
                    {
                        "type": "text",
                        "text": "Sure, how can I assist you today?"
                    }
                ]
            }
        ],
        "usage": {
...
}
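Pulling the text out of that payload is just a walk over the output items. A minimal sketch (the function name is mine); note that audio-modality responses carry the words in a transcript field on an audio content part, rather than in a text part:

```javascript
// Extract assistant text from a `response.done` server event payload.
function extractAssistantText(event) {
  const parts = [];
  for (const item of event.response?.output ?? []) {
    if (item.type !== 'message' || item.role !== 'assistant') continue;
    for (const content of item.content ?? []) {
      // Text-modality responses carry `text`; audio responses carry a
      // parallel `transcript` field instead.
      if (content.type === 'text' && content.text) parts.push(content.text);
      if (content.type === 'audio' && content.transcript) parts.push(content.transcript);
    }
  }
  return parts.join('');
}
```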

The SDK does give you https://openai.github.io/openai-agents-js/guides/streaming/#listen-to-all-events

So, to tear into the SDK source and see how that event is otherwise delivered (or silently dropped), you can start with a file search of the package for the string that catches the event.
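As a rough sketch of that file search, a few lines of Node can walk an installed package and report which files mention a given event string (the package path in the example is only illustrative):

```javascript
// Sketch: find which files in an installed package mention an event name,
// to locate where the SDK handles (or drops) it.
import { readdirSync, readFileSync, statSync } from 'node:fs';
import { join } from 'node:path';

function grepDir(dir, needle, hits = []) {
  for (const name of readdirSync(dir)) {
    const path = join(dir, name);
    if (statSync(path).isDirectory()) {
      grepDir(path, needle, hits); // recurse into subdirectories
    } else if (/\.(c|m)?js$|\.ts$/.test(name) &&
               readFileSync(path, 'utf8').includes(needle)) {
      hits.push(path);
    }
  }
  return hits;
}

// Example (path is illustrative):
//   grepDir('node_modules/@openai/agents-realtime', 'response.done')
```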

One thing the docs mention: “Custom guardrails to monitor model output”. That code path, as a terminus for text, may or may not anticipate text-to-speech with a transcript.

Then we await a “here’s how we do it in our application built on OpenAI’s SDK” from someone who has actually done that.


Hi, thanks for your detailed explanation and the suggestions!

I followed your advice: I listened for the response.done server event, and switched to session.createResponse({ modalities: ['text'] }) instead of the older session.response.create. I made sure the backend always generates a valid dummy audio buffer (PCM16, 24 kHz, 800 ms) and submits it exactly as documented. The SDK session opens fine and all events are logged. However, even after these changes, the assistant only returns audio output events; there is never any assistant text in the response, neither in the response.done output nor in any response.text.delta or response.text.done events.

No error is thrown from the SDK itself (other than the earlier method mismatch), but no text output is ever delivered, regardless of whether I use real mic audio or generated sine wave dummy audio on the backend.

Just to confirm, I’m using only the official Node.js SDK (@openai/agents-realtime) and not calling the WebSocket API directly. All permutations of modalities and config options have been tried, including forcing text in createResponse and toggling all combinations of allowed input/output modes. The outcome is always the same: only audio is returned from the agent, no text is available in any event, so it can’t be shown in the UI.

Is there something else that might be blocking text output in S2S mode, or is text-only response simply not available (yet) for SDK-driven speech sessions? If you or anyone else has had a successful S2S → text-only pipeline with this library, I’d really appreciate a working minimal code snippet!

Thanks again for your help and time.