Hi all,
I’m building a Node.js backend using the OpenAI Speech-to-Speech (S2S) SDK, and I need the agent’s output as text (not audio) for my frontend UI. I’ve tested both the `@openai/agents-realtime` and `@openai/agents/realtime` packages. My setup streams PCM audio from the browser to the backend, then sends it to the Realtime API. Audio-to-audio (S2S) works perfectly.
However, I need to display the agent’s reply as text in the browser. I’ve tried every config option I could find (`modalities: ['text']`, enabling transcription, removing `output_disabled`, etc.), but I only ever get audio events, never transcript or text events.
My stack:
- Node.js backend (Express)
- Latest OpenAI S2S SDK (`@openai/agents-realtime` and `@openai/agents/realtime`)
- Browser frontend (streams PCM audio to the server)
- No Chat API, no REST, no Assistants v2, no WebRTC; only the available libraries
How do I configure the SDK/session so the agent’s responses come back as text, not audio? Is there a supported way to get `response.text.delta` (or similar) events, or to trigger a text-only reply, in the S2S stack?
Here’s the base code I’m using (Node.js backend):
// server.js — Speech-in / Text-out with official SDK only
import 'dotenv/config';
import express from 'express';
import http from 'http';
import { WebSocketServer } from 'ws';
import { RealtimeAgent, RealtimeSession } from '@openai/agents/realtime';
const { OPENAI_API_KEY, PORT = 3000 } = process.env;
if (!OPENAI_API_KEY) throw new Error('OPENAI_API_KEY missing');
const app = express();
const server = http.createServer(app);
const wss = new WebSocketServer({ server });
app.use(express.static('public'));
app.get('/', (_q, r) => r.sendFile(process.cwd() + '/public/index.html'));
// ───────────────────────────────────────────────────────────────
wss.on('connection', async client => {
  console.log('🟢 browser connected', new Date().toLocaleTimeString());

  // 1️⃣ SDK session with every “classic” permutation already explored
  const agent = new RealtimeAgent({
    name: 'S2S Text-Out Assistant',
    instructions: 'You are a helpful assistant. Reply in clear English.',
  });
  const session = new RealtimeSession(agent, {
    transport: 'websocket',
    model: 'gpt-4o-realtime-preview-2025-06-03',
    language: 'en-US',
    modalities: ['audio', 'text'], // all combos tried
    audio: {
      encoding: 'pcm',
      sample_rate: 24000,
      transcription: { enabled: true, interim_results: true },
      turn_detection: { type: 'server_vad', create_response: false },
    },
  });

  // Debug: log *every* SDK event
  session.onAny?.((evt, ev) =>
    console.log(`[RA EVENT] ${evt}`, JSON.stringify(ev)));

  await session.connect({ apiKey: OPENAI_API_KEY });
  console.log('✅ SDK session open');

  // 2️⃣ Core trick: after we finish *one* chunk/turn, ask for text
  async function handleIncomingAudio(buf) {
    // send audio and COMMIT that buffer as the user turn
    await session.sendAudio(buf, { commit: true });
    // now formally request a text-only agent reply
    await session.response.create({ modalities: ['text'] });
  }

  // 3️⃣ Receive mic packets from browser
  client.on('message', data => handleIncomingAudio(data).catch(console.error));

  // 4️⃣ Text streams to browser
  session.on('response.text.delta', ev => {
    if (ev.delta) client.send(JSON.stringify({ type: 'assistant_text', content: ev.delta }));
  });
  session.on('response.text.done', ev => {
    if (ev.text) client.send(JSON.stringify({ type: 'assistant_text', content: ev.text }));
  });

  // 5️⃣ (Optional) pass audio back too—kept for comparison
  session.on('audio', ev => {
    if (ev.data) client.send(ev.data, { binary: true });
  });

  client.on('close', () => {
    session.close();
    console.log('👋 browser disconnected, session closed');
  });
});
// ───────────────────────────────────────────────────────────────
server.listen(PORT, () =>
console.log(`🚀 backend listening at http://localhost:${PORT}`));
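For reference, here are the raw Realtime API client events I believe the SDK ultimately needs to emit on the wire for the commit-then-request-text flow (event names and shapes taken from the Realtime API event reference; whether and how the S2S SDK exposes them is exactly what I’m asking):

```javascript
// Minimal sketch of the two raw Realtime API client events behind the
// "commit audio, then request a text-only reply" dance.

// Commit whatever is in the input audio buffer as the user's turn.
const commitEvent = { type: 'input_audio_buffer.commit' };

// Ask for a text-only response to that turn.
const responseCreateEvent = {
  type: 'response.create',
  response: { modalities: ['text'] },
};

// What they look like on the wire:
console.log(JSON.stringify(commitEvent));
// → {"type":"input_audio_buffer.commit"}
console.log(JSON.stringify(responseCreateEvent));
```

If the SDK has a supported way to send these (or does it under the hood when configured correctly), that’s the piece I’m missing.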
Attempts to Get Text Output from S2S SDK
- Initial Configuration
  - Set `transcription.enabled: true` and `modalities: ['text']` in the `RealtimeSession` or agent config.
  - Listened for `response.text.delta` events on the session/agent.
  - Result: Only audio events received; no text or transcript output.
- Alternate Event Listeners
  - Added listeners for `conversation.updated` and `conversation.item.completed` events.
  - Checked for assistant-role messages and a transcript in the delta or item.
  - Result: Still only received audio events; text not emitted.
- Hybrid API Attempt (Realtime + Chat Completions)
  - Tried sending audio to the Realtime API for transcription and then passing the transcript to the Chat Completions API for a text response.
  - Encountered a `TypeError` (likely due to an API mismatch or code-integration issues).
  - Result: Approach failed.
- Corrected Hybrid Attempt
  - Included the `RealtimeAgent` in the configuration, combining S2S agent logic with Chat Completions for downstream text generation.
  - Result: Still no direct text output as desired.
- Realtime API with `response.create` Event
  - Manually sent a `response.create` event to the OpenAI WebSocket, specifying `modalities: ['text']` to force a text-only response after receiving audio.
  - Listened for `response.text.delta` and `response.text.done`.
  - Result: The most promising approach, but it hit timing issues (sending before the WS connection was open), later solved with queueing.
- Raw WebSocket Debugging and Event Logging
  - Switched to a raw WebSocket client for more control, logging all OpenAI events to see exactly what the API sends.
  - Result: Still saw only audio events unless a text response was explicitly triggered with `response.create`.
- Session Parameter Variations
  - Tried changing `output_disabled`, toggling between `modalities: ['audio']`, `['text']`, and `['audio','text']`, and modifying the transcription and turn-detection configs in every permutation.
  - Result: None of these alone produced text output unless the `response.create` step was used.
- Frontend-to-Backend Variations
  - Tested both direct audio streaming over WebSocket and chunked audio POSTs via REST, to rule out client/transport-side issues.
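For completeness, this is the small queueing shim (a hypothetical helper I wrote myself, not part of any SDK) that fixed the timing issue in the `response.create` attempt above: anything sent before the socket is open gets buffered and flushed, in order, once it is:

```javascript
// Hypothetical wrapper around a raw WebSocket's send function: buffers
// outbound messages until the socket reports 'open', then flushes them
// in order. Fixes the "sent response.create before the WS was open" bug.
class QueuedSender {
  constructor(sendFn) {
    this.sendFn = sendFn; // the real transmit function, e.g. msg => ws.send(msg)
    this.open = false;
    this.queue = [];
  }

  // Call this from the socket's 'open' handler.
  handleOpen() {
    this.open = true;
    for (const msg of this.queue) this.sendFn(msg);
    this.queue.length = 0;
  }

  // Use this everywhere instead of ws.send directly.
  send(msg) {
    if (this.open) this.sendFn(msg);
    else this.queue.push(msg);
  }
}

// Tiny self-check with a fake transport:
const sent = [];
const q = new QueuedSender(m => sent.push(m));
q.send('a');        // buffered — socket not open yet
q.send('b');
q.handleOpen();     // flushes 'a' and 'b' in order
q.send('c');        // goes straight through
console.log(sent.join(',')); // → a,b,c
```

With this in place, the `response.create` path was the only one that ever produced `response.text.delta` events for me.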
What am I missing, and what is the correct/recommended way (with S2S/Realtime SDK) to get text responses for my use case?
Thanks so much for any guidance!