Hi!
I’ve been struggling with the newly released Realtime API. I want realtime transcription only (without the agent talking back), but I’ve had no success yet.
I followed the documentation of the TypeScript SDK here.
With that, I was able to successfully set up an agent that I can speak to through the microphone and hear respond through the speakers. With a few modifications, I could also log the transcription of what I said.
Here is the code:
import { RealtimeAgent, RealtimeSession } from "@openai/agents-realtime";

const agent = new RealtimeAgent({
  name: "Assistant",
  instructions: "You are a helpful assistant.",
});

const session = new RealtimeSession(agent, {
  model: "gpt-realtime",
  config: {
    inputAudioTranscription: {
      model: "gpt-4o-transcribe",
      prompt: "Expect words related to programming, development, and technology.",
      language: "es",
    },
  },
});

export async function connect() {
  try {
    await session.connect({
      // To get this ephemeral key string, you can run the following command or implement the equivalent on the server side:
      // curl -s -X POST https://api.openai.com/v1/realtime/client_secrets -H "Authorization: Bearer $OPENAI_API_KEY" -H "Content-Type: application/json" -d '{"session": {"type": "realtime", "model": "gpt-realtime"}}' | jq .value
      apiKey: "ek_1234",
    });
    console.log("You are connected!");
    console.log("Transport:", session.transport);

    session.transport.on("error", (event) => {
      console.log("Transport error", event);
    });
    session.transport.on("session.created", (event) => {
      console.log("Session created", event);
    });
    session.transport.on("session.updated", (event) => {
      console.log("Session updated", event);
    });
    session.transport.on("conversation.item.input_audio_transcription.completed", (event) => {
      console.log("Audio transcription completed", event);
    });
    session.transport.on("conversation.item.input_audio_transcription.failed", (event) => {
      console.log("Audio transcription failed", event);
    });
  } catch (e) {
    console.error(e);
  }
}
When I run the application I get these logs in the console:
And then the agent talks back.
But I just want the transcription; I don’t want the audio response. After many attempts, this is as far as I could get:
- Get an ephemeral key of type “transcription” instead of “realtime”, so the body sent to https://api.openai.com/v1/realtime/client_secrets is:
{"session": {"type": "transcription"}}
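For context, this is how I mint that key. It is the server-side equivalent of the curl command from the SDK docs, just with the session type swapped to "transcription" (the endpoint and the .value field come from that curl example; everything else is my own plumbing):

```typescript
// Server-side sketch: mint an ephemeral client secret for a
// transcription-only session. OPENAI_API_KEY must be available here.

// The request body, matching {"session": {"type": "transcription"}}.
export function transcriptionSecretBody() {
  return { session: { type: "transcription" } };
}

export async function mintTranscriptionKey(apiKey: string): Promise<string> {
  const res = await fetch("https://api.openai.com/v1/realtime/client_secrets", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(transcriptionSecretBody()),
  });
  if (!res.ok) throw new Error(`client_secrets failed: ${res.status}`);
  const json = await res.json();
  // The curl example extracts the key with `jq .value`.
  return json.value;
}
```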
When running the code again I get this error in the console:
Passing a realtime session update event to a transcription session is not allowed.
These update events are sent automatically by the SDK; I never trigger any event myself.
I’ve tried setting the output to text only, and changing the instructions to say that I don’t want any audio response, but it all seems to be ignored. Even if it worked, it would probably be much less efficient than transcribing alone.
I’m probably using the SDK incorrectly. Maybe I should go deeper and take more control over the WebRTC/WebSocket protocol myself? Any hints?
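In case it helps to discuss, here is a sketch of what I mean by "going deeper": opening a raw WebSocket transcription session and skipping the SDK entirely. The `?intent=transcription` query parameter, the `transcription_session.update` event type, and the subprotocol-based auth are my reading of the Realtime API reference and may be wrong; the rest reuses my config from above:

```typescript
// Sketch: a raw browser WebSocket transcription session, bypassing the SDK.
// ASSUMPTIONS (please verify against the current API reference):
// - the `?intent=transcription` query parameter selects a transcription session
// - the session is configured via a `transcription_session.update` event
// - browsers pass the ephemeral key via WebSocket subprotocols, since they
//   cannot set an Authorization header on a WebSocket

// The session-update payload: transcription only, same settings as my SDK code.
export function transcriptionUpdateEvent() {
  return {
    type: "transcription_session.update",
    session: {
      input_audio_transcription: {
        model: "gpt-4o-transcribe",
        prompt: "Expect words related to programming, development, and technology.",
        language: "es",
      },
    },
  };
}

// Connect with an ephemeral key minted for a transcription-type session.
export function connectTranscription(ephemeralKey: string): WebSocket {
  const ws = new WebSocket(
    "wss://api.openai.com/v1/realtime?intent=transcription",
    ["realtime", `openai-insecure-api-key.${ephemeralKey}`],
  );
  ws.addEventListener("open", () => {
    ws.send(JSON.stringify(transcriptionUpdateEvent()));
    // Audio would then be streamed as input_audio_buffer.append events.
  });
  ws.addEventListener("message", (e) => {
    const event = JSON.parse(String(e.data));
    if (event.type === "conversation.item.input_audio_transcription.completed") {
      console.log("Transcript:", event.transcript);
    }
  });
  return ws;
}
```

Since a transcription session has no agent, nothing should talk back; but I’d rather not reimplement the audio capture the SDK already handles, so a way to do this within the SDK would be ideal.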
Thanks in advance

