Realtime API phone use case - speaking text

I am having issues with the basic use case of answering a phone call. There is no method that I can find to tell it to speak something, i.e., “Thanks for calling ABC company, you are on a recorded line, how can I help?” I have the Twilio example working (twilio-samples speech-assistant-openai-realtime-api-node), but when I call it, I have to say “hello” to get the AI agent to do its normal intro (specified in the system prompt). I also have some cases where I need it to speak something while calling slower tools, e.g., “Hold on a second.” Has anyone figured this out?


Have you tried resampling to 24,000 Hz? I’ve made a pull request for this on Firefox: Fix Dynamic Sample Rate Detection for Audio Compatibility by mmtmn · Pull Request #7 · openai/openai-realtime-console · GitHub

For the initial conversation to trigger, you can send a

response.create

with your instructions. Make sure to pass modalities text AND audio for the AI to respond with audio.
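For reference, a minimal sketch of that event in Node.js (the `openAiWs` variable and the greeting text are placeholders, assuming an already-connected Realtime API websocket):

const initialResponse = {
    type: 'response.create',
    response: {
        // Both modalities are needed for the model to reply with audio
        modalities: ['text', 'audio'],
        instructions: 'Greet the caller with: "Thanks for calling ABC company, you are on a recorded line, how can I help?"'
    }
};

openAiWs.send(JSON.stringify(initialResponse));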

Thanks. I tried that but I want it to speak it in its normal voice, so I don’t have audio to send.

You can send text instead of audio as an initial “Response” and it will still output audio as long as you have both modalities (text AND audio) set.

Hi!

I suggest using pre-recorded audio for situations where you know exactly what needs to be said and when. In addition to the audio files, you’ll need to implement branching logic (e.g., when a long-running function is triggered, play the corresponding audio file to the user).

This approach should reduce costs and improve the user experience.
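As a rough sketch of that branching logic (assuming a Twilio Media Streams setup; `twilioWs`, `streamSid`, and `hold_on.ulaw` are all hypothetical, and the file would need to be 8 kHz G.711 µ-law audio, which is what Twilio’s media stream expects):

const fs = require('fs');

// When a long-running function is triggered, play a pre-recorded clip
// into the Twilio media stream instead of leaving the caller in silence.
function playHoldMessage(twilioWs, streamSid) {
    const audio = fs.readFileSync('hold_on.ulaw'); // pre-recorded "Hold on a second"
    twilioWs.send(JSON.stringify({
        event: 'media',
        streamSid,
        media: { payload: audio.toString('base64') }
    }));
}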


Did you try adding a <Say> tag to your TwiML before the <Connect> tag?

fastify.all('/incoming-call', async (request, reply) => {
    const twimlResponse = `<?xml version="1.0" encoding="UTF-8"?>
                          <Response>
                              <Say>Thanks for calling ABC company, you are on a recorded line, how can I help?</Say>
                              <Connect>
                                  <Stream url="wss://${request.headers.host}/media-stream" />
                              </Connect>
                          </Response>`;

    reply.type('text/xml').send(twimlResponse);
});

Seeing this post again, I think I wasn’t clear enough; your issue is already solved, I just explained it poorly.

Send a response.create with both modalities, text and audio, so ["text","audio"], with “instructions” set to whatever you want the AI to say.

Note that even though you pass "audio", you do not need any audio input.
This just tells the AI to respond with audio, and not just text.

I may just be misunderstanding your problem; do let me know.

I hope this clears things up. :hugs:

Here’s my implementation even if it’s Java, just so you get the idea:

// Ask the model to speak first: both modalities are required for audio
// output, even though no audio input is sent.
JSONObject responseCreate = new JSONObject()
        .put("type", "response.create")
        .put("response", new JSONObject()
                .put("modalities", new JSONArray().put("text").put("audio"))
                .put("instructions", SYSTEM_PROMPT));

webSocket.sendText(responseCreate.toString(), true);

They updated the developer docs, which say you can’t start with an assistant message.

I got it working with this (I have modalities: ["text", "audio"] already set in the session.update portion):

const introMessage = "Hello! Thanks for calling, are you interested in new health care plans?";

function sendAssistantIntro(openai_ws, logger, callSid) {
    // Step 1: Send `conversation.item.create` as a user message.
    // (Assistant audio messages can't be created directly, so the greeting
    // is phrased as a user instruction instead.)
    const userIntroMessage = {
        type: 'conversation.item.create',
        item: {
            type: 'message',
            role: 'user',
            content: [
                {
                    type: 'input_text',
                    text: `Greet the user with "${introMessage}"`
                }
            ]
        }
    };
    openai_ws.send(JSON.stringify(userIntroMessage));
    logger.info(`Sent introductory message as user for CallSid: ${callSid}`);

    // Step 2: Trigger the response so the model speaks the greeting.
    openai_ws.send(JSON.stringify({ type: 'response.create' }));
    logger.info(`Triggered response.create to play the assistant's introductory message in audio for CallSid: ${callSid}`);
}

Where do they say this? I can’t find it here.

I do this in my own code and it works flawlessly. :thinking:

Also, make sure you put more info into response.create; look at my example. I don’t think calling it alone does anything.

There I specify the modalities, and once I send that response.create, I initially get audio back of the AI greeting me.

I believe it was in one of theirs or Twilio’s blog posts where they said that.

It’s very finicky; I tried adding those in my response.create but kept getting silence (though that may also have been because I had it as an assistant message).

Found it:

conversation.item.create

Add a new Item to the Conversation’s context, including messages, function calls, and function call responses. This event can be used both to populate a “history” of the conversation and to add new items mid-stream, but has the current limitation that it cannot populate assistant audio messages.


I see, you’re right about that then, however my point still stands! :smile:

First, you send a conversation.item.create, and afterwards you pass a response.create with the modalities set to text and audio.
This will make the AI talk first, without the user having to say anything.

Do let me know if I’m getting your problem wrong but as far as I understand it, this should be the solution. :hugs:

Just to clarify @j.wischnat, you’re suggesting the following flow:

1. Incoming call > Twilio webhook > fires the media-stream connection handler on the local server
2. Open WSS with OpenAI (which creates a session)
3. Send Session Update (system prompt etc.)
4. Send conversation.item.create (receive event_id and item_id)
5. Follow up with response.create (for the initial greeting)

Is that pretty much accurate?
If yes, are you observing any time delay between Twilio firing the webhook (call coming in, picked up) and the initial greeting?

I understand there’s a <Pause> option in TwiML, but I’m not sure whether that holds the call while it’s ringing or whether the pause happens after the call has been picked up, leading to a void of silence while the OpenAI setup completes.
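For what it’s worth, as far as I know TwiML only executes once Twilio has answered the inbound call, so a <Pause> would just add dead air after pickup. A sketch of what it would look like in the earlier handler:

fastify.all('/incoming-call', async (request, reply) => {
    // <Pause> runs after the call is answered, so the caller hears
    // one second of silence before the media stream connects.
    const twimlResponse = `<?xml version="1.0" encoding="UTF-8"?>
                          <Response>
                              <Pause length="1"/>
                              <Connect>
                                  <Stream url="wss://${request.headers.host}/media-stream" />
                              </Connect>
                          </Response>`;

    reply.type('text/xml').send(twimlResponse);
});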

That is a perfect clarification, yes!

I am not using Twilio, so I can’t assist with that; however, I think the original problem lay in the usage of the OpenAI API.

I hope that my input was helpful still though!

Here’s a chart to visualize the process:

sequenceDiagram
    participant User
    participant Twilio
    participant LocalServer
    participant OpenAI

    User->>Twilio: Incoming call
    Twilio->>LocalServer: Webhook fired
    LocalServer->>OpenAI: Open WSS (creates a session)
    OpenAI->>LocalServer: Session created
    LocalServer->>OpenAI: Send Session Update (system prompt etc.)
    OpenAI->>LocalServer: Session updated
    LocalServer->>OpenAI: Send conversation.item.create
    OpenAI->>LocalServer: Receive event_id and item_id
    LocalServer->>OpenAI: Follow up with response.create (for the initial greeting)
    OpenAI->>LocalServer: Initial greeting in audio
    LocalServer->>Twilio: Send initial greeting audio
    Twilio->>User: Play initial greeting audio

    Note over LocalServer, OpenAI: Ensure modalities are set to ["text", "audio"]
    Note over Twilio: Consider using PAUSE option in TwiML to manage call timing

Your explanation the first time around was perfectly clear, and I think, with respect to @skisquaw, the problem definition was not correct. Both Twilio (or any alternative audio input) and the Realtime API are behaving as expected.

Since the OpenAI WSS connection starts with a session.created event from their end, and the Realtime API is response-driven, there’s no way to preload a system prompt with an initial greeting unless you do the 3-step process defined above (session.update > conversation.item.create > response.create).

The question is, how to handle this natural delay and if it can be done while the call is ringing, or has to be done once the call has been picked up.

Technically, it cannot be done while the call is ringing (or while mic input is being initialized), since that would require a parallel event to let the client know that something is happening. Twilio, at least, doesn’t have any such mechanism, and frankly the use case for it is quite niche, so I’m not surprised.

Furthermore, since this is an “incoming call” scenario, the webhook responsible for all the magic is only called once Twilio has received an incoming call and picked it up.

The suggestion made by @vb is a logical workaround and might actually improve the overall user experience. However, the greeting audio (in my opinion) should be recorded in a different voice than the one the user will encounter once the call is connected. This is not far removed from how it works in the real world: when you call Company X, their switchboard IVR greets you, then connects your call, and you hear a different voice on the other end.

Further, you can inject the greeting audio in Twilio’s handler using TwiML, which removes the need for the 3-step process with OpenAI.
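A sketch of that, reusing the earlier handler with a hypothetical pre-recorded greeting hosted at /greeting.mp3:

fastify.all('/incoming-call', async (request, reply) => {
    // <Play> streams the pre-recorded greeting (IVR-style, possibly in a
    // different voice) before handing the call off to the Realtime API.
    const twimlResponse = `<?xml version="1.0" encoding="UTF-8"?>
                          <Response>
                              <Play>https://${request.headers.host}/greeting.mp3</Play>
                              <Connect>
                                  <Stream url="wss://${request.headers.host}/media-stream" />
                              </Connect>
                          </Response>`;

    reply.type('text/xml').send(twimlResponse);
});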

Solution

Option 1: Use the pre-recorded method suggested by @vb (My pick)

Option 2: Use the 3-step process suggested by @j.wischnat (probably more robust)


I’m not sure if Twilio sends an event on an incoming call, but you could listen for that event and initialize the websocket while the call is connecting. In my experience this is more than enough time for things to become ready.
On connected → fire response.create to get an answer.
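A sketch of that pre-warming idea: the incoming-call webhook itself fires when the call arrives, so the handler can start opening the websocket before returning TwiML. Here `openRealtimeSession` and `pendingSessions` are hypothetical, and `CallSid` assumes fastify is parsing Twilio’s form-encoded webhook body:

const pendingSessions = new Map();

fastify.all('/incoming-call', async (request, reply) => {
    // Pre-warm: open the OpenAI websocket while the TwiML response is
    // still being returned, before Twilio attaches the media stream.
    const callSid = request.body?.CallSid;
    pendingSessions.set(callSid, openRealtimeSession()); // hypothetical helper

    const twimlResponse = `<?xml version="1.0" encoding="UTF-8"?>
                          <Response>
                              <Connect>
                                  <Stream url="wss://${request.headers.host}/media-stream" />
                              </Connect>
                          </Response>`;
    reply.type('text/xml').send(twimlResponse);
});

// Later, in the /media-stream handler, once Twilio sends its 'start' event:
//   const openAiWs = await pendingSessions.get(msg.start.callSid);
//   openAiWs.send(JSON.stringify({ type: 'response.create' }));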

This is how I handle it in my AI Phone agent.

Feel free to use the solution by @vb if you want a more controlled first answer. :hugs:
