Realtime API: How to get the Agent to respond *audibly* from data channel events

Hey everyone 👋

I’ve been building out various integrations using the Realtime API, and I’ve found that it’s really “reactionary” - in the sense that the user must first provide some input in order for the model to return an output.

Our original hope was that the model would be proactive, but that quickly proved wrong; the hope then shifted to providing input through the data channel as a means of “prodding” the model into being more proactive - but so far all attempts have failed.

Strategies Attempted

  • In the system instructions, provide something like:
    • “You need to start the conversation”
    • “Introduce yourself and ask their name”
    • etc.
  • On the data channel onOpen event, send something like above
  • With the push of a button, send something through the data channel instructing the model to say something

So far, every attempt I’ve made to get an audio response from a data channel event has failed.

Example

Here is an exploration I’ve been working on, with the idea of creating a sort of “recruitment” or initial point of contact for someone interested in job openings. They could come here, talk to this Agent (which has a bunch of open positions fed to it), and the Agent can figure out which one(s) the person is a good fit for - and use a tool to send that info off.


Has anyone found a technique or trick to getting the model to be more proactive?

TIA :pray:

Hi,

Yes, you can supply a short audio message “behind the curtain”, so to speak. You pre-record a file of yourself saying something to prompt a reply from the model, and then introduce that as the first message behind the scenes. The user is then greeted with the AI’s reply to that message, and the conversation can continue from there.
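A rough sketch of what that might look like over the WebRTC data channel. Hedging here: `injectHiddenGreeting` is a made-up helper name, and this assumes an already-open `dataChannel` plus a pre-recorded clip base64-encoded in the session’s input audio format (pcm16); the event shapes follow the Realtime API’s `conversation.item.create` / `response.create` client events.

```javascript
// Sketch only: inject a pre-recorded "user" audio message, then ask the
// model to respond to it. `dataChannel` is an open RTCDataChannel and
// `audioBase64` is base64-encoded pcm16 audio (assumed session format).
function injectHiddenGreeting(dataChannel, audioBase64) {
  // Add the pre-recorded clip to the conversation as if the user had spoken it.
  dataChannel.send(JSON.stringify({
    type: "conversation.item.create",
    item: {
      type: "message",
      role: "user",
      content: [{ type: "input_audio", audio: audioBase64 }]
    }
  }));

  // Ask the model to reply to that hidden message.
  dataChannel.send(JSON.stringify({ type: "response.create" }));
}
```

The user never sees the injected message - they just hear the model’s reply to it.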

4 Likes

Clever. :wink:

I wonder if there is work being done to allow the data channel to elicit an audio response? :thinking:

  • Any devs listening??? :slight_smile:
1 Like

Lots going on with the Realtime API, so there should be fixes and upgrades incoming - just no timeline on that yet.

1 Like

Would this work in an ongoing sense?

  • Could I use this in the middle of the convo to programmatically keep things moving?

Do you have an example of the code that would be used to inject the audio? (I’m using WebRTC)


:pray: thanks for the knowledge!

Absolutely - you are the conductor of the conversation. You can insert whatever you like into the flow to create whatever situation you want.

I might be missing something, but are you asking how to trigger the model to start talking without any initial audio input? If so, it’s just a matter of sending a response.create message over the data channel.

In this debug trace I have not “spoken” to the model to get the initial audio response.

[19:40:08 DBG] OpenAI data channel opened.
[19:40:08 INF] Sending initial response create to first call data channel oai-events.
[19:40:08 DBG] {
  "type": "response.create",
  "response": {
    "modalities": [
      "audio",
      "text"
    ],
    "instructions": "Introduce urself. Keep it short.",
    "voice": "alloy",
    "output_audio_format": "pcm16"
  },
  "event_id": "fc07173d-5037-49b0-9b72-eb8a70e15a18"
}
[19:40:08 INF] STEP 6: Wait for ctrl-c to indicate user exit.
[19:40:10 INF] Transcript done: Hello! I'm an advanced AI here to assist you with information, tasks, and answering questions. Let's make your tasks easier together. How can I help you today?
[19:40:23 INF] Transcript done: Well, I'm designed to understand and respond to a wide range of topics, I can process information quickly, and I'm equipped with the latest knowledge up to October 2023. Plus, I can assist with a variety of tasks and even converse in multiple languages. It's kind of like having a superpower, but in the digital world! What can I help you with?
1 Like

Yes! Absolutely!

In the event where the model has called a function, and the function has returned something (like from a RAG system or something), I need a way to “prod” the model to respond.

  • Sounds like this is it!

Here I was, devising a way to use the TTS endpoint to generate a quick “Please respond” WAV file, save it into a cache, and inject that into the audio stream somehow. :joy:

That sounds like the same problem I had: once the function result is returned, the model didn’t do anything. My approach is embarrassingly crude - I send a response.create to tell the model to speak.

    private static void OnFunctionArgumentsDone(RTCDataChannel dc, OpenAIResponseFunctionCallArgumentsDone argsDone)
    {
        var result = argsDone.Name switch
        {
            "get_weather" => $"The weather in {argsDone.Arguments.GetNamedArgumentValue("location")} is sunny.",
            _ => "Unknown Function."
        };
        logger.LogInformation($"Call {argsDone.Name} with args {argsDone.ArgumentsToString()} result {result}.");

        var getWeatherResult = GetWeather(argsDone);
        logger.LogDebug(getWeatherResult.ToJson());
        dc.send(getWeatherResult.ToJson());

        // Tell the AI to continue the conversation.
        var responseCreate = new OpenAIResponseCreate
        {
            EventID = Guid.NewGuid().ToString(),
            Response = new OpenAIResponseCreateResponse
            {
                Instructions = "Please give me the answer.",
            }
        };

        dc.send(responseCreate.ToJson());
    }
1 Like

That is EXACTLY what I’m facing

  • I’ve got a RAG system I already use with text
  • I’m trying to port the system over to realtime

But when the function call comes back, and we get the supporting data that’s necessary for answering the question…

  • the model would just sit there.

:rofl:

Thanks for the tip @aza , this worked like a charm!
And thanks to @Foxalabs for your help too!

Ahh, I also tried this but had problems with non-vocal responses. If it seems to be working for you now, awesome!

1 Like

I originally was trying to only include the audio modality:

    const forceResponseEvent = {
      event_id: "event_" + Date.now(),
      type: "response.create",
      response: {
        modalities: ["audio"]
      }
    };
  • Because I was only wanting an audio response

But this would constantly return an error:

    isTrusted: true
    bubbles: false
    cancelBubble: false
    cancelable: false
    composed: false
    currentTarget: RTCDataChannel {label: 'context', ordered: true, maxPacketLifeTime: null, maxRetransmits: null, protocol: '', …}
    defaultPrevented: false
    error: OperationError: User-Initiated Abort, reason=
    eventPhase: 0
    returnValue: true
    srcElement: RTCDataChannel {label: 'context', ordered: true, maxPacketLifeTime: null, maxRetransmits: null, protocol: '', …}
    target: RTCDataChannel {label: 'context', ordered: true, maxPacketLifeTime: null, maxRetransmits: null, protocol: '', …}
    timeStamp: 5907.299999982119
    type: "error"

Once I included "text" in the modalities array, it started working.
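For anyone hitting the same error above, a minimal sketch of the variant that avoided it for me - same shape as before, just with "text" added alongside "audio" (`buildForceResponseEvent` is just an illustrative name):

```javascript
// Sketch: request both modalities explicitly. Omitting "text" was what
// triggered the data channel error shown above.
function buildForceResponseEvent() {
  return {
    event_id: "event_" + Date.now(),
    type: "response.create",
    response: {
      modalities: ["audio", "text"]
    }
  };
}

// Usage: dataChannel.send(JSON.stringify(buildForceResponseEvent()));
```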

In fact!

  • You don’t need to include anything

The following works

    function force_audio_response() {
      if (!pc || !dataChannel) {
        realtime_logs("Error: Session not active", "error");
        return;
      }

      const forceResponseEvent = {
        type: "response.create"
      };

      try {
        dataChannel.send(JSON.stringify(forceResponseEvent));
      } catch (error) {
        console.log(error);
      }
    }

Turns out all you need is:

  • type: "response.create"

I can drop this in after any function call returns, and now we have the magic!
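For the WebRTC/JavaScript side, the whole “return the tool result, then prod the model” flow might look like this. A sketch under assumptions: `sendFunctionResultAndRespond` is a made-up helper, and `callId`/`resultJson` come from your own function call handling; the `function_call_output` item shape follows the Realtime API client events.

```javascript
// Sketch: hand the tool output back to the model, then nudge it to speak.
// `dataChannel` is an open RTCDataChannel, `callId` is the call_id from the
// model's function call event, and `resultJson` is your stringified result.
function sendFunctionResultAndRespond(dataChannel, callId, resultJson) {
  // Attach the function output to the conversation.
  dataChannel.send(JSON.stringify({
    type: "conversation.item.create",
    item: {
      type: "function_call_output",
      call_id: callId,
      output: resultJson
    }
  }));

  // Without this, the model just sits there - response.create makes it answer.
  dataChannel.send(JSON.stringify({ type: "response.create" }));
}
```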

1 Like