I’ve been building out various integrations using the Realtime API, and I’ve found that it’s really “reactive”, in the sense that the user must first provide some input before the model returns any output.
Our original hope was that the model would be proactive, but that quickly proved wrong. The hope then shifted to providing input through the data channel as a means of “prodding” the model into being more proactive, but so far all attempts have failed.
Strategies Attempted
In the system instructions, provide something like:
“You need to start the conversation”
“Introduce yourself and ask their name”
etc.
On the data channel onOpen event, send something like the above
With the push of a button, send something through the data channel instructing the model to say something
So far, every attempt I’ve made to get an audio response from a data channel event has failed.
Example
Here is an exploration I’ve been working on, with the idea of creating some sort of “recruitment” or point of initial contact for someone interested in some job openings. They could come here, talk to this Agent (which has a bunch of open positions fed to it), and the Agent can figure out which one(s) the person is a good fit for - and use a tool to send that info off.
Has anyone found a technique or trick to getting the model to be more proactive?
Yes, you can supply a short audio message “behind the curtain”, so to speak. Pre-record a file of yourself saying something that prompts a reply from the model, then introduce it as the first message behind the scenes. The user is then greeted with the AI’s reply to that message, and the conversation can continue from there.
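A text-based variation of the same idea (not the pre-recorded audio described above) is to inject a hidden user message and then ask for a reply. This is only a rough sketch: `EventChannel` and `sendHiddenOpener` are hypothetical names, and the event shapes are assumed to follow the Realtime API's `conversation.item.create` and `response.create` client events.

```typescript
// Minimal stand-in for the part of RTCDataChannel used here (assumption:
// the real channel is the WebRTC data channel opened to the Realtime API).
interface EventChannel {
  send(data: string): void;
}

// Hypothetical helper: add a hidden user turn, then ask the model to answer it.
function sendHiddenOpener(channel: EventChannel, text: string): void {
  // Step 1: put a user message into the conversation without the user speaking.
  const hiddenTurn = {
    type: "conversation.item.create",
    item: {
      type: "message",
      role: "user",
      content: [{ type: "input_text", text }],
    },
  };
  channel.send(JSON.stringify(hiddenTurn));

  // Step 2: explicitly request a response; the model does not reply on its own.
  channel.send(JSON.stringify({ type: "response.create" }));
}
```

The user never sees or hears the injected message, only the model's reply to it.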
I might be missing something, but are you asking how to trigger the model to start talking without any initial audio input? If so, it’s just a matter of sending a response.create message over the data channel.
In this debug trace I have not “spoken” to the model to get the initial audio response.
[19:40:08 DBG] OpenAI data channel opened.
[19:40:08 INF] Sending initial response create to first call data channel oai-events.
[19:40:08 DBG] {
  "type": "response.create",
  "response": {
    "modalities": [
      "audio",
      "text"
    ],
    "instructions": "Introduce urself. Keep it short.",
    "voice": "alloy",
    "output_audio_format": "pcm16"
  },
  "event_id": "fc07173d-5037-49b0-9b72-eb8a70e15a18"
}
[19:40:08 INF] STEP 6: Wait for ctrl-c to indicate user exit.
[19:40:10 INF] Transcript done: Hello! I'm an advanced AI here to assist you with information, tasks, and answering questions. Let's make your tasks easier together. How can I help you today?
[19:40:23 INF] Transcript done: Well, I'm designed to understand and respond to a wide range of topics, I can process information quickly, and I'm equipped with the latest knowledge up to October 2023. Plus, I can assist with a variety of tasks and even converse in multiple languages. It's kind of like having a superpower, but in the digital world! What can I help you with?
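For anyone doing this from a browser, the same request could be sketched roughly as below. The helper names (`buildGreetingRequest`, `requestGreeting`) and the `EventChannel` interface are made up for illustration, and the field names simply mirror the trace above, so they may differ between API versions.

```typescript
// Minimal stand-in for the part of RTCDataChannel used here.
interface EventChannel {
  send(data: string): void;
}

// Shape of the event shown in the trace, reduced to the fields used here.
interface ResponseCreateEvent {
  type: "response.create";
  response: {
    modalities: string[];
    instructions: string;
  };
}

// Hypothetical helper: build a response.create event with per-response instructions.
function buildGreetingRequest(instructions: string): ResponseCreateEvent {
  return {
    type: "response.create",
    response: {
      modalities: ["audio", "text"],
      instructions,
    },
  };
}

// Hypothetical helper: serialize the event and send it over the data channel.
function requestGreeting(channel: EventChannel, instructions: string): void {
  channel.send(JSON.stringify(buildGreetingRequest(instructions)));
}
```

With a real `RTCDataChannel`, `requestGreeting` would typically be called from the channel's `open` handler, which is the point where the trace above sends its initial response.create.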
When the model has called a function and the function has returned something (from a RAG system, for example), I need a way to “prod” the model to respond.
Sounds like this is it!
Here I was devising a way to use the TTS endpoint to generate some quick “Please respond” wav file, save it into cache, and inject that into the audio stream somehow.
That sounds like the same problem I had. Once the function result was returned, the model didn’t do anything. My approach is embarrassingly crude: I send a response.create to tell the model to speak.
private static void OnFunctionArgumentsDone(RTCDataChannel dc, OpenAIResponseFunctionCallArgumentsDone argsDone)
{
    // Human-readable summary of the call; used here only for logging.
    var result = argsDone.Name switch
    {
        "get_weather" => $"The weather in {argsDone.Arguments.GetNamedArgumentValue("location")} is sunny.",
        _ => "Unknown Function."
    };

    logger.LogInformation($"Call {argsDone.Name} with args {argsDone.ArgumentsToString()} result {result}.");

    // Build the function result message and send it over the data channel.
    var getWeatherResult = GetWeather(argsDone);
    logger.LogDebug(getWeatherResult.ToJson());
    dc.send(getWeatherResult.ToJson());

    // Tell the AI to continue the conversation.
    var responseCreate = new OpenAIResponseCreate
    {
        EventID = Guid.NewGuid().ToString(),
        Response = new OpenAIResponseCreateResponse
        {
            Instructions = "Please give me the answer.",
        }
    };

    dc.send(responseCreate.ToJson());
}
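The same two-step pattern, sketched in TypeScript for a browser data channel. This is an assumption-laden sketch rather than the code above translated exactly: `EventChannel` and `sendFunctionResult` are hypothetical names, and the result is assumed to be sent as a `function_call_output` conversation item whose `call_id` comes from the function call arguments event.

```typescript
// Minimal stand-in for the part of RTCDataChannel used here.
interface EventChannel {
  send(data: string): void;
}

// Hypothetical helper: return a function result, then prod the model to speak.
function sendFunctionResult(
  channel: EventChannel,
  callId: string,
  output: unknown,
): void {
  // Step 1: hand the function result back to the model as a conversation item.
  // The output field is a string, so structured results are JSON-encoded.
  channel.send(
    JSON.stringify({
      type: "conversation.item.create",
      item: {
        type: "function_call_output",
        call_id: callId,
        output: JSON.stringify(output),
      },
    }),
  );

  // Step 2: the model stays silent after a function result unless asked,
  // so explicitly request a new response.
  channel.send(JSON.stringify({ type: "response.create" }));
}
```

Sending the response.create without extra instructions lets the model answer from the function result and the existing session instructions; per-response instructions like the "Please give me the answer." above can be added if the reply needs steering.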