Hi everyone,
First, some context. I’m using the real-time API to build an agent that runs question-and-answer sessions. One user provides a set of questions, each with its own context (e.g., “Skip the next question if…”). The agent then asks these questions to another user. This is strictly a speech-to-speech setup; the user answering the questions has no chat interface.
So far, I’ve put all of these questions and instructions into the agent’s base instructions before starting it. I realize I could move the questions into a chat message instead, but the current approach works. If you have any thoughts on this, I’d be happy to hear them.
My main concern is about function calling. In my initial use case, once the agent has asked all of its questions, I want it to call a function automatically. I’ve specified this in the agent’s instructions, but nothing happens unless the user explicitly asks the agent to call that function. In other words, I need the agent itself to decide when to invoke a function based on its own actions (e.g., when it finishes all the questions, when a question contains a media URL, …).
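For illustration, here is a minimal sketch of this kind of setup. It assumes the API in question is OpenAI's Realtime API and that the session is configured with a `session.update` event; the function name `submit_answers`, its parameter schema, and the instruction wording are all hypothetical and only stand in for the real ones.

```python
import json

# Hypothetical tool definition: the name and schema are made up for this sketch.
FINISH_TOOL = {
    "type": "function",
    "name": "submit_answers",
    "description": (
        "Call this as soon as every question has been asked and answered. "
        "Do not wait for the user to request it."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "answers": {
                "type": "array",
                "items": {"type": "string"},
                "description": "One answer per question, in order.",
            }
        },
        "required": ["answers"],
    },
}


def build_session_update(questions: list[str]) -> str:
    """Return a JSON `session.update` event carrying instructions and the tool."""
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(questions, start=1))
    instructions = (
        "Ask the following questions one at a time, by voice.\n"
        f"{numbered}\n"
        "When the last question has been answered, call submit_answers "
        "without waiting for the user to ask."
    )
    event = {
        "type": "session.update",
        "session": {
            "instructions": instructions,
            "tools": [FINISH_TOOL],
            # "auto" leaves the decision to call the function to the model.
            "tool_choice": "auto",
        },
    }
    return json.dumps(event)
```

The resulting string would be sent over the session's WebSocket connection. With a configuration like this, the function is still only called when the user asks for it.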
My workaround has been to tell the agent to include a special keyword in the transcript without speaking it aloud. I then parse the transcript, and whenever I see the keyword, I trigger the function myself. However, I’d prefer a more direct or elegant approach if one is available.
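The workaround can be sketched roughly as follows. This is a simplified illustration, assuming the transcript arrives as incremental text chunks; the class name `KeywordTrigger` and the keyword `QUESTIONS_DONE` are hypothetical.

```python
from typing import Callable


class KeywordTrigger:
    """Watch streamed transcript text and fire a callback once on a keyword."""

    def __init__(self, keyword: str, on_match: Callable[[], None]) -> None:
        self.keyword = keyword.lower()
        self.on_match = on_match
        self._buffer = ""
        self.fired = False

    def feed(self, delta: str) -> None:
        """Add one transcript chunk; call on_match the first time the keyword appears."""
        if self.fired:
            return
        self._buffer += delta.lower()
        if self.keyword in self._buffer:
            self.fired = True
            self.on_match()
            return
        # Keep only a short tail so a keyword split across two chunks is still caught.
        keep = len(self.keyword) - 1
        self._buffer = self._buffer[-keep:] if keep > 0 else ""


if __name__ == "__main__":
    trigger = KeywordTrigger("QUESTIONS_DONE", lambda: print("calling the function"))
    for chunk in ["Thank you, that was the last one. QUEST", "IONS_DONE"]:
        trigger.feed(chunk)
```

The match is case-insensitive and fires only once per trigger, so a repeated keyword does not call the function twice.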
Is there a better way to achieve this?
Best,
Géry