How can I make the agent call a function by itself?

Hi everyone,

First, some context. I’m using the Realtime API to create an agent that conducts a question-and-answer session. The user provides a set of questions with a context for each question (e.g., “Skip the next question if…”). The agent then asks these questions to another user. This is strictly a speech-to-speech setup; there’s no chat interface involved for the user answering the questions.

So far, I’ve put all of these questions and instructions into the agent’s base instructions before starting it. I realize I could move the questions into a chat message instead, but it currently works as-is. If you have any thoughts on this, I’d be happy to hear them.



My main concern is function calling. In my initial use case, once the agent has asked all its questions, I want it to call a function automatically. I’ve specified this in the agent’s instructions, but nothing happens unless the user explicitly asks the agent to call that function. In other words, I need the agent itself to decide when to invoke a function based on its own actions (e.g., when it finishes all the questions, or when a question contains a media URL, …).

My workaround has been to tell the agent to include a special keyword in the transcript without saying it aloud. I then parse the transcript, and whenever I see the keyword, I trigger the function. However, I’d prefer a more direct or elegant approach if one is available.

Is there a better way to achieve this, please?

Best,
Géry

The Pattern for Function Call Usage:

User: Is Mars in retrograde right now?
Assistant to tool: astro_calculator(object: "Mars", coordinates: "geocentric", time_range: {"start": "2024-12-22T00:00:00", "end": "2024-12-22T23:59:59"}, motion_type: "apparent")
Tool:

{
  "object": "Mars",
  "retrograde_status": false,
  "apparent_motion": {
    "direction": "prograde",
    "rate": "0.25 arcseconds/hour"
  }
}

Assistant: Mars is currently in prograde motion, moving at an apparent rate of 0.25 arcseconds per hour in the geocentric reference frame. Its next retrograde period will start on January 12, 2025, at 03:00 UTC and end on March 2, 2025, at 18:00 UTC. Let me know if you’d like to track Mars’ future motion!


The AI picks from the available functions when one of them seems to answer the user’s input.
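
With the Realtime API, that same round trip is carried as WebSocket events: the model emits a function_call output item, and you send the result back as a function_call_output item, then ask for another response. A rough sketch of the handling side (assuming the documented event names; send_event stands in for however you write JSON to the socket, and the astro_calculator result is faked here):

import json

def handle_server_event(event: dict, send_event) -> None:
    # Sketch only: `event` is a parsed server event from the Realtime WebSocket;
    # `send_event` is assumed to be a callable that serializes a dict and sends it.
    if event.get("type") != "response.done":
        return

    # A finished response may contain one or more function_call output items.
    for item in event["response"].get("output", []):
        if item.get("type") != "function_call":
            continue

        if item["name"] == "astro_calculator":
            args = json.loads(item["arguments"])
            # Faked tool result for illustration; a real app would compute it.
            result = {"object": args["object"], "retrograde_status": False}

            # Return the tool output to the conversation...
            send_event({
                "type": "conversation.item.create",
                "item": {
                    "type": "function_call_output",
                    "call_id": item["call_id"],
                    "output": json.dumps(result),
                },
            })
            # ...then ask the model to continue the spoken answer with it.
            send_event({"type": "response.create"})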

An interview scenario, where a tool output should be sent only after all the information has been gathered and never before it is complete, takes an advanced function description to counter the model’s typical behavior, along with system instructions that prescribe the conversational path.
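
Concretely, the trigger condition has to be spelled out both in the instructions and in the tool description. A sketch of such a session configuration (the submit_interview_results tool and its fields are made up for illustration; the wording is what matters):

# Hypothetical session.update for an interview flow: the instructions prescribe
# the conversational path, and the tool description repeats the trigger condition.
session_update = {
    "type": "session.update",
    "session": {
        "instructions": (
            "You are conducting a structured interview. Ask every question "
            "in order, one at a time, and wait for an answer before moving on. "
            "Only after the final question has been answered, call "
            "submit_interview_results exactly once."
        ),
        "tools": [
            {
                "type": "function",
                "name": "submit_interview_results",  # hypothetical tool name
                "description": (
                    "Call this tool only after ALL interview questions have been "
                    "asked and answered. Never call it while questions remain, "
                    "and do not wait for the user to ask for it."
                ),
                "parameters": {
                    "type": "object",
                    "required": ["answers"],
                    "properties": {
                        "answers": {
                            "type": "array",
                            "items": {"type": "string"},
                            "description": "The candidate's answers, in order",
                        }
                    },
                    "additionalProperties": False,
                },
            }
        ],
        "tool_choice": "auto",
    },
}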

Tools cannot return media, such as images for vision; they can only return text the AI can understand.

Here’s an example function, just to check that the user isn’t discussing a bad URL or something like that:

{
  "name": "verify_url",
  "description": "Sends a URL to verify that it is a valid link and not a 404.",
  "strict": true,
  "parameters": {
    "type": "object",
    "required": [
      "url"
    ],
    "properties": {
      "url": {
        "type": "string",
        "description": "The URL to be verified"
      }
    },
    "additionalProperties": false
  }
}

However, a good deal of multi-line, application-specific text is needed in the main description, such as: “call this tool any time and every time a user message mentions a URL, so you can report in your answer whether it is invalid.”

Hi,

Thanks for your response. It seems like my use case isn’t being fully addressed because my users do not provide the URL or initiate the end of the Q&A—the agent does. Additionally, I don’t want to return media; I just want to call a function and handle the rest myself.

Here is a pseudo base instruction that I pass to my agent:

Your knowledge cutoff is 2023-10. You are a helpful, witty, and friendly AI.

You are about to conduct an interview with a candidate using the provided questions...

If there is media in a question, call the show_media tool.

....

HERE ARE THE QUESTIONS:

Question 1: Who are you?
Context: blablabla

Question 2: What do you think of that?
Context: blablabla
Media: s3://<s3-url>

With these instructions, the agent doesn’t call the tool because it’s not a user-initiated request.

My workaround is:

If there is media in the question, write "show_media" in the chat without saying it.

Then, I call a function when I detect “show_media” in the transcript. This works, but I would prefer the agent to call the function directly.
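
Roughly, the detection side looks like this (a simplified sketch: I’m assuming the agent’s spoken transcript arrives in response.audio_transcript.done events, and display_media() stands in for my actual handler that fetches and shows the S3 media):

KEYWORD = "show_media"

def display_media() -> None:
    # Hypothetical application hook: fetch the question's media and show it.
    ...

def handle_transcript_event(event: dict) -> None:
    # Sketch of the workaround: watch the agent's transcript for the marker word
    # and trigger the media handler whenever it appears.
    if event.get("type") != "response.audio_transcript.done":
        return
    if KEYWORD in event.get("transcript", ""):
        display_media()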

Is there a better way to have the agent autonomously call functions based on its actions?