GPT-4o innate agentic tool use capabilities

We’ve recently started using GPT-4o tools/functions. Since GPT-3.5, we’ve mostly used JSON outputs, which work well for most of our tool-based agent flows. However, for our new hybrid text chat/voice AI assistant on our agent workflow platform, it made the most sense to use “tools”, as they’re supported by both the text and voice APIs and provide standardized interactions across modes.

Our assistant needs to support dozens of function calls, so we created different “modes” such as agent_selection_and_subscription, agent_configuration, agent_operations, user_login, etc. Each assistant “mode” has a system prompt and a corresponding array of functions to use in that mode.
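A minimal sketch of what such a mode registry can look like (the shape, names, and prompts below are placeholders, not the actual code):

```typescript
// Hypothetical shape of an assistant "mode": a system prompt plus the tool
// definitions available while that mode is active.
interface ToolDefinition {
  type: "function";
  function: {
    name: string;
    description: string;
    parameters: Record<string, unknown>; // JSON Schema for the arguments
  };
}

interface AssistantMode {
  systemPrompt: string;
  tools: ToolDefinition[];
}

// Example registry keyed by mode name (contents abbreviated).
const modes: Record<string, AssistantMode> = {
  agent_selection_and_subscription: {
    systemPrompt: "Help the user find and subscribe to an agent.",
    tools: [], // select_agent, subscribe_to_agent, ...
  },
  agent_operations: {
    systemPrompt: "Help the user run and monitor their selected agent.",
    tools: [], // start_agent, stop_agent, get_agent_status, ...
  },
};
```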

Initially, I planned to create a routing prompt and set up mode switching manually. But then o1-preview suggested I equip each assistant “mode” with the same “switch_mode” tool/function, which makes the model aware of other modes as well.
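Expressed in the standard tools schema, such a shared switch_mode function could look roughly like this (the mode list and wording are placeholders):

```typescript
// A switch_mode tool included in every mode's tool array, so the model always
// sees which other modes exist and can request a transition on its own.
const switchModeTool = {
  type: "function" as const,
  function: {
    name: "switch_mode",
    description:
      "Switch the assistant to another mode when the user's request is better handled there.",
    parameters: {
      type: "object",
      properties: {
        mode: {
          type: "string",
          enum: [
            "agent_selection_and_subscription",
            "agent_configuration",
            "agent_operations",
            "user_login",
          ],
        },
        reason: { type: "string", description: "Why the switch is needed." },
      },
      required: ["mode"],
    },
  },
};
```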

I was skeptical at first—is it really that simple? Could I really skip most of the control flow logic? And if I don’t implement it, will I lose too much control?

So far, I’ve mostly been testing the voice side of the hybrid assistant, as it’s much faster, and I’m in awe. The assistant can take a query like “Start this agent” and, if it’s in agent_selection mode, first run the select_agent tool, then the switch_mode tool (which replaces the agent system prompt and functions, except for switch_mode itself), then the start_agent tool, and it all just works! :slight_smile:
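Handling the switch_mode call itself can be as simple as swapping the session’s system prompt and tools while always re-adding switch_mode; a rough sketch (types and names are placeholders, not from the project):

```typescript
// Rough sketch of applying a mode switch: replace the system prompt and tools
// with the target mode's, but keep switch_mode so the model can always move on.
type ToolSchema = Record<string, unknown>;

interface Mode {
  systemPrompt: string;
  tools: ToolSchema[];
}

interface Session {
  systemPrompt: string;
  tools: ToolSchema[];
}

function applyModeSwitch(
  session: Session,
  modes: Record<string, Mode>,
  switchModeTool: ToolSchema,
  targetMode: string
): void {
  const mode = modes[targetMode];
  if (!mode) throw new Error(`Unknown mode: ${targetMode}`);
  session.systemPrompt = mode.systemPrompt;
  session.tools = [...mode.tools, switchModeTool];
}
```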

I’m mostly sharing this to highlight my positive experience, but also to ask - does it feel too good to be true? Has anyone experienced any major issues with this type of setup?

Of course, we’re not really deleting anything in the database, and we’ve made it easy to undo any errors the model makes. So far, in limited testing, no major errors have occurred.

p.s. Here is the work-in-progress “mode” tools code, part of our open source Your Priorities agentic engagement platform:


So really, if you think about it, you are scaffolding a set of assistants behind a parent assistant. This is pretty interesting.

The unique aspect of an assistant that you are exploiting is its system prompt and the metadata that can point to a panoply of functions. So when you call switch_mode, you could actually switch the assistant contained within the parent assistant along with its relevant functions.


@robertb It was really interesting to read your perspective on this. I have a physical device with access to 45 functions, from setting alarms and looking up news to sending texts and playing audio, which it can chain together if it so chooses (it calls itself with the result of a function and decides whether to give another response or call another function).

https://www.youtube.com/watch?v=AKz6Seb3Axs
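That kind of chaining loop looks roughly like this with the Chat Completions API (a minimal sketch; executeLocalFunction is a hypothetical stand-in for the device’s local dispatcher):

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

// Hypothetical dispatcher for the device's local functions (set_alarm,
// play_audio, send_text, ...). Stubbed here for illustration.
async function executeLocalFunction(name: string, argsJson: string): Promise<unknown> {
  return { ok: true, function: name, args: JSON.parse(argsJson) };
}

// Keep calling the model with each tool result until it replies with plain text.
async function runTurn(
  messages: OpenAI.Chat.Completions.ChatCompletionMessageParam[],
  tools: OpenAI.Chat.Completions.ChatCompletionTool[]
): Promise<string | null> {
  while (true) {
    const response = await openai.chat.completions.create({
      model: "gpt-4o",
      messages,
      tools,
    });

    const message = response.choices[0].message;
    messages.push(message);

    // No tool calls: the model chose a direct response, so the turn is done.
    if (!message.tool_calls || message.tool_calls.length === 0) {
      return message.content;
    }

    // Otherwise run each requested function and feed its result back in.
    for (const call of message.tool_calls) {
      const result = await executeLocalFunction(
        call.function.name,
        call.function.arguments
      );
      messages.push({
        role: "tool",
        tool_call_id: call.id,
        content: JSON.stringify(result),
      });
    }
  }
}
```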

It seems to work well, but I have been wondering what I’ll do when it starts to struggle to pick between functions (I’m amazed it can handle 45, but it does sometimes loop, running the same function several times in a row).

I did consider some mechanism to give “focus” which would be similar to what you’re proposing about switching “mode”.

I tried very briefly by asking the agent to repeatedly respond with a tag (inside which was a JSON object of its choosing) while it was tracking the information about a real-life situation (such as the updates to a football game, or a game of poker), but I got the sense that this was adding more complexity on top of the existing 45 things it has to process - it was quite unreliable. So your idea of “mode switching” and giving more focussed access to system prompts and tools is very interesting.

I do wonder whether thinking about it having “focus” rather than a “mode” might be an interesting way to think about it. A system may have more than one focus at a time, and a focus can be reinforced each cycle (if the focus is ongoing) or “fade” over a series of cycles until it is no longer needed.

In practical terms I mean: if the user has recently asked about reading the news and playing music, then the appropriate functions and system prompt additions can be injected for those two focuses… then they get removed if the conversation has not focussed on those subjects for 5 “turns” of the conversation.
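A minimal sketch of what such a fading “focus” tracker could look like (all names and the 5-turn TTL here are illustrative):

```typescript
// Hypothetical "focus" tracker: each focus carries a time-to-live measured in
// conversation turns; mentioning it again refreshes the TTL, and expired
// focuses drop out, taking their injected tools and prompt fragments with them.
const FOCUS_TTL_TURNS = 5;

interface Focus {
  name: string; // e.g. "news", "music"
  turnsRemaining: number;
}

class FocusTracker {
  private focuses = new Map<string, Focus>();

  // Reinforce (or start) a focus when the current turn touches it.
  reinforce(name: string): void {
    this.focuses.set(name, { name, turnsRemaining: FOCUS_TTL_TURNS });
  }

  // Call once per conversation turn: decay every focus and drop expired ones.
  tick(): void {
    for (const [name, focus] of this.focuses) {
      focus.turnsRemaining -= 1;
      if (focus.turnsRemaining <= 0) this.focuses.delete(name);
    }
  }

  // The active focuses decide which tools and prompt additions get injected.
  active(): string[] {
    return [...this.focuses.keys()];
  }
}
```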

Maybe I am overcomplicating things :slight_smile:

Sorry that this is conjecture rather than real-world experience, but just to say your post really made me think about how a pattern like “mode” or “focus” switching might work, so thank you. I will look more into your code soon to see if I can get some more insight.


Thank you for your thoughts on this; it’s all so new yet exciting to wrap one’s head around! :slight_smile: Yes, “focus” could actually be another good way to think about this.

Since the early days of GPT-3.5, it’s made the most sense to keep each system prompt as simple as possible for maximum reliability. The goal was to ask for only one thing, one step, etc. - then use logic to chain, loop, or otherwise break down complex tasks into hand-crafted chains of thought or hand-crafted “system 2” thinking processes.

This simple system-prompt strategy has worked really well as the models have improved, making many tasks, even complex ones, reliable enough for our multi-scale agents to work 99%+ of the time. We mostly use JSON outputs, with the occasional hybrid CoT step followed by JSON.

With our new voice assistant and the voice API, tools are the most practical way to generate outputs for actions in a unified manner.

Then I read this from the official tools documentation:

We recommend that you use no more than 20 functions in a single tool call. Developers typically see a reduction in the model’s ability to select the correct tool once they have between 10-20 tools.

This aligns with our experience: clearly, the fewer tools, the better. And of course, this also applies to the system prompt, which sometimes refers to the tools or something connected to them - more tools lead to more complex system prompts.

Hence the “modes,” which are now evolving to also have “states” or perhaps “focuses” - a concept I’m coding and will check in later today. For example, when you’re not logged in, there’s a whole suite of tools the model shouldn’t see; similarly, when logged in, you don’t need the login tool. When an agent is running, you don’t need the run-agent tool or any related system prompt instructions.
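One simple way to express that kind of state gating is to attach a relevance predicate to each tool and filter the list before every request; a rough sketch with made-up state fields and tool names:

```typescript
// Hypothetical state-gated tool list: each tool declares when it is relevant,
// and only tools whose predicate passes are sent to the model.
interface AssistantState {
  loggedIn: boolean;
  agentRunning: boolean;
}

interface GatedTool {
  definition: object; // the usual tools/function schema
  isRelevant: (state: AssistantState) => boolean;
}

const gatedTools: GatedTool[] = [
  { definition: { /* user_login schema */ }, isRelevant: (s) => !s.loggedIn },
  { definition: { /* start_agent schema */ }, isRelevant: (s) => s.loggedIn && !s.agentRunning },
  { definition: { /* stop_agent schema */ }, isRelevant: (s) => s.loggedIn && s.agentRunning },
];

// Build the tool array for the current request from the current state.
function toolsForState(state: AssistantState): object[] {
  return gatedTools.filter((t) => t.isRelevant(state)).map((t) => t.definition);
}
```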

I’d be happy for any feedback. In the end, the core mission is, at each stage of the assistant process, to expose the model to as few tools as possible, with the simplest system prompts.

Thank you so much for expanding more on your experience here. It’s very insightful. Hope you get some good engagement for your project; it looks very interesting.


I’ve shared a video with results from those experiments here: Exploring a New UX for Multi-Agent AI with the Realtime API

We ended up implementing a simple state-machine that dynamically adds only the tools relevant to each stage of a workflow. This approach has been working really well. While it required quite a bit of scaffolding and guardrails, the strategy of dynamically changing the tools and showing the model as few functions as possible at any given time has proven effective.
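For anyone curious about the general shape, a minimal sketch of such a stage-based tool map (stage names, tools, and transitions here are illustrative, not the actual implementation):

```typescript
// Hypothetical workflow state machine: each stage lists the tools exposed
// while it is active and the stages it can legally transition to.
interface WorkflowStage {
  tools: string[];       // names of tool definitions to include in the request
  transitions: string[]; // stages reachable from here
}

const workflow: Record<string, WorkflowStage> = {
  login: { tools: ["user_login"], transitions: ["select_agent"] },
  select_agent: {
    tools: ["select_agent", "subscribe_to_agent"],
    transitions: ["operate_agent"],
  },
  operate_agent: {
    tools: ["start_agent", "stop_agent", "get_agent_status"],
    transitions: ["select_agent"],
  },
};

// Guardrail: only allow transitions the current stage explicitly permits.
function advance(current: string, next: string): string {
  if (!workflow[current]?.transitions.includes(next)) {
    throw new Error(`Invalid transition ${current} -> ${next}`);
  }
  return next;
}
```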