Tool Use Differences Between gpt-oss-20b and o3-mini in Multi-Agent Setup

Hi everyone,

I’m currently running gpt-oss-20b locally using Ollama on an i9 machine with 32GB of RAM, with num_thread=20 in the Modelfile to speed things up. It’s working well with a multi-agent chat setup based on the OpenAI Agents SDK in Python.
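
For context, the wiring is roughly this (simplified sketch, not my actual code; the model tag and client setup are just the usual Ollama defaults):

```python
# Rough sketch: the OpenAI Agents SDK pointed at Ollama's OpenAI-compatible
# endpoint, so gpt-oss-20b serves the agents locally.
from agents import Agent, Runner, OpenAIChatCompletionsModel, set_tracing_disabled
from openai import AsyncOpenAI

set_tracing_disabled(True)  # optional: no OpenAI key needed for local-only runs

# Ollama exposes an OpenAI-compatible API at localhost:11434/v1 by default.
local_client = AsyncOpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

local_model = OpenAIChatCompletionsModel(
    model="gpt-oss:20b",          # the tag as pulled into Ollama
    openai_client=local_client,
)

assistant = Agent(
    name="Assistant",
    instructions="You are a helpful assistant.",
    model=local_model,
)

print(Runner.run_sync(assistant, "Hello!").final_output)
```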

While the overall results are good enough for basic use, I’ve noticed that it’s not really comparable to o3-mini in terms of tool selection during extended conversations. In some cases, it misses the appropriate tool or doesn’t select any tool when it should.

Are there any tips or best practices I should consider to improve tool usage behavior with gpt-oss-20b?

Thanks in advance!
Giulio

Hello,

Not sure it applies to your setup, but have you already checked issue #11704 in the ollama/ollama GitHub repo?

Well, that one doesn’t apply to my setup, but thanks anyway. Here’s what I’m seeing on the gpt-oss-20b side: tool selection degrades after a few handoffs in multi-agent conversations, and I suspect the difference comes down to context window length (I now notice that o3-mini has a 200K-token context).


Hello and welcome to the forum!
What context length (in tokens) are you running? What reasoning effort have you set, and have you peeked at the reasoning tokens it produces?

Generally speaking, even before the oss models, I and others have found the most success by reducing the number of tools provided to the model as much as possible at runtime, so it doesn’t get confused. If you give it 150 tools when you’re only chatting and need internet access, that can skew things a bit. Obviously every use case is different, but that’s typically the best rule of thumb when it comes to tool use.
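
To illustrate what I mean, roughly (sketch only; the tools and the keyword routing are placeholder stand-ins for whatever selection logic you trust):

```python
# Sketch of runtime tool trimming: only hand the model the tools relevant to
# the current request instead of the full catalogue. The routing here is a
# naive keyword check purely for illustration.
from agents import Agent, Runner, function_tool

@function_tool
def web_search(query: str) -> str:
    """Search the web and return a short summary."""
    return f"(stub) results for {query!r}"

@function_tool
def read_file(path: str) -> str:
    """Read a local file and return its contents."""
    return f"(stub) contents of {path}"

TOOL_GROUPS = {"web": [web_search], "files": [read_file]}

def pick_tools(user_message: str):
    # Swap in whatever routing you actually trust (classifier, regex, etc.).
    if "search" in user_message or "http" in user_message:
        return TOOL_GROUPS["web"]
    if "file" in user_message:
        return TOOL_GROUPS["files"]
    return []  # plain chat: no tools at all

def run_turn(user_message: str) -> str:
    agent = Agent(
        name="Assistant",  # model/config omitted; same local model as usual
        instructions="Answer the user. Use a tool only if one is provided and needed.",
        tools=pick_tools(user_message),
    )
    return Runner.run_sync(agent, user_message).final_output
```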

Otherwise, there are several tweaks and params we can try with these oss models to find the right balance of reasoning and improve the success rate.

Note too, you can take the data you accumulate and eventually use it for fine-tuning to further optimize the model for your specific use case. I’m guessing you’re running a 5090 or something, so fine-tuning a 20b model should be somewhat doable. Maybe.

The context length is usually around 10K to 20K tokens, with a medium level of reasoning effort. If I remember correctly, the reasoning tokens it produced were accurate, but - surprisingly - the subsequent tool calls were not.
Open-weight models aside, everything works fine with o3-mini, as I mentioned.

I am providing only the necessary tools, roughly 25 for the coordinator/meta agent (not including 9 hand-offs) and about 10 for each of the 9 specialized agents, so the potential total is around 110 tools, plus 10 underlying hand-off tool calls.
However, when only 3 agents (1 coordinator and 2 specialized) are active, the number of required tools should be at most around 50, typically including 2 hand-offs to the specialized agents and 2 back to the coordinator.
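
To make the shape concrete, it’s roughly like this (heavily trimmed sketch; the agent names are invented and the empty tool lists stand in for the real ones):

```python
# Trimmed sketch of the topology described above: one coordinator that hands
# off to specialized agents, each of which can hand back when it is done.
from agents import Agent, Runner

billing_agent = Agent(
    name="Billing",
    instructions="Handle billing questions. Hand back to the Coordinator when done.",
    tools=[],          # ~10 billing tools in the real setup
)

support_agent = Agent(
    name="Support",
    instructions="Handle technical support. Hand back to the Coordinator when done.",
    tools=[],          # ~10 support tools in the real setup
)

coordinator = Agent(
    name="Coordinator",
    instructions="Route each request to the right specialist, or answer directly.",
    tools=[],                                  # ~25 general tools in the real setup
    handoffs=[billing_agent, support_agent],   # 9 specialists in the real setup
)

# Hand-offs back to the coordinator
billing_agent.handoffs = [coordinator]
support_agent.handoffs = [coordinator]

result = Runner.run_sync(coordinator, "I was charged twice last month.")
print(result.final_output)
```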

I would prefer to avoid fine-tuning the gpt-oss-20b model, but if that proves absolutely necessary, I might simply stick with o3-mini in the cloud instead. For this use case, gpt-oss-120b on an H100 seemed roughly equivalent to o3-mini.

Interesting….hmmm

Do you have a snippet of the reasoning that mentioned the tools, and the resulting JSON output?

I guess the first tweaks I would try would be:

  1. Increase reasoning effort to high (see the sketch after this list)

  2. Decrease the provided tools to 1-5, then scale up until it starts flubbing again. That way you’ll have a better idea of how many tools the model can realistically handle. 20b models are still small little critters, so expect the tool counts they can handle to be lower than API models or the 120b model.
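
For tweak 1, something like this should be enough (sketch only; whether the reasoning setting actually reaches the model depends on how faithfully your runner maps the OpenAI params, and with gpt-oss a “Reasoning: high” line in the system prompt is a common fallback):

```python
# Sketch of tweak 1: request high reasoning effort via the Agents SDK's
# ModelSettings. The "Reasoning: high" line in the instructions is a fallback
# hint for gpt-oss runners that ignore the reasoning parameter.
from agents import Agent, ModelSettings
from openai.types.shared import Reasoning

agent = Agent(
    name="Assistant",
    instructions=(
        "Reasoning: high\n"
        "You are a helpful assistant."
    ),
    model_settings=ModelSettings(reasoning=Reasoning(effort="high")),
    tools=[],   # tweak 2: start with 1-5 tools here and scale up gradually
)
```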

As a general rule of thumb, I don’t tend to provide any model more than 10 tools at once, ideally no more than 5. I may use a model to orchestrate other calls, with each call again being provided no more than 5-10 tools, but each call is modular and independent of the others. So each model call only sees 5-10 tools, even though the system as a whole has more happening at once. This is how I’ve worked around model confusion when tool lists start to get bloated. The orchestrator doesn’t even know or care what tools its siblings have available; it just knows to call the agent that can handle the task. Any time somebody provides 50-100 tools in a single call to a model, its success rate seems to consistently decrease, and those folks end up here in the forum asking about it.
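
In your framework the equivalent looks roughly like this (illustrative sketch using the SDK’s agents-as-tools mechanism; the names and tool lists are made up):

```python
# Sketch of the "each call only sees a handful of tools" pattern: specialists
# keep their own small tool sets, and the orchestrator only sees the
# specialists themselves as tools, never their underlying tools.
from agents import Agent, Runner

research_agent = Agent(
    name="Research",
    instructions="Answer research questions.",
    tools=[],        # its own 5-10 tools; invisible to the orchestrator
)

writer_agent = Agent(
    name="Writer",
    instructions="Draft and edit text.",
    tools=[],        # its own 5-10 tools; invisible to the orchestrator
)

orchestrator = Agent(
    name="Orchestrator",
    instructions="Delegate each task to the agent that can handle it.",
    tools=[
        research_agent.as_tool(
            tool_name="research",
            tool_description="Look up facts and summarize findings.",
        ),
        writer_agent.as_tool(
            tool_name="write",
            tool_description="Draft or edit a piece of text.",
        ),
    ],
)

result = Runner.run_sync(orchestrator, "Summarize the topic and draft a short post.")
print(result.final_output)
```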

I think this is the kind of system you’ve built / are trying to build, but I’m not quite sure.

Split the context, then. You don’t need to make an all-knowing agent.

There’s been a misinterpretation. I actually meant the opposite: I don’t have an all-knowing agent, but rather specialized agents with specialized context and tools. I’m using the OpenAI Agents framework to orchestrate all this.
Additionally, it works perfectly well with o3-mini. My only fundamental observation is that the gpt-oss:20b model is not really as comparable to o3-mini as OpenAI’s blog post suggested. As far as I can see, the 120b model is the equivalent instead.
That’s it.

Maybe you just need Harmony?

Of course Harmony is needed to communicate with the gpt-oss model, while openai-agents-python calls OpenAI-style endpoints. The ‘translation’ is handled by model runners like Ollama, vLLM (typically on H100/H200), or more fundamentally by llama.cpp-based solutions. These runners should expose (aside from some bugs or partial implementations) OpenAI-compatible APIs for local models.
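
So from the application side you just speak the standard Chat Completions API and let the runner deal with Harmony. Roughly (minimal sketch; the tool here is made up purely for illustration):

```python
# Minimal sketch: a plain OpenAI-style tool call against Ollama's compatible
# endpoint. The runner renders this into Harmony for gpt-oss and parses the
# model's Harmony output back into a standard tool_calls response.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",                  # illustrative tool
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "What's the weather in Rome?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
```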

I may be old-school, but I still don’t trust native tool calling. I have the model build JSON and decide based on that, because I can do a schema check, dismiss the output, and repeat the call until it does what I want.
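
Something like this, roughly (sketch with jsonschema against a local endpoint; the schema, prompt, and model name are just placeholders):

```python
# Sketch of the validate-and-retry loop: ask for raw JSON, check it against a
# schema, and repeat the call until it conforms (or give up).
import json

from jsonschema import ValidationError, validate
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

SCHEMA = {
    "type": "object",
    "properties": {
        "action": {"type": "string", "enum": ["search", "read_file", "none"]},
        "argument": {"type": "string"},
    },
    "required": ["action", "argument"],
    "additionalProperties": False,
}

def get_action(user_message: str, max_attempts: int = 3) -> dict:
    prompt = (
        "Reply with JSON only, matching this schema:\n"
        + json.dumps(SCHEMA)
        + "\nUser message: " + user_message
    )
    for _ in range(max_attempts):
        reply = client.chat.completions.create(
            model="gpt-oss:20b",
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        try:
            data = json.loads(reply)
            validate(instance=data, schema=SCHEMA)
            return data              # passed the schema check
        except (json.JSONDecodeError, ValidationError):
            continue                 # dismiss and repeat the call
    raise RuntimeError("Model never produced valid JSON")
```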