I’m currently running gpt-oss-20b locally with Ollama on an i9 machine with 32GB of RAM, with `num_thread` set to 20 in the Modelfile to speed things up. It’s working well with a multi-agent chat setup built on the OpenAI Agents SDK in Python.
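Roughly, the wiring looks like this; the base URL, model tag, and the sample tool are just illustrative placeholders, not my actual configuration:

```python
# Minimal sketch: pointing the OpenAI Agents SDK at Ollama's
# OpenAI-compatible endpoint. Model tag, URL, and the sample tool
# are illustrative placeholders.
from agents import Agent, Runner, OpenAIChatCompletionsModel, function_tool, set_tracing_disabled
from openai import AsyncOpenAI

set_tracing_disabled(True)  # no OpenAI key needed for the local setup

ollama_client = AsyncOpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible API
    api_key="ollama",                      # any non-empty string works locally
)

local_model = OpenAIChatCompletionsModel(
    model="gpt-oss:20b",
    openai_client=ollama_client,
)

@function_tool
def get_time(city: str) -> str:
    """Return the current time for a city (stub for illustration)."""
    return f"12:00 in {city}"

assistant = Agent(
    name="Assistant",
    instructions="Answer concisely and call tools when needed.",
    model=local_model,
    tools=[get_time],
)

result = Runner.run_sync(assistant, "What time is it in Rome?")
print(result.final_output)
```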
While the overall results are good enough for basic use, I’ve noticed that it’s not really comparable to o3-mini in terms of tool selection during extended conversations. In some cases, it misses the appropriate tool or doesn’t select any tool when it should.
Are there any tips or best practices I should consider to improve tool usage behavior with gpt-oss-20b?
Well, not in my setup, but thanks anyway. Here’s what I’m seeing on the gpt-oss-20b side: tool selection degrades after a few handoffs in multi-agent conversations, and I suspect the difference comes down to context window length (I now notice that o3-mini has a 200K-token context window).
Hello and welcome to the forum!
What context length are you running it at? What reasoning effort have you set, and have you peeked at the reasoning tokens it produces?
Generally speaking, even before the oss models, I and others have had the most success by reducing the number of tools provided to the model as much as possible at runtime, so it doesn’t get confused. If you give it 150 tools but you’re only chatting and need internet access, that can skew things a bit. Obviously each use case is different, but that’s typically the best rule of thumb when it comes to tool use.
Otherwise, there are several tweaks and params we can try with these oss models to find the right balance of reasoning to improve its success rate.
Note too that you can take the data you accumulate and eventually turn it into a fine-tuning set to further optimize the model for your specific use case. I’m guessing you’re running a 5090 or something, so fine-tuning a 20b model should be somewhat doable. Maybe.
The context length is usually around 10K to 20K tokens, with a medium level of reasoning effort. If I remember correctly, the reasoning tokens it produced were accurate, but - surprisingly - the subsequent tool calls were not.
As I mentioned, everything works fine with o3-mini; the issue only shows up with the open-weight models.
I am providing only the necessary tools: roughly 25 for the coordinator/meta agent (not counting the 9 handoffs) and about 10 for each of the 9 specialized agents, so the potential total is around 110 tools, plus the 10 underlying handoff tool calls.
However, when only 3 agents are active (1 coordinator and 2 specialized), the number of tools in play should be at most around 50, typically including 2 handoffs to the specialized agents and 2 back to the coordinator.
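Schematically, the topology is something like the sketch below, trimmed down to two specialists; the agent names and tools are placeholders, not my real ones:

```python
# Sketch of the agent topology: a coordinator that hands off to specialized
# agents, each of which can hand back. Names and tools are placeholders;
# the model is omitted here, but in my setup it points at the local runner.
from agents import Agent, function_tool, handoff

@function_tool
def search_invoices(query: str) -> str:
    """Placeholder domain tool."""
    return f"results for {query}"

@function_tool
def create_report(topic: str) -> str:
    """Placeholder domain tool."""
    return f"report on {topic}"

billing_agent = Agent(
    name="Billing",
    instructions="Handle billing questions. Hand back to the coordinator when done.",
    tools=[search_invoices],  # ~10 domain tools each in the real setup
)

reporting_agent = Agent(
    name="Reporting",
    instructions="Produce reports. Hand back to the coordinator when done.",
    tools=[create_report],
)

coordinator = Agent(
    name="Coordinator",
    instructions="Route the conversation to the right specialist.",
    tools=[],  # ~25 coordinator tools in the real setup
    handoffs=[billing_agent, reporting_agent],  # 9 specialists in the real setup
)

# Specialists can hand the conversation back to the coordinator.
billing_agent.handoffs = [handoff(coordinator)]
reporting_agent.handoffs = [handoff(coordinator)]
```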
I would prefer to avoid fine-tuning the gpt-oss-20b model, but if that proves absolutely necessary, I might simply stick with o3-mini in the cloud instead. For this use case, gpt-oss-120b on an H100 seemed roughly equivalent to o3-mini.
Do you have a snippet of the reasoning that mentioned the tools, and the resulting JSON output?
I guess the first tweaks I would try would be:

1. Increase reasoning effort to high (see the sketch after this list).
2. Decrease the provided tools to 1-5, then scale up until it starts flubbering again. That way you’ll have a better idea of how many tools the model can realistically handle. 20b models are still small little critters, so expect the tool counts they can handle to be lower than for API models or the 120b model.
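For the reasoning effort, something along these lines should work if your runner passes the setting through; I haven’t verified that every OpenAI-compatible server honors it, so treat it as a sketch:

```python
# Sketch: bumping reasoning effort to high via ModelSettings.
# Whether a local runner (Ollama, vLLM, llama.cpp server) actually honors
# the setting depends on its OpenAI-compatibility layer, so verify it by
# inspecting the reasoning tokens in the output.
from agents import Agent, ModelSettings
from openai.types.shared import Reasoning

agent = Agent(
    name="Assistant",
    instructions="Think carefully before choosing a tool.",
    model_settings=ModelSettings(reasoning=Reasoning(effort="high")),
)
```

If I remember right, gpt-oss can also pick up a reasoning level from the system prompt, so that’s another knob worth trying.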
As a general rule of thumb, I don’t tend to provide any model more than 10 tools at once, ideally no more than 5. I may use a model to orchestrate other calls, with each of those calls again being given no more than 5-10 tools, but each call is modular and independent of the others. So each model call only ever sees 5-10 tools, even though the system as a whole has more going on at once. This is how I’ve worked around model confusion when tool lists start to get bloated. Not even the orchestrator knows or cares what tools its siblings have available; it just knows to call the agent that can handle the task. Any time somebody provides 50-100 tools in a single call to a model, its success rate seems to consistently decrease, and then those folks end up here in the forum asking about the decreased success rate.
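To make that concrete, the pattern I’m describing looks roughly like this; the agents, tools, and names are made up, the point is just that each model call only ever sees its own handful of tools:

```python
# Sketch of the modular pattern: an orchestrator that only sees sub-agents
# exposed as tools, while each sub-agent only sees its own 5-10 tools.
# All names here are made up for illustration.
from agents import Agent, Runner, function_tool

@function_tool
def web_search(query: str) -> str:
    """Placeholder search tool."""
    return f"search results for {query}"

@function_tool
def read_file(path: str) -> str:
    """Placeholder file tool."""
    return f"contents of {path}"

research_agent = Agent(
    name="Research",
    instructions="Answer research questions using your tools.",
    tools=[web_search],  # keep each agent's list at 5-10 tools max
)

files_agent = Agent(
    name="Files",
    instructions="Work with local files using your tools.",
    tools=[read_file],
)

# The orchestrator never sees web_search or read_file directly;
# it only knows which agent to call for which kind of task.
orchestrator = Agent(
    name="Orchestrator",
    instructions="Delegate each task to the agent that can handle it.",
    tools=[
        research_agent.as_tool(
            tool_name="ask_research",
            tool_description="Delegate research questions.",
        ),
        files_agent.as_tool(
            tool_name="ask_files",
            tool_description="Delegate file operations.",
        ),
    ],
)

result = Runner.run_sync(orchestrator, "Find recent papers on tool calling.")
print(result.final_output)
```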
I think this is the kind of system you’ve built / are trying to build, but I’m not quite sure.
There’s been a misinterpretation. I actually meant the opposite — I don’t have an all-knowing agent but rather specialized agents with specialized context and tools. I’m using the OpenAI agent framework to orchestrate all this.
Additionally, it works perfectly well with o3-mini. My only fundamental observation is that the gpt-oss:20b model is not really comparable to o3-mini, as OpenAI presented it in their blog post. As far as I can see, the 120b model would be the equivalent instead.
That’s it
Of course, Harmony is needed to communicate with the gpt-oss model, while openai-agents-python calls OpenAI endpoints. So the ‘translation’ is done by model runners like Ollama, vLLM (typically on H100/H200), or more fundamentally by llama.cpp-based solutions. These runners should provide (aside from some bugs or partial implementations) OpenAI-compatible APIs for local models.
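In other words, from the client’s side it’s just the usual Chat Completions surface, and the runner does the Harmony formatting underneath. A minimal sketch, with an illustrative endpoint, model tag, and tool schema:

```python
# Sketch: a plain Chat Completions request with tools against a local,
# OpenAI-compatible runner. The runner handles the Harmony formatting;
# the endpoint, model tag, and tool schema here are illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "What's the weather in Rome?"}],
    tools=tools,
)

print(response.choices[0].message.tool_calls)
```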
I may be old school, but I still don’t trust native tool calling. I let the model build a JSON object and decide based on that, because I can do a schema check and dismiss and repeat the call until it does what I want.
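Roughly what I mean, as a sketch; the endpoint, model tag, schema, prompt, and retry count are all arbitrary examples:

```python
# Sketch: ask the model for raw JSON instead of a native tool call,
# validate it against a schema, and retry until it passes.
# Endpoint, model tag, schema, and retry count are arbitrary examples.
import json
from openai import OpenAI
from pydantic import BaseModel, ValidationError

class ToolRequest(BaseModel):
    tool: str
    arguments: dict

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

PROMPT = (
    'Reply with JSON only, matching {"tool": <tool name>, "arguments": {...}}. '
    "Available tools: get_weather(city), get_time(city)."
)

def get_validated_call(user_message: str, max_retries: int = 3) -> ToolRequest:
    for _ in range(max_retries):
        response = client.chat.completions.create(
            model="gpt-oss:20b",
            messages=[
                {"role": "system", "content": PROMPT},
                {"role": "user", "content": user_message},
            ],
        )
        raw = response.choices[0].message.content or ""
        try:
            return ToolRequest.model_validate(json.loads(raw))
        except (json.JSONDecodeError, ValidationError):
            continue  # dismiss and repeat the call
    raise RuntimeError("Model never produced a valid tool request")

print(get_validated_call("What's the weather in Rome?"))
```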