Hello everyone,
I’m developing a MAS (Multi-Agent System) platform where users can create their own MASs. At its core it uses openai-agents-js. Users create agents by writing instructions and handoff descriptions and by selecting tools and handoffs, which together constitute a MAS. Each agent’s model can also be customized.
Everything works, but the system is unreliable, and simply switching models or APIs doesn’t help since each has its own deficiencies. Here are the most common issues I’ve experienced across different models/APIs:
- **Multiple Outputs (rare)**
  - Occurs with the GPT-4.1 family, with a higher occurrence as model complexity decreases.
  - Possible solution: a basic check & retry mechanism would work since the problem is rare (see the sketch below).
  - Community suggests this is related to the Responses API: How to prevent the API returning multiple outputs? - #44 by bragma
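For the check & retry idea, here is a minimal sketch in TypeScript. It assumes the JS Agents SDK exposes the turn’s items on `result.newItems` with a `message_output_item` type (mirroring the Python SDK’s `new_items`); adjust the check to whatever your SDK version actually returns.

```ts
import { Agent, run } from '@openai/agents';

// Retry the run when a single turn produces more than one assistant message.
// Assumption: result.newItems exists and message items carry
// type === 'message_output_item' (mirrors the Python SDK's new_items).
async function runWithRetry(agent: Agent, input: string, maxRetries = 2) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const result = await run(agent, input);
    const messageCount = result.newItems.filter(
      (item: any) => item.type === 'message_output_item',
    ).length;
    if (messageCount <= 1) return result; // normal case: one final message
    // otherwise fall through and retry the whole run from scratch
  }
  throw new Error('Still getting multiple outputs after retries');
}
```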
- **Structured Output: Model Doesn’t Follow Strict Schema (rare)**
  - Occurs with the GPT-4.1 family and the Responses API.
  - Possible solution: providing the schema as a zod object instead of raw JSON helps (see the sketch below).
  - More info: GPT4.1 doesn't follow strict json schema
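For reference, this is roughly what the zod-based approach looks like with openai-agents-js. The schema fields (orderId, status) are made up for illustration; the point is passing the zod object as `outputType` instead of a hand-written JSON schema.

```ts
import { z } from 'zod';
import { Agent, run } from '@openai/agents';

// Placeholder schema: the fields are illustrative, not from my platform.
const OrderStatus = z.object({
  orderId: z.string(),
  status: z.enum(['pending', 'shipped', 'delivered']),
});

const agent = new Agent({
  name: 'Order status agent',
  instructions: 'Answer with the current status of the given order.',
  outputType: OrderStatus, // zod object instead of a raw JSON schema
});

const result = await run(agent, 'Where is order 1234?');
console.log(result.finalOutput); // parsed & validated against OrderStatus
```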
- **Model Returns Output + Function Call**
  - This leads to two problems:
    - The Agents SDK throws an exception in the next run due to the missing tool output in the history.
    - If the function call is something like “get_order_details,” the output is often: “I’m fetching the order details now, please hang tight.” However, the user never receives the actual order details because the agent can’t send them after producing its output.
  - Responses API returns message + function_call - #17 by markns
  - Agent run with previous_response_id fails - No tool output found for function call call_WdnUUKXKvwy3jk.... · Issue #1061 · openai/openai-agents-python · GitHub
  - toolUseBehavior: run_llm_again (default) doesn't work · Issue #180 · openai/openai-agents-js · GitHub
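One mitigation I can think of (not a fix, just a way to avoid the exception on the next run) is to sanitize the history before feeding it back: drop any function_call item that has no matching function_call_output. A rough sketch, assuming Responses-API-style items where calls and outputs are linked by a callId; the exact field names may differ in your SDK version:

```ts
// Item shape is an assumption: adjust `type` / `callId` to the actual
// history items your SDK version produces.
type HistoryItem = { type?: string; callId?: string; [key: string]: unknown };

function dropOrphanFunctionCalls(history: HistoryItem[]): HistoryItem[] {
  // ids of function calls that actually have an output in the history
  const answered = new Set(
    history
      .filter((item) => item.type === 'function_call_output')
      .map((item) => item.callId),
  );
  // keep everything except function calls whose output never arrived
  return history.filter(
    (item) => !(item.type === 'function_call' && !answered.has(item.callId)),
  );
}

// Hypothetical usage: clean the previous turn before starting the next one,
// e.g. const nextInput = dropOrphanFunctionCalls(previousResult.history);
```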
- **Instruction Following**
  - Most models, but especially gpt-4.1-nano, gpt-4.1-mini, o4-mini, and o3-mini, are very bad at following instructions. I’m not even talking about gpt-5, since it is not even “usable” with the Agents SDK. Even the most basic tasks that are explained in the prompt repeatedly and tested multiple times can fail the next time, which reduces reliability. o3 performs okay, but it is not perfect either.
  - GPT-5 does not work well compared to 4.1 · Issue #1397 · openai/openai-agents-python · GitHub
- **GPT-5 + Responses API is extremely slow**
  - GPT-5 is very slow compared to 4.1 (Responses API) - #2 by OctoSharko
- **Tool Calling - Model Gives Promises That Cannot Be Kept**
  - This is a well-known problem. It is especially common with gpt-4.1-mini and gpt-4.1-nano. Say the agent has a “get_order_details” tool and the customer gave their order id. Instead of producing a tool call, the model says “Hang tight, I’m fetching your order details” but never produces the function call. This is different from the “Model Returns Output + Function Call” scenario, since here no function call is produced by the model at all (a detection sketch is below).
  - GPT-4.1 sometimes says “One moment…” / “Hang tight…” but never calls a tool
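A workaround that might help here is a heuristic check: if the final output sounds like a promise (“hang tight”, “one moment”, …) but no tool was called in that turn, re-run once with an explicit reminder. A sketch, assuming `result.newItems` and a `tool_call_item` type as in the Python SDK; the phrase list is obviously something you would tune per agent:

```ts
import { Agent, run } from '@openai/agents';

// Phrases that usually signal an "empty promise"; tune these per agent.
const PROMISE_PATTERNS = [/hang tight/i, /one moment/i, /fetching/i, /let me check/i];

async function runWithoutEmptyPromises(agent: Agent, input: string) {
  const result = await run(agent, input);
  // Assumption: tool calls show up in newItems with type === 'tool_call_item'.
  const calledTool = result.newItems.some(
    (item: any) => item.type === 'tool_call_item',
  );
  const madePromise = PROMISE_PATTERNS.some((re) =>
    re.test(String(result.finalOutput ?? '')),
  );
  if (madePromise && !calledTool) {
    // One retry with an explicit nudge to actually call the tool.
    return run(
      agent,
      `${input}\n\n(Reminder: call the appropriate tool instead of announcing that you will.)`,
    );
  }
  return result;
}
```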
As you can see, most of these issues have remained unresolved for months, so I don’t believe they will be fixed soon. And since it is not always easy to change vendors, we need to share our best practices.

My goal is to collect every problem in this post, with explanations, and to gather workarounds with the community.

You can share your problem as a comment, or send it to me via PM and I will add it to the post. You can also say something like “I experience the ‘Structured Output: Model Doesn’t Follow Strict Schema’ problem with gpt-4o as well” and I will add it to that problem’s description. Please suggest your workarounds in the comments with clear explanations.