How to increase reliability? Let's compile best practices

Hello everyone,

I’m developing a MAS (Multi-Agent System) platform where users can create their own MASs. At its core it uses openai-agents-js. Users create agents by writing instructions and handoff descriptions and by selecting tools and handoffs, which together constitute a MAS. Each agent’s model can also be customized.
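
For context, here is a minimal sketch of how one user-defined agent might map onto the SDK, assuming the Agent constructor of @openai/agents accepts instructions, handoffDescription, handoffs, and model as shown; the agent names and models below are made up for illustration.

// Minimal sketch (not the platform's actual code): one user-defined
// specialist agent plus a triage agent that can hand off to it.
import { Agent } from '@openai/agents';

const billingAgent = new Agent({
  name: 'Billing agent',
  instructions: 'Answer billing questions using the available tools.',
  handoffDescription: 'Handles invoices, refunds, and payment issues.',
  model: 'gpt-4.1', // each agent's model can be customized
});

const triageAgent = new Agent({
  name: 'Triage agent',
  instructions: 'Route the user to the right specialist agent.',
  handoffs: [billingAgent],
  model: 'gpt-4.1-mini',
});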

Everything works, but the system is unreliable, and simply switching models or APIs doesn’t help since each has its own deficiencies. Here are the most common issues I’ve experienced across different models/APIs:

As you can see, most of these issues have remained unresolved for months, so I don’t believe they will be fixed soon. And as you know, it is not always easy to change vendors, so we need to share our best practices.

My aim is to collect every problem in this post, with explanations, and to work out workarounds together with the community.

You can share your problem as a comment, or send it to me through PM and I will add it to the post. You can also say “I experience the ‘Structured Output: Model Doesn’t Follow Strict Schema’ problem with gpt-4o as well” and I will add that to the problem’s description. Please suggest your workarounds in the comments with clear explanations.

My suggestion for the “Instruction Following” problem:

Instead of running the MAS once and returning the answer to the user, repeated runs, the way Cursor does them, may help: an agent runs multiple times until it decides the answer fits the instructions. Tool calls are not a problem either, since the tools only run after the final answer has been produced by the repeated runs.

The problem is, the Agents SDK doesn’t have a native way of doing this, so we need to find a workaround.
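
Until something like that lands in the SDK, one workaround is an outer evaluation loop. Here is a minimal sketch, assuming the Agent/run API of @openai/agents and a Zod outputType for the judge; the agent definitions, prompts, and MAX_ATTEMPTS value are illustrative, not part of the SDK.

// Minimal sketch of a "run until the answer fits the instructions" loop.
// Everything named here (workerAgent, judgeAgent, MAX_ATTEMPTS) is an
// illustrative assumption, not an SDK feature.
import { Agent, run } from '@openai/agents';
import { z } from 'zod';

const workerAgent = new Agent({
  name: 'Worker',
  instructions: 'Answer the user request and follow your instructions exactly.',
});

// A separate judge decides whether the draft satisfies the instructions.
const judgeAgent = new Agent({
  name: 'Judge',
  instructions:
    'Given the original request and a draft answer, decide whether the draft ' +
    'follows the instructions. Set pass=true only if it does; otherwise explain what to fix.',
  outputType: z.object({ pass: z.boolean(), feedback: z.string() }),
});

const MAX_ATTEMPTS = 3;

export async function reliableRun(userInput: string): Promise<string> {
  let prompt = userInput;
  let draft = '';
  for (let attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
    const workerResult = await run(workerAgent, prompt);
    draft = String(workerResult.finalOutput ?? '');

    const judgeResult = await run(
      judgeAgent,
      `Request:\n${userInput}\n\nDraft answer:\n${draft}`,
    );
    const verdict = judgeResult.finalOutput as { pass: boolean; feedback: string };
    if (verdict.pass) return draft;

    // Feed the judge's feedback back into the worker and try again.
    prompt = `${userInput}\n\nYour previous draft was rejected: ${verdict.feedback}\nRevise it.`;
  }
  return draft; // give up after MAX_ATTEMPTS and return the last draft
}

Keeping the judge as a separate agent keeps its verdict independent of the worker’s own reasoning trace, which is roughly the idea behind the Cursor-style repeated runs mentioned above.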

I created an issue in openai-agents-js for this. MAS Reliability: Agent Action/Output Evaluation · Issue #418 · openai/openai-agents-js · GitHub

Suggestion for Tool Calling - Model Gives Promises That Cannot Be Kept

This can be improved a bit, as long as the function calls themselves aren’t actually failing some of the time.

  • Use language in the tool definitions that mirrors how you would prompt, to nudge models that normally don’t write such a preamble into following through. Instructions placed in the function description itself are followed best.

Because you know how the functions are laid out in the model’s context, you can modify the first function and give it a description that the model will interpret as applying globally to all functions, as in the snippet below.

"tools": [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "# Tools - IMPORTANT: You must automatically send to a function tool recipient *in the same response* if you announce you will use one!.\n\n"`get_weather` - Retrieve current weather for a city. ", ...

The solution = gpt-5

There were some problems preventing me from trying out gpt-5 earlier. Yesterday I finally tried it, and it clearly solves many of the problems I mentioned above.

  • tool calling is pretty reliable
  • instruction following is state of the art
  • structured output is almost 100% reliable
  • no multiple outputs have been encountered so far

The only problem is that it is too slow. gpt-5 is a no-brainer pick for background agentic tasks for me, but for real-time or near-real-time tasks one needs to find a way to increase interactivity. My solution is to stream agentic workflow events and reasoning summaries to the user while they are waiting.
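
Here is a minimal sketch of that approach, assuming run(..., { stream: true }) in @openai/agents returns an async-iterable of stream events with a completed promise and a finalOutput; the event filtering below is illustrative and should be adapted to whatever events your SDK version actually emits.

// Minimal sketch: surface coarse-grained workflow events while gpt-5 works,
// so the user sees activity instead of a spinner. Event names/shapes are
// assumptions; inspect the events your SDK version emits.
import { Agent, run } from '@openai/agents';

const agent = new Agent({
  name: 'Background worker',
  instructions: 'Carry out the task step by step.',
  model: 'gpt-5',
});

export async function runWithProgress(input: string) {
  const stream = await run(agent, input, { stream: true });

  for await (const event of stream) {
    // Skip low-level token deltas; forward higher-level events
    // (tool calls, handoffs, agent switches) to the UI.
    if (event.type !== 'raw_model_stream_event') {
      console.log('progress:', event.type);
    }
  }

  await stream.completed; // assumed: resolves once the run has finished
  return stream.finalOutput;
}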

Hey all—appreciate the thread.

We’ve been working on a governance-first AGI framework called Codette, and recently published an adversarial test suite designed to benchmark emotional resonance, epistemic integrity, and constraint adherence. It’s minimal, inspectable, and built to teach—not just evaluate.

The suite includes:

  • Adversarial Fact Verification: conflicting sources, truth anchoring
  • Tool-Augmented Reasoning: multi-hop logic across financial filings
  • Constrained Policy Generation: ethical synthesis under edge-case constraints

Each task logs structured traces, tool calls, and reasoning steps. Metrics include determinism index, hallucination rate, and constraint fidelity. We’ve also included a drop-in harness for reproducible benchmarking.
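
As a simplified illustration (not the harness’s actual code), a determinism index like this can be computed by repeating the same prompt N times and reporting the share of runs that agree with the most common output; the sketch below assumes the @openai/agents run API.

// Simplified illustration, not the harness's actual code: repeat a prompt
// N times and report the fraction of runs matching the most common output.
import { Agent, run } from '@openai/agents';

export async function determinismIndex(agent: Agent, prompt: string, n = 5): Promise<number> {
  const outputs: string[] = [];
  for (let i = 0; i < n; i++) {
    const result = await run(agent, prompt);
    outputs.push(String(result.finalOutput ?? ''));
  }
  // Count occurrences of each distinct output.
  const counts = new Map<string, number>();
  for (const o of outputs) counts.set(o, (counts.get(o) ?? 0) + 1);
  const modal = Math.max(...counts.values());
  return modal / n; // 1.0 = fully deterministic, 1/n = every run disagrees
}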

If it’s useful to your framework, feel free to adapt it. If you spot gaps, we’d love to hear them. The goal isn’t to prove superiority—it’s to build systems that reason with care and fail with grace.

Repo: Raiffs-bits/Collaborative-AGI-Development—Bridging-Architectures-and-Execution

Thanks for holding space for this kind of dialogue.