How to increase reliability? Let's compile best practices

Hello everyone,

I’m developing a MAS (Multi-Agent System) platform where users can create their own MASs. At its core it uses openai-agents-js. Users create agents by writing instructions and handoff descriptions and by selecting tools and handoffs, which together constitute a MAS. Each agent’s model can also be customized.
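
For context, here is a minimal sketch of how one user-defined agent might map onto the SDK, assuming the Agent constructor of @openai/agents accepts instructions, handoffDescription, handoffs, and model as shown; the agent names and models below are made up for illustration.

// Minimal sketch (not the platform's actual code): one user-defined
// specialist agent plus a triage agent that can hand off to it.
import { Agent } from '@openai/agents';

const billingAgent = new Agent({
  name: 'Billing agent',
  instructions: 'Answer billing questions using the available tools.',
  handoffDescription: 'Handles invoices, refunds, and payment issues.',
  model: 'gpt-4.1', // each agent's model can be customized
});

const triageAgent = new Agent({
  name: 'Triage agent',
  instructions: 'Route the user to the right specialist agent.',
  handoffs: [billingAgent],
  model: 'gpt-4.1-mini',
});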

Everything works, but the system is unreliable, and simply switching models or APIs doesn’t help since each has its own deficiencies. Here are the most common issues I’ve experienced across different models/APIs:

As you can see, most of these issues have remained unresolved for months, so I don’t believe they will be fixed soon. And as you know, it is not always easy to change vendors, so we need to share our best practices.

My aim is to collect every problem in this post, with explanations, and to work out workarounds together with the community.

You can share your problem as a comment, or send it to me through PM and I will add it to the post. You can also say “I experience the ‘Structured Output: Model Doesn’t Follow Strict Schema’ problem with gpt-4o as well” and I will add that to the problem’s description. Please suggest your workarounds in the comments with clear explanations.

My suggestion for the “Instruction Following” problem:

Instead of running the MAS once and returning the answer to the user, repeated runs, the way Cursor does them, may help: an agent runs multiple times until it decides the answer fits the instructions. Tool calls are not a problem either, since the tools only run after the final answer has been produced by the repeated runs.

The problem is, the Agents SDK doesn’t have a native way of doing this, so we need to find a workaround.
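
Until something like that lands in the SDK, one workaround is an outer evaluation loop. Here is a minimal sketch, assuming the Agent/run API of @openai/agents and a Zod outputType for the judge; the agent definitions, prompts, and MAX_ATTEMPTS value are illustrative, not part of the SDK.

// Minimal sketch of a "run until the answer fits the instructions" loop.
// Everything named here (workerAgent, judgeAgent, MAX_ATTEMPTS) is an
// illustrative assumption, not an SDK feature.
import { Agent, run } from '@openai/agents';
import { z } from 'zod';

const workerAgent = new Agent({
  name: 'Worker',
  instructions: 'Answer the user request and follow your instructions exactly.',
});

// A separate judge decides whether the draft satisfies the instructions.
const judgeAgent = new Agent({
  name: 'Judge',
  instructions:
    'Given the original request and a draft answer, decide whether the draft ' +
    'follows the instructions. Set pass=true only if it does; otherwise explain what to fix.',
  outputType: z.object({ pass: z.boolean(), feedback: z.string() }),
});

const MAX_ATTEMPTS = 3;

export async function reliableRun(userInput: string): Promise<string> {
  let prompt = userInput;
  let draft = '';
  for (let attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
    const workerResult = await run(workerAgent, prompt);
    draft = String(workerResult.finalOutput ?? '');

    const judgeResult = await run(
      judgeAgent,
      `Request:\n${userInput}\n\nDraft answer:\n${draft}`,
    );
    const verdict = judgeResult.finalOutput as { pass: boolean; feedback: string };
    if (verdict.pass) return draft;

    // Feed the judge's feedback back into the worker and try again.
    prompt = `${userInput}\n\nYour previous draft was rejected: ${verdict.feedback}\nRevise it.`;
  }
  return draft; // give up after MAX_ATTEMPTS and return the last draft
}

Keeping the judge as a separate agent keeps its verdict independent of the worker’s own reasoning trace, which is roughly the idea behind the Cursor-style repeated runs mentioned above.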

I created an issue in openai-agents-js for this. MAS Reliability: Agent Action/Output Evaluation · Issue #418 · openai/openai-agents-js · GitHub

Suggestion for Tool Calling - Model Gives Promises That Cannot Be Kept

This can be improved a bit, as long as the function calls themselves aren’t actually failing some of the time.

  • Use language in the tool definitions that mirrors how you would prompt, to nudge models that normally don’t write such a preamble into following through. Instructions placed in the function description itself are followed best.

Because you know how the functions are laid out in the model’s context, you can modify the first function and give it a description that the model will interpret as applying globally to all functions, as in the snippet below.

"tools": [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "# Tools - IMPORTANT: You must automatically send to a function tool recipient *in the same response* if you announce you will use one!.\n\n"`get_weather` - Retrieve current weather for a city. ", ...

The solution = gpt-5

There were some problems preventing me from trying out gpt-5 earlier. Yesterday I finally tried it, and it clearly solves many of the problems I mentioned above.

  • tool calling is pretty reliable
  • instruction following is state of the art
  • structured output is almost 100% reliable
  • no multiple outputs have been encountered so far

The only problem is that it is too slow. gpt-5 is a no-brainer pick for background agentic tasks for me, but for real-time or near-real-time tasks one needs to find a way to increase interactivity. My solution is to stream agentic workflow events and reasoning summaries to the user while they are waiting.
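
Here is a minimal sketch of that approach, assuming run(..., { stream: true }) in @openai/agents returns an async-iterable of stream events with a completed promise and a finalOutput; the event filtering below is illustrative and should be adapted to whatever events your SDK version actually emits.

// Minimal sketch: surface coarse-grained workflow events while gpt-5 works,
// so the user sees activity instead of a spinner. Event names/shapes are
// assumptions; inspect the events your SDK version emits.
import { Agent, run } from '@openai/agents';

const agent = new Agent({
  name: 'Background worker',
  instructions: 'Carry out the task step by step.',
  model: 'gpt-5',
});

export async function runWithProgress(input: string) {
  const stream = await run(agent, input, { stream: true });

  for await (const event of stream) {
    // Skip low-level token deltas; forward higher-level events
    // (tool calls, handoffs, agent switches) to the UI.
    if (event.type !== 'raw_model_stream_event') {
      console.log('progress:', event.type);
    }
  }

  await stream.completed; // assumed: resolves once the run has finished
  return stream.finalOutput;
}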

Hey all—appreciate the thread.

We’ve been working on a governance-first AGI framework called Codette, and recently published an adversarial test suite designed to benchmark emotional resonance, epistemic integrity, and constraint adherence. It’s minimal, inspectable, and built to teach—not just evaluate.

The suite includes:

  • Adversarial Fact Verification: conflicting sources, truth anchoring
  • Tool-Augmented Reasoning: multi-hop logic across financial filings
  • Constrained Policy Generation: ethical synthesis under edge-case constraints

Each task logs structured traces, tool calls, and reasoning steps. Metrics include determinism index, hallucination rate, and constraint fidelity. We’ve also included a drop-in harness for reproducible benchmarking.
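
As a simplified illustration (not the harness’s actual code), a determinism index like this can be computed by repeating the same prompt N times and reporting the share of runs that agree with the most common output; the sketch below assumes the @openai/agents run API.

// Simplified illustration, not the harness's actual code: repeat a prompt
// N times and report the fraction of runs matching the most common output.
import { Agent, run } from '@openai/agents';

export async function determinismIndex(agent: Agent, prompt: string, n = 5): Promise<number> {
  const outputs: string[] = [];
  for (let i = 0; i < n; i++) {
    const result = await run(agent, prompt);
    outputs.push(String(result.finalOutput ?? ''));
  }
  // Count occurrences of each distinct output.
  const counts = new Map<string, number>();
  for (const o of outputs) counts.set(o, (counts.get(o) ?? 0) + 1);
  const modal = Math.max(...counts.values());
  return modal / n; // 1.0 = fully deterministic, 1/n = every run disagrees
}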

If it’s useful to your framework, feel free to adapt it. If you spot gaps, we’d love to hear them. The goal isn’t to prove superiority—it’s to build systems that reason with care and fail with grace.

Repo: Raiffs-bits/Collaborative-AGI-Development—Bridging-Architectures-and-Execution

Thanks for holding space for this kind of dialogue.