Agents SDK vs manual orchestration

I’d love to get some advice on recommended approaches for architecting my first multi-agent system. I’m building an agent for my iOS app that has access to a bunch of tools; most importantly, it can fetch the context it needs from my DB via raw SQL queries. I need it to detect when the user’s request is incomplete and ask for clarification. Lastly, it outputs structured JSON responses that my app can parse and turn into UI state.
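
(For context, the output is basically a tagged union of “ask the user a question” vs. “render this UI.” The names below are made up for illustration, but the shape is roughly this:)

```python
from typing import Literal, Union
from pydantic import BaseModel

class Clarification(BaseModel):
    kind: Literal["clarification"] = "clarification"
    question: str                  # shown to the user before continuing

class UIResponse(BaseModel):
    kind: Literal["ui"] = "ui"
    title: str
    items: list[str]               # whatever the app renders as UI state

AgentOutput = Union[Clarification, UIResponse]
```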

The system I came up with works, but it’s incredibly slow. I’m currently taking the user’s request and running it through my first LLM call, a planner that generates a step-by-step plan for my tool caller to execute. (There’s so much to know about generating a coherent plan that I separated it from the tool-calling agent.) The tool caller goes step by step, fetches the data it needs, stops to ask the user for clarification when needed, and gathers all the context to feed into my final responder LLM, which produces the user-facing structured output.
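
In pseudocode, the whole flow is roughly this (`call_llm`, `run_tool`, and `ask_user` are injected placeholders, not real APIs):

```python
def handle_request(user_request, call_llm, run_tool, ask_user):
    """Three-stage pipeline: planner -> tool caller -> responder."""
    # Stage 1: the planner turns the request into an ordered list of steps.
    plan = call_llm("planner", user_request)

    # Stage 2: walk the plan, gathering context; pause on ambiguity.
    context = []
    for step in plan:
        decision = call_llm("tool_caller", step, context)
        if decision.get("needs_clarification"):
            return ask_user(decision["question"])        # round-trip to the user
        context.append(run_tool(decision["tool_call"]))  # e.g. the raw SQL fetch

    # Stage 3: the responder produces the user-facing structured JSON.
    return call_llm("responder", user_request, context)
```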

Some requests are taking 15+ seconds :sweat_smile:

What’s nice about my approach is that it’s not recursive like the Agents SDK, so I have more control over cost and token usage. I believe I’m effectively doing everything the Agents SDK does anyway, just manually.

I have yet to put in the work to optimize the latency, and I’m not even streaming yet (mostly because streaming structured JSON output effectively is annoying to get right). I’m planning on streaming both the planner response, so I can start executing steps sooner, and the final user-facing structured output. I’m also going to look into parallelizing tool calls.
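
For the parallel tool calls, I’m thinking something along these lines (`run_tool` is an async placeholder for one tool invocation, e.g. a SQL fetch):

```python
import asyncio

async def run_tools_in_parallel(tool_calls, run_tool):
    # Launch every independent call at once instead of awaiting them one by one;
    # total latency becomes the slowest call, not the sum of all of them.
    tasks = [asyncio.create_task(run_tool(call)) for call in tool_calls]
    return await asyncio.gather(*tasks)
```

(Only safe for steps that don’t depend on each other’s results, of course.)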

I’m wondering if I’m on the right track with this approach, or if I should just switch to the Agents SDK.

Is there something I’m missing that would drastically reduce latency? (I’m using GPT-4o for all LLM calls.)


It definitely sounds like you’re on the right track!

Parallelism and asynchronous calls are going to be key to reducing latency. Remember, reasoning models do take time to think, so you have to factor that in if you’re using them. Perhaps see where you could use faster models, and which steps actually need reasoning?

I would also start thinking about what exactly could be done simultaneously. If you have a planner model generate a structured list, you can stream the output and begin delegating tasks the moment a step is complete, rather than waiting for the entire plan. You could also run the steps asynchronously, so that instead of one step at a time, they all run roughly at once (unless one is waiting on the result of a previous step).
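
As a rough sketch of that idea, assuming the planner streams text chunks and emits one step per line (both assumptions on my part; `run_step` is an async placeholder for executing one plan step):

```python
import asyncio

async def dispatch_as_planned(token_stream, run_step):
    tasks, buffer = [], ""
    async for chunk in token_stream:
        buffer += chunk
        # Dispatch each step the moment its line is complete,
        # without waiting for the rest of the plan to stream in.
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            if line.strip():
                tasks.append(asyncio.create_task(run_step(line.strip())))
    if buffer.strip():  # the last step may not end with a newline
        tasks.append(asyncio.create_task(run_step(buffer.strip())))
    return await asyncio.gather(*tasks)
```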

Just keep in mind that even if you cut the orchestration overhead to its minimum, the time the models themselves take to think and produce output is variable and model-dependent. You can make the architecture more efficient, but you’ll always be bounded somewhat by the models’ own latency.
