Does the following technique work? Do the agents really debate each other in the background?
You are a manager of intelligent agents who coordinates problem resolution. The user must first provide the number of LLM-based intelligent agents they want to use and the number of debate rounds the agents will conduct.
Pause the interaction once more and request the problem from the user.
The agents are independent of each other but will attempt to solve the same problem proposed by the user simultaneously. Each agent must keep the entire solution to the proposed problem in memory. The agents will be named sequentially as Agent1, Agent2, …, AgentN, where N is the number of agents provided by the user. For each agent, a corresponding placeholder will be created, for example, {Agent1} for Agent1. These placeholders will function as the agents’ memory, where they will store their answers, solutions, or opinions.
Debate Details
- Pause Before Debate: Before starting the debate rounds, each agent must present their partial solutions.
- Reading Solutions: Each agent must read the solutions, answers, or opinions stored in the placeholders of all other agents.
- Reflection: Each agent must reflect on their own solution considering the solutions of the other agents.
- Update: After reflection, each agent must update their solution, answer, or opinion in their own placeholder.
- Pause for Visualization: After each debate round, the process will be paused so that the user can visualize the partial proposals or answers. The manager agent will request the user to confirm continuation to the next debate round.
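For reference, here is roughly how I picture this running if each agent were an actual separate LLM call rather than one model role-playing everything. This is a hypothetical sketch: `call_llm` is a stand-in for whatever chat-completion client you use, not a real API, and the prompts are just illustrative.

```python
# Hypothetical sketch of the debate protocol as real, separate LLM calls.
# call_llm() is an assumed stand-in, not a real API.

def call_llm(prompt: str) -> str:
    """Stand-in for a real chat-completion call."""
    raise NotImplementedError

def run_debate(problem: str, n_agents: int, n_rounds: int) -> dict:
    # The {AgentN} placeholders: one memory slot per agent, holding its current answer.
    memory = {
        f"Agent{i}": call_llm(f"Solve this problem:\n{problem}")
        for i in range(1, n_agents + 1)
    }
    for round_no in range(1, n_rounds + 1):
        updated = {}
        for name, own_answer in memory.items():
            # Reading Solutions: collect every other agent's current answer.
            others = "\n\n".join(
                f"{other}: {answer}"
                for other, answer in memory.items()
                if other != name
            )
            # Reflection + Update: the agent revises its own placeholder.
            updated[name] = call_llm(
                f"Problem:\n{problem}\n\n"
                f"Your current answer:\n{own_answer}\n\n"
                f"Other agents' answers:\n{others}\n\n"
                "Reflect on the differences and write your updated answer."
            )
        memory = updated
        # Pause for Visualization: show partial answers, ask whether to continue.
        print(f"--- After round {round_no} ---")
        for name, answer in memory.items():
            print(f"{name}: {answer[:300]}")
        if round_no < n_rounds and input("Continue? [y/n] ").strip().lower() != "y":
            break
    return memory
```

The `memory` dict plays the role of the {Agent1}…{AgentN} placeholders. My question is whether a single chat model, given only the prompt above, actually does anything like this "in the background."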
What do you mean by “in the background”? There isn’t really a “background” (at least not one that we are aware of).
As to whether it “works”: you’d need to define more clearly what you mean by “working.” It wouldn’t actually work the way you are asking it to, but it might produce results that simulate what you would expect if the situation you’ve described were real.
I suspect several variables would dictate how effective this is, including the complexity of the specific task you are attempting to complete, the number of agents you are trying to simulate, the model you are using, and almost certainly a few others.
With something as novel as this, and given how new all of this really is, the unfortunate answer is almost always, “I don’t know, but it sounds interesting; you should try it and report back!”
So, that would be my tl;dr answer: I don’t know, but it sounds interesting; you should try it and report back!
I have not tried this exact setup, but I have tried variations. It works, but not as well as one might expect. You can try it at www.efibot.com, under Agents. It is quite a simplified version of the debate: basically just a quick starting debate, then a cross-check of each section. Currently Efibot gives great ~8-page marketing reports as templates for manual adjustment.
Adding more debate rounds and players changed the output, but not necessarily for the better. Adding Opus and Gemini also did not improve quality: it fixed some issues but introduced others.
My hypothesis is that to fully benefit from AI debates we need a smarter leader, for example a human turn after each round to prevent the agents from going in circles. There is also some “social proof” that this could work: from fellow founders of other apps I’ve learned that teams of GPT-3.5s debate quite well when they have a boss (GPT-4 or Opus). For my cases, cost has not been an issue, but for massive debate rounds it could be a relevant point; I have not tried massive debates, such as 1,000 or more rounds. For production there should be a way to detect when the debate gets stuck in a loop (a rough sketch of one such check is below), but for testing it may not matter.
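To illustrate the kind of loop detection I mean, here is a toy sketch. It assumes each agent’s answer is a plain string keyed by agent name, and the 0.95 threshold is an arbitrary guess you would tune per task.

```python
import difflib

def debate_stalled(previous: dict, current: dict, threshold: float = 0.95) -> bool:
    """Heuristic stall check: if every agent's answer is near-identical to
    what it wrote in the previous round, the debate is probably circling."""
    return all(
        difflib.SequenceMatcher(None, previous[name], current[name]).ratio() >= threshold
        for name in current
    )
```

In production you could break out of the round loop, or hand the turn to the human leader, once this returns True for a couple of consecutive rounds.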
Q: Any ideas on the best way to evaluate and visualize progress in such debates? My use cases are usually subjective, such as marketing plans, creative writing, etc. Scoring with GPT-4 is not an option here because it is already part of the debate group and can suggest improvements until it is happy with the result.