Can AI agents think together? A testbed for chain-of-thought collaboration

A small collection of reasoning tasks where:

Each task requires multiple reasoning steps

Each agent handles a piece of reasoning (or critiques another agent’s reasoning)

The agents must coordinate their chain-of-thought to solve the problem

Example task types:

Mystery puzzles → e.g., Agent 1 lists clues, Agent 2 draws conclusions, Agent 3 checks whether the conclusion follows logically (a rough sketch of this pipeline follows after this list)

Math word problems → Agents break the solution into steps and verify each other's work

Ethical dilemmas → Agents debate competing chains of thought and aim for consensus
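
As a first concrete sketch, here is roughly how the mystery pipeline could be wired up in Python. Everything here is an assumption for illustration: `call_llm` is a stand-in for whichever model API you use, and the role prompts and function names (`clue_agent`, `reasoning_agent`, `critic_agent`, `solve_mystery`) are placeholders, not a committed design.

```python
# Hypothetical sketch of the three-agent mystery pipeline.
# `call_llm` is a placeholder for whatever chat-completion API you use;
# the role prompts are illustrative, not a fixed design.

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to your model of choice and return its reply."""
    raise NotImplementedError("wire this up to your LLM provider")


def clue_agent(puzzle: str) -> str:
    # Agent 1: extract the stated clues, nothing more.
    return call_llm("List only the clues stated in this mystery, one per line:\n" + puzzle)


def reasoning_agent(clues: str) -> str:
    # Agent 2: reason step by step from the clues to a conclusion.
    return call_llm("Given these clues, reason step by step and state a conclusion:\n" + clues)


def critic_agent(clues: str, reasoning: str) -> str:
    # Agent 3: check whether the conclusion actually follows from the clues.
    return call_llm(
        "Clues:\n" + clues
        + "\n\nProposed reasoning:\n" + reasoning
        + "\n\nDoes the conclusion follow logically? Point out any unsupported step."
    )


def solve_mystery(puzzle: str) -> dict:
    # Run the chain: clues -> reasoning -> critique, keeping every intermediate step.
    clues = clue_agent(puzzle)
    reasoning = reasoning_agent(clues)
    critique = critic_agent(clues, reasoning)
    return {"clues": clues, "reasoning": reasoning, "critique": critique}
```

Keeping every intermediate output (rather than only the final answer) makes it easy to see exactly where the chain breaks down, which feeds directly into the research questions below.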

Deliverable:

A notebook or small app where you can:

Enter a problem

Run the agents and watch each one contribute its piece of the reasoning

Compare and critique their outputs
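
A minimal sketch of that harness, assuming any pipeline callable that returns a dict of per-agent outputs (for example the `solve_mystery` sketch above); `run_testbed` and its signature are illustrative placeholders, not a fixed interface.

```python
from typing import Callable, Dict

def run_testbed(problem: str, pipeline: Callable[[str], Dict[str, str]]) -> None:
    # `pipeline` is any callable mapping a problem string to {agent_name: output},
    # e.g. the solve_mystery sketch above. Prints each agent's contribution
    # side by side so you can compare and critique them by hand.
    outputs = pipeline(problem)
    print("PROBLEM:\n" + problem)
    for agent, output in outputs.items():
        print("=" * 40)
        print(agent.upper())
        print(output)

# Example (once call_llm is wired up to a real model):
# run_testbed("Three suspects were in the library when the vase broke...", solve_mystery)
```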

Your research questions:

Where do agents break down in collaborative chain-of-thought?
Do chain-of-thought prompts reduce errors in multi-agent reasoning?
How can agents better critique and correct each other’s steps?

That’s great, but I have some questions. Here they are:

What is the exact point where a system stops being just clever programming and actually becomes real AI? How do we define that boundary clearly?

If AI agents start challenging and refining each other’s reasoning, will they really be thinking together — or will it still be just outputs passed between tools?

Can we really build AI that forms something like a synthetic council, where agents don’t just agree but actually question, critique, and improve each other’s thinking?

If we ever reach that stage, who will decide who gets access to such powerful collaborative AI agents? Will it be open to everyone or only to a few?

Is there any real technical work today where AI agents are designed to question or disagree with each other, not just extend each other’s outputs?