Can AI agents think together? A testbed for chain-of-thought collaboration

A small collection of reasoning tasks where:

Each task requires multiple reasoning steps

Each agent handles a piece of reasoning (or critiques another agent’s reasoning)

The agents must coordinate their chain-of-thought to solve the problem
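
A minimal sketch of what "coordinating their chain-of-thought" could look like in code: each agent wraps a role prompt and appends its step to a shared transcript that the next agent sees. The `complete()` function is a stand-in for whatever LLM client you use, and the class and prompt formats are illustrative assumptions, not a fixed design.

```python
from dataclasses import dataclass

def complete(prompt: str) -> str:
    """Stand-in for an LLM call -- swap in your own client (OpenAI, local model, etc.)."""
    return f"[model response to {len(prompt)}-char prompt]"

@dataclass
class Agent:
    name: str
    role_prompt: str  # e.g. "List the clues" or "Critique the previous step"

    def step(self, problem: str, transcript: list[str]) -> str:
        # Each agent sees the problem plus everything reasoned so far.
        prompt = (
            f"{self.role_prompt}\n\nProblem: {problem}\n\n"
            "Reasoning so far:\n" + "\n".join(transcript)
        )
        return complete(prompt)

def collaborate(problem: str, agents: list[Agent]) -> list[str]:
    """Run agents in sequence, each one extending the shared chain-of-thought."""
    transcript: list[str] = []
    for agent in agents:
        transcript.append(f"{agent.name}: {agent.step(problem, transcript)}")
    return transcript
```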

Example task types:

Mystery puzzles → e.g., Agent 1 lists clues, Agent 2 draws conclusions, Agent 3 checks whether the conclusion follows logically (see the sketch after this list)

Math word problems → Agents break the solution into steps and verify each other's work

Ethical dilemmas → Agents debate competing chains of thought and aim for consensus
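
For the mystery-puzzle pipeline above, one possible wiring is a three-stage hand-off where the final call acts as the logic check. The role prompts and the returned dictionary are assumptions to adapt, and `complete()` again stands in for your model client.

```python
def complete(prompt: str) -> str:
    """Stand-in for an LLM call; replace with your client of choice."""
    return f"[model response to {len(prompt)}-char prompt]"

def solve_mystery(puzzle: str) -> dict[str, str]:
    # Agent 1: extract the clues.
    clues = complete(f"List the clues stated or implied in this mystery:\n{puzzle}")
    # Agent 2: reason from the clues to a conclusion.
    conclusion = complete(
        "Using only these clues, reason step by step to a conclusion:\n"
        f"Clues:\n{clues}"
    )
    # Agent 3: check whether the conclusion actually follows.
    verdict = complete(
        "Check whether the conclusion follows logically from the clues. "
        "Point out any unsupported leap.\n"
        f"Clues:\n{clues}\nConclusion:\n{conclusion}"
    )
    return {"clues": clues, "conclusion": conclusion, "verdict": verdict}
```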

Deliverable:

A notebook or small app where you can:

Enter a problem

Run the agents on it

Compare and critique their outputs (a minimal comparison view is sketched below)
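
For the "compare and critique" part, a rough sketch of a side-by-side view of two runs' transcripts. The index-based step alignment is a simplifying assumption; a real tool might align steps semantically.

```python
from itertools import zip_longest

def compare_transcripts(run_a: list[str], run_b: list[str]) -> None:
    """Print two transcripts step by step so you can eyeball where they diverge."""
    for i, (a, b) in enumerate(zip_longest(run_a, run_b, fillvalue="(no step)"), 1):
        print(f"Step {i}")
        print(f"  A: {a}")
        print(f"  B: {b}")
        print()

# Example usage with hand-written transcripts standing in for real agent output.
compare_transcripts(
    ["Agent 1: lists clues...", "Agent 2: concludes the butler did it"],
    ["Agent 1: lists clues...", "Agent 2: concludes the gardener did it",
     "Agent 3: flags an unsupported leap"],
)
```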

Your research questions:

Where do agents break down in collaborative chain-of-thought?
Do chain-of-thought prompts reduce errors in multi-agent reasoning?
How can agents better critique and correct each other’s steps?
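
For the second question, one possible starting point is a small A/B harness that runs the same problems with and without an explicit chain-of-thought instruction and compares error rates. `run_agents()` and `check_answer()` are placeholders for your pipeline and grader, not a prescribed method.

```python
def run_agents(problem: str, use_cot: bool) -> str:
    """Placeholder: run the multi-agent pipeline and return its final answer."""
    return "42"

def check_answer(answer: str, expected: str) -> bool:
    """Placeholder grader -- exact match here, but could be an LLM judge."""
    return answer.strip() == expected.strip()

def error_rate(dataset: list[tuple[str, str]], use_cot: bool) -> float:
    """Fraction of problems the pipeline gets wrong under the given setting."""
    errors = sum(
        not check_answer(run_agents(problem, use_cot), expected)
        for problem, expected in dataset
    )
    return errors / len(dataset)

# Dataset entries are (problem, expected answer) pairs you supply.
dataset = [("If a train leaves at 3pm ...", "42")]
print("error rate with CoT prompts:   ", error_rate(dataset, use_cot=True))
print("error rate without CoT prompts:", error_rate(dataset, use_cot=False))
```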