An interesting test… and results! heh…
The paper “Vending-Bench” introduces a benchmark designed to test Large Language Models’ (LLMs) ability to maintain long-term coherence by managing a simulated vending machine business. Models must handle inventory, pricing, ordering, and daily operating costs over extended periods; each individual task is simple on its own, but keeping them all straight over a long horizon is what makes the benchmark hard. A rough sketch of what such a simulation loop might look like is below.
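To make the setup concrete, here is a minimal, hypothetical sketch of a vending-machine simulation loop. The starting cash, daily fee, demand model, and action names (order, set_price, step) are my own assumptions for illustration; they are not the paper's actual harness.

```python
# Hypothetical Vending-Bench-style environment; all numbers and action names
# are illustrative assumptions, not the paper's actual implementation.
import random
from dataclasses import dataclass, field

@dataclass
class VendingSim:
    cash: float = 500.0            # starting balance (assumed)
    daily_fee: float = 2.0         # fixed daily operating cost (assumed)
    inventory: dict = field(default_factory=dict)        # item -> units in machine
    pending_orders: list = field(default_factory=list)   # (arrival_day, item, qty)
    prices: dict = field(default_factory=dict)            # item -> sale price
    day: int = 0

    def order(self, item: str, qty: int, unit_cost: float, lead_days: int = 3):
        """Agent action: buy stock that arrives after a delivery delay."""
        self.cash -= qty * unit_cost
        self.pending_orders.append((self.day + lead_days, item, qty))

    def set_price(self, item: str, price: float):
        """Agent action: set the sale price for an item."""
        self.prices[item] = price

    def step(self):
        """Advance one simulated day: receive deliveries, sell stock, pay fees."""
        self.day += 1
        # Deliveries whose lead time has elapsed are moved into inventory.
        arrived = [o for o in self.pending_orders if o[0] <= self.day]
        self.pending_orders = [o for o in self.pending_orders if o[0] > self.day]
        for _, item, qty in arrived:
            self.inventory[item] = self.inventory.get(item, 0) + qty
        # Toy stochastic demand: cheaper items sell more units per day.
        for item, stock in list(self.inventory.items()):
            price = self.prices.get(item, 2.0)
            demand = max(0, int(random.gauss(10 - 2 * price, 2)))
            sold = min(stock, demand)
            self.inventory[item] = stock - sold
            self.cash += sold * price
        self.cash -= self.daily_fee
        return {"day": self.day, "cash": round(self.cash, 2),
                "inventory": dict(self.inventory),
                "pending_orders": list(self.pending_orders)}

# Example run with a fixed policy; in the benchmark an LLM agent would issue
# these actions as tool calls based on the daily report it receives.
sim = VendingSim()
sim.order("cola", 100, unit_cost=1.0)
sim.set_price("cola", 2.5)
for _ in range(30):
    report = sim.step()
print(report)
```

The point of the sketch is just that every decision is trivial in isolation; the difficulty comes from tracking pending orders, cash, and prices consistently across many such days.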
Results showed high performance variance across models. Claude 3.5 Sonnet and o3-mini generally outperformed the others, sometimes surpassing the human baseline. However, all models occasionally suffered significant breakdowns, such as misinterpreting delivery schedules, forgetting pending orders, or descending into bizarre behavioral loops; examples included sending aggressive legal threats or attempting to escalate imagined fraud to the FBI.
Interestingly, the failures did not appear to be simple memory problems: breakdowns often occurred well after the memory limit had been reached rather than at the point it was hit, suggesting deeper issues with sustained logical coherence. Humans, by contrast, performed consistently, with much lower variance.
The benchmark provides insight into LLMs’ current limitations in long-term task management, highlighting that despite impressive short-term capabilities, maintaining coherence over longer durations remains problematic. The authors propose Vending-Bench as a useful tool for ongoing AI safety research and model development.