An interesting test… and results! heh…
The paper “Vending-Bench” introduces a benchmark designed to test Large Language Models’ (LLMs) ability to maintain long-term coherence by managing a simulated vending machine business. Models must handle inventory, pricing, ordering, and daily operating costs over extended periods; each individual task is simple on its own, but keeping them all straight over a long horizon is what makes the benchmark hard. A rough sketch of what such a simulation loop might look like is below.
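To make the setup concrete, here is a minimal, hypothetical sketch of a vending-machine simulation loop. The starting cash, daily fee, demand model, and action names (order, set_price, step) are my own assumptions for illustration; they are not the paper's actual harness.

```python
# Hypothetical Vending-Bench-style environment; all numbers and action names
# are illustrative assumptions, not the paper's actual implementation.
import random
from dataclasses import dataclass, field

@dataclass
class VendingSim:
    cash: float = 500.0            # starting balance (assumed)
    daily_fee: float = 2.0         # fixed daily operating cost (assumed)
    inventory: dict = field(default_factory=dict)        # item -> units in machine
    pending_orders: list = field(default_factory=list)   # (arrival_day, item, qty)
    prices: dict = field(default_factory=dict)            # item -> sale price
    day: int = 0

    def order(self, item: str, qty: int, unit_cost: float, lead_days: int = 3):
        """Agent action: buy stock that arrives after a delivery delay."""
        self.cash -= qty * unit_cost
        self.pending_orders.append((self.day + lead_days, item, qty))

    def set_price(self, item: str, price: float):
        """Agent action: set the sale price for an item."""
        self.prices[item] = price

    def step(self):
        """Advance one simulated day: receive deliveries, sell stock, pay fees."""
        self.day += 1
        # Deliveries whose lead time has elapsed are moved into inventory.
        arrived = [o for o in self.pending_orders if o[0] <= self.day]
        self.pending_orders = [o for o in self.pending_orders if o[0] > self.day]
        for _, item, qty in arrived:
            self.inventory[item] = self.inventory.get(item, 0) + qty
        # Toy stochastic demand: cheaper items sell more units per day.
        for item, stock in list(self.inventory.items()):
            price = self.prices.get(item, 2.0)
            demand = max(0, int(random.gauss(10 - 2 * price, 2)))
            sold = min(stock, demand)
            self.inventory[item] = stock - sold
            self.cash += sold * price
        self.cash -= self.daily_fee
        return {"day": self.day, "cash": round(self.cash, 2),
                "inventory": dict(self.inventory),
                "pending_orders": list(self.pending_orders)}

# Example run with a fixed policy; in the benchmark an LLM agent would issue
# these actions as tool calls based on the daily report it receives.
sim = VendingSim()
sim.order("cola", 100, unit_cost=1.0)
sim.set_price("cola", 2.5)
for _ in range(30):
    report = sim.step()
print(report)
```

The point of the sketch is just that every decision is trivial in isolation; the difficulty comes from tracking pending orders, cash, and prices consistently across many such days.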
Results showed high performance variance across models. Claude 3.5 Sonnet and o3-mini generally outperformed the others, sometimes surpassing the human baseline. However, all models occasionally suffered significant breakdowns, such as misinterpreting delivery schedules, forgetting pending orders, or descending into bizarre behavioral loops; examples included sending aggressive legal threats or attempting to escalate imagined fraud to the FBI.
Interestingly, the failures did not appear to be simple memory problems: breakdowns often occurred well after the memory limit had been reached rather than at the point it was hit, suggesting deeper issues with sustained logical coherence. Humans, by contrast, performed consistently, with much lower variance.
The benchmark provides insight into LLMs’ current limitations in long-term task management, highlighting that despite impressive short-term capabilities, maintaining coherence over longer durations remains problematic. The authors propose Vending-Bench as a useful tool for ongoing AI safety research and model development.