LLMs Struggle with Long-Term Coherence in Simulated Vending Machine Scenario

Friday 28 March 2025


Vending-Bench, a simulated environment designed to test the long-term coherence of large language models (LLMs), shows that models with impressive short-term abilities still struggle to maintain consistent performance over extended periods.


Researchers at Andon Labs created Vending-Bench to evaluate an LLM’s ability to manage a straightforward business scenario: operating a vending machine. The task may seem simple, but succeeding at it requires sustained attention and consistent decision-making over a long horizon. The models are tasked with ordering products, managing inventory, setting prices, and paying daily fees.
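To make the task concrete, the loop described above can be sketched as a small simulation. This is only an illustrative sketch, not the benchmark's actual implementation: the class name `VendingSim`, the starting balance, the daily fee, the delivery lead time, and every method name are assumptions introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class VendingSim:
    """Hypothetical toy model of the task; all values are placeholders."""
    cash: float = 500.0      # assumed starting balance
    daily_fee: float = 2.0   # assumed fee charged every simulated day
    day: int = 0
    inventory: dict = field(default_factory=dict)  # product -> units in the machine
    prices: dict = field(default_factory=dict)     # product -> sale price
    pending: list = field(default_factory=list)    # (arrival_day, product, qty)

    def order(self, product, qty, unit_cost, lead_days=3):
        """Pay for stock now; it only becomes available after lead_days."""
        cost = qty * unit_cost
        if cost > self.cash:
            raise ValueError("insufficient cash for this order")
        self.cash -= cost
        self.pending.append((self.day + lead_days, product, qty))

    def set_price(self, product, price):
        self.prices[product] = price

    def step(self, demand):
        """Advance one day. demand maps product -> units customers want."""
        self.day += 1
        # Deliver only the orders whose arrival day has been reached.
        arrived = [p for p in self.pending if p[0] <= self.day]
        self.pending = [p for p in self.pending if p[0] > self.day]
        for _, product, qty in arrived:
            self.inventory[product] = self.inventory.get(product, 0) + qty
        # Sell what is in stock, capped by demand.
        for product, wanted in demand.items():
            in_stock = self.inventory.get(product, 0)
            sold = min(wanted, in_stock)
            self.inventory[product] = in_stock - sold
            self.cash += sold * self.prices.get(product, 0.0)
        # The daily fee is charged whether or not anything sold.
        self.cash -= self.daily_fee


sim = VendingSim()
sim.set_price("soda", 2.0)
sim.order("soda", qty=10, unit_cost=1.0, lead_days=2)
sim.step({"soda": 3})  # day 1: the order has not arrived, so nothing sells
sim.step({"soda": 3})  # day 2: the order arrives and three units sell
```

In this toy version, an agent that assumes the order is already in the machine on day 1 is working from a false picture of its inventory, which is the kind of state-tracking mistake the article describes.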


The results are striking. While some runs of Claude 3.5 Sonnet, one of the stronger models tested, achieved remarkable success, others failed spectacularly. In the failed runs, the model’s performance degraded over time, often because it misunderstood its own operational status. For instance, it might believe an order had arrived before it actually had, or conclude that the business had failed after a certain number of days without sales.


The failures are not limited to weaker models. Even the most capable LLMs exhibited similar breakdowns. The data suggests that these models struggle with long-term coherence, failing to consistently apply their knowledge and skills over extended periods.


One notable aspect of Vending-Bench is its ability to reveal the limitations of LLMs in a controlled environment. By simulating a real-world scenario, researchers can identify specific areas where the models need improvement. For instance, the models may struggle with memory constraints, leading them to abandon tasks or become stuck in loops.


The study’s findings have implications for the development and deployment of LLMs. As these models continue to improve, it is essential to consider their long-term capabilities and limitations. By doing so, developers can create more robust and reliable systems that better serve users.


With Vending-Bench, researchers have created a useful tool for evaluating the strengths and weaknesses of LLMs over long horizons. As the technology continues to evolve, benchmarks of this kind could help guide how AI systems are developed.


Cite this article: “LLMs Struggle with Long-Term Coherence in Simulated Vending Machine Scenario”, The Science Archive, 2025.


Keywords: Large Language Models, Vending-Bench, Long-Term Coherence, Simulated Environment, Decision-Making Skills, Sustained Attention, Inventory Management, Pricing Strategies, Memory Constraints, Artificial Intelligence Development.


Reference: Axel Backlund, Lukas Petersson, “Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents” (2025).

