CODEELO: A Competition-Level Code Generation Benchmark for Large Language Models

Friday 28 February 2025


Accurately benchmarking the coding ability of large language models (LLMs) remains an open problem. A new paper proposes CODEELO, a competition-level code generation benchmark that aims to fill the gaps left by previous evaluations. Its evaluation framework assesses LLMs’ coding abilities using real problems from CodeForces, a popular competitive programming platform.


CODEELO addresses two major limitations of existing benchmarks: the lack of support for special judges and the lack of support for interactive problems. A special judge is needed when a problem accepts more than one correct output, so dedicated checker code must verify each answer instead of comparing it against a single reference. CODEELO handles these cases by evaluating submissions directly on the CodeForces platform.
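
To illustrate why special judges matter, here is a minimal Python sketch. The toy task (print any two positive integers that sum to n) and both function names are hypothetical and do not come from the paper or from CodeForces; they only show how an exact-match comparison can reject a valid answer that a checker would accept.

```python
def exact_match(expected: str, actual: str) -> bool:
    """Naive judge: compare whitespace-normalized output to one reference answer."""
    return expected.split() == actual.split()


def special_judge(n: int, actual: str) -> bool:
    """Hypothetical checker: accept any two positive integers a, b with a + b == n."""
    tokens = actual.split()
    if len(tokens) != 2:
        return False
    try:
        a, b = (int(t) for t in tokens)
    except ValueError:
        # Non-integer output is always wrong.
        return False
    return a > 0 and b > 0 and a + b == n


n = 10
reference = "1 9"   # the single stored reference answer
candidate = "4 6"   # a different but equally valid answer

print(exact_match(reference, candidate))  # False: rejected by plain comparison
print(special_judge(n, candidate))        # True: accepted by the checker
```

A benchmark that stores only reference outputs would mark the candidate above as wrong, which is the kind of misjudgment that delegating evaluation to the platform's own checkers avoids.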


The benchmark's judging method submits each solution to the CodeForces platform and records the verdict it returns. According to the authors, this ensures zero false positives and, for the first time, provides an execution environment aligned with the platform's own. This approach allows CODEELO to accurately assess LLMs’ abilities in generating code that solves real-world programming challenges.


The paper reports results for 30 popular open-source models and 3 proprietary models evaluated with CODEELO, and analyzes their performance across different problem types and difficulty levels. The top-performing models solved problems consistently, while others struggled even with relatively simple tasks.


The research highlights the importance of more accurate benchmarks for evaluating LLMs’ coding abilities. By providing a standardized evaluation framework, CODEELO enables researchers and developers to better understand the strengths and weaknesses of these models.


For those interested in competitive programming and large language models, CODEELO offers a way to examine how current models handle competition-level problems.


Cite this article: “CODEELO: A Competition-Level Code Generation Benchmark for Large Language Models”, The Science Archive, 2025.


Large Language Models, Code Generation, Benchmarking, Codeforces, Programming Challenges, Artificial Intelligence, Competitive Programming, Natural Language Processing, Machine Learning, Evaluation Framework


Reference: Shanghaoran Quan, Jiaxi Yang, Bowen Yu, Bo Zheng, Dayiheng Liu, An Yang, Xuancheng Ren, Bofei Gao, Yibo Miao, Yunlong Feng, et al., “CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings” (2025).

