Evaluating Large Language Models' Ability to Solve Algorithmic Challenges

Friday 28 March 2025


Researchers have conducted a comprehensive performance evaluation of Large Language Models (LLMs), focusing on their ability to solve the algorithmic challenges typically encountered in programming contests and technical interviews. The study analyzed LLM-generated code for problems from LeetCode, a popular platform offering a large repository of algorithmic problems across several difficulty levels.


The study covered two categories of LLMs: OpenAI models and the GitHub Copilot model. Each problem was attempted at five temperature settings: 0.2, 0.4, 0.6, 0.8, and 1.0. The temperature parameter controls how random the model's sampling is: low values produce more deterministic output, while high values produce more varied output. This allowed the researchers to examine how different levels of randomness affect code correctness and efficiency.
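
To make the effect of temperature concrete, the sketch below applies the standard temperature-scaled softmax to a small set of made-up token scores. The function name and the logit values are illustrative assumptions and do not come from the study; the five temperatures are the ones listed above.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, scaled by a sampling temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate tokens (not taken from the study).
example_logits = [2.0, 1.0, 0.5]

for t in (0.2, 0.4, 0.6, 0.8, 1.0):
    probs = softmax_with_temperature(example_logits, t)
    print(t, [round(p, 3) for p in probs])
```

At 0.2 nearly all of the probability mass sits on the highest-scoring token, while at 1.0 the alternatives are sampled noticeably more often, which is why repeated attempts at a higher temperature tend to differ more from one another.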


The evaluation process involved assessing the functional correctness and performance of the generated code using metrics such as pass@k, which measures the probability of a model generating a correct solution within k attempts. The study also considered memory usage, runtime performance, and LeetCode’s runtime percentile rankings to compare the execution speed of LLM-generated solutions with human-written counterparts.
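
The article does not say how pass@k was computed. A common choice is the unbiased estimator popularized by the HumanEval benchmark: generate n samples per problem, count the c samples that pass all tests, and estimate the chance that at least one of k randomly drawn samples is correct. The sketch below assumes that estimator; the function name and the example numbers are illustrative.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate pass@k for one problem.

    n: total samples generated, c: samples that passed, k: attempts allowed.
    Computes 1 - C(n - c, k) / C(n, k), the probability that a random
    size-k subset of the n samples contains at least one correct sample.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k draws include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 10 samples for one problem, 3 of them correct.
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```

A benchmark-level score would then be the mean of this value over all problems. With k = 1 the estimator reduces to the plain fraction of correct samples, c / n.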


The results suggest that the top-performing entries, such as Canonical Solutions and GPT-4-omni, achieve near-perfect pass@1 and pass@10 scores. They clearly outperform lower-tier models such as CodeLlama-13B-Instruct and WizardCoder-Python-7B.


Mid-tier models achieve moderate scores, while lower-performing models struggle with the evaluated tasks. The study highlights the strengths and limitations of current LLMs in code generation and problem-solving, offering insight into their potential applications and into areas for improvement in automated programming assistance.


The analysis also reveals that human-written code tends to outperform LLM-generated solutions in runtime performance. Even so, the LLMs' ability to produce a correct solution within a reasonable number of attempts indicates significant progress in AI-driven coding.


The research has implications for the development and deployment of Large Language Models in software engineering, highlighting the need for more advanced models that generate efficient code, not merely correct code. As LLMs continue to evolve, developers may rely on them more and more for routine coding tasks, leaving more time for higher-level design decisions.


Cite this article: “Evaluating Large Language Models' Ability to Solve Algorithmic Challenges”, The Science Archive, 2025.


Large Language Models, Algorithmic Challenges, Programming Contests, Technical Interviews, Code Generation, Problem-Solving, LeetCode, Pass@K, Runtime Performance, Memory Usage.


Reference: Lun Wang, Chuanqi Shi, Shaoshui Du, Yiyi Tao, Yixian Shen, Hang Zheng, Yanxin Shen, Xinyu Qiu, “Performance Review on LLM for solving leetcode problems” (2025).

