Thursday 20 March 2025
For decades, computer scientists have worked toward artificial intelligence that can write code as efficiently and effectively as humans do. Recently, a team of researchers made significant progress in this area by creating a benchmark that evaluates how efficient the code generated by language models is.
The new benchmark is called COFFE, which stands for Code Efficiency Benchmark for Code Generation. It consists of two types of problems: function-level code generation and file-level code generation. The first type requires the model to generate a specific function within a program, while the second asks it to create an entire file containing multiple functions.
COFFE is designed to assess not only whether the generated code is correct but also how efficient it is in terms of execution time. This matters because code efficiency can significantly affect the performance and reliability of software systems. The benchmark includes 398 function-level problems and 358 file-level problems, making it a comprehensive tool for evaluating language models.
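To make the idea of scoring both correctness and execution time concrete, here is a minimal Python sketch. It is not COFFE's actual harness: the `evaluate` helper, the two example functions, and the choice of best-of-N wall-clock timing are all assumptions made for illustration.

```python
import time


def evaluate(candidate, reference, test_inputs, repeats=5):
    """Return (is_correct, best_seconds) for a candidate function.

    Hypothetical helper: correctness is checked against a reference
    solution, and efficiency is the best wall-clock time over several runs.
    """
    # Correctness: the candidate must match the reference on every input.
    is_correct = all(candidate(x) == reference(x) for x in test_inputs)

    # Efficiency: time one full pass over the inputs, keep the fastest pass.
    best_seconds = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for x in test_inputs:
            candidate(x)
        best_seconds = min(best_seconds, time.perf_counter() - start)
    return is_correct, best_seconds


def sum_to_n_loop(n):
    """Correct but slow: adds the numbers one by one."""
    total = 0
    for i in range(n + 1):
        total += i
    return total


def sum_to_n_formula(n):
    """Correct and fast: uses the closed-form sum."""
    return n * (n + 1) // 2


inputs = [10, 1_000, 100_000]
loop_ok, loop_time = evaluate(sum_to_n_loop, sum_to_n_formula, inputs)
formula_ok, formula_time = evaluate(sum_to_n_formula, sum_to_n_formula, inputs)
print(f"loop:    correct={loop_ok}, seconds={loop_time:.6f}")
print(f"formula: correct={formula_ok}, seconds={formula_time:.6f}")
```

Both functions pass the correctness check, yet their running times differ widely. A correctness-only benchmark would miss this gap.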
The researchers tested 14 popular language models on COFFE and found that the efficiency of the generated code varied from model to model. The study also reports four key findings that could have significant implications for the development of artificial intelligence in software engineering.
First, larger language models are not always more efficient than smaller ones. In fact, some smaller models generated code that performed as well as, or even better than, the code from their larger counterparts.
Second, different evaluation metrics can lead to vastly different results. For example, a model may score well on correctness but poorly on efficiency, or the reverse.
Third, the researchers identified a correlation between the complexity of a problem and the execution time of the generated code. This suggests that models struggle more with complex problems and produce slower solutions for them.
Finally, the study revealed that language models tend to perform better when generating code for specific domains or industries than for general-purpose programming.
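The second finding, that the choice of metric can change which model looks best, can be illustrated with a small Python sketch. The model names and scores below are made-up placeholders, not results from the study. The sketch shows only how sorting by a correctness metric and by an efficiency metric can give different orderings.

```python
# Hypothetical scores for three imaginary models (NOT data from the study).
# pass_rate: fraction of problems solved correctly (higher is better).
# mean_seconds: average runtime of the correct solutions (lower is better).
scores = {
    "model_a": {"pass_rate": 0.80, "mean_seconds": 2.4},
    "model_b": {"pass_rate": 0.65, "mean_seconds": 0.9},
    "model_c": {"pass_rate": 0.72, "mean_seconds": 1.5},
}

# Rank by correctness: highest pass rate first.
by_correctness = sorted(
    scores, key=lambda name: scores[name]["pass_rate"], reverse=True
)

# Rank by efficiency: lowest mean runtime first.
by_efficiency = sorted(scores, key=lambda name: scores[name]["mean_seconds"])

print("Ranked by correctness:", by_correctness)
print("Ranked by efficiency: ", by_efficiency)
```

With these placeholder numbers the two rankings are exact opposites, which is why a benchmark that reports only one metric can give a misleading picture.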
The development of COFFE and the insights gained from its evaluation are significant steps toward efficient and effective artificial intelligence in software engineering. As researchers continue to improve language models, the benchmark can help verify that the code they generate meets the demands of real-world applications.
Cite this article: “Measuring Code Efficiency: A New Benchmark for AI-Generated Software”, The Science Archive, 2025.
Artificial Intelligence, Code Generation, Language Models, Software Engineering, Benchmark, COFFE, Efficiency, Correctness, Execution Time, Programming.







