Saturday 01 February 2025
The quest for fair evaluation of large language models (LLMs) has taken a crucial turn. A team of researchers has proposed a novel approach to combat data leakage, a pervasive problem that can skew results and render benchmarking unreliable.
Data leakage occurs when the data used to evaluate an LLM also appears in its training data, so the model may have already seen the tasks it is tested on. This is particularly problematic in software engineering, where LLMs are increasingly used to automate tasks such as code generation and test classification. The researchers argue that this contamination can lead to artificially inflated scores and misleading scientific conclusions.
To address this issue, the team has developed a combinatorial testing approach that generates diverse task instances from template tasks. These instances are designed to be semantically comparable while varying in complexity, making it more challenging for LLMs to memorize specific solutions. The resulting benchmark variants can be used to evaluate model performance over time, reducing the impact of data leakage and promoting fair comparisons.
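The idea of deriving many task instances from one template can be illustrated with a minimal sketch. The template text, the parameter names, and the `generate_variants` helper below are hypothetical and are not taken from the study. The sketch also enumerates every combination of parameter values, which is the simplest form of combinatorial generation and may differ from the selection strategy the researchers actually use.

```python
from itertools import product

# Hypothetical template task: the prompt text and the parameter names are
# illustrative assumptions, not taken from the study.
TEMPLATE = (
    "Write a function {name}({arg}) that returns the {op} "
    "of a list of {dtype}."
)

# Each parameter lists the values it may take in a generated variant.
PARAMETERS = {
    "name": ["solve", "compute"],
    "arg": ["values", "xs"],
    "op": ["sum", "maximum", "minimum"],
    "dtype": ["integers", "floats"],
}


def generate_variants(template, parameters):
    """Yield one task instance per combination of parameter values."""
    keys = list(parameters)
    for combo in product(*(parameters[key] for key in keys)):
        assignment = dict(zip(keys, combo))
        yield {"params": assignment, "prompt": template.format(**assignment)}


task_variants = list(generate_variants(TEMPLATE, PARAMETERS))
print(len(task_variants))          # number of generated task instances
print(task_variants[0]["prompt"])  # first generated prompt
```

Because every variant comes from the same template, the instances stay comparable in meaning while their surface form changes, which is what makes memorized solutions less useful.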
The researchers tested their approach using a subset of the HumanEval benchmark, which is widely used to assess code generation capabilities. They found that all models performed significantly better on the original HumanEval tasks than on the variant benchmarks. Since the variants are designed to be semantically comparable to the originals, this gap suggests that the original tasks may have leaked into the models' training data, leading to inflated scores.
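A comparison of this kind can be sketched as the difference between two pass rates. The function names and the outcome lists below are placeholders chosen for illustration. They are not the study's metric or its results.

```python
def pass_rate(results):
    """Return the fraction of solved tasks; `results` is a list of booleans."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(results) / len(results)


def score_gap(original_results, variant_results):
    """Pass rate on original tasks minus pass rate on variant tasks."""
    return pass_rate(original_results) - pass_rate(variant_results)


# Placeholder outcomes for one hypothetical model, NOT data from the study.
original_outcomes = [True, True, True, False]
variant_outcomes = [True, False, True, False]

gap = score_gap(original_outcomes, variant_outcomes)
print(gap)  # a positive gap means the model did better on the original tasks
```

A positive gap on its own does not prove leakage, since the variants could also be harder. It is a signal that warrants closer inspection.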
While this study focuses on software engineering, its implications extend to other domains where LLMs are being used. The combinatorial testing approach has the potential to improve evaluation fairness across various applications, from natural language processing to computer vision.
The development of robust benchmarks is critical for advancing the field of artificial intelligence. By acknowledging and addressing data leakage, researchers can ensure that their findings are reliable and meaningful. This study marks an important step towards creating a more transparent and trustworthy evaluation framework for LLMs.
Cite this article: “Fair Evaluation of Large Language Models: A Combinatorial Testing Approach to Combat Data Leakage”, The Science Archive, 2025.
Large Language Models, Data Leakage, Benchmarking, Software Engineering, Code Generation, Test Classification, Combinatorial Testing, Artificial Intelligence, Natural Language Processing, Computer Vision