Advancing AI’s Theorem-Proving Capabilities: A Novel Evaluation Method

Wednesday 19 March 2025


The ongoing quest for artificial intelligence (AI) systems that can reason and prove mathematical theorems has taken a significant step forward. Researchers have developed a novel approach to evaluating the performance of large language models (LLMs) in formal theorem proving, which could lead to more accurate assessments of their capabilities.


Traditionally, LLMs are evaluated on whether they can produce a valid proof of a theorem within a fixed number of attempts, such as 128. This metric, known as Pass@128, has been widely used to compare the performance of different models. However, it has several limitations. For instance, it ignores how difficult each theorem is, and it does not distinguish a theorem proved once in 128 attempts from one proved on nearly every attempt.
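To make the limitation concrete, here is a minimal Python sketch of a Pass@k-style score as described above: a theorem counts as solved if at least one of the first k attempts is a valid proof. The function names and the toy data are assumptions made for illustration; they are not taken from the paper or from any benchmark.

```python
from typing import Dict, List


def pass_at_k(attempts: Dict[str, List[bool]], k: int) -> float:
    """Fraction of theorems with at least one valid proof in the first k attempts."""
    if not attempts:
        raise ValueError("no theorems to evaluate")
    solved = sum(1 for outcomes in attempts.values() if any(outcomes[:k]))
    return solved / len(attempts)


def attempt_success_rate(outcomes: List[bool]) -> float:
    """Share of individual attempts that produced a valid proof."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# Hypothetical outcomes for three theorems, 128 attempts each.
toy_attempts = {
    "thm_a": [False] * 127 + [True],  # proved only on the last attempt
    "thm_b": [True] * 128,            # proved on every attempt
    "thm_c": [False] * 128,           # never proved
}

score = pass_at_k(toy_attempts, k=128)
rate_a = attempt_success_rate(toy_attempts["thm_a"])
rate_b = attempt_success_rate(toy_attempts["thm_b"])
```

In this toy data, thm_a and thm_b contribute equally to the Pass@128 score even though their attempt success rates differ widely, which is the kind of information the metric discards.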


To address these issues, researchers have developed an adaptive evaluation method that assigns weights to theorems based on their difficulty and uses a more comprehensive metric that incorporates the attempt success rate. This approach allows for a more accurate assessment of an LLM’s performance in formal theorem proving.
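The article does not give the paper's formulas, so the following Python sketch only illustrates the general idea of a difficulty-weighted score. It assumes, purely for illustration, that a theorem's difficulty weight is one minus the mean attempt success rate across all models, and that a model's score is the weighted average of its per-theorem success rates. The names, the weighting rule, and the toy data are all hypothetical and should not be read as the researchers' method.

```python
from typing import Dict, List

# model name -> theorem name -> list of attempt outcomes (True = valid proof)
Results = Dict[str, Dict[str, List[bool]]]


def success_rate(outcomes: List[bool]) -> float:
    """Share of attempts that produced a valid proof."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def difficulty_weights(results: Results) -> Dict[str, float]:
    """Assumed rule: a theorem is harder the less often models prove it."""
    theorems = next(iter(results.values())).keys()
    weights = {}
    for theorem in theorems:
        mean_rate = sum(
            success_rate(results[model][theorem]) for model in results
        ) / len(results)
        weights[theorem] = 1.0 - mean_rate
    return weights


def weighted_score(model_results: Dict[str, List[bool]],
                   weights: Dict[str, float]) -> float:
    """Weighted average of per-theorem success rates for one model."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    return sum(
        weights[t] * success_rate(model_results[t]) for t in weights
    ) / total


# Hypothetical outcomes for two models on an easy and a hard theorem.
toy_results: Results = {
    "model_a": {"easy": [True] * 4, "hard": [False] * 4},
    "model_b": {"easy": [True, False, False, False],
                "hard": [True, True, False, False]},
}

weights = difficulty_weights(toy_results)
score_a = weighted_score(toy_results["model_a"], weights)
score_b = weighted_score(toy_results["model_b"], weights)
```

Under these assumptions, model_b scores higher than model_a because it succeeds more often on the theorem that receives the larger weight, even though model_a is perfect on the easy one.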


The new evaluation method was tested on 10 different LLMs, including some popular models such as Codegeex4-9b and DeepSeek-Prover-V1.5-RL. The results showed that the adaptive evaluation method provided a more nuanced view of each model’s strengths and weaknesses. For example, while DeepSeek-Prover-V1.5-RL performed well in terms of Pass@128, it struggled with more difficult theorems. In contrast, Codegeex4-9b had a lower pass rate but was able to successfully prove more theorems.


The researchers also found that some models performed better on certain types of theorems than others. For instance, TheoremLlama excelled in proving algebraic theorems, while MetaMath-Llemma-7b was more successful with number theory proofs.


These findings have important implications for the development and evaluation of LLMs. By using an adaptive evaluation method that takes into account the difficulty of the theorems being proved, researchers can gain a better understanding of each model’s capabilities and limitations. This could lead to the development of more specialized models that are better suited to specific tasks.


In addition, the new evaluation method could help to identify areas where LLMs need improvement. For example, a model that struggles with more difficult theorems may need more training data or better algorithms for handling complex mathematical concepts.


Cite this article: “Advancing AI’s Theorem-Proving Capabilities: A Novel Evaluation Method”, The Science Archive, 2025.


Keywords: Artificial Intelligence, Large Language Models, Formal Theorem Proving, Adaptive Evaluation Method, Pass@128, Theorems, Difficulty Assessment, Attempt Success Rate, Machine Learning, Mathematical Proofing


Reference: Jianyu Zhang, Yongwang Zhao, Long Zhang, Jilin Hu, Xiaokun Luan, Zhiwei Xu, Feng Yang, “Psychometric-Based Evaluation for Theorem Proving with Large Language Models” (2025).

