Automating Program Verification with Large Language Models

Saturday 22 March 2025


Researchers have made significant progress in automating program verification, a crucial step in ensuring software is reliable and secure. A recent advance is a system that verifies software at the scale of entire repositories by using large language models (LLMs) to generate proofs.


Program verification is the process of formally checking that a program’s behavior matches its intended specification. This task is notoriously difficult, especially for complex programs with many interacting components. Traditional approaches rely on manual proof construction, which is time-consuming and error-prone.
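To make the idea concrete, here is a minimal illustration (a hypothetical example, not drawn from the paper): a function whose specification is written as a precondition and a postcondition, checked at runtime with assertions.

```python
def binary_search(xs, target):
    """Return an index i with xs[i] == target, or -1 if target is absent.

    Precondition:  xs is sorted in ascending order.
    Postcondition: result == -1 or xs[result] == target.
    """
    # Precondition check: the input must be sorted.
    assert all(xs[i] <= xs[i + 1] for i in range(len(xs) - 1)), "input not sorted"
    lo, hi = 0, len(xs) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if xs[mid] == target:
            result = mid
            break
        elif xs[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    else:  # loop exhausted without finding target
        result = -1
    # Postcondition check: any returned index really holds the target.
    assert result == -1 or xs[result] == target
    return result
```

These assertions only check the conditions on each concrete run; a static verifier such as Verus instead *proves* them for all possible inputs, which is what makes constructing such proofs hard.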


To overcome these limitations, researchers have turned to artificial intelligence (AI) techniques. Specifically, they’ve been exploring the potential of LLMs in generating proofs for program verification. These language models are trained on vast amounts of text data and can produce coherent and context-specific responses.


The system, called RagVerus, combines the strengths of LLMs and traditional program verification techniques. It uses a retrieval-augmented generation approach to produce proofs for the Verus verifier that are informed by cross-module dependencies and project-wide examples.


RagVerus consists of two main components: a retrieval module and a generation module. The retrieval module searches for relevant code snippets and proof annotations within a large repository, while the generation module uses these retrieved pieces to construct a complete proof.
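The two-stage pipeline can be sketched as follows. This is an illustrative simplification, not RagVerus’s actual implementation: retrieval here is plain token overlap, where the real system uses richer repository-aware signals, and the "generation" step only assembles the prompt that would be sent to an LLM.

```python
def retrieve(query, corpus, k=2):
    """Rank snippets by token overlap with the query (a simple stand-in
    for RagVerus's retrieval module) and return the top k."""
    q = set(query.lower().split())
    scored = sorted(corpus,
                    key=lambda s: len(q & set(s.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(task, examples):
    """Assemble a generation prompt from the task and retrieved examples."""
    context = "\n\n".join(f"Example:\n{e}" for e in examples)
    return f"{context}\n\nComplete the proof for:\n{task}"

# Hypothetical repository of annotated functions (Verus-style signatures).
corpus = [
    "fn max(a: u64, b: u64) -> (r: u64) ensures r >= a && r >= b",
    "fn sort(v: &mut Vec<u64>) ensures sorted(v)",
    "fn abs(x: i64) -> (r: i64) ensures r >= 0",
]
examples = retrieve("prove r >= both arguments of max", corpus)
prompt = build_prompt("fn max3(a: u64, b: u64, c: u64) -> (r: u64)", examples)
```

The key design point is that the prompt the model sees is grounded in real annotated code from the same repository, rather than generated from the task description alone.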


The system has been tested on a novel dataset called RepoVBench, which comprises verification tasks derived from real-world software systems. Experimental results show that RagVerus outperforms prior approaches both in proof completion rate and in code similarity to the ground-truth proofs.
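The two reported metrics are straightforward to compute; the sketch below shows one plausible formulation (the paper’s exact definitions, particularly of code similarity, may differ).

```python
import difflib

def completion_rate(results):
    """Fraction of tasks for which the generated proof was accepted
    by the verifier. `results` is a list of booleans."""
    return sum(results) / len(results)

def code_similarity(generated, reference):
    """Character-level similarity between a generated proof and the
    ground-truth proof, via difflib's ratio in [0, 1]."""
    return difflib.SequenceMatcher(None, generated, reference).ratio()

# Toy run: 3 of 4 generated proofs verified successfully.
rate = completion_rate([True, True, False, True])  # 0.75
sim = code_similarity("ensures r >= a",
                      "ensures r >= a && r >= b")
```

A high completion rate with low similarity would suggest the model finds valid but unidiomatic proofs; RagVerus reportedly improves on both axes.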


One of the key challenges in program verification is handling complex dependencies between different parts of a program. RagVerus addresses this issue by incorporating context-aware prompting, which enables the LLMs to generate proofs that take into account these intricate relationships.
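One way to supply such context is to walk the call graph and gather the specifications of everything the target function depends on, so the model sees the contracts it must respect. The sketch below is hypothetical: the names, the call-graph representation, and the traversal are illustrative, not RagVerus’s actual mechanism.

```python
# Illustrative repository state: specs per function, plus a call graph.
SPECS = {
    "parse": "fn parse(s: &str) -> (t: Token) ensures valid(t)",
    "eval":  "fn eval(t: Token) -> (v: i64) requires valid(t)",
}
CALL_GRAPH = {"run": ["parse", "eval"], "parse": [], "eval": []}

def dependency_context(target, depth=2):
    """Collect the specs of `target`'s (transitive) callees up to
    `depth` hops, in discovery order, for inclusion in the prompt."""
    seen, frontier = [], [target]
    for _ in range(depth):
        frontier = [c for f in frontier for c in CALL_GRAPH.get(f, [])]
        for c in frontier:
            if c not in seen:
                seen.append(c)
    return [SPECS[c] for c in seen]

ctx = dependency_context("run")  # specs of parse and eval
```

Feeding these callee contracts to the model lets it, for example, discharge `eval`’s `requires valid(t)` by citing `parse`’s `ensures valid(t)` — exactly the kind of cross-module reasoning a file-local prompt cannot support.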


The system’s ability to handle large-scale software is demonstrated by its performance on RepoVBench-Complex, a challenging subset of the dataset. These tasks stress the retrieval module in particular, which must locate relevant code and proof annotations scattered across a large repository before the generation module can assemble a complete proof.


While RagVerus represents a significant step forward in automating program verification, there is still much work to be done. Future research will focus on handling very long proofs, and on scaling the sampling budget and context window to cover larger repositories.


By reducing the manual effort of proof construction, RagVerus could make formal guarantees practical for a much wider range of software, with direct implications for reliability and security.


Cite this article: “Automating Program Verification with Large Language Models”, The Science Archive, 2025.


Program Verification, Artificial Intelligence, Large Language Models, Software Reliability, Security, RagVerus, Retrieval-Augmented Generation, Proof Completion Rates, Code Similarity


Reference: Sicheng Zhong, Jiading Zhu, Yifang Tian, Xujie Si, “RAG-Verus: Repository-Level Program Verification with LLMs using Retrieval Augmented Generation” (2025).
