Sunday 02 March 2025
The quest for efficient and accurate code testing has been a long-standing challenge in software development. A team of researchers has proposed DeCon, an approach that uses postconditions generated by a large language model to detect incorrect assertions produced by such models.
Assertions are an essential component of code testing, as they help identify errors and ensure the correctness of a program’s behavior. However, writing accurate assertions can be a laborious process, especially for complex software systems. Large language models, such as those used in chatbots like ChatGPT, have shown promise in automating this task by generating test cases and assertions. Yet these models often produce incorrect or incomplete assertions, which can lead to false positives and false negatives.
DeCon addresses this issue by using a large language model to generate postconditions, which are formal specifications of a program’s expected behavior. These postconditions are then used to detect incorrect assertions generated by other models. The researchers demonstrate the effectiveness of DeCon on the HumanEval benchmark, which consists of 164 programming problems with known solutions.
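To make the idea concrete, a postcondition can be read as a predicate over a function's input and its output. The sketch below is illustrative only: the list-sorting task, the `sort_numbers` function mentioned in the comments, and the name `postcondition_sorted` are assumptions made for this example, not taken from the DeCon paper or from HumanEval.

```python
from collections import Counter


def postcondition_sorted(numbers, result):
    """Hypothetical postcondition for a 'sort a list' task.

    Holds when `result` is in ascending order and is a permutation
    of the input `numbers`.
    """
    is_ordered = all(a <= b for a, b in zip(result, result[1:]))
    same_elements = Counter(result) == Counter(numbers)
    return is_ordered and same_elements


# An assertion such as `assert sort_numbers([3, 1, 2]) == [1, 2, 3]`
# pairs an input with an expected output; this pair satisfies the postcondition.
correct_pair = ([3, 1, 2], [1, 2, 3])

# An incorrect assertion claims an output that the postcondition rejects.
incorrect_pair = ([3, 1, 2], [1, 3, 2])

correct_ok = postcondition_sorted(*correct_pair)
incorrect_ok = postcondition_sorted(*incorrect_pair)
```

Note that the postcondition never needs the reference implementation: it only checks a property that any correct output must have, which is what lets it judge an assertion's expected value on its own.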
The results show that DeCon can detect over 64% of incorrect assertions generated by four state-of-the-art large language models, including GPT-3.5 and GPT-4. Moreover, DeCon improves the effectiveness of these models in code generation and fault-finding by up to 25%. This is a significant improvement over traditional approaches that rely on manual testing or use static analysis tools.
The researchers also conducted an empirical study on the quality of assertions generated by large language models. The results reveal that over 62% of these assertions are incorrect, highlighting the need for more effective approaches like DeCon to ensure the reliability of code testing.
DeCon’s architecture consists of three main components: postcondition generation, postcondition filtering, and fault detection. The first component uses a large language model to generate postconditions based on the problem description and target function signature. The second component filters out incorrect postconditions using a small set of input-output (I/O) examples provided with the problem description. Finally, the third component flags an assertion generated by another model as incorrect if it violates any of the remaining postconditions.
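The filtering and detection stages described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the postconditions here are hand-written lambdas standing in for LLM output, assertions are modelled as (input, expected output) pairs, and all function names (`holds`, `filter_postconditions`, `detect_incorrect`) are hypothetical.

```python
from typing import Any, Callable, List, Tuple

# Assumption: a postcondition is a predicate over (input, output).
Postcondition = Callable[[Any, Any], bool]
# Assumption: an assertion is an (input, expected_output) pair.
Assertion = Tuple[Any, Any]


def holds(post: Postcondition, inp: Any, out: Any) -> bool:
    """Evaluate a postcondition; as a simplification, an exception counts as a violation."""
    try:
        return bool(post(inp, out))
    except Exception:
        return False


def filter_postconditions(
    candidates: List[Postcondition], io_examples: List[Assertion]
) -> List[Postcondition]:
    """Keep only postconditions that hold on every trusted I/O example."""
    return [p for p in candidates if all(holds(p, i, o) for i, o in io_examples)]


def detect_incorrect(
    assertions: List[Assertion], postconditions: List[Postcondition]
) -> List[Assertion]:
    """Flag assertions whose expected output violates any surviving postcondition."""
    return [
        a for a in assertions
        if any(not holds(p, a[0], a[1]) for p in postconditions)
    ]


# Stand-ins for generated postconditions on an "absolute value" task.
candidates = [
    lambda x, y: y >= 0,              # correct property
    lambda x, y: y == x or y == -x,   # correct property
    lambda x, y: y > x,               # wrong: fails on the example abs(2) == 2
]
io_examples = [(2, 2), (-3, 3)]                    # examples from the problem description
assertions = [(-5, 5), (4, -4), (0, 0), (7, 8)]    # stand-ins for model-generated assertions

surviving = filter_postconditions(candidates, io_examples)
flagged = detect_incorrect(assertions, surviving)
print(len(surviving), flagged)
```

In this toy run the third candidate is discarded by the I/O examples, and the two assertions with impossible expected values, `(4, -4)` and `(7, 8)`, are flagged by the two postconditions that remain.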
The researchers evaluated DeCon’s performance using four large language models: GPT-3.5, GPT-4, CodeGen, and Codex.
Cite this article: “Efficient and Accurate Code Testing with DeCon: A Novel Approach Using Large Language Models”, The Science Archive, 2025.
Code Testing, Large Language Models, DeCon, Assertions, Postconditions, Code Generation, Fault-Finding, HumanEval Benchmark, GPT-3.5, GPT-4, Code Quality







