Designing Experiments with Large Language Models for Software Engineering Research

Sunday 30 November 2025

Recently, a team of researchers has been working on a framework for designing and conducting software engineering experiments with large language models (LLMs). LLMs are artificial intelligence systems that process and generate human-like text, and they are increasingly used for applications such as coding, writing, and even art.

The team’s goal is to create a standardized way of evaluating the performance and quality of LLM-generated code, which will help software engineers and researchers make more informed decisions when using these models. The framework is designed to be flexible and adaptable to different experimental settings and goals, allowing researchers to tailor their experiments to specific research questions.

The framework consists of six core components organized into a modular structure: the coding task, the quality attributes, the empirical research method, the environment, the LLM model, and the generated output. Each component captures a specific aspect of the experimental design and execution.

For example, the coding task component defines what kind of programming problem the LLM will be solving, such as writing a program to solve a mathematical equation or generating code for a web application. The quality attributes component specifies the criteria that will be used to evaluate the generated code, such as correctness, efficiency, and maintainability.
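To make the correctness attribute more concrete, here is a minimal sketch of how a generated function might be scored against a handful of predefined test cases. This is purely illustrative: the generated code, the function name, and the test cases are invented for this example and are not taken from the study itself.

```python
# Illustrative only: scoring a generated function for correctness as a pass rate.
# The generated_code string and the test cases below are hypothetical examples,
# not artifacts from the published framework.

generated_code = """
def solve(x):
    return x * x + 1
"""

test_cases = [(0, 1), (2, 5), (-3, 10)]  # (input, expected output) pairs

namespace = {}
exec(generated_code, namespace)   # load the generated function into a namespace
solve = namespace["solve"]

passed = sum(1 for x, expected in test_cases if solve(x) == expected)
pass_rate = passed / len(test_cases)
print(f"Correctness (pass rate): {pass_rate:.0%}")
```

Other attributes such as efficiency or maintainability would need their own measures (runtime, static-analysis scores, and so on), which is exactly why the framework asks researchers to state them explicitly up front.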

The empirical research method component outlines the approach that will be used to collect data from the LLM-generated code, such as running the code on a specific dataset or testing it against a set of pre-defined scenarios. The environment component describes the setting in which the experiment will take place, such as a cloud-based infrastructure or a local machine.
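A data-collection step of this kind often boils down to a simple driver loop: for each task, ask the model for code and record what comes back. The sketch below assumes a placeholder generate_code helper standing in for whatever LLM API an experiment actually uses; the task list and model name are likewise hypothetical.

```python
# Sketch of a data-collection loop for an LLM code-generation experiment.
# generate_code, the task list, and "example-model" are placeholders, not
# part of the published framework.
import csv

tasks = [
    {"id": "t1", "prompt": "Write a function that reverses a string."},
    {"id": "t2", "prompt": "Write a function that checks whether a number is prime."},
]

def generate_code(task, model="example-model"):
    # Placeholder: in a real experiment this would call the chosen LLM.
    return "def todo(): pass"

with open("results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["task_id", "model", "generated_code"])
    for task in tasks:
        code = generate_code(task)
        writer.writerow([task["id"], "example-model", code])
```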

The LLM model component specifies the type and configuration of the LLM that will be used in the experiment, such as its language capabilities, knowledge domain, and training data. Finally, the generated output component captures the actual code produced by the LLM, which can be used to evaluate its quality and performance.
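One way to picture how these six components fit together is as a single experiment record that a researcher fills in before and during a study. The sketch below is not the authors' actual schema; the field names paraphrase the components described above, and the example values are invented for illustration.

```python
# A minimal sketch of the six framework components as one experiment record.
# Field names paraphrase the article; example values are hypothetical.
from dataclasses import dataclass, field

@dataclass
class ExperimentSpec:
    coding_task: str                   # what the LLM is asked to produce
    quality_attributes: list[str]      # e.g. correctness, efficiency, maintainability
    research_method: str               # how data is collected (benchmark run, scenarios, ...)
    environment: str                   # where the experiment runs (cloud, local machine, ...)
    llm_model: dict = field(default_factory=dict)  # model name and configuration
    generated_output: str = ""         # code produced by the model, filled in later

spec = ExperimentSpec(
    coding_task="Generate a Python function that parses ISO-8601 dates",
    quality_attributes=["correctness", "maintainability"],
    research_method="controlled benchmark with predefined test scenarios",
    environment="local machine, Python 3.11",
    llm_model={"name": "example-model", "temperature": 0.2},
)
```

Keeping all six pieces in one place like this is what makes an experiment reproducible and comparable: another team can rerun the study by swapping in a different model or environment while holding the rest fixed.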

By using this framework, researchers can design experiments that are more comprehensive and systematic, allowing them to draw more reliable conclusions about the effectiveness of LLMs in software engineering tasks. The framework also provides a common language and set of standards for reporting experimental results, making it easier for researchers to communicate their findings to one another and to industry practitioners.

The team’s work has already led to several published studies that demonstrate the potential benefits of using LLMs in software engineering.

Cite this article: “Designing Experiments with Large Language Models for Software Engineering Research”, The Science Archive, 2025.

Large Language Models, Software Engineering, Experiment Design, Code Generation, Quality Attributes, Empirical Research, Artificial Intelligence, Experimental Framework, Research Methodology, Machine Learning.

Reference: Nathalia Nascimento, Everton Guimaraes, Paulo Alencar, “Designing Empirical Studies on LLM-Based Code Generation: Towards a Reference Framework” (2025).
