Automated Red Teaming Framework for Large Language Models

Friday 28 March 2025


For years, researchers have been working on ways to automatically generate high-quality prompts for red teaming large language models (LLMs). Red teaming is the process of probing an LLM with potentially harmful inputs to test whether it responds safely and responsibly. The goal is to identify and mitigate risks before these models are deployed in real-world applications.


Recently, a team of scientists has developed a new framework called RTPE that can automatically generate red team prompts for LLMs. RTPE stands for Red Team Prompt Evolution, and it uses a combination of natural language processing (NLP) techniques and machine learning algorithms to create effective attack prompts.


The process starts with a set of initial seed prompts, which are used to generate the first batch of red team prompts. These prompts are then scored by a safety model, which assesses whether each one elicits an unsafe response from the target LLM. The top-performing prompts are selected and used as the basis for generating additional prompts through a process called in-breadth evolving.
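
In outline, this is a selection loop: generate candidates, score them with the safety model, keep the best, and evolve them again. The Python sketch below illustrates the idea only; the function names (generate_candidates, safety_score, evolve) and the scoring logic are placeholder stand-ins for LLM calls, not RTPE's actual implementation.

```python
import random

# Toy stand-ins: in a real pipeline these would call an attacker LLM,
# the target LLM, and a safety/judge model respectively.
def generate_candidates(seed: str, n: int = 4) -> list[str]:
    return [f"{seed} (variant {i})" for i in range(n)]

def safety_score(prompt: str) -> float:
    # Placeholder: a judge model would rate how unsafe the target's
    # response to `prompt` is, e.g. on a 0-1 scale.
    return random.random()

def evolve(seeds: list[str], generations: int = 3, keep: int = 5) -> list[str]:
    pool = list(seeds)
    for _ in range(generations):
        # Expand the pool, then keep only the prompts the safety model
        # judges most effective as the basis for the next generation.
        candidates = [c for s in pool for c in generate_candidates(s)]
        pool = sorted(candidates, key=safety_score, reverse=True)[:keep]
    return pool

if __name__ == "__main__":
    print(evolve(["seed prompt about a restricted topic"], generations=2))
```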


In-breadth evolving uses NLP techniques to analyze the structure and content of the selected prompts, then modifies them to create new variations that are more effective at eliciting harmful responses from the LLM. This process is repeated multiple times, with each iteration generating new prompts that are evaluated by the safety model.
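
As a rough illustration, in-breadth evolving can be pictured as rewriting a surviving prompt into several stylistic variations. The helper rewrite_with_llm and the style list below are hypothetical; in practice an attacker LLM would perform the rewriting, and here the helper only formats an instruction string so the example runs standalone.

```python
# Minimal sketch of in-breadth evolving: one new candidate per style,
# broadening the pool of attack prompts.
STYLES = ["news report", "short story", "dialogue", "poem"]

def rewrite_with_llm(prompt: str, style: str) -> str:
    # Hypothetical: an attacker LLM would rewrite the prompt in this style.
    return f"Rewrite in the style of a {style}: {prompt}"

def in_breadth_evolve(prompt: str) -> list[str]:
    return [rewrite_with_llm(prompt, style) for style in STYLES]

print(in_breadth_evolve("seed prompt selected by the safety model"))
```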


The second component of RTPE is in-depth evolving, which uses machine learning algorithms to identify patterns and relationships within the generated prompts. These patterns are then used to generate even more effective prompts, which are further evaluated by the safety model.
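
One way to picture this step, under the assumption that "patterns" means recurring phrasing shared by the most successful prompts, is to mine frequent word pairs from the high-scoring prompts and fold them back into new candidates. The bigram mining below is deliberately simple and is only an assumed stand-in for whatever learning method the framework actually uses.

```python
from collections import Counter

def frequent_bigrams(prompts: list[str], top_k: int = 3) -> list[str]:
    # Count word bigrams across the prompts that already scored well.
    counts = Counter()
    for p in prompts:
        words = p.lower().split()
        counts.update(" ".join(pair) for pair in zip(words, words[1:]))
    return [bigram for bigram, _ in counts.most_common(top_k)]

def in_depth_evolve(base_prompt: str, successful_prompts: list[str]) -> list[str]:
    # Reuse each mined pattern when composing a new candidate prompt.
    patterns = frequent_bigrams(successful_prompts)
    return [f"{base_prompt} ({pattern})" for pattern in patterns]

print(in_depth_evolve("base prompt",
                      ["effective prompt about topic x",
                       "another effective prompt about topic x"]))
```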


Through this process, RTPE is able to generate a large and varied pool of high-quality red team prompts that can be used to test an LLM's robustness across a variety of scenarios. The effectiveness of these prompts was demonstrated through experiments using GPT-3.5-turbo-0613 as both the attack model and the target model.


The results showed that RTPE significantly outperformed existing methods for generating red team prompts, with an average attack success rate of 76%. Additionally, the generated prompts were highly diverse and remained effective at eliciting harmful responses from the LLM regardless of the literary genre used as the Mutagenic Factor.
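
Attack success rate here is simply the fraction of prompts whose responses the safety model flags as unsafe. The snippet below shows that calculation with illustrative boolean judgments, not data from the paper.

```python
def attack_success_rate(judgments: list[bool]) -> float:
    # Fraction of attack prompts judged to have elicited an unsafe response.
    return sum(judgments) / len(judgments) if judgments else 0.0

print(attack_success_rate([True, True, False, True]))  # 0.75
```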


The reliability of the safety model used in RTPE was also evaluated through a manual verification process, which found that the GPT-3.5-based safety judgments were broadly consistent with human assessments.


Cite this article: “Automated Red Teaming Framework for Large Language Models”, The Science Archive, 2025.


Red Teaming, Language Models, Natural Language Processing, Machine Learning, Safety Model, Prompts, Attack Prompts, NLP, GPT-3.5, Robustness


Reference: Rui Li, Peiyi Wang, Jingyuan Ma, Di Zhang, Lei Sha, Zhifang Sui, “Be a Multitude to Itself: A Prompt Evolution Framework for Red Teaming” (2025).

