Monday 10 March 2025
Researchers have discovered a new way for attackers to manipulate large language models, potentially allowing them to inject malicious prompts and alter the output of these powerful AI systems.
The attack exploits a weakness in the fine-tuning process used by many language models, which allows an attacker to manipulate the order in which the model processes input data. By carefully crafting a sequence of inputs, an attacker can trick the model into producing a desired response, essentially injecting their own prompt into the system.
To demonstrate this vulnerability, researchers created a series of attacks built on different types of prompts, including persuasion attempts and phishing scams. In each case, they successfully manipulated the language model’s output, showing how much damage the weakness could cause if left unaddressed.
One of the most concerning aspects of this attack is its ease of execution. The researchers found that an attacker only needs to make a few requests to the fine-tuning API in order to recover the permutation function used by the model, allowing them to craft targeted attacks.
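The article does not say what the permutation function is or how the researchers recovered it. As a rough intuition only, the toy sketch below shows why a fixed but hidden reordering can be learned from very few queries. Everything in it is an assumption made for illustration: `make_shuffling_oracle` is a hypothetical stand-in for a service that returns per-item results in a secret order, and it is not the fine-tuning API or the researchers' method.

```python
import random


def make_shuffling_oracle(n, seed=0):
    """Hypothetical service: returns its n inputs in a fixed, hidden order."""
    rng = random.Random(seed)
    hidden = list(range(n))
    rng.shuffle(hidden)

    def oracle(values):
        if len(values) != n:
            raise ValueError("expected exactly %d values" % n)
        # Output position i carries the result for input position hidden[i].
        return [values[hidden[i]] for i in range(n)]

    return oracle, hidden


def recover_permutation(oracle, n):
    """Recover the hidden order by sending one distinct marker per position."""
    markers = [1000 + j for j in range(n)]
    position_of = {marker: j for j, marker in enumerate(markers)}
    response = oracle(markers)
    # Each marker seen at output position i reveals which input landed there.
    return [position_of[value] for value in response]


def unshuffle(shuffled, permutation):
    """Put shuffled results back into the original input order."""
    restored = [None] * len(shuffled)
    for i, source_index in enumerate(permutation):
        restored[source_index] = shuffled[i]
    return restored


if __name__ == "__main__":
    oracle, _ = make_shuffling_oracle(6, seed=1)
    perm = recover_permutation(oracle, 6)
    print(unshuffle(oracle(list("abcdef")), perm))
```

In this toy setting a single probe is enough, because every input carries a unique marker. A real system would likely return noisier signals, which may be why the researchers needed a few requests rather than one.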
The implications of this vulnerability are significant, as language models are increasingly used in critical applications such as customer service chatbots and language translation systems. If the weakness is left unpatched, attackers could manipulate these models’ output, with potentially serious consequences for users.
To combat this threat, researchers recommend improving the security of fine-tuning APIs by implementing additional checks on input data and restricting access to sensitive information. They also suggest developing more robust methods for detecting and mitigating attacks against language models.
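The article does not specify which input checks the researchers have in mind. As one possible reading, a provider could screen submitted training examples before a fine-tuning job is accepted. The sketch below is a minimal illustration under that assumption; the phrase list, the length limit, and the function names are all hypothetical and would need tuning for any real service.

```python
# Hypothetical screening rules, chosen only for illustration.
SUSPICIOUS_PHRASES = (
    "ignore previous instructions",
    "disregard the above",
    "enter your password",
)
MAX_CHARS = 4000


def check_example(text):
    """Return the reasons an example was flagged; an empty list means it passed."""
    reasons = []
    if not text.strip():
        reasons.append("empty example")
    if len(text) > MAX_CHARS:
        reasons.append("example too long")
    lowered = text.lower()
    for phrase in SUSPICIOUS_PHRASES:
        if phrase in lowered:
            reasons.append("suspicious phrase: " + phrase)
    return reasons


def filter_dataset(examples):
    """Split examples into accepted ones and (example, reasons) rejections."""
    accepted, rejected = [], []
    for text in examples:
        reasons = check_example(text)
        if reasons:
            rejected.append((text, reasons))
        else:
            accepted.append(text)
    return accepted, rejected
```

Simple phrase matching like this is easy to evade, so it would at best complement the other measures the researchers recommend, such as restricting access to sensitive information.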
For now, however, it’s clear that attackers have a new tool in their arsenal, and language model developers must take steps to ensure the integrity of these powerful AI systems.
The researchers tested their attack on several different language models, including Google’s Gemini 1.0 Pro and the Purple Llama model. In each case, they were able to successfully manipulate the output, highlighting the widespread nature of this vulnerability.
To illustrate the potential impact of this attack, the researchers built examples of how an attacker might use the technique to inject malicious prompts into language models. In one scenario, they showed how a persuasion prompt could convince a model to generate false information; in another, they showed how a phishing prompt could be used to trick users into divulging sensitive information.
These examples serve as a stark reminder of the potential consequences of this vulnerability and highlight the urgent need for language model developers to address these security concerns.
Cite this article: “New Attack Method Allows Manipulation of Large Language Models”, The Science Archive, 2025.
Language Models, Fine-Tuning, Vulnerability, Attacks, Manipulation, Prompts, AI Systems, Phishing, Persuasion, Security







