Breakthrough Technique Boosts Performance of Smaller Language Models

Friday 25 July 2025

For years, language models have been impressing us with their ability to generate human-like responses to our queries. But there’s a catch: these models only shine when they’re large and complex. Smaller models, on the other hand, struggle to keep up.

Researchers have been looking for ways to close this gap. One common approach is knowledge distillation, in which a smaller "student" model is trained to imitate the responses of a larger, more knowledgeable "teacher" model. But this method has a limitation: the student only ever sees the teacher's final answers, not how confident the teacher is in them, so much of what the teacher knows is never passed on. A rough sketch of this response-level distillation is shown below.
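
As an illustration, here is what response-level distillation might look like in code. This is only a minimal sketch assuming a Hugging Face-style API; the model names, the toy prompt, and the bare-bones training loop are placeholders, not details from the paper.

```python
# Hypothetical sketch of response-level knowledge distillation:
# fine-tune a small "student" model on answers sampled from a larger "teacher".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher = AutoModelForCausalLM.from_pretrained("teacher-7b")    # placeholder model name
student = AutoModelForCausalLM.from_pretrained("student-1.5b")  # placeholder model name
tok = AutoTokenizer.from_pretrained("teacher-7b")               # assumes a shared tokenizer

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

for prompt in ["Explain photosynthesis in one sentence."]:      # toy prompt set
    # 1) Ask the teacher for an answer to the prompt.
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        answer_ids = teacher.generate(**inputs, max_new_tokens=64)

    # 2) Train the student to reproduce that answer token by token
    #    (for simplicity, the prompt tokens are included in the loss).
    out = student(input_ids=answer_ids, labels=answer_ids)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```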

Recently, researchers have developed a new technique that addresses this limitation. Dubbed daDPO (Distribution-Aware DPO), the approach considers not only the responses the teacher gives but also the output distribution behind them. In other words, it looks at how much probability the teacher assigns to each possible response, not just which response the teacher happened to produce.

The researchers tested the new method on several language models, including Vicuna-7B and the Qwen2.5 series. They found that daDPO significantly outperformed traditional knowledge distillation methods, both at restoring the performance of pruned models and at enhancing smaller language models.

One of the most impressive results came from comparing a Vicuna-7B model with 20% of its parameters pruned away against its unpruned counterpart. After training with daDPO, the pruned model recovered near-teacher performance, scoring just 7.3% below the full-size model.

Another notable finding was that daDPO allowed a smaller Qwen2.5-1.5B model to occasionally outperform its 7B teacher on certain tasks. This is remarkable, considering that the teacher is several times larger.

So how does daDPO work? The method samples responses from both the student and the teacher, builds preference pairs from them, and trains the student with a loss in the style of Direct Preference Optimization (DPO). The distribution-aware part is an extra term in that loss based on how much probability the teacher itself assigns to each response, rather than a judgment about which answer is objectively correct.
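
The exact objective is laid out in the paper, but its general shape can be sketched as a standard DPO loss plus a term driven by the teacher's probabilities. The code below is a hypothetical simplification: the gamma weight and the way the teacher term enters the loss are assumptions made for illustration, not the paper's formula.

```python
# Hypothetical sketch of a distribution-aware, DPO-style loss, assuming the
# relevant log-probabilities have already been computed for a preference pair.
import torch
import torch.nn.functional as F

def distribution_aware_dpo_loss(
    student_logp_chosen,    # log pi_student(chosen | prompt)
    student_logp_rejected,  # log pi_student(rejected | prompt)
    ref_logp_chosen,        # log pi_reference(chosen | prompt)
    ref_logp_rejected,      # log pi_reference(rejected | prompt)
    teacher_logp_chosen,    # log pi_teacher(chosen | prompt)
    teacher_logp_rejected,  # log pi_teacher(rejected | prompt)
    beta=0.1,               # standard DPO temperature
    gamma=0.1,              # weight on the teacher-distribution term (assumed)
):
    # Standard DPO implicit reward: how far the student has moved from the
    # reference model on the chosen response versus the rejected one.
    dpo_margin = beta * (
        (student_logp_chosen - ref_logp_chosen)
        - (student_logp_rejected - ref_logp_rejected)
    )
    # Distribution-aware term: favour responses the teacher itself assigns
    # higher probability to (this particular weighting is an assumption).
    teacher_margin = gamma * (teacher_logp_chosen - teacher_logp_rejected)
    return -F.logsigmoid(dpo_margin + teacher_margin).mean()
```

In a pair like this, the chosen response would typically be the teacher's answer and the rejected one the student's, with each log-probability computed over the full response given the prompt.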

The implications of daDPO are significant. With this technique, developers can build smaller language models that are cheaper to run and easier to deploy while keeping much of the performance of their larger teachers. That matters for applications such as machine translation, chatbots, and other natural language processing systems that need to run on limited hardware.

Cite this article: “Breakthrough Technique Boosts Performance of Smaller Language Models”, The Science Archive, 2025.

Language Models, Knowledge Distillation, daDPO, Distribution-Aware DPO, Vicuna-7B, Qwen2.5 Series, Natural Language Processing, Machine Translation, Chatbots

Reference: Zhengze Zhang, Shiqi Wang, Yiqun Shen, Simin Guo, Dahua Lin, Xiaoliang Wang, Nguyen Cam-Tu, Fei Tan, “daDPO: Distribution-Aware DPO for Distilling Conversational Abilities” (2025).
