Compressing Large Language Models for Resource-Constrained Devices

Friday 14 March 2025


Researchers have developed a new technique for compressing large language models so that they can run on resource-constrained devices such as smartphones and embedded systems. The approach, known as COMP, is designed to reduce the memory footprint of these models while preserving their ability to understand and generate human-like text.


Large language models are powerful tools that have enabled significant advances in natural language processing, but they require substantial computational resources to train and run. This has limited their adoption in settings such as mobile devices and edge computing systems, where power consumption and memory availability are critical concerns.


To address this issue, the COMP team developed a hybrid-granularity pruning strategy that combines layer-wise and neuron-level pruning techniques. Layer-wise pruning involves removing entire layers from the model, while neuron-level pruning involves removing individual neurons within those layers. By carefully selecting which layers and neurons to remove, the team was able to reduce the memory footprint of the models without sacrificing their performance.
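
The two granularities described above can be illustrated with a toy sketch. It is not the COMP implementation: the function name `hybrid_prune`, the score lists, and the parameters `layers_to_drop` and `neuron_keep_ratio` are all hypothetical, and the importance scores are assumed to be given. The sketch only shows the order of operations, first dropping whole layers and then thinning the neurons in the layers that remain.

```python
from typing import Dict, List


def hybrid_prune(
    layer_scores: List[float],
    neuron_scores: List[List[float]],
    layers_to_drop: int,
    neuron_keep_ratio: float,
) -> Dict[int, List[int]]:
    """Return {kept layer index: sorted indices of kept neurons}.

    All names and parameters here are illustrative assumptions,
    not taken from the COMP paper.
    """
    if len(layer_scores) != len(neuron_scores):
        raise ValueError("need one neuron score list per layer")
    if not 0 <= layers_to_drop < len(layer_scores):
        raise ValueError("layers_to_drop must leave at least one layer")
    if not 0.0 < neuron_keep_ratio <= 1.0:
        raise ValueError("neuron_keep_ratio must be in (0, 1]")

    # Coarse step: remove the lowest-scoring layers entirely.
    ranked = sorted(range(len(layer_scores)), key=lambda i: layer_scores[i])
    dropped = set(ranked[:layers_to_drop])

    kept: Dict[int, List[int]] = {}
    for i, scores in enumerate(neuron_scores):
        if i in dropped:
            continue
        # Fine step: keep only the highest-scoring neurons in this layer
        # (at least one, so the layer never becomes empty).
        n_keep = max(1, round(len(scores) * neuron_keep_ratio))
        top = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        kept[i] = sorted(top[:n_keep])
    return kept


example = hybrid_prune(
    layer_scores=[0.9, 0.1, 0.5],
    neuron_scores=[
        [0.2, 0.8, 0.5, 0.1],
        [0.3, 0.3, 0.3, 0.3],
        [0.9, 0.4, 0.6, 0.7],
    ],
    layers_to_drop=1,
    neuron_keep_ratio=0.5,
)
print(example)  # {0: [1, 2], 2: [0, 3]}
```

In this example the middle layer has the lowest score and is removed outright, while the two surviving layers each keep their two highest-scoring neurons.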


The approach is based on a novel metric that evaluates the importance of each layer and neuron in the model. This metric takes into account both the input and output of each layer, as well as the connections between them. By using this metric to guide the pruning process, the team was able to identify the most critical components of the model and remove those that were less important.
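
The article does not give the formula for COMP's metric, so the sketch below uses a stand-in: a well-known heuristic that scores a layer by how much it changes its input, measured as one minus the cosine similarity between the layer's input and output vectors, averaged over tokens. This is an assumption chosen for illustration, not the metric from the paper, and the function names are hypothetical. It does show the general idea of comparing a layer's input with its output to judge importance.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def layer_importance(
    inputs: Sequence[Sequence[float]],
    outputs: Sequence[Sequence[float]],
) -> float:
    """Stand-in importance score: mean of (1 - cosine similarity) per token.

    A layer whose output points in the same direction as its input
    scores near 0 and is a candidate for removal; a layer that
    changes the direction of its input strongly scores higher.
    """
    if not inputs or len(inputs) != len(outputs):
        raise ValueError("inputs and outputs must be non-empty and aligned")
    total = sum(1.0 - cosine_similarity(x, y) for x, y in zip(inputs, outputs))
    return total / len(inputs)


# A layer that only rescales its input changes no direction.
passthrough = layer_importance([[1.0, 0.0]], [[2.0, 0.0]])
# A layer that rotates its input by 90 degrees changes it completely.
rotating = layer_importance([[1.0, 0.0]], [[0.0, 1.0]])
print(passthrough, rotating)
```

Scores like these could serve as the per-layer ranking that guides pruning. A neuron-level score would need a separate definition, which the article does not specify.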


The results are impressive: COMP was able to compress a large language model from 13 billion parameters down to just 2.5 billion while preserving its accuracy on a range of tasks. This represents a reduction of over 80% in memory usage, making it possible to run the model on devices with limited resources.
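
The arithmetic behind the reported reduction can be checked in a few lines. The sketch assumes that weight memory scales linearly with parameter count and that weights are stored at 16 bits (2 bytes) each. The storage precision is an assumption, not a figure from the article, and activations and other runtime memory are ignored.

```python
def param_reduction(before: float, after: float) -> float:
    """Fraction of parameters removed."""
    return 1.0 - after / before


def weight_memory_gb(params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight storage in gigabytes (assumes 2 bytes per parameter)."""
    return params * bytes_per_param / 1e9


reduction = param_reduction(13e9, 2.5e9)
print(f"{reduction:.1%}")        # 80.8%
print(weight_memory_gb(13e9))    # 26.0
print(weight_memory_gb(2.5e9))   # 5.0
```

Under these assumptions the weights shrink from roughly 26 GB to roughly 5 GB, which is consistent with the reduction of over 80% quoted above.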


COMP also demonstrates significant speedups compared to traditional pruning techniques. By removing unimportant layers and neurons, the team was able to accelerate the processing of text data by up to 30%. This is particularly important for applications such as real-time language translation or chatbots, where fast response times are critical.


The potential impact of COMP is significant. It enables large language models to be deployed in a wide range of devices and applications, from smartphones to smart home systems. It also opens up new possibilities for edge computing and fog computing, where data processing occurs closer to the source of the data rather than in the cloud.


While there are still challenges to overcome, COMP represents an important step forward in the development of more efficient and scalable language models.


Cite this article: “Compressing Large Language Models for Resource-Constrained Devices”, The Science Archive, 2025.


Language Models, Compression, Pruning, Neural Networks, Natural Language Processing, Edge Computing, Fog Computing, Smartphones, Embedded Systems, Machine Learning.


Reference: Zihuai Xu, Yang Xu, Hongli Xu, Yunming Liao, Zhiwei Yao, Zuan Xie, “Lightweight and Post-Training Structured Pruning for On-Device Large Lanaguage Models” (2025).

