Friday 14 March 2025
As our reliance on language models grows, so too does the need for efficient ways to deploy them. A new system, HyGen, aims to address this by serving online and offline requests together on the same hardware, reducing wasted capacity and improving performance.
The challenge is that online requests, which require rapid response times, compete with offline tasks, such as batch processing, for resources like GPU compute and memory. Serving the two kinds of work separately avoids that contention, but it can leave resources underutilized during periods of low online demand. By co-locating the two types of requests, HyGen aims to fill that idle capacity with offline work without slowing down online traffic.
The system achieves this through a unique scheduling algorithm that takes into account the latency requirements of online requests while also maximizing the throughput of offline tasks. This is done by predicting the execution time of each request and allocating resources accordingly.
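The idea of filling a batch with offline work only while the predicted execution time stays within the online latency target can be sketched as follows. This is a minimal illustration, not HyGen's actual scheduler: the linear cost model, its coefficients, and the names `Request`, `predict_latency_ms`, and `build_batch` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    tokens: int    # tokens this request would add to the next batch
    online: bool   # True for latency-sensitive requests


def predict_latency_ms(batch_tokens: int,
                       base_ms: float = 5.0,
                       per_token_ms: float = 0.02) -> float:
    """Hypothetical linear cost model: fixed overhead plus a per-token cost."""
    return base_ms + per_token_ms * batch_tokens


def build_batch(online, offline, latency_budget_ms):
    """Admit every online request, then add offline requests one by one
    as long as the predicted batch latency stays within the budget."""
    batch = list(online)
    tokens = sum(r.tokens for r in batch)
    for r in offline:
        if predict_latency_ms(tokens + r.tokens) <= latency_budget_ms:
            batch.append(r)
            tokens += r.tokens
    return batch


online = [Request("on1", 500, True)]
offline = [Request("off1", 1000, False),
           Request("off2", 2000, False),
           Request("off3", 400, False)]
batch = build_batch(online, offline, latency_budget_ms=45.0)
print([r.req_id for r in batch])
```

In this toy run the large offline request `off2` is skipped because it would push the predicted latency past the budget, while the smaller `off3` still fits. A real predictor would be fitted to measured execution times rather than using fixed coefficients.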
One key innovation of HyGen is its ability to dynamically adjust the priority of requests based on changing demand patterns. By analyzing production traces, researchers found that LLM workloads exhibit significant temporal variations in request load, with online request rates varying by up to 3x within minutes. This insight has allowed them to develop a scheduling algorithm that can adapt to these fluctuations.
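One simple way to picture adapting to fluctuating load is to smooth the observed online request rate and give offline work whatever fraction of capacity is left over. The sketch below uses an exponentially weighted moving average for this; the class `LoadTracker`, the `capacity_rps` parameter, and the smoothing factor are hypothetical choices for illustration, and the article does not say HyGen works this way.

```python
class LoadTracker:
    """Tracks a smoothed online request rate (illustrative only)."""

    def __init__(self, capacity_rps: float, alpha: float = 0.5):
        self.capacity_rps = capacity_rps  # assumed max sustainable online rate
        self.alpha = alpha                # weight given to the newest sample
        self.ewma_rps = 0.0

    def observe(self, online_rps: float) -> None:
        # Exponentially weighted moving average of the online request rate.
        self.ewma_rps = (self.alpha * online_rps
                         + (1 - self.alpha) * self.ewma_rps)

    def offline_share(self) -> float:
        # Fraction of capacity left for offline work, clamped to [0, 1].
        used = min(self.ewma_rps / self.capacity_rps, 1.0)
        return 1.0 - used


tracker = LoadTracker(capacity_rps=100.0)
shares = []
for rate in (40.0, 120.0, 300.0):   # online load rising sharply
    tracker.observe(rate)
    shares.append(tracker.offline_share())
print(shares)
```

As the online rate climbs, the share left for offline requests shrinks toward zero, and it would grow again once the rate falls.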
Another feature of HyGen is its use of prefix sharing, which lets offline requests that begin with the same tokens reuse a common prefix and so reduce memory usage. By constructing a prefix tree and a self-balancing binary search tree (BST), the system can efficiently identify and exploit shared prefixes, further improving resource utilization.
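To make the prefix-tree part concrete, here is a small sketch of a trie over token IDs that reports how much of a new request's prompt is already present and lists requests so that those sharing a prefix sit next to each other. The class and method names are hypothetical, the self-balancing BST mentioned above is left out, and nothing here is taken from HyGen's code.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # token id -> TrieNode
        self.request_ids = []  # requests whose prompt ends at this node


class PrefixTree:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, req_id, tokens):
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, TrieNode())
        node.request_ids.append(req_id)

    def shared_prefix_len(self, tokens):
        """Length of the longest prefix of `tokens` already in the tree."""
        node, n = self.root, 0
        for t in tokens:
            nxt = node.children.get(t)
            if nxt is None:
                break
            node, n = nxt, n + 1
        return n

    def dfs_order(self):
        """Request ids in depth-first order, so requests that share a
        prefix end up adjacent."""
        order, stack = [], [self.root]
        while stack:
            node = stack.pop()
            order.extend(node.request_ids)
            # Push children in reverse so the smallest token id pops first.
            for key in sorted(node.children, reverse=True):
                stack.append(node.children[key])
        return order


tree = PrefixTree()
tree.insert("a", [1, 2, 3, 4])
tree.insert("b", [9, 9])
tree.insert("c", [1, 2, 3, 7])
print(tree.shared_prefix_len([1, 2, 3, 5]))
print(tree.dfs_order())
```

Requests `a` and `c` share the first three tokens, so the depth-first listing places them together; a scheduler that runs them back to back could keep the shared prefix in memory once instead of twice.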
The benefits of HyGen are twofold. Firstly, it can improve the overall throughput of LLM serving by up to 3.87x compared to traditional online-only or offline-only approaches. Secondly, it can reduce latency for online requests while also preventing starvation of offline tasks.
To demonstrate its effectiveness, researchers evaluated HyGen on production workloads and found that it achieved significant gains in both throughput and resource utilization. The system’s ability to adapt to changing demand patterns and optimize resource allocation made it an attractive solution for LLM serving.
As the use of language models continues to grow, so too will the need for efficient ways to deploy them. HyGen represents a major step forward in this area, offering a flexible and adaptable solution that can meet the complex demands of modern LLM workloads.
Cite this article: “HyGen: A Novel System for Efficient Language Model Serving”, The Science Archive, 2025.
Language Models, HyGen, Online Requests, Offline Tasks, Resource Allocation, Scheduling Algorithm, Prefix Sharing, Memory Usage, Throughput, Latency, LLM Serving







