Predicting Query Performance: A New Machine Learning Approach

Friday 31 January 2025


Predicting how long a database query will take to run is a notoriously tricky problem that researchers have worked on for years. A single query can combine data retrieval with complex computation, and its running time depends on how those pieces behave together, which makes completion times hard to forecast accurately.


Now, researchers have developed a new approach that uses machine learning to better predict query performance. The system, called MERLIN, is designed specifically for high-performance analytical (OLAP) databases such as ClickHouse, which are used to analyze large amounts of data quickly.


The team behind MERLIN recognized that traditional cost models, which estimate the time it takes to complete a query based on various factors such as data size and complexity, often struggle with complex queries. They also noticed that these models tend to focus too much on individual components of a query, rather than how they interact with each other.


To address this issue, MERLIN uses a multi-stage model that incorporates both static and dynamic information about the query. The system first analyzes the query plan, which is like a blueprint for how the database will execute the query. It then uses this information to predict the performance of individual components of the query, such as data retrieval and processing.
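The two-step idea described above (read the query plan, then predict per-component performance and combine it) can be illustrated with a minimal sketch. This is not MERLIN's actual model: the `PlanNode` structure, the operator names, and the per-row costs in `COST_PER_ROW_US` are all hypothetical, and a hand-written lookup plus a plain sum stands in for the learned stages.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PlanNode:
    """One operator in a simplified, hypothetical query plan tree."""
    op: str                 # operator name, e.g. "Scan", "Join", "Aggregate"
    est_rows: float         # optimizer's estimate of rows this operator handles
    children: List["PlanNode"] = field(default_factory=list)


# Assumed cost per row in microseconds. Illustrative values only,
# not measurements from ClickHouse or from the MERLIN paper.
COST_PER_ROW_US: Dict[str, float] = {
    "Scan": 0.5,
    "Filter": 0.1,
    "Join": 2.0,
    "Aggregate": 1.0,
}


def predict_operator_us(node: PlanNode) -> float:
    """Stage 1: predict one operator's time from its static plan features."""
    return COST_PER_ROW_US.get(node.op, 1.0) * node.est_rows


def predict_query_us(node: PlanNode) -> float:
    """Stage 2: combine operator predictions bottom-up over the plan tree.

    A plain sum is used here; a learned model would replace this step.
    """
    return predict_operator_us(node) + sum(
        predict_query_us(child) for child in node.children
    )


# Example plan: two scans feed a join, whose output is aggregated.
plan = PlanNode("Aggregate", 1_000, [
    PlanNode("Join", 10_000, [
        PlanNode("Scan", 100_000),
        PlanNode("Scan", 5_000),
    ]),
])
total_us = predict_query_us(plan)
```

Summing per-operator estimates is exactly the kind of component-by-component view the article says traditional cost models rely on; the sketch only shows where plan features enter and where a model that captures interactions between operators would slot in.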


But here’s where things get clever: MERLIN also takes into account the dynamic nature of the database environment. This includes factors such as the availability of resources like CPU and memory, which can impact query performance. By incorporating these dynamics into its predictions, MERLIN can better account for unexpected events that might slow down a query.
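To make the role of dynamic information concrete, here is a small sketch that scales a static time estimate by the current resource situation. Everything in it is an assumption made for illustration: the `ResourceSnapshot` fields, the inverse-CPU slowdown rule, and the fixed memory penalty are hand-written heuristics, whereas MERLIN learns this relationship from data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceSnapshot:
    """Hypothetical view of the server's state at prediction time."""
    cpu_idle_fraction: float   # share of CPU currently idle, between 0 and 1
    free_memory_bytes: int     # memory available to the query


def adjust_for_load(static_us: float,
                    needed_memory_bytes: int,
                    snap: ResourceSnapshot) -> float:
    """Scale a static estimate by an assumed load-dependent slowdown."""
    # Assume running time grows inversely with idle CPU; the floor of 0.05
    # avoids division by zero on a fully busy machine.
    slowdown = 1.0 / max(snap.cpu_idle_fraction, 0.05)
    # Assume a flat 2x penalty when the query needs more memory than is free.
    if needed_memory_bytes > snap.free_memory_bytes:
        slowdown *= 2.0
    return static_us * slowdown


busy = ResourceSnapshot(cpu_idle_fraction=0.5, free_memory_bytes=1_000)
fits_in_memory = adjust_for_load(1_000.0, needed_memory_bytes=100, snap=busy)
exceeds_memory = adjust_for_load(1_000.0, needed_memory_bytes=5_000, snap=busy)
```

The same query gets a different prediction depending on the snapshot, which is the point of adding dynamic inputs: a purely static model would return the same number regardless of how loaded the server is.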


The researchers tested MERLIN on two popular benchmarks: TPC-H and TPC-DS. These workloads involve complex analytical queries that test the limits of database performance. The results were impressive: MERLIN outperformed traditional cost models, producing significantly lower prediction errors and faster inference times.
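The article does not say which error metric the authors report. One metric commonly used in query performance prediction is the q-error, the factor by which a prediction is off in either direction; the sketch below assumes that metric purely for illustration, and the sample numbers are made up rather than taken from the paper.

```python
from statistics import median
from typing import Dict, Sequence


def q_error(predicted: float, actual: float) -> float:
    """Factor by which a prediction misses, always >= 1 (1 means exact)."""
    if predicted <= 0 or actual <= 0:
        raise ValueError("q-error is defined for positive values only")
    return max(predicted / actual, actual / predicted)


def summarize(predicted: Sequence[float],
              actual: Sequence[float]) -> Dict[str, float]:
    """Median and worst-case q-error over a set of queries."""
    errors = [q_error(p, a) for p, a in zip(predicted, actual)]
    return {"median": median(errors), "max": max(errors)}


# Made-up predicted and measured times (milliseconds) for three queries.
report = summarize(predicted=[100.0, 200.0, 50.0],
                   actual=[100.0, 100.0, 200.0])
```

Because the q-error is symmetric, predicting 50 ms for a 100 ms query counts the same as predicting 200 ms, which makes it convenient for comparing models across queries of very different sizes.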


MERLIN’s performance was also robust to cardinality errors, which arise when the database misestimates how many rows an operation will read or produce. Traditional cost models depend heavily on these estimates, so their predictions often degrade when the estimates are wrong. MERLIN was able to adapt and still produce reliable results.
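Why cardinality errors hurt a conventional cost model can be seen in a short simulation. The sketch below assumes a deliberately simple cost model that is linear in the estimated row count (the `us_per_row` value is arbitrary) and perturbs the row estimate by a random factor; it is a toy experiment, not the robustness test from the paper.

```python
import random


def linear_cost_us(est_rows: float, us_per_row: float = 0.5) -> float:
    """Toy cost model: predicted time is proportional to estimated rows."""
    return est_rows * us_per_row


def perturb(true_rows: float, max_factor: float, rng: random.Random) -> float:
    """Over- or underestimate the row count by a factor in [1, max_factor]."""
    factor = rng.uniform(1.0, max_factor)
    return true_rows * factor if rng.random() < 0.5 else true_rows / factor


def worst_q_error(true_rows: float, max_factor: float,
                  trials: int = 1000, seed: int = 0) -> float:
    """Largest prediction error factor seen across random perturbations."""
    rng = random.Random(seed)
    truth = linear_cost_us(true_rows)
    worst = 1.0
    for _ in range(trials):
        predicted = linear_cost_us(perturb(true_rows, max_factor, rng))
        worst = max(worst, predicted / truth, truth / predicted)
    return worst


# Row estimates off by up to 10x in either direction.
worst = worst_q_error(true_rows=1_000_000, max_factor=10.0)
```

In this toy model a row estimate that is off by some factor makes the time prediction off by the same factor, so the cost model inherits the full cardinality error. A predictor that is robust in the article's sense would have to rely on signals beyond the row estimate alone.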


The team behind MERLIN believes their approach could have significant implications for database management. By providing more accurate predictions of query performance, databases could optimize their resources more effectively, leading to better overall performance and reduced latency.


In the future, the researchers plan to expand MERLIN’s capabilities to include support for other types of queries and databases. They also hope to integrate MERLIN with existing database systems, making it easier for developers to use the technology in real-world applications.


Cite this article: “Predicting Query Performance: A New Machine Learning Approach”, The Science Archive, 2025.


Machine Learning, Query Performance, MERLIN, Database Management, ClickHouse, Cost Models, Query Plan, CPU, Memory, TPC-H, TPC-DS


Reference: Kaixin Zhang, Hongzhi Wang, Kunkai Gu, Ziqi Li, Chunyu Zhao, Yingze Li, Yu Yan, “MERLIN: Multi-stagE query performance prediction for dynamic paRallel oLap pIpeliNe” (2024).

