Hybrid Cost Modeling for Reducing Query Performance Regression in Index Tuning

IEEE Transactions on Knowledge and Data Engineering | , Vol 37(1): pp. 379-391

Autonomous index tuning (“auto-indexing” for short) has recently started being supported by cloud database service providers. Index tuners rely on query optimizer’s cost estimates to recommend indexes that can minimize the execution cost of an input workload. Such cost estimates can often be erroneous that lead to significant query performance regression. To reduce the chance of regression, existing work primarily uses machine learning (ML) technologies to build prediction models to improve query execution cost estimation using actual query execution telemetry as training data. However, training data collection is typically an expensive process, especially for index tuning due to the significant overhead of creating/dropping indexes. As a result, the amount of training data can be limited in auto-indexing for cloud databases. In this paper, we propose a new approach named “hybrid cost modeling” to address this challenge. The key idea is to limit the ML-based modeling effort to the leaf operators such as table scans, index scans, and index seeks, and then combine the ML-model predicted costs of the leaf operators with optimizer’s estimated costs of the other operators in the query plan. We conduct theoretical study as well as empirical evaluation to demonstrate the efficacy of applying hybrid cost modeling to index tuning, using both industrial benchmarks and real workloads.