ProtoRAIL: A Risk-cognizant Imitation Agent for Adaptive vCPU Oversubscription In The Cloud

Safely optimizing operating costs is one of the holy grails of revenue-generating cloud systems, and capacity and resource efficiency is a key factor in achieving it. Among the resource-efficiency strategies used by major cloud providers, oversubscription is an extremely prevalent practice in which more virtual resources are offered than the physical capacity actually available, minimizing the revenue lost to redundant capacity. While the resource can be of any type, including compute, memory, power, or network bandwidth, we focus on virtual CPU (vCPU) oversubscription, since vCPU cores are the primary billable units for cloud services and have a substantial impact on both the business and its users. For a seamless cloud experience that remains cost-efficient for the provider, suitable policies for controlling oversubscription margins are crucial. Narrow margins lead to redundant expenditure on under-utilized resource capacity, while wide margins lead to under-provisioning, where customer workloads may suffer from resource contention.
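The margin trade-off described above can be illustrated with a small arithmetic sketch. It is not part of ProtoRAIL; the function name `oversubscription_outcome`, its parameters, and the single-host, peak-demand model are assumptions made purely for illustration.

```python
def oversubscription_outcome(physical_cores, ratio, peak_utilization):
    """Toy model of one host (illustrative assumption, not the paper's method).

    physical_cores:   physical CPU cores on the host
    ratio:            vCPUs offered per physical core (1.0 = no oversubscription)
    peak_utilization: fraction of offered vCPUs that are busy at peak (0..1)

    Returns (stranded_cores, contended_cores).
    """
    offered_vcpus = physical_cores * ratio
    peak_demand = offered_vcpus * peak_utilization
    # Narrow margin: physical capacity that is paid for but never used at peak.
    stranded = max(0.0, physical_cores - peak_demand)
    # Wide margin: demand that exceeds physical capacity and must contend.
    contended = max(0.0, peak_demand - physical_cores)
    return stranded, contended


# Same host and workload, three different margins.
narrow = oversubscription_outcome(64, 1.0, 0.5)    # half the cores sit idle
balanced = oversubscription_outcome(64, 2.0, 0.5)  # demand matches capacity
wide = oversubscription_outcome(64, 3.0, 0.5)      # demand exceeds capacity
```

In this toy model a fixed ratio is only right for one utilization level, which is why a policy that adapts the margin to observed utilization is attractive.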

Most oversubscription policies today are engineered either with tribal knowledge or with static heuristics about the system, which lead to catastrophic overloading or to stranded, under-utilized resources. Designing smart oversubscription policies that adapt to demand and utilization patterns across time and granularity, jointly optimizing cost benefits and risks, is a non-trivial and largely unsolved problem. We address this challenge with our proposed Prototypical Risk-cognizant Active Imitation Learning (ProtoRAIL) framework, which exploits approximate symmetries in utilization patterns to learn suitable policies. Its active knowledge-in-the-loop (KITL) module de-risks the learned policies. Our empirical investigations and real deployments on X company’s internal (1st-party) cloud service show an orders-of-magnitude reduction in risk (roughly 90× or more) and a significant increase in benefits (stranded resources saved: ≈ 7 to 10%).