Deriving Semantic Checkers from Tests to Detect Silent Failures in Production Distributed Systems

  • Chang Lou,
  • Dimas Shidqi Parikesit,
  • Yujin Huang,
  • Zhewen Yang,
  • Senapati Diwangkara,
  • Yuzhuo Jing,
  • Achmad Imam Kistijantoro,
  • Ding Yuan,
  • Peng Huang

OSDI'25: 19th USENIX Symposium on Operating Systems Design and Implementation

Published by USENIX

Production distributed systems provide rich features, but various defects can cause a system to silently violate its semantics without raising explicit errors. Such failures can have serious consequences, yet they are extremely challenging to detect, because writing good checkers requires deep domain knowledge and substantial manual effort. In this paper, we explore a novel approach that directly derives semantic checkers from a system's existing test code. We first present a large-scale feasibility study of existing test cases. Guided by the study's findings, we develop T2C, a framework that uses static and dynamic analysis to transform and generalize a test into a runtime checker. We apply T2C to four large, popular distributed systems and successfully derive many checkers. The derived checkers detect 15 out of 20 real-world silent failures while incurring small runtime overhead.
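To illustrate the general idea (this is a conceptual sketch, not the actual T2C implementation or its output), the Java example below shows how a concrete test assertion might be generalized into a runtime checker that validates the same semantic property against live operations. The `KVStore` interface, class names, and the read-after-write property are all invented for this illustration.

```java
// Conceptual sketch: lifting a test assertion into a runtime semantic checker.
// All names here (TestDerivedChecker, KVStore) are hypothetical examples,
// not part of the T2C framework or the systems studied in the paper.
public class TestDerivedChecker {
    // Original unit test (hard-coded inputs):
    //   store.put("k", "v");
    //   assertEquals("v", store.get("k"));
    //
    // Generalized checker: the concrete test inputs ("k", "v") are replaced by
    // the arguments observed at runtime, so the same read-after-write semantic
    // is checked against production operations instead of a fixed test case.
    public static void checkReadAfterWrite(KVStore store, String key, String value) {
        String observed = store.get(key);
        if (!value.equals(observed)) {
            // Report a silent semantic violation instead of failing a test run.
            System.err.printf("semantic violation: put(%s, %s) but get returned %s%n",
                              key, value, observed);
        }
    }
}

// Hypothetical key-value store interface used only for this sketch.
interface KVStore {
    void put(String key, String value);
    String get(String key);
}
```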