Large-scale cluster management at Google with Borg
This paper presents a summary of the Borg system architecture and features, important design decisions, a quantitative analysis of some of its policy decisions, and a qualitative examination of lessons learned from a decade of operational experience with it.
An introduction to disk drive modeling
A calibrated, high-quality disk drive model is demonstrated in which the overall error factor is 14 times smaller than that of a simple first-order model, which enables an informed trade-off between effort and accuracy.
Omega: flexible, scalable schedulers for large compute clusters
This work presents a novel approach to cluster scheduling that uses parallelism, shared state, and lock-free optimistic concurrency control to address increasing scale and the need for rapid response to changing requirements, which monolithic cluster scheduler architectures struggle to meet.
My Cache or Yours? Making Storage More Exclusive
  • T. Wong, John Wilkes
  • Computer Science
  • USENIX Annual Technical Conference, General Track
  • 10 June 2002
This work explores the benefits of a simple scheme to achieve exclusive caching, in which a data block is cached at either a client or the disk array, but not both. It introduces a DEMOTE operation to transfer data ejected from the client to the array and evaluates its effectiveness with simulation studies.
CloudScale: elastic resource scaling for multi-tenant cloud systems
CloudScale is a system that automates fine-grained elastic resource scaling for multi-tenant cloud computing infrastructures, achieving significantly higher SLO conformance than other alternatives at low resource and energy cost.
UNIX Disk Access Patterns
A detailed characterization of every low-level disk access generated by three quite different systems over a two-month period is presented, finding that using a small non-volatile cache at each disk allowed writes to be serviced considerably faster than with a regular disk.
CPI²: CPU performance isolation for shared compute clusters
This work presents CPI², which uses cycles-per-instruction (CPI) data obtained by hardware performance counters to identify problems, select the likely perpetrators, and then optionally throttle them so that the victims can return to their expected behavior.
AGILE: Elastic Distributed Resource Scaling for Infrastructure-as-a-Service
AGILE uses wavelets to provide a medium-term resource demand prediction with enough lead time to start up new application server instances before performance falls short, and it uses dynamic VM cloning to reduce application startup times.
Hibernator: helping disk arrays sleep through the winter
This paper describes the Hibernator design and presents evaluations of it using both trace-driven simulations and a hybrid system composed of a real database server (IBM DB2) and an emulated storage server with multi-speed disks.
Borg, Omega, and Kubernetes
The lessons from developing and operating three different container-management systems at Google for more than ten years are described.