Learn More
Benchmarks have historically played a key role in guiding the progress of computer science systems research and development, but have traditionally neglected the areas of availability, maintainability, and evolutionary growth, areas that have recently become critically important in high-end system design. As a first step in addressing this deficiency, we(More)
The complexity of configuring computing systems is a major impediment to the adoption of new information technology (IT) products and greatly increases the cost of IT services. This paper develops a model of configuration complexity and demonstrates its value for a change management system. The model represents systems as a set of nested containers with(More)
The lmbench suite of operating system microbenchmarks provides a set of portable programs for use in cross-platform comparisons. We have augmented the lmbench suite to increase its flexibility and precision, and to improve its methodological and statistical operation. This enables the detailed study of interactions between the operating system and the(More)
We describe a methodology for identifying and characterizing dynamic dependencies between system components in distributed application environments such as e-commerce systems. The methodology relies on active perturbation of the system to identify dependencies and the use of statistical modeling to compute dependency strengths. Unlike more traditional(More)
We introduce the ROC-1 hardware platform, a large-scale cluster system designed to provide high availability for Internet service applications. The ROC-1 prototype embodies our philosophy of Recovery-Oriented Computing (ROC) by emphasizing detection and recovery from the failures that inevitably occur in Internet service environments, rather than simple(More)
System operators play a critical role in maintaining server dependability yet lack powerful tools to help them do so. To help address this unfulfilled need, we describe Operator Undo, a tool that provides a forgiving operations environment by allowing operators to recover from their own mistakes, from unanticipated software problems, and from intentional or(More)
Motivated by the growth of web and infrastructure services and their susceptibility to human operator-related failures, we introduce <i>system-level undo</i> as a recovery mechanism designed to improve service dependability. Undo enables system operators to recover from their inevitable mistakes and furthermore enables <i>retroactive repair</i> of problems(More)