Resilience Engineering: Learning to Embrace Failure
@article{Robbins2012ResilienceEL, title={Resilience Engineering: Learning to Embrace Failure}, author={Jesse Robbins and Kripa Krishnan and John Allspaw and Thomas A. Limoncelli}, journal={Queue}, year={2012}, volume={10}, pages={20--28} }
In the early 2000s, Amazon created GameDay, a program designed to increase resilience by purposely injecting major failures into critical systems semi-regularly to discover flaws and subtle dependencies. Basically, a GameDay exercise tests a company’s systems, software, and people in the course of preparing for a response to a disastrous event. Widespread acceptance of the GameDay concept has taken a few years, but many companies now see its value and have started to adopt their own versions…
40 Citations
Maelstrom: Mitigating Datacenter-level Disasters by Draining Interdependent Traffic Safely and Efficiently
- Computer Science, OSDI
- 2018
Maelstrom is presented, a new system for mitigating and recovering from datacenter-level disasters, with modular, reusable primitives that can be composed to safely and efficiently drain the traffic of interdependent services from one or more failing datacenters to the healthy ones.
The Tech Company: On the neglected second nature of platforms
- Business
- 2022
The unprecedented rise of startups such as Google or Amazon has spurred an ongoing debate on the conceptualization of the corporate model these firms represent. Thus far, attention has centered on…
Understanding resilience in the built environment: Going beyond disaster mitigation
- Economics
- 2021
Although introducing the resilience concept into the built environment context occurred relatively late compared to other disciplines, it has been rapidly gaining ground in urban-related studies.…
An Overview of 100 Resilient Cities Network—The Case of Amman
- Business, Advanced Studies in Efficient Environmental Design and City Planning
- 2021
Amman was chosen to join the 100 Resilient Cities Network (100RC) in December 2014 to design and implement an urban resilience strategy, announced three years later, to help strengthen the…
Sage: practical and scalable ML-driven performance debugging in microservices
- Computer Science, ASPLOS
- 2021
Sage is presented, a machine learning-driven root cause analysis system for interactive cloud microservices that focuses on practicality and scalability and captures the impact of dependencies between microservices to determine the root cause of unpredictable performance online, and applies corrective actions to recover a cloud service’s QoS.
Sage: Using Unsupervised Learning for Scalable Performance Debugging in Microservices
- Computer Science, ArXiv
- 2021
Sage is presented, a machine learning-driven root cause analysis system for interactive cloud microservices that leverages unsupervised ML models to circumvent the overhead of trace labeling, captures the impact of dependencies between microservices to determine the root cause of unpredictable performance online, and applies corrective actions to recover a cloud service's QoS.
Toward a Smart Cloud: A Review of Fault-Tolerance Methods in Cloud Systems
- Computer Science, IEEE Transactions on Services Computing
- 2021
A comprehensive survey of the state-of-the-art work on fault tolerance methods proposed for cloud computing is presented and current issues and challenges in cloud fault tolerance are discussed to identify promising areas for future research.
A framework for the resilience analysis of electric infrastructure systems including temporary generation systems
- Engineering, Reliab. Eng. Syst. Saf.
- 2020
Enhancing Failure Propagation Analysis in Cloud Computing Systems
- Computer Science, 2019 IEEE 30th International Symposium on Software Reliability Engineering (ISSRE)
- 2019
This work proposes a novel approach that joins fault injection with anomaly detection to identify the symptoms of failures and shows that this model can significantly improve the accuracy of failure analysis in terms of false positives and negatives, with a low computational cost.
Avoiding queue overflow and reducing queuing delay at eNodeB in LTE networks using congestion feedback mechanism
- Computer Science, Comput. Commun.
- 2019