• Corpus ID: 18596682

Failure as a Service (FaaS): A Cloud Service for Large-Scale, Online Failure Drills

@inproceedings{Gunawi2011FailureAA,
  title={Failure as a Service (FaaS): A Cloud Service for Large-Scale, Online Failure Drills},
  author={Haryadi S. Gunawi and Thanh Do and Joseph M. Hellerstein and Ion Stoica and Dhruba Borthakur and Jesse Robbins},
  year={2011}
}
Cloud computing is pervasive, but cloud service outages still take place. One might say that the computing forecast for tomorrow is “cloudy with a chance of failure.” One main reason why major outages still occur is that there are many unknown large-scale failure scenarios in which recovery might fail. We propose a new type of cloud service, Failure as a Service (FaaS), which allows cloud services to routinely perform large-scale failure drills in real deployments. 
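
The paper itself contains no code, but a minimal sketch may help make the idea of an online failure drill concrete. The Python below is purely illustrative, not the paper's design: the names Node, kill_process, and service_healthy are assumptions. It crashes a small random subset of nodes in a toy cluster and then checks a simple availability condition, which is the basic shape of a drill.

# Illustrative sketch only: a toy "failure drill" that crashes a small,
# random subset of nodes and verifies the service still responds.
# All names (Node, kill_process, service_healthy) are hypothetical.
import random
import time

class Node:
    """A stand-in for one machine in the deployment."""
    def __init__(self, name):
        self.name = name
        self.alive = True

def kill_process(node):
    # In a real drill this would, e.g., SIGKILL a daemon or cut power.
    node.alive = False
    print(f"[drill] injected crash on {node.name}")

def service_healthy(nodes):
    # Toy health check: the service survives if a majority of nodes is up.
    up = sum(1 for n in nodes if n.alive)
    return up > len(nodes) // 2

def run_drill(nodes, failure_fraction=0.1, seed=None):
    """Crash a small random subset of nodes and report whether the
    service (here, simple majority availability) masked the failures."""
    rng = random.Random(seed)
    victims = rng.sample(nodes, max(1, int(len(nodes) * failure_fraction)))
    for node in victims:
        kill_process(node)
    time.sleep(0.1)  # stand-in for waiting out the recovery protocol
    ok = service_healthy(nodes)
    print(f"[drill] {len(victims)} nodes failed, service healthy: {ok}")
    return ok

if __name__ == "__main__":
    cluster = [Node(f"node-{i}") for i in range(20)]
    run_drill(cluster, failure_fraction=0.1, seed=42)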

Citations

The Case for Drill-Ready Cloud Computing
TLDR
As cloud computing has matured, more and more local applications have been replaced by easy-to-use, on-demand services accessible over computer networks; if these services are not tested thoroughly, they can exhibit failures that lead to major service disruptions.
Failure scenario as a service (FSaaS) for Hadoop clusters
TLDR
This work proposes a new model, Failure Scenario as a Service (FSaaS), to be used across the cloud for testing the resilience of cloud applications, and focuses its efforts on the Hadoop platform.
Evolution of as-a-Service Era in Cloud
TLDR
The evolution of as-a-Service modalities, stimulated by cloud computing, is studied, and the most complete inventory of new members beyond the traditional cloud computing stack is explored.
Efficient Inter-cloud Replication for High-Availability Services*
TLDR
This paper investigates the idea of tolerating outages by inter-cloud replication, i.e., replicating a service on multiple, fail-independent clouds, and develops a new ordering protocol that makes the most of the high-bandwidth communication within a cloud while keeping cross-cloud Internet communication to the minimum necessary.
Self-managing SLA compliance in cloud architectures: a market-based approach
TLDR
Early results from simulation studies show that the approach is feasible and reduces the SLA violations incurred by cloud providers, and an innovative self-managed cloud architecture has been designed around this control mechanism.
The Hydra: A Layered, Redundant Configuration Management Approach for Cloud-Agnostic Disaster Recovery
  • Ke Huang, Kyrre M. Begnum
  • Computer Science
    2013 IEEE 5th International Conference on Cloud Computing Technology and Science
  • 2013
TLDR
A bottom-up approach to developing autonomic fault tolerance and disaster recovery on cloud-based deployments is demonstrated, and it is shown that tools used in system administration today can provide the foundation for recovery processes with few additions.
On fault resilience of OpenStack
TLDR
The authors built a prototype fault-injection framework targeting service communications during the processing of external requests, both among OpenStack services and between OpenStack and external services, and have thus far uncovered 23 bugs in two versions of OpenStack. A rough, hypothetical sketch of this style of communication-level fault injection appears after this list.
HMGOWM: A Hybrid Decision Mechanism for Automating Migration of Virtual Machines
TLDR
HMGOWM, a hybrid decision-making mechanism for automating the migration of VMs, is proposed; extensive experimental results indicate that the downtime experienced by users can be efficiently reduced and that the implementation of HMGOWM outperforms the original scheduling of the OpenStack cloud platform.
A Test Design Method for Resilient System on Cloud Infrastructure
TLDR
A new test design method is proposed for testing the expected system configuration after a reconfiguration operation triggered by disturbance events; it is designed to be embedded in the production system so that it can validate the expected configuration with high confidence and with minimal additional resources.
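
As promised in the OpenStack fault-resilience entry above, here is a rough, hypothetical sketch of communication-level fault injection: a wrapper around an inter-service call that probabilistically drops or delays requests. It is not the cited prototype; FaultInjectingChannel and real_send are assumed names used only for illustration.

# Hypothetical sketch of communication-level fault injection between services.
# Not the framework from the cited paper; all names are illustrative.
import random
import time

class FaultInjectingChannel:
    """Wraps a callable that sends a request to another service and
    probabilistically injects a drop (exception) or a delay."""
    def __init__(self, send_fn, drop_prob=0.05, delay_prob=0.1,
                 delay_s=1.0, seed=None):
        self.send_fn = send_fn
        self.drop_prob = drop_prob
        self.delay_prob = delay_prob
        self.delay_s = delay_s
        self.rng = random.Random(seed)

    def send(self, request):
        r = self.rng.random()
        if r < self.drop_prob:
            raise ConnectionError("fault injection: request dropped")
        if r < self.drop_prob + self.delay_prob:
            time.sleep(self.delay_s)  # simulate a slow or congested link
        return self.send_fn(request)

# Example: wrapping a trivial in-process "service call".
def real_send(request):
    return {"status": "ok", "echo": request}

channel = FaultInjectingChannel(real_send, drop_prob=0.2, delay_prob=0.2,
                                delay_s=0.01, seed=1)
for i in range(5):
    try:
        print(channel.send({"op": "create", "id": i}))
    except ConnectionError as e:
        print("caller observed failure:", e)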

References

Showing 1–10 of 26 references
FATE and DESTINI: A Framework for Cloud Recovery Testing
TLDR
A new testing framework for cloud recovery is proposed: FATE (Failure Testing Service) and DESTINI (Declarative Testing Specifications).
How is the weather tomorrow?: towards a benchmark for the cloud
TLDR
This paper argues that traditional benchmarks (like the TPC benchmarks) are not sufficient for analyzing novel cloud services, and presents initial ideas on what a new benchmark should look like to better fit the characteristics of cloud computing (e.g., scalability, pay-per-use, and fault tolerance).
Toward Online Testing of Federated and Heterogeneous Distributed Systems
TLDR
This work argues that system reliability should be improved by proactively identifying potential faults through online testing, and proposes DiCE, an approach that continuously and automatically explores system behavior to check whether the system deviates from its desired behavior.
Autopilot: automatic data center management
TLDR
The first version of Autopilot, the automatic data center management infrastructure developed within Microsoft over the last few years, is described; it is responsible for automating software provisioning and deployment, system monitoring, and repair actions that deal with faulty software and hardware.
Subtleties in Tolerating Correlated Failures in Wide-area Storage Systems
TLDR
This paper systematically revisits previously proposed techniques for addressing correlated failures and identifies a set of design principles that system builders can use to tolerate correlated failures.
Characterizing, modeling, and generating workload spikes for stateful services
TLDR
This paper proposes and validates a model of stateful spikes that makes it possible to synthesize volume and data spikes, and could thus be used by both cloud computing users and providers to stress-test their infrastructure. A toy illustration of spike synthesis is sketched after this reference list.
Automated software testing as a service
TLDR
The case for TaaS is made: a "programmer's sidekick" enabling developers to thoroughly and promptly test their code with minimal upfront resource investment; a "home edition" on-demand testing service for consumers to verify the software they are about to install on their PC or mobile device; and a public "certification service" that independently assesses the reliability, safety, and security of software.
Glacier: highly durable, decentralized storage despite massive correlated failures
TLDR
Glacier is described, a distributed storage system that relies on massive redundancy to mask the effect of large-scale correlated failures and is used as the storage layer for an experimental serverless email system.
Chukwa: A large-scale monitoring system
TLDR
The design and initial implementation of Chukwa, a data collection system for monitoring and analyzing large distributed systems, are described; Chukwa inherits Hadoop's scalability and robustness and includes a flexible and powerful toolkit for displaying monitoring and analysis results.
The impact of DHT routing geometry on resilience and proximity
TLDR
The basic finding is that, despite the initial preference for more complex geometries, the ring geometry allows the greatest flexibility, and hence achieves the best resilience and proximity performance.
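
As noted in the entry on workload spikes above, a toy illustration of synthesizing a volume spike is sketched below. It simply overlays a short multiplicative spike and random jitter on a flat baseline request rate; every parameter name is an assumption for illustration, not part of the cited model.

# Toy volume-spike generator: baseline request rate plus a short
# multiplicative spike. Illustrative only; not the cited paper's model.
import random

def synthesize_spike(duration_s=3600, baseline_rps=100.0,
                     spike_start=1200, spike_len=300, spike_factor=5.0,
                     noise=0.1, seed=None):
    """Return a list of per-second request rates containing one volume spike."""
    rng = random.Random(seed)
    rates = []
    for t in range(duration_s):
        rate = baseline_rps
        if spike_start <= t < spike_start + spike_len:
            rate *= spike_factor  # spike period
        rate *= 1.0 + rng.uniform(-noise, noise)  # small random jitter
        rates.append(rate)
    return rates

trace = synthesize_spike(seed=7)
print("peak rate:", round(max(trace), 1), "rps; baseline ~100 rps")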