An empirical study on crash recovery bugs in large-scale distributed systems

@article{Gao2018AnES,
  title={An empirical study on crash recovery bugs in large-scale distributed systems},
  author={Yu Gao and Wensheng Dou and Feng Qin and Chushu Gao and Dong Wang and Jun Wei and Ruirui Huang and Li Zhou and Yongming Wu},
  journal={Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering},
  year={2018}
}
In large-scale distributed systems, node crashes are inevitable, and can happen at any time. As such, distributed systems are usually designed to be resilient to these node crashes via various crash recovery mechanisms, such as write-ahead logging in HBase and hinted handoffs in Cassandra. However, faults in crash recovery mechanisms and their implementations can introduce intricate crash recovery bugs, and lead to severe consequences. In this paper, we present CREB, the most comprehensive… 
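
To make the recovery idea concrete, here is a minimal, self-contained sketch of write-ahead logging in the spirit of the mechanism the abstract attributes to HBase. It is an illustration only, not HBase's actual WAL; the class and file names (ToyWal, toy.wal) are hypothetical. Every update is flushed to an append-only log before it is applied in memory, so a node that crashes between the two steps can rebuild its state by replaying the log on restart.

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.Map;

// Toy write-ahead log (illustrative only): log the update durably first,
// then apply it to the in-memory store; recovery replays the log.
public class ToyWal {
    private final Path logFile;
    private final Map<String, String> store = new HashMap<>();

    public ToyWal(Path logFile) throws IOException {
        this.logFile = logFile;
        recover();  // crash recovery: replay whatever survived in the log
    }

    public void put(String key, String value) throws IOException {
        // 1. Record the intent in the log first (a real WAL would also fsync).
        try (BufferedWriter w = Files.newBufferedWriter(
                logFile, StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            w.write(key + "\t" + value);
            w.newLine();
        }
        // 2. Apply the update. A crash between steps 1 and 2 loses nothing:
        //    the constructor's recover() re-applies the logged entry.
        store.put(key, value);
    }

    public String get(String key) {
        return store.get(key);
    }

    private void recover() throws IOException {
        if (!Files.exists(logFile)) {
            return;  // fresh node, nothing to replay
        }
        for (String line : Files.readAllLines(logFile)) {
            String[] kv = line.split("\t", 2);
            if (kv.length == 2) {
                store.put(kv[0], kv[1]);  // re-apply the logged update
            }
        }
    }

    public static void main(String[] args) throws IOException {
        ToyWal wal = new ToyWal(Paths.get("toy.wal"));
        wal.put("row1", "v1");
        System.out.println(wal.get("row1"));  // "v1", even after a restart
    }
}

Crash recovery bugs of the kind the paper studies typically live in code like recover(): for example, assuming the log is complete, replaying entries in the wrong order, or mishandling a second crash that occurs during recovery itself.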


CrashTuner: detecting crash-recovery bugs in cloud systems via meta-info analysis
TLDR
CrashTuner, a novel fault-injection testing approach, is presented to combat crash-recovery bugs, which can cause severe damage such as cluster outages or start-up failures; the approach can be applied to five representative distributed systems.
Understanding Node Change Bugs for Distributed Systems
TLDR
An extensive empirical study on node change bugs is performed, and two useful tools are developed, NCTrigger and NPEDetector, which help users to automatically reproduce a node change bug by injecting node change events based on user specification.
CloudRaid: Detecting Distributed Concurrency Bugs via Log Mining and Enhancement
TLDR
This paper presents CloudRaid, a new automatic tool for finding distributed concurrency bugs efficiently and effectively, and proposes a log-enhancement technique that automatically introduces new logs into the system under test, which makes it well-suited for live systems.
Understanding Exception-Related Bugs in Large-Scale Cloud Systems
TLDR
This paper presents a comprehensive study on 210 eBugs from six widely-deployed cloud systems, including Cassandra, HBase, HDFS, Hadoop MapReduce, YARN, and ZooKeeper, and focuses on triggering conditions.
CloudRaid: hunting concurrency bugs in the cloud via log-mining
TLDR
The proposed CloudRaid automatically detects concurrency bugs in cloud systems, by analyzing and testing those message orderings that are likely to expose errors, and tries to flip the order of a pair of messages if they may happen in parallel.
A Delta-Debugging Approach to Assessing the Resilience of Actor Programs through Run-time Test Perturbations
TLDR
This work presents the first automated approach to testing the resilience of actor programs, which perturbs the execution of existing test cases and leverages delta debugging to explore all failure scenarios more efficiently.
Towards Understanding Tool-chain Bugs in the LLVM Compiler Infrastructure
TLDR
This paper conducts an empirical study of the LLVM tool-chain bugs, aiming to provide the first comprehensive understanding of these bugs.
McTAR: A Multi-Trigger Checkpointing Tactic for Fast Task Recovery in MapReduce
TLDR
A novel multi-trigger checkpointing approach for fast recovery of MapReduce tasks, named the Multi-trigger Checkpointing Tactic for fAst TAsk Recovery (McTAR), which employs a finer-grained and more effective fault-tolerance tactic.

References

SHOWING 1-10 OF 53 REFERENCES
Correlated Crash Vulnerabilities
TLDR
PACE, a framework that systematically generates and explores the persistent states that can occur in a distributed execution, is built; it uses a set of generic rules to prune the state space effectively, reducing checking time from days to hours in some cases.
TaxDC: A Taxonomy of Non-Deterministic Concurrency Bugs in Datacenter Distributed Systems
TLDR
This work studies 104 distributed concurrency bugs from four widely-deployed cloud-scale datacenter distributed systems, Cassandra, Hadoop MapReduce, HBase, and ZooKeeper, to present TaxDC, the largest and most comprehensive taxonomy of non-deterministic concurrency bugs in distributed systems.
Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems
TLDR
The majority of catastrophic failures could easily have been prevented by performing simple testing on error-handling code, the last line of defense, even without an understanding of the software design; a static checker, Aspirator, was developed that is capable of locating these bugs.
FCatch: Automatically Detecting Time-of-fault Bugs in Cloud Systems
TLDR
This paper carefully models time-of-fault (TOF) bugs as a new type of concurrency bug and develops FCatch to automatically predict TOF bugs by observing correct executions; evaluation on representative cloud systems shows that FCatch is effective, accurately finding severe TOF bugs.
Reducing crash recoverability to reachability
TLDR
A hierarchical formal model of what it means for a program to be crash recoverable is provided, and a novel technique is introduced that can automatically prove a program recovers correctly from a crash via a reduction to reachability.
DCatch: Automatically Detecting Distributed Concurrency Bugs in Cloud Systems
TLDR
To build DCatch, a set of happens-before rules is designed that models a wide variety of communication and concurrency mechanisms in real-world distributed cloud systems, along with tools that help prune false positives and trigger DCbugs.
Understanding Real-World Timeout Problems in Cloud Server Systems
TLDR
This study conducts a comprehensive characterization of real-world timeout problems in 11 commonly used cloud server systems (e.g., Hadoop, HDFS, Spark, and Cassandra).
CloudRaid: hunting concurrency bugs in the cloud via log-mining
TLDR
The proposed CloudRaid automatically detects concurrency bugs in cloud systems, by analyzing and testing those message orderings that are likely to expose errors, and tries to flip the order of a pair of messages if they may happen in parallel.
Lineage-driven Fault Injection
TLDR
MOLLY, a prototype of lineage-driven fault injection, is presented; it exploits a novel combination of data-lineage techniques from the database literature and state-of-the-art satisfiability testing to find bugs in fault-tolerant data management systems rapidly.
All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications
TLDR
It is found that applications use complex update protocols to persist state, and that the correctness of these protocols is highly dependent on subtle behaviors of the underlying file system, referred to as persistence properties.