An empirical study on crash recovery bugs in large-scale distributed systems
@article{Gao2018AnES, title={An empirical study on crash recovery bugs in large-scale distributed systems}, author={Yu Gao and Wensheng Dou and Feng Qin and Chushu Gao and Dong Wang and Jun Wei and Ruirui Huang and Li Zhou and Yongming Wu}, journal={Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering}, year={2018} }
In large-scale distributed systems, node crashes are inevitable, and can happen at any time. As such, distributed systems are usually designed to be resilient to these node crashes via various crash recovery mechanisms, such as write-ahead logging in HBase and hinted handoffs in Cassandra. However, faults in crash recovery mechanisms and their implementations can introduce intricate crash recovery bugs, and lead to severe consequences. In this paper, we present CREB, the most comprehensive…
22 Citations
CrashTuner: detecting crash-recovery bugs in cloud systems via meta-info analysis
- Computer Science, Biology
- SOSP
- 2019
CrashTuner is presented, a novel fault-injection testing approach to combat crash-recovery bugs, which can cause severe damage such as cluster outages or start-up failures; it can be applied to five representative distributed systems.
Understanding Node Change Bugs for Distributed Systems
- Computer Science
- 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER)
- 2019
An extensive empirical study on node change bugs is performed, and two useful tools are developed, NCTrigger and NPEDetector, which help users to automatically reproduce a node change bug by injecting node change events based on user specification.
CloudRaid: Detecting Distributed Concurrency Bugs via Log Mining and Enhancement
- Computer Science
- IEEE Transactions on Software Engineering
- 2022
This paper presents CloudRaid, a new automatic tool for finding distributed concurrency bugs efficiently and effectively, and proposes a log-enhancing technique that automatically introduces new logs into the system being tested, which makes it well suited for live systems.
Understanding Exception-Related Bugs in Large-Scale Cloud Systems
- Computer Science
- 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE)
- 2019
This paper presents a comprehensive study on 210 eBugs from six widely-deployed cloud systems, including Cassandra, HBase, HDFS, Hadoop MapReduce, YARN, and ZooKeeper, and focuses on triggering conditions.
How are distributed bugs diagnosed and fixed through system logs?
- Computer Science
- Inf. Softw. Technol.
- 2020
CloudRaid: hunting concurrency bugs in the cloud via log-mining
- Computer Science
- ESEC/SIGSOFT FSE
- 2018
The proposed CloudRaid automatically detects concurrency bugs in cloud systems, by analyzing and testing those message orderings that are likely to expose errors, and tries to flip the order of a pair of messages if they may happen in parallel.
A Delta-Debugging Approach to Assessing the Resilience of Actor Programs through Run-time Test Perturbations
- Computer Science
- 2020
This work presents the first automated approach to testing the resilience of actor programs, which perturbs the execution of existing test cases and leverages delta debugging to explore all failure scenarios more efficiently.
Towards Understanding Tool-chain Bugs in the LLVM Compiler Infrastructure
- Computer Science, Business
- 2021 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER)
- 2021
This paper conducts an empirical study of the LLVM tool-chain bugs, aiming to provide the first comprehensive understanding of these bugs.
McTAR: A Multi-Trigger Checkpointing Tactic for Fast Task Recovery in MapReduce
- Computer Science
- IEEE Transactions on Services Computing
- 2021
A novel multi-trigger checkpointing approach for fast recovery of MapReduce tasks is presented, named Multi-trigger Checkpointing Tactic for fAst TAsk Recovery (McTAR), which employs a finer-grained and better fault-tolerance tactic.
References
SHOWING 1-10 OF 53 REFERENCES
Correlated Crash Vulnerabilities
- Computer Science
- OSDI
- 2016
PACE, a framework that systematically generates and explores persistent states that can occur in a distributed execution, is built and uses a set of generic rules to effectively prune the state space, reducing checking time from days to hours in some cases.
TaxDC: A Taxonomy of Non-Deterministic Concurrency Bugs in Datacenter Distributed Systems
- Computer Science
- ASPLOS 2016
- 2016
This work studies 104 distributed concurrency bugs from four widely-deployed cloud-scale datacenter distributed systems (Cassandra, Hadoop MapReduce, HBase, and ZooKeeper) to present TaxDC, the largest and most comprehensive taxonomy of non-deterministic concurrency bugs in distributed systems.
Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems
- Computer Science
- OSDI
- 2014
The majority of catastrophic failures could easily have been prevented by performing simple testing on error handling code - the last line of defense - even without an understanding of the software design, and a static checker was developed, Aspirator, capable of locating bugs.
FCatch: Automatically Detecting Time-of-fault Bugs in Cloud Systems
- Computer Science
- ASPLOS 2018
- 2018
This paper carefully models time-of-fault (TOF) bugs as a new type of concurrency bug, and develops FCatch to automatically predict TOF bugs by observing correct execution. Evaluation on representative cloud systems shows that FCatch is effective, accurately finding severe TOF bugs.
Reducing crash recoverability to reachability
- Computer Science
- POPL 2016
- 2016
A hierarchical formal model of what it means for a program to be crash recoverable is provided and a novel technique capable of automatically proving that a program correctly recovers from a crash via a reduction to reachability is introduced.
DCatch: Automatically Detecting Distributed Concurrency Bugs in Cloud Systems
- Computer Science
- ASPLOS 2017
- 2017
To build DCatch, a set of happens-before rules is designed that models a wide variety of communication and concurrency mechanisms in real-world distributed cloud systems, along with tools that help prune false positives and trigger DCbugs.
Understanding Real-World Timeout Problems in Cloud Server Systems
- Computer Science
- 2018 IEEE International Conference on Cloud Engineering (IC2E)
- 2018
This paper conducts a comprehensive study to characterize real-world timeout problems in 11 commonly used cloud server systems (e.g., Hadoop, HDFS, Spark, and Cassandra).
CloudRaid: hunting concurrency bugs in the cloud via log-mining
- Computer Science
- ESEC/SIGSOFT FSE
- 2018
The proposed CloudRaid automatically detects concurrency bugs in cloud systems, by analyzing and testing those message orderings that are likely to expose errors, and tries to flip the order of a pair of messages if they may happen in parallel.
Lineage-driven Fault Injection
- Computer Science
- SIGMOD Conference
- 2015
MOLLY is presented, a prototype of lineage-driven fault injection that exploits a novel combination of data lineage techniques from the database literature and state-of-the-art satisfiability testing that finds bugs in fault-tolerant data management systems rapidly.
All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications
- Computer Science
- OSDI
- 2014
It is found that applications use complex update protocols to persist state, and that the correctness of these protocols is highly dependent on subtle behaviors of the underlying file system, which are referred to as persistence properties.