Exhaustive Exploration of the Failure-Oblivious Computing Search Space

Thomas Durieux, Youssef Hamadi, Zhongxing Yu, Martin Monperrus
2018 IEEE 11th International Conference on Software Testing, Verification and Validation (ICST)
High availability of software systems requires automated handling of crashes in the presence of errors. Failure-oblivious computing is one technique that aims to achieve high availability. We note that failure-obliviousness has not yet been studied in depth, and there are very few studies that help explain why failure-oblivious techniques work. For failure-oblivious computing to have an impact in practice, we need to deeply understand failure-oblivious behaviors in software. In this…

Context-aware Failure-oblivious Computing as a Means of Preventing Buffer Overflows
This work presents an approach to handling buffer overflows without aborting the program and demonstrates that introspection can be implemented in popular bug-finding and bug-mitigation tools such as LLVM’s AddressSanitizer, SoftBound, and Intel-MPX-based bounds checking.
Preventing Buffer Overflows by Context-aware Failure-oblivious Computing
This work presents an approach to handle buffer overflows without aborting the program by implementing a continuation logic in library functions based on an introspection function that allows querying the size of a buffer.
TripleAgent: Monitoring, Perturbation and Failure-Obliviousness for Automated Resilience Improvement in Java Applications
  • Long Zhang, Martin Monperrus
  • 2019 IEEE 30th International Symposium on Software Reliability Engineering (ISSRE)
A novel resilience improvement system that combines automated monitoring, automated perturbation injection, and automated resilience improvement, the last of which is achieved through the failure-oblivious computing concept.
Maximizing Error Injection Realism for Chaos Engineering with System Calls
The results show that Phoebe successfully generates realistic error models and is able to detect important reliability weaknesses with respect to system call invocation errors.
From Runtime Failures to Patches: Study of Patch Generation in Production. (De l'erreur d'exécution aux correctifs: étude de la génération de correctifs en production)
This thesis proposes new patch generation techniques that remove human intervention from patch generation, and shows that it is applicable and feasible to generate patches in the production environment without the intervention of a developer.
ATPG Binning and SAT-Based Approach to Hardware Trojan Detection for Safety-Critical Systems
This work uses binning of the trigger population based on Automatic Test Pattern Generation (ATPG) and invokes Boolean Satisfiability (SAT) solvers to generate test vectors with high Trojan coverage, demonstrating the effectiveness and superiority of this method with respect to prior work in terms of Trojan coverage and the cardinality of the test set.


Enhancing Server Availability and Security Through Failure-Oblivious Computing
Failure-oblivious computing is presented, a new technique that enables servers to execute through memory errors without memory corruption and to continue operating successfully, servicing legitimate requests and satisfying the needs of their users even after attacks trigger their memory errors.
Rx: treating bugs as allergies---a safe method to survive software failures
This paper proposes Rx, an innovative safe technique that can quickly recover programs from many types of software bugs, both deterministic and non-deterministic. It requires few to no modifications to applications and provides programmers with additional feedback for bug diagnosis.
DieHard: probabilistic memory safety for unsafe languages
Analytical and experimental results are presented that show DieHard's resilience to a wide range of memory errors, including a heap-based buffer overflow in an actual application.
Execution suppression: An automated iterative technique for locating memory errors
An automated approach for locating memory errors in the presence of memory corruption propagation that leverages the information revealed by a program crash and shows how crashes can be exposed in an execution by manipulating the relative ordering of particular variables within memory.
Automatic runtime error repair and containment via recovery shepherding
This work presents RCV, a system that enables software applications to survive divide-by-zero and null-dereference errors. A manual analysis of the source code relevant to the benchmark errors indicates that for 11 of the 18 errors the RCV and later patched versions produce identical or equivalent results on all inputs.
Exterminator: automatically correcting memory errors with high probability
Exterminator is a system that automatically corrects heap-based memory errors without programmer intervention, and enables collaborative bug correction by merging patches generated by multiple users.
ASSURE: automatic software self-healing using rescue points
Experimental results show that ASSURE enabled recovery for all of the bugs tested with fast recovery times, has modest performance overhead, and provides automatic self-healing orders of magnitude faster than current human-driven patch deployment methods.
SafeMem: exploiting ECC-memory for detecting memory leaks and memory corruption during production runs
This paper proposes a tool called SafeMem, which makes a novel use of existing ECC memory technology and exploits intelligent dynamic memory usage behavior analysis to detect memory leaks and corruption on-the-fly during production-runs.
Dynamic Error Remediation: A Case Study with Null Pointer Exceptions
This paper describes dynamic error remediation, and its effectiveness, with different strategies for handling the exceptions, and describes origin tracking, a JVM modification which exposes the origin of null values that cause null pointer exceptions.
Safe software updates via multi-version execution
This work implements multi-version execution in Mx, a system targeting Linux applications running on multi-core processors, and shows that it can be applied successfully to several real applications: Coreutils, a set of user-level UNIX applications; Lighttpd, a popular web server used by several high-traffic websites such as Wikipedia and YouTube; and Redis, an advanced key-value data structure server used by many well-known services such as GitHub and Flickr.