Automatic runtime error repair and containment via recovery shepherding

@inproceedings{Long2014AutomaticRE,
  title={Automatic runtime error repair and containment via recovery shepherding},
  author={Fan Long and Stelios Sidiroglou and Martin C. Rinard},
  booktitle={Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation},
  year={2014}
}
We present a system, RCV, for enabling software applications to survive divide-by-zero and null-dereference errors. RCV operates directly on off-the-shelf, production, stripped x86 binary executables. RCV implements recovery shepherding, which attaches to the application process when an error occurs, repairs the execution, tracks the repair effects as the execution continues, contains the repair effects within the application process, and detaches from the process after all repair effects are… 
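The attach, repair, track, contain, and detach cycle described above can be sketched as a toy model in C. This is not RCV: RCV works directly on stripped x86 binaries, whereas this sketch interprets a hypothetical six-instruction machine. The specific policies are assumptions chosen for illustration. A divide-by-zero or null read yields 0, and a null write is discarded. Per-register and per-cell taint bits track repair effects. OP_OUT stands in for a system call at which repair-influenced values are blocked. The shepherd detaches once no register or memory cell is tainted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy instruction set; RCV itself operates on x86 binaries.
   LOADI: reg[dst] = a            DIV:   reg[dst] = reg[a] / reg[b]
   LOAD:  reg[dst] = mem[reg[a]]  STORE: mem[reg[a]] = reg[b]
   OUT:   emit reg[a] (stands in for a system call)                  */
typedef enum { OP_LOADI, OP_DIV, OP_LOAD, OP_STORE, OP_OUT, OP_HALT } Opcode;
typedef struct { Opcode op; int dst, a, b; } Insn;

enum { NREGS = 4, MEMSZ = 8 };

typedef struct {
    long reg[NREGS];
    bool reg_taint[NREGS];  /* true if the value depends on a repair */
    long mem[MEMSZ];        /* address 0 plays the role of NULL */
    bool mem_taint[MEMSZ];
    bool attached;          /* is the shepherd currently attached? */
    int repairs, blocked, emitted;
    long last_out;
} Machine;

/* Does any register or memory cell still carry a repair effect? */
static bool effects_live(const Machine *m) {
    for (int i = 0; i < NREGS; i++) if (m->reg_taint[i]) return true;
    for (int i = 0; i < MEMSZ; i++) if (m->mem_taint[i]) return true;
    return false;
}

/* An error occurred: count the repair and attach the shepherd. */
static void note_repair(Machine *m) { m->repairs++; m->attached = true; }

/* Non-null addresses must fall inside the toy memory. */
static long valid_addr(long addr) { assert(addr > 0 && addr < MEMSZ); return addr; }

static void run(Machine *m, const Insn *prog) {
    for (int pc = 0; prog[pc].op != OP_HALT; pc++) {
        const Insn *in = &prog[pc];
        switch (in->op) {
        case OP_LOADI:
            m->reg[in->dst] = in->a;
            m->reg_taint[in->dst] = false;        /* overwrite flushes an effect */
            break;
        case OP_DIV:
            if (m->reg[in->b] == 0) {             /* divide-by-zero: result 0 */
                note_repair(m);
                m->reg[in->dst] = 0;
                m->reg_taint[in->dst] = true;
            } else {
                bool t = m->reg_taint[in->a] || m->reg_taint[in->b];
                m->reg[in->dst] = m->reg[in->a] / m->reg[in->b];
                m->reg_taint[in->dst] = t;
            }
            break;
        case OP_LOAD:
            if (m->reg[in->a] == 0) {             /* null read: result 0 */
                note_repair(m);
                m->reg[in->dst] = 0;
                m->reg_taint[in->dst] = true;
            } else {
                long addr = valid_addr(m->reg[in->a]);
                bool t = m->mem_taint[addr] || m->reg_taint[in->a];
                m->reg[in->dst] = m->mem[addr];
                m->reg_taint[in->dst] = t;
            }
            break;
        case OP_STORE:
            if (m->reg[in->a] == 0) {             /* null write: discard it */
                note_repair(m);
            } else {
                long addr = valid_addr(m->reg[in->a]);
                m->mem[addr] = m->reg[in->b];
                m->mem_taint[addr] = m->reg_taint[in->a] || m->reg_taint[in->b];
            }
            break;
        case OP_OUT:                              /* the process boundary */
            if (m->reg_taint[in->a]) {
                m->blocked++;                     /* contain: do not emit */
            } else {
                m->emitted++;
                m->last_out = m->reg[in->a];
            }
            break;
        case OP_HALT:
            break;
        }
        if (m->attached && !effects_live(m))
            m->attached = false;                  /* detach: effects flushed */
    }
}
```

Consider the program `r0 = 10; r1 = 0; r2 = r0 / r1; out r2; r2 = 7; out r2`. The repaired quotient is blocked at the first OP_OUT. Overwriting r2 then flushes the last repair effect, so the shepherd detaches before 7 is emitted. One design choice worth noting: a discarded null write leaves no taint in this model, so the shepherd detaches immediately after it. A stricter model could also track the skipped write as an effect.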


Automatic runtime recovery via error handler synthesis
TLDR
This work proposes Ares, a novel, practical approach to automatic runtime recovery (ARR) whose key insight is to leverage a system's inherent error-handling support to recover from unexpected errors, thereby expanding the intrinsic runtime error resilience of software systems.
Context-aware Failure-oblivious Computing as a Means of Preventing Buffer Overflows
TLDR
This work presents an approach to handling buffer overflows without aborting the program and demonstrates that introspection can be implemented in popular bug-finding and bug-mitigation tools such as LLVM’s AddressSanitizer, SoftBound, and Intel-MPX-based bounds checking.
CARE: compiler-assisted recovery from soft failures
TLDR
This work presents CARE, a lightweight compiler-assisted technique that repairs a crashed process on the fly when a crash-causing error is detected, allowing applications to continue executing instead of simply being terminated and restarted.
Fast and Precise On-the-Fly Patch Validation for All
TLDR
This study demonstrates for the first time that on-the-fly patch validation can often speed up state-of-the-art source-code-level automated program repair (APR) by over an order of magnitude, enabling all existing APR techniques to explore a larger search space and fix more bugs in the near future.
Automatic Analysis and Repair of Exception Bugs for Java Programs. (Analyse et réparation automatique des bugs liés aux exceptions dans les programmes Java)
TLDR
This thesis proposes resilience capabilities that correctly handle exceptions that were neither foreseen at specification time nor encountered during development or testing, and focuses on a more specific kind of exception: the null pointer dereference exception (NullPointerException in Java).
VFix: Value-Flow-Guided Precise Program Repair for Null Pointer Dereferences
TLDR
This work presents VFix, a new value-flow-guided APR approach that fixes null pointer exception (NPE) bugs by searching a substantially reduced solution space, greatly increasing the number of correct patches generated.
Preventing Use-after-free with Dangling Pointers Nullification
TLDR
DANGNULL is a system that detects temporal memory safety violations (in particular, use-after-free and double-free) at runtime and is effective against even the most sophisticated exploitation techniques.
Near-Zero Downtime Recovery From Transient-Error-Induced Crashes
TLDR
This article presents IterPro, a lightweight compiler-assisted resilience technique that quickly and accurately recovers processes from transient-error-induced crashes and could greatly reduce the overheads and resource requirements of the resilience subsystem in future exascale systems.
LetGo: A Lightweight Continuous Framework for HPC Applications Under Failures
TLDR
The hypothesis is that a class of HPC applications has enough intrinsic fault tolerance that it is possible to repurpose the default mechanism that terminates an application once a crash-causing error is signalled, and instead to attempt to repair the corrupted application state and continue the application's execution.
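The nullification idea behind DANGNULL can be sketched in C as follows. When an object is freed, every pointer still referring to it is set to NULL, so a later use becomes a null dereference rather than an exploitable use-after-free. The registry, `track_store`, and `free_and_nullify` are hypothetical names. DANGNULL tracks pointer relationships automatically, while this toy only handles exact base pointers held in slots that the caller registers explicitly and that are still in scope at free time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical registry of pointer slots and the object each refers to.
   Callers must route every protected pointer store through track_store(). */
enum { MAX_RECORDS = 16 };

typedef struct {
    void **slot;  /* where the pointer variable lives */
    void *obj;    /* object base it was last set to */
} PtrRecord;

static PtrRecord records[MAX_RECORDS];
static size_t nrecords;

/* Instrumented pointer store: *slot = obj, remembering the relationship. */
static void track_store(void **slot, void *obj) {
    *slot = obj;
    for (size_t i = 0; i < nrecords; i++) {
        if (records[i].slot == slot) {
            records[i].obj = obj;
            return;
        }
    }
    assert(nrecords < MAX_RECORDS);
    records[nrecords++] = (PtrRecord){ slot, obj };
}

/* Instrumented free: null out every tracked pointer that still refers to
   obj, then release it. Returns how many pointers were nullified. */
static size_t free_and_nullify(void *obj) {
    if (obj == NULL)
        return 0;                       /* free(NULL) is a no-op anyway */
    size_t nullified = 0;
    for (size_t i = 0; i < nrecords; i++) {
        /* Compare before free(): the slot must still hold this object. */
        if (records[i].obj == obj && *records[i].slot == obj) {
            *records[i].slot = NULL;
            records[i].obj = NULL;
            nullified++;
        }
    }
    free(obj);
    return nullified;
}
```

A dereference of `p` after `free_and_nullify(p)` now faults on NULL instead of reading reallocated memory. That turns a potential exploit into a detectable crash, which recovery systems such as RCV are designed to handle.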
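The entry on context-aware failure-oblivious computing builds on the basic failure-oblivious policy of executing through buffer overflows instead of aborting. A minimal C sketch of that basic policy uses a hypothetical bounds-carrying `Buf` handle. It discards out-of-bounds writes and manufactures a value for out-of-bounds reads. It does not model the context-aware introspection that entry describes, and returning 0 for every manufactured read is a simplifying assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bounds-carrying buffer handle. */
typedef struct {
    unsigned char *base;
    size_t len;
    size_t discarded_writes;    /* out-of-bounds writes dropped */
    size_t manufactured_reads;  /* out-of-bounds reads answered with 0 */
} Buf;

/* Checked write: in bounds it stores; out of bounds it is silently dropped
   instead of corrupting adjacent memory or aborting the program. */
static void buf_write(Buf *b, size_t i, unsigned char v) {
    assert(b != NULL);
    if (i < b->len)
        b->base[i] = v;
    else
        b->discarded_writes++;
}

/* Checked read: out of bounds it manufactures a value (0 here) and continues. */
static unsigned char buf_read(Buf *b, size_t i) {
    assert(b != NULL);
    if (i < b->len)
        return b->base[i];
    b->manufactured_reads++;
    return 0;
}
```

Writing six bytes into a four-byte `Buf` stores the first four and counts two discarded writes. The program keeps running with its adjacent memory intact.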
