TripleAgent: Monitoring, Perturbation and Failure-Obliviousness for Automated Resilience Improvement in Java Applications

@article{Zhang2019TripleAgentMP,
  title={TripleAgent: Monitoring, Perturbation and Failure-Obliviousness for Automated Resilience Improvement in Java Applications},
  author={Long Zhang and Martin Monperrus},
  journal={2019 IEEE 30th International Symposium on Software Reliability Engineering (ISSRE)},
  year={2019},
  pages={116--127}
}
  • Long Zhang, Martin Monperrus
  • Published 27 December 2018
  • Computer Science
  • 2019 IEEE 30th International Symposium on Software Reliability Engineering (ISSRE)
In this paper, we present a novel resilience improvement system for Java applications. The unique feature of this system is that it combines automated monitoring, automated perturbation injection, and automated resilience improvement. The latter is achieved through failure-oblivious computing, a concept introduced in 2004 by Rinard and colleagues. We design and implement the system as agents for the Java virtual machine. We evaluate the system on two real-world applications: a file transfer… 
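
The three ingredients named in the abstract (monitoring, perturbation injection, and failure-obliviousness) can be illustrated with a minimal plain-Java sketch. This is not the paper's implementation: the abstract states that the system is built as agents for the Java virtual machine, whereas the sketch below uses an ordinary wrapper method. The class `FailureObliviousSketch`, its fields, and its methods are hypothetical names introduced only for illustration.

```java
import java.util.function.Supplier;

// Illustrative sketch only; all names here are hypothetical and not taken from the paper.
public class FailureObliviousSketch {

    /** Perturbation switch (assumption): when true, the wrapped call throws instead of running. */
    public static boolean injectFailure = false;

    /** Monitoring counter (assumption): number of exceptions observed by the wrapper. */
    public static int observedFailures = 0;

    /** Perturbation injection: either runs the body or throws an injected exception. */
    public static <T> T perturb(Supplier<T> body) {
        if (injectFailure) {
            throw new IllegalStateException("injected perturbation");
        }
        return body.get();
    }

    /**
     * Failure-oblivious execution: the exception is recorded and swallowed,
     * and a caller-supplied fallback value is returned so execution continues.
     */
    public static <T> T failureOblivious(Supplier<T> body, T fallback) {
        try {
            return perturb(body);
        } catch (RuntimeException e) {
            observedFailures++;   // monitoring: record that a failure happened
            return fallback;      // failure-obliviousness: continue with a default
        }
    }

    public static void main(String[] args) {
        // Normal run: the body executes and its result is returned.
        System.out.println(failureOblivious(() -> "payload", ""));

        // Perturbed run: the injected exception is absorbed and the fallback is used.
        injectFailure = true;
        System.out.println(failureOblivious(() -> "payload", "<fallback>"));
        System.out.println("observed failures: " + observedFailures);
    }
}
```

Whether swallowing an exception is acceptable depends on the call site, which is why the sketch keeps a counter: the monitoring data is what would let an operator judge whether the fallback actually preserved acceptable behavior.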

Practical Online Debugging of Spark-like Applications
TLDR
This work presents an online debugging approach tailored to Big Data analytics applications that includes local debugging of remote parallel exceptions through dynamic local checkpoints, extended with domain-specific debugging operations and live code updating functionality, and extends the model to easily allow developers to automatically ignore exceptions that happen at runtime.
Chaos engineering experiments in middleware systems using targeted network degradation and automatic fault injection
TLDR
Methods for applying Chaos Engineering to open systems architecture are developed as a way to improve the resiliency of such systems against natural and adversarial failure conditions, which will, in turn, lead to the development of more resilient military mission systems.
Production Monitoring to Improve Test Suites
TLDR
An approach called PANKTI is devised which monitors applications as they execute in production, and then automatically generates unit tests from the collected production data, and shows that the generated tests indeed improve the quality of the test suite of the application under consideration.
A Reflection on “An Exploratory Study on Exception Handling Bugs in Java Programs”
TLDR
The goal of this reflection paper is to investigate the state of the art in exception handling research, with a particular emphasis on exception handling bugs, and how the paper investigating the prevalence and nature of exception handling bugs in two large, widely adopted Java systems has influenced other studies in the area.
Maximizing Error Injection Realism for Chaos Engineering With System Calls
TLDR
The results show that the novel fault injection framework, called Phoebe, successfully generates realistic error models and is able to detect important reliability weaknesses with respect to system call invocation errors.
Automatic Observability for Dockerized Java Applications
Docker is a virtualization technique heavily used in industry to build cloud-based systems. In this context, observability means that it is hard for engineers to get timely and accurate information

References

Rx: treating bugs as allergies---a safe method to survive software failures
TLDR
This paper proposes an innovative safe technique, called Rx, which can quickly recover programs from many types of software bugs, both deterministic and non-deterministic, which requires few to no modifications to applications and provides programmers with additional feedback for bug diagnosis.
Enhancing Server Availability and Security Through Failure-Oblivious Computing
TLDR
Failure-oblivious computing is presented, a new technique that enables servers to execute through memory errors without memory corruption and enables the servers to continue to operate successfully to service legitimate requests and satisfy the needs of their users even after attacks trigger their memory errors.
ASSURE: automatic software self-healing using rescue points
TLDR
Experimental results show that ASSURE enabled recovery for all of the bugs tested with fast recovery times, has modest performance overhead, and provides automatic self-healing orders of magnitude faster than current human-driven patch deployment methods.
Correctness attraction: a study of stability of software behavior under runtime perturbation
TLDR
A qualitative manual analysis enables the first taxonomy ever of the reasons behind correctness attraction and the findings on the stability of software under execution perturbations have a level of validity that has never been reported before in the scarce related work.
Crash-Only Software
TLDR
This paper presents ideas on how to build such crash-only Internet services, taking successful techniques to their logical extreme, and shows that it can lead to more reliable, predictable code and faster, more effective recovery.
Exhaustive Exploration of the Failure-Oblivious Computing Search Space
TLDR
The outcome of this experiment is a much better understanding of what really happens when failure-oblivious computing is used, and this opens new promising research directions.
Discovering faults in idiom-based exception handling
TLDR
It is shown that the popular return-code idiom for dealing with exceptions is indeed fault prone, but that a simple solution can lead to significant improvements.
Automatically patching errors in deployed software
TLDR
Aspects of ClearView that make it particularly appropriate for this context include its ability to generate patches without human intervention, apply and remove patches to and from running applications without requiring restarts or otherwise perturbing the execution, and identify and discard ineffective or damaging patches by evaluating the continued behavior of patched applications.
Exception-Chain Analysis: Revealing Exception Handling Architecture in Java Server Applications
  • Chen Fu, B. Ryder
  • Computer Science
    29th International Conference on Software Engineering (ICSE'07)
  • 2007
TLDR
A new static analysis is presented that, when combined with previous exception-flow analyses, computes chains of semantically-related exception-flow links, and thus reports entire exception propagation paths, instead of just discrete segments of them.