TripleAgent: Monitoring, Perturbation and Failure-Obliviousness for Automated Resilience Improvement in Java Applications

  • Long Zhang, Martin Monperrus
  • Published 27 December 2018
  • Computer Science
  • 2019 IEEE 30th International Symposium on Software Reliability Engineering (ISSRE)
In this paper, we present a novel resilience improvement system for Java applications. Its distinguishing feature is the combination of automated monitoring, automated perturbation injection, and automated resilience improvement. The last of these is achieved through failure-oblivious computing, a concept introduced in 2004 by Rinard and colleagues. We design and implement the system as agents for the Java virtual machine. We evaluate the system on two real-world applications: a file transfer…
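The interplay of the three ingredients named in the abstract can be pictured with a small Java sketch. It assumes, for illustration only, that a perturbation is an injected exception and that failure-obliviousness means swallowing that exception and continuing with a fallback value. The class `FailureObliviousSketch` and the methods `readGreeting` and `failureOblivious` are hypothetical names; this is not the paper's implementation, which the abstract describes as agents for the Java virtual machine.

```java
import java.util.function.Supplier;

/** Illustrative sketch only; all names here are hypothetical. */
public class FailureObliviousSketch {

    /** Perturbation: throws an exception when injection is switched on. */
    static String readGreeting(boolean injectFailure) {
        if (injectFailure) {
            throw new IllegalStateException("injected perturbation");
        }
        return "hello";
    }

    /**
     * Failure-oblivious wrapper: runs the body and, if it throws a
     * RuntimeException, logs it (monitoring) and returns the fallback
     * instead of propagating the failure.
     */
    static <T> T failureOblivious(Supplier<T> body, T fallback) {
        try {
            return body.get();
        } catch (RuntimeException e) {
            // Monitoring: record which failure was swallowed.
            System.err.println("swallowed: " + e.getMessage());
            return fallback;
        }
    }

    public static void main(String[] args) {
        // Normal execution: the body's own result is returned.
        String normal = failureOblivious(() -> readGreeting(false), "fallback");
        // Perturbed execution: the exception is swallowed, the fallback is used.
        String perturbed = failureOblivious(() -> readGreeting(true), "fallback");
        System.out.println(normal);
        System.out.println(perturbed);
    }
}
```

Whether continuing with a fallback value is acceptable depends on the call site, which is why a sketch like this would need monitoring data to judge each swallowed failure.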


Maximizing Error Injection Realism for Chaos Engineering with System Calls
The results show that Phoebe successfully generates realistic error models and is able to detect important reliability weaknesses with respect to system call invocation errors.
Practical Online Debugging of Spark-like Applications
This work presents an online debugging approach tailored to Big Data analytics applications that includes local debugging of remote parallel exceptions through dynamic local checkpoints, extended with domain-specific debugging operations and live code updating functionality, and extends the model to easily allow developers to automatically ignore exceptions that happen at runtime.
Chaos engineering experiments in middleware systems using targeted network degradation and automatic fault injection
Methods for applying Chaos Engineering to open systems architecture are developed as a way to improve the resiliency of such systems against natural and adversarial failure conditions, which will, in turn, lead to the development of more resilient military mission systems.
Production Monitoring to Improve Test Suites
An approach called PANKTI is devised which monitors applications as they execute in production, and then automatically generates unit tests from the collected production data, and shows that the generated tests indeed improve the quality of the test suite of the application under consideration.
A Reflection on “An Exploratory Study on Exception Handling Bugs in Java Programs”
The goal of this reflection paper is to investigate the state of the art in exception handling research, with a particular emphasis on exception handling bugs, and how the paper investigating the prevalence and nature of exception handling bugs in two large, widely adopted Java systems has influenced other studies in the area.
Automatic Observability for Dockerized Java Applications
Docker is a virtualization technique heavily used in industry to build cloud-based systems. In this context, observability means that it is hard for engineers to get timely and accurate information


Rx: treating bugs as allergies---a safe method to survive software failures
This paper proposes an innovative safe technique, called Rx, which can quickly recover programs from many types of software bugs, both deterministic and non-deterministic. Rx requires few to no modifications to applications and provides programmers with additional feedback for bug diagnosis.
Enhancing Server Availability and Security Through Failure-Oblivious Computing
Failure-oblivious computing is presented, a new technique that enables servers to execute through memory errors without memory corruption and enables the servers to continue to operate successfully to service legitimate requests and satisfy the needs of their users even after attacks trigger their memory errors.
ASSURE: automatic software self-healing using rescue points
Experimental results show that ASSURE enabled recovery for all of the bugs tested with fast recovery times, has modest performance overhead, and provides automatic self-healing orders of magnitude faster than current human-driven patch deployment methods.
SFIDA: a software implemented fault injection tool for distributed dependable applications
SFIDA, a new software-implemented fault injection tool, is described in this paper; it can be used to test the dependability of distributed applications on the Linux platform. This has been
Correctness attraction: a study of stability of software behavior under runtime perturbation
A qualitative manual analysis enables the first taxonomy ever of the reasons behind correctness attraction and the findings on the stability of software under execution perturbations have a level of validity that has never been reported before in the scarce related work.
Crash-Only Software
This paper presents ideas on how to build such crash-only Internet services, taking successful techniques to their logical extreme, and shows that it can lead to more reliable, predictable code and faster, more effective recovery.
Exhaustive Exploration of the Failure-Oblivious Computing Search Space
The outcome of this experiment is a much better understanding of what really happens when failure-oblivious computing is used, and this opens new promising research directions.
Discovering faults in idiom-based exception handling
It is shown that the popular return-code idiom for dealing with exceptions is indeed fault prone, but that a simple solution can lead to significant improvements.
Automatically patching errors in deployed software
Aspects of ClearView that make it particularly appropriate for this context include its ability to generate patches without human intervention, apply and remove patches to and from running applications without requiring restarts or otherwise perturbing the execution, and identify and discard ineffective or damaging patches by evaluating the continued behavior of patched applications.