• Corpus ID: 211566520

Understanding, Detecting and Localizing Partial Failures in Large System Software

@inproceedings{Lou2020UnderstandingDA,
  title={Understanding, Detecting and Localizing Partial Failures in Large System Software},
  author={Chang Lou and Peng Huang and Scott F. Smith},
  booktitle={NSDI},
  year={2020}
}
Partial failures occur frequently in cloud systems and can cause serious damage including inconsistency and data loss. Unfortunately, these failures are not well understood. Nor can they be effectively detected. In this paper, we first study 100 real-world partial failures from five mature systems to understand their characteristics. We find that these failures are caused by a variety of defects that require the unique conditions of the production environment to be triggered. Manually writing… 
Fail-slow fault tolerance needs programming support
TLDR
The Dependably Fast Library (DepFast) is designed and used to implement a distributed replicated state machine (RSM) and it is shown that it can tolerate various types of fail-slow faults that affect existing RSM implementations.
ViperProbe: Using eBPF Metrics to Improve Microservice Observability
TLDR
ViperProbe is the first scalable eBPF-based dynamic and adaptive microservices metrics collection framework that provides dynamic sampling and collection of deep, diverse, and precise system metrics, and is described as the CriticalMetrics.
SwissLog: Robust and Unified Deep Learning Based Log Anomaly Detection for Diverse Faults
TLDR
The semantic embedding and the time embedding approaches are combined to train a unified attention based BiLSTM model to detect anomalies, which is robust to the changing log data and effective for diverse faults.
ViperProbe: Rethinking Microservice Observability with eBPF
TLDR
ViperProbe is the first scalable eBPF-based dynamic and adaptive microservices metrics collection framework that provides dynamic sampling and collection of deep, diverse, and precise system metrics.
ITSY: Initial Trigger-Based PFC Deadlock Detection in the Data Plane
  • X. Wu, T. Ng
  • Computer Science
    IEEE INFOCOM 2021 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS)
  • 2021
TLDR
ITSY, a novel system that correctly detects and solves deadlocks entirely in the data plane and does not require any assumptions on network topologies and routing algorithms, is proposed and implemented.
Analytical study of software development process model variants
TLDR
This work presents an unambiguous exposition of selected software development model variants, studied in a theoretical, visual, and analytical manner, and concludes with guidance on choosing among these models.
Detecting and Resolving PFC Deadlocks with ITSY Entirely in the Data Plane
TLDR
ITSY is a novel system that correctly detects and resolves deadlocks entirely in the data plane and contributes to efficient deadlock detection, mitigation, and recurrence prevention, and it works with any network topologies and routing algorithms.
Predictive and Adaptive Failure Mitigation to Avert Production Cloud VM Interruptions
TLDR
NARYA is a preventive and adaptive failure mitigation service integrated into a production cloud, Microsoft Azure’s compute platform; it predicts imminent host failures based on multi-layer system signals and then decides on smart mitigation actions to avert VM failures.
Device and Placement Aware Framework to optimize Single Failure Recoveries and Reads for Erasure Coded Storage System with Heterogeneous Storage Devices
  • Yingxun Fu, Xun Liu, Li Ma
  • Computer Science
    2020 International Symposium on Reliable Distributed Systems (SRDS)
  • 2020
TLDR
A new erasure code framework termed Device Placement Aware Framework (DPAF) is proposed, to integrate existing erasure codes to generate DPAF-Codes, in order to gain good performance on heterogeneous storage devices.
Robust Anomaly Detection Using Reconstructive Adversarial Network
TLDR
Adran is an unsupervised anomaly detection model that introduces adversarial learning into a reconstructive model, generating a reconstructive adversarial network with an anomaly detection-based training objective that tolerates non-Gaussian noise by activating the discriminator with a non-smooth function.

References

SHOWING 1-10 OF 72 REFERENCES
Capturing and Enhancing In Situ System Observability for Failure Detection
TLDR
It is argued that the missing piece in failure detection is detecting what the requesters of a failing component see, which leads to the design and implementation of Panorama, a system designed to enhance system observability by taking advantage of the interactions between a system's components.
Comprehensive and Efficient Runtime Checking in System Software through Watchdogs
TLDR
This paper argues that modern software needs intrinsic failure detectors that are tailored to individual systems and can detect anomalies within a process at finer granularity, and advocates a notion of intrinsic software watchdogs.
Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems
TLDR
The majority of catastrophic failures could easily have been prevented by performing simple testing on error handling code - the last line of defense - even without an understanding of the software design, and a static checker, Aspirator, was developed that is capable of locating these bugs.
Early Detection of Configuration Errors to Reduce Failure Damage
TLDR
A tool named PCHECK is presented that analyzes the source code and automatically generates configuration checking code (called checkers); the checkers emulate the late execution that uses configuration values and detect latent configuration (LC) errors if the error manifestations are captured during the emulated execution.
The φ Accrual Failure Detector
TLDR
This paper presents a novel abstraction, called accrual failure detectors, that emphasizes flexibility and expressiveness and can serve as a basic building block for implementing failure detectors in distributed systems.
An Analysis of Network-Partitioning Failures in Cloud Systems
We present a comprehensive study of 136 system failures attributed to network-partitioning faults from 25 widely used distributed systems. We found that the majority of the failures led to
Detecting failures in distributed systems with the Falcon spy network
TLDR
The design, implementation, and evaluation of Falcon are presented; Falcon is a failure detector that is fast, reliable, and viable, and it could change the way that a class of distributed systems is built.
Practical Hardening of Crash-Tolerant Systems
TLDR
A generic and principled hardening technique is presented for Arbitrary State Corruption (ASC) faults, which specifically model the effects of realistic data corruptions on distributed processes, along with a wrapper library that transparently hardens distributed processes.
Microreboot - A Technique for Cheap Recovery
TLDR
This work uses separation of process recovery from data recovery to enable microrebooting - a fine-grain technique for surgically recovering faulty application components, without disturbing the rest of the application.
EIO: Error Handling is Occasionally Correct
TLDR
A static analysis technique, EDP, is developed that analyzes how file systems and storage device drivers propagate error codes and finds that errors are often incorrectly propagated.