• Corpus ID: 239050327

DeLag: Detecting Latency Degradation Patterns in Service-based Systems

@article{Traini2021DeLagDL,
  title={DeLag: Detecting Latency Degradation Patterns in Service-based Systems},
  author={Luca Traini and Vittorio Cortellessa},
  journal={ArXiv},
  year={2021},
  volume={abs/2110.11155}
}
Performance debugging in production is a fundamental activity in modern service-based systems. The diagnosis of performance issues is often time-consuming, since it requires thorough inspection of large volumes of traces and performance indices. In this paper we present DeLag, a novel automated search-based approach for diagnosing performance issues in service-based systems. DeLag identifies subsets of requests that show, in the combination of their Remote Procedure Call execution times… 

References

Detecting Latency Degradation Patterns in Service-based Systems
TLDR
An automated approach is presented that detects relevant RPC execution time patterns associated with request latency degradation, i.e. latency degradation patterns, based on a genetic search algorithm driven by an information retrieval relevance metric and an optimized fitness evaluation.
Diagnosing Performance Changes by Comparing Request Flows
TLDR
A new technique for gaining insight into performance changes in a distributed storage service caused by code changes, configuration modifications, and component degradations is developed, demonstrating the value and efficacy of comparing request flows.
Understanding latency variations of black box services
TLDR
This work proposes a general framework for understanding performance of arbitrary black box services, and designs algorithms that use this measure not only for a fixed latency interval, but also to explain the entire range of latencies of the service by segmenting it into smaller intervals.
Execution Anomaly Detection in Distributed Systems through Unstructured Log Analysis
TLDR
This paper proposes an unstructured log analysis technique for anomaly detection and a novel algorithm to convert free-form text messages in log files to log keys without heavily relying on application-specific knowledge.
Kraken: Leveraging Live Traffic Tests to Identify and Resolve Resource Utilization Bottlenecks in Large Scale Web Services
TLDR
Kraken is a new system that runs load tests by continually shifting live user traffic to one or more data centers, which enables empirical testing by monitoring user experience and system health in a feedback loop between traffic shifts.
DeCaf: Diagnosing and Triaging Performance Issues in Large-Scale Cloud Services
TLDR
DeCaf is presented, a system for automated diagnosis and triaging of KPI issues using service logs that uses machine learning along with pattern mining to help service owners automatically root cause and triage performance issues.
Performance debugging in the large via mining millions of stack traces
TLDR
To enable performance debugging in the large in practice, a novel approach is proposed, called StackMine, that mines callstack traces to help performance analysts effectively discover highly impactful performance bugs (e.g., bugs impacting many users with long response delay).
Fa: A System for Automating Failure Diagnosis
TLDR
Two novel challenges are addressed in a platform for automated diagnosis called Fa: making signatures robust to the noisy monitoring data in production systems, and generating reliable confidence estimates for matches.
Root cause detection in a service-oriented architecture
TLDR
MonitorRank is introduced, an algorithm that can reduce the time, domain knowledge, and human effort required to find the root causes of anomalies in such service-oriented architectures and provides a ranked order list of possible root causes for monitoring teams to investigate.
Experience report: Anomaly detection of cloud application operations using log and cloud metric correlation analysis
TLDR
An approach is presented that adopts a regression-based analysis technique to find the correlation between an operation's activity logs and the effect of that activity on cloud resources, and derives assertion specifications that can be used for runtime verification of running operations and their impact on resources.