Pivot tracing: dynamic causal monitoring for distributed systems

@article{Mace2016PivotTD,
  title={Pivot tracing: dynamic causal monitoring for distributed systems},
  author={Jonathan Mace and Ryan Roelke and Rodrigo Fonseca},
  journal={Proceedings of the 25th Symposium on Operating Systems Principles},
  year={2016}
}
Monitoring and troubleshooting distributed systems is notoriously difficult; potential problems are complex, varied, and unpredictable. The monitoring and diagnosis tools commonly used today -- logs, counters, and metrics -- have two important limitations: what gets recorded is defined a priori, and the information is recorded in a component- or machine-centric way, making it extremely hard to correlate events that cross these boundaries. This paper presents Pivot Tracing, a monitoring… 
Dynamic Causal Monitoring for Distributed Systems
TLDR
Pivot Tracing is dynamic and extensible, enables cross-tier analysis between any inter-operating applications with low execution overhead, and can identify a diverse range of root causes such as software bugs, misconfiguration, and limping hardware.
Pivot tracing
TLDR
Pivot Tracing gives users, at runtime, the ability to define arbitrary metrics at one point of the system, while being able to select, filter, and group by events meaningful at other parts of the system, even when crossing component or machine boundaries.
Universal context propagation for distributed system instrumentation
TLDR
This paper proposes a layered architecture for cross-cutting tools that separates concerns of system developers and tool developers, enabling independent instrumentation of systems, and the deployment and evolution of multiple such tools.
Falcon: A Practical Log-Based Analysis Tool for Distributed Systems
TLDR
A case study with the popular distributed coordination service Apache Zookeeper shows that Falcon eases the log-based analysis of complex distributed protocols and is helpful in bridging the gap between protocol design and implementation.
Profiling Distributed Virtual Environments by Tracing Causality
TLDR
This paper describes how the instrumentation can be implemented natively in common environments, how its output can be processed into a graph describing causality, and how heterogeneous data sources can be incorporated into this to maximise the scope of the profiling.
Profiling distributed systems in lightweight virtualized environments with logs and resource metrics
TLDR
LRTrace, a non-intrusive tracing and feedback control tool for distributed applications in lightweight virtualized environments, is proposed and implemented; it can diagnose performance issues caused by interference, bugs, or both, and helps users understand the workflows of data-parallel applications.
CAT: content-aware tracing and analysis for distributed systems
TLDR
CaT is presented, a non-intrusive content-aware tracing and analysis framework that, through a novel similarity-based approach, is able to comprehensively trace and correlate the flow of network and storage requests from applications.
Optimizing distributed data stream processing by tracing
Melange: A Hybrid Approach to Tracing Heterogeneous Distributed Systems
TLDR
This document proposes an alternative distributed tracing tool, operating as a middleware layer, that combines source code modification and passive metadata capture to implement distributed tracing in heterogeneous distributed systems containing black boxes.
DCatch: Automatically Detecting Distributed Concurrency Bugs in Cloud Systems
TLDR
To build DCatch, a set of happens-before rules are designed that model a wide variety of communication and concurrency mechanisms in real-world distributed cloud systems, and tools to help prune false positives and trigger DCbugs are designed.

References

SHOWING 1-10 OF 143 REFERENCES
Dynamic Causal Monitoring for Distributed Systems
TLDR
Pivot Tracing is dynamic and extensible, enables cross-tier analysis between any inter-operating applications with low execution overhead, and can identify a diverse range of root causes such as software bugs, misconfiguration, and limping hardware.
Fay: extensible distributed tracing from kernels to clusters
TLDR
Fay demonstrates the efficiency and extensibility benefits of using safe, statically-verified machine code as the basis for low-level execution tracing, and establishes that the expressiveness and performance of high-level tracing queries can equal or even surpass that of specialized monitoring tools.
lprof: A Non-intrusive Request Flow Profiler for Distributed Systems
Applications implementing cloud services, such as HDFS, Hadoop YARN, Cassandra, and HBase, are mostly built as distributed systems designed to scale. In order to analyze and debug the performance of
Stardust: tracking activity in a distributed storage system
TLDR
This paper reports on the experience building and using end-to-end tracing as an on-line monitoring tool in a distributed storage system and shows that such fine-grained tracing can be made efficient and is useful for on- and off-line analysis of system behavior.
Pinpoint: problem determination in large, dynamic Internet services
TLDR
This work presents a dynamic analysis methodology that automates problem determination in these environments by coarse-grained tagging of numerous real client requests as they travel through the system and using data mining techniques to correlate the believed failures and successes of these requests to determine which components are most likely to be at fault.
MTracer: A Trace-Oriented Monitoring Framework for Medium-Scale Distributed Systems
TLDR
This paper presents MTracer, a lightweight trace-oriented monitoring system for medium-scale distributed systems that has very low overhead and can handle more than 4000 events per second.
VScope: Middleware for Troubleshooting Time-Sensitive Data Center Applications
TLDR
Experimental results show that VScope can deploy and operate a variety of on-line analytics functions and metrics within a few seconds at large scale, and that, compared to traditional logging approaches, VScope-based troubleshooting has substantially lower perturbation and generates much smaller log data volumes.
So, you want to trace your distributed system? Key design insights from years of practical experience
TLDR
Drawing upon experiences building and using end-to-end tracing infrastructures, this paper distills the key design axes that dictate trace utility for important use cases and identifies the remaining challenges on the path to making tracing an integral part of distributed system design.
Structured Comparative Analysis of Systems Logs to Diagnose Performance Problems
TLDR
DISTALYZER is described, an automated tool to support developer investigation of performance issues in distributed systems that uses machine learning techniques to compare system behaviors extracted from the logs and automatically infer the strongest associations between system components and performance.
Dapper, a Large-Scale Distributed Systems Tracing Infrastructure
TLDR
This paper introduces the design of Dapper, Google’s production distributed systems tracing infrastructure, and describes how its design goals of low overhead, application-level transparency, and ubiquitous deployment on a very large scale system were met.