Capturing, indexing, clustering, and retrieving system history

Ira Cohen, Steve Zhang, Moisés Goldszmidt, Julie Symons, Terence Kelly, Armando Fox
Symposium on Operating Systems Principles
We present a method for automatically extracting from a running system an indexable signature that distills the essential characteristic from a system state and that can be subjected to automated clustering and similarity-based retrieval to identify when an observed system state is similar to a previously-observed state. This allows operators to identify and quantify the frequency of recurrent problems, to leverage previous diagnostic efforts, and to establish whether problems seen at different… 

Capturing, indexing, and retrieving system history

This work describes an approach for automatically extracting indexable descriptions, or signatures, that distill the system information most associated with a problem and can be formally manipulated to facilitate automated clustering and similarity-based search.

Fmeter: Extracting Indexable Low-Level System Signatures by Counting Kernel Function Calls

Fmeter is a monitoring system that extracts formal, indexable, low-level system signatures using the classical vector space model from the field of information retrieval and text mining; the signatures are shown to be naturally amenable to formal processing with statistical methods like clustering and supervised machine learning.
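To make the vector space model mentioned above concrete, here is a minimal sketch, not Fmeter's actual implementation. It assumes each monitoring interval is summarized as a dictionary of kernel function call counts, weights the counts with a plain TF-IDF scheme, and compares signatures with cosine similarity. The function names `tfidf_signatures` and `cosine_similarity`, the sample kernel function names, and the counts are all illustrative assumptions.

```python
import math
from collections import Counter


def tfidf_signatures(count_vectors):
    """Turn per-interval call counts into TF-IDF weighted signatures.

    count_vectors: list of dicts mapping function name -> call count
    (a hypothetical input format chosen for this sketch).
    """
    n = len(count_vectors)
    # Document frequency: number of intervals in which each function appears.
    df = Counter()
    for counts in count_vectors:
        df.update(name for name, c in counts.items() if c > 0)

    signatures = []
    for counts in count_vectors:
        total = sum(counts.values())
        sig = {}
        for name, c in counts.items():
            if c > 0 and total > 0:
                tf = c / total                 # relative call frequency
                idf = math.log(n / df[name])   # zero for functions seen everywhere
                sig[name] = tf * idf
        signatures.append(sig)
    return signatures


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse signature vectors."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Illustrative data: two read-heavy intervals and one network-heavy interval.
intervals = [
    {"vfs_read": 90, "schedule": 10},
    {"vfs_read": 80, "schedule": 20},
    {"tcp_sendmsg": 70, "schedule": 30},
]
sigs = tfidf_signatures(intervals)
sim_read_read = cosine_similarity(sigs[0], sigs[1])
sim_read_net = cosine_similarity(sigs[0], sigs[2])
```

In this toy data the function that appears in every interval gets zero weight, so the two read-heavy intervals come out as near-identical signatures while the network-heavy interval shares nothing with them. The same similarity measure could feed a clustering or nearest-neighbor retrieval step.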

Structured Comparative Analysis of Systems Logs to Diagnose Performance Problems

DISTALYZER is an automated tool that supports developer investigation of performance issues in distributed systems; it uses machine learning techniques to compare system behaviors extracted from the logs and automatically infers the strongest associations between system components and performance.

Learning, indexing, and diagnosing network faults

This work introduces a new class of indexable fault signatures that encode temporal evolution of events generated by a network fault as well as topological relationships among the nodes where these events occur and presents an efficient learning algorithm to extract such fault signatures from noisy historical event data.

Identifying Recurrent and Unknown Performance Issues

This paper formulates the problem of issue identification as an HMRF-based clustering problem and incorporates the learning of metric discretization thresholds and the optimization of issue clustering, which can achieve accurate identification of recurrent issues and unknown issues.

Use of Incremental Clustering in Clustering of Message Types in a Server Log

This work proposes applying incremental clustering to extract data from the event log according to characteristics provided by the users of the system.

Supporting System-wide Similarity Queries for networked system management

S2Q simplifies many systems management tasks through a simple and intuitive query interface available to operators, and two applications are evaluated in the paper: fast diagnosis of repeated failures in enterprise IT systems, and automated application traffic profiling on computer networks.

System Problem Detection by Mining Console Logs

This work proposes a fully automatic methodology for mining console logs using a combination of program analysis, information retrieval, data mining, and machine learning techniques, and extends the methods to perform online analysis on console log streams.

Log4Perf: suggesting and updating logging locations for web-based systems’ performance monitoring

Log4Perf is an automated approach that suggests where to insert logging statements with the goal of monitoring web-based systems’ CPU usage and can build well-fit statistical performance models, indicating that such models can be leveraged to investigate the influence of locations in the source code on performance.

System State Discovery Via Information Content Clustering of System Logs

This work explores a natural behaviour of system logs where system log data partitioned using source and time information contain correlated message types and demonstrates how the groups of partitions can be found by clustering the partitions based on their entropy-based information content.



Correlating Instrumentation Data to System States: A Building Block for Automated Diagnosis and Control

Experimental results from a testbed show that TAN models involving small subsets of metrics capture patterns of performance behavior in a way that is accurate and yields insights into the causes of observed performance effects.

Ensembles of models for automated diagnosis of system performance problems

This paper shows that the ensemble of models captures the performance behavior of the system accurately under changing workloads and conditions, with results comparable to those produced by an oracle that continuously changes the model based on advance knowledge of the workload.

Pinpoint: problem determination in large, dynamic Internet services

This work presents a dynamic analysis methodology that automates problem determination in these environments by coarse-grained tagging of numerous real client requests as they travel through the system and using data mining techniques to correlate the believed failures and successes of these requests to determine which components are most likely to be at fault.

The Art of Computer Systems Performance Analysis

Raj Jain, Int. CMG Conference, 1990

Performance debugging for distributed systems of black boxes

The goal is to design tools that enable modestly-skilled programmers to isolate performance bottlenecks in distributed systems composed of black-box nodes by developing two very different algorithms for inferring the dominant causal paths through a distributed system from these traces.

Detecting application-level failures in component-based Internet services

Pinpoint is a methodology for automating fault detection in Internet services by observing low-level internal structural behaviors of the service, modeling the majority behavior of the system as correct, and detecting anomalies in these behaviors as possible symptoms of failures.

A survey of fault localization techniques in computer networks

A Microrebootable System - Design, Implementation, and Evaluation

This work uses separation of process recovery from data recovery to enable frequent use of the microreboot, a fine grain recovery mechanism that restarts only suspected faulty application components without disturbing the rest.

httperf—a tool for measuring web server performance

This paper describes httperf, a tool for measuring web server performance. It provides a flexible facility for generating various HTTP workloads and for measuring server performance. The focus of

Data mining: practical machine learning tools and techniques with Java implementations

This presentation discusses the design and implementation of machine learning algorithms in Java, as well as some of the techniques used to develop and implement these algorithms.