Detecting performance anomalies in large-scale software systems using entropy
Today's fault management is characterized by inefficient event management. The events delivered by the managed system frequently describe symptoms of a problem rather than its cause. When a problem occurs in the managed system, e.g. a network failure or misconfigured software, the administrator is often flooded by a burst of more or less meaningless events that indicate only symptoms of the problem. The aim of an event correlator is to reduce the number of events shown to the administrator and to enrich their meaning. Ideally, the event correlator is able to condense the received events into a single event that directly indicates the problem in the managed system. The core part of any event correlation approach is a language for describing how events are (cor)related. This information is a prerequisite for feeding an event correlation engine. Many proposals have been made to address this issue. However, as will be shown, key issues have not yet been addressed in products and research: the correlation language has to deal with very complex and dynamic systems and with widely distributed knowledge. We introduce a graph that describes to the event correlator the dependencies within the managed system, and we outline why dependency graphs are an appropriate representation of correlation knowledge. The work presented here focuses on methods for deriving a dependency graph from information already present in the management system.
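
To make the idea of condensing a burst of symptom events concrete, the following is a minimal illustrative sketch in Python, not the correlation method of this work. It assumes that the dependency graph is acyclic, that it is given as a list of (dependent, antecedent) pairs, and that each event simply names the managed object it reports as failed. The function names and the example objects (web_shop, dns, router, and so on) are hypothetical.

```python
from collections import defaultdict


def build_dependency_graph(edges):
    """Map each managed object to the set of objects it directly depends on.

    `edges` is an iterable of (dependent, antecedent) pairs (assumed format).
    """
    graph = defaultdict(set)
    for dependent, antecedent in edges:
        graph[dependent].add(antecedent)
    return graph


def correlate(graph, failed):
    """Return the failed objects not explained by another failed object.

    A failed object counts as a symptom if it depends, directly or
    transitively, on some other failed object. The graph is assumed
    to be acyclic.
    """
    failed = set(failed)
    roots = set()
    for obj in failed:
        stack = list(graph.get(obj, ()))  # objects obj directly depends on
        seen = set()
        explained = False
        while stack:  # depth-first walk over all antecedents of obj
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            if node in failed and node != obj:
                explained = True
                break
            stack.extend(graph.get(node, ()))
        if not explained:
            roots.add(obj)
    return roots


# Hypothetical managed system: services depend on servers, servers on the network.
edges = [
    ("web_shop", "web_server"),
    ("web_server", "dns"),
    ("web_server", "router"),
    ("mail", "dns"),
    ("dns", "router"),
]
graph = build_dependency_graph(edges)

# A burst of five events is condensed to the single object they all depend on.
burst = ["web_shop", "web_server", "mail", "dns", "router"]
print(correlate(graph, burst))
```

In this sketch all five events in the burst trace back to the router, so only that object is reported. A real correlator would also have to handle cyclic dependencies, timing of events, and dependency information that changes while the system is running, which this example deliberately ignores.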