Adam J. Oliner

Learn More
If we hope to automatically detect and diagnose failures in large-scale computer systems, we must study real deployed systems and the data they generate. Progress has been hampered by the inaccessibility of empirical data. This paper addresses that dearth by examining system logs from five supercomputers, with the aim of providing useful insight and(More)
As the complexity of distributed computing systems increases, systems management tasks require significantly higher levels of automation; examples include diagnosis and prediction based on real-time streams of computer events, setting alarms, and performing continuous monitoring. The core of <i>autonomic computing</i>, a recently proposed initiative towards(More)
We aim to detect and diagnose <i>energy anomalies</i>, abnormally heavy battery use. This paper describes a collaborative black-box method, and an implementation called Carat, for diagnosing anomalies on mobile devices. A client app sends intermittent, coarse-grained measurements to a server, which correlates higher expected energy use with client(More)
Accurate fault detection is a key element of resilient computing. Syslogs provide key information regarding faults, and are found on nearly all computing systems. Discovering new fault types requires expert human effort, however, as no previous algorithm has been shown to localize faults in time and space with an operationally acceptable false positive(More)
We propose a method for identifying the sources of problems in complex production systems where, due to the prohibitive costs of instrumentation, the data available for analysis may be noisy or incomplete. In particular, we may not have complete knowledge of all components and their interactions. We define influences as a class of component interactions(More)
We present Nodeinfo, an unsupervised algorithm for anomaly detection in system logs. We demonstrate Nodeinfo's effectiveness on data from four of the world's most powerful supercomputers: using logs representing over 746 million processor-hours, in which anomalous events called alerts were manually tagged for scoring, we aim to automatically identify the(More)
We aim to detect and diagnose code misbehavior that wastes energy, which we call energy bugs. This paper describes a method and implementation, called Carat, for performing such diagnosis on mobile devices. Carat takes a collaborative, black-box approach. A noninvasive client app sends intermittent, coarse-grained measurements to a server, which identifies(More)
Cooperative checkpointing increases the performance and robustness of a system by allowing checkpoints requested by applications to be dynamically skipped at runtime. A robust system must be more than merely resilient to failures; it must be adaptable and flexible in the face of new and evolving challenges. A simulation-based experimental analysis using(More)