If we hope to automatically detect and diagnose failures in large-scale computer systems, we must study real deployed systems and the data they generate. Progress has been hampered by the inaccessibility of empirical data. This paper addresses that dearth by examining system logs from five supercomputers, with the aim of providing useful insight and ...
Cooperative checkpointing increases the performance and robustness of a system by allowing checkpoints requested by applications to be dynamically skipped at runtime. A robust system must be more than merely resilient to failures; it must be adaptable and flexible in the face of new and evolving challenges. A simulation-based experimental analysis using ...
We present Nodeinfo, an unsupervised algorithm for anomaly detection in system logs. We demonstrate Nodeinfo's effectiveness on data from four of the world's most powerful supercomputers: using logs representing over 746 million processor-hours, in which anomalous events called alerts were manually tagged for scoring, we aim to automatically identify the ...
Accurate fault detection is a key element of resilient computing. Syslogs provide key information regarding faults, and are found on nearly all computing systems. Discovering new fault types requires expert human effort, however, as no previous algorithm has been shown to localize faults in time and space with an operationally acceptable false positive ...
There is little information from independent sources in the public domain about mobile malware infection rates. The only previous independent estimate (0.0009%) [11] was based on indirect measurements obtained from domain-name resolution traces. In this paper, we present the first independent study of malware infection rates and associated risk factors ...
This paper gives an overview of the BlueGene/L Supercomputer. This is a jointly funded research partnership between IBM and the Lawrence Livermore National Laboratory as part of the United States Department of Energy ASCI Advanced Architecture Research Program. Application performance and scaling studies have recently been initiated with partners at a ...
As the complexity of distributed computing systems increases, systems management tasks require significantly higher levels of automation; examples include diagnosis and prediction based on real-time streams of computer events, setting alarms, and performing continuous monitoring. The core of autonomic computing, a recently proposed initiative towards ...
We aim to detect and diagnose energy anomalies, abnormally heavy battery use. This paper describes a collaborative black-box method, and an implementation called Carat, for diagnosing anomalies on mobile devices. A client app sends intermittent, coarse-grained measurements to a server, which correlates higher expected energy use with client ...
Cooperative checkpointing uses global knowledge of the state and health of the machine to improve performance and reliability by dynamically deciding when to skip checkpoint requests made by applications. Using results from cooperative checkpointing theory, this paper proves that periodic checkpointing is not expected to be competitive with the offline ...