What Supercomputers Say: A Study of Five System Logs
This paper examines system logs from five supercomputers with the aim of providing useful insight and direction for future research into the use of such logs, and proposes a simpler and more effective filtering algorithm.
Evaluating the viability of process replication reliability for exascale systems
Results show that state machine replication is a potentially useful technique for meeting the fault tolerance demands of HPC applications on future exascale platforms.
Memory Errors in Modern Systems: The Good, The Bad, and The Ugly
This study uses data from two leadership-class high-performance computer systems to analyze the reliability impact of hardware resilience schemes that are deployed in current systems and finds that counting errors instead of faults, a common practice among researchers and data center operators, can lead to incorrect conclusions about system reliability.
Feng Shui of supercomputer memory: positional effects in DRAM and SRAM faults
A study of DRAM and SRAM faults in large high-performance computing systems to understand the factors that influence faults in production settings. It finds that altitude has a substantial impact on SRAM faults and that top-of-rack placement correlates with a 20% higher fault rate.
Addressing failures in exascale computing
This report, produced by a workshop on ‘Addressing failures in exascale computing’ held in Park City, Utah, 4–11 August 2012, summarizes and builds on the workshop's discussions on resilience.
Towards informatic analysis of syslogs
  • Jon Stearley
  • Computer Science
  • IEEE International Conference on Cluster…
  • 20 September 2004
The author describes the use of the bioinformatic-inspired Teiresias algorithm to automatically classify syslog messages, and compares it to an existing log analysis tool (SLCT), and presents a simple graphical user interface for viewing analysis results.
Alert Detection in System Logs
This work formalizes the alert detection task in these terms, describes how Nodeinfo uses the information entropy of message terms to identify alerts, and presents an online version of this algorithm, which is now in production use.
Bad Words: Finding Faults in Spirit's Syslogs
This work presents experiments on three weeks of syslogs from Sandia's 512-node "Spirit" Linux cluster, showing one algorithm that localizes 50% of faults with 75% precision, corresponding to an excellent false positive rate of 0.05%.
Extra Bits on SRAM and DRAM Errors - More Data from the Field
The results of a field study of DRAM and SRAM faults in Cielo, a leadership-class high-performance computing system located at Los Alamos National Laboratory, show that vendor choice has a significant impact on fault rates and that command and address parity on the DDR channel is beneficial to memory reliability.
Bridging the Gaps: Joining Information Sources with Splunk
This paper describes the experience of applying the Splunk log analysis tool as a vehicle to combine both data and people, and describes the challenges in joining the data and expressing complex queries.