Edward Chuah

Bursts of abnormally high resource use are thought to be an indirect cause of failures in large cluster systems, but little work has systematically investigated the role of high resource usage in system failures, largely because of the lack of a comprehensive resource monitoring tool that resolves resource use by job and node. The recently developed ...
System event logs are often the primary source of information for diagnosing (and predicting) the causes of failures in cluster systems. Due to interactions among the system hardware and software components, the system event logs for large cluster systems consist of streams of interleaved events, and only a small fraction of the events over a small ...
A goal for the analysis of supercomputer logs is to establish causal relationships among events which reflect significant state changes in the system. Establishing these relationships is at the heart of failure diagnosis. In principle, a log analysis tool could automate many of the manual steps systems administrators must currently use to diagnose system ...
The ability to automatically detect faults or fault patterns to enhance system reliability is important for system administrators in reducing system failures. To achieve this objective, the message logs from cluster systems are augmented with failure information, i.e., the raw log data is labelled. However, tagging or labelling of raw log data is very ...
The use of console logs for error detection in large-scale distributed systems has proven useful to system administrators. However, such logs are typically redundant and incomplete, making accurate detection very difficult. In an attempt to increase this accuracy, we complement these incomplete console logs with resource usage data, which captures the ...
Professional Bodies: Member of Board of Engineers Malaysia. Research (any ongoing / completed research project): "Wireless System CoExistence in the Extended C-Band" (Completed). Publications: 1. Service and Broadband Wireless Access interference analysis in the extended C-band", Wireless Days
Large cluster systems are composed of complex, interacting hardware and software components. Components, or the interactions between components, may fail for many different reasons, leading to the eventual failure of executing jobs. This paper investigates an open question about failure diagnosis: What are the characteristics of the errors that lead to ...