Michael P. Kasick

Learn More
We present a behavior-based problem-diagnosis approach for PVFS that analyzes a novel source of instrumentation—CPU instruction-pointer samples and function-call traces—to localize the faulty server and to enable root-cause analysis of the resource at fault. We validate our approach by injecting realistic storage and network problems into three different(More)
Problem Diagnosis and debugging in concurrent environments such as the cloud and popular distributed systems frameworks has been a traditionally hard problem. We explore an evaluation of a novel way of debugging distributed systems frameworks by using system calls. We focus on Google's MapReduce framework, which enables distributed, data-intensive, parallel(More)
We present a syscall-based approach to automatically diagnose performance problems, server-to-client propagated errors, and server crash/hang problems in PVFS. Our approach compares the statistical and semantic attributes of syscalls across PVFS servers in order to diagnose the culprit server, under these problems, for different file-system benchmarks—dd,(More)
In the large-scale Emulab distributed system, the many failure reports make skilled operator time a scarce and costly resource, as shown by statistics on failure frequency and root cause. We describe the lessons learned with error reporting in Emulab, along with the design, initial implementation, and results of a new local error-analysis approach that is(More)
Performance problems account for a significant percentage of documented failures in large-scale distributed systems, such as Hadoop. Localizing the source of these performance problems can be frustrating due to the overwhelming amount of monitoring information available. We automate problem localization using ASDF an online diagnostic framework that(More)
Localizing performance problems (or fingerpointing) is essential for distributed systems such as Hadoop that support long-running, parallelized, data-intensive computations over a large cluster of nodes. Manual fingerpointing does not scale in such environments because of the number of nodes and the number of performance metrics to be analyzed on each node.(More)
Intrepid has a very-large, production GPFS storage system consisting of 128 file servers, 32 storage controllers, 1152 disk arrays, and 11,520 total disks. In such a large system, performance problems are both inevitable and difficult to troubleshoot. We present our experiences, of taking an automated problem diagnosis approach from proof-of-concept on a(More)