Learn More
Rights to individual papers remain with the author or the author's employer. Permission is granted for noncommercial reproduction of the work for educational or research purposes. This copyright notice must be included in the reproduced paper. USENIX acknowledges all trademarks herein.
We describe a methodology that enables the real-time diagnosis of performance problems in complex high-performance distributed systems. The methodology includes tools for generating precision event logs that can be used to provide detailed end-to-end application and system level monitoring; a Java agent-based system for managing the large amount of logging(More)
As the practice of science moves beyond the single investigator due to the complexity of the problems that now dominate science, large collaborative and multi-institutional teams are needed to address these problems. In order to support this shift in science, the computing and data handling infrastructure that is essential to most of modern science must(More)
We have designed, built, and analyzed a distributed parallel storage system that will supply image streams fast enough to permit multi-user, “real-time”, video-like applications in a wide-area ATM network-based Internet environment. We have based the implementation on user-level code in order to secure portability; we have characterized the(More)
Modern scientific computing involves organizing, moving, visualizing, and analyzing massive amounts of data from around the world, as well as employing large-scale computation. The distributed systems that solve large-scale problems will always involve aggregating and scheduling many resources. Data must be located and staged, cache and network capacity(More)