Today's top high performance computing systems run applications with hundreds of thousands of processes, contain hundreds of storage nodes, and must meet massive I/O requirements for capacity and performance. These leadership-class systems face daunting challenges in deploying scalable I/O systems. In this paper we present a case study of the I/O challenges…
Computational science applications are driving a demand for increasingly powerful storage systems. While many techniques are available for capturing the I/O behavior of individual application trial runs and specific components of the storage system, continuous characterization of a production system remains a daunting challenge for systems with hundreds of…
High-performance computing (HPC) and distributed systems rely on a diverse collection of system software to provide application services, including file systems, schedulers, and web services. Such system software services must manage highly concurrent requests, interact with a wide range of resources, and scale well in order to be successful.
In preparation for the Intergovernmental Panel on Climate Change (IPCC) Fifth Assessment Report, the climate community will run the Coupled Model Intercomparison Project phase 5 (CMIP-5) experiments, which are designed to answer crucial questions about future regional climate change and the results of carbon feedback for different mitigation scenarios. The…
In this paper, we propose an approach to improving the I/O performance of an IBM Blue Gene/Q supercomputing system using a novel framework that can be integrated into high performance applications. We take advantage of the system's tremendous computing resources and high interconnection bandwidth among compute nodes to efficiently exploit I/O bandwidth.
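A minimal sketch of the general idea (not the paper's actual framework), assuming an mpi4py environment: data from many compute ranks is routed over the fast interconnect to a small set of aggregator ranks, and only the aggregators issue large, contiguous writes to storage. The aggregator ratio and output file names are illustrative.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    RANKS_PER_AGGREGATOR = 8                  # assumed ratio, for illustration only
    group = comm.Split(rank // RANKS_PER_AGGREGATOR, rank)

    chunk = np.full(1 << 20, rank % 256, dtype=np.uint8)   # 1 MiB of per-rank data

    # Move each group's data across the interconnect to its aggregator (group rank 0).
    gathered = group.gather(chunk, root=0)

    if group.Get_rank() == 0:
        # Only aggregators touch the file system, each writing one large block.
        with open("output.%d.bin" % rank, "wb") as f:
            for buf in gathered:
                f.write(buf.tobytes())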
We examine the I/O behavior of thousands of supercomputing applications "in the wild" by analyzing the Darshan logs of over a million jobs, representing a combined total of six years of I/O behavior across three leading high-performance computing platforms. We mined these logs to analyze the I/O behavior of applications across all their runs on a platform;…
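As a hedged illustration of this kind of cross-run mining, the sketch below assumes the logs have already been reduced to one summary record per job; the column names are hypothetical and are not Darshan's actual counter names.

    import pandas as pd

    # Hypothetical per-job summaries: exe, nprocs, bytes_read, bytes_written, io_time, run_time
    jobs = pd.read_csv("job_summaries.csv")
    jobs["io_fraction"] = jobs["io_time"] / jobs["run_time"]

    # Characterize each application (executable) across all of its runs on the platform.
    per_app = jobs.groupby("exe").agg(
        runs=("exe", "size"),
        tib_written=("bytes_written", lambda b: b.sum() / 2**40),
        median_io_fraction=("io_fraction", "median"),
    )

    # Applications that spend most of their runtime in I/O are candidates for tuning.
    print(per_app.sort_values("median_io_fraction", ascending=False).head(10))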
I/O performance is an increasingly important factor in the productivity of large-scale HPC systems such as Hopper, a 153,216-core Cray XE6 system operated by the National Energy Research Scientific Computing Center. The scientific workload diversity of such systems, however, presents a challenge for I/O performance tuning. Applications vary in terms of data…
High-performance computing (HPC) storage systems rely on access coordination to ensure that concurrent updates do not produce incoherent results. HPC storage systems typically employ pessimistic distributed locking to provide this functionality in cases where applications cannot perform their own coordination. This approach, however, introduces significant…
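The sketch below is a toy, single-process stand-in for pessimistic coordination, not the paper's design: every writer must obtain a lock on the extent it updates before touching the shared object, so conflicting updates are serialized and every update pays the lock-acquisition cost even when conflicts are rare.

    import threading

    class LockManager:
        """Toy stand-in for a distributed lock service: one lock per extent."""
        def __init__(self):
            self._locks = {}
            self._table_lock = threading.Lock()

        def acquire(self, extent):
            with self._table_lock:
                lock = self._locks.setdefault(extent, threading.Lock())
            lock.acquire()          # blocks until any conflicting holder releases

        def release(self, extent):
            self._locks[extent].release()

    manager = LockManager()
    shared_object = bytearray(4096)

    def writer(rank, extent, payload):
        manager.acquire(extent)     # pessimistic: lock first, then update
        try:
            start, end = extent
            shared_object[start:end] = payload
        finally:
            manager.release(extent)

    threads = [threading.Thread(target=writer, args=(r, (0, 16), bytes([r]) * 16))
               for r in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()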
Fault response strategies are crucial to maintaining performance and availability in HPC storage systems, and the first responsibility of a successful fault response strategy is to detect failures and maintain an accurate view of group membership. This is a nontrivial problem given the unreliable nature of communication networks and other system components.
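A minimal sketch of the basic ingredient, assuming a simple heartbeat scheme (the timeout and member names are illustrative, and this is not the paper's protocol): each process records the last heartbeat it has seen from every peer and suspects peers whose heartbeats lapse.

    import time

    HEARTBEAT_TIMEOUT = 5.0         # seconds without a heartbeat before a member is suspected

    class MembershipView:
        def __init__(self, members):
            self.last_seen = {m: time.monotonic() for m in members}
            self.suspected = set()

        def on_heartbeat(self, member):
            # Record a heartbeat and clear any suspicion of the sender.
            self.last_seen[member] = time.monotonic()
            self.suspected.discard(member)

        def sweep(self):
            # Mark members whose heartbeats have lapsed as suspected failures.
            now = time.monotonic()
            for member, seen in self.last_seen.items():
                if now - seen > HEARTBEAT_TIMEOUT:
                    self.suspected.add(member)
            return sorted(self.suspected)

    view = MembershipView(["node0", "node1", "node2"])
    view.on_heartbeat("node1")
    print(view.sweep())             # peers that missed their heartbeat window (empty at first)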
Distributed object-based storage models are an increasingly popular alternative to traditional block-based or file-based storage abstractions in large-scale storage systems. Object-based storage models store and access data in discrete, byte-addressable containers to simplify data management and cleanly decouple storage systems from underlying hardware…
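A minimal in-memory sketch of that abstraction, purely illustrative and not any particular system's API: objects are discrete containers identified by an ID and read or written at arbitrary byte offsets, with no notion of blocks or directory paths.

    class ObjectStore:
        def __init__(self):
            self._objects = {}                 # object id -> bytearray

        def create(self, oid):
            self._objects.setdefault(oid, bytearray())

        def write(self, oid, offset, data):
            # Write bytes at an arbitrary offset, growing the object if needed.
            obj = self._objects[oid]
            if len(obj) < offset + len(data):
                obj.extend(b"\x00" * (offset + len(data) - len(obj)))
            obj[offset:offset + len(data)] = data

        def read(self, oid, offset, length):
            return bytes(self._objects[oid][offset:offset + length])

    store = ObjectStore()
    store.create("checkpoint-0042")
    store.write("checkpoint-0042", 4096, b"particle data")
    print(store.read("checkpoint-0042", 4096, 13))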