Roadrunner is a 1.38 Pflop/s-peak (double precision) hybrid-architecture supercomputer developed by LANL and IBM. It contains 12,240 IBM PowerXCell 8i processors and 12,240 AMD Opteron cores in 3,060 compute nodes. Roadrunner is the first supercomputer to run Linpack at a sustained speed in excess of 1 Pflop/s. In this paper we present a detailed …
Current disk prefetch policies in major operating systems track access patterns at the level of the file abstraction. While this is useful for exploiting application-level access patterns, file-level prefetching cannot realize the full performance improvements achievable by prefetching. There are two reasons for this. First, certain prefetch opportunities …
In this work we present an initial performance evaluation of Intel's latest, second-generation quad-core processor, Nehalem, and provide a comparison to the first-generation AMD and Intel quad-core processors, Barcelona and Tigerton. Nehalem is the first Intel processor to implement a NUMA architecture incorporating QuickPath Interconnect for interconnecting …
Ever-increasing demand for computing capability is driving the construction of ever-larger computer clusters, typically comprising commodity compute nodes, ranging in size up to thousands of processors, with each node hosting an instance of the operating system (OS). Recent studies [1, 4] have shown that even minimal intrusion by the OS on user …
Checkpoint/restart is a general idea for which particular implementations enable various functionalities in computer systems, including process migration, gang scheduling, hibernation, and fault tolerance. For fault tolerance, in current practice, implementations can be at user-level or system-level. User-level implementations are relatively easy to …
Based on a set of measurements done on the 512-node 500 MHz prototype and early results on a 2048-node 700 MHz BlueGene/L machine at IBM Watson, we present a performance and scalability analysis of the architecture from low-level characteristics to large-scale applications. In addition, we present predictions using our models for the performance of two …
but distant cores need not compete with each other for access to memory. Roadrunner, Jaguar, and Jugene are all petascale systems, which can process 1 × 10^15 floating-point operations per second (1 petaflop/s). Already, the high-performance computing community is investigating the challenges of exascale systems, which, while possibly only six years away, …
Collective I/O is a widely used technique to improve I/O performance in parallel computing. It can be implemented as a client-based or server-based scheme. The client-based implementation is more widely adopted in MPI-IO software such as ROMIO because of its independence from the storage system configuration and its greater portability. However, existing …
The design and implementation of a high-performance communication network are critical factors in determining the performance and cost-effectiveness of a large-scale computing system. The major issues center on the trade-off between the network cost and the impact of latency and bandwidth on application performance. One promising technique for extracting …
A cluster of data servers and a parallel file system are often used to provide high-throughput I/O service to parallel programs running on a compute cluster. To exploit I/O parallelism, parallel file systems stripe file data across the data servers. While this practice is effective in serving asynchronous requests, it may break an individual program's spatial …
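
To make the striping described in the last abstract concrete, here is a minimal sketch of how one contiguous file request could be split across data servers. It assumes a simple round-robin layout with a fixed stripe size; the function name `stripe_request`, its parameters, and the layout itself are illustrative assumptions, not details taken from the paper or from any particular file system.

```python
from typing import List, Tuple


def stripe_request(
    offset: int, length: int, stripe_size: int, num_servers: int
) -> List[Tuple[int, int, int]]:
    """Split a contiguous byte range of a file into per-server pieces.

    Assumes round-robin striping (hypothetical layout): stripe k of the file
    lives on server k % num_servers. Returns a list of
    (server, offset_on_server, length) tuples in file order.
    """
    pieces = []
    end = offset + length
    while offset < end:
        stripe_index = offset // stripe_size   # which stripe of the file
        within = offset % stripe_size          # position inside that stripe
        server = stripe_index % num_servers    # round-robin placement
        # Each server stores every num_servers-th stripe back to back.
        server_offset = (stripe_index // num_servers) * stripe_size + within
        chunk = min(stripe_size - within, end - offset)
        pieces.append((server, server_offset, chunk))
        offset += chunk
    return pieces


if __name__ == "__main__":
    # One 256 KiB sequential read with 64 KiB stripes over 4 servers
    # becomes four separate 64 KiB requests, one per server.
    for piece in stripe_request(0, 256 * 1024, 64 * 1024, 4):
        print(piece)
```

Under these assumptions, a single sequential request from one program is served as several smaller requests on different servers, which is the sense in which striping trades per-program spatial locality for parallelism.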