Roadrunner is a 1.38 Pflop/s-peak (double precision) hybrid-architecture supercomputer developed by LANL and IBM. It contains 12,240 IBM PowerXCell 8i processors and 12,240 AMD Opteron cores in 3,060 compute nodes. Roadrunner is the first supercomputer to run Linpack at a sustained speed in excess of 1 Pflop/s. In this paper we present a detailed ...
Clustered systems have become a dominant architecture for scalable high-performance supercomputers. In these large-scale computers, network performance and scalability are as critical as compute-node speed. InfiniBand™ has become a commodity networking solution supporting the stringent latency, bandwidth and scalability requirements of ...
In this work we present an initial performance evaluation of Intel's latest, second-generation quad-core processor, Nehalem, and provide a comparison to the first-generation AMD and Intel quad-core processors, Barcelona and Tigerton. Nehalem is the first Intel processor to implement a NUMA architecture incorporating QuickPath Interconnect for interconnecting ...
Data-driven programming models like MapReduce have gained popularity in large-scale data processing. Although great efforts through the Hadoop implementation and framework decoupling (e.g. YARN, Mesos) have allowed Hadoop to scale to tens of thousands of commodity cluster processors, the centralized designs of the resource manager, task scheduler and ...
Large-scale server deployments in the commercial internet space have been using group-based protocols such as peer-to-peer and gossip to coordinate services and data across globally distributed data centers. Here we look at applying these methods, which are themselves derived from early work in distributed systems, to large-scale, tightly-coupled ...
As compute nodes increase in parallelism, existing intra-node locking and synchronization primitives need to be scalable, fast, and power efficient. Most parallel runtime systems try to find a balance between these properties during synchronization by fine-tuned spin-waiting and processor yielding to the OS. Unfortunately, the code path followed by the OS ...
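The spin-then-yield balance mentioned in the abstract above can be illustrated with a minimal Python sketch. It is not the paper's mechanism: the function name `acquire_spin_then_yield` and the `spin_limit` parameter are hypothetical, and `time.sleep(0)` is used only as a portable stand-in for handing the processor back to the OS.

```python
import threading
import time

def acquire_spin_then_yield(lock, spin_limit=100):
    """Acquire `lock` by spinning briefly, then yielding to the OS scheduler.

    `spin_limit` (a hypothetical tuning knob) bounds how many failed
    non-blocking attempts are made before the thread gives up its time slice.
    """
    spins = 0
    while not lock.acquire(blocking=False):
        spins += 1
        if spins >= spin_limit:
            time.sleep(0)  # yield the processor instead of burning cycles
            spins = 0

def run_workers(num_threads=4, increments=1000):
    """Increment a shared counter from several threads under the lock."""
    lock = threading.Lock()
    counter = [0]

    def worker():
        for _ in range(increments):
            acquire_spin_then_yield(lock)
            try:
                counter[0] += 1
            finally:
                lock.release()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]

result = run_workers()
```

A small `spin_limit` favors low power and fairness at the cost of wake-up latency; a large one does the opposite, which is the trade-off the abstract refers to.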
Load balancing techniques (e.g. work stealing) are important for obtaining the best performance in distributed task scheduling systems that have multiple schedulers making scheduling decisions. In work stealing, tasks are randomly migrated from heavily loaded schedulers to idle ones. However, for data-intensive applications where tasks are dependent and task ...
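As a rough illustration of the random work stealing described in the abstract above, the following Python sketch models each scheduler as a task queue. The steal-half policy, the function name `steal_round`, and the queue layout are assumptions made for the example; the abstract only states that tasks migrate randomly from loaded schedulers to idle ones.

```python
import random
from collections import deque

def steal_round(queues, rng):
    """Run one round of random work stealing over per-scheduler queues.

    Each idle scheduler (empty queue) picks a random victim holding more
    than one task and takes half of the victim's tasks (rounded down) from
    the tail of its queue. Returns the number of tasks migrated.
    """
    moved = 0
    for thief, q in enumerate(queues):
        if q:
            continue  # only idle schedulers steal
        victims = [i for i, v in enumerate(queues) if i != thief and len(v) > 1]
        if not victims:
            continue  # nothing worth stealing this round
        victim = queues[rng.choice(victims)]
        for _ in range(len(victim) // 2):
            q.append(victim.pop())
            moved += 1
    return moved

# One overloaded scheduler and three idle ones (illustrative data).
queues = [deque(range(8)), deque(), deque(), deque()]
rng = random.Random(0)
while steal_round(queues, rng):
    pass  # repeat until no scheduler steals anything
```

This sketch ignores task dependencies and data locality, which are the complications the abstract goes on to raise for data-intensive applications.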
With the exponential growth in supercomputer parallelism, applications are growing more diverse, including traditional large-scale HPC MPI jobs and ensemble workloads such as finer-grained many-task computing (MTC) applications. Delivering high throughput and low latency for both workloads requires developing a distributed job management system that is ...
This work provides a performance analysis of three leading supercomputers that have recently been deployed: Purple, Red Storm and Blue Gene/L. These machines are architecturally diverse, with very different performance characteristics. Each contains over 10,000 processors and has a system peak of over 40 Teraflops. We analyze each system using a ...