Learn More
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to(More)
Live migration is a widely used technique for resource consolidation and fault tolerance. KVM and Xen use iterative pre-copy approaches which work well in practice for commercial applications. In this paper, we study pre-copy live migration of MPI and OpenMP scientific applications running on KVM and present a detailed performance analysis of the migration(More)
In this paper we characterize the behavior with respect to memory locality management of scientific computing applications running in virtualized environments. NUMA locality on current solutions (KVM and Xen) is enforced by pinning virtual machines to CPUs and providing NUMA aware allocation in hyper visors. Our analysis shows that due to two-level memory(More)
Scalability of applications on distributed shared-memory (DSM) multiprocessors is limited by communication overheads. At some point, using more processors to increase parallelism yields diminishing returns or even degrades performance. When increasing concurrency is futile, we propose an additional mode of execution, called slipstream mode, that instead(More)
The next decade of high-performance computing (HPC) systems will see a rapid evolution and divergence of multi-and manycore architectures as power and cooling constraints limit increases in microprocessor clock speeds. Understanding efficient optimization methodologies on diverse multicore designs in the context of demanding numerical methods is one of the(More)
The gyrokinetic Particle-in-Cell (PIC) method is a critical computational tool enabling petascale fusion simulation research. In this work, we present novel multi- and manycore-centric optimizations to enhance performance of GTC, a PIC-based production code for studying plasma microturbulence in tokamak devices. Our optimizations encompass all six GTC(More)
Computing the actions of Wilson-Dirac operator contributes most of the CPU time for the grand challenge problem of simulating Lattice Quantum Chromodynamics (Lattice QCD). This routine exhibits many challenges in implementation on most computational environments because of the multiple patterns of accessing the same data, making it difficult to align the(More)
Keywords: Lattice QCD calculations Graphic processing unit (GPU) General-purpose computing on graphics hardware SIMD computer architecture Data-parallel computing High-performance computing Domain specific systems a b s t r a c t Simulation time for the classical problem of Lattice Quantum Chromodynamics (Lattice QCD) is dominated by one kernel routine(More)
Efficient communication is a requirement for application scalability on High Performance Computing systems. In this paper we argue for incorporating proactive congestion avoidance mechanisms into the design of communication layers on manycore systems. This is in contrast with the status quo which employs a reactive approach, \emph{e.g.} congestion control(More)