Keywords: Lattice QCD calculations; Graphic processing unit (GPU); General-purpose computing on graphics hardware; SIMD computer architecture; Data-parallel computing; High-performance computing; Domain specific systems. Abstract: Simulation time for the classical problem of Lattice Quantum Chromodynamics (Lattice QCD) is dominated by one kernel routine(More)
Live migration is a widely used technique for resource consolidation and fault tolerance. KVM and Xen use iterative pre-copy approaches which work well in practice for commercial applications. In this paper, we study pre-copy live migration of MPI and OpenMP scientific applications running on KVM and present a detailed performance analysis of the migration(More)
The next decade of high-performance computing (HPC) systems will see a rapid evolution and divergence of multi- and manycore architectures as power and cooling constraints limit increases in microprocessor clock speeds. Understanding efficient optimization methodologies on diverse multicore designs in the context of demanding numerical methods is one of the(More)
In this paper we characterize the memory locality management behavior of scientific computing applications running in virtualized environments. NUMA locality on current solutions (KVM and Xen) is enforced by pinning virtual machines to CPUs and providing NUMA-aware allocation in hypervisors. Our analysis shows that due to two-level memory(More)
Scalability of applications on distributed shared-memory (DSM) multiprocessors is limited by communication overheads. At some point, using more processors to increase parallelism yields diminishing returns or even degrades performance. When increasing concurrency is futile, we propose an additional mode of execution, called slipstream mode, that instead(More)
The gyrokinetic Particle-in-Cell (PIC) method is a critical computational tool enabling petascale fusion simulation research. In this work, we present novel multi- and manycore-centric optimizations to enhance performance of GTC, a PIC-based production code for studying plasma microturbulence in tokamak devices. Our optimizations encompass all six GTC(More)
We report our experiences porting Spark to large production HPC systems. While Spark performance in a data center installation (with local disks) is dominated by the network, our results show that file system metadata access latency can dominate in an HPC installation using Lustre: it determines single-node performance up to 4x slower than a typical(More)
Computing the action of the Wilson-Dirac operator accounts for most of the CPU time in the grand challenge problem of simulating Lattice Quantum Chromodynamics (Lattice QCD). This routine is challenging to implement in most computational environments because the same data is accessed in multiple patterns, making it difficult to align the(More)
Efficient communication is a requirement for application scalability on High Performance Computing systems. In this paper we argue for incorporating proactive congestion avoidance mechanisms into the design of communication layers on manycore systems. This is in contrast with the status quo, which employs a reactive approach, e.g., congestion control(More)