• Corpus ID: 119327548

A performance evaluation of CCS QCD Benchmark on the COMA (Intel(R) Xeon Phi$^{TM}$, KNC) system

  • Taisuke Boku, K.-I. Ishikawa, Yoshinobu Kuramashi, Lawrence Meadows, Michael DMello, Maurice Troute, Ravi Vemuri
  • arXiv: High Energy Physics - Lattice
The most computationally demanding part of Lattice QCD simulations is solving quark propagators. Quark propagators are typically obtained with a linear equation solver utilizing HPC machines. The CCS QCD Benchmark is a benchmark program solving the Wilson-Clover quark propagator, and is developed at the Center for Computational Sciences (CCS), University of Tsukuba. We optimized the benchmark program for an Intel Xeon Phi (Knights Corner, KNC) system named "COMA (PACS-IX)" at CCS Tsukuba under…

Mixed Precision Solver Scalable to 16000 MPI Processes for Lattice Quantum Chromodynamics Simulations on the Oakforest-PACS System

This work has developed a mixed-precision quark solver for a large Intel Xeon Phi (KNL) system named "Oakforest-PACS", employing the O(a)-improved Wilson quarks as the discretized equation of motion.

Practical Implementation of Lattice QCD Simulation on Intel Xeon Phi Knights Landing

  • I. Kanamori, H. Matsufuru
  • Computer Science
    2017 Fifth International Symposium on Computing and Networking (CANDAR)
  • 2017
The performance tuning on KNL as well as the code design for facilitating such tuning on SIMD architecture and massively parallel machines are discussed.

Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices

This work compares the performance for an implementation of the Conjugate Gradient method with CUDA, OpenCL, and OpenACC on NVIDIA Pascal GPUs and tries to answer the question of whether the higher abstraction level of directive based models is inferior to lower level paradigms in terms of performance.

Machines and Algorithms

I discuss the evolution of computer architectures with a focus on QCD and with reference to the interplay between architecture, engineering, data motion and algorithms. New architectures are



Domain Decomposition method on GPU cluster

This work investigates the performance of a quark solver using the restricted additive Schwarz (RAS) preconditioner on a low-cost GPU cluster and finds that the improvement mainly comes from the reduction of the communication bottleneck, as expected.

Lattice QCD with Domain Decomposition on Intel® Xeon Phi Co-Processors

  • S. Heybrock, B. Joó, …, P. Dubey
  • Computer Science
    SC14: International Conference for High Performance Computing, Networking, Storage and Analysis
  • 2014
This work investigates this in the context of Lattice Quantum Chromodynamics and implements an alternative solver algorithm, based on domain decomposition, on Intel® Xeon Phi co-processor (KNC) clusters, that strong-scales to more nodes and reduces the time-to-solution.

HISQ inverter on Intel Xeon Phi and NVIDIA GPUs

This contribution compares the performance of the Intel Xeon Phi to current Kepler-based NVIDIA Tesla GPUs running a conjugate gradient solver and obtains a performance of 250 GFlop/s on both architectures.

Code Optimization on Kepler GPUs and Xeon Phi

This work upgrades the code to use the latest CPS (Columbia Physics System) library along with the most recent QUDA (QCD CUDA) library for lattice QCD to improve the performance of the conjugate gradient (CG) inverter so that it runs twice as fast as before.

Conjugate gradient solvers on Intel Xeon Phi and NVIDIA GPUs

This work compares the performance of the Intel Xeon Phi to current Kepler-based NVIDIA Tesla GPUs running a conjugate gradient solver using the Knights Corner architecture, and obtains a performance greater than 300 GFlop/s on both architectures.

Accelerating Twisted Mass LQCD with QPhiX

The implementation of twisted mass fermion operators for the QPhiX library is presented and it is demonstrated that on the Xeon Phi 7120P the Dslash kernel is able to reach 80% of the theoretical peak bandwidth, while on a Xeon Haswell E5-2630 CPU the generated code for the Dslash operator with AVX2 instructions outperforms the corresponding implementation in the tmLQCD library by a factor of 5.5 in single precision.

Grid: A next generation data parallel C++ QCD library

The motivation, implementation details, and performance of a new physics code base called Grid are discussed; it is intended to be more performant and more general than QDP++, but similar in spirit.

Staggered Dslash Performance on Intel Xeon Phi Architecture

This work tests the performance of CG and dslash, the key step in the CG algorithm, on the Intel Xeon Phi, also known as the Many Integrated Core (MIC) architecture, and tries different parallelization strategies using MPI, OpenMP, and the vector processing units (VPUs).

MILC Staggered Conjugate Gradient Performance on Intel KNL

The work done to optimize the staggered conjugate gradient (CG) algorithm in the MILC code for use with the Intel Knights Landing (KNL) architecture is reviewed.

Adaptive algebraic multigrid on SIMD architectures

The implementation of the Wuppertal adaptive algebraic multigrid code DD-$\alpha$AMG on SIMD architectures is presented, with particular emphasis on the Intel Xeon Phi processor used in QPACE 2.