Can traditional programming bridge the Ninja performance gap for parallel computing applications?

Nadathur Satish, Changkyu Kim, Jatin Chhugani, Hideki Saito, Rakesh Krishnaiyer, Mikhail Smelyanskiy, Milind Girkar, Pradeep K. Dubey. ACM SIGARCH Computer Architecture News, pages 440–451.

Current processor trends of integrating more cores with wider SIMD units, along with a deeper and more complex memory hierarchy, have made it increasingly challenging to extract performance from applications. Some believe that traditional approaches to programming do not apply to these modern processors and that radical new languages must hence be discovered. In this paper, we question this thinking and offer evidence in support of traditional programming methods and the performance-vs…

Towards Enhancing Performance, Programmability, and Portability in Heterogeneous Computing

This chapter presents work with a real-world application of interest to NASA, called Synoptic SARB, to extend the GLAF graphical user interface front-end, as well as the code generation back-end, to facilitate expressing existing data structures found in real-world applications and to enable code generation whose output can seamlessly integrate with pre-existing code.

GPRM : a high performance programming framework for manycore processors

A new task-based parallel reduction model, called the Glasgow Parallel Reduction Machine (GPRM), which provides high performance while maintaining ease of programming, together with a low-overhead mechanism, called “Global Sharing”, that improves performance in multiprogramming situations.

SIMD@OpenMP: a programming model approach to leverage SIMD features

The evaluation on the Intel Xeon Phi coprocessor shows that the SIMD proposal allows the compiler to efficiently vectorize codes poorly or not vectorized automatically with the Intel C/C++ compiler, an important step in the direction of a more common and generalized use of SIMD instructions.

Measuring the Haskell Gap

A subset of the benchmarks studied by Satish et al. is chosen and ported to Haskell, and the performance of these benchmarks, compiled with both the standard Glasgow Haskell Compiler and the experimental Intel Labs Haskell Research Compiler, is measured.

Delivering Parallel Programmability to the Masses via the Intel MIC Ecosystem: A Case Study

This study offers evidence that traditional compiler optimizations can deliver parallel programmability to the masses on the Intel Xeon Phi platform, and observes that identically optimized code on MIC can outperform its CPU counterpart by up to 3.2-fold.

Temporal Vectorization for Stencils

A novel temporal vectorization scheme for stencils that vectorizes the stencil computation in the iteration space and assembles points with different time coordinates in one vector.

Evaluating Auto-Vectorizing Compilers through Objective Withdrawal of Useful Information

This article exhaustively evaluates five industry-grade compilers on four representative vector platforms using a modified version of TSVC and application-level proxy kernels, and formulates a method to objectively supply or withdraw information that would otherwise aid the compiler in the auto-vectorization process.

Accelerating Graph Processing on Large-scale Multicores

This paper proposes a performance predictor paradigm for a heterogeneous parallel architecture where multiple disparate accelerators are integrated in an operational high performance computing setup and aims to improve graph processing efficiency by exploiting the underlying concurrency variations within and across the heterogeneous integrated accelerators.

Scalability analysis of AVX-512 extensions

A scalability and energy-efficiency analysis of AVX-512 is performed using the ParVec benchmark suite to expose the main bottlenecks of the architecture; results show that the performance and energy improvements depend greatly on the fraction of code that can be vectorized.

FAST: fast architecture sensitive tree search on modern CPUs and GPUs

FAST is an extremely fast, architecture-sensitive layout of the index tree, logically organized to optimize for architecture features such as page size, cache-line size, and SIMD width of the underlying hardware; it achieves a 6X performance improvement over uncompressed index search for large keys on CPUs.

Efficient implementation of sorting on multi-core SIMD CPU architecture

An efficient implementation and detailed analysis of MergeSort on current CPU architectures, together with a study of the performance scalability of the proposed sorting algorithm with respect to salient architectural features of modern chip multiprocessor (CMP) architectures, including SIMD width and core count.

3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs

A novel 3.5-D blocking optimization for stencil computations on modern CPUs and GPUs is presented.

Auto-tuning stencil codes for cache-based multicore platforms

This thesis has created an automatic stencil code tuner, or auto-tuner, that incorporates several optimizations into a single software framework, thereby allowing for much greater productivity than hand-tuning.

Lattice Boltzmann Modeling: An Introduction for Geoscientists and Engineers

Lattice Boltzmann models have a remarkable ability to simulate single- and multi-phase fluids and transport processes within them. A rich variety of behaviors, including higher Reynolds numbers…

Can CPUs Match GPUs on Performance with Productivity?: Experiences with Optimizing a FLOP-intensive Application on CPUs and GPU

Performance of a real-world image processing application that uses a cross-correlation algorithm to compare a given image with a reference one is evaluated, demonstrating that performance comparable to that of GPUs can be achieved with much greater productivity on modern multicore CPUs.

The end of denial architecture and the rise of throughput computing

This talk will discuss exploitation of parallelism and locality with examples drawn from the Imagine and Merrimac projects, from NVIDIA GPUs, and from three generations of stream programming systems.

Joint Forces: From Multithreaded Programming to GPU Computing

Using graphics hardware to enhance CPU-based standard desktop applications is a question not only of programming models but also of critical optimizations that are required to achieve true…

Performance Evaluation of Convolution on the Cell Broadband Engine Processor

An in-depth analysis of the convolution algorithm and its complexity is presented in order to develop adequate parallel algorithms; the proposed parallelization approach can be widely adopted by any convolution-based application.

Larrabee: A many-Core x86 architecture for visual computing

  • D. Carmean
  • 2008 IEEE Hot Chips 20 Symposium (HCS), 2008
This article consists of a collection of slides from the author's conference presentation. Some of the topics discussed include: architecture convergence; Larrabee architecture; and graphics pipeline.