ispc: A SPMD compiler for high-performance CPU programming

  • Matt Pharr, William R. Mark
  • Published 13 May 2012
  • Computer Science
  • 2012 Innovative Parallel Computing (InPar)
SIMD parallelism has become an increasingly important mechanism for delivering performance in modern CPUs, due to its power efficiency and relatively low cost in die area compared to other forms of parallelism. […] We have developed a compiler, the Intel® SPMD Program Compiler (ispc), that delivers very high performance on CPUs thanks to effective use of both multiple processor cores and SIMD vector units. ispc draws from GPU programming languages, which have shown that for many applications the…

Writing scalable SIMD programs with ISPC

A performance study of compiling several benchmarks from the domains of computer graphics, financial modeling, and high-performance computing for different vector instruction sets using the Intel SPMD Program Compiler, an alternative to compiler autovectorization of scalar code or handwriting vector code with intrinsics.

Can GPGPU Programming Be Liberated from the Data-Parallel Bottleneck?

A feature-complete version of HPP, including all syntactic constructs, is implemented and runs on top of a task-parallel runtime executing on the CPU; ongoing work includes reducing overhead due to channel management, and the authors plan to make a public version available in the future.

OpenCL Performance Evaluation on Modern Multi Core CPUs

This paper evaluates the performance of OpenCL programs on out-of-order multicore CPUs from the architectural perspective, comparing OpenCL to conventional parallel programming models for CPUs.

Riposte: A trace-driven compiler and parallel VM for vector code in R

Riposte is a new runtime for the R language that uses tracing, a technique commonly used to accelerate scalar code, to dynamically discover and extract sequences of vector operations from arbitrary R code and achieves an overall average speed-up of over 150× without explicit programmer parallelization.

SIMD@OpenMP: a programming model approach to leverage SIMD features

The evaluation on the Intel Xeon Phi coprocessor shows that the SIMD proposal allows the compiler to efficiently vectorize codes that the Intel C/C++ compiler vectorizes poorly or not at all automatically, an important step toward more common and generalized use of SIMD instructions.

Exploiting Automatic Vectorization to Employ SPMD on SIMD Registers

Experimental results reveal that, although the manually-tuned intrinsics code slightly outperforms the SPMD-based column scan, the performance differences are small, and developers may benefit from the advantages of SIMD parallelism through ispc, while supporting arbitrary hardware architectures without hard-to-maintain code.

High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs

This work proposes an alternative approach that automatically translates programs written in one programming model (CUDA) into another (CPU threads) based on Polygeist/MLIR, and includes a representation of parallel constructs that allows conventional compiler transformations to apply transparently and without modification and enables parallelism-specific optimizations.

DynaSOAr: A Parallel Memory Allocator for Object-Oriented Programming on GPUs with Efficient Memory Access

DynaSOAr achieves performance superior to state-of-the-art GPU memory allocators by controlling both memory allocation and memory access, and manages heap memory more efficiently than other allocators, allowing programmers to run up to 2x larger problem sizes with the same amount of memory.

SIMD programming using Intel vector extensions

Decoupled Vector-Fetch Architecture with a Scalarizing Compiler

This thesis presents the Hwacha decoupled vector-fetch architecture as the basis of a new data-parallel machine and reasons through the design decisions while describing its programming model, microarchitecture, and LLVM-based scalarizing compiler that efficiently maps OpenCL kernels to the architecture.



Twin Peaks: A Software Platform for Heterogeneous Computing on General-Purpose and Graphics Processors

Twin Peaks is presented, a software platform for heterogeneous computing that executes code originally targeted for GPUs on CPUs as well, which permits a more balanced execution between the CPU and GPU, and enables portability of code between these architectures and to CPU-only environments.

A performance analysis of the Berkeley UPC compiler

This paper describes a portable open source compiler for UPC, identifies some of the challenges in compiling UPC, and uses a combination of micro-benchmarks and application kernels to show that the compiler has low overhead for basic operations on shared data and is competitive with, and sometimes faster than, the commercial HP compiler.

MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs

A framework called MCUDA is described, which allows CUDA programs to be executed efficiently on shared memory, multi-core CPUs and argues that CUDA can be an effective data-parallel programming model for more than just GPU architectures.

Closing the Ninja Performance Gap through Traditional Programming and Compiler Technology

It is demonstrated that the otherwise uncontrolled growth of the Ninja gap can be contained, offering more stable and predictable performance growth over future architectures and strong evidence that radical language changes are not required.

Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU

This paper discusses optimization techniques for both CPU and GPU, analyzes what architecture features contributed to performance differences between the two architectures, and recommends a set of architectural features which provide significant improvement in architectural efficiency for throughput kernels.

Extending a C-like language for portable SIMD programming

This paper shows how a C-like language can be extended to allow for portable and efficient SIMD programming, and presents a type system and a formal semantics of the extension and proves the soundness of the type system.

Cilk: an efficient multithreaded runtime system

This paper shows that on real and synthetic applications, the “work” and “critical path” of a Cilk computation can be used to accurately model performance, and proves that for the class of “fully strict” (well-structured) programs, the Cilk scheduler achieves space, time and communication bounds all within a constant factor of optimal.

Revisiting SIMD Programming

The design of Cn is based on the concept of the SIMD array type architecture and revisits first principles of designing efficient and portable parallel programming languages.

LLVM: a compilation framework for lifelong program analysis & transformation

The design of the LLVM representation and compiler framework is evaluated in three ways: the size and effectiveness of the representation, including the type information it provides; compiler performance for several interprocedural problems; and illustrative examples of the benefits LLVM provides for several challenging compiler problems.

A single-program-multiple-data computational model for EPEX/FORTRAN