Comparative analysis of coprocessors

@article{Sakdhnagool2019ComparativeAO,
  title={Comparative analysis of coprocessors},
  author={Putt Sakdhnagool and Amit Sabne and Rudolf Eigenmann},
  journal={Concurrency and Computation: Practice and Experience},
  year={2019},
  volume={31}
}
While GPUs have seen a steady increase in usage, Xeon Phis have struggled to prove their value and were eventually discontinued. Is this a matter of the Intel many‐core architecture's younger age, or are there reasons due to specific features? This paper reviews quantitative information addressing these questions. Using two of the latest coprocessors, we evaluate performance and programming productivity across a range of microbenchmarks and applications. We consider productivity as the percentage of…

References


Performance Evaluation of Sparse Matrix Multiplication Kernels on Intel Xeon Phi

This paper investigates the performance of the Xeon Phi coprocessor for sparse linear algebra kernels and shows that Xeon Phi’s sparse kernel performance is very promising and even better than that of cutting-edge CPUs and GPUs.

CUDA-Lite: Reducing GPU Programming Complexity

CUDA-lite, an enhancement to CUDA, is presented, along with preliminary results indicating that auto-generated code can achieve performance comparable to hand-coded versions.

An early performance evaluation of many integrated core architecture based SGI rackable computing system

  • S. Saini, Haoqiang Jin, R. Biswas
  • Computer Science
    2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC)
  • 2013
An early performance evaluation of the Xeon Phi coprocessor based on the Many Integrated Core architecture featuring 60 cores with a peak performance of 1.0 Tflop/s is conducted.

Examining recent many-core architectures and programming models using SHOC

Modifications to the stock SHOC distribution are described and several examples of using the augmented version of SHOC for evaluation of recent heterogeneous architectures and programming models are presented.

hiCUDA: High-Level GPGPU Programming

hiCUDA, a high-level directive-based language for CUDA programming, is designed; it allows programmers to perform tedious tasks in a simpler manner, directly on the sequential code, thus speeding up the porting process.

Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking

This technical report presents the microarchitectural details of the NVIDIA Volta architecture, discovered through microbenchmarks and instruction set disassembly, and compares quantitatively the findings against its predecessors, Kepler, Maxwell and Pascal.

OpenMPC: Extended OpenMP Programming and Tuning for GPUs

  • Seyong Lee, R. Eigenmann
  • Computer Science
    2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
  • 2010
This paper develops a fully automatic compilation and user-assisted tuning system supporting OpenMPC, which builds on OpenMP to provide an abstraction of the complex CUDA programming model and offers high-level controls of the involved parameters and optimizations.

Productive Programming of GPU Clusters with OmpSs

This work presents the implementation of OmpSs for clusters of GPUs, which supports asynchrony and heterogeneity for task parallelism based on annotating a serial application with directives that are translated by the compiler.

Rodinia: A benchmark suite for heterogeneous computing

This characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.

HYDRA: Extending Shared Address Programming for Accelerator Clusters

A fully automatic translation system is presented that generates an MPI + accelerator program from a HYDRA program and ensures scalability of the generated program by optimizing data placement and transfer to and from the limited, discrete memories of accelerator devices.