One OpenCL to rule them all?

@article{Dolbeau2013OneOT,
  title={One OpenCL to rule them all?},
  author={Romain Dolbeau and François Bodin and Guillaume Colin de Verdiere},
  journal={2013 IEEE 6th International Workshop on Multi-/Many-core Computing Systems (MuCoCoS)},
  year={2013},
  pages={1-6}
}
OpenCL is now available on a very large set of processors, which makes the language an attractive layer for addressing multiple targets with a single code base. How sensitive OpenCL code is in practice to the underlying hardware remains to be better understood. This paper studies how realistic it is to use a single OpenCL code for a set of hardware co-processors with different underlying micro-architectures. In this work, we target the Intel® Xeon Phi™, NVIDIA® K20c and AMD® 7970…
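For context, here is a minimal sketch (not code from the paper) of what a single OpenCL code base means in practice: the same kernel source string is compiled at run time for whichever device the runtime reports, be it a Xeon Phi, a K20c or a 7970. Standard OpenCL 1.x host API; error handling is omitted for brevity.

#include <stdio.h>
#include <CL/cl.h>

/* One kernel source, usable unchanged on any OpenCL device. */
static const char *src =
    "__kernel void saxpy(float a, __global const float *x,\n"
    "                    __global float *y) {\n"
    "    size_t i = get_global_id(0);\n"
    "    y[i] = a * x[i] + y[i];\n"
    "}\n";

int main(void) {
    cl_platform_id plat;
    cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    /* CL_DEVICE_TYPE_ALL matches CPUs, GPUs and accelerators alike. */
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_ALL, 1, &dev, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    /* The same source is JIT-compiled for the device found above. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, "", NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "saxpy", NULL);

    float a = 2.0f, x[4] = {1, 2, 3, 4}, y[4] = {1, 1, 1, 1};
    cl_mem bx = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               sizeof x, x, NULL);
    cl_mem by = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                               sizeof y, y, NULL);
    clSetKernelArg(k, 0, sizeof a, &a);
    clSetKernelArg(k, 1, sizeof bx, &bx);
    clSetKernelArg(k, 2, sizeof by, &by);

    size_t n = 4;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, by, CL_TRUE, 0, sizeof y, y, 0, NULL, NULL);
    printf("y[0] = %g\n", y[0]); /* 2*1 + 1 = 3 */
    return 0;
}

Functional portability comes essentially for free this way; the question the paper studies is whether the performance of such a kernel also carries over between micro-architectures.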
Citations

Automatic Generation of Optimized OpenCL Codes Using OCLoptimizer
Presents OCLoptimizer, a tool that automatically generates host code and optimizes OpenCL kernels for each specific target device based on a user-provided configuration file; it can explore different granularities for the problem decomposition as well as different alternatives for the kernel.
Understanding Performance Portability of OpenACC for Supercomputers
Proposes a systematic optimization method, instead of auto-tuning by compilers, that lets OpenACC application developers achieve reasonably portable performance with minor code modifications while using the available OpenACC compilers efficiently and correctly.
Analysis and parameter prediction of compiler transformation for graphics processors
Describes a portable compiler transformation, thread coarsening (sketched below), which increases the amount of work carried out by a single thread running on the GPU, and shows that the speedups given by coarsening are stable for problem sizes larger than a threshold called the saturation point.
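As a rough, generic illustration of thread coarsening (a sketch of the transformation's effect, not the paper's compiler):

/* Baseline: one element per work-item. */
__kernel void scale(__global float *y, float a) {
    size_t i = get_global_id(0);
    y[i] = a * y[i];
}

/* Coarsened by a factor of 4: launch with a quarter of the
 * work-items (global size n / 4, rounded up). The stride keeps
 * neighbouring work-items on neighbouring elements, preserving
 * memory coalescing on GPUs. */
#define COARSEN 4
__kernel void scale_coarsened(__global float *y, float a, uint n) {
    size_t i = get_global_id(0);
    size_t stride = get_global_size(0);
    for (int c = 0; c < COARSEN; ++c, i += stride)
        if (i < n)
            y[i] = a * y[i];
}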
Writing Self-adaptive Codes for Heterogeneous Systems
Explores the development of self-adaptive kernels whose code depends on configuration parameters that are tuned using a genetic algorithm through an iterative optimization process.
Simultaneous multiprocessing in a software-defined heterogeneous FPGA
Investigates how to enhance an existing software-defined framework to reduce overheads and to use all the available CPU cores in parallel with the FPGA hardware accelerators, and introduces two schedulers, Dynamic and LogFit, which distribute the tasks among all the resources in an optimal manner.
Writing a performance-portable matrix multiplication
Presents a performance-portable matrix multiplication with a set of parameters that can be tuned for each device using a genetic algorithm, and compares it to two state-of-the-art adaptive implementations on four different platforms (a parameterized sketch follows below).
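As a hedged sketch of what such tunable parameters can look like (an assumed shape, not the paper's implementation): a tiled matrix multiplication whose tile size TS is injected as a build option, so a tuner, genetic or otherwise, can try a different value per device.

/* C = A * B for n x n matrices. TS is supplied at build time, e.g.
 * clBuildProgram(prog, 1, &dev, "-DTS=16", NULL, NULL); it must
 * divide n, and the kernel is launched with local size (TS, TS). */
__kernel void matmul(__global const float *A, __global const float *B,
                     __global float *C, uint n) {
    __local float As[TS][TS];
    __local float Bs[TS][TS];
    size_t row = get_global_id(1), col = get_global_id(0);
    size_t lr = get_local_id(1), lc = get_local_id(0);
    float acc = 0.0f;
    for (uint t = 0; t < n; t += TS) {
        /* Stage one TS x TS tile of A and of B in local memory. */
        As[lr][lc] = A[row * n + t + lc];
        Bs[lr][lc] = B[(t + lr) * n + col];
        barrier(CLK_LOCAL_MEM_FENCE);
        for (uint kk = 0; kk < TS; ++kk)
            acc += As[lr][kk] * Bs[kk][lc];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    C[row * n + col] = acc;
}

Good values of TS differ across devices: local memory per work-group and preferred work-group sizes vary between, say, a discrete GPU and a CPU, which is exactly what makes per-device tuning pay off.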
Directive-based tile abstraction to distribute loops on accelerators
Presents TileK, a tile abstraction used to generate distributed kernels from nested loops; it provides a high-level abstraction that enables effective and efficient placement of multi-dimensional computations on the 3D topology of accelerators such as graphics processing units (GPUs).
Parallel Programming Models for Heterogeneous Many-Cores: A Survey
A comprehensive survey of parallel programming models for heterogeneous many-core architectures; compiler techniques for improving programmability and portability are reviewed, and various software optimization techniques for minimizing communication overhead are examined.
Workload distribution and balancing in FPGAs and CPUs with OpenCL and TBB
Focuses on streaming applications that can be implemented as a pipeline of stages, and presents an approach that allows the user to specify the mapping of the pipeline stages to the devices (FPGA, GPU or CPU) and the number of active threads. The paper evaluates the performance and energy effectiveness of FPGA and CPU devices for a class of parallel computing applications in which the workload can be distributed in a way that enables…

References

Showing 1–10 of 26 references:
From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming
Evaluates OpenCL as a programming tool for developing performance-portable applications for GPGPU, and proposes the use of auto-tuning with a search harness to better explore the kernels' parameter space (outlined below).
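The core of such a search harness can be sketched as follows. This is a generic outline with assumed names, not the paper's tool: src, ctx, dev and q come from the usual OpenCL setup (as in the earlier sketch), and run_and_time is a hypothetical helper that enqueues the kernel and returns elapsed seconds from profiling events.

/* Rebuild the same kernel with each candidate option string and
 * keep the fastest variant that compiles on this device. */
const char *opts[] = { "-DTS=8", "-DTS=16", "-DTS=32" };
double best = 1e30;
int best_i = -1;
for (int i = 0; i < 3; ++i) {
    cl_program p = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    if (clBuildProgram(p, 1, &dev, opts[i], NULL, NULL) != CL_SUCCESS) {
        clReleaseProgram(p);
        continue; /* variant invalid here, e.g. too much local memory */
    }
    cl_kernel k = clCreateKernel(p, "matmul", NULL);
    double t = run_and_time(q, k); /* hypothetical timing helper */
    if (t < best) { best = t; best_i = i; }
    clReleaseKernel(k);
    clReleaseProgram(p);
}
printf("best build options: %s\n", opts[best_i]);

A genetic or exhaustive tuner is then just a smarter way of generating the opts list.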
Performance Gaps between OpenMP and OpenCL for Multi-core CPUs
Shows that incorrect usage of the multi-core CPUs, the inherently fine-grained parallelism of OpenCL, and immature OpenCL compilers are the main reasons for OpenCL's poorer performance.
Auto-tuning a high-level language targeted to GPU codes
Performs auto-tuning over a large optimization space for GPU kernels, focusing on loop permutation, loop unrolling, tiling, and the choice of which loop(s) to parallelize, with results on convolution kernels, codes from the PolyBench suite, and an implementation of belief propagation for stereo vision.
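An unroll factor is the same kind of build-time knob as the tile size above; a minimal, generic example (not from the paper):

/* UNROLL is supplied at build time, e.g. "-DUNROLL=4". Because the
 * trip count is a compile-time constant, the compiler can fully
 * unroll the inner loop. Launch with ceil(n / UNROLL) work-items. */
__kernel void partial_sums(__global const float *x,
                           __global float *out, uint n) {
    size_t i = get_global_id(0) * UNROLL;
    float s = 0.0f;
    for (int c = 0; c < UNROLL; ++c)
        if (i + c < n)
            s += x[i + c];
    out[get_global_id(0)] = s;
}

Tilings and loop permutations can be exposed the same way, as -D constants the host varies between builds.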
Performance characterization of the NAS Parallel Benchmarks in OpenCL
Experimental results and analysis show that the OpenCL version has different characteristics from the OpenMP version on multicore CPUs, and that it exhibits different performance characteristics depending on the OpenCL compute device used.
An experimental study on performance portability of OpenCL kernels
Investigates how specific code optimizations are to a given accelerator architecture and how severe the resulting lack of performance portability is; functional portability is achieved, which reduces the development time of kernels.
Comparing Hardware Accelerators in Scientific Applications: A Case Study
Shows that OpenCL provides application portability between multicore processors and GPUs, but may incur a performance cost, and illustrates that graphics accelerators can make simulations involving large numbers of particles feasible.
A multi-objective auto-tuning framework for parallel codes
Presents a multi-objective auto-tuning framework comprising compiler and runtime components that enables the runtime system to choose specifically tuned code versions when dynamically adjusting to changing circumstances.
Guest Editor's Introduction: Special Section on Challenges and Solutions in Multicore and Many-Core Computing
Presents nine high-quality contributions to this special issue of Concurrency and Computation, first presented at the Frontiers of GPU, Multi- and Many-Core Systems Workshop, held in conjunction with the 10th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2010).
PARTANS: An autotuning framework for stencil computation on multi-GPU systems
Presents an autotuner that optimizes the distribution of stencil computations across multiple GPUs (illustrated below), and shows that the best strategy depends not only on the stencil operator, problem size, and GPU, but also on the PCI Express layout.
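To make distribution across multiple GPUs concrete, here is a generic sketch (not PARTANS itself) that splits a 1D domain across two command queues. Assumed names: q0 and q1 are queues on two GPUs sharing one context, k is the stencil kernel with its arguments already set, and n is the domain size.

/* Each device processes half of the n points, selected via the
 * global work offset. Halo exchange between the halves is omitted;
 * that traffic is what makes the PCI Express layout matter. */
size_t half = n / 2;
clEnqueueNDRangeKernel(q0, k, 1, NULL,  &half, NULL, 0, NULL, NULL);
clEnqueueNDRangeKernel(q1, k, 1, &half, &half, NULL, 0, NULL, NULL);
clFinish(q0);
clFinish(q1);

How the domain is cut and how halos are exchanged is precisely the search space such an autotuner explores.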
AutoTune: A Plugin-Driven Approach to the Automatic Tuning of Parallel Applications
The AutoTune project extends Periscope, an automatic distributed performance-analysis tool developed by Technische Universität München, with plugins for performance and energy-efficiency tuning, so that it can tune serial and parallel codes for multicore and manycore architectures.