pocl: A Performance-Portable OpenCL Implementation

@article{Jskelinen2014poclAP,
  title={pocl: A Performance-Portable OpenCL Implementation},
  author={Pekka J{\"a}{\"a}skel{\"a}inen and Carlos S. de La Lama and Erik Schnetter and Kalle Raiskila and Jarmo H. Takala and Heikki Berg},
  journal={International Journal of Parallel Programming},
  year={2014},
  volume={43},
  pages={752--785}
}
OpenCL is a standard for parallel programming of heterogeneous systems. The benefits of a common programming standard are clear; multiple vendors can provide support for application descriptions written according to the standard, thus reducing the program porting effort. While the standard brings the obvious benefits of platform portability, the performance portability aspects are largely left to the programmer. The situation is made worse due to multiple proprietary vendor implementations with…
Area Exam: General-Purpose Performance Portable Programming Models for Productive Exascale Computing
Modern supercomputer architectures have grown increasingly complex and diverse since the end of Moore’s law in the mid-2000s, and are far more difficult to program than their predecessors. While HPC…
Augmenting Operating Systems with OpenCL Accelerators
TLDR
KOCL exposes a set of high-level programming interfaces that let Linux kernel module developers offload compute-intensive tasks to different hardware accelerators without managing and coordinating the platform-specific computing and memory resources.
PACXXv2 + RV: An LLVM-based Portable High-Performance Programming Model
TLDR
The high-performance capabilities of PACXXv2 together with RV are demonstrated on benchmarks from well-known benchmark suites, and the performance of the generated code is compared to Intel's OpenCL driver and POCL, the portable OpenCL project based on LLVM.
Compiler and runtime techniques for bulk-synchronous programming models on CPU architectures
TLDR
This dissertation argues and demonstrates that obtaining high performance from executing OpenCL programs on CPU is feasible, and presents compiler and runtime techniques to execute OpenCL programs on CPU architectures.
PPOpenCL: a performance-portable OpenCL compiler with host and kernel thread code fusion
TLDR
Validation against OpenCL benchmarks shows that PPOpenCL (implemented in Clang 3.9.1) can achieve significantly improved portable performance on seven platforms considered.
OpenCL in Scientific High Performance Computing: The Good, the Bad, and the Ugly
TLDR
This work presents experiences with utilising OpenCL alongside C++, MPI, and CMake in two real-world scientific codes, and points out current limitations of OpenCL in the domain of scientific HPC from an application developer's and user's point of view.
The Minos Computing Library: efficient parallel programming for extremely heterogeneous systems
TLDR
The Minos Computing Library is introduced as system software, a programming model, and a programming model runtime that facilitates programming extremely heterogeneous systems and provides performance portability.
A runtime controller for openCL applications on heterogeneous system architectures
TLDR
This work proposes a runtime controller, integrated in the Linux operating system (OS), that optimizes the power efficiency of a running OpenCL application by deciding the system configuration; experimental results show that this controller is able to obtain a power efficiency of more than 90% of the one achievable via offline profiling.
Cross-platform heterogeneous runtime environment
TLDR
A cross-platform heterogeneous runtime environment is designed that provides a high-level, unified execution model coupled with an intelligent resource management facility, and that can achieve scalable performance and application speedup as the number of computing devices increases, without any changes to the program source code.
Efficient kernel synthesis for performance portable programming
TLDR
TANGRAM is a kernel synthesis framework that composes architecture-neutral computations and composition rules into high-performance kernels customized for different architectural hierarchies based on an extensible architectural model that can be used to specify a variety of architectures.

References

OpenCL-based design methodology for application-specific processors
TLDR
The case shows that the use of OpenCL allows producing scalable application-specific processor designs and makes it possible to gradually reach the performance of hand-tailored RTL designs by exploiting the OpenCL extension mechanism to access custom hardware operations of varying complexity.
Customized Exposed Datapath Soft-Core Design Flow with Compiler Support
TLDR
An application-specific processor design toolset that uses a multi-issue exposed data path processor architecture template is presented, and it is shown that a relatively small soft-core tailored with the toolset provides significant speedups on software execution without using any instruction set extensions.
Techniques for efficient placement of synchronization primitives
TLDR
Novel compiler techniques are proposed that use explicit synchronization to parallelize programs that cannot be auto-parallelized, using real codes from the industry-standard SPEC CPU benchmarks, the Linux kernel, and other widely used open source codes.
Improving Performance of OpenCL on CPUs
TLDR
A static analysis and an accompanying optimization are presented that exclude code regions from control-flow to data-flow conversion, the commonly used technique to leverage vector instruction sets, together with a novel technique to implement barrier synchronization.
An OpenCL framework for heterogeneous multicores with local memory
TLDR
The design and implementation of an OpenCL framework that targets heterogeneous accelerator multicore architectures with local memory is presented; it is based on software-managed caches and coherence protocols that guarantee OpenCL memory consistency to overcome the limited size of the local memory.
Whole-function vectorization
TLDR
A language- and platform-independent code transformation is discussed that performs whole-function vectorization on low-level intermediate code given by a control flow graph in SSA form, for data-parallel programming languages on machines with SIMD instruction sets.
MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs
TLDR
A framework called MCUDA is described, which allows CUDA programs to be executed efficiently on shared-memory, multi-core CPUs, and it is argued that CUDA can be an effective data-parallel programming model for more than just GPU architectures.
Synchronization optimizations for efficient execution on multi-cores
TLDR
This paper proposes novel predication-based and other adjunct synchronization optimizations which facilitate exploitation of a higher level of TLP than what can be achieved using the state of the art, and demonstrates the efficacy of the techniques using real codes from the industry-standard SPEC CPU benchmarks and other widely used open source codes.
Twin Peaks: A Software Platform for Heterogeneous Computing on General-Purpose and Graphics Processors
TLDR
Twin Peaks is presented, a software platform for heterogeneous computing that executes code originally targeted for GPUs on CPUs as well, which permits a more balanced execution between the CPU and GPU, and enables portability of code between these architectures and to CPU-only environments.
LLVM: a compilation framework for lifelong program analysis & transformation
  • Chris Lattner, V. Adve
  • Computer Science
  • International Symposium on Code Generation and Optimization, 2004. CGO 2004.
  • 2004
TLDR
The design of the LLVM representation and compiler framework is evaluated in three ways: the size and effectiveness of the representation, including the type information it provides; compiler performance for several interprocedural problems; and illustrative examples of the benefits LLVM provides for several challenging compiler problems.