• Corpus ID: 10777608

PENCIL: Towards a Platform-Neutral Compute Intermediate Language for DSLs

Authors: Riyadh Baghdadi, Albert Cohen, Serge Guelton, Sven Verdoolaege, Jun Inoue, Tobias Grosser, Georgia Kouveli, Alexey Kravets, Anton Lokhmotov, Cedric Nugteren, Fraser Waters, Alastair F. Donaldson
We motivate the design and implementation of a platform-neutral compute intermediate language (PENCIL) for productive and performance-portable accelerator programming. 


PENCIL: A Platform-Neutral Compute Intermediate Language for Accelerator Programming
PENCIL, a rigorously defined subset of GNU C99 enriched with additional language constructs, is presented; it enables compilers to exploit parallelism and produce highly optimized code when targeting accelerators.
VOBLA: a vehicle for optimized basic linear algebra
VOBLA is compiled to PENCIL, a domain-independent intermediate language designed for efficient mapping to accelerator architectures such as GPGPUs; the performance of OpenCL code generated by this compilation flow is evaluated on ARM Mali, AMD Radeon, and AMD Opteron platforms.
Correct and efficient accelerator programming
The aim of this Dagstuhl seminar was to bring together researchers from various sub-disciplines of computer science to brainstorm and discuss the theoretical foundations, design and implementation of techniques and tools for correct and efficient accelerator programming.
Scalable Polyhedral Compilation, Syntax vs. Semantics: 1–0 in the First Round
A family of techniques called offline statement clustering is introduced; it integrates transparently into the flow of a state-of-the-art polyhedral compiler and can reduce scheduling time by a factor of 6 without a significant loss in optimization opportunities.
Algorithmic species revisited: A program code classification based on array references
This work presents a revised theory of algorithmic species that overcomes the limitation to static affine loop nests, together with an extension of this theory named SPECIES+ that provides a more detailed 6-tuple characterisation.
Towards Improving Programmability of Heterogeneous Parallel Architectures
OpenCRun is presented, an OpenCL runtime implementation supporting a range of platforms with very different architectural characteristics, such as x86 multicores and embedded parallel accelerators; a code transformation technique, work-item coalescing, is proposed that bypasses the limitations of the embedded platforms, allowing code developed for GPGPUs to be ported seamlessly.
A compiler for throughput optimization of graph algorithms on GPUs
  • Sreepathi Pai, K. Pingali
  • Computer Science
    Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications
  • 2016
This paper argues that three optimizations, called throughput optimizations, are key to high performance for this application class; these optimizations are implemented in a compiler that produces CUDA code from an intermediate-level program representation called IrGL.
Verification of Loop Parallelisations
A technique based on separation logic is proposed to verify whether a loop can be parallelised, and to show how loop iteration contracts can be compiled into specifications for the code produced by the parallelising compiler.
A Deep Learning Based Cost Model for Automatic Code Optimization
A novel deep-learning-based cost model for automatic code optimization is presented; it enables TIRAMISU to automatically find code transformations that match or outperform state-of-the-art compilers, without the heavy feature engineering those compilers require.
Program Correctness by Transformation
This paper argues that compilation and program transformations should be made annotation-aware, i.e. during compilation and program transformation not only the code but also the corresponding annotations should be changed, so that if the original high-level program could be verified, the resulting low-level program can be verified as well.


A Heterogeneous Parallel Framework for Domain-Specific Languages
A new end-to-end system for building, compiling, and executing DSL applications on parallel heterogeneous hardware, the Delite Compiler Framework and Runtime, is presented, along with results comparing the performance of several machine learning applications written in OptiML.
A domain-specific approach to heterogeneous parallelism
Delite is introduced, a system designed specifically for DSLs that is both a framework for creating an implicitly parallel DSL as well as a dynamic runtime providing automated targeting to heterogeneous parallel hardware.
Deriving Efficient Data Movement from Decoupled Access/Execute Specifications
A framework of C++ classes for decoupled Access/Execute specifications is developed, allowing automatic communication optimisations such as software pipelining and data reuse; the ease and efficiency of programming the Cell Broadband Engine architecture using these classes is demonstrated.
OptiML: An Implicitly Parallel Domain-Specific Language for Machine Learning
OptiML is an implicitly parallel, expressive, and high-performance alternative to MATLAB and C++, and is shown to outperform explicitly parallelized MATLAB code in nearly all cases.
Automatically generating and tuning GPU code for sparse matrix-vector multiplication from a high-level representation
A compiler that generates and tunes code for sparse matrix-vector multiplication (SpMV) on GPUs is developed, and it is shown that the generated code performs similarly to or better than hand-optimized code.
The Landscape of Parallel Computing Research: A View from Berkeley
The parallel landscape is framed with seven questions, and the following recommendations are made to explore the design space rapidly:
  • The overarching goal should be to make it easy to write programs that execute efficiently on highly parallel computing systems.
  • The target should be 1000s of cores per chip, as these chips are built from processing elements that are the most efficient in MIPS (Million Instructions per Second) per watt, MIPS per area of silicon, and MIPS per development dollar.
OP2: An active library framework for solving unstructured mesh-based applications on multi-core and many-core architectures
It is demonstrated that an application written once at a high level using the OP2 API can be easily ported across a wide range of contrasting platforms and can achieve near-optimal performance without intervention from the domain application programmer.
Some efficient solutions to the affine scheduling problem. Part II. Multidimensional time
  • P. Feautrier
  • Computer Science
    International Journal of Parallel Programming
  • 2005
This paper extends the algorithms developed in Part I to cases in which no affine schedule exists, i.e. to problems whose parallel complexity is polynomial but not linear, and gives experimental evidence for the applicability, performance, and limitations of the algorithm.
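A one-line sketch of the idea, in standard polyhedral-model notation (not drawn from this page): a multidimensional affine schedule assigns each statement instance a vector of affine expressions compared lexicographically, and a dependence from instance $u$ of statement $S$ to instance $v$ of statement $T$ is satisfied when the target's time vector is lexicographically greater than the source's:

```latex
\theta_S(i, j) = (i,\; j), \qquad
u \to v \;\Longrightarrow\; \theta_T(v) \succ_{\mathrm{lex}} \theta_S(u)
```

Each extra vector component corresponds to a sequential dimension of time, which is how loop nests with polynomial (rather than linear) parallel complexity can still be scheduled.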