Corpus ID: 6433694

A Tale of Three Runtimes

Nicolas Vasilache, Muthu Manikandan Baskaran, Thomas Henretty, Benoît Meister, Harper Langston, Sanket Tavarageri, Richard A. Lethin
This contribution discusses the automatic generation of event-driven, tuple-space based programs for task-oriented execution models from a sequential C specification. We developed a hierarchical mapping solution using auto-parallelizing compiler technology to target three different runtimes relying on event-driven tasks (EDTs). Our solution benefits from the important observation that loop types encode short, transitive relations among EDTs that are compact and efficiently evaluated at runtime…
Efficient Compilation to Event-Driven Task Programs
This work presents a technique for generating task graphs from a polyhedral representation of a program that is efficient in both compilation time and asymptotic execution time. It also explores the different ways of programming EDTs under each synchronization model and identifies the important sources of overhead associated with them.
Automatic Code Generation for an Asynchronous Task-based Runtime
Hardware scaling considerations associated with the quest for exascale and extreme scale computing are driving system designers to consider event-driven-task (EDT)-oriented execution models for…
Compiling Affine Loop Nests for a Dynamic Scheduling Runtime on Shared and Distributed Memory
This article uses techniques from the polyhedral compiler framework to extract tasks and generate components of the runtime that are used to dynamically schedule the generated tasks, and is also the first automatic scheme that allows for dynamic scheduling of affine loop nests on a cluster of multicores.
Automatic Code Generation and Data Management for an Asynchronous Task-Based Runtime
New capabilities are developed within R-Stream, an automatic source-to-source optimization compiler, for the automatic generation and optimization of code and data management targeting the Open Community Runtime (OCR), an exascale-ready asynchronous task-based runtime.
PIPES: A Language and Compiler for Task-Based Programming on Distributed-Memory Clusters
This work presents a new macro-dataflow programming environment for distributed-memory clusters, based on the Intel Concurrent Collections (CnC) runtime, and introduces a compiler to automatically generate Intel CnC C++ run-time, with key automatic optimizations including task coarsening and coalescing.
The Open Community Runtime: A runtime system for extreme scale computing
The fundamental concepts behind OCR are laid out, OCR performance is compared to that of MPI for two simple benchmarks, and OCR features supporting flexible algorithm expression are compared.
Multigrain Parallelism: Bridging Coarse-Grain Parallel Programs and Fine-Grain Event-Driven Multithreading
A Multigrain Parallel Programming environment is presented that allows programmers to use these well-known coarse-grain constructs to generate a fine-grain multithreaded application to be run on top of a fine-grain event-driven program execution model.
Polyhedral Optimization of TensorFlow Computation Graphs
R-Stream can exploit the optimizations available with R-Stream to generate a highly optimized version of the computation graph, specifically mapped to the targeted architecture, demonstrating its utility in porting neural network computations to parallel architectures.
Runtime Systems Summit


Semi-Automatic Composition of Loop Transformations for Deep Parallelism and Memory Hierarchies
This work leverages algorithmic advances in polyhedral code generation and has been implemented in a modern research compiler, using a semi-automatic optimization approach to demonstrate that current compilers suffer from unnecessary constraints and intricacies that can be avoided in a semantically richer transformation framework.
Logical inference techniques for loop parallelization
This paper presents a fully automatic approach to loop parallelization that integrates the use of static and run-time analysis and thus overcomes many known difficulties such as nonlinear and…
Compiler-assisted dynamic scheduling for effective parallelization of loop nests on multicore processors
This paper develops a completely automatic parallelization approach that transforms input affine sequential codes into efficient parallel codes that execute on a multi-core system in a load-balanced manner, obviating the need for programmer intervention and for rewriting existing algorithms for efficient parallel execution on multi-cores.
Adapting the polyhedral model as a framework for efficient speculative parallelization
A Thread-Level Speculation framework that can speculatively parallelize a sequential loop nest in various ways by rescheduling its iterations, applying a polyhedral model adapted for speculative and runtime code parallelization.
Cilk: An Efficient Multithreaded Runtime System
It is shown that on real and synthetic applications, the “work” and “critical-path length” of a Cilk computation can be used to model performance accurately, and it is proved that for the class of “fully strict” (well-structured) programs, the Cilk scheduler achieves space, time, and communication bounds all within a constant factor of optimal.
Evaluation of mechanisms for fine-grained parallel programs in the J-machine and the CM-5
An abstract machine approach is used to compare the mechanisms of two parallel machines, the J-Machine and the CM-5, and finds that message dispatch is less valuable without atomic operations that allow the scheduling levels to cooperate.
A practical automatic polyhedral parallelizer and locality optimizer
An automatic polyhedral source-to-source transformation framework that can optimize regular programs for parallelism and locality simultaneously, implemented in a tool that automatically generates OpenMP parallel code from C program sections.
Transforming loops to recursion for multi-level memory hierarchies
A new compiler transformation that can be used to convert loop nests into recursive form automatically is presented, and an improved algorithm for transitive dependence analysis is developed that is much faster than the best previously known algorithm in practice.
Oversubscription on multicore processors
This paper evaluates the impact of task oversubscription on the performance of MPI, OpenMP and UPC implementations of the NAS Parallel Benchmarks on UMA and NUMA multi-socket architectures and discusses sharing and partitioning system management techniques.
Qthreads: An API for programming with millions of lightweight threads
The qthread API and its Unix implementation are introduced, resource management is discussed, and performance results from the HPCCG benchmark are presented.