Cooperative kernels: GPU multitasking for blocking algorithms

Tyler Sorensen, Hugues Evrard, Alastair F. Donaldson
Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering
There is growing interest in accelerating irregular data-parallel algorithms on GPUs. These algorithms are typically blocking, so they require fair scheduling. But GPU programming models (e.g. OpenCL) do not mandate fair scheduling, and GPU schedulers are unfair in practice. Current approaches avoid this issue by exploiting scheduling quirks of today's GPUs in a manner that does not allow the GPU to be shared with other workloads (such as graphics rendering tasks). We propose cooperative… 


Slate: Enabling Workload-Aware Efficient Multiprocessing for Modern GPGPUs
Slate, a software-based workload-aware GPU multiprocessing framework, is presented; it enables concurrent kernels from different processes to share GPU devices and improves GPU resource utilization.
Concurrent query processing in a GPU-based database system
A variation of the multi-dimensional knapsack model is constructed to maximize concurrency in a multi-kernel environment, an in-depth analysis of the model is presented, and a dynamic-programming algorithm is developed to solve it.
Inter-workgroup barrier synchronisation on graphics processing units
This thesis includes the following studies: it is shown that the scheduling guarantees of current GPUs can be used to dynamically create an execution environment that allows for a safe and portable global barrier across a subset of the GPU threads.
GPGPU Task Scheduling Technique for Reducing the Performance Deviation of Multiple GPGPU Tasks in RPC-Based GPU Virtualization Environments
This work proposes a GPGPU task scheduling scheme, based on thread division processing, that lets multiple VMs running GPGPU tasks share the GPU evenly in an RPC-based GPU virtualization environment. It divides the threads of a GPGPU task into several groups and controls the execution time of each thread group to prevent any single GPGPU task from monopolizing the GPU for a long time.
Forward progress on GPU concurrency
An overview of work undertaken over the last six years in the Multicore Programming Group at Imperial College London, and with collaborators internationally, related to understanding and reasoning about concurrency in software designed for acceleration on GPUs is provided.
Fast parallel vessel segmentation
Transactions on Petri Nets and Other Models of Concurrency XIII
This paper extends the previous work by means of large-scale statistically-sound experiments that describe the effects and trends of these parameters for different populations of process models and shows that, indeed, there exist parameter configurations that have a significant positive impact on alignment computation efficiency.
GPU acceleration of liver enhancement for tumor segmentation
MCC'2017 - The Seventh Model Checking Contest
The principles and results of the 2017 edition of the Model Checking Contest are presented, which took place along with the Petri Net and ACSD joint conferences in Zaragoza, Spain.


Cooperative Kernels: GPU Multitasking for Blocking Algorithms (Extended Version)
This work describes a prototype cooperative kernel framework implemented in OpenCL 2.0 and evaluates the approach by porting a set of blocking GPU applications to cooperative kernels and examining their performance under multitasking.
Enabling preemptive multiprogramming on GPUs
This paper argues for preemptive multitasking and designs two preemption mechanisms that can be used to implement GPU scheduling policies. It extends an NVIDIA GK110 (Kepler)-like GPU architecture to allow concurrent execution of GPU kernels from different user processes, and implements a scheduling policy that dynamically distributes the GPU cores among concurrently running kernels according to their priorities.
Multitasking Real-time Embedded GPU Computing Tasks
This study highlights the shortcomings of current GPU architectures with regard to running multiple real-time tasks, and recommends new features that would improve scheduling, including hardware priorities, preemption, programmable scheduling, and a common time concept and atomics across the CPU and GPU.
Exploiting Parallelism in Iterative Irregular Maxflow Computations on GPU Accelerators
This paper considers a graph-based maximum flow algorithm that has applications in network optimization problems and shows that the performance of the GPU algorithm far exceeds that of a sequential CPU algorithm.
A study of Persistent Threads style GPU programming for GPGPU workloads
Through micro-kernel benchmarks, it is shown that the Persistent Threads (PT) approach can achieve up to an order-of-magnitude speedup over non-PT kernels, but can also result in performance loss in many cases.
Chimera: Collaborative Preemption for Multitasking on a Shared GPU
Chimera first introduces streaming multiprocessor flushing, which can instantly preempt an SM by detecting and exploiting idempotent execution, and utilizes flushing collaboratively with two previously proposed preemption techniques for GPUs, namely context switching and draining to minimize throughput overhead while achieving a required preemption latency.
A GPU implementation of inclusion-based points-to analysis
This paper describes a high-performance GPU implementation of an important graph algorithm used in compilers such as gcc and LLVM: Andersen-style inclusion-based points-to analysis, which achieves an average speedup of 7x compared to a sequential CPU implementation and outperforms a parallel implementation of the same algorithm running on 16 CPU cores.
Task management for irregular-parallel workloads on the GPU
It is demonstrated that dynamic scheduling and efficient memory management are critical to achieving high efficiency on irregular workloads, and that task-donation is the preferred choice because it performs comparably to task-stealing with lower memory overhead.
Inter-block GPU communication via fast barrier synchronization
S. Xiao, Wu-chun Feng. 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 2010.
This work proposes two approaches for inter-block GPU communication via barrier synchronization: GPU lock-based synchronization and GPU lock-free synchronization. It evaluates the efficacy of each approach via a micro-benchmark as well as three well-known algorithms: Fast Fourier Transform, dynamic programming, and bitonic sort.
A compiler for throughput optimization of graph algorithms on GPUs
Sreepathi Pai, K. Pingali. Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, 2016.
This paper argues that three optimizations, called throughput optimizations, are key to high performance for this application class, and implements them in a compiler that produces CUDA code from an intermediate-level program representation called IrGL.