Improving GPGPU concurrency with elastic kernels

@inproceedings{Pai2013ImprovingGC,
  title={Improving GPGPU concurrency with elastic kernels},
  author={Sreepathi Pai and Matthew J. Thazhuthaveetil and R. Govindarajan},
  booktitle={ASPLOS '13},
  year={2013}
}
Each new generation of GPUs vastly increases the resources available to GPGPU programs. GPU programming models (like CUDA) were designed to scale to use these resources. However, we find that CUDA programs actually do not scale to utilize all available resources, with over 30% of resources going unused on average for programs of the Parboil2 suite that we used in our work. Current GPUs therefore allow concurrent execution of kernels to improve utilization. In this work, we study concurrent… 
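As a point of reference for the mechanism the abstract alludes to, the sketch below (not taken from the paper; kernel bodies, names, and problem sizes are illustrative) shows the standard CUDA idiom for exposing kernel concurrency: independent kernels launched into separate non-default streams, which the hardware may overlap when a single kernel cannot fill the device.

    #include <cuda_runtime.h>

    // Placeholder kernels standing in for two independent pieces of work.
    __global__ void kernelA(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = x[i] * 2.0f;
    }

    __global__ void kernelB(float *y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = y[i] + 1.0f;
    }

    int main() {
        const int n = 1 << 20;
        float *x, *y;
        cudaMalloc(&x, n * sizeof(float));
        cudaMalloc(&y, n * sizeof(float));

        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);

        // Launches in different non-default streams carry no implicit ordering,
        // so the GPU is free to execute them concurrently if resources allow.
        kernelA<<<(n + 255) / 256, 256, 0, s1>>>(x, n);
        kernelB<<<(n + 255) / 256, 256, 0, s2>>>(y, n);

        cudaStreamSynchronize(s1);
        cudaStreamSynchronize(s2);

        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
        cudaFree(x);
        cudaFree(y);
        return 0;
    }

Whether the two kernels actually overlap depends on their per-kernel resource demands and the GPU's scheduling behavior, which is the gap this work examines.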
Slate: Enabling Workload-Aware Efficient Multiprocessing for Modern GPGPUs
TLDR
Slate is presented, a software-based workload-aware GPU multiprocessing framework that enables concurrent kernels from different processes to share GPU devices and improves GPU resource utilization.
Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications
TLDR
It is argued that the GPU memory system should be augmented with application awareness so that requests from different applications can be scheduled in a round-robin (RR) fashion while still preserving the benefits of the current first-ready FCFS (FR-FCFS) memory scheduling policy.
Kernel concurrency opportunities based on GPU benchmarks characterization
TLDR
This work proposes to categorize the kernels of each application in these benchmark suites by multiple criteria, based on their behavior in terms of computation type, memory-hierarchy usage, efficiency, and hardware occupancy, and analyzes kernel-concurrency opportunities.
Simultaneous Multikernel: Fine-Grained Sharing of GPUs
TLDR
Simultaneous Multikernel (SMK) is proposed, a fine-grained dynamic sharing mechanism that fully utilizes resources within a streaming multiprocessor by exploiting heterogeneity of different kernels.
Multi-threaded Kernel Offloading to GPGPU Using Hyper-Q on Kepler Architecture
TLDR
This work investigates the Hyper-Q feature with heterogeneous workloads in which multiple concurrent host threads or processes each offload computations to the GPU, evaluates the resulting performance, and compares it against a kernel-reordering mechanism the authors introduced for the Fermi architecture.
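To illustrate the workload pattern this study targets (a hedged sketch, not the authors' code; the thread count, kernel body, and sizes are assumptions), each host thread below submits work through its own CUDA stream; on Kepler-class GPUs with Hyper-Q, such streams map to independent hardware work queues rather than being funneled through a single queue.

    #include <cuda_runtime.h>
    #include <thread>
    #include <vector>

    // Placeholder kernel standing in for each thread's offloaded computation.
    __global__ void busyKernel(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = data[i] * data[i] + 1.0f;
    }

    // Each host thread uses a private stream, so its launches are not
    // serialized behind the other threads' work in the default stream.
    void hostWorker(float *data, int n) {
        cudaStream_t stream;
        cudaStreamCreate(&stream);
        busyKernel<<<(n + 255) / 256, 256, 0, stream>>>(data, n);
        cudaStreamSynchronize(stream);
        cudaStreamDestroy(stream);
    }

    int main() {
        const int kThreads = 4;   // illustrative number of concurrent host threads
        const int n = 1 << 18;
        std::vector<float *> buffers(kThreads);
        for (auto &buf : buffers) cudaMalloc(&buf, n * sizeof(float));

        std::vector<std::thread> workers;
        for (int t = 0; t < kThreads; ++t)
            workers.emplace_back(hostWorker, buffers[t], n);
        for (auto &w : workers) w.join();

        for (auto buf : buffers) cudaFree(buf);
        return 0;
    }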
Maximizing the GPU resource usage by reordering concurrent kernels submission
TLDR
This work proposes a novel optimization approach that reorders kernel invocations with the goal of maximizing resource utilization, improving average turnaround time and system throughput.
Efficient kernel management on GPUs
  • Xiuhong Li, Yun Liang
  • Computer Science
    2016 Design, Automation & Test in Europe Conference & Exhibition (DATE)
  • 2016
TLDR
A framework is designed that optimizes performance and energy efficiency for multiple-kernel execution on GPUs, and an algorithm is developed to adjust the thread-level parallelism (TLP) of the concurrently executing kernels.
Enabling preemptive multiprogramming on GPUs
TLDR
This paper argues for preemptive multitasking and designs two preemption mechanisms that can be used to implement GPU scheduling policies; it extends an NVIDIA GK110 (Kepler)-like GPU architecture to allow concurrent execution of GPU kernels from different user processes and implements a scheduling policy that dynamically distributes the GPU cores among concurrently running kernels according to their priorities.
Kernelet: High-Throughput GPU Kernel Executions with Dynamic Slicing and Scheduling
TLDR
Kernelet combines transparent memory management and PCIe data-transfer techniques with dynamic slicing and scheduling of kernel executions, and develops a novel Markov chain-based performance model to guide scheduling decisions.
Simultaneous Multikernel GPU: Multi-tasking throughput processors via fine-grained sharing
TLDR
Simultaneous Multikernel (SMK) is proposed, a fine-grained dynamic sharing mechanism that fully utilizes resources within a streaming multiprocessor by exploiting the heterogeneity of different kernels to improve system throughput while maintaining fairness.
...
...

References

Fine-grained resource sharing for concurrent GPGPU kernels
TLDR
KernelMerge provides a concurrent kernel scheduler compatible with the OpenCL API that runs two OpenCL kernels concurrently on one device and outlines a method for using KernelMerge to investigate how concurrent kernels influence each other, with the goal of predicting runtimes for concurrent execution from individual kernel runtimes.
Analyzing CUDA workloads using a detailed GPU simulator
TLDR
Two observations are made: that for the applications the authors study, performance is more sensitive to interconnect bisection bandwidth than to latency, and that, for some applications, running fewer threads concurrently than on-chip resources would otherwise allow can improve performance by reducing contention in the memory system.
Enabling Task Parallelism in the CUDA Scheduler
TLDR
An issue queue is proposed that merges workloads which would individually underutilize GPU processing resources so that they can run concurrently on an NVIDIA GPU; throughput increases in all cases where the GPU would have been underused by a single kernel.
Exploiting concurrent kernel execution on graphic processing units
TLDR
This paper explores techniques to effectively share a context, i.e., context funneling, which can be done either manually at the application level or automatically by the GPU runtime starting from CUDA v4.0.
Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs
TLDR
Techniques for compiling fine-grained SPMD-threaded programs, expressed in programming models such as OpenCL or CUDA, to multicore execution platforms are described, and reasonable restrictions on the synchronization model enable significant optimizations and performance improvements over a baseline approach.
The case for GPGPU spatial multitasking
TLDR
The case is made for a GPU multitasking technique called spatial multitasking, which allows GPU resources to be partitioned among multiple applications simultaneously, showing an average speedup of up to 1.19× over cooperative multitasking when two applications share the GPU.
Kernel Fusion: An Effective Method for Better Power Efficiency on Multithreaded GPU
  • Guibin Wang, Yisong Lin, Wei Yi
  • Computer Science
    2010 IEEE/ACM Int'l Conference on Green Computing and Communications & Int'l Conference on Cyber, Physical and Social Computing
  • 2010
TLDR
Experimental evaluation validates that the proposed kernel fusion method can reduce energy consumption without performance loss for several typical kernels, and an effective method is proposed to reduce shared-memory usage and coordinate the thread spaces of the kernels to be fused.
Supporting GPU sharing in cloud environments with a transparent runtime consolidation framework
TLDR
A framework to enable applications executing within virtual machines to transparently share one or more GPUs is presented and it is found that even when contention is high the consolidation algorithm is effective in improving the throughput, and that the runtime overhead of the framework is low.
Rodinia: A benchmark suite for heterogeneous computing
TLDR
This characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.
Chunking parallel loops in the presence of synchronization
TLDR
A transformation framework is presented that uses a combination of transformations from past work to obtain an equivalent set of parallel loops that chunk together statements from multiple iterations while preserving the semantics of the original parallel program, thereby improving performance and scalability.
...
...