Corpus ID: 306206

Enabling Task Parallelism in the CUDA Scheduler

@inproceedings{Guevara2009EnablingTP,
  title={Enabling Task Parallelism in the CUDA Scheduler},
  author={Marisabel Guevara and Chris Gregg and Kim M. Hazelwood and Kevin Skadron},
  year={2009}
}
General-purpose computing on graphics processing units (GPUs) introduces the challenge of scheduling independent tasks on devices designed for data-parallel or SPMD applications. This paper proposes an issue queue that merges workloads that would underutilize GPU processing resources so that they can run concurrently on an NVIDIA GPU. Using kernels from microbenchmarks and two applications, we show that throughput is increased in all cases where the GPU would have been underused by a single…
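The merging idea described in the abstract above can be sketched as follows. This is purely an illustration, not the authors' implementation: the class, function, and kernel names are hypothetical, and the scheduler is reduced to a single resource dimension (thread blocks). An issue queue greedily pairs pending kernels whose combined block demand still fits on the device, so they could be dispatched concurrently; anything that cannot be paired is issued alone.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Kernel:
    """Hypothetical description of a pending GPU kernel launch."""
    name: str
    blocks: int  # thread blocks the kernel would launch


def schedule(queue: List[Kernel], gpu_blocks: int) -> List[Tuple[Kernel, ...]]:
    """Greedily merge pairs of kernels while their combined block
    count fits on the device; otherwise issue each kernel alone.
    A one-dimensional sketch of an issue queue that merges
    underutilizing workloads."""
    issued: List[Tuple[Kernel, ...]] = []
    pending = list(queue)
    while pending:
        k = pending.pop(0)
        # Find a partner that, together with k, still fits on the GPU.
        partner = next(
            (p for p in pending if k.blocks + p.blocks <= gpu_blocks), None
        )
        if partner is not None:
            pending.remove(partner)
            issued.append((k, partner))  # merged: run concurrently
        else:
            issued.append((k,))          # run alone
    return issued


if __name__ == "__main__":
    q = [Kernel("small_a", 8), Kernel("big", 120), Kernel("small_b", 16)]
    for group in schedule(q, gpu_blocks=128):
        print("+".join(k.name for k in group))
    # → small_a+big
    # → small_b
```

On real hardware this decision would also weigh registers, shared memory, and warp occupancy; the single block count here only conveys the "merge if the combination still fits" policy.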
A Static Task Scheduling Framework for Independent Tasks Accelerated Using a Shared Graphics Processing Unit
TLDR
This paper explores the problem of GPU task scheduling, allowing multiple tasks to share the GPU efficiently and execute in parallel, and develops a multi-tasking execution model as a performance-prediction approach.
Exploiting the Task-Pipelined Parallelism of Stream Programs on Many-Core GPUs
TLDR
Simulation results derived by a cycle-accurate simulator on real-world applications prove that the proposed GPU microarchitecture improves the computing throughput by 18% and reduces the overall accesses to off-chip GPU memory by 13%.
CUsched: multiprogrammed workload scheduling on GPU architectures
TLDR
This paper proposes a set of hardware extensions to the current GPU architectures to efficiently support multi-programmed GPU workloads, allowing concurrent execution of codes from different user processes.
Workload-aware Scheduling Techniques for General Purpose Applications on Graphics Processing Units
TLDR
It is shown that GPU computing workloads have significantly varying characteristics, and that design techniques which monitor the hardware state to aid scheduling at each of the three levels are required to continue making advancements in GPU computing.
On Improving the Performance of Multi-threaded CUDA Applications with Concurrent Kernel Execution by Kernel Reordering
TLDR
A producer-consumer principle approach to manage GPU kernel invocations from within parallel host regions by reordering the respective GPU kernels before actually invoking them is proposed.
Multi-threaded Kernel Offloading to GPGPU Using Hyper-Q on Kepler Architecture
TLDR
This work investigates the Hyper-Q feature within heterogeneous workloads with multiple concurrent host threads or processes offloading computations to the GPU each and evaluates the performance obtained and compares it against a kernel reordering mechanism introduced by the authors for the Fermi architecture.
Enabling preemptive multiprogramming on GPUs
TLDR
This paper argues for preemptive multitasking and designs two preemption mechanisms that can be used to implement GPU scheduling policies; it extends an NVIDIA GK110 (Kepler)-like GPU architecture to allow concurrent execution of GPU kernels from different user processes and implements a scheduling policy that dynamically distributes the GPU cores among concurrently running kernels according to their priorities.
Scenario-Based Execution Method for Massively Parallel Accelerators
  • S. Yamagiwa, Shixun Zhang
  • Computer Science
    2013 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications
  • 2013
TLDR
This paper proposes a novel execution mechanism for accelerators, called scenario-based execution, that drastically improves performance by exploiting the accelerator's potential: the application invokes all program contents on the accelerator side.
Merge or Separate?: Multi-job Scheduling for OpenCL Kernels on CPU/GPU Platforms
TLDR
A machine learning-based predictive model is used at runtime to decide whether to merge OpenCL kernels or schedule them separately to the most appropriate devices, without the need for ahead-of-time profiling.
Effective GPU Sharing Under Compiler Guidance
TLDR
The proposed solution outperforms existing state-of-the-art solutions by leveraging its knowledge about applications’ multiple resource requirements, which include memory as well as SMs, and improves throughput by up to 2.5× for Rodinia benchmarks, and up to 1.7× for Darknet neural networks.

References

Showing 1–10 of 24 references
Accelerating Data-Serial Applications on Data-Parallel GPGPUs: A Systems Approach
TLDR
A highly-efficient software barrier is designed, implemented, and evaluated that synchronizes all the thread blocks running on an offloaded kernel on the GPGPU without having to transfer execution control back to the host processor.
A performance study of general-purpose applications on graphics processors using CUDA
TLDR
This paper uses NVIDIA's C-like CUDA language and an engineering sample of their recently introduced GTX 260 GPU to explore the effectiveness of GPUs for a variety of application types, and describes some specific coding idioms that improve their performance on the GPU.
Brook for GPUs: stream computing on graphics hardware
TLDR
This paper presents Brook for GPUs, a system for general-purpose computation on programmable graphics hardware that abstracts and virtualizes many aspects of graphics hardware, and presents an analysis of the effectiveness of the GPU as a compute engine compared to the CPU.
Program optimization space pruning for a multithreaded gpu
TLDR
The complexity involved in optimizing applications for one highly-parallel system and one relatively simple methodology for reducing the workload involved in the optimization process are shown.
Scalable Programming Models for Massively Multicore Processors
  • M. McCool
  • Computer Science
    Proceedings of the IEEE
  • 2008
TLDR
A range of multicore processor architectures and programming models are surveyed and evaluated with a focus on GPUs and the Cell BE processor, finding that the scalable programming models developed for these processors are also applicable to current and future multicore CPUs.
Merge: a programming model for heterogeneous multi-core systems
TLDR
The Merge framework replaces current ad hoc approaches to parallel programming on heterogeneous platforms with a rigorous, library-based methodology that can automatically distribute computation across heterogeneous cores to achieve increased energy and performance efficiency.
Harmony: an execution model and runtime for heterogeneous many core systems
TLDR
Harmony is proposed: a runtime-supported programming and execution model that provides semantics for simplifying parallelism management, dynamic scheduling of compute-intensive kernels to heterogeneous processor resources, and online, monitoring-driven performance optimization for heterogeneous many-core systems.
Extending the OpenMP Tasking Model to Allow Dependent Tasks
TLDR
An extension is proposed to allow runtime detection of dependencies between generated tasks, broadening the range of applications that can benefit from tasking, or improving performance when load balancing or locality are critical to performance.
Accelerating leukocyte tracking using CUDA: A case study in leveraging manycore coprocessors
TLDR
It is demonstrated how a systems biology application—detection and tracking of white blood cells in video microscopy—can be accelerated by 200× using a CUDA-capable GPU.
A fast high quality pseudo random number generator for graphics processing units
  • W. Langdon
  • Computer Science
    2008 IEEE Congress on Evolutionary Computation (IEEE World Congress on Computational Intelligence)
  • 2008
TLDR
The Park-Miller PRNG is programmed using G80's native Value4f floating point in RapidMind C++ to address limited numerical precision of nVidia GeForce 8800 GTX and other GPUs.