Fine-Grained Synchronizations and Dataflow Programming on GPUs

@inproceedings{Li2015FineGrainedSA,
  title={Fine-Grained Synchronizations and Dataflow Programming on GPUs},
  author={Ang Li and Gert-Jan van den Braak and H. Corporaal and Akash Kumar},
  booktitle={Proceedings of the 29th ACM International Conference on Supercomputing (ICS)},
  year={2015}
}
The last decade has witnessed the blooming emergence of many-core platforms, especially graphics processing units (GPUs). With the exponential growth of cores in GPUs, utilizing them efficiently becomes a challenge. The data-parallel programming model assumes a single instruction stream for multiple concurrent threads (SIMT); therefore, little support is offered to enforce thread ordering and fine-grained synchronizations. This becomes an obstacle when migrating algorithms which exploit fine…
HeteroSync: A benchmark suite for fine-grained synchronization on tightly coupled GPUs
TLDR
This work characterizes the scalability of HeteroSync for different coherence protocols and consistency models on modern, tightly coupled CPU-GPU systems and shows that certain algorithms, coherence protocols, and consistency models scale better than others.
Lock-based synchronization for GPU architectures
TLDR
The proposed locking scheme allows lock stealing within individual warps to avoid concurrency bugs due to the SIMT execution of GPUs, and adopts lock virtualization to reduce the memory cost of fine-grain GPU locks.
Fast Fine-Grained Global Synchronization on GPUs
TLDR
This paper extends the reach of general-purpose GPU programming by presenting a software architecture that supports efficient fine-grained synchronization over global memory and by implementing a scalable and efficient message-passing library.
Lightweight Hardware Transactional Memory for GPU Scratchpad Memory
TLDR
This work proposes GPU-LocalTM as a lightweight and efficient transactional memory (TM) for GPU local memory, which provides speedups from 1.1x up to 100x over serialized critical sections.
GPU performance modeling and optimization
  • A. Li
  • Computer Science
  • 2016
TLDR
This thesis proposes an analytic model for throughput-oriented parallel processors called X, which is visualizable, traceable and portable, while providing a good abstraction for both application designers and hardware architects to understand the performance and motivate potential optimization approaches.
Warp-Consolidation: A Novel Execution Model for GPUs
TLDR
A novel execution model for modern GPUs that hides the CTA execution hierarchy of the classic GPU execution model while exposing the originally hidden warp-level execution, relying on individual warps to undertake the original CTAs' tasks.
Warp Scheduling for Fine-Grained Synchronization
TLDR
Back-Off Warp Spinning (BOWS) is proposed, a hardware warp scheduling policy that extends existing warp scheduling policies to temporarily deprioritize warps executing busy-wait code, together with Dynamic Detection of Spinning (DDOS), a novel hardware mechanism for accurately and efficiently detecting busy-wait synchronization on GPUs.
A Study of Single and Multi-device Synchronization Methods in Nvidia GPUs
TLDR
This work provides an in-depth analysis of the performance considerations and pitfalls of state-of-the-art synchronization methods for Nvidia GPUs, and provides a case study of the commonly used reduction operator to illustrate how the knowledge gained can be useful.
Don't Forget About Synchronization!: A Case Study of K-Means on GPU
TLDR
The experience shows that lock-based solutions to the k-means clustering problem outperform the well-engineered and parallel KMCUDA on both synthetic and real datasets; two guidelines are identified to help make concurrency effective when programming GPU applications.
A deadlock‐free lock‐based synchronization for GPUs
TLDR
This paper discusses various deadlock scenarios that can happen on GPUs, and describes a novel deadlock-free, fine-grained, lock-based synchronization mechanism for GPU architectures that avoids deadlocks without significant overhead.

References

SHOWING 1-10 OF 45 REFERENCES
Efficient Synchronization Primitives for GPUs
TLDR
This paper revisits the design of synchronization primitives (specifically barriers, mutexes, and semaphores) and how they apply to the GPU, and defines an abstraction of GPUs to classify any GPU based on the behavior of its memory system.
Inter-block GPU communication via fast barrier synchronization
  • S. Xiao, Wu-chun Feng
  • Computer Science
  • 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)
  • 2010
TLDR
This work proposes two approaches for inter-block GPU communication via barrier synchronization, GPU lock-based synchronization and GPU lock-free synchronization, and evaluates the efficacy of each approach via a micro-benchmark as well as three well-known algorithms: Fast Fourier Transform, dynamic programming, and bitonic sort.
CudaDMA: Optimizing GPU memory bandwidth via warp specialization
TLDR
This work proposes an approach for programming GPUs with tightly-coupled specialized DMA warps for performing memory transfers between on-chip and off-chip memories, and presents an extensible API, CudaDMA, that encapsulates synchronization and common sequential and strided data transfer patterns.
Synchronization state buffer: supporting efficient fine-grain synchronization on many-core architectures
TLDR
The Synchronization State Buffer is proposed, a scalable architectural design for fine-grain synchronization that efficiently performs synchronization between concurrent threads, recording and managing the states of frequently synchronized data using modest hardware support.
Using Shared Memory to Accelerate MapReduce on Graphics Processing Units
  • Feng Ji, Xiaosong Ma
  • Computer Science
  • 2011 IEEE International Parallel & Distributed Processing Symposium
  • 2011
TLDR
This work designed and implemented a GPU MapReduce framework whose key techniques include shared memory staging area management, thread-role partitioning, and intra-block thread synchronization, and proposes a novel GPU data staging scheme for MapReduce workloads, tailored toward the GPU memory hierarchy.
Performance Modeling of Atomic Additions on GPU Scratchpad Memory
TLDR
This paper presents an exhaustive microbenchmark-based analysis of atomic additions in shared memory that quantifies the impact of access conflicts on latency and throughput and proposes a performance model to estimate the latency penalties due to collisions by position or bank conflicts.
Accelerating Data-Serial Applications on Data-Parallel GPGPUs: A Systems Approach
TLDR
A highly-efficient software barrier is designed, implemented, and evaluated that synchronizes all the thread blocks running on an offloaded kernel on the GPGPU without having to transfer execution control back to the host processor.
Techniques for efficient placement of synchronization primitives
TLDR
Novel compiler techniques are proposed to parallelize, via explicit synchronization, programs which cannot be auto-parallelized, evaluated on real codes, specifically from the industry-standard SPEC CPU benchmarks, the Linux kernel, and other widely used open-source codes.
Atomic-free irregular computations on GPUs
TLDR
This paper presents two high-level methods that exploit algebraic properties of algorithms to elide atomics in irregular programs, and illustrates the generality of the two methods by applying them to five irregular graph applications.
Dynamic Barrier Architecture for Multi-Mode Fine-Grain Parallelism Using Conventional Processors
TLDR
A new hardware barrier architecture is introduced that provides the full DBM functionality, but can be implemented with much simpler hardware and can be used to efficiently support multi-mode moderate-width parallelism with instruction-level granularity.