Adaptive and transparent cache bypassing for GPUs

@inproceedings{Li2015AdaptiveAT,
  title={Adaptive and transparent cache bypassing for GPUs},
  author={Ang Li and Gert-Jan van den Braak and Akash Kumar and H. Corporaal},
  booktitle={SC15: International Conference for High Performance Computing, Networking, Storage and Analysis},
  year={2015},
  pages={1-12}
}
In the last decade, GPUs have come to be widely adopted for general-purpose applications. To capture on-chip locality for these applications, modern GPUs integrate a multilevel cache hierarchy in an attempt to reduce the amount and latency of the massive, sometimes irregular memory accesses. However, inferior performance is frequently attained due to serious congestion in the caches caused by the huge number of concurrent threads. In this paper, we propose a novel compile-time…
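For context on the mechanism the abstract alludes to: on NVIDIA GPUs of that era, cache bypassing is expressible per load instruction via PTX cache operators (ld.global.ca caches in L1 and L2; ld.global.cg bypasses L1 and caches only in L2). The CUDA sketch below is purely illustrative, not the paper's actual framework; the kernel, array names, and the 256-entry lookup table are invented, assuming a hypothetical compile-time pass that marks streaming loads for bypass and reused loads for caching.

// Cached load: ld.global.ca keeps the line in both L1 and L2.
__device__ __forceinline__ float load_cached(const float* addr) {
    float v;
    asm volatile("ld.global.ca.f32 %0, [%1];" : "=f"(v) : "l"(addr));
    return v;
}

// Bypassing load: ld.global.cg skips L1 and caches only in L2,
// leaving L1 capacity to loads with actual reuse.
__device__ __forceinline__ float load_bypass(const float* addr) {
    float v;
    asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(v) : "l"(addr));
    return v;
}

// Hypothetical kernel: a bypass-selection pass might mark the streaming
// input src for bypass and the small, heavily reused table lut (assumed
// to hold 256 entries) for caching.
__global__ void scale_by_lut(const float* src, const float* lut,
                             float* dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = load_bypass(&src[i]) * load_cached(&lut[i & 255]);
}

Because the cache operator travels with each load, a source-to-source tool can toggle the policy per instruction with no hardware changes.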
Reducing Cache Contention on GPUs
TLDR
Caches, which significantly improve CPU performance, have been introduced to GPUs to further enhance application performance, but on GPUs their effect is insignificant in many cases and even detrimental in some.
Intelligent Scheduling and Memory Management Techniques for Modern GPU Architectures
TLDR
An instruction-aware Control Loop Based Adaptive Bypassing algorithm, called Ctrl-C, is developed that can effectively improve cache utilization in GPUs and achieves an average speedup of 1.42x for cache-sensitive GPGPU workloads.
Locality-Aware CTA Clustering for Modern GPUs
TLDR
The concept of CTA-Clustering is proposed, together with associated software-based techniques that reshape the default CTA scheduling to group CTAs with potential reuse on the same SM; these are incorporated into an integrated framework for automatic inter-CTA locality optimization.
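A minimal sketch of one software clustering idea in this spirit: since a physical thread block never migrates between SMs, letting each physical block batch several consecutive logical CTAs makes those CTAs share one SM's L1. The batching scheme, names, and placeholder workload below are assumptions for illustration, not the paper's exact technique.

// Each physical block claims `cluster` consecutive logical CTA ids from
// a global counter; a physical block stays on one SM, so those logical
// CTAs share that SM's L1.
__global__ void clustered_copy(const float* in, float* out,
                               int num_logical, int cluster,
                               unsigned* next) {   // next starts at 0
    __shared__ unsigned base;
    while (true) {
        if (threadIdx.x == 0)
            base = atomicAdd(next, (unsigned)cluster);
        __syncthreads();                    // make base visible to all
        unsigned b = base;
        __syncthreads();                    // all threads have read base
                                            // before thread 0 overwrites it
        if (b >= (unsigned)num_logical)
            return;
        unsigned end = min(b + (unsigned)cluster, (unsigned)num_logical);
        for (unsigned c = b; c < end; ++c) {
            unsigned i = c * blockDim.x + threadIdx.x;  // CTA c's slice
            out[i] = in[i] * 2.0f;          // placeholder per-element work
        }
    }
}

The grid would be sized to the number of concurrently resident blocks (e.g., SM count times blocks per SM), with next zero-initialized before launch.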
Exploring cache bypassing and partitioning for multi-tasking on GPUs
TLDR
This paper proposes cache partitioning combined with cache bypassing as the shared-cache management mechanism for multi-tasking on GPUs, reducing the interference among tasks while preserving the locality of each task.
Contention-Aware Selective Caching to Mitigate Intra-Warp Contention on GPUs
TLDR
A locality- and contention-aware selective caching scheme based on memory access divergence is proposed to mitigate intra-warp resource contention in the L1 data (L1D) cache on GPUs; it outperforms recently published state-of-the-art GPU cache bypassing schemes.
Compiler-Assisted GPU Thread Throttling for Reduced Cache Contention
TLDR
A compiler-assisted thread throttling scheme is proposed that limits the number of active thread groups to reduce cache contention and consequently improve performance; the scheme is evaluated with GPU programs that suffer from cache contention.
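As a rough illustration of the throttling knob (not the paper's compiler pass): with a grid-stride kernel, the host can cap how many blocks run concurrently, which bounds the number of thread groups contending for each SM's L1. The saxpy example and the cap of 2 blocks per SM are arbitrary assumptions.

#include <cuda_runtime.h>

// Grid-stride kernel: the grid size, not the data size, determines how
// many blocks contend for each SM's L1 at any moment.
__global__ void saxpy_throttled(int n, float a, const float* x, float* y) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        y[i] = a * x[i] + y[i];
}

// Cap the number of blocks per SM; blocksPerSM is the knob a throttling
// pass would derive from its contention model (2 is an arbitrary value).
void launch_throttled(int n, float a, const float* x, float* y) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int blocksPerSM = 2;
    saxpy_throttled<<<prop.multiProcessorCount * blocksPerSM, 256>>>(n, a, x, y);
}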
A Quantitative Performance Evaluation of Fast on-Chip Memories of GPUs
  • E. Konstantinidis, Y. Cotronis
  • 2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP), 2016
TLDR
A set of micro-benchmarks is presented that provides effective-bandwidth measurements of the special on-chip memories of GPUs; the peak measurements are validated on real-world problems from the polybench-gpu benchmark suite.
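A hypothetical micro-benchmark kernel in this spirit, assuming shared memory as the target: each thread issues many conflict-free shared-memory reads, and effective bandwidth would be bytes moved divided by measured kernel time (host-side timing and the other on-chip memories are omitted).

// Each of 256 threads performs `iters` rotating, bank-conflict-free
// shared-memory reads; effective bandwidth = 256 * iters * 4 bytes
// divided by measured kernel time.
__global__ void shmem_read_bw(float* sink, int iters) {
    __shared__ float buf[1024];
    int tid = threadIdx.x;
    for (int i = tid; i < 1024; i += blockDim.x)
        buf[i] = (float)i;
    __syncthreads();
    float acc = 0.0f;
    for (int it = 0; it < iters; ++it)
        acc += buf[(tid + it) & 1023];      // stride-1 within a warp
    if (acc == -1.0f)                       // never true here; keeps the
        sink[tid] = acc;                    // loop from being eliminated
}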
The Demand for a Sound Baseline in GPU Memory Architecture Research
Modern GPUs adopt massive multithreading and multi-level cache hierarchies to hide long operation latencies, especially off-chip memory access latencies. However, poor cache indexing and cache line…
Inter-kernel Reuse-aware Thread Block Scheduling
TLDR
This article proposes new hardware thread block schedulers that optimize inter-kernel reuse while using work stealing to preserve load balance, reducing average execution time and energy in both regular and irregular applications.
CUDAAdvisor: LLVM-based runtime profiling for modern GPUs
TLDR
This paper presents CUDAAdvisor, a profiling framework to guide code optimization on modern NVIDIA GPUs that supports GPU profiling across different CUDA versions and architectures, including CUDA 8.0 and the Pascal architecture.

References

SHOWING 1-10 OF 38 REFERENCES
An efficient compiler framework for cache bypassing on GPUs
TLDR
An efficient compiler framework for cache bypassing on GPUs is proposed, with algorithms that judiciously select global load instructions for cache access or bypass.
Locality-Driven Dynamic GPU Cache Bypassing
TLDR
This paper presents a design that integrates locality filtering based on reuse characteristics of GPU workloads into the decoupled tag store of the existing L1 D-cache through simple and cost-effective hardware extensions.
Characterizing and improving the use of demand-fetched caches in GPUs
TLDR
This paper characterizes application performance on GPUs with caches and provides a taxonomy for reasoning about different types of access patterns and locality; it presents an algorithm that can be automated and applied at compile time to identify an application's memory access patterns and use that information to intelligently configure cache usage, improving application performance.
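One concrete, real knob such a compile-time configuration step could drive is the runtime API call cudaFuncSetCacheConfig, which on Fermi/Kepler-class GPUs splits the 64 KB of on-chip memory between L1 and shared memory per kernel. The stencil kernel and the boolean classification below are invented for illustration.

#include <cuda_runtime.h>

__global__ void stencil_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];
}

// A kernel classified as cache-friendly gets the larger L1 split;
// a shared-memory-heavy kernel gets the opposite split.
void configure_for_pattern(bool cache_friendly) {
    cudaFuncSetCacheConfig(stencil_kernel,
                           cache_friendly ? cudaFuncCachePreferL1
                                          : cudaFuncCachePreferShared);
}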
Adaptive Cache Management for Energy-Efficient GPU Computing
TLDR
A specialized cache management policy for GPGPUs is proposed that is coordinated with warp throttling to dynamically control the number of active warps, along with a simple predictor that estimates the optimal number of active warps to take full advantage of the cache space and on-chip resources.
A detailed GPU cache model based on reuse distance theory
TLDR
This work extends reuse distance theory to GPUs by modelling the GPU's hierarchy of threads, warps, thread blocks, and sets of active threads, including conditional and non-uniform latencies, cache associativity, miss-status holding registers, and warp divergence.
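For reference, the base definition the model builds on: the reuse distance of an access is the number of distinct addresses touched since the previous access to the same address. Below is a host-side C++ sketch of that base computation; the GPU-specific extensions in the paper are not modelled here.

#include <cstdint>
#include <cstdio>
#include <list>
#include <unordered_map>
#include <vector>

// Move-to-front stack: an address's depth in the stack at access time
// equals the number of distinct addresses touched since its last use.
int64_t reuse_distance(std::list<uint64_t>& stack,
                       std::unordered_map<uint64_t,
                                          std::list<uint64_t>::iterator>& pos,
                       uint64_t addr) {
    int64_t dist = -1;                      // -1 marks a cold (first) access
    auto it = pos.find(addr);
    if (it != pos.end()) {
        dist = std::distance(stack.begin(), it->second);
        stack.erase(it->second);
    }
    stack.push_front(addr);
    pos[addr] = stack.begin();
    return dist;
}

int main() {
    std::list<uint64_t> stack;
    std::unordered_map<uint64_t, std::list<uint64_t>::iterator> pos;
    std::vector<uint64_t> trace = {0x0, 0x40, 0x80, 0x40, 0x0};
    for (uint64_t a : trace)                // prints -1 -1 -1 1 2
        printf("addr %#llx -> distance %lld\n",
               (unsigned long long)a, (long long)reuse_distance(stack, pos, a));
    return 0;
}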
Managing shared last-level cache in a heterogeneous multicore processor
TLDR
HeLM throttles GPU LLC accesses and yields LLC space to cache-sensitive CPU applications; it outperforms the LRU policy by 12.5% and TAP-RRIP by 5.6% for a processor with 4 CPU and 4 GPU cores.
MRPB: Memory request prioritization for massively parallel processors
TLDR
This paper proposes the memory request prioritization buffer (MRPB), a hardware structure that improves the caching efficiency of massively parallel workloads by applying two prioritization methods, request reordering and cache bypassing, to memory requests before they access a cache.
Analyzing CUDA workloads using a detailed GPU simulator
TLDR
Two observations are made: for the applications the authors study, performance is more sensitive to interconnect bisection bandwidth than to latency; and, for some applications, running fewer threads concurrently than on-chip resources might otherwise allow can improve performance by reducing contention in the memory system.
Improving GPU performance via large warps and two-level warp scheduling
TLDR
This work proposes two independent ideas, the large warp microarchitecture and two-level warp scheduling, which improve performance by 19.1% over traditional GPU cores for a wide variety of general-purpose parallel applications that heretofore have not been able to fully exploit the available resources of the GPU chip.
Cache-Conscious Wavefront Scheduling
TLDR
This paper proposes Cache-Conscious Wavefront Scheduling (CCWS), an adaptive hardware mechanism that uses a novel intra-wavefront locality detector to capture locality that is lost by other schedulers due to excessive contention for cache capacity.