A detailed GPU cache model based on reuse distance theory

@article{Nugteren2014ADG,
  title={A detailed GPU cache model based on reuse distance theory},
  author={Cedric Nugteren and Gert-Jan van den Braak and Henk Corporaal and Henri E. Bal},
  journal={2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)},
  year={2014},
  pages={37-48}
}
As modern GPUs rely partly on their on-chip memories to counter the imminent off-chip memory wall, the efficient use of their caches has become important for performance and energy. However, optimising cache locality systematically requires insight into and prediction of cache behaviour. On sequential processors, stack distance or reuse distance theory is a well-known means to model cache behaviour. However, it is not straightforward to apply this theory to GPUs, mainly because of the parallel…
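As a point of reference for the sequential theory the paper builds on, the following is a minimal illustrative Python sketch (not the paper's GPU model): classic reuse (stack) distance analysis over an address trace. For a fully associative LRU cache with C lines, an access hits exactly when its reuse distance is below C; the function names and toy traces below are invented for illustration.

from collections import OrderedDict

def reuse_distances(trace):
    """Reuse distance of each access; None marks a cold (compulsory) miss."""
    stack = OrderedDict()              # last entry = most recently used line
    distances = []
    for line in trace:
        if line in stack:
            keys = list(stack.keys())
            # distance = number of distinct lines touched since the last use
            distances.append(len(keys) - 1 - keys.index(line))
            del stack[line]
        else:
            distances.append(None)
        stack[line] = True
    return distances

def lru_hit_rate(trace, cache_lines):
    """Predicted hit rate of a fully associative LRU cache of a given size."""
    ds = reuse_distances(trace)
    hits = sum(1 for d in ds if d is not None and d < cache_lines)
    return hits / len(trace)

# Example: a streaming pattern never hits; immediate reuse always hits.
streaming = list(range(64))
short_reuse = [a for x in range(32) for a in (x, x)]
print(lru_hit_rate(streaming, 16), lru_hit_rate(short_reuse, 16))   # 0.0 0.5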
A reuse distance based performance analysis on GPU L1 data cache
  • Dongwei Wang, Weijun Xiao
  • Computer Science
  • 2016 IEEE 35th International Performance Computing and Communications Conference (IPCCC)
  • 2016
TLDR
This work analyzes the memory accesses of twenty benchmarks based on reuse distance theory and quantifies their patterns, finding that most benchmarks either access the cache in a streaming manner or reuse previous cache lines within a short reuse distance.
GPUs Cache Performance Estimation using Reuse Distance Analysis
TLDR
This paper proposes a memory model based on reuse distance to predict the performance of the entire GPU cache hierarchy (L1 and L2 caches); the model is flexible, taking the different cache parameters into account, and can therefore be used for design space exploration and sensitivity analysis.
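Such a reuse-distance model naturally supports this kind of sweep: once the reuse-distance histogram of a trace is known, the predicted hit rate of any candidate cache size can be read off without re-running the workload. The sketch below is a simplified, fully associative LRU approximation with hypothetical helper names, not the proposed model, which also accounts for further cache parameters.

from collections import Counter, OrderedDict

def reuse_distance_histogram(trace):
    """Histogram of finite reuse distances; cold misses are excluded."""
    stack, hist = OrderedDict(), Counter()
    for line in trace:
        if line in stack:
            keys = list(stack.keys())
            hist[len(keys) - 1 - keys.index(line)] += 1
            del stack[line]
        stack[line] = True
    return hist

def hit_rate_curve(trace, cache_sizes):
    """Map candidate cache sizes (in lines) to predicted hit rates."""
    hist, total = reuse_distance_histogram(trace), len(trace)
    return {c: sum(n for d, n in hist.items() if d < c) / total
            for c in cache_sizes}

# Example: sweep a few L1-like capacities for a 32-line working set.
tiled = [a for _ in range(8) for a in range(32)]
print(hit_rate_curve(tiled, [8, 16, 32, 64]))   # hit rate jumps at 32 lines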
Locality Protected Dynamic Cache Allocation Scheme on GPUs
TLDR
A locality-protected cache allocation scheme (LPP) based on instruction PC that makes full use of data locality within the fixed cache capacity to improve GPU performance; evaluation shows that LPP provides up to a 17.8% speedup and an average improvement of 5.5% over the baseline method.
Coordinated static and dynamic cache bypassing for GPUs
The massively parallel architecture enables graphics processing units (GPUs) to boost performance for a wide range of applications. Initially, GPUs employed only scratchpad memory as on-chip memory.
Optimizing Cache Bypassing and Warp Scheduling for GPUs
TLDR
This paper proposes coordinated static and dynamic cache bypassing to improve GPU application performance and develops a bypass-aware warp scheduler that adaptively adjusts the scheduling policy based on cache performance.
RDMKE: Applying Reuse Distance Analysis to Multiple GPU Kernel Executions
TLDR
The present study proposes a framework, called RDMKE (short for Reuse Distance-based profiling in MKEs), to provide a method for analyzing GPU cache memory performance in MKE scenarios, and simulation results of 28 two-kernel executions indicate that RDMKE can properly capture the nonlinear cache behaviors in MKE scenarios.
Adaptive and transparent cache bypassing for GPUs
TLDR
This paper proposes a novel compile-time framework for adaptive and transparent cache bypassing on GPUs that uses a simple yet effective approach to control the bypass degree to match the size of applications' runtime footprints.
An efficient compiler framework for cache bypassing on GPUs
TLDR
An efficient compiler framework for cache bypassing on GPUs is proposed and efficient algorithms that judiciously select global load instructions for cache access or bypass are presented.
The Demand for a Sound Baseline in GPU Memory Architecture Research
TLDR
The results show that advanced cache indexing functions can greatly reduce conflict misses and improve cache efficiency; the allocation-on-fill policy yields better performance than allocation-on-miss; and while performance does not consistently improve with more MSHRs, additional MSHRs can greatly mitigate the problem of memory partition camping.

References

Showing 1-10 of 26 references
Cache Miss Analysis for GPU Programs Based on Stack Distance Profile
TLDR
This paper proposes, for the first time, a cache miss analysis model for GPU programs, based on a deep analysis of the GPU's execution model, and shows that the method is efficient and can be used to guide cache locality optimizations for GPU programs.
An adaptive performance modeling tool for GPU architectures
TLDR
An analytical model is presented to predict the performance of general-purpose applications on a GPU architecture; the model captures full system complexity and shows high accuracy in predicting the performance trends of different optimized kernel implementations.
Neither more nor less: Optimizing thread-level parallelism for GPGPUs
TLDR
This paper proposes a dynamic CTA scheduling mechanism, called DYNCTA, which modulates the TLP by allocating an optimal number of CTAs based on application characteristics in order to minimize resource contention.
Performance Estimation of GPUs with Cache
TLDR
A model to count the number of instructions in a kernel is developed, and a methodology that gives the exact instruction count is found; the count is then used to predict the total execution time.
Analyzing CUDA workloads using a detailed GPU simulator
TLDR
Two observations are made: for the applications the authors study, performance is more sensitive to interconnect bisection bandwidth than to latency, and, for some applications, running fewer threads concurrently than on-chip resources would otherwise allow can improve performance by reducing contention in the memory system.
Cache-Conscious Wavefront Scheduling
TLDR
This paper proposes Cache-Conscious Wavefront Scheduling (CCWS), an adaptive hardware mechanism that makes use of a novel intra-wavefront locality detector to capture locality that is lost by other schedulers due to excessive contention for cache capacity.
An integrated GPU power and performance model
TLDR
An integrated power and performance (IPP) prediction model for a GPU architecture predicts the optimal number of active processors for a given application, and the outcome of IPP is used to control the number of running cores.
Reuse Distance as a Metric for Cache Behavior.
TLDR
The distribution of conflict and capacity misses was measured in the execution of code generated by a state-of-the-art EPIC compiler, and it is observed that some program transformations that enhance parallelism may counter optimizations that reduce capacity misses.
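To make the conflict/capacity distinction concrete, a common accounting (sketched below with invented names, not the cited paper's methodology) treats misses of a fully associative LRU cache as cold or capacity misses, and any additional misses of a set-associative cache of the same total size as conflict misses.

from collections import OrderedDict

def simulate_lru(trace, num_sets, ways):
    """Miss count of a set-associative LRU cache with modulo set indexing."""
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for line in trace:
        s = sets[line % num_sets]
        if line in s:
            s.move_to_end(line)            # refresh LRU position
        else:
            misses += 1
            if len(s) >= ways:
                s.popitem(last=False)      # evict the least recently used line
            s[line] = True
    return misses

trace = [0, 4, 8] * 4                                         # three lines sharing one set
capacity_and_cold = simulate_lru(trace, num_sets=1, ways=8)   # fully associative
all_misses = simulate_lru(trace, num_sets=4, ways=2)          # same 8-line capacity
print("conflict misses:", all_misses - capacity_and_cold)     # -> 9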
Ocelot: A dynamic optimization framework for bulk-synchronous applications in heterogeneous systems
TLDR
Several novel code transformations that are applicable only when compiling explicitly parallel applications are explored, and traditional dynamic compiler optimizations are revisited for this new class of applications, to inform the design of compilation tools for explicitly parallel programming models (such as OpenCL) as well as future CPU and GPU architectures.
Multicore-aware reuse distance analysis
TLDR
This paper shows several methods to keep reuse stacks consistent so that they account for invalidations and cache sharing, either as references arise in a simulated execution or at synchronization points, and shows that adding multicore-awareness substantially improves the ability of reuse distance analysis to model cache behavior.