A detailed GPU cache model based on reuse distance theory
@article{Nugteren2014ADG,
  title   = {A detailed GPU cache model based on reuse distance theory},
  author  = {Cedric Nugteren and Gert-Jan van den Braak and Henk Corporaal and Henri E. Bal},
  journal = {2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)},
  year    = {2014},
  pages   = {37-48}
}
As modern GPUs rely partly on their on-chip memories to counter the imminent off-chip memory wall, the efficient use of their caches has become important for performance and energy. However, optimising cache locality systematically requires insight into and prediction of cache behaviour. On sequential processors, stack distance or reuse distance theory is a well-known means to model cache behaviour. However, it is not straightforward to apply this theory to GPUs, mainly because of the parallel…
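To make the core idea concrete, here is a minimal sketch, assuming a fully associative LRU cache and a fixed line size, of how reuse distances are computed from a memory access trace; the constant `LINE_BYTES` and the function name are illustrative, not taken from the paper's model.

```python
# Reuse (stack) distance sketch: for each access, count the number of
# distinct cache lines touched since the previous access to the same line.
from collections import OrderedDict

LINE_BYTES = 128  # illustrative GPU cache line size

def reuse_distances(addresses):
    """Yield one reuse distance per access; float('inf') marks a
    first-time (cold) access."""
    stack = OrderedDict()  # LRU stack: most recently used line is last
    for addr in addresses:
        line = addr // LINE_BYTES
        if line in stack:
            keys = list(stack)
            # Distinct lines accessed since the last touch of `line`.
            dist = len(keys) - 1 - keys.index(line)
            del stack[line]
        else:
            dist = float('inf')
        stack[line] = None  # move (or insert) to most-recently-used
        yield dist

# In a fully associative LRU cache with capacity C lines, an access
# hits iff its reuse distance is < C.
trace = [0, 128, 256, 0, 512, 128]
print(list(reuse_distances(trace)))  # [inf, inf, inf, 2, inf, 3]
```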
107 Citations
A reuse distance based performance analysis on GPU L1 data cache
- Computer Science
- 2016 IEEE 35th International Performance Computing and Communications Conference (IPCCC)
- 2016
This work analyzes the memory accesses of twenty benchmarks based on reuse distance theory and quantifies their patterns, finding that most benchmarks either access the cache in a streaming manner or reuse a previous cache line within a short reuse distance.
GPUs Cache Performance Estimation using Reuse Distance Analysis
- Computer Science
- 2019 IEEE 38th International Performance Computing and Communications Conference (IPCCC)
- 2019
This paper proposes a memory model based on reuse distance to predict the performance of the entire cache hierarchy (L1 and L2 caches) in GPUs; the model is flexible in that it takes different cache parameters into account, so it can be used for design space exploration and sensitivity analysis.
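As a hedged illustration of how reuse-distance models account for cache parameters, the sketch below converts a reuse distance histogram into an estimated miss rate for a fully associative LRU cache; the histogram and the cache configuration are made-up example values, not results from the paper.

```python
# Estimate an LRU miss rate from a reuse distance histogram: under
# fully associative LRU, an access misses iff its reuse distance is
# at least the cache capacity in lines.

def miss_rate(histogram, cache_lines):
    """histogram: reuse distance -> access count; float('inf') entries
    are cold accesses, which always miss."""
    total = sum(histogram.values())
    misses = sum(count for dist, count in histogram.items()
                 if dist >= cache_lines)
    return misses / total

hist = {2: 500, 40: 300, 1000: 100, float('inf'): 100}
# 16 KB cache with 128-byte lines -> 128 lines of capacity.
print(miss_rate(hist, 16 * 1024 // 128))  # 0.2
```

Changing `cache_lines` re-evaluates the same histogram for a different cache size, which is what makes this style of model cheap for design space exploration.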
Locality Protected Dynamic Cache Allocation Scheme on GPUs
- Computer Science
- 2016 IEEE Trustcom/BigDataSE/ISPA
- 2016
A locality-protected cache allocation scheme (LPP) based on instruction PC that makes full use of data locality within the fixed cache capacity to improve GPU performance; experiments show that LPP provides up to a 17.8% speedup and an average improvement of 5.5% over the baseline.
Coordinated static and dynamic cache bypassing for GPUs
- Computer Science
- 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)
- 2015
The massively parallel architecture enables graphics processing units (GPUs) to boost performance for a wide range of applications. Initially, GPUs employed only scratchpad memory as on-chip memory…
Optimizing Cache Bypassing and Warp Scheduling for GPUs
- Computer Science
- IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
- 2018
This paper proposes coordinated static and dynamic cache bypassing to improve GPU application performance, and develops a bypass-aware warp scheduler that adaptively adjusts the scheduling policy based on cache performance.
RDMKE: Applying Reuse Distance Analysis to Multiple GPU Kernel Executions
- Computer Science
- J. Circuits Syst. Comput.
- 2019
The present study proposes a framework, called RDMKE (short for Reuse Distance-based profiling in MKEs), to analyze GPU cache memory performance in multiple kernel execution (MKE) scenarios; simulation results of 28 two-kernel executions indicate that RDMKE properly captures the nonlinear cache behaviors in MKE scenarios.
Adaptive and transparent cache bypassing for GPUs
- Computer Science
- SC15: International Conference for High Performance Computing, Networking, Storage and Analysis
- 2015
This paper proposes a novel compile-time framework for adaptive and transparent cache bypassing on GPUs that uses a simple yet effective approach to control the bypass degree to match the size of applications' runtime footprints.
An efficient compiler framework for cache bypassing on GPUs
- Computer Science
- 2013 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)
- 2013
An efficient compiler framework for cache bypassing on GPUs is proposed and efficient algorithms that judiciously select global load instructions for cache access or bypass are presented.
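For intuition only (this is not the paper's actual selection algorithm), a compile-time bypass decision can be sketched as a profile-driven heuristic: global loads whose accesses mostly have reuse distances beyond the cache capacity gain little from caching and are candidates for bypass. The threshold and profile format below are assumptions.

```python
# Hypothetical bypass selection: bypass loads whose profiled reuse
# distances almost always exceed the cache capacity in lines.

def select_bypass(load_histograms, cache_lines, threshold=0.9):
    """load_histograms: load PC -> {reuse distance: access count}.
    Returns the set of load PCs that should bypass the cache."""
    bypass = set()
    for pc, hist in load_histograms.items():
        total = sum(hist.values())
        far = sum(c for d, c in hist.items() if d >= cache_lines)
        if far / total > threshold:
            bypass.add(pc)
    return bypass

profiles = {
    0x100: {2: 900, 8: 100},       # short reuse distances: keep cached
    0x200: {float('inf'): 1000},   # pure streaming: bypass
}
print(select_bypass(profiles, 128))  # {512}, i.e. the load at PC 0x200
```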
The Demand for a Sound Baseline in GPU Memory Architecture Research
- Computer Science
- 2017
The results show that advanced cache indexing functions can greatly reduce conflict misses and improve cache efficiency; that the allocation-on-fill policy yields better performance than allocation-on-miss; and that performance does not consistently improve with more MSHRs, although additional MSHRs can greatly mitigate the problem of memory partition camping.
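To illustrate why the choice of indexing function matters for conflict misses, the sketch below contrasts conventional modulo set indexing with a simple XOR-based hash; the bit widths are illustrative assumptions, not any particular GPU's configuration.

```python
# Modulo vs. XOR-based cache set indexing on a power-of-two stride.
LINE_BITS = 7  # 128-byte lines
SET_BITS = 5   # 32 sets

def modulo_index(addr):
    """Conventional indexing: the set bits directly above the line offset."""
    return (addr >> LINE_BITS) & ((1 << SET_BITS) - 1)

def xor_index(addr):
    """XOR hash: fold higher address bits into the set index so that
    power-of-two strides spread across sets instead of colliding."""
    low = (addr >> LINE_BITS) & ((1 << SET_BITS) - 1)
    high = (addr >> (LINE_BITS + SET_BITS)) & ((1 << SET_BITS) - 1)
    return low ^ high

stride = 1 << (LINE_BITS + SET_BITS)  # worst-case stride for modulo indexing
addrs = [i * stride for i in range(8)]
print([modulo_index(a) for a in addrs])  # [0, 0, 0, 0, 0, 0, 0, 0] -- all conflict
print([xor_index(a) for a in addrs])     # [0, 1, 2, 3, 4, 5, 6, 7] -- spread out
```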
References
Showing 1-10 of 26 references
Cache Miss Analysis for GPU Programs Based on Stack Distance Profile
- Computer Science
- 2011 31st International Conference on Distributed Computing Systems
- 2011
This paper proposes, for the first time, a cache miss analysis model for GPU programs, based on a deep analysis of the GPU's execution model, and shows that the method is efficient and can be used to guide cache locality optimizations for GPU programs.
An adaptive performance modeling tool for GPU architectures
- Computer Science
- PPoPP '10
- 2010
Presents an analytical model that predicts the performance of general-purpose applications on a GPU architecture, captures full system complexity, and shows high accuracy in predicting the performance trends of different optimized kernel implementations.
Neither more nor less: Optimizing thread-level parallelism for GPGPUs
- Computer Science
- Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques
- 2013
To reduce resource contention, this paper proposes a dynamic CTA scheduling mechanism, called DYNCTA, which modulates TLP by allocating an optimal number of CTAs based on application characteristics.
Performance Estimation of GPUs with Cache
- Computer Science
- 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum
- 2012
Develops a model that counts the number of instructions in a kernel exactly, and uses this instruction count methodology to predict the total execution time.
Analyzing CUDA workloads using a detailed GPU simulator
- Computer Science
- 2009 IEEE International Symposium on Performance Analysis of Systems and Software
- 2009
Two observations are made: for the applications studied, performance is more sensitive to interconnect bisection bandwidth than to latency, and, for some applications, running fewer threads concurrently than on-chip resources would otherwise allow can improve performance by reducing contention in the memory system.
Cache-Conscious Wavefront Scheduling
- Computer Science
- 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
- 2012
This paper proposes Cache-Conscious Wavefront Scheduling (CCWS), an adaptive hardware mechanism that uses a novel intra-wavefront locality detector to capture locality that is lost by other schedulers due to excessive contention for cache capacity.
An integrated GPU power and performance model
- Computer Science
- ISCA
- 2010
An integrated power and performance (IPP) prediction model for a GPU architecture that predicts the optimal number of active processors for a given application; the outcome of IPP is used to control the number of running cores.
Reuse Distance as a Metric for Cache Behavior
- Computer Science
- 2001
The distribution of conflict and capacity misses was measured in the execution of code generated by a state-of-the-art EPIC compiler, and it is observed that some program transformations that enhance parallelism may counter the optimizations that reduce capacity misses.
Ocelot: A dynamic optimization framework for bulk-synchronous applications in heterogeneous systems
- Computer Science
- 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)
- 2010
Explores several novel code transformations that are applicable only when compiling explicitly parallel applications, and revisits traditional dynamic compiler optimizations for this new class of applications, to inform the design of compilation tools for explicitly parallel programming models (such as OpenCL) as well as future CPU and GPU architectures.
Multicore-aware reuse distance analysis
- Computer Science
- 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW)
- 2010
This paper shows several methods to keep reuse stacks consistent so that they account for invalidations and cache sharing, either as references arise in a simulated execution or at synchronization points, and shows that adding multicore-awareness substantially improves the ability of reuse distance analysis to model cache behavior.
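A minimal sketch of the invalidation bookkeeping such a multicore-aware analysis needs, assuming one private reuse stack per core and write-invalidate coherence; the class and method names are hypothetical, not taken from the paper.

```python
# Per-core reuse stacks with write-invalidate coherence: a write by one
# core removes the line from every other core's stack, so the victim
# core's next access to that line counts as a miss.
from collections import OrderedDict

class CoherentReuseStacks:
    def __init__(self, num_cores):
        self.stacks = [OrderedDict() for _ in range(num_cores)]

    def access(self, core, line, is_write=False):
        stack = self.stacks[core]
        if line in stack:
            keys = list(stack)
            dist = len(keys) - 1 - keys.index(line)
            del stack[line]
        else:
            dist = float('inf')  # cold miss or coherence miss
        stack[line] = None  # move (or insert) to most-recently-used
        if is_write:
            for other, s in enumerate(self.stacks):
                if other != core:
                    s.pop(line, None)  # invalidate remote copies
        return dist

rs = CoherentReuseStacks(2)
rs.access(0, 10)                  # core 0 reads line 10: cold miss
rs.access(1, 10, is_write=True)   # core 1 writes: invalidates core 0's copy
print(rs.access(0, 10))           # inf -- the reuse was destroyed
```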