Configurable XOR Hash Functions for Banked Scratchpad Memories in GPUs

@article{Braak2016ConfigurableXH,
  title={Configurable XOR Hash Functions for Banked Scratchpad Memories in GPUs},
  author={Gert-Jan van den Braak and Juan G{\'o}mez-Luna and Jos{\'e} Mar{\'i}a Gonz{\'a}lez-Linares and Henk Corporaal and Nicol{\'a}s Guil Mata},
  journal={IEEE Transactions on Computers},
  year={2016},
  volume={65},
  pages={2045-2058}
}
Scratchpad memories in GPU architectures are employed as software-controlled caches to increase the effective GPU memory bandwidth. Through the use of well-known optimization techniques, such as privatization and tiling, they are properly exploited. Typically, they are banked memories which are addressed with a $\text{mod}(2^N)$ …
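The abstract describes the baseline bank addressing, a mod(2^N) of the word address, which the paper replaces with a configurable XOR hash of higher address bits. Below is a minimal sketch of both index functions, assuming a 32-bank, word-addressed scratchpad; the particular bit group folded in by the XOR variant is illustrative only, not the paper's configuration.

```c
#include <stdint.h>

#define NUM_BANKS 32   /* assume 2^5 banks, as in many GPUs */
#define BANK_BITS 5

/* Baseline: bank = address mod 2^N, i.e. the lowest BANK_BITS bits. */
static inline uint32_t bank_mod(uint32_t word_addr)
{
    return word_addr & (NUM_BANKS - 1);
}

/* XOR hash: fold a group of higher address bits onto the low bits.
 * Strided patterns that alias to one bank under bank_mod() are spread
 * across banks here. Which higher bits to fold in would be chosen per
 * application by a configurable hash; this fixed choice is illustrative. */
static inline uint32_t bank_xor(uint32_t word_addr)
{
    uint32_t low  = word_addr & (NUM_BANKS - 1);
    uint32_t high = (word_addr >> BANK_BITS) & (NUM_BANKS - 1);
    return low ^ high;
}
```

For example, a stride-32 access pattern maps every address to bank 0 under bank_mod(), but is distributed over all 32 banks by bank_xor().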
An access pattern based adaptive mapping function for GPGPU scratchpad memory
TLDR
An adaptive mapping function is proposed, which can dynamically select a suitable mapping function for applications based on statistics of the first executing block; the experimental results show that bank conflicts are reduced by 94.8 percent and performance improves by 1.235× for 17 benchmarks on GPGPU-Sim, a Fermi-like simulator.
Improving GPU Performance: Reducing Memory Conflicts and Latency
TLDR
A set of software techniques to improve the parallel updating of the output bins in so-called 'voting algorithms', such as histogram and Hough transform, is analyzed, implemented and optimized on GPUs.
A load balancing technique for memory channels
TLDR
In the proposed memory system, a memory request from a busy channel can be migrated to, and serviced by, other non-busy channels, which reduces memory-controller stalls and yields a 10.1% performance increase for GPGPU workloads.
Architecting Memory Systems for Emerging Technologies
The advance of traditional dynamic random access memory (DRAM) technology has slowed down, while the capacity and performance needs of memory systems have continued to increase. This is a result of …
Adaptive Linear Address Map for Bank Interleaving in DRAMs
TLDR
The experimental results show that the presented adaptive bank-interleaved linear address map for DRAM can effectively improve performance at a moderate hardware cost.
Power modeling and architectural techniques for energy-efficient GPUs
TLDR
This thesis investigates bottlenecks that cause low performance and low energy efficiency in GPUs, proposes architectural techniques to address them, and introduces the more complex Entropy Encoding Based Memory Compression (E2MC) technique for GPUs.

References

SHOWING 1-10 OF 34 REFERENCES
Application-Specific Reconfigurable XOR-Indexing to Eliminate Cache Conflict Misses
TLDR
It is shown how application-specific hashing of the address can eliminate a large number of conflict misses in caches, and that a reconfigurable XOR-function selector is inherently less complex than a reconfigurable selector for bit-selecting functions.
Simulation and architecture improvements of atomic operations on GPU scratchpad memory
TLDR
This paper proposes to use a hash function in both the addressing of the banks and the locks of the scratchpad memory in GPGPU-Sim to reduce serialization of threads, resulting in a speed-up in histogram and Hough transform applications with minimal hardware cost.
StreamScan: fast scan algorithms for GPUs without global barrier synchronization
TLDR
StreamScan is a novel approach to implement scan on GPUs with only one computation phase; the main idea is to restrict synchronization to adjacent workgroups only, thereby eliminating global barrier synchronization completely.
XOR-based hash functions
TLDR
Two ways to reason about hash functions are presented: by their null space and by their column space, which help to quickly determine whether a pattern is mapped conflict-free.
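The null-space view can be made concrete: an XOR-based hash is a linear map over GF(2), so two addresses whose bit patterns differ by d (XOR-difference) collide exactly when d lies in the null space of the hash matrix. A small sketch, with an illustrative hash matrix not taken from the reference:

```c
#include <stdint.h>

/* Each index bit of an XOR-based hash is the parity (XOR) of a subset
 * of address bits. Represent the hash matrix H as one bitmask per
 * output bit; these row masks are an example, not from the paper. */
static const uint32_t H_rows[5] = {
    0x00000421u,   /* index bit 0 = XOR of address bits 0, 5, 10 */
    0x00000842u,   /* index bit 1 = XOR of address bits 1, 6, 11 */
    0x00001084u,   /* index bit 2 = XOR of address bits 2, 7, 12 */
    0x00002108u,   /* index bit 3 = XOR of address bits 3, 8, 13 */
    0x00004210u,   /* index bit 4 = XOR of address bits 4, 9, 14 */
};

static inline uint32_t parity32(uint32_t x)
{
    x ^= x >> 16; x ^= x >> 8; x ^= x >> 4; x ^= x >> 2; x ^= x >> 1;
    return x & 1u;
}

static inline uint32_t xor_hash(uint32_t addr)
{
    uint32_t idx = 0;
    for (int i = 0; i < 5; i++)
        idx |= parity32(addr & H_rows[i]) << i;
    return idx;
}

/* Because the hash is linear over GF(2), addresses a and a^d map to the
 * same bank exactly when xor_hash(d) == 0, i.e. d is in the null space
 * of H. This checks a whole family of access patterns at once. */
static inline int collides(uint32_t d)
{
    return xor_hash(d) == 0;
}
```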
Performance Modeling of Atomic Additions on GPU Scratchpad Memory
TLDR
This paper presents an exhaustive microbenchmark-based analysis of atomic additions in shared memory that quantifies the impact of access conflicts on latency and throughput, and proposes a performance model to estimate the latency penalties due to collisions by position or bank conflicts.
Eliminating cache conflict misses through XOR-based placement functions
TLDR
It is shown that for an 8 Kbyte data cache, XOR-mapping schemes approximately halve the miss ratio for two-way associative and column-associative organizations, and provide a very significant reduction in the miss ratio for the other cache organizations, including the direct-mapped cache.
Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures
TLDR
Techniques for enhancing the memory efficiency of applications on data-parallel architectures are presented, based on the analysis and characterization of memory access patterns in loop bodies; they target vectorization via data transformation to benefit vector-based architectures and algorithmic memory selection for scalar-based architectures.
Reducing Conflict Misses by Application-Specific Reconfigurable Indexing
TLDR
An improved indexing scheme for direct-mapped caches is proposed, which drastically reduces the number of conflict misses by using application-specific information, and is based on the selection of a subset of the address bits.
Pseudo-randomly interleaved memory
TLDR
The notion of polynomial interleaving modulo an irreducible polynomial is introduced as a way of achieving pseudo-random interleaving with certain attractive and provable properties.
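A hedged sketch of the idea: the address is interpreted as a polynomial over GF(2) and reduced modulo an irreducible polynomial; the remainder selects the bank. The degree-5 polynomial below is one valid irreducible choice, picked for illustration rather than taken from the paper.

```c
#include <stdint.h>

#define POLY      0x25u   /* x^5 + x^2 + 1, irreducible over GF(2) */
#define POLY_DEG  5

/* Long division over GF(2): repeatedly XOR a shifted copy of p(x)
 * to clear the highest set bit, until the remainder fits in
 * POLY_DEG bits. The remainder is the bank index. */
static inline uint32_t poly_mod_bank(uint32_t addr)
{
    for (int bit = 31; bit >= POLY_DEG; bit--) {
        if (addr & (1u << bit))
            addr ^= POLY << (bit - POLY_DEG);
    }
    return addr;   /* remainder in [0, 2^POLY_DEG) */
}
```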
Arbitrary Modulus Indexing
TLDR
A new scheme called Arbitrary Modulus Indexing (AMI) is introduced that can be implemented efficiently for all moduli, matching or improving the efficiency of the best existing schemes for primes while allowing great flexibility in choosing a modulus to optimize cost/performance trade-offs.
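For context on why non-power-of-2 moduli need dedicated indexing logic, the sketch below reduces an address modulo 7 with a standard digit-folding trick that avoids a divider; this is only an illustration of divider-free modulo, not the AMI scheme itself.

```c
#include <stdint.h>

/* Reduce x modulo 7 (= 2^3 - 1) without division: since 2^3 ≡ 1 (mod 7),
 * summing the base-8 digits preserves the value mod 7. */
static inline uint32_t mod7(uint32_t x)
{
    while (x > 7) {
        uint32_t sum = 0;
        while (x) { sum += x & 7u; x >>= 3; }
        x = sum;
    }
    return (x == 7) ? 0 : x;   /* map the representative 7 back to 0 */
}
```

The same folding works for any modulus of the form 2^k - 1; arbitrary moduli require more general reduction hardware, which is the cost/performance trade-off the paper addresses.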