A new perspective for efficient virtual-cache coherence

@inproceedings{Kaxiras2013ANP,
  title={A new perspective for efficient virtual-cache coherence},
  author={S. Kaxiras and Alberto Ros},
  booktitle={Proceedings of the 40th Annual International Symposium on Computer Architecture},
  year={2013}
}
  • S. Kaxiras, Alberto Ros
  • Published 2013
  • Computer Science
  • Proceedings of the 40th Annual International Symposium on Computer Architecture
Coherent shared virtual memory (cSVM) is highly coveted for heterogeneous architectures as it will simplify programming across different cores and manycore accelerators. In this context, virtual L1 caches can be used to great advantage, e.g., saving energy by eliminating address translation for hits. Unfortunately, multicore virtual-cache coherence is complex and costly because it requires reverse translation for any coherence request directed towards a virtual L1. The reason is the…
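The cost the abstract alludes to can be illustrated with a minimal toy model (not the paper's design, which the truncated text goes on to address): a virtually tagged L1 serves hits without any translation, but coherence requests arrive carrying physical addresses, so the cache must keep a physical-to-virtual reverse map just to locate the affected line. All structures and names below are illustrative.

// Toy model: why coherence traffic is awkward for a virtually tagged L1.
// Directory/L2 messages carry physical addresses; the L1 is looked up by
// virtual address, so every incoming request needs a reverse translation.
#include <cstdint>
#include <iostream>
#include <optional>
#include <unordered_map>

struct VirtualL1 {
    std::unordered_map<uint64_t, int> lines_by_va;      // VA -> data (toy payload)
    std::unordered_map<uint64_t, uint64_t> reverse_map; // PA -> VA, kept only for coherence

    void fill(uint64_t va, uint64_t pa, int data) {      // allocate on a miss
        lines_by_va[va] = data;
        reverse_map[pa] = va;                             // extra state maintained per line
    }
    std::optional<int> load(uint64_t va) const {          // hits need no translation at all
        auto it = lines_by_va.find(va);
        if (it == lines_by_va.end()) return std::nullopt;
        return it->second;
    }
    // An invalidation arrives with a physical address (e.g., from a directory).
    void invalidate_pa(uint64_t pa) {
        auto it = reverse_map.find(pa);
        if (it == reverse_map.end()) return;              // not cached here
        lines_by_va.erase(it->second);                     // reverse translation PA -> VA
        reverse_map.erase(it);
    }
};

int main() {
    VirtualL1 l1;
    l1.fill(/*va=*/0x1000, /*pa=*/0x8000, 42);
    std::cout << "hit: " << l1.load(0x1000).value() << "\n";   // no TLB access on the hit
    l1.invalidate_pa(0x8000);                                   // coherence needs the PA->VA map
    std::cout << "after invalidation, hit? "
              << (l1.load(0x1000).has_value() ? "yes" : "no") << "\n";
}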
Citations

Early Experiences with Separate Caches for Private and Shared Data
This paper proposes dedicated caches for private (plus shared read-only) data and for shared data: the private cache is independent for each core, while the shared cache (L1S) is logically shared but physically distributed across all cores.
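A rough sketch of that split, assuming a per-page private/shared classification supplied elsewhere (e.g., by the OS); the class and field names are invented for illustration and this is not the paper's implementation.

// Rough sketch: route accesses to a per-core private cache (private and
// shared read-only data) or to a logically shared L1S, based on a page-level
// classification. Purely illustrative.
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <unordered_set>

enum class PageClass { Private, SharedReadOnly, Shared };

struct SplitL1 {
    std::unordered_map<uint64_t, PageClass> page_class;  // per-page classification (assumed input)
    std::unordered_set<uint64_t> l1p;                     // per-core private cache (tags only, toy)
    std::unordered_set<uint64_t> l1s;                     // slice of the logically shared L1S

    static uint64_t page_of(uint64_t addr) { return addr >> 12; }

    void access(uint64_t addr) {
        PageClass c = page_class.count(page_of(addr))
                        ? page_class[page_of(addr)] : PageClass::Shared;
        // Private and shared read-only data need no coherence, so they can live
        // in the simple per-core cache; truly shared data goes to L1S.
        if (c == PageClass::Shared) l1s.insert(addr);
        else                        l1p.insert(addr);
    }
};

int main() {
    SplitL1 cache;
    cache.page_class[SplitL1::page_of(0x1000)] = PageClass::Private;
    cache.page_class[SplitL1::page_of(0x2000)] = PageClass::Shared;
    cache.access(0x1008);
    cache.access(0x2010);
    std::cout << "L1P lines: " << cache.l1p.size()
              << ", L1S lines: " << cache.l1s.size() << "\n";
}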
Efficient Intra-SM Slicing through Dynamic Resource Partitioning for GPU Multiprogramming
Proposes a delayed many-segment translation designed for hybrid virtual caching, which effectively lowers accesses to the TLBs, yielding significant power savings and performance improvements with scalable delayed translation over variable-length segments.
Reducing address translation overheads with virtual caching
This thesis makes novel empirical observations, based on real-world applications, about the temporal properties of synonym accesses, and proposes a practical virtual-cache design with dynamic synonym remapping (VC-DSR) that effectively reduces the design complications of virtual caches.
Devirtualizing virtual memory for heterogeneous systems
Accelerators are increasingly recognized as one of the major drivers of future computational growth. For accelerators, unified virtual memory (VM) promises to simplify programming and provide safe…
Architectural Support for Virtual Memory in GPUs
This work is the first to explore GPU translation lookaside buffers and page table walkers for address translation in the context of shared virtual memory for heterogeneous systems, and considers the impact on the design of general-purpose GPU performance-improvement schemes.
Architectural support for address translation on GPUs: designing memory management units for CPU/GPUs with unified address spaces
This paper is the first to explore GPU memory management units (MMUs), consisting of translation lookaside buffers (TLBs) and page table walkers (PTWs), for address translation in unified heterogeneous systems. It shows that cache-parallel address translation does pose challenges, but that modest optimizations, and a little TLB-awareness in other GPU performance-enhancement schemes, can buy back much of the lost performance.
TLB Shootdown Mitigation for Low-Power Many-Core Servers with L1 Virtual Caches
This paper proposes a low-overhead, readily implementable hardware mechanism that uses Bloom filters to reduce spurious invalidations and mitigate their ill effects in low-power virtually addressed caches.
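A tiny sketch of the filtering idea as described in that summary: each core keeps a Bloom filter summarizing the pages it may have cached, and an incoming invalidation that misses in the filter can be dropped without disturbing the cache. The hash functions and sizes below are arbitrary illustrative choices, not the paper's parameters.

// Tiny sketch of filtering spurious invalidations with a Bloom filter:
// each core summarizes the page addresses it may have cached; an incoming
// shootdown/invalidation that misses in the filter is dropped locally.
#include <bitset>
#include <cstdint>
#include <iostream>

struct BloomFilter {
    std::bitset<4096> bits;
    static uint64_t h1(uint64_t x) { return (x * 0x9E3779B97F4A7C15ULL) >> 52; }  // 12-bit index
    static uint64_t h2(uint64_t x) { return (x * 0xC2B2AE3D27D4EB4FULL) >> 52; }

    void insert(uint64_t page)            { bits.set(h1(page)); bits.set(h2(page)); }
    bool may_contain(uint64_t page) const { return bits.test(h1(page)) && bits.test(h2(page)); }
};

int main() {
    BloomFilter cached_pages;           // updated whenever this core fills a line
    cached_pages.insert(0x1234);

    // Incoming invalidations (e.g., from a TLB shootdown of page 0x9999):
    for (uint64_t page : {0x1234ULL, 0x9999ULL}) {
        if (cached_pages.may_contain(page))
            std::cout << "page 0x" << std::hex << page << ": walk/flush the local cache\n";
        else
            std::cout << "page 0x" << std::hex << page << ": spurious, dropped by the filter\n";
    }
}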
Efficient synonym filtering and scalable delayed translation for hybrid virtual caching
Conventional translation lookaside buffers (TLBs) are required to complete address translation with short latencies, as address translation is on the critical path of all memory accesses even for…
Filtering Translation Bandwidth with Virtual Caching
Evaluation shows that the entire GPU virtual cache hierarchy effectively filters the high address-translation bandwidth, achieving almost the same performance as an ideal MMU.

References

Showing 1-10 of 40 references
Cache coherence for GPU architectures
This paper describes a time-based coherence framework for GPUs, called Temporal Coherence (TC), that exploits globally synchronized counters in single-chip systems to develop a streamlined GPU coherence protocol called TC-Weak.
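A minimal sketch of the time-based idea, assuming a globally synchronized counter: each cached copy carries an expiration timestamp (a lease) and self-invalidates when the clock passes it, so a writer only waits for the latest outstanding lease instead of sending invalidations. Lease prediction, fences, and the TC-Weak specifics are omitted; all names are illustrative.

// Minimal sketch of time-based coherence with globally synchronized counters:
// a reader's copy is valid until an expiration timestamp; when the global
// clock passes it, the copy self-invalidates, so a writer only has to wait
// out the latest outstanding lease instead of sending invalidations.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <unordered_map>

uint64_t global_clock = 0;                       // globally synchronized counter

struct TimestampedCache {
    struct Line { int value; uint64_t expires; };
    std::unordered_map<uint64_t, Line> lines;

    void fill(uint64_t addr, int value, uint64_t lease) {
        lines[addr] = {value, global_clock + lease};
    }
    bool valid(uint64_t addr) const {
        auto it = lines.find(addr);
        return it != lines.end() && global_clock < it->second.expires;  // self-invalidation on expiry
    }
};

int main() {
    TimestampedCache reader;
    uint64_t latest_lease_for_X = 0;             // tracked at the shared level (e.g., the L2)

    reader.fill(/*addr=*/0x40, /*value=*/7, /*lease=*/10);
    latest_lease_for_X = std::max(latest_lease_for_X, global_clock + 10);

    std::cout << "t=0 reader sees X? " << reader.valid(0x40) << "\n";
    // A writer at the L2 simply waits (or defers completion) until the lease expires.
    global_clock = latest_lease_for_X;
    std::cout << "t=" << global_clock << " reader sees X? " << reader.valid(0x40) << "\n";
}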
The Synonym Lookaside Buffer: A Solution to the Synonym Problem in Virtual Caches
It is shown that small SLBs of 8-16 entries are sufficient to solve the synonym problem in virtual caches and that their performance overhead is negligible.
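A small sketch of the lookup flow such a structure implies: before the virtual cache is indexed, a tiny table remaps synonym virtual pages to a single primary virtual page, so every synonym of a physical page finds the same line. The structure, field names, and sizes below are illustrative, not the paper's design.

// Small sketch of a synonym-lookaside-buffer style lookup: remap synonym
// virtual pages to one "primary" virtual page before indexing the virtual
// cache, so all synonyms of a physical page find the same cache line.
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <unordered_set>

constexpr uint64_t kPageBits = 12;

struct SynonymLB {
    // synonym virtual page -> primary virtual page (only pages with synonyms need entries)
    std::unordered_map<uint64_t, uint64_t> remap;
    uint64_t canonicalize(uint64_t va) const {
        uint64_t vpage = va >> kPageBits, offset = va & ((1ULL << kPageBits) - 1);
        auto it = remap.find(vpage);
        return ((it == remap.end() ? vpage : it->second) << kPageBits) | offset;
    }
};

int main() {
    SynonymLB slb;
    std::unordered_set<uint64_t> virtual_cache;            // tags are canonical VAs (toy model)

    // Two virtual pages (0x1000.. and 0x7000..) map to the same physical page.
    slb.remap[0x7] = 0x1;                                   // 0x7000.. is a synonym of 0x1000..

    virtual_cache.insert(slb.canonicalize(0x1008));         // fill via the primary mapping
    bool hit = virtual_cache.count(slb.canonicalize(0x7008)) > 0;
    std::cout << "access through the synonym hits? " << (hit ? "yes" : "no") << "\n";
}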
Complexity-effective multicore coherence
  • Alberto Ros, S. Kaxiras
  • Computer Science
  • 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT)
  • 2012
A virtually costless coherence scheme is shown to outperform a MESI directory protocol while reducing shared-cache and network energy consumption, across 15 parallel benchmarks on 16 cores.
Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks
This work proposes to deactivate the coherence protocol for private memory blocks and to handle them as uniprocessor systems do, allowing directory caches to omit tracking an appreciable fraction of blocks, which reduces their load and increases their effective size.
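A toy sketch of that deactivation: blocks on pages classified as private simply never allocate a directory entry, so only genuinely shared blocks consume directory capacity. The classification source and the recovery needed on a private-to-shared transition are simplified away, and all names are illustrative.

// Toy sketch: blocks on OS-classified private pages bypass the directory
// entirely (no tracking entry is allocated), so the directory cache only
// holds genuinely shared blocks.
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <unordered_set>

struct Directory {
    std::unordered_set<uint64_t> private_pages;                    // page classification (assumed input)
    std::unordered_map<uint64_t, std::unordered_set<int>> sharers; // tracked (shared) blocks only

    static uint64_t page_of(uint64_t addr) { return addr >> 12; }

    void on_l1_fill(int core, uint64_t addr) {
        if (private_pages.count(page_of(addr))) return;  // coherence deactivated: nothing to track
        sharers[addr].insert(core);                      // shared data is tracked as usual
    }
};

int main() {
    Directory dir;
    dir.private_pages.insert(Directory::page_of(0x1000));  // page touched by a single core

    dir.on_l1_fill(/*core=*/0, 0x1000);   // private: no directory entry allocated
    dir.on_l1_fill(/*core=*/0, 0x2000);   // shared: tracked
    dir.on_l1_fill(/*core=*/1, 0x2000);

    std::cout << "directory entries: " << dir.sharers.size() << "\n";  // 1, not 2
}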
Reducing memory reference energy with opportunistic virtual caching
An Opportunistic Virtual Cache is proposed that exposes virtual caching as a dynamic optimization by allowing some memory blocks to be cached with virtual addresses and others with physical addresses, saving 94-99% of TLB lookup energy and nearly 23% of L1 cache dynamic lookup energy.
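A rough sketch of that dynamic choice: per page, the cache decides whether a block may be looked up by virtual address (skipping the TLB on hits) or must take the conventional physical path. The "safe for virtual caching" predicate below is an invented stand-in for the real criteria (synonyms, permissions, and so on).

// Rough sketch of the opportunistic idea: per page, decide whether a block
// may be cached by virtual address (no TLB access on hits) or must use the
// conventional physical path.
#include <cstdint>
#include <iostream>
#include <unordered_map>

struct OpportunisticL1 {
    std::unordered_map<uint64_t, bool> va_cacheable;   // per-page: safe for virtual caching?
    int tlb_lookups = 0;

    static uint64_t page_of(uint64_t addr) { return addr >> 12; }

    void access(uint64_t va) {
        bool virt = va_cacheable.count(page_of(va)) && va_cacheable.at(page_of(va));
        if (virt) {
            // Virtual path: index and tag by VA; a hit needs no address translation.
        } else {
            ++tlb_lookups;                             // physical path: translate first
        }
    }
};

int main() {
    OpportunisticL1 l1;
    l1.va_cacheable[OpportunisticL1::page_of(0x1000)] = true;   // e.g., a private, synonym-free page
    l1.va_cacheable[OpportunisticL1::page_of(0x2000)] = false;  // e.g., a page with synonyms

    for (int i = 0; i < 100; ++i) l1.access(0x1000 + 8 * i);    // virtual path, no TLB energy
    for (int i = 0; i < 100; ++i) l1.access(0x2000 + 8 * i);    // physical path
    std::cout << "TLB lookups for 200 accesses: " << l1.tlb_lookups << "\n";
}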
Enigma: architectural and operating system support for reducing the impact of address translation
Enigma is a novel approach to address translation that defers the bulk of the work associated with address translation until data must be retrieved from physical memory.
Dynamic self-invalidation: reducing coherence overhead in shared-memory multiprocessors
  • A. Lebeck, D. Wood
  • Computer Science
  • Proceedings 22nd Annual International Symposium on Computer Architecture
  • 1995
The results show that DSI reduces the execution time of a sequentially consistent full-map coherence protocol by as much as 41%.
DeNovo: Rethinking the Memory Hierarchy for Disciplined Parallelism
DeNovo, a hardware architecture motivated by a disciplined shared-memory programming model, is presented; the discipline allows DeNovo to seamlessly integrate message-passing-like interactions within a global address space for improved design complexity, performance, and efficiency.
Shared last-level TLBs for chip multiprocessors
This paper is the first to propose and evaluate shared last-level (SLL) TLBs as an alternative to the commercial norm of private, per-core L2 TLBs, and shows that they hold great promise for CMPs.
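A compact sketch of that organization: each core keeps a small private L1 TLB whose misses go to one last-level TLB shared by all cores, so a translation walked in by one core can later hit for another. Capacities, the fake translation function, and the missing replacement policy are purely for illustration.

// Compact sketch of a shared last-level TLB: per-core L1 TLBs back into one
// TLB shared by all cores, so a page walked by one core can hit for others.
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

struct SharedLastLevelTLB {
    std::unordered_map<uint64_t, uint64_t> sll;                 // vpage -> pframe, shared by all cores
    std::vector<std::unordered_map<uint64_t, uint64_t>> l1;     // private per-core TLBs
    int page_walks = 0;

    explicit SharedLastLevelTLB(int cores) : l1(cores) {}

    uint64_t translate(int core, uint64_t vpage) {
        if (auto it = l1[core].find(vpage); it != l1[core].end()) return it->second;
        if (auto it = sll.find(vpage); it != sll.end()) {        // inter-core reuse happens here
            l1[core][vpage] = it->second;
            return it->second;
        }
        ++page_walks;                                            // miss everywhere: walk the page table
        uint64_t pframe = vpage ^ 0xABCDEULL;                    // fake translation for the toy model
        sll[vpage] = l1[core][vpage] = pframe;
        return pframe;
    }
};

int main() {
    SharedLastLevelTLB tlb(/*cores=*/4);
    tlb.translate(/*core=*/0, /*vpage=*/0x42);   // page walk
    tlb.translate(/*core=*/1, /*vpage=*/0x42);   // hits in the shared last-level TLB
    std::cout << "page walks: " << tlb.page_walks << "\n";       // 1, not 2
}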
Organization and performance of a two-level virtual-real cache hierarchy
It is shown how the second-level cache can be easily extended to solve the synonym problem resulting from the use of a virtually addressed cache at the first level, and how this organization has a performance advantage over a hierarchy of physically addressed caches in a multiprocessor environment.