A new perspective for efficient virtual-cache coherence

  • Stefanos Kaxiras, Alberto Ros
  • Published 23 June 2013
  • Computer Science
  • Proceedings of the 40th Annual International Symposium on Computer Architecture
Coherent shared virtual memory (cSVM) is highly coveted for heterogeneous architectures as it will simplify programming across different cores and manycore accelerators. In this context, virtual L1 caches can be used to great advantage, e.g., saving energy consumption by eliminating address translation for hits. Unfortunately, multicore virtual-cache coherence is complex and costly because it requires reverse translation for any coherence request directed towards a virtual L1. The reason is the… 

Early Experiences with Separate Caches for Private and Shared Data

This paper proposes dedicated caches for private (plus shared read-only) data and for shared data: the private cache is independent for each core, while the shared cache (L1S) is logically shared but physically distributed across all cores.

Efficient Intra-SM Slicing through Dynamic Resource Partitioning for GPU Multiprogramming


Devirtualizing virtual memory for heterogeneous systems

The Devirtualized Virtual Memory (DVM) scheme is proposed to combine the protection of VM with direct access to physical memory (PM) and reduces VM overheads to 5% on average, down from 29% for conventional VM.


This work is the first to explore GPU Translation Lookaside Buffers and page table walkers for address translation in the context of shared virtual memory for heterogeneous systems and considers the impact on the design of general purpose GPU performance improvement schemes.

Architectural support for address translation on GPUs: designing memory management units for CPU/GPUs with unified address spaces

This work is the first to explore GPU Memory Management Units (MMUs) consisting of Translation Lookaside Buffers (TLBs) and page table walkers (PTWs) for address translation in unified heterogeneous systems and shows that a little TLB-awareness can make other GPU performance enhancements feasible in the face of cache-parallel address translation.

TLB Shootdown Mitigation for Low-Power Many-Core Servers with L1 Virtual Caches

This paper proposes a low-overhead and readily implementable hardware mechanism that uses Bloom filters to reduce spurious invalidations and mitigate their ill effects in low-power virtually-addressed caches.

Efficient synonym filtering and scalable delayed translation for hybrid virtual caching

This paper proposes a delayed many-segment translation designed for hybrid virtual caching; it effectively reduces accesses to the TLBs, yielding significant power savings and performance improvement through scalable delayed translation with variable-length segments.

Filtering Translation Bandwidth with Virtual Caching

Evaluation shows that the entire GPU virtual cache hierarchy effectively filters the high address translation bandwidth, achieving almost the same performance as an ideal MMU.

The Virtual Block Interface: A Flexible Alternative to the Conventional Virtual Memory Framework

The benefits of VBI are demonstrated with two important use cases: reducing the overheads of address translation (for both native execution and virtual machine environments), as VBI reduces the number of translation requests and associated memory accesses; and two heterogeneous main memory architectures, where VBI significantly improves performance over conventional virtual memory.

Enabling Large-Reach TLBs for High-Throughput Processors by Exploiting Memory Subregion Contiguity

With MESC, address translations of up to 512 pages can be coalesced into a single TLB entry, without changing the memory allocation policy (i.e., demand paging) or requiring large-page support.



Cache coherence for GPU architectures

This paper describes a time-based coherence framework for GPUs, called Temporal Coherence (TC), that exploits globally synchronized counters in single-chip systems to develop a streamlined GPU coherence protocol, called TC-Weak.

Complexity-effective multicore coherence

  • Alberto Ros, S. Kaxiras
  • Computer Science
    2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT)
  • 2012
This paper presents a virtually costless coherence scheme that outperforms a MESI directory protocol while also reducing shared-cache and network energy consumption for 15 parallel benchmarks on 16 cores.

Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks

This work proposes to deactivate the coherence protocol for private memory blocks and to treat them as uniprocessor systems do, which allows directory caches to omit tracking an appreciable number of blocks, reducing their load and increasing their effective size.

Reducing memory reference energy with opportunistic virtual caching

An Opportunistic Virtual Cache is proposed that exposes virtual caching as a dynamic optimization by allowing some memory blocks to be cached with virtual addresses and others with physical addresses, and saves 94-99% of TLB lookup energy and nearly 23% of L1 cache dynamic lookup energy.

Enigma: architectural and operating system support for reducing the impact of address translation

Enigma is a novel approach to address translation that defers the bulk of the work associated with address translation until data must be retrieved from physical memory.

Dynamic self-invalidation: reducing coherence overhead in shared-memory multiprocessors

  • A. Lebeck, D. Wood
  • Computer Science
    Proceedings 22nd Annual International Symposium on Computer Architecture
  • 1995
The results show that DSI reduces execution time of a sequentially consistent full-map coherence protocol by as much as 41%.

DeNovo: Rethinking the Memory Hierarchy for Disciplined Parallelism

DeNovo is presented, a hardware architecture motivated by a disciplined shared-memory programming model that allows DeNovo to seamlessly integrate message passing-like interactions within a global address space for improved design complexity, performance, and efficiency.

Shared last-level TLBs for chip multiprocessors

This paper is the first to propose and evaluate shared last-level (SLL) TLBs as an alternative to the commercial norm of private, per-core L2 TLBs, and argues that SLL TLBs hold great promise for CMPs.

A Primer on Memory Consistency and Cache Coherence

This primer provides readers with a basic understanding of consistency and coherence, presenting both high-level concepts and specific, concrete examples from real-world systems.

Cooperative shared memory: software and hardware for scalable multiprocessors

The initial implementation of cooperative shared memory uses a simple programming model, called Check-In/Check-Out (CICO), in conjunction with even simpler hardware, called Dir1SW, that adds little complexity to message-passing hardware, but efficiently supports programs written within the CICO model.