Corpus ID: 239015952

Enabling Large-Reach TLBs for High-Throughput Processors by Exploiting Memory Subregion Contiguity

@article{Yu2021EnablingLT,
  title={Enabling Large-Reach TLBs for High-Throughput Processors by Exploiting Memory Subregion Contiguity},
  author={Chao Yu and Yuebin Bai and Rui Wang},
  journal={ArXiv},
  year={2021},
  volume={abs/2110.08613}
}
Accelerators such as GPUs have become a popular way to deliver future performance gains, and sharing the same virtual memory space between CPUs and GPUs is increasingly adopted to simplify programming. However, address translation, the cornerstone of virtual memory, is becoming a performance bottleneck for GPUs. In GPUs, a single TLB miss can stall hundreds of threads due to the SIMT execution model, degrading performance dramatically. Through real-system analysis, we observe that the OS… 
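To make the stakes concrete, TLB reach is simply the entry count times the page size, so a modestly sized TLB covers very little memory with 4 KB pages. A minimal C sketch of the arithmetic, with the 128-entry figure assumed for illustration rather than taken from the paper:

```c
#include <stdio.h>

/* TLB reach = number of entries x page size. The 128-entry figure is an
 * assumption for illustration, not a value from the paper. */
int main(void)
{
    const long entries = 128;
    const long page_4k = 4L << 10;  /* 4 KB */
    const long page_2m = 2L << 20;  /* 2 MB */

    printf("reach with 4 KB pages: %ld KB\n", entries * page_4k / 1024);
    printf("reach with 2 MB pages: %ld MB\n", entries * page_2m / (1024 * 1024));
    return 0;
}
```

This prints 512 KB versus 256 MB, which is why coalescing contiguity into larger-reach entries, as several of the references below propose, pays off.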

References

Showing 1-10 of 52 references
Supporting x86-64 address translation for 100s of GPU lanes
  • Jason Power, M. Hill, D. Wood
  • Computer Science
    2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)
  • 2014
TLDR
This proof-of-concept design shows how a judicious combination of extant CPU MMU ideas satisfies GPU MMU demands for 4 KB pages with minimal overheads (an average of less than 2% over ideal address translation).
Observations and opportunities in architecting shared virtual memory for heterogeneous systems
TLDR
This work analyzes, using real-system measurements, shared virtual memory across the CPU and an integrated GPU, and presents a detailed measurement study of a commercially available integrated APU that illustrates these effects and motivates future research opportunities.
Architectural support for address translation on GPUs: designing memory management units for CPU/GPUs with unified address spaces
TLDR
This work is the first to explore GPU Memory Management Units (MMUs) consisting of Translation Lookaside Buffers (TLBs) and page table walkers (PTWs) for address translation in unified heterogeneous systems and shows that a little TLB-awareness can make other GPU performance enhancements feasible in the face of cache-parallel address translation.
Big data causing big (TLB) problems: taming random memory accesses on the GPU
TLDR
A TLB-conscious approach to mitigating the slowdown for algorithms with irregular memory access is proposed and applied to two fundamental database operations, random sampling and hash-based grouping, showing that the slowdown can be dramatically reduced and yielding a performance increase of up to 13×.
Increasing TLB reach by exploiting clustering in page translations
TLDR
This work provides a detailed characterization of the spatial locality among virtual-to-physical translations and presents a multi-granular TLB organization that significantly increases effective TLB reach and substantially reduces miss rates while requiring no additional OS support.
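A minimal C sketch of the clustered-translation idea, assuming the simplest case where a cluster of 8 consecutive virtual pages maps to 8 consecutive physical frames; the field names, cluster size, and aligned-cluster simplification are illustrative, not from the paper:

```c
#include <stdbool.h>
#include <stdint.h>

#define CLUSTER_PAGES 8  /* assumed cluster size, for illustration */

/* One multi-granular entry covering a cluster of consecutive pages.
 * This sketch assumes the virtual cluster maps to a physical cluster
 * with the same intra-cluster offsets. */
typedef struct {
    uint64_t vpn_base;      /* first virtual page of the cluster */
    uint64_t pfn_base;      /* first physical frame of the cluster */
    uint8_t  valid_bitmap;  /* one valid bit per page in the cluster */
} clustered_tlb_entry;

/* Return true and fill *pfn_out if vpn hits in this cluster entry. */
static bool cluster_lookup(const clustered_tlb_entry *e, uint64_t vpn,
                           uint64_t *pfn_out)
{
    uint64_t off = vpn - e->vpn_base;  /* wraps (and misses) if vpn < base */
    if (off >= CLUSTER_PAGES || !((e->valid_bitmap >> off) & 1))
        return false;
    *pfn_out = e->pfn_base + off;
    return true;
}
```

One entry thus covers up to 8 translations while tolerating holes in the cluster via the valid bitmap.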
CoLT: Coalesced Large-Reach TLBs
TLDR
This work proposes Coalesced Large-Reach TLBs (CoLT), which leverage intermediate degrees of contiguity in memory allocation to coalesce multiple virtual-to-physical page translations into single TLB entries, eliminating TLB misses for next-generation big-data applications with low-overhead implementations.
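A minimal C sketch of CoLT-style coalescing, assuming a page-table walk hands back a run of neighboring translations; the names and the fixed linear scan are illustrative assumptions:

```c
#include <stdint.h>

/* One translation as returned by a page-table walk. */
typedef struct { uint64_t vpn, pfn; } pte;

/* One coalesced TLB entry covering `span` contiguous pages. */
typedef struct {
    uint64_t vpn_base, pfn_base;
    unsigned span;
} coalesced_entry;

/* Coalesce translations that are contiguous in both virtual and
 * physical space, starting from ptes[0] (n must be >= 1). Stops at the
 * first break in contiguity or after max_span pages. */
static coalesced_entry coalesce(const pte *ptes, unsigned n, unsigned max_span)
{
    coalesced_entry e = { ptes[0].vpn, ptes[0].pfn, 1 };
    for (unsigned i = 1; i < n && e.span < max_span; i++) {
        if (ptes[i].vpn != e.vpn_base + i || ptes[i].pfn != e.pfn_base + i)
            break;  /* contiguity broken: the entry ends here */
        e.span++;
    }
    return e;
}
```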
Towards high performance paged memory for GPUs
TLDR
Without modifying the GPU execution pipeline, it is shown that the performance overheads of GPU paged memory can be largely hidden, converting an average 2× slowdown into a 12% speedup compared to programmer-directed transfers.
Efficient Address Translation for Architectures with Multiple Page Sizes
TLDR
This work introduces MIX TLBs, energy-frugal set-associative structures that concurrently support all page sizes by exploiting superpage allocation patterns, boosting the performance of big-memory applications on native CPUs, virtualized CPUs, and GPUs.
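A simplified C sketch of a set-associative TLB that holds 4 KB and 2 MB entries side by side by tagging at 4 KB granularity and masking by page size; a real MIX TLB additionally mirrors superpage entries across the sets their 4 KB offsets index into, an install-side step this sketch omits. Set and way counts are assumptions:

```c
#include <stdbool.h>
#include <stdint.h>

#define SETS 64
#define WAYS 4

typedef struct {
    bool     valid;
    bool     is_2mb;    /* page-size bit stored alongside each entry */
    uint64_t vtag;      /* virtual page number at 4 KB granularity */
    uint64_t pfn;       /* frame number at the entry's own granularity */
} tlb_entry;

static tlb_entry tlb[SETS][WAYS];

/* Probe one set, matching 4 KB and 2 MB entries together. A 2 MB entry
 * ignores the low 9 tag bits (2 MB / 4 KB = 512 = 2^9). */
static bool tlb_lookup(uint64_t vaddr, uint64_t *paddr_out)
{
    uint64_t vpn4k = vaddr >> 12;
    unsigned set = (unsigned)(vpn4k % SETS);

    for (int w = 0; w < WAYS; w++) {
        const tlb_entry *e = &tlb[set][w];
        if (!e->valid)
            continue;
        uint64_t tag_mask = e->is_2mb ? ~(uint64_t)0x1FF : ~(uint64_t)0;
        if ((vpn4k & tag_mask) != (e->vtag & tag_mask))
            continue;
        uint64_t off_mask = e->is_2mb ? 0x1FFFFF : 0xFFF;
        unsigned shift = e->is_2mb ? 21 : 12;
        *paddr_out = (e->pfn << shift) | (vaddr & off_mask);
        return true;
    }
    return false;
}
```

The key point is that no separate split TLB per page size is needed; one structure serves all sizes concurrently.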
Devirtualizing Memory in Heterogeneous Systems
TLDR
Devirtualized Memory (DVM) is proposed to combine the protection of virtual memory (VM) with direct access to physical memory (PM); DVM also shows potential to extend beyond accelerators to CPUs, where it reduces VM overheads to 5% on average, down from 29% for conventional VM.
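A minimal C sketch of the devirtualized access path, assuming an allocation where VA equals PA so that translation reduces to a protection check; the flat per-page permission table here is a toy stand-in, not the paper's mechanism:

```c
#include <stdbool.h>
#include <stdint.h>

#define DVM_PAGES (1u << 20)  /* toy: track 1M pages (4 GB of 4 KB pages) */
enum { PERM_READ = 1, PERM_WRITE = 2 };

/* Toy per-page permission table; a real design keeps compact permission
 * metadata reachable without a full page-table walk. */
static uint8_t perm_table[DVM_PAGES];

/* Devirtualized access: VA == PA by construction, so translation is the
 * identity and only the protection check remains on the common path. */
static bool dvm_access_ok(uint64_t va, uint8_t need, uint64_t *pa_out)
{
    uint64_t page = va >> 12;
    if (page >= DVM_PAGES || (perm_table[page] & need) != need)
        return false;  /* protection is still fully enforced */
    *pa_out = va;      /* identity mapping: no TLB, no page table */
    return true;
}
```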
Efficient virtual memory for big memory servers
TLDR
This work proposes mapping part of a process's linear virtual address space with a direct segment while page-mapping the rest of the virtual address space, removing TLB miss overhead for big-memory workloads.
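A minimal C sketch of direct-segment translation using the paper's base/limit/offset registers; the struct layout and fallback path are illustrative:

```c
#include <stdbool.h>
#include <stdint.h>

/* The three direct-segment registers: virtual addresses in
 * [base, limit) translate by adding offset; all other addresses fall
 * back to conventional paging. */
typedef struct { uint64_t base, limit, offset; } direct_segment;

static bool ds_translate(const direct_segment *ds, uint64_t va,
                         uint64_t *pa_out)
{
    if (va >= ds->base && va < ds->limit) {
        *pa_out = va + ds->offset;  /* segment hit: no TLB miss possible */
        return true;
    }
    return false;  /* miss: take the normal page-mapped path */
}
```

Because the segment is one contiguous range, the common-case check is two comparisons and an add, with no TLB capacity pressure at all.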
...