A survey of techniques for architecting TLBs

@article{Mittal2017ASO,
  title={A survey of techniques for architecting TLBs},
  author={Sparsh Mittal},
  journal={Concurrency and Computation: Practice and Experience},
  year={2017},
  volume={29}
}
  • Sparsh Mittal
  • Published 25 May 2017
  • Computer Science
  • Concurrency and Computation: Practice and Experience
Summary: The translation lookaside buffer (TLB) caches virtual-to-physical address translation information and is used in systems ranging from embedded devices to high-end servers. [...] Key Method: We characterize the techniques across several dimensions to highlight their similarities and distinctions. We believe that this paper will be useful for chip designers, computer architects, and system engineers.
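As a concrete illustration of what the surveyed structure does, here is a minimal software sketch of a direct-mapped TLB. The 64-entry capacity, 4 KiB page size, and the `page_table` dict standing in for a hardware page-table walk are illustrative assumptions, not details from the survey.

```python
PAGE_SIZE = 4096    # assumed 4 KiB pages (a common default)
TLB_ENTRIES = 64    # assumed capacity for this sketch

class DirectMappedTLB:
    """Minimal direct-mapped TLB: virtual page number (VPN) -> physical frame number (PFN)."""

    def __init__(self):
        self.entries = [None] * TLB_ENTRIES  # each slot holds (vpn, pfn) or None

    def translate(self, vaddr, page_table):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        slot = vpn % TLB_ENTRIES
        entry = self.entries[slot]
        if entry is not None and entry[0] == vpn:
            pfn = entry[1]                    # TLB hit: no page-table walk needed
        else:
            pfn = page_table[vpn]             # TLB miss: "walk" the page table
            self.entries[slot] = (vpn, pfn)   # fill the slot for future accesses
        return pfn * PAGE_SIZE + offset
```

A second access to the same page hits in the TLB and never consults the page table, which is exactly the latency saving the survey's techniques aim to maximize.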
Improving Instruction TLB Reliability with Efficient Multi-bit Soft Error Protection
Abstract: A translation lookaside buffer (TLB) is a memory cache that stores recent virtual-to-physical address translations to reduce access latency. [...]
Fast TLB Simulation for RISC-V Systems
This paper presents a TLB simulation framework that allows rapid, flexible and versatile prototyping of various hardware TLB design choices, and enables validation, profiling and benchmarking of software running on RISC-V systems.
CoPTA: Contiguous Pattern Speculating TLB Architecture
This paper proposes CoPTA, a technique to speculate the memory address translation upon a TLB miss to hide the PTW latency and shows that the operating system has a tendency to map contiguous virtual memory pages to contiguous physical pages.
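The contiguity observation CoPTA exploits can be sketched as a simple speculation rule: on a TLB miss, guess the missing PFN from a cached neighboring translation, assuming the OS mapped contiguous virtual pages to contiguous physical pages. This is an illustrative helper, not the paper's actual hardware mechanism; the name `speculate_pfn` and the neighbor-search order are made up.

```python
def speculate_pfn(vpn, tlb_entries):
    """On a TLB miss for `vpn`, guess its PFN from a nearby cached translation,
    assuming contiguous virtual pages map to contiguous physical pages.
    `tlb_entries` is a dict vpn -> pfn of currently cached translations.
    Returns a speculative PFN, or None if no nearby entry exists."""
    for delta in (1, -1, 2, -2):          # check the closest neighbors first
        neighbor = vpn + delta
        if neighbor in tlb_entries:
            # contiguity assumption: PFN offsets mirror VPN offsets
            return tlb_entries[neighbor] - delta
    return None
```

The speculative translation lets execution proceed while the page-table walk runs in the background; the walk's result then confirms or squashes the guess.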
Architecting HBM as a high bandwidth, high capacity, self-managed last-level cache
This paper designs a last-level, stacked DRAM cache that is practical for real-world systems and takes advantage of High Bandwidth Memory (HBM), and introduces novel tag/data storage that enables faster lookups, associativity, and more capacity than previous designs.
Ptlbmalloc2: Reducing TLB Shootdowns with High Memory Efficiency
  • Stijn Schildermans, Kris Aerts, Jianchen Shan, X. Ding
  • Computer Science
    2020 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom)
  • 2020
Ptlbmalloc2 outperforms glibc by up to 70% in terms of cycles and execution time with a negligible impact on memory efficiency for real-world workloads, providing a strong incentive to rethink memory allocator scalability in the current era of many-core NUMA systems and cloud computing.
Big data causing big (TLB) problems: taming random memory accesses on the GPU
A TLB-conscious approach to mitigate the slowdown for algorithms with irregular memory access is proposed, applied to two fundamental database operations - random sampling and hash-based grouping - showing that the slowdown can be dramatically reduced, and resulting in a performance increase of up to 13×.
Efficient TCAM design based on dual port SRAM on FPGA
This paper presents a 480 × 104-bit SRAM-based TCAM design on an Altera Cyclone IV FPGA, achieving a lookup rate of over 150 million input search words and an update speed of 75 million rules per second.
Runtime Data Management on Non-Volatile Memory-based Heterogeneous Memory for Task-Parallel Programs
  • Kai Wu, J. Ren, Dong Li
  • Computer Science
    SC18: International Conference for High Performance Computing, Networking, Storage and Analysis
  • 2018
This paper studies task-parallel programs and introduces a runtime system to address the data placement problem on NVM-based heterogeneous memory systems (HMS), along with a performance model to predict the performance of tasks with various data placements on HMS.
D-TCAM: A High-Performance Distributed RAM Based TCAM Architecture on FPGAs
A novel TCAM architecture, the distributed RAM based TCAM (D-TCAM), using D-CAM as a building block is presented, which improves throughput by 58.8% without any additional hardware cost.
Improving application timing predictability and caching performance on multi-core systems
This dissertation focuses on shared cache interference and investigates two issues raised by the increasing complexity of underlying hardware and software for multi-core systems: timing predictability of real-time computing and caching performance for high performance computing.

References

Showing 1–10 of 98 references
DiDi: Mitigating the Performance Impact of TLB Shootdowns Using a Shared TLB Directory
This paper characterizes the impact of TLB shootdowns on multiprocessor performance and scalability, and presents the design of a scalable TLB coherency mechanism that couples a shared TLB directory with load/store-queue support for lightweight TLB invalidation, thereby eliminating the need for costly IPIs.
Inter-core cooperative TLB for chip multiprocessors
This work is the first to present TLB prefetchers that exploit commonality in TLB miss patterns across cores in CMPs, and shows that TLB prefetchers exploiting inter-core correlations can effectively eliminate TLB misses.
B2P2: bounds based procedure placement for instruction TLB power reduction in embedded systems
The code placement problem of minimizing page switches in a program is formulated and proved NP-complete, and an efficient Bounds Based Procedure Placement (B2P2) heuristic is proposed to reduce the program's page switches.
Synergistic TLBs for High Performance Address Translation in Chip Multiprocessors
This paper proposes to improve system performance by means of a novel way of organizing TLBs called Synergistic TLBs, and finds that an optimal point exists for high performance address translation.
Code Transformations for TLB Power Reduction
This paper proposes compiler techniques (specifically, instruction and operand reordering, array interleaving, and loop unrolling) to reduce page switches in data accesses, resulting in an average 39% reduction in data-TLB page switching.
Recency-based TLB preloading
A novel TLB miss prediction algorithm based on the concept of "recency" is presented, and it is shown that it can predict over 55% of the TLB misses for the five commercial applications considered.
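The "recency" idea can be sketched as follows: keep missed pages in a recency-ordered list, and when a page misses again, preload its neighbors in that list, since pages referenced around the same time previously tend to miss together again. This is an illustrative sketch of the concept, not the paper's algorithm; the class name and list representation are assumptions.

```python
class RecencyPreloader:
    """On each TLB miss, record the missed page in a recency list and suggest
    preloading its neighbors in that list (pages referenced around the same
    time in the past), per the recency-based preloading idea."""

    def __init__(self):
        self.order = []  # recency list, most recently missed page last

    def on_miss(self, vpn):
        prefetch = []
        if vpn in self.order:
            i = self.order.index(vpn)
            # recency neighbors are likely to miss again soon
            if i > 0:
                prefetch.append(self.order[i - 1])
            if i < len(self.order) - 1:
                prefetch.append(self.order[i + 1])
            self.order.remove(vpn)
        self.order.append(vpn)
        return prefetch
```

A real implementation would thread the recency order through the page-table entries themselves rather than keep a separate list, but the prediction rule is the same.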
A simulation based study of TLB performance
The amount of memory mapped was found to be the dominant factor in TLB performance, and small first-level FIFO instruction TLBs can be effective in two level TLB configurations.
Generating physical addresses directly for saving instruction TLB energy
Four different approaches for reducing the number of accesses to the instruction TLB (iTLB) for power and performance optimizations are proposed, and one of these schemes that uses a combination of compiler and hardware enhancements can reduce iTLB dynamic power by over 85% in most cases.
Uniprocessor Virtual Memory without TLBs
A feasibility study of performing virtual address translation without specialized translation hardware is presented; trace-driven simulations show that software-managed address translation is just as efficient as hardware-managed address translation.