Stash: Have your scratchpad and cache it too

@inproceedings{Komuravelli2015StashHY,
  title={Stash: Have your scratchpad and cache it too},
  author={Rakesh Komuravelli and Matthew D. Sinclair and Johnathan Alsop and Muhammad Huzaifa and Maria Kotsifakou and Prakalp Srivastava and Sarita V. Adve and Vikram S. Adve},
  booktitle={2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)},
  year={2015},
  pages={707--719}
}
Heterogeneous systems employ specialization for energy efficiency. Since data movement is expected to be a dominant consumer of energy, these systems use specialized memories (e.g., scratchpads and FIFOs) to improve efficiency for the data they target. These memory structures, however, tend to exist in local address spaces, incurring significant performance and energy penalties due to inefficient data movement between the global and private spaces. We propose an efficient heterogeneous memory…
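
As a concrete picture of the global-versus-private data movement the abstract describes, here is a minimal CUDA sketch (the kernel and variable names are invented for illustration, not from the paper) that stages data from global memory into a scratchpad (CUDA shared memory) before computing on it; the explicit copies in and out are exactly the traffic Stash aims to avoid.

    #include <cstdio>

    #define TILE 256

    // Hypothetical kernel: scales a vector, staging each tile through the
    // on-chip scratchpad (CUDA shared memory). The explicit copies between
    // the global and private address spaces are the overhead Stash targets.
    __global__ void scale(const float* in, float* out, float k, int n) {
        __shared__ float tile[TILE];               // scratchpad: private address space
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) tile[threadIdx.x] = in[i];      // global -> scratchpad copy
        __syncthreads();
        if (i < n) out[i] = k * tile[threadIdx.x]; // compute, then copy back to global
    }

    int main() {
        const int n = 1 << 20;
        float *in, *out;
        cudaMallocManaged(&in, n * sizeof(float));
        cudaMallocManaged(&out, n * sizeof(float));
        for (int i = 0; i < n; ++i) in[i] = float(i);
        scale<<<(n + TILE - 1) / TILE, TILE>>>(in, out, 2.0f, n);
        cudaDeviceSynchronize();
        printf("out[7] = %f\n", out[7]);           // expect 14.0
        cudaFree(in); cudaFree(out);
    }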

LMStr: exploring shared hardware controlled scratchpad memory for multicores

TLDR
LMStr is a special kind of scratchpad memory (SPM) shared among the cores of a multicore processor; it can be used alongside a regular cache hierarchy or on its own as a redesigned SPM.

Local memory store (LMStr): A hardware controlled shared scratchpad for multicores

  • N. Siddique, Abdel-Hameed A. Badawy, Jeanine E. Cook, D. Resnick
  • Computer Science
    2017 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computed, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI)
  • 2017
TLDR
LMStr is a special kind of scratchpad memory (SPM) shared among the cores of a multicore processor; it can be used alongside a regular cache hierarchy or on its own as a redesigned SPM.

APRES: Improving Cache Efficiency by Exploiting Load Characteristics on GPUs

TLDR
Adaptive PREfetching and Scheduling (APRES) is proposed to improve GPU cache efficiency; it achieves a 31.7% performance improvement over the baseline GPU and a 7.2% additional speedup over the best combination of existing warp scheduling and prefetching methods.

SPX64: A Scratchpad Memory for General-purpose Microprocessors

TLDR
This paper adds a virtually addressed, set-associative scratchpad to a general-purpose CPU; the scratchpad exists alongside a traditional cache and avoids many of the programming challenges associated with traditional scratchpads without sacrificing generality.
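
To make "virtually addressed, set-associative" concrete, the sketch below computes a set index and tag directly from a virtual address, with no translation on the lookup path; the geometry (64 B lines, 64 sets) is an assumption for illustration, not SPX64's actual configuration.

    #include <cstdint>
    #include <cstdio>

    // Assumed geometry for illustration: 64 B lines, 64 sets (not SPX64's real config).
    constexpr uint64_t LINE_BITS = 6;   // log2(64 B line)
    constexpr uint64_t SET_BITS  = 6;   // log2(64 sets)

    // Set-associative indexing on the *virtual* address: no TLB on the lookup path.
    void lookup(uint64_t vaddr) {
        uint64_t set = (vaddr >> LINE_BITS) & ((1ull << SET_BITS) - 1);
        uint64_t tag = vaddr >> (LINE_BITS + SET_BITS);
        printf("vaddr 0x%llx -> set %llu, tag 0x%llx\n",
               (unsigned long long)vaddr, (unsigned long long)set,
               (unsigned long long)tag);
    }

    int main() { lookup(0x7f00dead1040ull); }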

SDC: a software defined cache for efficient data indexing

TLDR
By giving a program the ability to explicitly use the cache as a lookaside key-value buffer, SDC enables a much more efficient cache without disruptively changing the existing cache organization and without substantially increasing hardware cost.
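
A rough sketch of the usage pattern SDC enables follows; sdc_put, get, and the maps are invented stand-ins (not SDC's real interface), with an ordinary hash map simulating the lookaside key-value buffer that is probed before the full index.

    #include <cstdint>
    #include <cstdio>
    #include <unordered_map>

    // Invented stand-ins for SDC's hardware key-value buffer: an ordinary map
    // simulates the semantics of a lookaside buffer probed before the full index.
    std::unordered_map<uint64_t, uint64_t> sdc;        // simulated in-cache buffer
    std::unordered_map<uint64_t, uint64_t> full_index; // slow, complete index

    void sdc_put(uint64_t key, uint64_t val) { sdc[key] = val; }

    uint64_t get(uint64_t key) {
        if (auto it = sdc.find(key); it != sdc.end())  // fast path: buffer hit
            return it->second;
        uint64_t val = full_index.at(key);             // slow path: full lookup
        sdc_put(key, val);                             // install for next time
        return val;
    }

    int main() {
        full_index[42] = 1337;
        printf("%llu\n", (unsigned long long)get(42)); // miss, installs entry
        printf("%llu\n", (unsigned long long)get(42)); // hit
    }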

ShaVe-ICE

TLDR
ShaVe-ICE, an operating-system-level solution with hardware support, is proposed to virtualize and ultimately share SPM resources across a many-core embedded system to reduce average memory latency; a number of simple allocation policies for improving performance and energy are also presented.

Analyzing and Leveraging Shared L1 Caches in GPUs

TLDR
Lightweight communication optimization techniques are developed, along with a run-time mechanism that considers applications' latency-tolerance characteristics to decide which applications should execute under a private versus a shared L1 cache organization and reconfigures the caches accordingly.

Rethinking the Memory Hierarchy for Modern Languages

TLDR
Hotpads, a new memory hierarchy designed from the ground up for modern, memory-safe languages like Java, Go, and Rust, is presented; it improves memory performance and efficiency substantially and unlocks many new optimizations.
...

References

Showing 1-10 of 48 references

An optimal memory allocation scheme for scratch-pad-based embedded systems

TLDR
This article presents a compiler strategy that automatically partitions the data among the memory units, and shows that this strategy is optimal, relative to the profile run, among all static partitions for global and stack data.
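
The flavor of such profile-driven partitioning can be shown with a tiny greedy allocator. This is a simplified sketch with invented profile numbers; the cited article formulates the placement problem exactly and solves it optimally, which a greedy pass in general does not.

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    struct Var { const char* name; int bytes; long accesses; };

    // Greedy stand-in for profile-driven SPM partitioning: place variables with
    // the highest accesses-per-byte density into the scratchpad until it fills.
    int main() {
        std::vector<Var> vars = {          // invented profile data
            {"hist",   4096, 900000},
            {"coeffs",  256, 500000},
            {"buf",    8192, 100000},
            {"table",  1024, 300000},
        };
        int spm_free = 4096;               // assumed 4 KiB scratchpad
        std::sort(vars.begin(), vars.end(), [](const Var& a, const Var& b) {
            return (double)a.accesses / a.bytes > (double)b.accesses / b.bytes;
        });
        for (const Var& v : vars) {
            if (v.bytes <= spm_free) { spm_free -= v.bytes; printf("SPM:  %s\n", v.name); }
            else                     { printf("DRAM: %s\n", v.name); }
        }
    }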

Reducing memory reference energy with opportunistic virtual caching

TLDR
An Opportunistic Virtual Cache is proposed that exposes virtual caching as a dynamic optimization by allowing some memory blocks to be cached with virtual addresses and others with physical addresses, and saves 94-99% of TLB lookup energy and nearly 23% of L1 cache dynamic lookup energy.

Dymaxion: Optimizing memory access patterns for heterogeneous systems

  • Shuai Che, J. Sheaffer, K. Skadron
  • Computer Science
    2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)
  • 2011
TLDR
This paper proposes a simple API, Dymaxion, that allows programmers to optimize memory mappings to improve the efficiency of memory accesses on heterogeneous platforms; it achieves a 3.3× speedup on GPU kernels and a 20% overall performance improvement, including the PCI-E transfer, over the original CUDA implementations on an NVIDIA GTX 480 GPU.
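
The kind of remapping Dymaxion automates can be illustrated with the classic array-of-structs to struct-of-arrays transformation, which turns strided GPU loads into coalesced ones. This is a hedged sketch: Dymaxion's actual API and its overlap of remapping with PCI-E transfer are not shown.

    #include <cstdio>

    struct Particle { float x, y, z, w; };     // AoS layout: per-field access is strided

    // After remapping to SoA, consecutive threads read consecutive floats,
    // so loads coalesce into full-width memory transactions.
    __global__ void scale_x(const float* xs, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 2.0f * xs[i];      // coalesced: thread i reads xs[i]
    }

    int main() {
        const int n = 1024;
        Particle* aos; float *xs, *out;
        cudaMallocManaged(&aos, n * sizeof(Particle));
        cudaMallocManaged(&xs,  n * sizeof(float));
        cudaMallocManaged(&out, n * sizeof(float));
        for (int i = 0; i < n; ++i) aos[i] = {float(i), 0, 0, 0};
        for (int i = 0; i < n; ++i) xs[i] = aos[i].x;  // AoS -> SoA remap
        scale_x<<<n / 256, 256>>>(xs, out, n);
        cudaDeviceSynchronize();
        printf("out[3] = %f\n", out[3]);       // expect 6.0
    }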

TLC: A tag-less cache for reducing dynamic first level cache energy

TLDR
A new cache design adds way-index information to the TLB, reducing the dynamic energy of a 32 kB, 8-way cache by 78% compared to a VIPT cache without affecting performance.

Dynamic allocation for scratch-pad memory using compile-time decisions

TLDR
This research proposes a dynamic allocation methodology for global and stack data and program code that accounts for changing program requirements at runtime, has no software-caching tags, requires no runtime checks, has extremely low overheads, and yields 100% predictable memory access times.

Compiler-decided dynamic memory allocation for scratch-pad based embedded systems

TLDR
A dynamic allocation method for global and stack data that accounts for changing program requirements at runtime, has no software-caching tags, requires no run-time checks, has extremely low overheads, and yields 100% predictable memory access times is presented.

QuickRelease: A throughput-oriented approach to release consistency on GPUs

TLDR
QuickRelease (QR) improves on conventional GPU memory systems in two ways: it uses a FIFO to enforce the partial order of writes so that synchronization operations can complete without frequent cache flushes, and it provides a throughput-oriented solution for fine-grain synchronization on GPUs.
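
A single-threaded toy simulation of that FIFO idea follows, purely as an assumption-laden illustration (WriteFIFO and its methods are invented, not QR's hardware): ordinary stores are buffered, and a release only drains the buffer in order instead of flushing a cache.

    #include <cstdio>
    #include <queue>
    #include <utility>

    // Toy model: ordinary stores are buffered in a FIFO, and a release drains
    // it in program order, so no cache flush is needed to make prior writes
    // visible before the synchronization write.
    struct WriteFIFO {
        std::queue<std::pair<int*, int>> pending;   // (address, value), in program order
        void buffer_store(int* addr, int val) { pending.push({addr, val}); }
        void drain() {                              // release: complete writes in order
            while (!pending.empty()) {
                auto [addr, val] = pending.front(); pending.pop();
                *addr = val;
            }
        }
    };

    int main() {
        int data = 0, flag = 0;
        WriteFIFO fifo;
        fifo.buffer_store(&data, 42);  // ordinary store: buffered, not yet visible
        fifo.drain();                  // release boundary: drain FIFO, no flush
        flag = 1;                      // synchronization write now safe to publish
        printf("data=%d flag=%d\n", data, flag);
    }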

Scratchpad memory: a design alternative for cache on-chip memory in embedded systems

TLDR
The results clearly establish scratchpad memory as a low-power alternative in most situations, with an average energy reduction of 40% and an average area-time reduction of 46% relative to the cache.

D2MA: Accelerating coarse-grained data transfer for GPUs

TLDR
D2MA is a reimagination of traditional DMA that addresses the challenges of extending DMA to thousands of concurrently executing threads: it provides a more direct and efficient path for data to travel from global memory into shared memory and introduces a novel dynamic synchronization scheme that is transparent to the programmer.
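
D2MA predates it, but CUDA later exposed a similarly direct global-to-shared transfer path through cooperative_groups::memcpy_async; the sketch below uses that public API as a stand-in, and the analogy to D2MA is mine, not the paper's.

    #include <cooperative_groups.h>
    #include <cooperative_groups/memcpy_async.h>
    #include <cstdio>
    namespace cg = cooperative_groups;

    #define TILE 256

    // The async copy streams a tile from global memory into shared memory
    // without staging each element through registers, in the spirit of
    // D2MA's direct global->scratchpad path.
    __global__ void sum_tile(const int* in, int* out) {
        __shared__ int tile[TILE];
        auto block = cg::this_thread_block();
        cg::memcpy_async(block, tile, in, sizeof(tile)); // DMA-like bulk copy
        cg::wait(block);                                 // wait for completion
        if (threadIdx.x == 0) {
            int s = 0;
            for (int i = 0; i < TILE; ++i) s += tile[i];
            *out = s;
        }
    }

    int main() {
        int *in, *out;
        cudaMallocManaged(&in, TILE * sizeof(int));
        cudaMallocManaged(&out, sizeof(int));
        for (int i = 0; i < TILE; ++i) in[i] = 1;
        sum_tile<<<1, TILE>>>(in, out);
        cudaDeviceSynchronize();
        printf("sum = %d\n", *out);                      // expect 256
    }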

Memory allocation for embedded systems with a compile-time-unknown scratch-pad size

TLDR
This work presents a compiler method whose resulting executable is portable across SPMs of any size, and shows that the overhead from the embedded loader averages about 1% in both code size and run time across the benchmarks.