Prototyping a Configurable Cache/Scratchpad Memory with Virtualized User-Level RDMA Capability

@article{Kalokerinos2019PrototypingAC,
  title={Prototyping a Configurable Cache/Scratchpad Memory with Virtualized User-Level RDMA Capability},
  author={G. Kalokerinos and V. Papaefstathiou and George Nikiforos and S. Kavadias and Xiaojun Yang and D. Pnevmatikatos and M. Katevenis},
  journal={Trans. High Perform. Embed. Archit. Compil.},
  year={2019},
  volume={5},
  pages={100-120}
}
We present the hardware design and implementation of a local memory system for individual processors inside future chip multiprocessors (CMP). [...] The processor interacts with the NI at user level through virtualized command areas in scratchpad; the NI uses a similar access mechanism to provide efficient support for two hardware synchronization primitives: counters and queues. We describe the NI design, the hardware cost, and the latencies of our FPGA-based prototype implementation that…
A Reconfigurable Cache for Efficient Use of Tag RAM as Scratch-Pad Memory
TLDR
Proposes a cache organization, called the Tag-SPM architecture, which allows the tag RAM to be used as scratch-pad memory (SPM), increasing SPM capacity in a highly cost-effective way.
A Reconfigurable Cache for Efficient Use of Tag RAM as Scratch-Pad Memory
TLDR
This paper presents a cache organization, called the Tag-SPM architecture, which allows the tag RAM to be used as SPM and thus increases SPM capacity. This is accomplished with small Tag/Data-SPM controllers and four additional multiplexers in the cache organization.

References

On-chip communication and synchronization mechanisms with cache-integrated network interfaces
TLDR
This paper introduces event responses, as a mechanism for software configurable synchronization primitives, and presents three event response mechanisms that expose NI functionality to software, for multiword transfer initiation, memory barriers for explicitly-selected accesses of arbitrary size, and multi-party synchronization queues.
SEEN AS LOAD-STORE INSTRUCTION GENERALIZATION
This paper presents the current (2007) author’s views and opinions on interprocessor communication (IPC) and how it should evolve in future multiprocessors, with an attempt to define an IPC…
Scratchpad memory: a design alternative for cache on-chip memory in embedded systems
TLDR
The results clearly establish scratchpad memory as a low-power alternative in most situations, with an average energy reduction of 40%; the average area-time reduction for the scratchpad memory was 46% of the cache memory.
Integration of message passing and shared memory in the Stanford FLASH multiprocessor
TLDR
This paper presents the hardware and software mechanisms in FLASH to support various message passing protocols and provides an integrated solution that handles the interaction of the messaging protocols with virtual memory, protected multiprogramming, and cache coherence.
Anatomy of a message in the Alewife multiprocessor
TLDR
The Alewife machine, a shared-memory multiprocessor being built at MIT, provides a message-passing interface that affords direct, user-level access to the network queues, supports an efficient DMA mechanism, and includes fast trap handling for message reception.
Distributed Microarchitectural Protocols in the TRIPS Prototype Processor
TLDR
This paper describes the control protocols in the TRIPS processor, a distributed, tiled microarchitecture that supports dynamic execution. It describes each of the five types of reused tiles that compose the processor, the control and data networks that connect them, and the distributed microarchitectural protocols that implement instruction fetch, execution, flush, and commit.
Producer-consumer communication in distributed shared memory multiprocessors
TLDR
StreamLine, a cache based message passing mechanism, provides the best performance on the benchmarks with regular communication patterns, and forwarding write and cache based locks are also among the best performing producer initiated mechanisms.
Smart Memories: a modular reconfigurable architecture
TLDR
Simulations of the mappings show that the Smart Memories architecture can successfully map two very different machines at opposite ends of the architectural spectrum, the Imagine stream processor and the Hydra speculative multiprocessor, with only modest performance degradation.
Telegraphos: high-performance networking for parallel processing on workstation clusters
  • E. Markatos, M. Katevenis
  • Computer Science
  • Proceedings. Second International Symposium on High-Performance Computer Architecture
  • 1996
TLDR
This paper presents Telegraphos, a distributed system that provides efficient shared-memory support on top of a workstation cluster. It offers a variety of shared-memory operations, such as remote reads, remote writes, and remote atomic operations, all launched from user level without any intervention of the operating system.
Remote queues: exposing message queues for optimization and atomicity
TLDR
Remote Queues is introduced, a communication model that integrates polling with selective interrupts to support a wide range of applications and communication paradigms and provides atomicity guarantees that greatly simplify programming for the user.