Prototyping a Configurable Cache/Scratchpad Memory with Virtualized User-Level RDMA Capability

  title={Prototyping a Configurable Cache/Scratchpad Memory with Virtualized User-Level RDMA Capability},
  author={George Kalokerinos and Vassilis D. Papaefstathiou and George Nikiforos and Stamatis G. Kavadias and Xiaojun Yang and Dionisios N. Pnevmatikatos and Manolis G. H. Katevenis},
  journal={Trans. High Perform. Embed. Archit. Compil.},
We present the hardware design and implementation of a local memory system for individual processors inside future chip multiprocessors (CMP). [...] The processor interacts with the NI (network interface) at user level through virtualized command areas in scratchpad; the NI uses a similar access mechanism to provide efficient support for two hardware synchronization primitives: counters and queues. We describe the NI design, the hardware cost, and the latencies of our FPGA-based prototype implementation that…

A Reconfigurable Cache for Efficient Use of Tag RAM as Scratch-Pad Memory

A cache organization, called the Tag-SPM architecture, allows the tag RAM to be used as scratch-pad memory (SPM), which increases SPM capacity and is a highly cost-effective way to boost the SPM space.

A Reconfigurable Cache for Efficient Use of Tag RAM as Scratch-Pad Memory

This paper presents a cache organization, called the Tag-SPM architecture, which allows the tag RAM to be used as the SPM and thus increases its capacity; this is accomplished with small Tag/Data-SPM controllers and four additional multiplexers in the cache organization.



On-chip communication and synchronization mechanisms with cache-integrated network interfaces

This paper introduces event responses, as a mechanism for software configurable synchronization primitives, and presents three event response mechanisms that expose NI functionality to software, for multiword transfer initiation, memory barriers for explicitly-selected accesses of arbitrary size, and multi-party synchronization queues.


An attempt to define an IPC architecture that is uniformly extensible from small-scale chip multiprocessors (CMP) to large-scale multi-chip parallel systems, outlining a unifying architecture for high-performance IPC at both the small and the large scale.

Scratchpad memory: a design alternative for cache on-chip memory in embedded systems

The results clearly establish scratchpad memory as a low-power alternative in most situations, with an average energy reduction of 40%; the average area-time reduction for the scratchpad memory was 46% of the cache memory.

Coherent Network Interfaces for Fine-Grain Communication

This paper begins an exploration of network interfaces (NIs) that use coherence, called coherent network interfaces (CNIs), to improve communication performance. The study is restricted to NI/CNIs that reside on coherent memory or I/O buses, to NIs that are much simpler than processors, and to the performance of fine-grain messaging from user process to user process.

Integration of message passing and shared memory in the Stanford FLASH multiprocessor

This paper presents the hardware and software mechanisms in FLASH to support various message passing protocols and provides an integrated solution that handles the interaction of the messaging protocols with virtual memory, protected multiprogramming, and cache coherence.

Anatomy of a message in the Alewife multiprocessor

The Alewife machine, a shared-memory multiprocessor being built at MIT, provides a message-passing interface that affords direct, user-level access to the network queues, supports an efficient DMA mechanism, and includes fast trap handling for message reception.

Distributed Microarchitectural Protocols in the TRIPS Prototype Processor

This paper describes the control protocols in the TRIPS processor, a distributed, tiled microarchitecture that supports dynamic execution. It describes each of the five types of reused tiles that compose the processor, the control and data networks that connect them, and the distributed microarchitectural protocols that implement instruction fetch, execution, flush, and commit.

Producer-consumer communication in distributed shared memory multiprocessors

StreamLine, a cache-based message passing mechanism, provides the best performance on the benchmarks with regular communication patterns, and forwarding write and cache-based locks are also among the best-performing producer-initiated mechanisms.

Smart Memories: a modular reconfigurable architecture

Simulations of the mappings show that the Smart Memories architecture can successfully map two very different machines at opposite ends of the architectural spectrum, the Imagine stream processor and the Hydra speculative multiprocessor, with only modest performance degradation.

Telegraphos: high-performance networking for parallel processing on workstation clusters

  • E. Markatos, M. Katevenis
  • Computer Science
  • Proceedings. Second International Symposium on High-Performance Computer Architecture
  • 1996
This paper presents Telegraphos, a distributed system that provides efficient shared-memory support on top of a workstation cluster. It offers a variety of shared-memory operations, such as remote reads, remote writes, and remote atomic operations, all launched from user level without any intervention of the operating system.