Corpus ID: 18141824

Avalanche: A Communication and Memory Architecture for Scalable Parallel Computing

J. Carter, A. Davis, R. Kuramkote, E. R. Panier, and L. Stoller
As the gap between processor and memory speeds widens, system designers will inevitably incorporate increasingly deep memory hierarchies to maintain the balance between processor and memory system performance. At the same time, most communication subsystems are permitted access only to main memory, and not to a processor's top-level cache. As memory latencies increase, this lack of integration between the memory and communication systems will seriously impede interprocessor communication performance…
Design alternatives for shared memory multiprocessors
The paper concludes with several recommendations to designers of next-generation DSM machines, complete with a discussion of the issues that led to each recommendation so that designers can decide which ones are relevant to them given changes in technology and corporate priorities.
Distributed Shared Memory: A Review
A new abstraction of shared memory on a distributed system, commonly known as Distributed Shared Memory (DSM), is proposed; it combines the best of the two original models by providing the illusion of a large "shared" memory that extends across machine boundaries.
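The "illusion of a large shared memory" can be made concrete with a toy sketch. The following is an invented illustration of the general DSM idea, not any system described here: each node keeps a local page cache, a read of a missing page fetches it from the owning node, and writes go through the owner and invalidate cached copies. All names and the protocol details are assumptions for illustration only.

```python
# Toy DSM illustration (invented, not any particular system): each node has a
# local page cache; a read miss fetches the page from its owner, and writes
# update the owner and invalidate other nodes' cached copies.
class Node:
    def __init__(self, nid, nodes, directory):
        self.nid = nid
        self.nodes = nodes           # every node in the "machine"
        self.directory = directory   # page number -> owning node
        self.pages = {}              # locally cached pages

    def read(self, page, offset):
        if page not in self.pages:                       # local miss: "page fault"
            owner = self.directory[page]
            self.pages[page] = owner.pages[page].copy()  # fetch over the "network"
        return self.pages[page][offset]

    def write(self, page, offset, value):
        owner = self.directory[page]
        owner.pages[page][offset] = value                # single writer: update owner
        for node in self.nodes:
            if node is not owner:
                node.pages.pop(page, None)               # invalidate stale copies

nodes, directory = [], {}
a = Node(0, nodes, directory)
b = Node(1, nodes, directory)
nodes.extend([a, b])

a.pages[0] = [0] * 4096   # node a owns page 0
directory[0] = a
b.write(0, 100, 7)        # b writes through the owner; a.read and b.read now see 7
```

Real DSM systems differ in exactly these choices: invalidate versus update on write, page versus cache-line granularity, and how strictly the memory model orders the operations.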
Extending the reach of microprocessors: column and curious caching
This thesis motivates column and curious caching through high-performance communication, evaluates these adaptive mechanisms for communication and other uses, and proposes implementations designed for different constraints, demonstrating how these simple mechanisms enable substantial performance improvements and support a wide range of additional functionality.
Design and verification of adaptive cache coherence protocols
This dissertation illustrates the use of TRSs by giving the operational semantics of a simple instruction set, and a processor that implements the same instruction set on a micro-architecture that allows register renaming and speculative execution.
Issues in Multiprocessor Memory Consistency Protocol Design and Verification
This work provides a glimpse of the complexity faced in DSM protocol design and describes work in progress on addressing these issues through model checking and synthesis, using recently proposed challenge problems to drive the research.
Effects Of Communication Latency, Overhead, And Bandwidth In A Cluster Architecture
Most applications in this study demonstrate a highly linear dependence on both overhead and per-message bandwidth, indicating that further improvements in communication performance will continue to improve application performance.
Architectural enhancement for message passing interconnects
Research in high-performance architecture has been focusing on achieving more computing power to solve computationally-intensive problems. Advancements in the processor industry are not applicable in…
MP-LOCKs: replacing H/W synchronization primitives with message passing
It is argued that synchronization operations implemented using fast message passing and kernel-embedded lock managers are an attractive alternative to dedicated synchronization hardware and should be considered as a replacement for hardware locks in future scalable multiprocessors that support efficient message passing mechanisms.
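The core idea of replacing hardware locks with a lock manager reached via messages can be sketched in a few lines. This is a minimal illustration of the concept using Python threads and queues, not the MP-LOCKs implementation; the message names and protocol are assumptions made for the example.

```python
# Illustrative sketch (not the MP-LOCKs implementation): a lock realized as
# ACQUIRE/RELEASE messages to a lock-manager thread, with no hardware atomics
# used for the lock itself.
import threading, queue

class LockManager:
    def __init__(self):
        self.inbox = queue.Queue()        # all lock traffic arrives as messages
        threading.Thread(target=self._serve, daemon=True).start()

    def _serve(self):
        holder, waiters = None, []
        while True:
            op, reply = self.inbox.get()
            if op == "ACQUIRE":
                if holder is None:
                    holder = reply
                    reply.put("GRANTED")  # lock free: grant immediately
                else:
                    waiters.append(reply) # otherwise queue the requester, FIFO
            elif op == "RELEASE":
                if waiters:
                    holder = waiters.pop(0)
                    holder.put("GRANTED") # hand the lock to the next waiter
                else:
                    holder = None

    def acquire(self):
        reply = queue.Queue()
        self.inbox.put(("ACQUIRE", reply))
        reply.get()                       # block until the manager grants the lock

    def release(self):
        self.inbox.put(("RELEASE", None))

# Usage: a shared counter protected only by the message-passing lock.
mgr = LockManager()
counter = 0

def worker():
    global counter
    for _ in range(1000):
        mgr.acquire()
        counter += 1                      # critical section
        mgr.release()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
```

Because the manager serves one message at a time, mutual exclusion falls out of the message ordering; the trade-off the paper weighs is this round-trip cost against the cost of dedicated synchronization hardware.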
A System Software Architecture for High End Computing
This work presents a new system architecture used at Sandia, a lightweight applications interface to a collection of processing nodes that allows data to flow between processing nodes with minimal system overhead while maintaining a suitable degree of protection and reconfigurability.
Reactive NUMA: A Design For Unifying S-COMA And CC-NUMA
  • B. Falsafi and D. Wood
  • Proceedings of the 24th Annual International Symposium on Computer Architecture, 1997
This paper proposes and evaluates a new approach to directory-based cache coherence protocols called Reactive NUMA (R-NUMA). An R-NUMA system combines a conventional CC-NUMA coherence protocol with a…


The Stanford FLASH multiprocessor
The FLASH multiprocessor efficiently integrates support for cache-coherent shared memory and high-performance message passing, while minimizing both hardware and software overhead. Each node in FLASH…
The Alewife architecture is described, concentrating on the novel hardware features of the machine, including LimitLESS directories and the rapid-context-switching processor.
Software-extended coherent shared memory: performance and cost
An evaluation of the tradeoffs involved in the design of the software-extended memory system of Alewife, a multiprocessor architecture that implements coherent shared memory through a combination of hardware and software mechanisms, shows that a small amount of shared-memory hardware provides adequate performance.
Tempest and Typhoon: user-level shared memory
Future parallel computers must efficiently execute not only hand-coded applications but also programs written in high-level, parallel programming languages. Today's machines limit these programs to a…
Hiding Shared Memory Reference Latency on the Galactica Net Distributed Shared Memory Architecture
Preliminary performance evaluations indicate that together the latency hiding mechanisms employed by the Galactica Net scalable distributed shared memory architecture are able to hide a significant amount of the memory reference latency, thus increasing the scalability of the architecture.
The performance impact of flexibility in the Stanford FLASH multiprocessor
This paper compares the performance of FLASH to that of an idealized hardwired machine on representative parallel applications and a multiprogramming workload, and finds that for a range of optimized parallel applications the performance differences between the idealized machine and FLASH are small.
Avoiding conflict misses dynamically in large direct-mapped caches
Using trace-driven simulation of applications and the operating system, it is shown that a CML buffer enables a large direct-mapped cache to perform nearly as well as a two-way set associative cache of equivalent size and speed, although with lower hardware cost and complexity.
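The conflict misses that a CML buffer targets are easy to reproduce with a toy trace-driven simulation. The sketch below is an invented illustration (not the paper's simulator or its CML mechanism): it counts misses for a direct-mapped cache versus a two-way set-associative cache of the same total size, on a synthetic trace where two arrays alias the same cache sets. The sizes and the trace are assumptions chosen to make the ping-pong effect visible.

```python
# Illustrative trace-driven sketch (not the paper's simulator): conflict misses
# in a direct-mapped cache vs. a two-way set-associative cache of equal size.
from collections import OrderedDict

LINE = 32          # bytes per cache line
LINES = 1024       # total lines in each cache (same capacity for both)

def misses(trace, ways):
    """Simulate an LRU cache with the given associativity; return miss count."""
    sets = LINES // ways
    cache = [OrderedDict() for _ in range(sets)]   # one LRU dict per set
    miss = 0
    for addr in trace:
        tag = addr // LINE
        s = cache[tag % sets]
        if tag in s:
            s.move_to_end(tag)                     # LRU update on a hit
        else:
            miss += 1
            s[tag] = True
            if len(s) > ways:
                s.popitem(last=False)              # evict least recently used
    return miss

# Two arrays one cache-size apart alias the same sets: they ping-pong in a
# direct-mapped cache but coexist in a two-way cache. Two passes over the
# interleaved trace expose the difference after the compulsory misses.
stride = LINES * LINE
trace = [base + i * LINE
         for _ in range(2) for i in range(256) for base in (0, stride)]

dm = misses(trace, ways=1)      # every access misses: 1024
twoway = misses(trace, ways=2)  # only first-pass compulsory misses: 512
```

With equal capacity, the direct-mapped cache misses on every access of the second pass while the two-way cache hits all of them, which is exactly the gap a small victim-style buffer such as the CML buffer is meant to close.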
The Stanford Dash multiprocessor
The overall goals and major features of the directory architecture for shared memory (Dash), a distributed directory-based protocol that provides cache coherence without compromising scalability, are presented.
Cache Invalidation Patterns in Shared-Memory Multiprocessors
The cache invalidation patterns of several parallel applications are analyzed and a classification scheme for data objects found in parallel programs is proposed, indicating that cache line sizes in the 32-byte range yield the lowest data and invalidation traffic.
Adaptive software cache management for distributed shared memory architectures
It is found that the access patterns of a large percentage of shared data objects fall into a small number of categories for which efficient software coherence mechanisms exist.