TLSync: Support for multiple fast barriers using on-chip transmission lines

@article{Oh2011TLSyncSF,
  title={TLSync: Support for multiple fast barriers using on-chip transmission lines},
  author={Jung-Sub Oh and Milos Prvulovi{\'c} and Alenka G. Zaji{\'c}},
  journal={2011 38th Annual International Symposium on Computer Architecture (ISCA)},
  year={2011},
  pages={105--115}
}
As the number of cores on a single chip grows, scalable barrier synchronization becomes increasingly difficult to implement. In software implementations, such as the tournament barrier, a larger number of cores results in a longer latency for each round and a larger number of rounds. Hardware barrier implementations require significant dedicated wiring, e.g., using a reduction (arrival) tree and a notification (release) tree, and multiple instances of this wiring are needed to support multiple…

A Generic Implementation of Barriers Using Optical Interconnects
TLDR
It is proved in this paper that current protocols for barriers in optical NoCs are simplistic and cannot be trivially extended to accommodate normal events that arise in regular operation, such as the presence of multiple applications, context switches, thread migrations, and variability in the number of active threads.
WiSync: An Architecture for Fast Synchronization through On-Chip Wireless Communication
TLDR
This paper proposes to address the challenge of fine-grain synchronization in shared-memory multiprocessing by using on-chip wireless communication, and shows that WiSync speeds up synchronization substantially.
A case for globally shared-medium on-chip interconnect
TLDR
This paper shows that with straightforward optimizations, the traffic between different cores can be kept relatively low, which allows simple shared-medium interconnects to be built using communication circuits driving transmission lines.
HyBar: high efficient barrier synchronization based on a hybrid packet-circuit switching Network-on-Chip
TLDR
This paper presents HyBar, a hardware barrier based on a hybrid-switching NoC that uses packet switching and circuit switching in two separate sub-networks. It introduces only a minor efficiency loss for concurrent barriers and places no limitation on the layout of participating cores in the on-chip network.
Efficient synchronization and communication in many-core chip multiprocessors
TLDR
GBarrier is a hardware-based barrier mechanism especially aimed at providing efficient barriers in future many-core CMPs, and deploys a dedicated G-Line-based network to allow for fast and efficient signaling of barrier arrival and departure.
Enhancing effective throughput for transmission line-based bus
TLDR
Transmission line-based buses are found to be a more compelling interconnect even for large-scale chip-multiprocessors, and thus bring into doubt the centrality of packet switching in future on-chip interconnect.
Broadcast- and Power-Aware Wireless NoC for Barrier Synchronization in Parallel Computing
TLDR
The proposed architecture reduces the barrier synchronization cost by up to 43.97% in terms of network latency under the PARSEC benchmarks and saves up to 80.49% of idle-state power consumption in WIs for a 64-core system, compared with the conventional WiNoC architecture, without incurring significant overhead.
Traffic steering between a low-latency unswitched TL ring and a high-throughput switched on-chip interconnect
TLDR
This paper shows that a low-latency unswitched interconnect built with transmission lines can be synergistically used with a high-throughput switched interconnect and designs a broadcast ring as a chain of unidirectional transmission line structures with very low latency but limited throughput.
Single-cycle collective communication over a shared network fabric
  • T. Krishna, L. Peh
  • Computer Science
    2014 Eighth IEEE/ACM International Symposium on Networks-on-Chip (NoCS)
  • 2014
TLDR
A network fabric is designed that enables messages to dynamically create virtual 1-to-Many (multicast) and Many-to-1 (reduction) tree routes over a physical mesh, get forked/aggregated at nodes on the tree, and traverse the tree, all within a single cycle across each dimension.
...
...

References

SHOWING 1-10 OF 37 REFERENCES
Efficient and scalable barrier synchronization for many-core CMPs
TLDR
This work uses global interconnection lines (G-lines) and the S-CSMA technique to develop a simple G-lines-based network that operates independently of the main data network in order to carry out barrier synchronizations on many-core CMPs.
A case for globally shared-medium on-chip interconnect
TLDR
This paper shows that with straightforward optimizations, the traffic between different cores can be kept relatively low, which allows simple shared-medium interconnects to be built using communication circuits driving transmission lines.
TLC: transmission line caches
  • Bradford M. Beckmann, D. Wood
  • Computer Science
    Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36.
  • 2003
TLDR
This paper proposes a family of transmission line cache (TLC) designs that represent different points in the latency/bandwidth spectrum and shows that TLC provides more consistent performance than the DNUCA design across a wide variety of workloads.
Low-Overhead, High-Speed Multi-core Barrier Synchronization
TLDR
Three barrier implementations that are hybrids of software and dedicated hardware barriers and are specifically tailored for CMPs are presented and evaluated, providing low latency comparable to that of dedicated hardware networks at a fraction of the cost.
Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors and Fast Barriers
TLDR
This work presents barrier filters, a mechanism for fast barrier synchronization on chip multiprocessors that enables vector computations to be efficiently distributed across the cores of a CMP.
NoC with Near-Ideal Express Virtual Channels Using Global-Line Communication
TLDR
This work proposes a novel NoC with a hybrid interconnect that leverages multiple types of interconnects: conventional full-swing short-range wires for the data path, together with low-swing, multi-drop wires offering long-range, ultra-low-latency communication for the flow control signals.
Synchronization and communication in the T3E multiprocessor
TLDR
The T3E augments the memory interface of the DEC 21164 microprocessor with a large set of explicitly-managed, external registers (E-registers), which provide a rich set of atomic memory operations and a flexible, user-level messaging facility.
Distributed Hardwired Barrier Synchronization for Scalable Multiprocessor Clusters
TLDR
The versatility, scalability, programmability, and low overhead make the distributed barrier architecture attractive in constructing fine-grain, massively parallel MIMD systems using multiprocessor clusters with distributed shared memory.
Power reduction of CMP communication networks via RF-interconnects
TLDR
A novel interconnect design exploiting dynamic RF-I bandwidth allocation to realize a reconfigurable network-on-chip architecture is proposed, and it is found that the adaptive RF-I architecture on top of a mesh with 4B links can even outperform the baseline with 16B mesh links by about 1%, and reduces NoC power by approximately 65% including the overhead incurred for supporting RF-I.
Scalability Evaluation of Barrier Algorithms for OpenMP
TLDR
Some of the most widely used approaches for implementing barriers on large-scale shared-memory multiprocessor systems are considered: a "blocking" implementation that de-schedules a waiting thread, a "centralized" busy-wait implementation, and three forms of distributed "busy" wait implementations.
...
...