Architecture support for accelerator-rich CMPs

Jason Cong, Mohammad Ali Ghodrat, Michael Gill, Beayna Grigorian, Glenn D. Reinman
DAC Design Automation Conference, 2012
This work discusses hardware architectural support for accelerator-rich CMPs (ARC). [...] The resource management scheme supports sharing and arbitration of multiple cores for a common set of accelerators, and it uses a hardware-based arbitration mechanism to give cores feedback on the wait time before a particular resource becomes available. Second, the work proposes a light-weight interrupt system to reduce the OS overhead of handling interrupts, which occur frequently in an accelerator-rich platform.
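
The wait-time feedback idea can be illustrated with a small software model. This is only a sketch under stated assumptions, not the paper's hardware design: the names `Arbiter`, `query_wait`, `reserve`, and `choose` are hypothetical, time is counted in abstract cycles, and the policy of falling back to a software path when the wait is too long is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Accelerator:
    # Cycle at which every job already queued on this accelerator has finished.
    busy_until: int = 0


@dataclass
class Arbiter:
    """Toy model of a shared-accelerator arbiter (hypothetical, for illustration)."""
    accelerators: dict = field(default_factory=dict)

    def add(self, name: str) -> None:
        self.accelerators[name] = Accelerator()

    def query_wait(self, name: str, now: int) -> int:
        # Feedback to the core: cycles until the accelerator becomes free.
        return max(0, self.accelerators[name].busy_until - now)

    def reserve(self, name: str, now: int, duration: int) -> int:
        # Queue a job behind existing work and return its start cycle.
        acc = self.accelerators[name]
        start = max(now, acc.busy_until)
        acc.busy_until = start + duration
        return start


def choose(arbiter: Arbiter, name: str, now: int, hw_cycles: int, sw_cycles: int):
    """Assumed core-side policy: use the accelerator only if waiting still pays off."""
    wait = arbiter.query_wait(name, now)
    if wait + hw_cycles < sw_cycles:
        start = arbiter.reserve(name, now, hw_cycles)
        return ("accelerator", start + hw_cycles)
    return ("software", now + sw_cycles)


arb = Arbiter()
arb.add("fft")
first = choose(arb, "fft", now=0, hw_cycles=100, sw_cycles=500)
second = choose(arb, "fft", now=10, hw_cycles=100, sw_cycles=150)
```

In this model the first request finds the accelerator idle and reserves it, while the second sees a 90-cycle wait and runs in software because waiting would take longer than the software path.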

Architecture Support for Domain-Specific Accelerator-Rich CMPs

This work presents a hardware resource management scheme for sharing loosely coupled accelerators and arbitrating among multiple requesting cores, and a mechanism for accelerator virtualization that efficiently composes multiple smaller accelerators into a larger virtual accelerator.

Accelerator-rich CMPs: From concept to real hardware

  • Yu-Ting Chen, J. Cong, Yi Zou
  • Computer Science
    2013 IEEE 31st International Conference on Computer Design (ICCD)
  • 2013
This work discusses a prototype of accelerator-rich CMPs (PARC) and develops an automated flow, with a number of IP templates and customizable interfaces to a C-based synthesis flow, to enable rapid design and update of PARC.

ARACompiler: a prototyping flow and evaluation framework for accelerator-rich architectures

This work develops ARACompiler, a highly automated design flow for prototyping ARAs and evaluating them on FPGAs, which can provide a 2.9x to 42.6x saving in evaluation time over full-system simulation.

Supporting Address Translation for Accelerator-Centric Architectures

This work examines the memory access behavior of customized accelerators to drive the TLB augmentation and MMU designs and presents a relatively small private TLB design to provide low-latency caching of translations to each accelerator.

An analysis of accelerator coupling in heterogeneous architectures

A salient conclusion of the study is that working sets of non-trivial size are best served by loosely-coupled accelerators that integrate private memory blocks tailored to their needs.

CHARM: a composable heterogeneous accelerator-rich microprocessor

CHARM is a Composable Heterogeneous Accelerator-Rich Microprocessor design that provides scalability, flexibility, and design reuse in the space of accelerator-rich CMPs to provide orders of magnitude improvement in performance and power efficiency.

Revisiting accelerator-rich CMPs: Challenges and solutions

A novel architecture template, Transparent Self-Synchronizing (TSS) accelerators, is proposed for efficient and scalable realization of streaming applications; it significantly reduces the pressure on the communication fabric, processor load, and memory requirements to improve scalability.

Revisiting Accelerator-Based CMPs: Challenges and Solutions

A novel architecture template, Transparent Self-Synchronizing (TSS) accelerators, is proposed for efficient and scalable realization of streaming applications; it significantly reduces the pressure on the communication fabric, processor load, and memory requirements to improve scalability.

On-chip interconnection network for accelerator-rich architectures

This paper proposes reserving NoC paths based on timing information from the global manager, and maximizes the benefit of path reservation by regularizing the communication traffic through TLB buffering and hybrid switching.

Power-efficient accelerator allocation in adaptive dark silicon many-core systems

This work proposes a power-efficient accelerator allocation scheme for adaptive many-core systems that maximally utilizes and dynamically allocates a shared accelerator to competing cores, such that deadlines of the executing applications are met and the total power consumption of the overall system is minimized.



AXR-CMP: Architecture Support in Accelerator-Rich CMPs

Hardware architectural support for accelerator-rich CMPs is discussed, along with an efficient cache management scheme for accelerators that mitigates memory latency by overlapping data transfer with computation.

HiPPAI: High Performance Portable Accelerator Interface for SoCs

This work demonstrates a novel High Performance Portable Accelerator Interface (HiPPAI) for SoC platforms using hardware accelerators, which reduces the overheads of system calls and address translations at the user/kernel boundary in traditional software stacks and enables function portability.

VEAL: Virtualized Execution Accelerator for Loops

It is concluded that using a hybrid static-dynamic compilation approach to map computation onto loop-level accelerators is a practical way to increase computation efficiency, without the overheads associated with instruction set modification.

EXOCHI: architecture and programming environment for a heterogeneous multi-core multithreaded system

Exoskeleton Sequencer (EXO), an architecture to represent heterogeneous accelerators as ISA-based MIMD architecture resources, and C for Heterogeneous Integration (CHI), an integrated C/C++ programming environment that supports accelerator-specific inline assembly and domain-specific languages are presented.

A wire-speed Power™ processor: 2.3GHz 45nm SOI with 16 cores and 64 threads

An emerging data-center market merges network and server attributes into a single wire-speed processor SoC that powers edge-of-network processing, intelligent I/O devices in servers, network attached appliances, distributed computing, and streaming applications.

Simics: A Full System Simulation Platform

Simics is a platform for full system simulation that can run actual firmware and completely unmodified kernel and driver code, and it provides both functional accuracy for running commercial workloads and sufficient timing accuracy to interface to detailed hardware models.

Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset

The Wisconsin Multifacet Project has created a simulation toolset to characterize and evaluate the performance of multiprocessor hardware systems commonly used as database and web servers and has released a set of timing simulator modules for modeling the timing of the memory system and microprocessors.

McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures

Combining power, area, and timing results of McPAT with performance simulation of PARSEC benchmarks at the 22nm technology node for both common in-order and out-of-order manycore designs shows that when die cost is not taken into account, clustering 8 cores together gives the best energy-delay product, whereas when cost is taken into account, configuring clusters with 4 cores gives the best EDA²P and EDAP.

Kernel sharing on reconfigurable multiprocessor systems

This work shows that using extensions for sharing configured circuits between processes improves overall system throughput, and outperforms a static schedule of the kernels between the multiple processes.

Accelerating vision and navigation applications on a customizable platform

This work mathematically characterizes viable accelerator candidates, describes ideal application code for acceleration, and outlines a dynamic-programming-based methodology for selecting an optimal set of candidates to accelerate applications in the domain of vision and navigation.