Double Throughput Multiply-Accumulate unit for FlexCore processor enhancements

Tung Thanh Hoang, Magnus Själander, Per Larsson-Edefors
2009 IEEE International Symposium on Parallel & Distributed Processing

As a simple five-stage General-Purpose Processor (GPP), the baseline FlexCore processor has a limited set of datapath units. By utilizing a flexible datapath interconnect and a wide control word, a FlexCore processor is explicitly designed to support integration of special units that, on demand, can accelerate certain data-intensive applications. In this paper, we propose the integration of a novel Double Throughput Multiply-Accumulate (DTMAC) unit, whose different operating modes allow for on… 

Customization for an Energy-Efficient Embedded Processor with Flexible Datapath

A design space exploration methodology to customize the FlexCore datapath interconnect to a domain of applications, with the goal of reducing energy dissipation, and a high-speed energy-efficient 2-cycle multiply-accumulate architecture, which can act as an accelerator for embedded processors.

FlexCore: Implementing an exposed datapath processor

An overview of the implementation of complete FlexCore processors is given, accompanied by discussions on datapath interconnects, datapath extensions and instruction decompression.

Design space exploration for an embedded processor with flexible datapath interconnect

Evaluation results suggest that a well-optimized instance of a 65-nm multiplier-extended FlexCore processor datapath, obtained using FlexTools, executes nine integer EEMBC benchmarks with a 15% cycle count reduction and dissipates 17% less energy than a reference MIPS datapath.

A High-Speed, Energy-Efficient Two-Cycle Multiply-Accumulate (MAC) Architecture and Its Application to a Double-Throughput MAC Unit

This work proposes a high-speed and energy-efficient two-cycle multiply-accumulate (MAC) architecture that supports two's complement numbers and includes accumulation guard bits and saturation circuitry. It extends the new architecture to create a versatile double-throughput MAC (DTMAC) unit that efficiently performs either multiply-accumulate or multiply operations for N-bit, 1 × N/2-bit, or 2 × N/2-bit operands.
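
As a rough behavioral illustration of the ideas named in this summary (two's complement operands, accumulation guard bits, saturation, and a mode that runs two N/2-bit operations on packed operands), the following Python sketch may help. It is not the authors' architecture: the function names, the default widths, and the choice of a (2N + guard)-bit accumulator are assumptions made only for this example.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of `value` as a two's complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def saturate(value, bits):
    """Clamp `value` to the range of a signed `bits`-bit integer."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


def mac(acc, a, b, n=16, guard=8):
    """One N-bit multiply-accumulate step.

    Assumption: the accumulator is 2*n + guard bits wide and saturates
    instead of wrapping around on overflow.
    """
    product = to_signed(a, n) * to_signed(b, n)
    return saturate(acc + product, 2 * n + guard)


def dual_mac(acc_hi, acc_lo, a, b, n=16, guard=8):
    """Two independent N/2-bit MAC steps on operands packed into N-bit words.

    Assumption: the upper and lower halves of `a` and `b` each hold one
    N/2-bit two's complement operand, and each half has its own accumulator.
    """
    h = n // 2
    return (
        mac(acc_hi, a >> h, b >> h, h, guard),  # upper halves
        mac(acc_lo, a, b, h, guard),            # lower halves (masked in to_signed)
    )
```

For example, `dual_mac(0, 0, 0xFF01, 0xFE02)` treats the words as the pairs (-1, 1) and (-2, 2), so each call performs two half-width products where `mac` would perform one full-width product.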

Multi-Mode Datapath Circuits for Flexible and Energy-Efficient Computing

This thesis presents the existing FlexCore processor design environment, which enables holistic processor system evaluations of, for example, multi-mode circuits, and demonstrates processor integration of a multi-mode cyclic-redundancy-checking (CRC) accelerator.

Low Power and Area Efficient 2C Multiply-Accumulate Unit and Its Application to a DTMAC Unit

A low-power and area-efficient two-cycle multiply-accumulate (2C-MAC) architecture that supports 2’s complement numbers and includes accumulation guard bits and saturation circuitry; it is extended to create a double-throughput MAC that can perform either multiply or multiply-accumulate operations.

Design and Analysis of High Speed, Area Optimized 32x32-Bit Multiply Accumulate Unit Based on Vedic Mathematics

The efficiency of Urdhva Triyagbhyam Vedic method for multiplication which strikes a difference in actual process of multiplication itself is presented, which enables the parallel generation of partial products and eliminates unwanted multiplication and addition steps.

An efficient hardware based MAC design in digital filters with complex numbers

  • M. Basiri, N. Sk
  • Computer Science
    2014 International Conference on Signal Processing and Integrated Networks (SPIN)
  • 2014
The proposed multiplier-cum-accumulator architecture, which can be used as a multiplier as well as a MAC, gives better performance than a conventional fixed-point complex-number MAC.

MAC Implementation using Vedic Multiplication Algorithm

The paper presents the implementation of a MAC (multiplier-accumulator) unit using a Vedic multiplier, and the proposed design shows improvement of speed over the design presented in [1].



FlexCore: Utilizing Exposed Datapath Control for Efficient Computing

This study shows that, in comparison to a conventional five-stage general-purpose processor, the FlexCore is up to 40% more efficient in terms of cycle count on a set of benchmarks from the embedded application domain and that both the fine-grained control and the flexible interconnect contribute to the speedup.

A Flexible Datapath Interconnect for Embedded Applications

The results from the case studies indicate that by utilizing a flexible interconnect, significant performance gains can be achieved for generic applications.

FPGA-friendly code compression for horizontal microcoded custom IPs

The code size of one of the new HMA-based technologies, called NISC, is studied; it is shown that NISC code size can be several times larger than that of a typical RISC processor, and several low-overhead dictionary-based code compression techniques are proposed to reduce the code size.

An efficient twin-precision multiplier

A twin-precision multiplier that in normal operation mode efficiently performs N-b multiplications and has 72% lower power dissipation and 15% higher speed than the conventional one, while only requiring 8% more transistors.

High performance dual-MAC DSP architecture

The MSA core is a dual-MAC modified Harvard architecture that has been designed to have good performance on both voice and video algorithms, and some of the best features and simplicity of microcontrollers have been incorporated into the core.

Reconfigurable embedded MAC core design for low-power coarse-grain FPGA

A reconfigurable multiplier design for low-power field programmable gate arrays (FPGAs) is presented that can configure itself dynamically and is thus suitable for FPGA-type designs.

A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations

  • P. Kogge, H. Stone
  • Mathematics, Computer Science
    IEEE Transactions on Computers
  • 1973
This paper uses a technique called recursive doubling in an algorithm for solving a large class of recurrence problems on parallel computers such as the ILLIAC IV.
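
A minimal sketch of the recursive doubling idea, applied to the simplest recurrence x[i] = x[i-1] op a[i] (an inclusive prefix scan), is shown here. The function name and the sequential list-based formulation are assumptions for illustration. On a parallel machine every element in one pass would be updated at the same time, so n elements need about log2(n) passes.

```python
def recursive_doubling_scan(values, op=lambda x, y: x + y):
    """Inclusive prefix scan by recursive doubling.

    `op` must be associative. Each pass combines every element with the
    element `stride` positions to its left, then doubles the stride.
    """
    result = list(values)
    stride = 1
    while stride < len(result):
        prev = result
        # All updates in a pass read only from `prev`, so they are
        # independent and could run in parallel.
        result = [
            prev[i] if i < stride else op(prev[i - stride], prev[i])
            for i in range(len(prev))
        ]
        stride *= 2
    return result
```

With the default addition operator, `recursive_doubling_scan([1, 2, 3, 4, 5])` yields the running sums `[1, 3, 6, 10, 15]` after three passes (strides 1, 2 and 4).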

Computer Architecture: A Quantitative Approach

This best-selling title, considered for over a decade to be essential reading for every serious student and practitioner of computer design, has been updated throughout to address the most important

Implementation of Programmable Baseband Processors

Implementation of programmable baseband DSP processors for digital radio communications is discussed in this paper. An implementation example based on IEEE802.11a/b/g WLAN is given.