Summary form only given. FPGAs are increasingly being used in the high performance and scientific computing community to implement floating-point based hardware accelerators. We analyze the floating-point multiplier and adder/subtractor units by considering the number of pipeline stages of the units as a parameter and use throughput/area as the metric. We ...
We develop new algorithms and architectures for matrix multiplication on configurable devices. These have reduced energy dissipation and latency compared with the state-of-the-art field-programmable gate array (FPGA)-based designs. By profiling well-known designs, we identify "energy hot spots," which are responsible for most of the energy dissipation. ...
Advances in their technologies have positioned FPGAs and embedded processors to compete with digital signal processors (DSPs). In this paper, we evaluate the performance in terms of both latency and energy-efficiency of FPGAs, embedded processors, and DSPs in multiplying two n × n matrices. As specific examples, we have chosen a representative of each ...
Almost all signal processing algorithms are initially represented as double precision floating-point in languages such as Matlab. For hardware implementations, these algorithms have to be converted to large precision fixed-point to have a sufficiently large dynamic range. However, the inevitable quantization effects and the complexity of ...
In this paper, we present techniques for energy-efficient design at the algorithm level using FPGAs. We then use these techniques to create energy-efficient designs for two signal processing kernel applications: fast Fourier transform (FFT) and matrix multiplication. We evaluate the performance, in terms of both latency and energy efficiency, of FPGAs in ...
Reconfigurable architectures such as FPGAs are flexible alternatives to DSPs or ASICs used in mobile devices for which energy is a key performance metric. Reconfigurable architectures offer several parameters such as operating frequency, precision, amount of memory, number of computation units, etc. These parameters define a large design space that must ...
We develop new algorithms and architectures for matrix multiplication on configurable hardware. These designs significantly reduce the latency as well as the area. Our designs improve on the previous designs in [7] and [1] in terms of the area/speed metric, where the speed denotes the maximum achievable running frequency. The area/speed metrics for the ...
Summary form only given. We first develop a novel architecture for fixed-point LU decomposition of streaming input matrices on FPGAs. Our architecture, based on a circular linear array, achieves the minimal latency and is resource-efficient. We then extend it, by using a stacked matrices approach, to a floating-point based architecture, which achieves the ...
In this paper, new algorithms and architectures for matrix factorization are presented. Two fully-parallel and block-based designs for LU decomposition on configurable devices are proposed. A linear array architecture is employed to minimize the usage of long interconnects, leading to lower energy dissipation. The designs are made scalable by using a fixed ...