High-Speed Double-Precision Computation of Reciprocal, Division, Square Root and Inverse Square Root

@article{Pineiro2002HighSpeedDC,
  title={High-Speed Double-Precision Computation of Reciprocal, Division, Square Root and Inverse Square Root},
  author={Jos{\'e}-Alejandro Pi{\~n}eiro and Javier D. Bruguera},
  journal={IEEE Trans. Computers},
  year={2002},
  volume={51},
  pages={1377--1388}
}
A new method for the high-speed computation of double-precision floating-point reciprocal, division, square root, and inverse square root operations is presented in this paper. This method employs a second-degree minimax polynomial approximation to obtain an accurate initial estimate of the reciprocal and the inverse square root values, and then performs a modified Goldschmidt iteration. The high accuracy of the initial approximation allows us to obtain double-precision results by computing a… 
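As a rough illustration of the two-step flow the abstract describes (an accurate initial estimate followed by a Goldschmidt iteration), here is a minimal C sketch for division. The quadratic seed's coefficients are hypothetical values fit at three sample points, not the paper's minimax coefficients, and the recurrence below is plain Goldschmidt without the paper's modifications; it only shows how a good seed cuts the iteration count, since the error roughly squares each pass.

#include <stdio.h>

/* Hypothetical quadratic seed for 1/b on [1, 2); roughly 7 correct bits.
 * The paper derives genuine minimax coefficients; these values were
 * merely fit at three sample points. */
static double seed_recip(double b) {
    return 2.1208 + b * (-1.4538 + b * 0.3230);
}

/* Goldschmidt division q = a/b for b in [1, 2): multiply numerator and
 * denominator by r = 2 - D each pass, driving the denominator to 1. */
static double goldschmidt_div(double a, double b) {
    double y0 = seed_recip(b);
    double n = a * y0, d = b * y0;
    for (int i = 0; i < 4; i++) {   /* ~7 -> 14 -> 28 -> 56+ bits */
        double r = 2.0 - d;
        n *= r;
        d *= r;
    }
    return n;
}

int main(void) {
    printf("%.17g vs %.17g\n", goldschmidt_div(1.0, 1.7), 1.0 / 1.7);
    return 0;
}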


High-speed function approximation using a minimax quadratic interpolator
TLDR
The use of an enhanced minimax approximation which takes into account the effect of rounding the polynomial coefficients to a finite size allows for a further reduction in the size of the look-up tables to be used, making the method very suitable for the implementation of an elementary function generator in state-of-the-art DSPs or graphics processing units (GPUs).
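As a rough sketch of the table-driven quadratic-interpolator idea (not the cited design), the C fragment below evaluates 1/sqrt(x) on [1, 2) piecewise: the leading bits of x select a segment, and a degree-2 polynomial is evaluated on the remainder. Here the per-segment coefficients are derived by straightforward interpolation at runtime; a real unit would store precomputed, finite-width minimax coefficients in its look-up tables, which is exactly the rounding effect the enhanced approximation accounts for. SEGS and quad_interp_rsqrt are invented names.

#include <math.h>
#include <stdio.h>

#define SEGS 64   /* 2^6 segments over [1, 2); table size is the knob */

/* Piecewise-quadratic evaluation of 1/sqrt(x) for x in [1, 2).
 * A hardware interpolator would read c0, c1, c2 from a table indexed
 * by the leading mantissa bits; here they are derived on the fly by
 * Newton interpolation at the segment's endpoints and midpoint. */
static double quad_interp_rsqrt(double x) {
    int i = (int)((x - 1.0) * SEGS);        /* segment from leading bits */
    double h  = 1.0 / SEGS;
    double x0 = 1.0 + i * h;                /* segment start */
    double f0 = 1.0 / sqrt(x0);
    double f1 = 1.0 / sqrt(x0 + 0.5 * h);
    double f2 = 1.0 / sqrt(x0 + h);
    double d1 = (f1 - f0) / (0.5 * h);                /* first diff */
    double d2 = ((f2 - f1) / (0.5 * h) - d1) / h;     /* second diff */
    double t  = x - x0;
    return f0 + d1 * t + d2 * t * (t - 0.5 * h);      /* Newton form */
}

int main(void) {
    printf("%.12f vs %.12f\n", quad_interp_rsqrt(1.37), 1.0 / sqrt(1.37));
    return 0;
}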
An Area-Efficient Unified Architecture for Multi-Functional Double-Precision Floating-Point Computation
TLDR
The area efficiency (performance/area ratio) of the proposed unified architecture is increased by about 20% on average, offering a better performance-area trade-off for embedded microprocessors.
Low Latency Floating-Point Division and Square Root Unit
TLDR
A floating-point division and square root unit is presented, which implements radix-64 floating-point division and radix-16 floating-point square root, requiring 11, 6, and 4 cycles for double-, single-, and half-precision division with normalized operands and result, and 15, 8, and 5 cycles for square root.
Computing Floating-Point Square Roots via Bivariate Polynomial Evaluation
TLDR
This paper shows how to reduce the computation of correctly rounded square roots of binary floating-point data to the fixed-point evaluation of some particular integer polynomials in two variables, and shows further that this approach allows for high instruction-level parallelism (ILP) exposure, and thus, potentially low-latency implementations.
Algorithm and architecture for logarithm, exponential, and powering computation
TLDR
A sequential implementation of the algorithm, with a control unit which allows the independent computation of logarithm and exponential, is proposed and the execution times and hardware requirements are estimated for single and double-precision floating-point computations.
Design and Implementation of a 64/32-bit Floating-point Division, Reciprocal, Square root, and Inverse Square root Unit
This paper presents an efficient design and implementation of a configurable multifunctional floating-point unit for the computation of division, reciprocal, square root and inverse square root,
Hardware Implementation of Single Iterated Multiplicative Inverse Square Root
The inverse square root plays an important role in Cholesky decomposition, which is devoted to hardware-efficient compressed sensing. However, the performance is usually limited by the trade-off
Low-complexity Inverse Square Root Approximation for Baseband Matrix Operations
TLDR
A scalable low-complexity approximation method for the inverse square root is developed and applied in Cholesky and QR decompositions, and can accelerate any fixed-point system where cost-efficiency and low power consumption are of high importance and a coarse approximation of the inverse square root operation is required.
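To make concrete what a coarse, low-cost inverse-square-root approximation can look like, the sketch below uses the well-known single-precision bit-trick seed plus one Newton-Raphson step. This is a generic classic technique, not the cited paper's method; the magic constant and refinement step are standard.

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Classic single-precision bit-trick seed plus one Newton-Raphson step
 * for 1/sqrt(x); a generic example of a coarse approximation, not the
 * cited paper's scheme. One step leaves roughly 0.2% worst-case error. */
static float rsqrt_approx(float x) {
    uint32_t u;
    float y;
    memcpy(&u, &x, sizeof u);
    u = 0x5f3759dfu - (u >> 1);             /* crude estimate via exponent */
    memcpy(&y, &u, sizeof y);
    return y * (1.5f - 0.5f * x * y * y);   /* one Newton step */
}

int main(void) {
    printf("%.6f vs %.6f\n", rsqrt_approx(2.0f), 1.0f / sqrtf(2.0f));
    return 0;
}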
High-Radix Logarithm with Selection by Rounding: Algorithm and Implementation
A high-radix digit-recurrence algorithm for the computation of the logarithm, and an analysis of the tradeoffs between area and speed for its implementation, are presented in this paper. Selection by
High-speed floating-point divider with reduced area
TLDR
The recursive equations in the Goldschmidt algorithm are modified to replace full-precision multipliers with smaller multipliers and squarers, and implementations of floating-point reciprocal and division units using this modification are presented.
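The flavor of that modification can be illustrated by rewriting the Goldschmidt recurrence in error form: with the scaled denominator D = 1 - e, the correction factor is 1 + e and the error updates as e <- e*e, so a narrow squarer can stand in for a full-width multiplier. The C sketch below shows only this error-form recurrence, not the cited paper's exact datapath.

#include <stdio.h>

/* Error-form Goldschmidt: with scaled denominator D = 1 - e, the
 * correction factor is 1 + e and the error updates by squaring, so a
 * narrow squarer can replace one full-width multiplier. Illustrative
 * only; y0 is any initial estimate of 1/b. */
static double goldschmidt_div_sq(double a, double b, double y0) {
    double n = a * y0;
    double e = 1.0 - b * y0;        /* seed error, |e| << 1 */
    for (int i = 0; i < 4; i++) {
        n *= 1.0 + e;               /* apply correction factor */
        e *= e;                     /* squarer instead of 2 - D multiply */
    }
    return n;
}

int main(void) {
    printf("%.17g vs %.17g\n", goldschmidt_div_sq(1.0, 1.7, 0.6), 1.0 / 1.7);
    return 0;
}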

References

Efficient initial approximation and fast converging methods for division and square root
TLDR
A new initial approximation method for division, an accelerated higher-order converging division algorithm, and a new square root algorithm that can compute double-precision square roots faster using smaller look-up tables than the Newton-Raphson method are proposed.
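For context, the Newton-Raphson baseline mentioned here typically refines y ~ 1/sqrt(x) via y' = y*(3 - x*y*y)/2 and then forms sqrt(x) = x*y. A minimal C sketch, with an arbitrary table-free seed (a real design would seed from a look-up table and need fewer iterations):

#include <stdio.h>

/* Newton-Raphson baseline: refine y ~ 1/sqrt(x) with
 * y' = y * (3 - x*y*y) / 2, then form sqrt(x) = x * y.
 * The seed is an arbitrary table-free guess for x in [1, 4). */
static double nr_sqrt(double x) {
    double y = 1.0 / (0.5 + 0.5 * x);       /* ~20% worst-case error */
    for (int i = 0; i < 5; i++)
        y = y * (1.5 - 0.5 * x * y * y);
    return x * y;
}

int main(void) {
    printf("%.17g\n", nr_sqrt(2.0));        /* ~1.4142135623730951 */
    return 0;
}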
Area and performance tradeoffs in floating-point divide and square-root implementations
TLDR
The case for high-performance division and square root is argued and the algorithms and implementations of the primary techniques, subtractive and multiplicative methods, employed in microprocessor floating-point units with their associated area/performance tradeoffs are explained.
Floating point division and square root algorithms and implementation in the AMD-K7™ microprocessor
  • S. Oberman
  • Computer Science
    Proceedings 14th IEEE Symposium on Computer Arithmetic (Cat. No.99CB36336)
  • 1999
TLDR
This paper presents the AMD-K7 IEEE 754 and x87 compliant floating point division and square root algorithms and implementation, and the formulation of a mechanically-checked formal proof using the ACL2 theorem prover.
Very High Radix Square Root with Prescaling and Rounding and a Combined Division/Square Root Unit
TLDR
Comparisons with other combined div/sqrt units show that the proposed scheme potentially produces a significant speed-up for division, whereas, for square root, the speed-ups are small.
Fast Hardware-Based Algorithms for Elementary Function Computations Using Rectangular Multipliers
TLDR
These algorithms exploit microscopic parallelism using specialized hardware with heavy use of truncation based on detailed accuracy analysis for the computation of the common elementary functions, namely division, logarithm, reciprocal square root, arc tangent, sine and cosine.
Improving Goldschmidt Division, Square Root, and Square Root Reciprocal
The aim of this paper is to accelerate division, square root, and square root reciprocal computations when the Goldschmidt method is used on a pipelined multiplier. This is done by replacing the last
High bandwidth evaluation of elementary functions
  • P. Farmwald
  • Computer Science
    1981 IEEE 5th Symposium on Computer Arithmetic (ARITH)
  • 1981
TLDR
This paper elaborates on a technique for computing piecewise quadratic approximations to many elementary functions, which permits the effective use of large RAMs or ROMs and parallel multipliers for rapidly generating single-precision floating-point function values.
Faithful powering computation using table look-up and a fused accumulation tree
A method for the calculation of faithfully rounded single-precision floating-point powering (X^p) is proposed in this paper. This method employs table look-up and a second-degree minimax
Cascaded implementation of an iterative inverse-square-root algorithm, with overflow lookahead
We present an unconventional method of computing the inverse of the square root. It implements the equivalent of two iterations of a well-known multiplicative method to obtain 24-bit mantissa
Evaluating Elementary Functions in a Numerical Coprocessor Based on Rational Approximations
TLDR
It is concluded that rational approximations can successfully compete with previously used methods when execution time and silicon area are considered.