• Publications
NVIDIA Tesla: A Unified Graphics and Computing Architecture
To enable flexible, programmable graphics and high-performance computing, NVIDIA has developed the Tesla scalable unified graphics and parallel computing architecture, which is massively multithreaded and programmable in C or via graphics APIs.
Division Algorithms and Implementations
A taxonomy of division algorithms is presented that classifies the algorithms by their hardware implementations and their impact on system design; it finds that digit recurrence algorithms are suitable for low-cost implementations where chip area must be minimized.
Floating point division and square root algorithms and implementation in the AMD-K7™ microprocessor
  • S. Oberman
  • Computer Science
    Proceedings 14th IEEE Symposium on Computer…
  • 14 April 1999
This paper presents the IEEE 754 and x87 compliant floating point division and square root algorithms of the AMD-K7 and their implementation, along with the formulation of a mechanically checked formal proof using the ACL2 theorem prover.
High-speed function approximation using a minimax quadratic interpolator
The use of an enhanced minimax approximation which takes into account the effect of rounding the polynomial coefficients to a finite size allows for a further reduction in the size of the look-up tables to be used, making the method very suitable for the implementation of an elementary function generator in state-of-the-art DSPs or graphics processing units (GPUs).
Design issues in high performance floating point arithmetic units
This work examines the state of the art in FPU design and proposes techniques for improving the performance and the performance/area ratio of future FPUs, along with a combination of those techniques that provides a basis for future high-performance floating point units.
Design Issues in Division and Other Floating-Point Operations
The system performance impact of floating-point division latency for varying instruction issue rates is presented and the performance implications of shared multiplication hardware, shared square root, on-the-fly rounding and conversion, and fused functional units are examined.
SRT division architectures and implementations
It is concluded that divider performance is only weakly sensitive to reasonable choices of architecture but significantly improved by aggressive circuit techniques.
The SNAP project: design of floating point arithmetic units
The paper presents results of the Stanford subnanosecond arithmetic processor (SNAP) research effort in the design of hardware for floating point addition, multiplication and division, and shows that one-cycle FP addition is achievable 32% of the time using a variable latency algorithm.
AMD 3DNow! technology: architecture and implementations
The AMD-K6-2 microprocessor is the first implementation of AMD 3DNow!, a technology innovation for the x86 architecture that drives today's personal computers, designed to open the traditional processing bottlenecks for floating-point-intensive and multimedia applications.