Asymptotic Optimality of Parallel Short Division

  • Niall Emmart, C. Weems
  • Published 2016
  • Computer Science
  • 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
In 2011 we published a practical algorithm for short division (division of a multiple precision dividend by a single precision divisor) on a parallel processor (HiPC 2011) with a run time of O(n/p+log p). Our algorithm, based on parallel computation of remainder sequences, is an improvement of Takahashi's earlier work (LSSC 2007), which has a run time of O((n/p) log p). Here we prove that Omega(n/p+log p) is a tight lower bound for short division (using a conventional fixed radix number system…
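To make the terms in the abstract concrete, here is a minimal Python sketch of short division and of one way a remainder sequence can be split into independent chunks. It is an illustration under stated assumptions, not the authors' algorithm: the function names `short_div` and `chunked_short_div`, the digit layout (most-significant digit first), and the sequential loops standing in for parallel workers are all hypothetical choices made for this example. The idea it relies on is that feeding one digit d into the running remainder is the affine map r -> (r*B + d) mod v, so a whole chunk of k digits collapses to r -> (a*r + b) mod v with a = B^k mod v and b equal to the chunk's own remainder.

```python
def short_div(digits, v, base=2**32, r=0):
    """Sequential short division of a digit list (most-significant first)
    by a single-digit divisor v, starting from incoming remainder r < v."""
    quotient = []
    for d in digits:
        cur = r * base + d
        quotient.append(cur // v)  # always < base because r < v
        r = cur % v
    return quotient, r


def chunked_short_div(digits, v, p, base=2**32):
    """Split the dividend into at most p chunks (p >= 1 assumed).
    The loops below run one after another here; in a parallel setting,
    steps 1 and 3 are independent per chunk and step 2 is a prefix scan."""
    if not digits:
        return [], 0
    size = -(-len(digits) // p)  # ceiling division
    chunks = [digits[i:i + size] for i in range(0, len(digits), size)]

    # Step 1: summarize each chunk as an affine map r -> (a*r + b) mod v.
    maps = []
    for c in chunks:
        a = pow(base, len(c), v)
        _, b = short_div(c, v, base)
        maps.append((a, b))

    # Step 2: exclusive prefix over the maps gives each chunk's incoming remainder.
    incoming = []
    r = 0
    for a, b in maps:
        incoming.append(r)
        r = (a * r + b) % v

    # Step 3: each chunk computes its quotient digits from its incoming remainder.
    quotient = []
    for c, r_in in zip(chunks, incoming):
        qc, _ = short_div(c, v, base, r_in)
        quotient.extend(qc)
    return quotient, r
```

For example, with decimal digits, `chunked_short_div([1, 2, 3, 4, 5, 6, 7], 7, p=3, base=10)` returns the same quotient digits and remainder as `short_div([1, 2, 3, 4, 5, 6, 7], 7, base=10)`. Step 2 is written as a plain loop; composing the affine maps is associative, which is what would allow a logarithmic-depth scan in its place.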
Review of Basic Classes of Dividers Based on Division Algorithm
The broad classification of dividers into basic classes (digit recurrence, high radix, functional iteration, estimation, look-up table, and variable latency) is described, illustrating that, in practical implementations, many algorithms combine one or more classes and are implemented with different hardware architectures.
A Study of High Performance Multiple Precision Arithmetic on Graphics Processing Units
A study of the impact of multi-modal decision analysis on graphics processing units and how it affects performance and efficiency is published.


Parallel multiple precision division by a single precision divisor
  • Niall Emmart, C. Weems
  • Computer Science
  • 2011 18th International Conference on High Performance Computing
  • 2011
This work combines a parallel version of Jebelean's exact division algorithm with a left-to-right algorithm for computing the borrow chain, to relax the requirement of exactness, and employs Takahashi's recently reported cyclic reduction technique for GPU division to further enhance performance.
An Algorithm for Exact Division
  • T. Jebelean
  • Computer Science, Mathematics
  • J. Symb. Comput.
  • 1993
An algorithm that computes the quotient of two long integers when the division is known to be exact, starting from the least-significant digits of the operands, which makes it better suited for systolic parallelization in a "least-significant digit first" pipelined manner.
Fast recursive division
A new recursive method for division with remainder of integers is presented, and practical results of an implementation suggest that the authors have the fastest integer division on a SPARC architecture compared to all other integer packages they know of.
On Parallel Prefix Computation
We prove that prefix sums of n integers of at most b bits can be found on a COMMON CRCW PRAM in time with a linear time-processor product. The algorithm is optimally fast, for any polynomial number…
Modular exponentiation via the explicit Chinese remainder theorem
A new result on the parallel complexity of modular exponentiation is obtained: there is an algorithm for the Common CRCW PRAM that, given positive integers x, e, and m in binary, of total bit length n, computes $x^e \bmod m$ in time $O(n/\lg\lg n)$ using $n^{O(1)}$ processors.
Upper and Lower Time Bounds for Parallel Random Access Machines without Simultaneous Writes
It is shown that even if the authors allow nonuniform algorithms, an arbitrary number of processors, and arbitrary instruction sets, $\Omega (\log n)$ is a lower bound on the time required to compute various simple functions, including sorting n keys and finding the logical “or” of n bits.
A randomized sublinear time parallel GCD algorithm for the EREW PRAM
  • J. Sorenson
  • Mathematics, Computer Science
  • Inf. Process. Lett.
  • 2010
We present a randomized parallel algorithm that computes the greatest common divisor of two integers of n bits in length with probability $1 - o(1)$ that takes $O(n \log\log n / \log n)$…
Improved Upper and Lower Time Bounds for Parallel Random Access Machines Without Simultaneous Writes
The time required by a variant of the PRAM to compute a certain class of functions called critical functions (which include the Boolean OR of n bits) is studied, and it is shown that any PRAM which computes a critical function must take at least $0.5 \log n - O(1)$ steps.
Bidirectional Exact Integer Division
It is shown that the high-order part and the low-order part of the exact quotient can be computed independently from each other.
Parallel Algorithms for Shared-Memory Machines
  • R. Karp, V. Ramachandran
  • Computer Science
  • Handbook of Theoretical Computer Science, Volume A: Algorithms and Complexity
  • 1990
This chapter discusses parallel algorithms for shared-memory machines, which focus on the technological limits of today's chips, in which gates and wires are packed into a small number of planar layers.