Montgomery Modular Multiplication Algorithm on Multi-Core Systems

@article{Fan2007MontgomeryMM,
  title={Montgomery Modular Multiplication Algorithm on Multi-Core Systems},
  author={Junfeng Fan and Kazuo Sakiyama and Ingrid M. R. Verbauwhede},
  journal={2007 IEEE Workshop on Signal Processing Systems},
  year={2007},
  pages={261--266}
}
In this paper, we investigate efficient software implementations of the Montgomery modular multiplication algorithm on a multi-core system. A HW/SW co-design technique is used to find an efficient system architecture and instruction scheduling method. We first implement the Montgomery modular multiplication on a multi-core system with general-purpose cores. We then speed it up by adopting the Multiply-Accumulate (MAC) operation in each core. As a result, the performance can be improved by… 
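
For orientation, the word-serial form of Montgomery multiplication that the abstract refers to can be sketched as below. This is a generic textbook variant (coarsely integrated operand scanning) run on a single thread, not the paper's multi-core schedule; the names `mac` and `montgomery_multiply` and the default word size `w=32` are assumptions made for illustration. The `mac` helper stands in for the Multiply-Accumulate operation named in the abstract. Python 3.8 or later is assumed for `pow(n, -1, m)`.

```python
def mac(acc, a, b, carry, mask, w):
    """Multiply-accumulate: return (low word, carry word) of acc + a*b + carry."""
    t = acc + a * b + carry
    return t & mask, t >> w


def montgomery_multiply(x, y, n, w=32):
    """Return x * y * R^-1 mod n, with R = 2**(w*s) and s the word count of n.

    Assumes n is odd and 0 <= x, y < n.
    """
    mask = (1 << w) - 1
    s = (n.bit_length() + w - 1) // w          # number of w-bit words in n
    n_prime = (-pow(n, -1, 1 << w)) & mask     # -n^-1 mod 2^w
    xs = [(x >> (w * i)) & mask for i in range(s)]
    ns = [(n >> (w * i)) & mask for i in range(s)]
    t = [0] * (s + 2)                          # running sum, least significant word first

    for i in range(s):
        yi = (y >> (w * i)) & mask

        # Multiplication step: t += x * y_i
        carry = 0
        for j in range(s):
            t[j], carry = mac(t[j], xs[j], yi, carry, mask, w)
        total = t[s] + carry
        t[s], t[s + 1] = total & mask, total >> w

        # Reduction step: add m * n so the low word becomes zero,
        # then shift the sum down by one word.
        m = (t[0] * n_prime) & mask
        _, carry = mac(t[0], m, ns[0], 0, mask, w)
        for j in range(1, s):
            t[j - 1], carry = mac(t[j], m, ns[j], carry, mask, w)
        total = t[s] + carry
        t[s - 1] = total & mask
        t[s] = t[s + 1] + (total >> w)

    result = sum(word << (w * j) for j, word in enumerate(t[: s + 1]))
    # The sum is below 2n, so one conditional subtraction completes the reduction.
    return result - n if result >= n else result
```

Every inner-loop step is a single `mac` call, which is why a hardware MAC unit in each core would be expected to shorten the loop body. How the loops are partitioned across cores is the subject of the paper and is not modeled here.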


Parallelization of Radix-2 Montgomery Multiplication on Multicore Platform
TLDR
This brief presents an improved task partitioning of the Montgomery multiplication algorithm for the multicore platform with area-efficient processors to verify the efficiency of parallelization.
An Efficient Implementation of Montgomery Multiplication on Multicore Platform With Optimized Algorithm, Task Partitioning, and Network Architecture
TLDR
A block-level parallel algorithm for MM with quotient pipelining is proposed and optimally mapped onto a network-on-chip-based multicore platform equipped with a broadcasting mechanism to maximize the speedup ratio with regard to a given intercore communication latency.
Hardware Implementation of Improved Montgomery Modular Multiplication Algorithm
TLDR
A hardware implementation of a modular multiplication coprocessor for both RSA and ECC cryptosystems using a self-improvement Montgomery modular multiplication algorithm, which completes a modular multiplication in fewer clock cycles than the other designs under equivalent conditions.
Highly-Parallel Montgomery Multiplication for Multi-Core General-Purpose Microprocessors
TLDR
This work proposes a new parallel Montgomery multiplication algorithm which exhibits up to 39% better performance than the best known serial Montgomery multiplication variant for bit-lengths of 2048 or larger, and is the first work that shows with actual implementation results that Montgomery multiplication can be practically and scalably parallelized on general-purpose multi-core processors.
pSHS: A scalable parallel software implementation of Montgomery multiplication for multicore systems
  • Zhimin Chen, P. Schaumont
  • Computer Science
    2010 Design, Automation & Test in Europe Conference & Exhibition (DATE 2010)
  • 2010
Parallel programming techniques have become one of the great challenges in the transition from single-core to multicore architectures. In this paper, we investigate the parallelization of the
A Parallel Implementation of Montgomery Multiplication on Multicore Systems: Algorithm, Analysis, and Prototype
TLDR
This work presents a parallel software implementation of the Montgomery multiplication for multicore systems, pSHS, and shows that it offers high performance, scales over different numbers of cores, and remains stable when the communication latency changes.
The Researcher and Implement of High-Speed Modular Multiplication Algorithm Basing on Parallel Pipelining
This paper presents an improved method that realizes parallel operation both within each cell arithmetic unit and between cell arithmetic units to improve the speed of Montgomery modular multiplication
Efficient Translation of Algorithmic Kernels on Large-Scale Multi-cores
  • A. Pande, J. Zambreno
  • Computer Science
    2009 International Conference on Computational Science and Engineering
  • 2009
TLDR
The design of a novel embedded processor architecture (called a μ-core) that makes use of a reconfigurable ALU that serves as the basis of custom 2-dimensional array architectures that can be used to accelerate algorithms such as cryptography and image processing.
Survey on Hardware Implementation of Montgomery Modular exponentiation
TLDR
Three modified Montgomery algorithms are discussed and their outputs compared with each other: an iterative architecture, a Montgomery multiplier for faster cryptography, and Vedic multipliers used in the Montgomery algorithm for multiplication.
Novel algorithms and hardware architectures for Montgomery Multiplication over GF(p)
TLDR
A novel digit-digit based MM algorithm is derived and two hardware architectures that compute that algorithm are described, making use of available dedicated multiplier and memory blocks and drastically reducing the FPGA’s standard logic usage while keeping an acceptable performance compared with other implementation approaches.

References

SHOWING 1-10 OF 33 REFERENCES
Efficient pipelining for modular multiplication architectures in prime fields
TLDR
A pipelined architecture of a modular Montgomery multiplier, suitable for use in public-key coprocessors, is presented and compared to the state of the art in Montgomery multipliers on the basis of performance results for 1024-bit RSA.
A Scalable Architecture for Modular Multiplication Based on Montgomery's Algorithm
TLDR
A word-based version of MM is presented and used to explain the main concepts in the hardware design; it gives enough freedom to select the word size and the degree of parallelism to be used, according to the available area and/or desired performance.
Montgomery in Practice: How to Do It More Efficiently in Hardware
TLDR
This work presents modular exponentiation based on Montgomery's method without any modular reduction achieving the best possible bound according to C. Walter.
Parallelized Very High Radix Scalable Montgomery Multipliers
  • K. Kelley, D. Harris
  • Computer Science, Mathematics
    Conference Record of the Thirty-Ninth Asilomar Conference on Signals, Systems and Computers, 2005.
  • 2005
TLDR
A parallelized very high radix scalable Montgomery multiplier designed for non-redundant FPGA implementations that can perform 1024-bit modular exponentiation in 5.0 ms and 256-bit modular exponentiation in 0.20 ms, improving the fastest scalable design yet reported.
A fast dual-field modular arithmetic logic unit and its hardware implementation
TLDR
A fast modular arithmetic logic unit (MALU) that is scalable in the digit size (d) and the field size (k) and well suited and very efficient for the modular multiplication and addition/subtraction which are the computational kernels of elliptic curve and hyperelliptic curve cryptography.
Architectural Enhancements to Support Digital Signal Processing and Public-Key Cryptography
TLDR
The analysis shows that the MIPS32 architecture can be easily extended for efficient cryptography processing and offers some advantages compared to the ARMv5TE architecture.
Hardware Implementation of Montgomery's Modular Multiplication Algorithm
TLDR
Hardware is described for implementing the fast modular multiplication algorithm developed by P.L. Montgomery (1985), showing that this algorithm is up to twice as fast as the best currently available and is more suitable for alternative architectures.
Montgomery modular exponentiation on reconfigurable hardware
  • Thomas Blum
  • Computer Science, Mathematics
    Proceedings 14th IEEE Symposium on Computer Arithmetic (Cat. No.99CB36336)
  • 1999
TLDR
This contribution proposes arithmetic architectures which are optimized for modern field programmable gate arrays (FPGAs) and shows that it is possible to implement modular exponentiation at secure bit lengths on a single commercially available FPGA.
Analyzing and comparing Montgomery multiplication algorithms
TLDR
The operations involved in computing the Montgomery product are studied, several high-speed, space-efficient algorithms for computing MonPro(a, b), and their time and space requirements are described.
Modular exponentiation using parallel multipliers
  • S. Tang, K. Tsui, P. Leong
  • Computer Science, Mathematics
    Proceedings. 2003 IEEE International Conference on Field-Programmable Technology (FPT) (IEEE Cat. No.03EX798)
  • 2003
A field programmable gate array (FPGA) semi-systolic implementation of a modular exponentiation unit, suitable for use in implementing the RSA public key cryptosystem is presented. The design is