Parallelization of Radix-2 Montgomery Multiplication on Multicore Platform
@article{Han2013ParallelizationOR,
  title={Parallelization of Radix-2 Montgomery Multiplication on Multicore Platform},
  author={Jun Han and Shuai Wang and Wei Huang and Zhiyi Yu and Xiaoyang Zeng},
  journal={IEEE Transactions on Very Large Scale Integration (VLSI) Systems},
  year={2013},
  volume={21},
  pages={2325--2330}
}
Montgomery multiplication is the kernel operation in public key ciphers. Aiming at parallel implementation of Montgomery multiplication, this brief presents an improved task partitioning of the Montgomery multiplication algorithm for the multicore platform with area-efficient processors. Several multicore platforms are designed to verify the efficiency of parallelization. The fastest platform takes 3460 cycles to finish a 1024-b Montgomery multiplication, which is six times faster than a single…
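For orientation, the operation being parallelized can be sketched in its textbook bit-serial radix-2 form. This is a minimal sequential sketch, not the task partitioning proposed in the brief (which the abstract does not detail); the function name `mont_mul_radix2` and its parameters are assumptions made for illustration.

```python
def mont_mul_radix2(a: int, b: int, n: int, k: int) -> int:
    """Return a * b * 2^(-k) mod n, for odd n with a, b < n < 2^k.

    Textbook radix-2 (bit-serial) Montgomery multiplication:
    one bit of `a` is consumed per iteration.
    """
    if n % 2 == 0:
        raise ValueError("modulus must be odd")
    if not (0 <= a < n and 0 <= b < n and n < (1 << k)):
        raise ValueError("require 0 <= a, b < n < 2^k")

    s = 0
    for i in range(k):
        a_i = (a >> i) & 1      # i-th bit of a, least significant first
        s += a_i * b            # conditional add of the multiplicand
        if s & 1:               # quotient bit: make s even by adding n
            s += n
        s >>= 1                 # exact division by 2
    # The loop keeps s < 2n, so one conditional subtraction suffices.
    if s >= n:
        s -= n
    return s


if __name__ == "__main__":
    n, k = 97, 7                # small odd modulus, 2^7 = 128 > 97
    r = mont_mul_radix2(5, 7, n, k)
    # Multiplying the result by 2^k recovers a * b mod n.
    print((r << k) % n == (5 * 7) % n)
```

Each loop iteration depends on the running sum from the previous one, which is the data dependency any parallel partitioning of the algorithm has to work around.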
26 Citations
An Efficient Implementation of Montgomery Multiplication on Multicore Platform With Optimized Algorithm, Task Partitioning, and Network Architecture
- Computer Science
- IEEE Transactions on Very Large Scale Integration (VLSI) Systems
- 2014
A block-level parallel algorithm for MM with quotient pipelining is proposed and optimally mapped onto a network-on-chip-based multicore platform equipped with a broadcasting mechanism, maximizing the speedup ratio for a given intercore communication latency.
Parallelism exploitation of Montgomery multiplication in RNS on NoC-based platform
- Computer Science
- 2014 12th IEEE International Conference on Solid-State and Integrated Circuit Technology (ICSICT)
- 2014
An efficient parallelization scheme is proposed to overcome the influence of communication latency and is shown to be more resistant to communication latency than the state-of-the-art MM algorithm.
A Systolic Hardware Architecture of Montgomery Modular Multiplication for Public Key Cryptosystems
- Computer Science
- 2017
This work shows how to relax the data dependency in conventional word-based algorithms to increase the possibility of reusing the current words of variables, and proposes a novel scheduling scheme to reduce the number of memory accesses in the developed scalable microarchitecture.
Efficient VLSI Architecture for Montgomery Modular Multiplier
- Computer Science
- 2017
A configurable CSA (CCSA) is proposed for performing modular multiplication using two serial half-adders, and a skip detector is designed to detect and skip unnecessary carry-save addition operations while maintaining the short critical path delay.
A Heterogeneous Multicore Crypto-Processor With Flexible Long-Word-Length Computation
- Computer Science
- IEEE Transactions on Circuits and Systems I: Regular Papers
- 2015
The proposed multicore processor provides flexible and efficient computation for various forms of RSA and ECC algorithms, fulfilling low-latency or high-throughput requirements of different application scenarios, by using a heterogeneous multicore architecture.
VLSI Implementation of High Performance Montgomery Modular Multiplication for Crypto Graphical Application
- Computer Science
- 2017
Experimental results show that the proposed Montgomery modular multiplier achieves higher performance and a significant area–time product improvement compared with previous designs.
VLSI ARCHITECTURE FOR MONTGOMERY MODULAR MULTIPLICATION ALGORITHM BY USING PASTA ADDER
- Computer Science
- 2017
A configurable CSA (CCSA), which could be one full-adder or two serial half-adders, is proposed to reduce the extra clock cycles for operand pre-computation and format conversion by half, to overcome the weakness in the Montgomery modular multiplier.
Efficient Area and Delay Profile Architecture of Asynchronous Parallel Self Timed Adder Based Montgomery Multiplication
- Computer Science, Mathematics
- 2018
This architecture proposes improvements on the well-known Montgomery multiplication algorithm and its previous implementations; the proposed work was implemented and synthesized in the Xilinx ISE design suite using a hardware description language (HDL).
Enhanced VLSI Architecture For Montgomery Modular Multiplication In Digital Filters
- Computer Science
- 2016
A configurable CSA (CCSA), which could be one full-adder or two serial half-adders, is proposed to reduce the extra clock cycles for operand precomputation and format conversion by half.
Low Power Montgomery Modular Multiplication Using Carry Save Adder
- Computer Science
- 2016
A mechanism is developed that can detect and skip the unnecessary carry-save addition operations in the one-level CCSA architecture while maintaining the short critical path delay, so that high throughput can be obtained.
References
Montgomery Modular Multiplication Algorithm on Multi-Core Systems
- Computer Science, Mathematics
- 2007 IEEE Workshop on Signal Processing Systems
- 2007
This paper first implements Montgomery modular multiplication on a multi-core system with general-purpose cores, and then speeds it up by adopting the Multiply-Accumulate (MAC) operation in each core.
A Parallel Implementation of Montgomery Multiplication on Multicore Systems: Algorithm, Analysis, and Prototype
- Computer Science
- IEEE Transactions on Computers
- 2011
This work presents pSHS, a parallel software implementation of Montgomery multiplication for multicore systems, and shows that it is high-performance, scalable over different numbers of cores, and stable when the communication latency changes.
Analyzing and comparing Montgomery multiplication algorithms
- Computer Science, Mathematics
- IEEE Micro
- 1996
The operations involved in computing the Montgomery product are studied, and several high-speed, space-efficient algorithms for computing MonPro(a, b) are described along with their time and space requirements.
A Scalable Architecture for Modular Multiplication Based on Montgomery's Algorithm
- Computer Science
- IEEE Trans. Computers
- 2003
A word-based version of MM is presented and used to explain the main concepts in the hardware design; it gives enough freedom to select the word size and the degree of parallelism according to the available area and/or desired performance.
A low-complexity heterogeneous multi-core platform for security SoC
- Computer Science
- 2010 IEEE Asian Solid-State Circuits Conference
- 2010
Comparison results show that this heterogeneous multi-core SoC platform, which handles intensive cryptographic algorithms in different security protocols, has a low hardware cost and more flexibility.
Modular multiplication without trial division
- Mathematics, Computer Science
- 1985
A method is presented for multiplying two integers modulo N while avoiding division by N, using a representation of residue classes that speeds up modular multiplication without affecting the modular addition and subtraction algorithms.
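The residue-class representation mentioned in this summary can be illustrated with a short sketch, assuming the usual choice R = 2^k with odd N < R. The helper names (`montgomery_setup`, `redc`, `to_mont`, `mod_mul`) are hypothetical and chosen for this example only; the code follows the standard textbook reduction rather than any specific implementation listed here.

```python
def montgomery_setup(n: int, k: int) -> int:
    """Return n' = -n^(-1) mod 2^k for odd n < 2^k (needs Python 3.8+)."""
    r = 1 << k
    return (-pow(n, -1, r)) % r


def redc(t: int, n: int, k: int, n_prime: int) -> int:
    """Montgomery reduction: return t * 2^(-k) mod n for 0 <= t < n * 2^k.

    Only masks and shifts by k bits are used; there is no division by n.
    """
    mask = (1 << k) - 1
    m = ((t & mask) * n_prime) & mask   # m = t * n' mod R
    u = (t + m * n) >> k                # t + m*n is divisible by R
    return u - n if u >= n else u


def to_mont(x: int, n: int, k: int) -> int:
    """Map x to its Montgomery representative x * 2^k mod n."""
    return (x << k) % n


def mod_mul(a: int, b: int, n: int, k: int) -> int:
    """Compute a * b mod n by way of the Montgomery representation."""
    n_prime = montgomery_setup(n, k)
    a_m, b_m = to_mont(a, n, k), to_mont(b, n, k)
    p_m = redc(a_m * b_m, n, k, n_prime)   # representative of a*b
    return redc(p_m, n, k, n_prime)        # convert back to ordinary form


if __name__ == "__main__":
    print(mod_mul(5, 7, 97, 7) == 35)
```

Converting into and out of the representation costs extra work, so the approach pays off when many multiplications share one modulus, as in modular exponentiation. Representatives can be added and subtracted modulo N directly, which is why those algorithms are unaffected.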
Challenges of programming multi-core microprocessors
- Computer Science
- 2008
It is claimed that many of the programming abstractions for parallel programs have been honed for the development of closed-world software like operating system kernels and are not suitable for application development in a modular manner.
Test power reduction with multiple capture orders
- Computer Science
- 13th Asian Test Symposium
- 2004
A multiple-capture-orders method is developed to guarantee the full scan fault coverage and a test architecture based on a ring control structure is adopted which makes the test control very simple and requires very low area overhead.
Fast and accurate protocol specific bus modeling using TLM 2.0
- Computer Science
- 2009 Design, Automation & Test in Europe Conference & Exhibition
- 2009
A new methodology is introduced that enables the creation of fast and cycle accurate protocol specific bus-based communication models, based on the new TLM 2.0 standard from the Open SystemC Initiative (OSCI).
Combining Behavioural Real-time Software Modelling with the OSCI TLM-2.0 Communication Standard
- Computer Science
- 2010 10th IEEE International Conference on Computer and Information Technology
- 2010
A software Processing Element (PE) model is implemented which effectively integrates mixed timing RTOS-centric software models, abstract processor hardware functions, and OSCI TLM-2.0 communication interfaces.