Parallelization of Radix-2 Montgomery Multiplication on Multicore Platform

  • Jun Han, Shuai Wang, Wei Huang, Zhiyi Yu, Xiaoyang Zeng
  • Published 1 December 2013
  • Computer Science
  • IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Montgomery multiplication is the kernel operation in public key ciphers. Aiming at parallel implementation of Montgomery multiplication, this brief presents an improved task partitioning of the Montgomery multiplication algorithm for the multicore platform with area-efficient processors. Several multicore platforms are designed to verify the efficiency of parallelization. The fastest platform takes 3460 cycles to finish a 1024-b Montgomery multiplication, which is six times faster than a single… 
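As background for the abstract above, the following is a minimal, single-threaded sketch of bit-serial (radix-2) Montgomery multiplication. It is not the authors' task-partitioned multicore scheme, which the abstract does not describe in detail; the function name `mont_mul_radix2` and its parameters are illustrative assumptions.

```python
def mont_mul_radix2(a: int, b: int, n: int, k: int) -> int:
    """Return a * b * 2^(-k) mod n for odd n, with 0 <= a, b < n < 2^k.

    Illustrative textbook radix-2 loop (names are hypothetical),
    processing one bit of `a` per iteration.
    """
    if n % 2 == 0:
        raise ValueError("modulus must be odd")
    s = 0
    for i in range(k):
        a_i = (a >> i) & 1      # i-th bit of a, least significant first
        s += a_i * b            # conditionally add b
        if s & 1:               # make s even by adding the odd modulus
            s += n
        s >>= 1                 # exact division by 2
    if s >= n:                  # s < 2n here, so one subtraction suffices
        s -= n
    return s


if __name__ == "__main__":
    n, k = 97, 7                # toy modulus with n < 2^k
    r = 1 << k
    a, b = 25, 43
    a_bar, b_bar = (a * r) % n, (b * r) % n       # Montgomery form
    ab_bar = mont_mul_radix2(a_bar, b_bar, n, k)  # still in Montgomery form
    print(mont_mul_radix2(ab_bar, 1, n, k) == (a * b) % n)
```

In this bit-serial form a 1024-bit modulus would need k = 1024 loop iterations, each depending on the previous value of `s`; that dependency is what makes parallelizing the algorithm non-trivial.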
An Efficient Implementation of Montgomery Multiplication on Multicore Platform With Optimized Algorithm, Task Partitioning, and Network Architecture
A block-level parallel algorithm for MM with quotient pipelining is proposed and optimally mapped onto a network-on-chip-based multicore platform equipped with a broadcasting mechanism, maximizing the speedup ratio for a given intercore communication latency.
Parallelism exploitation of Montgomery multiplication in RNS on NoC-based platform
An efficient parallelization scheme is proposed to overcome the influence of communication latency and is shown to be more resistant to communication latency than the state-of-the-art MM algorithm.
A Systolic Hardware Architecture of Montgomery Modular Multiplication for Public Key Cryptosystems
Montgomery modular multiplication is mostly used in the field of public-key cryptosystems. This work presents how to relax the data dependency in conventional word-based algorithms to increase the
Efficient VLSI Architecture for Montgomery Modular Multiplier
Montgomery modular multiplication is used in cryptographic algorithms and digital signal processing applications. The main objective is to reduce the delay and area of the Montgomery multipliers while
A Heterogeneous Multicore Crypto-Processor With Flexible Long-Word-Length Computation
The proposed multicore processor provides flexible and efficient computation for various forms of RSA and ECC algorithms, fulfilling low-latency or high-throughput requirements of different application scenarios, by using a heterogeneous multicore architecture.
VLSI Implementation of High Performance Montgomery Modular Multiplication for Crypto Graphical Application
This paper proposes a simple and efficient Montgomery multiplication algorithm such that the low-cost and high-performance Montgomery modular multiplier can be implemented accordingly. Full-adder or
In data transmission applications, the widely used public-key cryptosystem is a simple and efficient Montgomery multiplication algorithm such that the low-cost and high-performance. In which includes
Efficient Area and Delay Profile Architecture of Asynchronous Parallel Self Timed Adder Based Montgomery Multiplication
With the ongoing digital revolution and advances in high performance computing, powerful desktop computer systems are available to almost everybody at low cost. While there has always been a demand
Enhanced VLSI Architecture For Montgomery Modular Multiplication In Digital Filters
The multiplier receives and outputs the data with binary representation and uses only one-level Carry Save Adder (CSA) to avoid the carry propagation at each addition operation. A famous approach to
Low Power Montgomery Modular Multiplication Using Carry Save Adder
A mechanism that can detect and skip the unnecessary carry-save addition operations in the one-level CCSA architecture while maintaining the short critical path delay is developed and high throughput can be obtained.
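The carry-save addition that the two entries above rely on can be sketched in a few lines. This is a generic bitwise model of one carry-save adder level, not the circuit from either paper; the function name `carry_save_add` is an illustrative assumption.

```python
from typing import Tuple


def carry_save_add(x: int, y: int, z: int) -> Tuple[int, int]:
    """Reduce three non-negative integers to a (sum, carry) pair.

    Each bit position is handled independently, so no carry ripples
    through the word; the carries are only shifted left by one.
    """
    partial_sum = x ^ y ^ z                      # per-bit sum without carries
    carry = ((x & y) | (x & z) | (y & z)) << 1   # per-bit majority, shifted
    return partial_sum, carry


if __name__ == "__main__":
    s, c = carry_save_add(0b1011, 0b0110, 0b1101)
    # A single ordinary addition at the end resolves the carries.
    print(s + c == 0b1011 + 0b0110 + 0b1101)
```

Keeping the running value in this redundant (sum, carry) form defers the slow carry-propagating addition until the final result is needed.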


Montgomery Modular Multiplication Algorithm on Multi-Core Systems
This paper first implements the Montgomery modular multiplication on a multi-core system with general-purpose cores, and then speeds it up by adopting the Multiply-Accumulate (MAC) operation in each core.
A Parallel Implementation of Montgomery Multiplication on Multicore Systems: Algorithm, Analysis, and Prototype
This work presents pSHS, a parallel software implementation of the Montgomery multiplication for multicore systems, and shows that it is high-performance, scalable over different numbers of cores, and stable when the communication latency changes.
Analyzing and comparing Montgomery multiplication algorithms
The operations involved in computing the Montgomery product are studied, and several high-speed, space-efficient algorithms for computing MonPro(a, b) are described along with their time and space requirements.
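For readers unfamiliar with the MonPro(a, b) notation, here is a minimal sketch assuming the usual definition: with R = 2^k and an odd modulus n < R, MonPro returns a·b·R⁻¹ mod n. It is a plain big-integer version, not one of the word-level algorithms the paper compares; the names `mon_pro`, `n_prime`, and `k` are illustrative.

```python
def mon_pro(a_bar: int, b_bar: int, n: int, k: int) -> int:
    """Return a_bar * b_bar * R^(-1) mod n, where R = 2^k and n is odd.

    Requires Python 3.8+ for pow() with a negative exponent.
    """
    r = 1 << k
    n_prime = (-pow(n, -1, r)) % r   # n' with n * n' = -1 (mod R)
    t = a_bar * b_bar
    m = (t * n_prime) % r            # chosen so that t + m*n is divisible by R
    u = (t + m * n) >> k             # exact division by R
    return u - n if u >= n else u    # u < 2n, so one subtraction suffices


if __name__ == "__main__":
    n, k = 97, 7
    r = 1 << k
    a, b = 25, 43
    a_bar, b_bar = (a * r) % n, (b * r) % n   # operands in Montgomery form
    c_bar = mon_pro(a_bar, b_bar, n, k)       # (a*b*R) mod n
    print(mon_pro(c_bar, 1, n, k) == (a * b) % n)
```

The only divisions are by R, a power of two, so they reduce to shifts and masks; no trial division by n is needed.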
A Scalable Architecture for Modular Multiplication Based on Montgomery's Algorithm
A word-based version of MM is presented and used to explain the main concepts in the hardware design and gives enough freedom to select the word size and the degree of parallelism to be used, according to the available area and/or desired performance.
A low-complexity heterogeneous multi-core platform for security SoC
Comparison results show that this heterogeneous multi-core SoC platform, which handles intensive cryptography algorithms in different security protocols, has a low-complexity hardware cost but more flexibility.
Modular multiplication without trial division
Let N > 1. We present a method for multiplying two integers (called N-residues) modulo N while avoiding division by N. N-residues are represented in a nonstandard way, so this method is useful only
Challenges of programming multi-core microprocessors
It is claimed that many of the programming abstractions for parallel programs have been honed for the development of closed-world software such as operating system kernels and are not suitable for modular application development.
Test power reduction with multiple capture orders
A multiple-capture-orders method is developed to guarantee the full scan fault coverage and a test architecture based on a ring control structure is adopted which makes the test control very simple and requires very low area overhead.
Fast and accurate protocol specific bus modeling using TLM 2.0
A new methodology is introduced that enables the creation of fast and cycle accurate protocol specific bus-based communication models, based on the new TLM 2.0 standard from the Open SystemC Initiative (OSCI).
Combining Behavioural Real-time Software Modelling with the OSCI TLM-2.0 Communication Standard
  • K. Yu, N. Audsley
  • Computer Science
  • 2010 10th IEEE International Conference on Computer and Information Technology
  • 2010
A software Processing Element (PE) model is implemented which effectively integrates mixed timing RTOS-centric software models, abstract processor hardware functions, and OSCI TLM-2.0 communication interfaces.