Parallelization of Radix-2 Montgomery Multiplication on Multicore Platform

@article{Han2013ParallelizationOR,
  title={Parallelization of Radix-2 Montgomery Multiplication on Multicore Platform},
  author={Jun Han and Shuai Wang and Wei Huang and Zhiyi Yu and Xiaoyang Zeng},
  journal={IEEE Transactions on Very Large Scale Integration (VLSI) Systems},
  year={2013},
  volume={21},
  pages={2325--2330}
}
  • Jun Han, Shuai Wang, Wei Huang, Zhiyi Yu, Xiaoyang Zeng
  • Published 1 December 2013
  • Computer Science
  • IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Montgomery multiplication is the kernel operation in public key ciphers. Aiming at parallel implementation of Montgomery multiplication, this brief presents an improved task partitioning of the Montgomery multiplication algorithm for the multicore platform with area-efficient processors. Several multicore platforms are designed to verify the efficiency of parallelization. The fastest platform takes 3460 cycles to finish a 1024-b Montgomery multiplication, which is six times faster than a single… 
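For orientation, the bit-serial (radix-2) Montgomery multiplication that the abstract refers to can be sketched as follows. This is the textbook single-core form, not the task partitioning proposed in the brief; the function names `mont_mul_radix2` and `mod_mul_via_montgomery` are hypothetical and chosen only for illustration. Each loop iteration depends on the accumulator `s` from the previous one, which is the data dependence any parallel partitioning has to work around.

```python
def mont_mul_radix2(a, b, n, k):
    """Return a * b * 2^(-k) mod n using bit-serial Montgomery multiplication.

    Assumes n is odd, n < 2^k, and 0 <= a, b < n.
    """
    s = 0
    for i in range(k):
        a_i = (a >> i) & 1      # i-th bit of a, least significant first
        s += a_i * b            # conditionally add the multiplicand
        if s & 1:               # make s even by adding the (odd) modulus
            s += n
        s >>= 1                 # exact division by 2
    # The loop keeps s < 2n, so one conditional subtraction suffices.
    if s >= n:
        s -= n
    return s


def mod_mul_via_montgomery(x, y, n, k):
    """Compute x * y mod n by converting into and out of Montgomery form."""
    r = 1 << k                                   # Montgomery radix R = 2^k
    x_bar = (x * r) % n                          # x in Montgomery form
    y_bar = (y * r) % n                          # y in Montgomery form
    z_bar = mont_mul_radix2(x_bar, y_bar, n, k)  # equals x * y * R mod n
    return mont_mul_radix2(z_bar, 1, n, k)       # strip the remaining factor R


if __name__ == "__main__":
    # Small illustrative operands; a real cipher would use e.g. k = 1024.
    print(mod_mul_via_montgomery(123, 45, 239, 8))
```

The conversion steps in `mod_mul_via_montgomery` use an ordinary `%` for brevity; in practice they are usually done once per exponentiation, so the division-free inner loop dominates the cost.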

An Efficient Implementation of Montgomery Multiplication on Multicore Platform With Optimized Algorithm, Task Partitioning, and Network Architecture
TLDR
A block-level parallel algorithm for MM with quotient pipelining is proposed and optimally mapped onto a network-on-chip-based multicore platform equipped with a broadcasting mechanism, to maximize the speedup ratio for a given intercore communication latency.
Parallelism exploitation of montgomery multiplication in RNS on NoC-based platform
TLDR
An efficient parallelization scheme is proposed to overcome the influence caused by communication latency and is shown to be more resistant to communication latency than the state of the art MM algorithm.
A Systolic Hardware Architecture of Montgomery Modular Multiplication for Public Key Cryptosystems
TLDR
This work shows how to relax the data dependency in conventional word-based algorithms to increase the possibility of reusing the current words of variables, and proposes a novel scheduling scheme to reduce the number of memory accesses in the developed scalable microarchitecture.
Efficient VLSI Architecture for Montgomery Modular Multiplier
TLDR
A configurable CSA (CCSA) is proposed for performing modular multiplication by using two serial half-adders, and a mechanism that can detect and skip the unnecessary carry-save addition operations while maintaining the short critical path delay is developed by designing a skip detector.
A Heterogeneous Multicore Crypto-Processor With Flexible Long-Word-Length Computation
TLDR
The proposed multicore processor provides flexible and efficient computation for various forms of RSA and ECC algorithms, fulfilling low-latency or high-throughput requirements of different application scenarios, by using a heterogeneous multicore architecture.
VLSI Implementation of High Performance Montgomery Modular Multiplication for Crypto Graphical Application
TLDR
Experimental results show that the proposed Montgomery modular multiplier can achieve higher performance and a significant area–time product improvement when compared with previous designs.
VLSI ARCHITECTURE FOR MONTGOMERY MODULAR MULTIPLICATION ALGORITHM BY USING PASTA ADDER
TLDR
A configurable CSA (CCSA), which could be one full-adder or two serial half-adders, is proposed to reduce the extra clock cycles for operand pre-computation and format conversion by half, to overcome the weakness in the Montgomery modular multiplier.
Efficient Area and Delay Profile Architecture of Asynchronous Parallel Self Timed Adder Based Montgomery Multiplication
TLDR
This architecture proposes improvements on the well-known Montgomery multiplication algorithm and its previous implementations; the proposed work was implemented and synthesized in the Xilinx ISE design suite using a hardware description language (HDL).
Enhanced Vlsi Architecture For Montgomery Modular Multiplication In Digital Filters
TLDR
A configurable CSA (CCSA), which could be one full-adder or two serial half-adders, is proposed to reduce the extra clock cycles for operand pre-computation and format conversion by half.
Low Power Montgomery Modular Multiplication Using Carry Save Adder
TLDR
A mechanism that can detect and skip the unnecessary carry-save addition operations in the one-level CCSA architecture while maintaining the short critical path delay is developed and high throughput can be obtained.

References

Montgomery Modular Multiplication Algorithm on Multi-Core Systems
TLDR
This paper first implements the Montgomery modular multiplication on a multi-core system with general-purpose cores, and then speeds it up by adopting the Multiply-Accumulate (MAC) operation in each core.
A Parallel Implementation of Montgomery Multiplication on Multicore Systems: Algorithm, Analysis, and Prototype
TLDR
This work presents a parallel software implementation of the Montgomery multiplication for multicore systems, pSHS, and shows that it is high-performance, scalable over different numbers of cores, and stable when the communication latency changes.
Analyzing and comparing Montgomery multiplication algorithms
TLDR
The operations involved in computing the Montgomery product are studied, and several high-speed, space-efficient algorithms for computing MonPro(a, b) are described along with their time and space requirements.
A Scalable Architecture for Modular Multiplication Based on Montgomery's Algorithm
TLDR
A word-based version of MM is presented and used to explain the main concepts in the hardware design and gives enough freedom to select the word size and the degree of parallelism to be used, according to the available area and/or desired performance.
A low-complexity heterogeneous multi-core platform for security soc
TLDR
Comparison results show that this heterogeneous multi-core SoC platform, designed to handle intensive cryptographic algorithms in different security protocols, has a low-complexity hardware cost yet offers more flexibility.
Modular multiplication without trial division
TLDR
A method for multiplying two integers modulo N while avoiding division by N, based on a representation of residue classes that speeds up modular multiplication without affecting the modular addition and subtraction algorithms.
Challenges of programming multi-core microprocessors
TLDR
It is claimed that many of the programming abstractions for parallel programs have been honed for the development of closed-world software such as operating system kernels and are not suitable for modular application development.
Test power reduction with multiple capture orders
TLDR
A multiple-capture-orders method is developed to guarantee the full scan fault coverage and a test architecture based on a ring control structure is adopted which makes the test control very simple and requires very low area overhead.
Fast and accurate protocol specific bus modeling using TLM 2.0
TLDR
A new methodology is introduced that enables the creation of fast and cycle accurate protocol specific bus-based communication models, based on the new TLM 2.0 standard from the Open SystemC Initiative (OSCI).
Combining Behavioural Real-time Software Modelling with the OSCI TLM-2.0 Communication Standard
  • K. Yu, N. Audsley
  • Computer Science
  • 2010 10th IEEE International Conference on Computer and Information Technology
  • 2010
TLDR
A software Processing Element (PE) model is implemented which effectively integrates mixed timing RTOS-centric software models, abstract processor hardware functions, and OSCI TLM-2.0 communication interfaces.