Montgomery Multiplication Using Vector Instructions

@inproceedings{Bos2013MontgomeryMU,
  title={Montgomery Multiplication Using Vector Instructions},
  author={Joppe W. Bos and Peter L. Montgomery and Daniel Shumow and Gregory M. Zaverucha},
  booktitle={Selected Areas in Cryptography},
  year={2013}
}
In this paper we present a parallel approach to compute interleaved Montgomery multiplication. We have implemented this approach for tablet devices that run the x86 architecture (Intel Atom Z2760) using SSE2 instructions, as well as devices that run on the ARM platform (Qualcomm MSM8960, NVIDIA Tegra 3 and 4) using NEON instructions. When instantiating modular exponentiation with this parallel version of Montgomery multiplication we observed a performance increase of more than a factor of 1.5…
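For orientation, interleaved Montgomery multiplication can be sketched in a few lines of Python. This is a plain sequential sketch, not the vectorized scheme from the paper: the function name `montgomery_multiply`, the 32-bit limb width, and the use of Python big integers in place of limb arrays are all assumptions made for illustration.

```python
def montgomery_multiply(a, b, m, n_limbs, w=32):
    """Return a * b * R^-1 mod m, where R = 2^(w * n_limbs).

    Illustrative sketch (hypothetical helper, not the paper's code).
    Requires m odd, m < R, and 0 <= a, b < m.
    """
    r = 1 << w                      # radix of one limb
    mask = r - 1
    mu = (-pow(m, -1, r)) % r       # mu = -m^-1 mod 2^w (needs Python 3.8+)
    c = 0
    for i in range(n_limbs):
        a_i = (a >> (w * i)) & mask  # i-th limb of a
        c += a_i * b                 # multiplication step
        q = (mu * (c & mask)) & mask # makes c + q*m divisible by 2^w
        c = (c + q * m) >> w         # reduction step, interleaved with the above
    if c >= m:                       # final conditional subtraction
        c -= m
    return c


# Usage: multiply two residues modulo a 127-bit prime.
m = (1 << 127) - 1
n_limbs = 4
R = 1 << (32 * n_limbs)
x, y = 0x123456789ABCDEF, 0xFEDCBA987654321
x_mont = (x * R) % m                 # convert to Montgomery form
y_mont = (y * R) % m
p_mont = montgomery_multiply(x_mont, y_mont, m, n_limbs)
product = montgomery_multiply(p_mont, 1, m, n_limbs)  # convert back
```

Each loop iteration alternates one multiplication step with one reduction step, which is what "interleaved" refers to. A SIMD implementation would map the limb products onto vector lanes rather than rely on arbitrary-precision integers as this sketch does.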
Fast Multiple Montgomery Multiplications Using Intel AVX-512IFMA Instructions
TLDR
A fast implementation of multiple Montgomery multiplications using Intel AVX-512IFMA (Integer Fused Multiply-Add) instructions is proposed, based on a modified Montgomery multiplication.
Montgomery Modular Multiplication on ARM-NEON Revisited
TLDR
The Cascade Operand Scanning (COS) method is introduced to speed up multi-precision multiplication on SIMD architectures, and it is shown that two COS computations can be “coarsely” integrated into an efficient vectorized variant of Montgomery modular multiplication, which the paper calls the CICOS method.
Parallel modular multiplication using 512-bit advanced vector instructions
TLDR
A new block-based variant of Montgomery multiplication, the Block Product Scanning (BPS) method, is presented; it is particularly efficient using new 512-bit advanced vector instructions (AVX-512) on modern Intel processor families, and allows for squaring and sub-quadratic Karatsuba enhancements.
Speeding up elliptic curve arithmetic on ARM processors using NEON instructions
TLDR
Dual NEON-based multiplications and squarings in the finite field Fp are proposed, with a reported performance improvement of up to 30% over a conventional implementation of the same operation when instantiated on 256-bit elliptic curves over either Fp or Fp2.
Montgomery Arithmetic from a Software Perspective
This chapter describes Peter L. Montgomery’s modular multiplication method and the various improvements to reduce the latency for software implementations on devices which have access to many
Montgomery multiplication using CUDA
TLDR
This paper implements a highly optimized systolic Montgomery multiplication algorithm using NVIDIA's general-purpose parallel programming model, CUDA (Compute Unified Device Architecture), for NVIDIA GPUs, and shows that this version is faster than previously implemented multiprecision Montgomery multiplication algorithms, while also providing an intuitive data representation.
ARM/NEON Co-design of Multiplication/Squaring
TLDR
This work introduces a new parallel approach for integer multiplication and squaring on ARM–NEON processors that mixes ARM and NEON instructions to hide ARM computation latency behind NEON computation, outperforming the best-known results on identical ARM–NEON processors.
Efficient arithmetic on ARM-NEON and its application for high-speed RSA implementation
TLDR
A novel Double Operand Scanning (DOS) method is introduced to speed up multi-precision squaring with non-redundant representations on SIMD architectures; it is compatible with separated Montgomery algorithms and highly efficient for the RSA cryptosystem.
Implementation of RSA Signatures on GPU and CPU Architectures
TLDR
This paper reports a constant-time CPU and GPU software implementation of RSA exponentiation that uses algorithms offering a first-line defense against timing and cache attacks, and finds that a combination of the schoolbook and Karatsuba algorithms for integer multiplication, along with Montgomery reduction, yields the fastest modular multiplication procedure.

References

Montgomery Multiplication on the Cell
TLDR
A technique to speed up Montgomery multiplication targeted at the Synergistic Processor Elements (SPE) of the Cell Broadband Engine is proposed, which consists of splitting a number into four consecutive parts, representing columns in a 4-SIMD organization.
Software Implementation of Modular Exponentiation, Using Advanced Vector Instructions Architectures
TLDR
It is demonstrated, for the first time, how such a software approach can outperform classical scalar (ALU) implementations on high-end x86_64 platforms, provided they have a wide SIMD architecture.
Parallel cryptographic arithmetic using a redundant Montgomery representation
  • D. Page, N. Smart
  • Computer Science, Mathematics
    IEEE Transactions on Computers
  • 2004
TLDR
It is shown that a SIMD parallel implementation of RSA can be around twice as fast as traditional sequential code, which is especially useful given the larger 2,048-bit RSA keys now being proposed for standard security levels.
High-Performance Modular Multiplication on the Cell Processor
TLDR
This paper presents software implementation speed records for modular multiplication on the synergistic processing elements of the Cell Broadband Engine (Cell) architecture, and proposes techniques for efficiently implementing modular multiplication algorithms that are suited to any architecture able to perform multiple computations concurrently.
Energy-Efficient Software Implementation of Long Integer Modular Arithmetic
TLDR
This paper investigates the performance and energy characteristics of software algorithms for long integer arithmetic, and shows that a combination of Karatsuba-Comba multiplication and Montgomery reduction achieves better performance than other algorithms for modular multiplication.
An RNS Montgomery modular multiplication algorithm
TLDR
The authors present a new RNS modular multiplication algorithm for very large operands, based on Montgomery's method adapted to mixed radix and performed using a residue number system.
Architectural Support for Long Integer Modulo Arithmetic on Risc-Based Smart Cards
TLDR
This paper investigates the potential of application-specific instruction set extensions for cryptographic workloads such as long integer arithmetic and proposes two special instructions that carry out computations of the form a × b + c + d, whereby a, b, c, d are single-precision words (unsigned integers).
Architectural Support for Long Integer Modulo Arithmetic on Risc-Based Smart Cards
TLDR
This paper defines two special instructions that carry out computations of the form a × b + c + d, whereby a, b, c, d are single-precision words (unsigned integers), and therefore they are simple to incorporate into common RISC architectures such as the MIPS32.
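One reason the form a × b + c + d is convenient for multi-precision arithmetic is that, for w-bit unsigned words, the result can never exceed two words: (2^w − 1)² + 2(2^w − 1) = 2^(2w) − 1. The short Python check below illustrates this bound. The function name `mul_add_add` and the 32-bit word size are assumptions made for illustration, not the instructions defined in the paper.

```python
def mul_add_add(a, b, c, d, w=32):
    """Compute a*b + c + d on w-bit unsigned words; return (high, low) words.

    Hypothetical helper for illustration only.
    """
    mask = (1 << w) - 1
    if not all(0 <= x <= mask for x in (a, b, c, d)):
        raise ValueError("operands must be w-bit unsigned words")
    t = a * b + c + d
    return t >> w, t & mask


# Worst case: every operand is the largest 32-bit word.
max_word = (1 << 32) - 1
high, low = mul_add_add(max_word, max_word, max_word, max_word)
# Both halves are exactly max_word, so the sum fills two words with no overflow.
```

Because the double-word result cannot overflow, an inner loop of a multi-precision multiplication can accumulate a partial product, an incoming carry, and an existing result word in a single step.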
Systolic-Arrays for Modular Exponentiation Using Montgomery Method (Extended Abstract)
TLDR
Two types of systolic array for Montgomery modular multiplication (MMM) are proposed, which can realize more efficient and flexible chip implementations than the array in [1].
On Software Parallel Implementation of Cryptographic Pairings
TLDR
This paper identifies several methods for exploiting parallelism within one pairing evaluation (intra-pairing) and parallelism between different pairing evaluations (inter-pairing), and shows that it is possible to accelerate pairing evaluation by a significant factor in comparison to a naive approach.