# Montgomery Modular Multiplication on ARM-NEON Revisited

@article{Seo2014MontgomeryMM, title={Montgomery Modular Multiplication on ARM-NEON Revisited}, author={Hwajeong Seo and Zhe Liu and Johann Gro{\ss}sch{\"a}dl and Jongseok Choi and Howon Kim}, journal={IACR Cryptol. ePrint Arch.}, year={2014}, volume={2014}, pages={760} }

Montgomery modular multiplication constitutes the “arithmetic foundation” of modern public-key cryptography, with applications ranging from RSA, DSA and Diffie-Hellman over elliptic curve schemes to pairing-based cryptosystems. [...] The COS method operates on 32-bit words in a row-wise fashion (similar to the operand-scanning method) and does not require a “non-canonical” representation of operands with a reduced radix. We show that two COS computations can be “coarsely” integrated into an efficient…
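To make the row-wise, 32-bit-word processing mentioned in the abstract concrete, here is a minimal Python sketch of word-level Montgomery multiplication. It is an assumption-laden illustration, not the paper's COS method: it uses the textbook coarsely integrated operand scanning (CIOS) layout, runs on plain Python integers rather than NEON registers, and is not constant-time. The names `to_words`, `from_words`, and `mont_mul_words` are hypothetical helpers introduced here.

```python
W = 32                      # word size in bits (full radix, no reduced-radix form)
MASK = (1 << W) - 1


def to_words(x, s):
    """Split x into s little-endian 32-bit words."""
    return [(x >> (W * i)) & MASK for i in range(s)]


def from_words(ws):
    """Recombine little-endian 32-bit words into an integer."""
    return sum(w << (W * i) for i, w in enumerate(ws))


def mont_mul_words(a, b, n, s):
    """Return a*b*R^-1 mod n with R = 2^(32*s); requires n odd and a, b < n < R."""
    n0_inv = (-pow(n, -1, 1 << W)) & MASK      # -n^-1 mod 2^32 (Python 3.8+)
    A, B, N = to_words(a, s), to_words(b, s), to_words(n, s)
    t = [0] * (s + 2)
    for i in range(s):
        # Multiplication row: t += A * B[i]
        carry = 0
        for j in range(s):
            acc = t[j] + A[j] * B[i] + carry
            t[j] = acc & MASK
            carry = acc >> W
        acc = t[s] + carry
        t[s] = acc & MASK
        t[s + 1] = acc >> W
        # Reduction row: choose m so that t + m*N is divisible by 2^32,
        # then shift the accumulator down by one word.
        m = (t[0] * n0_inv) & MASK
        carry = (t[0] + m * N[0]) >> W
        for j in range(1, s):
            acc = t[j] + m * N[j] + carry
            t[j - 1] = acc & MASK
            carry = acc >> W
        acc = t[s] + carry
        t[s - 1] = acc & MASK
        t[s] = t[s + 1] + (acc >> W)
    r = from_words(t[:s + 1])
    return r - n if r >= n else r              # final conditional subtraction
```

Each outer iteration handles one word of `b`: a multiplication row followed by a reduction row, so the accumulator never grows beyond `s + 2` words. A SIMD implementation would map these inner loops onto vector multiply-accumulate instructions, which this sketch does not attempt.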

## 27 Citations

Efficient arithmetic on ARM-NEON and its application for high-speed RSA implementation

- Computer Science, Mathematics
- Secur. Commun. Networks
- 2015

A novel Double Operand Scanning (DOS) method is introduced to speed up multi-precision squaring with non-redundant representations on SIMD architectures; it is compatible with separated Montgomery algorithms and highly efficient for the RSA cryptosystem.

SIDH on ARM: Faster Modular Multiplications for Faster Post-Quantum Supersingular Isogeny Key Exchange

- Computer Science, Mathematics
- IACR Cryptol. ePrint Arch.
- 2018

These results consolidate the practicality of supersingular isogeny-based protocols for many real-world applications and present efficient implementations of SIDH and SIKE for 64-bit ARMv8-A processors, based on a high-speed Montgomery multiplication that leverages the power of 64-bit instructions.

Parallel Implementation of SM2 Elliptic Curve Cryptography on Intel Processors with AVX2

- Computer Science, Mathematics
- ACISP
- 2020

This paper presents an efficient and secure implementation of SM2, the Chinese elliptic curve cryptography standard that has been adopted by the International Organization for Standardization (ISO) as ISO/IEC 14888-3:2018; it is the first constant-time implementation of the Co-Z based ladder that leverages the parallelism of AVX2.

ARM/NEON Co-design of Multiplication/Squaring

- Computer Science
- WISA
- 2017

This work introduces a new parallel approach for integer multiplication and squaring on ARM–NEON processors that mixes ARM and NEON instructions to hide the latency of ARM computation behind NEON computation, outperforming the best-known results on identical ARM–NEON processors.

Speeding up elliptic curve arithmetic on ARM processors using NEON instructions

- Mathematics, Computer Science
- 2020

Dual NEON-based multiplications and squarings in the finite field Fp are proposed, with a reported performance improvement of up to 30% over a conventional implementation of the same operation when instantiated on 256-bit elliptic curves over either Fp or Fp2.

NEON PQCryto: Fast and Parallel Ring-LWE Encryption on ARM NEON Architecture

- Computer Science, Mathematics
- IACR Cryptol. ePrint Arch.
- 2015

This paper presents the first implementation of ring-LWE encryption on the ARM NEON architecture, proposes a vectorized version of the iterative Number Theoretic Transform (NTT) for high-speed computation, and presents a 32-bit variant of the SAMS2 technique, originally proposed at CHES’15, for fast reduction.

PhiRSA: Exploiting the Computing Power of Vector Instructions on Intel Xeon Phi for RSA

- Computer Science
- SAC
- 2016

A vector-oriented Montgomery multiplication design based on the vector carry propagation chain (VCPC) method fully exploits the computing power of vector instructions on Intel Xeon Phi, achieving high throughput comparable to that of GPUs but with far fewer parallel tasks, and small latency comparable to that of CPUs.

Efficient VLSI Architecture for Montgomery Modular Multiplier

- Computer Science
- 2017

A Configurable CSA (CCSA) is proposed for performing modular multiplication using two serial half-adders, and a skip detector is designed to detect and skip unnecessary carry-save addition operations while maintaining the short critical-path delay.

Efficient Software Implementation of Ring-LWE Encryption on IoT Processors

- Computer Science
- IEEE Transactions on Computers
- 2020

This paper presents the first implementation of ring-LWE encryption on ARM NEON and MSP430 architectures; the results are roughly 7 times faster than the fastest ECC implementation on the desired platforms at the same security level.

Supersingular Isogeny Diffie-Hellman Key Exchange on 64-Bit ARM

- Computer Science, Mathematics
- IEEE Transactions on Dependable and Secure Computing
- 2019

An efficient implementation of the supersingular isogeny Diffie-Hellman (SIDH) key exchange protocol on 64-bit ARMv8 processors for 125- and 160-bit post-quantum security levels is presented and a comprehensive analysis of both approaches based on the inversion-to-multiplication ratio is provided.

## References

SIMD acceleration of modular arithmetic on contemporary embedded platforms

- Computer Science, Mathematics
- 2013 IEEE High Performance Extreme Computing Conference (HPEC)
- 2013

This contribution proposes vector processing techniques to accelerate modular multiplications in prime fields in ECC, and demonstrates implementations for the Venom (NEON) coprocessor in Qualcomm's Scorpion (ARM) CPU, as well as for the SSE2 instruction-set extensions in Intel's Atom CPU.

Efficient and Secure Algorithms for GLV-Based Scalar Multiplication and Their Implementation on GLV-GLS Curves

- Computer Science
- CT-RSA
- 2014

The techniques reduce the cost of adding protection against timing attacks in the GLV-based variable-base scalar multiplication computation to below 10% and contribute to the improvement of the state-of-the-art performance of elliptic curve computations.

Montgomery Multiplication Using Vector Instructions

- Computer Science
- Selected Areas in Cryptography
- 2013

A parallel approach to compute interleaved Montgomery multiplication which is particularly suitable to be computed on 2-way single instruction, multiple data platforms as can be found on most modern computer architectures in the form of vector instruction set extensions is presented.

Fast Software Polynomial Multiplication on ARM Processors Using the NEON Engine

- Computer Science, Mathematics
- CD-ARES Workshops
- 2013

A novel software multiplier for performing a polynomial multiplication of two 64-bit binary polynomials based on the VMULL instruction included in the NEON engine supported in many ARM processors is described, obtaining a fast software multiplication in the binary field \(\mathbb{F}_{2^m}\), which is up to 45% faster compared to the best known algorithm.

Software Implementation of Modular Exponentiation, Using Advanced Vector Instructions Architectures

- Computer Science
- WAIFI
- 2012

It is demonstrated, for the first time, how such a software approach can outperform the classical scalar (ALU) implementations, on the high end x86_64 platforms, if they have a wide SIMD architecture.

Montgomery Multiplication on the Cell

- Computer Science, Mathematics
- PPAM
- 2009

A technique to speed up Montgomery multiplication targeted at the Synergistic Processor Elements (SPE) of the Cell Broadband Engine is proposed, which consists of splitting a number into four consecutive parts, representing columns in a 4-SIMD organization.

NEON Crypto

- Computer Science, Mathematics
- CHES
- 2012

This paper explains how to use a single 800MHz Cortex A8 core to compute the existing NaCl suite of high-security cryptographic primitives at the following speeds: 5.60 cycles per byte (1.14 Gbps) to encrypt using a shared secret key, 2.30 cycles per byte (2.78 Gbps), and 244655 cycles (3269/second) to sign a message.

NEON Implementation of an Attribute-Based Encryption Scheme

- Computer Science, Mathematics
- ACNS
- 2013

This paper presents the design of a software cryptographic library that implements a 127-bit security level attribute-based encryption scheme over mobile devices equipped with a 1.4 GHz Exynos 4 Cortex-A9 processor and a developing board that hosts a 2.7 GHz Exynos 5 Cortex-A15 processor.

Modular multiplication without trial division

- Mathematics, Computer Science
- 1985

A method for multiplying two integers modulo N while avoiding division by N, using a representation of residue classes that speeds up modular multiplication without affecting the modular addition and subtraction algorithms.
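The idea summarized in this entry can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not a transcription of the 1985 paper: R is taken to be a power of two, 2^k, larger than an odd modulus n, and the names `montgomery_setup` and `redc` are hypothetical. Residues are stored as x·R mod n, and the only reduction primitive is `redc`, which replaces division by n with a mask and a shift.

```python
def montgomery_setup(n, k):
    """Return (R, n') with R = 2^k and n' = -n^-1 mod R; requires n odd, n < R."""
    r = 1 << k
    n_prime = (-pow(n, -1, r)) % r     # modular inverse via pow (Python 3.8+)
    return r, n_prime


def redc(t, n, k, n_prime):
    """Return t * R^-1 mod n for 0 <= t < R*n, without dividing by n."""
    mask = (1 << k) - 1
    m = ((t & mask) * n_prime) & mask  # makes t + m*n divisible by R
    u = (t + m * n) >> k               # exact division by R is a shift
    return u - n if u >= n else u


# Small usage example with an assumed toy modulus.
n, k = 97, 8
r, n_prime = montgomery_setup(n, k)
x, y = 42, 77
xm, ym = (x * r) % n, (y * r) % n          # convert to the special representation
prod_m = redc(xm * ym, n, k, n_prime)      # product, still in the representation
sum_m = (xm + ym) % n                      # addition is the ordinary modular one
prod = redc(prod_m, n, k, n_prime)         # convert back to an ordinary residue
total = redc(sum_m, n, k, n_prime)
```

Converting into and out of the representation costs extra work, so the approach pays off when many multiplications share one modulus, as in modular exponentiation.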

Implementing the Rivest Shamir and Adleman Public Key Encryption Algorithm on a Standard Digital Signal Processor

- Computer Science
- CRYPTO
- 1986

A description of the techniques employed at Oxford University to obtain a high speed implementation of the RSA encryption algorithm on an "off-the-shelf" digital signal processing chip and the techniques of algorithm development employed lead to a provably correct implementation.