Efficient arithmetic on ARM-NEON and its application for high-speed RSA implementation

@article{Seo2015EfficientAO,
  title={Efficient arithmetic on ARM-NEON and its application for high-speed RSA implementation},
  author={Hwajeong Seo and Zhe Liu and Johann Gro{\ss}sch{\"a}dl and Howon Kim},
  journal={IACR Cryptol. ePrint Arch.},
  year={2015},
  volume={2015},
  pages={465}
}
Advanced modern processors support Single Instruction Multiple Data (SIMD) instructions (e.g., Intel-AVX, ARM-NEON), and a massive body of research has been conducted on vector-parallel implementations of modular arithmetic, a crucial component of modern public-key cryptography ranging from RSA, ElGamal and DSA to ECC. In this paper, we introduce a novel Double Operand Scanning (DOS) method to speed up multi-precision squaring with non-redundant representations on SIMD architectures. The…

ARM/NEON Co-design of Multiplication/Squaring
TLDR
This work introduces a new parallel approach for integer multiplication and squaring on ARM-NEON processors that mixes ARM and NEON instructions to hide the latency of ARM computation behind NEON, outperforming the best-known results on identical ARM-NEON processors.
SIDH on ARM: Faster Modular Multiplications for Faster Post-Quantum Supersingular Isogeny Key Exchange
TLDR
These results consolidate the practicality of supersingular isogeny-based protocols for many real-world applications and present efficient implementations of SIDH and SIKE for 64-bit ARMv8-A processors, based on a high-speed Montgomery multiplication that leverages the power of 64-bit instructions.
PhiRSA: Exploiting the Computing Power of Vector Instructions on Intel Xeon Phi for RSA
TLDR
A vector-oriented Montgomery multiplication design based on the vector carry propagation chain (VCPC) method is proposed to fully exploit the computing power of vector instructions on Intel Xeon Phi, achieving high throughput comparable to that of GPUs but with far fewer parallel tasks, and low latency comparable to that of CPUs.
SIKE in 32-bit ARM Processors Based on Redundant Number System for NIST Level-II
TLDR
An optimized implementation of the post-quantum Supersingular Isogeny Key Encapsulation for 32-bit ARMv7-A processors supporting the NEON engine (i.e., SIMD instructions) is presented, which is about 1.58× faster than the previous state-of-the-art work presented at CHES’18.
Parallel Implementations of LEA, Revisited
TLDR
The proposed implementations of LEA on ARM and NEON architectures achieve the fastest LEA encryption, within 3.2 cycles/byte on Cortex-A9 processors.
Secure Number Theoretic Transform and Speed Record for Ring-LWE Encryption on Embedded Processors
TLDR
This paper presents a secure and record-setting Ring-LWE encryption implementation on low-end 8-bit AVR processors; it targets the most expensive operation, i.e., Number Theoretic Transform (NTT) based polynomial multiplication, to provide countermeasures against timing attacks and the best performance among similar implementations to date.
Efficient Software Implementation of Ring-LWE Encryption on IoT Processors
TLDR
This paper presents the first implementation of ring-LWE encryption on ARM NEON and MSP430 architectures; the results are roughly 7 times faster than the fastest ECC implementation on the target platforms at the same security level.
NEON PQCryto: Fast and Parallel Ring-LWE Encryption on ARM NEON Architecture
TLDR
This paper presents the first implementation of ring-LWE encryption on the ARM NEON architecture, proposes a vectorized version of the iterative Number Theoretic Transform (NTT) for high-speed computation, and presents a 32-bit variant of the SAMS2 technique, originally proposed at CHES’15, for fast reduction.
NEON-SIDH: Efficient Implementation of Supersingular Isogeny Diffie-Hellman Key-Exchange Protocol on ARM
TLDR
The goal of this paper is to show that isogeny-based cryptosystems can be implemented further and be used as an alternative to classical cryptosystems on embedded devices.
Parallel Implementations of CHAM
TLDR
Novel parallel implementations of the CHAM-64/128 block cipher on modern ARM-NEON processors, in terms of instruction set and multiple cores, are presented, and the 4.2 cycles/byte result is competitive with the parallel implementations of LEA-128/128 and HIGHT-64/128 on the same processor.

References

Montgomery Modular Multiplication on ARM-NEON Revisited
TLDR
The Cascade Operand Scanning (COS) method is introduced to speed up multi-precision multiplication on SIMD architectures, and it is shown that two COS computations can be “coarsely” integrated into an efficient vectorized variant of Montgomery modular multiplication, which the paper calls the CICOS method.
Reverse Product-Scanning Multiplication and Squaring on 8-Bit AVR Processors
High performance, small code size, and good scalability are important requirements for software implementations of multi-precision arithmetic algorithms to fit resource-limited embedded systems. In
On the Evaluation of Multi-core Systems with SIMD Engines for Public-Key Cryptography
  • P. Martins, L. Sousa
  • Computer Science, Mathematics
    2014 International Symposium on Computer Architecture and High Performance Computing Workshop
  • 2014
TLDR
The efficiency of these devices when operating as cryptographic accelerators is assessed, using a two-tiered parallelism model, where not only multi-core, but also Single Instruction Multiple Data (SIMD) parallelism is exploited to increase the throughput of modular multiplications.
Stretching the limits of Programmable Embedded Devices for Public-key Cryptography
TLDR
The efficiency of embedded devices when operating as cryptographic accelerators is assessed, exploiting both multithreading and Single Instruction Multiple Data (SIMD) parallelism, and algorithms are proposed to simultaneously perform multiple modular multiplications.
New Speed Records for Montgomery Modular Multiplication on 8-Bit AVR Microcontrollers
TLDR
A new variant of the widely used hybrid method for multiple-precision multiplication that is 10.6% faster than the original hybrid technique is presented, and it is shown how to perform the modular subtraction of Montgomery reduction in a regular fashion, without executing conditional statements, so as to counteract Simple Power Analysis attacks.
SIMD acceleration of modular arithmetic on contemporary embedded platforms
TLDR
This contribution proposes vector processing techniques to accelerate modular multiplications in prime fields in ECC, and demonstrates implementations for the Venom (NEON) coprocessor in Qualcomm's Scorpion (ARM) CPU, as well as for the SSE2 instruction-set extensions in Intel's Atom CPU.
Energy-Efficient Software Implementation of Long Integer Modular Arithmetic
TLDR
This paper investigates the performance and energy characteristics of software algorithms for long integer arithmetic, and shows that a combination of Karatsuba-Comba multiplication and Montgomery reduction achieves better performance than other algorithms for modular multiplication.
Software Implementation of Modular Exponentiation, Using Advanced Vector Instructions Architectures
TLDR
It is demonstrated, for the first time, how such a software approach can outperform classical scalar (ALU) implementations on high-end x86_64 platforms, provided they have a wide SIMD architecture.
Multi-precision Multiplication for Public-Key Cryptography on Embedded Microprocessors
TLDR
This paper proposes a novel method, i.e., “consecutive operand caching”, which reduces the number of required load instructions by caching the operands and boosts the speed of multi-precision multiplication by 3.85%, as compared to previous best known results.
Multi-precision Squaring for Public-Key Cryptography on Embedded Microprocessors
TLDR
The novel and flexible SBD method, which delays the doubling process until the very end of the partial-product computation and then doubles the result by simply shifting it one bit to the left, outperforms state-of-the-art implementations by between 3.5% and 4.4%.