Montgomery Multiplication on the Cell

@inproceedings{Bos2009MontgomeryMO,
  title={Montgomery Multiplication on the Cell},
  author={Joppe W. Bos and Marcelo E. Kaihara},
  booktitle={PPAM},
  year={2009}
}
A technique to speed up Montgomery multiplication targeted at the Synergistic Processor Elements (SPE) of the Cell Broadband Engine is proposed. The technique consists of splitting a number into four consecutive parts. These parts are placed one by one in each of the four element positions of a vector, representing columns in a 4-SIMD organization. This representation enables arithmetic to be performed in a 4-SIMD fashion. An implementation of the Montgomery multiplication using this technique… 
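The four-part layout described in the abstract can be illustrated with a minimal sketch. It assumes 16-bit words and little-endian word order, and the helper names `to_columns` and `from_columns` are hypothetical; none of these details are confirmed by the abstract, and the code only models the data layout, not the paper's actual SPE implementation.

```python
WORD_BITS = 16  # assumed word size, chosen for illustration only
LANES = 4       # four element positions per vector (4-SIMD)


def to_columns(x, words_per_part):
    """Split x into four consecutive parts of words_per_part words each.

    Vector i holds word i of every part, so lane j of each vector
    belongs to part j (one column per part).
    """
    mask = (1 << WORD_BITS) - 1
    total_words = LANES * words_per_part
    if x < 0 or x >> (WORD_BITS * total_words):
        raise ValueError("x does not fit in the requested number of words")
    # Little-endian list of words: words[0] is the least significant.
    words = [(x >> (WORD_BITS * k)) & mask for k in range(total_words)]
    return [
        tuple(words[j * words_per_part + i] for j in range(LANES))
        for i in range(words_per_part)
    ]


def from_columns(vectors):
    """Inverse of to_columns: rebuild the integer from its vectors."""
    words_per_part = len(vectors)
    x = 0
    for i, vec in enumerate(vectors):
        for j, word in enumerate(vec):
            x |= word << (WORD_BITS * (j * words_per_part + i))
    return x
```

With this layout, one vector operation touches the same word index in all four parts at once, which is what makes 4-SIMD arithmetic on a single large number possible in principle; carry handling between parts is deliberately left out of the sketch.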
Montgomery Multiplication Using Vector Instructions
TLDR
Presents a parallel approach to computing interleaved Montgomery multiplication that is particularly suited to 2-way single instruction, multiple data (SIMD) platforms, which are found on most modern computer architectures in the form of vector instruction set extensions.
Montgomery multiplication using CUDA
TLDR
This paper implements a highly optimized systolic Montgomery multiplication algorithm using NVIDIA's general-purpose parallel programming model, CUDA (Compute Unified Device Architecture), for NVIDIA GPUs, and shows that this version is faster than previously implemented multiprecision Montgomery multiplication algorithms, while also providing an intuitive data representation.
Montgomery Modular Multiplication on ARM-NEON Revisited
TLDR
The Cascade Operand Scanning (COS) method is introduced to speed up multi-precision multiplication on SIMD architectures, and it is shown that two COS computations can be “coarsely” integrated into an efficient vectorized variant of Montgomery modular multiplication, which the paper calls the CICOS method.
Montgomery Arithmetic from a Software Perspective
This chapter describes Peter L. Montgomery’s modular multiplication method and the various improvements to reduce the latency for software implementations on devices which have access to many
Faster ECC over \(\mathbb{F}_{2^{521}-1}\) (feat. NEON)
TLDR
High speed parallel multiplication and squaring algorithms for the Mersenne prime \(2^{521}-1\) are presented in order to provide asymptotically faster integer multiplication and fast reduction algorithms.
Efficient arithmetic on ARM-NEON and its application for high-speed RSA implementation
TLDR
A novel Double Operand Scanning (DOS) method is introduced to speed up multi-precision squaring with non-redundant representations on SIMD architectures; it is compatible with separated Montgomery algorithms and highly efficient for the RSA cryptosystem.
Pollard Rho on the PlayStation 3
TLDR
This paper describes a high-performance PlayStation 3 implementation of the Pollard rho discrete logarithm algorithm on elliptic curves over prime fields and most of the implementation strategies apply to other large moduli as well.
PhiRSA: Exploiting the Computing Power of Vector Instructions on Intel Xeon Phi for RSA
TLDR
A vector-oriented Montgomery multiplication design based on the vector carry propagation chain (VCPC) method is presented to fully exploit the computing power of vector instructions on Intel Xeon Phi; it achieves high throughput comparable to that of GPUs but with far fewer parallel tasks, and low latency comparable to that of CPUs.
Investigating large integer arithmetic on Intel Xeon Phi SIMD extensions
  • A. Keliris, M. Maniatakos
  • Computer Science, Mathematics
    2014 9th IEEE International Conference on Design & Technology of Integrated Systems in Nanoscale Era (DTIS)
  • 2014
TLDR
Preliminary results indicate that the Knights Corner SIMD speedup of large integer multiplication is limited by the absence of specific instructions that typically appear in common SIMD architectures, but emulation on Knights Landing shows that large integer arithmetic can indeed benefit from the presence of 512-bit vectors, for commonly used 1024- and 2048-bit operands, compared to publicly available large arithmetic libraries.
On the Cryptanalysis of Public-Key Cryptography
TLDR
The elliptic curve method (ECM) for integer factorization is the asymptotically fastest method to find relatively small factors of large integers and the performance of ECM gives information about secure parameter choices of some cryptographic protocols.

References

Multi-Stream Hashing on the PlayStation 3
TLDR
This work presents high-performance multi-stream versions of cryptographic hash functions from the MD/SHA-family, which can be useful for cryptanalytic use as well as for utilizing the SPEs as cryptographic accelerators.
Accelerating SSL using the Vector processors in IBM's Cell Broadband Engine for Sony's Playstation 3
TLDR
This paper explores the implementation and performance gains when using the vector processing capabilities for SSL and shows that big improvements are still possible with the hardware designed primarily for other purposes.
Pollard Rho on the PlayStation 3
TLDR
This paper describes a high-performance PlayStation 3 implementation of the Pollard rho discrete logarithm algorithm on elliptic curves over prime fields and most of the implementation strategies apply to other large moduli as well.
Modular multiplication without trial division
TLDR
A method for multiplying two integers modulo N while avoiding division by N, a representation of residue classes so as to speed modular multiplication without affecting the modular addition and subtraction algorithms.
Fast Elliptic-Curve Cryptography on the Cell Broadband Engine
This paper is the first to investigate the power of the Cell Broadband Engine for state-of-the-art public-key cryptography. We present a high-speed implementation of elliptic-curve Diffie-Hellman
Power efficient processor architecture and the cell processor
  • H. P. Hofstee
  • Computer Science
    11th International Symposium on High-Performance Computer Architecture
  • 2005
TLDR
The paper discusses some of the challenges microprocessor designers face and provides motivation for performance per transistor as a reasonable first-order metric for design efficiency, and alternate architectural choices and some of its limitations are discussed.
Fast Implementations of AES on Various Platforms
TLDR
This paper presents new software speed records for encryption and decryption using the block cipher AES-128 on different architectures, including the first AES implementation for the GPU that implements both encryption and decryption.
Montgomery exponentiation needs no final subtractions
Montgomery's modular multiplication algorithm is commonly used in implementations of the RSA cryptosystem. It has been observed that there is no need for extra cleaning up at the end of an
Short Chosen-Prefix Collisions for MD5 and the Creation of a Rogue CA Certificate
TLDR
A more flexible family of differential paths and a new variable birthdaying search space are described, leading to just three pairs of near-collision blocks to generate the collision, enabling construction of RSA moduli that are sufficiently short to be accepted by current CAs.
Advances in Cryptology — CRYPTO ’96
  • N. Koblitz
  • Computer Science, Mathematics
    Lecture Notes in Computer Science
  • 2001
TLDR
This work presents new, simple, and practical constructions of message authentication schemes based on a cryptographic hash function, and proves that NMAC and HMAC are secure as long as the underlying hash function has some reasonable cryptographic strengths.