Architectural Enhancements for Montgomery Multiplication on Embedded RISC Processors

@inproceedings{Groschdl2003ArchitecturalEF,
  title={Architectural Enhancements for Montgomery Multiplication on Embedded RISC Processors},
  author={Johann Gro{\ss}sch{\"a}dl and Guy-Armand Kamendje},
  booktitle={ACNS},
  year={2003}
}
Montgomery multiplication normally spends over 90% of its execution time in inner loops executing some kind of multiply-and-add operations. The performance of these critical code sections can be greatly improved by customizing the processor’s instruction set for low-level arithmetic functions. In this paper, we investigate the potential of architectural enhancements for multiple-precision Montgomery multiplication according to the so-called Finely Integrated Product Scanning (FIPS) method. We… 
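The FIPS structure named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the 32-bit default word size, and the use of a Python integer `t` to stand in for a wide multiply/accumulate register are all assumptions made for illustration. The two nested loops show where the multiply-and-add operations concentrate.

```python
def to_words(x, s, w=32):
    """Split integer x into s little-endian words of w bits (hypothetical helper)."""
    mask = (1 << w) - 1
    return [(x >> (w * i)) & mask for i in range(s)]


def from_words(words, w=32):
    """Recombine little-endian w-bit words into an integer (hypothetical helper)."""
    return sum(word << (w * i) for i, word in enumerate(words))


def montgomery_fips(a, b, n, s, w=32):
    """Return a * b * R^-1 mod n with R = 2^(w*s).

    Assumes n is odd, n < 2^(w*s), and 0 <= a, b < n.
    """
    mask = (1 << w) - 1
    A, B, N = to_words(a, s, w), to_words(b, s, w), to_words(n, s, w)
    # n0_inv = -n^-1 mod 2^w, computed from the least significant word of n.
    n0_inv = (-pow(N[0], -1, 1 << w)) & mask
    m = [0] * s   # holds reduction factors first, then the result words
    t = 0         # accumulator; models a wide multiply/accumulate register

    # First half: columns 0 .. s-1 of the product, reduction interleaved.
    for i in range(s):
        for j in range(i):
            t += A[j] * B[i - j]      # multiply-and-add (operand product)
            t += m[j] * N[i - j]      # multiply-and-add (reduction)
        t += A[i] * B[0]
        m[i] = ((t & mask) * n0_inv) & mask
        t += m[i] * N[0]              # low word of t is now zero
        t >>= w

    # Second half: columns s .. 2s-1; each column yields one result word.
    for i in range(s, 2 * s):
        for j in range(i - s + 1, s):
            t += A[j] * B[i - j]
            t += m[j] * N[i - j]
        m[i - s] = t & mask
        t >>= w

    # t now holds the final carry; one conditional subtraction completes the reduction.
    result = from_words(m, w) + (t << (w * s))
    if result >= n:
        result -= n
    return result
```

Multiplying the returned value by R modulo n recovers `a * b mod n`, which gives a quick way to check the sketch against ordinary modular arithmetic.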
Performance Evaluation of Instruction Set Extensions for Long Integer Modular Arithmetic on a SPARC V8 Processor
TLDR
A partial loop unrolling (PLU) technique for modular multiplication is introduced which achieves large performance gains at the cost of a moderate increase in code size, while maintaining the full flexibility of a "rolled-loop" implementation.
Architectural Enhancements to Support Digital Signal Processing and Public-Key Cryptography
TLDR
The analysis shows that the MIPS32 architecture can be easily extended for efficient cryptography processing and offers some advantages compared to the ARMv5TE architecture.
Instruction Set Extensions for Fast Arithmetic in Finite Fields GF(p) and GF(2^m)
TLDR
This paper introduces a set of five custom instructions to accelerate arithmetic operations in finite fields GF(p) and GF(2^m), which can be easily integrated into a standard RISC architecture like MIPS32 and require only little extra hardware.
Architectural support for arithmetic in optimal extension fields
TLDR
This work introduces two custom instructions to accelerate the reduction modulo a pseudo-Mersenne (PM) prime and shows that the multiplication in an optimal extension field can take advantage of a multiply/accumulate unit with a wide accumulator so that a certain number of 64-bit products can be summed up without overflow.
Enhanced Montgomery Multiplication on DSP Architectures for Embedded Public-Key Cryptosystems
TLDR
This paper tackles the efficient support of modular exponentiation on inexpensive circuitry for embedded security services and proposes a variant of the finely integrated product scanning (FIPS) algorithm that is targeted to digital signal processors.
An efficient scalable and hybrid arithmetic unit for public key cryptographic applications
TLDR
An efficient, scalable, and flexible arithmetic unit is proposed that executes word-based multiplication and squaring for RSA arithmetic, and addition, multiplication, and inversion in GF(2^m) for Elliptic Curve Cryptography (ECC) arithmetic.
New Speed Records for Montgomery Modular Multiplication on 8-Bit AVR Microcontrollers
TLDR
A new variant of the widely used hybrid method for multiple-precision multiplication is presented that is 10.6% faster than the original hybrid technique. The paper also shows how to perform the modular subtraction of Montgomery reduction in a regular fashion, without executing conditional statements, so as to counteract Simple Power Analysis attacks.
Fast and Efficient Implementation of AES via Instruction Set Extensions
  • A. J. Elbirt
  • Computer Science
    21st International Conference on Advanced Information Networking and Applications Workshops (AINAW'07)
  • 2007
TLDR
A general purpose instruction set extension to a 32-bit SPARC V8 compatible processor core is presented that accelerates the performance of Galois Field fixed field constant multiplication, a core element of the AES algorithm.

References

SHOWING 1-10 OF 37 REFERENCES
Optimized RISC Architecture for Multiple-Precision Modular Arithmetic
TLDR
This paper presents an optimized assembly routine for fast multiple-precision multiplication with "finely" integrated Montgomery reduction (FIOS method) and demonstrates that the custom instructions double the processor's arithmetic performance compared to a standard MIPS32 core.
Implementing 1,024-bit RSA exponentiation on a 32-bit processor core
  • B. Phillips, N. Burgess
  • Computer Science
    Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors
  • 2000
This paper describes how long-wordlength (1024-bit) modular exponentiation may be implemented on a standard 32-bit microprocessor core with a total execution time of under 1 second. The design does
Lx: a technology platform for customizable VLIW embedded processing
TLDR
The experiments described in the paper show that specialization for an application domain is effective, yielding large gains in price/performance ratio, and that scaling machine resources scales performance, although not uniformly across all applications.
Instruction set selection for ASIP design
  • M. Gschwind
  • Computer Science
    Proceedings of the Seventh International Workshop on Hardware/Software Codesign (CODES'99) (IEEE Cat. No.99TH8450)
  • 1999
TLDR
The presented approach uses the processor core to allow early evaluation of ASIP design options using rapid prototyping techniques, and describes a hardware/software co-design methodology which can be used with this design approach.
Hardware/software instruction set configurability for system-on-chip processors
TLDR
This paper describes the key dimensions of extensibility within the processor architecture, the instruction set extension description language and the means of automatically extending the software environment from that description, and describes two groups of benchmarks that show 20 to 40 times acceleration of a broad set of algorithms through application-specific instruction set extensions, relative to high performance RISC processors.
An ASIP design methodology for embedded systems
TLDR
This paper presents a unique architecture and methodology to design ASIPs in the embedded controller domain by customizing an existing processor instruction set and architecture rather than creating an entirely new ASIP tuned to a benchmark.
Analyzing and comparing Montgomery multiplication algorithms
TLDR
The operations involved in computing the Montgomery product are studied, and several high-speed, space-efficient algorithms for computing MonPro(a, b) are described along with their time and space requirements.
Design of a high performance 32×32-bit multiplier with a novel sign select Booth encoder
  • Kiwon Choi, Minkyu Song
  • Computer Science, Engineering
    ISCAS 2001. The 2001 IEEE International Symposium on Circuits and Systems (Cat. No.01CH37196)
  • 2001
In this paper, a high performance 32×32-bit multiplier for a DSP core is proposed. The multiplier is composed of a novel sign select Booth encoder, an efficient data compressor block with a
Exponentiation Cryptosystems on the IBM PC
  • P. Comba
  • Computer Science
    IBM Syst. J.
  • 1990
TLDR
A mixed system that combines the superior key management capabilities inherent in public key cryptosystems with the much higher bulk-encryption speed obtainable with the Data Encryption Algorithm is discussed.
Modular multiplication without trial division
TLDR
A method is presented for multiplying two integers modulo N while avoiding division by N, based on a representation of residue classes that speeds up modular multiplication without affecting the modular addition and subtraction algorithms.