# Architectural Enhancements for Montgomery Multiplication on Embedded RISC Processors

@inproceedings{Groschdl2003ArchitecturalEF, title={Architectural Enhancements for Montgomery Multiplication on Embedded RISC Processors}, author={Johann Gro{\ss}sch{\"a}dl and Guy-Armand Kamendje}, booktitle={ACNS}, year={2003} }

Montgomery multiplication normally spends over 90% of its execution time in inner loops executing some kind of multiply-and-add operations. The performance of these critical code sections can be greatly improved by customizing the processor’s instruction set for low-level arithmetic functions. In this paper, we investigate the potential of architectural enhancements for multiple-precision Montgomery multiplication according to the so-called Finely Integrated Product Scanning (FIPS) method. We…
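
To make the inner-loop structure concrete, the following is a minimal Python sketch of a FIPS-style Montgomery multiplication. It is an illustration under stated assumptions, not the paper's implementation: the word size `W = 32`, the helper names (`to_words`, `from_words`, `fips_montmul`), and the use of a Python integer `acc` to stand in for a wide hardware accumulator are all choices made here. Each column interleaves the `a[j] * b[i-j]` products with the `m[j] * n[i-j]` reduction products, which is where the multiply-and-add operations mentioned above accumulate.

```python
W = 32                      # assumed word size in bits
MASK = (1 << W) - 1

def to_words(x, s):
    """Split x into s little-endian W-bit words."""
    return [(x >> (W * i)) & MASK for i in range(s)]

def from_words(ws):
    """Recombine little-endian W-bit words into an integer."""
    return sum(w << (W * i) for i, w in enumerate(ws))

def fips_montmul(a, b, n, s):
    """Return a*b*R^-1 mod n with R = 2^(W*s); n must be odd, a, b < n."""
    A, B, N = to_words(a, s), to_words(b, s), to_words(n, s)
    n0_inv = (-pow(N[0], -1, 1 << W)) & MASK   # -n^-1 mod 2^W
    M = [0] * s
    acc = 0                                    # models a wide accumulator
    # Lower half: columns 0 .. s-1 determine the quotient words M[i].
    for i in range(s):
        for j in range(i):
            acc += A[j] * B[i - j] + M[j] * N[i - j]
        acc += A[i] * B[0]
        M[i] = (acc * n0_inv) & MASK
        acc += M[i] * N[0]                     # low word of acc is now zero
        acc >>= W
    # Upper half: columns s .. 2s-1 produce the result words.
    out = [0] * s
    for i in range(s, 2 * s):
        for j in range(i - s + 1, s):
            acc += A[j] * B[i - j] + M[j] * N[i - j]
        out[i - s] = acc & MASK
        acc >>= W
    r = from_words(out) + (acc << (W * s))     # acc holds the final carry
    return r - n if r >= n else r              # result before this step is < 2n
```

The result stays in the Montgomery domain: multiplying it by `R mod n` and reducing recovers `a*b mod n`.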

## 31 Citations

Performance Evaluation of Instruction Set Extensions for Long Integer Modular Arithmetic on a SPARC V8 Processor

- Computer Science, Mathematics
- 10th Euromicro Conference on Digital System Design Architectures, Methods and Tools (DSD 2007)
- 2007

A partial loop unrolling (PLU) technique for modular multiplication is introduced that achieves large performance gains at the cost of a moderate increase in code size, while maintaining the full flexibility of a "rolled-loop" implementation.

Architectural Enhancements to Support Digital Signal Processing and Public-Key Cryptography

- Computer Science
- WISES
- 2004

The analysis shows that the MIPS32 architecture can be easily extended for efficient cryptography processing and offers some advantages compared to the ARMv5TE architecture.

Instruction Set Extensions for Fast Arithmetic in Finite Fields GF(p) and GF(2^m)

- Computer Science
- CHES
- 2004

This paper introduces a set of five custom instructions to accelerate arithmetic operations in finite fields GF(p) and GF(2^m), which can be easily integrated into a standard RISC architecture like MIPS32 and require only little extra hardware.

Architectural support for arithmetic in optimal extension fields

- Computer Science, Mathematics
- Proceedings. 15th IEEE International Conference on Application-Specific Systems, Architectures and Processors, 2004.
- 2004

This work introduces two custom instructions to accelerate the reduction modulo a pseudo-Mersenne (PM) prime and shows that the multiplication in an optimal extension field can take advantage of a multiply/accumulate unit with a wide accumulator, so that a certain number of 64-bit products can be summed up without overflow.
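
As a rough illustration of the two ideas in this summary, the sketch below reduces modulo a prime of the form p = 2^n - c by folding the high part back in (since 2^n ≡ c mod p), and sums several products before reducing once. The function names `reduce_pm` and `dot_mod_pm` and the example prime 2^32 - 5 are assumptions chosen for this sketch; they do not come from the cited work, and Python's unbounded integers stand in for the wide accumulator.

```python
def reduce_pm(x, n, c):
    """Reduce a non-negative x modulo p = 2^n - c, assuming small c > 0."""
    p = (1 << n) - c
    mask = (1 << n) - 1
    # Fold: x = hi * 2^n + lo  is congruent to  hi * c + lo  (mod p).
    while x >> n:
        x = (x >> n) * c + (x & mask)
    return x - p if x >= p else x

def dot_mod_pm(xs, ys, n, c):
    """Sum the products first (wide accumulator), then reduce once."""
    acc = sum(x * y for x, y in zip(xs, ys))
    return reduce_pm(acc, n, c)

# Example with the assumed prime p = 2^32 - 5:
p = (1 << 32) - 5
print(dot_mod_pm([p - 1, 12345, 7], [p - 2, 67890, 9], 32, 5))
```

Deferring the reduction until after the accumulation is what lets a single fold replace one reduction per product.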

Enhanced Montgomery Multiplication on DSP Architectures for Embedded Public-Key Cryptosystems

- Computer Science
- EURASIP J. Embed. Syst.
- 2008

This paper tackles the efficient support of modular exponentiation on inexpensive circuitry for embedded security services and proposes a variant of the finely integrated product scanning (FIPS) algorithm that is targeted to digital signal processors.

An efficient scalable and hybrid arithmetic unit for public key cryptographic applications

- Computer Science, Mathematics
- IEICE Electron. Express
- 2007

An efficient scalable and flexible arithmetic unit which executes word-based multiplication and squaring for RSA arithmetic and addition, multiplication and inversion in GF(2^m) for Elliptic Curve Cryptography (ECC) arithmetic operation is proposed.

New Speed Records for Montgomery Modular Multiplication on 8-Bit AVR Microcontrollers

- Computer Science, Mathematics
- AFRICACRYPT
- 2014

A new variant of the widely used hybrid method for multiple-precision multiplication is presented that is 10.6% faster than the original hybrid technique. The paper also shows how to perform the modular subtraction of Montgomery reduction in a regular fashion, without executing conditional statements, so as to counteract Simple Power Analysis attacks.
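
The idea of a regular, branch-free subtraction can be sketched as follows. This is a generic mask-and-select pattern, not the routine from the cited paper; the name `sub_select`, the 8-bit word size, and the convention that `t` carries one extra top word are assumptions made for the example. Python itself gives no timing guarantees, so the code only illustrates the data flow: the subtraction is always computed, and a mask derived from the final borrow selects which value to return.

```python
W = 8                       # assumed word size (8-bit, as on AVR)
MASK = (1 << W) - 1

def sub_select(t, n):
    """Return t - n if t >= n, else t, without branching on the comparison.

    t: little-endian words, len(n) + 1 entries (top word holds the carry).
    n: little-endian words of the modulus. Assumes 0 <= t < 2n.
    """
    s = len(n)
    d = []
    borrow = 0
    for i in range(s):
        x = t[i] - n[i] - borrow
        d.append(x & MASK)
        borrow = (x >> W) & 1          # 1 when the word subtraction wrapped
    borrow = ((t[s] - borrow) >> W) & 1  # final borrow: 1 exactly when t < n
    keep = (-borrow) & MASK            # all ones keeps t, all zeros takes d
    return [(t[i] & keep) | (d[i] & ~keep & MASK) for i in range(s)]
```

Both candidate results are formed on every call, so the sequence of operations does not depend on whether the subtraction was needed.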

Fast and Efficient Implementation of AES via Instruction Set Extensions

- Computer Science
- 21st International Conference on Advanced Information Networking and Applications Workshops (AINAW'07)
- 2007

A general purpose instruction set extension to a 32-bit SPARC V8 compatible processor core is presented that accelerates the performance of Galois Field fixed field constant multiplication, a core element of the AES algorithm.
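
For context, multiplication by a constant in the AES field GF(2^8) can be written in software as repeated doubling and XOR, which is the kind of operation such an extension targets. The sketch below is plain software, not the proposed instruction; the names `xtime` and `gf_mul` are conventional but chosen here, and the reduction polynomial 0x11B is the one AES specifies.

```python
def xtime(b):
    """Multiply b by x (i.e. by 0x02) in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    return (b ^ 0x11B) & 0xFF if b & 0x100 else b

def gf_mul(b, c):
    """Multiply byte b by the constant c in GF(2^8) using shift-and-XOR."""
    result = 0
    while c:
        if c & 1:
            result ^= b
        b = xtime(b)
        c >>= 1
    return result

# MixColumns uses the fixed constants 0x02 and 0x03:
print(hex(gf_mul(0x57, 0x02)), hex(gf_mul(0x57, 0x03)))
```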

## References


Optimized RISC Architecture for Multiple-Precision Modular Arithmetic

- Computer Science
- SPC
- 2003

This paper presents an optimized assembly routine for fast multiple-precision multiplication with "finely" integrated Montgomery reduction (FIOS method) and demonstrates that the custom instructions double the processor's arithmetic performance compared to a standard MIPS32 core.

Implementing 1,024-bit RSA exponentiation on a 32-bit processor core

- Computer Science
- Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors
- 2000

This paper describes how long-wordlength (1024-bit) modular exponentiation may be implemented on a standard 32-bit microprocessor core with a total execution time of under 1 second. The design does…

Lx: a technology platform for customizable VLIW embedded processing

- Computer Science
- ISCA '00
- 2000

The experiments described in the paper show that specialization for an application domain is effective, yielding large gains in price/performance ratio, and that scaling machine resources scales performance, although not uniformly across all applications.

Instruction set selection for ASIP design

- Computer Science
- Proceedings of the Seventh International Workshop on Hardware/Software Codesign (CODES'99) (IEEE Cat. No.99TH8450)
- 1999

The presented approach uses the processor core to allow early evaluation of ASIP design options using rapid prototyping techniques, and describes a hardware/software co-design methodology which can be used with this design approach.

Hardware/software instruction set configurability for system-on-chip processors

- Computer Science
- Proceedings of the 38th Design Automation Conference (IEEE Cat. No.01CH37232)
- 2001

This paper describes the key dimensions of extensibility within the processor architecture, the instruction set extension description language and the means of automatically extending the software environment from that description, and describes two groups of benchmarks that show 20 to 40 times acceleration of a broad set of algorithms through application-specific instruction set extensions, relative to high performance RISC processors.

An ASIP design methodology for embedded systems

- Computer Science
- CODES '99
- 1999

This paper presents a unique architecture and methodology to design ASIPs in the embedded controller domain by customizing an existing processor instruction set and architecture rather than creating an entirely new ASIP tuned to a benchmark.

Analyzing and comparing Montgomery multiplication algorithms

- Computer Science, Mathematics
- IEEE Micro
- 1996

The operations involved in computing the Montgomery product are studied, and several high-speed, space-efficient algorithms for computing MonPro(a, b) are described along with their time and space requirements.

Design of a high performance 32×32-bit multiplier with a novel sign select Booth encoder

- Computer Science, Engineering
- ISCAS 2001. The 2001 IEEE International Symposium on Circuits and Systems (Cat. No.01CH37196)
- 2001

In this paper, a high performance 32×32-bit multiplier for a DSP core is proposed. The multiplier is composed of a novel sign select Booth encoder, an efficient data compressor block with a…

Exponentiation Cryptosystems on the IBM PC

- Computer Science
- IBM Syst. J.
- 1990

A mixed system that combines the superior key management capabilities inherent in public key cryptosystems with the much higher bulk-encryption speed obtainable with the Data Encryption Algorithm is discussed.

Modular multiplication without trial division

- Mathematics, Computer Science
- 1985

A method is presented for multiplying two integers modulo N while avoiding division by N. It relies on a representation of residue classes that speeds up modular multiplication without affecting the modular addition and subtraction algorithms.
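
The reduction step behind this representation can be sketched in a few lines. The version below is a textbook-style single-precision form written for illustration; the names `montgomery_setup` and `redc`, and the choice R = 2^k, are assumptions of this sketch. A residue x is represented as x·R mod N, and `redc(T)` returns T·R^-1 mod N using only shifts, masks, and multiplications.

```python
def montgomery_setup(n, k):
    """Return (R, n_prime) with R = 2^k > n and n_prime = -n^-1 mod R; n must be odd."""
    R = 1 << k
    n_prime = (-pow(n, -1, R)) % R
    return R, n_prime

def redc(T, n, k, n_prime):
    """Return T * R^-1 mod n for 0 <= T < R*n, with no division by n."""
    r_mask = (1 << k) - 1
    m = ((T & r_mask) * n_prime) & r_mask   # makes T + m*n divisible by R
    t = (T + m * n) >> k                    # exact division by R via a shift
    return t - n if t >= n else t

# Usage: multiply 42 by 77 modulo 97 through the Montgomery domain.
n, k = 97, 8
R, n_prime = montgomery_setup(n, k)
a_bar, b_bar = (42 * R) % n, (77 * R) % n        # convert in
prod_bar = redc(a_bar * b_bar, n, k, n_prime)    # equals 42*77*R mod n
print(redc(prod_bar, n, k, n_prime) == (42 * 77) % n)
```

Addition and subtraction of represented values work unchanged because (x + y)·R = x·R + y·R, which matches the remark in the summary above.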