New and improved architectures for Montgomery modular multiplication

  • Authors:
  • M. Sudhakar;R. V. Kamala;M. B. Srinivas

  • Affiliations:
  • Center for VLSI and Embedded System Technologies, International Institute of Information Technology, Hyderabad, Andhra Pradesh, India (all three authors)

  • Venue:
  • Mobile Networks and Applications
  • Year:
  • 2007

Abstract

This paper presents an improved Montgomery multiplier that uses modified four-to-two carry-save adders (CSAs) to reduce the critical path delay. Instead of implementing a four-to-two CSA with two levels of carry-save logic, the authors propose a modified four-to-two CSA that needs only one level of carry-save logic by taking advantage of pre-computed input values. The paper also proposes a new bit-sliced, unified and scalable Montgomery multiplier architecture applicable to both RSA and ECC (Elliptic Curve Cryptography). In existing word-based scalable multiplier architectures, some processing elements (PEs) perform no useful computation during the last pipeline cycle when the precision is not an exact multiple of the word size, as is common in ECC. This intrinsic limitation costs a few extra clock cycles on operand lengths that are not powers of 2. The proposed architecture eliminates these extra clock cycles by reconfiguring the design at the bit level, and can therefore operate on any operand length, limited only by memory and control constraints. It requires 2 to 15% fewer clock cycles than existing architectures for the key lengths of interest in RSA and, for ECC, 11 to 18% fewer for binary fields and 10 to 14% fewer for prime fields. An FPGA implementation of the proposed architecture performs 1,024-bit modular exponentiation in about 15 ms, which improves on the existing multiplier architectures.
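For readers unfamiliar with the underlying arithmetic, the following is a minimal software sketch of radix-2 Montgomery multiplication, first with an ordinary accumulator and then with the accumulator held in carry-save form. It is not the authors' circuit: the function names, the operand selection table, and the choice of y + n as the pre-computed value are assumptions made for illustration, showing one way a pre-computed input can reduce each iteration to a single carry-save level.

```python
def csa(a, b, c):
    """3-to-2 carry-save adder on integers: s + cy == a + b + c."""
    s = a ^ b ^ c
    cy = ((a & b) | (a & c) | (b & c)) << 1
    return s, cy


def csa_4to2(a, b, c, d):
    """Conventional 4-to-2 compression: two carry-save levels in series."""
    s1, c1 = csa(a, b, c)
    return csa(s1, c1, d)


def montgomery_multiply(x, y, n, k):
    """Return x * y * 2^(-k) mod n for odd n and 0 <= x, y < n < 2^k."""
    s = 0
    for i in range(k):
        xi = (x >> i) & 1
        s += xi * y
        if s & 1:          # make the sum even by adding the odd modulus
            s += n
        s >>= 1            # exact division by 2
    return s - n if s >= n else s


def montgomery_multiply_csa(x, y, n, k):
    """Same result, with the accumulator kept as a (sum, carry) pair.

    Assumption for illustration: y + n is pre-computed once, so each
    iteration adds exactly one selected operand through one CSA level.
    """
    yn = y + n                      # hypothetical pre-computed input
    ss, sc = 0, 0
    for i in range(k):
        xi = (x >> i) & 1
        # Parity of ss + sc + xi*y decides whether n must be added.
        q = (ss ^ sc ^ (xi & y)) & 1
        operand = (0, y, n, yn)[xi + 2 * q]
        ss, sc = csa(ss, sc, operand)
        ss >>= 1                    # both low bits are 0 here, so the
        sc >>= 1                    # shift loses no information
    s = ss + sc                     # single carry-propagate add at the end
    return s - n if s >= n else s
```

Both functions return the product in the Montgomery domain, so multiplying the result by 2^k modulo n recovers x * y mod n. The two-level `csa_4to2` is included only as the baseline that the single-level variant avoids.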