Area-Efficient Hardware Design of Modular Exponentiation based on Montgomery Multiplier for RSA Cryptosystem

Richard Boateng Nti1,1 and Kwangki Ryoo1,

1 Graduate School of Information & Communications, Hanbat National University,125 Dongseodaero, Yuseong-Gu, Daejeon 34158, Republic of Korea

[email protected], [email protected]

Abstract. Modular exponentiation is the most time-consuming mathematical operation in some public key cryptosystem such as RSA. The primal operation of the RSA cryptosystem is modular exponentiation, computed by repeated modular multiplication. Fast modular multiplication algorithms have been proposed to speed up decryption/encryption yet minimize area. However, the Montgomery algorithm is limited by the carry propagation delay from the addition of long operands. In this paper, we propose a hardware structure that simplifies the operation of the Q logic coupled with a compact CSA in Montgomery multiplier. The resulting design was applied in modular exponentiation for lightweight applications of RSA. Synthesis results showed that the new multiplier design achieved reduce hardware area, consequently, an area-efficient modular exponentiation design. A frequency of 452.49MHz was achieved for modular exponentiation with 85K gate equivalent using the 130nm CMOS technology.

Keywords: Public key cryptosystem, RSA, Carry-Save Adder (CSA), Montgomery multiplication, Modular exponentiation.

1 Introduction

Maintenance of privacy and data integrity is essential from the viewpoint of security, which can be achieved through techniques: authentication and cryptography1. Public key cryptosystems are vital for information security. RSA is the most widely used public key algorithm and requires repeated modular multiplication to compute for modular exponentiation2. Modular multiplication with large numbers is time-consuming. The Montgomery algorithm is used as the core algorithm for cryptosystems. Montgomery algorithm determines the quotient by replacing the trial division by modulus with a series of additions and shift operations3. To avoid the delay, several approaches have been proposed to speed up the operation based on carry-save addition. Based on the representation, these approaches can be divided into two.

In the first approach, the intermediate results are kept in carry-save form to avoid carry propagation4 while the input and output operands (i.e. A, B, N and S) of the algorithm remains in binary representation. Final conversion from the carry save form into the binary form must be performed. This comes with an extra effort to pay by using the CPA3. This results in an increase in the total computation time and it implies that the resulting throughput rate is dependent on the length of operands.

On the contrary, some work used 5-to-2 carry-save additions/adders (CSA)5, a three-level CSA tree, to deal with this problem without performing format conversion. This second approach eliminates repeated interim output of the Montgomery modular multiplication output to input conversion by keeping all input and output operands in the carry save form except the final step for the results. Mclvor et al.6 proposed two algorithmic variants of the Montgomery algorithm with both approaches using carry save adder(CSA) to compute for the exponentiation. One is based on a five-to-two CSA and the other on a four-to-two CSA plus multiplexer. Each can perform a Montgomery multiplication in only k + 1 and k + 2 clock cycles, respectively, where k is the operand bit length. The later approach MM42 in an effort to reduce the number of input operands introduces extra multiplexers and select signals to select the desired operands. More registers are also required to store the combined input as well.

In this paper, we focus on the hardware design of efficient Montgomery multiplier with a two-level adder. A simplified Q_logic was designed for bit operation which accounted for a reduction in the hardware area. The proposed Montgomery multiplier is then applied in the H algorithm to compute modular exponentiation.

Section 2 reviews radix-2 Montgomery and modular exponentiation algorithms. We proposed an efficient design of the Montgomery algorithm in section 3. In Section 4, we compare different Montgomery multipliers as well as modular exponentiation designs. Finally, concluding remarks are drawn in Section 5.

2 Modular Multiplication and Exponentiation Algorithms

The Montgomery multiplication is an algorithm used to compute the product of two integers A and B modulo N. Algorithm 1 shows the radix-2 version of the algorithm. Given two integers a and b; where a, b < N (i.e. N is the k-bit modulus), R can be defined as 2k mod N where 2k-1 ? N < 2k. The N-residue of a and b with respect to R can be defined as (1). Based on (1), the Montgomery modular product S of A and B can be obtained as (2) where R-1 is the inverse of R modulo N, i.e. R * R-1 = 1 (mod N). Since the convergence range of S in Montgomery Algorithm is 0 ? S < 2N, an additional operation S = S – N is required to remove the oversized residue if S ? N. The critical delay of algorithm 1 occurs during the calculation of S. A = (a * R) mod N, B = (b * R) mod N (1) S = (A * B * R-1) mod N (2) Algorithm 1: Radix-2 Montgomery MultiplicationInputs: A, B, N (modulus)Output: Sk S0 = 0; for i = 0 to k - 1 { qi = (Si0 + Ai * B0) mod 2; Si+1 = (Si + Ai * B + qi * N)/2;} if(Sk ? N ) Sk = Sk - N;return Sk; Algorithm 2: H-Algorithm (modular exponentiation)Inputs: M, E(ke bits), N(modulus), K ? R2 mod NOutput: ME mod N P = MM(M, K, N), Q = MM(K, 1 ,N); //Pre-process for i = ke - 1 to 0 { Q = MM(Q, Q, N); if(Ei = 1) Q = MM(P, Q, N); }return MM(Q, 1, N); //Post-process Modular exponentiation is usually accomplished by performing repeated modular multiplications9-10. The H-algorithm is used to calculate ME mod N. Let M be a k-bit message and E denotes a ke–bit exponent or key. Algorithm 2 depicts the H-algorithm, where the two constant factors K and R can be computed in advance. Note that the value of R is either 2k or 2k+1. The content of the exponent is scanned from the most significant bit (MSB). 3 Proposed Hardware Design We propose an efficient hardware design of Montgomery algorithm for low area complexity. The new multiplier is embedded into the H-algorithm to implement modular exponentiation. Algorithm 3: Modified Montgomery MultiplicationInputs: A, B, N (modulus)Output: Sk + 21. S0 = 0;2. for i = 0 to k + 1 {3. qi = ( Si0 + Ai * B0 ) mod 2;4. Si+1=(Si+Ai * B + qi * N ) div 2; }5. return Sk + 2 Algorithm 4: Montgomery Multiplication (proposed) Inputs: A, B, N (modulus) Output: (S1k + 2, S2k + 2)1. S10 = 0; S20 = 0;2. for i = 0 to k + 1 {3. q_0=(S1i0 + S2i0); q_1=(S1i0 + S2i0+Bi0); 4. If (Ai = 0 and q_0 = 0)5. (S1i+1,S2i+1)=(S1i+ S2i + 0 + 0)/2; 6. If (Ai = 0 and q_0 = 1)7. (S1i+1,S2i+1)=(S1i+ S2i + 0 + N)/2; 8. If (Ai = 1 and q_1 = 0)9. (S1i+1,S2i+1)=(S1i+ S2i + B + 0)/2; 10. If (Ai = 1 and q_1 = 1)11. (S1i+1,S2i+1)=(S1i+ S2i + B + N)/2; }12. return (S1k + 2, S2k + 2); The proposed Montgomery algorithm (algorithm 4) is a modification of algorithm 3 by Walter5. From line 4 of algorithm 3, the computation of Si+1 depends on the pre-computation of qi which evaluates to 0 or 1. Evaluation of even or odd numbers (line 3) can simply be derived from the LSB (0=even, 1=odd). Inference can be made that qi is influenced by Ai. When Ai equals zero (0) qi depends directly on S 0, else it depends on the XOR of S0 and B0. On the basis of this observation, a new algorithm was proposed (algorithm 4). The proposed algorithm splits qi into two: q_0 and q_1. Whereas q_0 computes for the equivalent qi when Ai is 0, q_1 operates when Ai is 1 as shown on line 3. As a result, a logic circuit of bit operations can model the given mathematical equation as Q_logic. The simplified Q_logic design significantly reduces the hardware complexity. Fig. 1 illustrates the block diagram of the proposed multiplier. Registers A, B and N store the inputs, S1 and S2 store the intermediate computation and R presents the output. Based on the control signals to the multiplexers, different computations are carried out. When Ai=0, q_0 from the Q_logic is actively involved otherwise q_1. The arithmetic unit was designed as a two-level adder (CSA). Note that S (S1 and S2) is shift register: outputs one bit right shift. The shifting accounts for the division by 2. All lines in short dashes represent bit signals. Fig. 1. Proposed multiplier structure. The modular exponentiation implemented is the H-Algorithm. Given that the inputs M, E, N, and K represent the message, exponent, modulus and a constant factor K, the proposed structure as shown in fig. 2 computes ME mod N. Montgomery multiplication is repeatedly applied for exponentiation. In effect, the efficiency and performance of the modular multiplier make a significant impact. The initial step in the algorithm is the pre-processing stage which calculates P and Q. The finite state machine (FSM) shown in fig. 2 dictates the route of data in and out of the Montgomery multiplier. In state 0, M and K are routed to the MM; the output from the multiplier goes to register P. The gate combinations after the multiplier ensure that P receives the value for the first evaluation. State 1 follows with 1 and K being routed to MM and then stored in Q. Iterations are performed in states 2 and 3 until the given condition is reached. The square operation (state 2) is always executed but the computation of state 3 depends on ei. Therefore a 2-by-1 multiplexer chooses between P and Q. The result goes to the output via port S. Fig. 2. Proposed exponentiation architecture. 4 Experimental Results Synthesis of the proposed hardware design of the Montgomery multiplier and modular exponentiation was done using TSMC 90nm and 130nm CMOS technology from Synopsys Design Compiler respectively. Table 1 shows comparisons of different multipliers designs. Our proposed multiplier design recorded a reduction of about 37% in area over8 at 250MHz due to the bit operations carried out by our Q_logic design and the omission of BRFA. In addition, a clock speed of 884.9MHz was achieved with throughput enhancement by a factor of 3.04 and a reduction of approximately 6% in hardware complexity. A gate count of 84K was gained at 884.9MHz (1.13ns). Nevertheless, Kuang et al.8 showed a reduction in clock cycles compared to our approach. Table 2 illustrates implementation comparison for 1024-bit modular exponentiations. An operating frequency of 452.49MHz was set to compare with the reference paper10. From the synthesis results, the number of gate count decreased by 22.73%. Table 1. Comparison of results for different Montgomery multiplier implementations with 1024-bit key size Multiplier #Cycle Delay(ns) Area(um2) Throughput(Mbps) proposed 1 1026* 4.00 313K 249.5 proposed 2 1026* 1.13 465K 883.2 5 880 4.00 498K 290.9 4 1049 5.60 406K 174.3 Table 2. Implementation comparison for 1024-bit exponentiations. ME design Delay(ns) Area(um2) Gate count(K) our 2.21 579891 85 6 2.00 - 139 7 2.21 714676 110 5 Conclusion This paper presented an alternative hardware design of the Montgomery algorithm. A simplified Q_logic was designed coupled with a compact arithmetic unit which caused hardware area to decrease. Synthesis results show that our multiplier has an improved performance for low area modular multiplication applications with an area of 313K (um2) at 250MHz. In addition, the modular exponentiation design was implemented using the proposed Montgomery multiplier as the kernel unit and showed improved performance. Future works on this research will delve into the analysis and design of Carry Save Adder (CSA) to improve propagation delay. 1 Please note that the LNCS Editorial assumes that all authors have used the western naming convention, with given names preceding surnames. This determines the structure of the names in the running heads and the author index.