A Hardware-Efficient Elliptic Curve Cryptographic Architecture over GF(p)

This paper proposes a hardware-efficient elliptic curve cryptography (ECC) architecture over GF(p), which uses adders to achieve scalar multiplication (SM) through a hardware-reuse method. At the algorithm level, the interleaved modular multiplication (IMM) algorithm and the binary modular inverse (BMI) algorithm are improved so that they need only two adders. In addition to the adders, the data registers are another optimization target. The design is synthesized with Design Compiler on a 0.13 µm CMOS ASIC platform. Scalar multiplication over the 160-, 192-, 224-, and 256-bit field orders at a clock frequency of 150 MHz takes 1.99–3.17 ms. Moreover, the gate area required for the different field orders is in the range of 35.65k–59.14k gates, which is 50%–91% less hardware than other processors.


Introduction
Due to the rapid development of technology, Internet of Things (IoT)-related devices have become popular, and their security must be guaranteed. In addition to IoT devices, the safety of road networks also needs great attention [1]. Elliptic curve cryptography (ECC) is an asymmetric cryptosystem put forward by Miller [2] and Koblitz [3] in 1986, which offers higher security than other methods such as the RSA encryption algorithm. Several international organizations have adopted ECC, including NIST [4], ANSI [5], and IEEE [6].
A large number of hardware architectures have been proposed for ECC [7][8][9][10][11][12][13][14][15][16][17]. Among them, there are two approaches to realizing modular multiplication (MM), namely, the multiplier and the adder. The multiplier-based architecture includes designs based on a specific prime field and designs based on the Montgomery multiplication algorithm [7]. The adder-based architecture includes designs based on the interleaved multiplication algorithm [9]. The processor in [13] uses a design with the Montgomery MM algorithm and an r-bit × r-bit multiplier. The processors in [8,14] use a design with an n-bit × n-bit multiplier, where MM consists of a multiplication and a fast reduction operation over a specific prime field. It should be noted that the multiplier-based architecture requires a lot of hardware.
In ECC, modular inversion (MI) is also a cumbersome operation. Binary modular inversion algorithms are usually used in hardware-efficient architectures. The MM and MI units of the processor in [11] are based on the adder, and the two units use separate adders. The processors in [18,19] adopt a radix-4 Booth-encoded IMM algorithm. The processor in [20] implements MM through a radix-2 MM algorithm and avoids MI through projective coordinates.
Traditional software implementations of cryptographic algorithms suffer from high power consumption and long delays, which can be addressed by hardware implementation. This article attempts to provide security assurance with low power consumption for IoT devices through hardware implementation. The following are the main contributions of this article.
(1) A hardware-efficient architecture based on adder units is proposed to achieve as little hardware consumption as possible.

This part introduces elliptic curves (EC) over GF(p). When the characteristic p of a nonsupersingular elliptic curve E over GF(p) is greater than 3, the following formula can be used:
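The formula itself is not present in this text. Assuming the usual short Weierstrass form for curves over a prime field with p > 3 is what is meant, it would read as below. The coefficient symbols a and b are assumptions introduced here (this a is unrelated to the input a of the inversion algorithm discussed later).

```latex
E:\; y^2 \equiv x^3 + ax + b \pmod{p}, \qquad a, b \in GF(p), \qquad 4a^3 + 27b^2 \not\equiv 0 \pmod{p}
```

The condition on 4a^3 + 27b^2 excludes singular curves, on which the point addition law is not well defined.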

Elliptic Curve Scalar Multiplication.
In ECC, SM is the basic operation. The SM operation, also called point multiplication (PM), takes an integer k and a point P on the elliptic curve as input and is performed as a sequence of point addition (PA) and point doubling (PD) operations, as given in Algorithm 1. In Step 1, the point Q is initialized as the point at infinity.
Step 2 iterates n − 1 times, and each iteration includes a PD operation. k_i = 1 indicates that a PA operation is also performed.
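The control flow described above can be sketched in Python as a left-to-right double-and-add loop. The names `scalar_mult`, `add`, `double`, and `identity` are illustrative and not taken from the paper, and plain integers under addition stand in for curve points so that the sketch stays self-contained. It approximates Algorithm 1 under these assumptions rather than transcribing it.

```python
def scalar_mult(k, P, add, double, identity):
    """Left-to-right double-and-add: returns k*P in an additive group.

    add, double, and identity are hypothetical placeholders for PA, PD,
    and the point at infinity; the caller supplies them.
    """
    Q = identity                                   # Step 1: start at the identity
    for i in range(k.bit_length() - 1, -1, -1):    # Step 2: scan bits, MSB first
        Q = double(Q)                              # one PD per iteration
        if (k >> i) & 1:                           # k_i = 1 adds one PA
            Q = add(Q, P)
    return Q                                       # Step 3


# Stand-in group: integers under addition (identity 0), so that k*P is
# ordinary multiplication and the result is easy to check by hand.
result = scalar_mult(13, 7, add=lambda a, b: a + b,
                     double=lambda a: 2 * a, identity=0)
print(result)
```

With real curve points, `add` and `double` would be the PA and PD routines and `identity` the point at infinity. The number of PA calls equals the Hamming weight of k, which is why the cycle count of an SM operation depends on the scalar.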

Scalar Multiplication Architecture
This part describes the bottom-up algorithm optimization on GF(p), which achieves maximum reuse of the adder units. The SM operation is implemented by using two full-word adder units. The optimization of the MM and MI operations helps reduce power consumption and improve the performance of the SM operation.
3.1. Modular Addition/Subtraction. MA and MS operations are implemented based on Algorithm 2. In an ASIC, addition and subtraction can be implemented with nearly equal hardware, namely, adder units. Since each MA or MS operation must complete in one clock cycle, two full-word adders are needed. In addition, the adder unit is the minimum unit here, and C0_n and C1_n are the most significant bits (MSB).
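A minimal Python sketch of modular addition and subtraction with two additions each is given below. It assumes both operands are already in [0, p − 1]. The function names and the use of the sign of the second result as the select signal are illustrative choices; Algorithm 2 is not reproduced in this text, so the exact carry handling (C0_n, C1_n) may differ.

```python
def mod_add(a, b, p):
    """(a + b) mod p for a, b in [0, p-1]."""
    s0 = a + b                      # first adder: plain sum
    s1 = s0 - p                     # second adder: sum minus modulus
    return s0 if s1 < 0 else s1     # sign of s1 selects the reduced result


def mod_sub(a, b, p):
    """(a - b) mod p for a, b in [0, p-1]."""
    d0 = a - b                      # first adder: plain difference
    d1 = d0 + p                     # second adder: difference plus modulus
    return d1 if d0 < 0 else d0     # sign of d0 selects the corrected result


print(mod_add(9, 7, 13), mod_sub(3, 7, 13))
```

Both candidate results are formed unconditionally and one of them is selected at the end, which is what makes two adder units sufficient for a single-cycle operation.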

Modular Multiplication.
MM is an indispensable operation in the SM architecture. In this study, the interleaved modular multiplication algorithm is selected. The standard interleaved modular multiplication in [16] (Algorithm 2) has certain shortcomings. Since steps 5, 6, and 7 carry out addition operations with carry propagation and steps 6 and 7 compare the operands over their full length, there is a large latency. In response to this problem, the improved algorithm in [16] (Algorithm 3) performs the addition operations with carry-save adders in the loop. Moreover, the modified algorithm in [16] (Algorithm 4) reduces the area and time by a lookup-table method. In [10], a new interleaved modular multiplication algorithm is proposed, which uses only two adder units. The specific steps are shown in Algorithm 5 as follows. In step 1, the variable R is initialized to zero. In step 2.1, R × 2 can be realized by a shift operation. In step 2.2, X_i × Y can be implemented by a multiplexer. Step 2.3 and step 2.4 each require an adder unit. Therefore, if each iteration is completed within one clock cycle, a total of two adder units are required. After the iteration of step 2, the result is limited to [0, 2p − 1]. Therefore, it is necessary to go to step 3 to limit R to [0, p − 1].
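For orientation, the textbook form of interleaved modular multiplication can be sketched in Python as follows. This is the standard shift-add-reduce variant, not the two-adder algorithm of [10]: it reduces R fully in every iteration, so it uses more add/subtract steps per bit and needs no final correction step. The function name and the bit-width choice are assumptions made for illustration.

```python
def interleaved_mod_mul(x, y, p):
    """Return (x * y) mod p for x, y in [0, p-1], one bit of x per iteration."""
    r = 0                                   # step 1: R = 0
    for i in range(p.bit_length() - 1, -1, -1):
        r <<= 1                             # R * 2 as a left shift
        if r >= p:                          # keep R in [0, p-1]
            r -= p
        if (x >> i) & 1:                    # X_i * Y is a choice between 0 and Y
            r += y
        if r >= p:                          # second conditional reduction
            r -= p
    return r


# NIST P-256 prime, used here only as a realistic 256-bit modulus.
P256 = 2**256 - 2**224 + 2**192 + 2**96 - 1
print(interleaved_mod_mul(P256 - 2, P256 - 3, P256))
```

Because the multiplier bit X_i only selects between 0 and Y, no hardware multiplier is required; the cost per bit is a shift, a multiplexer, and additions, which is the property the adder-based architecture relies on.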

Modular Inversion.
In addition to the MM operation, the modular inversion (MI) operation also plays an extremely important role in the SM operation architecture. In MI operation, the same two adder units are reused to reduce hardware consumption.
This paper adopts the binary modular inversion algorithm proposed in [10]. Algorithm 3 can calculate MM and MI operations in the same clock cycle. If the input a = 1, it is an MI operation, and y = 1/x mod p. In step 1, the variables u, v, r, and s are initialized. In step 2 and step 3, the /2 operations can be realized by right shifting one bit.
With a positive or negative odd r, r/2 mod p = (r + p) ≫ 1; that is, it can be computed by adding p to r and then shifting right by one bit. The same is true in other situations. The above operations require one adder unit. In step 2 or step 3, the comparison between u and v needed in step 4 is calculated in advance, which requires two adder units. In step 4, (r − s, u − v) or (s − r, v − u) is calculated, which requires two adder units.
Step 5 requires a total of two adder units, one of which is used to determine whether r or s is less than 0, and the other is used for r + p or s + p. Therefore, if each step is completed in one clock cycle, two adder units are required.

Input: an integer k and a point P on elliptic curve
Output: kP
(3) return Q
ALGORITHM 1: Elliptic curve scalar multiplication.

Input: p, x, a ∈ [1, p − 1]
Output: y, satisfying xy = a mod p
Step 1: u = p; v = x; r = 0; s = a;
Step 2: if (u is even
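The steps described in this section can be sketched in Python as a binary modular division, which reduces to inversion when a = 1. The initialization and the halving rule follow the text. The loop structure and termination test are the standard binary-inversion ones and are an assumption here, since the full listing of Algorithm 3 is not available in this text. The sketch requires an odd prime p; the names `mod_div` and `half` are illustrative.

```python
def mod_div(a, x, p):
    """Return y with x*y = a (mod p); p an odd prime, x in [1, p-1], a in [0, p-1]."""
    def half(t):
        # t/2 mod p: even values shift directly, odd values add p first
        return t >> 1 if t % 2 == 0 else (t + p) >> 1

    u, v, r, s = p, x, 0, a                 # step 1
    # Invariants: x*r = a*u (mod p) and x*s = a*v (mod p)
    while u != 1 and v != 1:
        while u % 2 == 0:                   # step 2: halve u together with r
            u, r = u >> 1, half(r)
        while v % 2 == 0:                   # step 3: halve v together with s
            v, s = v >> 1, half(s)
        if u >= v:                          # step 4: subtract the smaller pair
            u, r = u - v, r - s
            if r < 0:                       # step 5: bring r back into [0, p-1]
                r += p
        else:
            v, s = v - u, s - r
            if s < 0:
                s += p
    return r if u == 1 else s


print(mod_div(1, 3, 7))
```

Every operation in the loop is a shift, a comparison, or an addition/subtraction, so the same adder units that serve MM can serve MI, which is the reuse argument made above.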

Point Addition and Point Doubling.
Algorithm 4 provides PA and PD operations. Since modular operations (MA, MS, MM, and MI) share the same two adder units, only one modular operation is computed at a time. A total of eight registers are required, of which six are used for PA and PD operations (t1, t2, x1, x2, y1, and y2) and two for the integer k and the prime p.
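As an illustration of the point operations, the standard affine PA and PD formulas can be sketched in Python as below, performing one modular operation at a time and using t1 and t2 as temporaries. The listing of Algorithm 4 is not available in this text, so the operation order is an assumption. The toy curve y^2 = x^3 + 2x + 2 over GF(17) with the point (5, 1) is a common textbook example and not a parameter from the paper, and `pow(..., -1, p)` (Python 3.8+) stands in for the MI unit.

```python
def point_add(P1, P2, a, p):
    """Affine PA/PD on y^2 = x^3 + a*x + b over GF(p); None is the point at infinity."""
    if P1 is None:
        return P2
    if P2 is None:
        return P1
    x1, y1 = P1
    x2, y2 = P2
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                            # P + (-P) = point at infinity
    if P1 == P2:                               # PD: slope of the tangent
        t1 = (3 * x1 * x1 + a) % p             # MM, MA
        t2 = pow((2 * y1) % p, -1, p)          # MI
    else:                                      # PA: slope of the chord
        t1 = (y2 - y1) % p                     # MS
        t2 = pow((x2 - x1) % p, -1, p)         # MS, MI
    t1 = (t1 * t2) % p                         # MM: slope lambda
    x3 = (t1 * t1 - x1 - x2) % p               # MM, MS, MS
    y3 = (t1 * (x1 - x3) - y1) % p             # MS, MM, MS
    return (x3, y3)


G = (5, 1)                                     # toy base point (assumption)
print(point_add(G, G, 2, 17), point_add((6, 3), G, 2, 17))
```

PA and PD differ only in how the slope is formed, so a single datapath with the shared modular units can serve both.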

Scalar Multiplier Architecture.
In this part, Figure 1 shows the architecture for SM on GF(p), which implements the modular operations MM, MS, MA, and MI as well as the point operations SM, PA, and PD. Among them, the point controller block is the main state

Input: P1(x1, y1), P2(x2, y2),
ALGORITHM 4: Point addition and point doubling.

Implementation and Result
The ECC architecture described in this part is designed in Verilog HDL and synthesized with Design Compiler using the SMIC 130-nm CMOS standard cell library. In addition, the circuit area is reported in equivalent 2-way NAND gates. The simulation parameters are taken from the FIPS 186-2 standard [8]. Figure 2 lists the main parameters for one 256-bit elliptic curve on the prime field GF(p); the curves for the other bit lengths can be found in the FIPS 186-2 standard. The coordinates of the base point G on the elliptic curve are Gx and Gy.
The proposed architecture needs a total of two adders and twelve data registers. According to Table 1, the required registers and adders consume 42% of the hardware. Among them, the twelve registers are used for data storage. As the field order increases, the adders' share of the resource consumption increases from 13.72% to 15.54%.
In Table 2, the results of the implementation and comparison of the proposed architecture are shown. Averaged over 100 tests, the SM operation requires 186 k, 268 k, 364 k, and 475 k clock cycles on the 160-, 192-, 224-, and 256-bit prime fields, respectively. The proposed architecture takes 1.24, 1.78, 2.42, and 3.16 ms with gate areas of 35.65 k, 43.25 k, 49.41 k, and 59.14 k for one SM operation over the 160-, 192-, 224-, and 256-bit prime fields, respectively. The authors of [10,11] use the IMM algorithm and BIA to realize the inversion and multiplier units. Among them, the processor in [10] and the proposed processor use the same method, that is, they use the same unit to implement MM and MI operations. In contrast, the proposed design has higher performance on the 160/192/224/256-bit prime fields and is 1.28∼1.29 times faster than that of [10]. Under the 160-bit prime field, the processor in [10] takes 35.43 k gate area and 1.60 ms to perform an SM operation. In terms of the area-time product (AT), the AT value of the proposed processor is relatively low, indicating a better balance between hardware consumption and performance. The processor in [11] uses two adder-based inversion units and two adder-based multiplier units, whereas the proposed processor uses one combined unit. In contrast, the proposed processor has the advantage of low hardware consumption. In addition, the proposed design saves 64.81%, 64.87%, 65.66%, and 64.69% of the area over the 160/192/224/256-bit prime fields compared with the design in [11]. Taking the 160-bit prime field as an example, the processor in [11] takes 0.87 ms. Although it has higher performance, the proposed design chooses a lower AT value in order to balance hardware consumption and performance. In summary, the proposed processor has the advantages of low hardware consumption and high hardware efficiency.
Mathematical Problems in Engineering
The processor in [13] uses a word-based Montgomery multiplier and a dynamic redundant binary converter, which can improve the performance of SM. Compared with the design in [13], the proposed design saves 69.66%, 63.35%, 58.93%, and 50.84% of the area over the 160/192/224/256-bit prime fields. The processor in [14] has a large power consumption, which is not suitable for IoT devices. More specifically, a full-size 256 bit × 256 bit multiplier requires a large amount of hardware, namely, 659 k gates. In contrast, the proposed design saves 91.03% of the area. The processor in [15] uses a systolic arithmetic unit at a high frequency of 556 MHz. On the 256-bit prime field, the proposed design saves 51.52% of the area.
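The relations between the reported figures can be checked with a few lines of Python. The sketch assumes the 150 MHz clock stated in the abstract and reads the 160-bit cycle count as 186 thousand cycles, an interpretation adopted here because it reproduces the 1.24 ms latency. The helper names are illustrative, and the area-time product is simply gate area (in k gates) multiplied by latency (in ms).

```python
CLOCK_HZ = 150e6   # clock frequency from the abstract


def latency_ms(cycles, clock_hz=CLOCK_HZ):
    """Latency in milliseconds for a given cycle count."""
    return cycles / clock_hz * 1e3


def area_saving_percent(ours_kgates, other_kgates):
    """Relative area saving of the proposed design, in percent."""
    return (1 - ours_kgates / other_kgates) * 100


t160 = latency_ms(186_000)                      # assumes the count is in thousands
saving_vs_14 = area_saving_percent(59.14, 659)  # 256-bit design vs. multiplier in [14]
speedup_vs_10 = 1.60 / 1.24                     # 160-bit latency ratio vs. [10]
at_ours = 35.65 * 1.24                          # k gates * ms, 160-bit field
at_ref10 = 35.43 * 1.60                         # k gates * ms, processor in [10]

print(round(t160, 2), round(saving_vs_14, 2),
      round(speedup_vs_10, 2), round(at_ours, 1), round(at_ref10, 1))
```

Under these assumptions the latency, the saving relative to [14], and the speedup relative to [10] agree with the values quoted in the text, and the AT value of the proposed design comes out lower than that of [10].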

Conclusion
By constructing a bottom-up optimization of all algorithm-level operations of scalar multiplication on the basis of two full-word adders, a hardware-efficient elliptic curve processor over GF(p) is proposed. Through the improvement of the IMM algorithm and the BMI algorithm, both become suitable for implementation with two adder units. Moreover, the registers are also optimized. A total of 12 full-word register units are used to store data. Synthesized on a 0.13 µm ASIC platform, the processor's hardware consumption stays within the range of 35.65 k∼59.14 k gates, which is far lower than that of most processors.

Data Availability
The raw/processed data required to reproduce these findings cannot be shared at this time as the data also form part of an ongoing study.

Conflicts of Interest
The authors declare that they have no conflicts of interest.