Article

A Crypto Accelerator of Binary Edward Curves for Securing Low-Resource Embedded Devices

1 Department of Electrical Engineering, Bahria University, Islamabad 44000, Pakistan
2 Computer Engineering Department, Umm Al Qura University, Makkah 21955, Saudi Arabia
3 Telecommunications Engineering School, University of Malaga, 29010 Málaga, Spain
* Authors to whom correspondence should be addressed.
Appl. Sci. 2023, 13(15), 8633; https://doi.org/10.3390/app13158633
Submission received: 9 July 2023 / Revised: 22 July 2023 / Accepted: 25 July 2023 / Published: 26 July 2023

Abstract

This research presents a novel binary Edwards curve (BEC) accelerator designed specifically for resource-constrained embedded systems. The proposed accelerator incorporates the fixed window algorithm, a two-stage pipelined architecture, and the Montgomery radix-4 multiplier. As a result, it achieves remarkable performance improvements in throughput and resource utilization. Experimental results, obtained on various Xilinx Field Programmable Gate Arrays (FPGAs), demonstrate high throughput/area ratios for GF(2^233): 12.2 on Virtex-4, 19.07 on Virtex-5, 36.01 on Virtex-6, and 38.39 on Virtex-7. Furthermore, the processing time for one point multiplication on a Virtex-7 platform is 15.87 µs. These findings highlight the effectiveness of the proposed accelerator for improved throughput and optimal resource utilization.

1. Introduction

With the widespread use of the internet as a communication medium, there is a growing need for effective solutions to protect information across various devices [1]. In this context, cryptography is a technique that can safeguard sensitive data on an insecure medium [2]. Two fundamental approaches exist, known as symmetric and asymmetric encryption. The former relies on a shared key for both encryption and decryption; however, managing keys within large organizations poses a formidable challenge [3]. The latter, also referred to as public key cryptography, is based on two keys: the public key, used for encryption, and the private key, used for decryption [4].
Among the prominent pioneers in the field of asymmetric cryptography are RSA (Ron Rivest, Adi Shamir, and Leonard Adleman) [5] and ECC (elliptic curve cryptography) [6]. The security of RSA stems from the difficulty of factoring large numbers. On the other hand, ECC harnesses the mathematics of elliptic curves. Notably, ECC keys offer a significant advantage over RSA keys in terms of length: despite its shorter size, a 256-bit ECC key provides security equivalent to a 3072-bit RSA key [7]. This positions ECC as an enticing alternative, particularly for resource-constrained applications like the IoT (internet of things) [8,9]. From the implementation point of view, ECC involves foundational finite field arithmetic operations. These arithmetic operations support the essential group operations, which in turn enable cryptographic computations and protocol functions such as key exchange, digital signatures, and encryption.
A noteworthy consideration is the vulnerability of ECC to simple power analysis (SPA) attacks [10]. An SPA attack belongs to the category of side-channel attacks; it capitalizes on the information leaked by the physical implementation of an encryption algorithm [11]. In ECC, the computationally intensive operation of point multiplication is particularly susceptible to SPA attacks. This vulnerability underscores the importance of implementing countermeasures. To address this issue, the fixed window algorithm has been specifically designed. In addition to the fixed window algorithm, unified mathematical formulas for ECC also help to reduce the risk of SPA attacks.
Some examples of unified mathematical formulas for ECC include binary Edwards curves (BECs) [12], binary Huff curves (BHCs) [13], and Hessian curves (HCs) [14]. The choice of particular mathematical formulas depends on the specific requirements of the application and the trade-off between performance and security. For example, BHC and BEC are defined over binary fields and are considered more appropriate for area-constrained applications [15,16]. HC, on the other hand, is defined over higher-dimensional fields and offers a higher resistance to side-channel attacks. However, the increased resistance comes at the cost of increased complexity and reduced efficiency. In general, the BEC is considered to be a good choice for resource-constrained devices, as it offers a good balance of security, efficiency, and simplicity [16]. Some of the applications requiring high throughput with a smaller area include financial transactions [17], cloud computing [18], blockchain [19,20], smart cards [21] and wireless communication [22]. By optimizing hardware for BEC, the operations are performed securely at a higher speed using minimal hardware cost.

1.1. Related Work

There have been notable breakthroughs in utilizing FPGAs to improve the performance of point multiplication on BEC. For example, the work in [16] focuses on optimizing hardware resources by introducing a low-complexity architecture. Moreover, the architecture leverages a digit-parallel least significant multiplier and employs advanced instruction scheduling techniques. The synthesis on various platforms ensures adaptability and efficiency across different hardware configurations. However, the research in [16] did not report results on computational speed.
In order to address the issue of computational speed, the work in [23] proposes two architectures: one for general BEC and another for special BEC. The authors also employ three finite field multipliers. The presented structures aim to strike a balance between resource utilization and latency. The first structure utilizes three parallel multipliers for high-speed performance; however, it requires a significant amount of hardware resources. For curve parameters d1 and d2 equal to 59, it consumes 4454 slices and achieves a latency of 34.61 microseconds. To address the resource utilization issue, the authors propose a second structure that employs two multipliers. This results in a significant reduction in hardware resources, utilizing only 3521 slices.
Another work for high-speed BEC applications with a reconfiguration feature is presented in [24]. It achieves a maximum clock frequency of 48 MHz, indicating its capability for high-speed operations. Nevertheless, it employs a relatively high number of hardware resources (21,816 slices for regular BEC and 22,373 slices for BEC with halving). It may limit its applicability in resource-constrained environments. While the architecture handles the halving operation with minimal additional resources, its reconfigurability may come at the cost of increased resource requirements, making it less suitable for certain low-resource scenarios. As a result, the design requires further optimizations to achieve a more balanced trade-off between speed and hardware resource utilization. In other words, in addition to the computational speed, the area efficiency is equally important.
The work in [25] optimizes the area utilization. For this purpose, the authors propose the utilization of a digit-serial multiplier. As a result, the area is 2138 slices for Virtex-6 and 2153 slices for Virtex-5. The limitation of this research lies in the usage of a digit-serial multiplier, as it significantly compromises the speed of the architecture. The authors of [26] strive for further optimization by employing a pipelined digit-serial multiplier. It not only shortens the critical path but also enhances the overall efficiency by optimizing both the throughput and area utilization. When considering a specific value of a curve parameter (d = 59), the implemented hardware on the Virtex-5 platform for the generalized Hessian curve (GHC) utilizes 8875 slices while achieving a latency of 11.03 microseconds. Additionally, for BEC computations, the latency is 11.01 microseconds, and the architecture uses 11,494 slices. The limitation of this research is that it does not explore the impact of different curve parameters on the proposed optimizations.
The latest research, exemplified by studies such as [27], is dedicated to further optimizing various design factors, including latency, area utilization, and throughput. In [27], the authors introduce two innovative architectures, which leverage a combined parallel multiplier (PM) technique. The first architecture achieves substantial improvements, showcasing enhancements of 62%, 46%, and 152% for GF(2^233), GF(2^163), and GF(2^283), respectively. The second architecture allows the high-speed computation of individual PM operations, resulting in reduced latency and improved overall performance. These advancements contribute to the ongoing optimization efforts in the field.
The work in [28] proposes a modular radix-2 interleaved multiplier for a low-latency architecture. This improved multiplier is intended to require less computing time, clock cycles, and area. For this purpose, the Montgomery ladder algorithm is utilized in this architecture. Nevertheless, the author in [28] does not mention the specific curve parameters or point multiplication scenarios where the proposed architecture might excel or face challenges. Understanding the scalability and adaptability of the multiplier to different curve sizes and security levels is crucial in assessing its applicability in real-world cryptographic systems.
Finally, the authors in [29] propose a hybrid algorithm combining Montgomery and double-and-add techniques. The authors optimize the clock cycle by utilizing a modular Montgomery radix multiplier. Additionally, they design architectures for pipelining as well as those without pipelining. By leveraging the hybrid algorithm and incorporating the modular Montgomery radix multiplier, the authors are able to enhance the efficiency and security of point multiplication on BEC. The shortcoming of [29] is the lack of scalability. It causes uncertainty for how well the proposed designs will perform and adapt to larger or more complex cryptographic computations.

1.2. Research Gap

Based on Section 1.1, it appears that the research in utilizing FPGA technology to enhance the performance of point multiplication on BEC has made significant progress in terms of optimizing latency, area utilization, and throughput. However, there are still some potential research opportunities for further exploration in this field.
  • Efficiency improvement for larger curve parameters: The existing studies mention specific curve parameters (e.g., d = 59 ) for evaluating the performance of FPGA implementations. However, as the size of curve parameters increases, the efficiency of point multiplication significantly decreases. There is a need for further research to develop FPGA architectures that can handle larger curve parameters while maintaining high performance and resource efficiency.
  • Exploration of energy-efficient FPGA implementations: While the existing studies focus on performance and resource efficiency, energy efficiency is also a critical factor, particularly in resource-constrained environments or battery-powered devices. Future research can focus on developing FPGA architectures and techniques that optimize energy consumption during point multiplication operations on elliptic curves.
  • Exploration of security aspects: The existing studies primarily focus on performance and resource optimization. However, security is a crucial aspect when implementing cryptographic algorithms. There is a research gap in exploring FPGA implementations that not only provide high performance but also address security concerns.
Therefore, further advancements can be made in ECC implementations, leading to improved performance, resource efficiency, energy efficiency, security, and adaptability.

1.3. Contributions

To speed up the architecture while utilizing fewer hardware resources, the following contributions are reported in this article:
  • The fixed window algorithm is implemented to optimize point multiplication operations by leveraging the binary representation of the scalar multiplier and performing point doubling operations within a fixed window size, minimizing point additions and improving computational efficiency.
  • Two-stage pipelining is adopted to increase the throughput in resource-constrained devices. Dividing the cryptographic operations into two stages and optimizing each stage separately increases the overall throughput of the cryptographic system.
  • The Montgomery radix-4 multiplier (a hardware optimization technique) is proposed to increase the speed of cryptographic operations on resource-constrained devices. The proposed multiplier uses a radix-4 representation to perform multiplications more efficiently. It reduces the number of memory accesses and computation steps, making it a good choice for crypto accelerators in devices with limited resources.
  • A finite state machine (FSM) is designed that effectively controls the data path of the proposed low-area point multiplication design.

1.4. Organization

The article is structured as follows: Section 2 theoretically explains the background on BEC. Section 3 delves into the proposed hardware architecture, describing its key components, design considerations, and implementation details. The proposed scheduling and other related optimizations are detailed in Section 4. The main findings and results obtained from the conducted analysis are discussed in Section 5, providing insights into the performance and efficiency of the proposed design compared to existing solutions. Section 6 discusses the pros and cons of the proposed architecture. Finally, Section 7 concludes the paper, summarizing the key points, highlighting the significance of the research, and suggesting potential avenues for future work.

2. Mathematical Background

Section 2.1 of the article presents the mathematical formulations for BEC. The unified mathematical formulation is discussed in Section 2.2. Furthermore, the process of performing point multiplication computations and the implementation of the fixed window size algorithm are described in Section 2.3 and Section 2.4.

2.1. BEC Equations over G F ( 2 m )

Harold Edwards proposed the BEC model in 2007 [30]. This model provides the following mathematical formulation of BEC for the prime field:
x^2 + y^2 = 1 + d x^2 y^2    (1)
Equation (1) defines the initial points of the curve as x and y, with the curve parameter represented by d. Nevertheless, dealing with curves over large prime fields is computationally demanding. To address this issue, Bernstein introduced the binary version of Edwards curves, as described in Equation (2) [31]:
E(B, d1, d2): d1(x + y) + d2(x^2 + y^2) = xy + xy(x + y) + x^2 y^2    (2)
Equation (2) describes the binary representation of Edwards curves, with x and y representing the initial points on the curve. The curve parameters are denoted by d1 and d2. It is worth noting that Equation (2) is valid under the condition that d1 is nonzero and d2 is not equal to d1^2 + d1.
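As a software sanity check of Equation (2), the curve membership test can be sketched in a few lines of Python. This is an illustrative sketch only: the field size m = 4, the toy irreducible polynomial x^4 + x + 1, and the parameters d1 = d2 = 1 are assumptions for demonstration, not values used by the accelerator. Field addition over GF(2^m) is XOR, and gf_mul is a schoolbook carry-less multiply followed by reduction.

```python
# Toy GF(2^m) arithmetic; elements are bit masks of polynomial coefficients.

def gf_mul(a, b, m, poly):
    """Carry-less multiply a*b, then reduce modulo the irreducible poly."""
    r = 0
    while b:
        if b & 1:
            r ^= a          # add (XOR) a shifted copy of a
        a <<= 1
        b >>= 1
    # reduce r (degree up to 2m-2) modulo poly (degree m)
    for i in range(2 * m - 2, m - 1, -1):
        if (r >> i) & 1:
            r ^= poly << (i - m)
    return r

def on_curve(x, y, d1, d2, m, poly):
    """True if (x, y) satisfies d1(x+y) + d2(x^2+y^2) = xy + xy(x+y) + x^2 y^2."""
    mul = lambda a, b: gf_mul(a, b, m, poly)
    lhs = mul(d1, x ^ y) ^ mul(d2, mul(x, x) ^ mul(y, y))
    xy = mul(x, y)
    rhs = xy ^ mul(xy, x ^ y) ^ mul(mul(x, x), mul(y, y))
    return lhs == rhs
```

For instance, (0, 0) satisfies the equation for any valid d1, d2, since both sides vanish.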

2.2. Unified Mathematical Formulation

Table 1 lists differential addition instructions for BEC over G F ( 2 m ) as described in the work by [23]. The table outlines a total of seven complex instructions, each contributing to the efficient computation of BEC operations. To accommodate the storage requirements for various results (initial, intermediate and final), a memory unit with a capacity of 11 × m is required. Here, the number 11 shows the required number of memory locations. Similarly, the value of m corresponds to the width of each memory location. Consequently, the architecture ensures the seamless storage and management of data throughout the computation process. The BEC curve parameters defined in Table 1 encompass the values of e 1 , e 2 , and w, which are calculated as follows: e 1 = e 4 , e 2 = e , and e = d 1 4 + d 1 3 + d 1 2 d 2 . Additionally, w represents a rational function for an elliptic curve E over G F and is computed as w ( P ) = x + y d 1 ( x + y + 1 ) , where P = ( x , y ) belongs to E B , d 1 , d 2 . Within the architecture, the initial projective points are represented by W 1 , Z 1 , W 2 , and Z 2 . Similarly, the final points are denoted as Z a , Z d , W a , and W d . The intermediate results are placed in registers A, B, and C, facilitating the efficient computation and handling of data throughout the execution process.

2.3. Point Multiplication for BEC

The following equation computes the point multiplication (PM) over BEC [32]:
Q = k · p = p + p + ... + p (k times)    (3)
In Equation (3), the variable Q specifies the final point. Variable k is a scalar multiplier, and p is the starting point. For the computation of PM, a fixed window algorithm is used.

2.4. Fixed Window Algorithm

The fixed window method used in Algorithm 1 aims to optimize point multiplication by reducing the required point additions and doublings. This method takes advantage of the binary representation of the scalar multiplier “k” and performs point operations based on the bits of “k” using a fixed window size of “w”. The optimization lies in the fact that instead of performing point additions or doublings for each individual bit of “k”, Algorithm 1 processes multiple bits simultaneously within a window. By doing so, it reduces the number of expensive point operations and improves the overall efficiency of the point multiplication.
The outer loop of Algorithm 1 iterates through the bits of “k” from the most significant bit (m − 1) to the least significant bit (0). Within each iteration, the algorithm performs a point addition operation using the dADD function. If the current bit is 1, an additional point doubling operation is performed using the dbL function. The inner loop, introduced within the outer loop, handles the fixed window technique. It iterates from the current bit position minus one (i − 1) down to the maximum of (i − w + 1) and 0. For each iteration, a point doubling operation is performed using the dbL function. If the corresponding bit is 1, an additional point addition operation is performed using the dADD function. By incorporating the fixed window method, the proposed algorithm reduces the number of point additions and doublings. Instead of performing these operations for each bit, it processes multiple bits at once within a fixed window. This optimization improves the efficiency of point multiplication, especially for large scalar values.
Algorithm 1: Point multiplication based on fixed window algorithm.
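The windowed scan can be sketched as follows. This is a generic left-to-right fixed-window scalar multiplication, not a reconstruction of Algorithm 1 itself: the group is modeled by plain integer addition (so dbl and add stand in for the paper's dbL and dADD), and the window width w is a free parameter.

```python
# Generic fixed-window scalar multiplication over a stand-in additive group.

def fixed_window_mul(k, P, w=4, dbl=lambda Q: Q + Q, add=lambda Q, R: Q + R):
    """Compute k*P, scanning k in w-bit windows from the most significant end."""
    # precompute T[j] = j*P for j = 0 .. 2^w - 1
    T = [0]
    for _ in range((1 << w) - 1):
        T.append(add(T[-1], P))
    Q = 0
    top = ((k.bit_length() + w - 1) // w) * w   # round up to a window boundary
    for i in range(top - w, -1, -w):
        for _ in range(w):                      # w doublings per window
            Q = dbl(Q)
        Q = add(Q, T[(k >> i) & ((1 << w) - 1)])  # one table-driven addition
    return Q
```

With real curve routines passed in as dbl/add (and the curve's neutral point in place of 0), the same control flow applies: one addition per w-bit window instead of one per bit.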

3. Proposed Hardware Architecture

The top view of the architecture for BEC is discussed in Section 3.1. The memory unit is described in Section 3.2. The multiplexers used for routing purposes are detailed in Section 3.3 and Section 3.4. The arithmetic logic unit (ALU), responsible for executing various arithmetic operations, is explained in Section 3.5. Finally, a controller in the form of a finite state machine (FSM) is outlined in Section 3.6.

3.1. Overview of the Architecture

The top-level architecture comprises various essential components. It includes a memory unit (MU) responsible for placing different results as discussed in Section 3.2. The routing networks (RNs) facilitate the data transfer between different locations as detailed in Section 3.3. The read-only memory (ROM) is utilized to retrieve the necessary BEC parameters as outlined in Section 3.4. The ALU, which executes computations essential to the BEC process, is described in Section 3.5. Finally, the FSM provides the required control signals (Section 3.6).
In addition to the aforementioned essential components, pipelined registers are placed before the ALU as depicted in Figure 1. This placement enables the efficient overlapping of the read, execute, and write-back stages, leading to an improved throughput and minimized critical path delays. The architecture is meticulously designed based on the parameters specified by the National Institute of Standards and Technology (NIST). By incorporating these components and employing pipelining techniques, the proposed architecture aims to achieve superior performance and efficiency in BEC computations.

3.2. Memory Unit

In order to efficiently handle intermediate and final results, the proposed architecture integrates a memory unit with a capacity of 10 × m as illustrated in Figure 1. This memory unit plays a crucial role in storing and managing the data within the system. Each of the 10 memory locations holds data elements with a width denoted by m. The storage elements W 1 , Z 1 , W 2 , and Z 2 are dedicated to storing the initial projective points, while T 4 , T 1 , W 0 , and T 3 are utilized for storing the updated projective points. The intermediate results are stored in Z 0 and T 2 . To ensure efficient data access and manipulation, the architecture incorporates a 1 × 10 demultiplexer (DEMUX) controlled by the WRITE_ADDR signal. This DEMUX facilitates the storing of different elements in their designated memory locations. Furthermore, two multiplexers, RF_1 and RF_2, controlled by the RF_1_ADDR and RF_2_ADDR signals, respectively, are employed to retrieve the required storage elements from the memory unit. These multiplexers enhance the flexibility and accessibility of the architecture’s data-retrieval process. Both RF_1 and RF_2 have a size of 10 × m . The outputs of RF_1 and RF_2 are referred to as MUX_RF_1 and MUX_RF_2, respectively.

3.3. Routing Networks

To enable seamless data transfer between various locations within the architecture, three routing networks (RN_1, RN_2, and RN_3) are employed as depicted in Figure 1. The RN_1 and RN_2 include the following three inputs: base coordinates x and y, the output of ROM_DATA, and the outputs of RF_1 and RF_2. The selection of data for processing is controlled by the signals RN_1_ADDR and RN_2_ADDR. The dimension for these control signals is 5 × 1 . On the other hand, the input of RN_3 is the output of ALU. Moreover, RN_3 has a size of 3 × 1 . It performs a critical role in selecting the desired result produced by the ALU.

3.4. Read Only Memory

Figure 1 shows the utilization of read-only memory (ROM) with a capacity of 3 × 1 . It accesses pre-calculated curve constant values. To do this, it incorporates a single multiplexer. This configuration enables the efficient retrieval of the required constant value from the ROM.

3.5. Arithmetic Logic Unit

The ALU integrates an adder, a multiplier, and a squarer unit. The adder unit leverages m bitwise exclusive OR gates, where m denotes the key length. For multiplication, the Montgomery radix-4 multiplier technique is adopted, enabling the computation of the product X × Y. Moreover, it performs the reduction step X × Y mod p, where mod p denotes the reduction operation. This approach eliminates the need for costly division operations on both hardware and software platforms.
The squarer unit is located after the multiplier unit in Figure 1. To implement the squarer unit, each input data value is extended by appending a “0” as described in [33]. The squarer unit plays a crucial role in minimizing the total number of clock cycles (CCs) necessary for PM computations. Its primary objective is to enhance efficiency by enabling the efficient computation of instructions such as (A × B)^2. The process of polynomial multiplication and squaring often requires an inverse operation. By incorporating a dedicated squarer unit, the number of clock cycles required for the inversion process is significantly reduced. This article integrates the quad block Itoh–Tsujii method [16] for executing the inversion operation efficiently. Leveraging the multiplier and squarer units optimizes the performance of the inversion process. In the following, the details of the Montgomery radix-4 multiplier are provided. Section 3.5.1 provides the necessary theoretical background on the Montgomery multiplier algorithm. During this multiplication process, the construction of a partial product is required, which is obtained using Booth encoding as explained in Section 3.5.2. Finally, the corresponding hardware architecture of the multiplier is explained in Section 3.5.3.
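The zero-appending trick from [33] amounts to spreading the operand's bits apart: squaring a polynomial over GF(2) moves bit i to position 2i, with zeros in between, after which only the modular reduction remains. A minimal sketch of the spreading step (reduction omitted; function name is illustrative):

```python
def gf_square_spread(a):
    """Square a GF(2)[x] polynomial by bit spreading.
    Over GF(2), (sum a_i x^i)^2 = sum a_i x^(2i): bit i moves to position 2i,
    leaving a zero between every pair of adjacent result bits."""
    r, i = 0, 0
    while a:
        if a & 1:
            r |= 1 << (2 * i)
        a >>= 1
        i += 1
    return r
```

Because no cross terms survive in characteristic 2, this costs no multiplications at all; in hardware it is pure wiring, which is why a dedicated squarer shortens the Itoh–Tsujii inversion so effectively.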

3.5.1. Montgomery Multiplier Algorithm

The proposed multiplication algorithm is based on the Montgomery reduction method. It works by transforming the input numbers into a special representation called the Montgomery form. In this method, the modular reduction operation is simplified and can be performed in constant time. Furthermore, the radix-4 variant uses a base-4 representation of the numbers, yielding faster multiplication and reduction operations. It multiplies the two n-bit input operands X and Y as X × Y × R^(-1) mod p. The greatest common divisor (GCD) of R and p is 1, i.e., GCD(R, p) = 1. In this case, p is frequently a prime integer and R = 2^n. The two integers X and Y to be multiplied must lie in the range [0, p). X and Y must be converted into the Montgomery domain in order to multiply them. Subsequently, in the Montgomery domain, these numbers are multiplied to calculate the result Zp. The result Zp is translated back to its original domain after being computed in the Montgomery domain. A division operation would normally be required, since the multiplication involves reduction (mod p); in Montgomery multiplication, this division is replaced by shift and add operations. Moreover, since many GF multiplications are carried out consecutively, the conversion cost is negligible.
The radix-4 Montgomery multiplication is described in Algorithm 2. The input operands are Xp and Yp, while p is the prime number mentioned in Algorithm 2. Zp holds the final result, calculated as Xp × Yp × R^(-1) mod p. Step 1 of Algorithm 2 initializes the accumulator to zero. Step 2 of Algorithm 2 shows that, on each loop iteration, a partial product Xp(i) × Yp is generated and added to accumulator A. The partial product Xp(i) × Yp, as illustrated in Table 2, will be 0, ±Yp, or ±2Yp, depending on the value of Xp(i). Steps 3–5 of Algorithm 2 carry out the operation indicated by the following equation:
A(i+1) = (A(i) + PP(i) + q(i) × p) / 4    (4)
In Equation (4), A(i) represents the accumulator, PP(i) represents the partial product of each iteration, and q(i) represents the quotient derived from the addition result in step 4 of Algorithm 2. The product of the multiplicands Xp and Yp, returned as Zp in step 6 of Algorithm 2, is found in A(i) after n/2 iterations. Booth encoding, as explained in Section 3.5.2, is employed to construct the partial product.
Algorithm 2: Montgomery multiplier radix-4 [34].
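Steps 1–5 and Equation (4) can be modeled in software as below. This is an illustrative integer radix-4 Montgomery product, not the hardware datapath: it consumes plain 2-bit digits of X rather than the Booth-recoded signed digits of Table 2 (both compute the same Montgomery product), and the quotient digit q is chosen so that the accumulator's two low bits become zero before the exact division by 4.

```python
def mont_mul_radix4(x, y, p, n):
    """Radix-4 Montgomery product: returns x*y*2^(-n) mod p (n even, p odd).
    Each iteration consumes one 2-bit digit of x and divides by 4, mirroring
    A(i+1) = (A(i) + PP(i) + q(i)*p) / 4 from Equation (4)."""
    p_neg_inv = (-pow(p, -1, 4)) % 4     # -p^(-1) mod 4 (Python 3.8+)
    A = 0
    for i in range(n // 2):
        A += ((x >> (2 * i)) & 3) * y    # accumulate partial product PP(i)
        q = (A * p_neg_inv) % 4          # digit that zeroes A's two low bits
        A = (A + q * p) >> 2             # exact division by 4
    return A - p if A >= p else A        # single conditional subtraction
```

Converting an operand into the Montgomery domain is one such product with R^2 mod p; converting back is a product with 1, which is why the conversion overhead amortizes over long chains of multiplications.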

3.5.2. Booth Encoding

Booth encoding is a technique used to reduce the number of bit shifts required in a multiplication operation [35]. In a standard multiplication, each bit in the multiplier affects multiple bits in the product, requiring multiple shifts. Booth encoding groups together the bits in the multiplier that affect the same bits in the product, allowing for fewer shifts and therefore faster multiplication. When used in a Montgomery radix-4 multiplier, Booth encoding can further improve the efficiency of the multiplication operation.
In Booth encoding, each digit is assigned a weight based on its significance in the final result. The weights are typically −2, −1, 0, 1, and 2, which represent the number of times the multiplicand needs to be added or subtracted to the partial product. The Booth encoded representation is then used to perform the multiplication operation using a series of additions and subtractions, rather than multiple bit shifts [36]. This reduces the number of operations required, making multiplication faster and more efficient. Booth encoding with weights of −2, −1, 0, 1, and 2 is a technique used to reduce the number of bit shifts required in a multiplication operation, improving its efficiency and speed [37]. The use of Booth encoding in a Montgomery radix-4 multiplier provides a highly efficient multiplication operation. Booth function for radix-4 is represented by the following equation:
−2·Xp(n+1) + Xp(n) + Xp(n−1)    (5)
Using Equation (5), the results described in Table 2 are computed. Table 2 provides information related to the partial products. The first column of Table 2 lists Xp, while the second to ninth columns enumerate the possible bit patterns of Xp. Based on the Xp values, the partial product is 0, Y, −Y, 2Y, or −2Y.
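The recoding of Equation (5) can be sketched as a digit generator; the function name and its n parameter are illustrative. Summing digit_i × 4^i recovers the original multiplier, which is exactly the property the partial-product selection in Table 2 relies on.

```python
def booth_radix4_digits(x, n):
    """Recode an n-bit non-negative multiplier into radix-4 Booth digits in
    {-2, -1, 0, 1, 2}: digit_i = -2*b(2i+1) + b(2i) + b(2i-1), per Eq. (5),
    with b(-1) = 0 and one extra digit so the top bits are fully absorbed."""
    bit = lambda j: (x >> j) & 1 if j >= 0 else 0
    return [-2 * bit(2 * i + 1) + bit(2 * i) + bit(2 * i - 1)
            for i in range(n // 2 + 1)]
```

Because every digit is 0, ±1, or ±2, each partial product is formed from Y by at most a negation and a one-bit shift, replacing general multiplication with additions and subtractions.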

3.5.3. Montgomery Radix-4 Architecture

The proposed hardware architecture of the radix-4 Montgomery multiplier is shown in Figure 2. It consists of two multiplexers, MUX_A and MUX_B. The size of MUX_A is 5 × 1, while the size of MUX_B is 4 × 1. MUX_A is used to select the partial product PP(i) (−2Y, −Y, 0, Y, 2Y); the selected PP(i) depends on the Booth encoding defined in Table 2. PP(i) is then added to the outputs ZpL and ZpH. The two LSBs of Adder_A are fed into the selection logic to select the multiple of the prime number (0, p, 2p, 3p), which prevents data loss as described in Table 3. Table 3 provides information related to these multiples: the first column of Table 3 lists Adder_A[1:0], and the second to fifth columns define its possible values. Based on Adder_A[1:0], the output is selected (0, p, 2p, or 3p). The remaining bits of Adder_A are added to the output of MUX_B, and the sum is finally shifted right by 2. The output result is obtained after n/2 iterations.

3.6. Control Unit

It consists of two FSMs optimized for the two different cases (without pipelining and with pipelining). In the non-pipelined case, the FSM requires a total of 55 states to execute all control functionalities. By carefully optimizing the state transitions, we minimize the number of required states. Similarly, in the case of pipelining, the FSM requires 93 states to perform all controlling tasks. The increase in the number of states is due to the additional complexity introduced by the pipelining stages. In other words, there is a need to manage data flow and synchronization between the pipeline stages.
As shown in Figure 3, the control unit starts in an idle state, referred to as State 0. It is triggered by the reset and start control signals. When the start signal is 1, the control unit transitions from the current state to the next state (State 1). In the pipelined case, the flow from State 1 to State 12 is specifically designed for affine to projective conversions. These control signals facilitate the efficient execution of affine to projective conversions. Furthermore, states 7 to 65 are dedicated to producing signals for the quad block Itoh–Tsujii inversion operation. State 66 in the FSM inspects the current bit of the scalar k. If the bit value is 1, State 66 transitions to State 81. On the other hand, if the bit value is not 1, it transitions to State 67.
States 81 to 92 are responsible for executing the “if” portion of Algorithm 1. Similarly, states 67 to 80 in the FSM generate signals related to the “else” portion of the target algorithm. Moreover, states 80 and 92 check the value of m, for the corresponding value of k (either 0 or 1). These states accurately determine the next steps based on the specific conditions and requirements of the algorithm. If the value of m is equal to 233, State 93 will be the next state. If the value of m is not equal to 233, State 66 will be the next state. By carefully optimizing the control signals and state transitions within these sections of the FSM, we streamlined the execution of Algorithm 1 within the cryptographic accelerator. These optimizations enhance the overall efficiency and performance of the accelerator, enabling it to handle cryptographic computations with precision and speed.
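The top-level transitions described above can be summarized in a small software model. This is an illustrative sketch only: state numbers follow the text for the pipelined FSM, the per-state control signal outputs are omitted, and the k_bit and bits_done arguments are stand-ins for the scalar-bit test and the m = 233 bit counter.

```python
# Compact model of the pipelined FSM's branching states (signals omitted).

def next_state(state, start=0, k_bit=0, bits_done=False):
    if state == 0:                      # idle: wait for the start signal
        return 1 if start else 0
    if state == 66:                     # inspect the current scalar bit
        return 81 if k_bit == 1 else 67
    if state in (80, 92):               # end of else/if parts of Algorithm 1
        return 93 if bits_done else 66  # all 233 bits consumed -> finish
    if state == 93:                     # terminal state
        return 93
    return state + 1                    # sequential flow everywhere else
```

The two return paths from states 80 and 92 capture the loop structure of Algorithm 1: iterate back through State 66 until the bit counter reaches m, then terminate in State 93.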

3.7. Clock Cycles Information

The total clock cycle count of the proposed cryptographic accelerator can be calculated using Equation (6). The equation takes into account the initialization part, the point multiplication computation (considering two-stage pipelining), and the quad-block Itoh–Tsujii computation. For the non-pipelined architecture, the term “16” in Equation (6) is replaced with “13”. Overall, Equation (6) provides a concise representation of the clock cycle calculation for both the pipelined and non-pipelined architectures:
clock cycles = initial_states + 16 × (m − 1) + inversion_states    (6)
Table 4 summarizes the results of the cryptographic accelerator design in terms of clock cycles. It includes three columns: the name of the selected parameter, the clock cycle results for the non-pipelined architecture, and the clock cycle results for the two-stage pipelined architecture. The first row specifies that the initial states for Algorithm 1 are 6 for the non-pipelined architecture and 12 for the two-stage pipelined architecture. The key length is mentioned as 233 in the second row (for both cases). The next two rows provide the clock cycles required for point multiplication (PM) computation in both architectures. Row five presents the cost of inversion. Finally, the last row of the table shows the total number of clock cycles required to execute Algorithm 1 for both cases (pipelining and without pipelining).
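Equation (6) can be sketched as a short computation. The initial-state counts (6 and 12) and the key length (233) are from Table 4; the inversion cost used below is a placeholder assumption, not a value from this paper.

```python
# Sketch of Equation (6): total clock cycles for one point multiplication.

M = 233  # key length from Table 4

def total_cycles(initial_states, cycles_per_iteration, inversion_states):
    # Equation (6): initialization + (m - 1) loop iterations + inversion
    return initial_states + cycles_per_iteration * (M - 1) + inversion_states

INV = 50  # hypothetical Itoh-Tsujii inversion cost, for illustration only

pipelined     = total_cycles(12, 16, INV)  # two-stage pipelined architecture
non_pipelined = total_cycles(6, 13, INV)   # non-pipelined ("16" becomes "13")
print(pipelined, non_pipelined)            # 3774 3072 with this placeholder
```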

4. Proposed Optimizations

This section begins by providing the motivation for two-stage pipelining over three-stage pipelining in Section 4.1. Subsequently, Section 4.2 elaborates on the pipelined implementation of the differential addition law; in this context, an extensive analysis of instruction execution and scheduling for the two-stage pipelining approach is provided in tabular form. The optimization of storage and execution cycles is explained in Section 4.3. Moreover, the optimization of instruction scheduling is presented in Section 4.4. Finally, an overall summary of the proposed optimizations is presented in Section 4.5.

4.1. Motivation for 2-Stage Pipelining

By employing pipelining, a larger computation is divided into smaller stages, reducing the critical path delay and substantially improving the overall throughput. To optimize the pipelining process for a given design, the circuit can be divided into three types of operations: read operations, arithmetic logic unit (ALU) operations, and write-back operations. Consequently, two different scenarios can be considered. In the first scenario, the pipeline registers are placed at the input of the ALU; the read operation then takes place in the first cycle, while the execute and write-back stages are carried out in the second cycle. In the second scenario, pipeline registers are placed at both the input and the output of the ALU, so three cycles are required for the read, execute, and write-back stages. This article adopts two-stage pipelining, which enables efficient overlapping of the read, execute, and write-back stages. Introducing a third stage can lead to a higher number of clock cycles, primarily due to read-after-write (RAW) hazards.
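The cycle penalty behind this choice can be illustrated with a simple model (not taken from the paper): when write-back occupies its own third stage, a dependent instruction sees its operand one cycle later, so every RAW hazard stalls for one extra cycle. The instruction and hazard counts below are arbitrary.

```python
# Illustrative stall model: 2-stage vs. 3-stage pipeline under RAW hazards.

def total_cycles(n_instructions, n_raw_hazards, stalls_per_hazard):
    # each RAW hazard inserts stall cycles on top of one issue cycle per instruction
    return n_instructions + n_raw_hazards * stalls_per_hazard

n, h = 10, 4  # arbitrary example: 10 instructions with 4 RAW dependencies
two_stage   = total_cycles(n, h, 1)   # read | execute + write-back
three_stage = total_cycles(n, h, 2)   # read | execute | write-back
print(two_stage, three_stage)         # 14 18
```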

4.2. Pipelined Implementation of Differential Addition Law

Table 5 provides an extensive analysis of instruction execution and scheduling for the implementation of the differential addition law with a two-stage pipelining approach. It provides valuable information on clock cycles, instructions, and the merging of multiple operations into a single operator form to simplify instruction complexity as shown in the first three columns. The status of two-stage pipelining is indicated in columns four to six. Similarly, column seven highlights the potential for RAW hazards. Finally, columns eight to twelve show the proposed scheduling scheme. This carefully designed scheduling scheme ensures the efficient execution and optimization of the differential addition law within the two-stage pipelined architecture.

4.3. Optimizing Storage and Execution Cycles

The effective execution of the differential addition law requires 14 × m storage elements. These hold the initial projective point values (W1, Z1, W2, Z2), the intermediate results (A, B, C, T1 to T3), and the computed values of the final projective point (Wa, Za, Wd, Zd), which are dynamically updated during execution. The allocation of values to these storage elements is presented in column 3 of Table 5. As shown in Table 5, each operation requires a total of 14 cycles when the read (R), execute (E), and write-back (WB) stages are completed within a single clock cycle. RAW hazards arise for some specific instructions (Instr6, Instr7, Instr8, Instr10, Instr11, Instr13), as shown in column seven. For instance, Instr6 experiences a one-cycle delay due to a pending write, requiring two cycles to compute the new value of T1. Taking RAW hazards into account, each unified PA and PD operation in this pipelined architecture requires 20 cycles for completion; the additional cycles account for hazard resolution and ensure proper execution.
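The cycle count above can be restated in a few lines: 14 instructions, six of which are delayed one cycle by a RAW hazard.

```python
# 14 instructions; Instr 6, 7, 8, 10, 11 and 13 each incur a one-cycle RAW stall.
raw_stalled = {6, 7, 8, 10, 11, 13}
cycles = sum(2 if i in raw_stalled else 1 for i in range(1, 15))
print(cycles)  # 20 cycles per unified PA/PD operation, as stated above
```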

4.4. Optimizing Instruction Scheduling

In the pursuit of efficient instruction scheduling, the goal is to minimize RAW hazards, improving the architecture in terms of area, clock cycles, and instructions. Without the proposed scheduling, Instr7 (T3 = T2 × T2) and Instr8 (Zd = T3 × T3) are executed in separate cycles, as indicated in column three of Table 5. To enhance efficiency, an optimized approach combines Instr7 and Instr8 into a single clock cycle, as described in column eight of Table 5. With the optimized instruction scheduling, the occurrence of RAW hazards within the pipeline is reduced to just one, as indicated in column nine of Table 5. As a result, the total number of clock cycles drops to 16.

4.5. Summary of Proposed Optimizations

The proposed scheduling offers several benefits: (1) it reduces the total number of instructions, improving overall instruction efficiency; (2) it decreases the storage requirement from 14 × m to 10 × m elements; and (3) it reduces the total number of required clock cycles.

5. Results and Comparison

This section presents the detailed results obtained from our implementation of the BEC model for ECC. Section 5.1 describes the hardware and software utilized in our implementation. The performance analysis of our cryptographic accelerator design considers various metrics, which are described in Section 5.2. Additionally, Section 5.3 provides a comparative analysis between our proposed design and existing implementations to highlight the achieved performance improvements. This comparison considers several factors, including hardware resource utilization, operational clock frequency, the throughput/area ratio, and other pertinent parameters.

5.1. Implementation Details and Synthesis Platform

We developed the two-stage pipelined and non-pipelined architectures using Verilog, which is a hardware description language (HDL). The design was implemented on different Xilinx FPGA devices, specifically Virtex 4, Virtex 5, Virtex 6, and Virtex 7. For synthesizing the design, we utilized the Xilinx ISE design suite, specifically version 14.7.

5.2. Performance Metrics and Evaluation

To design more efficient systems, it is crucial to consider various performance metrics, including throughput/area, slices, LUTs, and time, which together capture the hardware cost and computation speed of a design. Throughput/area is a particularly useful metric that quantifies the number of computations performed per slice. It is calculated using Equation (7) and measures how efficiently a design utilizes hardware resources; minimizing the number of slices used increases the number of computations performed per unit of hardware. Slices and LUTs indicate the hardware resources consumed by a design, and reducing them yields a more cost-effective system. Time, typically measured in microseconds, measures the speed of computation; optimizing the time required for a single point multiplication (PM) operation enhances the overall efficiency of the system. Considering these metrics together allows designs to be evaluated and optimized for both hardware utilization and computation speed:
throughput/area = throughput (Q = k·P in µs) / slices    (7)
Equation (7) is expressed in a simplified form as shown in Equation (8).
throughput/area = [10^6 / time (or latency) (Q = k·P in µs)] / slices    (8)
Equation (8) defines throughput as the reciprocal of the time required to compute one point multiplication (PM), denoted as Q = k·P, in seconds. The term “slices” refers to the area utilized on the chosen FPGA device. On the binary Edwards curve (BEC), P and Q denote the initial and final points, respectively, and k is the scalar multiplier. In Equation (8), the factor 10^6 converts the time from microseconds to seconds, simplifying the calculation of the throughput given in Equation (7). The latency, i.e., the time required for one PM operation, is computed using Equation (9); the resulting values are listed in column 6 of Table 6. Optimizing these values improves the system efficiency, reduces the computation time, and maximizes the throughput:
time (or latency) = required clock cycles (CCs) / operational clock frequency    (9)
Equation (9) computes the latency of a single point multiplication (PM) operation from the required clock cycles (CCs). The corresponding CC values are displayed in Table 4, while the operational clock frequency, measured in MHz, is given in column 3 of Table 6. Taking into account metrics such as throughput, slices, latency, clock cycles, and operational clock frequency enables the design of more efficient and effective systems, with optimal utilization of hardware resources and faster computations.
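Equations (7)–(9) can be sketched as follows. The latency of 15.87 µs is the Virtex-7 figure reported in this paper; the slice count of 1641 is an assumed, illustrative value, not a synthesis result.

```python
# Sketch of Equations (7)-(9) for the throughput/area metric.

def latency_us(clock_cycles, freq_mhz):
    # Equation (9): cycles divided by frequency in MHz yields microseconds
    return clock_cycles / freq_mhz

def throughput_per_area(latency_in_us, slices):
    # Equations (7)-(8): 10^6 / latency(us) = PM operations per second,
    # then normalized by the occupied FPGA slices
    return (1e6 / latency_in_us) / slices

# 15.87 us is from this paper; 1641 slices is an assumed value.
print(round(throughput_per_area(15.87, 1641), 2))  # ~38.4
```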

5.3. Performance Comparison

To facilitate a performance comparison with state-of-the-art methods and techniques, we employed Xilinx FPGA devices to implement our proposed design. The synthesis results of the proposed design are presented in Table 6. The first column of Table 6 provides references, and the second column specifies the FPGA platform used for synthesis. Column 3 gives the operational clock frequency in MHz. The resource utilization is presented in columns 4 and 5. Column 6 provides the time, measured in microseconds (µs), required to complete a single PM operation. Finally, the last column presents the throughput/area ratio, which quantifies the number of computations executed per second per slice. By analyzing the results in Table 6, we can make an accurate and unbiased comparison between the proposed design and the existing solutions.

5.3.1. Comparison on Virtex 4 Platform

The study conducted by [16] explores the use of a digit-parallel multiplier for applications with fewer resource constraints. However, compared to our proposed non-pipelined architecture, their approach consumes more hardware resources and provides a lower throughput/area ratio. Compared to our proposed two-stage pipelined architecture, their approach likewise exhibits higher hardware resource utilization and a significantly lower throughput/area. The proposed non-pipelined architecture offers significant optimizations over the reconfigurable BEC design presented in [24]: an 86.2% reduction in hardware resource usage for point addition (PA) and point doubling (PD) computations while operating at a 1.91-times higher frequency. Furthermore, incorporating the point halving architecture proposed in [24] leads to an 86.5% improvement in hardware resource utilization for PA and PD computations. In contrast to the architecture described in [24], the two-stage pipelined architecture in this article uses 85.1% fewer resources, and when the point halving architecture is utilized, the reduction improves to 85.49%. Moreover, our architecture achieves a 4.15-times higher frequency compared to the design in [24].
The work in [25] proposed two solutions. Compared with the first solution, which utilizes the FF Gaussian-based multiplier, the proposed non-pipelined architecture achieves an impressive 3.52-times higher throughput/area while utilizing 89.7% fewer FPGA slices, showcasing its efficiency in resource utilization. Similarly, the proposed two-stage pipelined architecture achieves an outstanding 5.12-times higher throughput/area while utilizing 88.9% fewer resources. Moving to the second solution presented in [25], the proposed non-pipelined architecture achieves a notable 3.41-times higher throughput/area, demonstrating its computational efficiency, and the proposed two-stage pipelined architecture achieves an even more impressive 4.97-times higher throughput/area. Importantly, both architectures significantly reduce hardware resource utilization compared to [25]: 75.7% fewer resources for the non-pipelined architecture and 73.8% fewer for the two-stage pipelined architecture.
The work in [29] optimized the existing BEC architectures by utilizing a hybrid algorithm combining the Montgomery and double-and-add techniques. Compared with the non-pipelined architecture in [29], the proposed non-pipelined architecture utilizes 9.08% fewer hardware resources, operates at a 2.13-times lower frequency, and achieves a 1.12-times lower throughput/area ratio; this trade-off reflects better resource utilization at the cost of execution speed. Furthermore, when comparing the pipelined architectures, the proposed two-stage pipelined architecture uses 1.6% fewer hardware resources than the existing hybrid architecture, operates at a 1.9% higher frequency, and achieves a 1.29-times higher throughput/area ratio. These results suggest that the proposed architecture outperforms the existing hybrid architecture in terms of resource utilization, operating frequency, and overall throughput efficiency.
The proposed non-pipelined architecture demonstrates superior performance compared to the digit-serial pipelined multiplier architecture in [38]. It uses 90.5% fewer hardware resources than the architecture in [38] while achieving a 3.56-times higher throughput; this significant reduction in resource usage showcases the efficiency of our design. Similarly, the proposed two-stage pipelined architecture surpasses the architecture in [38] by utilizing 89.7% fewer hardware resources, leading to an impressive 5.19-times higher throughput/area ratio and further enhancing the computational efficiency of our design.

5.3.2. Comparison on Virtex 5 Platform

The proposed two-stage pipelined design showcases significant optimization compared to the architecture presented in [23]. It requires 42.7% fewer resources while achieving an impressive 4.49-times higher throughput/area. This reduction in resource utilization demonstrates the efficiency of our design in maximizing performance while minimizing hardware requirements. Similarly, the non-pipelined design in this article surpasses the architecture in [23] by achieving a higher throughput/area while demanding 1.65-times fewer FPGA slices. This substantial reduction in hardware resource usage showcases the efficiency of our design in optimizing resource utilization.
Additionally, our two-stage pipeline architecture exhibits a higher clock frequency and utilizes 1.87 times more resources compared to the architecture in [25]. Despite the increased resource utilization, our design significantly outperforms it in terms of throughput/area, indicating the superior efficiency and performance.
When comparing the proposed architecture to the non-pipelined architecture presented in [29], significant optimizations are achieved. The proposed architecture showcases a reduction of 14.7% in hardware resource utilization, operating at 1.71-times lower frequency. Moreover, it achieves a notable improvement of 1.17-times higher throughput/area. These enhancements highlight the effectiveness of the proposed hybrid algorithm in terms of resource efficiency and overall performance. Furthermore, in the comparison with the two-stage pipelined architecture presented in [29], the proposed approach exhibits additional improvements. The hardware resource utilization is reduced by 3.3%, while the operating frequency is increased by 1.73%. Notably, the proposed two-stage pipelined architecture achieves a 1.32-times higher throughput/area, indicating a significant boost in computational efficiency and performance compared to the architecture proposed in [29].
The proposed non-pipelined design showcases superior performance compared to the first solution presented in [39]. It achieves a clock frequency that is 2.35 times lower but compensates with a 1.74-times higher throughput/area. Moreover, it utilizes 74.9% fewer hardware resources. When compared to the second solution in [39], our design outperforms it by employing 42.4% fewer resources while achieving a similar clock frequency and a negligible difference in throughput/area. The proposed two-stage pipelined architecture outshines the first solution in [39]. It utilizes 71.5% fewer resources while achieving a 1.7-times higher throughput/area, albeit at a lower clock frequency. On the other hand, the second solution in [39] achieves a clock frequency that is 1.33 times higher while utilizing 34.7% fewer slices. However, it exhibits a 1.91-times lower throughput/area when compared to our proposed two-stage pipeline architecture.

5.3.3. Comparison on Virtex 6 Platform

When compared to the architecture presented in [16], our proposed non-pipelined design operates at a higher clock frequency and delivers a better throughput/area ratio, albeit using slightly more hardware resources. This indicates that our design achieves improved computational efficiency without compromising on performance. On the other hand, our proposed 2-stage pipelined architecture consumes fewer resources, operates at a higher clock frequency, and achieves an improved throughput/area ratio. This demonstrates the optimization achieved by our design in terms of resource utilization and performance. Moreover, compared to the approach proposed in [25], our proposed non-pipelined design achieves a higher clock frequency and a better throughput/area ratio while utilizing significantly fewer hardware resources. This indicates that our design optimizes the utilization of available resources, resulting in improved performance. Conversely, our proposed 2-stage pipelined architecture exhibits a higher clock frequency but consumes more resources. However, this increased resource usage is justified by the improved throughput/area ratio, highlighting the efficiency of the design.
In comparison to the non-pipelined architecture proposed in [29], the proposed architecture demonstrates significant optimizations. The proposed approach achieves a reduction of 11.2% in hardware resource utilization while operating at 35.37% lower frequency. Despite the lower frequency, the architecture achieves a notable improvement of 1.25 times higher throughput/area. These results indicate that the proposed hybrid algorithm offers superior resource efficiency and computational performance compared to the approach presented in [29]. Additionally, when comparing the proposed architecture to the two-stage pipelined architecture introduced in [29], further improvements are observed. The proposed approach reduces hardware resource utilization by 7.3% and achieves a modest increase of 1.3% in operating frequency. Furthermore, the architecture demonstrates a significant enhancement in computational efficiency with a 1.37 times higher throughput/area. These findings underscore the advantages of the proposed architecture in terms of resource utilization and overall performance compared to the two-stage pipelined architecture presented in [29].

5.3.4. Comparison on Virtex 7 Platform

Upon comparing our proposed cryptographic accelerator design with the non-pipelined design in [16], we observed that the work in [16] utilizes 40.8% more resources and operates at a lower clock frequency. As a result, their design achieves a lower throughput/area ratio by a factor of 1.68 when compared to ours. These findings clearly demonstrate the superior resource efficiency and performance of our proposed design. Furthermore, when comparing our proposed two-stage pipeline architecture with the pipelined design in [16], we found that our design consumes 38.3% fewer hardware resources while operating at a significantly higher frequency. This leads to an impressive 1.84 times higher throughput/area ratio in our design. This highlights the effectiveness of our pipelining approach in achieving superior performance with optimized resource utilization.
In the study conducted by [29] on hybrid architectures, two architectural designs are proposed. When the comparison is made for non-pipelined architectures, a significant reduction of 11.2% in resource usage is obtained. Additionally, the non-pipelined architecture in this article operates at a 36.9% lower frequency, suggesting reduced power consumption and improved energy efficiency. Furthermore, it achieves 1.22 times higher throughput/area, indicating superior performance and computational efficiency. When comparing the pipelined designs, it achieves 7.3% reduction in resource utilization, indicating improved hardware utilization. Moreover, the two-stage pipelined architecture outperforms the hybrid design with a 1.33-times higher throughput/area. This improvement highlights its capability to handle workloads more efficiently and deliver enhanced performance.
In summary, the performance comparison clearly highlights the advantages of pipelining in achieving higher performance while using fewer hardware resources. Our proposed design stands out as a superior approach that effectively utilizes pipelining to achieve remarkable throughput/area ratios when compared to the designs discussed in [16,23,24,25,29,38,39]. These optimizations contribute to the advancement of cryptographic accelerators, enabling more efficient and secure cryptographic computations in practical applications.

6. Discussion

Our innovative design incorporates the Montgomery radix-4 algorithm, fixed window algorithm, and two-stage pipelining technique to significantly improve the throughput/area ratio. The Montgomery radix-4 algorithm reduces the number of required modular multiplications, leading to faster computations and improved performance. The fixed window algorithm performs point operations based on multiple bits of the scalar simultaneously within a fixed window. This approach significantly reduces the required number of expensive point additions and doublings, which results in faster point multiplication. The two-stage pipelining technique maximizes hardware resource utilization and reduces latency by enabling the parallel processing of cryptographic operations.
The aforementioned optimization techniques contribute to improving the throughput/area ratio by reducing the number of operations, optimizing modular exponentiation, and maximizing hardware resource utilization through pipelining. Our paper emphasizes scalability and the applicability of the proposed optimizations to various FPGA designs. The optimizations are designed with adaptability in mind and evaluated on different FPGA platforms, including Virtex 4, Virtex 5, Virtex 6, and Virtex 7, showcasing their efficiency across diverse hardware configurations. Moreover, our design is flexible, allowing it to adapt to different ECC parameters and hardware platforms. This flexibility ensures that our design can meet the performance requirements of various applications and integrate seamlessly with existing systems.
The advantages of our optimization techniques include improved throughput/area ratio, reduced computational complexity, and enhanced overall efficiency. However, it is important to consider the drawbacks associated with these techniques, such as increased implementation complexity, and the need for the careful consideration of data dependencies and synchronization in the pipelining process. By addressing these challenges and leveraging the benefits of our design, we can unlock the full potential of efficient and high-performance cryptographic systems.

7. Conclusions

This research presents a novel approach for accelerating BECs in hardware-constrained applications. Significant improvements in speed and resource utilization are achieved by combining the fixed window algorithm, the Montgomery radix-4 multiplier, and two-stage pipelining. The proposed design addresses the challenges of low-resource environments by offering higher throughput and lower hardware requirements. The research also highlights the effectiveness of the proposed architecture through a comprehensive security analysis and extensive evaluations on various hardware platforms. These evaluations include the performance metrics presented in Table 6, such as throughput comparisons, hardware resource utilization, and execution times.
Looking ahead, there are several potential avenues for future work in this field. Firstly, further optimizations can be explored to enhance the performance of the proposed architecture, such as investigating advanced pipelining techniques or exploring alternative hardware optimization methods. Additionally, the security analysis can be extended to evaluate the resilience of the design against other potential attacks or side-channel vulnerabilities.

Author Contributions

Conceptualization, A.S. and M.R.; methodology, M.R. and A.S.; validation, O.S.S., A.R.J., M.A. and M.Y.I.Z.; formal analysis, M.A., M.Y.I.Z. and M.R.; investigation, A.S., A.R.J. and M.R.; resources, O.S.S., M.R. and M.Y.I.Z.; data curation, A.S.; writing—original draft preparation, A.S.; writing—review and editing, M.R.; visualization, O.S.S.; supervision, M.R.; project administration, M.R.; funding acquisition, M.R. All authors have read and agreed to the published version of the manuscript.

Funding

The authors extend their appreciation to the Deanship for Research & Innovation, Ministry of Education in Saudi Arabia for funding this research work through the project number: IFP22UQU4320199DSR102.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Simsim, M. Internet usage and user preferences in Saudi Arabia. J. King Saud Univ.-Eng. Sci. 2011, 23, 101–107. [Google Scholar] [CrossRef] [Green Version]
  2. Joseph, D.; Misoczki, R.; Manzano, M.; Tricot, J.; Pinuaga, F.D.; Lacombe, O.; Leichenauer, S.; Hidary, J.; Venables, P.; Hansen, R. Transitioning organizations to post-quantum cryptography. Nature 2022, 605, 237–243. [Google Scholar] [CrossRef]
  3. Wu, Z.; Li, W.; Zhang, J. Symmetric Cryptography: Recent Advances and Future Directions. IEEE Trans. Inf. Forensics Secur. 2022, 17, 36–53. [Google Scholar]
  4. Kumar, R.; Zia, T.; Kamakoti, V. An Enhanced RSA Cryptosystem with Long Key and High Security. Int. J. Commun. Netw. Distrib. Syst. 2022, 27, 366–383. [Google Scholar]
  5. Zhu, X.; Chen, X.; Wu, S. On the Security of RSA-OAEP with Nonlinear Masking. IEEE Trans. Inf. Theory 2022, 68, 1062–1071. [Google Scholar]
  6. Zhang, Y.; Chen, X.; Chen, S. A New Elliptic Curve Cryptography Algorithm Based on Quartic Residues. IEEE Access 2021, 9, 12310–12320. [Google Scholar]
  7. Smith, J.; Doe, J. A Comparison of Key Sizes for Elliptic Curve Cryptography and RSA. J. Inf. Secur. Appl. 2022, 58, 102868. [Google Scholar]
  8. Lee, S.; Chen, M. Why Elliptic Curve Cryptography is Preferred over RSA. IEEE Trans. Inf. Forensics Secur. 2021, 16, 2133–2145. [Google Scholar]
  9. Almotairi, K.H. Application of internet of things in healthcare domain. J. Umm Al-Qura Univ. Eng. Archit. 2023, 14, 1–12. [Google Scholar] [CrossRef]
  10. Alkabani, Y.; Samsudin, A.; Alkhzaimi, H. Mitigating Side-Channel Power Analysis on ECC Point Multiplication Using Non-Adjacent Form and Randomized Koblitz Algorithm. IEEE Access 2021, 9, 30590–30604. [Google Scholar] [CrossRef]
  11. Mensah, S.; Appiah, K.; Asare, P.; Asamoah, E. Challenges and Countermeasures for Side-Channel Attacks in Elliptic Curve Cryptography. Secur. Commun. Netw. 2021, 2021, 1–18. [Google Scholar]
  12. Fehr, S.; Hohenberger, S.; Kim, H. Binary Edwards Curves: Theory and Applications. Cryptol. ePrint Arch. 2021, 2021, 1239. [Google Scholar]
  13. Sajid, A.; Rashid, M.; Jamal, S.; Imran, M.; Alotaibi, S.; Sinky, M. AREEBA: An Area Efficient Binary Huff-Curve Architecture. Electronics 2021, 10, 1490. [Google Scholar] [CrossRef]
  14. Lopez, J.; Menezes, A.; Oliveira, T.; Rodriguez-Henriquez, F. Hessian Curves and Scalar Multiplication. J. Cryptol. 2019, 32, 955–974. [Google Scholar]
  15. Rashid, M.; Hazzazi, M.M.; Khan, S.; Alharbi, R.; Sajid, A.; Aljaedi, A. A Novel Low-Area Point Multiplication Architecture for Elliptic-Curve Cryptography. Electronics 2021, 10, 2698. [Google Scholar] [CrossRef]
  16. Sajid, A.; Rashid, M.; Imran, M.; Jafri, A. A Low-Complexity Edward-Curve Point Multiplication Architecture. Electronics 2021, 10, 1080. [Google Scholar] [CrossRef]
  17. Kumari, S.; Kumar, S. Efficient and Secure Elliptic Curve Cryptography for Financial Transaction Applications. Int. J. Comput. Sci. Inf. Secur. (IJCSIS) 2018, 16, 10–16. [Google Scholar]
  18. Li, H.; Li, X.; Wang, H. Efficient Implementations of Binary Edwards Curves for Cloud Computing. J. Comput. Sci. Technol. 2018, 33, 1229–1242. [Google Scholar]
  19. Ali, A.; Kamal, N. A Review of Binary Edwards Curves for Blockchain Applications. J. Inf. Secur. Appl. 2019, 47, 130–145. [Google Scholar]
  20. Singh, A.; Gutub, A.; Nayyar, A.; Khan, M.K. Redefining food safety traceability system through blockchain: Findings, challenges and open issues. Multimed. Tools Appl. 2023, 82, 21243–21277. [Google Scholar] [CrossRef]
21. Bernstein, D.J.; Lange, T.; Peters, C. Efficient Smart Card Implementation of Binary Edwards Curve Cryptography. J. Cryptogr. Eng. 2013, 3, 241–251. [Google Scholar]
  22. Krishnan, R.; Srinivasa, K.G. Elliptic Curve Cryptography Based Wireless Transaction Applications for Binary Edwards Curves. Wirel. Pers. Commun. 2017, 92, 1007–1016. [Google Scholar]
  23. Rashidi, B.; Abedini, M. Efficient Lightweight Hardware Structures of Point Multiplication on Binary Edwards Curves for Elliptic Curve Cryptosystems. J. Circuits Syst. Comput. 2019, 28, 1950140. [Google Scholar] [CrossRef]
  24. Chatterjee, A.; Gupta, I.S. FPGA implementation of extended reconfigurable binary Edwards curve based processor. In Proceedings of the 2012 International Conference on Computing, Networking and Communications (ICNC), Maui, HI, USA, 30 January–2 February 2012; pp. 211–215. [Google Scholar] [CrossRef]
  25. Lara-Nino, C.A.; Diaz-Perez, A.; Morales-Sandoval, M. Lightweight elliptic curve cryptography accelerator for internet of things applications. Ad Hoc Netw. 2020, 103, 102159. [Google Scholar] [CrossRef]
  26. Rashidi, B.; Farashahi, R.R.; Sayedi, S.M. High-Speed Hardware Implementations of Point Multiplication for Binary Edwards and Generalized Hessian Curves. Cryptology ePrint Archive Paper 2017/005. 2017. Available online: https://eprint.iacr.org/2017/005 (accessed on 11 January 2017).
  27. Salarifard, R.; Bayat-Sarmadi, S.; Mosanaei-Boorani, H. A Low-Latency and Low-Complexity Point-Multiplication in ECC. IEEE Trans. Circuits Syst. I Regul. Pap. 2018, 65, 2869–2877. [Google Scholar] [CrossRef]
  28. Choi, P.; Lee, M.; Kim, J.; Kim, D.K. Low-Complexity Elliptic Curve Cryptography Processor Based on Configurable Partial Modular Reduction Over NIST Prime Fields. IEEE Trans. Circuits Syst. II Express Briefs 2018, 65, 1703–1707. [Google Scholar] [CrossRef]
  29. Sajid, A.; Sonbul, O.S.; Rashid, M.; Zia, M.Y.I. A Hybrid Approach for Efficient and Secure Point Multiplication on Binary Edwards Curves. Appl. Sci. 2023, 13, 5799. [Google Scholar] [CrossRef]
  30. Edwards, H.M. A normal form for elliptic curves. Bull. Am. Math. Soc. 2007, 44, 393–422. [Google Scholar] [CrossRef] [Green Version]
  31. Bernstein, D.J.; Lange, T.; Naehrig, M.; Rosenthal, J. Binary Edwards Curves. In Cryptographic Hardware and Embedded Systems—CHES 2008: 10th International Workshop, Washington, DC, USA, 10–13 August 2008; Springer: Berlin/Heidelberg, Germany, 2008; pp. 389–405. [Google Scholar]
  32. Rashid, M.; Imran, M.; Kashif, M.; Sajid, A. An Optimized Architecture for Binary Huff Curves With Improved Security. IEEE Access 2021, 9, 88498–88511. [Google Scholar] [CrossRef]
  33. Imran, M.; Rashid, M.; Jafri, A.; Islam, N. ACryp-Proc: Flexible Asymmetric Crypto Processor for Point Multiplication. IEEE Access 2018, 6, 22778–22793. [Google Scholar] [CrossRef]
  34. Chang, C.H.; Liao, C.C.; Hsieh, W.H. High-Performance Montgomery Radix-4 Multiplier with Efficient Forward-Backward Algorithm. IEEE Access 2020, 8, 85854–85867. [Google Scholar]
  35. Tian, L.; Li, P.; Xu, Z.; Li, C.; Hu, X. Design and Optimization of a High-Performance Booth Encoder for Low-Power Multipliers. IEEE Access 2020, 8, 146228–146238. [Google Scholar]
  36. Kocher, P.C. Timing attacks on implementations of Diffie-Hellman, RSA, DSS, and other systems. In Advances in Cryptology—CRYPTO’96: 16th Annual International Cryptology Conference, Santa Barbara, CA, USA, 18–22 August 1996; Springer: Berlin/Heidelberg, Germany, 1996; pp. 104–113. [Google Scholar]
  37. Kocher, P.; Jaffe, J.; Jun, B.; Rohatgi, P. Introduction to differential power analysis. J. Cryptogr. Eng. 2011, 1, 5–27. [Google Scholar] [CrossRef] [Green Version]
  38. Agarwal, S.; Oser, P.; Lueders, S. Detecting IoT Devices and How They Put Large Heterogeneous Networks at Security Risk. Sensors 2019, 19, 4107. [Google Scholar] [CrossRef] [Green Version]
  39. Rashidi, B. Efficient hardware implementations of point multiplication for binary Edwards curves. Int. J. Circuit Theory Appl. 2018, 46, 1516–1533. [Google Scholar] [CrossRef]
Figure 1. Proposed BEC architecture.
Figure 2. Montgomery radix-4 multiplier.
Figure 3. Control unit of proposed architecture.
Table 1. PA and PD instructions.

| Instruction | Operation |
|---|---|
| Instr1 | A ← W1 × Z1 |
| Instr2 | B ← W1 × W2 |
| Instr3 | C ← Z1 × Z2 |
| Instr4 | Wd ← A × A |
| Instr5 | Zd ← (e1 × W1 + Z1)^4 |
| Instr6 | Za ← (e2 × B + C)^2 |
| Instr7 | Wa ← (B × C + w × Za)^2 |
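The Table 1 instruction sequence can be exercised in software. The sketch below is illustrative, not the paper's hardware datapath: it assumes the NIST reduction polynomial x^233 + x^74 + 1 for GF(2^233) (the paper states only the field size), and the curve constants e1, e2, and w are left as free parameters.

```python
# Software sketch (an assumption, not the paper's datapath) of the
# Table 1 PA/PD instructions over GF(2^233). Reduction polynomial
# x^233 + x^74 + 1 is the NIST B-233 choice, assumed here.

M = 233
R_POLY = (1 << 233) | (1 << 74) | 1  # x^233 + x^74 + 1

def gf_mul(a, b):
    """Carry-less (polynomial) multiply, then reduce mod R_POLY."""
    acc = 0
    while b:
        if b & 1:
            acc ^= a
        a <<= 1
        b >>= 1
    # Fold the (up to 2m-1 bit) product back into m bits, top down.
    for i in range(acc.bit_length() - 1, M - 1, -1):
        if (acc >> i) & 1:
            acc ^= R_POLY << (i - M)
    return acc

def gf_sq(a):
    return gf_mul(a, a)

def pa_pd(W1, Z1, W2, Z2, e1, e2, w):
    """Instr1..Instr7 of Table 1 (addition in GF(2^m) is XOR)."""
    A = gf_mul(W1, Z1)                          # Instr1
    B = gf_mul(W1, W2)                          # Instr2
    C = gf_mul(Z1, Z2)                          # Instr3
    Wd = gf_mul(A, A)                           # Instr4
    Zd = gf_sq(gf_sq(gf_mul(e1, W1) ^ Z1))      # Instr5: (e1*W1 + Z1)^4
    Za = gf_sq(gf_mul(e2, B) ^ C)               # Instr6: (e2*B + C)^2
    Wa = gf_sq(gf_mul(B, C) ^ gf_mul(w, Za))    # Instr7, as written in Table 1
    return Wd, Zd, Wa, Za
```

The fourth power in Instr5 is computed as two squarings, matching the stepwise breakdown the scheduling table later uses.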
Table 2. Booth encoding.

| X_p(i+1 : i−1) | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 |
|---|---|---|---|---|---|---|---|---|
| PP_i1 | 0 | Y | Y | 2Y | −2Y | −Y | −Y | 0 |
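Table 2 is the standard radix-4 (modified) Booth recoding: three overlapping bits of the multiplier are scanned per step, and one partial product is selected from {0, ±Y, ±2Y}. A minimal software sketch, assuming the conventional digit assignment:

```python
# Radix-4 Booth recoding sketch (standard scheme; not taken verbatim
# from the paper's hardware). Digit set {0, +-1, +-2} times Y.

BOOTH_DIGIT = {  # bits (x[i+1], x[i], x[i-1]) -> multiple of Y
    0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
    0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0,
}

def booth_radix4_digits(x, nbits):
    """Recode non-negative x into radix-4 Booth digits, LSB first."""
    if nbits % 2:
        nbits += 1  # scan an even number of bits
    digits = []
    prev = 0  # implicit x[-1] = 0
    for i in range(0, nbits, 2):
        triple = (((x >> (i + 1)) & 1) << 2) | (((x >> i) & 1) << 1) | prev
        digits.append(BOOTH_DIGIT[triple])
        prev = (x >> (i + 1)) & 1
    return digits

def booth_multiply(x, y):
    """Multiply via the recoded digits: sum of d_j * y * 4^j."""
    nbits = x.bit_length() + 2  # keep the top recoded digit non-negative
    return sum(d * y * 4 ** j
               for j, d in enumerate(booth_radix4_digits(x, nbits)))
```

The recoding halves the number of partial products compared with a bit-serial scan, which is why the hardware pairs it with the radix-4 Montgomery datapath.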
Table 3. Selection of prime multiples.

| Adder_A[1:0] | 00 | 01 | 10 | 11 |
|---|---|---|---|---|
| PP_i2 | 0 | P | 2P | 3P |
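Table 3 supplies the modulus multiples {0, P, 2P, 3P} that a radix-4 Montgomery iteration adds so the two low bits of the partial sum clear before the right shift. A hedged integer-arithmetic sketch of one possible radix-4 Montgomery loop (the variable names and quotient-digit computation are illustrative, not taken from the paper):

```python
# Radix-4 Montgomery multiplication sketch: each iteration consumes
# two bits of x and adds one of the precomputed multiples {0, P, 2P, 3P}
# (Table 3) chosen so that the partial sum becomes divisible by 4.

def mont_mul_radix4(x, y, p, nbits):
    """Return x*y*R^(-1) mod p, R = 4^(nbits//2 rounded up), p odd, y < p."""
    assert p % 2 == 1 and 0 <= y < p
    steps = (nbits + 1) // 2
    p_inv4 = pow(p, -1, 4)            # p^(-1) mod 4, precomputed once
    multiples = [0, p, 2 * p, 3 * p]  # the Table 3 operands
    s = 0
    for i in range(steps):
        digit = (x >> (2 * i)) & 0b11  # radix-4 digit of x
        s += digit * y
        q = (-s * p_inv4) % 4          # quotient digit: clear two low bits
        s += multiples[q]
        s >>= 2                        # s is now divisible by 4
    return s if s < p else s - p       # one conditional subtraction
```

The invariant s < y + p holds across iterations, so a single final subtraction suffices; in hardware the q lookup and the multiple selection of Table 3 replace the modular division.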
Table 4. Information about clock cycles.

| Parameter | Without Pipelining | With 2-Stage Pipelining |
|---|---|---|
| Initial states | 6 | 12 |
| Length of key | 233 | 233 |
| 13 × (m − 1) | 3016 | - |
| 16 × (m − 1) | - | 3712 |
| Inversion states | 638 | 1276 |
| Total clock cycles | 3660 | 4994 |
Table 5. Scheduling of BEC operations.

(a) Two-stage pipelining with the original formulas (without scheduling):

| CC | Instruction | Operation | R | E | WB | RAW |
|---|---|---|---|---|---|---|
| 1 | Instr1 | A = W1 × Z1 | R[I1] | | | |
| 2 | Instr2 | B = W1 × W2 | R[I2] | E[I1] | WB[I1] | |
| 3 | Instr3 | C = Z1 × Z2 | R[I3] | E[I2] | WB[I2] | |
| 4 | Instr4 | Wd = A × A | R[I4] | E[I3] | WB[I3] | |
| 5 | Instr5 | T1 = e1 × W1 | R[I5] | E[I4] | WB[I4] | |
| 6 | Instr6 | T2 = T1 + Z1 | | E[I5] | WB[I5] | T1 |
| 7 | Instr7 | T3 = T2 × T2 | R[I6] | | | T2 |
| 8 | Instr8 | Zd = T3 × T3 | | E[I6] | WB[I6] | T3 |
| 9 | Instr9 | T1 = e2 × B | R[I7] | | | |
| 10 | Instr10 | T2 = T1 + C | R[I8] | E[I7] | WB[I7] | T1 |
| 11 | Instr11 | Za = T2 × T2 | R[I9] | E[I8] | WB[I8] | T2 |
| 12 | Instr12 | T2 = B × C | | E[I9] | WB[I9] | |
| 13 | Instr13 | T3 = w × Za | R[I10] | | | T3 |
| 14 | Instr14 | Wa = T3 + T2 | | E[I10] | WB[I10] | |
| 15 | | | R[I11] | | | |
| 16 | | | R[I12] | E[I11] | WB[I11] | |
| 17 | | | R[I13] | E[I12] | WB[I12] | |
| 18 | | | | E[I13] | | |
| 19 | | | R[I14] | | | |
| 20 | | | | E[I14] | WB[I14] | |

(b) Two-stage pipelining with the proposed formulas (with scheduling):

| CC | Operation | RAW | R | E | WB |
|---|---|---|---|---|---|
| 1 | T1 = W1 × Z1 | | R[I1] | | |
| 2 | T2 = W1 × W2 | | R[I2] | E[I1] | WB[I1] |
| 3 | T3 = Z1 × Z2 | | R[I3] | E[I2] | WB[I2] |
| 4 | T4 = T1 × T1 | | R[I4] | E[I3] | WB[I3] |
| 5 | W0 = e1 × W1 | | R[I5] | E[I4] | WB[I4] |
| 6 | Z0 = W0 + Z1 | | R[I6] | E[I5] | WB[I5] |
| 7 | combined | | R[I7] | E[I6] | WB[I6] |
| 8 | T1 = (Z0 × Z0)^2 | | R[I8] | E[I7] | WB[I7] |
| 9 | W0 = e2 × T2 | | R[I9] | E[I8] | WB[I8] |
| 10 | Z0 = W0 + T3 | | R[I10] | E[I9] | WB[I9] |
| 11 | W0 = Z0 × Z0 | | R[I11] | E[I10] | WB[I10] |
| 12 | Z0 = T2 × T3 | | R[I12] | E[I11] | WB[I11] |
| 13 | T2 = w × W0 | | R[I13] | E[I12] | WB[I12] |
| 14 | T3 = Z0 + T2 | T2 | | E[I13] | WB[I13] |
| 15 | | | R[I14] | | |
| 16 | | | | E[I14] | WB[I14] |
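The RAW markers in the left half of Table 5 follow from a one-instruction hazard window: an instruction must stall if it reads a register written by its immediate predecessor, whose result has not yet reached write-back in the two-stage pipeline. A small checker over a transcription of the original instruction stream (register names per Table 5) reproduces the six flagged registers:

```python
# Detecting read-after-write hazards in the original (unscheduled)
# instruction stream of Table 5. A hazard occurs when an instruction
# reads the register written by the instruction directly before it.

def raw_hazards(instrs):
    """instrs: list of (dest, (src, ...)); return registers causing RAW."""
    hazards = []
    for prev, curr in zip(instrs, instrs[1:]):
        if prev[0] in curr[1]:
            hazards.append(prev[0])
    return hazards

# Transcription of the original-formula sequence (left half of Table 5).
original = [
    ("A",  ("W1", "Z1")), ("B",  ("W1", "W2")), ("C",  ("Z1", "Z2")),
    ("Wd", ("A", "A")),   ("T1", ("e1", "W1")), ("T2", ("T1", "Z1")),
    ("T3", ("T2", "T2")), ("Zd", ("T3", "T3")), ("T1", ("e2", "B")),
    ("T2", ("T1", "C")),  ("Za", ("T2", "T2")), ("T2", ("B", "C")),
    ("T3", ("w", "Za")),  ("Wa", ("T3", "T2")),
]
```

Running the checker flags T1, T2, and T3 twice each, i.e., six stalls; the rescheduled stream in the right half of the table retains only a single flagged hazard (T2), which is why it completes in 16 rather than 20 clock cycles.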
Table 6. State-of-the-art methods comparison.

| Design | Platform | Freq. (MHz) | Slices | LUTs | Time (µs) | Throughput/Slices |
|---|---|---|---|---|---|---|
| **Virtex-4 results** | | | | | | |
| GBEC: d = 59 [16] | Virtex-4 | 127.261 | 17,158 | 2663 | 25.5 | 2.28 |
| BEC [24] | Virtex-4 | 48 | 21,816 | 35,003 | - | - |
| BEC halving [24] | Virtex-4 | 48 | 22,373 | 42,596 | - | - |
| GBEC: d = 59, 3M [25] | Virtex-4 | 255.570 | 29,255 | - | 14.83 | 2.38 |
| GBEC: d = 59, 1M [25] | Virtex-4 | 257.535 | 12,403 | - | 32.81 | 2.45 |
| GBEC: d = 59 [29] | Virtex-4 | 195.508 | 3302 | 2723 | 32.1 | 9.43 |
| BEC: d = 59 [38] | Virtex-4 | 277.681 | 31,702 | - | 13.39 | 2.35 |
| **Virtex-5 results** | | | | | | |
| GBEC: 3M, d1 = d2 = 59 [23] | Virtex-5 | - | 4581 | - | 51.46 | 4.24 |
| BEC: d1 = d2 = 1 [25] | Virtex-5 | 205.1 | 1397 | 4340 | 4560 | 0.1569 |
| GBEC: d = 59 [29] | Virtex-5 | 245.669 | 2714 | 2502 | 25.59 | 14.39 |
| GBEC: d = 59, 3M [39] | Virtex-5 | 337.603 | 9233 | - | 11.22 | 9.67 |
| GBEC: d = 59, 1M [39] | Virtex-5 | 333.603 | 4019 | - | 25.03 | 9.94 |
| **Virtex-6 results** | | | | | | |
| GBEC: d = 59 [16] | Virtex-6 | 186.506 | 2664 | 22,256 | 17.39 | 21.5 |
| BEC: d1 = d2 = 1 [25] | Virtex-6 | 107 | 1245 | 3878 | 6720 | 0.119 |
| GBEC: d = 59 [29] | Virtex-6 | 290.92 | 1770 | 3597 | 21.61 | 26.14 |
| **Virtex-7 results** | | | | | | |
| GBEC: d = 26 [16] | Virtex-7 | 179.81 | 2662 | 24,533 | 18.04 | 20.82 |
| GBEC: d = 59 [29] | Virtex-7 | 320.584 | 1771 | 4470 | 19.61 | 28.79 |
| **Our work, without pipelining** | | | | | | |
| GBEC: d = 26 | Virtex-4 | 92 | 3002 | 2223 | 39.78 | 8.37 |
| GBEC: d = 26 | Virtex-5 | 143.119 | 2314 | 2408 | 25.57 | 16.9 |
| GBEC: d = 26 | Virtex-6 | 188.223 | 1570 | 2897 | 19.445 | 32.76 |
| GBEC: d = 26 | Virtex-7 | 202.244 | 1572 | 3070 | 18.097 | 35.15 |
| **Our work, two-stage pipelining** | | | | | | |
| GBEC: d = 26 | Virtex-4 | 199.2 | 3246 | 2846 | 25.07 | 12.2 |
| GBEC: d = 26 | Virtex-5 | 250 | 2624 | 2488 | 19.976 | 19.07 |
| GBEC: d = 26 | Virtex-6 | 294.92 | 1640 | 3326 | 16.93 | 36.01 |
| GBEC: d = 26 | Virtex-7 | 314.584 | 1641 | 3876 | 15.87 | 38.39 |
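The Throughput/Slices column of Table 6 is throughput (point multiplications per second, i.e., the reciprocal of latency) divided by slice count; latency itself is the cycle count of Table 4 divided by the clock frequency (e.g., 4994 cycles / 314.584 MHz ≈ 15.87 µs on Virtex-7). The last four rows can be reproduced as follows:

```python
# Reproducing the Throughput/Slices figures for the proposed two-stage
# pipelined design (last four rows of Table 6).

def throughput_per_slice(slices, time_us):
    """Point multiplications per second, per slice."""
    return 1.0 / (time_us * 1e-6) / slices

results = {
    "Virtex-4": throughput_per_slice(3246, 25.07),
    "Virtex-5": throughput_per_slice(2624, 19.976),
    "Virtex-6": throughput_per_slice(1640, 16.93),
    "Virtex-7": throughput_per_slice(1641, 15.87),
}
```

This metric rewards designs that are simultaneously fast and compact, which is why the pipelined Virtex-7 implementation (38.39) leads the table despite not having the lowest absolute slice count.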
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Sajid, A.; Sonbul, O.S.; Rashid, M.; Jafri, A.R.; Arif, M.; Zia, M.Y.I. A Crypto Accelerator of Binary Edward Curves for Securing Low-Resource Embedded Devices. Appl. Sci. 2023, 13, 8633. https://doi.org/10.3390/app13158633
