Compact modular multiplier design for strong security capabilities in resource-limited Telehealth IoT devices

Telehealth is an emerging model of delivering quality health care to remote communities and stay-at-home users. This is motivated by rising health care costs and by the benefits of many patients staying at home as opposed to extended hospital stays. Telehealth relies on IoT technology


Introduction and Related Work
The Internet of Things (IoT) finds a ready application in delivering quality healthcare to remote communities and stay-at-home patients. Moving extended-stay patients to their homes benefits patients and their families and reduces the escalating costs of healthcare (Belcher et al., 2021; Dykgraaf et al., 2021; Granjal et al., 2015; Atzori et al., 2010). However, IoT devices are considered the weakest link in any system that uses them. This is due to their limited computing and energy resources, coupled with the fact that most of these devices are heterogeneous and seldom undergo password changes or operating system updates. A practical approach to securing these IoT devices, especially for telehealth, is to use physically unclonable functions (PUFs) to facilitate authentication and secure key establishment/exchange (Fakroon et al., 2021).
IoT operations and communications can be secured using elliptic curve cryptography (ECC) in preference to more traditional, and computationally expensive, approaches such as RSA (Rivest et al., 1978; Lidl and Niederreiter, 1994). ECC offers security comparable to RSA with shorter key sizes (Di Matteo et al., 2021). An essential step in implementing ECC is providing efficient modular multiplication in the binary extension field $GF(2^m)$, since it is the core operation for field arithmetic, including modular exponentiation, modular squaring, and multiplicative inversion (Chiou et al., 2006; Kim and Jeon, 2014; Choi and Lee, 2015; Kim and Kim, 2018).
Modular multipliers can be implemented in serial or parallel form, depending on the intended application. A parallel implementation produces all output bits in a single clock cycle, giving high throughput at the expense of many hardware resources. A serial implementation, on the other hand, targets low-space applications at the expense of increasing the computation latency to $m$ clock cycles. As we target resource-limited IoT applications, we concentrate on the serial implementation of the adopted modular multiplier algorithm. The multiplier can be implemented in a bit-serial or a word-serial fashion. Word-serial implementation achieves better area and time complexity than bit-serial implementation, making it more efficient for resource-constrained IoT devices.

https://doi.org/10.1016/j.jksuci.2022.06.009 1319-1578/© 2022 The Author(s). Published by Elsevier B.V. on behalf of King Saud University. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
In this work, we propose a two-dimensional (2-D) word-based serial-in serial-out (SISO) modular multiplier processor. The explored structure has the regularity, modularity, concurrency, and local interconnections characteristic of systolic structures, making it well suited to VLSI implementation. Using the formal mapping technique described in (Gebali, 2011; Ibrahim et al., 2018; Ibrahim et al., 2016; Ibrahim, 2019; Gebali and Ibrahim, 2016), the system designer can control the area and power consumption of the explored structure to fit IoT devices. Applying a non-linear scheduling function allows the system designer to control the workload of the resulting processor array and of each processing element. The latency of the algorithm is also controlled through non-linear task scheduling. The experimental results show that the proposed multiplier structure achieves significant savings in area and energy consumption, making it more suitable for resource-limited IoT devices.
The paper is organized as follows: Section 2 provides an overview of the adopted modular interleaved multiplication algorithm over the binary extension field $GF(2^m)$. Section 3 explains the systematic approach used to explore the 2-D word-based SISO processor. Section 4 presents the experimental results and an analysis of the developed word-based multiplier structure and the competing word-based structures previously reported in the literature. Lastly, the conclusion of this work is provided in Section 5.

Modular Interleaved Multiplication Algorithm
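Assume an irreducible polynomial $H(\theta)$ of order $m$ defining the binary extension field $GF(2^m)$. Also, assume two polynomials $X(\theta)$ and $Y(\theta)$ in this field. We can write:

$$H(\theta) = \sum_{j=0}^{m} h_j \theta^j \tag{1}$$

$$X(\theta) = \sum_{j=0}^{m-1} x_j \theta^j \tag{2}$$

$$Y(\theta) = \sum_{j=0}^{m-1} y_j \theta^j \tag{3}$$

where $h_j, x_j, y_j \in GF(2)$. We can perform modular multiplication by taking the product of $X(\theta)$ and $Y(\theta)$ and reducing the result using $H(\theta)$. The following mathematical expressions show the interleaved method used to perform this operation:

$$Z(\theta) = X(\theta)\,Y(\theta) \bmod H(\theta) = \sum_{i=0}^{m-1} y_i X^i(\theta) \tag{4}$$

$$X^{i+1}(\theta) = \theta X^i(\theta) \bmod H(\theta) \tag{5}$$

with initial value $X^0(\theta) = X(\theta)$. Eq. (5) can be represented in the bit-level form at iteration $i$ as follows:

$$x_j^{i+1} = x_{j-1}^{i} + x_{m-1}^{i} h_j \tag{6}$$

with $x_{-1}^{i} = 0$ for $0 \le i \le m-1$. Also, Eq. (4) can be represented in the recursive form as:

$$Z^{i+1}(\theta) = Z^i(\theta) + y_i X^i(\theta) \tag{7}$$

with $Z^0 = 0$. Also, Eq. (7) can be expressed in the bit-level form at iteration $i$ as follows (Choi and Lee, 2015):

$$z_j^{i+1} = z_j^{i} + y_i x_j^{i} \tag{8}$$

with $0 \le i \le m-1$, $0 \le j \le m-1$, and $z_j^0 = 0$ for $0 \le j \le m-1$. The algorithmic form of these expressions is given in Algorithms 1 and 2. Algorithm 2 is the bit-level form of Algorithm 1.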

Algorithm 1: Interleaved Modular Multiplication Algorithm
Algorithm 2: Interleaved Modular Multiplication Algorithm in the bit-level form.
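The bit-level interleaved recurrences can be sketched in software as follows. This is a minimal illustration, not the authors' listing: it assumes the LSB-first recurrences $z_j^{i+1} = z_j^i + y_i x_j^i$ and $x_j^{i+1} = x_{j-1}^i + x_{m-1}^i h_j$ with $x_{-1}^i = 0$ and $z_j^0 = 0$, and the function names (`interleaved_mul_bits`, `to_bits`, `from_bits`, `gf_mul`) and the integer encoding of polynomials are hypothetical conveniences.

```python
def interleaved_mul_bits(x_bits, y_bits, h_bits):
    """Bit-level interleaved multiplication over GF(2^m) (assumed LSB-first form).

    x_bits, y_bits: lists of m bits; index j holds the coefficient of theta^j.
    h_bits: coefficients h_0..h_{m-1} of the irreducible polynomial
            (the leading coefficient h_m = 1 is implicit).
    """
    m = len(x_bits)
    x = list(x_bits)          # X^0 = X
    z = [0] * m               # Z^0 = 0
    for i in range(m):
        # z_j^{i+1} = z_j^i XOR (y_i AND x_j^i)
        z = [z[j] ^ (y_bits[i] & x[j]) for j in range(m)]
        # x_j^{i+1} = x_{j-1}^i XOR (x_{m-1}^i AND h_j), with x_{-1}^i = 0
        msb = x[m - 1]
        x = [(x[j - 1] if j > 0 else 0) ^ (msb & h_bits[j]) for j in range(m)]
    return z


def to_bits(value, m):
    """Integer -> list of m bits, least significant first."""
    return [(value >> j) & 1 for j in range(m)]


def from_bits(bits):
    """List of bits (least significant first) -> integer."""
    return sum(b << j for j, b in enumerate(bits))


def gf_mul(a, b, poly, m):
    """Multiply a and b in GF(2^m); poly encodes H including its leading bit."""
    h_bits = to_bits(poly, m)  # keeps h_0..h_{m-1}, drops the leading h_m
    return from_bits(interleaved_mul_bits(to_bits(a, m), to_bits(b, m), h_bits))


if __name__ == "__main__":
    # GF(2^3) with H = theta^3 + theta + 1: theta * theta^2 = theta + 1
    print(gf_mul(2, 4, 0b1011, 3))
```

Each pass of the loop corresponds to one row of the iteration space indexed by $i$, and each list position to one column indexed by $j$, which is the same two-index structure used to build the dependence graph.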

Exploration of the word-based 2D SISO Multiplier
In this section, we will follow a systematic methodology, previously reported by the second author in (Gebali, 2011), to extract the word-based 2D SISO multiplier structure. The approach starts by extracting the data dependency graph (DG) of the adopted multiplier algorithm. Then, non-linear scheduling and projection functions are applied to nodes of the DG to extract the systolic multiplier processor. The details of this approach are given in the following subsections.

Algorithm Dependence Graph
The data dependence graph (DG) associated with the modular multiplication algorithm is obtained from the iterations defining the modular multiplication in Eqs. (6) and (8), following the guidelines set forth in (Gebali, 2011). The two equations use two iteration indices $i$ and $j$ to define a 2-D integer domain $D \subset \mathbb{Z}^2$. The details of the DG for the case $m = 5$ are shown in Fig. 1. The operations in Eqs. (6) and (8)  are obtained at column $m - 1$ and broadcast horizontally as shown in Fig. 1. Input signals $z_j^0$, $x_j^0$, and $h_j$ are supplied at the top row, and the output signals $z_j^m$ are obtained from the bottom row.

Node Scheduling and Projection Functions
The DG of Fig. 1 can be used for design space exploration by choosing node scheduling and projection functions based on the approach discussed in (Gebali, 2011).
We will not use the linear scheduling and projection functions since they provide few options in choosing the resulting processor array area, latency, processing elements workload, or overall system workload. In this work, we use the non-linear node scheduling and projection techniques. This choice affords a rich set of design options to optimize the resulting systolic array area, latency, processing elements workload, and overall system workload.
Our goal is to design a SISO multiplier in which the input polynomials $X$, $Y$, and $H$ are supplied in word-serial fashion. The resulting reduced polynomial $Z$ is also obtained in a word-serial manner. Assume the goal of the system designer is to process $k$ bits of each polynomial simultaneously and obtain $k$ bits of the output polynomial. The following parts explain the steps to be used by the system designer.

Non-Linear Task Scheduling
The non-linear scheduling technique in Gebali (2011) is used to partition domain $D$ into $k \times k$ equitemporal zones or clusters. The choice of $k$ allows the system designer to control the number of bits of the input or output polynomials to be processed simultaneously. This indirectly affects the system area, speed, and latency.
We choose a non-linear scheduling function $l(p)$ to assign a time value to each node $p$ of the DG, where $l(p)$ is the time assigned to node $p$, $0 \le i < m + \lambda$, and $-\lambda \le j < m - 1$, with $\lambda$ given by:

$$\lambda = k \lceil m/k \rceil - m \tag{10}$$

$\lambda$ represents the number of columns and rows that must be added to the DG in order to make their number an integer multiple of $k$. For the case shown in Fig. 2, where $m = 5$ and $k = 2$, we have $\lambda = 1$, which implies adding one column on the left and one row at the bottom. The blue boxes highlight the equitemporal zones (the clusters of nodes having the same time values). Fig. 3 shows the scheduling time for the nodes of the DG when $m = 5$ and $k = 4$. In this case, we have $\lambda = 3$, which implies adding three columns on the left and three rows at the bottom.

Examination of Fig. 2 or Fig. 3 reveals that any block $l$ receives two inputs from the north and west directions and produces two outputs at the south and east directions. The times associated with these inputs and outputs are summarized in Table 1. We note that the inputs at the top row produce the outputs of the right column. Similarly, the inputs at the left column produce the outputs at the bottom row. Therefore, the number of iterations for the modular multiplication should be given by:

$$\lceil m/k \rceil^2 + 1 \tag{11}$$

This means that the first product of the proposed multiplier will be available on the output bus after $\lceil m/k \rceil^2 + 1$ clock cycles. Through each subsequent clock cycle, we will have another product. Therefore, the throughput should be $1/(\lceil m/k \rceil^2 + 1)$.
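The padding and cycle-count arithmetic above can be checked with a short script. This is a sketch under stated assumptions: the padding is computed as $k\lceil m/k \rceil - m$, which reproduces the two quoted cases ($m = 5$ with $k = 2$ and $k = 4$), and the helper names are hypothetical.

```python
from fractions import Fraction


def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def padding(m, k):
    """Rows/columns added so the DG dimensions become a multiple of k (assumed formula)."""
    return k * ceil_div(m, k) - m


def first_product_cycles(m, k):
    """Clock cycles until the first product is available: ceil(m/k)^2 + 1."""
    return ceil_div(m, k) ** 2 + 1


def throughput(m, k):
    """Products per clock cycle, as an exact fraction."""
    return Fraction(1, first_product_cycles(m, k))


if __name__ == "__main__":
    for m, k in [(5, 2), (5, 4), (409, 8), (409, 16), (409, 32)]:
        print(m, k, padding(m, k), first_product_cycles(m, k), throughput(m, k))
```

For the field size $m = 409$ used in the experiments, the cycle count falls quickly as $k$ grows, since it depends on the square of $\lceil m/k \rceil$.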

Non-Linear Task Projection
Figs. 2 and 3 indicate that the nodes in each $k \times k$ equitemporal zone execute at the same time. This observation, together with the projection technique explained in Gebali (2011), yields a non-linear task projection function that maps the $k \times k$ node clusters to a single processor array. The processor array consists of a two-dimensional array of $k \times k$ processing elements (PEs). The whole system is shown in Fig. 4. Notice that we used two input registers, X and XL, to store input $X$. Register X stores $k$ bit values starting from bit $x_{m-1}^0$, while register XL stores $k$ bit values starting from bit $x_{m-2}^0$. The intermediate words of $Z$, $XL$, and $X$ are pipelined through the shift-right registers SHR-Z, SHR-XL, and SHR-X, respectively.

The content of $H$ is pipelined through the rotate-right register ROR-H. The registers are based on $k$-bit words and the size is . Fig. 4 illustrates the word update at the bottom outputs of the processor array.

Fig. 1. DG of the adopted multiplication algorithm for $m = 5$.

A. Ibrahim and F. Gebali Journal of King Saud University - Computer and Information Sciences xxx (xxxx) xxx
The details of each PE are shown in Fig. 5 for the case $m = 5$ and $k = 4$. Two tri-state buffers are used to select between signals $x_{m-1}^i$ and $x_d$. Control signal $t$ is activated ($t = 1$) at time instances $l = q\lceil m/k \rceil + 1$, $0 \le q < \lceil m/k \rceil$, enabling the tri-state buffers Tr1 to pass the $x_{m-1}^i$ signals. Signals $x_{m-1}^i$ and the input bits $y_i$ are broadcast to the processors to compute the intermediate results $Z$, $X$, and $XL$. When signal $t$ is deactivated ($t = 0$), the tri-state buffers Tr2 are enabled to pass the intermediate $x_d$ signals, as shown in Fig. 5. At time instances $l = q\lceil m/k \rceil$, $1 \le q < \lceil m/k \rceil$, control signal $n$ is deactivated ($n = 0$) to force zero values of the $XL$ signals, as shown at the left of Fig. 3. The logic circuit of the PE is shown in Fig. 6. It contains two AND gates and two XOR gates.
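A behavioral sketch of the PE logic may help make the gate count concrete. It assumes that the two AND gates and two XOR gates implement the bit-level recurrences of Eqs. (6) and (8), namely $x_j^{i+1} = x_{j-1}^i + x_{m-1}^i h_j$ and $z_j^{i+1} = z_j^i + y_i x_j^i$. The exact wiring of Fig. 6, the tri-state buffers, and the control signals $t$ and $n$ are not modeled, and the names `pe_cell` and `row_step` are hypothetical.

```python
def pe_cell(z_in, x_in, x_west, x_msb, h_j, y_i):
    """One processing element: two AND gates and two XOR gates (assumed wiring).

    z_out = z_in XOR (y_i AND x_in)       -> partial-product accumulation
    x_out = x_west XOR (x_msb AND h_j)    -> shift-and-reduce of X
    """
    z_out = z_in ^ (y_i & x_in)
    x_out = x_west ^ (x_msb & h_j)
    return z_out, x_out


def row_step(z, x, h, y_i):
    """Apply one iteration across a row of m cells (bit lists, LSB first)."""
    m = len(x)
    x_msb = x[m - 1]  # broadcast value taken from column m - 1
    outs = [
        pe_cell(z[j], x[j], x[j - 1] if j > 0 else 0, x_msb, h[j], y_i)
        for j in range(m)
    ]
    return [o[0] for o in outs], [o[1] for o in outs]


if __name__ == "__main__":
    # One step in GF(2^3) with h = [1, 1, 0] (theta^3 + theta + 1)
    print(row_step([0, 0, 0], [1, 1, 0], [1, 1, 0], 1))
```

In this model the west input carries the neighboring bit of $X$, while the broadcast inputs correspond to the horizontally distributed $x_{m-1}^i$ and $y_i$ signals.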
Table 1
Input and output timing for block $l$ in Fig. 2 or Fig. 3.

We can describe the operation details of the 2-D SISO multiplier for general values of $m$ and $k$ as follows:
1. At the first time instance, $l = 1$, the controller activates the select signal (S) of all MUXes, depicted in Fig. 4, to allow the $k$ most significant bits of $X$, $XL$, and $H$ to be input to the processor array shown in Fig. 4. To ensure that $Z$ has a zero initial value, as described in Algorithm 1, the controller resets the shift-right register SHR-Z at the first time instance. Also at this time instance, the controller activates the control signal $t$ ($t = 1$) to enable the tri-state buffers Tr1 in Fig. 5 to move the bits $x_{m-1}^i$, $1 \le i \le k$, to the remaining PEs in the same row of the processor array. We also notice that the least significant $k$ bits of variable $Y$ are broadcast horizontally, at the first time instance, to the PEs of the processor array.
2. At times $1 < l \le \lceil m/k \rceil$, the controller still activates the select signal S of all MUXes to enable the remaining words of inputs $X$, $XL$, and $H$ to supply the processor array inputs. These words, together with $X_{m-1}^i$, $Y_i$, $1 \le i \le k$, are used to calculate in sequence the partial words of $Z$, $X$, and $XL$. These words are pipelined through the shift-right registers SHR-Z, SHR-X, and SHR-XL, respectively, as displayed in Fig. 4. Also, the fixed words of $H$ are pipelined through the rotate-right register ROR-H. It is worth noting that the depth of the shift-right register SHR-Z keeps the initial values of $Z$ at zero during these time instances. Also, the updated bits $x_{m-1}^i$ are pipelined through the shift-right register SHR-Xm of depth $d$ during these time instances.
3. At times $l > \lceil m/k \rceil$, the controller deactivates all MUXes ($S = 0$) to pass in sequence the resulting partial words of $Z$, $X$, and $XL$ kept in the shift-right registers SHR-Z, SHR-X, and SHR-XL, respectively, and the fixed $H$ words kept in the rotate-right register ROR-H to the processor array. Also, at the same time instances, the updated bits $X_{m-1}^i$ kept in the SHR-Xm register are passed to the processor array block. These bits are used alongside the bits of $Y_{m-1}^i$, $qk < i \le q(k+1)$, $1 \le q \le \lceil m/k \rceil - 1$, to compute in sequence the partial words of $Z$, $X$, and $XL$.
4. During times $l = q\lceil m/k \rceil + 1$, $1 \le q \le \lceil m/k \rceil$, the controller resets the control signal $n$ in Fig. 5 to force zero values of $XL$ in Fig. 3. Signal $n$ is set to 1 for the remaining times.
5. During times $l \ge \lceil m/k \rceil \lfloor (m+k)/k \rfloor$, the controller activates the load signal of the register Z, Fig. 4, to pass in sequence the resulting output words of $Z$.
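The controller timing in the five steps above can be tabulated with a small script. This is a sketch that follows the time ranges exactly as listed (S active for $l \le \lceil m/k \rceil$; $t = 1$ at $l = q\lceil m/k \rceil + 1$ for $0 \le q < \lceil m/k \rceil$; $n = 0$ at $l = q\lceil m/k \rceil + 1$ for $1 \le q \le \lceil m/k \rceil$; load active for $l \ge \lceil m/k \rceil \lfloor (m+k)/k \rfloor$). The function name `control_schedule`, the dictionary layout, and the total run length of $\lceil m/k \rceil^2 + 1$ steps are assumptions made for illustration.

```python
def control_schedule(m, k):
    """Return one dict per time step l with the assumed control-signal values."""
    w = -(-m // k)                        # ceil(m / k): words per polynomial
    total_steps = w * w + 1               # cycles until the product is complete
    load_from = w * ((m + k) // k)        # first step at which Z is loaded out
    t_times = {q * w + 1 for q in range(w)}             # t = 1 (Tr1 enabled)
    n_low_times = {q * w + 1 for q in range(1, w + 1)}  # n = 0 (XL forced to 0)
    rows = []
    for l in range(1, total_steps + 1):
        rows.append({
            "l": l,
            "S": int(l <= w),             # MUX select: external input words
            "t": int(l in t_times),
            "n": int(l not in n_low_times),
            "load_Z": int(l >= load_from),
        })
    return rows


if __name__ == "__main__":
    for row in control_schedule(5, 2):
        print(row)
```

For $m = 5$ and $k = 2$ this yields ten time steps, with the input words entering during the first three and the output words of $Z$ loaded during the last two.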
We added delay elements (D flip-flop blocks) to the processor array, as shown in Fig. 5, to ensure that there is always a one-time-step difference between the words of $Z$, $X$, and $H$ above and below the delay elements. These elements synchronize the operation of the processor array by delaying the words of $Z$, $X$, and $H$ by one time instance so that they arrive at the same time as the resulting $x_d$ bits. We notice from Fig. 3 that the $x_d$ bits are generated starting from the second time instance, which results in the extra 1 term in Eq. (11).

Experimental Results and Discussion
We estimated the area and delay complexities of the proposed 2-D word-based multiplier structure and the efficient word-based ones in the literature (Pan et al., 2013; Xie et al., 2015; Hua et al., 2013; Chen et al., 2008). The area estimation is based on counting the number of basic logic gates and components (AND gates, tri-state buffers, XOR gates, flip-flops (FFs), and MUXes) of the compared multiplier structures. In this work, we define latency as the number of clock cycles required to complete the multiplication operation. We also define critical path delay (CPD) as the total delay of the basic gates/components along the longest path of the multiplier logic circuit. Table 2 shows the estimated area and time results of the examined multiplier structures. We can interpret the symbols used in Table 2 as follows:
1. $k$ designates the word size of the multiplier structures.
2. $f_A$ designates the delay of the basic 2-input AND gate.
3. $f_X$ designates the delay of the basic 2-input XOR gate.
4. $f_{MUX}$ designates the delay of the 2-to-1 MUX.
5. … $) + k + 3$ represents the total number of FFs used in the multiplier structure of Pan et al. (2013).
6. … $) + 4k + 1$ represents the total number of FFs used in the multiplier structure of Hua et al. (2013).
7. … $) + 2k$ represents the total number of FFs used in the multiplier structure of Chen et al. (2008).
8. $b_1 = k + \lceil m/k \rceil^2 + \lceil m/k \rceil$ depicts the latency of the multiplier structure of Chen et al. (2008).
10. $c_2 = f_A + 2f_X$ is the estimated CPD of the multiplier structure of Hua et al. (2013).
12. $c_4 = k f_A + k f_X + f_{MUX}$ is the estimated CPD of the proposed multiplier structure.
It is worth reporting that the estimated number of FFs includes the input/output registers. This ensures a fair comparison between the multiplier structures.
By investigating the area expressions in Table 2, we can conclude the following:
1. The multipliers in (Pan et al., 2013; Xie et al., 2015) have area complexity approximately of order $O(mk)$.
… ) for the other multiplier structures.
5. Increasing the word size $k$ will not significantly increase the number of FFs of the proposed multiplier structure. This is because the FF area complexity of the proposed multiplier structure is of order $O(k\lceil m/k \rceil)$.
The FFs consume a larger chip area than the other logic components, as indicated in Rabaey (2002). Therefore, reducing the number of FFs significantly reduces the overall area of the multiplier structures. As mentioned above, increasing the word size does not significantly increase the total number of FFs in the proposed multiplier structure. As a result, the overall area of the proposed multiplier structure does not increase significantly as $k$ increases.
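As a worked illustration of the FF growth term, the quantity $k\lceil m/k \rceil$ can be evaluated for the field size $m = 409$ and the word sizes $k = 8, 16, 32$ used in the synthesis experiments. This covers only the leading-order term quoted above, not the full FF expressions of Table 2:

```latex
\begin{aligned}
k = 8:  &\quad 8  \,\lceil 409/8  \rceil = 8  \times 52 = 416 \\
k = 16: &\quad 16 \,\lceil 409/16 \rceil = 16 \times 26 = 416 \\
k = 32: &\quad 32 \,\lceil 409/32 \rceil = 32 \times 13 = 416
\end{aligned}
```

The term is the same for all three word sizes because $k\lceil m/k \rceil$ is simply $m$ rounded up to the next multiple of $k$, which is consistent with the observation that the FF count is nearly insensitive to $k$.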
By analyzing the latency expressions in Table 2, we can observe the following:
1. The multiplier of Hua et al. (2013) has the lowest latency of the compared multiplier structures.
2. The latency results in Table 3 for the standard value $m = 409$ and word sizes $k = 8, 16, 32$ confirm that the latency expression of the proposed multiplier structure leads to a higher latency than those of the multiplier structures in (Xie et al., 2015; Pan et al., 2013).
3. The latency decreases as the word size $k$ increases, because the latency expressions are inversely proportional to $k$.
By investigating the CPD expressions, we can observe the following:
1. The CPD expressions of the Xie et al. (2015), Hua et al. (2013), and Chen et al. (2008) multiplier designs do not depend on the word size $k$. Therefore, they have fixed CPD values for all values of $k$.
2. The CPD expressions of the Pan et al. (2013) and proposed multiplier structures depend directly on $k$. Therefore, the CPD values of these multipliers increase as $k$ increases.
Since it is difficult to qualitatively estimate the latency decrease and the CPD increase as $k$ increases, we cannot predict precisely which multiplier structure has the best computation time. However, the quantitative results in Table 3 show which multiplier structure outperforms the others in computation time.
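To make the trade-off concrete for the proposed structure only, the computation time can be written as the product of latency and CPD. The expression below is a rough model under two assumptions: the latency is taken as the iteration count $\lceil m/k \rceil^2 + 1$ stated for the scheduling function (the Table 2 entry may differ), and the CPD is taken as the gate-level estimate $c_4$, which ignores wiring and synthesis effects:

```latex
T(k) \approx \left( \lceil m/k \rceil^{2} + 1 \right)
             \left( k f_A + k f_X + f_{MUX} \right)
```

In this idealized model the first factor shrinks roughly as $1/k^2$ while the second grows roughly as $k$, so the outcome depends on the actual gate delays, which is why the synthesized values are needed for a reliable comparison.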
All the multiplier structures were modelled in VHDL. The modelled multipliers were synthesized for the standard field size $m = 409$ and $k = 8, 16, 32$. For synthesis, we used Synopsys tools version 2005.09-SP2 with the NanGate (15 nm, 0.8 V) Open Cell Library. The typical corner (VDD = 0.8 V and T_j = 25 °C) and unit drive strength were used for all the utilized primitives. We recorded the switching activities of each design during simulation (using Mentor Graphics ModelSim SE 6.0a) in a Switching Activity Interchange Format (SAIF) file. This file is read by Synopsys Power Compiler to generate the power report. Table 3 displays the design metrics used to compare the adopted word-based multiplier structures: Latency, Area (A), CPD, maximum frequency (F), Total Computation Time (T), Consumed Power (P), and Consumed Energy (E). Area and CPD are obtained from the synthesis tools. The maximum operating frequency is obtained as the reciprocal of the CPD. The area is normalized by the area of a 2-input NAND gate. The total computation time is the time required to complete one product operation, obtained by multiplying latency and CPD. The consumed power is measured at a 1 kHz frequency. The consumed energy is obtained as the product of P and T.
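The derived metrics follow directly from the definitions above and can be sketched as below. The input values in the example are placeholders chosen only to exercise the arithmetic; they are not results from Table 3, and the function name `derived_metrics` is hypothetical.

```python
def derived_metrics(latency_cycles, cpd_s, power_w):
    """Compute F, T, and E from latency, CPD, and power as defined in the text.

    F = 1 / CPD            (maximum operating frequency, Hz)
    T = latency * CPD      (total computation time, s)
    E = P * T              (consumed energy, J)
    """
    freq_hz = 1.0 / cpd_s
    time_s = latency_cycles * cpd_s
    energy_j = power_w * time_s
    return {"F_hz": freq_hz, "T_s": time_s, "E_j": energy_j}


if __name__ == "__main__":
    # Placeholder inputs: 2705 cycles, 1 ns CPD, 1 microwatt (illustrative only)
    print(derived_metrics(latency_cycles=2705, cpd_s=1e-9, power_w=1e-6))
```

Because E is the product of P and T, a design with higher latency can still consume less energy if its power draw is sufficiently lower.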
We can read the performance results in Table 3 as follows:
1. Our proposed multiplier outperforms the other multipliers in terms of A. It significantly reduces area at all embedded word sizes $k$, by rates varying from 33.5% to 94.6% at $k = 8$, 30.7% to 95.1% at $k = 16$, and 31.2% to 96.3% at $k = 32$.
2. The multiplier structure of Pan et al. (2013) outperforms the other multiplier structures, including the proposed one, in terms of computation time at $k = 8$ and $k = 16$. It saves at least 45.8% of the time at $k = 8$ and 9.3% at $k = 16$.
3. The multiplier structure of Xie et al. (2015) outperforms the other multiplier structures, including the proposed one, in terms of computation time at $k = 32$. It saves at least 18.4% of the time at this embedded word size.
4. The proposed multiplier outperforms the other multiplier structures in terms of consumed power (P). The reduction in power is attributed to the reduced area of the proposed design relative to the other multiplier designs. The reduced area reduces the parasitic capacitance and consequently the dynamic and overlap energy loss of the circuit. Dynamic and overlap power loss is a major contributor to power loss in electronic circuits. The proposed multiplier structure decreases power consumption at all $k$ values, by 31.5% to 98.7% at $k = 8$, 27.4% to 98.9% at $k = 16$, and 19.7% to 98.7% at $k = 32$.
5. Our proposed multiplier outperforms the other multiplier structures in terms of consumed energy. It saves energy by rates varying from 37.6% to 98.1% at $k = 8$, 43.5% to 98.3% at $k = 16$, and 42.8% to 98.5% at $k = 32$. The energy reduction results from the significant saving in consumed power and the reasonable computation time of the proposed multiplier structure relative to the other multiplier structures.
In our comparison, we considered all design metrics (Area, Time, Power, and Energy) to ensure a fair comparison between the proposed design and its competitors. There is always a trade-off between design metrics: using more resources leads to more area, higher speed, and higher power consumption, and vice versa. In this paper, we mainly target resource-constrained IoT applications that have tighter restrictions on area and consumed energy. The obtained results show that the proposed multiplier outperforms its competitors in terms of area, consumed power, and consumed energy for all the common embedded word sizes. Although the proposed design has a lower computation speed than some of its competitors, its performance is still in the acceptable range. Therefore, the proposed design can be efficiently used in the implementation of crypto-processors in resource-limited medical IoT devices such as wearable and implantable medical devices. It can also be used in other resource-limited applications that impose limitations on area and consumed energy.

Summary and Conclusion
In this manuscript, we presented a compact and practical 2-D word-based SISO processor for modular multiplication over $GF(2^m)$. The proposed processor architecture is derived using a formal and systematic technique for mapping regular iterative algorithms (RIA) onto processor arrays. The methodology allows the system designer to control the workload of the entire processor array system and the workload of each processing element.
Managing the processor word size provides control of system speed, latency, and area. The proposed processor size can be adjusted to fit the expected chip area, making the implementation of the proposed multiplier processor more efficient in resource-limited IoT devices. The regularity and modularity of the proposed processor array make it well suited to implementation in ASIC technology. The obtained results show that the proposed multiplier processor reduces area, power consumption, and consumed energy compared with the other word-based multiplier designs. Thus, it can be used to implement crypto-processors in resource-limited medical IoT devices, such as wearable and implantable medical devices, and in other resource-limited devices such as smart cards, RFID devices, and wireless sensor nodes. In future work, we intend to implement the entire ECC cryptographic processor based on the proposed multiplier accelerator structure to estimate the overall savings in area and energy consumed by the entire system.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.