High Performance and Low Power Hardware Implementation for Cryptographic Hash Functions

Since hash functions are among the most widely used primitives in cryptography, their efficient hardware implementation is of critical importance. This paper proposes a high-performance hardware implementation of hash functions based on the sponge construction, which generates a digest of any desired length, and considers two key design metrics: throughput and power consumption. Firstly, we introduce the unfolding transformation, which increases the throughput of the hash function, and pipelining and parallelism design techniques, which reduce the delay. Secondly, we propose a frequency trade-off technique that yields a range of clock frequencies within which low dynamic power consumption can be traded against high throughput. Finally, we use a load-enable based clock gating scheme to eliminate wasted signal toggling in the idle mode of the hash encryption system. We demonstrate the proposed design techniques in 45 nm CMOS technology at 10 MHz. The results show up to 47.97 times higher throughput, a 6.31% delay reduction, and a 13.65% dynamic power reduction.


Introduction
The rapid growth of e-commerce has increased the number of transactions carried over the internet, so intruders must be prevented from accessing sensitive information. This calls for a higher level of security protection. There are many types of modern cryptography, for example, symmetric-key cryptography, public-key cryptography, and cryptographic hash functions. Cryptographic hash functions are used in almost every modern application and in a multitude of protocols, for example in digital signatures, to achieve message authentication and integrity protection. Hash-based message authentication codes (HMACs), for instance, are used in the IP security protocol and in the secure sockets layer (SSL) protocol [1].
Some hash functions, such as the message-digest algorithm (MD) series (MD4 and its strengthened variant MD5) and the secure hash algorithm (SHA) series (SHA-0 and SHA-1), were widely used but have been broken in practice.
Considering the potential danger of attacks on SHA-2, in 2008 the National Institute of Standards and Technology (NIST) started the NIST hash competition to develop the future hash standard SHA-3 [2].
Although software encryption is becoming more prevalent today, hardware design remains the implementation of choice for many commercial and military applications [3]. Firstly, a hardware design is much faster than the corresponding software implementation [4]. Secondly, a hardware implementation provides physical protection and thus a high level of security [5]. However, a hash function with a higher security level requires more complicated logic, and processing more information requires a higher frequency to maintain efficiency (or throughput). As a result, the power dissipation of the hardware design increases considerably. This causes serious problems in hardware systems, such as lower reliability, higher energy consumption, and higher device costs. Thus, low power techniques are highly valued in present-day hardware design.

Figure 1: Sponge construction [6], showing the absorbing and squeezing phases.
The rest of this paper is organized as follows. Section 2 introduces the sponge construction and the low power methods used in this paper. Section 3 analyzes the hash function designed with the sponge construction and its original hardware implementation, and then presents the unfolding transformation and the pipelining and parallelism design techniques used to improve the throughput and delay of the hash function. Section 4 constructs the hash encryption system and introduces two low power techniques: the frequency trade-off technique and the load-enable based clock gating scheme. Section 5 concludes the paper.

Background of the Research
In this section, first, sponge construction will be explained. Next, we will introduce two dynamic power reduction methods which are used in this paper.

Sponge Construction.
The idea of sponge construction came from the design of RadioGatún, and its final definition was given at the Ecrypt Hash Workshop in Barcelona [6]. As shown in Figure 1, sponge construction takes arbitrary length input with finite internal state and gives an output of any desired length.
There are three components in sponge construction [7]: (i) a state memory; (ii) a function of fixed length that permutes or transforms the state memory; (iii) a padding function.
The state memory in Figure 1 is divided into two parts: the top section, called the bitrate, of r bits and the bottom section, called the capacity, of c bits. The input message is padded to a whole multiple of the bitrate, so the padded message can be broken into r-bit blocks. Sponge construction consists of two processes: absorbing and squeezing. In absorbing (left of the dashed line in Figure 1), firstly, the input message is padded and the state memory is initialized; secondly, the first r-bit block of the padded input is XORed with the first r bits of the initial state memory; thirdly, the fixed-length function updates the state memory. Steps two and three are repeated until all padded r-bit blocks are used up. In squeezing (the right section), firstly, the first r bits of the latest state memory form the first r-bit output block; secondly, if more output bits are needed, the fixed-length function updates the state memory and the first r bits of the new state memory form the second r-bit output block. This process is repeated until the desired number of output bits is produced.
The extent to which the c-bit capacity part is altered by the input message depends on the fixed-length function [7]. The security of the hash function, for example, its resistance to collision or preimage attacks, relies on this c-bit part. Because its input and output sizes can be arbitrarily long, the sponge construction allows building various primitives such as hash functions. The Keccak hash function, known as the new SHA-3, uses this sponge construction.
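The absorbing and squeezing steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the design used in this paper: `toy_permutation` is a hypothetical stand-in for the fixed-length function, the padding rule (a single 1 followed by the fewest zeros) is assumed, and the state is modeled as an integer whose top r bits are the bitrate part.

```python
def toy_permutation(state: int, width: int) -> int:
    """Hypothetical bijective mixing function; NOT a real cryptographic permutation."""
    mask = (1 << width) - 1
    for _ in range(4):
        state = ((state << 7) | (state >> (width - 7))) & mask  # rotate left by 7
        state ^= state >> 3                                      # invertible xor-shift
        state = (state * 0x9E3779B1 + 1) & mask                  # odd multiplier: bijective mod 2^width
    return state


def pad(message_bits: str, r: int) -> str:
    """Assumed padding: append one '1', then the fewest '0's to reach a multiple of r."""
    padded = message_bits + "1"
    return padded + "0" * (-len(padded) % r)


def sponge(message_bits: str, r: int, c: int, out_bits: int) -> str:
    """Absorb r-bit blocks into an (r + c)-bit state, then squeeze out_bits bits."""
    width = r + c
    state = 0                                    # initialized state memory
    padded = pad(message_bits, r)
    # Absorbing: XOR each block into the bitrate part, then apply the function.
    for i in range(0, len(padded), r):
        block = int(padded[i:i + r], 2)
        state ^= block << c                      # only the top r bits are touched
        state = toy_permutation(state, width)
    # Squeezing: read the bitrate part, applying the function between reads.
    out = ""
    while len(out) < out_bits:
        out += format(state >> c, f"0{r}b")
        if len(out) < out_bits:
            state = toy_permutation(state, width)
    return out[:out_bits]


digest = sponge("1011", r=32, c=96, out_bits=128)
```

Note that the input blocks are XORed only into the bitrate part; the capacity part changes solely through the fixed-length function, which is why the security argument rests on the capacity.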

Dynamic Power Reduction Methods.
Digital circuits consume dynamic power in the active mode. There are two sources of dynamic power consumption [8]: (i) charging and discharging of the output capacitance; (ii) short-circuit current when the PMOS and NMOS networks are both ON.
Because the short-circuit power is usually less than 10% of the total dynamic power [9], the dynamic power that we try to reduce in this paper refers to switching power from here on. Up to a constant factor, dynamic power can be expressed as in (1), where $C_L$ is the load capacitance, $V_{DD}$ is the supply voltage, $f$ is the clock frequency, and TR is the toggle rate of the gate output:

$$P_{\text{dynamic}} \propto C_L \cdot V_{DD}^{2} \cdot f \cdot TR \qquad (1)$$

Since power optimization at RTL has a significant impact with reasonable accuracy, RTL is considered the optimal stage for low power techniques [8]. According to (1), four parameters determine the dynamic power consumption: supply voltage, clock frequency, load capacitance, and the toggle rate of the gate output. Because reducing the supply voltage increases the critical path delay and changing the output capacitance requires redesigning the load logic, it is more efficient to focus on clock frequency and toggle rate at RTL.
Figure 2 shows a basic dynamic voltage/frequency scaling (DVFS) system. By collecting information about the workload and the temperature, the DVFS controller determines a clock frequency that is sufficient to finish the work and gives the best performance without overheating. Choosing a proper clock frequency in this way reduces dynamic power.
To reduce the toggle rate, we use the clock gating technique; in particular, we use the load-enable based clock gating scheme [10]. Figure 3 shows the usual structure of load-enable based clock gating. If the data do not change during some consecutive clock periods, or the enable signal is kept low, those clock periods are wasted. This technique can be applied to a circuit with a mux in which an enable signal is the selection signal, or to a pipelined circuit such as the hash encryption system in this research.
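The way the two RTL-level knobs act on switching power can be illustrated with a small calculation. The sketch below assumes only the proportionality to load capacitance, squared supply voltage, clock frequency, and toggle rate; the function name and the scale-factor interface are hypothetical, and any constant factor cancels because only ratios are computed.

```python
def relative_dynamic_power(freq_scale: float, toggle_scale: float,
                           vdd_scale: float = 1.0, cap_scale: float = 1.0) -> float:
    """Switching power relative to a baseline design, with each parameter
    given as a ratio to its baseline value (1.0 means unchanged)."""
    return cap_scale * vdd_scale ** 2 * freq_scale * toggle_scale


# Halving the clock frequency halves switching power.
half_clock = relative_dynamic_power(freq_scale=0.5, toggle_scale=1.0)

# Removing 20% of the signal toggles (for example by gating idle clocks)
# saves 20% of the switching power at an unchanged frequency.
gated = relative_dynamic_power(freq_scale=1.0, toggle_scale=0.8)
```

Both effects are linear, whereas the supply voltage enters quadratically; the voltage is left untouched here because lowering it lengthens the critical path.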

Proposed High-Speed Hashing Module in Hardware
Cryptographic hash functions provide powerful protection for data and are used in the security layer of many communication protocols. However, as protocols evolve, data sizes and communication speeds increase dramatically, and the low throughput of the hash function becomes a bottleneck in these digital communication systems. A promising solution is hardware implementation on reconfigurable devices, which combines high flexibility with speed and physical security. Various techniques have been proposed to speed up hash functions or improve their throughput, for example, the unfolding transformation and pipeline and parallelism techniques. In this section, we first present the characteristics of the hash algorithm that are relevant to its hardware implementation. We then introduce the high-speed hashing module based on a delay bound analysis.
Finally, the unfolding transformation and the pipeline and parallelism technique are used to optimize the inner logic of the transformation rounds.

Hash Algorithm Specification.
In this section, we introduce a cryptographic hash algorithm with sponge construction, called sponge hash algorithm (SHAT). SHAT is a hash function generating 128-, 256-, or 384-bit hash values; according to the hash value length, the variants are denoted SHAT-128, SHAT-256, and SHAT-384. The parameters of SHAT are shown in Table 1. The 4-bit S-box is specified in Table 2. The diffusion layer is a permutation that satisfies the diffusion property (as does the P-function of Camellia [11]). For computational efficiency, this diffusion layer should be represented using only bitwise exclusive ORs. For security, the branch number of the diffusion layer should be optimal against differential and linear cryptanalysis [11]. Once all eight 4-bit S-box outputs (indexed 0 to 7) are available, the diffusion layer mixes them. The diffusion layer is defined in (2).

Hash Function of SHAT.
SHAT uses the hermetic sponge construction as shown in Figure 4. As mentioned in Section 2, r is called the bitrate and c is called the capacity.

Table 2: S-box.

International Journal of Distributed Sensor Networks
In the absorbing phase, the input message shown in Figure 4 is treated as a sequence of blocks and padded to a whole multiple of the bitrate r. The padding method is as follows: given the total length of the input message in bits (assumed to be a whole multiple of four, that is, a whole number of hexadecimal digits), we append a single 1 bit to the end of the message, followed by the smallest nonnegative number of zero bits that satisfies the padding condition in (3). The bitrate part of the state is then XORed with the padded r-bit message block, and the result goes through the one-way compression function, Perm. Perm is a permutation process with 48 steps. Each STEP is defined in Algorithm 1, in which the left circular rotation amounts are $\mathrm{rot}_0 = 19$, $\mathrm{rot}_1 = 1$, and $\mathrm{rot}_2 = 14$. The squeezing phase of SHAT is defined in (4). The complete algorithm for SHAT-128, SHAT-256, and SHAT-384 is specified in Algorithm 2.

Hardware Implementation.
Following the specification of SHAT in Algorithm 2, the architecture of SHAT is illustrated in Figure 5.
The S-box is designed from a Karnaugh map. According to Table 2, we obtain the logic functions of the S-box as shown in (5), which relate its four input bits (indexed 0 to 3) to its four output bits (indexed 0 to 3). There are 48 iteration rounds in the basic architecture of the Perm function. We use the rolling loop technique to reduce the area requirement: our design is a single operation block that is reused 48 times, as shown in Figure 6. Here a cycle counter tracks the iteration round, from 0 to 47. The critical path is highlighted by a bold line. Since the delay of a circular shift is negligible in a hardware implementation, the critical path delay of this architecture is given by (6).

Proposed High-Speed Module.
In the previous section, we introduced the rolling loop technique to construct the Perm function. Although this approach is area efficient, throughput stays low because 48 clock cycles are required to generate the result. Many architectures can be derived by varying the Perm function to solve this problem. We apply the unfolding transformation technique. This high-speed module combines several STEP blocks into a single round and can even take advantage of a completely round-unrolled circuit. By unfolding, hidden concurrencies can be parallelized [12]. In addition, the pipeline and parallelism technique explained in [13] is used to improve the unfolded construction of the hash function. This technique relies on precomputation, which is found by analysing the inner logic and architecture of the hash function.

Unfolding Transformation.
According to Figure 6, the mathematical expression of one iteration round is given in (7).

Algorithm 1: Typical one step algorithm.

When two STEP operations are unrolled into each round, we get 24 rounds in one permutation process. The throughput is given as

$$\text{Throughput} = \frac{\#\text{bits} \times f}{\#\text{operations}} \qquad (8)$$

where #bits is the number of message bits processed per permutation, $f$ is the operating frequency, and #operations is the number of rounds (clock cycles) per permutation. Considering (8), although this unfolding transformation reduces the maximum operating frequency, the throughput increases significantly because the number of operations is reduced from 48 to 24. The mathematical expression of one iteration round is then replaced by (9).

Pipeline and Parallelism.
Suppose two STEP operations are unrolled in each round. This lowers the maximum frequency while increasing the throughput, at the cost of increased area. If some logic can be evaluated in parallel, and this parallelism lies on the critical path, then the delay of each round decreases and the frequency of each operation can be increased. According to (8), when the number of operations and the number of bits are kept constant, the throughput increases with the frequency. This method can be used in any other hardware implementation of a hash function.
For example, Figure 7 shows the architecture obtained by unfolding two STEP operations into one round with the minimum critical path delay. The critical path is composed of seven XOR gates and two functions. Compared with the architecture of one STEP block, unfolding two STEPs into one round adds three 32-bit XOR gates and one function to the critical path. The critical path is highlighted by a bold line.
In Figure 7, the cycle counter value of the following step can be combined with temp2 first and then XORed with temp3 in the second STEP part. In the first STEP part, by contrast, the counter is XORed with the index-3 term and then with the index-2 term, so the second STEP part needs an additional component to combine temp3 with the next counter value. Because the output of every step must still be generated, this area penalty cannot be avoided.
In the limiting case, (10) reduces to (11). This is the delay bound of SHAT: the delay of one SHAT operation round cannot be less than this bound.

Experimental Results.
We introduce a measure of hardware efficiency in (12) [14], which refines the usual figure of merit (FOM). We assume that power is proportional to gate count, so when throughput is traded off against power the metric can be divided by the gate count a second time in place of the power dissipation. Note that one gate equivalent (GE) is equal to the area of a two-input NAND gate in 45 nm CMOS technology:

$$\text{Efficiency} = \frac{\text{Throughput}}{\text{GE}^{2}} \qquad (12)$$

In Table 4, the throughput of SHAT-256 is 51.05% of that of Grøstl, whereas the area of SHAT-256 is only 21.84% of that of Grøstl; this results in having 84.47 [...]. In Table 5, the throughput of SHA-384 is 6.09 times that of SHAT-384, but its area is 9.11 times larger; as a result, the hardware efficiency of SHAT-384 is 13.64 times higher than that of SHA-384.
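The SHA-384 comparison can be checked with a short calculation. The sketch below assumes the efficiency metric is throughput divided by the square of the gate count, which is how the description of the metric reads; the function name is hypothetical, and the designs are expressed relative to SHAT-384 because only the two reported ratios are needed.

```python
def hardware_efficiency(throughput: float, gate_equivalents: float) -> float:
    """Assumed metric: throughput divided by area twice (area stands in for power)."""
    return throughput / gate_equivalents ** 2


# Normalize SHAT-384 to throughput 1 and area 1; SHA-384 is then
# 6.09 times the throughput and 9.11 times the area.
shat_384 = hardware_efficiency(1.0, 1.0)
sha_384 = hardware_efficiency(6.09, 9.11)

efficiency_gain = shat_384 / sha_384   # equals 9.11**2 / 6.09
```

The gain evaluates to about 13.6, in line with the reported 13.64; the small difference is consistent with the two ratios being rounded to two decimals.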
We then implement the unfolding transformation technique with 10 different numbers of unrolled loops (1, 2, . . . , 48), using 45 nm CMOS technology at 10 MHz, to evaluate the performance of SHAT-128; the results are shown in Table 7. As Table 7 shows, the throughput of the Perm function reaches up to 47.97 times that of the original design, which is 6.67 Mbps. However, area, delay, and power increase dramatically as a penalty.
Finally, we apply the pipeline and parallelism technique to reconstruct the STEP block, as shown in Table 6. Compared with the original circuit, the critical path delay is reduced by up to 6.31%, while power and area increase by 8%.
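The reported throughput figures follow from the relation throughput = bits × frequency / rounds. The sketch below uses the 32-bit message block of the hash encryption system and the fixed 10 MHz evaluation clock; taking the ten unrolling factors to be the divisors of 48 is an assumption (it happens to give exactly ten values), and the helper names are hypothetical.

```python
BLOCK_BITS = 32      # bits absorbed per Perm call (32-bit message blocks)
FREQ_MHZ = 10.0      # evaluation clock frequency
TOTAL_STEPS = 48     # STEP operations in one Perm


def divisors(n: int) -> list[int]:
    """Unrolling factors that split n steps into a whole number of rounds."""
    return [k for k in range(1, n + 1) if n % k == 0]


def throughput_mbps(block_bits: int, freq_mhz: float, rounds: int) -> float:
    """Bits per Perm times clock frequency, divided by clock cycles per Perm."""
    return block_bits * freq_mhz / rounds


# Map each unrolling factor k to the throughput with 48 / k rounds per Perm.
table = {k: throughput_mbps(BLOCK_BITS, FREQ_MHZ, TOTAL_STEPS // k)
         for k in divisors(TOTAL_STEPS)}

rolled = table[1]                  # one STEP per round, 48 rounds
fully_unrolled = table[48]         # all 48 STEPs in a single round
gain = fully_unrolled / rolled
```

At a fixed clock the gain is exactly the unrolling factor, 48; dividing 320 Mbps by the rounded baseline of 6.67 Mbps gives about 47.98, close to the reported 47.97. This holds only while the unrolled critical path still fits in the 100 ns clock period.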

Low Power Design for Hash Function
Low power design is a significant consideration in hardware implementation. Power consumption determines a device's lifetime, reliability, and energy cost, so low power techniques are now applied in almost every application. There are many methods to reduce power consumption, such as clock gating for dynamic power and power gating for leakage power. Decreasing the clock frequency also reduces power dissipation considerably.
Firstly, we propose the frequency trade-off technique. With this method we obtain a range of frequency values within which low power consumption can be traded against high throughput of the hash function. Secondly, we construct a hash encryption system that includes an input data padding unit, RAM registers, the main hash computing unit, a message digest extraction component, and a main control unit. Thirdly, by analyzing the idle mode and the control signals of this hash encryption system, we apply the load-enable based clock gating scheme to reduce the dynamic power consumption.

Frequency Trade-Off Technique.
According to (1), dynamic power dissipation decreases linearly with clock frequency, so reducing the clock frequency is an effective way to lower it. In Section 2.2, we discussed the DVFS technique: by collecting information about workload and temperature, DVFS determines a clock frequency sufficient for the required performance. However, modifying the clock frequency at RTL is not easy, and the clock frequency is normally treated as a constant. Moreover, dynamic frequency scaling reduces the number of operations a system can issue in a given amount of time, thus reducing performance. There is therefore a conflict to resolve: a high clock frequency brings high throughput but dramatically increases dynamic power consumption, whereas a low clock frequency minimizes dynamic power dissipation but decreases throughput as well.
However, with the unfolding transformation technique introduced in Section 3.3, the maximum frequency of the Perm function decreases as the number of unrolled loops increases. This means that we can decrease the clock frequency while still increasing the throughput of the hash algorithm. The unfolding transformation thus achieves high performance without a high clock frequency. Exploiting this advantage, we can make a trade-off between high performance and low power consumption by choosing a proper clock frequency.
Next, we explain how to obtain this frequency range from two performance bounds. First, we obtain two values for the rolled Perm circuit: the dynamic power consumption $P_1$ and the clock frequency $f_1$, which is constrained by the circuit design (the clock period computed from $f_1$ must not be less than the critical path delay). Then, according to (8), we obtain the throughput $T_1$ at this frequency. The two performance bounds are defined in (13), where $N$ is the number of iteration rounds in one Perm function with rolling STEPs:

$$P_{\max} = N \cdot P_1, \qquad T_{\min} = T_1 \qquad (13)$$

The method can be stated as follows. Taking the original folded circuit as the reference (we assume this circuit has 48 iteration rounds in one Perm function), each unfolded design with a different number of unrolled STEPs (2, 3, . . . , 48) is subject to two performance bounds: the maximum dynamic power and the minimum throughput of the circuit. These two bounds determine the boundaries of the proper frequency range for each unfolded circuit. When we choose a specific clock frequency in this range, the total dynamic power consumption of that Perm function is not more than the defined maximum dynamic power $P_{\max}$, and its throughput is not less than the fixed minimum throughput $T_{\min}$.
This clock frequency range offers many different choices for circuit designs that use the unfolding transformation technique. The results of the frequency trade-off technique are shown in Table 9 in Section 4.4.
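One way to read the frequency trade-off technique is as the intersection of three constraints on the clock frequency of an unfolded design. The sketch below is an interpretation under stated assumptions, not the paper's exact procedure: total power per Perm is taken as rounds times per-cycle power, power is assumed to scale linearly with frequency, and the power and delay values passed for the unfolded design are hypothetical placeholders rather than measured results.

```python
REF_FREQ_MHZ = 10.0                 # frequency at which power figures are given
BLOCK_BITS = 32                     # bits processed per Perm
ROLLED_ROUNDS = 48                  # rounds of the reference (rolled) circuit
ROLLED_POWER = 27.27                # rolled-circuit power at 10 MHz (units as in Table 7)

P_MAX = ROLLED_ROUNDS * ROLLED_POWER                      # maximum total power bound
T_MIN = BLOCK_BITS * REF_FREQ_MHZ / ROLLED_ROUNDS         # minimum throughput, Mbps


def frequency_range_mhz(rounds: int, power_at_ref: float, delay_ns: float):
    """Return (f_low, f_high) in MHz for an unfolded design, or None if empty.

    rounds       -- clock cycles per Perm after unfolding
    power_at_ref -- dynamic power of the design at REF_FREQ_MHZ (hypothetical input)
    delay_ns     -- critical path delay of the design (hypothetical input)
    """
    f_low = T_MIN * rounds / BLOCK_BITS                           # throughput >= T_MIN
    f_power = P_MAX * REF_FREQ_MHZ / (rounds * power_at_ref)      # total power <= P_MAX
    f_timing = 1000.0 / delay_ns                                  # period >= critical path
    f_high = min(f_power, f_timing)
    return (f_low, f_high) if f_low <= f_high else None


# Sanity check: for the rolled circuit itself the range collapses to 10 MHz.
rolled_range = frequency_range_mhz(48, ROLLED_POWER, delay_ns=10.0)

# Hypothetical two-STEP design: 24 rounds, made-up power and delay values.
two_step_range = frequency_range_mhz(24, power_at_ref=50.0, delay_ns=20.0)
```

With these placeholder inputs the two-STEP design may run anywhere between 5 MHz and roughly 10.9 MHz; the lower end minimizes power at the baseline throughput, and the upper end maximizes throughput within the baseline power budget.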

Hash Encryption System Design.
The hash encryption system is divided into 5 main parts as shown in Figure 8.
Firstly, the receiver and RAM section serves as our padding unit. We use serial communication to connect the PC and the hash encryption system, so a clock divider is needed to generate a clock synchronized with the Baud rate of the serial link. We choose a transmission rate of 4800 Baud, which is deliberately slow in order to keep the error rate low (less than 3%). In this case, one Baud represents 1 bit. Each transmission frame consists of one start bit "0", an 8-bit message, and one finish bit "1"; the start and finish bits are added to the transmitted message bits automatically. The receiver samples at 16 times the Baud rate, and the FPGA board provides a 100 MHz clock. Thus, the sampling clock period is 1302 times the period of the provided 100 MHz clock, as shown in (14):

$$\frac{100\ \text{MHz}}{4800 \times 16} \approx 1302.08 \approx 1302 \qquad (14)$$

The resulting error of 0.0064% is far below the 3% limit. Because the liquid crystal display (LCD) can show at most 32 hexadecimal characters, which matches the number of digest bits of SHAT-128, the bitrate r for each padded block is set to 32 bits, that is, eight 4-bit hexadecimal digits.
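The clock divider ratio and its rounding error can be reproduced directly from the figures in this paragraph. The sketch below is only an arithmetic check; the function name is hypothetical.

```python
def sampling_divider(clock_hz: float, baud: int, oversample: int):
    """Return the integer divider for the sampling clock and the relative
    error (in percent) caused by rounding the exact ratio."""
    exact = clock_hz / (baud * oversample)
    divider = round(exact)
    error_pct = abs(exact - divider) / exact * 100.0
    return divider, error_pct


# 100 MHz board clock, 4800 Baud, 16 samples per bit.
divider, error_pct = sampling_divider(100e6, 4800, 16)
```

The exact ratio is about 1302.08, so rounding to 1302 gives a timing error of about 0.0064%, which matches the figure quoted above.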
Secondly, the hash function introduced in Section 3 is designed as a sponge construction, as shown in Figure 4. It absorbs 32-bit message blocks and squeezes out a 128-bit digest.
Finally, the main control unit manages the working order between the receiver, the hash process, and the LCD display. Figure 9 shows the pipelined operation of the system.
Because we use serial communication, the transfer is slow. With a rate of 4800 Baud, chosen for its low error rate, each 32-bit block needs roughly 7 ms. For example, if seven 32-bit blocks need to be transmitted, roughly 50 ms is spent on data receiving and padding. The hash function used in this system performs one STEP per round, which means 48 iteration rounds for a complete Perm function; even so, hash processing needs only roughly 6 μs. The LCD display period also takes considerable time: even though LCD initialization can finish before the hash digest is available, roughly 1.5 ms is still needed to display all the data.
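The gap between receiving time and hashing time can be made concrete with a rough estimate. The sketch below uses the Baud rate, block size, frame format, and board clock stated in this section; the helper names are hypothetical, and the per-Perm time assumes one round per clock cycle at 100 MHz.

```python
def transfer_time_ms(payload_bits: int, baud: int, framed: bool = False) -> float:
    """Serial transfer time in milliseconds. With framed=True, every 8-bit
    byte carries one start and one finish bit (10 line bits per byte)."""
    line_bits = payload_bits * 10 / 8 if framed else payload_bits
    return line_bits / baud * 1000.0


def perm_time_us(rounds: int, clock_mhz: float) -> float:
    """Time for one Perm call in microseconds, at one round per clock cycle."""
    return rounds / clock_mhz


block_ms = transfer_time_ms(32, 4800)                  # payload bits only
block_framed_ms = transfer_time_ms(32, 4800, True)     # including start/finish bits
seven_blocks_ms = 7 * block_ms
one_perm_us = perm_time_us(48, 100.0)
```

Counting payload bits only gives about 6.7 ms per block and about 47 ms for seven blocks, consistent with the rough figures in the text; including the start and finish bits raises this to about 8.3 ms per block. A single 48-round Perm at 100 MHz takes under half a microsecond, so the hash unit is idle for almost the entire receiving phase.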

Load-Enable Based Clock Gating.
In this section, we introduce the load-enable based clock gating technique for the hash encryption system.
Clock gating is the most widely used low power technique at RTL. At RTL it is more practical to control the toggle rate of the gate output than any of the other three parameters, namely $V_{DD}$, clock frequency, and gate output capacitance. According to Figure 9, the hash encryption system operates as a pipeline. The finishing signal of each process can be used as the enable signal in load-enable based clock gating, as shown in Figure 3. XOR-based clock gating, on the other hand, requires identifying the outputs of individual flip-flops, which is not easily done in our encryption system; thus load-enable based clock gating is the best low power option for this design.
As shown in Figure 10, three signal pairs realize this load-enable based clock gating: the receiver enable and receiver finish signals, the hash enable and hash finish signals, and the LCD enable and LCD finish signals. Because the receiver runs at a specific clock frequency matched to the serial communication, the main control unit does not gate the receiver clock directly; instead, it manages the receiver by gating the clock of the clock divider with the receiver enable signal. Figure 9 gives the three operation phases of the encryption system. In the first phase, the receiver enable and LCD enable signals are asserted to logic one and the hash enable signal is held at logic zero; thus the receiver starts receiving input messages and padding them into RAM. In the meantime, the system begins the initialization process for the LCD. The hash processing unit, however, is waiting for the padded input message. Because serial communication takes a long time, owing to the low Baud rate and bit-by-bit transmission, LCD initialization can finish before the padded message is ready. Thus, the main control unit can set the LCD enable signal to logic zero when the LCD finish signal switches to logic one.
During the second phase, the padded message is ready, so the receiver finish signal switches to logic one. The receiver enable signal is then set to zero, which turns off the clock divider; no sampling clock is produced, and the receiver stops working. In this phase, the hash enable signal is asserted to logic one for hash encryption, which is our core function. The LCD enable signal remains at zero while waiting for the hash digest generated by hash processing.
The system enters the third phase when the hash finish signal switches to logic one. In this phase, the hash digest is ready; thus both the receiver and the hash process are idle, which means that the receiver enable and hash enable signals are both at logic zero. The LCD enable signal is asserted to logic one to start LCD displaying and is set back to zero when the displaying process is finished. This is the end of the whole operation; the device is then turned off or repeats these three phases for another input message.
By analyzing the structure and operation of the hash encryption system, we can determine the idle time of each component. Applying load-enable based clock gating to each component then reduces the dynamic power dissipation of the system, as shown in Table 8 in Section 4.4.
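The three-phase enable sequencing can be modeled behaviorally. The sketch below is a simplified software model, not the RTL of the design: the signal and function names are hypothetical, and the gated clock is modeled as a plain AND of clock and enable, whereas a hardware clock gate would normally latch the enable to avoid glitches.

```python
from dataclasses import dataclass


@dataclass
class Enables:
    receiver: bool   # gates the clock divider that feeds the receiver
    hash_unit: bool  # gates the hash processing unit
    lcd: bool        # gates the LCD controller


def controller(receiver_done: bool, hash_done: bool,
               lcd_init_done: bool, lcd_display_done: bool) -> Enables:
    """Derive the enable signals from the finish signals of each unit."""
    if not receiver_done:
        # Phase 1: receive and pad; the LCD initializes in parallel, then idles.
        return Enables(receiver=True, hash_unit=False, lcd=not lcd_init_done)
    if not hash_done:
        # Phase 2: only the hash unit is clocked.
        return Enables(receiver=False, hash_unit=True, lcd=False)
    # Phase 3: only the LCD is clocked, until the digest has been displayed.
    return Enables(receiver=False, hash_unit=False, lcd=not lcd_display_done)


def gated_clock(clk: int, enable: bool) -> int:
    """Idealized clock gate: the clock reaches the unit only while enabled."""
    return clk & int(enable)


phase1 = controller(False, False, False, False)
phase1_lcd_ready = controller(False, False, True, False)
phase2 = controller(True, False, True, False)
phase3 = controller(True, True, True, False)
finished = controller(True, True, True, True)
```

In every phase at most two units receive a clock, and for most of the run only one does, which is where the reduction in wasted toggling comes from.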

Experimental Results.
Using a 10 MHz clock frequency and 45 nm CMOS technology, the results of the frequency trade-off technique are shown in Tables 9, 10, and 11. Table 9 shows that the area and critical path delay are unchanged compared with the unfolding transformation technique. Tables 10 and 11 give the variation of dynamic power consumption and throughput under the frequency trade-off method. In these tables, f stands for frequency and T for throughput, and the accompanying pct column is the percentage increase relative to the minimum throughput ($T_{\min}$), which is 6.67 Mbps. P is the total dynamic power consumed in finishing a complete Perm function, and its pct column is the percentage power reduction relative to the maximum power consumption ($P_{\max}$), defined as 1308.96 μW, the product of 48 (the number of iteration rounds) and 27.27 μW (as shown in Table 7). N stands for the number of iteration rounds.
We then apply the load-enable based clock gating scheme to the hash encryption system using a 100 MHz clock frequency, which is available on the FPGA board, and 45 nm CMOS technology. As shown in Table 8, the dynamic power decreases by 13.65%, at the cost of a 3.64% increase in area and a 5.52% increase in critical path delay.

Conclusion
This paper targets a high-performance, low-power hardware implementation of a cryptographic hash function based on the sponge construction. Firstly, we use the unfolding transformation technique to improve the throughput of the hash function. Secondly, pipeline and parallelism design techniques reduce the critical path delay by modifying the structure of the permutation function. Thirdly, the proposed frequency trade-off technique computes a frequency range within which low dynamic power consumption can be traded against high throughput. Finally, a load-enable based clock gating scheme is applied to the hash encryption system to eliminate wasted signal toggling in the idle mode.
The experimental results show that the unfolding transformation achieves up to 47.97 times higher throughput, the pipeline and parallelism methods give a 6.31% delay reduction, and the load-enable based clock gating scheme decreases dynamic power consumption by 13.65%. The frequency trade-off technique shows how to choose the clock frequency of the hash function to achieve both low power consumption and high throughput.