Strengthened 32-bit AES implementation: Architectural error correction configuration with a new voting scheme

Digital data transmission is becoming increasingly vulnerable to both malicious and natural faults. To assure reliability, security and privacy in communication, a low-cost fault-resilient architecture for the Advanced Encryption Standard (AES) is proposed. Because the reliability of the voter directly affects the reliability of the whole AES architecture, we introduce a novel voting scheme that includes a majority voter (named TMR voter) and an error barrier element (named DMR voter). In this paper, a reliable and secure 32-bit data-path AES implementation based on our robust fault-resilient approach is developed. We show that the proposed architecture can tolerate up to triple-bit (byte) simultaneous faults in each pipeline stage's logic and verify this claim through extensive error simulations. Error simulation results also show that our architecture achieves close to 100% fault-masking capability for multiple-bit (byte) faults. Finally, it is shown that the Application-Specific Integrated Circuit implementations of the fault-tolerant architectures using the composite field-based S-box (CFB-AES) and the ROM-based S-box (RB-AES) offer a better trade-off between area usage, throughput and fault resilience than their counterparts. The architecture is therefore well suited to highly secure, resource-constrained applications.


| INTRODUCTION
Cryptographic algorithms, such as block ciphers and stream ciphers, play an essential role in providing confidentiality for data transmission in modern information systems [1]. Many fault attacks have been mounted on implementations of cryptographic primitives, including mathematically strong ciphers such as the Advanced Encryption Standard (AES), by which secret information can be extracted. In some existing attacks, for example differential fault analysis, even a single well-adjusted fault is enough to extract secret information.
Cryptosystems are often implemented on hardware platforms (i.e. Field Programmable Gate Array [FPGA] or Application-Specific Integrated Circuit [ASIC] platforms) to meet real-time requirements. On the other hand, the probability of fault occurrence in such hardware platforms increases with transistor down-scaling, so providing reliability for such designs has become a serious concern. In other words, a non-secure implementation of a cryptosystem not only fails in the presence of fault(s) but may also reveal secret information [2]. AES is a standard for the encryption of secret data established by the U.S. National Institute of Standards and Technology in November 2001 [3]. It is used in a wide range of applications [4] as a well-known block cipher with strong cryptographic features. AES can also serve as the basis of lightweight authenticated encryption, for example AES-based lightweight authenticated encryption, and of stream ciphers, for example leak extraction [5].
Critical resource-constrained digital technologies such as the Internet of Things (IoT), which face severe limitations on area and power, have been extensively developed during the recent decade. These technologies, which have transformed our daily lives, have major security deficiencies [6,7]. Due to the importance of security in such technologies, many research works have optimised AES so that their requirements are met [8]. In fact, the main goal of using AES is to secure data transmission by allowing only the intended receiver to access the original data with a specific key. However, as previously stated, because malicious and natural faults may occur in a non-secure implementation, AES alone does not guarantee the reliability of data transmission [9].
This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
We introduce a low-cost fault-tolerant technique based on a combination of hardware and time redundancy, and we provide a secure and reliable implementation of AES with a 32-bit data-path. We also propose a new voting scheme that includes two independent voting stages: a Triple Modular Redundancy (TMR, Majority) voter and a Dual Modular Redundancy (DMR, Error Barrier Element) voter. Our TMR voter is hazard-free and can be used in all types of digital circuits. The proposed technique achieves a high error correction capability. Our main contributions are as follows.
1. We propose a highly reliable architecture for a 32-bit AES implementation. Our fault-tolerant configuration can be generalised to the hardware implementation of any digital circuit.
2. We introduce a new voting scheme that conforms to the proposed fault-tolerant configuration. Our voting scheme consists of two independent voting stages: Majority and DMR voting. We also modify the architecture of the proposed Majority-voter so that it can be used in the asynchronous realisation of digital circuits.
3. We organise sixteen 8-bit registers so that they perform the ShiftRow operation and are also consistent with our timing specification.
4. We show, by checking all cases, that our architecture can tolerate all faults of up to 3 bits occurring in each pipeline stage, and we verify this through extensive fault injection scenarios at the simulation level.
5. Finally, we implement our fault-tolerant 32-bit AES architectures using the composite field-based S-box (CFB-AES) and the ROM-based S-box (RB-AES), as well as the traditional fault-tolerant configurations, TMR and Quintuple Modular Redundancy (QMR), of 32-bit AES, on the ASIC platform using 180-nm Complementary Metal Oxide Semiconductor (CMOS) technology, and we compare the results obtained for the different structures.
The remainder of this paper is organised as follows. Section 2 reviews different AES architectures and their fault-resilient hardware implementations. We describe the proposed fault-resilient architecture of AES in detail in Section 3. The proposed voting scheme is also explained in this section. Section 4 reports the results of our extensive error simulations and shows how reliability is enforced and how different fault effects are thwarted. The implementation results on TSMC 180-nm technology are discussed and compared with previous AES implementations in terms of area, frequency and throughput in Section 5. Finally, Section 6 concludes the paper.

| State of the art of low-cost hardware implementation for AES
Due to the significance of security, there are many research works on optimising the area and power consumption of AES for different applications. Fully pipelined or unrolled-round architectures of AES are implemented in hardware in several works [10,11], but they only provide high throughput and are practical for resource-rich applications that demand it. These architectures are not appropriate for resource-constrained devices or embedded systems [10]. In [12], a low-cost 128-bit AES is introduced for edge devices in IoT applications. The authors optimised the encryption/decryption hardware implementation of AES in a four-fold manner based on resource sharing.
An optimised data-path of AES is proposed by Rouvroy et al. for FPGA implementation in [13]. Their design uses dedicated 18-Kbit dual-port Block RAMs (BRAMs) to implement the SubByte and MixColumn operations. Bui et al. [10] presented a 32-bit ultra-low-power, low-area AES architecture based on minimising the number of registers and the control logic through the reorganisation of both the data-path and the key expansion. The authors of [10] exploited the DES S-box to reduce the power consumption of their proposed 32-bit AES. Two 8-bit architectures optimised in terms of area and power consumption, with only one or two S-boxes, are introduced in [14,15], respectively. Bani-Hani proposed a 32-bit compact AES for embedded applications with tight area and power budgets. The MixColumn and SubByte operations are combined into a single operation thanks to the use of FPGA BRAMs. An addressable shift register with dynamic access to its internal bit content is also utilised in this design [16]. A 32-bit AES architecture is suggested in [17] which employs a low-switching-activity ShiftRow operation and also occupies a low implementation area. Shreedhar et al. proposed an ultra-small implementation of AES with a low gate count [18].
To reduce the AES gate count, they reused the area-critical circuits, that is, only one S-box circuit and one 32-bit MixColumn circuit, repeatedly. In addition, they cascaded the input without extra multiplexing circuits and also minimised the ShiftRow operation. To utilise the silicon area and battery power efficiently, a small-area AES and a Cyclic Redundancy Check, as the two most used approaches for ensuring security and reliability, are implemented on one core by Noor and John in [11]. They shared the Galois Field Computation Unit between the two designs to reduce the implementation area so that the energy consumption can also be reduced.

| State of the art of fault-resilient hardware implementation for AES
Generally, to mitigate the fault effect in digital circuits, three categories of techniques are proposed, that is, correction, detection and recovery. These techniques are often used to reduce the effect of natural faults. Among all these techniques, fault detection is widely performed to deal with the malicious fault(s) in the fault attacks. In these approaches, some recovery schemes are addressed in software [2,19,20].
A fault-detection technique based on partial duplication is proposed for a 128-bit AES architecture in [19]. While this technique imposes about 25% area overhead and a negligible throughput overhead on the unprotected AES architecture, it checks only 32 bits for faults in each clock cycle, and faults in the other bits are not checked. A fault-detection technique called recomputing with permuted operands (REPO) for AES is introduced in [2].
REPO performs a computation on the input data followed by a re-computation on the permuted input data to detect faults. This technique is able to detect both permanent and transient faults in all operations of AES, but it causes a high throughput degradation (about 50%). In the fault detection technique by Mestiri et al. in [21], the AES round function is divided into two pipeline stages. Each stage is used for the encryption and re-encryption processes in two consecutive clock cycles. The outputs of encryption and re-encryption are compared for transient fault detection. This technique is not able to detect permanent faults. A high-throughput fault-resilient architecture for 128-bit AES (HFA) is introduced in [22]. The HFA is constructed of four equivalent blocks, each of which is divided into two pipeline stages. Fault detection is performed by comparing the results of each pipeline stage in its computation and re-computation modes. A concurrent and lightweight fault-tolerant scheme based on parity checking is integrated into a parallel pipeline implementation of 128-bit AES in [23]. There are also research works that introduce hybrid fault detection techniques for AES [24,25]. Yu and Heys employed a hardware-redundancy-based fault detection technique for SubByte, while a technique based on parity checking is used for the other AES operations [24]. In the fault detection technique proposed in [25], an operation, a round function or the entire cryptographic process is executed, followed by its inverse, and the result is compared with the input. The main disadvantage of this technique is its high implementation cost.
Correction techniques are more desirable than detection. Several AES architectures with correction capability have been proposed so far. Cheltha et al. proposed a fault-correction technique based on the Hamming code for the AES architecture in [26], and a hybrid redundancy fault-correction technique for the AES S-box is given in [20]. Although these techniques provide a high level of reliability, they are not strong enough against fault attacks.
Robustness against fault attacks is a challenging issue in correction techniques that deserves more attention. In [27], three fault-tolerant architectures for 128-bit AES are introduced: configurable fault-tolerant AES (CFTA), robust CFTA (R-CFTA), and high-throughput CFTA (HT-CFTA). These architectures offer different levels of fault tolerance and robustness against fault attacks. R-CFTA is the most robust and reliable of the three. A high-throughput fault-tolerant hardware implementation of the AES S-box based on hybrid redundancy is proposed in [28].

| PROPOSED ARCHITECTURE
In this section, the proposed architecture with error correction capability is introduced for 32-bit AES as a case study. It should be noted that, although we use this case study, the proposed technique is generally applicable to the pipelined implementation of any digital circuit, especially similar block ciphers and hash functions.
The proposed architecture is depicted in Figure 1(a). The AES architecture includes three main parts: AES core; a control unit; and a key generator unit.

| AES core
The AES core is the encryption data-path, which is divided into two pipeline stages. One stage is a TMR implementation of four S-boxes for the SubByte operation, and the other stage is a TMR configuration of a 32-bit MixColumn circuit and 32-bit XOR gates for the 32-bit AddRoundKey, as shown in Figure 1. In our architecture, the ShiftRow operation is executed by the ShiftRow shift-register, which also acts as a pipeline register to store intermediate results. Figures 2(a) and 3(b) illustrate the proposed ShiftRow register and its related control signals, respectively. The 8-bit registers in this shift-register are organised so that its output, that is, the input to the next pipeline stage, is the result of applying the ShiftRow operation to its input.
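To make the byte movement concrete, the following minimal Python sketch applies ShiftRow to a 16-byte state stored column by column, as defined in the AES standard [3]. The function names `shift_rows` and `column` and the list-based state are illustrative assumptions; the sketch does not model the register organisation of Figure 2(a), only which input bytes end up in each 32-bit output column.

```python
def shift_rows(state):
    """Apply AES ShiftRow to a 16-byte state stored column by column.

    state[r + 4 * c] holds the byte in row r, column c. Row r is rotated
    left by r positions, so the output byte at (r, c) comes from (r, (c + r) % 4).
    """
    if len(state) != 16:
        raise ValueError("state must contain exactly 16 bytes")
    return [state[r + 4 * ((c + r) % 4)] for c in range(4) for r in range(4)]


def column(state, c):
    """Return the four bytes (one 32-bit word) of column c."""
    return state[4 * c:4 * c + 4]


if __name__ == "__main__":
    shifted = shift_rows(list(range(16)))
    for c in range(4):
        print(c, column(shifted, c))
```

With the input bytes numbered 0 to 15, output column 0 contains bytes 0, 5, 10 and 15, one from each input column. This is why a 32-bit output word cannot be formed from a single 32-bit input word.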
Each 32-bit block of data that forms a state matrix column is processed through each pipeline stage twice, in two consecutive clock cycles. In this paper, the first pass is named the original computation and the second is called the redundant computation. As can be seen in Figure 1(b), thanks to the TMR configuration of the AES operations in each pipeline stage, data blocks are processed in a fault-tolerant manner in both the original and redundant computations. The TMR configuration inherently masks all transient and permanent faults that occur in a single module [20]. To also mask faults in two modules, we exploit time redundancy by processing data in the original and redundant computation modes. We use two 32-bit registers, the data register and its redundant copy (i.e. the temp1 register), to store the 32-bit output of the fault-tolerant MixColumn and AddRoundKey operations in the original and redundant computation modes, respectively. In turn, two shift-registers, that is, the ShiftRow shift-register and the temp2 shift-register as the main register and its redundant copy, retain the output of the TMR configuration of the SubByte operation in the original and redundant computation modes. Figure 2(b) depicts the ShiftRow shift-register, the temp2 shift-register and how their contents are compared through comparators. As shown in the figure, we need two 32-bit comparators to compare their inputs and outputs. This is because the ShiftRow operation needs all 128 bits of the state matrix, so the 32-bit input to the next pipeline stage is different from the 32-bit data currently being processed in the SubByte operation, that is, the input to the ShiftRow operation. In fact, the proposed countermeasure is based on error detection in the fault-tolerant implementation of each pipeline stage, combined with a kind of time redundancy.
If a transient error is detected as a mismatch between the outputs of the original and redundant computation modes of a pipeline stage, the corresponding error signal, that is, $err_1$ or $err_2^1$, is activated and the control unit prevents the data-path registers from loading the incorrect data, so that the previous correct register contents are preserved until the error disappears. The system continues to process the previous expected data. The error signals of each stage report an error only when the same errors occur in at least two modules of that stage and the error is latched in the corresponding registers.
If the error signal $err_2^2$ reports an error, the secondary reset is set by the control unit and the encryption process is re-initialised. This case occurs only when there is an error in an 8-bit output register of the ShiftRow shift-register or the temp2 shift-register.
There is no need for an additional recovery procedure in our proposed design except when the error signal $err_2^2$ becomes '1'.
The whole encryption (decryption) process takes 95 clock cycles. In Figure 3(b), we show how the different columns (32-bit) of the state matrix are processed through the pipeline stages of our architecture, together with the related control signals. In this figure, $s_i^j$ is the i-th column of the state matrix in round j, with j = 0…11; Err-mask allows each stage's error signal to be checked when that stage is in redundant computation mode; and the control signals Load_I and Load_II are the enables for the ShiftRow shift-register (temp1 register) and the temp2 shift-register (Data register) in the original and redundant computations, respectively. As shown in Figure 3(b), in the normal situation the process switches between original and redundant computations every other clock cycle. The proposed voting scheme, which is composed of two independent voting elements, is described in the next two subsections.
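The masking property of a TMR stage can be illustrated with a short Python sketch. It uses the conventional bitwise 2-out-of-3 majority function rather than the voter circuit proposed in this paper, and it models a fault as an XOR mask on one module's 32-bit output. The names `tmr_stage` and `add_round_key`, and the key value, are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def majority(a, b, c):
    """Bitwise 2-out-of-3 majority of three 32-bit words."""
    return ((a & b) | (a & c) | (b & c)) & MASK32


def tmr_stage(operation, word, fault_masks=(0, 0, 0)):
    """Run three copies of `operation` and vote on their outputs.

    fault_masks models faults as XOR masks applied to each module's output
    (a mask of 0 means the module is fault-free).
    """
    outputs = [(operation(word) ^ mask) & MASK32 for mask in fault_masks]
    return majority(*outputs)


def add_round_key(word, key=0x0F1E2D3C):
    """Toy 32-bit AddRoundKey; the key value is arbitrary."""
    return (word ^ key) & MASK32


if __name__ == "__main__":
    golden = add_round_key(0xDEADBEEF)
    # Any fault confined to one module is masked by the vote.
    print(tmr_stage(add_round_key, 0xDEADBEEF, (0xFFFFFFFF, 0, 0)) == golden)
    # The same fault in two modules defeats a plain majority vote.
    print(tmr_stage(add_round_key, 0xDEADBEEF, (0x1, 0x1, 0)) == golden)
```

The second case, identical faults in two modules, is the situation that a plain majority vote cannot mask and that the time redundancy described above is meant to catch.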

| Majority TMR voter
In the new proposed voting structure (Figure 4(a)), if the first or the third replica becomes faulty, D input of the flip-flop goes to high logic value '1' which selects the outputs of the second replica as the correct output. Otherwise, the output of the first replica (the main module) is selected as the correct output. In fact, D input of the flip-flop is at a low logic value '0' when either the second replica is faulty or there is no faulty replica.
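The selection rule above can be checked against the conventional majority function with a small Python sketch. Figure 4(a) is not reproduced here, so the sketch assumes that the D input is asserted wherever the first and third replicas disagree (a bitwise XOR). That is one plausible reading of "the first or the third replica becomes faulty", not a confirmed description of the circuit. The flip-flop and all timing are ignored, and the function names are illustrative.

```python
from itertools import product


def select_voter(r1, r2, r3):
    """Mux-style voter: pick replica 2 where replicas 1 and 3 disagree."""
    d = r1 ^ r3                      # assumed D input: 1 where replica 1 or 3 is faulty
    return (r2 & d) | (r1 & ~d)      # D = 1 selects replica 2, D = 0 selects replica 1


def majority(r1, r2, r3):
    """Conventional 2-out-of-3 majority, used here only as a reference."""
    return (r1 & r2) | (r1 & r3) | (r2 & r3)


if __name__ == "__main__":
    same = all(select_voter(a, b, c) == majority(a, b, c)
               for a, b, c in product((0, 1), repeat=3))
    print(same)
```

Under this assumption the mux-style selection gives the same result as a majority vote for every combination of single-bit replica outputs, while needing only one comparison and one multiplexer per bit.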
Note that we have merged the recovery controller unit and the conventional majority voter in our proposed TMR system in order to reduce the area overhead and improve time efficiency. Compared with [29], a significant reduction in area overhead is achieved, as more than 20 transistors are omitted.
This voting mechanism, depicted in Figure 4(b), has three control signals that control the clock of the D flip-flop. Counter I, Counter II and Counter III are the outputs of the three counters (one per replica), and each goes to the high logic value '1' when a complete counting period is over. Once the three counter signals are at the high logic value, the flip-flop is clocked. The rest of the voting mechanism is the same as in Figure 4(a). The two Muller C-elements shown in Figure 4(b) synchronise the replicas and enable the recovery mode. In fact, when the three counter signals are at logic value '1', the clock signal of the flip-flop goes high, and it returns to logic value '0' as soon as all the counter signals are at logic value '0' (this happens when the counter circuits are reset at the end of the recovery mode). The voter design depicted in Figure 4 Figure 4(c).
Implementing the Muller C-element is an issue in itself, especially on FPGA platforms. We address this issue in Figure 4(c), where a different method is proposed to implement and synthesise a Muller C-element using a simple memory element (i.e. a flip-flop). This method differs from other existing implementations [30,31]. In addition, the output is set to '0' and it does not need initialisation. The proposed C-element is subject to metastability if the clock and clear signals rise or fall at the same time, as in 00 to 11 or 11 to 00 transitions. However, such transitions cannot occur in our design for the following reasons: 1. The input signals to the Muller C-element change once in every n clock cycles, which is not fast enough to cause metastability in the C-element. 2. The NOT gate's delay is sufficiently smaller than the duration of the signals' transitions, so the NOT gate propagates the changes before the transitions end. Asynchronous circuits have been gaining a lot of attention in recent years [32-36]. For this reason, we have designed our proposed Majority-voter so that it can be used in both the synchronous and asynchronous realisation of digital circuits.
It is worth mentioning that several fault-tolerant techniques have previously been proposed to improve the reliability of combinational and sequential circuits. Triple modular redundancy is one of the well-studied techniques and is widely used in reliable circuit design. To the best of our knowledge, all of the existing TMR approaches suggested for synchronous circuits use the global clock signal to make sure that the corresponding outputs of the different modules are being voted on. However, the global clock is no longer the best synchroniser because of process variations, clock skew, etc. Herein, an alternative to the global clock net is suggested which makes the conventional TMR technique applicable to both synchronous and asynchronous system designs. Since TMR is the most well-known fault-tolerant technique, it is of great importance to make it applicable to asynchronous circuits. Employing independent clock nets for the different replicas and using a counter for synchronisation leads to a novel TMR technique that, together with the proposed hazard-free voter, forms a complete high-performance TMR design. Independent clocking is the key feature that makes the synchronous circuits different from asynchronous ones; therefore, we suggest that each replica in a TMR system uses its own clock net, including the clock generator and clock tree. This approach increases the need for an alternative mechanism (the global clock performs this task in synchronous circuits) that guarantees that voting is performed at certain synchronisation points. Thus, a regular counter is employed and loaded with an arbitrary value n; it stops the circuit's normal operation once in every n clock cycles and triggers the recovery mode, in which state restoration is performed.
FIGURE 3 Timing diagram of the: (a) ShiftRow & temp1 shift-register control signals; (b) proposed architecture
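For readers unfamiliar with the Muller C-element, its behaviour can be summarised by a short Python model. This is a purely behavioural sketch: it does not represent the flip-flop-based implementation of Figure 4(c), and it models neither delays nor metastability. The class name is illustrative.

```python
class MullerCElement:
    """Behavioural model of an n-input Muller C-element."""

    def __init__(self, initial=0):
        self.output = initial

    def update(self, *inputs):
        """Set the output when all inputs are 1, clear it when all are 0, otherwise hold."""
        if not inputs:
            raise ValueError("at least one input is required")
        if all(inputs):
            self.output = 1
        elif not any(inputs):
            self.output = 0
        return self.output


if __name__ == "__main__":
    # Three counter signals rising one by one and then falling one by one.
    element = MullerCElement()
    sequence = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1),
                (0, 1, 1), (0, 0, 1), (0, 0, 0)]
    print([element.update(*signals) for signals in sequence])
```

Applied to three counter signals, the output rises only once all three are high and falls only once all three are low again, which matches the clocking behaviour described for the flip-flop of the voter.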
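The idea of counter-based synchronisation points can be sketched in Python under strongly simplifying assumptions: each replica has its own fixed clock period, counts n of its own cycles and then waits; voting takes place when the slowest replica has finished; and the recovery mode takes no time, after which all counters restart together. The function name and the example periods are hypothetical.

```python
def synchronisation_points(clock_periods, n, t_end):
    """Return the times at which all replicas have completed n local clock cycles.

    clock_periods: one clock period per replica (independent clock nets).
    Each replica counts n of its own cycles and then waits; voting happens
    once the slowest replica has finished, after which all counters restart.
    """
    if n <= 0 or any(period <= 0 for period in clock_periods):
        raise ValueError("n and all clock periods must be positive")
    points = []
    start = 0
    while True:
        vote_time = max(start + n * period for period in clock_periods)
        if vote_time > t_end:
            return points
        points.append(vote_time)
        start = vote_time


if __name__ == "__main__":
    # Three replicas with slightly different clock periods, voting every 4 cycles.
    print(synchronisation_points((10, 11, 12), n=4, t_end=150))
```

In this model the voting instants are set by the slowest replica, so no global clock is needed to decide when the three outputs may be compared.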
Since CMOS devices are active devices, hazards may be generated on the replicas' outputs even if all the inputs are stable and hazard-free [37]. Hazards can also be generated by unequal propagation delays on paths that originate from the same inputs [38].
As depicted in Figure 5, the replicas' outputs may differ from each other for a moment because of a difference in phase offset (mesochronous system). Moreover, each replica's output may be momentarily erroneous because of a hazard. Therefore, a voter that is not hazard-free may spread the hazard to its output. To address this problem, we propose a hazard-free voter. Without protection, hazards on the replicas' outputs could propagate to the voter's output; to suppress them, we insert a C-element into the voter's structure. This C-element compares the D input of the flip-flop with a delayed version of it in order to mask any transient fault on it. The delay value depends on the duration of the hazard; the use of four back-to-back inverters is suggested in [39], which is claimed to suppress hazards with durations shorter than 0.9 ns. We have reduced the area overhead by omitting the whole recovery controller unit [29] and replacing it with two Muller C-elements. Hence, not only is the area overhead reduced in terms of transistor count, but the recovery mode is also sped up, which improves the circuit's time efficiency.
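The hazard-suppression principle, a C-element fed by a signal and a delayed copy of it, can be illustrated with a discrete-time Python sketch. Time is quantised into arbitrary steps, the delay is expressed in those steps, and the signal is assumed to have held its first sample value before the trace starts. None of the numbers correspond to the 0.9 ns figure from [39], and the function name is illustrative.

```python
def hazard_filter(samples, delay):
    """Filter a sampled binary signal with a C-element fed by the signal and a delayed copy.

    samples: list of 0/1 values, one per time step.
    delay: delay of the second C-element input, in time steps.
    The signal is assumed to have held samples[0] before the first sample.
    """
    if delay < 0:
        raise ValueError("delay must be non-negative")
    output = samples[0] if samples else 0
    filtered = []
    for t, current in enumerate(samples):
        delayed = samples[t - delay] if t >= delay else samples[0]
        if current == delayed:      # both inputs agree: the C-element follows them
            output = current
        filtered.append(output)     # otherwise it holds its previous value
    return filtered


if __name__ == "__main__":
    glitch = [0, 0, 1, 1, 0, 0, 0, 0, 0, 0]      # a two-step hazard
    step = [0, 0, 1, 1, 1, 1, 1, 1]              # a genuine transition
    print(hazard_filter(glitch, delay=3))
    print(hazard_filter(step, delay=3))
```

A pulse shorter than the delay never reaches the output, whereas a genuine transition passes through after a latency equal to the delay. The delay therefore trades hazard immunity against voter latency.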

| Error barrier Element (DMR voter)
The proposed Error Barrier Element performs the two tasks of a voter in a DMR configuration: holding the previous correct state when faced with a mismatch in any pipeline stage of the design, and changing the vote signal's value when both modules produce the same output. In other words, when the outputs of the two replicas differ, which means that an error has occurred, the voter holds the previous value until the two replicas' outputs agree again. The proposed DMR voter is shown in Figure 4(d).
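The hold-on-mismatch behaviour can be captured in a few lines of Python. This is a word-level behavioural sketch and not the circuit of Figure 4(d); the class and method names are illustrative.

```python
class ErrorBarrierElement:
    """Word-level model of a DMR voter that holds its last agreed value."""

    def __init__(self, initial=0):
        self.value = initial

    def vote(self, replica_a, replica_b):
        """Return (output, error). The output changes only when both replicas agree."""
        error = replica_a != replica_b
        if not error:
            self.value = replica_a
        return self.value, error


if __name__ == "__main__":
    barrier = ErrorBarrierElement()
    print(barrier.vote(5, 5))   # replicas agree: the output follows them
    print(barrier.vote(7, 6))   # mismatch: the previous value is held
    print(barrier.vote(7, 7))   # agreement restored: the output is updated
```

Unlike a majority voter, this element cannot tell which replica is wrong. It only stops a suspect value from moving forward, which is why it acts as a barrier rather than as a corrector.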

| Control unit
In the proposed architecture, control signals must be generated at specific times to establish the desired scheduling, as shown in Figure 3(a) and (b). There are three different types of signals:
• Sequential control signals, that is, register-related signals such as Load_I, Load_II and MUX1-sel-MUXr6-sel, and the Err-mask signal: these control signals must be activated or deactivated on the negative edge of the clock signal after the error detection process.
• Combinational control signals, such as the data-path MUX select signals: these control signals must be activated or deactivated shortly after the positive edge of the clock signal on which the error flag is specified.
• Reset signal: this signal must be activated after an uncorrectable error is detected.
To generate these control signals, the control unit consists of three parts: $CU_1$, $CU_2$ and $CU_3$.
The first part, $CU_1$, is responsible for generating the first type of control signals, that is, the sequential control signals. As we illustrate in Figure 1(b), Err-mask is the signal that determines which error flag must be considered in each clock cycle. This unit is triggered on the negative edge of the clock signal.
$CU_2$ is the second part of our control unit. It is fed by a delayed clock and generates the control signals for the combinational part of the data-path, for example the select inputs of the multiplexers depicted in Figure 1(b). The combinational control signals must be determined after the error flag is specified, so the clock signal of $CU_2$ is delayed by the delay of the error signal generator circuit, that is, the comparator circuit, of the intended pipeline stage. It should be noted that the control signals of these units preserve their previous values when an error is detected by $Err_1$ and $Err_2$, except for the enables of the 8-bit registers that are not input registers in the ShiftRow and temp1 shift-registers. These enable signals are set to '0' to prevent an incorrect state from being loaded. After the error disappears, they change to their next correct values. The third part of the control unit, named $CU_3$, generates the reset signal for the whole design. When an uncorrectable error is detected in the ShiftRow shift-register (secondary reset) or an external reset occurs, the reset is activated and consequently all sequential parts of the data-path are reset.

| Key generator
The key generator unit, as a main part of the AES encryption, takes the input secret key and generates round keys for all computational rounds, that is 10 rounds [3]. In our architecture, round keys are pre-computed and stored in a RAM. In fact, each pair of 32-bit data and 32-bit related key are loaded in each clock cycle.
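As an illustration of pre-computed round keys, the following Python sketch runs the standard AES-128 key expansion from the AES specification [3] and returns the 44 round-key words as a list that stands in for the key RAM. A 128-bit key is assumed, which is consistent with the 10 rounds mentioned above. The S-box is computed algebraically only to keep the listing short, and the function names and the list-based "RAM" are illustrative assumptions that say nothing about the actual memory organisation.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
        b >>= 1
    return product


def sbox_byte(value):
    """Compute one S-box entry: multiplicative inverse followed by the affine map."""
    inverse = 0
    if value:
        inverse = 1
        for _ in range(254):          # value^254 is the inverse in GF(2^8)
            inverse = gf_mul(inverse, value)
    result = inverse
    for shift in range(1, 5):         # XOR with the inverse rotated left by 1..4 bits
        result ^= ((inverse << shift) | (inverse >> (8 - shift))) & 0xFF
    return result ^ 0x63


SBOX = [sbox_byte(v) for v in range(256)]


def sub_word(word):
    """Apply the S-box to each byte of a 32-bit word."""
    return int.from_bytes(bytes(SBOX[b] for b in word.to_bytes(4, "big")), "big")


def expand_key(key):
    """Return the 44 round-key words of AES-128 (the modelled key RAM)."""
    if len(key) != 16:
        raise ValueError("AES-128 needs a 16-byte key")
    words = [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(4)]
    rcon = 0x01
    for i in range(4, 44):
        temp = words[i - 1]
        if i % 4 == 0:
            temp = ((temp << 8) | (temp >> 24)) & 0xFFFFFFFF   # RotWord
            temp = sub_word(temp) ^ (rcon << 24)               # SubWord and Rcon
            rcon = gf_mul(rcon, 0x02)
        words.append(words[i - 4] ^ temp)
    return words


if __name__ == "__main__":
    key_ram = expand_key(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))
    print(len(key_ram), hex(key_ram[4]), hex(key_ram[43]))
```

With the example key from the AES specification, word 4 is a0fafe17 and word 43 is b6630ca6. In a 32-bit data-path, one such word would be read per clock cycle alongside the matching state column.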

| ERROR SIMULATIONS
If a one-, two- or three-bit error appears in the logic of AddRoundKey, SubByte or MixColumn, our fault-tolerant scheme can tolerate it, and the error coverage (EC) in these cases is 100%. We design our 32-bit architecture so that it inherently masks all single-bit (byte) permanent and transient faults thanks to the identical blocks in each pipeline stage. If a fault appears in one block in the original computation (see Figure 6(a)) or the redundant computation (see Figure 6(b)), the correct output is obtained due to the majority vote. Our architecture can also tolerate any single error in the registers. Note that for the ShiftRow shift-register, because all four 32-bit words, that is, all 128 bits of data, are needed, the 32-bit data applied to the next stage is different from the data being processed in the redundant computation in the current stage. Due to this difference, we cannot expect an error in the ShiftRow shift-register to disappear by re-processing data in that stage. In this case, if any mismatch between the contents of the ShiftRow shift-register and the temporary shift-register is observed, the control unit asserts the secondary reset flag to re-initialise the encryption. It is worth mentioning that the single-byte fault is one of the most important and practical fault models used by attackers in fault injection attacks [2].
Our proposed architecture also assures the correct output in the case of two or even three faulty blocks in each pipeline stage's logic. This is because in these cases (see Figures 7 and 8), in which the output of one computation, whether original or redundant, is erroneous, the error flag of the corresponding pipeline stage signals the error to the control unit. In these situations, the control unit holds the correct previous output of the DMR voter in all pipeline stages until the errors disappear. So, our architecture remains operational in the presence of two or three simultaneous faults in the combinational logic of each stage, as depicted in Figures 7 and 8.
An attacker may inject several multiple-bit transient faults into cryptographic hardware to gain secret information [2]. To evaluate the fault-tolerance capability of our proposal, extensive error simulations are performed using the Modelsim simulator. Different cases of multiple-bit and burst bit-flip transient faults are injected into our design. In our fault model, random faults of different sizes are injected at random locations, that is, the inputs or outputs of the operations, at random clock cycles of the encryption process. The size, location and time of the faults are determined by a random number generator. The considered fault model covers both natural faults and fault injection attacks. The results of our error simulations for faults of different sizes (multiple or burst transient faults) are reported in Figure 9. In this figure, the EC is calculated by the following equation: EC is a measure of all the covered errors, that is, the errors that our architecture can tolerate. In other words, in the presence of uncovered errors our architecture cannot continue to encrypt data correctly. Our simulations show that our architecture is able to tolerate more than 99.99% of the total injected faults. The probability of the error ($err_2^2$) that activates the secondary reset flag (SRR) is calculated as: As mentioned previously, our architecture does not need any recovery process in most cases (more than about 99.99% of all cases, as shown in Figure 9).
According to the results of our error simulations (Figure 9), the proposed architecture provides a high level of reliability and, as a consequence, security for the AES design against both natural and malicious faults.
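The interplay of majority voting and re-computation can be sketched in Python as follows. The model is deliberately simplified: a stage is a pure function on 32-bit words, faults are XOR masks on the three module outputs, and a mismatch between the original and redundant results causes the pair of computations to be repeated. The names `process_with_retry` and `toy_stage` and the fault masks are hypothetical, and the real pipeline scheduling is not modelled.

```python
from itertools import chain, repeat

MASK32 = 0xFFFFFFFF
CLEAN = (0, 0, 0)


def majority(a, b, c):
    """Bitwise 2-out-of-3 majority."""
    return (a & b) | (a & c) | (b & c)


def voted(operation, word, masks):
    """TMR output when each module's result is XORed with its fault mask."""
    return majority(*((operation(word) ^ m) & MASK32 for m in masks))


def process_with_retry(operation, word, fault_schedule):
    """Repeat original/redundant computation pairs until both voted outputs agree.

    fault_schedule yields one 3-tuple of fault masks per clock cycle.
    Returns (accepted_output, clock_cycles_used).
    """
    cycles = 0
    schedule = iter(fault_schedule)
    while True:
        original = voted(operation, word, next(schedule))
        redundant = voted(operation, word, next(schedule))
        cycles += 2
        if original == redundant:      # no mismatch: the register may load
            return original, cycles
        # mismatch: the previous register contents are held and the data is reprocessed


def toy_stage(word):
    """Stand-in for the logic of one pipeline stage."""
    return (word ^ 0xA5A5A5A5) & MASK32


if __name__ == "__main__":
    # All three modules are hit during the original computation only.
    schedule = chain([(0x7, 0x7, 0x7), CLEAN], repeat(CLEAN))
    print(process_with_retry(toy_stage, 0x12345678, schedule))
```

In this example the faulty original result disagrees with the clean redundant result, so the data is processed again and the correct value is accepted after four cycles instead of two. The sketch also shows the limit of the approach: identical faults in both computations would go unnoticed.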
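To show how a coverage figure of this kind can be tallied, the following Python sketch runs a toy random fault-injection campaign. Two assumptions are made explicit: the coverage is taken to be the percentage of injections after which the output is still correct, which is a common definition and not necessarily the exact equation used for Figure 9, and the target is a plain TMR stage without time redundancy rather than the proposed architecture. The numbers it produces therefore say nothing about the real design. All names are illustrative.

```python
import random

MASK32 = 0xFFFFFFFF


def majority(a, b, c):
    """Bitwise 2-out-of-3 majority."""
    return (a & b) | (a & c) | (b & c)


def random_fault_mask(rng, size):
    """Return a 32-bit mask with `size` randomly placed bit flips."""
    mask = 0
    for position in rng.sample(range(32), size):
        mask |= 1 << position
    return mask


def error_coverage(trials, faulty_modules, fault_size, seed=0):
    """Percentage of injections for which a plain TMR stage still outputs the golden value."""
    rng = random.Random(seed)
    covered = 0
    for _ in range(trials):
        golden = rng.getrandbits(32)
        masks = [0, 0, 0]
        for module in rng.sample(range(3), faulty_modules):
            masks[module] = random_fault_mask(rng, fault_size)
        if majority(*(golden ^ m for m in masks)) == golden:
            covered += 1
    return 100.0 * covered / trials


if __name__ == "__main__":
    for modules in (1, 2):
        print(modules, error_coverage(10000, faulty_modules=modules, fault_size=3))
```

Faults confined to one module are always covered in this toy model. Faults in two modules are covered only when their bit positions do not overlap, which is the gap that the re-computation and error barrier are intended to close.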

| IMPLEMENTATION RESULTS
The proposed design of the 32-bit AES architecture is created with a focus to enhance reliability and security and to achieve a small area. To evaluate the design features of the proposed architecture is described in Verilog, simulated using Modelsim  Table 1. We implement the S-boxes using memories (ROM-base) and the ones presented in [41] (Composite field-based) in two different proposals, named RB-AES and CFB-AES.
There are many AES implementations in the literature but a few of them focussed on the 32-bit ASIC implementation. Most of the research works proposed 128-bit AES architectures to attend high throughput. Therefore, we pick a set of different presented AES architectures targeted for a small area to draw a comparison with our architectures in terms of area, dynamic power, operating frequency and throughput [10,12,14,15,18,40]. Some of the selected architectures in this set have been developed with the aim of achieving a high level of reliability in a small area (i.e. most of the 128-bit architectures), similar to our aims [20,22,27]. Table 1 lists the evaluation and comparison design features for some of the similar 8-, 32-and 128-bit AES implementations.
The 8-bit architectures reported in [14,15,18] achieve the smallest implementation area, that is 1.45-4 K GE. However, these designs sacrifice throughput to reduce the area further. In other words, they need a large number of clock cycles to encrypt a 128-bit plain-text, that is 160-336. Table 1 shows that the architecture presented in [40] utilises 5.5 K GE, which makes it the smallest but slowest (i.e. 28 Mbps) design among all 32-bit architectures. According to the results reported in Table 1, the proposed CFB-AES offers about 8 times more throughput than the 32-bit architectures in [10,40]. Moreover, it occupies only 6.1 K GE, which is about 10% more than the architecture in [40], the smallest 32-bit implementation.
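The arithmetic behind these comparisons can be sketched in a few lines of Python. The relation between throughput, clock frequency and cycles per block is the standard one for an iterative block cipher; the 100 MHz clock used in the example is a purely hypothetical value chosen for illustration and is not taken from Table 1, and the function names are our own.

```python
def throughput_mbps(freq_mhz, cycles_per_block, block_bits=128):
    """Throughput of an iterative cipher: bits per block times blocks per second.

    With the frequency in MHz the result is directly in Mbps.
    """
    return block_bits * freq_mhz / cycles_per_block


def relative_overhead_percent(value, baseline):
    """Relative increase of `value` over `baseline`, in percent."""
    return 100.0 * (value - baseline) / baseline


if __name__ == "__main__":
    # Hypothetical 100 MHz clock; cycle counts are the 8-bit range quoted above.
    for cycles in (160, 336):
        print(cycles, "cycles ->", round(throughput_mbps(100, cycles), 1), "Mbps")
    # Area figures quoted in the text: 6.1 K GE (CFB-AES) versus 5.5 K GE ([40]).
    print("area overhead:", round(relative_overhead_percent(6.1, 5.5), 1), "%")
```

At a fixed clock frequency, throughput falls in inverse proportion to the number of cycles per block, which is why the 8-bit designs pay for their small area with low throughput. The area figures quoted in the text give an overhead of roughly 10.9%, consistent with the "about 10%" stated above.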
The implementation results of the classical TMR and QMR configurations of 32-bit AES are also reported in Table 1; these are the two most popular fault-tolerant techniques for error correction. The fault-resilient architectures in [20,22,27] are 128-bit AES architectures and therefore offer higher throughput than any of the 32-bit architectures considered. Our architectures have a much smaller implementation area but a lower frequency and throughput than [22]. However, the architecture proposed in [22] is capable only of error detection and cannot perform any correction.
The numbers of errors that can be tolerated by the fault-tolerant architectures, namely classical TMR, QMR, [20,27] and our proposed architecture, are compared in Figure 10. Our architecture assures masking of up to three simultaneous transient faults (single-bit (byte), burst and/or multiple) in each pipeline stage, that is, the AddRoundKey, SubByte and MixColumn operations, thanks to its modular structure and in accordance with our analysis of Figures 6-8. In addition, our architecture can tolerate all single-bit (byte) faults in the ShiftRow operation; such faults activate error flags in the second stage of the pipeline. As can be seen in Figure 10, our architecture tolerates the highest number of faults (up to three simultaneous) among all fault-tolerant architectures in the AddRoundKey, MixColumn and SubByte operations.
Compared with all the other 128-bit fault-tolerant [20,22,27] or even unprotected [12] AES implementations, our proposals use a much smaller area, that is 6.1 and 9.9 K GE versus 14.6, 15.7, 16.6 and 14.6 K GE.
According to Table 1 and Figure 10, the proposed architectures outperform both the TMR and QMR configurations in terms of area, frequency and the number of errors that can be tolerated. The proposed architectures offer less throughput than the implementations in [20,22,27]. It should be noted that, in the trade-off between throughput and area, we favour area efficiency over high throughput. On the other hand, the proposed architectures spend some resources in order to correct close to 100% of errors.
Based on the error simulation and implementation results, and on comparisons with other similar works (see Figure 10 and Table 1), we conclude that our fault-tolerant architectures allow a proper trade-off between the reliability, security and implementation cost of the AES for a wide range of resource-constrained applications.

| CONCLUSIONS
We have proposed a low-cost fault-resilient 32-bit AES architecture based on hybrid redundancy. A novel voting scheme, including a hazard-free TMR voter and a new DMR voter, has also been introduced. The proposed architecture achieves high fault resilience with limited hardware and frequency overheads. Our error injection at the simulation level shows that the proposed architecture attains a very high error correction ability, close to 100%.
Moreover, our proposal has been implemented on ASIC and compared with a number of low-cost architectures from the literature. Our implementation results show that the proposed 32-bit architecture provides a high level of reliability and, as a result, security in a very small implementation area, with little reduction in throughput compared to other similar 32-bit architectures.
The proposed solution is also applicable to the pipelined implementation of any digital circuit, especially in resource-constrained applications in which a long error-recovery delay cannot be tolerated.