Automated Generation of Fault-Resistant Circuits

Fault Injection (FI) attacks, which involve intentionally introducing faults into a system to cause it to behave in an unintended manner, are widely recognized and pose a significant threat to the security of cryptographic primitives implemented in hardware, making fault tolerance an increasingly critical concern. However, protecting cryptographic hardware primitives securely and efficiently, even with well-established and documented methods such as redundant computation, can be a time-consuming, error-prone, and expertise-demanding task. In this research, we present a comprehensive and fully-automated software solution for the Automated Generation of Fault-Resistant Circuits (AGEFA). Our application employs a generic and extensively researched methodology for the secure integration of countermeasures based on Error-Correcting Codes (ECCs) into cryptographic hardware circuits. Our software tool allows designers without hardware security expertise to develop fault-tolerant hardware circuits with pre-defined correction capabilities under a comprehensive fault adversary model. Moreover, our tool applies to masked designs without violating the masking security requirements, in particular to designs generated by the tool AGEMA. We evaluate the effectiveness of our approach through experiments on various block ciphers and demonstrate its ability to produce fault-tolerant circuits. Additionally, we assess the security of examples generated by AGEFA against Side-Channel Analysis (SCA) and FI using state-of-the-art leakage and fault evaluation tools.


Introduction
The already well-advanced integration of embedded systems into our daily lives highlights the critical need for reliable guarantees with respect to the confidentiality of sensitive data processed by these systems. Cryptographic primitives, such as block ciphers, are well-established methods that ensure the confidentiality of data both at rest and in transit. However, the implementation of cryptographic primitives in hardware presents significant challenges that are yet to be fully resolved. Physical attacks, such as passive Side-Channel Analysis (SCA) attacks [Koc96] and active Fault Injection (FI) attacks [BDL97], have emerged as severe attack vectors due to the physical accessibility that embedded devices offer to potential adversaries.
In the context of SCA, an adversary records a certain physical characteristic of the device during the execution of a cryptographic primitive and subsequently exploits the relation between the measurements and the processed data to recover secret information. Examples include but are not restricted to [Koc96, KJJ99, GMO01, HS13, GST14]. Masking [CJRR99], as a well-studied countermeasure against SCA based on secret sharing [Sha79], has gained a considerable amount of attention from the scientific community due to its simple security assumptions and adversary models. The d-probing model [ISW03] and its extension to cover physical defaults, known as the robust probing model [FGP+18], are the commonly used adversary models and form the basis for security notions such as probing security and composability, which make it possible to formally prove the security of masked implementations. However, designing and implementing a properly masked cryptographic primitive is complex, error-prone, and tedious. Many examples of insecure masking schemes can be found in the literature [MMSS19], highlighting the need for automated solutions that reduce manual interaction and the potential for human error. One such solution is the Automated Generation of Masked Hardware (AGEMA) [KMMS22], which automates the generation of masked hardware circuits by translating unprotected gate-level netlists into masked hardware circuits, albeit with an associated increase in circuit size and latency compared to manually crafted masked designs. To narrow the performance gap in terms of area and latency between automatically-generated masked designs and manual approaches, various optimizations have been introduced and implemented into new software tools like the Automated Generation of Masked Nonlinear Components (AGMNC) [WFP+23] or Compress [CGM+23].
In addition to passive SCA attacks, active adversaries employ techniques that deliberately introduce faults into the operation of the device to compromise its security [BDL97]. Such attacks can be carried out through various fault injection methods, i.e. by shortening the period of particular clock cycles (clock-glitching) [ADN+10], by altering the power supply (voltage-glitching) [SGD08], by electromagnetic pulses (Electromagnetic Fault Injection (EMFI)) [SSAQ02], or by focused laser beams [SA02]. The injected faults can be exploited by a variety of concrete techniques including Differential Fault Analysis (DFA) [BS97], Differential Fault Intensity Analysis (DFIA) [GYTS14], Statistical Fault Attack (SFA) [FJLT13], Fault Sensitivity Analysis (FSA) [LSG+10], Ineffective Fault Attack (IFA) [Cla07], and Statistical Ineffective Fault Attack (SIFA) [DEK+18]. The first generic and comprehensive hardware fault adversary model in [RSG23] abstracts any fault, regardless of its location, type, or timing, by replacing a gate whose output is faulty with another gate realizing the functionality of the faulty gate. The faulty circuit is then compared with a fault-free circuit to determine whether the fault is propagated to the primary output. Incorporating redundancy into the system can be an effective strategy for countering FI attacks. This redundancy allows faulty states to be compared with fault-free states, thereby detecting or correcting faults introduced during the attack. One way of incorporating redundancy is to use binary linear codes, which can be implemented in a variety of ways. Importantly, the fault tolerance of an implementation depends on the specification of the underlying code used.
• Schemes based on Error-Detecting Codes (EDCs) are used to detect a pre-defined number of faults. If a fault is detected, the device discards the faulty output and may eliminate the key if a certain threshold is reached [AMR+20].
• Schemes based on Error-Correcting Codes (ECCs) are used to correct a pre-defined number of faults. Hence, an adversary will always observe fault-free outputs as long as the injected fault is covered by the underlying fault model, defined by the employed code [SRM20].
Since EDC-based schemes do not protect against SIFA [DEK+18], this work focuses on the application of ECC-based schemes to block ciphers, as exemplarily and manually done in [SRM20].

Our Contributions
To the best of our knowledge, there is no AGEMA-equivalent tool for protecting a given design against FI attacks, i.e. there is no tool for automatically generating fault-tolerant circuits. We fill this gap by introducing an extension to the current security-aware hardware design flow, namely Automated Generation of Fault-Resistant Circuits (AGEFA).
In particular, our tool, which is publicly available via GitHub¹, has the following key features.
• AGEFA enables the fully automated translation of unprotected hardware designs, i.e. designs without any countermeasures against FI, into provably fault-tolerant hardware circuits. In particular, the fault-tolerance of AGEFA's outputs, even against SIFA, can be exhaustively verified by cryptographic fault-analysis tools, e.g., VerFI [AWMN20]. For this work, we have chosen to exclusively focus on symmetric ciphers, aiming to ensure a comprehensive verification process. While there are no inherent limitations to applying AGEFA to, e.g., post-quantum cryptography, the potentially extensive outcomes would pose significant challenges for verification. As a result, AGEFA enables even inexperienced engineers without deep knowledge in the field of hardware security to reliably protect any design against FI attacks.
Concretely, AGEFA provides automated solutions for the following problems:
- Searching for an efficient ECC that is not only correct but also tailored precisely to the designer's specific requirements can be a time-consuming task when approached manually. In Section 3.2, we present an automated procedure aimed at expediting the process of finding an efficient ECC tailored to the individual requirements of the user.
- Implementing a fault-resistant design, which relies on the generated ECC, is a task fraught with a high risk of implementation flaws. Even minor oversights can compromise the security of the entire design. In Section 3.3, we outline a procedure that automates the implementation process, thereby eliminating human errors.
- Furthermore, AGEFA can incorporate several non-trivial optimizations, typically performed manually. This capability enables the generated designs to achieve an efficiency level comparable to that of manually crafted designs.
• When AGEFA processes a masked design, e.g. an SCA-secure design generated by AGEMA, it preserves all the security guarantees provided by the masking countermeasure that satisfy Probe-Isolating Non-Interference (PINI) requirements, while adding protection against FI attacks. In practice, this means that an engineer can give an unprotected design without any countermeasures to AGEMA first, and receive an SCA-secure design based on PINI gadgets. Further processing of the resulting design with AGEFA will produce a solution that is SCA- and FI-secure.

Notations
As summarized in Table 1, we use lower-case characters to denote atomic elements, e.g. Boolean variables, and sans-serif fonts to denote Boolean functions. Additionally, we use upper-case bold characters, such as $\mathbf{X}$, to represent a set of elements with cardinality $|\mathbf{X}|$.
We refer to the individual elements of a set using their index, such that $x_i \in \mathbf{X}$ represents the i-th element of $\mathbf{X}$. As an exception, we use upper-case characters to denote multi-bit variables such as $X \in \mathbb{F}_2^n$.
² We enumerate all k-bit input and output chunks and store the number related to w as its linear index. This is introduced in Section 3.1.

Circuit Model
We formalize the operations carried out by a hardware circuit through a Boolean function $\mathbb{F}_2^i \to \mathbb{F}_2^o$ with i inputs and o outputs. The circuit acquires every input through a signal driven by a physical wire. Subsequently, the corresponding outputs are retrieved from the circuit by reading the signals transmitted through the output wires. Each wire w in a circuit is uniquely defined by its name, while we refer to the signal currently transmitted by a wire as its state. Formally, we associate multiple parameters with each wire and define functions to access them as given in Table 2.
Further, we view a circuit as a composition of separate building blocks called modules.

Definition 1 (Module).
A module M is defined with the following elements:
• $\mathrm{IN}_M$ (resp. $\mathrm{OUT}_M$) defines a set of primary input wires (resp. primary output wires) belonging to M.
• $\mathrm{T}_M$ defines the set of intermediate (internal) wires of M.
• $\mathrm{INST}_M$ defines a set of instructions representing the functionality of M. We define an instruction as a Boolean function based on a restricted set of operators {not, and, nand, or, nor, xor, xnor, reg}.
We refer to an atomic module G as a gate (either combinational or sequential). We focus on sequential circuits, where a clock signal clk synchronizes the processed data by storing the state of wires in ST in registers. Any synchronous sequential circuit (which has no combinational loops) can be modeled as a Mealy machine [Mea55]. We show the schematic model of a generic sequential circuit in Figure 1.
The model incorporates a single register stage created by merging all synchronization elements, such as registers from an arbitrary number of stages. Initially, an active reset signal rst forces the register stage to load the state of the primary inputs (data and control signals) carried by wires in IN. Subsequently, the circuit repeatedly processes the state of ST by feeding the feedback signal state carried by wires in FB back to the combinational logic. The register stage synchronizes the state of all wires in ST using a clock signal clk, while the combinational logic computes the subsequent register inputs based on signals from FB. We refer to the combinational logic together with its subsequent multiplexer as the round function R.

Security Models
The discussion on security models in our work is twofold, encompassing both the SCA and FI adversary models. We remark that SCA adversary models allow us to prove the secure application of masking [CJRR99], which stands as the predominant concept for protecting a circuit against SCA. Therefore, masking is the only SCA countermeasure considered in this work. According to secret sharing [Sha79], masking involves randomizing every sensitive variable $X \in \mathbb{F}_2^n$ with d + 1 uniformly and randomly distributed shares $(X_0, \dots, X_d)$ such that $X = \bigoplus_{i=0}^{d} X_i$. In addition, the circuit must undergo transformation into a masked variant, executing every operation on a subset of shares. However, while we briefly revisit SCA models to provide necessary background information, we focus more extensively on the FI adversary models.
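The sharing step described above can be illustrated with a short Python sketch. It assumes Boolean (XOR-based) sharing, and the function names `share` and `unshare` are hypothetical helpers introduced only for this illustration; they are not part of AGEMA or AGEFA.

```python
import secrets
from functools import reduce
from operator import xor


def share(x: int, n: int, d: int) -> list[int]:
    """Split the n-bit value x into d + 1 Boolean shares (assumed XOR sharing)."""
    # d shares are drawn uniformly at random from {0, ..., 2^n - 1}.
    shares = [secrets.randbits(n) for _ in range(d)]
    # The last share is chosen such that the XOR of all shares equals x.
    shares.append(reduce(xor, shares, x))
    return shares


def unshare(shares: list[int]) -> int:
    """Recombine the shares by XOR-ing them together."""
    return reduce(xor, shares, 0)
```

For instance, `unshare(share(0xA5, 8, 2))` recombines three shares of an 8-bit value into the original value, while any two of the three shares on their own are uniformly distributed.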

Side-Channel Analysis (SCA) Adversary Model
The d-probing adversary model [ISW03] restricts the adversary's capabilities to place a maximum of d probes on the wires of an ideal circuit. Each probe provides access to the intermediate value of a specific wire at a particular point in time, corresponding to a clock cycle.

Definition 3 (d-probing security).
A circuit is d-probing secure iff it does not disclose any sensitive information to any d-probing adversary.
Further, the (g, t, c)-robust d-probing model [FGP+18] extends the d-probing model with the coverage of physical defaults, such as glitches (g), transitions (t), and couplings (c). A (g, t, c)-robust d-probing adversary can place up to d extended probes, each capable of recording all intermediate values on a specific wire that could potentially be leaked due to the physical defaults.
Definition 4 ((g, t, c)-robust d-probing security). A circuit is (g, t, c)-robust d-probing secure iff it does not disclose any sensitive information to any (g, t, c)-robust d-probing adversary.
We remark that the most prominent instances of the robust probing model are the (1, 0, 0)-robust d-probing model, i.e. the glitch-extended probing model, (0, 1, 0), i.e. the transition-extended probing model, and (1, 1, 0), i.e. the glitch- and transition-extended probing model. Except for [CBG+17, CEM18], couplings are not extensively considered, particularly from a theoretical point of view, because they require access to detailed information about the physical realization of the target device, e.g., placement and routing details.

Composability
As designs have continued to increase in complexity, it has become challenging to design and evaluate circuits that are provably robust-probing secure, especially those that require higher-order security. To address this issue, researchers have introduced composable gadgets [CGLS21], which are small and provably secure building blocks that can be composed to create circuits of any order that are also provably secure. In practice, tools such as AGEMA replace unprotected cells with their protected counterparts, i.e. gadgets.

Definition 5 (Composability).
A robust-probing secure gadget is composable iff its arbitrary composition with other robust-probing secure and composable gadgets also results in a robust-probing secure circuit.
For a set of gadgets to be composable, each gadget must individually satisfy particular security requirements, and all possible combinations of the gadgets must also be in conformity with the underlying security model. As explained in detail below, PINI [CS20] is one such composable security notion.
Definition 6 (Perfect Probe Simulation). Let P be a set of d (extended) probes on a gadget. P can be perfectly simulated with a set S of input shares if a probabilistic polynomial-time simulator can be found that computes a joint probability distribution based on S, which is identical to the distribution of P.
Definition 7 (d-Probe-Isolating Non-Interference (PINI)). Let P be a set of $d_0$ (extended) probes on a gadget, and S a set of $d_1$ (extended) probes on the gadget's primary outputs, while O contains all output share indices probed by the probes in S. A gadget is d-PINI iff for every P ∪ S with $d_0 + d_1 \le d$, there exists a set I of $d_0$ share indices such that the wires probed by probes in P ∪ S can be perfectly simulated from input shares with indices in I ∪ O [CS20].

Fault Injection (FI) Adversary Model
Based on [RSG23], we consider a formal model for the adversary whose abilities are denoted by $\zeta(f, s, l)$, where $f \in \{1, 2, \dots, |G|\}$ denotes the maximum number of faults that can be simultaneously injected into different gates, $s \in \{\tau_{sr}, \tau_s, \tau_r, \tau_{bf}, \tau_{fm}\}$ the fault type, i.e. stuck-at ($\tau_{sr}, \tau_s, \tau_r$), bit-flip ($\tau_{bf}$), or custom ($\tau_{fm}$) faults, and $l \in \{c_i, m, mc_i\}$ the possible fault locations, which are usually a restricted set of gates, defined by $l = c_i \subseteq c_\infty$, and registers, defined by $l = m$. In this work, we focus on a conservative (worst-case) adversary model that allows toggle faults ($s = \tau_{bf}$), including stuck-at faults as well, at arbitrary locations of the circuit, such as gates and registers ($l = mc_\infty$), but excluding primary signals³. Each fault is modeled by replacing the faulty cell with a predefined cell that implements another functionality, i.e. for $s = \tau_{bf}$ a faulty gate gets replaced by its negated variant. Hence, for certain given inputs X, the faulty circuit may produce a faulty intermediate state $\tilde{Q}$ instead of Q, while for other inputs the fault gets suppressed. Then, the model compares the primary output of the faulty circuit, $\tilde{Z}$, with that of an equivalent circuit without any fault, Z, to determine whether the injected fault was effective (cf. Definition 8) or ineffective (cf. Definition 9).
Definition 8 (Effective fault). We consider an injected fault as effective, w.r.t. a primary input X, iff $\tilde{Z} \neq Z$, i.e. if the injected fault affects the primary output when X is processed.
Definition 9 (Ineffective fault). We consider an injected fault as ineffective, w.r.t. a primary input X, iff $\tilde{Z} = Z$, i.e. if the injected fault does not affect the primary output when X is processed.
As our focus is solely on error correction, we can disregard separate definitions of detected and undetected faults. We consider a circuit to be secure under a fault model $\zeta(f, s, l)$ if all the considered faults will be corrected, i.e. are ineffective.
Definition 10 ((f, s, l)-fault security). A circuit is (f, s, l)-fault secure if all considered faults are ineffective.
We remark that $(f, \tau_{bf}, mc_\infty)$-fault security, which we consider in this work, is equivalent to security under the multivariate adversary model defined in [SRM20] and that security according to Definition 10 also implies security against SIFA.
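To make the distinction between effective and ineffective faults concrete, the following Python sketch evaluates a toy netlist with and without a bit-flip fault, modeling the fault by swapping the affected gate for its negated variant. The two-gate netlist, the wire names, and the helpers `evaluate` and `classify` are assumptions made purely for illustration and do not correspond to any tool interface.

```python
from itertools import product

GATES = {
    "and": lambda a, b: a & b,
    "nand": lambda a, b: 1 ^ (a & b),
    "or": lambda a, b: a | b,
    "nor": lambda a, b: 1 ^ (a | b),
}
# A bit-flip fault replaces a gate by its negated variant.
NEGATED = {"and": "nand", "nand": "and", "or": "nor", "nor": "or"}

# Toy netlist in topological order: (output wire, gate type, input wires).
# It computes z = (a AND b) OR c.
NETLIST = [("q", "and", ("a", "b")), ("z", "or", ("q", "c"))]


def evaluate(netlist, inputs, faulty_wires=frozenset()):
    """Evaluate the netlist; gates driving a wire in faulty_wires are negated."""
    state = dict(inputs)
    for out, gate, ins in netlist:
        if out in faulty_wires:
            gate = NEGATED[gate]
        state[out] = GATES[gate](*(state[w] for w in ins))
    return state


def classify(netlist, fault, output="z", names=("a", "b", "c")):
    """Label the fault as effective or ineffective for every primary input."""
    result = {}
    for bits in product((0, 1), repeat=len(names)):
        x = dict(zip(names, bits))
        z = evaluate(netlist, x)[output]
        z_faulty = evaluate(netlist, x, {fault})[output]
        result[bits] = "effective" if z_faulty != z else "ineffective"
    return result
```

In this toy circuit, a fault on the AND gate is effective whenever c = 0 and ineffective whenever c = 1, so whether the output is disturbed depends on the processed data. A circuit that is fault secure in the sense of Definition 10 would report every considered fault as ineffective for every input.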

Fault Propagation and Independence
Whenever an adversary injects faults in f intermediate gates, it is possible that more than f output signals of the circuit become faulty [AMR+20]. To illustrate, let us assume that the adversary injects a fault in one intermediate gate inside the circuit. The faulty output of the gate may become the input of multiple subsequent gates, so that the fault propagates to multiple primary outputs. This phenomenon is known as fault propagation. To prevent fault propagation, every module must satisfy the independence property [AMR+20], meaning that every intermediate wire in a module should contribute to at most one primary output of the circuit.
Definition 11 (Independence). The implementation of a module M inside a circuit fulfills the independence property iff $|\mathrm{OUT}_M| = 1$, i.e. if M computes a single output.
If the independence property is satisfied, every introduced fault can make at most one output signal faulty.
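The effect of the independence property can be sketched with two functionally equivalent toy netlists in Python: one shares an intermediate gate between two outputs, the other duplicates it so that every output has its own logic cone. The netlists and the helper `max_faulty_outputs` are illustrative assumptions, not AGEFA code.

```python
from itertools import product

OPS = {"and": lambda a, b: a & b, "xor": lambda a, b: a ^ b}


def evaluate(netlist, inputs, flipped=None):
    """Evaluate the netlist; the gate driving wire `flipped` outputs the negated value."""
    state = dict(inputs)
    for out, op, ins in netlist:
        value = OPS[op](*(state[w] for w in ins))
        state[out] = value ^ 1 if out == flipped else value
    return state


def max_faulty_outputs(netlist, outputs, names):
    """Worst-case number of faulty outputs over all inputs and all single-gate faults."""
    worst = 0
    for bits in product((0, 1), repeat=len(names)):
        x = dict(zip(names, bits))
        good = evaluate(netlist, x)
        for gate_out, _, _ in netlist:
            bad = evaluate(netlist, x, flipped=gate_out)
            worst = max(worst, sum(good[o] != bad[o] for o in outputs))
    return worst


# Shared logic: the wire t contributes to both outputs z0 and z1.
SHARED = [("t", "and", ("a", "b")),
          ("z0", "xor", ("t", "c")),
          ("z1", "xor", ("t", "d"))]

# Independent logic: each output owns a private copy of the AND gate.
INDEPENDENT = [("t0", "and", ("a", "b")), ("z0", "xor", ("t0", "c")),
               ("t1", "and", ("a", "b")), ("z1", "xor", ("t1", "d"))]
```

With inputs `("a", "b", "c", "d")` and outputs `("z0", "z1")`, a single fault can corrupt two outputs of `SHARED` but at most one output of `INDEPENDENT`, at the cost of one duplicated gate.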

Error-Correcting Codes (ECCs)
To achieve (f, s, l)-fault security, an error-correction mechanism is required. Usually, error correction is based on redundancy and binary linear codes.
Definition 12 (Binary linear code). A binary linear code C is a k-dimensional subspace of $\mathbb{F}_2^n$ representing a linear and injective mapping function $\mathsf{C}: \mathbb{F}_2^k \to \mathbb{F}_2^n$.
As C is linear, we can formalize the mapping by a (k × n)-matrix, referred to as the generator matrix G, i.e. it holds that $X \cdot G = Y$. Furthermore, we refer to a code as systematic iff it embeds the message into the first k bits of the codeword, i.e. G is of the form $[I_k | P]$, with $I_k$ being the k × k identity matrix. In this context, where we are dealing with binary linear systematic codes (denoted as [n, k]-codes in the following), we refer to the codeword associated with the message $X \in \mathbb{F}_2^k$ as $Y = X|X' \in C$, while $X'$ denotes the parity of X. Formally, the parity is computed as $X' = X \cdot P$, while P again is the matrix representation of a linear and injective mapping $\mathsf{P}: \mathbb{F}_2^k \to \mathbb{F}_2^{n-k}$. To prove the injectivity of a linear function, we refer to Lemma 1.
When transmitting a codeword $Y \in C$ over an unreliable communication channel, the receiver can obtain a faulty codeword $\tilde{Y} = (X \oplus E)\,|\,(X' \oplus E')$, where $E|E'$ is the error vector that affected the transmission. The number of faults that can be corrected within a single codeword depends on the code's minimum distance δ, i.e. $\delta = \min_{Y \neq Z \in C} \mathrm{HD}(Y, Z) = \min_{Y \in C \setminus \{0\}} \mathrm{HW}(Y)$. Therefore, we denote an [n, k]-code satisfying a minimum distance of δ as an [n, k, δ]-code.
Here, HD(Y, Z) denotes the Hamming Distance (HD) of two different codewords Y and Z, i.e. the number of bits which differ between Y and Z, while HW(.) denotes the Hamming Weight (HW) of a codeword.
To correct the faults, $\tilde{Y}$ is replaced by its nearest valid codeword, i.e. the codeword with minimal distance to $\tilde{Y}$. Accordingly, $\tilde{Y}$ will only be correctly replaced by Y if the distance between Y and $\tilde{Y}$ is still smaller than the distance between $\tilde{Y}$ and other codewords. This directly leads to Lemma 2. Hence, an [n, k, δ]-code can correct at most $f = \lceil \delta/2 \rceil - 1$ faulty bits in a single codeword.
Lemma 2. An [n, k, δ]-code can correct all faulty codewords $\tilde{Y} = (X \oplus E)\,|\,(X' \oplus E')$ with $\mathrm{HW}(E) + \mathrm{HW}(E') < \delta/2$.
The translation from a faulty codeword to its nearest valid codeword can be done by means of syndrome decoding. For a valid codeword $Y = X|X'$ it holds that $X' = X \cdot P$ and therefore $X' \oplus X \cdot P = 0$. For a faulty codeword $\tilde{Y} = (X \oplus E)\,|\,(X' \oplus E')$ with the error vector $E|E'$, we denote $(X' \oplus E') \oplus (X \oplus E) \cdot P = E' \oplus E \cdot P$ as a syndrome. Once the syndrome is calculated, the Syndrome Decoder (SD) compares it to a pre-calculated lookup table of syndromes and their corresponding error vectors $\mathsf{S}: E' \oplus E \cdot P \mapsto E|E'$. If the syndrome matches one of the entries in the table, it means that an error has occurred and the decoder can use the associated error vector to correct the error. Based on syndrome decoding, we can correct faulty codewords as shown in Figure 2. We call the place where such a correction is made a correction point.
We remark that S is split into two functions, namely $\mathsf{S}_0: E' \oplus E \cdot P \mapsto E$ and $\mathsf{S}_1: E' \oplus E \cdot P \mapsto E'$, meaning that both receive the syndrome. Syndrome decoders $\mathsf{S}_0$ and $\mathsf{S}_1$ predict E and E' correctly as long as the fault can be corrected by the underlying C. Hence, it must hold that $\mathrm{HW}(E) + \mathrm{HW}(E') < \delta/2$ to correct $\tilde{Y}$ with an [n, k, δ]-code C.
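The syndrome-decoding procedure described above can be sketched in Python as follows. The sketch assumes a systematic [8, 4, 4] code whose parity matrix `P` is chosen here only for illustration (it is not a code selected by AGEFA); the error vector is split into a message part E and a parity part, and the syndrome is the XOR of the parity part with E · P. All function names are hypothetical.

```python
K, N = 4, 8  # message length and codeword length of the assumed [8, 4, 4] code

# Parity matrix P (k rows, n - k columns); the generator matrix is [I_k | P].
P = [(0, 1, 1, 1),
     (1, 0, 1, 1),
     (1, 1, 0, 1),
     (1, 1, 1, 0)]


def xor(a, b):
    """Bitwise XOR of two equally long bit tuples."""
    return tuple(u ^ v for u, v in zip(a, b))


def parity(x):
    """Compute the parity X * P over GF(2)."""
    return tuple(sum(x[i] & P[i][j] for i in range(K)) % 2 for j in range(N - K))


def encode(x):
    """Systematic encoding: message followed by its parity."""
    return tuple(x) + parity(x)


def syndrome(y):
    """Received parity XOR recomputed parity; all-zero for valid codewords."""
    return xor(y[K:], parity(y[:K]))


# Lookup table: syndrome -> error vector, for all errors of weight <= 1.
TABLE = {}
for pos in [None] + list(range(N)):
    error = tuple(int(i == pos) for i in range(N))
    TABLE[syndrome(error)] = error


def correct(y):
    """Replace a faulty codeword by its nearest valid codeword."""
    error = TABLE.get(syndrome(y))
    if error is None:
        raise ValueError("more faults than the code can correct")
    return xor(y, error)
```

Since the assumed code has minimum distance 4, it corrects a single faulty bit per codeword, regardless of whether the fault hits the message part or the parity part. In hardware, the lookup table would be realized as combinational logic rather than as a dictionary.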

Majority Voting
A simple yet effective correction point derived from an [n, 1, δ]-code with $\mathsf{C}: \mathbb{F}_2 \to \mathbb{F}_2^n$ and $\mathsf{C}(x) = \{x\}^n$ can correct up to $f = \lceil \delta/2 \rceil - 1$ faults in a codeword with an odd codeword size of n = δ through the application of a technique known as Majority Voting (MV). In essence, MV involves examining each bit within the codeword and identifying the majority value among them. By aligning the bits with the majority decision, the correction point effectively corrects up to f faults in the codeword. Typically, the correction points are designed to receive the redundantly generated results of a system, which often involves obtaining the same output from n independent and parallel instances. For example, consider a scenario where three cipher cores simultaneously process identical inputs. In this setup, the correction points are responsible for evaluating these redundant outputs in order to correct one fault per output bit. Since the independent instances operate without sharing any signals, the occurrence of a fault in one instance does not impact the outputs of the other instances. Additionally, given that each output bit is corrected individually, there is no necessity for the instances to adhere to the independence property, which reduces the area overhead associated with MV to a factor of n due to the presence of these n parallel instances plus an additional MV circuit. If the correction is applied exclusively to the system's output, the effectiveness of MV is constrained as it can only correct a total of f faults throughout the entire execution of the system, as opposed to f faults per clock cycle. Consequently, to attain the desired level of fault security denoted as $(f, \tau_{bf}, mc_\infty)$, the correction logic must be capable of addressing the cumulative sum of faults that an adversary could potentially inject during the system's execution. This necessitates correcting a much larger number of faults, rendering the corresponding overhead arising from redundancy impractical for our needs.
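As a minimal sketch of the bit-wise majority decision described above, the following Python function votes over the outputs of an odd number of redundant instances. The function name `majority_vote` and the integer encoding of the outputs are assumptions made for illustration only.

```python
def majority_vote(outputs: list[int], width: int) -> int:
    """Bit-wise majority over the outputs of n redundant instances (n odd)."""
    n = len(outputs)
    if n % 2 == 0:
        raise ValueError("majority voting requires an odd number of instances")
    result = 0
    for bit in range(width):
        # Count how many instances report a 1 at this bit position.
        ones = sum((value >> bit) & 1 for value in outputs)
        if ones > n // 2:
            result |= 1 << bit
    return result
```

With three instances, `majority_vote([0b1011, 0b1010, 0b0010], 4)` tolerates one faulty instance per output bit: bit 0 is wrong in the first instance and bit 3 is wrong in the third, yet both are outvoted by the remaining two instances.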

Impeccable Circuits
Aghaie et al. apply the aforementioned [n, k, δ]-codes in Concurrent Error Detection (CED) schemes for hardware circuits [AMR+20]. The presented strategy limits the fault propagation in a way that the circuit guarantees the detection of up to f faults at arbitrary positions when employing an [n, k, δ]-code. Again, f refers to the number of simultaneously injected faults as introduced in Section 2.3.3 and applied in Definition 10. As detection of faults is not enough to prevent SIFA, a follow-up work shows a scheme for correcting faults at arbitrary positions by applying an [n, k, δ]-code [SRM20]. To implement the error correction scheme, we commence with a general sequential circuit, as illustrated in Figure 1. To simplify matters, we assume |IN| mod k = 0 ∧ |FB| mod k = 0, meaning that the input state of the round function and the feedback state are both multiples of k bits.
In accordance with the [n, k, δ]-code, each k-bit message X, carried by a dedicated k-bit segment of IN or FB, can be encoded into its respective codeword $Y = X|X'$ by applying the parity mapping function $\mathsf{P}$, as defined in Section 2.4.1. For example, for a state of q · k bits, we define a function $\mathsf{F}: \mathbb{F}_2^{q \cdot k} \to \mathbb{F}_2^{q \cdot (n-k)}$. This function maps a collection of q messages, each composed of k bits, to their corresponding q parities, each comprised of n − k bits. Specifically, F applies P individually to each of the q messages. While the straightforward procedure mentioned above applies to data signals, it is essential to carefully consider control signals, which are typically processed by a finite-state machine and play a critical role in the round function. Changes in the control flow can potentially enable an adversary […].
Further, we assume that F is an injective function⁴. Therefore, the redundancy size must be at least as large as the message size, implying that n ≥ 2k. Additionally, since F is an injective function, the redundant counterpart of the round function R can exclusively operate on the parities of the input signals. This results in a redundant round function denoted as $\mathsf{R}' = \mathsf{F} \circ \mathsf{R} \circ \mathsf{F}^{-1}$, operating on the set of parities belonging to the input signals of the round function. In particular, we denote the set of wires driving the redundancy state of the feedback signals as FB'. To construct a full correction point for FB, we employ the same principle as for defining F to establish $\mathsf{SD}_0: \mathbb{F}_2^{q \cdot (n-k)} \to \mathbb{F}_2^{q \cdot k}$ and $\mathsf{SD}_1: \mathbb{F}_2^{q \cdot (n-k)} \to \mathbb{F}_2^{q \cdot (n-k)}$. These functions are responsible for decoding the syndromes of q parity chunks. As a result, we can apply q correction points in parallel to correct the input state of the round function. Figure 3 shows the schematic of the circuit from Figure 1 after the application of such a CED.
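The construction of the redundant round function from the original round function and an injective parity mapping can be sketched in Python for a single 4-bit chunk (q = 1, k = 4, n = 8). The parity matrix and the 4-bit permutation standing in for the round function are illustrative assumptions, as are all names in the code.

```python
# Rows of an assumed invertible 4 x 4 parity matrix, stored as bit masks:
# row i is XOR-ed into the parity whenever message bit i is set.
P_ROWS = [0b0111, 0b1011, 0b1101, 0b1110]


def parity(x: int) -> int:
    """Compute the 4-bit parity of the 4-bit message x over GF(2)."""
    p = 0
    for i, row in enumerate(P_ROWS):
        if (x >> i) & 1:
            p ^= row
    return p


# Arbitrary 4-bit permutation used as a stand-in for the round function R.
ROUND = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
         0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]

# F as a truth table; injectivity is what makes the inverse well defined.
F = [parity(x) for x in range(16)]
assert len(set(F)) == 16, "the parity mapping must be injective"
F_INV = {p: x for x, p in enumerate(F)}

# Redundant round function: decode the parity, apply R, re-encode the result.
ROUND_RED = [F[ROUND[F_INV[p]]] for p in range(16)]
```

By construction, `ROUND_RED[F[x]] == F[ROUND[x]]` holds for every 4-bit x, so the redundant data path can track the parity of the state without access to the state itself. In a real design the table `ROUND_RED` would be synthesized into its own combinational logic, so that no intermediate signal is shared with the original round function.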
It is important to note that, contrary to MV as explained in Section 2.4.2, the round functions R and R', together with the correction logic, must be implemented in a way that the propagation of a single intermediate fault to multiple faults within ST or ST' is prevented. This necessitates their implementation in a manner that ensures the independence property. In the case of an ECC with δ = 2f + 1, the output of $\mathsf{SD}_0$ remains unchanged even when up to f faults are introduced at its input. It means that $\mathsf{SD}_0$ does not propagate any fault E with HW(E) < f. Therefore, it is feasible to instantiate F and the corresponding XOR operation separately. However, this does not hold true for $\mathsf{SD}_1$ [SRM20]. Consequently, all sub-circuits enclosed by dashed red lines in Figure 3 must adhere to the independence property. Unfortunately, this may lead to the necessity of incorporating multiple instances of the same correction logic, potentially increasing the area requirements. To mitigate the increase in area overhead, the authors of [SRM20] suggested the insertion of multiple correction points while maintaining the circuit's latency, provided that the round function R can be decomposed. Specifically, they suggest splitting R into two sub-functions $\mathsf{R} = \mathsf{R}_1 \circ \mathsf{R}_0$ in a way that $\mathsf{R}_1$ becomes linear. Further, if $\mathsf{R}_1$ can be effectively represented as a binary matrix composed of elements from the finite field $\mathbb{F}_2^k$, it has been established in [AMR+20] that $\mathsf{R}_1$ exhibits a property of fault non-propagation. Consequently, when implemented in accordance with the aforementioned decomposition, there is no necessity for an additional correction point before the computation of $\mathsf{R}_1$. However, if $\mathsf{R}_0$ and $\mathsf{R}_1$ are both non-linear, this would necessitate additional circuitry to implement multiple correction points. In this case, two individual correction points are required to correct the inputs of $\mathsf{R}_0$ and $\mathsf{R}_1$, respectively. It is worth noting that the round function can be decomposed into multiple sub-functions, as opposed to having just one linear and one non-linear sub-function. An extreme example of this approach is illustrated in [BKHL20], where a correction point is introduced for each non-linear gate. While this strategy simplifies the achievement of the independence property, it is important to consider that the substantial increase in the number of correction points could potentially result in an impractical area overhead, depending on the specific target.
Since we allow an adversary to introduce up to f faults per clock cycle, it is important to note that the number of faults to be corrected between two correction points is effectively doubled, as observed in [SRM20]. To illustrate this, consider an adversary with the capability to inject a single fault per clock cycle. For example, the adversary can fault the input of a specific register in one clock cycle and fault the output of another register in the subsequent clock cycle. In this scenario, the fault introduced at the register's input is propagated to its output, resulting in two simultaneous faults on register outputs. Consequently, the underlying ECC must be designed to correct a total of 2f faults to achieve $(f, \tau_{bf}, mc_\infty)$-fault security.

Technique
In this section, we present AGEFA's procedure (cf. Figure 4) for converting an unprotected implementation into a fault-resistant design, utilizing the technique outlined in Impeccable Circuits II [SRM20]. We have chosen to adopt the scheme due to its assurance of security within our designated fault adversary model, a feat that, practically, cannot be achieved through the application of, for instance, cipher-level MV. Additionally, Impeccable Circuits II provides a high degree of flexibility in its implementation, primarily through the ability to decompose the round function as needed. This flexibility permits us to optimize protection in two vital dimensions: latency and area. On one hand, we can opt for no decomposition to minimize the latency, while on the other hand, we have the option to decompose the round function into multiple sub-functions to reduce the overall area footprint at the cost of higher latency. Moreover, this flexibility allows us to experiment with various configurations, enabling us to identify the most suitable one. This advantage is a direct result of automation and would be unattainable through manual methods.

[Figure 4. Block labels: Fault-unprotected Netlist, Attribute Report, Code Parameters; Attribute Prop., Corr. Point Generation, Generate Layers, Combine Modules Independently, Write Final Design; Fault-protected Design.]

AGEFA follows the process outlined in Figure 4. The process begins with receiving a gate-level netlist written in Verilog. The given netlist implements no protection against FI attacks but can be a masked implementation that is protected against SCA. To prepare the netlist for processing by AGEFA, the designer must first synthesize the behavioral-level description of the design using a synthesizer such as Design Compiler (DC) [Inc] or Yosys [Wol]. Additionally, for every primary input or output wire w, the designer has to set attribute(w) through textual annotations in the netlist in the following manner:
• If wire w should be identified as the clock signal, it must hold that attribute(w) = clock. In this work, we denote the clock signal as clk.
• If wire w should be identified as the reset signal, it must hold that attribute(w) = reset. In this work, we denote the reset signal as rst.
• For every primary input or output wire w which carries a data signal (including plaintext, ciphertext, and key), it must hold that attribute(w) = secure. Likewise, if w carries a control signal, e.g. enable or done signals, it must hold that attribute(w) = control.
• Further, if the round function is a decomposition of h sub-functions, and we abstract every sub-function with a module $R_{h-1}, \ldots, R_0$, it holds that attribute(w) = layer for every intermediate wire w.

Once the annotated netlist is received, AGEFA applies a slightly modified version of AGEMA's parser, resulting in a graph (G, W) that represents the netlist based on the circuit model described in Section 2.2. We straightforwardly render the round function R into a separate set of modules $\mathcal{R}$. Again, each module in $\mathcal{R}$ abstracts one coordinate function of R. If R is decomposed, we create additional coordinate functions for every intermediate wire x with attribute(x) = layer. More concretely, if attribute(x) = layer, we add a new module X to $\mathcal{R}$ while $OUT_X = \{x\}$ and $IN_X$ can contain primary input wires or other intermediate wires w with attribute(w) = layer. Further, we render the register stage into another set of modules $\mathcal{FF}$ while $FF \in \mathcal{FF}$ abstracts a single register. Based on this representation of the circuit, AGEFA performs the following high-level steps.
• To generate a secure and efficient design, AGEFA sets all parameters mentioned in Table 2 for every signal of the circuit.
• Based on the given message size k and fault cardinality 2f , AGEFA automatically finds an efficient ECC including all operations required to construct a correction point.For every operation required for the correction, AGEFA generates a module that satisfies the independence property.
• AGEFA transforms all coordinate functions of the (decomposed) combinational logic in a way that they fulfill the independence property.
• Subsequently, it proceeds to construct the three functions, as indicated by the dashed red lines in Figure 3. In this process, AGEFA connects the generated functions, each of which independently satisfies the independence property. The compositions are designed so that the independence property is upheld.
• Lastly, AGEFA finalizes the protected design by connecting the cascaded functions.

Attribute Propagation
We repeat that the protection mechanism presented in Impeccable Circuits II [SRM20] takes special care of the Finite State Machine (FSM), as injecting faults on the FSM may change the control flow, i.e. it enables an adversary to observe intermediate states of the design. Therefore, AGEFA must distinguish between the protection of control signals and data signals. While all w ∈ IN are already annotated, AGEFA automatically determines attribute(x) for every x ∈ ST and for every x ∈ FB (see Figure 1).
To propagate the attributes of the primary inputs through the circuit, we can apply Algorithm 1 of [KMMS22] on $\mathcal{R}$ and $\mathcal{FF}$. Hence, the following two rules apply.
• For every $R \in \mathcal{R}$, it holds for the output wire w [...]
• For every $FF \in \mathcal{FF}$, it holds for the output wire $w \in OUT_{FF}$ that attribute(w) = attribute(x) if $x \in IN_{FF}$ denotes the input wire with attribute(x) ≠ clock.
If AGEFA processes a masked design operating on multiple shares, we make sure that the error-correction logic does not violate the SCA-security assumptions of the original design. Specifically, if a masked design adheres to probing security (resp. composability) under the robust probing model, the final masked and fault-tolerant design must similarly guarantee probing security (resp. composability) under the same model. Implementing the error-correction logic SCA-securely can be done share-wise, allowing each share domain to be processed independently, thereby satisfying the PINI notion. However, it is crucial to ensure that the error-correction logic does not mistakenly combine shares from different domains. To avoid this, each message (for encoding) must only contain shares from the same share domain, which can be achieved by labeling every w ∈ ST with a corresponding share domain, i.e. share_domain(w). The designer has to provide an additional report assigning all w ∈ ST that carry shared variables to their corresponding share domain share_domain(w)⁵. If a handcrafted masked implementation is given to AGEFA, the designer must manually generate and provide the wires with their respective share domain, i.e. in the synthesis script or the behavioral design, which might be a challenging task. On the other hand, if the design is entirely made of secure and composable gadgets, such as a result of AGEMA, we can annotate the required wires in each gadget separately and propagate the annotation during the synthesis procedure⁶. However, if share_domain(w) for a w ∈ ST is not specified in the report, e.g. in case of attribute(w) = control, we set share_domain(w) = 0, while it holds that share_domain(w) > 0 for all w carrying a shared variable. This annotation ensures that subsequent stages, such as Algorithm 5, encode signals exclusively from the same share domain within a given message. We acknowledge that annotating a masked circuit in this manner can be a tedious task. Nonetheless, if we were to limit the annotation to only primary inputs of the circuit, we would need to exercise greater caution when consolidating signals into the same share domain. This cautious approach would, in turn, lead to a further increase in the area overhead of the final result.

Optimizations
If the round function R can be expressed as the composition of two sub-functions, $R = R_1 \circ R_0$, and $R_1$ can be represented as a binary matrix with elements in $\mathbb{F}_2^k$, there is no need to correct the inputs of $R_1$ [AMR+20, SRM20]. Hence, avoiding an additional correction point leads to a more efficient design with respect to circuit size and latency. Therefore, AGEFA searches in the module-based representation of R, denoted as $\mathcal{R}$, for a subset of modules $\mathcal{R}_1 \subset \mathcal{R}$ whose instructions can be represented as the aforementioned matrix multiplication. This procedure is twofold. First, AGEFA extracts all $R \in \mathcal{R}$ representing linear functions. Then, AGEFA considers all linear $R \in \mathcal{R}$ as a linear layer and checks if the correction at its inputs can be safely removed.
Algorithm 1
  Input: $\mathcal{R}$      ▷ The module-based abstractions of the round function.
  Input: $\mathcal{FF}$     ▷ The module-based abstraction of the register stage.
  Input: IN                 ▷ The set of wires carrying primary input signals
  Output: $\mathcal{R}$     ▷ The round function with marked input and output wires.
  [...]                     ▷ Check if the non-linear output is an input of only linear functions.
  9: [...]                  ▷ Get the register output wire.
  10: for ∀z ∈ Z do
  [...]                     ▷ Mark register outputs that are no inputs of linear functions.
  12: end for
  [...]
  14: for ∀R ∈ $\mathcal{R}_1$ do
  [...]                     ▷ Consider linear layer outputs as inputs of another linear layer.
  16: end for
  [...]
  21: X ← X \ {x}
  22: end for

To identify every $R \in \mathcal{R}$ representing a linear function, we create its Algebraic Normal Form (ANF) based on $INST_R$ and examine whether it exclusively comprises monomials with an algebraic degree of one. If R is confirmed to abstract a linear function, we mark $w \in OUT_R$ by setting linear_index(w) = 1. Otherwise, we set linear_index(w) = 0. Additionally, it has been demonstrated in [AMR+20] that the outputs of a multiplexer stage controlled by rst can be directly integrated into the following linear layer without an additional correction. Thus, we set linear_index(w) = 1 not only when R abstracts a linear function but also in cases where R represents a function with non-linear components limited to multiplexers using rst as the select signal. Hence, it holds that linear_index(w) = 1 if the ANF of R exclusively contains monomials with an algebraic degree of one or monomials with an algebraic degree of two with rst as one of the variables. This approach enables the identification of such multiplexers, even if they are not explicitly represented by dedicated multiplexer modules in the netlist but are expressed in their underlying algebraic form. After this step, the output wire w of every $R \in \mathcal{R}$ is either marked as linear (linear_index(w) = 1) or non-linear (linear_index(w) = 0), and we denote the list of linear modules as $\mathcal{R}_1 = \{R \in \mathcal{R} \mid \exists w \in OUT_R : \text{linear\_index}(w) = 1\}$. Similarly, we denote the non-linear modules as $\mathcal{R}_0 = \mathcal{R} \setminus \mathcal{R}_1$.

To identify if $\mathcal{R}_1$ abstracts one or multiple layer(s) which do not require additional correction, we apply Algorithm 1. Initially, in Lines 3-7, Algorithm 1 creates two sets, X and Y, encompassing all output wires of $\mathcal{R}_0$. In the following Lines 9-14, Algorithm 1 systematically examines each output wire x ∈ X produced by a non-linear function, determining whether its signal is exclusively propagated (through a register, as checked in Line 9) to the inputs of linear coordinate functions. This criterion is satisfied iff the signal carried by x is not propagated to the input of any $R \in \mathcal{R}_0$. To identify such wires, Algorithm 1 marks every input wire of $\mathcal{R}_1$, denoted as z in Line 12, meeting this condition by setting linear_index(z) = 1.
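The linearity check described above (ANF with only degree-one monomials, or degree-two monomials that contain rst) can be sketched in a few lines. This is an illustrative sketch, not AGEFA's implementation: the truth-table representation, the function names `anf_monomials` and `linear_index`, and the `rst_bit` parameter are assumptions made for this example. It reads the criterion literally, so a constant term also yields 0.

```python
def anf_monomials(truth_table, n):
    """Return the ANF of an n-input Boolean function as a set of monomials.

    truth_table[m] is the output for input assignment m, where bit i of m is
    the value of input i. A monomial is a bitmask of the inputs it multiplies.
    """
    coeffs = list(truth_table)
    for i in range(n):                      # binary Moebius transform
        step = 1 << i
        for mask in range(1 << n):
            if mask & step:
                coeffs[mask] ^= coeffs[mask ^ step]
    return {mask for mask, c in enumerate(coeffs) if c}


def linear_index(truth_table, n, rst_bit=None):
    """Return 1 if every monomial has degree one, or degree two including rst."""
    for monomial in anf_monomials(truth_table, n):
        degree = bin(monomial).count("1")
        if degree == 1:
            continue
        if degree == 2 and rst_bit is not None and (monomial >> rst_bit) & 1:
            continue
        return 0                            # constant or higher-degree term
    return 1


XOR2 = [0, 1, 1, 0]                         # x0 ^ x1
AND2 = [0, 0, 0, 1]                         # x0 & x1
# Multiplexer: inputs a (bit 0), b (bit 1), rst (bit 2); output is a if rst else b.
MUX = [(m & 1) if (m >> 2) & 1 else (m >> 1) & 1 for m in range(8)]

print(linear_index(XOR2, 2), linear_index(AND2, 2), linear_index(MUX, 3, rst_bit=2))
```

The multiplexer has the ANF rst·a ⊕ rst·b ⊕ b, so it is accepted only when the position of rst is supplied.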

Algorithm 2
  Input: L                    ▷ The sorted linear layer with marked linear outputs
  Input: $IN_L$, $OUT_L$      ▷ A sorted list of input and output wires of L
  Input: k                    ▷ The message size
  Output: L                   ▷ The linear layer with updated linear outputs and inputs
  1: q ← 0
  2: for ∀w ∈ $IN_L$ do
  3:   linear_index(w) ← ⌊q/k⌋ + 1
  4:   q ← q + 1
  5: end for
  6: for ∀w ∈ $OUT_L$ do
  7:   linear_index(w) ← ⌊q/k⌋ + 1
  8:   q ← q + 1
  9: end for
  10: for ∀q ∈ {0, k, 2k, ..., $|OUT_L|$ − k} do
  11: [...]

Algorithm 3 Generation of a binary, linear, systematic, and injective [n, k, δ]-code.
  Input: k, δ                 ▷ Message size k and minimum distance δ
  Output: C                   ▷ A binary, linear, systematic, and injective [n, k, δ]-code
  [...]                       ▷ Iterate through all possible messages
  3: X′ ← 0                   ▷ X′ stores the parity of X
  4: while ∃Z : [...]
  [...]
  end while
  9: [...]                    ▷ Add the new codeword to the code
  10: end for

In Lines 2-9 of Algorithm 2, each k-bit message derived from the sorted primary wires of L is assigned a distinct linear index greater than 0. This annotation signifies that no correction is necessary. However, Algorithm 2 is responsible for verifying whether the conditions for removing the correction are indeed met. If not, Algorithm 2 resets all linear indices to 0 (cf. Lines 15 and 20). This reset implies that correction is required for all signals. To assess this, AGEFA iterates through all chunks consisting of k modules and stores both their inputs and linear indices. Two conditions must be satisfied to warrant the removal of the correction point.
1. The same input signal (except the signal carried by rst) must not be distributed across multiple coordinate functions forming a k-bit message. This condition is examined in Line 13.
2. Each coordinate function forming a k-bit message should receive the same number of inputs with identical linear indices. This condition is examined in Line 18.
It is important to note that whether the correction logic can be removed depends on the order of the wires in $IN_L$ and $OUT_L$. We specifically examine only the case where $IN_L$ and $OUT_L$ are sorted based on the wire names, although other orderings of $IN_L$ and $OUT_L$ can also lead to a removed correction point. However, validating all these diverse orderings is computationally infeasible. Instead, we choose this sorted representation based on the belief that it is likely to be implemented, given that it results in a k-bit message for k subsequent bits of a larger state. To illustrate, in the context of a cipher with a 64-bit round state $\{x_0, \ldots, x_{63}\}$ and k = 4, we assume that a matrix representation in $\mathbb{F}_2^4$ is implemented by considering sets such as $\langle x_0, x_1, x_2, x_3 \rangle, \ldots, \langle x_{60}, x_{61}, x_{62}, x_{63} \rangle$ as 4-bit messages. In other words, we assume that a designer would interpret the 64-bit state as 16 subsequent 4-bit words. However, if the designer chooses another representation, AGEFA cannot remove the correction point, resulting in a design that is still secure but not as efficient as possible.

Correction Point Generation
The designer specifies the code parameters, i.e. the message size k and the maximum number of faults to correct within one clock cycle, usually set to 2f to achieve $(f, \tau_{bf}, mc_\infty)$-fault security. Utilizing these parameters, AGEFA estimates the appropriate ECC parameters [n, k, δ]. As outlined in Lemma 2 and Section 2.4.3, it holds that n ≥ 2k. Further, the underlying ECC must correct 2f faults, hence δ = 4f + 1.
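The step from 2f correctable faults to δ = 4f + 1 can be written out explicitly. The derivation below is a sketch that relies on the standard bound that a code with minimum distance δ corrects up to ⌊(δ − 1)/2⌋ errors; the symbol t is introduced here only for illustration.

```latex
\begin{align*}
t = \left\lfloor \frac{\delta - 1}{2} \right\rfloor \ge 2f
  \;&\Longleftrightarrow\; \delta \ge 4f + 1, \\
f = 1 \;&\Longrightarrow\; \delta = 5, \quad t = 2 .
\end{align*}
```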

Error-Correcting Code (ECC) Generation
The procedure for finding a binary, linear, systematic, and injective [n, k, δ]-code, denoted as C, which is a vector subspace of $\mathbb{F}_2^n$, is outlined in Algorithm 3. It processes every message X by assigning a parity X′ and generating the codeword Y = X|X′. In Line 5, it is checked if Y can be added to C without violating the code's minimum distance and injectivity, i.e. by checking if the minimum distance of Y and all other codewords of C is sufficient and by ensuring that no other codeword in C shares X′ as parity. As long as Y is not a suitable codeword for C, we increment its parity X′ until the codeword satisfies all the requirements. Algorithm 3 repeats this procedure until all $X \in \mathbb{F}_2^k$ are associated with a corresponding X′ in C. The presented procedure follows a greedy approach for constructing a so-called lexicographic code [CS86, Con90]. We remark that (1) such a code exists for all possible k and δ [BP93], (2) the resulting codes are provably linear, systematic, and injective [Lev60, CS86, BP93], and (3) n is usually minimal [CS86]. The injectivity and systematicity of the code generated by Algorithm 3 follow directly from the algorithm's construction. However, the linearity of the code is not immediately apparent. A lexicographic code, where the codewords are arranged and iterated in lexicographic order, was first proven to be linear by Levenshtein [Lev60]. Brualdi et al. later generalized this proof to lexicographic codes generated using arbitrarily ordered bases of $\mathbb{F}_2^n$, resulting in a lexicographic ordering on the coefficient vectors [BP93]. As Algorithm 3 iterates through the codewords in lexicographic order, the generated codes are also linear by extension of the aforementioned proofs. In summary, Algorithm 3 leads to codes with small parities and thus an efficient encoding. However, the designer is free to force AGEFA to employ a certain ECC that better fits particular needs. From C, we derive the mapping P in the form of a lookup table. We compute the table lookup for an arbitrary message X as X · P = X′. Further, we generate the mapping $P^{-1}(\cdot)$, again as a lookup table, to map arbitrary parities back to their corresponding messages.
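The greedy search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not AGEFA's code: messages and parities are modelled as Python integers, the function names are hypothetical, and the parity width is derived afterwards from the largest parity found.

```python
def hamming_weight(x: int) -> int:
    """Number of set bits in x."""
    return bin(x).count("1")


def fits(msg: int, parity: int, parities: list, delta: int) -> bool:
    """Check minimum distance and injectivity against all stored codewords."""
    for other_msg, other_parity in enumerate(parities):
        if other_parity == parity:          # parity already used by another message
            return False
        distance = hamming_weight(msg ^ other_msg) + hamming_weight(parity ^ other_parity)
        if distance < delta:                # too close to an existing codeword
            return False
    return True


def generate_code(k: int, delta: int) -> list:
    """Return a table with parities[X] = X' for every k-bit message X."""
    parities = []
    for msg in range(1 << k):               # messages in lexicographic order
        parity = 0
        while not fits(msg, parity, parities, delta):
            parity += 1                     # try the next parity
        parities.append(parity)
    return parities


def code_length(k: int, parities: list) -> int:
    """Codeword length n = k + number of parity bits."""
    return k + max(parities).bit_length()


table = generate_code(2, 3)
print(table, code_length(2, table))         # parities of messages 0..3 and n
```

For k = 2 and δ = 3 the search yields the parities 0, 3, 5, 6, i.e. a [5, 2, 3]-code. The inverse mapping is then simply the reversed table, e.g. `{p: x for x, p in enumerate(table)}`.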

Syndrome Decoder (SD) Generation
Algorithm 4 constructs the corresponding SD of the previously generated [n, k, δ]-code C. As shown in Figure 2, we split the SD into two separate functions, namely $S_0 : E' \oplus E \cdot P \to E$ and $S_1 : E' \oplus E \cdot P \to E'$. As previously mentioned, we generate the mappings $S_0$ and $S_1$ as lookup tables. Algorithm 4 generates mappings for all possible error vectors that C can correct, specifically for error vectors with at most $\frac{\delta-1}{2}$ faulty bits. In Line 3, the syndrome is computed based on the error vector. The syndrome is subsequently mapped to the corresponding parts of the error vector and stored as a mapping (lookup table) in $S_0$ and $S_1$. These lookup tables are not complete, meaning that $S_0$ and $S_1$ do not cover all $E|E' \in \mathbb{F}_2^n$. We deal with such cases in the following section.
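The construction of the two syndrome tables can be illustrated with a small sketch. The parity table below is an example [5, 2, 3]-code chosen for this illustration; the function names, the integer encoding of error vectors, and the choice to treat unknown syndromes as "no error" in `correct` are assumptions of the sketch, not taken from AGEFA.

```python
from itertools import combinations

# Example parity table of a linear [5, 2, 3]-code: PARITY[X] is the parity X'.
PARITY = [0b000, 0b011, 0b101, 0b110]
N, K, DELTA = 5, 2, 3


def build_syndrome_decoder(parity, n, k, delta):
    """Map every syndrome of a correctable error E|E' to E (s0) and E' (s1)."""
    s0, s1 = {}, {}
    for weight in range((delta - 1) // 2 + 1):
        for positions in combinations(range(n), weight):
            error = sum(1 << p for p in positions)
            e_msg = error >> (n - k)                 # message part E
            e_par = error & ((1 << (n - k)) - 1)     # parity part E'
            syndrome = e_par ^ parity[e_msg]         # T = E' xor E * P
            s0[syndrome] = e_msg
            s1[syndrome] = e_par
    return s0, s1


def correct(msg, par, parity, s0, s1):
    """Correct a received pair (msg, par); unknown syndromes are left untouched."""
    syndrome = par ^ parity[msg]
    return msg ^ s0.get(syndrome, 0), par ^ s1.get(syndrome, 0)


S0, S1 = build_syndrome_decoder(PARITY, N, K, DELTA)
# Message 3 with parity 6, fault on message bit 0 -> received (2, 6).
print(correct(0b10, 0b110, PARITY, S0, S1))
```

Because the example code is linear, the syndrome of a faulty codeword depends only on the error vector, which is why a table indexed by the syndrome suffices.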

Algebraic Representation
To integrate $P$, $P^{-1}$, $S_0$, and $S_1$ into a hardware design and to facilitate further steps, we convert the lookup tables into a set of modules in accordance with Definition 1. We repeat that a module can abstract complex operations consisting of multiple gates and inputs.

Algorithm 4 Generation of the Syndrome Decoder (SD).
  Input: P, n, k, δ           ▷ Resulting ECC parameters from Algorithm 3
  Output: $S_0$, $S_1$        ▷ The corresponding SD with two mappings
  [...]                       ▷ Generate error vectors to correct
  [...]
  $S_1 \leftarrow S_1 \cup (T, E')$   ▷ Add $S_1(T) = E'$ to the mapping
  6: end for

For every module M with $|IN_M| = n$ inputs and $|OUT_M| = m$ outputs, it holds that it can be formalized by an n-ary vectorial Boolean function with m coordinate functions. However, we can also decompose the Boolean functions in a way that multiple vectorial Boolean functions compute the module's intermediates in $T_M$ while other functions process the intermediates to compute the primary outputs OUT. We denote the resulting sets of modules as $\mathcal{P}$, $\mathcal{P}^{-1}$, $\mathcal{S}_0$, and $\mathcal{S}_1$, while every module in the set abstracts one coordinate function of the respective operation, e.g. $P_i$ denotes the i-th module in $\mathcal{P}$ abstracting the i-th coordinate function of P. Further, it holds that none of the modules stores intermediates, meaning that every module implements a Boolean function processing all input signals to compute a single output.
Since all functions are injective, their corresponding lookup tables may contain don't care values. For example, $S_0$ and $S_1$ certainly have such cases, as explained above. To find the most efficient logic function that represents the lookup table even if it contains don't cares, we apply the Quine-McCluskey algorithm [Qui52, McC56] on every coordinate function of the respective mappings. Hence, every mapping is translated from a lookup table into its minimal sum-of-products form, becoming efficient in terms of circuit size. Finally, we store the generated Boolean function, i.e. the minimal sum-of-products form, using a single instruction, in accordance with Definition 1, into the corresponding module.

Optimizations
Depending on the complexity of the combinational logic, choosing C with minimum n, as shown in Section 3.2.1, may not always be the optimal approach. As an example, consider an arbitrary linear, injective, and systematic [8, 4, 3]-code C which is capable of correcting one fault in an 8-bit codeword with a 4-bit message. It can be shown that it is possible to implement the mapping $P : \mathbb{F}_2^4 \to \mathbb{F}_2^4$, where a message $X : \langle x_3, x_2, x_1, x_0 \rangle$ is mapped to its corresponding parity $X' : \langle x'_3, x'_2, x'_1, x'_0 \rangle$, using four separate coordinate functions $p_0, p_1, p_2, p_3 : \mathbb{F}_2^4 \to \mathbb{F}_2$ as follows.

$x'_3 = p_3(x_3), \quad x'_2 = p_2(x_2, x_1), \quad \ldots$

According to Figure 3 and the explanations given in Section 2.4.3, the redundant part of the round function, i.e. R′, is derived as $F \circ R \circ F^{-1}$. We remark that each coordinate function of R′ should satisfy the independence property. Therefore, the outputs of the sub-functions of $R \circ F^{-1}$ are the inputs of F, i.e. X in the above equations. Due to the independence property, the coordinate functions $p_0$ to $p_3$ cannot share any inputs. As an example, the circuit which computes $x_2$ must be instantiated three times, as $p_0$, $p_1$, and $p_2$ receive $x_2$. In other words, each input of the coordinate functions $p_0$ to $p_3$ should be individually generated. This means that these coordinate functions need in total 9 individual inputs. As a side note, there is no other [8, 4, 3]-code with a smaller number of individual inputs. If we artificially increase the parity size by one bit, we achieve a linear, injective, and systematic [9, 4, 3]-code with message-to-parity mapping $Q : \mathbb{F}_2^4 \to \mathbb{F}_2^5$ which maps a message $X : \langle x_3, x_2, x_1, x_0 \rangle$ to its corresponding parity $X' : \langle x'_4, x'_3, x'_2, x'_1, x'_0 \rangle$ with the following five coordinate functions $q_0, q_1, q_2, q_3, q_4 : \mathbb{F}_2^4 \to \mathbb{F}_2$.

$x'_4 = q_4(x_3), \quad \ldots$

Despite the larger parity size, the coordinate functions of Q need in total only 8 individual inputs. We remark that the number of individual inputs of P (resp. Q) has a decisive influence on the complexity of the design, i.e. circuit size, as R′ needs to employ these message-to-parity mappings. Consequently, a mapping with a minimal number of inputs per coordinate function can avoid multiple instances of the same combinational sub-circuit, potentially saving more area than that of the additional parity. The specific size of the circuit depends on the round function and the syndrome decoders, which generally become more complex as the parity size increases. Therefore, the designer has to balance between fewer instances of individual coordinate functions belonging to the round function and an increasingly complex SD.
To determine the optimal parity size, AGEFA automatically finds the parity size resulting in the smallest number of individual inputs for the message-to-parity mapping. This is done by incrementally increasing the parity size as long as the number of individual inputs decreases and continuing with the code that results in the smallest number of individual inputs. However, it is recommended to apply AGEFA on every design twice, once with these optimizations and once without, to ensure that the optimizations truly result in a decrease in the circuit size. If optimizing the parity size does not lead to a smaller circuit, it can be concluded that the SD becomes too complex and that the smallest possible parity size, i.e. AGEFA without optimizations, is a better choice.
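The metric used in this optimization, the number of individual inputs of a message-to-parity mapping, can be computed directly from a parity lookup table. The sketch below is illustrative only: the function names are hypothetical, and the small [5, 2, 3] parity table is an example chosen for brevity rather than one of the [8, 4, 3] or [9, 4, 3] codes discussed above.

```python
def input_supports(parity_table, k):
    """For every parity bit, return the set of message bits it depends on."""
    width = max(parity_table).bit_length()
    supports = []
    for j in range(width):
        deps = set()
        for i in range(k):
            # Parity bit j depends on message bit i if flipping bit i
            # changes parity bit j for at least one message.
            if any(((parity_table[x] ^ parity_table[x ^ (1 << i)]) >> j) & 1
                   for x in range(1 << k)):
                deps.add(i)
        supports.append(deps)
    return supports


def individual_inputs(parity_table, k):
    """Total number of inputs over all coordinate functions of the mapping."""
    return sum(len(deps) for deps in input_supports(parity_table, k))


EXAMPLE_PARITY = [0b000, 0b011, 0b101, 0b110]   # example [5, 2, 3]-code
print(input_supports(EXAMPLE_PARITY, 2), individual_inputs(EXAMPLE_PARITY, 2))
```

Comparing this count across candidate codes with increasing parity size mirrors the selection rule described in the text; whether the smaller count also yields a smaller circuit still has to be confirmed by synthesis.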

Layer Generation
In contrast to the round function, $P$, $P^{-1}$, $S_0$, and $S_1$ are designed to process a single message (or parity) rather than the entire state ST. To correct ST, we have to decide which signals of ST can be securely corrected within a single message and which signals need to be processed separately. To address this limitation, we extend the modules to operate on ST, as given below.

Generating Redundant State
To accurately encode every signal of the circuit state, it is necessary to construct the redundant counterpart belonging to ST. Due to the importance of control signals, every x ∈ ST with attribute(x) = control builds its own message, while k data signals can be processed within the same message. In particular, we pad every control signal x with k − 1 leading zeros, resulting in an (n − k)-bit parity, denoted as $X' = P(\langle \{0\}^{k-1}, x \rangle)$. In contrast, data signals are encoded as k-bit chunks, each of which is denoted as $X = \langle x_{k-1}, \ldots, x_0 \rangle$ and results in an (n − k)-bit parity, derived as $X' = P(X)$. X is only padded with i leading zeros if only k − i data signals are available, i.e. if the number of data signals is not a multiple of k. Furthermore, special care must be taken when the given design is masked, since the combination of multiple share domains of the same variable within a k-bit message violates the security of masking. To transfer ST into a padded state, denoted as SP, we apply Algorithm 5.
Before executing Algorithm 5, we cluster the unpadded state ST so that all signals with the same attribute, the same share domain, and the same linear index appear in groups one after the other, while control signals make a distinct group. In Line 5 of Algorithm 5, a separate message is generated for each control signal by padding the message with (k − 1) zeros. To indicate leading zeros, we insert dummy signals driving constant zero into SP, which we denote as 0 ∈ SP (see Line 5). As control signals are usually unmasked and are associated with separate messages, their share domain can be neglected. In contrast, data signals (processed in Lines 7-16) cannot be combined in the same message if their share domains or linear indices are different. Therefore, we only combine data signals in a message with the same parameters and start a new message as soon as we reach a data signal with a different share domain or linear index. Finally, if the number of signals in a message is smaller than k, we fill the remaining space with zeros, see Line 19. Note that we process the feedback state FB in the same way to receive a padded feedback state FP. Further, we remark that ST and FB are clustered in the same way, meaning that the signals of a certain register have the same index in both states. For example, $u_i \in SP$ denotes the input signal of the i-th register while $v_i \in FP$ denotes its corresponding output signal. Now, based on the padded data states SP and FP, we can create q · (n − k) signals representing the corresponding parity states SP′ and FP′.

Algorithm 5
  [...]
  if share_domain(w) ≠ q then
  8:   q ← share_domain(w)
  10: [...]
  13:  q ← linear_index(w)
  [...]
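The padding rules described above can be sketched as follows. Since the listing of Algorithm 5 is only partially available here, this is a reconstruction from the prose under assumptions: the `Signal` record, the function `pad_state`, the string "0" for the constant-zero dummy signal, and the ordering of wires inside a message are all choices made for this illustration. The input state is assumed to be clustered already.

```python
from dataclasses import dataclass

ZERO = "0"  # dummy signal driving constant zero


@dataclass(frozen=True)
class Signal:
    name: str
    attribute: str          # "control" or "secure"
    share_domain: int = 0
    linear_index: int = 0


def pad_state(state, k):
    """Split a clustered state into k-signal messages with leading zeros."""
    messages, current, current_key = [], [], None

    def flush():
        if current:
            messages.append([ZERO] * (k - len(current)) + list(current))
            current.clear()

    for sig in state:
        if sig.attribute == "control":
            flush()
            current_key = None
            messages.append([ZERO] * (k - 1) + [sig.name])   # own message
            continue
        key = (sig.share_domain, sig.linear_index)
        if key != current_key or len(current) == k:
            flush()                                          # start a new message
            current_key = key
        current.append(sig.name)
    flush()
    return messages


state = [Signal("en", "control")]
state += [Signal(f"d{i}", "secure", share_domain=1) for i in range(5)]
state += [Signal("e0", "secure", share_domain=2)]
for message in pad_state(state, 4):
    print(message)
```

With k = 4, the control signal, the five signals of share domain 1, and the single signal of share domain 2 end up in four messages, and no message mixes share domains.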

Extending ECC Modules
As depicted in Figure 3, FB must be given as an input state to the functions F of the correction logic, while both F compute the parity state depending on FB. Here, we refer to the sets of outputs of the functions F as U′ and V′. We highlight that both states indicate parity states as |U′| = |V′| = |FP′| and that we derive the wire names of U′ and V′ from FP′ by extending every wire name of FP′ with a unique extension. Both functions F map FP as a state of q concatenated k-bit messages to a state of q corresponding (n − k)-bit parities (U′ and V′). This functionality is achieved through the parallel application of q instances of the modules in $\mathcal{P}$, with each instance processing a single message. Therefore, we start with creating an additional set of modules $\mathcal{F}$. We remark that every $F_i \in \mathcal{F}$ computes exactly one signal of the parity state U′. To instantiate every $F_i \in \mathcal{F}$ we apply Algorithm 6.
Algorithm 6 receives the input state FP, the output state U′, the set of modules $\mathcal{P}$, and the size of one message k, and generates the set of modules $\mathcal{F}$. It investigates $|\mathcal{P}|$ outputs for every k-bit chunk of the input state FP. If the investigated output signal is not driving constant zero (as checked in Line 4), we create a new module that computes the output signal based on the given k-bit input chunk in Line 5. The applied instructions are taken from a particular module of $\mathcal{P}$. As these instructions depend on the input and output signals of P, we replace all signal names in the instructions with the input and output signals of the new module. [...] Afterwards, Algorithm 6 continues with the next k input signals until all messages are processed.
We repeat Algorithm 6 with V′ instead of U′ to receive the modules for the second instance of the function F. The same can be done to produce $F^{-1}$ but with $\mathcal{P}^{-1}$ instead of $\mathcal{P}$ and appropriate input and output states. Further, the same strategy with $\mathcal{S}_0$ (resp. $\mathcal{S}_1$) and appropriate input and output states can be applied to generate the SD modules $\mathcal{SD}_0$ (resp. $\mathcal{SD}_1$). We remark that we do not apply Algorithm 6 to generate modules for the linear layers, as these modules can be straightforwardly generated. For example, if the linear layer receives U′ and FP′ as inputs, the i-th module processes the i-th signal of U′ and FP′ to generate the corresponding single output signal.
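At the level of values rather than netlist modules, applying q instances of P in parallel amounts to a chunk-wise table lookup. The following sketch illustrates only this functional view; it does not generate modules or rename signals as Algorithm 6 does. The function name, the most-significant-bit-first bit order, and the example parity table are assumptions of the sketch.

```python
def encode_state(state_bits, parity_table, k, parity_width):
    """Map q concatenated k-bit messages to q concatenated parities."""
    if len(state_bits) % k != 0:
        raise ValueError("state must be padded to a multiple of k")
    parity_bits = []
    for start in range(0, len(state_bits), k):
        chunk = state_bits[start:start + k]
        message = int("".join(str(b) for b in chunk), 2)     # MSB first
        parity = parity_table[message]                        # one instance of P
        parity_bits.extend((parity >> j) & 1
                           for j in reversed(range(parity_width)))
    return parity_bits


EXAMPLE_P = [0b000, 0b011, 0b101, 0b110]    # example [5, 2, 3]-code, n - k = 3
print(encode_state([0, 1, 1, 1], EXAMPLE_P, k=2, parity_width=3))
```

For a padded state of q = 2 messages, the result has q · (n − k) = 6 parity signals, matching the size of the parity state described above.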

Satisfying Independence Property
To establish a functional and correct design, it would be enough to implement a circuit architecture in which all modules are connected based on their input and output wires, as depicted in Figure 3. However, this design would lack robustness against potential faults if multiple signals, e.g. the outputs of the round function, fail to satisfy the independence property.
Example 2 (One Advanced Encryption Standard (AES) round). In the context of a single round of the AES cipher, all coordinate functions of the S-box share the same 8-bit input in a standard implementation, while the MixColumns operation combines the 8-bit outputs of four different S-boxes to produce 32 output bits. As a result, if a single-bit input of any S-box is faulty, the fault can propagate to every S-box output bit and subsequently to all 32 output bits of MixColumns. This example highlights how a single faulty intermediate in an AES round can potentially lead to several faulty bits in the output state of the same cipher round.
To restrict the propagation of faults to a single output bit, such that every faulty intermediate signal results in at most one faulty output bit, AGEFA connects all components in a manner that ensures no signal serves as a shared input for multiple modules. This is accomplished by allowing only one output signal per module, such that no fault within one module can propagate to multiple outputs. Although all generated modules are individually independent, they must also be connected to other modules to form the desired functions. This connection must be established without sharing intermediate signals between multiple coordinate functions. To accomplish this, we follow a specific composing procedure (cf. Algorithm 7), composing two sets of modules $\mathcal{X}$ and $\mathcal{Y}$. Further, we can iteratively compose the result of Algorithm 7, which is also a set of modules $\mathcal{Z}$, with further sets of modules. [...] To address this issue, we include a post-processing step after executing Algorithm 7 where we examine each $I_Z$ to ensure that each intermediate signal is computed only once. In other words, we remove any instructions that compute an intermediate signal that has already been computed. Moreover, particular attention must be given to primary outputs computed by any module in $\mathcal{X}$. These signals may be either directly wired out without reaching a module in $\mathcal{Y}$, resulting in no connection between the modules, or wired out and also given as inputs to modules in $\mathcal{Y}$, thus connecting $\mathcal{X}$ and $\mathcal{Y}$ while also being primary outputs. To ensure the independence property is maintained, primary outputs are automatically handled as outputs of $\mathcal{Z}$. Therefore, if a module $X \in \mathcal{X}$ computes a primary output, it is added as an additional module to $\mathcal{Z}$. This ensures that primary outputs are computed by independent modules, even if they serve as intermediates in another module. Algorithm 7 handles primary outputs between two layers in Line 19.

Finalize
At this point, we have arrived at three sets of modules, each representing one of the sub-circuits demarcated by red dashed boxes in Figure 3. Since these modules do not require any further extensions, we can proceed to write them to the final design. Next, we create the modified register stage by utilizing ST, ST′, FB, and FB′ along with the annotated clock signal, and finally we connect all the modules together in a top module, which can then be printed to complete the final design.
If the circuit encompasses a signal indicating the termination of the cipher computation (commonly identified as a done signal), we incorporate the multiplexer-based construction from [SRM20] to prevent faults injected on such a signal from revealing intermediate states to an adversary. Hence, we connect all primary outputs of the final circuit depicted in Figure 3, together with the redundancy of the done signal, to a multiplexer tree that only forwards the primary output if fewer than $\frac{\delta}{2}$ bits of the done codeword are faulty.

Case Studies
To demonstrate the flexibility of AGEFA and the performance of the designs it generates, we applied it to a wide range of publicly available unprotected cipher designs.

Design Sources
We target the complete cipher cores given in Table 2 of [KMMS22]. We took all the designs from GitHub⁷, including both behavioral-level descriptions and Verilog netlists. All given netlists were generated by Synopsys DC and the NanGate 45 nm standard cell library. Their specifications are listed in Table 3.
For the unmasked cipher cores, we utilized the available Verilog netlists from GitHub and annotated the primary inputs and intermediates according to the description given in Section 3, which are then given to AGEFA. Where circuit decomposition was desired, we decomposed CRAFT, Skinny64, and Midori into two sub-circuits, separating the non-linear part and the linear part (without input correction). This procedure is in line with the decomposition strategy outlined in [SRM20]. For the other cores, where the correction of the linear part cannot be removed, we decided to further decompose the linear part based on the diffusion properties of the cipher if it tends to reduce the area footprint. On the other hand, if the designs are masked by AGEMA, we first synthesized the behavioral-level result of AGEMA by Synopsys DC. Beforehand, we annotated every register input or primary output of each respective gadget with its corresponding share domain. Technically, this is done by introducing a new attribute share_domain in the synthesis script. We also kept the annotation during synthesis and included register inputs of the netlist in the attribute report along with their share domain.

Verification Setup
Since AGEFA produces behavioral-level designs, we synthesized the output of AGEFA and verified the resiliency of the resulting netlists against potential faults using VerFI, an open-source tool for cryptographic fault diagnosis [AWMN20]. Available on GitHub, VerFI simulates faults on the netlist based on user-defined specifications and determines whether the injected faults are detected and/or corrected. To provide a thorough security analysis in our fault adversary model, we configured VerFI to exhaustively verify whether all possible combinations of $(\delta - 1)/2$ toggle faults injected on arbitrary cells of the netlist during every clock cycle are corrected. Hence, VerFI considers every possible fault during its analysis. Further, if AGEFA processes a masked design, we additionally evaluated the robust probing security of the resulting netlist with PROLEAD [MM22]. PROLEAD, like VerFI, makes use of a simulator to determine the independence of distributions for each possible robust probing adversary. We followed the evaluation guidelines provided in the PROLEAD GitHub repository and conducted two separate evaluations for each design. Specifically, we evaluated each design in compact mode using up to 100 million simulations and in normal mode with an effect size of 0.1. We confirmed the robust probing security of a design only if PROLEAD detected no leakage during both evaluations. We remark that PROLEAD does not perform an exhaustive evaluation, i.e. not all input vectors are checked. Hence, PROLEAD cannot provide a security proof as usually given by formal verification tools, such as SILVER [KSM20], and ends up with a false-negative probability of $10^{-5}$ when reporting the absence of leakage. However, since full cipher designs protected against SCA and FI are quite large, exhaustive evaluations (e.g. by SILVER) become infeasible, while PROLEAD claims efficiency even for large circuits.
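The idea of exhaustively checking every combination of up to $(\delta - 1)/2$ toggle faults can be illustrated on a toy example. The following Python sketch does not use VerFI or its interface; it models a single bit protected by a repetition code with majority decoding, and all function names are hypothetical.

```python
from itertools import combinations


def encode(bit, n):
    """Repetition code: copy the bit n times."""
    return [bit] * n


def decode(word):
    """Majority decoding of a repetition codeword of odd length."""
    return int(sum(word) > len(word) // 2)


def all_faults_corrected(delta, max_faults=None):
    """Exhaustively toggle every combination of up to max_faults positions
    (default: (delta - 1) // 2) in a length-delta repetition codeword and
    check that majority decoding still returns the original bit."""
    n = delta
    if max_faults is None:
        max_faults = (delta - 1) // 2
    for bit in (0, 1):
        for num_faults in range(1, max_faults + 1):
            for positions in combinations(range(n), num_faults):
                word = encode(bit, n)
                for p in positions:
                    word[p] ^= 1  # toggle fault
                if decode(word) != bit:
                    return False
    return True


print(all_faults_corrected(3))                # single faults, delta = 3
print(all_faults_corrected(5))                # up to two faults, delta = 5
print(all_faults_corrected(3, max_faults=2))  # beyond the correction capability
```

In a real netlist the fault locations are cells rather than codeword bits and the analysis spans all clock cycles, so the number of combinations grows far more quickly than in this toy model.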
Finally, as a sanity check, we performed experimental SCA evaluations on selected designs using a Field-Programmable Gate Array (FPGA)-based setup. We measured power consumption traces using a SAKURA-X board featuring a Kintex-7 target FPGA. We monitored the power consumption, recorded by a digital oscilloscope at a sampling rate of 500 MS/s, at a shunt resistor inserted in the target FPGA's Vdd path while the target design received a stable 6 MHz clock. Using 100 million traces obtained by encrypting either a fixed or a random plaintext under a constant key, we conducted a non-specific (fixed vs. random) t-test to gain an impression of the first-order information leakage of the design under test. This test compares the statistical properties of two groups of traces and detects the presence of information leakage by estimating t-statistic values for every single sample point based on Student's t-distribution [SM15]. In this work, we depict the t-values for the first- and second-order statistical moments.
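The per-sample fixed-vs-random test can be sketched in a few lines of Python. This is a simplified illustration using Welch's t-statistic from the standard library, not the measurement and evaluation code used for the experiments. The function names, the toy traces, and the use of the commonly cited |t| > 4.5 threshold are assumptions; the second-order variant is approximated by centering and squaring each group before the test.

```python
from math import sqrt
from statistics import fmean, variance


def welch_t(group_a, group_b):
    """Welch's t-statistic for two groups of samples (non-constant groups assumed)."""
    mean_a, mean_b = fmean(group_a), fmean(group_b)
    var_a, var_b = variance(group_a), variance(group_b)
    return (mean_a - mean_b) / sqrt(var_a / len(group_a) + var_b / len(group_b))


def centered_square(samples):
    """Preprocessing for a second-order univariate test."""
    mu = fmean(samples)
    return [(x - mu) ** 2 for x in samples]


def t_trace(fixed_traces, random_traces, order=1):
    """Compute one t-value per sample point; traces are lists of equal length."""
    result = []
    for fixed_col, random_col in zip(zip(*fixed_traces), zip(*random_traces)):
        if order == 2:
            fixed_col = centered_square(fixed_col)
            random_col = centered_square(random_col)
        result.append(welch_t(fixed_col, random_col))
    return result


# Toy traces with two sample points; only the second one depends on the input class.
fixed = [[1.0, 5.0], [1.2, 5.1], [0.9, 4.9], [1.1, 5.0]]
random_ = [[1.0, 7.0], [1.1, 7.2], [1.0, 6.9], [1.05, 7.1]]
t_values = t_trace(fixed, random_)
print([abs(t) > 4.5 for t in t_values])
```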

Unmasked Designs
We start with the unmasked designs summarized in Table 3. For every cipher core, we investigated eight different message sizes k, from 1 to 8, and two different minimum distances, δ = 3 and δ = 5. Technically, a code that satisfies δ = 3 can correct a single fault, while δ = 5 enables the correction of two faults. We remark that the underlying code of the presented scheme must support the correction of two faults per cycle (δ = 5) in order to be $(1, \tau_{bf}, mc_{\infty})$-fault secure. The detailed results, including the circuit sizes and the critical path delays for every experiment, are given in Appendix A, while we summarize the results leading to the best performance in terms of circuit size and critical path delay in Table 4 and Table 5. We remark that AGEFA does not add latency to any design. Hence, the number of required clock cycles stays the same as in Table 3. The results depicted in Table 4 and Table 5 demonstrate the significant impact of the message size k and the applied decomposition on the generated designs. As a result, we can provide practical recommendations for AGEFA's settings based on the specific requirements of the designer.
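The relation between the minimum distance and the number of correctable faults stated above is the standard bound for bounded-distance decoding. The symbol t for the number of correctable faults is introduced here only for illustration.

```latex
\[
  t = \left\lfloor \frac{\delta - 1}{2} \right\rfloor, \qquad
  \delta = 3 \;\Rightarrow\; t = 1, \qquad
  \delta = 5 \;\Rightarrow\; t = 2 .
\]
```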

Recommended message sizes.
Our observations indicate that the most area-efficient designs are generated by using k = 2 or k = 4. These message sizes offer a good trade-off between the number of bits per message, i.e., the number of parallel messages processed, and the area required to process one message. In contrast, using k = 1 leads to a slightly larger, but still small, circuit size compared to the designs generated with k = 2 and k = 4 due to the high number of messages and the required parallel instances of correction logic. Additionally, larger codes with k > 4 become inefficient in terms of area and latency as they require increasingly complex correction logic. Moreover, most cipher cores process states with sizes that are multiples of 4, leading to less padding. This explains why designs with k = 8 perform better than, e.g., k = 7, even though the code becomes more complex. Therefore, to optimize the area overhead, we recommend using codes with 2 ≤ k ≤ 4. For designers seeking a short critical path delay, k = 1 appears to be the best choice. In this case, AGEFA applies the same correction logic as for MV, but corrects faults during every round. In our experiments, setting the message size k = 1 yielded the shortest critical path delay for almost every design. This outcome is because the MV code simply duplicates the data signal to build redundancy. Consequently, the mapping between 1-bit messages and their redundant counterparts, and vice versa, is achieved through wiring, i.e., without any additional computation. Hence, only the SDs contribute to the critical path delay. However, MV codes are usually large, as they cannot be realized with a code size of 2k. Therefore, they may not be the best choice when seeking area-optimized results. It is important to note that these recommendations should not be considered strict rules and may not always apply. Therefore, we suggest exploring multiple designs generated with different message sizes before finalizing the design choice.
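The padding argument can be made concrete with a small calculation. The following Python sketch uses a simplified model in which a state of a given width is split into k-bit messages and the last message is padded; it ignores control signals that are encoded separately, and the function name is hypothetical.

```python
def padding_bits(state_bits, k):
    """Return (number of k-bit messages, number of padded bits) for a state."""
    messages = -(-state_bits // k)  # ceiling division without floating point
    return messages, messages * k - state_bits


# Padding required for a 64-bit state and each message size considered above.
for k in range(1, 9):
    messages, padding = padding_bits(64, k)
    print(f"k = {k}: {messages} messages, {padding} padded bits")
```

For a 64-bit state, k = 7 requires 10 messages with 6 padded bits, whereas k = 8 requires 8 messages without any padding, which is consistent with the observation that k = 8 can outperform k = 7.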
Recommended decomposition strategy.
The impact of decomposition becomes particularly evident when applied to complex functions, such as AES-128, and in scenarios where no correction is needed for the linear layer, such as CRAFT and Skinny64. Therefore, we recommend decomposing the circuit into a linear and a non-linear part whenever it is feasible to remove the correction for the linear component. However, in the case of lightweight ciphers where removing the correction of the linear layer is not possible, decomposition seems to hurt the area and latency overheads. This is because restricting fault propagation does not justify the cost of additional correction logic. This observation holds since the round functions of lightweight ciphers, in contrast to the round-based AES, are simple and faults cannot propagate extensively. For more complex functions, even the incorporation of multiple additional correction layers can result in a smaller area overhead, while the results without decomposition become impractical. Therefore, we recommend decomposing complex functions in general to avoid impractically large results.

Verification
Our verification with VerFI demonstrated that all injected faults become ineffective, except for faults on the gate responsible for computing the done signal. However, such a fault is detected and reveals no information to the adversary since no intermediate state is forwarded to the primary output.

Masked Designs
We selected CRAFT to examine the application of AGEFA to masked implementations. This selection is justified by its comparably small area footprint and moderate latency, making it easier to verify using both tool-based and experimental analyses. For the same reason, we restrict our experiment to a first-order masked version of CRAFT protected by a code with minimum distance δ = 3. The fault-tolerant version of masked CRAFT generated by AGEFA is still small enough to be verified with PROLEAD in a feasible time and using a realistic amount of memory. Additionally, the design is still compact enough to fit on our FPGA-based setup for experimental analysis, and the length of the power consumption traces remains practical. The above arguments highlight that the decision to use CRAFT was purely based on verification considerations. AGEFA itself is capable of handling even higher-order masked versions of all ciphers discussed above in a matter of minutes. Below, we describe our design flow in detail.
• We started with the behavioral-level description of CRAFT and synthesized it with Synopsys DC and the NanGate 45 nm library to obtain a Verilog netlist. The synthesized design has an area footprint of 1.21 kGE and a critical path delay of 0.68 ns.
• To process the netlist with AGEMA, we annotated the primary inputs directly in the Verilog file by marking all signals related to the plaintext or key as secure. This was done to ensure that AGEMA produces a design with masked plaintext and key. Additionally, we configured AGEMA to use first-order (d = 1) protection and to apply HPC2 gadgets [CGLS21] to protect the given netlist. Further, we configured AGEMA to operate in the naive mode, which replaces every cell that needs to be masked with its HPC2 variant, and to produce a non-pipelined design. Compared to a pipelined version of the same design, this approach significantly reduces the overall circuit size by avoiding additional pipeline registers, but at the cost of being able to encrypt only one plaintext in each cipher run. As a result, AGEMA produced a masked behavioral-level design of the given netlist which is provably secure under the (1, 1, 0)-robust 1-probing model, i.e. first-order glitch- and transition-extended probing secure.
• We synthesized the masked behavioral-level design along with the provided HPC2 gadgets obtained from AGEMA's GitHub. Beforehand, we extended our synthesis script to define the share_domain attribute for every masked gadget and set the share_domain attribute for every register input for each gadget separately. Finally, we adjusted our synthesis script to generate the attribute report. The masked design provided by AGEMA has an area footprint of 10.84 kGE and a critical path delay of 1.12 ns.
• The synthesized netlist of the masked design serves as the input of AGEFA. We annotated all data inputs, i.e. the masked plaintext and key signals, as secure, while we annotated the register enable and done signal as control. Next, we applied AGEFA to generate a fault-tolerant design using ECCs capable of correcting a single fault (δ = 3). It is important to note that δ = 3 is not sufficient for security under the $(1, \tau_{bf}, mc_{\infty})$-fault model, but it facilitates the verification process. By setting k = 1, every single bit is treated as a separate message. As a side note, the resulting design does not require the attribute report to maintain probing security if all registers are realized with separate modules. Each message contains only k = 1 bit, which prevents any two signals from different share domains from being combined in a syndrome decoder. This is not the case for any other implementation with k > 1, where the attribute report generated by the synthesis script is strictly required.
• To obtain the final netlist, allowing us to use PROLEAD and VerFI to evaluate the results, we synthesized both behavioral-level designs generated by AGEFA with Synopsys DC using the NanGate 45 nm library. The final masked and fault-tolerant design has an area footprint of 112.46 kGE and a critical path delay of 1.47 ns.
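The role of the share-domain annotation in the design flow above can be illustrated with a small check. The sketch below is not AGEFA's clustering routine: the grouping of consecutive signals into k-bit messages, the signal names, and the function name are assumptions made to show why k = 1 never mixes share domains while k > 1 requires the share-domain information.

```python
def messages_mixing_domains(signals, k):
    """Group consecutive signals into k-bit messages and return those messages
    that contain signals from more than one share domain.

    signals: list of (signal_name, share_domain) tuples.
    """
    mixed = []
    for start in range(0, len(signals), k):
        message = signals[start:start + k]
        domains = {domain for _, domain in message}
        if len(domains) > 1:
            mixed.append([name for name, _ in message])
    return mixed


# Two shared register inputs a and b, each with shares in domains 0 and 1.
signals = [("a0", 0), ("a1", 1), ("b0", 0), ("b1", 1)]

print(messages_mixing_domains(signals, 1))  # k = 1: no message can mix domains
print(messages_mixing_domains(signals, 2))  # k = 2, naive order: domains are mixed

# With share-domain information available, signals can be clustered per domain first.
by_domain = sorted(signals, key=lambda s: s[1])
print(messages_mixing_domains(by_domain, 2))
```

The third call shows the intended use of the annotation: once signals are clustered by share domain, a code with k = 2 no longer combines shares from different domains in one message.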

Verification
Similar to the unmasked designs, our verification with VerFI led to the conclusion that all considered faults are corrected. We also verified the security of the given design under the (1, 1, 0)-robust 1-probing model using PROLEAD. Using 100 million simulations, PROLEAD reported the highest probability for detectable leakage as $-\log_{10}(p) = 5.88$. Since PROLEAD assumes a false-positive probability of $10^{-5}$, the original authors claim that exceeding the 5.0 threshold is not a strict criterion for insecure designs if the number of considered probing sets is quite high. Since there are 150 560 possible robust probing sets in our case study, the probability of at least one probing set surpassing the threshold is $1 - (1 - 10^{-5})^{150560} = 77.81\%$. Additionally, we observed that the probabilities did not increase when we considered more simulations. Therefore, we conclude that the surpassed threshold is due to a false positive, i.e., the underlying design maintains first-order security.
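The false-positive estimate above can be reproduced with a one-line computation. The following Python sketch assumes, as the formula does, that the probing sets are tested independently with the same per-set false-positive probability; the function name is hypothetical.

```python
def prob_at_least_one_false_positive(alpha, num_sets):
    """Probability that at least one of num_sets independent tests, each with
    false-positive probability alpha, exceeds the threshold by chance."""
    return 1.0 - (1.0 - alpha) ** num_sets


p = prob_at_least_one_false_positive(1e-5, 150560)
print(f"{100 * p:.2f}%")
```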
To verify the absence of leakage, we conducted experimental analyses using the setup detailed in Section 4.2, which yielded the results depicted in Figure 5. Given that the design is anticipated to possess only first-order security (with 2 shares), we expected the absence of leakage in the first order, as confirmed by Figure 5b, as well as the presence of detectable higher-order leakage, which is evident from Figure 5c.

Benchmark
In addition to evaluating the security of the hardware designs produced by AGEFA, it is crucial to analyze the performance of the tool itself, specifically in terms of runtime. To this end, we conducted runtime measurements of AGEFA while generating each of the individual case studies. Our measurements were performed on a standard laptop with an i7-10610U CPU running at a clock frequency of 1.80 GHz and 16 GB of RAM. We used the Ubuntu 20.04 subsystem running on Windows 10 as the environment for executing AGEFA. The runtime measurements are depicted in Figure 6.
For δ = 3, there are significant differences in the runtime of AGEFA when optimizations are turned on versus when they are turned off. Notably, artificially increasing the code size becomes the primary bottleneck as the message size increases, while the optimized variant is usually faster than the variant without optimizations if k < 5. When the message size is sufficiently small, the time saved by finding minimal Boolean functions can outweigh the time required to perform the code optimization. This explains the lower runtime observed when AGEFA optimizes the code. Upon analyzing Figure 6, we can conclude that finding an optimized code with k = 9 would take hours, and further increasing the message size or the minimum distance would make this optimization computationally infeasible. However, it is essential to emphasize the following points:
• The process of finding an optimized ECC is independent of the specific netlist being processed. Thus, it is sufficient to find a code with specific k and δ once and reuse it for subsequent designs. This approach allows for the precomputation of ECCs for arbitrary parameters and the use of these precomputations as a database for AGEFA. If a code with specific parameters already exists in the database, AGEFA can simply load the code from the database and skip the code generation process. This strategy can significantly reduce the runtime of AGEFA and make the optimization of ECCs computationally feasible for larger message sizes.
• Iteratively checking a large number of codes for their applicability, as is required to find an optimized code, can be efficiently performed in parallel. Therefore, the aforementioned database of precomputed ECCs can be created on a more powerful machine using multiple threads.
• All case studies imply that large message sizes lead to inefficient designs. Hence, generating large codes should generally be avoided.
If the code optimization is disabled, optimizing Boolean functions with the Quine-McCluskey algorithm can become time-consuming when correcting multiple faults, i.e. as δ grows. However, optimizing Boolean functions is optional and can be avoided by replacing don't cares with concrete results, albeit at the cost of performance. Additionally, this optimization is mostly problematic for large message sizes, which lead to inefficient designs. For instance, generating designs with a code that satisfies k = 6 was the most time-consuming case in our experiments with δ = 5, taking around an hour.

Comparison
When evaluating the performance of our constructions, it is important to compare them to hand-crafted designs, in which countermeasures are manually implemented by the designer rather than through automated tools. This comparison involves two key aspects. Firstly, we assess how AGEFA processes masked designs generated by AGEMA in comparison to manually masked designs. Secondly, we evaluate how arbitrary designs generated by AGEFA compare to designs where error correction is manually implemented.

Hand-Made Masked Designs vs. AGEMA-Generated Designs
In the context of SCA, an extensive discussion of the advantages and disadvantages of using composable gadgets, as automatically instantiated by AGEMA, versus hand-made masking is given in [KMMS22]. Hand-crafted masked circuits are often more efficient under some performance metrics, such as area, latency, or demand for fresh randomness, but can be difficult to verify and evaluate. Although PROLEAD can evaluate full masked cipher cores, evaluating higher-order designs may be infeasible due to the high demand on runtime and memory. Composable gadgets achieve higher-order provable security by design but at the cost of some overhead. If AGEFA processes a design made out of composable gadgets, the overhead compared to a hand-crafted design is multiplied by at least a factor of two due to the addition of redundancy, i.e. the second instantiation of the round function. However, it is important to note that this overhead applies to all target circuits and is not unique to masked designs generated by AGEMA. Furthermore, error correction typically involves redundant computation, which cannot be avoided. Therefore, the overhead is not specific to AGEFA and would also arise if the designer manually integrated an error-correction facility. As demonstrated above, preserving the robust probing security of a hand-crafted or automatically generated masked circuit when applying AGEFA is straightforward when it is configured with k = 1. However, if the user applies another code with k > 1, annotating every register input and primary output with its corresponding share domain can be complicated and error-prone. If we pre-annotate the gadgets of AGEMA, the designs can be processed by AGEFA together with an attribute report (generated by a synthesizer such as Synopsys DC) to integrate arbitrary ECCs without violating the robust probing security.

Hand-Made Fault-Resistant Designs vs. AGEFA-Generated Designs
Impeccable Circuits II [SRM20]. Initially, we compare the manually protected designs of CRAFT from [SRM20] with the outcomes of AGEFA. Given that the synthesis script used to generate the results in [SRM20] is not publicly accessible, we provide not only the absolute performance metrics but also the relative area and delay overheads compared to the unprotected design. Specifically, we present the overheads for the synthesized AGEFA-generated designs in relation to the unprotected input design of AGEFA, which we synthesized ourselves. The performance metrics are detailed in Table 6. While we acknowledge that cipher-level MV leads to more area-efficient designs compared to the application of Impeccable Circuits II, we must emphasize once more that cipher-level MV falls short of guaranteeing the desired security level (see Section 2.4.2). Further, the hand-crafted Impeccable Circuits II design with δ = 3 incorporates a non-injective code, which complicates a fair comparison to the design generated using AGEFA. To make an equitable comparison, we focus on the design for δ = 5, where both the hand-crafted version and the AGEFA result employ injective codes and the circuit decomposition is consistent. Upon examination, we observe that the AGEFA-generated design exhibits an approximately 26% increase in relative area overhead and an approximately 15% increase in critical path delay compared to the hand-crafted design. However, it is essential to note that manual protection of the design offers opportunities for optimization of the target cipher itself, specifically tailored to minimize overhead. Such optimizations are not feasible with AGEFA since it operates on a pre-synthesized netlist where the target structure is at least partially removed. Consequently, we posit that optimizing the netlist before applying AGEFA may yield more area-efficient designs.
A Countermeasure Against SIFA [BKHL20]. The authors of [BKHL20] investigated a [3, 1, 3]-code to protect Skinny against one-bit FI attacks. While they did not report the relative area overhead of a real hardware design, they estimated the relative overhead based on the number of basic logic gates. It is worth noting that such an estimated overhead may tend to be overly optimistic, as it does not take into account additional logic elements needed for signal distribution, such as gates with higher drive strength or buffers instantiated to allow a higher fan-out. When considering the smallest design of Skinny generated by AGEFA, we end up with a relative area overhead of 6.3 times, which is 12.5% higher than the relative overhead of 5.6 times reported in [BKHL20], while the manual protection, again, allows a wider range of optimizations on the target cipher. Again, we estimated the relative overhead of the AGEFA-generated design by comparing it to our unprotected, synthesized Skinny design (see Table 3).
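The 12.5% figure follows directly from the two relative overheads quoted above:

```latex
\[
  \frac{6.3}{5.6} = 1.125
  \quad\Longrightarrow\quad
  \text{the AGEFA-generated design has a } 12.5\% \text{ higher relative area overhead.}
\]
```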

Summary
We observe that the designs generated by AGEFA tend to be quite large, particularly when considering the round-based implementation of the AES without any form of decomposition. This characteristic makes these particular designs more suitable for academic exploration, where our primary objective is to demonstrate the limitations and capabilities of AGEFA. However, our findings indicate that when dealing with lightweight ciphers and/or employing decomposition techniques, the results become more acceptable. In this context, we must emphasize that we operate within a strong adversary model, wherein security cannot be assured through significantly more cost-effective methods like cipher-level MV. Furthermore, it is worth noting that AGEFA produces smaller designs when the unprotected design is serialized (involving multiple cycles per cipher round) as opposed to round-based (with one cycle per cipher round). This introduces a trade-off between area and latency. Since we primarily provided round-based designs to AGEFA, we obtained designs of substantial size but with relatively low latency.

Conclusions
In conclusion, this paper introduces AGEFA, a fully-automated and open-source framework designed to enable engineers and hardware designers to generate fault-tolerant cryptographic hardware circuits with ease and reliability. The framework leverages various optimization techniques to implement the ECC-based procedure presented in Impeccable Circuits II [SRM20] on arbitrary hardware circuits, improving the performance of the resulting designs. Our tool is demonstrated through a series of case studies that feature well-known symmetric block ciphers, providing insight into the diverse performance trade-offs that arise based on the message size and the number of faults to correct, particularly in terms of area overhead and critical path delay. Our case studies yield specific recommendations for creating area-optimized designs (by selecting k = 2 or k = 4 with decomposition) or delay-optimized designs (by selecting k = 1 without decomposition), depending on the desired outcome. Furthermore, we demonstrated that our tool naturally extends the existing security-aware hardware design flow. AGEFA can add fault tolerance to masked circuits generated with AGEMA without violating the given security proofs under the robust probing security model by simply annotating internal signals of each gadget. Hence, the combination of AGEMA and AGEFA allows for the generation of fault- and SCA-secure hardware circuits from an unprotected netlist. We conducted practical experiments and tool-based evaluations to demonstrate the fault tolerance and SCA resistance of the designs generated by our framework, further affirming our claims. However, while our evaluation shows that the generated designs are secure against SCA and FI adversaries individually, we cannot guarantee their security against combined adversaries who apply both types of attacks simultaneously. Hence, further research on extending AGEFA or AGEMA to ensure security against combined adversaries would be a promising avenue to explore. Additionally, the substantial overhead attributed to error correction can be mitigated in scenarios where assured error detection is deemed sufficient, such as in Root-of-Trust (RoT) modules. Integrating error detection support into AGEFA necessitates an additional output signal, signaling the injection of a fault, while employing an underlying EDC with a smaller minimum distance than that of an ECC. Hence, we highlight the exploration of a fault-detection variant of AGEFA as a compelling direction for future research.

Figure 1: General Mealy model of a synchronous sequential circuit.

Figure 3: CED based on Impeccable Circuits II with injective F.

Figure 4: Procedure of AGEFA to protect a circuit against FI attacks.
Additionally, in Lines 15-20 the algorithm considers all output wires of linear functions and assesses whether their signals are exclusively propagated to the inputs of other linear coordinate functions. This step enables the detection of multiple cascaded linear layers. Ultimately, all elements $R \in \mathcal{R}$ possessing solely linear inputs and outputs are identified as constituting a linear layer. Formally, we denote the linear layer as $L = \{R \in \mathcal{R} \mid \nexists\, w \in OUT_R \cup IN_R : \mathrm{linear\_index}(w) = 0\}$ with a joint set of primary input wires $IN_L$ and primary output wires $OUT_L$. We remark that both sets $IN_L$ and $OUT_L$ must be sorted based on the signal names. Utilizing the information from $L$, $IN_L$, and $OUT_L$, the procedure presented in Algorithm 2 determines whether the correction point can be removed from all wires in $IN_L$. Specifically, it examines whether the implementation of $L$ can be represented as a binary matrix with elements in $\mathbb{F}_2^k$. A crucial condition for such a representation is that both $|IN_L|$ and $|OUT_L|$ are multiples of k, with the exception of rst, which is encoded into a separate message. Consequently, rst is counted as k signals when computing $|IN_L|$. If $L$ satisfies this condition, i.e. $|IN_L|$ and $|OUT_L|$ are multiples of k, Algorithm 2 is executed. Otherwise, we set linear_index(w) = 0 for all $w \in IN_L \cup OUT_L$.
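The size condition that gates the execution of Algorithm 2 can be sketched as a small check. The following Python snippet is an illustration only: the wire names, the name of the reset signal, and the function names are assumptions, and the snippet covers only the multiple-of-k condition, not Algorithm 2 itself.

```python
def effective_width(wires, k, reset_name="rst"):
    """Count the wires of a set, counting the reset signal as k signals
    because it is encoded into a separate message."""
    return sum(k if wire == reset_name else 1 for wire in wires)


def size_condition_holds(in_wires, out_wires, k):
    """Check whether both wire sets of the linear layer are multiples of k."""
    in_sorted = sorted(in_wires)    # both sets are sorted by signal name
    out_sorted = sorted(out_wires)
    return (effective_width(in_sorted, k) % k == 0
            and effective_width(out_sorted, k) % k == 0)


in_l = ["s3", "s1", "rst", "s0", "s2"]
out_l = ["t0", "t1", "t2", "t3"]
for k in (2, 3, 4):
    print(k, size_condition_holds(in_l, out_l, k))
```

If the check fails for the chosen k, the wires of the candidate layer are treated as non-linear, matching the fallback of setting linear_index(w) = 0 described above.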

Algorithm 5: Secure padding of a state
Input: ST, an unpadded but clustered state, i.e. a set of signals
Input: k, the size of one message
Output: SP, a padded state, i.e. a set of signals
1: q ← 0
2: l ← 0
3: for ∀w ∈ ST do
4: if attribute(w) = control then

Figure 5 (c): 2nd-order univariate t-test.

Table 1: Notations used in this work.

Table 2: Parameters associated with a single wire.
Notation: Description
attribute(w): Returns the attribute associated with w. It holds that attribute(w) ∈ {clock, control, layer, reset, secure}, introduced in Section 3.
share_domain(w): If w carries the share of a sensitive variable, this parameter stores the share domain of w. This is introduced in Section 3.1.
linear_index(w): If w is the input or output of a binary matrix with elements in $\mathbb{F}_2^k$

precisely, Algorithm 6 initially takes the first k signals from FP to compute the first (n − k) signals from U. Hence, it creates the first (n − k) modules $\{F_0, \ldots, F_{n-k-1}\}$, while every module receives k signals from FP as input. As F is just the extension of P to a full state, $F_i \in \{F_0, \ldots, F_{n-k-1}\}$ implements the same function as $P_i \in P$ but on different signals. Therefore, we can copy the instructions from $P_i$ to $F_i$ if we update the

Table 3: Full cipher implementation case studies.

Table 6: Comparison of different CRAFT designs.

Table 7: Synthesis results, δ = 3, without optimized code and without round function decomposition, circuit size in kGE.

Table 8: Synthesis results, δ = 3, without optimized code and without round function decomposition, critical path delay in ns.

Table 9: Synthesis results, δ = 3, with optimized code and without round function decomposition, circuit size in kGE.

Table 10: Synthesis results, δ = 3, with optimized code and without round function decomposition, critical path delay in ns.

Table 11: Synthesis results, δ = 3, without optimized code and with round function decomposition, circuit size in kGE.

Table 12: Synthesis results, δ = 3, without optimized code and with round function decomposition, critical path delay in ns.

Table 13: Synthesis results, δ = 3, with optimized code and with round function decomposition, circuit size in kGE.

Table 14: Synthesis results, δ = 3, with optimized code and with round function decomposition, critical path delay in ns.

Table 15: Synthesis results, δ = 5, without optimized code and without round function decomposition, circuit size in kGE.

Table 16: Synthesis results, δ = 5, without optimized code and without round function decomposition, critical path delay in ns.

Table 17: Synthesis results, δ = 5, without optimized code and with round function decomposition, circuit size in kGE.

Table 18: Synthesis results, δ = 5, without optimized code and with round function decomposition, critical path delay in ns.