Optimizing implementations of linear layers using two and higher input XOR gates

Maximum distance separable (MDS) matrices are often used in the linear layer of a block cipher due to their good diffusion property. A well-designed lightweight MDS matrix, especially an involutory one, can provide both security and performance benefits to the cipher. Finding an efficient linear straight-line program (SLP) for the circuit of a linear layer is still a challenging problem. In this article, we first propose a new heuristic algorithm called Superior Boyar-Peralta (SBP), which searches for SLPs with the minimum number of two-input Exclusive-OR (XOR) gates at the minimum circuit depth. Compared with the existing global optimization methods supporting only two-input XOR gates, the SBP heuristic provides the best global optimization solutions, especially for extracting low-latency circuits. Moreover, we give a new 4 × 4 involutory MDS matrix over $\mathbb{F}_{2^4}$, which requires only 41 XOR gates at depth 3 after applying the SBP heuristic, whereas the previously best-known cost is 45 XOR gates at the same depth. In the second part of the article, for further optimization of the circuit area of linear layers with multiple-input XOR gates, we enhance the recently proposed BDKCI heuristic algorithm by incorporating circuit depth awareness, which limits the depth of the circuits created. By using the proposed circuit depth-bounded version of BDKCI, we present better circuit implementations of linear layers of block ciphers than those given in the literature. For instance, the given circuit for the AES MixColumn matrix requires only 44 XOR gates/depth 3/240.95 GE in the STM 130 nm (simply called ASIC4) library, while the previous best-known result is 55 XOR gates/depth 5/243.00 GE. Furthermore, our new 4 × 4 involutory MDS matrix requires only 19 XOR gates/depth 3/79.75 GE in the STM 90 nm (simply called ASIC1) library, which is lighter than the state-of-the-art results.


INTRODUCTION
In recent years, lightweight cryptography has gained increasing attention due to the usage of resource-constrained devices like smart devices, wearable devices, and Internet of Things (IoT) devices. Since these small, constrained devices can manipulate private data, designing novel, lighter cryptographic primitives with low implementation costs is crucial (Duval & Leurent, 2018).
Despite having the maximum diffusion property, MDS matrices have high implementation costs. So, finding lightweight MDS diffusion layers, especially involutory ones, with minimized hardware requirements is a challenging task (Toh et al., 2018). The cost of implementation can be quantified using two key metrics: XOR count and circuit depth. The XOR count is divided into three types: direct XOR (d-XOR), general XOR (g-XOR), and sequential XOR (s-XOR) count (see the "Definition and Notations" section for more details). XOR count represents the number of XOR gates required for the circuit implementation of the diffusion matrix, while circuit depth refers to the number of layers of gates required to implement the linear layer in hardware. Minimizing the number of gates, particularly expensive ones such as XOR gates, ensures low-cost and efficient circuit implementations of the diffusion matrix. Similarly, minimizing circuit depth guarantees reduced latency. On the other hand, Gate Equivalent (GE) is a metric used to compare the size of logic gates. Banik, Funabiki & Isobe (2019) observed that the use of higher-input XOR gates may result in a reduced area (i.e., GE) in ASIC libraries. That idea opened new directions in research on optimized circuit implementations; hence, researchers started to search for circuits not only with the minimum number of XOR gates and circuit depth but also with reduced GE cost by considering multiple-input XOR gates (e.g., Baksi et al., 2021; Banik, Funabiki & Isobe, 2019; Liu et al., 2022b).

Related work
To address the challenge of finding efficient circuit implementations of a given linear layer, a variety of local optimization techniques (e.g., Gupta & Ray, 2013; Sim et al., 2015; Beierle, Kranz & Leander, 2016; Sarkar & Sim, 2016; Sarkar & Syed, 2017; Pehlivanoğlu et al., 2018) were first proposed in order to reduce the XOR count.
Local optimization means selecting the coefficients of the matrix with minimum XOR counts, but it does not guarantee finding efficient circuits, because the fixed cost of connecting the entries is not optimized by local optimization methods.
Then, researchers started to address the task of globally optimizing linear layers. This involves estimating the hardware cost of a linear layer by identifying an SLP that corresponds to it (Li et al., 2019). Two such algorithms are the Paar1 and Paar2 heuristics (Paar, 1997), which generate cancellation-free SLPs, i.e., SLPs in which the two operands of an XOR never share a variable. Although Paar's heuristics do not necessarily yield the optimal circuit implementations, the Paar1 heuristic is easy to implement and produces fast results, even for large matrix sizes. Boyar-Peralta's heuristic (Boyar & Peralta, 2010) and its variant (Boyar, Matthews & Peralta, 2012) inspired new global optimization heuristics. In Boyar, Find & Peralta (2017), the authors modified Paar's heuristic by using preprocessing steps and allowing cancellations. In the same article, minimizing the number of AND gates was handled alongside XOR gates. In Duval & Leurent (2018), the authors proposed a new approach based on searching the circuit space to find optimal circuits of MDS matrices by using the tree-based Dijkstra searching technique. In Li et al. (2019), the authors modified Boyar-Peralta's heuristic (Boyar, Matthews & Peralta, 2012) by considering the circuit depth metric to determine the optimal circuit implementations. Boyar, Find & Peralta (2019) proposed a new heuristic creating smaller linear and nonlinear circuits for a given circuit depth bound. Tan & Peyrin (2019) proposed the Randomized Normal Boyar Peralta (RNBP) heuristic and two nondeterministic algorithms, A1 and A2. All these heuristics focus on the reduction of XOR counts by using temporary intermediate signals (gates) to determine the globally optimized implementations of a diffusion matrix. In Banik, Funabiki & Isobe (2021), the authors extracted lower circuit depth implementations by adding randomness to the tweaked algorithm given in Li et al. (2019) (this new version is simply called the BFI heuristic). In Liu et al. (2022a), especially considering the low-latency criteria, the authors proposed a new framework based on forward and backward search strategies that can find optimal solutions with a minimized circuit depth. Pehlivanoglu & Demir (2022) designed a new framework that combines some of these recently proposed global optimization heuristics to find better circuit implementations. It should be noted that all the global optimization methods given above generate circuit implementations with two-input XOR gates under the g-XOR metric.
The idea given in Banik, Funabiki & Isobe (2019) has opened up new directions for research on the usage of multiple-input XOR gates in SLPs. In the same article, Banik, Funabiki & Isobe (2019) designed a graph-based heuristic to explore circuits featuring both two-input and three-input XOR gates. Specifically, they converted circuits constructed using only two-input gates into new ones with a combination of two-input and three-input XOR gates. Then, Baksi et al. (2021) introduced enhanced versions of the BP heuristic (originally presented in Boyar & Peralta (2010) and Tan & Peyrin (2019)), simply called BDKCI. These improved versions support two-input, three-input, and four-input XOR gates. Recently, Liu et al. (2022b) proposed two algorithms: the transform algorithm and the graph extending algorithm. By combining these two algorithms, they generated better circuit implementations.
In the literature, there are few heuristics that optimize implementations of diffusion matrices under only the s-XOR metric, without using any temporary intermediate signals. Optimizing a diffusion matrix under the s-XOR metric is based on the problem of optimal pivoting in Gauss-Jordan elimination (Kölsch, 2019). In Jean et al. (2017), the authors proposed an exhaustive search algorithm to find the optimal circuit implementations for small matrix sizes such as 4 × 4 and 8 × 8 under the s-XOR metric. Xiang et al. (2020) proposed a new heuristic, called XZLBZ, that was capable of reducing the implementation cost (in terms of s-XOR count) of 16 × 16 and 32 × 32 involutory/non-involutory binary MDS matrices. More recently, Yang, Zeng & Wang (2021) proposed a new heuristic, called the IX algorithm, which was an improved variant of the heuristic given in Xiang et al. (2020). The IX heuristic found better circuit implementations under the same run-time with higher accuracy. However, none of these heuristics designed to decrease the s-XOR count takes the circuit depth metric into consideration.

Motivation and our contribution
In this article, we focus on two challenging research questions: (1) how to improve the BP heuristic for circuits using two-input XOR gates under low-latency criteria (especially for depth 3), and (2) how to enhance the BDKCI heuristic by incorporating circuit depth awareness. To address the first research question, we propose a new heuristic, called SBP, which improves Boyar-Peralta's heuristic (Boyar, Matthews & Peralta, 2012) by considering low-latency criteria. We introduce a new randomized way of choosing actions that leads to better circuit solutions, especially at the minimum circuit depth 3. To address the second research question, we give an enhanced (depth-bounded) version of the BDKCI heuristic that is capable of producing depth-limited circuits.
The main contributions of this article are as follows:
We give a new 4 × 4 involutory MDS linear layer over $\mathbb{F}_{2^4}$ whose circuit implementation requires the lowest number of XORs (i.e., 41 g-XORs, saving four compared with the previous best result (Liu et al., 2022a)) at the minimum depth 3.
We apply our new heuristic SBP to previously given 4 × 4 linear layers and find many low-latency circuits that are better than the best previous results given in Liu et al. (2022a).
For further improvement, we enhance the recently proposed BDKCI heuristic algorithm by incorporating circuit depth awareness, which limits the depth of the circuits created.

Organization
This article is organized as follows: In the "Definition and Notations" section, we give some basic notations and definitions. In the "SBP Heuristic" section, we propose a new heuristic, SBP, for global optimization to generate low-latency circuit implementations of linear layers. Next, in the "Depth-bounded version of BDKCI Heuristic" section, we present the depth-bounded version of the BDKCI heuristic and some good experimental results. Finally, we conclude and highlight some possible future works in the "Conclusion and Future Works" section.

DEFINITION AND NOTATIONS
This section reviews the fundamental mathematical principles concerning finite fields and MDS matrices. In this context, definitions and notations are introduced.
The finite field $\mathbb{F}_{2^m}$ is defined by an irreducible polynomial $p(x)$ of degree $m$ over $\mathbb{F}_2$ and can be denoted as $\mathbb{F}_2[x]/(p(x))$. Each element of $\mathbb{F}_{2^m}$ can be represented as $\sum_{i=0}^{m-1} b_i \alpha^i$, where $b_i \in \mathbb{F}_2$ and $\alpha$ is a root of $p(x)$. For simplicity's sake, hexadecimal notation is used to represent the elements of $\mathbb{F}_{2^m}$ and the irreducible polynomial $p(x)$; e.g., the irreducible polynomial $p(x) = x^4 + x + 1$ can be denoted as 0x13.
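As an illustration of this representation, the following minimal sketch multiplies elements of $\mathbb{F}_{2^4}$ written as 4-bit integers, reducing by $p(x)$ = 0x13. The function name `gf_mul` and its parameters are hypothetical helpers introduced here, not part of the original work.

```python
def gf_mul(a: int, b: int, poly: int = 0x13, m: int = 4) -> int:
    """Multiply a and b in F_{2^m}, reducing by the irreducible polynomial `poly`.

    Elements are integers whose bit i is the coefficient b_i of alpha^i.
    """
    result = 0
    for _ in range(m):
        if b & 1:            # add (XOR) the current multiple of a
            result ^= a
        b >>= 1
        a <<= 1              # multiply a by alpha
        if a & (1 << m):     # degree reached m: reduce modulo p(x)
            a ^= poly
    return result


# alpha = 0x2 is the class of x; alpha^4 = alpha + 1 = 0x3 because p(alpha) = 0.
alpha = 0x2
powers = [1]
for _ in range(14):
    powers.append(gf_mul(powers[-1], alpha))

print([hex(p) for p in powers])
```

With this polynomial, the successive powers of `alpha` run through all 15 nonzero field elements, so 0x2 generates the multiplicative group in this representation.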
An $n \times n$ matrix over the finite field $\mathbb{F}_{2^m}$ can be represented as $M_n(\mathbb{F}_{2^m})$, and the binary representation of the same $n \times n$ matrix (each entry of which is an $m \times m$ invertible binary matrix) can be denoted as $M_n(GL(m, \mathbb{F}_2))$.
Definition 1. (MDS Matrix) Let $C$ be an $[n, k, d]$ code and $G = [I \mid A]$ be a generator matrix of $C$, where $A$ is a $k \times (n - k)$ matrix. $A$ is an MDS matrix if and only if every square sub-matrix of $A$ is nonsingular. If $A$ also satisfies $A = A^{-1}$, $A$ is an involutory MDS matrix.
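Definition 1 can be checked directly for small matrices. The sketch below tests every square sub-matrix for nonsingularity and tests $A = A^{-1}$ by squaring, over $\mathbb{F}_{2^4}$ with $p(x)$ = 0x13. All function names are hypothetical, and the example matrices are illustrative choices rather than matrices from this article.

```python
from itertools import combinations


def gf_mul(a, b, poly=0x13, m=4):
    """Multiplication in F_{2^4} with p(x) = 0x13 (hypothetical helper)."""
    result = 0
    for _ in range(m):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & (1 << m):
            a ^= poly
    return result


def gf_det(mat):
    """Determinant by Laplace expansion; signs vanish in characteristic 2."""
    if len(mat) == 1:
        return mat[0][0]
    det = 0
    for col, entry in enumerate(mat[0]):
        if entry:
            minor = [row[:col] + row[col + 1:] for row in mat[1:]]
            det ^= gf_mul(entry, gf_det(minor))
    return det


def is_mds(mat):
    """True if every square sub-matrix is nonsingular."""
    n = len(mat)
    for k in range(1, n + 1):
        for rows in combinations(range(n), k):
            for cols in combinations(range(n), k):
                sub = [[mat[r][c] for c in cols] for r in rows]
                if gf_det(sub) == 0:
                    return False
    return True


def is_involutory(mat):
    """True if mat * mat is the identity matrix."""
    n = len(mat)
    for i in range(n):
        for j in range(n):
            acc = 0
            for k in range(n):
                acc ^= gf_mul(mat[i][k], mat[k][j])
            if acc != (1 if i == j else 0):
                return False
    return True


small = [[1, 1], [1, 2]]  # illustrative 2 x 2 matrix
# Illustrative Hadamard matrix had(1, 2, 4, 6); its first row XORs to 1.
had = [[1, 2, 4, 6], [2, 1, 6, 4], [4, 6, 1, 2], [6, 4, 2, 1]]
print(is_mds(small), is_involutory(small), is_involutory(had))
```

The brute-force sub-matrix enumeration is only practical for small sizes such as 4 × 4, which is the range considered in this article.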
The GHadamard matrix form proposed in Pehlivanoğlu et al. (2018) is a hybrid construction method to construct (involutory) MDS matrices. A $k \times k$ GHadamard matrix $GH$ is generated by using the non-zero $b_i$ parameters and their inverses with a $k \times k$ Finite Field Hadamard (simply Hadamard) matrix $H$ over $\mathbb{F}_{2^m}$. A 4 × 4 GHadamard matrix $GH$ can be denoted as follows:

Metrics
To compute the hardware implementation cost of a diffusion matrix in terms of XOR count, there are three important metrics: direct XOR (d-XOR) count (Khoo et al., 2014), sequential XOR (s-XOR) count (Jean et al., 2017), and general XOR (g-XOR) count (we use the same abbreviation as in Xiang et al. (2020)).
Definition 3. The d-XOR count is defined as the Hamming weight (the number of nonzero elements) of the $n \times n$ invertible binary matrix minus $n$.
Definition 4. The s-XOR count is defined as the minimum number of XOR operations necessary to implement an $n \times n$ invertible binary matrix using in-place operations. In other words, given input vectors $\{x_0, x_1, \ldots, x_{n-1}\}$ of the $n \times n$ invertible binary matrix, the output vectors $\{y_0, y_1, \ldots, y_{n-1}\}$ are calculated using in-place operations $x_i \leftarrow x_i \oplus x_j$, where $0 \le i, j \le n - 1$.
Definition 5. The g-XOR count is defined as the minimum number of required operations $x_i \leftarrow x_{j_1} \oplus x_{j_2}$, where $0 \le j_1, j_2 < i$.
Under d-XOR, some intermediate values can be computed repeatedly, which overestimates the cost of the final circuit (Yang, Zeng & Wang, 2021); therefore, the s-XOR and g-XOR metrics are used for further evaluation. However, computing the s-XOR count is computationally expensive, especially for optimizing full MDS matrices (Duval & Leurent, 2018).
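To make the gap between the d-XOR count and an SLP with shared intermediate signals concrete, the sketch below uses a small hypothetical 4 × 4 invertible binary matrix (not taken from this article). It computes the d-XOR count and then evaluates a hand-written two-input XOR program in the style of Definition 5. The program is one valid SLP for this matrix and is not claimed to be minimal.

```python
def d_xor_count(matrix):
    """d-XOR count: Hamming weight of the binary matrix minus its dimension n."""
    return sum(sum(row) for row in matrix) - len(matrix)


def run_slp(n, program):
    """Evaluate a two-input XOR program.

    Signals are bitmasks over x_0..x_{n-1}; each step (j1, j2) appends
    signals[j1] XOR signals[j2] as a new signal.
    """
    signals = [1 << i for i in range(n)]
    for j1, j2 in program:
        signals.append(signals[j1] ^ signals[j2])
    return signals


# Hypothetical matrix: row i lists which inputs are XORed into output y_i.
M = [
    [1, 1, 1, 0],  # y0 = x0 + x1 + x2
    [0, 1, 1, 1],  # y1 = x1 + x2 + x3
    [0, 0, 1, 1],  # y2 = x2 + x3
    [0, 0, 0, 1],  # y3 = x3
]
row_masks = [sum(bit << i for i, bit in enumerate(row)) for row in M]

# t4 = x2 + x3 (= y2), t5 = x1 + t4 (= y1), t6 = x0 + x1, t7 = t6 + x2 (= y0)
program = [(2, 3), (1, 4), (0, 1), (6, 2)]
signals = run_slp(4, program)

print(d_xor_count(M), len(program))
print(all(mask in signals for mask in row_masks))
```

Here the shared signal t4 is reused for two outputs, so the program needs one XOR fewer than the d-XOR count of the same matrix.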

SBP HEURISTIC
The SBP heuristic starts from Boyar-Peralta's heuristic but uses a different structure for choosing the new bases in order to find the optimal circuit solutions. SBP chooses a threshold value that gives the number of best candidate pairs to retain, i.e., the pairs that minimize the sum of distances or maximize the Euclidean norm, going beyond the tied ones. After that, it performs a randomization step to pick one of these best pairs at random by using the uniform integer distribution function. This function produces integer values in the range [0, threshold value] according to a uniform discrete distribution. Different distributions such as uniform, normal, and sampling distributions, all based on the Mersenne Twister algorithm (Matsumoto & Nishimura, 1998), were tested in our initial experiments to explore the effects of various random number distributions. The findings showed that the uniform integer distribution was the most effective (in terms of the XOR count of the generated circuit) for our research problem. Therefore, we chose it for further experiments.
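The randomization step described above can be sketched as follows. This is a simplified illustration and not the authors' implementation: the function name, the `(score, pair)` layout, and the clamping of the pool to the number of available candidates are assumptions made here. Python's `random` module is itself based on the Mersenne Twister generator.

```python
import random


def pick_candidate(candidates, threshold, rng):
    """Pick one pair uniformly at random among the `threshold` best candidates.

    candidates: list of (score, pair) tuples, where a higher score is better
    (an assumed layout). The pool is clamped so that it never exceeds the
    number of candidates actually available.
    """
    ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
    pool = ranked[:max(1, min(threshold, len(ranked)))]
    index = rng.randrange(len(pool))  # uniform integer in [0, len(pool))
    return pool[index][1]


rng = random.Random(2024)  # seeded only to make the sketch reproducible
candidates = [(5.0, ("x0", "x1")), (4.5, ("x2", "x3")),
              (3.0, ("x1", "x2")), (1.0, ("x0", "x3"))]
picks = {pick_candidate(candidates, threshold=2, rng=rng) for _ in range(50)}
print(picks)
```

Only the two best-scoring pairs can ever be returned with `threshold=2`, which is the sense in which a small threshold keeps the selection inside a pool of high-quality elements.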
We present all the details in Algorithm 1. In Algorithm 1, S denotes a sequence of input signals (i.e., the $x_i$'s), a depth array keeps track of the circuit depth of the signals in S, and D denotes the distance vector, where $d_H(S, y_i)$ represents the Hamming distance from S to the output signals (i.e., the $y_i$'s). SBP picks signal pairs that maximize the Euclidean norm of the updated distance vector D while taking the circuit depth limit into account. Here, however, the algorithm handles only a specified number of pairs (depending on the chosenParam parameter). Then, SBP applies the uniform discrete distribution function to determine a new base element. It repeats the previous steps until all elements of D are equal to zero. The idea behind SBP potentially leads to the best result by pairing up the input signals that minimize the target values in the distance vector.
The chosenParam parameter plays a pivotal role in defining the dimension of the element space. If the size of this space equals the maximum count of selectable elements, SBP yields outcomes equal to those of other optimization algorithms. However, by constraining the number of elements within this space (through the chosenParam value), SBP consistently selects superior elements. The chosenParam value can be selected based on (1) the size of the matrix and (2) the runtime of other optimization algorithms. For small matrix sizes, the chosenParam value should be lower than for larger sizes. For the second condition, if generating the circuit for the same matrix takes a long time in other optimization algorithms, it is advisable to keep the chosenParam value low; if it takes a short time, a higher threshold is recommended. However, when the chosenParam value is set excessively high, the algorithm may enter an infinite loop, making it challenging to select between elements or to find any optimal element at all. When establishing the maximum value for the threshold, the fundamental factor to consider is the number of elements the algorithm places in the candidate list (i.e., the allElement array) during each element selection. For example, if there are 10 elements in the allElement array within one iteration, the threshold value should not exceed 10. However, since this number varies with each iteration of the algorithm, a precise threshold value cannot be calculated. Therefore, an average threshold value can be determined instead.
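The role of the distance vector can be illustrated on a toy instance. The sketch below is not Algorithm 1: it computes each distance by brute force (the fewest XOR gates needed to build a target from the current base) and ranks candidate pairs with the usual Boyar-Peralta rule, assumed here to be "minimize the sum of distances, then maximize the Euclidean norm". Signals are bitmasks over the inputs, and all names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def distance(base, target):
    """Fewest XOR gates needed to build `target` from the signals in `base`."""
    for k in range(1, len(base) + 1):
        for combo in combinations(base, k):
            acc = 0
            for signal in combo:
                acc ^= signal
            if acc == target:
                return k - 1  # XORing k signals together costs k - 1 gates
    raise ValueError("target is not reachable from the base")


def rank_pairs(base, targets):
    """Rank candidate pairs by (sum of new distances, -Euclidean norm)."""
    ranked = []
    for a, b in combinations(base, 2):
        new_signal = a ^ b
        if new_signal in base:
            continue
        new_d = [distance(base + [new_signal], y) for y in targets]
        norm = sqrt(sum(d * d for d in new_d))
        ranked.append((sum(new_d), -norm, (a, b), new_d))
    ranked.sort()
    return ranked


base = [0b0001, 0b0010, 0b0100, 0b1000]  # x0, x1, x2, x3
targets = [0b0111, 0b1110]               # y0 = x0+x1+x2, y1 = x1+x2+x3
initial_d = [distance(base, y) for y in targets]
best = rank_pairs(base, targets)[0]
print(initial_d, best[2], best[3])
```

On this instance the pair (x1, x2) is ranked first because the new signal x1 + x2 brings both targets one gate closer at once. A threshold-based variant would keep the first few entries of the ranked list instead of only the first.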
While SBP shares its foundational traits with A1, A2, and RNBP, its superior performance can be attributed to its distinct element storage logic. Unlike other optimization algorithms that exhaustively explore all possibilities during element storage, thereby significantly expanding the search space and often generating numerous divergent paths, SBP takes a more controlled approach. The SBP algorithm carefully curates the search space and stores the elements acquired through the element selection process of the BP algorithm, up to a specified limit. This strategy ensures that the highest-quality elements remain readily accessible within the stored values. Selecting from this pool of top-tier elements keeps the focus on achieving superior results. Consequently, this approach narrows down the search space, ultimately leading to the optimal XOR count.

Better circuit implementations for 4 × 4 low-latency involutory MDS matrices by using SBP heuristic
In this subsection, we apply our new heuristic SBP to the existing and new linear layers and find numerous low-latency candidates for circuit implementations. Notably, we give a new 4 × 4 involutory MDS matrix over $\mathbb{F}_{2^4}$/0x19 which can be implemented with only 41 g-XOR gates and depth 3 by applying the SBP global optimization method, while the previous best result requires 45 g-XOR gates (Liu et al., 2022a) at the same depth.
In Table 1, we provide the circuit implementation and computation sequence of the matrix $GH_1$ obtained by applying the SBP heuristic with threshold value 7. Moreover, to look for more efficient low-latency circuit implementations of the matrix $GH_1$, we compare our implementation with the results of different state-of-the-art heuristics. We ran all the algorithms for eight hours on the matrix $GH_1$, taking the number of XOR gates into account with respect to the minimum depth, and we present all the implementation costs in Table 2. As shown in Table 2, our proposed heuristic leads to better circuit results in terms of circuit depth (not only depth 3 but also other depths) than the other heuristics given in the literature.
Furthermore, in Table 3, we consider several 4 × 4 linear layers given in the literature to extract their low-latency circuits. Our results are better than those of the other heuristics, and the SBP heuristic yields a significant improvement for the minimum circuit depth metric. Note that the implementation of the 4 × 4 involutory MDS matrix given in Sarkar & Syed (2016) requires only 44 g-XOR and 40 g-XOR for depth 3 and depth 4, respectively. These new records beat all previous best-known results for this matrix. Even so, the circuit of $GH_1$ (see Table 1) beats all the records for low-latency implementations of 4 × 4 involutory MDS linear layers.
Table 1 (final rows)
No. | New base element | New distance vector D
39 | t_39 = x_6 + x_13 (1) | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0]
40 | t_40 = t_2 + t_11 (2) | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
41 | t_41 = t_39 + t_40 [y_13] (3) | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

DEPTH-BOUNDED VERSION OF BDKCI HEURISTIC
[...] within the original BDKCI heuristic. Note that we have not only made alterations to these two functions but have also modified others called within them. Algorithm 2 represents the MAIN function, which begins by importing a target matrix. It then systematically introduces XOR gates using the SLP method until all elements within the target matrix are covered. This process is repeated so that the parameters of the XOR circuit improve over multiple applications of the SLP method. At the end of each iteration, the best XOR circuit parameters, including relevant information such as cost and depth, are recorded in a log file. Algorithm 3, on the other hand, represents the PICKNEWBASEELEMENTXOR3 function. In this function, an element (the chosen value) is picked randomly from the element array to generate a circuit gate. Also, the depth value of the selected element is appended to the depth array. In the original BDKCI version, the A1 and A2 algorithms can be used in addition to RNBP to determine the chosen value. In our proposed depth-bounded version, however, we use only the RNBP heuristic.
The following is a brief overview of the changes made to the original BDKCI heuristic: Within the MAIN function, the "BestDepth" variable is declared as a large data type, enabling it to store the minimum depth value identified during the algorithm's execution.Inside the same function, we have established the "depths" array for the purpose of retaining the depth of each gate.These values play a crucial role in identifying the minimum depth value attained throughout the algorithm's execution.
The return type of the EASYMOVEXOR3 function has been altered to an integer, allowing us to decide whether to print the results based on the function's return value. Moreover, inside the same function, we have made the following modifications, which allow us to record the depth information of two-input, three-input, and four-input XOR gates, respectively:
depths[BaseSize] = max(depth_map[a], depth_map[b]) + 1,
depths[BaseSize] = max(depth_map[a], depth_map[b], depth_map[c]) + 1,
depths[BaseSize] = max(depth_map[a], depth_map[b], depth_map[c], depth_map[d]) + 1.
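The depth rule above is the same for every gate width: a gate sits one level above its deepest input. A minimal sketch, assuming a dictionary-based `depth_map` and hypothetical signal names (the original code is not reproduced here):

```python
def gate_depth(depth_map, inputs):
    """Depth of an XOR gate: one more than the deepest of its inputs."""
    return max(depth_map[s] for s in inputs) + 1


# Primary inputs have depth 0.
depth_map = {f"x{i}": 0 for i in range(6)}

depth_map["t6"] = gate_depth(depth_map, ["x0", "x1"])              # XOR2
depth_map["t7"] = gate_depth(depth_map, ["t6", "x2", "x3"])        # XOR3
depth_map["t8"] = gate_depth(depth_map, ["t7", "x4", "x5", "x0"])  # XOR4

circuit_depth = max(depth_map.values())
print(depth_map["t6"], depth_map["t7"], depth_map["t8"], circuit_depth)
```

Because a three-input or four-input gate counts as a single level, replacing a chain of two-input gates by one wider gate can lower the recorded depth as well as the gate count.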
Furthermore, within the EASYMOVEXOR3 function, a boolean variable named "foundone" has been defined to monitor whether the algorithm's depth surpasses the specified threshold value, which determines whether the current algorithm round continues or ends.
In the function PICKNEWBASEELEMENTXOR3, we have defined the "DepthLimit" variable, which allows us to generate circuits with the chosen circuit depth. Moreover, the condition "[...] DepthLimit)" compares the depth of the element pair that is eligible for selection in the current round with the depth limit. If the depth limit is exceeded, this pair of elements is not selected, and the loop continues in order to select a new pair of elements.
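The depth check in PICKNEWBASEELEMENTXOR3 can be pictured with the following sketch. It is an assumption-laden simplification: the original loops until an admissible pair is drawn, whereas this version filters the candidate list first and then draws once. The function name, argument layout, and signal names are hypothetical.

```python
import random


def pick_within_depth(candidates, depth_map, depth_limit, rng):
    """Return a random candidate pair whose XOR gate respects depth_limit.

    Returns None when no candidate is admissible, which a caller could treat
    as the end of the current round.
    """
    allowed = [
        (a, b) for a, b in candidates
        if max(depth_map[a], depth_map[b]) + 1 <= depth_limit
    ]
    if not allowed:
        return None
    return rng.choice(allowed)


depth_map = {"x0": 0, "x1": 0, "t2": 2, "t3": 3}
candidates = [("t3", "x0"), ("t2", "x1"), ("x0", "x1")]
rng = random.Random(7)  # seeded only to make the sketch reproducible

choice = pick_within_depth(candidates, depth_map, depth_limit=3, rng=rng)
print(choice)
```

With a depth limit of 3, the pair ("t3", "x0") is never returned because the resulting gate would sit at depth 4.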

Better circuit implementations by using depth-bounded version of BDKCI heuristic
In this subsection, we present improved circuit implementations for the linear layers of some block ciphers, obtained with the circuit depth-bounded version of the BDKCI heuristic proposed in this study. We obtain a circuit for the AES MixColumn matrix with a cost of 240.95 GE (see Table 4) in the ASIC4 library. This circuit uses five XOR2 gates, seven XOR3 gates, and 32 XOR4 gates with depth 3, outperforming the previous best result of 243 GE with depth 5 (Liu et al., 2022b). Note that XOR2, XOR3, and XOR4 refer to two-input, three-input, and four-input XOR gates, respectively. The binary matrix of AES MixColumn is taken directly from the repository given in Baksi et al. (2021). Table 5 provides an overview of recent works on AES MixColumn, including our own findings. Additionally, we have improved on the previous implementations of the linear layers of ANUBIS and CLEFIA $M_0$. As for TWOFISH, we find a circuit that equals the previous best-known result.
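The reported figures can be cross-checked with a short calculation. The per-gate GE costs below are assumptions: they are not stated in this section and are used here because they reproduce the reported total of 240.95 GE for the ASIC4 library.

```python
# Assumed per-gate area costs in GE for the ASIC4 (STM 130 nm) library.
ASSUMED_GE = {"XOR2": 3.33, "XOR3": 4.66, "XOR4": 5.99}

# Gate counts reported for the AES MixColumn circuit.
gate_counts = {"XOR2": 5, "XOR3": 7, "XOR4": 32}

total_gates = sum(gate_counts.values())
total_ge = sum(gate_counts[g] * ASSUMED_GE[g] for g in gate_counts)

print(total_gates, round(total_ge, 2))
```

The calculation also shows why wider gates pay off in area under these assumed costs: one XOR4 replaces three XOR2 gates at well under three times the cost of a single XOR2.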

Table 2
Circuit cost (XOR count/depth) of $GH_1$ under several global optimization algorithms.

Table 3
Comparison of circuit cost (XOR count/depth) of binary matrices of size 16 × 16 under several global optimization algorithms.
Note: Bold values indicate the best results.

Table 6
Summary of implementation costs of linear layers of various block ciphers in ASIC4 library.
Note: Bold values indicate the best results.

Table 8
The global optimization results of 4 × 4 involutory and MDS matrices over $\mathbb{F}_{2^4}$ by using the depth-bounded version of BDKCI.