Structural and Statistical Analysis of Multidimensional Linear Approximations of Random Functions and Permutations

The goal of this paper is to investigate linear approximations of random functions and permutations. Our motivation is twofold. First, before the distinguishability of a practical cipher from an ideal one can be analysed, the cryptanalyst must have an accurate understanding of the statistical behaviour of the ideal cipher. Secondly, this issue has been neglected both in old and in more recent studies, particularly when multiple linear approximations are being used simultaneously. Traditional models have been based on the average behaviour and simplified using other assumptions such as independence of the linear approximations. Multidimensional cryptanalysis was introduced to avoid making artificial assumptions about statistical independence of linear approximations. On the other hand, it has the drawback of including many trivial approximations that do not contribute to the attack but just cause a waste of time and memory. We show for the first time in this paper that the trivial approximations reduce the degree of freedom of the related $\chi^2$ distribution. Previously, the affine linear cryptanalysis was proposed to allow removing trivial approximations and, at the same time, admitting a solid statistical model. In this paper, we identify another type of multidimensional linear approximation, called Davies-Meyer approximation, which has similar advantages, and present full statistical models for both the affine and the Davies-Meyer type of multidimensional linear approximations. The new models given in this paper are realistic, accurate and easy to use. They are backed up by standard statistical tools such as Pearson's $\chi^2$ test and finite population correction and demonstrated to work accurately using practical examples.

ciphers. It makes use of the nonrandom behaviour of certain linear approximations of the cipher. Linear approximations are single-bit values obtained by exclusive-or summation of certain input bits and output bits over some rounds of the block cipher.
In the setting of linear key-recovery attacks, the traditional heuristic assumption is that a keyed iterated block cipher becomes a pseudorandom function or permutation if some of its rounds are replaced by encryption using a wrong key. On the other hand, if the key is correct, then the data is computed from the cipher. Distinguishing between these two cases using statistical tests requires statistical models of the test statistic. For a recent overview of the existing models, we refer to [1]. Such statistical models are always based on trade-offs between accuracy and feasibility. The traditional approach has been to state some unproven assumptions, called the wrong-key hypothesis and the right-key hypothesis, which are intended to capture the statistical behaviour while remaining simple enough to allow feasible computation of the model.
In all existing studies, the wrong-key hypothesis in linear cryptanalysis, as well as in other statistical attacks, is based on a statistical model of the family of random permutations, when the target cipher is a block cipher, or a model of the family of random functions in some other cases such as stream ciphers. Then the main effort in the cryptanalytic attack is focused on identifying and demonstrating evidence of nonrandom behaviour in the target cipher. In linear cryptanalysis, the problem is to find bit combinations that are either strongly biased, or equal to zero for all keys. The known search algorithms for finding suitable biased linear approximations are based on Matsui's seminal work [2], where biased linear approximations were found by identifying one or more strong linear approximation trails of which the linear approximation is composed. The right-key hypothesis is then derived from a statistical model that captures the probability distributions of the linear approximations, and their parameters, in the case of the cipher.
The success probability and the data complexity of the attack are then estimated based on statistical distinguishing between the probability distributions in the right-key case and the wrong-key case. Since the wrong-key case is typically modelled using linear approximations of randomly and uniformly selected permutations, it is clear that a proper understanding of the random behaviour has an essential role in statistical cryptanalysis.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Throughout the history of linear cryptanalysis, the wrong-key hypothesis has taken different forms, and the main contributions are rather scattered in the literature. The first goal of this paper is to give a concise presentation of the behaviour of random functions and permutations under linear cryptanalysis. Our second goal is to present a new and more realistic model of the wrong-key hypothesis for the multidimensional linear cryptanalysis. The statistical behaviour of a multidimensional linear approximation appears to depend significantly on its structure.

B. Existing Wrong-Key Models in Linear Cryptanalysis
The understanding of the statistical behaviour of linear approximations of random functions and permutations has developed considerably over time. In early works, correlations of linear approximations of random permutations were estimated to be negligible and equal to their expected value, zero. While it was understood already in 1994 by O'Connor [3] that the correlations of linear approximations vary over the random permutations, it was not until 2006 that this fact was examined in more detail by Daemen and Rijmen [4]. They considered the probability distribution of correlations of linear approximations both for random functions and random permutations and showed that both distributions behave similarly and can be approximated using normal distributions with the same parameters, the only distinction being that, for permutations, the discrete distribution of correlations can take only even values.
These advanced models of linear approximations of random functions and permutations led to the observation that if a linear approximation of a cipher has correlation equal to zero for all keys, then it is not random and can be distinguished from random [5]. Conversely, this means that under the traditional hypothesis, according to which correlations of linear approximations of random permutations are equal to zero, even a truly randomly selected permutation will be falsely identified as nonrandom, because the correlations of its linear approximations are usually nonzero. This example illustrates how important it is to state the wrong-key assumption accurately.
The wrong-key model of [4] was extended by Bogdanov and Tischhauser [6] by integrating data sampling into it. While being an important opening to key-dependent models, it had two main drawbacks. First, the right-key model was still based on the assumption that the correlation of the linear approximation is equally large (in absolute value) for all keys. Secondly, the plaintexts were assumed to be drawn with replacement. While giving realistic estimates for small sample sizes, this approach leads to significant deviations from the true behaviour when the sample size approaches the full codebook. These two drawbacks of that model were highlighted by the counterintuitive phenomenon that the success probability is not always an increasing function of the data complexity. The underlying problems were corrected by a new model given in [7] for a single linear approximation based on a single dominant trail. A more detailed study of the conditions for this counterintuitive phenomenon of nonmonotonicity was given in [8].
With the goal of making the linear distinguishers more powerful, several authors have proposed to use multiple linear approximations simultaneously. In the early models, the wrong-key hypothesis was always based on the assumption that in the wrong-key case, the expected correlations, that is, the correlations of linear approximations computed for the full codebook of the cipher, behave as on average, that is, are equal to zero [9], see also [10]. Recently, key-dependency has been integrated into the models both in the wrong-key and right-key cases [7], [11], [12] by adopting a simplifying assumption that the correlations of any set of multiple linear approximations are independent when considered over the set of all permutations. In a subsequent version [1] of [11], this assumption was stated only for correlations of linearly independent linear approximations of random permutations. Whether this means a true theoretical improvement is not known.
In general, not much is known about the statistical independence of correlations considered as random variables over the key space. Only correlations of components of balanced functions are known to be independent trivially as they are always constants, that is, equal to zero. A multidimensional linear approximation of a permutation is not in general a balanced function. Hence the correlations of its components may not be equal to zero and may have statistical dependencies.
The assumption about independence of correlations was needed to derive statistical distributions for the sum of the squared correlations of the linear approximations. More specifically, the independence assumption has been used for expressing the variance of the sum of squared correlations as the sum of the variances of the squared correlations of the individual linear approximations. In this paper, it will be shown that, for certain sets of linear approximations, this result can be achieved without the independence assumption.

C. Our Contributions
We start by deriving exact formulas for the mean and variance of the capacity of multinomially distributed variables and make the observation that the variance of the capacity is additive, that is, it can be expressed as the sum of the variances of the capacities of the individual variables in the case when the expected distribution is uniform. This corresponds to the case of the expected value distribution of a random function.
We continue by revisiting the distributions of correlations of single linear approximations of random functions and random permutations. Adding to the results of [4] we observe that a linear approximation of a random function is a random Boolean function, while this is not the case if the random functions are restricted to permutations. We give the discrete probability distribution of the correlation of a linear approximation of a random permutation in terms of a hypergeometric distribution.
While multidimensional linear approximations of some functions can be modelled using the multinomial distribution, this is never the case for a multidimensional linear approximation of permutations. Even in case of a single variable, the hypergeometric distribution must be used instead of the binomial distribution. We leave it an open question whether the multivariate hypergeometric distribution might give a feasible approach in this case, and instead, use continuous approximations of the probability distributions to model the statistical behaviour of the capacity of a multidimensional linear approximation of a random permutation. This leads us to the study of the $\chi^2$ distribution.
In many practical applications of multidimensional linear cryptanalysis, the linear space of linear approximations contains many trivial approximations that have correlation zero for any permutation. Their impact has been ignored in previous works and the degree of freedom of the $\chi^2$ distribution is taken equal to $2^t - 1$ where $t$ is the dimension of the multidimensional linear approximation, see e.g. [7]. We prove that in the presence of trivial approximations, the degree of freedom is strictly less than $2^t - 1$. Moreover, we conjecture the correct value of the degree of freedom and present experimental evidence to support this conjecture. We also identify a new type of multidimensional linear approximation, which we call the Davies-Meyer approximation, and which is characterised by the property of not containing any trivial linear approximations.
Having found a realistic solution to the problem of how to model wrong-key behaviour for multidimensional linear cryptanalysis, we apply the same approach to the recently presented variant of linear cryptanalysis named affine multidimensional cryptanalysis [13]. Preliminary versions of these results appeared in [14].
Affine subsets of linear approximations naturally arise in many ciphers. As an example we analyse SIMON32/64, which is a Feistel cipher that employs the bitwise AND operation as the only nonlinear component of the round function. In this case, the affine spaces originate from the linear spaces comprising the four linear approximations of the AND operation that have non-zero correlations. Using our $\chi^2$ model of affine linear approximations of randomly selected permutations, we experimentally identify nonrandom behaviour of 2-dimensional affine sets of linear approximations over 13-18 rounds and a 6-dimensional affine set over 18 rounds of SIMON32/64. These experiments used the full codebook of data and $2^{13}$ randomly selected keys. We also performed similar experiments with less than the full codebook of data.

D. Outline
The standard definitions of linear cryptanalysis are recalled and the mean and variance of capacity are computed for a general multinomial distribution in Section II, where we also recall the related discrete probability distributions and their continuous approximations. In Section III, the distributions of correlations of single linear approximations are revisited. The new contributions on the structure and probability distributions of multidimensional linear approximations are presented in Section IV and applied to affine sets of approximations in Section V. Then we enhance these statistical models by integrating random sampling without replacement into them in Section VI. To perform randomness analysis of linear approximations of SIMON32/64, we define the randomness test in Section VII and present the results in Section VIII. The conclusions are drawn in Section IX.

A. Correlation and Capacity
Let $F$ be a function from $S$ to $\mathbb{F}_2^t$, where $S$ is a finite set and $\mathbb{F}_2^t$ is a vector space over $\mathbb{F}_2$ of dimension $t$. We focus on two ways of defining $F$. First, we can just give the (indexed) set of the values $F(x)$, $x \in S$. The second way of defining $F$ is to give $t$ Boolean functions $f_1, \ldots, f_t$, that is, the $t$ coordinate functions of $F$, and their values $f_i(x)$, $x \in S$, $i = 1, \ldots, t$. Given $\beta = (\beta_1, \ldots, \beta_t) \in \mathbb{F}_2^t$, we denote by $\beta \cdot F$ the linear combination of the coordinate functions of $F = (f_1, \ldots, f_t)$ determined by $\beta$, that is,
$$\beta \cdot F = \beta_1 f_1 + \cdots + \beta_t f_t,$$
and say that the Boolean function $\beta \cdot F$ is a component of $F$.
Functions are in general imbalanced, that is, not all values in the image space are taken equally often. Related to the two ways of defining $F$, we have two ways of measuring the imbalance of $F$. First, we can consider the uniformity of its value distribution. Given $\eta \in \mathbb{F}_2^t$ let us denote by $p_\eta$ its probability, that is,
$$p_\eta = \frac{|\{ x \in S \mid F(x) = \eta \}|}{|S|}.$$
Then the imbalance of this distribution is measured using the capacity
$$\mathrm{Cap}(F) = 2^t \sum_{\eta \in \mathbb{F}_2^t} \left( p_\eta - 2^{-t} \right)^2. \tag{1}$$
Secondly, we can consider the imbalance of its components using correlations. Let $f$ be a Boolean function from $S$ to $\mathbb{F}_2$. Then its correlation $\mathrm{cor}(f)$ is given by
$$\mathrm{cor}(f) = \frac{1}{|S|} \left( |\{ x \in S \mid f(x) = 0 \}| - |\{ x \in S \mid f(x) = 1 \}| \right). \tag{2}$$
It is well-known, see e.g. [15], [16], that these two approaches to measuring imbalance are related due to the following equality,
$$p_\eta = 2^{-t} \sum_{\beta \in \mathbb{F}_2^t} (-1)^{\beta \cdot \eta}\, \mathrm{cor}(\beta \cdot F), \tag{3}$$
or equivalently, by the Walsh-Hadamard transform,
$$\mathrm{cor}(\beta \cdot F) = \sum_{\eta \in \mathbb{F}_2^t} (-1)^{\beta \cdot \eta}\, p_\eta. \tag{4}$$
Then we can express $\mathrm{Cap}(F)$ also as
$$\mathrm{Cap}(F) = \sum_{\beta \neq 0} \mathrm{cor}(\beta \cdot F)^2. \tag{5}$$
In particular, a random function $F : S \to \mathbb{F}_2^t$ can be generated either by selecting its $t$ coordinate functions randomly and independently, or by picking its values $F(x)$ randomly and independently from $\mathbb{F}_2^t$. The value distribution of a random function $F$ follows a multinomial distribution. By (5) the expected value of the capacity of the value distribution of a random function is the sum of the expected values of the squared correlations taken over the non-trivial components of $F$. We are also interested in determining the variance of the capacity for random functions. The problem is not trivial, since we can neither assume all nonzero components of $F$ to be independent, nor to have independent correlations. Nevertheless, in the next subsection we give a result, see Corollary 1, which shows that, based solely on the properties of the multinomial distribution of the values of a random function $F$, the variance of its capacity is obtained as the sum of the equal variances of the squared correlations of its nonzero components.
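The relation between the two measures of imbalance can be checked numerically on a toy function. The following Python sketch assumes the usual definitions (capacity as $2^t \sum_\eta (p_\eta - 2^{-t})^2$ and correlation as the normalised difference between the numbers of zeroes and ones); the function names and the 16-element toy domain are hypothetical choices made for illustration only.

```python
from fractions import Fraction
from itertools import product
import random


def dot(u, v):
    """Inner product of two bit tuples over F_2."""
    return sum(a & b for a, b in zip(u, v)) % 2


def capacity_from_distribution(values, t):
    """2^t * sum_eta (p_eta - 2^-t)^2, with p_eta the relative frequency of eta."""
    m = len(values)
    uniform = Fraction(1, 2**t)
    return 2**t * sum(
        (Fraction(values.count(eta), m) - uniform) ** 2
        for eta in product((0, 1), repeat=t)
    )


def correlation(bits):
    """(#zeroes - #ones) / |S| for a list of Boolean values."""
    return Fraction(bits.count(0) - bits.count(1), len(bits))


def capacity_from_correlations(values, t):
    """Sum of squared correlations over the nonzero components beta . F."""
    return sum(
        correlation([dot(beta, y) for y in values]) ** 2
        for beta in product((0, 1), repeat=t)
        if any(beta)
    )


# Hypothetical toy example: a random function from a 16-element set to F_2^3.
rng = random.Random(1)
t = 3
values = [tuple(rng.randint(0, 1) for _ in range(t)) for _ in range(16)]
cap_p = capacity_from_distribution(values, t)
cap_c = capacity_from_correlations(values, t)
print(cap_p, cap_c)  # the two numbers agree
```

Exact rational arithmetic is used so that the two expressions can be compared with `==` rather than up to rounding.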

B. Capacity Related to Multinomial Distributed Variables
In the preceding section, the notion of capacity was defined as a measure of the uniformity of the value distribution of the function. More generally, we can define capacity for any finite set of non-negative values. Let $z_1, \ldots, z_k$ be non-negative real numbers and denote by $m$ their sum. Then we define their capacity as
$$C = k \sum_{\eta=1}^{k} \left( \frac{z_\eta}{m} - \frac{1}{k} \right)^2. \tag{6}$$
This quantity is related to the Euclidean distance of the probability distribution from the uniform distribution and is also called the squared Euclidean imbalance. By substituting $z_\eta = |\{ x \in S \mid F(x) = \eta \}|$ and $k = 2^t$ into (6), we have $m = |S|$ and we get the capacity of $F$ as defined by (1). Next we determine the mean and variance of the capacity for stochastic variables that follow a general multinomial distribution.
Let $z_1, \ldots, z_k$ be the outcomes of a set of $k$, $k \geq 2$, stochastic variables that follow a multinomial distribution with probabilities $p_1, \ldots, p_k$ and let us denote the number of trials by $m$. Then $z_1 + \cdots + z_k = m$. Let us denote by $C$ the capacity of $z_1, \ldots, z_k$ as given by (6). Then $C$ is also an outcome of a stochastic variable. The proof of the following result is given in Appendix A.
Theorem 1: Let C be the capacity of multinomially distributed variables and let the parameters of the multinomial distribution be p 1 , . . . , p k and m. Then Note that in the expression of the expected capacity we have which is the capacity of the values p 1 , . . . , p k . If p η = 1 k , for all η = 1, . . . , k, then P 2 = 1/k and P 2 2 = P 3 = 1/k 2 , and the mean and variance of the capacity of multinomially distributed variables are given by the following corollary.
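Corollary 1: Let $C$ be the capacity of a multinomially distributed variable with distribution parameters $p_\eta = 1/k$, for all $\eta = 1, \ldots, k$, and $m$. Then
$$\mathrm{Exp}(C) = \frac{k-1}{m}, \qquad \mathrm{Var}(C) = \frac{2(k-1)(m-1)}{m^3}.$$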
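For small parameters the moments of the capacity in the uniform case can be computed exactly by enumerating all $k^m$ equally likely outcomes. The Python sketch below does this and compares the result with the closed forms $(k-1)/m$ and $2(k-1)(m-1)/m^3$. These closed forms are our reading of the uniform case: they agree with the values $\mathrm{Exp}(c^2) = 2^{-n}$ and $\mathrm{Var}(c^2) = 2(2^n - 1)2^{-3n}$ quoted for $k = 2$, $m = 2^n$ in Section III. The capacity is assumed to be $k \sum_\eta (z_\eta/m - 1/k)^2$, and the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def capacity(z):
    """Capacity k * sum_i (z_i/m - 1/k)^2 of non-negative counts z (assumed definition)."""
    k, m = len(z), sum(z)
    return k * sum((Fraction(zi, m) - Fraction(1, k)) ** 2 for zi in z)


def exact_moments(k, m):
    """Exact mean and variance of the capacity for m uniform trials over k cells."""
    caps = []
    for outcome in product(range(k), repeat=m):  # all k^m equally likely sequences
        caps.append(capacity([outcome.count(i) for i in range(k)]))
    mean = sum(caps) / len(caps)
    var = sum((c - mean) ** 2 for c in caps) / len(caps)
    return mean, var


def closed_form(k, m):
    """Closed forms assumed for the uniform case: (k-1)/m and 2(k-1)(m-1)/m^3."""
    return Fraction(k - 1, m), Fraction(2 * (k - 1) * (m - 1), m**3)


for k, m in [(2, 2), (3, 4), (4, 5)]:
    print(k, m, exact_moments(k, m), closed_form(k, m))
```

The enumeration grows as $k^m$, so this check is only feasible for toy parameters; it is meant as a consistency test, not as a substitute for the proof in Appendix A.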

C. Standard Probability Distributions
The normal distribution is denoted by $N(\mu, \sigma^2)$, where $\mu$ is the mean and $\sigma^2$ is the variance. In case $\mu = 0$ and $\sigma^2 = 1$ this distribution is called the standard normal distribution.
The binomial distribution is the multinomial distribution with $k = 2$ and is denoted by $B(m, p)$, where $p = p_1$ and $1 - p = p_2$. The mean and variance of this distribution are $mp$ and $mp(1 - p)$, respectively. The binomial distribution corresponds to random sampling with replacement from a set $S$ of size $M = |S|$, where we have two types of elements, denoted by 0 and 1. If the sampling is without replacement then the number of outcomes of type 0 in $m$ experiments follows the hypergeometric distribution $HG(M, K, m)$, where $K$ is the number of elements of type 0 in the entire $S$. The mean and variance of the hypergeometric distribution are
$$mp \quad \text{and} \quad mp(1-p)B, \qquad B = \frac{M - m}{M - 1}, \tag{7}$$
respectively, where we denoted by $p$ the probability of outcomes of type 0 in the entire set $S$, that is, $p = K/M$. The variances of the binomial and hypergeometric distributions differ by a factor, whose close estimate is called the finite population correction coefficient [17]. For sufficiently large $S$, both distributions can be approximated by the normal distribution $N(\mu, \sigma^2)$, where $\mu$ is the mean and $\sigma^2$ is the variance, as follows:
$$B(m, p) \approx N\big(mp,\; mp(1-p)\big), \tag{8}$$
$$HG(M, K, m) \approx N\big(mp,\; mp(1-p)B\big). \tag{9}$$
The general (noncentral) chi-squared distribution with $\ell$ degrees of freedom and noncentrality parameter $\delta$ is denoted by $\chi^2_\ell(\delta)$. It is defined as the probability distribution of the sum of squares of $\ell$ independent random variables that follow the normal distributions $N(\mu_i, 1)$, $i = 1, \ldots, \ell$. Then $\delta = \sum_{i=1}^{\ell} \mu_i^2$. The mean of the $\chi^2_\ell(\delta)$ distribution is $\ell + \delta$ and its variance is $2(\ell + 2\delta)$. If $\delta = 0$ then the distribution is called central and is denoted by $\chi^2_\ell$. Another setting that gives rise to a chi-squared distribution is the one of the multinomial distribution. Using the same notation as in Subsection II-B we set
$$T = \sum_{\eta=1}^{k} \frac{(z_\eta - m/k)^2}{m/k}. \tag{10}$$
Then $T$ is the test statistic of the well-known Pearson's chi-squared test and it is known to follow approximately the $\chi^2_{k-1}(\delta)$ distribution with
$$\delta = mk \sum_{\eta=1}^{k} \left( p_\eta - \frac{1}{k} \right)^2, \tag{11}$$
see, e.g., [18].
Note that in this setting the number of degrees of freedom is the number of variables that are free to vary, that is $k - 1$, the size of the domain of the multinomial distribution minus one, due to the constraint $z_1 + \cdots + z_k = m$. If there are other constraints, then the number of degrees of freedom may be further reduced. For example, $s$ additional linearly independent linear constraints on the values $z_\eta$ will further reduce the number of degrees of freedom to $k - 1 - s$. By the expression (6) of $C$ we have $T = mC$. Hence the $\chi^2$ distribution of $T$ can be used to give a continuous approximation of the discrete probability distribution of $C$. For example, we can compare the mean and variance of $C$ given by Corollary 1 in the case where $p_\eta = 1/k$ for all $\eta = 1, \ldots, k$ with the ones obtained from the $\chi^2$ distribution of $T$. We can see that the means are identical, while the variances differ by a negligible term $2(k - 1)/m^3$.
The multinomial distribution and the related Pearson's $\chi^2$ distribution apply to the case when the values $z_\eta$ are obtained by drawing samples of $m$ elements from $S$ with replacement. If sampling is without replacement then the multivariate hypergeometric distribution must be used instead of the multinomial distribution. Then the statistic $T$ given in (10) must be multiplied by the inverse of the finite population correction coefficient to get a $\chi^2$-distributed variable [17]. We state this result for further reference as follows.
Lemma 1: Let $T$ be given by (10) where the values of variables $z_\eta$, $\eta = 1, \ldots, k$, are obtained by sampling $m$ elements from $S$ without replacement and the initial probabilities $p_\eta$ are as defined in the setting of the multinomial distribution. Then the variable $T/B$, where $B$ is given by (7), approximately follows the $\chi^2_{k-1}(\delta)$ distribution, where $\delta$ is given by (11).
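The effect of sampling without replacement on the variance can be illustrated with exact arithmetic. The Python sketch below computes the mean and variance of $HG(M, K, m)$ from its probability mass function and compares the ratio to the binomial variance with $(M - m)/(M - 1)$, the standard exact factor for the hypergeometric distribution. The parameter values are hypothetical, and identifying this factor with the coefficient $B$ of Lemma 1 is an assumption of the sketch.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(M, K, m):
    """P(X = x): x elements of type 0 among m draws without replacement."""
    total = comb(M, m)
    lo, hi = max(0, m - (M - K)), min(K, m)
    return {x: Fraction(comb(K, x) * comb(M - K, m - x), total) for x in range(lo, hi + 1)}


def moments(pmf):
    """Exact mean and variance of a finite distribution given as {value: probability}."""
    mean = sum(x * q for x, q in pmf.items())
    var = sum((x - mean) ** 2 * q for x, q in pmf.items())
    return mean, var


# Hypothetical parameters: 48 draws from a set of 64 elements, half of type 0.
M, K, m = 64, 32, 48
p = Fraction(K, M)
mean, var = moments(hypergeom_pmf(M, K, m))
binomial_var = m * p * (1 - p)
factor = var / binomial_var
print(mean, var, factor, Fraction(M - m, M - 1))
```

As $m$ approaches $M$ the factor tends to zero, which reflects the fact that the full codebook leaves no sampling randomness.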

III. PROBABILITY DISTRIBUTION OF A SINGLE LINEAR APPROXIMATION OF A RANDOM FUNCTION AND PERMUTATION
We denote by $\mathbb{F}_2^n$ the linear space over the field $\mathbb{F}_2 = \{0, 1\}$ with addition denoted by '+' and inner product denoted by '·'. Let $f$ be a Boolean function from $\mathbb{F}_2^n$ to $\mathbb{F}_2$. Given an element $a \in \mathbb{F}_2^n$ the Boolean function $g$ defined as
$$g(x) = a \cdot x + f(x)$$
is called a linear approximation of $f$. We first derive the distributions of linear approximations of random Boolean functions and random balanced Boolean functions. They are essentially the same as those given by Daemen and Rijmen in [4]. In this section, we will complete their work by giving the exact distributions in both cases.

A. Zeroes of Linear Approximations
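Let $f$ be a Boolean function from $\mathbb{F}_2^n$ to $\mathbb{F}_2$. We say that $x \in \mathbb{F}_2^n$ is a zero of $f$ if $f(x) = 0$. To determine the correlation of a linear approximation of $f$, let us first determine the number of its zeroes.
Lemma 2: Let $a \in \mathbb{F}_2^n$, $a \neq 0$, and $f$ be a Boolean function from $\mathbb{F}_2^n$ to $\mathbb{F}_2$. Let $N_0$ be the number of the zeroes of $f$. Then the number of zeroes of the linear approximation $g(x) = a \cdot x + f(x)$ is $2^{n-1} - N_0 + 2\upsilon$, where
$$\upsilon = |\{ x \in \mathbb{F}_2^n \mid a \cdot x = 0 \text{ and } f(x) = 0 \}|. \tag{12}$$
Proof: Since $a \neq 0$, exactly $2^{n-1}$ inputs $x$ satisfy $a \cdot x = 1$, and $N_0 - \upsilon$ of them are zeroes of $f$. Hence $|\{ x \mid a \cdot x = 1 \text{ and } f(x) = 1 \}| = 2^{n-1} - N_0 + \upsilon$. The zeroes of $g$ are the inputs with $a \cdot x = f(x)$, so adding $\upsilon$ to both sides of this equation gives what is claimed.
The following lemma gives the distribution of $\upsilon$.
Lemma 3: Let a Boolean function $f$ over $\mathbb{F}_2^n$ be chosen randomly and equiprobably from the set of all Boolean functions with a fixed number $N_0$ of zeroes. Let $a \in \mathbb{F}_2^n$ be nonzero and fixed. Then $\upsilon$ defined by (12) follows the hypergeometric distribution $HG(2^n, 2^{n-1}, N_0)$.
Proof: Given a fixed balanced linear function $a \cdot x$, the $N_0$ zeroes of $f$ are chosen by choosing $\upsilon$ zeroes among the $2^{n-1}$ zeroes of $a \cdot x$ and $N_0 - \upsilon$ zeroes among the $2^{n-1}$ inputs $x$ such that $a \cdot x = 1$.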
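The counting argument behind Lemmas 2 and 3 can be verified exhaustively for small $n$. The Python sketch below assumes that $\upsilon$ counts the common zeroes of $a \cdot x$ and $f$, and that the zero count of $g(x) = a \cdot x + f(x)$ is $2^{n-1} - N_0 + 2\upsilon$; both are our reading of the lemmas. It enumerates every Boolean function with a prescribed number of zeroes for $n = 3$. The mask and the value of $N_0$ are hypothetical.

```python
from collections import Counter
from itertools import combinations, product
from math import comb

n = 3
POINTS = list(product((0, 1), repeat=n))


def dot(a, x):
    """Inner product of two bit tuples over F_2."""
    return sum(ai & xi for ai, xi in zip(a, x)) % 2


def upsilon_histogram(a, n_zeroes):
    """Enumerate every f with exactly n_zeroes zeroes.

    Returns the histogram of upsilon and a flag telling whether the zero
    count of g(x) = a.x + f(x) always equals 2^(n-1) - N0 + 2*upsilon.
    """
    hist = Counter()
    identity_holds = True
    for zero_set in combinations(POINTS, n_zeroes):
        zeroes_f = set(zero_set)
        upsilon = sum(1 for x in zeroes_f if dot(a, x) == 0)
        # g(x) = 0 exactly when a.x equals f(x)
        zeroes_g = sum(1 for x in POINTS if dot(a, x) == (0 if x in zeroes_f else 1))
        identity_holds &= zeroes_g == 2 ** (n - 1) - n_zeroes + 2 * upsilon
        hist[upsilon] += 1
    return hist, identity_holds


a, n_zeroes = (1, 0, 1), 3
hist, identity_holds = upsilon_histogram(a, n_zeroes)
# Hypergeometric counts: choose u zeroes where a.x = 0 and N0 - u where a.x = 1.
expected = {
    u: comb(2 ** (n - 1), u) * comb(2 ** (n - 1), n_zeroes - u)
    for u in range(n_zeroes + 1)
}
print(identity_holds, dict(hist) == expected)
```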

B. Random Boolean Function
The number of zeroes of a Boolean function selected randomly and equiprobably from the set of all Boolean functions of $n$ variables follows the binomial distribution $B(2^n, \frac{1}{2})$.
Theorem 2: Let $f$ be selected randomly and equiprobably from the set of all Boolean functions of $n$ variables. Then for any fixed $a \in \mathbb{F}_2^n$ the number of zeroes of the linear approximation $a \cdot x + f(x)$ follows a binomial distribution $B(2^n, \frac{1}{2})$.
Proof: For any fixed Boolean function $g$, the mapping that sends a Boolean function $f$ to the function $f + g$ is a bijection on the set of all Boolean functions of $n$ variables. Hence if $f$ is chosen uniformly at random from this set then so is $f + g$. In particular, the number of zeroes of $f + g$ follows the same distribution as the number of zeroes of $f$.
For an alternative proof that computes the distribution of the zeroes of the linear approximation based on Lemma 3, see Appendix B. Now we apply Corollary 1 for k = 2 to get the following result.
Corollary 2: Let $a \in \mathbb{F}_2^n$ be fixed. The distribution of a correlation $c$ of a linear approximation $a \cdot x + f(x)$ of a Boolean function $f$ that is drawn randomly and equiprobably from the set of all Boolean functions of $n$ variables has the following parameters:
$$\mathrm{Exp}(c) = 0, \quad \mathrm{Var}(c) = 2^{-n}, \quad \mathrm{Exp}(c^2) = 2^{-n}, \quad \mathrm{Var}(c^2) = 2(2^n - 1)2^{-3n}.$$
Proof: When $k = 2$ we have $C = c^2$ by (5) and we can apply Corollary 1 with $m = 2^n$ to get $\mathrm{Exp}(c^2) = 2^{-n}$ and $\mathrm{Var}(c^2) = 2(2^n - 1)2^{-3n}$. Further, by Theorem 2 we have that $\mathrm{Exp}(2^n c) = 0$. Hence $\mathrm{Exp}(c) = 0$ and $\mathrm{Var}(c) = \mathrm{Exp}(c^2) - \mathrm{Exp}(c)^2 = 2^{-n}$.
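The parameters derived in the proof of Corollary 2 can be confirmed by brute force for $n = 3$, where all $2^8 = 256$ Boolean functions can be enumerated. The following Python sketch computes the correlation of $a \cdot x + f(x)$ for every $f$ and a hypothetical fixed mask $a$, then evaluates the moments of $c$ and $c^2$ exactly.

```python
from fractions import Fraction
from itertools import product

n = 3
POINTS = list(product((0, 1), repeat=n))


def dot(a, x):
    """Inner product of two bit tuples over F_2."""
    return sum(ai & xi for ai, xi in zip(a, x)) % 2


def correlations(a):
    """Correlation of a.x + f(x) for every Boolean function f of n variables."""
    cors = []
    for truth_table in product((0, 1), repeat=2**n):
        # x is a zero of a.x + f(x) exactly when a.x equals f(x)
        zeroes = sum(1 for x, fx in zip(POINTS, truth_table) if dot(a, x) == fx)
        cors.append(Fraction(2 * zeroes - 2**n, 2**n))
    return cors


def mean_var(xs):
    """Exact mean and variance of a list of rationals under the uniform measure."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return mean, var


c = correlations((1, 1, 0))
mean_c, var_c = mean_var(c)
mean_c2, var_c2 = mean_var([x * x for x in c])
print(mean_c, var_c, mean_c2, var_c2)
```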

C. Balanced Random Boolean Function
A Boolean function over $\mathbb{F}_2^n$ is said to be balanced if its number of zeroes is equal to $2^{n-1}$. It is well known that a vectorial Boolean function is a permutation if and only if all its components, that is, nonzero linear combinations of its coordinate functions, are balanced.
From Lemma 2 and Lemma 3 we get the following result.
Theorem 3: Let $f$ be selected randomly and equiprobably from the set of all balanced Boolean functions of $n$ variables. Then for any fixed $a \in \mathbb{F}_2^n$, $a \neq 0$, the number of zeroes of the linear approximation $f(x) + a \cdot x$ is an even integer $2\upsilon$ where $\upsilon \sim HG(2^n, 2^{n-1}, 2^{n-1})$.
Corollary 3: The distribution of a correlation $c = \mathrm{cor}(g)$ of a linear approximation $g(x) = a \cdot x + f(x)$, $a \neq 0$, of a balanced Boolean function $f$ drawn randomly and equiprobably from the set of all balanced Boolean functions of $n$ variables has the following parameters:
$$\mathrm{Exp}(c) = 0, \qquad \mathrm{Var}(c) = \frac{1}{2^n - 1}.$$
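For balanced functions the same kind of exhaustive check is possible. With $n = 3$ there are $\binom{8}{4} = 70$ balanced Boolean functions. The Python sketch below verifies that the zero count of $a \cdot x + f(x)$ is always even and computes the mean and variance of the correlation. The comparison value $1/(2^n - 1)$ is what the variance of $HG(2^n, 2^{n-1}, 2^{n-1})$ implies for $c = 2^{2-n}\upsilon - 1$; we treat it as a derived assumption rather than as a quotation. Inputs and masks are encoded as integers, which is a convenience of this sketch.

```python
from fractions import Fraction
from itertools import combinations

n = 3
SIZE = 2**n


def parity(v):
    """Parity of the bits of v, i.e. the inner product a.x when v = a & x."""
    return bin(v).count("1") & 1


def balanced_correlations(a):
    """(zero count, correlation) of a.x + f(x) for every balanced f of n variables."""
    out = []
    for zero_set in combinations(range(SIZE), SIZE // 2):
        zeroes_f = set(zero_set)
        zeroes_g = sum(
            1 for x in range(SIZE) if parity(a & x) == (0 if x in zeroes_f else 1)
        )
        out.append((zeroes_g, Fraction(2 * zeroes_g - SIZE, SIZE)))
    return out


results = balanced_correlations(0b101)  # hypothetical nonzero mask
cors = [c for _, c in results]
mean_c = sum(cors) / len(cors)
var_c = sum((c - mean_c) ** 2 for c in cors) / len(cors)
print(len(results), mean_c, var_c)
```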

D. Random Vectorial Boolean Function and Permutation
In the context of linear cryptanalysis, a linear approximation of a vectorial Boolean function $F : \mathbb{F}_2^n \to \mathbb{F}_2^s$ is identified with a Boolean function $g$ defined as
$$g(x) = a \cdot x + b \cdot F(x), \qquad a \in \mathbb{F}_2^n,\; b \in \mathbb{F}_2^s.$$
Since a single component $b \cdot F(x)$, $b \neq 0$, of a random vectorial Boolean function is a random Boolean function, it follows that the number of zeroes of a linear approximation of a random vectorial Boolean function is binomially distributed as given by Theorem 2.
For permutations, the nonzero component functions b ·F (x) are balanced Boolean functions. Therefore, the distribution of the zeroes of a single linear approximation of a permutation drawn uniformly at random among all permutations is given by Theorem 3.
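The characterisation of permutations through balanced components, on which this subsection relies, can be checked exhaustively in a tiny case. The Python sketch below runs over all $4^4 = 256$ functions from $\mathbb{F}_2^2$ to $\mathbb{F}_2^2$, encoded as tuples of integers (an encoding chosen for this sketch), and tests whether being a permutation coincides with having all nonzero components balanced.

```python
from itertools import product

n = 2
SIZE = 2**n


def parity(v):
    """Parity of the bits of v, i.e. the inner product b.y when v = b & y."""
    return bin(v).count("1") & 1


def is_permutation(values):
    """values[x] = F(x); F is a permutation iff all values are distinct."""
    return len(set(values)) == SIZE


def all_components_balanced(values):
    """True if b.F takes the value 1 on exactly half of the inputs for every b != 0."""
    return all(
        sum(parity(b & y) for y in values) == SIZE // 2 for b in range(1, SIZE)
    )


agree = all(
    is_permutation(values) == all_components_balanced(values)
    for values in product(range(SIZE), repeat=SIZE)
)
print(agree)
```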

IV. MULTIDIMENSIONAL LINEAR APPROXIMATIONS OF PERMUTATIONS

A. Multidimensional Linear Approximation as a Vectorial Boolean Function
Let $F : \mathbb{F}_2^n \to \mathbb{F}_2^s$ be a vectorial Boolean function. A multidimensional linear approximation $\Lambda$ is a vectorial Boolean function such that the components of $\Lambda$ form a linear subspace of the linear space of all linear approximations of $F$. Let us denote this subspace by $L$ and its dimension by $t$. Let us fix a basis $\lambda_1, \ldots, \lambda_t$ of $L$, and give notations of the basic elements as
$$\lambda_i(x) = a_i \cdot x + b_i \cdot F(x), \qquad i = 1, \ldots, t. \tag{13}$$
Then $\Lambda : \mathbb{F}_2^n \to \mathbb{F}_2^t$ is given by $\lambda_1, \ldots, \lambda_t$ as its coordinate functions. Given $\beta = (\beta_1, \ldots, \beta_t) \in \mathbb{F}_2^t$ the component $\beta \cdot \Lambda$ of $\Lambda$ has a unique representation as a linear approximation of $F$ of the form $a \cdot x + b \cdot F(x)$ as follows:
$$\beta \cdot \Lambda(x) = \Big( \sum_{i=1}^{t} \beta_i a_i \Big) \cdot x + \Big( \sum_{i=1}^{t} \beta_i b_i \Big) \cdot F(x). \tag{14}$$
In the rest of this paper we identify a linear approximation $g(x) = a \cdot x + b \cdot F(x)$ with the pair $(a, b) \in \mathbb{F}_2^n \times \mathbb{F}_2^s$ and call it the mask pair of $g$. Here the element $a \in \mathbb{F}_2^n$ is called the input mask and the element $b \in \mathbb{F}_2^s$ the output mask. We also denote $\mathrm{cor}(g)$ by $\mathrm{cor}(a, b)$. Also the linear subspace $L$ spanned by $\lambda_1, \ldots, \lambda_t$ of the space of all linear approximations is identified with a linear subspace of $\mathbb{F}_2^n \times \mathbb{F}_2^s$ spanned by the mask pairs $(a_1, b_1), \ldots, (a_t, b_t)$ given by (13). We will use $L$ also to denote this subspace and call it the mask space of $\Lambda$.
By (1), (5), and (14) the capacity of $\Lambda$ is then given as
$$\mathrm{Cap}(\Lambda) = \sum_{(a,b) \in L,\; (a,b) \neq (0,0)} \mathrm{cor}(a, b)^2.$$
One known consequence of this result is that the value distribution of a multidimensional linear approximation is uniform if and only if the correlations of all its non-zero linear approximations are equal to zero.

B. Structure of Multidimensional Linear Approximation of Permutation
In this section we determine the structural properties of a multidimensional approximation of a permutation F . For example, F is an encryption function of a block cipher, or some rounds of a block cipher with a fixed key, or F is just any bijective function of bit strings.
A multidimensional linear approximation Λ of a permutation F may contain nonzero linear approximations with mask pairs of the form (a, 0) or (0, b). Such linear approximations are called trivial, because they have fixed correlations equal to zero for any permutation F . Next we examine their effect on the distribution of the capacity Cap(Λ). Let us denote by U the linear subspace of the multidimensional approximation consisting of the approximations of the form (a, 0) and let u be its dimension. Similarly, we denote by V the subspace of the masks of the form (0, b) and by v its dimension.
Often such spaces span the whole multidimensional approximation, that is, all masks are of the form $(a, b)$, where $(a, 0) \in U$ and $(0, b) \in V$. Then the multidimensional approximation is said to have independent input and output masks [19]. But in general, there may exist a linear subspace $W$ of $L$ such that, if $(a, b) \in W$ and $(a, b) \neq (0, 0)$, then $a \neq 0$ and $b \neq 0$. Then $U \cap W = V \cap W = \{(0, 0)\}$ and the mask space $L$ of the multidimensional approximation $\Lambda$ can be written as a direct sum
$$L = U \oplus V \oplus W.$$
Mask pairs of the type comprising $W$ do not have independent input and output masks. We will show later in Subsection IV-D that they are actually connected by a one-to-one correspondence.
Let us denote by $\Lambda_1$, $\Lambda_2$ and $\Lambda_3$ the multidimensional linear approximations determined by the mask sets $U$, $V$ and $W$, respectively. Then the values of $\Lambda_1$ are $u$-bit vectors, the values of $\Lambda_2$ are $v$-bit vectors, and the values of $\Lambda_3$ are $(t - u - v)$-bit vectors, and $\Lambda = (\Lambda_1, \Lambda_2, \Lambda_3)$. Since all linear approximations in $U$ and $V$ are balanced, the value distributions of $\Lambda_1$ and $\Lambda_2$ are uniform. Considering this property for $\Lambda_1$ we get $2^u$ constraints for the value distribution of $\Lambda$ as follows
$$\sum_{\zeta, \nu} p_{(\xi, \zeta, \nu)} = 2^{-u} \quad \text{for all } \xi \in \mathbb{F}_2^u.$$
Similarly, by the uniformity of the value distribution of $\Lambda_2$, we get the following constraints
$$\sum_{\xi, \nu} p_{(\xi, \zeta, \nu)} = 2^{-v} \quad \text{for all } \zeta \in \mathbb{F}_2^v,$$
from which $2^v - 1$ are linearly independent of the previous ones, since both families of constraints sum to $\sum_\eta p_\eta = 1$. We conclude that the number of degrees of freedom of the probability distribution of the values of a multidimensional linear approximation $\Lambda$ of a permutation, as considered above, is bounded from above by
$$2^t - 1 - (2^u - 1) - (2^v - 1) = 2^t - 2^u - 2^v + 1.$$
Let us now consider $\Lambda$ and the probabilities $p_\eta$ of its $t$-bit values $\eta = (\xi, \zeta, \nu)$ as stochastic variables over the space of all equiprobable permutations. We apply Pearson's $\chi^2$ test and compute the test variable as
$$T(\Lambda) = 2^n \sum_{\eta \in \mathbb{F}_2^t} \frac{(p_\eta - 2^{-t})^2}{2^{-t}} = 2^n\, \mathrm{Cap}(\Lambda).$$
Then $T(\Lambda)$ follows the $\chi^2$ distribution. By Corollary 3, for linear approximations of randomly and equiprobably drawn permutations, the expected value of correlations $\mathrm{cor}(a, b)$, with $(a, b) \neq 0$, is equal to zero, also of those correlations where $a \neq 0$ and $b \neq 0$. On the other hand, $\mathrm{cor}(a, b) = 1$ for $a = b = 0$. Hence by (3), the expected value of each $p_\eta$ is equal to $2^{-t}$. Thus we have proved the following result.
Let $\Lambda$ be a multidimensional linear approximation of dimension $t$, and let the linear subspaces $U$ and $V$ of its trivial masks have dimensions $u$ and $v$, respectively. Then for a permutation chosen randomly and equiprobably from the set of all permutations from $\mathbb{F}_2^n$ to $\mathbb{F}_2^n$ the capacity of this multidimensional linear approximation follows, when multiplied by the factor $2^n$, the central $\chi^2$ distribution with at most $2^t - 2^u - 2^v + 1$ degrees of freedom.
Motivated by this result, we conjecture that the value distribution of a multidimensional linear approximation of a randomly and equiprobably chosen permutation with mask subspaces $U$ and $V$ of dimensions $u$ and $v$, respectively, has the maximum degree of freedom, that is, $2^t - 2^u - 2^v + 1$.
Conjecture 1: For a permutation from $\mathbb{F}_2^n$ to $\mathbb{F}_2^n$ drawn uniformly at random, the capacity of a multidimensional linear approximation with dimension $t$ and the linear subspaces of trivial masks with dimensions $u$ and $v$ follows, when multiplied by $2^n$, the $\chi^2$ distribution with $2^t - 2^u - 2^v + 1$ degrees of freedom.
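The role of the trivial masks can be made concrete on a toy example. The Python sketch below uses a hypothetical mask space over $n = 3$ with $u = 1$ and $v = 2$. It counts the nontrivial mask pairs, which equals $2^t - 2^u - 2^v + 1$, and averages $2^n\,\mathrm{Cap}(\Lambda)$ exactly over all $8!$ permutations. Assuming that each nontrivial squared correlation has expectation $1/(2^n - 1)$ over random permutations, the exact mean is the number of nontrivial pairs times $2^n/(2^n - 1)$, which approaches the conjectured number of degrees of freedom (the mean of the $\chi^2$ distribution) as $n$ grows. Masks are encoded as integers for convenience.

```python
from fractions import Fraction
from itertools import permutations, product

n = 3
SIZE = 2**n
PARITY = [bin(v).count("1") & 1 for v in range(SIZE)]


def span(basis):
    """All F_2-linear combinations of the mask pairs (a, b) in basis."""
    pairs = set()
    for coeffs in product((0, 1), repeat=len(basis)):
        a = b = 0
        for c, (ai, bi) in zip(coeffs, basis):
            if c:
                a ^= ai
                b ^= bi
        pairs.add((a, b))
    return pairs


def scaled_correlation(perm, a, b):
    """2^n * cor(a, b) for the approximation a.x + b.F(x), with F given as a tuple."""
    return sum(1 - 2 * (PARITY[a & x] ^ PARITY[b & perm[x]]) for x in range(SIZE))


# Hypothetical mask space: U spanned by (001, 0); V spanned by (0, 010) and (0, 100).
basis = [(0b001, 0), (0, 0b010), (0, 0b100)]
L = span(basis)
U = {(a, b) for a, b in L if b == 0}
V = {(a, b) for a, b in L if a == 0}
nontrivial = [(a, b) for a, b in L if a and b]
dof = len(L) - len(U) - len(V) + 1  # 2^t - 2^u - 2^v + 1

# Exact average of 2^n * Cap over all (2^n)! permutations, nontrivial pairs only.
total_sq, count = 0, 0
for perm in permutations(range(SIZE)):
    total_sq += sum(scaled_correlation(perm, a, b) ** 2 for a, b in nontrivial)
    count += 1
mean_scaled_capacity = Fraction(total_sq, SIZE * count)
print(dof, len(nontrivial), mean_scaled_capacity)
```

Only the mean is checked here; the sketch says nothing about the shape of the distribution, for which the experiments of the next subsection are the relevant evidence.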

C. Experiments
We performed experiments to check the validity of Conjecture 1 in different dimensions. In our simulations of a random permutation, we used the iterated block cipher SMALLPRESENT-[4] with a varying number of rounds. This cipher has 31 rounds in total and the block size is 16 bits [20]. The state bits at the input and output of each round are numbered from 0 to 15 from right to left.
For each fixed number of rounds of SMALLPRESENT-[4] varying from 0 to 31, the distribution of the capacity of the multidimensional linear approximation is computed over $2^{14}$ keys. Then the mean and the variance of the capacity are computed. The multidimensional linear approximation is of the form $U \oplus V$ where both $U$ and $V$ have nonzero bits in positions 5, 6, 9, 10, 11, 13, 14, 15. Six typical examples are depicted in Figures 1-6. In all six examples $U$ is spanned by bits in positions 9, 10, 11, 13, 14, 15, and has dimension equal to 6, while the dimension of $V$ varies from 1 to 6.
In each figure, the negatives of the base 2 logarithms, that is $-\log_2$, of the mean and variance of the capacity are plotted as the number of rounds increases, and compared with the hypothetical value given by Conjecture 1, which is depicted using a horizontal line. We see that the results of the experiments support Conjecture 1 perfectly.
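The horizontal reference lines described above can be reproduced from Conjecture 1 alone. The following is a minimal sketch, assuming that the approximation is exactly of the form $U \oplus V$ (so that $t = u + v$) and that $2^n$ times the capacity is $\chi^2$ distributed with the conjectured number of degrees of freedom; the function names are illustrative and do not come from the paper.

```python
from math import log2

def conjectured_dof(t: int, u: int, v: int) -> int:
    """Degrees of freedom 2^t - 2^u - 2^v + 1 stated in Conjecture 1."""
    return 2**t - 2**u - 2**v + 1

def hypothetical_levels(n: int, t: int, u: int, v: int):
    """Return (-log2 mean, -log2 variance) of the capacity under Conjecture 1.

    A chi-square variable with k degrees of freedom has mean k and variance 2k;
    the capacity is that variable scaled by 2^-n.
    """
    k = conjectured_dof(t, u, v)
    mean = k / 2**n
    variance = 2 * k / 2**(2 * n)
    return -log2(mean), -log2(variance)

def reference_lines(n: int = 16, u: int = 6):
    """Levels for v = 1..6, assuming t = u + v (no Davies-Meyer part W)."""
    return {v: hypothetical_levels(n, u + v, u, v) for v in range(1, 7)}
```

With $t = u + v$ the conjectured number of degrees of freedom factors as $(2^u - 1)(2^v - 1)$, which is a convenient sanity check on the formula.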
We also computed a number of experimental probability distributions of capacities for a random permutation instantiated by 20 rounds of SMALLPRESENT-[4]. One typical example of such probability distribution is plotted in Figure 7 for a multidimensional linear approximation with mask space

D. Special Case
Let us start by defining a special type of multidimensional linear approximation, which we call a Davies-Meyer approximation for reasons to be explained in this subsection.
Definition 1: A multidimensional linear approximation $\Lambda$ is called a Davies-Meyer approximation if given any linearly independent set of mask pairs $(a_i, b_i)$, $i = 1, \ldots, t$, in the mask space $L$ of $\Lambda$, the input masks $a_i$, $i = 1, \ldots, t$, are linearly independent and the output masks $b_i$, $i = 1, \ldots, t$, are linearly independent.
An equivalent formulation of this definition can be given as follows.
Theorem 5: A multidimensional linear approximation of a permutation is a Davies-Meyer approximation if and only if it does not contain any nonzero trivial approximations.
Proof: By definition, a Davies-Meyer approximation does not contain any nonzero approximation that has either input or output mask equal to zero, since a zero element cannot be included in a set of linearly independent elements. It remains to show that if the mask space L of a multidimensional linear approximation Λ does not contain any trivial approximations, then Λ must be a Davies-Meyer approximation.
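Let us suppose the contrary, that is, $L$ does not contain trivial approximations, but $\Lambda$ is not a Davies-Meyer approximation. Then $L$ contains a linearly independent set of mask pairs $(a_i, b_i)$, $i = 1, \ldots, t$, such that the input masks or the output masks are linearly dependent. Assume that the output masks are linearly dependent; the other case is symmetric. Then the output masks $b_i$ of some nonempty subset of the pairs sum to zero, while the sum of the corresponding pairs $(a_i, b_i)$ is nonzero by the linear independence of the mask pairs. Then $L$ contains a nonzero mask pair of the form $(a, 0)$, which contradicts the assumption.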
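The equivalence in Theorem 5 can be checked mechanically on small mask spaces. The sketch below is illustrative only: masks are represented as Python integers, `basis_pairs` is assumed to be a linearly independent set of mask pairs spanning $L$, and the helper names are hypothetical. Checking a single basis suffices because the projection of $L$ onto the input (or output) masks is injective exactly when it is injective on a basis.

```python
from itertools import product

def gf2_rank(vectors):
    """Rank over GF(2) of integers viewed as bit vectors."""
    basis = []
    for v in vectors:
        for b in basis:
            v = min(v, v ^ b)  # clear the leading bit of b from v if it is set
        if v:
            basis.append(v)
    return len(basis)

def is_davies_meyer(basis_pairs):
    """Definition 1: input masks and output masks are each linearly independent."""
    t = len(basis_pairs)
    inputs = [a for a, _ in basis_pairs]
    outputs = [b for _, b in basis_pairs]
    return gf2_rank(inputs) == t and gf2_rank(outputs) == t

def has_nonzero_trivial(basis_pairs):
    """Search the span for a nonzero pair of the form (a, 0) or (0, b)."""
    for coeffs in product((0, 1), repeat=len(basis_pairs)):
        a = b = 0
        for c, (ai, bi) in zip(coeffs, basis_pairs):
            if c:
                a ^= ai
                b ^= bi
        if (a, b) != (0, 0) and (a == 0 or b == 0):
            return True
    return False
```

For any independent basis the two predicates should be complementary, which is the content of the theorem.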
By this theorem, the multidimensional approximation $\Lambda_3$ determined by the mask set $W$ in the presentation (16) is a Davies-Meyer approximation.

Let $\Lambda$ be a Davies-Meyer approximation of dimension $t$ of a permutation $F$, and let its mask space $L$ be spanned by the mask pairs $(a_1, b_1), \ldots, (a_t, b_t)$. Let $D$ be the linear mapping from the space spanned by $(a_1, \ldots, a_t)$ to the space spanned by $(b_1, \ldots, b_t)$ defined by $D(a_i) = b_i$, $i = 1, \ldots, t$. By definition, $t \le n$. Then we extend $D$ to a bijective linear mapping from $\mathbb{F}_2^n$ to $\mathbb{F}_2^n$ and denote it by $\tilde{D}$. Then a linear approximation $(a, b) \in L$ can be expressed as
$$a \cdot x + b \cdot F(x) = a \cdot x + \tilde{D}a \cdot F(x) = a \cdot \left(x + \tilde{D}' F(x)\right), \tag{18}$$
where $\tilde{D}'$ is the transpose of $\tilde{D}$. If $F$ is chosen randomly and equiprobably from the set of all permutations, then the same holds for the permutation $P = \tilde{D}' \circ F$. We observe that the function of the form
$$x \mapsto x + P(x)$$
is the Davies-Meyer construction [21], which is known to give a pseudorandom function (more accurately, a family of pseudorandom functions) from $\mathbb{F}_2^n$ to $\mathbb{F}_2^n$ if $P$ is a truly random permutation, that is, chosen randomly and equiprobably among all permutations from $\mathbb{F}_2^n$ to $\mathbb{F}_2^n$ [22]. By (18) the linear approximations in $L$ form a linear subspace of the components of a Davies-Meyer function, and hence the Davies-Meyer approximation $\Lambda$ of $F$ is a pseudorandom function from $\mathbb{F}_2^n$ to $\mathbb{F}_2^t$ if $F$ is a truly random permutation.

In Subsection VII-A we define a test for distinguishing a permutation (cipher) from a truly random permutation. In the theory of cryptography, analogous tests are also used to distinguish a function from a truly random function. Specifically, a pseudorandom function is defined by the property that there is no efficient test that can be used to distinguish between a pseudorandom function and a truly random function with a larger than negligible distinguishing advantage [23].
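This means that any probability distribution computed from the values of a Davies-Meyer approximation $\Lambda$ over a truly random permutation $F$ can be replaced by the corresponding distribution computed for a truly random function. Recalling that the multinomial distribution of the capacity of a truly random function from $\mathbb{F}_2^n$ to $\mathbb{F}_2^t$ can be approximated by the $\chi^2$ distribution with $2^t - 1$ degrees of freedom, see Section II-C, we can state the following result.

Theorem 6: Let $\Lambda$ be a Davies-Meyer approximation of dimension $t$ of a permutation chosen randomly and equiprobably from the set of all permutations from $\mathbb{F}_2^n$ to $\mathbb{F}_2^n$. Then the capacity of $\Lambda$, when multiplied by $2^n$, approximately follows the $\chi^2$ distribution with $2^t - 1$ degrees of freedom.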
This property will be used later in the statistical analysis of a Davies-Meyer approximation, see Theorem 11.
To illustrate a probability distribution of a Davies-Meyer approximation we depict the distribution of capacity over $2^{14}$ random keys in Figure 8.

E. Multidimensional Linear Approximation of Serpent
The block cipher Serpent [10] was one of the first ciphers analysed using multidimensional linear cryptanalysis. The multidimensional approximation $\Lambda$ for Serpent was built by taking the linear space spanned by a linearly independent set of $m$ strong base approximations of the form $(a_1, b), \ldots, (a_m, b)$, all with the same output mask $b$. Then $L$ is of the form $U \oplus V$, where $u = m$ and $v = 1$. Moreover, the Davies-Meyer part $W$ was non-existent. It means that all the linear combinations of the base approximations involving an even number of base approximations had output mask equal to zero, and hence, correlation zero. In the cryptanalysis, all $2^{m+1} - 1$ non-zero approximations were involved, including those $2^m - 1$ of the form $(a, 0)$ with correlation zero. It was mentioned that such approximations can be ignored in the computation of the empirical correlation. Nevertheless, they cannot be ignored when the number of degrees of freedom of the sampled $\chi^2$ statistic is determined, as will be explained in Subsection VI-C.
Recently it was proposed by Nyberg to remove the subspace of trivial linear approximations and consider only the remaining set that forms an affine subspace [13]. Let us apply this idea to the multidimensional approximation of Serpent discussed above. Take the $(m-1)$-dimensional subspace spanned by the masks $(a_2 \oplus a_1, 0), \ldots, (a_m \oplus a_1, 0)$ and denote it by $H$. Then the affine subspace $(a_1, b) + H$ is only half of the size of the original linear space and still contains all $m$ strong base approximations. Moreover, for each key, the capacity of the affine set of approximations is exactly the same as the capacity of the original set, while the number of degrees of freedom of the $\chi^2$ statistic is reduced by one half.
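The construction of the affine set $(a_1, b) + H$ can be illustrated on toy masks. This is a small sketch with made-up integer masks (not the Serpent masks), and it enumerates only the span of the $m$ base approximations themselves; the helper names are illustrative.

```python
def linear_span(basis):
    """All XOR combinations of the mask pairs in `basis` (assumed independent)."""
    span = [(0, 0)]
    for a, b in basis:
        span += [(x ^ a, y ^ b) for x, y in span]
    return span

def affine_set(base):
    """(a_1, b) + H, where H is spanned by (a_i ^ a_1, 0) for i >= 2."""
    (a1, b), rest = base[0], base[1:]
    h = linear_span([(ai ^ a1, 0) for ai, _ in rest])
    return [(a1 ^ x, b ^ y) for x, y in h]

# Toy example: m = 3 base approximations sharing the output mask 5.
BASE = [(1, 5), (2, 5), (4, 5)]
SPAN = linear_span(BASE)
AFFINE = affine_set(BASE)
```

In this toy case the span has $2^3$ elements, the even combinations have output mask zero, and the affine set consists of the $2^2$ odd combinations, all carrying the output mask of the base approximations.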
To conclude this section let us mention that the structure of multidimensional linear approximations must be taken into consideration also for non-bijective functions. Then only the mask pairs of the form $(a, 0)$ are trivial with mean and variance of the correlation equal to zero. For example, if in the above example the block cipher Serpent is replaced by some non-bijective function but the same set of linear approximations is used, then removing the trivial approximations leads to the same affine set of approximations.
Next we study the distribution of the capacity for an affine set of linear approximations of a random permutation. Further in Subsection VI-D, we will recall the sampled χ 2 statistic from [13] with the following essential improvements: randomisation over the key and sampling without replacement. The compound probability distribution is then given by the integration of the probability distribution of the capacity into the probability distribution of the sampled χ 2 statistic.

A. Constructing Affine Set of Approximations
The approach for constructing an affine set of linear approximations which does not contain trivial approximations but has a statistical model without artificial independence assumptions was proposed by Nyberg [13]. Such a set can be constructed, for example, by taking an affine subspace of input masks and an affine subspace of output masks to get a set of the form
$$(a_0 + U) \times (b_0 + V),$$
where the dimensions of $U$ and $V$ are positive, $a_0 \notin U$ and $b_0 \notin V$. We denote
$$A = (a_0 + U) \times (b_0 + V).$$
Then the smallest linear space that contains $A$ is
$$\{(0, 0), (a_0, b_0)\} \oplus U \oplus V,$$
that is, the space $W$ in the expression (16) has dimension one. But using the multidimensional linear approximation defined by this set of masks instead of using only the set $A$ would add all trivial linear approximations from $U$ and $V$ to this set and reduce the strength of the attack. To avoid wasting attack resources, such as memory and time, we want to exclude the linear approximations with masks in $U \oplus V$.

More generally, let us consider a statistic $T(A)$ for any affine set of the form $A = (a_0, b_0) + H$ where $H$ is a linear subspace of masks and $(a_0, b_0) \notin H$. Moreover, we assume that $A$ does not contain trivial masks. Let $\Lambda$ be the multidimensional linear approximation defined by the linear space of masks $L = \{(0, 0), (a_0, b_0)\} \oplus H$. Let $\Lambda'$ be the multidimensional linear approximation defined by $H$ and $\Lambda' = U \oplus V \oplus W$ be its presentation in the form (16). We define the affine test statistic as follows
$$T(A) = T(\Lambda) - T(\Lambda'). \tag{20}$$
We denote the dimension of $\Lambda$ by $t$. Hence we can express
$$\Lambda = (f_0, \Lambda'), \quad f_0(x) = a_0 \cdot x + b_0 \cdot F(x).$$
Then the values of $\Lambda$ are given as $(\nu, \eta)$, where $\nu$ is a bit and $\eta$ is a $(t-1)$-bit vector.
Since the correlations of the linear approximations are not independent, we cannot examine the distribution of $T(A)$ directly from its expression as a sum of squared correlations. We can, however, do this if instead we express $T(A)$ in terms of the value distribution $p_{(\nu,\eta)}$ of $\Lambda$ as given by the following lemma.
Lemma 4: In the setting defined above, we have
$$T(A) = 2^n 2^{t-1} \sum_{\eta \in \mathbb{F}_2^{t-1}} \left(p_{(0,\eta)} - p_{(1,\eta)}\right)^2. \tag{21}$$
Proof: By applying (17) to $T(\Lambda')$ we obtain
$$T(\Lambda') = 2^n 2^{t-1} \sum_{\eta \in \mathbb{F}_2^{t-1}} \Big(\sum_{\delta \in \mathbb{F}_2} p_{(\delta,\eta)} - 2^{-(t-1)}\Big)^2.$$
By replacing the summation index $(\delta, \eta) \in \mathbb{F}_2 \times \mathbb{F}_2^{t-1}$ by $\eta \in \mathbb{F}_2^t$ we get the expression of $T(\Lambda)$ given by (17), that is, $T(\Lambda) = 2^n 2^t \sum_{(\delta,\eta)} (p_{(\delta,\eta)} - 2^{-t})^2$. Then the claim follows from (20), since $2(x^2 + y^2) - (x + y)^2 = (x - y)^2$ with $x = p_{(0,\eta)} - 2^{-t}$ and $y = p_{(1,\eta)} - 2^{-t}$.
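The two ways of writing the affine statistic, as $2^n$ times the sum of squared correlations over $A$ and as a scaled sum of squared count differences per category $\eta$, can be compared numerically. The sketch below is an illustration under stated assumptions: a toy 6-bit permutation generated with Python's `random` module, masks as integers, and hypothetical helper names. It assumes the value-distribution form $2^n 2^{t-1} \sum_\eta (p_{(0,\eta)} - p_{(1,\eta)})^2$.

```python
import random

def parity(x):
    return bin(x).count("1") & 1

def linear_span(basis):
    """All XOR combinations of the mask pairs in `basis` (assumed independent)."""
    span = [(0, 0)]
    for a, b in basis:
        span += [(x ^ a, y ^ b) for x, y in span]
    return span

def random_permutation(n_bits, seed):
    rng = random.Random(seed)
    perm = list(range(1 << n_bits))
    rng.shuffle(perm)
    return perm

def correlation(perm, a, b):
    total = sum(1 - 2 * (parity(a & x) ^ parity(b & y)) for x, y in enumerate(perm))
    return total / len(perm)

def t_from_correlations(perm, a0, b0, h_basis):
    """2^n times the sum of squared correlations over A = (a0, b0) + H."""
    total = sum(correlation(perm, a0 ^ a, b0 ^ b) ** 2 for a, b in linear_span(h_basis))
    return len(perm) * total

def t_from_values(perm, a0, b0, h_basis):
    """2^n * 2^(t-1) * sum over eta of (p_(0,eta) - p_(1,eta))^2."""
    diff = {}  # eta -> (#inputs with nu = 0) - (#inputs with nu = 1)
    for x, y in enumerate(perm):
        eta = tuple(parity(a & x) ^ parity(b & y) for a, b in h_basis)
        nu = parity(a0 & x) ^ parity(b0 & y)
        diff[eta] = diff.get(eta, 0) + (1 - 2 * nu)
    return 2 ** len(h_basis) * sum(d * d for d in diff.values()) / len(perm)
```

The second function touches each input once and needs only $2^{t-1}$ counters, which mirrors the categorisation of inputs by the value of $\Lambda'$ and of $f_0$.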

B. Distribution of the Statistic T (A) for a Random Permutation
To compute $T(A)$ according to (21) for a permutation, all $n$-bit inputs $x$ are distributed to $2^{t-1}$ categories according to the value $\eta$ of $\Lambda'(x)$. Further, within each category the inputs $x$ are divided into two subsets according to their value $f_0(x)$. The resulting value in category $\eta$ is the difference of the sizes of its two subsets.
Since the expected probability distribution of the values $(\nu, \eta)$ of $\Lambda$ over all permutations is uniform, the expected value of the differences $p_{(1,\eta)} - p_{(0,\eta)}$ is zero. Hence we propose to use Pearson's $\chi^2$ test for the values obtained in this way in $2^{t-1}$ categories. The related $\chi^2$ test statistic is $T(A)$.
To determine the number of degrees of freedom of $T(A)$, we observe that, taken together, the $2^{t-1}$ variables $p_{(1,\eta)} + p_{(0,\eta)}$ and the $2^{t-1}$ variables $p_{(1,\eta)} - p_{(0,\eta)}$, where $\eta$ is a $(t-1)$-bit vector, uniquely determine the value distribution of $\Lambda$ with probabilities $p_{(\nu,\eta)}$, where $\nu$ is a bit and $\eta$ is a $(t-1)$-bit vector, which by Conjecture 1 has $2^t - 2^u - 2^v + 1$ free variables. Since the masks in $U \oplus V$ (if any) belong also to the multidimensional linear approximation $\Lambda'$, the value distribution of $\Lambda'$ has $2^{t-1} - 2^u - 2^v + 1$ free variables, also by Conjecture 1. Since $T(A) + T(\Lambda') = T(\Lambda)$, it follows that $T(A)$ must have at least $2^{t-1}$ degrees of freedom. On the other hand, by its expression (21) $T(A)$ has at most $2^{t-1}$ degrees of freedom, and hence exactly $2^{t-1}$ degrees of freedom.
We conclude that under Theorem 4 and Conjecture 1 for random permutations, $T(A)$ is $\chi^2$ distributed with $2^{t-1}$ degrees of freedom and summarise the result as follows.

Theorem 7: Let $A$ be an affine set of linear approximations of a permutation from $\mathbb{F}_2^n$ to $\mathbb{F}_2^n$ that does not contain trivial approximations, and assume that Conjecture 1 holds. Then for a permutation drawn uniformly at random, $T(A) = 2^n \mathrm{Cap}(A)$ follows $\chi^2$ distribution with $|A|$ degrees of freedom.

VI. DATA SAMPLING FOR APPROXIMATIONS OF RANDOM PERMUTATIONS
A linear attack can be seen as composed of two parts: first, finding an approximation with good correlation and secondly, detecting this correlation in a collection of input-output pairs. When viewed this way, linear cryptanalysis is mainly a parameter estimation problem and the influence of data sampling is only on the second part. The distribution of the correlation over the keys is determined by the structure of the block cipher. Undersampling introduces an error to this parameter estimation problem. The empirical correlation is therefore a random variable in the key and the choice of the sample of plaintexts.
In Sections IV and V we presented the probability distributions of correlations and capacities computed over the full input domain of random functions and permutations. The goal of this section is to integrate a random variate data sample of fixed size into these probability distributions.

A. Sampling With or Without Replacement for a Random Permutation
In many studies on linear cryptanalysis, known plaintext-ciphertext pairs are assumed to be drawn randomly and independently, which implies sampling with replacement. It has been argued that sampling without replacement implies chosen plaintext and contradicts the essence of linear cryptanalysis of being a "known plaintext attack".
On the other hand, it has been acknowledged that duplicated plaintext-ciphertext pairs do not give new information, for which reason experimental cryptanalysis of practical ciphers typically uses non-repeating plaintexts. For example, in the first experimental cryptanalysis on the DES cipher, Matsui generated the plaintexts as distinct powers of a primitive element in a 64-bit field [2].
Considering practical applications, the raw data obtained from the cipher is rarely non-repeating and may have too many duplicates for being random looking. Therefore, it requires preparations before it can be used for statistical analysis. Given two models, one requiring data input that looks like a randomly generated sample with replacement and another one without duplicates, the latter is arguably more practical to achieve. It takes O(N ) memory and time to clear a raw sample of N plaintexts from duplicates and also gives a unique value for the size of the clean sample.
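Clearing a raw sample of duplicates in linear time and memory can be sketched as follows. This is one straightforward way to do it with a hash set, assuming the sample is a sequence of (plaintext, ciphertext) pairs with hashable plaintexts; the function name is illustrative.

```python
def deduplicate(pairs):
    """Keep the first occurrence of each plaintext.

    Runs in O(N) expected time and O(N) memory for N input pairs and returns
    the clean sample, whose length is the sample size to use in the model.
    """
    seen = set()
    clean = []
    for plaintext, ciphertext in pairs:
        if plaintext not in seen:
            seen.add(plaintext)
            clean.append((plaintext, ciphertext))
    return clean
```

The length of the returned list is the unique value $N$ of the clean sample size.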
Today, linear cryptanalysis is most commonly used for estimating how many rounds of an iterative block cipher it takes until any reasonable linear attack requires the full codebook of known plaintext-ciphertext pairs. Achieving the whole codebook using sampling with replacement introduces unnecessary uncertainty to the model that can be avoided if sampling is without replacement. Based on the reasons given above, we will consider only sampling without replacement in this paper. In particular, for the experiments given in Section VIII that deal with sample sizes equal or close to full codebook, analysis with distinct plaintexts gives more accurate results.
Whether sampling is with or without replacement also has implications for the statistical models of wrong-key behaviour. The classical wrong-key assumptions commonly use the idea that if the key is wrong then the values of a bitwise linear approximation follow the binomial distribution with probability 1/2. This leads to a normal distribution $\mathcal{N}(0, 1/N)$ of the empirical correlation, where $N$ is the size of the sample drawn with replacement. As long as the cipher has a linear approximation such that its correlation has only a small number of values as the key varies, or the average correlation over the keys is different from zero, this wrong-key model is reasonable. But modern block ciphers have been designed not to have such linear approximations. In particular, the correlations typically vary a lot with the key and have average value equal to zero, leading to the same distribution $\mathcal{N}(0, 1/N)$ of the empirical correlation in the right-key case and the wrong-key case. It follows that the early models of linear cryptanalysis, e.g. [24], [25], hardly apply to modern block ciphers.
The more advanced models of linear correlations of block ciphers consider randomisation over the key and the data sample. Bogdanov and Tischhauser were the first to present a wrong-key model of the empirical correlation and gave the distribution $\mathcal{N}(0, 1/N + 2^{-n})$, where $n$ is the block size and $N$ is the size of the sample assuming sampling with replacement [4]. Later Blondeau and Nyberg showed that if sampling is without replacement, then the wrong-key distribution is $\mathcal{N}(0, 1/N)$. While this distribution is the same as in the classical case without key randomisation, the setting is different and the corresponding right-key model allows building a distinguisher [7].
In this section, we present probability distributions of the capacity considered over a random permutation and random sampling without replacement for a single linear approximation, a multidimensional linear approximation including Davies-Meyer approximation as a special case, and an affine set of approximations. In all cases, we first derive the distribution for an arbitrary fixed permutation by randomisation over the data sample only. Then by using the results from Sections IV and V we present the compound probability distributions of the capacity over a random permutation and a random sample.

B. Sampling Without Replacement for a Single Linear Approximation
Given a mask pair $(a, b)$, where $b \neq 0$, and a data sample $S$ of input-output pairs $(x, F(x))$ of size $N$ drawn for a random function $F : \mathbb{F}_2^n \to \mathbb{F}_2^s$, let us denote by $\hat{w}(a, b)$ the number of inputs $x$ for which $(x, F(x)) \in S$ and the linear approximation $a \cdot x + b \cdot F(x)$ takes the value zero. Let $w(a, b)$ be the number of zeroes of $a \cdot x + b \cdot F(x)$ over all inputs $x \in \mathbb{F}_2^n$. Then $\hat{w}(a, b) \sim \mathrm{HG}(2^n, w(a, b), N)$.
For a truly random $F$, we know by Theorem 2 that
$$w(a, b) \sim B(2^n, 1/2).$$
Then $\hat{w}(a, b)$ taken over a truly random function and a random data sample $S$ of size $N$ has the following probability distribution
$$\Pr\left(\hat{w}(a, b) = k\right) = \sum_{w=0}^{2^n} \binom{2^n}{w} 2^{-2^n} \frac{\binom{w}{k}\binom{2^n - w}{N - k}}{\binom{2^n}{N}} = \binom{N}{k} 2^{-N}.$$
Hence $\hat{w}(a, b) \sim B(N, 1/2)$. Let us denote by $\widehat{\mathrm{cor}}(a, b)$ the sampled correlation, that is,
$$\widehat{\mathrm{cor}}(a, b) = \frac{2\hat{w}(a, b) - N}{N}.$$
By normal approximation (8), we obtain the following result.
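The step from the hypergeometric sample count to $B(N, 1/2)$ can be verified exactly on a toy population. The sketch below assumes that the full count $w$ is distributed as $B(M, 1/2)$ over a population of size $M$ (here $M$ plays the role of $2^n$) and that, given $w$, the sampled count is hypergeometric; exact rational arithmetic avoids any rounding. The function names are illustrative.

```python
from fractions import Fraction
from math import comb

def hypergeom_pmf(population, successes, draws, k):
    """Pr[k successes among `draws` draws without replacement]."""
    if k < 0 or k > draws or k > successes or draws - k > population - successes:
        return Fraction(0)
    return Fraction(comb(successes, k) * comb(population - successes, draws - k),
                    comb(population, draws))

def compound_pmf(population, draws, k):
    """Average the hypergeometric pmf over w ~ B(population, 1/2)."""
    return sum(Fraction(comb(population, w), 2 ** population)
               * hypergeom_pmf(population, w, draws, k)
               for w in range(population + 1))

def binomial_half_pmf(draws, k):
    """Pr[k successes] for B(draws, 1/2)."""
    return Fraction(comb(draws, k), 2 ** draws)
```

For every sample size up to the full population the compound distribution coincides with the binomial one, so the distinctness of the plaintexts leaves no trace once the function itself is random.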
Theorem 8: Let $\widehat{\mathrm{cor}}(a, b)$, where $b \neq 0$, be the sampled correlation of the linear approximation $(a, b)$ of a function from $\mathbb{F}_2^n$ to $\mathbb{F}_2^s$. Then the probability distribution of $\widehat{\mathrm{cor}}(a, b)$ taken over a truly random function and a random data sample of size $N$ of distinct plaintexts, where $N \le 2^n$, can be approximated by the normal distribution $\mathcal{N}(0, 1/N)$.
To prove the corresponding result for a random permutation we use the normal approximation from the beginning.
Theorem 9: Let $\widehat{\mathrm{cor}}(a, b)$, where $b \neq 0$, be the sampled correlation of a linear approximation of a permutation from $\mathbb{F}_2^n$ to $\mathbb{F}_2^n$. Then the probability distribution of $\widehat{\mathrm{cor}}(a, b)$ considered over a truly random permutation and a random data sample of size $N$ of distinct plaintexts, where $N \le 2^n$, can be approximated by the normal distribution $\mathcal{N}(0, 1/N)$.
Proof: Given a linear approximation $(a, b)$, $b \neq 0$, and a data sample $S$ of size $N$ drawn for a permutation $E : \mathbb{F}_2^n \to \mathbb{F}_2^n$, the sampled correlation is
$$\widehat{\mathrm{cor}}(a, b) = \frac{2\hat{w}(a, b) - N}{N}, \quad \text{where } \hat{w}(a, b) \sim \mathrm{HG}(2^n, w(a, b), N).$$
Then by normal approximation (8), approximately
$$\widehat{\mathrm{cor}}(a, b) \sim \mathcal{N}\left(\mathrm{cor}(a, b), \frac{B}{N}\right),$$
where $\mathrm{cor}(a, b) = 2^{-n}(2w(a, b) - 2^n)$ and $B = (2^n - N)/2^n$. By Theorem 3, Corollary 3, and using the normal approximation of the hypergeometric distribution (8), we have
$$\mathrm{cor}(a, b) \sim \mathcal{N}(0, 2^{-n}).$$
Then the distribution of $\widehat{\mathrm{cor}}(a, b)$ taken over a random permutation and a random sample of size $N$ is approximately normal with mean $\mathrm{Exp}(\mathrm{cor}(a, b)) = 0$ and variance equal to
$$\mathrm{Var}(\mathrm{cor}(a, b)) + \mathrm{Exp}\left(\mathrm{Var}(\widehat{\mathrm{cor}}(a, b))\right) = 2^{-n} + \frac{B}{N} = \frac{1}{N}.$$

We get another view of this result by observing that a linear approximation of a random permutation $F$ can be expressed as a linear approximation of a pseudorandom function as follows
$$a \cdot x + b \cdot F(x) = a \cdot \left(x + P(x)\right),$$
where $P$ is the truly random permutation derived from $F$, see Subsection IV-D, and then applying Theorem 8.
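The variance bookkeeping behind Theorem 9 and the wrong-key models of Subsection VI-A can be written out in exact arithmetic. This is a minimal sketch under the approximations used in the text: the correlation of a random permutation has variance $2^{-n}$, the sampling variance for a fixed permutation is taken as $B/N$ without replacement and $1/N$ with replacement, and $B = (2^n - N)/2^n$. The function names are illustrative.

```python
from fractions import Fraction

def finite_population_correction(n, N):
    """B = (2^n - N) / 2^n for a sample of N distinct inputs out of 2^n."""
    return Fraction(2**n - N, 2**n)

def sampled_correlation_variance(n, N, with_replacement=False):
    """Total variance of the sampled correlation over permutation and sample.

    Law of total variance: Var(cor) + Exp(Var(sampled cor | permutation)).
    """
    key_variance = Fraction(1, 2**n)
    if with_replacement:
        sampling_variance = Fraction(1, N)
    else:
        sampling_variance = finite_population_correction(n, N) / N
    return key_variance + sampling_variance
```

Without replacement the two contributions add up to exactly $1/N$, while with replacement the extra term $2^{-n}$ remains.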

C. Sampling Without Replacement for a Multidimensional Linear Approximation
Let us now recall the sampled test statistic of a multidimensional linear approximation $\Lambda$. It is obtained by taking (17) and replacing $2^n$ by $N$ and the correlations $\mathrm{cor}(a, b)$ by the sampled correlations $\widehat{\mathrm{cor}}(a, b)$ as follows
$$T_N(\Lambda) = N \sum_{(a,b) \in L,\ (a,b) \neq 0} \widehat{\mathrm{cor}}(a, b)^2. \tag{22}$$
Let us first derive the probability distribution of $T_N(\Lambda)$ for an arbitrary fixed key and randomly chosen sample of distinct plaintexts. The corresponding probability distribution for $T_N(\Lambda)$ is given by the following result originally stated in [7]. The proof given in [7] assumed independent hypergeometric distributions. In [1] the validity of this result was questioned due to the artificial assumption of independence. Therefore another proof will be given here by applying the standard statistical argument of finite population correction to the $\chi^2$ distributed variable as given by Lemma 1. In our context, $2^n$ is the size of the population and $N$ is the size of the random sample of distinct elements from that population.
Theorem 10: Let $\Lambda$ be a multidimensional linear approximation of dimension $t$ applied to a permutation from $\mathbb{F}_2^n$ to $\mathbb{F}_2^n$. Let $T_N(\Lambda)$ be the statistic defined by (22) computed over a random sample of size $N$ of distinct plaintexts. Then $B^{-1} T_N(\Lambda)$ follows the non-central $\chi^2$ distribution with $2^t - 1$ degrees of freedom and non-centrality parameter $B^{-1} N \mathrm{Cap}(\Lambda)$, where $B$ is as defined by (7).
Proof: We denote by $\hat{p}_\eta$ the sampled probabilities of the distribution of the $t$-bit values $\eta$ of $\Lambda$ computed for a sample of size $N$ of inputs $x$. We apply (15) to this distribution to write $T_N(\Lambda)$ as follows
$$T_N(\Lambda) = N 2^t \sum_{\eta} \left(\hat{p}_\eta - 2^{-t}\right)^2. \tag{23}$$
Then $T_N(\Lambda)$ is Pearson's $\chi^2$-test statistic with $2^t - 1$ degrees of freedom. Since the sample is without replacement we apply Lemma 1 and get that $B^{-1} T_N(\Lambda)$ is non-centrally $\chi^2$ distributed and has expected value equal to $2^t - 1 + \delta$, where $\delta$ is the non-centrality parameter. Then the expected value of $T_N(\Lambda)$ is equal to $B(2^t - 1) + B\delta$. To determine $\delta$ we compute the expected value of $T_N(\Lambda)$ directly. Expanding the expression (23) we get
$$T_N(\Lambda) = N 2^t \sum_{\eta} \left(\hat{p}_\eta - p_\eta\right)^2 \tag{24}$$
$$\qquad + N 2^t \sum_{\eta} \left(p_\eta - 2^{-t}\right)^2 \tag{25}$$
$$\qquad + N 2^{t+1} \sum_{\eta} \left(\hat{p}_\eta - p_\eta\right) p_\eta, \tag{26}$$
where $p_\eta$ is the probability of the $t$-bit value $\eta$ in the image space of $\Lambda$. Note that in the expansion (24)-(26) the term $-N 2^{t+1} 2^{-t} \sum_{\eta} (\hat{p}_\eta - p_\eta)$ was omitted because it is equal to zero. Now expression (24) is Pearson's $\chi^2$-test statistic with $2^t - 1$ degrees of freedom by using the standard approximation $N p_\eta \approx N 2^{-t}$ in the denominator. Moreover it is central, since for each $\eta$ the expected value of $\hat{p}_\eta$ is equal to $p_\eta$. Since the sampling is without replacement, we get that the expected value of (24) is equal to $B(2^t - 1)$. The expression (25) is constant and equal to $N \mathrm{Cap}(\Lambda)$, and the expected value of (26) is equal to zero. Solving $\delta$ from the equation
$$B(2^t - 1) + B\delta = B(2^t - 1) + N \mathrm{Cap}(\Lambda)$$
gives the non-centrality parameter as claimed.
As the sample size $N$ grows and gets equal to $2^n$, the sampled statistic $T_N(\Lambda)$ gets equal to the statistic $T(\Lambda)$. In general, the $\chi^2$-variables computed for the entire input space may not have the same number of degrees of freedom, as we saw in Subsection IV-B, which complicates the analysis of the compound distribution of the statistic $T_N(\Lambda)$ considered over a random permutation and a random sample of size $N$. In the case where $\Lambda$ does not contain any trivial approximations, that is, $\Lambda$ is a Davies-Meyer approximation, the distribution of $T(\Lambda)$ also has $2^t - 1$ degrees of freedom. Moreover, by Theorem 6 the distribution given by Conjecture 1 holds and we get the following result. The proof is similar to the proof of Theorem 13 in the next subsection and is omitted here.
Theorem 11: Let $\Lambda$ be a Davies-Meyer approximation applied to a random permutation from $\mathbb{F}_2^n$ to $\mathbb{F}_2^n$. Let $T_N(\Lambda)$ be the statistic defined by (22) computed for a sample of size $N$ of distinct plaintexts and considered as a random variable over a truly random permutation and a random sample of size $N$. Then the mean of $T_N(\Lambda)$ is $|\Lambda| - 1$ and the variance is $2(|\Lambda| - 1)$.

D. Sampling Without Replacement for an Affine Approximation
Given an affine subspace $A$ of linear approximations defined by two multidimensional linear approximations $\Lambda$ and $\Lambda'$ of dimensions $t$ and $t - 1$ respectively, we define the sampled test statistic $T_N(A)$ analogously to (20) as follows
$$T_N(A) = T_N(\Lambda) - T_N(\Lambda'). \tag{27}$$
By repeating the derivations of Section V, but now for the sampled statistic $T_N(A)$ and using Theorem 10, we get the following result.
Theorem 12: Let $A$ be an affine set of linear approximations applied to a permutation from $\mathbb{F}_2^n$ to $\mathbb{F}_2^n$ and assume it does not contain trivial approximations. Let $T_N(A)$ be the statistic defined by (27) computed for a random sample of size $N$ of distinct plaintexts. Then $B^{-1} T_N(A)$ follows the noncentral $\chi^2$ distribution with $|A|$ degrees of freedom and noncentrality parameter $B^{-1} N \mathrm{Cap}(A)$, where $B$ is as defined by (7).
The noncentrality parameter $B^{-1} N \mathrm{Cap}(A)$ of the distribution of $T_N(A)$ depends on the permutation. If the permutation is truly random, the distribution of $T(A) = 2^n \mathrm{Cap}(A)$ is given by Theorem 7 under the assumption that Conjecture 1 holds. We get the following result.
Theorem 13: Let $A$ be an affine set of linear approximations applied to a permutation from $\mathbb{F}_2^n$ to $\mathbb{F}_2^n$ and let us assume that $A$ does not contain trivial approximations and Conjecture 1 holds. Let $T_N(A)$ be the statistic defined by (27) computed for a sample of size $N$ of distinct plaintexts and considered as a random variable over a random permutation and a random sample of size $N$. Then the mean of $T_N(A)$ is $|A|$ and the variance is $2|A|$.
Proof: Let us denote $|A|$ by $\ell$. Using the non-central $\chi^2$ distribution of $B^{-1} T_N(A)$ for a fixed permutation with capacity $\mathrm{Cap}(A)$ given by Theorem 12, we get that the mean of $T_N(A)$ is equal to
$$B\ell + N \mathrm{Cap}(A). \tag{28}$$
By taking the mean over random permutations, where $\mathrm{Exp}(\mathrm{Cap}(A)) = 2^{-n}\ell$, we get the mean $B\ell + N 2^{-n} \ell = \ell$ as claimed.
Similarly, by Theorem 12, we get that the variance of $T_N(A)$ is equal to
$$2B^2 \ell + 4BN \mathrm{Cap}(A). \tag{29}$$
Then the total variance over a random permutation is computed as the sum of the mean of (29) and the variance of (28) to get
$$2B^2 \ell + 4BN 2^{-n} \ell + N^2 2^{-2n} \cdot 2\ell = 2\ell \left(B + N 2^{-n}\right)^2 = 2\ell.$$

Based on these considerations one can argue that, when considered as a random variable over a random permutation and a random sample of $N$ distinct plaintexts, the test statistic $T_N(A)$ follows the $\chi^2$ distribution with $|A|$ degrees of freedom.

We have seen that constructions of multidimensional and affine linear approximations that do not contain any trivial approximations have a simple and clear theory for random permutations. Also for those approximations that contain trivial approximations it is quite straightforward to derive the mean and the variance of the sampled statistic. For permutations originating from ciphers the theory is not that clear. The least one can say is that linear approximations of block ciphers have the same trivial linear approximations as a random permutation. The problem of trivial approximations was observed also in [10] where it was recommended to exclude them in the computation of the empirical correlation. While this helps in speeding up the cryptanalysis, the problem of accuracy still remains. In the case of [10] the trivial linear approximations could have been easily excluded by considering the related affine set as discussed in Subsection IV-E.
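The mean and variance claimed in Theorem 13 can be rechecked with exact arithmetic. The sketch below is an illustration under stated assumptions: $B = (2^n - N)/2^n$, $2^n \mathrm{Cap}(A)$ is $\chi^2$ distributed with $\ell = |A|$ degrees of freedom, and for a fixed permutation $B^{-1} T_N(A)$ is noncentral $\chi^2$ with $\ell$ degrees of freedom and noncentrality $B^{-1} N \mathrm{Cap}(A)$. The function name is hypothetical.

```python
from fractions import Fraction

def moments_random_permutation(ell, n, N):
    """Mean and variance of T_N(A) over a random permutation and a random sample.

    For a fixed permutation: mean = B*ell + N*Cap, variance = 2*B^2*ell + 4*B*N*Cap.
    Over permutations: Exp(Cap) = ell / 2^n and Var(Cap) = 2*ell / 2^(2n).
    """
    B = Fraction(2**n - N, 2**n)
    cap_mean = Fraction(ell, 2**n)
    cap_var = Fraction(2 * ell, 4**n)
    mean = B * ell + N * cap_mean
    # Law of total variance: Exp(Var(T | permutation)) + Var(Exp(T | permutation)).
    variance = (2 * B * B * ell + 4 * B * N * cap_mean) + N * N * cap_var
    return mean, variance
```

The result does not depend on $N$, including the full-codebook case $N = 2^n$ where $B = 0$.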
VII. EVALUATING NONRANDOMNESS OF LINEAR APPROXIMATIONS
In this section, we apply the statistical models of linear approximations of a random permutation and present a test to evaluate non-randomness of linear approximations of a block cipher based on the observed capacities in large experiments.

A. Randomness Test
We consider the set of permutations from F n 2 to F n 2 and a permutation drawn from this set. The null hypothesis of the test is defined as follows.

Hypothesis 1 (Null Hypothesis):
The permutation is a truly random permutation, that is, it has been drawn uniformly at random from the set of all permutations.
The alternative hypothesis is then defined as follows.

Hypothesis 2 (Alternative Hypothesis):
The permutation is not a truly random permutation.
The test is performed by computing the test statistic T for the given permutation. For example, T = T N (A) defined by (27). Given a threshold τ , the null hypothesis is accepted if T ≤ τ and the alternative hypothesis is accepted if T > τ.
To determine the threshold, we first set the significance level $\alpha$ and then use the probability distribution of $T$ over a randomly and uniformly selected permutation to determine the threshold $\tau_\alpha$ in such a way that
Pr(Alternative Hypothesis is accepted | Permutation is truly random) = $\alpha$.

The following probability
Pr(Alternative Hypothesis is accepted | Permutation is not truly random)
is called the success probability. It depends on $\tau_\alpha$ and we denote it by $P_S(\alpha)$.
The performance of the test for a given significance level $\alpha$ can then be quantified using the distinguishing advantage defined as the absolute value of the difference
$$|P_S(\alpha) - \alpha|.$$
The distinguishing advantage is a value between 0 and 1. While for practical attacks this value is usually closer to 1, non-negligible deviation from 0 can be considered to reflect nonrandomness of the permutation.
When used for key recovery it is assumed that, if the test statistic is computed from data obtained from the cipher using a wrong key, then the Null Hypothesis holds. Based on this assumption, the significance level $\alpha$ can be interpreted as the fraction of wrong keys falsely accepted as potential right keys. If the total number of key candidates to be tested is $2^k$ then the number of key candidates accepted by the test as correct is $\alpha 2^k = 2^{k-d}$, where $d = -\log_2(\alpha)$. This is akin to saying that the number of correct key bits recovered using the distinguisher is $d$. The levels $\alpha = 0.25, 0.125, 0.0625$ used in this paper correspond to $d = 2, 3, 4$, respectively.
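A threshold $\tau_\alpha$ and the corresponding number of recovered key bits can be computed as sketched below. This is not the procedure used in the paper: it assumes that the test statistic is $\chi^2$ distributed with a known number of degrees of freedom under the null hypothesis, and it replaces the exact $\chi^2$ quantile by the Wilson-Hilferty approximation so that only the Python standard library is needed. The function names are illustrative.

```python
from math import log2, sqrt
from statistics import NormalDist

def chi2_quantile_wh(prob, dof):
    """Wilson-Hilferty approximation to the chi-square quantile function."""
    z = NormalDist().inv_cdf(prob)
    c = 2.0 / (9.0 * dof)
    return dof * (1.0 - c + z * sqrt(c)) ** 3

def threshold(alpha, dof):
    """tau_alpha such that Pr(T > tau_alpha | null hypothesis) is about alpha."""
    return chi2_quantile_wh(1.0 - alpha, dof)

def advantage_bits(alpha):
    """d = -log2(alpha): key bits gained when a fraction alpha of wrong keys survives."""
    return -log2(alpha)
```

For example, `threshold(0.0625, 64)` gives the acceptance boundary for an affine set of 64 approximations at the strictest of the three significance levels; an exact quantile routine can be substituted where higher accuracy matters.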

B. How to Determine the Success Probability
At each significance level, the success probability depends on the probability distribution of the test statistic considered over the cipher keys and over the data obtained from the cipher. Previously, there have been attempts to estimate the distribution of the capacity of a multidimensional linear approximation of a block cipher [1], [12]. The former paper determines the mean and the variance of the probability distribution which is then assumed to be normal, while the latter paper computes an approximation of the probability density function in simulations.
In this paper, we take a different approach. We use rough estimates of the correlations only to support the search for strong linear approximations, but do not use these estimates to model the cipher. For certain selected values of $\alpha$, we determine the success probabilities $P_S(\alpha)$ and distinguishing advantages experimentally and see how many rounds the distinguishers can cover before the distinguishing advantage becomes negligible, that is, $\alpha \approx P_S(\alpha)$.

A. Description of SIMON
SIMON is a family of lightweight block ciphers designed by the US National Security Agency (NSA) and published in 2013 [26]. The SIMON$2n/mn$ family of lightweight block ciphers has 10 members differing in their block and key sizes. All members of the family have a Feistel structure with round function $R$ employing a non-linear function $f$. In each round $i$, $R$ receives two $n$-bit input words $X_i$ and $Y_i$, and outputs two $n$-bit words $X_{i+1}$ and $Y_{i+1}$. The round function uses three operations in $\mathbb{F}_2^n$: bitwise addition (exclusive-or; XOR), bitwise multiplication (AND), and a left circular shift by $j$ positions, which we denote by '$\oplus$', '$\&$', and '$\ll j$', respectively. Note that the meaning of '$\oplus$' within this section differs from the one used elsewhere in this paper, see Section IV. The internal non-linear function $f$ is defined as:
$$f(X) = \left((X \ll 1) \,\&\, (X \ll 8)\right) \oplus (X \ll 2).$$
The output of the round function $R$ on an input block $X_i \| Y_i$ is:
$$R(X_i, Y_i) = \left(Y_i \oplus f(X_i) \oplus k_i\right) \| X_i,$$
where $i$ is the round number and $k_i$ is the round key. The entire cipher is a composition of round functions $R_{r-1} \circ R_{r-2} \circ \cdots \circ R_0 (X_0, Y_0)$. The structure of the round function of SIMON is depicted in Figure 9.
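The round function can be written down directly for the 16-bit words of SIMON32/64. The sketch below follows the publicly specified SIMON round, $f(X) = ((X \ll 1)\,\&\,(X \ll 8)) \oplus (X \ll 2)$ with $X_{i+1} = Y_i \oplus f(X_i) \oplus k_i$ and $Y_{i+1} = X_i$; the key schedule is omitted and the round keys are assumed to be given. The helper names are illustrative.

```python
WORD_BITS = 16
WORD_MASK = (1 << WORD_BITS) - 1

def rol(x, j):
    """Left circular shift of a 16-bit word by j positions."""
    j %= WORD_BITS
    return ((x << j) | (x >> (WORD_BITS - j))) & WORD_MASK

def f(x):
    """Non-linear function of the SIMON round."""
    return (rol(x, 1) & rol(x, 8)) ^ rol(x, 2)

def round_function(x, y, k):
    """One Feistel round: (X_i, Y_i) -> (X_{i+1}, Y_{i+1})."""
    return y ^ f(x) ^ k, x

def inverse_round(x_next, y_next, k):
    """Undo one round; the Feistel structure makes this independent of f's invertibility."""
    x = y_next
    return x, x_next ^ f(x) ^ k

def encrypt_rounds(x, y, round_keys):
    """Apply the rounds R_0, R_1, ... in order with the given round keys."""
    for k in round_keys:
        x, y = round_function(x, y, k)
    return x, y
```

A reduced-round variant, as used when studying how many rounds a distinguisher covers, is obtained by passing a shorter list of round keys.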

B. Building Linear Approximations of SIMON32/64
We start by determining the linear approximations over one round of SIMON. Referring to Figure 9 let us denote by a the mask on the left input data half X i and by b the mask on the right input data half Y i . To have nonzero correlation, we then must have b as the mask on the left output data half X i+1 .
The bitwise AND-function maps two bits x and y to the single bit x&y. All four linear combinations of the bits x and y have nonzero correlation with x&y, all with the same absolute value 2 −1 . Hence the 2-bit to 1-bit AND-function is a bent function and its four linear approximations with nonzero output mask form an affine set with capacity equal to 1.
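The claim about the AND function is small enough to check by exhaustive enumeration. The sketch below computes the correlation of each linear combination ux ⊕ vy with x&y over the four possible inputs and sums the squared correlations; the function and variable names are our own.

```python
from itertools import product


def correlation(u, v):
    """Correlation between u*x xor v*y and x & y over all inputs (x, y)."""
    total = 0
    for x, y in product((0, 1), repeat=2):
        bit = (u & x) ^ (v & y) ^ (x & y)
        total += 1 - 2 * bit          # +1 if the approximation holds, else -1
    return total / 4


# One entry per input mask (u, v); the output mask is always 1.
corrs = {(u, v): correlation(u, v) for u, v in product((0, 1), repeat=2)}

# Capacity of the set: sum of the squared correlations.
capacity = sum(c * c for c in corrs.values())
```

All four correlations have absolute value 1/2, with only the mask (1, 1) giving a negative sign, and the capacity is exactly 1.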
To capture the linear approximations of the bitwise AND-function over one round, let us assume now that b is a vector with a single 1-bit, so that only a single AND operation is activated. The four linear approximations of the AND function induce four masks c on the right half Y i+1 , which form the two-dimensional affine space
c ∈ a ⊕ b ≫2 ⊕ span{b ≫1 , b ≫8 }, (30)
where we denote by b ≫j the right circular shift by j positions of the bit string b. Since any Feistel cipher (without the final swap) is its own inverse, the same reasoning works also backwards. Given a mask b||c on the output data X i+1 ||Y i+1 , where b is assumed to have a single 1-bit, the input masks on the data X i ||Y i that may give nonzero correlations are of the form a||b, where a is one of the four vectors of the two-dimensional affine set
a ∈ c ⊕ b ≫2 ⊕ span{b ≫1 , b ≫8 }. (31)
Each non-zero bit of the mask b, as described above, potentially induces a two-dimensional affine subspace of masks. Later in this section we will see an example of a mask b that has two non-zero bits which together generate a four-dimensional affine space of masks, each relating to a linear approximation over one round with a correlation that has absolute value 2 −2 .
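The mask sets described above can be enumerated and checked exhaustively for the 16-bit words of SIMON32/64. The following sketch makes two assumptions of our own. First, f is the standard SIMON function with left rotations by 1, 8 and 2. Second, the one-round correlation is computed as the correlation of (a ⊕ c)·x ⊕ b·f(x) over all x, since Y i cancels out and the round key only affects the sign. The helper names are hypothetical.

```python
N = 16                      # word size of SIMON32/64
MASK = (1 << N) - 1


def rol(x, j):
    return ((x << j) | (x >> (N - j))) & MASK


def ror(x, j):
    return ((x >> j) | (x << (N - j))) & MASK


def f(x):
    return (rol(x, 1) & rol(x, 8)) ^ rol(x, 2)


def parity(x):
    return bin(x).count("1") & 1


def span(generators):
    """All XOR combinations of the given generator masks."""
    vectors = {0}
    for g in generators:
        vectors |= {v ^ g for v in vectors}
    return vectors


def and_masks(b):
    """Linear space spanned by the shifts by 1 and 8 of each active bit of b."""
    bits = [1 << i for i in range(N) if (b >> i) & 1]
    return span([ror(bit, j) for bit in bits for j in (1, 8)])


def output_right_masks(a, b):
    """Masks c on Y_{i+1} for input mask a||b (forward direction)."""
    return sorted((a ^ ror(b, 2)) ^ w for w in and_masks(b))


def input_left_masks(b, c):
    """Masks a on X_i for output mask b||c (backward direction)."""
    return sorted((c ^ ror(b, 2)) ^ w for w in and_masks(b))


def round_correlation(a, b, c):
    """Exhaustive one-round correlation, up to a key-dependent sign."""
    m = a ^ c
    total = sum(1 - 2 * (parity(m & x) ^ parity(b & f(x)))
                for x in range(1 << N))
    return total / (1 << N)
```

For example, `input_left_masks(0x0001, 0x0000)` returns the four masks 4000 x , 4100 x , C000 x and C100 x , and `round_correlation` gives absolute value 2 −1 for each of them. If two active bits of b are placed so that their AND gates share an input bit, the generators may coincide and the set is smaller, which is why the text says "potentially".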
Since it soon becomes impractical to iterate this method over more rounds while keeping track of all linear approximations, we focus on the so-called core approximations, which are linear approximations that use at each round only one linear approximation with input mask a||b and output mask b||(a ⊕ b ≫2 ). This was also the approach used in [27] to build the core approximation trails for SIMON. Then the trail correlations are computed based on how many times the linear approximation x&y = 0 was used.
For a linear approximation of an n-bit block cipher to be useful in cryptanalysis, its average squared correlation should differ from 2 −n , that is, the corresponding value for a random permutation. The trail correlations are commonly used to give lower bounds to the average squared correlation. Only if such a lower bound is larger than 2 −n is it useful for distinguishing from the random case.
The absolute values of the trail correlations for SIMON32/64 used in [27] were far less than 2 −16 . Nevertheless, referring to the method by Biryukov et al. of multiple linear cryptanalysis [25], it was argued that by collecting sufficiently many of such linear approximations that have squared correlations with known lower bounds, however small they are, one can accumulate the lower bound of the sum of the average squared correlations of the multiple linear approximations to exceed 2 −32 to get data complexity less than 2 32 . The problem is that these early methods of linear cryptanalysis make the assumption that for each linear approximation to be used, the cryptanalyst has obtained a good estimate of the absolute value of the correlation, and this value is the same for all keys. This assumption is not satisfied by modern block ciphers such as SIMON whose linear approximations have a large number of trails and correlations that vary a lot with the key.
In this paper, we revisit the constructions of sets of linear approximations of SIMON32/64 from [27], and compute estimates of the correlations experimentally over large sets of keys. As observed, a core trail can be extended at the beginning and at the end by using all four approximations to build multiple linear approximations, which naturally give rise to affine sets of linear approximations.
Using the statistical distribution given in Theorem 13 of the capacity of an affine set of linear approximations for a random permutation and the randomness test presented in Section VII we evaluate the deviation of the behaviour of the block cipher from random behaviour by extensive experiments.

C. Full Codebook Randomness Evaluation of SIMON32/64
In this section we build examples of affine sets of linear approximations which cover up to 18 rounds of SIMON32/64 and experimentally evaluate their distinguishing advantages. We begin by describing the core approximation trail we will use for these examples and experimentally evaluate its nonrandomness starting from 13 rounds. The core trail is built starting with the input mask 4000 x ||0001 x to round 1 and by following its propagation through the rounds as described in Subsection VIII-B. Note that the output mask from round i is the input mask to round i + 1.
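The propagation of the core trail can be reproduced with a short script. The sketch below applies the core rule a||b → b||(a ⊕ b ≫2 ) of Subsection VIII-B round by round. The weight function is an assumption of ours: it charges one factor 2 −1 for every active bit of the mask b entering a round, following the single AND approximation discussed there. The function names are hypothetical.

```python
N = 16                      # word size of SIMON32/64
MASK = (1 << N) - 1


def ror(x, j):
    """Right circular shift of an N-bit word by j positions."""
    return ((x >> j) | (x << (N - j))) & MASK


def core_step(a, b):
    """Core approximation over one round: a||b -> b||(a xor b>>>2)."""
    return b, a ^ ror(b, 2)


def core_trail(a, b, rounds):
    """List of masks: the input mask followed by the output mask of each round."""
    masks = [(a, b)]
    for _ in range(rounds):
        a, b = core_step(a, b)
        masks.append((a, b))
    return masks


def trail_weight(masks):
    """Number of active AND gates, i.e. -log2 of the trail correlation
    under the assumption stated in the lead-in."""
    return sum(bin(b).count("1") for _, b in masks[:-1])


trail = core_trail(0x4000, 0x0001, 16)
for rnd, (a, b) in enumerate(trail[1:], start=1):
    print(f"output of round {rnd:2d}: {a:04X}||{b:04X}")
```

Starting from 4000 x ||0001 x , the script yields 0001 x ||0000 x after round 1 and 0040 x ||0110 x after round 16, the two masks used later in this subsection to extend the trail.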
By counting the total number of non-zero bits in the right halves of the output masks from all rounds of the trail, we can evaluate the trail correlation. We see that after 10 rounds the absolute value of the trail correlation drops below 2 −16 . Nevertheless, the true correlation will stay above this value for at least 16 rounds as we will see next in Table II, which gives the results of the experimental evaluation of the core linear approximation over 13-18 rounds.
The next step is to use an affine subspace of linear approximations. We take the output mask 0001 x ||0000 x from round 1 of the core trail and determine all four input masks to round 1 that have non-zero correlation. According to (31) they are 4000 x ||0001 x , C000 x ||0001 x , 4100 x ||0001 x , and C100 x ||0001 x . This is the two-dimensional affine space
4000 x ||0001 x ⊕ span{8000 x ||0000 x , 0100 x ||0000 x }.
With these input masks and a single output mask as given for the core trail we have an affine set of input-output mask pairs of dimension 2. In Table III we give the experimental results of these affine linear approximations over 13-18 rounds using the test described in Section VII. The expected capacity in the random case is 2 −30 . Compared to the corresponding results for the single approximations given in Table II, there is a significant improvement in success probabilities, at all significance levels, up to 17 rounds, after which point the capacity also becomes very close to that of the random case.
In an attempt to analyse nonrandomness of 18 rounds of SIMON32/64 we use another trail constructed from the core trail. We add one more round to the core trail in the beginning and take all four masks. To apply (31) we take b||c = 4000 x ||0001 x , to get the four input masks a||b, where
a ∈ 1001 x ⊕ span{2000 x , 0040 x }.
To obtain the output masks, we start with the output mask a||b = 0040 x ||0110 x from round 16, and observe that b has two active bits. Then by (30) the output masks are of the form b||c, where one value for c is a ⊕ b ≫2 = 0004 x . All the 16 masks are obtained by adding this value to the masks in the linear space spanned by the right circular shifts by 1 and 8 of the two single-bit components 0100 x and 0010 x of b. That is,
c ∈ 0004 x ⊕ span{0080 x , 0001 x , 0008 x , 1000 x }.
The resulting affine space of linear masks is of dimension 6. Again, by running experiments with a randomly selected set of 2 13 keys we get the average capacity 2 −25.99 , which is quite close to the expected capacity of 2 −26 in the random case. Nevertheless, when applying the randomness test, we get success probabilities P S (0.25) = 0.260, P S (0.125) = 0.137 and P S (0.0625) = 0.072. Thus the 6-dimensional 18-round distinguisher performs better than the 1-dimensional 18-round distinguisher in Table II or the 2-dimensional 18-round distinguisher in Table III. On the other hand, we can see that the 18-round affine set of approximations does not preserve all the distinguishing power of the 16-round core approximation, see Table II, even though it was constructed by extending this 16-round approximation by one round up and one round down, taking all masks in the input and output that have non-zero correlations. For example, P S (0.125) = 0.159 for the 16-round approximation, while P S (0.125) = 0.137 for the extended 18-round affine approximation.
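The construction of the 6-dimensional affine set can be retraced numerically. The sketch below enumerates the four input masks and the 16 output masks from the values given in the text and combines them into 64 input-output mask pairs. The estimate of the random-case capacity as the number of approximations times 2 −32 is a rough approximation used here only for illustration; the precise distribution is the one given by the statistical model of this paper. The helper names are our own.

```python
from itertools import product
from math import log2

N = 16                      # word size of SIMON32/64
MASK = (1 << N) - 1


def ror(x, j):
    return ((x >> j) | (x << (N - j))) & MASK


def span(generators):
    """All XOR combinations of the given generator masks."""
    vectors = {0}
    for g in generators:
        vectors |= {v ^ g for v in vectors}
    return vectors


def affine_set(base, generators):
    return sorted(base ^ w for w in span(generators))


# Extra round at the beginning: output mask b||c = 4000||0001.
b_in, c_in = 0x4000, 0x0001
input_left = affine_set(c_in ^ ror(b_in, 2), [ror(b_in, 1), ror(b_in, 8)])

# Extra round at the end: input mask a||b = 0040||0110 (output of round 16).
a_out, b_out = 0x0040, 0x0110
generators = [ror(bit, j) for bit in (0x0100, 0x0010) for j in (1, 8)]
output_right = affine_set(a_out ^ ror(b_out, 2), generators)

# Each pair is (input mask a||b_in, output mask b_out||c).
pairs = [((a, b_in), (b_out, c)) for a, c in product(input_left, output_right)]

# Rough random-case capacity: one term of about 2^-32 per approximation.
log2_random_capacity = log2(len(pairs)) - 32
```

With 4 input masks and 16 output masks the set has 64 = 2 6 elements, and the rough estimate evaluates to 2 −26 , in line with the expected capacity quoted above.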

D. Compliance With the Model of Multidimensional Linear Cryptanalysis
We used this experimental setting also for verifying Conjecture 1 and computed the capacity of the 6-dimensional linear subspace of approximations related to the 6-dimensional affine subspace of approximations of SIMON32/64 constructed in the previous subsection. The dimensions of the spaces are t = 6, u = 4 and v = 2. According to Conjecture 1 the expected capacity in the random case is equal to 2 −26.508 . When evaluating it for the cipher by computing the average capacity over 2 13 keys we get 2 −26.506 , which is convincingly close to the conjectured value.

E. Randomness Testing of SIMON Using Less Than Full Codebook
The experiments used for this section are identical to those of Subsection VIII-C in all but the data complexity. We use 2 30 and 2 28 plaintext-ciphertext pairs instead of the full codebook. The results of the experiments given in Tables IV-V reveal that distinguishing from random is still possible with less than the full codebook. Comparison with Table III shows that, as expected, the success probability is lower in rounds 13-16 and becomes closer to the random behaviour already at 17 rounds when 2 30 pairs of data are used and even earlier, at 16 rounds, when 2 28 pairs are used. This is due to the smaller sample, which leads to a larger sampling error and thus to a bigger variance for the test statistic.
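The effect of the sample size on the sampling error can be illustrated with the standard finite population correction. The sketch below computes the variance of an empirical correlation estimated from m distinct plaintexts drawn without replacement from a codebook of 2 32 blocks, for a fixed permutation whose true correlation is `corr`. This is only the textbook sampling term, stated under our own simplifying assumptions; it does not include the variation over keys or permutations that the full model of this paper accounts for.

```python
def sampling_variance(m, n_bits=32, corr=0.0):
    """Variance of the empirical correlation from m distinct plaintexts.

    Uses the textbook formula sigma^2 / m * (N - m) / (N - 1) for sampling
    without replacement, with N = 2^n_bits and sigma^2 = 1 - corr^2 for a
    sequence of +1/-1 values with mean corr.
    """
    population = 2 ** n_bits
    if not 0 < m <= population:
        raise ValueError("sample size must be between 1 and the codebook size")
    sigma2 = 1.0 - corr ** 2
    return sigma2 / m * (population - m) / (population - 1)


for log_m in (32, 30, 28):
    print(f"m = 2^{log_m}: variance = {sampling_variance(2 ** log_m):.3e}")
```

With the full codebook the correction factor vanishes and no sampling error remains, while the variance grows as the sample shrinks to 2 30 and then 2 28 pairs.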

IX. CONCLUSION
In this paper we presented a model which captures the statistical behaviour of the capacity of multidimensional linear approximations computed for a permutation and a sample of plaintexts, when the permutation and the fixed-size sample of distinct plaintexts are selected uniformly at random. The additivity of the variances of squared correlations is achieved without any assumptions of statistical independence, based only on standard statistical tools such as Pearson's χ 2 test and the finite population correction coefficient.
We showed for the first time that the number of degrees of freedom of the related χ 2 distribution depends on the structure of the multidimensional linear approximation and that it can be significantly smaller than assumed in previous works due to the existence of trivial approximations. We identified two types of sets of multiple linear approximations, the Davies-Meyer approximation and the affine approximation, which do not have trivial approximations. Such types of approximations offer the most efficient χ 2 -based linear attacks due to the optimal number of degrees of freedom. When selecting sets of strong multiple linear approximations for actual ciphers, such structures are recommended for consideration if possible. As the first example, we mentioned the first multidimensional linear cryptanalysis on Serpent, where restricting to an affine set of approximations could potentially improve the attack.
The second example consists of experimental evaluation of certain affine multidimensional linear approximations on the block cipher SIMON32/64. Using our statistical model for random permutations we presented a simple test to evaluate randomness experimentally. We were able to identify nonrandom behaviour of round-reduced SIMON32/64 up to 18 rounds. It remains, however, an open question whether such affine multidimensional linear approximations on SIMON32/64 and its larger versions can potentially lead to efficient key-recovery attacks.
The best linear key-recovery attack on SIMON32/64 is given in [28]. It makes use of the 13-round linear hull identified in [27] and adds 5 rounds before and after this distinguisher to extend the attack over to 23 rounds. It starts from the input to round 2 of the core trail, see Table I and ends after round 14 thus covering 13 rounds of the cipher. It has been chosen carefully in such a way that the input and output masks of the linear hull have only one active bit each which allows very efficient key guessing techniques.
The distinguishing advantage of the 13-round linear approximation used in [28] corresponds to the one given in Table II, that is, it can recover d = 4 bits of the secret key with experimentally determined success probability P S (0.0625) = 0.330. Our 2-dimensional affine distinguishers given in Table III can improve the success probability to P S (0.0625) = 0.483 for 13 rounds, or alternatively, increase the number of rounds to 15 with a slight decrease of the success probability to P S (0.0625) = 0.254. Our distinguishers, however, have several active bits in the input and output making efficient key search a challenging task which is left for future work.
By combining these expressions and multiplying by k 2 /m 4 , we get the claimed result for the variance of C. The derivation of the mean is similar, but simpler.

APPENDIX B PROOF OF THEOREM 2