Spectral analysis of ZUC-256

In this paper we develop a number of generic techniques and algorithms for the spectral analysis of large linear approximations in cryptanalysis. We apply the developed tools to ZUC-256 and give a distinguishing attack with complexity around 2^236. Although the attack is only 2^20 times faster than exhaustive key search, the result indicates that the keystream generated by ZUC-256 does not provide the full 256 bits of entropy that would be expected from a 256-bit key. To the best of our knowledge, this is the first known academic attack on full ZUC-256 with a computational complexity below that of exhaustive key search.


Introduction
ZUC is the stream cipher used as the core of the 3GPP Confidentiality and Integrity Algorithms UEA3 & UIA3 for LTE networks [ETS11a]. It was initially proposed in 2010 as the candidate for UEA3 & UIA3 for use in China. After external and public evaluation and two ZUC workshops, held in 2010 and 2011, it was ultimately accepted by 3GPP SA3 as a new inclusion in the LTE standards with a 128-bit security level, i.e., the secret key is 128 bits long.
Like most stream ciphers, ZUC has a linear part, an LFSR, and a non-linear part, called the F function, which disrupts the linearity of the LFSR contribution. The design differs from common stream ciphers, which are usually defined over the binary field GF(2) or extension fields of GF(2): the LFSR in ZUC is defined over the prime field GF(p) with p = 2^31 − 1, while the registers in F are defined over GF(2^32). A bit-reorganization (BR) layer between the LFSR and F serves as a connection layer, extracting bits from the LFSR and pushing them into F. Thus, standard cryptanalysis of common stream ciphers cannot be directly applied to ZUC, and to date there has been no efficient cryptanalysis of ZUC with an attack faster than exhaustive key search.
After ZUC was announced, a number of research efforts were conducted to evaluate the cipher [ETS11b], [STL10], [WHN+12]. A weakness in the initialization phase was found in [STL10], [WHN+12], and this directly resulted in an improved version. After the adoption as the UEA3 & UIA3 standard, there was additional work on the cryptanalysis of ZUC. A guess-and-determine attack on ZUC was proposed in [GDL13] based on half-words, i.e., 16-bit blocks, splitting the registers in the LFSR and FSM into high and low 16 bits, where some carry bits are introduced by the splitting. It requires 6 keystream words and the complexity is O(2^392), which is, however, higher than exhaustive key search. In [ZFL11], a differential trail covering 24 rounds of the initialization stage is given, but this poses no threat since ZUC has 32 initialization rounds. [LMVH15] also shows that weak inputs do not exist in ZUC when it is initialized with 32 rounds. These results indicate that ZUC is resistant to common attacks.
In January 2018, ZUC-256 was announced as the 256-bit version of ZUC [Tea18], in order to satisfy the 256-bit security requirement of 5G from 3GPP [3GP18]. Compared to ZUC-128, the structure of ZUC-256 remains the same; only the initialization and message authentication code generation phases are modified to match the 256-bit security level. Subsequently, in July 2018, a workshop on ZUC-256 was held and some general cryptanalysis was presented, but no obvious weaknesses of ZUC-256 were found. To conclude, there are so far no efficient cryptanalysis techniques that succeed in reducing the claimed security levels of ZUC (128-bit or 256-bit).
In this paper, we propose a distinguishing attack on ZUC-256 with computational complexity around 2^236, obtained by linearly approximating the non-linear part F and the conversion between the different finite fields of the LFSR and F. The key techniques we employ to find a good linear approximation and to compute the bias are called spectral tools here, using, e.g., the Walsh-Hadamard Transform (WHT) and the Discrete Fourier Transform (DFT). Spectral tools are widely used in linear cryptanalysis to, for example, efficiently compute the distribution or the bias of a linear approximation, since fast algorithms for the WHT and DFT reduce the computational complexity from O(N^2) to O(N log N) [MJ05], [LD16]. They are also widely used to investigate properties of Boolean functions and S-boxes, which can be considered as vectorial Boolean functions, such as correlation, autocorrelation, propagation characteristics and value distributions [NH07], [HN12]. We explore the use of the WHT and DFT and find new results about efficiently computing the bias or correlations. Importantly, we show how a permutation or a linear masking in the time domain affects the spectrum points in the frequency domain for widely used operations and components, such as ⊞, ⊕, and S-boxes. Based on that, we give a number of further results on how to choose linear maskings in the time domain, by considering the behavior of the noise variables in the frequency domain, such that a decent approximation with a large bias can be found.
We employ the new findings in a spectral analysis of ZUC-256 and use them to develop a distinguishing attack. Even though the distinguishing attack is not a very strong one, it indicates that ZUC-256 cannot achieve the full 256-bit security level in this scenario.
The rest of this paper is organized as follows. We first give the general design and structure of ZUC-256 in Section 2, and the spectral analysis techniques are presented in Section 3. After that, in Section 4, we give a distinguishing attack on ZUC-256 using the spectral tools. Specifically, we first derive a linear approximation in Subsection 4.1; we then show how to efficiently derive the bias of the approximation in Subsections 4.2 to 4.4, using the spectral analysis and a technique called the "bit-slicing technique"; and lastly we give the distinguishing attack based on the derived approximation. In Section 5, we conclude the paper.

Description of ZUC-256
In this section we give a brief description of the ZUC-256 algorithm. Basically, the structure of ZUC-256 is exactly the same as that of ZUC-128, except that the length of the secret key K is changed to 256 bits and the loading process of the key and IV is modified accordingly [Tea18]. ZUC-256 takes a 256-bit secret key K and a 128-bit initial vector IV as input and produces a sequence usually called the keystream. In this paper, we use Z^(t) to denote the generated keystream block at a time instance t, for t = 1, 2, . . .. In ZUC-256, each keystream block is a 32-bit word, so we write Z^(t) ∈ GF(2^32), t = 1, 2, . . .. Furthermore, each (K, IV) pair should produce a unique keystream sequence; in practice, K is usually fixed and the IV value varies to generate many different keystream sequences.
The overall schematic of the ZUC-256 algorithm is shown in Figure 1. It consists of three layers: the top layer is a linear feedback shift register (LFSR) of 16 stages; the bottom layer is a nonlinear block called the F function; and the middle layer, called the bit-reorganization (BR) layer, is a connection layer between the LFSR and F.

Figure 1: The keystream generation phase of the ZUC-256 stream cipher.

Below we give some details of the three layers; for more details we refer to the original design document [ETS11a].

The LFSR Layer
The LFSR part consists of 16 cells, denoted (s_0, s_1, ..., s_15), each holding 31 bits and giving 496 bits in total. Every value in the cells is an element of the finite field GF(p), where p = 2^31 − 1, and it can be written in the binary representation

x = Σ_{i=0}^{30} x_i · 2^i, where x_i ∈ {0, 1} for 0 ≤ i ≤ 30.

Then 2^k · x mod p is computed as x <<<_31 k, where <<<_31 k denotes the 31-bit left circular shift by k steps. This makes the implementation quite efficient. One can see that the LFSR in ZUC operates over a prime field instead of GF(2) or GF(2^n), as most stream ciphers do, which makes it not directly susceptible to common linear cryptanalysis. The feedback polynomial of the LFSR is given by:

P(x) = −x^16 + 2^15 x^15 + 2^17 x^13 + 2^21 x^10 + 2^20 x^4 + (1 + 2^8) ≡ 0 mod p.

P(x) is a primitive polynomial over GF(p), and this ensures that the LFSR sequence is an m-sequence with period p^16 − 1 ≈ 2^496. If we denote the LFSR state at clock t as (s_0^(t), s_1^(t), ..., s_15^(t)), the new cell value is computed according to the feedback taps of P(x) above; if the computed feedback value is 0, it is set to p (i.e., the representation of the element 0 is the binary representation of p).
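The rotation trick and the feedback computation can be sketched as follows (a minimal illustration of the keystream-mode LFSR update; the taps are read off the quoted P(x), and the helper names are ours):

```python
p = 2**31 - 1  # prime modulus of the ZUC LFSR field

def rot31(x, k):
    """Compute 2^k * x mod p as a 31-bit left circular shift (0 <= x <= p)."""
    return ((x << k) | (x >> (31 - k))) & p

def lfsr_step(s):
    """One keystream-mode LFSR clock for a state s of 16 cells.
    Taps follow P(x) = -x^16 + 2^15 x^15 + 2^17 x^13 + 2^21 x^10 + 2^20 x^4 + (1 + 2^8)."""
    v = (rot31(s[15], 15) + rot31(s[13], 17) + rot31(s[10], 21)
         + rot31(s[4], 20) + s[0] + rot31(s[0], 8)) % p
    return s[1:] + [v if v != 0 else p]  # element 0 is represented by p
```

The masking by p in `rot31` is exactly the wrap-around of the circular shift, so no modular reduction is needed for the individual products.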

The BR Layer
The BR layer is the connection layer between the LFSR and F. It extracts 128 bits from the LFSR and forms four 32-bit words X0, X1, X2, X3, with the first three fed to F and the last one XOR-ed with the output of F to finally generate the keystream symbol. For a cell s_i in the LFSR, the high and low 16 bits are extracted as s_{iH} (bits 30..15) and s_{iL} (bits 15..0), respectively. Then X0, X1, X2, X3 are constructed as follows [ETS11a]:

X0 = s_{15H} || s_{14L},  X1 = s_{11L} || s_{9H},  X2 = s_{7L} || s_{5H},  X3 = s_{2L} || s_{0H},

where h||l denotes the concatenation of two 16-bit integers h and l into a 32-bit one, with l being the least significant bits and h the most significant bits of the result. Then X1, X2 are sent into F to update the registers there.

The Non-linear Layer F
The nonlinear layer F has two internal 32-bit registers R1 and R2, which are updated through linear and nonlinear operations. It is a compression function taking X0, X1, X2 as input and producing one 32-bit word W, which is used to generate the keystream symbol as Z = W ⊕ X3:

W = (X0 ⊕ R1) ⊞ R2,

where ⊞ denotes addition modulo 2^32. Then F is updated by:

W1 = R1 ⊞ X1,  W2 = R2 ⊕ X2,
R1 = S(L1(W1_L || W2_H)),  R2 = S(L2(W2_L || W1_H)).

Here S = (S0, S1, S0, S1) is a 32 × 32 S-box composed of four juxtaposed S-boxes, where S0 and S1 are two different 8-to-8-bit S-boxes. L1, L2 are two 32 × 32 linear transforms, defined as XORs of 32-bit rotations [ETS11a]:

L1(X) = X ⊕ (X <<<_32 2) ⊕ (X <<<_32 10) ⊕ (X <<<_32 18) ⊕ (X <<<_32 24),
L2(X) = X ⊕ (X <<<_32 8) ⊕ (X <<<_32 14) ⊕ (X <<<_32 22) ⊕ (X <<<_32 30).

Just like other stream ciphers, ZUC-256 uses an initialization phase before generating a keystream sequence, to fully mix the secret key and IV. During the initialization phase, the key and IV are loaded into the LFSR registers and the cipher runs 32 iterations with the output of the F function fed back to the LFSR instead of producing keystream symbols. After the initialization, the cipher enters the keystream mode, with the first output word of F being discarded and the following outputs forming the keystream symbols by XOR-ing with X3. Since the attack in this paper only uses the keystream mode, we do not give the details of the initialization mode, but refer to the design documents [ETS11a], [Tea18].
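As an illustration, the two linear transforms L1 and L2 can be written directly as XORs of 32-bit rotations (a sketch following the rotation constants of the design documents; `rot32` is our helper name):

```python
MASK32 = 0xFFFFFFFF

def rot32(x, k):
    """32-bit left circular shift by k."""
    return ((x << k) | (x >> (32 - k))) & MASK32

def L1(x):
    """L1(X) = X xor (X <<< 2) xor (X <<< 10) xor (X <<< 18) xor (X <<< 24)."""
    return x ^ rot32(x, 2) ^ rot32(x, 10) ^ rot32(x, 18) ^ rot32(x, 24)

def L2(x):
    """L2(X) = X xor (X <<< 8) xor (X <<< 14) xor (X <<< 22) xor (X <<< 30)."""
    return x ^ rot32(x, 8) ^ rot32(x, 14) ^ rot32(x, 22) ^ rot32(x, 30)
```

Both maps are GF(2)-linear, i.e., L1(a ⊕ b) = L1(a) ⊕ L1(b), which is what makes them transparent to the WHT-based analysis in the next section.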

Spectral tools for cryptanalysis
In multidimensional linear cryptanalysis one often has to deal with large distributions and must be able to find good approximations with large biases that can then be used in an attack. In this section, we give several spectral-analysis techniques that help to efficiently find a good linear approximation and compute its bias. We will later use most of the presented techniques in the cryptanalysis of ZUC-256.
Notations. Let X^(1), X^(2), ..., X^(t) be t independent random variables taking values from an alphabet of n-bit integers, such that the total size of the alphabet is N = 2^n. For a random variable X, let the sequence X_k, k = 0, 1, ..., N − 1, represent the distribution table of X, i.e., X_k = Pr{X = k}, or a sequence of occurrence counts in the time domain, e.g., X_k = the number of occurrences of the value k. If such a sequence of numbers is normalized by dividing each entry by the total number of occurrences, we speak of an empirical distribution or a type [CT12].
We denote by W(X) the N-point Walsh-Hadamard Transform (WHT) and by F(X) the N-point Discrete Fourier Transform (DFT). Individual values of the transforms are addressed by W(X)_k and F(X)_k, for k = 0, 1, ..., N − 1. We denote by X̂_k the spectrum value at a point k, i.e., X̂_k = W(X)_k or X̂_k = F(X)_k, depending on the context. The values X̂_k, for k = 0, 1, ..., N − 1, in the frequency domain constitute the spectrum of X.
When considering Boolean operations, such as k · M, where k is an n-bit integer (or an index) and M is an n × n Boolean matrix, it should be understood that the integer k is mapped 1-to-1 to a Boolean vector of length n containing the bits of the binary representation of k. The Boolean matrix multiplication is then performed modulo 2, and the resulting Boolean vector is mapped back 1-to-1 to an n-bit integer.
WHT and DFT. The DFT is defined as

F(X)_k = Σ_{j=0}^{N−1} X_j · ω_0^{jk},  k = 0, 1, ..., N − 1,

where ω_0 = e^{−i2π/N} is a primitive N-th root of unity. Every point value F(X)_k is a complex number with a real part Re(·) and an imaginary part Im(·), i.e., X̂_k = Re(X̂_k) + i · Im(X̂_k).
The WHT is a special variant of the DFT and is defined as

W(X)_k = Σ_{j=0}^{N−1} X_j · (−1)^{k·j},

where k · j now denotes the bitwise dot product of the binary representations of the n-bit indices k and j, i.e., in vectorial binary form, k · j = ⊕_{i=0}^{n−1} k_i · j_i, where k_i, j_i are the i-th bits of the binary representations of k and j, for i = 0, 1, ..., n − 1. Every W(X)_k has only a real part, and it is an integer whenever the time-domain values are integers.
The squared magnitude at a point k is |X̂_k|^2 = Re(X̂_k)^2 + Im(X̂_k)^2. The point k = 0 in the spectrum represents the sum of all values in the time domain for both the WHT and the DFT, i.e.,

X̂_0 = Σ_{j=0}^{N−1} X_j.   (1)

There are many well-known fast algorithms computing the DFT or WHT in time O(N log N), which makes spectral transforms widely used in cryptanalysis and in many other areas as well.
Convolutions. A typical operation in linear multidimensional cryptanalysis is to compute the distribution of a noise variable that is the sum (⊕ or ⊞) of other noise variables (referred to as sub-noise variables). While computing the distribution directly in the time domain might be complicated, the complexity can be largely reduced using the DFT and WHT [MJ05] through

W(X^(1) ⊕ · · · ⊕ X^(t)) = W(X^(1)) • · · · • W(X^(t)),
F(X^(1) ⊞ · · · ⊞ X^(t)) = F(X^(1)) • · · · • F(X^(t)),   (2)

where • is the point-wise multiplication of two spectrum vectors. In particular, the overall complexity is now O(t · N log N).
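The ⊕-case can be sketched as follows (a minimal illustration with a textbook fast WHT; function names are ours):

```python
def fwht(a):
    """Fast Walsh-Hadamard transform of a list whose length is a power of two.
    Applying it twice multiplies the input by N, so the inverse is fwht(.) / N."""
    a = list(a)
    h = 1
    while h < len(a):
        for i in range(0, len(a), 2 * h):
            for j in range(i, i + h):
                a[j], a[j + h] = a[j] + a[j + h], a[j] - a[j + h]
        h *= 2
    return a

def xor_convolve(dists):
    """Distribution of X(1) xor ... xor X(t) via point-wise spectrum products."""
    N = len(dists[0])
    spec = [1.0] * N
    for d in dists:
        spec = [u * v for u, v in zip(spec, fwht(d))]
    return [v / N for v in fwht(spec)]  # inverse WHT = WHT / N
```

For t sub-noises this costs O(t · N log N) instead of the O(t · N^2) of direct convolution in the time domain.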

Precision problems and the bias in the frequency domain
The bias of a multidimensional noise variable X is often expressed in the time domain as the Squared Euclidean Imbalance (SEI), also called the capacity in some papers [HN12], defined in [BJV04] as follows:

ε(X) = N · Σ_{i=0}^{N−1} (X_i / f − 1/N)^2,

where f = Σ_{i=0}^{N−1} X_i is the normalization factor, used when the distribution table of X is not normalized, for example when the table in the time domain stores the number of occurrences of each entry. If the distribution table of X is already normalized, then f = 1, as expected for the sum of all probabilities.
It is known that to distinguish a noise distribution X with the above bias ε(X) from random using hypothesis testing, one needs to collect O(1/ε(X)) samples from this distribution [BJV04], [HG97].
Precision problems. Assume that we want to compute the bias of a noise variable X, which is the sum (⊕ or ⊞) of t other sub-noises X^(1), ..., X^(t), using the convolution formulae given in Equation 2. If the expected bias is ε(X) ≈ 2^{−p}, then in practice we would expect probability values around 2^{−n} ± 2^{−p/2−n} on average, and then a float data type should maintain at least O(p/2) bits of precision for every value of X_k in the time domain, provided that the float data type has an exponent field (e.g., data types float and double in standard C).
For example, when we want to compute a bias ε > 2^{−512} (p = 512), the underlying data types for float or integer values should hold at least 256 bits of precision. This forces a program to use large-number arithmetic (e.g., BIGNUM, quad precision, etc.), which requires more RAM and disk storage and expensive computation time.
In the following, we show that the bias of X may be computed in the frequency domain without switching back to the time domain, and the required precision then fits well into the standard type double in C/C++.

Theorem 1. For an n-bit random variable X with a normalized or non-normalized probability distribution (X_0, X_1, ..., X_{N−1}) and its spectrum (X̂_0, X̂_1, ..., X̂_{N−1}), computed either with the DFT or the WHT, the bias ε(X) can be computed in the frequency domain as the sum of the normalized squared magnitudes of all nonzero points, where the zero point X̂_0 serves as the normalization factor, i.e.,

ε(X) = Σ_{k=1}^{N−1} |X̂_k|^2 / |X̂_0|^2.   (3)

Proof. From Equation 1 we get that the normalization factor is f = |X̂_0|. The SEI expression can be written as ε(X) = N · Σ_{i=0}^{N−1} (X_i/f − U_i)^2, where U is the uniform distribution. The spectrum of the sequence X/f − U is X̂_k / |X̂_0| at every nonzero point k and 0 at k = 0, so by Parseval's theorem we derive ε(X) = N · (1/N) · Σ_{k=1}^{N−1} |X̂_k / X̂_0|^2, which is the claimed result.

Theorem 1 means that the required precision of values in the frequency domain can be as small as just a few bits, but the exponent value must be correct and preserved. In C/C++ it is therefore good enough to store the spectrum of a distribution in type double, which has 52 bits of precision; the smallest exponent it can hold is 2^{−1023}. We can hardly imagine cryptanalysis where the expected bias would be smaller than that (and if it were, we could always change the factor X̂_0 to a larger value).
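A quick numerical check of Theorem 1 in the WHT case (our own sketch; `fwht` is a textbook fast WHT):

```python
def fwht(a):
    """Fast Walsh-Hadamard transform (length a power of two)."""
    a = list(a)
    h = 1
    while h < len(a):
        for i in range(0, len(a), 2 * h):
            for j in range(i, i + h):
                a[j], a[j + h] = a[j] + a[j + h], a[j] - a[j + h]
        h *= 2
    return a

def sei_time(X):
    """SEI in the time domain: N * sum_i (X_i/f - 1/N)^2 with f = sum_i X_i."""
    N, f = len(X), float(sum(X))
    return N * sum((x / f - 1.0 / N) ** 2 for x in X)

def sei_freq(X):
    """Theorem 1: the same bias from the spectrum, sum_{k>0} |X_k|^2 / |X_0|^2."""
    s = fwht(X)
    return sum(v * v for v in s[1:]) / float(s[0] * s[0])
```

Both functions agree on normalized and non-normalized tables alike, since the zero spectrum point supplies the normalization.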
A similar technique to compute the bias in the frequency domain was given in [BJV04], but the time-domain sequence there consists of the probability differences to the uniform distribution, while here it is the original probabilities of the variable X. With this, we can directly compute the bias of the sum (⊞ or ⊕) of several sub-noises in the frequency domain by combining Theorem 1 and Equation 2: the bias of the ⊞-sum of several sub-noises can be computed as

ε(X^(1) ⊞ · · · ⊞ X^(t)) = Σ_{k=1}^{N−1} Π_{i=1}^{t} |F(X^(i))_k|^2 / Π_{i=1}^{t} |F(X^(i))_0|^2,   (4)

and a similar result holds for the ⊕-sum in the WHT case. Note that if we convert each spectrum value |F(X^(i))_k| to log_2(|F(X^(i))_k|^2) (and similarly in the WHT case), then arithmetic in the frequency domain, such as in Equation 4, changes from computing products to computing sums. This can give an additional speed-up as well as RAM and storage savings. Later we will show how these results help to find a good approximation.
The main observation and motivation for developing further algorithms. In linear cryptanalysis of stream ciphers with an FSM and an LFSR, the approach is usually to first linearly approximate the FSM to get a noise variable X of the linear approximation; the LFSR contribution in the linear approximation is then canceled out by combining several (say t) time instances such that only noise terms remain. Thus, the final noise is the t-folded noise of X, written t × X (i.e., the total noise is the sum of t independent noise variables that follow the same distribution as X), for which the bias is written ε(t × X). Usually, an attacker tries to maximize this value.
One important observation from Theorem 1 and Equation 4 is that if there is a peak (maximum) value |X̂_k| in the spectrum of X at some nonzero position k, then that peak value will be the dominating contributor to the bias ε(t × X), as it contributes |X̂_k|^{2t}, while other points in the spectrum of X have a much smaller (or even negligible) contribution to the total bias as t grows.
This important observation also affects the case when trying to align the spectrum points from several sub-noises with different distributions to achieve a large bias.We should actually try to move the peak spectrum values of each sub-noise such that they are aligned at some nonzero index k.Then the product of those peak values will result in a large total bias value.This motivates us to develop further algorithms to permute or linearly mask variables and align them at an expected or desired spectrum location k in the frequency domain.In the next sections we will give new findings and algorithms for WHT and DFT cases, which can be helpful in searching for a good linear approximation for common operations in the nonlinear part of a stream cipher, such as , ⊕, S-boxes, etc.
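The dominance of the peak point can be illustrated numerically: by Theorem 1 and the convolution formula, the t-folded bias is ε(t × X) = Σ_{k≠0} (X̂_k / X̂_0)^{2t}, and the share of the largest point approaches 1 as t grows (a small sketch for the WHT case; function names are ours):

```python
def fwht(a):
    """Fast Walsh-Hadamard transform (length a power of two)."""
    a = list(a)
    h = 1
    while h < len(a):
        for i in range(0, len(a), 2 * h):
            for j in range(i, i + h):
                a[j], a[j + h] = a[j] + a[j + h], a[j] - a[j + h]
        h *= 2
    return a

def folded_bias(X, t):
    """epsilon(t x X) = sum over k != 0 of (X_k / X_0)^(2t)."""
    s = fwht(X)
    return sum((v / float(s[0])) ** (2 * t) for v in s[1:])

def peak_share(X, t):
    """Fraction of the total folded bias contributed by the largest spectrum point."""
    s = fwht(X)
    r = [abs(v / float(s[0])) for v in s[1:]]
    return max(r) ** (2 * t) / folded_bias(X, t)
```

Even for a small example distribution, the peak share climbs quickly toward 1 with increasing t, which is why aligning peak points across sub-noises pays off.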

Algorithms for WHT type approximations
Consider the expression

X = M^(1) · X^(1) ⊕ M^(2) · X^(2) ⊕ · · · ⊕ M^(t) · X^(t),   (5)

where the distribution tables of the X^(i)'s are known and an attacker can freely select n × n full-rank Boolean matrices M^(i), i = 1, ..., t; we want an efficient method to search for the choices of the M^(i)'s that maximize the total bias ε(X). Below we first give a theorem and then an algorithm to achieve this.
Theorem 2. Given an n-bit variable X with its distribution and an n × n full-rank Boolean matrix M, we have

W(M · X)_k = W(X)_{k·M}.

Note that the left-side matrix multiplication M · X is switched to the right-side (row-vector) multiplication k · M.
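Theorem 2 is easy to verify numerically (a sketch with n = 3; the bit/matrix helpers are our own, using the convention that bit i of an integer is coordinate i of the Boolean vector):

```python
def fwht(a):
    """Fast Walsh-Hadamard transform (length a power of two)."""
    a = list(a)
    h = 1
    while h < len(a):
        for i in range(0, len(a), 2 * h):
            for j in range(i, i + h):
                a[j], a[j + h] = a[j] + a[j + h], a[j] - a[j + h]
        h *= 2
    return a

def mat_vec(M, x, n):
    """Column action M * x over GF(2)."""
    return sum((sum(M[i][j] & ((x >> j) & 1) for j in range(n)) % 2) << i
               for i in range(n))

def vec_mat(k, M, n):
    """Row action k * M over GF(2)."""
    return sum((sum(((k >> i) & 1) & M[i][j] for i in range(n)) % 2) << j
               for j in range(n))

n = 3
M = [[1, 1, 0], [0, 1, 0], [0, 0, 1]]  # full-rank over GF(2)
X = [3, 1, 4, 1, 5, 9, 2, 6]           # any distribution table
Y = [0] * (1 << n)
for j in range(1 << n):
    Y[mat_vec(M, j, n)] = X[j]         # distribution table of M*X
WX, WY = fwht(X), fwht(Y)
assert all(WY[k] == WX[vec_mat(k, M, n)] for k in range(1 << n))
```

The identity rests on k · (M·x) = (k·M) · x, so permuting the time domain by M permutes the spectrum by the transposed (row) action.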
We want to maximize ε(X) in Equation 5, and we know that if the spectrum values of the X^(i)'s are aligned after the linear masking by the M^(i)'s, we can achieve a large bias. By "aligned" we mean that the largest spectrum magnitudes of each X^(i) are at the same location, and that this also holds for the second largest, the third largest spectrum magnitudes, and so on. In practice, it is unlikely that such a perfect alignment can be achieved for all spectrum points. Instead, we can try to align n large spectrum magnitudes and thus get a decent bias. Algorithm 1 below achieves this based on Theorem 2.
Algorithm 1 Find M^(1), ..., M^(t) that maximize spectrum points of X at n indices K
Input: the distributions of the X^(i)'s (1 ≤ i ≤ t) and the index matrix K, which must be an n × n full-rank Boolean matrix in which each row K_{j,*} is the binary form of the j-th spectrum index where we want the best alignment to happen.
Output: the n × n full-rank Boolean matrices M^(1), M^(2), ..., M^(t)
1: procedure WhtMatrixAlign(K, X^(1), ..., X^(t))
2:   for q = 1, ..., t do
3:     compute the spectrum W(X^(q)) and sort the indices λ_1, λ_2, ... in decreasing order of |W(X^(q))_λ|
4:     construct an n × n full-rank Boolean matrix Λ in a greedy fashion, taking rows from the sorted list λ_1, λ_2, ... while keeping Λ full-rank
5:     require K · M^(q) = Λ, from which we derive M^(q) = K^{−1} · Λ

Intuitively, the main trick of Algorithm 1 happens in the final step, where K · M^(q) = Λ is enforced. For example, take the first row of K as the integer k, and the first row of Λ as the integer λ. The integer λ will eventually be the first value in the sorted list, λ = λ_1, for which |W(X^(q))_{λ_1}| → max. Following Theorem 2, the k-th spectrum point of M^(q) X^(q), expressed as W(M^(q) X^(q))_k = W(X^(q))_{k·M^(q)}, now actually holds the largest spectrum value W(X^(q))_{λ_1}, since k · M^(q) = λ = λ_1 by construction.
As a comment, in Algorithm 1 we do not really have to sort and find all N − 1 indices λ; most likely the inner loop will use just slightly more than n values of the first "best" λ's. Thus, it is enough to collect only the best c · n indices, for some small c = 2, 3, 4, out of which the full-rank matrix Λ can be constructed. We note that the algorithm does not necessarily give the best overall bias, but it guarantees that at least n points in the spectrum of X will have the largest possible peak values.
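The greedy construction of the full-rank matrix Λ from the sorted index list can be sketched with a standard GF(2) "xor basis" (our helper; rows are n-bit integers, best spectrum magnitudes first):

```python
def greedy_fullrank(sorted_indices, n):
    """Pick the first n indices from the sorted candidate list that are
    linearly independent over GF(2); they become the rows of Lambda."""
    basis, chosen = [], []
    for v in sorted_indices:
        cur = v
        for b in basis:
            cur = min(cur, cur ^ b)  # reduce against the current basis
        if cur:                      # v is independent of the rows chosen so far
            basis.append(cur)
            chosen.append(v)
        if len(chosen) == n:
            return chosen
    return None                      # not enough independent candidates
```

The masking matrix M^(q) is then recovered from K · M^(q) = Λ by Gaussian elimination over GF(2).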
Linear approximation of S-boxes. S-boxes, which can be regarded as vectorial Boolean functions, are widely used in both stream and block ciphers, serving as the main nonlinear component to disrupt linearity. Therefore, linear approximations of S-boxes are widely studied in cryptanalysis. For one-dimensional approximations of an S-box, i.e., a·x ⊕ b·S(x), where a, b ∈ GF(2^n) are linear masks, the common way is to construct a linear approximation table (LAT) by trying all possible values of a, b. The complexity is O(2^{2n}), which is affordable for small S-boxes, e.g., 4-bit or 8-bit ones, and the WHT is usually employed to speed up the process. For multiple (vectorized) linear approximations, i.e., A·x ⊕ B·S(x), where A, B are n × n full-rank binary masking matrices, testing every choice of A, B would be impossible, and the main task is rather to find A, B such that the linear approximation is highly biased. Some papers have investigated properties of multiple linear approximations, e.g., [HN12], [HCN19], but there is not much research on how a linear masking in the time domain affects the spectrum points in the frequency domain, and on how to find good linear maskings that achieve a highly biased approximation. Below we give some new results in these directions.
Let S(x) be an S-box mapping Z_N → Z_N, with x ∈ Z_N, N = 2^n. For the sake of notation in this section, an expression of the kind W(F(x)) means the WHT over the distribution table constructed from the function F(x): Z_N → Z_N by running through all values of x.
For an n-bit S-box S(x) and an n-bit integer k, let us introduce the k-th binary-valued (i.e., ±1/N) function associated with S(x) as

B_k{S(x)} = (1/N) · (−1)^{k·S(x)},

where k · S(x) is the scalar product of two binary vectors, i.e., k · S(x) = ⊕_{i=0}^{n−1} k_i · S(x)_i, and 1/N is a normalization factor. Such a combination (without the normalization factor 1/N) is called a component of the S-box, and for a well-chosen S-box every component should have good cryptographic properties. We can derive the following results.

Theorem 3. For a given S-box S(x) and a full-rank Boolean matrix Q, let X be the (non-normalized) noise distribution of the expression S(x) ⊕ Q·x. Then

W(X)_k = N · W(B_k{S(x)})_{k·Q}.

Proof. Let X be the non-normalized noise distribution of the expression S(x) ⊕ Q·x, where every X_j is the number of different values of x for which j = S(x) ⊕ Q·x. Then we have

W(X)_k = Σ_x (−1)^{k·(S(x) ⊕ Q·x)} = Σ_x (−1)^{k·S(x)} · (−1)^{(k·Q)·x},

from which the result follows, since the last term is exactly N · W(B_k{S(x)})_{k·Q}.

Theorem 3 can now be used to derive a matrix Q such that at least n points in the noise spectrum, where the noise variable is X = S(x) ⊕ Q·x, will have peak values, thus making the total bias ε(X) large. Basically, we first search for the > n best one-dimensional linear masks and then build a matrix Q that places these best peak values in the spectrum; see Algorithm 2 for details.
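Theorem 3 can be checked numerically with a toy 3-bit S-box (a sketch; the helper names are ours, and the 1/N factor inside B_k cancels against the factor N of the theorem, so the comparison is over integers):

```python
def fwht(a):
    """Fast Walsh-Hadamard transform (length a power of two)."""
    a = list(a)
    h = 1
    while h < len(a):
        for i in range(0, len(a), 2 * h):
            for j in range(i, i + h):
                a[j], a[j + h] = a[j] + a[j + h], a[j] - a[j + h]
        h *= 2
    return a

def parity(v):
    return bin(v).count("1") & 1

def mat_vec(M, x, n):
    """Column action M * x over GF(2)."""
    return sum((sum(M[i][j] & ((x >> j) & 1) for j in range(n)) % 2) << i
               for i in range(n))

def vec_mat(k, M, n):
    """Row action k * M over GF(2)."""
    return sum((sum(((k >> i) & 1) & M[i][j] for i in range(n)) % 2) << j
               for j in range(n))

def theorem3_holds(S, Q, n):
    """Check W(S(x) xor Q*x)_k == N * W(B_k{S})_{k*Q} for every k."""
    N = 1 << n
    X = [0] * N
    for x in range(N):
        X[S[x] ^ mat_vec(Q, x, n)] += 1  # noise distribution (occurrence counts)
    WX = fwht(X)
    return all(
        WX[k] == fwht([(-1) ** parity(k & S[x]) for x in range(N)])[vec_mat(k, Q, n)]
        for k in range(N))

S = [3, 6, 1, 4, 0, 7, 2, 5]           # a toy 3-bit S-box (a permutation)
Q = [[1, 0, 1], [0, 1, 1], [0, 0, 1]]  # full-rank over GF(2)
assert theorem3_holds(S, Q, 3)
```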
Algorithm 2 Build a matrix Q such that the spectrum of X = S(x) ⊕ Q·x contains peak values
Let Φ be a sorted list of at most c · n (for some small c ≈ 4) best triples (k, λ, ω), sorted by the magnitude of ω, where k is the index of the binary-valued function of the S-box, λ is the index of the spectrum point, and ω is the corresponding spectrum value. If the list is full and a new triple is to be added, the last (worst) entry is removed.
1: for every k and every spectrum point λ of W(B_k{S(x)}) do
2:   consider the triple (k, λ, ω = |W(B_k{S(x)})_λ|)
3:   if ω is larger than the ω of the worst triple in Φ, add (k, λ, ω) to Φ
4: from the list Φ, use the greedy approach to construct the n × n full-rank Boolean matrices K and Λ, similar to Algorithm 1: set the i-th row of K to the k value of the l-th entry of Φ, i.e., K_{i,*} = Φ(l).k, and the i-th row of Λ to the λ value of the same entry, i.e., Λ_{i,*} = Φ(l).λ, advancing l until rank(K) = i + 1 and rank(Λ) = i + 1
5: output Q = K^{−1} · Λ

In Algorithm 2, the parameter c should be chosen such that we do not need to generate the final rows of K and Λ randomly. Alternatively, one can modify the algorithm as follows: when a new triple is added to Φ, run the greedy algorithm and flag the records in Φ that are used to construct K and Λ; then, if the size of Φ reaches the limit, remove the first worst unflagged triple (starting from the end of Φ).
The algorithm is not guaranteed to reach the maximum possible overall bias, but it guarantees that at least the maximum possible peak value will be present in the noise spectrum, which allows us to get a fairly large bias in the end. The complexity is O(N^2 log N), but in practice there are usually other sub-noises that depend solely on k and λ, which can be used to select a subset of "promising" k and λ values for the actual probing of the total noise spectrum, as will be shown later for the ZUC-256 case.
Other useful formulae on spectral analysis of S-boxes can be derived in Corollary 1, based on Theorem 2 and Theorem 3.

Corollary 1. Let M, P, Q be n × n full-rank Boolean matrices, and let S(x) be a bijective S-box over n-bit integers. Then

W(P·S(Q·x) ⊕ M·x)_k = N · W(B_{k·P}{S(x)})_{(k·M)·Q^{−1}}.
Theorem 4 (Linear transformation of S-boxes). Consider the k-th binary-valued function at a spectrum point λ = k · M, for some full-rank Boolean matrix M, where the original S-box S(x) is linearly transformed with further full-rank Boolean matrices R and Q. We have

W(B_k{R·S(Q·x)})_λ = W(B_{k·R}{S(x)})_{λ·Q^{−1}}.

Suppose we want to find a set of the best m triples {(k, λ, ε)} sorted by the maximum bias ε. Assume we have a fast method to find the best m triples {(k', λ', ε)} for W(B_{k'}{S(x)})_{λ'} instead; then that set can be converted to {(k, λ, ε)} via k = k' · R^{−1} and λ = λ' · Q, and the result follows.
Theorem 5 (S-box as a disjoint combination). Consider an n-bit S-box constructed from t smaller n_1, n_2, ..., n_t-bit S-boxes S_1(x_1), S_2(x_2), ..., S_t(x_t), such that

S(x) = (S_1(x_1), S_2(x_2), ..., S_t(x_t)),

where the n-bit input integer x is split into t sub-blocks x = (x_1, x_2, ..., x_t) of n_1, n_2, ..., n_t bits, and similarly k = (k_1, ..., k_t) and λ = (λ_1, ..., λ_t). Then

W(B_k{S(x)})_λ = Π_{i=1}^{t} W(B_{k_i}{S_i(x_i)})_{λ_i}.

Proof. Since all the x_i's are independent of each other, the combined bias at any point λ is the product of the sub-biases at the corresponding λ_i's for each k_i-th binary-valued function of the corresponding S-box S_i(x_i), as shown by

W(B_k{S(x)})_λ = (1/N) · Σ_x (−1)^{k·S(x) ⊕ λ·x} = Π_{i=1}^{t} (1/N_i) · Σ_{x_i} (−1)^{k_i·S_i(x_i) ⊕ λ_i·x_i},

where N_i = 2^{n_i} and N = Π_i N_i.
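A numerical check of Theorem 5, for a 4-bit S-box built from two 2-bit ones (toy S-boxes of our choosing; S1 acts on the low 2 bits, S2 on the high 2 bits):

```python
def parity(v):
    return bin(v).count("1") & 1

def component(S, k, lam, n):
    """W(B_k{S})_lam = (1/N) * sum over x of (-1)^(k.S(x) xor lam.x)."""
    N = 1 << n
    return sum((-1) ** (parity(k & S[x]) ^ parity(lam & x)) for x in range(N)) / N

S1 = [2, 0, 3, 1]  # toy 2-bit S-boxes
S2 = [1, 3, 0, 2]
# Disjoint 4-bit combination: S(x) = (S1(x1), S2(x2)) with x = (x1, x2).
S = [S1[x & 3] | (S2[x >> 2] << 2) for x in range(16)]

for k in range(16):
    for lam in range(16):
        lhs = component(S, k, lam, 4)
        rhs = component(S1, k & 3, lam & 3, 2) * component(S2, k >> 2, lam >> 2, 2)
        assert abs(lhs - rhs) < 1e-12
```

Splitting k and lam into their low and high halves mirrors the splitting of x, exactly as in the theorem.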
Theorem 4 and Theorem 5 pave the way to computing the bias of any pair (k, λ) in Equation 11 efficiently, in time O(t), without even having to construct the large n-bit distribution of the S-box approximation (e.g., X = R·S(Q·x) ⊕ M·x), given that S(x) is constructed from smaller S-boxes, which is a common case in cipher designs. E.g., we can simply precompute the tables of triples {(k_i, λ_i, ε)} exhaustively for the smaller S-boxes, then apply the theorems to compute the bias of the large composite S-box for any pair (k, λ).
For example, let X = R·S(Q·x) ⊕ M·x be the noise variable resulting from an approximation of a large n-bit composite S-box R·S(Q·x), where R and Q are known n × n Boolean matrices and M·x is the approximation of that large S-box with a selectable (or given) n × n full-rank Boolean matrix M. Then, to get the value of some spectrum point k of X, we do the following: compute λ = k · M (Theorem 3), convert the indices as k' = k · R and λ' = λ · Q^{−1} (Theorem 4), and split them into n_1, ..., n_t-bit integers k' = (k'_1, ..., k'_t) and λ' = (λ'_1, ..., λ'_t) (Theorem 5). Then the desired spectrum value at the index k is derived as

W(X)_k = N · Π_{i=1}^{t} W(B_{k'_i}{S_i(x_i)})_{λ'_i}.

Alongside, this also leads to an efficient and fast algorithm to search for the best set of triples {(k, λ, ε)} in Equation 11, by "reverting" the procedure. These findings have a direct application in the upcoming cryptanalysis of ZUC-256.
General approach of spectral cryptanalysis using WHT.With the tools and methods developed in this subsection, we can now propose a general framework for finding the best approximation, based on probing spectral indices.

1. Derive the total noise expression based on basic approximations and S-box approximations. The noise expression may involve ⊕ operations and Boolean matrix multiplications, where some of the matrices can be selected by the attacker.
2. Derive the expression for the k-th spectrum point of the total noise, using the formulae that we found earlier.
3. Convert expressions such as k · M, where the matrix M is selectable, into a free parameter λ. If there are more selectable matrices, more λ's can be used.
4. Probe different tuples (k, λ, ...) to find the maximum peak value in the spectrum of the total noise. The search space for the k's and λ's may be shrunk using the spectrum values of the basic approximations.
5. Convert the best found tuple into the selected matrices, and compute the final multidimensional bias using the constructed matrices.

Algorithms for DFT type approximations
In this section we provide a few ideas on spectral analysis for DFT-type convolutions. Although these methods were not used in the presented attack on ZUC-256, they can be quite helpful in the linear cryptanalysis of other ciphers. Consider the expression

X = c_1 · X^(1) + c_2 · X^(2) + · · · + c_t · X^(t) mod 2^n,   (12)

where, again, N = 2^n, the attacker can choose the constants c_i, which must be odd, and the X^(i)'s are independent random variables. We propose an algorithm to find the combination of constants c_i such that the total noise X has the best peak spectrum value.
The theorem below helps to decide how to rearrange the spectrum points in the frequency domain, so as to achieve a larger total bias, by multiplication with a constant in the time domain, which is a form of linear masking.

Theorem 6. For a given distribution of X and an odd constant c we have

F(c · X mod 2^n)_k = F(X)_{c·k mod 2^n}.

Corollary 2. Any spectrum value at an index k = 2^m (1 + 2q), for some m = 0, ..., n − 1 and q = 0, ..., 2^{n−m−1} − 1, can only be relocated to another index k' of the form k' = 2^m (1 + 2q'), for some q' = 0, ..., 2^{n−m−1} − 1.
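Theorem 6 can be verified directly (a small sketch with a plain O(N^2) DFT, N = 8 and c = 3; variable names are ours):

```python
import cmath

def dft(a):
    """Plain O(N^2) DFT with kernel exp(-2*pi*i*j*k/N)."""
    N = len(a)
    return [sum(a[j] * cmath.exp(-2j * cmath.pi * j * k / N) for j in range(N))
            for k in range(N)]

N, c = 8, 3                # c must be odd so that x -> c*x mod N is a bijection
X = [3, 1, 4, 1, 5, 9, 2, 6]
Y = [0] * N
for x in range(N):
    Y[(c * x) % N] = X[x]  # distribution table of c*X mod N

FX, FY = dft(X), dft(Y)
assert all(abs(FY[k] - FX[(c * k) % N]) < 1e-9 for k in range(N))
```

In other words, multiplying in the time domain by an odd constant permutes the DFT spectrum by the same multiplication of the index, which is what Corollary 2 exploits.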
The results above can be used to solve the problem of finding the constants c i in Equation 12 such that the spectrum of X would contain the maximum possible peak value.
Algorithm 3 Find c_i's that maximize the peak spectrum point of X in Equation 12
Input: the distributions of X^(i), for i = 1, 2, ..., t.
Output: the coefficients c_i, for i = 1, 2, ..., t.

Linear cryptanalysis of ZUC-256
In this section, we perform linear cryptanalysis of ZUC-256. The basic idea of linear cryptanalysis is to approximate nonlinear operations by linear ones and then to find linear relationships between the generated keystream symbols, or between keystream symbols and the LFSR state words, resulting in a distinguishing attack and a correlation attack, respectively. In a distinguishing attack over GF(2) or an extension field of GF(2), the common way is to find LFSR states at several time instances (usually 3, 4 or 5) that XOR to zero, such that the LFSR contribution in the linear approximation is canceled out and only the noise terms remain, which are biased. This common approach, however, does not apply well to ZUC, since the LFSR in ZUC is defined over the prime field GF(2^31 − 1), which is different from the extension field GF(2^32) used in the function F.
In this section, we describe a more general approach in which the expression used to cancel out the LFSR contribution is directly included in the full noise expression, which effectively reduces the total noise, i.e., makes the final bias larger. This general approach may be used in the cryptanalysis of any other LFSR-based stream cipher.
Below we first give our linear approximation of the full ZUC-256, including the LFSR state cancellation process. Then we describe in detail how we employ the spectral tools given in Section 3 and a technique we call "bit-slicing" to efficiently compute the bias. Finally, we use the derived linear approximation to launch a distinguishing attack on ZUC-256.

Linear approximation
Any 31-bit LFSR word s_x^(t) at time instance t and cell index x can be expressed as s_0^(t+x), for 0 ≤ t and 0 ≤ x ≤ 15. Thus, in this section, we will omit the lower index and refer to an LFSR word by its time instance only, i.e., s^(t+x). We then try to find a four-tuple of time instances (t_1, t_2, t_3, t_4) such that s^(t_1) + s^(t_2) = s^(t_3) + s^(t_4) mod p (where p = 2^31 − 1).
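The indexing convention above is simply the shift-register property: whatever sits in cell x at time t sits in cell 0 at time t + x. A toy check in C, with a hypothetical 4-cell feedback over GF(2^5 − 1) chosen only for illustration, not ZUC-256's actual recurrence:

```c
enum { TOYP = 31, CELLS = 4, STEPS = 12 };

/* one shift of a toy 4-cell LFSR over GF(p), p = 2^5 - 1;
   the feedback coefficients are hypothetical */
static void lfsr_step(unsigned s[CELLS]) {
    unsigned f = (2u * s[CELLS - 1] + 9u * s[0]) % TOYP;
    for (int i = 0; i + 1 < CELLS; i++) s[i] = s[i + 1];
    s[CELLS - 1] = f;
}

/* trace[t][x] = s_x^(t) for t = 0 .. STEPS-1 */
static void lfsr_trace(unsigned trace[STEPS][CELLS]) {
    unsigned s[CELLS] = { 1, 7, 19, 30 };
    for (int t = 0; t < STEPS; t++) {
        for (int x = 0; x < CELLS; x++) trace[t][x] = s[x];
        lfsr_step(s);
    }
}
```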
At each time instance t_i, we define a 32-bit variable X^(t_i) which is the concatenation of the low and high 16-bit parts of s^(t_i+a) and s^(t_i+b), for some constants 0 ≤ a, b ≤ 15, a ≠ b, following the description of the BR layer in ZUC-256, i.e., X^(t_i) is one of the words extracted by the BR layer. Then one can derive the following relation for the X^(t_i)'s according to Equation 13, where ⊞_16 denotes 16-bit arithmetic addition, i.e., addition modulo 2^16, applied to the low and high 16-bit halves of the X^(t_i)'s in parallel. Here C^(t_1) is a 32-bit random carry variable arising from the approximation of the modulo-p addition; it can only take a few values, and the values in the low and high parts of C^(t_1) are independent. As an example, the approximation in Equation 14 can be instantiated with X^(t_i) = X1^(t_i).
Next, we would like to derive the distribution of the carries, and to achieve that, we first give a theorem (Theorem 7).
Proof. The proof is given in Appendix A.
Corollary 4. The distribution of C^(t_1) (note that here t_1 denotes the time instance) in Equation 14 is as follows:
Proof. Given the relation in Equation 14, we basically need to consider the two 16-bit cases independently (since a ≠ b). The distributions of E_L^(t_1+a) and E_H^(t_1+b) can be proved through Theorem 7 by setting n = 31, s = 0, t = 16 and n = 31, s = 15, t = 16, respectively. The probability values can be approximated as 2/3 and 1/6 with an error < 2^−63.
Next, we list the expressions for generating the keystream symbols at time instances t and t + 1, where ⊟ denotes arithmetic subtraction modulo 2^32 and (T1, T2) ≪ 16 denotes a cyclic shift by 16 bits to the left. Then we give the full approximation of ZUC-256, based on Equation 13 and its approximation in Equation 14, where σ is the swap of the high and low 16 bits of a 32-bit argument, and M is some 32 × 32 full-rank Boolean matrix that we can choose, serving as a linear masking matrix. The expressions for the noise term N1^(t_1) (which we further split as N1^(t_1) = N1a^(t_1) ⊕ N1b^(t_1)) and the noise term N2^(t_1) follow, the latter involving the term SL_1(T1^(t)) ⊕ SL_2(T2^(t)).
In our analysis we consider the noise variables N1^(t_1) and N2^(t_1) as independent. Under this assumption the attacker actually loses some advantage, since there is a dependency between, for example, T1^(t_1), T2^(t_1) in N1^(t_1) and SL_1(T1^(t_1)), SL_2(T2^(t_1)) in N2^(t_1). The attack could be stronger if we took these dependencies into account, since the noise distributions would then carry more information. However, it is practically hard to compute the bias in that scenario.
Next we want to compute the distribution and the bias of the noise terms. However, as one can note, there are many variables involved in each sub-noise expression. For example, the sub-noise N1a^(t_1) involves 17 32-bit variables and 3 C-carries. In order to compute the distribution of N1a^(t_1), a naive loop over all combinations of the involved variables would imply a complexity of O(9^3 · 2^(17·32)), which is computationally infeasible.
In the next subsections we recap the bit-slicing technique and show how we adapt it to our case to compute the distributions of the above noise terms.

Recap on the bit-slicing technique from [MJ05]
Let an n-bit noise variable N be expressed in terms of several n-bit uniformly distributed independent variables, using any combination of bitwise Boolean functions (AND, OR, XOR, etc.) and arithmetic addition and subtraction modulo 2^n. The distribution of such a noise expression, referred to as a pseudo-linear function in [MJ05], can be efficiently derived through the so-called "bit-slicing" technique in complexity O(k · 2^n + k^2 · n · 2^(n/2)), for some (usually small) k.
The general idea behind the technique is the following: if we know the set of distributions of the (n − 1)-bit truncated inputs for each possible outcome vector of the sub-carries of the corresponding arithmetical sub-expressions, then we can easily extend these distributions to the n-bit truncated distributions with a new vector of output sub-carries. The algorithm may thus be viewed as a Markov chain in which each node is a vector of probabilities, one for each combination of sub-carries, and transition matrices are used to go from the (n − 1)-th state to the n-th state.
Example. Let us explain the technique on a small example. Let n = 32 bits and let the noise N be expressed in terms of random 32-bit variables A, B, C as in Equation 17. For each n-bit value X with binary form (x_(n−1) . . . x_1 x_0), we will compute the number of combinations of A, B, C such that the value of N equals X.
Transition matrices. We construct two k × k (here 12 × 12) transition matrices, M_0 and M_1, associated with every bit position i, for the i-th bit x_i of X being either 0 or 1, such that the vector V_(i+1) is derived as V_(i+1) = M_(x_i) · V_i. That is, when the i-th bit x_i of X is 0 we apply M_0, otherwise M_1. The two matrices are constructed as follows: initialize M_0 and M_1 with zeroes; loop through all possible choices of the i-th bits of A, B, C ∈ {0, 1}^3 and all possible values of (c1_in, c2_in, c3_in); for each combination, compute the resulting bit r ∈ {0, 1} by evaluating the noise expression, together with the vector of output carries (c1_out, c2_out, c3_out); then increase the corresponding matrix cell by one: ++M_r[τ(c1_out, c2_out, c3_out)][τ(c1_in, c2_in, c3_in)]. Note that the inner output carries c1_out and c2_out must not be summed into the outer output carry c3_out; only the resulting 1-bit values of the inner sums go into the outer expression.
In Appendix B we give C code for computing the transition matrices M_0 and M_1 for the example noise expression given in Equation 17.
The general formulae can now be derived as follows:
Pr{N = (x_(n−1) . . . x_0)} = (1 / 2^(t·n)) · (1, 1, . . . , 1) · M_(x_(n−1)) · M_(x_(n−2)) · · · M_(x_0) · V_0,   (18)
where t is the number of involved random variables (in our example t = 3), 1/2^(t·n) is the normalization factor for the distribution, and V_0 is the initial vector of carry states. The all-ones row vector on the left appears because the last carries are truncated by the modulo-2^n operation, so all carry outcomes must be summed into the result.
Precomputed vectors. We intentionally split Equation 18 into two parts, since it shows that the computation of Pr{N = X} for all values of X ∈ {0, 1, . . . , 2^n − 1} can be accelerated by precomputing two tables of the middle sub-vectors in Equation 18, for all possible values of the high (H[(x_(n−1) . . . x_(n/2))]) and low (L[(x_(n/2−1) . . . x_0)]) halves of X,
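To make the recap concrete, here is a miniature bit-slicing implementation in C. It is a hedged sketch: instead of the paper's three-variable Equation 17 with its 12 carry states, we use the simpler single-carry noise N = ((A + B) mod 2^n) ⊕ A ⊕ B with n = 8, so the transition matrices are 2 × 2; the Markov-chain scan over the bits of X, and the final summation over the truncated carry, are exactly as described above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { NBITS = 8 };   /* toy width instead of n = 32 */

/* count[x] = number of pairs (A, B) with N == x, computed by bit-slicing */
static void bitslice_counts(uint64_t count[1 << NBITS]) {
    /* M[r][c_out][c_in]: number of bit choices (a_i, b_i) taking the
       incoming carry c_in to the outgoing carry c_out while producing
       result bit r of the noise */
    uint64_t M[2][2][2];
    memset(M, 0, sizeof M);
    for (int a = 0; a < 2; a++)
        for (int b = 0; b < 2; b++)
            for (int cin = 0; cin < 2; cin++) {
                int s = a + b + cin;
                int r = (s & 1) ^ a ^ b;   /* i-th bit of (A+B) ^ A ^ B */
                M[r][s >> 1][cin]++;
            }
    for (uint32_t x = 0; x < (1u << NBITS); x++) {
        uint64_t V[2] = { 1, 0 };          /* initial carry is 0 */
        for (int i = 0; i < NBITS; i++) {
            int r = (x >> i) & 1;
            uint64_t W[2];
            for (int co = 0; co < 2; co++)
                W[co] = M[r][co][0] * V[0] + M[r][co][1] * V[1];
            V[0] = W[0];
            V[1] = W[1];
        }
        /* the last carry is truncated by the mod 2^n, so sum both states */
        count[x] = V[0] + V[1];
    }
}

/* brute-force reference */
static uint64_t brute_count(uint32_t x) {
    uint64_t c = 0;
    for (uint32_t a = 0; a < (1u << NBITS); a++)
        for (uint32_t b = 0; b < (1u << NBITS); b++)
            if ((((a + b) ^ a ^ b) & ((1u << NBITS) - 1)) == x) c++;
    return c;
}
```

For n = 8 the result can be verified against a brute-force loop over all 2^16 pairs (A, B); at n = 32, only the bit-sliced version remains feasible.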
while s