Improved Fast Correlation Attack Using Multiple Linear Approximations and Its Application on SOSEMANUK

At CRYPTO 2018, Todo et al. proposed an effective fast correlation attack using multiple linear approximations, and gave effective attacks on the Grain-like stream ciphers with the same size of LFSR and key. However, many stream ciphers require the size of the LFSR to be at least twice the key size. For this type of stream cipher, we propose an improved fast correlation attack using multiple linear approximations. The main idea is to reduce the number of attacked bits of the parity-check equations by XORing the same linear approximation at different clocks, and then to further bypass some unknown variables of the parity-check equations by multiple linear approximations with an expected probability. Finally, all unknown variables are recovered by solving systems of linear equations. SOSEMANUK is one of the finalists in the eSTREAM project. The best absolute correlation of linear approximations of SOSEMANUK we found is $2^{-20.84}$, which improves on the linear approximations with the current best absolute correlation of $2^{-21.41}$. Finally, the improved fast correlation attack method is applied to SOSEMANUK, and a fast correlation attack with time/data/memory complexity of $O(2^{139.75})/O(2^{139.37})/O(2^{139.37})$ is given, with a success probability of 0.99. It improves on the current best fast correlation attack with time/data/memory complexity of $O(2^{147.88})/O(2^{145.5})/O(2^{147.1})$ (ASIACRYPT 2008). For the optional key size of SOSEMANUK ranging from 128 bits to 256 bits, our attack result shows that SOSEMANUK can only guarantee the security of a 139-bit key. In addition, our new fast correlation attack method can be applied to the linear analysis of other LFSR-based stream ciphers.

Communicated by T. the correct key to distinguish and recover these secret bits. This part is usually accelerated by the fast Walsh transform (FWT). However, when there are multiple highly biased linear approximations, this part of the information leakage is often ignored. At CRYPTO 2018, Todo et al. [10] proposed a new fast correlation attack method using multiple linear approximations to solve this problem, and then gave effective fast correlation attacks on Grain-like stream ciphers, including Grain-128a [11], Grain-128 [12] and Grain-v1 [13]. Grain-128a is standardized by ISO/IEC 29167-13 [14], and Grain-v1 is one of the seven finalists in the eSTREAM project [15]. Let $n$ and $\kappa$ be the size of the LFSR and the number of key bits in a stream cipher, respectively. Todo et al. pointed out that when we find $m$ linear approximations with high correlations under the same keystream output masks, $m$ related solutions can be observed with high correlations in the parity-check equations due to the property of the LFSR. If only the $(n-h)$-bit unknown variable out of the $n$-bit LFSR state variable is exhausted and the remaining $h$-bit variable is fixed to a constant in the online phase, then the $h$-bit variable can be bypassed with probability $2^{-h}$ and $m2^{-h}$ related solutions can be observed on average. Therefore, $m2^{-h} \ge 1$ is a necessary condition. The $m2^{-h}$ solutions can be represented as $s_i = S_0 \times M_{\gamma_i}$, where $S_0$ is the $n$-bit LFSR initial state and $M_{\gamma_i}$ is an $n \times n$ binary matrix. When the corresponding matrix $M_{\gamma_i}$ for the solution $s_i = S_0 \times M_{\gamma_i}$ is obtained after using the Poisson distribution, $S_0$ can be recovered by computing $s_i \times M_{\gamma_i}^{-1}$. The time complexity of the online phase is $(n-h)2^{n-h}$ when accelerated by FWT. When $(n-h)2^{n-h} \le 2^{\kappa}$, an effective attack result can be given. Therefore, it is required that the size $n$ of the LFSR generally does not exceed the number $\kappa$ of key bits. Then the multiple linear approximations can be exploited to reduce the number of attacked bits.
However, general stream ciphers usually have $n \ge 2\kappa$, such as the SNOW family of stream ciphers (including SNOW 2.0 [16], SNOW 3G [17], SNOW-V [18] and SNOW-Vi [19]), SOSEMANUK [20], K2 [21] and so on. Todo et al. did not give the corresponding results for this case. On the one hand, if the method of Todo et al. is used directly, then the attack is not valid because the time complexity of the online phase is $(n-h)2^{n-h} \gg 2^{\kappa}$. On the other hand, the number of attacked bits of the parity-check equations can be reduced by XORing the same linear approximation at different clocks [3], [4], [5] (so that they contain only a $B$-bit unknown variable), and $B_2$ bits of the unknown variables of the parity-check equations can then be bypassed by $m$ linear approximations with an expected probability. Then in the online phase, $m2^{-B_2}$ $B_1$-bit solutions $s_i$ can be observed with high correlations, where $B = B_1 + B_2$ and $B < n$. The solutions can be represented as $s_i = (S_0 \times M_{\gamma_i})^{L}_{B_1}$, where $x^{L}_{B_1}$ denotes the value of a vector $x$ on the least significant $B_1$ bits. Note that we cannot compute the vector multiplication between a $B_1$-bit vector $s_i$ and an $n \times n$ binary matrix $M_{\gamma_i}^{-1}$. In this case, even if $M_{\gamma_i}^{-1}$ is obtained, the method of removing $M_{\gamma_i}$ proposed by Todo et al. cannot be used to recover the $n$-bit LFSR initial state $S_0$. In the case of $n \ge 2\kappa$, how to use multiple linear approximations to improve the efficiency of the fast correlation attack is still an unsolved problem.
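The feasibility condition discussed above can be checked numerically. The following is a minimal sketch, assuming the largest admissible $h$ is $\lfloor \log_2 m \rfloor$ (from $m2^{-h} \ge 1$); the function names and the sample values of $m$ are hypothetical and serve only as an illustration, while $n = 320$ is the size of an LFSR with ten 32-bit words.

```python
import math

def max_bypassed_bits(m: int) -> int:
    """Largest h with m * 2^(-h) >= 1, i.e. floor(log2(m))."""
    return m.bit_length() - 1

def online_cost_log2(n: int, h: int) -> float:
    """log2 of (n - h) * 2^(n - h), the FWT-accelerated online cost."""
    return math.log2(n - h) + (n - h)

def is_effective(n: int, kappa: int, m: int) -> bool:
    """True if the online cost with the largest admissible h is at most 2^kappa."""
    h = max_bypassed_bits(m)
    return online_cost_log2(n, h) <= kappa

# Hypothetical parameters, for illustration only.
print(is_effective(n=80, kappa=80, m=2**10))    # LFSR size equal to key size
print(is_effective(n=320, kappa=128, m=2**20))  # LFSR size at least twice the key size
```

With these sample values the first case satisfies the condition and the second does not, which mirrors the observation that bypassing bits alone is insufficient when $n \ge 2\kappa$.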
SOSEMANUK [20] is one of the seven finalists in the eSTREAM project [15]. It is a software-oriented stream cipher designed by Berbain et al. It adopts design principles similar to the stream cipher SNOW 2.0 [16] and the block cipher Serpent [22]. SOSEMANUK aims at improving SNOW 2.0 in both security and performance. The overall structure is based on the combination of a word-oriented LFSR and a finite state machine (FSM). The introduction of multiplication modulo $2^{32}$ in the FSM is also a feature of SOSEMANUK. In addition, SOSEMANUK has a variable key size, ranging from 128 to 256 bits.
Over nearly the past 20 years, SOSEMANUK has undergone a great deal of cryptanalysis. However, none of it breaks the claimed 128-bit key security of the cipher. The attack methods against SOSEMANUK mainly include fast correlation attacks [23], [24] and guess-and-determine attacks [25], [26]. Regarding fast correlation attacks, at ASIACRYPT 2008, the best correlation of linear approximations of SOSEMANUK found in [23] was $2^{-21.41}$. A fast correlation attack with time/data/memory complexity of $O(2^{147.88})/O(2^{145.5})/O(2^{147.1})$ was then given, with a success probability of 0.99. Subsequently, [24] found 896 linear approximations with an equal correlation of $2^{-21.41}$. Under the assumption that the linear approximations are independent of each other, a fast correlation attack with low data complexity was given, in which the data complexity is reduced by a factor of $2^{10}$. However, according to the fast correlation attack method proposed by Todo et al., these linear approximations are in fact not independent of each other. Therefore, we consider the attack given at ASIACRYPT 2008 to be the best attack result at present. For guess-and-determine attacks, at ASIACRYPT 2010, [26] gave the best guess-and-determine attack result, with a time complexity of $2^{176}$ and a data complexity of 20. Whether there is a linear approximation of SOSEMANUK with higher correlation, and how to use multiple linear approximations with high correlations to perform a fast correlation attack on SOSEMANUK, are two unsolved problems.

B. Our Contributions
In this paper, we propose an improved fast correlation attack method using multiple linear approximations, and give an improved fast correlation attack on SOSEMANUK.
1) An improved fast correlation attack method using multiple linear approximations is proposed. $m$ linear approximations with high correlations will cause $m$ related solutions to pass the filter in the online phase. The linear relationship between the correct initial state and the other $m-1$ related solutions can be characterized by some matrices, which are called the derived matrices in this paper. For a given linear approximation, in the preprocessing phase, parity-check equations containing only a $B$-bit unknown variable are constructed by XORing the same linear approximation at different clocks. In the online phase, only the $(B-B_2)$-bit unknown variable out of the $B$-bit unknown variable in the parity-check equations is exhausted, and the remaining $B_2$-bit variable is fixed to a constant value (e.g., 0). Then, m2  approximations with the same keystream output masks. Our method solves the problem of how to further reduce the number of attacked bits of the parity-check equations by using multiple linear approximations with the same keystream output masks after exploiting the collision reduction. This extends the method proposed by Todo et al. at CRYPTO 2018, which relies only on the multiple linear approximations to reduce the number of attacked bits. Thirdly, in terms of the success probability of the attack, Zhang et al. calculate the success probability of recovering the LFSR initial state by the unique distance. The success probability is not explicitly given, but they point out that it is not less than 0.5. In our paper, we give the constraint relationship between the success probability and the attack complexity. We can then give the fast correlation attack on SOSEMANUK with a success probability of 0.99, which is consistent with the previous work at ASIACRYPT 2008 [23]. Finally, the linear approximations of SOSEMANUK used in the attack are different. Zhang et al. exploit an existing single linear approximation with correlation of $2^{-21.41}$ proposed at ASIACRYPT 2008. In our paper, we find new linear approximations with a higher correlation of $2^{-20.84}$ through some new techniques, which improves on the linear approximations with the current best absolute correlation of $2^{-21.41}$.

D. Paper Organization
In Sect. II, we give some notations and definitions, together with the SOSEMANUK stream cipher and the general fast correlation attack. In Sect. III, we give the improved fast correlation attack using multiple linear approximations. In Sect. IV, the divide-and-conquer strategy is used to search for linear approximations of SOSEMANUK with high correlations, and the improved fast correlation attack on SOSEMANUK is given by using the multiple linear approximations. In Sect. V, we conclude this paper.

A. Notations and Definitions
• $GF(2)$ denotes the binary field and $GF(2^n)$ denotes the $n$-dimensional vector space over $GF(2)$.
• $\oplus$ denotes the bitwise XOR operation.
• $\vee$ denotes the bitwise OR operation.
• $x \lll k$ denotes bitwise left rotation by $k$ bits for a vector $x \in GF(2^{32})$.
) and the output mask Γ 0 ∈ $GF(2^{32})$.
• Let $x^{H}_{B}$ and $x^{L}_{B}$ be the value of a vector $x \in GF(2^n)$ on the most and least significant $B$ bits, respectively.
• $wt(x)$ denotes the Hamming weight of a vector $x \in GF(2^n)$.

B. Description of SOSEMANUK
SOSEMANUK is a word-oriented stream cipher. It consists of three parts: the LFSR component, the FSM component and the nonlinear function Serpent1, whose structure is shown in Fig. 1.
The LFSR has ten 32-bit word registers $(s_{t+9}, s_{t+8}, \cdots, s_t)$. The sequence of the LFSR has maximal period $2^{320} - 1$. The recurrence relation of the LFSR is as follows where $\alpha \in GF(2^{32})$ is a root of the following primitive polynomial: and $\beta \in GF(2^{8})$ is a root of the following primitive polynomial:
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
The FSM has two 32-bit word registers $R1_t$, $R2_t$, and the registers of the FSM are updated as follows: where $r_t$ is the least significant bit of $R1_t$ and $M$ = 0x54655307. Then the output of the FSM is
SOSEMANUK outputs four 32-bit keystream words for every four consecutive clocks, that is, where the nonlinear function Serpent1 takes four consecutive outputs $(f_{t+3}, f_{t+2}, f_{t+1}, f_t)$ of the FSM as input, and outputs four 32-bit words. The input and output of Serpent1 are in the same bit slice mode. The input $x_i$ of the $i$-th S-box is composed in the following bit slice mode: After the $4 \times 4$-bit S-box S2 of the Serpent block cipher is applied 32 times in parallel, the S-boxes output $S(x_i)$, $0 \le i \le 31$. Then Serpent1 outputs Serpent1$(f_{t+3}, f_{t+2}, f_{t+1}, f_t)$ in the same bit slice mode for $S(x_i)$, $0 \le i \le 31$.
SOSEMANUK is initialized with a secret key of variable length ranging from 128 bits to 256 bits and a 128-bit initial vector (IV). For more details, please refer to [20].
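To make the FSM description more concrete, the following is a minimal sketch of one FSM step. It follows the update rule as stated in the SOSEMANUK specification [20] rather than this paper's own displayed equations, so the exact time indexing is an assumption; the function names `rotl32`, `trans` and `fsm_step` are hypothetical.

```python
MASK32 = 0xFFFFFFFF
M = 0x54655307  # multiplication constant of the FSM

def rotl32(x: int, k: int) -> int:
    """Rotate a 32-bit word left by k bits."""
    return ((x << k) | (x >> (32 - k))) & MASK32

def trans(x: int) -> int:
    """Trans(x) = (M * x mod 2^32) rotated left by 7 bits (per the specification)."""
    return rotl32((M * x) & MASK32, 7)

def fsm_step(r1: int, r2: int, s1: int, s8: int, s9: int):
    """One FSM update (indexing assumed from the specification).

    r1, r2 are the registers before the step; s1, s8, s9 stand for the LFSR
    words s_{t+1}, s_{t+8}, s_{t+9}. Returns (new R1, new R2, FSM output f).
    """
    # The least significant bit of R1 selects between s_{t+1} and s_{t+1} XOR s_{t+8}.
    chosen = (s1 ^ s8) if (r1 & 1) else s1
    new_r1 = (r2 + chosen) & MASK32
    new_r2 = trans(r1)
    f = ((s9 + new_r1) & MASK32) ^ new_r2
    return new_r1, new_r2, f
```

The two nonlinear operations visible here, addition modulo $2^{32}$ and the multiplication inside `trans`, are the ones whose linear approximations are analyzed in Sect. IV.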

C. The General Description of Fast Correlation Attack
The fast correlation attack is a well-known cryptanalysis tool for LFSR-based stream ciphers. Its basis is to establish a linear approximation equation with high correlation that only involves LFSR state variables and keystream variables. The process of a fast correlation attack is divided into a preprocessing phase and an online phase [1], [2], [3], [4], [5], [6], [7], [8]. Let $\rho$ be the correlation of a linear approximation $z_t = \alpha \cdot S_t$, where $S_t$ is the row vector of the $n$-bit internal state of the LFSR at clock $t$ and $\alpha$ is an $n$-bit mask. Let $A$ be the $n \times n$ binary matrix representing the transformation of internal states of the LFSR, whose characteristic polynomial is a primitive polynomial of degree $n$. Then the state of the LFSR is updated by $S_{t+1}^T = A S_t^T$. The linear approximation $z_t = \alpha \cdot S_t$ can be expressed as a linear approximation of the initial state $S_0$ of the LFSR, that is, $z_t = \alpha A^t \cdot S_0$. In the preprocessing phase, we construct $D$ parity-check equations containing only a $B$-bit unknown variable by XORing $v$ clocks of the linear approximation, that is,
$$\bigoplus_{j=1}^{v} z_{t_{i,j}} = \Big(\bigoplus_{j=1}^{v} \alpha A^{t_{i,j}}\Big) \cdot S_0, \quad 1 \le i \le D,$$
where $B < n$ and $(\bigoplus_{j=1}^{v} \alpha A^{t_{i,j}})^{H}_{n-B} = 0$. Then the parity-check equations only involve the $B$-bit unknown variable of the LFSR initial state $S_0$, and the correlation becomes $\rho^v$. In the online phase, the FWT is used to accelerate the evaluation of the parity-check equations and recover the $B$-bit initial state. After recovering the $B$-bit initial state of the LFSR, the remaining bits of the internal state can be recovered by the same process. The attack complexity of this part is much lower than that of the first attack process. After obtaining the full initial state of the LFSR, the remaining states can be exhausted with a much lower complexity if the number of remaining states is small.
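The online phase described above can be illustrated on a toy scale. The sketch below is an assumption-laden illustration, not the attack of this paper: each parity check is modeled as a pair (mask, bit) where the bit equals the inner product of the mask with a hypothetical $B$-bit secret, flipped with some noise probability. All names (`fwt`, `recover_bits`, `make_checks`) and parameters are hypothetical.

```python
import random

def parity(x: int) -> int:
    """Parity of the bits of x, i.e. an inner product over GF(2) once masked."""
    return bin(x).count("1") & 1

def fwt(values):
    """In-place fast Walsh-Hadamard transform; len(values) must be a power of two."""
    n = len(values)
    step = 1
    while step < n:
        for start in range(0, n, 2 * step):
            for i in range(start, start + step):
                u, v = values[i], values[i + step]
                values[i], values[i + step] = u + v, u - v
        step *= 2
    return values

def recover_bits(checks, B):
    """Return the B-bit guess that agrees with the most parity checks.

    The table accumulates (-1)^bit per mask; after the transform, entry x
    equals the sum over all checks of (-1)^(bit XOR mask.x).
    """
    table = [0] * (1 << B)
    for mask, bit in checks:
        table[mask] += 1 - 2 * bit
    fwt(table)
    return max(range(1 << B), key=lambda x: table[x])

def make_checks(secret, B, D, noise, rng):
    """Generate D toy parity checks on a B-bit secret with the given noise rate."""
    checks = []
    for _ in range(D):
        mask = rng.randrange(1, 1 << B)
        bit = parity(mask & secret) ^ (rng.random() < noise)
        checks.append((mask, bit))
    return checks

rng = random.Random(2024)
toy_secret = 0b10110010
print(recover_bits(make_checks(toy_secret, 8, 4000, 0.3, rng), 8) == toy_secret)
```

Evaluating all $2^B$ guesses directly would cost about $D \cdot 2^B$ operations, whereas the transform costs about $B \cdot 2^B$, which is where the $B2^B$-type terms in the complexity estimates come from.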

III. IMPROVED FAST CORRELATION ATTACK USING MULTIPLE LINEAR APPROXIMATIONS
Todo et al. [10] found that when the correlations of $m$ linear approximations with the same keystream output masks are high and the others have correlation zero, the recovery of the initial state can be transformed into the problem of recovering one of the $m$ related solutions under a single linear approximation, due to the property of the LFSR. The $m$ linear approximations can then be exploited to reduce the number of attacked bits in the online phase with an expected probability. For Grain-like stream ciphers, in which the key has the same size as the LFSR, this method can significantly improve the attack efficiency. On the other hand, when the size of the LFSR is large, the traditional fast correlation attack constructs parity-check equations in the preprocessing phase to reduce the number of unknown variables by XORing the same linear approximation at different clocks. This reduces the number of attacked bits in the online phase.
Based on the above two methods to reduce the number of attacked bits, this section proposes an improved fast correlation attack using multiple linear approximations for stream ciphers with the LFSR size larger than the number of key bits.This method not only constructs parity-check equations to reduce the number of unknown variables by XORing the same linear approximation at different clocks in the preprocessing phase, but also uses multiple linear approximations to further reduce the number of attacked bits with an expected probability in the online phase.Finally, we use the method of solving systems of linear equations to recover the full initial state.Therefore, the new method improves the attack efficiency.

A. Description of the Problem
Let $\{z_t\}_{t \ge 1}$ be a known binary keystream sequence and $\{S_t\}_{t \ge 1}$ be an $n$-bit state sequence of an LFSR. Let $F : GF(2^n) \to GF(2)$ be a nonlinear function and $z_t = F(S_t)$. Let the correlation of the linear approximation $z_t = \alpha \cdot S_t$ be $\rho$, where $\alpha$ is an $n$-bit mask. Let $A$ be the $n \times n$ binary matrix representing the transformation of internal states of the LFSR, whose characteristic polynomial is a primitive polynomial of degree $n$. Then the state of the LFSR is updated by $S_{t+1}^T = A S_t^T$, where $S_t$ is the row vector of the $n$-bit internal state of the LFSR at clock $t$. The linear approximation $z_t = \alpha \cdot S_t$ can be expressed as a linear approximation of the initial state $S_0$ of the LFSR, i.e., we have the linear approximation $z_t = \alpha A^t \cdot S_0$ with correlation $\rho$.
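The identity $\alpha \cdot S_t = \alpha A^t \cdot S_0$ can be checked on a toy LFSR. The sketch below assumes a 4-bit LFSR with recurrence $s_{t+4} = s_{t+1} \oplus s_t$ (characteristic polynomial $x^4 + x + 1$), which is chosen only for illustration and is unrelated to SOSEMANUK; vectors and matrix rows are packed into Python integers, and all names are hypothetical.

```python
def parity(x: int) -> int:
    return bin(x).count("1") & 1

def mat_vec(A, s):
    """Column-vector product A s^T over GF(2); bit i of the result is row i dotted with s."""
    return sum(parity(row & s) << i for i, row in enumerate(A))

def vec_mat(a, A):
    """Row-vector product a A over GF(2): XOR of the rows of A selected by the bits of a."""
    out = 0
    for i, row in enumerate(A):
        if (a >> i) & 1:
            out ^= row
    return out

# Toy state S_t = (s_t, s_{t+1}, s_{t+2}, s_{t+3}), bit j of the integer holds s_{t+j}.
# Rows give the next state: (s_{t+1}, s_{t+2}, s_{t+3}, s_t XOR s_{t+1}).
A = [0b0010, 0b0100, 0b1000, 0b0011]

def state_at(S0, t):
    """S_t obtained by applying S_{t+1}^T = A S_t^T t times."""
    s = S0
    for _ in range(t):
        s = mat_vec(A, s)
    return s

def mask_at(alpha, t):
    """The mask alpha A^t, obtained by t right-multiplications by A."""
    a = alpha
    for _ in range(t):
        a = vec_mat(a, A)
    return a

alpha, S0, t = 0b1001, 0b0110, 7
print(parity(alpha & state_at(S0, t)) == parity(mask_at(alpha, t) & S0))
```

The point of rewriting every approximation in terms of $S_0$ is that all clocks then share the same unknown, so approximations from different clocks can be combined into parity checks.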
Definition 1: Let $\alpha_i$, $1 \le i \le m$ be $m$ row vectors of $n$-bit masks and correlations $\rho_i$ satisfy
Question 1: Let the initial state $S_0$ be an $n$-bit unknown variable. How can the initial state $S_0$ be recovered by using the multiple linear approximations defined in Definition 1 and the known keystream sequence $\{z_t\}_{t \ge 1}$?

B. New Fast Correlation Attack
We first use another set of mathematical theories to introduce the method of reducing the number of attacked bits based on multiple linear approximations proposed by Todo et al. We can easily prove the following Lemma 1 ∼ Lemma 3.
Lemma 1: Let $A$ be an $n \times n$ binary matrix and its characteristic polynomial be a primitive polynomial. Then the set $\{A^i : 0 \le i \le 2^n - 2\} \cup \{0\}$ forms a finite field under the multiplication and addition of matrices over the binary field, and $A$ is a primitive element in the finite field.
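Lemma 1 can be checked exhaustively for a small case. The following is a minimal sketch assuming $n = 4$ and the transition matrix of the toy recurrence $s_{t+4} = s_{t+1} \oplus s_t$ (characteristic polynomial $x^4 + x + 1$); the matrix and all names are hypothetical illustrations. It also checks that the orbit $\{\alpha A^k\}$ of a nonzero mask covers all nonzero vectors, which is the existence statement of Lemma 2.

```python
def vec_mat(a, B):
    """Row vector a times matrix B over GF(2); rows of B are bit-packed integers."""
    out = 0
    for i, row in enumerate(B):
        if (a >> i) & 1:
            out ^= row
    return out

def mat_mul(X, Y):
    """Matrix product X Y over GF(2): row i of the product is (row i of X) times Y."""
    return tuple(vec_mat(row, Y) for row in X)

def mat_add(X, Y):
    """Matrix sum over GF(2)."""
    return tuple(x ^ y for x, y in zip(X, Y))

n = 4
A = (0b0010, 0b0100, 0b1000, 0b0011)      # toy transition matrix
I = tuple(1 << i for i in range(n))       # identity matrix
ZERO = (0,) * n

powers = [I]                               # A^0, A^1, ..., A^(2^n - 2)
for _ in range(2**n - 2):
    powers.append(mat_mul(powers[-1], A))

field = set(powers) | {ZERO}
closed = all(mat_add(X, Y) in field for X in field for Y in field)

alpha = 0b0001
orbit = {vec_mat(alpha, P) for P in powers}   # all alpha A^k

print(len(field), closed, len(orbit))
```

Closure under multiplication is immediate for powers of $A$, so closure under addition is the only property that needs an explicit check here.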
Lemma 2: Let $\alpha, \beta \in GF(2^n)$ be two nonzero row vectors, then the following two conclusions hold: 1) There exists a nonnegative integer $k$ such that $\beta = \alpha A^k$; 2) If $\beta = \alpha A^k$, then the matrix $A^k$ can be computed by
Lemma 3: Let $\alpha, \beta \in GF(2^n)$ be two nonzero row vectors. Let binary variables $x_t$, $y_t$ be $x_t = \alpha \cdot S_t$ and $y_t = \beta \cdot S_t$, respectively. If $\beta = \alpha A^k$ and $S_t^T = A S_{t-1}^T$, then we have For the mask set In addition, the following Theorem 1 can be easily proved by Lemma 3.
Theorem 1: is called the derived matrix set of $S_0$, and the set $\{S_0 k : k \in K\}$ is called the related solution set of $S_0$.
Remark 1: Theorem 1 shows that $m$ related solutions of $S_0$ will be observed in the parity-check equations according to $m$ linear approximations with high correlations. Suppose that $\rho_1 = \max\{|\rho_i^v| : 1 \le i \le m\}$. If the probability that the correct initial state $S_0$ becomes a candidate is greater than $p$ under one specific attack and some given samples, then the probabilities that the $m$ related solutions of $S_0$ become candidates are all greater than $p$. Let the key size of a stream cipher be $\kappa$.
Assuming that the most significant $h$ bits of the $m$ related solutions are uniformly distributed, there will be $m2^{-h}$ related solutions whose most significant $h$ bits are 0 as candidates on average. We then construct the statistic that only contains the least significant $(n-h)$ bits of the initial state $S_0$. The time complexity of the online phase is $(n-h)2^{n-h}$ when accelerated by FWT. When the corresponding matrix $A^{k_i}$ for one related solution $S_0 (A^{k_i})^T$ is obtained after using the Poisson distribution, we can recover the full initial state $S_0$ by multiplying the related solution by $((A^{k_i})^T)^{-1}$. When $(n-h)2^{n-h} \le 2^{\kappa}$, an effective attack result can be given, which is the principle of reducing the number of attacked bits proposed by Todo et al. Therefore, it is required that the size $n$ of the LFSR generally does not exceed the number $\kappa$ of key bits. For general stream ciphers, there is usually $n \ge 2\kappa$. In this case, their method leads to a huge time complexity for the online phase, that is, $(n-h)2^{n-h} \gg 2^{\kappa}$. Todo et al. did not give the corresponding results for this case.
In order to solve this problem, this paper adopts the case of v > 1 given in Theorem 1 and uses multiple linear approximations to reduce the attack complexity.
The Main Idea of Our New Attack: Our method reduces the attack complexity in two parts: construct the parity-check equations containing only a $B$-bit unknown variable according to the following Theorem 2, and further reduce the number of attacked bits by multiple linear approximations according to the following Theorem 3, i.e., bypass a $B_2$-bit unknown variable with probability $2^{-B_2}$. The parity-check equations then contain only a $B_1 = (B - B_2)$-bit unknown variable. Different from the method proposed by Todo et al., we can only obtain the $B_1$-bit related solution $x$ of the unknown variable $S_0$ by the FWT, that is, $x = (S_0 (A^{k_i})^T)^{L}_{B_1}$. Our method requires establishing multiple matches of the candidate value $x = (S_0 (A^{k_i})^T)^{L}_{B_1}$ with the derived matrix $(A^{k_j})^T$ to recover the full initial state $S_0$ by solving systems of linear equations. The specific attack process is given below.
Let $v > 1$ and the data complexity of the attack be $N$. Construct the following $D$ parity-check equations by XORing $v$ clocks of the linear approximation $z_t = \alpha_1 A^t \cdot S_0$, $0 \le t \le N - 1$, such that the number of unknown bits in the parity-check equations is only $B$, $0 < B < n$.
where and the parity-check equations Eq. (1) can be written as The parity-check equations Eq. (2) contain only a $B$-bit unknown variable and the correlation of the parity-check equations is $\rho_1^v$. According to Theorem 1, we have the following Theorem 2, that is, when
Theorem 2: The $m$ $B$-bit where the parity-check equations $e_i(\alpha_1, x)$ satisfy For the remaining wrong solutions $x \notin \{(S_0 k)^{L}_{B} : k \in K\}$, the corresponding correlation is 0.
The $B$-bit unknown variable of the parity-check equation Eq. (3) is divided into two parts: the $B_1$-bit unknown variable that needs to be exhausted in the attack, and the $B_2$-bit unknown variable that can be bypassed with probability $2^{-B_2}$, where $B = B_1 + B_2$. Let the positive integer $d = m2^{-B_2}$. For the $B$-bit related solution set $\{(S_0 k)^{L}_{B} : k \in K\}$ shown in Theorem 2, if we assume that the most significant $B_2$ bits of the $m$ $B$-bit related solutions are uniformly distributed, then on average $d = m2^{-B_2}$ of the $m$ $B$-bit related solutions will have their most significant $B_2$ bits equal to 0 and appear as candidates.
In order to simplify the expression, let $k_1 = 0$, then we have $E = (A^{k_1})^T$. Therefore, the derived matrix set At this time, the parity-check equations Eq. (3) only involve the $B_1 = (B - B_2)$-bit unknown variable, and Theorem 2 can be rewritten as Theorem 3.
Theorem 3: where the parity-check equations $e_i(\alpha_1, x)$ satisfy $(\bigoplus_{j=1}^{v} \alpha_1 A^{t_{i,j}})^{H}_{n-B} = 0$. And for the wrong value
Lemma 4 [28]: Let $D$ known parity-check equations contain only a $B_1$-bit unknown variable, and the correlation of the parity-check equations be $\rho$. Assume that the linear approximation's probability of holding is independent for each guessed value, and that its probability is equal to $1/2$ for all wrong guessed values. If the correct guessed value is ranked among the top $r$ out of all $2^{B_1}$ candidates in the attack and the success probability of the attack is $p$, then we have where $\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\,dt$ is the distribution function of the standard normal distribution.
Let the probability that each of the $d$ $B_1$-bit related solutions ranks among the top $r$ out of all $2^{B_1}$ candidates be $p$. Let the positive integer $q$ satisfy $q \le \min\{r, d\}$. Let the probability that at least $q$ values out of the $d$ $B_1$-bit related solutions rank among the top $r$ out of all $2^{B_1}$ candidates be $p_{suc}$, then we have Assuming that the accuracy is 0.000001, the approximate solution $p \in [0, 1]$ satisfying $|f(p) - p_{suc}| < 0.000001$ can be calculated by using the Bisection Method. Algorithm 1 gives the framework of the new fast correlation attack. We explain the framework as follows.
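The bisection step mentioned above can be sketched as follows. This is a minimal illustration under a stated assumption: $f(p)$ is taken to be the binomial tail $\sum_{i=q}^{d} \binom{d}{i} p^i (1-p)^{d-i}$, i.e., the $d$ related solutions are treated as ranking among the top $r$ independently with probability $p$. The function names and the sample values of $d$ and $q$ are hypothetical.

```python
from math import comb

def at_least_q(p: float, d: int, q: int) -> float:
    """Probability that at least q of d independent events of probability p occur."""
    return sum(comb(d, i) * p**i * (1 - p) ** (d - i) for i in range(q, d + 1))

def solve_p(p_suc: float, d: int, q: int, tol: float = 1e-6) -> float:
    """Bisection on [0, 1] for p with |f(p) - p_suc| < tol; f is increasing in p."""
    lo, hi = 0.0, 1.0
    mid = 0.5
    for _ in range(200):                 # far more iterations than ever needed
        mid = (lo + hi) / 2
        val = at_least_q(mid, d, q)
        if abs(val - p_suc) < tol:
            break
        if val < p_suc:
            lo = mid
        else:
            hi = mid
    return mid

# Hypothetical example: d = 4 related solutions, at least q = 3 must be ranked.
p = solve_p(0.99, d=4, q=3)
print(p, at_least_q(p, 4, 3))
```

The resulting per-solution probability $p$ is the quantity that Lemma 4 links to the number of parity-check equations $D$ and the rank threshold $r$.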

1) Preprocessing phase: Construct parity-check equations
Let the success probability of recovering the $n$-bit LFSR initial state $S_0$ be $p_{suc}$ and the parity-check equations contain only a $B_1$-bit unknown variable. Let the candidate values used by the attack be ranked among the top $r$ out of all $2^{B_1}$ candidates. Calculate the derived matrix set Suppose that the data required for the attack is $N$, construct $D$ parity-check equations Eq. (4) by $z_t$, $0 \le t \le N - 1$, that
and the solution needs to satisfy the following system of linear equations with $(qB_1 - n)$ equations then the probability that a wrong solution satisfies the system of linear equations (8) is $2^{-(qB_1 - n)}$. The time complexity of calculating the block matrix $P$ by Gaussian elimination is $(qB_1)^3$. Since there are $C_m^q$ choices of $(A^{k_{j_1}})^T, (A^{k_{j_2}})^T, \cdots, (A^{k_{j_q}})^T$ for $1 \le j_1 < j_2 < \cdots < j_q \le m$, we need to calculate the block matrix $P$ for each of the $C_m^q$ matrices $M$, respectively. The time complexity of this part is $C_m^q (qB_1)^3$. Exhaust the $A_r^q$ choices of $s = (s_1, s_2, \cdots, s_q)$, and calculate the solution $x$ by Eq. (7) when $s$ satisfies Eq. (8). The time complexity of solving the systems of linear equations is $C_m^q ((qB_1)^3 + A_r^q)$. The solutions of the system of linear equations (6) are analyzed below. For the set {i the correct solution $S_0$ can be obtained from the system of linear equations (6) with probability 1. We can get the correct solution $S_0$ from $C_d^q$ out of the $C_m^q A_r^q$ systems of linear equations (6). The solutions of the remaining $(C_m^q A_r^q - C_d^q)$ systems of linear equations (6) are wrong, and the expected number of wrong solutions in the candidate set is $(C_m^q A_r^q - C_d^q)2^{-(qB_1 - n)}$ according to Eq. (8). The expected number of occurrences in the candidate set for one given wrong solution is $(C_m^q A_r^q - C_d^q)2^{-qB_1}$ according to Eq. (7). For a valid attack, the time complexity of solving the systems of linear equations is much less than that of an exhaustive search of the LFSR state, that is, $C_m^q A_r^q \ll 2^n$. Since $qB_1 > n$, we have $(C_m^q A_r^q - C_d^q)2^{-qB_1} \ll 1$. The expected number of occurrences in the candidate set for the correct solution is $C_d^q$. Therefore, the solution with the most occurrences among the candidate solutions can be determined as the correct solution.
The time complexity of recovering the correct solution $S_0$ includes the time complexity of solving the systems (6) and the time complexity of verifying the solutions. The time complexity is therefore $T_3 = O(C_m^q ((qB_1)^3 + A_r^q))$. The following Theorem 4 determines the constraint relationship between the success probability and the attack complexity given in Algorithm 1. Moreover, Theorem 4 can be easily proved from the above description.
Theorem 4: Let the success probability of recovering the full $n$-bit LFSR initial state $S_0$ be $p_{suc}$. Then for the given positive integers $v$, $B$, $B_1$, $B_2$, $r$, $q = \lfloor n/B_1 \rfloor + 1$, $d = m2^{-B_2}$, where $v > 1$, $B = B_1 + B_2$, $B < n$, $q \le \min\{r, d\}$, the total time complexity of the new fast correlation attack given in Algorithm ), and the memory complexity is $O(\max\{N, 2^{B_1}\})$. In particular, when $v = 2$, the time complexity is $O((D2^{n-B})^{1/2} + D + B_1 2^{B_1} + C_m^q ((qB_1)^3 + A_r^q))$ and the remaining complexities are unchanged.
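The cost of the equation-solving step, $T_3 = O(C_m^q((qB_1)^3 + A_r^q))$, can be evaluated for trial parameters with a few lines of code. The sketch below assumes that $C_m^q$ denotes the binomial coefficient and $A_r^q$ the number of ordered selections of $q$ out of $r$ candidates; the function name and the parameter values are hypothetical and are not the parameters used in the attack on SOSEMANUK.

```python
import math

def solving_cost_log2(n: int, B1: int, B2: int, m: int, r: int):
    """Return (q, d, log2 of C(m, q) * ((q*B1)^3 + A(r, q))) for trial parameters."""
    q = n // B1 + 1                 # q = floor(n / B1) + 1, so that q * B1 > n
    d = m // 2**B2                  # expected number of surviving related solutions
    if q > min(r, d):
        raise ValueError("constraint q <= min(r, d) is violated")
    t3 = math.comb(m, q) * ((q * B1) ** 3 + math.perm(r, q))
    return q, d, math.log2(t3)

# Hypothetical parameters, for illustration only.
q, d, cost = solving_cost_log2(n=320, B1=130, B2=8, m=2**10, r=16)
print(q, d, round(cost, 2))
```

Such a helper makes it easy to see how quickly the $C_m^q A_r^q$ term grows with $r$ and $m$, which is why the condition $C_m^q A_r^q \ll 2^n$ has to be checked for any parameter choice.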

IV. APPLICATION TO SOSEMANUK
In this section, we search for linear approximations of SOSEMANUK with high correlations by using the divide-and-conquer strategy. We then apply the new fast correlation attack given in Sect. III to SOSEMANUK, and give an improved fast correlation attack on SOSEMANUK.

A. Linear Approximate Representation of SOSEMANUK
The linear approximation of SOSEMANUK can be divided into two parts: FSM part and Serpent1 part.
1) Linear Approximation of FSM: A general idea for approximating the FSM part is to cancel out the contributions of the FSM registers by combining expressions for two keystream outputs with linear masks. We have the following equations by using the keystream equations of two consecutive clocks and binary masks $\Gamma_1, \Gamma_2 \in GF(2^{32})$.
Introducing the binary masks $\Gamma_3, \Gamma_4, \Gamma_5 \in GF(2^{32})$, the linear approximation equations can be established as Let the correlations of the above two linear approximations be According to the Correlation Theorem of the composite function [30], we have where $\rho_{Trans}(a \to b)$ denotes the correlation of the function $y = Trans(x)$ with the input mask $a \in GF(2^{32})$ and the output mask $b \in GF(2^{32})$. Therefore, we can establish the linear approximation of the FSM part as For the following two linear approximations we can prove that the correlation of both is $2^{-1}$. Then for $x_\Gamma \in \{0, 1\}$, the linear approximation (9) can be written as and the correlation of linear approximation (10) is
2) Linear Approximation of Serpent1: Next, we establish the linear approximation of the Serpent1 part by the keystream expression $[GF(2^{32})]^4$ be the input masks and output masks of the following linear approximation of Serpent1, respectively.

And let the correlation of the linear approximation of Serpent1 be ρ
), $0 \le i \le 31$ be the input mask and output mask of the linear approximation of the $i$-th S-box and its correlation be $\rho_S(a_i \to b_i)$. Then the correlation $\rho_{Serpent1}$ can be computed by There are the following three combinations of the output masks $\Gamma_2, \Gamma_1$ of the FSM and the input masks Then for $0 \le \tau \le 2$, we have Now we can give the linear approximation of SOSEMANUK based on the above linear approximations of the FSM and Serpent1. For the keystream masks ) and the LFSR masks $\Gamma_3, \Gamma_4, \Gamma_5 \in GF(2^{32})$, the linear approximation of SOSEMANUK can be established as where $t \equiv 1 \bmod 4$, $0 \le \tau \le 2$ and $x_\Gamma \in \{0, 1\}$. The correlation $\rho_{SOSE}$ can be computed by where

B. Searching for Linear Approximations of SOSEMANUK With High Correlations
In this section, we give a divide-and-conquer strategy to search for linear approximations of SOSEMANUK. According to formula (9), the linear approximations of SOSEMANUK include three kinds of linear approximations: modular addition, modular multiplication and S-box. The main idea of the divide-and-conquer strategy is as follows: Firstly, we search for the linear approximation trails of modular additions with high correlations by a MILP model. After determining this part of the linear approximations, the correlation of the linear approximation of modular multiplication can be calculated. Finally, all the remaining linear approximations are determined according to the LAT of the S-box. Then the accurate correlations of linear approximations of SOSEMANUK can be calculated by formula (12).
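The last step of the strategy relies on the LAT of a 4-bit S-box, which is small enough to compute exhaustively. The sketch below computes a table of correlations $\rho_S(a \to b)$; the S-box values are assumed to be those of Serpent S2 and should be checked against [22] before being relied upon, and the function names are hypothetical.

```python
# Assumed to be Serpent S2; verify against the Serpent specification [22].
SBOX = [8, 6, 7, 9, 3, 12, 10, 15, 13, 1, 14, 4, 0, 11, 5, 2]

def parity(x: int) -> int:
    return bin(x).count("1") & 1

def correlation_table(sbox):
    """table[a][b] = correlation of a.x = b.S(x) over all inputs x."""
    size = len(sbox)
    table = [[0.0] * size for _ in range(size)]
    for a in range(size):
        for b in range(size):
            total = sum(1 - 2 * (parity(a & x) ^ parity(b & sbox[x]))
                        for x in range(size))
            table[a][b] = total / size
    return table

def best_input_masks(table, b):
    """Largest absolute correlation for output mask b, and the input masks reaching it."""
    best = max(abs(row[b]) for row in table)
    return best, [a for a, row in enumerate(table) if abs(row[b]) == best]

table = correlation_table(SBOX)
for b in range(1, 16):
    print(hex(b), best_input_masks(table, b))
```

Listing, for each output mask, all input masks that reach the best correlation is the kind of lookup needed when several input masks of one active S-box should share the same output mask.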
1) Searching for the Linear Approximation Trails of SOSEMANUK With High Correlations: Introduce the mask then the linear approximation trail of SOSEMANUK includes linear approximations of three modular additions, and the correlations of the linear approximations of the three modular additions can be computed by In addition, the maximal correlation of the S-box is $2^{-1}$. Then for a given mask tuple $(A_3, A_2, A_1, A_0) \in \{(\Gamma_2, \Gamma_1, 0, 0), (0, \Gamma_2, \Gamma_1, 0), (0, 0, \Gamma_2, \Gamma_1)\}$, there exist masks $B_i$, $0 \le i \le 3$ such that the correlation of linear approximation Next, we establish a MILP model for the correlation $\rho$ of the above three modular additions and the correlation $\rho_{Serpent1}$ of the S-box, and then search for the linear approximation trails of SOSEMANUK with high correlations.
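For small word sizes, the correlation of a linear approximation of modular addition can be computed by exhaustive enumeration, which is a convenient sanity check for any closed-form rule or MILP encoding. The following is a minimal sketch for addition modulo $2^n$ with input masks $x$, $y$ and output mask $z$; the function name is hypothetical and the word size is kept tiny on purpose.

```python
def parity(x: int) -> int:
    return bin(x).count("1") & 1

def add_correlation(x: int, y: int, z: int, n: int) -> float:
    """Correlation of z.(a + b mod 2^n) = x.a XOR y.b, by enumerating all (a, b)."""
    mod = 1 << n
    total = 0
    for a in range(mod):
        for b in range(mod):
            bit = parity(z & ((a + b) % mod)) ^ parity(x & a) ^ parity(y & b)
            total += 1 - 2 * bit
    return total / (mod * mod)

# The least significant bit of a sum is linear; the next bit involves one carry.
print(add_correlation(0b0001, 0b0001, 0b0001, 4))
print(add_correlation(0b0010, 0b0010, 0b0010, 4))
```

For 32-bit words such enumeration is infeasible, which is why a condition-based characterization such as the one from [31] is needed inside the MILP model.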
By introducing three 33-dimensional binary vectors $s_0$, $s_1$, $s_2$, we can establish the linear approximations of the following three modular additions according to the following Lemma 5.

And the correlation $|\rho|$ can be computed by $|\rho|$

Lemma 5 [31]: Let the input masks and output mask of a linear approximation of addition modulo $2^n$ be x, y and z, respectively. Introduce an (n + 1)-dimensional binary vector s; then the necessary and sufficient condition for $\rho_+(x, y \to z) \neq 0$ is that the following 8n + 1 conditions hold at the same time.

And its absolute correlation can be calculated by

Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

The Gurobi solver [32] supports the characterization of the bitwise OR operation. Then we can enforce $z = \Gamma_1 \vee \Gamma_2$ by introducing a 32-dimensional binary vector z. In order to search for a high correlation $|\rho_{SOSE}|$, the Hamming weight of z is limited as follows in this paper.
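For small word sizes, the correlation of a linear approximation of modular addition can be checked by direct enumeration. The following is a minimal sketch under that assumption; it is a brute-force check rather than the MILP characterization of Lemma 5, and the function names are illustrative only.

```python
def parity(v: int) -> int:
    """Return the XOR of all bits of v."""
    return bin(v).count("1") & 1


def addition_correlation(x_mask: int, y_mask: int, z_mask: int, n: int) -> float:
    """Exact correlation of  x_mask.a ^ y_mask.b ^ z_mask.((a + b) mod 2^n) = 0.

    All 2^(2n) input pairs are enumerated, so this is only usable for small n.
    """
    modulus = 1 << n
    balance = 0
    for a in range(modulus):
        for b in range(modulus):
            total = (a + b) % modulus
            bit = parity(x_mask & a) ^ parity(y_mask & b) ^ parity(z_mask & total)
            balance += 1 - 2 * bit  # +1 when the approximation holds, -1 otherwise
    return balance / (modulus * modulus)


if __name__ == "__main__":
    # The least significant bit of a sum is linear in the inputs.
    print(addition_correlation(0b0001, 0b0001, 0b0001, 4))
    # Bit 1 involves one carry bit.
    print(addition_correlation(0b0010, 0b0010, 0b0010, 4))
    # A mask combination that leaves an unmasked input bit uncovered.
    print(addition_correlation(0b0010, 0b0000, 0b0010, 4))
```

Such a check scales as $2^{2n}$ and is therefore only a sanity test on reduced word sizes; for n = 32 the MILP conditions are needed.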
Solve the above MILP model with the Gurobi solver. Whenever the solver returns a solution, the following constraint is added to the MILP model to exclude the current solution, and the search then continues for the remaining solutions.
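One common way to exclude an already found solution of a model over binary variables is a linear cut that every other assignment satisfies. The sketch below builds such a cut in plain Python, independent of any solver; it is an assumption about the general technique, not a reproduction of the constraint used in this paper, and the variable names are hypothetical.

```python
from itertools import product
from typing import Dict, Tuple


def exclusion_cut(solution: Dict[str, int]) -> Tuple[Dict[str, int], int]:
    """Return (coeffs, rhs) of the cut  sum(coeffs[v] * v) >= rhs.

    The cut is the linear form of
        sum_{v : sol[v] = 0} v  +  sum_{v : sol[v] = 1} (1 - v)  >=  1,
    i.e. at least one binary variable must differ from the given solution.
    """
    coeffs = {name: (1 if value == 0 else -1) for name, value in solution.items()}
    rhs = 1 - sum(solution.values())
    return coeffs, rhs


def satisfies(coeffs: Dict[str, int], rhs: int, assignment: Dict[str, int]) -> bool:
    """Check whether a 0/1 assignment fulfils the cut."""
    return sum(coeffs[name] * assignment[name] for name in coeffs) >= rhs


if __name__ == "__main__":
    found = {"z0": 1, "z1": 0, "z2": 1}  # hypothetical solver output
    coeffs, rhs = exclusion_cut(found)
    for bits in product((0, 1), repeat=3):
        candidate = dict(zip(found, bits))
        print(candidate, satisfies(coeffs, rhs, candidate))
```

In a solver loop, the pair (coeffs, rhs) would be added as one extra linear constraint after each returned solution, and the model would be re-optimized until it becomes infeasible or the correlation bound is no longer met.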
As a result, 16 linear approximation trails of SOSEMANUK with correlation of $2^{-21.41}$, shown in Table II, are obtained.

2) Calculating the Accurate Correlations of Linear Approximations: In this section, we carefully select the masks $B_i$, $0 \le i \le 3$, according to the LAT of the S-box given in Table VII, and calculate the accurate correlations of linear approximations of SOSEMANUK by formula (12), i.e.,

In Table II, there are the following two kinds of $(\Gamma_1, \Gamma_2)$.
The masks $(\Gamma_1, \Gamma_2)$ are transformed into the input masks of Serpent1 according to the bit-slice mode of Serpent1; the nonzero input masks of the S-boxes for the two kinds of $(\Gamma_1, \Gamma_2)$ then lie only at bit positions 25, 24, 14 and 0. Hence the intermediate masks $(\Gamma_1, \Gamma_2)$ satisfying $\rho_{Serpent1} \neq 0$ in formula (12) are active only in the 25th, 24th, 14th and 0th S-boxes.
When τ = 0, the input masks of the 25th, 24th, 14th and 0th active S-boxes can be nonzero only in the two least significant bits. Then there are only three possibilities for the input mask of each of the four S-boxes, i.e., 0x1, 0x2 and 0x3. In addition, because the linear approximation of addition modulo $2^n$ must have nonzero correlation, the input mask at the 25th S-box must be 0x3. Then the mask tuples $(\Gamma_1, \Gamma_2)$ have $3^3 = 27$ cases. For the 16 mask tuples $(\Gamma_3, \Gamma_4, \Gamma_5)$ shown in Table II, we exhaust the 27 kinds of $(\Gamma_1, \Gamma_2)$ and calculate the corresponding correlation $\rho_{FSM}$, respectively (see Table III). The input masks of the 25th, 24th, 14th and 0th active S-boxes derived from the two mask tuples $(\Gamma_1, \Gamma_2)$ are (0x3, 0x2, 0x3, 0x3) and (0x3, 0x1, 0x3, 0x3), respectively. That is, the input masks of the 25th, 14th and 0th S-boxes are all 0x3, and the input masks of the 24th S-box are 0x1 and 0x2. Therefore, if there is an output mask of the 24th S-box for which both input masks 0x1 and 0x2 have high correlations, then the linear approximation of SOSEMANUK will have two linear approximation trails with high correlations.
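The count of 27 candidates can be reproduced by a short enumeration. The sketch below is illustrative only: it assumes that bit 0 of an S-box input mask comes from $\Gamma_1$ and bit 1 from $\Gamma_2$ in the bit-slice layout, which is an assumption made here for the example and should be checked against the definition of Serpent1 used in the paper.

```python
from itertools import product

ACTIVE_SBOXES = (25, 24, 14, 0)  # active S-box positions named in the text
FIXED_INPUTS = {25: 0x3}         # the 25th S-box input mask is forced to 0x3


def candidate_mask_pairs():
    """Enumerate all (gamma1, gamma2) word masks for the active S-boxes."""
    free_positions = [p for p in ACTIVE_SBOXES if p not in FIXED_INPUTS]
    pairs = []
    for choice in product((0x1, 0x2, 0x3), repeat=len(free_positions)):
        sbox_inputs = dict(FIXED_INPUTS)
        sbox_inputs.update(zip(free_positions, choice))
        gamma1 = gamma2 = 0
        for position, mask in sbox_inputs.items():
            # ASSUMPTION: bit 0 of the S-box mask belongs to gamma1, bit 1 to gamma2.
            gamma1 |= (mask & 1) << position
            gamma2 |= ((mask >> 1) & 1) << position
        pairs.append((gamma1, gamma2))
    return pairs


if __name__ == "__main__":
    pairs = candidate_mask_pairs()
    print(len(pairs))
    for gamma1, gamma2 in pairs[:3]:
        print(f"0x{gamma1:08x} 0x{gamma2:08x}")
```

Every enumerated pair has the same bitwise OR, with bits 25, 24, 14 and 0 set, since each active S-box receives a nonzero input mask.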
Based on the above observation, the output mask of the 24th S-box can be selected as $b_{24}$ = 0x7. According to the LAT of the S-box given in the Appendix, we know $\rho_S(0x1 \to 0x7) = \rho_S(0x2 \to 0x7) = 2^{-1}$. In addition, the output masks of the 25th, 14th and 0th S-boxes can be selected as
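The two S-box correlations quoted above can be recomputed from the S-box table. The sketch below assumes that Serpent1 uses the Serpent S-box S2 and that masks are applied with bit 0 as the least significant bit; both are assumptions of this example and should be checked against Table VII.

```python
# ASSUMPTION: Serpent S-box S2, taken to be the S-box used by Serpent1.
SERPENT_S2 = (8, 6, 7, 9, 3, 12, 10, 15, 13, 1, 14, 4, 0, 11, 5, 2)


def parity(v: int) -> int:
    """Return the XOR of all bits of v."""
    return bin(v).count("1") & 1


def sbox_correlation(sbox, in_mask: int, out_mask: int) -> float:
    """Correlation of  in_mask.x ^ out_mask.S(x) = 0  over all inputs x."""
    balance = sum(
        1 - 2 * (parity(in_mask & x) ^ parity(out_mask & sbox[x]))
        for x in range(len(sbox))
    )
    return balance / len(sbox)


if __name__ == "__main__":
    for in_mask in (0x1, 0x2):
        print(hex(in_mask), "-> 0x7:", sbox_correlation(SERPENT_S2, in_mask, 0x7))
```

Looping `in_mask` and `out_mask` over 0..15 yields a complete linear approximation table under the same assumptions.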

TABLE IV 8 KEYSTREAM OUTPUT MASKS
$\rho_{FSM}\,\rho_{Serpent1}$ has at most two other linear approximation trails with nonzero correlations besides the two linear approximation trails shown in Table III, and the maximum correlation of these two other trails is $2^{-64}$. Obviously, the correlations of the two other linear approximation trails are negligible. Then for the 8 mask tuples $(B_0, B_1, B_2, B_3)$ shown in Table IV  −20.848 , and the number of both is 64.

C. Experimental Verification
In this section, we verify the correctness of the correlation $2^{-20.84}$ of the multiple linear approximations used in the new fast correlation attack presented in Sect. IV-D.
In Sect. IV-B2, for each of the 8 keystream output mask tuples $(B_0, B_1, B_2, B_3)$ shown in Table V, we can always obtain 16 linear approximations with absolute correlation of $2^{-20.84}$. According to the new fast correlation attack given in Sect. III-B, the multiple linear approximations with the same keystream output masks are the basis of the attack. Therefore, the correlations of the 16 linear approximations with the same keystream output masks need to be verified. In the attack, we use the first keystream output mask $(B_0, B_1, B_2, B_3)$ = (0x03004001, 0x01000000, 0x01000000, 0x02004001) given in Table V and its corresponding 16 LFSR masks. Then we need to verify the correlations of the 16 linear approximations given in Table VI.
Since the linear approximations are only related to the keystream output phase, we generate long keystreams from the following internal state after the initialization phase of SOSEMANUK

The above internal state is generated under a test vector with a 128-bit key K and a 128-bit IV provided in the design report [34], that is, K = 0x00000000,00000000,000001B7,FE83C0A7, IV = 0x00112233,44556677,8899AABB,CCDDEEFF.
For each of the 16 linear approximations given in Table VI, we generate $2^{47}$ instances of equation (11) and count the experimental correlations, respectively. Obtaining the experimental correlations takes about 12 days. The experiments were run on two desktop computers: one with 32 GB of memory and a 24-core Intel(R) Core(TM) i9-14900KF CPU @ 3.20 GHz, and one with 64 GB of memory and a 24-core 13th Gen Intel(R) Core(TM) i9-13900KF CPU @ 3.00 GHz. The comparisons of theoretical correlations and experimental correlations are summarized in Table VI. Overall, the experimental correlations are very close to the theoretical correlations, which supports the correctness of our results.
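The choice of $2^{47}$ samples can be related to the target correlation by a standard statistical rule of thumb, which is not stated in the paper and is assumed here: the estimate of a small correlation from N independent samples has a standard deviation of about $N^{-1/2}$. A minimal sketch of the estimator and of this rule follows; the function names are illustrative.

```python
def empirical_correlation(bits) -> float:
    """Estimate a correlation as (#zeros - #ones) / N from observed bits."""
    n = len(bits)
    ones = sum(bits)
    return (n - 2 * ones) / n


def log2_std_error(log2_samples: float) -> float:
    """log2 of the approximate standard deviation 1/sqrt(N) of the estimate."""
    return -log2_samples / 2


if __name__ == "__main__":
    # Toy data: three of four evaluations of the approximation hold.
    print(empirical_correlation([0, 0, 0, 1]))

    log2_corr = -20.84   # theoretical absolute correlation from the text
    log2_samples = 47    # number of evaluated equations from the text
    # Ratio of the expected signal to the statistical noise of the estimate.
    print(2 ** (log2_corr - log2_std_error(log2_samples)))
```

Under this heuristic the expected signal is roughly six standard deviations above the noise, which is consistent with the experimental correlations in Table VI being close to the theoretical ones.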

D. Analysis of Attack Complexity and Success Probability
For each keystream output mask tuple $(B_0, B_1, B_2, B_3)$ shown in Table IV, there are m = 16 kinds of $(\Gamma_3, \Gamma_4, \Gamma_5, x_\Gamma)$ satisfying the correlation $|\rho_{SOSE}| \ge 2^{-20.848}$. In this paper, we use the 16 linear approximations given in Table VI to carry out the new fast correlation attack on SOSEMANUK.
Let the success probability be $p_{suc}$ = 0.99; then the approximate solution p = 0.958002 is obtained by the bisection method. We search for the parameters that minimize the attack complexity according to Theorem

After determining the full 320-bit initial state of the LFSR, a small-scale exhaustive search can be used to recover the remaining 64 bits of R1 and R2 of the FSM. The complexity of this part is much lower. Finally, the attack complexity of recovering the full state of SOSEMANUK is considered to be $O(2^{139.75})$, and the success probability is 0.99. At the same success probability, our attack result improves the current best fast correlation attack on SOSEMANUK with time/data/memory complexity of $O(2^{147.88})/O(2^{145.5})/O(2^{147.1})$.
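The size of the improvement, and the claim that the 64-bit FSM search does not change the total, can be checked with simple base-2 logarithm arithmetic. The sketch below uses only the complexity figures quoted in the text; the helper names are illustrative.

```python
import math

# log2 complexities quoted in the text
PREVIOUS = {"time": 147.88, "data": 145.5, "memory": 147.1}    # ASIACRYPT 2008 attack
IMPROVED = {"time": 139.75, "data": 139.37, "memory": 139.37}  # attack in this paper


def log2_gain(previous, improved):
    """Per-resource improvement, expressed as a difference of log2 values."""
    return {key: round(previous[key] - improved[key], 2) for key in previous}


def log2_sum(a: float, b: float) -> float:
    """Return log2(2^a + 2^b) without forming the huge intermediate values."""
    hi, lo = max(a, b), min(a, b)
    return hi + math.log2(1 + 2.0 ** (lo - hi))


if __name__ == "__main__":
    print(log2_gain(PREVIOUS, IMPROVED))
    # Adding a 2^64 exhaustive search over R1, R2 to the 2^139.75 main phase.
    print(log2_sum(139.75, 64))
```

The differences correspond to factors of about $2^{8.13}$ in time, $2^{6.13}$ in data and $2^{7.73}$ in memory, and the additional $2^{64}$ search leaves the exponent 139.75 unchanged at this precision.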

V. CONCLUSION
In this paper, we propose an improved fast correlation attack using multiple linear approximations for stream ciphers whose LFSR size is at least twice the key size. The main idea is to construct parity-check equations with fewer unknown variables by XORing the same linear approximation at different clocks, and then to further bypass some unknown variables of the parity-check equations by multiple linear approximations with an expected probability. Finally, all unknown variables are recovered by solving systems of linear equations. SOSEMANUK is one of the finalists in the eSTREAM project. The best absolute correlation of linear approximations of SOSEMANUK we found is $2^{-20.84}$, which improves on the previous best absolute correlation of $2^{-21.41}$. The improved fast correlation attack method is then applied to SOSEMANUK, giving a fast correlation attack with time/data/memory complexity of $O(2^{139.75})/O(2^{139.37})/O(2^{139.37})$ and success probability 0.99. It improves the current best fast correlation attack with time/data/memory complexity of $O(2^{147.88})/O(2^{145.5})/O(2^{147.1})$. For the optional key size of SOSEMANUK ranging from 128 bits to 256 bits, our attack result shows that SOSEMANUK can only guarantee 139-bit security. The new methods and new results in this paper are verified by experiments. In addition, we note that our new fast correlation attack method can be applied to the linear analysis of other LFSR-based stream ciphers.

Manuscript received 11 October 2023; revised 26 February 2024; accepted 29 May 2024. Date of publication 4 June 2024; date of current version 17 September 2024. This work was supported in part by the Natural Science Foundation of Henan under Grant 222300420100 and in part by the National Natural Science Foundation of China under Grant 62372463 and Grant 62302518. (Corresponding authors: Sudong Ma; Ting Cui.)
we can construct the parity-check equations containing only B-bit unknown variable. Define the set {(S 0 k) L b : k ∈ K} as the b-bit related solution set of $S_0$. Theorem 2 gives the correlations of parity-check equations about the B-bit related solutions of $S_0$ and the remaining wrong solutions, respectively.
−B2 values out of m correct values can pass the filter on average. Finally, all matchings of the candidate values with the derived matrices are exhausted and the corresponding systems of linear equations are established. Then all the systems of linear equations are solved and the correct full n-bit LFSR variable is recovered.
2) By using the divide-and-conquer strategy to search the linear approximations of SOSEMANUK, linear approximations of SOSEMANUK with greater correlation are found; that is, the best correlation we found is $2^{-20.84}$, which improves the current best correlation of $2^{-21.41}$. The divide-and-conquer strategy is divided into the following three steps. First, we search for the linear approximation trails of modular additions with high correlations by a Mixed Integer Linear Programming (MILP) model. After determining this part of the linear approximations, the correlation of multiplication modulo $2^{32}$ can be calculated. Finally, the remaining linear approximations are determined according to the linear approximation table (LAT) of the S-box. Then the accurate correlations of linear approximations of SOSEMANUK can be calculated by exhausting all the intermediate trails.

3) Applying the improved fast correlation attack method to SOSEMANUK, a fast correlation attack with time/data/memory complexity of $O(2^{139.75})/O(2^{139.37})/O(2^{139.37})$ is given, and the success probability is 0.99. It improves the current best fast correlation attack with time/data/memory complexity of $O(2^{147.88})/O(2^{145.5})/O(2^{147.1})$. The comparison of the related work on SOSEMANUK is given in Table I.

TABLE I THE COMPARISON OF FAST CORRELATION ATTACKS ON SOSEMANUK

TABLE II 16 LINEAR APPROXIMATION TRAILS OF SOSEMANUK WITH CORRELATION OF $2^{-21.41}$

TABLE VII LINEAR APPROXIMATION TABLE OF S-BOX USED IN SERPENT1 WITH INPUT MASK a AND OUTPUT MASK b