A Constant-time AVX2 Implementation of a Variant of ROLLO

This paper introduces a key encapsulation mechanism ROLLO+ and presents a constant-time AVX2 implementation of it. ROLLO+ is a variant of ROLLO-I targeting IND-CPA security. The main difference between ROLLO+ and ROLLO-I is that the decoding algorithm of ROLLO+ is adapted from the decoding algorithm of ROLLO-I. Our implementation of ROLLO+-I-128, one of the level-1 parameter sets of ROLLO+, takes 851823 Skylake cycles for key generation, 30361 Skylake cycles for encapsulation, and 673666 Skylake cycles for decapsulation. Compared to the state-of-the-art implementation of ROLLO-I-128 by Aguilar-Melchor et al., which is claimed to be constant-time but actually is not, our implementation achieves a 12.9x speedup for key generation, a 10.6x speedup for encapsulation, and a 14.5x speedup for decapsulation. Compared to the state-of-the-art implementation of the level-1 parameter set of BIKE by Chen, Chou, and Krausz, our key generation time is 1.4x as slow, but our encapsulation time is 3.8x as fast, and our decapsulation time is 2.4x as fast.


Introduction
ROLLO is a code-based key encapsulation mechanism that took part in the NIST post-quantum cryptography standardization process up to the second round. It is the merger of three first-round candidates: Rank-Ouroboros [AMAB+17], LAKE [ABD+17a], and LOCKER [ABD+17b]. While most of the code-based submissions are based on the Hamming metric, ROLLO is based on the rank metric and makes use of so-called "low-rank parity-check" (LRPC) codes. The construction of ROLLO, according to the latest version of the specification (version 2020/04/21, available at [ABD+20]), is similar to the third-round code-based candidate BIKE [ABB+20], in the sense that each public key is the quotient of two "low-weight" ring elements. In other words, both schemes are NTRU-like [HPS98].
Unfortunately, the underlying hard problem of ROLLO was not fully studied at the beginning of the second round of the standardization process. During the second round, two algebraic attacks against ROLLO were published [BBC+20, BBB+20]. These attacks show that the proposed parameter sets do not reach the claimed security levels. For example, the parameter set ROLLO-I-128, which was claimed to have 128-bit (pre-quantum) security, was shown to have only 71-bit security in [BBC+20]. In response to these attacks, the ROLLO team announced in the latest specification new parameter sets that achieve the desired security levels under the attacks of [BBC+20, BBB+20]. However, ROLLO still failed to enter the third round.
Even though ROLLO will not be standardized by the current standardization process, it does have some interesting features. First of all, ROLLO has very small keys. The public keys of the level-1, 3, and 5 parameter sets ROLLO-I-128, ROLLO-I-192, and ROLLO-I-256 are only 696, 954, and 1371 bytes, respectively. The key sizes are 2.2 to 5.3 times as small as the key sizes of the corresponding parameter sets of BIKE. Second, ROLLO has a CCA2-secure variant ROLLO-II, while BIKE is only claimed to be CPA-secure. Under the assumption that there will not be attacks that outperform [BBC+20, BBB+20], it seems that ROLLO could be a more interesting option for standardization than BIKE. In fact, NIST also commented that "Despite the development of algebraic attacks, NIST believes rank-based cryptography should continue to be researched. The rank metric cryptosystems offer a nice alternative to traditional hamming metric codes with comparable bandwidth." in the status report for the second-round candidates [AASA+20].
Of course, it takes time to see whether the practical security of ROLLO has been thoroughly studied, but [BBC+20, BBB+20] certainly help the community to understand more about the security of ROLLO. On the other hand, there has been very little effort in improving the performance of ROLLO. In particular, how to build a fast constant-time implementation of ROLLO seems to be a problem that has never been well studied, and the goal of this paper is to show a solution.

Previous Works
We are aware of only two previous papers about implementing ROLLO. The first paper is [LMB+19], which presents an implementation of the second-round version of ROLLO. The target platform of [LMB+19] is ARM SecureCore SC300. The parameters and the decoding algorithm used in the paper are based on the second-round specification instead of the latest one. The authors did not claim that their implementation is constant-time, and there is no discussion about how to make the implementation constant-time. For these reasons, we think it is not very meaningful to consider the speeds reported in the paper.
The second paper is [AMAB+21], which presents two implementations of ROLLO-I-128 and claims that they are constant-time. The target platform is Intel Coffee Lake. Unfortunately, we found that the algorithm to_ref, which the authors use to reduce matrices to row echelon form, is vulnerable to cache-timing attacks.
The algorithm to_ref is shown in Algorithm 1. As one can see, the indices $i$ and $j$ do not depend on the entries of $M$, so it is fine to access $M_{i,j}$ and $M_i$ directly using memory load/store instructions. However, the value of $\mu$ does depend on the entries of $M$, so $M_{\mu,j}$ and $M_\mu$ should not be accessed directly. Of course, it could be that the authors' implementation actually fixes this issue with some extra operations, but we have checked the source code (available at https://github.com/peacker/constant_time_rollo) and found that this is not the case. [AMAB+21] uses a similar algorithm to_rref to reduce matrices into reduced row echelon form, and to_rref has the same problem. We have not checked whether other parts of the source code are actually constant-time.
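To illustrate the access-pattern issue, the following Python sketch contrasts a load at a secret row index with a masked scan that reads every row. The function names and the one-integer-per-row encoding are assumptions made for illustration and are not taken from [AMAB+21]; Python itself offers no timing guarantees, so the sketch only shows the idea of selecting by mask instead of by address.

```python
def load_row_leaky(rows, mu):
    # The memory address depends on the secret index mu (cache-timing leak).
    return rows[mu]

def load_row_masked(rows, mu, width):
    """Return rows[mu] while reading every row, selecting with a bit mask."""
    full = (1 << width) - 1
    out = 0
    for i, row in enumerate(rows):
        diff = i ^ mu
        nonzero = ((diff | -diff) >> 63) & 1   # 1 if i != mu, else 0
        mask = (nonzero - 1) & full            # all ones only when i == mu
        out |= row & mask
    return out
```

The masked version costs one pass over the whole matrix per access, which is the price of making the sequence of touched addresses independent of $\mu$.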
Finally, the source code included in the second-round submission package of ROLLO is not constant-time. This was first pointed out in [DGK20].

Our Contribution
This paper introduces a new key encapsulation mechanism ROLLO+ and presents a constant-time AVX2 implementation of it. ROLLO+ is a variant of ROLLO-I. Like ROLLO-I, ROLLO+ targets only CPA security. The main difference between ROLLO+ and ROLLO-I is that the decoding algorithm of ROLLO+ is adapted from the decoding algorithm of ROLLO-I. We adapted the decoding algorithm so that some invalid ciphertexts can be identified and rejected.
Our implementation of ROLLO+-I-128, one of the level-1 parameter sets of ROLLO+, takes 851823 Skylake cycles for key generation, 30361 Skylake cycles for encapsulation, and 673666 Skylake cycles for decapsulation. Compared to the state-of-the-art implementation of ROLLO-I-128 (which is not constant-time) by Aguilar-Melchor et al., our implementation achieves a 12.9x speedup for key generation, a 10.6x speedup for encapsulation, and a 14.5x speedup for decapsulation. Compared to the state-of-the-art implementation of the level-1 parameter set of BIKE by Chen, Chou, and Krausz, our key generation time is 1.4x as slow, but our encapsulation time is 3.8x as fast, and our decapsulation time is 2.4x as fast.
As ROLLO+ and ROLLO-I are very similar, we expect that an implementation of ROLLO-I built with our techniques (see below) would achieve similar key generation, encapsulation, and decapsulation times. We also expect that our implementation techniques can be used to accelerate ROLLO-II and other rank-metric cryptosystems such as RQC [AMAB+20].

Techniques
We consider field multiplication in $\mathbb{F}_{2^{mn}}$, field inversion in $\mathbb{F}_{2^{mn}}$, and Gaussian elimination as the main building blocks. The speed of our implementation is achieved by optimizing these building blocks. Here is a brief overview of how we optimize them.
• Key generation, encapsulation, and decapsulation each involve one multiplication in $\mathbb{F}_{2^{mn}}$, for 3 multiplications in total. Each multiplication in $\mathbb{F}_{2^{mn}}$ involves one low-weight operand. In order to exploit the structure of the low-weight operand, we make use of matrix transposition to perform the multiplication. As far as we can tell, we are the first to introduce this idea.
• The decoding algorithm computes the intersection of several vector spaces. One way to compute the intersection is to use the Zassenhaus algorithm. Given generating sets of two vector spaces, the Zassenhaus algorithm builds a matrix from the generating sets and applies Gaussian elimination to it. Gaussian elimination is also useful for checking the weights of sampled elements in key generation and encapsulation. In order to perform Gaussian elimination in constant time, we generalized the algorithm presented in [BCS13, Section 6]. The original algorithm is only able to compute the systematic form (if it exists) of the input matrix, while our generalized algorithm is able to compute the row echelon form or the reduced row echelon form. The generalized algorithm turns out to be very efficient even under the constraint of being constant-time. As far as we can tell, we are the first to introduce this generalized algorithm.
• Key generation involves one field inversion in $\mathbb{F}_{2^{mn}}$. We make use of the Itoh-Tsujii algorithm [IT88] to perform the field inversion in $\mathbb{F}_{2^{mn}}$. We found that this is much faster than raising field elements to the power $2^{mn} - 2$.
At the end of each call to the Zassenhaus algorithm, we need to derive a generating set of the intersection for the next call. We use a small loop that is carefully designed to protect the procedure against timing attacks. We also propose an efficient way to generate low-weight elements using our algorithm for Gaussian elimination. Both the loop and the algorithm for generating low-weight elements are presented in Section 5.

Availability of Source Code
We plan to submit our implementation to the eBACS project [Be] so that the source code can be included in SUPERCOP. Our source code will be in the public domain.

Organization
Section 2 reviews the specification of ROLLO and introduces ROLLO+. Section 3 presents how we optimize field multiplications and field inversions. Section 4 presents how we optimize Gaussian elimination. Section 5 presents how we use the techniques in Sections 3 and 4 to carry out the decoding algorithm and to sample low-weight elements. Section 6 shows some experimental results.

ROLLO and ROLLO+
This section reviews the specification of ROLLO (mainly ROLLO-I) and introduces the specification of ROLLO+. In particular, we argue in this section that ROLLO+ is IND-CPA secure as long as ROLLO-I is IND-CPA secure.

Parameter Sets
The parameter sets of ROLLO and ROLLO+ are shown in Table 1. As one can see, ROLLO+ simply takes its parameters from ROLLO. We note that ROLLO-II is claimed to be CCA secure, while ROLLO+-II only targets CPA security.

Finite Fields
Each parameter set of ROLLO and ROLLO+ uses two distinct primes $m$ and $n$. Each $m$ is associated with a degree-$m$ irreducible polynomial $P_m \in \mathbb{F}_2[x]$, and similarly each $n$ is associated with a degree-$n$ irreducible polynomial $P_n \in \mathbb{F}_2[y]$. The list of $P_m$'s and $P_n$'s is shown in Table 2. $P_m$ and $P_n$ are used to construct $\mathbb{F}_{2^m}$ as $\mathbb{F}_2[x]/(P_m)$ and $\mathbb{F}_{2^{mn}}$ as $\mathbb{F}_{2^m}[y]/(P_n)$. A field element $a \in \mathbb{F}_{2^m}$ will always be represented as a vector $(a_0, \dots, a_{m-1}) \in \mathbb{F}_2^m$ such that $a = \sum_i a_i x^i$. A field element $\alpha \in \mathbb{F}_{2^{mn}}$ will be represented as a vector $(\alpha_0, \dots, \alpha_{n-1}) \in \mathbb{F}_{2^m}^n$ such that $\alpha = \sum_i \alpha_i y^i$, if $\alpha$ is a part of a key pair or a ciphertext. Our implementation represents low-weight $\mathbb{F}_{2^{mn}}$ elements in a different way, which will be explained in Section 3.
We note that $P_n$ is called $P$ in ROLLO's specification [ABD+20]. We call the polynomial $P_n$ because this explicitly shows that its degree is $n$. Also, the specification does not mention the field $\mathbb{F}_{2^{mn}}$, even though it is actually used implicitly: the specification defines operations of ROLLO as arithmetic between polynomials over $\mathbb{F}_{2^m}$ modulo $P$ (i.e., modulo $P_n$). For the purpose of this paper, we think it is better to describe operations as arithmetic in $\mathbb{F}_{2^{mn}}$.

Terminologies and Notations
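Given $\alpha = \sum_{i=0}^{n-1} \alpha_i y^i \in \mathbb{F}_{2^{mn}}$ with $\alpha_i = \sum_{k=0}^{m-1} \alpha_{i,k} x^k$, we define $\mathrm{Mat}(\alpha)$ as
$$\mathrm{Mat}(\alpha) = \begin{pmatrix} \alpha_{0,0} & \cdots & \alpha_{0,m-1} \\ \vdots & \ddots & \vdots \\ \alpha_{n-1,0} & \cdots & \alpha_{n-1,m-1} \end{pmatrix} \in \mathbb{F}_2^{n \times m}.$$
Similarly, given $\alpha, \beta \in \mathbb{F}_{2^{mn}}$, we define $\mathrm{Mat}(\alpha, \beta)$ as the vertical concatenation of $\mathrm{Mat}(\alpha)$ and $\mathrm{Mat}(\beta)$.
The support of $\alpha \in \mathbb{F}_{2^{mn}}$, denoted as $\mathrm{Supp}(\alpha)$, is defined as $\mathrm{Supp}(\alpha) = \mathrm{RowSpace}(\mathrm{Mat}(\alpha))$. The support $\mathrm{Supp}(\alpha, \beta)$ is defined in the same way from $\mathrm{Mat}(\alpha, \beta)$.
The rank weight (or simply weight) of $\alpha$, denoted as $\|\alpha\|$, is defined as the dimension of $\mathrm{Supp}(\alpha)$. Similarly, we define the rank weight of $(\alpha, \beta)$, i.e., $\|(\alpha, \beta)\|$, as the dimension of $\mathrm{Supp}(\alpha, \beta)$. The set $S_w^{2n}(\mathbb{F}_{2^m})$ is defined as
$$S_w^{2n}(\mathbb{F}_{2^m}) = \{(\alpha, \beta) \in \mathbb{F}_{2^{mn}}^2 : \|(\alpha, \beta)\| = w\}.$$
We note that the specification instead defines $S_w^{2n}(\mathbb{F}_{2^m})$ as a set of vectors in $\mathbb{F}_{2^m}^{2n}$, so the definition there might look different from (but is actually equivalent to) our definition.
We use $\dim(S)$ to denote the dimension of a linear subspace $S$, and we use $\mathrm{rank}(M)$ to denote the rank of a matrix $M$.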
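The definitions above can be made concrete with a short Python sketch. It assumes, purely for illustration, that an element of $\mathbb{F}_{2^{mn}}$ is stored as a list of $n$ integers, each holding one row of $\mathrm{Mat}(\alpha)$ as a bitmask; the helper names are hypothetical. The elimination used here branches on data and is therefore not constant-time.

```python
def rank_f2(rows):
    """Rank over F_2 of a matrix whose rows are given as integer bitmasks."""
    basis = []
    for r in rows:
        for b in basis:
            r = min(r, r ^ b)   # clear the leading bit of b from r if it is set
        if r:
            basis.append(r)
    return len(basis)

def rank_weight(*elements):
    """Rank weight of one element, or of a tuple such as (alpha, beta)."""
    stacked = [row for element in elements for row in element]  # Mat(alpha, beta)
    return rank_f2(stacked)
```

For example, `rank_weight(alpha, beta)` stacks $\mathrm{Mat}(\alpha)$ on top of $\mathrm{Mat}(\beta)$ and returns the dimension of the joint row space, which is the quantity bounded by $d$ or $r$ in the scheme.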

Decoding Algorithms
ROLLO uses the Rank Support Recover (RSR) algorithm as the decoding algorithm. The algorithm is shown in Algorithm 2. On input an $\mathbb{F}_2$-subspace $F \subset \mathbb{F}_{2^m}$ of dimension $d$ and $s \in \mathbb{F}_{2^{mn}}$ such that $\langle s_1, \dots, s_n \rangle = S \subseteq EF := \langle \{ef \mid e \in E, f \in F\} \rangle$ for some $\mathbb{F}_2$-subspace $E \subset \mathbb{F}_{2^m}$ of dimension $r$, RSR recovers $E$ with a certain probability: RSR is a probabilistic algorithm, so it might fail to recover $E$. Note that $F$ is always represented as a vector $(f_1, \dots, f_d) \in \mathbb{F}_{2^m}^d$ such that $F = \langle f_1, \dots, f_d \rangle$. As we will show in the next subsection, RSR is used as a subroutine of the decapsulation of ROLLO-I. The way decapsulation is defined ensures that $S \subseteq EF$ and thus $\dim(S) \le dr$, as long as the ciphertext is valid. However, as shown in Algorithm 2, RSR does not check whether $\dim(S) \le dr$. Even when $\dim(S) > dr$, in which case the ciphertext must be invalid, RSR still computes and returns $\bigcap_{i=1}^{d} f_i^{-1} S$. As one might expect, the larger $\dim(S)$ is, the more computation it takes to compute $\bigcap_{i=1}^{d} f_i^{-1} S$. A constant-time implementation of RSR thus has to take at least as much time as when $\dim(S) = \min(m, n)$.
In order to avoid spending extra time on processing these invalid ciphertexts, ROLLO+ uses an adapted version of RSR, which we call the RSR+ algorithm. The pseudocode of RSR+ is shown in Algorithm 3. RSR+ checks the dimension of $S$ and only returns $E$ when $\dim(S) \le dr$. When $\dim(S) > dr$, the algorithm simply returns $\bot$. In this way, the running time of the algorithm can be bounded by the time for the case $\dim(S) = dr$.
In our implementation, each of $S$, $f_i^{-1} S$, and $E$ is represented as an array of $\mathbb{F}_{2^m}$ elements that generate the subspace, which will be explained in more detail in Section 5.

Key generation, Encapsulation, Decapsulation
Key generation, encapsulation, and decapsulation of ROLLO-I are depicted in Figure 1. As shown in the figure, Alice starts by generating the public key $h$ and the secret key $(h_1, h_2)$. Then, Alice sends the public key to Bob. On receiving the public key $h$, Bob runs the encapsulation algorithm to generate the ciphertext $c$ and the session key $\mathrm{Hash}(E)$, and he sends $c$ to Alice. Here, $\mathrm{Hash}(E)$ means hashing the reduced row echelon form of the matrix whose rows are formed by the basis elements of $E$. ROLLO uses SHA256 as the hash function. With $c$ and the secret key $(h_1, h_2)$, Alice runs the decoding algorithm to obtain $E$ and the session key $\mathrm{Hash}(E)$. We note that the specification denotes the secret key as $(x, y)$, but we use $(h_1, h_2)$ because we would like to reserve $x$ and $y$ for polynomials.
Key generation, encapsulation, and decapsulation of ROLLO+ are depicted in Figure 2. The specification of ROLLO+ is very similar to that of ROLLO-I. The differences between ROLLO-I and ROLLO+ are listed below.
• ROLLO-I uses RSR as the decoding algorithm, while ROLLO+ uses RSR+.
• In ROLLO-I decapsulation always returns $\mathrm{Hash}(E)$, while in ROLLO+ decapsulation returns $\bot$ if RSR+ returns $\bot$.
• In ROLLO-I the secret key is defined as $(h_1, h_2)$, while in ROLLO+ the secret key is defined as $(h_1, F)$. What decapsulation (of either ROLLO-I or ROLLO+) needs are $h_1$ and $F$, so we think it is more natural to define the secret key in this way.

IND-CPA Security of ROLLO+
We claim that ROLLO+ is IND-CPA secure, as long as ROLLO-I is IND-CPA secure. To see this, let us recall the IND-CPA game as shown in, say, [Pei14]. In the game, the adversary is asked to distinguish between two experiments: both generate a key pair and a ciphertext, but one outputs the encapsulated session key while the other outputs an independent random key. In the experiments, Setup is a function that outputs public parameters, Gen stands for the key generation algorithm, and Encaps stands for the encapsulation algorithm. As one can see, the decapsulation algorithm is not used in the game, and sk is generated but not used.
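For reference, the two experiments can be sketched as follows. This is the standard formulation for key encapsulation mechanisms; the symbols $pp$, $\mathcal{K}$ (the session-key space), and $k^*$ are notation assumed here and may differ from the presentation in [Pei14].

```latex
\begin{align*}
\text{Experiment 0:}\quad
  & pp \leftarrow \mathsf{Setup}();\ (pk, sk) \leftarrow \mathsf{Gen}(pp);\
    (c, k) \leftarrow \mathsf{Encaps}(pk);\ \text{output } (pp, pk, c, k). \\
\text{Experiment 1:}\quad
  & pp \leftarrow \mathsf{Setup}();\ (pk, sk) \leftarrow \mathsf{Gen}(pp);\
    (c, k) \leftarrow \mathsf{Encaps}(pk);\ k^* \leftarrow \mathcal{K};\
    \text{output } (pp, pk, c, k^*).
\end{align*}
```

Neither experiment calls the decapsulation algorithm or reveals $sk$, which is the only property the argument relies on.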
As ROLLO-I and ROLLO+ only differ in the formats of secret keys and the decapsulation algorithms, clearly ROLLO+ must be IND-CPA secure if ROLLO-I is IND-CPA secure.
In fact it is also easy to see that if $(h_1, h_2)$ is the same in ROLLO-I and ROLLO+, then on input the same valid ciphertext $c$, the two decapsulation algorithms will generate the same outputs. This implies that ROLLO+ and ROLLO-I have the same failure rates for decapsulation.

Building Blocks
As mentioned in the introduction, our implementation of ROLLO+ makes use of optimizations for the following building blocks.
• Field multiplication in $\mathbb{F}_{2^{mn}}$: we need to multiply $h_1^{-1}$ and $h_2$ in key generation, multiply $e_2$ and $h$ in encapsulation, and multiply $h_1$ and $c$ in decapsulation. The reader should be aware that in each of the three multiplications, there is one operand of weight bounded by either $d$ or $r$: the weights of $h_2$ and $h_1$ are bounded by $d$, and the weight of $e_2$ is bounded by $r$.
• Gaussian elimination: like previous implementations, our implementation uses the Zassenhaus algorithm to compute the intersection of two vector spaces, which requires computing the row echelon form of matrices. Our implementation also uses Gaussian elimination to check the weights of $(h_1, h_2)$ and $(e_1, e_2)$.
• Field inversion in $\mathbb{F}_{2^{mn}}$: we need to compute the inverse of $h_1$ in key generation.
We emphasize that ROLLO-I and ROLLO-II can be implemented using essentially the same building blocks.

Field Multiplications and Inversions
As described in Section 2.7, we need to multiply elements in $\mathbb{F}_{2^{mn}}$ in key generation, encapsulation, and decapsulation. This section shows how we build our multiplication functions for $\mathbb{F}_{2^m} = \mathbb{F}_2[x]/(P_m)$ and $\mathbb{F}_{2^n} = \mathbb{F}_2[y]/(P_n)$ as subroutines of the multiplication function for $\mathbb{F}_{2^{mn}}$, and shows how we optimize our multiplication function for $\mathbb{F}_{2^{mn}}$ by exploiting the fact that one of the operands is of weight bounded by $d$ or $r$. In key generation, it is necessary to compute the inverse of $h_1$. This section also shows how we implement field inversion in $\mathbb{F}_{2^{mn}}$.

Field multiplications in $\mathbb{F}_{2^m}$ and $\mathbb{F}_{2^n}$
Like many cryptographic implementations that involve multiplications in binary fields, our implementation makes use of the pclmulqdq instruction. Given two 64-bit polynomials in $\mathbb{F}_2[x]$, i.e., binary polynomials of degree at most 63, the instruction computes their product. Take our multiplication function for $\mathbb{F}_{2^{67}}$, shown in Figure 3, as an example. We consider the two operands $a$ and $b$ as polynomials $a = \sum_{i=0}^{66} a_i x^i \in \mathbb{F}_2[x]$ and $b = \sum_{i=0}^{66} b_i x^i \in \mathbb{F}_2[x]$. Let $c = \sum_{i=0}^{66} c_i x^i = ab \bmod P_{67}$. To carry out the field multiplication, $a$ is represented as $(a^{(0)}, a^{(1)})$, where $a^{(0)} = \sum_{i=0}^{63} a_i x^i$ and $a^{(1)} = \sum_{i=0}^{2} a_{i+64} x^i$. Similarly, $b$ is represented as $(b^{(0)}, b^{(1)})$. First, we obtain $a^{(0)} b^{(0)}$, $a^{(0)} b^{(1)}$, $a^{(1)} b^{(0)}$, and $a^{(1)} b^{(1)}$ by using pclmulqdq 4 times. Note that $a^{(1)} b^{(1)}$ consists of 5 bits only. Our goal is to compute
$$c = a^{(0)} b^{(0)} + (a^{(0)} b^{(1)} + a^{(1)} b^{(0)}) x^{64} + a^{(1)} b^{(1)} x^{128} \bmod P_{67}.$$
To carry out the reduction modulo $P_{67}$, let
$$c' = a^{(0)} b^{(0)} + (a^{(0)} b^{(1)} + a^{(1)} b^{(0)}) x^{64} + a^{(1)} b^{(1)} x^{128} = c'^{(0)} + c'^{(1)} x^{64} + c'^{(2)} x^{128},$$
where $\deg(c'^{(i)}) < 64$ for all $i$. Observe that $x^{128} \equiv x^{61}(x^5 + x^2 + x + 1) \pmod{P_{67}}$, so $c \equiv c'^{(0)} + c'^{(1)} x^{64} + c'^{(2)} (x^5 + x^2 + x + 1) x^{61} \pmod{P_{67}}$. We thus compute $c'^{(2)} (x^5 + x^2 + x + 1)$ using one pclmulqdq to obtain
$$c'' = c'^{(0)} + c'^{(1)} x^{64} + c'^{(2)} (x^5 + x^2 + x + 1) x^{61} = c''^{(0)} + c''^{(1)} x^{64},$$
where $\deg(c''^{(i)}) < 64$ for all $i$. Finally, observe that
$$c''^{(1)} x^{64} = \sum_{i=0}^{2} c''^{(1)}_i x^{64+i} + x^{67} \sum_{i=0}^{60} c''^{(1)}_{i+3} x^i \equiv \sum_{i=0}^{2} c''^{(1)}_i x^{64+i} + (x^5 + x^2 + x + 1) \sum_{i=0}^{60} c''^{(1)}_{i+3} x^i \pmod{P_{67}}.$$
We thus compute $\big(\sum_{i=0}^{60} c''^{(1)}_{i+3} x^i\big)(x^5 + x^2 + x + 1)$ by using pclmulqdq to obtain
$$c = c''^{(0)} + \sum_{i=0}^{2} c''^{(1)}_i x^{64+i} + \Big(\sum_{i=0}^{60} c''^{(1)}_{i+3} x^i\Big)(x^5 + x^2 + x + 1).$$
In addition to multiplications in $\mathbb{F}_{2^m}$, we also implemented multiplications in $\mathbb{F}_{2^n}$ using basically the same strategy. As described in the following subsection, our implementation uses both the multiplication functions for $\mathbb{F}_{2^m}$ and $\mathbb{F}_{2^n}$ to carry out multiplications in $\mathbb{F}_{2^{mn}}$. Note that this differs from previous implementations, as they do not use any multiplication function for $\mathbb{F}_{2^n}$.
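The multiplication and two-step reduction for $\mathbb{F}_{2^{67}}$ can be modelled in Python as follows. This is a sketch under stated assumptions: Python integers stand in for 64-bit limbs, `clmul64` is a hypothetical software stand-in for pclmulqdq, and $P_{67} = x^{67} + x^5 + x^2 + x + 1$ is inferred from the congruence $x^{128} \equiv x^{61}(x^5 + x^2 + x + 1)$ rather than read from Table 2. A bit-by-bit reference multiplication is included for comparison.

```python
MASK64 = (1 << 64) - 1
R = 0b100111                 # x^5 + x^2 + x + 1
P67 = (1 << 67) | R          # assumed P_67 = x^67 + x^5 + x^2 + x + 1

def clmul64(a, b):
    """Carry-less product of two polynomials of degree < 64 (models pclmulqdq)."""
    r = 0
    for i in range(64):
        if (b >> i) & 1:
            r ^= a << i
    return r

def mul_f2_67(a, b):
    a0, a1 = a & MASK64, a >> 64
    b0, b1 = b & MASK64, b >> 64
    # c' from four carry-less multiplications
    c1 = (clmul64(a0, b0)
          ^ ((clmul64(a0, b1) ^ clmul64(a1, b0)) << 64)
          ^ (clmul64(a1, b1) << 128))
    # fold c'(2) with x^128 = x^61 (x^5 + x^2 + x + 1)
    c2 = (c1 & ((1 << 128) - 1)) ^ (clmul64(c1 >> 128, R) << 61)
    # fold bits 67..127 with x^67 = x^5 + x^2 + x + 1
    return (c2 & ((1 << 67) - 1)) ^ clmul64(c2 >> 67, R)

def mul_reference(a, b):
    """Schoolbook product followed by bit-by-bit reduction modulo P_67."""
    c = 0
    for i in range(67):
        if (b >> i) & 1:
            c ^= a << i
    for i in range(c.bit_length() - 1, 66, -1):
        if (c >> i) & 1:
            c ^= P67 << (i - 67)
    return c
```

Both folds multiply by the same 6-bit constant, which is why two extra carry-less multiplications suffice for the reduction.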

Accelerating Multiplications in $\mathbb{F}_{2^{mn}}$ with Matrix Transposition
To multiply two elements in $\mathbb{F}_{2^{mn}}$, all previous implementations represent the operands as polynomials over $\mathbb{F}_{2^m}$ and carry out a generic polynomial multiplication using arithmetic in $\mathbb{F}_{2^m}$. However, each of the three multiplications mentioned in Section 2.7 involves one element of weight bounded by $d$ or $r$, and we found that this can be exploited to make the multiplications much faster.
For each of the three multiplications, the task is to multiply $v, w \in \mathbb{F}_{2^{mn}}$, where $(v, w)$ is either $(h_2, h_1^{-1})$, $(h_1, c)$, or $(e_2, h)$. Let $t = d$ if $v \in \{h_1, h_2\}$ or $t = r$ if $v = e_2$. Let $\beta_1, \dots, \beta_t$ be a basis of $\mathrm{Supp}(h_1, h_2)$ if $v \in \{h_1, h_2\}$ or a basis of $\mathrm{Supp}(e_1, e_2)$ if $v = e_2$. We represent $w$ as $(w_0, w_1, \dots, w_{n-1}) \in \mathbb{F}_{2^m}^n$ such that $w = w_0 + w_1 y + \cdots + w_{n-1} y^{n-1}$. In the meantime we represent $v$ as two vectors $(\alpha_1, \dots, \alpha_t) \in \mathbb{F}_{2^n}^t$ and $(\beta_1, \dots, \beta_t) \in \mathbb{F}_{2^m}^t$, such that
$$v = \alpha_1 \beta_1 + \alpha_2 \beta_2 + \cdots + \alpha_t \beta_t.$$
To see why $v$ can be represented in this way, let $v = v_0 + v_1 y + \cdots + v_{n-1} y^{n-1}$ such that $v_i \in \mathbb{F}_{2^m}$ for all $i$. Suppose $v_i = \sum_j \alpha_{j,i} \beta_j$ where each $\alpha_{j,i} \in \mathbb{F}_2$. Then $v$ can be rewritten as
$$v = \sum_{i=0}^{n-1} \Big(\sum_{j=1}^{t} \alpha_{j,i} \beta_j\Big) y^i = \sum_{j=1}^{t} \beta_j \sum_{i=0}^{n-1} \alpha_{j,i} y^i.$$
Now, letting $\alpha_j = \sum_{i=0}^{n-1} \alpha_{j,i} y^i \in \mathbb{F}_{2^n}$, we have $v = \alpha_1 \beta_1 + \alpha_2 \beta_2 + \cdots + \alpha_t \beta_t$. Following the discussion above, we have $wv = (w\beta_1)\alpha_1 + \cdots + (w\beta_t)\alpha_t$.
Since $w$ is represented as an array of $n$ elements in $\mathbb{F}_{2^m}$, each $w\beta_i$ can be obtained by calling the multiplication function for $\mathbb{F}_{2^m}$ $n$ times. Let
$$\gamma_i = w\beta_i = \sum_{j=0}^{n-1} \gamma_{i,j} y^j,$$
where each $\gamma_{i,j} \in \mathbb{F}_{2^m}$. Now the task is to compute $\gamma_i \alpha_i$ for each $i$. Mathematically, we can view $\mathbb{F}_{2^{mn}}$ as $\mathbb{F}_{2^n}[x]/(P_m)$, which means $\gamma_i$ can be written as $\gamma'_{i,0} + \gamma'_{i,1} x + \cdots + \gamma'_{i,m-1} x^{m-1}$, where each $\gamma'_{i,k} \in \mathbb{F}_{2^n}$. Once we have $\gamma'_{i,0}, \dots, \gamma'_{i,m-1}$, it is easy to multiply each $\gamma_i$ by $\alpha_i$: we can simply call our multiplication function for $\mathbb{F}_{2^n}$ $m$ times. But how can we derive the $\gamma'_{i,k}$'s from the $\gamma_{i,j}$'s? Let $\gamma_{i,j} = \sum_{k=0}^{m-1} \gamma_{i,j,k} x^k$, where $\gamma_{i,j,k} \in \mathbb{F}_2$. Then we have
$$\gamma_i = \sum_{j=0}^{n-1} \sum_{k=0}^{m-1} \gamma_{i,j,k} x^k y^j = \sum_{k=0}^{m-1} \Big(\sum_{j=0}^{n-1} \gamma_{i,j,k} y^j\Big) x^k.$$
In other words, $\gamma'_{i,k} = \sum_{j=0}^{n-1} \gamma_{i,j,k} y^j$ for $k = 0, \dots, m-1$. To see how our implementation obtains $\gamma'_{i,k}$ for $k = 0, \dots, m-1$, it is useful to consider the matrix
$$\mathrm{Mat}(\gamma_i) = \begin{pmatrix} \gamma_{i,0,0} & \cdots & \gamma_{i,0,m-1} \\ \vdots & \ddots & \vdots \\ \gamma_{i,n-1,0} & \cdots & \gamma_{i,n-1,m-1} \end{pmatrix}.$$
$\gamma_i$ is stored as an array of $n$ $\mathbb{F}_{2^m}$ elements $\gamma_{i,0}, \dots, \gamma_{i,n-1}$, meaning that the matrix is stored in a row-major fashion. Our implementation performs a matrix transposition to obtain an array of $m$ $\mathbb{F}_{2^n}$ elements $\gamma'_{i,0}, \dots, \gamma'_{i,m-1}$. We then multiply each $\gamma'_{i,k}$ by $\alpha_i$ using our multiplication function for $\mathbb{F}_{2^n}$ to obtain $\gamma_i \alpha_i$.
For ease of implementation, we actually augment the $n \times m$ matrix to obtain a $256 \times 128$ matrix. This is possible as $m < 128$ and $n < 256$. We then use an assembly function transpose_64x256_sp_asm to transpose the $256 \times 128$ matrix. What transpose_64x256_sp_asm does is essentially transposing four $64 \times 64$ matrices in parallel. In total there are eight $64 \times 64$ submatrices in the $256 \times 128$ matrix, so to complete the matrix transposition we need to call transpose_64x256_sp_asm twice. The algorithm used by transpose_64x256_sp_asm to transpose each $64 \times 64$ matrix has been explained in [Cho17, Section 2], and the function simply vectorizes the algorithm so that 4 matrix transpositions can be carried out in parallel using logical instructions for YMM registers. We note that the assembly function is included in the source code of Classic McEliece [ABC+20]. The source code of Classic McEliece, including the function, is in the public domain.
We make use of lazy reduction to save operations. To multiply the $\gamma'_{i,k}$'s by $\alpha_i$, we do not actually use the complete multiplication function for $\mathbb{F}_{2^n}$. Instead, we only reduce the results so that they can fit into 256 bits. Then, we compute the array of $m$ $\mathbb{F}_{2^n}$ elements $\sum_i \gamma'_{i,0} \alpha_i, \dots, \sum_i \gamma'_{i,m-1} \alpha_i$ and reduce them fully. Finally, we transpose the matrix corresponding to this array to obtain an array of $n$ $\mathbb{F}_{2^m}$ elements as the default representation of $wv$.
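The identity $wv = \sum_i (w\beta_i)\alpha_i$ and the role of the transposition can be checked on a toy field with the following Python sketch. Everything concrete in it is an assumption chosen for illustration: the toy parameters $m = 5$, $n = 3$ with $P_5 = x^5 + x^2 + 1$ and $P_3 = y^3 + y + 1$, the bitmask encoding, and all function names. It does not model the 256 x 128 padding, the assembly transposition, or lazy reduction.

```python
M, PM = 5, 0b100101   # toy P_m = x^5 + x^2 + 1 (assumed)
N, PN = 3, 0b1011     # toy P_n = y^3 + y + 1 (assumed)

def mul_small(a, b, p, deg):
    """Multiply two binary polynomials and reduce modulo p of degree deg."""
    c = 0
    for i in range(deg):
        if (b >> i) & 1:
            c ^= a << i
    for i in range(c.bit_length() - 1, deg - 1, -1):
        if (c >> i) & 1:
            c ^= p << (i - deg)
    return c

def mul_generic(v, w):
    """Schoolbook product in F_{2^m}[y]/(P_n); elements are lists of n ints."""
    prod = [0] * (2 * N - 1)
    for i in range(N):
        for j in range(N):
            prod[i + j] ^= mul_small(v[i], w[j], PM, M)
    for k in range(2 * N - 2, N - 1, -1):      # replace y^k using y^N = P_n - y^N
        for t in range(N):
            if (PN >> t) & 1:
                prod[k - N + t] ^= prod[k]
        prod[k] = 0
    return prod[:N]

def transpose(rows, ncols):
    """Transpose a bit matrix stored as a list of row integers."""
    return [sum(((r >> k) & 1) << j for j, r in enumerate(rows))
            for k in range(ncols)]

def expand(alphas, betas):
    """Coefficients v_i = sum_j alpha_{j,i} beta_j of v = sum_j alpha_j beta_j."""
    v = [0] * N
    for a, b in zip(alphas, betas):
        for i in range(N):
            if (a >> i) & 1:
                v[i] ^= b
    return v

def mul_low_weight(w, alphas, betas):
    acc = [0] * M                                        # m elements of F_{2^n}
    for a, b in zip(alphas, betas):
        gamma = [mul_small(wj, b, PM, M) for wj in w]    # w * beta_i: n mults in F_{2^m}
        for k, g in enumerate(transpose(gamma, M)):      # gamma'_{i,k} in F_{2^n}
            acc[k] ^= mul_small(g, a, PN, N)             # accumulate gamma'_{i,k} * alpha_i
    return transpose(acc, N)                             # back to n elements of F_{2^m}
```

With $t$ basis elements, the low-weight path uses $tn$ multiplications in $\mathbb{F}_{2^m}$ and $tm$ multiplications in $\mathbb{F}_{2^n}$, instead of the $n^2$ multiplications in $\mathbb{F}_{2^m}$ of the schoolbook product.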

Field Inversions
We make use of the Bernstein-Yang constant-time GCD algorithm [BY19] to perform field inversions in $\mathbb{F}_{2^m}$. The basic version of the algorithm for modular inversion is shown in the code segments in Figures 5.1 and 6.1 of [BY19], and in [BY19, Section 7.1] an adapted version for polynomials over $\mathbb{F}_3$ is shown. We consider the $\mathbb{F}_{2^m}$ element as a polynomial over $\mathbb{F}_2$ and compute its inverse modulo $P_m \in \mathbb{F}_2[x]$. The algorithm we use to compute inversions in $\mathbb{F}_{2^m}$ is adapted from the algorithm for polynomials over $\mathbb{F}_3$. The only modification we made is to replace the variable c in each step by $g_0$.
To invert $\alpha \in \mathbb{F}_{2^{mn}}$, the Itoh-Tsujii algorithm first computes $\alpha^{\gamma-1}$, where $\gamma = (2^{mn} - 1)/(2^m - 1)$. Once we have $\alpha^{\gamma-1}$, $\alpha^{\gamma}$ is computed as $\alpha \cdot \alpha^{\gamma-1}$. Observe that $2^{mn} - 1 = (2^m - 1)\gamma$ and thus $(\alpha^{\gamma})^{2^m - 1} = \alpha^{2^{mn} - 1} = 1$. This implies that $\alpha^{\gamma} \in \mathbb{F}_{2^m}$. Therefore, from $\alpha^{\gamma-1}$ and $\alpha^{\gamma}$, we compute $\alpha^{-1}$ using 1 field inversion and $n$ field multiplications in $\mathbb{F}_{2^m}$. Note that each $\mathrm{frob}_{2^i}$ is a linear operation: if the input and output are viewed as vectors in $\mathbb{F}_{2^m}^n$, the operation simply multiplies the input by an $n \times n$ matrix over $\mathbb{F}_2$ to obtain the output. $\mathrm{frob}_{2^i}$ is thus very cheap for any $i$. In the Itoh-Tsujii algorithm, we have to perform a few generic multiplications in $\mathbb{F}_{2^{mn}}$. These multiplications are implemented with four layers of Karatsuba, which appears to provide the best performance for the parameter sets.
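The final step of the inversion can be written out as a short derivation. The product expansion of $\alpha^{\gamma-1}$ below is the textbook form of the Itoh-Tsujii exponent and is stated as background; the particular addition chain and Frobenius maps used by the implementation are not reproduced here.

```latex
\begin{align*}
\gamma &= \frac{2^{mn}-1}{2^m-1} = \sum_{i=0}^{n-1} 2^{mi},
&
\alpha^{\gamma-1} &= \prod_{i=1}^{n-1} \alpha^{2^{mi}}, \\
\alpha^{\gamma} &= \alpha \cdot \alpha^{\gamma-1} \in \mathbb{F}_{2^m},
&
\alpha^{-1} &= \left(\alpha^{\gamma}\right)^{-1} \cdot \alpha^{\gamma-1}.
\end{align*}
```

Since $(\alpha^{\gamma})^{-1}$ lies in $\mathbb{F}_{2^m}$ and $\alpha^{\gamma-1}$ has $n$ coefficients in $\mathbb{F}_{2^m}$, the last product costs $n$ multiplications in $\mathbb{F}_{2^m}$, matching the operation count given above.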

Gaussian Elimination
As described in Section 2.7, we use Gaussian elimination to generate $(h_1, h_2) \in S_d^{2n}(\mathbb{F}_{2^m})$ and $(e_1, e_2) \in S_r^{2n}(\mathbb{F}_{2^m})$, and to perform the Zassenhaus algorithm. This section presents how we implement Gaussian elimination to reduce matrices into row echelon form or reduced row echelon form. We note that our implementation computes the row echelon form instead of the reduced row echelon form whenever possible to save computation.

A Constant-time Algorithm for Computing Systematic Form
Assume that the matrix $A \in \mathbb{F}_2^{\mu \times \nu}$ has systematic form. In [BCS13, Section 6], the authors describe an algorithm for computing the systematic form of $A$. We denote the $i$th row of $A$ as $A_i$. The algorithm consists of the following steps.
1. Set $p$ to 1.
2. For each $i \in \{p+1, \dots, \mu\}$, add $A_i$ to $A_p$ if $A_{p,p} = 0$.
3. For each $i \in \{1, \dots, \mu\} \setminus \{p\}$, add $A_p$ to $A_i$ if $A_{i,p} = 1$.
4. If $p < \mu$, increase $p$ by 1 and go back to Step 2.
In each iteration of the algorithm, column $p$ of the matrix is scanned to search for the $p$th pivot.
Step 2 sets $A_{p,p}$ to 1 by conditionally adding row $i$ to row $p$ for all $i > p$. Step 3 sets all $A_{i,p}$'s with $i \ne p$ to 0 by conditionally adding row $p$ to each row $i$. The values of $p$ and $i$ at any point of the algorithm do not depend on the entries of the input matrix, so the algorithm is constant-time even if $A_i$, $A_p$, $A_{p,p}$, $A_{i,p}$ are accessed using memory load/store instructions.
Note that the systematic form is the same as the reduced row echelon form for $A$. If in Step 3 the set $\{1, \dots, \mu\} \setminus \{p\}$ is replaced by $\{p+1, \dots, \mu\}$, the row echelon form of $A$ will be computed instead of the reduced row echelon form. However, for matrices without systematic form, it is not guaranteed that the algorithm above will generate the reduced row echelon form, and it is also not guaranteed that the corresponding algorithm with $\{p+1, \dots, \mu\}$ in Step 3 will generate the row echelon form.
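A minimal Python model of this procedure is sketched below. The encoding is an assumption for illustration: each row is an integer, column $j$ (0-indexed here, unlike the 1-indexed text) is bit $j$, and conditional additions are expressed by ANDing with an all-ones or all-zeros mask. The function name and the `reduced` flag are hypothetical.

```python
def systematic_form(rows, reduced=True):
    """Masked elimination for a matrix that is assumed to have systematic form."""
    rows = list(rows)
    mu = len(rows)
    for p in range(mu):
        # Make A[p][p] equal to 1 by conditionally adding the rows below row p.
        for i in range(p + 1, mu):
            need = 1 - ((rows[p] >> p) & 1)      # 1 if A[p][p] is still 0
            rows[p] ^= rows[i] & -need
        # Clear column p in the other rows (only the rows below if not reduced).
        targets = range(mu) if reduced else range(p + 1, mu)
        for i in targets:
            if i != p:                           # comparison of public indices
                bit = (rows[i] >> p) & 1
                rows[i] ^= rows[p] & -bit
    return rows
```

The loop bounds and the indices `p` and `i` never depend on matrix entries; only the masks do, which mirrors the argument for why the algorithm is constant-time.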

Computing Row Echelon Form and Reduced Row Echelon Form
We generalized the algorithm in Section 4.1 so that the reduced row echelon form of any $A$ can be computed. The main idea is to search for the column index $j$ of pivot $p$ in each iteration. Once the column index is found, we can use steps that are similar to Steps 2 and 3 of the algorithm in Section 4.1 to set $A_{p,j}$ to 1 and set all $A_{i,j}$ with $i \ne p$ to 0. Our algorithm consists of the following steps.
1. Set $p$ to 1.
2. Set $v$ to the logical OR of $A_p, \dots, A_\mu$.
3. Find the index $j$ of the first nonzero entry in $v$. If $v = 0$, set $j$ to any value in $\{1, \dots, \nu\}$.
4. For each $i \in \{p+1, \dots, \mu\}$, add $A_i$ to $A_p$ if $A_{p,j} = 0$.
5. For each $i \in \{1, \dots, \mu\} \setminus \{p\}$, add $A_p$ to $A_i$ if $A_{i,j} = 1$.
6. If $p < \min(\mu, \nu)$, increase $p$ by 1 and go back to Step 2.
In each iteration of the algorithm, we search for the $p$th pivot in the submatrix formed by $A_p, \dots, A_\mu$.
Steps 2 and 3 find the column index $j$ of the pivot.
Step 4 ensures that $A_{p,j} = 1$ by conditionally adding $A_i$ with $i > p$ to row $A_p$. Step 5 ensures that $A_{i,j} = 0$ for all $i \ne p$ by conditionally adding $A_p$ to other rows.
We note that the algorithm works even if the input matrix is not full rank. Indeed, in this case, at the end of iteration $\mathrm{rank}(A)$, the matrix must be in reduced row echelon form. In the last $\min(\mu, \nu) - \mathrm{rank}(A)$ iterations, Steps 4 and 5 will not change the matrix because $A_p, \dots, A_\mu$ are always zero. We can modify the set in Step 5 to $\{p+1, \dots, \mu\}$ so that the algorithm computes only the row echelon form.
We have explained why the algorithm always computes the reduced row echelon form and how it can be adapted to compute the row echelon form. The remaining question is how to make the algorithm constant-time. In fact, it is not hard to see that the algorithm is constant-time if finding the first nonzero entry of $v$ (Step 3) and the accesses of $A_{p,j}$ (Step 4) and $A_{i,j}$ (Step 5) are made constant-time. Below we show that these steps can be made constant-time by using AVX/AVX2 intrinsics.
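The generalized algorithm can be modelled in Python along the same lines. The sketch assumes that column $j$ of a row is bit $j$ of an integer, with the least significant bit as the "first" column and 0-indexed rows and columns; the function name and the `reduced` flag are hypothetical. The lookup of $j$ uses ordinary Python arithmetic and a branch on `v`, so it models the logic only and not the constant-time AVX2 realization.

```python
def echelon_form(rows, nu, reduced=True):
    """Row echelon or reduced row echelon form over F_2 via pivot-column search."""
    rows = list(rows)
    mu = len(rows)
    for p in range(min(mu, nu)):
        v = 0
        for i in range(p, mu):                   # Step 2: OR of rows p..mu-1
            v |= rows[i]
        # Step 3: index of the first nonzero entry of v (any column if v == 0).
        j = (v & -v).bit_length() - 1 if v else 0
        for i in range(p + 1, mu):               # Step 4: force A[p][j] = 1
            need = 1 - ((rows[p] >> j) & 1)
            rows[p] ^= rows[i] & -need
        targets = range(mu) if reduced else range(p + 1, mu)
        for i in targets:                        # Step 5: clear column j elsewhere
            if i != p:
                bit = (rows[i] >> j) & 1
                rows[i] ^= rows[p] & -bit
    return rows
```

On a rank-deficient input such as `[0b0110, 0b0110, 0b1100]` with `nu = 4`, the last iteration sees `v == 0` and leaves the matrix unchanged, as described above.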

Implementing Our Gaussian Elimination algorithm
Consider $128 < \nu \le 256$. In this case, our implementation represents $A$ as an array of $\mu$ 256-bit vectors of type __m256i. In each iteration of the algorithm in the previous subsection, the vector $v$ is thus computed using $\mu - p$ ORs between the 256-bit vectors. We then use the function vec256_find_first_one in Figure 4 to find the index of the first one of $v$.
To understand how the function vec256_find_first_one works, let us first consider the case when $v$ is not a zero vector. We first use the intrinsic _mm256_cmpeq_epi64 to generate a 256-bit vector mask where each of the four 64-bit blocks is set to 0xFF...F if the corresponding block in $v$ is 0, or 0x00...0 if the corresponding block in $v$ is not 0. The 256-bit vector is then compressed into a 32-bit value mask_one using _mm256_movemask_epi8, such that each byte (of value either 0xFF or 0x00) in mask is reduced to the complement of its most significant bit. Then, we use the intrinsic _tzcnt_u64 to compute the index of the first (i.e., the least significant) 1 in the 32-bit value. _tzcnt_u64 is compiled into the tzcnt instruction. The tzcnt instruction returns the index of the least significant 1 of the operand, or 64 if the operand is 0. The return value of _tzcnt_u64 thus lies in $\{0, 8, 16, 24\}$. The right shift ensures that the value of index is 0, 4, 8, or 12 when the index of the first nonzero 64-bit block is 0, 1, 2, or 3, respectively. We then shift each 64-bit block of _mm256_set1_epi64x(0x753100006420) right by index bits with _mm256_srli_epi64. The output of _mm256_srli_epi64, again called mask, can be passed to _mm256_permutevar8x32_epi32 to broadcast the 64-bit block of a 256-bit vector whose index equals that of the first nonzero block in $v$. We then use mask to broadcast the first nonzero 64-bit block of $v$ and use _mm256_extract_epi64(v, 0) to obtain the first nonzero block.
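The block-selection steps can be traced with a plain-integer model in Python. This is a sketch, not the code of Figure 4: a 256-bit vector is modelled as a Python integer with block 0 in the least significant 64 bits, each intrinsic is replaced by equivalent integer arithmetic, and the shift amount of 1 is inferred from the stated mapping of 0, 8, 16, 24 to 0, 4, 8, 12. The function name is hypothetical and the input is assumed to be nonzero.

```python
MASK64 = (1 << 64) - 1
SELECTOR = 0x753100006420   # constant quoted in the text

def first_nonzero_block(v):
    """Model of the selection of the first nonzero 64-bit block of nonzero v."""
    blocks = [(v >> (64 * k)) & MASK64 for k in range(4)]
    # Compare each block with zero, then keep one mask bit per byte (8 per block).
    zero_bits = 0
    for k, b in enumerate(blocks):
        zero_bits |= (0xFF if b == 0 else 0x00) << (8 * k)
    nonzero_bits = ~zero_bits & 0xFFFFFFFF                  # complemented 32-bit mask
    tz = (nonzero_bits & -nonzero_bits).bit_length() - 1    # tzcnt: 0, 8, 16 or 24
    index = tz >> 1                                          # 0, 4, 8 or 12
    shifted = SELECTOR >> index                              # per-block right shift
    # The lane permutation reads the low 3 bits of each 32-bit index lane.
    lo_lane, hi_lane = shifted & 7, (shifted >> 32) & 7
    lanes = [(v >> (32 * i)) & 0xFFFFFFFF for i in range(8)]
    return lanes[lo_lane] | (lanes[hi_lane] << 32)
```

The nibbles of 0x6420 and 0x7531 are the even and odd 32-bit lane numbers, so shifting by 4 bits per block index selects the lane pair (2k, 2k+1) that makes up 64-bit block k.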
A minor optimization in our implementation is that we use Montgomery's trick to compute all $f_i^{-1}$'s: we compute $f_1, f_1 f_2, \dots, f_1 \cdots f_n$, compute $(f_1 \cdots f_n)^{-1}$, and finally compute $f_n^{-1}$ as $(f_1 \cdots f_{n-1}) \cdot (f_1 \cdots f_n)^{-1}$, compute $f_{n-1}^{-1}$ as $(f_1 \cdots f_{n-2}) \cdot ((f_1 \cdots f_n)^{-1} \cdot f_n)$, and so on. Each subspace $f_i^{-1} \langle s'_1, \dots, s'_{dr} \rangle$ is represented as an array containing $f_i^{-1} s'_1, \dots, f_i^{-1} s'_{dr}$.
Following previous implementations of ROLLO, we compute the intersection of two subspaces using the Zassenhaus algorithm. Let $U$ and $V$ be two matrices over the same field and with the same number of columns. The Zassenhaus algorithm applies Gaussian elimination to the matrix
$$Z = \begin{pmatrix} U & U \\ V & 0 \end{pmatrix}$$
to obtain a matrix of the form
$$Z' = \begin{pmatrix} A & C \\ 0 & B \\ 0 & 0 \end{pmatrix},$$
where $A$ and $B$ do not have any zero rows. It is guaranteed that $\mathrm{RowSpace}(A) = \mathrm{RowSpace}(U) + \mathrm{RowSpace}(V)$ and $\mathrm{RowSpace}(B) = \mathrm{RowSpace}(U) \cap \mathrm{RowSpace}(V)$.
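As a small worked example of the Zassenhaus algorithm, the Python sketch below builds the block matrix from two generating sets over $\mathbb{F}_2$ and reads off bases of the sum and of the intersection. The encoding (each row an integer, left half in the high bits) and the function names are assumptions for illustration, and the elimination here uses pivot searches and swaps, so it is a textbook version rather than the constant-time algorithm of Section 4.2.

```python
def row_echelon(rows, width):
    """Reduced row echelon form over F_2, pivoting from the most significant bit."""
    rows = list(rows)
    p = 0
    for col in range(width - 1, -1, -1):
        pivot = next((i for i in range(p, len(rows)) if (rows[i] >> col) & 1), None)
        if pivot is None:
            continue
        rows[p], rows[pivot] = rows[pivot], rows[p]
        for i in range(len(rows)):
            if i != p and (rows[i] >> col) & 1:
                rows[i] ^= rows[p]
        p += 1
    return rows

def zassenhaus(U, V, m):
    """Return (basis of RowSpace(U) + RowSpace(V), basis of their intersection)."""
    Z = [(u << m) | u for u in U] + [v << m for v in V]   # rows (U | U) and (V | 0)
    Z = row_echelon(Z, 2 * m)
    sum_basis = [z >> m for z in Z if z >> m]             # rows (A | C): keep A
    inter_basis = [z for z in Z if z and not (z >> m)]    # rows (0 | B): keep B
    return sum_basis, inter_basis
```

For $U = \langle 100, 010 \rangle$ and $V = \langle 010, 001 \rangle$ in $\mathbb{F}_2^3$, the call `zassenhaus([0b100, 0b010], [0b010, 0b001], 3)` returns a three-element basis of the sum and the single intersection vector `0b010`.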
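Therefore, in the first iteration of Step 3, to compute $f_1^{-1}\langle s'_1, \dots, s'_{dr} \rangle \cap f_2^{-1}\langle s'_1, \dots, s'_{dr} \rangle$, we set row $i$ of $U$ to the vector formed by $f_1^{-1} s'_i$ for $i \in \{1, \dots, dr\}$, set row $i$ of $V$ to the vector formed by $f_2^{-1} s'_i$ for $i \in \{1, \dots, dr\}$, and apply the algorithm in Section 4.2 to the resulting matrix $Z$ to compute its row echelon form $Z'$. Note that $Z$ and $Z'$ are $2dr \times 2m$ matrices.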
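To make the construction concrete, here is a small sketch of the Zassenhaus algorithm over $\mathbb{F}_2$. It is illustrative only: the elimination below is a plain, branching Gaussian elimination and not the constant-time algorithm of Section 4.2, each row of the stacked matrix is packed into one 64-bit word (so it assumes at most 32 columns per half), and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Row echelon form over F_2. Pivots are searched from the most significant
   bit down, so the left half (high 32 bits) is eliminated first.
   Plain elimination with branches: NOT constant-time. */
static void row_echelon(uint64_t *z, size_t rows)
{
    size_t rank = 0;
    for (int col = 63; col >= 0 && rank < rows; col--) {
        size_t p = rank;
        while (p < rows && !((z[p] >> col) & 1u))
            p++;
        if (p == rows)
            continue;                                   /* no pivot in this column */
        uint64_t t = z[p]; z[p] = z[rank]; z[rank] = t; /* move pivot row up */
        for (size_t r = rank + 1; r < rows; r++)
            if ((z[r] >> col) & 1u)
                z[r] ^= z[rank];
        rank++;
    }
}

/* Builds the stacked matrix with rows (u_i | u_i) and (v_i | 0), reduces it,
   and copies the rows whose left half is zero and right half is nonzero.
   These rows form a basis of the intersection of the two row spaces.
   Returns the number of rows written to out. Requires k <= 32. */
size_t zassenhaus_intersection(const uint32_t *u, const uint32_t *v,
                               size_t k, uint32_t *out)
{
    uint64_t z[64];
    assert(k <= 32);
    for (size_t i = 0; i < k; i++) {
        z[i] = ((uint64_t)u[i] << 32) | u[i];
        z[k + i] = (uint64_t)v[i] << 32;
    }
    row_echelon(z, 2 * k);

    size_t n = 0;
    for (size_t i = 0; i < 2 * k; i++)
        if ((z[i] >> 32) == 0 && z[i] != 0)
            out[n++] = (uint32_t)z[i];
    return n;
}
```

For example, with rows {001, 010} for the first space and {010, 100} for the second, the only surviving row with a zero left half is 010, the single basis vector of the intersection. Note that the final copy loop branches on the data, which is exactly the kind of leakage a constant-time implementation has to avoid.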
After the Gaussian elimination, the rows of $B$ form a basis of the intersection. We need to extract the basis or a generating set of the intersection so that we can continue with the next iteration of Step 3. This might seem to be a trivial task, but it is actually not. A naive implementation can easily leak secret information (e.g., the dimension of the intersection) through timing.
Let $\Delta = dr$. To obtain a generating set with $\Delta$ elements of the intersection in constant time, our implementation sets $Z^{(L)}$ to the left half of $Z'$, sets $Z^{(R)}$ to the right half of $Z'$, and carries out the following steps for $i = 1, \dots, \Delta$.
Each iteration can be easily converted into a sequence of logical operations, making the loop constant-time. We claim that after the last iteration, the first $dr$ rows of $Z^{(R)}$ will form a generating set with $dr$ elements of the intersection.
It might be easier to see why our claim is true by analysing how the loop changes $Z^{(R)}$ in different cases. Let $\mu$ and $\nu$ be the number of rows of $A$ and $B$, respectively. Then $Z'$ can be written as
$$Z' = \begin{pmatrix} A & C \\ 0 & B \\ 0 & 0 \end{pmatrix},$$
where $C$ is a $\mu \times m$ matrix, and the 0 at the bottom right corner is a matrix of $m$ columns and $2dr - \mu - \nu$ (which can be 0) rows. Note that we must have $\mu \ge \nu$ and $\nu \le dr$. We consider the following three cases: 1) $\mu \ge dr$, 2) $\mu < dr$ and $\mu + \nu \ge dr$, and 3) $\mu < dr$ and $\mu + \nu < dr$.

In the first case, at the end of the loop, the first $dr$ rows of $Z^{(R)}$ will become $\begin{pmatrix} 0 \\ B \\ 0 \end{pmatrix}$, where the zero matrix at the bottom has $2dr - \mu - \nu$ rows, and the zero matrix on the top has $dr - \nu - (2dr - \mu - \nu) = \mu - dr$ rows. In the second case, the first $dr$ rows of $Z^{(R)}$ will consist of the rows of $B$ split into two groups with a zero matrix in between, where the zero matrix in the middle has $dr - \nu$ rows. In the third case, the first $dr$ rows of $Z^{(R)}$ will become $\begin{pmatrix} 0 \\ B \\ 0 \end{pmatrix}$, where the zero matrix on the top has $\mu$ rows, and the zero matrix at the bottom has $dr - \mu - \nu$ rows. In other words, the first $dr$ rows of $Z^{(R)}$ always form a generating set of the intersection after the loop is carried out, which shows the correctness of our claim.

At the end of Step 3, we obtain a $dr \times m$ matrix that forms a generating set of $E$. For decapsulation, we compute the reduced row echelon form of this matrix to obtain a unique representation of $E$ for hashing.
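The following sketch shows one branch-free loop whose output matches the three cases analysed above. It is an assumption on our side, reconstructed from the case analysis rather than copied from the implementation: for each of the first $\Delta$ rows, if the left half of the row is nonzero (the row belongs to $A$), its right half is replaced by the right half of the row $\Delta$ positions below, or by zero if that lower row also belongs to $A$. Each half-row is a single 64-bit word here (so it assumes $m \le 64$), and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* All-ones if x != 0, all-zeros otherwise, computed without branching. */
static uint64_t nonzero_mask(uint64_t x)
{
    return 0 - ((x | (0 - x)) >> 63);
}

/* zl and zr hold the left and right halves of the 2*delta rows of the
   reduced matrix. After the call, zr[0..delta-1] is a generating set of the
   intersection (possibly containing zero rows). */
void extract_generating_set(const uint64_t *zl, uint64_t *zr, size_t delta)
{
    for (size_t i = 0; i < delta; i++) {
        uint64_t in_a   = nonzero_mask(zl[i]);           /* row i belongs to A       */
        uint64_t low_ok = ~nonzero_mask(zl[i + delta]);  /* row i+delta is not in A  */
        uint64_t repl   = zr[i + delta] & low_ok;        /* a row of B, or zero      */
        zr[i] = (zr[i] & ~in_a) | (repl & in_a);         /* masked select, no branch */
    }
}
```

The loop bounds and memory accesses depend only on delta, so the running time does not reveal $\mu$ or $\nu$.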

Implementing RSR
One can implement RSR using essentially the same steps as presented at the beginning of Section 5.1. One only needs to modify Step 1 so that it becomes

1. Compute $(s'_1, \dots, s'_m) \in \mathbb{F}_{2^m}^{m}$, the vector of $\mathbb{F}_{2^m}$ elements formed by the first $m$ rows of the row echelon form of Mat($s$).

Indeed, since $m < n$, we have $S = \langle s'_1, \dots, s'_m \rangle$. Modifying Step 1 in this way will change the dimension of the matrices $Z$ and $Z'$ in the Zassenhaus algorithm. Now the matrices are in $\mathbb{F}_2^{2m \times 2m}$ instead of $\mathbb{F}_2^{2dr \times 2m}$. As $dr < m$, a constant-time implementation for RSR is thus expected to be slower than a constant-time implementation for RSR$^+$. We note that the loop for extracting a generating set of the intersection also needs to be modified because the dimension of the matrices has changed. This can be done by setting $\Delta$ to $m$ instead of $dr$.

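Sampling from $S_r^{2n}$ and $S_d^{2n}$
In the key generation algorithm, $(h_1, h_2)$ is sampled from $S_d^{2n}$. In the encapsulation algorithm, $(e_1, e_2)$ is sampled from $S_r^{2n}$. Without loss of generality, let us consider the task of generating $(h_1, h_2)$. As explained in Section 3.2, $h_1$ and $h_2$ can be considered as
$$h_1 = \sum_{i=1}^{d} \alpha_i \beta_i, \qquad h_2 = \sum_{i=1}^{d} \gamma_i \beta_i,$$
where $\alpha_i, \gamma_i \in \mathbb{F}_{2^n}$ and $\beta_i \in \mathbb{F}_{2^m}$ for all $i$. In fact, Mat($h_1, h_2$) is simply
$$\begin{pmatrix} \alpha_1^T & \cdots & \alpha_d^T \\ \gamma_1^T & \cdots & \gamma_d^T \end{pmatrix} \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_d \end{pmatrix},$$
where $\alpha_i$'s, $\beta_i$'s, and $\gamma_i$'s are considered as row vectors over $\mathbb{F}_2$. By definition, Supp($h_1, h_2$) = Supp($h_1$) + Supp($h_2$) is the row space of Mat($h_1, h_2$). Mat($h_1, h_2$) is of rank $d$ if and only if both matrices being multiplied are of rank $d$ (i.e., full rank).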
Following the discussion above, we carry out the following steps to generate $(h_1, h_2)$.
1. Generate a random $d \times m$ matrix and use the algorithm in Section 4.2 to check if it is full rank by reducing it to row echelon form. If the matrix is not full rank, repeat this step.
2. Generate a random $d \times 2n$ matrix and use the algorithm in Section 4.2 to check if it is full rank by reducing it to row echelon form. If the matrix is not full rank, repeat this step.
3. Obtain $\beta_i$'s from the $d \times m$ matrix. Obtain $\alpha_i$'s and $\gamma_i$'s from the $d \times 2n$ matrix. Compute $h_1$ as $\sum_i \alpha_i \beta_i$ to obtain the polynomial representation. We do not compute the polynomial representation of $h_2$ because we can use $\gamma_i$'s and $\beta_i$'s to carry out the multiplication between $h_1^{-1}$ and $h_2$ using the technique introduced in Section 3.2.
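Steps 1 and 2 above are both instances of rejection sampling of a full-rank binary matrix. The following sketch illustrates the pattern for a $d \times m$ matrix with $m \le 64$. Several parts are stand-ins and should be read as assumptions: splitmix64 replaces the cryptographic random number generator, rank_gf2 is a plain branching elimination rather than the constant-time algorithm of Section 4.2, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in PRNG (splitmix64). NOT suitable for key generation. */
static uint64_t splitmix64(uint64_t *state)
{
    uint64_t z = (*state += UINT64_C(0x9E3779B97F4A7C15));
    z = (z ^ (z >> 30)) * UINT64_C(0xBF58476D1CE4E5B9);
    z = (z ^ (z >> 27)) * UINT64_C(0x94D049BB133111EB);
    return z ^ (z >> 31);
}

/* Rank over F_2 of d rows packed into 64-bit words (d <= 64).
   Plain elimination with branches: NOT constant-time. */
unsigned rank_gf2(const uint64_t *rows, unsigned d)
{
    uint64_t work[64];
    unsigned rank = 0;
    assert(d <= 64);
    for (unsigned i = 0; i < d; i++)
        work[i] = rows[i];
    for (unsigned i = 0; i < d; i++) {
        if (work[i] == 0)
            continue;                                  /* dependent row */
        uint64_t pivot = work[i] & (0 - work[i]);      /* lowest set bit */
        rank++;
        for (unsigned j = i + 1; j < d; j++)
            if (work[j] & pivot)
                work[j] ^= work[i];
    }
    return rank;
}

/* Repeats "generate a random d x m matrix" until the matrix has rank d. */
void sample_full_rank(uint64_t *rows, unsigned d, unsigned m, uint64_t *state)
{
    assert(m >= 1 && m <= 64 && d <= m);
    uint64_t mask = (m == 64) ? ~UINT64_C(0) : ((UINT64_C(1) << m) - 1);
    do {
        for (unsigned i = 0; i < d; i++)
            rows[i] = splitmix64(state) & mask;
    } while (rank_gf2(rows, d) != d);
}
```

Whether a sample is rejected depends only on the discarded matrix, so the number of repetitions reveals nothing about the matrix that is finally accepted; the rank check itself still has to be constant-time in a real implementation.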

Experiment Results and Discussions
This section presents some experimental results. All cycle counts for our implementation and the implementation of [CCK21] are measured on one core of an Intel Xeon E3-1220 v5 CPU (Skylake), with Turbo Boost and hyper-threading disabled. The cycle counts for [AMAB+21] are taken from the paper directly and are measured on an Intel Core i7-8850H CPU (Coffee Lake).
Table 3 shows the timings of some operations in our implementation. Each column corresponds to a specific operation, as explained below.
• "mul F" means a multiplication of two $\mathbb{F}_{2^{mn}}$ elements, where the support of one of the elements is a subset of $F$. The multiplications $h_1^{-1} \cdot h_2$ and $h_1 \cdot c$ are of this type.
• "mul E" means a multiplication of two $\mathbb{F}_{2^{mn}}$ elements, where the support of one of the elements is a subset of $E$. The multiplication $h \cdot e_2$ is of this type.
• "echelon" means the process of reducing a $2dr \times 2m$ matrix over $\mathbb{F}_2$ into its row echelon form, which is used for the Zassenhaus algorithm.
• "reduced" means the process of reducing a $dr \times m$ matrix over $\mathbb{F}_2$ into its reduced row echelon form, which is used for deriving a unique representation of $E$ for hashing.
Table 4 presents the cycle counts for key generation, encapsulation, and decapsulation of our ROLLO$^+$ implementation, the two ROLLO-I-128 implementations from [AMAB+21], and the BIKE implementation from [CCK21]. The numbers of our implementation and the implementation of [CCK21] are measured using the SUPERCOP benchmarking framework.

How about the Speed of ROLLO-I?
One might think that our implementation is an order of magnitude faster than the implementations of [AMAB+21] because the specification of ROLLO$^+$ is different from that of ROLLO-I. This is not true.
First of all, the encapsulation algorithm of ROLLO$^+$ is identical to that of ROLLO-I, so our encapsulation time is exactly the encapsulation time one can achieve for ROLLO-I. Second, the key generation algorithm of ROLLO$^+$ is almost identical to that of ROLLO-I, so our key generation time is expected to be very close to the key generation time one can achieve for ROLLO-I. Finally, the decapsulation algorithm of ROLLO$^+$ is somewhat different from that of ROLLO-I, but it is easy to estimate the decapsulation time of ROLLO-I when our techniques are used. To use our techniques for ROLLO-I, the matrices for the Zassenhaus algorithm will be $2m \times 2m$ matrices instead of $2dr \times 2m$ matrices, as discussed in Section 5.2. Assuming that the decapsulation time is dominated by the Zassenhaus algorithm, which is true according to our experiments, the decapsulation time of ROLLO-I-128 is expected to be $(m/dr)^2 \approx 1.43$ times our decapsulation time of ROLLO$^+$-I-128, which is still much smaller than 9744693 cycles.
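As a rough worked version of this estimate, using only the figures quoted in this paper (673666 cycles for our decapsulation and the factor 1.43) and assuming that the whole decapsulation time scales with the Zassenhaus step:

$$1.43 \times 673666 \approx 9.6 \times 10^{5} \text{ cycles}, \qquad \frac{9744693}{9.6 \times 10^{5}} \approx 10.$$

Under this assumption, a ROLLO-I-128 decapsulation built with the same techniques would still be about an order of magnitude faster than the count of 9744693 cycles.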

How about Other Platforms?
One might wonder whether it is possible to port our implementations to non-x86 platforms. Although how ROLLO$^+$ should be implemented on other platforms is out of the scope of this paper, we explain below how this can be done.
First of all, as shown in Section 3, our implementation makes use of pclmulqdq for carryless multiplications. Many platforms do not support any instruction for carryless multiplications. However, one can still use instructions for integer multiplications to carry out carryless multiplications. For example, as shown in [CC21, Section 5.1.2], umlal can be used to carry out carryless multiplications on Cortex-M4. Similarly, one can easily implement a function that achieves the functionality of pclmulqdq on any reasonable platform using instructions for integer multiplications and logical instructions. With such a function, one can easily build multiplication functions for $\mathbb{F}_{2^m}$ and $\mathbb{F}_{2^n}$.
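One well-known way to obtain a carryless product from integer multiplications is to multiply operands in which only every fourth bit is kept, so that carries never reach the next kept position, and to mask the results afterwards. The sketch below shows this "multiplication with holes" idea for 32-bit operands next to a shift-and-xor reference. It is an illustration of the general idea and is not claimed to be the method of [CC21]; both function names are hypothetical.

```c
#include <stdint.h>

/* Reference: 32 x 32 -> 64 carryless product by masked shift-and-xor. */
uint64_t clmul32_ref(uint32_t a, uint32_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 32; i++) {
        uint64_t m = 0 - (uint64_t)((b >> i) & 1u);   /* all-ones iff bit i of b is set */
        r ^= ((uint64_t)a << i) & m;
    }
    return r;
}

/* Same product using 16 integer multiplications. Each masked operand keeps
   one bit out of four, so at most 8 partial products meet in one position
   and the sum fits in the 4-bit slot without carrying into the next one. */
uint64_t clmul32_holes(uint32_t x, uint32_t y)
{
    uint64_t x0 = x & 0x11111111u, x1 = x & 0x22222222u;
    uint64_t x2 = x & 0x44444444u, x3 = x & 0x88888888u;
    uint64_t y0 = y & 0x11111111u, y1 = y & 0x22222222u;
    uint64_t y2 = y & 0x44444444u, y3 = y & 0x88888888u;

    /* z_k collects the products whose bit positions are congruent to k mod 4 */
    uint64_t z0 = (x0 * y0) ^ (x1 * y3) ^ (x2 * y2) ^ (x3 * y1);
    uint64_t z1 = (x0 * y1) ^ (x1 * y0) ^ (x2 * y3) ^ (x3 * y2);
    uint64_t z2 = (x0 * y2) ^ (x1 * y1) ^ (x2 * y0) ^ (x3 * y3);
    uint64_t z3 = (x0 * y3) ^ (x1 * y2) ^ (x2 * y1) ^ (x3 * y0);

    /* keep only the parity bit of every slot */
    z0 &= UINT64_C(0x1111111111111111);
    z1 &= UINT64_C(0x2222222222222222);
    z2 &= UINT64_C(0x4444444444444444);
    z3 &= UINT64_C(0x8888888888888888);
    return z0 | z1 | z2 | z3;
}
```

This approach is only constant-time on processors whose integer multiplier runs in operand-independent time, which has to be checked for each target platform.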
Our implementation also makes use of tzcnt for counting trailing zeros. Many platforms do not support any instruction for counting trailing zeros. However, as shown in [And05], there are many methods to count the number of trailing zeros without using any tzcnt-like instruction. Note that the methods are not constant-time but can all be easily made constant-time. On Cortex-M4, one can simply use rbit to reverse the bits in a word and clz to count leading zeros. Following the discussion above, one can easily implement the functionality of tzcnt on any reasonable platform.
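As an illustration of how a trailing-zero count can be made branch-free, the sketch below isolates the lowest set bit and reads off the six bits of its position with fixed masks. The function name ctz64_ct is hypothetical, and the convention of returning 64 for a zero input is chosen to mirror the behaviour of tzcnt described earlier.

```c
#include <stdint.h>

/* 1 if y != 0, else 0, computed without branching. */
static unsigned nz(uint64_t y)
{
    return (unsigned)((y | (0 - y)) >> 63);
}

/* Number of trailing zeros of x, or 64 if x == 0. Uses only logical
   operations, shifts, and subtractions; no branches, no table lookups. */
unsigned ctz64_ct(uint64_t x)
{
    uint64_t low = x & (0 - x);                        /* lowest set bit only */
    unsigned n = 0;
    n |= nz(low & UINT64_C(0xFFFFFFFF00000000)) << 5;  /* position bit 5 */
    n |= nz(low & UINT64_C(0xFFFF0000FFFF0000)) << 4;  /* position bit 4 */
    n |= nz(low & UINT64_C(0xFF00FF00FF00FF00)) << 3;  /* position bit 3 */
    n |= nz(low & UINT64_C(0xF0F0F0F0F0F0F0F0)) << 2;  /* position bit 2 */
    n |= nz(low & UINT64_C(0xCCCCCCCCCCCCCCCC)) << 1;  /* position bit 1 */
    n |= nz(low & UINT64_C(0xAAAAAAAAAAAAAAAA));       /* position bit 0 */
    return n + (64u & (nz(x) - 1u));                   /* adds 64 only when x == 0 */
}
```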
With a function that implements the functionality of tzcnt, one can find the index $j$ of the first nonzero entry of a vector: we need to do this for the vector $v$ in Step 3 of the Gaussian elimination algorithm presented in Section 4.2. Assume that $v$ is represented as an array of $b$-bit words. It is easy to obtain the first nonzero word in the vector along with its index $j' = \lfloor j/b \rfloor$ in constant time using logical instructions. Then, using the function that implements the functionality of tzcnt and $j'$, one can easily derive the index $j$. A function find_first_one for finding the index of the first nonzero entry in a 256-bit vector is shown in Figure 6. One can also easily obtain the $j$th entry of a vector in constant time using simple instructions. This operation is required in Steps 4 and 5 of the Gaussian elimination algorithm. A function get_coef for obtaining the $j$th entry of a 256-bit vector is also shown in Figure 6.
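The two operations described above can be sketched for a 256-bit vector stored as four 64-bit words ($b = 64$). This is an independent illustration and not a reproduction of Figure 6: the function bodies, the helper names, and the convention of returning 256 for the zero vector are assumptions.

```c
#include <stdint.h>

/* All-ones if x != 0, all-zeros otherwise, computed without branching. */
static uint64_t nonzero_mask(uint64_t x)
{
    return 0 - ((x | (0 - x)) >> 63);
}

/* Trailing-zero count via a population count of the bits below the lowest
   set bit; returns 64 for x == 0. */
static unsigned ctz64(uint64_t x)
{
    uint64_t y = (x & (0 - x)) - 1;
    y = y - ((y >> 1) & UINT64_C(0x5555555555555555));
    y = (y & UINT64_C(0x3333333333333333)) + ((y >> 2) & UINT64_C(0x3333333333333333));
    y = (y + (y >> 4)) & UINT64_C(0x0F0F0F0F0F0F0F0F);
    return (unsigned)((y * UINT64_C(0x0101010101010101)) >> 56);
}

/* Index j of the first nonzero bit of v, or 256 if v is zero. */
unsigned find_first_one(const uint64_t v[4])
{
    uint64_t word = 0, found = 0;
    unsigned jw = 0;                                  /* word index j' */
    for (unsigned i = 0; i < 4; i++) {
        uint64_t nzm  = nonzero_mask(v[i]);
        uint64_t take = nzm & ~found;                 /* set only for the first nonzero word */
        word |= v[i] & take;
        jw   |= i & (unsigned)take;
        found |= nzm;
    }
    return 64u * jw + ctz64(word) + (192u & (unsigned)~found);
}

/* Bit j of v (0 <= j < 256). Every word is read; none is indexed by j. */
unsigned get_coef(const uint64_t v[4], unsigned j)
{
    uint64_t word = 0;
    for (unsigned i = 0; i < 4; i++)
        word |= v[i] & ~nonzero_mask((uint64_t)(i ^ (j >> 6)));
    return (unsigned)((word >> (j & 63u)) & 1u);
}
```

The variable shift in get_coef is assumed to take operand-independent time, which holds on common targets with a barrel shifter but should be verified for the platform at hand.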
The function transpose_64x256_sp_asm and the loop for obtaining the generating set of the intersection of two linear subspaces (Section 5.1) both consist of logical instructions, so it is easy to implement the same functionalities on any reasonable platform.

Figure 4: Our function for finding the index of the first nonzero bit in a 256-bit vector.

Table 1: Parameter sets of ROLLO and ROLLO$^+$

Table 2: The lists of $P_m$'s and $P_n$'s as polynomials in $\mathbb{F}_2[X]$.

Input: An $\mathbb{F}_2$-subspace $F$ of dimension $d$ in $\mathbb{F}_{2^m}$, represented as $(f_1, \dots, f_d) \in \mathbb{F}_{2^m}^{d}$ such that $F = \langle f_1, \dots, f_d \rangle$, and $s = \sum_{i=0}^{n-1} s_i y^i \in \mathbb{F}_{2^{mn}}$.
Output: An $\mathbb{F}_2$-subspace $E$ of $\mathbb{F}_{2^m}$.

Table 3: Cycle counts for several components in our implementation for ROLLO

Table 4: Cycle counts for key generation, encapsulation, and decapsulation of the ROLLO-I implementations from [AMAB+21] (the paper did not implement ROLLO-II), our ROLLO