Power Analysis on NTRU Prime

This paper applies a variety of power analysis techniques to several implementations of NTRU Prime, a Round 2 submission to the NIST PQC Standardization Project. The techniques include vertical correlation power analysis, horizontal in-depth correlation power analysis, online template attacks, and chosen-input simple power analysis. The implementations include the reference one, one optimized using smladx, and three protected ones. Adversaries in this study can fully recover private keys with a single trace of short observation span, with a few template traces from a fully controlled device similar to the target and no a priori power model, or sometimes even with the naked eye. The techniques target the constant-time generic polynomial multiplications in the product scanning method. Though this work focuses on decapsulation, the techniques also apply to the key generation and encapsulation of NTRU Prime. Moreover, they apply to ideal-lattice-based cryptosystems where each private-key coefficient comes from a small set of possibilities.

Unfortunately, quantum resistance is no guarantee of practical security. There has been a large amount of work on implementation attacks against post-quantum cryptosystems.
[TE15] provides a comprehensive collection of fault analyses and side-channel analyses on various post-quantum schemes. [EFGT17], [KAJ17], and [PSKH18] present more cutting-edge side-channel analyses on digital signatures. [EFGT17] applies electromagnetic analysis to BLISS and achieves full key recovery from a single trace using integer linear programming. [KAJ17] features three zero-value attacks on supersingular isogeny Diffie-Hellman using refined power analysis. [PSKH18] proposes correlation power analysis on Rainbow and Unbalanced Oil-and-Vinegar, two digital signatures based on multivariate quadratic equations, to fully recover the secrets in use.
Our Contributions NTRU Prime [BCLvV16], a Round 2 submission to the NIST PQC Standardization Project, is based on ideal lattices. This submission contains two schemes: Streamlined NTRU Prime and NTRU LPRime. Streamlined NTRU Prime is a variant of the classic NTRU [HPS98], and NTRU LPRime shares a similar structure with NewHope [ADPS16, LPR10]. However, their reference implementations are not subject to the previous attacks against lattice-based schemes. Those attacks target implementations with data-dependent timing differences [SW07] and those which employ the operand scanning method [AKJ+18, ATT+18], sparse multiplication [LSCH10, KY12, WZW13, ZWW13, AKJ+18, SMS19], or the NTT network [PPM17] for polynomial multiplications. In contrast (and somewhat unusually), the reference implementation of NTRU Prime is constant-time and generic, realizing polynomial multiplications with the product scanning method.
This paper applies a variety of power analysis techniques to several implementations of NTRU Prime. The techniques include vertical correlation power analysis (VCPA) [BCO04], horizontal in-depth correlation power analysis (HIDCPA) [CFG+10], online template attacks (OTA) [BCP+14], and chosen-input simple power analysis (CISPA) [KJJ99]. The implementations include the reference one [BCLvV16], one optimized using DSP instructions, and some protected ones [LSCH10]. Adversaries can fully recover private keys with a single trace, with a few template traces and no a priori power model, or even with the naked eye.
This work demonstrates private-key recovery from the polynomial multiplication in decapsulation. However, because HIDCPA and OTA are single-(target-)trace attacks on random inputs, they can also reveal private keys from NTRU LPRime's key generation and the seeds of session keys from both schemes' encapsulations. The ideal-lattice-based cryptosystems where each private-key coefficient comes from a small set of possibilities [ADPS16, HRSS17, HPS+17, KMRV18, BDK+18] may well succumb to the approaches in this study.
Even if NTRU Prime optimizes its polynomial multiplications with Karatsuba's method [Kar63, WP06] and Toom's method [Too63, CA69], the approaches may remain effective. Karatsuba's method itself does not prevent the VCPA, the OTA, and the CISPA on the lowest-level multiplications. If the low-level schoolbook multiplications are sufficiently long, then HIDCPA works, too. Unfortunately, if the optimized version uses Toom-k as the first layer, the approaches can only reveal the first and last 1/k of the private-key coefficients. How to adapt them to a fully optimized NTRU Prime [BCLvV16] in pursuit of full private-key recovery is worth further investigation.
Prior Work In the field of side-channel analysis on lattice-based encryption, [SW07] and [LSCH10] are the classics. [SW07] describes a timing attack against NTRUEncrypt exploiting the variation in the number of hash calls during decryption. [LSCH10] not only applies simple power analysis and correlation power analysis to a typical NTRU software implementation, but also provides the corresponding countermeasures.
[AKJ+18] performs single-trace power analysis on two versions of NTRU. [ATT+18] mounts horizontal differential power analysis on NewHope and Frodo, revealing private keys with a >99% success rate from a single trace. [BFM+18] applies extend-and-prune template attacks to challenges beyond the scope of [ATT+18]. [SMS19] implements additive masking for NTRUEncrypt with little overhead using Cortex-M4 SIMD instructions, and mounts a second-order attack against the additive masking.
In the literature, there does not seem to be any attack against NTRU Prime, or against polynomial multiplication using the product scanning method in general. It is worth noting, however, that [HMHW09] and [UW14] mount correlation power analysis on multi-precision integer multiplication using the product scanning method in ECDSA and in optimal-Ate pairings, respectively. Also, [JB16] launches (repeated) single-trace correlation/clustering attacks against the operand-scanning field multiplications in elliptic curve scalar multiplication with precomputations, and claims applicability to the product scanning method.

NTRU Prime
NTRU Prime [BCLvV16] is a Round 2 candidate in the NIST Post-Quantum Cryptography Standardization Project [Nat17]. It features polynomial rings distinct from those of typical Ring-LWE-based cryptosystems and of NTRU, to avoid potential algebraic attacks. NTRU Prime comprises two key-encapsulation mechanisms based on ideal lattices: Streamlined NTRU Prime and NTRU LPRime.
Let Z/mZ be represented by (−m/2, m/2] ∩ Z. For a given prime p and an arbitrary m ∈ Z+, R and R_m refer to Z[x]/(x^p − x − 1) and (Z/mZ)[x]/(x^p − x − 1), respectively. A polynomial is small if all of its coefficients belong to {−1, 0, 1}, and weight-w if exactly w of its coefficients are nonzero. If not specified, the usual expressions for a polynomial of degree n ∈ N (as a function of x, as a sum of monomials, or as a coefficient vector) are interchangeable. The terms above help describe Streamlined NTRU Prime and NTRU LPRime concisely.
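The two polynomial classes above are easy to pin down in code. The following minimal sketch (plain Python, with function names of our own choosing) captures both definitions:

```python
# Minimal sketch of the definitions above; a polynomial is a plain
# list of integer coefficients (index i holds the coefficient of x^i).

def is_small(f):
    """Small: every coefficient belongs to {-1, 0, 1}."""
    return all(coeff in (-1, 0, 1) for coeff in f)

def is_weight(f, w):
    """Weight-w: exactly w coefficients are nonzero."""
    return sum(1 for coeff in f if coeff != 0) == w
```

For instance, 1 − x^2 + x^4, written [1, 0, -1, 0, 1], is both small and weight-3.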

Decapsulation
1. Recover the small weight-w polynomial r ∈ R from the ciphertext c using the private key (this involves computing c × 3f in R_q).
2. Hash r to obtain the session key K.
NTRU LPRime has positive integer parameters p, q, w, δ, and I: p and q are primes; 8 | I; 2p ≥ 3w; q ≥ 16w + 2δ + 3; p ≥ I; x^p − x − 1 is irreducible in (Z/qZ)[x]. Also, it includes the following five functions:
• Hash: a hash function producing two fixed-length strings, a cipher key and a session key, from each I-bit string
• Generator: producing a polynomial ∈ R_q from each seed string
• Small: producing a small weight-w polynomial ∈ R from each cipher key
• Top: mapping each vector ∈ (Z/qZ)^I to a fixed-length string
• Right: mapping each string in the image of Top to a vector ∈ (Z/qZ)^I
The following introduction to its key generation, encapsulation, and decapsulation again skips error detection, encoding, and decoding due to their irrelevance to this paper.
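The arithmetic constraints on (p, q, w, δ, I) are mechanical to verify. The sketch below checks all of them except the irreducibility of x^p − x − 1, which needs polynomial arithmetic over Z/qZ; the function names are our own, and the sample values in the usage note merely satisfy the listed inequalities rather than being parameters quoted from the submission:

```python
def _is_prime(n):
    """Trial division; adequate for parameter-sized integers."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def lpr_params_ok(p, q, w, delta, I):
    """Check NTRU LPRime's integer constraints (irreducibility of
    x^p - x - 1 in (Z/qZ)[x] is not verified here)."""
    return (_is_prime(p) and _is_prime(q)
            and I % 8 == 0
            and 2 * p >= 3 * w
            and q >= 16 * w + 2 * delta + 3
            and p >= I)
```

As a sanity check, (p, q, w, δ, I) = (761, 4591, 250, 292, 256) satisfies every inequality, while raising δ to 300 violates q ≥ 16w + 2δ + 3.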

Key Generation
1. Pick randomly and uniformly a seed S.
2. Pick randomly and uniformly a small weight-w polynomial a ∈ R.
3. Obtain A ∈ R by rounding each coefficient of (a × Generator(S) in R_q) to the nearest multiple of 3.
4. Set (S, A) as the public key and a as the private key.

3. Obtain b = Small(k).
4. Obtain B ∈ R by rounding each coefficient of (b × Generator(S) in R_q) to the nearest multiple of 3.

Obtain
6. Obtain C' = Top(C), and set (B, C') as the ciphertext.

Decapsulation

1. Recover r from the ciphertext using the private key a (this involves computing a × B in R_q).
2. Obtain the session key K from (k, K) = Hash(r).
The polynomial multiplication in R_q in decapsulation is the operation of interest. In NTRU-like cryptosystems there are three common ways to realize such polynomial multiplications [LSCH10, KY12, WZW13, ZWW13, BCLvV16, AKJ+18, SMS19]. The two inputs of degree < p here are the small private key f (or a) and the known ciphertext c (or B).
• Operand Scanning Method [HW11]: viewing f × c as the superposition of the f_i × c
• Sparse Multiplication: expressing f as ({i : f_i = 1}, {j : f_j = −1}) and substituting integer additions/subtractions for integer multiplications
• Product Scanning Method [HW11]: calculating the (f × c)_i one by one
Table 1 shows the details. Conventional NTRU implementations favor the operand scanning method and sparse multiplication. Hence, these two have been intensively studied in the field of power analysis. In contrast, NTRU Prime adopts the product scanning method in its reference implementation. This method has received little attention in the literature, so this paper features a comprehensive set of power analyses on it.
Table 1: the product scanning method for e = f × c

    for i = 0 to (p − 1):
        for j = 0 to i:
            e_i += f_{i−j} × c_j (mod q)
    for i = p to (2p − 2):
        for j = (i − p + 1) to (p − 1):
            e_i += f_{i−j} × c_j (mod q)
    Output: e(x) mod P(x), where P(x) = x^p − x − 1 for NTRU Prime and x^p ± 1 for NTRUEncrypt

Although the experiments in this article only focus on the recovery of f from (c × 3f in R_q) in Streamlined NTRU Prime and of a from (a × B in R_q) in NTRU LPRime, the formula h = (g/(3f) in R_q) assures the recovery of g in Streamlined NTRU Prime with the knowledge of the public key. Furthermore, the single-(target-)trace attacks on random inputs in this article easily adapt to the multiplications in encapsulation/key generation for session-key/secret-key recovery.
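A direct transcription of Table 1 into Python may help fix ideas. The sketch below (names ours) computes e = f × c one output coefficient at a time and then reduces modulo x^p − x − 1 using x^p ≡ x + 1:

```python
def product_scanning_rq(f, c, p, q):
    """Multiply f and c in R_q = (Z/qZ)[x]/(x^p - x - 1) with the
    product scanning method of Table 1: each output coefficient e_i
    is accumulated in full before moving on to the next one."""
    e = [0] * (2 * p - 1)
    for i in range(p):                        # e_0 .. e_{p-1}
        for j in range(i + 1):
            e[i] = (e[i] + f[i - j] * c[j]) % q
    for i in range(p, 2 * p - 1):             # e_p .. e_{2p-2}
        for j in range(i - p + 1, p):
            e[i] = (e[i] + f[i - j] * c[j]) % q
    # Reduce mod x^p - x - 1: x^(p+k) = x^(k+1) + x^k, so fold each
    # high coefficient e_i (i >= p) into e_{i-p+1} and e_{i-p}.
    for i in range(2 * p - 2, p - 1, -1):
        e[i - p + 1] = (e[i - p + 1] + e[i]) % q
        e[i - p] = (e[i - p] + e[i]) % q
    return e[:p]
```

With (p, q) = (3, 7), multiplying 1 − x by 2 + 3x + 4x^2 gives 5 + 4x + x^2 after the reduction.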

Power Analysis
Side-channel analysis can break an implementation without breaking the underlying cryptosystem under its design assumptions. First, it collects side-channel leakages (such as execution time [Koc96], power consumption [KJJ99], and electromagnetic radiation [vE85]) from cryptographic devices. Then it identifies the relations between such leakages and the operations being executed or the intermediate values being processed. Finally, it employs a series of data processing, observation, and statistical analysis steps to reveal sensitive information about the cryptographic primitives in use.
Power analysis is a popular branch of side-channel analysis. Classic instances include simple power analysis (SPA) [KJJ99], correlation power analysis (CPA) [BCO04], and profiling attacks [CRR02]. This paper points out that NTRU Prime is subject to all of them. In general, power analysis consists of four steps [MOP07]. First, it targets specific intermediate values to decompose the entire key space into several tiny search spaces. Then it models the expected power consumption of the target device for these intermediate values. After power trace recording, it produces an optimal guess for each search space. Finally, sensitive information such as private keys is derived from these optimal guesses. In this context, ciphertexts are assumed accessible at low cost.
SPA, CPA, and profiling attacks follow different ways to model the expected power consumption and produce optimal guesses [MOP07]. SPA only cares about the target's data-dependent power characteristics that can be captured within one or a few executions and identified with simple arithmetic or even visual inspection. The knowledge of such power characteristics usually requires a deep understanding of the target device. CPA applies a simple power model to its target device (typically the Hamming weight model for microcontrollers and the Hamming distance model for FPGAs). Then it decides on optimal guesses based on the Pearson correlation coefficients between the expected power consumption and its counterpart in reality, for all candidates in the search space. Profiling attacks construct multivariate Gaussian distributions or typical power sample sequences from the measurements in their profiling stage. Then, with these highly customized power models, they select suitable statistical analyses to decide on optimal guesses.
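As a concrete illustration of the CPA step, the sketch below (names ours) scores key candidates by the Pearson correlation between a Hamming-weight model and the measured samples at one point of interest. It is a generic sketch that judges candidates by |ρ|; the experiments later note that some measurement setups invert the readings, in which case only one sign needs checking:

```python
import numpy as np

def hamming_weight(v):
    """Hamming weight of a value as a 32-bit word (two's complement)."""
    return bin(int(v) & 0xFFFFFFFF).count("1")

def cpa_best_guess(measured, hypotheses):
    """measured: N power samples at one point of interest, one per trace.
    hypotheses: dict mapping each key candidate to its N predicted
    intermediate values.  Returns the (candidate, rho) whose
    Hamming-weight model correlates most strongly with the samples."""
    best, best_rho = None, 0.0
    samples = np.asarray(measured, dtype=float)
    for cand, values in hypotheses.items():
        model = np.array([hamming_weight(v) for v in values], dtype=float)
        if model.std() == 0.0:        # constant model: rho undefined
            continue
        rho = float(np.corrcoef(model, samples)[0, 1])
        if abs(rho) > abs(best_rho):
            best, best_rho = cand, rho
    return best, best_rho
```

In a simulation where the samples are the Hamming weights of (3 × c_i mod q) plus noise, the candidate 3 wins by a wide margin over other multipliers.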

Vertical Correlation Power Analysis (VCPA)
This VCPA targets the e_{p−1} calculation in the product scanning method in Table 1. The only coefficient of e involving all the coefficients of f and c is e_{p−1}, so its calculation is highly controllable, and its power consumption is rich in data dependencies. Since the private key f is small and weight-w, each search space is tiny. The following assumption helps accelerate this VCPA: the chosen power model suits the target device so well that in each search space, the correct hypothesis gives a correlation coefficient far away from those of the rest. Because every internal state corresponds to the same kind of multiply-accumulate-and-reduce, this assumption allows adversaries to set a fixed threshold shared by all the search spaces. During this VCPA, whenever the hypothesis being examined yields a coefficient that crosses the threshold, we assume that this hypothesis is optimal in the current search space.
Furthermore, the assumption above ensures that there is a wide range of eligible thresholds, so adversaries can efficiently select a threshold leaving exactly one survivor per search space.Note that after computing a few correlation coefficients in the first stage, we should be able to identify the numerical gap between correct and incorrect hypotheses.
Algorithm 1 shows this VCPA in detail. It involves N independent multiplications of random ciphertexts c and a fixed secret key f. Each e_{p−1} calculation leads to a power trace of size L. Algorithm 1 employs the Hamming weight power model [KJJ99, MD99] by default due to its simplicity and prevalence in microcontroller power modeling. Besides, Algorithm 1 by default considers higher correlation coefficients "better", and defines the optimal guess in a search space as the hypothesis with the "best" correlation coefficient. In this case, a correlation coefficient crosses the threshold if it is above the fixed threshold.

Algorithm 1 Vertical Correlation Power Analysis on NTRU Prime
[Listing omitted. Inputs: N power traces of the e_{p−1} calculation and a THRESHOLD to sieve out the correct hypotheses; output: the small weight-w polynomial f. The first stage identifies the (x, y, id_{w−1}) giving the "best" correlation coefficient ρ_{x,y,id_{w−1}}; the second stage iterates over the remaining coefficients, identifying among all (x, id_{wt}) the one giving the "best" correlation coefficient ρ_{x,id_{wt}}.]

The approach takes probabilistic linear time in terms of p, if an Ω(1) lower bound τ for w/p exists in view of security concerns. In any case the second stage needs fewer than p iterations of testing ±c_i. Assume the random variables X_1 = (p − b_w) and …

There are practical considerations behind Algorithm 1. The first stage views f_{b_w} and f_{b_{w−1}} as a pair to avoid the potential confusion between the access to c_i and the update of e_{p−1} from 0 to e_{p−1,1}. Also, to examine the hypotheses of high a priori probability first, the nested loop takes the form "for j = 1 to (p − w + 1) for i = 0 to (j − 1) do" rather than "for i = 0 to (p − w) for j = (i + 1) to (p − w + 1) do". This reduces both the time consumption and the required number of traces. Moreover, this VCPA assumes the relative order of operations in an implementation to be the same as its counterpart in the source code, so the second stage only considers the samples after the latest id*_{wt+1}. Some may suggest that Algorithm 1 judge a correlation coefficient by its absolute value for the sake of generality. However, this makes the first stage prone to failure due to troublesome false positives: every e_i is initially zero, and for 32-bit microcontrollers (e.g. our target devices) HammingWeight(x) ≈ 32 − HammingWeight(−x), so (b_286, b_285, −f_{b_286}, −f_{b_285}) is a false positive which will lead to confusion during the first step. Similarly, under this general design, (−f_760, −f_759, …, −f_{761−m}) is a false positive in the first block recovery of the HIDCPA in subsection 3.2. When deciding on the power model in use, adversaries can take their measurement setups into consideration so that they only need to check hypotheses for either positive or negative correlation coefficients, not both.

Horizontal In-Depth Correlation Power Analysis (HIDCPA)
Algorithm 2 Horizontal In-Depth Correlation Power Analysis on NTRU Prime
[Listing omitted: for i = 0 to (m − 1), for j = 0 to (l − 1), append to realArr the sample ∈ P of e_x += f_{x−y} × c_y (mod q); then (optGuess, bestCorr) ← candidateConstruct([ ]); if optGuess is empty, invoke the error correction mechanism.]

Error Correction Mechanism In pursuit of single-trace full private-key recovery, attackers may intuitively group every m consecutive coefficients (including zeros) of f into a block. The corresponding CPA relates each search space to m samples during the e_{p−1} calculation and avoids collecting samples vertically from different executions for hypothesis examination. Sadly, this naive CPA by itself is impractical, but as we shall see below, we have solved three issues to make it work. Algorithm 2 and Algorithm 3 together detail this HIDCPA. Algorithm 3 by default employs the Hamming weight power model, considers higher correlation coefficients "better", and thus shares with Algorithm 1 the same definitions of which guess in a search space is optimal and of how a correlation coefficient crosses the threshold.
Our HIDCPA reveals m coefficients of f at a time (depth = m) by observing the calculation of e_{p−1}, e_p, …, e_{p−2+l} in the product scanning method in Table 1. During the e_i calculation, every (c_j, f_{i−j}) updates the e_i in memory with its multiply-accumulate-and-reduce. This HIDCPA targets such updates, locating m × l samples to check every m-coefficient hypothesis. It starts from the block (f_{p−1}, f_{p−2}, …, f_{p−m}), generates eligible candidates recursively, and prunes incorrect candidates every n coefficients during the recursive construction. Should the pruning leave no survivor in the current block recovery, the starting index of the block rolls back by ⌊m/2⌋; the next block recovery will then eliminate the error from the previous block recovery. Otherwise, the optimal survivor becomes part of the final guess, and the starting index moves forward by m. This HIDCPA iterates such block recoveries until a block ends at f_0. Figure 2 diagrams an instance with depth m = 5, pruning period n = 2, and breadth l = 5.

Algorithm 3 candidateConstruct
[Listing omitted. Input: the current block candidate coeffs; output: (the "best" m-coefficient hypothesis coeffs gives, the corresponding ρ); for each extension, it iterates for j = 0 to (l − 1), pops IVArr l times, and returns (guess_{x*}, ρ_{x*}), where ρ_{x*} = max{ρ_{−1}, ρ_0, ρ_1}.]

Despite its resemblance to horizontal CPA [CFG+10], HIDCPA focuses on the depth, with the breadth being auxiliary. Candidate pruning and error correction together make for an efficient HIDCPA with surprising parameter sets (e.g. (m, l) = (67, 5)). Because l is small, this HIDCPA requires a far shorter observation span than horizontal CPA does. We describe how the three main features of our HIDCPA solve the main practical problems encountered.

1. If m is too small, too many candidates fit the measurements well. However, in the naive CPA, the search complexity grows exponentially in m with base 3.
In practice the naive CPA is doomed to fail with m < 20, yet when m ≥ 20, it encounters search spaces of size ≥ 3^20 ≈ 2^32.
Our solution, à la subsection 3.1, is to prune the candidate list whenever n new coefficients are added during the recursive construction of block candidates. To be precise, we choose a fixed threshold for correlation coefficients and check the current candidate whenever its size reaches a multiple of n. We prune the current candidate, along with all its descendants, if its correlation coefficient fails to cross the threshold.
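The pruning schedule can be sketched independently of the power model. In the toy sketch below (names ours), score is a stand-in for the correlation coefficient of a partial block; in the real attack it would be the Pearson correlation computed from the corresponding power samples:

```python
def build_block_candidates(m, n, score, threshold):
    """Depth-first construction of m-coefficient block hypotheses over
    {-1, 0, 1}.  Whenever a partial candidate grows to a multiple of n
    coefficients, it is scored; if the score fails to cross the
    threshold, the candidate and all of its descendants are pruned."""
    survivors = []

    def extend(prefix):
        if prefix and len(prefix) % n == 0 and score(prefix) < threshold:
            return                      # prune this branch entirely
        if len(prefix) == m:
            survivors.append(tuple(prefix))
            return
        for coeff in (-1, 0, 1):
            extend(prefix + [coeff])

    extend([])
    return survivors
```

With a toy score measuring agreement with a fixed secret block and a threshold of 0.99, only the correct block survives, and the tree of 3^m candidates is never fully expanded.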
2. Since m is large, if we make an error towards the end of a block, the naive CPA may well not detect it due to the smallness of its influence on the correlation coefficients for the current block. However, we will notice the error as soon as we start the next block: all of its block candidates get pruned.
As described above, our solution is for attackers to "roll back" by half a block (decrementing the startIdx of the block by ⌊m/2⌋) when all candidates are pruned. By starting this new block, which contains the last half of the previous block and the first half of the next block, we ensure that the troublesome recovery error lands near the middle of this new block, and thus its influence becomes noticeable. This feature cannot detect a recovery error in the last block. Thus, attackers check the last few uncertain coefficients via exhaustive search (in Streamlined NTRU Prime, h and the smallness of g give (f, g); in NTRU LPRime, knowing (S, A) gives a). Two tricks serve to accelerate the search:
• Apply the threshold not only to the entire block but also solely to the last n coefficients each time we prune candidates.
• Apply a smaller m to the last few block recoveries.
3. This HIDCPA observes the calculation of e_p, e_{p+1}, …, e_{p−2+l} and enables us to check each m-coefficient hypothesis with m × l samples in the single trace. This makes our approach as effective as a VCPA with m × l traces, increases the numerical gap between correct and incorrect guesses, and improves the efficiency of the candidate pruning.
Note that the naive CPA only examines the e_{p−1} calculation, mapping an m-coefficient hypothesis to only m samples. Thus, it is only as effective as a VCPA with m traces. Take (m, l) = (67, 5) as an example: it is easy to see that there are situations where 67 samples are not sufficient for the VCPA while 335 samples are.

Online Template Attack (OTA)
The correlation-based approaches in the previous subsections require assuming a simple power model. Traditional template attacks generate refined power models, but they demand numerous template traces in the profiling stage and heavy computational power.
Fortunately, [BCP+14] proposes a way out: online template attacks. Such approaches originally target elliptic curve cryptography. Thanks to online template generation, they achieve single-(target-)trace full private-key recovery with only one template trace per secret scalar bit. Though [BCP+14] claims that the transfer of OTA to other cryptographic algorithms is nontrivial, this paper presents an OTA against NTRU Prime here. Algorithm 4 shows its control flow. This OTA targets the e_{p−1} calculation in the product scanning method in Table 1, and it works as follows. First, attackers acquire one single "target trace" from the target device. They partition the target trace into p n-dimensional "target vectors" of power samples, each corresponding to the operation e_{p−1} += f_{p−1−i} × c_i (mod q) for i = 0, 1, …, (p − 1), respectively. Second, attackers reveal each f_{p−1−i} iteratively with the knowledge of c and f_{p−1}, f_{p−2}, …, f_{p−i}. They collect three "template traces", extracting "template vectors" for the operations e_{p−1} += x × c_i (mod q), ∀x ∈ {−1, 0, 1}. Right after the online template generation, they measure the similarity between the i-th target vector and each of the three template vectors. The template vector resembling the target vector the most makes its hypothesis the attackers' optimal guess. Empirically, the Euclidean distance measure suffices to distinguish the correct hypothesis.
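The matching step of the OTA reduces to a nearest-template search. A minimal sketch (names ours) with Euclidean distance:

```python
import numpy as np

def ota_recover(target_vectors, template_for):
    """target_vectors: the p slices of the single target trace, one per
    operation e_{p-1} += f_{p-1-i} * c_i (mod q).
    template_for(i, x, recovered): the template vector for hypothesis
    x in {-1, 0, 1} at position i, given the coefficients recovered so
    far (online template generation).  Each coefficient is guessed as
    the hypothesis whose template lies closest to the target vector."""
    recovered = []
    for i, target in enumerate(target_vectors):
        guess = min((-1, 0, 1),
                    key=lambda x: float(np.linalg.norm(target - template_for(i, x, recovered))))
        recovered.append(guess)
    return recovered
```

The template_for callback stands in for the measurement of a template trace on the fully controlled device; in a simulation where the target vectors are noisy copies of the correct templates, the full coefficient sequence comes back exactly.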
Algorithm 4 Online Template Attack on NTRU Prime
Input: the power trace P of the e_{p−1} calculation
Output: a small weight-w polynomial f ∈ R
[Listing omitted: for each i, T_x ← the template vector of (e_{p−1} + x × c_i) mod q for x ∈ {−1, 0, 1}, and the closest T_x determines f_{p−1−i}.]

The idea above needs further elaboration to reach a practical implementation: How does one correctly partition the target trace? How does one collect template traces and extract the template vectors?
Attackers can notice with the naked eye a pattern repeated p times in the target trace. The peak indices in adjacent copies of the pattern give n. The OTA assumes that attackers fully control a device of the same type as the target one. On this device, they experiment with different ciphertexts of identical prefixes and the same private key. Superposing the corresponding traces, they identify where the divergence begins, and hence the boundaries between target vectors; Figure 3 shows two such traces. If the fully controlled device allows illegitimate f*, a chosen-input attack with offline template generation becomes a smarter choice. This new OTA is only slightly different from the old one: first, …; second, c*_1 = c_0 or (c_0 × (−w) mod q). The first feature boosts the reusability of template vectors since it limits e_{p−1} to the multiples of c_0 (in Z/qZ). In addition, f_i = −1 and f_i = 1 are randomly and uniformly distributed, so in most cases e_{p−1} only takes its value from a few of these multiples. Thanks to the second feature, four executions suffice to generate all the template vectors this new OTA needs in the case of (p, q, w) = (761, 4591, 286). Table 2 shows how this works. The template vectors for e_{p−1} += (−1) × c_i, e_{p−1} += 0 × c_i, and e_{p−1} += 1 × c_i in Z/qZ are denoted [−], [×], and [+], respectively. "e_{p−1} before the operation" is expressed as c_0 × t mod q for some t ∈ {−w, −w + 1, …, w}.

Experiments and Results
The three approaches above are applied to the reference C implementation of NTRU Prime [BCLvV16, KRSS18] on STM32F303RCT7 [STM18] and STM32F415RGT6 [STM16], two Cortex-M4-based STM32 boards, to validate their efficacy. As the NTRU Prime submission [BCLvV16] recommends for Streamlined NTRU Prime, the target implementation sets (p, q, w) = (761, 4591, 286). Figure 4 presents the power patterns of the multiply-accumulate-and-reduces on both target devices. It aligns the power patterns vertically using correlation methods and separates the horizontal copies with black dotted lines. Each copy corresponds to e_i += f_{i−j} × c_j (mod q) in the product scanning method in Table 1.
All the experiments here are based on the ChipWhisperer-Lite Two-Part Version [O'F16]. A control board clocks the target boards at 7.38 MHz and samples their power consumption at 29.54 MS/s. The program ChipWhisperer Capture [O+18] retrieves power samples from the control board, storing power traces and input data. While the HIDCPA and the OTA are programmed in Python 3.6.1, the VCPA is programmed in C++ in pursuit of high performance. They all run on a MacBook Air.
Note that in our experiments, the correlation-based approaches check hypotheses for negative correlation coefficients and set negative thresholds. The reason behind this design is that the ChipWhisperer-Lite Two-Part Version [O'F16] inserts a resistor between the target device and its power supply, measuring the voltage after the resistor to quantify the target's power consumption. As a result, the more power the target device consumes, the lower the readings.
We have tried alternatives like inverting the classic Hamming weight power model or reversing the voltage and ground lines during power measurement using the Semi-Rigid cable with SMA(F) connector from Jyebao Co., Ltd. [JYE19]. However, they do not reflect as well as the original choice how each factor in the measurement setup influences the design of correlation attacks.

The experiment for the VCPA contains 10 trials on the F3 board. Each trial involves an independent key generation. The VCPA adopts −0.90 as its threshold because the Hamming weight power model is stunningly compatible with STM32 boards. The C++ program needs only 50 traces to completely reveal each of the 10 secret keys, in under 8 seconds. Figure 5 and Figure 6 are screenshots of an example trial. The f_{b_1} recovery in this trial starts with the monomial +x^5. It searches from higher-order monomials to lower-order ones and from smaller sample indices to larger ones, updates its guess with the (monomial, sampleId) of the "best" known correlation coefficient, and finally outputs −x^2 as its answer. −0.90 is a nice choice of negative threshold for this experiment: 134 out of the 285 CORR are lower than −0.99, 201 lower than −0.98, and 262 lower than −0.95. The "worst" CORR is −0.914526 (Term 156: +x^329).
The range of eligible thresholds is wide in this experiment: globally, the "best" correlation coefficient the wrong hypotheses give is −0.707664 in the first stage (Term 001: −x^757 and Term 002: −x^755) and −0.658716 in the second stage (Term 213: +x^177), so the range is roughly 0.2.
A dynamic threshold, a more general design, seems worth trying: locally, the difference between the "best" correlation coefficients from the optimal guess and from the second-best hypothesis for (b_k, f_{b_k}) is 0.547713 on average and 0.294544 at worst (Term 118: −0.925785 from +x^421 versus −0.631241 from +x^422).
The experiment for the HIDCPA sets (m, n, l) = (67, 6, 5) and adopts −0.95 as the threshold in its candidate pruning. Figure 7 and Figure 8 come from an example trial on the F3 board. In this trial the HIDCPA takes around 3.5 minutes to reveal f_760, f_759, …, f_10 and leaves f_9, f_8, …, f_0 for exhaustive search. In the f_559 ⋯ f_493 recovery, 42 67-coefficient hypotheses survive candidate pruning, and the 26th survivor becomes the HIDCPA's optimal guess. Unfortunately, this guess is partially incorrect, since the next block recovery yields no result. Thus, the start of the block candidates rolls back from f_492 to f_525 so that the HIDCPA can review its answers to f_525, f_524, …, f_493 (underlined in Figure 7) and meanwhile make its first meaningful guess of f_492, f_491, …, f_459. The comparison between Candidate 26 and Candidate 1 in Figure 7 shows that the f_559 ⋯ f_493 recovery only fails at f_493 (double-underlined in red). Here are some interesting observations from the example trial.
There is no (obvious) upper bound on the negative threshold in the HIDCPA: if the threshold goes too negative, no hypothesis can survive the candidate pruning. Indeed, the optimal guess does not always give the "best" correlation coefficients during the candidate pruning, so adversaries should sacrifice a little efficiency for broader effectiveness by selecting less extreme thresholds. Sadly, the HIDCPA's time consumption grows rapidly as the negative threshold rises. For example, switching the threshold from −0.95 to −0.90 makes the example trial take 17 minutes to finish the first block recovery and 261 minutes to reveal f_760, f_759, …, f_10.
Note that each of the coefficients f_{l−2}, f_{l−3}, …, f_0 corresponds to fewer than l samples. Since Algorithm 2 and Algorithm 3 only provide high-level descriptions of the HIDCPA, attackers should take care of this detail when implementing the last block recovery. Some may therefore worry about the accuracy of the last block recovery and the efficiency of the subsequent exhaustive search. We conducted some extra trials to relieve such concerns. According to the trials with n = 6, l = 5, and m ∈ {67, 61, 55, 49, 43, 37, 31, 25}, block recovery with m × l samples makes almost no errors on average. In the worst case (m = 43), one out of 150 revealed coefficients is wrong. Assume the exhaustive search for f_0, f_1, …, f_{l−2} is inevitable. The full private-key recovery needs the additional search for f_{l−1} on average, so the entire search roughly terminates in 0.5 × 3^l ≈ 122 rounds.
Note that the correlation attacks may well fail in their example trials if they judge a correlation coefficient only by its absolute value: in the VCPA's first stage, the correct guess (756, 755, 1, −1) gives −0.992131 while its deceptive counterpart (756, 755, −1, 1) gives +0.986618 as their "best" correlation coefficients. In the HIDCPA's first block recovery, the correct guess gives −0.987827 while its additive inverse gives +0.984279. In either case, the numerical gap in between is very small.

The experiment for the OTA mounts a chosen-input attack with offline template generation on the F4 board. The trial in Figure 9 sets c_0 = 2046 and shares the same secret with the example trial for the HIDCPA. After finding the boundaries {44, 156, …, 85164}, the OTA takes 0.5 seconds to achieve full private-key recovery. Figure 10 shows how many times each template vector gets used in this trial. It labels all the template vectors with their (3t + f_{p−1−i}), where t follows the definition for Table 2. The red, blue, and green bars are respectively for f_{p−1−i} = −1, 0, and 1. Figure 10 implies that the trial only needs 60 template vectors, namely those for t = −4, −3, …, 15. Moreover, because w = 286 ≪ 2p/3 ≈ 508, most of the blue bars are much higher than their adjacent counterparts in red and green. To summarize, Figure 10 confirms the reusability claim in subsection 3.3.

Software Countermeasures
There are three common software countermeasures for NTRU-like cryptosystems, with their prototypes first introduced in [LSCH10]. All designed for the polynomial multiplication in R_q, these countermeasures are compatible with NTRU Prime and the product scanning method in Table 1:

Countermeasure 1: the random initialization of e_i
Countermeasure 2: the randomized access to (c_j, f_{i−j}) pairs
Countermeasure 3: a first-order masking scheme

Countermeasure 1 assigns a random integer m_i ∈ Z/qZ to each e_i in the initialization stage, and removes all the m_i using modular subtractions after the product scanning method. The polynomial multiplication in Countermeasure 2 receives one more argument, a random permutation Perm. During the e_i calculation, the program iterates from j = 0 to j = (p − 1), adding the appropriate f_{i−Perm[j]} × c_{Perm[j]} to e_i. Countermeasure 3 computes the product from masked operands (the masked ciphertext and the masked private key); its operations are in R_q, and every multiplication follows the product scanning method in Table 1.
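The three countermeasures can be sketched in Python as follows. This is a simplified model: it keeps the product scanning loop but omits the x^p − x − 1 ring reduction, and the masking equations in cm3 are one standard first-order instantiation consistent with the statement below that D_2 depends only on m_f, m_c, and f, not necessarily the paper's exact formulas.

```python
import random

Q = 4591  # sntrup761 modulus

def product_scanning(f, c, q=Q):
    """Unprotected product scanning: e_i = sum_{j<=i} f[i-j] * c[j] mod q."""
    p = len(f)
    e = [0] * p
    for i in range(p):
        for j in range(i + 1):
            e[i] = (e[i] + f[i - j] * c[j]) % q
    return e

def cm1(f, c, q=Q):
    """Countermeasure 1: random initialization of each e_i."""
    p = len(f)
    m = [random.randrange(q) for _ in range(p)]
    e = m[:]                                  # each e_i starts at a random m_i
    for i in range(p):
        for j in range(i + 1):
            e[i] = (e[i] + f[i - j] * c[j]) % q
    return [(e[i] - m[i]) % q for i in range(p)]  # modular subtraction of m_i

def cm2(f, c, q=Q):
    """Countermeasure 2: randomized access order to the (c_j, f_{i-j}) pairs."""
    p = len(f)
    e = [0] * p
    for i in range(p):
        perm = list(range(i + 1))
        random.shuffle(perm)                  # fresh permutation per e_i
        for j in perm:
            e[i] = (e[i] + f[i - j] * c[j]) % q
    return e

def poly_sub(a, b, q=Q):
    return [(x - y) % q for x, y in zip(a, b)]

def cm3(f, c, q=Q):
    """Countermeasure 3, one possible first-order masking instantiation:
    f*c = (f+m_f)*(c+m_c) - m_f*c - D_2, where D_2 = m_c*(f+m_f) is
    precomputable from the masks and the private key alone."""
    p = len(f)
    m_f = [random.randrange(q) for _ in range(p)]
    m_c = [random.randrange(q) for _ in range(p)]
    f_m = [(fi + mi) % q for fi, mi in zip(f, m_f)]
    d2 = product_scanning(m_c, f_m, q)        # precomputable offline
    c_m = [(ci + mi) % q for ci, mi in zip(c, m_c)]
    d1 = poly_sub(product_scanning(f_m, c_m, q),
                  product_scanning(m_f, c, q), q)
    return poly_sub(d1, d2, q)
```

All three variants return the same result as the unprotected loop; cm3 performs two online multiplications, which matches the roughly doubled running time reported below.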
For the first defensive strategy, the increase in time consumption is negligible. The third takes just twice as long to complete the entire computation: D_2 only depends on the two masks m_f and m_c and the private key f. Thus, NTRU Prime can compute D_2 ahead of time in its key schedule, and an update of D_2 is unnecessary until a regeneration of the private key or the mask pair [MOP07]. All these defensive strategies can protect NTRU Prime from the analyses in section 3. However, this does not mean they are invincible. As shown later in this section, Countermeasure 1 and Countermeasure 2 are subject to chosen-input simple power analysis. Improper implementations of Countermeasure 3 are at risk of horizontal correlation power analysis [ATT+18] and online template attacks. Luckily, to the best of our knowledge, there is still no efficient attack against NTRU Prime with both ciphertexts and private keys masked. Despite its large overhead, the masking scheme is the software countermeasure this paper recommends the most.

Chosen-Input Simple Power Analysis (CISPA)
The introduction of the CISPA starts with Countermeasure 2 as the victim due to the simplicity of its CISPA implementation. Although the counterpart on Countermeasure 1 is a bit more sophisticated, the underlying idea remains the same: observe the victim's e_0, …, e_{p−1} calculation with ciphertexts of only one or two nonzero coefficients, and reveal f according to the discontinuities in each of the few collected power traces.
The CISPA on Countermeasure 2 sets c = c_0, where 3 | c_0 and c_0 ≠ 0, acquires one single trace from its target device, and partitions the full trace roughly into p partial traces corresponding to the e_i calculations, where i ∈ {0, 1, …, p − 1}. Each partial trace contains at most one discontinuity, whose existence indicates a change of e_i's value during the calculation. There are two types of such discontinuities: one signals f_i = 1 and the other f_i = −1. Attackers then classify the partial traces with the naked eye (and simple arithmetic if necessary) into three categories: No Discontinuity, Discontinuity I, and Discontinuity II. In our experiment, the major difference between the two types of discontinuities lies in the degree of their voltage drops.
The f_i recovery directly follows the categorization. To be specific, No Discontinuity implies e_i = 0 throughout the calculation, so f_i = 0. Either Discontinuity I or Discontinuity II implies e_i = c_0 at the end of the calculation (i.e., f_i = 1), and the other implies e_i = −c_0 (i.e., f_i = −1). Figure 11 shows the part of the full trace corresponding to the e_745, …, e_760 calculation (p = 761) and labels each partial trace with the category it belongs to: "X" stands for No Discontinuity, "I" for Discontinuity I, and "II" for Discontinuity II. These labels imply f = ±(x^759 − x^758 − x^754 + x^753 + x^751 + x^750 + x^746 + x^745 + ⋯). The error detection mechanisms in NTRU Prime [BCLvV16] help determine which sign is correct. Even if each e_i calculation follows an independent random permutation, this CISPA still works. An ad hoc solution is to shuffle all the accesses to (c_j, f_{i−j}) pairs regardless of the e_i calculation in which they are involved. Unfortunately, this solution is not an option for resource-constrained implementations due to its high entropy requirement.
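The categorization step can be sketched as a toy classifier. Each partial trace is modeled here as a piecewise-constant voltage level; the thresholds are hypothetical and would have to be calibrated per device, and which drop size corresponds to which sign is resolved only up to the global ± ambiguity mentioned above.

```python
# Toy discontinuity classifier for the CISPA on Countermeasure 2.

def classify_partial(trace, tol=0.1, big_drop=0.5):
    """Return 'X' (no discontinuity), 'I' (small drop), or 'II' (large drop).
    tol and big_drop are hypothetical, device-dependent thresholds."""
    base = trace[0]
    for v in trace[1:]:
        if abs(v - base) > tol:                    # level shift found
            return 'I' if abs(base - v) < big_drop else 'II'
    return 'X'

def recover_coeffs(labels, sign=1):
    """Map trace categories to key coefficients, up to a global sign
    resolved later by NTRU Prime's error detection mechanisms."""
    lut = {'X': 0, 'I': 1, 'II': -1}
    return [sign * lut[lab] for lab in labels]
```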
The second stage focuses on the e_{p−1} calculation in the product scanning method in Table 1 with (w − 1) further chosen ciphertexts; the b_1, b_2, …, b_w here follow the definition in subsection 3.1. As a result, each power trace in this stage gets divided by two discontinuities into three parts. f_{b_k} = f_{b_{w/4}} (or f_{b_{3w/4}}) if and only if the first part shares the same pattern with the last part. Figuratively, f_{b_1}, f_{b_2}, …, f_{b_w} cluster into two groups. The error detection mechanisms in NTRU Prime [BCLvV16] reveal f by finding out which group corresponds to f_i = 1. Figure 12 shows the result of this stage.

Though the two CISPA require a long observation span and a strict form of chosen inputs, they need no power model, few observations, and only a low sampling frequency. Besides, their underlying assumption is natural: the operations e_i += f_{i−j} × c_j (mod q) with c_j ≠ 0 have similar power patterns for a fixed e_i and f_{i−j} ∈ {−1, 0, 1}, and distinct power patterns for (three) different e_i. The three e_i values are from {−c_0, 0, c_0} in Countermeasure 2 and from Z/qZ in Countermeasure 1.
The two CISPA may well remain successful in noisy settings. They rely not on specific characteristics (e.g., voltage drops) but on the general change of power pattern in a (partial) trace after one particular multiply-accumulate-and-reduce (or smladx, in section 5). The attacks work as long as adversaries can discern such changes from the normal variation due to electronic noise and categorize them into only two classes. Thanks to the strict form of the chosen inputs, the power pattern stays uniform before and after the targeted operation, so adversaries can focus on any changes happening there. Hence, a device whose signal-to-noise ratio is low enough to hide such changes would be immune to many other, if not most, power attacks. Such a device is not among the targets that power analysis typically cares about.

Additional Remarks
Some may argue that Countermeasure 3 remains secure against efficient power analysis even if it only masks private keys (Variant K) or ciphertexts (Variant C). Unfortunately, neither variant is secure. Though originally designed for Frodo and NewHope, the horizontal CPA in [ATT+18] directly applies to Variant K, independently revealing f and m_f. The knowledge of ciphertexts remains useful because input ciphertexts, which are publicly accessible, participate in the polynomial multiplications without any disguise.

Experiments and Results
The experiments here are designed to confirm the CISPA's effectiveness on Countermeasure 1 and Countermeasure 2. The settings are almost the same as those in subsection 3.4, except that the CISPA allows much lower sampling frequencies. The CISPA samples the power consumption of Countermeasure 1 on the F3 board and of Countermeasure 2 on the F4 board at 434.39 kS/s and 115.38 kS/s, respectively. Figure 12 and Figure 11 present the results of these experiments.
The analysis on the first target requires a higher sampling frequency for two reasons. First, if e_i is randomly initialized, its Hamming weight (and those of other relevant intermediate values) may not change drastically after meaningful multiply-accumulate-reduce operations (nonzero f_{i−j} and c_j). Second, although the F3 board conforms a bit better to the Hamming weight power model than the F4 board, the data-dependent component of its power consumption is far less significant than the F4 counterpart's.
Here are the performance statistics. On the F4 board, Countermeasure 1 requires 16044870 clock cycles to complete the (f × c in R_q) calculation, Countermeasure 2 requires 25233239, Countermeasure 3 requires 31986334, and the unprotected version requires 15973389. The next section presents an optimized version. As we shall see, the optimized NTRU Prime remains subject to the power analyses in section 3 and section 4.

The Optimized NTRU Prime
The polynomial multiplication in R_q is side-channel informative and computationally intensive, so the majority of optimizations for NTRU-like cryptosystems focus on this operation [SDC09, BCLvV16, HRSS17, DWZ18, KRS19]. The pursuit of faster implementations is not only for performance improvement but also for side-channel leakage suppression.
Compared with the previous versions, the NTRU Prime in this section contains two optimizations at the instruction level on the product scanning method in Table 1. First, its e_i calculation only needs one reduction in Z/qZ, which follows a series of multiply-and-accumulates; in contrast, the old version [BCLvV16] demands that each multiply-accumulate operation stay strictly in Z/qZ. Second, every two smlabbs in the assembly are replaced with one smladx: smlabb adds the product of one (c_j, f_{i−j}) pair to e_i at a time, while smladx adds the products of two consecutive (c_j, f_{i−j}) pairs [ARM11]. Table 3 presents the implementations of the polynomial multiplication in R_q before and after each optimization.
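The two optimizations can be modeled in Python as follows: a wide accumulator reduced mod q only once per e_i, and paired multiply-accumulates mimicking smladx's dual 16-bit operation. This is a simplified sketch of the loop structure, again without the x^p − x − 1 ring reduction; it is not the paper's assembly.

```python
def product_scanning_opt(f, c, q=4591):
    """e_i via (1) a single deferred reduction and (2) paired MACs."""
    p = len(f)
    e = [0] * p
    for i in range(p):
        acc = 0                      # wide accumulator, no per-step reduction
        j = 0
        while j + 1 <= i:            # two (c_j, f_{i-j}) pairs per step,
            acc += f[i - j] * c[j] + f[i - j - 1] * c[j + 1]  # smladx analogue
            j += 2
        if j <= i:                   # odd leftover term (smlabb analogue)
            acc += f[i - j] * c[j]
        e[i] = acc % q               # one reduction per e_i
    return e
```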
After the first modification, the multiplication takes 5837648 clock cycles, 36.55% of the original running time. After the second, the running time further decreases to 2947368 clock cycles, 50.49% of the previous figure.

The Transfer of Power Analyses
In theory, the VCPA, HIDCPA, OTA, and CISPA are easily adaptable to the optimized NTRU Prime. Compared with Algorithm 1, the new VCPA takes into account the possibility that both the f_{i−j} and the f_{i−j−1} involved in the same smladx are nonzero; therefore, it may reveal two (b_k, f_{b_k}) pairs at a time in the second stage. Just like the smlabb of (c_j, f_{i−j}) before, every smladx of (c_j, f_{i−j}, c_{j+1}, f_{i−j−1}) here updates the e_i in memory. Following the spirit of Algorithm 2 and Algorithm 3, the new HIDCPA checks 2m-coefficient hypotheses and reveals 2m coefficients of f at a time with the m × l samples corresponding to the above memory updates. Also, in its error correction mechanism, the new HIDCPA rolls the starting index of the block back by (m/2) × 2 rather than m/2.
The new OTA partitions the target trace into p/2 target vectors, each corresponding to an operation e_{p−1} += (f_{p−1−j} × c_j + f_{p−2−j} × c_{j+1}) in Table 3, where j = 0, 2, …, (p − 2) or (p − 1). It then reveals all the (f_{p−1−j}, f_{p−2−j}) iteratively: in each round the new OTA prepares nine template vectors of the operation e_{p−1} += (a × c_j + b × c_{j+1}), where (a, b) ∈ {−1, 0, 1}^2. The online template generation relies on the knowledge of c and on the recovery results for f_{p−1}, f_{p−2}, …, f_{p−j}. The CISPA in subsection 4.2 is directly applicable, and we recommend the instance on Countermeasure 2 due to its simplicity.
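One round of the nine-template matching can be sketched as follows. The three-sample Hamming-weight leakage model is an assumption for illustration (real template vectors come from measured traces), and matching is done by nearest Euclidean distance.

```python
# Toy online-template matcher for the optimized code: nine hypothetical
# template vectors, one per (a, b) in {-1, 0, 1}^2.

def hw32(x):
    """Hamming weight of x as a 32-bit two's-complement word."""
    return bin(x & 0xFFFFFFFF).count('1')

def template_vec(e_prev, a, b, cj, cj1, q=4591):
    """Assumed leakage of e_{p-1} += a*c_j + b*c_{j+1}: the Hamming weights
    of the two products and of the updated accumulator."""
    e_new = (e_prev + a * cj + b * cj1) % q
    return [hw32(a * cj), hw32(b * cj1), hw32(e_new)]

def match(target_vec, e_prev, cj, cj1, q=4591):
    """Return the (a, b) whose template vector is nearest to the target."""
    best, best_d = None, float('inf')
    for a in (-1, 0, 1):
        for b in (-1, 0, 1):
            t = template_vec(e_prev, a, b, cj, cj1, q)
            d = sum((x - y) ** 2 for x, y in zip(t, target_vec))
            if d < best_d:
                best, best_d = (a, b), d
    return best
```

Each matched pair (a, b) yields (f_{p−1−j}, f_{p−2−j}) for one target vector; the recovered pair then feeds the next round's accumulator value, as in the original OTA.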
In practice, attackers may find the adaptation frustrating. The load/store instructions at the assembly level contribute many highly data-dependent components to microcontrollers' power consumption [MOP07]. Unfortunately, the substitution of smladx for smlabb cuts in half the number of such instructions available for the f recovery. As a result, the numerical gaps between correct and incorrect hypotheses shrink in correlation-based approaches.
The situation is worse for the new OTA and the CISPA: e_{p−1} += (f_{p−1−j} × c_j + f_{p−2−j} × c_{j+1}) takes far fewer clock cycles than the aggregate of e_{p−1} += f_{p−1−i} × c_i and e_{p−1} += f_{p−2−i} × c_{i+1} in subsection 3.3. The fewer data-dependent samples per f_i the OTA observes from each target vector, the more susceptible it is to electronic noise and the lower its resolution. The CISPA suffers the same setbacks.
Accordingly, the new VCPA requires more traces, and the new HIDCPA demands smaller m, smaller n, and larger l. Now that each (f_{p−1−j}, f_{p−2−j}) corresponds to 9 clock cycles, far fewer than the 56 clock cycles in the unoptimized case, the new OTA may well fail in practice. Despite its direct applicability, the CISPA no longer tolerates extremely low sampling frequencies.

Experiments and Results
The experiments below apply the new HIDCPA in subsection 5.2 and the CISPA in subsection 4.2 to the optimized NTRU Prime in subsection 5.1 on the F4 board. The experiment settings, if not specified, are the same as those in subsection 3.4 and subsection 4.4, including how a correlation coefficient is judged.
The new HIDCPA changes (m, n, l) to (7, 3, 10). The example trial in Figure 13 shares the same secret with that in Figure 7. In this trial the new HIDCPA takes around 9.5 minutes to reveal f_760, f_759, …, f_9 and leaves f_8, f_7, …, f_0 for exhaustive search. The CISPA samples the target's power consumption at 14.769 MS/s, and Figure 14 shows the result. The labels in Figure 14 imply f = ±(x^759 − x^758 − x^754 + x^753 + x^751 + x^750 + x^746 + x^745 + ⋯), the same secret as that in Figure 11. Furthermore, attackers can tell from the target's assembly that the discontinuities here result from the updates of e_i in memory. If so, the knowledge of c_0 = 2046 and the assumptive use of the Hamming weight power model together make f = x^759 − x^758 + ⋯ the only possibility.

Conclusion
This paper features multiple power analysis approaches on the product scanning method specialized for NTRU-like cryptosystems. We test the attacks on NTRU Prime, a Round 2 candidate in the NIST Post-Quantum Cryptography Standardization Project. The experiments cover the reference implementation, an implementation further optimized using SIMD instructions, and implementations featuring common protective measures. Every approach achieves full private-key recovery for both schemes in NTRU Prime.
The VCPA is fast and extensible to larger parameters. One single trace of short observation span suffices for the HIDCPA to reveal dozens of f_i at a time, quickly and reliably. The OTA needs no a priori power model but only three template traces per f_i from a fully controlled device similar to the target, one single target trace, and little computational resources. With the CISPA, attackers can uncover private keys of the protected NTRU Prime with the naked eye.
The approaches in this study focus on the polynomial multiplication in R_q in decapsulation. Nonetheless, the single-(target-)trace attacks on random inputs, namely the HIDCPA and the OTA, also apply to (a × Generator(S) in R_q) in NTRU LPRime's key generation, (h × r in R_q) in Streamlined NTRU Prime's encapsulation, and (b × A in R_q) in NTRU LPRime's encapsulation. This paper recommends that both operand polynomials in such multiplications be masked, since (the original version of) Countermeasure 3 is the sole survivor throughout the article.
Other ideal-lattice-based cryptosystems are likely susceptible to these approaches if their private-key coefficients are from a small set of possibilities. As the number of possible coefficients increases, the OTA and the CISPA require higher-resolution measurements, and the HIDCPA becomes computationally impractical.
The approaches potentially break even more optimized NTRU Prime. If the polynomial multiplications use multi-level Karatsuba ending with schoolbook multiplications between two n-long polynomials, these approaches can still target the schoolbook multiplications related to the multiplications between two upper halves or two lower halves in the multi-level Karatsuba. In theory, this adaptation leads to full private-key recovery. If Toom-k is used instead of Karatsuba as the first layer, this adaptation can only reveal the first and last 1/k of the private-key coefficients. How to adapt these approaches so that they succeed against a general mix of Toom and Karatsuba multiplications is an interesting follow-up question.
Finally, parts of these techniques may be adaptable to implementations that use the operand scanning method as the bottom layer, especially when the fixed operand is the small polynomial f and the scanning operand is the generic-looking c.
The e_{p−1} calculation contains w meaningful multiply-accumulate-reduce operations and thus w interesting internal states e_{p−1,1}, e_{p−1,2}, …, e_{p−1,w}, as defined in Figure 1. This VCPA contains two stages: the first reveals (b_w, b_{w−1}, f_{b_w}, f_{b_{w−1}}) all at once, while the second reveals the other (b_k, f_{b_k}) sequentially. The first stage calculates e_{p−1,2} from each hypothesis under examination, checking the similarity between the corresponding expected and real power consumptions. The second stage calculates each e_{p−1,w−k+1} similarly.
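The hypothesis testing in each stage can be illustrated with a toy vertical CPA on a single multiply-accumulate. The Hamming-weight-plus-noise leakage model, the trace count, and the noise level are assumptions for illustration, not the paper's measured setup; note that the correct guess is identified by the signed correlation, which echoes the sign ambiguity discussed in the experiments.

```python
# Toy vertical CPA on e = f_k * c_k mod q for a secret f_k in {-1, +1}.
import random

Q = 4591

def hw(x):
    return bin(x % Q).count('1')

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def vcpa_demo(n_traces=500, sigma=1.0, seed=1):
    """Score the hypotheses f_k = -1 and f_k = +1 by signed correlation."""
    rng = random.Random(seed)
    f_true = -1                                  # secret coefficient
    cs = [rng.randrange(1, Q) for _ in range(n_traces)]
    # Simulated traces: noisy Hamming-weight leakage of f_true * c mod q.
    leaks = [hw(f_true * c) + rng.gauss(0, sigma) for c in cs]
    return {g: pearson([hw(g * c) for c in cs], leaks) for g in (-1, 1)}
```

Under this model the correct guess scores close to +1 while its additive inverse scores visibly lower; comparing only absolute values would risk exactly the sign confusion reported in the experiments.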

Figure 3: Target Trace Partitioning in the OTA

Figure 7: The Error Correction Mechanism in the HIDCPA

Figure 8: Size Distribution of Block Candidates in the HIDCPA Recursive Construction

Figure 9: The Result of the OTA
Figure 10: How Many Times Each Template Vector Gets Used

Figure 12: The CISPA on Countermeasure 1: the second stage

Variant C cannot hide the b_1, b_2, …, b_w defined in subsection 3.1, and online template attacks may further reveal f_{b_k} by jointly considering c and m_c. The few possibilities for f_{b_k} render the template vectors of the eligible hypotheses in a round mutually distinguishable. In contrast, the original version of Countermeasure 3 survives every attack in this paper.

Table 1: The Polynomial Multiplication in R_q in Decapsulation

Table 2: The Mapping between (f*, c*) and Template Vectors

Table 3: The Two Optimizations at the Instruction Level