Polynomial multiplication on embedded vector architectures

Abstract. High-degree, low-precision polynomial arithmetic is a fundamental computational primitive underlying structured lattice-based cryptography. Its algorithmic properties and suitability for implementation on different compute platforms are an active area of research, and this article contributes to this line of work: Firstly, we present memory-efficiency and performance improvements for the Toom-Cook/Karatsuba polynomial multiplication strategy. Secondly, we provide implementations of those improvements on the Arm® Cortex®-M4 CPU, as well as the newer Cortex-M55 processor, the first M-profile core implementing the M-profile Vector Extension (MVE), also known as Arm® Helium™ technology. We also implement the Number Theoretic Transform (NTT) on the Cortex-M55 processor. We show that despite being single-issue, in-order and offering only 8 vector registers compared to 32 on A-profile SIMD architectures such as Arm® Neon™ technology and the Scalable Vector Extension (SVE), by careful register management and instruction scheduling we can obtain a 3× to 5× performance improvement over already highly optimized implementations on Cortex-M4, while maintaining the low area and energy profile necessary for use in the embedded market. Finally, as a real-world application we integrate our multiplication techniques into the post-quantum key-encapsulation mechanism Saber.


Introduction
The rapidly expanding Internet of Things (IoT) has an unprecedented impact on our digital ecosystem, so much so that it is often termed the fourth industrial revolution. However, it also poses significant challenges. Firstly (sadly, often rather lastly), security: Absence or insufficient quality of security measures on IoT devices is a frequent headline. Secondly, performance: Embedded devices have far fewer compute resources than high-end consumer devices such as mobile phones or desktops. The dichotomy of the situation is that tight resource constraints need to be imposed on these devices to allow them to be cost-effective, but equally they limit performance and, importantly, often impede incorporating secure cryptographic protocols. Both aspects are in fact closely tied because (public-key) cryptography is computationally demanding. It is therefore a constant tussle for cryptographers and embedded engineers alike to fit secure and advanced cryptographic protocols into constrained devices, and to have them perform sufficiently well to enable a wide range of applications. Unsurprisingly, an important criterion to assess a new cryptographic protocol is to evaluate its suitability for constrained devices.
The rise of Post-Quantum Cryptography (PQC) exacerbates the problem. Recalling the context first: Large-scale quantum computers are a threat to our current digital security infrastructure, especially public-key cryptography (PKC). The reason is that the security of today's predominant public-key cryptography - RSA and Elliptic Curve Cryptography (ECC) - rests on the hardness of computational problems which are known [Sho97, PZ03] to be solvable in polynomial time using a large quantum computer. Therefore, the National Institute of Standards and Technology (NIST) started a standardization procedure [NISa] for key-encapsulation mechanisms (KEM), public-key encryption (PKE) and signature schemes which resist the increased computational abilities of quantum computers - those schemes form the field of Post-Quantum Cryptography (PQC). These schemes will eventually replace their counterparts RSA and ECC used in our current public-key infrastructure. We refer to [NISb, Arm20] for more detailed introductions.
Post-Quantum Cryptography is highly relevant to the problem of balancing cost, performance and security in the IoT since it tends to have a higher resource footprint than classical public-key cryptography. It is thus vital to understand the demands and performance of PQC on embedded devices. Reflecting this, the Cortex-M4 processor has been designated by NIST as the reference platform for PQC on embedded systems.
The most prominent class of PQC is that of schemes based on structured lattices, and the underlying computational workload is the multiplication of polynomials of large degree and low coefficient precision. This problem has been studied extensively, and two approaches prevailed: Multiplication via the Toom-Cook/Karatsuba algorithms [Too63, Coo66, KO62], and multiplication via the Number Theoretic Transform (NTT). We are thus interested in the constraints and performance of these algorithms on embedded systems. Structurally, both approaches are very similar and amenable to the same tradeoffs [KMRV18, KRS19, MKV20]. The main benefit of Toom-Cook is the absence of modular arithmetic, which the NTT heavily relies on. Its main drawback, and the primary factor determining its memory usage, is that it expands the input both before ("evaluation") and during ("point multiplication") the core of the multiplication routine, while the NTT operates in-place.
On high-end processors, performance improvements can often be obtained by leveraging Single Instruction Multiple Data (SIMD) instruction set extensions, such as Arm's Neon instruction set on the R- and A-profiles of the Arm architecture, or the Intel® Advanced Vector Extension (AVX). On embedded systems, however, SIMD instruction extensions are challenging within the usually tight power, cost and area constraints. The M-profile Vector Extension (MVE), or Helium vector extension, is a rather recent addition to the M-profile of the Arm architecture which promises to introduce the performance benefits of vector extensions to embedded systems while maintaining a low gate and power profile. The Cortex-M55 processor is the first implementation of the Helium instruction set. The primary motivation of this work is to evaluate how this future generation of low-cost and low-energy processors fares in the context of the next generation of public-key cryptography.

Contributions
We contribute to the study of viability and performance of PQC on embedded devices. We study two themes: Firstly, how to reduce the resource demands of PQC for low-end embedded devices such as the Cortex-M4 CPU. Secondly, how to leverage the Helium instruction set on embedded devices implementing the Helium vector extension, focusing on the Cortex-M55 processor. Our specific contributions are as follows:
• We revive a known but little-used variant of Toom-Cook and Karatsuba called striding Toom-Cook/Karatsuba to reduce the memory usage of polynomial multiplication. In a nutshell, while expansion during evaluation remains, a suitable input reordering makes it possible to avoid size-doubling during point multiplication, thereby saving almost 50% in memory and also accelerating the interpolation step. See Table 1. We implement the striding variant of Toom-Cook/Karatsuba on the Cortex-M4 processor.
• We implement the striding Toom-Cook/Karatsuba and a 32-bit degree-256 negacyclic NTT on the Cortex-M55 processor, based on the Helium instruction set. We report an ≈ 5× speedup of our striding Toom-Cook/Karatsuba implementation compared to previous Cortex-M4 implementations, and an ≈ 3.5× speedup of our implementation of the NTT compared to the fastest NTT on Cortex-M4 [CHK + 21]. NTT-based multiplication remains faster, albeit by a smaller margin than on Cortex-M4.
• We provide an introduction to the M-profile Vector Extension and to the Cortex-M55 processor, highlight its similarities and differences compared to other vector extensions, and share our experience on what to look out for to get the most out of the instructions. We hope that this will enable researchers to build on our work and study the use of the Helium vector extension for other workloads, especially PQC schemes.

Preliminaries
We denote by Z_n the ring of integers modulo n ∈ Z. If n = q is a prime, we also write F_q. Unless stated otherwise, R denotes a commutative ring. We denote by R[X] the polynomial ring over R, and for a monic polynomial P ∈ R[X] we denote by R[X]/(P) the quotient of R[X] obtained by identifying polynomials with the same remainder modulo P. We will mostly work with F_q[X]/(X^n + 1) or Z_{2^k}[X]/(X^n + 1).

Polynomial multiplication
In this section we survey the most prominent sub-quadratic polynomial multiplication strategies: Toom-Cook/Karatsuba and the Number Theoretic Transform (NTT). Good expositions exist in the literature, so we will be brief. See e.g. the excellent [Ber01].

Multiplication by evaluation
The idea behind Toom-Cook/Karatsuba and NTT-based multiplication is the same: Multiplication via evaluation. Given a ring R, two polynomials f, g ∈ R[X] and fixed elements α_1, ..., α_n ∈ R, we have (fg)(α_i) = f(α_i) · g(α_i) for all i. Each equation puts a constraint on fg, and by choosing n large enough, we hope to be able to recover fg from the (fg)(α_i).
Formally, consider the evaluation homomorphism ev_α : R[X] → R^n, f ↦ (f(α_1), ..., f(α_n)). It factors through the quotient R[X]/∏_i (X − α_i), which as a set is canonically identified with R[X]_{<n}, the set of polynomials of degree < n. Depending on the context, we can view the evaluation homomorphism either as ev_α : R[X]/∏_i (X − α_i) → R^n, or as ev_α : R[X]_{<n} → R^n. It is an isomorphism if and only if for all i ≠ j, α_i − α_j is invertible in R. For example, this holds if R is a field and the α_i are pairwise distinct. If ev_α is bijective, we can thus fully recover a polynomial f ∈ R[X]_{<n} of degree < n from its evaluations f(α_i); equally, we can recover the residue of any f ∈ R[X] modulo ∏_i (X − α_i) from its evaluations. This technique can be used to calculate products in R[X]_{<n} (provided deg(f) + deg(g) < n) or in R[X]/∏_i (X − α_i). In both cases, we first compute the f(α_i), g(α_i) - the "evaluation step" - then the products f(α_i)g(α_i) - the "point multiplication step" - and finally recover fg from its evaluations at the α_i - the "interpolation step".
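To make the blueprint concrete, the following sketch (ours, not the paper's code) multiplies two polynomials over F_q via evaluation, pointwise multiplication, and Lagrange interpolation; the prime q = 97 and the point set are illustrative choices.

```python
# Multiplication by evaluation over F_q: evaluate f and g at n distinct
# points, multiply pointwise, and recover f*g by Lagrange interpolation.
q = 97  # small prime, so every nonzero difference alpha_i - alpha_j is invertible

def evaluate(f, alphas):
    """ev_alpha: coefficient list -> list of evaluations f(alpha_i) mod q."""
    return [sum(c * pow(a, i, q) for i, c in enumerate(f)) % q for a in alphas]

def interpolate(values, alphas):
    """Inverse of ev_alpha: recover the unique f of degree < n via Lagrange."""
    n = len(alphas)
    coeffs = [0] * n
    for j, (yj, aj) in enumerate(zip(values, alphas)):
        # Build L_j(X) = prod_{m != j} (X - alpha_m) / (alpha_j - alpha_m)
        basis, denom = [1], 1
        for m, am in enumerate(alphas):
            if m == j:
                continue
            # multiply basis by (X - alpha_m): new[i] = old[i-1] - alpha_m*old[i]
            basis = [(c2 - am * c1) % q for c1, c2 in zip(basis + [0], [0] + basis)]
            denom = denom * (aj - am) % q
        scale = yj * pow(denom, q - 2, q) % q  # Fermat inversion of the denominator
        for i, c in enumerate(basis):
            coeffs[i] = (coeffs[i] + scale * c) % q
    return coeffs

# Evaluation step, point multiplication step, interpolation step:
f, g, alphas = [2, 5], [3, 1, 4], [0, 1, 2, 3]  # deg(f) + deg(g) = 3 < 4 points
vals = [x * y % q for x, y in zip(evaluate(f, alphas), evaluate(g, alphas))]
fg = interpolate(vals, alphas)
```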
It is common to add the "evaluation at ∞", defined as the leading coefficient f(∞) := f_n for f ∈ R[X]_{≤n}, giving ev_α^∞ : R[X]_{≤n} → R^{n+1}. ev_α^∞ is bijective if and only if ev_α is. In this case, we can therefore recover a polynomial of degree ≤ n from its evaluations at α_1, ..., α_n plus its n-th coefficient.
The Toom-Cook algorithm generalizes Karatsuba's. Concretely, we are interested in 4-way Toom-Cook defined by the points 0, ∞, ±1, ±1/2, 2 and applied over a ring R where 2^k = 0 for some k - we will mainly be looking at Z_{2^13}. This setting has been introduced and studied in [KMRV18], and a few subtle points should be noted - see [KMRV18] for details: Firstly, to make sense of the fractional evaluations in Z, one scales by α^{deg(f)}, so that for f = a_0 + a_1 X + a_2 X^2 + a_3 X^3 we have f(1/2) := 8a_0 + 4a_1 + 2a_2 + a_3. Secondly, ev_α is not an isomorphism: In fact, if 2^k = 0 in R but 2^{k−1} ≠ 0, the polynomial f := 2^{k−1} X(X + 1) satisfies f(α) = 0 for all α ∈ {0, ∞, ±1, ±1/2, 2}. However, [KMRV18] explains that multiplication by evaluation still works if we temporarily operate with k + 3 bits of precision - that is, we lift representatives from Z_{2^k} to Z_{2^{k+3}} below 2^k, compute evaluation, multiplication and interpolation in Z_{2^{k+3}}, and reduce to Z_{2^k} in the end.
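The non-injectivity claim is easy to check numerically. The snippet below (our illustration) evaluates f = 2^{k−1} X(X + 1) at all seven Toom-4 points for k = 13, using the scaled fractional evaluations for degree-3 inputs, and confirms that every evaluation vanishes modulo 2^k:

```python
# Check that f = 2^(k-1) * X * (X+1) lies in the kernel of ev_alpha over Z_2^k
# for the Toom-4 point set {0, inf, +-1, +-1/2, 2}, with fractional points
# scaled by 2^3 as in the text: f(1/2) := 8a0 + 4a1 + 2a2 + a3.
k = 13
q = 1 << k
# f = 2^(k-1) * (X + X^2), written as a degree-3 coefficient tuple (a0, a1, a2, a3)
a0, a1, a2, a3 = 0, 1 << (k - 1), 1 << (k - 1), 0

evals = {
    "0":    a0,
    "1":    a0 + a1 + a2 + a3,
    "-1":   a0 - a1 + a2 - a3,
    "2":    a0 + 2*a1 + 4*a2 + 8*a3,
    "1/2":  8*a0 + 4*a1 + 2*a2 + a3,   # scaled by 2^3
    "-1/2": 8*a0 - 4*a1 + 2*a2 - a3,   # scaled by 2^3
    "inf":  a3,
}
# every evaluation is 0 mod 2^k, so ev_alpha is not injective over Z_2^k
assert all(v % q == 0 for v in evals.values())
```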

Iterating Toom-Cook/Karatsuba
Toom-Cook/Karatsuba multiplication is usually applied iteratively, with R being a polynomial ring itself, transforming a product of two polynomials of large degree into numerous products of polynomials of smaller degree, which can then be computed by more direct means such as schoolbook multiplication. We explain this for k-way Toom-Cook.
Given a polynomial f ∈ R[X]_{<n} with n = kr, we can group its monomials into the k blocks of exponents {ir, ir + 1, ..., (i + 1)r − 1}, i = 0, ..., k − 1, and thereby express f as f = Σ_{i<k} f_i(X) · X^{ir} with f_i ∈ R[X]_{<r}. Formally, we consider lifts under the substitution homomorphism R[X][Y] → R[X], Y ↦ X^r: viewing R[X] as the coefficient ring, we then apply Toom-Cook to the lifts in R[X][Y], computing the point products over R[X]_{<r} and resolving Y = X^r at the end. Ultimately, this computes a degree-n product via 2k − 1 products of degree n/k. Toom-Cook and Karatsuba can be combined. E.g., one layer of 4-way Toom-Cook and two layers of Karatsuba reduce a degree-16n product to 63 degree-n products.
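One layer of the scheme is easiest to see for k = 2, where Toom-Cook specializes to Karatsuba: split f = f_0 + Y·f_1 with Y = X^r, compute the 2k − 1 = 3 half-size products at the points 0, 1, ∞, and recombine by resolving Y = X^r. A sketch (ours, not the paper's code):

```python
# One Karatsuba layer: a degree-2r product from three degree-r products.
def schoolbook(f, g):
    """Plain coefficient-list multiplication in R[X]."""
    out = [0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            out[i + j] += a * b
    return out

def karatsuba(f, g):
    """One Karatsuba layer for equal, even-length coefficient lists."""
    r = len(f) // 2
    f0, f1 = f[:r], f[r:]          # f = f0 + Y*f1 with Y = X^r
    g0, g1 = g[:r], g[r:]
    p0   = schoolbook(f0, g0)      # evaluation at Y = 0
    p1   = schoolbook([a + b for a, b in zip(f0, f1)],
                      [a + b for a, b in zip(g0, g1)])  # evaluation at Y = 1
    pinf = schoolbook(f1, g1)      # evaluation at Y = inf
    mid  = [p1[i] - p0[i] - pinf[i] for i in range(len(p1))]
    # resolve Y = X^r: blocks overlap since each c_i has degree 2r - 2
    out = [0] * (2 * len(f) - 1)
    for i, c in enumerate(p0):   out[i] += c
    for i, c in enumerate(mid):  out[i + r] += c
    for i, c in enumerate(pinf): out[i + 2 * r] += c
    return out
```

The overlapping additions in the final loop are exactly the "resolving Y = X^r" step; iterating `karatsuba` on the three sub-products (instead of calling `schoolbook`) gives the multi-layer variants discussed above.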

Number Theoretic Transform
The Number Theoretic Transform (NTT) instantiates the "multiplication by evaluation" blueprint with R = F_q for a prime q with n | q − 1. Then, it is known that F_q has a primitive n-th root of unity ζ_n, that X^n − 1 = ∏_{i<n} (X − ζ_n^i), and that ev_α for α = (ζ_n^0, ζ_n^1, ..., ζ_n^{n−1}) is an isomorphism NTT : F_q[X]/(X^n − 1) → F_q^n. If n = 2^k, then NTT can be computed iteratively, layer by layer, by repeatedly splitting each quotient ring in two: the k-th entry in the i-th layer is F_q[X]/(X^{n/2^{i+1}} − ζ_{2^{i+1}}^{rev_i(k)}), where rev_i is the bit-reversal on i + 1 bits (for example, rev_2 maps 0b011 to 0b110). This strategy is akin to the Cooley-Tukey algorithm for the FFT and achieves complexity n log_2(n). Since NTT is a ring isomorphism, we can compute the product of a, b ∈ F_q[X]/(X^n − 1) as NTT^{−1}(NTT(a) ∘ NTT(b)), where ∘ is pointwise multiplication in F_q^n. We consider the NTT from an implementation view. Every transition between layers is of the form F_q[X]/(X^{2m} − ζ^2) ≅ F_q[X]/(X^m − ζ) × F_q[X]/(X^m + ζ). In terms of the standard monomial basis, this map is computed by the Cooley-Tukey (CT) butterfly (a, b) ↦ (a + ζb, a − ζb), and the Gentleman-Sande (GS) butterfly (a, b) ↦ (a + b, (a − b)ζ^{−1}) reverses the Cooley-Tukey butterfly up to a factor of 1/2. The CT and GS butterflies are two of the central primitives to implement for an NTT-based polynomial multiplication; the third is point-wise multiplication in F_q.
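The radix-2 structure and the CT butterfly can be sketched compactly in a recursive form (our illustration; the paper's implementations are iterative and in-place). The toy parameters q = 257 and n = 8 satisfy n | q − 1; the primitive n-th root is found by search rather than hard-coded:

```python
# Recursive radix-2 NTT over F_q and NTT-based multiplication in F_q[X]/(X^n - 1).
q, n = 257, 8
# primitive n-th root of unity: w^n = 1 and w^(n/2) = -1
w = next(g for g in range(2, q) if pow(g, n, q) == 1 and pow(g, n // 2, q) == q - 1)

def ntt(a, root):
    """Decimation-in-time: split into even/odd parts, recurse, CT-combine."""
    m = len(a)
    if m == 1:
        return list(a)
    even = ntt(a[0::2], root * root % q)
    odd  = ntt(a[1::2], root * root % q)
    out, t = [0] * m, 1
    for i in range(m // 2):
        # CT butterfly: (a, b) -> (a + zeta*b, a - zeta*b)
        out[i]          = (even[i] + t * odd[i]) % q
        out[i + m // 2] = (even[i] - t * odd[i]) % q
        t = t * root % q
    return out

def intt(a):
    """Inverse NTT: forward transform with w^-1, then scale by n^-1."""
    ninv = pow(n, q - 2, q)
    return [c * ninv % q for c in ntt(a, pow(w, q - 2, q))]

def cyclic_mul(a, b):
    """a*b in F_q[X]/(X^n - 1) as NTT^-1(NTT(a) o NTT(b))."""
    return intt([x * y % q for x, y in zip(ntt(a, w), ntt(b, w))])
```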

Modular Arithmetic in F q
We briefly recall two prominent algorithms in the context of modular multiplication: Barrett reduction and Montgomery multiplication.
The central problem is as follows: Given a modulus n ∈ Z and a "large" a ∈ Z, find "small" representatives of a modulo n. Two choices stand out: Firstly, the canonical unsigned representative a mod+ n is the unique representative of a modulo n in the interval {0, 1, ..., n − 1}. Secondly, the canonical signed representative a mod± n is the unique representative of a modulo n in the interval {−⌊n/2⌋, −⌊n/2⌋ + 1, ..., ⌈n/2⌉ − 1}. These can naïvely be calculated as a mod+ n = a − n·⌊a/n⌋ and a mod± n = a − n·⌊a/n⌉, where ⌊·⌉ denotes rounding to the nearest integer, but such an approach is unacceptable from a performance perspective since division by n is expensive in general. Barrett and Montgomery reduction are two ways around this, trading the division by n for a division by a power of 2, which can be cheaply implemented as a bitshift.

Barrett reduction
The idea behind Barrett reduction is to pick ℓ such that 2^ℓ > n and approximate ⌊a/n⌉ = ⌊a · (2^ℓ/n)/2^ℓ⌉ ≈ ⌊a · ⌊2^ℓ/n⌉/2^ℓ⌉, where C := ⌊2^ℓ/n⌉ can be precomputed. See Algorithm 1. For ℓ and a bounded by the machine word size, the resulting modular reduction a − n·⌊aC/2^ℓ⌉ is amenable to direct mapping to many instruction sets: It is a combination of a high-multiply with rounding for x ↦ ⌊x·C/2^ℓ⌉ and a low multiply-accumulate for (x, a) ↦ a − n · x. We will discuss the details in the case of the Helium vector extension in Section 6.4.1.
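A scalar model of this reduction (ours, not the Helium code of Section 6.4.1; the modulus n = 3329 and shift ℓ = 26 are illustrative choices):

```python
# Barrett reduction with a rounding "high-multiply": C approximates 2^l / n.
n, l = 3329, 26                  # example odd modulus; 2^l > n
C = (2**l + n // 2) // n         # precomputed C = round(2^l / n)

def barrett(a):
    t = (a * C + 2**(l - 1)) >> l    # high-multiply with rounding: round(a*C / 2^l)
    return a - n * t                 # low multiply-accumulate

r = barrett(100000)
assert (r - 100000) % n == 0 and abs(r) <= n   # small representative of a mod n
```

For inputs bounded well below 2^ℓ the result is (close to) the signed canonical representative; the exact bound depends on the input range and on ℓ.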

Montgomery reduction and multiplication
Montgomery reduction is a way to compute a representative a′ s.t. 2^ℓ · a′ ≡ a modulo n if n is odd; in other words, a′ is a small representative of the quotient a/2^ℓ computed in Z_n - note that the latter expression makes sense because n is odd, so 2 is invertible modulo n.
If a is already divisible by 2^ℓ in Z, we can set a′ := a/2^ℓ, computed in Z. Otherwise, if we pick k := a · n^{−1} mod± 2^ℓ, where a · n^{−1} is computed in Z_{2^ℓ}, we have n · k ≡ a modulo 2^ℓ by construction, and so a − n · k is divisible by 2^ℓ. We can thus set a′ := (a − n · k)/2^ℓ. We are interested in Montgomery reduction for ℓ bounded by the machine word size, and a < 2^{2ℓ} a product of two single-width values x, y < 2^ℓ. In this case, it can be expressed in single-width operations: two high-multiplications (x, y) ↦ ⌊xy/2^ℓ⌋ and k ↦ ⌊n·k/2^ℓ⌋, and two low multiplications (x, y) ↦ xy mod+ 2^ℓ and a ↦ a · n^{−1} in Z_{2^ℓ}. See Algorithm 2 for both the double-width and the single-width description of Montgomery multiplication. We will discuss the details in the case of the Helium vector extension in Section 6.4.1.
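The double-width description translates into the following sketch (ours; n = 3329 and ℓ = 16 are example parameters, and the split into separate high/low multiplies of Algorithm 2 is left implicit):

```python
# Montgomery reduction: given a with |a| < n * 2^l and n odd, compute a'
# with 2^l * a' == a (mod n) and a' small.
n, l = 3329, 16        # example odd modulus and shift
R = 1 << l
ninv = pow(n, -1, R)   # n^-1 mod 2^l, exists since n is odd

def montgomery_reduce(a):
    k = (a * ninv) % R        # low multiply: k = a * n^-1 mod 2^l
    if k >= R // 2:           # take the signed representative k = a*n^-1 mod+- 2^l
        k -= R
    return (a - n * k) >> l   # a - n*k is exactly divisible by 2^l by construction

def montgomery_mul(x, y):
    """Computes a small representative of x * y * 2^-l mod n."""
    return montgomery_reduce(x * y)
```

Note the result carries a factor 2^{−ℓ}, which implementations typically absorb by keeping operands in the Montgomery domain or by folding 2^ℓ into precomputed constants.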

M-profile Vector Extension (MVE)
In this section we give an introduction to the M-profile Vector Extension (MVE), also referred to as Arm® Helium™ Technology. We will focus on aspects of MVE that are relevant to the algorithms studied in this paper, and refer to [Arme, Mar21] for more exhaustive introductions to the Helium instruction set, as well as to [Armb] for the full reference. We also found the blog series [Armc, Part 1-4] very useful.

A primer on vector architectures
Vector architectures promise to accelerate computation through the introduction of Single Instruction Multiple Data (SIMD) operations: They add a set of wide registers that are viewed as vectors of equal-sized blocks called lanes, and provide instructions that operate on those lanes in parallel. For example, a 128-bit vector register can be viewed as a length-8 vector of 16-bit elements, and a 16-bit vector add instruction (e.g. VADD.u16 in Helium, see Figure 1) adds the lanes of two such vectors, storing the results in the lanes of a third vector - thus the term "Single Instruction, Multiple Data".
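A scalar model (ours, for illustration only) of the VADD.u16 example: a 128-bit register is treated as 8 unsigned 16-bit lanes, and each lane wraps modulo 2^16 independently, with no carry propagating between lanes.

```python
# Model of a 128-bit vector register as 8 x u16 lanes and a lane-wise add.
import struct

def vadd_u16(va: bytes, vb: bytes) -> bytes:
    """Lane-wise 16-bit add, as in Helium's VADD.u16 (no inter-lane carry)."""
    a = struct.unpack("<8H", va)   # 16 bytes -> 8 unsigned 16-bit lanes
    b = struct.unpack("<8H", vb)
    return struct.pack("<8H", *(((x + y) & 0xFFFF) for x, y in zip(a, b)))
```

The key point the model captures: an overflow in one lane (0xFFFF + 1 → 0) does not disturb its neighbours, unlike a 128-bit scalar addition.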
It is intuitively clear that vectorization holds great potential for accelerating computational workloads. However, the exact order of speedup is influenced by a large number of architectural and microarchitectural parameters, which tend to also have implications on both hardware cost and software constraints. Choosing a specific set of parameters is therefore a careful tradeoff tailored to the target domain. We provide some details in the rest of this section to help the reader put the design choices of Helium and Cortex-M55 and their implications on hardware and software into perspective.
Architecturally, two parameters impacting the speedup potential of a SIMD architecture are the expressiveness of its instruction set and the size of the vector registers: A SIMD architecture with 256-bit vectors, such as provided by AVX2 or 256-bit implementations of Arm's Scalable Vector Extension (SVE), performs twice as many operations per instruction compared to SIMD on 128-bit vectors, such as provided by Helium or Neon. Also, specialized instructions for common operations or operation sequences (such as multiply-accumulate) lower the number of instructions required to express a computation. It is clear, however, that tailoring those parameters for performance comes at a significant cost in hardware.
Another architectural parameter of interest is the number of architectural vector registers, which constitutes a less obvious tradeoff: To begin, fewer vector registers clearly imply a lower hardware cost but larger constraints on software. However, the extent to which those constraints imply a performance loss is less obvious, and very much depends on the workload under consideration.
Microarchitecturally, primary parameters influencing the potential of vectorization are the number of SIMD execution units, the latency and throughput of vector instructions, and out-of-order execution capabilities. For example, the Cortex-X1 CPU has four Neon execution units and therefore twice the vectorization throughput potential compared to a Cortex-A78 CPU with two Neon execution units. Whether such potential can be realized, however, heavily depends on the nature of the workload (e.g. data flow, instruction distribution) and on the scheduling of instructions: In the case of high-end out-of-order CPUs, the latter is mostly taken care of by the CPU itself - even to the extent that multiple iterations of a loop can be executed in parallel to create enough independent data streams to keep all SIMD units busy. Out-of-order execution also relaxes constraints on software since instruction scheduling happens dynamically, but the downside of those powerful microarchitectural features is a significant hardware cost. In-order execution, such as found in all M-profile processors as well as low-end A-profile processors, considerably improves cost and efficiency, but there it is the responsibility of the programmer/compiler to keep the SIMD unit(s) busy through careful instruction scheduling.
To summarize: Within the design space of SIMD architectures and microarchitectures, there are tradeoffs between hardware cost and performance potential, and between hardware cost and software constraints towards reaching this potential. Whether those constraints can be met is a workload-specific problem. We will next describe the tradeoffs and constraints in the case of the Helium vector architecture and the Cortex-M55 microarchitecture. In Section 6, we will discuss how to solve the software constraints in the case of the PQC workloads under consideration, thereby realizing the full performance potential of Helium + Cortex-M55 while maintaining a low gate and power profile.

Introduction to MVE
The M-Profile Vector Extension (MVE) is a feature of the Armv8.1-M architecture [Armb, Arme], primarily for signal processing and machine learning applications. MVE is also referred to as the Helium vector extension, in alignment with the Arm® Neon™ Technology architecture extension for A-profile processors [Arma, Section C.3.5]. However, while there are similarities between the Helium instruction set and the Neon instruction set, the Helium vector extension is a new ground-up architecture design specifically tailored for the extreme area and energy efficiency required in typical Cortex-M applications.
The vector file Like the Neon instruction set, the Helium vector extension uses 128-bit vector registers. There are 8 vector registers, compared to 32 in the Neon instruction set, a design choice which can be approached from multiple angles: Firstly, it results in a lower gate count and power profile. Secondly, the lower register count is compensated for by multiple features, including the presence of numerous instructions utilizing both vector and scalar registers. The suitability of scalar-vector operations is a main difference between M- and A-profile: On A-profile implementations, the physical distance between vector and scalar register file is too large to allow for low-latency scalar-vector operations. The smaller scale of M-profile CPUs means that these operations become feasible.
Implication: Algorithm design A common technique for vectorization of computational workloads is batching, whereby multiple instances of a higher-level construct are computed in parallel, rather than the computation of a single instance of the construction being vectorized and accelerated.
Since batching tends to use a large number of vector registers but a low number of general purpose registers (GPR), we found it not suitable for use with the Helium vector extension except for very small computations. Instead, we found that the Helium instruction set works well for single-instance speedup, using a mix of the vector and GPR files. We will see this in the example of schoolbook multiplications below.

Instruction Overlapping
The idea of overlapping the execution of instructions is fundamental to CPU microarchitecture and at the heart of the concept of a pipeline. However, it is commonly hidden from the programmer. The Helium architecture extension deviates from this by making instruction overlapping part of the architecture.
The execution of each vector instruction is architecturally subdivided into four 32-bit parts called beats. The operation of each beat may affect multiple lanes: For example, the VADD.u16 operation operates on two 16-bit lanes per beat. The Helium architecture extension requires that each beat of an instruction is executed in-order, but it allows implementations to execute beats from the following instruction whilst still executing beats from the previous instruction [Armb]. For example, a vector load VLDR instruction can be overlapped with a vector multiply-accumulate VMLA as shown in Figure 2.
Because the load/store units and arithmetic units can both be kept busy at the same time, even on a single-issue CPU, the architected instruction overlap can significantly improve performance at minimal area overhead. Moreover, implementations benefit from the fact that instruction overlapping is architectural because they do not need logic to maintain the illusion of atomic execution: The Helium architecture extension allows instructions to be interrupted between architectural beats.

Implication: Instruction design Support for instruction overlapping has to be built into the design of the architecture: For example, consider a data dependency between two consecutive instructions A, B on a dual-beat system. In this case, beats B0, B1 of B can only commence if the inputs (produced by beats A0, A1 of A) are available. Generally, this is supported by instructions whose beats describe the same amount of work and operate on lanes independently, and from right to left. This explains why there is a 128-bit left shift Helium instruction VSHLC (see Figure 3), but no 128-bit right shift. Long multiplies are another example: The inputs to VMULL in the Neon instruction set are lower and upper halves of 128-bit vectors. This would not fare well with beat-wise execution, and instead there are instructions VMULL{B,T} in Helium operating on even or odd parts of the inputs, which is beat-friendly. We again refer to [Armc, Part 1] for details and more examples.
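A scalar model (ours) of the beat-friendly long multiplies: VMULLB multiplies the even-numbered ("bottom") 16-bit lanes, VMULLT the odd-numbered ("top") lanes, each producing four 32-bit results, so every beat can be computed from lanes in the same position.

```python
# Model of Helium's VMULLB/VMULLT: even/odd 16-bit lanes -> four 32-bit lanes.
def vmullb_s16(a, b):
    """Long multiply of the even ("bottom") lanes; a, b are 8-lane lists."""
    return [a[2 * i] * b[2 * i] for i in range(4)]

def vmullt_s16(a, b):
    """Long multiply of the odd ("top") lanes."""
    return [a[2 * i + 1] * b[2 * i + 1] for i in range(4)]
```

Contrast this with Neon's VMULL, which pairs the lower or upper half of a vector: there, a beat of the result would depend on lanes from a different position of the input, which conflicts with beat-wise execution.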
Figure 3: The semantics of the beat-friendly long left shift with carry instruction VSHLC

Implication: Programming discipline There are three main ways to develop code for the Helium instruction set: Auto-vectorization, intrinsics and handwritten assembly. In the former two cases, the compiler has the responsibility of instruction scheduling. When handwriting Helium instructions, however, it is important to carefully study the ordering of instructions in order to make optimal use of instruction overlapping and the available computational resources. For example, the sequence in Figure 2 would run 50% slower on a system with 64-bit data paths if we batched the multiplies and load operations. Generally, alternating instructions of different nature (load/store, add/sub, multiply) is preferred.
Low overhead loops Armv8.1-M adds the Low Overhead Branch Extension, allowing software to indicate how many iterations of a loop will be executed and letting the hardware handle counter modification and branching. Importantly, the loop-end instruction LE may cache details of the loop and allow processors to optimize future iterations, skipping subsequent LE instructions and overlapping instructions across loop iterations. The following iterations of the loop can therefore perform as if the loop had been unrolled at compile-time.

Interleaving memory operations
The Helium instruction set adds support for deinterleaving data during loads via the instruction sequences VLD2{0-1} and VLD4{0-3}, and interleaving data during stores via VST2{0-1} and VST4{0-3}. Logically, those correspond to a load of {2, 4} · 16 bytes of data, combined with the corresponding 2-way or 4-way transposition. It should be noted that the feasibility of such instructions in an embedded architecture is a non-trivial question, as the required interleaving logic has to be implementable at low cost. We refer to [Armc, Part 2] for details on how this is achieved. See Figure 4 for an illustration of VLD2{0,1}. The instructions can appear in any order and be interleaved with other operations in support of instruction overlapping.
Listing 1: Computing the lane-wise absolute difference via lane predication

Lane predication The Helium vector architecture supports restricting the effect of a vector instruction to a subset of lanes through lane predication. Here, the subset of "active" lanes is determined by a dedicated register VPR.P0 (see e.g. Figure 1) which can be set directly or defined via a "predicate" over the contents of vector registers. Moreover, the predicate can be inverted, thereby allowing for the construction of if-then-else style instruction sequences using predication. For example, Listing 1 shows how to compute the absolute difference between two vectors using predication. See also [Mar21, Section 4.4].
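A scalar model (ours, mirroring the pattern of Listing 1 rather than reproducing its assembly): a compare sets the predicate, the subtraction executes on the active lanes, and the inverted predicate covers the remaining lanes.

```python
# Model of predicated absolute difference: |a - b| lane by lane.
def vabd(a, b):
    """a, b: equal-length lists of lane values."""
    pred = [x >= y for x, y in zip(a, b)]                # compare sets the predicate
    out = [x - y if p else 0
           for x, y, p in zip(a, b, pred)]               # sub on active lanes
    out = [o if p else y - x
           for x, y, p, o in zip(a, b, pred, out)]       # sub under inverted predicate
    return out
```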

Cortex-M55
The Cortex-M55 processor is the first implementation of the Armv8.1-M architecture, including optional support for the Helium vector extension. Cortex-M55r1 is the most recent revision of the Cortex-M55 processor, providing additional features and performance optimizations. We give a very brief introduction here and refer to [Armd] for details.

Pipeline and memory
The Cortex-M55 processor has a 5-stage in-order pipeline when Helium is included, and instruction issue is single-issue with a few exceptions. In addition to cache-able main memory, the Cortex-M55 processor supports tightly coupled memory (TCM) interfaces for instructions and data, allowing fast and deterministic memory access which is useful for real-time or performance-critical applications. The total D-TCM bandwidth is 128-bit/cycle, allowing 64-bit/cycle bandwidth for processing such as vector processing, plus 64-bit/cycle for DMA transfers to/from TCM running in the background. We locate all our code and data in TCM for our measurements.
Vector processing The Cortex-M55 processor is a dual-beat implementation of the Helium instruction set which supports instruction overlapping. This means that it takes two cycles to perform the work of a vector instruction, and that parallel execution of the next instruction can commence in the second cycle, computing resources permitting. Figure 2 shows how a sequence of alternating VMLA, VLDR would perform on the Cortex-M55 processor, in the absence of other hazards and assuming data and code in TCM.
The Cortex-M55 processor has separate units for load/store (e.g. VLDR, VSTR), additive integer operations (e.g. VADD, VSUB), and multiplicative integer as well as floating point operations (e.g. VMUL). We therefore tried to achieve an alternation of those three kinds of operations as a first step towards good instruction overlapping. The assembly snippets below reflect the different instruction types through their color.

Memory efficient striding Toom-Cook
In this section, we describe "striding" Toom-Cook multiplication. While not new - see e.g. [Ber01] - it does not appear to have been used in lattice-based cryptography before.
Recall from Section 2.2.2 that when using classical k-way Toom-Cook for large-degree multiplication, the inputs in R[X] are lifted to R[X][Y] via the substitution Y = X^{n/k}. The problem with this approach is that even if, ultimately, we care about size-preserving multiplication in R[X]/(X^n + 1), the base multiplication in R[X]_{<n/k} is length-doubling: The polynomial degree is too small for wraparound. This size-doubling during point multiplication in turn raises the memory usage.
This can be improved by using a different ring isomorphism, effectively changing the arrangement of X and Y: We lift the polynomials from R[X]/(X^n + 1) to S[X]/(X^k − Y), where S := R[Y]/(Y^{n/k} + 1) is another negacyclic polynomial ring with size-preserving multiplication. In this way, the polynomial reduction can be moved into the point multiplication, significantly reducing the memory usage of Toom-Cook multiplication. Another advantage is that the interpolation operates on smaller polynomials and is thus faster.

Difference in evaluation:
Striding Toom-Cook/Karatsuba evaluation differs from classical evaluation only through the memory access pattern, which is no longer contiguous but at stride k. In Section 5 and Section 6 we discuss how to overcome this with a low performance penalty on the Cortex-M4 and Cortex-M55 processors. Alternatively, one can permute the input upfront and thereby reduce striding evaluation to classical evaluation. It depends on the context whether this is feasible - we comment on the case of Saber below.

Differences in interpolation: While classical interpolation gives ab = Σ_{i<2k−1} c_i(X) · Y^i with Y = X^{n/k} and c_i ∈ R[X], striding interpolation gives ab = Σ_{i<2k−1} c_i(Y) · X^i with Y = X^k and c_i ∈ S. Resolving the substitution Y = X^{n/k} in the classical context means overlapping the upper half of c_i with the lower half of c_{i+1}. Resolving the substitution Y = X^k in the striding context gives ab = Σ_{i<k} (c_i(X^k) + X^k · c_{k+i}(X^k)) · X^i - computing this thus involves a negacyclic rotation of c_{k+i} and addition onto c_i.
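The 2-way (Karatsuba) case of the striding decomposition can be sketched as follows (our illustration, with plain integer coefficients): the strided split f = f_0(X^2) + X·f_1(X^2) yields three half-size products that live in the size-preserving negacyclic ring R[Y]/(Y^{n/2} + 1), and the Y-weighted term is recombined via a negacyclic rotation, exactly as described above.

```python
# 2-way striding multiplication in R[X]/(X^n + 1) from three half-size
# negacyclic products, verified against a direct negacyclic schoolbook.
def negacyclic_mul(f, g):
    """Schoolbook in R[Y]/(Y^m + 1): wraparound with a sign flip."""
    m = len(f)
    out = [0] * m
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            if i + j < m:
                out[i + j] += a * b
            else:
                out[i + j - m] -= a * b
    return out

def negacyclic_rot(c):
    """Multiplication by Y in R[Y]/(Y^m + 1)."""
    return [-c[-1]] + c[:-1]

def striding_mul(f, g):
    f0, f1 = f[0::2], f[1::2]   # strided split: f = f0(X^2) + X*f1(X^2)
    g0, g1 = g[0::2], g[1::2]
    c0 = negacyclic_mul(f0, g0)                     # size-preserving point products
    c2 = negacyclic_mul(f1, g1)
    c1 = [s - a - b for s, a, b in
          zip(negacyclic_mul([x + y for x, y in zip(f0, f1)],
                             [x + y for x, y in zip(g0, g1)]), c0, c2)]
    even = [a + b for a, b in zip(c0, negacyclic_rot(c2))]  # f0*g0 + Y*f1*g1
    out = [0] * len(f)
    out[0::2], out[1::2] = even, c1                  # re-interleave the strides
    return out
```

Note that, unlike classical Karatsuba, no sub-product here exceeds half the size of the output: the wraparound happens inside the point multiplication.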

Application: Saber
Our primary interest in striding Toom-Cook/Karatsuba multiplication lies in its use for the Saber PQC scheme. We refer to [BMD + 20] or the section pertaining to Toom-Cook multiplication in the supplementary material for details, but the important point is that the ring Z_{2^13}[X]/(X^256 + 1) underlying Saber is amenable to iterated application of the striding method. Concretely, a striding Toom-Cook/Karatsuba based multiplication for Saber would use a hybrid of one layer of 4-way Toom-Cook to reduce from Z_{2^13}[X]/(X^256 + 1) to Z_{2^13}[X]/(X^64 + 1), one or two layers of Karatsuba to reduce from Z_{2^13}[X]/(X^64 + 1) to Z_{2^13}[X]/(X^32 + 1) or Z_{2^13}[X]/(X^16 + 1), and schoolbook multiplication.
The polynomials used in Saber are generated from packed data on a per-coefficient level. We expect that it is possible to merge the upfront permutation of the input into the key generation process with negligible penalty, thereby reducing the evaluation step to classical Toom-Cook/Karatsuba, but we leave exploring the details for future work.

Implementation: Cortex-M4
In this section, we describe the implementation aspects of the striding version of Toom-Cook and Karatsuba on Cortex-M4 processors.

Toom-Cook/Karatsuba
As explained in Section 4, there are two fundamental differences between the classical and the striding variants of Toom-Cook multiplication: the access pattern to the coefficients of the operands and the memory expansion of the products. The sequence of arithmetic operations performed during evaluation and interpolation is the same for both variants.
The read accesses occur during the evaluation and are performed with offset 64 for classical Toom-Cook, whereas 4 consecutive coefficients at a time are required for the striding version. Although the access pattern of the striding variant is more regular, it is an issue for exploiting the DSP extensions to carry out two operations on halfword registers in parallel. To circumvent this problem, the polynomials can either be stored with a custom memory layout, in which case the complexity is moved to the packing operation, or the coefficients of the polynomial can be packed with an offset of 4 after being loaded into the registers. Since we implement the entire multiplication, we opt for the latter. The instruction overhead is equal to the number of word loads, effectively as if we only performed halfword loads.
The write operations happen during interpolation, when the result is computed from the weighted polynomials. Here, the striding version has an advantage since it performs half as many iterations as the classical version due to the non-expansion of the products. Additionally, since the coefficients of the result are generated consecutively, the write operations can be simplified to 4 per iteration instead of 7. In addition to this simplification of the interpolation, the non-expansion of the products in the striding version halves the memory required to store all the weighted polynomials, i.e., a saving of 896 bytes.
The differences in the implementation of the striding version of Karatsuba with respect to classical Karatsuba are analogous to those for Toom-Cook. There is an overhead in the instruction count due to the sequential access pattern to the coefficients of the polynomial during evaluation, and its counterpart during interpolation, if one wants to exploit the single-instruction-multiple-data capabilities of Cortex-M4 processors. However, the non-expansion of the polynomial degree after the multiplication is more beneficial than in the case of Toom-Cook. In particular, if Karatsuba is applied recursively, the memory utilization can be kept constant by re-utilizing the memory allocated to store the operands after the evaluation is performed.
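As an illustration of this non-expansion property, one layer of striding Karatsuba can be sketched in plain Python. The helper `negamul` and the function name are ours; the point is that all three half-size products stay in the half-size negacyclic ring:

```python
def negamul(f, g):
    """Schoolbook multiplication in Z[Y]/(Y^m + 1)."""
    m = len(f)
    c = [0] * m
    for i in range(m):
        for j in range(m):
            s = 1 if i + j < m else -1   # Y^m = -1
            c[(i + j) % m] += s * f[i] * g[j]
    return c

def karatsuba_striding(f, g):
    """One layer of striding Karatsuba in Z[X]/(X^n + 1): split by parity
    of the exponent, so the three products live in Z[Y]/(Y^{n/2} + 1) and
    never expand beyond n/2 coefficients."""
    n = len(f)
    fe, fo = f[0::2], f[1::2]            # f = fe(X^2) + X*fo(X^2)
    ge, go = g[0::2], g[1::2]
    pe = negamul(fe, ge)
    po = negamul(fo, go)
    ps = negamul([x + y for x, y in zip(fe, fo)],
                 [x + y for x, y in zip(ge, go)])
    # even part: pe + Y*po, where Y*po is a negacyclic rotation (no growth)
    ypo = [-po[-1]] + po[:-1]
    even = [x + y for x, y in zip(pe, ypo)]
    # odd part: ps - pe - po
    odd = [s - x - y for s, x, y in zip(ps, pe, po)]
    out = [0] * n
    out[0::2], out[1::2] = even, odd
    return out
```

Because `pe`, `po`, and `ps` have the same length as the split operands, the buffers holding the operands can be reused for the products when recursing, which is the constant-memory behaviour described above.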

Schoolbook multiplication
It has been shown in [KMRV18, KRS19] that schoolbook multiplication has the highest impact on the performance of polynomial multiplication. We follow the divide-and-conquer approach of [KRS19] to decompose the schoolbook multiplication into smaller multiplications that can be performed without fetching extra coefficients from memory, aiming to use the multiply-and-accumulate instructions. Moreover, to avoid length doubling, we need to perform a negacyclic convolution in place. This is a challenge because the DSP extensions of Cortex-M4 processors offer a wide range of multiply-and-accumulate instructions, but not their multiply-and-subtract counterparts. Additionally, extra load instructions are required due to the negatively wrapped accumulation, which further increases the register pressure, which is key to performance [KMRV18, KRS19]. To circumvent this issue, we create a custom memory layout where both operands are in consecutive addresses and, therefore, only one pointer needs to be kept in a register while coefficients of different operands can be loaded using immediate offsets.
The other implementation decision for the polynomial multiplication is the cut-off degree below which Karatsuba is not applied anymore and the multiplication is performed using the schoolbook algorithm. Due to the availability of registers to pre-load coefficients and the absence of multiply-subtract instructions, we find that the optimal cut-off is degree 16. For higher cut-offs, the performance is determined by the dominant term between the extra load operations required in the striding version of Karatsuba and the overhead introduced by the negacyclic convolution during schoolbook multiplication. We have verified that the optimal cut-off is degree 16, with 34,884 clock cycles for a 256-coefficient polynomial multiplication, while for degree 32 the execution time increases to 36,340 clock cycles. This result was expected since 16 was also the optimal cut-off for other multiplications combining Toom-Cook and Karatsuba [KMRV18, KRS19].
In Section 7 we discuss the performance and memory figures of our proposed multiplication and compare the impact on the full Saber operation to other implementations in Cortex-M4 processors as well as in the new Cortex-M55 processor.

Implementation: Cortex-M55
In this section, we describe our implementations of large-degree, low-precision polynomial multiplication on the Cortex-M55 CPU, leveraging the Helium vector extension.

NTT vs. Toom-Cook/Karatsuba on vector architectures
[CHK+21] demonstrates that for Cortex-M4, polynomial multiplication via NTTs can outperform multiplication via Toom-Cook even for coefficient rings not tailored to the NTT, such as Z_{2^13}. For vectorized implementations, however, the passage to the NTT comes at a larger cost than for non-vectorized implementations. We provide some details.
Setting expectations Since the Helium vector extension operates on 128-bit vectors and the Cortex-M55 CPU is single-issue for vector instructions, we consider a 4× speedup a theoretical limit for a vectorized 32-bit NTT. However, we do not expect to meet this limit, because the NTT relies on long multiplications for modular arithmetic, and few Helium instructions can perform more than two 32 × 32 → 64 long multiplications. For Toom-Cook/Karatsuba based multiplication, we operate on 16-bit lanes and hence get a theoretical speedup of 8× over scalar code. However, optimized implementations on the Cortex-M4 CPU already leverage the DSP extension to treat 32-bit registers as 2 × 16-bit vectors and thus operate on two 16-bit values at once. We thus expect a speedup somewhere between 4× and 8×. Considering that NTT-based polynomial and matrix-vector multiplication for the Saber rings is around twice as fast as Toom-Cook/Karatsuba based routines [CHK+21, Table 5], it is not clear which approach will be faster when implemented using Helium.

Vectorizing striding Toom-Cook/Karatsuba
Recall that 4-way Toom-Cook evaluation is a matrix-vector multiplication, with every sub-matrix-vector multiplication R^{7×4} × R^4 → R^7 corresponding to one evaluation transformation. We vectorize this by computing one sub-matrix-matrix multiplication R^{7×4} × R^{4×ℓ} → R^{7×ℓ} at a time, storing the four rows of the input R^{4×ℓ} and the 7 rows of the output R^{7×ℓ} in one vector register each. We find that by overwriting input registers as soon as they are no longer needed, the 8 vector registers available are sufficient for the transformation. With 128-bit vectors, we compute ℓ = 128/16 = 8 evaluations at once. The above vectorization approach works for both classical and striding Toom-Cook. However, in the case of striding Toom-Cook, we need to de-interleave the input at stride 4 first, for which we leverage the de-interleaving load instruction VLD4{0-3}.
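To make the shape of this computation concrete, the following Python sketch models the evaluation as a 7 × 4 matrix applied column-wise to 4 rows of ℓ coefficients each. The evaluation points 0, ±1, ±2, 3 and ∞ are an assumption (a common choice for 4-way Toom-Cook); the point set used in the actual implementation may differ:

```python
# assumed evaluation points for 4-way Toom-Cook: 0, 1, -1, 2, -2, 3, infinity
POINTS = [0, 1, -1, 2, -2, 3]

def toom4_eval_matrix():
    """7 x 4 evaluation matrix: row r evaluates a degree-3 'limb vector'
    at POINTS[r]; the last row is the point at infinity (leading limb)."""
    rows = [[p ** j for j in range(4)] for p in POINTS]
    rows.append([0, 0, 0, 1])
    return rows

def evaluate(limbs):
    """limbs: 4 rows of length l (modelling one vector register each).
    Returns 7 rows of length l, i.e. l evaluations computed in parallel,
    one matrix-vector product per column."""
    E = toom4_eval_matrix()
    l = len(limbs[0])
    return [[sum(E[r][j] * limbs[j][t] for j in range(4)) for t in range(l)]
            for r in range(7)]
```

With 16-bit lanes and 128-bit vectors, each "row" holds l = 8 coefficients, matching the 8 simultaneous evaluations described above.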
The interleaving of memory operations and arithmetic is crucial to leverage instruction overlapping. We achieve this by optimizing the evaluation across loop iterations, using preloading and late storing of inputs and results. The preloading becomes more challenging for striding Toom-Cook, where we can use VLD4{0-3} only once we have four free vector registers available. Listing 2 shows one loop of our vectorized Toom-Cook evaluation.

Listing 3: Schoolbook multiplication
We turn to the implementation of Toom-Cook interpolation: As for evaluation, we vectorize interpolation by computing 8 interpolations at once, keeping the 7 input/output rows in one vector each and leveraging the interleaving store VST4{0-3} for the striding. We use the interpolation sequence from [KMRV18].
At the end of the interpolation, we have f_0, . . ., f_6 such that Σ_{i<4} (f_i(X^4) + X^4 f_{4+i}(X^4)) X^i (with f_7 := 0) is the polynomial we are looking for. Each X^4 · f_{4+i} amounts to a negacyclic rotation of f_{4+i}, which we compute gradually within the interpolation loop, using VSHLC to shift one 128-bit block after each iteration and remembering the carry-out as the carry-in for the next iteration. This requires three more general-purpose registers for the carries. The interpolation sequence from [KMRV18] is very add/sub-heavy, leading to inevitable stalls on a dual-beat system when implemented as written. We experimented with replacing VADD, VSUB by VMLA with constants ±1, but were limited by the pressure on the GPR file. It is left for future work to optimize this further.

Schoolbook multiplication
We consider multiplication in Z_{2^16}[X]/(X^k + 1). While previous vectorized implementations use batch multiplication, our algorithm vectorizes a single schoolbook multiplication. To our knowledge, the approach is novel; it rests on the shift-with-carry instruction VSHLC illustrated in Figure 3.
Outline We outline the algorithm for k = 16. We can write ab = Σ_{i<8} [(a_i + a_{8+i} X^8) b] X^i. We will now separately discuss the following:
• How to efficiently compute one subproduct (a_i + a_{8+i} X^8) b in the sum. We will express this as an 8-fold batch multiplication over Z_{2^16}[W]/(W^2 + 1), where W = X^8.
• How to shift and accumulate in the sum. We will express this as an iteration of addition and negacyclic rotation, similar to a linear feedback shift register.

Subproduct computation
We write b = Σ_{j<8} (b_j + b_{8+j} X^8) X^j as before, and hence obtain (a_i + a_{8+i} X^8) b = Σ_{j<8} [(a_i + a_{8+i} X^8)(b_j + b_{8+j} X^8)] X^j. This batch multiplication is easy to vectorize: If a_i and a_{8+i} are in GPRs a_0, a_1, and b_0, . . ., b_7 and b_8, . . ., b_15 are in vectors b_0 and b_1, we only have to calculate a scalar-vector 2×2 schoolbook multiplication (a_0 + a_1 W)(b_0 + b_1 W) = (a_0 b_0 − a_1 b_1) + (a_0 b_1 + a_1 b_0) W, using W^2 = (X^8)^2 = −1. This uses 2 GPRs for the a, 2 vector registers for the b, and 2 vector registers for the output. Multiplication is done via VMUL.u16/VMLA.u16. Since there is no multiply-subtract variant of VMLA.u16, the computation of −a_1 b_1 requires flipping the sign of a_1 or b_1.
Accumulation We need to compute the sum ab = Σ_{i<8} [(a_i + a_{8+i} X^8) b] X^i. We have discussed how to compute an individual product (a_i + a_{8+i} X^8) b; we now consider how to sum them.
Since multiplication by X is a negacyclic shift (z_0, . . ., z_{k−1}) → (−z_{k−1}, z_0, . . ., z_{k−2}), computing [(a_i + a_{8+i} X^8) b] X^i from (a_i + a_{8+i} X^8) b means a negacyclic shift by i positions. This does not map well to MVE because the maximum shift amount of VSHLC is 32 bits, hence two 16-bit coefficients. Instead, setting a_i' := a_i + a_{8+i} X^8, we can rewrite the sum in Horner form: ab = (. . . ((a_7' b) X + a_6' b) X + . . .) X + a_0' b. Here, we only ever shift by X, corresponding to a VSHLC Qd, Ra, #16. Moreover, we can handle the additions through multiply-accumulates in the schoolbook subroutine, computing a_i' b + _ instead of a_i' b. The carry-out c_new of each shift has to be negated and fed back into the bottom coefficient. We implement this lazily by buffering c_new in a vector C = (0, . . ., 0, c) and deferring the correction. If C is initially 0, we can simultaneously store c in C and clear it via VSHLC. We only need to apply the correction once after the full loop.
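The shift-and-accumulate scheme can be modelled in plain Python as follows. This is a behavioural sketch only: the "vectors" are lists of 8 lanes, and the negacyclic shift feeds the negated carry back immediately rather than lazily via a buffer vector:

```python
def schoolbook_vectorized(a, b):
    """Negacyclic multiplication in Z[X]/(X^16 + 1), organised like the
    Helium code: subproducts (a_i + a_{8+i} X^8) * b summed in a
    Horner-style loop of 'shift accumulator negacyclically by one
    coefficient (modelling VSHLC with negated carry feedback), then
    multiply-accumulate the next subproduct over W = X^8, W^2 = -1'."""
    assert len(a) == len(b) == 16
    blo, bhi = b[:8], b[8:]              # b in two vector registers
    acc_lo, acc_hi = [0] * 8, [0] * 8
    for i in range(7, -1, -1):
        # negacyclic shift acc <- acc * X: top coefficient is negated
        # and fed back into the bottom lane
        carry = acc_hi[7]
        acc_hi = [acc_lo[7]] + acc_hi[:7]
        acc_lo = [-carry] + acc_lo[:7]
        # subproduct (a_i + a_{8+i} X^8) * b via VMLA-style accumulates:
        # the low half picks up a_i*b_lo - a_{8+i}*b_hi (sign from W^2 = -1)
        ai, ah = a[i], a[8 + i]
        acc_lo = [x + ai * u - ah * v for x, u, v in zip(acc_lo, blo, bhi)]
        acc_hi = [x + ai * v + ah * u for x, u, v in zip(acc_hi, blo, bhi)]
    return acc_lo + acc_hi
```

The first subproduct (i = 7) ends up shifted seven times by the subsequent iterations, reproducing the Horner bracketing of the sum.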

Implementation considerations
A standalone implementation of the negacyclic shift would not perform well on a dual-beat system like the Cortex-M55 processor, because consecutive invocations of VSHLC would stall. We will revisit this after studying the nature of each subproduct computation.
Each subproduct involves the following: Firstly, loading the a-input into two GPRs, ideally using LDRD to load a pair of GPRs at once. Secondly, 4× multiply via VMUL.u16 or VMLA.u16. Thirdly, a sign flip in a or b. The lowest instruction count is achieved by storing −b_1 in another vector throughout the loop. Flipping a instead adds a 1-instruction overhead, but relieves pressure on the vector register file. At best, we get 1 × LDRD and 3 × VMLA per subproduct.
As for the negacyclic shift, a naïve implementation would perform poorly on a dual-beat system because consecutive invocations of VSHLC would stall. However, we found that interleaving subproduct and shift leads to good overlapping and resource utilization on the Cortex-M55 processor. We show an iteration of the core loop in Listing 3.
Variants There is a variant computing the rotation abX instead of ab: We write abX = Σ_{i<8} a_i' b X^{i+1} and apply the same multiply-accumulate strategy as before. The resulting rotation in the a-input is easy to express through modified immediate offsets in the GPR loads.
Secondly, consider what happens when we accumulate onto the destination vectors in the first iteration: Since the result of the first iteration gets shifted 7 times subsequently, this computes ab + cX^7, where c is the polynomial previously stored in the destination vectors.
The previous points allow for efficient integration of the 16 → 32 Karatsuba interpolation with the degree-16 schoolbook multiplication: Recall that we have to compute (f_e g_e + X f_o g_o) + X(f_s g_s − f_e g_e − f_o g_o). We avoid any manual shift of f_o g_o here by computing X f_o g_o first, and then using the multiply-accumulate variant for the remaining products: Here, X f_o g_o is known, and multiplication by X^8 is a single sign flip and some re-indexing.

Integrating Toom-Cook, Karatsuba, and Schoolbook
We implement degree-256 negacyclic polynomial multiplication through a combination of one layer of 4-way Toom-Cook for degree 256 → 64 reduction, one layer of Karatsuba for degree 64 → 32 reduction, and an integrated 32 → 16 Karatsuba step followed by 16 × 16 schoolbook multiplication. We found this variant slightly more efficient than an implementation based on a degree-32 schoolbook multiplication.

Forward NTT
For the modular multiplications, we use a 3-instruction Montgomery multiplication. The high product can be computed in two ways: one uses the multiply-high instruction VMULH.s32, the other the doubling multiply-high instruction VQDMULH.s32 from fixed-point arithmetic, whose functional effect is a, b → 2ab/2^32. In the latter case, the doubling can be corrected through a halving subtraction VHSUB.s32 in the last instruction. We mention the fixed-point variant in particular because VQDMULH has a scalar-vector variant where one operand comes from a GPR. This is useful for the first layers, where twiddle factors are shared across butterflies. For the last layers, where each size-4 butterfly operates on consecutive 32-bit elements, we use the de-interleaving load VLD4{0-3} and compute four size-4 butterflies at once. In this case, we require 6 vectors for the root constants, in addition to the 4 input vectors. However, we find that with careful register management, the total of 8 vector registers is sufficient. To leverage instruction overlapping, it is vital to interleave the Montgomery multiplications (consisting of multiplication instructions only) and the add/sub steps (consisting of addition operations only). Moreover, since there are 6 multiplications but only 4 addition/subtraction operations, we also interleave loads/stores in order to achieve a stall-free execution.
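The arithmetic that such a Montgomery multiplication realises can be modelled in plain Python with R = 2^32. The modulus below is only an illustrative odd value, and the sketch does not mirror the exact Helium instruction sequence:

```python
R = 1 << 32

def montmul(a, b, q):
    """Montgomery multiplication: returns r with r*R = a*b (mod q) and
    |r| < q, for 0 <= a, b < q and odd q < 2^31."""
    qinv = pow(q, -1, R)               # q * qinv = 1 (mod R)
    m = (a * b * qinv) % R             # low-half product
    r = (a * b - m * q) // R           # exact division: low 32 bits cancel
    return r

q = 25166081                           # illustrative odd 25-bit modulus
```

The division by R is exact because a·b − m·q ≡ 0 (mod R) by construction, which is what lets a hardware implementation keep only the high halves of the products.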

Point multiplication for full NTT
For the point multiplication in F_q^n, our measured code uses Algorithm 4. We comment on the possibility of using the shorter Algorithm 5: This algorithm is only applicable if one of the inputs is odd. Here, the following trick can be applied:
Proposition 1. Assume a Montgomery multiplication routine for the NTT which produces only even representatives. Then, if (x_s) is an input vector to the NTT, all entries of its NTT transform have the same parity, which agrees with the parity of the representative x_0.
In particular, if x 0 is odd, then all entries in the NTT transform of (x s ) are odd.
Proof. Consider the first layer: If n is the size of the NTT, each pair (a, b) = (x_r, x_{r+n/2}) of elements in the lower and upper half is transformed via (a, b) → (a + ζb, a − ζb). By assumption, ζb is represented by an even integer, so at the end of layer 0, the parity of the elements in the lower half has not changed, and the parity of elements in the upper half is the same as the parity of the corresponding lower-half element. Continue inductively.
As long as we ensure that x_0 is odd initially, we can therefore force all outputs of the NTT to have only odd entries, and thus be suitable for our accelerated point multiplication via Algorithm 5. We leave it for future work to explore the use of this trick further.
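Proposition 1 is easy to check empirically with a toy model in plain Python. The parameters and twiddle handling below are illustrative (a single root is reused, so this is not a correct NTT), since only the even-representative property of the multiplication matters for the parity argument:

```python
q = 17                                  # toy odd modulus (illustrative)
n = 8
zeta = 3                                # illustrative root; the exact twiddle
                                        # bookkeeping is immaterial for parity

def even_mul(z, b):
    """Modular multiplication returning an even representative:
    for odd q, exactly one of r and r + q is even."""
    r = (z * b) % q
    return r if r % 2 == 0 else r + q

def ntt_parity_demo(x):
    """CT-style layers (a, b) -> (a + zb, a - zb) without intermediate
    reduction, using even-representative multiplications throughout."""
    x = list(x)
    step = n // 2
    while step >= 1:
        for start in range(0, n, 2 * step):
            for j in range(start, start + step):
                t = even_mul(zeta, x[j + step])
                x[j], x[j + step] = x[j] + t, x[j] - t
        step //= 2
    return x
```

Since only even quantities are ever added or subtracted, every output lane inherits the parity of the representative x_0, as the proposition states.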

Point multiplication for partial NTT
When implementing multiplication in F_q[X]/(X^256 + 1) using a 6-layer incomplete NTT, the base multiplication takes place in F_q[X]/(X^4 − ζ). For this, [CHK+21] relies on SMLAL to reduce the number of Montgomery reductions by operating on double-width values as long as possible, as does [BHK+], which moreover batches the base multiplication.
As mentioned in Section 3.2, batched implementations are difficult to realize in the Helium instruction set due to the lower number of vector registers. A batched multiplication in F_q[X]/(X^4 − ζ) would need at least 4 + 4 + 4 vectors for the two inputs and the output. Instead, we vectorize a single multiplication in F_q[X]/(X^4 − ζ) as follows: For inputs a = a_0 + a_1 X + a_2 X^2 + a_3 X^3 and b = b_0 + b_1 X + b_2 X^2 + b_3 X^3, we first prepare the reversed and expanded array of 32-bit values b_3, b_2, b_1, b_0, ζb_3, ζb_2, ζb_1 in memory. We can then compute the coefficients of ab as dot products of [a_0, a_1, a_2, a_3] with length-4 subvectors of the expanded array, each of which can be computed with a single invocation of the long-multiply-accumulate-across instruction VMLALDAV.s32. For example, the X^2-coefficient of ab is the dot product of [a_0, a_1, a_2, a_3] with [b_2, b_1, b_0, ζb_3]. Like UMLAL, VMLALDAV.s32 accumulates into a pair of GPRs. We compute all coefficients of ab in pairs of GPRs, before moving them into a vector using VMOV Qx[i], Qx[j], Ra, Rb and applying a single Montgomery reduction.
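The dot-product formulation can be checked with a small Python model. The parameter values are illustrative; each dot product below corresponds to one VMLALDAV.s32 invocation, with the final Montgomery reduction replaced by a plain reduction mod q:

```python
def basemul_x4(a, b, zeta, q):
    """Multiplication in F_q[X]/(X^4 - zeta) via four length-4 dot products
    against the reversed, expanded array (b3, b2, b1, b0, z*b3, z*b2, z*b1)."""
    e = [b[3], b[2], b[1], b[0],
         (zeta * b[3]) % q, (zeta * b[2]) % q, (zeta * b[1]) % q]
    # the coefficient of X^j is the dot product of a with e[3-j : 7-j]
    return [sum(a[t] * e[3 - j + t] for t in range(4)) % q
            for j in range(4)]
```

Sliding a length-4 window over the 7-element array reproduces exactly the wrap-around terms multiplied by ζ, which is why a single pre-expansion of b suffices for all four output coefficients.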
The approach has some drawbacks: Firstly, we need four loads for b, whereas a batched implementation would require only one. This can be amortized by simultaneously computing multiple products a_i b; we have not implemented this approach yet. Secondly, there is the cost of preparing b_3, b_2, b_1, b_0, ζb_3, ζb_2, ζb_1. Following an approach similar to [BHK+] adapted to this context, this can be partly amortized in the application to matrix-vector multiplication by precomputing b_3, b_2, b_1, b_0, ζb_3, ζb_2, ζb_1 as part of the forward NTT. In Section 7 below, this size-doubling version of the NTT is called "expanded NTT". We find that the above schoolbook multiplication strategy offers a good mix of different kinds of operations which can be interleaved to leverage instruction overlapping.

Inverse NTT
We use Gentleman-Sande butterflies (a, b) → (a + b, ζ(a − b)), which we compute using interleaved add/sub sequences and the 3-instruction Montgomery multiplication as in the forward NTT. Overflow is prevented through the addition of selected Barrett reductions.
The GS butterflies are inverse to the CT butterflies only up to a factor of 2, which we need to compensate through a modular division in or after the inverse NTT. We also need to account for the Montgomery twist by 2^{−31} ∈ F_q in the point multiplication. We merge half of the required scalings with the multiplications in the last-layer GS butterflies. For the other half, we add explicit Montgomery multiplications. One could integrate the scalings into the GS butterflies for all but the first coefficient with a trick similar to Proposition 1; however, we would need explicit Barrett reductions at the end of the last layer in this case, which would only be one instruction shorter than the Montgomery multiplications.

Side-Channel resistance
All of our code is resistant against timing side-channels on the Cortex-M4 and Cortex-M55 processors: The control flow is secret-independent, and all instructions we use, including predication, have data-independent timing. We do not attempt resistance against other side-channels.

Hashing
We have not attempted to develop hashing implementations based on the Helium vector extension: Both the smaller-scale nature of Cortex-M CPUs and the fact that software executes directly in the physical address space (without virtual-to-physical translation) mean that the overhead associated with offloading computation to accelerators is very low. As a result, many Cortex-M microcontrollers already include accelerators for symmetric cryptography and hashing. It is expected that these accelerators will be used for the hashing in PQC schemes, eliminating the need for high-performance software-based hashing.
"90's versions" Numerous PQC schemes offer a "90's version" replacing the use of SHA-3 by AES and SHA-2, and it has been discussed whether NIST should standardize those versions. We support NIST standardizing "90's versions": Firstly, many existing MCUs already have hardware acceleration for AES and SHA-2. Secondly, AES and SHA-2 will not go away anytime soon, and a limited gate budget may prevent vendors from shipping MCUs with both SHA-2 and SHA-3 acceleration. Finally, SHA-2 is a faster software fallback for systems which do not have any hardware acceleration for hashing.

Development setup
Benchmarking for the Cortex-M4 processor We have developed our code based on the pqm4 framework [KRSS]. We have focused our development on polynomial multiplication only, without optimizing other operations in Saber. To measure the performance, we have executed the benchmark environment provided by pqm4 on an STM32F4 Discovery board featuring a 32-bit Arm Cortex-M4 with FPU core, 1MB of Flash, and 192KB of RAM.
Benchmarking for the Cortex-M55 processor We developed our code using the Fixed Virtual Platform (FVP) for the Arm® Corstone®-300 MPS2, which contains a functional model of the Cortex-M55 CPU. To measure performance, we have used an FPGA model of the Cortex-M55r1 CPU, which is expected to be publicly released by the end of 2021; an image including the Cortex-M55r0 CPU is already available. Researchers can get access to Cortex-M55 RTL and cycle-accurate models through the Arm Academic Access program. Most of our Helium code was written with the help of our code-generation tooling, which supports, among other things, loading contiguous or scattered buffers. While the final assembly can possibly be written by hand, we found the tool useful for quick experimentation with different approaches. For very simple code sequences, such as the Karatsuba algorithm or the NTT base multiplication, we have written the assembly by hand.
There are alternatives to handwritten assembly, e.g. auto-vectorization and intrinsics. The benefit is that the programmer operates at the level of C, while giving up the assembly-level control that potentially unlocks the final bit of performance. We explore handwritten assembly in this work to understand the capabilities of the Cortex-M55 processor and the Helium instruction set, without wondering how much penalty we pay for using C.

Polynomial and matrix-vector multiplication
Table 5 and Table 4 show the performance of our matrix-vector multiplication routines running on the Cortex-M55 CPU, compared to prior implementations on the Cortex-M4 CPU. We found that the incomplete NTT with expansion (see Section 6.4.4) outperforms a full NTT, and is about 3.4× faster than the incomplete NTT implementation from [CHK+21]. For Toom-Cook, we obtain a ≈ 5× speedup compared to [MKV20]. The results are in line with the expectations established in Section 6.1. Overall, we confirm the finding of [CHK+21] that NTT-based polynomial multiplication outperforms Toom-Cook-based multiplication, but we note that the difference is smaller than on the Cortex-M4 CPU.
Table 3 and Table 4 provide a more detailed breakdown of the performance of various components of NTT and Toom-Cook/Karatsuba based polynomial multiplication.

Saber
We have integrated our polynomial and matrix-vector multiplication routines into a top-level implementation of the NIST PQC finalist Saber. Table 2 shows the results for both Toom-Cook/Karatsuba and NTT-based implementations in comparison with prior art.
As observed and explained above, the difference between Toom-Cook/Karatsuba and NTT on the Cortex-M55 processor is smaller than on the Cortex-M4 processor. We also see that implementations based on the striding technique are slightly faster than classical Toom-Cook/Karatsuba. The main benefit of the striding approach, however, is the reduced memory usage. We show this in Table 6, where the memory footprint of different Saber implementations on the Cortex-M4 processor is compared. The striding Toom-Cook approach provides a memory-compact alternative to the NTT-based implementation. We repeat the common observation that the arithmetic underlying Saber, and many other PQC schemes, is outweighed by the computational complexity of hashing; see Section 6.6.

Conclusion
In this work, we have introduced the Helium vector extension for the M-profile of the Arm architecture and studied its use for structured lattice-based cryptography on the example of the Cortex-M55 processor and the NIST PQC finalist Saber. We have demonstrated that even within the tight constraints of the embedded market, features like instruction overlapping, scalar-vector operations and careful instruction design allow a speedup close to the theoretical optimum for the core mathematical primitives, despite the Cortex-M55 processor offering only 8 vector registers and an in-order, single-issue pipeline. We have also introduced and implemented a memory-efficient variant of Toom-Cook multiplication suitable for highly constrained devices. Though we have performed numerous optimizations, there is still scope for further improvements and extensions: We hope that our results will motivate further research into the capabilities of the Helium vector extension, especially within the realm of PQC, and the exploration of further schemes and optimizations.

Acknowledgements
This work was supported partially by the Research Council KU Leuven: C16/15/058 and by CyberSecurity Research Flanders with reference number VR20192203. In addition, Angshuman Karmakar is funded by FWO (Research Foundation - Flanders) as junior post-doctoral fellow (contract number 203056 / 1241722N LV).
We thank Jack Andrew, Francois Botman, Tom Grocutt, Fabien Klein and Jun Mendoza for sharing their insights into the Helium vector extension and the Cortex-M55 processor.

Figure 2
Instruction overlapping on a dual-beat implementation of Helium.

Table 1
Comparing features of multiplication based on FFT/NTT and Toom-Cook.

Code The code generation tooling and resulting Helium assembly for Cortex-M55 are available at https://gitlab.com/arm-research/security/pqmx. The repository also contains detailed instructions for how to explore Helium using a freely available functional model of the Cortex-M55. The Cortex-M4 code will be made available on the Saber repository https://github.com/KULeuven-COSIC/SABER.

Table 2
Comparison of Toom-Cook and NTT-based implementations of the IND-CPA and IND-CCA versions of the NIST PQC finalist Saber. SW-based hashing waters down optimizations for polynomial multiplication, but is expected to be HW-accelerated in practice; see Section 6.6. Numbers are for the SHA-3-based Saber, not the 90's version.
† Saber implementation from [KRSS] using Toom-Cook for polynomial multiplication instead of NTT.

Table 3
Comparing different stages of NTT multiplication for Saber. "Expand" is the NTT variant discussed in Section 6.4.4.

Table 4
Performance of polynomial multiplication based on Toom-Cook/Karatsuba. Sum decompositions refer to the shares of Toom-Cook/Karatsuba during interpolation/evaluation.

Table 5
Comparison of different approaches for matrix-vector multiplication.