A RISC-V Post Quantum Cryptography Instruction Set Extension for Number Theoretic Transform to Speed-Up CRYSTALS Algorithms

In recent years, public-key cryptography has become a fundamental component of digital infrastructures. These infrastructures now face a new and growing threat from quantum computers. It is well known that, in the coming years, quantum computers will be able to run algorithms capable of breaking the cryptographic schemes currently in widespread use for public-key cryptography. Post-quantum cryptography aims to define algorithms that run on classical computer architectures and withstand attacks from quantum computers. The National Institute of Standards and Technology is currently running a selection process to define one or more quantum-resistant public-key algorithms, and lattice-based cryptographic constructions are considered among the leading candidates. However, such algorithms require non-negligible computational resources. One viable solution is to accelerate them totally or partially in hardware, to alleviate the workload of the main processing unit. In this paper, we investigate a solution that trades off performance and complexity to execute the lattice-based algorithms CRYSTALS-Kyber and CRYSTALS-Dilithium: we introduce a dedicated Post-Quantum Arithmetic Logic Unit, embedded directly in the pipeline of a RISC-V processor. This results in an almost negligible area overhead with a large impact on the speed-up of the algorithms and a consistent reduction in the energy required per operation.


I. INTRODUCTION
In recent years, digitalization has been progressing in many different areas, such as Industry 4.0, automotive, and healthcare. This leads to increasingly complex Systems-on-Chip (SoCs) that require a continuous internet connection to the cloud, which has to be supported efficiently, especially in mobile systems. Such systems communicate over an intrinsically insecure channel, such as the public 5G infrastructure; secure communication is therefore an essential requirement for all these application domains [1]. In State of the Art (SoA) systems, the security of the connections relies on Public Key Cryptography (PKC), which employs a pair of keys, public and private. PKC algorithms are based on hard mathematical problems that are considered infeasible to solve, i.e. it would take far too many resources and time to compute the solution and break the system. However, the advent of quantum computers will strongly compromise the security of these algorithms, since such machines will be able to solve the problems in polynomial time using Shor's algorithm [2]. For this reason, in 2017 the National Institute of Standards and Technology (NIST) started a standardization process, which is now at its third round, to find one or more quantum-resistant public-key cryptographic algorithms [3].
The associate editor coordinating the review of this manuscript and approving it for publication was Ilsun You.
Post-Quantum Cryptography (PQC) exploits mathematical elements and operations which are usually not straightforward to implement on standard processors. This is a critical aspect especially in low-power embedded devices that have a limited amount of resources and computational power. Consequently, there is increasing interest regarding hardware acceleration of PQC [4]. There are several options available in the design space to address this problem.
The most optimised approach is to design and build an Application Specific Integrated Circuit (ASIC) to accelerate the requested algorithm [5]. This solution can reach the best results in terms of performance and power/energy consumption, but it requires a considerable design effort and the related non-recurring costs, which often makes it undesirable, especially for accelerating algorithms that are still under development or evaluation. A classical approach is to design and embed hardware accelerators connected to the control and processing unit as memory-mapped peripherals [6]. Another option is to bring smaller hardware accelerators directly into the processor pipeline [7]. With this approach, the area increment is lower and almost negligible for larger cores. The timing performance is lower than with full algorithm acceleration, since the CPU can manage only 32/64-bit operands in its execution stage and therefore only a subset of functions can be accelerated. The in-pipeline acceleration approach is applicable to processors with an open Instruction Set Architecture (ISA), which can be extended to compute the accelerated functions. A RISC-V processor is well suited to this purpose [8], since RISC-V is a free and open ISA that provides a set of reserved opcodes specifically created to promote more specialized instruction-set extensions.
Among all the post-quantum algorithms that are being developed and that are currently competing in the NIST standardisation process, Lattice-Based Cryptography (LBC) [9] offers a very good trade-off between security and efficiency. In this work, we propose a first ISA extension to the CVA6 [10] processor, applicable also to other processors of the RISC-V family, to accelerate the execution of the CRYSTALS-Kyber and CRYSTALS-Dilithium algorithms, respectively a Key Encapsulation Mechanism (KEM) and a Digital Signature Scheme (DSS). Currently, the RISC-V community is putting a lot of effort into standardising RISC-V cryptographic extensions. At the time of writing, the proposed standard is in public review [11]. Unfortunately, the policy is to support only existing standardised cryptographic constructs. Candidate protocols for future standardisation are not currently taken into account, even if it is mentioned that the standard will also deal in the future with the NIST Post-Quantum Cryptography contest. Our proposed work, therefore, aims at paving the way to such future extensions, in case the algorithms we analysed and accelerated are selected by NIST as future standards.
Number Theoretic Transform (NTT) [12] represents one of the most onerous parts of the Kyber and Dilithium algorithms. NTT is a specialized form of the Discrete Fourier Transform (DFT) for finite fields and is widely adopted to perform polynomial multiplication with large operands [13]. As already indicated in [14], this is one of the bottlenecks of such algorithms, hence it is investigated in this work for its potential hardware acceleration. The contributions of this work include, but are not limited to:
• Detailed study of the algorithms to demonstrate which functions are worth accelerating in hardware.
• First post-quantum ISA cryptographic extension embedded in the 64-bit CVA6 RISC-V processor.
• Introduction of new hardware functionalities directly mapped to assembly instructions to reduce the execution time of the CRYSTALS suite.
• Evaluation of the Post-Quantum (PQ) ISA extension in terms of power/energy consumption on FPGA technology.
The rest of the paper is organised as follows: Section II provides an overview of LBC and the CRYSTALS suite. In Section III the algorithms are analysed to identify the bottlenecks in the RISC-V CVA6 core, and the architecture of the implemented instructions is described. Section IV describes the implementation of the system on real hardware and its testing against the reference implementation to measure time acceleration, energy efficiency improvement, and resource consumption impact. The obtained results are discussed and compared with solutions available in the SoA in Section V. Finally, in Section VI we conclude this work, highlighting the innovative results achieved and the possible improvements for future research.

II. KYBER AND DILITHIUM OVERVIEW
In the NIST PQC standardization process, seven algorithms have reached the third round as finalists, and five of them are lattice-based. Different mathematical problems can be used to construct cryptography schemes based on lattices, and the best known is the Learning With Errors (LWE) problem. LWE involves the extraction of the vector s from the equation t = As + e, where A is a matrix, t, s and e are vectors, and the vector e must be sampled from specific small error distributions. CRYSTALS is a CRYptographic SuiTe for Algebraic LatticeS that encompasses the Kyber and Dilithium algorithms, which base their security on the hardness of solving the LWE problem in module lattices (MLWE problem [15]). Compared to standard LWE, matrices in MLWE have smaller dimensions and the coefficients are polynomials in $R_q$.
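As an illustration of the relation t = As + e, the following minimal Python sketch builds a toy LWE instance over plain integers modulo q. The dimensions, the modulus 97, the error bound, and the function names are hypothetical choices made only for readability; they are not CRYSTALS parameters, and the module-lattice structure of MLWE is not modelled.

```python
import random


def lwe_instance(n=4, m=6, q=97, bound=1, seed=0):
    """Build a toy LWE instance t = A*s + e (mod q) with a small error vector e."""
    rng = random.Random(seed)
    A = [[rng.randrange(q) for _ in range(n)] for _ in range(m)]
    s = [rng.randrange(q) for _ in range(n)]
    e = [rng.randint(-bound, bound) for _ in range(m)]
    t = [(sum(a * b for a, b in zip(row, s)) + ei) % q for row, ei in zip(A, e)]
    return A, s, e, t


def residual(A, s, t, q=97):
    """Return t - A*s reduced to the centred range; with the right s this is e."""
    out = []
    for row, ti in zip(A, t):
        r = (ti - sum(a * b for a, b in zip(row, s))) % q
        out.append(r - q if r > q // 2 else r)
    return out
```

With the secret s the residual is the small error vector; without it, an attacker sees only A and t, and recovering s is the LWE problem.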

A. CRYSTALS-KYBER
Kyber is one of the four candidates selected as third-round finalists in the KEM category of the NIST PQC competition, together with Classic McEliece [16], NTRU [17] and SABER [18]. The construction of Kyber follows two steps: first, it encrypts 32-byte messages following the conventional method to construct an INDistinguishability under Chosen-Plaintext Attack (IND-CPA) secure public-key encryption scheme; then, a tweaked Fujisaki-Okamoto (FO) transform [19] is applied to obtain a KEM that is secure against chosen-ciphertext attacks.
Kyber works on rings of integer polynomials modulo a prime $q$, which are denoted as $\mathbb{Z}_q[X]$. Polynomials modulo both $q$ and $X^n + 1$ compose the ring $R_q = \mathbb{Z}_q[X]/(X^n + 1)$. Bold lower-case letters represent vectors and bold upper-case letters are matrices with coefficients in $R_q$. Noise polynomials in Kyber are sampled from the centred binomial distribution $B_\eta$, where $\eta$ is directly related to the range of the noise samples.
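The centred binomial distribution $B_\eta$ can be sketched as follows: each coefficient is the difference between two sums of $\eta$ random bits, so it lies in the range from $-\eta$ to $\eta$. This is a minimal illustration in Python; the function name `sample_cbd` is hypothetical, and Python's `random` module stands in for the pseudo-random function that Kyber actually uses, so the sketch is not suitable for cryptographic use.

```python
import random


def sample_cbd(eta, n=256, rng=None):
    """Sample n coefficients from the centred binomial distribution B_eta.

    Each coefficient is (sum of eta bits) - (sum of eta bits).
    Assumption: a non-cryptographic RNG replaces Kyber's PRF output.
    """
    rng = rng or random.Random()
    coeffs = []
    for _ in range(n):
        a = sum(rng.getrandbits(1) for _ in range(eta))
        b = sum(rng.getrandbits(1) for _ in range(eta))
        coeffs.append(a - b)
    return coeffs
```

A larger $\eta$ widens the range of the noise samples, which is the relation between $\eta$ and the noise mentioned above.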
The main functions to construct the IND-CPA-secure public-key infrastructure in Kyber are key generation, encryption, and decryption. Detailed information about such algorithms can be found in the official documentation of Kyber [20]. The parameter sets of Kyber are reported in Table 1. The parameter $n$ is set to 256 to encapsulate keys with 256 bits of entropy, and $q$ is a small prime that allows a fast NTT-based multiplication. The parameter $k$ fixes the lattice dimension and allows scaling security and efficiency to different levels. The parameter $\eta_1$ defines the noise of the vectors s and e in the key generation function and of r in the encryption function, while $\eta_2$ defines the noise of $e_1$ and $e_2$ in the encryption function. The columns pk, sk, and ct indicate respectively the size in bytes of the public key, the secret key, and the ciphertext.
Kyber algorithms involve several cryptographic primitives. The key generation function requires seed expansion through the SHA3-512 hash function, the generation of the matrix $\hat{A} \in R_q^{k \times k}$ through the eXtendable-Output Function (XOF) SHAKE-128, and rejection sampling to generate elements in $R_q$ that are statistically close to a uniformly random distribution. The noise terms e and s are sampled from the centred binomial distribution $B_\eta$, which requires a Pseudo-Random Function (PRF), implemented in Kyber with the SHAKE-256 XOF. In the encryption function, matrix generation (i.e. $\hat{A}$) and vector sampling (i.e. r, $e_1$ and $e_2$) require the same primitives as the key generation function. NTT is adopted for polynomial multiplications. In addition, several auxiliary functions are used: key encoding (and decoding) to serialize polynomials into byte arrays (and deserialize them), and compression (and decompression) functions to transform elements of $\mathbb{Z}_q$ into integers represented with fewer than $\log_2(q)$ bits (and vice versa).
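The compression and decompression functions can be illustrated with a short Python sketch. It assumes the rounding-based definitions given in the Kyber specification, i.e. compression of x to d bits as round((2^d / q) x) mod 2^d and decompression as round((q / 2^d) y); the function names and the helper `centered_distance` are hypothetical and are not taken from the reference code.

```python
Q = 3329  # Kyber modulus


def compress(x, d):
    """Map x in Z_q to a d-bit integer: round((2^d / q) * x) mod 2^d."""
    return (((x << d) + Q // 2) // Q) & ((1 << d) - 1)


def decompress(y, d):
    """Map a d-bit integer back to Z_q: round((q / 2^d) * y)."""
    return (y * Q + (1 << (d - 1))) >> d


def centered_distance(a, b):
    """Distance between a and b modulo q, taken in the centred range."""
    diff = (a - b) % Q
    return min(diff, Q - diff)
```

Compression is lossy: decompressing a compressed value returns an element close to the original one, with an error that shrinks as d grows.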

B. CRYSTALS-DILITHIUM
CRYSTALS-Dilithium is one of the three digital signature schemes selected in the third round of the NIST PQC competition, together with Falcon [21] and Rainbow [22]. The mathematical notation reported in Section II-A for Kyber is also valid for the Dilithium algorithm. The parameter sets of Dilithium are reported in Table 2. A detailed explanation can be found in the official documentation [23]. In Dilithium, A is a $k \times \ell$ polynomial matrix, while s and e become $\ell$-dimensional and $k$-dimensional polynomial vectors, named respectively $s_1$ and $s_2$. In Dilithium $n$ and $q$ are fixed, while the dimensions $k$ and $\ell$ determine the security level and performance. The parameter $\eta$ indicates the maximum size of the coefficients of $s_1$ and $s_2$. The parameter $\gamma_1$ limits the coefficients of the polynomial vector y, which is used as a masking vector in the signature generation function. Dilithium is composed of three main functions: key generation, signature generation, and signature verification. The two main operations that constitute such functions are XOFs and multiplications in the polynomial ring $R_q = \mathbb{Z}_q[X]/(X^n + 1)$. The generation of the matrix A and of $s_1$ and $s_2$ adopts SHAKE-128 and SHAKE-256 as XOFs. The computation of the public key $t = As_1 + s_2$ is performed over $R_q$. NTT is adopted for multiplications. In the signing procedure, the message to be signed is hashed using SHAKE-256 as a Collision Resistant Hash (CRH).

C. NTT AND MODULAR REDUCTIONS
In the Kyber and Dilithium algorithms, polynomial multiplication represents one of the most critical and time-consuming operations. NTT is a special case of the DFT which is conducted in the finite field $\mathbb{Z}_q$ rather than in the complex field $\mathbb{C}$. NTT can be adopted to speed up polynomial multiplication, reducing the complexity of multiplying two n-term polynomials from $O(n^2)$ to $O(n \log n)$. The NTT transformation is denoted as:

$$\hat{a}_i = \sum_{j=0}^{n-1} a_j\,\omega^{ij} \bmod q, \qquad i = 0, 1, \dots, n-1 \tag{1}$$

where $\omega$ is the n-th primitive root of unity. Since the product of two n-term polynomials has 2n coefficients and should be reduced modulo $(X^n + 1)$, the Negative Wrapped Convolution (NWC) [24] procedure can be used to remove this overhead. NWC involves the pre-scale and post-scale of the polynomials with the square root $\xi$ of the n-th primitive root of unity. Pre-scale is the multiplication between the coefficients of the input polynomials and $\xi$, while the post-scale requires the multiplication between the output polynomial coefficients and $\xi^{-1}$. Equation 2 refers to polynomial multiplication with the NTT technique:

$$c = \mathrm{INTT}\big(\mathrm{NTT}(a) \bullet \mathrm{NTT}(b)\big) \tag{2}$$

The symbol • indicates Point-Wise Multiplication (PWM). NTT computation is typically executed through butterfly operations in Cooley-Tukey (CT) or Gentleman-Sande (GS) configurations [25]. NTT in Kyber is slightly different from the classic one because the field $\mathbb{Z}_q$ does not contain the 2n-th primitive root of unity, and the modulus $(X^n + 1)$ cannot be fully factored into n degree-1 polynomials but only into n/2 degree-2 polynomials [26], [27]. This means that one 256-term NTT can be conducted as two separate 128-term NTTs, and the PWM requires five $\mathbb{Z}_q$ multiplications instead of only one.
Butterfly computations and PWM are executed modulo q employing Montgomery [28] and Barrett [29] reductions in Kyber (reported respectively in Algorithms 1 and 2). Dilithium adopts Montgomery and a special reduction named reduce32 (reported respectively in Algorithms 3 and 4).

Algorithm 1 Montgomery Reduction for Crystals-Kyber
Input: 32-bit integer $a$, $q_{inv} = q^{-1} \bmod 2^{16}$
Output: 16-bit integer $t$ congruent to $aR^{-1} \bmod q$, where $R = 2^{16}$
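The two Kyber reductions can be modelled in a few lines of Python. This is a minimal sketch that follows the structure of the public reference implementation as we understand it (signed 16-bit truncation in the Montgomery step, a rounding offset in the Barrett step); the helper `to_int16` and the exact constants of the Barrett step are assumptions of this sketch, not a transcription of Algorithms 1 and 2.

```python
Q = 3329                    # Kyber modulus
QINV = pow(Q, -1, 1 << 16)  # q^-1 mod 2^16


def to_int16(x):
    """Interpret the low 16 bits of x as a signed 16-bit integer."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def montgomery_reduce(a):
    """Return r congruent to a * 2^-16 mod q, for |a| < q * 2^15."""
    t = to_int16(a * QINV)    # low half of a * q^-1, signed
    return (a - t * Q) >> 16  # a - t*q is a multiple of 2^16, so this is exact


def barrett_reduce(a):
    """Return the centred representative of a 16-bit signed integer a mod q."""
    v = ((1 << 26) + Q // 2) // Q     # rounded approximation of 2^26 / q
    t = (v * a + (1 << 25)) >> 26     # approximates round(a / q)
    return a - t * Q


def fqmul(a, b):
    """Multiplication followed by Montgomery reduction."""
    return montgomery_reduce(a * b)
```

Because `fqmul` returns a·b·2^-16 mod q, one of the two operands (typically the twiddle factor) is kept in Montgomery form, i.e. pre-multiplied by 2^16 mod q, so that the extra factor cancels.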

III. EXTENDED ISA DEFINITION AND IMPLEMENTATION
RISC-V ISA provides four basic instruction formats named R-type, I-type, S-type, and U-type. The R-type format uses two registers as sources and puts the output to a single destination register, I-type format replaces one source register with one immediate, S-type format replaces the destination register with one immediate, and U-type format has no source operands but a larger immediate. The RISC-V ISA has reserved a portion of the encoding space for custom extensions using specific values for the opcode field (i.e. bits 6:0) of the instruction, like in the R-type format reported in Figure 1. In the next sections, we will explain our RISC-V custom implementation for CRYSTALS algorithms.

A. ALGORITHMS ANALYSIS ON RISC-V IMPLEMENTATION
The choice of operations to accelerate is based on the analysis of the Kyber and Dilithium algorithms running on the RISC-V CVA6 core. As reported in Section II, the most onerous parts of the Kyber and Dilithium algorithms are related to the computation of polynomial multiplication, XOF, and CRH functions. To confirm this assumption, we first analysed the contribution of such operations to the three main functions composing Kyber (i.e. key generation, encryption, and decryption) and Dilithium (i.e. key generation, signature generation, and signature verification). In this analysis we focus on indcpa_keypair, indcpa_enc, and indcpa_dec for Kyber and on crypto_sign_keypair, crypto_sign and crypto_sign_verify for Dilithium, as found in the reference codes. The goal of this evaluation process is to identify primitives that can be easily integrated inside the execution stage of the target RISC-V processor. The results of this preliminary analysis are reported in Tables 3 and 4 respectively for the Kyber and Dilithium algorithms. Column Funct. reports the functions we are evaluating, column Op. indicates the main sub-functions, columns OP1 and OP2 are the lengths of the input operands and OP3 is the length of the output one. Column % Funct. indicates the contribution in terms of clock cycles taken by the sub-function (i.e. column Op.) over the main one (i.e. column Funct.). The reported Keccak sub-function refers to the KeccakF1600_StatePermute function in the code, which is used to compute XOF and CRH operations in the algorithms, while the basemul sub-function refers to the PWM. The presented results refer to the lowest security level of both algorithms, measuring the average number of clock cycles over 10 000 repetitions.

Algorithm 3 Montgomery Reduction for Crystals-Dilithium
Input: 64-bit integer $a$ such that $-2^{31}q \le a \le 2^{31}q$, $q_{inv} = q^{-1} \bmod 2^{32}$
Output: 32-bit integer $r$ such that $r = a\,2^{-32} \bmod q$
1: $r = a\,q_{inv}$
2: $r = (a - r\,q) \gg 32$
3: return $r$

Algorithm 4 Modular Reduction (Reduce32) for Crystals-Dilithium
Input: 32-bit integer $a$ such that $a \le 2^{31} - 2^{22} - 1$
Output: 32-bit integer $t = a \bmod q$
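Algorithms 3 and 4 can be sketched in Python as follows. This is a minimal model under stated assumptions: the product in step 1 of Algorithm 3 is truncated to a signed 32-bit value, and reduce32 uses the shift-by-23 approximation of the division by q found in the public reference code as we understand it. The function names and the helper `to_int32` are hypothetical.

```python
Q_DIL = 8380417                     # Dilithium modulus, 2^23 - 2^13 + 1
QINV_DIL = pow(Q_DIL, -1, 1 << 32)  # q^-1 mod 2^32


def to_int32(x):
    """Interpret the low 32 bits of x as a signed 32-bit integer."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def montgomery_reduce_dilithium(a):
    """Return r congruent to a * 2^-32 mod q, for |a| <= 2^31 * q."""
    r = to_int32(a * QINV_DIL)    # step 1, kept as a signed 32-bit value
    return (a - r * Q_DIL) >> 32  # step 2, the difference is a multiple of 2^32


def reduce32(a):
    """Return a small representative of a mod q, for a <= 2^31 - 2^22 - 1."""
    t = (a + (1 << 22)) >> 23     # approximates round(a / 2^23)
    return a - t * Q_DIL
```

Both functions return a value congruent to the expected result but not necessarily in the range from 0 to q-1; the output stays bounded in absolute value by q.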
Each polynomial has 256 16-bit coefficients in Kyber and 256 32-bit coefficients in Dilithium, thus the operand lengths of the sub-functions are 4096 and 8192 bits when polynomials are involved. In the case of the ntt/invntt sub-functions, the OP2 operand refers to the twiddle factors (128 16-bit or 32-bit constants). Aside from the Keccak permutation, the other sub-functions are composed mainly of modular multiplications and reductions that can be accelerated by scalar instructions within the RISC-V CPU. In particular, we selected the fqmul (multiplication followed by Montgomery reduction) and Barrett reduction operations for the Kyber algorithm and the fqmul and reduce32 operations for the Dilithium one. In Kyber, the fqmul and Barrett reduction operations take 31 and 22 cycles respectively, contributing 51% to 82% of the three high-level indcpa functions. For the Dilithium algorithm, the fqmul and reduce32 operations take 26 and 19 cycles respectively, contributing 11% to 27% of the high-level functions. We remark that the purpose of this work is to investigate the acceleration of polynomial operations, even though we know that a considerable contribution to the execution time of the CRYSTALS suite is also due to the Keccak function. The integration of a Keccak hardware accelerator inside the pipeline of the processor would be very costly, since it requires several bit manipulation operations and storage resources.
The implemented extension has been inserted into the custom-0 RISC-V encoding space (opcode = 7'b0001011), using the R-type format for all instructions to simplify the decode stage. The funct3 field is used to distinguish between Kyber and Dilithium, and the funct7 field identifies the particular instruction. The fqmul and reduce operations can be implemented straightforwardly, since they have two and one input operands, respectively.
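The R-type encoding in the custom-0 space can be illustrated with a small Python helper that assembles the 32-bit instruction word from its fields. The field layout and the custom-0 opcode follow the RISC-V specification; the funct3 and funct7 values used in the example are placeholders, since the actual values assigned to each instruction are listed in Table 5 and are not reproduced here.

```python
CUSTOM0_OPCODE = 0b0001011  # custom-0 major opcode


def encode_rtype(funct7, rs2, rs1, funct3, rd, opcode=CUSTOM0_OPCODE):
    """Assemble a 32-bit R-type instruction word from its fields."""
    fields = (("funct7", funct7, 7), ("rs2", rs2, 5), ("rs1", rs1, 5),
              ("funct3", funct3, 3), ("rd", rd, 5), ("opcode", opcode, 7))
    for name, value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name}={value} does not fit in {bits} bits")
    return ((funct7 << 25) | (rs2 << 20) | (rs1 << 15)
            | (funct3 << 12) | (rd << 7) | opcode)


def decode_rtype(word):
    """Split a 32-bit R-type instruction word back into its fields."""
    return {
        "funct7": (word >> 25) & 0x7F,
        "rs2": (word >> 20) & 0x1F,
        "rs1": (word >> 15) & 0x1F,
        "funct3": (word >> 12) & 0x7,
        "rd": (word >> 7) & 0x1F,
        "opcode": word & 0x7F,
    }


# Placeholder funct3/funct7 values (both 0), operands x10 and x11, result in x12.
example_word = encode_rtype(funct7=0, rs2=11, rs1=10, funct3=0, rd=12)
```

An instruction word built this way could be emitted from C source through the assembler's `.word` directive when the toolchain has no mnemonic for the custom instruction.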
As reported in Section II, NTT and INTT are employed for polynomial multiplication in both the Kyber and Dilithium algorithms. Forward NTT requires the CT-butterfly configuration, while inverse NTT employs the GS one. Figure 2 depicts both CT and GS configurations. The integration of the butterfly unit inside the RISC-V pipeline as R-type instructions has two main blocking factors: the butterfly unit requires three input operands, i.e. two 16-bit (or 32-bit) polynomial coefficients and one 16-bit (or 32-bit) twiddle factor, and it produces two 16-bit (or 32-bit) polynomial coefficients as the result of the computation, whereas the R-type format provides only two source registers and one destination register. To overcome the first limitation, we exploited the fact that the twiddle factors are compile-time constants, therefore they can be saved in an internal LUT of the PQ ALU. We introduced an appropriate instruction (set_twiddle_k/_d), called before the actual butterfly one ([i]ntt_k/_d), to select the twiddle factor. The second limitation can be overcome by packing the two 16-bit (or 32-bit) results into a single 64-bit destination register. This solution is valid only on 64-bit architectures for the Dilithium algorithm, but it can also be used on 32-bit architectures to accelerate the Kyber one.
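The role of the two butterfly configurations can be illustrated with a toy NTT in Python. This is a sketch under stated assumptions: it uses a small ring (n = 8, q = 17, with 3 as a primitive 16th root of unity) rather than the Kyber or Dilithium parameters, a complete NTT rather than the incomplete one used in Kyber, and plain modular arithmetic instead of Montgomery form. All names are hypothetical.

```python
Q = 17    # toy modulus (not a CRYSTALS parameter)
N = 8     # toy polynomial length
LOGN = 3
PSI = 3   # primitive 2N-th root of unity modulo Q


def bitrev(k, bits=LOGN):
    """Reverse the lowest `bits` bits of k."""
    return int(format(k, f"0{bits}b")[::-1], 2)


PSI_REV = [pow(PSI, bitrev(k), Q) for k in range(N)]
PSI_INV_REV = [pow(pow(PSI, -1, Q), bitrev(k), Q) for k in range(N)]
N_INV = pow(N, -1, Q)


def ct_butterfly(a, b, w):
    """Cooley-Tukey butterfly: multiply first, then add and subtract."""
    t = (w * b) % Q
    return (a + t) % Q, (a - t) % Q


def gs_butterfly(a, b, w):
    """Gentleman-Sande butterfly: add and subtract first, then multiply."""
    return (a + b) % Q, ((a - b) * w) % Q


def ntt(poly):
    """Forward negacyclic NTT built from CT butterflies."""
    a, t, m = list(poly), N, 1
    while m < N:
        t //= 2
        for i in range(m):
            for j in range(2 * i * t, 2 * i * t + t):
                a[j], a[j + t] = ct_butterfly(a[j], a[j + t], PSI_REV[m + i])
        m *= 2
    return a


def intt(poly):
    """Inverse negacyclic NTT built from GS butterflies."""
    a, t, m = list(poly), 1, N
    while m > 1:
        h = m // 2
        for i in range(h):
            for j in range(2 * i * t, 2 * i * t + t):
                a[j], a[j + t] = gs_butterfly(a[j], a[j + t], PSI_INV_REV[h + i])
        t, m = 2 * t, h
    return [(x * N_INV) % Q for x in a]


def multiply_via_ntt(a, b):
    """Multiply modulo (X^N + 1, Q) with point-wise products in the NTT domain."""
    return intt([(x * y) % Q for x, y in zip(ntt(a), ntt(b))])


def negacyclic_mul(a, b):
    """Schoolbook reference: X^N wraps around to -1."""
    c = [0] * N
    for i in range(N):
        for j in range(N):
            sign = 1 if i + j < N else -1
            c[(i + j) % N] = (c[(i + j) % N] + sign * a[i] * b[j]) % Q
    return c
```

Each butterfly call takes two coefficients and one twiddle factor and returns two coefficients, which is exactly the three-input, two-output shape that does not fit the R-type format directly.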
Finally, as a result of this work, we have implemented the ten instructions listed in Table 5, five per algorithm, which accelerate the selected operations. In particular, reduce_k computes the Barrett reduction for Kyber, reduce_d computes the reduce32 reduction for Dilithium, ntt_k/_d perform the CT butterfly, and intt_k/_d perform the GS one.

B. PQ-ALU ARCHITECTURE
We designed two different PQ ALU modules to implement the PQ ISA extension listed in Table 5. Figure 3 shows the hardware design for Kyber acceleration, while Figure 4 shows the design for Dilithium. To limit the hardware resource utilisation, both implementations reuse the hardware modules of the fqmul and reduction instructions (i.e. Barrett for Kyber and reduce32 for Dilithium) to perform the CT and GS butterfly operations. Multiplexers are used to modify the datapath and implement both butterfly configurations. The set_twiddle_k/_d instructions change the address of the internal LUT to provide the requested twiddle factor.
All instructions are encoded with R-type format, but reduction instructions do not use Rs2 and set_twiddle_k/_d use only Rs1. The unused registers can be set to x0 in source code to prevent the CPU from performing unnecessary register renaming and/or pipeline stalling due to dependency check between instructions. The output of each instruction is sign-extended to fill the 64-bit register, apart from the ntt_k/_d instructions whose output pair must be taken from the packed destination register. This overhead requires just one shift operation to extract the result stored in the most significant part of the 64-bit register.
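The packing of the two butterfly outputs and their extraction in software can be sketched as follows. This is an illustrative Python model that assumes the two results occupy adjacent fields of the destination register, with the first result in the least significant field; the actual bit positions used by the PQ ALU are not specified in the text, and the function names are hypothetical.

```python
MASK64 = (1 << 64) - 1


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a signed integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def pack_pair(low, high, width):
    """Pack two signed width-bit results into one 64-bit register value."""
    mask = (1 << width) - 1
    return (((high & mask) << width) | (low & mask)) & MASK64


def unpack_pair(reg, width):
    """Recover both results; the upper one needs a single shift first."""
    low = sign_extend(reg, width)
    high = sign_extend(reg >> width, width)
    return low, high
```

With width = 16 the pair fits in 32 bits, which is why the same scheme could serve Kyber on a 32-bit core, while width = 32 for Dilithium needs the full 64-bit register.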

IV. FPGA DEMONSTRATOR

A. FPGA DEMONSTRATOR TEST SETUP
In order to test and prototype the system, we implemented a basic SoC on the Xilinx ZCU106 evaluation board [30]. As shown in Figure 5, the SoC is composed of:
• The CVA6 core, extended with our proposed PQ ALU;
• The RAM memory (external DDR4);
• The UART used to display the debug information;
• The JTAG used to program the device with compiled software directly written in the RAM;
• The clock generator.
The CVA6 core includes 32KB and 16KB of cache memory for data and instructions, respectively, and does not integrate any Floating Point Unit (FPU). We used performance-oriented strategies both for synthesis and implementation on Xilinx Vivado 2020.2, with an operational clock frequency of 100MHz. Notably, the critical path does not include our proposed PQ ALU, therefore our ISA extension does not affect the timing performance of the system.

B. RESOURCE UTILISATION ANALYSIS
Regarding complexity, we briefly report a resource analysis of the extended CVA6. These results have been obtained with the Flow_PerfOptimized_high strategy with retiming for the synthesis and the Performance_ExtraTimingOpt strategy for the implementation, and with the keep hierarchy attribute on the PQ ALU. Table 6 sums up the utilisation reports.
We can observe that the PQ ALU has a minimal impact on resource utilisation. In fact, it employs only 178 LUTs for Kyber and 377 for Dilithium, no FF, 5 DSP in Kyber and 10 in Dilithium, and only 0.5 BRAM. This means that the area impact of the PQ ALU is almost negligible compared to the CVA6. This result was predictable since our PQ ALU mainly performs arithmetic operations such as multiplications, which in FPGAs are mapped on DSPs, if available.
The PQ ALU is integrated into the CVA6 core as a fixed latency Functional Unit (FU) and its control requires further logic which increases the resource usage. Even if it is not the aim of this paper to describe in detail the architecture of the CVA6 processor (please refer to [10] for further information), in the following paragraph we briefly analyse the area increment due to the control logic.
The largest increment is represented by the issue stage of the core, which dispatches instructions to the FUs and keeps track of them in a scoreboard-like data structure. This part of the CVA6 processor is composed of the issue_read_operands module and the just mentioned scoreboard. The latter is a FIFO that keeps track of all decoded, issued, and committed instructions, together with all the registers that are involved during the execution of those instructions. The issue_read_operands module issues the instructions from the scoreboard and fetches the operands, also handling the forwarding logic in order to execute two instructions back to back (i.e. with no bubble in between). Since we introduced a new FU in the execution stage, more logic is required to dispatch the instructions to the available FUs, to fetch/forward the operands to other stages of the pipeline, and to decode the instructions themselves.

C. POWER CONSUMPTION
The energy necessary to carry out a specific function is a very important metric, especially if the system is intended to be employed in edge devices where power consumption is a major constraint. Since the resource overhead introduced with our implementation is almost negligible, we expected the instantaneous power consumption to be almost invariant, and the energy necessary to perform a specific function to be reduced by the same percentage as the clock cycles saved in carrying out the operation. We performed a specific set of tests in which we measured the total power consumption of the FPGA; knowing its static power consumption in advance, we were able to derive the dynamic one during the execution of all six analysed functions, both with the regular CVA6 processor and with the one embedding our PQ ALU. The three main functions for Kyber are indcpa_keypair, indcpa_enc, and indcpa_dec [26], while those for Dilithium are crypto_sign_keypair, crypto_sign_signature, and crypto_sign_verify [31].
Static power consumption figures have been taken from the implementation results of Vivado, while the total power consumption has been measured by applying the methods described in [32]: data were collected using the MaxPowerTool instrument from Maxim Integrated for the ZCU106 board. The VCCINT power rail has been taken into account to measure the power consumption of the programmable logic, while the VCCBRAM, VCCAUX, and VCC1V2 rails are taken into account for other components of the FPGA and board (e.g. the Block RAM, the I/O, and the DDR memory).
Our aim here is to measure the power consumption of the different configurations under the same environmental conditions and to calculate the relative energy saving. Table 7 shows the collected results for the lowest level of security of both algorithms; other security levels follow the same principles. We observe that the dynamic power consumption is slightly lower (less than 248 mW difference) for the PQ CVA6 functions. The difference is probably due to the fact that, even if more hardware is active within the FPGA when the PQ ALU is added, the overall switching activity is lower because the processor operates more efficiently and performs fewer memory accesses, resulting in an overall small reduction of the dissipated power. Randomness in the output of the place and route algorithms could also play a minor role. What should be taken into account is that the overall power consumption is almost the same, but in the PQ CVA6 the functions require far fewer clock cycles to be executed: the energy necessary to perform each function is reduced by the same factor.
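The relation between power, clock cycles, and energy used in this reasoning can be written as energy = power × cycles / clock frequency. The short Python sketch below makes the arithmetic explicit; only the 100 MHz clock comes from the test setup, while the power and cycle values in the example are placeholders chosen for illustration and are not measurements from Table 7.

```python
def energy_joules(power_w, clock_cycles, clock_hz=100e6):
    """Energy of one function call: power times execution time."""
    return power_w * clock_cycles / clock_hz


def energy_saving(power_base_w, cycles_base, power_pq_w, cycles_pq, clock_hz=100e6):
    """Fraction of energy saved by the accelerated configuration."""
    e_base = energy_joules(power_base_w, cycles_base, clock_hz)
    e_pq = energy_joules(power_pq_w, cycles_pq, clock_hz)
    return 1.0 - e_pq / e_base


# Placeholder values: equal power in both configurations, 2.7x fewer cycles.
example_saving = energy_saving(1.0, 270_000, 1.0, 100_000)
```

When the power is the same in both configurations, the saving depends only on the ratio of the clock cycle counts, which is the behaviour described above.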

D. PERFORMANCE TESTS
Our aim is to measure and understand how the PQ ALU insertion improves the overall cryptographic performance. To do that, we measured the number of clock cycles to compute each principal cryptographic function in three different scenarios to understand the impact of our acceleration: without PQ ALU, with only fqmul and reduce accelerations, and with all five instructions per algorithm (including logic for butterfly operations).
Once the functionalities had been verified against the KAT (Known Answer Test) values provided by NIST, we defined a second set of tests to measure the performance of our proposed PQ ALU and the speed-up obtained. Each of these functions has been executed on our hardware prototype by the RISC-V processor, for a total of 10 000 iterations. In Table 8 we present the average number of clock cycles necessary to execute each function and its sub-functions for Kyber, and in Table 9 for Dilithium. The performance improvements are also shown, expressed in terms of clock cycle reduction for executing the task. The average speed-up achieved for the main cryptographic functions ranges from 1.2× to 2.7× depending on the security level.
Please note that the GS butterfly of Kyber (adopted in the inverse NTT) contains the Barrett reduction operation, which is not present in the CT butterfly (adopted in the forward NTT). This results in a higher speed-up, because the software version of the INTT takes more clock cycles than the NTT, while the optimized instructions for the butterflies take just one clock cycle in both cases.

V. RESULTS DISCUSSIONS
Combining the achieved results in terms of complexity, timing performance, and energy per function, we observe that our solution achieves a speed-up of 20% to 65% for the Kyber and Dilithium functions, with an almost negligible increase in LUTs (+3%), no impact on FFs and a moderate use of DSPs (+5 units) in the FPGA. Thanks to the large timing improvement at low hardware cost, there is also a significant advantage in the energy required for performing a Kyber/Dilithium operation, which is reduced approximately by the same factor as the performance speed-up.
The approach we followed is inspired by [33], where a similar ISA extension is carried out for the LAC algorithms, with remarkable results. Other contributions present a tightly integrated hardware accelerator in the RISC-V processor. In [34] an NTT and hash accelerator is embedded in the RISCY processor on an FPGA, with low hardware consumption (886 LUT, 618 FF, and 26 DSP), which is however higher than that of our PQ ALU, and with comparable performance in terms of NTT clock cycle count (24609 against our 22866). In [35], the authors follow a completely different approach, designing a domain-specific vector co-processor, integrated with a RISC-V processor, focusing on the NTT for the Ring-LWE and Module-LWE PQC algorithms, with a complexity in the order of 942 kGates. There are also contributions targeting ASIC accelerators, which achieve higher performance but at the price of more expensive hardware with less flexibility: in [36] a 65nm hardware accelerator for both Kyber and Dilithium (plus other PQC algorithms) is presented. Finally, there are several contributions on Kyber and/or Dilithium hardware accelerators for FPGAs, which are close to our presented implementation but target slightly higher performance with higher complexity: in [37] a Dilithium hardware accelerator is presented, capable of performing all the Dilithium functions (keygen, signature, and verification) in hardware, at the price of about 70k LUT and 86k FF, with performance in the order of 10-20 kOps, depending on the function. In [27] a Kyber hardware accelerator integrated into a SoC is presented. The entire NTT primitive is performed in hardware, providing better timing performance but also much higher resource consumption (about 7k LUT, 4.6k FF, and 2 DSPs) than our proposed architecture. In [38] a Dilithium hardware accelerator, still based on the NTT primitive, is presented; the performance is remarkable (from 1 to 11 kOps), but the resource consumption is also quite high (30k LUT, 11k FF, 45 DSP) compared to our solution.
VOLUME 9, 2021
Even if a detailed comparison with the few alternative solutions available in the literature is not easy, in Table 10 we compare our work with similar solutions. Since many factors are affecting the performance of the systems presented in the various works, we decided to compare only the NTT core designed for FPGA applications, to be used for the Kyber protocol (n = 256, q = 3329). On top of that, we did not consider the execution time for the NTT or the number of clock cycles necessary to carry out the NTT operation, since those numbers are affected by many factors which are not always specified in literature, like security level and code optimisation. We decided to present and compare the speed-up that each circuit achieved using hardware acceleration for the NTT function. Resource utilisation in terms of LUTs, FFs, DSPs, and BRAMs is reported. Even though it is not possible to compute a Speed-up Vs. Resource overhead, since implementations are very different (e.g. some use DSPs and/or BRAM, others do not), it is possible to observe that the speed-up is proportional to the resource utilisation and that our proposed solution is aligned to the state of the art.

VI. CONCLUSION
LBC is one of the most promising candidates to win the NIST standardization process, with Kyber and Dilithium among the finalists of the third round of this process. An important achievement of this work is the proof that PQC algorithms can be significantly accelerated by exploiting the flexibility of RISC-V processors, integrating dedicated accelerators directly in the core pipeline. The analysis of the Kyber and Dilithium algorithms identified their onerous parts in the polynomial multiplications, which use the NTT and modular reductions. We proposed a dedicated architecture to accelerate them directly in the pipeline of a RISC-V processor, exploiting dedicated instructions for the fqmul and Barrett reduction in the case of Kyber, and for the fqmul and reduce32 in the case of Dilithium. In addition, we also integrated the butterfly operations used in NTT and INTT computations for both algorithms. We defined dedicated assembly instructions for each of these functions and we realised two different hardware accelerators, which can be integrated as FUs into the processor pipeline. After that, we designed a SoC implemented on the Xilinx ZCU106 evaluation board to gather all the performance and power consumption results. As shown in Section V, the number of clock cycles required by the different functions has been reduced consistently, with speed-ups of up to 2.7× for indcpa_dec at security level 5. These results are achieved at an affordable price in terms of resource consumption, leading also to an energy saving comparable to the obtained speed-up.
This work provides the first available indication of the power consumption of those operations and the impact that in-pipeline hardware acceleration has on the energy required to execute every single function.