FPGA Implementation of Compact Hardware Accelerators for Ring-Binary-LWE-based Post-quantum Cryptography

Authors:
Pengzhou He (Villanova University, USA; ORCID 0000-0003-3461-4548),
Tianyou Bao (Villanova University, USA; ORCID 0000-0003-3321-5123),
Jiafeng Xie (Villanova University, USA; ORCID 0000-0002-4814-1318),
Moeness Amin (Villanova University, USA; ORCID 0000-0002-0926-4120)

ACM Transactions on Reconfigurable Technology and Systems, Volume 16, Issue 3, Article No. 45, pp. 1–23. https://doi.org/10.1145/3569457

Published: 21 June 2023

Abstract

Post-quantum cryptography (PQC) has recently drawn substantial attention from various communities owing to the proven vulnerability of existing public-key cryptosystems against the attacks launched from well-established quantum computers. The Ring-Binary-Learning-with-Errors (RBLWE), a variant of Ring-LWE, has been proposed to build PQC for lightweight applications. As more Field-Programmable Gate Array (FPGA) devices are being deployed in lightweight applications like Internet-of-Things (IoT) devices, it would be interesting if the RBLWE-based PQC can be implemented on the FPGA with ultra-low complexity and flexible processing. However, thus far, limited information is available for such implementations. In this article, we propose novel RBLWE-based PQC accelerators on the FPGA with ultra-low implementation complexity and flexible timing. We first present the process of deriving the key operation of the RBLWE-based scheme into the proposed algorithmic operation. The corresponding hardware accelerator is then efficiently mapped from the proposed algorithm with the help of algorithm-to-architecture implementation techniques and extended to obtain higher-throughput designs. The final complexity analysis and implementation results (on a variety of FPGAs) show that the proposed accelerators have significantly smaller area-time complexities than the state-of-the-art designs. Overall, the proposed accelerators feature low implementation complexity and flexible processing, making them desirable for emerging FPGA-based lightweight applications.

1 INTRODUCTION

Existing public-key cryptosystems, such as Rivest Shamir Adleman (RSA) and Elliptic Curve Cryptography (ECC), have been shown to be vulnerable to attacks launched from well-established quantum computers based on Shor’s algorithm [6, 46]. As a result, post-quantum cryptography (PQC)-related research and development have gained substantial attention in various communities [35]. Several types of cryptosystems have been proposed as potential standard PQC candidates, including lattice-based cryptography [1]. Because of its high security against quantum attacks and low implementation complexity [1, 34], lattice-based cryptography has emerged as one of the most important PQCs. The recently released National Institute of Standards and Technology (NIST) third-round PQC standardization process has also selected three lattice-based PQC schemes (out of four public-key encryption scheme finalists) [1].

Many lattice-based PQC algorithms are based on the learning-with-errors (LWE) problem or its variants like Ring-LWE [39]. Overall, Ring-LWE not only retains the hardness of the lattice-based problem but also offers significantly lower computational complexity than the standard LWE. This reduction is attributed to the fact that the main operation of Ring-LWE becomes a polynomial multiplication over the ring \(\mathbb {Z}_q[x]/(x^n+1)\) [34]. A relatively new variant of Ring-LWE, namely the Ring-Binary-LWE (RBLWE), was proposed in [10] to further reduce the computational complexity by using binary errors in lieu of regular Gaussian distributed errors. Since its introduction, the RBLWE-based PQC has been regarded as a promising lightweight PQC for emerging applications [3, 10, 17, 21, 48, 49, 50].

Indeed, apart from the general-purpose PQC algorithms undergoing the NIST PQC standardization process, there is an urgent need to develop lightweight PQC schemes for related applications. The recent National Science Foundation (NSF) Secure and Trustworthy Cyberspace Principal Investigators’ Meeting 2022 (SaTC PI Meeting’22) has also identified that one future research direction is the “lightweight PQC” in the breakout group report of “security in a Post-Quantum World” [2]. Following this trend, several works have been released on efficient implementation of the RBLWE-based encryption scheme.

Existing Works. The first software implementation of the RBLWE-based PQC was presented in [10]. This was followed by the first hardware implementation of the RBLWE-based PQC scheme, which was provided in [3]. A pair of high-speed and ultra-lightweight structures for the RBLWE-based scheme was then presented in [17]. Recently, efficient high-speed structures were presented in [48, 50]. A new compact design was proposed in [45]. A low-complexity hardware design for the RBLWE-based PQC was reported in [21]. More recently, a new high-speed architecture was released in [49]. A pair of low-speed and high-speed lightweight accelerators was presented in [33]. Two high-speed RBLWE-based arithmetic architectures were proposed in [25]. Efficient hardware RBLWE-based implementations were also released in [52]. Other works include the fault-detection-based design of [43] (based on the structure in [17]), the fault-resistant software implementation of [16], and the RISC-V-based RBLWE cryptoprocessor of [20]. These designs represent the principal efforts in the field to date.

Existing Challenges. Along with the NIST PQC standardization process [1], the recent trend in the PQC field has gradually switched to the efficient implementation of the PQC algorithms on hardware platforms, e.g., Field-Programmable Gate Array (FPGA) devices [48]. Though more and more FPGA devices are deployed for lightweight applications such as Internet-of-Things (IoT) and edge computing devices, a thorough study of the lightweight implementation of the RBLWE-based PQC accelerator on the FPGA platform has not been carried out in the literature. Specifically, we point to three challenges facing existing low-complexity implementations of the RBLWE-based scheme: (1) The compact designs involve relatively large area usage; e.g., the most recent architecture of [21] employs one \(\mathrm{log}_2q\)-bit n-to-1 multiplexer (MUX) to deliver the input data to the calculation cell for further processing. This setup incurs large area occupation and is not ideal for practical applications. (2) The architectures are typically designed for a fixed processing speed, and no flexible processing styles are offered. (3) Most of the existing compact structures for RBLWE-based PQC have not fully considered the actual operational conditions. For instance, the existing designs of [17, 21] still assume the large-size input operands to be connected to the structure in parallel, which is impractical due to the corresponding large number of input bits (can be 1,024 bits for \(n=256\)), while another design assumes the input data is produced by external resources [45]. In essence, significant efforts are needed to advance the compact/ultra-compact implementations for RBLWE-based PQC on the FPGA platform.

Major Contributions. Based on the above considerations, we propose novel hardware accelerators for the RBLWE-based PQC on the FPGA platform to obtain ultra-low implementation complexity, flexible processing, and practical input processing setup. The main contributions of this article are:

Presenting the motivation and delineating the implementation-oriented algorithmic derivation for the key arithmetic component of the RBLWE-based PQC. Specifically, we have derived a binary polynomial originated polynomial multiplication algorithm for possible low-cost implementation, which has not been presented in the literature.
Introducing a novel ultra-compact accelerator for the RBLWE-based PQC scheme, following a novel two-level computation strategy. This strategy allows us to design the accelerator with not only small resource usage but also efficient operation.
Extending the proposed basic version of accelerator into a novel flexible processing style-based design with efficient structural optimizations.
Implementing the proposed structures on the FPGA platform (a variety of devices) and also comparing them with the competing designs to confirm their superior performance. For instance, on the Intel Stratix-V device, the proposed accelerator (\(t=2\)) has at least 74.25% and 75.42% less area-delay product (ADP) than the most recent design for \(n=256\) and \(n=512\), respectively.

The rest of the article is organized as follows. Section 2 provides problem background and preliminary knowledge. Section 3 presents the motivation and the algorithm for the proposed design strategy. The proposed accelerator is presented in Section 4. Further algorithmic derivation and structural extension are presented in Section 5. The complexity analysis, implementation, and comparison are shown in Section 6, and the conclusion is given in Section 7.

2 BACKGROUND

RBLWE-based Encryption Scheme. RBLWE is a relatively new variant of Ring-LWE introduced in [10], which uses binary errors, replacing the regular Gaussian distributed errors, to obtain lower computational complexity. Overall, the RBLWE-based encryption scheme involves three main operational phases [10]: key generation, encryption, and decryption. Figure 1 gives a summary of the operations involved within each phase. The related notations are as follows: a is a public polynomial (integer coefficients of \(\mathrm{log}_2q\)-bit); \(r_1\) and \(r_2\) are two random binary polynomials and \(r_2\) is the secret key; message m (binary polynomial of n-length) is encoded to produce the ciphertext \(c_1\) and \(c_2\); n is the security level of the scheme; and q is used for modulo operation (defined by the ring \(R_q=\mathbb {Z}_q[x]/(x^n+1)\)), where the ring polynomial is \(f(x)=x^n+1\). The detailed operations within each phase are given below.

Fig. 1. Details of the RBLWE-based encryption scheme.

Key generation. Following the setup in [10], Alice and Bob both know the public parameter a. Alice then uses two selected binary polynomials \(r_1\) and \(r_2\) to compute \(p=r_1-a\cdot r_2\), which is sent to Bob; \(r_1\) is no longer used after this. In this case, \(r_2\) (n-bit) is the secret key and the public key is p (\(n\mathrm{log}_2q\)-bit).

Encryption. The original message m (a binary polynomial) is coded into \(\widetilde{m}\) through the operation that each coefficient of m is multiplied with \(q/2\) [10]. The coded \(\widetilde{m}\) is then delivered to Alice from Bob through the operations involved with three binary errors (polynomials) \(e_1\), \(e_2\), and \(e_3\) to generate the ciphertext \(c_1\) and \(c_2\) (ciphertext is thus \(2n\mathrm{log}_2q\)-bit); see Table 1.

Table 1. Major Arithmetic Operations of the RBLWE-based PQC

Phase	Key Arithmetic Operations
Key generation	\(r_1-a\cdot r_2=p\) (PM followed by one PA)
Encryption	\(c_1=a\cdot e_1+e_2\) (PM followed by one PA)
Encryption	\(c_2=p\cdot e_1+e_3+\widetilde{m}\) (PM followed by two PAs)
Decryption	\((c_1\cdot r_2+c_2)\) (PM followed by one PA)

PM: polynomial multiplication; PA: polynomial addition.

Decryption. Alice uses a threshold decoder to obtain the recovered message m. The decoder will produce “1” if the coefficient of the (\(c_1r_2+c_2\)) is in the range of \((q/4,3q/4)\); otherwise it is “0” [10].

Security/Hardness of the RBLWE Problem. It is proved that the binary LWE with uniform error distribution retains the worst-case hardness of the LWE problem when the number of samples is restricted [11]. In [10], it is shown that the RBLWE-based encryption scheme is based on the average-case hardness of the RBLWE problem.

Several security attacks and analyses have been carried out to estimate the security level of the RBLWE-based PQC scheme [19]. A relatively recent work has determined that the RBLWE-based PQC achieves 73/84 and 140/190 quantum/classic security bits for the parameter settings of \(n=256,q=256\) and \(n=512,q=256\), respectively [19], which fits lightweight applications well [3, 8, 16].

Inverted RBLWE-based Scheme. To facilitate efficient hardware implementation, the authors of [17] have proposed to use the inverted range \((-\lfloor \frac{q}{2}\rfloor ,\lfloor \frac{q}{2}\rfloor -1)\) to represent the integer coefficients within the polynomials such that the modular addition/subtraction does not need any reduction based on the two’s complement number system. The major operations of Figure 1 then need only slight changes (inverting signs) on the encode and decode functions, while the rest remains the same [17]. We also adopt this method in our article.
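The wrap-around property behind the inverted range can be illustrated with a short Python sketch. It assumes \(\mathrm{log}_2q=8\) and models a fixed-width adder by masking to 8 bits; the helper names (`to_signed`, `wrap_add`, `centered_mod`) are illustrative and do not come from [17].

```python
def to_signed(x, bits):
    """Interpret the low `bits` bits of x as a two's-complement integer."""
    mask = (1 << bits) - 1
    x &= mask
    return x - (1 << bits) if x >> (bits - 1) else x

def wrap_add(x, y, bits):
    """Add two signed coefficients and keep only `bits` bits, like a fixed-width adder."""
    return to_signed(x + y, bits)

def centered_mod(x, q):
    """Reference reduction of x into the range [-q/2, q/2 - 1]."""
    return (x + q // 2) % q - q // 2

bits, q = 8, 256
pairs = [(100, 100), (-128, -1), (127, 1), (-60, 30)]
results = [wrap_add(x, y, bits) for x, y in pairs]
print(results)
```

In this model, discarding the carry out of the top bit gives the same result as an explicit reduction into the inverted range, which is why no separate reduction step is needed.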

Key Arithmetic Operations of RBLWE-based Encryption Scheme. From Figure 1, one can surmise that the key arithmetic operation of the RBLWE-based scheme is a polynomial multiplication followed by up to two polynomial additions; see Table 1. The binary polynomial used for generating \(c_2\) in the encryption phase can be set as zero in other phases. Besides, the polynomial subtraction in the key generation phase can be realized by addition under the two’s complement representation. The polynomial addition is simple, while the polynomial multiplication is much more complicated because of the involved modular operation. Note that in actual implementation, a constant error is also needed when executing the final addition with an integer polynomial, following the suggestion of the recent work of [52].
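As a toy illustration of how one multiply-then-add operation can serve all three phases, the following Python sketch runs key generation, encryption, and decryption with a single helper `poly_op`. All values are hypothetical, \(n=8\) is far below the real parameter sizes, the assignment of operands to the two addends is only one possible reading of the text, and the constant error mentioned above is not modeled.

```python
def poly_op(d, b, u, v, q):
    """Return D*B mod (x^n + 1) + U + V with coefficients reduced mod q."""
    n = len(d)
    g = [(ui + vi) % q for ui, vi in zip(u, v)]
    for i in range(n):
        for j in range(n):
            if i + j < n:
                g[i + j] = (g[i + j] + d[i] * b[j]) % q
            else:                       # x^n = -1, so wrapped terms are subtracted
                g[i + j - n] = (g[i + j - n] - d[i] * b[j]) % q
    return g

n, q = 8, 256
zero = [0] * n
a  = [17, 200, 5, 99, 250, 31, 128, 64]   # hypothetical public polynomial
r1 = [1, 0, 0, 1, 1, 0, 1, 0]
r2 = [0, 1, 1, 0, 1, 0, 0, 1]             # secret key
e1 = [1, 1, 0, 0, 1, 0, 1, 1]
e2 = [0, 1, 0, 1, 0, 0, 1, 0]
e3 = [1, 0, 0, 0, 1, 1, 0, 0]
m  = [1, 0, 1, 1, 0, 0, 1, 0]

neg_a = [(-c) % q for c in a]                       # subtraction as addition of -a
p  = poly_op(neg_a, r2, r1, zero, q)                # key generation: r1 - a*r2
c1 = poly_op(a, e1, e2, zero, q)                    # encryption, first ciphertext
m_coded = [(q // 2) * bit for bit in m]
c2 = poly_op(p, e1, m_coded, e3, q)                 # encryption, second ciphertext
g  = poly_op(c1, r2, c2, zero, q)                   # decryption: c1*r2 + c2
recovered = [1 if q / 4 < c < 3 * q / 4 else 0 for c in g]
print(recovered == m)
```

With these toy sizes the accumulated binary error stays below \(q/4\), so the threshold decoder recovers the message; the sketch says nothing about failure rates at the real parameter sets.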

3 MOTIVATION AND DERIVATION

Following the given major arithmetic operation of Section 2, we conduct the mathematical derivation to obtain the proposed algorithm. The proposed strategy and major innovations are also provided here.

Consideration and Motivation. Though there exist fast algorithms such as the Karatsuba algorithm [36, 47] that can theoretically reduce the computational complexity of the polynomial multiplication (Number Theoretic Transform (NTT) [37] is not applicable here due to the parameter setting [3]), these algorithms cannot be directly adopted because they (1) decompose the original polynomial multiplication into several short-size polynomial multiplications and thus are more suitable for high-speed applications (parallel processing) [36]; (2) involve pre- and post-addition operations, and hence the hardware structure involves complicated resource usage and control setup [29]; and (3) cannot be easily mapped into ultra-compact hardware structures due to the decomposed multiple sub-polynomial multiplications [21].

Besides, as stated in the Existing Challenges of Section 1, most of the existing compact designs still suffer from (1) large area, (2) inflexible processing speed, and (3) impractical design setup.

Proposed Strategy. Therefore, unlike the typical fast algorithm-based approach or the existing design methods, we propose a novel implementation-oriented algorithmic derivation strategy to obtain (1) low implementation complexity, (2) flexible processing rates, and (3) complete input-processing-related structural designing.

Contributions. Specifically, we notice that one polynomial within the polynomial multiplication has merely binary values (the other has integer coefficients). This motivates us to use the polynomial multiplication in the form that the modular operation is involved mainly with the binary polynomial and propose a two-level computation technique to minimize the resource usage. Further, we propose binary polynomial operation shared parallel computation to create the processing flexibility and design a complete processing setup for all signals, mainly those input signals.

Contribution I: Binary Polynomial-oriented Derivation. Let us first define the main operation of the RBLWE-based scheme as (1) \(\begin{equation} G=DB~\mathrm{mod}~[f(x)=x^n+1]+U+V, \end{equation}\) where \(D=\sum _{i=0}^{n-1}d_ix^i\), \(B=\sum _{i=0}^{n-1}b_ix^i\), \(U=\sum _{i=0}^{n-1}u_ix^i\), \(V=\sum _{i=0}^{n-1}v_ix^i\), and \(G=\sum _{i=0}^{n-1}g_ix^i\) (\(d_i\), \(u_i\), and \(g_i\) are \(\mathrm{log}_2q\)-bit integers and \(b_i,v_i\in \lbrace 0,1\rbrace\)). Then, we have (2) \(\begin{equation} \begin{split} W=BD~\mathrm{mod}~f(x), \end{split} \end{equation}\)

where \(W=\sum _{i=0}^{n-1}w_ix^i\) (\(w_i\) is also \(\mathrm{log}_2q\)-bit). We can further have (3) \(\begin{equation} \begin{split} W&=b_0(d_0+ \cdots +d_{n-1}x^{n-1})~\mathrm{mod}~f(x)+\cdots \\ & \quad +b_{n-1}x^{n-1}(d_0+ \cdots +d_{n-1}x^{n-1})~\mathrm{mod}~f(x).\\ \end{split} \end{equation}\)

Then, considering that \(x^n\equiv -1\) such that we have \(w_0\), (4) \(\begin{equation} \begin{split} w_0&= b_0d_0+b_{n-1}d_1x^n+ \cdots +b_1d_{n-1}x^n~\mathrm{mod}~f(x)\\ &=b_0d_0+(-b_{n-1})d_1+(-b_{n-2})d_2+ \cdots +(-b_1)d_{n-1}. \end{split} \end{equation}\)

Similarly, we have other \(w_i\) (\(1\le i\le n-1\)) of Equation (3) as (5) \(\begin{equation} \begin{split} w_1& =b_1d_0+b_0d_1+(-b_{n-1})d_2+\cdots +(-b_2)d_{n-1},\\ w_2& =b_2d_0+b_1d_1+b_{0}d_2+\cdots +(-b_3)d_{n-1},\\ \cdots ~&\cdots ~\cdots \\ w_{n-1}& =b_{n-1}d_0+b_{n-2}d_1+b_{n-3}d_2+\cdots +b_0d_{n-1},\\ \end{split} \end{equation}\) where it is observed that the coefficients of B are circularly shifted, with the sign of each wrapped-around coefficient inverted, while the coefficients of D keep the same positions in every \(w_i\) (\(0\le i\le n-1\)).
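The coefficient-wise form of Equations (4) and (5) can be cross-checked against a plain schoolbook multiplication with reduction by \(x^n+1\). The Python sketch below does this for a small hypothetical example (\(n=4\), \(q=256\)); the function names are illustrative.

```python
def negacyclic_mul(d, b, q):
    """Schoolbook product of D and B reduced modulo x^n + 1 and q."""
    n = len(d)
    w = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                w[i + j] = (w[i + j] + d[i] * b[j]) % q
            else:
                w[i + j - n] = (w[i + j - n] - d[i] * b[j]) % q
    return w

def w_coeff(i, d, b, q):
    """w_i as in Equations (4)-(5): +b_{i-j} d_j for j <= i, -b_{n+i-j} d_j for j > i."""
    n = len(d)
    total = 0
    for j in range(n):
        if j <= i:
            total += b[i - j] * d[j]
        else:
            total -= b[n + i - j] * d[j]
    return total % q

q = 256
d = [1, 2, 3, 4]        # integer polynomial D
b = [1, 0, 1, 1]        # binary polynomial B
w_formula = [w_coeff(i, d, b, q) for i in range(len(d))]
w_direct = negacyclic_mul(d, b, q)
print(w_formula, w_direct)
```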

Contribution II: Proposed Two-level Computation. As the proposed work aims to deliver ultra-compact accelerators for the RBLWE-based scheme, it is highly desirable that the computation of \(w_i\) (\(0\le i\le n-1\)) can be executed through the accumulation of n respective point-wise multiplications. However, since the coefficients of B are circularly shifted and with one coefficient’s sign inverted, we proceed to process the overall computation in two levels. First, define (6) \(\begin{equation} \begin{split} \overline{B}^{(0)}& =b_0+(-b_{n-1})x+(-b_{n-2})x^2+\cdots +(-b_1)x^{n-1},\\ \overline{B}^{(1)}& =b_1+b_0x+(-b_{n-1})x^2+\cdots +(-b_2)x^{n-1},\\ \cdots ~&\cdots ~\cdots \\ \overline{B}^{(n-1)}& =b_{n-1}+b_{n-2}x+b_{n-3}x^2+\cdots +b_0x^{n-1},\\ \end{split} \end{equation}\)

where we have (just list \(\overline{B}^{(0)}\)) (7) \(\begin{equation} \begin{split} \overline{B}^{(0)}_0=b_0,~ \overline{B}^{(0)}_1& =-b_{n-1},~ ~\cdots ~ \overline{B}^{(0)}_{n-1}=-b_{1},\\ \end{split} \end{equation}\)

which can be similarly applied to \(\overline{B}^{(1)}\), \(\overline{B}^{(2)}\), \(\ldots\), \(\overline{B}^{(n-1)}\). We also have (8) \(\begin{equation} \begin{split} \overline{B}^{(n-1)}_{j+1}=\overline{B}^{(n-2)}_j~(0\le j\le n-2),~~~ \overline{B}^{(n-1)}_{0}=-\overline{B}^{(n-2)}_{n-1}, \end{split} \end{equation}\)

which can be extended to any coefficients of \(\overline{B}^{(i)}\) (\(0\le i\le n-2\)) as (9) \(\begin{equation} \begin{split} \overline{B}^{(i+1)}_{j+1}=\overline{B}^{(i)}_j~(0\le j\le n-2),~~~ \overline{B}^{(i+1)}_{0}=-\overline{B}^{(i)}_{n-1}. \end{split} \end{equation}\)
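The recurrence of Equation (9) can be checked numerically. The sketch below builds each \(\overline{B}^{(i)}\) directly from the definition in Equation (6) and compares it with the result of a one-position shift in which the wrapped coefficient is negated; the list-based representation and the function names are assumptions for illustration.

```python
def b_bar(i, b):
    """Coefficients of B-bar^(i) written out directly, as in Equation (6)."""
    n = len(b)
    return [b[i - j] if j <= i else -b[n + i - j] for j in range(n)]

def step_up(prev):
    """Equation (9): B-bar^(i+1) from B-bar^(i) (shift by one, negate the wrapped value)."""
    return [-prev[-1]] + prev[:-1]

def step_down(cur):
    """The same rule read backwards: B-bar^(i) from B-bar^(i+1)."""
    return cur[1:] + [-cur[0]]

b = [1, 0, 1, 1, 0]
n = len(b)
up_ok = all(step_up(b_bar(i, b)) == b_bar(i + 1, b) for i in range(n - 1))
down_ok = all(step_down(b_bar(i + 1, b)) == b_bar(i, b) for i in range(n - 1))
print(up_ok, down_ok, b_bar(n - 1, b))
```

The last printed operand is \(\overline{B}^{(n-1)}\), which is simply B in reversed order with no negative signs.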

Equations (4) and (5) can be rewritten into another form as (10) \(\begin{equation} w_i=\sum _{j=0}^{n-1}\overline{B}^{(i)}_jd_j, \end{equation}\) where two levels of computation are involved: Level I, where for each specific \(w_i\), there are n number of accumulations of \(\overline{B}^{(i)}_jd_j\), and Level II, where there is a need to switch from \(\overline{B}^{(i+1)}\) to \(\overline{B}^{(i)}\). We start from \(\overline{B}^{(n-1)}\) as all its coefficients have positive signs.

Finally, we can substitute Equation (10) into Equation (1) to obtain (11) \(\begin{equation} \begin{split} G=\sum _{i=0}^{n-1}g_ix^i =\sum _{i=0}^{n-1}\sum _{j=0}^{n-1}\overline{B}^{(i)}_jd_jx^i+\sum _{i=0}^{n-1}u_ix^i+\sum _{i=0}^{n-1}v_ix^i =\sum _{i=0}^{n-1}\left(\sum _{j=0}^{n-1}\overline{B}^{(i)}_jd_j+u_i+v_i\right)x^i.\\ \end{split} \end{equation}\)

Observing Equations (1) to (11), it is evident that the polynomial multiplication now reduces to accumulating the products of the sign-adjusted, circularly shifted coefficients of the binary polynomial B with the matching coefficients of polynomial D, so that the modular operation acts only on the binary polynomial, which reduces the resource usage (see Section 4). Algorithm 1 captures the proposed two-level computation strategy.
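A software sketch of the two-level computation, written from Equations (9) to (11) rather than from the listing of Algorithm 1 itself, might look as follows. It starts from \(\overline{B}^{(n-1)}\), accumulates n products for each output coefficient, and then switches to the next operand; the function name and the example values are hypothetical.

```python
def two_level_compute(d, b, u, v, q):
    """Compute G = D*B mod (x^n + 1) + U + V one coefficient at a time."""
    n = len(d)
    b_bar = b[::-1]                        # B-bar^(n-1): all signs are positive
    g = [0] * n
    for i in range(n - 1, -1, -1):
        acc = 0
        for j in range(n):                 # Level I: n serial multiply-accumulates
            acc += b_bar[j] * d[j]
        g[i] = (acc + u[i] + v[i]) % q     # final additions of Equation (11)
        b_bar = b_bar[1:] + [-b_bar[0]]    # Level II: switch to B-bar^(i-1)
    return g

q = 256
d = [1, 2, 3, 4]
b = [1, 0, 1, 1]
u = [10, 20, 30, 40]
v = [1, 0, 1, 0]
g = two_level_compute(d, b, u, v, q)
print(g)
```

Only one operand \(\overline{B}^{(i)}\) is held at a time, which mirrors the motivation for a small register footprint in the hardware.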

Note that there is a need for a decoder function attached to the final output, realized by the XOR gate connecting with the two most significant bits (MSBs) of \(g_i\), for the operation in the decryption phase. Meanwhile, a constant error also needs to be added with the corresponding coefficient of U before the final decoder operation [10, 52], which will be presented in the accelerator design process.
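The XOR-based decoder can be sketched in a few lines of Python, assuming \(\mathrm{log}_2q=8\) and treating each coefficient as an 8-bit two's-complement pattern. The function name `xor_decode` is illustrative. In this model the output is 1 exactly when the unsigned bit pattern lies in \([q/4, 3q/4)\), so the lower boundary value \(q/4\) is included, a small difference from the open interval quoted in Section 2.

```python
def xor_decode(g, bits):
    """Decode one coefficient from the XOR of its two most significant bits."""
    g &= (1 << bits) - 1                  # bit pattern of g in `bits` bits
    msb = (g >> (bits - 1)) & 1
    next_msb = (g >> (bits - 2)) & 1
    return msb ^ next_msb

bits = 8
q = 1 << bits
decoded = [xor_decode(g, bits) for g in range(q)]
reference = [1 if q // 4 <= g < 3 * q // 4 else 0 for g in range(q)]
print(decoded == reference)
```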

4 PROPOSED HARDWARE ACCELERATOR

Following Algorithm 1, we present the proposed ultra-compact hardware accelerator for the RBLWE-based scheme, as shown in Figure 2. The proposed structure consists of four major components, namely (1) the input processing component, (2) the computation component, (3) the final output component, and (4) the control unit (CU).

Fig. 2. The proposed ultra-compact hardware accelerator for the RBLWE-based scheme (basic version). CSR: circular shift-register; CU: control unit; AC: accumulation cell.

The Input Processing Component. As shown in Figure 2, the input processing component includes all the circular shift-registers (CSRs) used for delivering correct input data into the computation and other components for further processing, which can be seen in Figure 3.

Fig. 3. The details of the proposed accelerator, where the values in the registers are initially loaded. SI: sign inverter.

As shown in Figure 3, CSR-I and CSR-II are responsible for producing the input \(\overline{B}^{(i)}_j\), while CSR-III is in charge of delivering the matched \(d_j\) based on Line 6 of Algorithm 1. CSR-I consists of n number of 2-bit registers (1 bit for sign and 1 bit for values of B), one sign inverter (SI) cell, and one 2-bit MUX. The coefficients of the input B are serially loaded into the n registers through the MUX and the signs for all coefficients are set as all “0,” which takes in total n clock cycles. After that, the 2-to-1 MUX switches to the other channel so that all the values stored in the registers of CSR-I can be circularly shifted. The SI cell functions to obtain the inverted value of the input under two’s complement representation, as shown in Figure 4.

Fig. 4. The SI cell (HD: 1-bit half adder).
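A bit-level model of the sign inversion is sketched below. It assumes the usual invert-and-add-one negation built from two half adders on a 2-bit (sign, value) pair; the exact wiring of Figure 4 is not reproduced here, so this is only one plausible realization.

```python
def half_adder(a, b):
    """1-bit half adder: returns (sum, carry)."""
    return a ^ b, a & b

def si_cell(sign, value):
    """Negate a 2-bit two's-complement number (sign, value): invert both bits, add 1."""
    inv_sign, inv_value = sign ^ 1, value ^ 1
    new_value, carry = half_adder(inv_value, 1)
    new_sign, _ = half_adder(inv_sign, carry)   # carry out of the top bit is dropped
    return new_sign, new_value

# (0, 0) encodes 0, (0, 1) encodes +1, (1, 1) encodes -1
print(si_cell(0, 0), si_cell(0, 1), si_cell(1, 1))
```

Zero maps to itself, so a coefficient \(b_j=0\) keeps the pattern (0, 0) no matter how often it passes through the cell.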

Contribution II: Two-level CSR-based Operations. According to Lines 4–8 of Algorithm 1, all the coefficients of a specific \(\overline{B}^{(i)}\) need to be multiplied with the matched \(d_j\), respectively, to generate one \(w_i\). After that, the computation switches from \(\overline{B}^{(i)}\) to \(\overline{B}^{(i-1)}\). Following the proposed two-level computation strategy, we have proposed a new implementation technique, i.e., the Two-level CSR-based Operations:

Level I: Once all the registers are loaded with the initial values according to Figure 3, i.e., all the values of the registers form operand \(\overline{B}^{(n-1)}\), then in the following cycles, the values stored in the registers will become \(\overline{B}^{(i)}\) (from \(i=n-2\) to 0) through the function of the SI cell, following Equation (7). This level’s CSR operation corresponds with Line 4 of Algorithm 1.

Level II: All the registers’ outputs of CSR-I are connected to CSR-II to initialize the values stored in the related registers. In this case, we need two clock signals, one for CSR-I (clk-1) and one for CSR-II (clk), as indicated by the colored clocks in Figure 3. The clk-1 signal activates the registers in CSR-I once every n clock cycles, i.e., it forms the operand \(\overline{B}^{(i)}\) every n cycles, for \(i=n-1\) to 0. The other clock (clk), which is also the general clock signal in the structure, activates the registers of CSR-II once per clock cycle. Thus, CSR-II delivers the desired output to the following component; i.e., all the n coefficients of a certain \(\overline{B}^{(i)}\) are delivered serially, one per cycle (Table 2). This level’s CSR operation mainly corresponds to Line 6 of Algorithm 1.

Table 2. Coefficients of \(\overline{B}^{(i)}_j\) Related to the Two-level CSRs

Level I (\(\overline{B}^{(0)}\))	\(\overline{B}^{(0)}_0\)	\(\overline{B}^{(0)}_1\)	\(\cdots\)	\(\overline{B}^{(0)}_{n-2}\)	\(\overline{B}^{(0)}_{n-1}\)
Level II (serially)	\(b_0\)	\(-b_{n-1}\)	\(\cdots\)	\(-b_{2}\)	\(-b_{1}\)
Level I (\(\overline{B}^{(1)}\))	\(\overline{B}^{(1)}_0\)	\(\overline{B}^{(1)}_1\)	\(\cdots\)	\(\overline{B}^{(1)}_{n-2}\)	\(\overline{B}^{(1)}_{n-1}\)
Level II (serially)	\(b_{1}\)	\(b_0\)	\(\cdots\)	\(-b_{3}\)	\(-b_{2}\)
\(\cdots\)	\(\cdots\)	\(\cdots\)	\(\cdots\)	\(\cdots\)	\(\cdots\)
Level I (\(\overline{B}^{(n-2)}\))	\(\overline{B}^{(n-2)}_0\)	\(\overline{B}^{(n-2)}_1\)	\(\cdots\)	\(\overline{B}^{(n-2)}_{n-2}\)	\(\overline{B}^{(n-2)}_{n-1}\)
Level II (serially)	\(b_{n-2}\)	\(b_{n-3}\)	\(\cdots\)	\(b_{0}\)	\(-b_{n-1}\)
Level I (\(\overline{B}^{(n-1)}\))	\(\overline{B}^{(n-1)}_0\)	\(\overline{B}^{(n-1)}_1\)	\(\cdots\)	\(\overline{B}^{(n-1)}_{n-2}\)	\(\overline{B}^{(n-1)}_{n-1}\)
Level II (serially)	\(b_{n-1}\)	\(b_{n-2}\)	\(\cdots\)	\(b_{1}\)	\(b_{0}\)

The Level I operand is switched every \(n\) clock cycles, from \(i=n-1\) to 0, while the Level II coefficients are serially produced once per cycle, from \(j=0\) to \(n-1\).

The actual details of the proposed two-level CSRs are shown in Figure 3. CSR-II consists of n 2-bit registers connected in a circular-shift arrangement such that only one register’s output is connected to the following AND cell and AC component. Meanwhile, all the n outputs of the registers in CSR-I are connected to CSR-II to initialize the values stored in the related registers. Once all the initial values are stored in the registers of CSR-I, the outputs of these registers (\(\overline{B}^{(n-1)}\)) are used to initialize the corresponding registers in CSR-II. The registers in CSR-I are then deactivated for the next n clock cycles, during which each coefficient of the operand \(\overline{B}^{(n-1)}\) (held in CSR-II) is serially delivered to the AND cell and AC component. After all the related coefficients are produced, CSR-I is activated again to form operand \(\overline{B}^{(n-2)}\), which is loaded into CSR-II for the next n clock cycles’ operation. The above operations are repeated until all the related coefficients are delivered to the following computation component.

The internal structure of CSR-II is shown in Figure 5, which is also indicated by a green dotted box in Figures 2 and 3. Each register of CSR-II is attached with a 2-to-1 MUX such that (1) when the control signal (ctr-2) switches the signal flow to the upper channel of the MUX, the values from CSR-I are loaded into the corresponding registers, and (2) once the registers are initiated with correct values, the MUXes then work in the lower channel such that the values stored in the registers can be circularly produced in every cycle. In total, CSR-II is reloaded with the output from CSR-I n times.

Fig. 5. The internal structure of CSR-II, where the MUXes are used to initiate the values in the registers once per every n clock cycles through the control signal (ctr-2).
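The interplay of the two CSRs can be modeled at the cycle level with the short Python sketch below. It assumes that the register at position 0 of CSR-II drives the output and that shifting moves values toward position 0; both choices, along with the generator name `csr_stream`, are modeling assumptions rather than details taken from Figures 3 and 5.

```python
def csr_stream(b):
    """Yield (i, j, coefficient) in the order the two-level CSRs would deliver them."""
    n = len(b)
    csr1 = b[::-1]                        # CSR-I initially holds B-bar^(n-1)
    for i in range(n - 1, -1, -1):
        csr2 = list(csr1)                 # "load": copy CSR-I into CSR-II
        for j in range(n):                # one coefficient leaves CSR-II per cycle
            yield i, j, csr2[0]
            csr2 = csr2[1:] + csr2[:1]    # plain circular shift inside CSR-II
        csr1 = csr1[1:] + [-csr1[0]]      # clk-1 tick: shift CSR-I through the SI cell

rows = {}
for i, j, coeff in csr_stream([1, 0, 1, 1]):
    rows.setdefault(i, []).append(coeff)
print(rows)
```

For \(b=(1,0,1,1)\) the four emitted rows follow the pattern of Table 2: the last operand delivered, \(\overline{B}^{(0)}\), reads \(b_0,-b_3,-b_2,-b_1\).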

Besides that, CSR-III functions to produce the matched \(d_j\) to the same component according to Algorithm 1. Since the coefficients of D are delivered in a serial format in every cycle (for n cycles) and then the same process is repeated in the next n cycles, we can use a relatively simple CSR to perform the related functions. An extra MUX is used to load all the coefficients of D into the corresponding registers in the first n cycles in a serial format. After that, the MUX (controlled by ctr-3) lets the values in the registers be circularly shifted and delivered to the AND cell and AC component, i.e., starting from \(d_0\) until \(d_{n-1}\), and then repeats the same process. In total, n rounds of operations are needed, and each round takes n clock cycles. Note that there is a one-clock-cycle pause between two rounds, realized by deactivating the registers in CSR-III, as CSR-II needs to be reloaded every n clock cycles.

CSR-IV and CSR-V, for inputs U and V, respectively, have the same internal structures as that of CSR-III except that the contained registers are activated to deliver the output to the final adder (see Figure 2) once per every n cycles. Note that there is a need to add a constant error for the decryption operation, which can be realized through a counter delivering the error once per cycle and added with a corresponding coefficient of U to be loaded into CSR-IV first, as shown in Figure 2.

The Computation Component. As shown in Figure 3, the AND cell consists of \(\mathrm{log}_2q\) parallel AND gates (see Figure 6). The output of the AND cell is connected to the AC cell, which consists of an adder along with a register (connected in a loop). Note that the AC cell produces one desired output every n clock cycles; here we connect the output of the adder, rather than the output of the register, directly to the following final adder and thus avoid the extra clock cycle needed to store the result in the register. A control signal (clr-ac) is used to clear the content of the register after n clock cycles of accumulation.

Fig. 6. The internal structure of the AND cell.
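A behavioral sketch of the AND cell and the accumulation cell is given below. It assumes that the value bit of each 2-bit coefficient gates \(d_j\) through the AND gates and that the sign bit selects subtraction in the accumulator; the text does not spell out how the sign bit is consumed, so this split is an assumption.

```python
def and_cell(d_j, value_bit, bits):
    """log2(q) parallel AND gates: pass d_j when value_bit is 1, otherwise output 0."""
    gate_mask = ((1 << bits) - 1) if value_bit else 0   # value bit fanned out to each gate
    return d_j & gate_mask

def accumulate(d, signed_b, bits):
    """Accumulate n gated coefficients in a `bits`-wide register (wrap-around is mod q)."""
    mask = (1 << bits) - 1
    acc = 0                                   # register content after clr-ac
    for d_j, (sign, value) in zip(d, signed_b):
        term = and_cell(d_j & mask, value, bits)
        acc = (acc - term) & mask if sign else (acc + term) & mask
    return acc

d = [1, 2, 3, 4]
b_bar_0 = [(0, 1), (1, 1), (1, 1), (0, 0)]    # encodes b0, -b3, -b2, -b1 for b = (1, 0, 1, 1)
b_bar_3 = [(0, 1), (0, 1), (0, 0), (0, 1)]    # encodes b3, b2, b1, b0
w0 = accumulate(d, b_bar_0, 8)
w3 = accumulate(d, b_bar_3, 8)
print(w0, w3)
```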

The Final Output Component. The final output component only involves a final adder. An XOR gate is connected with the two MSBs of \(g_i\) to produce the decoded output according to the decryption function described in Section 2 [3]. Note that the output of CSR-V is set as zero if the structure is not working in the encryption phase.

The Control Unit (CU) Component. The CU component is in charge of producing all the necessary control signals related to the operation of the structure for the RBLWE-based encryption scheme, ranging from the loading and delivery of input signals and multiplication-and-accumulation to the production of the final output. We propose to use a finite state machine (FSM) to generate the desired control signals along with some extra logic circuits. We have used three consecutive states, namely “read” (n cycles), “load” (1 cycle per round), and “calculation” (n cycles per round), to execute and process the calculated results. The proposed FSM begins with the “read” state (all CSRs are already set to zero), and the internal counter counts from 0 to \(n-1\). Then, the CU moves into the “load” state, which lasts for only one cycle so that CSR-II can be initialized, while the rest of the CSRs are frozen. After that, the CU component enters the “calculation” state, in which it produces shifting signals to keep the related CSRs circularly shifting.

The Overall Operation of the Hardware Accelerator. Overall, the proposed ultra-compact accelerator produces the desired output after \((n^2+n)\) operational cycles, not counting those CSRs’ initial values’ loading time. Besides that, the proposed structure is suitable for working in different phases of the RBLWE-based PQC scheme and does not require external resources’ assistance for data computation and processing, especially on the input processing part, which is missing in the existing designs [17, 21, 45, 52].
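The cycle count can be reproduced with a simple trace of the three FSM states. The sketch below only counts cycles and does not model any datapath; the function name `fsm_trace` is hypothetical.

```python
def fsm_trace(n):
    """Return the FSM state for every clock cycle of one complete operation."""
    trace = ["read"] * n                      # serial loading of the inputs
    for _ in range(n):                        # one round per output coefficient
        trace.append("load")                  # CSR-II is reloaded from CSR-I
        trace.extend(["calculation"] * n)     # n multiply-accumulate cycles
    return trace

n = 4
trace = fsm_trace(n)
operational = len(trace) - trace.count("read")
print(operational, n * n + n)
```

Excluding the n "read" cycles, each of the n rounds contributes one "load" cycle and n "calculation" cycles, which gives the \((n^2+n)\) figure.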

5 EXTENSION: ALGORITHM-TO-ACCELERATOR

In this section, we present the extended version of the proposed ultra-compact hardware accelerator with parallel computation to provide flexible processing capability. Details of the related algorithmic operations are also provided.

Contribution III: Extended Version. First of all, Equation (9) can be extended to (where \(1\le k \le n-i-1\)) (12) \(\begin{equation} \begin{split} \overline{B}^{(i+k)}_{j+k}& =\overline{B}^{(i)}_j~(0\le j\le n-1-k),~~~~~~~ \overline{B}^{(i+k)}_{k}=-\overline{B}^{(i)}_{n-k}. \end{split} \end{equation}\)

For parallel computation, connecting the proposed strategy in Section 3, we can assume \(n=st\) (s and t are integers and s is relatively small) and rewrite Equation (10) into the form of (13) \(\begin{equation} \begin{split} W=\sum _{l=0}^{t-1}\sum _{i=0}^{s-1}w_{i+ls}x^{i+ls} =\sum _{l=0}^{t-1}\sum _{i=0}^{s-1}\sum _{j=0}^{n-1}\overline{B}^{(i+ls)}_jd_jx^{i+ls},\\ \end{split} \end{equation}\)

which can also be substituted into Equation (11) to have (14) \(\begin{equation} \begin{split} G= \sum _{l=0}^{t-1}\sum _{i=0}^{s-1}\left(\sum _{j=0}^{n-1}\overline{B}^{(i+ls)}_jd_j+u_{i+ls}+v_{i+ls}\right)x^{i+ls},\\ \end{split} \end{equation}\)

where we can see that the original computation of G has become the accumulations of t groups of sub-polynomial as (15) \(\begin{equation} \begin{split} G=\sum _{l=0}^{t-1}G_l,\\ \end{split} \end{equation}\)

where each \(G_l\) is (16) \(\begin{equation} \begin{split} G_l=\sum _{i=0}^{s-1}\left(\sum _{j=0}^{n-1}\overline{B}^{(i+ls)}_jd_j+u_{i+ls}+v_{i+ls}\right)x^{i+ls},\\ \end{split} \end{equation}\)

where Group \(G_0\) involves \(\lbrace \overline{B}^{(0)},\ldots ,~\overline{B}^{(s-1)}\rbrace\), Group \(G_1\) involves \(\lbrace \overline{B}^{(s)},\ldots ,~\overline{B}^{(2s-1)}\rbrace , \ldots ,\) and Group \(G_{t-1}\) involves \(\lbrace \overline{B}^{(n-s)},\ldots ,~\overline{B}^{(n-1)}\rbrace\).

Besides that, following the rule presented in Equations (9) and (12), one can obtain all the coefficients of \(\overline{B}^{(i+sl)}\) from \(\overline{B}^{(i)}\) directly through (17) \(\begin{equation} \begin{split} \overline{B}^{(i+sl)}_{j+sl}& =\overline{B}^{(i)}_j~(0\le j\le n-1-sl),\\ \overline{B}^{(i+sl)}_{sl}& =-\overline{B}^{(i)}_{n-sl}, \end{split} \end{equation}\)

where \(0\le l \le t-1\) and \(0\le i \le s-1\). In this case, we can easily obtain all the coefficients of \(\overline{B}^{(s-1)}\), \(\overline{B}^{(2s-1)}, \ldots , \overline{B}^{(n-s-1)}\) from \(\overline{B}^{(n-1)}\) directly through the following: (1) setting \(i=s-1\) and choosing \(l=t-1\) for Equation (17), we can obtain \(\overline{B}^{(s-1)}\) from \(\overline{B}^{(n-1)}\) directly; (2) meanwhile, we can set l from 1 to \(t-2\), and the other operands such as \(\overline{B}^{(2s-1)}, \ldots , \overline{B}^{(n-s-1)}\) can also be obtained from \(\overline{B}^{(s-1)}\) (\(\overline{B}^{(n-1)}\)). Under this operation, the last operand within each group of \(G_l\) (\(0\le l \le t-2\)) is obtained directly from \(\overline{B}^{(n-1)}\), as shown in Equation (6).

For the other \(\overline{B}^{(i+ls)}\) within each group of \(G_l\), following Equation (16), one can then use Equation (9) to obtain the related operands. For example, for Group \(G_{t-1}\), it is easy to obtain \(\overline{B}^{(n-2)}\) from \(\overline{B}^{(n-1)}\) through Equation (9), which can be extended to obtain other operands within the same group.

Following the above procedure, these t number of \(G_l\) can be processed in parallel to shorten the overall computation latency. Each \(G_l\) (\(0\le l\le t-1\)) still follows the two-level computation strategy of Algorithm 1. We thus have:

Note that in actual execution, all the involved \(\overline{W_{ls}}=\overline{W_{ls}}+\overline{Z_{ls}}_{,j}d_j\) (Line 8) are processed in parallel for \(l=0\) to \(t-1\). Thus, there are t number of outputs available and it takes in total \(n^2/t\) computational cycles to execute all the operations of Algorithm 2.
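The grouped computation of Equations (13) to (16) can be checked numerically with the following Python sketch. It is not a transcription of Algorithm 2. The function names are hypothetical, b is taken as binary and d as 8-bit purely for illustration, and the additive terms \(u\) and \(v\) are omitted. The sketch accumulates the t groups side by side, counts one cycle per shared multiply-and-accumulate step, and compares the result with a schoolbook multiplication modulo \(x^n+1\).

```python
import random


def b_bar_coeff(b, i, j):
    """Coefficient j of operand B^(i): b_{i-j} if j <= i, else -b_{n+i-j}."""
    n = len(b)
    return b[i - j] if j <= i else -b[n + i - j]


def grouped_product(b, d, t, q=256):
    """Compute w_i = sum_j B^(i)_j * d_j (mod q) with t groups handled side by side."""
    n = len(b)
    assert n % t == 0
    s = n // t
    w = [0] * n
    cycles = 0
    for i in reversed(range(s)):      # rounds: last operand of each group first
        for j in range(n):            # one multiply-and-accumulate cycle
            for l in range(t):        # these t updates are concurrent in hardware
                idx = i + l * s
                w[idx] = (w[idx] + b_bar_coeff(b, idx, j) * d[j]) % q
            cycles += 1
    return w, cycles


def schoolbook_negacyclic(b, d, q=256):
    """Reference product b(x) * d(x) mod (x^n + 1, q)."""
    n = len(b)
    w = [0] * n
    for a in range(n):
        for c in range(n):
            k = a + c
            if k < n:
                w[k] = (w[k] + b[a] * d[c]) % q
            else:
                w[k - n] = (w[k - n] - b[a] * d[c]) % q
    return w


random.seed(1)
n, t = 16, 4
b = [random.randint(0, 1) for _ in range(n)]
d = [random.randrange(256) for _ in range(n)]
w, cycles = grouped_product(b, d, t)
assert w == schoolbook_negacyclic(b, d)
assert cycles == n * n // t
```

The cycle counter reaches \(n^2/t\) because each of the \(s=n/t\) rounds spends n cycles while all t groups advance together.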

Example. For a clear understanding of the proposed Algorithm 2, we can assume \(n=4\) and \(t=2\) to have (18) \(\begin{equation} \begin{split}\overline{B}^{(0)}& =b_0+(-b_{3})x+(-b_{2})x^2+(-b_1)x^{3},\\ \overline{B}^{(1)}& =b_1+b_0x+(-b_{3})x^2+(-b_2)x^{3},\\ \overline{B}^{(2)}& =b_2+b_1x+b_{0}x^2+(-b_3)x^{3},\\ \overline{B}^{(3)}& =b_{3}+b_{2}x+b_{1}x^2+b_0x^{3}, \end{split} \end{equation}\) where we have split them into two groups as \(G_0\) and \(G_1\): one for \(\overline{B}^{(0)}\) and \(\overline{B}^{(1)}\) (\(G_0\)) and the other for \(\overline{B}^{(2)}\) and \(\overline{B}^{(3)}\) (\(G_1\)). These two groups can be processed at the same time and only two rounds of four cycles of multiplication-and-accumulation are required: (1) obtain \(\overline{B}^{(1)}\) from \(\overline{B}^{(3)}\) through Equation (17); (2) serially process each coefficient of \(\overline{B}^{(1)}\) and \(\overline{B}^{(3)}\) to be multiplied with the corresponding \(d_j\) (\(0\le j\le 3\)) as well as the following addition in a parallel format (four cycles); (3) obtain \(\overline{B}^{(0)}\) and \(\overline{B}^{(2)}\) from \(\overline{B}^{(1)}\) and \(\overline{B}^{(3)}\), respectively; (4) following step (2) for \(\overline{B}^{(0)}\) and \(\overline{B}^{(2)}\); (5) obtain the final output. Thus, this whole procedure now requires only half the time when compared to the previous four rounds of operations, as specified in Algorithm 1.
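The example can be replayed symbolically in Python. In this sketch each coefficient is stored as a pair (sign, k) standing for \(\pm b_k\), and the helper names `b_bar` and `previous_operand` are hypothetical. It checks the operands of Equation (18), the step that derives one operand from the next higher one, and the two-round schedule for \(n=4\) and \(t=2\).

```python
def b_bar(i, n):
    """Operand B^(i) as a list of (sign, k) pairs; entry j stands for sign * b_k."""
    return [(1, i - j) if j <= i else (-1, n + i - j) for j in range(n)]


def previous_operand(op):
    """Derive B^(i-1) from B^(i): drop coefficient 0 and append it, negated, at the end."""
    sign, k = op[0]
    return op[1:] + [(-sign, k)]


n, t = 4, 2
s = n // t

# Equation (18), written out as signed indices of b.
assert b_bar(0, n) == [(1, 0), (-1, 3), (-1, 2), (-1, 1)]
assert b_bar(1, n) == [(1, 1), (1, 0), (-1, 3), (-1, 2)]
assert b_bar(2, n) == [(1, 2), (1, 1), (1, 0), (-1, 3)]
assert b_bar(3, n) == [(1, 3), (1, 2), (1, 1), (1, 0)]

# Step (3) of the example: B^(0) from B^(1) and B^(2) from B^(3).
assert previous_operand(b_bar(1, n)) == b_bar(0, n)
assert previous_operand(b_bar(3, n)) == b_bar(2, n)

# Round 1 uses the last operand of each group, round 2 the one before it.
schedule = [[i + l * s for l in range(t)] for i in reversed(range(s))]
total_cycles = len(schedule) * n  # two rounds of four cycles
```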

Corresponding Hardware Accelerator: Extended Version. The corresponding hardware accelerator for the RBLWE-based scheme is presented in Figure 7. For detailed information of the proposed structure, we have described their specific components as below.

Fig. 7. The proposed hardware accelerator for the RBLWE-based scheme (extended version).

The Input Processing Component. The internal structures of all the related units are shown in Figure 8, where we have used \(t=2\) and \(s=n/2\) as an example to illustrate the proposed design and it can be easily extended to other numbers of t. The internal structures of these CSRs are the same as those in Figure 2 except that CSR-II now has two outputs, similar to CSR-IV and CSR-V.

Fig. 8. The proposed hardware accelerator for \(s= n/2\) ( \(t=2\) ). The values in the CSRs are the initial loading.

Following Equation (16), we can have two groups of parallel processing, where Group \(G_0\) involves \(\lbrace \overline{B}^{(0)},\ldots ,\overline{B}^{(n/2-1)}\rbrace\), and Group \(G_1\) involves \(\lbrace \overline{B}^{(n/2)},\ldots ,\overline{B}^{(n-1)}\rbrace\). Then, based on Equation (6), we can have (19) \(\begin{equation} \begin{split} \overline{B}^{(n/2-1)}& =b_{n/2-1}+\cdots +(-b_{n-1})x^{n/2}+\cdots +(-b_{n/2})x^{n-1},\\ \overline{B}^{(n-1)}& =b_{n-1}+\cdots +b_{n/2-1}x^{n/2}+\cdots +b_0x^{n-1}, \end{split} \end{equation}\)

where we can see that the first coefficient, i.e., \(\overline{B}^{(n/2-1)}_0=b_{n/2-1}\), needs to be delivered out first along with the first coefficient of \(\overline{B}^{(n-1)}\) such that \(G_0\) and \(G_1\) can be processed in parallel. Thus, CSR-II produces two outputs: one for \(\lbrace b_{n-1},b_{n-2},\ldots ,b_{n/2}\rbrace\) and the other for \(\lbrace b_{n/2-1},b_{n/2-2},\ldots ,b_0\rbrace\), the same as CSR-IV and CSR-V. Note that there is no change on the constant error being added to input U.

Apart from the general signal delivery from CSR-II, extra attention must be paid to the negative signs attached to half of the coefficients of \(\overline{B}^{(n/2-1)}\). To obtain the desired sign inversion, we have added an inverter (with a MUX) on the second output of CSR-II. An extra control signal (ctr-s) is attached to the MUX so that the MUX delivers either the original value or the inverted value to the following computation component. The same control signal (ctr-s) is also connected to the carry-in of the adder in the following AC cell such that a negated value can be produced according to the two’s complement representation (ctr-s becomes “1” when an inverted value is needed).
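The inverter-plus-carry-in arrangement relies on the two’s complement identity \(-x = \overline{x}+1\) modulo \(2^{\mathrm{log}_2q}\). The Python sketch below illustrates this identity only. The function name `ac_add` is hypothetical, the value reaching the adder is treated as a \(\mathrm{log}_2q\)-bit word, and \(\mathrm{log}_2q=8\) is assumed.

```python
Q_BITS = 8                  # assumption: log2(q) = 8
MASK = (1 << Q_BITS) - 1


def ac_add(acc: int, operand: int, ctr_s: int) -> int:
    """Accumulate +operand (ctr_s = 0) or -operand (ctr_s = 1), modulo 2^Q_BITS."""
    # The MUX selects either the original bits or the inverted bits.
    muxed = operand ^ MASK if ctr_s else operand
    # ctr_s also serves as the adder carry-in, completing the two's complement.
    return (acc + muxed + ctr_s) & MASK


# Exhaustive check of both modes for 8-bit words.
for acc in range(1 << Q_BITS):
    for x in range(1 << Q_BITS):
        assert ac_add(acc, x, 0) == (acc + x) % (1 << Q_BITS)
        assert ac_add(acc, x, 1) == (acc - x) % (1 << Q_BITS)
```

Reusing one signal for both the MUX select and the carry-in means that no separate subtractor is needed.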

The Computation Component and the Final Output Component. Two sets of AND cell and AC and final output components, the same as those in Figure 2, are involved.

The CU Component. Although the basic structure of the extended accelerator remains the same as the one in Figure 2, the CU still needs to be updated when t is changed to 2. Besides that, the number of initiating times, i.e., loading values from CSR-I to CSR-II, now becomes \(n/2\), and the corresponding control signals also need to be updated. For the sign control signal related to the output of CSR-II (ctr-s), we have used \((t-1)\) comparators, where the ith fixed value only needs to be set as \((n/t)\times i\). For example, for Figure 8, i.e., \(t=2\), we can set the threshold to \((n/2-1)\) such that when the count number is greater than this number, the related control signal (ctr-s) is set as “1” (negative sign). Otherwise, ctr-s is set as “0” (positive sign), which can be easily observed from the distribution of signs in Table 2. A similar strategy can be extended to designs with other values of t.
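The comparator rule for \(t=2\) can be sketched in Python as follows. The sketch assumes that the count number indexes the coefficient position j of the operand \(\overline{B}^{(n/2-1)}\) being delivered, and it covers only that first operand of Group \(G_0\). The function names are hypothetical. The sign rule used for comparison (coefficient j of \(\overline{B}^{(i)}\) is negative exactly when \(j>i\)) is read off Equations (18) and (19).

```python
def ctr_s(count: int, n: int) -> int:
    """Sign control for t = 2: 1 (negative) once the counter exceeds n/2 - 1."""
    return 1 if count > n // 2 - 1 else 0


def is_negative(i: int, j: int) -> bool:
    """Coefficient j of B^(i) carries a minus sign exactly when j > i."""
    return j > i


n = 8
signs = [ctr_s(j, n) for j in range(n)]

# The comparator output matches the sign pattern of B^(n/2-1).
assert all(ctr_s(j, n) == int(is_negative(n // 2 - 1, j)) for j in range(n))
```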

The Overall Operation of the Hardware Accelerator (Extended Version). Overall, the proposed hardware accelerator produces the desired output after \((n^2/t+n/t)\) (\(t=2\) for the structure in Figure 8) cycles of computations, not counting those CSRs’ initial value loading time. Again, the proposed structure does not require any additional external resources for input signal processing. Besides that, though the proposed accelerator has slightly higher area usage than the structure in Figure 2, the computation latency is significantly reduced, which may potentially increase its efficiency in overall area-time complexities when compared with the former one. Following the design style of Figure 8, one can easily obtain extended accelerators with different t based on Algorithm 2.

Summary of the Proposed Hardware Accelerator Design Strategy. The design strategy of the proposed hardware accelerators for the RBLWE-based scheme, including the basic and the extended versions in Figures 2 and 7, strictly follows the Proposed Strategy described in Section 3: (1) the proposed two-level computation strategy is based on the binary polynomial involved modular operation to obtain low implementation complexity (see also the proposed Two-Level CSRs); (2) resource shared parallel computation for flexible processing (see the structure of Figure 7); and (3) the complete input processing setup that includes all major operations of the PQC scheme with efficient control.

6 COMPLEXITY ANALYSIS AND COMPARISON

This section focuses on the complexity analysis of the proposed hardware accelerators along with those of the existing similar style designs (e.g., [17, 21]). FPGA-based comparison is also provided to confirm the efficiency of the proposed accelerators, both basic and extended versions, over the state-of-the-art solutions.

Complexity Analysis. The proposed RBLWE-based PQC accelerator of Figure 2 contains one AND cell, two \(\mathrm{log}_2q\)-bit adders, one \(\mathrm{log}_2q\)-bit register (not counting the ones in the CSRs), five CSRs with different processing sizes, and one XOR gate. The latency time for encryption/decryption is \((2n^2+n)/(n^2+n)\) cycles (the input loading needs n cycles). A CU is required to generate the necessary control signals for the smooth operation of the hardware structure.

The proposed accelerator of Figure 7 has t number of identical computations and final output components, where each set has the same complexity as that of Figure 2. Therefore, the structure of Figure 7 has t AND cells, \(2t\) number of \(\mathrm{log}_2q\)-bit adders, t number of \(\mathrm{log}_2q\)-bit registers (not including the ones in the CSRs), five different sized CSRs, and t XOR gates. A CU is designed to generate all the control signals to coordinate all the signal processing within the accelerator. The latency cycles required for encryption/decryption are now reduced to only \((2n^2/t+n/t)/(n^2/t+n/t)\) cycles (the input loading time still needs n cycles).
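The closed-form latency expressions can be evaluated with a few lines of Python. The function name `latency_cycles` is hypothetical, and the sketch merely evaluates \((2n^2+n)/t\) and \((n^2+n)/t\) under the assumption that t divides n.

```python
def latency_cycles(n: int, t: int = 1):
    """Return (encryption, decryption) cycle counts from the closed-form expressions."""
    if n % t != 0:
        raise ValueError("t must divide n")
    enc = (2 * n * n + n) // t   # 2n^2/t + n/t
    dec = (n * n + n) // t       # n^2/t + n/t
    return enc, dec


for n in (256, 512):
    for t in (1, 2, 64):
        enc, dec = latency_cycles(n, t)
        print(f"n={n:3d} t={t:2d} enc={enc:7d} dec={dec:7d}")
```

For \(n=256\) and \(n=512\) with \(t=1\), 2, and 64, the encryption values produced this way agree with the encryption latency entries listed for the proposed designs in Table 4. The decryption entries in Table 4 are implementation-reported values and are not reproduced by this sketch.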

The overall area-time complexities of the proposed hardware structures are listed in Table 3 along with those of the newly reported compact designs of [17, 21, 33]. Reference [17] has shown the performance over the one in [3], so we just list the compact design of [17]. Note that the compact design in [45] does not include the input processing setup as the proposed design; therefore, we do not list it in Table 3. Besides that, the rest of the available designs are high-speed structures [25, 48, 49, 50], which are not ideal for resource-constrained applications. The fault detection structure of [43] uses the same high-speed structure of [17]. The design of [20] is based on a RISC-V core. Accordingly, we also do not include them in the comparison.

Table 3.

Design	AND Cell	XOR	Adder\(^1\)	CSR\(^2\)	\(n\)-to-1 MUX\(^{1}\)	Register\(^{1,3}\)	Latency (Enc./Dec.)\(^4\)	RER?\(^5\)
Figure 5 of [17]	1	1	1	\(^{*}\)	2	1	-/\(\simeq n(n+1)+1\)	Yes
Figure 9 of [21]	1	1	2	\(^{\#}\)	2	1	\(2n^2\)/\(n^2\)	Yes
Figure 2 of [33]	1	1	1	3	1	1	\((2n^2+2n)\)/\((n^2+n)\)	No
Pro. (Figure 2)	1	1	3	5	0	1	\((2n^2+n)/(n^2+n)\)	No
Pro. (Figure 7)	\(t\)	\(t\)	\(2t+1\)	5	0	\(t\)	\((2n^2/t+n/t)/(n^2/t+n/t)\)	No

The existing design of [17] only provides the structure for the decryption phase, and hence its complexities are only estimated at this specific phase.
\(^1\): \(\mathrm{log}_2q\)-bit. \(^2\): including different bit-size CSRs.
\(^*\): The existing structure of [17] does not take the actual input processing resource into consideration; only a regular shift-register (for binary polynomial) is used while the rest of the input bits are all fed to the structure in a parallel format. \(^3\): Refers to the registers in the main structure, not in those CSRs.
\(^4\): The latency listed here is the number of cycles required for the encryption/decryption phase of the RBLWE-based scheme (not including the initial input loading time).
\(^{\#}\): The CSR for the integer polynomial connected to the \(n\)-to-1 MUX (total \(n\times \mathrm{log}_2q\) bits) is not included in Figure 9 of [21].
\(^5\): RER: require extra resources. The existing low-complexity designs from [17, 21] need extra CSRs for input data processing.

Table 3. Theoretical Area-time Complexities for Various Compact Structures (RBLWE-based Scheme)

As shown in Table 3, the proposed hardware structures offer much more efficiency than the existing ones, which is underscored by the following properties: (1) The structure in [17] uses two n-to-1 \(\mathrm{log}_2q\)-bit MUXes to transfer the parallel inputs into serial format to achieve low-complexity operation, which significantly increases the overall area occupation of the whole structure. The newly reported structures in [21, 33] still require one n-to-1 \(\mathrm{log}_2q\)-bit MUX for compact implementation. This is in contrast to the proposed accelerators, which require only one 2-bit n-size CSR for ultra-compact implementation. (2) The proposed accelerator of Figure 7 can be flexibly adjusted on the computational latency according to the specific application requirements to achieve efficient and fast operations, while the existing designs only have fixed processing speed. Meanwhile, the proposed design of Figure 7 still maintains low area occupation since the increase of the area usage mainly comes from the computation and final output components, which are minor when compared with the five regular CSRs. (3) Lastly, the proposed hardware accelerators do not require extra resources’ assistance for input processing, while the ones of [17, 21] still need extra CSRs to load the input polynomials into parallel format (see Figure 5 of [17] and Figure 9 of [21]). Overall, the theoretical analysis and comparison demonstrate that the proposed hardware designs have significantly better area-time complexities than the existing ones.

Further FPGA-based Implementation. We have coded the proposed structures with VHDL (functions verified through ModelSim) to obtain their detailed area-time complexities. As some of the existing designs were implemented on the Intel devices [17, 21, 33] while some others were reported on the Xilinx FPGAs [21, 26, 45, 52, 54], we thus decided to follow the existing reports’ styles and implement the proposed accelerators on both Intel and Xilinx devices for a comprehensive comparison with the existing reports. Meanwhile, as some of the existing designs such as [21] have already shown their efficiency over the other ones (e.g., [3, 17]), we compare only with those recent designs like [21].

Experimental Setup. (1) Parameter setup: we have followed the same parameter settings as the existing reports of [3, 10, 17, 52], with \(n=256\) (and \(n=512\)) and \(q=256\) (\(\mathrm{log}_2q=8\)) selected for the RBLWE-based scheme. This provides equivalent 190/140 and 84/73 bits of classical and quantum security, respectively, according to the security analysis [3, 11, 17, 31], which fits the resource-constrained application requirements [3]. (2) Platform setup: we have used Intel Quartus Prime 17.0 to implement all the coded designs on Stratix-V (5SGXMA9N1F45C2), Arria-V GZ (5AGZME7K3F40I4), and Cyclone-V (5CSXFC6D6F31I7ES) devices, respectively (following [21]). (3) Processing style setup: we have chosen \(t=1\), \(t=2\), \(\ldots\), up to \(t=64\) for the proposed design, which still maintains low implementation complexity even when \(t=64\).

The obtained area-time complexities, in terms of the number of adaptive logic modules (ALMs) and delay (critical-path × latency) of the proposed designs with different parameters, are plotted in Figure 9.

Fig. 9. Area-time complexities of the proposed designs on the Intel FPGA devices. Red line ( \(n=256\) ); blue line ( \(n=512\) ).

One can see from Figure 9 that the Fmax of a given design mostly lies within a certain range, while the area usage increases slowly with t. For example, from the design of \(t=1\) to the case of \(t=16\), the area usage increases very slowly, while the throughput is around 12 times higher. This demonstrates the effectiveness of the proposed design: ultra-low implementation complexity and flexible processing.

Comparison Based on Intel Platform Implementations. For a more detailed demonstration of the efficiency of the proposed structures, we have used the FPGA-based implementation results for comparison. The area-time complexities of the proposed and existing compact designs of [17, 21, 33], in terms of the number of ALMs, registers (reg.), Fmax, latency cycles, delay time (latency cycle × critical path), area-delay product (ADP), and throughput (Thr.), with respect to different parameters, operational phases, and devices, are listed and calculated in Table 4. Note that we have followed the existing designs [17, 21, 33] to use the number of ALMs as area usage to calculate the ADP (designs of [17, 21, 33] did not report the register count).

Table 4.

Design	\(n\)	Phase	Device	ALMs	Reg.	Fmax	Latency	Delay	ADP	ADPR\(^1\)	Thr.
[17]	256	-/Dec.	Stratix-V	3,472	-	201.25	65,792	327	1,135,344	–350.34%	0.78
	256	-/Dec.	Arria-V	3,470	-	178.67	65,792	368	1,276,960	–180.76%	0.70
	256	-/Dec.	Cyclone-V	3,556	-	89.67	65,792	734	2,610,104	–201.48%	0.35
[21]	256	Enc./Dec.	Stratix-V	1,864	-	316.96	131,072/65,536	414/207	771,696/385,848	–53.05%	1.24
	256	Enc./Dec.	Arria-V	1,864	-	268.17	131,072/65,536	489/244	911,496/454,816	-	1.05
	256	Enc./Dec.	Cyclone-V	1,878	-	142.15	131,072/65,536	922/461	1,731,516/865,758	-	0.56
[33]	256	Enc./Dec.	Stratix-V	846	2,587	221.29	132,096/66,048	596/298	504,216/252,108	-	0.86
\(t=1\)	256	Enc./Dec.	Stratix-V	492	1,202	502.51	131,328/66,304	262/132	129,083/64,917	74.25%	1.94
	256	Enc./Dec.	Arria-V	492	1,199	378.07	131,328/66,304	347/175	170,903/86,284	81.03%	1.46
	256	Enc./Dec.	Cyclone-V	474	1,179	228.94	131,328/66,304	574/290	271,903/137,277	84.14%	0.88
\(t=2\)	256	Enc./Dec.	Stratix-V	502	1,184	447.23	65,664/33,408	147/75	73,706/37,499	85.13%	3.43
\(t=64\)	256	Enc./Dec.	Stratix-V	1,706	1,930	359.84	2,052/1,540	6/4	9,729/7,301	97.10%	59.82
[17]	512	-/Dec.	Stratix-V	6,901	-	171.32	262,656	1,533	10,579,233	–413.45%	0.33
	512	-/Dec.	Arria-V	6,900	-	155.76	262,656	1,686	11,633,400	–203.93%	0.30
	512	-/Dec.	Cyclone-V	7,066	-	75.52	262,656	3,478	24,575,548	–235.15%	0.15
[21]	512	Enc./Dec.	Stratix-V	3,551	-	296.65	262,144	884	3,139,084	–52.35%	0.58
	512	Enc./Dec.	Arria-V	3,554	-	243.49	262,144	1,077	3,827,658	-	0.48
	512	Enc./Dec.	Cyclone-V	3,614	-	129.22	262,144	2,029	7,332,806	-	0.25
[33]	512	Enc./Dec.	Stratix-V	1,596	5,033	203.87	526,336/263,168	2,582/1,291	4,120,872/2,060,436	-	0.40
\(t=1\)	512	Enc./Dec.	Stratix-V	880	2,252	458.09	524,800/263,680	1,146/576	1,008,151/506,535	75.42%	0.89
	512	Enc./Dec.	Arria-V	881	2,237	366.30	524,800/263,680	1,433/720	1,262,213/634,185	83.43%	0.71
	512	Enc./Dec.	Cyclone-V	862	2,222	201.57	524,800/263,680	2,604/1,308	2,244,270/1,127,609	84.62%	0.39
\(t=2\)	512	Enc./Dec.	Stratix-V	888	2,201	454.96	262,400/132,352	577/291	512,158/258,327	87.46%	1.76
\(t=64\)	512	Enc./Dec.	Stratix-V	2,105	3,260	370.23	8,200/5,128	22/14	46,622/29,156	98.58%	36.97

Pro.: proposed. Enc.: encryption; Dec.: decryption. Thr.: Throughput (Dec. phase, coefficient/microsecond).
Unit for Fmax: MHz. Unit for delay: \(\mu\)s. Delay = critical path \(\times\) latency in microseconds. ADP = #ALMs \(\times\) delay (Dec.). The results of [17, 21] are obtained from [21], where the authors have re-implemented [17] to report the result. The latency of the proposed designs includes the input loading time (while the existing designs [17, 21] only count the computation time as the latency).
The original designs of [17, 21, 33] did not report the register count. However, we obtained the register count of [33] from its open-access source code.
\(^1\): ADPR refers to ADP reduction. For calculation of ADPR (Dec. phase), the result of [33] on the Stratix-V device is used as the baseline for comparison of all related results on the same device, while the results of [21] on the Arria-V and Cyclone-V devices are used as the baseline for comparison of all the results obtained from the same devices.

Table 4. Comparison of the Complexities of Various Compact Accelerators/Cryptoprocessors (RBLWE-based Scheme)

As evident from Table 4, the proposed hardware accelerators significantly outperform the existing designs on nearly every aspect of the area-time complexities. The proposed compact design (\(t=1\), Figure 2) has at least 74.25% and 75.42% less ADP than the competing design of [33] for \(n=256\) and \(n=512\) (Stratix-V), respectively. The proposed extended version of Figure 8 (\(t=2\)) involves 85.13% and 87.46% less ADP than the design of [33], respectively, for the cases of \(n=256\) and \(n=512\). Besides that, as shown in Table 4, the efficiency of the proposed design increases rapidly with t; i.e., the proposed structure of Figure 7 (\(t=64\)) has 97.10% and 98.58% less ADP than the competing design of [33], respectively, for the cases of \(n=256\) and \(n=512\), which fully demonstrates the superior performance of the proposed structures.
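The derived columns of Table 4 follow from the definitions given in its notes, and can be recomputed with the Python sketch below. The function name `metrics` is hypothetical. The input figures are taken from Table 4 (proposed design with \(t=1\), \(n=256\), Stratix-V, decryption phase; baseline ADP of [33] on the same device).

```python
def metrics(alms, fmax_mhz, latency_cycles, n, baseline_adp=None):
    """Recompute delay (us), ADP, throughput (coeff/us), and ADP reduction."""
    delay_us = latency_cycles / fmax_mhz   # cycles / MHz gives microseconds
    adp = alms * delay_us                  # ADP = #ALMs x delay (Dec.)
    throughput = n / delay_us              # coefficients per microsecond
    adpr = None if baseline_adp is None else 1 - adp / baseline_adp
    return delay_us, adp, throughput, adpr


# Proposed t = 1, n = 256, Stratix-V, decryption phase (values from Table 4).
delay, adp, thr, adpr = metrics(
    alms=492, fmax_mhz=502.51, latency_cycles=66_304, n=256, baseline_adp=252_108
)
print(f"delay={delay:.0f} us, ADP={adp:.0f}, Thr={thr:.2f}, ADPR={adpr:.2%}")
```

With these inputs the recomputed values round to the entries reported in Table 4 for this row (delay of 132 \(\mu\)s, ADP of 64,917, throughput of 1.94, and ADPR of 74.25%).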

Meanwhile, though the existing designs do not report the register count, we still list the register count of the best existing design, which we obtained from its open-access source code [33]. It is clear that the proposed design has much smaller register usage than the best competing one.

It is important to point out that the performance of the existing designs does not account for the external resources’ assistance on input processing (CSRs). This part of resources actually takes a large amount of area. Meanwhile, we also want to mention that the existing designs directly attach the large-size polynomials (large I/O) to the structure so that the designed structure has to use the virtual pin-based FPGA implementation, e.g., [21]. This setup undoubtedly hinders these designs’ mapping efficiency on the FPGAs; see Table 4.

Finally, with a slight increase in area usage and a slight sacrifice in Fmax, the proposed accelerator of Figure 7 can cut the latency almost in half. This processing strategy is therefore worth considering for other deployments.

Xilinx Platform Implementation. We have also implemented the proposed designs on the Xilinx platform through Vivado 2020.2 on Virtex-7 (XC7V2000), Artix-7 (XC7Z020), Ultrascale+ (XCZU9EG), and Kintex-7 (XC7K325) devices. We have re-implemented the design of [21] on the Virtex-7 device for comparison (since it outperforms [3, 17]) with a setup similar to that of the proposed ones. We have also included a regular Ring-LWE cryptoprocessor with compact style [54], which outperforms others like [28, 32]. Besides that, we have listed the results of [26, 45, 52] in Table 5, though these three designs do not include the input/output resources in the implementation, which may lead to smaller area usage and faster frequency. Nevertheless, from Table 5, one can see again that the proposed designs have better performance than the existing ones. For instance, the proposed accelerator with \(t=32\) has at least 81.55% less ADP and 20.61× higher throughput than the compact design of [45] for \(n=256\), though the result of [45] does not include all the necessary resources. When compared with the compact regular Ring-LWE based design in [54] and the compact approximate Ring-LWE based structure of [26], the proposed accelerator again shows its superior performance. However, as the design of [45] has not included the input loading CSRs in the final implementation and the designs of [26, 54] still belong to the regular (or similar) Ring-LWE based structures, we believe the comparison with the most recent one in [21] is more appropriate than the others to showcase the efficiency of the proposed design strategy.

Table 5.

Design	\(n\)	Phase	Device	LUT	FF	Slice	Fmax	Latency\(^1\)	Delay	ADP\(^2\)	Thr.
[45]\(^*\)	256	Enc./Dec.	Virtex-7	380	640	165	434.32	133k/66k	307.94/153.67	25,356	1.67
[21]\(^3\)	256	Enc./Dec.	Virtex-7	693	2,880	654	303.03	131,072/65,536	432.54/216.27	141,439	1.18
[52]\(^*\)	256	Enc./Dec.	Virtex-7	213	336	71	510	-/66,304	-/130.01	9,231	1.97
Pro. \(t=1\)	256	Enc./Dec.	Virtex-7	756	1,603	339	435.31	131,328/66,304	301.69/152.31	51,635	1.68
		Enc./Dec.	Artix-7	741	1,603	337	325.00	131,328/66,304	404.09/204.01	68,752	1.25
		Enc./Dec.	Ultrascale+	739	1,603	186	250.00	131,328/66,304	525.31/265.22	49,330	0.97
Pro. \(t=32\)	256	Enc./Dec.	Virtex-7	2,080	2,162	629	345.26	4,104/2,568	11.89/7.44	4,678	34.42
		Enc./Dec.	Artix-7	1,787	2,158	638	258.75	4,104/2,568	15.86/9.92	6,332	25.79
		Enc./Dec.	Ultrascale+	1,705	2,157	335	250.00	4,104/2,568	16.42/10.27	3,441	24.92
[21]\(^3\)	512	Enc./Dec.	Virtex-7	1,333	5,717	1,235	277.78	525,312/262,656	1891.12/945.56	1,167,769	0.28
[52]\(^*\)	512	Enc./Dec.	Virtex-7	360	615	121	443	-/263.7k	-/595.26	72,026	0.86
Pro. \(t=1\)	512	Enc./Dec.	Virtex-7	1,411	3,162	638	416.36	524,800/263,680	1260.45/633.30	404,044	0.81
		Enc./Dec.	Artix-7	1,396	3,160	635	278.07	524,800/263,680	1887.29/948.25	602,139	0.54
		Enc./Dec.	Ultrascale+	1,389	3,144	372	250.00	524,800/263,680	2099.20/1054.72	392,356	0.49
Pro. \(t=32\)	512	Enc./Dec.	Virtex-7	2,708	3,346	862	345.26	16,400/9,232	47.50/26.74	23,049	19.15
		Enc./Dec.	Artix-7	1,794	3,443	839	222.86	16,400/9,232	73.59/41.43	34,756	12.36
		Enc./Dec.	Ultrascale+	2,088	3,445	445	250.00	16,400/9,232	65.60/36.93	16,433	13.86
[54]\(^4\)	256	Enc./Dec.	Kintex-7	1,381	1,179	479	275	35,478/17,732	129.01/64.48	30,656	3.97
[26]\(^{*,}\)\(^5\)	256	Enc.	Kintex-7	643	206	206	320	69k	215.63	44,420\(^\#\)	1.19\(^\#\)
[26]\(^{*,}\)\(^5\)	256	Dec.	Kintex-7	546	160	160	320	34k	106.25	17,000	2.41
Pro. \(t=32\)	256	Enc./Dec.	Kintex-7	1,793	2,157	659	222.03	4,104/2,568	18.48/11.57	7,622	22.13
Pro. \(t=32\)	512	Enc./Dec.	Kintex-7	2,158	3,445	893	222.03	16,400/9,232	73.86/41.58	37,131	12.31

Unit for Fmax: MHz. Unit for delay: \(\mu\)s. Delay = critical path \(\times\) latency in microseconds. Thr: Throughput (Dec., coefficient/second\(\times 10^6\)).
\(^*\): The designs of [26, 45, 52] do not include the input/output resources in the final implementations, as indicated by the small number of FFs. This can also be observed from Figure 3 of [45], Figure 3 of [26], and Table I of [52], respectively. Meanwhile, [26, 45] only provided the results for \(n=256\).
\(^\#\): Based on the data for encryption structure.
\(^1\): Those CSRs’ loading time is included in the proposed designs (but not for the existing designs).
\(^2\): ADP = #Slice\(\times\) delay (Dec.).
\(^3\): We have added a CSR for the input polynomial so that it can be implemented on the Virtex-7 device (rather than virtual pin based), but the implementation results listed here do not include that CSR’s resource usage.
\(^4\): The design of [54] also needs 2 DSP and 2 8k BRAM, which are not listed here (the actual ADP is larger than the reported result here).
\(^5\): The listed results are based on the approximate Ring-LWE of \(L=3\) in [26], which has a similar security level as the RBLWE-based scheme.

Table 5. Comparison of the Complexities of Various Low-complexity Accelerators/Cryptoprocessors on Xilinx Platform

Indirect Comparison with Other RBLWE-based PQC Designs. The recently released high-performance RBLWE-based PQC structure of [49] has been shown to be more efficient than the other high-speed designs of [3, 17, 50]. These high-performance designs suit resource-abundant applications such as servers but may not fit resource-constrained lightweight applications. Hence, we do not directly include them in the comparison list. Note also that the design of [49] only includes the major arithmetic hardware; i.e., all the CSRs needed for the input polynomials are missing from the structure (see Figure 1 of [49]). From this perspective, the proposed designs offer a more complete and balanced setup than the high-speed designs [3, 17, 49, 50].

Comparison with NewHope and Kyber Implementations. We first emphasize that the target application environment of the RBLWE-based scheme differs from that of NewHope [53] and Kyber [51]. Because the RBLWE-based encryption scheme is designed for lightweight or ultra-lightweight applications, where resources are limited or very limited, we prioritize area usage. From this perspective, Table 6 shows the distinct advantage of the proposed accelerators over the existing designs [51, 53]: they use no BRAM or DSP and occupy a very small number of slices (e.g., the cases of \(t=4\) and \(t=8\)). Moreover, even a faster configuration of the proposed accelerator, e.g., \(t=64\), still has smaller area complexity than the existing ones (counting all the resource usage together). Overall, we can conclude that the proposed RBLWE-based accelerators are more suitable for lightweight/ultra-lightweight applications than NewHope [53] and Kyber [51].

Table 6. Comparison with NewHope and Kyber Implementations

Design	\(n\)	Phase	Device	LUT	FF	Slice	Fmax	DSP	BRAM	Latency\(^1\)	Delay
NewHope [53]	512	Enc./Dec.	Artix-7	6,780	4,026	-	200	2	7	6.6k/2.5k	33/12.5
Kyber [51]\(^2\) server	512\(^1\)	Enc./Dec.	Artix-7	7,412	4,644	2,126	161	3	2	5,079/-	30.5/-
Kyber [51]\(^2\) client	512\(^1\)	Enc./Dec.	Artix-7	6,785	3,981	-	167	3	2	-/6,668	-/41.3
Pro. \(t=4\)	512	Enc./Dec.	Artix-7	1,466	2,827	592	294.64	0	0	131,200/66,688	445.29/226.34
Pro. \(t=8\)	512	Enc./Dec.	Artix-7	1,574	2,867	632	255.99	0	0	65,600/33,856	256.26/132.26
Pro. \(t=32\)	512	Enc./Dec.	Artix-7	1,794	3,443	839	222.86	0	0	16,400/9,232	73.59/41.43
Pro. \(t=64\)	512	Enc./Dec.	Artix-7	3,399	4,272	1,132	236.67	0	0	8,200/5,128	34.65/21.67

Unit for Fmax: MHz. Unit for delay: \(\mu\)s. Delay = critical path \(\times\) latency in microseconds.
\(^1\): Latency refers to encryption/decryption (encapsulation/decapsulation) cycles, respectively.
\(^2\): The design of [51] used server and client to execute encapsulation and decapsulation, respectively. We just listed the performance of security rank \(k=2\), which is an equivalent size of \(n=512\).
\(^*\): In comparing with NewHope and Kyber, we aimed at emphasizing the difference between the RBLWE-based scheme’s targeting application environment and the former two. Based on this consideration, we can conclude that the proposed RBLWE-based accelerators are more suitable for lightweight/ultra-lightweight applications than NewHope and Kyber.
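As a rough check on how the Delay column relates to the other columns, the following Python sketch recomputes delay as latency cycles divided by Fmax (cycles over MHz gives microseconds) for the proposed rows of Table 6, and forms an area-delay product using the Table 5 note's definition (ADP = #Slice \(\times\) decryption delay). The names `TABLE6_PROPOSED`, `delay_us`, `adp`, and `recompute` are illustrative assumptions and are not part of the original work; the numbers are copied from Table 6.

```python
# Proposed-design rows copied from Table 6 (n = 512, Artix-7).
# Key: t. Value: (slices, fmax_mhz, enc_cycles, dec_cycles,
#                 reported_enc_delay_us, reported_dec_delay_us)
TABLE6_PROPOSED = {
    4: (592, 294.64, 131_200, 66_688, 445.29, 226.34),
    8: (632, 255.99, 65_600, 33_856, 256.26, 132.26),
    32: (839, 222.86, 16_400, 9_232, 73.59, 41.43),
    64: (1132, 236.67, 8_200, 5_128, 34.65, 21.67),
}


def delay_us(cycles: int, fmax_mhz: float) -> float:
    """Delay = critical path x latency; cycles / Fmax(MHz) is in microseconds."""
    return cycles / fmax_mhz


def adp(slices: int, dec_delay_us: float) -> float:
    """ADP = #Slice x delay (Dec.), following the Table 5 note."""
    return slices * dec_delay_us


def recompute() -> dict:
    """Return {t: (enc_delay_us, dec_delay_us, adp)} recomputed from raw columns."""
    out = {}
    for t, (slices, fmax, enc, dec, _, _) in TABLE6_PROPOSED.items():
        enc_d = delay_us(enc, fmax)
        dec_d = delay_us(dec, fmax)
        out[t] = (enc_d, dec_d, adp(slices, dec_d))
    return out


if __name__ == "__main__":
    for t, (enc_d, dec_d, area_delay) in recompute().items():
        print(f"t={t}: enc={enc_d:.2f} us, dec={dec_d:.2f} us, ADP={area_delay:.0f}")
```

Under these assumptions, the recomputed delays for the proposed designs agree with the reported Delay column to within rounding. The Kyber and NewHope rows are left out because their reported values come from the cited works.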

Discussions. Overall, the proposed compact accelerators for the RBLWE-based scheme perform well on the FPGA platform, offering very low area occupation, flexible timing, a complete input processing setup, and small ADP. The obtained results show that the proposed structure of Figure 2 is suitable for resource-constrained nodes/devices owing to its extremely low resource occupation and excellent delay. The proposed design of Figure 7 offers high flexibility for deployment in various small resource-limited devices, as one can choose a different \(t\) for the optimal implementation. Because static power accounts for the majority of the power consumption on the FPGA platform, we do not report power here. Nevertheless, the results obtained for multiple choices of \(t\) provide sufficient references for further deployment of the proposed accelerators in potential applications.
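To illustrate what choosing \(t\) might look like in practice, the sketch below picks the configuration with the fewest slices whose decryption delay fits a given budget. It is a hypothetical selection helper, not a procedure from the article: the name `pick_t`, the budget values, and the use of the \(n=512\) Artix-7 figures from Table 6 as the candidate set are all assumptions made for illustration.

```python
from typing import Optional

# (t, slices, decryption delay in microseconds) for the proposed designs,
# copied from Table 6 (n = 512, Artix-7). Illustrative candidate set only.
PROPOSED = [
    (4, 592, 226.34),
    (8, 632, 132.26),
    (32, 839, 41.43),
    (64, 1132, 21.67),
]


def pick_t(max_dec_delay_us: float) -> Optional[int]:
    """Return the t with the fewest slices whose decryption delay fits the budget.

    Returns None when no listed configuration meets the budget.
    """
    feasible = [(slices, t) for t, slices, delay in PROPOSED if delay <= max_dec_delay_us]
    if not feasible:
        return None
    # min() compares slice counts first, so the smallest-area option wins.
    return min(feasible)[1]


if __name__ == "__main__":
    for budget in (1000.0, 150.0, 30.0, 10.0):
        print(budget, pick_t(budget))
```

A real deployment would likely weigh other factors too (for example LUT/FF counts or the target device), so this only shows the area-versus-delay trade-off that the multiple choices of \(t\) expose.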

The proposed accelerators all have stable operation frequencies, which can be explored further to resist timing-related attacks [17]. Since the primary focus of this article is to deliver ultra-compact structures for resource-constrained applications, side-channel attack resistance is not covered. Nevertheless, we note that the existing countermeasures against various types of side-channel attacks [3, 44] are applicable to the proposed designs, and we regard this line of research as future work.

Other Related Works. For other available lattice-based PQC schemes or similar (such as [23, 38, 42]), the authors of [3, 17] have shown that their cryptoprocessors for RBLWE-based PQC achieve better ADP than those designs. As the RBLWE-based scheme is a lightweight PQC scheme suited to resource-constrained applications, we mostly compare with designs of the same type in the literature.

Nevertheless, we still wish to highlight important LWE works in the field (with Gaussian-distributed errors): (1) Ring-LWE structures based on NTT can be found in [4, 7, 14, 18, 40, 42], which represent a major line of effort in the field (among others); (2) non-NTT-based Ring-LWE/LWE structures can be found in [7, 24, 27, 28, 30, 32, 38, 54], which represent another trend in the field when the parameter settings do not favor NTT.

Meanwhile, we also want to mention some other important types of lattice-based PQC hardware designs, including (1) hardware cryptoprocessors for Learning-with-Rounding (LWR)-based PQC schemes [5, 22, 41, 55] and (2) hardware architectures for the Nth Degree Truncated Polynomial Ring Units (NTRU)-based PQC [9, 12, 13, 15].

7 CONCLUSION

This article proposed novel ultra-compact hardware accelerators for the RBLWE-based scheme on the FPGA platform. Based on the proposed design strategy and several optimization techniques, we presented two accelerators: the basic version (Figure 2) has very low implementation complexity, and the extended version (Figure 7) offers flexible choices of processing time while still maintaining low resource occupation. The complexity analysis and comparison confirmed the superior efficiency of the proposed designs over the existing ones. These results show that the RBLWE-based scheme is a promising PQC scheme for emerging resource-constrained applications. The proposed design strategy and the obtained results can serve as important references for further standardization and deployment of RBLWE-based PQC, possibly in FPGA-based applications.

REFERENCES

[1] 2020. Post-quantum cryptography round 3 submissions. https://csrc.nist.gov/.
[2] 2022. National Science Foundation (NSF) 2022 Secure and Trustworthy Cyberspace principal investigators’ meeting (SaTC PI Meeting’22)–Break out group reports/slides: Security in a post-quantum world. Slides page 4. https://cps-vo.org/group/satc-pimtg22/breakouts.
[3] Aydin Aysu, Michael Orshansky, and Mohit Tiwari. 2018. Binary Ring-LWE hardware with power side-channel countermeasures. In 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE’18). IEEE, 1253–1258.
[4] Aydin Aysu, Cameron Patterson, and Patrick Schaumont. 2013. Low-cost and area-efficient FPGA implementations of lattice-based cryptography. In 2013 IEEE International Symposium on Hardware-oriented Security and Trust (HOST’13). IEEE, 81–86.
[5] Andrea Basso and Sujoy Sinha Roy. 2021. Optimized polynomial multiplier architectures for post-quantum KEM Saber. DAC. Springer, Berlin, Heidelberg.
[6] Daniel J. Bernstein. 2009. Introduction to post-quantum cryptography. In Post-quantum Cryptography. Springer, 1–14.
[7] Song Bian, Masayuki Hiromoto, and Takashi Sato. 2019. Filianore: Better multiplier architectures For LWE-based post-quantum key exchange. In Proceedings of the 56th Annual Design Automation Conference 2019. 1–6.
[8] Andrey Bogdanov, Lars R. Knudsen, Gregor Leander, Christof Paar, Axel Poschmann, Matthew J. B. Robshaw, Yannick Seurin, and Charlotte Vikkelsoe. 2007. PRESENT: An ultra-lightweight block cipher. In International Workshop on Cryptographic Hardware and Embedded Systems. Springer, 450–466.
[9] Konstantin Braun, Tim Fritzmann, Georg Maringer, Thomas Schamberger, and Johanna Sepúlveda. 2018. Secure and compact full NTRU hardware implementation. In 2018 IFIP/IEEE International Conference on Very Large Scale Integration (VLSI-SoC’18). IEEE, 89–94.
[10] Johannes Buchmann, Florian Göpfert, Tim Güneysu, Tobias Oder, and Thomas Pöppelmann. 2016. High-performance and lightweight lattice-based public-key encryption. In Proceedings of the 2nd ACM International Workshop on IoT Privacy, Trust, and Security. 2–9.
[11] Johannes Buchmann, Florian Göpfert, Rachel Player, and Thomas Wunderer. 2016. On the hardness of LWE with binary error: Revisiting the hybrid lattice-reduction and meet-in-the-middle attack. In International Conference on Cryptology in Africa. Springer, 24–43.
[12] Eros Camacho-Ruiz, Santiago Sánchez-Solano, Piedad Brox, and Macarena C. Martínez-Rodríguez. 2021. Timing-optimized hardware implementation to accelerate polynomial multiplication in the NTRU algorithm. ACM Journal on Emerging Technologies in Computing Systems (JETC) 17, 3 (2021), 1–16.
[13] Elizabeth Carter, Pengzhou He, and Jiafeng Xie. 2022. High-performance polynomial multiplication hardware accelerators for KEM saber and NTRU. Cryptology ePrint Archive (2022).
[14] Donald Donglong Chen, Nele Mentens, Frederik Vercauteren, Sujoy Sinha Roy, Ray C. C. Cheung, Derek Pao, and Ingrid Verbauwhede. 2014. High-speed polynomial multiplication architecture for ring-LWE and SHE cryptosystems. IEEE Transactions on Circuits and Systems I: Regular Papers 62, 1 (2014), 157–166.
[15] Viet Ba Dang, Kamyar Mohajerani, and Kris Gaj. 2021. High-speed hardware architectures and fair FPGA benchmarking of CRYSTALS-Kyber, NTRU, and Saber. In Proceeding of the NIST 3rd PQC Standardization Conf. 1–48.
[16] Shahriar Ebrahimi and Siavash Bayat-Sarmadi. 2020. Lightweight and fault-resilient implementations of binary Ring-LWE for IoT devices. IEEE Internet of Things Journal 7, 8 (2020), 6970–6978.
[17] Shahriar Ebrahimi, Siavash Bayat-Sarmadi, and Hatameh Mosanaei-Boorani. 2019. Post-quantum cryptoprocessors optimized for edge and resource-constrained devices in IoT. IEEE Internet of Things Journal 6, 3 (2019), 5500–5507.
[18] Tim Fritzmann and Johanna Sepúlveda. 2019. Efficient and flexible low-power NTT for lattice-based cryptography. In 2019 IEEE International Symposium on Hardware Oriented Security and Trust (HOST’19). IEEE, 141–150.
[19] Florian Göpfert, Christine van Vredendaal, and Thomas Wunderer. 2017. A hybrid lattice basis reduction and quantum search attack on LWE. In International Workshop on Post-Quantum Cryptography. Springer, 184–202.
[20] Shahriar Hadayeghparast, Siavash Bayat-Sarmadi, and Shahriar Ebrahimi. 2022. High-speed post-quantum cryptoprocessor based on RISC-V architecture for IoT. IEEE Internet of Things Journal 9, 17 (2022), 15839–15846.
[21] Pengzhou He, Ujjwal Guin, and Jiafeng Xie. 2021. Novel low-complexity polynomial multiplication over hybrid fields for efficient implementation of binary Ring-LWE post-quantum cryptography. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 11, 2 (2021), 383–394.
[22] Pengzhou He, Chiou-Yng Lee, and Jiafeng Xie. 2021. Compact coprocessor for KEM Saber: Novel scalable matrix originated processing. In The NIST 3rd Standardization Conference. 1–16.
[23] James Howe, Ciara Moore, Máire O’Neill, Francesco Regazzoni, Tim Güneysu, and Kevin Beeden. 2016. Lattice-based encryption over standard lattices in hardware. In 2016 53rd ACM/EDAC/IEEE Design Automation Conference (DAC’16). IEEE, 1–6.
[24] James Howe, Tobias Oder, Markus Krausz, and Tim Güneysu. 2018. Standard lattice-based key encapsulation on embedded devices. IACR Transactions on Cryptographic Hardware and Embedded Systems (2018), 372–393.
[25] José L. Imaña, Pengzhou He, Tianyou Bao, Yazheng Tu, and Jiafeng Xie. 2022. Efficient hardware arithmetic for inverted binary Ring-LWE based post-quantum cryptography. IEEE Transactions on Circuits and Systems I: Regular Papers 69, 8 (2022), 3297–3307.
[26] Ayesha Khalid, Song Bian, Chenghua Wang, Máire O’Neill, and Weiqiang Liu. 2021. AxRLWE: A multi-level approximate Ring-LWE co-processor for lightweight IoT applications. IEEE Internet of Things Journal 9, 13 (2021), 10492–10501.
[27] Ayesha Khalid, James Howe, Ciara Rafferty, Francesco Regazzoni, and Máire O’Neill. 2018. Compact, scalable, and efficient discrete Gaussian samplers for lattice-based cryptography. In 2018 IEEE International Symposium on Circuits and Systems (ISCAS’18). IEEE, 1–5.
[28] Dur E. Shahwar Kundi, Song Bian, Ayesha Khalid, Chenghua Wang, Máire O’Neill, and Weiqiang Liu. 2020. AxMM: Area and power efficient approximate modular multiplier for R-LWE cryptosystem. In 2020 IEEE International Symposium on Circuits and Systems (ISCAS’20). IEEE, 1–5.
[29] Chiou-Yng Lee and Pramod Kumar Meher. 2016. Comment on “Subquadratic space-complexity digit-serial multipliers over \(GF(2^m)\) using generalized \((a, b)\)-Way Karatsuba algorithm.” IEEE Transactions on Circuits and Systems I: Regular Papers 63, 8 (2016), 1316–1319.
[30] Dongsheng Liu, Cong Zhang, Hui Lin, Yuyang Chen, and Mingyu Zhang. 2018. A resource-efficient and side-channel secure hardware implementation of ring-LWE cryptographic processor. IEEE Transactions on Circuits and Systems I: Regular Papers 66, 4 (2018), 1474–1483.
[31] Mingjie Liu and Phong Q. Nguyen. 2013. Solving BDD by enumeration: An update. In Cryptographers’ Track at the RSA Conference. Springer, 293–309.
[32] Weiqiang Liu, Sailong Fan, Ayesha Khalid, Ciara Rafferty, and Máire O’Neill. 2019. Optimized schoolbook polynomial multiplication for compact lattice-based cryptography on FPGA. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 27, 10 (2019), 2459–2463.
[33] Benjamin J. Lucas, Ali Alwan, Marion Murzello, Yazheng Tu, Pengzhou He, Andrew J. Schwartz, David Guevara, Ujjwal Guin, Kyle Juretus, and Jiafeng Xie. 2022. Lightweight hardware implementation of binary ring-LWE PQC accelerator. IEEE Computer Architecture Letters 21, 1 (2022), 17–20.
[34] Vadim Lyubashevsky, Chris Peikert, and Oded Regev. 2010. On ideal lattices and learning with errors over rings. In Annual International Conference on the Theory and Applications of Cryptographic Techniques. Springer, 1–23.
[35] Daniele Micciancio and Oded Regev. 2009. Lattice-based cryptography. In Post-quantum Cryptography. Springer, 147–191.
[36] Jeng-Shyang Pan, Chiou-Yng Lee, and Pramod Kumar Meher. 2013. Low-latency digit-serial and digit-parallel systolic multipliers for large binary extension fields. IEEE Transactions on Circuits and Systems I: Regular Papers 60, 12 (2013), 3195–3204.
[37] John M. Pollard. 1971. The fast Fourier transform in a finite field. Mathematics of Computation 25, 114 (1971), 365–374.
[38] Thomas Pöppelmann and Tim Güneysu. 2013. Towards practical lattice-based public-key encryption on reconfigurable hardware. In International Conference on Selected Areas in Cryptography. Springer, 68–85.
[39] Oded Regev. 2009. On lattices, learning with errors, random linear codes, and cryptography. Journal of the ACM (JACM) 56, 6 (2009), 1–40.
[40] Claudia Patricia Rentería-Mejía and Jaime Velasco-Medina. 2017. High-throughput ring-LWE cryptoprocessors. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 25, 8 (2017), 2332–2345.
[41] Sujoy Sinha Roy and Andrea Basso. 2020. High-speed instruction-set coprocessor for lattice-based key encapsulation mechanism: Saber in hardware. IACR TCHES, 443–466.
[42] Sujoy Sinha Roy, Frederik Vercauteren, Nele Mentens, Donald Donglong Chen, and Ingrid Verbauwhede. 2014. Compact ring-LWE cryptoprocessor. In International Workshop on Cryptographic Hardware and Embedded Systems. Springer, 371–391.
[43] Ausmita Sarker, Mehran Mozaffari Kermani, and Reza Azarderakhsh. 2020. Fault detection architectures for inverted binary ring-LWE construction benchmarked on FPGA. IEEE Transactions on Circuits and Systems II: Express Briefs 68, 4 (2020), 1403–1407.
[44] Tobias Schneider, Amir Moradi, and Tim Güneysu. 2016. ParTI–towards combined hardware countermeasures against side-channel and fault-injection attacks. In Annual International Cryptology Conference. Springer, 302–332.
[45] Karim Shahbazi and Seok-Bum Ko. 2021. Area and power efficient post-quantum cryptosystem for IoT resource-constrained devices. Microprocessors and Microsystems 84 (2021), 104280.
[46] Peter W. Shor. 1994. Algorithms for quantum computation: Discrete logarithms and factoring. In Proceedings 35th Annual Symposium on Foundations of Computer Science. IEEE, 124–134.
[47] André Weimerskirch and Christof Paar. 2006. Generalizations of the Karatsuba algorithm for efficient implementations. IACR Cryptol. ePrint Arch. 2006 (2006), 224.
[48] Jiafeng Xie, Kanad Basu, Kris Gaj, and Ujjwal Guin. 2020. Special session: The recent advance in hardware implementation of post-quantum cryptography. In 2020 IEEE 38th VLSI Test Symposium (VTS’20). IEEE, 1–10.
[49] Jiafeng Xie, Pengzhou He, Xiaofang Maggie Wang, and Jose Luis Imana. 2021. Efficient hardware implementation of finite field arithmetic \(AB+ C\) over hybrid fields for post-quantum cryptography. IEEE Transactions on Emerging Topics in Computing 10, 2 (2021), 1222–1228.
[50] Jiafeng Xie, Pengzhou He, and Wujie Wen. 2021. Efficient implementation of finite field arithmetic for binary Ring-LWE post-quantum cryptography through a novel lookup-table-like method. In DAC, Vol. 21. 1–6.
[51] Yufei Xing and Shuguo Li. 2021. A compact hardware implementation of CCA-secure key exchange mechanism CRYSTALS-KYBER on FPGA. IACR Transactions on Cryptographic Hardware and Embedded Systems (2021), 328–356.
[52] Dongdong Xu, Xiang Wang, Yuanchao Hao, Zhun Zhang, Qiang Hao, and Zhiyu Zhou. 2022. A more accurate and robust binary Ring-LWE decryption scheme and its hardware implementation for IoT devices. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 30, 8 (2022), 1007–1019.
[53] Neng Zhang, Bohan Yang, Chen Chen, Shouyi Yin, Shaojun Wei, and Leibo Liu. 2020. Highly efficient architecture of NewHope-NIST on FPGA using low-complexity NTT/INTT. IACR Transactions on Cryptographic Hardware and Embedded Systems (2020), 49–72.
[54] Yuqing Zhang, Chenghua Wang, Dur E. Shahwar Kundi, Ayesha Khalid, Máire O’Neill, and Weiqiang Liu. 2020. An efficient and parallel R-LWE cryptoprocessor. IEEE Transactions on Circuits and Systems II: Express Briefs 67, 5 (2020), 886–890.
[55] Yihong Zhu et al. 2021. LWRpro: An energy-efficient configurable crypto-processor for Module-LWR. IEEE Trans. Circuits and Systems-I 68, 3 (2021), 1146–1159.

Index Terms

FPGA Implementation of Compact Hardware Accelerators for Ring-Binary-LWE-based Post-quantum Cryptography
1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks
    1. Redundancy
  2. Embedded and cyber-physical systems
    1. Embedded systems
    2. Robotics
2. Networks
  1. Network properties
    1. Network reliability

Published in
ACM Transactions on Reconfigurable Technology and Systems Volume 16, Issue 3
September 2023
447 pages
ISSN:1936-7406
EISSN:1936-7414
DOI:10.1145/3604889
Editor:
Deming Chen
University of Illinois, Urbana-Champaign, USA
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 21 June 2023
- Online AM: 27 October 2022
- Accepted: 9 October 2022
- Revised: 28 August 2022
- Received: 17 June 2022
Author Tags
Compact hardware accelerator
Field-Programmable Gate Array (FPGA)
flexible processing throughput
low implementation complexity
post-quantum cryptography (PQC)
Ring-Binary-Learning-with-Errors (RBLWE)-based scheme