FELIX: FPGA-Based Scalable and Lightweight Accelerator for Large Integer Extended GCD

The extended greatest common divisor (XGCD) computation is a critical component in various cryptographic applications and algorithms, including both pre- and postquantum cryptosystems. In addition to computing the greatest common divisor (GCD) of two integers, the XGCD also produces Bézout coefficients $b_{a}$ and $b_{b}$ which satisfy $\mathrm {GCD}(a,b) = a\times b_{a} + b\times b_{b}$ . In particular, computing the XGCD for large integers is of significant interest. Most recently, XGCD computation between 6479-bit integers is required for solving Nth-degree truncated polynomial ring unit (NTRU) trapdoors in Falcon, a National Institute of Standards and Technology (NIST)-selected postquantum digital signature scheme. To this point, existing literature has primarily focused on exploring software-based implementations for XGCD. The few existing high-performance hardware architectures require significant hardware resources and may not be desirable for practical usage, and the lightweight architectures suffer from poor performance. To fill the research gap, this work proposes a novel FPGA-based scalable and lightweight accelerator for large integer XGCD (FELIX). First, a new algorithm suitable for scalable and lightweight computation of XGCD is proposed. Next, a hardware accelerator (FELIX) is presented, including both constant- and variable-time versions. Finally, a thorough evaluation is carried out to showcase the efficiency of the proposed FELIX. In certain configurations, FELIX involves 81% less equivalent area-time product (eATP) than the state-of-the-art design for 1024-bit integers, and achieves a 95% reduction in latency over the software for 6479-bit integers (Falcon parameter set) with reasonable resource usage. Overall, the proposed FELIX is highly efficient, scalable, lightweight, and suitable for very large integer computation, making it the first such XGCD accelerator in the literature (to the best of our knowledge).


Samuel Coulon, Tianyou Bao, and Jiafeng Xie, Senior Member, IEEE

I. INTRODUCTION
RAPID progression in computer security has initiated a new round of cryptographic engineering research [1], [2], [3]. For instance, the greatest common divisor (GCD) computation serves as one of the major components in many critical cryptographic applications and algorithms [4], [5], [6]. One application, the RSA cryptosystem, requires verification that some large key parameters are coprime, which equates to confirming GCD(·) = 1 [7]. Accordingly, many related research activities have been carried out, including implementations on both software and hardware platforms [5], [6]. The recent advance in the field, however, has gradually shifted from software design to hardware implementation, since the latter typically offers better performance [4], [6], [8].

A. Prior Works and Existing Challenges
Overall, the research community has proposed four major methods to implement the large integer extended GCD (XGCD) function [8]: the extended Euclidean algorithm (EEA) [12], the plus-minus (PM) algorithm [13], the two-bit PM algorithm [14], and the k-ary algorithm [15].
Meanwhile, related software and hardware implementations have been reported, mostly focused on high-performance designs [4], [5], [8], [10]. The recent works in [4] and [8] target high-performance applications. They have demonstrated the efficiency of hardware acceleration over software-based implementations and appear to represent the state-of-the-art in the literature. The work of [4] modified the two-bit PM algorithm to reduce the total number of necessary algorithm iterations. They further developed a high-speed architecture by performing single-cycle large integer arithmetic and reducing carry-propagation delay with carry-save adders (CSAs). More recently, [8] adopted the k-ary algorithm, employing some strategies from two-bit PM and using redundant sign digit (RSD) representation to reduce carry propagation in large integer arithmetic. On the other hand, [6] is the most recent (and, to the best of our knowledge, only) existing lightweight XGCD accelerator. It uses the binary extended Euclidean algorithm (BEEA) and a sequential processing strategy to reuse hardware resources and avoid large integer arithmetic.
The major limitations of the existing works for XGCD hardware acceleration lie in the following three aspects.
1) The existing works like [4] and [8] only consider fast computation, and hence their applications are somewhat limited. For instance, the design of [4] requires 225,776 LUTs, 31,438 FFs, and 57,012 slices for 1024-bit integers on a Kintex-7 field-programmable gate array (FPGA). Such large resource usage may not be ideal for practical FPGA-based applications with parameters as large as 6479-bit integers, especially when XGCD computation is just one component of a larger cryptographic scheme.
2) The existing lightweight designs are severely lacking in speed. For example, the work of [6] presents a lightweight implementation, but the resulting performance is not sufficient for many applications. The fastest configuration of [6] for 2048-bit integers still requires 1.68 ms for processing; extrapolating from the given performance results, we estimate that it would take around 5.58 ms for 6479-bit computation, which is 21% slower than our test of the FALCON software implementation [11].
3) Finally, the existing works (including software ones) only report XGCD designs for up to 1024- or 2048-bit integers, so an architecture for very large integers, as needed by applications like FALCON, is missing from the literature.

B. Target Platform
This work targets the implementation of XGCD as one component of a larger scheme rather than as a standalone chip. As such, FPGA was selected as the prototyping platform. The FPGA platform provides many benefits for hardware accelerator design, including flexibility and relatively short development cycles. Application-specific integrated circuits (ASICs) could be targeted in future work if a complete scheme (FALCON, for example) were implemented with the goal of creating an end product. Although [8] presented results only on the ASIC platform, [4] and [6] presented implementation results on FPGA, so we primarily compare our work with [4] and [6]. We also compare with the FALCON reference software code [11] for the evaluation of XGCD computation over 6479-bit integers.

C. Contributions
Following the above discussion, we aim to develop a novel FPGA-based scalable and lightweight accelerator for large integer XGCD (FELIX). In pursuit of this goal, we have made the following contributions.
1) We have proposed a new algorithm for scalable XGCD computation based on the two-bit PM GCD algorithm. The proposed algorithm enables both constant- and variable-time processing options and features extensions, based on our thorough analysis of potential speedup factors, that are suitable for pipelined processing.
2) We have proposed an efficient accelerator, FELIX, to perform the computation on the FPGA platform. The accelerator implements a pipelined, sequential processing strategy which achieves an excellent balance between lightweight resource usage and efficient processing. We have given the architecture details and related design techniques for FELIX, including its different components, functions, and constant- and variable-time modes.
3) We have finally provided a thorough performance evaluation to demonstrate the efficiency of the proposed FELIX over the state-of-the-art designs. We also showcase the proposed FELIX's superior performance and practical resource usage when implementing XGCD for very large integers such as FALCON's 6479-bit parameters. To the authors' best knowledge, this is the first report in the literature of a scalable and lightweight XGCD accelerator with efficient processing, especially for computation over integers as large as 6479 bits. We hope this work can stimulate many related research studies in the field.
The rest of the article is arranged as follows. Section II introduces the preliminary knowledge. Section III introduces the algorithm background information. Section IV discusses the formulation of the proposed algorithm. Section V presents the proposed FELIX. Evaluation of the accelerator is provided in Section VI, including the extension to FALCON's parameter sets. The conclusion is finally given in Section VII.

2) Some Applications: XGCD has been deployed in interesting applications such as VDFs [16] and modular inversion (suitable for elliptic curve cryptosystems) [5]. For example, given an integer a and a modulus b with GCD(a, b) = 1, one can find the modular inverse a^(-1) mod b from the Bézout coefficient b_a, since a × b_a ≡ 1 (mod b).
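As a small, concrete illustration of the modular-inversion application: the Bézout coefficient of a with respect to modulus b is exactly the modular inverse. Python's built-in three-argument pow (which computes modular inverses for exponent -1 since Python 3.8) can cross-check this; the values below are arbitrary toy numbers, not parameters from any cryptosystem.

```python
a, b = 17, 3120            # toy coprime pair: gcd(17, 3120) = 1
inv = pow(a, -1, b)        # modular inverse via Python 3.8+ built-in pow
assert (a * inv) % b == 1  # a * a^{-1} == 1 (mod b), the Bezout relation mod b
```

The same inverse would be obtained by running XGCD(a, b) and reducing the coefficient b_a modulo b.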

B. XGCD in FALCON
Recently, XGCD has been found in the NTRUSolve function of FALCON, a critical operation in its key generation phase [11]. FALCON's key generation samples polynomials f and g and solves the NTRU equation to find polynomials F and G, where a valid set (f, g, F, G) constitutes a valid private key [11]. Initially, f and g are polynomials over the ring Z[x]/(x^n + 1), where n = 1024, with small coefficients (5-8 bits). Polynomials f and g are repeatedly cast onto smaller rings Z[x]/(x^(n/2) + 1), Z[x]/(x^(n/4) + 1), ..., until f and g are plain integers over the ring Z. With each iteration, the polynomial degree halves and the coefficient size approximately doubles, so the final f and g become very large integers. In the reference code [11], they are represented in residue number system (RNS) notation with 209 31-bit primes. Given 209 × 31 = 6479, we estimate this as the maximum bit-size of f and g, though the reference states they average around 6307 bits [11]. From here, we must find F and G such that f × G − g × F = q. Substituting u × q and v × q for G and F, respectively, yields f × u × q − g × v × q = q, and factoring out q reduces this to f × u − g × v = 1. This overall process equates to computing the XGCD, where f and g are the 6479-bit input integers, u and v are the resultant Bézout coefficients, and the expected result is GCD(f, g) = 1. Note that further steps are needed to raise F and G back to the ring Z[x]/(x^n + 1), but those steps are not the focus of this work.
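The reduction of the NTRU equation to an XGCD instance can be sanity-checked at integer scale with a textbook extended Euclid. The values f and g below are toy stand-ins (any coprime pair), not FALCON's actual 6479-bit integers, and `egcd` is a generic helper, not the paper's algorithm.

```python
def egcd(a, b):
    # classic extended Euclid: returns (d, x, y) with a*x + b*y == d
    if b == 0:
        return a, 1, 0
    d, x, y = egcd(b, a % b)
    return d, y, x - (a // b) * y

f, g_ = 127, 64                # toy coprime stand-ins for the final f and g
d, u, v_neg = egcd(f, g_)      # f*u + g_*v_neg == 1
v = -v_neg                     # flip the sign so that f*u - g_*v == 1
assert d == 1 and f * u - g_ * v == 1
```

In FALCON, GCD(f, g) = 1 is the expected outcome; any other result means the sampled (f, g) pair is rejected and resampled.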

C. Notations
We have used the following notations throughout the article. N denotes the bit-width of the integer inputs; (a_0, b_0) and (a, b, u, m, y, n) are N-bit variables used in the algorithm and accelerator descriptions. Q refers to the bit-width of the parsed sections of the above-mentioned variables. Therefore, a_0, b_0, a, b, u, m, y, and n are stored and processed as N/Q individual Q-bit sections. R_E and R_O are the power-of-two reduction factors used to configure the proposed XGCD algorithm. L is the pipeline latency (Table I). I is the average number of iterations (Table II). In Algorithms 1-3, "//" indicates floor division, which is bit-shifting in hardware.
A valid set (N, Q, R_E, R_O) constitutes a configuration for the architecture of the proposed FELIX, while a set of just (R_E, R_O) is sufficient to characterize the configuration of the underlying algorithm. As such, we denote configurations of the algorithm as (R_E, R_O), as discussed in Section III. For a given N, smaller Q, R_E, and R_O result in lower resource usage and slower processing, while larger Q, R_E, and R_O result in greater resource usage and faster processing.
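The sectioning implied by this notation can be sketched in Python; this is a behavioral model of the Q-bit storage layout only (the helper names `to_sections`/`from_sections` are ours, not from the paper), not a description of the hardware.

```python
def to_sections(x, N, Q):
    # parse an N-bit integer into N/Q little-endian Q-bit sections
    mask = (1 << Q) - 1
    return [(x >> (Q * i)) & mask for i in range(N // Q)]

def from_sections(secs, Q):
    # reassemble the integer from its Q-bit sections
    return sum(s << (Q * i) for i, s in enumerate(secs))

N, Q = 1024, 128                     # one configuration from the paper's tables
x = (1 << 1000) + 12345
assert from_sections(to_sections(x, N, Q), Q) == x
assert len(to_sections(x, N, Q)) == N // Q    # N/Q sections
```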

III. ALGORITHM BACKGROUND
This section provides some background on GCD algorithms as a basis for our work. We start by detailing some information on the Stein family of GCD algorithms [17] and then discuss the one we choose to build from, two-bit PM [14].

A. Background and Algorithm Selection
Existing work suggests that Euclid's (division-based) [12] and Stein's (subtraction-based) [17] algorithm families perform favorably for large integer implementations of GCD. Given our objective of low resource usage and the relatively high cost of division operations in hardware, we chose Stein's algorithm family for our design. Stein's algorithm (also known as binary GCD) [17] and its variants [13], [14], [18] operate on the basis that, given two integers a and b, at least one of two conditions must always be true: 1) a or b is even, or 2) a ± b is even. These algorithms reduce their inputs by dividing the inputs directly when they are even or dividing the sum/difference of the inputs when they are both odd. Division is always by a power-of-two reduction factor, which equates to simple bit-shifting in hardware.
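The shared principle of the Stein family can be seen in the classic binary GCD, which uses only the operations named above (shifts, subtraction, parity tests). This is a software sketch for intuition; the paper's designs build on the PM variants rather than this exact routine.

```python
def stein_gcd(a, b):
    # binary (Stein) GCD: shifts and subtraction only, no general division
    if a == 0:
        return b
    if b == 0:
        return a
    k = 0
    while (a | b) & 1 == 0:        # both even: factor out a common 2
        a, b, k = a >> 1, b >> 1, k + 1
    while a & 1 == 0:              # strip remaining factors of 2 from a
        a >>= 1
    while b != 0:
        while b & 1 == 0:          # b even: discard its factors of 2
            b >>= 1
        if a > b:                  # keep a <= b
            a, b = b, a
        b -= a                     # both odd: the difference is even
    return a << k                  # restore the common power of 2

assert stein_gcd(48, 18) == 6
```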
Stein-based algorithms [13], [14], [17], [18] can be classified according to the maximum number of bits they can reduce per iteration when a or b is even and when a and b are both odd. We follow [4] and denote this as (R_E, R_O), where R_E is the maximum reduction factor when a or b is even and R_O the factor when a and b are both odd. The algorithms can further be classified by whether they operate in constant or variable time, where constant-time efficiency is based on the worst-case iteration count, and variable-time mode finishes when a or b has been reduced to 0 [14].
Purdy's algorithm divides by 2 when a or b is even and by 2 when a and b are both odd, i.e., reduction factors (2, 2). Purdy's algorithm is simple but suffers from quadratic worst-case time complexity [18]. The PM algorithm introduced swapping mechanisms; it still has reduction factors (2, 2) but finishes in at most 3.012 × N iterations [13]. The two-bit PM algorithm extended the PM algorithm, exploiting the fact that when a and b are both odd, one of (a + b) % 4 = 0 or (a − b) % 4 = 0 is always true, so some PM operations can be combined [14]. Two-bit PM divides by at most 4 when a or b is even, and divides by 4 when a and b are both odd, so the original two-bit PM has reduction factors (4, 4) for both constant- and variable-time modes, with the worst case reduced to 1.51 × N + 1 iterations. When [4] modified two-bit PM, they noted that while the option to reduce by 4 (instead of 2) when a or b is even does improve the average iteration count in variable-time mode, it does not improve the worst-case iteration count. We therefore remove these operations for two-bit PM constant-time GCD, resulting in reduction factors (2, 4). In summary, two-bit PM in variable-time mode has reduction factors (4, 4) and completes when a = 0 or b = 0; constant-time mode has reduction factors (2, 4) and finishes in 1.51 × N + 1 iterations.
We select two-bit PM (Algorithm 1) as the basis to develop our algorithm for implementation because of its excellent balance between simple operations (addition/subtraction/shifting) and reasonable worst-case speed. Further, those operations are easily converted to sequential processing, i.e., they can be executed bit by bit. Furthermore, the variable-time version of the algorithm (4, 4) can easily be extended to accommodate optimizations for our pipelined sequential processing strategy, which is discussed in the next section. Finally, the original two-bit PM source [14] also provides insight on the extension from GCD to XGCD, which is relevant to this work.

B. Two-Bit PM Algorithm for GCD
Algorithm 1: Two-Bit PM GCD [14]

The procedure for two-bit PM GCD computation is shown in Algorithm 1. The algorithm initializes with a ← a_0 and b ← b_0. There is also a small variable g that tracks how many times a or b has been reduced and determines which should be reduced next. g increments or decrements when a or b is being reduced, so the lesser-reduced of a and b can always be determined by the sign of g. The main step of Algorithm 1 implements the main iteration loop of two-bit PM. On each iteration, when a or b is even, it is reduced by a factor of 2 (or 4 in variable-time mode), and g is incremented or decremented accordingly. When a and b are both odd, either (a + b) % 4 = 0 or (a − b) % 4 = 0 must be true [14]. When this is the case, g indicates whether a or b should be reduced. This procedure loops for 1.51 × N + 1 iterations in constant-time mode, or until a = 0 or b = 0 in variable-time mode, and returns GCD(a_0, b_0) ← a + b.
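The loop just described can be modeled in Python. This is a behavioral sketch of the variable-time (4, 4) mode for intuition, not a line-by-line transcription of Algorithm 1: inputs are assumed odd (as the PM family requires), and the exact g bookkeeping in the both-odd case is our plausible reading of the prose.

```python
def two_bit_pm_gcd(a, b):
    # variable-time two-bit PM GCD sketch; a, b assumed odd on entry
    g = 0                                  # sign of g picks which of a, b to reduce
    while a != 0 and b != 0:
        if a % 2 == 0:
            a //= 4 if a % 4 == 0 else 2   # R_E = 4: shift by up to 2 bits
            g -= 1
        elif b % 2 == 0:
            b //= 4 if b % 4 == 0 else 2
            g += 1
        else:
            # both odd: exactly one of a+b, a-b is divisible by 4
            t = a + b if (a + b) % 4 == 0 else a - b
            if g >= 0:
                a = t // 4                 # R_O = 4
                g -= 1
            else:
                b = t // 4
                g += 1
    return abs(a + b)  # one of a, b is 0; intermediate values may go negative

assert two_bit_pm_gcd(21, 15) == 3
```

Each update preserves the GCD because, with the other operand odd, gcd(a, b) = gcd((a ± b)/2^k, b).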

IV. FELIX: THE PROPOSED ALGORITHM
In this section, we start by extending the two-bit PM GCD procedure (Algorithm 1) to two-bit PM XGCD, to additionally calculate Bézout coefficients b_a and b_b. Then, some further extensions to the algorithm are proposed to optimize processing, finally resulting in the proposed algorithm (Algorithm 2).

A. Proposed New Two-Bit PM Algorithm for XGCD
The original two-bit PM source [14] provides steps for the extension from GCD to XGCD computation. Our version is shown in Algorithm 2, where operations highlighted in blue are the only operations used for constant-time mode, and where (R_E, R_O) is set to (2, 4). Note that these are the same constant-time operations found in Algorithm 1. For variable-time mode, all operations in Algorithm 2 are available depending on the (R_E, R_O) configuration. We first discuss the extension from GCD to XGCD; the next section discusses extensions for larger (R_E, R_O) configurations.
For extension to XGCD, we take Algorithm 1 and introduce Bézout variables u, m, y, and n, which we update to preserve the equality with a and b in (1). Thus, u and m must be updated when a is reduced, while y and n need not be updated because their equality with b in (1) still holds, and vice versa.
Algorithm 3: Bézout Coefficient Update Function (bez_update(·)), Based on [4]

The procedure for updating u and m, or y and n, is shown in Algorithm 3, which is based on a similar procedure in [4]. Algorithm 3 is used in Algorithm 2, called as bez_update(·). To understand its function, let us first consider the case where a % 2 = 0, so u and m need to be updated such that (1) still holds with a ← a//2. Remember that a_0 and b_0 are required to be odd. Assuming (1) currently holds, u and m must either be both even or both odd [14]. When u and m are even, we can reduce them directly, so that u ← u//2 and m ← m//2. When u and m are odd, dividing by 2 would discard the lowest bit and break the equality. Instead, we add b_0 to u and subtract a_0 from m. Since a_0 and b_0 are odd, this produces even results. Then, the values can safely be reduced as u ← (u + b_0)//2 and m ← (m − a_0)//2.
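The update step just described can be sketched in Python. This is an illustrative software model in the spirit of Algorithm 3 (the represented value is u·a_0 + m·b_0, which must halve in lockstep with a), not the hardware datapath.

```python
def bez_update(u, m, a0, b0, shifts):
    # Halve the Bezout pair (u, m) `shifts` times while preserving the
    # invariant  value == u*a0 + m*b0  (a0, b0 odd, value divisible by 2**shifts).
    for _ in range(shifts):
        if u % 2 == 0:          # u and m are both even: halve directly
            u, m = u // 2, m // 2
        else:                   # u and m are both odd: make them even first
            u, m = (u + b0) // 2, (m - a0) // 2
    return u, m

# one halving step keeps the represented value consistent:
a0, b0 = 7, 5                   # toy odd inputs
u, m = 3, 1                     # represented value: 3*7 + 1*5 = 26 (even)
u2, m2 = bez_update(u, m, a0, b0, 1)
assert u2 * a0 + m2 * b0 == (u * a0 + m * b0) // 2
```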
Let us now consider the case where a and b are both odd. Suppose g ≥ 0 and (a + b) % 4 = 0 is met; then u and m need to be updated such that (1) still holds with a ← (a + b)//4. In the same way that b is added to a, y and n must be added to u and m, respectively. This can be observed in Algorithm 2, where u ± y and m ± n are sometimes fed into bez_update(·). From this point, the procedure is the same as when a or b is even and follows the operations of Algorithm 3.
We have now obtained all necessary procedures to update a, b, u, m, y, and n such that (1) holds on each iteration. The algorithm iterates until it reaches 1.51 × N + 1 iterations in constant-time mode [14], or until a = 0 or b = 0 in variable-time mode. At this point, the GCD and Bézout coefficients can be obtained as GCD(a_0, b_0) = a + b, b_a = u + y, and b_b = m + n. There is also a final step to invert GCD(a_0, b_0), b_a, and b_b when GCD(a_0, b_0) < 0.
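Putting the pieces together, the whole variable-time procedure can be modeled in Python. This is a behavioral sketch under stated assumptions, not a transcription of Algorithm 2: inputs must be odd, even values are reduced by 2 per step (the (2, 4) datapath; the (4, 4) version also shifts by 4 when possible), and the g bookkeeping is our plausible reading of the prose.

```python
def xgcd_two_bit_pm(a0, b0):
    # variable-time two-bit PM XGCD sketch; a0 and b0 must be odd
    a, b = a0, b0
    u, m, y, n = 1, 0, 0, 1   # invariants: a == u*a0 + m*b0, b == y*a0 + n*b0
    g = 0                     # sign of g picks which of a, b to reduce

    def update(u, m, shifts):  # coefficient halving in the style of Algorithm 3
        for _ in range(shifts):
            if u % 2 == 0:
                u, m = u // 2, m // 2
            else:
                u, m = (u + b0) // 2, (m - a0) // 2
        return u, m

    while a != 0 and b != 0:
        if a % 2 == 0:
            a //= 2
            u, m = update(u, m, 1)
            g -= 1
        elif b % 2 == 0:
            b //= 2
            y, n = update(y, n, 1)
            g += 1
        else:
            s = 1 if (a + b) % 4 == 0 else -1  # one of a+b, a-b divides by 4
            if g >= 0:
                a = (a + s * b) // 4
                u, m = update(u + s * y, m + s * n, 2)
                g -= 1
            else:
                b = (b + s * a) // 4
                y, n = update(y + s * u, n + s * m, 2)
                g += 1

    d, ba, bb = a + b, u + y, m + n
    if d < 0:                  # final inversion step when the GCD is negative
        d, ba, bb = -d, -ba, -bb
    return d, ba, bb
```

By the two invariants, the returned values always satisfy a_0 × b_a + b_0 × b_b = GCD(a_0, b_0).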

B. Extensions of the Proposed Algorithm
To this point, we have established a constant-time (2, 4) and variable-time (4, 4) two-bit PM XGCD algorithm based on [14]. We now explore options to extend the variable-time version of the two-bit PM algorithm for optimized processing. Recall that the variable-time algorithm terminates when either of the input integers has been reduced to 0, so average-case analysis is concerned with the average number of iterations required to reach this point. If the implementation only supports division by 4 (the basic (4, 4) version), then it takes many iterations to reduce a or b to 0. However, when a, b, or a ± b is divisible by a larger power of 2, the algorithm can be extended to reduce more bits in a single iteration, thus reducing the average number of iterations required [4].
The trade-off is that reducing more bits per iteration requires the loop in the bez_update(·) function (Algorithm 3) to run longer. As opposed to [4], who operate over whole N-bit integers and evaluate all possible outcomes of bez_update(·) in a single cycle, we have unrolled the loop and implemented it as a pipeline (see the PipeUY and PipeMN components in Section V-D) operating over Q-bit sections. This strategy is explained further in Section V-A, but for now, the important note is that more iterations of the bez_update(·) loop equate to a longer, more resource-intensive pipeline. We observe that our sequential processing strategy already requires at least N/Q (the number of sequential processing sections) cycles per iteration, so extending the pipelines to any length less than N/Q cycles imposes no latency penalty on a single iteration. Overall, we have conducted a simple analysis of these possible extensions, and the results are listed in Tables I-III.
Table I presents the pipeline latency L (in clock cycles) for the extensions of the proposed algorithm. For example, if we configure to allow reduction by 32 in a single iteration (32, 32), the pipeline latency is 11 clock cycles. For reduction by 32, we need log2(32) iterations of the bez_update(·) loop (2 cycles each), plus an optional 1 cycle for adding or subtracting the inputs to bez_update(·) when it is called. So, L is calculated as Max(2 × log2(R_E), 1 + 2 × log2(R_O)). The pipeline latency (and the majority of the pipeline resource usage) is determined by the greater of R_E and R_O, so there is no latency benefit (and only minimal resource usage benefit) to setting one of them lower than the other. Table II presents the average number of iterations (I) required to reduce a or b to 0 for a given configuration (R_E, R_O), obtained by simulating Algorithm 2 with uniformly random inputs. Table III then uses this data to compute the average number of clock cycles required to reduce a or b to 0, where values are calculated as [Max(N/Q, L) + 1] × I (L is the pipeline latency and I is the average number of iterations).
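These formulas are easy to tabulate. The snippet below simply evaluates the two closed-form expressions from the text (the function names are ours); the asserted values reproduce the cases discussed above.

```python
from math import log2

def pipeline_latency(RE, RO):
    # L = Max(2*log2(R_E), 1 + 2*log2(R_O))
    return max(2 * int(log2(RE)), 1 + 2 * int(log2(RO)))

def cycles_per_iteration(N, Q, L):
    # Max(N/Q, L) + 1, the per-iteration cycle count used in Table III
    return max(N // Q, L) + 1

assert pipeline_latency(32, 32) == 11             # the (32, 32) case above
assert pipeline_latency(4, 4) == 5                # matches the 5-cycle PipeUY/PipeMN
assert cycles_per_iteration(1024, 128, 11) == 12  # N/Q = 8 < L = 11: pipeline-bound
```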
When selecting a configuration (N, Q, R_E, R_O), there are two speedup factors to consider: 1) the average number of iterations and 2) the cycles per iteration. The total average number of clock cycles for processing is the product of these two values. First, given N, the average number of iterations is purely determined by the reduction factor (R_E, R_O) configuration of the algorithm. Table II demonstrates that increasing the reduction factor decreases the average number of iterations required, though the percent reduction decreases with each successive extension. For example, extension from (4, 4) to (8, 8) decreases the average number of iterations by 23%, while

TABLE III AVERAGE NUMBER OF CLOCK CYCLES
further extensions to (16, 16) and (32, 32) yield only a further 12% and 5% decrease, respectively. For this reason, we chose not to explore further extensions such as (64, 64). Second, given N, cycles per iteration is computed as Max(N/Q, L) + 1. In other words, cycles per iteration is the greater of the number of sequential processing sections (N/Q) and the pipeline latency (L). If N/Q < L, then pipeline utilization will be suboptimal. For example, in Table III, for N = 1024 and Q = 128, it can be seen that the average number of clock cycles actually increases from (16, 16) to (32, 32), since the pipeline length L = 11 is greater than N/Q = 8. For optimal cycles per iteration, it is best to make N/Q as small as possible (i.e., make Q as large as possible) without dipping below L.
From Table III, we conclude that extensions from (4, 4) to (8, 8) and (16, 16) generally provide tangible reductions in the average number of cycles, which should outweigh the increases in resource usage. However, further extension to (32, 32) seems relatively ineffective and will require trade-off analysis against the related increase in resource usage. On the other hand, doubling Q yields significant reductions in the average number of cycles, since doubling Q halves the number of cycles per iteration if the updated N/Q is still greater than L. However, we expect that larger Q, such as Q = 512, will incur serious maximum-frequency penalties due to carry-propagation delay for larger integer arithmetic. Overall, the various configurations and trade-offs are explored in Section VI.

V. FELIX: THE PROPOSED ACCELERATOR

A. Overall Design Strategy
In keeping with our objectives of scalability and lightweight resource usage, FELIX implements a sequential processing strategy where N-bit integer variables are parsed into Q-bit sections, and each operation found in Algorithms 2 and 3 (addition/subtraction/shifting) is executed sequentially, section by section, from least to most significant. Large integer arithmetic, such as addition, incurs carry-propagation delay proportional to the bit-width of the operands. Our strategy of processing section by section reduces this delay by operating over smaller Q-bit sections instead of the entire N-bit integers. The carry bits between sections are registered, rendering carry propagation between sections negligible.
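The section-by-section strategy with registered carries can be modeled in software. The sketch below (helper name `sectioned_add` is ours) shows a Q-bit-at-a-time addition in which only a single carry bit crosses each section boundary, which is what the hardware registers between sections.

```python
def sectioned_add(x_secs, y_secs, Q):
    # add two sectioned integers Q bits at a time, carrying a single
    # registered bit between sections (software model of the strategy)
    mask = (1 << Q) - 1
    out, carry = [], 0
    for xs, ys in zip(x_secs, y_secs):
        s = xs + ys + carry
        out.append(s & mask)   # low Q bits of this section's sum
        carry = s >> Q         # at most 1: the registered inter-section carry
    return out, carry

Q = 8
x, y = 0xABCD, 0x1234
xs = [(x >> (Q * i)) & 0xFF for i in range(4)]   # little-endian 8-bit sections
ys = [(y >> (Q * i)) & 0xFF for i in range(4)]
out, carry = sectioned_add(xs, ys, Q)
assert sum(s << (Q * i) for i, s in enumerate(out)) == x + y
```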
The necessary pipelined subcomponents are AddSub(), Add(), Sub(), and Shift(s), where AddSub() is configurable for Q-bit addition or subtraction, Add() and Sub() implement only addition or only subtraction, respectively, and Shift(s) shifts a Q-bit value right by s bits. The subcomponents are assembled into processing pipelines that execute all possible permutations of operations in Algorithms 2 and 3. Overall, the proposed accelerator (FELIX) is shown in Fig. 1 and consists of four major components: 1) variable memory; 2) PipeAB (the a and b processing pipeline); 3) PipeUY and PipeMN; and 4) the control unit. These components, along with some other important design details, are described below.

B. Data Flags
Attached to each Q-bit section is a series of flags that assist with dataflow throughout the accelerator: valid, first, final, sub, target, and numshift. The valid flag indicates that the current section is valid data. The first flag indicates that the current section contains the least significant bits of the data, while the final flag indicates that the current section contains the most significant bits. The sub flag configures the AddSub() blocks in the processing pipelines to perform subtraction. The target flag indicates which variables are being processed (a, u, m or b, y, n). The numshift flag indicates the reduction factor of the current operation, which configures the Shift(numshift) block in PipeAB and enables and disables stages within PipeUY and PipeMN.

C. Constant-and Variable-Time Architectures
Overall, the architectures for constant- and variable-time processing are mostly the same. It can be seen in Algorithm 2 that, in general, the operations in blue (constant-time operations) require the same processing primitives (AddSub(), Add(), Sub(), and Shift()) as the nonblue operations. The necessary components are mostly determined by Q and Max(R_E, R_O), so, for example, (2, 4) and (4, 4) configurations with the same Q require the same components. The only difference is that the constant-time control unit maintains an iteration count for its termination condition, while the variable-time one contains logic to monitor when a or b has been reduced to 0.

D. Component Descriptions
1) Variable Memory: Input integers a_0 and b_0 and system variables a, b, u, m, y, and n are all N-bit integers stored in RAM-like register structures of width Q and depth N/Q. Each structure has a dual-port setup, with one read port and one write port. We tried using traditional BRAMs but found that implementing with LUTs/FFs was much more efficient than BRAM when computing the equivalent number of slices (ENS) [19], which will be discussed later. Memory access is managed by the control unit. Since values always need to be processed in order from section 0 to section N/Q − 1, the access pattern is simply incrementing addresses from 0 to N/Q − 1. To facilitate quicker next-state decisions, we also store an additional copy of the least significant bits of a and b in separate registers a_lsb and b_lsb (which are used to determine the divisibility of a, b, and a ± b within each iteration).
2) PipeAB Component: The PipeAB component is shown in Fig. 2 and implements the operations that reduce the a and b variables, which can be seen as the first assignment in each of the cases of the loop in Algorithm 2. Overall, this processing pipeline consists of one configurable AddSub() and one configurable Shift(numshift). The sub flag received with the input data configures the AddSub() to perform subtraction when necessary, and the numshift flag configures the Shift(numshift) block to shift by the correct number of bits. The maximum numshift that must be supported is log2(Max(R_E, R_O)), corresponding to the highest reduction factor set by the implementer. In cases where a or b is even, addition and subtraction are unnecessary, so the AddSub() block is inactive and acts as a pass-through register. Thus, the pipeline latency is two clock cycles in all cases.
3) PipeUY and PipeMN Components: The PipeUY and PipeMN components are shown in Fig. 3; they act as processing pipelines that implement the operations of the bez_update(·) function (Algorithm 3). Note that for the cases where a and b are both odd in Algorithm 2, when bez_update(·) is called, the inputs u, y, m, and n need to be added or subtracted before they are fed to the function. So, similar to the PipeAB component, the first processing element in PipeUY and PipeMN is a configurable AddSub() block. The structure of the rest of the pipeline depends on the (R_E, R_O) configuration. Let us consider the basic variable-time configuration (4, 4) as a basis. The pipelines must support reduction by 4, which is two iterations of the bez_update(·) loop. This requires at most two Add() blocks for PipeUY or two Sub() blocks for PipeMN, as well as two Shift(1) blocks for both. With the initial AddSub(), this equates to a five-clock-cycle latency for each pipeline. For (8, 8), the pipelines in Fig. 3 need to be extended with an additional Add() or Sub() and an additional Shift(1). This continues up to (32, 32), which has one AddSub(), five Add() or Sub(), and five Shift(1) blocks to support a maximum of five iterations of the bez_update(·) loop, totaling an 11-clock-cycle pipeline latency. These pipeline latencies can be seen in Table I.
Note that within the bez_update(·) function, uy is checked on each iteration to see whether it is even. This decision is implemented with the dec blocks in PipeUY in Fig. 3. When dec sees data carrying the first flag (indicating these are the least significant bits), it checks the lowest bits of the z output of the preceding AddSub() or Add() block as well as the current overall state operation being carried out. If the overall state operation requires only reduction by 2, the following stage(s) of Add()/Sub() and Shift(1) are disabled and act as pass-through registers. If the overall state operation enables another stage of Add()/Sub() and Shift(1) and z is odd, then both the following Add()/Sub() and Shift(1) components are enabled. If z is already even, Shift(1) is enabled but Add() or Sub() is disabled. When state indicates that no more reduction is necessary, the remaining stages of Add() or Sub() and Shift(1) are disabled regardless of whether z is even or odd. This indicates that the current operation is complete, and values should be passed through to the pipeline output.
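The per-iteration decision can be sketched as follows (a behavioral Python model of one bez_update(·) call; the correction constants b_0/a_0 and their signs are one standard convention for preserving u × a_0 + m × b_0 and are an assumption of this sketch, since Algorithm 3 fixes its own):

```python
def bez_update(u, m, a0, b0, numshift):
    """Sketch of one bez_update(.) call: halve the coefficient pair (u, m)
    numshift times while exactly halving the value u*a0 + m*b0 each step.
    Preconditions: u*a0 + m*b0 is divisible by 2**numshift (it is, because
    a or b was shifted by the same amount), and a0 is odd (as in
    modular-inversion use cases).

    The odd/even test is what the dec blocks decide: odd -> enable the
    Add()/Sub() stage before Shift(1); even -> Shift(1) only. The hardware's
    dec tests u; we also test m so this sketch stays exact when b0 is even."""
    for _ in range(numshift):
        if u % 2 or m % 2:          # dec: enable Add() in PipeUY, Sub() in PipeMN
            u, m = u + b0, m - a0   # (u+b0)*a0 + (m-a0)*b0 == u*a0 + m*b0
        u, m = u >> 1, m >> 1       # Shift(1); exact, since both are now even
    return u, m
```

The add/subtract leaves the combination u × a_0 + m × b_0 unchanged while fixing parity, so the subsequent 1-bit shift loses no information.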
4) Control Unit: The control unit manages the state machine and coordinates timing throughout the system. Its primary function is to manage memory access and send data to the correct processing pipelines (PipeAB, PipeUY, PipeMN). The control unit is also responsible for managing the termination conditions. When running in constant-time mode, it maintains an iteration_count and terminates the process when iteration_count exceeds 1.51·N + 1. When in variable-time mode, it monitors the output of PipeAB and terminates when the result is 0 (i.e., either a or b has been reduced to 0).
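The two termination rules can be summarized in a short sketch (function and parameter names here are illustrative, not taken from the RTL):

```python
def should_terminate(mode, iteration_count, N, pipeab_out):
    """Sketch of the control unit's two termination rules. In constant-time
    mode the loop runs a fixed, data-independent number of iterations
    (terminating once iteration_count exceeds 1.51*N + 1), so execution time
    leaks nothing about the operands. In variable-time mode it stops as soon
    as PipeAB outputs 0, i.e., a or b has been fully reduced."""
    if mode == "constant":
        return iteration_count > 1.51 * N + 1
    return pipeab_out == 0
```

The constant-time bound depends only on the public parameter N, which is what makes that mode's latency independent of the secret operands.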

E. Dataflow
To understand the dataflow of the proposed FELIX, we consider an example using the constant-time configuration (2, 4). The necessary operations are highlighted in blue in Algorithm 2. The discussed dataflow process can then be extrapolated to the variable-time operations and overall processing.

1) Initialization Step: On start-up, a_0 and b_0 are shifted into the system section-by-section, from the least significant bits to the most significant bits, and stored in variable memory. Note that these values are stored twice: 1) as a_0 and b_0, which will be accessed but not modified by the bez_update(·) function, and 2) as a and b, which will be both accessed and modified. Other variables are also initialized per the Initialization Step in Algorithm 2. Then, we enter the processing loop, i.e., the Main Step of Algorithm 2. Each iteration proceeds as follows.
1) First, the least significant sections of a and b (stored in a_lsb and b_lsb) as well as g are read to determine which state operation of Algorithm 2 is selected. For example, if a is even, then the first constant-time condition (a%2 = 0) in Algorithm 2 is met. g is decremented as g ← g − 1 since a is being reduced by 1 bit.

2) Next, flags are set according to the determined state. For this case, sub ← 0, target ← 0, and numshift ← 1.

3) The control unit begins reading sections from a, u, m, a_0, and b_0. Values b, y, and n are not needed during this operation, so they are not read. From here, the data sections from a are fed to PipeAB, sections from u and a_0 are fed to PipeUY, and sections from m and b_0 are fed to PipeMN. The sub, target, and numshift flags travel with the data, and the valid, first, and final flags are appended to the proper data sections.

4) Data populate the pipelines in parallel. For this state, a just needs to be reduced by a factor of 2, so the AddSub() block in PipeAB is disabled, and data get shifted by numshift = 1. For PipeUY and PipeMN, the AddSub() blocks are also disabled, so data pass through to the first Add() or Sub(). When this happens, the first dec unit observes the first flag from the first section of u passing through. If u is even, then the first Add() and Sub() blocks in PipeUY and PipeMN are disabled; if u is odd, they are enabled. The data then pass through the first Shift(1) block, which shifts them by 1. At this point, the operation is complete, so the second dec block in PipeUY disables the remaining stages of Add()/Sub() and Shift(1) in both PipeUY and PipeMN.

5) Two clock cycles after data enter the processing pipelines, the first section of a exits PipeAB and is sent back to variable memory. The target flag determines whether to save the resultant values in a or b (here a, since target = 0). After three more clock cycles, the first sections of u and m exit PipeUY and PipeMN (five-cycle pipeline latency) and are saved to variable memory via the same method.

6) At this point, the first section of data has already returned to the variable memory, with its lowest bits updated in a_lsb. With a_lsb, b_lsb, and g all updated, the control unit can determine the next state ahead of time. The system continues reading sections of a, u, m, a_0, and b_0 and sending them to the processing pipelines until the final section has been sent.

7) The control unit then increments the iteration count in constant-time mode, or checks whether the output of PipeAB is 0 in variable-time mode, and terminates if the corresponding condition is met. This sequence of procedures repeats until the termination condition is met.
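The section-by-section streaming of a shift can be modeled compactly in Python (Q-bit limbs, least significant section first; this is a dataflow model, not the RTL):

```python
def shift_right_sections(sections, Q, numshift):
    """Right-shift a large integer stored as Q-bit sections (least
    significant section first), the way data stream through a
    Shift(numshift) block one section per clock: each output section
    borrows its top numshift bits from the *next* incoming section,
    which is why sections are streamed LSB-first. Requires
    0 <= numshift < Q."""
    mask = (1 << Q) - 1
    out = []
    for i, s in enumerate(sections):
        nxt = sections[i + 1] if i + 1 < len(sections) else 0
        out.append(((s >> numshift) | (nxt << (Q - numshift))) & mask)
    return out
```

Because each output section only needs the following input section, the shift can be computed on the fly as sections arrive, which is what keeps the pipeline latency independent of N.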

2) Final Step: When the termination condition is met, we enter the postprocessing and data shift-out phase, the Final Step in Algorithm 2. Flags are set to perform the final addition and potential value inversion, with sub ← 0 and numshift ← 0. The control unit begins reading sections from a, b, u, y, m, and n. Sections from a and b are fed to PipeAB, u and y to PipeUY, and m and n to PipeMN. The values are added together using the first AddSub() of each pipeline, with the remaining pipeline stages disabled. The results a + b, u + y, and m + n are returned to memory as they sequentially exit the pipelines. When the final result section of a + b exits PipeAB, the data are analyzed to determine whether a + b, u + y, and m + n must be inverted.

VI. FELIX: EVALUATION AND COMPARISON
This section gives a detailed evaluation of the proposed FELIX on the FPGA platform. Results cover the implementation and comparison with a state-of-the-art high-performance XGCD accelerator, comparison with an existing lightweight XGCD design, and evaluation under the FALCON parameters.

A. Performance Metrics
We adopt common FPGA performance metrics such as Throughput, Throughput per Slice (TPS), and Area-Time Product (ATP). One shortcoming of a metric like ATP, generally calculated as (LUTs × Latency), is that it does not account for other resource usage, such as BRAM and DSP. So, we additionally present results for the equivalent number of slices (ENS) and the Equivalent ATP (eATP). ENS is effective as a holistic resource-usage metric and is computed according to Table IV [19], with eATP = ENS × Latency.
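The metric computation itself is straightforward; the sketch below illustrates the idea with placeholder conversion weights, since the paper's actual factors come from Table IV of [19] and are not reproduced here:

```python
def equivalent_slices(luts, ffs, brams, dsps,
                      w_lut=0.25, w_ff=0.125, w_bram=196.2, w_dsp=100.0):
    """Illustrative ENS computation: fold every resource type into one
    slice-equivalent figure so that the area-time product can account for
    BRAM/DSP usage, which plain ATP (LUTs x Latency) ignores. The default
    weights here are placeholders for this sketch, not the values from
    Table IV of [19]."""
    return luts * w_lut + ffs * w_ff + brams * w_bram + dsps * w_dsp

def eatp(ens, latency):
    """eATP = ENS x Latency (latency in whatever unit the table reports)."""
    return ens * latency
```

A design that trades LUTs for BRAMs can look cheap under ATP yet expensive under eATP, which is exactly the effect the comparison with [6] below relies on.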

B. Experimental Setup
To test the effectiveness of the proposed FELIX, we have implemented the design in various (N, Q, R_E, R_O) configurations on the FPGA platform. The experimental setup is as follows.
1) The accelerator code is written in VHDL, with functionality verified in ModelSim.

2) We have identified parameter combinations (N, Q, R_E, R_O) that effectively demonstrate the performance of FELIX compared with the existing literature.

3) We have implemented the design (in both constant- and variable-time modes) in Vivado 2020.2 on the Kintex-7 (XC7K410TFBG676-1) for N = 1024 and on the UltraScale+ (XCZU7EG-FFVF1517-3-E) for N = 2048 and N = 6479. Note that, for a fair comparison, we do not include ASIC- or software-based designs (such as [8] and [10]) in the direct comparison, though we discuss them in Section VI-F. The FPGA utilization results are post place-and-route, but the design was not run on a physical FPGA, so the listed frequencies may be slightly higher than what is achievable in a real system. Finally, the FALCON reference code for XGCD (zint_bezout(·) from keygen.c [11]) was benchmarked on the CPU as follows: 1) the microbenchmark support library from Google [20] is used as the benchmark library; 2) a single core of an AMD Ryzen Threadripper 3960X processor (@3.8 GHz) is used; 3) Ubuntu 20.04 LTS (in a virtual machine) was used for testing; and 4) g++ 9.4.0 was used to compile the code with optimization flags disabled.
C. Comparison With High-Performance Variable-Time XGCD

1) Comparison: Table V reports the FELIX implementation results for comparison with the only (to our knowledge) high-performance large integer XGCD implementation on FPGA [4]. The design from [4] is also based on a modified two-bit PM with (R_E = 8, R_O = 4). Since [4] operates in variable time with N = 1024, we implemented FELIX with (N = 1024, Q = 32/64/128) and (R_E = 8/16/32, R_O = 8/16/32). We focus on the higher performance configurations of FELIX, omitting lower Q, R_E, and R_O. Also, (N = 1024, Q = 128, R_E = 32, R_O = 32) is not recorded because of inefficient pipeline utilization, as discussed in the proposed algorithm section (Section IV-B).
2) Analysis: Since [4] targets high-performance applications while FELIX targets low resource usage and scalability, it is unsurprising that [4] utilizes significantly more ENS but is generally much faster. Still, the FELIX configuration Q = 32 with (R_E = 8, R_O = 8) achieves an 81% reduction in eATP over [4]. Further, the fastest presented FELIX configuration, Q = 128 with (R_E = 16, R_O = 16), achieves a 96% reduction in ENS and is only 6.2× slower than [4]. As expected, increasing Q significantly reduces processing time, but at the cost of roughly doubled resource usage and a lower maximum frequency. The optimal selection based on eATP appears to be Q = 32/64. For the (R_E, R_O) configuration, (8, 8) gave the best eATP, with the extension to (16, 16) offering no eATP improvement.

D. Comparison With Lightweight Constant-Time XGCD

1) Comparison: Table VI reports the FELIX implementation results for comparison with a lightweight constant-time XGCD architecture [6], which follows a sequential processing strategy similar to that of FELIX. The design from [6] is based on a variant of the binary EEA from [5]. Since [6] operates in constant time with N = 2048 and Q = 16/32/64/128/256, we implemented FELIX with those same parameters and (R_E = 2, R_O = 4), our constant-time configuration. [6] does not indicate whether 18 or 36 Kb BRAMs are used, so we conservatively assess them as 18 Kb.
2) Analysis: FELIX outperforms [6] in almost all metrics. Since [6] utilizes BRAMs, it does achieve lower resource usage in terms of FF/CLB (LUTs are not reported). However, when accounting for BRAMs, FELIX achieves significantly lower ENS, eATP, and Latency. In particular, the fastest implementation (Q = 256) achieves 45% lower ENS, 97% lower eATP, and 95% lower Latency than [6] with the same Q. Even when configured for the lowest possible resource usage, the FELIX implementation for Q = 16 achieves 23% lower ENS than [6] with the same Q while still achieving a 95% reduction in Latency. The BEEA algorithm used in [6] requires some integer division, which appears to lengthen its time per iteration compared with FELIX's use of simple operations in the two-bit PM. Further, our strategy of implementing memory with LUT/FF instead of BRAM yields better ENS usage, especially for storing wide Q-bit data.

E. Comparison With FALCON XGCD Reference Code

1) Comparison: Table VII reports the FELIX implementation results for comparison with the FALCON reference code [11], which is written in C and was tested on a CPU. Since the FALCON reference documents do not indicate whether XGCD should be computed in constant or variable time, we present our results for both. Further, since the FALCON XGCD is implemented on the CPU, we only compare the Latency.
2) Analysis: For the constant-time implementation (R_E = 2 and R_O = 4), the fastest implementation of FELIX (Q = 512) achieves a 90% reduction in Latency over the reference implementation. In terms of the lowest resource usage, the Q = 32 implementation of FELIX achieves an ENS of just 0.50 K while still producing a 21% reduction in Latency.
For the variable-time implementation, the proposed accelerator with Q = 512 and (R_E = 32, R_O = 32) yields the fastest overall processing and produces a 95% reduction in Latency over [11]. In terms of eATP, FELIX (Q = 128) with (R_E = 32, R_O = 32) leads to the lowest eATP of 1036.03 K, while still giving 85% lower Latency compared with the reference implementation of [11].

F. Some Additional Considerations With Other Works
Note that other GCD/XGCD implementations exist, such as [10], [21], and [22], but we do not explicitly compare against them here since they are software- or ASIC-based and have already been discussed and outperformed by [4]. However, we note that [8] is an ASIC-based XGCD design and represents the fastest existing accelerator in the literature. Although a direct comparison with an ASIC-based design is difficult, from [8, Table III] one can observe that [8] involves around 50.5% less eATP than [4] for N = 1024. Since our design achieves up to an 81% reduction in eATP compared with [4] on the FPGA platform, we conclude that FELIX indirectly outperforms [8].

G. Discussion and Future Work
From the above analysis, it can be seen that FELIX performs very competitively with [4], significantly outperforms [6], and effectively implements the XGCD found in FALCON [11]. Regarding [4], it is clear that their accelerator targets vastly different applications from FELIX, and the authors indicate that ASIC is the primary target of their work. However, their FPGA implementation is the most advanced such design in the literature and is thus helpful for comparison using metrics like ATP and eATP, where our accelerator performs favorably. Deshpande et al. [6] is effective for comparison because it implements a sequential processing strategy similar to FELIX. Considering this was the only recent lightweight implementation we could find in the literature, the need for a lightweight and scalable XGCD with reasonable efficiency like FELIX becomes even more evident. Finally, we believe we have sufficiently proven the viability of a reasonably lightweight XGCD accelerator for extremely large parameter sets like the one used in FALCON. As new cryptographic schemes are developed, we hope our research provides a foundation for potentially even larger parameters such as N = 16384.

VII. CONCLUSION
This article presents an FPGA-based scalable and lightweight accelerator for XGCD computation (FELIX). We have developed a modified version of the two-bit PM algorithm for XGCD computation suitable for pipelined sequential processing, along with an FPGA-based accelerator that efficiently implements its functionality. We found that our proposed accelerator achieves more efficient area-time complexity than the state-of-the-art high-performance and lightweight accelerators in the literature. Finally, we proved the viability of an FPGA-based accelerator for extremely large parameter sets such as the one used in FALCON, which had not yet been shown in the literature (to the best of our knowledge). We hope this work can positively impact related research in the field.
Given two integer inputs a and b, the basic GCD computes GCD(a, b), while the extended version, XGCD, computes GCD(a, b) together with a pair of Bézout coefficients b_a and b_b satisfying the Bézout identity GCD(a, b) = a × b_a + b × b_b.
For example, modular inversion computes (a^(-1) mod b) by solving GCD(a, b) = a × b_a + b × b_b. If GCD(a, b) = 1 (meaning a and b are coprime), then (b_a mod b) can be returned as the modular inverse. The challenge is to solve for GCD(a, b), b_a, and b_b when the inputs a and b are very large.
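This inversion route can be demonstrated with the textbook division-based extended Euclid, the software-style baseline whose divisions the PM approach avoids (a small self-contained sketch, not FELIX's algorithm):

```python
def xgcd(a, b):
    """Classic extended Euclidean algorithm. Returns (g, s, t) with
    g = GCD(a, b) = a*s + b*t. Note each iteration needs an integer
    division -- exactly the operation the two-bit PM method replaces
    with adds, subtracts, and shifts."""
    old_r, r = a, b
    old_s, s = 1, 0
    old_t, t = 0, 1
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_s, s = s, old_s - q * s
        old_t, t = t, old_t - q * t
    return old_r, old_s, old_t

def modinv(a, b):
    """a^(-1) mod b via the Bezout coefficient b_a, as described above."""
    g, ba, _ = xgcd(a, b)
    if g != 1:
        raise ValueError("a and b are not coprime")
    return ba % b
```

For cryptographic parameter sizes (e.g., the 6479-bit FALCON integers), the quotients q and multiprecision divisions make this loop expensive, which motivates division-free hardware formulations.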
Algorithm 2: Proposed New Two-Bit PM Algorithm for Large Integer XGCD Computation

The invariants a = u × a_0 + m × b_0 and b = y × a_0 + n × b_0 are always true, with a ← a_0, b ← b_0, u ← 1, m ← 0, y ← 0, and n ← 1 on start-up. Similar to Algorithm 1, we proceed by reducing a or b directly when a or b is even, or reducing a + b or a − b when a and b are both odd, with each iteration shrinking the operands by the configured reduction factor.
If a + b is positive (i.e., its MSB is 0), then the results are returned as GCD(a_0, b_0) ← a + b, b_a ← u + y, and b_b ← m + n. If a + b is negative, then the GCD(a_0, b_0) and Bézout results need to be inverted, and the results are returned as GCD(a_0, b_0) ← −(a + b), b_a ← −(u + y), and b_b ← −(m + n).
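The loop cases and this final combine/invert step fit together as in the following behavioral sketch (plain Python with a fixed reduction factor of 2 per step rather than the configurable (R_E, R_O); the invariant pairing and correction signs are our reconstruction, and a_0 is assumed odd):

```python
def xgcd_pm(a0, b0):
    """Behavioral sketch of a two-bit-PM-style binary XGCD. Maintains the
    invariants a = u*a0 + m*b0 and b = y*a0 + n*b0, so that on termination
    GCD(a0, b0) = a + b, b_a = u + y, and b_b = m + n, matching the final
    combine step described above. Assumes a0 is odd (as in
    modular-inversion use cases); reduction factor is 2 per step."""
    a, b = a0, b0
    u, m, y, n = 1, 0, 0, 1

    def halve(p, q):  # divide p*a0 + q*b0 by 2 while keeping p, q integers
        if p % 2 == 0 and q % 2 == 0:
            return p // 2, q // 2
        return (p + b0) // 2, (q - a0) // 2  # value-preserving correction

    while a != 0 and b != 0:
        if a % 2 == 0:
            a //= 2
            u, m = halve(u, m)
        elif b % 2 == 0:
            b //= 2
            y, n = halve(y, n)
        elif a >= b:                 # both odd: reduce the larger via (a-b)/2
            a = (a - b) // 2
            u, m = halve(u - y, m - n)
        else:
            b = (b - a) // 2
            y, n = halve(y - u, n - m)

    g, ba, bb = a + b, u + y, m + n  # final combine step
    if g < 0:                        # mirrors the Final Step inversion; not
        g, ba, bb = -g, -ba, -bb     # triggered with the nonnegative values here
    return g, ba, bb
```

Because each branch preserves its invariant exactly, the Bézout identity a_0 × b_a + b_0 × b_b = GCD(a_0, b_0) holds by construction at termination.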

TABLE V: COMPARISON WITH HIGH-PERFORMANCE VARIABLE-TIME XGCD ACCELERATOR BASED ON FPGA IMPLEMENTATION

TABLE VI: COMPARISON WITH LIGHTWEIGHT CONSTANT-TIME XGCD ACCELERATOR BASED ON FPGA IMPLEMENTATION

TABLE VII: COMPARISON WITH FALCON XGCD REFERENCE IMPLEMENTATION PERFORMANCE