Pipelined Key Switching Accelerator Architecture for CKKS-Based Fully Homomorphic Encryption

The increasing ubiquity of big data and cloud-based computing has led to growing concerns regarding the privacy and security of user data. In response, fully homomorphic encryption (FHE) was developed to address this issue by enabling arbitrary computation on encrypted data without decryption. However, the high computational cost of homomorphic evaluation restricts the practical application of FHE schemes. To tackle these computational and memory challenges, a variety of optimization approaches and acceleration efforts are actively being pursued. This paper introduces the KeySwitch module, a highly efficient and extensively pipelined hardware architecture designed to accelerate the costly key switching operation in homomorphic computations. Built on top of an area-efficient number-theoretic transform design, the KeySwitch module exploits the inherent parallelism of the key switching operation and incorporates three main optimizations: fine-grained pipelining, efficient on-chip resource usage, and high-throughput implementation. An evaluation on the Xilinx U250 FPGA platform demonstrated a 1.6× improvement in data throughput over previous work with more efficient hardware resource utilization. This work contributes to the development of advanced hardware accelerators for privacy-preserving computation and promotes the adoption of FHE in practical applications with enhanced efficiency.


Introduction
With the explosion of Internet-of-Things data and the widespread use of machine learning (ML) as a cloud-based service, securing private user data during ML inference has become a pressing concern for cloud-service providers. Fully homomorphic encryption (FHE) is a promising solution for preserving sensitive information in cloud computing because it provides strong defense mechanisms and enables direct computation on encrypted data (ciphertext) while preserving confidentiality [1,2]. However, the requirement for high degrees of security leads to complex parameter settings, resulting in expensive computation on large ciphertexts, which limits the practical realization of FHE-based applications. Cloud-side analytics can be resource-intensive and time-consuming, making it necessary to develop cryptographic accelerators to facilitate the deployment of real-world applications. Cryptographic accelerators are designed to reduce the computational overhead of homomorphic functions, thus enabling faster and more efficient computation on encrypted data. The development of such accelerators is crucial to unlock the full potential of FHE-based solutions, making them more accessible to a wider range of users and supporting the secure processing of sensitive data in real-world settings. Figure 1 illustrates an end-to-end FHE-based cryptosystem with primary homomorphic operations performed in the cloud server.
FHE cryptographic protocols typically involve integer-and lattice-based schemes. The most efficient lattice-based schemes rely on the ring learning with errors (RLWE) problem, which provides strong security guarantees and the desired performance [3].
In RLWE-based FHE protocols, the input messages are encrypted by adding noise, and the generated ciphertexts are composed of two polynomial rings. The growth of noise through homomorphic computations limits the circuit depth, and the selection of FHE parameters must balance the security requirements with computational complexity [4]. Parameter selection primarily involves the polynomial degree N and the modulus Q; at least 128-bit security is typically required to guard against unpredictable attacks [5]. To support greater multiplicative depth, N must increase proportionally. High-circuit-depth FHE schemes inevitably have the drawback of large ciphertexts, which leads to expensive computations, high-bandwidth data movement, and large storage-space requirements. Primary homomorphic operations involve addition, multiplication, and permutation of ciphertexts. Homomorphic multiplication between ciphertexts is often computationally expensive because of the convolution of polynomial coefficients. Figure 2 shows a general diagram of the multiplication between two ciphertexts, which dominates homomorphic operations. Initially, a ciphertext consists of two component polynomials. Ciphertext multiplication results in a tuple of three polynomials, making further computation challenging. Thus, an operation is required to revert the ciphertext to its original form. An expensive operation known as key switching is required to relinearize the ciphertext. However, key switching is computationally intensive, with number theoretic transform (NTT) and inverse NTT (INTT) operations being dominant. Therefore, developing key switching hardware accelerators is significant for speeding up homomorphic multiplication and realizing FHE-based applications.

Related Works
While FHE holds potential, its primary limitation is inefficiency, which stems from two factors: complex polynomial operations and time-consuming ciphertext management. To tackle the computational and memory demands of homomorphic functions, various optimization and acceleration efforts are underway. Table 1 presents FHE accelerators, highlighting the hardware utilized and features of the accelerators. Initially, FHE acceleration depended on general hardware features. However, CPUs lack the capacity to effectively harness FHE's inherent parallelism [6]. GPU-based implementations tap into this parallelism, but GPU's extensive floating-point units remain underused as FHE tasks mainly involve integer operations [7][8][9]. Furthermore, neither CPUs nor GPUs offer sufficient main memory bandwidth to cope with FHE workload's data-intensive nature. To enhance FHE scheme performance, researchers have been exploring custom hardware accelerators using ASIC and FPGA technologies. ASIC solutions [10][11][12][13] show promise, as they surpass CPU/GPU implementations and bridge the performance gap between plaintext and ciphertext computations. However, to accommodate large on-chip memory, expensive advanced technology nodes such as 7 nm or 12 nm are required for ASIC implementations. Furthermore, designing and fabricating these ASIC proposals demand significant engineering time and high non-recurring costs. Since FHE algorithms are not standardized and continue to evolve, any changes would necessitate major ASIC redesign efforts. Conversely, FPGA solutions are more cost-effective than ASICs, offer rapid prototyping and design updates, and are better equipped to adapt to future FHE algorithm modifications.
Several studies have proposed FPGA-accelerated architecture designs for FHE [14][15][16][17][18][19]. Notably, Riazi et al. introduced HEAX, a hardware architecture that accelerates CKKS-based HE on Intel FPGA platforms and supports low parameter sets [14]. However, the architecture requires high input/output and memory interface bandwidths, as well as costly internal memory, making it difficult to place and route multiple cores on the target FPGA platform. Han et al. proposed coxHE, an FPGA acceleration framework for FHE kernels using the high-level synthesis (HLS) design flow [16]. Targeting key switching operations, coxHE examined data dependences to minimize interdependence between data, maximizing parallel computation and algorithm acceleration. Mert et al. proposed Medha, a programmable instruction-set architecture that accelerates cloud-side RNS-CKKS operations [17]. Medha featured seven residue polynomial arithmetic units (RPAU), a memory-conservative design, and support for multiple parameter sets using a single hardware accelerator with a divide-and-conquer technique. However, these three FPGA-based implementations only support small parameter sets, insufficient for bootstrapping. Recently, Yang et al. proposed Poseidon, an FPGA-based FHE accelerator supporting bootstrapping on the modern Xilinx U280 FPGA [18]. Poseidon employed several optimization techniques to enhance resource efficiency. Similarly, Agrawal et al. presented FAB, an FPGA-accelerated design that balances memory and computing consumption for bootstrapping with large homomorphic parameters [19]. FAB accelerates CKKS bootstrapping using a carefully designed datapath for key switching, taking full advantage of the 43 MB of on-chip storage. However, the design's extensive parallelism consumes numerous logic elements, especially with larger parameter sets.
Additionally, inefficient scheduling can result in redundant resource consumption and complex workflow synchronization, leading to suboptimal performance. In this work, we adopt a pipelined KeySwitch design to simplify scheduling and target a high-throughput implementation. Our design method leverages the FPGA fabric's programmable logic elements and enhances on-chip memory utilization.

Our Main Contributions
This study presents a comprehensive hardware architecture for the KeySwitch accelerator design, which operates in a highly pipelined manner to speed up CKKS-based FHE schemes. Built on compact NTT and INTT engines [20], the KeySwitch module efficiently employs on-chip resources. Importantly, our design approach significantly reduces internal memory consumption, allowing on-chip memory to hold temporary data. The design executes subfunctions concurrently in a pipelined and parallel manner to boost throughput. We demonstrate an example design supporting a three-level parameter set. The proposed KeySwitch module was evaluated on the Xilinx UltraScale+ XCU250 FPGA platform, and we provide an in-depth discussion of the design methodology and area breakdown for a better understanding of key operations. Compared with the most closely related study, our KeySwitch module achieves a 1.6× higher throughput rate and superior hardware efficiency.
The remainder of this paper is organized as follows: Section 2 provides an overview of the underlying operations of RLWE-based HE schemes. Section 3 describes the key switching algorithm in detail, and Section 4 presents the design of our KeySwitch module. Section 5 presents the experimental results, compares our approach with related works, and discusses our findings. Finally, Section 6 concludes the study.

Background
CKKS-based HE schemes have been extensively studied to perform meaningful computations on encrypted data of real and complex numbers. In the encrypted data domain, the ciphertext often consists of two N-degree polynomials, and each coefficient is an integer modulo Q. Therefore, the underlying homomorphic operations in RLWE-based HE schemes share similarities, enabling the development of a single hardware accelerator that can support multiple HE instances. Our study primarily focuses on accelerating CKKS-based homomorphic encryption; however, the operations described at the ciphertext level have broad applicability to almost all lattice-based homomorphic encryption schemes.

Residue Number System
The Chinese remainder theorem (CRT) enables a polynomial in R_Q to be represented as an RNS decomposition with smaller pairwise coprime moduli such that Q = ∏_{i=0}^{L} q_i [21]. This enables a polynomial a in R_Q to be represented across RNS channels as a set of polynomial components. For instance, considering an RNS representation with three pairwise coprime moduli q_0, q_1, q_2, the polynomial a can be represented as a set of three polynomials: a ≡ (a_0, a_1, a_2) mod (q_0, q_1, q_2), where each a_i is a polynomial in R_{q_i}. This technique can significantly reduce the magnitude of coefficients and improve the performance of arithmetic operations in HE.
We denote the polynomial ring associated with each modulus as R_{q_i} = Z_{q_i}[X]/(X^N + 1). Thus, arithmetic operations on large integer coefficients can be performed for each smaller modulus without any loss of precision.
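The channel-wise arithmetic described here can be illustrated with plain integers: a value modulo Q = q_0 · q_1 · q_2 is held as three small residues, and multiplication proceeds independently per channel. A minimal sketch with toy moduli (real CKKS moduli are NTT-friendly primes of roughly 40-60 bits):

```python
# Toy RNS arithmetic: represent a in Z_Q by its residues modulo
# pairwise-coprime q_i, and operate channel-wise.
Q_PRIMES = [97, 101, 103]          # toy pairwise-coprime moduli
Q = 97 * 101 * 103                 # Q = prod q_i = 1009091

def to_rns(a):
    """Map a in Z_Q to its residue vector (a mod q_0, a mod q_1, a mod q_2)."""
    return [a % q for q in Q_PRIMES]

def rns_mul(x, y):
    """Channel-wise multiplication: no big-integer arithmetic needed."""
    return [(xi * yi) % q for xi, yi, q in zip(x, y, Q_PRIMES)]

a, b = 123456, 654321
prod = rns_mul(to_rns(a), to_rns(b))
assert prod == to_rns((a * b) % Q)   # matches multiplication modulo Q
```

In hardware, each RNS channel maps naturally to an independent arithmetic unit, which is the parallelism the KeySwitch module exploits.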

Gadget Decomposition
Let q be the modulus and g = (g_0, g_1, . . . , g_{d−1}) ∈ Z^d be a gadget vector. A gadget decomposition [22], denoted by g^{−1} : Z_q → Z^d, maps an integer a ∈ Z_q to a vector g^{−1}(a) ∈ Z_q^d such that ⟨g^{−1}(a), g⟩ = a (mod q). By extending the domain of the gadget decomposition g^{−1} from Z_q to R_q, we can apply it to a polynomial a = ∑_{i∈[N]} a_i · X^i in R_q by mapping each coefficient a_i to a vector g^{−1}(a_i) ∈ Z_q^d, which yields a vector of d polynomials. This extension was proposed by [23].
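For the common base-B gadget vector g = (1, B, B², . . . , B^{d−1}), the decomposition g^{−1}(a) is simply the base-B digit expansion of a. A sketch with illustrative parameters, checking the defining identity ⟨g^{−1}(a), g⟩ = a (mod q):

```python
# Base-B gadget decomposition: g = (1, B, B^2, ..., B^(d-1)).
q, B, d = 2**16, 2**4, 4            # toy parameters chosen so that B^d >= q

def gadget_inv(a):
    """Digit-decompose a in Z_q into d base-B digits (the vector g^{-1}(a))."""
    return [(a // B**i) % B for i in range(d)]

g = [B**i for i in range(d)]        # the gadget vector itself
a = 51427 % q
digits = gadget_inv(a)
# Defining property: <g^{-1}(a), g> = a (mod q)
assert sum(di * gi for di, gi in zip(digits, g)) % q == a
```

The point of the decomposition is that the digits are small (bounded by B), which keeps noise growth under control when they are multiplied by switching-key components.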
RNS representation can also be integrated with prime decomposition, as exemplified in [24]. An element a ∈ R_Q can be represented in RNS form as ([a]_{q_i})_{0≤i≤l} ∈ ∏_{i=0}^{l} R_{q_i}. The inverse mapping, which retrieves the original element a from its RNS form, is given by a = ∑_{i=0}^{l} [a]_{q_i} · q̂_i · [q̂_i^{−1}]_{q_i} (mod Q), where q̂_i = Q/q_i.
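The inverse CRT mapping can be sketched in a few lines: writing q̂_i = Q/q_i, the residues are recombined as ∑_i [a]_{q_i} · q̂_i · [q̂_i^{−1}]_{q_i} mod Q, and the per-modulus constants are precomputable (toy moduli again):

```python
# CRT reconstruction: recover a in Z_Q from its RNS residues.
Q_PRIMES = [97, 101, 103]
Q = 97 * 101 * 103

def from_rns(residues):
    """a = sum_i [a]_{q_i} * q_hat_i * [q_hat_i^{-1}]_{q_i} mod Q."""
    acc = 0
    for a_i, q_i in zip(residues, Q_PRIMES):
        q_hat = Q // q_i                    # q_hat_i = Q / q_i
        q_hat_inv = pow(q_hat, -1, q_i)     # [q_hat_i^{-1}]_{q_i}, precomputable
        acc += a_i * q_hat * q_hat_inv
    return acc % Q

a = 123456
assert from_rns([a % p for p in Q_PRIMES]) == a   # a < Q, so recovered exactly
```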

Key Generation
The client begins by generating a secret key sk, which is a polynomial in R_Q. Then, they generate a uniformly random polynomial r from U(R_Q) and an error (noise) polynomial e from a distribution χ. The corresponding public key is generated as pk = (b, r) ∈ R_Q², where b is obtained by taking the inner product of r and a fixed vector s and adding the error polynomial e, that is, b = ⟨r, s⟩ + e.
Let sk′ be a different secret key. We sample D_1 ← U(R_Q^L) and e ← χ^L. Using the gadget vector g, we compute D_0 = −sk′ · D_1 + sk · g + e (mod Q) and return the switching key SwK = (D_{0,j} | D_{1,j}), in which each D_j is a vector of polynomials d_i ∈ ∏_{i=0}^{l} R_{q_i} [23].

Encryption and Decryption
CKKS encodes a vector of at most N/2 real values into a plaintext polynomial m of N coefficients, modulo q. Using the generated public key pk = (b, r), the client encrypts an input message and produces a noisy ciphertext ct = (c_0, c_1) ∈ R_Q², where c_0 = r_1 · r + e_0 and c_1 = r_1 · b + m + e_1; here, r_1 is another uniformly random polynomial and e_0 and e_1 are fresh noise polynomials. After homomorphic computations on ciphertexts, the client obtains the results in encrypted form ct = (c_0, c_1) and uses the secret key to recover the desired information. Decryption is performed as m̃ = c_1 − c_0 · sk ≈ m + e with a small error.
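The encryption/decryption round trip can be checked numerically with a toy scheme: with pk = (b, r) and b = r · s + e as above, decrypting via c_1 − c_0 · s leaves m plus a small combination of the noise terms. All parameters below (N = 8, q = 2^20, ternary noise) are insecure illustrative stand-ins, and the sketch omits CKKS encoding and scaling:

```python
import random

# Toy RLWE-style encryption/decryption over R_q = Z_q[X]/(X^N + 1).
N, q = 8, 2**20

def poly_mul(a, b):
    """Schoolbook multiplication in Z_q[X]/(X^N + 1) (negacyclic wrap)."""
    res = [0] * N
    for i in range(N):
        for j in range(N):
            v = a[i] * b[j]
            if i + j < N:
                res[i + j] = (res[i + j] + v) % q
            else:                        # X^N = -1: wrapped terms flip sign
                res[i + j - N] = (res[i + j - N] - v) % q
    return res

def add(a, b):
    return [(x + y) % q for x, y in zip(a, b)]

def small():                             # ternary noise/secret polynomial
    return [random.choice([-1, 0, 1]) for _ in range(N)]

random.seed(0)
s = small()                                      # secret key sk
r = [random.randrange(q) for _ in range(N)]      # uniform polynomial
e = small()
b = add(poly_mul(r, s), e)                       # pk = (b, r) with b = r*s + e

m = [random.randrange(16) for _ in range(N)]     # small message coefficients
r1, e0, e1 = small(), small(), small()
c0 = add(poly_mul(r1, r), e0)                    # c0 = r1*r + e0
c1 = add(add(poly_mul(r1, b), m), e1)            # c1 = r1*b + m + e1

# Decryption: c1 - c0*s = m + (r1*e + e1 - e0*s), i.e., m plus small noise.
dec = [(x - y) % q for x, y in zip(c1, poly_mul(c0, s))]
noise = []
for d_i, m_i in zip(dec, m):
    diff = (d_i - m_i) % q
    noise.append(diff if diff <= q // 2 else diff - q)
assert max(abs(t) for t in noise) <= 2 * N + 1   # ternary products stay tiny
```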

Homomorphic Operations
Homomorphic addition: Taking ciphertexts a = (a_0, a_1) and b = (b_0, b_1) as an example, their homomorphic addition is computed by coefficient-wise addition of the corresponding pairs of RNS-element polynomials: c_add = (a_0 + b_0, a_1 + b_1).

Homomorphic multiplication: For ciphertexts a = (a_0, a_1) and b = (b_0, b_1), their homomorphic multiplication is performed by multiplications between their RNS elements: c_mult = (a_0 · b_0, a_0 · b_1 + a_1 · b_0, a_1 · b_1). This dyadic multiplication produces a special ciphertext component a_1 · b_1 for a different secret key (that is, sk²). Subsequently, key switching is performed to relinearize the quadratic form of the homomorphic multiplication result and obtain a linear ciphertext of the original form.

Key switching: RLWE ciphertexts can be transformed from one secret key to another using the key switching computation with SwK. This method enables the transformation of a ciphertext decryptable by sk into a new ciphertext under a different secret key sk′ with an additional error e_KS. The SwK consists of d encryptions of sk · g_i under the different secret key sk′, that is, SwK · (1, sk′) ≈ sk · g (mod Q) [23].
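The three-component structure of homomorphic multiplication is a ring identity: if a decrypts via a_0 + a_1 · s and b via b_0 + b_1 · s, their product decrypts via d_0 + d_1 · s + d_2 · s², which is why the d_2 term needs key switching back to the (1, s) form. A sketch verifying the identity over a toy ring (parameters illustrative):

```python
import random

N, q = 8, 2**20

def poly_mul(a, b):
    """Multiplication in Z_q[X]/(X^N + 1) (schoolbook negacyclic convolution)."""
    res = [0] * N
    for i in range(N):
        for j in range(N):
            v = a[i] * b[j]
            if i + j < N:
                res[i + j] = (res[i + j] + v) % q
            else:                        # X^N = -1 wraps with a sign flip
                res[i + j - N] = (res[i + j - N] - v) % q
    return res

def add(a, b):
    return [(x + y) % q for x, y in zip(a, b)]

random.seed(1)
rand_poly = lambda: [random.randrange(q) for _ in range(N)]
a0, a1, b0, b1, s = (rand_poly() for _ in range(5))

# Tensor product of (a0, a1) and (b0, b1):
d0 = poly_mul(a0, b0)
d1 = add(poly_mul(a0, b1), poly_mul(a1, b0))
d2 = poly_mul(a1, b1)                 # term attached to s^2 -> needs key switching

# Identity: (a0 + a1*s)(b0 + b1*s) == d0 + d1*s + d2*s^2 in R_q
lhs = poly_mul(add(a0, poly_mul(a1, s)), add(b0, poly_mul(b1, s)))
rhs = add(d0, add(poly_mul(d1, s), poly_mul(d2, poly_mul(s, s))))
assert lhs == rhs
```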

Key Switching Algorithm
Algorithm 1 provides a detailed description of homomorphic multiplication with a key switching operation, which is a crucial building block of the SEAL HE library [6]. One remarkable feature of homomorphic multiplication is that the NTT is a linear transformation, and optimized HE implementations typically store polynomials in NTT form across operations instead of in coefficient form. Therefore, the first phase of homomorphic multiplication involves dyadic multiplication. The use of the Karatsuba algorithm, a fast multiplication technique, reduces the total number of coefficient-wise multiplications from four to three. Dyadic multiplication produces a tuple of polynomials (ct_{0,i}, ct_{1,i}, ct_{2,i}), where ct_{2,i} is a special component that encrypts the square of the secret key; that is, the tuple decrypts under (1, s, s²). To recombine the homomorphic products and obtain a linear ciphertext in the form (1, s), key switching is required to make ct_{2,i} decryptable with the original secret key; the result is recombined as c = (ct_0, ct_1) + KeySwitch(ct_2, SwK) (mod Q). Key switching is a computationally intensive operation that typically dominates the cost of homomorphic multiplication. The key switching operation requires two inputs: the polynomial component ct_{2,i} and the switching key matrix SwK. The polynomial component ct_{2,i} is represented in RNS form as (l + 1) residue polynomials, whereas the switching key matrix SwK = (D_{0,j} | D_{1,j}) is a tensor of (l + 1) matrices of (L + 2) residue polynomials. RNS decomposition is used to enable fast key switching with a highly parallel and pipelined implementation.
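The four-to-three reduction is the Karatsuba identity a_0 b_1 + a_1 b_0 = (a_0 + a_1)(b_0 + b_1) − a_0 b_0 − a_1 b_1, so the cross term reuses the two products already computed. A sketch with single modular integers standing in for NTT-domain coefficient lanes (the function name is ours):

```python
q = 2**20   # illustrative coefficient modulus

def dyadic_karatsuba(a0, a1, b0, b1):
    """Compute (ct0, ct1, ct2) with 3 multiplications instead of 4."""
    m0 = (a0 * b0) % q                               # ct0
    m2 = (a1 * b1) % q                               # ct2
    cross = ((a0 + a1) * (b0 + b1) - m0 - m2) % q    # ct1 = a0*b1 + a1*b0
    return m0, cross, m2

a0, a1, b0, b1 = 123, 456, 789, 1011
ct0, ct1, ct2 = dyadic_karatsuba(a0, a1, b0, b1)
assert ct1 == (a0 * b1 + a1 * b0) % q
```

Because ciphertexts are kept in NTT form, the same identity applies independently to every coefficient lane, so hardware can trade one multiplier for two adders per lane.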
Algorithm 1 shows that key switching involves l INTT and l² NTT operations for raising the modulus, and two INTTs and 2l NTTs for modulus switching. Thus, key switching dominates the homomorphic multiplication process in terms of computational cost. However, at the l-depth level, the main costs are memory expense and data movement. To illustrate the efficient utilization of the on-chip resources of the FPGA platform, we used a parameter set of five modulo primes as a running example. The implementation results indicate that the proposed approach maximizes the utilization of hardware resources.

Algorithm 1 Homomorphic multiplication algorithm with a key switching operation [6]
Input: a = (a_0, a_1) and b = (b_0, b_1) ∈ (∏_{i=0}^{l} q_i)²
Output: c = (c_0, c_1) ∈ (∏_{i=0}^{l} q_i)²
1: /* Dyadic multiplication */
2: for i = 0 to l do
3:   ct_{0,i} = a_{0,i} · b_{0,i}
4:   ct_{2,i} = a_{1,i} · b_{1,i}
5:   ct_{1,i} = a_{0,i} · b_{1,i} + a_{1,i} · b_{0,i}
6: end for
7: /* Key switching */
8: for i = 0 to l do                ▷ Modulus raising
9:   ã ← INTT_{q_i}(ct_{2,i})
10:  for j = 0 to l do
11:   if i = j then
       ⋮ (modulus raising, multiply-accumulate with SwK, and modulus switching; steps 12-32)
33: end for
34: return c = (c_0, c_1)

The key switching operation is computationally intensive, with NTT and INTT operations being dominant. In an FHE setting, ciphertext polynomials are represented in NTT form by default to reduce the number of NTT/INTT conversions. However, this format is not compatible with the rescaling operation that occurs during moduli switching. Therefore, the key switching process performs INTT and NTT operations before and after rescaling, respectively. Consequently, the primary computational costs associated with key switching are the NTT and INTT operations. Conventionally, NTT and INTT units consume a large amount of internal memory to store precomputed twiddle factors (TFs). In this study, the proposed KeySwitch module employs in-place NTT and INTT hardware designs that aim to reduce on-chip memory usage [20]. In particular, each NTT and INTT unit stores several TF bases of the associated modulus and utilizes a built-in twiddle factor generator (TFG) to produce all other factors at runtime. Based on the design method of [20] and an exploration of the key switching execution, we designed different NTT modules for the associated moduli across pipeline stages. By adopting this approach, the proposed KeySwitch module utilizes hardware resources more efficiently.
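The TFG idea — keeping only a base root of unity and producing each twiddle ω^k from the previous one with a single modular multiplication, instead of reading a precomputed table — can be sketched in software. The parameters below (q = 257, N = 8, a plain cyclic NTT rather than the negacyclic variant used for X^N + 1) are toy choices for illustration; the in-place hardware design of [20] is structured differently:

```python
# Radix-2 NTT over Z_q with twiddles generated on the fly:
# each combine stage keeps a running power of omega instead of a table.
q = 257                  # NTT-friendly Fermat prime: 2^8 | q - 1
N = 8
g = 3                    # 3 is a primitive root modulo 257
omega = pow(g, (q - 1) // N, q)       # primitive N-th root of unity (= 64)

def ntt(a, w):
    if len(a) == 1:
        return a
    even = ntt(a[0::2], w * w % q)    # half-size transforms use w^2
    odd = ntt(a[1::2], w * w % q)
    half = len(a) // 2
    out = [0] * len(a)
    wk = 1                            # twiddle "generator": w^k built incrementally
    for k in range(half):
        t = wk * odd[k] % q
        out[k] = (even[k] + t) % q
        out[k + half] = (even[k] - t) % q
        wk = wk * w % q               # next twiddle from the previous one
    return out

def intt(a):
    n_inv = pow(N, -1, q)
    res = ntt(a, pow(omega, -1, q))   # inverse transform uses omega^{-1}
    return [x * n_inv % q for x in res]

x = [3, 1, 4, 1, 5, 9, 2, 6]
assert intt(ntt(x, omega)) == x       # round trip recovers the input
```

The running product `wk = wk * w % q` is the software analogue of the TFG: one multiplier replaces a table of N/2 precomputed constants per stage, which is where the on-chip memory savings come from.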

KeySwitch Hardware Architecture
In the ModRai module, the first INTT operation transforms a sequence of (l + 1) input polynomials into the associated modulus (op 1). The next stage performs MOD operations on the previous INTT results for the (l + 2) moduli. Because operations on the individual (l + 2) moduli are independent under RNS decomposition, we can perform (l + 2) MODs in parallel (op 2) to efficiently pipeline the computation. Modular multiplication (ModMul) also requires the original input polynomial, which reduces the number of MODs on the (l + 2) moduli to (l + 1) MODs at a time. Figure 4 shows the selectable MOD outputs. Subsequently, the (l + 1) NTT modules must run in parallel for the subsequent NTT computations (op 3). Once the NTT computations are complete, the ModMul module performs modular multiplications with the SwK following Algorithm 1. To simultaneously generate two relinearized vectors, we deployed 2 × (l + 2) ModMul modules (op 4). After the ModMul product, the results were stored in the following memory banks (ops 5 and 6, respectively). We used two UltraRAM (URAM) banks, which are large high-speed memory elements, to store two polynomials with five RNS components. After accumulating (l + 1) polynomials in the URAMs, the ModRai module transferred the temporary data to the ModSwi module memory and continued accumulating the next polynomials. The dyadic multiplication and accumulation performed after the NTT is denoted MAR, and its detailed structure is shown in Figure 5.

The ModSwi module performed the second part of the key switching operation after (l + 1) iterations. In this step, temporary data from ModRai were received and stored in RAM banks (op 7). The following INTT unit transformed only the two polynomials with the associated special modulus q_sp (op 8). The ModSwi module then performed the flooring operation with (l + 1) MR units and (l + 1) NTT computations (ops 9 and 10, respectively).
For the ModMul operation with the 51-bit modulus, the coefficients were compared with half of q_sp, and the subtraction with the residue of q_sp modulo q_i was then determined [6]. At the end of the flooring, subtraction from the ModRai outputs and subsequent multiplication by the inverse of the special prime were performed for the two polynomials of RNS components in parallel (ops 11 and 12, respectively). Op 13 added the remaining two components of the homomorphic multiplication results to the outputs of the flooring operation and generated the relinearized ciphertext. The output of the key switching operation consists of two polynomials of RNS components, which are referred to as c_0 and c_1 of the key-switched ciphertext c.
The pipeline timing for the key switching operation is shown in Figure 6, where each pipeline stage comprises a series of consecutive operations separated by a few cycles. Each square block represents the approximate delay of a one-polynomial NTT computation. The ModRai unit raises the modulus in a highly pipelined manner, with the results stored in RAM until all input moduli are transformed (op 6). Subsequently, the ModSwi module performs the modulus switching operation for only the two polynomials with the associated special modulus. In a pipelined operation, modulus switching has a timing delay of two square blocks. However, the delay gap between consecutive key switching operations depends on the number of modulo primes, which affects the accumulation latency in the ModRai module.

The same SwK matrices can be reused for all homomorphic multiplication operations at a specific level. However, these matrices are often too large to be stored in on-chip memory, leading to significant data movement overhead and a bottleneck in the overall performance of the cryptosystem. Thus, reducing data movement between on-chip and external memory is critical for improving the efficiency of the system.

Evaluation Results
We developed the proposed KeySwitch architecture in SystemVerilog HDL as register-transfer-level (RTL) designs. We then performed logic synthesis for the Xilinx UltraScale+ XCU250 FPGA platform using the Xilinx Vivado (v2020.1) tool. The KeySwitch hardware design stores the TF bases in Block RAM (BRAM) units and keeps temporary data in URAM. For our chosen parameter settings, we kept the SwK in the main memory and supplied it to the KeySwitch module for verification. With default synthesis settings, the KeySwitch module achieved a maximum clock frequency of 236 MHz.
The security level of our KeySwitch design is based directly on the CKKS FHE primitive [25], without introducing any functional modifications. Parameter choices, such as the polynomial degree N and modulus size log Q, significantly influence the security and achievable multiplication depth of a CKKS instance. In this research, we opted for a large log Q to allow for a high circuit depth and increased N to ensure a higher security level. Specifically, we set N = 2^16 and a large modulus of log Q = 1760 bits to achieve 128-bit security [26]. These parameters allowed for a multiplication depth of up to 32 levels during ciphertext evaluation. Implementations with a circuit depth less than 32 yield a security level greater than 128 bits. We used L = 3 as a study example throughout this evaluation to illustrate the effectiveness of our proposed KeySwitch module in comparison to prior work.
The synthesis results for our proposed KeySwitch module, which supports five moduli, are presented in Table 2. In the initial design, we stored all the TF constants for the utilized moduli in on-chip BRAM. This conventional approach required a large amount of storage for precomputed TFs, leading to memory overhead. By effectively integrating the TFG into the NTT and INTT hardware designs, we significantly improved internal memory utilization. The NTT design approach employing runtime TFG led to a remarkable reduction (approximately 99%) in on-chip memory usage compared with the traditional method of storing all precomputed TFs in memory. Furthermore, this approach resulted in a moderate increase (around 21%) in DSP slices, accompanied by a negligible rise in logic elements. These outcomes highlight the effectiveness of the KeySwitch module regarding on-chip resource utilization, allowing more internal memory to be allocated to evaluation keys and temporary data during calculations.

To provide a comprehensive breakdown of on-chip resource usage, Tables 3 and 4 detail the FPGA hardware utilization of the ModRai and ModSwi modules, respectively. The functional modules corresponding to the operations shown in Figure 3 were synthesized and reported separately. This approach facilitates a more precise assessment of resource utilization. With the NTT and INTT modules operating on a single modulus, we were able to derive the TF memory from LUTRAM instead of BRAM, resulting in significant savings in on-chip RAM utilization. Additionally, it is worth noting that a 60-bit integer multiplier necessitated twelve DSP slices, whereas a 51-bit integer multiplier necessitated only six. As a result, we developed various NTT modules for different moduli to maximize the utilization of DSP slices. Table 3 shows that the ModRai module dominates on-chip resource consumption in the KeySwitch hardware design.
In particular, the INTT unit consumed 12.5 BRAMs to store the TF bases of four moduli. The moduli switching circuit used more LUT elements and FFs. The first three NTT units alternately operated on two modulo primes each and shared the multiplexing circuit from the previous MOD units to select the appropriate modulus. The associated RTL designs of these NTT units are denoted NTTq_{0123}, in which NTTq_{01}, NTTq_{12}, and NTTq_{23} consume 564, 282, and 282 DSP slices and 12.5, 11, and 11 BRAMs, respectively. For dyadic multiplication and accumulation, we grouped the RTL modules into designs denoted MARs of the corresponding modulus primes. Each unit simultaneously processes 16 coefficients during key switching. Table 4 provides a clear breakdown of the hardware consumption of the subunits in the ModSwi module. The INTT and NTT units in this module operate on only a single modulus, which is why we derived the TF memories from LUTRAM. To simplify the design, we grouped the RTL modules of the four NTT units into a single design, denoted NTTq_{0123}, because they share the same control circuit. The DSP utilization of the NTT unit of q_0 was 564 DSP slices, whereas each NTT unit of the other moduli q_i consumed only 282 slices.
For the KeySwitch module, we utilized URAM to construct temporary data memory units. Using our design method, we confirmed that the memory unit of each RNS component consistently consumed 16 URAM blocks for the 16-bank data memory units.

Comparison with Related Works
For comparison with a software implementation, we used a computer system equipped with an Intel Core i9-9900KF CPU and 32 GB of DDR4 DRAM running the Windows 10 operating system. We installed version 3.7 of the widely used SEAL HE library [6] and executed the switch_key_inplace() routine to evaluate the execution time of key switching. Latency measurements were performed using C++ chrono functions. We then extracted test vectors from the SEAL source code and ran them through the KeySwitch module for verification. As shown in Table 5, our KeySwitch design achieved a speedup of approximately 113.4× over the software implementation.
The most suitable comparison for our key switching accelerator is with HEAX [14]. In Table 6, we compare the efficiency of both KeySwitch hardware designs. Even though the polynomial sizes differ, both studies performed key switching at the same circuit depth, enabling fair comparisons. We calculated data throughput and assessed hardware efficiency metrics for this comparison. Our KeySwitch module design operates at a lower clock frequency on Xilinx FPGA technology than HEAX but achieves a 1.6× higher data throughput. Comparing LUT efficiencies is impractical due to structural differences between Intel FPGA's ALM elements and Xilinx FPGA's LUT elements. Differences in DSP slice structures between the two FPGA technologies led to distinct modulus bit width selections. Although our design exhibited lower DSP efficiency, we employed enhanced Barrett-based modular multiplication and reasonable numbers of DSP slices, combined with lightweight modular reduction. Our KeySwitch design also used flip-flops more effectively than HEAX for pipelined registers. Importantly, our proposed KeySwitch design achieved a 2.15× improvement in RAM efficiency. Despite a 10× larger polynomial size, our KeySwitch module consumed 1.3× less internal RAM than HEAX. The primary advantage of our design lies in the use of TFG modules in the NTT and INTT hardware designs, as well as the minimal number of TF constants stored in on-chip memory.
Comparing with other FPGA-based implementations: Medha presents a single hardware design for RNS-CKKS acceleration using a Xilinx Alveo U250 FPGA, offering a versatile instruction-set architecture that supports two HE parameter sets (Set-1: N = 2^14, log Q = 438 bits; Set-2: N = 2^15, log Q = 564 bits) [17]. With a 497.24 µs execution time for homomorphic multiplication with Set-1, Medha reaches a throughput rate of 14,431 Mbps. In contrast, our design employs a pipelined strategy, achieving 3.4× higher throughput than Medha at the cost of increased hardware resource usage. Poseidon, an FPGA-based FHE accelerator featuring bootstrapping capabilities, utilizes optimization methods to enhance resource efficiency [18]. By leveraging an advanced Xilinx Alveo U280 FPGA with high-bandwidth memory (HBM), Poseidon reports a key switching latency of 218.6 µs for a specific parameter set (N = 2^16, L = 44). Our KeySwitch module exhibits a comparable execution time of 284.6 µs, but with reduced hardware overhead. FAB, an additional U280 FPGA-based FHE accelerator with bootstrapping support, refines on-chip memory access to remove memory-access-related bottlenecks [19]. For a parameter set of (N = 2^14, log Q = 438 bits), FAB attains an execution time of 180.3 µs for homomorphic multiplication and a throughput rate of 39,802 Mbps, which is marginally lower than our KeySwitch module's 49,046 Mbps. Nonetheless, FAB consumes a higher hardware ratio than our design, with the exception of DSP slices. To summarize, our design focuses on accelerating key switching using pipelined and parallel implementations. By deploying the processor in consecutive pipeline stages, key switching operations are unrolled, resulting in high asymptotic throughput with minimal hardware resource overhead. Comparing with 100×, the GPU-based FHE implementation by Jung et al.
[8]: 100× focuses on large parameter sets (N = 2^16, log Q = 2364 bits and N = 2^17, log Q = 3220 bits) and achieves a significant speedup for CKKS compared with previous GPU-based attempts. Through memory-centric improvements, 100× enhances overall performance and reaches an acceleration rate more than 100 times faster than single-threaded CPU execution. Although it is challenging to make a fair comparison between their work and our architecture, our KeySwitch module attains a similar processing time with a more adaptable and customizable FPGA implementation. In addition, other studies demonstrate impressive performance by utilizing modern GPU features, such as tensor cores. For instance, TensorFHE accelerates NTT computation by adopting GPU fine-grained operation and data parallelism [9]. However, TensorFHE still faces suboptimal acceleration due to GPU architectural limitations.

Limitation of This Study
As shown in Figure 7, the SwK is larger than the on-chip BRAM capacity. Although the SwK is reusable, storing the entire SwK in the internal memory of existing FPGA devices remains an overhead. To mitigate this issue, we reserved on-chip BRAMs for intermediate data and stored the SwK test vectors in the main memory. However, this made data movement from the main memory a critical performance bottleneck, limiting the acceleration of the KeySwitch module. An effective solution to further increase the main memory bandwidth is to use alternative main memory technologies, such as HBM [27]. HBM can provide several times higher bandwidth than DDRx technology, thus improving the performance of the KeySwitch module.
Han and Ki proposed a method to reduce the length of the SwK by using a decomposition number (dnum) to split the SwK and decompose ciphertexts into dnum slices [26]. However, increasing dnum also increases the number of SwK components. To overcome this limitation, we can load each component only when it is needed for computation, which reduces the number of accesses to the external memory during key switching. Choosing a proper dnum is crucial to strike a balance between the multiplication depth and the homomorphic evaluation complexity. Furthermore, the NTT and INTT units perform computations iteratively, and the SwK components can be cached in the internal buffer over time. Therefore, the use of dnum can significantly reduce the SwK length, and careful consideration of the trade-offs can enhance the overall performance of the KeySwitch module.

Conclusions
This study proposed an efficient hardware design for the KeySwitch module that accelerates homomorphic multiplication by utilizing efficient NTT and INTT engines. The KeySwitch module achieves high hardware efficiency by exploiting on-chip resources and reducing internal memory consumption. The pipelined key switching operation also enables fast homomorphic multiplication with high throughput rates.
In the future, the proposed KeySwitch module can be applied to accelerate realistic HE-based applications, such as logistic regression inference and simple convolutional neural networks. Efficient NTT and INTT hardware designs can support large circuit depths, making the instruction-set KeySwitch architecture a promising approach for practical HE-based applications. Further research should investigate the integration of the proposed KeySwitch module with other HE-based cryptographic schemes to develop a more comprehensive hardware acceleration platform.