Optimized Homomorphic Evaluation of Boolean Functions

Abstract. We propose a new framework to homomorphically evaluate Boolean functions using the Torus Fully Homomorphic Encryption (TFHE) scheme. Compared to previous approaches focusing on Boolean gates, our technique can evaluate more complex Boolean functions with several inputs using a single bootstrapping. This allows us to greatly reduce the number of bootstrapping operations necessary to evaluate a Boolean circuit compared to previous works, thus achieving significant performance improvements. We define our approach theoretically: it consists in adding an intermediate homomorphic layer between the plain Boolean space and the ciphertext space. This layer relies on so-called $p$-encodings, which embed bits into $\mathbb{Z}_p$. We analyze the properties these encodings must satisfy to enable the evaluation of a given Boolean function and provide a deterministic algorithm (as well as an efficient heuristic) to find valid sets of encodings for a given function. We also propose a method to decompose any Boolean circuit into Boolean functions which are efficiently evaluable using our approach. We apply our framework to homomorphically evaluate various cryptographic primitives, and in particular the AES cipher. Our implementation results show significant improvements compared to the state of the art.


Introduction
Homomorphic encryption (HE) is a cryptographic technique allowing the computation of operations on encrypted messages (which directly reflect on the original messages once decrypted), using only knowledge of public data. For example, an additive homomorphic encryption scheme is able to encrypt two messages $m_1$ and $m_2$ in ciphertexts $c_1$ and $c_2$ and to compute a third ciphertext $c_3$ from $c_1$ and $c_2$ that encrypts the sum $m_1 + m_2$, without knowledge of the secret key.
The security of these schemes typically relies on a small noise introduced in the data when encrypting. The problem is that this noise grows as homomorphic computations are carried out, which eventually buries the original data in the noise and makes it unrecoverable at decryption. In 2009, Gentry [Gen09] introduced the operation of bootstrapping to solve this problem. This operation resets the noise to a nominal level without decryption, allowing a potentially unlimited number of operations and making the construction of a Fully Homomorphic Encryption (FHE) scheme possible. As this operation is extremely heavy and slow, it is considered the main bottleneck for the development of schemes efficient enough to be used in practice.
Currently, the most popular schemes in the FHE ecosystem are lattice-based and rely on the hardness of the Learning With Errors (LWE) assumption [Reg05] and/or its ring variant RLWE [LPR10]. BFV [Bra12], BGV [BGV12] and CKKS [CKKS17] are leveled schemes, which means that they keep track of the "level" of noise in the data during the homomorphic evaluation. As soon as this level reaches a critical bound, no more computations can be performed. Some recent works (see e.g. [CHK+18], [CCS19], [LLL+20]) propose a bootstrapping operation for these schemes to overcome this limit. On the other hand, TFHE [CGGI18] is built on top of a powerful bootstrapping technique, currently known to be the most efficient, but which limits the precision of encrypted data.
Each FHE scheme offers a set of basic homomorphic operations that can be used to build more complex algorithms. In general, these operations are homomorphic additions and multiplications; however, some complex operations cannot be constructed with these operations only. TFHE offers homomorphic additions and multiplications by a plaintext as well, but its strength lies in its programmable bootstrapping operation, which allows the evaluation of encrypted look-up tables (LUT) while resetting the noise level. However, for performance reasons, these look-up tables can only handle a small number of input bits (around 8 bits maximum), so the scheme is best suited for applications requiring a small precision.
In particular, TFHE is the best option to evaluate Boolean circuits with encrypted inputs, but the performance of the existing frameworks is still limited. In [CGGI18], the authors propose a strategy to evaluate Boolean functions called gate bootstrapping, in which they perform one bootstrapping for each bivariate Boolean gate of the underlying circuit. As a consequence, the conversion of the original Boolean circuit into a homomorphic circuit handling encrypted bits is straightforward, and the noise growth is contained thanks to the systematic use of bootstrapping. However, this approach is very expensive due to the high number of bootstrappings, which makes it highly suboptimal for large circuits.
The authors of [CLOT21] propose a different approach: by leveraging a newer version of the TFHE scheme supporting a new operation named TLWE ciphertext multiplication, Boolean circuits are evaluated with homomorphic sums for XOR gates and this new multiplication operation for AND gates. While this approach is clearly an improvement over the vanilla framework, we note that a few bootstrappings are still required to control the noise growth and that this new TLWE multiplication remains costly both in terms of performance and in terms of noise. Thus, we choose to stick to the first version of the TFHE scheme (while slightly modifying it) to keep the framework lighter, and we tackle the performance issues of [CGGI18] with a different approach than the one of [CLOT21].
Our work introduces a new framework to homomorphically evaluate Boolean functions on encrypted data efficiently, i.e. by reducing the number of necessary bootstrappings. Our approach introduces an intermediate homomorphic layer which encodes bits on a small ring $\mathbb{Z}_p$ before encrypting them. This allows us to evaluate Boolean functions with one cheap homomorphic sum followed by one bootstrapping. After formalizing the underlying concept of $p$-encoding and explaining our evaluation strategy, we investigate the issue of finding valid sets of encodings for a Boolean function. We characterize this problem and provide an exact constructive algorithm to solve it. We further provide a sieving heuristic that finds solutions more efficiently, at the cost of losing optimality. Since our method is only efficient for Boolean functions with a limited number of inputs, we also propose a heuristic to decompose any Boolean circuit into Boolean functions which are efficiently evaluable using our approach. Finally, we apply our technique to various cryptographic primitives, namely the SIMON block cipher, the Trivium stream cipher, the Keccak permutation, the Ascon s-box and the AES s-box. Compared to previous works implementing the same primitives (for SIMON, Trivium and AES), our implementations achieve significant speedups.
After some technical preliminaries on TFHE (Section 2), we introduce a new concept of intermediate homomorphic layer and explain how bits are encoded in Section 3, and we present the algorithms to construct it in Sections 4 and 5. Finally, we describe our modifications of the TFHE scheme in Section 6 and our experimental results in Section 7.

Notations
Let $\mathbb{T} = \mathbb{R}/\mathbb{Z}$ be the real torus, that is to say the additive group of real numbers modulo 1. In practice, torus elements are not represented with an infinite number of digits, but are discretized. Let us denote this precision in base 2 as $\Omega$. We can define the discretized torus $\mathbb{T}_q = \{ \frac{a}{q} \mid a \in \mathbb{Z}_q \}$ (the elements of the torus up to $\Omega$ bits of precision, with $q = 2^\Omega$) and identify it with the ring $\mathbb{Z}_q$. As a consequence, any element $\frac{a}{q}$ of $\mathbb{T}_q$ is represented in machine by $a$ without any loss of property of the group $\mathbb{T}_q$. The operations of sum $+$ and external product $\cdot$ have to be understood modulo $q$.
Moreover, for a natural integer $N$ and a given $q$, we denote by $\mathbb{T}_{N,q}[X]$ the ring of polynomials $\mathbb{T}_q[X]/(X^N + 1)$. The elements of this ring are polynomials of degree at most $N - 1$ with coefficients in $\mathbb{T}_q$. As for the scalar version, this ring is identified with the ring $\mathbb{Z}_{N,q}[X] = \mathbb{Z}_q[X]/(X^N + 1)$. $N$ is usually taken as a power of two.
Finally, we denote by $\mathbb{B}$ the set of binary digits $\{0, 1\}$. The symbols $\&$ and $\oplus$ denote the AND and XOR binary operations. For $x, q \in \mathbb{Z}$, $[x]_q$ denotes the reduction of $x$ modulo $q$. For a set $S$, $x \xleftarrow{\$} S$ denotes a uniformly random sampling from the set. For a distribution $\chi$, $x \xleftarrow{\$} \chi$ denotes a random sampling according to the distribution.

Complexity Assumptions
The TFHE scheme, like other lattice-based schemes, relies on the hardness of the LWE assumption. More precisely, it relies on the torus-based version of the problem. In the following, we consider the classic definition but over a discretized torus and with a binary secret:
Definition 1 (LWE problem over the discretized torus). Let $q, n \in \mathbb{N}$ and let $s = (s_1, \dots, s_n) \xleftarrow{\$} \mathbb{B}^n$. Let $\chi$ be an error distribution over $\mathbb{Z}_q$. The decisional Learning With Errors over discretized torus problem is to distinguish samples chosen with the following distributions:
$$D_0 = \{ (a, r) \mid a \xleftarrow{\$} \mathbb{Z}_q^n,\ r \xleftarrow{\$} \mathbb{Z}_q \}$$
and:
$$D_1 = \{ (a, b) \mid a \xleftarrow{\$} \mathbb{Z}_q^n,\ b = \sum_{j=1}^{n} a_j \cdot s_j + e,\ e \xleftarrow{\$} \chi \}.$$
The search version of the problem is to recover $s$ from the samples of $D_1$.
Both the search and decisional problems are reducible to each other [Reg05] and their average case is as hard as worst-case lattice problems.
[Joy22] argues that identifying the discretized torus $\mathbb{T}_q$ with $\mathbb{Z}_q$ makes the LWE assumption over the discretized torus as hard as the standard LWE assumption.
TFHE relies as well on the generalized version of LWE over rings introduced in [BGV12], named GLWE.
Definition 2 (GLWE problem over the discretized torus). Let $N, q, k \in \mathbb{N}$ with $N$ a power of two and let $s = (s_1, \dots, s_k) \xleftarrow{\$} \mathbb{B}_N[X]^k$. Let $\chi$ be an error distribution over $\mathbb{Z}_{N,q}[X]$. The General decisional Learning With Errors over discretized torus problem is to distinguish samples chosen with the following distributions:
$$D_0 = \{ (a, r) \mid a \xleftarrow{\$} \mathbb{Z}_{N,q}[X]^k,\ r \xleftarrow{\$} \mathbb{Z}_{N,q}[X] \}$$
and:
$$D_1 = \{ (a, b) \mid a \xleftarrow{\$} \mathbb{Z}_{N,q}[X]^k,\ b = \sum_{j=1}^{k} a_j \cdot s_j + e,\ e \xleftarrow{\$} \chi \}.$$
The search version is analogous to the LWE one.
Note that RLWE is simply an instantiation of GLWE with $k = 1$. The complexity analysis is analogous to the LWE version. In practice, the error distribution $\chi$ is a centered Gaussian distribution parametrized by its standard deviation $\sigma$. Before describing the TFHE scheme in more depth, it is useful to define the plaintext space and how it is embedded in the discretized torus.

Plaintext Space
The plaintext space is the ring $\mathbb{Z}_p$, with $p \in \mathbb{N}$. For now, let us assume that $p \mid q$ and identify $\mathbb{Z}_p$ with $\mathbb{T}_p$. As $p \mid q$, all elements of $\mathbb{T}_p$ are elements of $\mathbb{T}_q$ as well. Thus, we can define a mapping $\rho : \mathbb{Z}_p \to \mathbb{Z}_q$ as $\rho : m \mapsto \frac{mq}{p}$. Of course, only $p$ elements of $\mathbb{Z}_q$ are reached by such a mapping, and they form the set $\{ \frac{kq}{p} \mid k \in \mathbb{Z}_p \}$. As they are evenly distributed across $\mathbb{Z}_q$, they define what we call sectors of $\mathbb{Z}_q$, of the form:
$$\left[ \frac{kq}{p} - \frac{q}{2p},\ \frac{kq}{p} + \frac{q}{2p} \right) \quad \text{for } k \in \mathbb{Z}_p.$$
The embedding of $\mathbb{Z}_p$ in $\mathbb{Z}_q$ is illustrated in Figure 1.
During the encryption of $m$, some small noise $e$ is drawn from a Gaussian distribution over $\mathbb{Z}_q$ and is added to the encoded message. As $e$ is small, the noisy message $m + e$ stays in the same sector as $m$, but while homomorphic operations are carried out, the noise grows and may overflow out of the sector. When decrypting, one recovers the sum $m' + e'$ of the expected result and some noise. As long as $|e'| < \frac{q}{2p}$, the message $m'$ can be recovered by rounding to the closest center of sector.
In our work, we pick odd values for $p$. Since $q$ is a power of 2 in practice, this implies that $p$ does not divide $q$. This enables nice features explained in Section 6. Consequently, the centers and the bounds of sectors are computed by rounding the fractions to the closest integers. In practice, $p$ is much smaller than $q$ ($p$ is restricted to a few bits, while $q$ typically equals $2^{32}$ or $2^{64}$), so this approximation is sound. In the following, we will ignore this rounding.
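The embedding of $\mathbb{Z}_p$ into $\mathbb{Z}_q$ with an odd $p$ can be sketched on clear values as follows. This is only an illustration under our own assumptions: the names `encode` and `decode` and the constants `Q` and `P` are hypothetical, and no encryption is involved.

```python
Q = 2**32  # ciphertext modulus q (a power of two); illustrative value
P = 7      # odd plaintext modulus p, so p does not divide q

def encode(m: int, p: int = P, q: int = Q) -> int:
    """Map m in Z_p to the rounded centre of its sector in Z_q."""
    # (2*a + b) // (2*b) is round(a / b) in exact integer arithmetic.
    return ((2 * (m % p) * q + p) // (2 * p)) % q

def decode(x: int, p: int = P, q: int = Q) -> int:
    """Round a noisy element of Z_q to the closest sector centre."""
    return ((2 * (x % q) * p + q) // (2 * q)) % p

# A noise smaller than q / (2p) (about 3.07e8 here) is removed by decoding.
noise = 300_000_000
recovered = [decode(encode(m) + noise) for m in range(P)]
```

With these parameters, a noise slightly above $\frac{q}{2p}$ pushes the value into the neighbouring sector and decoding returns the wrong message.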

Ciphertexts Types and Basic Operations
TFHE manipulates several different types of ciphertexts. In the following, we explain their structure:
• TLWE ciphertexts: The message $m$ to be encrypted is encoded as an element of $\mathbb{T}_q$. A mask $a = (a_1, \dots, a_n)$ is drawn uniformly from $\mathbb{T}_q^n$ and a noise error $e$ is sampled from $\chi$. Using the secret key $sk = (s_1, \dots, s_n) \in \mathbb{B}^n$, the body of the ciphertext is defined by $b = \sum_{j=1}^{n} a_j \cdot s_j + m + e$. Finally, the TLWE ciphertext is $c = (a, b)$. The decryption is performed by calculating the phase $\varphi(c) = b - \langle a, s \rangle = m + e$ and rounding to the closest center of sector.
• TRLWE ciphertexts: They have the same global structure as TLWE ones, except that the mask $a$ is sampled from $\mathbb{T}_{N,q}[X]^k$, the secret key from $\mathbb{B}_N[X]^k$ and the error from $\mathbb{T}_{N,q}[X]$. Some papers in the literature use the denomination TRLWE only if $k = 1$, and TGLWE otherwise. In this work, we do not make a difference between both cases.
During the bootstrapping phase presented in Section 2.5, another structure (the TRGSW ciphertext) is used, but we do not detail it as we will not need it. More details about TRGSW can be found in [Joy22].
Two basic homomorphic operations are straightforward with these two structures: the component-wise sum of two TLWE (resp. TRLWE) ciphertexts $c_1$ and $c_2$ produces a ciphertext $c_3$ encrypting the sum modulo $p$ of the two underlying messages $m_1$ and $m_2$. Moreover, the external product $\lambda \cdot c_1$ with $\lambda \in \mathbb{Z}$ produces an encryption of the product $[\lambda \cdot m_1]_p$. In the framework introduced by this paper, freshly encrypted ciphertexts are TLWE ciphertexts, and so are the ciphertexts handled during homomorphic computations. We only manipulate TRLWE ciphertexts during the BlindRotate phase of the bootstrapping, presented in Section 2.5.
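The TLWE structure and its two linear operations can be illustrated with a toy model. The sketch below is not a secure or faithful implementation: the dimension `N_LWE`, the uniform noise bounded by `NOISE` (the scheme uses a Gaussian) and all function names are assumptions made for illustration only.

```python
import random

Q = 2**32      # ciphertext modulus q
P = 7          # odd plaintext modulus p
N_LWE = 16     # toy dimension n, far too small to be secure
NOISE = 2**10  # toy noise bound (uniform here, Gaussian in the real scheme)

def encode(m):
    return ((2 * (m % P) * Q + P) // (2 * P)) % Q   # round(m * q / p)

def decode(x):
    return ((2 * (x % Q) * P + Q) // (2 * Q)) % P   # closest sector centre

def keygen(rng):
    return [rng.randrange(2) for _ in range(N_LWE)]  # binary secret key

def encrypt(m, sk, rng):
    a = [rng.randrange(Q) for _ in range(N_LWE)]     # uniform mask
    e = rng.randint(-NOISE, NOISE)                   # small noise
    b = (sum(ai * si for ai, si in zip(a, sk)) + encode(m) + e) % Q
    return a, b

def decrypt(c, sk):
    a, b = c
    phase = (b - sum(ai * si for ai, si in zip(a, sk))) % Q
    return decode(phase)

def add(c1, c2):
    """Component-wise sum: encrypts (m1 + m2) mod p."""
    (a1, b1), (a2, b2) = c1, c2
    return [(x + y) % Q for x, y in zip(a1, a2)], (b1 + b2) % Q

def scalar_mul(lam, c):
    """External product by a clear integer: encrypts (lam * m) mod p."""
    a, b = c
    return [(lam * x) % Q for x in a], (lam * b) % Q
```

Both operations also add or scale the noise terms, which is why a bootstrapping is eventually needed.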

TFHE programmable bootstrapping (PBS)
As defined by Gentry in [Gen09], the procedure of bootstrapping can be defined as the homomorphic evaluation of the decryption circuit. In the context of TFHE, the hardest part to compute is the rounding of the value to an element of $\mathbb{T}_p$ by removing the noise. To achieve this homomorphically, TFHE uses four procedures called ModulusSwitch, BlindRotate, SampleExtract and KeySwitch.

ModulusSwitch:
The high-level idea is to start by homomorphically computing the phase $\mu \in \mathbb{Z}_q$ and reducing it to $\tilde{\mu} \in \mathbb{Z}_{2N}$ by computing $\tilde{\mu} = \left\lfloor \frac{\mu \cdot 2N}{q} \right\rceil$. In practice, $N$ takes values between $2^{10}$ and $2^{13}$, so the most significant bits carrying the true value modulo $p$ are preserved.

BlindRotate:
Then, for a polynomial $v(X) \in \mathbb{Z}_{N,q}[X]$ with coefficients $v_j$, called the accumulator, one homomorphically multiplies $v(X)$ by $X^{-\tilde{\mu}}$ by blind rotation, which yields an encryption of the polynomial $X^{-\tilde{\mu}} \cdot v(X)$. By defining $v_j := \frac{1}{p} \left\lfloor \frac{jp}{2N} \right\rceil$ for all $j$, the blind rotation outputs an encrypted version of the message in the zero-degree coefficient. We do not explain here how this polynomial multiplication occurs; the reader is referred to [CGGI18] for a more detailed explanation. The procedure outputs a TRLWE ciphertext of dimension $k$ encrypting the polynomial $X^{-\tilde{\mu}} \cdot v(X)$. Note that the quotient polynomial of the ring has degree $N$ but $\tilde{\mu}$ lives in $\mathbb{Z}_{2N}$, so each coefficient $v_i$ can be reached both with a multiplication by $X^{-\tilde{\mu}}$ and with a multiplication by $X^{[N - \tilde{\mu}]_{2N}}$. In the latter case, the coefficient $v_i$ gets negated because of the ring modulus $X^N + 1$: we will refer to this problem as the negacyclicity problem. One way to prevent this issue is to ensure that the most significant bit of $\mu$ is fixed at 0 [Joy22], but a recent work [CLOT21] proposes a more sophisticated way to solve this problem. In our case, we use a modified version of the accumulator detailed in Section 6.

SampleExtract:
This step simply extracts the degree-zero coefficient of the previous polynomial. It takes as input the TRLWE ciphertext yielded by the BlindRotate step and outputs the TLWE ciphertext $c'$ encrypting the original message $m$. However, this ciphertext is not immediately usable for either further homomorphic computations or decryption, because it has length $kN + 1$ instead of $n + 1$ (and as a consequence is encrypted under a different TLWE key).

KeySwitch:
The previous step outputs the right value, but encrypted under a different set of parameters, i.e. $c' \in \mathbb{Z}_q^{kN+1}$ while we are looking for $c \in \mathbb{Z}_q^{n+1}$. The only thing left is to convert $c'$ to $c$, which requires key switching keys constructed from the secret key $sk$ used at encryption. More details about this specific step can also be found in [CGGI18].
This "bland" bootstrapping procedure simply refreshes the noise in the ciphertext to put it back at the "initial level", but it can very simply be turned into a programmable bootstrapping (PBS). Specifically, it can simultaneously evaluate homomorphically any function $f$ on the input. To achieve this, at the construction of the accumulator, each coefficient $v_j$ is replaced by its image $f(v_j)$. This feature is extremely powerful and is at the core of the potential of TFHE.
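The negacyclicity problem mentioned for BlindRotate can be observed on clear polynomials, without any encryption. The sketch below is a plaintext illustration under our own naming (`modulus_switch` and `rotate` are hypothetical helpers): it multiplies an accumulator by $X^{-\tilde{\mu}}$ in $\mathbb{Z}[X]/(X^N + 1)$ and reads the degree-zero coefficient, as SampleExtract would.

```python
def modulus_switch(mu: int, q: int, n: int) -> int:
    """Round mu in Z_q to an element of Z_{2N}: round(mu * 2N / q) mod 2N."""
    return ((2 * mu * 2 * n + q) // (2 * q)) % (2 * n)

def rotate(v, mu):
    """Coefficients of X^(-mu) * v(X) in Z[X]/(X^N + 1), for mu in Z_{2N}."""
    n = len(v)
    shift = (-mu) % (2 * n)          # X^(-mu) = X^(2N - mu), since X^(2N) = 1
    out = [0] * n
    for j, coeff in enumerate(v):
        e = (j + shift) % (2 * n)
        if e < n:
            out[e] += coeff
        else:                        # X^N = -1: wrap around with a sign flip
            out[e - n] -= coeff
    return out

N = 8
accumulator = list(range(1, N + 1))  # arbitrary test values v_0, ..., v_{N-1}
first_half = [rotate(accumulator, mu)[0] for mu in range(N)]
second_half = [rotate(accumulator, mu)[0] for mu in range(N, 2 * N)]
```

For $\tilde{\mu} < N$ the extracted coefficient is $v_{\tilde{\mu}}$, whereas for $\tilde{\mu} \geq N$ it is $-v_{\tilde{\mu} - N}$: the second half of the torus can only return negated values of the first half.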

Basics on Boolean Functions and Boolean Circuits
In this paper, we focus on the evaluation of Boolean functions with TFHE. A Boolean function has the form $f : \mathbb{B}^\ell \to \mathbb{B}$, with $\ell$ being called the arity of the function.
Definition 3. The Algebraic Normal Form (ANF) of a Boolean function $f : \{0, 1\}^\ell \to \{0, 1\}$ is a polynomial expression in which each term corresponds to a specific combination of the $\ell$ input variables. The ANF is defined as follows:
$$f(x_1, \dots, x_\ell) = a_0 \oplus a_1 x_1 \oplus a_2 x_2 \oplus \dots \oplus a_{2^\ell - 1} x_1 x_2 \cdots x_\ell$$
where $a_0, a_1, a_2, \dots, a_{2^\ell - 1} \in \{0, 1\}$ are the Boolean coefficients and $x_1, x_2, \dots, x_\ell$ are called the Boolean variables.
This result means that any Boolean function can be evaluated by means of AND and XOR operations. In the following, we will focus on the implementation of Boolean circuits composed of these operations exclusively.
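As an illustration of Definition 3, the ANF coefficients of a function can be computed from its truth table with the classical binary Möbius transform. This sketch is not taken from the paper; the function name `anf_coefficients` and the indexing convention (bit $i$ of the index selects variable $x_{i+1}$) are our own assumptions.

```python
def anf_coefficients(truth_table):
    """Binary Moebius transform: truth table -> ANF coefficients.

    truth_table[x] is f evaluated on the bits of x; the returned list a
    satisfies: a[u] = 1 iff the monomial made of the variables selected
    by the bits of u appears in the ANF of f.
    """
    size = len(truth_table)
    arity = size.bit_length() - 1
    if size == 0 or 1 << arity != size:
        raise ValueError("truth table length must be a power of two")
    a = list(truth_table)
    for i in range(arity):
        for x in range(size):
            if x & (1 << i):
                a[x] ^= a[x ^ (1 << i)]  # XOR in the sub-cube without bit i
    return a

# OR(x1, x2) = x1 XOR x2 XOR x1.x2, so all three non-constant monomials appear.
or_anf = anf_coefficients([0, 1, 1, 1])
```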
A Boolean function can be represented by its truth table, which is simply a table gathering all the possible inputs and the corresponding outputs of the function. It can also be represented with a Boolean formula. A third representation is the Boolean circuit:
Definition 4. A Boolean circuit associated to the Boolean function $f$ is a finite Directed Acyclic Graph whose edges are wires and whose vertices are Boolean gates representing Boolean operations. We consider AND gates and XOR gates, of fan-in 2 and fan-out 1. We also consider copy gates, of fan-in 1 and fan-out > 1, that output several copies of their input. A circuit is further formally composed of input gates of fan-in 0 and fan-out 1, and output gates of fan-in 1 and fan-out 0.
Evaluating an $\ell$-input $m$-output circuit consists in writing an input $x \in \mathbb{B}^\ell$ in the input gates, processing the gates from input gates to output gates, then reading the outputs from the output gates. This notion of Boolean circuit will be particularly useful in Section 5.

Boolean Encoding over $\mathbb{Z}_p$ and Homomorphic Evaluation Strategy Between $\mathbb{B}$ and $\mathbb{Z}_p$
To evaluate Boolean functions in TFHE, one could use the vanilla TFHE with $p = 2$. The problem is that the only evaluable function would be the XOR operation. To evaluate the other operators, the solution of [CGGI18], which is also implemented in the tfhe-rs library [Zam22b], is to take a larger $p$, specifically $p = 8$. This allows all the operations of the Boolean algebra to be carried out; however, the negacyclicity problem introduced in Section 2.5 arises because 8 is even. Their solution to this issue is to keep a bit of padding fixed to zero, i.e., the values in $\mathbb{Z}_p$ have their most significant bit fixed to zero. This restriction has a heavy impact on performance, because it requires a bootstrapping after each Boolean gate to make sure no data ever overflows into the most significant bit.
Our solution makes use of odd values for $p$, which allows us to remove this padding constraint and to perform more operations without bootstrapping. To do so, we had to slightly adapt the bootstrapping procedure of TFHE to support odd moduli. We explain this tweak in Section 6.
Moreover, the PBS described in Section 2.5 takes only one input and so can only evaluate univariate functions. The common solution to evaluate multivariate functions is to concatenate several input ciphertexts into one by shifting each input to a distinct bit position and summing them all. The problem is that the number of message bits cannot grow too much, because the other parameters of the LWE problem must grow accordingly. As a consequence, performance quickly degrades as the arity of the function increases. Our approach consists in removing the padding bit and using a combination of homomorphic additions before a PBS to evaluate a function for any number of inputs with the cost of a single PBS.
To this purpose, we propose a construction in which we embed Boolean values in $\mathbb{Z}_p$ for well-chosen values of $p$, forming an "intermediate homomorphic layer" between $\mathbb{B}$ and $\mathbb{Z}_q$. In the following, we explain how we define such a layer, and we describe our new strategy to evaluate Boolean functions in a more efficient way without considering the circuit representation of the function.

Encoding of $\mathbb{B}$ over $\mathbb{Z}_p$
To represent Boolean values over $\mathbb{Z}_p$, we use a mapping function that we call a $p$-encoding:
Definition 5 ($p$-encoding). A $p$-encoding is a function $E : \mathbb{B} \to 2^{\mathbb{Z}_p}$ that maps the Boolean space to a subset of the discretized torus. A $p$-encoding is valid if and only if: We call this last property relaxed negacyclicity.
In our approach, when we need to encrypt a bit, we apply a $p$-encoding to embed it in $\mathbb{Z}_p$, then we encrypt the result using the classical setup of TFHE. When new values are freshly encrypted or produced by a PBS, only one element of $\mathbb{Z}_p$ is chosen for each bit. We call such an encoding, in which $E(0)$ and $E(1)$ are both singletons, a canonical $p$-encoding.
Property 1 (Reduction to a canonical encoding). Let $E$ be a valid $p$-encoding and $E'$ a canonical $p'$-encoding. We denote $\alpha' = E'(0)$ and $\beta' = E'(1)$. Let $c$ be a ciphertext encrypting a bit $b$ under $E$. Then, one can produce a ciphertext $c'$ encrypting the same bit $b$ under $E'$ by applying a PBS on $c$. This PBS performs the function:
$$\mathrm{Cast}_{E \to E'} : x \mapsto \begin{cases} \alpha' & \text{if } x \in E(0) \\ \beta' & \text{if } x \in E(1) \\ \bot & \text{otherwise} \end{cases}$$
Here, $\bot$ simply denotes a placeholder value for a state that cannot be reached.
Our goal is to represent the Boolean function we want to evaluate with a sum of $p$-encodings (we define what we mean by "sum of $p$-encodings" in Section 3.2). When sums are carried out on ciphertexts (and so homomorphically on the underlying $p$-encodings), the sets $E(0)$ and $E(1)$ of the $p$-encodings may move, grow or shrink, but they should never overlap, as this would result in a loss of information. As we removed the need for a bit of padding, we do not need to track a potential overflow of data (informally, we say that ciphertexts are free to "go around the torus"). After the sum, the encoding of the result can be reset to a canonical one with a PBS to allow further computation. Alternatively, if the homomorphic computation is over, the result can be recovered by decrypting the ciphertext and checking in which of the two sets the decrypted value lies.
The next subsection explains in further detail the process of evaluating Boolean functions with $p$-encodings.

A New Strategy for Homomorphic Boolean Evaluation
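In the following, we consider two Boolean variables $x$ and $y$ and their two respective encodings $E_x$ and $E_y$ over $\mathbb{Z}_p$. Let $f$ be a bivariate Boolean function and let us construct two sets $P_0$ and $P_1$ such that, for $b \in \mathbb{B}$:
$$P_b = \{ [x' + y']_p \mid (x, y) \in \mathbb{B}^2,\ f(x, y) = b,\ x' \in E_x(x),\ y' \in E_y(y) \}.$$
We say that the sum of $p$-encodings $E_x + E_y$ is suitable for the evaluation of $f$ if and only if $P_0 \cap P_1 = \emptyset$. The definition can be generalized to any number of arguments $\ell$ for $f$. For a given $f$, finding such suitable encodings is not trivial. We elaborate further on this point in Section 4.
If $E_x$ and $E_y$ are suitable for $f$, then one can use the computed sets $P_b$ to construct a new $p$-encoding $E_z$, with $E_z(0) = P_0$ and $E_z(1) = P_1$, that encodes the bit $f(x, y)$. As $E_z$ is valid, the clear value of the bit can be recovered by decryption, or further computations can be performed without the need for a bootstrapping.
Definition 7 (Application of a function to a vector of encodings). Let $f : \mathbb{B}^\ell \to \mathbb{B}$ be a Boolean function and let $\mathcal{E} = (E_1, \dots, E_\ell)$ be a vector of $p$-encodings. We define $f(\mathcal{E})$ by:
$$f(\mathcal{E})(0) = P_0, \qquad f(\mathcal{E})(1) = P_1,$$
with:
$$P_b = \left\{ \left[ \sum_{i=1}^{\ell} x'_i \right]_p \;\middle|\; (x_1, \dots, x_\ell) \in \mathbb{B}^\ell,\ f(x_1, \dots, x_\ell) = b,\ x'_i \in E_i(x_i) \right\}.$$
We stress that $f(\mathcal{E})$ is a valid $p$-encoding if and only if $P_0 \cap P_1 = \emptyset$.
Let us illustrate the latter definition on two toy examples. We consider the two Boolean operators $\&$ and $\oplus$. Writing $A + B = \{ [a + b]_p \mid a \in A,\ b \in B \}$, the $p$-encoding resulting from the function $f : (x, y) \mapsto x \,\&\, y$ is:
$$P_0 = (E_x(0) + E_y(0)) \cup (E_x(0) + E_y(1)) \cup (E_x(1) + E_y(0)), \qquad P_1 = E_x(1) + E_y(1),$$
and the $p$-encoding resulting from the function $f : (x, y) \mapsto x \oplus y$ is:
$$P_0 = (E_x(0) + E_y(0)) \cup (E_x(1) + E_y(1)), \qquad P_1 = (E_x(0) + E_y(1)) \cup (E_x(1) + E_y(0)).$$
Figure 3 further illustrates this construction for these two operations.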
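The application of a function to a vector of encodings can be computed directly on clear sets. The sketch below is a minimal illustration of Definition 7 under our own conventions: an encoding is represented as a dictionary `{0: set, 1: set}`, and the helper names are hypothetical.

```python
from itertools import product

def apply_to_encodings(f, encodings, p):
    """Return (P0, P1): all sums of encoding elements, grouped by the output of f."""
    images = {0: set(), 1: set()}
    for bits in product((0, 1), repeat=len(encodings)):
        out = f(*bits)
        # One element is picked in E_i(x_i) for each input i.
        choices = [enc[b] for enc, b in zip(encodings, bits)]
        for elems in product(*choices):
            images[out].add(sum(elems) % p)
    return images[0], images[1]

def is_suitable(f, encodings, p):
    """The vector of encodings is suitable for f iff P0 and P1 are disjoint."""
    p0, p1 = apply_to_encodings(f, encodings, p)
    return p0.isdisjoint(p1)

bit = {0: {0}, 1: {1}}  # canonical encoding 0 -> {0}, 1 -> {1}
and_images = apply_to_encodings(lambda x, y: x & y, [bit, bit], 3)
xor_images = apply_to_encodings(lambda x, y: x ^ y, [bit, bit], 3)
```

With $p = 3$ both operators yield disjoint sets, whereas with $p = 2$ the AND operator does not, which matches the remark that only XOR is evaluable with $p = 2$.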
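To wrap up, here is our proposed framework to evaluate a Boolean function $f : \mathbb{B}^\ell \to \mathbb{B}$. We construct a gadget $\Gamma = (\mathcal{E}_{in}, E_{out}, p_{in}, p_{out})$ such that:
• All the elements of $\mathcal{E}_{in}$ are $p_{in}$-encodings, and $E_{out}$ is a canonical $p_{out}$-encoding.
• The encoding $E_{inter} = f(\mathcal{E}_{in})$ is valid.
Given ciphertexts $c_1, \dots, c_\ell$ encrypting the input bits under the encodings of $\mathcal{E}_{in}$, the gadget produces a ciphertext $c'$ encrypting the output bit under $E_{out}$. To do so, we perform the following algorithm:
• Summing the ciphertexts $c_1, \dots, c_\ell$ homomorphically, which produces a ciphertext $c_{inter}$ encoded under $E_{inter}$.
• Reducing the encoding of $c_{inter}$ from $E_{inter}$ to $E_{out}$ by applying a PBS on $c_{inter}$ performing the function $\mathrm{Cast}_{E_{inter} \to E_{out}}$. This produces the expected result $c'$.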
The advantage of this construction is that only one PBS is performed to apply the function. Moreover, depending on the function, the input size of the PBS look-up table might be much smaller than the arity of the function. Gadgets can be seen as a way to compress several Boolean operators into a single evaluation of a univariate look-up table. Of course, for a given $p_{in}$ and a given $f$, such a gadget may not exist. In such a case, two solutions can be considered:
• Increasing the value of $p_{in}$ (e.g. taking $p_{in} \geq 2^\ell$ always works, but is very inefficient).
• Splitting the function into a graph of subfunctions, and evaluating each one with a gadget.
The question of constructing valid gadgets for a given $f$ is treated in Section 4. The question of efficiently splitting a function is treated in Section 5.
Example: We illustrate our approach with a simple working example: let $f$ be a basic multiplexing function, in which the bit $c$ selects which of the bits $a$ and $b$ is output. Instead of leveraging its Boolean representation $f(a, b, c) = a \,\&\, c \oplus b \,\&\, \bar{c}$, which would cost 3 PBS with the approach of [CGGI18], our strategy consists in constructing a gadget and applying it to the inputs $a$, $b$ and $c$, which takes only one PBS. Here is the step-by-step procedure:
1. Encrypting the bits $a$, $b$ and $c$ under suitable 7-encodings $E_a$, $E_b$ and $E_c$.
2. Applying the function $f$ on the 7-encodings by summing the ciphertexts, which produces a valid 7-encoding. At this point, only sums have been performed on the ciphertexts.
3. With one PBS, resetting the result to a target canonical $p$-encoding $E_{new}$ (with any $p$).
A visualization of this procedure can be found in Figure 4. We just defined the gadget $\Gamma = ((E_a, E_b, E_c), E_{new}, 7, 7)$.

Encoding Switching
To apply a gadget to a given ciphertext, it has to be encrypted under the right encoding. Thus, we need a method to homomorphically switch the encoding of a ciphertext. This also allows us to plug the output of any gadget into the input of any other one, and so to evaluate a chain of gadgets of arbitrary length. In the following, we explore different possibilities of encoding switching. Let us begin with some trivial cases:
Property 2 (Encoding switching with a sum by a constant). Let $x$ be a ciphertext encoded under a valid $p$-encoding $E_x$ and $a \in \mathbb{Z}_p$ a constant. The encoding of $x$ can be switched to:
$$E'_x(b) = \{ [e + a]_p \mid e \in E_x(b) \} \quad \text{for } b \in \mathbb{B},$$
by a homomorphic addition of the ciphertext $x$ and the clear value $a$.
Proof. All the elements of $E'_x(0)$ (resp. $E'_x(1)$) are offset by exactly $a$ from their counterparts in $E_x(0)$ (resp. $E_x(1)$). Thus, if the original encoding $E_x$ was valid, then $E_x(0) \cap E_x(1) = \emptyset$. So we trivially get $E'_x(0) \cap E'_x(1) = \emptyset$ and thus the validity of $E'_x$.
Property 3 (Encoding switching with multiplication by a constant). Let $x$ be a ciphertext encoded under a valid $p$-encoding $E_x$ and $a \in \mathbb{Z}_p$ a constant value coprime with $p$. The encoding of $x$ can be switched to:
$$E'_x(b) = \{ [a \cdot e]_p \mid e \in E_x(b) \} \quad \text{for } b \in \mathbb{B},$$
by a homomorphic multiplication of the ciphertext $x$ by the clear value $a$.
Proof. As $a$ is coprime with $p$, the multiplication by $a$ is a bijection from $\mathbb{Z}_p$ to $\mathbb{Z}_p$. By validity of $E_x$, every element of $E_x(0)$ differs from every element of $E_x(1)$. Applying a bijection to them preserves these inequalities.
Note that the coprimality of $a$ and $p$ is a sufficient condition for the multiplication to be a valid encoding switching, but it is not necessary. In particular, one other case is particularly useful in practice:
Property 4 (Encoding switching for a canonical encoding containing a zero). Let $x$ be a ciphertext encoded under the $p$-encoding: and let $a \in \mathbb{Z}_p \setminus \{0\}$. Then, it can be switched to: by a simple homomorphic multiplication between the ciphertext $x$ and the clear value $a$. This holds as well if $E(0)$ and $E(1)$ are swapped.
Proof. The property is trivial by the linear homomorphism of the TFHE scheme.
These techniques are powerful because they do not require any bootstrapping, so they can be considered as free in terms of performance. However, any valid $p$-encoding can be turned into any other one with a programmable bootstrapping, even with a different modulus $p$. A reduced version of this is given by Property 1, but it can be extended to any valid output $p$-encoding.
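The effect of Properties 2 and 3 can be checked on clear encodings. The following sketch uses our own hypothetical helpers (`shift`, `scale`, `is_valid`) and the dictionary representation `{0: set, 1: set}`; it also shows an example where multiplying by a constant that is not coprime with $p$ collapses the two sets.

```python
from math import gcd

def shift(enc, a, p):
    """Encoding obtained after adding the clear constant a (Property 2)."""
    return {b: {(x + a) % p for x in elems} for b, elems in enc.items()}

def scale(enc, a, p):
    """Encoding obtained after multiplying by the clear constant a (Property 3)."""
    return {b: {(x * a) % p for x in elems} for b, elems in enc.items()}

def is_valid(enc):
    """Disjointness of E(0) and E(1), the condition needed for odd p."""
    return enc[0].isdisjoint(enc[1])

enc7 = {0: {0, 3}, 1: {1, 5}}   # a valid 7-encoding chosen for illustration
shifted = shift(enc7, 4, 7)     # {0: {4, 0}, 1: {5, 2}}
scaled = scale(enc7, 3, 7)      # {0: {0, 2}, 1: {3, 1}}

enc6 = {0: {1}, 1: {4}}         # p = 6 and a = 2 are not coprime
collapsed = scale(enc6, 2, 6)   # both bits are sent to {2}
```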
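Property 5 (Arbitrary encoding switching with a PBS). Let $c$ be a ciphertext encoded under $E$. Its encoding can be switched to $E'$ (even with a different modulus $p'$) by applying a PBS on $c$ evaluating the function:
$$\mathrm{Cast}_{E \to E'} : x \mapsto \begin{cases} \alpha' & \text{if } x \in E(0) \\ \beta' & \text{if } x \in E(1) \\ \bot & \text{otherwise} \end{cases} \quad \text{with } \alpha' \in E'(0) \text{ and } \beta' \in E'(1).$$
Here, $\bot$ simply denotes an arbitrary placeholder value, as it will never be reached.
See Sections 2.5 and 6.2 for more in-depth insight into the actual procedure of programmable bootstrapping.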
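The look-up table evaluated by such a PBS can be tabulated on clear values. The sketch below is only an illustration: the name `cast_table` is hypothetical, `None` stands for the placeholder $\bot$, and picking the smallest element of each output set is an arbitrary choice of ours when the target encoding is not canonical.

```python
def cast_table(enc_in, p_in, enc_out):
    """Tabulate the cast function from enc_in (over Z_{p_in}) to enc_out."""
    alpha = min(enc_out[0])  # representative of the output encoding of 0
    beta = min(enc_out[1])   # representative of the output encoding of 1
    table = []
    for x in range(p_in):
        if x in enc_in[0]:
            table.append(alpha)
        elif x in enc_in[1]:
            table.append(beta)
        else:
            table.append(None)  # unreachable input: placeholder value
    return table

inter = {0: {0, 1, 2, 5}, 1: {3, 4, 6}}   # some valid 7-encoding
canonical = {0: {0}, 1: {1}}              # target canonical encoding
lut = cast_table(inter, 7, canonical)
```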

Algorithms of construction of gadgets
Let $f : \mathbb{B}^\ell \to \mathbb{B}$ be a Boolean function with $\ell$ inputs. This section addresses the problem of constructing a gadget for $f$. To do so, we pick a value for $p$ and we search for a vector $\mathcal{E}_{in}$ of $\ell$ $p$-encodings suitable for $f$.

Reduction of the Search Space
While exhaustive search is a first option, it quickly becomes impractical due to the explosion of the number of possibilities as $p$ grows. As a consequence, a reduction of the search space is needed, without leaving out a potential solution.
We introduce two lemmas that will be used to reduce the search space:
Lemma 1 (Reducibility to singletons). Let $f : \mathbb{B}^\ell \to \mathbb{B}$ and let $(E_1, \dots, E_\ell)$ be a vector of $p$-encodings suitable for $f$. Then any vector of canonical $p$-encodings $(E'_1, \dots, E'_\ell)$ such that $E'_i(b) \subseteq E_i(b)$ for all $i \in \{1, \dots, \ell\}$ and $b \in \mathbb{B}$ is suitable for the function $f$ as well.
Proof. Let us assume that the vector $\mathcal{E} = (E_1, \dots, E_\ell)$ of Lemma 1 is suitable for the function $f$. Then the sets $P_0$ and $P_1$ constructed like in Equation 3 are disjoint. Now, let us consider a vector of canonical $p$-encodings $\mathcal{E}' = (E'_1, \dots, E'_\ell)$ with $E'_i(b) \subseteq E_i(b)$ for all $i$ and $b$. As a consequence, if we build the sets $P'_0$ and $P'_1$ relative to the encodings $\mathcal{E}'$, then we naturally get $P'_0 \subseteq P_0$ and $P'_1 \subseteq P_1$. So we get $P'_0 \cap P'_1 = \emptyset$, proving Lemma 1.
Lemma 2 (Reducibility to the singleton zero). Let $f : \mathbb{B}^\ell \to \mathbb{B}$ and let $(E_1, \dots, E_\ell)$ be a vector of canonical $p$-encodings suitable for $f$ and of the form: $\forall i \in \{1, \dots, \ell\}$, $E_i(0) = \{x^{(i)}\}$ and $E_i(1) = \{y^{(i)}\}$. Then the vector $(E'_1, \dots, E'_\ell)$ with $E'_i(0) = \{0\}$ and $E'_i(1) = \{[y^{(i)} - x^{(i)}]_p\}$ is suitable for the function $f$ as well.
Proof. Let $f : \mathbb{B}^\ell \to \mathbb{B}$ be a function and $\mathcal{E} = (E_1, \dots, E_\ell)$ be a vector of canonical $p$-encodings suitable for $f$, with $E_i(0) = \{x^{(i)}\}$ and $E_i(1) = \{y^{(i)}\}$. Let us build the sets $P_0$ and $P_1$ according to Equation 3. Each element of these sets is the sum of exactly one element of each $p$-encoding, that is to say one element of $E_i(0) \cup E_i(1)$ for each $i$. Let us pick an index $k \in \{1, \dots, \ell\}$, a value $a \in \mathbb{Z}_p$ and replace $E_k$ in the vector $\mathcal{E}$ by $E'_k$, with $E'_k(0) = \{[x^{(k)} + a]_p\}$ and $E'_k(1) = \{[y^{(k)} + a]_p\}$. All the elements of $P_0$ and $P_1$ are then offset by $a$, so, by Property 2, we directly have $P'_0 \cap P'_1 = \emptyset$ from $P_0 \cap P_1 = \emptyset$ (by suitability of the encodings for $f$).
By iterating this procedure on each of the $\ell$ elements of $\mathcal{E}$, and by picking each time $a = -x^{(i)}$, we prove Lemma 2.
Using both Lemmas 1 and 2, we can restrict the search to the encodings of the form $E_i(0) = \{0\}$ and $E_i(1) = \{d_i\}$ with $d_i \neq 0$, without any loss of generality. Moreover, we restrict the solutions further: we only consider $p$-encodings with $p$ odd and prime. The choice of an odd $p$ frees us from the negacyclicity constraint (more about that in Section 6.1). To explain the constraint of primality, we introduce the following lemma, which drastically improves the performance of the search:
Lemma 3. Let $p$ be a prime, $f : \mathbb{B}^\ell \to \mathbb{B}$ be a Boolean function and let $\mathcal{E} = (E_1, \dots, E_\ell)$ be $p$-encodings suitable for $f$ with $E_i(0) = \{0\}$ and $E_i(1) = \{d_i\}$ for all $i$. Then, for any $\omega \in \mathbb{Z}_p \setminus \{0\}$, the $p$-encodings $(E'_1, \dots, E'_\ell)$ with $E'_i(0) = \{0\}$ and $E'_i(1) = \{[\omega \cdot d_i]_p\}$ are suitable for $f$ as well.
Proof. This is an immediate consequence of Property 3.
As a consequence, if $p$ is prime (which we shall always choose in practice), any solution can be turned into a solution with $d_1 = 1$ by simply multiplying all the $p$-encodings of the solution by $[d_1^{-1}]_p$. So we can fix $d_1 = 1$ without any loss of generality, drastically reducing the size of the search space.
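The normalization to $d_1 = 1$ amounts to one modular inversion. A minimal sketch, assuming a prime $p$ and Python 3.8 or later (for `pow` with a negative exponent); the name `normalize` and the example vector are ours:

```python
def normalize(d, p):
    """Multiply every d_i by the inverse of d_1 modulo the prime p, so that d_1 = 1."""
    inv = pow(d[0], -1, p)  # modular inverse, exists since p is prime and d_1 != 0
    return [(inv * di) % p for di in d]

# With p = 7, the inverse of 3 is 5, so (3, 2, 6) becomes (1, 3, 2).
normalized = normalize([3, 2, 6], 7)
```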

Formalization of the Search Problem
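According to the lemmas from Section 4.1, we can reduce the problem of finding a vector of $p$-encodings $(E_1, \dots, E_\ell)$ such that $f(E_1, \dots, E_\ell)$ is valid to the problem of finding a vector $d = (d_1, \dots, d_\ell) \in (\mathbb{Z}_p \setminus \{0\})^\ell$ with $d_1 = 1$. In the following, we describe an algorithm to find such a vector $d$.
We denote by $V$ the matrix of elements of $\mathbb{B}$ of shape $2^\ell \times \ell$ gathering all the possible sequences of inputs for the function $f$:
$$V = \begin{pmatrix} 0 & \cdots & 0 & 0 \\ 0 & \cdots & 0 & 1 \\ \vdots & & \vdots & \vdots \\ 1 & \cdots & 1 & 1 \end{pmatrix}$$
Also, we denote by $b$ the vector of all the outputs of the function $f$, sorted in the same order as the rows of $V$. Thus, we have $b_i = f(V_i)$ for every row $V_i$ of $V$. Let us define the vector $r$ as $r = V d$. To make $d$ a solution of the problem, $r$ has to verify the following property:
$$\forall i, j \in \{1, \dots, 2^\ell\},\quad b_i \neq b_j \implies r_i \neq r_j.$$
An alternative formulation is: we look for two disjoint subsets $P_0$ and $P_1$ of $\mathbb{Z}_p$ such that:
$$\forall i \in \{1, \dots, 2^\ell\},\quad r_i \in P_{b_i}.$$
The following section describes an algorithm finding a solution to this problem.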

Algorithm
We start by constructing two sets F and T such that: $F = \{V_i \mid b_i = 0\}$ and $T = \{V_i \mid b_i = 1\}$. Each row $V_i$ represents a linear combination of the $d_j$'s, which satisfies $r_i = V_i \cdot d$. The values $r_i$ produced by the elements of F must be different from the ones produced by T. As a consequence, we can write: $\forall (u, v) \in F \times T,\ u \cdot d \neq v \cdot d$, which is equivalent to writing $(u - v) \cdot d \neq 0$. So we can rewrite our constraints in the set $C = \{u - v \mid (u, v) \in F \times T\}$. The set C contains vectors with coordinates in {0, 1, −1} representing linear combinations that have to be non-zero. Note that if an element of the set C is the opposite of another, it does not bring any further constraint and can thus be safely discarded from the set.
Using a set in the implementation at this point of the algorithm removes many duplicate constraints and simplifies the next step. Then, the problem reduces to solving a "linear system of inequalities" in the ring $\mathbb{Z}_p$: $c \cdot d \neq 0$ for all $c \in C$.
After filtering, we pack all the elements of C in ℓ matrices $\{C_i\}_{1 \le i \le \ell}$ (each row being a linear combination), where the matrix $C_i$ packs all the constraints involving only the first i inputs (i.e. all the coefficients of column index greater than i are zero).
We then perform a recursive search (Algorithm 1), assigning at each step of depth i a possible value $d_i$ to the i-th input. To do so, we call Algorithm 2 to construct the set of all possible values complying with the constraints of the matrix $C_i$ and the values previously set for the preceding inputs. If we reach a dead end, we backtrack by deleting the value of the preceding input and assigning it the next possible value. Algorithms 1 and 2 formalize this idea: Algorithm 1 is a basic recursive backtracking algorithm using calls to the set construction function (Algorithm 2) to get the possibilities for the next value of d. The latter, when called at depth j + 1, takes as input the j values already computed for d at the previous depths and the matrix of constraints $C_{j+1}$. Each row of $C_{j+1}$ creates a (potentially duplicate) forbidden value for $d_{j+1}$; these values are all computed and the complement of this set in $\mathbb{Z}_p$ is returned by the algorithm (i.e. the set of possible values for $d_{j+1}$ at this point of the search).
Theorem 1. Running Algorithm 1 with increasing values of p ensures that the first solution d found is optimal for the function f, i.e. the solution is valid and its associated p is the smallest possible.
Optimizations: Several optimizations are possible to improve the performance of the search. First, in Algorithm 2, one can check the size of the set S at each iteration and stop as soon as the size of the set is p. Such a set means that a dead end has been reached and that no value will be returned by the function. Then, one can leverage symmetries existing in the table but also in the function. For example, if we consider the function f : (x, y) −→ x ⊕ y, the two variables x and y have symmetric roles. Thus, if the pair of encodings $(E_x, E_y)$ is valid, then the pair $(E_y, E_x)$ is valid as well. As a consequence, one can arbitrarily set $d_x \le d_y$ and remove half of the possibilities for (x, y).
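The search described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`build_constraints`, `pack_by_depth`, `allowed_values`, `search`, `find_encoding`, `is_valid`) are hypothetical, each constraint is stored at the depth of its last non-zero coefficient so that it is checked exactly once, the value 0 is always forbidden (since $d_i \neq 0$), and $d_1$ is fixed to 1, which assumes a prime p.

```python
from itertools import product

def build_constraints(f, ell):
    """Set C of difference vectors u - v, with f(u) = 0 and f(v) = 1."""
    rows = list(product((0, 1), repeat=ell))
    false_rows = [v for v in rows if f(*v) == 0]
    true_rows = [v for v in rows if f(*v) == 1]
    constraints = set()
    for u in false_rows:
        for v in true_rows:
            c = tuple(a - b for a, b in zip(u, v))
            # A constraint and its opposite are equivalent: keep only one.
            if tuple(-x for x in c) not in constraints:
                constraints.add(c)
    return constraints

def pack_by_depth(constraints, ell):
    """Group constraints by the index of their last non-zero coefficient."""
    packed = [[] for _ in range(ell)]
    for c in constraints:
        last = max(i for i, x in enumerate(c) if x != 0)
        packed[last].append(c)
    return packed

def allowed_values(prefix, rows, p):
    """Complement in Z_p of the values forbidden for the next coordinate."""
    j = len(prefix)
    forbidden = {0}  # d_j must be non-zero
    for c in rows:
        partial = sum(ci * di for ci, di in zip(c, prefix))
        # c[j] is +1 or -1, hence it is its own inverse modulo p.
        forbidden.add((-c[j] * partial) % p)
        if len(forbidden) == p:  # early exit: dead end
            return []
    return [x for x in range(p) if x not in forbidden]

def search(packed, p, prefix=()):
    """Recursive backtracking over d_1, ..., d_ell."""
    j = len(prefix)
    if j == len(packed):
        return prefix
    candidates = allowed_values(prefix, packed[j], p)
    if j == 0:  # d_1 = 1 without loss of generality (p prime)
        candidates = [x for x in candidates if x == 1]
    for x in candidates:
        solution = search(packed, p, prefix + (x,))
        if solution is not None:
            return solution
    return None

def find_encoding(f, ell, primes=(3, 5, 7, 11, 13, 17, 19, 23, 29, 31)):
    """Try increasing odd primes and return the first (p, d) found."""
    packed = pack_by_depth(build_constraints(f, ell), ell)
    for p in primes:
        d = search(packed, p)
        if d is not None:
            return p, d
    return None

def is_valid(f, ell, p, d):
    """Check that the sums reached by f = 0 and f = 1 are disjoint mod p."""
    p0, p1 = set(), set()
    for v in product((0, 1), repeat=ell):
        r = sum(vi * di for vi, di in zip(v, d)) % p
        (p1 if f(*v) else p0).add(r)
    return p0.isdisjoint(p1)
```

For instance, `find_encoding(lambda x, y: x ^ y, 2)` explores p = 3 first and stops there, while the 3-input majority function has no solution for p = 3 and is only solved at p = 5.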

Development of a heuristic:
The algorithm of the previous section is deterministic and finds any existing set of encodings compliant with the function f for a given value of p. However, the right value for p is not known a priori, so we have to run the full algorithm for each possible value of p until we find one that works. For this reason, we might prefer an efficient heuristic over the previous algorithm in some contexts. In Section 4.5, we define such a heuristic, which drastically improves the performance by directly executing the algorithm with realistic values for p.

Performances measurements
In this section, we present some experimental results to demonstrate the performance of the algorithm. We ran Algorithm 1 on a large number of random Boolean functions of arity ℓ. Two metrics are particularly interesting for us:
• The running time of the algorithm, especially in the cases where there is no solution. Figure 6a shows the evolution of the execution time of the algorithm for random Boolean functions for which no solution exists. It shows the explosion of the complexity for high values of p, and justifies the need for a more efficient algorithm for those functions (we introduce one in Section 5).
Lastly, Figure 6b shows how long it takes to find a solution when one exists, relative to the running time when no solution exists at all. It illustrates a form of "speed of convergence" and shows that it is located around 1/3.

An Efficient Sieving Heuristic to Find Suitable Encodings
Let us consider a function f : and its associated system of linear inequalities:
To reduce the number of samples required to find a solution, we want to avoid sampling trivially wrong sets of $d_j$'s. For example, if all the $d_j$'s are themselves divisible by p, then the $C_i$'s will all be divisible as well. To tackle this problem, we perform the sampling across prime numbers in $\mathbb{Z}$.
Algorithm 3: Sample a solution d in $\mathbb{Z}$ for a function f and return a possible value for p.
Running this algorithm several times and keeping the smallest returned value for p, one gets an upper bound on the minimum p required to evaluate a function with our framework. Note that, unlike the deterministic search algorithm, this heuristic does not require a prime p.
Example: Let us consider the s-box of the block cipher ASCON. We study this s-box in more detail and provide an exact optimized solution for its homomorphic evaluation in Section 7.4. Here, we apply Algorithm 3 on the five functions generating the five output bits and monitor the results until we gather N = 10000 non-zero possible values for p. Figure 7a shows the distribution of the values of p returned by the algorithm during these N runs on the first subfunction. The optimal value of p found by the deterministic approach of Section 4.3 is 17, so the upper bound 19 is pretty close, despite being rarely found by the algorithm. Also, Figure 7b shows that 21 (the second best solution found by the sieving) is found almost instantly by the algorithm.
In the process of finding the smallest possible p and a correct vector of p-encodings to evaluate a function f, this heuristic is really efficient to get a tight upper bound on the value of p.
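The sampling idea can be illustrated with the following Python sketch. It is only one plausible reading of the heuristic and not a transcription of Algorithm 3: the helper names (`constraints`, `smallest_odd_modulus`, `sieve`), the pool of primes, the number of trials and the choice of returning the smallest odd p that divides none of the constraint values are all assumptions made for the illustration.

```python
import random
from itertools import product

def constraints(f, ell):
    """Difference vectors u - v with f(u) = 0 and f(v) = 1."""
    rows = list(product((0, 1), repeat=ell))
    zeros = [v for v in rows if not f(*v)]
    ones = [v for v in rows if f(*v)]
    return {tuple(a - b for a, b in zip(u, v)) for u in zeros for v in ones}

def smallest_odd_modulus(values):
    """Smallest odd p >= 3 dividing none of the (non-zero) integers."""
    p = 3
    while any(v % p == 0 for v in values):
        p += 2
    return p

def sieve(f, ell, trials=200, pool=(2, 3, 5, 7, 11, 13, 17, 19, 23), seed=0):
    """Sample d over the integers and keep the smallest modulus found."""
    rng = random.Random(seed)
    cons = constraints(f, ell)
    best = None
    for _ in range(trials):
        d = [rng.choice(pool) for _ in range(ell)]
        # Evaluate every constraint over the integers, not modulo p.
        values = [sum(ci * di for ci, di in zip(c, d)) for c in cons]
        if 0 in values:
            continue  # this sample violates a constraint for every p
        p = smallest_odd_modulus(values)
        if best is None or p < best[0]:
            best = (p, tuple(x % p for x in d))
    return best
```

Since no constraint value is divisible by the returned p, the reduced vector d is a valid solution modulo p, so the smallest p observed over many trials is an upper bound on the optimal modulus.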

Scaling our Approach to any Boolean Circuit
Our framework optimizes the homomorphic evaluation of single Boolean functions but suffers from the following limitations:
1. For a Boolean function with a high number of inputs, the search algorithm may be very time-consuming.
2. Some functions simply do not have any solution for acceptable values of p (p < 32 for example) and thus are not efficiently evaluable in a single PBS.
As a consequence, we need a solution to extend our framework to these cases. In this section, we propose a strategy to leverage the circuit representation of a "tough" function f to find a strategy of homomorphic evaluation with as few bootstrappings as possible.

Graph of Subcircuits
Let f : B ℓ −→ B be a Boolean function, and let F be a Boolean circuit representing f (some preliminaries about Boolean circuits can be found in Section 2.6).Let us describe the layout of the circuit F. It has ℓ input wires, denoted by {y j } 1≤j≤ℓ , and the output wire is denoted by z.The intermediary wires are denoted by {t j } 1≤j≤θ .The Boolean operation gates are of fan-out 1.
Our goal is to split the circuit into a directed acyclic graph G, whose vertices are subcircuits $\{F_1, \dots, F_k\}$ and whose edges connect the output of a subcircuit with an input of another. Each subcircuit $F_i$ represents a subfunction $f_i : \mathbb{B}^{\ell_i} \to \mathbb{B}$ and is evaluated homomorphically with a gadget $\Gamma_i$ of our framework.
We use the same notations to refer to the elements of a subcircuit $F_i$ and we index them with i. The output of $F_i$ is denoted by $z^{(i)}$, its inputs by $\{y^{(i)}_j\}_{1 \le j \le \ell_i}$, and so on. The graph is valid for f with respect to modulus p if the following properties are satisfied:
• Each subcircuit $F_i$ has only one output $z^{(i)}$.
• For a subcircuit $F_i$, all its inputs are either inputs of the whole circuit or outputs of other subcircuits of the graph. We can write this property as:
Thus, the indexing of the $F_i$'s respects the topological order of the graph, i.e. no gate of $F_i$ has a child in any of the $F_j$ with j < i.
• All the Boolean functions $f_i$ represented by the subcircuits $F_i$ are evaluable in a single bootstrapping with modulus p with our proposed method.
• The last subcircuit $F_k$ of the graph has z (the output of the main circuit) for output: $z^{(k)} = z$.
To homomorphically evaluate the function f, we evaluate each subcircuit with one bootstrapping and get the final result. In order to reduce the cost of evaluation for a given p, the goal is hence to find the smallest possible valid graph in terms of number of subcircuits. Taking a greater value of p produces a different graph that may be smaller (as subcircuits might be larger), but the bootstrapping timings in this graph might on the other hand be greater. One can therefore run the search for different values of p and keep the most efficient setup among the possible graphs.

Heuristics to Find a Small Graph
Finding such a graph can be done by exhaustively evaluating all the possible subcircuits with our method introduced in Section 4, and then finding the most efficient one. However, it is not really practical to evaluate all the possible subcircuits, so we develop some heuristics to reduce the search space. Let us start by defining a few bounds on the considered subcircuits (the subcircuits that do not respect them are left aside by our algorithm):
• The subcircuits have at most B inputs (∀i, $\ell^{(i)}$ < B). The purpose of this bound is to limit the running time of Algorithm 1. In practice, for our experiments, we took B = 10.
• The subcircuits are evaluable with one single bootstrapping with a maximum value $p_{max}$. This value ensures a bootstrapping with a reasonable timing. If the search algorithm fails for $p_{max}$, the subcircuit is dropped without trying to extend p. In our experiments, we took $p_{max}$ = 31.
In order to decompose our Boolean circuit into a graph satisfying the above properties for a modulus p, we would want to exhaustively search all the subcircuits of F compliant with the bounds we introduced earlier. However, not all subcircuits are equally worth evaluating. In particular, a wire entering a copy gate is particularly worth evaluating because it costs one bootstrapping but produces several inputs for the next subcircuits.
We gather the wires that precede a copy gate in the set Z. We add to this set the global output z. We also gather the input wires of the global circuit F in the set Y. We define the notion of atomic subcircuit, that is, a valid subcircuit all of whose inputs belong to Y ∪ Z and whose output belongs to Z. Note that the merge of two atomic subcircuits that respect the global circuit wiring is also an atomic subcircuit.
Our heuristic works as follows: 1.For each of these outputs z i ∈ Z, we exhaustively construct a set F zi that gathers all the atomic subcircuits whose output is z i .We then filter out the subcircuits of F zi that do not comply with the bounds introduced at the beginning of the section or that are not evaluable with a gadget with the input modulus p (we use Algorithm 1 to decide that).
2. Now we want to construct the smallest valid graph evaluating F using subcircuits from the F zi 's.While finding the smallest graph is hard, constructing any valid graph is easy.As a consequence, our strategy to find a small graph is to randomly create a lot of valid graphs and to take the smallest one.The procedure to create a valid graph is the following: we start from the output z and we randomly draw a subcircuit F z from F z .The inputs of F z can be sorted into two categories: the ones belonging to Y and the ones belonging to Z.For each one of these latter wires w ∈ Z, we repeat the procedure, i.e. we draw a subcircuit F w from F w , and so on.When we have reached all the input wires of F, we get a valid graph G .This second step is run a large amount of times (the number of trials is a parameter of the method), and the smallest graph, i.e. the one with the fewest subcircuits, is returned.
We apply this method to the s-box of AES in Section 7.5.
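The second step of the heuristic (random construction of valid graphs, keeping the smallest) can be sketched as follows. The data model is an assumption made for the illustration: `candidates` maps each wire of Z to the list of already filtered atomic subcircuits producing it, and each subcircuit is simply represented by the tuple of its input wires. The function names are hypothetical.

```python
import random

def random_graph(output, candidates, primary_inputs, rng):
    """Draw one valid graph: a dict {output wire: chosen subcircuit inputs}."""
    chosen = {}
    stack = [output]
    while stack:
        wire = stack.pop()
        if wire in primary_inputs or wire in chosen:
            continue  # nothing to evaluate, or already covered
        subcircuit = rng.choice(candidates[wire])
        chosen[wire] = subcircuit
        stack.extend(subcircuit)  # recurse on the inputs belonging to Z
    return chosen

def smallest_graph(output, candidates, primary_inputs, trials=1000, seed=0):
    """Repeat the random construction and keep the graph with fewest nodes."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        graph = random_graph(output, candidates, primary_inputs, rng)
        if best is None or len(graph) < len(best):
            best = graph
    return best

# Toy circuit (hypothetical): z can be computed from (t, u) or from (t, c, d).
toy_candidates = {
    "z": [("t", "u"), ("t", "c", "d")],
    "t": [("a", "b")],
    "u": [("c", "d")],
}
toy_inputs = {"a", "b", "c", "d"}
best = smallest_graph("z", toy_candidates, toy_inputs)
```

On this toy circuit, the graph using the subcircuit `("t", "c", "d")` for z needs two bootstrappings instead of three, so it is the one that is kept.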

Parallelization of the Execution of the Graph
Once we have our graph G, we can identify its n L layers.Formally, they are defined as: By construction, all the subcircuits belonging to the same layer can be evaluated in parallel.This reduces the number of bootstrapping steps from k (the number of subcircuits in the graph G) to n L (the number of layers).Our graph-finding heuristic can be tweaked to select the graph with minimum number of layers instead of minimum number of subcircuits to optimize parallelization.
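As an illustration, the layers can be computed by a simple depth-first traversal. The sketch below assumes one natural definition of a layer, namely that a subcircuit whose inputs are all primary inputs sits in layer 1 and that any other subcircuit sits one layer above its deepest predecessor; the function name and the graph representation (a dict mapping each output wire to the tuple of its input wires) are hypothetical.

```python
def layers(graph, primary_inputs):
    """Return the list of layers, each layer being a sorted list of wires."""
    depth = {}

    def depth_of(wire):
        if wire in primary_inputs:
            return 0
        if wire not in depth:
            depth[wire] = 1 + max(depth_of(w) for w in graph[wire])
        return depth[wire]

    for wire in graph:
        depth_of(wire)
    n_layers = max(depth.values())
    return [sorted(w for w in graph if depth[w] == i)
            for i in range(1, n_layers + 1)]

# Three subcircuits (k = 3) but only two sequential bootstrapping steps.
toy_graph = {"t": ("a", "b"), "u": ("c", "d"), "z": ("t", "u")}
toy_layers = layers(toy_graph, {"a", "b", "c", "d"})
```

Here the subcircuits producing t and u belong to the same layer and can be bootstrapped in parallel, before the one producing z.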

Adaptation of TFHE and the tfhe-rs Library
From a high-level point of view, our technique can be seen as adding a layer of abstraction on top of TFHE. However, things are not that simple: picking odd values for p leads to some changes in the inner workings of the programmable bootstrapping (PBS), and the choice of parameters is also affected by this change. Moreover, we implemented our framework by forking the tfhe-rs library [Zam22b] written in Rust. The following sections cover the adaptation of the PBS and the choice of new parameters. The adaptation of the library is treated in Section 6.4.

Dealing with the Negacyclicity Problem for an Odd p
In the following, we explain the negacyclicity problem and how we propose to solve it.To do so, we need to dig into the details of the BlindRotate step of the PBS, that we have introduced in Section 2.5.
Let v(X) be a polynomial of the ring $\mathbb{Z}_{q,N}[X]/(X^N + 1)$, denoted by $v(X) = \sum_{i=0}^{N-1} v_i X^i$. Observe that a multiplication by X in this ring "rotates" the coefficients of the polynomial: $X \cdot v(X) = -v_{N-1} + v_0 X + \dots + v_{N-2} X^{N-1}$. In TFHE, the polynomial multiplication in the blind rotation is actually done by $X^{-\tilde{\mu}}$, with $\tilde{\mu} = \lfloor \mu \cdot 2N / q \rceil$, which lives in {0, . . ., 2N − 1}. This leads to two problems:
• A coefficient $v_j$ can be brought in first place by two different rotations: the one induced by the polynomial multiplication by $X^{[-j]_{2N}}$ and the one by $X^{[-j-N]_{2N}}$.
• Each time a coefficient goes from last to first, it gets negated (because $X^N = -1$ in the ring). So actually, the multiplication by $X^{[-j]_{2N}}$ correctly yields $v_j$, but the one by $X^{[-j-N]_{2N}}$ yields $-v_j$.
However, these problems can be circumvented for even and odd values of p. Recall that µ = m + e ∈ $\mathbb{Z}_q$, with e sampled from a small centered Gaussian. Because the error is small, µ does not take all the values of $\mathbb{Z}_q$ with the same probability: in particular, the densest parts in terms of probability over $\mathbb{Z}_q$ are the ones close to the "unscrambled" values of m, namely $\{kq/p \mid k \in \mathbb{Z}_p\}$. We illustrate this distribution on Figure 9. We call these sections of the torus the dense spots.
When we transpose these dense spots into $\mathbb{Z}_{2N}$, they become the sectors close to $\{k \cdot 2N/p \mid k \in \mathbb{Z}_p\}$. Let us note that the noises in $\mathbb{Z}_q$ and $\mathbb{Z}_{2N}$ are fundamentally different: the former is the one added at encryption, which may have grown during the homomorphic computations, and the latter is called "drift" and is caused by the accumulation of the rounding errors on each coefficient of the ciphertext during the modulus switching (but this difference in nature does not impact our purpose). Let $k \in \mathbb{Z}_p$. The multiplications of v(X) by $X^{-k \cdot 2N/p}$ and by $X^{-k \cdot 2N/p - N}$ both bring the coefficient $v_{k \cdot 2N/p}$ in first place, up to the minus sign. For the sake of clarity, we write the exponent of the latter in a slightly different manner: $k \cdot \frac{2N}{p} + N = (k + \frac{p}{2}) \cdot \frac{2N}{p}$. This is where the parity of p plays a part: if p is even, then $(k + \frac{p}{2}) \cdot \frac{2N}{p}$ is a dense spot as well. So, the rotations by these two values will happen with high probability and they will both yield the same coefficient $v_{k \cdot 2N/p}$ (up to the minus sign for one of them). Thus, when evaluating a function f with a PBS, the calls f(k) and f(k + p/2) will produce the same output (once again, up to the minus sign), which is a collision constraining the definition of f. On the other hand, let us consider an odd value for p. Then, $(k + \frac{p}{2}) \cdot \frac{2N}{p}$ is no longer a dense spot, as it lies exactly halfway between the two dense spots $(k + \frac{p-1}{2}) \cdot \frac{2N}{p}$ and $(k + \frac{p+1}{2}) \cdot \frac{2N}{p}$. As a consequence, a collision never occurs. Figure 10 illustrates this phenomenon.
That is why we select only odd values for p in our framework.We will see in Section 6.3 how this change impacts the parametrization of the scheme.
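The parity argument can be checked with exact arithmetic. The short Python sketch below places the dense spots at the fractions k/p of the torus and tests whether the antipodal point k/p + 1/2 (the rotation that yields the negated coefficient) is itself a dense spot; the function name is hypothetical and roundings are ignored.

```python
from fractions import Fraction

def colliding_pairs(p):
    """Pairs (k, k') such that the antipode of dense spot k is dense spot k'."""
    spots = {Fraction(k, p) for k in range(p)}
    pairs = []
    for k in range(p):
        antipode = (Fraction(k, p) + Fraction(1, 2)) % 1
        if antipode in spots:
            pairs.append((k, int(antipode * p)))
    return pairs
```

For an even modulus such as p = 4, every value k collides with k + p/2, whereas for an odd modulus the list is empty.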
Exception for p = 2: We just said that only odd values can be selected for p in our framework. However, p-encodings with even values of p exist as well, but they need to satisfy the relaxed negacyclicity property introduced in Definition 5. This restriction makes them basically useless, as using only odd p-encodings is sufficient to evaluate all possible Boolean functions without having to bother with the negacyclicity property. However, the case p = 2 is an exception: the valid 2-encodings are automatically negacyclic and make it possible to evaluate the XOR operation by simply performing a homomorphic sum (so without bootstrapping). So it might be efficient to switch between 2-encodings for XOR operations and p-encodings (with odd p) for non-linear Boolean functions. We make use of this strategy in our implementation of the Keccak permutation in Section 7.3 and for the AES in Section 7.5.

Construction of the Accumulator for an Odd p
The accumulator is the polynomial v(X) used in the BlindRotate step of the PBS. In Section 6.1, we showed how the values are spread over the torus after bootstrapping. To actually make this work, we need to explicitly characterize this polynomial. In the following presentation, we neglect roundings to keep notations light (as if p divided N), or, equivalently, the division operator is assumed to include rounding.
Definition 10. If p is an odd modulus, and f :
Let us explain the structure of this accumulator. The polynomial has degree N and is made of p distinct windows of width $\frac{N}{p}$. Each of these windows has constant coefficient value f(k), for k ∈ {0, . . ., p − 1}. For $0 \le \alpha \le \frac{p-1}{2}$, the range of degrees whose coefficients are $f(\alpha)$ is $\left[\alpha \frac{2N}{p} - \frac{N}{2p} ; \alpha \frac{2N}{p} + \frac{N}{2p}\right]$. The remaining windows carry the values $-f\left(\alpha + \frac{p+1}{2}\right)$, with $0 \le \alpha < \frac{p-1}{2}$. This time, the range of spanned degrees is $\left[\alpha \frac{2N}{p} + \frac{N}{2p} ; (\alpha + 1) \frac{2N}{p} - \frac{N}{2p}\right]$. Thus, the values k ∈ {0, . . ., p − 1} span the entire space [0; N) without overlap. The values from $\frac{p+1}{2}$ onwards get negated by the negacyclicity, so the underlying coefficient is also negated to compensate for this effect. We illustrate this construction on Figure 11.
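The window structure can be simulated in the clear. The Python sketch below is not Definition 10 itself: it is a model derived from the description of the windows, in which each coefficient of index j takes the value f(k) of the nearest dense spot reached by a rotation of j, or −f(k) of the nearest dense spot reached by a rotation of j + N, whichever is closer. The outputs of f are plain integers here, whereas they are scaled torus elements in the real scheme; all function names are hypothetical.

```python
def nearest_spot(mu, p, N):
    """Index k of the dense spot k*2N/p closest to mu, and a scaled distance."""
    k = (mu * p + N) // (2 * N)  # round(mu * p / (2N))
    return k % p, abs(mu * p - k * 2 * N)

def build_accumulator(f, p, N):
    """Coefficients v_0, ..., v_{N-1} of the accumulator for an odd p."""
    acc = []
    for j in range(N):
        k0, dist0 = nearest_spot(j, p, N)      # rotation by j yields v_j
        k1, dist1 = nearest_spot(j + N, p, N)  # rotation by j + N yields -v_j
        acc.append(f(k0) if dist0 <= dist1 else -f(k1))
    return acc

def blind_rotate_constant(acc, mu):
    """Constant coefficient of X^(-mu) * v(X) in Z[X]/(X^N + 1)."""
    N = len(acc)
    mu %= 2 * N
    return acc[mu] if mu < N else -acc[mu - N]

def dense_spot(k, p, N):
    """Rounded position of the k-th dense spot in Z_2N."""
    return (4 * k * N + p) // (2 * p)  # round(k * 2N / p)

# Toy parameters: the drift must stay below the half-width N/(2p) of a window.
p, N = 7, 512
f = lambda k: k + 1
acc = build_accumulator(f, p, N)
```

With these toy parameters, rotating by any value within 30 positions of the k-th dense spot returns f(k) for every k, including the values k ≥ (p + 1)/2 whose coefficients are stored negated.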

Crafting of Parameters
The instances of the TFHE scheme are defined by a set of parameters.These parameters should simultaneously ensure the security of the scheme and the correctness of the homomorphic computations.They also determine the time of execution of one PBS.Here we define a framework to dimension the parameters required to optimally execute a given gadget.
Finding an optimal set of parameters for a given application is a hard problem and has been studied in particular in [BBB + 23].The parameters need to ensure three properties: security, correctness and efficiency.
Let us start with an overview of the different parameters at play in an instance of the TFHE bootstrapping:
• n: the dimension of the LWE samples. Namely, the TLWE ciphertexts are vectors of length n + 1.
• q: the modulus of the ring the encrypted values live in. In tfhe-rs those values are stored in u32 values, making $q = 2^{32}$. We treat this as an immutable platform-dependent value.
• σ: the standard deviation of the Gaussian distribution of error in LWE samples.
• k: the dimension of the GLWE samples. If k = 1, we talk about RLWE samples.
• σ′: the standard deviation of the Gaussian distribution of error in GLWE samples.
• A few more parameters dimensioning some inner algorithms of the bootstrapping. A detailed description and an analysis of their impact on performance and noise level can be found in [BBB+23]. In this work, they are denoted as micro-parameters.
In [BBB + 23], authors elaborate a strategy where they define an atomic pattern of FHE operators, that is to say a subgraph of FHE operators in which the noise of the output is independent from the one in the inputs.Then, they develop an optimization framework to derive the best set of parameters for a given atomic pattern.
In particular, the first atomic pattern they study, which they denote by $A^{(CJP21)}$, is a subgraph composed of a linear combination of ciphertexts with clear constants, then a Keyswitch and then a BlindRotate followed by a SampleExtract (ModulusSwitch is seen as a part of BlindRotate). Note that in Section 2.5 we introduced the bootstrapping of TFHE by putting the BlindRotate before the Keyswitch, but the other way around is also doable. To dimension the parameters of TFHE to evaluate such an atomic pattern, their framework takes as input the 2-norm of the vector of constants of the linear combination (denoted by ν) and a noise bound t on the standard deviation of the distribution of error in a ciphertext that ensures a correct decryption with a good probability (1 − ϵ). We elaborate further on how this bound is constructed later in this section.
If we look closely, the evaluation of a gadget we introduced in Definition 8 can be seen as an $A^{(CJP21)}$ with a few differences. Thus, we slightly modified the tool concrete-optimizer [Zam22a], which generates parameters for different types of atomic patterns, to support our gadget as a new atomic pattern. Let us dive into the differences between a gadget and an $A^{(CJP21)}$:
Support of odd values for p: Using an odd value for p changes the bootstrapping procedure, and in particular the definition of the accumulator for the BlindRotate (as explained in Section 6.2). With our construction, the windows in the polynomial are half the size of the ones for an even p, which impacts the noise bound t. As this bound depends on the failure probability α that the user is ready to tolerate, we shall denote it $t_\alpha$ hereafter, which satisfies:
where z* is the standard score and ∆ is the scaling factor (see [BBB+23] for more explanations). The impact of our adaptation on this formula is solely with respect to the scaling factor. In the context of an $A^{(CJP21)}$, we have $\Delta = \frac{q}{2^{\pi} p}$ with π the number of MSB for padding. As explained in Section 6.1, we do not need any padding mechanism anymore, so the $2^{\pi}$ vanishes. However, the length of a window is divided by 2, and p does not divide q anymore so we need to add a rounding. We finally get $\Delta = \left\lfloor \frac{q}{2p} \right\rceil$.

Link between input encodings and ν:
In a scenario where only one gadget has to be evaluated, its inputs are freshly encrypted ciphertexts. Then, there is no need to perform any encoding switching before evaluating the gadget, and so we are in the context of an $A^{(CJP21)}$ with ν = 1. However, if we are in the context of a graph of gadgets like in Section 5, the output of a gadget can be used as input of subsequent gadgets under different encodings.
In this case, some encoding switchings are necessary. If these encoding switchings are made using a multiplication by a constant (Property 3), we are still in the context of an $A^{(CJP21)}$ but with ν ≠ 1. To formalize that, we first recall that Algorithm 1 produces gadgets whose input encodings are of the form $E_i(0) = \{0\}$, $E_i(1) = \{d_i\}$. Thus, if we fix that all gadget output ciphertexts are encoded under the encoding mapping 0 to {0} and 1 to {1}, then the encoding switchings needed before an evaluation of Γ correspond to a linear combination of the inputs with the vector $d = (d_i \mid i \in [1, \ell])$, so we fall back on an $A^{(CJP21)}$ with ν = ∥d∥.
We implemented these changes in concrete-optimizer and used it to generate sets of parameters for our implementations detailed in Section 7.

Concrete Implementations of p-Encodings and Homomorphic Functions in tfhe-rs
To implement our framework, we relied on the tfhe-rs library [Zam22b].Here is a list of the major changes we applied to the code:

Addition of the notion of p-encoding:
An encoding E is simply implemented with a structure Encoding storing two HashSets and the modulus p. The HashSets represent the two sets E(0) and E(1). When creating an Encoding, the code checks whether the two underlying sets are disjoint or not. Moreover, the operations of encryption and decryption are modified as well. The signatures change from encrypt(Boolean, ClientKey) -> Ciphertext to encrypt(Boolean, ClientKey, Encoding) -> Ciphertext (same for decrypt). The functions also perform the mapping $\mathbb{B} \to \mathbb{Z}_p$ before encryption and the other way around after decryption.

Support of odd moduli: The native tfhe-rs only supports power-of-two moduli p. We extended the library to handle odd values for p. This required modifying the encryption and decryption algorithms, and computing the sets of parameters with the method of Section 6.3.

Definition of the new structure Gadget:
According to the evaluation strategy we introduced in Section 3.2, we wrote a new structure Gadget, associated with a Boolean function $f : \mathbb{B}^\ell \to \mathbb{B}$, carrying:
• A list of the Encoding objects for the inputs: $E_{in} = (E_1, \dots, E_\ell)$, with the input modulus $p_{in}$ they are encoded on.
• The output Encoding object $E_{out}$, with the output modulus $p_{out}$ it is encoded on.
• The clear function f.
When such a structure is constructed, it self-checks whether $f(E_{in})$ is valid. Then, when provided with ℓ Ciphertext objects encoded under their respective p-encodings, it executes the homomorphic sum and the PBS and outputs the result encoded under $E_{out}$. Some utility functions performing encoding switching are also available, allowing the chaining of several Gadgets.
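To make the roles of these two structures concrete, here is a cleartext model written in Python. It is only a sketch of the logic described above and does not reproduce the Rust code or any tfhe-rs API: there is no encryption, the homomorphic sum is a plain sum modulo p, and the PBS is modelled by a lookup table. The method names (`values`, `encode`, `decode`, `evaluate_clear`) are hypothetical.

```python
from itertools import product

class Encoding:
    """A p-encoding: two disjoint subsets E(0) and E(1) of Z_p."""

    def __init__(self, zeros, ones, p):
        self.p = p
        self.zeros = {x % p for x in zeros}
        self.ones = {x % p for x in ones}
        if not self.zeros.isdisjoint(self.ones):
            raise ValueError("E(0) and E(1) must be disjoint")

    def values(self, bit):
        return self.ones if bit else self.zeros

    def encode(self, bit):
        return min(self.values(bit))  # any representative would do

    def decode(self, value):
        value %= self.p
        if value in self.ones:
            return 1
        if value in self.zeros:
            return 0
        raise ValueError("value outside of the encoding")

class Gadget:
    """Cleartext model of a gadget: sum of the inputs, then table lookup."""

    def __init__(self, encodings_in, encoding_out, f):
        moduli = {e.p for e in encodings_in}
        if len(moduli) != 1:
            raise ValueError("all inputs must share the same modulus")
        self.p_in = moduli.pop()
        self.encodings_in = encodings_in
        self.encoding_out = encoding_out
        self.f = f
        self.table = self._build_table()  # self-check of f(E_in)

    def _build_table(self):
        table = {}
        for bits in product((0, 1), repeat=len(self.encodings_in)):
            out = self.f(*bits)
            choices = (e.values(b) for e, b in zip(self.encodings_in, bits))
            for combo in product(*choices):
                s = sum(combo) % self.p_in
                # The same sum must never be reached with both outputs.
                if table.setdefault(s, out) != out:
                    raise ValueError("f(E_in) is not valid")
        return table

    def evaluate_clear(self, encoded_inputs):
        s = sum(encoded_inputs) % self.p_in
        return self.encoding_out.encode(self.table[s])
```

For example, an AND gadget built from the encoding mapping 0 to {0} and 1 to {1} passes the self-check for p = 3, while the same construction is rejected for p = 2 because the inputs (0, 0) and (1, 1) yield the same sum.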

Implementation of the accumulator:
The procedure of bootstrapping of tfhe-rs is slightly modified to support the new version of the accumulator we introduced in Section 6.2.

Parsing of graphs:
We implemented a Python script that produces graphs to represent more complex functions that require several PBS, as described in Section 5. These graphs are stored in a comprehensive file format and our Rust implementation has a parsing module that loads these graphs and automatically generates the corresponding graph of Gadgets.

Application to Cryptographic Primitives
In this section, we apply our approach to some cryptographic primitives. For each primitive, we first explain the construction of the required gadgets and then report the concrete performance of our implementation. We detail all the timings of our experiments along with the sets of parameters we used in Section 7.6.
For performance measurement, we implemented our framework in our fork of the library tfhe-rs [Zam22b] adapted as discussed in Section 6 and we generated the sets of parameters thanks to our version of concrete-optimizer [Zam22a]. By default, we tailored the sets of parameters to limit the probability of failure ϵ of a bootstrapping to $2^{-40}$, and to reach a security level of λ = 128 bits. All experiments have been carried out on a laptop with a 12th Gen Intel(R) Core(TM) i5-1245U CPU with 10 cores and a frequency of 4.4 GHz, and 16 GB of RAM.

SIMON Block Cipher
SIMON is a hardware-oriented block cipher developed in [BSS+15], which relies only on the following operations: AND, rotation, XOR. It is a classical Feistel network for which the Feistel function consists in applying basic operations on the branch, xoring the subkey and then xoring the result with the other branch, as depicted in Figure 12 (in this figure, $S^i$ denotes the left circular shift by i bits). We use one ciphertext per bit, so the rotation operation is essentially free. Note that the key is considered as a plaintext, which does not change anything in the framework. In our implementation, we considered a (128-128) instance of SIMON (i.e. the whole state and the key are of size 128).
Figure 12: One Feistel round of SIMON.
The Boolean function to evaluate can be defined as
Using Algorithm 1, we found the smallest possible p (p = 9) and the following 9-encodings to evaluate each bit of the Feistel function with one single bootstrapping (i.e. totalling 64 PBS per round).
The sum of these p-encodings yields the output encoding:
which is valid for f. After the PBS, all the bits of the state are encrypted under the encoding $E_0$. We formalize that with the gadget $\Gamma = ((E_0, E_1, E_2, E_3, E_4), E_0, 9, 9)$. To perform a Feistel round on a state of size k, the gadget Γ is applied in parallel k/2 times. Note that one bit may be used in several evaluations as $b_0$, $b_1$ and $b_2$. So we sometimes have to switch from $E_0$ to $E_1$ by a simple external multiplication by 2, which is negligible in terms of performance.
Using our version of concrete-optimizer [Zam22a], we crafted a set of parameters suitable for this modulus and these encodings.On our machine, one PBS with such parameters takes about 9.5 ms.The theoretical timings achieved on one full block without any parallelization is 41 seconds (68 rounds × 64 bits × 9.5 ms) which we confirmed experimentally.
Nonetheless, this setting is intrinsically parallelizable: the 64 gadgets of each round can be performed in parallel.We implemented parallelization using the module Rayon of Rust, which made the total timings drop to 13 seconds on our machine.
Compared to [BSS + 23] that implemented the same block cipher on an equivalent hardware with parallelism, our implementation is about 10 times faster.Table 6 shows the comparison.Note that in this paper, the probability of failure is not specified.As ours is pretty conservative, this is a good argument in favor of our framework.

The Trivium Stream Cipher
Trivium [Can06] is a stream cipher that uses a circular state.At each round, the bits are rotated within the state, except for three of them that are refreshed using the Boolean function of Section 7.1.The outer stream is generated by xoring three bits of the state each round once a "warming-up" phase is achieved.
For each generated key bit, it requires performing this function three times and aggregating five XOR operations in the center.Our strategy is to evaluate the refreshing function three times per round with one PBS for each of them, then get the result in Z 2 and chain the five XOR operations to get the output.Figure 13 illustrates the layout of the cipher.
In [BOS23], the authors implement Trivium using the original tfhe-rs library, with 2 bits of message and 2 bits of carry for a total of 4 significant bits out of the 32 of a ciphertext component. They call this mode the shortint mode. The use-case they target is transciphering.
To compare our implementation with the one of [BOS23], timings are not a good metric as in their work they are provided on a massive AWS instance with a significant amount of parallelism.A better metric is to count the number of PBS and compare the parameter sets.
We reproduced the PBS operation with their parameter set on our machine and then simply estimated the timings of one round of Trivium with their approach with no parallelism.The results are summed up in Table 1.Note that in our implementation we do not refresh the output bits with a PBS after the chain of XOR, because in the use-case of transciphering one more XOR has to be performed with the message.We take advantage of this and move the last PBS into the transciphering phase.Let us recall that our approach encrypts each bit in one TFHE ciphertext.Let us explain the stategies of homomorphization of these sub-functions: • ρ and π simply reorder the bits within the state, so they are not impacted by the homomorphization.
• θ is just a serie of XOR operations, so it can be performed with a serie of homomorphic additions and without any PBS provided that the input ciphertexts are defined over Z p with p = 2.
• χ is the only non-linear function of the permutation, and has to be performed with a PBS. It is the transformation that applies the function defined by χ(a)_i = a_i ⊕ (¬a_{i+1} & a_{i+2}), with indices taken modulo 5 within a row, to get each bit of the output state.
• Finally, ι performs a simple XOR with a constant, so it can be handled in the same manner as θ. The difference is that the constant is in the clear this time.
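To make the non-linear step concrete, the following sketch works on cleartext bits only. It checks that the standard definition of χ on a 5-bit row agrees with a rewriting that uses only XORs and a single AND per output bit. That rewriting is assumed here to correspond to the "alternative formula" of χ; the function names are illustrative and do not come from the implementation.

```python
from itertools import product

def chi_standard(row):
    """Keccak chi on one row: a[i] XOR ((NOT a[i+1]) AND a[i+2])."""
    n = len(row)
    return [row[i] ^ ((1 ^ row[(i + 1) % n]) & row[(i + 2) % n])
            for i in range(n)]

def chi_xor_and(row):
    """Assumed alternative form: a[i] XOR a[i+2] XOR (a[i+1] AND a[i+2])."""
    n = len(row)
    return [row[i] ^ row[(i + 2) % n] ^ (row[(i + 1) % n] & row[(i + 2) % n])
            for i in range(n)]

# Exhaustive check over the 32 possible rows.
all_rows = [list(bits) for bits in product((0, 1), repeat=5)]
identity_holds = all(chi_standard(r) == chi_xor_and(r) for r in all_rows)
print(identity_holds)
```

In this form, the only non-linear operation is one AND per bit, and everything else is a XOR that can be carried out with homomorphic additions over Z_2.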
The p-encodings we use are E_⊕, with p_⊕ = 2, for the linear operations, and E_&, with p_& = 3, to evaluate the & operator in the alternative formula of χ.
Our strategy of homomorphic evaluation of the Keccak permutation is as follows:
1. Encrypt the input state under the encoding E_⊕.
2. Evaluate the subfunctions θ, ρ, and π. These functions being purely linear, they can be performed only with sums under E_⊕.
3. Change the encoding from E_⊕ to E_& with one PBS per bit of the state (Property 5).
4. Evaluate the AND operator of the subfunction χ with the corresponding gadget.

As a result, each round takes two programmable bootstrappings per bit. An implementation with our tweaked version of tfhe-rs takes 16.5 seconds (without any parallelism) on our hardware to perform one Keccak round on a state of 1600 bits, in spite of the two PBS required per round and per bit. These timings are possible because the small values of p allow the use of a set of small parameters, which speeds up the computation. Since a full run of Keccak counts 24 rounds, we can estimate the timing without parallelism at 6.6 minutes. For the sake of simplicity, we use the same set of parameters for both types of PBS, avoiding the hassle of using two different server keys.
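The figures above can be cross-checked with a few lines of arithmetic. The sketch below only uses numbers stated in the text (1600 state bits, two PBS per bit and per round, 24 rounds, 16.5 s per round). The average cost per PBS is a derived quantity that also absorbs the time spent in linear operations, so it should be read as an upper bound rather than a measurement.

```python
STATE_BITS = 1600            # size of the Keccak state
PBS_PER_BIT_PER_ROUND = 2    # one encoding switch + one AND gadget
ROUNDS = 24                  # rounds in a full Keccak run
SECONDS_PER_ROUND = 16.5     # measured single-thread timing for one round

pbs_per_round = STATE_BITS * PBS_PER_BIT_PER_ROUND
total_pbs = pbs_per_round * ROUNDS
total_minutes = SECONDS_PER_ROUND * ROUNDS / 60

# Derived average: includes the (cheap) homomorphic additions.
avg_ms_per_pbs = 1000 * SECONDS_PER_ROUND / pbs_per_round

print(pbs_per_round, total_pbs, round(total_minutes, 1), round(avg_ms_per_pbs, 2))
```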
This implementation strategy complies with the more generic one that we introduce in Section 7.4 and that is illustrated in Figure 15. It is very well suited to use cases where linear and non-linear operations alternate.

Ascon
Ascon [DEMS21] is a lightweight block cipher algorithm that was designed to provide efficient and secure encryption and authentication for a wide range of applications, particularly in resource-constrained environments such as embedded systems and IoT devices. The name "Ascon" stands for "Authenticated encryption for Small Constrained Devices". We implemented its s-box, whose circuit is represented in Figure 14. This layout is a bit different from the others: the s-box takes five bits as input and outputs five bits. We denote by f_0, ..., f_4 the five functions B^5 → B that generate the five output bits x_0, ..., x_4. Thus, we need to define five gadgets (one per function).
These functions, once analyzed by the algorithm, can each be computed in a single bootstrapping, but for different values of p (respectively p = 17, 7, 7, 15, 11, which are the smallest possible values). We could implement the gadgets Γ_0, ..., Γ_4 (associated to f_0, ..., f_4) with different values for p, but this would require introducing encoding switchings before each round of hashing. To keep things simple, we generated only encodings with p = 17, which makes the implementation more straightforward as no encoding switching is required. For each subfunction f_i, five canonical 17-encodings (E_i,0, ..., E_i,4) of the form E_i,j = {0 ↦ {0}, 1 ↦ {d_i,j}} are computed. The results are displayed in Table 2. Note the zero values in some cases: they show that the variable is not used in the subfunction.

The s-box layer is followed by a linear layer, where the bits of the state are shifted and combined with XOR operations. This can be trivially done with p = 2. Finally, to prepare the next round, an encoding switching is performed to bring the ciphertexts back to 17-encodings. This is summed up in Figure 15. Note that there is no encoding switching

To wrap up, we construct the five gadgets Γ_i = ((E_i,0, ..., E_i,4), E_⊕, 17, 2, f_i). They carry out the evaluation of the s-boxes and output ciphertexts encrypted under E_⊕. Then, the linear layer is trivially evaluated with homomorphic sums. An encoding switching from E_⊕ to E_i,j makes it possible to return to non-linear operations.
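As an illustration of what such a gadget computes, here is a minimal cleartext sketch. It assumes canonical encodings that map bit 0 to 0 and bit 1 to a value d_j in Z_p, and an odd p, for which validity is taken to mean that no element of Z_p is reached by two inputs with different outputs. The values of Table 2 are not used; the toy function is a 3-input multiplexer, and all names are hypothetical.

```python
from itertools import product

def encoded_sums(d, p):
    """Sum of the encoded bits in Z_p, for every possible input tuple."""
    return {x: sum(b * dj for b, dj in zip(x, d)) % p
            for x in product((0, 1), repeat=len(d))}

def lookup_table(f, d, p):
    """Return {sum: output bit} if d is valid for f modulo p, else None."""
    table = {}
    for x, s in encoded_sums(d, p).items():
        if table.setdefault(s, f(*x)) != f(*x):
            return None  # two inputs with different outputs collide in Z_p
    return table

def search(f, ell, p):
    """Exhaustive search for a valid vector d in (Z_p)^ell."""
    for d in product(range(p), repeat=ell):
        if lookup_table(f, d, p) is not None:
            return d
    return None

def mux(s, a, b):
    """Toy function: select a when s = 0, b when s = 1."""
    return b if s else a

d = search(mux, 3, 7)
table = lookup_table(mux, d, 7)
# Evaluating the function = summing the encoded bits, then one table lookup.
recovered = all(table[s] == mux(*x) for x, s in encoded_sums(d, 7).items())
print(d, recovered)
```

The table lookup plays the role of the single bootstrapping: once the encoded bits are summed, one lookup indexed by an element of Z_p yields the output bit.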
Using this solution, the s-box is evaluated in 92 ms. Note that the 5 different PBS described in Table 2 have different norms of the vector d, so each of them may call for a different set of parameters. We use the most restrictive one (i.e., the one with the greatest ∥ν∥) for all 5. Estimating the timings of a full run of Ascon is not trivial because it depends heavily on the parameters. To give a rough idea, in hashing mode, 64 s-boxes are required per round, with 12 rounds recommended. The outputs of the s-boxes are in Z_2 to allow the evaluation of the linear layer of Ascon. At the end of this linear layer, the encoding of each of the 320 bits of the state must be switched back to Z_17 with a PBS. To do so, we use the same set of parameters as for the encoding switching in Step 3 of the Keccak evaluation in Section 7.3.
This gives an estimation of 89 seconds for one Ascon hash.

AES
AES [DR00], or Advanced Encryption Standard, stands as one of the most widely used and trusted encryption algorithms in the world of computer security. Its standardization occurred in 2001 when it was adopted by NIST to replace the obsolete DES (Data Encryption Standard). Implementing this primitive in FHE is known to be particularly tricky, and only a few attempts have been made [GHS12], [CLT14], [TCBS23].
A round of AES can be decomposed into 4 steps:
1. SubBytes: a non-linear substitution step where each byte is replaced by another according to a lookup table. This step concentrates all the challenge for homomorphization, the other ones being trivial with our framework.
Recall that the SubBytes step is made of 16 s-boxes. So, we can derive that one execution of the SubBytes step takes 16 × 36 = 576 PBS.
The outputs of this step are encoded with p = 2, allowing the XOR operations of the following steps to be performed efficiently. We also need to take into account the encoding switching to come back to p = 11 before each SubBytes. It costs one PBS per bit, so 128 PBS. Finally, this gives a total of 704 PBS per round. For AES-128, which takes 10 rounds, we estimate a full run at 7040 PBS.
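The count above can be reproduced as follows. This is a plain bookkeeping sketch using only the quantities stated in the text (36 PBS per s-box, 16 s-boxes, 128 state bits, 10 rounds); the constant names are illustrative.

```python
PBS_PER_SBOX = 36          # PBS needed for one 8-bit s-box
SBOXES_PER_SUBBYTES = 16   # one s-box per byte of the 128-bit state
STATE_BITS = 128           # one encoding switch (p = 2 -> p = 11) per bit
ROUNDS_AES128 = 10

subbytes_pbs = SBOXES_PER_SUBBYTES * PBS_PER_SBOX
switching_pbs = STATE_BITS
pbs_per_round = subbytes_pbs + switching_pbs
total_pbs = pbs_per_round * ROUNDS_AES128

print(subbytes_pbs, pbs_per_round, total_pbs)
```

The encoding switchings account for 128 of the 704 PBS of a round, so SubBytes clearly dominates the cost.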

Performances
In terms of performance, with a set of parameters ensuring a security level of λ = 128 bits and an error probability ϵ = 2^−40, a PBS takes 17 ms on our hardware. The total runtime of the whole implementation on one thread is 135 s. We note that the 16 evaluations of s-boxes in SubBytes can be parallelized, as well as each of the 128 encoding switchings before SubBytes. Moreover, within each s-box, we can locally apply our strategy of parallelization introduced in Section 5.3.
We compare favorably to the previous works of [GHS12] and [CLT14], which report timings of respectively 18 minutes and 5 minutes for a full AES. Once again, the authors do not mention the value of ϵ. The more recent work of [TCBS23] also proposes an implementation of AES-128, using a completely different technique called tree-bootstrapping. On a similar experimental setup, but with a failure probability ϵ = 2^−23, they claim an execution in 270 s on one thread. We ran our code again with another set of parameters tailored for the same ϵ and obtained a full run in 103 s. Note that in our implementation, we used the more restrictive set of parameters PBS_(11,4) for every bootstrapping (even the ones that should be performed with PBS_(2,1)). We also derived the theoretical timing that could have been achieved if we had implemented this with two server keys (one for each set of parameters). This theoretical timing should be 105 s with ϵ = 2^−40; we added it in Table 6.

Summary of Applications
We summarize hereafter the parameters and performances of our implementations of cryptographic primitives. Table 3 gives an overview of the TFHE parameters used for each value of p in these examples. They all meet the required security level of 2^128 and the error probability ϵ = 2^−40. It also shows the associated p and the norm of d, denoted by N_d (with N_d = ⌈log_2(∥d∥)⌉), which are the inputs of the parameter selection algorithm. To allow the comparison with the strategy of gate bootstrapping, we also included the set of parameters hardcoded in tfhe-rs to evaluate Boolean operators. Table 4 shows the complexity of the cryptographic primitives expressed in PBS with our framework. It can be compared with Table 5, which illustrates the number of PBS required with the naive strategy of gate bootstrapping. Finally, Table 6 shows the concrete performance achieved by our implementations on our machine, as well as the comparison with other works and with the gate bootstrapping. For more information about an implementation or a comparison, the reader is referred to the related section.

Conclusion
In this paper, we have proposed a new strategy to evaluate Boolean functions homomorphically using TFHE. Our technique relies on constructing an intermediate homomorphic layer between the Boolean space B of the plaintexts and the torus T_q on which ciphertexts live. We introduced a formal model for our technique and detailed algorithms to efficiently construct such layers and select appropriate parameters. We further extended our strategy

Table 3: Sets of TFHE parameters for each PBS used in our implementations, with the constraints used to generate the sets, and the performances. Each setting is referenced as PBS_(p,N_d) with N_d = ⌈log_2(∥d∥)⌉. All these parameters ensure a level of security λ = 128 bits and a bootstrapping failure probability of ϵ = 2^−40. q is always fixed to 2^32. PBS_gate refers to the naive case of the gate bootstrapping implemented in [Zam22b] and is used to estimate the timings of the naive strategy in Table 6.

Table 6: Timings of evaluation of full primitives, and comparison with previous works when they exist. As in Table 4, a star (*) is added in the cells if our timing is not obtained from a full implementation but estimated from an implemented building block. Also, the security level of each implementation is λ = 128 and the default error probability is ϵ = 2^−40. The concurrent works that do not indicate their ϵ are marked with †.

Figure 1: Embedding of Z_p in Z_q

Figure 2: Representation of two valid p-encodings. The green part represents E(1), and the red part E(0). Note that the relaxed negacyclicity is respected by the p-encoding on the right-hand figure as p is even.

2. For a Boolean function f to be evaluated on b_1, ..., b_ℓ, homomorphically compute the sum of the ciphertexts c = c_1 + ··· + c_ℓ. This yields an encryption of b = f(b_1, ..., b_ℓ), encoded with a valid p-encoding E_sum = f(E_1, ..., E_ℓ).

Figure 3: Starting from two canonical encodings, we produce two new p-encodings corresponding to the results of the & and the ⊕ operations.

Figure 4: Illustration of an execution of the framework for the multiplexing function.

Figure 5: Rate of success of the algorithm for 100 random Boolean functions for different values of ℓ and p.

The principle is to sample random values in Z (with some large bound) and assign them to the d_j's. If none of the corresponding values C_i = Σ_{j=1}^{ℓ} c_j^(i) × d_j is divisible by a value p, then the vector (d_j mod p | j ∈ {1, ..., ℓ}) is a solution of the system of inequalities generated by C.

(a) Running time of the algorithm for different values of ℓ and p for random functions. Note that the scale is logarithmic.
(b) Ratio between the time to find a solution when it exists and the time to run the full algorithm when no solution exists.

Figure 6: Some metrics about running time.
Require: {C_i}_{1≤i≤n} ▷ The lines of the matrix of constraints C of the function f
Require: P ▷ The set of possible values for p to be tested
Require: D ▷ The set of possible values in Z to assign to the d_i's. All these elements are big primes
Ensure: f is possible to evaluate using a modulus smaller than or equal to p.
d ←$ D ▷ Sample random prime values in Z and assign them to d = (d_1, ..., d_ℓ)
r ← C × d ▷ r is the right member of the system
for p ∈ P do
    if 0 ∈ [r]_p then ▷ If p divides one of the coordinates of r
        P ← P \ {p} ▷ This value of p is incorrect
    end if
end for
if |P| > 0 then
    return min(P) ▷ Returns the smallest possible value for p, if any.
end if
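The heuristic above can be sketched in a few lines of Python. Several points are assumptions made for the illustration: the rows of C are taken to be the differences x − y over all pairs of inputs with f(x) = 1 and f(y) = 0, only odd candidates for p are considered, the d_j's are drawn without repetition, and the helper names and the toy multiplexer function are hypothetical.

```python
import random
from itertools import product

def constraint_rows(f, ell):
    """Assumed constraint matrix: one row x - y per pair with f(x)=1, f(y)=0."""
    inputs = list(product((0, 1), repeat=ell))
    ones = [x for x in inputs if f(*x) == 1]
    zeros = [x for x in inputs if f(*x) == 0]
    return [tuple(a - b for a, b in zip(x, y)) for x in ones for y in zeros]

def heuristic_min_p(C, P, D, rng):
    """One run of the randomized search; returns (p, d mod p) or None."""
    d = rng.sample(D, len(C[0]))            # distinct random primes
    r = [sum(c * dj for c, dj in zip(row, d)) for row in C]
    surviving = [p for p in P if all(ri % p != 0 for ri in r)]
    if not surviving:
        return None
    p = min(surviving)
    return p, [dj % p for dj in d]

def mux(s, a, b):
    """Toy function: select a when s = 0, b when s = 1."""
    return b if s else a

# Candidate values for the d_j's: the primes between 10000 and 10200.
PRIMES = [n for n in range(10_000, 10_200)
          if all(n % k for k in range(2, int(n ** 0.5) + 1))]

C = constraint_rows(mux, 3)
result = heuristic_min_p(C, range(3, 32, 2), PRIMES, random.Random(0))
print(result)
```

A single run can miss the smallest valid p, or fail entirely, so in practice the sampling would be repeated and the smallest p found across runs kept.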
(a) The outputs of 10000 runs of Algorithm 3 for the first subfunction of the Ascon s-box.
(b) Number of iterations required to get a solution for a given value of p.

Figure 8: Example of a graph of subcircuits (on the left) and of a valid subcircuit (on the right). Each subcircuit F_i is evaluated homomorphically with a gadget Γ_i.

Figure 9: Distribution of the values of µ across Z_q for p = 6 and p = 5: the colored parts show the dense spots where the value has a high probability to lie. The width of these sectors depends on σ (the standard deviation of the error distribution χ of TFHE). Note that this distribution looks the same for μ in Z_2N.

Figure 11: Illustration of the construction of the accumulator. On top is the ring Z_2N split into windows. Below is a representation of the polynomial v, with its version once rotated by a multiplication by X^N. In the figure, p = 5.

Figure 15: A common layout to evaluate cryptographic primitives. The upper part of the boxes represents what happens in the clear, while the lower part shows the encrypted operations.

Table 1: Comparison of timings of one round of Trivium between our work and [BOS23], with ϵ = 2^−40.

Keccak is a hash function standardized by NIST under the name SHA-3 [NIS15]. It is a sponge function, whose transformation is called the Keccak permutation. It consists of five sub-functions: θ, ρ, π, χ, and ι.

This gadget is applied once per bit of the state.
5. Evaluate the remaining ⊕ operators of χ and the ι subfunction, then jump back to Step 2 for the next loop iteration.

Casting a ciphertext from E_⊕ to E_& (Step 3) is a bit tricky because p_⊕ = 2 is even. Because of the negacyclicity problem, one needs E_&(0) = [−E_&(1)]_{p_&}. With p_& = 3, the only candidate is the encoding E_& defined above.
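The interplay between the two encodings can be simulated on cleartext values. The sketch below assumes that E_⊕ maps a bit b to b in Z_2 and that E_& maps 1 to 1 and 0 to 2 in Z_3, a choice that satisfies the negacyclicity condition stated above; the symmetric choice would work equally well. It models one row of χ with one encoding switch and one AND gadget per bit, and every name is illustrative.

```python
from itertools import product

P_XOR, P_AND = 2, 3
E_AND = {0: 2, 1: 1}   # assumed encoding in Z_3, with E_AND[0] == -E_AND[1] mod 3

def switch_encoding(bit):
    """Cleartext model of the PBS casting a bit from Z_2 to Z_3 (Step 3)."""
    return E_AND[bit % P_XOR]

def and_gadget(ea, eb):
    """Cleartext model of the AND gadget (Step 4): sum in Z_3, then lookup."""
    s = (ea + eb) % P_AND
    return 1 if s == 2 else 0   # only 1 + 1 = 2 encodes a true AND

def chi_row(row):
    """One row of chi using XOR (sum mod 2) and the AND gadget."""
    n = len(row)
    enc = [switch_encoding(b) for b in row]
    out = []
    for i in range(n):
        and_bit = and_gadget(enc[(i + 1) % n], enc[(i + 2) % n])
        out.append((row[i] + row[(i + 2) % n] + and_bit) % P_XOR)  # Step 5
    return out

def chi_reference(row):
    """Standard definition of chi, used only as a cross-check."""
    n = len(row)
    return [row[i] ^ ((1 ^ row[(i + 1) % n]) & row[(i + 2) % n])
            for i in range(n)]

matches = all(chi_row(list(r)) == chi_reference(list(r))
              for r in product((0, 1), repeat=5))
print(matches)
```

With this choice, the sums 2 + 2, 2 + 1 and 1 + 1 land on 1, 0 and 2 in Z_3 respectively, so the encoded value 2 isolates the case where both bits are set.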

Table 2: Parameters d_i,j for Ascon, with p = 17 for every subfunction.

subfunction | d_i,0 | d_i,1 | d_i,2 | d_i,3 | d_i,4