1 Introduction

Lattice-based cryptography has generated considerable interest in the last decade due to many attractive features, including conjectured security against quantum attacks, strong security guarantees from worst-case hardness, and constructions of fully homomorphic encryption (FHE) schemes (see the survey [33]). Moreover, lattice-based cryptographic schemes are often algorithmically simple and efficient, essentially manipulating vectors and matrices or polynomials modulo relatively small integers, and in some cases they outperform traditional systems.

Modern lattice-based cryptosystems are built upon two main average-case problems over general lattices: Short Integer Solution (SIS) [1] and Learning With Errors (LWE) [35], and their analogues over ideal lattices, ring-SIS [29] and ring-LWE [27]. The hardness of these problems can be related to that of their worst-case counterparts, provided the instances follow specific distributions and the parameters are chosen appropriately [1, 27, 29, 35].

In particular, discrete Gaussian distributions play a central role in lattice-based cryptography. A natural set of examples illustrating the importance of Gaussian sampling are lattice-based signature and identity-based encryption (IBE) schemes [16]. The most iconic example is the signature algorithm proposed in [16] (hereafter GPV) as a secure alternative to the well-known (and broken) GGH signature scheme [18]. In that work, the authors use the Klein/GPV algorithm [21], a randomized variant of Babai’s nearest plane algorithm [4], in which the rounding step is replaced by randomized rounding according to a discrete Gaussian distribution, so as to return a lattice point (almost) independent of the hidden basis. The GPV signature scheme has also been combined with LWE to obtain the first identity-based encryption (IBE) scheme [16] conjectured to be secure against quantum attacks. Later, a new Gaussian sampling algorithm for arbitrary lattices was presented in [32]. It is a randomized variant of Babai’s rounding-off algorithm; it is more efficient and parallelizable, but outputs longer vectors than the Klein/GPV algorithm.

As an alternative to the above trapdoor technique, lattice-based signatures [11, 23, 24, 25, 26] were also constructed by applying the Fiat-Shamir heuristic [14]. Note that, in contrast to the algorithms outlined above, which sample from a discrete Gaussian distribution for an arbitrary real center not known in advance, the schemes developed in [11, 25] only need to sample from a discrete Gaussian centered at zero.

1.1 Our Contributions

We develop techniques to speed up discrete Gaussian sampling when the center is not known in advance, obtaining a flexible time-memory trade-off that compares favorably to rejection sampling. We start from the cumulative distribution table (CDT) approach suggested in [32] and lower both the computational cost of the precomputation phase and the global memory required when sampling from a non-centered discrete Gaussian: we precompute CDTs for a relatively small number of centers, in \(\mathcal {O}(\lambda ^3)\), and compute the cdf only when needed, i.e. when, for a given uniform random input, the values returned by the CDTs for the two closest precomputed centers differ. Second, we present an adaptation of the lazy technique described in [12] to compute most of the cdf in IEEE standard double precision, thus decreasing the number of precomputed CDTs. Finally, we propose a more flexible approach which takes advantage of the information already present in the precomputed CDTs. For this we use a Taylor expansion around the precomputed centers and values instead of the lazy technique, which reduces the number of precomputed CDTs to \(\omega (\lambda )\).

We stress, though, that our construction is not constant time, which limits its utility. We consider addressing this issue important future work.

1.2 Related Work

Many discrete Gaussian samplers over the integers have been proposed for lattice-based cryptography: rejection sampling [12, 17], inversion sampling with a cumulative distribution table (CDT) [32], Knuth-Yao sampling [13], discrete Ziggurat [7], Bernoulli sampling [11], Kahn-Karney sampling [20] and binary arithmetic coding (BAC) [36].

The optimal method will of course depend on the setting in which it is used. In this work, we focus on what can be done on a modern computer, with a comfortable amount of memory and hardwired integer and floating-point operations. This is in contrast to the works [11, 13], which focus on circuits or embedded devices. We consider exploring the limits of the usual memory and hardwired operations available on commodity hardware to be as interesting a question as what is feasible in more constrained settings.

Rejection Sampling and Variants. Straightforward rejection sampling [37] is a classical method to sample from any distribution: sample a value from a uniform distribution and accept it with a probability equal to its probability in the target distribution. This method does not use precomputed data but needs floating-point arithmetic and several trials per sample. Bernoulli sampling [11] introduces an exponential bias from Bernoulli variables, which can be sampled efficiently, especially in circuits. The bias is then corrected in a rejection phase based on another Bernoulli variable. This approach is particularly suited to embedded devices due to the simplicity of the computation and its near-optimal entropy consumption. Kahn-Karney sampling is another variant of rejection sampling for discrete Gaussian distributions which does not use floating-point arithmetic. It is based on the von Neumann algorithm for sampling from the exponential distribution [31], requires no precomputed tables and consumes fewer random bits than Bernoulli sampling, though it is slower. Currently the fastest approach in the computer setting uses straightforward rejection sampling with “lazy” floating-point computations [12], using IEEE standard double precision floating-point numbers in most cases.
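To fix ideas, the following minimal sketch (C++, double precision only, illustrative names) shows straightforward rejection sampling over a tailcut support; a cryptographic implementation would use higher precision and a careful analysis of the acceptance step.

```cpp
#include <cmath>
#include <random>

// Minimal sketch of straightforward rejection sampling for D_{Z,s,c}:
// draw a candidate uniformly in the tailcut support [c - tau*s, c + tau*s]
// and accept it with probability rho_{s,c}(x) = exp(-pi (x - c)^2 / s^2).
// On average this costs roughly 2*tau trials per sample.
long rejection_sample(double s, double c, double tau, std::mt19937_64 &rng) {
    const double pi = 3.14159265358979323846;
    std::uniform_real_distribution<double> u01(0.0, 1.0);
    std::uniform_int_distribution<long> cand((long)std::floor(c - tau * s),
                                             (long)std::ceil(c + tau * s));
    while (true) {
        long x = cand(rng);
        double rho = std::exp(-pi * (x - c) * (x - c) / (s * s));
        if (u01(rng) < rho) return x;  // accept with probability rho_{s,c}(x)
    }
}
```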

Note that none of these methods requires precomputation that depends on the distribution’s center c. In all the alternative approaches we present hereafter, there is some center-dependent precomputation. When the center is not known in advance, this can result in prohibitive costs, and handling them becomes a major issue around which most of our work is focused.

Center-Dependent Approaches. The cumulative distribution table algorithm is based on the inversion method [9]. All non-negligible cumulative probabilities are stored in a table, and at sampling time one generates a cumulative probability in \([0,1)\) uniformly at random, performs a binary search through the table and returns the corresponding value. Several alternatives to the straightforward CDT approach are possible. Of special interest are: the alias method [38], which encodes CDTs in a more involved but more efficient way; BAC sampling [36], which uses arithmetic coding tables to sample with an optimal consumption of random bits; and discrete Ziggurat [7], which adapts the Ziggurat method [28] to obtain a flexible time-memory trade-off. Knuth-Yao sampling [22] uses a random bit generator to traverse a binary tree formed from the bit representation of the probability of each possible sample; each terminal node is labeled with the corresponding sample. The main advantage of this method is that it consumes a near-optimal amount of random bits. A block variant and other practical improvements are suggested in [13]. This method is center-dependent, but it is clearly designed for circuits and in a computer setting it is surpassed by other approaches.
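As an illustration of the inversion method, a plain CDT sampler can be sketched as follows (C++, double precision, illustrative names; an actual implementation stores the table with \(\lambda \) bits of precision, as discussed in Sect. 3).

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Sketch of CDT (inversion) sampling: table[j] stores cdf_{s,c}(v_min + j)
// over the tailcut support.  Sampling draws p uniformly in [0,1) and returns
// the preimage of p under the cdf, found by binary search.
struct CDT {
    long v_min;                 // smallest tabulated value, -ceil(tau*s)-1
    std::vector<double> table;  // table[j] = cdf_{s,c}(v_min + j), non-decreasing

    long invert(double p) const {
        // smallest tabulated value whose cumulative probability exceeds p
        auto it = std::upper_bound(table.begin(), table.end(), p);
        return v_min + (long)(it - table.begin());
    }
    long sample(std::mt19937_64 &rng) const {
        std::uniform_real_distribution<double> u01(0.0, 1.0);
        return invert(u01(rng));
    }
};
```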

Our main contribution is to show how to get rid of the known-center constraint with reasonable memory usage for center-dependent approaches. As a consequence, we obtain a performance gain with respect to rejection sampling approaches. Alternatively, any of the methods discussed above could have replaced our straightforward CDT approach. This, however, would have made our algorithms, proofs, and implementations more involved. On the other hand, further performance improvements could perhaps be achieved this way. This is an interesting problem for future work.

2 Preliminaries

Throughout this work, we denote the set of real numbers by \(\mathbb {R} \) and the integers by \(\mathbb {Z} \). We extend any real function \(f(\cdot )\) to a countable set \(A\) by defining \(f(A) = \sum _{x \in A} f(x)\). We also denote by \(U_I\) the uniform distribution on \(I\).

2.1 Discrete Gaussian Distributions on \(\mathbb {Z}\)

The discrete Gaussian distribution on \(\mathbb {Z} \) is defined as the probability distribution whose unnormalized density function is

$$\begin{aligned} \rho (x) := e^{-\pi x^2}. \end{aligned}$$

If \(s \in \mathbb {R} ^+\) and \(c \in \mathbb {R} \), then we extend this definition to

$$\begin{aligned} \rho _{s,c}(x) := \rho \left( \frac{x-c}{s}\right) \end{aligned}$$

and denote \(\rho _{s,0}(x)\) by \(\rho _{s}(x)\). For any mean \(c \in \mathbb {R} \) and parameter \(s \in \mathbb {R} ^+\) we can now define the discrete Gaussian distribution \(D_{s,c}\) as

$$\begin{aligned} D_{s,c}(x) := \frac{\rho _{s,c}(x)}{\rho _{s,c}(\mathbb {Z})} \quad \text {for all } x \in \mathbb {Z}. \end{aligned}$$

Note that the standard deviation of this distribution is \(\sigma = s/\sqrt{2\pi }\). We also define \({{\mathrm{cdf}}}_{s,c}\) as the cumulative distribution function (cdf) of \(D_{s,c}\):

$$\begin{aligned} {{\mathrm{cdf}}}_{s,c}(x) := \sum _{i = -\infty }^{x} D_{s,c}(i). \end{aligned}$$

Smoothing Parameter. The smoothing parameter \(\eta _\epsilon (\varLambda )\) quantifies the minimal discrete Gaussian parameter s required to obtain a given level of smoothness on the lattice \(\varLambda \). Intuitively, if one picks a noise vector over a lattice from a discrete Gaussian distribution with radius at least as large as the smoothing parameter, and reduces this modulo the fundamental parallelepiped of the lattice, then the resulting distribution is very close to uniform (for details and formal definition see [30]).

Gaussian Measure. An interesting property of discrete Gaussian distributions with a parameter \(s\) greater than the smoothing parameter is that the Gaussian measure, i.e. \(\rho _{ s,c}(\mathbb {Z})\) for \(D_{s,c}\), is essentially the same for all centers.

Lemma 1

(From the proof of [30, Lemma 4.4]). For any \(\epsilon \in (0,1)\), \( s > \eta _\epsilon (\mathbb {Z})\) and \(c \in \mathbb {R} \) we have

$$\begin{aligned} \rho _{ s,c}(\mathbb {Z}) \in \left[ \frac{1-\epsilon }{1+\epsilon },\, 1\right] \cdot \rho _{ s}(\mathbb {Z}). \end{aligned}$$

Tailcut Parameter. To deal with the infinite domain of Gaussian distributions, algorithms usually take advantage of their rapid decay to sample from a finite domain. The next lemma is useful in determining the tailcut parameter \(\tau \).

Lemma 2

([17, Lemma 4.2]). For any \(\epsilon \in (0,1)\), \( s \ge \eta _\epsilon (\mathbb {Z})\), \(c \in \mathbb {R} \) and \(\tau > 0\), we have

$$\begin{aligned} \textsf {E}_\textsf {tailcut} := \mathop {\Pr }\limits _{X\sim D_{\mathbb {Z}, s,c}}\left[ |X-c| > \tau s\right] < 2e^{-\pi \tau ^2} \cdot \frac{1+\epsilon }{1-\epsilon } \end{aligned}$$
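As a simple worked example, ignoring the factor \((1+\epsilon )/(1-\epsilon ) \approx 1\), pushing \(\textsf {E}_\textsf {tailcut}\) below \(2^{-\lambda }\) amounts to requiring

$$\begin{aligned} 2e^{-\pi \tau ^2} \le 2^{-\lambda } \iff \tau \ge \sqrt{\frac{(\lambda +1)\ln 2}{\pi }}, \end{aligned}$$

i.e. \(\tau \approx 5.4\) for \(\lambda = 128\).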

2.2 Floating-Point Arithmetic

We recall some facts from [12] about floating-point arithmetic (FPA) with m bits of mantissa, which we denote by \(\mathbb {FP}_m\). A floating-point number is a triplet \(\bar{x} = (s,e,v)\) where \(s \in \{0,1\}\), \(e \in \mathbb {Z} \) and \(v \in \mathbb {N}_{2^m-1}\), which represents the real number \(\bar{x} = {(-1)}^s \cdot 2^{e-m} \cdot v\). Denote by \(\epsilon = 2^{1-m}\) the floating-point precision. Every FPA-operation \(\bar{\circ } \in \{\bar{+}, \bar{-}, \bar{\times }, \bar{/}\}\) and its respective arithmetic operation on \(\mathbb {R} \), \(\circ \in \{+, -, \times , /\}\), satisfy

$$\begin{aligned} \forall \bar{x},\bar{y} \in \mathbb {FP}_m, \left|(\bar{x}\;\bar{\circ }\;\bar{y}) - (\bar{x} \circ \bar{y}) \right|\le \left|\bar{x} \circ \bar{y} \right|\epsilon \end{aligned}$$

Moreover, we assume that the floating-point implementation of the exponential function \(\bar{\exp }(\cdot )\) satisfies

$$\begin{aligned} \forall \bar{x} \in \mathbb {FP}_m, \left|\bar{\exp }(\bar{x}) - \exp (\bar{x}) \right|\le \epsilon . \end{aligned}$$

2.3 Taylor Expansion

Taylor’s theorem provides a polynomial approximation around a given point for any sufficiently differentiable function.

Theorem 1

(Taylor’s theorem). Let \(d \in \mathbb {Z} ^+\) and let the function \(f : \mathbb {R} \rightarrow \mathbb {R} \) be \(d+1\) times differentiable in some neighborhood U of \(a \in \mathbb {R} \). Then for any \(x \in U\)

$$\begin{aligned} f(x) = \mathcal {T}_{d,f,a}(x) + \mathcal {R}_{d,f,a}(x) \end{aligned}$$

where

$$\begin{aligned} \mathcal {T}_{d,f,a}(x) = \sum _{i=0}^d \frac{f^{(i)}(a)}{i!}{(x-a)}^i \end{aligned}$$

and

$$\begin{aligned} \mathcal {R}_{d,f,a}(x) = \int _a^x \frac{f^{(d+1)}(t)}{d!}{(x-t)}^d dt \end{aligned}$$

3 Variable-Center with Polynomial Number of CDTs

We consider the case in which the mean is variable, i.e. the center is not known before the online phase, as is the case for lattice-based hash-and-sign signatures. The center can be any real number, but without loss of generality we will only consider centers in \([0,1)\). Because CDTs are center-dependent, a first naive option would be to precompute a CDT for each possible real center in \([0,1)\), in accordance with the desired accuracy. Obviously, this first option has the same time complexity as the classical CDT algorithm, i.e. \(\mathcal {O}(\lambda \log s\lambda )\) for \(\lambda \) the security parameter. However, it is completely impractical, with \(2^\lambda \) precomputed CDTs of size \(\mathcal {O}(s \lambda ^{1.5})\) each. The opposite trade-off is to compute the CDT on-the-fly, avoiding any precomputation storage, which increases the computational cost to \(\mathcal {O}(s \lambda ^{3.5})\), assuming that the computation of the exponential function runs in \(\mathcal {O}(\lambda ^3)\) (see Sect. 3.2 for a justification of this assumption).

An interesting question is whether we can keep the time complexity of the classical CDT algorithm with only a polynomial number of precomputed CDTs. To answer it, we start by fixing the number n of equally spaced centers in \([0,1)\) and precompute the CDTs for each of them. Then, we apply the CDT algorithm to the two precomputed centers closest to the desired center, for the same uniformly drawn cumulative probability. Assuming that the number of precomputed CDTs is sufficient, the values returned from both CDTs will be equal most of the time; in this case we can conclude, thanks to a simple monotonicity argument, that the returned value would have been the same for the CDT at the desired center, and return it as a valid sample. Otherwise, the larger value immediately follows the smaller one, and we then have to compute the cdf at the smaller value for the desired center in order to know whether the cumulative probability is lower or higher than this cdf. If it is lower, the smaller value is returned as the sample; otherwise the larger one is.

3.1 Twin-CDT Algorithm

As discussed above, to decrease the memory required by the CDT algorithm when the distribution center is only determined during the online phase, we can precompute CDTs for a number n of centers equally spaced in \([0,1)\) and compute the cdf only when necessary. Algorithm 1 (resp. 2) describes the offline (resp. online) phase of the Twin-CDT algorithm. Algorithm 1 precomputes CDTs, up to a precision m that guarantees the \(\lambda \) most significant bits of each cdf, and stores them with \(\lambda \) bits of precision as a matrix \(\mathbf {T}\), where the i-th row is the CDT corresponding to the i-th precomputed center i / n. To sample from \(D_{s,c}\), Algorithm 2 searches for the preimage under the cdf of a cumulative probability p, drawn from the uniform distribution on \([0,1)\cap \mathbb {FP}_\lambda \), in both CDTs corresponding to the centers \(\lfloor n(c-\lfloor c\rfloor )\rfloor / n\) and \(\lceil n(c-\lfloor c\rfloor )\rceil / n\), which return values \(v_1\) and \(v_2\) respectively. If the same value is returned from both CDTs (i.e. \(v_1 = v_2\)), then this value plus the integer part of the desired center is a valid sample; otherwise it computes \({{\mathrm{cdf}}}_{s,c-\lfloor c\rfloor }(v_1)\) and returns \(v_1+\lfloor c\rfloor \) if \(p < {{\mathrm{cdf}}}_{s,c-\lfloor c\rfloor }(v_1)\) and \(v_2+\lfloor c\rfloor \) otherwise.

[Algorithm 1: Twin-CDT algorithm, offline phase (CDT precomputation)]
[Algorithm 2: Twin-CDT algorithm, online phase (sampling)]
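The online phase can be sketched as follows (C++, double precision, illustrative names; the actual Algorithms 1 and 2 work with \(\lambda \)-bit tables and an m-bit cdf as described above). The callback `cdf_sc` stands in for the on-the-fly cdf evaluation discussed below, and we assume that T holds n+1 tables, for centers 0/n, ..., n/n.

```cpp
#include <algorithm>
#include <cmath>
#include <functional>
#include <random>
#include <vector>

// Sketch of the Twin-CDT online phase (Algorithm 2), double precision only.
// T[i] stores cdf_{s,i/n}(v) for v = v_min, ..., v_min + T[i].size() - 1,
// for i = 0..n; cdf_sc(c, v) evaluates cdf_{s,c}(v) on the fly.
long twin_cdt_sample(double c, long n, long v_min,
                     const std::vector<std::vector<double>> &T,
                     const std::function<double(double, long)> &cdf_sc,
                     std::mt19937_64 &rng) {
    double c_int = std::floor(c), c_frac = c - c_int;   // c_frac in [0,1)
    std::uniform_real_distribution<double> u01(0.0, 1.0);
    double p = u01(rng);                                // cumulative probability
    long i1 = (long)std::floor(n * c_frac);             // closest centers below/above
    long i2 = (long)std::ceil(n * c_frac);
    auto invert = [&](long i) {                         // inversion in CDT number i
        const auto &t = T[i];
        return v_min + (long)(std::upper_bound(t.begin(), t.end(), p) - t.begin());
    };
    long v1 = invert(i1), v2 = invert(i2);
    if (v1 == v2) return v1 + (long)c_int;              // both tables agree
    // Tables disagree (v2 = v1 + 1): one cdf evaluation at the true center decides.
    return (p < cdf_sc(c_frac, v1) ? v1 : v2) + (long)c_int;
}
```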

Correctness. We establish correctness in the lemma below.

Lemma 3

Assuming that m is large enough to ensure \(\lambda \) correct bits during the cdf computation, the statistical distance between the output distribution of Algorithm 2 and the target distribution \(D_{s,c}\) is bounded by \(2^{-\lambda }\).

Proof

First note that from the discrete nature of the considered distribution we have \(D_{s,c} = D_{s,c-\lfloor c\rfloor } + \lfloor c\rfloor \). Now recall that the probability integral transform states that if X is a continuous random variable with cumulative distribution function \({{\mathrm{cdf}}}\), then \({{\mathrm{cdf}}}(X)\) has a uniform distribution on [0, 1]. Hence the inversion method: \({{\mathrm{cdf}}}^{-1}(U_{[0,1]})\) has the same distribution as X. Finally by noting that for all \(s,p \in \mathbb {R} \), \({{\mathrm{cdf}}}_{s,c}(p)\) is monotonic in c, if \({{\mathrm{cdf}}}^{-1}_{s,c_1}(p) = {{\mathrm{cdf}}}^{-1}_{s,c_2}(p) := v\), then \({{\mathrm{cdf}}}^{-1}_{s,c}(p) = v\) for all \(c \in [c_1,c_2]\), and as a consequence, for all \(v \in [-\lceil \tau s\rceil -1,\lceil \tau s\rceil +1]\), the probability of outputting v is equal to \(\mathbb {FP}_m:{{\mathrm{cdf}}}_{s,c}(v) - \mathbb {FP}_m:{{\mathrm{cdf}}}_{s,c}(v-1)\) which is \(2^{-\lambda }\)-close to \(D_{s,c}(v)\).    \(\square \)

The remaining issue in the correctness analysis of Algorithm 2 is to determine the error occurring during the m-precision cdf computation. Indeed, bounding this error tells us what precision m is needed to correctly compute the \(\lambda \) most significant bits of the cdf. This error is characterized in Lemma 4.

Lemma 4

Let \(m \in \mathbb {Z} \) be a positive integer and \(\varepsilon = 2^{1-m}\). Let \(\bar{c},\bar{ s},\bar{h} \in \mathbb {FP}_m\) be at distance at most \(\delta _c\), \(\delta _ s\) and \(\delta _h\) respectively from \(c, s,h \in \mathbb {R} \), where \(h = 1/\rho _{ s,c}(\mathbb {Z})\). Let \(\varDelta {f(x)} := \left|\mathbb {FP}_m : f(x) - f(x) \right|\). We also assume that the following inequalities hold: \( s \ge 4\), \( \tau \ge 10\), \( s\delta _ s \le 0.01\), \(\delta _c \le 0.01\), \( s^2\varepsilon \le 0.01\), \((\tau s + 1)\varepsilon \le 1/2\). Then we have the following error bound on \(\varDelta {{{\mathrm{cdf}}}_{ s,c}(x)}\) for any integer x such that \(|x| \le \tau s + 2\):

$$\begin{aligned} \varDelta {{{\mathrm{cdf}}}_{ s,c}(x)} \le 3.5\tau ^3 s^2\varepsilon \end{aligned}$$

Proof

We derive the following bounds using [10, Facts 6.12, 6.14, 6.22]:

$$\begin{aligned} \varDelta {{{\mathrm{cdf}}}_{ s,c}(x)} \le \varDelta \left[ \sum _{i = -\lceil \tau s \rceil -1}^{\lceil \tau s \rceil + 1}\rho _{ s,c}(i)\right] \left( \frac{1}{ s} + 3.6 s\varepsilon \right) + 3.6 s\varepsilon \end{aligned}$$
$$\begin{aligned} \varDelta \left[ \sum _{i = -\lceil \tau s \rceil -1}^{\lceil \tau s \rceil + 1}\rho _{ s,c}(i)\right] \le 3.2\tau ^3 s^3\varepsilon \end{aligned}$$

   \(\square \)

For the sake of readability, the FPA error bound of Lemma 4 is fully simplified and is therefore not tight. For a practical implementation, one can derive a better bound using an ad-hoc approach as done in [34].

Efficiency. On average, the evaluation of the cdf requires \(\lceil \tau s\rceil + 1.5\) evaluations of the exponential function. For the sake of clarity, we assume that the exponential function is computed using a direct power series evaluation with schoolbook multiplication, so its time complexity is \(\mathcal {O}(\lambda ^3)\). We refer the reader to [6] for a discussion of different ways to compute the exponential function in high-precision.
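For concreteness, a high-precision cdf evaluation can be sketched with MPFR as follows. The precomputed normalization `h` \(\approx 1/\rho _{s,c}(\mathbb {Z})\) (essentially center-independent above the smoothing parameter, by Lemma 1) and the double-precision center are simplifying assumptions of this sketch; the error analysis is the subject of Lemma 4.

```cpp
#include <mpfr.h>
#include <cmath>

// Sketch of an m-bit evaluation of cdf_{s,c}(x): sum rho_{s,c}(i) =
// exp(-pi (i - c)^2 / s^2) for i = -ceil(tau*s)-1, ..., x and multiply by the
// precomputed normalization h ~ 1/rho_{s,c}(Z).
void cdf_sc(mpfr_t res, double s, double c, long x, double tau,
            const mpfr_t h, mpfr_prec_t m) {
    mpfr_t acc, t, pi;
    mpfr_inits2(m, acc, t, pi, (mpfr_ptr)0);
    mpfr_const_pi(pi, MPFR_RNDN);
    mpfr_set_ui(acc, 0, MPFR_RNDN);
    for (long i = -(long)std::ceil(tau * s) - 1; i <= x; i++) {
        mpfr_set_d(t, (double)i - c, MPFR_RNDN);   // i - c (double-precision center)
        mpfr_sqr(t, t, MPFR_RNDN);                 // (i - c)^2
        mpfr_div_d(t, t, s * s, MPFR_RNDN);        // (i - c)^2 / s^2
        mpfr_mul(t, t, pi, MPFR_RNDN);             // pi (i - c)^2 / s^2
        mpfr_neg(t, t, MPFR_RNDN);
        mpfr_exp(t, t, MPFR_RNDN);                 // rho_{s,c}(i)
        mpfr_add(acc, acc, t, MPFR_RNDN);
    }
    mpfr_mul(res, acc, h, MPFR_RNDN);              // normalize: cdf_{s,c}(x)
    mpfr_clears(acc, t, pi, (mpfr_ptr)0);
}
```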

Lemma 5 establishes that the time complexity of Algorithm 2 is \(\mathcal {O}(\lambda \log s \lambda + \lambda ^4/ n)\), so with \(n = \mathcal {O}(\lambda ^3)\) it has asymptotically the same computational cost as the classical CDT algorithm.

Lemma 5

Let \(\textsf {P}_\textsf {cdf}\) be the probability of computing the \({{\mathrm{cdf}}}\) during the execution of Algorithm 2. Assuming that \(\tau s \ge 10\), we have

Proof

$$\begin{aligned} \textsf {P}_\textsf {cdf} \le \max _{c\in [0,1)}\left( \sum _{i=-\lceil \tau s \rceil - 1}^{\lceil \tau s \rceil + 1} \left|{{\mathrm{cdf}}}_{s,c}(i) - {{\mathrm{cdf}}}_{s,c+\frac{1}{ n}}(i) \right|\right) \end{aligned}$$

Assuming that \(\tau s \ge 10\), we have

$$\begin{aligned} e^{-\frac{1.25\tau }{s n}}\varDelta _\textsf {measure}{{\mathrm{cdf}}}_{s,c}(i) \le {{\mathrm{cdf}}}_{s,c+\frac{1}{ n}}(i) \le {{\mathrm{cdf}}}_{s,c}(i) \end{aligned}$$

Hence the upper bound.    \(\square \)

On the other hand, the precomputation matrix generated by Algorithm 1 takes \( n\) times the size of one CDT, hence the space complexity is \(\mathcal {O}( n s \lambda ^{1.5})\). Note that for n sufficiently big to make the cdf computational cost negligible, the memory required by this algorithm is about 1 GB for the parameters considered in cryptography, which is prohibitively expensive for practical use.

3.2 Lazy-CDT Algorithm

A first idea to decrease the number of precomputed CDTs is to avoid costly cdf evaluations by using the same lazy trick as in [12] for rejection sampling. Indeed, a careful analysis of Algorithm 2 shows that most of the time many of the computed cdf bits are not used. This suggests a new strategy which consists of computing the bits of \({{\mathrm{cdf}}}_{s,c}(v_1)\) lazily. When the values corresponding to the generated probability for the two closest centers are different, the Lazy-CDT algorithm first computes the cdf only at a precision \(m'\) ensuring \(k < \lambda \) correct bits. If the comparison is decided with those k bits, it returns the sample. Otherwise, it recomputes the cdf at a precision m ensuring \(\lambda \) correct bits.

[Algorithm 3: Lazy-CDT algorithm, online phase (sampling)]
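The lazy comparison step can be sketched as follows (C++, illustrative names): `cdf_double` stands for a hardware-precision cdf evaluation guaranteeing k correct bits (Lemma 4), and `less_high` for the m-bit comparison that is only triggered when the cheap estimate cannot decide.

```cpp
#include <cmath>
#include <functional>

// Sketch of the lazy decision of Algorithm 3, used when the two CDT lookups
// disagree (v2 = v1 + 1): first compare p against a hardware-precision estimate
// of cdf_{s,c}(v1); fall back to the m-bit comparison only when p lies within
// the error margin 2^{-k} of that estimate.
long lazy_decide(double p, double c_frac, long v1, long v2, int k,
                 const std::function<double(double, long)> &cdf_double,
                 const std::function<bool(double, double, long)> &less_high) {
    double f = cdf_double(c_frac, v1);              // ~k correct bits (Lemma 4)
    if (std::fabs(p - f) > std::ldexp(1.0, -k))     // comparison already decided
        return (p < f) ? v1 : v2;
    return less_high(p, c_frac, v1) ? v1 : v2;      // recompute at precision m
}
```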

Correctness. In addition to the choice of m, discussed in Sect. 3.1, to achieve \(\lambda \) bits of precision, the correctness of Algorithm 3 also requires knowing k, the number of correct bits after the floating-point computation of the cdf with \(m'\) bits of mantissa. For this purpose, given \(m'\), Lemma 4 provides a theoretical lower bound on k.

Efficiency. As explained in [12], the precision used for floating-point arithmetic has a non-negligible impact, because floating-point operations become much more expensive when the precision exceeds the hardware precision. For instance, modern processors typically provide floating-point arithmetic following the IEEE standard double precision (\(m = 53\)), but quad-float FPA (\(m=113\)) is usually about 10–20 times slower for basic operations, and the overhead is much larger for multiprecision FPA. Therefore the maximal hardware precision is a natural choice for \(m'\). However, this choice for \(m'\) in Algorithm 3 is a strong constraint for cryptographic applications, where the error occurring during the floating-point cdf computation is usually greater than 10 bits, making the time-memory trade-off of Algorithm 3 inflexible. Note that the probability of triggering high precision in Algorithm 3, given that \(v_1 \ne v_2\), is about \(2^{q-k}\textsf {P}_\textsf {cdf}\), where q is the number of common leading bits of \({{\mathrm{cdf}}}_{s,\lfloor n(c - \lfloor c\rfloor )\rfloor /n}(v_1)\) and \({{\mathrm{cdf}}}_{s,\lceil n(c - \lfloor c\rfloor )\rceil /n}(v_2)\). By using this lazy trick in addition to lookup tables as described in Sect. 5 with parameters considered in cryptography, we achieve a computational cost lower than that of the classical centered CDT algorithm with a memory requirement in the order of 1 megabyte.

4 A More Flexible Time-Memory Tradeoff

In view of the limitations of the lazy approach described above, a natural question is whether we can find a better way to approximate the cdf. The major advantage of the lazy trick is that it does not require additional memory. However, in our context the CDTs are precomputed, and rather than approximating the cdf from scratch it would be interesting to reuse the information contained in these precomputations. Consider the cdf as a function of the center and note that each precomputed cdf value is the zero-degree term of the Taylor expansion of the cdf around a precomputed center. Hence, we may approximate the cdf by its Taylor expansions by precomputing some higher-degree terms.

At first glance, this seems to increase the memory requirements of the sampling algorithm, but we will show that this approach allows us to drastically reduce the number of precomputed centers to \(\omega (\lambda )\), thanks to a fallback probability which decreases rapidly with the degree of the Taylor expansion. Moreover, this approximation is faster than the lazy cdf computation and has no strong constraints related to the maximal hardware precision. As a result, we obtain a flexible time-memory trade-off which achieves, in particular, the same time complexity as the CDT algorithm for centered discrete Gaussians with practical memory requirements for cryptographic parameters.

4.1 Taylor-CDT Algorithm

Our Taylor-CDT algorithm is similar to the Lazy-CDT algorithm (Algorithm 3) described above, except that the lazy computation of the cdf is replaced by a Taylor expansion of the cdf, viewed as a function of the Gaussian center, around each precomputed center, for all possible values. The zero-degree term of each of these Taylor expansions is present in the corresponding CDT element \(\mathbf {T}_{i,j}\), and the d higher-degree terms are stored as an element \(\mathbf {E}_{i,j}\) of another matrix \(\mathbf {E}\). As for the other approaches, these precomputations shall be performed at a precision m sufficient to ensure \(\lambda \) correct bits. During the online phase, Algorithm 5 proceeds as follows. Draw p from the uniform distribution over \([0,1)\cap \mathbb {FP}_\lambda \) and search for p in the CDTs of the two precomputed centers closest to the decimal part of the desired center. If the two values found are equal, add the integer part of the desired center to this value and return it as a valid sample. Otherwise, select the precomputed center closest to the decimal part of the desired center and evaluate, at this decimal part, the Taylor expansion corresponding to this center and the value found in its CDT. If p is smaller or bigger than this evaluation by more than the approximation error upper bound \(\textsf {E}_\textsf {expansion}\), characterized in Lemma 6, add the integer part of the desired center to the corresponding value and return it as a valid sample. Otherwise, it is necessary to compute the full cdf to decide which value to return.

[Algorithm 4: Taylor-CDT algorithm, offline phase (precomputation of CDTs and Taylor coefficients)]
[Algorithm 5: Taylor-CDT algorithm, online phase (sampling)]
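The Taylor expansion step can be sketched as follows (C++, double precision, illustrative names): `t0` plays the role of the CDT entry \(\mathbf {T}_{i,j}\) (the degree-0 term), `coeffs` holds the higher-degree terms \(\mathbf {E}_{i,j}\), `err` is the bound \(\textsf {E}_\textsf {expansion}\) of Lemma 6, and `less_high` stands for the full m-bit cdf comparison.

```cpp
#include <functional>
#include <vector>

// Sketch of the decision step of Algorithm 5 when the two CDT lookups disagree:
// evaluate the Taylor polynomial of c' -> cdf_{s,c'}(v1) around the precomputed
// center c0 at the desired center c_frac, and only fall back to the full cdf
// when p lies within the approximation error band of that evaluation.
long taylor_decide(double p, double c_frac, double c0, double t0,
                   const std::vector<double> &coeffs,   // degree 1..d terms
                   double err, long v1, long v2,
                   const std::function<bool(double, double, long)> &less_high) {
    double dc = c_frac - c0, approx = 0.0;
    for (auto it = coeffs.rbegin(); it != coeffs.rend(); ++it)
        approx = approx * dc + *it;                 // Horner evaluation
    approx = approx * dc + t0;                      // add the degree-0 term
    if (p < approx - err) return v1;                // decided despite the error band
    if (p > approx + err) return v2;
    return less_high(p, c_frac, v1) ? v1 : v2;      // rare: full m-bit cdf
}
```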

Efficiency. Algorithm 5 performs two binary searches on CDTs in \(\mathcal {O}(\lambda \log s \lambda )\), d additions and multiplications on \(\mathbb {FP}_m\) in \(\mathcal {O}(m^2)\) with probability \(\textsf {P}_\textsf {cdf} \approx 3\lambda /n\) (see Lemma 5) and a cdf computation on \(\mathbb {FP}_m\) in \(\mathcal {O}(s\lambda ^{3.5})\) with probability close to \(2^{q+1}\textsf {P}_\textsf {cdf}\textsf {E}_\textsf {expansion}\), where q is the number of common leading bits of \({{\mathrm{cdf}}}_{s,\lfloor n(c - \lfloor c\rfloor )\rfloor /n}(v_1)\) and \({{\mathrm{cdf}}}_{s,\lceil n(c - \lfloor c\rfloor )\rceil /n}(v_2)\) and \(\textsf {E}_\textsf {expansion}\) is the Taylor expansion approximation error bound described in Lemma 6.

Lemma 6

Let \(\textsf {E}_\textsf {expansion}\) be the maximal distance between \({{\mathrm{cdf}}}_{s,x}(v)\) and \(\mathcal {T}_{d,{{\mathrm{cdf}}}_{s, x}(v),c}(x)\), its Taylor expansion around c, for all \(v \in [-\lceil \tau s\rceil -1, \lceil \tau s\rceil +1]\), \(c\in [0,1)\) and \(x \in [c, c+1/2n]\). Assuming that \(\tau \ge 2.5\) and \(s \ge 4\), we have

$$\begin{aligned} \textsf {E}_\textsf {expansion} \le \frac{(d+1)(1.3)^{d+1}\tau ^{d+2}}{d!\,n^{d+1}s^{\frac{d+1}{2}}} \end{aligned}$$

Proof

From Theorem 1 we have

$$\begin{aligned} \textsf {E}_\textsf {expansion} = \max _{\begin{array}{c} c\in [0,1) \\ x \in [c, c+1/2n] \\ v \in \left[ -\lceil \tau s\rceil -1, \lceil \tau s\rceil + 1\right] \end{array}}\left( \sum _{i = -\lceil \tau s\rceil -1}^v \int _c^{x} \frac{\rho _{s,t}^{(d+1)}(i)}{d!\,\rho _{s,t}(\mathbb {Z})}{(c+\frac{1}{2n}-t)}^d dt \right) \end{aligned}$$

By using well-known series-integral comparison we obtain \(\rho _{s,t}(\mathbb {Z}) \ge s\sqrt{2\pi } - 1\) and since \(\left| \rho _{s,t}^{(d)}(i)\right| < \frac{d(1.3\tau )^{d}2^d}{s^{d/2}}\) for \(s \ge 4\) and \(\tau \ge 2.5\), it follows that

$$\begin{aligned} \textsf {E}_\textsf {expansion} \le \frac{(d+1)(1.3)^{d+1}\tau ^{d+2}}{d!\,n^{d+1}s^{\frac{d+1}{2}}} \end{aligned}$$

   \(\square \)

A careful analysis of this technique shows that with \(d = 4\) we achieve the same asymptotic computational cost as the classical CDT algorithm with \(n = \omega (\lambda )\), where the hidden factor is less than 1 / 4. Therefore, for this degree, the space complexity of Algorithms 4 and 5 is only \(\lambda \) times bigger than for centered sampling, showing that these algorithms can achieve a memory requirement as low as 1 MB. Finally, note that, taking care to add the floating-point computation error to the approximation error, one can evaluate the Taylor expansion at the maximal hardware precision to reduce its computational cost.

5 Lookup Tables

We shall now show how to use partial lookup tables to avoid the binary search in most cases when using CDT algorithms; this technique is the CDT analogue of the Knuth-Yao algorithm improvement described in [8]. Note that this strategy is particularly well suited to discrete Gaussian distributions with relatively small expected values. The basic idea is to subdivide the uniform distribution \(U_{[0,1)}\) into \(\ell \) uniform distributions on subsets of the same size, \(U_{[i/\ell ,(i+1)/\ell )}\), with \(\ell \) a power of two. We then precompute a partial lookup table on these subsets, which allows us to return the sample at once when the subset considered does not include a cdf image. We note that instead of subdividing the uniform range into stripes of the same size, we could also recursively subdivide only some stripes of the previous subdivision. However, for the sake of clarity and ease of exposition, this improvement is not included in this paper, and we describe the technique for the classical centered CDT algorithm.

First, we initialize a lookup table of size \(\ell = 2^l\) where the i-th entry corresponds to the subinterval \([i/\ell ,(i+1)/\ell )\) of \([0,1)\). Second, after precomputing the CDT, we mark with \(\bot \) all the entries for which there is at least one CDT element in the corresponding subinterval \([i/\ell ,(i+1)/\ell )\), and mark all remaining entries with \(\top \). Each entry marked with \(\top \) allows us to return a sample without performing a binary search in the CDT, because only one value corresponds to this subinterval: the one associated with the first CDT element greater than or equal to \((i+1)/\ell \).
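A minimal sketch of this construction on top of a centered CDT (C++, double precision, illustrative names) is as follows; entries containing no cdf value store the forced sample directly, the others fall back to the binary search.

```cpp
#include <algorithm>
#include <limits>
#include <random>
#include <vector>

// Sketch of a partial lookup table in front of a centered CDT.  Entry i of the
// guide covers the subinterval [i/l, (i+1)/l) of [0,1): if no cdf value falls
// strictly inside it, the sample is determined by i alone and stored directly;
// otherwise the entry keeps a mark and sampling falls back to a binary search.
struct GuidedCDT {
    static constexpr long MARK = std::numeric_limits<long>::min();
    long v_min;                     // smallest tabulated value
    std::vector<double> cdt;        // cdt[j] = cdf_{s,c}(v_min + j)
    std::vector<long> guide;        // guide[i] = forced sample, or MARK

    void build(std::size_t l) {     // l is a power of two
        guide.assign(l, MARK);
        for (std::size_t i = 0; i < l; i++) {
            double lo = (double)i / l, hi = (double)(i + 1) / l;
            auto it = std::upper_bound(cdt.begin(), cdt.end(), lo);
            if (it != cdt.end() && *it >= hi)                 // nothing in [lo, hi):
                guide[i] = v_min + (long)(it - cdt.begin());  // the sample is forced
        }
    }
    long sample(std::mt19937_64 &rng) const {
        std::uniform_real_distribution<double> u01(0.0, 1.0);
        double p = u01(rng);
        std::size_t i = (std::size_t)(p * guide.size());      // top bits of p
        if (guide[i] != MARK) return guide[i];                 // no binary search
        auto it = std::upper_bound(cdt.begin(), cdt.end(), p); // fall back
        return v_min + (long)(it - cdt.begin());
    }
};
```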

Efficiency. The efficiency of this technique is directly related to the number of entries, marked with \(\top \), whose subintervals do not contain a CDT element. We denote the probability of performing a binary search by \(\textsf {P}_\textsf {binsrch}\); obviously, the probability of returning the sample immediately after choosing i, which is part of p, is \(1 - \textsf {P}_\textsf {binsrch}\). Lemma 7 gives a lower bound on \(\textsf {P}_\textsf {binsrch}\).

Lemma 7

Let \(\ell \ge 2^8\) and \( s \ge \eta _{\frac{1}{2}}(\mathbb {Z})\), and let \(\textsf {P}_\textsf {binsrch}\) be the probability of performing a binary search during the execution of the CDT algorithm implemented with the lookup table trick described above. We have

Proof

$$\begin{aligned} \textsf {P}_\textsf {binsrch} = \frac{\ell - \sum _{i=\lfloor c - \tau s\rfloor }^{\lceil c + \tau s\rceil } \left\lfloor \ell {{\mathrm{cdf}}}_{s,c}(i)\right\rfloor - \left\lfloor \ell {{\mathrm{cdf}}}_{s,c}(i-1)\right\rfloor }{\ell } \end{aligned}$$

From Lemma 2 we have

$$\begin{aligned} \left\lfloor \ell {{\mathrm{cdf}}}_{s,c}\left( \left\lfloor c-0.6 s\sqrt{\log _2\ell } \right\rfloor \right) \right\rfloor = 0 \end{aligned}$$
$$\begin{aligned} \left\lfloor \ell \left( 1-{{\mathrm{cdf}}}_{s,c}\left( \left\lceil c+0.6 s\sqrt{\log _2\ell }\right\rceil \right) \right) \right\rfloor = 0 \end{aligned}$$

   \(\square \)

Table 1. Performance of sampling from \(D_{(g),\sigma '}\) as implemented in [3] and with our non-centered discrete Gaussian sampler with \(\ell = n = 2^{8}\). The column \(D_{(g),\sigma '}/s\) gives the number of samples returned per second, and the column “memory” gives the maximum amount of memory consumed by the process. All timings are on an Intel(R) Xeon(R) CPU E5-2667 (strombenzin). Precomputation uses two cores; the online phase uses one core.

6 Experimental Results

In this section, we present experimental results for our C++ implementation, distributed under the terms of the GNU General Public License version 3 or later (GPLv3+), which uses the MPFR [15] and GMP [19] libraries as well as Salsa20 [5] as the pseudorandom number generator. Our non-centered discrete Gaussian sampler was implemented with a binary search executed byte by byte if \(\ell = 2^8\) and 2 bytes by 2 bytes if \(\ell = 2^{16}\), without recursive subdivision of \(U_{[0,1)}\); therefore \([0,1)\) is subdivided into \(\ell \) intervals of the same size and \({{\mathrm{cdf}}}(x)\) is stored for all \(x \in \left[ -\lceil \tau \sigma \rceil -1, \lceil \tau \sigma \rceil +1\right] \). The implementation uses a fixed number of precomputed centers \( n = 2^{8}\) with a lookup table of size \(\ell = 2^{8}\) and includes the lazy cdf evaluation optimization.

We tested the performance of our non-centered discrete Gaussian sampler by using it as a subroutine of Peikert’s sampler [32] for sampling from \(D_{(g),\sigma ',0}\) with \(g \in \mathbb {Z} [x]/(x^{N}+1)\) for \(N\) a power of two. To this end, we adapted the implementation of this sampler from [3], swapping out the sampler from the dgs library [2] (implementing rejection sampling and [11]) used in [3] for our sampler for \(D_{\mathbb {Z},\sigma ,c}\). Note that sampling from \(D_{(g),\sigma ',0}\) is more involved and thus slower than sampling from \(D_{\mathbb {Z} ^{N},\sigma ',0}\). That is, to sample from \(D_{(g),\sigma ',0}\), [3] first computes an approximate square root of \(\varSigma _2 = \sigma '^2 \cdot g^{-T} \cdot g^{-1} - r^2\) with \(r=2\cdot \lceil \sqrt{\log N}\, \rceil \). Then, given an approximation \(\sqrt{\varSigma _2}'\) of \(\sqrt{\varSigma _2}\), it samples a vector \(x \leftarrow _{\$} \mathbb {R}^N\) from a standard normal distribution and interprets it as a polynomial in \(\mathbb {Q}[X]/(x^{N}+1)\); computes \(y = \sqrt{\varSigma _2}' \cdot x\) in \(\mathbb {Q}[X]/(x^{N}+1)\); and returns \(g \cdot (\lfloor y \rceil _r)\), where \(\lfloor y \rceil _r\) denotes sampling a vector in \(\mathbb {Z} ^N\) whose i-th component follows \(D_{\mathbb {Z},r,y_i}\). Thus, implementing Peikert’s sampler requires sampling from \(D_{\mathbb {Z},r,y_i}\) for changing centers \(y_{i}\) and sampling from a standard normal distribution. We give experimental results in Table 1, indicating that our sampler increases the rate by a factor of \(\approx 3\).