RNS Montgomery reduction algorithms using quadratic residuosity

The residue number system (RNS) is a method for representing an integer as an n-tuple of its residues with respect to a given base. Since RNS has inherent parallelism, it is actively researched as a way to implement faster processing systems for public-key cryptography. This paper proposes new RNS Montgomery reduction algorithms, Q-RNSs, whose main part consists of two matrix multiplications. Letting n be the size of a base set, the number of unit modular multiplications in the proposed algorithms is evaluated as (2n² + n). This is achieved by posing a new restriction on the RNS base, namely, that its elements should have a certain quadratic residuosity. This makes it possible to remove some multiplication steps from conventional algorithms, and thus the new algorithms are simpler and have higher regularity than conventional ones. Our experiments confirm that there are sufficiently many candidates for RNS bases meeting the quadratic residuosity requirements.


Introduction
The residue number system (RNS) is a method for representing an integer in which a given integer x is represented by its residues with respect to a base of integers that are pairwise co-prime. If we denote the base by B = {m_1, m_2, . . ., m_n} and the RNS representation of x as [x_1, x_2, . . ., x_n], it holds that x_i = x mod m_i. The main feature of RNS is that addition, subtraction, and multiplication are carried out by independent addition, subtraction, and multiplication with respect to each base element. The operation flow at each base element is called a channel. If each channel has a processing unit, an n-fold speed increase can be achieved. (Part of this paper is based on results obtained from a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO).)
This paper proposes improved RNS M-red algorithms, Q-RNS M-reds, which, by posing quadratic residuosity constraints on the RNS base, achieve the least number of multiplications. Past improvements in RNS M-red algorithms, with exceptions such as [6], were optimizations within one round of M-red execution, whereas our optimization for Q-RNS M-red is novel in that it transfers the square root of a constant from the current round to the previous round. Q-RNS includes two concrete algorithms called sQ-RNS and dQ-RNS, which differ in the multiplication unit used. This paper is organized as follows. Section 2 introduces the notation used and some basic concepts. Section 3 explains the conventional RNS M-red algorithms. In Sect. 4, we introduce the new idea of using quadratic residuosity to simplify the M-red algorithm. The first variant, sQ-RNS M-red, is a direct combination of the new idea with a conventional algorithm. The second variant, dQ-RNS M-red, relaxes the constraint on RNS base choice by introducing the double-level Montgomery technique from [13]. Other procedures necessary to implement public-key cryptography, such as Initialize, are discussed in Sect. 5. Section 6 compares RNS M-red algorithms including FPGA implementations, and Sect. 7 concludes this paper.

Notation
The following definitions are applied in this paper.
w : Bit size of a word in a given computer.
⟨x⟩_m = x mod m, where ⟨x⟩_m ∈ [0, m).
In this paper, matrix expressions are used to describe parallel processing using RNS. If the matrix is diagonal, no substantive mixture of B and B′ occurs in an operation. In such cases, the definitions above are sufficient to carry out the matrix operations. If different bases appear in an operation, which occurs in the base extension operation or the ToBin transformation, the following computation rules apply.
If the multiplication unit is the Montgomery one, analogous rules apply. These definitions suffice to carry out the matrix computations appearing in this paper. Note that the matrix computation in this paper differs from standard matrix computation in that the result in each line is reduced by a modulus unique to that line. Therefore, no inverse matrix can be defined here. However, this representation effectively simplifies the algorithm descriptions and makes it easy to count the number of operations.

Modular multiplication
Most public-key cryptosystems are implemented by repetition of a modular multiplication with a large modulus p, that is, ⟨xy⟩_p, where p < 2^l. Here l ranges from several hundred to several thousand, and p is usually a large prime or a product of two large primes. Suppose that we need to add the results of two modular multiplications. If we run a modular multiplication twice, at least two multiplications and two modular reductions are necessary. If we instead compute ⟨xy + zw⟩_p directly, a single reduction suffices after the two multiplications and an addition. This technique is called lazy reduction, and it effectively reduces the number of reduction operations when a summation of several modular multiplications is to be computed. Such a case frequently occurs in implementations of Elliptic Curve Cryptography.
We define the number of terms ν as the degree of laziness. For example, Eq. (2) has degree ν = 2.
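The saving can be sketched in a few lines of Python; this is a toy illustration only, and the modulus and operand values are arbitrary choices, not taken from the paper:

```python
# Toy sketch of lazy reduction: nu modular products summed with one reduction.
p = (1 << 255) - 19          # the Curve25519 prime, used here only as an example modulus

def sum_of_products_eager(pairs, p):
    # nu multiplications and nu + 1 reductions
    return sum((x * y) % p for x, y in pairs) % p

def sum_of_products_lazy(pairs, p):
    # nu multiplications but a single reduction (degree of laziness nu = len(pairs))
    return sum(x * y for x, y in pairs) % p
```

Both functions return the same residue; the lazy variant simply defers the reduction until after the final addition.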

Montgomery reduction
When implementing modular multiplication, one option is to simply use Eq. (1). However, to avoid the conditional branches inherent in the division operation, another popular option is to implement it using the Montgomery multiplication below [21].
Figure 1 shows details of the procedure.
Step 1 is a multiplication, followed by the Montgomery reduction in steps 2-5, which here is called M-red. For correct results, it suffices that gcd(R, p) = 1 and R > p. Since p is usually odd, choosing R = 2^l satisfies these conditions. In this setting, step 4 is carried out simply by a shift operation. Step 5 is called the final subtraction, which makes the computation result less than p.
Let MM(x, y) denote the right-hand side of Eq. (3). Using MM(x, y), the procedure to compute a modular multiplication consists of an Initialize step, the main body, and a Finalize step. If our goal is a single execution of a modular multiplication, calling MM(x, y) four times is not efficient. However, when computing many modular multiplications, the overheads of Initialize and Finalize become negligible because each is called only once, at the beginning and at the end, respectively.
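As a concrete illustration, the following Python sketch implements M-red (steps 2-5 of Fig. 1) and the four-call MM(x, y) flow; the prime p = 10007 and l = 14 are toy parameters chosen only for the example:

```python
def mred(T, p, l):
    # Montgomery reduction, steps 2-5 of Fig. 1: returns T * R^{-1} mod p, R = 2^l
    R = 1 << l
    q = (T * (-pow(p, -1, R))) % R   # step 2: q <- T * (-p^{-1}) mod R
    t = (T + q * p) >> l             # steps 3-4: exact division by R is a shift
    return t - p if t >= p else t    # step 5: final subtraction

def MM(x, y, p, l):
    return mred(x * y, p, l)         # MM(x, y) = x * y * R^{-1} mod p

p, l = 10007, 14                     # toy prime and R = 2^14 > p
R2 = (1 << (2 * l)) % p              # Initialize constant R^2 mod p
x, y = 1234, 5678
xm = MM(x, R2, p, l)                 # Initialize: x -> x*R mod p
ym = MM(y, R2, p, l)
zm = MM(xm, ym, p, l)                # Main body: stays in the Montgomery domain
z = MM(zm, 1, p, l)                  # Finalize: back to ordinary representation
assert z == x * y % p
```

The Initialize and Finalize calls above are exactly the overhead that becomes negligible when many multiplications are chained in the Montgomery domain.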
The goal of this paper is to propose an efficient algorithm to compute the Montgomery reduction in RNS, that is, an RNS M-red algorithm.

RNS
Let B = {m_1, . . ., m_n} be a base for RNS representation, where gcd(m_i, m_j) = 1 holds for i ≠ j. The RNS representation of an integer x is given by [⟨x⟩_{m_1}, . . ., ⟨x⟩_{m_n}]. The symbol ⟨x⟩_m is defined as ⟨x⟩_m = x mod m, and thus ⟨x⟩_m ∈ [0, m) holds. The n-tuple on the right is called the RNS representation of x in base B. The representation is unique if 0 ≤ x < M, where M = ∏_{i=1}^n m_i. This representation allows fast arithmetic in Z/MZ, since ⟨x ∘ y⟩_{m_i} can be computed independently in each channel, where ∘ ∈ {+, −, ×, /}; '/' applies only if y is co-prime to M.
In RNS, a large integer can be processed with independent parallel operations in each channel. If we could use small factors of the public key p as an RNS base, a very efficient implementation would be possible. It is, however, difficult to employ such an approach, since the public key p is usually a large prime or a product of two large primes.
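The channel-wise arithmetic can be sketched as follows; the base B is a toy choice of four small primes, not a base from the paper's experiments:

```python
from math import prod

B = [251, 241, 239, 233]                 # pairwise co-prime toy base (n = 4)
M = prod(B)                              # dynamic range [0, M)

def to_rns(x):
    return [x % m for m in B]

def rns_mul(xs, ys):
    # one independent unit multiplication per channel; no carries cross channels
    return [(x * y) % m for x, y, m in zip(xs, ys, B)]

x, y = 123456, 654321
assert rns_mul(to_rns(x), to_rns(y)) == to_rns(x * y % M)
```

Addition and subtraction work the same way, which is what makes an n-unit parallel implementation straightforward.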

Chinese remainder theorem
According to the Chinese remainder theorem, the integer x represented in RNS is recovered by x = ⟨∑_{i=1}^n ⟨x_i M_i^{-1}⟩_{m_i} M_i⟩_M, where M_i = M/m_i. Let us consider a method avoiding the modulo-M operation in evaluating the right-hand side. Writing ξ_i = ⟨x_i M_i^{-1}⟩_{m_i}, we can replace the outer reduction by the subtraction of a multiple of M without affecting the equality, and rewrite the equation with a new unknown L as x = ∑_{i=1}^n ξ_i M_i − LM. Considering 0 ≤ x/M < 1, we obtain L = ⌊∑_{i=1}^n ξ_i/m_i⌋.
An approximation L̂ of L is proposed in [3]: L̂ = ⌊α + ∑_{i=1}^n trunc(t, ξ_i)/2^w⌋ for an appropriate offset α ∈ [0, 1), where trunc(t, x) is a function that forces the lower w − t bits of x to zero, and 0 < t ≤ w. The difference between L̂ and L is at most 1 if appropriate t and α are selected. This approximation is the most important part of the base extension process in the computation of RNS M-red.
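The recovery formula and the approximation of L can be checked numerically. The sketch below uses a toy base with w = 8, t = 6, and α = 0; these parameter values are illustrative assumptions, with exact integer arithmetic serving as the reference for L:

```python
from math import prod

B = [251, 241, 239, 233]        # pseudo-Mersenne-style toy base, m_i = 2^w - mu_i
w, t, alpha = 8, 6, 0.0
M = prod(B)

def trunc(tbits, x):
    return (x >> (w - tbits)) << (w - tbits)    # zero the lower w - t bits

def recover(residues):
    # xi_i = <x_i * M_i^{-1}>_{m_i};  x = sum_i xi_i * M_i - L*M
    xis = [(r * pow(M // m, -1, m)) % m for r, m in zip(residues, B)]
    S = sum(xi * (M // m) for xi, m in zip(xis, B))
    L = S // M                                   # exact L = floor(sum xi_i / m_i)
    L_hat = int(alpha + sum(trunc(t, xi) for xi in xis) / 2 ** w)
    return S - L * M, L, L_hat

x = 123456789 % M
x_rec, L, L_hat = recover([x % m for m in B])
assert x_rec == x and 0 <= L - L_hat <= 1        # L_hat differs from L by at most 1
```

With α = 0 the approximation never overshoots, so L̂ ∈ {L, L − 1}; choosing α and t as described in [3] removes even that off-by-one in the error-free settings used later.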

Special modulus for fast reduction
Usually, arbitrary moduli can be selected as RNS base elements so long as they are pairwise co-prime. It is well known that pseudo-Mersenne primes of the form m_i = 2^w − μ_i are useful for fast reduction. Actually, if Eq. (4) holds, x mod m_i can be computed efficiently with the following procedure. First, repeat the operation x ← (x mod 2^w) + μ_i ⌊x/2^w⌋ twice. If x < 2^{2w} holds for the initial value, the result is in [0, 2^{w+1}).
The final result is obtained by subtracting m_i at most once. In addition, if the Hamming weight of μ_i is small, the multiplication by μ_i can be replaced by several additions. Let μ̄ = (1/w) log₂ μ_i. Equation (4) is satisfied if μ̄ < 0.5.
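A direct Python sketch of this fast reduction follows; w = 16 and μ = 17 are toy parameters (so m = 65519 and μ̄ = log₂(17)/16 < 0.5), and the final correction is written as a loop for safety even though the text bounds it at one subtraction:

```python
def pm_reduce(x, w, mu):
    # reduce x < 2^(2w) modulo m = 2^w - mu, using the identity 2^w = mu (mod m)
    m = (1 << w) - mu
    mask = (1 << w) - 1
    for _ in range(2):                 # "repeat the operation below twice"
        x = (x & mask) + mu * (x >> w)
    while x >= m:                      # final correction (at most one subtraction
        x -= m                         # when mu-bar < 0.5, per the text's bound)
    return x

w, mu = 16, 17                         # toy parameters: m = 65519
m = (1 << w) - mu
for x in (0, m - 1, m, 12345678, (1 << (2 * w)) - 1):
    assert pm_reduce(x, w, mu) == x % m
```

Each iteration preserves the value modulo m because 2^w ≡ μ (mod m), while shrinking the operand to about w bits.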

Quadratic residuosity
An integer a is called a quadratic residue modulo m if there exists a solution x to the congruence x² ≡ a (mod m).
In other words, a is called a quadratic residue if a has a square root modulo m, and a quadratic non-residue otherwise. Unlike the real numbers, not every integer has a square root for a given modulus m. An integer a can be a quadratic residue or a quadratic non-residue, depending on the value of the modulus m.
Let the function QR(a, m) be defined by QR(a, m) = 1 if a is a quadratic residue modulo m, and QR(a, m) = 0 otherwise. This function will be used as a distinguisher of quadratic residues. If QR(a, m) = 1, let ⟨a^{1/2}⟩_m denote one of the square roots of a modulo m.
Quadratic residuosity has been used with RNS in signal processing applications to represent a complex signal (for instance, refer to Section 8.1 of [22]), whereas no previously proposed RNS M-red algorithm has used quadratic residuosity. This paper applies quadratic residuosity to the RNS M-red algorithm for the first time, constructing algorithms that consist of the least number of unit multiplications.
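For an odd prime modulus m, QR(a, m) can be evaluated with Euler's criterion, and when m ≡ 3 (mod 4) a square root is obtained by a single exponentiation; the general case needs Tonelli-Shanks. The toy values below are illustrative:

```python
def QR(a, m):
    # Euler's criterion for an odd prime m: a^((m-1)/2) = 1 (mod m) iff a is a QR
    return 1 if pow(a % m, (m - 1) // 2, m) == 1 else 0

def sqrt_mod(a, m):
    # square root for m = 3 (mod 4); the general case needs Tonelli-Shanks
    assert m % 4 == 3 and QR(a, m) == 1
    return pow(a % m, (m + 1) // 4, m)

m = 251                     # odd prime, 251 = 3 (mod 4)
a = 17 * 17 % m             # a quadratic residue by construction
r = sqrt_mod(a, m)
assert QR(a, m) == 1 and r * r % m == a
```

For composite moduli (such as the squares used by dQ-RNS later), residuosity follows from the residuosity modulo the prime factors.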

Algorithm
Figure 2 shows the Montgomery reduction algorithm corresponding to steps 2-5 of Fig. 1. By relaxing the range of the output to less than 2p, the final subtraction has been removed. In addition, the upper bound on the input is relaxed from p² to βp² with β ≥ 4. The condition R > βp ensures that the output is less than 2p. All RNS M-red algorithms in this paper can be regarded as RNS variants of this M-red algorithm.
Figure 3 shows the RNS M-red algorithm derived straightforwardly from M-red in Fig. 2. A description of each step is on the same line as the step number, followed by the actual specification in matrix form. Steps 1, 4, and 5 correspond to steps 1, 2, and 3 in Fig. 2, respectively. Steps 2 and 3 derive the approximation {q̂}_{B′} from {q}_B, a technique called the base extension. Similarly, steps 6 and 7 are also a base extension. (Note that R = M = ∏_{i=1}^n m_i for RNS M-red, while R = 2^l is the common setting for the binary M-red.)
Since step 1 is carried out in base B, modulo M is automatically applied to the computation, and the result is equivalent to that of step 1 in Fig. 2. It is in base B′ that steps 4 and 5 should be carried out, for the following reasons. As for step 4, it is of no use computing (x + pq) in B because the result is always a multiple of M and thus always 0 in base B. The computation in step 5 multiplies r by M^{-1}. This can be carried out in base B′ but not in base B, since M^{-1} does not exist in base B. Although the final result s is computed at step 5, it is only represented in base B′. In order to complete the representation in base B, steps 6 and 7 extend {s}_{B′} to {s}_B. This ensures compatibility between the output and input of the RNS M-red algorithm.
The matrix elements in each step are defined as shown in Fig. 3; steps 3 and 7 contain the matrices of the first and second base extensions, respectively.
This algorithm includes (2n² + 5n) unit multiplications. We exclude the multiplications by L̂ at steps 3 and 7 because a technique in [3] shows how to carry out each of these with fewer than n additions.
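The structure of Fig. 3 can be sketched end to end in Python. This is a simplified model under stated assumptions: the function names are hypothetical, the bases are toy primes, and an exact CRT-based base extension stands in for the approximate extension with L̂ used in the actual algorithm:

```python
from math import prod

def base_extend(residues, src, dst):
    # exact base extension via CRT; Fig. 3 approximates this with the
    # correction term L_hat instead of a full reconstruction
    M = prod(src)
    x = sum(((r * pow(M // m, -1, m)) % m) * (M // m)
            for r, m in zip(residues, src)) % M
    return [x % m for m in dst]

def rns_mred(x, p, B, Bp):
    # returns residues of s = x * M^{-1} mod p (relaxed range: s < 2p)
    M = prod(B)
    q = [(x % m) * pow(-p, -1, m) % m for m in B]             # step 1, in base B
    q_ext = base_extend(q, B, Bp)                             # steps 2-3
    r = [((x % m) + p * qe) % m for qe, m in zip(q_ext, Bp)]  # step 4, in base B'
    s = [rr * pow(M, -1, m) % m for rr, m in zip(r, Bp)]      # step 5, in base B'
    return base_extend(s, Bp, B + Bp)                         # steps 6-7

p = 10007
B, Bp = [251, 241, 239, 233], [229, 227, 223, 211]            # toy bases
M = prod(B)
x = 4 * p * p - 12345                                         # input below beta*p^2, beta = 4
s = base_extend(rns_mred(x, p, B, Bp), B + Bp, [M])[0]        # read the integer back
assert s % p == x * pow(M, -1, p) % p and s < 2 * p
```

The sketch makes visible why steps 4-5 must run in B′ (division by M is impossible in B) and why the output must be extended back to B before the next round.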

Requirement for parameters
Throughout this paper, we assume that the bit length w is common to the base elements m_i and m_i′. Let m_i have the form of a pseudo-Mersenne prime, m_i = 2^w − μ_i, where μ_i is a relatively small positive integer; similarly for m_i′. Such a modulus has two important properties: (a) m_i is a special modulus for fast reduction;
(b) 1/m_i can be well approximated as 1/2^w.
Property (b) applies to the computation of L̂ in steps 3 and 7 of Fig. 3. Let ξ_i(q) = ⟨θ⟩_{m_i} be the result of step 2. To estimate the approximation error, compare f_i(q) = ξ_i(q)/m_i with its approximation f̂_i(q) = trunc(t, ξ_i(q))/2^w. The following equations hold for f and f̂ [23].
Let q̂ be the extended value of q as computed in step 3. Using the above equations, we can show the relationship between q and q̂ as q̂ = q + uM with u ∈ {0, 1} if the offset α = 0. This means that q is transformed into q̂ at step 3 with an error term uM. This error is absorbed in the relaxed range of the output ⟨xM^{-1}⟩_p until the end of step 5. A similar analysis applies to the second base extension at step 7. In this case, the approximation error e_1 is replaced by e_2 with parameters (m_i′, μ_i′), and the offset α is positive. A typical offset at the second base extension is α = 0.5 [3]. The second base extension is error-free if e_2 < α and 2p ≤ (1 − α)M′.
From condition (iii) and Eq. (5), we can derive the lower bound of t, the effective number of bits for approximation, where e_0 and e_0′ represent the summation parts of e_1 and e_2, respectively.

G-RNS algorithm
Guillermin proposed an algorithm that at the time achieved the minimum number of unit multiplications [9]. We call this algorithm G-RNS (Fig. 4). Step 1 is the integration of steps 1 and 2 in Fig. 3. Step 2 comes from the first term of step 4 combined with steps 5 and 6 of Fig. 3. Step 3 is derived from steps 3-6 of Fig. 3. Step 4 corresponds to step 7 of the basic algorithm.
Step 5 is new in Fig. 4.
The elements of the matrices of G-RNS are defined for steps 1 through 4 as shown in Fig. 4, and in step 5, c_{ii} = ⟨M_i⟩_{m_i}. The necessary number of unit multiplications for G-RNS is (2n² + 3n).

C-RNS algorithm
Figure 5 shows an algorithm proposed by Cheung et al. [15]. The elements appearing in steps 1 through 5 are defined as shown in Fig. 5, and in step 6, c_{ii} = ⟨M_i⟩_{m_i}. The difference from G-RNS is that C-RNS restores the original base extension matrix at step 3. The number of multiplications is (2n² + 4n) in general, but the computation of the base extension can be reduced drastically in the special case when n is small. As discussed in [15,16], when n is not so large, we can expect that a_ij and b_ij are close to 2^w. This makes it possible to reduce the computation amount in the base extensions. In hindsight, this property can be applied to the basic RNS M-red as well. It is shown in [16] that efficient parameters exist for n = 4 and a 258-bit modulus p.

R-RNS algorithm
Gandino et al. proposed a reorganized version of the RNS Montgomery multiplication algorithm [6], which we call the R-RNS algorithm here. As shown in Fig. 6, we can describe the R-RNS algorithm using almost the same notation as G-RNS. Let us explain the differences between Figs. 4 and 6. In step 2, the elements of the matrix are changed from e_ij to ê_ij, where the latter is defined so that the following relationship holds.
Thus, the result of step 2 in Fig. 6 is identical to that in Fig. 4. In addition, the notation of the result in step 3 is changed to ⟨ŝ⟩_{m_i}, although its value is identical to ⟨σ⟩_{m_i}, the result of step 3 of Fig. 4.

Since the new output includes ⟨ŝ⟩_{m_i} instead of ⟨s⟩_{m_i}, step 5 of Fig. 4 is omitted in Fig. 6. This reduces the number of unit multiplications by n compared with Fig. 4, for a total of (2n² + 2n).

Derivation of Q-RNS
We introduce an idea of posing quadratic residuosity conditions on the RNS base so as to make steps 1 and 2 of the G-RNS algorithm unnecessary. Figure 7 (left) shows part of a long sequence of operations in which a multiplication and a G-RNS M-red are repeated. It consists of three phases: the previous M-red, a multiplication, and the present M-red. The input of the present M-red is {xy}_{B∪B′}. For simplicity, the elements of the RNS representation are uniformly numbered from 1 to 2n only in Fig. 7. From the definition of G-RNS, the input is multiplied by constants in bases B = {m_1, . . ., m_n} and B′ = {m_1′, . . ., m_n′}, respectively. If the bases B and B′ are selected so that these constants are quadratic residues, each constant can be represented as the square of a constant K, as shown in Fig. 7 (left). We then transfer the square root K from the present M-red to the previous M-red, integrating K into the coefficients of the multiplications at the last steps (Fig. 7, right). As a result, the outputs of the previous M-reds are modified to Kx and Ky, and their product is K²xy, which is the same as the value immediately after steps 1 and 2 of Fig. 4. We call this new algorithm Q-RNS M-red, or simply Q-RNS, from the initials of quadratic residuosity. Q-RNS includes sQ-RNS, which is directly derived from G-RNS, and dQ-RNS, in which the unit multiplication is replaced by the Montgomery multiplication.
Most past improvements in RNS M-red algorithms, except [6], were optimizations within one round of M-red execution. Our optimization for Q-RNS M-red is unique in that it transfers the square root of a constant from the present round to the previous round.
As seen in Fig. 7, Q-RNS assumes that the multiplication is carried out as preprocessing for the next M-red. This assumption ensures that the degree of K is 2. Let us consider the possible degrees of K. All RNS M-red algorithms discussed in this paper use two bases, B and B′, with the intent that these M-reds accommodate a number twice the length of what a single base can represent. This means that M-red is designed not to accommodate a number of degree 3 or more. If the degree of K is 1 or 0, we could cope with such cases by multiplying the input by K or K². Even if such cases should occur, the computation amount would be the same as in Fig. 7 (left).
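The square-root transfer of Fig. 7 can be demonstrated on a single channel; the modulus and constant below are toy values chosen for the illustration:

```python
# Single-channel illustration of Fig. 7: if the constant c multiplying the
# M-red input is a quadratic residue mod m, write c = K^2 and fold K into the
# outputs of the previous round; steps 1-2 of G-RNS then become unnecessary.
m = 251                          # toy channel modulus, m = 3 (mod 4)
c = 5                            # toy per-channel constant; QR(c, m) must equal 1
assert pow(c, (m - 1) // 2, m) == 1
K = pow(c, (m + 1) // 4, m)      # a square root of c (valid since m = 3 mod 4)
x, y = 123, 87                   # outputs of the previous round, pre-multiplied by K
assert (K * x % m) * (K * y % m) % m == c * x * y % m
```

The multiplication by K costs nothing extra, because it merges into constants that the previous round multiplies anyway.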

sQ-RNS algorithm
Figure 8 shows the sQ-RNS algorithm, the initial "s" indicating single-level rather than double-level Montgomery, a technique proposed in [13]. sQ-RNS is basically derived according to the procedure shown in Fig. 7, with a small extra optimization.
The constants d_ii and e_ii, defined by Eqs. (6) and (7), are the diagonal elements in steps 1 and 2 of G-RNS. The quadratic residuosity of these constants is key to the design of Q-RNS. Taking the square root of d_ii would yield a factor (−1)^{1/2}, which requires that −1 be a quadratic residue modulo m_i. To reduce the constraints on base B even a bit, the factor ⟨−1⟩_{m_i} in d_ii is moved to the base extension matrix a_ij. As a result, the constant K² is defined accordingly. (If more than one square root of the constant exists, either is usable to construct Q-RNS M-red.)
The new requirement for base B is that the values on the right-hand side of the above equations must be quadratic residues. For a given p and i, j ∈ [1, n], we can describe the requirement using the function QR as follows. If Eq. (8) holds, there exists a coefficient K defined by the following equations, and Q-RNS is properly defined.
The computation of L̂ at step 1 is also modified due to the transfer of the factor ⟨−1⟩_{m_i}. Before the transfer, it was L̂ ← ⌊0 + ∑_{i=1}^n trunc(t, ⟨K²x⟩_{m_i})/2^w⌋. This is replaced by the formula below. Note here that the offset value changes from 0 to (1 + α), which compensates for the effect of the transfer of the factor ⟨−1⟩_{m_i}. The derivation of the new formula is explained in "Appendix A". This also makes the constant a_i negative, as shown in the next paragraph. In Fig. 8, the input and output of the algorithm are {K²x}_{B∪B′} and {Ks}_{B∪B′}, respectively, and the elements of each matrix are defined from those of G-RNS as follows. The constants β_ij, β_i, and γ_i are each multiplied by K, because they correspond to the explanatory constants h_1 through h_2n in Fig. 7 (right). Note that steps 2 and 3 in Fig. 8 can be carried out simultaneously; therefore, the computation time of sQ-RNS can be estimated as comparable with twice that of a matrix multiplication. The number of unit multiplications is (2n² + n), which is the minimum among all previously proposed RNS M-red algorithms.

dQ-RNS algorithm
The double-level Montgomery is a technique proposed in [13]. It replaces the standard modular unit multiplication in RNS M-red with a single-word Montgomery multiplication. dQ-RNS in Fig. 9 is derived by applying this technique to sQ-RNS. Using the Montgomery multiplication ⊗_M removes the requirement μ̄ < 0.5 imposed by the special modulus for fast reduction. Without this requirement, we can take square numbers m_i′ = σ_i² as base elements, even though they may violate the condition μ̄ < 0.5. In this approach, the condition for quadratic residuosity becomes very simple, as given by Eq. (9), since m_i′ = σ_i² automatically satisfies QR(m_i′, m_j) = 1. Therefore, it is expected that bases can be found efficiently for a wider range of base sizes n.
The elements of the matrices in dQ-RNS are defined from those of sQ-RNS with a modification by the coefficient 2^{kw/2} (for k = 1, 2, 3), which is represented by a symbol with a dot, e.g., α̇_ij = ⟨2^w α_ij⟩_{m_i} and α̇_i = ⟨2^w α_i⟩_{m_i}. If we take an even-valued w, these elements are always well defined. The constants {K²}_{B∪B′} and {K}_{B∪B′} are the same as in sQ-RNS. Consequently, the variables {K²x}_{B∪B′} and {ξ}_B are unchanged from sQ-RNS, leaving the formulae for L̂ and L̂′ unchanged.
The number of unit multiplications of dQ-RNS is also (2n² + n). It should be noted that the unit multiplication in this case is the Montgomery multiplication.
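A single-word Montgomery multiplication unit can be sketched as follows; w = 16 and the square modulus m = 251² are toy choices illustrating why the μ̄ < 0.5 requirement disappears (only gcd(m, 2) = 1 is needed):

```python
def make_mont_unit(w, m):
    # single-word Montgomery multiplication: a (x)_M b = a * b * 2^{-w} mod m;
    # only an odd m is required, so square moduli m = sigma^2 are allowed
    R = 1 << w
    m_neg_inv = (-pow(m, -1, R)) % R
    def mul(a, b):
        T = a * b
        q = (T * m_neg_inv) & (R - 1)   # mod 2^w is a mask
        t = (T + q * m) >> w            # exact division by 2^w is a shift
        return t - m if t >= m else t
    return mul

w = 16
m = 251 * 251                           # an odd square modulus below 2^16
mul = make_mont_unit(w, m)
a, b = 12345 % m, 54321 % m
assert mul(a, b) == a * b * pow(1 << w, -1, m) % m
```

The extra factor 2^{-w} per multiplication is what the dotted constants 2^{kw/2} in Fig. 9 compensate for.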

Base search
A base search experiment is carried out for a given modulus p to find RNS bases that satisfy the requirement for quadratic residuosity. To avoid bias, we use the five NIST primes [24] and the prime for Curve25519 [25], which are defined as common moduli for Elliptic Curve Cryptography. The NIST primes are called P-192, P-224, P-256, P-384, and P-521, with the numbers representing the bit size of each prime. The prime for Curve25519 is defined as 2^255 − 19. Experiment for sQ-RNS: we search for bases satisfying Eq. (8) using the following search algorithm. Search algorithm 1: (1) Let the candidates be an ordered sequence of prime numbers of the form c_i = 2^w − μ_i, where μ_i > μ_j > 0 for i > j.
The search is done in a smaller-index-first manner.
Since we choose the candidates from among prime numbers, they automatically satisfy the condition that they be pairwise co-prime. This also makes it easier to determine quadratic residuosity. Table 1 presents the search results for n = 4. The rightmost column shows that these bases satisfy the condition μ̄ < 0.5.
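The structure of search algorithm 1 can be sketched as below. Note the assumptions: the exact condition of Eq. (8) is not reproduced here, so the single check QR(p, c) = 1 is used as a simplified stand-in, and the parameters are toy values:

```python
def is_prime(n):
    # trial division; adequate for word-sized candidates
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def qr(a, m):
    # Euler's criterion, m an odd prime
    return pow(a % m, (m - 1) // 2, m) == 1

def search_base(p, w, n):
    # smaller-index-first search over prime candidates c = 2^w - mu (mu odd);
    # QR(p, c) = 1 stands in for the full condition of Eq. (8), which also
    # constrains the residuosity of the base-extension constants
    base, mu = [], 1
    while len(base) < n and mu < (1 << (w // 2)):   # keep mu-bar < 0.5
        c = (1 << w) - mu
        if is_prime(c) and qr(p, c):
            base.append(c)
        mu += 2
    return base

p = 2**255 - 19                  # the Curve25519 prime
base = search_base(p, 16, 4)
```

Since roughly half of the candidate primes pass each residuosity check, a base of moderate size n is typically found after scanning only a short prefix of the candidate sequence.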

Experiment for dQ-RNS:
We apply the following search algorithm, which generates bases satisfying the condition given by Eq. (9). Search algorithm 2: (1) Let the seeds be an ordered sequence of odd numbers of the form σ_i = 2^{w/2} − ν_i, where ν_i > ν_j > 0 for i > j. The search is done in a smaller-index-first manner.
For the base elements found by algorithm 2, it holds that μ̄ > 0.5, since m_i′ = σ_i² = 2^w − ν_i 2^{w/2+1} + ν_i², i.e., μ_i′ = ν_i(2^{w/2+1} − ν_i) > 2^{w/2}. Figure 10 shows the search results for α = 0.5 and degree of laziness ν ≥ 2. The search succeeds for the (n, w) pairs plotted in the figure, although the graph for P-256 is almost hidden behind that of Curve25519. The search fails when max(e_1, e_2) exceeds 0.5 and violates condition (iii), max(e_1, e_2) ≤ α, in Sect. 3.1.2. The lower bounds of word length w for success are 22, 24, 24, 26, 28, and 24 bits for P-192, P-224, P-256, P-384, P-521, and Curve25519, respectively. The lower bound t_0 of the bit length necessary for the approximation ranges from 3 to 8; therefore, it is possible to realize a compact computation circuit for L̂ and L̂′. Let N_1 be the number of seeds satisfying QR(p, σ_i) = 1, and let N_0 be the number of all seeds generated until the algorithm halts. In our experiment, N_1/N_0 ranges from 0.30 to 0.43, which implies that the probability that QR(p, σ_i) = 1 is near 0.3 for these primes. We also confirmed that bases are efficiently found for some randomly chosen non-NIST primes.
Experiments show that bases for dQ-RNS can be found unless the word size w is too small.For instance, w ≥ 22 suffices for P-192.Since values less than 22 do not seem to be promising parameters for efficient hardware implementation, dQ-RNS has a sufficient range of word size selection.It is up to hardware designers to determine optimum sizes for specific Q-RNS applications.

Application to cryptography
We discuss several procedures necessary for the RNS implementation of public-key cryptography, including Initialize, Finalize, the transform to RNS representation (ToRNS, hereafter), and the transform to binary representation (ToBin, hereafter). We also provide formulae for bounds on the degree of laziness and for relaxation of reduction within a channel. Although these issues were discussed in previous work, we are interested in the case of Q-RNS and in exact expressions of the bounds.

Basic representation
Table 2 shows the representation of the computation result in each algorithm. Row (a) corresponds to the orthodox modular multiplication described by Eq. (1). Row (b) is the standard Montgomery multiplication defined by Eq. (3), in which the constant M = 2^l is multiplied by x. The bar symbol in row (b) denotes relaxation of the upper bound of reduction from p to 2p. Row (c) represents the conventional RNS M-reds other than R-RNS; this is an immediate transformation of (b) into an RNS representation with bases B and B′. Rows (d) and (e) are for Q-RNS, derived from (c) by multiplying the constants K and 2^{w/2}K, respectively. The representation for the R-RNS algorithm is derived from (d) by replacing the coefficient {K}_{B∪B′} with {1}_B ∪ {X}_{B′} for an appropriate constant X. Montgomery developed an efficient reduction algorithm (b) by multiplying representation (a) by a constant R (here, M), whereas this paper proposes efficient RNS M-red algorithms (d) and (e) by multiplying representation (c) by the constants K and 2^{w/2}K. As a result, (d) and (e) are realized with fewer unit multiplications, and their structures are much simpler. As will be explained in the next subsection, we can embed the multiplication by K or 2^{w/2}K into the Initialize process, and we can carry out the removal of K or 2^{w/2}K in parallel with the Finalize process.

ToRNS and Initialize
If {⟨x⟩_p}_{B∪B′} is given, Initialize for the conventional RNS M-red is carried out as follows: first, the product (10) is computed; then the product is input to RNS M-red to obtain {⟨xM⟩_p}_{B∪B′}. A similar Initialize process can be defined for Q-RNS and applied to the result of ToRNS, which denotes the transformation from binary to RNS representation. To describe the concrete procedure, we assume the input is represented in binary as a sequence of w-bit words. For sQ-RNS, the procedure outputs X_{n−1} = ⟨⟨x⟩_p⟩_{m_i}. The second step can be implemented with a single-word modular reduction, which matches the special modulus for fast reduction. By running this procedure with all moduli, we obtain {⟨x⟩_p}_{B∪B′}.
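The word-serial conversion for one channel can be sketched as a Horner scan; the function name and parameters below are illustrative, and the per-step reduction is the place where the fast-reduction (sQ-RNS) or single-word Montgomery (dQ-RNS) variant differs:

```python
def to_rns_channel(x_words, m, w):
    # ToRNS for one channel: Horner scan over the w-bit words of x, most
    # significant word first; each step needs one single-word modular reduction
    X = 0
    for xw in x_words:
        X = (X * (1 << w) + xw) % m
    return X

w, m = 16, 65519                        # toy parameters
x = 0x123456789ABCDEF
words = [(x >> (w * j)) & 0xFFFF for j in (3, 2, 1, 0)]   # MSW first
assert to_rns_channel(words, m, w) == x % m
```

Running the same scan once per modulus (ideally in parallel, one channel per unit) yields the full RNS representation of ⟨x⟩_p.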
To initialize this variable, we first multiply it by a constant, then input the product to sQ-RNS and obtain the basic representation {⟨KxM⟩_p}_{B∪B′}. For dQ-RNS, the procedure outputs Y_{n−1} = ⟨2^{−(n−1)w}⟨x⟩_p⟩_{m_i}. The second step can be implemented with a single-word Montgomery reduction, which matches well with the double-level Montgomery. By running this procedure with all moduli, we obtain {2^{−(n−1)w}⟨x⟩_p}_{B∪B′}. It is possible to prepare the following lookup table for the single-word Montgomery reduction.
These constants are used in a similar way to the constant (− p −1 ) mod R at step 1 in Fig. 2.
For Initialize, we first multiply by a constant as follows. Then, we input this product to dQ-RNS and obtain the basic representation {⟨2^{w/2}KxM⟩_p}_{B∪B′}.

Finalize and ToBin
In the conventional RNS M-red, the Finalize of the Montgomery reduction is carried out by inputting the following value to RNS M-red.
Similarly, the Finalize for sQ-RNS and dQ-RNS is carried out by inputting the following values to the respective M-red algorithms. Regardless of whether sQ-RNS or dQ-RNS is used, the products above are both {⟨K²xM⟩_p}_{B∪B′}. Similarly, the intermediate results at step 1 of the corresponding Q-RNS M-reds are the same. To compute the binary representation, we need an additional subroutine, ToBin, shown in Fig. 11, whose input is the intermediate result of step 1 above. From condition (v) in Sect. 3.1.2, it follows that the input requirement of ToBin is met. If the return value Z of ToBin is not less than p, it is best to carry out the final subtraction in the binary representation.
ToBin in Fig. 11 is derived for dQ-RNS from the one proposed in [9]. In [9], one of the moduli in base B is chosen as m_1 = 2^w, which is used as the 2^w in Fig. 11. On the other hand, due to quadratic residuosity, it is not possible to use m_1 = 2^w for Q-RNS. Therefore, ToBin needs (n + 1) more words in its lookup table in step 2 than were used in [9]. In addition, step 4 needs an n-word table, while step 3 needs no table.
In Fig. 11, the unit multiplication is basically the Montgomery multiplication, although step 2 is an exception. A typical implementation of step 2 applies a multiplication without reduction and takes the lower w bits. Since steps 2 and 3 can be carried out at the same time, an efficient implementation is possible.
To obtain ToBin for sQ-RNS, steps 3 and 4 in Fig. 11 should each be modified by the corresponding parts of the equation above. These modifications need a 2n-word table for sQ-RNS in addition to those of the conventional RNS M-reds.

Degree of laziness
We now derive the upper bound on the degree of laziness ν in terms of the Q-RNS parameters. As a typical lazy reduction, we consider a product sum. Its output satisfies the upper bound on the input of dQ-RNS. On the other hand, from condition (iv) in Sect. 3.1.2, combined with 4ν ≤ β, we can derive the bound on ν below. Here, α may be replaced by e_1. The same formula can be applied to sQ-RNS.

Relaxation of reduction within channel
So far, we have assumed that the modular reduction in a unit multiplication is carried out strictly; that is, its result is always less than m_i. It is, however, known for the Montgomery reduction in Fig. 2 that relaxation of the reduction is effective for avoiding the conditional branch due to the final subtraction, thus making the implementation simpler. It may also be possible to apply this idea to the modular reduction in a unit operation. Let ⟨ξ⟩_m^{(δ)} denote the δ-relax of ⟨ξ⟩_m, defined by ⟨ξ⟩_m^{(δ)} ≡ ⟨ξ⟩_m (mod m) and ⟨ξ⟩_m^{(δ)} ∈ [0, δm), where δ ≥ 1. The special case δ = 1 means strict reduction. If we introduce δ-relax into RNS M-red, it affects the representation of the error bound e_1, which is modified accordingly; a similar modification is required for e_2.
By replacing e_1 and e_2 with their δ-relax versions in conditions (i)-(v) in Sect. 3.1.2, the basic algorithm in Fig. 3 and all its variants, including the Q-RNSs, output a correct result for δ-relax variables. Note that the ranges of L̂ and L̂′ change when δ-relax is applied. Since the relaxation requires a bit length wider than w, there is a tradeoff between reduction and word size.

Number of unit multiplications
Table 3 summarizes the comparison of four conventional RNS M-red algorithms and the two Q-RNS M-red algorithms. Among these, the proposed ones achieve the least number of unit multiplications. It should be noted that the unit multiplication for dQ-RNS is Montgomery's, while the other algorithms use standard modular multiplication. Note also that if n is small in C-RNS, there is a possibility of finding base extension matrices with less computation. As for the requirements on base choice, the basic algorithm, G-RNS, and R-RNS pose the weakest requirements, while sQ-RNS poses the strongest; C-RNS and dQ-RNS fall somewhere in between. dQ-RNS has weaker requirements on the RNS base than sQ-RNS, since it is possible to employ square numbers as elements of the bases.
As with conventional RNS M-reds, sQ-RNS and dQ-RNS are easy to implement on a parallel processing architecture owing to RNS. Since sQ-RNS and dQ-RNS consist mostly of two matrix multiplications, these algorithms have more regularity and simplicity than conventional ones. From past work, Q-RNS can terminate in (2n^2 + n)/n = (2n + 1) cycles if n processing units operate in parallel. Since the multiplication preceding Q-RNS finishes in 2n/n = 2 cycles, the total cycle count of a Montgomery multiplication is (2n + 1 + 2ν), where ν is the degree of laziness. Another possibility, though less likely, is that with (n^2 + n) unit multipliers, Q-RNS finishes in two cycles. Although this seems theoretically possible, in practice several issues remain to be elaborated, such as the feasibility of fan-out n for the registers and the design of an efficient circuit for summing the results from the unit multipliers.
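The cycle counts above can be checked with a small helper (a back-of-the-envelope sketch; the function name is ours):

```python
def qrns_montmul_cycles(n, nu):
    """Cycles for one RNS Montgomery multiplication with n parallel
    processing units: the preceding multiplications take 2*nu cycles
    (2n/n = 2 cycles per lazy term), and the Q-RNS reduction, which
    needs 2n^2 + n unit multiplications, takes (2n^2 + n)/n = 2n + 1
    cycles.
    """
    reduction_cycles = (2 * n * n + n) // n   # = 2n + 1
    return reduction_cycles + 2 * nu
```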
Fig. 9 The double-level Montgomery QR by Eq. (9)

Bigou et al. proposed a method that consists of fewer unit multiplications than other RNS M-red algorithms, including Q-RNS, under the hypothesis that the modulus p and the product of the base moduli M satisfy a certain equation [12,14]. Although Q-RNS also poses quadratic residuosity conditions, their hypothesis is much stronger: in fact, no base exists for the NIST primes [14]. In their algorithm, it is preferable to fix the base first and then determine p under the hypothesis. In contrast, bases for Q-RNS can be found with very high probability not only for the NIST primes but also for other primes. Therefore, the comparison in this paper does not include their algorithm.

Size of lookup table
Table 4 compares the lookup table sizes necessary for the four algorithms G-RNS, R-RNS, sQ-RNS, and dQ-RNS. Compared with G-RNS and R-RNS, sQ-RNS and dQ-RNS need only (n + 1) and (2n + 1) extra words of memory, respectively. With such a small amount of additional memory, the Q-RNSs provide substantial merit in terms of the reduced number of multiplications and the simplicity of the algorithm. A toy example of the parameters is shown in "Appendix C".

Fig. 12 A set of operation units

FPGA implementation
We have implemented sQ-RNS on an FPGA with parameters n = 4 and w = 65, using P-256 as the modulus. We have also implemented R-RNS for comparison.
Figure 12 shows the main operation units: a multiply-and-add unit and a modular reduction unit, where the latter carries out the fast reduction algorithm presented in Sect. 2.6. Let c_m and c_r be the clock cycles required to carry out these operations, respectively; in our implementation, c_m = 1 and c_r = 2. n sets of these operation units are prepared, and we use almost the same configuration for both sQ-RNS and R-RNS. Table 5 summarizes the results of the FPGA implementations. Both (a) and (b) consume almost the same FPGA-specific hardware resources, such as lookup tables (LUTs), flip-flops (FFs), and digital signal processing (DSP) blocks. In this implementation, we did not hand-tune the multiplier and adder; these components were synthesized automatically by the compiler, so further optimization may be possible.

Conclusion
This paper proposed new RNS Montgomery reduction algorithms, namely sQ-RNS and dQ-RNS, which are derived by posing quadratic residuosity requirements on the RNS bases. They achieve a smaller number of unit multiplications than all previously proposed algorithms, while the size of the lookup tables they use is comparable with that of conventional ones. The improvement over the R-RNS algorithm was confirmed with FPGA implementations. Since the proposed algorithms have more regularity and symmetry than conventional ones, they may be well suited to hardware implementation.

Fig. 13 Operation diagrams for (a) sQ-RNS and (b) R-RNS

Table 1 Example base for sQ-RNS

Table 2 Comparison of basic representations

Table 4 Number of memory words

Table 5 Synthesis results

Another topic for future study is improvement of the two base search algorithms proposed in this paper.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.