High-Performance Computation in Residue Number System Using Floating-Point Arithmetic

Abstract: Residue number system (RNS) is known for its parallel arithmetic and has been used in recent decades in various important applications, from digital signal processing and deep neural networks to cryptography and high-precision computation. However, comparison, sign identification, overflow detection, and division are still hard to implement in RNS. For such operations, most of the methods proposed in the literature only support small dynamic ranges (up to several tens of bits), so they are only suitable for low-precision applications. We recently proposed a method that supports arbitrary moduli sets with cryptographically sized dynamic ranges, up to several thousands of bits. The practical interest of our method compared to existing methods is that it relies only on very fast standard floating-point operations, so it is suitable for multiple-precision applications and can be efficiently implemented on many general-purpose platforms that support IEEE 754 arithmetic. In this paper, we make further improvements to this method and demonstrate that it can successfully be applied to implement efficient data-parallel primitives operating in the RNS domain, namely finding the maximum element of an array of RNS numbers on graphics processing units. Our experimental results on an NVIDIA RTX 2080 GPU show that for random residues and a 128-moduli set with 2048-bit dynamic range, the proposed implementation reduces the running time by a factor of 39 and the memory consumption by a factor of 13 compared to an implementation based on mixed-radix conversion.


Introduction
The emergence of new highly parallel architectures has increased interest in fast, carry-free, and energy-efficient computer arithmetic techniques. One such technique is the residue number system (RNS), which has received a lot of attention over recent years [1][2][3]. The RNS is of interest to scientists dealing with computationally intensive applications as it provides efficient highly parallelizable arithmetic operations. This number coding system is defined in terms of pairwise coprime integers called moduli, and a large weighted number is converted into several smaller numbers called residues, which are obtained as the remainders when the given number is divided by the moduli. A useful feature is that the residues are mutually independent, and for addition, subtraction and multiplication, instead of big word length (multiple-precision) operations, we can perform several small word length operations on these residues without carry propagation between them [1].
In a recent paper [21], we have presented a method for implementing difficult RNS operations via computing a finite precision floating-point interval that localizes the fractional value of an RNS representation. Such an interval is called a floating-point interval evaluation, or simply an interval evaluation. The method deserves attention for three reasons. First, it is intended for arbitrary moduli sets with large dynamic ranges significantly exceeding the usual word length of computers. Dynamic ranges consisting of hundreds and even thousands of bits are in demand in many modern applications, primarily in cryptography and high-precision arithmetic. Second, the method leads to efficient software implementations using general-purpose computing platforms, since it only requires very fast standard (finite precision) floating-point operations, and most computations can be performed in a parallel manner. Third, it is a fairly versatile method suitable for computing a wide range of fundamental operations that are problematic in RNS, including • magnitude comparison; • sign identification; • dynamic range overflow detection; • general division and scaling.
A key component of this method is an accurate algorithm that computes the floating-point interval evaluation for a number in RNS representation; see Algorithm 1 in [21]. To obtain the result with the desired accuracy using only finite precision operations, this algorithm performs an iterative refinement procedure, which in some cases is the most expensive part of the algorithm.
In this paper, we continue our research on the application of finite precision floating-point intervals to implement high-performance RNS algorithms. The contribution of this paper is four-fold:

1. In Section 3, we provide proofs of some important properties of the interval evaluation algorithm that were not included in the previous article [21].
2. In Section 4, we present an improved version of the interval evaluation algorithm that reduces the number of iterations required to achieve the desired accuracy.
3. In Section 5, we demonstrate that our method can successfully be applied to implement efficient data-parallel primitives in RNS arithmetic; namely, we use it to find the maximum element of an array of RNS numbers on GPUs.
4. In Section 6, we present new numerical results showing the performance of the method on various moduli sets, from a moderately small 4-moduli set with a 64-bit dynamic range to a huge 256-moduli set with a 4096-bit dynamic range.

Residue Number System
We fix a set of n pairwise coprime positive integers {m_1, m_2, ..., m_n} called the moduli set. Each number m_i from this set is called a modulus, and the inequality m_i > 1 must hold for all i ∈ {1, 2, ..., n}. We define M as the product of all the m_i. Throughout this paper, we assume that operations modulo m_i are inexpensive, in contrast to operations modulo M, which can take a long time. Consider an integer X. For each modulus in {m_1, m_2, ..., m_n}, we have x_i = X mod m_i, i.e., x_i is the smallest non-negative remainder of X modulo m_i; this is also written |X|_{m_i}, and x_i is called the ith residue of X. Thus, an integer X in the RNS is represented by the n-tuple of residues X = (x_1, x_2, ..., x_n).
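As a concrete illustration, converting an integer to its residues is a single remainder operation per modulus. The sketch below uses a toy four-modulus set chosen for illustration only (it is not one of the paper's benchmark sets):

```python
# Toy moduli set (pairwise coprime); chosen for illustration only.
moduli = [7, 11, 13, 15]

M = 1
for m in moduli:
    M *= m                     # dynamic range: M = 15015

def to_rns(x, moduli):
    """Return the RNS representation (x_1, ..., x_n) of x."""
    return tuple(x % m for m in moduli)

print(to_rns(1000, moduli))    # → (6, 10, 12, 10)
```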
It has been proved that if 0 ≤ X < M, then the number X corresponds one-to-one with its RNS representation [22]. An interesting and useful feature of RNS is that addition, subtraction, and multiplication are computed in an element-wise fashion. If X, Y, and Z have RNS representations given by (x_1, x_2, ..., x_n), (y_1, y_2, ..., y_n), and (z_1, z_2, ..., z_n), then for • ∈ {+, −, ×}, Z = X • Y implies z_i = |x_i • y_i|_{m_i} for all i ∈ {1, 2, ..., n}. That is, the ith RNS digit z_i is defined in terms of |x_i • y_i|_{m_i} only, and no carry information needs to be communicated between residue digits [23]. As a result, we have very high speed concurrent (parallel) operations, which makes the RNS attractive for modern high-performance applications.
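The element-wise property can be checked directly by applying an operation to residues and comparing against the residues of the conventionally computed result. A minimal sketch with the same toy moduli set as above (helper names are illustrative):

```python
moduli = [7, 11, 13, 15]   # toy pairwise coprime moduli; M = 15015

def to_rns(x):
    return tuple(x % m for m in moduli)

def rns_add(x, y):
    return tuple((a + b) % m for a, b, m in zip(x, y, moduli))

def rns_mul(x, y):
    return tuple((a * b) % m for a, b, m in zip(x, y, moduli))

X, Y = 123, 45
# Residue-wise results agree with converting the conventional results,
# as long as the true results stay below M:
assert rns_add(to_rns(X), to_rns(Y)) == to_rns(X + Y)   # 168 < 15015
assert rns_mul(to_rns(X), to_rns(Y)) == to_rns(X * Y)   # 5535 < 15015
```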
We can convert an RNS representation back to its integer form using the following well-known formula:

X = | Σ_{i=1}^{n} M_i |x_i w_i|_{m_i} |_M, (1)

where M_i = M/m_i and w_i is the multiplicative inverse of M_i modulo m_i. Both M_i and w_i are RNS constants that follow from the Chinese Remainder Theorem (CRT) [2].
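The CRT reconstruction formula translates directly into code. The sketch below assumes Python 3.8+ for the three-argument `pow`, which computes the modular inverses w_i:

```python
def from_rns(res, moduli):
    """Reconstruct X from its residues via the CRT formula (1)."""
    M = 1
    for m in moduli:
        M *= m
    total = 0
    for x_i, m_i in zip(res, moduli):
        M_i = M // m_i
        w_i = pow(M_i, -1, m_i)          # w_i = M_i^{-1} mod m_i (Python 3.8+)
        total += M_i * ((x_i * w_i) % m_i)
    return total % M

# Residues of 1000 with respect to the toy moduli set {7, 11, 13, 15}:
print(from_rns((6, 10, 12, 10), [7, 11, 13, 15]))  # → 1000
```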

Implementing Difficult RNS Operations Using Finite Precision Floating-Point Intervals
Here we give a brief overview of a method for implementing difficult non-modular RNS operations (e.g., comparison, sign determination, overflow detection, and division) using limited precision floating-point arithmetic [21]. A straightforward approach is based on the CRT Formula (1). However, the sum modulo M causes a major implementation problem because M is generally a very large and arbitrary integer [24]. When M consists of hundreds or thousands of bits, the CRT-based method is very slow and inefficient. Instead, the paper [21] uses a fractional version of the CRT, which operates with a fractional representation of an RNS number [25].
Let an integer X be represented in the RNS by the residues (x_1, x_2, ..., x_n). The fractional representation of X is calculated as follows:

X/M = | Σ_{i=1}^{n} |x_i w_i|_{m_i} / m_i |_1, (2)

where the notation | · |_1 denotes that the integer part of the expression is discarded and only the fractional part is retained. Unlike (1), the fractional version of the CRT defined by Formula (2) does not require the time-consuming modulo M operation. However, X/M is a rational number, and difficulties arise when we attempt to compute it using limited precision arithmetic. To address these difficulties without increasing the precision, ref. [21] suggests evaluating X/M using floating-point interval arithmetic; the resulting interval is denoted I(X/M). Thus, I(X/M) provides information about the range of possible values of the exact fractional representation of an RNS number. This information may not be sufficient to restore the binary form of the number, but it can be used efficiently to perform otherwise hard RNS operations such as magnitude comparison, sign detection, scaling, and division.
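Formula (2) can be verified exactly with rational arithmetic: the fractional part of the sum of the u_i/m_i equals X/M. A sketch using Python's `fractions` module (toy moduli set, illustrative only):

```python
from fractions import Fraction

moduli = [7, 11, 13, 15]       # toy moduli set, illustrative only
X = 1000
M = 1
for m in moduli:
    M *= m

s = Fraction(0)
for m_i in moduli:
    M_i = M // m_i
    w_i = pow(M_i, -1, m_i)            # w_i = M_i^{-1} mod m_i
    u_i = ((X % m_i) * w_i) % m_i      # u_i = |x_i * w_i|_{m_i}
    s += Fraction(u_i, m_i)

frac = s - int(s)                      # discard the integer part
assert frac == Fraction(X, M)
```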

Highly Accurate Computation of I(X/M)
Consider an RNS representation X = (x_1, x_2, ..., x_n), which can take any value in the range 0 to M − 1. To compute I(X/M) = [(X/M)_L, (X/M)_U], where (X/M)_L and (X/M)_U denote the lower and upper bounds, such that δI(X/M) < ε, where ε is a given accuracy parameter and δI(X/M) denotes the difference between the upper and lower bounds of I(X/M) divided by X/M, we can use Algorithm 1 of [21]. The general structure of this algorithm is shown in Figure 1. The algorithm consists of the following main stages:

1. Initial computation. In this stage, (2) is evaluated using floating-point interval arithmetic. We first calculate u_i = |x_i w_i|_{m_i} for all i ∈ {1, 2, ..., n}. Then, S_L(X) and S_U(X) are computed by evaluating Σ_{i=1}^{n} u_i/m_i in finite precision arithmetic with downwardly directed rounding and upwardly directed rounding, respectively. The algorithm ends if both S_L(X) and S_U(X) are equal to zero, since in this case X is also zero; otherwise (at least one of S_L(X) and S_U(X) is nonzero), the initial values of (X/M)_L and (X/M)_U are obtained by discarding the integer parts and keeping only the fractional parts of S_L(X) and S_U(X).

2. Validation. In this stage, the algorithm checks the condition ⌊S_L(X)⌋ = ⌊S_U(X)⌋. If it is true, then (X/M)_L and (X/M)_U are calculated correctly. Otherwise (a rare case), one of the bounds is adjusted by calling the mixed-radix conversion (MRC) procedure.

3. Checking the accuracy. The algorithm operates with a predefined constant ψ, which is expressed in [21] in terms of ε, n, and the unit roundoff u = 2^{1−p}, assuming p-bit floating-point arithmetic (p = 24 and 53 for binary32 and binary64, respectively). The value of ψ may seem strange at first glance; however, it has been proved in [21] that if (X/M)_L ≥ ψ, then δI(X/M) < ε. If this is the case, then the algorithm ends; otherwise, the refinement stage is performed. One should ensure that ψ ≤ 1/4 (note that this condition is very weak and is satisfied for virtually all values of n and ε).

4. Refinement. The ultimate goal of this stage is to determine K such that b_U = (2^K X/M)_U ≥ ψ. We set K ← 0 and perform the following iterative computation:

u_i ← |2^k u_i|_{m_i} for all i ∈ {1, 2, ..., n},   b_U ← fl↑(Σ_{i=1}^{n} u_i/m_i),   K ← K + k,  (3)

where k = ⌊log₂(1/(2ψ))⌋ and the initial values of the u_i are those computed earlier (at the first stage of the algorithm). All the |2^k|_{m_i} are constants that can be precomputed. The symbol fl↑(·) denotes the result of a finite precision floating-point computation performed with upwardly directed rounding. The computation of (3) is repeated until b_U ≥ ψ. Once the iterations are finished, the following is computed:

b_L ← fl↓(Σ_{i=1}^{n} u_i/m_i),  (4)

where the u_i are those computed in the iterative procedure and fl↓(·) denotes the result of a finite precision floating-point computation performed with downwardly directed rounding. Finally, we compute the endpoints of I(X/M) as follows:

(X/M)_L ← b_L/2^K,   (X/M)_U ← b_U/2^K.  (5)

We note that in the described algorithm, pairwise summation is used to compute the sum of a set of floating-point numbers in finite precision arithmetic.
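The initial-computation stage can be imitated in software that lacks access to hardware rounding modes by nudging each intermediate result one ulp outward. The sketch below is a simplified stand-in for the algorithm described above: it uses naive sequential summation rather than pairwise summation, and `math.nextafter` (Python 3.9+) as a crude substitute for directed rounding, but it still produces a valid enclosure of X/M for the toy moduli set:

```python
import math
from fractions import Fraction

def dir_sum(terms, up):
    """Sequential summation, nudging each partial sum one ulp outward to
    emulate directed rounding (crude but yields a valid enclosure)."""
    target = math.inf if up else -math.inf
    s = 0.0
    for t in terms:
        s = math.nextafter(s + t, target)
    return s

moduli = [7, 11, 13, 15]        # toy moduli set, illustrative only
X = 1000
M = math.prod(moduli)

terms_lo, terms_hi = [], []
for m_i in moduli:
    M_i = M // m_i
    w_i = pow(M_i, -1, m_i)
    u_i = ((X % m_i) * w_i) % m_i
    q = u_i / m_i               # rounds to nearest; widen one ulp each way
    terms_lo.append(math.nextafter(q, -math.inf))
    terms_hi.append(math.nextafter(q, math.inf))

S_L = dir_sum(terms_lo, up=False)
S_U = dir_sum(terms_hi, up=True)
lo = S_L - math.floor(S_L)      # keep only the fractional parts
hi = S_U - math.floor(S_U)
assert lo <= Fraction(X, M) <= hi
```

The enclosure is a few ulps wide here, since no refinement iterations are needed for this input.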

Properties of the Interval Evaluation Algorithm
The following theorem gives the maximum number of iterations in the refinement stage of the described algorithm.

Theorem 1. The iterative procedure (3) terminates after at most ⌈log₂(ψM)/k⌉ iterations.

Proof. The iterative procedure ends at the jth iteration if b_U ≥ ψ. On the other hand, since upwardly directed rounding is applied, b_U cannot be less than 2^{jk} X/M, provided the iterations are computed correctly. Thus, the number of iterations depends on the magnitude of X: the smaller the magnitude, the more iterations are needed. For X = 1, the loop termination condition is guaranteed to be satisfied when 2^{jk}/M ≥ ψ. The result follows by expressing j in terms of M, ψ, and k and requiring it to be an integer. □
One way to reduce the number of iterations is to increase the refinement factor k. However, the following theorem says that this can lead to an incorrect result.

Theorem 2. To guarantee a correct result when calculating I(X/M) = [(X/M)_L, (X/M)_U], the refinement factor k must not be greater than ⌊log₂(1/(2ψ))⌋.
Proof. The proof is based on two remarks:

1. As shown in [21], if X is too close to M, namely if 1 − X/M ≤ 2un⌈log₂ n⌉, then (X/M)_U may not be computed correctly. Consequently, X should be less than M(1 − 2un⌈log₂ n⌉) if we want to calculate the correct value of (X/M)_U.

2. Calculation of (3) at the (i − 1)th refining iteration can be considered the same as calculating (2) with upwardly directed rounding for some number X_{i−1}. Accordingly, calculating (3) at the ith iteration can be considered the same as calculating (2) with upwardly directed rounding for the number X_i = 2^k X_{i−1}. We denote the result of the (i − 1)th iteration by b_U^{(i−1)}; it satisfies b_U^{(i−1)} < ψ, since otherwise the iterations would have stopped.

First, assume k = ⌊log₂(1/(2ψ))⌋. In this case, X_i is less than 2^{⌊log₂(1/(2ψ))⌋} ψM, and since 2^{⌊log₂(1/(2ψ))⌋} ≤ 1/(2ψ), X_i is less than M/2. Thus, if un⌈log₂ n⌉ ≤ 1/4, then b_U^{(i)} will be computed correctly (see the first remark above). Now consider the case k = ⌊log₂(1/(2ψ))⌋ + 1. For this setting, we have X_i < 2 · 2^{⌊log₂(1/(2ψ))⌋} ψM, so we can only guarantee that X_i is less than M, but not that it is less than M(1 − 2un⌈log₂ n⌉). Thus, b_U^{(i)} may not be computed correctly. □

Description
We propose an improvement to the algorithm of [21] described above by modifying each iteration of the refinement loop as follows:

r ← max(k, ⌊log₂(1/(2b_U))⌋),   u_i ← |2^r u_i|_{m_i} for all i ∈ {1, 2, ..., n},   b_U ← fl↑(Σ_{i=1}^{n} u_i/m_i),   K ← K + r.  (6)

Thus, we make the refinement factor r dependent on the value of b_U, but not less than k = ⌊log₂(1/(2ψ))⌋, and before starting the iterations, we assign the value of (X/M)_U computed at the first stage of the algorithm to b_U. Once the iterations are finished, b_L is computed according to (4), and the desired endpoints of I(X/M) are obtained according to (5). The proposed modification improves the performance of the algorithm by reducing the number of iterations required to achieve the desired accuracy. Summarizing, we refer to the resulting procedure as Algorithm 1.

The correctness of the original algorithm has been proved in [21], so we only have to prove that the proposed modification does not violate it. The following theorem establishes this fact.

Theorem 3. If b_U^{(i−1)} is calculated correctly, then b_U^{(i)} is also calculated properly.

The proof is as follows. Calculation of (6) at the ith iteration can be considered the same as calculating (2) with upwardly directed rounding for the input X_i = 2^r X_{i−1}. Note that X_{i−1}/M ≤ b_U^{(i−1)}, so X_i ≤ 2^r b_U^{(i−1)} M ≤ M/2. According to the first remark from the proof of Theorem 2 above, X_i should be less than M(1 − 2un⌈log₂ n⌉), which holds whenever un⌈log₂ n⌉ ≤ 1/4.

To implement the proposed improved algorithm, all the |2^r|_{m_i} should be precomputed and stored in a lookup table of size no more than n by ⌈log₂ M⌉. We note that this memory overhead is not significant in practice. In fact, let each RNS modulus consist of 32 bits and n = 512. Then M has a bit size of about 16 thousand bits (a huge dynamic range), and the total size of the lookup table is 32 MB. Moreover, a table of powers of two is in demand in various RNS applications.
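To see why an adaptive refinement factor helps, one can model only the growth of b_U across iterations (ignoring the residue updates, which do not affect the iteration count). The sketch below assumes one plausible reading of making r depend on b_U, namely r = max(k, ⌊log₂(1/(2·b_U))⌋); the parameter values are illustrative, not the paper's:

```python
import math

# Illustrative parameters: psi plays the role of the accuracy threshold.
psi = 2.0 ** -20
k = math.floor(math.log2(1.0 / (2.0 * psi)))   # fixed refinement factor, = 19

def iterations_fixed(b):
    """Iterations until b >= psi when scaling by the fixed factor 2^k."""
    it = 0
    while b < psi:
        b *= 2.0 ** k
        it += 1
    return it

def iterations_adaptive(b):
    """Iterations when the factor adapts to the current value of b."""
    it = 0
    while b < psi:
        r = max(k, math.floor(math.log2(1.0 / (2.0 * b))))
        b *= 2.0 ** r
        it += 1
    return it

b0 = 2.0 ** -900   # a very small fractional value X/M
print(iterations_fixed(b0), iterations_adaptive(b0))
```

In this model a tiny input needing dozens of fixed-factor iterations finishes in a single adaptive iteration; the real algorithm additionally updates the residues modulo each m_i at every step.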

Demonstration
To demonstrate the benefits of the proposed modification, we have counted the number of iterations required to compute the interval evaluation in residue number systems with four different moduli sets. The first set consists of 8 moduli and provides a 128-bit dynamic range. The second set consists of 32 moduli and provides a 512-bit dynamic range. The third set consists of 64 moduli and provides a 1024-bit dynamic range. The fourth set consists of 256 moduli and provides a 4096-bit dynamic range.
The results are reported in Figures 2 and 3, where the algorithm from [21] is labeled "Original algorithm" and the modified algorithm (Algorithm 1) is labeled "Proposed modification". The inputs are powers of two ranging from 2^0 up to the full dynamic range, and the x-axis of the plots denotes the binary logarithm of the RNS number for which the interval evaluation is computed. The y-axis denotes the number of refining iterations required to compute the interval evaluation with accuracy ε = 10^−7. In this demonstration, all calculations were done in standard double precision floating-point arithmetic. The plots show that for small inputs, the proposed modification reduces the number of iterations by a factor of almost 3, and the advantage grows with the size of the moduli set.

Application: Finding the Maximum Element of an Array of RNS Numbers
Reduction is a widely used data-parallel primitive in high-performance computing and is part of many important algorithms such as least squares and MapReduce [26]. For an array of N elements {X_1, X_2, ..., X_N}, applying the reduction operator ⊕ gives a single value X* = X_1 ⊕ X_2 ⊕ ... ⊕ X_N. The reduction operator is a binary associative (and often commutative) operator such as +, ×, MIN, MAX, logical AND, or logical OR. In this section, we apply the considered interval evaluation method to implement one operation of the parallel reduction primitive over an array of RNS numbers, namely MAX, which finds the maximum element of the array. Our implementation is intended for GPUs supporting the Compute Unified Device Architecture (CUDA) [27].

Approach
Let {X_1, X_2, ..., X_N} be an array of N integers in RNS representation, and suppose we want to find the largest element of this array, X* = max{X_1, X_2, ..., X_N}. To simplify the presentation, we consider unsigned numbers, i.e., each X_i varies in the range 0 to M − 1. Our approach to finding X* is illustrated in Figure 4. The computation is decomposed into two stages:

1. For a given array {X_1, X_2, ..., X_N}, the array {I(X_1/M), I(X_2/M), ..., I(X_N/M)} is calculated. Each interval evaluation is coupled with the corresponding index of the RNS representation in the input array (idx in Figure 4), so knowing the interval evaluation, we can always fetch the original RNS representation.

2. The reduction tree is built over the array of interval evaluations to obtain the maximum RNS representation. The basic building block of the reduction is comparing the magnitudes of two numbers in the RNS. For two given numbers X = (x_1, x_2, ..., x_n) and Y = (y_1, y_2, ..., y_n), the magnitude comparison is performed as follows [21]:
• if the interval evaluations I(X/M) and I(Y/M) do not overlap, then the larger number is identified directly from the bounds;
• if x_i = y_i for all i ∈ {1, 2, ..., n}, then X = Y;
• if neither case applies, then the mixed-radix representations of X and Y are calculated and compared component-wise to produce the final result.
Thus, after the array of interval evaluations is computed, each RNS comparison is performed very quickly, namely in O(1) time, and the input RNS array is accessed only in corner cases, when the numbers being compared are equal or nearly equal to each other and the accuracy of their interval evaluations is insufficient. The presented approach also reduces the memory footprint of intermediate computations, since the reduction layers store not the RNS numbers themselves but only their interval evaluations, and each interval evaluation has a fixed size regardless of the size of the moduli set and dynamic range.
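The two-stage approach can be prototyped sequentially: precompute one interval per number, then reduce, touching the exact representations only when intervals overlap. In this sketch the interval is a nearest-rounded sum widened by a generous ulp pad, and exact `Fraction` comparison stands in for the MRC fallback; all helper names and the toy moduli set are illustrative:

```python
import math
from fractions import Fraction

moduli = [7, 11, 13, 15]                 # toy moduli set; M = 15015
M = math.prod(moduli)
W = [pow(M // m, -1, m) for m in moduli]  # CRT constants w_i

def interval_eval(res):
    """Crude interval for X/M: nearest-rounded sum widened by a few ulps.
    Stands in for the paper's directed-rounding algorithm."""
    s = sum(((x * w) % m) / m for x, w, m in zip(res, W, moduli))
    f = s - math.floor(s)
    pad = 8 * math.ulp(max(s, 1.0))
    return (f - pad, f + pad)

def exact_frac(res):
    """Exact X/M; plays the role of the MRC fallback in this sketch."""
    s = sum(Fraction((x * w) % m, m) for x, w, m in zip(res, W, moduli))
    return s - math.floor(s)

def rns_max(numbers):
    evals = [(interval_eval(r), i) for i, r in enumerate(numbers)]
    best = 0
    for (lo, hi), i in evals[1:]:
        blo, bhi = evals[best][0]
        if lo > bhi:                     # intervals disjoint: decide by bounds
            best = i
        elif hi >= blo and exact_frac(numbers[i]) > exact_frac(numbers[best]):
            best = i                     # overlap: fall back to exact comparison
    return numbers[best]

nums = [tuple(x % m for m in moduli) for x in (123, 9876, 4567, 15014, 42)]
print(rns_max(nums))                     # → (6, 10, 12, 14), i.e., 15014
```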

CUDA Implementation
In CUDA, programs that run on the GPU are called kernels. Each kernel launches a grid of parallel threads, and within the grid, GPU threads are grouped into thread blocks. Threads belonging to the same block can communicate through shared memory or shuffle instructions, while global communication between thread blocks is done by sequential kernel launches or using device memory atomic operations.
In what follows, we denote by Arr the pointer to the global memory for accessing the input RNS array and by Eval the pointer to the global memory for accessing the array of floating-point interval evaluations. Each element of the Eval array is an instance of a structure that consists of three fields:
• low is the lower bound of the interval evaluation;
• upp is the upper bound of the interval evaluation;
• idx is the index of the corresponding RNS representation in the Arr array.
We also use the following notations:
• N is the size of the input array;
• gSize is the number of thread blocks per grid;
• bSize is the number of threads per block;
• bx is the block index within the grid;
• tx is the thread index within the block.
For an input array of RNS numbers, the calculation of the array of floating-point interval evaluations is implemented as a separate kernel. One thread computes one interval evaluation, and many threads run concurrently, storing the results in a pre-allocated global memory buffer of the GPU. The pseudocode of the kernel (the code executed by each thread) is shown in Algorithm 2, where the device function ComputeEval is a sequential computation of an interval evaluation as described in Algorithm 1.

Algorithm 2
Computing an array of floating-point interval evaluations
1: i ← bx × bSize + tx
2: while i < N do
3:    Eval[i] ← ComputeEval(Arr, i)
4:    Eval[i].idx ← i
5:    i ← i + gSize × bSize
6: end while

Remark 1. We have also implemented a parallel algorithm in which n threads simultaneously compute an interval evaluation for a number represented in an n-moduli RNS, and the ith thread is assigned the modulo m_i computation. However, for our purpose (calculating an array of interval evaluations), the sequential algorithm is preferable, since it does not require communication between parallel threads. The parallel version can be used in applications that have insufficient internal parallelism to saturate the GPU.
Once the array of interval evaluations is computed, the reduction kernel is launched, which generates and stores partial reduction results using multiple (gSize) thread blocks. To avoid multiple global memory accesses, the fast shared memory is employed as a cache at the internal reduction levels. The same reduction kernel then runs again to reduce the partial results into a single result using a single thread block. The result is an interval evaluation that represents the maximum element in the input array, and the desired RNS number is retrieved from the array using the index associated with that interval evaluation.
The pseudocode of the reduction kernel is shown in Algorithm 3. In this algorithm, S is the size of the reduction operation, Sh is an array in the shared memory of each thread block, and Pow is the next power of 2 greater than or equal to bSize. Just like the Eval array, each element of the Sh array is a structure consisting of three fields: low, upp, and idx. The function RnsCmp(A, B, Arr) returns 1 if the RNS number associated with interval evaluation A is greater than the one associated with B.

Algorithm 3 Parallel reduction over an array of interval evaluations
1: i ← bx × bSize + tx
2: Sh[tx] ← Eval[i]
3: i ← i + gSize × bSize
4: while i < S do
5:    if RnsCmp(Eval[i], Sh[tx], Arr) = 1 then
6:       Sh[tx] ← Eval[i]
7:    i ← i + gSize × bSize
8: end while
9: — Intra-block synchronization —
10: i ← Pow/2
11: while i ≥ 1 do
12:    if tx < i and tx + i < bSize and RnsCmp(Sh[tx + i], Sh[tx], Arr) = 1 then
13:       Sh[tx] ← Sh[tx + i]
14:    — Intra-block synchronization —
15:    i ← i/2
16: end while
17: if tx = 0 then
18:    Eval[bx] ← Sh[0]

Although we have only considered the MAX operation in this paper, other reduction operations can be implemented in quite a similar manner. We also note that our approach can be straightforwardly extended to find the maximum element in an array of signed RNS numbers.

Results and Discussion
We present performance results for several approaches to finding the maximum element in an array of RNS numbers on the GPU:
• Proposed approach: an implementation of the MAX operation as described in Section 5, using floating-point interval evaluations to compare the magnitudes of RNS numbers.
• Naive approach: a straightforward parallel reduction using floating-point interval evaluations that consists of two kernel invocations. In contrast to the proposed variant, the naive one has no kernel that precomputes an array of interval evaluations. Instead, two interval evaluations are computed each time two RNS numbers are compared. This reduces the memory footprint but increases the computational load.
• Mixed-radix approach: an implementation of the MAX operation as described in Section 5, but using the MRC procedure instead of floating-point interval evaluations to compare the magnitudes of RNS numbers. We used the Szabo and Tanaka MRC algorithm [3] for this implementation.
All tests were carried out on a system running Ubuntu 20.04.1 LTS, equipped with an NVIDIA GeForce RTX 2080 video card (see Table 1), an Intel Core i5 7500 CPU, and 16 GB of DDR4 memory. We used CUDA Toolkit version 11.1.105 and NVIDIA Driver version 455.32.00. The source code was compiled with the -O3 option. We carried out the performance tests using 7 different sets of RNS moduli that provide dynamic ranges from 64 to 4096 bits. An overview of the moduli sets is given in Table 2, and the tool used to generate these sets is available in the GitHub repository: https://github.com/kisupov/rns-moduli-generator. The measurements were performed with arrays of fixed length N = 5,000,000, and random residues were used for data generation, i.e., in the case of an n-moduli RNS, the input dataset (an array of N RNS numbers) was obtained from a random integer array of size N × n. The GNU MP library was used to validate the results.
We report the performance of the tested CUDA implementations in Table 3. We clearly see that the proposed approach significantly outperforms the naive and mixed-radix variants for all test cases. The MRC method has a quadratic behavior, while the proposed and naive approaches based on floating-point interval evaluations have a linear behavior. In fact, the Szabo and Tanaka algorithm requires n(n − 1) operations modulo m i to compute the mixed-radix representation in an n-moduli RNS, while sequential (single-threaded) computation of the interval evaluation using Algorithm 1 requires only n operations modulo m i and 4n standard floating-point operations, assuming no ambiguity resolution or refinement iterations are required. The mixed-radix implementation has not been evaluated for the 256-moduli set due to excessive memory consumption. Although in some cases Algorithm 1 calls the MRC procedure, these cases are very rare when the numbers are randomly distributed. Moreover, if it is known in advance that the RNS number is small enough, then a quick-and-dirty computation of the interval evaluation is possible, which is obtained from Algorithm 1 by eliminating steps 9 to 11.
In Figure 5, we show the performance gains of the proposed approach over the other approaches tested. The superiority of the proposed approach over the naive one increases with the size of the moduli set. In turn, for the case of the 128-moduli set, the superiority of the proposed approach over the mixed-radix method is less than that for the case of the 64-moduli set. One possible reason for this is a decrease in the GPU memory bandwidth due to strided accesses to the input array of RNS numbers. The reader can refer to [28] for further details. An interleaved addressing scheme, in which the residues of the RNS numbers are interleaved, would provide more efficient access to the global GPU memory. We plan to explore this in the future. Table 4 shows the memory consumption (in MB) of the tested implementations. Memory consumption is calculated as the size of the auxiliary buffer that needs to be allocated in the global GPU memory for a particular implementation to work properly. The memory requirements of the mixed-radix implementation increase in proportion to the number of moduli, since the size of each mixed-radix representation is equal to the size of the corresponding RNS representation. In contrast, the memory requirements of the naive and proposed implementations are constant, since the size of each floating-point interval evaluation is fixed (40 bytes in our setting, including padding) and does not depend on the size of the moduli set. The naive implementation requires less memory, since it stores only partial results generated by the first reduction kernel, while the proposed implementation requires storing the computed interval evaluations for all inputs. However, the memory consumption of our proposed implementation does not seem critical, since the NVIDIA RTX 2080 graphics card has 8 GB of GDDR6 RAM.

Conclusions
An efficient method has been considered for performing difficult operations in a residue number system with arbitrary moduli sets. It relies only on very fast standard floating-point operations and integer modulo m_i arithmetic, which allows it to be implemented on many general-purpose computing architectures. The method can be used to improve the efficiency of various parallel algorithms operating in the RNS domain. In particular, it has already been successfully employed in CUDA-accelerated multiple-precision linear algebra kernels [10]. It is also implemented in GRNS, a new software library for high-performance RNS computations on CPU and GPU architectures (available at https://github.com/kisupov/grns). Similar to the MAX operation presented in this paper, we plan in the future to use our method to build other important data-parallel primitives over RNS arrays, such as SUM and SCAN.

Data Availability Statement: Data sharing is not applicable to this article.

Conflicts of Interest:
The author declares no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

RNS   Residue number system
GPUs  Graphics processing units
CRT   Chinese Remainder Theorem
MRC   Mixed-radix conversion
CUDA  Compute Unified Device Architecture