Pipelined Two-Operand Modular Adders

The pipelined two-operand modular adder (TOMA) is one of the basic components of digital signal processing (DSP) systems that use the residue number system (RNS). Such modular adders are used in binary/residue and residue/binary converters, residue multipliers and scalers, as well as within residue processing channels. The structure of a pipelined TOMA is usually obtained by inserting an appropriate number of pipeline register layers into a nonpipelined TOMA structure. Hence the area of a pipelined TOMA is determined by the nonpipelined TOMA structure and by the total number of pipeline registers. In this paper we propose a new pipelined TOMA that has a considerably smaller area and an attainable pipelining frequency comparable with that of other known pipelined TOMA structures. We compare its area and pipelining frequency with those of TOMAs based on the ripple-carry adder (RCA), the Hiasat TOMA and the parallel-prefix adder (PPA), using data from a very-large-scale integration (VLSI) standard cell library.


Introduction
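Modular addition plays an important role in the implementation of digital signal processing systems that use the residue number system [1][2][3][4], as well as its derivatives such as the quadratic residue number system (QRNS) [5] and the modified quadratic residue number system (MQRNS) [6] for the processing of complex signals. The RNS is a nonweighted integer number system determined by its base $B = \{m_1, m_2, \ldots, m_n\}$, a set of positive pairwise prime integers $m_i$, $i = 1, 2, \ldots, n$. With $M = \prod_{i=1}^{n} m_i$ and $Z_M = \{0, 1, \ldots, M-1\}$, each integer $X \in Z_M$ is represented as $X \leftrightarrow (x_1, x_2, \ldots, x_n)$, where $x_i = |X|_{m_i}$ is the residue of $X$ modulo $m_i$. This mapping is a bijection, and for $X, Y \in Z_M$ and for $Z = X \circ Y \in Z_M$ we have $z_i = |x_i \circ y_i|_{m_i}$, where $\circ$ denotes addition, subtraction or multiplication.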
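The component-wise character of RNS arithmetic described above can be illustrated with a minimal Python sketch. The base (29, 31, 32) and the operands are assumptions chosen only for illustration; they are not taken from the paper, and the helper names `to_rns` and `rns_op` are hypothetical.

```python
from math import gcd, prod

def to_rns(x, base):
    """Forward conversion: the residue of x for each modulus of the base."""
    return tuple(x % m for m in base)

def rns_op(xr, yr, base, op):
    """Apply op independently in every residue channel (no inter-channel carries)."""
    return tuple(op(a, b) % m for a, b, m in zip(xr, yr, base))

# Illustrative base of pairwise prime five-bit moduli (an assumption).
BASE = (29, 31, 32)
assert all(gcd(a, b) == 1 for i, a in enumerate(BASE) for b in BASE[i + 1:])
M = prod(BASE)  # size of the dynamic range Z_M

X, Y = 1234, 567
xr, yr = to_rns(X, BASE), to_rns(Y, BASE)

add = rns_op(xr, yr, BASE, lambda a, b: a + b)
sub = rns_op(xr, yr, BASE, lambda a, b: a - b)
mul = rns_op(xr, yr, BASE, lambda a, b: a * b)

# Each channel result agrees with the residues of the result computed in Z_M.
print(add == to_rns((X + Y) % M, BASE))
print(sub == to_rns((X - Y) % M, BASE))
print(mul == to_rns((X * Y) % M, BASE))
```

Each channel needs only a small modular adder or multiplier, which is where the TOMA considered in this paper is used.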
The reverse conversion from the RNS to a weighted system can be performed using the Chinese remainder theorem (CRT) [1], [2] or the mixed-radix system (MRS) [1], [2]. The main advantage of the RNS comes from the fact that addition, subtraction and multiplication are carry-free, i.e. they are performed without carries between the individual positions of the number. The principal advantage of the RNS with respect to high-speed DSP is due to the replacement of large multipliers, which limit the pipelining frequency, by small multipliers modulo $m_i$. If their binary size $l = \lceil \log_2 m_i \rceil$, where $\lceil \cdot \rceil$ denotes rounding up to an integer, does not exceed six bits, multiplications by a constant can be performed by look-up with small ROMs or by using combinatorial networks. General multiplications are also easier to perform because their standard realizations are small, or segmentation of operands can be used for the combinatorial realization. It is worth mentioning that moduli with $l < 7$ may provide dynamic ranges of over 90 bits [7]. An additional advantage of the RNS is the possibility of reducing power dissipation in CMOS circuits, which is due to the lower switching activity and the reduction of supply voltages [9]. The RNS has found numerous applications in DSP, for example in FIR filters [8][9][10][11], FFT processors [12], digital downconversion [13] and image processing [14], [15].
Generally, TOMAs can be divided into two main categories determined by the type of the modulus: TOMAs for moduli akin to $2^n$ represent the first category, and those for generic moduli the other. There are several works in the literature that consider the TOMA design.
Banerji [16] presented a look-up approach, and Agrawal and Rao [17] proposed a TOMA for moduli of the form $(2^n + 1)$ based on binary adders. Soderstrand [18] introduced a hybrid approach based on a look-up table along with a binary adder. Bayoumi and Jullien [19] described TOMAs using the table approach and the binary adder approach. Dugdale [20] demonstrated an implementation of TOMAs that used binary adders, and Piestrak [21] proposed a TOMA based on the carry-save adder (CSA) and two binary adders. Zimmermann [22] introduced modulo $(2^n \pm 1)$ adders based on the parallel-prefix architecture (PPA). Hiasat [23] proposed a TOMA with a reduced area based on the carry-look-ahead (CLA) adder. A novel delay-power-area-efficient approach to the TOMA design was also given by Patel et al. [24]. Their TOMA structure was based on the cascaded connection of a modified carry-save adder (CSA) and a reduced carry-propagate adder (CPA). The CPA designs used included the ELM [25], Kogge-Stone [26] and Ladner-Fischer [27] PPAs.
In this paper we propose a new TOMA based on a modified CLA adder. This TOMA has a smaller area than the other considered TOMAs and makes it possible to derive a new pipelined TOMA that is better than other known pipelined TOMAs in terms of the area and the number of pipeline register stages. We show the structure of the new pipelined TOMA and, for comparison, TOMAs based on the RCA, the PPA in the Brent-Kung form [28] and the Hiasat TOMA [23]. Comparisons are made using data from a VLSI standard cell library. We compare the structures of the individual TOMAs in terms of area, delay and pipelining frequency using the additive method. The method sums the areas of the individual components expressed in gate equivalents (GE), where 1 GE is the area of the NAND gate with fan-out = 1 for the given standard cell library. The propagation delay of an individual element is taken as the worst-case delay over all possible inputs. The analysis relies upon the established 130 nm Samsung standard cell library STDH150 [29]. Calculations of the areas and delays of the individual components are practically technology independent, and they can be scaled down for VLSI technologies such as 28 nm or 22 nm. We may therefore suppose that, for the comparison of individual digital structures, the assumed technology gives sufficient and dependable information.
The paper has the following structure: in Sec. 2 we review the basic TOMA structures, in Sec. 3 we consider the TOMA-RCA, in Sec. 4 the Hiasat TOMA, in Sec. 5 the TOMA based on the PPA, and finally in Sec. 6 the new TOMA. In each section we analyze the nonpipelined and the pipelined form.

Basic TOMA Structures Based on Binary Adders
In this section we briefly describe the basic known TOMA structures that use exclusively binary adders in series, and which may therefore be the most suitable for transformation to the pipelined form, rather than those that use two parallel adders as in [21]. Two-operand modular addition for small $m$, $\lceil \log_2 m \rceil \le 6$, can be implemented by using ..., but such an approach markedly reduces the attainable pipelining frequency.

The TOMA computes $r_m = |X + Y|_m$, where $r_m$ is the least nonnegative remainder of the division of $X + Y$ by the modulus $m$. For $0 \le X, Y < m$, the computation can also be expressed as
$$r_m = \begin{cases} X + Y, & X + Y < m, \\ X + Y - m, & X + Y \ge m. \end{cases}$$
We shall briefly analyze the operation of the Bayoumi-Jullien TOMA (Fig. 1), because this structure will be the basis for the design of the selected TOMAs. The binary adder in the first stage of this TOMA computes $X + Y$, whereas the second adder computes $X + Y - m$. The output of the TOMA is selected using $carry = carry_A \lor carry_B$. For $X + Y < m$, $carry = 0$ and $r_m = X + Y$, whereas for $X + Y \ge m$, $carry = 1$ and $r_m = X + Y - m$.
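The selection rule above can be checked with a small behavioural model. This is only a sketch under stated assumptions: both adders are modelled as n-bit binary adders with a carry-out, the second adder adds the n-bit two's complement of m, and the function name `toma_two_adders` is hypothetical. It says nothing about gate-level structure, area or delay.

```python
def toma_two_adders(x, y, m, n=5):
    """Behavioural model of a TOMA built from two n-bit binary adders in series."""
    assert 0 <= x < m and 0 <= y < m and 0 < m <= (1 << n)
    mask = (1 << n) - 1

    # First adder: X + Y on n bits, with carry-out carryA.
    s = x + y
    carry_a, s_low = s >> n, s & mask

    # Second adder: add the n-bit two's complement of m (2**n - m).
    # Its n output bits equal X + Y - m whenever X + Y >= m; carry-out is carryB.
    d = s_low + ((1 << n) - m)
    carry_b, d_low = d >> n, d & mask

    # Output multiplexer controlled by carry = carryA OR carryB.
    carry = carry_a | carry_b
    return d_low if carry else s_low

# Exhaustive check for m = 29 against the definition r_m = (X + Y) mod m.
m = 29
ok = all(toma_two_adders(x, y, m) == (x + y) % m
         for x in range(m) for y in range(m))
print(ok)
```

Either carry alone is insufficient: $carry_A$ is set when $X + Y \ge 2^n$, and $carry_B$ is set when $m \le X + Y < 2^n$, which is why the two are ORed.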

TOMA-RCA
By way of introduction we consider the realization of the Bayoumi-Jullien TOMA based on the RCA. In order to obtain a pipelined structure, layers of pipeline registers consisting of flip-flops (FFs) have to be inserted between the individual adders, as shown in Fig. 5. In the following we analyze the area of the TOMA-RCA expressed in GE, the delay and the maximum attainable pipelining frequency. The area is estimated using the areas of the individual components from STDH150; the delay of a nonpipelined structure is evaluated using the maximum delays of the individual components. In order to estimate the pipelining frequency, the structure is divided into layers balanced with respect to delay, and the maximum pipelining frequency is obtained as the inverse of the sum of the delay of the slowest layer and the FF delay.
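The frequency estimate described above amounts to a one-line computation. The sketch below uses hypothetical layer and flip-flop delays chosen only for illustration; they are not STDH150 data, and the function name is an assumption.

```python
def max_pipelining_frequency(layer_delays_ns, ff_delay_ns):
    """Return f_max in MHz: the inverse of (slowest layer delay + flip-flop delay)."""
    period_ns = max(layer_delays_ns) + ff_delay_ns
    return 1e3 / period_ns  # 1 / ns = GHz, so multiply by 1000 for MHz

# Hypothetical delays in nanoseconds (illustrative values only).
layers = [0.40, 0.55, 0.50]
f_max = max_pipelining_frequency(layers, ff_delay_ns=0.45)
print(f"{f_max:.1f} MHz")
```

Only the slowest layer matters, which is why a balanced decomposition is sought: shortening any other layer does not raise the attainable frequency.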

A. Nonpipelined 5-bit TOMA-RCA area
The area of the 5-bit TOMA-RCA can be expressed in the following manner: ... (1)
The area given by (1) does not depend upon the form of the two's complement system (TCS) representation of $-m$, ...
The delay of the 5-bit TOMA-RCA can be expressed as ... The delays for the $s_4$ and $c_5$ bits can be calculated as ... In order to compute the delay of $c'_5$, we shall first calculate $t_{c_i}$ and ...
The area of the pipelined structure is the sum of the nonpipelined 5-bit TOMA-RCA area and the area of the pipeline registers. In this case these registers require $n_s = 66$ FFs. Thus the area can be expressed as
$$A_{TOMA\text{-}RCAp} = A_{TOMA\text{-}RCA} + n_s A_{FF}.$$
As $A_{FF}$ we shall use the area of the flip-flop FD1Q, $A_{FD1Q}$, from STDH150. For the structure from Fig. 5 we obtain $A_{TOMA\text{-}RCAp} = 472.9$ GE.
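The contribution of the pipeline registers to the total area can be sketched as follows. The flip-flop counts (66, 58, 51 and 30) and the value $A_{FD1Q} = 5.67$ GE are those quoted in the text for the individual structures; applying the same flip-flop area to every structure is an assumption of this sketch, and the function names are hypothetical. The nonpipelined areas are not reproduced here.

```python
A_FD1Q = 5.67  # GE, area of the FD1Q flip-flop as quoted in the text

def register_area(n_ff, a_ff=A_FD1Q):
    """Area of the pipeline registers in GE."""
    return n_ff * a_ff

def pipelined_area(a_nonpipelined, n_ff, a_ff=A_FD1Q):
    """Additive model: nonpipelined area plus register area, in GE."""
    return a_nonpipelined + register_area(n_ff, a_ff)

# Flip-flop counts quoted in the text for the pipelined structures.
FF_COUNTS = {"TOMA-RCA": 66, "Hiasat TOMA": 58, "TOMA-BK": 51, "new TOMA": 30}

for name, n_ff in FF_COUNTS.items():
    print(f"{name}: {n_ff} FFs -> {register_area(n_ff):.2f} GE in registers")
```

Under this assumption the registers of the TOMA-RCA account for about 374 GE of the quoted 472.9 GE, which suggests that the number of flip-flops, rather than the adder logic, dominates the area of a pipelined TOMA.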

D. Pipelined 5-bit RCA-TOMA pipelining rate
In order to design a pipelined structure of a TOMA, we have to decompose its nonpipelined structure into a certain number of layers and place pipeline registers between them. The decomposition is, to a certain extent, arbitrary. The lower limit of the number of layers is two, and the upper limit is determined by the delay of the component that we treat as indivisible. The minimum clock period, i.e. the inverse of the maximum pipelining rate, is approximately the sum of the delay of the slowest layer and the delay of the pipeline register. In this case we have assumed that a register layer is placed after each FA or HA and that the OR gate and the MUXs are in the same layer. Hence we may evaluate the maximum delay of a layer as ... where ...

Hiasat TOMA
In the following we examine the results of transforming the Hiasat TOMA, which requires the smallest amount of hardware among the known TOMAs. This TOMA consists of the serial connection of five units: the sum-and-carry unit (SAC), the carry propagate and generate unit (CPG), the CLA for $c_{OUT}$, the multiplexer (MUX), and the CLA and summation unit (CLAS). The SAC is composed of HAs and HALs (the modified HAs in [23]). The SAC operates on the individual bits of $X + Y$ and $X + Y - m$, with the assumption that the TCS representation of $-m$ without the sign bit is $(z_{n-1}, \ldots, z_1, z_0)$ with $n = 5$. Since each $z_i$ is 0 or 1, there may be $w$ bits for which $z_i = 0$ and $n - w$ bits for which $z_i = 1$. Hence the SAC has $w$ HAs and $n - w$ HAL cells. The CFG computes the carry generate and carry propagate vectors as in the standard CLA ... This unit has at most $2k - 2$ HAs. In the CLAS, $p_i$ and $g_i$ are used to compute $c_{OUT}$, which controls the selection of $X + Y$ or $X + Y - m$. Since $c_0 = 0$ and $g_0 = 0$, $c_{OUT}$ can be computed for the five-bit Hiasat adder as ... The following stage, the MUX, uses $c_{OUT}$ to select the carry and generate signals ... The final stage, the five-bit CLA adder, computes the carries ... In the next step the sum bits are calculated as ...
First we shall determine the area of the components of the five-bit Hiasat TOMA and then the area for $m = 29$.

A. 5-bit Hiasat TOMA area
The area of the five-bit Hiasat TOMA can be computed as follows:
$$A_{TOMA\_Hiasat} = A_{SAC\_5} + A_{CFG\_5} + A_{CLA} + A_{MUX} + A_{CLAS}. \quad (16)$$
The areas of the individual blocks from (16) can be expressed as: ... with ... In general, the area of the CFG_5 can be expressed as ... The CLAS block consists of the five-bit propagate-generate unit (PGU_5), the carry-generate unit (CGU_5) and the summation unit (SU_5). Its hardware amount can be estimated as ... with the fan-outs 1, 3, 3, 4, 2. We get ... and ...
Example 2. Area of the five-bit Hiasat TOMA for $m = 29$.
The TCS representation of $-m$ is equal to 100011, hence $w = 3$ and $k - w = 2$ (the sign bit is excluded).
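The values in Example 2 can be reproduced with a short sketch. The helper names are hypothetical; the sketch assumes that the TCS representation of $-m$ has $n + 1$ bits (sign bit first) and that, as stated for the SAC, zeros among the $n$ remaining bits correspond to HA cells and ones to HAL cells.

```python
def tcs_of_minus_m(m, n=5):
    """(n+1)-bit two's complement representation of -m as a bit string, sign bit first."""
    return format((1 << (n + 1)) - m, f"0{n + 1}b")

def sac_cell_counts(m, n=5):
    """Return (w, n - w): the number of zeros and ones among the bits z_{n-1}..z_0."""
    z = tcs_of_minus_m(m, n)[1:]  # drop the sign bit
    w = z.count("0")
    return w, n - w

print(tcs_of_minus_m(29))   # 100011
print(sac_cell_counts(29))  # (3, 2): three HAs and two HALs
```

Because the counts depend only on the bit pattern of $-m$, the same helper gives the HA/HAL split for any other five-bit modulus.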

C. Pipelined 5-bit Hiasat TOMA area
The area of the pipelined 5-bit Hiasat TOMA can be expressed as
$$A_{TOMA\_Hiasat\_p} = A_{TOMA\_Hiasat} + n_h A_{FF},$$
where $n_h$ is the number of flip-flops used in the pipeline registers. For example, for the structure from Fig. 6 ...

D. Pipelining frequency of pipelined 5-bit Hiasat TOMA
In Fig. 6, a pipelined form of the Hiasat TOMA is presented. Five pipeline register stages with 58 flip-flops are used.
In this case we have adopted a decomposition into six layers, which leads to a balanced structure. In order to evaluate the maximum pipelining frequency we shall calculate the delays of the individual layers. The maximum pipelining frequency depends on the delay of the slowest layer and the delay of the assumed pipeline register. These layers have the following delays: layer 1 ...

PPA-based TOMA
As the next structure we consider the TOMA based on a PPA. The Brent-Kung (BK) adder [28] has been selected as the PPA. The Brent-Kung TOMA can be transformed to the pipelined form relatively easily; moreover, the use of the Brent-Kung PPA allows one to simplify the adder used in the second stage, where one of the addends is a constant. The prefix operator $\circ$ is defined as
$$(g_i, p_i) \circ (g_j, p_j) = (g, p), \qquad g = g_i + p_i g_j, \quad (27\text{a})$$
$$p = p_i p_j. \quad (27\text{b})$$
The block that implements (27a-b) will be denoted as $BK_i$. Subsequently we shall analyze the area and delay of the TOMA based on two BK adders.
The area of the TOMA-BK can be expressed as ... where $A_{BK}$ and $A_{BK-m}$ represent the areas of the BK adder and of the modified BK-m adder that subtracts $m$, respectively.
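The role of the prefix operator in carry computation can be illustrated with the following sketch. It assumes the usual form of the operator, $(g, p) \circ (g', p') = (g + p g', p p')$, and evaluates it in a serial scan for brevity. Since the operator is associative, a Brent-Kung tree computes the same group signals in a different order, so this sketch does not model the BK topology, its area or its delay. The function names are hypothetical.

```python
def prefix_op(hi, lo):
    """Combine the (g, p) pair of a more significant group with a less significant one."""
    g, p = hi
    g2, p2 = lo
    return (g | (p & g2), p & p2)

def add_with_prefix(a, b, n=5):
    """n-bit addition whose carries are obtained with the prefix operator."""
    # Bit-level generate and propagate signals.
    gp = [(((a >> i) & (b >> i)) & 1, ((a >> i) ^ (b >> i)) & 1) for i in range(n)]
    acc = gp[0]
    carries = [0, acc[0]]            # c_0 = 0, c_1 = g_0
    for i in range(1, n):
        acc = prefix_op(gp[i], acc)  # group signals over bits i..0
        carries.append(acc[0])       # c_{i+1} is the group generate
    s = sum((gp[i][1] ^ carries[i]) << i for i in range(n))
    return s, carries[n]             # n-bit sum and carry-out

s, c_out = add_with_prefix(19, 27)
print(s, c_out)  # 14 1, since 19 + 27 = 46 = 32 + 14
```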

A. The area of BK adder
$A_{BK}$ can be calculated as ... The area of the first two terms is ... After transforming the logic functions used for the realization of the individual adders in (29), we obtain the following areas ... Using (29), (30) and (31a-e) we obtain ...

B. The delay of BK adder
The BK adder delay can be expressed as ...

C. The area of BK-m adder
The form of the first layer of the BK-m adder depends on the TCS representation of $-m$, denoted $\tilde{m}$. We shall analyze the prefix operator computation for a pair of bits $(\tilde{m}_i, s_i)$; (27a-b) can be expressed as ... For the individual values of $\tilde{m}_i$ the HAs become reduced: for $\tilde{m}_i = 0$ we have $g_i = 0$, and the XOR gate that computes $p_i$ is reduced to a direct connection, i.e. $p_i = s_i$. For $\tilde{m}_i = 1$ we have $g_i = s_i$, and the XOR gate that computes $p_i$ becomes an inverter, i.e. $p_i = \bar{s}_i$.
Next we shall analyze the BK-m adder for $m = 29$ in order to have a comparison with the adder presented by Hiasat [23]. The TCS representation of $-m$ for $m = 29$ has the form 100011. Hence for HA$_0$, $g_0$ is a connection and $p_0$ an inversion; for HA$_1$, $g_1$ is a connection and $p_1$ an inversion; for HA$_2$, $g_2 = 0$ and $p_2$ is a connection; for HA$_3$, $g_3 = 0$ and $p_3$ is a connection; for HA$_4$, $g_4 = 0$ and $p_4$ is a connection.
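The reduction of the half adders for a constant operand can be verified exhaustively. In the sketch below, `reduced_ha` is a hypothetical helper, $s_i$ denotes the variable input bit and the constant bit is taken from the five low-order bits 00011 of the TCS representation 100011 quoted above for $m = 29$.

```python
def reduced_ha(s_i, m_bit):
    """Half adder with one constant input bit m_bit: returns (g_i, p_i)."""
    if m_bit == 0:
        return 0, s_i        # g_i = 0, p_i = s_i (direct connection)
    return s_i, s_i ^ 1      # g_i = s_i (connection), p_i = NOT s_i (inverter)

# The reduced cell agrees with the generic half adder g = s AND m, p = s XOR m.
for s_i in (0, 1):
    for m_bit in (0, 1):
        assert reduced_ha(s_i, m_bit) == (s_i & m_bit, s_i ^ m_bit)

# Five low-order bits of the TCS representation 100011 of -29.
m_tilde = 0b00011
p_kind = ["inversion" if (m_tilde >> i) & 1 else "connection" for i in range(5)]
g_kind = ["connection" if (m_tilde >> i) & 1 else "constant 0" for i in range(5)]
print(p_kind)
print(g_kind)
```

For $m = 29$ the first layer therefore needs only two inverters and no AND or XOR gates, which is the source of the area saving in the BK-m adder.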
Moreover, regarding that ... Assuming the direct realization we obtain ... For the other blocks we have ... We finally obtain for the BK-m adder ...

D. The area of the pipelined TOMA BK
This area can be evaluated as
$$A_{TOMA\_BKp} = A_{TOMA\_BK} + n_{BK} A_{FF},$$
where $n_{BK}$ is the number of flip-flops in the pipeline registers. For the structure from Fig. 7 with $n_{BK} = 51$ and $A_{FF} = A_{FD1Q} = 5.67$ GE, we get ...

New Five-bit TOMA
In this section we present a new TOMA structure and its pipelined form, which requires a smaller area than the other TOMA structures. The TOMA is configured as a serial connection of an $X + Y$ adder and an $X + Y - m$ adder, designed in a manner that leads to a substantial simplification and thus to a smaller delay or a smaller number of pipeline levels. Both adders are modifications of the standard CLA adder. In the first stage of the proposed structure the propagate and generate signals and the transfer functions [30] $t_i = a_i + b_i$ are used. The first three carries $c_1$, $c_2$ and $c_3$ are computed simultaneously, and $c_3$ is used to generate $c_4$ and $c_5$.
Generally, the computation of the carry can be expressed, assuming $c_0 = 0$, as
$$c_{i+1} = g_i + t_i c_i, \quad i = 0, 1, \ldots, 4.$$
In the above formula the transfer function $t_i = a_i + b_i$ is used instead of $p_i$, which is justified as follows:
$$g_i + p_i c_i = a_i b_i + (a_i \oplus b_i) c_i = a_i b_i + (a_i + b_i) c_i = g_i + t_i c_i,$$
with $t_i = a_i + b_i$, $g_i = a_i b_i$ and $p_i = a_i \oplus b_i$; the two forms differ only when $a_i b_i = 1$, in which case the carry equals 1 regardless.
We may express $c_2$ and $c_3$ as functions of $g_i$ and $t_i$ as
$$c_2 = g_1 + t_1 g_0, \qquad c_3 = g_2 + t_2 g_1 + t_2 t_1 g_0.$$
Consequently, we obtain
$$c_4 = g_3 + t_3 c_3$$
and
$$c_5 = g_4 + t_4 g_3 + t_4 t_3 c_3.$$
In the adder realization the above equations are transformed to the NAND form. The sum bits are generated using
$$s_i = p_i \oplus c_i, \quad i = 0, 1, \ldots, 4.$$
Regarding that the second operand of the $X + Y - m$ adder is $\tilde{m}$, we can write ... We may simplify the above equations by substituting the $\tilde{m}$ values of the individual five-bit moduli. The results of this simplification are given in Tab. 1. In Fig. 8, the TOMA based on the new principle for $m = 29$ is depicted.
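The claim that the transfer function $t_i = a_i + b_i$ may replace $p_i$ in the carry equations can be checked exhaustively for five-bit operands. The grouping of terms below ($c_1$ to $c_3$ from $g_i$ and $t_i$ directly, $c_4$ and $c_5$ from $c_3$) is an assumption that follows the verbal description in the text. The function names are hypothetical, and the NAND-level realization is not modelled.

```python
def carries_with_transfer(a, b):
    """Carries c_1..c_5 of a 5-bit adder from g_i = a_i b_i and t_i = a_i + b_i (c_0 = 0)."""
    g = [((a >> i) & (b >> i)) & 1 for i in range(5)]
    t = [((a >> i) | (b >> i)) & 1 for i in range(5)]
    # The first three carries depend only on g_i and t_i.
    c1 = g[0]
    c2 = g[1] | (t[1] & g[0])
    c3 = g[2] | (t[2] & g[1]) | (t[2] & t[1] & g[0])
    # c_3 is then used to generate the last two carries.
    c4 = g[3] | (t[3] & c3)
    c5 = g[4] | (t[4] & g[3]) | (t[4] & t[3] & c3)
    return [c1, c2, c3, c4, c5]

def reference_carries(a, b):
    """Carry into position i, taken from integer addition of the i low-order bits."""
    return [((a & ((1 << i) - 1)) + (b & ((1 << i) - 1))) >> i for i in range(1, 6)]

ok = all(carries_with_transfer(a, b) == reference_carries(a, b)
         for a in range(32) for b in range(32))
print(ok)
```

An OR gate is generally cheaper than an XOR gate, which is the usual motivation for using $t_i$ in the carry path while keeping $p_i$ only for the sum bits.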

A. 5-bit new TOMA area
We shall analyze the area and delay of the new TOMA for $m = 29$. The area of the new TOMA can be computed as ... The hardware amount of the $X + Y$ adder can be expressed as ... In effect, we obtain ... The total hardware amount is $A_{TOMA\_New} \approx 110.06$ GE.

B. 5-bit New TOMA delay
The delay of the new TOMA can be written as ... where ...
The area of the pipelined new TOMA depends on $n_N$, the number of flip-flops in the pipeline registers. For the structure from Fig. 8 with $n_N = 30$ and $A_{FF} = A_{FD1Q} = 5.67$ GE, we get a pipelined area of approximately 280.82 GE.

E. Pipelining frequency of the pipelined new TOMA
For the individual layers in the pipelined structure of the new TOMA, shown in Fig. 8, we have the following delays: layer 1: ...
It is seen that the area-delay product has the best values for the TOMA-RCA and the new TOMA; moreover, the new TOMA requires the smallest area for the pipelined structure, but at the cost of a reduced maximum pipelining frequency. In general, the new pipelined TOMA calls for about 35% less area than the TOMA-BK, the best of the three other considered structures.

Conclusions
The structures of pipelined two-operand modular adders for five-bit moduli based on the ripple-carry adder, the Brent-Kung adder and the Hiasat adder have been presented and analyzed with respect to the area, the number of layers and the attainable pipelining frequency. A new structure of the two-operand modular adder based on a modified carry-look-ahead adder has also been proposed. It has been shown that the new pipelined adder has the smallest number of pipeline layers, as well as an area smaller by about 35% than that of the best of the other considered structures.

..., n. Each integer $X \in Z_M$ is represented as $X(x_1, x_2, \ldots, x_n)$ ... This mapping is a bijection, and for $X, Y \in Z_M$ and for ...
... the individual components come from STDH150. The data of the individual components are given in Appendix A. After inserting these data into (1) ... The particular form of this representation allows the area to be reduced for the given modulus. For example, if $m_i = 0$, the HA reduces to a single connection, and for $m_i = 1$ to one connection and one inverter. For the FA and $m_i = 0$, we have one XOR gate and a single AND gate, and for $m_i = 1$ one OR gate and one exclusive-NOR gate. For $m = 29$ and ...
5-bit TOMA-RCA delay
We shall estimate the delay of the structure of Fig. 4, taking into consideration the individual delays of signals inside the individual HAs and FAs.
Area of pipelined 5-bit TOMA-RCA
In Fig. 5 a pipelined form of the TOMA-RCA is presented. Six flip-flop stages with 66 flip-flops are used.
... $\tilde{m}$ influences the form of $BK_0$ and $BK_1$.

Tab. 1. Logical functions for the realization of the carries of the $X + Y - m$ adder.

Example 1. Computation of 5-bit TOMA-RCA delay for components from the STDH150.
The design of the pipelined structure aimed at minimizing the number of pipeline stages while preserving as high a pipelining frequency as possible. The structure allows one to employ only three pipeline register stages with 30 flip-flops, with the maximum pipelining frequency equal to ...