NULL Convention Floating Point Multiplier

Floating point multiplication is a critical part of high dynamic range and computationally intensive digital signal processing applications, which require high precision and low power. This paper presents the design of an IEEE 754 single precision floating point multiplier using the asynchronous NULL convention logic paradigm. Rounding has not been implemented, to suit high precision applications. The novelty of the research is that it is the first ever NULL convention logic multiplier designed to perform floating point multiplication. The proposed multiplier offers a substantial decrease in power consumption when compared with its synchronous version. Performance attributes of the NULL convention logic floating point multiplier, obtained from Xilinx simulation and Cadence, are compared with its equivalent synchronous implementation.


Introduction
Clocked circuits have dominated the semiconductor industry for the past two decades. Excessive clock skew, clock noise, and the larger power dissipation of clocked circuits have led the way to the asynchronous world of very large scale integration (VLSI). NULL convention logic (NCL) is an asynchronous paradigm that requires less power, generates less noise, radiates less EMI, and allows reusability of components compared to its synchronous counterparts [1].
The 2009 International Technology Roadmap for Semiconductors (ITRS) has predicted that asynchronous clockless circuits will occupy 49% of the chip area in 2024 and has identified power consumption as one of the major design challenges. Delay insensitive NCL circuits designed using CMOS exhibit an inherent idle behaviour, since they switch only when useful work is being performed. The dynamic power consumption due to switching activity is therefore greatly reduced when compared with the synchronous counterpart. Hence, NCL based asynchronous designs make a significant contribution to research in low power VLSI.
In order to integrate NCL into semiconductor design industry, reusable design libraries have to be designed.
We performed a background analysis of the circuits that were designed using the NCL methodology. We observed that, due to the complexity involved in processing floating point data, researchers contributing to NCL focussed only on NCL based designs that processed nonfractional and fixed point data. However, high precision is a prime requirement for high dynamic range and computationally intensive applications such as the fast Fourier transform, which require efficient hardware to support floating point data. Hence, we propose the design and characterization of an NCL based floating point multiplier (FPM) that is compliant with the IEEE 754 single precision format. The proposed NCL FPM is targeted to perform multiplication of floating point numbers and to dissipate lower power than its synchronous counterpart. The primary contribution of our research is a low power, high precision, reusable NCL floating point multiplier library component, which in future can be used as an integral component in the design of NCL based DSP processor cores. The performance attributes of the NCL FPM are analysed in terms of power, average delay, and area and compared with those of its equivalent synchronous FPM.
The outline of the paper is as follows. In Section 2, the literature on NCL based designs is presented. Section 3 presents a brief description of the existing synchronous floating point multiplier architecture. Section 4 gives a detailed structural description of the proposed design, starting with the design and development of NCL components, followed by the integration of the components to realize the complete NCL FPM. Results and discussions are presented in Section 5, followed by a conclusion in Section 6.
The Scientific World Journal

NCL Literature
Delay insensitivity, hysteresis, and input completeness are the distinct advantages of NCL circuits. Delay insensitivity specifies that the circuit operates correctly regardless of when the circuit inputs become available [1]. Delay insensitivity is achieved through dual rail or quad rail logic [1]. A dual rail signal consists of two wires, $D^0$ and $D^1$, whose values are from the set {DATA0, DATA1, NULL}, as illustrated in Table 1 [1]. DATA0 and DATA1 represent Boolean logic levels 0 and 1, respectively. NULL represents the empty set, a state in which DATA is not available [1]. The two rails are mutually exclusive: they cannot be asserted simultaneously. If both rails are asserted, the signal is in an illegal state [1].
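The dual rail encoding described above can be sketched in a few lines of Python. This is an illustrative model only: the names `DualRail`, `encode`, and `decode` are hypothetical, and each state is written as a (rail 0, rail 1) pair.

```python
from enum import Enum


class DualRail(Enum):
    """Legal dual rail states, written as (rail0, rail1) wire levels."""
    NULL = (0, 0)    # no data available
    DATA0 = (1, 0)   # Boolean 0: rail 0 asserted
    DATA1 = (0, 1)   # Boolean 1: rail 1 asserted


def encode(bit):
    """Map a Boolean value, or None for 'no data', to a dual rail state."""
    if bit is None:
        return DualRail.NULL
    return DualRail.DATA1 if bit else DualRail.DATA0


def decode(rail0, rail1):
    """Map two wire levels to a state; both rails asserted is illegal."""
    if rail0 and rail1:
        raise ValueError("illegal state: both rails asserted")
    return DualRail((int(bool(rail0)), int(bool(rail1))))
```

Treating the illegal state as an exception, rather than as a fourth enum member, mirrors the mutual exclusivity of the two rails.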
Threshold NCL gates with hysteresis state holding capability are constructed to realize the NCL circuits [1]. A basic threshold gate, specified as the THmn gate in Figure 1, has $n$ inputs and 1 output. At least $m$ of the $n$ inputs must be asserted before the output will become asserted. Hysteresis is enforced by the fact that, after the output is asserted, all inputs must be deasserted before the output becomes deasserted [1]. Input completeness requires that no output transits from NULL to DATA or from DATA to NULL until all inputs have transited from NULL to DATA or from DATA to NULL [1].
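The set, reset, and hold behaviour of a threshold gate can be illustrated with a small behavioural model. This is a sketch under simplifying assumptions (no weighted inputs, no timing); the class name `ThresholdGate` and its `update` method are hypothetical.

```python
class ThresholdGate:
    """Behavioural model of a THmn gate: n inputs, threshold m, hysteresis."""

    def __init__(self, m, n):
        if not 1 <= m <= n:
            raise ValueError("threshold m must satisfy 1 <= m <= n")
        self.m = m
        self.n = n
        self.out = 0  # the gate is assumed to reset to logic 0

    def update(self, inputs):
        """Apply n input levels (0 or 1) and return the resulting output."""
        if len(inputs) != self.n:
            raise ValueError("expected %d inputs" % self.n)
        asserted = sum(1 for level in inputs if level)
        if asserted >= self.m:
            self.out = 1   # set: at least m inputs are asserted
        elif asserted == 0:
            self.out = 0   # reset: every input is deasserted
        # otherwise the previous output is held (hysteresis)
        return self.out
```

For example, a TH22 gate built as `ThresholdGate(2, 2)` stays asserted when only one input drops and is deasserted only once both inputs have returned to 0.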
The NCL modules, designed using threshold gates, are sandwiched between the delay insensitive (DI) registers to realize a DI, input complete NCL system. The flow of DATA and NULL wavefronts is controlled by the request and acknowledge signals, $K_i$ and $K_o$ [1], as shown in Figure 2.
The input wavefronts NULL and DATA are controlled by the handshaking signals and completion detection circuitry. Two adjacent register stages interact through their request and acknowledge signals, $K_i$ and $K_o$, respectively. The handshaking signals ensure that two DATA wavefronts are always separated by a NULL wavefront. The acknowledge signals are combined in the completion detection circuitry to produce the request signals to the previous register stage [1]. NCL registration is realized through cascaded arrangements of single-bit dual-rail registers. These registers consist of th22 gates that pass a DATA value at the input only when $K_i$ is request for data (rfd) (i.e., logic 1) and likewise pass NULL only when $K_i$ is request for null (rfn) (i.e., logic 0). They also contain a NOR gate to generate $K_o$, which is rfn when the register output is DATA and rfd when the register output is NULL. The registers are reset to NULL, since all th22 gates are reset to logic 0. An $n$-bit register stage, comprising $n$ single-bit dual-rail NCL registers, requires $n$ completion signals, one for each bit. The NCL completion component uses these $K_o$ lines to detect complete DATA and NULL sets at the output of every register stage and to request the next NULL and DATA set, respectively. In full word completion, the single-bit output of the completion component is connected to all $K_i$ lines of the previous register stage [1].
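The register and completion behaviour described above can be approximated by the following behavioural sketch. It assumes ideal gates with no delay; the names `th22`, `DualRailRegister`, and `full_word_completion` are hypothetical, `ki` stands for the request input and `ko` for the acknowledge output, and the completion function is modelled as a single set/reset/hold element rather than as a tree of gates.

```python
def th22(a, b, prev):
    """TH22 behaviour: set when both inputs are 1, reset when both are 0."""
    if a and b:
        return 1
    if not a and not b:
        return 0
    return prev  # hold the previous output (hysteresis)


class DualRailRegister:
    """Single-bit dual rail register: two TH22 gates plus a NOR gate."""

    def __init__(self):
        self.q0 = 0  # output rail 0, reset to NULL
        self.q1 = 0  # output rail 1, reset to NULL

    def step(self, d0, d1, ki):
        """Apply the input rails and request ki; return (q0, q1, ko)."""
        self.q0 = th22(d0, ki, self.q0)
        self.q1 = th22(d1, ki, self.q1)
        ko = int(not (self.q0 or self.q1))  # NOR: 1 is rfd, 0 is rfn
        return self.q0, self.q1, ko


def full_word_completion(ko_bits, prev):
    """Combine per-bit ko lines: 1 if all rfd, 0 if all rfn, else hold."""
    if all(ko_bits):
        return 1
    if not any(ko_bits):
        return 0
    return prev
```

With `ki = 1` the register accepts DATA and answers with `ko = 0` (request for NULL); it then ignores a NULL input until `ki` falls to 0, which is what keeps successive DATA wavefronts separated.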
Research on DI design using NCL has taken different dimensions since its inception in the field of asynchronous VLSI design. The most familiar dimension is the design of a circuit in various NCL styles, such as dual rail, quad rail, static, and semistatic designs [2,3]. The second dimension focuses on transistor level design of NCL threshold gates using varied approaches to reduce power consumption [1,4]. The third dimension focuses on developing designs using NCL and comparing them with synchronous versions of the designs [5,6]. The fourth dimension focuses on the tools available for simulating and synthesizing NCL designs [7,8]. In all the dimensions of NCL research, power, average delay, and area are set as the performance attributes.
We have performed an analysis of the existing NCL circuits that use multipliers. Table 2 shows that the multipliers designed so far have used only nonfractional multiplication and fixed point fractional multiplication. The existing NCL multiplier architectures do not support floating point multiplication. Hence, we propose an NCL based IEEE 754 single precision (32-bit) floating point multiplier that performs multiplication of floating point numbers and is targeted to consume less power than its equivalent synchronous version.
Table 2: Analysis of NCL multipliers.

Existing Synchronous Floating Point Multiplier
The implementation of a synchronous FPM without rounding support [13] utilizes the IEEE 754 single precision binary format to represent floating point numbers, as shown in Figure 3.
The format consists of a sign bit ($S$), an 8-bit exponent ($E$), and a 23-bit mantissa ($M$). An extra bit is added to the MSB of the mantissa to form the significand. If the exponent lies strictly between 0 and 255, and there is a 1 in the MSB of the significand, the number is said to be normalized [13].
The real number is represented by (1):
$$Z = (-1)^{S} \times 2^{E-127} \times \left(1 + \sum_{i=0}^{22} m_i\, 2^{\,i-23}\right) \tag{1}$$
where $m_{22}, m_{21}, \ldots, m_1, m_0$ represent the 23 mantissa bits. Multiplication of two numbers in floating point format is done by (i) addition of the exponents of the two numbers, followed by subtraction of the bias from the result, (ii) multiplication of the significands of the two numbers, and (iii) calculation of the sign by XORing the signs of the two numbers [13]. To represent the multiplication result as a normalized number, there should be a 1 in the MSB of the result (leading one), so the result is normalized to obtain a 1 at the MSB of its significand. The algorithm is implemented using the synchronous multiplier architecture [13] shown in Figure 4. The architecture has been designed to suit high precision applications. Hence, rounding support is not included in the hardware design.
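The three steps and the normalization can be sketched in software as follows. This is a minimal model, not the hardware described in [13]: it assumes normalized operands only (no zero, subnormal, infinity, or NaN handling, and no exponent overflow or underflow checks), and the function names are hypothetical. The helper `to_single` truncates the product fraction to 23 bits purely so that the result can be compared with an ordinary single precision value.

```python
import struct

BIAS = 127
FRACTION_MASK_46 = (1 << 46) - 1


def float_to_bits(x):
    """Return the 32-bit IEEE 754 single precision pattern of x."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def bits_to_float(bits):
    """Interpret a 32-bit pattern as an IEEE 754 single precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def fp_multiply(a_bits, b_bits):
    """Multiply two normalized operands; return (sign, exponent, fraction).

    The fraction is the 46 bits below the leading one of the normalized
    48-bit significand product; nothing is rounded.
    """
    sign = ((a_bits >> 31) ^ (b_bits >> 31)) & 1           # step (iii)
    exponent = ((a_bits >> 23) & 0xFF) + ((b_bits >> 23) & 0xFF) - BIAS  # (i)
    sig_a = (a_bits & 0x7FFFFF) | 0x800000                 # restore hidden 1
    sig_b = (b_bits & 0x7FFFFF) | 0x800000
    product = sig_a * sig_b                                # step (ii), 48 bits
    if (product >> 47) & 1:        # leading one at bit 47: shift and adjust
        product >>= 1
        exponent += 1
    return sign, exponent, product & FRACTION_MASK_46


def to_single(sign, exponent, fraction):
    """Pack a result into 32 bits, truncating the fraction to 23 bits."""
    return (sign << 31) | ((exponent & 0xFF) << 23) | (fraction >> 23)
```

For instance, multiplying the bit patterns of 1.5 and 2.5 leaves the leading one at bit 46, so no shift is needed, whereas 3.0 times 3.0 sets bit 47 and triggers the shift and exponent increment.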

NCL Components
NCL XOR Gate.
The NCL XOR gate performs the XOR operation on the sign bits of the two input operands to obtain the sign bit (sign) of the NCL FPM's product, as shown in Figure 5. Two instances of the th24compx0 threshold gate [1] are used to perform the XOR operation. It presents 1 gate delay.
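A dual rail XOR built from two such gates can be modelled as below. The sketch assumes that the set function of a TH24comp gate with inputs A, B, C, D is (A + B)(C + D) and that the gate resets only when all inputs are 0; this assumption, the input wiring, and the function names are illustrative rather than taken from the design in Figure 5. Operands are (rail 0, rail 1) pairs.

```python
def th24comp(a, b, c, d, prev):
    """Assumed TH24comp behaviour: set on (A or B) and (C or D), else hold."""
    if (a or b) and (c or d):
        return 1
    if not (a or b or c or d):
        return 0   # reset only when every input is deasserted
    return prev    # hold the previous output (hysteresis)


def ncl_xor(x, y, state=(0, 0)):
    """Dual rail XOR of x and y, each a (rail0, rail1) pair.

    `state` is the previous output pair, needed for the hold behaviour.
    """
    x0, x1 = x
    y0, y1 = y
    # With mutually exclusive rails these reduce to:
    z0 = th24comp(x0, y1, y0, x1, state[0])   # x0*y0 + x1*y1 -> result is 0
    z1 = th24comp(x0, y0, x1, y1, state[1])   # x0*y1 + x1*y0 -> result is 1
    return (z0, z1)
```

Because each gate needs one rail from each operand before it can set, the output stays NULL until both sign bits have arrived, which is the input completeness property discussed earlier.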

NCL Ripple Borrow Subtractor.
We designed a 9-bit NCL subtractor to subtract the bias, $(127)_{10} = (01111111)_2$, from the result of the NCL exponent adder. The NCL subtractor comprises a cascaded structure of 7 NCL one bit subtractors (OS) and 2 NCL zero subtractors (ZS), as illustrated in [...] (4) and (5), respectively [...].
The NCL significand multiplier comprises a partial product generator, 6 Wallace tree NCL carry save adders of varied widths, and 1 NCL ripple carry adder. The NCL partial product generator comprises an array of 586 NCL AND gates. The significand bits of the inputs act on the NCL AND gates to produce 586 partial products. The NCL Wallace tree carry save adders [15] and the NCL ripple carry adder act on the partial products to produce the 48-bit product, IP[47:0], as shown in Figure 8. It presents 15 gate delays.

NCL Normalizer.
The intermediate product, IP, has to be normalized to obtain a leading "1" at bit 46. Since both inputs are normalized, IP will contain a leading "1" at bit 46 or 47. A leading "1" at bit 46 implies that IP is already a normalized number and hence no shift is needed. If IP has a leading "1" at bit 47, then the IP has to be shifted to the right by 1 bit [...] Figure 9. An NCL half adder is constructed using 2 th24compx0, 1 th12x0, and 1 th22x0 gates. The NCL IP shifter, comprising 46 cascaded NCL 2:1 multiplexers (MUX) [1], presents 2 gate delays, as shown in Figure 10.

NCL Floating Point Multiplier
Architecture. The NCL FPM components developed using the NCL design methodology are sandwiched between the DI registers to realize the NCL FPM, as shown in Figure 11. The NCL FPM comprises two DI register banks (RB), one at the input and one at the output. The 66-bit DI RB1 receives the two inputs as normalized numbers. The 55-bit DI RB2 outputs the normalized result of the NCL FPM. The NCL XOR gate acts on the sign bits of the two inputs to produce the sign bit of the product (sign). The 8-bit exponent inputs are added, and the bias is then subtracted from their sum, using the NCL ripple carry adder and the NCL subtractor to obtain the 8-bit IE. The array of 586 NCL AND gates, together with the NCL significand multiplier, acts on the two 24-bit significand inputs to obtain the 48-bit unnormalized IP. IP[47] is assigned as the select (sel) input to the MUXs of the NCL IP shifter and as the carry input (cin) to the LSB position of the NCL IE incrementer. If IP[47] = 1, the 46-bit IP shifter shifts the IP to the right by 1 bit and, simultaneously, the IE is incremented by 1. If IP[47] = 0, the IP bits remain unchanged and the IE is not incremented. The output of the normalizer is the final 8-bit exponent ($E_n$) and the 46-bit normalized product ($P_n$). The 55-bit output is passed through the 55-bit DI RB2 to generate the sign bit (sgn_out), exponent bits (exp_out), and significand product bits (product_out) as the final result of the NCL FPM.
The two DI RBs interact with each other through their request and acknowledge signals, $K_i$ and $K_o$. These signals ensure that two DATA wavefronts are separated by a NULL wavefront and prevent the wavefronts from overlapping [1]. Upon the assertion of reset, the NCL FPM components are initialized to NULL. When reset is deasserted, full word completion component 2 and full word completion component 1 generate logic 1 on their respective outputs, $K_{o2}$ and $K_{out}$, indicating the completion of the NULL wavefront. $K_{o2} = 1$ and $K_{out} = 1$ are sent as request for DATA to $K_i$ of DI RB1 and $K_i$ of DI RB2 (through $K_{in}$). A DATA wavefront is passed through DI RB1, processed through the combinational logic, and received at the output of DI RB2. The completion components then generate logic 0 on their outputs, indicating the completion of the DATA wavefront. Request for NULL is sent to $K_i$ of DI RB1 and $K_i$ of DI RB2, thereby repeating the sequence. The DATA/NULL cycle represents the sequence: (i) flow of DATA through DI RB1 and the combinational circuit; (ii) flow of DATA through DI RB2 and request for NULL through the completion circuit; (iii) flow of NULL through DI RB1 and the combinational circuit; (iv) flow of NULL through DI RB2 and request for DATA. In the corresponding cycle time, $T_{comb}$ denotes the combinational delay and $T_{comp}$ the delay of the completion components.

Results and Discussions
The VHDL gate level structural model of the NCL FPM was designed using gate delays based on physical-level simulations with TSMC 1.8 V 0.18 µm static CMOS technology libraries [16]. It was simulated on the Xilinx ISE simulator using an exhaustive VHDL test bench that generates $2^{33} \times 2^{33}$ possible input test vector combinations. At the simulation level, $T_{DD}$ is obtained as the arithmetic mean of the DATA/NULL cycle times corresponding to all possible pairs of input operands [2]. The NCL FPM yielded a $T_{DD}$ of 5.9 ns and was found to be functionally correct, as illustrated in Figure 12. For 32-bit operations of the NCL FPM, the input operands are specified in the IEEE 754 format, as shown in Table 3. The sequence of operations performed on the operands and the corresponding results are summarized in Table 4.
The NCL FPM and the synchronous FPM were synthesized to TSMC 1.8 V 180 nm process technology libraries. The results are summarized in Table 5. The average delay of the NCL FPM is 5.672 ns. The minimum clock period at which the synchronous FPM operates without any timing violation is 0.39 ns. The results demonstrate that the NCL FPM is much slower than the synchronous FPM. However, when the synchronous FPM was run at the same speed as the NCL FPM, the proposed NCL FPM dissipated 67.52% less power than its equivalent synchronous FPM. The synchronous FPM operating at its maximum speed consumes 81% more power than the NCL FPM, as shown in Figure 13. It is also observed that the area of the NCL FPM is 63% larger than that of the synchronous FPM. In spite of the decrease in speed and increase in area, the NCL FPM promises a significant reduction in power when compared to its synchronous version.
Smith and Di [1] have stated that NCL based systems produce a significant decrease in power at the expense of an increase in area, which is approximately 1.5-2 times that of the equivalent synchronous systems. The number of threshold gates required to realise the NCL FPM components and DI registers contributes to the larger area, which is a bottleneck in the proposed NCL FPM. However, when realizing SoCs, the DSP processor cores that include the NCL FPM as one of their components will generally require less than half of the entire chip area. The remaining chip area will be occupied by flash, cache, memory, and peripherals, which are the same for both synchronous and NCL designs [1]. Hence, the increase in area is less significant when weighed against other advantages such as low power, elimination of clock related issues, and lower electromagnetic interference [1].
The reasons for low power consumption in the NCL FPM are discussed as follows. The total power dissipated in a static CMOS circuit is modelled by
$$P = \alpha\, C_L\, V_{dd}^{2}\, f,$$
where $\alpha$ is the activity factor, $C_L$ is the load capacitance, $V_{dd}$ is the supply voltage, and $f$ is the frequency of switching. The $T_{DD}$ of NCL circuits, which are data dependent by construction, can be compared with the clock frequency of a synchronous clocked circuit [1]. NCL circuits switch only when DATA and NULL wavefronts are being processed ($T_{DD}/2$), unlike clocked circuits, which switch on every clock pulse [1]. When the switching activity of a static CMOS circuit is controlled by a clock, $\alpha = 1$. Alternatively, circuits driven by data will have a maximum activity factor of $\alpha = 0.5$ [17]. Hence, the switching power of NCL circuits, which are data driven, is almost halved when compared to clocked Boolean circuits. Short circuit power, which depends on the rise and fall times of switching transitions, reduces with the switching activity. Hence, the proposed NCL FPM has a significant reduction in dynamic power. The NCL FPM strictly adheres to monotonic transitions between DATA and NULL wavefronts. Hence, there is no glitching [1] in the NCL FPM, unlike the existing synchronous FPM, which dissipates glitch power. Due to the absence of glitches, the power in the NCL FPM is uniformly distributed in time.
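The effect of the activity factor can be illustrated numerically. The sketch below takes the switching power as the product of activity factor, load capacitance, squared supply voltage, and switching frequency, as described above. The function name is hypothetical, and the load capacitance and frequency are made-up values chosen only for illustration; only the 1.8 V supply comes from the paper.

```python
def switching_power(alpha, c_load, v_dd, freq):
    """Switching power alpha * C_L * V_dd^2 * f, in watts."""
    return alpha * c_load * v_dd ** 2 * freq


# Hypothetical operating point, not measured data.
C_LOAD = 1e-12   # farads (assumed)
V_DD = 1.8       # volts (supply voltage of the libraries used)
FREQ = 100e6     # hertz (assumed)

clocked = switching_power(1.0, C_LOAD, V_DD, FREQ)      # clock driven
data_driven = switching_power(0.5, C_LOAD, V_DD, FREQ)  # data driven
ratio = data_driven / clocked
```

With all other terms equal, the ratio depends only on the two activity factors, which is why the data driven figure comes out at half the clocked one regardless of the assumed capacitance and frequency.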
To illustrate the novelty of the proposed multiplier, a comparison of the proposed and existing NCL multipliers is performed and summarized in Table 6. The NCL circuits designed in the past utilized multipliers that performed multiplication of nonfractional and fixed point numbers. The designs determined the trade-off between speed, power, and area. Consequently, we state that the proposed NCL floating point multiplier, characterized in terms of power, speed, and area, is the first ever NCL based low power and high precision multiplier, designed to perform floating point multiplication.

Conclusion
We have designed, simulated, and synthesized an IEEE 754 single precision NCL FPM without rounding support. The gate level structural model of the proposed NCL FPM was successfully simulated and verified to be functionally correct. Synthesis results showed that asynchronous NCL FPM dissipated much less power than its synchronous counterpart. Hence, it can be used as a reusable library component in NCL based digital signal processing applications that demand low power and high precision. The future work is to optimize the proposed design for higher throughput and lower power. NULL cycle reduction technique and fine grain pipelining can be applied to the NCL FPM to increase the throughput. MTNCL (multithreshold NCL) gates can be used at transistor level to decrease the leakage power. In future, many reusable library components, such as NCL floating point adder and NCL floating point ALU, can be designed to realize floating point DSP processors.