Hardware Implementation of 32-Bit High-Speed Direct Digital Frequency Synthesizer

The design and implementation of a high-speed direct digital frequency synthesizer are presented. A modified Brent-Kung parallel adder is combined with a pipelining technique to improve the speed of the system. A gated clock technique is proposed to reduce the number of registers in the phase accumulator design. The quarter-wave symmetry technique is used to store only one quarter of the sine wave. The ROM lookup table (LUT) is partitioned into three 4-bit sub-ROMs based on an angular decomposition technique and a trigonometric identity. By exploiting the sine-cosine symmetry together with XOR logic gates, one sub-ROM block can be removed from the design. These techniques compress the ROM to 368 bits. The ROM compression ratio is 534.2 : 1, with only two adders, two multipliers, and XOR gates, and the frequency resolution is 0.029 Hz. These techniques make the direct digital frequency synthesizer an attractive candidate for wireless communication applications.


Introduction
Direct digital frequency synthesis (DDFS) systems with high-speed frequency hopping, low power consumption, high frequency resolution, and small chip area are in great demand, especially in wireless communications and radar systems. DDFS systems have low noise and high frequency resolution compared with the phase-locked loop [1]. The DDFS output frequency ($f_{\mathrm{out}}$) is given by $f_{\mathrm{out}} = (\mathrm{FCW}/2^{N}) \times f_{\mathrm{clk}}$, where $f_{\mathrm{clk}}$ is the clock frequency, FCW is the frequency control word, and $N$ is the phase accumulator width. Pipelining partitions a given task into a number of subtasks that are performed in sequence. The pipelining technique is used in [2][3][4] to increase the throughput. However, this technique also increases the power consumption and chip area. The gated clock technique was applied to the pipelined phase accumulator (PA) in [5] to reduce the number of registers without performance degradation. A parallel adder based on the progression-of-states technique was combined with the pipelining technique in [6,7] to obtain a high-speed, low-power pipelined parallel PA. The pipelined accumulator with a parallel adder was proposed in [8]. In this paper, two blocks of parallel 16-bit ripple carry adder (RCA) are designed based on the progression-of-states technique with two pipelining stages.
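The output frequency relation can be checked with a small behavioural model. The following is a minimal sketch in Python, not the Verilog design used in this work; the function names `ddfs_output_frequency` and `count_wraps` are hypothetical and chosen for illustration only. It assumes an ideal $N$-bit accumulator that adds the FCW once per clock cycle, so each overflow marks the start of one output period.

```python
def ddfs_output_frequency(fcw: int, f_clk: float, n_bits: int = 32) -> float:
    """Return f_out = (FCW / 2**N) * f_clk for an N-bit phase accumulator."""
    return fcw / (1 << n_bits) * f_clk


def count_wraps(fcw: int, n_bits: int, cycles: int) -> int:
    """Run a behavioural N-bit accumulator and count its overflows.

    Each overflow corresponds to one completed period of the output wave.
    """
    phase, wraps = 0, 0
    modulus = 1 << n_bits
    for _ in range(cycles):
        phase += fcw
        if phase >= modulus:      # accumulator overflow
            phase -= modulus
            wraps += 1
    return wraps
```

Over a run of $2^{N}$ clock cycles the accumulator overflows exactly FCW times, which is the same ratio FCW/$2^{N}$ that appears in the formula.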
DDFS spectral purity depends on the number of phase output bits used for addressing the ROM. Therefore, the higher the spectral purity required, the bigger the ROM needed. The quarter-wave symmetry in [9,10] was applied as a simple technique to reduce the ROM size, storing only a quarter ($0 : \pi/2$) of the sine wave and using the two most significant bits (MSBs) to generate the full sine wave ($2\pi$). Quarter-wave symmetry together with an angular decomposition method based on trigonometric identity was used in [11][12][13] and in [14], where the ROM is partitioned into two ROMs, namely, coarse and fine ROMs, to obtain the quadrature outputs, the sine and cosine values. The author suggested several DDFS systems implemented with different types of FPGA kit boards and compared them with an ASIC-based approach.
The polynomial approximation method was used in DDFS design to optimize the spurious-free dynamic range (SFDR). This method was used by Ashrafi and Adhami in [15] to determine the upper bound of the SFDR using piecewise polynomial interpolation. Fourier series was used to establish a linear relationship between the coefficients of the interpolating polynomials and the frequency spectrum. This method can be used in any kind of DDFS to maximize the SFDR.
De Caro et al. proposed a dual-slope technique in [16] to optimize the piecewise linear approximation for the phase-to-sine mapping. This technique improves the spectral purity and reduces the ROM size. The stored values of the two quarter ($\pi/4$) sine and cosine sub-ROMs, together with the three MSBs and a mapper, are required to accomplish the quarter phase-to-sine mapping and produce the quadrature output from the DDFS. In [17], De Caro et al. proposed another DDFS architecture, which is based on a piecewise linear approximation technique with nonuniform segment length at the input of three groups of multiplexers. This technique can maximize the SFDR and reduce the size of the coefficients ROM.
This paper presents a high-speed DDFS system with a pipelined PA based on a modified parallel BK adder and the gated clock technique. The ROM was resized by applying the quarter-wave symmetry technique, which stores only one quarter of the sine wave, and an angular decomposition technique based on trigonometric identity has been applied to compress the quarter ROM LUT. Based on these techniques, the quarter ROM LUT was partitioned into three sub-ROMs ($\alpha$, $\beta$, and $\gamma$). The proposed architecture improves the speed of the DDFS and reduces the size of the ROMs.

The Gated Clock and Parallel Pipelining Technique for Phase Accumulator
The modified parallel BK adder based on the progression-of-states technique, combined with the gated clock technique, was used in the proposed design of the PA. The frequency resolution ($f_{\mathrm{Res}}$) of the DDFS is determined by the clock frequency ($f_{\mathrm{clk}}$) and the number of input bits $N$ of the PA, as given by $f_{\mathrm{Res}} = f_{\mathrm{clk}}/2^{N}$. For high frequency resolution, it is preferable to design a PA with a large number of FCW input bits. However, a large ROM is required to cover all $2^{N}$ values of the phase accumulator output. For this reason, only part of the MSBs of the phase output is used to address the phase-to-amplitude converter (ROM lookup table), which maintains the high frequency resolution. The pipelining technique was used to increase the throughput of the accumulator, and this throughput increases with the number of pipeline stages, as shown in Figure 1.
The number of registers increases with the number of pipeline stages, which leads to high power consumption. Therefore, in this design, a gated clock technique was used to reduce the number of preskewing registers while preserving high-speed operation. In this technique, D flip-flops (DFFs) were used to connect each row of the pipeline stages with the FCW input. These registers are clocked by pipelined pulses, each one clock cycle wide, derived from the shifted clock pulses as shown in Figure 2(a). Considering that the phase accumulator has $N$ input bits, the PA was partitioned into pipeline stages with $B$ DFFs in each stage. With the gated clock technique applied to the proposed design, the number of preskewing registers has been reduced from 80 to 36, corresponding to a 55% reduction. Figure 2(b) shows the operation of the parallel adder. Four adders process the 8-bit input word, which is held constant, within four clock cycles. Holding the FCW for four clock cycles limits the update rate of the frequency input word but does not cause any glitch, because the four adders steadily advance the accumulator output four times. Consider the parallel adder output at the $n$th clock time. The result of the fourth adder is fed back to the second input of all four adders. To generate the second output (for clock time $n+2$) and the fourth output (for clock time $n+4$), the FCW is shifted up by one and two bits, respectively: the removed bits of the FCW input are replaced with the one and two bits from the lower pipelining stage before the addition. The outputs for clock times $n+2$ and $n$ are used to generate the output for clock time $n+3$. This operation keeps the frequency tuning word constant for four clock cycles without causing any imperfections in the PA output.
The clock cycle partitioned into quarters drives the multiplexers, which select one of the four results at the output of the PA in turn to overcome the holding time of the parallel adders, as illustrated in Figure 2(c).
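The progression-of-states idea can be illustrated with a behavioural model. The sketch below is written in Python under stated assumptions and is not the hardware description used in this work: the names `sequential_pa` and `parallel_pa`, the symbols F (the held FCW) and S(n) (the accumulator state at clock time n), and the choice to form the third result as S(n+2) + F are all illustrative. Pipelining and carry exchange between stages are not modelled.

```python
MASK = (1 << 32) - 1  # 32-bit phase accumulator width, as in the text


def sequential_pa(fcw: int, cycles: int) -> list:
    """Reference model: one addition per clock cycle."""
    out, phase = [], 0
    for _ in range(cycles):
        phase = (phase + fcw) & MASK
        out.append(phase)
    return out


def parallel_pa(fcw: int, cycles: int) -> list:
    """Four adders produce S(n+1)..S(n+4) from S(n) while the FCW is held."""
    out, state = [], 0
    for _ in range(0, cycles, 4):
        s1 = (state + fcw) & MASK          # S(n) + F
        s2 = (state + (fcw << 1)) & MASK   # S(n) + 2F: FCW shifted up one bit
        s3 = (s2 + fcw) & MASK             # S(n+2) + F (assumed wiring)
        s4 = (state + (fcw << 2)) & MASK   # S(n) + 4F: FCW shifted up two bits
        out.extend([s1, s2, s3, s4])       # the multiplexer emits one per cycle
        state = s4                         # fourth result is fed back
    return out[:cycles]
```

For any FCW the two models produce the same output sequence, which is the sense in which holding the FCW for four cycles causes no imperfection in the PA output.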

Phase Accumulator Architecture
The proposed phase accumulator architecture, based on the modified parallel BK adder and the gated clock technique with pipelining stages, is shown in Figure 3. The output of the PA is a truncated 14-bit value formed from the 8 bits of the top pipelining stage and 6 bits of the second pipelining stage.
The 32-bit FCW input is sufficient to obtain a 0.029 Hz frequency resolution from the 125 MHz clock frequency of the Cyclone III FPGA kit board ($f_{\mathrm{Res}} = 125 \times 10^{6}/2^{32} = 0.029$ Hz).
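The resolution figure and the 14-bit truncation can be reproduced with a few lines of Python. This is a hedged sketch: the constant and function names are hypothetical, and the truncation is modelled simply as keeping the 14 most significant bits of the 32-bit phase word.

```python
F_CLK = 125e6     # Cyclone III board clock from the text, in Hz
N_BITS = 32       # phase accumulator width
ADDR_BITS = 14    # truncated phase word that addresses the ROM


def frequency_resolution(f_clk: float, n_bits: int) -> float:
    """Smallest output frequency step: f_clk / 2**N."""
    return f_clk / (1 << n_bits)


def truncate_phase(phase: int) -> int:
    """Keep the 14 MSBs of a 32-bit phase value (8 + 6 bits of two stages)."""
    return (phase & ((1 << N_BITS) - 1)) >> (N_BITS - ADDR_BITS)
```

With these values the resolution evaluates to roughly 0.0291 Hz, and the 18 discarded LSBs only affect the ROM address once they carry into bit 18.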

Modifying Brent-Kung Adder.
The BK adder is fast, and all carries are computed simultaneously through a binary tree of "BK" cells as shown in Figure 4(a). BK cells compute the carry for two or more full adder (FA) cells, and they are calculated from the generate ($g$) and propagate ($p$) cells [23,24]. The arithmetic operation is given by $(g, p) \bullet (g', p') = (g + p \cdot g',\; p \cdot p')$, where $(g, p)$ and $(g', p')$ are the higher and lower generate and propagate functions, respectively. Here $p_i$ is a propagate function ($p_i = x_i \oplus y_i$) and $g_i$ is a generate function ($g_i = x_i \cdot y_i$).
The general prefix addition algorithm is explained by Zimmermann in [25]. By adding the carry input $c_{\mathrm{in}}$ to the prefix structure with some modifications, the prefix structure can be used in pipelining-based adder design. This approach is used in the BK adder for fast carry computation. In this paper, a modification to the BK adder is proposed so that it can be used in a pipelining architecture. The proposed modification removes the operation of $g_0$, and the carry out of the first bit, $c_1$, is obtained from a 2-to-1 multiplexer. The inputs to this multiplexer are $x_0$ and $c_{\mathrm{in}}$, while $p_0$ is the select input and the output is $c_1$. The operation of the multiplexer is given by $c_1 = \overline{p_0} \cdot x_0 + p_0 \cdot c_{\mathrm{in}}$, where $c_1$ is the carry out, $p_0$ is the propagate function, $c_{\mathrm{in}}$ is the carry input, and $x_0$ is the first bit input. The proposed modification of the 8-bit BK adder is shown in Figure 4(b).
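A bit-level model helps to see why a multiplexer on the first bit is enough to bring the carry input into the prefix tree. The following Python sketch is an illustration under assumptions, not the authors' Verilog: the function names are hypothetical, the tree is a textbook 8-bit Brent-Kung up-sweep and down-sweep, and the first node is seeded with the pair ($c_1$, 0) so that every prefix result directly equals a carry.

```python
def mux2(select: int, when0: int, when1: int) -> int:
    """2-to-1 multiplexer."""
    return when1 if select else when0


def bk_op(hi, lo):
    """BK cell: (g, p) o (g', p') = (g + p.g', p.p')."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)


def modified_bk_add8(x: int, y: int, c_in: int = 0):
    """Return (sum, carry_out) of an 8-bit addition with carry input."""
    n = 8
    xb = [(x >> i) & 1 for i in range(n)]
    yb = [(y >> i) & 1 for i in range(n)]
    p = [a ^ b for a, b in zip(xb, yb)]      # propagate
    g = [a & b for a, b in zip(xb, yb)]      # generate
    c1 = mux2(p[0], xb[0], c_in)             # replaces the g_0 operation
    node = [(g[i], p[i]) for i in range(n)]
    node[0] = (c1, 0)                        # assumed seeding of the tree
    for d in range(3):                       # up-sweep of the binary tree
        step = 1 << d
        for i in range(2 * step - 1, n, 2 * step):
            node[i] = bk_op(node[i], node[i - step])
    for d in range(1, -1, -1):               # down-sweep fills the gaps
        step = 1 << d
        for i in range(3 * step - 1, n, 2 * step):
            node[i] = bk_op(node[i], node[i - step])
    carries = [c_in] + [node[i][0] for i in range(n)]
    total = sum((p[i] ^ carries[i]) << i for i in range(n))
    return total, carries[n]
```

When $p_0 = 0$ the two input bits are equal, so the carry out of bit 0 equals $x_0$; when $p_0 = 1$ the carry input passes through. For example, `modified_bk_add8(200, 100, 1)` returns the 8-bit sum 45 with a carry out of 1.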
The sum and carry out of the modified 8-bit BK adder are shown in

Comparison of Different Adder Architecture.
An adder is a key element of the pipelined PA design, and a fast adder improves PA performance. Parallel-prefix adder tree structures such as Sklansky [26], Kogge-Stone [27], BK [23], and Beaumont-Smith [28] have been used in pipelined accumulator designs for high-speed operation. A comparison has been made between a conventional adder and several parallel-prefix adders for 12-bit, 18-bit, 24-bit, and 32-bit operations. The PA designs were coded in Verilog HDL and verified on the Cyclone III FPGA kit board. Prior to that, all the designs were simulated by using ALTERA Quartus II. The comparison result is shown in Figure 5. From the figure, it can be seen that the BK adder performs relatively faster, especially for a high number of bits.

ROM Lookup Table Design
Figure 4: (a) 8-bit BK adder; (b) proposed modified 8-bit BK adder. (The Scientific World Journal.)
The ROM LUT, or phase-to-amplitude converter (PAC), is the memory storage of the DDFS, which is used to convert the phase accumulator output into the sine amplitude. With the quarter-wave symmetry technique, the ROM stores only one quarter of the sine waveform, and the two most significant bits (MSBs) from the phase accumulator are used to reconstruct the full sine wave. Of these two MSBs, one determines whether the sine amplitude is increasing or decreasing and the other determines its sign. The phase output is used directly in the first and third quarters, while the inverse values of the phase output are used in the second and fourth quarters. This requires a 2's complement when the phase is between ($\pi : 2\pi$), which is achieved by adding a full adder at the output gate to produce the full sine wave value. To save power and achieve a smaller design area, a 1/2 LSB offset is added to the stored memory address of the sub-ROMs. This offset removes the full adder component from the 2's complement of the proposed design.
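The quarter-wave reconstruction can be sketched as follows. This Python model is an illustration only: it uses a single uncompressed quarter table rather than the sub-ROMs of the proposed design, the 11-bit magnitude width `AMP_BITS` is a hypothetical choice, and the names are not taken from the original implementation. The 1/2 LSB phase offset is what allows the address to be mirrored by a plain bitwise complement (XOR gates) with no adder.

```python
import math

ADDR_BITS = 14             # truncated phase word, as in the text
Q_BITS = ADDR_BITS - 2     # bits that address one quarter of the wave
AMP_BITS = 11              # hypothetical magnitude width; sign kept separately

# Quarter-wave table sampled with a 1/2 LSB phase offset.
QUARTER = [
    round((2**AMP_BITS - 1) * math.sin(math.pi / 2 * (k + 0.5) / 2**Q_BITS))
    for k in range(2**Q_BITS)
]


def sine_from_phase(phase: int) -> int:
    """Rebuild a full-wave sample from the quarter table and the two MSBs."""
    sign = (phase >> (ADDR_BITS - 1)) & 1      # MSB: sign of the sample
    mirror = (phase >> (ADDR_BITS - 2)) & 1    # second MSB: rising or falling
    index = phase & (2**Q_BITS - 1)
    if mirror:
        index ^= 2**Q_BITS - 1                 # 1's complement of the address
    magnitude = QUARTER[index]
    return -magnitude if sign else magnitude
```

In this model the negative half is produced by negating the magnitude; in hardware the same offset lets the sign be applied with XOR gates instead of a 2's complement adder.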
The angular decomposition technique based on trigonometric identity is one of the best techniques to reduce the quarter ROM LUT size. The quarter ROM LUT was partitioned into three ($\alpha$, $\beta$, and $\gamma$), such that $\alpha + \beta + \gamma < \pi/2$, where $\sin(\alpha + \beta) = \sin\alpha \cos\beta + \cos\alpha \sin\beta$ (10).
Equation (10) indicates the presence of four split sub-ROMs, two for $\sin\alpha$ and $\cos\alpha$ and two for $\sin\beta$ and $\cos\beta$. The calculated results of ($\sin\alpha$, $\cos\alpha$) show that they are inversely symmetrical. Based on this, $\cos\alpha$ can be obtained by complementing the $\sin\alpha$ values, and this is achieved by connecting the $\sin\alpha$ output and a logic high to the XOR logic gate inputs. In this way, only one addressing sub-ROM is needed for the $\sin\alpha$ and $\cos\alpha$ values. The $\cos\beta$ values are fed into the sub-ROM with a 4-bit input; therefore $\cos\beta = \cos(0.046019^{\circ}) = 0.999999677 \cong 1$. Thus, the $\cos\beta$ block ROM was also removed from the proposed design. Therefore, (10) can be reduced to $\sin(\alpha + \beta) = \sin\alpha + \cos\alpha \sin\beta$.
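The cost of dropping the small-angle cosine term can be estimated numerically. The Python sketch below is a hedged check, not part of the original design flow: the symbols alpha (coarse angle) and beta (fine angle) and all function names are assumptions for illustration, and the angle 0.046019 is interpreted in degrees because that is the interpretation that reproduces the quoted value 0.999999677.

```python
import math


def sin_exact(alpha: float, beta: float) -> float:
    """sin(alpha + beta) via the full identity."""
    return math.sin(alpha) * math.cos(beta) + math.cos(alpha) * math.sin(beta)


def sin_reduced(alpha: float, beta: float) -> float:
    """Reduced form that assumes cos(beta) = 1."""
    return math.sin(alpha) + math.cos(alpha) * math.sin(beta)


def max_reduction_error(beta: float, steps: int = 1024) -> float:
    """Worst absolute error of the reduced form over alpha in [0, pi/2]."""
    worst = 0.0
    for k in range(steps + 1):
        alpha = (math.pi / 2) * k / steps
        err = abs(sin_exact(alpha, beta) - sin_reduced(alpha, beta))
        worst = max(worst, err)
    return worst


BETA = math.radians(0.046019)   # fine angle quoted in the text, in degrees
```

The error equals $\sin\alpha\,(1 - \cos\beta)$, so it is bounded by $1 - \cos\beta$, which for this angle is about $3.2 \times 10^{-7}$ and far below one LSB of a 12-bit output ($2^{-12} \approx 2.4 \times 10^{-4}$).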
The final design of the high-speed DDFS, which consists of parallel pipelining PA and compressed ROM LUT by using the wave symmetry technique, is shown in Figure 7.
Note: in this work the measured DDFS output waveform is in signal-to-noise ratio (SNR).

Hardware Implementation of High-Speed DDFS
The proposed design of the high-speed DDFS with 0.029 Hz frequency resolution was coded in Verilog HDL, successfully simulated in ALTERA Quartus II software, and implemented with a Cyclone III FPGA kit board. The proposed DDFS has been verified using a spectrum analyzer. The measured results are consistent with the simulated results. The DDFS design accumulates the FCW input into the sawtooth PA output, and the PA output feeds into the phase-to-amplitude converter (ROM LUT). The full amplitude waveform at the output of the ROM lookup table is shown using the Signal Tap logic analyzer in Figure 8.

Result and Discussion
The 32-bit DDFS has been implemented on the Cyclone III FPGA kit board. The 32-bit phase accumulator output was truncated to 14 bits. A full sine wave with 12-bit output would require $2^{14} \times 12 = 196608$ bits of ROM. Quarter-wave symmetry and an angular decomposition technique based on trigonometric identity were applied, and the ROM was compressed to only 368 bits, a 534.2 : 1 ratio. The measured DDFS output waveforms and spectra for different clock frequencies illustrate the purity of the sine wave output, as shown in Figure 9 for a frequency tuning word of 1FFFFFFF in hexadecimal format and $f_{\mathrm{clk}} = 125$ MHz (the Cyclone III kit board clock frequency). The calculated output frequency is $f_{\mathrm{out}} = (\mathrm{FCW}/2^{N}) \times f_{\mathrm{clk}} = ((2^{29} - 1)/2^{32}) \times 125 \times 10^{6} \approx 15.624999$ MHz, with $N = 32$. The output frequency measured with the oscilloscope is 15.63 MHz, which closely matches the calculated result.
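The arithmetic behind the ROM size, the compression ratio, and the calculated output frequency can be reproduced directly. This is a small Python sketch with hypothetical names; it only restates the numbers given in the text and assumes nothing about the internal ROM organisation.

```python
ADDR_BITS = 14              # truncated phase accumulator output
OUT_BITS = 12               # sine sample width
COMPRESSED_ROM_BITS = 368   # ROM size reported for the proposed design


def uncompressed_rom_bits(addr_bits: int, out_bits: int) -> int:
    """Bits needed to store every sample of the full sine wave."""
    return (1 << addr_bits) * out_bits


def compression_ratio(addr_bits: int, out_bits: int, compressed: int) -> float:
    return uncompressed_rom_bits(addr_bits, out_bits) / compressed


def output_frequency(fcw: int, f_clk: float = 125e6, n_bits: int = 32) -> float:
    """f_out = (FCW / 2**N) * f_clk."""
    return fcw / (1 << n_bits) * f_clk
```

The ratio evaluates to about 534.26, reported as 534.2 in the text, and the tuning word 1FFFFFFF (that is, $2^{29} - 1$) gives an output frequency about 0.03 Hz below 15.625 MHz.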
The signal-to-noise ratio (SNR) was approximated based on $\mathrm{SNR} = 6.02\,b + 1.76$ dB (14) (see [29]), where $b$ represents the number of bits used to feed the digital-to-analog converter. For the proposed DDFS, $b = 12$ and the calculated SNR is 74 dB. The measured DDFS output from the spectrum analyzer shows an SNR of approximately 68 dB, as shown in Figures 10(a) and 10(b). The measured value is about 6 dB less than the calculated result, and this is attributed to the noise from wire connections. Table 1 compares the ROM size of the proposed design, together with some other parameters, against other designs. The table shows that the proposed design has the smallest ROM size.
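As a worked check of the SNR estimate, the ideal quantization formula can be evaluated for a 12-bit converter. The symbol $b$ for the number of converter bits is an assumed name, and the measured value of 68 dB is the one quoted from the spectrum analyzer.

```latex
\begin{aligned}
\mathrm{SNR}_{\text{ideal}} &= 6.02\,b + 1.76\ \text{dB} \\
  &= 6.02 \times 12 + 1.76 = 72.24 + 1.76 = 74.00\ \text{dB}, \\
\mathrm{SNR}_{\text{ideal}} - \mathrm{SNR}_{\text{measured}} &\approx 74 - 68 = 6\ \text{dB}.
\end{aligned}
```

Since each bit contributes about 6.02 dB in this formula, a 6 dB gap is equivalent to roughly one bit of effective resolution.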

Conclusion
A 32-bit high-speed DDFS system was designed with 0.029 Hz frequency resolution. The system was successfully simulated in ALTERA Quartus II software. The phase accumulator together with the ROM lookup table has been implemented on the Cyclone III FPGA kit and verified with a Signal Tap logic analyzer, and the complete DDFS system with a digital-to-analog converter has been implemented on the Cyclone III FPGA kit board and its performance has been measured using an oscilloscope and a spectrum analyzer. Parallel pipelining with clock gating has been applied to a modified BK adder for the proposed phase accumulator design. Wave symmetry and an angular decomposition technique based on trigonometric identity were used to reduce the ROM size. The ROM was compressed to only 368 bits, with a 534.2 : 1 ratio. The proposed design has demonstrated attractive results that improve the operation speed and reduce the ROM size significantly.
Figure 10: The measured signal-to-noise ratio for the DDFS waveform output (a and b).