FPGA-Based Implementation of a New Phase-to-Sine Amplitude Conversion Architecture

The classical structure of a linear-interpolation-based phase-to-sine mapper (PSM) consists of at least two ROMs for polynomial coefficient storage. Other architectures may include an extra ROM for storing residual errors. However, ROMs dissipate high power and occupy a significant amount of the die area. This study presents a new technique that eliminates the ROM storing the segment initial coefficients by computing these coefficients in hardware. Therefore, it becomes possible to save a noticeable amount of hardware resources. The proposed direct digital frequency synthesizer (DDFS) architecture has been coded in VHDL and synthesized with Quartus II software. Post-simulation results show that the proposed design is capable of achieving the theoretical spurious-free dynamic range (SFDR) upper bound when optimal polynomial coefficients are considered. For 32 piecewise linear segments, the SFDR of the synthesized sinusoid is 84.15 dBc. A ROM compression ratio of 597.3:1 was also achieved. The performance of the DDFS is compared with previously presented DDFS techniques, and the results show that the proposed design has the advantages of a high ROM compression ratio and low hardware complexity. DOI: http://dx.doi.org/10.5755/j01.eee.19.10.5905

Index Terms-Direct digital frequency synthesizer, DDS, phase to sine amplitude conversion, piecewise linear approximation.

I. INTRODUCTION
Direct digital frequency synthesizers (DDFSs) are capable of producing sine output waveforms with ultra-fine frequency increments, fast frequency switching, and high spectral purity. The synthesized signal is primarily digital in the DDFS; thus, the DDFS can be combined with different digital modulations because the frequency, phase, and amplitude are easy to handle in the digital domain. Moreover, a number of applications require the DDFS output frequency to be switched in some predefined pattern. The most obvious application is DDFS-based chirp signals used in radar systems, in spread spectrum communications, and as the stimulus in bio-impedance measurement. Among frequency synthesis techniques, DDFS appears to be the most suitable for such applications. DDFS exhibits flexible tuning capability, and an important feature is that the duration of the chirp signal and its frequency range can be adjusted independently [1].
The ROM-based DDFS architecture was first introduced by Tierney et al. [2]. As displayed in Fig. 1, the basic structure of the ROM-based DDFS consists of three major blocks: a phase accumulator, a phase-to-sine amplitude converter (PSAC), and a digital-to-analog converter (DAC). Here, fclk is the reference clock used by the DDFS and FIW is the frequency instruction word. At each leading edge of fclk, the phase accumulator adds the M-bit FIW to its contents. The accumulated phase value addresses the sine lookup table (LUT) to produce the sine waveform. One period of the synthesized waveform corresponds exactly to one overflow of the M-bit phase accumulator. The synthesized output frequency can be expressed as
$$f_{out} = \frac{FIW}{2^{M}}\, f_{clk}, \quad \text{where } 0 \le FIW \le 2^{M-1}. \tag{1}$$
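As a quick numerical check of the relation f_out = FIW · fclk / 2^M, the following Python sketch evaluates the output frequency and the tuning step of an M-bit phase accumulator. The function names are illustrative only, and the accumulator width M = 15 is an assumption, chosen because FIW = 4065 then reproduces the 0.124 fclk ratio quoted for Fig. 10.

```python
def output_frequency(fiw: int, m_bits: int, f_clk: float) -> float:
    """Output frequency of an M-bit phase accumulator: FIW * f_clk / 2**M."""
    if not 0 <= fiw <= 2 ** (m_bits - 1):
        raise ValueError("FIW must lie in [0, 2**(M-1)]")
    return fiw * f_clk / 2 ** m_bits


def frequency_resolution(m_bits: int, f_clk: float) -> float:
    """Smallest frequency step, obtained with FIW = 1."""
    return f_clk / 2 ** m_bits


# Assumed example: M = 15 and a normalized clock (f_clk = 1.0),
# so the result is directly the ratio f_out / f_clk.
ratio = output_frequency(4065, 15, 1.0)
```

With these assumed values, `ratio` is close to 0.124, and the tuning step is fclk / 32768.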
For a precise approximation, the ROM-based sine LUT has to be filled with the sine amplitude values that correspond to each possible phase value. The phase and amplitude quantization errors decrease as the depth and width of the ROM increase, respectively; thus, enlarging the ROM is the direct way to obtain a sinusoidal output of high spectral purity. However, a large ROM has high power consumption and occupies a large area. These factors negatively influence the performance of the DDFS. Therefore, compressing the size of the ROM without sacrificing spectral purity is essential. Most DDFSs are developed from the architecture shown in Fig. 1. However, during the last four decades, considerable modifications have been introduced and numerous alternative architectures have been proposed to reduce the computational complexity of the PSAC. In general, these methods fall into three major groups: ROM compression [3]-[5], angle rotation [6]-[8], and piecewise polynomial interpolation methods [9]-[11].
As stated in [9], [10], the piecewise linear interpolation method is regarded as an efficient technique that compares well with other approximation techniques in terms of performance and hardware complexity. A generic PSAC structure based on the linear interpolation technique comprises two ROMs that store the segment initial amplitudes and the segment slope coefficients. In this study, we propose an improved version of the standard linear-interpolated PSAC architecture. Our goal is to eliminate the ROM that stores the segment initial coefficients in order to minimize system complexity. Once this ROM is eliminated, we expect the target system to exhibit excellent spectral purity and low power consumption with reasonable hardware overhead.

II. PIECEWISE LINEAR INTERPOLATION BASIC BACKGROUND
In uniform piecewise linear approximation, the first quadrant of the sine function is divided into s segments of equal length. Each segment is approximated with a first-order polynomial. Thus, p(x) can be expressed as
$$p(x) = m_i (x - x_i) + c_i, \quad x_i \le x < x_{i+1}, \quad i = 0, 1, \ldots, s-1, \tag{2}$$
where $m_i$ and $c_i$ are the polynomial coefficients of the ith segment, $x_i$ is the lower bound of the ith segment, x is the input phase scaled to a binary fraction in the interval [0, 1], and s is the number of segments, which is chosen to be a power of two for further simplification. Fig. 2 depicts the basic structure of the uniform piecewise linear-interpolated PSAC, in which two ROMs, one multiplier, and one adder are the common blocks. We aim to compute the initial coefficients in hardware so that one of the coefficient ROMs can be bypassed. In the following section, we show that a simple recursive substitution in each polynomial segment enables the segment initial amplitude coefficients to be derived from the slope coefficients; accordingly, ROM elimination becomes feasible.
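The conventional two-ROM evaluation described above can be sketched in Python as follows. This is only a floating-point illustration under stated assumptions: the coefficients are obtained by joining the segment end points of sin(πx/2) (chord interpolation), not by the optimization used later in the paper, and the names `build_tables` and `psac_two_rom` are hypothetical.

```python
import math


def build_tables(s: int):
    """Slope (m_i) and initial-amplitude (c_i) tables for s equal segments.

    Assumption: chord interpolation of sin(pi*x/2) on x in [0, 1].
    """
    bounds = [math.sin(math.pi * i / (2 * s)) for i in range(s + 1)]
    slopes = [s * (bounds[i + 1] - bounds[i]) for i in range(s)]
    initials = bounds[:s]
    return slopes, initials


def psac_two_rom(x: float, slopes, initials) -> float:
    """Evaluate p(x) = m_i * (x - i/s) + c_i with one multiply and one add."""
    s = len(slopes)
    i = min(int(x * s), s - 1)  # segment selector (upper phase bits in hardware)
    return slopes[i] * (x - i / s) + initials[i]
```

Because s is a power of two, the segment index `i` corresponds to the upper bits of the phase word and `x - i/s` to the remaining lower bits, which is why no divider is needed in hardware.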

III. THE PROPOSED MODIFICATION
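As mentioned before, the segment initial amplitude coefficients $c_i$ can be obtained by recursive substitution in each segment polynomial. In each subinterval of (2), the sine function is approximated by a first-order polynomial of the form
$$p_i(x) = m_i (x - x_i) + c_i, \quad \text{where } x_i \le x < x_{i+1}. \tag{3}$$
For a uniform piecewise linear approximation, the segments are equal in length and the segment bounds $x_i$ are equal to $i/s$. Starting from the first interval, the segment initial coefficient $c_0$ and the segment lower bound $x_0$ are equal to zero. Thus, the first segment polynomial is expressed as
$$p_0(x) = m_0 x, \quad \text{where } 0 \le x < \tfrac{1}{s}. \tag{4}$$
By substituting the segment bound $1/s$ into (4), we find the second segment initial coefficient $c_1$ as
$$c_1 = \frac{m_0}{s}. \tag{5}$$
Thus, the second segment polynomial
$$p_1(x) = m_1 \left(x - \tfrac{1}{s}\right) + c_1, \quad \text{where } \tfrac{1}{s} \le x < \tfrac{2}{s}, \tag{6}$$
can be expressed in terms of slope coefficients by substituting (5) into (6):
$$p_1(x) = m_1 \left(x - \tfrac{1}{s}\right) + \frac{m_0}{s}, \quad \text{where } \tfrac{1}{s} \le x < \tfrac{2}{s}. \tag{7}$$
We apply the same procedure to segment number two by substituting the segment boundary $2/s$ into (7). Therefore, the third segment initial coefficient $c_2$ is expressed as
$$c_2 = m_1 \left(\tfrac{2}{s} - \tfrac{1}{s}\right) + \frac{m_0}{s} = \frac{1}{s}\,(m_0 + m_1), \tag{8}$$
and $p_2(x)$ can readily be expressed as
$$p_2(x) = m_2 \left(x - \tfrac{2}{s}\right) + \frac{1}{s}\,(m_0 + m_1), \quad \text{where } \tfrac{2}{s} \le x < \tfrac{3}{s}. \tag{9}$$
Following the same procedure for the ith segment polynomial
$$p_i(x) = m_i \left(x - \tfrac{i}{s}\right) + c_i, \quad \text{where } \tfrac{i}{s} \le x < \tfrac{i+1}{s}, \tag{10}$$
we can, in general, deduce the segment initial coefficient $c_i$ as
$$c_i = \frac{1}{s} \sum_{j=0}^{i-1} m_j. \tag{11}$$
By substituting (11) into (10), $p_i(x)$ becomes
$$p_i(x) = m_i \left(x - \tfrac{i}{s}\right) + \frac{1}{s} \sum_{j=0}^{i-1} m_j, \quad \text{where } \tfrac{i}{s} \le x < \tfrac{i+1}{s}. \tag{12}$$
We can then rewrite (2) as
$$p(x) = m_i \left(x - \tfrac{i}{s}\right) + \frac{1}{s} \sum_{j=0}^{i-1} m_j, \quad \tfrac{i}{s} \le x < \tfrac{i+1}{s}, \quad i = 0, 1, \ldots, s-1. \tag{13}$$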
At this point, the initial coefficients have been replaced by the accumulated previous slope coefficients, thus allowing the ROM to be replaced with a simple accumulator. We show in subsequent sections of the paper that the hardware resources of this accumulator are significantly smaller than those of the ROM it replaces.
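The replacement of the initial-coefficient table by a running sum of slopes can be checked numerically. The sketch below is a floating-point illustration, not the hardware itself: it assumes slopes obtained by joining the segment end points of sin(πx/2) (any set of slopes of a continuous piecewise linear function starting at zero would do), and the function names are hypothetical.

```python
import math
from itertools import accumulate


def slopes_for_sine(s: int):
    """Illustrative slope coefficients m_i (chord interpolation, an assumption)."""
    return [
        s * (math.sin(math.pi * (i + 1) / (2 * s)) - math.sin(math.pi * i / (2 * s)))
        for i in range(s)
    ]


def initial_coefficients(slopes):
    """c_i = (1/s) * sum of m_0 .. m_(i-1), built by a running sum, not a table."""
    s = len(slopes)
    running = [0.0] + list(accumulate(slopes))[:-1]
    return [acc / s for acc in running]
```

For these assumed slopes, `initial_coefficients(slopes_for_sine(32))[i]` agrees with sin(πi/64) up to floating-point rounding, which is the content of the accumulated-slope expression for c_i.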

IV. THE PROPOSED DDFS ARCHITECTURE
Based on the theoretical approach presented in the previous section, we introduce a single coefficient ROM architecture displayed in Fig. 3.
The initial coefficients ROM is replaced by a simple digital accumulator, which is depicted in the dashed-line rounded rectangle. The accumulator is simply a digital integrator whose output, the running sum of the slope coefficients, is equivalent to the initial coefficient. Furthermore, memory requirements are reduced by a factor of four by exploiting the quadrant symmetry of the sine function. Accordingly, the architecture has to perform both positive and negative accumulation. For this purpose, the 1's complement block is needed. The accumulator word length is given by
$$D = \left\lceil \log_2 \left( \sum_{i=0}^{s-1} \mathrm{dec}(m_{qi}) \right) \right\rceil \tag{14}$$
or is simply taken equal to $(N + \log_2 s)$ bits, where $\lceil \cdot \rceil$ denotes the ceiling function, $\mathrm{dec}(m_{qi})$ represents the decimal value of the ith quantized slope coefficient, and N is the slope coefficient word length.
According to (13), for a given segment i, the accumulator holds the value $\sum_{j=0}^{i-1} m_j$, which represents the segment initial coefficient $c_i$. This value must be kept unchanged during the segment interval. To do this, the architecture has to initiate one accumulation cycle coincident with each segment transition. For this purpose, a digital comparator monitors the ROM address bus (the segment selector) to detect segment transitions. Thus, the comparator output signal En is responsible for initiating the accumulation cycle.
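The enable-on-transition behaviour can be illustrated with a small behavioural model in Python. This is a sketch under simplifying assumptions, not the authors' VHDL: the class name and interface are hypothetical, the phase is assumed to advance by at most one segment per clock, and the falling direction (mirrored quadrants) is modelled by subtracting the slope of the segment being entered.

```python
class SlopeAccumulator:
    """Behavioural model of an accumulator enabled only on segment transitions."""

    def __init__(self, slope_rom):
        self.slope_rom = list(slope_rom)  # quantized slope words m_q0 .. m_q(s-1)
        self.prev_segment = 0             # last value seen on the ROM address bus
        self.acc = 0                      # running sum of m_qj for j < current segment

    def clock(self, segment: int) -> int:
        """One clock cycle; `segment` is the ROM address (segment selector)."""
        enable = segment != self.prev_segment  # comparator output En
        if enable:
            if segment == self.prev_segment + 1:
                # Rising phase: add the slope of the segment just left.
                self.acc += self.slope_rom[self.prev_segment]
            elif segment == self.prev_segment - 1:
                # Falling phase: subtract the slope of the segment entered.
                self.acc -= self.slope_rom[segment]
            else:
                raise ValueError("model assumes at most one segment step per clock")
            self.prev_segment = segment
        return self.acc
```

Between transitions the comparator output stays low, so the accumulator value is held constant for the whole segment, as required.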
The phase boundary value (π/2) is quantized to (L − 2) bits; thus, the segment bound is B = L − 2 − log2 s bits. In this case, the output of the accumulator must be shifted left by B bits before the resulting coefficient is added to the multiplier output; as a result, the adder has a word length of (D + B) bits. Hardwired shifting does not involve any digital gate. Finally, the output of the adder has to be truncated to a word length of P = L − 1 to accommodate the required DAC resolution. The ROM size required for this architecture is $2^A \times N$ bits, where A = log2 s is the ROM address width. Compared with the conventional counterpart architecture, which has an additional initial coefficient ROM of $2^A \times (L - 1)$ bits, the proposed algorithm saves $2^A \times (L - 1)$ bits of memory at the cost of the additional D-bit accumulator.
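The word-length bookkeeping of this section can be collected in a few lines of Python. The function and key names are hypothetical, and the accumulator entry is only the simple (N + log2 s) estimate, not the exact word length.

```python
def psac_word_lengths(L: int, s: int, N: int) -> dict:
    """Derive the word lengths and ROM sizes from L, s, and N."""
    A = s.bit_length() - 1  # ROM address width, log2(s)
    if 2 ** A != s:
        raise ValueError("s must be a power of two")
    return {
        "A": A,
        "B": L - 2 - A,                          # segment-bound (shift) bits
        "P": L - 1,                              # sine output word length
        "slope_rom_bits": 2 ** A * N,            # ROM kept by the architecture
        "initial_rom_bits_saved": 2 ** A * (L - 1),
        "accumulator_bits_estimate": N + A,      # simple (N + log2 s) estimate
    }
```

For example, L = 15, s = 32, and N = 6 give A = 5, B = 8, P = 14, a 192-bit slope ROM, and 448 bits of initial-coefficient ROM saved.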

V. SAMPLE DESIGNS AND PERFORMANCE
Following the proposed architecture shown in Fig. 3, we consider a design sample and show the best possible computational cost in this section. We assume that the first quadrant of the sine function is divided into 32 segments (s = 32). The same design procedure can be used for any other number of segments with similar results. In [9], [10] it is stated that, with uniform piecewise linear approximation, the maximum achievable SFDR for a given number of segments is
$$\mathrm{SFDR}_{max} = -20 \log_{10}\left(\frac{1}{16 s^{2}}\right) = 24.08 + 40 \log_{10}(s) \ \text{dBc}. \tag{15}$$
Thus, the targeted SFDR is 84.286 dBc. As a rule of thumb, with an L-bit phase resolution, the spurs introduced by phase truncation lie at about −6.02L dB; thus, the system parameters of this design are obtained as follows: L = 15 bits, P = L − 1 = 14 bits, A = log2(s) = 5, and B = L − 2 − A = 8. The width of the ROM, the size of the added accumulator, and the multiplier and adder input widths all depend on N, the slope coefficient word length.
Hence, to complete the design with minimum hardware overhead, we have to minimize the polynomial coefficient word length N, which is the key parameter that determines the performance of the PSM.
Once N is known, we can easily obtain the accumulator word length D by using (14), as well as the size of the multiplier and the adder input widths. To do so, the optimal polynomial coefficients have to be obtained first and then quantized on a given number of bits to achieve the targeted SFDR level.
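The design targets of this section can be reproduced with a short calculation. The sketch below evaluates the bound 20·log10(16·s²) = 24.08 + 40·log10(s) and then picks the smallest L whose −6.02L dB phase-truncation spur level lies below it. Selecting L this way is one reading of the rule of thumb and is stated here as an assumption; the function names are hypothetical.

```python
import math


def sfdr_upper_bound_dbc(s: int) -> float:
    """Maximum achievable SFDR for s uniform linear segments."""
    return 20 * math.log10(16 * s * s)


def min_phase_bits(target_sfdr_dbc: float, db_per_bit: float = 6.02) -> int:
    """Smallest L such that db_per_bit * L exceeds the SFDR target."""
    L = 1
    while db_per_bit * L <= target_sfdr_dbc:
        L += 1
    return L
```

For s = 32 the bound evaluates to about 84.29 dBc, and `min_phase_bits` returns 15, in line with the L = 15 bits used in the design sample.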

VI. OPTIMAL POLYNOMIAL COEFFICIENTS
To minimize the approximation error, the minimum-mean-square-error (MMSE) criterion is employed, that is, the coefficients are chosen to minimize the mean squared error between p(x) and the target sine over the first quadrant. With the aid of the Maple optimization package, the optimal set of slope coefficients $m_i$ (i = 0, ..., s − 1) is obtained and presented in Table I. Since p(x) has quadrant symmetry, its spectrum is free of even harmonics and p(x) can be expressed as a Fourier sine series over the odd harmonics. Figure 4 shows the spectrum of p(x), where the largest unwanted frequency component occurs at harmonic (4s − 1) and has an amplitude of −84.15 dB with respect to the target sinusoid. The same figure makes clear that the non-quantized optimal coefficients $m_i$ are consistent with the theoretical SFDR upper bound of (15). Figure 5 shows the residual error of the approximated sinusoidal wave. The maximum absolute error is equal to $1.9 \times 10^{-4}$ and occurs in the last linear segment (i = s).
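A residual-error curve of the kind shown in Fig. 5 can be explored with the Python sketch below. It does not use the optimal coefficients of Table I; as a stand-in, it assumes slopes obtained by joining the segment end points of sin(πx/2), so the resulting error is larger than the optimized value reported above. The function name is hypothetical.

```python
import math


def residual_error(s: int, points_per_segment: int = 64):
    """Return (max_abs_error, segment_index) for chord interpolation of sin(pi*x/2)."""
    worst, worst_seg = 0.0, 0
    for i in range(s):
        y0 = math.sin(math.pi * i / (2 * s))
        y1 = math.sin(math.pi * (i + 1) / (2 * s))
        for k in range(points_per_segment):
            t = k / points_per_segment  # position inside the segment, 0 <= t < 1
            x = (i + t) / s
            err = abs(math.sin(math.pi * x / 2) - (y0 + (y1 - y0) * t))
            if err > worst:
                worst, worst_seg = err, i
    return worst, worst_seg
```

For s = 32 this stand-in gives a maximum error of about 3.0 × 10⁻⁴, located in the last segment, which is where the optimized design also shows its largest error.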

VII. QUANTIZATION OF POLYNOMIAL COEFFICIENTS
To obtain an efficient hardware realization, the optimal real-valued coefficients (detailed in Table I) have to be quantized with sufficient finite precision. The rounded quantized coefficient $m_{qi}$ is obtained as
$$m_{qi} = \left\lfloor 2^{N} m_i + 0.5 \right\rfloor,$$
where $\lfloor \cdot \rfloor$ denotes the floor function, N is the coefficient word length, and the term 0.5 ensures that halfway values of $2^{N} m_i$ are rounded up. Shortening the coefficient word to minimize the LUT size and simplify the arithmetic circuitry is highly desirable; however, the poor accuracy caused by excessive quantization may decrease the SFDR level. Thus, the design has to balance circuit complexity against quantization accuracy. To achieve this balance, the SFDR level is checked for each quantization accuracy, starting from the lowest. The quantization process starts with N = 4 bits, and the resulting spurious level is checked against the targeted SFDR level. According to Fig. 6, the SFDR with a 4-bit coefficient word length is about 73.2 dBc, which is far below the theoretical SFDR upper bound. Thus, we have to quantize the coefficients with 5 bits or more. With 6-bit quantization, the design reaches an SFDR of 82.8 dBc, which is just 1.2 dB below the maximum achievable SFDR. The 7-bit and 8-bit trials do not exhibit considerable SFDR improvement. Each additional bit increases the LUT by s bits and extends the accumulator and multiplier sizes by 1 bit. Consequently, a 6-bit quantization resolution has been adopted as a compromise, with an acceptable spur level of 82.8 dBc. The quantized slope coefficients are shown in Table II. Once again, the spectrum of the quantized $p_q(x)$ needs to be determined to show the effect of the quantization process. Figure 7 shows the resulting spectrum, where the largest unwanted frequency component has an amplitude of −82.8 dBc and occurs at harmonic (4s + 1).
Figure 8 shows the residual error of the approximated sinusoidal wave with a 6-bit coefficient word length. Unlike the non-quantized errors shown in Fig. 5, the quantized errors appear randomly distributed over the segment lines because of the nonlinear truncation and rounding processes.
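The round-half-up quantization of the slope coefficients, floor(2^N · m + 0.5), can be sketched in Python as follows. The function names are hypothetical, and the slope values are assumed to be scaled so that the resulting code fits in N bits.

```python
import math


def quantize_slope(m: float, n_bits: int) -> int:
    """Rounded N-bit integer code: floor(2**N * m + 0.5)."""
    return math.floor((2 ** n_bits) * m + 0.5)


def quantization_error(m: float, n_bits: int) -> float:
    """Difference between the reconstructed value m_q / 2**N and the real-valued m."""
    return quantize_slope(m, n_bits) / 2 ** n_bits - m
```

A halfway value such as m = 2.5/16 with N = 4 is rounded up to the code 3, and the reconstruction error never exceeds half an LSB, that is, 2^−(N+1).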

VIII. STRUCTURAL DESIGN IMPROVEMENTS
The architecture displayed in Fig. 3 has to be improved to achieve well-organized hardware. First, for the multiply-add circuitry shown in Fig. 9(a), we suggest two scenarios, shown in Fig. 9(b) and Fig. 9(c). In the first proposed scheme, the output of the digital integrator after hardwired shifting has to be added to the $m_i \cdot x$ product. The first term is 18 bits wide and its 8 rightmost bits are zeros. Thus, the 8 least significant bits (LSBs) of the $m_i \cdot x$ product are simply concatenated to the resulting output, and the addition is written with the concatenation notation of the VHDL hardware description language. The adder output is D + B = 18 bits wide, which does not match the 15-bit sine output resolution. Thus, the adder output word length has to be truncated by 5 bits. In effect, the 5 LSBs of the $m_i \cdot x$ product are discarded anyway in the final stage, so truncating the product at an early stage is preferable. In doing so, the multiplier output becomes 9 bits wide, which leads to a noticeable saving in logic gates. Following this procedure, the proposed scheme requires a (6 × 8)-bit multiplier with 9 bits of output, 6 full adders (F.A), and 5 half adders (H.A). No rounding process is applied.
Clearly, in the first scenario, the truncation of the 5 LSBs of the $m_i \cdot x$ product introduces an arithmetic error. To reduce this error, a rounding technique must be applied. Rounding is usually realized by adding a constant equal to $LSB_{OUT}/2 = 2^{4}$ and then truncating the result. Thus, we have to add the bit $m_i \cdot x$[4]. The implementation of such a scheme is displayed in Fig. 9(c). This scheme requires 3 additional H.A as a penalty for the rounding process, and only 4 LSBs of the multiplier output are truncated. The exact realized circuit is shown in Fig. 9(d).
The proposed architecture shown in Fig. 3 still requires improvement. The second modification allows the two's complementer at the input of the accumulator to be replaced with a simpler one's complementer. The second MSB (MSB2) of the input phase feeds the carry-in of the first adder. Therefore, the +1 adder that is otherwise essential to perform the two's complement can be saved. Furthermore, we have to extend the output of the ROM by (D − N) MSBs to match the accumulator inputs; otherwise, performing the negative accumulation is impossible.
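Two of the arithmetic shortcuts used in this section can be checked exhaustively with integer arithmetic. The Python sketch below is an illustration with hypothetical function names: the first pair shows that adding half an output LSB (2^4) before dropping 5 LSBs equals truncating and then adding the most significant discarded bit, and the last function shows that a one's complement combined with a carry-in of 1 yields the two's complement.

```python
def round_by_constant(product: int, drop: int = 5) -> int:
    """Add half an output LSB, 2**(drop-1), then truncate `drop` LSBs."""
    return (product + (1 << (drop - 1))) >> drop


def round_by_bit(product: int, drop: int = 5) -> int:
    """Truncate `drop` LSBs, then add the most significant discarded bit."""
    return (product >> drop) + ((product >> (drop - 1)) & 1)


def negate_with_carry_in(value: int, width: int) -> int:
    """Two's-complement negation built from a one's complement plus carry-in 1."""
    mask = (1 << width) - 1
    return ((~value & mask) + 1) & mask
```

Both identities hold for every 14-bit product and every 10-bit accumulator input, which is why the rounding costs only a few half adders and the separate +1 adder can be dropped.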

IX. PERFORMANCE COMPARISON
To validate the proposed algorithm, we coded the design sample and the traditional piecewise linear interpolation DDFS architecture in VHDL by using Altera Quartus II 11.0 software with the previously mentioned parameters. The completed designs were implemented on a Stratix III FPGA (EP3SE50F484C2 device).
An architecture having 32 piecewise linear segments should have a worst-case spur of −84 dBc, which is achieved as well. Figure 10 shows the output spectrum for an output frequency of 0.124 times the clock frequency, with FIW set to 4065. The characteristics of the proposed work, along with the standard uniform linear-interpolated DDFSs, are summarized in Table III and are compared with previously published algorithms. Note that the compression ratio has been calculated with respect to a ROM size of $2^{L-2} \times P$ bits, where L is the phase resolution and P = L − 1 is the sine output resolution. A great advantage of this technique is that it replaces the s × (L − 1)-bit ROM required by the standard architecture with an (N + log2 s)-bit accumulator. For example, the ROM required by the traditional approach for s = 64 is (64 × 15) bits, which can be replaced by only a 12-bit accumulator. The accumulator is just 1 bit wider than in the s = 32 architecture, while the compression ratio is now 758:1. Compared with the algorithms in [9], [4], and [5], the proposed algorithm exhibits the highest compression ratio with low hardware overhead.
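The quoted compression ratio for the 32-segment design can be reproduced from the definitions above. The following Python sketch uses a hypothetical function name and assumes that the only remaining memory is the s × N-bit slope ROM.

```python
def rom_compression_ratio(L: int, s: int, N: int) -> float:
    """Ratio of a quarter-wave LUT of 2**(L-2) x (L-1) bits to the s x N-bit slope ROM."""
    reference_bits = 2 ** (L - 2) * (L - 1)
    slope_rom_bits = s * N
    return reference_bits / slope_rom_bits
```

With L = 15, s = 32, and N = 6, the reference LUT holds 114688 bits and the slope ROM 192 bits, giving a ratio of about 597.3:1.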

X. CONCLUSIONS
In this paper, we have presented an improved phase-to-sinusoid amplitude conversion architecture based on linear interpolation. The initial coefficient ROM has been replaced by a simple digital integrator. A generalized single-ROM DDFS architecture utilizing this approach was presented, and a particular design with optimal polynomial coefficients for 32 linear segments was discussed. The multiply-add circuitry has been minimized, resulting in lower hardware implementation cost. The conventional and improved designs have been implemented on Altera's Stratix III FPGA (EP3SE50F484C2). It is shown that the proposed approach exhibits an excellent ROM compression ratio with reasonable hardware resources in comparison with previously presented DDFS designs.


Fig. 5. The residual error of the synthesized curve.

Fig. 10. Calculated output spectrum for Fout = 0.124 fclk.

TABLE I. OPTIMAL SLOPE COEFFICIENTS.

TABLE II. QUANTIZED SLOPE COEFFICIENTS.

TABLE III. COMPARISON WITH REPORTED WORK.