Demonstration of a carrier frequency offset estimator for 16-/32-QAM coherent receivers: a hardware perspective

Abstract: We propose and implement a hardware-efficient frequency offset estimator (FOE) optimized for 16- and 32-QAM coherent optical receivers, with low hardware cost and high estimation accuracy. The proposed FOE combines a wide-range coarse estimator and a narrow-range, highly accurate estimator in a feedforward architecture. We numerically and experimentally investigate the performance of the proposed estimator using a field-programmable gate array (FPGA) based real-time coherent receiver. Compared with other state-of-the-art estimators in the literature, the proposed method reduces hardware utilization by over 40% while maintaining the same level of estimation accuracy in terms of mean-squared error (MSE) and optical signal-to-noise ratio (OSNR) sensitivity. These results enable the development of next-generation DSP circuits capable of supporting high-capacity coherent optical communication links with advanced modulation formats.


Introduction
Global IP traffic continues its explosive growth thanks to the development of bandwidth-consuming services such as 4k/8k ultra-high definition (UHD) video streaming, 5G mobile access, and cloud computing [1]. This continuing demand for higher network capacity has driven rapid research and development of coherent optical communication technologies [2]. With the recent advances in high-speed electronics and signal processing techniques, coherent optical communication with digital signal processing (DSP) is considered the de facto solution for achieving higher spectral efficiency, improved receiver sensitivity, and better impairment tolerance in long-haul communication systems [2]. M-ary quadrature amplitude modulation (M-QAM) formats, such as 16-QAM and 32-QAM, are particularly favored because they significantly increase the spectral efficiency by packing more bits into one symbol and reduce the average cost per bit [2,3]. However, as the modulation order increases, the reduced Euclidean distance between symbols poses substantial difficulties for efficient DSP design and signal demodulation, resulting in a reduced operating range for higher-order QAM links [3]. The limited reach of these fiber links has impeded the extensive adoption and deployment of higher-order QAM technologies by network operators worldwide [3]. Thus, efficient DSP algorithms for higher-order QAM signals are highly desirable for the telecommunication industry.
Among the DSP blocks in a coherent receiver, the FOE is essential for estimating and compensating the wavelength mismatch between the carrier signal and the local oscillator (LO) [2]. Since a free-running laser may exhibit optical frequency drifts of up to 5 GHz over its lifetime, it is critical to keep the frequency offset between the carrier and the LO within the tolerance of the carrier phase estimation (CPE) algorithm to avoid performance degradation [4-12]. Frequency offset (FO) estimation for multilevel QAM signals is particularly challenging because their data modulation cannot be directly erased by the standard Mth-power method [2]. The classic way of dealing with this issue is to use only the consecutive pairs of quadrature phase-shift keying (QPSK)-like symbols for FO estimation of a 16-QAM signal, as proposed by Fatadin et al. [4]. The downside is that the performance of this estimator degrades significantly for higher-order modulation formats such as 32- and 64-QAM, as they inherently contain fewer consecutive QPSK-like symbols. To improve the estimation accuracy, the least-squares (LS) method [5] as well as various fast Fourier transform (FFT) based FOEs [6,7] have been proposed. These methods generally reduce the estimation MSE by more than 100× and achieve ~1.5 dB OSNR improvement over the classic FOE [4]. Other FOEs have exploited unscented Kalman filters [8,9], blind frequency offset search (BFS) [10], and pilot symbols [11]. However, the feedback loop inside the Kalman filter cannot be efficiently implemented in hardware without sacrificing the performance of the FOE.
So far, these state-of-the-art M-QAM FOEs in the literature have always been tested using numerical simulations or experiments with offline DSP and floating-point operations [4-9]. With a view toward a practical and deployable optical transceiver, it is important to implement and test the FOEs in a highly parallel application-specific integrated circuit (ASIC) or an FPGA with limited word width. In fact, when realized in hardware, many of the aforementioned FOEs would consume extensive hardware resources and chip area due to massive parallelism, which leads to a larger device footprint, increased power consumption, and higher cost per bit. Therefore, from an economic point of view, it is important to conduct feasibility studies of efficient DSP algorithms with FPGA prototyping rather than relying only on offline verification.
In this paper, we present a novel hardware-efficient FOE for 16- and 32-QAM coherent receivers by exploiting the similarity between the non-QPSK-like symbols of 16-/32-QAM and phase-shift keying (PSK) signals. The proposed FOE is implemented on a Xilinx XC7VX690T FPGA along with the other DSP blocks of a coherent receiver and tested through numerical simulations and back-to-back experiments. The results show that the proposed method achieves performance comparable to the LS-16 and the QPSK-selection assisted FFT (QSA-FFT) estimators [5,7] while reducing the required lookup tables (LUTs) and flip-flops (FFs) by over 40% for an FPGA implementation. To make fair comparisons, we only considered blind FOEs with feedforward architecture in this study.
The rest of this paper is organized as follows. Section II introduces the operation principle of the proposed method. Section III presents the simulation results. Section IV shows the experiment and implementation details. Section V discusses the complexity and hardware requirements. Section VI concludes this work.

Principle of operation
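We use the case of 32-QAM as an example to illustrate the operation principle of the proposed FOE. After the adaptive equalization, the kth received 32-QAM sample can be expressed as:
$$X_k = A_k \exp\left[ j\left(\theta_k + 2\pi \Delta f\, k T_s + \varphi_k\right) \right] + n_k, \qquad (1)$$
where $A_k$ and $\theta_k$ are the amplitude and phase of the modulated data, $\Delta f$ is the frequency offset, $T_s$ is the symbol period, $\varphi_k$ is the laser phase noise, and $n_k$ is the ASE noise. Since the symbols of each amplitude class lie close to the points of a PSK constellation, we can decompose the modulated data phase as
$$\theta_k = \theta_{\mathrm{PSK},k} + \varepsilon_k, \qquad (2)$$
where $\theta_{\mathrm{PSK},k}$ is the phase of the nearest PSK point and $\varepsilon_k$ is a small phase offset. By rotating Class III symbols by π/16, normalizing all the symbols to unit amplitude to obtain $X'_k$, and then raising $X'_k$ to the 16th power, we obtain:
$$\left(X'_k\right)^{16} \approx \exp\left[ j\,16\left(2\pi \Delta f\, k T_s + \varphi_k + \varphi_{\mathrm{ASE},k} + \varepsilon_k\right) \right], \qquad (3)$$
where $\varphi_{\mathrm{ASE},k}$ denotes the ASE-induced phase noise. In (3), the modulated data phases of all symbols are effectively erased. In the next step, by using the conventional time-domain differential phase method [4], the frequency offset over L samples can be calculated as:
$$\Delta f_{1,k} = \frac{1}{32\pi T_s} \arg\left( \sum_{i=k-L+1}^{k} \left(X'_i\right)^{16} \left[\left(X'_{i-1}\right)^{16}\right]^{*} \right). \qquad (4)$$
Note that the differential phase offset $(\varepsilon_i - \varepsilon_{i-1})$ obeys a zero-mean multinomial distribution. Since the phase offsets in (2) are small, the variance of this differential phase offset is also relatively small. Therefore, the block summation in (4) can effectively suppress the influence of the laser phase noise, the ASE-induced phase noise, and the approximation error on the estimate. The key idea of this method is to utilize all the symbols of a higher-order QAM signal to obtain a more accurate estimate by purposely introducing an additional zero-mean, low-variance random variable $\varepsilon_k$.
Due to the 16th power operation in (3), the theoretical estimation range of the FOE is limited to $[-1/(32T_s), 1/(32T_s)]$. Considering a 10 Gbaud 32-QAM system, when the actual frequency offset is 400 MHz, the FOE will yield an estimate of −225 MHz, which is $1/(16T_s)$ (625 MHz) away from the correct value due to the phase wrapping of the arg(·) operation in (4). To extend the operating range, we adopt a wide-range estimator to shift the folded estimate back to its actual value, since the difference between the actual frequency offset and the estimated result is always an integer multiple of $1/(16T_s)$. Building upon this idea, we utilize the feedforward parallel architecture shown in Fig. 2.
Similarly, for a 16-QAM signal, the 8th power operation should be performed in (3) and the final estimate can be obtained from $\Delta f_{1,k} + n/(8T_s)$, where $n \in [-1, 1]$. Only three trial offsets are required to cover the full estimation range, since the theoretical estimation range of the fine FOE is $[-1/(16T_s), 1/(16T_s)]$ with the 8th power operation for 16-QAM.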
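As a rough numerical illustration of the fine estimator and its folding behavior, the following sketch generates noiseless 32-QAM symbols with a fixed frequency offset, pre-rotates the Class III symbols by π/16, raises the normalized symbols to the 16th power, and evaluates a differential-phase estimate of the form of (4). The cross-constellation levels (±1, ±3, ±5), the magnitude threshold, and all function names are assumptions made for illustration only; laser phase noise and ASE noise are omitted.

```python
import cmath
import math
import random

BAUD = 10e9        # symbol rate from the text (10 Gbaud)
TS = 1.0 / BAUD    # symbol period T_s

# Assumed cross 32-QAM grid: levels +-1, +-3, +-5 without the four corners.
LEVELS = (-5, -3, -1, 1, 3, 5)
QAM32 = [complex(i, q) for i in LEVELS for q in LEVELS
         if not (abs(i) == 5 and abs(q) == 5)]


def make_symbols(delta_f, count, seed=0):
    """Noiseless random 32-QAM symbols rotated by a constant frequency offset."""
    rng = random.Random(seed)
    return [rng.choice(QAM32) * cmath.exp(2j * math.pi * delta_f * k * TS)
            for k in range(count)]


def prerotate_and_normalize(x):
    """Rotate the outer (Class III) rings by pi/16 and drop the amplitude."""
    # |x|^2 is 26 or 34 for Class III and at most 18 otherwise (assumed threshold).
    rotation = math.pi / 16 if abs(x) ** 2 > 22.0 else 0.0
    return cmath.exp(1j * (cmath.phase(x) + rotation))


def fine_foe(symbols, power=16):
    """Differential-phase frequency estimate after the power-law operation."""
    acc = 0j
    prev = None
    for x in symbols:
        cur = prerotate_and_normalize(x) ** power
        if prev is not None:
            acc += cur * prev.conjugate()   # differential product, summed over the block
        prev = cur
    return cmath.phase(acc) / (2 * math.pi * power * TS)


if __name__ == "__main__":
    for offset in (100e6, 400e6):
        est = fine_foe(make_symbols(offset, 1025))
        print(f"applied {offset / 1e6:.0f} MHz, fine estimate {est / 1e6:.1f} MHz")
```

With the 100 MHz offset the estimate stays inside the unambiguous range, whereas the 400 MHz offset folds to roughly −225 MHz, in line with the example in the text.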

Simulation results
We simulated 10 Gbaud 16-QAM and 32-QAM coherent communication systems to investigate the performance of the proposed FOE. The performance of the proposed algorithm is compared with the classic partitioning estimator [4] as well as some recently published blind FOEs, including the LS-4/16 and QSA-FFT methods [5,7]. We selected the LS-4/16 and QSA-FFT estimators for comparison because they represent the most recently published time-domain and frequency-domain FOEs with feedforward architecture [5,7]. The classic partition-based estimator is also included in the benchmark as the baseline [4]. The laser linewidth for all simulations is set to 100 kHz. The total number of symbols is 500,000 for each run. To focus on the limiting factors of the FOE, other impairments such as timing offset and linear distortions are excluded from the simulations. The OSNR for all simulations is set to 20 dB for 16-QAM and 25 dB for 32-QAM. We first swept the frequency offset from −1.5 GHz to 1.5 GHz and applied the abovementioned FOEs to obtain the results.
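The text does not spell out how the simulated impairments are generated. A minimal sketch of one common approach is given below, assuming a Wiener (random-walk) laser phase noise with per-symbol variance 2π·linewidth·T_s, a constant frequency offset, and complex additive white Gaussian noise set by an SNR per symbol. The conversion from OSNR to SNR is deliberately left out, and the function names and the SNR parameter are hypothetical.

```python
import cmath
import math
import random


def phase_noise_variance(linewidth_hz, baud=10e9):
    """Per-symbol variance of the Wiener phase-noise increment, 2*pi*linewidth*T_s."""
    return 2 * math.pi * linewidth_hz / baud


def impair(symbols, delta_f, linewidth_hz=100e3, snr_db=None, baud=10e9, seed=0):
    """Apply frequency offset, Wiener phase noise and optional complex AWGN."""
    rng = random.Random(seed)
    sigma_pn = math.sqrt(phase_noise_variance(linewidth_hz, baud))
    mean_power = sum(abs(s) ** 2 for s in symbols) / len(symbols)
    # Noise standard deviation per real dimension for the requested SNR per symbol
    # (SNR per symbol is an assumed metric; the paper specifies OSNR instead).
    if snr_db is None:
        sigma_n = 0.0
    else:
        sigma_n = math.sqrt(mean_power / 10 ** (snr_db / 10) / 2)
    phase = 0.0
    out = []
    for k, s in enumerate(symbols):
        phase += rng.gauss(0.0, sigma_pn)   # random-walk laser phase
        rotation = cmath.exp(1j * (2 * math.pi * delta_f * k / baud + phase))
        noise = complex(rng.gauss(0.0, sigma_n), rng.gauss(0.0, sigma_n))
        out.append(s * rotation + noise)
    return out


if __name__ == "__main__":
    # 100 kHz linewidth at 10 Gbaud, as in the simulation setup.
    print(phase_noise_variance(100e3))
    print(impair([1 + 1j, 3 - 1j, -5 + 3j], delta_f=400e6, snr_db=25.0))
```

Whether the 100 kHz linewidth refers to each laser or to the combined linewidth is not stated in the text; the sketch treats it as the combined value.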
Figure 3 shows the normalized mean-square error (MSE), defined as $E[|(\Delta f_{\mathrm{est}} - \Delta f)T_s|^2]$, versus the applied frequency offset $\Delta f$ for 16-QAM and 32-QAM. We set the number of symbols per block L in (4) to 256 for 16-QAM and 1024 for 32-QAM. For 16-QAM, the proposed estimator achieves the highest accuracy along with the LS-16 FOE (~10^−8), whereas the result of the QSA-FFT algorithm oscillates due to the limited spectral resolution. For 32-QAM, a slight performance degradation is observed for the proposed FOE (~10^−8) against the QSA-FFT method and the LS-16 FOE (~5 × 10^−9) because more approximation errors have been introduced. Note that such small variations in estimation MSE correspond to a negligible OSNR penalty [5-7].
Figure 4 illustrates the impact of the number of symbols L per block on the estimation MSE for 16- and 32-QAM. The frequency offset is randomly chosen from [−1.2 GHz, 1.2 GHz] according to Fig. 3 for this simulation. As L increases, all FOEs become more accurate because the impact of the ASE noise and laser phase noise is better suppressed. The FFT method shows higher ASE and laser phase noise tolerance than the time-domain FOEs when L becomes relatively long. Therefore, the proposed FOE is favorable where a small value of L is required (e.g., in a coherent burst-mode receiver that must compensate a rapidly changing frequency offset). In practice, there is a tradeoff between the accuracy of the estimate and the hardware complexity depending on the block length L.
In Fig. 5, we show the normalized MSE of the FOEs versus the OSNR for 16- and 32-QAM, with frequency offsets randomly chosen from [−1.2 GHz, 1.2 GHz]. The FFT method shows stronger noise tolerance than its time-domain counterparts. Above a certain OSNR threshold (13 dB for 16-QAM, 17 dB for 32-QAM), its accuracy is nearly independent of the OSNR. The proposed FOE, along with the LS-16 method, ensures high estimation accuracy for typical OSNR values in a 16-/32-QAM transmission system (typically > 20 dB).

Experiment and implementation
To further study the effectiveness of the proposed method, we conducted a proof-of-concept, back-to-back transmission experiment as shown in Fig. 6(a). Due to the limited effective number of bits (ENOB, ~4.5 bits) of the electrical arbitrary waveform generator (EAWG) and analog-to-digital converters (ADCs), only a 10 Gbaud 16-QAM transmission experiment was conducted. A transmitter laser with 100 kHz linewidth is modulated with the 10 Gbaud 16-QAM signal via a LiNbO3 IQ modulator. The 16-QAM signal is generated from two pseudo-random binary sequences of length 2^15 − 1 at 12 GS/s using the EAWG. A variable optical attenuator along with an erbium-doped fiber amplifier (EDFA) is used to adjust the OSNR level. Two percent of the optical power is tapped to an optical spectrum analyzer for OSNR monitoring. After beating with a 30-kHz linewidth external cavity laser (ECL) that is offset by 625 MHz from the transmitter laser, the signal is coherently detected, sampled at 50 GS/s, and fed into the memory of the FPGA chip with 8-bit word width. After the signal is resampled to 20 GS/s, the FPGA feeds the samples into its DSP chain (×2 oversampling) shown in Fig. 6(b), which is implemented in Verilog with the Xilinx Vivado design tool. Since the FPGA chip operates at 156.25 MHz, massive parallelism (64/128 lanes) is adopted for the DSP chain to meet the throughput requirement. First, front-end equalization is performed to correct the IQ gain mismatch and timing skew. A 9-tap, T/2 fractionally spaced equalizer (FSE) adapted by the radius-directed equalization (RDE) algorithm is used to compensate the timing error as well as the channel distortion. After that, the samples are sent to the FOE blocks for FO estimation and compensation with 256 symbols per block, followed by the blind phase search (BPS) based CPE with 8 trial angles and an averaging length of 7 [13]. The processed samples are then sent to an external workstation via the universal asynchronous receiver and transmitter (UART) interface for symbol decoding and error counting. Differential encoding/decoding and hard-decision thresholds are used to map the constellation points to bit sequences and to count the bit error rate (BER) from 250,000 processed symbols.
Figure 7 depicts the schematic diagram for the hardware implementation of the proposed FOE. Each input sample has 8 bits for each of its real and imaginary components, 6 of which are fractional bits. As the complex samples come into the proposed FOE block, they are first split and forwarded to the coarse and fine estimators, as well as an amplitude detection unit. The outputs of the amplitude detection unit, $d_k$, are then used to select the QPSK-like symbols for the coarse estimator and to perform the pre-rotation for the fine estimator. The fine estimator is designed to operate in polar coordinates since it requires all symbols to be normalized to the same amplitude, as shown in (3). Moreover, by processing the symbols in polar coordinates, one can replace the resource-consuming complex multipliers required in (4) with adders and subtractors. A coordinate rotation digital computer (CORDIC) softcore is used to perform the coordinate transformation [14]. A LUT takes the estimation results from the two estimators and performs the unfolding operation according to Fig. 2(b). We have moved the averaging operation in (4) to the output of the multiplexer to simplify the hardware design. It should be noted that the magnitude detection unit in the proposed FOE can be reused by the adaptive equalizer for the RDE updating function. Therefore, the equivalent hardware complexity of the proposed FOE is significantly reduced.
We also implemented the other FOEs with a filter length of 256, synthesized in separate runs, on separate FPGAs along with the rest of the coherent receiver. We download the received signal into the memory of the FPGAs and let the coherent receivers demodulate the signal in real time. Figure 8(a) shows the measured BER as a function of OSNR (measured with 0.1 nm resolution bandwidth) for the various FOEs. In agreement with the simulation results, the demodulated signal processed by the classic partitioning-based FOE shows the worst performance due to its lack of estimation accuracy. Compared with the classic FOE, the proposed FOE achieves over 1.5 dB OSNR improvement at a BER of 10^−4. The insets in Fig. 8(a) depict the demodulated constellations using the proposed FOE at 20 dB and 26 dB OSNR. We did not measure a significant performance difference between the proposed method and the LS-16/QSA-FFT FOEs, since the OSNR penalty between them is less than 0.2 dB. We attribute the error floor and the deviation from the theoretical result to the quantization noise, the scaling effect of the accumulated multiplications inside the FPGA, and the limited ENOB of the system. The floorplan of the DSP chain on the FPGA is shown in Fig. 8(b). The overall utilization of the coherent receiver is ~65% when using the proposed FOE.
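The 64/128 lane figures quoted for the DSP chain are consistent with dividing the sample rate by the 156.25 MHz FPGA clock. The short calculation below makes this explicit; associating 128 lanes with the ×2-oversampled stages and 64 lanes with the symbol-rate stages is an interpretation rather than something stated in the text, and the function name is hypothetical.

```python
def lanes_required(sample_rate_hz, clock_hz=156.25e6):
    """Number of samples that must be processed in parallel per FPGA clock cycle."""
    lanes, remainder = divmod(sample_rate_hz, clock_hz)
    # Round up if the sample rate is not an integer multiple of the clock.
    return int(lanes) + (1 if remainder else 0)


if __name__ == "__main__":
    print("20 GS/s (x2 oversampling):", lanes_required(20e9))
    print("10 GS/s (one sample per symbol):", lanes_required(10e9))
```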

Implementation complexity analysis
Table 1 lists the resource utilization of the abovementioned FOEs after synthesis and implementation. To fulfill the stringent throughput requirement, a direct implementation (Cooley-Tukey) of a fully pipelined radix-2 FFT is selected for the QSA-FFT FOE. Among the FOEs in this study, LS-16 shows the highest degree of hardware complexity and resource utilization because of its sixteen-fold iteration. In fact, the design tool failed to finish the placement and routing process because the LUTs and registers required by the LS-16 method exceeded the maximum available resources of the FPGA chip. Therefore, for the LS-16 FOE, we used post-synthesis simulation to process the experimental data and generate the result in Fig. 8. While the iterative nature of the LS-4/16 method grants it excellent estimation accuracy, the large amount of required hardware resources clearly makes it less preferable for practical implementations.
To obtain reasonable accuracy, the FFT-based method needs a relatively large number of symbols per block to maintain the spectral resolution. Assuming direct implementation without resource sharing, the computation and interconnection complexity of the FFT-based method scales as $N\log_2 N$, which means that the FFT-based method requires at least $2N\log_2 N$ multipliers and $2N\log_2 N$ adders [9]. In addition, the extensive interconnection complexity of the FFT-based method may impose stringent timing requirements on the very-large-scale integration (VLSI) and ASIC design process as N increases. On the other hand, since the proposed estimator falls into the category of time-domain FOEs, its computational complexity grows linearly. From Fig. 7, it can be seen that the proposed FOE is multiplier-free on top of the resource-saving classic estimator because the fine estimator operates purely in polar coordinates, which means that the averaging operation in Fig. 7 can be performed with a barrel shifter instead of multipliers or dividers. That is to say, the proposed algorithm achieves very low hardware complexity. Compared with the QSA-FFT method, the proposed FOE requires ~50% fewer LUTs and ~40% fewer slice registers while maintaining a satisfactory estimation accuracy.
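To illustrate why operating in polar coordinates removes the multipliers, the sketch below models the fine estimator on integer phase words: the 16th power becomes a 4-bit left shift with wraparound, the differential operation becomes a modular subtraction, and the block average becomes an arithmetic right shift. This is only a plausible datapath under stated assumptions, not the authors' RTL: the 16-bit phase word, the power-of-two block length, and the choice to average wrapped phase differences (rather than take the argument of a complex sum as in (4)) are all assumptions, and the usage example feeds an unmodulated phase ramp so that data-dependent phase offsets are excluded.

```python
import math

PHASE_BITS = 16            # assumed phase word width
MOD = 1 << PHASE_BITS      # one full turn (2*pi) in phase-word units


def to_phase_word(angle_rad):
    """Quantize an angle to an unsigned PHASE_BITS-bit phase word."""
    return round(angle_rad / (2 * math.pi) * MOD) % MOD


def to_signed(word):
    """Interpret a phase word as two's complement, i.e. an angle in [-pi, pi)."""
    return word - MOD if word >= MOD // 2 else word


def fine_foe_polar(phase_words, log2_power=4):
    """Average wrapped differential phase using only shifts, masks, adds, subtracts.

    len(phase_words) - 1 must be a power of two so the average is a right shift.
    """
    pairs = len(phase_words) - 1
    shift = pairs.bit_length() - 1
    assert pairs == 1 << shift, "number of differential pairs must be a power of two"
    acc = 0
    prev = (phase_words[0] << log2_power) & (MOD - 1)   # 16th power = phase * 16
    for word in phase_words[1:]:
        cur = (word << log2_power) & (MOD - 1)
        acc += to_signed((cur - prev) & (MOD - 1))       # wrapped phase difference
        prev = cur
    return acc >> shift                                  # block average via shift


def phase_word_to_hz(avg_word, baud=10e9, power=16):
    """Scale the averaged word to Hz (offline conversion, for display only)."""
    return avg_word / MOD * baud / power


if __name__ == "__main__":
    ramp = [to_phase_word(2 * math.pi * 100e6 * k / 10e9) for k in range(257)]
    print(phase_word_to_hz(fine_foe_polar(ramp)) / 1e6, "MHz")
```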

Conclusion
In this paper, we presented a high-accuracy and hardware-efficient FOE for 16- and 32-QAM signals. We investigated the performance of the proposed FOE under different filter lengths and OSNR values. Simulation and real-time experiment results show that the proposed FOE achieves high accuracy (< 0.2 dB OSNR penalty) when compared to the LS-16 and the QSA-FFT methods. Due to its efficient hardware architecture, the proposed solution requires substantially fewer hardware resources, with over 40% reduction in the number of required LUTs and slice registers against the other FOEs. Our FPGA-based prototyping demonstrates the feasibility of the proposed FOE for practical coherent optical transceiver implementations.

Fig. 1. Constellation partitioning and PSK approximation for 32-QAM. Class III symbols are rotated by π/16 and all symbols are normalized to the same amplitude before the 16th power operation.
The feedforward parallel architecture is shown in Fig. 2(a). A fine FOE using the quasi-PSK approximation algorithm calculates an accurate estimate $\Delta f_{1,k}$, yet the absolute location of the final estimate remains unknown. Five trial offsets are then applied to $\Delta f_{1,k}$, giving the candidates $\Delta f_{1,k} + n/(16T_s)$, where $n \in [-2, 2]$. The indexing factor $S_k$ determines the optimum estimate $\Delta f_{\mathrm{est}}$. The location information of the frequency offset can be obtained by using a wide-range, coarse FOE (with output $\Delta f_{2,k}$) which implements the classic QPSK partitioning algorithm [4]. Since it only serves as a location reference for the frequency offset, the coarse FOE can use a short filter length to save hardware resources. In this study, the filter length of the coarse FOE is set to 128. The indexing factor $S_k$ that selects the output of the multiplexer can be calculated through Algorithm 1 shown in Fig. 2(b). In practice, a lookup table (LUT) can be implemented to generate $S_k$ based on $\Delta f_{1,k}$ and $\Delta f_{2,k}$ to minimize the computational effort.
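Algorithm 1 of Fig. 2(b) is not reproduced in the text, so the following is only a guess at its effect: among the trial offsets, pick the candidate closest to the coarse estimate. The nearest-candidate rule, the function name, and the numerical values in the usage example are assumptions for illustration.

```python
def unfold(f_fine, f_coarse, baud=10e9, power=16, n_max=2):
    """Return (unfolded estimate, n) for the candidate f_fine + n * baud / power
    that lies closest to the coarse estimate."""
    step = baud / power   # ambiguity of the fine estimate: 1/(16 T_s) for 32-QAM
    best_n = min(range(-n_max, n_max + 1),
                 key=lambda n: abs(f_fine + n * step - f_coarse))
    return f_fine + best_n * step, best_n


if __name__ == "__main__":
    # 32-QAM: fine estimate folded to -225 MHz, coarse estimate near 380 MHz.
    print(unfold(-225e6, 380e6))
    # 16-QAM: 8th power, hence a 1/(8 T_s) step and three trial offsets.
    print(unfold(-250e6, 0.9e9, power=8, n_max=1))
```

Because the coarse estimate is only used to choose n, its error merely has to stay below half the step size, which is why a short filter length can suffice for the coarse FOE.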

Fig. 7. Schematic diagram of the hardware implementation of the proposed FOE.
Following the radius-directed approach of the classic method for 16-QAM [4], we classify the received symbols into three distinct groups based on their magnitude $A'_k$, as shown in Fig. 1. Class I symbols are the QPSK-like symbols located on the two diagonal lines of the complex plane. Class II and Class III symbols can be approximately seen as 8-PSK symbols rotated by π/8 and 16-PSK symbols rotated by π/16, respectively, with a small phase offset.
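The size of the phase offset in this PSK approximation can be checked numerically. The sketch below assumes an ideal cross 32-QAM constellation with levels ±1, ±3, ±5 and hypothetical squared-magnitude thresholds for the three classes; it reports, per class, the largest angular distance between a constellation point and the nearest point of the corresponding rotated PSK grid.

```python
import cmath
import math

# Assumed cross 32-QAM grid: levels +-1, +-3, +-5 without the four corners.
LEVELS = (-5, -3, -1, 1, 3, 5)
QAM32 = [complex(i, q) for i in LEVELS for q in LEVELS
         if not (abs(i) == 5 and abs(q) == 5)]


def classify(point):
    """Return (class name, PSK order M) from the squared magnitude."""
    r2 = abs(point) ** 2
    if r2 < 6 or 14 < r2 < 22:   # |x|^2 = 2 or 18: points on the diagonals
        return "I", 4
    if r2 < 14:                  # |x|^2 = 10
        return "II", 8
    return "III", 16             # |x|^2 = 26 or 34


def residual_offset(point):
    """Phase distance (rad) to the nearest point of an M-PSK grid rotated by pi/M."""
    _, m = classify(point)
    spacing = 2 * math.pi / m
    shifted = cmath.phase(point) - math.pi / m
    return (shifted + spacing / 2) % spacing - spacing / 2


def worst_offsets_deg():
    """Largest absolute residual phase offset per class, in degrees."""
    worst = {}
    for p in QAM32:
        name, _ = classify(p)
        offset = abs(math.degrees(residual_offset(p)))
        worst[name] = max(worst.get(name, 0.0), offset)
    return worst


if __name__ == "__main__":
    for name, offset in sorted(worst_offsets_deg().items()):
        print(f"Class {name}: max |offset| = {offset:.3f} deg")
```

Under these assumptions the offsets are symmetric about zero within each class, which is consistent with treating them as a zero-mean perturbation in the block summation.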