Low-complexity carrier phase recovery based on principal component analysis for square-QAM modulation formats

Abstract: We propose, numerically analyze and experimentally demonstrate a low-complexity, modulation-order independent, non-data-aided (NDA), feed-forward carrier phase recovery (CPR) algorithm. The proposed algorithm enables synchronous decoding of arbitrary square quadrature amplitude modulation (QAM) constellations and is suitable for a realistic hardware implementation based on block-wise parallel processing. The proposed method is based on principal component analysis (PCA) and outperforms the well-known and widely used blind phase search (BPS) algorithm at low signal-to-noise ratio (SNR) values, showing a much lower cycle slip rate (CSR) both numerically and experimentally. For operation at higher SNR values, a hybrid two-stage implementation combining the proposed method and BPS is also proposed, and the performance of both methods is benchmarked against the two-stage BPS (2S-BPS). The complexity of the proposed simple and hybrid methods is evaluated against 2S-BPS, and computational complexity savings of 92% and 40% are expected for the simple and hybrid methods, respectively.


Introduction
Laser phase noise is one of the limiting factors in coherent optical communications, especially for higher order modulation formats [1,2]. The traditional method for optical signal demodulation using a digital phase-locked loop (PLL) is highly sensitive to feedback delay, thus limiting the laser linewidth (LW) tolerance of PLL-based carrier phase recovery (CPR) methods [3]. Moreover, considering realistic hardware implementation of digital signal processing (DSP) at the receiver side, a block-wise operation needs to be contemplated, making PLL-based methods even more impractical [4].
Unsupervised feed-forward CPR methods generally extract the phase noise by either applying a nonlinear transformation such as the 4th power [1,5], using the symmetry of the constellation to fold it [6], or massively testing different phase values and comparing them to decided symbols [4,7].
In this letter, we present a hardware-efficient, modulation-order independent, non-data-aided (NDA) and feed-forward CPR method for synchronous decoding of arbitrary quadrature amplitude modulation (QAM) constellations. The proposed method, called principal component-based phase estimation (PCPE), is based on extraction of the principal component of the squared constellation and it is suitable for low signal-to-noise ratio (SNR) signal transmissions considering forward error correction (FEC). Simulations and experimental results are presented and the PCPE performance is compared with a two-stage implementation of the traditional blind phase search (BPS) algorithm.

Principal component based CPR algorithm
Principal component analysis (PCA) is a well-known algorithm that converts a set of observations of possibly correlated random variables into a set of linearly decorrelated variables, the principal components (PCs) [8]. It is usually applied as a pre-processing tool in pattern recognition applications for the extraction of the most critical data features, by projecting the original data into a new feature space in which the first component keeps the most information about the original data.
The PCs extracted directly from a QAM constellation are invariant to the constellation rotation. However, if the received constellation is squared, the angles of the PCs extracted from this modified constellation depend linearly on the original constellation rotation. Figures 1(a) and 1(b) show 64QAM constellations without and with a π/6 rad phase rotation, respectively. Figures 1(c) and 1(d) show the same 64QAM constellations, without and with the π/6 rad phase rotation, after the squaring process.
Fig. 1. 64QAM constellations: (a) before squaring and without phase rotation; (b) before squaring and with π/6 rad phase rotation; (c) after squaring and without phase rotation; (d) after squaring and with π/6 rad phase rotation. The PC is marked in red.
The angle of the first PC of the squared constellation is related to the rotation of the original constellation by the following relation: $\varphi_{PC} = 2\varphi + \pi/2$. Therefore, it is only necessary to extract the first PC of the squared constellation to obtain the phase estimation.
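As a quick numerical check of this relation, the following sketch (Python with NumPy, not part of the original work) squares an ideal, noise-free 64QAM constellation rotated by π/6 rad, extracts the first PC and inverts the relation. The helper names `square_qam`, `first_pc_angle` and `rotation_from_pc` are illustrative, and the eigendecomposition via `numpy.linalg.eigh` is only one possible way of extracting the PC. Since a PC is defined up to its sign, the recovered rotation is folded into [-π/4, π/4).

```python
import numpy as np

def square_qam(order):
    """Points of an ideal square QAM constellation (e.g. order=64)."""
    side = int(round(np.sqrt(order)))
    levels = np.arange(-(side - 1), side, 2)      # e.g. -7, -5, ..., 7
    i, q = np.meshgrid(levels, levels)
    return (i + 1j * q).ravel()

def first_pc_angle(points):
    """Angle of the first PC of the squared constellation."""
    z = points ** 2
    a = np.vstack([z.real, z.imag])               # 2 x N real matrix
    cov = a @ a.T / a.shape[1]                    # 2 x 2 covariance (zero-mean data)
    _, eigvecs = np.linalg.eigh(cov)              # eigenvalues in ascending order
    pc = eigvecs[:, -1]                           # eigenvector of the largest eigenvalue
    return np.arctan2(pc[1], pc[0])

def rotation_from_pc(phi_pc):
    """Invert phi_pc = 2*phi + pi/2 and fold the result into [-pi/4, pi/4)."""
    phi = (phi_pc - np.pi / 2) / 2
    return (phi + np.pi / 4) % (np.pi / 2) - np.pi / 4

phi_true = np.pi / 6
rotated = square_qam(64) * np.exp(1j * phi_true)
phi_est = rotation_from_pc(first_pc_angle(rotated))
```

In this idealized case `phi_est` matches `phi_true` to numerical precision; with noise and a finite number of symbols the estimate is only approximate.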
A very useful numerical method to find the first PC is the power iteration method (PIM) [9,10], an algorithm that iteratively converges to the eigenvector associated with the greatest eigenvalue of a diagonalizable matrix. It begins by randomly initializing a vector and then multiplies this vector by the covariance matrix of the inputs at each iteration. When the number of iterations is sufficiently large, the vector converges to the direction of the first PC.
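A minimal sketch of the power iteration described above is given below, assuming a symmetric covariance matrix. The function name `power_iteration`, the default of 50 iterations and the per-iteration normalization (used here only to keep the vector bounded) are illustrative choices rather than details taken from [9,10].

```python
import numpy as np

def power_iteration(matrix, num_iters=50, v0=None):
    """Approximate the eigenvector of the largest eigenvalue of `matrix`."""
    matrix = np.asarray(matrix, dtype=float)
    if v0 is None:
        # Random initialization, seeded here only for reproducibility
        v = np.random.default_rng(0).standard_normal(matrix.shape[0])
    else:
        v = np.asarray(v0, dtype=float)
    v = v / np.linalg.norm(v)
    for _ in range(num_iters):
        w = matrix @ v                 # multiply by the covariance matrix
        v = w / np.linalg.norm(w)      # renormalize to unit length
    return v

# Example: eigenvalues 3 and 1, dominant eigenvector along [1, 1]
cov = np.array([[2.0, 1.0], [1.0, 2.0]])
v = power_iteration(cov)
```

The convergence speed depends on the ratio between the two largest eigenvalues; in this example the unwanted component shrinks by a factor of 3 per iteration.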
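The PCPE algorithm is based on PIM in order to track the first PC of the squared constellation. Figure 2 shows a block diagram representation of PCPE. The algorithm starts by calculating a 2×N real matrix, $\mathbf{A}_k$, associated with the kth squared constellation input block, given by
$$\mathbf{A}_k = \begin{bmatrix} \Re\{\mathbf{x}_k^2\} \\ \Im\{\mathbf{x}_k^2\} \end{bmatrix},$$
where $\mathbf{x}_k$ is the kth data input block with N elements and $\Re\{\cdot\}$ and $\Im\{\cdot\}$ are the real and imaginary parts, respectively. Considering the input vector as zero-mean, the covariance matrix is a 2×2 matrix updated for each input vector and calculated as
$$\mathbf{C}_k = \mathbf{A}_k \mathbf{A}_k^{T}.$$
Although the PIM requires multiple iterations to converge to the first PC, one iteration per block is enough to track the phase error. The PCPE algorithm starts by considering the first PC as $\mathbf{v}_0 = [1 \;\; 0]^{T}$. Then, for each incoming data block, the PC is updated by
$$\mathbf{v}_k = \frac{\mathbf{C}_k \mathbf{v}_{k-1}}{\lVert \mathbf{C}_k \mathbf{v}_{k-1} \rVert}.$$
The estimated phase noise for each block follows from the relation between the PC angle and the constellation rotation and is given by
$$\hat{\varphi}_k = \frac{1}{2}\left(\angle \mathbf{v}_k - \frac{\pi}{2}\right),$$
where $\angle \mathbf{v}_k$ is the angle of $\mathbf{v}_k$. Finally, in order to eliminate phase ambiguities, a phase unwrapping process is necessary and is performed at the end [3]. The estimated phase to be compensated is then updated to the unwrapped value of $\hat{\varphi}_k$.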
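The block-wise procedure described in this section can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `pcpe`, the normalization of the PC after each update, and the unwrapping with a π/2 period (done here by applying `numpy.unwrap` to four times the phase) are choices made for this example. The estimate carries the usual ambiguity of multiples of π/2 for square QAM.

```python
import numpy as np

def pcpe(symbols, block_size):
    """Block-wise phase estimates from one power iteration per block."""
    symbols = np.asarray(symbols, dtype=complex)
    num_blocks = len(symbols) // block_size
    v = np.array([1.0, 0.0])                      # initial PC guess [1, 0]^T
    phases = np.empty(num_blocks)
    for k in range(num_blocks):
        block = symbols[k * block_size:(k + 1) * block_size]
        z = block ** 2                            # squared constellation
        a = np.vstack([z.real, z.imag])           # 2 x N real matrix
        c = a @ a.T                               # 2 x 2 covariance (zero mean assumed)
        w = c @ v                                 # single power iteration
        v = w / np.linalg.norm(w)                 # normalization: an assumption
        phi_pc = np.arctan2(v[1], v[0])
        phases[k] = (phi_pc - np.pi / 2) / 2      # invert phi_pc = 2*phi + pi/2
    # Unwrap with a pi/2 period (assumed form of the unwrapping step)
    return np.unwrap(4 * phases) / 4

# Example: 16QAM with a constant 0.2 rad rotation, 40 blocks of 256 symbols
rng = np.random.default_rng(1)
levels = np.array([-3.0, -1.0, 1.0, 3.0])
tx = rng.choice(levels, 40 * 256) + 1j * rng.choice(levels, 40 * 256)
rx = tx * np.exp(1j * 0.2)
est = pcpe(rx, 256)
# Compensation would rotate block k by exp(-1j * est[k])
```

Because the PC is carried over from block to block, the first few estimates are still converging from the initial guess; after that the estimate follows the rotation up to a multiple of π/2.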

Two-stage minimum-distance BPS
BPS is a feed-forward algorithm that tests different phases to compensate for phase distortions. A common implementation for complexity reduction is to break it into two stages with $B_1$ and $B_2$ test phases in the first and the second stages, respectively [7]. In this way, the algorithm tests up to $B_1(B_2 + 1)$ different phase values while the complexity is proportional to $B_T = B_1 + B_2$ [4]. Figure 3 shows a block diagram representation of the two-stage blind phase search (2S-BPS) algorithm.
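To make the test-phase search concrete, the sketch below implements a minimum-distance search for a single block of 16QAM symbols and chains two such searches. It is only an illustration: the function names, the uniform test-phase grid over a π/2 range, and the choice of refining within one coarse step in the second stage are assumptions of this example and are not taken from [4,7].

```python
import numpy as np

def decide_16qam(x):
    """Nearest 16QAM point (levels -3, -1, 1, 3 on each axis)."""
    re = np.clip(2 * np.floor(x.real / 2) + 1, -3, 3)
    im = np.clip(2 * np.floor(x.imag / 2) + 1, -3, 3)
    return re + 1j * im

def bps_block(block, num_test_phases, center=0.0, span=np.pi / 2):
    """Test phase that minimizes the summed squared distance to decisions."""
    b = np.arange(num_test_phases)
    test_phases = center + (b / num_test_phases - 0.5) * span
    # Rotate the whole block by every test phase: shape (B, N)
    rotated = block[None, :] * np.exp(-1j * test_phases[:, None])
    dist = np.abs(rotated - decide_16qam(rotated)) ** 2
    return test_phases[np.argmin(dist.sum(axis=1))]

def two_stage_bps_block(block, b1, b2):
    """Coarse search over pi/2, then a fine search around the coarse result."""
    coarse = bps_block(block, b1)
    # Assumed refinement range: one coarse step centered on the coarse phase
    return bps_block(block, b2, center=coarse, span=(np.pi / 2) / b1)

# Example: noise-free 16QAM block rotated by 0.1 rad, 11 test phases per stage
levels = np.array([-3.0, -1.0, 1.0, 3.0])
i, q = np.meshgrid(levels, levels)
block = (i + 1j * q).ravel() * np.exp(1j * 0.1)
phase_est = two_stage_bps_block(block, 11, 11)
```

With 11 test phases per stage, the second stage reaches a resolution of (π/2)/121 rad while only 22 test phases are evaluated per block, which is the motivation for splitting the search.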

PCPE and BPS hybrid scheme
In the analysis and discussion section we show that the PCPE has a considerably lower cycle slip rate (CSR) for low SNR signals but worse phase tracking performance than BPS for high SNR. Therefore, another possible approach is to combine the cycle slip robustness of the PCPE with the accuracy of the BPS while still reducing its complexity. Figure 4 shows a block diagram for the PCPE and BPS hybrid scheme (PCPE-BPS). In this scheme, the first stage performs the PCPE, while the second stage performs a BPS algorithm whose test phases are confined to a range set by η, the phase aperture of the BPS stage, i.e., the range in which the fine test phase search is performed. The phase unwrapping process is done in the PCPE stage. This ensures that the hybrid scheme has the same robustness to cycle slips as the PCPE alone.

Computational complexity analysis
In order to analyze and compare the computational complexity of the proposed PC-based methods and 2S-BPS, we calculated the number of operations necessary to implement the algorithms. The operations were divided into real additions (or subtractions), real multiplications (or divisions), square root computations, accesses to look-up tables (LUTs) for trigonometric computations, decisions and comparisons. The values are shown in Tab. 1, where N is the number of symbols processed per block. For example, considering N = 64 samples per block and 11 test phases per BPS stage, and comparing only the number of multiplications, the PCPE would represent a computational complexity reduction of 92% in comparison to the 2S-BPS. If the PCPE-BPS method were used instead, a 40% computational complexity reduction would be achieved, also in comparison to the 2S-BPS. It is important to note that the computational complexity of the BPS-based algorithms scales with both the number of test phases and the number of symbols processed per block, while that of the PCPE depends only on the number of symbols per block. Therefore, for higher order modulation formats that would require more test phases for precision, the PCPE would represent an even higher computational complexity reduction.

Algorithm evaluation
In this section, we benchmark the proposed PCPE and PCPE-BPS algorithms against the 2S-BPS algorithm. The comparisons presented here take into consideration realistic hardware implementations in which the data is processed at a rate much lower than the symbol rate, therefore using parallelized processing.

Cycle slip tolerance
Due to the π/2 phase symmetry of square M-QAM modulation formats, cycle slips may occur during the phase unwrapping process [3]. This is a highly nonlinear phenomenon that leads to catastrophic failure if no special coding is employed. In general, an NDA CPR algorithm is not capable of recovering from a cycle slip by itself. Nonetheless, it is worth reducing the cycle slip rate in order to reduce the amount of pilot symbols (overhead) necessary to recover from cycle slips [11], to allow turbo decoders to compensate for differential coding [12,13], or to allow good operation of cycle slip detection algorithms [14]. In our analysis, CSR is defined as the number of cycle slip occurrences per block. The first cycle slip occurs the first time the difference between the actual and the estimated phase noise becomes greater than π/2. A subsequent cycle slip only occurs if this difference increases to more than π or returns below π/2. Therefore, the CSR is computed over the K simulated blocks from the rounded difference between the estimated phase noise and the average of the actual phase noise per block.
Figure 6 shows the simulation results. The performance shown is the average of 1000 realizations of a 32 GBd signal with 256 blocks of 64 symbols in each realization (a total of 16384 symbols). We considered combined LWs of 200, 500, 1000 and 2000 kHz. Results are presented only for the PCPE and 2S-BPS algorithms, as the PCPE-BPS method had the same CSR performance as PCPE. For the 2S-BPS method, the number of test phases used in each stage was 6 for quadrature phase shift keying (QPSK) and 11 for higher-order M-QAM formats. These values were selected beforehand because they yielded the lowest required SNR for the 2S-BPS algorithm; using more test phases yielded negligible improvements.
For all modulation formats and all SNRs analyzed, PCPE showed a better performance than 2S-BPS for LWs of less than 1000 kHz. For the case with 2000 kHz of LW, PCPE only showed a better performance for low SNR values. It is important to note that a high LW affects the PCPE performance more than the 2S-BPS performance.
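One possible way of counting cycle slips along the lines of the CSR definition in this subsection is sketched below. The exact expression used in the simulations is not reproduced here, so the details are assumptions of this example: the phase difference is rounded to the nearest multiple of π/2, a slip is counted whenever the rounded value changes from one block to the next, and the sequence is taken to start with no offset. The function name `cycle_slip_rate` is illustrative.

```python
import numpy as np

def cycle_slip_rate(estimated_phase, actual_phase, block_size):
    """Cycle slips per block from per-block estimates and per-symbol phase noise."""
    est = np.asarray(estimated_phase, dtype=float)     # one value per block
    act = np.asarray(actual_phase, dtype=float)        # one value per symbol
    num_blocks = len(est)
    # Average of the actual phase noise within each block
    act_avg = act[:num_blocks * block_size].reshape(num_blocks, block_size).mean(axis=1)
    # Rounded phase difference, in units of pi/2
    index = np.round((est - act_avg) / (np.pi / 2)).astype(int)
    # Assumption: one slip per change of the rounded index, starting from 0
    slips = np.count_nonzero(np.diff(index, prepend=0))
    return slips / num_blocks

# Example: the estimate jumps by pi/2 in block 2 and returns in block 4
est = np.array([0, 0, np.pi / 2, np.pi / 2, 0, 0, 0, 0])
csr = cycle_slip_rate(est, np.zeros(16), block_size=2)   # 2 slips in 8 blocks
```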

Mutual information penalties
In this section, we analyze the achievable information rates (AIRs), assuming that an optimal encoder and decoder are employed, by calculating the mutual information (MI) between the transmitted and received symbols. The MI computation is a Monte Carlo approach, which uses an auxiliary channel to compute a lower bound to the AIR [15]. The computed AIR reaches the actual MI if the channel is exactly the same as the auxiliary channel [16,17]. Here, we considered two auxiliary channels, one with only additive white Gaussian noise (AWGN) and the other where the main source of noise is the residual phase noise. The AIR is the maximum obtained from these two auxiliary channels.
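As an illustration of the Monte Carlo approach with the AWGN auxiliary channel, the sketch below estimates the MI of a square QAM constellation over a simulated AWGN channel, so that the true and auxiliary channels coincide. The function names, the number of symbols and the unit-power normalization are assumptions of this example; the phase-noise auxiliary channel and the residual phase noise of a CPR stage are not modeled.

```python
import numpy as np

def square_qam(order):
    """Points of an ideal square QAM constellation."""
    side = int(round(np.sqrt(order)))
    levels = np.arange(-(side - 1), side, 2)
    i, q = np.meshgrid(levels, levels)
    return (i + 1j * q).ravel()

def mi_awgn(order, snr_db, num_symbols=20000, seed=0):
    """Monte Carlo MI estimate (bit/symbol) with an AWGN auxiliary channel."""
    rng = np.random.default_rng(seed)
    const = square_qam(order)
    const = const / np.sqrt(np.mean(np.abs(const) ** 2))   # unit average power
    noise_var = 10 ** (-snr_db / 10)                       # complex noise variance
    idx = rng.integers(0, order, num_symbols)              # uniform symbols
    noise = np.sqrt(noise_var / 2) * (rng.standard_normal(num_symbols)
                                      + 1j * rng.standard_normal(num_symbols))
    y = const[idx] + noise
    # Log-likelihoods of y for every candidate symbol (common constants cancel)
    loglik = -np.abs(y[:, None] - const[None, :]) ** 2 / noise_var
    peak = loglik.max(axis=1)                              # for a stable log-sum-exp
    log_sum = peak + np.log(np.exp(loglik - peak[:, None]).sum(axis=1))
    log_tx = loglik[np.arange(num_symbols), idx]
    # MI = log2(M) - E[ log2( sum_j p(y|x_j) / p(y|x) ) ]
    return np.log2(order) - np.mean(log_sum - log_tx) / np.log(2)

mi_high = mi_awgn(16, 30.0)    # approaches log2(16) = 4 bit/symbol
mi_low = mi_awgn(16, -20.0)    # approaches 0 bit/symbol
```

If the received symbols came from a channel that is not exactly AWGN, the same expression evaluated with the AWGN likelihoods would give a lower bound on the MI rather than the MI itself.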
We analyzed the algorithms in a system with 200 kHz of LW and no residual frequency offset between the transmitter and receiver lasers. Cycle slips were ideally compensated to simplify the MI computation.
Each MI value is an average computed from 10000 realizations with 1024 blocks of 64 symbols each. For the 2S-BPS algorithm, we used 11 test phases in each stage, and for the PCPE-BPS method we used 11 test phases and an aperture of η = 1/11. Figure 7(a) shows the ideal AIR for different modulation formats. Figures 7(b)-7(e) show the results in terms of penalty in AIR. These penalties were calculated by comparing the AIR achieved with the algorithms with the ideal MI that would be achieved if no phase noise were present.
For all modulation formats tested and for low SNR, the MI penalty of 2S-BPS was higher than that of the other algorithms. This is because the 2S-BPS algorithm has more difficulty distinguishing the test phases in the presence of high AWGN. Certain combinations of symbols can lead to ambiguities and, therefore, estimation errors. For high SNR, the PCPE showed a higher penalty than the other algorithms. Unlike 2S-BPS, PCPE is not based on testing phases, and the calculation of the PC benefits from having more distinct points in the same block of the received signal, which could explain the better performance at low SNR.
In the case of high SNR, the selected block size may not be large enough for the covariance matrix to represent the signal accurately, i.e., if the block size is smaller than the constellation order, then some combinations of constellation points inside a block would generate a wrong PC. Increasing the number of samples per block does improve the PC calculation, but because the same phase value is applied to the entire block, the phase correction could be worse, leaving more residual phase noise. This trade-off was investigated in simulations and we found that a block size of around 64 samples was the best when considering a 256QAM signal operating at 32 GBd with 200 kHz of combined linewidth. It is also important to point out that the number of samples per block in a real system will be driven by the internal clock of the digital application-specific integrated circuit (ASIC) responsible for the DSP operations in a coherent receiver.
For low SNR, the PCPE-BPS is more penalized than the PCPE. The behavior of 2S-BPS at low SNR is passed on to the hybrid algorithm, but the penalty introduced was lower than 0.1 bit/symbol. On the other hand, for high SNR, the penalty of PCPE-BPS drops dramatically, which justifies the proposed hybrid algorithm.

Experimental validation
To further validate the proposed method, we assembled the setup shown in Fig. 8. The experimental results are presented in Fig. 9. An 84-GSa/s arbitrary waveform generator (AWG) generated two electrical lines of a 28 GBd signal with a 0.4 roll-off root raised cosine pulse shape. The electrical signals were amplified in electrical drivers and fed to a LiNbO3 external optical modulator together with a 100-kHz LW laser. Polarization multiplexing (PM) was emulated and the signal was combined with the noise generated in an erbium-doped fiber amplifier (EDFA). Then, it was received in a back-to-back configuration by a coherent receiver with another 100-kHz LW laser as local oscillator. The received electrical signals were sampled by an 80-GSa/s oscilloscope. The DSP was performed offline and consisted of a radially directed equalizer (RDE) algorithm with 12 taps for polarization demultiplexing and inter-symbol interference (ISI) compensation, a frequency-domain 4th-power frequency offset estimator and the CPR being evaluated. The modulation formats considered were 16QAM and 64QAM. Cycle slips were not compensated.
We computed the optical signal-to-noise ratio (OSNR) required to achieve an error-free transmission considering 20% FEC overhead (OH). In this case, the target AIR was computed by $\mathrm{AIR}_{\mathrm{target}} = \log_2(M)/(1 + \mathrm{OH})$, which led to 3.333 and 5 bit/symbol for 16QAM and 64QAM, respectively. The OSNR was measured in an optical spectrum analyzer with 0.1 nm resolution.
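The target values quoted above follow directly from the expression for the target AIR, as the short check below shows. The function name `air_target` is illustrative.

```python
import math

def air_target(order, fec_overhead):
    """Target AIR in bit/symbol for an M-QAM format and a given FEC overhead."""
    return math.log2(order) / (1.0 + fec_overhead)

print(round(air_target(16, 0.2), 3))   # 3.333
print(round(air_target(64, 0.2), 3))   # 5.0
```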
The PCPE algorithm performed better than the other two algorithms, with required-OSNR gains of 0.1 and 0.15 dB in comparison to PCPE-BPS and BPS, respectively, for 16QAM. For 64QAM, the required-OSNR gains were 0.2 and 0.43 dB, respectively. These gains were higher than expected and are possibly due to IQ imbalances in the generation of the modulation formats. The PCPE is agnostic to these imbalances and therefore performs better. For very low OSNR values, the PCPE and PCPE-BPS methods performed better due to lower cycle slip rates, as expected.
Fig. 8. Experimental setup. Polarization multiplexing is realized by emulation. Noise originates from a broadband light source followed by an EDFA and an attenuator. Emu.: emulator; Att.: attenuator; ASE: amplified spontaneous emission; ECL: external cavity laser.

Conclusions
We proposed and demonstrated the PCPE algorithm, a block-wise NDA CPR method for square M-QAM systems that is independent of the modulation order. It is based on extracting the PC of the squared received signal. Simulation results showed that PCPE is more robust to cycle slips than the traditional 2S-BPS algorithm. Nonetheless, it showed a higher penalty in terms of MI for high SNR. Considering this scenario, we proposed a hybrid algorithm, the PCPE-BPS. This scheme, comprising a PCPE stage for coarse estimation and a BPS stage for fine tuning, combines the cycle slip robustness of the former with the accuracy of the latter. This hybrid scheme should be the best choice considering flexibility and complexity, operating only with the PCPE stage at low SNR or with both stages at high SNR. In future work, we also plan to extend our analysis to the impact of non-linear phase noise [13], equalization-enhanced phase noise [18,19], high frequency drifts, constellation imperfections [20] and the use of probabilistic shaping [21], as well as to make comparisons with data-aided CPR methods.