Real-time Transmission of Geometrically-shaped Signals using a Software-defined GPU-based Optical Receiver

A software-defined optical receiver is implemented on an off-the-shelf commercial graphics processing unit (GPU). The receiver provides real-time signal processing functionality to process 1 GBaud minimum phase (MP) 4-, 8-, 16-, 32-, 64-, 128-ary quadrature amplitude modulation (QAM) as well as geometrically shaped (GS) 8- and 128-QAM signals using Kramers-Kronig (KK) coherent detection. Experimental validation of this receiver over a 91 km field-deployed optical fiber link between two Tokyo locations is shown with detailed optical signal-to-noise ratio (OSNR) investigations. A net data rate of 5 Gbps using 64-QAM is demonstrated.

In this work, the flexible, software-defined, real-time multi-modulation format receiver is optimized for improved performance, and the demonstrated modulation formats are extended with odd-power QAM and geometric shaping (GS). A commercial off-the-shelf GPU is used for real-time digital signal processing of minimum phase (MP) 4-, 8-, 16-, 32-, 64-, and 128-QAM as well as geometrically-shaped 8-QAM and 128-QAM signals using KK coherent detection [15]. The receiver is experimentally validated using a field-deployed transmission link between two Tokyo locations, with dynamic components of the receiver digital signal processing (DSP) handling fluctuations of the environmental conditions. Detailed investigations into optical signal-to-noise ratio (OSNR) performance are shown, and the carrier-to-signal power ratio (CSPR) is optimized for all transmission scenarios. Net throughput is calculated for the evaluated formats and multiple hard decision forward error correction (HD-FEC) algorithms, with 64-QAM paired with a 20% overhead HD-FEC achieving a net throughput of 5 Gbps at 28.2 dB OSNR.
This paper is an extension to [11]. The demonstrated capabilities of the real-time receiver are extended to show support for odd-power QAM and geometrically-shaped constellations. Furthermore, the performance of the receiver with respect to [11] is improved through several optimizations. Firstly, the power of the digitally-inserted carrier tone with respect to the signal (CSPR) is optimized: in [11], a static CSPR value was used, whereas here a detailed investigation into the influence of CSPR on transmission performance is presented. Secondly, the reconfigurable optical add-drop multiplexers (ROADMs) are removed from the transmission path and replaced by an erbium-doped fiber amplifier (EDFA) at the remote site, enabling evaluation of the receiver at higher OSNR. Thirdly, the 1 GHz photodiode is replaced by a 6.5 GHz model, significantly improving the receiver bandwidth.
Fourthly, the gap between the signal and digitally-inserted carrier is optimized, resulting in the use of a smaller gap. Fifthly, launch power is optimized, but we observe no significant dependence of OSNR performance on launch power. These changes and optimizations substantially improve transmission performance compared to what we presented in [11]. Finally, more modulation formats are evaluated (4-, 8-, GS-8-, 16-, 32-, 64-, 128-, and GS-128-QAM) to comprehensively demonstrate multi-modulation format receiver capability.
Fig. 1 shows the real-time DSP chain for KK coherent N-QAM signals. Here, an overview of its GPU-based implementation is given. A more detailed description can be found in [11]. First, buffers containing 2^22 digitized samples are transferred from the analog-to-digital converter (ADC) to the GPU in real time using direct memory access (DMA). The signal processing starts with a GPU kernel converting samples received as 12-bit fixed point to 32-bit floating point numbers, adding the appropriate DC offset, and performing the square root and logarithm KK front-end operations. This first kernel is annotated by the number 1 in Fig. 1. In step 3, enabled by a pair of 100% overlap-save 1024-point fast Fourier transforms (FFTs), the phase of the optical signal is recovered by a frequency-domain Hilbert transform. This phase is combined with the amplitude calculated in step 1 to reconstruct the optical signal [15], which is subsequently downconverted for further processing. Another pair of FFTs supports frequency-domain static equalization and resampling from 4 to 2 samples-per-symbol. Finally, a 4-tap adaptive time-domain widely-linear [16] decision-directed least mean square (DD-LMS) equalizer is employed to recover the signal. The minimum Euclidean distance decisions made by the equalizer are demapped into bits and sent to random-access memory (RAM).
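The KK front-end described above (square root, logarithm, Hilbert transform, field reconstruction) can be illustrated with a minimal CPU sketch in plain Python. This is not the GPU implementation of this work: it processes one periodic block with a single circular FFT instead of overlap-save 1024-point FFTs, assumes the DC term is already present in the photocurrent, and omits downconversion and equalization. All function names and the two-tone test signal are hypothetical and chosen only for illustration.

```python
import cmath
import math

def fft(x, inverse=False):
    """Unnormalized recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def hilbert(x):
    """Hilbert transform of a real, periodic block in the frequency domain."""
    n = len(x)
    spec = fft([complex(v) for v in x])
    for k in range(n):
        if k == 0 or k == n // 2:
            spec[k] = 0j          # DC and Nyquist bins carry no phase term
        elif k < n // 2:
            spec[k] *= -1j        # positive frequencies
        else:
            spec[k] *= 1j         # negative frequencies
    return [v.real / n for v in fft(spec, inverse=True)]

def kk_reconstruct(photocurrent):
    """Recover a minimum-phase field from its intensity (DC term included)."""
    amplitude = [math.sqrt(i) for i in photocurrent]
    log_amplitude = [math.log(a) for a in amplitude]
    phase = hilbert(log_amplitude)
    return [a * cmath.exp(1j * p) for a, p in zip(amplitude, phase)]

# Hypothetical test field: unit carrier plus a weak single-sideband signal,
# so the minimum-phase condition |signal| < carrier holds at every sample.
N = 256
field = [1.0
         + 0.15 * cmath.exp(2j * math.pi * 3 * i / N)
         + 0.10j * cmath.exp(2j * math.pi * 7 * i / N)
         for i in range(N)]
photocurrent = [abs(e) ** 2 for e in field]   # ideal square-law detection
recovered = kk_reconstruct(photocurrent)
max_err = max(abs(r - e) for r, e in zip(recovered, field))
print(f"max reconstruction error: {max_err:.2e}")
```

In this idealized setting (no noise, no receiver filtering, carrier well above the signal) the complex field is recovered from the intensity alone. Lowering the carrier until the minimum-phase condition is violated makes the reconstruction fail, which is the mechanism behind the reconstruction errors discussed later in the paper.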

Digital signal processing chain
Full utilization of GPU resources is achieved through massive parallelization within kernels as well as operating multiple processing streams in parallel. Most of the kernels operating within each stream are highly parallel themselves, e.g. the FFTs, Hilbert transform, and the frequency-domain equalizer. However, algorithms such as the adaptive time-domain equalizer are hard to parallelize due to their sequential nature and the time-dependencies of the tap updates. The use of multiple processing streams allows these hard-to-parallelize algorithms to run next to easy-to-parallelize algorithms. Therefore, even though the adaptive equalization takes up a significant amount of time, as shown in the GPU profiler trace in Fig. 1, it does not require a significant amount of resources. This concept and many other implementation strategies are discussed in detail in [11]. Constellation cardinality only slightly influences GPU resource utilization. The KK algorithm and static equalization take up the majority of computational resources.
Note that several parameters of the real-time receiver are static and optimized offline. Since the ADC is AC-coupled, the DC-term of the signal is lost. This DC-term is crucial for correct signal reconstruction and is added in step 1 as described above. The question remains how to determine the optimal value, since it depends on the signal power, noise power, and CSPR. Throughout this work, all measurements are performed multiple times using different DC offset values to ensure that the optimal performance is observed. Alternatively, one could implement an algorithm to calculate and update the optimal DC-term in real time [17].
Similarly, the 203-tap static frequency-domain equalizer is optimized offline using a training sequence every time that the data acquisition is initialized. Therefore, the entire signal processing chain up to the adaptive equalizer is agnostic to the modulation format. The adaptive equalizer only needs to know the constellation since, after initial setup and convergence using a training sequence, it is updated in a blind decision-directed fashion where part of a buffer is used to update equalizer taps for subsequent buffers. Also, the symbol decisions are demapped into bits and are considered as the output of the DSP chain. The constellation points and bit mapping are uploaded to the GPU for the equalizer to make decisions based on a minimum Euclidean distance criterion. Note that no phase compensation algorithm is required since the KK coherent receiver scheme is free of phase noise. Furthermore, the DSP chain does not rely on specific properties of modulation formats such as symmetry. Therefore, support for various geometrically-shaped constellations is achieved by uploading the location of the constellation points and corresponding bit mapping for the equalizer and demapper to use.
Fig. 2 shows the experimental setup using the field-deployed link between Koganei and Otemachi, Tokyo. First, an 8-bit 12 GS/s arbitrary-waveform generator (AWG) generates MP 1 GBaud QAM signals with 1% roll-off root-raised-cosine (RRC) pulse shaping and a digitally inserted carrier tone at 0.516 GHz. The 2^20 N-ary symbol sequence is generated using PCG64 pseudo-random numbers and mapped to the desired modulation format. Both conventional and geometrically shaped QAM formats are employed, the latter optimized for the additive white Gaussian noise (AWGN) channel and generated through an iterative optimization process similar to [18].
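The table-driven decision and demapping step described above can be sketched as follows. This is a minimal CPU illustration in plain Python, not the GPU kernel of this work; the function names and both example constellations (a Gray-labelled 4-QAM and a deliberately asymmetric 4-point set) are hypothetical and are not the formats used in the experiment.

```python
def nearest_symbol(sample, constellation):
    """Index of the constellation point with minimum Euclidean distance."""
    return min(range(len(constellation)),
               key=lambda i: abs(sample - constellation[i]))

def demap(samples, constellation, labels, bits_per_symbol):
    """Hard-decide each sample and expand its integer label into bits (MSB first)."""
    bits = []
    for sample in samples:
        label = labels[nearest_symbol(sample, constellation)]
        bits.extend((label >> b) & 1 for b in reversed(range(bits_per_symbol)))
    return bits

# Hypothetical table 1: conventional Gray-labelled 4-QAM.
qam4_points = [1 + 1j, -1 + 1j, -1 - 1j, 1 - 1j]
qam4_labels = [0b00, 0b01, 0b11, 0b10]

# Hypothetical table 2: an asymmetric 4-point set. Only the table changes,
# the decision and demapping code is identical.
gs4_points = [0j, 2 + 0j, -1 + 1.5j, -1 - 2j]
gs4_labels = [0b00, 0b01, 0b10, 0b11]

qam4_bits = demap([0.9 + 1.2j, -0.8 - 1.1j], qam4_points, qam4_labels, 2)
gs4_bits = demap([1.8 + 0.1j, -0.9 + 1.0j], gs4_points, gs4_labels, 2)
print(qam4_bits, gs4_bits)
```

Because the decision rule only consults the uploaded point and label tables, switching between conventional and geometrically-shaped formats requires no change to the processing code, which mirrors the design choice described in the text.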
At the receiver, the signal is amplified, combined with amplified spontaneous emission (ASE) from a noise-loading stage to vary the OSNR, and filtered using a 5 GHz bandpass filter (BPF). A variable optical attenuator (VOA) controls the optical power into the 6.5 GHz photodiode such that the electrical output swing fills the detection range of the 1 GHz 4 GS/s 12-bit ADC. The receiver clock is synchronized to the transmitter, and DSP is performed in real time on the GPU with 5120 processing cores as described in Section 2. Error counting is performed offline over 98 buffers containing 2^20 symbols each.
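As a quick consistency check of the buffer sizes, the following sketch relates the ADC rate, the baud rate, and the number of symbols used for error counting. It assumes the buffer sizes read as 2^22 samples and 2^20 symbols; the variable names are illustrative only.

```python
ADC_RATE = 4_000_000_000      # samples per second (4 GS/s ADC)
BAUD_RATE = 1_000_000_000     # symbols per second (1 GBaud)
SAMPLES_PER_BUFFER = 2 ** 22  # assumed reading of the buffer size
BUFFERS_COUNTED = 98          # buffers used for offline error counting

samples_per_symbol = ADC_RATE // BAUD_RATE
symbols_per_buffer = SAMPLES_PER_BUFFER // samples_per_symbol
buffer_duration_ms = SAMPLES_PER_BUFFER / ADC_RATE * 1e3
symbols_counted = BUFFERS_COUNTED * symbols_per_buffer

print(samples_per_symbol, symbols_per_buffer, buffer_duration_ms, symbols_counted)
```

Under these assumptions each buffer holds 4 samples per symbol and spans roughly one millisecond of signal, and the 98 buffers cover about 10^8 symbols.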

Experimental Setup
Several optimizations have been employed to increase OSNR performance. Firstly, no significant dependence of OSNR performance on launch power was observed. Therefore, the launch power was fixed. Next, the gap between the signal and the digitally-inserted carrier required for KK detection was optimized. With the 1% roll-off RRC 1 GBaud signal ending at 0.505 GHz and the carrier tone at 0.516 GHz, a gap of only 11 MHz was left. One would expect a larger gap to be more beneficial since it reduces signal-signal beat interference (SSBI), which, when not entirely removed by the KK algorithm, leaves reconstruction errors. However, the ADC has a limited 3 dB bandwidth of 1 GHz, thus impairing the signal after detection but crucially before the KK algorithm, leading to imperfect reconstruction. It is expected that the penalty from this increasingly imperfect reconstruction obstructs the use of a larger gap. Finally, the power of the digitally-inserted carrier tone relative to the signal, the CSPR, is varied and reported in the next section.
[...] N-QAM signal. Therefore, if CSPR is increased, signal power decreases, signal-to-noise ratio (SNR) decreases, and Q-factor decreases. This is especially relevant at low OSNRs, because optical noise is dominant in this regime. Conversely, if CSPR is increased, carrier power increases, SSBI decreases, fewer reconstruction errors occur, and Q-factor increases. This is relevant at higher OSNRs, because reconstruction errors are dominant in this regime. Reconstruction errors originate from minimum-phase condition violations and predominantly distort the outer points of a constellation, as can be seen in the insets of Fig. 3 [15]. The CSPR trade-off between noise and reconstruction errors is observed for all tested modulations. 4-QAM with 6 dB CSPR reaches the 6.7% overhead HD-FEC Q-factor threshold of 8.35 dB [19] at 6.5 dB OSNR. For this specific CSPR, the Q-factor never exceeds 12 dB, whilst higher CSPRs have a lower error floor, i.e. a higher Q-factor ceiling.
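The spectral gap and the noise side of the CSPR trade-off can be made concrete with a short calculation. The sketch below assumes that CSPR is the ratio of carrier power to signal power in linear units and that the total optical power (carrier plus signal) is held fixed; both function names are illustrative.

```python
import math

def band_edge_ghz(baud_rate_gbaud, rolloff):
    """Upper edge of an RRC-shaped baseband spectrum centred at 0 GHz."""
    return 0.5 * baud_rate_gbaud * (1.0 + rolloff)

def signal_power_fraction(cspr_db):
    """Share of the total power left for the signal at a given CSPR."""
    cspr_linear = 10.0 ** (cspr_db / 10.0)
    return 1.0 / (1.0 + cspr_linear)

edge_ghz = band_edge_ghz(1.0, 0.01)        # 1 GBaud, 1% roll-off
gap_mhz = (0.516 - edge_ghz) * 1e3         # carrier tone at 0.516 GHz

# Signal power given up when moving from 6 dB to 10 dB CSPR at fixed total power.
penalty_db = 10.0 * math.log10(signal_power_fraction(6.0)
                               / signal_power_fraction(10.0))

print(f"band edge {edge_ghz:.3f} GHz, gap {gap_mhz:.1f} MHz, "
      f"signal power change {penalty_db:.2f} dB")
```

Under the fixed-total-power assumption, raising CSPR from 6 dB to 10 dB leaves between 3 and 4 dB less power for the signal, which is why high CSPR is penalized in the noise-limited, low-OSNR regime.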
The optimal CSPR depends on the operating regime of the transmission link, the modulation format, and the FEC algorithm employed. Q-values above 15 dB are not displayed in Fig. 3 because too few errors were recorded for statistical significance. 16-QAM reaches the 20% overhead HD-FEC Q-factor threshold of 6.70 dB [20], [21] at 14.6 dB OSNR using a CSPR of 8 dB. The 6.7% overhead HD-FEC is reached at an OSNR of 17.6 dB using 10 dB CSPR.
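To relate the quoted Q-factor thresholds to bit error rates, the sketch below uses the common BER-derived Q-factor convention, Q_dB = 20 log10(sqrt(2) erfcinv(2 BER)). That this convention matches the one used for the thresholds above is an assumption; the paper text does not state the formula.

```python
import math

def q_db_to_ber(q_db):
    """BER equivalent to a Q-factor in dB (assumed BER-derived convention)."""
    q_linear = 10.0 ** (q_db / 20.0)
    return 0.5 * math.erfc(q_linear / math.sqrt(2.0))

ber_low_overhead = q_db_to_ber(8.35)    # 6.7% overhead HD-FEC threshold
ber_high_overhead = q_db_to_ber(6.70)   # 20% overhead HD-FEC threshold

print(f"8.35 dB -> BER {ber_low_overhead:.2e}")
print(f"6.70 dB -> BER {ber_high_overhead:.2e}")
```

With this convention the two thresholds correspond to pre-FEC bit error rates of roughly 4.5e-3 and 1.5e-2, respectively.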

Fig. 3 shows the Q-factor versus OSNR.
The OSNR performance of 32-QAM and 64-QAM is shown in Fig. 4. 32-QAM reaches both the 20% and 6.7% overhead FEC thresholds, at 19.9 dB and 24.9 dB OSNR, respectively. 64-QAM is not able to reach the 6.7% overhead FEC threshold, but reaches the 20% overhead threshold at 28.2 dB. The successful transmission of 64-QAM is enabled by improving receiver bandwidth, increasing OSNR, and optimizing CSPR. [...] unfortunately stay below the thresholds.
Fig. 6b estimates the net throughput of the transmission system at various OSNRs. For each of the HD-FEC threshold crossings mentioned above, the net data rate after FEC decoding at the OSNR of the crossing is plotted. This figure reveals interesting choices for the system designer. If the system operates above an OSNR of 28.2 dB, 64-QAM combined with a 20% overhead HD-FEC can be employed for a net data rate of 5 Gbps. Alternatively, a lower complexity HD-FEC with an overhead of 6.7% can be paired with 32-QAM for a net data rate of 4.7 Gbps, requiring 24.9 dB OSNR. The multi-modulation format software-defined DSP allows for efficient operation from 6 dB up to 28 dB OSNR, flexibly switching modulation format depending on OSNR.
[...] swapping the binary labels of two randomly chosen constellation points until convergence is reached. After each iteration, the generalized mutual information (GMI) for the AWGN channel is evaluated and, if gains are found, the modified constellation is taken as the new baseline. GS-8-QAM, see Fig. 7b, resembles a circular 8-QAM with a center constellation point [22], but differs in that it is not symmetric. Symmetries are added to GS-128-QAM to aid convergence of the constellation design. The result of the optimization algorithm is to increase the Euclidean distance between constellation points that differ by more than 1 bit.
GS formats were created for SNRs in steps of 1 dB, ranging from 5 dB to 14 dB for 8-QAM and from 17 dB to 23 dB for 128-QAM, and were experimentally tested using the field-deployed link, since there is no a priori knowledge of the AWGN channel that best resembles the transmission scenario. Many of the GS formats tested on the transmission link perform worse than their conventional counterparts, but GS-8-QAM optimized for 14 dB SNR and GS-128-QAM optimized for 20 dB SNR show significant experimental gains and are included in this work. The constellation design technique optimizes for AWGN, which is a substantially different channel than the one employed in this work. The gain of GS-8-QAM with respect to 8-QAM in the AWGN channel is 0.5 dB at both FEC thresholds. Fig. 8 shows the influence of KK reconstruction errors on OSNR performance for various CSPRs. The insets show that the outer points of a constellation are predominantly distorted. The employed GS-8-QAM is more resilient against these distortions than 8-QAM, enabling operation at lower CSPR. Therefore, it is expected that more advanced optimization techniques that take into account the channel, including CSPR and KK reconstruction errors, will produce better performing modulation formats. However, determining the best GS modulation format for the transmission scenario was not the goal of this investigation. These suboptimal constellations are included as a proof-of-concept for using GS formats in the flexible multi-modulation format receiver, to demonstrate that the receiver supports non-symmetrical modulation formats, and to demonstrate the use of GS-QAM in KK coherent detection.
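The net data rates quoted in this section can be cross-checked with a short calculation. The sketch assumes that the net rate equals the gross bit rate divided by (1 + FEC overhead), which reproduces the quoted figures; the function name is illustrative.

```python
import math

def net_rate_gbps(constellation_size, fec_overhead, baud_rate_gbaud=1.0):
    """Net data rate after FEC, assuming net = gross / (1 + overhead)."""
    bits_per_symbol = math.log2(constellation_size)
    return baud_rate_gbaud * bits_per_symbol / (1.0 + fec_overhead)

rate_64qam = net_rate_gbps(64, 0.20)    # 64-QAM with 20% overhead HD-FEC
rate_32qam = net_rate_gbps(32, 0.067)   # 32-QAM with 6.7% overhead HD-FEC

print(f"64-QAM: {rate_64qam:.2f} Gbps, 32-QAM: {rate_32qam:.2f} Gbps")
```

The 64-QAM case gives 6 bits per symbol at 1 GBaud divided by 1.2, and the 32-QAM case gives 5 bits per symbol divided by 1.067, consistent with the 5 Gbps and 4.7 Gbps reported above.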

Discussion
In [11], we note that the error floor or Q-factor ceiling of high-cardinality formats is most likely due to low-pass filtering at the receiver causing KK reconstruction errors. Here, we use a photodiode with higher bandwidth, enabling successful processing of 64-QAM signals reaching the 20% overhead HD-FEC Q-factor threshold. Further gains could be made by employing a higher bandwidth ADC as well. Note that the KK reconstruction errors are caused by filtering effects between direct detection at the photodiode and conversion in the ADC [23]. Alternatively, one could implement a second static equalizer before the KK algorithm to counter these filtering effects [24]. The current implementation almost fully utilizes the GPU processing capabilities, but further optimization of the implementation can free up resources for such an additional filtering step.
The main factor limiting baud and data rates in this work is the ADC. The off-the-shelf commercial 12-bit 4 GS/s ADC has a 3 dB bandwidth of 1 GHz, limiting the baud rate to 1 GBaud. [...] which limits the sampling rate of the ADC. Future ADCs, employing PCIe Gen 4 or additional lanes, are expected to offer greater sampling rates and bandwidth. Proprietary interfaces such as NVIDIA NVLink (not available in off-the-shelf ADCs at the time of writing) can support an order of magnitude higher throughput from ADC to GPU and may enable greater increases in sampling rates.

Conclusion
A commercial off-the-shelf GPU is used for real-time digital signal processing of minimum-phase 4-, 8-, 16-, 32-, 64-, and 128-QAM as well as geometrically-shaped 8-QAM and 128-QAM signals detected with a KK coherent receiver [15]. This real-time, flexible, multi-modulation format receiver is experimentally validated using a field-deployed link between two Tokyo locations. A net data rate of 5 Gbps is demonstrated using 1 GBaud 64-QAM. This shows the potential of GPUs for software-defined signal processing functionality in optical communication systems.
Data availability. Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.