Performance versus Complexity Study of Neural Network Equalizers in Coherent Optical Systems

We present the results of the comparative analysis of the performance versus complexity for several types of artificial neural networks (NNs) used for nonlinear channel equalization in coherent optical communication systems. The comparison has been carried out using an experimental set-up with transmission dominated by the Kerr nonlinearity and component imperfections. For the first time, we investigate the application to the channel equalization of the convolution layer (CNN) in combination with a bidirectional long short-term memory (biLSTM) layer and the design combining CNN with a multi-layer perceptron. Their performance is compared with the one delivered by the previously proposed NN equalizer models: one biLSTM layer, three-dense-layer perceptron, and the echo state network. Importantly, all architectures have been initially optimized by a Bayesian optimizer. We present the derivation of the computational complexity associated with each NN type -- in terms of real multiplications per symbol so that these results can be applied to a large number of communication systems. We demonstrated that in the specific considered experimental system the convolutional layer coupled with the biLSTM (CNN+biLSTM) provides the highest Q-factor improvement compared to the reference linear chromatic dispersion compensation (2.9 dB improvement). We examine the trade-off between the computational complexity and performance of all equalizers and demonstrate that the CNN+biLSTM is the best option when the computational complexity is not constrained, while when we restrict the complexity to lower levels, the three-layer perceptron provides the best performance. Our complexity analysis for different NNs is generic and can be applied in a wide range of physical and engineering systems.

Abstract-We present the results of the comparative performance-versus-complexity analysis for the several types of artificial neural networks (NNs) used for nonlinear channel equalization in coherent optical communication systems. The comparison is carried out using an experimental set-up with the transmission dominated by the Kerr nonlinearity and component imperfections. For the first time, we investigate the application to the channel equalization of the convolution layer (CNN) in combination with a bidirectional long short-term memory (biLSTM) layer and the design combining CNN with a multilayer perceptron. Their performance is compared with the one delivered by the previously proposed NN-based equalizers: one biLSTM layer, three-dense-layer perceptron, and the echo state network. Importantly, all architectures have been initially optimized by a Bayesian optimizer. First, we present the general expressions for the computational complexity associated with each NN type; these are given in terms of real multiplications per symbol. We demonstrate that in the experimental system considered, the convolutional layer coupled with the biLSTM (CNN+biLSTM) provides the largest Q-factor improvement compared to the reference linear chromatic dispersion compensation (2.9 dB improvement). Then, we examine the trade-off between the computational complexity and performance of all equalizers and demonstrate that the CNN+biLSTM is the best option when the computational complexity is not constrained, while when we restrict the complexity to some lower levels, the three-layer perceptron provides the best performance.

I. INTRODUCTION
A MONGST the variety of different nonlinearity compensation methods, the machine learning (ML) based techniques are gaining momentum as a promising and flexible tool capable to efficiently unroll fiber and component-induced impairments. In the past several years, the research on artificial neural networks (NN) for optical channel equalization has already led to the development of a noticeable number of novel digital signal processing (DSP) methods that can provide the performance better than that rendered by the "conventional" DSP approaches [1]- [10]. The fast development of NNrelated research and the growing ML developers community incites testing different novel NN architectures to mitigate fiber propagation impairments. In terms of the experimental verification of NN-based equalizers, several works dealt with the intensity-modulation with direct-detection (IM/DD) links. It was demonstrated that the application of the NNs with different internal structure, such as multi-layer perceptron (MLP) [11], [12] (i.e. a simple densely connected feedforward NN architecture), convolutional NNs (CNN) [13], [14], echo state networks (ESN) [15], and long short-term memory (LSTM) NNs [16], is efficient in improving optical system-level performance. However, the test of similar NN architectures in coherent optical systems has been carried out, mainly, numerically [17]- [20], or in short-haul experiments [21]- [24]. It is worth noticing that some very recent works evaluated the functioning of NN-based equalizers in metro/long-haul trials [4], [5], [8]- [10].
The variety of existing and emerging channel equalizers makes a comparative analysis of the different solutions a timely challenge. The NN-based channel equalization refers to two important aspects: i) the improvement of performance by the reduction of bit-error-rate (BER), and ii) the complexity of the algorithms, which is a fundamental issue for practical implementation. Clearly, the comparison can be carried out only for specific systems: some approaches can be more suitable for certain transmission links, while the others are favorable for different systems.
To gain a thorough understanding of how each of the aforementioned NN architectures performs, we need to pick a benchmark system for the comparison. In this work, we perform such a comparison using, as a benchmark, a single channel transmission of a dual-polarization (DP) 16-QAM signal with 34.4 GBd rate transmitted over 9×50km TrueWave Classic (TWC) fiber spans at the power of 2 dBm. Such a choice of the fiber and the power level ensures that the system is in the strongly nonlinear regime, as we intend to study how the NNs unroll the Kerr nonlinearity effects. In our work, we analyze both the synthetically simulated and the experimental data. We, first, analyze the performance of several previously studied NN models: MLP, bidirectional LSTM (biLSTM), and ESN. Next, we compare their performance with that rendered by new composite NN structures: i) the convolutional layer coupled with the MLP (CNN+MLP); ii) the combination of the convolutional layer with the biLSTM (CNN+biLSTM). These new designs are, then, tested in the same environment, allow-arXiv:2103.08212v2 [eess.SP] 23 Jun 2021 ing us to infer the performance characteristics pertinent to each type among the 7 different NN topologies. We point out that the term "topology" in our research identifies the particular NN structure (architecture) with a specific fixed distribution of hyper-parameters. We emphasize that, in contrast to other similar investigations, we employ the Bayesian optimization procedure [9] for each NN type studied. This provides the optimal distribution of hyper-parameters pertaining to each NN type, such that we identify the best functioning regime (in terms of the performance delivered) for each architecture without complexity constraints. We show that the new CNN+biLSTM combination performs better than all other studied types. For each NN type considered hereafter, we also present the analytical expressions for the complexity, i.e. the number of multiplications attributed to each specific NN per recovered symbol. The highest complexity for the optimized NN equalizers corresponds to the new CNN+biLSTM composition that also renders the best performance.
The completely new subject in the remit of this manuscript is what happens when we restrain the complexity of different NN types: no work has previously addressed the comparison of the performance rendered by different NNs considering the identical levels of computational complexity. Our findings demonstrate a nontrivial behavior: while at the relatively high complexity levels the best performing model is the CNN+biLSTM, when we constraint the complexity to lower values, the simple MLP equalizer outperforms the advanced NN structures with the same complexity. Nevertheless, we notice that the goal of this paper is not to reach a broad conclusion about the trade-off between the complexity and performance for all possible transmission scenarios; rather, we aim at emphasizing the importance of accounting for this issue in the equalizer design stage, and we provide the tools for one's correctly assessing the DSP-type complexity of the most popular neuro-layers.
The paper is organized as follows. In Sec. II we describe the details of the different NN equalizers analyzed in our study. Sec. III presents how to compute the computational complexity on all NN-based equalizers considered in this paper. Sec. IV describes the experimental setup and contains the results, including the comparison between the performance and computational complexity of different NN topologies; the performance is also compared with the digital back-propagation with 3 steps per span. Our findings are summarized in the conclusion.

II. A ZOO OF NEURAL NETWORK-BASED EQUALIZERS
In this section, we revisit the most popular NN architectures that have been proposed and investigated so far in coherent optical channel post-equalization. We also introduce two new composite NN equalizer structures that can be deemed as the extension of previously proposed NN configurations.
To enhance the reproducibility of our methods, we provide a thorough summary of each NN architecture. The code of the algorithms implemented in Python 3.6.9 with TensorFlow (2.2.0) GPU backend and Keras (2.3.1), is provided in Zenodo [25].
Before addressing the details of the NN-based equalizers, let us describe how the datasets used in this work are created. When dealing with the optical channel equalization, we require the NN to process not only the symbol of interest but also the neighboring ones insofar as both the chromatic dispersion and the drive amplifier add the memory to the channel. The latter means that the NN performs better if it is given information about the correlations between the symbols in the sequence. Therefore, the input of the real-value NN models used in this paper (in the regression task), is the time-domain vector delayed by k symbols (the memory vector) containing the real and imaginary parts of both polarizations for the symbol at the time-step k and its 2N neighboring (past and future) symbols. In the NN signal processing, due to the computational memory constraints the input layer receives just a portion of the total data, called the mini-batch, as far as the finite computational resources limit the length of the sequences with which we can operate. The NN input mini-batch shape can be defined by three dimensions: (B, M, 4), where B is the mini-batch size, M is the memory size defined through the number of neighbors N as M = 2N + 1, and 4 is the number of features for each symbol, referring to the real and imaginary parts of two polarization components. The output target is to recover the real and imaginary parts of the k-th symbol of one of the polarization, so the shape of the NN output batch can be expressed as (B, 2).
In general, for all the NNs considered in this paper, we use the mean square error (MSE) loss estimator, since this choice corresponds to the conventional loss function frequently used for the regression tasks [26]. The other types of loss functions such as the mean absolute error, the Huber Loss, and the Log-Cost loss, were also considered for our NNs, but they did not show any noticeable benefits compared to the MSE. Moreover, it is important to highlight that we decided to present just the regression task in this paper because (for our test case scenario) the results achieved by regression and classification algorithms were close, but some fewer epochs were needed in the case of regression to reach the lowest BER.
The classical Adam algorithm was chosen for the stochastic optimization step with the default learning rate equal to 0.001 [27]. All NNs were trained for at most 1000 epochs (if not stopped earlier because of negligible changes in the loss function value over 150 epochs) and, after every training epoch, we calculated the BER obtained using the independently generated testing dataset.
The dataset was composed of 2 20 symbols for the training dataset and 2 18 independently generated symbols for the evaluation. To eliminate any possible data periodicity and overestimation [28] in our experiment, a pseudo-random bit sequence (PRBS) of order 32 was used to generate those datasets with different random seeds for each of them. The periodicity of the data is, therefore, 2 10 times higher than our training dataset size, since the modulation format used in our study was the 16 QAM. For the simulation, the Mersenne twister generator [29], that has periodicity equal to 2 19937 − 1, was used with different random seed. Additionally, we highlight that the NN training data were shuffled using numpy.random.shuffle function in Python before feed-ing the dataset into the NN: such a shuffling helps to mitigate overfitting. The experimental setups and scenarios in which the datasets were acquired are described in the following sections.
The following subsections will delve deeper into the design of the NN models used within this paper.

A. A multi-layer perceptron
The first and, perhaps, simplest and well-studied NN-based equalizer that we consider is the MLP, proposed for the shorthaul coherent system equalization in [22] and the long-haul systems in [30]. The MLP is a deep feed-forward densely connected NN structure that handles the I/Q components for each polarization jointly, providing two outputs for each processed symbol: its real and imaginary parts. Due to the MLP's ability to process joint I/Q components, the equalizer can learn the nonlinear phase impairments in addition to the amplitude-related nonlinearities. When using the MLP, the channel and device-induced memory effects are taken into account by incorporating the time-delayed versions of the input signal, as was carried out in [30].
In a simulation environment, the MLP equalizer showed performance metrics similar to those delivered by the "traditional" digital back-propagation (DBP) with 2 steps-per-span and 2 samples-per-symbol at 1000 km of standard singlemode fiber [30]. In our current paper, we use the same 3layer MLP as in [22], but in our case here the number of neurons and the activation function optimized for each layer. Importantly, the number of layers in MLP, which is 3, has been found as optimal for our particular transmission scenario by the Bayesian optimizer (BO). However, this MLP topology rendered the BO, can alter essentially for different transmission scenarios.
The general equation, in a matrix form, describing the output vector y given the input x passing through the 3-layer MLP, is: where x is the input vector with n i elements, y is the output vector with n o elements, φ is a nonlinear activation function, W n1 ∈ R ni×n1 , W n2 ∈ R n1×n2 , W n3 ∈ R n2×n3 and W out ∈ R n3×no are the real weight matrices of the respective dimensions participating in each layer of the MLP, b 1,2,3 are the bias vectors, the indexes n 1,2,3 stand for the number of neurons in each hidden layer, and × in (1) is the matrix-vector convolution.

B. Long short-term memory NNs
Compared to static (memoryless) systems where the MLPs can be efficient, the time sequences usually ought to be approached dynamically. Thus, recurrent NNs (RNNs) are often favored over other NN models for time sequences. However, training the recurring connections can be a much more complicated task compared to the MLP training, so that the network weights are usually changed almost imperceptibly. This aspect of RNNs often leads to the well-known vanishing gradient problem [26], [31]. The LSTM networks were built to solve it and to harness the memory-related effects. The LSTM comprises a gateway architecture that includes three gate types: the input (i t ) gates, the forget (f t ) gates, and the output (o t ) gates, as shown in Fig. 1. The compact form of the forward pass LSTM cell equations for a time-step t, (i.e. when we process the input feature sequence x t having the size n i ) is [32], [33]: where C t is the cell state vector, h t is the current hidden state vector of the cell with size n h and h t−1 is the previous hidden state vector. Note that n i is equal to the number of features, and n h is the number of hidden units that will be chosen in the design process. The trainable parameters of the LSTM network are represented by the matrices W ∈ R n h ×ni and U ∈ R n h ×n h with the respective upper indices i, f , o, and c, referring to the particular LSTM gates mentioned previously. More details are given in Fig. 1. In (2), is the elementwise product, and σ denotes the logistic sigmoid activation function. The aim of the input i-gate is to store the content to the cell; the forget f -gate defines what information is to be erased; the output o-gate defines what information has to be passed to the next cell.  (2) for one time-step. The arrows represent the "flow" of respective variables (the blue/green ones refer to the previous state and current input), the rectangles identify the nonlinear functions, while the symbols in circles identify the respective mathematical operations.
What makes the usage of the LSTM a dynamical approach is: the time sequence is processed by the array of LSTM cells ranging over the t-interval of interest, which is the memory size in our case. Besides the "dynamical" LSTM property, the bidirectional LSTM (biLSTM) provides a more robust solution for time series since with the bidirectional structure, we are learning the weights from the past visible values to the future hidden values, and that corresponds to our learning which features of the past values are useful for a particular symbol value prediction [34]. In the optical channel equalization context, the key advantage of biLSTM is that it can efficiently handle intersymbol interference (ISI) between the preceding and the following symbols.
In the context of channel equalization, the LSTM was suggested in [35], [36] to reduce the transmission impairments in IM/DD systems with pulse-amplitude modulation (PAM). The LSTM-based approach was developed further in [17], where, for the first time, the biLSTM was used in an optical coherent system to compensate for fiber nonlinearities, but only in a simulation environment. Additionally, it was shown that the biLSTM also outperformed a low-complexity DBP [17]. More recently, a bi-vanilla RNN was applied as well for the softdemapping nonlinear ISI [10]. In our current study, we use a similar structure as in [17] , where the NN model is made up of a bidirectional LSTM layer followed by a dense layer. Finally, we note that, in contrast to the previous studies where the grid search was executed to guess the optimal number of hidden unities and memory size, this paper uses the BO to identify the best-performing biLSTM structure [9].

C. Echo state networks
The ESN is a promising type of RNNs due to its relaxed training complexity and its ability to preserve the temporal features from different signals over time [21], [37]- [40]. The ESNs are in the reservoir computing (RC) category because in the ESNs only the output weights are trainable. In Fig. 2, the grey-colored area is the reservoir "main" structure containing the randomly connected "neurons" that capture the time features of the signal, while the output weights are trained to define which states are more relevant to describe the desired output. In this paper, we use the concept of leaky-ESN [41] containing no output feedback connections. Our motivation to choose the leaky-ESN architecture is that there was an experimental observation that the leaky-ESN configuration outperforms the traditional ESN in feature extraction for noisy time series [38]. The latter is, evidently, an important property in optical transmission-related tasks. The leaky-ESN is formalized for a certain time-step t, as follows: where s t ∈ R Nr is the system state at time-step t, N r is the number of hidden neurons units in the dynamic layers, which represents the dimensionality in the reservoir; x t ∈ R ni and y t ∈ R no are the input and the output vector of the ESN, respectively; W r ∈ R Nr×Nr is a reservoir weight matrix that defines which neuron units are connected (including the selfconnections); this matrix is also characterized by a sparsity parameter s p defining the ratio of connected neurons to the total possible connections number. Finally, W in ∈ R Nr×ni is the input weight matrix, µ is the leaking rate parameter, and W o ∈ R no×Nr is the output weight matrix which is the only one that is trainable using a regression technique. This training phase in ESN does not affect the dynamics of the system, which makes it possible to operate with the same reservoir for different tasks [37]. A schematic representation of a leaky-ESN, including the sequential input, dynamic, static, and output layers, as depicted in Fig. 2. The signal passing through the dynamic layer in Fig. 2 is represented by (3), and this layer is the core of the reservoir structure. Then it is followed by a static layer, represented by (4), which incorporates the leaky-ESN behavior through accumulating (integrating) its inputs, but it is also losing exponentially (leaking) accumulated excitation over time. Finally, the output layer defines which units are relevant to the description of the current task (for the equalization, in our case), and it is described by (5).
Concerning the previous ESN applications for optical channel equalization [39], the ESN was implemented in the optical domain for the distortions' mitigation: a 2 dB gain in Q 2factor was achieved for 64-QAM 30 GBaud signals transmitted through 100 km fiber at 10 dBm input power. In addition, the same as it is in our paper, the reservoir can be applied in the digital domain. In [21], the leaky-ESN was successfully applied after the analog-to-digital converter to enable 80 km transmission to reach below KP4-FEC limit [42] for a 32 GBd on-off keying signal.

D. Convolutional neural networks
Due to their feature extraction propensity [26], the CNNs have become one of the most commonly used NN structures in such areas as 2D image classification and 3D video applications [43], [44]. Convolution layers have also been found efficient in the analysis of temporal 1D sequences with several applications to time series sensors, audio signals, and natural language processing [45], [46]. For longer sequences, the CNN layer can be used as a pre-processing step due to its ability to reform the original sequence and extract its high-level features used for further processing cycles [24].
Here, we investigate, for the first time, two new models for the equalization of signal distortions in metro systems, combining a 1D convolutional layer performing the effective signal pre-processing with two previously proposed NNbased equalizers: the MLP described in Sec. II-A, and the biLSTM, Sec. II-B. These new structures, CNN+biLSTM and CNN+MLP, are addressed in our study because it was shown that the convolution layers are efficient in image denoising [47] and array signal processing [48], where the CNNs can reduce the background and quantization noise effects on coded signals. Therefore, we can naturally surmise that in our model the first convolutional layer can enhance the received signal by removing a part of the embedded noise before it enters the next neural layer. Also, generally, by adding the CNN layer, we end up with a NN model with less trainable parameters without losing performance, which can be yet another advantage. To that end, in the current study, we analyze how the combined NN architectures work for the optical channel equalization task. A simplified CNN+MLP combination was already successfully used in [24] at the transceiver for the high-baud-rate 80 km system.
The convolutional layer is primarily characterized by three key parameters: the number of filters, the size of its kernel, and the layer activation function. The extracting functionality is achieved by applying n f filters, sliding over the raw input sequence, and generating the number of output maps equal to n f , with a fixed kernel size n k . The convolutional layer is constructed as a squash function, which means that the input is mapped to a lower-dimensional representation, in which only the main (or desirable) characteristics are retained. Since the CNNs were mainly developed in the context of image recognition and spacial feature extraction, other parameters such as padding, dilation, and stride, are also used in the design of the convolutional layers. Considering that the input shape is (B, M, 4), the output shape after the CNN layer with all those parameters is defined as (B, L out , n f ), where the parameter L out is the function of the CNN hyper-parameters and defined as: However, in this paper, we will not focus on the investigation of those additional parameters. Consequently, we fix the default convolutional layer configuration with the padding equal to 0 (which corresponds to "valid" in Keras), the dilation equal to 1, and the stride equal to 1. Then, the input-output mapping of the convolution layer for this configuration can be described as follows: where y f i is the output feature map of the i-th input element produced by filter f in the CNN layer, x in is the input raw data vector, k f j is the j-th trained convolution kernel of the filter f , and b f j is the bias of the filter f . Further, n is the feature index of the kernel and input data, ranging from 1 to n i , corresponding to the number of features in the data; φ, as before, denotes the nonlinear activation function used in the convolutional layer. Note that (7) is true for all i ∈ [1, ..., L out ]. Moreover, since the pooling layer captures only the most important features in the data and ignores the less important ones [49], the pooling discretization process is not used in our equalizers to avoid the downsampling of feature sequences.
The output collection of feature maps, y f , emerging from the convolutional layer, is then fed into one of the structures described above: either into two dense layers (MLP, where the number of layers is, again, dictated by the BO), forming the CNN+MLP structure or into the one biLSTM layer, resulting in CNN+biLSTM. We recall that we use the convolutional layer before the following layers to extract the middle-level locally invariant features from the input series.
Here we mention that even the CNNs alone are extremely powerful deep learning instruments that have a complicated multiparametric structure that combines filters, kernel size, padding, stride, dilation, and pooling. However, having performed an exhaustive experimental exploration, we observed that deep CNNs have not reached the substantial performance level, like the one achieved by CNN+ MLP or CNN+biLSTM in our test case. Therefore, in this work, we utilize the convolution layers as pre-processing feature-extracting step and do not include deep CNN architectures in our current study. In this section, the computational complexity in terms of real multiplications per recovered output symbol is examined for all introduced NN architectures. We notice that the number of additions is typically neglected for such estimation in ordinary DSP techniques [50]. The major reason for this is that the typical algorithms for multiplying two integers with n digits have a computational complexity of O(n 2 ), whereas adding the same two numbers has a computational complexity of Θ(n) [51]. As a result, due to dealing with float values with 16 decimal digits, multiplication is by far the most timeconsuming part of the implementation procedure.
Here we point out that the training complexity will not be considered since we evaluate the real-time computation complexity (evaluation phase), which is the most critical part, while the training of a NN equalizer is carried out offline (calibration phase). Also, the computational complexity of nonlinear activation functions is not considered in our framework, due to the fact that typically their operation is based on an approximation approach, rather than on direct multiplicative calculation. In the classical lookup tables-based (LUTs) approximation method, direct mapping can be digitally implemented with much fewer computations required to apply such activation functions [52], [53].
Early works presented the results regarding the complexity of the MLP [30], RNN [54], and LSTM [36] layers. However, to enhance the understanding of this subject and clarify it, in our work we directly relate those complexities to the parameters of the most widely used machine learning platforms (Keras, TensorFlow, and PyTorch) without losing generality, and specifically addressing the composite NN types described before. Let the mini-batch size be B, n s be the input time sequence size, with n s = M , where M is the memory size (see also Sec. II), and n i be the number of features, which in our case is equal to 4. Since we recover the real and imaginary parts of each symbol, the number of outputs per symbol, n o , is equal to 2. For ESN, biLSTM, and CNN layers, as they require inputs in the form of tensors of rank 3, the input of the NN equalizer can be parametrized as [B, n s , n i ], the three numbers defining the dimensions of the input tensor, as mentioned above. The parametrization for the MLP equalizer is simpler, with [B, n s · n i ] defining the dimensions of the 2D tensor input. We use flattening layers when it was necessary to reduce the dimensionality of the data.
In this case, considering three dense layers with n 1 , n 2 , and n 3 neurons, respectively, the complexity C MLP of the resulting NN is given by: where a 1 is the contribution of the input layer, b 1 is the contribution of the hidden layer, and c 1 refers to the contribution of the output layer. The subindex "1" in a, b, and c explicitly associates these parameters with the MLP architecture The next part presents the computational complexity for an NN-based equalizer composed of a biLSTM layer. Assuming that the biLSTM layer has n h hidden units, the complexity of such a NN is given by: where a 2 is the contribution of the only layer, while the subindex "2" attributes the number a to the biLSTM. This expression is easier to understand if we analyze the mathematical description of the LSTM cell, see (2) and Fig. 1. We have several contributions to the cell's complexity. In the first layer we have 4n i n h multiplications associated with the input vector x t . Then, 4n 2 h multiplications are due to the operations with the previous cell output h t−1 . Afterward, 3n h and n o n h multiplications due to the internal multiplications identified with and involving the current cell output (h t ) going into the output layer, respectively, are added. Lastly, we multiply the number of operations by the number of time steps in the layer, n s . Since the topology is bidirectional, the total contribution is also multiplied by 2.
Following Sec. II, now we address the computational complexity associated with the ESN equalizer. Before presenting the respective expression, it is important to emphasize two aspects. First, the implementation of the ESN in the digital domain does not benefit from the fact that only the output layer weights are trainable, since, as mentioned previously, the training is not a key bottleneck as it is carried out during the offline calibration process. Second, the complexity of the ESN can potentially drop drastically if we implement it in the optical domain as an ESN dynamic layer, as it was noted in [39]. However, in this paper, we analyze the ESN implementation in the digital domain, similarly to [21].
Considering the leaky-ESN definition given by (3)-(5), the computational complexity of this equalizer can be expressed as: In the expression above, a 3 represents the contributions of (3), where the input layer adds n i N r multiplications whereas the dynamic layers add N 2 r s p . b 3 refers to the contributions of (4) describing the static layer, and c 3 represents the multiplications in the output layer, (5). This overall process is repeated for all n s time steps. Note that in the case of a potential optical implementation of the ESN, a 3 and b 3 would be equal to zero, and only the final weights would be learned in the digital domain.
Finally, let us address the complexity of the composite structures: CNN+MLP and CNN+biLSTM. The computational complexity of a 1-D convolutional layer is described as: C CNN = n i n f n k n s + 2 padding−dilation(n k − 1)−1 stride +1 (11) However, we assumed (7) that the convolutional layer is defined by the number of filters n f , the kernel size n k , and that the number of time steps n s ≥ n k , according to (6) the output size for each filter of the CNN is (n s − n k + 1). (12) and (13) are the expressions for the complexity of a convolutional layer combined with two dense layers or one biLSTM layer, respectively: C CNN+MLP = n i n f n k (n s − n k + 1) a4 + (n s − n k + 1)n f b4 n 1 + n 1 n 2 + n 2 n o c4 , In this scenario, the two-layer MLP has n 1 and n 2 neurons in each respective layer, and the biLSTM layer has n h hidden units. In the equations above, a 4 and a 5 are the contributions of the convolutional layer, b 4 is the correction factor for the transition between layers since the flattening layer was placed before the dense layers; b 5 is the number of time-steps for the following biLSTM layer; c 4 is the contribution of the twolayer MLP; and c 5 is the contribution of the biLSTM layer where, in this case, the number of filters, n f , is equal to the number of features entering the LSTM cell. Finally, we would like to express the computational complexity of the DBP-based receiver used in this paper for benchmark purposes. We considered a basic implementation of the DBP algorithm [55], where each propagation step comprises a linear part for dispersion compensation followed by a nonlinear phase cancellation stage. The linear part is achieved with a zero-forcing equalizer by transforming the signal in the frequency domain and multiplying with the inverse dispersion transfer function of the propagation section. The complexity of the DBP in terms of RMpS is [30], [50]: (14) where N step is the number of steps per span used, N FFT is the FFT size, n is the oversampling ratio, and N D = τ D /T , where τ D corresponds to the dispersive channel impulse response and T = 1/R s is the symbol duration. We have considered that N FFT = 256 and τ D defined as: where f c is the optical carrier reference frequency that in our case was 193.41 THz, c is the speed of light, L span is the span length and D is the fiber dispersion parameter.

IV. PERFORMANCE VERSUS COMPUTATIONAL COMPLEXITY TRADE-OFF ANALYSIS
In this section, we initially describe the numerical and experimental scenarios used in this paper to analyze and compare the functioning of the equalizers detailed in Sec. II. After that, the two types of analysis for our set of NN structures are carried out. First, we present the maximum performance improvement (in terms of Q-factor gain compared to the non-equalized case) that each equalizer can deliver and compare this gain to the respective computational complexity corresponding to each optimized equalizer. Then, we decrease the computational complexity of six NN topologies from Sec. II and present the gain improvement provided by each NN-equalizer when all NNs have approximately the same computational complexity. This enables us to investigate the dependence of optical performance on the computational complexity and to identify which equalizer is better for a certain complexity level.

A. Experimental and numerical setups
The setup used in our experiment is depicted in Fig. 4. At the transmitter, a DP-16QAM 34.4 Gbaud symbol sequence was mapped out of data bits generated by a 2 32 − 1 PRBS. Then, a digital RRC filter with roll-off 0.1 was applied to limit the channel bandwidth to 37.5 GHz. The resulting filtered digital samples were resampled and uploaded to a digitalto-analog converter (DAC) operating at 88 Gsamples/s. The outputs of the DAC were amplified by a four-channel electrical amplifier which drove a dual-polarization in-phase/quadrature Mach-Zehnder modulator, modulating the continuous waveform carrier produced by an external cavity laser at λ = 1.55µm. The resulting optical signal was transmitted over 9×50 km spans of TWC optical fiber with EDFA amplification. The optical amplifier noise figure was in the 4.5 to 5 dB range. The parameters of the TWC fiber -at λ = 1.55µm -are: attenuation coefficient α = 0.23 dB/km, dispersion coefficient D = 2.8 ps/(nm·km), and effective nonlinear coefficient γ = 2.5 (W · km) −1 .  At the RX side, the optical signal was converted into the electrical domain using an integrated coherent receiver. The resulting signal was sampled at 50 Gsamples/s by a digital sampling oscilloscope and processed by an offline DSP based on the algorithms described in [56]. Firstly, the bulk accumulated dispersion was compensated using a frequency domain equalizer, which was followed by the removal of the carrier frequency offset. A constant-amplitude zero-autocorrelationbased training sequence was then located in the received frame and the equalizer transfer function was estimated from it. After the equalization, the two polarizations were demultiplexed and the signal was corrected for clock frequency and phase. Carrier phase estimation was then achieved with the help of pilot symbols. Thereafter, the resulting soft symbols were used as input for the NN equalizers. Finally, the pre-FEC BER was evaluated from the signal at the NN output.
With regard to simulation, we mimic the experimental transmission setup 1 . The optical signal's propagation along optical fiber was simulated by solving the Manakov equations via the split-step Fourier method (with a resolution of 1km per step). Every span was followed by an optical amplifier with noise figure NF = 4.5 dB, which fully compensates fiber losses and adds amplified spontaneous emission noise. At the receiver, after full electronic chromatic dispersion compensation (CDC) by the frequency-domain equalizer and downsampling to the symbol rate, the received symbols were normalized to the transmitted ones. Finally, we added Gaussian noise to the signal representing an additional transceiver distortion that we may have in the experiment, such that the Q-factor level of the simulated data matched the experimental one. The system performance is evaluated in terms of the Q-factor, defined as: Q = 20 log 10 √ 2 erfc −1 (2 BER) .

B. Optimized NN-based architectures
In this section, we show the maximum achievable Qfactor for all equalizers without constraining the computational complexity. The Bayesian optimization (BO) tool, introduced in [9] for optical NN-based equalizers, was implemented to identify the optimum values of hyper-parameters for each NN topology, which provides the best Q-factor in the experimental test dataset. As it was recently shown, the BO render superior performance compared to other types of search algorithms for machine learning hyperparameter tuning [57]. The same topologies (without further optimization) were tested for the numerical analysis as well. The search space used in the BO procedure was defined via the allowed hyper-parameters intervals: In Table. I, the line marked with the "Best Topology" label, summarizes the hyper-parameters obtained by BO. These values are used to count the real multiplications per symbol recovery (complexity), and to assess the equalizers' 1 We consider a DP-16QAM, single-channel signal at 34.4 Gbaud preshaped by an RRC filter with 0.1 roll-off transmissions with an upsampling rate of 8 samples per symbol (275.2 GSamples/s) over a system consisting of 9×50 km TWC-fiber spans TABLE I: Summary of the complexity attributing to each NN equalizer topology: the topology type is identified in the leftmost column. The complexity corresponding to each topology and the NN type is expressed in terms of real multiplications per symbol recovered (RMpS), highlighted in red. In this table, we also depict the hyper-parameters distributions found by the BO: the cell marked as "Best topology" and other 6 topologies (Topologies from 1 to 6, referring to the increasing complexity threshold number) for the study of complexity versus performance. In addition, for all topologies, the values of n s , n i and n o were: 41, 4 and 2, respectively, and these are not reported in the performance expressed via the Q-factor gain, Fig. 5. Note that for all equalizers, the same optimal number of taps found by the BO was N = 20, which means that the memory in our equalizers is M = 41 and the mini-batch size, B, is equal to 4331. Moreover, for the ESN, the BO found the best value µ = 0.57, and the optimal spectral radius equal to 0.667. The activation functions found for every hidden NN layer are summarized as following: 1D-CNN layer -'linear activation function followed by  The results obtained by using the numerical synthetic data are presented in Fig. 5a. First, the CNN+biLSTM turned out to be best-performing in terms of the Q-factor gain: it achieved a 4.38 dB Q-factor improvement when compared to the conventional DSP algorithms [56], 0.05 dB when compared to the biLSTM equalizer level, 0.47 dB when compared to the CNN+MLP equalizer level, 1.4 dB when compared to the MLP equalizer level, and 3.96 dB when compared to the ESN equalizer level. Second, when adding the convolutional layers to MLP and biLSTM, we observed the improvement in terms of the number of epochs needed to reach the highest performance: the single-layer biLSTM required 119 epochs, while the CNN+biLSTM reduced this number to 89 epochs; the MLP itself needed 214 epochs to reach the best performance level, and the CNN+MLP required just 100 epochs. Thus, we conclude that the addition of a convolutional layer indeed renders the enhancement in the NN structure's performance and assists in the training stage.
When considering how NN equalizers function with the experimental data, Fig. 5b, we can mention two major observations. First, similarly to the numerical results, the CNN+biLSTM is best-performing among all the considered NN structures in terms of the Q-factor gain. The CNN+biLSTM demonstrated a 2.91 dB improvement when compared to the conventional DSP, 0.15 dB when compared to the biLSTM equalizer, 0.61 dB when compared to the CNN+MLP equalizer, 0.96 dB when compared to the MLP equalizer, and 2.33 dB when compared to the ESN equalizer. Additionally, as was also observed in the numerical analysis, a lower number of training epochs was necessary to reach the best performance point when we add a convolutional layer: using the CNN+biLSTM we needed 169 epochs, while for the pure biLSTM this number was 232 epochs; the number of epochs required for the CNN+MLP to reach the best performance was 107, and for the pure MLP it was 753 epochs. Second, compared to the simulation, the overall gain of all NN-based equalizers is slightly reduced. This can be explained by the existing "reality gap" between the numerical model and the true experimental transmission results. In real transmission, extra nonlinearity and non-ideally behavior of the transceivers (signal clipping by the ADC/DAC, harmonic and intermodulation distortions of the driver amplifier (DA), I/Q skew, etc.) add extra noise and complexity to the process of channel inversion. We believe that with just the split-step method, the NNs can unroll the synthetic propagation effect more easily than reverting the actual propagation in the experimental condition. We also point out that even though the gain numbers are different in the numerical and experimental data, the NN structures' performance followed the same pattern for both numerical and experimental data: the best performance was attributed to the CNN+biLSTM, the next level performance pertains to the biLSTM, followed by the CNN+MLP, the MLP and, finally, the ESN.
Finally, of all equalizer types investigated in this study, the DBP 3 StPS applied with two samples per symbol was still the least complex method. In all simulation and experiment test cases, however, the CNN+biLSTM outperformed the DBP, as shown in Fig. 5. Even by optimizing the DBP's nonlinear coefficient parameter (γ), the DBP approach was able to enhance the Q-factor by 1.32 dB, whereas the CNN+biLSTM equalizer improved by 2.91 dB, in the experimental case. The boost in performance by the CNN+biLSTM relative to the DBP in the experiment scenario demonstrated the NNequalizer power in mitigating transmission impairments in a practical application.

C. Comparative analysis of different NN-based equalizers with the fixed computational complexity
The analysis given above does not address the question of which NN topology would provide the best gain if we restrict the NN structure's complexity to a certain level. To answer this question, we retested the equalizers constraining the total number of real multiplications per recovered symbol (RMpS). We considered the complexity values in the range from 10 3 to 10 8 RMpS. We note that the NN structures with large RMpS (∼ 10 8 ) can be prohibitively complex for efficient hardware implementation. However, Ref. [58] demonstrated an efficient   Table I in the cells marked from ''Topology 1" to "Topology 6". The parameters of those topologies were also tuned by the BO: for each case, we reduced the allowed BO search range to comply with each computational complexity constraint.
As seen in Fig. 6, for different allowed computational complexity levels, the performance ranking of equalizer types changes. Several conclusions can be drawn analyzing the results emerging from the simulated (Fig. 6a) and experimental (Fig. 6b) data. First, in the experimental scenario, the best complexities corresponding to the maximum gain coincide with the complexities identified by the BO procedure, which confirms the effectiveness of the BO in finding the "right" NN architecture. Second, in simulations, the maximum performance is reached already at a lower complexity level compared to the experimental results. As it can be seen from the experimental figure, the CNN+biLSTM, CNN+MLP, and biLSTM equalizers need ≈ 10 7 RMpS, while in the simulation ≈ 10 6 RMpS was already enough to achieve the best performance. This observation further confirms that the NN can cope with the reversion of the simulated channel more easily than with the reversion of experimentally obtained data. Third, when we increase the complexity above the level determined by the BO, the gain remains nearly constant: this is due to overfitting and it is particularly pronounced in the MLP scenario. The key concept of the function approximation capability of the MLP belongs to its number of i) feed-forward hidden layers and ii) hidden neurons; these two parameters define the NN's capacity [59]. Changing, the MLP's capacity by adjusting the complexity levels frequently leads to unpredictable changes in the NN's performance. Starting at the 10 5 complexity level for both simulation and experimental layouts, we can see that MLPs with oversized capacity suffer from overfitting, as the network memorizes the properties of the training set in such detail that it can no longer efficiently recover the information from the inference dataset [59]. The latter blockades the equalizer from providing further Q-factor improvement. Thus, we argue that the architectures found by the BO identify the most appropriate NN equalizer's capacity (structure) matching our problem, and a further increase in complexity cannot render any noticeable performance improvement.
Next, we note that for the high level of RMpS (Topologies 4, 5, and 6), the best-performing equalizer is the CNN+biLSTM. However, once we reduce the number of real multiplications from Topology 3 and below, the best-performing equalizer turns out to be the traditional MLP. This can be explained by the fact that advanced architectures, such as CNN and biLSTM, require more filters and a higher number of hidden units, respectively, to learn the complete dynamics of the data. Also, we observe that the CNN+biLSTM performs similarly to the CNN+MLP at low complexity levels (orange and yellow curves in Fig. 6), and similarly to the biLSTM (blue line) at high complexity. Consequently, we can infer how the addition of a convolutional layer works: while for high complexity the blue and orange curves are approximately the same, at a lower allowed complexity level the CNN+biLSTM performs better.
In addition, we used the hatched blue zone in both simulation and experimental cases, for the traditional DBP with 3 StPS, to highlight the performance of the NN equalizers with similar CC of the DBP. Then it is evident that reducing the number of neurons, filters, and hidden units is not the optimal technique to achieve low complexity architectures, because performance fell below the DBP level. As a possible alternative, pruning and quantization techniques [60], [61] can be used to minimize the CC of the NN equalizers without compromising their performance, making the NN equalizers appealing not only for their good performance but also for their decreased complexity.
Finally, the performance shown by the ESN does not meet the expectations, showing the lowest achievable gain number. However, [62] contains the results explaining the poor ESN performance for the nonlinear wireless scenario. It was shown that in the channel with a high level of noise, the ESNbased equalizer was found to perform poorly. Furthermore, they demonstrated that by increasing the ESNs' number of neurons (i.e., the complexity), and, thus, effectively increasing the hidden dimensionality of the representation, equalization performance worsens. Moving to the nonlinear optical channel equalization, we observed both aforementioned effects: the performance was relatively poor due to the high level of noise, and the performance did not improve when we increased the complexity, as observable in the green curve of Fig. 6.

V. CONCLUSION
In this paper, we proposed and examined novel designs of combined NNs: (a) CNN+MLP and (b) CNN+biLSTM for the equalization of coherent optical fiber channels. We reviewed and compared several key existing NN-based methods with the proposed new algorithms using both the numerically simulated synthetic data and the experimental data from the benchmark transmission system. One of the most relevant outcomes of our work lies in the reported analytical expressions for the complexity (the number of real multiplications) associated with each NN type considered in the paper. Although a comparative analysis has been carried out for a specific benchmark system, we believe that our findings are relatively generic and can be applied to other scenarios.
Fiber Kerr nonlinearity was the predominant source of signal deterioration in the experimental benchmark system for comparing different channel equalizers. In order to produce clear nonlinear signal distortions, we used a low dispersion TWC fiber and processed the data at 2 dBm signal launch power. We emphasize that the trade-off conclusions for each NN equalizer's performance and complexity are unique to the system under consideration in this paper. However, we believe that our research paves the way for a methodology for estimating the computational cost of various NN-based channel equalizers.
We described in detail the design of the selected most promising NN-based equalizers. To derive the best-performing NN structures, we utilized the Bayesian optimization of each NN type that provides the optimized set of hyper-parameters for each particular NN-based equalizer. For these optimized structures, we found that the best performance of the test system was rendered by the new CNN+biLSTM architecture, though the performance of the pure biLSTM was only slightly lower. However, the optimized CNN+biLSTM design corresponded to the highest complexity among all cases studied.
The important part of the analysis was the comparison of the performance under the condition of the restricted complexity: the respective results are given in the last section. We found that at high complexity levels the best-performing NN among studied cases is the CNN+biLSTM. However, when reducing the complexity, we observed the transition: when the allowed complexity is relatively low, the best-performing structure turned out to be the simple MLP. We can explain this behavior as follows: the advanced architectures (the CNN and biLSTM) require more complexity-hungry components (filters or hidden units) to learn the data dynamics, while the MLP is less demanding using just the summation and activation functions at the basic level. Overall, we conclude that the addition of the convolutional layer can be beneficial if we do not restrain the complexity. However, the important message is that complexity can play an important and even crucial role in the hardware implementation of the NN equalizers. Our analysis demonstrates that even the simple NN structures, like the MLP, can outperform the more advanced counterparts when the complexity is constrained to relatively low levels.