Assessment of Cross-trained Machine Learning Techniques for QoT Estimation in Agnostic Optical Networks

With the evolution of 5G technology, high definition video, virtual reality, and the internet of things (IoT), the demand for high-capacity optical networks has been increasing dramatically. To support this capacity demand, low-margin optical networks are attracting operator interest. To serve this techno-economic interest, planning tools with higher accuracy and accurate models for quality of transmission estimation (QoT-E) are needed. However, considering the heterogeneity of state-of-the-art optical networks, it is challenging to develop such an accurate planning tool and low-margin QoT-E models using the traditional analytical approach. Fortunately, data-driven machine-learning (ML) cognition provides a promising path. This paper reports the use of cross-trained ML-based learning methods to predict the QoT of an un-established lightpath (LP) in an agnostic network, based on data retrieved from the already established LPs of an in-service network. This advance prediction of the QoT of an un-established LP in an agnostic network is a key enabler not only for the optimal planning of such a network but also for automatically deploying LPs with a minimum margin in a reliable manner. The QoT metric of the LPs is defined by the generalized signal-to-noise ratio (GSNR), which includes the effects of both amplified spontaneous emission (ASE) noise and non-linear interference (NLI) accumulation. Real field data is mimicked by using the reliable and well-tested network simulation tool GNPy. Using the generated synthetic data set, supervised ML techniques such as the wide deep neural network, deep neural network, multi-layer perceptron regressor, boosted tree regressor, decision tree regressor, and random forest regressor are applied, demonstrating the GSNR prediction of an un-established LP in an agnostic network with a maximum error of 0.40 dB. © 2020 Optical Society of America under the terms of the OSA Open Access Publishing Agreement


Introduction
The global data traffic demand will experience a dramatic increase over the next few years [1], driven by the implementation of 5G technology and the expansion of bandwidth-hungry applications such as high definition video and virtual and augmented reality content. To sustain this remarkable rise in IP traffic, network operators demand the full exploitation of the residual capacity of the already deployed network infrastructure. To fully exploit this residual capacity, the data transport layer needs to be driven to its maximum available capacity. The primary enablers for optimal data transport exploitation are the dense wavelength division multiplexed (DWDM) transmission technique and network disaggregation. These features kick off a road map for state-of-the-art technologies such as elastic optical networks (EONs) and the software-defined networking (SDN) paradigm in optical networks. The striking features of EON and SDN in optical networks are the dynamic and adaptive provisioning of network resources in both the control and data planes. In the data plane, the EON paradigm [2,3] has explored a unique optical network architecture able to provide LPs based on the actual traffic demands. This flexibility makes the LP provisioning problem much more challenging than in traditional fixed-grid wavelength division multiplexing (WDM) networks. This prior knowledge of the QoT can be used for agnostic network planning and for wavelength assignment in the online scenario.
The remainder of the paper is organized as follows. In Section 2, we briefly describe the abstraction of the physical layer, along with the argument that an accurate QoT-E has a key role in minimizing the margin. In Section 3, the simulation model, along with the synthetic data generation and analysis, is presented. In Section 4, the orchestration of the ML engine is presented. In Section 5, we illustrate the proposed ML methods in terms of QoT-E. Then, in Section 6, we describe the results in detail. Finally, conclusions and future research directions are drawn in Section 7.

Abstraction of optical transport network
Typically, an optical transport network is a structure of nodes connected in a mesh topology, where traffic requests are added/dropped or routed, as shown in Fig. 1(a). The topology links are bidirectional node-to-node fiber connections, deployed as one or more fiber pairs with a single fiber for each direction, generally amplified at specific span lengths using lumped and/or distributed amplification techniques: erbium-doped fiber amplifiers (EDFAs), optionally assisted by some distributed Raman amplification. Furthermore, topology links are commonly expressed as an OLS together with a specific controller that sets the working point of each in-line amplifier (ILA) and the spectral load fed at the input of each fiber span. The current state-of-the-art optical network exploits DWDM for the spectral usage of fiber propagation and coherent technology for optical transmission. Furthermore, the transport layer routing operations are performed using reconfigurable optical add/drop multiplexer (ROADM) technology. The spectral grid considered in DWDM technology can be either fixed or flexible, according to the ITU-T recommendations [24], which define the spectral slots for both grid architectures. Using either grid architecture, LPs are deployed, where the LPs are the set of possible connections between nodes according to the traffic requirements. Over each deployed LP, a polarization-division-multiplexed (PDM) multi-level modulation format is used for propagation from each source to its destination. During transmission, an LP suffers from several propagation impairments, such as amplifier noise added as ASE, fiber propagation effects, and ROADM filtering effects. Additionally, it has been widely shown that fiber propagation in an uncompensated coherent optical transmission system impairs the QoT of the operated LPs by introducing amplitude and phase noise [4,25-27].
The introduced phase noise is effectively compensated by the DSP module at the receiver using a carrier phase estimator (CPE) algorithm; this kind of noise needs to be considered only for very high symbol rate transmission designed for short reach [27]. In contrast, the amplitude noise, typically defined as the NLI, always impairs the performance. It is a Gaussian disturbance that accumulates with the ASE noise at the receiver. Finally, the ROADM filtering effects also apply some level of degradation to the QoT, generally considered as an extra loss.

GSNR as quality of transmission estimation metric
It is well accepted that the QoT metric for any candidate LP routed through a particular set of OLSs is given by the GSNR, which includes both the accumulated ASE noise and the NLI disturbance, defined in Eq. (1):

GSNR = P_Rx / (P_ASE + P_NLI) = (1/OSNR + 1/SNR_NL)^-1, (1)

where OSNR = P_Rx/P_ASE, SNR_NL = P_Rx/P_NLI, P_Rx is the power of the particular channel at the receiver, P_ASE is the power of the ASE noise, and P_NLI is the power of the NLI. Given a particular BER vs. OSNR back-to-back characterization of the transceiver, the GSNR precisely gives the BER, as has been extensively shown in multi-vendor experiments using commercial products [6]. The non-linear effects during fiber propagation generate P_NLI, which depends on the power of the particular channel and on the spectral load with a cubic law [4]. In this context, it is clear that for each particular OLS there exists an optimal spectral load that maximizes the GSNR [5]. Given a particular source and destination separated by N optical domains, each characterized by a GSNR_i, with i = 1, . . . , N, the overall QoT is given by Eq. (2):

GSNR_LP = ( Σ_{i=1..N} 1/GSNR_i )^-1. (2)

Analyzing the propagation effects on a given LP between a particular source and destination, we abstract the behavior as a cascade of optical domains, each introducing QoT impairments. Thus, besides the effects of the ROADMs, each LP experiences the cumulative impairments of all previously traversed OLSs, where each traversed OLS introduces some amount of ASE noise and NLI.
For QoT purposes, the OLS can be abstracted by a unique parameter defined as the SNR degradation that, in general, is frequency-dependent (GSNR_i(f)), provided the OLS controllers keep each OLS operating at its optimal working point. Hence, a network can be abstracted as a weighted graph G = (V, E) corresponding to the particular topology. The vertices V are the ROADM network nodes, while the edges E are the OLSs, with the GSNR_i(f) degradations as weights on the corresponding edges, as shown in Fig. 1(b). In particular, for an LP routed from source node A to destination node F through intermediate nodes C and E, the QoT is given by Eq. (3):

GSNR_LP = ( 1/GSNR_AC + 1/GSNR_CE + 1/GSNR_EF )^-1. (3)

Given this network abstraction, LPs can be provisioned between a particular source and destination with the minimum margin, which depends upon the GSNR of the corresponding route.
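As a sketch of the cascading rule in Eq. (2), the end-to-end GSNR of a lightpath can be computed from the per-OLS GSNR values by summing the inverse linear GSNRs; the function name and the 18/17.5/19 dB values below are purely illustrative, not values from the paper:

```python
import numpy as np

def cascade_gsnr(gsnr_db):
    """End-to-end GSNR (dB) of a lightpath traversing several OLSs.

    Per Eq. (2), the inverse GSNRs (in linear units) of the
    traversed optical line systems add up."""
    gsnr_lin = 10 ** (np.asarray(gsnr_db) / 10)   # dB -> linear
    total_lin = 1.0 / np.sum(1.0 / gsnr_lin)      # inverse-sum rule
    return 10 * np.log10(total_lin)               # linear -> dB

# Lightpath A -> C -> E -> F traversing three OLSs (illustrative values)
print(round(cascade_gsnr([18.0, 17.5, 19.0]), 2))
```

Note that the cascaded GSNR is always lower than the worst individual OLS, which is why the weakest link dominates the provisioning margin.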

Approaches for QoT estimation
In this section, we list different possible approaches to gaining knowledge of the OLS characteristics, each allowing a different reduction in the GSNR uncertainty.
In the first approach, the data available from the system and network components, such as the static characterization of devices (e.g., amplifier gain and noise figure in the frequency domain, connector loss, etc.), is used to implement an accurate QoT-E in vendor-specific systems. For this approach, a considerable number of analytical models are available for calculating the GSNR using such data and the characterization of the OLS components. However, this static data-based approach may not remain accurate, as the components gradually degrade with aging, leading to a progressively unreliable QoT-E after a certain period.
A second approach is based on telemetry data, considering only the present network status. Assuming agnostic OLS operation in an open environment, the OLS controller mainly depends on telemetry data collected from the optical channel monitor (OCM) and the EDFAs. In this approach, it is possible to utilize the telemetry of the network's current state to estimate an accurate QoT. In contrast to the previous approach, this one does not rely on the static characterization of device parameters and thus removes the uncertainty in the QoT-E accuracy due to the device aging discussed above. Nevertheless, the problem with this method is that the GSNR response, mainly its OSNR component, is highly dependent upon the spectral load configuration, which introduces considerable uncertainty in the QoT margin [15].
The third approach considers a data set that collects the QoT responses to random spectral loads of the in-service network. This data is generated during the operative phase of the in-service network by measuring the OLS response in terms of GSNR for various spectral load configurations. This case provides an ideal playground to apply ML: an ML method using a training data set composed of spectral load realizations of the in-service network yields an accurate QoT-E for each newly generated spectral load realization of the agnostic network scenario. In contrast to the second approach, where only telemetry data is considered, this approach bases the QoT-E on the GSNR response to specific spectral load configurations of the in-service network, decreasing the uncertainty in the GSNR predictions for the agnostic network. Additionally, unlike the first approach, this method does not need any knowledge of the OLS's physical parameters. Thus, it provides an ideal platform to apply ML methods in an agnostic network scenario.
In this work, we focus on the third approach, the ML model's training on the GSNR response to specific spectral load configurations of the already deployed in-service network, and consider its realization to predict the QoT of an agnostic network shown in Fig. 2(a).

Simulation model and synthetic data generation & analysis
A typical SDN-empowered backbone optical network is considered, in which the edges are modeled as OLSs, while the nodes are ROADM sites. The OLSs comprise fibers and amplifiers and are managed by a controller assumed to operate at the optimal non-linear propagation working point. The random behavior of the physical layer is considered through the amplifier gain ripple. In addition, fiber impairments such as attenuation (α), dispersion (D), and insertion losses are considered. To make the simulation more realistic, the statistics of the insertion losses follow an exponential distribution with λ = 4, as described in [28]. Because of computational resource limitations, the considered OLSs carry only 76 channels over the standard 50 GHz grid on the C-band, with a total bandwidth close to 4 THz; we do not expect a substantial difference in results when considering the standard 96 channels on the entire C-band. We assume transceivers operating at 32 GBaud, shaped with a root-raised-cosine (RRC) filter. The ILAs, particularly the EDFAs in the optical line, are configured to work at a constant output power of 0 dBm per channel. All network links are assumed to use standard single-mode fiber (SMF) with a span length of 80 km. The simulation parameters are given in Table 1. To retrieve a data set in the absence of real field data, the reliable and well-tested open-source network simulation tool GNPy [29,30] is used to generate synthetic data for the proposed scenario. This library provides an end-to-end simulation environment that develops the network models for the physical layer [31]. The QoT-E engine of the GNPy library is spectrally resolved and is based on the generalized Gaussian noise (GGN) model [30,32]. Exploiting this capability, GNPy is configured to mimic the spectral load configuration of a network.
The mimicked parameters are the signal power at the receiver, the ASE noise accumulation and NLI generation during LP propagation, the GSNR of each propagating LP from a particular source to destination, and, finally, the number of spans traversed by the candidate LP from source to destination. Between the ASE noise and the NLI, the ASE is the more prominent, as it is twice the NLI when the system operates at optimal power [4,33]. Remarkably, it is also the most difficult to measure. The ASE noise power depends on the working point of the EDFAs [34], which in turn depends on the spectral load [7]. In this frame of reference, the generated data set is perturbed by varying the most sensitive EDFA parameters: the noise figure and the gain ripple. The noise figure is drawn from a uniform distribution between 6 dB and 11 dB, while the gain ripple is varied uniformly within 1 dB. Additionally, the generated data set is further perturbed by insertion losses drawn from an exponential distribution [28]. The mimicked data set consists of two subsets: one refers to the in-service network, while the other refers to the agnostic network. The configuration used for the agnostic network scenario is the same as for the in-service network, except for the perturbed EDFA noise figure, gain ripple, and OLS insertion losses already described.
After concluding the basic network configuration, the most delicate part is the spectral load configuration of the simulated links. The proposed model uses a subset of the 2^76 possible configurations, where 76 is the total number of channels. In the considered subset of spectral load realizations, each source-to-destination (s -> d) pair has 1024 realizations of random traffic, ranging from 34% to 100% of total bandwidth utilization. The first data set is generated on the European Union (EU) network topology, used as the in-service network, while for the agnostic network the required data set is generated on the USA network topology, shown in Fig. 2(b) and Fig. 2(c). The raw statistics of both network topologies are given in Table 2. To mimic the proposed behavior, GNPy is configured to generate 4096 realizations of spectral load configuration for 4 (s -> d) pairs of the in-service EU network and 35,840 realizations of spectral load for 35 (s -> d) pairs of the agnostic USA network, shown in Table 3 and Table 4. After retrieving the data set, a statistical analysis of the GSNR against different spectral load configurations for a particular (s -> d) pair is performed to estimate the uncertainty in the GSNR calculations. We considered a single path {Milwaukee-Minneapolis} in the test data set for the statistical analysis of the GSNR; we selected only one path, as we do not expect a substantial difference in the statistical characteristics of the GSNR for the other paths. Firstly, the distribution of the GSNR for this particular path is depicted in Fig. 3. For this purpose, we selected a channel under test (channel number 10) that is always ON across all the test realizations. Along with this analysis, a few primary considerations arise by computing the average of the GSNRs for every channel over all the test realizations of this path, presented in Fig. 4.
The average values of the GSNRs present a characteristic figure of the OLS components, falling between 13.27 dB and 13.90 dB, with standard deviations from 0.233 dB to 0.33 dB, respectively. In this scenario, if nothing is known about the GSNR dependency upon frequency, the same GSNR threshold must be enforced for all channels, with a magnitude lower than a global expected minimum. In this case, the per-channel thresholds are fixed to a constant GSNR of 12.4 dB, creating an average margin of up to 1.70 dB over the considered set of realizations for this particular path. In a second scheme, given stored data that characterizes the frequency-resolved GSNR response, one can reduce the margin by fixing a minimum value for each channel that must lie under the respective minimum measurement (continuous orange line in Fig. 4). Although sub-optimal, this is the best result achievable while remaining conservative and agnostic with respect to the specific spectral load configuration of this path. It yields a modest improvement over the first case: the average margin is reduced from 1.70 dB to 1.51 dB. However, this second approach is strongly dependent upon the sample space of GSNR realizations; acquiring a reliable minimum value for each channel requires a sufficiently large number of instances. In this context, neither scenario is feasible for an agnostic network, where the statistics of the GSNR against different spectral load configurations are not available.
In contrast, an ML approach appears to be a promising candidate not only to predict the GSNR for any particular path of a network but also to decrease the uncertainty in the GSNR predictions, even if the sample size from the in-service network is limited.

Machine learning model orchestration
The proposed models relate the features and labels of the candidate LPs of the in-service network using ML. The parameters used to define the features for the ML models are the received signal power, NLI, ASE, channel frequency, and the distance between the source and destination nodes (integral number of fiber spans traversed), while the label is the GSNR of the candidate LP, as depicted in Fig. 5. The total input to the proposed ML models consists of 380 features, as there are 76 entries for each of the parameters (76 × 5 = 380). The considered models exploit ML's ability to find the relationship between the provided features and the desired label [35]. Almost all ML-based models perform better when trained on standardized data sets [36]; a standardized data set has zero mean and unit variance.
In the present work, standardization of the data set is done using z-score normalization, expressed in Eq. (4):

z = (x − µ)/σ, (4)

where µ and σ are the mean and standard deviation of each particular feature vector of the data set. Model evaluation uses the mean square error (MSE) as the loss function, expressed in Eq. (5):

MSE = (1/n) Σ_{i=1..n} (GSNR^a_i − GSNR^p_i)^2, (5)

where GSNR^a_i and GSNR^p_i are the actual and predicted values of any candidate channel for the i-th spectral load, respectively, and n is the total number of realizations in the test data set.
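A minimal sketch of Eqs. (4) and (5) — functionally equivalent to scikit-learn's StandardScaler and mean_squared_error; the toy arrays are illustrative only:

```python
import numpy as np

def z_score(X):
    """Eq. (4): standardize each feature column to zero mean, unit variance."""
    mu = X.mean(axis=0)      # per-feature mean
    sigma = X.std(axis=0)    # per-feature standard deviation
    return (X - mu) / sigma

def mse(gsnr_actual, gsnr_pred):
    """Eq. (5): mean square error over the n test realizations."""
    diff = np.asarray(gsnr_actual) - np.asarray(gsnr_pred)
    return np.mean(diff ** 2)

# Illustrative feature matrix (rows: realizations, columns: features)
X = np.array([[13.2, 0.5], [13.8, 0.7], [13.5, 0.6]])
Xs = z_score(X)
print(Xs.mean(axis=0))                    # each column now has ~zero mean
print(mse([13.3, 13.9], [13.4, 13.8]))    # loss on two toy GSNR values (dB)
```

In the actual pipeline, the scaler statistics must be computed on the in-service (training) set only and then applied unchanged to the agnostic (test) set.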

Machine learning analysis
The paradigm of ML and especially deep learning concept allows the inference of useful network characteristics that cannot be easily or directly measured. Generally, ML models' cognition ability is provided by a series of intelligent algorithms that learn the inherent information of the training data. The inherent information is then abstracted into the decision models that guide the testing phase. These trained decision models offer operational advantages by allowing the network to draw conclusions and react autonomously.
In the present work, six ML models are proposed for QoT-E. Each proposed model consists of three basic modules: pre-processing, training, and testing. The pre-processing module standardizes the data set before passing it to the training module. The training module uses the standardized data set of the EU network, considered as the in-service network (Table 3), for training the proposed models. After training, the testing module runs the tests on the subset of the USA network data set, considered as the agnostic network (Table 4).
The proposed ML-based models are developed using the high-level Python application program interfaces (APIs) of two open-source ML libraries, scikit-learn© (SKL) [36] and TensorFlow© (TF) [37], which provide a variety of learning algorithms as well as appropriate functions to refine the data set before it is used as input to an ML model. Generally, SKL is a general-purpose or traditional ML library, while TF is considered a deep learning library. Unlike TF, SKL provides a powerful feature engineering tool kit for data refining processes such as dimensionality reduction, standardization, and transformation, whereas TF automatically extracts useful features from the data and does not need to do this manually.
Using SKL's Python API, three ML models are developed: the Decision Tree Regressor, Random Forest Regressor, and Multi-Layer Perceptron Regressor; using TF's API, the Boosted Tree Regressor, Deep Neural Network, and Wide Deep Neural Network are developed.

Decision tree regressor
To estimate the QoT, the Decision tree regressor (DTR) model is proposed, which provides direct relationships between the input and response variables [38]. The DTR constructs a tree based on several decisions made by analyzing different aspects of the data set features, eventually leading to a response variable. The DTR has two main tuning parameters: min_samples_leaf and max_depth. We tuned these to min_samples_leaf = 3 and max_depth = 100 to achieve the best trade-off between precision and computational time in the particular simulation scenario.
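A sketch of the DTR configuration with the tuning values reported above; the random features and GSNR-like labels are synthetic stand-ins for the GNPy-generated data set, not the paper's data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
# Toy stand-in for the 380-entry feature vectors (76 channels x 5 parameters)
X_train = rng.normal(size=(500, 380))
# Illustrative GSNR labels in dB, loosely tied to the first feature
y_train = 13.5 + 0.3 * X_train[:, 0] + 0.05 * rng.normal(size=500)

# Tuning values reported in the text
dtr = DecisionTreeRegressor(min_samples_leaf=3, max_depth=100)
dtr.fit(X_train, y_train)
print(dtr.predict(X_train[:3]).shape)
```

min_samples_leaf = 3 prevents leaves from memorizing single realizations, while max_depth = 100 bounds the tree growth and thus the fitting time.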

Random forest regressor
The Random forest regressor (RFR) is a type of ML algorithm that creates an ensemble of regression trees using the bagging technique [39]. Bagging creates various subsets of data from the training data set, chosen randomly with replacement. The RFR is a step beyond plain bagging: it not only takes a random subset of the data but also a random selection of features, rather than using all the features, to train several decision trees. The final prediction of the RFR is made by simply averaging the predictions of the individual decision trees. As with the DTR parameter tuning, we do the same for the RFR and set min_samples_leaf = 3 and max_depth = 100 in the particular simulation scenario.
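A sketch of the RFR with the same leaf/depth tuning; n_estimators and the per-split feature subset size are illustrative assumptions (the text does not report them), and the data is again a synthetic stand-in:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X_train = rng.normal(size=(500, 380))
y_train = 13.5 + 0.3 * X_train[:, 0] + 0.05 * rng.normal(size=500)

rfr = RandomForestRegressor(n_estimators=100,       # ensemble size (assumption)
                            min_samples_leaf=3,     # per the text
                            max_depth=100,          # per the text
                            max_features="sqrt",    # random feature subset per split
                            bootstrap=True)         # bagging: sampling with replacement
rfr.fit(X_train, y_train)
print(rfr.predict(X_train[:3]).shape)
```

bootstrap=True implements the sampling-with-replacement step, and max_features realizes the random feature selection that distinguishes the RFR from plain bagging.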

Multi-layer perceptron regressor
The Multi-layer perceptron regressor (MLPR) is one of the most commonly used artificial neural networks. An MLPR is not a single perceptron with multiple layers but multiple artificial neurons (perceptrons) arranged in multiple layers; typically, it has three or more layers of perceptrons. The layers of the MLPR form a directed, acyclic graph in which each layer is fully connected to the subsequent layer [40]. The proposed MLPR is configured with training steps = 1000 and trained with the back-propagation (BP) algorithm, using the default Stochastic Gradient Descent (SGD) optimizer with the default learning rate = 0.01 and L1 regularization = 0.001 [41]. The basic function of the L1 regularization is to prevent the MLPR from over-fitting. In addition, several non-linear activation functions, such as ReLU, tanh, and sigmoid, were tested during model building; ReLU was selected to power the MLPR, as it outperforms the others in terms of prediction and computational load [42]. Finally, and most importantly, the hidden layers: the MLPR is tuned over several numbers of hidden layers and neurons to achieve the best trade-off between precision and computational time. These two parameters are linked to the complexity of the MLPR, which is tied to the complexity of the problem. Although increasing the number of layers and neurons improves the accuracy of the MLPR up to a certain extent, a further increase has the adverse effect of causing over-fitting and increasing the computational time. In this trade-off, the MLPR for QoT-E uses 3 hidden layers containing 20 neurons each.
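A sketch of the MLPR with the reported configuration, on synthetic stand-in data. Note that scikit-learn's alpha parameter is an L2 penalty, not the L1 regularization named in the text, so alpha = 0.001 is used here as an approximate stand-in:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 380))                       # stand-in features
y = 13.5 + 0.3 * X[:, 0] + 0.05 * rng.normal(size=300)  # stand-in GSNR labels (dB)

mlpr = MLPRegressor(hidden_layer_sizes=(20, 20, 20),  # 3 hidden layers, 20 neurons
                    activation="relu",                # ReLU, per the text
                    solver="sgd",                     # SGD with back-propagation
                    learning_rate_init=0.01,          # per the text
                    alpha=0.001,                      # L2 stand-in for L1 = 0.001
                    max_iter=1000)                    # 1000 training steps
mlpr.fit(X, y)
print(mlpr.predict(X[:2]).shape)
```

The inputs are already standard-normal here; on the real data set, the z-score normalization of Eq. (4) must be applied before fitting, or SGD convergence degrades.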

Boosted tree regressor
The Boosted tree regressor (BTR) is also a type of ML algorithm that creates an ensemble of regression trees, using the Gradient-Boosting technique. The BTR works by combining various regression tree models, particularly decision trees, using Gradient-Boosting [43]. More formally, this class of models can be written as Eq. (6):

f(x) = r_0 + r_1 + r_2 + · · · + r_i, (6)

where the final regressor f is the sum of simple base regressors r_i. As with the other tree regressors, we tune the BTR and set min_samples_leaf = 3 and max_depth = 100. Furthermore, the BTR is configured with several other parameters, such as the default learning rate = 0.01 and L1 regularization = 0.001 on the absolute weights of the tree leaves, for the current simulation scenario.
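The paper builds the BTR with TensorFlow's boosted-trees estimator, which has since been removed from TensorFlow; as a hedged, functionally similar stand-in, the additive model of Eq. (6) can be sketched with scikit-learn's gradient boosting. min_samples_leaf, max_depth, and the learning rate follow the text; n_estimators and the data are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(42)
X_train = rng.normal(size=(500, 380))
y_train = 13.5 + 0.3 * X_train[:, 0] + 0.05 * rng.normal(size=500)

# Eq. (6): the prediction is the sum f(x) = r_0 + r_1 + ... + r_i,
# with each new tree r_i fitted to the residual of the current ensemble.
btr = GradientBoostingRegressor(learning_rate=0.01,   # per the text
                                min_samples_leaf=3,   # per the text
                                max_depth=100,        # per the text (unusually deep
                                                      # for boosting, kept as stated)
                                n_estimators=50)      # ensemble size (assumption)
btr.fit(X_train, y_train)
print(btr.predict(X_train[:3]).shape)
```

Unlike the RFR, the trees here are not independent: each one is chosen to best reduce the remaining loss, which is the boosting property discussed in Section 6.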

Deep neural network
The Deep neural network (DNN) is a type of artificial neural network (ANN) with multiple layers between the input and output layers [44]. Each layer of the DNN consists of multiple neurons, the computational and learning units of neural networks, shown in Fig. 6. For the QoT-E, the proposed DNN is configured with training steps = 1000, using the default Adaptive Gradient (ADAGRAD) keras optimizer with the default learning rate = 0.01 and L1 regularization = 0.001 [45]. The basic function of the L1 regularization is to prevent the DNN from over-fitting. In addition, the ReLU non-linear activation function is selected during model building, as it allows translation of the given input features into the prediction label of our point of interest with less complexity [42]. Finally, and most importantly, the number of hidden layers: the DNN is tuned over several numbers of hidden layers and neurons to achieve the best trade-off between precision and computational time. In this trade-off, the DNN for QoT-E uses 3 hidden layers containing 20 nodes each.
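A Keras sketch of this DNN under the stated configuration (3 hidden layers of 20 ReLU neurons, Adagrad with learning rate 0.01, L1 = 0.001); the 380-wide input matches the 76 × 5 feature layout, and the single linear output is the predicted GSNR:

```python
import tensorflow as tf

l1 = tf.keras.regularizers.l1(0.001)  # L1 penalty on layer weights, per the text
model = tf.keras.Sequential([
    tf.keras.Input(shape=(380,)),     # 76 channels x 5 feature parameters
    tf.keras.layers.Dense(20, activation="relu", kernel_regularizer=l1),
    tf.keras.layers.Dense(20, activation="relu", kernel_regularizer=l1),
    tf.keras.layers.Dense(20, activation="relu", kernel_regularizer=l1),
    tf.keras.layers.Dense(1),         # predicted GSNR (dB)
])
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.01),
              loss="mse")             # Eq. (5) loss
```

Training would then call model.fit on the standardized in-service features and GSNR labels for the stated number of steps.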

Wide deep neural network
The Wide deep neural network (W-DNN) is a type of DNN architecture that combines the strength of memorization with generalization [46]. The W-DNN jointly trains a wide linear model, such as Linear Regression (LR), for memorization and a DNN for generalization. The outputs of the wide LR and the generalizing DNN are combined at the output layer to produce the final prediction, shown in Fig. 6(b). For the QoT-E, the proposed W-DNN is configured with 1000 training steps; the wide LR part uses the default Follow The Regularized Leader (FTRL) keras optimizer with the default learning rate = 0.01 and L1 regularization = 0.001 [47], while the deep DNN part uses the default ADAGRAD keras optimizer with the default learning rate = 0.01 and L1 regularization = 0.001 [45]. The ReLU non-linear activation function is selected for the deep part, as it outperforms the other non-linear activation functions in terms of prediction [42]. The output layer is fed with a sigmoid function, as the sigmoid produces activation values in a particular range so that the output layer is always activated. During training, the wide and deep parts of the W-DNN are trained jointly at the same time; the loss function over the training steps for the W-DNN is shown in Fig. 12. This joint training of the wide and deep architectures optimizes all the parameters and related weights of their sum, taken into account during training.
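A simplified Keras sketch of the wide-and-deep topology. Two deviations from the text are assumptions of this sketch: Keras trains the whole graph with a single optimizer, so the per-branch FTRL/Adagrad split requires the older estimator-style API and is not reproduced here; and the two branch outputs are simply summed, a common wide-and-deep formulation, in place of the sigmoid-fed output layer described above:

```python
import tensorflow as tf

inp = tf.keras.Input(shape=(380,))

# Deep part: the generalizing DNN (3 hidden layers of 20 ReLU neurons, as above)
deep = inp
for _ in range(3):
    deep = tf.keras.layers.Dense(20, activation="relu")(deep)
deep_out = tf.keras.layers.Dense(1)(deep)

# Wide part: a linear model on the raw features (memorization)
wide_out = tf.keras.layers.Dense(1)(inp)

# The wide and deep outputs are combined at the output layer and trained jointly
out = tf.keras.layers.Add()([wide_out, deep_out])
model = tf.keras.Model(inp, out)
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.01),
              loss="mse")
```

Because both branches feed one loss, a single model.fit call updates the wide and deep weights at the same time, which is the joint-training property the text emphasizes.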

Results and discussion
This section presents the numerical assessment of the various ML models developed using the Python APIs of SKL and TF.

Comparison of SKL based learning methods for QoT-E
In this section, we evaluate the prediction performance of the proposed ML models based on SKL's API: DTR, RFR, and MLPR. Fig. 7 plots the distribution of ∆GSNR, the prediction error metric, for the DTR, RFR, and MLPR over the test samples of the ON-channel realizations only. For the given simulation scenario, the DTR is unable to find the underlying relationships and irregularities. On the other hand, the RFR takes advantage of averaging various decision trees, trained on randomly selected subsets of the training samples, instead of relying on a single decision tree; therefore, the overall performance of the RFR is much better than that of the DTR. Furthermore, the MLPR performs very well compared to the DTR and RFR, thanks to the cognitive capability provided by its internally configured neurons. These descriptive results are confirmed by the mean (µ) and standard deviation (σ) of the ∆GSNR distribution for each proposed model in Fig. 7. Furthermore, the box plot of ∆GSNR in Fig. 9(a) also shows that the prediction performance of the MLPR outperforms the DTR and RFR. In Fig. 9(a), the central rectangle spans the first quartile (Q1) to the third quartile (Q3), the segment inside the rectangular box shows the median of ∆GSNR, and the "whiskers" around the box show the minimum and maximum values of ∆GSNR. Focusing on the MLPR after observing Fig. 7 and Fig. 9(a), we further analyze the results related to the MLPR by showing the bean plot of the ∆GSNR distribution for all the test paths of the agnostic USA network in Fig. 11. In Fig. 11, the "whiskers" above and below each bean show the outer bound of ∆GSNR, and the black segments show the individual observations along with the µ value (red segment in each bean) of ∆GSNR for each test path. After demonstrating the prediction performance, we further analyze the training time for each SKL based learning method, shown in Fig. 10(a), which depicts that the proposed MLPR takes longer to train than the RFR and DTR due to its internally fully connected hidden perceptrons. The RFR takes slightly longer than the DTR because of its reliance on the bagging technique.

Comparison of TF based learning methods for QoT-E
In this section, we carry out the same analysis for the TF-based learning methods as was done for the SKL-based ones. We first evaluate the prediction performance of the proposed ML models based on TF's API: BTR, DNN, and W-DNN, shown in Fig. 8. Figure 8 demonstrates the distribution of ∆GSNR for the W-DNN, BTR, and DNN over the test samples of the ON channels realization only. For the given simulation scenario, the BTR takes advantage of the boosting technique, combining several regression-tree models and selecting each new tree as the one that best reduces the loss function, rather than choosing it randomly. The DNN in turn outperforms the BTR due to the learning capacity provided by its internally configured neurons. Finally, the W-DNN refines the DNN further by combining the LR and DNN: the wide LR component tunes the DNN by characterizing features at the output layer. These results are confirmed by the µ and σ of the ∆GSNR distribution reported for each proposed model in Fig. 8.
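The boosting principle behind the BTR, in which each new tree is chosen to best reduce the remaining loss rather than being grown on a random subset, can be illustrated with scikit-learn's gradient boosting as a stand-in for TF's boosted-trees estimator (the wide-and-deep model corresponds to TF's `DNNLinearCombinedRegressor`). Data and hyperparameters below are illustrative only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
X = rng.uniform(size=(1000, 8))                   # stand-in lightpath features
y = X.sum(axis=1) + 0.1 * rng.normal(size=1000)   # stand-in GSNR target [dB]

# Boosting: trees are added sequentially, each fit against the loss of the
# current ensemble (contrast with the RFR's bagging of independent trees).
btr = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05,
                                max_depth=3, random_state=1)
btr.fit(X, y)
staged = list(btr.staged_predict(X))              # prediction after each tree
mse = [np.mean((p - y) ** 2) for p in staged]
# The training loss shrinks as trees are added, reflecting the sequential
# loss-reduction criterion described above.
print(f"MSE after 1 tree: {mse[0]:.3f}, after 200 trees: {mse[-1]:.3f}")
```

The `staged_predict` trace makes the difference from bagging visible: each stage strictly targets the residual error, rather than averaging independently trained trees.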
To better visualize the comparison among the W-DNN, BTR, and DNN, Fig. 9(b) shows the box plot of the ∆GSNR distribution for each TF-based learning method. Focusing on the W-DNN after observing Fig. 8 and Fig. 9(b), it is clear that the W-DNN is the best TF-based learning method in the present simulation scenario. We further elaborate its results in Fig. 13, which shows the bean plot of the ∆GSNR distribution over all the test paths of the agnostic USA network. We also analyze the training time of each TF-based learning method in Fig. 10(b): the proposed W-DNN takes the longest to train, due to the synergic training of the LR and DNN, while the DNN takes longer than the BTR because of its internal hidden layers containing several neuron units.
Finally, we analyze the prediction ability of the best models of the two libraries, the MLPR and the W-DNN, over all the test lightpaths. Observing Fig. 13 and Fig. 11, the W-DNN shows a maximum ∆GSNR of 0.40 dB, on the Buffalo-Charleston lightpath, while the MLPR shows a maximum ∆GSNR of 0.62 dB, on the Memphis-Miami lightpath. The remarkable performance of the W-DNN stems from the synergic training of the LR and DNN, which enables it to outperform both the traditional MLPR and the DNN. The classical MLPR, moreover, shows a larger deviation of ∆GSNR because it relies on a gradient-based local search, which at some point during training can get stuck in an unwanted local minimum.
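The headline figure quoted above, the maximum ∆GSNR over all test lightpaths, is simple to reproduce once per-path predictions are available. The sketch below uses hypothetical GSNR values (the path names come from the paper's USA network topology, but the numbers are invented for illustration).

```python
import numpy as np

# Hypothetical per-sample GSNR values [dB] for two test lightpaths
gsnr_true = {"Buffalo-Charleston": np.array([22.1, 21.8, 22.4]),
             "Memphis-Miami":      np.array([20.5, 20.9, 20.2])}
gsnr_pred = {"Buffalo-Charleston": np.array([22.3, 21.5, 22.6]),
             "Memphis-Miami":      np.array([20.1, 21.0, 20.6])}

# Signed dGSNR per sample, then the worst-case absolute error per path
delta = {p: gsnr_pred[p] - gsnr_true[p] for p in gsnr_true}
worst = {p: float(np.abs(d).max()) for p, d in delta.items()}

# The model's headline figure is the largest error across all test paths
max_error = max(worst.values())
worst_path = max(worst, key=worst.get)
print(f"maximum |dGSNR| = {max_error:.2f} dB on {worst_path}")
```

Keeping the sign of ∆GSNR per sample, as in the bean plots of Figs. 11 and 13, also reveals whether a model systematically over- or under-estimates the GSNR of a given path.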

Conclusion
In summary, we proposed and evaluated several ML models for QoT-E, considering a scenario in which these models are cross-trained on the in-service EU network and tested on the completely agnostic USA network. The proposed ML models are developed using the higher-level APIs of the SKL and TF libraries, and are cross-trained and tested on synthetic data generated with the GNPy library.
Exploiting the ability of ML, the W-DNN performs slightly better than the DNN thanks to the enhancement provided by the synergic training of the generalized (DNN) and wide (LR) architectures. The prediction performance of the applied DNN is in turn marginally better than that of the MLPR, since the backpropagation algorithm used by the MLPR relies on a gradient-based local search, which at some point during training can get stuck in an unwanted local minimum [48].
In addition, the BTR and RFR predictions are better than those of the DTR due to the use of the ensembling techniques of boosting and bagging, respectively. Finally, the results demonstrate that ML techniques are a compelling alternative for fast and accurate provisioning of the QoT of an LP in an agnostic optical network scenario. Remarkably, the W-DNN proved to be the model achieving the best generalization, with a maximum prediction error smaller than 0.40 dB.

Disclosures
The authors declare no conflicts of interest.