Supervised Machine Learning for Refractive Index Structure Parameter Modeling

The Hellenic Naval Academy (HNA) reports the latest results from an over-water free-space optical (FSO) LASER communications test facility operating between the lighthouse of Psitalia Island and HNA’s laboratory building. The FSO link was established within the Piraeus port area, with a path length of 2958 meters and an average altitude of 35 meters, primarily above water. The facility was recently upgraded with a BLS450 scintillometer, an MRV TS5000/155 FSO system, and an Ambient Weather WS-2000 weather station. This paper presents the preliminary optical turbulence measurements, collected from 24 to 31 May 2022, alongside the macroscopic meteorological parameters. Four machine learning algorithms, namely a Random Forest, a Single Layer Neural Network, a Deep Neural Network and a Gradient Boosting Regressor, were explored to predict the refractive index structure parameter via regression modeling.


I. INTRODUCTION
Free space optical communication technology with LASERS enables the transmission of information by propagating an electromagnetic wave as an optical carrier signal. The technology has proliferated significantly over the last decade and is expected to be a key component of future systems, since bandwidth requirements continue to increase and will likely become too large for traditional radio-frequency (RF) devices. Compared with RF devices, free space optical (FSO) communication devices generally (i) have lower size, weight and power (SWaP) requirements, (ii) deliver energy more efficiently owing to the narrow beam profile of a LASER and (iii) are free of licensing restrictions at their operational wavelengths [1]. However, FSO systems are affected by numerous deleterious factors such as system, geometric, misalignment and atmospheric losses. One phenomenon that causes fading in the optical channel is atmospheric turbulence along the path, which arises from inhomogeneities in temperature and pressure due to solar heating and wind [2].
Over the last three years, the Hellenic Naval Academy, in collaboration with the Directed Energy group of the Naval Postgraduate School (NPS), has conducted a series of experimental research campaigns to model both the performance of a free space optical link in a maritime environment and the optical turbulence in such an environment through the refractive index structure parameter ($C_n^2$). In 2020, HNA proposed a new model [3] which allows FSO link performance estimation over sea when combined with point measurements of environmental parameters. The experiment measured the Received Signal Strength Indicator (RSSI) of the FSO system and constructed a second-order polynomial using regression modeling to quantify its relation to macroscopic environmental parameters collected by a weather station.
In 2021 [4], the same experimental setup was utilized to improve the aforementioned model and validate it against real meteorological data. The predicted RSSI values exhibited a reasonably strong correlation with the measurements. From a modeling perspective, the Navy Atmospheric Vertical Surface Layer Model (NAVSLaM) [10] was used to predict the atmospheric turbulence based on the same meteorological data, which allowed for a statistical correlation between the refractive index structure parameter and RSSI. The NAVSLaM model is developed by the Meteorology department at the NPS and predicts the refractive index structure parameter from mean atmospheric layer properties, with an emphasis on the air-sea temperature difference. Also in 2021 [5], the experiment used an information theoretic method, namely the Jensen-Shannon divergence, a symmetrization of the Kullback-Leibler divergence, to measure the similarity between different probability distributions modeling an optical channel according to the strength of the atmospheric turbulence. Furthermore, the Pearson family of continuous probability distributions was employed to determine the best fit according to the mean, standard deviation, skewness and kurtosis of the modeled data. Finally, the Monterey Bay experimental setup at NPS was utilized in [6] to measure the atmospheric turbulence over the water and compare the results with the predictions of the NAVSLaM model, as well as to conduct a regression analysis for turbulence predictive modeling based on environmental parameters.
This paper provides preliminary results of HNA's study for modeling the refractive index structure parameter ($C_n^2$) over a maritime environment, in collaboration with the NPS and the NIWC Pacific. The initial data set of six macroscopic meteorological parameters with the corresponding $C_n^2$ values, acquired during the last week of May 2022, is used to model the refractive index structure parameter by employing regression machine learning algorithms.
This paper compares four machine learning algorithms: a Random Forest (RF), a Gradient Boosting Regressor (GBR) and two Neural Networks (a single layer and a deep network). The data was split into two sets: 80% of the data set is used for model training, and the remaining 20% is held out for testing the model's prediction accuracy, with the root-mean-squared error (RMSE) and the $R^2$ parameter as the performance metrics. For the single layer NN, the data was instead split 80%, 10% and 10% for training, validating and testing the model, respectively. The performance results of all algorithms appeared to be very promising ($R^2 \geq 0.79$).
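The hold-out protocol and the two metrics described above can be sketched as follows. This is a minimal illustration on synthetic placeholder data, not the paper's data set or code. It assumes that $R^2$ is the usual coefficient of determination, $1 - SS_{res}/SS_{tot}$, and it uses an ordinary least-squares fit purely as a stand-in model. The helper names (`train_test_split_80_20`, `rmse`, `r_squared`) are hypothetical.

```python
import numpy as np

def train_test_split_80_20(X, y, seed=0):
    """Shuffle the rows and hold out 20% of them for testing."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    n_train = int(0.8 * len(y))
    train, test = order[:n_train], order[n_train:]
    return X[train], X[test], y[train], y[test]

def rmse(y_true, y_pred):
    """Root-mean-squared error between targets and predictions."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """Coefficient of determination (assumed definition of R^2)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Synthetic placeholder data: six features and one response.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 6))
true_weights = np.array([0.5, -0.2, 0.1, 0.0, 0.3, -0.4])
y = X @ true_weights + rng.normal(scale=0.1, size=1000)

X_train, X_test, y_train, y_test = train_test_split_80_20(X, y)

# Stand-in model: ordinary least squares with an intercept column.
A_train = np.column_stack([X_train, np.ones(len(X_train))])
coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
y_pred = np.column_stack([X_test, np.ones(len(X_test))]) @ coef

print(f"RMSE = {rmse(y_test, y_pred):.3f}, R^2 = {r_squared(y_test, y_pred):.3f}")
```

Both metrics are computed only on the held-out rows, so they describe how well the model generalizes rather than how well it fits the training data.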

II. OPTICAL TURBULENCE BACKGROUND
All LASER beams propagating through the atmosphere suffer from temporal and spatial irradiance fluctuations, a phenomenon known as scintillation, which can severely degrade the performance of a laser link. The quantitative way to determine the reliability of such a link is through a probability density function (PDF) for these fluctuations, valid for the turbulence regime of interest, whether weak or strong. The most common metric of scintillation is the scintillation index (S.I.), defined as [7]
$$\sigma_I^2 = \frac{\langle I^2 \rangle}{\langle I \rangle^2} - 1,$$
where $I$ is the irradiance of the optical wave and $\langle \cdot \rangle$ denotes the ensemble average (equivalently, under ergodicity, the long-time average). According to weak fluctuation theory, the S.I. is proportional to the Rytov variance [7]
$$\sigma_R^2 = 1.23\, C_n^2\, k^{7/6} L^{11/6},$$
where $C_n^2$ is the refractive index structure parameter, a metric for the strength of the turbulence, $k = 2\pi/\lambda$ is the optical wavenumber and $L$ is the path length of the link. A common model for $C_n^2$ prediction is the Hufnagel-Valley (HV5/7) model, where the values of 5 and 7 refer to the atmospheric coherence length ($r_0$) in cm and the isoplanatic angle ($\theta_0$) in μrad, respectively, for a wavelength of 0.55 μm. The shortfall of this model is that it is not considered appropriate for a maritime environment [7]. Other models exist in the literature that are applicable to turbulence predictions in the maritime environment.
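As a rough numerical illustration of the Rytov variance for the 2958 m path, the sketch below evaluates the standard plane-wave expression $\sigma_R^2 = 1.23\, C_n^2 k^{7/6} L^{11/6}$ for a few turbulence strengths. The wavelength of 850 nm and the three $C_n^2$ values are assumptions chosen for illustration only; neither is taken from the paper or from the measured data.

```python
import math

def rytov_variance(cn2, wavelength_m, path_length_m):
    """Plane-wave Rytov variance: 1.23 * Cn2 * k^(7/6) * L^(11/6)."""
    k = 2.0 * math.pi / wavelength_m  # optical wavenumber [rad/m]
    return 1.23 * cn2 * k ** (7.0 / 6.0) * path_length_m ** (11.0 / 6.0)

PATH_LENGTH_M = 2958.0        # link length reported in the paper
ASSUMED_WAVELENGTH_M = 850e-9  # ASSUMPTION: not stated in the paper

# ASSUMPTION: illustrative Cn2 values [m^(-2/3)], not measured data.
for cn2 in (1e-16, 1e-15, 1e-14):
    sigma_r2 = rytov_variance(cn2, ASSUMED_WAVELENGTH_M, PATH_LENGTH_M)
    print(f"Cn2 = {cn2:.0e} m^-2/3 -> Rytov variance = {sigma_r2:.3g}")
```

Because the expression is linear in $C_n^2$ but scales as $L^{11/6}$, a long horizontal path such as this one can leave the weak-fluctuation regime (Rytov variance well below 1) even for moderate turbulence strengths.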

III. STATISTICAL LEARNING BACKGROUND
Statistical learning refers to the collection of tools used to understand a given set of data. Usually, there are some input variables (features, $X_i$) from which the goal is to make a prediction, or obtain an output (response, $Y$). Statistical learning assumes there is some relationship between them that can be modeled as [8]
$$Y = f(X) + \varepsilon,$$
where $f$ is a fixed but unknown function of the features and $\varepsilon$ is an error term, independent of $X$ and with zero mean. There are two main reasons for estimating $f$: prediction and inference. Most statistical learning methods are categorized as either parametric or nonparametric, depending on whether or not a functional form is assumed for $f$. Another important phenomenon is overfitting, in which a model follows the noise in the training data too closely and therefore gives misleading results on new data. Finally, most statistical problems fall into two types, supervised and unsupervised learning. In the former, the response values are known while training the model, whereas in the latter they are not. This paper follows a supervised statistical learning approach using four non-parametric models, i.e. a Random Forest (RF), a Gradient Boosting Regressor (GBR) and two Neural Networks (NN), a single layer and a deep network.

IV. MEASUREMENT SYSTEMS OVERVIEW
The experimental setup used for this research is located on the HNA premises, at Piraeus, Greece. A LASER communications link was established across the entrance of the Piraeus port, between the roof of HNA's laboratory building and the lighthouse of the Psitalia island, as shown in Figure 1. The total length of the link is 2958 meters, with the majority of that being over water and at a mean height value of 35 meters above sea level.
To quantify optical performance, an FSO system using MRV TS5000/155 transceiver heads was utilized.
To measure the local weather conditions, an Ambient Weather WS-2000 weather station was set up to record several macroscopic meteorological parameters, including wind speed, wind direction, air temperature, relative humidity, air pressure, dew point, solar radiation and rainfall rate.
Finally, a BLS450 scintillometer is located ten feet away from the MRV and WS-2000 to measure the path integrated atmospheric turbulence and heat flux.

A. Data Set
During the aforementioned period, the observed parameter values from the BLS450 scintillometer were logged once per minute. The same time interval was used for the collection and storage of atmospheric data from the WS-2000 weather station, so that they could be accurately matched with the $C_n^2$ measurements and compiled in an .xlsx file. A few technical issues, such as system resets and line-of-sight link blockages due to maritime traffic, resulted in a few missed measurements. The meteorological conditions during the experiment were fairly stable, with air temperature ranging from 20 to 29 °C, relative humidity between 45 and 85 percent and very low average wind speed. A key parameter of the meteorological data was the air-sea temperature difference (ASTD). To calculate this parameter, an online weather statistics database was referenced [9]. The entire dataset was screened and redundant records were excluded, yielding a clean dataset of 8055 rows and eight columns containing the meteorological parameters and the respective output value of $C_n^2$ for the same date/time. The result is a single user-friendly file for further processing and analysis.
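The matching step described above can be sketched with pandas as follows. This is only an illustration under stated assumptions: the column names, the toy records and the constant sea temperature are all hypothetical, and the ASTD is taken as air temperature minus sea temperature. The paper does not describe the actual file layout or the tooling used for this step.

```python
import pandas as pd

# Hypothetical scintillometer log (one record per minute, one duplicate).
scint = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2022-05-24 10:00:12", "2022-05-24 10:01:09",
        "2022-05-24 10:01:09", "2022-05-24 10:03:15",
    ]),
    "cn2": [2.1e-15, 2.4e-15, 2.4e-15, 1.9e-15],
})

# Hypothetical weather-station log for the same period.
weather = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2022-05-24 10:00:40", "2022-05-24 10:01:41",
        "2022-05-24 10:02:39", "2022-05-24 10:03:40",
    ]),
    "air_temp_c": [24.1, 24.2, 24.2, 24.3],
    "rel_humidity_pct": [61, 60, 60, 59],
})

SEA_TEMP_C = 21.0  # ASSUMPTION: placeholder value, not from the paper

def to_minute(df):
    """Round timestamps down to the minute and drop repeated minutes."""
    out = df.copy()
    out["minute"] = out["timestamp"].dt.floor("min")
    return out.drop(columns="timestamp").drop_duplicates(subset="minute")

# Keep only the minutes for which both instruments produced a record.
merged = to_minute(scint).merge(to_minute(weather), on="minute", how="inner")
merged["astd_c"] = merged["air_temp_c"] - SEA_TEMP_C  # air minus sea
merged = merged.dropna().reset_index(drop=True)
print(merged)
```

An inner join silently discards minutes lost to resets or link blockages, which is one simple way to handle the missed measurements mentioned above. The merged frame could then be written out with `DataFrame.to_excel`, provided an Excel writer engine is installed.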

B. Regression Modeling Results
This first part of the analysis is devoted to modeling the refractive index structure parameter with four machine learning based regression algorithms: a single layer neural network, implemented in the Neural Fitting application of MATLAB, and a deep neural network, a gradient boosting regressor and a random forest, implemented in a Jupyter notebook within the Anaconda Python environment.
The Neural Fitting application supports data selection, network creation and training, and performance evaluation according to the mean square error and regression analysis. A single hidden layer feed-forward network with sigmoid hidden neurons and linear output neurons was created in order to fit the seven meteorological parameters (inputs) to $C_n^2$ (output). The network was trained with either a Levenberg-Marquardt backpropagation algorithm or a Bayesian Regularization algorithm. The first requires more memory but less time to train the model. Training stops automatically when generalization stops improving, as indicated by the mean square error of the validation samples.
The second algorithm requires more time but can result in good generalization for difficult, small or noisy datasets. Training stops according to adaptive weight minimization. The network model was trained several times, varying the training algorithm and the number of nodes. The best outcome was obtained for a network composed of 70 nodes and trained with the Levenberg-Marquardt algorithm, which produced an R-squared of 0.896 and a mean square error (MSE) of 0.0834.
For reference, R-squared measures the correlation between outputs and target values; the closer its value is to 1, the closer their relationship. The MSE is the average of the squared differences between the actual and the predicted output values, and the goal is to make it as small as possible. The data was split 80/10/10 for training, validation and testing. The results of the network fitting are shown in Figure 2.
To develop the Random Forest (RF) model, we used the sklearn module and specifically the Random Forest Regressor function. Several parameters can be selected for an RF model. For this data, after executing a grid search with cross validation, the optimal set of parameters was found to be: the number of decision trees, n_estimators, was 80; the criterion (loss function) used to determine the model outcome, criterion, was MSE; the maximum possible depth of each tree was left at its default value, which allows the leaves to expand until they are all pure; and the maximum number of features under consideration in each split (equal to the number of estimators). The results showed very good agreement between model predictions and observed values, i.e. an R-squared of 0.865 and a root mean square error (RMSE) of 0.241, and are plotted in Figure 3.
Gradient boosting (GB) is a variant of ensemble methods in which weak learners are created in series in order to produce a strong ensemble model. GB makes use of the residual error for learning. The main training steps for a GB model are the following: i) an initial tree estimates the label value, ii) the residual error is calculated, iii) another model is created to predict the error of the previous model, not the label, and iv) the label prediction is updated based on the error prediction. As with the RF, the GB model includes several hyperparameters that can be selected initially and then tuned.
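The four gradient boosting steps listed above can be sketched from scratch as follows. This is a simplified illustration for a squared-error loss on synthetic data, not the sklearn implementation used in the paper. The function names and the default settings (`n_stages`, `learning_rate`, `max_depth`) are hypothetical choices for this sketch.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted_trees(X, y, n_stages=50, learning_rate=0.1, max_depth=3):
    """Fit an initial tree, then a series of trees on the residuals."""
    # Step i: an initial tree estimates the label value.
    first = DecisionTreeRegressor(max_depth=max_depth, random_state=0).fit(X, y)
    prediction = first.predict(X)
    correctors = []
    for _ in range(n_stages):
        # Step ii: residual error of the current ensemble.
        residual = y - prediction
        # Step iii: a new tree predicts the error, not the label.
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residual)
        # Step iv: update the label prediction with the error prediction.
        prediction = prediction + learning_rate * tree.predict(X)
        correctors.append(tree)
    return first, correctors, learning_rate

def predict_boosted_trees(model, X):
    """Sum the initial tree and the scaled residual corrections."""
    first, correctors, learning_rate = model
    prediction = first.predict(X)
    for tree in correctors:
        prediction = prediction + learning_rate * tree.predict(X)
    return prediction

# Synthetic placeholder data (not the paper's measurements).
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(500, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

model = fit_boosted_trees(X, y)
first_mse = float(np.mean((y - model[0].predict(X)) ** 2))
boosted_mse = float(np.mean((y - predict_boosted_trees(model, X)) ** 2))
print(f"training MSE: first tree {first_mse:.4f}, boosted {boosted_mse:.4f}")
```

The learning rate shrinks each correction, so many small steps are taken instead of a few large ones; this is the trade-off behind pairing a small learning rate with a large number of boosting stages.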
The hyperparameters selected for this model are: i) the number of boosting stages to perform (n_estimators = 1000), ii) the learning rate of the model (learning_rate = 0.05), iii) the maximum depth of the individual regression estimators (max_depth = 6) and iv) the minimum number of samples required to split an internal node (min_samples_split = 12). The results again showed good agreement between model predictions and observed values, i.e. an R-squared of 0.851 and a root mean square error (RMSE) of 0.252, and are plotted in Figure 4.
The last algorithm explored was a deep neural network, an evolution of the single layer network intended to overcome its limitations. In practice, a deep neural network is a neural network with additional hidden layers. The number of hidden layers and the number of nodes in each layer control the capacity of the model, and the appropriate choice depends on the specific problem to be solved.
As the dataset was not very large, we limited the number of layers in the deep learning model to save time and avoid overfitting. For our model, an architecture with three hidden layers and a sequentially decreasing number of nodes in each layer (30/20/10) was selected and run with a batch size of 16 for a total of 350 epochs. A ReLU activation function was used for the hidden layer nodes and a linear activation for the output node, since this is a regression model. The ReLU activation outputs the input directly if it is greater than 0 and returns zero otherwise. The loss function was the mean squared error, minimized with the Adam optimizer. The line plot (Figure 5) shows the expected behavior: the model rapidly learns the problem, decreasing the loss function to about 0.01 within about 75 epochs, and remains fairly stable thereafter. The line plot also shows that train and test performance remain comparable during training, although the training curve is somewhat noisy. Figure 6 presents the scatter plot for the DNN algorithm. The results for this model were $R^2$ = 0.79 and RMSE = 0.088.
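The random forest, gradient boosting and deep network settings reported in this section can be expressed in scikit-learn roughly as follows. This is a sketch on synthetic placeholder data, so the printed scores say nothing about the paper's results. Two substitutions are assumptions of this sketch: the paper does not name the framework used for the deep network, so `MLPRegressor` is used here as a stand-in, and the inputs to the network are standardized, which the paper does not mention. The random forest's split criterion and maximum depth are left at their defaults (squared error, unlimited depth), and `max_features` is also left at its default.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data: seven features and one response.
rng = np.random.default_rng(0)
X = rng.normal(size=(800, 7))
y = (np.sin(X[:, 0]) + 0.5 * X[:, 1] - 0.3 * X[:, 2] * X[:, 3]
     + rng.normal(scale=0.1, size=800))

# 80/20 hold-out split, as used for the Python models in the paper.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

models = {
    # 80 trees; default criterion (squared error) and unlimited depth.
    "random_forest": RandomForestRegressor(n_estimators=80, random_state=0),
    # Hyperparameters as reported for the gradient boosting model.
    "gradient_boosting": GradientBoostingRegressor(
        n_estimators=1000, learning_rate=0.05, max_depth=6,
        min_samples_split=12, random_state=0),
    # Stand-in for the deep network: 30/20/10 ReLU hidden layers,
    # Adam optimizer, batches of 16, at most 350 epochs.
    "deep_network": make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(30, 20, 10), activation="relu",
                     solver="adam", batch_size=16, max_iter=350,
                     random_state=0)),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
    scores[name] = {"r2": float(r2_score(y_test, y_pred)), "rmse": rmse}
    print(f"{name}: R^2 = {scores[name]['r2']:.3f}, RMSE = {rmse:.3f}")
```

`MLPRegressor` uses a squared-error loss and an identity output activation, which matches the loss and output layer described above. By default it may also stop before 350 epochs once the training loss stops improving.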

VI. CONCLUSIONS
This study presented the preliminary results of HNA's work on modeling the refractive index structure parameter ($C_n^2$) over a maritime environment, in collaboration with the NPS and the NIWC Pacific. A new ML based approach was proposed to estimate $C_n^2$ in the atmospheric surface layer. To validate the proposed approach, four ML based algorithms were utilized, i.e. a Random Forest (RF), a Gradient Boosting Regressor (GBR) and two Neural Networks (a single layer and a deep network), to model the refractive index structure parameter based on six macroscopic meteorological parameters. The performance results of all algorithms appeared to be very promising ($R^2 \geq 0.79$).
The most notable takeaway of this study is the demonstrated ability of ML algorithms to accurately predict complex variables such as $C_n^2$. Therefore, we strongly believe that such methods should be further researched for optical turbulence modeling. Future work on this topic includes comparing ML algorithms with empirical Monin-Obukhov similarity theory (MOST) based methods, specifically the Sabot and Kopeika macrometeorology-based model and the Navy Atmospheric Vertical Surface Layer Model (NAVSLaM), developed by the meteorological department of the Naval Postgraduate School (NPS).