Comparison of the efficacy of particle swarm optimization and stochastic gradient descent algorithms on multi-layer perceptron model to estimate longitudinal dispersion coefficients in natural streams

Accurate estimation of the longitudinal dispersion coefficient (LDC) is essential for modeling the pollution status of rivers. This research investigates the capabilities of machine-learning methods, namely the multi-layer perceptron (MLP), the multi-layer perceptron trained with particle swarm optimization (MLP-PSO), the multi-layer perceptron trained with stochastic gradient descent deep learning (MLP-SGD), and linear and non-linear regression (LR and NLR) methods, for determining the LDC of pollution in natural rivers, and evaluates the accuracy of these methods against real measured data. The correlation coefficient (CC), root mean squared error (RMSE) and Willmott's Index (WI) were used to evaluate the accuracy of the mentioned methods. Comparison of the results showed the superiority of the MLP-SGD model, with a CC of 0.923, RMSE of 281.4 and WI of 0.954, which indicates the undeniable accuracy and quality of the deep-learning model and shows that it can be used as a powerful model for LDC simulation. Also, owing to the acceptable performance of the PSO algorithm in hybridizing the MLP model, the use of PSO algorithms is recommended for training machine-learning techniques for LDC estimation.


Introduction
Nowadays, one of the most important environmental subjects in the world is the study of surface water and river quality. Since rivers and surface waters are the main sources of water supply for different purposes, it is important to determine and protect their quality. The main causes of surface water pollution are sewage, industrial wastes, and the use of pesticides and insecticides in agriculture. On the other hand, other problems such as drought damage make it all the more necessary to study the issue of water quality.
As mentioned, the entry of various agricultural and industrial wastes into rivers and streams has become a common way of dissolving and eliminating biological matter. One solution for controlling pollution in open-channel streams is the release of contaminants in a controlled and deliberate manner that exploits the river's dispersion and self-purification processes; this is one of the most important methods of river environmental management for reducing damage to the environment. It therefore requires sufficient and accurate knowledge of the ability of the water to carry, spread and clean the contamination along a certain length of its path, called the full mixing length. Pollutants are released under the effect of transfer and mixing processes in the vertical, transverse and longitudinal directions. The capability of rivers and other surface flows to distribute added materials in the longitudinal, transverse and vertical directions is expressed by the dispersion coefficients Kx, Ky and Kz, respectively (Tayfur & Singh, 2005). Kx is the most important coefficient, as it remains significant at farther distances as well (Chatila, 1997). Therefore, an appropriate and accurate evaluation of the longitudinal dispersion coefficient (LDC) is one of the main prerequisites for studying variations in the concentration of contaminants in natural rivers. Accurate modeling and estimation of this coefficient also holds a special position in reservoir and delta design, environmental issues and river engineering (Fischer et al., 1979). The LDC can easily be specified when real data are available, but the unknown features and characteristics of mixing and dispersion in most rivers have forced researchers to use different experimental relations.
Some experimental and theoretical approaches that have been extensively developed include those by Elder (1959), Fisher (1968, 1975), McQuivey and Keefer (1974), Liu (1977), Fischer et al. (1979), Iwasa and Aya (1991), Seo and Cheong (1998), Kashefipour and Falconer (2002) and Disley et al. (2015). Although empirical equations are valuable, they have disadvantages such as low accuracy and high cost. Also, each of these relations considers only some of the effective parameters, and a comprehensive relation covering all the effective parameters has not yet been accepted by all researchers. In recent years, in addition to the use of these equations, the implementation of highly accurate artificial intelligence methods has attracted the attention of many river engineers and researchers. Kargar et al. (2020) investigated multiple linear regression, random forest, the M5 model tree (M5P), Gaussian process regression and support vector regression to model the LDC in 60 natural streams of different regions around the world. They used various geometric and hydraulic information of the study areas as model inputs and compared the results with empirical models, concluding that the established M5P technique had better accuracy than the empirical equations and the other machine-learning techniques. Noori et al. (2009) estimated the LDC applying adaptive neuro-fuzzy inference system (ANFIS) and support vector machine (SVM) models; results indicated that the established models were superior to classical regression methods. In another study, the LDC was forecasted by a genetic programming method in natural rivers, which produced precise results (RMSE = 0.085 and R² = 0.98) (Azamathulla & Ghani, 2011). Tayfur and Singh (2005) calculated the LDC in natural rivers and streams implementing the artificial neural network (ANN) technique.
The model input variables include geometric features (channel shape parameter, channel sinuosity, and channel width) and hydraulic parameters (relative shear velocity, shear velocity, flow velocity, flow depth, and flow discharge). They indicated that the developed ANN model can be more useful than empirical formulas in predicting the LDC. Sattar and Gharabaghi (2015) presented a new gene expression model to establish experimental relations between the LDC and different control parameters in rivers of New Zealand, Europe, Canada, and the United States. Findings showed that the developed models are accurate and can be efficiently utilized to forecast the LDC in the studied areas. In the research of Antonopoulos et al. (2015), an empirical equation and an ANN model were developed for dispersion coefficient estimation based on various hydrodynamic and hydrological information of the Axios River; the ANN models showed results with acceptable accuracy. Applying the MLP and radial basis function (RBF) techniques and empirical equations, Parsaie and Haghiabi (2015) forecasted the LDC in rivers and revealed that the MLP and RBF models performed appropriately for estimating the LDC. Najafzadeh and Tafarojnoruz (2016) examined the capabilities of the neuro-fuzzy-based group method of data handling (NF-GMDH), enhanced by means of the particle swarm optimization (PSO) algorithm, in LDC estimation. Comparison of the NF-GMDH-PSO results with those of the artificial neural network (ANN), model tree (MT), and differential evolution (DE) methods showed the DE model to be the superior technique. Alizadeh et al. (2017) considered the Bayesian network (BN) method for estimation of the LDC based on two types of input data covering dimensionless and dimensional variables. The established technique provided an appropriate method for pollutant transport prediction in natural streams. Najafzadeh et al. (2021) performed uncertainty analysis in the estimation of longitudinal and lateral dispersion coefficients (Kx and Ky) in rivers, implementing five intelligent models for estimation of the mentioned coefficients, and reported that the support vector machine gives the least uncertainty in both Kx and Ky estimation. Ghiasi et al. (2021) suggested implementing a deep convolutional network (DCN) for improving the accuracy of LDC estimation; they stated that the utilized DCN model demonstrates exceptional accuracy in estimating the LDC over the full possible range of data in comparison with the empirical and ML models. Arya Azar et al. (2021) used the least squares support vector machine (LS-SVM), the adaptive neuro-fuzzy inference system (ANFIS), and ANFIS optimized by Harris hawks optimization (ANFIS-HHO) for LDC estimation and compared the results with those of the experimental methods; the proposed ANFIS-HHO performed better than the other two models. The superiority of artificial intelligence and machine-learning methods over experimental equations in estimating the longitudinal dispersion coefficient has also been reported in other studies (Azamathulla & Wu, 2011; Najafzadeh & Sattar, 2015; Noori et al., 2011, 2015; Piotrowski et al., 2012; Riahi-Madvar et al., 2009; Sahay, 2011; Toprak et al., 2004, 2014; Toprak & Cigizoglu, 2008). The special position of the LDC in contamination transfer and its dependency on geometric and hydrodynamic factors can be concluded from the number of studies conducted over the past decades (Altunkaynak, 2016; Bencala & Walters, 1983; Deng et al., 2001; Etemad-Shahidi & Taghipour, 2012; Rutherford, 1994; Tutmez & Yuceer, 2013; Wang et al., 2017).
The main purpose of this study is to investigate different experimental methods and equations for accurate estimation of the LDC in natural rivers and to evaluate the accuracy of these methods against real measured data. Accordingly, the main contribution of this study is evaluating MLP-SGD deep learning and the PSO algorithm for improving the accuracy of the standalone MLP in LDC estimation. Finally, the results of the empirical equations and the LR and NLR models are compared with each other.

Database used
Estimation of the longitudinal dispersion coefficient using experimental equations requires a set of hydraulic and geometric data for the rivers. In this study, data from 50 different rivers in the US and the UK (see Appendix) were gathered from different sources (Fisher, 1968; Graf, 1995; McQuivey & Keefer, 1974; Nordin & Sabol, 1974; Rutherford, 1994). The dataset includes flow depth (m), river width (m), average flow velocity (m/s), flow shear velocity (m/s), and the dispersion coefficient ε (m²/s). Table 1 presents the variables required for the experimental equations, the parameters used for modeling, and the statistical parameters of the data used.

Models used in the research
The use of the dimensional analysis method and the presentation of dimensionless variables allow a more detailed study of the factors affecting the relationships governing the LDC. The most important parameters affecting the dispersion phenomenon fall into three categories: fluid properties (fluid density ρ and fluid viscosity coefficient μ), hydraulic characteristics of the flow (average flow velocity U, flow depth h and shear velocity U*), and geometric parameters of the flow cross-section (river width w, flow path shape, etc.). In other words:

ε = f(ρ, μ, U, h, U*, w)    (1)

By applying the Buckingham π theorem, Equation 1 can be reduced to Equation 2:

ε/(hU*) = f(U/U*, w/h, Re*)    (2)

where ε/(hU*) is the dimensionless mixing or dispersion coefficient and Re* = ρU*h/μ is the shear Reynolds number. If the turbulent flow is rough (Re* ≥ 70), the effect of fluid viscosity can be disregarded and Equation 2 can be summarized as Equation 3:

ε/(hU*) = f(U/U*, w/h)    (3)
For this reason, the existing experimental equations use the parameters of dimensionless velocity and dimensionless flow depth to estimate this coefficient.
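As an illustration, the dimensionless groups above can be formed from a single set of river measurements; all numeric values below are made up for the example and are not from the paper's dataset.

```python
# Hypothetical illustration: forming the dimensionless groups of Equations 2-3
# from one (invented) set of river measurements.

rho = 1000.0    # fluid density (kg/m^3)
mu = 1.0e-3     # dynamic viscosity (kg/(m*s))
h = 0.85        # flow depth (m)
w = 20.0        # river width (m)
U = 0.45        # average flow velocity (m/s)
U_star = 0.06   # shear velocity (m/s)
eps = 15.0      # measured LDC (m^2/s)

Re_star = rho * U_star * h / mu         # shear Reynolds number Re*
dimensionless_ldc = eps / (h * U_star)  # eps / (h U*)
velocity_ratio = U / U_star             # U / U*
aspect_ratio = w / h                    # w / h

# Re* is far above 70 here, so viscosity can be neglected (Equation 3 applies).
print(Re_star, dimensionless_ldc, velocity_ratio, aspect_ratio)
```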

Experimental equations
In order to estimate the LDC in the rivers, researchers presented various experimental relationships, the most important of which are presented in Table 2. As can be seen, all of these formulas calculate the LDC using variables related to the average flow conditions in the river section.

Feed-Forward multi-layer perceptron (MLP)
In this study, feed-forward multilayer perceptron neural networks were used to estimate the LDC. This network is made up of simple operational elements called neurons that work in parallel. Multilayer neural networks consist of an input layer (for input data), one or more hidden layers (for organizing neurons), and an output layer (for output data). The behavior of the neural network is determined by how the components are connected and by adjusting the value of each connection, which is called the connection weight. One of the most widely used types of neural networks in hydrology and water resources is the feed-forward multilayer perceptron with the back-propagation error training pattern (Ulke et al., 2009). In this type of network, data flow from the input layer towards the hidden layers, as described by Tayfur (2012):
w_ij(new) = w_ij(old) − η ∂E/∂w_ij    (4)

where w_ij(old) and w_ij(new) are the weights between neurons i and j before and after a given iteration, respectively, η is the learning rate, and E is the error function. Network training and error reduction continue until the network converges. Neural networks can have multiple hidden layers; however, research shows that feed-forward neural networks with a single hidden layer are able to approximate any nonlinear function (Hornik et al., 1989).
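As a minimal sketch of the back-propagation weight update described above, the following example trains a single sigmoid neuron by gradient descent on a squared-error function; the tiny network, data point, and learning rate are purely illustrative, not those of the paper's model.

```python
import math

# Minimal sketch of the weight update w_new = w_old - eta * dE/dw
# for a single sigmoid neuron (all values illustrative).

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(w, b, x, y, eta):
    """One gradient-descent update on the squared error E = 0.5*(yhat - y)^2."""
    yhat = sigmoid(w * x + b)
    delta = (yhat - y) * yhat * (1.0 - yhat)  # chain rule through the sigmoid
    return w - eta * delta * x, b - eta * delta

w, b = 0.5, 0.0
for _ in range(1000):
    w, b = train_step(w, b, x=1.0, y=0.8, eta=0.5)

print(sigmoid(w * 1.0 + b))  # approaches the target 0.8
```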

Particle swarm optimization (PSO)
The particle swarm optimization (PSO) algorithm is a social search algorithm inspired by the behavior of flocks of birds and schools of fish searching for food. In this algorithm, each solution (in this study, the weights and biases of the neural network), also called a particle, is equivalent to a bird in the collective movement of a flock and is evaluated by an objective function. In the PSO algorithm, the particles work together to achieve a common goal, which makes the method more effective than if the particles acted separately: collective behavior depends not only on the behavior of each individual but also on how the group interacts. The particles are spread through the search space and then, in order to reach the best solutions, gradually move toward successful areas (optimal solutions) under the influence of their own experience and the knowledge of neighboring particles. Compared to other evolutionary algorithms, the PSO algorithm has attractive features such as constructive collaboration and shared memory between individuals; consequently, moving into areas containing better solutions is more likely and solutions of the desired quality are discovered faster. In addition, the algorithm is very simple, fast, and requires little memory. First, a number of particles are created with random positions and velocities; then, in each iteration, the particles correct their movement toward the target according to the best positions found by themselves and by their neighbors. After consecutive iterations, the problem converges to the optimal answer. The velocity and position of each particle are corrected by Equations 5 and 6, respectively, as described by Marini and Walczak (2015).
v_i(t+1) = ω·v_i(t) + c1·rand1·(pbest_i − x_i(t)) + c2·rand2·(gbest − x_i(t))    (5)
x_i(t+1) = x_i(t) + v_i(t+1)    (6)

in which gbest represents the best position obtained by the particle population, pbest_i represents the best position particle i has ever experienced, t is the iteration number, and rand1 and rand2 are random numbers in the range [0, 1]. The acceleration coefficients c1 and c2 are the cognitive parameter (personal experience) and social parameter (collective experience), respectively, and determine the slope of movement in the local search; their values are chosen in the range [0, 2]. Moreover, ω is an inertia coefficient that decreases linearly and is usually defined in the range [0, 1]. The operation of the PSO algorithm, adapted from Marini and Walczak (2015), is shown in Figure 1.
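The velocity and position updates of Equations 5 and 6 can be sketched on a stand-in objective (minimizing a sum of squares); the swarm size, acceleration coefficients, and inertia schedule below are illustrative choices, not necessarily the paper's exact settings.

```python
import random

random.seed(0)

# Sketch of the PSO updates (Equations 5-6) minimizing f(x) = sum(x_k^2)
# as a stand-in objective; all parameter values are illustrative.

def f(x):
    return sum(v * v for v in x)

DIM, N_PARTICLES, ITERATIONS = 2, 30, 200
C1, C2 = 2.0, 2.0

pos = [[random.uniform(-5, 5) for _ in range(DIM)] for _ in range(N_PARTICLES)]
vel = [[0.0] * DIM for _ in range(N_PARTICLES)]
pbest = [p[:] for p in pos]
gbest = min(pbest, key=f)[:]

for t in range(ITERATIONS):
    omega = 0.9 - 0.5 * t / ITERATIONS  # linearly decreasing inertia weight
    for i in range(N_PARTICLES):
        for k in range(DIM):
            r1, r2 = random.random(), random.random()
            vel[i][k] = (omega * vel[i][k]
                         + C1 * r1 * (pbest[i][k] - pos[i][k])  # cognitive term
                         + C2 * r2 * (gbest[k] - pos[i][k]))    # social term
            pos[i][k] += vel[i][k]
        if f(pos[i]) < f(pbest[i]):
            pbest[i] = pos[i][:]
            if f(pbest[i]) < f(gbest):
                gbest = pbest[i][:]

print(f(gbest))  # moves toward the optimum at 0
```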

Application of PSO algorithm in optimization and training of neural networks
In neural network training, the optimization variables are the weights and biases of the network. If the nth layer of a hypothetical neural network has R inputs and M neurons, the weight matrix (W_n) and bias vector (B_n) of this layer can be represented as:

W_n = [w_m,r], m = 1, ..., M, r = 1, ..., R;  B_n = [b_1, b_2, ..., b_M]^T    (7)

where [w_m,1, w_m,2, ..., w_m,R]^T is the vector of weights connecting neuron m to the inputs of the layer. The parameter vector of this layer can then be written as Equation 8:

θ_n = [vec(W_n); B_n]    (8)

Similarly, the weight and bias matrices and the corresponding parameter vectors are defined for each layer of the network. Concatenating the parameter vectors of all network layers forms the vector of optimization variables, so for a network with L layers the variable vector X is obtained from Equation 9:

X = [θ_1; θ_2; ...; θ_L]    (9)

Each vector X_i (i = 1, 2, ..., N) represents the position of one particle, where N is the number of particle solutions in the swarm, and the neural network operates with the parameters encoded in these vectors. As described by Garro and Vázquez (2015), Equations 5 and 6 are then used in each step to compute the new position vectors.
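The encoding of network weights and biases into a flat particle position vector (Equations 7-9) can be sketched as a pair of packing/unpacking helpers; the layer sizes used here are hypothetical.

```python
# Sketch of Equations 7-9: packing each layer's weights and biases into the
# flat parameter vector X that serves as a PSO particle position
# (layer sizes are hypothetical).

def flatten_params(layers):
    """layers: list of (weight_matrix, bias_vector) per layer -> flat vector X."""
    x = []
    for W, b in layers:
        for row in W:      # all weights of the layer, row by row (vec(W_n))
            x.extend(row)
        x.extend(b)        # followed by the layer's biases B_n
    return x

def unflatten_params(x, shapes):
    """Inverse mapping: rebuild (W_n, B_n) per layer from a particle vector."""
    layers, idx = [], 0
    for m, r in shapes:    # m neurons with r inputs in this layer
        W = [x[idx + j * r: idx + (j + 1) * r] for j in range(m)]
        idx += m * r
        b = x[idx: idx + m]
        idx += m
        layers.append((W, b))
    return layers

shapes = [(3, 2), (1, 3)]                # a 2-3-1 network: hidden + output layer
x0 = list(range(3 * 2 + 3 + 1 * 3 + 1))  # 13 parameters in total
assert flatten_params(unflatten_params(x0, shapes)) == x0  # exact round trip
```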

Multilayer perceptron-stochastic gradient descent (MLP-SGD)
Stochastic gradient descent (SGD) is an iterative method for optimizing an objective function with suitable smoothness properties. To compute the gradient in ordinary gradient descent (GD), an average over all n training samples must be taken. If the training dataset consists of millions or billions of samples, this average becomes expensive to calculate, so each iteration is slow. SGD is used to avoid this problem: in each iteration, a mini-batch M is sampled uniformly from the n samples in the training dataset, where the batch size is constant and much smaller than n. The gradient is estimated from the sample pairs in the mini-batch M and used to update all parameters. SGD reduces to gradient descent when the batch size equals n. In the SGD method the parameters are updated more frequently than in gradient descent, although convergence becomes noisier. When the batch size is 1, updates occur at the maximum rate, yielding an ordinary perceptron-like algorithm (Tsuruoka et al., 2009). In this case, the gradient estimate for a single sample x(i) with corresponding value y(i) is:

g = ∇_θ L(f(x(i); θ), y(i))    (10)

After estimating the gradient, the parameter vector θ is updated in the opposite direction of the gradient:

θ ← θ − η·g    (11)

In Equation 11, η is the learning rate, which decreases as the iterations proceed. MLP-SGD is an MLP-based method improved by SGD. The MLP-SGD network consists of many hidden layers containing neurons with scaled sigmoid, rectifier, maxout, and ExpRectifier activation functions. Advanced features such as epochs, rho, L1 or L2 regularization, momentum training, adaptive learning rate, and rate annealing allow high prediction precision; some of these parameters are active only if the adaptive learning rate is disabled.
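A minimal sketch of the mini-batch SGD update (Equations 10 and 11), here fitting a simple linear model to noiseless synthetic data; the model, data, and hyper-parameters are illustrative, not the paper's MLP-SGD configuration.

```python
import random

random.seed(1)

# Sketch of mini-batch SGD (Equations 10-11) fitting the noiseless line
# y = 2x + 1; model, data, and settings are illustrative only.

data = [(i / 100.0, 2.0 * (i / 100.0) + 1.0) for i in range(100)]
w, b = 0.0, 0.0
eta, batch_size = 0.1, 16

for step in range(2000):
    batch = random.sample(data, batch_size)   # uniform mini-batch M of size 16
    grad_w = grad_b = 0.0
    for x, y in batch:                        # averaged gradient estimate
        err = (w * x + b) - y
        grad_w += 2.0 * err * x / batch_size
        grad_b += 2.0 * err / batch_size
    w -= eta * grad_w                         # step opposite the gradient (Eq. 11)
    b -= eta * grad_b

print(round(w, 2), round(b, 2))  # approaches (2.0, 1.0)
```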

Error evaluation indicators
In order to evaluate and compare the results obtained from the machine-learning and regression models with the observed LDC data, the correlation coefficient (CC), root mean squared error (RMSE) and Willmott's Index (WI) were used:

CC = Σ(O_i − Ō)(P_i − P̄) / sqrt(Σ(O_i − Ō)² · Σ(P_i − P̄)²)
RMSE = sqrt((1/n) Σ(P_i − O_i)²)
WI = 1 − Σ(O_i − P_i)² / Σ(|P_i − Ō| + |O_i − Ō|)²

where O_i and P_i are the measured and estimated values, respectively, Ō and P̄ represent the mean observed and estimated values, and n is the number of data points. CC and WI values close to 1 and RMSE values close to 0 indicate a high-quality LDC estimation. Taylor diagrams (Taylor, 2001), with the models shown as differently colored points in polar coordinates, are used to evaluate the accuracy of the estimated LDC values.
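The three indicators can be sketched directly from their definitions; the observed/estimated values below are illustrative, not the paper's data.

```python
import math

# Sketch of the three evaluation indicators (O = observed, P = estimated);
# the sample values are invented for illustration.

def cc(O, P):
    """Pearson correlation coefficient."""
    n = len(O)
    mo, mp = sum(O) / n, sum(P) / n
    num = sum((o - mo) * (p - mp) for o, p in zip(O, P))
    den = math.sqrt(sum((o - mo) ** 2 for o in O) * sum((p - mp) ** 2 for p in P))
    return num / den

def rmse(O, P):
    """Root mean squared error."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(O, P)) / len(O))

def willmott(O, P):
    """Willmott's Index of agreement."""
    mo = sum(O) / len(O)
    num = sum((o - p) ** 2 for o, p in zip(O, P))
    den = sum((abs(p - mo) + abs(o - mo)) ** 2 for o, p in zip(O, P))
    return 1.0 - num / den

O = [100.0, 250.0, 400.0, 550.0]
P = [110.0, 240.0, 420.0, 530.0]
print(cc(O, P), rmse(O, P), willmott(O, P))
```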

Results and discussion
In the present study, 149 datasets belonging to 50 different rivers were used to model and estimate the LDC. There is no universal rule for separating training and testing datasets; for example, Deo et al. (2018) and Samadianfard et al. (2019a, 2019b, 2020) used 70% of their data for model development, and the same 70%/30% split between training and testing was adopted here. A four-layer perceptron neural network (one input layer, two hidden layers and one output layer) was used to estimate the LDC with the MLP method. The Levenberg-Marquardt algorithm was implemented owing to its efficiency and faster convergence in neural network training. The activation functions in the hidden and output layer neurons were sigmoid and linear, respectively. For the PSO algorithm, the population size, maximum number of iterations, inertia weight and number of initial particles were set to 1000, 500, 0.9, and 30, respectively.
Using the 30% of the collected dataset reserved for testing, the LDC values were calculated with each of the relationships in Table 2 and their accuracy was evaluated against the measured data. Table 3 shows the results of the implemented empirical equations. Z-H is the most accurate empirical equation, with CC, RMSE and WI values of 0.774, 454.8, and 0.863, respectively, which shows the efficiency of this equation. Conversely, the S-C equation has the lowest accuracy, with CC, RMSE, and WI values of 0.739, 791.0, and 0.796, respectively; the high RMSE indicates the inefficiency of this equation. The efficiency or inefficiency of these equations can also be deduced from the scatter and comparison plots presented in Figures 2 and 3.
Additionally, Table 4 presents the evaluation metrics of the MLP, MLP-PSO, MLP-SGD, and regression methods. Accordingly, the best performance belongs to MLP-SGD, with a CC of 0.923, RMSE of 281.4, and WI of 0.954. Thus, MLP-SGD had high potential compared to the other models for low-error modeling of LDC values. On the other hand, the MLP-PSO results show that using the PSO algorithm to train the MLP neural network improved all validation indices of the standalone MLP model: the CC, RMSE, and WI values changed from 0.769 to 0.884, 803.6 to 347.8, and 0.775 to 0.937, respectively. In other words, these metrics improved by 14.9%, 56.7%, and 20.9%, respectively. Regression equations for LDC estimation were obtained using the LR and NLR methods and are presented in Table 5; these equations depend on the values of W/H and U/U*. Comparing the results in Table 4 with those in Table 3, it can be concluded that the performance of NLR is superior to the empirical equations examined in this study. Scatter and comparative plots of the implemented methods are shown in Figures 2 and 3. As can be seen, the scatter of points (light blue points) around the bisector line (solid black line) for MLP-SGD is lower than for the other methods. Furthermore, the conformity of the estimated points (dark blue points) to the measured data line (red dotted line) is greater for the MLP-SGD model than for the others, which indicates the superiority of the MLP-SGD model.
To compare the superior methods among the empirical and machine-learning models and to identify the most efficient one, Figure 4 compares the results of Z-H with those of the MLP-PSO, MLP-SGD, and NLR methods; the proximity of the MLP-SGD points (pale blue points) to the observed data line (red dotted line) shows the suitability of this method for estimating the LDC. Moreover, Figure 5 demonstrates the boxplots of the utilized models; similarly, it is clear that the MLP-SGD estimates are in better agreement with the measured LDC values. In Figure 6, the Taylor diagrams of the models are illustrated, where the RMSE is presented as the distance from the reference point: the smaller the distance between a model's point and the reference point, the higher its accuracy. The brown point for MLP-SGD lies closest to the reference point, confirming the superiority of this model. Comparing the results of this study with the findings of Kargar et al. (2020) indicates that the developed MLP-SGD model (RMSE of 281.4) estimated the LDC with higher accuracy than M5P (RMSE of 454.9), the best model of Kargar et al. (2020). Based on the outcomes of this research, the proposed MLP-SGD model may therefore be recommended for LDC estimation with a high degree of confidence.

Sensitivity analysis
Sensitivity analysis indicates the effect of each input parameter of the model: one parameter is varied between the lower and upper limits of its range of changes while the other parameter is held constant at its mean. The results of the sensitivity analysis showed that the width-to-depth ratio (W/H) is more important than U/U* for estimating the LDC with the most accurate model (MLP-SGD). This importance was demonstrated by removing each parameter from the LDC modeling: removing W/H resulted in a 60.9% decrease in CC, a 140.6% increase in RMSE, and a 61.0% decrease in WI, whereas removing U/U* caused a 42.0% decrease in CC, a 107.2% increase in RMSE, and a 44.9% decrease in WI. The complete results of the sensitivity analysis are given in Table 6.
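The one-at-a-time procedure described above can be sketched as follows, with a hypothetical linear stand-in for the trained model (the paper's actual model is MLP-SGD) and invented sample values.

```python
# Sketch of the one-at-a-time sensitivity procedure: vary one input across its
# observed range while holding the other at its mean. `predict` is a
# hypothetical stand-in for the trained model.

def predict(w_h, u_ustar):
    # Placeholder model (illustrative only, not the fitted MLP-SGD).
    return 5.0 * w_h + 2.0 * u_ustar

def sensitivity(samples, var_index, predict_fn):
    """Output swing when input `var_index` sweeps its range, others at mean."""
    cols = list(zip(*samples))
    means = [sum(c) / len(c) for c in cols]
    lo, hi = min(cols[var_index]), max(cols[var_index])
    outputs = []
    for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
        point = list(means)
        point[var_index] = lo + frac * (hi - lo)
        outputs.append(predict_fn(*point))
    return max(outputs) - min(outputs)  # swing attributable to this input

samples = [(10.0, 2.0), (40.0, 6.0), (25.0, 4.0)]  # invented (W/H, U/U*) pairs
print(sensitivity(samples, 0, predict), sensitivity(samples, 1, predict))
```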

Conclusion
Proper estimation of the longitudinal dispersion coefficient of pollution is very important owing to its undeniable effect on pollution control and management in rivers. In this study, the LDC values were estimated through several experimental equations and machine-learning and regression methods, namely MLP, MLP-PSO, MLP-SGD, NLR, and LR, and their accuracy was evaluated using the measured data. To achieve this goal, 149 datasets belonging to 50 rivers in the United States and the United Kingdom were divided into two parts: 70% for training and 30% for testing. Among the experimental methods, Z-H had the best performance, with a CC of 0.774, RMSE of 454.8, and WI of 0.863. Among the machine-learning methods, the MLP-SGD method showed the highest accuracy, with a CC of 0.923, RMSE of 281.4, and WI of 0.954. Also, the MLP-PSO method indicated acceptable accuracy, with a CC of 0.884, RMSE of 347.8, and WI of 0.937. Therefore, it may be concluded that the PSO algorithm had a significant effect on improving the results of the standalone MLP method, which indicates the efficiency of this optimization algorithm. Using the NLR and LR methods, equations for LDC estimation were presented, of which the NLR equations, defined over two domains, had acceptable accuracy with a CC of 0.864, RMSE of 431.7, and WI of 0.856. Thus, the LDC can be estimated by these non-linear equations with low error values. Finally, given the importance of estimating the LDC as accurately as possible, the use of deep-learning methods and PSO optimization algorithms is recommended.

Disclosure statement
No potential conflict of interest was reported by the author(s).