Multi-step ahead forecasting of electrical conductivity in rivers by using a hybrid Convolutional Neural Network-Long Short-Term Memory (CNN-LSTM) model enhanced by Boruta-XGBoost feature selection algorithm

Electrical conductivity (EC) is widely recognized as one of the most essential water quality metrics for predicting salinity and mineralization. In the current research, the EC of two Australian rivers (Albert River and Barratta Creek) was forecasted for up to 10 days ahead using a novel deep learning algorithm (a Convolutional Neural Network combined with a Long Short-Term Memory model, CNN-LSTM). The Boruta-XGBoost feature selection method was used to determine the significant inputs (time series lagged data) to the model. To benchmark the Boruta-XGB-CNN-LSTM model, three machine learning approaches were used: the multi-layer perceptron neural network (MLP), K-nearest neighbour (KNN), and extreme gradient boosting (XGBoost). Statistical metrics such as the correlation coefficient (R), root mean square error (RMSE), and mean absolute percentage error (MAPE) were used to assess the models' performance. From 10 years of data for both rivers, 7 years (2012–2018) were used as the training set and 3 years (2019–2021) for testing the models. In forecasting EC one day ahead, the Boruta-XGB-CNN-LSTM model outperformed the other machine learning models at both stations on the test dataset (R = 0.9429, RMSE = 45.6896, MAPE = 5.9749 for the Albert River, and R = 0.9215, RMSE = 43.8315, MAPE = 7.6029 for Barratta Creek). Given its better performance in both rivers, this model was then used to forecast EC 3–10 days ahead. The results showed that the Boruta-XGB-CNN-LSTM model is very capable of forecasting the EC for the next 10 days, with performance decreasing only slightly as the forecasting horizon increased from 3 to 10 days. These findings indicate that the Boruta-XGB-CNN-LSTM model can serve as a reliable soft computing method for accurately predicting how the EC changes in rivers.

EC data were collected from Barratta Creek at Northcote (station 119101A), located at 19.69°S and 147.17°E (http://www.bom.gov.au/waterdata/). Figure 2 shows the locations of the stations at which the EC data were collected. Table 1 summarizes the descriptive statistics of the EC data for both stations. The average observed EC values were 459 and 380 for the Albert River and Barratta Creek, respectively. According to the coefficient of variation (C.V.) values, the variation in the EC in the Albert River (C.V. = 34.8%) was larger than that in Barratta Creek (C.V. = 31.1%). Figure 3 shows the time series and frequency distributions of the EC values at both stations.

Boruta-XGBoost feature selection
The Boruta technique is a wrapper around the random forest algorithm, named after the forest god from Slavic mythology 28, that computes the Z-scores of each predictor against randomized "shadow" attributes. The major predictor variables are established by the distribution of the Z-score metrics 29. In this study, the XGBoost algorithm was used as the base estimator instead of a random forest. The procedure is as follows:

1. Random shadow features are created: all data features are shuffled at random, and their numerical order is altered.
2. The XGBoost technique calculates the relevance, expressed by the Z-score, of both the shadow features and the original features.
3. The essential features are selected. An original feature with a Z-score greater than the largest Z-score in the set of shadow features is designated as "important". An original feature with a Z-score considerably lower than those of the shadow features is tagged as "not important" and deleted permanently from the feature set.
4. Steps 1–3 are repeated until the significance of all features has been marked or the set number of iterations is reached.
5. Details can be found in the work of 31.

Figure 2. Locations of stations at which the EC was measured (the map was generated using ArcGIS software, version 10.8: https://support.esri.com/zh-cn/products/desktop/arcgis-desktop/arcmap/10-8-1; the Australia shape file is from https://www.abs.gov.au, and the maps of river stations are from http://www.bom.gov.au/waterdata/).
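The shadow-feature loop above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: absolute correlation with the target stands in for the XGBoost importance score, and a single iteration is shown rather than the full repeat-until-marked procedure.

```python
import numpy as np

def boruta_step(X, y, rng):
    """One Boruta iteration: append shuffled 'shadow' copies of every
    feature, score all columns, and flag originals that beat the best
    shadow score. Absolute correlation with the target is used here as
    a simple stand-in for the XGBoost importance measure."""
    n, p = X.shape
    shadows = X.copy()
    for j in range(p):                      # shuffle each shadow column independently
        rng.shuffle(shadows[:, j])
    full = np.hstack([X, shadows])
    scores = np.array([abs(np.corrcoef(full[:, j], y)[0, 1]) for j in range(2 * p)])
    max_shadow = scores[p:].max()           # benchmark criterion (max-shadow)
    return scores[:p] > max_shadow          # True = "important"

rng = np.random.default_rng(0)
n = 500
informative = rng.normal(size=n)            # feature that drives the target
noise = rng.normal(size=n)                  # irrelevant feature
y = 3.0 * informative + 0.1 * rng.normal(size=n)
X = np.column_stack([informative, noise])
keep = boruta_step(X, y, rng)               # informative feature passes the criterion
```

In the real algorithm this step is repeated, and features are only permanently accepted or rejected once their Z-scores are consistently above or below the max-shadow benchmark.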

MLP
MLP, as an architecture of artificial neural networks (ANNs), has been widely employed in various disciplines [32–35]. Similar to other ANN architectures, the MLP receives input signals and processes them before they are transmitted to the other neurons in the hidden layer(s). At least one hidden layer exists in the MLP structure. During the training phase, the neurons in each layer are linked to the neurons in the adjacent layer through weights. Sigmoid and linear activation functions are typically used in the hidden and output layers, respectively, to examine the input data characteristics 36. The MLP can be expressed mathematically as S_q(k) = f(Σ_p W^I_qp(k) x_p(k)) and ŷ(k) = Σ_q W^O_q(k) S_q(k), where q is the number of hidden neurons, x_p(k) is the input signal, S_q(k) is the output of the qth hidden neuron, and f is the tangent hyperbolic function. The activation function of the output neuron is linear (purelin). Two sets of weights must be updated: those from the input to the hidden layer(s), denoted by the vector W^I(k), and those from the hidden to the output layer, denoted by the vector W^O(k). This study adopts MLP networks trained with a backpropagation algorithm, which can be considered the most prevalent and popular approach. Backpropagation is a supervised learning method that has been used in several prediction tasks 37,38. In this study, the Levenberg-Marquardt technique was used as the backpropagation algorithm to train the MLP network.
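The forward pass of this architecture can be sketched directly from the equations above. This is a minimal illustration only (random weights, no Levenberg-Marquardt training): a tangent-hyperbolic hidden layer followed by a linear (purelin) output neuron.

```python
import numpy as np

def mlp_forward(x, W_in, W_out):
    """Forward pass of a one-hidden-layer MLP: tangent-hyperbolic
    activation in the hidden layer and a linear (purelin) output
    neuron, mirroring the architecture described above."""
    s = np.tanh(W_in @ x)      # hidden-neuron outputs S_q
    return float(W_out @ s)    # linear combination at the output

rng = np.random.default_rng(1)
x = rng.normal(size=4)               # input signal x_p
W_in = rng.normal(size=(8, 4))       # input-to-hidden weights W^I
W_out = rng.normal(size=8)           # hidden-to-output weights W^O
y_hat = mlp_forward(x, W_in, W_out)  # scalar forecast
```

Training adjusts W^I and W^O by backpropagating the prediction error; the forward pass itself is unchanged at inference time.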

XGBoost
XGBoost is an improved variant of the gradient boosting tree 39. Based on classification and regression tree theory, XGBoost is a successful solution for regression and classification tasks [40–43]. The XGBoost method approximates an objective function (expressing the goodness of fit) using a second-order Taylor expansion, enabling faster calculations 44. The core of the algorithm is to optimize the value of the objective function, which typically has two components (training loss and regularization): Obj = Σ_i l(ŷ_i, y_i) + Σ_k Ω(f_k), where l is the training loss function and Ω is the regularization term. The training loss evaluates the performance of the model on the training data, while the regularization term limits the model complexity to prevent, e.g., overfitting 45. The complexity can be defined in several ways, with the following expression commonly used for each tree: Ω(f) = γT + (1/2)λ‖ω‖², where T is the number of leaves and ω is the vector of leaf scores. The structural score is the following objective function: Obj = Σ_j [G_j ω_j + (1/2)(H_j + λ)ω_j²] + γT, where G_j and H_j are the sums of the first- and second-order gradients of the loss over the instances in leaf j. Minimizing this quadratic form gives the optimal leaf score ω_j* = −G_j/(H_j + λ) for a given tree structure q(x). Figure 4 illustrates the structure of the XGBoost model.
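The closed-form leaf score can be checked numerically. The sketch below assumes squared-error loss (so the first-order gradient is the residual and the second-order gradient is 1) and computes the optimal weight for a single leaf; it is an illustration of the derivation, not the library's internals.

```python
import numpy as np

def optimal_leaf_score(y_true, y_pred, lam=1.0):
    """Optimal leaf weight under squared-error loss, following the
    XGBoost derivation: with g_i = y_pred - y_true and h_i = 1, the
    score minimizing G*w + 0.5*(H + lambda)*w^2 is w* = -G / (H + lambda)."""
    g = y_pred - y_true           # first-order gradients
    h = np.ones_like(y_true)      # second-order gradients (constant for squared loss)
    G, H = g.sum(), h.sum()
    return -G / (H + lam)

# Four samples falling into one leaf, all with residual -2:
y_true = np.array([3.0, 3.0, 3.0, 3.0])
y_pred = np.array([1.0, 1.0, 1.0, 1.0])
w = optimal_leaf_score(y_true, y_pred, lam=1.0)   # -(-8) / (4 + 1) = 1.6
```

Note how the regularization term λ shrinks the leaf score below the raw mean residual (2.0), which is exactly how XGBoost trades fit against complexity.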

KNN
KNN, developed by 46, is a well-known ML method for addressing regression and classification problems. The technique includes a variable parameter, k, which represents the number of nearest neighbours. The KNN algorithm operates by locating the data points, or neighbours, in a training dataset that are closest to a query point. After selecting the k closest data points, a majority voting rule is applied to determine which class is the most prevalent, and the most frequent category becomes the final classification for the query. KNN for regression involves four steps:

1. Determine the distance between the query sample and the labelled samples, e.g., the weighted Euclidean distance d(x_t, x_tr) = sqrt(Σ_{n=1}^{N} w_n (x_tr,n − x_t,n)²), where N is the number of input features; x_tr,n and x_t,n are the nth feature values of the training (x_tr) and testing (x_t) points, respectively; and w_n is the weight of the nth feature, ranging between 0 and 1.
2. Arrange the labelled instances in ascending order of distance.
3. Define the ideal number of neighbours based on the root mean squared error (RMSE), e.g., through cross-validation.
4. Compute the prediction as the inverse-distance-weighted average of the target values of the k nearest neighbours.
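The four steps above can be sketched as follows. This is a minimal illustration with unit feature weights and a fixed k (in practice k would be chosen by cross-validation as in step 3).

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=3, eps=1e-12):
    """KNN regression following the four steps above: Euclidean
    distances (unit feature weights here), ascending sort, and an
    inverse-distance-weighted average of the k nearest targets."""
    d = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))   # step 1
    order = np.argsort(d)[:k]                             # step 2 (k fixed, step 3)
    w = 1.0 / (d[order] + eps)                            # step 4: inverse-distance weights
    return float((w * y_train[order]).sum() / w.sum())

X_train = np.array([[0.0], [1.0], [2.0], [10.0]])
y_train = np.array([0.0, 1.0, 2.0, 10.0])
pred = knn_regress(X_train, y_train, np.array([1.1]), k=3)  # dominated by the closest neighbour
```

The small eps guards against division by zero when the query coincides exactly with a training point.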

CNN-LSTM
In this study, modern DL techniques are used to develop a prediction model for forecasting the EC in rivers. The CNN-LSTM framework contains two key components: (1) convolutional and pooling layers that perform complex mathematical operations to extract features from the input data, and (2) LSTM and dense layers that process the obtained features 47.

CNN layer
The one-dimensional CNN (1D-CNN) is a deep feedforward neural network with local connections and weight-sharing properties 48. CNNs can automatically extract high-level dependence characteristics from input data. The learning performance and training duration of the model are determined by its structure, particularly the number of layers. A shallow structure may perform inadequately, whereas an excessively deep CNN may deteriorate the temporal sequential element of the data or be vulnerable to overfitting 49. Typically, the CNN architecture has convolutional and max-pooling layers 50. The CNN filter slides along the time axis, and its input is a three-dimensional tensor. The number of CNN convolution kernels is typically determined by the complexity of the objective. A batch normalization layer is added after the convolution layer to enhance the model performance 51. Overall, CNNs consist of several layers, such as the input layer, convolutional layers, nonlinear activation layers, pooling layers, dropout layers, batch normalization layers, one or more fully connected layers, and a loss activation layer. Figure 5 shows the structure of the CNN model.
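The two core operations of a 1D-CNN, a filter sliding along the time axis followed by max pooling, can be sketched in plain NumPy. This is a conceptual illustration of the mechanics, not the Keras layers used in the study.

```python
import numpy as np

def conv1d(x, kernel):
    """Valid 1-D convolution: slide the kernel along the time axis
    and take a dot product at each position."""
    n, m = len(x), len(kernel)
    return np.array([np.dot(x[i:i + m], kernel) for i in range(n - m + 1)])

def max_pool1d(x, size=2):
    """Non-overlapping max pooling over the time axis, keeping the
    strongest activation in each window."""
    trimmed = x[: (len(x) // size) * size]
    return trimmed.reshape(-1, size).max(axis=1)

signal = np.array([1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0])
feat = conv1d(signal, np.array([0.5, 0.5]))   # a simple moving-average filter
pooled = max_pool1d(feat, size=2)             # downsampled feature map
```

In the real network, many such kernels are learned jointly, and the pooled feature maps are what the LSTM layer consumes.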

LSTM layer
The LSTM is a version of the recurrent neural network in which the neurons of ANNs are replaced by memory blocks composed of memory cells connected across layers. The approach was proposed by 52 and improved by 53 to address the vanishing gradient problem. Each LSTM unit consists of a memory cell and three primary gates: the input, output, and forget gates 54. By determining which information is forgotten and which is remembered, the LSTM generates a regulated information flow and learns long-term dependencies. Specifically, the input gate i_t and a candidate state c*_t control the new information stored in the memory state c_t at time t. The forget gate f_t regulates which previous information is erased from or retained in the memory cell from time t−1, whereas the output gate o_t determines which information may be used to generate the output of the memory cell. Equations (6)-(10) represent the processes performed by an LSTM unit 55:

f_t = σ(W_f x_t + U_f h_{t−1} + b_f) (6)
i_t = σ(W_i x_t + U_i h_{t−1} + b_i) (7)
o_t = σ(W_o x_t + U_o h_{t−1} + b_o) (8)
c*_t = tanh(W_c x_t + U_c h_{t−1} + b_c) (9)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c*_t (10)

where x_t represents the input, W_* and U_* are weight matrices, b_* are bias vectors, σ is the sigmoid function, and ⊙ represents component-wise multiplication. The output of the memory cell, the hidden state h_t, is computed as h_t = o_t ⊙ tanh(c_t). Figure 6 shows the structure of the LSTM cell and the CNN-LSTM model used to forecast the EC values in rivers.
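A single LSTM time step implementing the gate equations above can be written directly in NumPy. This is an illustrative sketch with random, untrained weights; a framework such as Keras would fuse and train these parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b hold the parameters for the forget,
    input, output, and candidate transformations, in that order."""
    f = sigmoid(W[0] @ x_t + U[0] @ h_prev + b[0])       # forget gate f_t
    i = sigmoid(W[1] @ x_t + U[1] @ h_prev + b[1])       # input gate i_t
    o = sigmoid(W[2] @ x_t + U[2] @ h_prev + b[2])       # output gate o_t
    c_star = np.tanh(W[3] @ x_t + U[3] @ h_prev + b[3])  # candidate state c*_t
    c = f * c_prev + i * c_star                          # memory state c_t
    h = o * np.tanh(c)                                   # hidden state h_t
    return h, c

rng = np.random.default_rng(2)
d_in, d_h = 3, 5
W = rng.normal(size=(4, d_h, d_in))
U = rng.normal(size=(4, d_h, d_h))
b = np.zeros((4, d_h))
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), W, U, b)
```

Because o_t is a sigmoid output and tanh(c_t) is bounded, each component of the hidden state is strictly inside (−1, 1), which keeps gradients well-behaved across long sequences.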

Model development
A novel hybrid expert system composed of Boruta-XGBoost as the feature extractor and the CNN-LSTM model was developed to forecast the EC in rivers. Boruta-XGBoost, a tree-based feature selection method, was used because classical statistical methods such as cross-correlation may introduce lagged-time input components with errors owing to the assumption of linearity. Moreover, three other ML models (MLP, XGBoost, and KNN) were coupled with Boruta-XGBoost to validate the main hybrid framework for forecasting the daily EC values in 1-, 3-, 5-, 7-, and 10-day-ahead scenarios for Barratta Creek and the Albert River over the period 2012 to 2021.
All the schemes were implemented in Python 3.6, based on the Keras, Scikit-learn, XGBoost, and Boruta-SHAP libraries. Figure 9 shows the process flow of the multi-step forecasting of the EC parameter. As discussed, the Boruta-XGBoost feature selection technique specifies an importance factor, the Z-score, for each predictor 56. If the Z-score is greater than the max-shadow (a benchmark criterion), the considered predictor is input to the ML models, and predictors with Z-scores lower than the criterion are ignored 57. Input pools including 20 lags of the EC signals associated with both study areas in the five horizons (i.e., 1-, 3-, 5-, 7-, and 10-day ahead) were assessed using the Boruta-XGBoost approach. Figures 7 and 8 show the results of the Boruta-XGBoost feature selection for the Albert River and Barratta Creek, respectively. The green predictors are the significant components that pass the max-shadow condition, the red predictors are the rejected entities, and the yellow predictors are tentative entities. Table 2 lists the optimal lagged-time components to be fed to the ML models in the five horizons for each river.
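Framing the univariate EC series as a supervised-learning problem with selected lags can be sketched as follows. This is an illustrative helper, not the paper's code: `lags` would come from the Boruta-XGBoost selection, and `horizon` is the number of days ahead being forecast.

```python
import numpy as np

def make_lag_matrix(series, lags, horizon):
    """Build a supervised-learning table from a univariate EC series:
    each row holds the selected lagged values, and the target is the
    value `horizon` steps ahead. A lag of 1 means yesterday's EC."""
    max_lag = max(lags)
    rows, targets = [], []
    for t in range(max_lag, len(series) - horizon + 1):
        rows.append([series[t - lag] for lag in lags])
        targets.append(series[t + horizon - 1])
    return np.array(rows), np.array(targets)

ec = np.arange(10.0)                           # toy EC signal: 0, 1, ..., 9
X, y = make_lag_matrix(ec, lags=[1, 2, 3], horizon=1)
```

For the 3- to 10-day horizons, only `horizon` changes; the lag set is re-selected per horizon, matching Table 2.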
It is necessary to use an appropriate strategy for splitting a time-series dataset for forecasting. Generally, approximately 60-80% of the dataset is used for training the models, and the rest is used for validation. To this end, strategies such as k-fold cross-validation 58, holdout, and walk-forward 59 approaches are promising for avoiding overfitting. In this study, the holdout strategy was used, with 70% and 30% of the dataset used for training and testing, respectively.
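For time series, the holdout split must preserve temporal order (no shuffling), so that the test period strictly follows the training period. A minimal sketch of this chronological split:

```python
import numpy as np

def holdout_split(X, y, train_frac=0.7):
    """Chronological holdout split for time series: the first
    `train_frac` of the samples form the training set and the
    remainder the test set, preserving temporal order."""
    cut = int(len(X) * train_frac)
    return X[:cut], X[cut:], y[:cut], y[cut:]

X = np.arange(20).reshape(10, 2)   # 10 time-ordered samples
y = np.arange(10)
X_tr, X_te, y_tr, y_te = holdout_split(X, y, train_frac=0.7)
```

With a 70/30 split over 2012-2021, this corresponds to training on roughly 2012-2018 and testing on 2019-2021, as in the study.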
Four powerful ML models were used to forecast the daily EC: Boruta-XGB-MLP, Boruta-XGB-XGBoost, Boruta-XGB-KNN, and Boruta-XGB-CNN-LSTM (proposed). Notably, the hyperparameters of hybrid models must be appropriately tuned to avoid overfitting while obtaining optimal modeling results. To this end, various open-source strategies such as grid search, random search, and Bayesian optimization can be applied and implemented in programming languages such as MATLAB and Python 60,61. In this research, the ML models were optimized using the grid search technique. Table 3 summarizes the optimal settings, network architectures, and hyperparameters associated with the four ML models. The key hyperparameters of the Boruta-XGB-CNN-LSTM approach, as the model of interest, were the number of LSTM layers, the number of CNN layers, the number of neurons, the training algorithm, and the learning rate 62.
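The grid search idea can be sketched generically. This is a hedged illustration, not the paper's tuning code: `toy_score` is a hypothetical stand-in for a function that trains a model with the given hyperparameters and returns its validation RMSE.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Exhaustive grid search: evaluate every combination of
    hyperparameter values and keep the one with the lowest
    validation score (e.g., RMSE)."""
    best_params, best_score = None, float("inf")
    keys = list(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = score_fn(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical scorer: pretend validation RMSE is minimized at lr=0.01, 64 units.
toy_score = lambda p: abs(p["lr"] - 0.01) + abs(p["units"] - 64) / 100
best, _ = grid_search({"lr": [0.1, 0.01, 0.001], "units": [32, 64]}, toy_score)
```

The cost grows multiplicatively with each added hyperparameter, which is why random search or Bayesian optimization are common alternatives for deeper networks.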
A pre-processing step, classical min-max normalization, was applied to mitigate the negative effects of differing data scales: all inputs and targets were scaled between zero and one. This operation is typically applied to increase the rate of convergence and the modeling accuracy 63.
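A minimal sketch of this scaling, including the inverse transform needed to report forecasts on the original EC scale. Fitting the range on the training data only (an assumption consistent with standard practice, though not spelled out in the text) avoids leaking test-set information into the transform.

```python
import numpy as np

def minmax_fit(x):
    """Fit min-max scaling parameters on the training data only."""
    return x.min(), x.max()

def minmax_transform(x, lo, hi):
    return (x - lo) / (hi - lo)      # maps the training range to [0, 1]

def minmax_inverse(z, lo, hi):
    return z * (hi - lo) + lo        # back to the original EC scale

train = np.array([200.0, 450.0, 700.0])
lo, hi = minmax_fit(train)
z = minmax_transform(train, lo, hi)         # [0.0, 0.5, 1.0]
restored = minmax_inverse(z, lo, hi)        # recovers the original values
```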

Statistical metrics
Six statistical indices were used to evaluate the robustness of the ML models: the RMSE, correlation coefficient (R), uncertainty at a 95% confidence level (U95%), mean absolute percentage error (MAPE), T-statistic test (Tstat), and Nash-Sutcliffe model efficiency coefficient (NSE) 60,61.
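Four of these indices have simple closed forms and can be sketched as follows (U95% and Tstat are omitted here, as their exact formulations vary between sources).

```python
import numpy as np

def rmse(obs, sim):
    return float(np.sqrt(np.mean((obs - sim) ** 2)))

def corr(obs, sim):
    """Pearson correlation coefficient R."""
    return float(np.corrcoef(obs, sim)[0, 1])

def mape(obs, sim):
    """Mean absolute percentage error, in percent."""
    return float(np.mean(np.abs((obs - sim) / obs)) * 100)

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit; 0 means the
    model is no better than the mean of the observations."""
    return float(1 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2))

obs = np.array([400.0, 450.0, 500.0, 550.0])   # toy measured EC values
sim = np.array([410.0, 440.0, 510.0, 540.0])   # toy forecasts
```

Note that MAPE is undefined when an observation is zero, which is rarely an issue for EC but matters for other water quality parameters.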
Figure 10 shows the scatter plots for the Boruta-XGB-CNN-LSTM and comparative models, incorporating the upper and lower bounds, in terms of the R and RMSE metrics between the measured and forecasted one-step-ahead EC (Albert River) in the testing period. The Boruta-XGB-CNN-LSTM model exhibited the highest accuracy, with R = 0.9429 and RMSE = 45.68, followed by Boruta-XGB-XGBoost (R = 0.9323 and RMSE = 52.444), Boruta-XGB-MLP (R = 0.9261 and RMSE = 52.777), and Boruta-XGB-KNN (R = 0.8302 and RMSE = 82.499). Furthermore, the forecasts generated by the Boruta-XGB-CNN-LSTM model lay within the 25% upper and lower bound thresholds, indicating a strong relationship between the forecasted and measured EC.
Figure 11 shows the ridge plots, which indicate the relative deviation percent (RD, %), to assess the one-step-ahead EC forecasts for the Albert River obtained by the Boruta-XGB-CNN-LSTM and comparative models. In addition, the interquartile range (IQR) values are presented. The Boruta-XGB-CNN-LSTM model produced the most accurate RD distribution, with the lowest IQR (5.333). The benchmark Boruta-XGB-XGBoost model was superior to the Boruta-XGB-MLP and Boruta-XGB-KNN models.
Table 5 presents the one-step-ahead forecasting results of the four models for Barratta Creek. The proposed Boruta-XGB-CNN-LSTM model was slightly more accurate than the comparative models in the training period (R = 0.9316, RMSE = 43.2172, MAPE = 7.6428, NSE = 0.8673, Tstat = 2.7861, U95% = 119.7122) and the testing period (R = 0.9215, RMSE = 43.8315, MAPE = 7.6029, NSE = 0.8488, Tstat = 1.1701, U95% = 121.4845). Although the performance of the comparative models was satisfactory, it was lower than that of the proposed approach in forecasting the one-step-ahead EC for Barratta Creek.
Figure 12 shows the scatter plots for the Boruta-XGB-CNN-LSTM and comparative models, incorporating the upper and lower bounds, in terms of the R and RMSE metrics between the measured and forecasted one-step-ahead EC (Barratta Creek). The Boruta-XGB-CNN-LSTM model achieved the highest accuracy (R = 0.9215 and RMSE = 43.831), and its forecasts lay within the 25% range between the upper and lower bound thresholds. The models ranking second, third, and fourth in terms of accuracy were Boruta-XGB-MLP (R = 0.9184 and RMSE = 44.717), Boruta-XGB-XGBoost (R = 0.9128 and RMSE = 46.064), and Boruta-XGB-KNN (R = 0.9042 and RMSE = 48.315), respectively. Although the 25% upper and lower bounds were reasonable for the comparative models, Boruta-XGB-CNN-LSTM was the best model in this forecasting task.
Figure 14 shows the Taylor diagrams of the one-step-ahead EC forecasted by the Boruta-XGB-CNN-LSTM, Boruta-XGB-MLP, Boruta-XGB-KNN, and Boruta-XGB-XGBoost models for (a) the Albert River and (b) Barratta Creek. The Taylor diagram is a valuable tool for comprehensively assessing a model's comparability against the observed EC using the standard deviation and correlation coefficient. For the Albert River, the Boruta-XGB-CNN-LSTM forecast (blue solid circle) was close to the measured EC, with a correlation coefficient of more than 0.95 and a standard deviation between 125 and 150. The Boruta-XGB-MLP, Boruta-XGB-KNN, and Boruta-XGB-XGBoost predictions were slightly farther from the measured EC, with correlation coefficients lower than 0.95 and standard deviations between 100 and 150. For Barratta Creek, the Boruta-XGB-CNN-LSTM model (red solid circle) exhibited the highest precision, with a correlation coefficient of 0.90-0.95, followed by the Boruta-XGB-MLP, Boruta-XGB-XGBoost, and Boruta-XGB-KNN models. In other words, the Boruta-XGB-CNN-LSTM model was superior in forecasting the one-step-ahead EC for both the Albert River and Barratta Creek.

Multi-step ahead forecasting
Table 6 presents the metrics for the Boruta-XGB-CNN-LSTM multi-step-ahead (i.e., 3-, 5-, 7-, and 10-day-ahead) EC forecasts for the Albert River. The forecasting accuracy in the 3-day-ahead scenario was higher than that in the 7- and 10-day-ahead cases in both the training and testing periods, as indicated by the superior goodness-of-fit metrics for the 3-day-ahead forecasts: (R = 0.8947, RMSE = 73.6800, MAPE = 10.4113, NSE = 0.7998, Tstat = 2.3851, U95% = 204.1362) for the training period and (R = 0.8764, RMSE = 66.3651, MAPE = 12.0275, NSE = 0.7633, Tstat = 4.7504, U95% = 183.0642) for the testing period. Similarly, the 5-day-ahead performance was superior to that of the 7- and 10-day-ahead forecasts but inferior to that of the 3-day-ahead horizon. In other words, the proposed model attained higher precision in short-term forecasting (i.e., 1, 3, and 5 days) than in long-term forecasting (i.e., 7 and 10 days) of the EC for the Albert River.

Figure 15 shows the scatter plots, along with the R and RMSE metrics, of the Boruta-XGB-CNN-LSTM model for the multi-step-ahead (i.e., 3-, 5-, 7-, and 10-day) EC forecasts for the Albert River. In addition, the 25% upper and lower bound confidence intervals are presented. The strongest correlation is observed for the 3-day-ahead forecasts, given the highest R (0.8764) and lowest RMSE (66.365), although the 5-, 7-, and 10-day-ahead forecasts are also satisfactory. Overall, the proposed model is better at short-term EC forecasting (1, 3, and 5 days), and its performance decreases over the long-term forecast horizons (i.e., 7 and 10 days) for the Albert River.
Table 7 and Fig. 16 present the multi-step-ahead (i.e., 3-, 5-, 7-, and 10-day) EC forecasts for Barratta Creek obtained using the proposed Boruta-XGB-CNN-LSTM model. Table 7 shows that the model yields more accurate forecasts in the 3- and 5-day-ahead horizons than in the 7- and 10-day-ahead horizons in both the training and testing periods. This finding is supported by the scatter plots in Fig. 16. The short-term forecasts (3- and 5-day-ahead) are more accurate (R of 0.7677 and 0.7108, respectively), with lower RMSEs (72.466 and 79.445, respectively), than those of the 7- and 10-day-ahead horizons. Therefore, the Boruta-XGB-CNN-LSTM model is more effective for short-term EC forecasting at the Barratta Creek station.

Discussion
The results demonstrate the effectiveness of the proposed Boruta-XGB-CNN-LSTM model in accurately forecasting the EC for the Albert River and Barratta Creek across different time horizons. For one-day-ahead predictions, the hybrid model outperformed the other ML approaches according to multiple statistical evaluation metrics, indicating the benefits of optimizing the input features and leveraging CNN-LSTM architectures for water quality prediction. Notably, short-term forecasts up to 5 days ahead achieved higher accuracy than the longer 7-10-day horizons. This is understandable given the increasing uncertainty of more distant predictions. However, the model still produced reasonably good accuracy even 10 days ahead, suggesting usefulness for supporting various planning functions. While performance decreased with lead time as expected, the only slight deterioration demonstrates the model's ability to learn dependencies beyond immediate observations. This capacity to capture rich temporal patterns should aid in addressing non-stationarities in environmental systems. Comparing performance across stations reveals that the approach is transferable despite the rivers' differing characteristics. Tests on independent sites within Australia indicate potential applicability in diverse settings, pending location-specific tuning.

The study's findings have several potential applications and implications for improving water resource management and environmental monitoring. The accurate multi-step EC forecasts produced by the Boruta-XGB-CNN-LSTM model allow river authorities to optimize water allocation and reservoir operations over different timescales, helping balance the needs of water users. The model's predictions also help pollution control agencies identify at-risk areas and implement targeted mitigation strategies. Meanwhile, drinking water facilities and industries can better treat incoming supplies if alerted in advance to changing EC levels via the forecasts. Agricultural producers and fish farmers could also utilize the projections to schedule irrigation and select suitable crops and species. Furthermore, the predictions may aid emergency responders during flood and contamination events. Overall, systematically incorporating data-driven insights enables the development of long-term, sustainable river basin management strategies that consider both current and future water needs.

Conclusion
A hybrid CNN-LSTM model was used to forecast the multi-step-ahead EC for the Albert River and Barratta Creek in Australia. The proposed model was optimized using the Boruta-XGBoost algorithm to rank and select the best input features. Forecasting was performed over the 1-, 3-, 5-, 7-, and 10-day horizons to demonstrate the applicability of the Boruta-XGB-CNN-LSTM model. Moreover, the forecasting performance of the proposed method was compared with those of state-of-the-art models: Boruta-XGB-MLP, Boruta-XGB-XGBoost, and Boruta-XGB-KNN. The goodness-of-fit metrics demonstrated that the hybrid Boruta-XGB-CNN-LSTM could effectively forecast the multi-step-ahead EC for both rivers. In particular, the proposed model attained the highest precision in the testing period for the Albert River (R = 0.9429, RMSE = 45.6896, MAPE = 5.9749, NSE = 0.8878, Tstat = 3.3426, U95% = 126.3533) and Barratta Creek (R = 0.9215, RMSE = 43.8315, MAPE = 7.6029, NSE = 0.8488, Tstat = 1.1701, U95% = 121.4845) in forecasting the one-step-ahead EC. Moreover, the Boruta-XGB-CNN-LSTM model was more accurate in short-term (i.e., 1-, 3-, and 5-day) forecasting, and its performance slightly deteriorated in the 7- and 10-day-ahead forecast horizons. The proposed model can be extended to other applications such as agricultural, environmental, and atmospheric modeling.

While the proposed Boruta-XGB-CNN-LSTM model achieved good performance, some limitations remain. The study utilized daily water quality and meteorological data from only two rivers within Australia, so expanding data collection to more diverse locations globally would help validate the generalizability and robustness of the models. Additionally, real-time data sources such as satellite imagery could help capture spatial influences and improve the forecasts. The study focused on predicting a single water quality parameter, but developing multi-parameter models that simultaneously forecast other important indices would increase its practical relevance. Moreover, while measures were taken to prevent overfitting, more rigorous validation techniques such as uncertainty quantification on out-of-sample data could provide a more realistic assessment of long-term forecast accuracy. Addressing these limitations through multidisciplinary collaborations in future work would help advance the development of widely applicable AI solutions for integrated water resource and ecosystem management globally.
Figure 1. (A) Point and nonpoint sources of river water contamination. (B) Expectations of sanitation, hygiene, and clean water in 2030.

Figure 3. Time series and frequency distributions of EC data for (a) Albert River and (b) Barratta Creek.

Figure 5. Structure of the CNN deep learning approach.

Figure 9. Modeling flowchart of the adopted research.

Figure 10. Scatter plots of forecasted versus measured EC for the Albert River.

Figure 11. Ridge plots of relative deviation percent (RD, %) for the Albert River EC forecasted by different models.

Figure 12. Scatter plots of forecasted versus measured EC values for Barratta Creek.

Figure 13. Ridge plots of RD (%) for the Barratta Creek EC forecasted by different models.

Figure 14. Taylor diagrams for one-step-ahead EC forecasting for (a) the Albert River and (b) Barratta Creek.

Figure 15. Scatter plots of multi-step-ahead forecasted EC versus measured EC for the Albert River.

Figure 16. Scatter plots of multi-step-ahead forecasted EC versus measured EC for Barratta Creek.

Table 1. Descriptive statistics of EC values at the stations of interest (2012-2021).

Table 3. Model adjustment for EC forecasting.

Table 4. Results of one-step-ahead EC forecasting for the Albert River.

Table 5. Results of one-step-ahead EC forecasting for Barratta Creek.

Table 6. Results of multi-step-ahead EC forecasting for the Albert River.

Table 7. Results of multi-step-ahead EC forecasting for Barratta Creek.