Machine learning models to predict nitrate concentration in a river basin

Aquifer-stream interactions affect the water quality in Mediterranean areas; therefore, the coupling of surface water and groundwater models is generally used to solve water-planning and pollution problems in river basins. However, their use is limited because model inputs and outputs are not spatially and temporally linked, and the data update and fitting are laborious tasks. Machine learning models have shown great potential in water quality simulation, as they can identify the statistical relationship between input and output data without the explicit requirement of knowing the physical processes. This allows the ecological, hydrological, and environmental variables that influence water quality to be analysed with a holistic approach. In this research, feature selection (FS) methods and algorithms of artificial intelligence—random forest (RF) and eXtreme Gradient Boosting (XGBoost) trees—are used to simulate nitrate concentration and determine the main drivers related to nitrate pollution in Mediterranean streams. The developed models included 19 inputs and sampling of nitrate concentration in 159 surface water quality-gauging stations as explanatory variables. The models were trained on 70 percent data, with 30 percent used to validate the predictions. Results showed that the combination of FS method with local knowledge about the dataset is the best option to improve the model’s performance, while RF and XGBoost simulate the nitrate concentration with high performance (r = 0.93 and r = 0.92, respectively). The final ranking, based on the relative importance of the variables in the RF and XGBoost models, showed that, regarding nitrogen and phosphorus concentration, the location explained 87 percent of the nitrate variability. RF and XGBoost predicted nitrate concentration in surface water with high accuracy without using conditions or parameters of entry and enabled the observation of different relationships between drivers. Thus, it is possible to identify and delimit zones with a spatial risk of pollution and approaches to implementing solutions.


Introduction
Nitrate is an important component in the environment. Its availability influences food supply, water and habitat quality, while toxic effects on stream biota and human health can occur with high concentrations of nitrate (Singh et al 2022). Its main source in Europe is diffuse pollution (Grinsven et al 2015, Alcon et al 2022, whereby nitrogen leaches when transformed into nitrate form. The main issue, then, with nitrates is their mobility in soil, and the fact that they can persist in surface water (SW) and groundwater (GW) (Defterdarović et al 2021), contributing to poor water quality and eutrophication (Pang et al 2022). Currently, the ecological status of more than half the water bodies in the EU is assessed as poor (Poikane et al 2019), contrary to the requirements of the Water Framework Directive (WFD) and Nitrates Directive (Directive 91/676/EEC). Decreasing nitrate concentration is already a challenge in several areas of Europe (Grizzetti et al 2021, Tzilivakis et al 2021 with approximately 40 percent of water bodies in Spain assessed as having poor water quality (Ministerio para la 2. Material and methods 2.1. Case study The Júcar RB is located within the Júcar River Basin District in the east of the Iberian Peninsula (Spain) on the Mediterranean side, with an area of 22,208 Km 2 (figure 1). The Júcar River has the largest catchment area and the greatest flow contribution of the Júcar RB District, with 36 surface water bodies and a length of 509 km on the main axis, which empties into the Mediterranean Sea. In the geomorphological context, the main characteristics of the basin can be grouped into two main zones: a mountainous interior, with peaks between 1,500 and 2,028 m, but which develops below 1,000 m and a second coastal zone, made up of coastal plains. This plain is an alluvial platform that provides nutrient-rich soil that supports most of the irrigated agricultural production, and is home to more than 80% of the basin's total population (Confederación Hidrográfica del Júcar 2022b).
Average temperatures range from less than 10°C inland to 18°C in the coastal zone (figure 1). The climate varies from humid to semi-arid, with the presence of droughts and a concentration of approximately 42 percent of the annual rainfall in autumn on the coastal strip. The average annual rainfall is 504 mm year −1 , with a spatial range of 797 mm year −1 in the headwater, 368 mm year −1 in the midstream and 679 mm year −1 at the mouth of the river at the Mediterranean Sea. The contribution to the main river network in the Júcar RB is 1245 hm 3 year −1 with 23.9 hm 3 year −1 discharging into the Mediterranean Sea. The great hydrological variability and the scarcity of resources in the basin has meant that, in order to meet the demand, especially for irrigation water, a large number of hydraulic infrastructures have been built with a total water storage capacity of 2,846 hm 3 (Confederación Hidrográfica del Júcar 2022b).
According to the dominant lithology of the GW bodies (IGME-DGA 2012), the outcrop can be classified as 25 percent detrital and 29 percent carbonate, with the rest being of mixed origin from both materials. The water bodies on the main axis of the river are classified as gaining stream (64 percent receiving discharges from the GW), losers (14 percent of the river infiltrating resources into the GW), and variable (22 percent representing one situation or another depending on the time of the year). The nitrate concentration of 25 percent of the aquifer is above the good status threshold, located in the midstream and downstream sections (Confederación Hidrográfica del Júcar 2022a).
The land use in the Júcar RB (EEA 2021) roughly breaks down into forest areas and open spaces (49 percent), agriculture (49 percent), and artificial surfaces (2 percent). Agriculture is the activity with the highest water resource requirement (85 percent of total demand), and the dry season (July and August) coincides with the most water demanding period (Ortega-Reig et al 2017). The water demand is 1338 hm 3 year −1 , of which 55 percent is supplied by rivers, and 41 percent by aquifers. The total rainfall area is 209,773 ha, 38 percent of which corresponds to citrus crops, located in the downstream of the basin, the area with the highest nitrate concentration in rivers and aquifers. The second and third most important groups are winter cereals for grain and the grape crop, each covering 11 percent of the area. However, net water demand is higher for rice crops (8011 m 3 ha −1 year −1 ), while citrus requires 3890 m 3 ha −1 year −1 (Confederación Hidrográfica del Júcar 2022c). Citrus orchards and rice crops with irrigation are the main sources of diffuse pollution in the basin. The largest cities in the basin are Albacete (385,000 inhabitants) and Cuenca (198.842 inhabitants). The discharge wastewater produced by domestic and industrial uses amounts to 20 hm 3 year −1 in the two cities. The greatest load of nitrate pollution comes from agriculture rather than from point sources (Dorado-Guerra et al 2021).

Observed data
The variable target nitrate concentration, water quality, water quantity and ecological parameters in SW, PL, and nitrate concentration in GW were measured by the Júcar RB District authority and the dataset is available on the Water Information System for the Júcar RB District report ('SIA Júcar' in Spanish: aps.chj.es/siajucar/, accessed on March 26 2021). The different sampling networks are shown in figure 1.
T w , pH, N, NO 2 , NH 4 , BOD 5 , SS, DO, and TP have recently been factors used to forecast nitrate concentration using machine learning models (Latif et al 2020). The variable target nitrate concentration and previous parameters have been measured at surface water quality gauging stations at 159 points since 1990.
Some studies have revealed the dependent relationship between hydrological factors and nitrate concentration in SW bodies with precipitation and streamflow playing an important role in the fluctuations across different temporal scales (Gu et al 2020). Precipitation and T a were acquired from AEMET (the State Meteorological Agency in Spain), which has a high-resolution (0.05 degrees) daily gridded precipitation dataset for Peninsular Spain and the Balearic Islands (version 2) (Peral García et al 2021). The point nearest to the surface water body was taken as the reference for precipitation in each of the river reaches, where the streamflow has been measured at 20 points since 1970. GW and SW interactions can be significant when modelling nitrate concentration in rivers in the region where piezometric levels and nitrate concentration in the GW are high (Rafiei et al 2022). The PL has been measured in 19 wells since 1990.
Changes are expected in the community structure after stress levels or pollutant agents and provide an early indication of possible adverse effects within the ecosystem. The Specific Pollution Sensitivity Index (IPS) measures the relative abundance of diatom species, and, with a score range from 0 to 20, the reaches with values above 18 are classified as good quality, while values close to 0 are classified as poor quality (Cemagref 1982). The Iberian Biological Monitoring Working Party (IBMWP) index is determined by the numbers of macroinvertebrate families (Alba-Tercedor et al 2002). Index scores range from 0 to 235 points, and reaches with values above 100 are classified as good quality, while values close to 0 are classified as poor quality. The quality riparian index (QBR) is used to assess the quality of the riparian vegetation, providing a rapid assessment of the overall condition of the riparian zone using four aspects (total riparian vegetation cover, cover structure, and quality and degree of naturalness of the stream channel). The QBR index scores range from 0 to 100, with reaches attaining values above 95 classified as good quality, and values close to 0 as poor (Munné et al 2003). Ecological indicators have been measured every year since 2009. Anthropogenic effects have been taken into account when dealing with DP, which corresponds to 99 percent of the nitrate load in the Júcar RB District. DP comes from the PATRICAL model using the methodology detailed in Dorado-Guerra et al 2021.
Once the time series were obtained for each SW body, the median of all parameters was calculated on a quarterly scale, with the exception of temperature and precipitation, which were entered into the model as cumulative. The analysis was performed for the period between 2009 and 2019, due to ecological indicator data being available since 2009. Table 1 shows the independent parameters, including data sources and timescale (Dorado-Guerra et al 2022).

Methodology
In total, 19 parameters, including climatic, hydrological, hydrogeological, ecological, water quality and anthropogenic, were used as inputs for modelling the SW nitrate concentration using RF and XGBoost models. The models were calibrated and validated with 70 and 30 percent of the dataset, respectively, which consisted of the target value and prediction factors at the location of each SW body from 2009 to 2019. Records with missing values were excluded from training and test datasets. As a result, some features with only few samples were excluded and the cross-validation (CV) method was applied, which allowed the algorithm to learn from the totality of the data, so that the data was unbiased. In order to identify the best input combination for nitrate estimation, a comprehensive feature selection analysis was carried out using Pearson correlation, mutual information (MI) and the BorutaShap algorithm. The applied methodology is depicted in figure 2.

Feature selection
When the number of inputs is high, selecting the best inputs has an important impact on the model accuracy and computational cost (Rodriguez-Galiano et al 2018, Effrosynidis and Arampatzis 2021). Therefore, to recognise  the best input combination for estimating nitrate concentration, a feature selection analysis was carried out using Pearson correlation, MI and the BorutaShap algorithm as a random forest-based wrapper process. MI is a measure of the quantity of information that a random variable shares with another variable. The mathematical definition of MI is described in Cover andThomas 2006, andVergara andEstévez 2014. It is related linearly to the entropies of variables: a nonlinear measure that can be a useful tool to determine the dominant inputs among large numbers of parameters, thereby supporting the information obtained with Pearson's coefficient. Features are ranked from largest to smallest MI values in terms of the target.
BorutaShap is a wrapper-feature selection methodology that merges the Boruta algorithm with the SHAP (Shapley Additive Explanations) framework for feature importance and ranking, and the sampling procedure uses smaller sub-samples of the available data at each iteration of the algorithm. Boruta and BorutaShap are based on a RF algorithm, which is faster than other algorithms, can usually be run without parameters tuning, can capture non-linear dependencies between predictor and dependent variables and provides a numerical score of feature importance (Kursa and Rudnicki 2010). The BorutaShap algorithm uses the following process (Keany 2021): (1). create shadow features (new copies of all the features in the dataset), and add the shadow features back to the dataset; (2). estimate the feature importance metrics of original and shadow features; (3). generate a threshold using the maximum importance score of the shadow features, and assign a hit to any features that are above the threshold; (4). carry out a two-sided t-test of equality for each unassigned feature; (5). classify the features into three groups-features with an importance significantly above the threshold ('important'), those that outperform at a less than the threshold ('tentative'), and features with an importance significantly below the threshold ('unimportant'), which are removed from the process; and 6. delete all shadow features and repeat the procedure until an importance has been assigned to each feature. The Boruta-SHAP library for Python was then applied to the feature selection (Keany 2020).

Machine learning models
Supervised learning algorithms, such as RF, are increasingly being used in SW pollution modelling (e.g., Thornhill et al 2017, Jamei et al 2022. RF is an assemblage of a large number of classification or regression trees, which uses a sample of the data to build a model. For regression targets, RF generates several decision trees and aggregates the predictions using bootstrapping, thereby averaging the predictions to construct a model using only a proportion of the predictors (Breiman 2001). The correlation between decision trees decreases, thereby improving the predictive power and reducing the computational complexity of the algorithm (Tyralis et al 2019).
XGBoost is an enhancement of the gradient-boosting decision tree algorithm (Chen and Guestrin 2016) with the main objective to improve the accuracy and speed of the model. Each update in the algorithm is based on the prediction results of the previous one; by adding a new tree to adjust the residual error between the prediction results of the previous tree and the true value, a new model was formed and used as the basis for the next model learning (J Li et al 2022). XGBoost increases the weight of training samples with high error rates and processes them multiple times with the aim of reducing the error rate (Kiangala andWang 2021, Singha et al 2021). Therefore, this algorithm is insensitive to outliers and consistent against overfitting, which simplifies model selection (Shahhosseini et al 2019). For the mathematical details of the algorithm, see Chen and Guestrin 2016.
The ML library packages within Python, scikit-learn and XGBoost, were used to carry out the RF and XGBoost algorithms and CV. Each model was validated using a K-fold CV with 10 repeats. To conduct RF and XGBoost analysis, a grid search for model performance optimisation was carried out with the CV; the hyperparameter ranges and optimised values detected are shown in table 2.

Prediction performance assessment
Model performance was evaluated using the modified version of the Kling-Gupta Efficiency (KGEM) and its three components (equation 1): r represents the correlation coefficient between the simulated and observed time series; b (bias) is the ratio between the simulated and observed means (μ) (equation 2); and γ is the ratio of the coefficients of variation for both time series (equation 3). The optimal value of the KGEM and for each of the three components is 1. The KGEM indicator provides a useful assessment of model performance due to its decomposition into correlation (r), bias (b), and variability (γ). In this way, the model's ability to reproduce the temporal dynamics and distribution of nitrate concentration can be measured (Gupta et al 2009, Kling et al 2012.  Table 3 shows the sensitivity analysis of applied MI for selecting dominant inputs. The highest MI scores were obtained with the N (1.15), PL (0.90), DRS (0.85), and nitrate-GW (0.68), and the lowest with DBO5 (0.00), SS (0.06), and DO (0.08). BorutaShap was applied to verify the Pearson and MI analysis, and the relative importance of features according to BorutaShap (table 3) indicated that N, DRS, piezometric level, IBMWP, TP and pH were the most important features for predicting nitrate concentration. The tentative features were DP and precipitation; the others were considered unimportant, and they should be omitted from the modelling process. The Pearson's coefficient, MI and BorutaShap values agreed on the three most influential parameters (N, PL and DRS), while the less influential parameters changed depending on the FS method.
The output of the FS methods was used to choose the input groups for the algorithms (table 4). Group 1 was composed of 19 features, and Group 2 of the 10 features with the highest value of the MI and Pearson correlation coefficient. Group 3 was similar to Group 2 but one variable (QBR) was excluded to increase the number of data; Group 4 was composed of the features selected using BorutaShap, and Group 5 was a mixture of the results found with MI, Pearson's coefficient (Group 3) and BorutaShap (Group 4). In Group 5, PL was excluded due to data availability and because it demonstrated a significant correlation with DRS, which could present collinearity. NO 2 was excluded due to data availability and because it did not show a strong relationship with nitrate concentration.

Modelling assessment
The nitrate concentration in the Júcar RB was predicted using RF and XGBoost algorithms with five groups of predictors. The KGEM indicator and its three parameters were calculated to evaluate the prediction accuracy of these models, and the values obtained in the validation stage are shown in figures 3(a) and (b). In all models with the RF algorithm the lineal correlation between simulated and observed is greater than 0.88, the bias was smaller than 10 percent and errors in the simulated variability less than 9 percent. The KGEM value ranged between 0.85 and 0.90, which means that there were no significant changes in the model's performance within the different groups. The difference in each of the parameters turned out to be only a few percent of the overall achievable range. However, a 4-percent increase in linear correlation was found with Group 3 when compared to Group 1. The best KGEM index was found within Group 5, which decreased the bias and increased lineal correlation. Meanwhile, the probability density function (PDF) of the residuals in validation shows (figure 3(c)) that all groups with RF algorithm were well-proportioned with lower mean and standard deviation values with high accumulation of errors in zero values. The differences observed between groups with the KGEM index are supported by the PDF. In the models with XGBoost algorithm in the validation stage, the KGEM index range was between 0.77 and 0.87, the lineal correlation greater than 0.86, the bias smaller than 6 percent, and the variability smaller than 16 percent ( figure 3(b)). In general, the XGBoost algorithm showed a systematic tendency to slightly underestimate the nitrate concentration in the validation. Group 5 showed the best result, decreasing the bias in simulated to 4 percent ( figure 3(b)), and improving the model performance by 2 percent compared with Group 1. The PDF shows that the errors of Group 5 were well-proportioned with lower mean and high accumulation in zero values, whereas the other groups showed a higher standard deviation of errors ( figure 3(d)). However, after using CV, the predictive performance of the models with XGBoost improved and reached a behavior similar to RF (figure 3(f)).
Group 5, which consisted of the variables with the best MI and BoruraShap scores, was identified as the optimal input combination for the two algorithms. It provided high lineal correlation, was unbiased (RF) or slightly biased (XGBoost), and the variability was smaller. Moreover, mean and standard deviation of errors had high accumulation in zero values. Likewise, the weakest performances in the validation with the two algorithms were related to Group 2, which consisted of the 10 variables with the best MI scores. It demonstrated high lineal correlation, and small bias; however, errors in the simulated variability are widespread (38 percent). After applying CV, Groups 1, 2, 3 and 4 displayed a similar behavior (figures 3(e) and (f)), and Group 5 still produced the best performance.
The plots simulated and observed nitrate values are shown in figure 4, comparing the performance of the two predictive algorithms applying CV with Group 5. The models showed a pattern of nitrate distribution along the river similar to the observed data, with differences existing mainly downstream of the watershed, where the models slightly underestimated the nitrate concentration (figures 4(a) and (b)). In general, the probability of identifying high nitrate concentrations increased in the middle and downstream of the watershed. The models fit the temporal variability of nitrate concentrations along the river. There had been a slight decrease in recent years, and this behavior is represented in the models. Moreover, the seasonal variability was in accordance with the observed values, with nitrate concentration higher in autumn and winter, and decreasing in summer. However, there was a slight underfitting in the values simulated in autumn and winter with the two algorithms downstream of the basin (figures 4(c) and (d)).

Importance of conditioning factors
The importance of the driving features in the modelling process is shown in figure 5. N was the most important feature in the prediction of nitrate concentration using RF and XGBoost algorithms. This result agrees with the MI, Pearson's coefficient and BorutaShap; however, there were differences between groups and algorithms in terms of ranking the features. Most important among the other features for the prediction of nitrate concentration are the following: DRS in Group 5 with RF, and Groups 4 and 5 with XGBoost; nitrate-GW in Groups 3 and 5 with both algorithms; precipitation in Groups 4 and 5 with both algorithms; PL in all groups in both algorithms (with the exception of Group 5); and pH and total P in all groups in both algorithms. Group 2, which performed with less accuracy using the two algorithms, gave a high importance (88 percent RF-90 percent XGBoost) to N, while in Group 5 with RF, the importance of N is 57 percent. Of all the variables used in the prediction of nitrate concentration, the least contributing variables were NO 2 , NH 4 , DO, SS, BOD 5 , Tw, Ta, streamflow and QBR.

Comparison of models and feature selection approaches
The models used with the RF and XGBoost algorithms are reliable when estimating the nitrate concentration in the Júcar RB. However, the difference in the calculation procedures of feature selection methods and algorithms  resulted in different model performances. Models in Group 2, consisting of the 10 features with the best MI score, performed the worst with the two algorithms. This could be because MI assesses the features independently without considering their context, and the features were selected in a univariate way. Therefore, Figure 5. Feature importance results of the RF and XGBoost algorithms with different groups of variables.
MI was not able to deal directly with the problem of redundant inputs (Nourani et al 2017, Effrosynidis andArampatzis 2021). Models in Group 3 were similar to Group 2, but the removal of a feature with only few data improved the model performance by 2 percent. Models in Group 4, consisting of the BorutaShap results (selection of 8 out of 19 features) improved the model performance from Group 2 by 2 percent. Although BorutaShap is a new algorithm, it has recently been used in different fields, performing well in terms of feature reduction and predictive accuracy (Kleiman et al 2021, Ghosh and Chaudhuri 2022, Peiró-Signes et al 2022. It reduces the number of features by including only the relevant ones without compromising the model performance, and the Shap value embedded in the algorithm adds an important explanatory capacity that reduces the overfitting problem (Ghosh and Chaudhuri 2022). The highest performance was found with Group 5 (merging Groups 3 and 4) with the two algorithms. Combining the results of the two selection methods and knowledge of the data allowed variables that were highly correlated and those that provided few data to be excluded. PL depends on DRS, and removing PL from the predictors reduced the model complexity and the cost of prediction and increased the sample size of the dataset. Sample size had a significant impact on modelling and prediction performance in this study, and the increase of training data and smaller set of features decreased the variance among the residuals. In this way, the performance of the model was improved. Comparing the two algorithms in the validation stage for Group 5, the RF resulted in a slightly better performance (3 percent) in respect to bias and variance. However, after applying CV the performance of XGBoost improved (4 percent), while RF remained the same. Therefore, either algorithm could be used for nitrate prediction, as the difference between the two algorithms was 1 percent. The improvement with CV for the XGBoost algorithm was possibly due to the fact that successive trees gave extra weight to points incorrectly predicted during the previous analysis and finally a weighted vote was taken for the prediction (Fan et al 2018). After using CV, both models were able to recognise the complex interactions between conditioning factors, and Tomperi et al 2017 reported an increase in the accuracy of the prediction of AI models after applying the CV method.
On the other hand, the results revealed how sensitive XGBoost is to the wrong features being selected. In Groups 2, 3, and 4, the XGBoost metrics decreased for the validation dataset. In contrast, RF showed a more robust model, and introducing wrong features to RF did not change the model performance considerably, as it maintained a similar performance level. In other research using RF and XGBoost algorithms, the authors reported that they obtained the best performance with XGBoost, although the difference with RF was small (Fan et al 2018, Zhong et al 2019, Kiangala and Wang 2021, Peiró-Signes et al 2022. XGBoost and RF are ensemble algorithms; therefore, it is difficult to explain their predictions, and each one has different limitations. The performance of RF depends on the amount of data used in the training dataset (Ghimire et al 2022), while XGBoost presents less accurate results when dealing with imbalanced data (Kiangala and Wang 2021).
The models used with XGBoost and RF algorithms are substantially higher than the traditional hydrological models applied in the Júcar RB. The coupling of hydrological and water quality models in the Júcar RB found 58 percent of lineal correlation, a bias smaller than 20 percent, and the variability was 25 percent (Dorado-Guerra et al 2021). ML algorithms improved the correlation, bias and variability measures reached with the coupling of hydrological models in the Júcar RB by 40 and 37 percent with RF and XGBoost algorithms, respectively, with the lineal correlation the parameter that improved the most. Similar results were found by Wu et al 2017, who reported that AI algorithms are statistically better than hydrologic models.

Use of extrinsic features of surface water bodies and their effect on nitrate pollution
It can be inferred that it is possible to model the nitrate concentration in SW in the Júcar RB using N, DRS, P, IPS pH, nitrate-GW, precipitation, DP and IBMWP, the features representative of weather, location, ecological status, water quality and anthropogenic effects. This approach could be considered as a methodology to predict nitrate concentration, especially in data-scarce areas, but it must be validated in the other catchments of the region. Other studies showed that location and precipitation were important driving factors affecting water quality in rivers and aquifers (Ha et al 2020, He et al 2022. The results show that the high nitrate concentration in the Júcar RB is linked to high nitrogen zones (figure 6), and that the relationship between these two variables is lineal as shown by Pearson's correlation. Other studies showed similar results, in which nitrogen was the main predictor of nitrates (Oehler and Elliott 2011). Nitrogen leaches when transformed into nitrate form, and the main issue then with nitrates is their mobility in the soil and the fact that they can persist in SW and GW (Defterdarović et al 2021). Agricultural activity is the main source of nitrogen in the watershed (Dorado-Guerra et al 2021); therefore, DP is the most probable cause for the higher nitrate probabilities and the increase of the nitrate concentrations in the river.
The DRS exhibited a positive effect on nitrate concentration in SW in the Júcar river (figure 6), and a similar result in GW was found by Rodriguez-Galiano et al 2014 and He et al 2022. This may be because the nitrate pollution is associated with agricultural zones located in the downstream of the watershed, while in the upstream the land use is forest (Dorado-Guerra et al 2021). Therefore, DRS contributed significant information to help identify polluted areas.
Precipitation was the most influential meteorological variable with relative importance, though a weak positive effect of precipitation on SW nitrate was detected by the two algorithms used. In this study, precipitation above 500 mm/trimester was associated with high nitrate concentration in SW (figure 6); as nitrate inputs were mainly from diffuse sources, rise of nitrate concentration takes place mainly in winter and spring when precipitation is high (figures 4(c) and (d)). However, the influence of precipitation on the SW nitrate concentration is complex, as shown in figure 6. For example, high rainfall increases the streamflow resulting in the dilution of SW chemical components (Romero et al 2007;Temino-Boes et al 2021), which can also promote crops to uptake nitrogen (Sieling and Kage 2006). The precipitation would then have positive and negative effects on nitrate concentration in SW.
TP was another important factor for predicting nitrates in SW, with similar results found by Oehler and Elliott 2011. TP above 0.1 mg l −1 was associated with high nitrate concentration in the Júcar RB (figure 6), which might be an indicator of the N:P ratio controlling important N speciation processes through temporary plant uptake and decay (Ensign and Doyle 2006). As for pH, there was a negative relationship with nitrate concentration, perhaps due to the fact that increasing pH affects microbial activity and decreases the nitrification process (Chen et al 2006), and pH levels above 8.25 and below 7.4 were associated with the lowest nitrate Figure 6. Scatter plot overlaid on a density contour plot to show a correlation between predictor features of group 5 with RF algorithm and the target variable. concentration (figure 6). The relationship between the high nitrate concentration in GW was not clearly related to the high nitrate levels in SW, because this relationship depends on the river-aquifer interaction. However, in a previous study, it was shown that there is a high linear correlation between nitrate content in both GW and SW when river and aquifer are connected (Dorado-Guerra et al 2021).
Nitrate is an important predictor of diatoms index IPS and macroinvertebrates index IBMWP (Valerio et al 2021), and several studies have shown that diatom distribution is highly dependent on nitrates, which have fast growth rates that allow them to react faster to chemical changes and detect the first step of degradation (Doung et al 2007, Tan et al 2017, Karaouzas et al 2019. The relationship between nitrate and IPS and IBMWP indices was negative in this study (figure 6); IPS values above 16 were related with the lowest nitrate concentration, while IBMWP values above 80 were related with the lowest nitrate concentration (figure 6).

Conclusions
This paper explores the potential of feature selection and artificial intelligence algorithms to model nitrate concentration in surface water bodies in areas with water scarcity and high interaction between rivers and aquifers. RF and XGBoost successfully modelled the nitrate concentration in the Júcar RB and enabled recognition of the complex interactions between conditioning factors. FS methods are useful tools, but they need to be combined with local knowledge of the dataset, as the amount of data available and high correlation between predictor features affect the performance of the models. Nitrogen, total phosphorus and location were the strongest predictor factors for nitrate concentration in surface water bodies in the Júcar RB, because they accounted for approximately 88 percent of the nitrate variation. On the other hand, RF and XGBoost models obtained better performance than hydrological models in the prediction of nitrate concentration in surface water bodies of Júcar RB.