Dissolved oxygen modelling of the Yamuna River using different ANFIS models

Dissolved oxygen (DO) is one of the prime parameters for assessing the water quality of any stream. Accurate estimation of DO is therefore necessary to devise measures for maintaining the riverine ecosystem and to design appropriate water quality improvement plans. Machine learning techniques are becoming valuable tools for the prediction and simulation of water quality parameters. A study was performed in the Delhi stretch of the Yamuna River, India, where physicochemical parameters were examined for 5 years to simulate the DO using various machine learning techniques. The simulation and prediction capabilities of the adaptive neuro-fuzzy inference system with grid partitioning (ANFIS–GP) and with subtractive clustering (ANFIS–SC) were evaluated on high-dimensional river characteristics. Four different models (M1, M2, M3 and M4) were developed using different combinations of input parameters to predict DO. Results obtained from the models were evaluated using root mean square error and coefficient of determination (R²) to identify the appropriate combination of parameters to simulate the DO. Results suggest that both types of ANFIS models work adequately and accurately predict the DO; however, ANFIS–GP outperforms ANFIS–SC. M4 generated an R² of 0.953 from ANFIS–GP compared to 0.911 from ANFIS–SC.


INTRODUCTION
The production and consumption of dissolved oxygen (DO) in rivers are dynamic and complex (Zahraeifard & Deng 2012). DO remains in water as free oxygen and its concentration varies due to diffusion. The concentration of DO in water depends on several sources, sinks and solubility rates. The atmosphere is the most significant external source of oxygen to a stream, and photosynthesis plays a significant role as an internal source of oxygen (Lyons et al. 2014). Photosynthesis contributes more oxygen to water because algae release pure oxygen, whereas the gas transferred by atmospheric diffusion at the air-water interface contains only 20% oxygen (Holtgrieve et al. 2010). Microorganisms, aquatic plants and aquatic animals all consume oxygen through respiration; these sinks remain active throughout the day and night. In contrast, photosynthesis generates oxygen only during the daytime, so algae act as both a source and a sink of oxygen (Arora & Keshari 2021b). Another critical factor is solubility, which depends on the water pressure, temperature and salinity. An increase in pressure increases the solubility of the gas, whereas higher salinity and temperature reduce it (Cox 2003; Verberk et al. 2011). A healthy riverine ecosystem maintains a balance between the sources and sinks of oxygen; however, several factors affect the DO concentration in the river along with the depth of the water body. Mathematically, the concentration of DO can be expressed as a function of these terms, $\mathrm{DO} = f(\mathrm{DO}_{so}, \mathrm{DO}_{si}, S)$, where $\mathrm{DO}_{so}$ is the source of DO, $\mathrm{DO}_{si}$ is the sink of DO, and $S$ is solubility. A persistently low concentration of DO in a river triggers several environmental problems (Ay & Kisi 2012). The river system's biota is affected if the oxygen content falls below 30% of the saturation limit.
The DO concentration varies rapidly with the flow available in rivers, velocity, turbulence, the amount of organic matter and the atmospheric reactions involved in the riverine system (Cox 2003; Quick et al. 2019). Anthropogenic activities are becoming significant sinks of oxygen, consuming the available DO through partially treated or untreated wastewater from domestic, industrial, commercial and agricultural sectors (Arora & Keshari 2021b). Maintaining equilibrium between sources and sinks is essential for the sustainability of the aquatic ecosystem.
Assessing DO variation in heavily polluted rivers with statistical methods is no longer an appropriate approach because the water quality parameters are complex and nonlinear (Cox 2003; Parmar & Keshari 2012; Arora & Keshari 2021a). Various researchers have used machine learning techniques such as artificial neural networks (ANN) and the adaptive neuro-fuzzy inference system (ANFIS) to predict, simulate and forecast water quality parameters (Singh et al. 2009; Chen & Liu 2014; Ay & Kisi 2017; Tiwari et al. 2018; Shah et al. 2021; Alsulaili & Refaie 2021). Fuzzy logic has several advantages in classification, data mining, interpretation, and optimization of time series data in various fields (Wijayasekara & Manic 2014; Tiwari et al. 2018). Fuzzy theory has been widely used to model nonlinear behaviour in various hydrological applications (Altunkaynak et al. 2005; Keskin et al. 2006; Chang et al. 2015; Khan & Valeo 2015; Ay & Kisi 2017; Arora & Keshari 2021a). The fuzzy system can remove the uncertainties from the data and develop the model structure through the rule-based system (Huang et al. 2010; Shah et al. 2021). Altunkaynak et al. (2005) used the Takagi-Sugeno fuzzy logic approach to model fluctuations in DO at Golden Horn, Turkey, and compared the results with autoregressive moving average (ARMA) models. The results reveal that the fuzzy models are superior to ARMA in predicting DO fluctuations. Güldal & Tongal (2010) identified the variation in the water depth of a lake and compared the accuracy of recurrent neural networks (RNN), ANFIS and stochastic models using the coefficient of determination. They found that RNN and ANFIS performed better than stochastic models. Moosavi et al. (2013) compared different data-driven models to predict a reservoir's groundwater level at two distinct basins.
The researchers used ANN, ANFIS and ANN-ANFIS coupled models and found that ANFIS and the combinations of various models perform better than the ANN due to the errors involved in selecting an adequate number of neurons for the ANN model. ANFIS is also better than ANN because of its capability of analyzing uncertainties in input parameters. Parmar & Bhardwaj (2015) compared regression, ANN, wavelet and ANFIS models to predict chemical oxygen demand (COD) in the Yamuna River, India. They also compared the conventional techniques with the wavelet-coupled model. Khan & Valeo (2015) applied fuzzy regression and compared it with the Tanaka and Diamond methods of fuzzy modelling to predict the DO, and found that the ability to record the uncertainty of water quality parameters makes the fuzzy regression technique a substantial approach for predicting DO. Shah et al. (2021) compared the performance of various ANFIS models by varying the type and number of membership functions (MFs) to predict the electrical conductivity (EC) and total dissolved solids (TDS) in the Indus River. The ANFIS model was developed using three MFs, with triangular and Gaussian MF types used for EC and TDS, respectively. The study generated high coefficients of correlation (R) of 0.91 and 0.92 for EC and TDS, respectively, and revealed that, with pre-processing of data and selection of appropriate parameters, ANFIS can simulate the water quality parameters with less intricacy than deterministic models.
The literature review shows that fuzzy modelling techniques can be applied to a wide range of problems with high accuracy. However, detailed studies on the differences between the two approaches of fuzzy modelling (subtractive clustering [SC] and grid partitioning [GP]) are unavailable. Applying the correct approach for the prediction of a parameter can improve the simulation results significantly. The derivation of fuzzy models depends on linguistic terms designed via MFs and delivers input parameters to the optimization model (Cordón 2011). As DO acts as the health indicator of the riverine system, the predominant tasks are the accurate prediction of DO for assessing the state of the water body, designing policies for water resource management, and appropriate allocation of available water that keeps a sufficient amount of flow in rivers. In this study, the GP and SC approaches are applied to model the DO of a stream passing through a highly urbanized area, which receives a large volume of wastewater from domestic, industrial and agricultural sources through multiple drains.

MATERIAL AND METHODOLOGY
ANFIS models were developed for the simulation and prediction of DO. A hybrid algorithm combining the least-squares and gradient descent methods was used to conserve the search space and minimize the model's operational time. The model's structure is designed using GP and SC methods, and various combinations of input parameters are tested using both methods.
Adaptive neuro-fuzzy inference system
ANFIS is the combined structure of neural network and fuzzy logic. This composite structure allows neurons to record the input data and fuzzy rules to optimize the solution. The fuzzy sets in the model define the fuzzy rule base and make the ANFIS capable of simulating the nonlinear behaviour of input parameters. The rule base of the network increases with the number of input parameters. However, it also increases the computational time of the model (Chang & Chang 2006). The ANFIS structure uses five layers: input, fuzzification, normalization, defuzzification and output layers. The number of input parameters is defined in the first layer. Fuzzification includes the distribution of MF to each input parameter and allocation of type of MF. For splitting the ANFIS input function, the model uses fuzzy MFs, which cover the input space and activate several local regions simultaneously using single input through overlapping. The number of MFs plays a vital role as the MFs control the partitioned input function's resolution and the ANFIS model's approximation.
If-then rule bases are formed based on the number and type of MFs defined in the previous step. Fuzzification converts the input into fuzzy sets, and defuzzification converts the fuzzy sets back into an output after applying the inference processes, normalization and optimization (Chang & Chang 2006). The fuzzy inference system (FIS) rule base can, however, be altered by understanding the relationship between input parameters, which reduces the computational time while keeping an optimized output. The ability to alter the rule base and modify the structure of the FIS makes it preferable to neural networks for a wide range of applications (Arora & Keshari 2020). The FIS is designed using Gaussian-type MFs with a hybrid learning algorithm to optimize the model. The FIS structure depends on the type and number of MFs selected for modelling (Babuška & Verbruggen 2003; Sonmez et al. 2018). The overall architecture of the FIS model is shown in Figure 1.
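The five-layer structure described above can be illustrated with a minimal forward pass. This is a sketch under stated assumptions, not the authors' implementation: it assumes a zero-order Sugeno system (one constant per rule), Gaussian MFs, and rule firing strengths formed as the product of membership grades. The function names and all numeric values are hypothetical.

```python
import math
from itertools import product


def gaussian_mf(x, centre, sigma):
    """Gaussian membership grade of x for a fuzzy set (centre, sigma)."""
    return math.exp(-0.5 * ((x - centre) / sigma) ** 2)


def anfis_forward(inputs, mfs, consequents):
    """Zero-order Sugeno forward pass (illustrative only).

    inputs      : list of crisp input values (input layer)
    mfs         : for each input, a list of (centre, sigma) pairs
    consequents : one constant per rule, ordered like itertools.product
    """
    # Fuzzification: grade of each input in each of its MFs
    grades = [[gaussian_mf(x, c, s) for (c, s) in mf_list]
              for x, mf_list in zip(inputs, mfs)]
    # Rule firing strength: product of one grade per input
    firing = [math.prod(combo) for combo in product(*grades)]
    # Normalization: firing strengths rescaled to sum to one
    total = sum(firing)
    weights = [w / total for w in firing]
    # Defuzzification and output: weighted sum of the rule constants
    return sum(w * c for w, c in zip(weights, consequents))


# Hypothetical example: two inputs, two MFs each, hence four rules
mfs = [[(0.0, 1.0), (1.0, 1.0)], [(0.0, 1.0), (1.0, 1.0)]]
print(anfis_forward([0.5, 0.5], mfs, [0.0, 1.0, 1.0, 2.0]))
```

Because the normalized weights sum to one, the output is always a weighted average of the rule constants; training adjusts the MF parameters and the constants.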

Grid partitioning
GP is a fuzzy clustering method commonly used to design the FIS when the number of input variables is small. The input space is partitioned based on the minimum distance between each input variable divided into two membership functions. The problem region is divided into sub-regions, and the input space is further divided into sub-regions to refine the space depending on the type and number of MFs selected for designing the model. The partitioning method is preferred when the knowledge about the centre's distribution is not adequate (Benmouiza & Cheknane 2019). The rule base of the grid-partitioned FIS is defined as: where x is the input region varying from 1, 2, …, m for the mth sub-region and $A_{ki}^{m}$ is the fuzzy term.
is the maximal value of the input parameter, and both the values would be computed using the least squares method. The input region is divided into m sub-regions, where $x = x_1, x_2, x_3, \ldots, x_m$. The MF for the fuzzy term $A_{ki}^{m}$ would be: where $\mu'_m(x_i)$ is the MF. The output (O) corresponding to the mth sub-region is written as: The sub-regions are divided on the maximum value of error from the training samples. Once the maximum errors of every sub-region are obtained, the region splits into two regions and the new approximation error is the minimum of the new sub-region. The sub-region splitting continues until the errors become constant in two regions. The splitting of a sub-region into multiple regions is shown in Figure 2. The maximum error obtained from the sub-region at which the split occurs is written as: where $E_m$ is the error obtained from the mth sub-region from the $N_m$ training samples, and $x_j$ and $t_j$ are the model output and the target value, respectively, for the jth training sample. The computational time in GP increases exponentially with the number of input parameters and MFs.
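To make the idea of partitioning the input space concrete, the following sketch builds a uniform grid partition: each input range is covered by evenly spaced Gaussian MFs between its minimum and maximum, and the rule base is the set of all MF combinations. This is a simplification and an assumption: it does not reproduce the error-driven splitting of sub-regions described above. The function name `grid_partition` and the choice of MF width are hypothetical.

```python
import math
from itertools import product


def grid_partition(data, n_mfs=3):
    """Uniform grid partition of the input space (illustrative only).

    data  : list of samples, each a list of input values
    n_mfs : number of Gaussian MFs per input (at least 2)
    Returns (mfs, rules): mfs[i] is a list of (centre, sigma) for input i,
    and rules is a list of tuples holding one MF index per input.
    """
    n_inputs = len(data[0])
    mfs = []
    for i in range(n_inputs):
        column = [row[i] for row in data]
        lo, hi = min(column), max(column)
        step = (hi - lo) / (n_mfs - 1)
        # Assumed width: neighbouring MFs cross at a membership grade of 0.5
        sigma = step / (2 * math.sqrt(2 * math.log(2)))
        mfs.append([(lo + k * step, sigma) for k in range(n_mfs)])
    # One rule per combination of MFs across all inputs
    rules = list(product(range(n_mfs), repeat=n_inputs))
    return mfs, rules


# Hypothetical data: three samples of two inputs
mfs, rules = grid_partition([[0, 10], [4, 20], [2, 30]], n_mfs=3)
print(len(rules))  # 3 MFs on each of 2 inputs gives 9 rules
```

The number of rules is the number of MFs raised to the number of inputs, which is why the computational cost of GP grows so quickly with additional inputs.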

Subtractive clustering
In SC, the number of rules equals the number of MFs formed. In this method, each data point is considered a candidate cluster centre, and the importance of each centre is identified through the data points in the centre's neighbourhood. The process runs through several iterations and allocates the centre by identifying the most influential centre with the highest number of data points in its surroundings. The radius of the cluster of points is identified using the centre of neighbouring points. The process repeats until all the data points fall within the radius of a cluster. The potential of the data point is written as: where $P_i$ is the potential index of data point $x_i$ and $r$ is the radius within which all the neighbourhood's data points fall. The second iteration is calculated as: where $P_{c1}$ represents the potential of cluster 1 and $r_a = K_r \cdot r_b$, where $K_r$ is a positive constant, usually 1.5, and $r_b$ is the neighbourhood radius. The process is repeated, and the cluster radius is recalculated until a sufficient number of cluster centres has been generated.
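The procedure can be sketched in a few lines. The sketch assumes the commonly used exponential form of the potential, in which each point's potential is a sum of terms exp(-4 d² / r²) over all data points, and a wider penalty radius (K_r times the neighbourhood radius) when the influence of a chosen centre is subtracted. The exact expressions used in the study are not reproduced here, so this form, the stopping rule and all names are assumptions.

```python
import math


def subtractive_clustering(points, radius=0.5, k_r=1.5, stop_ratio=0.15):
    """Pick cluster centres by potential (illustrative only).

    points     : list of coordinate lists, ideally scaled to [0, 1]
    radius     : neighbourhood radius used for the initial potential
    k_r        : factor giving the wider penalty radius k_r * radius
    stop_ratio : assumed stopping rule; stop once the best remaining
                 potential is below stop_ratio times the first peak
    """
    def sqdist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    alpha = 4.0 / radius ** 2
    beta = 4.0 / (k_r * radius) ** 2
    # Initial potential: high where many points lie close together
    potential = [sum(math.exp(-alpha * sqdist(p, q)) for q in points)
                 for p in points]
    first_peak = max(potential)
    centres = []
    while True:
        best = max(range(len(points)), key=lambda i: potential[i])
        peak = potential[best]
        if peak < stop_ratio * first_peak:
            break
        centres.append(points[best])
        # Subtract the new centre's influence from every point
        potential = [p - peak * math.exp(-beta * sqdist(x, points[best]))
                     for p, x in zip(potential, points)]
    return centres


# Hypothetical data: two tight groups of points
pts = [[0, 0], [0.05, 0], [0, 0.05], [1, 1], [0.95, 1], [1, 0.95]]
print(len(subtractive_clustering(pts)))  # 2
```

Each selected centre becomes one MF per input and one rule, so the size of the rule base follows the data rather than the number of inputs.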

MODEL DEVELOPMENT AND EVALUATION
The sampling data are divided in a 70:30 ratio to design the model, where 70% of the sampling data points are used for model training and 30% are used for testing the model. Different sets of parameters are selected for designing the model. The four models are designed using temperature, biological oxygen demand (BOD), COD, conductivity and ammonia. The spatial and temporal analysis of data suggests that temperature, BOD, COD and ammonia produce the most significant impact on the variation in DO concentration, hence they are used for the model development (Arora & Keshari 2021b). The first model (M1) is developed considering temperature, BOD and COD as input parameters. The additional parameter selected in the second model (M2) is conductivity. The presence of ammonia reflects the generation of algae in the water, which acts as both the source and sink of oxygen. The third model (M3) is designed by combining the base parameters with ammonia.
The fourth model (M4) covers the combined effect of conductivity and ammonia over the base parameters to simulate the river's DO. The GP and SC fuzzy clustering algorithms were used to design the model with similar input parameters. ANFIS input optimization technique was applied to identify the appropriate combination of input parameters. Four models were designed with different input parameters to observe the contribution of each parameter in affecting DO concentration. The spatial and temporal behaviour of the parameters and results of hierarchically aligned cluster analysis and principal component analysis (PCA) is used to select the appropriate parameters (Arora & Keshari 2021b). The input parameters selected to design the FIS models are shown in Table 1.
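The four input combinations and the 70:30 division can be written down compactly. This is a sketch under stated assumptions: the text does not say whether the split was random or chronological, so a seeded random shuffle is assumed here, and the sample count of 60 (monthly samples over 5 years) is used only for illustration.

```python
import random

# Input combinations as described in the text
BASE = ["temperature", "BOD", "COD"]
MODEL_INPUTS = {
    "M1": BASE,
    "M2": BASE + ["conductivity"],
    "M3": BASE + ["ammonia"],
    "M4": BASE + ["conductivity", "ammonia"],
}


def train_test_split(samples, train_fraction=0.7, seed=0):
    """Shuffle the samples and split them into training and testing sets.

    A random split is an assumption; a chronological split would simply
    skip the shuffle.
    """
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical sample indices standing in for 60 monthly records
train, test = train_test_split(list(range(60)))
print(len(train), len(test))  # 42 18
```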
The coefficient of determination (R²) and root mean square error (RMSE) were evaluated to analyze the model's performance. An RMSE close to zero indicates that the model is adequate, and an R² close to 1 represents a better correlation between the observed and the predicted values obtained from the FIS model. The formulas used to compute R² and RMSE are:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (o_i - p_i)^2}{\sum_{i=1}^{n} (o_i - \bar{o})^2}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (p_i - o_i)^2} \qquad (10)$$

where $p_i$ is the predicted value of DO, $o_i$ is the observed value of DO, $\bar{o}$ is the mean observed value of DO and $n$ is the number of samples.
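Both performance measures are straightforward to compute. The sketch below assumes the common definitions: RMSE as the square root of the mean squared difference between predicted and observed values, and R² as one minus the ratio of the residual sum of squares to the total sum of squares about the observed mean. The function names are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of determination, assuming R2 = 1 - SS_res / SS_tot."""
    mean_o = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_o) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# A perfect prediction gives RMSE = 0 and R2 = 1
print(rmse([1, 2, 3, 4], [1, 2, 3, 4]), r_squared([1, 2, 3, 4], [1, 2, 3, 4]))
```

Predicting the observed mean for every sample gives R² = 0 under this definition, which provides a useful baseline when reading the model comparison.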

STUDY AREA
Delhi is one of India's largest and most densely populated cities, and all the wastewater generated from its various sectors (domestic, commercial, industrial and agricultural) joins the Yamuna River. Part of the wastewater passes through treatment processes before its confluence with the river. However, the share of treated wastewater is very low compared to the untreated wastewater that is discharged into the drains through irregular means and subsequently joins the river. The Yamuna River travels about 375 km before reaching Delhi, and the flow of the river is obstructed at the Wazirabad barrage for water supply. The freshwater flow remains low throughout the year, and only wastewater from Delhi flows in the river except during the monsoon period. Monthly water samples for 5 years were collected from Nizamuddin, Delhi, which is located 16 km downstream of the Wazirabad barrage. Between the Wazirabad barrage and Nizamuddin, the Yamuna River receives effluents from several drains, of which the Najafgarh drain discharges the most, carrying 2.5 times more water than is available in the river (CPCB 2006). As the Najafgarh drain joins the river just after the Wazirabad barrage (0.5 km downstream), it causes maximum damage to the river's water quality (CPCB 2006). The untreated or partially treated wastewater discharged from 16 drains makes the 22 km Delhi stretch of the Yamuna River, between the Wazirabad and Okhla barrages, one of the most polluted sections of the river. The water quality falls into category E of the designated best-use water quality criteria of the Indian water quality standards, which indicates that the river's water is not fit for drinking, even after advanced treatment (CPCB 2006). The BOD load in the river increases up to 80 tonnes/day after the Najafgarh drain's confluence. The sampling location also receives the effluents from a thermal power plant on the river's right bank.
The flow of the river is obstructed by several structures in the study area, including six road bridges, two railway bridges and two metro railway bridges, which cause silting near the bridge piers. The DO content of the river falls to zero in this stretch and causes a significant degradation of aquatic plants and animals. Anaerobic conditions have also been observed in the river due to the decomposition of organic matter in the absence of oxygen.
The sampling location is selected considering distance from the Wazirabad barrage, confluence of drains and time taken by the flow to provide sufficient mixing of wastewater with the river water to represent a homogeneous mixture. The river receives the effluent load from the right bank only and has heavy habitation close to the right bank, as shown in Figure 3. On the left bank, the river has a flood plain to cater for the excess water during a flood; however, the encroachment of the flood plain is another problem of the Yamuna River as it causes narrow channels of flow during the monsoon period and a river flow with low discharge during the rest of the year.
Monthly water samples were collected for 5 years (Arora & Keshari 2021b) and physicochemical analyses were carried out as per the standard methods (APHA 2005). In-situ measurements were performed for temperature, DO and pH of the water samples. The analyses of BOD, COD, ammonia, total Kjeldahl nitrogen (TKN) and conductivity were performed in the laboratory within 48 h of collection, after preserving the samples with hydrochloric acid at a temperature below 4°C in the dark.

RESULTS AND DISCUSSION
The experimental analysis shows significant variation in all water quality parameters except pH. The average temperature remains around 33°C during summer and falls to 11°C in winter. The DO remains zero for most of the year due to the low freshwater flow and the confluence of wastewater drains at regular intervals. The BOD and COD were significantly higher than the discharge norms (CPCB 2006). Similarly, ammonia and TKN were also much higher throughout the year, except during the monsoon season. Delhi receives rainfall from July to September, and sufficient water flows into the river during this period, reducing BOD levels to below 50 mg/L and increasing DO to 2 mg/L. The minimum values observed for ammonia, TKN, BOD and COD are caused by the dispersion of impurities due to the excess flow received in the monsoon season. Even then, the river ecosystem remains in poor condition, although it is better than in the river's non-monsoon state. The descriptive statistics of the water quality parameters used for the study are shown in Table 2.

Input optimization
The effectiveness of the models depends on the type and number of parameters selected for the simulation of output. BOD and COD were selected in every model and produced a direct and substantial impact on the DO concentration, whereas temperature affected the regeneration rate of DO from the atmosphere. The other parameters selected in M2 and M3 (i.e., conductivity and ammonia, respectively) were also found to be principal factors in fluctuating DO values. Previous studies show that ammonia and TKN follow a similar pattern of variation and produce a similar degree of flux, and therefore only ammonia was selected for the model development.

ANFIS model analysis
The Takagi-Sugeno algorithm was used for the development of the model. Three MFs of Gaussian type were selected for each input parameter, and the output is generated from a constant-type MF. The number and type of MFs were selected based on the data distribution and computational time. The performance of GP depends on the fineness of the grid, and the SC calculations are proportional to the number of data points. The model structure of the GP and SC methods is shown in Table 3. In all the models based on GP, three MFs are used for each input, whereas the number of MFs is generated automatically in SC, depending on the data points of each parameter and the number of input parameters in each model. M1 has three input parameters and uses five MFs; M2 and M3 have four input parameters each and use six MFs; M4 contains five input parameters and uses nine MFs in the SC-based model. As the number of input parameters increases, the rule base grows exponentially in GP, whereas SC considers each data point as a candidate cluster centre. The computational time in GP therefore increases exponentially with the number of input parameters and MFs: the rule base for five input parameters with three MFs each becomes $3^5 = 243$, whereas in SC the rule base remains small and consumes less computational space. The optimization was carried out for both partitioning methods using hybrid learning for all the models. For each model, the number of epochs was varied until the observed error became constant or reached its minimum.
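The growth of the GP rule base can be checked with a short calculation. The sketch assumes that every combination of MFs across the inputs forms one rule, so the rule count is the number of MFs per input raised to the number of inputs. The variable names are hypothetical.

```python
# Three Gaussian MFs per input, as stated for the GP-based models
MFS_PER_INPUT = 3

# Number of input parameters in each model
N_INPUTS = {"M1": 3, "M2": 4, "M3": 4, "M4": 5}

# One rule per combination of MFs across the inputs
gp_rules = {model: MFS_PER_INPUT ** n for model, n in N_INPUTS.items()}
print(gp_rules)  # {'M1': 27, 'M2': 81, 'M3': 81, 'M4': 243}
```

Adding a single input triples the rule base, which is the practical reason GP becomes expensive as parameters are added.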
The performance of the models was evaluated using RMSE and R². The results of ANFIS-GP and ANFIS-SC, shown in Table 4, indicate that both methods produce suitable solutions for the prediction. M1 produces an acceptable but higher RMSE than the other models in both ANFIS-GP and ANFIS-SC. This indicates that the input parameters used for modelling are insufficient to explain the phenomenon of DO variation in the river. However, an R² of more than 0.75 indicates that these input parameters are substantial factors that affect the variability in DO concentration. The performance of the M2 and M3 models increases with the inclusion of conductivity and ammonia, respectively, compared to M1. However, the R² values indicate better performance of M2 compared to M3 in both the GP and SC methods. The results suggest that conductivity produces higher variation in DO than ammonia. M4 considers conductivity and ammonia together with the other parameters, and it outperforms the other models. The RMSE of M4 was only 0.049, and its R² is 0.953 for ANFIS-GP. The bi-plots between the observed and predicted DO from all the ANFIS models obtained from both partitioning methods are shown in Figure 4. ANFIS-SC shows a similar pattern across the models. The highest RMSE is found in M1 and the lowest in M4, as in ANFIS-GP. M2 and M3 deliver approximately similar results because both models contain a similar number of input parameters and MFs, which indicates that the output of ANFIS-SC essentially depends on the number of MFs rather than the characteristics of the input parameters. The M4 model had an RMSE of 0.150, the lowest among all the ANFIS-SC models but higher than that of the M4 of ANFIS-GP, as shown in Table 4. The GP model does not use the coefficient; it relies on calculating the maximum number of rules based on the number of MFs and input parameters.
The mandatory optimization improves the results of the GP model for the larger number of parameters and subsequent rule bases, whereas the SC model works on a good input-output relation and the coefficient generated from that relation. Therefore, in the ANFIS-GP model, which classifies data based on the rule base, the model performance improves with the size of the rule base and is not affected by the input-output relationship. It is evident from the results that the ANFIS-GP model outperforms ANFIS-SC and could act as an effective tool for defining, planning and managing water quality parameters when the input-output parameters maintain a regular variation and the input characteristics vary significantly due to anthropogenic disturbances.
The results obtained from the present study are compared with recent studies performed to simulate the DO using various soft computing methods. Ay & Kisi (2017) developed several DO modelling techniques using different combinations of pH, EC, temperature and flow. The ANFIS models were selected for comparison with the present study, as shown in Table 5. Ay & Kisi (2017) used 2 MFs of triangular type to design the GP model. Stajkowski et al. (2020) modelled DO at two stations using normalized, standardized and spectral analysis of the ARIMA model. The normalized ARIMA model shows the highest accuracy with a configuration of 4,1,4, which states that the model is designed for fourth order of autoregressive, first order of differencing and fourth order of moving average model. In contrast, Abba et al. (2021) compared different types of neural network models with different combinations of input parameters. The emotional artificial neural network-genetic algorithm (EANN-GA) and neural network ensemble (NNE) produced the highest accuracy from the various models.

CONCLUSION
The study was carried out for the simulation of DO using different ANFIS models. Monthly water quality data were collected from the Yamuna River for 5 years, and significant parameters were identified using spatial and temporal analysis, cluster analysis and PCA (published as a separate study). The simulation models were developed using two fuzzy clustering algorithms, i.e., GP and SC, and the results of both clustering algorithms were compared using R² and RMSE. Different combinations of input parameters were used to develop the models using the ANFIS algorithms (ANFIS-GP and ANFIS-SC), and the applicability of the ANFIS models was tested using the water quality parameters of the Yamuna River. The M4 model of ANFIS-GP had the lowest RMSE and the maximum R² of 0.953 and contained three MFs of Gaussian type. Moreover, all the ANFIS-GP models outperformed their ANFIS-SC counterparts and showed a good correlation with the observed values of DO.
BOD and COD are significant parameters that reflect the direct consumption of oxygen through biological and biochemical decomposition of organic matter, respectively. The study reveals that, other than BOD and COD, conductivity also significantly impacts DO and affects the DO concentration more strongly than ammonia. However, both conductivity and ammonia are involved in the DO modelling and improve the model performance. It can be concluded that the appropriate