GMDH, also called self-organizing modelling, is a non-linear modelling technique first introduced by Ivakhnenko (1968) that establishes an input-output relationship within a complex system. The GMDH approach integrates elements from both statistics and neural networks, combining their strengths to improve the imputation of missing hydrological data (Valença et al., 1998). It requires only small data samples and is able to optimise model structure objectively (Anastasakis & Mort, 2001). GMDH has been used widely in diverse fields such as pattern recognition, physiological experiments, cybernetics, medical science, education, safety science, economics, tool-life testing in gun drilling, forecasting of mobile communication, ecology, weather modelling and hydraulic field engineering systems (Onwubolu, 2016); however, its use in hydrological modelling and forecasting is still less explored (Aghelpour et al., 2020).
GMDH Algorithm
The GMDH algorithm follows the principle of polynomial approximation to connect input and output values. The general relationship between the input and output variables is expressed by the polynomial function known as the Kolmogorov-Gabor polynomial:
\(Y = \alpha_0 + {\sum }_{i=1}^{n}{\alpha }_{i}{x}_{i} + {\sum }_{i=1}^{n}{\sum }_{j=1}^{n}{\alpha }_{ij}{x}_{i}{x}_{j} + {\sum }_{i=1}^{n}{\sum }_{j=1}^{n}{\sum }_{k=1}^{n}{\alpha }_{ijk}{x}_{i}{x}_{j}{x}_{k} + \ldots\) (1)
where \(X = ({x}_{1}, {x}_{2}, \ldots, {x}_{n})\) denotes the input variables, \(\alpha = ({\alpha }_{1}, {\alpha }_{2}, \ldots, {\alpha }_{n})\) represents the weights, and \(Y\) is the output variable (Li et al., 2020).
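To make Eq. (1) concrete, the following minimal Python/NumPy sketch evaluates the Kolmogorov-Gabor polynomial truncated after the second-order term; the coefficient values are purely illustrative, not fitted model weights:

```python
import numpy as np

def kolmogorov_gabor(x, a0, a_lin, a_quad):
    """Evaluate Eq. (1) truncated after the second-order term.

    x      : input vector (x1, ..., xn)
    a0     : bias coefficient alpha_0
    a_lin  : length-n vector of first-order coefficients alpha_i
    a_quad : n x n matrix of second-order coefficients alpha_ij
    """
    x = np.asarray(x, dtype=float)
    y = a0
    y += np.dot(np.asarray(a_lin, dtype=float), x)   # sum_i alpha_i x_i
    y += x @ np.asarray(a_quad, dtype=float) @ x     # sum_i sum_j alpha_ij x_i x_j
    return y

# Illustrative coefficients (not fitted values)
y = kolmogorov_gabor([1.0, 2.0], a0=0.5,
                     a_lin=[0.1, 0.2],
                     a_quad=[[0.0, 0.05], [0.05, 0.0]])
```

Higher-order terms follow the same pattern with higher-dimensional coefficient arrays, which is exactly why the full expansion becomes data- and computation-hungry as the paragraph below notes.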
The above equation necessitates a substantial amount of data and computationally intensive calculations involving high-dimensional matrices to determine a large number of coefficients. However, the GMDH algorithm overcomes this limitation by employing a multilayered perceptron-type structure. In this structure, the functions utilized are either linear or second-order polynomial functions in two or three variables, such as:
\({F}_{1}({x}_{i}, {x}_{j}) = \alpha_0 + \alpha_1 {x}_{i} + \alpha_2 {x}_{j}\) (2)

\({F}_{2}({x}_{i}, {x}_{j}) = {F}_{1}({x}_{i}, {x}_{j}) + \alpha_3 {x}_{i}{x}_{j} + \alpha_4 {x}_{i}^{2} + \alpha_5 {x}_{j}^{2}\) (3)
The model parameters are determined through the least-squares method, where the residues between the model outputs and targets are minimized. To achieve this, all possible combinations of two inputs are utilized to generate an output (as illustrated in Fig. 1). The outputs are evaluated, and the best results meeting a specified threshold are selected. This process is repeated for subsequent layers until the threshold criteria are satisfied.
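The least-squares fit of the quadratic neuron in Eq. (3) for one pair of inputs can be sketched as follows (a Python/NumPy illustration with synthetic data, not the toolbox implementation):

```python
import numpy as np

def fit_pair(xi, xj, y):
    """Fit the quadratic neuron of Eq. (3) by ordinary least squares.

    Builds the design matrix [1, xi, xj, xi*xj, xi^2, xj^2] and solves
    for the six coefficients a0..a5 minimising the squared residuals
    between the neuron output and the target y.
    """
    A = np.column_stack([np.ones_like(xi), xi, xj, xi * xj, xi**2, xj**2])
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coeffs, A @ coeffs  # fitted coefficients and model outputs

# Synthetic check: recover a known noiseless polynomial exactly
rng = np.random.default_rng(0)
xi, xj = rng.normal(size=50), rng.normal(size=50)
y = 1.0 + 2.0 * xi - 0.5 * xj + 0.3 * xi * xj + 0.1 * xi**2
coeffs, y_hat = fit_pair(xi, xj, y)
```

In the full algorithm this fit is repeated for every pair of inputs in a layer, and only the neurons whose outputs best match the target survive to the next layer.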
GMDH algorithms have been successfully applied for short-term prediction and forecasting of river flow (Ikeda et al., 1976; Valença et al., 1998; Aghelpour et al., 2020; Bonakdari et al., 2020; Khodakhah et al., 2022), water quality (Haghiabi et al., 2018), turbidity (Tsai et al., 2017), dissolved oxygen (Tomić et al., 2018) and river water level (Li et al., 2020). For instance, Dag et al. (2016) developed the R package GMDH for short-term forecasting and observed improved performance compared to ARIMA models and exponential smoothing methods. Changzheng et al. (2009) combined GMDH with the Expectation-Maximization (EM) algorithm to address missing values in noisy data and found it superior to four other popular imputation methods.
Moreover, Acock et al. (2000) demonstrated the utility of GMDH in filling gaps in weather data obtained from weather stations. Meghwar et al. (2016) utilized GMDH for analyzing missing data in river flow. Additionally, Aghelpour et al. (2020) employed the GMDH method for modeling and forecasting river flow, concluding that it is a reliable predictor for daily river flow, except during severe floods.
Overall, these studies highlight the versatility and effectiveness of GMDH in various applications, including both forecasting and addressing missing data challenges in hydrological analysis.
The objective of this research is to assess the efficacy of the GMDH technique in imputing missing values in long-term hydrological series data. The study employed historical hydrological datasets from four gauging stations located on the Mahanadi river and its tributaries to construct and evaluate the performance of the GMDH model.
Study Area:
The Mahanadi river, a significant east-flowing river in India, originates in the Dhamtari district of Chhattisgarh and empties into the Bay of Bengal. The Mahanadi basin is divided into three sub-basins: upper, middle, and lower Mahanadi. In this study, hydrological datasets from four Hydrological Observation (HO) sites operated by the Central Water Commission (CWC) (Table 2) in the lower sub-basin of the Mahanadi river were utilized to develop and evaluate the GMDH model. Among these sites, Tikarapara and Boudh are situated along the main course of the Mahanadi river, while Kantamal and Padampur are located on its tributaries, namely the Tel and the Ong, respectively. Each site has its own set of missing daily discharge data. Fig. 2 below illustrates the geographical locations of the hydrological observation sites in the lower sub-basin of the Mahanadi river.
Table 2
Hydrological Observation Site Locations

| River-Gauging Station | River/Tributary | Latitude | Longitude |
| --- | --- | --- | --- |
| Tikarapara | Mahanadi | 20.6019 | 84.7761 |
| Boudh | Mahanadi | 20.860 | 84.322 |
| Kantamal | Mahanadi/Tel | 20.6527 | 83.7234 |
| Padampur | Mahanadi/Ong | 21.0154 | 83.1043 |
Methodology:
In this paper, we use the GMDH toolbox provided by the Yarpiz organization to create the GMDH polynomial neural network (http://yarpiz.com/323/ypml120-time-series-prediction-using-gmdh). The code was executed in MATLAB using the following fundamental steps:
Step 1: Split the initial dataset into training and testing sets.
In the first step, continuous data X = {x1, x2, ..., xM} without missing values is chosen as the input variables, as shown in Fig. 3. The available data is then divided into separate training and testing datasets, used to learn the system structure and to select the best results, respectively. This helps avoid overfitting in prediction and is known as regularization. The first 70% of the time series was used as the training set and the last 30% as the testing set (Dag et al., 2016).
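The 70/30 chronological split can be sketched as follows (a minimal Python illustration with a placeholder series; the actual study uses the daily discharge records described above):

```python
import numpy as np

def chronological_split(series, train_frac=0.7):
    """Split a time series into training and testing sets without shuffling,
    preserving temporal order: the first 70% trains the model, the last 30%
    is held out for selecting the best candidate neurons."""
    n_train = int(len(series) * train_frac)
    return series[:n_train], series[n_train:]

flows = np.arange(100.0)  # placeholder daily-discharge series
train, test = chronological_split(flows)
```

Keeping the split chronological (rather than random) matters for hydrological series, since shuffling would leak future information into the training set.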
Step 2: Create combinations of input variables within each layer.
The second step involves constructing \(\binom{M}{2} = M(M-1)/2\) new variables within the training dataset, one for each pair of inputs. Additionally, a regression polynomial for the first layer is constructed by creating a quadratic expression that approximates the output y, as given in Eq. (3).
Step 3: Apply an optimization principle to determine the elements within each layer.
In step three, the contributing nodes in each hidden layer are identified based on their root mean square error (RMSE) values. The least effective variables are then eliminated by replacing the old columns of X with the new columns Z.
Step 4: Stopping rule for multilayer structure generation.
Step four involves the iterative process of the GMDH algorithm, repeating steps 2 and 3. The iteration continues until the errors of the test data in each layer no longer decrease; at that point, the iterative computation is terminated.
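Steps 2-4 above can be sketched together as a simplified selection loop (a Python/NumPy illustration, not the Yarpiz toolbox code; for brevity it selects neurons on training residuals, whereas the full algorithm evaluates candidates on the held-out testing set):

```python
import numpy as np
from itertools import combinations

def fit_quadratic(xi, xj, y):
    """Least-squares fit of the Eq. (3) neuron; returns fitted outputs."""
    A = np.column_stack([np.ones_like(xi), xi, xj, xi * xj, xi**2, xj**2])
    c, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ c

def gmdh_sketch(X, y, keep=4, max_layers=3):
    """Simplified GMDH loop: build all M(M-1)/2 pairwise neurons (Step 2),
    keep the `keep` best by RMSE and replace X with their outputs Z (Step 3),
    and stop when the best RMSE no longer decreases (Step 4)."""
    best_rmse = np.inf
    for _ in range(max_layers):
        candidates = []
        for i, j in combinations(range(X.shape[1]), 2):
            z = fit_quadratic(X[:, i], X[:, j], y)
            rmse = np.sqrt(np.mean((z - y) ** 2))
            candidates.append((rmse, z))
        candidates.sort(key=lambda t: t[0])
        layer_best = candidates[0][0]
        if layer_best >= best_rmse:  # Step 4: errors no longer decrease
            break
        best_rmse = layer_best
        # Step 3: surviving neuron outputs Z become the next layer's inputs
        X = np.column_stack([z for _, z in candidates[:keep]])
    return best_rmse

# Sanity check: a target expressible by one Eq. (3) neuron is fit almost exactly
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))
err = gmdh_sketch(X, X[:, 0] * X[:, 1])
```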
For each dataset, the next 30 days of values were imputed and compared with the original data using the performance indices discussed below. Figure 5 shows the
Performance Indices:
The evaluation of model performance for both training and forecasting data is conducted using commonly used metrics such as root-mean-square error (RMSE), correlation coefficient (R) and Mean Absolute Error (MAE). These indices, widely employed in time series forecasting evaluations, provide insights into the accuracy and correlation of the results. The definitions of RMSE, R and MAE are as follows:
The coefficient of determination R2, the square of the correlation coefficient R, measures how well the regression model fits the data, with values ranging from 0 to 1; an R2 approaching 1 indicates a better fit. The correlation coefficient R is given by:

\(R = \dfrac{\mathrm{Cov}(X, Y)}{{\sigma }_{X}\,{\sigma }_{Y}}\)
where Cov(X, Y) is the covariance between variables X and Y, and \({\sigma }_{X}\) and \({\sigma }_{Y}\) are the standard deviations of X and Y, respectively.
\(\mathrm{RMSE} = \sqrt{\dfrac{1}{n}{\sum }_{i=1}^{n}{\left({y}_{\mathrm{pred},i} - {y}_{\mathrm{actual},i}\right)}^{2}}\)
where n is the number of data points, \({y}_{\mathrm{pred}}\) represents the predicted values and \({y}_{\mathrm{actual}}\) represents the actual values.
The RMSE provides a measure of the overall error between the predicted and actual values, with lower RMSE values indicating better model performance and higher accuracy (Aghelpour et. al., 2020).
MAE
The MAE is calculated by taking the average of the absolute differences between the predicted and actual values:
\(\mathrm{MAE} = \dfrac{1}{n}{\sum }_{i=1}^{n}\left|{y}_{\mathrm{pred},i} - {y}_{\mathrm{actual},i}\right|\)
where n is the number of data points, and \({y}_{\mathrm{pred}}\) and \({y}_{\mathrm{actual}}\) are the predicted and actual values, respectively.
As the RMSE and MAE criteria approach zero and R2 simultaneously approaches one, the accuracy of the model is higher.
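The three indices defined above can be computed together as in the following minimal Python/NumPy sketch (sample ddof=1 is used for both covariance and standard deviations so the two are consistent):

```python
import numpy as np

def performance_indices(y_actual, y_pred):
    """Compute RMSE, MAE and R^2 as defined above."""
    y_actual = np.asarray(y_actual, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_pred - y_actual) ** 2))
    mae = np.mean(np.abs(y_pred - y_actual))
    # R = Cov(X, Y) / (sigma_X * sigma_Y); np.cov defaults to ddof=1
    r = (np.cov(y_actual, y_pred)[0, 1]
         / (np.std(y_actual, ddof=1) * np.std(y_pred, ddof=1)))
    return rmse, mae, r**2

# Perfect prediction: RMSE = MAE = 0 and R^2 = 1
rmse, mae, r2 = performance_indices([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```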