Imputing missing data using grey system theory and the biplot method to forecast groundwater levels and yields

Groundwater management is one of today’s important tasks. It has become necessary to seek out increasingly reliable methods to conserve groundwater resources. Dependable forecasting of the amounts of groundwater that can be abstracted in a sustainable manner requires long- term monitoring of the groundwater regime (rate of abstraction and groundwater levels). Moni toring of the groundwater source for the town of Bečej, Serbia had been disrupted for multiple years. The objective of the paper is to assess the possibility of reinterpreting the missing data or, in other words, to reconstruct the operation of the groundwater source and its effect on groundwater levels. At the Bečej source, groundwater is withdrawn from three water-bearing strata comprised of fine- to coarse-grained sands. Historic data are used to reconstruct the ope ration of the Bečej source between 1 st of October 1980 to 1 st of May 2010. The monitored parameters are total source yield and piezometric head at seven observation wells and 14 pump ing wells. A data reconstruction methodology was developed, which included the use of an autoregressive (AR) model, a grey model (GM), and the biplot method. The methodology is ap plied to fill the data gaps during the considered period. The paper also describes the criteria for evaluating the accuracy of the AR model, GM, and biplot method. The proposed data recon struction approach yielded satisfactory results and the methodology is deemed useful for the Bečej source data, as well as other historic data not necessarily associated with groundwater sources, but also groundwater control and protection systems, as well as hydrometeorological, hydrological and similar uses.


INTRODUCTION
Groundwater is one of the primary sources of water for various uses. The demand for groundwater will steadily rise in the future. This is a growing problem and more and more attention will be devoted to seeking out methods and strategies for progressively efficient groundwater management. In practice, the water supply for a particualr area is often provided by developing groundwater sources that deliver sufficient water quantities of a satisfactory quality.
Effective groundwater source management largely depends on local natural conditions, the amount and quality of collected operational data, and the efficient and reliable forecasting of the future water demand and the ability of the hydrogeological setting to meet that demand. Roughly 75% of the water demand in Serbia relies on groundwater. However, there has been no regular monitoring of surface water and groundwater dynamics during a period of economic crisis in the country, which resulted in an underdeveloped observation network. At groundwater sources, regular monitoring was hindered by poor organization and lack of instrumentation and human resources. More recently, the surface water and groundwater observation network has been expanded based on European Union recommendations and the scope and extent of monitoring are now satisfactory.
The need to develop methods that will forecast sustainable rates of groundwater abstraction is becoming increasingly pronounced. Sufficiently reliable prognostic methods would create conditions for effective conservation of groundwater resources, addressing problems such as inefficient use and over-exploitation, hydrology (CHAKRABORTY et al, 2009;MEHDIZADEH, 2020), seismology (XIAOJUN et al., 2015), exodynamics (YONGPING et al., 2020), and hydrogeology (RISTIĆ VAKA-NJAC et al., 2018).
Many researchers in hydrogeology and similar disciplines have applied the grey system theory to address diverse problems, such as predicting earthquakes (DAOGONG et al., 1999), forecasting of pore pressure changes (MOHAMMED et al., 2010), projecting land subsidence to prevent hazards (TANG et al., 2008), sustainable groundwater management for drinking water supply (JIANFEI et al., 2012), estimating the amount of rainfall and runoff to prevent flooding (YU et al., 2001;KANG et al., 2006), forecasting groundwater levels (MAHMOD et al., 2013;KANG & MAENG, 2016) and predicting and analyzing spring discharges (HAO et al., 2010).
The biplot method is also applied by many researchers in different fields. For example, this statistical method has been very useful in addressing environmental protection issues, to predict the evolution of air pollution (GONZALEZ CABRERA et al., 2006;ZHANG et al., 2018); in mechanics, to solve problems applying SVD (singular value decomposition, which is the basis of the biplot method) (MEIJAARD, 1993;GAI & HU, 2018); in geophysics, to improve the quality of seismic data (MARTINS et al., 2016) and to determine the spatial relationship between deep geologic structures and gold (ZHAO & CHEN, 2010).
Upon reconstruction of the missing piezometric head data and total groundwater source yields, the gaps in the Bečej groundwater source (Bečej GS) data were filled over the study period. Sufficiently reliable input data were provided to assess the feasibility of Bečej GS expansion, predict the long-term effect of groundwater abstraction, and propose optimal solutions for future production wells.

STUDY AREA
The study area is the extended area of Bečej GS, Serbia, as shown in Fig. 1. The groundwater source is located west of the city of Bečej. Groundwater from the tapped aquifers is used for both the municipal and industrial water supplies. Between 1980 and 2010, 25 production wells and seven observation wells were drilled at Bečej GS. Fourteen production wells are currently in service (Fig.  1). The others have been decommissioned. The wells tap several aquifers. Their individual discharge capacities are 3 -16 l/s.  The extended study area is located within the Pannonian Plain and is composed of Precambrian/Palaeozoic, Mesozoic, Mi-ocene, Pliocene and Quaternary rocks. The oldest rocks are crystalline schists and Precambrian/Palaeozoic granites. The youngest rocks are Quaternary in age, with alternating lithological units (gravels and coarse sands underlie sandy silts, silts and clays) (MALEŠEVIĆ, 1984;TERZIN, 1994). Vertically, there are two types of aquifers down to a depth of about 130 m: an unconfined intergranular aquifer in the alluvial and terrace deposits (from the land surface to roughly 30 m -aquifer I) and three confined intergranular aquifers in deeper geological formations (from 60 m to 130 m -aquifers II, III and IV). Bečej GS taps the deeper aquifers of Quaternary age: the first is at a depth of 60 to 80 m, the second at 83 to 100 m, and the third from 110 to 130 m (Fig. 2).
The unconfined aquifer is composed of sands in the upper part and fine-grained, partly clayey sands in the lower part. The bottom of this layer comprises sandy clay. The three deeper aquifers are composed of fine-to medium-grained sands, with some gravel. The sands include a pelitic component, which hinders production/causes well ageing. The aquifers are separated by relatively thin layers of clay, sandy clay and clayey sand. Their thickness ranges from 5 to 20 m. These three aquifers constitute a hydrogeologic complex, which evolved during the Lower and Middle Pleistocene. On a regional scale, they are considered as a single, confined aquifer (MALEŠEVIĆ, 1984;TERZIN, 1994).
As part of the reported research, a methodology was developed to reconstruct the missing data (10-year period) due to the lack of continuous monitoring. It is shown in Fig. 4 in the form of an algorithm.
In the first step, input data comprising values observed during the course of monitoring were analyzed. The objective of the proposed methodology was to reconstruct the missing data applying the methods indicated in the algorithm. The AR model was used to reconstruct the dates (observation positions) on which piezometric head and source yield were not observed. The accuracy of the AR model was determined on the basis of AIC (Akaike information criterion), according to which the lowest value was assumed to be accurate. After the observation dates were reconstructed, the grey method was applied to arrive at piezometric head values for the period during which there was no monitoring. The input data for the GM were the original and reconstructed dates of observation and the observed piezometric heads. First the GM was tested against the (original) data observed during the study period. If a comparison of the original and reconstructed piezometric head data and MAPE criterion (Table 2) indicated that the accuracy of the model was satisfactory, the approach was acceptable for reconstructing the missing data. If not, the proposed methodology was not suitable for further reconstruction. Given that calculations and the MAPE criterion (Table 2) indicated that the accuracy of GM was satisfactory, it was applied to fill the data gaps. The above methods used to determine the missing observation dates and piezometric heads produced a completed matrix for the biplot method, which was applied to reconstruct the total source yields during the period of no observations. The accuracy of the total source yields derived by the biplot method was determined based on the condition d solution was acceptable, the model was deemed suitable for reconstructing the total source yields. If not, it would have been necessary to look for other, more accurate models. All the calculations were made in Microsoft Excel (AR model, grey theory and biplot method) and MathLab (biplot method). The figures were produced in CorelDraw (Figs. 1, 2, 4) and Microsoft Excel (Figs. 3,5,6,7,8,9).
A brief theoretical background of the models used in this research is provided below.

Autoregressive (AR) model
The AR model is commonly applied to process statistical timeseries. The AR model was used for predicting missing observation positions. More specifically, it is used to forecast the values of current and future variables, based on the values of the same variable at different times. The variable is described as a function of its values in previous periods Z t-1 , Z t-2 .
The mathematical expression is (DAI et al., 2015): where p is the AR model order and Zt is the constant.
In order to construct the AR model, first its order (p) needs to be determined. The principle is as follows: for the original (observed) data (x 1 ,x 2 ,...,x n ) of order p, the AR model is: If the original time-series is stationary, the AR process is applied directly to the original dataset. If not, the time-series needs to be transformed into a stationary time-series (sequence). The common transformation approach is first-order differentiation, as was required in the present case.
First-order differentiation is implemented as: Then the AR model can be applied to the differentiated timeseries. The ultimate forecasted values are derived from: where ˆt X D is the value resulting from the AR process. The AR model order is defined (determined) applying AIC. AIC is calculated as (CHAKRABORTY et al., 2009) where N is the length of the sample, p is the number of estimated coefficients (p-order, AR model order) and ˆi u are the residuals. The model where the AIC is the lowest is selected as optimal (most suitable).
Using the AR model, the observation positions (dates) are reconstructed and then used as input data for the GM.

Grey model (GM)
The theory provides an effective solution for the data uncertainty and discretization problem. GMs predict the future values of a time-series, based solely on a set of observed data (recorded, original values in the present case). According to the grey system theory, a general grey model GM(h,N) is the GM in which h is the order of the differential equation and N is the number of parameters.
GM(h,N) is defined by the following differential equation (YU et al., 2001): where a i and b j are the determined coefficients, x k l ( ) ( ) 1 is the sequence of the main factor, x k j ( ) ( ) 1 are the sequences of the influential factors, and k is the time sequence variable.
A single-variable first-order grey model GM(1,1) was used in the present research because there was a single time-series of observations.
The original time-series was represented as (ZHAO-HUI & JING, 2016):  Before proceeding with the GM equation, the accumulating generation operator (AGO) will be defined. AGOs are used in GM to reduce data randomness.
The first-order AGO for X (0) is defined as (ZHAO-HUI & JING, 2016): where (1) The sequence of the generated averages of the neighboring values of sequence (1) The grey differentiated first-order equation GM(1,1) is (ZHAO-HUI & JING, 2016): The first term (1)   1 ( ) x t of the sequence (1) ( )( 1, 2,..., ) = j x t j n is used as the initial condition of the grey differentiated equation.
The solution to Eq. (9) is a time-dependent function: To distinguish between a and b, Eq. (9) is discretized. In other words, the difference in grey derivative I of interval and thus: x (0) (t j ) = a(z (1) (t j ) + z (1) (t j-1 )) + b (i = 1,2,..., n; k = 2,3,..., m) (11) Coefficients a and b are obtained by the least squares method: The time sequence response formula, that is, the solution of (9) is: x t e x t a e I j= 1,2,..., n (14) The fitting forecast value of xj(0) can be obtained by the inverse AGO, as follows (LIU &LUO, 2010): The accuracy of the model is represented by the mean absolute percentage error (MAPE) (ZHAO-HUI & JING, 2016): where: In the present case, MAPE was used to assess model accuracy. The model accuracy criterion is shown in The GM (1,1) was used to reconstruct piezometric head data during the 10-year period of discontinued observations (from 1 February 1989 to 1 June 1998, Fig. 3). The completed observation positions (dates) and piezometric heads represented input data for the biplot method.

Biplot method
The biplot method involves the imputing of arbitrary values to missing data in order to complete the matrix, and then calculating SVD using only two components (ARCINIEGAS-ALAR- CON et al., 2014). SVD is the basis of the biplot method. In the reported case study, the biplot method was used to forecast missing data on total source yields from 1 October 1980 to 1 June 1998.
The initial data for constructing the matrix included: 1. Original and reconstructed observation positions (dates) on which groundwater level and total source yield were observed); 2. Piezometric head data -original and reconstructed using the GM (1,1); and 3. Observed source yield data. The constructed nxp matrix denoted by X comprised x ij (i=1,...,n; j=1,...,p) elements, where the missing data were symbolized by x ij aus . In the first step, the missing values were imputed in their respective columns, to complete matrix X. In the specific case, observed average discharges (i.e. known values) were used to fill the data gaps.
The completed matrix X was standardized: where: m j is the mean deviation of the j-th column, s j is the standard deviation of the j-th column, and p ij are the standardized elements. The matrix with p ij elements was denoted by P. The next step was to calculate SVD of matrix P. SVD of a standardized matrix is a fundamental and prominent approach in modern numerical linear algebra. The SVD algorithm represents an approximation of the main matrix by smaller matrices (MEI-  JAARD, 1993). According to this theorem, any matrix A (mxn) can be decomposed into three matrices, as: U is the orthogonal mxm matrix, V is the orthogonal nxn matrix, L is the diagonal mxn matrix with positive elements on the main diagonal, referred to as singular values of A, and all the other terms are equal to zero.
The singular values are generally sorted in descending order, on the main diagonal L.
Taking into account the first two components, the resulting equation (expression) is: l k are the singular values, a ik are the eigenvectors of the rows, g jk are the eigenvectors of the columns, and e ij is the error of row i in column j.
The new matrix, composed of p ij (2) elements, was denoted by P (2) . All the p ij (2) elements in P (2) were re-assigned their original values, (2) x m s p . This resulted in a new matrix -X (2) (n×p) . The x ij aus missing in the original matrix X were imputed with the corresponding values of ∧ x ij (2) in X (2) . Filling of data gaps was followed by iterations (back to matrix standardization, repeating the entire process) until the specified condition ( 0.01 − < where: and 1 2 2 -na is the total number of values missing in matrix X, -x i is the value forecasted for the i-th row of missing values in the active iteration, -x i A is the value forecasted for the i-th missing values in the previous iteration, -y ij are the original values (not missing) in the i-th row and j-th column, and -N is the total number of considered values.

RESULTS AND DISCUSSION -APPLICATION OF THE METHODOLOGY TO A REAL CASE STUDY
Before the missing data were reconstructed, the hydrogeological system was analyzed based on data collected in the last 30 years of operation to determine the groundwater dynamics.
The unconfined intergranular aquifer (aquifer I) is relatively thin and the values of its hydraulic parameters are low, so the groundwater reserves are modest. This aquifer is recharged by precipitation and via a good hydraulic link with a river. The aquifer is drained by the pumping wells and by migration into the deeper aquifers, given the semi-permeable floor of aquifer I.
Recharge of the confined intergranular aquifers (II, III and IV) is highly complex because of the large distances involved, so the recharge process can only be assumed based on the geological and hydrogeological features of the terrain and long-term piezometric levels of the groundwater. In view of the fact that the spread of the aquifers is regional, recharge occurs over a large area where their overlying strata are dominated by sands or in zones where the aquifers are directly linked to the land surface or rivers. Infiltration of precipitation into the confined aquifers has not been confirmed or occurs on a small scale. It is assumed that recharge also takes place through groundwater migration from the first, unconfined aquifer, given that the strata overlying aquifer II are composed of sands, clayey sands and sandy clays, which allow partial groundwater infiltration from aquifer I. This infiltration rate is very slow. The confined aquifers are drained solely by the studied pumping wells and neighbouring groundwater sources.
All the observation wells registered identical piezometric head fluctuation trends -lower in the summer and early autumn months due to higher rates of groundwater abstraction. Given the amount of water withdrawn from these aquifers, this decrease indicates that recharge conditions are satisfactory. The total average monthly yield of the groundwater source suggests that the amount of abstracted groundwater has not changed significantly compared to 10 years ago. The recorded piezometric level ranged from a minimum of 65 m above sea level to a maximum of 76 m.a.s.l., while the total source yield was from 68 l/s to 124 l/s.
The missing data reconstruction results are described below. Records of groundwater abstraction rates since the groundwater source was commissioned (a period of 30 years, Fig. 3 provided input data for applying the GM (1,1) and biplot method to Bečej GS. The missing data were reconstructed according to the algorithm shown in Fig. 4, first the observation dates using the AR model, then the groundwater levels resulting from GM (1,1), and finally the total source yield applying the biplot method. The proposed methodology with regard to the period of no observations is described below.

Imputing of missing observation positions based on the AR model
The AR model was used to forecast observation data (dates) of piezometric head and total source yield. Observation date is the date (day, month and year) of measurement. For AR modeling purposes, these dates were converted into cumulative days (Fig.  3 shows the dates of original measurement). Each observation date is identified as an observation position in order to define the total number of measured and later reconstructed values (125 measured and 38 reconstructed).
Given that the original time-series was transient (Fig. 5) and in view of the growing trend resulting from Eq. (3), first-order differentiation (Fig. 6) was needed in order to apply the AR model. Time-series that exhibit a trend are referred to as non-stationary.
Equation (1) was applied to the stationary time-series, the values for the AR models of the second, third and fourth orders were calculated, and then the AIC criterion was used to assess the most accurate solution. AIC was applied to second-, third-and fourth-order AR (AR2, AR3 and AR4). The results based on Eq.
The lowest AIC value was adopted (i.e. second-order AR, AR2). Figure 7. shows the original dates (in red) from 1 October 1980 to 1 February 1989, based on which the missing dates (shown in blue) were reconstructed, along with subsequent original dates (1 June 1998 to 1 May 2010). Table 4. is a numerical representation of the forecasted observation positions based on second-order AR. The dates shown in red were reconstructed (there were no actual piezometric head and source yield observations).
The MAPE of AR2 was 2.26 %. In view of the large number of missing observation dates, this error was deemed acceptable. Also, AR2 was adopted according to AIC.

Imputing of missing piezometric head data based on the GM
The input data used in this research were considered to be a grey system, with some known and some unknown information, so that it could be analyzed applying the grey system theory. The first step was to analyze the input data from 1 October 1980 to 1 May 2010, and the second step to apply a GM (1,1). The GM (1,1) was used to reconstruct piezometric head values for the period during which the observation wells (P1, P2, P3, P4, P5, P6, P7-2) were not monitored. The first step was to verify the accuracy of the GM (1,1) using the observed data.    Equations (7) through (15), described in the theoretical section, were applied in that order to the original time-series of observed piezometric head data. This yielded the predicted piezometric head values. The objective was to check the accuracy and applicability of the GM to the case study or, in other words, to the original measured data (Table 5). Then Eq. (16) was applied to the reconstructed time-series to check the calculation error and Table 2. was used to assess whether the GM was applicable to the original time-series and if it yielded acceptable solutions. Table 5. shows the observed (original) and reconstructed piezometric heads at observation well P1. It is apparent that the difference between the observed and reconstructed values was acceptable according to the criterion from Table 2 and that the model could be used for the discontinued period from 2 January 1989 to 6 January 1998.
The errors are shown in Table 6. The observed and reconstructed piezometric head values agreed rather well, given the ten-year period of discontinued groundwater monitoring (Table 5). Based on the criterion shown in Table 2, the conclusion was that the error was acceptable in view of the fact that the accuracy of the reconstructed values was high, so the solution was deemed acceptable.

Imputing of missing source yield data based on the biplot method
The biplot method was used to reconstruct the missing source yield data, from 1 October 1980 to 1 July 1998. The input data in this case included original and reconstructed: -observation dates, -piezometric heads and -observed source yield data.
The first step, before the biplot method, was to complete matrix X for calculating the missing total source yield values. The averages of the observed source yields were used for the missing values, x ij aus . The completed matrix included the observation dates (original and reconstructed), piezometric heads (original and reconstructed), observed source yields and average source yields (in place of those that needed to be reconstructed). Matrix X was than standardized applying Eq. 18, resulting in standardized matrix P. This was followed by the SVD of matrix P (in Mat-Lab), based on Eq. 19. Then Eq. 20 was applied, resulting in matrix P (2) . After applying (2) (  (Table 7). Table 8 and Figure 8 show the observed (original) and reconstructed source yields. It is apparent in Figure 8. that the original and reconstructed source yield data agree rather well, over the entire study period (from 1 October 1980 to 1 May 2010). Table 9. shows the errors in the 2 nd iteration. Figure 9. shows the Bečej GS observed and reconstructed piezometric head and source yield data.    . shows that the trends of the piezometric head and source yield data are consistent with those registered in situ. The reason for the lack of fluctuations of the reconstructed values is attributable to the long period of time during which the considered parameters were not observed, as the applied methodology cannot produce more accurate solutions. However, the proposed methodology was highly effective in the study area, given that all the solutions were within the set criteria and the errors were in the predefined range. The models used in this research yielded satisfactory results and it is believed that they can be applied to fill data gaps in the study area (Bečej, Serbia). The developed methodology can also serve as a tool for upgrading other models (e.g. hydrodynamic) in groundwater management, when available input data are insufficient for this type of analysis. The level of error associated with the proposed methodology indicates that it can be applied to other groundwater sources and groundwater management systems.

CONCLUSION
An adequate groundwater source monitoring database is essential for efficient decision making in groundwater resources management. When such data are lacking, it is a challenge to provide sufficiently reliable groundwork for forecasting the future status of a groundwater source and the regime of the captured groundwater.
The paper described a methodology developed to reconstruct missing piezometric head and source yield data because monitoring of the Bečej groundwater source (Bečej GS) had been discontinued for a period of ten out of 30 years. The accuracy of the reconstructed values was first assessed in respect of the observed (original) data. MAPE showed that the applied models were highly accurate, indicating that the methodology could be applied to the entire time series of data observed during the study period.
Temporarily discontinued monitoring of piezometric head and source yield is one of the problems that had an adverse effect on the reliability of decisions relating to the feasibility of expanding the groundwater source in the future, in view of the potential of the area and the predicted efficiency of groundwater abstraction over a long period of time.
The proposed methodology effectively solves that problem. The implementation of the proposed algorithm, resulting in complete input data, improved source expansion solutions based on one of many methods for predicting groundwater dynamics. In the specific case, future research will focus on hydrodynamic modeling with reconstructed and filled data gaps for Bečej GS.
The proposed, verified methodology can be applied to groundwater abstraction systems, but also other systems, such as groundwater control (mines, hydraulic structures, agriculture, industrial zones, and urban areas), groundwater protection, and hydrometeorological, hydrological and similar systems.
The advantage of the proposed methodology is that it ensures sufficiently reliable reconstruction and/or forecasting of needed parameters, in cases that involve a small amount of observed data, or known quantities. In some instances it is relatively simple to implement.
AUTHOR CONTRIBUTIONS: All authors discussed the results and contributed to the final manuscript. J.R. with D.P., Z.G. and D.B. designed the methodology and wrote the final manuscript. Research plan, fieldwork and initial data analysis was conducted by J.R. D.B. provided critical feedback and helped shape the research methodology.
FUNDING: This research received no external funding.

CONFLICTS OF INTEREST:
The authors declare no conflicts of interest.