Variable selection methods for water demand forecasting in Ethiopia: Case study Gondar town

This study applied variable selection methods to forecast the urban water demand of Gondar town. Seven variable selection methods were adopted to develop an appropriate water demand forecasting model. Multiple linear regression analysis was used to identify the optimal predictor variables for the water demand forecasting model. The results showed that PCA identified the influential variables in water demand modeling better than the other statistical methods. We developed three models to forecast water demand in the study area. Model 1 was selected because it gives more accurate results than Model 2 and Model 3.

Subjects: Agriculture; Environmental Sciences; Agriculture and Food


ABOUT THE AUTHORS
We submitted a manuscript entitled "Variable selection methods for water demand forecasting in Ethiopia: Case study Gondar town", which will be published in the journal Cogent Environmental Science. All authors certify that they have participated sufficiently in the work to take public responsibility for the content, including participation in the concept, design, analysis, writing, or revision of the manuscript. For example, Dr. Mohammed Gedefaw is the principal author of this paper and made substantial contributions to the design, idea generation, analysis, interpretation and drafting of the manuscript. All the other co-authors contributed greatly to the improvement of the paper. This paper falls under the water and natural resources management research thematic team. The final manuscript was checked and approved by all the authors before submission.

PUBLIC INTEREST STATEMENT
Water demand forecasting extrapolates future demand from historical urban water consumption data and historical data on correlated factors. Water demand can be estimated by developing appropriate mathematical models based on the predictor variables that affect the demand for water. This study applied variable selection methods to forecast the urban water demand of Gondar town. The study considers monthly total rainfall (mm), number of rainy days in a month, average maximum monthly temperature, water price (US$/KL), and water restriction levels. Seven variable selection methods were adopted to develop an appropriate water demand forecasting model. Multiple linear regression analysis was used to identify the optimal predictor variables for the water demand forecasting model.

Introduction
Water demand forecasting extrapolates future demand from historical urban water consumption data and historical data on correlated factors (Xinping, 2009). An accurate water demand forecast is needed for a sustainable supply of water to consumers with excellent quality, quantity and pressure (Almutaz, Ajbar, Khalid, & Ali, 2012). Water demand forecasting is important for water resources planning and optimal allocation (Liu, Savenije, & Xu, 2003; Mohamed & Al-Mualla, 2010; Tiwari & Adamowski, 2013). Water demand can be estimated by developing appropriate mathematical models based on the predictor variables that affect the demand for water (Haque, Rahman, Hagare, & Comparative, 2018).
Climatic variables (rainfall and temperature), socioeconomic conditions (household income and water price), population growth, technical innovation, cost of supply and the condition of the water distribution system are the factors that affect the water consumption pattern of the town (Anele, Todini, Hamam, & Abu-Mahfouz, 2018). Identifying the crucial variables is essential for developing a water demand forecasting model, since the accuracy of the model depends on the selection of a suitable set of predictor variables. These variables are strongly correlated with each other, which can create multicollinearity problems during regression-based model development (Haque et al., 2018).
In this paper, we compare variable selection methods in order to choose the best one for predicting the urban water demand of Gondar town. The comparison focuses on how well each method eliminates multicollinearity problems in the linear regression model used for water demand forecasting. The variables selected for predicting water demand depend on the time horizon, such as short-term or long-term forecasting. Seven variable selection methods are considered in our study to find the optimal variable set for developing water demand forecasting models. The variable selection methods compared are (1) Forward selection (FS); (2) Backward elimination (BE); (3) Stepwise selection (SW); (4) Residual mean square error (MSE); (5) Mallow's Cp criterion (CP); (6) Akaike Information Criterion (AIC) and (7) Principal component analysis (PCA). The aim of a variable selection procedure is to identify the predictor variables that have a great impact on the response variable and provide robust model prediction. Many studies have applied variable selection methods in different disciplines (Araujo, Peres, & Fogliatto, 2017; Duan et al., 2018; Figueiredo, De, Cordella, Bouveresse, Rutledge, & Archer, 2018; Gilhodes et al., 2017; Herrera, 2010; Rahman, Imtiaz, & Hawboldt, 2016; Sheng et al., 2018; Zubaidi et al., 2018). However, the number of studies applying variable selection methods to water demand forecasting is very limited. This paper therefore tries to fill that gap by forecasting water demand with seven variable selection methods.
The findings of this paper are expected to provide baseline information about variable selection methods in water demand modeling for accurate prediction.

Study area
Gondar town is located in northern Ethiopia at 12°30ʹ North and 37°20ʹ East, at an altitude of 1500-2200 m above sea level. The maximum and minimum temperatures of the area are 30.7°C and 12.3°C, respectively. The area has a bimodal rainfall pattern with an annual precipitation of 1000 mm (Garedew, Hagos, Zegeye, & Addis, 2015).

Data sources
The raw data used in this study are monthly total rainfall (mm), number of rainy days in a month, average maximum monthly temperature, water price (US$/KL), and water restriction levels. These were obtained from the National Meteorological Agency of Ethiopia.

Methods
To model residential water demand, seven variable selection methods were adopted in this study to identify the predictor variables. These methods were evaluated using split-sample validation (Mekasha, Tesfaye, & Duncan, 2014). The study period was divided into two parts: January 2010 to December 2015, used to develop the multiple linear regression (MLR) models, and January 2016 to December 2017, used to validate the developed models. The MLR technique develops a model by constructing a linear relationship between two or more independent variables and a dependent variable, expressed as follows:
$$Y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_i x_i + \varepsilon \tag{1}$$
where $Y$ is the dependent variable, $b_0, \dots, b_i$ are the coefficients estimated by the least squares method, $x_1, \dots, x_i$ are the independent variables, $i$ is the number of independent variables and $\varepsilon$ is the error term related to each observation. The semi-log form was used to develop the multiple linear regression models for all seven variable selection methods: the dependent variable was taken in logarithmic form and the independent variables were incorporated in their original, untransformed form. The seven variable selection methods are described as follows.
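The split-sample, semi-log regression described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic, the coefficient values in `true_b` are invented for the demonstration, and the helper names `fit_semilog` and `predict_semilog` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic monthly data (assumption): 96 months standing in for Jan 2010 - Dec 2017.
n_months = 96
rainfall = rng.gamma(shape=2.0, scale=50.0, size=n_months)   # mm
max_temp = rng.normal(loc=27.0, scale=2.0, size=n_months)    # deg C
X = np.column_stack([rainfall, max_temp])

# Hypothetical "true" semi-log relationship used only to generate demand values.
true_b = np.array([5.0, -0.002, 0.03])                       # b0, b1, b2
log_demand = true_b[0] + X @ true_b[1:] + rng.normal(0.0, 0.05, n_months)
demand = np.exp(log_demand)

def fit_semilog(X, y):
    """Least-squares fit of ln(y) = b0 + b1*x1 + ... + bk*xk."""
    A = np.column_stack([np.ones(len(X)), X])                # add intercept column
    coef, *_ = np.linalg.lstsq(A, np.log(y), rcond=None)
    return coef

def predict_semilog(coef, X):
    """Back-transform the fitted log-demand to the original demand scale."""
    A = np.column_stack([np.ones(len(X)), X])
    return np.exp(A @ coef)

# Split-sample validation: first 72 months calibrate, last 24 months validate.
n_cal = 72
coef = fit_semilog(X[:n_cal], demand[:n_cal])
pred = predict_semilog(coef, X[n_cal:])
rel_err = np.mean(np.abs(pred - demand[n_cal:]) / demand[n_cal:])

print("coefficients (b0, b1, b2):", coef)
print("mean relative validation error:", rel_err)
```

Holding out the final 24 months, rather than a random subset, keeps the validation period strictly after the calibration period, which matches how a forecast would be used in practice.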

Forward selection (FS)
The FS method adds variables to the model until no remaining variable outside the model can add anything significant to the dependent variable. FS begins with no variables in the model. For each variable, the test statistic (TS), a measure of the variable's contribution to the model, is calculated. If the calculated p-value for the variable is less than the critical value, the FS method keeps the variable in the model; otherwise, the variable is removed from the model. This is done iteratively until all the variables in the model have a value of p < 0.1. The partial F-statistic was calculated by Equation (2) and compared with the F-distribution to estimate the p-value. A critical threshold value of p < 0.1 was adopted in this study:
$$F = \frac{SSE_{i-1} - SSE_i}{SSE_i / (n - k - 1)} \tag{2}$$
where $SSE_{i-1}$ and $SSE_i$ are the sums of squared errors before and after the inclusion of a predictor variable, $n$ is the number of data points, and $k$ is the number of predictor variables.
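A forward selection loop driven by the partial F-statistic might look like the sketch below. Several things here are assumptions rather than details from the study: the data are synthetic, the residual degrees of freedom are taken as n - k - 1 (a model with an intercept), and, to stay within NumPy, the p < 0.1 rule is approximated by a fixed critical value `f_crit` of about 2.8, which is roughly the 10% point of the F-distribution with 1 and 60 to 70 degrees of freedom. The function names are hypothetical.

```python
import numpy as np

def sse(X, y, cols):
    """Sum of squared errors of an intercept-plus-columns least-squares fit."""
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)

def partial_f(sse_prev, sse_new, n, k):
    """Partial F for the k-th predictor, assuming n - k - 1 residual d.o.f."""
    return (sse_prev - sse_new) / (sse_new / (n - k - 1))

def forward_selection(X, y, f_crit):
    """Add the best remaining variable until none reaches f_crit."""
    n, p = X.shape
    selected = []
    current_sse = sse(X, y, selected)          # intercept-only model
    while len(selected) < p:
        k = len(selected) + 1                  # model size if a variable is added
        candidates = [c for c in range(p) if c not in selected]
        scores = {
            c: partial_f(current_sse, sse(X, y, selected + [c]), n, k)
            for c in candidates
        }
        best = max(scores, key=scores.get)
        if scores[best] < f_crit:              # no candidate is significant
            break
        selected.append(best)
        current_sse = sse(X, y, selected)
    return selected

# Synthetic example (assumption): only columns 0 and 2 drive the response.
rng = np.random.default_rng(42)
n = 72
X = rng.normal(size=(n, 4))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(0.0, 0.5, n)

selected = forward_selection(X, y, f_crit=2.8)
print("selected column indices, in order of entry:", selected)
```

Replacing the fixed `f_crit` with an exact p-value from an F-distribution routine would follow the described procedure more closely; the fixed threshold is used here only to keep the sketch dependency-free.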

Backward elimination (BE)
This is the simplest of all variable selection procedures and can be easily implemented without special software. This method deletes variables one by one from the model until all remaining variables contribute something significant to the dependent variable. BE begins with a model that includes all variables (Mekasha et al., 2014; Haque et al., 2018). This process continues until none of the variables outside the model has a p-value less than the critical value and every variable in the model satisfies the p-value criterion.

Mean square error (MSE)
This method finds several subsets of different sizes that best predict the dependent variable. R² finds subsets of variables that best predict the dependent variable based on the appropriate TSs. If there are $k$ potential predictor variables, then the possible number of prediction models would be $2^k$. All the possible models built from the independent variables were evaluated with the MSE criterion, and the model with the lowest value of MSE was selected. The MSE measures the variance for each of the models and is given by:
$$MSE = \frac{\sum (Y - Y_p)^2}{n - k - 1}$$
where $Y$ and $Y_p$ are the observed and predicted water demand values, respectively, $n$ is the number of data points, and $k$ is the number of independent variables.
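The exhaustive search over all $2^k$ subsets can be sketched as below. This is an illustration on synthetic data, assuming an intercept in every model so that the residual degrees of freedom are n - k - 1; the function names `residual_mse` and `best_subset_by_mse` are hypothetical.

```python
import itertools
import numpy as np

def residual_mse(X, y, cols):
    """Residual mean square of an intercept-plus-columns least-squares fit."""
    n = len(y)
    A = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid) / (n - len(cols) - 1)

def best_subset_by_mse(X, y):
    """Evaluate every subset of columns and return the one with the lowest MSE."""
    p = X.shape[1]
    results = {}
    for size in range(p + 1):
        for cols in itertools.combinations(range(p), size):
            results[cols] = residual_mse(X, y, list(cols))
    best = min(results, key=results.get)
    return best, results

# Synthetic example (assumption): five candidate predictors, two of them relevant.
rng = np.random.default_rng(3)
n = 72
X = rng.normal(size=(n, 5))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(0.0, 0.5, n)

best, results = best_subset_by_mse(X, y)
print("number of models evaluated:", len(results))
print("subset with the lowest MSE:", best)
```

With five candidate predictors, as in this study, the search covers only 32 models, so exhaustive evaluation is cheap; the count doubles with each additional candidate.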

Best model with the Akaike information criterion (AIC)
The AIC procedure was proposed by Akaike (Haque et al., 2018), and it selects the model with the minimum value of the AIC, which can be calculated by the following equation:
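For a least-squares regression, the AIC is commonly written in the form below. This is a standard textbook expression offered as an assumption, not necessarily the exact form used in this study. Here $n$ is the number of data points, $k$ is the number of predictor variables in the candidate model, and $SSE_k$ is its residual sum of squares; the count $k + 1$ assumes that an intercept is fitted, and additive constants shared by all candidate models are dropped because they do not affect which model attains the minimum.

$$AIC = n \ln\left(\frac{SSE_k}{n}\right) + 2(k + 1)$$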

Best model with Mallow's Cp criterion (CP)
The Cp criterion was proposed by Mallow (Look, 2010) for univariate regression analysis, and it selects the model with the minimum value of the Cp statistic. The Cp statistic can be calculated as follows:
$$C_p = \frac{SSE_k}{S^2} - n + 2(k + 1)$$
where $S^2$ is the MSE for the full model, $SSE_k$ is the residual sum of squares for the subset model that contains $k$ predictor variables, and $n$ is the number of data points. The term $k + 1$ counts the intercept along with the $k$ predictor variables.

Principal component analysis (PCA)
Principal component analysis is one of the most frequently used multivariate data analysis methods. It is a projection method: it projects observations from a p-dimensional space with p variables to a k-dimensional space (where k < p) so as to conserve the maximum amount of information from the initial dimensions, with information measured by the total variance of the dataset. PCA dimensions are also called axes or factors.
If the information associated with the first two or three axes represents a sufficient percentage of the total variability of the scatter plot, the observations can be represented on a two- or three-dimensional chart, which makes interpretation much easier. Each principal component is a linear combination of the original variables:
$$PC_i = b_{i1} x_1 + b_{i2} x_2 + \dots + b_{ik} x_k$$
where $x_1, x_2, \dots, x_k$ represent the original variables in the data matrix and $b_{ij}$ represent the eigenvectors.
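A compact way to obtain the eigenvalues and eigenvectors is to decompose the correlation matrix of the variables, as in the sketch below. The data are synthetic, and working from the correlation matrix (that is, standardizing the variables first) is an assumption made here because the predictors have different units; the function name `pca_correlation` is hypothetical.

```python
import numpy as np

def pca_correlation(X):
    """PCA of standardized variables via the correlation matrix."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize columns
    R = np.corrcoef(X, rowvar=False)                   # correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)               # symmetric eigen-solver
    order = np.argsort(eigvals)[::-1]                  # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()                # share of total variance
    scores = Z @ eigvecs                               # component scores
    return eigvals, eigvecs, explained, scores

# Synthetic example (assumption): rainy days track rainfall, temperature does not.
rng = np.random.default_rng(7)
n = 96
rainfall = rng.gamma(2.0, 50.0, n)
rainy_days = 0.08 * rainfall + rng.normal(0.0, 2.0, n)
max_temp = rng.normal(27.0, 2.0, n)
X = np.column_stack([rainfall, rainy_days, max_temp])

eigvals, eigvecs, explained, scores = pca_correlation(X)
print("eigenvalues:", eigvals)
print("explained variance ratio:", explained)
print("loadings of the first component:", eigvecs[:, 0])
```

Each column of `eigvecs` holds the coefficients that weight the standardized variables in one component, and the eigenvalues sum to the number of variables, so an eigenvalue below one marks a component that carries less variance than a single standardized variable.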

Standardized coefficients of the variable selection methods
The standardized coefficient indicates how large an effect each independent variable has on the dependent variable. The results showed that, out of the five variables (rainfall, maximum temperature, number of rainy days, water price and water restriction zone), three are statistically significant under FS, four under BE, three under SW and four under the MSE variable selection method (Figure 1). Rainfall was the most influential variable for water demand forecasting in our study.
Under the Mallow's Cp criterion and AIC variable selection methods, rainfall, maximum temperature, number of rainy days and water restriction zone had no significant impact on the water demand forecast. However, the coefficients of these variables under Mallow's Cp criterion are negative, and maximum temperature also showed a negative coefficient under AIC. Figure 1 and Table 1 clearly show that the variable selection methods give different results and select different sets of variables as inputs to the linear regression models. The results also showed irrational relationships between the variables chosen by the selection methods and water demand. This is due to the presence of multicollinearity among the independent variables (Haque et al., 2018).
As far as the statistical results of the modeling are concerned, the best model was selected based on the values of R² and adjusted R²: the model with the highest R² and adjusted R² is the best model. The findings of this paper showed that the highest value of R² was recorded for Model 2, which was found to be the best model (Table 2).
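For reference, the two goodness-of-fit measures are usually defined as below. These are the standard textbook definitions, stated here as an assumption about how the values were computed. $SSE$ is the residual sum of squares, $SST$ is the total sum of squares of the dependent variable about its mean, $n$ is the number of data points and $k$ is the number of independent variables in a model with an intercept.

$$R^2 = 1 - \frac{SSE}{SST}, \qquad R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}$$

Because the adjustment factor grows with $k$, adjusted R² increases only when an added variable reduces $SSE$ by more than would be expected by chance, whereas R² never decreases when a variable is added.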

Pearson's correlation matrices of the water demand variables
The Pearson's correlation matrix of the water demand variables showed that the maximum correlation coefficient was 0.318, between water price and maximum temperature, followed by 0.222, between number of rainy days and rainfall. All the correlation results are given in Table 3. The presence of high correlation between independent variables indicates strong multicollinearity, which is likely to produce biased results in the regression analysis.
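A correlation screen of this kind can be reproduced with a few lines of NumPy. The sketch below uses synthetic data, so the numbers it prints are unrelated to Table 3; the variable names and the helper `strongest_pair` are hypothetical.

```python
import numpy as np

names = ["rainfall", "max_temp", "rainy_days", "water_price", "restriction"]

# Synthetic monthly data (assumption): rainy days are built to follow rainfall.
rng = np.random.default_rng(1)
n = 96
rainfall = rng.gamma(2.0, 50.0, n)
max_temp = rng.normal(27.0, 2.0, n)
rainy_days = 0.08 * rainfall + rng.normal(0.0, 2.0, n)
water_price = rng.normal(0.5, 0.05, n)
restriction = rng.integers(0, 3, n).astype(float)
X = np.column_stack([rainfall, max_temp, rainy_days, water_price, restriction])

# Pearson correlation matrix; columns are the variables.
R = np.corrcoef(X, rowvar=False)

def strongest_pair(R, names):
    """Return the pair of variables with the largest absolute correlation."""
    off = np.abs(R.copy())
    np.fill_diagonal(off, 0.0)                 # ignore self-correlations
    i, j = np.unravel_index(np.argmax(off), off.shape)
    return names[min(i, j)], names[max(i, j)], float(R[i, j])

pair = strongest_pair(R, names)
print(np.round(R, 3))
print("most strongly correlated pair:", pair)
```

Scanning the off-diagonal entries this way flags candidate pairs for multicollinearity before any regression is fitted.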

Principal component analysis (PCA)
The eigenvalues of the principal components of the independent variables are shown in Figure 2. The results showed that the last three principal components (PCs) account for 60% of the variability and that their eigenvalues are less than one. This paper therefore selected PC-1 to PC-5 to find the variables that influence the water demand estimate. Bold values in Table 4 indicate a correlation between PCs and independent variables. This study incorporated all the independent variables in the chosen five PCs.
The independent variables heavily loaded in PC-1 are rainfall and maximum temperature; in PC-2, rainfall, maximum temperature and number of rainy days; in PC-3, maximum temperature, number of rainy days and water price; in PC-4, number of rainy days, water price and water restriction zones; and in PC-5, number of rainy days, water price and water restriction zones. However, PC-1 and PC-2 are both dominated by rainfall and maximum temperature, which are highly correlated with each other. These variables were therefore taken from PC-1 for use in the regression analysis to avoid the multicollinearity problem.
Finally, we developed three individual models from the independent variables, taking into account their loadings in the principal component analysis. These are Models 1, 2 and 3. The potential of each model is indicated in Table 2.
(1) Model 1: Rainfall, maximum temperature, number of rainy days, water restriction zone
(2) Model 2: Rainfall, maximum temperature, number of rainy days
(3) Model 3: Rainfall, number of rainy days
When we developed the three models, water price was not found to be statistically significant in any of them by regression analysis. Therefore, we excluded this variable from the three models when simulating results.
From the results, we can conclude that Model 1 gives better results than the other two models. Models 2 and 3 give weaker results (Table 5), which may be due to multicollinearity among the variables (Haque et al., 2018). The most influential predictor variables for determining water demand in Gondar town were rainfall, maximum temperature, number of rainy days and water restriction zone. The results show that the chosen independent variables predict water demand with good accuracy and that the developed model is free from the multicollinearity problem. This method is also simple to apply to water demand forecasting in a water supply system.

Conclusion
This study analyzed and compared the results of the FS, backward elimination, stepwise selection, MSE criterion, Mallow's Cp criterion, AIC and PCA variable selection methods for water demand forecasting of Gondar town. The results indicate that the seven variable selection methods produced different sets of predictor variables.
The results showed that PCA identified the influential variables in water demand modeling better than the other statistical methods. The results of this study fit the study area in Ethiopia well, and other areas with different water consumption patterns and climate conditions can adopt the same approach to develop their own water demand forecasting models. The results of this paper also indicated that incorporating many independent variables in the model does not necessarily improve model performance.