Assessing Gene Expression Programming as a technique for seasonal streamflow prediction: A case study of NSW

This research aims to provide long-term streamflow forecast models that use multiple climate indices as predictors. An advanced evolutionary method, Gene Expression Programming (GEP), is used to solve the resulting symbolic regression problems, as it has been found to be superior to other traditional methods. Being a transparent model, GEP expresses the relationship between input (climate indices) and output (streamflow) variables as mathematical expressions, which helps users understand the underlying hydrological relationship between the climate modes and streamflow without detailed knowledge of the software used. Three stations in New South Wales (NSW) were chosen based on their long data records and few missing values. Several preliminary analyses, including single and multiple correlation analyses, reveal that PDO (Pacific Decadal Oscillation), IPO (Interdecadal Pacific Oscillation), IOD (Indian Ocean Dipole) and ENSO (El Niño Southern Oscillation) are among the influential indices for the study region. The resultant models appear to be more efficient, with up to 50% higher Pearson correlation (r) values than those of the simple multiple linear regression (MLR) technique adopted in one of our previous studies. Furthermore, statistical performance measures including Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Willmott index of agreement (d) and Nash-Sutcliffe efficiency (NSE) indicate high predictability of the developed models. The similar correlation values (r) obtained for the calibration and validation periods, which range between 0.74 and 0.91, increase the reliability of the resultant models for predicting seasonal streamflow up to three months in advance.


Introduction
An efficient forecast model is of immense importance to water stakeholders, as it can greatly support low-risk decision making, which in turn enhances the potential economic benefit [1,2]. Because remote climate drivers fluctuate at very low frequencies, they offer better predictability of streamflow than initial catchment conditions.
Australia is greatly influenced by climate anomalies originating from the surrounding Pacific, Indian and Southern Oceans. The major climate drivers influencing southeast Australia's climate include, but are not limited to, ENSO and IPO (PDO) from the Pacific Ocean, IOD from the Indian Ocean and SAM (Southern Annular Mode) from the Southern Ocean [3].
The ENSO (El Niño-Southern Oscillation) phenomenon, which causes climate variability in many parts of the world, is the direct consequence of large-scale interactions between the ocean and the atmosphere [13]. In addition to ENSO, eastern Australia is known to be influenced by IOD [14] and by the interdecadal modulation of ENSO impacts arising from low-frequency variability in the Pacific Ocean, which is recognized as the Pacific Decadal Oscillation or PDO [15]. Past studies [16][17][18] have investigated the influence of the Interdecadal Pacific Oscillation (IPO) on the decadal to multidecadal variation of rainfall and streamflow and suggested that IPO should be considered a dominant climate index. IPO and PDO are highly correlated, as they are two indices of the same phenomenon: IPO is defined for the whole Pacific Basin and PDO for the North Pacific Basin (poleward of 20°N).
Many hydrologists have established the existence of strong correlations between streamflow and large-scale climate drivers, though the nature of the relationship has remained a difficult question. According to [5], the relationship between streamflow and remote climate drivers is more likely to be non-linear; thus, a non-linear model is expected to give better solutions than a linear model.
Although the difficulty of working with artificial intelligence models encourages users to attempt comparatively simple statistical models, the limitations of those models become evident when the data become complex. In addition, one of the major advantages of artificial intelligence-based models such as GEP and ANN over regression-based models is that they do not impose a fixed model structure on the data; rather, they allow the data to identify the model structure [28], which sometimes makes the models more robust.
One of the main advantages of GEP models over some other data-driven models is that the resultant model is not a complete "black box"; rather, the relationship between input (climate indices) and output (streamflow) variables can be explained with a mathematical expression (a combination of basic operators and functions). GEP has been found to perform better than other data-driven methods such as ANN and ANFIS [29][30][31][32]. Reference [25] compared the performance of ANN, ANFIS, GEP and ARMA models in forecasting lake levels in Turkey and concluded that GEP performed best among the data-driven models. GEP was suggested as a feasible alternative to ANN, ANFIS and MLR time series models when these models were applied to simulate the rainfall-runoff transformation process [25,32]. Flow duration curves have also been generated successfully using non-linear regression equations developed by GEP. ANN and GEP were compared for estimating reference evapotranspiration, where GEP provided explicit equations even though the outcomes from ANN were slightly better than those from GEP [33]. In the study of reference [34], GEP predicted one-day-ahead river flow efficiently, with a high correlation coefficient.
The study intends to provide deterministic forecasts, which can play a more important role in solving water management problems than the probabilistic approaches attempted by many researchers to date [3,5,35,36,37], because they enable water stakeholders to make decisions knowing the predicted amount of future streamflow. Furthermore, the Bayesian joint probability (BJP) method used by the Australian Bureau of Meteorology (http://www.bom.gov.au/water/ssf/index.shtml) to forecast streamflow is also a probabilistic method. Therefore, the present study is an endeavor to explore the non-linear relationship between climate indices and seasonal streamflow with a view to providing deterministic forecasts.

Study area and data
The north-eastern part of New South Wales was chosen as the study area considering its agricultural importance, climatic variation and geographic location. Three streamflow stations (Figure 1), Hunter River at Singleton (Station ID 210001), Goulburn River at Coggan (Station ID 210006) and Namoi River at North Cuerindi (Station ID 419005), were selected based on their long data records and few missing values (less than 5%). Five climate indices were chosen for the analysis based on previous research on streamflow and rainfall in the study area: the ENSO-based SST anomalies NINO3.4 and EMI, together with IPO, PDO and DMI (IOD). The oceanic and atmospheric climate index data were collected from the Climate Explorer website (http://climexp.knmi.nl), while the EMI data were obtained from the JAMSTEC website (http://www.jamstec.go.jp/frcgc/research/dl/iod/modoki), for a duration of 102 years. An overview of the climatic variables used is presented in Table 1. Observed monthly streamflow (in cumec) for the same 102 years, from 1914 to 2015, was collected from the Australian Bureau of Meteorology (BOM) (http://www.bom.gov.au/waterdata/). The dataset was divided into two segments: 96 years of data (1914 to 2009) were used for calibrating the models and the remaining 6 years (2010 to 2015) for validating them. Using the collected monthly streamflow data, seasonal mean discharge was derived for spring (September-October-November).
In GEP, the tail length t of each gene is determined by the head length h as t = h(n - 1) + 1, where n is the number of arguments of the function with the most arguments. The steps to predict streamflow are as follows:
• Initially, a random population is generated, consisting of individual chromosomes of fixed length.
• Each chromosome in the initial population is expressed as an expression tree and evaluated using the predicted-observed data pairs of the training period together with an appropriate fitness function.
• The next step is to determine the set of terminals T and the set of functions F to create the chromosomes.
• The next major step is to select the chromosomal architecture, which comprises the length of the head (h), the number of genes per chromosome and the genetic operators.
• The proper linking function (addition, multiplication, subtraction or division) needs to be chosen to connect the algebraic sub-trees.
• Finally, the default values of the genetic operators need to be selected for the GeneXpro program [38].
An overview of the parameters used is given in Table 2. The development of the GEP models and all the relevant statistical calculations were performed using the "GeneXpro tools 5.0" software.
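To make the chromosome structure described in the steps above more concrete, the following minimal Python sketch shows how a single gene could be decoded into an expression tree and evaluated, and how several genes could be linked by addition. It is not the GeneXpro implementation: the function set, the protected square root "Q", the helper names (`tail_length`, `evaluate_gene`, `evaluate_chromosome`) and the index values are all assumptions made for illustration only.

```python
import math
import operator

# Function set F: symbol -> (number of arguments, implementation).
# The symbols and the protected square root "Q" are illustrative assumptions.
FUNCTIONS = {
    "+": (2, operator.add),
    "-": (2, operator.sub),
    "*": (2, operator.mul),
    "Q": (1, lambda x: math.sqrt(abs(x))),
}


def tail_length(head_length):
    """Tail length t = h(n - 1) + 1, with n the largest arity in FUNCTIONS."""
    n = max(arity for arity, _ in FUNCTIONS.values())
    return head_length * (n - 1) + 1


def evaluate_gene(gene, terminals):
    """Decode a gene breadth-first into an expression tree and evaluate it."""
    children = {}
    next_child = 1  # index of the next symbol not yet attached to a parent
    node = 0
    while node < next_child:
        arity = FUNCTIONS[gene[node]][0] if gene[node] in FUNCTIONS else 0
        children[node] = range(next_child, next_child + arity)
        next_child += arity
        node += 1

    def value(index):
        symbol = gene[index]
        if symbol in FUNCTIONS:
            func = FUNCTIONS[symbol][1]
            return func(*(value(child) for child in children[index]))
        return terminals[symbol]  # terminal: look up the predictor value

    return value(0)


def evaluate_chromosome(genes, terminals):
    """Link the sub-trees of all genes by addition (one possible linking function)."""
    return sum(evaluate_gene(gene, terminals) for gene in genes)


# Hypothetical lagged index values, not data from the study.
indices = {"NINO3.4": 2.0, "PDO": 3.0, "DMI": 16.0}
# Head of length 3 ("+", "*", "Q") followed by a tail of 4 terminals.
gene = ["+", "*", "Q", "NINO3.4", "PDO", "DMI", "NINO3.4"]
print(tail_length(3))                # 4
print(evaluate_gene(gene, indices))  # NINO3.4 * PDO + sqrt(|DMI|) = 10.0
```

In this example the last tail symbol is never reached by the decoder. This unused region is what allows a fixed-length chromosome to encode expression trees of different sizes.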

Results and discussion
In one of our preliminary studies [39]
From Table 3 it is evident that the developed models have high correlation (r) values with very low errors, which confirms the satisfactory statistical performance of the GEP models.
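For readers who wish to reproduce this kind of evaluation outside GeneXpro, the sketch below computes the Pearson correlation (r), RMSE, MAE, Nash-Sutcliffe efficiency (NSE) and Willmott index of agreement (d) from paired observed and predicted values, using their standard textbook definitions. The function name `performance_metrics` and the sample numbers are hypothetical and are not taken from the study.

```python
import math


def performance_metrics(observed, predicted):
    """Return r, RMSE, MAE, NSE and Willmott d for paired series."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    pairs = list(zip(observed, predicted))

    sse = sum((p - o) ** 2 for o, p in pairs)         # sum of squared errors
    var_o = sum((o - mean_o) ** 2 for o in observed)  # spread of observations
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    cov = sum((o - mean_o) * (p - mean_p) for o, p in pairs)
    # Potential error used in the denominator of Willmott's d.
    potential = sum((abs(p - mean_o) + abs(o - mean_o)) ** 2 for o, p in pairs)

    return {
        "r": cov / math.sqrt(var_o * var_p),
        "RMSE": math.sqrt(sse / n),
        "MAE": sum(abs(p - o) for o, p in pairs) / n,
        "NSE": 1.0 - sse / var_o,
        "d": 1.0 - sse / potential,
    }


# Hypothetical seasonal flows (cumec): every prediction is 1 unit too high.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 3.0, 4.0, 5.0]
metrics = performance_metrics(observed, predicted)
print(metrics)
```

In this toy case r is 1 although NSE is only about 0.2, because correlation is insensitive to a constant bias. This is one reason for reporting error-based measures alongside r.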
The output equations from the best developed GEP models for all three stations are as follows:
In Table 4, the performance of the GEP models is compared with that of the MLR models developed in one of our previous studies (Esha and Imteaz, 2018). It is evident from both analyses (MLR and GEP) that the selected best models have statistically significant correlation (r) values with low errors. Comparing the best models from both analyses shows that, for every station, the GEP models outperformed the MLR models, with much higher correlation values (almost twice those of MLR) and much lower errors.

Conclusion
This study was conducted to assess the ability of the GEP technique to predict seasonal streamflow in the north-east NSW region using combinations of lagged climate indices as predictors. The performance of the developed GEP models was evaluated based on several statistical parameters, including Pearson correlation (r), RRSE, RAE, RMSE, MAE and NSE. The best model for each of the three stations was selected considering its higher correlation value and lower errors. The outcomes of this study were compared with the results obtained from the MLR analysis that formed part of one of our previous studies. The comparison revealed that, for every station, the GEP models outperformed the MLR models, with much higher correlation values ranging between 0.74 and 0.91, whereas the values for the MLR models varied between 0.41 and 0.65. The developed models can predict spring streamflow up to three months ahead, thus enabling water stakeholders to make low-risk decisions at an early stage of the crop period. The predictability of GEP models for seasonal streamflow in other regions of NSW will be assessed in our future studies.