Regional flood frequency analysis using data-driven models (M5, random forest, and ANFIS) and a multivariate regression method in ungauged catchments

Flooding is recognized worldwide as one of the most expensive natural hazards. To adopt proper structural and nonstructural measures for controlling and mitigating the rising flood risk, the availability of streamflow values along a river is essential. This raises concerns in the hydrological assessment of poorly gauged or ungauged catchments. In this regard, several flood frequency analysis approaches have been applied in the literature, including the index flow method (IFM), square grids method (SGM), hybrid method (HM), and the conventional multivariate regression method (MRM). Because these approaches are often based on assumptions that simplify the complex nature of the hydrological system, they might not be able to address uncertainties associated with the complexity of the system. One powerful tool for dealing with this issue is data-driven modeling, which can be readily adopted in complex systems. The objective of this research is to utilize three different data-driven models, namely random forest (RF), adaptive neuro-fuzzy inference system (ANFIS), and the M5 decision tree algorithm, to predict peak flow associated with various return periods in ungauged catchments. Results from each data-driven model were assessed and compared with the conventional multivariate regression method. The results revealed that all three data-driven models performed better than the multivariate regression method. Among them, the RF model not only demonstrated superior peak flow prediction compared to the other algorithms but also provided insight into the complexity of the system by delivering a mathematical formulation.


Introduction
Floods are among the costliest natural hazards in most parts of the world, resulting in heavy loss of life and economic damage (Gao et al. 2018). Regional flood frequency analysis (RFFA) is a widely used method for estimating flood risk at target locations in river basins where streamflow measurements are either limited or unavailable (Griffis and Stedinger 2007; Leclerc and Ouarda 2007; Zaman et al. 2012; Smith et al. 2015; Lotfirad et al. 2018). The first RFFA was undertaken by the USGS in New England (Kinnison and Colby 1945). Lately, RFFA has received significant attention for the design and management of water infrastructure such as dams, reservoirs, and bridges to mitigate flood risks and hence financial losses. Much of this attention has occurred in response to water-related hazards such as floods in regions where little or no data is available on peak flows, such as the Indus floods to the east of Iran.
RFFA has been used as a crucial tool in many applications, for example (i) water infrastructure, (ii) flood protection projects, (iii) land cover planning, and other hydrologic studies. The regional techniques consist of a multivariate statistical structure derived from catchment characteristics data. In this process, a region of influence approach can be formed in which catchments are pooled together based on geographic proximity. Consequently, an optimum region is formed based on some objective function (Holmes et al. 2002; Aziz et al. 2010).
The relationship among flood flow values, physiographical and/or morphological characteristics of a catchment is a fundamental framework for RFFA. Various methods such as multivariate regression method (MRM), square grids method (SGM) and hybrid method (HM) have been developed in the literature for RFFA; each approach has its advantages and disadvantages. Among these methods, MRM has been widely used and presented in previous studies (Golestani et al. 2010;Malekinezhad et al. 2011;Latt et al. 2014).
Recently, data-driven models have been widely adopted in hydrological studies. Data-driven models can handle the nonlinearity and uncertainty in hydrological data very effectively. These methods are extensively used for rainfall-runoff modeling, water resource management modeling, and suspended sediment modeling. For example, Kumar et al. (2015) explored the application of data-driven models in regional flood frequency estimation: regional flood frequency relationships were developed using data-driven models, viz. ANN and FIS, for the lower Godavari subzone 3(f) of India and were compared with the regional relationships derived using the L-moments approach.
Data-driven models such as random forest (RF), fuzzy computing techniques, and the M5 decision tree have been developed for the complex modeling of floods by Solomatine and Xue (2004) and Jahangir et al. (2022). Among the numerous data mining techniques, ANFIS is one of the most widely used approaches in various water-related areas. Although it is an accurate predictive tool, the ANFIS technique has an inherent disadvantage that often makes users hesitant to interpret its outputs: it is a black box, so the nature of its solution is opaque. Networks of the same architecture trained on the same dataset may also differ because of the arbitrary nature of the internal representation (Witten and Frank 2000). Srinivas et al. (2008), Aziz et al. (2017), Esmaeili-Gisavandani et al. (2017), Sharifi Garmdareh et al. (2018), and Zalnezhad et al. (2022) attempted to shed light on the structure of ANFIS using regional flood analysis and rule-recovery methods. However, few studies have applied the M5 algorithm and random forest in RFFA. Therefore, in this study, the random forest and M5 algorithms were used to estimate peak flow and were compared with ANFIS and the multivariate regression method. The main advantage of M5 and RF is that they are induced as linear patches; these models provide a representation that is reproducible and understandable by practitioners (Solomatine and Xue 2004; Jothiprakash and Kote 2011). The target of RFFA is to predict river peak flow associated with various return periods in ungauged catchments and to reduce uncertainty in flood evaluation (Merz and Blöschl 2008; Zaman et al. 2012; Shu and Ouarda 2012; Leščešen et al. 2019). The objectives of this study are twofold. First, the aim is to perform RFFA using the random forest and M5 algorithms. Second, the goal is to estimate flood occurrences in data-deficient catchments within the western region of Iran.

Materials and methods
Out of eighty-nine stream gauges, thirty-two stations were used because of data availability. The data were obtained for the period 1987-2018. The study area, including the thirty-two stream gauges, is located in the west of Iran. From the homogeneous stations, twenty-seven stations were used for calibration and five stations for validation of the models (Fig. 1b). The five validation stations were not used in model building; after each data-driven model was built, it was validated with these stations. To obtain a single model, the return period was taken into account as an independent factor. The study considered the annual maximum instantaneous peak flow. The Kolmogorov-Smirnov test in EasyFit 5.6 was used to identify the best-fitting distribution function, from which peak flows for the different return periods were estimated (Shokouhifar et al. 2022).
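The quantile-estimation step described above can be illustrated with a minimal Python sketch. It is not the EasyFit workflow used in the study: the Gumbel distribution, the method-of-moments fit, and the sample of annual maxima are all assumptions made for illustration, and the study selected its distribution per station from several candidates.

```python
import math

def fit_gumbel_moments(annual_maxima):
    """Fit a Gumbel (EV1) distribution by the method of moments (assumed choice)."""
    n = len(annual_maxima)
    mean = sum(annual_maxima) / n
    var = sum((q - mean) ** 2 for q in annual_maxima) / (n - 1)
    scale = math.sqrt(6.0 * var) / math.pi
    loc = mean - 0.5772156649 * scale  # Euler-Mascheroni constant
    return loc, scale

def gumbel_cdf(q, loc, scale):
    """Non-exceedance probability of flow q under the fitted Gumbel."""
    return math.exp(-math.exp(-(q - loc) / scale))

def gumbel_quantile(return_period, loc, scale):
    """Peak flow with annual exceedance probability 1 / return_period."""
    p = 1.0 - 1.0 / return_period  # non-exceedance probability
    return loc - scale * math.log(-math.log(p))

def ks_statistic(sample, cdf):
    """Kolmogorov-Smirnov distance between the empirical and fitted CDFs."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

# Hypothetical annual maximum peak flows (m^3/s), for illustration only.
peaks = [112.0, 85.5, 240.3, 67.8, 158.9, 301.2, 95.4, 133.7, 189.6, 74.1]
loc, scale = fit_gumbel_moments(peaks)
d_stat = ks_statistic(peaks, lambda q: gumbel_cdf(q, loc, scale))
quantiles = {T: gumbel_quantile(T, loc, scale) for T in (2, 10, 100, 1000)}
```

In practice the Kolmogorov-Smirnov distance would be computed for each candidate distribution, and the one with the smallest distance would supply the quantiles.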

Study area
The Karkheh River basin is located in the west of Iran (Fig. 1a). It covers 51,230 square kilometers in parts of six Iranian provinces. The length of the Karkheh River is approximately 900 km (Fallah-Mehdipour et al. 2020). This study considers RFFA using three data-driven models (M5, RF and ANFIS) and a multivariate regression method in ungauged catchments. Further details of the Karkheh Basin can be found in many papers (see, e.g., Gheitasi 2016; Zamani et al. 2015).

Data
The following data were used in this study: (i) annual maximum instantaneous peak flow, obtained from the Ministry of Energy of Iran, and (ii) the physiographic characteristics of the catchments, extracted from ALOS-PALSAR satellite data with a spatial resolution of 12.5 m (https://asf.alaska.edu). The extraction of physiographic characteristics was carried out in ArcGIS 10.5.
In the data utilized in this study, the annual maximum flood (Q) ranges from 17.5 to 1337.8 m³/s. The flood discharge was calculated for return periods (T) of 2, 10, 100, and 1000 years. Drainage areas (A) range from 8.17 to 26,187.02 km². The height of each sub-basin (H) ranges from 1043 to 2621 m, the stream length (L) ranges from 4.77 to 420.82 km, and the catchment slope (S) varies from 8.14% to 37.67%. Table 1 presents the physiographic characteristics of the studied catchments.

Data-driven modeling
Data-driven modeling relies on relationships between measured data without a need for a priori knowledge of the physical system behavior (Jones et al. 2013;Ashrafzadeh et al. 2020;Biazar et al. 2020;Jafarpour et al. 2022).
Once trained, a data-driven model becomes a parametric description of the function. Out of several possible data-driven methods, ANFIS is one of the most widely used in water resource applications, whereas less attention has been directed toward RF and M5 model trees. In all models, 70% of the data were used for training and 30% for testing. A brief description of these methods follows.
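As a rough illustration of how such a dataset might be assembled, the sketch below builds one sample per station and return period, with the return period T used as an input alongside A, H, L and S, and then makes a 70/30 split. The station values, the field names, and the random row-wise shuffle are assumptions; the paper does not state how the split was drawn.

```python
import random

def build_samples(stations, return_periods):
    """One row per (station, return period): features (A, H, L, S, T), target Q_T."""
    rows = []
    for st in stations:
        for T in return_periods:
            features = (st["A"], st["H"], st["L"], st["S"], T)
            rows.append((features, st["Q"][T]))
    return rows

def split_train_test(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of the rows reproducibly and cut it at train_fraction."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical stations (area km^2, height m, length km, slope %, quantiles m^3/s).
stations = [
    {"A": 8.2, "H": 2600.0, "L": 4.8, "S": 37.0,
     "Q": {2: 18.0, 10: 41.0, 100: 77.0, 1000: 115.0}},
    {"A": 210.0, "H": 2100.0, "L": 33.0, "S": 27.0,
     "Q": {2: 60.0, 10: 140.0, 100: 260.0, 1000: 390.0}},
    {"A": 1500.0, "H": 1800.0, "L": 98.0, "S": 18.3,
     "Q": {2: 150.0, 10: 330.0, 100: 610.0, 1000: 900.0}},
    {"A": 9800.0, "H": 1400.0, "L": 260.0, "S": 11.2,
     "Q": {2: 320.0, 10: 640.0, 100: 1050.0, 1000: 1300.0}},
    {"A": 26000.0, "H": 1050.0, "L": 420.0, "S": 8.5,
     "Q": {2: 410.0, 10: 800.0, 100: 1200.0, 1000: 1337.0}},
]
rows = build_samples(stations, (2, 10, 100, 1000))
train, test = split_train_test(rows)
```

Treating T as a feature is what allows a single model to cover all four return periods instead of fitting one model per return period.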

M5 model
The M5 model is a data-driven model proposed by Quinlan (1992) and widely employed in the realm of water science (Rahimikhoob 2014; Kisi and Kilic 2016; Kisi and Parmar 2016). The final structure, together with its leaves, is shown as a tree in Fig. 2b. Further details of the M5 model can be found in many papers (e.g., Farajpanah et al. 2020; Adib et al. 2023).
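For readers unfamiliar with M5, the splitting criterion introduced by Quinlan (1992) is the standard deviation reduction (SDR): the standard deviation of the target at a node minus the size-weighted standard deviations of the two child subsets. The sketch below evaluates SDR for every candidate threshold of a single attribute; the area and peak-flow values are hypothetical, and a full M5 implementation would also fit linear models at the leaves, prune, and smooth.

```python
import math

def std_dev(values):
    """Population standard deviation; 0.0 for an empty list."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / n)

def sdr(targets, left, right):
    """Standard deviation reduction achieved by splitting targets into left/right."""
    n = len(targets)
    return (std_dev(targets)
            - len(left) / n * std_dev(left)
            - len(right) / n * std_dev(right))

def best_split(x, y):
    """Return (threshold, sdr) of the split on attribute x that maximizes SDR."""
    pairs = sorted(zip(x, y))
    best = (None, -math.inf)
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold can separate identical attribute values
        threshold = 0.5 * (pairs[i][0] + pairs[i - 1][0])
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        gain = sdr(y, left, right)
        if gain > best[1]:
            best = (threshold, gain)
    return best

# Hypothetical drainage areas (km^2) and peak flows (m^3/s).
area = [10.0, 20.0, 30.0, 1000.0, 1200.0, 1500.0]
peak = [5.0, 7.0, 6.0, 200.0, 230.0, 260.0]
threshold, gain = best_split(area, peak)
```

Here the best threshold falls midway between the small and the large catchments, which is where the peak flows change most sharply.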

Random forest
The RF method is nonparametric and belongs to the family of ensemble methods. It consists of a set of regression trees fitted to the training data. Typically, a set of base training samples is formed for each tree. Three parameters are essential in RF. The first is how many trees should be created, the second is how many variables are considered when creating a node in each tree, and the third is the node size, which controls the depth of the regression tree created. One of the advantages of this method is that there is no need to prune the trees during modeling and classification (Esmaeili- ).
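The role of the three parameters can be seen in a deliberately small, pure-Python sketch of a random forest for regression. The names `n_trees`, `mtry` and `min_node_size` are illustrative choices rather than the settings or software used in the study, and the catchment data are hypothetical.

```python
import random

def mean(values):
    return sum(values) / len(values)

def sse(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def build_tree(X, y, mtry, min_node_size, rng):
    """Grow one unpruned regression tree; a leaf is the mean target (float)."""
    if len(y) < 2 * min_node_size or len(set(y)) == 1:
        return mean(y)
    best = None
    for j in rng.sample(range(len(X[0])), mtry):  # random subset of variables
        for t in sorted(set(row[j] for row in X))[:-1]:
            left = [i for i, row in enumerate(X) if row[j] <= t]
            right = [i for i, row in enumerate(X) if row[j] > t]
            if len(left) < min_node_size or len(right) < min_node_size:
                continue
            cost = sse([y[i] for i in left]) + sse([y[i] for i in right])
            if best is None or cost < best[0]:
                best = (cost, j, t, left, right)
    if best is None:
        return mean(y)
    _, j, t, left, right = best
    return {
        "feature": j,
        "threshold": t,
        "left": build_tree([X[i] for i in left], [y[i] for i in left],
                           mtry, min_node_size, rng),
        "right": build_tree([X[i] for i in right], [y[i] for i in right],
                            mtry, min_node_size, rng),
    }

def predict_tree(tree, row):
    while isinstance(tree, dict):
        side = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[side]
    return tree

def fit_forest(X, y, n_trees=50, mtry=2, min_node_size=2, seed=0):
    """Fit each tree to a bootstrap resample of the training data."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(y)) for _ in range(len(y))]
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx],
                                 mtry, min_node_size, rng))
    return forest

def predict_forest(forest, row):
    """Average the predictions of all trees."""
    return mean([predict_tree(tree, row) for tree in forest])

# Hypothetical catchments: (area km^2, stream length km, slope %) -> peak flow m^3/s.
X = [(8.2, 4.8, 37.0), (55.0, 14.0, 31.5), (210.0, 33.0, 27.0),
     (640.0, 61.0, 22.4), (1500.0, 98.0, 18.3), (4200.0, 170.0, 14.9),
     (9800.0, 260.0, 11.2), (26000.0, 420.0, 8.5)]
y = [18.0, 42.0, 95.0, 160.0, 310.0, 520.0, 830.0, 1300.0]
forest = fit_forest(X, y)
pred_small = predict_forest(forest, X[0])
pred_large = predict_forest(forest, X[-1])
```

Because every tree predicts a mean of observed targets, the forest cannot extrapolate beyond the range of the training peak flows, which is worth keeping in mind for long return periods.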

ANFIS
The adaptive neuro-fuzzy inference system (ANFIS) is a multilayer feed-forward network in which each node performs a particular function on incoming signals (Jang 1993; Heddam 2014). A typical ANFIS architecture, in which a circle indicates a fixed node and a square indicates an adaptive node, is shown in Fig. 3. Further details of ANFIS can be found in many books (e.g., Azar 2010; Esmaeili-Gisavandani 2017; Adib et al. 2021).
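The layered computation can be sketched as a forward pass through a two-input, two-rule first-order Sugeno system of the kind described by Jang (1993). All membership-function and consequent parameters below are hypothetical; in a trained ANFIS they would be learned from the data, and the study's own rule base is not reproduced here.

```python
def gbell(x, a, b, c):
    """Generalized bell membership function with width a, slope b, center c."""
    return 1.0 / (1.0 + abs((x - c) / a) ** (2 * b))

def anfis_forward(x1, x2, mf_x1, mf_x2, consequents):
    """Forward pass: rule k pairs the k-th membership function of each input."""
    # Layer 1 (adaptive): membership grades of each input.
    mu1 = [gbell(x1, *params) for params in mf_x1]
    mu2 = [gbell(x2, *params) for params in mf_x2]
    # Layer 2 (fixed): rule firing strengths as products of grades.
    w = [m1 * m2 for m1, m2 in zip(mu1, mu2)]
    # Layer 3 (fixed): normalized firing strengths.
    total = sum(w)
    w_norm = [wi / total for wi in w]
    # Layer 4 (adaptive): first-order linear consequent of each rule.
    f = [p * x1 + q * x2 + r for p, q, r in consequents]
    # Layer 5 (fixed): weighted sum of rule outputs.
    return sum(wn * fi for wn, fi in zip(w_norm, f))

# Hypothetical parameters: (a, b, c) per membership function, (p, q, r) per rule.
mf_area = [(2.0, 2.0, 2.0), (2.0, 2.0, 8.0)]
mf_slope = [(5.0, 2.0, 10.0), (5.0, 2.0, 30.0)]
rules = [(1.0, 0.5, 2.0), (3.0, -0.2, 10.0)]
output = anfis_forward(3.0, 12.0, mf_area, mf_slope, rules)
```

The output is always a convex combination of the individual rule outputs, so the opacity discussed in the Introduction comes from the learned parameters rather than from the arithmetic itself.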

Evaluation criteria
Normalized root-mean-square error (NRMSE) and the correlation coefficient (R²) were used to evaluate model performance.
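A minimal sketch of the two criteria is given below. Two assumptions are made because the defining equations are not reproduced here: NRMSE is taken as the RMSE divided by the mean of the observed values (other normalizations, such as the range or standard deviation, are also in common use), and R² is taken as the squared Pearson correlation between observed and predicted values.

```python
import math

def nrmse(observed, predicted):
    """RMSE divided by the mean of the observations (assumed normalization)."""
    n = len(observed)
    rmse = math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)
    return rmse / (sum(observed) / n)

def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov ** 2 / (var_o * var_p)

# Hypothetical observed and predicted peak flows (m^3/s).
obs = [100.0, 200.0, 300.0]
pred = [110.0, 190.0, 320.0]
score_nrmse = nrmse(obs, pred)
score_r2 = r_squared(obs, pred)
```

Under these definitions a perfect model gives NRMSE = 0 and R² = 1; note that a squared correlation can remain high for a model with a systematic bias, which is why it is paired with an error measure.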

Results
According to Table 2, four combinations of input data were used in the MRM, ANFIS, M5 and RF models to estimate peak flow with different return periods for regional flood frequency analysis (RFFA) (Table 3).

ANFIS results
To perform RFFA with the ANFIS model, an optimum number of membership functions was specified for each combination by trial and error. The best type of membership function was selected from among bell-shaped (gbellmf), trapezoidal (trapmf), triangular (trimf), Gaussian (gaussmf) and two-sided Gaussian (gauss2mf) functions by repeatedly training and testing the model for every membership function number and type. Based on the correlation coefficient (R²) and the normalized root-mean-square error (NRMSE), combination 4 (R² = 0.92 and NRMSE = 0.851) performed better than the others (Fig. 4).

M5 results
The M5 model tree does not require any user-defined parameters to be set. In addition, the M5 model provides a number of linear relations that can easily be used to predict the RFFA, as shown in Fig. 5. As shown in Table 4, the M5 model results indicated that input combination 4 performed better than the other combinations (R² = 0.95 and NRMSE = 0.45). The tree relationships of the M5 model for the best combination of inputs are presented in the appendix.

RF results
As shown in Table 5, the RF model results indicated that input combination 4 performed better than the other combinations (R² = 0.96 and NRMSE = 0.223).
Fig. 5 compares the peak flow values estimated by each model. RF has the best performance in peak flow estimation, while MRM has the worst. Furthermore, most of the models underestimated the peak flow for the 2-year and 10-year return periods, while most overestimated the peak flow for the 100-year and 1000-year return periods. Fig. 6 illustrates the performance of the models in the calibration (twenty-seven stream gauges) and validation (five stream gauges) stages. According to the Taylor diagrams (Fig. 6), the performance of the RF model is the best. For the return periods of 2, 10, and 1000 years, the M5 model ranks second after RF, but for the 100-year return period, the ANFIS model ranks second after RF.
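One simple way to quantify under- and overestimation by return period is the mean relative bias within each group, sketched below. The records are hypothetical and only illustrate the sign convention (negative for underestimation, positive for overestimation); they are not the study's results.

```python
def mean_relative_bias(observed, predicted):
    """Average of (predicted - observed) / observed."""
    return sum((p - o) / o for o, p in zip(observed, predicted)) / len(observed)

def bias_by_return_period(records):
    """records: iterable of (return_period, observed, predicted) tuples."""
    groups = {}
    for T, obs, pred in records:
        observed, predicted = groups.setdefault(T, ([], []))
        observed.append(obs)
        predicted.append(pred)
    return {T: mean_relative_bias(o, p) for T, (o, p) in sorted(groups.items())}

# Hypothetical (return period, observed, predicted) peak flows in m^3/s.
records = [
    (2, 50.0, 45.0), (2, 80.0, 76.0),
    (100, 400.0, 440.0), (100, 600.0, 630.0),
]
bias = bias_by_return_period(records)
```

With these illustrative numbers the 2-year group has a negative bias and the 100-year group a positive one, mirroring the pattern described above.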

Discussion
This study aims to provide a relatively simple method to estimate peak flow in ungauged regions based on their physiographic characteristics. To achieve this, data-driven models of varying natures were used. The MRM model is based on regression, the ANFIS model is based on fuzzy logic, the M5 model is based on tree-based classification of the input space, and the RF model is based on supervised ensemble learning. The models require inputs such as area, stream length, basin slope, basin height, and return period, the first four of which can be derived from topography. The best combination of inputs was combination 4, with the highest correlation coefficient and the lowest NRMSE. Given the results obtained in estimating peak flow in the calibration and validation stages, particularly with the RF model, the approach appears to be an effective alternative to similar studies whose inputs and modeling processes were considerably more complex. It is evident from Fig. 6 that the models used for estimating peak flow performed favorably, especially the RF model, with better accuracy for the short return periods (2 and 10 years) than for the long return periods (100 and 1000 years).
RFFA establishes a relationship between flood frequency and the physiographical characteristics of catchments to estimate floods in ungauged regions (e.g., Rahman et al. 2020). In this regard, the performances of the RF and M5 tree models as piecewise linear functions, ANFIS, and the multivariate regression method were evaluated for estimating flood frequency in ungauged sub-catchments, as in Vafakhah and Bozchaloei (2020).
Fig. 6 Performance of RF, M5, ANFIS, and MRM in estimating flood frequency at the five validation stream gauges for the 2-, 10-, 100-, and 1000-year return periods
A comparison of the correlation coefficient and root-mean-square error values indicated improved performance of the data-driven models compared to traditional methods such as the multivariate regression method (MRM). However, the performance of the RF model is close to that of the M5 and ANFIS models.

Conclusions
Knowing the magnitude of historical floods in a particular area is crucial for designing hydraulic structures. Small and medium watersheds often lack flow gauging stations due to the costs involved in building and maintaining them. Nevertheless, hydraulic structures need to be built on rivers in these areas in order to develop civil and agricultural activities. Therefore, the design flood discharge must be determined. This study used machine learning models to estimate the peak flow of ungauged watersheds.
The following conclusions on model performance were drawn from this study: (i) the tree structure of multi-linear models generated by RF and M5 is comprehensible and straightforward for decision-makers to grasp, and it provides a clear overview of the relationships between the physiographic characteristics of the watershed; (ii) the RF and M5 models make it simple to create a family of explainable models with a varied number of component models and thus varied complexity and accuracy; (iii) RF and M5 are the fastest of the data-driven models (data processing with RF and M5 is faster than with ANFIS); (iv) the information encapsulated in the RF and M5 algorithms can potentially assist in variable selection and in evaluating the relationships between variables when processing data with other models. For instance, M5 can help users determine the sensitivity of the data.

Regression expressions of the best M5 model
The linear regression expressions of the best M5 model are given here.

Authors contributions
The authors declare that they have contributed to the preparation of this manuscript.

Funding
The authors did not receive support from any organization for the submitted work.

Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.

Ethical approval
The manuscript is an original work with its own merit, has not been previously published in whole or in part, and is not being considered for publication elsewhere.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.