A New Modeling Approach for Spatial Prediction of Flash Flood with Biogeography Optimized CHAID Tree Ensemble and Remote Sensing Data

Abstract: Flash floods induced by torrential rainfalls are considered one of the most dangerous natural hazards, due to their sudden occurrence and high magnitudes, which may cause huge damage to people and properties. This study proposed a novel modeling approach for spatial prediction of flash floods based on the tree intelligence-based CHAID (Chi-square Automatic Interaction Detector) random subspace, optimized by biogeography-based optimization (the CHAID-RS-BBO model), using remote sensing and geospatial data. In this proposed approach, a forest of tree intelligence was constructed through the random subspace ensemble, and then swarm intelligence was employed to train and optimize the model. The Luc Yen district, located in the northwest mountainous area of Vietnam, was selected as a case study. For this purpose, a flood inventory map with 1866 polygons for the district was prepared based on Sentinel-1 synthetic aperture radar (SAR) imagery and field surveys with handheld GPS. Then, a geospatial database with ten influencing variables (land use/land cover, soil type, lithology, river density, rainfall, topographic wetness index, elevation, slope, curvature, and aspect) was prepared. Using the inventory map and the ten explanatory variables, the CHAID-RS-BBO model was trained and verified. Various statistical metrics were used to assess the prediction capability of the proposed model. The results show that the proposed CHAID-RS-BBO model yielded the [...] We conclude that the proposed method can accurately predict the spatial distribution of flash floods in tropical storm areas.


Introduction
Flooding is a phenomenon in which the water level in one place is above the permitted level, which is determined by the current frequency index. Researchers and planners point out that flooding is considered a significant disaster where the flow of water can flow from any sources and can be sudden or deliberate [1]. Flash floods are the most dangerous natural occurrences among various types of floods because of their rapid occurrences in a short period of time, and they pose more risks than other floods [2]. Climate change and rapid population growth are among the main drivers of flooding [3]. Additionally, according to the Intergovernmental Panel on Climate Change (IPCC) assessment, heavy rains are forecasted to have more impact on future floods [4]. Deaths and economic damage, destruction of agricultural crops, damage to environmental ecosystems, and the spread of contagious diseases along the water route are direct effects of the floods, which can cause irreparable damage [5][6][7][8]. Considering the historical events of the floods in the period 1998-2018, about 3136 flood catastrophes worldwide have occurred, and their consequences have affected more than approximately two billion people and caused about 556 billion US$ in economic losses [9]. Indeed, the devastating consequences of flash floods on human lives have been spotted around the world [10,11]. There are a wide range of reasons, such as changes in the urbanization process, which cause vegetation cover changes and rapidly increasing population growth, which is accompanied by land use changes, resulting in an increase of the runoff coefficient [12]. Therefore, human settlements and vital infrastructure are vulnerable to flooding, and it is likely impossible to prevent this natural disaster completely. Thus, an effective spatial prediction of such events may reduce injuries and losses [13]. 
However, spatial prediction of flash flooding remains challenging due to the complex environmental factors involved [14,15]. Therefore, accurate modeling and mapping of flood risks play an important role in risk management planning and preventive measures [16]. Due to the destructive effects of flash floods on the environment and their social consequences, many studies so far have attempted flood risk modeling and zoning [17][18][19], because identifying areas vulnerable to flooding will be one of the most effective measures to reduce flood damage and flood management [20]. However, risk modeling and flood sensitivity mapping across large areas still remain challenging, because flash floods occur largely in each region under different climate conditions, which are unpredictable [21].
The literature review shows that precise predictive models are often required for preparing flood risk maps, which support decision making to minimize and monitor these events. A vast number of studies on flood risk assessment use hydrological and hydrodynamic models. For instance, Giustarini, et al. [22] attempted to map the flood risks by using the temporal correlation model combined with hydraulic variables and time in the Severn River floodplain in the UK, while Li, et al. [23] used the Urban Flood Simulation Model (UFSM) and the Urban Flood Damage Assessment Model (UFDAM) in Shanghai, China for flood simulation. Recently, Komi, et al. [24] employed the distributed and calibrated hydrological method in the River Basin in West Africa with an application of rainfall intensity analysis and frequency intensity distribution relationships in flood risk modeling. The SCS-CN (Soil Conservation Service Curve Number) method has also been applied together with hydrograph theory in the Volos metropolitan area, Greece [16]. However, due to the lack of hydrological data, the limitations of the forecast, and the lack of hydrometric stations to record runoff and discharge, these methods cannot be used as a basic and optimal method for risk assessment at all locations.
In recent years, multi-criteria decision-making models have also been used for mapping flood risk using six influencing factors, including rainfall, slope, elevation, river density, land use, and soil types in Sukhothai Province, Thailand [25]. Wang, et al. [26] attempted to develop a new hybrid technique using an integration of multi-criteria decision analysis, the analytic network process, and Weighted Linear Composition (WLC) in Shanghai City, China. Although multi-criteria decision-making methods can be a potential approach for improving the prediction performance in environmental hazard assessment, these techniques still have critical limitations, because the weight value of each factor differs between regions. Importantly, several influencing factors such as land use/land cover (LULC) are often obtainable from earth observation data that consist of optical and synthetic aperture radar (SAR) data. Optical remote sensing datasets, which can be acquired at a certain time throughout the year, are largely affected by the cloud coverage that commonly occurs in tropical regions [27]. On the other hand, SAR remotely sensed data can be acquired under all weather conditions and have become an essential source for mapping LULC [28]. Among various SAR sensors, Sentinel-1 C band SAR, provided by the European Space Agency (ESA) with dual polarization (VH, VV), can be acquired free-of-charge with a very high temporal resolution of 6 days, which makes it possible to provide systematic continuity data for mapping LULC [29,30].
Recently, various statistical machine learning techniques have been developed, including Frequency Ratio Index (FR) for flood risk mapping in the Markam Basin of Papua New Guinea [31], and flood sensitivity modeling in part of the Middle Ganga Plain in the Ganga Land Basin [32]. A number of studies have investigated the ability and the effectiveness of machine learning approaches combined with various optimization techniques for forecasting flash flood risk, such as a combined artificial neural network (FA-LM-ANN) model in the Bac Ha Region located in Northwest Vietnam [33] and flood prediction using a self-organized neural network (SOM) technique at Kemaman River in Malaya Peninsula [34]. Various other attempts have been made to predict flood risk in the current literature. Shafapour Tehrany, Kumar and Shabani [5] employed a Support Vector Machine (SVM) model for predicting flood risk in the Brisbane River basin, Australia, whereas Jahangir, et al. [35] integrated a multilayer perceptron neural network (MLPNN) model with GIS for spatial flood analysis in Tehran Province, Iran. One of the biggest challenges of predicting the risk of flooding is the lack of data in different regions. As a result, specific models cannot be used directly in different environments. In this context, novel machine learning techniques can help researchers tackle these systematic issues and improve the predictive accuracy of flood models.
Thus, this study aims to fill these gaps in the literature by developing a novel modeling framework for spatial prediction of flash floods, in which a CHAID tree ensemble is built with the random subspace (RS) method and optimized by biogeography-based optimization (the CHAID-RS-BBO model). The RS ensemble is a powerful framework that has proven efficient in various spatial domains, e.g., landslide [36], flood [37], and image classification [38], whereas the CHAID decision tree is capable of providing good classification accuracy [39][40][41]. As for the BBO, this algorithm provides an efficient solution for searching and optimizing model parameters [42][43][44]. The proposed method can overcome the shortcomings of recent studies on flash flood risk mapping and will provide insights for further development of techniques for monitoring flash floods in stormy tropical regions.

Study Area
Luc Yen is a mountainous district of the Yen Bai province in the northwest region of Vietnam (Figure 1). It covers approximately 810.10 km², occupying 1.2% of the total area of the Yen Bai province. It is located between latitudes of 21°55′30″N and 22°03′30″N, and between longitudes of 104°30′06″E and 104°53′30″E. In terms of morphometry, the study area has a complex terrain consisting of mountain ranges, hills, mounts, cliffs, small valleys and plains along the Chay river, connecting directly to Thac Ba reservoir. The topography is divided into high mountainous and low-flat elevation areas. The mountainous areas have very steep slopes and sharp peaks with elevation ranging from 100 m to 1399 m, while lower elevation areas are small valleys and plains distributed along the Chay river with elevations varying from 43 m to 100 m. In addition, the study area has complex and dense small streams and springs originating from two mountain ranges (Nui Voi and Large Rock mountain) before discharging into the Chay river in a northwest to southeast direction. Because of its complex terrain and drainage network, the study area is highly vulnerable to flash floods, which take place when rapid runoff from steep slopes discharges quickly into small streams within a short time before reaching the Chay river [45].
In the study area, the geology consists of six formations and outcrop complexes with an uneven distribution. Three formations account for >85% of the total study area: Song Chay (38.8%), Song Hong complex (32.6%), and Nui Chua (15.6%). The climate is typically subtropical monsoonal, with two seasons: a rainy season (April to September) and a dry season (October to March). The average yearly total rainfall ranges between 1739.3 mm and 2437.8 mm [45] and is mainly distributed in the rainy season, which accounts for 67.74-83.34% of the total annual rainfall. It is worth noting that high rainfall intensity events often occur in a short period; coupled with steep slopes and recent deforestation, they might cause frequent occurrences of flash floods and landslides in the study area.

Data
This research employed the on-off modeling approach [46] for the flash flood study, in which flash floods in the future will happen under the same conditions that caused them in the past; therefore, historical flash floods must be collected. In this work, an inventory map with a total of 1866 flash flood polygons for the district was derived from the flash flood inventory map of the state-funded Project No-03/HD-KHCN-NTM of Vietnam [47]. These flash floods, which occurred during the last five years (2015-2019), were derived through change detection techniques using multi-temporal Sentinel-1 synthetic aperture radar (SAR) imagery [33]; field surveys with handheld GPS were then carried out to check and confirm the result. The largest polygon size of these flash floods is 64,064.3 m², whereas the smallest polygon size is 912.3 m², and the average polygon size is 6037 m².
Because flash flood occurrences are influenced by various factors and their complex interactions, researchers have different views on which factors to use. However, factors are commonly selected relating to topography, climate, soil, and human activities [48,49]. Since there are no specific rules and criteria for selecting effective flood factors in different regions, we selected ten influencing factors as the input explanatory variables for flash flood modeling in this study, based on the suggestions of various prior studies in the literature and the opinions of experts. These factors are land use/land cover (LULC), soil type, lithology, river density, rainfall, elevation, topographic wetness index (TWI), slope, curvature, and aspect (Table 1).

Flash flooding begins with precipitation but depends on other factors, such as breadth, topography, and types of LULC during rainfall in the catchment [59]. Land-use type, especially vegetation density, has a significant impact on preventing or reducing flooding: the denser the vegetation, the more effectively it prevents severe flooding [51]. Additionally, different LULC types have different infiltration capacities and runoff coefficients, which significantly influence the time of concentration in a watershed [52,53]. Therefore, the characteristics of LULC are one of the main factors in flash flood prediction. The LULC map was derived from free-of-charge Sentinel-1 C band SAR data downloaded from the Copernicus open access hub of the European Space Agency (ESA), using the random forest (RF) classification algorithm available in the Sentinel Application Platform (SNAP) toolbox. A total of eight types of land cover were obtained and visualized using the ArcGIS software in the study area, including bare land, crop areas, forest areas, grassland, orchard area, paddy rice, urban and built-up, and water bodies (Figure 2a).
Although mountainous areas in the northern, northwest, and southern parts of the study area have different types of forest vegetation, which may contribute to reducing flash floods, the transitional areas from mountains to small valleys and plains consist of bare-land and grassland areas, which have a high potential for flash floods during or after high-rainfall-intensity events.

Soil Type
In terms of hydrology, soil types have a strong influence on the infiltration and erosion processes occurring in a watershed. This is because each soil type has different properties, which may reduce or increase runoff flow and/or erosion magnitude, and therefore have a direct relation to flash floods. For example, if the soil type is more capable of absorbing water, it can reduce runoff flow and time of water flow concentration into streams or rivers [60]. The soil layer of the study area was prepared by digitizing the soil texture map 1:50,000 scale. There are eleven soil types in the study area, in which YCMR soil occupies more than 80% of the total area, followed by WS and RM soils (Figure 2b).

Lithology
Flash flood flow often consists of different flow components, including surface flow, base flow, and groundwater flow. While soil types have a strong influence on surface flow, the type of rocks has a significant effect on base flow and ground flow system. Each type of rock has a specific permeability and density; these have different effects on infiltration and storage capacity and can influence the generation of water flow system in a watershed. For example, the resistant or impermeable rocks have less water absorption capacity, which may increase the base flow and runoff flow. Therefore, the type of rock in the region has a significant impact on flash flood risk modeling. The lithology map (Figure 2c) was obtained from the Luc Yen District Geological and Mineral Resources Map, with a scale of 1:50,000 [33]. The lithology was characterized by different types of rocks, including sedimentary, igneous, and metamorphic. The metamorphic rocks are dominant in the study area, accounting for 48%, followed by igneous and sedimentary (alluvium and recent deposits) [54]. Characteristics of lithologies in the study area were presented in previous studies [61][62][63][64][65][66] and are summarized in Table 2.

River Density
Rivers are one of the most important factors used in flood sensitivity mapping, due to their significant impact on flood occurrence [67]. The higher the density of the water network in an area, the greater the impact on flood flow expansion [55]. In this research, river density (Figure 2d) was extracted from the above Digital Elevation Model (DEM) and river network system.

Rainfall
One of the essential characteristics of a flash flood event is that it occurs quickly after high rainfall intensity within a short period of time (i.e., several hours) in steep mountainous areas with sparse vegetation coverage [56]. Therefore, rainfall is considered an essential factor in flood prediction, and rainfall rate was chosen for flood risk assessment in this study. The higher the rainfall in an area, the greater the likelihood of a flood. In this research, the highest 16-day rainfall during the last 3 years at 30 stations in and around the study area was used to generate the rainfall pattern map using the Inverse Distance Weighting technique [68]. The rainfall map (Figure 3a), with 142 mm in the northern areas and 620 mm in the central and southeastern areas, was interpolated from the regional rain gauge stations in ArcGIS software.
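The inverse distance weighting step can be illustrated with a minimal sketch. The gauge coordinates and rainfall values below are hypothetical, and the power parameter of 2 is an assumption (the paper does not state the setting used in ArcGIS).

```python
import math

def idw_interpolate(stations, x, y, power=2.0):
    """Estimate a value at (x, y) from (sx, sy, value) station tuples."""
    weighted_sum = 0.0
    weight_total = 0.0
    for sx, sy, value in stations:
        distance = math.hypot(x - sx, y - sy)
        if distance == 0.0:
            return value  # query point coincides with a gauge
        weight = 1.0 / distance ** power  # closer gauges weigh more
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total

# Hypothetical gauges: (easting_km, northing_km, rainfall_mm)
gauges = [(0.0, 0.0, 142.0), (10.0, 0.0, 620.0), (0.0, 10.0, 300.0)]
estimate = idw_interpolate(gauges, 5.0, 0.0)
```

Because the weights are positive and normalized, every interpolated value stays between the smallest and largest gauge readings.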

Elevation
Elevation and its effects play an essential role in flooding, and the lower the altitude, the greater the probability of flooding in that area [56,58]. Surface water flow often moves from high elevations towards low elevations, and therefore the low and flat area has a naturally high probability of flood occurrence [58]. The elevation map of the study area (Figure 3b

TWI
One of the parameters related to water flow is the topographic wetness index (TWI), which was derived from the elevation map of the study area using the following relationship [69]:

$$\mathrm{TWI} = \ln\left(\frac{A_s}{\tan\beta}\right)$$

where A_s denotes the upslope area, and β is the slope angle at the pixel. The topographic wetness index is used to measure topographic control in hydrological studies [70]. TWI is a type of topographic property that shows the spatial distribution of moisture and cumulative water flow in response to the guiding force of water to lower areas [71]. In this area, TWI (Figure 3c) ranges from 142.8 to 662.1, in which the high values (>300) show the greatest density of torrential areas (30.25% of the class surface).
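As a small illustration of the TWI definition, the following sketch evaluates the index for a single pixel. The function name and the minimum-slope guard (used to avoid division by zero on flat cells) are assumptions for illustration, not part of the paper's workflow.

```python
import math

def topographic_wetness_index(upslope_area, slope_deg, min_slope_deg=0.1):
    """TWI = ln(A_s / tan(beta)) for one pixel.

    upslope_area: upslope contributing area A_s (assumed per unit contour width)
    slope_deg:    slope angle beta in degrees
    """
    # Clamp very flat cells so tan(beta) never becomes zero (assumed guard).
    slope_rad = math.radians(max(slope_deg, min_slope_deg))
    return math.log(upslope_area / math.tan(slope_rad))

steep = topographic_wetness_index(100.0, 45.0)
gentle = topographic_wetness_index(100.0, 5.0)
```

For the same upslope area, the gentler slope yields the larger index, which matches the interpretation that flat, convergent locations accumulate more water.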

Slope
Slope, as one of the environmental parameters, has a direct impact on surface water flow processes through its influence on flow direction, velocity, and especially the time of water flow concentration at outfall [72]. High slopes often create faster movement and high velocity of runoff flow, as well as speeding up water flow in streams and rivers relative to lower slopes. Hence, runoff flow forming from steep slopes will cause an increase in water accumulation in low slope areas [58]. The slope layer shows a wide variation, ranging from 0 to 83.3 degrees in the study area (Figure 3d). In this area, a high slope angle in the mountainous areas has a strong effect on flash flood generation, while low slope in small valleys and plains affects the flash-flood propagation and duration.

Aspect
The slope aspect is one of the parameters influencing the hydrological conditions of the earth, which can affect local climate, physiographic approaches, soil moisture content and vegetation growth.
The aspect map consists of nine classes [55].

Curvature
Curvature presents the characteristic of morphometry and is obtained by intersecting a horizontal plane with the surface based on the Digital Elevation Model (30 m × 30 m). Curvature index has three states: concave (positive), convex (negative), and flat (zero), which can affect runoff processes [73]. The curvature map was prepared using altitude information on the study area. In this study area, approximately 70% of the research territory is covered by curvature values (Figure 3f). It was noted that most of the historical flash floods occurred in this area, being torrential.

Chi-Square Automatic Interaction Detection (CHAID)
The CHAID model is a classification tree technique used in many linear regressions [74]. The CHAID tree process is the division of large branches into smaller branches arranged in descending order from top to bottom, and the grouping continues based on specific factors [75]. The classification method of the CHAID algorithm was proposed by Kass [76]. This technique, as a new approach in the literature, has titles such as automatic interaction detection, classification and regression tree, artificial neural network, and genetic algorithm that can predict the required analysis [41]. The CHAID algorithm uses chi-square statistics as the separation criterion and performs the Dodge separation [77]. Thus, the classification continues as long as there is an acceptable value of chi-square between the dependent variable and the conditioning factors: the nodes with the highest chi-square value form the first-order split of the tree, and the nodes with the lowest chi-square value have the lowest degree. For this reason, the CHAID method chooses a statistical approach (Pearson's chi-square statistic) that is desirable in terms of data type and the nature of the target [78]:

$$\chi^2 = \sum_{j=1}^{J} \sum_{i=1}^{I} \frac{\left(n_{ij} - m_{ij}\right)^2}{m_{ij}}$$

where n_ij is the observed cell frequency, m_ij is the expected cell frequency for (x_n = i, y_n = j), and the p value is given by p = Pr(χ²_d > χ²), with χ²_d following a chi-square distribution with d = (J − 1)(I − 1) degrees of freedom [79].
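The chi-square criterion can be sketched for a single candidate split as follows. The contingency counts are hypothetical, and the expected frequencies are computed under the usual independence assumption (row total times column total divided by the grand total).

```python
def pearson_chi_square(table):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected frequency m_ij if predictor and target were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

# Hypothetical counts: rows = slope class (low, high), columns = (non-flood, flood)
counts = [[30, 10], [10, 30]]
stat, dof = pearson_chi_square(counts)
```

A CHAID-style procedure would compute this statistic for each candidate predictor and prefer the one with the smallest p value; the p value itself requires the chi-square distribution function, which is omitted here to keep the sketch dependency-free.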

Random Subspace Ensemble (RSE)
The Random Subspace Ensemble algorithm was first developed by Ho [80]. RSE is an ensemble learning method in which a number of classifiers are trained and combined [81]. Like the bagging algorithm, it relies on random selection to create training subsets; however, it randomly selects subsets of the input features rather than subsets of the samples. The resulting trees therefore differ from one another because each is learned on a different feature subspace. The RSE algorithm has been reported to be more robust than the Bagging and AdaBoost algorithms.
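A minimal sketch of the random subspace idea is given below. To stay short and self-contained it uses a nearest-centroid classifier as the base learner, which is a stand-in assumption and not the CHAID tree used in this study; the data and function names are hypothetical.

```python
import random

def train_centroid(X, y, features):
    """Base learner (assumed stand-in): per-class mean of the selected features."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(r[f] for r in rows) / len(rows) for f in features]
    return features, centroids

def predict_centroid(model, x):
    features, centroids = model
    def sq_dist(c):
        return sum((x[f] - c[k]) ** 2 for k, f in enumerate(features))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def train_random_subspace(X, y, n_learners, n_features, seed=0):
    """Train each base learner on a random subset of the feature indices."""
    rng = random.Random(seed)
    all_features = range(len(X[0]))
    return [train_centroid(X, y, rng.sample(all_features, n_features))
            for _ in range(n_learners)]

def predict_ensemble(models, x):
    """Combine the base learners by majority vote."""
    votes = [predict_centroid(m, x) for m in models]
    return max(set(votes), key=votes.count)

# Hypothetical rescaled factor values; 0 = non-flood, 1 = flood
X = [[0.1, 0.2, 0.9, 0.1], [0.2, 0.1, 0.8, 0.3],
     [0.9, 0.8, 0.2, 0.7], [0.8, 0.9, 0.1, 0.9]]
y = [0, 0, 1, 1]
ensemble = train_random_subspace(X, y, n_learners=5, n_features=2)
```

Here `n_learners` and `n_features` play the roles that m-tree and m-factor play for the CHAID-RS model.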

Biogeography-Based Optimization (BBO)
BBO is an evolutionary population-based search technique developed by Dan Simon [82], and was first applied to the multilayer perceptron neural network by [83]. The basic concepts of this algorithm are based on biogeography, including species migration, species emergence, and extinction [84]. The BBO algorithm starts by creating habitats; then the migration and mutation steps are performed [85].
According to the BBO algorithm, the purpose of migration is to upgrade or correct the quality of existing solutions [86]. The migration rate (λ_s) is then defined to modify the suitability index variable. Due to some conditions that threaten the geographical location of the site, the habitat deviates from its optimal habitat suitability index, which is called the mutation process and is expressed as follows [87].
where P_s, λ_s, and µ_s are the probability, the habitat migration, and the mutation, respectively; S_max represents the maximum species count.
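For orientation, the following sketch shows the linear migration model commonly used in the BBO literature, in which a habitat with s species has an immigration rate that falls and an emigration rate that rises as s approaches S_max. This particular form, and the reading of λ_s and µ_s as immigration and emigration rates, follow Simon's general formulation and are assumptions here; the exact expression used in this study is not reproduced in the text.

```python
def migration_rates(species_count, s_max, max_immigration=1.0, max_emigration=1.0):
    """Linear migration model (assumed): returns (immigration, emigration)."""
    # Crowded habitats accept few newcomers ...
    immigration = max_immigration * (1.0 - species_count / s_max)
    # ... and export many of their features to other habitats.
    emigration = max_emigration * species_count / s_max
    return immigration, emigration

empty_habitat = migration_rates(0, 10)
half_full = migration_rates(5, 10)
saturated = migration_rates(10, 10)
```

The default maxima of 1.0 mirror the maximum immigration and emigration values reported for the BBO settings in this study.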

Proposed CHAID-RS-BBO Model for Flash Flood Susceptibility Modeling
The overall flowchart with the CHAID-RS-BBO model in this research is shown in Figure 4.

Flash-Flood Database Establishment, Coding and Checking
In this step, the flash flood database for the Luc Yen, which consists of 1866 polygons, was constructed using Sentinel-1 SAR images and field investigations with handheld GPS and the ten selected influencing factors. The database associated with the geodatabase model in the ESRI ArcCatalog function was employed due to the ability to optimize its performance [88].
Because the CHAID model cannot read and understand the flash-flood-influencing factors directly, a coding process is required to convert all values in the factor maps into the range 0-1. In our research, values in six continuous factors (river density, rainfall, topographic wetness index, elevation, slope, curvature) were rescaled into the above range, whereas the other categorical factors (LULC, soil type, lithology, and aspect) were coded using the method described in [58].
Subsequently, a total of 1866 points representing flash flood locations were divided into two datasets: 70% of the locations were randomly selected and used as the training set, and the remaining 30% of locations were used as the testing set to validate the model accuracy, as suggested in [56,89-91]. Finally, a sampling process was performed to generate values of the ten influencing factors.
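The rescaling and splitting steps can be sketched as follows. Min-max scaling is assumed for the continuous factors (the text states only that values were converted into the range 0-1), and the random seed and function names are hypothetical.

```python
import random

def min_max_rescale(values):
    """Rescale continuous factor values into the range 0-1 (assumed min-max)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant factor carries no information
    return [(v - lo) / (hi - lo) for v in values]

def train_test_split(samples, train_fraction=0.7, seed=42):
    """Shuffle the samples and split them into training and testing sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Slope values in degrees spanning the range reported for the study area
scaled = min_max_rescale([0.0, 20.0, 83.3])
# Indices of the 1866 flash flood locations
train_ids, test_ids = train_test_split(range(1866))
```

With 1866 locations, a 70/30 split gives 1306 training and 560 testing locations.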

Establishing the CHAID-RS and the Cost Function
To generate the CHAID Decision Tree Ensemble using the Random Subspace framework (CHAID-RS), we determine three important parameters that are required to optimize: (1) number of CHAID trees used in the ensemble (m-tree); (2) number of the influencing factors used for the CHAID trees (m-factor); and (3) the minimum number of samples per leaf in the CHAID trees (m-leaf). The other parameters of the CHAID-RS model are used as the default values [92]. The three parameters were searched and optimized using the BBO algorithm.
Before optimizing these three parameters, it is necessary to design a cost function for the model. In this research, the cost function (CoF) (Equation (5)) proposed in [54] was adopted:
where FLPR_i is the predicted output of the flash flood model, FLIV_i is the flood inventory value, and n is the total number of samples used.

Optimizing the CHAID-RS Using the BBO Algorithm
To search and optimize the three parameters, m-tree, m-factor, and m-leaf, for the CHAID-RS model, a three-dimensional searching space was established: m-tree = [1-500]; m-factor = [2-10]; and m-leaf = [2-20]. The three parameters were then transferred into a BBO matrix for optimizing. The other parameters of the BBO are as follows: the population size was 50; the maximum immigration and emigration values were 1.0; the mutation and crossover values were 0.25 and 0.95, respectively; and the total number of iterations used was 1000 [42]. Each individual of the population has three characteristics, which are the three parameters of the CHAID-RS model. The CoF was used to measure the suitability of the habitat. Herein, the smaller the CoF value, the better the habitat. Finally, the combination with the lowest CoF value was determined, and the best m-tree, m-factor, and m-leaf were derived. The best model was called the CHAID-RS-BBO model.
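The search over the three-dimensional space can be sketched with a simplified BBO loop. This is an illustrative sketch under stated assumptions rather than the implementation used in the study: `toy_cost` is a hypothetical stand-in for the CoF of a trained CHAID-RS model, the migration rates are rank-based and linear, two elite habitats are kept unchanged per generation, and the iteration count is reduced from 1000 to 100 to keep the example fast.

```python
import random

BOUNDS = [(1, 500), (2, 10), (2, 20)]  # m-tree, m-factor, m-leaf

def toy_cost(habitat):
    """Hypothetical stand-in for the CoF of a trained CHAID-RS model."""
    target = (120, 6, 5)  # arbitrary optimum, for illustration only
    return sum(((h - t) / (hi - lo)) ** 2
               for h, t, (lo, hi) in zip(habitat, target, BOUNDS))

def bbo_search(cost, bounds, pop_size=50, iterations=100,
               mutation_prob=0.25, n_elite=2, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(lo, hi) for lo, hi in bounds]
                  for _ in range(pop_size)]
    for _ in range(iterations):
        population.sort(key=cost)  # best (lowest cost) habitat first
        # Rank-based linear rates: good habitats emigrate more, immigrate less.
        emigration = [(pop_size - rank) / pop_size for rank in range(pop_size)]
        immigration = [1.0 - mu for mu in emigration]
        new_population = [list(h) for h in population]
        for i in range(n_elite, pop_size):  # elite habitats stay unchanged
            for d, (lo, hi) in enumerate(bounds):
                if rng.random() < immigration[i]:
                    # Roulette-wheel choice of the emigrating habitat.
                    source = rng.choices(range(pop_size), weights=emigration)[0]
                    new_population[i][d] = population[source][d]
                if rng.random() < mutation_prob:
                    new_population[i][d] = rng.randint(lo, hi)  # random reset
        population = new_population
    return min(population, key=cost)

best = bbo_search(toy_cost, BOUNDS)
```

In an actual run, `toy_cost` would be replaced by a function that trains the CHAID-RS model with the candidate m-tree, m-factor, and m-leaf values and returns its CoF on the training data.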

Final CHAID-RS-BBO Model and Flash Flood Susceptibility
Once the CHAID-RS-BBO model was obtained, the performance of the model on both the training dataset and the validation dataset was checked. In this research, positive predictive value (PPV), and negative predictive value (NPV), sensitivity, specificity, accuracy, kappa, ROC curve and area under the curve (AUC) were used. Since explanations of these metrics for measuring the quality of spatial models are common in the literature, e.g., [93][94][95], we do not repeat these explanations here. In the final step, the CHAID-RS-BBO model was used to estimate the flash flood susceptibility index for each pixel of the Luc Yen district and generate the flash flood susceptibility map.
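For reference, the confusion-matrix-based metrics listed above can be computed as in the following sketch. The counts are hypothetical and chosen only to make the arithmetic easy to follow; the formulas are the standard definitions.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Standard classification metrics from the four confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Agreement expected by chance, from the marginal totals (Cohen's kappa).
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    return {
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": accuracy,
        "kappa": (accuracy - expected) / (1.0 - expected),
    }

# Hypothetical counts for illustration
metrics = confusion_metrics(tp=90, tn=80, fp=20, fn=10)
```

With these counts the accuracy is 0.85 and kappa is 0.70, since the chance agreement is 0.5.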

Correlation of the Predictors of Flash Floods
The results of the Pearson's correlation analysis among the ten influencing factors (LULC, soil type, lithology, river density, rainfall, topographic wetness index, elevation, slope, curvature, and aspect) are presented in Figure 5. As can be seen from Figure 5, the highest positive correlation value (0.65) was observed between the LULC and the slope factors, whereas the largest negative correlation value of −0.57 was observed between the TWI and the slope factors in the study area. The absolute values of these correlations are below 0.7, which is the threshold value for the collinearity problem [96]. Therefore, it is concluded that there is no collinearity problem among the considered influencing factors.
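The collinearity screening can be sketched as a pairwise Pearson check against the 0.7 threshold. The factor values below are hypothetical and far smaller than a real sample; the helper names are assumptions for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def collinear_pairs(factors, threshold=0.7):
    """List factor pairs whose absolute correlation reaches the threshold."""
    names = list(factors)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if abs(pearson_r(factors[a], factors[b])) >= threshold]

# Hypothetical sampled values for three factors
samples = {
    "slope": [1, 2, 3, 4, 5],
    "twi": [5, 4, 3, 2, 1],
    "elevation": [2, 1, 3, 1, 2],
}
flagged = collinear_pairs(samples)
```

In this toy sample only the slope-TWI pair is flagged; an empty list would correspond to the situation reported for the ten factors of this study.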

Training the Flash Flood Models
The training set accounts for 70% of the total dataset; the results of the training phase for the machine learning models are shown in Table 3 and Figure 6. The CHAID-RS-BBO, CHAID, J48DT, logistic regression, and MLP-NN models all achieved very good overall accuracy on the training dataset. The AUC values ranged from 0.871 to 0.979 (CHAID-RS-BBO = 0.979, CHAID = 0.949, J48DT = 0.955, logistic regression = 0.871, MLP-NN = 0.942). The models also showed high predictive performance in terms of accuracy and the kappa coefficient: the accuracies of the five ML models ranged from 81.99 to 93.34, whereas the kappa values ranged from 0.634 to 0.867.

Table 3. Performance of the flash flood models in the training phase.

| Metric | CHAID-RS-BBO | CHAID | J48DT | Logistic regression | MLP-NN |
|---|---|---|---|---|---|
| True positive | 867 | 832 | 893 | 835 | 868 |
| True negative | 828 | 823 | 786 | 654 | 774 |
| False positive | 41 | 76 | 15 | 73 | 40 |
| False negative | 80 | 85 | 122 | 254 | |

In contrast to the ensemble-based model, the logistic regression model produced the lowest performance (AUC = 0.871, accuracy = 81.99, kappa = 0.634). Figure 6 shows the predictive performance of the models in the training phase using the AUC indicator. The graph also clearly shows that the proposed model produced the best predictive performance for flash flood susceptibility on the training dataset.

Validating the Flash Flood Models
The results of the testing phase, which used 30% of the total dataset, are shown in Table 4 and Figure 7. As can be observed from Table 4, the proposed ensemble-based model yielded the highest prediction performance, with AUC = 0.960, accuracy = 91.00, and kappa = 0.820, followed by the MLP-NN, CHAID, and J48DT models. Conversely, the logistic regression model had the lowest performance in terms of AUC, accuracy, and kappa (AUC = 0.880, accuracy = 81.36, kappa = 0.627). Overall, the results show that the ensemble-based model achieved high accuracy and satisfactory predictive performance for flash flood occurrence, as can be clearly seen in Figure 7.

Table 4. Performance of the flash flood models in the validation phase.
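As a reminder of what the AUC values compared above measure, the sketch below computes the AUC as the probability that a randomly chosen flood sample receives a higher score than a randomly chosen non-flood sample, with ties counted as one half. This pairwise form is equivalent to the area under the ROC curve; the function name `auc` and the example scores are illustrative and are not study data.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0   # positive ranked above negative
            elif p == q:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Illustrative scores: one flood sample is ranked below one non-flood sample.
print(auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1]))
```

The double loop is quadratic in the number of samples, which is acceptable for a sketch; a rank-based formulation would be preferable for large validation sets.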

Flash Flood Susceptibility Maps
Since the CHAID-RS-BBO model had the highest predictive performance in both the training and the testing phases and outperformed the benchmark models, we employed it to map flash flood susceptibility in the study area. Accordingly, the model was used to calculate the flash flood susceptibility index for every pixel of the study area. The predicted susceptibility values were converted into a raster format and presented in the ArcGIS environment. Figure 8 illustrates the spatial prediction of flash floods in the study area, with susceptibility values ranging from 0.022 to 0.9101. As can be seen from Figure 8, the highest flash flood susceptibility index was observed in the steep mountainous highland areas, where flash floods often occur during the storm season associated with tropical typhoons. In contrast, the lowest values were found in the lowland areas close to rivers and streams.

Discussion
This study proposed a novel framework based on Sentinel-1 SAR images and field investigations combined with a new ensemble-based model for spatial prediction of flash flood hazards. Ten flash flood predictors were selected based on a review of the literature and on their relationships with flash floods in the study area. As suggested in previous work [54,97], correlations among these predictors should be checked before proceeding to the modeling process. In this work, Pearson correlation analysis confirmed that these predictors are valid for modeling, since all correlation values are less than 0.7. Consequently, the high performance of the CHAID-RS-BBO model indicates that these predictors were selected, processed, and coded successfully.
The final flood model is a hybrid of three components: CHAID, RS, and BBO. CHAID serves as the tree-structured classifier, whereas the RS, with its feature sub-spacing framework, helps to reduce the error rate of the flood model by generating various sub-datasets for the forest of CHAID classifiers. Additionally, the BBO was integrated to optimize the three parameters (m-tree, m-factor, m-leaf) of the hybrid model. In our work, the merit of the BBO is that, with a population of 50 run for 1000 iterations, a total of 50,000 candidate combinations of m-tree, m-factor, and m-leaf for the CHAID-RS model were evaluated and compared in order to select the best combination. The high prediction capability of the CHAID-RS-BBO model indicates that the three parameters were searched globally and optimized.
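The role of the random subspace component can be sketched in a few lines: each base learner is trained on a random subset of m-factor features, and the share of learners voting "flood" gives a susceptibility index between 0 and 1. This is only an illustration under stated assumptions: `CentroidLearner` is a deliberately simple stand-in and not CHAID, the m-leaf parameter therefore has no counterpart here, and all names are hypothetical.

```python
import random


class CentroidLearner:
    """Stand-in base learner (NOT CHAID): nearest class mean on a feature subset."""

    def __init__(self, features):
        self.features = features  # indices of the features this learner sees

    def fit(self, X, y):
        self.means = {}
        for label in (0, 1):
            rows = [x for x, t in zip(X, y) if t == label]
            self.means[label] = [
                sum(r[f] for r in rows) / len(rows) for f in self.features
            ]
        return self

    def predict(self, x):
        def sq_dist(label):
            return sum(
                (x[f] - m) ** 2 for f, m in zip(self.features, self.means[label])
            )
        return 1 if sq_dist(1) < sq_dist(0) else 0


def fit_random_subspace(X, y, m_tree, m_factor, seed=0):
    """Train m_tree base learners, each on m_factor randomly chosen features."""
    rng = random.Random(seed)
    n_features = len(X[0])
    return [
        CentroidLearner(sorted(rng.sample(range(n_features), m_factor))).fit(X, y)
        for _ in range(m_tree)
    ]


def susceptibility(ensemble, x):
    """Fraction of base learners that vote 'flood' (label 1) for sample x."""
    return sum(learner.predict(x) for learner in ensemble) / len(ensemble)
```

In the hybrid described above, an optimizer such as BBO would choose `m_tree` and `m_factor` (and, for real CHAID trees, the leaf-size parameter) by minimizing a cost function evaluated on the trained ensemble.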
The validity of the hybrid CHAID-RS-BBO model for flash flood modeling was confirmed through comparison with four benchmark machine learning algorithms (CHAID, J48DT, logistic regression, and MLP-NN). The proposed model was the most accurate in predicting the flash flood events and outperformed the benchmarks, indicating that CHAID-RS-BBO is promising for flash flood studies.

Concluding Remarks
This research presents a novel approach for flash flood modeling that combines a new machine learning hybrid with geospatial data and available remote sensing data. Based on the findings, the following conclusions can be drawn. As the source of the flash flood inventories and six predictors, the remote sensing data (Sentinel-1 SAR, Sentinel-2, and ALOS-PALSAR DEM) are important sources for flash flood modeling. Given its high performance, CHAID-RS-BBO can be considered a new tool for flash flood modeling. The susceptibility map, which reveals the flash flood hotspots in Luc Yen, might help the local government and decision-makers to minimize flash flood impacts and to select sites for collecting flash flood water for living needs and development projects. The current study recommends conducting precise and up-to-date meteorology, morphometry, hydrology, geology, topography, and socioeconomic studies. Early warning systems (EWS) should be developed to predict flash floods and thereby minimize losses and reduce damage. Last, but not least, a national plan for flash flood disaster management and risk reduction should be put in place.

Conflicts of Interest:
The authors declare no conflict of interest.