Flood Susceptibility Mapping Using Machine Learning Algorithms: A Case Study in Huong Khe District, Ha Tinh Province, Vietnam

A flood is a natural catastrophe that causes heavy damage not only to people but also to property. To prevent and mitigate flood damage, an accurate flood susceptibility map that reveals highly flood-prone areas is essential. This study aims to construct flood susceptibility maps for the Huong Khe district using three machine learning algorithms, namely the K-Nearest Neighbour (KNN), the Support Vector Machine (SVM), and the Artificial Neural Network (ANN). Training and testing datasets were extracted from Sentinel-1 SAR images. Seven causative factors were selected as input for the predictive models after removing highly correlated and unimportant factors through a rigorous screening process based on the Pearson correlation coefficient (PCC) and the information gain ratio (InGR). The models' hyperparameters were found by the grid search algorithm integrated with 5-fold cross-validation. The three optimal flood susceptibility models showed excellent performance, with very high accuracy indices in both the training and testing phases: overall accuracy and AUC values above 90%. High and very high susceptibility classes on the flood susceptibility maps accounted for around 18% of the total study area and were mainly located in residential and agricultural areas. Thus, proper land use planning is needed for these areas to reduce damage in flood seasons.

International Journal of Geoinformatics, Vol. 19, No. 7, July, 2023 ISSN: 1686-6576 (Printed) | ISSN 2673-0014 (Online) | © Geoinformatics International

Various methods have been applied in flood susceptibility assessment (FSA), including hydrological and hydraulic models [3] [4] and [5], the Analytic Hierarchy Process (AHP) [6] [7] and [8], the integration of Fuzzy Logic and the Analytic Hierarchy Process (F-AHP) [9] [10] and [11], the frequency ratio (FR) [12] [13] and [14], the weight of evidence (WoE) [12] [14] and [15], and logistic regression (LR) [16] and [17]. Physically-based models, such as MIKE FLOOD and HEC-RAS, have proven their efficacy in developing flood susceptibility models. However, these models require complex data that are difficult to obtain for large areas, such as river cross-sections and long-duration meteorological and hydrological data [18] and [19]. To develop a flood susceptibility map using an expert-based approach (such as AHP or F-AHP), experts evaluate the contribution of various influencing factors to the flood susceptibility index; however, these subjective judgments lead to predictive errors [20]. Flood susceptibility models constructed by statistical approaches, including FR, WoE, and LR, are rated as comprehensible and reliable, but they require high-quality data on historical floods and influencing factors [21]. Furthermore, they do not examine the correlation between influencing factors, so redundant data may exist in the influencing-factor database.
Machine learning algorithms (MLAs) have brought high accuracy to natural disaster prediction. Compared to expert-based and statistical approaches, MLAs have produced better results [22] [23] and [24]. This is because highly correlated and low-contribution factors are typically removed from the input data of MLAs through a feature selection process. Besides, the performance of machine learning models depends not only on the input data and learning algorithms but also on the hyperparameters. Previous studies on flood susceptibility assessment using MLAs relied on the "trial and error" method to find the best hyperparameters [25] and [26]. This method is unsystematic and needs to be improved by a better search algorithm.
In summary, the main objective of this study was to use MLAs to construct FSMs in Huong Khe district, the most frequently flooded district of Ha Tinh province, Vietnam. To achieve this objective, six sub-objectives were implemented: (1) building a historical flood database using Sentinel-1 SAR satellite images, (2) generating an influencing-factor database from original geospatial data, (3) eliminating highly correlated and low-contribution influencing factors through a feature selection process, (4) tuning hyperparameters by Grid Search, (5) developing FSMs using the selected MLAs, and (6) assessing the accuracy and comparing the performance of the developed FSMs.

Methodology
The overall methodology of the study is represented in Figure 1. It includes developing a historical flood database, preparing causative factors, constructing optimal flood susceptibility models, assessing the models' performance, and mapping flood susceptibility.

Study Area
Huong Khe is a border district of Ha Tinh province. It shares an approximately 50 km borderline with the Lao People's Democratic Republic to the west and borders Tuyen Hoa district (Quang Binh province) to the south. To the north and east it is bordered by Vu Quang, Thach Ha, and Can Loc districts of Ha Tinh province. This mountainous district ranges in elevation from 3 m to 1,440 m. It is covered by forest land, agricultural land, specially-used land, and residential land, with corresponding percentages of 78.58%, 14.22%, 2.65%, and 0.75% [27]. The three latter land-use types are distributed in lowlands and are often seriously flooded.
Every year, this district receives high rainfall. According to statistical data from 2010 to 2020 at the Huong Khe meteorological station, the average annual rainfall was 2,560 mm. Among the months of the rainy season, September and October were the wettest, with 641 mm and 586 mm, respectively. The corresponding figures for the three remaining months of the rainy season (July, August, and November) were 268 mm, 223 mm, and 247 mm, respectively. Because of the high rainfall, this district has frequently suffered natural disasters, such as floods and landslides. The location of the study area is represented in Figure 2.

Historical Flood Database
In Ha Tinh province, historical flood data has not been recorded as geospatial data, such as points or polygons. In practice, the flood situation is stored in reports with simple descriptive information about where a flood occurred, how far it extended, and how deep it was. Therefore, collecting historical flood data from flood control departments for training and testing flood susceptibility models is not feasible. To solve this difficulty, researchers have used synthetic aperture radar (SAR) images to extract geo-information about past floods. In the period 2006-2011, ALOS PALSAR provided L-band SAR images that were useful for flood mapping [28] [29] and [30]. Recently, free Sentinel-1 SAR images have been effectively utilized in flood detection and monitoring [31] [32] and [33]. Flood information extracted from SAR images has been used as training and testing data for flood susceptibility mapping models [34] [35] and [36]. Hence, SAR satellite images are a crucial data source for developing flood susceptibility models.

To prepare training and testing data for the flood susceptibility models, three Sentinel-1 SAR images were used to extract historical flood data. These images included two acquired in May and October 2020 and one acquired in September 2019, and they were preprocessed following the flowchart proposed by Filipponi [37]. Table 1 displays the detailed information of the Sentinel-1 images. Among the three images, the image taken in May 2020, before the rainy and storm season, was used as the pre-flood image, while the remaining images, collected at flood times, were used as flood-event images. To detect and delineate floodwater, the image ratioing method was applied: ratio images were produced by calculating the ratio of the digital number values of corresponding pixels in the pre-flood and flood-event images. Floodwater was then delineated using Otsu's thresholding method.
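The thresholding step can be sketched in pure Python. The following `otsu_threshold` helper and the synthetic ratio values are illustrative only, not the paper's actual processing chain (which operates on full Sentinel-1 rasters):

```python
def otsu_threshold(values, bins=256):
    """Return the threshold maximizing between-class variance (Otsu's method)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0
    hist = [0] * bins
    for v in values:
        hist[min(int((v - lo) / width), bins - 1)] += 1
    total = len(values)
    # Sum of all values, approximated by bin centers.
    sum_all = sum((lo + (i + 0.5) * width) * h for i, h in enumerate(hist))
    best_t, best_var, w_b, sum_b = lo, -1.0, 0, 0.0
    for i in range(bins):
        w_b += hist[i]                      # background (below-threshold) count
        if w_b == 0 or w_b == total:
            continue
        sum_b += (lo + (i + 0.5) * width) * hist[i]
        w_f = total - w_b                   # foreground (above-threshold) count
        var_between = w_b * w_f * (sum_b / w_b - (sum_all - sum_b) / w_f) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, lo + (i + 1) * width
    return best_t

# Illustrative change-ratio values: unchanged land near 1, flooded pixels near 5.
ratio = [1.0 + 0.01 * (i % 5) for i in range(50)] + \
        [5.0 + 0.01 * (i % 5) for i in range(50)]
threshold = otsu_threshold(ratio)
flooded = [v > threshold for v in ratio]
```

In practice the same separation would be applied per pixel to the ratio raster, labeling pixels above the threshold as floodwater.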
A total of 250 flood points were randomly created from the areas inundated over the two years 2019 and 2020, and 250 non-flood points were randomly generated from non-flood sites. Finally, the entire dataset of 500 points was divided into two subsets, training and testing, in a ratio of 2 to 1. The location of the flooded points is represented in Figure 3.
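The 2:1 split can be reproduced with a simple shuffle. The helper below is a sketch; the seed and the point encoding are assumptions, not details from the paper:

```python
import random

def split_samples(points, train_frac=2 / 3, seed=42):
    """Shuffle labeled sample points and split them into training/testing subsets."""
    pts = list(points)
    random.Random(seed).shuffle(pts)   # fixed seed for a reproducible split
    cut = round(len(pts) * train_frac)
    return pts[:cut], pts[cut:]

# 250 flood + 250 non-flood sample points (labels and ids are illustrative).
samples = [("flood", i) for i in range(250)] + \
          [("non-flood", i) for i in range(250)]
train, test = split_samples(samples)
```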

Causative Factor Selection

Correlation Analysis
Studies on disaster susceptibility assessment typically use Pearson's correlation coefficient (PCC) to analyze the correlation between causative factors [38] and [39]. Assuming that X and Y are two causative factors, the Pearson's correlation coefficient (r) is computed by the following equation:

r = Σᵢ(Xᵢ − X̄)(Yᵢ − Ȳ) / √[Σᵢ(Xᵢ − X̄)² · Σᵢ(Yᵢ − Ȳ)²]

Equation 2

Where: Xᵢ and Yᵢ are samples of the causative factors X and Y; X̄ and Ȳ are the means of X and Y; and n is the number of samples. The absolute value of r ranges from 0 to 1. Two factors with an absolute value of r greater than 0.6 are considered strongly correlated [40]. In that case, one of the two factors is eliminated.
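The coefficient and the 0.6 screening rule translate directly into code. The sample values below are invented for illustration:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two causative-factor samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Two factors with |r| > 0.6 are strongly correlated; one would be dropped.
strongly_correlated = abs(pearson_r([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])) > 0.6
```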

Information Gain Ratio
Information Gain Ratio (InGR) is utilized to measure the information contribution of causative factors to prediction models. It has been widely applied in disaster forecasting models [41] and [42]. It is computed as:

InGR(T, F) = [Infor(T) − Infor(T, F)] / SplitInfor(T, F)

Where: Infor(T) is the information entropy of the dataset T; Infor(T, F) is the amount of information in the subsets (T1, T2, …, Tm) split from T regarding the causative factor F; and SplitInfor(T, F) is the potential information generated by dividing the training data T into m subsets.
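As a minimal sketch of this definition for a categorical factor (the labels and factor values are invented, and continuous factors would first need to be discretized):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Infor(T): Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(labels, factor_values):
    """InGR(T, F) = [Infor(T) - Infor(T, F)] / SplitInfor(T, F)."""
    n = len(labels)
    subsets = {}
    for lab, val in zip(labels, factor_values):
        subsets.setdefault(val, []).append(lab)
    # Infor(T, F): weighted entropy of the subsets split by the factor.
    infor_tf = sum(len(s) / n * entropy(s) for s in subsets.values())
    # SplitInfor(T, F): entropy of the split itself.
    split_infor = entropy(factor_values)
    return (entropy(labels) - infor_tf) / split_infor

# A factor that perfectly separates flood (1) from non-flood (0) has InGR = 1.
perfect = gain_ratio([1, 1, 0, 0], ["low", "low", "high", "high"])
useless = gain_ratio([1, 0, 1, 0], ["low", "low", "high", "high"])
```

Factors whose InGR is near zero (like `useless` above) contribute no information and would be removed.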

Machine Learning Algorithms

Support Vector Machine
The Support Vector Machine (SVM) was first introduced in 1992 by Boser et al., [43]. It is a robust supervised learning algorithm that can be applied to both classification and regression, and it has been effectively used for natural disaster prediction, including landslides [44] and [45] and floods [46] and [47]. The algorithm aims to find an optimal hyperplane that correctly separates the data points, with the hyperplane considered optimal when it has the widest margin. For the SVM algorithm, the kernel function, penalty (C), and gamma (γ) are hyperparameters that directly affect its performance. The penalty parameter (C) controls the trade-off between achieving a low training error and maintaining a wide margin; it determines the level of misclassification that the SVM classifier is willing to tolerate. Gamma (γ) is a parameter of the kernel function that defines the influence of each training example and affects the shape of the decision boundary. In this study, the default kernel, the Radial Basis Function (RBF), was used because it produced higher accuracy than the other kernels.
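Gamma's effect on the influence of each training example can be seen directly in the RBF kernel formula, K(x, z) = exp(−γ‖x − z‖²). A minimal sketch (the sample vectors are illustrative):

```python
from math import exp

def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel: similarity decays with squared distance, scaled by gamma."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return exp(-gamma * sq_dist)

# A nearby point is far more similar than a distant one...
near = rbf_kernel([0.0, 0.0], [0.5, 0.5])
far = rbf_kernel([0.0, 0.0], [2.0, 2.0])

# ...and a larger gamma shrinks each training example's region of influence.
tight = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=5.0)
loose = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.1)
```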

K-Nearest Neighbour (KNN)
Considered the most straightforward algorithm in the machine learning field [48], KNN calculates the distance from a candidate point to its K neighboring points, where K is an integer and should be an odd number [49]. The distance can be computed as the Euclidean, Manhattan, or Minkowski distance. The candidate point is then assigned to the class with the maximum number of neighbors. KNN is highly recommended for real-time applications because of its speed [50]. However, it does not work well with large datasets or high-dimensional data.
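A toy sketch of the majority-vote rule with Euclidean distance (the coordinates and labels are invented):

```python
from collections import Counter
from math import dist  # Euclidean distance between two points (Python 3.8+)

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    neighbours = sorted(train, key=lambda p: dist(p[0], query))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

# Training points as (feature-vector, label) pairs.
train = [((0.0, 0.0), "non-flood"), ((0.0, 1.0), "non-flood"),
         ((1.0, 0.0), "non-flood"), ((5.0, 5.0), "flood"),
         ((5.0, 6.0), "flood")]
label = knn_predict(train, (0.5, 0.5))
```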

Artificial Neural Network (ANN)
The artificial neural network (ANN) is a powerful model for predicting disaster susceptibility. It can capture the complex relationships between causative factors and natural disasters [51]. An ANN simulates the behavior of the human brain to process input information [52]. This research used the multilayer perceptron (MLP) model, which has proven effective for both landslide and flood susceptibility mapping. The MLP model contains three types of layers: one input layer, one or more hidden layers, and one output layer. The input layer prepares the data for the model and consists of nodes for the input variables. The hidden layers process the data, and the number of hidden layers needed depends on the complexity of the problem being solved. Finally, the output layer consists of nodes that represent the output results.
The ANN has several hyperparameters, including the activation function, solver, learning rate, and learning rate initialization. The activation function is a crucial component of the MLP, as it introduces non-linearity into the network; it is applied to the output of each neuron, enabling the modeling of complex relationships between inputs and outputs. The solver determines the algorithm used to optimize the weights and biases of the MLP during training, controlling how the network adjusts its parameters to minimize the error between predicted and actual outputs. The learning rate governs the step size taken during each iteration of the optimization process, influencing the extent to which the weights and biases are adjusted based on the calculated gradient. The learning rate initialization parameter establishes the initial value of the learning rate.

The three selected algorithms have showcased impressive abilities in modeling landslide and flood susceptibility in different regions around the world. In this study, we evaluate their abilities in developing flood susceptibility models specifically for Huong Khe district, a mountainous region of central Vietnam.
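The layer structure described above can be illustrated with a single forward pass through a one-hidden-layer MLP. The weights below are arbitrary placeholders, not trained values from the study:

```python
from math import exp

def dense(inputs, weights, biases):
    """One fully connected layer: out[j] = sum_i inputs[i] * weights[j][i] + biases[j]."""
    return [sum(i * w for i, w in zip(inputs, row)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    """A common non-linear activation applied to hidden-layer outputs."""
    return [max(0.0, v) for v in values]

def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """Input layer -> one ReLU hidden layer -> sigmoid output (flood probability)."""
    hidden = relu(dense(x, w_hidden, b_hidden))
    logit = dense(hidden, w_out, b_out)[0]
    return 1.0 / (1.0 + exp(-logit))

# Three causative-factor values in, one flood probability out.
p = mlp_forward([0.7, 0.2, 0.5],
                w_hidden=[[0.4, -0.3, 0.8], [0.1, 0.9, -0.2]],
                b_hidden=[0.0, 0.1],
                w_out=[[1.2, -0.7]],
                b_out=[0.05])
```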

Tuning Hyperparameters
The performance of machine learning algorithms mainly depends on data quality and hyperparameter configuration. The optimal hyperparameters are typically found through a hyperparameter tuning process, which can be done manually or automatically. In the first method, different hyperparameters are experimented with through a "trial and error" approach [53] and [54]. In the second method, the optimal set of hyperparameters is found by algorithms such as Grid Search [55] and [56], Bayesian Optimization [57], the Genetic Algorithm [58], or Whale Optimization [59]. Grid Search is the simplest of the automatic searching methods but typically gives reliable results.
It generates combinations of hyperparameter values and calculates the model's performance for each combination. The results are tracked, and the optimal combination is returned at the end of the search. This study used the Grid Search algorithm with 5-fold cross-validation to find the best hyperparameters and optimize the models' accuracy.
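With scikit-learn, which exposes the SVM hyperparameters named above, a grid search with 5-fold cross-validation can be sketched as follows. The synthetic dataset and grid values are placeholders, not the study's data or search space:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for flood/non-flood samples with seven causative factors.
X, y = make_classification(n_samples=300, n_features=7, random_state=0)

# Candidate values for the penalty (C) and kernel coefficient (gamma).
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}

# Evaluate every combination with 5-fold cross-validated accuracy.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best_params = search.best_params_  # optimal combination found by the search
best_score = search.best_score_    # its mean 5-fold cross-validation accuracy
```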

Accuracy Assessment
There are many metrics for assessing the performance of machine learning models. This study used primary statistical metrics computed from the confusion matrix:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Overall Accuracy (OA) = (TP + TN) / (TP + TN + FP + FN)

Equation 6

Where: TP is short for True Positive, TN for True Negative, FP for False Positive, and FN for False Negative.

Table 2 shows the default and tuned values of the hyperparameters for the three chosen machine learning models. The results reveal that the tuned hyperparameters slightly improved the models' performance: the overall accuracy (OA) of KNN, SVM, and ANN increased from 0.91642, 0.91045, and 0.91940 to 0.92239, 0.91642, and 0.92239, respectively.
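As a worked example, the confusion-matrix metrics can be computed directly from the TP, TN, FP, and FN counts (the counts below are invented for illustration):

```python
def confusion_metrics(tp, tn, fp, fn):
    """Precision, recall, and overall accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    overall_accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, overall_accuracy

# Invented counts: 90 floods detected, 10 false alarms, 15 missed floods.
precision, recall, oa = confusion_metrics(tp=90, tn=85, fp=10, fn=15)
```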

Flood Susceptibility Mapping
The flood susceptibility maps for the Huong Khe district were generated from the three optimal flood susceptibility models (FSMs). Each pixel of these raster maps is assigned a digital number representing the flood probability, ranging from 0 to 1. The susceptibility level was then classified into 5 categories: very low, low, moderate, high, and very high, with corresponding value ranges of 0-0.2, 0.2-0.4, 0.4-0.6, 0.6-0.8, and 0.8-1. These maps are shown in Figure 8. As shown in Figure 8, high and very high susceptibility areas are mainly distributed in low-lying agricultural and residential areas. On the other hand, low susceptibility areas are located in higher areas covered by forest.

Figure 8: Flood susceptibility maps developed by the three models
Figure 9: Percentage of susceptibility classes by two models
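The five-class reclassification of the probability raster can be expressed as a small helper; the pixel values below are illustrative:

```python
def susceptibility_class(probability):
    """Map a flood probability in [0, 1] to one of five susceptibility classes."""
    for upper, label in [(0.2, "very low"), (0.4, "low"),
                         (0.6, "moderate"), (0.8, "high")]:
        if probability < upper:
            return label
    return "very high"

# Reclassify a few illustrative pixel values.
classes = [susceptibility_class(p) for p in (0.05, 0.35, 0.55, 0.75, 0.95)]
```

In a GIS workflow, the same rule would be applied to every pixel of the probability raster to produce the final categorical map.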

Discussion
The InGR values of the causative factors indicate their importance to the flood susceptibility model. The study found that ELE and LU were the most critical factors, followed by NDVI. These findings suggest that topography, land use, and vegetation cover significantly influence flood susceptibility. On the other hand, factors such as SPI and ASP did not contribute valuable information and were therefore removed from the models.

Strong correlations were observed between certain factors, such as DITRI-DITRO and TWI-SLO, indicating that some factors provide redundant or overlapping information. To optimize the input dataset, when two factors were strongly correlated, the factor with the lower InGR value was eliminated. This process helped refine the selection of factors used in developing the flood susceptibility models.

The hyperparameter tuning process showed that the tuned hyperparameters slightly improved the models' performance in terms of overall accuracy, highlighting the importance of optimizing model parameters to achieve better results. The accuracy assessment indicated high overall accuracy for all three models. In the training phase, the ANN model performed best, followed by the KNN model, while the SVM model had the lowest accuracy. However, in the testing phase, the SVM model outperformed the other models. These findings suggest that different models may exhibit varying performance in different phases, highlighting the importance of evaluating models on independent datasets.

The flood susceptibility maps generated from the three optimal FSMs provided valuable insights into the areas at risk of flooding. The distribution of high and very high susceptibility areas was predominantly observed in low-lying agricultural and residential areas, while low susceptibility areas were mostly located in higher areas covered by forests.
These findings demonstrate the ability of the models to effectively identify areas prone to flooding and provide useful information for flood risk management and mitigation.

Conclusion
This study aimed to develop flood susceptibility models using machine learning techniques and assess their accuracy. The results demonstrated that elevation, land use, and vegetation cover were significant factors influencing flood susceptibility. Through the selection and elimination of causative factors based on their importance and correlation, an optimal input dataset was constructed. Hyperparameter tuning was performed to enhance the models' performance, leading to slight improvements in overall accuracy. The accuracy assessment revealed that all three models achieved high overall accuracy, with variations observed between the training and testing phases. The SVM model performed the best in the testing phase, indicating its potential for accurate flood susceptibility prediction. The generated flood susceptibility maps provided valuable information for identifying areas at high risk of flooding. The maps highlighted the importance of factors such as topography and land use in determining flood-prone areas. These findings contribute to the understanding of flood vulnerability in the study area and can assist in implementing effective flood risk management strategies.