Leveraging machine learning for predicting flash flood damage in the Southeast US

Flash flooding is a recurrent natural hazard with substantial impacts in the Southeast US (SEUS) due to the frequent torrential rainfall that occurs in the region, which is triggered by tropical storms, thunderstorms, and hurricanes. Flash floods are costly natural hazards, primarily due to their rapid onset. Therefore, predicting the property damage of flash floods is imperative for proactive disaster management. Here, we present a systematic framework that considers a variety of features explaining different components of risk (i.e. hazard, vulnerability, and exposure), and examine multiple machine learning methods to predict flash flood damage. A large database of more than 14 000 flash flood events is assessed for training and testing the methodology, while a multitude of data sources are utilized to acquire reliable information related to each event. A variable selection approach was employed to alleviate the complexity of the dataset and facilitate the model development process. The random forest (RF) method was then used to map the identified input covariates to a target variable (i.e. property damage). The RF model was implemented in two modes: first, as a binary classifier to estimate whether a region of interest was damaged in any particular flood event, and then as a regression model to predict the amount of property damage associated with each event. The results indicate that the proposed approach is successful not only for classifying damaging events (with an accuracy of 81%), but also for predicting flash flood damage in good agreement with the observed property damage. This study is among the few efforts to predict flash flood damage across a large domain using mesoscale input variables, and the findings demonstrate the effectiveness of the proposed methodology.


Introduction
The Southeast US (SEUS) is known to be susceptible to flash flooding due to the frequent high-intensity rainfall triggered by tropical storms, thunderstorms, and hurricanes (Orville and Huffines 2001, Czajkowski et al 2011, Smith and Smith 2015). During the last two decades, widespread flash flood events have caused significant economic damage in this region. Recent studies have shown that the frequency of flash flooding is increasing in the SEUS (Alipour et al 2020). Therefore, predicting the property damage of flash floods is crucial for attaining proactive disaster management in this region.
Generally, risk refers to the potential losses from a particular hazard (Cardona et al). The conventional approaches for modeling flood risk mostly depend on the flood water depth to estimate the associated damage (Aerts et al 2014, Velasco et al 2014). Several recent studies have shown that considering multivariate data improves the damage estimates (Wagenaar et al 2017). Therefore, over the past few years, several studies have evaluated flood risk in various regions of the globe (de Moel et al 2015, Arnell and Gosling 2016, van Berchum et al 2018) using a multitude of variables representing hazard, vulnerability, and exposure.
Recent advances in machine learning (ML) techniques have led to significant improvements in flood risk assessment (Wang et al 2015, Lai et al 2016). Artificial neural network (ANN), decision tree, logistic regression, random forest (RF), regression tree, and support vector machine are among the most widely used ML models for flood risk assessment. Table 1S (available online at stacks.iop.org/ERL/15/024011/mmedia) lists all of the factors used in these studies. Although some of these works addressed flood damage prediction, few attempted to predict the potential property damage of flash flooding events. In addition, the majority of damage prediction studies have been conducted at small-scale regional domains, explicitly applicable only to the region of interest (Scheuer et al 2011, Garrote et al 2016).
Therefore, in this study, we propose a risk-based and physically informed model for near real-time estimation of the potential property damage of flash flood events across the SEUS. Several influential factors including geographic, socioeconomic, and climatic features are utilized as input to the ML model in order to predict the property damage of each flash flood event. This study also presents a unique model input structure/topology by which the ML model produces improved results and which could serve as a universal approach to predict potential property damage in any region of interest. The model was trained and tested on a large database consisting of more than 14 000 flash flood events during 1996-2017. The overarching research objective is to develop a risk-informed mesoscale flash flood damage prediction model across the SEUS that assists decision makers and insurance companies dealing with flood risk assessment.

Study area and data
In this study, several data sources have been utilized to acquire information on flash flood events as well as the physical and geographical characteristics of the SEUS during 1996-2017. Each dataset and its characteristics are thoroughly explained in the following sections.

Study area
The study area encompasses nine southeastern US states (referred to as the SEUS in this study): Alabama, Arkansas, Florida, Georgia, Louisiana, Mississippi, North Carolina, South Carolina, and Tennessee. The climate of this region varies with latitude, topography, and proximity to the Atlantic Ocean and the Gulf of Mexico (Ingram et al 2013). The high-pressure system known as the Bermuda High commonly draws moisture from the Atlantic Ocean and the Gulf of Mexico, causing warm and humid summers in the SEUS along with frequent thunderstorms (Zhu and Liang 2013). Based on the 2017 US census estimate, over 61 million people reside in the 755 counties of the SEUS. A large number of flash flood events have impacted the SEUS in the past couple of decades, imposing billions of dollars in damage on SEUS residents.

NOAA storm events database
The National Oceanic and Atmospheric Administration (NOAA) Storm Events database is a comprehensive repository that provides information on different types of natural disasters, such as flash flooding, across the US from 1996 to present. This information includes the beginning and termination date and time, location, associated injuries and fatalities, amount of damage to properties and crops, and an event narrative (Ashley and Ashley 2008, Sharif et al 2015, Konisky et al 2016, Hamidi et al 2017, Shah et al 2017). In this study, we used the NOAA storm events database to obtain information for 14 317 flash flood events, including the onset time, duration, date, location, and property damage, during 1996-2017.

NLDAS-2 hourly precipitation
The precipitation data from Phase 2 of the North American Land Data Assimilation System (NLDAS-2) are available at 1/8th-degree spatial resolution (about 12 km) and hourly temporal resolution during the period of January 1979 to present (Xia et al 2012). The hourly NLDAS-2 precipitation data is generated from different in situ and remote sensing data sources (Yu et al 2017).
Since flash floods generally occur in small catchments, usually less than 1000 km² (Villarini et al 2010, Llasat et al 2016), the hourly NLDAS-2 precipitation data are upscaled to a 0.3° grid cell (using bilinear interpolation) so as to represent an approximate inundated area of 1000 km². Then, the location, start time, and duration of each flash flood event (acquired from the NOAA storm events database) are utilized to extract the mean and cumulative precipitation during each flash flood event from the NLDAS-2 data. The mean and cumulative precipitation represent the intensity and severity of flash flood events, respectively, both of which are important characteristics for identifying flash flood hazard.
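As a minimal illustration of this extraction step, the sketch below computes the mean and cumulative precipitation over an event window from an hourly series for a single grid cell; the function name and the toy series are illustrative and not part of the NLDAS-2 tooling.

```python
import numpy as np

def event_precip_stats(hourly_precip, start_hour, duration_hours):
    """Mean and cumulative precipitation (mm) over an event window.

    hourly_precip : 1-D array of hourly precipitation for the grid cell
                    covering the flood location (assumed pre-extracted
                    from the upscaled 0.3-degree field).
    """
    window = hourly_precip[start_hour:start_hour + duration_hours]
    cumulative = float(np.sum(window))   # severity proxy
    mean = cumulative / duration_hours   # intensity proxy (mm/h)
    return mean, cumulative

# Toy example: a 6-hour event within a 24-hour record.
precip = np.array([0, 0, 2, 5, 10, 8, 4, 1] + [0] * 16, dtype=float)
mean_p, cum_p = event_precip_stats(precip, start_hour=2, duration_hours=6)
# cumulative = 2 + 5 + 10 + 8 + 4 + 1 = 30 mm, mean = 5 mm/h
```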
GTOPO30 topography data
GTOPO30 has a 1 km spatial resolution and has been used in many studies for the estimation of topographical indices. In this study, we used GTOPO30 to derive several topographic factors including altitude, slope, flow accumulation, and topographic roughness index (TRI) at different spatial resolutions (i.e. 1, 3, and 30 km) corresponding to each flash flood event.

Zillow database
Zillow real estate provides an index known as the Zillow Home Value Index (ZHVI), which is the median home value in a specific geographic region and housing type, available from 1996 to present. Several studies have utilized this product for risk analysis (Watson et al 2016, Morckel 2017, Miller 2018). In this study, we used the ZHVI to evaluate the median home value for all homes in each county during 1996-2017. The median home value is an indicator of flash flood exposure. The Zillow dataset does not include information for all counties in all years, so we used machine learning to predict the missing values, as explained in more detail in section 3.1.

US census bureau database
The US Census Bureau provides accurate data about the nation's people and economy, as well as geographic information including boundary maps of states, counties, places, and census tracts, collected through questionnaires every 10 years. Here, we extracted the population of SEUS counties from this dataset to analyze flash flood exposure. We also derived the area of each county from the SEUS counties shapefile to estimate the population density as another indicator of flash flood exposure.

The centers for disease control and prevention's social vulnerability index (SVI)
The Centers for Disease Control and Prevention's (CDC) SVI is based on 15 social factors, including unemployment, minority status, and disability. These factors are divided into four themes, namely socioeconomic status, household composition and disability, minority status and language, and housing and transportation (Cimellaro et al 2016). The SVI values are available for the years 2000, 2010, 2014, and 2016. Since these data are not available for all years during 1996-2017, for simplicity, accuracy, and consistency we used the 2016 SVI at the county level to evaluate social vulnerability to flash flood events.

Methodology
This study proposes a risk-based model for flash flood damage prediction over the SEUS, a valuable tool for decision makers and insurance companies. The framework of the proposed approach is presented in figure 1.

Filling the gaps in the Zillow dataset
One of the variables used in this study is the median home value, which represents flash flood exposure. We utilized the Zillow dataset to extract this information for each flash flood event during 1996-2017 over the SEUS. Unfortunately, the median home value is not available for all counties and years in the study period. To cope with this shortcoming, we utilized an ANN to predict the missing median home values. ANN models are suitable for modeling a wide variety of nonlinear problems by learning the relationships between a set of inputs and the output without requiring prior assumptions or knowledge of the underlying physics of the process (ASCE Task Committee 2000a, 2000b, Asadi et al 2013, Mitra et al 2016). There are several versions of ANNs that could be adopted to estimate the missing home values. Our appraisal analysis suggests that a simple multilayer perceptron (ANN-MLP) structure suffices to properly estimate the missing home values in this study. A typical ANN-MLP structure consists of three layers: an input layer, a hidden layer containing neurons, and an output layer. ANN models with large input dimensions that include irrelevant variables behave poorly (Bowden et al 2005a, 2005b, Wu et al 2014). In this study, our input variables only include the centroid latitude and longitude of each county, the year, and the corresponding county population, while the output is the median home value for that specific year and county. The number of neurons in the hidden layer was selected by trial and error. Out of 16 610 samples (755 counties in the SEUS during the 1996-2017 period: 755 × 22), 9853 cases were available in the Zillow dataset and the remaining median home values (16 610 − 9853 = 6757 cases) were missing.
There are several methods for splitting the data into subsets for training, validating, and testing the model (Bowden et al 2002, Wu et al 2013). Here, we randomly separated the 9853 samples into three groups: training (70% of the data), to train and calibrate the ANN model; validation (15% of the data), to validate the trained model and avoid model overfitting; and testing (15% of the data), to verify the performance of the trained model. It is important to note that random separation of the dataset supports the generalizability of the trained model. We normalized the input variables and trained the ANN-MLP model using the training dataset. Validation is an important part of model development (Humphrey et al 2017); here, we used the validation data for early stopping during model training. The trained model was verified using the testing dataset, as shown in figure 2. The results show a high correlation between the model output and the actual median home values reported by Zillow. Therefore, the trained model was used to estimate the missing median home values in the Zillow dataset.
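The gap-filling procedure above can be sketched with scikit-learn's MLPRegressor on a synthetic stand-in for the county records (the relation between inputs and home value below is purely illustrative, not the paper's data); `early_stopping` with `validation_fraction=0.15` mirrors the 70/15/15 split and early-stopping strategy described above.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the county records: lat, lon, year, population.
n = 2000
X = np.column_stack([
    rng.uniform(25, 37, n),                     # county centroid latitude
    rng.uniform(-95, -75, n),                   # county centroid longitude
    rng.integers(1996, 2018, n).astype(float),  # year
    rng.uniform(1e4, 1e6, n),                   # county population
])
# Illustrative smooth relation standing in for median home value.
y = 50_000 + 2_000 * X[:, 0] + 0.05 * X[:, 3] + 1_000 * (X[:, 2] - 1996)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.15, random_state=0)
x_scaler = StandardScaler().fit(X_tr)
y_mean, y_std = y_tr.mean(), y_tr.std()  # normalize the target as well

# Single hidden layer; early_stopping holds out 15% of the training
# data as a validation set, mirroring the 70/15/15 split.
mlp = MLPRegressor(hidden_layer_sizes=(20,), early_stopping=True,
                   validation_fraction=0.15, max_iter=2000, random_state=0)
mlp.fit(x_scaler.transform(X_tr), (y_tr - y_mean) / y_std)
r2 = mlp.score(x_scaler.transform(X_te), (y_te - y_mean) / y_std)
```

In practice, the trained network would then be applied to the (latitude, longitude, year, population) tuples of the counties with missing ZHVI records.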

Variable selection
Variable selection is a common procedure in model development in artificial intelligence. It helps remove redundant predictors that add noise to the main estimators, and it saves computation time. Additionally, it reduces the potential for model overfitting. Figure 3 illustrates the variable selection process; the final selected features are shown in red, yellow, blue, and gray, which respectively represent exposure, vulnerability, hazard, and spatiotemporal features.
We selected our variables in several steps, such that we were able to address one issue at a time. The geomorphologic features of the inundated area, namely altitude, slope, flow accumulation, and topographic roughness index, were extracted at different spatial resolutions (1, 3, and 30 km). The correlation between each resolution and the reported damage was estimated, and the resolution with the highest correlation was selected in this step. Note that we also used the Spearman correlation coefficient, which assesses the monotonic relationship between two variables (whether linear or not), and found that the results for the selected variables were the same as those obtained with the Pearson correlation. Afterwards, we used the variance inflation factor (VIF) approach to remove multicollinear variables. VIF is calculated as 1/(1 − R²), where R is the correlation computed for each pair of predictor variables. To further reduce the dimension of our input variables, we also performed a leave-one-out approach in which one input variable was removed and the prediction was repeated; the variable set yielding the most accurate predictions was selected as our final model input. Most of the selected variables (e.g. duration and median home value) represent the hazardousness of the flash flood events and the amount of exposed property. The household composition and disability index represents the percentages of people aged 65 or older, people aged 17 or younger, civilians with a disability, and single-parent households. We chose this factor because it showed a relatively high correlation with flash flood property damage. This group of people may reside in regions that are more prone to flooding, due to either a lack of awareness or limited financial means. The location of the flash flooding enables our model to predict damage over a large region, and the timing variables (i.e. month and onset time) serve as indicators of factors that are not included in our study (e.g. soil moisture).
We compared the models' performance (in both classification and regression scenarios) with and without the variable selection approach, and found that the proposed procedure, comprising three main components (correlation coefficient, VIF, and leave-one-out analysis), collectively ensures that our ML models are fed the most appropriate input variables and helps produce generalizable, non-overfitted models. Figure 4 shows the spatial variation of input features including vulnerability, population of each county in 2017, median home value in 2017, mean duration of flash floods during 1996-2017, long-term
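The VIF screening rule described above can be sketched as follows, computing 1/(1 − R²) for each pair of predictors (the pairwise form stated in the text; the function and variable names are illustrative):

```python
import numpy as np

def pairwise_vif(X):
    """Pairwise VIF = 1 / (1 - R^2), with R the Pearson correlation
    between each pair of predictor columns of X."""
    R = np.corrcoef(X, rowvar=False)
    with np.errstate(divide="ignore"):
        V = 1.0 / (1.0 - R ** 2)
    np.fill_diagonal(V, 1.0)  # self-pairs are not informative
    return V

rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + 0.05 * rng.normal(size=500)  # nearly collinear with x1
V = pairwise_vif(np.column_stack([x1, x2, x3]))
# V[0, 2] is large (x1 and x3 are collinear); V[0, 1] stays near 1.
```

Pairs whose VIF exceeds a chosen threshold would flag one of the two variables for removal.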

Random forest
The objective of this study is to build a model that can predict flash flood damage using the event characteristics as input variables. In this study, we used RF for the classification and prediction of flash flood property damage. RF, proposed by Breiman (2001), is an ensemble learning method that generates multiple decision trees using randomly selected subsets of samples drawn with replacement. This method is suitable for both regression and classification problems. Due to its randomized and decorrelated trees, RF is able to capture the connection between input and output variables even when their relationship is highly complex and nonlinear (He et al 2016, Hong et al 2016).
In this study, RF was used in two modes, classification and regression (see figure 5). For the classification problem, we transformed the damage values into a binary scoring system, such that zero represents events with no property damage and one refers to any damage value greater than zero. In the regression mode, RF is used to estimate the relationships between the predictors and the output variable (damage). To deal with the skewness of the data in the regression model, both input and output variables were transformed using Box-Cox and log transformations. We randomly split the dataset into two groups: 85% of the data for training and the remaining 15% for testing. In both the classification and regression models, using a trial-and-error approach, we found that 1000 trees yielded promising performance: using 1000 trees improves the model performance compared to smaller ensembles, while increasing the number of trees beyond 1000 results in only minor improvement and significantly adds to the computational cost. The model was also verified using the testing dataset.
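The two RF modes can be sketched with scikit-learn on synthetic event data (the features and the synthetic damage relation are purely illustrative; the paper's Box-Cox option is replaced here by a simple log transform):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins for event features and skewed property damage.
n = 1500
X = rng.normal(size=(n, 6))
damaging = X[:, 0] + X[:, 1] > 0.5
log_damage = 8 + X[:, 0] + X[:, 1] + 0.3 * rng.normal(size=n)
damage = np.where(damaging, np.exp(log_damage), 0.0)

# Mode 1: binary classification (any damage vs. no damage).
y_cls = (damage > 0).astype(int)
Xtr, Xte, ytr, yte = train_test_split(X, y_cls, test_size=0.15, random_state=0)
clf = RandomForestClassifier(n_estimators=1000, random_state=0).fit(Xtr, ytr)
acc = clf.score(Xte, yte)

# Mode 2: regression on damaging events only; the log transform
# tames the skewness of the damage values.
Xd, yd = X[damaging], np.log(damage[damaging])
Xtr, Xte, ytr, yte = train_test_split(Xd, yd, test_size=0.15, random_state=0)
reg = RandomForestRegressor(n_estimators=1000, random_state=0).fit(Xtr, ytr)
r2 = reg.score(Xte, yte)
```

The 85/15 split and the 1000-tree ensemble size follow the setup described in the text; predictions in the regression mode would be back-transformed with `np.exp` to damage dollars.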

Results and discussion
The results are discussed in two subsections below. Section 4.1 reports the performance of the proposed classification model for classifying flash flood events into damaging and non-damaging categories, and section 4.2 presents the results of the damage prediction model.

Damaging versus non-damaging classification
Here, sensitivity (true positive rate) and specificity (true negative rate) are utilized to assess the performance of the developed classifier model (Lin et al 2019). Sensitivity measures the proportion of positives that are correctly identified (i.e. events that actually caused property damage and were correctly classified as damaging), and specificity measures the proportion of negatives that are correctly identified (i.e. events that caused no property damage and were correctly classified as non-damaging). Both range from zero to one, with a value of one indicating perfect model accuracy. Sensitivity and specificity are calculated using the following equations:

Sensitivity = (No. of correctly predicted damaging events) / (Total no. of damaging events)   (1)

Specificity = (No. of correctly predicted non-damaging events) / (Total no. of non-damaging events)   (2)

Figure 6 shows the performance of the RF classifier model. The locations of correctly (blue) and incorrectly (red) classified events in the testing dataset, as well as the sensitivity and specificity of the model for each state, are shown in this figure. Figures 6(a) and (b) show the results for damaging and non-damaging events, respectively. The numbers of correct (true) and incorrect (false) classifications are also shown in each figure panel. The overall performance of the model is fairly high in classifying both damaging and non-damaging events. The sensitivity and specificity for the states of Alabama, Florida, and Louisiana are considerably high (greater than 0.75), which indicates the higher reliability of the classification model in these states. Sensitivity and specificity are inversely related, such that if sensitivity increases, specificity decreases, and vice versa (Parikh et al 2008). This is particularly apparent for Mississippi and North Carolina: although North Carolina has a low sensitivity, it has a high specificity (>0.9), while conversely, a high sensitivity and low specificity are observed for Mississippi. To better understand the classifier model's performance, the overall sensitivity and specificity of the model, as well as its accuracy, are presented in figure 7(a), where accuracy is calculated as:

Accuracy = (No. of correctly predicted damaging events + No. of correctly predicted non-damaging events) / (Total no. of events)   (3)

The model accuracy, sensitivity, and specificity indicate the reliability of the model in the classification of flash flood damage. Figure 7(b) shows the importance of each variable in the developed classifier model. The importance of each variable is calculated based on the increase in the prediction error when the values of that variable are permuted. As can be seen from the figure, the most important variables are the location of the event (latitude and longitude).
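Equations (1)-(3) can be computed directly from the binary labels; the sketch below uses illustrative toy labels (1 = damaging event, 0 = non-damaging):

```python
def classifier_scores(y_true, y_pred):
    """Sensitivity, specificity, and accuracy from binary labels,
    following equations (1)-(3)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(y_true)                 # total damaging events
    n_neg = len(y_true) - n_pos        # total non-damaging events
    sensitivity = tp / n_pos
    specificity = tn / n_neg
    accuracy = (tp + tn) / len(y_true)
    return sensitivity, specificity, accuracy

# Toy labels: 4 damaging and 6 non-damaging events.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
sens, spec, acc = classifier_scores(y_true, y_pred)
# sens = 3/4 = 0.75, spec = 4/6, acc = 7/10 = 0.7
```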
This implies that by considering the location of events along with other geographic, socioeconomic, and flood factors, we can extend our prediction to larger domains. The flow accumulation is the least important feature; however, the leave-one-out approach mentioned earlier in section 3.2 indicated that retaining this variable increases the accuracy.

Damage prediction model
RF is not only used as a classifier; it is also implemented to predict the amount of property damage from a particular flash flood event. The flash flood events that caused property damage were randomly divided into two parts: training (85% of the dataset) and testing (15% of the dataset). The results of the developed model are evaluated using two performance measures: the correlation coefficient (R) and bias, both of which are commonly used to measure the accuracy and performance of ML models.
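A sketch of the two evaluation measures follows; note that the text does not state its exact bias definition, so the mean difference between predicted and observed values is assumed here for illustration:

```python
import numpy as np

def r_and_bias(obs, pred):
    """Pearson correlation and bias between observed and predicted
    damage. The bias definition (mean of predicted minus observed)
    is an assumption, not taken from the paper."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    r = np.corrcoef(obs, pred)[0, 1]
    bias = float(np.mean(pred - obs))
    return r, bias

# Toy observed vs. predicted damage values.
obs = [10.0, 20.0, 30.0, 40.0]
pred = [12.0, 18.0, 33.0, 41.0]
r, bias = r_and_bias(obs, pred)
# bias = mean(pred - obs) = (2 - 2 + 3 + 1) / 4 = 1.0
```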

Summary and conclusion
This study proposed a risk-based and physically informed model for predicting flash flood property damage across the Southeast US (SEUS) using a variety of influential factors including geographic, socioeconomic, and climatic features. We selected RF as the central model. The model was trained and tested using the information acquired from various data sources for a large number of flash flood events during the period of 1996-2017. RF was implemented in two different modes, classification and regression. In the classification mode, we estimated whether a flash flood caused any property damage, and then in the regression mode, the amount of property damage was predicted. Various statistical measures were employed to evaluate the performance of both the classifier and regression models, and the results indicated the reliability of the developed framework.
The findings of this study demonstrate the applicability and accuracy of the RF model for predicting property damage associated with flash flood events over a large domain. For future work, researchers are encouraged to develop probabilistic models for predicting flash flood damage. Moreover, additional predictors such as watershed properties can be incorporated into the model.

Acknowledgments
We would like to acknowledge the National Centers for Environmental Information for providing access to the NOAA Storm Events Database. We also appreciate the data provided by the North American Land Data Assimilation Systems. The authors declare no competing interests.

Data statement
The data that support the findings of this study are available from the corresponding author upon reasonable request.