Letter (Open Access)

Leveraging machine learning for predicting flash flood damage in the Southeast US


Published 7 February 2020 © 2020 The Author(s). Published by IOP Publishing Ltd
Citation: Atieh Alipour et al 2020 Environ. Res. Lett. 15 024011, DOI 10.1088/1748-9326/ab6edd



Abstract

Flash floods are a recurrent natural hazard with substantial impacts in the Southeast US (SEUS) owing to the frequent torrential rainfall events that occur in the region, triggered by tropical storms, thunderstorms, and hurricanes. Flash floods are costly natural hazards, primarily due to their rapid onset. Therefore, predicting the property damage of flash floods is imperative for proactive disaster management. Here, we present a systematic framework that considers a variety of features explaining different components of risk (i.e. hazard, vulnerability, and exposure), and examine multiple machine learning methods to predict flash flood damage. A large database of more than 14 000 flash flood events is used for training and testing the methodology, while a multitude of data sources are utilized to acquire reliable information for each event. A variable selection approach was employed to alleviate the complexity of the dataset and facilitate the model development process. The random forest (RF) method was then used to map the identified input covariates to the target variable (i.e. property damage). The RF model was implemented in two modes: first, as a binary classifier to estimate whether a region of interest was damaged in any particular flood event, and then as a regression model to predict the amount of property damage associated with each event. The results indicate that the proposed approach is successful not only for classifying damaging events (with an accuracy of 81%), but also for predicting flash flood damage in good agreement with the observed property damage. This study is among the few efforts to predict flash flood damage across a large domain using mesoscale input variables, and the findings demonstrate the effectiveness of the proposed methodology.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

The Southeast US (SEUS) is known to be susceptible to flash flooding due to the frequent high intensity rainfall triggered by tropical storms, thunderstorms and hurricanes (Orville and Huffines 2001, Czajkowski et al 2011, Smith and Smith 2015). During the last two decades, widespread flash flood events have caused significant economic damage in this region. Recent studies have shown that the frequency of flash flooding is increasing in the SEUS (Alipour et al 2020). Therefore, predicting property damage of flash floods is crucial for attaining proactive disaster management in this region.

Generally, risk refers to the potential losses of a particular hazard (Cardona et al 2012, Armenakis et al 2017, Ahmadalipour et al 2019), which is characterized as a function of three major components: hazard, vulnerability, and exposure (Adger 2006, Dang et al 2010, Winsemius et al 2013, Budiyono et al 2015, Koks et al 2015). Assessing flash flood risk components has been the subject of several studies. Recently, Ahmadalipour and Moradkhani (2019) investigated the spatiotemporal characteristics of flash flooding hazard over the Contiguous United States (CONUS). Also, Khajehei et al (2020) assessed the socioeconomic vulnerability of flash flooding at the county scale across the entire CONUS while accounting for flash flood characteristics including duration, frequency, magnitude and severity.

The conventional approaches for modeling flood risk are mostly dependent on the flood water depth to estimate the associated damage (Aerts et al 2014, Velasco et al 2014). Several recent studies have shown that considering multivariate data will improve the damage estimates (Wagenaar et al 2017). Therefore, over the past few years, several studies evaluated flood risk in various regions of the globe (de Moel et al 2015, Arnell and Gosling 2016, van Berchum et al 2018) using a multitude of variables representing hazard, vulnerability, and exposure.

Recent advances in machine learning (ML) techniques have led to significant improvements in flood risk assessment (Wang et al 2015, Lai et al 2016). Artificial neural networks (ANN), decision trees, logistic regression, random forests (RF), regression trees, and support vector machines are the most widely used ML models for flood risk assessment (Kourgialas and Karatzas 2017, Mojaddadi et al 2017, Gotham et al 2018, Nafari and Ngo 2018, Shafapour Tehrany et al 2019, Terti et al 2019). Table 1S (available online at stacks.iop.org/ERL/15/024011/mmedia) lists the factors used in these studies. Although some of these works addressed flood damage prediction, few attempted to predict the potential property damage of flash flood events. In addition, the majority of damage prediction studies are conducted over small regional domains and are explicitly applicable only to the region of interest (Scheuer et al 2011, Garrote et al 2016).

Therefore, in this study, we propose a risk-based and physically informed model for near real-time estimation of the potential property damage of flash flood events across the SEUS. Several influential factors, including geographic, socioeconomic, and climatic features, are utilized as input to the ML model to predict the property damage of each flash flood event. This study also presents a unique model input structure by which the ML model produces improved results and which can serve as a universal approach for predicting potential property damage in any region of interest. The model was trained and tested on a large database of more than 14 000 flash flood events during 1996–2017. The overarching research objective is to develop a risk-informed, mesoscale flash flood damage prediction model across the SEUS that assists decision makers and insurance companies dealing with flood risk assessment.

2. Study area and data

In this study, several data sources have been utilized to acquire information on flash flood events as well as the physical and geographical characteristics of the SEUS during 1996–2017. Each dataset and its characteristics are explained in the following sections.

2.1. Study area

The study area encompasses nine southeastern US states (referred to as the SEUS in this study): Alabama, Arkansas, Florida, Georgia, Louisiana, Mississippi, North Carolina, South Carolina, and Tennessee. The climate of this region varies with latitude, topography, and proximity to the Atlantic Ocean and the Gulf of Mexico (Ingram et al 2013). A high-pressure system known as the Bermuda High commonly draws moisture from the Atlantic Ocean and the Gulf of Mexico, causing warm, humid summers in the SEUS along with frequent thunderstorms (Zhu and Liang 2013). Based on the 2017 US census estimates, over 61 million people reside in the 755 counties of the SEUS. A large number of flash flood events have impacted the SEUS in the past two decades, imposing billions of dollars in damage on its residents.

2.2. NOAA storm events database

The National Oceanic and Atmospheric Administration (NOAA) Storm Events database is a comprehensive repository that provides information on different types of natural disasters, such as flash flooding, across the US from 1996 to the present. This information includes the beginning and termination date and time, location, associated injuries and fatalities, amount of damage to properties and crops, and an event narrative (Ashley and Ashley 2008, Sharif et al 2015, Konisky et al 2016, Hamidi et al 2017, Shah et al 2017). In this study, we used the NOAA storm events database to obtain information for 14 317 flash flood events, including the onset time, duration, date, location, and property damage during 1996–2017.

2.3. NLDAS-2 hourly precipitation

The precipitation data from Phase 2 of the North American Land Data Assimilation System (NLDAS-2) are available at 1/8th-degree spatial resolution (about 12 km) and hourly temporal resolution during the period of January 1979 to present (Xia et al 2012). The hourly NLDAS-2 precipitation data is generated from different in situ and remote sensing data sources (Yu et al 2017).

Since flash floods generally occur in small catchments, usually smaller than 1000 km² (Villarini et al 2010, Llasat et al 2016), the hourly NLDAS-2 precipitation data are upscaled to 0.3° grid cells (using bilinear interpolation) so as to represent an approximate inundated area of 1000 km². Then, the location, start time, and duration of each flash flood event (acquired from the NOAA storm events database) are utilized to extract the mean and cumulative precipitation during each event from the NLDAS-2 data. The mean and cumulative precipitation represent the intensity and severity of flash flood events, respectively, both of which are important characteristics for identifying flash flood hazard.
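As a minimal sketch of this extraction step (the paper does not describe its implementation; the function name, units, and the hourly series below are illustrative assumptions), the intensity and severity of an event can be computed from an hourly precipitation series at the event's grid cell:

```python
import numpy as np

def event_precip_stats(hourly_precip, start_hr, duration_hr):
    """Mean (intensity) and cumulative (severity) precipitation over an
    event window; array name and units (mm/h) are illustrative assumptions."""
    window = hourly_precip[start_hr:start_hr + duration_hr]
    return window.mean(), window.sum()

# Hypothetical 6 h event beginning at hour 2 of a 12 h hourly series
series = np.array([0.0, 1.0, 5.0, 8.0, 6.0, 4.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0])
mean_p, cum_p = event_precip_stats(series, start_hr=2, duration_hr=6)
```

In the study's workflow, `start_hr` and `duration_hr` would come from the NOAA storm events record and the series from the upscaled NLDAS-2 cell covering the event location.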

2.4. GTOPO30 topography data

GTOPO30 is a global digital elevation model with a spatial resolution of about 1 km that has been used in many studies for estimating topographical indices (Marlier et al 2015, Folk et al 2018, Durand et al 2019, Abbaszadeh et al 2019b). In this study, we used GTOPO30 to derive several topographic factors, including altitude, slope, flow accumulation, and the topographic roughness index (TRI), at different spatial resolutions (i.e. 1, 3, and 30 km) corresponding to each flash flood event.

2.5. Zillow database

Zillow real estate provides the Zillow Home Value Index (ZHVI), the median home value for a specific geographic region and housing type, available from 1996 to the present. Several studies have utilized this product for risk analysis (Watson et al 2016, Morckel 2017, Miller 2018). In this study, we used the ZHVI to evaluate the median home value for all homes in each county during 1996–2017. The median home value is an indicator of flash flood exposure. The Zillow dataset does not include information for all counties in all years, so we used machine learning to predict the missing values, as explained in more detail in section 3.1.

2.6. US census bureau database

The US Census Bureau provides data on the nation's people and economy, as well as geographic information including boundary maps of states, counties, places, and census tracts, through questionnaires conducted every 10 years. Here, we extracted the population of SEUS counties from this dataset to analyze flash flood exposure. We also derived the area of each county from the SEUS counties shapefile to estimate population density as another indicator of flash flood exposure.

2.7. The centers for disease control and prevention's social vulnerability index (SVI)

The Centers for Disease Control and Prevention's (CDC) SVI is based on 15 social factors, such as unemployment, minority status, and disability. These factors are grouped into four themes: socioeconomic status, household composition and disability, minority status and language, and housing and transportation (Cimellaro et al 2016). The SVI values are available for the years 2000, 2010, 2014, and 2016. Since the data are not available for every year during 1996–2017, for simplicity and consistency we used the 2016 SVI at the county level to evaluate vulnerability to flash flood events.

3. Methodology

This study proposes a risk-based model for flash flood damage prediction over the SEUS, a valuable tool for decision makers and insurance companies. The framework of the proposed approach is presented in figure 1.

Figure 1. Schematic representation of the proposed framework for flash flood damage prediction. In the figure, ANN (MLP) stands for artificial neural network (multilayer perceptron), and RF is random forest.

3.1. Filling the gaps in zillow dataset

One of the variables used in this study is the median home value, which explains flash flood exposure. We utilized the Zillow dataset to extract this information for each flash flood event during 1996–2017 over the SEUS. Unfortunately, the median home value is not available for all counties and all years in the study period. To cope with this shortcoming, we utilized an ANN to predict the missing median home values. ANN models are suitable for modeling a wide variety of nonlinear problems by learning the relationships between a set of inputs and the output without any prior assumptions or knowledge of the underlying physics of the process (ASCE Task Committee 2000a, 2000b, Asadi et al 2013, Mitra et al 2016). Several versions of ANNs could be adopted to estimate the missing home values; our analysis suggests that a simple ANN-MLP structure suffices for this task. A typical ANN-MLP structure consists of three layers: an input layer, a hidden layer containing the neurons, and an output layer. High-dimensional ANN models that include irrelevant inputs behave poorly (Bowden et al 2005a, 2005b, Wu et al 2014). In this study, the input variables include only the centroid latitude and longitude of each county, the year, and the corresponding county population, while the output is the median home value for that year and county. The number of neurons in the hidden layer was selected by trial and error. Out of 16 610 samples (755 counties in the SEUS over the 22 years of 1996–2017: 755 × 22), 9853 cases were available in the Zillow dataset and the remaining 6757 median home values were missing.

There are several methods for splitting the data into subsets for training, validating, and testing a model (Bowden et al 2002, Wu et al 2013). Here, we randomly separated the 9853 available samples into three groups: training (70% of the data), to train and calibrate the ANN model; validation (15%), to validate the trained model and avoid overfitting; and testing (15%), to verify the performance of the trained model. Random separation of the dataset supports the generalizability of the trained model. We normalized the input variables and trained the ANN-MLP model using the training dataset. Validation is an important part of the modeling process (Humphrey et al 2017); here, we used the validation data for early stopping during model development. The trained model was verified using the testing dataset, as shown in figure 2. The results show a high correlation between the model output and the actual median home values reported by Zillow. Therefore, the trained model was used to estimate the missing median home values in the Zillow dataset.
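The gap-filling workflow can be sketched as follows. The paper does not state its software stack; scikit-learn is used here purely for illustration, with synthetic data standing in for the Zillow records, and scikit-learn's built-in early stopping playing the role of the explicit validation subset described above:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the available records: inputs are
# [latitude, longitude, year, county population]; target is median home value.
X = rng.uniform([25.0, -95.0, 1996.0, 1e4], [37.0, -75.0, 2017.0, 1e6], size=(2000, 4))
y = 1e5 + 50.0 * np.sqrt(X[:, 3]) + 1e3 * (X[:, 2] - 1996.0) + rng.normal(0.0, 5e3, 2000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=0)
scaler = StandardScaler().fit(X_train)
y_mean, y_std = y_train.mean(), y_train.std()  # normalize target for stable training

# early_stopping=True holds out an internal validation fraction, analogous to
# the explicit validation subset used in the paper.
model = MLPRegressor(hidden_layer_sizes=(20,), early_stopping=True,
                     max_iter=2000, random_state=0)
model.fit(scaler.transform(X_train), (y_train - y_mean) / y_std)

pred = model.predict(scaler.transform(X_test)) * y_std + y_mean
r = np.corrcoef(pred, y_test)[0, 1]  # correlation on the held-out test set
```

A model trained this way would then be applied to the county-year combinations missing from the dataset.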

Figure 2. Verification results for the ANN-MLP model over the testing period, used to fill in the missing median home values in the Zillow dataset; R = correlation coefficient.

3.2. Variable selection

Variable selection is a common step in model development in machine learning. It removes redundant predictors that add noise to the main estimators, saves computation time, and helps prevent overfitting. Figure 3 illustrates the variable selection process; the final selected features are shown in red, yellow, blue, and gray, representing exposure, vulnerability, hazard, and spatiotemporal features, respectively.

Figure 3. Flowchart of the variable selection method, and the final 11 chosen variables (at the bottom) that are used as input to the random forests model. Red, yellow, blue, and gray colors are used for variables representing exposure, vulnerability, hazard, and spatiotemporal features, respectively.

We selected our variables in several steps, addressing one issue at a time. The geomorphologic features of the inundated area, namely altitude, slope, flow accumulation, and topographic roughness index, were extracted at different spatial resolutions (1, 3, and 30 km). The correlation between each resolution and the reported damage was estimated, and the resolution with the highest correlation was selected. Note that we also used the Spearman correlation coefficient, which assesses the monotonic relationship between two variables (whether linear or not), and found that the selected variables were the same as those obtained with the Pearson correlation. Afterwards, we used the variance inflation factor (VIF) to remove multicollinear variables. VIF is calculated as 1/(1 − ${R}^{2}$), where $R$ is the correlation computed for each pair of the predictor variables. To further reduce the dimension of the input set, we performed a leave-one-out approach in which one input variable was removed at a time and the prediction repeated; the variables yielding the most accurate predictions were retained as the final model inputs. Most of the selected variables (e.g. duration and median home value) represent the hazardousness of flash flood events and the amount of exposed property. The household composition and disability index represents the percentage of people aged 65 or older, aged 17 or younger, civilians with a disability, and single-parent households. We chose this factor because it showed a higher correlation with flash flood property damage; these groups may reside in regions more prone to flooding owing to a lack of awareness or limited financial means. The location of each flash flood enables our model to predict damage across a large region, and the timing variables (i.e. month and onset time) serve as proxies for factors not included in our study (e.g. soil moisture). We compared the models' performance (classification and regression) with and without the variable selection approach, and found that the proposed procedure, combining correlation analysis, VIF, and the leave-one-out approach, ensures that the ML models are fed the most appropriate input variables and yields generalizable, non-overfitted models.
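The pairwise VIF screening described above can be illustrated as follows (a sketch of the paper's pairwise formulation, in which $R$ is the Pearson correlation between two predictors; the conventional VIF instead regresses each predictor on all the others, and the data here are synthetic):

```python
import numpy as np

def pairwise_vif(x, y):
    """VIF = 1 / (1 - R^2), with R the Pearson correlation between a
    pair of predictors, following the paper's pairwise formulation."""
    r = np.corrcoef(x, y)[0, 1]
    return 1.0 / (1.0 - r ** 2)

# Two uncorrelated predictors give the minimum VIF of 1; common rules of
# thumb flag VIF above roughly 5-10 as problematic multicollinearity.
x = np.array([1.0, -1.0, 1.0, -1.0])
y = np.array([1.0, 1.0, -1.0, -1.0])
v = pairwise_vif(x, y)
```

In practice, one member of any predictor pair whose VIF exceeds the chosen threshold would be dropped before model training.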

Figure 4 shows the spatial variation of input features including vulnerability, population of each county in 2017, median home value in 2017, mean duration of flash floods during 1996–2017, long-term average intensity of flash flood events during 1996–2017, flow accumulation, and slope. Figure 4 also illustrates the monthly and diurnal distribution of flash flood events during 1996–2017. This figure indicates that flash floods are more frequent during spring and summer (April–September), and the onset is more likely to happen in the afternoon (3 pm–7 pm).

Figure 4. The spatial variation of input features used for predicting flash flood damage. (a) The 2016 relative vulnerability index (household composition and disability); (b) population of each county in 2017; (c) median home value in 2017; (d) mean duration of flash floods during 1996–2017; (e) long-term average intensity of flash flood events during 1996–2017; (f) flow accumulation; (g) slope for each county; and the monthly (h) and diurnal (i) distribution of flash flood events during 1996–2017.

3.3. Random forest

The objective of this study is to build a model that predicts flash flood damage using event characteristics as input variables. We used RF for both classification and prediction of flash flood property damage. RF, proposed by Breiman (2001), is an ensemble learning method that grows multiple decision trees, each on a randomly selected subset of samples drawn with replacement. The method is suitable for both regression and classification problems. Because RF randomizes and decorrelates its trees, it can capture very complex, nonlinear relationships between input and output variables (He et al 2016, Hong et al 2016).

In this study, RF was used in two modes, classification and regression (see figure 5). For the classification problem, we transformed the damage values into a binary scoring system, where zero represents events with no property damage and one refers to any damage value greater than zero. In the regression mode, RF estimates the relationship between the predictors and the output variable (damage). To deal with the skewness of the data, both input and output variables of the regression model were transformed using Box-Cox and log transformations. We randomly split the dataset into two groups: 85% of the data for training and the remaining 15% for testing. In both the classification and regression models, a trial-and-error approach showed that 1000 trees yield good performance: using 1000 trees improves performance compared with smaller ensembles, while adding more than 1000 trees produces only marginal improvement at significantly higher computational cost. The model was also verified using the testing dataset.
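A sketch of this two-stage RF setup, using scikit-learn and synthetic data in place of the study's 11 covariates (the paper's exact feature set and Box-Cox parameters are not reproduced; a log transform is shown for the regression target):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
# Synthetic stand-in for the 11 selected covariates and property damage (USD)
X = rng.normal(size=(3000, 11))
damage = np.where(X[:, 0] + X[:, 1] > 0,
                  np.exp(10.0 + X[:, 2] + rng.normal(0.0, 0.5, 3000)), 0.0)

# Stage 1: binary classifier (damaging vs non-damaging), 85/15 split as in the text
y_bin = (damage > 0).astype(int)
Xtr, Xte, ytr, yte = train_test_split(X, y_bin, test_size=0.15, random_state=1)
clf = RandomForestClassifier(n_estimators=1000, random_state=1, n_jobs=-1)
clf.fit(Xtr, ytr)
acc = clf.score(Xte, yte)

# Stage 2: regression on damaging events only, log-transformed skewed target
mask = damage > 0
Xd, yd = X[mask], np.log(damage[mask])
Xtr, Xte, ytr, yte = train_test_split(Xd, yd, test_size=0.15, random_state=1)
reg = RandomForestRegressor(n_estimators=1000, random_state=1, n_jobs=-1)
reg.fit(Xtr, ytr)
pred = np.exp(reg.predict(Xte))  # back-transform predictions to dollars
```

The 1000-tree setting mirrors the ensemble size reported in the text; in a real application the covariates would be the 11 selected variables of figure 3.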

Figure 5. The schematic representation of the flash flood damage prediction framework. In the figure, RF stands for Random Forest, and AUC is the area under relative-operating characteristic curve.

4. Results and discussion

The results are discussed in two subsections below. Section 4.1 reports the performance of the proposed classification model in separating flash flood events into damaging and non-damaging classes, and section 4.2 explains the effectiveness of the regression model for flash flood damage prediction.

4.1. Damaging versus non-damaging classification

Here, sensitivity (true positive rate) and specificity (true negative rate) are utilized to assess the performance of the developed classifier model (Lin et al 2019). Sensitivity measures the proportion of positives that are correctly identified (i.e. events that actually caused property damage and were correctly classified as damaging), and specificity measures the proportion of negatives that are correctly identified (i.e. events that caused no property damage and were correctly classified as non-damaging). Both range from zero to one, with a value of one indicating perfect accuracy. Sensitivity and specificity are calculated as:

Sensitivity = TP / (TP + FN), (1)

Specificity = TN / (TN + FP), (2)

where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.

Figure 6 shows the performance of the RF classifier model. The locations of correctly (blue) and incorrectly (red) classified events in the testing dataset, as well as the sensitivity and specificity of the model for each state, are shown in this figure. Figures 6(a) and (b) show the results for damaging and non-damaging events, respectively. The numbers of correct (true) and incorrect (false) classifications are also shown in each panel. The overall performance of the model is fairly high in classifying both damaging and non-damaging events. The sensitivity and specificity for Alabama, Florida, and Louisiana are considerably high (greater than 0.75), indicating higher reliability of the classification model in these states. Sensitivity and specificity trade off against each other: as sensitivity increases, specificity tends to decrease, and vice versa (Parikh et al 2008). This is particularly apparent for Mississippi and North Carolina: North Carolina has low sensitivity but high specificity (>0.9), whereas Mississippi shows high sensitivity and low specificity.

Figure 6. The performance of the proposed binary damage classification approach for (a) damaging and (b) non-damaging flash flood events. The blue and red colors indicate the true and false predictions, respectively. The points on the map show the location of flash flood events. The total numbers of correctly and incorrectly predicted events are shown for both cases. On the right side of the panels, the sensitivity and specificity of the model are shown for each state.

To better understand the classifier model's performance, the overall sensitivity and specificity of the model, as well as its accuracy (see equation (3)), are presented in figure 7(a). The model accuracy, sensitivity, and specificity indicate the reliability of the model in classifying flash flood damage:

Accuracy = (TP + TN) / (TP + TN + FP + FN). (3)
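All three metrics can be computed directly from a confusion matrix; the labels below are hypothetical (1 = damaging, 0 = non-damaging):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical observed and predicted labels for ten events
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 1, 0, 0, 0, 1, 1, 0, 1, 0])

# sklearn returns the matrix in (tn, fp, fn, tp) order when raveled
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)                 # true positive rate
specificity = tn / (tn + fp)                 # true negative rate
accuracy = (tp + tn) / (tp + tn + fp + fn)   # overall fraction correct
```

For these labels all three metrics come out to 0.8, since one damaging and one non-damaging event are misclassified.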

Figure 7. (a) The relative-operating characteristic (ROC) curve of the proposed random forest classifier; AUC = area under curve. (b) The relative importance of features for the random forest classifier model.

Moreover, to further evaluate the ML model, we estimated the area under the relative-operating characteristic (ROC) curve. The ROC curve represents the model's trade-off between the false positive rate (one minus specificity) and the true positive rate (sensitivity); the area under it ranges between 0.5 and 1, where 1 is the ideal value (Chapi et al 2017, Rahmati and Pourghasemi 2017). The dashed line in figure 7(a) represents an area of 0.5, corresponding to a model with no skill. The area under the curve (AUC) indicates the accuracy of the model, and several studies have employed it to measure classifier performance. For instance, Joo et al (2019) used a Bayesian network to integrate the weights of different variables affecting flood damage and reported an AUC of 0.67. The high AUC (0.87) shown in figure 7(a) indicates the reliability of the proposed model.

Figure 7(b) shows the importance of each variable in the developed classifier model, calculated as the increase in prediction error when the values of that variable are permuted. As can be seen from the figure, the most important variables are the location of the event (latitude and longitude). This implies that by considering the location of events along with other geographic, socioeconomic, and flood factors, we can extend our prediction to larger domains. Flow accumulation is the least important feature; however, the leave-one-out approach described in section 3.2 indicated that keeping this variable increases the accuracy.
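Permutation importance of this kind can be computed as follows (synthetic data; only the first two of three illustrative features carry signal, loosely mimicking the contrast between event location and flow accumulation):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(2)
# Three illustrative covariates; only the first two drive the label
X = rng.normal(size=(1000, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=2).fit(X, y)

# Permutation importance: the drop in score when one feature's values are
# shuffled, matching the importance measure described in the text
imp = permutation_importance(clf, X, y, n_repeats=10, random_state=2)
ranking = np.argsort(imp.importances_mean)[::-1]  # most to least important
```

Shuffling the uninformative third feature barely changes the score, so its importance is near zero, while the first feature ranks highest.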

4.2. Damage prediction model

RF is not only used as a classifier but is also implemented to predict the amount of property damage from a particular flash flood event. The flash flood events that caused property damage were randomly divided into two parts: training (85% of the dataset) and testing (15%). The results of the developed model are evaluated using two performance measures, the correlation coefficient (R) and bias, both of which are commonly used to measure the accuracy and performance of ML models (Gavahi et al 2019, Neri et al 2019, Shastry and Durand 2019, Abbaszadeh et al 2019a). The regression (i.e. damage prediction) model is evaluated for the training set, the testing set, and the entire dataset, and the results are shown in figure 8. The statistical measures in this figure indicate a satisfactory agreement between the observed and predicted values; however, a slight negative bias is observed (a mean of −$1100 for testing and −$1010 overall).

Figure 8. The performance of the random forest model in prediction of flash flood damage over the SEUS. The subplots show the histogram of bias for the training set, testing set, and the entire dataset (totaling 5500, 970, and 6470 events, respectively). The axis titles for all the panels are the same.

The findings of several studies suggest that climate change will increase the likelihood of flooding events (Sisco et al 2017, Yin et al 2018, Zhang et al 2018, Marsooli et al 2019), and therefore, proactive disaster risk management strategies are required. The proposed framework in this study can help the decision makers and insurance agencies to better allocate the resources and inform the communities about the hazardousness of flash flood events (Shao et al 2019).

5. Summary and conclusion

This study proposed a risk-based and physically informed model for predicting flash flood property damage across the Southeast US (SEUS) using a variety of influential factors including geographic, socioeconomic, and climatic features. We selected RF as the central model. The model was trained and tested using the information acquired from various data sources for a large number of flash flood events during the period of 1996–2017. RF has been implemented in two different modes, classification and regression. In the classification mode, we estimated whether the flash flood caused any property damage or not, and then in the regression mode, the amount of property damage was predicted. Various statistical measures were employed to evaluate the performance of both classifier and regression models, and the results indicated the reliability of the developed framework.

The findings of this study demonstrate the applicability and accuracy of the RF model for predicting property damage associated with flash flood events over a large domain. For future work, researchers are encouraged to develop probabilistic models for predicting flash flood damage. Moreover, additional predictors such as watershed properties could be incorporated into the model.

Acknowledgments

We would like to acknowledge the National Centers for Environmental Information for providing access to the NOAA Storm Events Database. We also appreciate the data provided by the North American Land Data Assimilation Systems. The authors declare no competing interests.

Data statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.
