1 Introduction

Urban flooding presents escalating challenges for sustainable development worldwide (Hammond et al. 2015). The frequency and severity of flood damage are on the rise, attributed to the effects of climate change and rapid urbanization (Chen et al. 2015; Yin et al. 2015; Jamali et al. 2018).

Urban hydrology provides a basis for estimating the inundated areas and damage of urban floods (Fletcher et al. 2013; Ichiba et al. 2018). The prevalence of impervious surfaces in cities reduces precipitation infiltration and, simultaneously, increases the speed of surface water flow during a storm (Lee and Brody 2018; Yu et al. 2018; Feng et al. 2021). Consequently, these conditions leave only a limited time to warn and evacuate people who may be caught in flash floods. Furthermore, urban flash floods are particularly destructive because of the high exposure of people and expensive infrastructure in cities (Jamali et al. 2018; Singh et al. 2018).

Several hydrological models have been developed for urban flood simulations, among which models such as the Storm Water Management Model and MIKE FLOOD are widely applied worldwide (Mignot et al. 2019; Bulti and Abebe 2020; Guo et al. 2021; Babaei et al. 2018; Bai et al. 2018; Li et al. 2018; Nigussie and Altunkaynak 2019). These models enable the prediction of inundated areas in cities, which is essential for urban flood early warnings (Rangari et al. 2019; Seenu et al. 2020). However, there is a critical gap in model calibration and validation due to a lack of measured floodwater depth data across urban areas. The primary challenges are the limited coverage of monitoring spots and the relatively high cost of maintaining gauging equipment (Arshad et al. 2019; Moy De Vitry et al. 2019; Li et al. 2019). Moreover, urban areas present numerous, scattered waterlogging locations owing to various low-lying facilities and underground infrastructure, such as subway stations and traffic tunnels. Monitoring distributed flood inundation at these scattered locations with a traditional sensor-based system is therefore costly. Consequently, comprehensive monitoring networks that cover urban flood depths extensively remain scarce globally (Wang et al. 2018).

In the quest to acquire more comprehensive flood depth data within urban areas, a range of new technologies has been adopted over the past decade. These include remote sensing (Rong et al. 2020), social media or crowdsourcing data retrieval (Kankanamge et al. 2020), deep learning (Chia et al. 2023; Faramarzzadeh et al. 2023), and object recognition in flood imagery (Jiang et al. 2019; Bhola et al. 2019; Park et al. 2021). Barz et al. (2019) established a method combining content-based image retrieval with artificial intelligence to identify "flood" and "non-flood" events from a dataset of about 100,000 online images. Vlad et al. (2019) developed a deep learning classifier that detected flooding events by analyzing text published by online news outlets. While these approaches have introduced novel insights into urban flood monitoring, obtaining accurate and continuous floodwater level data at specific locations remains a challenge. To extract more quantitative information from flood images, high-resolution photographs of water level gauges captured by video surveillance systems have been used to deduce water level data through identification technology (Lv et al. 2018; Jiang et al. 2020). Furthermore, Basnyat and Gangopadhyay developed a low-cost, low-power cyber-physical system prototype that uses a Raspberry Pi camera to detect rising water levels through image processing and text recognition techniques (Basnyat et al. 2018). This innovation has the potential to enable the widespread implementation of a flash flood detection system that requires minimal human supervision. Yet, the deployment of such systems in urban locales faces constraints, particularly the necessity of installing on-site water level gauges and camera systems at heavily trafficked urban locations (Yang et al. 2014).

Given the above limitations, this study aims to develop an innovative method for obtaining continuous data on urban inundation depths at specific locations during a flash flood. Image-recognition technology is applied to estimate flood levels from traffic photos by identifying the position of the water surface relative to reference objects in the water, including people and vehicles passing through it. The primary innovation of this study lies in the establishment of a new method for image-based recognition of surface water depth. Compared with traditional approaches that rely on sensors for monitoring or numerical models for estimating flood inundation depths, the method proposed in this study is cost-effective and computationally efficient, making it more suitable for providing flood inundation depth data over large spatial extents. Moreover, the proposed method can provide real-time data on inundated water levels by analyzing images from existing traffic surveillance cameras in road networks. As traffic surveillance cameras are widespread in cities globally, this method might be practical for expanding urban flood monitoring and providing timely early warning.

2 Materials and Methods

2.1 Data

A total of 1,177 images of flooded streets with pedestrians or vehicles were utilized for water depth identification. The image resolutions varied from 142 × 107 pixels to 4613 × 2595 pixels. These images were acquired through Google Search using keywords such as “urban waterlogging” and “urban floods.” Some of these photos were sourced from traffic cameras. Complex and challenging scenarios, including those featuring nighttime, muddy water, and splashes caused by vehicle movement, were considered in the selection of the flood photos.

According to the VOC (Visual Object Classes) dataset framework developed by the PASCAL organization, the images were converted into a standard dataset for flood depth recognition. PASCAL (Pattern Analysis, Statistical Modeling and Computational Learning) is a network organization funded by the EU, which has developed public datasets of images and annotations, together with standardized evaluation software (Everingham et al. 2010). PASCAL VOC is one of the most widely used visual datasets and employs a strictly normalized framework (Gauen et al. 2017) to record the size, object class, and bounding box coordinates in the images. In this study, the 1,177 images were annotated with 2,095 objects in seven categories using the VOC format with the aid of the software LabelImg (Yakovlev and Lisovychenko 2020). The seven categories were defined based on the positions of people or vehicles as they were submerged by floodwater, as detailed in Table 1.
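For illustration, the following minimal sketch parses a VOC-style annotation of the kind produced by LabelImg. The file name, pixel coordinates, and the "Car-handle" label in the embedded XML are invented for this example and are not taken from the actual dataset.

```python
# Minimal sketch: parsing a VOC-style annotation as used for the flood dataset.
# The XML content below is a hypothetical example, not an annotation from this study.
import xml.etree.ElementTree as ET

voc_xml = """
<annotation>
  <filename>flood_0001.jpg</filename>
  <size><width>1280</width><height>720</height><depth>3</depth></size>
  <object>
    <name>Car-handle</name>
    <bndbox><xmin>412</xmin><ymin>238</ymin><xmax>866</xmax><ymax>540</ymax></bndbox>
  </object>
</annotation>
"""

root = ET.fromstring(voc_xml)
for obj in root.findall("object"):
    name = obj.find("name").text                                  # flood-depth category label
    box = obj.find("bndbox")
    coords = [int(box.find(t).text) for t in ("xmin", "ymin", "xmax", "ymax")]
    print(name, coords)                                           # -> Car-handle [412, 238, 866, 540]
```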

Table 1 Categories of flood images

2.2 Object Detection Algorithm

The “you only look once” (YOLO) algorithm was employed in this study to extract features and detect objects in the flood images using a deep convolutional neural network. YOLO is favored for object detection because of its rapid image processing capabilities: it integrates feature extraction, object classification, and bounding box regression in a single pass through the neural network (Redmon et al. 2016). Unlike two-stage algorithms such as Faster R-CNN, which first generate candidate regions of the input image and then extract features from them through separate networks, YOLO skips the generation of candidate regions and completes detection in one step. This gives YOLO a markedly higher computational speed while maintaining adequate precision (Zou et al. 2019), aligning with the objective of this study to monitor flood depth variation in real time.

YOLO performs object detection by estimating the likelihood that each bounding box within an image grid belongs to one of the predefined categories, employing a convolutional neural network (CNN) together with a regression method. First, YOLO divides the full image into a grid, typically of size 7 × 7, and uses this grid as the input representation for the neural network. It then proposes two potential bounding boxes for each cell of the grid, while class probabilities over the n categories are predicted for each cell. Every bounding box is characterized by five parameters: the x-coordinate (x) and y-coordinate (y) of the box's center, the box's width (w), its height (h), and the confidence score (c), which is the probability of the box containing an object of interest, ranging between 0 and 1. Consequently, the network output, known as the YOLO head, is a tensor with the dimensions (7, 7, 2 × 5 + n). For instance, for a dataset encompassing 20 object categories, the YOLO output would be a tensor of dimensions (7, 7, 30). Here, the first two dimensions, a 7 × 7 matrix, correspond to the grid of feature cells, while the final dimension of 30 is the number of bounding boxes per cell multiplied by the five parameters per box, plus the 20 class probabilities (2 × 5 + 20).
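The arithmetic behind the output dimensions can be summarized in a few lines. The values below simply reproduce the worked example in the text (a 7 × 7 grid, two boxes per cell, 20 categories) and are not the configuration used later in this study.

```python
# Illustrative sketch of the YOLO head dimensions described above (YOLOv1-style grid).
grid = 7           # the image is divided into a 7 x 7 grid
boxes_per_cell = 2
box_params = 5     # x, y, w, h, confidence
num_classes = 20   # e.g., the 20 categories of PASCAL VOC

depth = boxes_per_cell * box_params + num_classes   # 2 * 5 + 20 = 30
print((grid, grid, depth))                          # -> (7, 7, 30)
```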

The YOLO algorithm was initially developed and subsequently upgraded to its third iteration by Redmon (Redmon and Farhadi 2018). The fourth generation, YOLOv4, was introduced by Alexey Bochkovskiy in 2020 (Bochkovskiy et al. 2020), and this version was employed in the current study. YOLOv4 consists of three components: the backbone, feature enhancement, and prediction. In the backbone, the input image is first adjusted to a fixed resolution, and CSPDarknet53 (Wang et al. 2020) then extracts three feature layers from it through five residual network stages. The three feature layers are subsequently passed to the feature enhancement part, known as the neck, where they are processed to produce three predictive layers that form the YOLO head. The YOLO head, within the prediction component, is responsible for the final object classification and image recognition tasks.

In this study, three feature layers with the shapes (13, 13), (26, 26), and (52, 52) were extracted from an input image of shape (416, 416) through the downsampling (pooling) operations of the CNN, as shown in Fig. 1. Each cell within the feature layers predicts three candidate bounding boxes over the seven categories. As previously described, every bounding box is defined by five parameters. Consequently, the output dimensions of the YOLO heads are (13, 13, 22), (26, 26, 22), and (52, 52, 22), respectively. Here, the third dimension, 22, is obtained by adding the product of the number of bounding boxes and the parameters per box to the probability values of the seven categories, that is, 3 × 5 + 7.

Fig. 1 Network of "You Only Look Once" Version 4 (YOLOv4)

In addition, YOLOv4 employs the Mish activation function (Misra 2019) instead of the LeakyReLU function (Maas et al. 2013) used in YOLOv3. Unlike LeakyReLU, Mish is bounded below, which enhances the model's regularization effect. Moreover, the non-monotonic nature of Mish allows it to retain small negative values, which helps stabilize the gradient flow within the network. Its smooth, continuously differentiable curve also contributes to better result quality. Although the Mish function demands more computational resources than LeakyReLU, it is beneficial for improving the model's accuracy. The activation functions are illustrated in Fig. 1.
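For reference, the two activation functions compared here can be written in a few lines of PyTorch; the sample input values are arbitrary and the 0.1 negative slope for LeakyReLU is a common default rather than the exact setting used in this study.

```python
# Minimal sketch of the two activation functions: Mish(x) = x * tanh(softplus(x)),
# LeakyReLU(x) = x if x > 0 else negative_slope * x.
import torch
import torch.nn.functional as F

def mish(x: torch.Tensor) -> torch.Tensor:
    # Smooth, non-monotonic, and bounded below (minimum around -0.31)
    return x * torch.tanh(F.softplus(x))

x = torch.linspace(-3.0, 3.0, steps=7)
print(mish(x))
print(F.leaky_relu(x, negative_slope=0.1))   # piecewise linear, unbounded below
```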

2.3 Model Training, Validation and Test

In this study, the input data were randomly partitioned into a training set (81% of the data), a validation set (9%), and a test set (10%). The test set remained unused during the training process. YOLOv4 was trained through transfer learning, wherein the initial parameters, including the initial weights of the CNN, were adopted from the existing dataset VOC2007 (Everingham et al. 2010). Transfer learning is a practical scheme that reuses generic features learned from larger training datasets, thereby reducing the time required to train the model. The VOC2007 dataset contains nearly 10,000 annotated images covering 20 object classes, offering a sufficiently large and diverse foundation to provide the CNN used in this study with initial weights.
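A minimal sketch of the 81/9/10 split, assuming a list of image file names and scikit-learn's train_test_split, is shown below; the file names and random seed are placeholders, not the study's actual data handling code.

```python
# Sketch of the 81% / 9% / 10% split described above (file names are hypothetical).
from sklearn.model_selection import train_test_split

image_paths = [f"flood_{i:04d}.jpg" for i in range(1177)]

# First hold out 10% as the test set, then split the remainder 90/10,
# which yields roughly 81% / 9% / 10% of the full dataset.
trainval, test = train_test_split(image_paths, test_size=0.10, random_state=42)
train, val = train_test_split(trainval, test_size=0.10, random_state=42)
print(len(train), len(val), len(test))   # roughly 953 / 106 / 118 images
```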

For the model training in this study, 200 epochs were set. Initially, a frozen learning approach with a learning rate of 1e-3 was applied for the first 90 epochs, followed by regular learning with a learning rate of 1e-4 for the subsequent epochs. Frozen learning did not change the initial parameters of the networks in the backbone part and made only minor adjustments to the other networks during training (Brock et al. 2017). In contrast, regular learning modified the parameters of all networks within the model. Frozen learning reduces training time by leveraging existing network parameters from prior datasets, such as VOC2007. However, because the backbone networks remain static during frozen learning, the model cannot fully converge under this scheme alone. Consequently, regular learning was employed thereafter to ensure further model convergence. Additionally, a reduced learning rate was applied in regular learning to minimize the risk of overfitting.
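The two-stage schedule can be sketched as follows in PyTorch. The model object, its backbone attribute, the compute_loss method, and the choice of Adam are hypothetical stand-ins for the actual YOLOv4 implementation; only the epoch counts and learning rates follow the description above.

```python
# Illustrative sketch: freeze the backbone for the first 90 epochs at lr = 1e-3,
# then unfreeze everything and continue at lr = 1e-4.
import torch

def set_backbone_trainable(model, trainable: bool) -> None:
    for p in model.backbone.parameters():      # `backbone` is a hypothetical attribute
        p.requires_grad = trainable

def train(model, loader, total_epochs=200, freeze_epochs=90):
    set_backbone_trainable(model, False)                      # frozen learning
    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=1e-3)
    for epoch in range(total_epochs):
        if epoch == freeze_epochs:                            # switch to regular learning
            set_backbone_trainable(model, True)
            optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
        for images, targets in loader:
            optimizer.zero_grad()
            loss = model.compute_loss(images, targets)        # hypothetical loss method
            loss.backward()
            optimizer.step()
```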

The convergence curve of the model training is shown in Fig. 2. A stabilization in the decrease of training loss was observed after 70 epochs, attributable to the unchanged network structure during the frozen learning phase. Subsequently, with the cessation of frozen learning, a further reduction in training loss occurred, continuing until the 150th epoch.

Fig. 2 Convergence curve of the model training

2.4 Indicators for Performance Assessment

Mean average precision (mAP) was used in this study to evaluate the accuracy of flood image recognition. It is a metric that is extensively applied in object detection and classification (Zou et al. 2019). The mAP was computed from the precision and recall indicators of the modeling results. Precision is the ratio of correctly predicted positive observations to the total predicted positive observations, and recall is the ratio of correctly predicted positive observations to all actual positives, as shown in Eqs. 1 and 2.

$$Precision= \frac{TP}{TP+FP}$$
(1)
$$Recall= \frac{TP}{TP+FN}$$
(2)

where TP (true positive) represents the number of samples that are correctly recognized or classified, FP (false positive) represents the number of samples that are incorrectly recognized as positive, and FN (false negative) represents the number of positive samples that are not recognized.
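As a worked numerical illustration of Eqs. 1 and 2, the counts below are invented for this example and are not results from this study.

```python
# Worked example of Eqs. 1 and 2 with illustrative counts.
TP, FP, FN = 80, 10, 20
precision = TP / (TP + FP)   # 80 / 90 ≈ 0.889
recall = TP / (TP + FN)      # 80 / 100 = 0.800
print(precision, recall)
```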

Given the above indicators, the mAP can be calculated as follows:

$$mAP= \frac{{\sum }_{1}^{n}\;AP}{n}$$
(3)
$$AP= \int_{0}^{1}p\left(r\right)dr$$
(4)

where AP (average precision) represents the area under the precision-recall curve of the prediction results for each category, n is the total number of categories, r is the value of recall from Eq. 2, and p(r) represents the precision-recall curve. The value of p(r) is taken as the maximum precision attained at any recall greater than or equal to r, evaluated as r increases from 0 to 1 at a fixed interval, such as 0.1.
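A minimal sketch of this interpolated AP calculation is given below; the recall and precision arrays are illustrative, not outputs of the model in this study, and the mAP would simply be the mean of the per-category AP values.

```python
# Sketch of the 11-point interpolated AP described above (Eqs. 3 and 4).
import numpy as np

def average_precision(recalls: np.ndarray, precisions: np.ndarray) -> float:
    """Mean of the maximum precision at recall >= r, for r in {0.0, 0.1, ..., 1.0}."""
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        mask = recalls >= r
        p_max = precisions[mask].max() if mask.any() else 0.0
        ap += p_max / 11.0
    return ap

recalls = np.array([0.1, 0.3, 0.5, 0.7, 0.9])          # illustrative values
precisions = np.array([1.0, 0.95, 0.90, 0.80, 0.60])   # illustrative values
print(average_precision(recalls, precisions))
```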

3 Results

The results of flood recognition from the street images are presented in Fig. 3. As shown in the figure, the reference objects—vehicles and pedestrians—are enclosed within bounding boxes, with the identified category of water depth displayed at the top of each box. In this way, the flood water depth can be extracted by real-time image recognition using street photos or surveillance videos from traffic cameras. It must be emphasized that this method is not appropriate for severe urban flooding, as it depends on the relative position between the water surface and the reference object. When the floodwater level exceeds the height of the reference object, the model's accuracy is likely to be impaired.
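As a hedged illustration of how such a detector could be applied to a traffic camera stream, the sketch below reads frames with OpenCV and passes them to a placeholder detector. The FloodDepthDetector class, its detect method, the weights file, and the stream address are all hypothetical stand-ins, not part of this study's implementation.

```python
# Hypothetical sketch: applying a trained flood-depth detector to a camera stream.
import cv2

class FloodDepthDetector:
    """Placeholder for the trained YOLOv4 flood-depth model (not the actual model)."""
    def __init__(self, weights_path: str):
        self.weights_path = weights_path
    def detect(self, frame):
        # A real implementation would run the network and return
        # (bounding_box, category, confidence) tuples; this stub returns nothing.
        return []

detector = FloodDepthDetector("yolov4_flood_weights.pth")        # hypothetical weights file
cap = cv2.VideoCapture("rtsp://traffic-camera.example/stream")   # placeholder address

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    for box, category, confidence in detector.detect(frame):
        # category would be one of the seven labels, e.g. "Car-handle" or "Per-leg"
        print(category, confidence, box)
cap.release()
```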

Fig. 3 Examples of flood depth recognition

The precision, recall, and AP of the flood recognition from the test set are shown in Table 2. The mAP (mean average precision) of the model's predictions reached 89.29%, demonstrating a satisfactory performance of the model in identifying the submerged depths of the reference objects. The AP for the predictions in the categories Car-roof and Car-pipe was 98.81% and 98.63%, respectively, indicating that the most accurate performance of the flood depth recognition in this study was in identifying water levels reaching a car's roof and exhaust pipe. The AP for the categories "Car-handle" and "Per-leg" was 92.47% and 90.44%, respectively, representing the second-best performances in the flood depth recognition.

Table 2 Results of flood depth recognition by category

The AP of the seven categories and the number of labeled objects in each category are shown in Fig. 4. The dataset consisted of a total of 2,095 objects. Of these, 1,885 objects were used for model training and validation, while the remaining 210 objects were reserved for testing. The category "Per-leg" had the highest number of objects in the training and validation set, with a total of 494 objects, and 67 objects of this category were in the test set. Compared with the Per-leg category, the Car-pipe category had fewer objects, with 342 in the training and validation set and 30 in the test set. However, the AP of Car-pipe was 98.63%, higher than the AP of the Per-leg category, which was only 90.44%. In addition, the Per-waist category had the fewest objects but only the second-lowest AP. The AP was therefore affected not only by the number of objects but also by the feature extraction of the objects. This finding is supported by other studies using the VOC2007 dataset, in which the objects were classified into 20 categories with varying APs. In those studies, objects with relatively clear contours and regular shapes, such as cars and trains, achieved higher recognition APs, whereas irregularly shaped objects, such as humans, cats, and dogs, had lower APs (Wang et al. 2013). This is consistent with the results of this study: when a car was used as the reference object for the flood-submerged depth, the AP of the flood recognition was higher than when the human body was used as the inundated reference, because objects with clear contours and regular shapes are more easily identified and their features more easily extracted in image recognition.

Fig. 4 AP and number of labeled objects of the predictions

4 Discussion

4.1 Performances of Frozen Learning and Regular Learning

The performances of frozen learning and regular learning for inundation depth recognition were compared by tracking the changes in the mAP and the loss during training under each method. Table 3 shows the mAP and loss values at every 50 epochs of the learning process. At 200 epochs, the mAP of regular learning was higher than that of frozen learning, and the loss of regular learning was lower. This indicates that regular learning performed better than frozen learning. In addition, a further improvement in model performance was observed with a combined learning method, which employed frozen learning for the first 90 epochs and switched to regular learning for the subsequent epochs.

Table 3 mAP and loss across different epochs using frozen learning and regular learning

4.2 Impacts of Training Technologies

Three technologies were utilized in the deep learning process for image processing and model training to analyze their impact on enhancing image recognition performance for flood depths. These technologies are Mosaic (Bochkovskiy et al. 2020), label smoothing (Szegedy et al. 2016), and cosine annealing (Loshchilov and Hutter 2016), all of which are currently widely used in image recognition. Mosaic is an image augmentation technique that combines four images through cropping and blending to fully utilize the dataset (Hao and Zhili 2020). Label smoothing is a regularization technique that introduces noise to the labels, helping to mitigate overfitting (Müller et al. 2019). Cosine annealing, characterized by periodic learning rate decreases and resets, helps avoid local optima during training by allowing larger learning rates after each reset (Gotmare et al. 2018).
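Two of these techniques can be enabled with a few lines in PyTorch, as sketched below; the stand-in model, the 0.1 smoothing factor, and the restart period of 50 epochs are placeholder choices rather than the settings used in this study, and Mosaic augmentation is omitted because it operates on the input images rather than the training loop.

```python
# Illustrative sketch: label smoothing via the loss function and a cosine-annealing
# learning-rate schedule with warm restarts.
import torch
import torch.nn as nn

model = nn.Linear(128, 7)                                    # stand-in for the detector
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)         # softens the one-hot targets
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=50)

for epoch in range(200):
    # ... one epoch of training using `criterion` and `optimizer` would run here ...
    scheduler.step()   # learning rate decays along a cosine curve and resets every T_0 epochs
```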

The applicability of these technologies was analyzed by comparing the model mAP across eight scenarios in which the technologies were applied individually or in combination, as detailed in Table 4. In scenario 1, where none of the aforementioned technologies was applied, the mAP was only 67.58%. When Mosaic, label smoothing, and cosine annealing were used separately, the mAP increased to 72.14%, 70.94%, and 70.48%, respectively. The most significant improvement over scenario 1 was observed in scenario 2, where Mosaic alone was used. However, in scenarios 5 and 6, where Mosaic was combined with label smoothing and with cosine annealing, the mAP reached 71.92% and 67.53%, respectively. Thus, the mAP decreased when Mosaic was used in conjunction with label smoothing or cosine annealing, compared with its solitary application. Mosaic alone demonstrated the greatest efficacy in enhancing the mAP for flood depth recognition in this study, in line with prior research on YOLOv4, where Mosaic was shown to enhance the accuracy of image recognition (Bochkovskiy et al. 2020).

Table 4 mAP using different training technologies

4.3 Impacts of Activation Functions

DarknetConv2D serves as the foundational convolutional block in the residual networks. It comprises three fundamental components: convolution, normalization, and the activation function. The activation function is critical, significantly affecting the performance of deep learning models in terms of both accuracy and computational speed. In this study, the effects of two activation functions, Mish and LeakyReLU, were evaluated by measuring the mean average precision (mAP) and frames per second (FPS), where a higher FPS denotes faster image recognition. The results are shown in Table 5. The YOLOv4 model using Mish generally achieved higher mAP than the model using LeakyReLU. However, the FPS of the model with Mish was 30.49, lower than that of the model with LeakyReLU, indicating that Mish operated at a slower speed in image recognition. These findings align with previous research, which indicated that while Mish incurs a greater reduction in detection speed than LeakyReLU, it offers superior accuracy (Wang et al. 2021). Despite these differences, it is noteworthy that both activation functions demonstrate fast prediction speeds, averaging only 32.8 and 28.6 ms per image for Mish and LeakyReLU, respectively.

Table 5 mAP (%) and FPS using the LeakyReLU and Mish functions

5 Conclusion

An image recognition method for detecting urban flood inundation was established in this study by employing the YOLOv4 deep learning network combined with transfer learning techniques. In the images, vehicles and pedestrians submerged by floodwaters were used as reference objects to gauge floodwater depth. For images involving vehicles, the model categorized discernible floodwater levels into four categories: 'Car-none', 'Car-pipe', 'Car-handle', and 'Car-roof', indicating that the floodwater depth in the image was below 5 cm, or had reached the vehicle's exhaust pipe, door handle, or roof, respectively. Similarly, for images involving pedestrians, the model classified the flood depth into three categories: 'Per-none', 'Per-leg', and 'Per-waist', signifying that the floodwater depth was below 5 cm, or had reached a person's leg or waist, respectively. The results demonstrated that the YOLOv4 model's mean average precision (mAP) for flood depth recognition reached 89.29%, with a recognition speed of 30.49 frames per second (FPS).

It was also found that the accuracy of flood depth recognition by YOLOv4 is influenced by the type of reference object submerged. Recognition accuracy was higher when using vehicles as reference objects compared to people, attributable to vehicles' regular shapes and fixed positions, which facilitate easier identification. Furthermore, the study explored the impact of various transfer learning techniques on flood depth recognition. It was found that a hybrid approach, combining frozen learning with regular learning, yielded better results than employing each technique in isolation. Additionally, the implementation of image augmentation with Mosaic technology was effective in enhancing the model's recognition accuracy.

The proposed YOLOv4 model boasts high accuracy and rapid processing speeds in determining flood water depths from images, utilizing vehicles and pedestrians as reference markers. This capacity allows the model to leverage images or video data from existing traffic cameras to ascertain flood depths during storms, thus facilitating the acquisition of on-site, real-time, and continuous water depth measurements. Data on flood depths garnered through this approach can serve a dual purpose: they can be instrumental in the verification and calibration of urban hydrological models, and they can also provide timely flood inundation information essential for early warning systems in urban settings.