Intelligent Detection Method for Concrete Dam Surface Cracks Based on Two-Stage Transfer Learning

Abstract: The timely identification and detection of surface cracks in concrete dams, an important public safety infrastructure, is of great significance in predicting engineering hazards and ensuring dam safety. Due to their low efficiency and accuracy, manual detection methods are gradually being replaced by computer vision techniques, and deep learning semantic segmentation methods have higher accuracy and robustness than traditional image methods. However, the lack of data images and insufficient detection performance remain challenges in concrete dam surface crack detection scenarios. Therefore, this paper proposes an intelligent detection method for concrete dam surface cracks based on two-stage transfer learning. First, relevant domain knowledge is transferred to the target domain using two-stage transfer learning (cross-domain and intradomain learning), allowing the model to be fully trained with a small dataset. Second, the segmentation capability is enhanced by using residual network 50 (ResNet50) as the feature extraction network of the UNet model to strengthen crack feature information extraction. Finally, multilayer parallel residual attention (MPR) is integrated into the jump connection path of the UNet model to improve the focus on critical information for clearer crack edge segmentation. The results show that the proposed method achieves the best mIoU and mPA, 88.3% and 92.7%, respectively, among many advanced semantic segmentation models. Compared with the benchmark UNet model, the proposed method improves mIoU and mPA by 4.6% and 3.2%, respectively, reduces FLOPs by 36.7%, and improves inference speed by 48.9%. It shows better segmentation performance on dam face crack images, with a low miss detection rate for fine cracks and clear crack edge segmentation, and achieves an accuracy of over 85.7% in crack area prediction. In summary, the proposed method has higher efficiency and accuracy in concrete dam face crack detection, with greater robustness, and can provide a better alternative or complementary approach to dam safety inspections than the benchmark UNet model.


Introduction
Concrete dams are an important infrastructure for developing and utilising hydropower and water resources in human society, and they play an extensive role in promoting sustainable socioeconomic development and ensuring energy security. However, concrete dams are exposed to atmospheric and water environments for long periods and are subjected to external effects such as water erosion, temperature changes, dry and wet conditions, and freezing and thawing [1,2]. Thus, defects inevitably appear on their surfaces [3], with cracks being the most significant factor threatening dam safety and stability [4]. Cracks not only exist on dam surfaces but, if left untreated, will also extend to the inside of the dam, affecting the strength and service life of the dam and eventually causing safety accidents such as leakage and collapse [5]. Therefore, timely identification and detection of cracks on concrete dam surfaces are of great importance in predicting engineering hazards and ensuring dam safety. Due to the low efficiency and accuracy of manual detection and the weak adaptability of traditional image segmentation methods, an intelligent detection method for concrete dam surface cracks based on two-stage transfer learning is proposed, which can achieve accurate and intelligent dam crack segmentation from unmanned aerial vehicle (UAV) images.
To identify the safety hazards posed by cracks in concrete dams, early crack detection was mainly manual: inspectors surveyed cracks using auxiliary tools such as binoculars [6] or worked from manual hanging baskets close to the dam surface [7]. However, manual detection is costly, susceptible to subjective influences, inefficient, inaccurate, and dangerous [8]. Recently, researchers have been investigating dynamic sensor-based crack detection, which, together with online singular spectrum analysis with real-time eigen perturbations, can make a great contribution to crack monitoring [9]. Real-time single-sensor online damage monitoring technology based on eigen perturbation has developed rapidly. This method can process data online and detect damage in structures in real time, improve the accuracy of damage assessment by generating new data through eigen perturbation, and monitor the deformation and direction of cracks in real time, giving it temporal and spatial advantages in crack identification [10,11]. However, information must still be collected before real-time single-sensor and singular spectrum analysis monitoring based on eigen perturbation can be employed [12,13]: cracks must be located in time, and information such as their specific shape and area must be obtained to better judge the crack hazard level and plan repairs. The greatest difference between dam surface cracks and other surface cracks is their spatial distribution. Other cracks are mainly located on land, where instruments and equipment can conveniently be used directly for close-range inspection. Information related to cracks in dams can be obtained efficiently and accurately by processing UAV images with computer vision technology [14,15].
In recent years, crack image segmentation methods based on computer vision have been researched and can be divided into traditional image segmentation methods and deep learning semantic segmentation methods [16,17]. Traditional image segmentation methods mainly use low-order visual information, such as the lower grey value of crack pixels relative to the background [18], together with image binarization, image filters, and numerical images, to achieve segmentation for crack recognition [19-21]. For example, Talab et al. [22] first converted an image to greyscale and then classified the background and foreground in the greyscale image using a suitable threshold to segment out the cracks. Zhou and Liu [23] used a threshold segmentation algorithm to extract concrete cracks, which was more effective in identifying concrete with no colour difference on the surface. Cho et al. [24] identified and classified crack candidate pixels, then performed filtering to remove noise, and finally searched, filled, and thresholded the crack region to detect cracks. Compared with manual direct detection, traditional image segmentation methods offer a certain degree of safety and feasibility; however, when there is considerable noise in the background, they are vulnerable to interference from external factors and their segmentation results are not accurate [25]. Moreover, traditional segmentation methods involve many steps for crack detection and require humans to continuously adjust the parameters to adapt to the segmentation scene, which results in low detection efficiency [26]. Because these methods cannot meet practical engineering needs in terms of accuracy and efficiency, an accurate and efficient intelligent crack segmentation method must be established.
Compared with traditional image segmentation methods based on low-order visual information, the higher-order visual information used in deep learning semantic segmentation methods has higher accuracy and robustness [27]. Semantic segmentation algorithms classify each image pixel by using higher-order visual information to fully consider the association between each pixel point [28], thus allowing for more efficient and accurate segmentation of image classes.
Existing studies show that semantic segmentation algorithms are feasible for crack detection. Jiang et al. [29] proposed a concrete crack image recognition method based on the UNet model to segment cracks and then used the maximum inscribed circle algorithm to calculate crack width. Jian et al. [30] used the DeepLabV3 model on the public dataset Crack500, and the test set IoU was 67%, which is relatively low. However, most studies are experimental studies of cracks in housing buildings, pavement, and bridges using open datasets [31-33], while there are fewer studies on methods for concrete dam crack detection. Tang et al. [34] proposed a visual crack width measurement method: after crack image segmentation using the UNet model, an improved image refinement algorithm and width measurement scheme simplify the redundant data in crack images and improve crack shape estimation efficiency. Huang et al. [35] proposed an improved DeepLabV3+ network for hydraulic concrete crack segmentation but could not handle fine crack segmentation. Chen et al. [36] proposed A_DCDNet for dam crack detection based on an FCNN, which detects cracks efficiently, but the model is too coarse for crack edge segmentation. Wang et al. [37] constructed a dam crack dataset and used the benchmark model SegNet to achieve dam crack segmentation, but some cracks were missed. Cheng et al. [38] proposed an improved U2-Net model for embankment crack detection with good detection accuracy, but the dataset is small, and the model is prone to underfitting and inadequate training if trained directly.
Therefore, if the existing methods are directly applied to concrete dam surface crack detection, the following problems remain: (1) Concrete dam cracks differ from cracks in public datasets in distribution and pixel characteristics; thus, public datasets are not suitable as training datasets for concrete dam crack detection (see Figure 1). The cracks in public datasets are more obvious, while concrete dam cracks are more obscure, small, and low in contrast, so public crack datasets cannot be used directly to expand the dam face crack images; doing so would change the probability distribution of the training set data and reduce the recognition accuracy and edge segmentation precision. Additionally, the small sample size of the concrete dam crack dataset leads to inadequate model training and missed detections. (2) In concrete dam images, cracks occupy few pixels, while the background takes up the vast majority of the image, resulting in inadequate extraction of crack feature information and weak segmentation of fine cracks. (3) Cracks in actual projects are more complex, and existing algorithms have low robustness and pixel accuracy when detecting cracks in concrete dams because of uneven lighting and the low contrast between cracks and the surrounding environment, which makes the segmented edges overly smooth.
Water 2023, 15, x FOR PEER REVIEW

Figure 1. The first row shows crack images from public datasets [31,39], and the second row shows images of cracks in concrete dam surfaces.

To solve the above problems and improve concrete dam crack detection efficiency and accuracy, this paper explores an intelligent detection method for concrete dam surface cracks based on two-stage transfer learning. A two-stage transfer learning strategy is designed to fully train the model, and accuracy is improved by using the ResNet50 network as the backbone feature extraction network to fully extract the dam face crack feature information. Multilayer parallel residual attention (MPR) is constructed and added to improve the model robustness and edge segmentation performance. We conducted experiments on a self-built dam surface crack dataset, with UAVs collecting images at 3-10 m from the dam surface and crack widths between 3 and 20 mm, to verify the correctness and feasibility of the proposed method. The experimental results show that the mIoU, mPA, and frames per second (FPS) of the proposed method are 88.3%, 92.7%, and 36.5, respectively, showing the best performance in dam surface crack segmentation.
The main contributions of this paper are as follows:
1. An intelligent detection method for concrete dam surface cracks based on two-stage transfer learning is proposed to share parameters and features through cross-domain and intradomain learning. The proposed approach alleviates the low segmentation accuracy caused by the lack of concrete dam crack data, which leads to inadequate model training and overfitting, and reduces missed detections of fine cracks.
2. The ResNet50 network is used as the backbone feature extraction network to enhance crack feature information extraction by the UNet encoder and improve the ability to segment fine cracks.
3. The designed multilayer parallel residual attention (MPR) is integrated into the jump connection path of the UNet model to suppress the interference of extraneous regions on crack segmentation, improve pixel accuracy, and make crack edge contours clearer.

Deep Learning-Based Crack Detection
Deep learning currently has two popular directions: object detection methods and semantic segmentation methods. Object detection methods are usually of two main types: (1) two-stage detection [40] and (2) one-stage detection [41]. The two-stage detection approach first generates a fixed-size feature map for each candidate region using an RoI pooling layer and then obtains the result using bounding box regression; it is accurate but computationally intensive. The one-stage detection method predicts the object class and bounding box in a single step, with high computational efficiency. Huang et al. [42] detected cracks in dams with an improved YOLOX and achieved high accuracy. Min et al. [43] proposed an improved YOLOv4 to achieve accurate bridge crack identification. Object detection is accurate and efficient and enables both identification of the target type and localization of the target; however, it does not recover the specific contours of the detected target.
Deep learning semantic segmentation uses convolutional neural networks to understand the real-world objects represented by each pixel in an image, that is, to segment the different objects in an image at the pixel level so that a specific outline of the target can be obtained. Because the specific shape and area of cracks are needed to better determine the crack hazard level and to plan repairs, this paper uses semantic segmentation for crack detection in concrete dams. In 2015, the fully convolutional network (FCN) was proposed [44], a milestone in deep learning image segmentation. FCN applies a softmax function on top of a convolutional layer to classify each pixel and thereby complete fine image segmentation. However, FCN cannot use global scene category information, which causes feature loss. In the same year, Ronneberger et al. proposed the UNet model [45] based on the FCN architecture, which modified the FCN encoder and decoder to obtain better performance with fewer training data, although its feature extraction capability is not optimal. To solve the problem that FCN cannot utilise global scene information, Zhao et al. proposed the pyramid scene parsing network (PSPNet) [46] in 2017, which enables semantic segmentation models to fully access contextual relationships through a pyramid structure. In 2018, Chen et al. proposed an encoder-decoder structure with atrous separable convolution for semantic image segmentation (DeepLabV3+) [47], which achieves excellent segmentation results; however, it is computationally intensive and may lead to long training times. Subsequently, a number of excellent semantic segmentation models emerged [48] and were applied to crack segmentation in various scenarios [49]. Liu et al. [50] used the DeepLabV3+ model for crack detection in ageing buildings and bridges to obtain information on crack changes in buildings. Rill-García et al. [51] trained an improved fully convolutional neural network (FCNN) on the public CrackForest dataset and achieved road crack detection but with low segmentation accuracy. Liu et al. [52] used Swin-UNet to perform experiments on a public fracture dataset and obtained fracture contours efficiently, but the IoU was 70%, which is relatively low. In summary, although semantic segmentation achieves better performance for crack detection than traditional image segmentation, targeted optimisation is needed for concrete dam surface crack detection, where cracks are obscure and datasets are lacking.

Transfer Learning-Based Crack Detection
Deep learning algorithms require large datasets to support training, and insufficient data can seriously affect crack feature extraction and image detection. Transfer learning is one of the most effective and practical solutions to this limitation [53]. Transfer learning transfers knowledge learned in one domain (the source domain) to another domain (the target domain). In general, transfer learning can obtain relatively good test results even with small samples of labelled data [54]. For example, Fan et al. [3] proposed a transfer learning-based underwater dam crack image segmentation model that can accurately segment underwater dam crack images with a small dataset. Li et al. [55] improved crack detection accuracy by sharing model parameters in the source domain through a transfer learning approach. In theory, transfer learning can be performed between any domains, but the more relevant the pretraining dataset is to the target task dataset, the better the prior knowledge provided by the pretrained model and the better the knowledge transfer [56]. Therefore, this paper proposes a two-stage transfer learning strategy to enable better transfer learning.

Crack Detection Backbone Network
Concrete dam surface cracks are more obscure and minute, and feature information extraction is difficult; therefore, it is necessary to make targeted changes to the model backbone feature extraction network to enhance the encoder information extraction capability. The dominant backbone network structures in current crack recognition are AlexNet [57], VGG [58], and ResNet [59]. Dung and Anh [60] replaced the backbone feature extraction network of the FCN model with a VGG network. The model achieved a comprehensive evaluation index, average accuracy of 90%, and can predict concrete cracks more accurately than the original model. Based on relevant research experience, network depth is crucial for visual recognition tasks; however, in practical experiments, it has been found that with the superposition of convolutional and pooling layers, the network learning effectiveness decreases and the error rate increases [61]. With the introduction of residual networks [59], it is possible to reach a deeper network level with a low classification error rate. Residual networks are widely used for this reason. Zhao et al. [62] addressed the low recognition accuracy problem of the UNet model by using ResNet18 as the backbone network to enhance the feature extraction capability of the network. Xu et al. [63] used the ResNet34 residual network as a model encoder to better extract crack detail information. Considering the stronger feature extraction capability of ResNet, this paper replaces the UNet backbone feature extraction network with ResNet50 to increase the depth of the feature extraction network and improve the segmentation accuracy of the model.

Crack Detection Based on Attention Mechanisms
In computer vision, attention mechanisms can mimic the human visual system, thus assigning different weights according to the importance of different content to improve attention to key information and enhance the network feature representation [64]. Yu et al. [65] addressed the low accuracy problem of existing deep learning road crack detection methods and improved the model's ability to detect cracks by embedding an attention mechanism in the UNet model. Given the application flexibility of the attention mechanism and the excellent performance that can be achieved in image processing and inspired by the residual module [59] and the atrous-spatial pyramid pooling (ASPP) module in DeepLabV3+ [47], an MPR attention mechanism was designed and integrated into the jump connection path of UNet to increase the model's attention to crack feature information, reduce the effect of noise on crack edge segmentation results, and improve the model's pixel segmentation accuracy.

Model Framework
The UNet model [45], as one of the most commonly used semantic segmentation models for deep learning, has the advantages of being lightweight and easy to deploy and can achieve good segmentation results even for small-scale dataset training; therefore, the UNet model is chosen for concrete dam surface crack detection in this paper. The UNet model can be divided into two parts: the encoder and the decoder. The encoder is used to extract features from the input image to gradually obtain higher-order semantic feature information in the image. The decoder is used to upsample and convolve the input feature map to gradually recover the image size and the number of feature map channels. Finally, the classification of pixels in the image is predicted. To enhance the information transfer between the encoder and decoder, the UNet model is designed with a jump connection so that the high-resolution shallow information output from the corresponding stage of the encoder is fed directly to the decoder, supplementing some of the feature information lost due to downsampling or convolution.
To further improve the segmentation ability and edge recognition accuracy of the UNet model for obscure and small cracks and to make the segmentation of concrete dam surface cracks more accurate, the UNet model is improved in this paper; the structure of the improved model is shown in Figure 2. The specific improvement measures are as follows: (1) A ResNet50 network is built as the feature extraction network of the UNet model encoder. The number of network layers and extraction capacity are deepened to fully obtain the crack feature information through the residual module so that the model can effectively learn the deep features of the dam face cracks and improve the crack segmentation model accuracy.
(2) MPR attention added to the jump connection layer is designed to enhance the feature representation of the model by acquiring more semantic information, which not only suppresses feature responses in irrelevant regions and increases the importance of effective feature information channels but also allows the network to focus on crack feature information, complementing the loss of detail and enabling the model to segment the dam face crack images more accurately.

ResNet50 Backbone Network
Appropriately increasing the network depth can strengthen the feature extraction ability of the model and obtain more deep semantic information, but researchers have found that as the network model deepens, training becomes increasingly difficult and encounters problems such as vanishing and exploding gradients. To effectively solve the degradation problem caused by deepening the network while strengthening crack feature information extraction by the UNet backbone network and improving crack detection accuracy, the ResNet50 backbone network, shown in Table 1, is used as the UNet encoder in this paper. Each residual module in the ResNet50 network consists of three stacked convolutional layers of 1 × 1, 3 × 3, and 1 × 1, and a residual (shortcut) structure is added to retain some of the shallow information, which helps prevent the model degradation problem as the network deepens. The residual module has 2 types, as shown in Figure 3. If the number of input feature map channels matches the number of output channels, the residual module is type A. If the number of input feature map channels does not match the number of output channels, the residual structure needs to perform a 1 × 1 convolution to adjust the number of channels, and the residual module is type B.
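The two module types can be summarised with the standard residual formulation of the ResNet paper [59]. The symbols below are notation introduced here for illustration and are not taken from Figure 3: $\mathcal{F}$ stands for the stacked 1 × 1, 3 × 3, and 1 × 1 convolutions, and $W_s$ for the 1 × 1 convolution on the shortcut. The activation applied after the addition is omitted.

```latex
% Module A: input and output channel counts match (identity shortcut)
y = \mathcal{F}(x) + x

% Module B: channel counts differ (1 x 1 convolution W_s on the shortcut)
y = \mathcal{F}(x) + W_s x
```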


Multilayer Parallel Residual Attention
MPR is designed to address the problem that cracks in the image provide little information and the pixel segmentation accuracy is low. MPR is added to the jump connection so that the decoder can obtain more important semantic information for learning and increase the edge segmentation accuracy. The MPR consists of 4 parts (Figure 4): the skip layer, the 1 × 1 convolution layer, the 3 × 3 convolution layer, and the atrous convolution layer with a dilation rate of 3. The 4 parts are computed in parallel, and the main code is shown in Table 2. The skip layer passes the input feature map x through unchanged and adds it directly to the outputs F_i(x) of the convolution branches. This step retains the feature information of the previous layer in the feature map y of the next layer to protect information integrity. The 1 × 1 and 3 × 3 convolution layers allow the model to perform feature learning under multiscale convolution, extracting deep semantic information and increasing the nonlinear capability of the model. The atrous convolution layer uses a dilation rate of 3 to give the model a larger receptive field for the same number of parameters and computational effort, which helps to recover the information missing at crack edges. Finally, the outputs of the four parts are summed to increase the amount of crack information in each feature map dimension, thus reducing the influence of irrelevant feature information on subsequent crack identification results.
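The parallel summation and the enlarged receptive field can be written compactly as follows. This is a sketch under stated assumptions rather than the formulation in Table 2: the branch subscripts are notation introduced here, and the atrous branch is assumed to use a 3 × 3 kernel, which the text does not state.

```latex
% Sum of the four parallel branches: skip, 1 x 1, 3 x 3, atrous (dilation 3)
y = x + F_{1\times 1}(x) + F_{3\times 3}(x) + F_{\mathrm{atrous},\,d=3}(x)

% Effective kernel size of a k x k convolution with dilation rate d
k_{\mathrm{eff}} = k + (k-1)(d-1), \qquad k=3,\ d=3 \;\Rightarrow\; k_{\mathrm{eff}} = 7
```

Under this assumption, the atrous branch covers a 7 × 7 neighbourhood while keeping the 9 weights of an ordinary 3 × 3 kernel.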

Two-Stage Transfer Learning Strategy
Semantic segmentation is a type of supervised learning that requires a large quantity of labelled training data, which is not easy to obtain for concrete dam surface cracks in complex environments. The labour cost of acquisition and data labelling is high, and the lack of data easily leads to inadequate model training and low detection accuracy. Transfer learning methods have been developed to solve the problem of insufficient training data for deep learning [66]. However, if the difference between the two domains is particularly large, the results obtained by directly adopting transfer learning are often poor. Therefore, this paper designs a two-stage learning strategy for knowledge transfer to improve the low detection accuracy and poor modelling results on small sample datasets. The two-stage transfer learning training process is shown in Figure 5.
[...] the existing labelled data samples in the source domain can be better utilised for classification training in the new space, which solves the poor accuracy problem caused by sparse labelled dam face crack images.
Both phases were trained for 300 epochs. Because the data, tasks, and models of the first and second phases are similar, a freeze training step was added to the second-phase transfer training to speed up training and prevent the weights from being corrupted: the backbone network was frozen for the first 150 epochs, during which only the decoding classifier was trained, and the whole network was unfrozen for the remaining 150 epochs. The two-stage transfer learning strategy reduced the model training time and resulted in a highly accurate and robust model for intelligent concrete dam surface crack detection.
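The freeze training schedule can be sketched as a small helper that reports which parts of the network are updated at a given epoch. This is a minimal illustration, not the authors' training code: the function name `trainable_parts` and the dictionary keys are hypothetical, and the absence of freezing in the first phase is inferred from the statement that the freeze step was added to the second phase.

```python
from typing import Dict

TOTAL_EPOCHS = 300    # epochs per training phase, as stated in the text
FREEZE_EPOCHS = 150   # second-phase epochs with the backbone frozen


def trainable_parts(phase: int, epoch: int) -> Dict[str, bool]:
    """Return which parts of the network receive gradient updates.

    phase: 1 (cross-domain) or 2 (intradomain); epoch: 0-based index.
    """
    if phase not in (1, 2):
        raise ValueError("phase must be 1 or 2")
    if not 0 <= epoch < TOTAL_EPOCHS:
        raise ValueError("epoch out of range")
    # Assumption: only phase two uses freeze training; phase one trains all layers.
    backbone_frozen = phase == 2 and epoch < FREEZE_EPOCHS
    return {"backbone": not backbone_frozen, "decoder": True}
```

In a PyTorch training loop, such a flag would typically be applied by setting `requires_grad` on the backbone parameters before each epoch.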

Experimental Setup
The experiments in this paper were run on the Ubuntu 18.04 operating system with an AMD EPYC 7543 32-core CPU and a GeForce RTX 3090 GPU with 24 GB of video memory. The programming language was Python 3.8, and network training used CUDA 11.1 and the PyTorch 1.8.1 deep learning framework. The specific model training parameters are shown in Table 3.

Datasets
The dataset used for the experiments in this paper has three parts: the PASCAL VOC 2012 dataset, the first-stage cross-domain training crack dataset (DatasetA), and the second-stage target domain concrete dam surface crack image dataset (DatasetB). PASCAL VOC 2012 is a public benchmark dataset from a computer vision challenge covering image classification, detection, and semantic segmentation; it consists of 1464 training images and 1449 validation images and contains 20 categories plus 1 background class. DatasetA consists of 2 parts. The first part is derived from the publicly available datasets CrackForest [30], SDNET2018 [24], and Aft Original Crack DataSet Second [26]. Because the distribution and pixel characteristics of cracks on concrete dam surfaces are very different from those in the open-source datasets, only the images with fine and obscure cracks were selected from the 3 publicly available datasets, and those with similarity greater than 0.7 were excluded to improve the dataset quality; 280 images were finally obtained. The second part was collected by us from actual concrete road cracks: 700 small crack images were added to increase the realism of the dataset and bring it closer to the actual engineering context. After applying the Retinex image enhancement algorithm [67] and adjusting image luminosity, contrast, and spatial variation (random rotation, flip), DatasetA was expanded to 3152 images and used as the target domain dataset for cross-domain training in the first phase of the model. The selected open dataset images and road crack images are shown in Figure 6.
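For a segmentation dataset, spatial augmentations such as rotation and flipping must be applied identically to an image and its pixel-level label mask, otherwise the annotation no longer matches the image. The sketch below illustrates this on tiny nested lists. It is not the authors' pipeline: the function names are hypothetical, and the 50% flip probability and the restriction to multiples of 90 degrees are assumptions, since the text only says "random rotation, flip".

```python
import random
from typing import List, Tuple

Grid = List[List[int]]  # a tiny stand-in for a 2D image or label mask


def hflip(grid: Grid) -> Grid:
    """Mirror each row left to right."""
    return [row[::-1] for row in grid]


def rot90(grid: Grid) -> Grid:
    """Rotate the grid 90 degrees clockwise."""
    return [list(col) for col in zip(*grid[::-1])]


def augment_pair(image: Grid, mask: Grid, rng: random.Random) -> Tuple[Grid, Grid]:
    """Apply the same random flip and rotation to an image and its mask."""
    if rng.random() < 0.5:  # assumed flip probability
        image, mask = hflip(image), hflip(mask)
    for _ in range(rng.randrange(4)):  # 0, 90, 180, or 270 degrees
        image, mask = rot90(image), rot90(mask)
    return image, mask
```

Photometric changes such as luminosity and contrast adjustment, by contrast, are applied to the image only, because they do not move any pixels.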

The second-stage target domain dataset, DatasetB, is our self-built concrete dam surface crack dataset. A DJI Mavic 3 UAV was used to photograph several concrete dams on the upper Jinsha River in China to obtain concrete dam crack image data. The shooting distance was 3 to 10 m, and collection focused on image diversity (different angles, background conditions, and light intensity). A total of 350 images were collected, with crack widths between 3 and 20 mm. To ensure that the dataset can adapt to different complex environments and to make the model more generalisable and robust, the images were annotated and then expanded to 1393 images using Retinex enhancement, adjustment of image brightness and contrast, and spatial transformations.
The intelligent detection method for concrete dam surface cracks in this paper uses supervised learning. To ensure the quality of the dataset and the accuracy of the target information, the images actually collected for DatasetA and DatasetB were manually annotated at the pixel level using the LabelMe annotation tool. Finally, DatasetB was randomly divided into a training set and a validation set at a ratio of 8:2.
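A random 8:2 split of this kind can be sketched in a few lines. This is a minimal illustration, not the authors' script: the file names, the fixed seed, and the helper name `split_dataset` are assumptions introduced here.

```python
import random

def split_dataset(items, train_ratio=0.8, seed=0):
    """Shuffle a copy of `items` and split it into (train, val) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)      # seeded for a reproducible split
    n_train = int(len(shuffled) * train_ratio)
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical file names standing in for the 1393 DatasetB images.
image_names = [f"dam_crack_{i:04d}.jpg" for i in range(1393)]
train_set, val_set = split_dataset(image_names)
print(len(train_set), len(val_set))
```

The text does not say whether augmented copies of one photograph were kept on the same side of the split; if that matters, the same helper can be applied to source-image identifiers instead of individual files.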

Evaluation Indicators
To objectively evaluate the performance of different models, this paper uses the following typical evaluation metrics: mean intersection over union (mIoU), mean pixel accuracy (mPA), number of parameters, floating point operations (FLOPs), and frames per second (FPS). Intersection over union (IoU) is the overlap rate between the predicted mask and the ground-truth mask. mIoU is the arithmetic mean of the per-category IoU values and assesses how precise the overall image segmentation is. Pixel accuracy (PA) is the proportion of correctly segmented pixels, and mPA is the arithmetic mean of the per-class PA values, which assesses the global accuracy of the model. The number of parameters measures the model size, FLOPs measure the model complexity, i.e., the amount of computation, and FPS measures the inference speed of the model.
In these definitions, k + 1 denotes the number of categories to be predicted plus one background class, and p_ij denotes the number of pixels of category i that are predicted to be in category j. Therefore, p_ii counts true positives, p_ij (i ≠ j) counts false negatives, and p_ji (i ≠ j) counts false positives.
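The two accuracy metrics can be computed directly from a confusion matrix whose entry p[i][j] counts pixels of category i predicted as category j. The sketch below uses the standard definitions, IoU_i = p_ii / (sum_j p_ij + sum_j p_ji - p_ii) and PA_i = p_ii / sum_j p_ij, averaged over the k + 1 classes. The function names and the toy masks are illustrative assumptions, as is the convention of scoring an absent class as 0.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """p[i][j] = number of pixels of true class i predicted as class j."""
    p = [[0] * num_classes for _ in range(num_classes)]
    for t, q in zip(y_true, y_pred):
        p[t][q] += 1
    return p

def miou_mpa(p):
    """Return (mIoU, mPA), averaging per-class IoU and PA over all classes."""
    n = len(p)
    ious, pas = [], []
    for i in range(n):
        tp = p[i][i]
        row = sum(p[i])                        # pixels whose true class is i
        col = sum(p[j][i] for j in range(n))   # pixels predicted as class i
        union = row + col - tp
        ious.append(tp / union if union else 0.0)   # assumed: absent class scores 0
        pas.append(tp / row if row else 0.0)
    return sum(ious) / n, sum(pas) / n

# Toy flattened masks: 0 = background, 1 = crack.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]
p = confusion_matrix(y_true, y_pred, num_classes=2)
miou, mpa = miou_mpa(p)
print(p, round(miou, 3), round(mpa, 3))
```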

Ablation Experiments
To verify the effectiveness of the improvement strategies in this paper, eight sets of ablation experiments were run on the same dataset and in the same experimental environment to measure the impact of each strategy on concrete dam surface crack extraction accuracy. Table 4 shows that crack segmentation performance improved under every strategy. With ResNet50 as the backbone feature extraction network, mIoU and mPA improved by 1.2 and 1.3 percentage points, respectively, compared with the UNet model, the model computations were reduced by 64.1%, and the inference speed increased by 128.1%. This indicates that the ResNet50 backbone can obtain more crack feature information and segment the foreground crack region more accurately while also reducing the complexity of the UNet model and speeding up inference. With MPR attention embedded in the UNet model, mIoU and mPA improved by 1.1 and 1.6 percentage points, respectively, compared with the UNet model. The model's ability to extract important semantic information was enhanced and some useless background feature information was suppressed, which improved pixel segmentation accuracy and made edge details more complete, at the cost of longer detection time. With the two-stage transfer learning strategy, both mIoU and mPA improved substantially, showing that the method can effectively alleviate the overfitting and undertraining caused by the small dataset, reduce fragmentation of dam surface cracks, and reduce missed detections. Overall, the model performed best when the three improvements were combined. The mIoU and mPA of the proposed method were 88.3% and 92.7%, respectively, 4.6 and 3.2 percentage points higher than those of the UNet model. The computation required by the proposed method was 36.7% lower than that of the UNet model, and its inference speed was 48.9% higher than that of the baseline UNet model. This demonstrates that the method achieves better concrete dam surface crack segmentation performance. Although the larger number of parameters increases the memory taken up by the model weights, the memory footprint had no direct effect on the concrete dam surface crack segmentation in this paper.
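The comparisons above involve two different kinds of change: the mIoU and mPA gains are absolute differences between two percentages, whereas the FLOPs and inference speed figures are relative changes. The sketch below makes the distinction concrete using only numbers stated in this paper (88.3%, 92.7%, 36.5 FPS, and the reported gains). The baseline values it prints are derived from those figures for illustration and are not read from Table 4.

```python
def percentage_point_change(new_pct, old_pct):
    """Absolute difference between two percentages."""
    return new_pct - old_pct

def relative_change_pct(new, old):
    """Relative change of `new` with respect to `old`, in percent."""
    return (new - old) / old * 100.0

# Values reported for the proposed method.
miou_new, mpa_new, fps_new = 88.3, 92.7, 36.5

# Baseline UNet values implied by the stated gains (derived, not reported).
miou_base = miou_new - 4.6                 # 4.6 percentage points lower
mpa_base = mpa_new - 3.2                   # 3.2 percentage points lower
fps_base = fps_new / (1 + 48.9 / 100)      # speed is stated as 48.9% higher

print(round(miou_base, 1), round(mpa_base, 1), round(fps_base, 1))
print(round(relative_change_pct(miou_new, miou_base), 2))   # mIoU gain as a relative change
```

Read this way, a 4.6 percentage point gain in mIoU corresponds to a relative improvement of roughly 5.5% over the implied baseline, which is why the two units should not be interchanged.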

Comparative Trials of Different Attention Mechanisms
To show the advantages of the MPR attention proposed in this paper, Table 5 gives the experimental results of different attention mechanisms on model detection ability under the same experimental environment and dataset. The table shows that SE [68] attention had the least effect on the model inference speed but did not significantly improve model accuracy. Compared with SE attention, the popular convolutional block attention module (CBAM) [69] achieved a more pronounced improvement in crack IoU and PA, indicating that CBAM attention is more capable of extracting fine targets. The improvement in crack accuracy from adding CA [70] attention was negligible. There was a small decrease in accuracy with the addition of the lightweight efficient channel attention (ECA) [71] module, with a 0.3% decrease in mIoU and a 0.7% decrease in mPA, indicating that ECA is less suitable for fine crack detection and that attention is not beneficial in every scenario. Compared with other commonly used attention mechanisms, the MPR attention designed in this paper is more effective in improving accuracy and is more suitable for detecting concrete dam surface cracks. Although MPR has some impact on the inference speed, the improvement in detection accuracy is more important for ensuring dam safety and stability, so the trade-off is worthwhile.

Table 6 shows the effect of different transfer learning approaches on the segmentation accuracy of the model. It is clear from Table 6 that the cross-domain transfer learning approach achieved the smallest improvement in training accuracy for the target domain, with mIoU and mPA improving by only 0.6% and 0.3%, demonstrating that if the difference between the two domains is large, the pretrained model does little to improve the target domain. Compared with cross-domain transfer, the intradomain transfer learning approach achieved a more significant improvement in target domain accuracy, with the IoU and PA of the cracks improving by 1.9% and 1.5%, respectively, indicating that better target domain learning can be achieved through intradomain feature knowledge transfer. Rather than single-stage transfer learning, this paper used a two-stage transfer learning approach combining cross-domain and intradomain learning for optimal results. First, cross-domain training was performed: the model pretrained on the out-of-domain PASCAL VOC 2012 dataset was trained on the in-domain DatasetA, which avoids training from scratch and yields a better in-domain pretrained model. Then, this pretrained model was used as the starting model for the target domain to achieve the best transfer learning effect. The final IoU and PA of the cracks improved by 5.6% and 2.9%, respectively. This indicates that, in transfer learning, the accuracy improvement depends not only on the difference in data distribution between the source and target domains but also on the quality of the pretrained source model, which also verifies the effectiveness of the two-stage transfer learning strategy in this paper.
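The staging idea described above, in which the weights from each training stage initialise the next, can be illustrated with a deliberately tiny stand-in. In the sketch below a two-parameter linear model replaces the segmentation network, and three synthetic regression tasks stand in for PASCAL VOC 2012, DatasetA, and DatasetB. All data, hyperparameters, and names are assumptions made for illustration; this is not the authors' training code.

```python
def train(weights, data, epochs, lr=0.5):
    """Full-batch gradient descent on mean squared error for y = w * x + b."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return (w, b)

def mse(weights, data):
    """Mean squared error of the linear model on `data`."""
    w, b = weights
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

xs = [0.0, 0.25, 0.5, 0.75, 1.0]

def make_domain(slope, intercept):
    """Synthetic task: points on the line y = slope * x + intercept."""
    return [(x, slope * x + intercept) for x in xs]

out_of_domain = make_domain(1.0, 0.0)   # stand-in for PASCAL VOC 2012
in_domain = make_domain(2.0, 0.5)       # stand-in for DatasetA
target = make_domain(2.2, 0.6)          # stand-in for DatasetB

few_epochs = 5                          # short training budget on the target
from_scratch = train((0.0, 0.0), target, few_epochs)

pretrained = train((0.0, 0.0), out_of_domain, 200)   # generic pretraining
stage1 = train(pretrained, in_domain, 200)           # stage 1: cross-domain
two_stage = train(stage1, target, few_epochs)        # stage 2: intradomain

print(mse(from_scratch, target), mse(two_stage, target))
```

With the same short training budget on the target task, the model that starts from the stage 1 weights ends with a lower error than the one trained from scratch, because its starting point is already close to the target solution. The toy only illustrates the warm-start mechanism and says nothing about the magnitude of the gains reported in Table 6.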

Crack Segmentation Image Comparison
For a more intuitive analysis and to demonstrate that the proposed method has greater crack segmentation capability, the concrete dam surface crack detection results of the UNet model, UNet + ResNet50, UNet + ResNet50 + MPR, and the method in this paper are visualised in this section. The visualisation of the crack segmentation for the different models is shown in Figure 7. The selected dam face crack images were complex and contained obscure, very fine cracks; therefore, segmentation was challenging. The UNet benchmark model in Figure 7 roughly extracted the dam face crack contours, but the extraction was incomplete and weak for fine cracks, leading to serious missed detections and noise, and the segmented crack edges were over-smoothed and coarse, with low segmentation accuracy. With the ResNet50 backbone network added to UNet, the model's ability to extract features from the image was enhanced and the segmentation accuracy improved, allowing the outline of fine cracks to be roughly extracted. However, breaks and noise remained, and the edge details needed to be improved. Adding MPR attention to UNet + ResNet50 reduced the interference of background regions on subsequent predictions, focused the model more on crack regions, and enhanced fine crack extraction. This improved the accuracy of crack edge shape segmentation, with clearer contours and better detail retention than the UNet model, but fragmented segmentation and missed detections were still present. After combining the three improvement strategies, the proposed method achieved the most complete dam face crack segmentation, which was closest to the manual annotation, and showed strong segmentation ability and robustness for cracks of different scales. These results show that the two-stage transfer learning strategy can compensate for the undersegmentation and overfitting caused by the small quantity of data, can effectively reduce the missed detections caused by thin cracks with few effective pixels, and can render edge details more clearly and completely. The combined subjective and objective results show that the proposed method exhibits the best performance in intelligent concrete dam surface crack segmentation and shows greater robustness in segmenting complex and obscure fine cracks.

Table 7 shows the training evaluation results of this paper's model and the current typical semantic segmentation models UNet [28], PSPNet [29], DeepLabV3+ [30], and SegFormer [31] in the same experimental environment and with the same dataset. In Table 7, the DeepLabV3+ model had the lowest mIoU and mPA, with only 71.6% and 75.0%, respectively, followed by PSPNet, with both models having lower segmentation values than expected. This indicates that large networks such as DeepLabV3+ can easily overfit when the sample size is insufficient. The better segmentation accuracy of the UNet and SegFormer models shows that these models retain a stable segmentation ability even on small datasets and in complex scenarios. Overall, the accuracy indexes of the proposed method were the best, and the IoU and PA of cracks improved significantly compared with the UNet model before improvement, reaching 76.9% and 85.6%, respectively. This indicates that the segmentation accuracy of the improved method for concrete dam surface cracks is better than that of the other models and that the method is more suitable for concrete dam surface crack detection.
Figure 8 shows the change curves of the ablation experimental metrics. It can be observed that the mIoU and mPA values of each experimental model increased with the number of iterations, and the training loss decreased with the number of iterations. The models that did not use the transfer learning strategy were trained for 260 iterations before the fluctuation in each metric gradually stabilised. After that, the mIoU, mPA, and training loss curves were parallel to the X-axis, indicating that the training could end. The improvement in mIoU and mPA and the decrease in loss values were evident in the models trained with the transfer learning strategy: mIoU and mPA increased rapidly in the first 10 rounds of training and converged smoothly in the later rounds, with the corresponding loss values dropping to much lower values. In summary, with the transfer learning strategy, the segmentation accuracy on small datasets can be improved through knowledge and feature sharing, and high accuracy can be achieved in a short training time. Therefore, transfer learning can be applied not only to detection tasks with small datasets but also to scenarios with limited computational power for model training, to achieve fast training results, reduce the number of training rounds, and save computational resources.

Figure 8 shows the metric values for each generation of weights obtained during training; from left to right: mIoU, mPA, and loss. The red circle in the figure marks the model metrics for the last 10 generations.
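The stopping criterion described for Figure 8, a curve that has become parallel to the X-axis, can be approximated by checking whether the most recent metric values stay within a small band. The sketch below is one simple way to express that check. The window of 10 mirrors the last 10 generations highlighted in the figure, while the tolerance, the function name, and the synthetic curve are assumptions.

```python
def has_plateaued(values, window=10, tol=0.1):
    """True if the last `window` values vary by no more than `tol`."""
    if len(values) < window:
        return False
    recent = values[-window:]
    return max(recent) - min(recent) <= tol

# Synthetic mIoU curve in percent: a fast rise followed by a flat tail.
curve = [60 + 2.5 * i for i in range(10)] + [85.0] * 10

print(has_plateaued(curve[:12]), has_plateaued(curve))
```

The same check can be applied to mPA or to the training loss; stricter criteria are possible but are not described in the text.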

Crack Area Pixel Value Prediction
Tables 4, 6 and 7 and Figure 7 show that the proposed method performs best in automatic concrete dam surface crack segmentation, segments complex and obscure small cracks more accurately, and can accurately and efficiently detect crack location and shape. This subsection therefore performs a pixel-level statistical analysis of the cracks segmented by this paper's method to predict the pixel count of the crack area in each image. Table 8 gives the crack area prediction statistics of the proposed method; the selected images are the same as those segmented in Figure 7. Table 8 shows that the crack area prediction accuracy of the proposed method is above 85.7%, and the prediction accuracy for relatively obvious crack areas can exceed 90%. The test results show that, in addition to fast and accurate crack identification and segmentation, the proposed crack segmentation method can also accurately obtain crack area information. In summary, UAV images can be fed into the proposed method to efficiently and accurately identify and segment concrete dam surface cracks; the pixel counts of the segmented crack areas can then be derived, and the actual area of the concrete dam cracks can be obtained by converting pixels to physical units [72]. The crack hazard level is judged from the area obtained, and the appropriate material is then selected for remediation. This enables better targeted repair during dam health diagnosis and addresses the low efficiency and accuracy of manual measurement while avoiding material waste.

Figure 9a,b gives the segmentation results of the method in this paper on dam crack images without annotation. The method effectively detects the general outline and location of the cracks in the picture and can accurately segment different cracks, which demonstrates the effectiveness and reliability of the method in this paper for detecting cracks in actual dams. Figure 9c,d shows dam crack images whose environmental backgrounds differ from those in DatasetB. These images are also recognised well, probably because the transfer learning process pretrains the weights on crack images from similar environments, so crack segmentation in other environmental backgrounds also achieves good robustness.

Table 4 reflects that the combined method in this paper achieves the best overall capability. Using ResNet50 as the feature extraction network improved the mIoU and inference speed, but the improvement in mPA was not significant. The mPA measures the proportion of pixels correctly segmented by the model, so this indicates that there was still room for improvement in the model's segmentation and edge recognition capabilities. The experiments found that the mPA improvement was more obvious after adding MPR to the UNet model. Therefore, the two improvements were combined, but the missed detection of fine cracks remained, most likely because the small sample size of the dataset left the model insufficiently trained. After the improved UNet model was trained using the two-stage transfer strategy, a significant reduction in model misses was observed in Figure 5, demonstrating that transfer learning is very effective in helping training when using small datasets.
The method ultimately combines good inference speed with the highest segmentation accuracy, but it also has a large number of parameters and a large model size, which limits its portability and makes it less suitable for detection in mobile scenarios. In the future, the model can be slimmed down by pruning the convolutional layers or by replacing the model encoder with a lightweight feature extraction network to reduce the parameters and computations, so that the model can also be deployed on mobile and embedded devices such as Jetson.
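The pixel-to-area step described in the crack area prediction subsection can be sketched as follows. This is a minimal illustration under stated assumptions: the ground resolution `mm_per_pixel` is a hypothetical value, the conversion actually used follows [72] and is not reproduced here, and the accuracy is assumed to be one minus the relative error of the predicted pixel count, which the text does not define explicitly.

```python
def crack_pixel_count(mask):
    """Count pixels labelled 1 (crack) in a 2-D list mask."""
    return sum(sum(1 for v in row if v == 1) for row in mask)

def pixel_area_to_mm2(pixel_count, mm_per_pixel):
    """Convert a pixel count to physical area, assuming square pixels."""
    return pixel_count * mm_per_pixel ** 2

def area_accuracy(pred_pixels, true_pixels):
    """Assumed definition: 1 minus the relative error of the predicted area."""
    return 1.0 - abs(pred_pixels - true_pixels) / true_pixels

# Toy masks: 0 = background, 1 = crack.
true_mask = [
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
]
pred_mask = [
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1],
]

true_px = crack_pixel_count(true_mask)
pred_px = crack_pixel_count(pred_mask)
mm_per_pixel = 2.0   # hypothetical ground resolution of the UAV image
print(pred_px, true_px, area_accuracy(pred_px, true_px))
print(pixel_area_to_mm2(pred_px, mm_per_pixel))
```

In practice the ground resolution depends on the shooting distance and camera parameters, so it would need to be determined per image before the conversion is meaningful.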

Conclusions
To improve the efficiency and accuracy of concrete dam surface crack detection, an intelligent detection method based on two-stage transfer learning is proposed, which achieves accurate and efficient segmentation of UAV images of concrete dam surface cracks. Compared with other typical semantic segmentation models, the proposed method has the best overall capability and better performance in detecting cracks in concrete dams.
After training with two-stage transfer learning, the proposed method alleviates the inadequate training caused by small-scale dam crack datasets, reduces missed detections of fine cracks, and can achieve high accuracy in fewer training rounds. Additionally, using ResNet50 as the UNet feature extraction network extracts crack feature information more fully and improves the model's segmentation capability. Finally, adding MPR to the skip connection path significantly improves pixel accuracy and makes crack edge segmentation finer and more complete.
The experimental results show that the mIoU and mPA of the proposed method reach 88.3% and 92.7%, respectively, and the FPS is 36.5. The proposed method segments cracks on concrete dam surfaces more efficiently and accurately and achieves an accuracy of over 85.7% in predicting the crack area, which supports safer and more reliable dam crack inspection.