IterLUNet: Deep Learning Architecture for Pixel-Wise Crack Detection in Levee Systems

Deep learning has recently been used extensively for crack detection in structural health monitoring settings. However, detecting cracks in levee systems has yet to receive considerable critical attention. Thus, this study presents a novel encoder-decoder-based fully convolutional neural network that automatically detects cracks in levee images at the pixel level. We propose strengthening feature learning by concatenating the decoder and bottleneck feature maps back to the encoder blocks. This added reinforcement in the U-Net-like architecture results in a loop-like structure that exploits all the feature maps from encoders, bottlenecks, and decoders. The proposed architecture, Iterative Loop U-Net (IterLUNet), outperforms state-of-the-art architectures on the levee-system image dataset, improving Intersection over Union (IoU) by 10.32% on average over 10-Fold Cross-Validation (FCV) compared to the baseline U-Net model, and by 11.00%, 7.65%, and 7.43% compared to the recent models MultiResUNet, Attention U-Net, and UNet++, respectively. In addition, IterLUNet has at least 63% fewer trainable parameters than the baseline model, thus requiring less storage for pixel-wise crack detection in AI-based inspection of levee systems.


I. INTRODUCTION
Recent deep learning methods have achieved state-of-the-art results on challenging computer vision problems like image classification, object detection, and image segmentation [1]. The Convolutional Neural Network (CNN or ConvNet) significantly advanced deep-learning methods [2] by introducing three layer types: the convolutional layer as a feature extractor, the activation layer to add non-linearity, and the pooling layer to reduce the spatial dimensions. Consequently, CNNs gained popularity mainly because they automatically extract essential features through successive convolutional layers. Building on the components of the CNN, Long et al. [3] proposed the Fully Convolutional Network (FCN), a breakthrough in deep-learning-based end-to-end image segmentation without fully connected layers. The FCN was then extended to encoder-decoder architectures. The encoders in an encoder-decoder architecture extract features from the images, and the decoders map low-level features from the encoders to an output segmentation mask [3], [4], [5].
Several fully convolutional neural network-based architectures, FCN [3], SegNet [4], U-Net [5], MultiResUNet [6], Attention U-Net [7], and UNet++ [8], have previously been applied to perform semantic, or pixel-wise, segmentation on medical imaging datasets. Among these, U-Net is a widely used encoder-decoder architecture that succeeded as the state-of-the-art model for image segmentation tasks in medical imaging [5]. The commonality among the variants of U-Net-like models is that they have skip connections from the encoder to the decoder to help retrieve any spatial information lost in the down-sampling path of the encoders. Hence, in this paper, we explore the potential of U-Net-like deep learning architectures to improve prediction on limited and complex datasets like levee crack images through two main hypotheses. First, the proposed model, IterLUNet, improves performance by utilizing learned features from the decoder and bottleneck layer to feed the feature maps back to the encoder. Second, U-Net-like architectures can be made deeper without increasing the number of training parameters by applying existing deep-learning concepts to the design of the encoder and decoder blocks. The proposed deep learning architecture directly learns meaningful underlying representations of cracks from the image dataset. Of course, the training process requires a considerable amount of labeled data, which is a challenge in flood control systems where few crack images are available to train and evaluate models, and collecting levee crack images is labor- and time-intensive. In light of this, we aim to develop a deep learning architecture that can be trained on a small labeled dataset and assist during field investigations performed with a handheld device or unmanned aerial vehicles. Furthermore, most deep-learning approaches detect cracks on concrete or asphalt surfaces, predominantly in civil infrastructure. Existing architectures have yet to address the complexities of the levee system's surroundings, where cracks develop on the slopes, crest, concrete floodwalls, and areas near the structure.
Currently, the inspection of flood water control systems is done manually. Field investigators typically walk the site or fly drones to capture images, followed by hours of manual checking for faults [9], [10]. This inspection method is expensive, slow, and laborious. Thus, this research introduces a high-performance, fully automated AI-based inspection solution using an encoder-decoder-based fully convolutional neural network architecture to detect cracks in levee images. To this end, the U-Net model is further improved to address the limitations and intricacies of the levee crack dataset. The contributions of this paper can be summarized as follows:
• With the underlying hypothesis that decoder and bottleneck outputs can reinforce the model's learning, we propose Iterative Loop U-Net (IterLUNet), a combined encoder-decoder and decoder-encoder deep learning model with three different high-performing model components.
• We show that U-Net-like architectures can be constructed deeper and broader to extract relevant features without compromising the model's size by deliberately including powerful contemporary deep-learning concepts.
• We introduce a new benchmark dataset for image segmentation of levee crack images.

II. RELATED WORKS
There is a significant amount of literature on automatic crack detection. Many studies utilize variations of the U-Net [11] architecture and its symmetrical contracting-expanding paths with skip connections [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24]. Likewise, Zou et al. [24] developed DeepCrack, a SegNet-like architecture, to demonstrate that multi-scale convolutional features improve results and model convergence. In DeepCrack, encoder and decoder outputs are connected to build single-scale fused feature maps, and several of these are combined hierarchically to produce a multi-scale fusion map used to compute the loss and the final output mask. These approaches primarily focus on crack detection in structural health monitoring settings, whereas the primary purpose of pixel-wise segmentation in this study is to accurately locate crack regions in the levee from images and, given the image scale, measure their size. The distinctions of our proposed architecture compared to these studies include the added skip connections from the expanding path back to the contracting path. Moreover, the combination of advanced techniques used in the building blocks of the proposed architecture contributes to reduced model size and improved model performance.
Lately, object detection methods have also been used for crack detection in levee systems [25]. The authors in [25] analyzed machine learning and deep learning-based techniques and suggested a lightweight stacking-based model for edge devices like drones. The significant difference in this research is that, unlike in [25], where the authors detected bounding boxes of cracks, the architecture developed in this study uses a pixel-wise annotated levee dataset to perform semantic, or pixel-level, detection of cracks. The pixel-level approach enables precise identification of crack regions on levee systems, a clear advantage over a bounding-box approach.

III. PROPOSED ARCHITECTURE
The baseline architecture, U-Net, is symmetric: a contracting path with encoder blocks, each followed by a max-pooling layer to generate feature vectors, and an expanding path with decoder blocks that upsample the feature space. The feature vectors generated by the encoder blocks contain fine-grained spatial information that is otherwise lost along the contracting path. Therefore, in U-Net, skip connections from the contracting path to the expanding path are constructed by concatenating the feature vector from each encoder to the corresponding decoder, allowing the architecture to propagate spatial information from previous layers while accurately reconstructing the segmentation mask [5]. The fundamental hypothesis behind the design of IterLUNet is that the higher-level features from the expanding path also carry relevant information that could be helpful during training. Thus, the proposed architecture builds connections from the expanding path back from the decoder to the encoder to represent the complexity of cracks.
A deep learning model has a considerable number of parameters to tune during training and requires thousands of training samples to generalize well to unseen data. Training deep learning models with many parameters from scratch is prone to overfitting in real-world semantic segmentation tasks where annotated images are limited, often only in the hundreds. Additionally, a higher number of training parameters increases the model's overall size, making near-real-time segmentation of crack pixels from non-crack pixels infeasible. Hence, depthwise separable convolutions and an iterative loop-like structure are introduced to curb the growing number of parameters and optimize the architecture for higher performance. The decoder and bottleneck feature maps are iteratively concatenated to the encoder's input at the next stage using simple skip connections in a U-like shape, hence the name Iterative Loop U-Net (IterLUNet), as illustrated in Fig. 2.

A. BUILDING BLOCKS
The primary components of IterLUNet are InitialBlock, Squeeze and Excitation (SE) Block, IntermediateBlock, and Iterative Loop Block (IterLBlock). In Fig. 1, substructure A, substructure B, and substructure C depict InitialBlock, IterLBlock, and IntermediateBlock, respectively, which are discussed in detail in the following sections.

1) INITIALBLOCK
In [26], the authors show that the structure of an inception module with factorized asymmetric convolutions does not work well in the early layers. Since IterLUNet trains on a small dataset, using a classic convolution layer in the InitialBlock instead of an inception module helps reduce model complexity. The InitialBlock has a single 3 × 3 convolution with a stride of 1, followed by batch normalization and ReLU activation, as shown in Fig. 1, substructure A. It is the initial convolution block used in the first encoder of every iteration and produces 64 feature maps.
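As a concrete illustration, the following is a minimal Keras sketch of the InitialBlock as described above; the helper name and signature are ours, not from a released implementation.

```python
from tensorflow.keras import layers

def initial_block(x, filters=64):
    """InitialBlock (Fig. 1, substructure A): a single 3x3 convolution
    with stride 1, then batch normalization and ReLU, yielding 64 maps."""
    x = layers.Conv2D(filters, kernel_size=3, strides=1, padding="same",
                      kernel_initializer="he_normal")(x)
    x = layers.BatchNormalization()(x)
    return layers.Activation("relu")(x)
```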

2) SE BLOCK
The skip connections combine low-level and high-level feature maps. Therefore, it is essential to recognize and prioritize meaningful latent representations. Thus, the Squeeze and Excitation (SE) block [27] and its variant, the concurrent spatial and channel SE (csSE) block proposed in [23], are used in the architecture. The SE block squeezes along the spatial domain and excites, or reweights, the channels. The csSE block, in contrast, emphasizes both channel and spatial information. Therefore, the SE and csSE blocks in the architecture recalibrate the feature space spatially and channel-wise, which is one way to optimize the network with only a slight increase in model complexity and computational cost.
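A Keras sketch of the SE and csSE recalibrations, following [27] and [23]; the reduction ratio and the additive fusion of the two branches are our assumptions (element-wise maximum is another common choice in [23]).

```python
from tensorflow.keras import layers

def channel_se(x, ratio=8):
    """Channel SE (cSE): squeeze spatially, then excite (reweight) channels."""
    c = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)             # squeeze: HxWxC -> C
    s = layers.Dense(c // ratio, activation="relu")(s)
    s = layers.Dense(c, activation="sigmoid")(s)       # per-channel weights
    return layers.Multiply()([x, layers.Reshape((1, 1, c))(s)])

def spatial_se(x):
    """Spatial SE (sSE): a 1x1 convolution produces a per-pixel gate."""
    s = layers.Conv2D(1, kernel_size=1, activation="sigmoid")(x)
    return layers.Multiply()([x, s])

def cs_se(x, ratio=8):
    """Concurrent spatial and channel SE (csSE): fuse both recalibrations."""
    return layers.Add()([channel_se(x, ratio), spatial_se(x)])
```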

3) INTERMEDIATEBLOCK
The IntermediateBlock comprises a single Depthwise Separable Convolution followed by a csSE block, as shown in substructure C of Fig. 1. In the Depthwise Separable Convolution (DSC) layer, two cascaded operations generate latent representations of the concatenated intermediate feature maps. The first operation is a 3 × 3 depthwise convolution with a stride of one, dilation of one, and a depth multiplier to perform channel-wise spatial convolution. A 1 × 1 point-wise convolution with stride one follows, together with batch normalization and ELU activation, as shown in Fig. 1. Performance with ELU activation and batch normalization is slightly better and more consistent than with ReLU activation, mostly because ELU avoids the dying-ReLU problem and improves generalization through faster learning [28]. The DSC layer in the IntermediateBlock performs similarly to a traditional convolution layer; however, the layer's significant advantage is that it lowers the number of training parameters. Finally, adding the csSE block after the convolution operations ensures that the concatenated filters are relevant both spatially and channel-wise, adding to the model's performance gain.
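Continuing the sketch, the IntermediateBlock can be expressed with Keras's DepthwiseConv2D plus a pointwise convolution; the output filter count per call is an assumption, and the docstring illustrates the parameter savings.

```python
from tensorflow.keras import layers

def intermediate_block(x, filters):
    """IntermediateBlock (Fig. 1, substructure C): depthwise separable
    convolution, batch normalization, ELU, then csSE recalibration.
    Parameter savings, e.g. 64 -> 128 channels: a standard 3x3 conv needs
    3*3*64*128 = 73,728 weights; the separable form needs only
    3*3*64 (depthwise) + 1*1*64*128 (pointwise) = 8,768."""
    x = layers.DepthwiseConv2D(kernel_size=3, strides=1, dilation_rate=1,
                               depth_multiplier=1, padding="same")(x)
    x = layers.Conv2D(filters, kernel_size=1, strides=1)(x)  # pointwise
    x = layers.BatchNormalization()(x)
    x = layers.Activation("elu")(x)
    return cs_se(x)  # from the SE sketch above
```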

4) ITERATIVE LOOP BLOCK (ITERLBLOCK)
Based on the second hypothesis, we propose the IterLBlock. The balance of width and depth in the proposed architecture is accomplished by managing the number of output filters produced throughout the network and recalibrating the importance of filters for optimal performance. Accordingly, convolutions with larger spatial filters are factorized while retaining a growing number of filters in IterLUNet. The proposed substructure, the Iterative Loop Block (IterLBlock), follows the design principles introduced in [26], factorizing larger filter-sized operations into asymmetric convolutions. The inception-module-like substructure B has 1 × 1, 3 × 3, and 5 × 5 convolutions, as shown in Fig. 1. The 5 × 5 convolution is computationally expensive and slow, so it is replaced with 3 × 3 convolutions, which are further factorized into two asymmetric convolutions, 1 × 3 and 3 × 1. The order of operations is illustrated in Fig. 1, substructure B. Each convolution is followed by a batch normalization layer and ReLU non-linearity. Throughout the network, the batch normalization layer after each convolution adds regularization, reducing the need for dropout layers and thereby helping avoid overfitting on the levee crack dataset.
Substructure B operates as a feature extractor, conceptually similar to a classic convolutional layer. As the network grows deeper, the input to the IterLBlock eventually becomes a higher-dimensional feature vector, since features of different scales and dimensions are concatenated. Such high-dimensional feature vectors are predisposed to exploding during training without advanced computational resources. The IterLBlock therefore adds computational efficiency without compromising the model's performance through two factors. First, the 1 × 1 convolution reduces the dimensionality of the feature vector by compressing channels, making the more expensive 3 × 3 and 5 × 5 convolutions feasible for higher-dimensional input feature vectors. Second, stacking the SE block or its variant after the concatenation in the inception module, as shown in Fig. 1, substructure B, together with batch normalization, recalibrates learning and adds regularization to the network [29].
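A sketch of the IterLBlock along these lines; the per-branch filter split (half/quarter/quarter) is our assumption, chosen so the concatenated output matches the requested width.

```python
from tensorflow.keras import layers

def conv_bn_relu(x, filters, kernel):
    """Convolution, then batch normalization and ReLU, as in substructure B."""
    x = layers.Conv2D(filters, kernel, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.Activation("relu")(x)

def iterl_block(x, filters):
    """IterLBlock (Fig. 1, substructure B): inception-like module with a
    1x1 branch, a factorized 3x3 branch, and the former 5x5 branch built
    from two stacked factorized 3x3s; csSE follows the concatenation."""
    b1 = conv_bn_relu(x, filters // 2, (1, 1))        # 1x1 compression
    b2 = conv_bn_relu(x, filters // 4, (1, 1))        # 3x3 as 1x3 + 3x1
    b2 = conv_bn_relu(b2, filters // 4, (1, 3))
    b2 = conv_bn_relu(b2, filters // 4, (3, 1))
    b3 = conv_bn_relu(x, filters // 4, (1, 1))        # 5x5 via two 3x3s
    for _ in range(2):
        b3 = conv_bn_relu(b3, filters // 4, (1, 3))
        b3 = conv_bn_relu(b3, filters // 4, (3, 1))
    out = layers.Concatenate()([b1, b2, b3])          # width = filters
    return cs_se(out)                                  # SE after concat
```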

B. LOOPS AND ITERATIONS
In IterLUNet, loops are created to support connections from the decoder back to the encoder. As these links increase, the number of encoder-decoder blocks also grows, leading to three iterations that match the output filter numbers of the baseline model. The initial encoder in each iteration uses the InitialBlock, with 64 output feature maps extracted from the input RGB image, whereas the decoders and bottlenecks apply the IterLBlock, as illustrated in Fig. 2. After the first iteration, the pooling layer output is concatenated with the output of the respective expanding path to maintain the spatial dimension of the input feature vector for the succeeding encoder.
The first iteration has a simple U-like structure with one set of encoder-decoder blocks and a bottleneck layer, with total filters {64, 128}. The second iteration begins exploiting the output vectors of the decoder and bottleneck layer of the first iteration. From the second iteration onwards, the number of encoder and decoder blocks increases, and the IntermediateBlock accepts the concatenated feature vectors as input. The number of output filters in the second iteration grows to {64, 128, 256}. In the third iteration, pursuing the same idea of concatenating feature vectors, the output filter numbers in the contracting path become {64, 128, 256, 512}. Finally, a 1 × 1 Conv2D forms the network's final layer, applying a convolution with a sigmoid activation function to the output of the final decoder of the third iteration to generate the segmentation mask. Since the architecture is designed to predict a binary segmentation mask, this final 1 × 1 layer, with sigmoid activation and a single output channel, maps the channels to the crack and background classes. Fig. 3 provides a summary, including the total number of building blocks used in designing IterLUNet.
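Putting the blocks together, the following simplified sketch shows the three iterations and the loop-back concatenation. The exact wiring of every decoder and bottleneck map back to matching-resolution encoders follows Fig. 2 and is condensed here: only the final decoder output of each iteration is looped back in this sketch.

```python
from tensorflow.keras import layers, Input, Model

def encoder_step(x, filters, first):
    """Contracting step: InitialBlock for an iteration's first encoder,
    IterLBlock otherwise, followed by 2x2 max pooling."""
    feat = initial_block(x, filters) if first else iterl_block(x, filters)
    return feat, layers.MaxPooling2D(2)(feat)

def decoder_step(x, skip, filters):
    """Expanding step: upsample, concatenate the encoder skip, IterLBlock."""
    x = layers.Concatenate()([layers.UpSampling2D(2)(x), skip])
    return iterl_block(x, filters)

def build_iterlunet(input_shape=(256, 256, 3)):
    inputs = Input(input_shape)
    x = inputs
    iterations = ((64, 128), (64, 128, 256), (64, 128, 256, 512))
    for it, filter_set in enumerate(iterations):
        *enc_filters, bottleneck = filter_set
        skips = []
        for i, f in enumerate(enc_filters):
            feat, x = encoder_step(x, f, first=(i == 0))
            skips.append(feat)
        x = iterl_block(x, bottleneck)                 # bottleneck
        for f, skip in zip(reversed(enc_filters), reversed(skips)):
            x = decoder_step(x, skip, f)
        if it < len(iterations) - 1:
            # Loop back: fuse the decoder output with the input so the next
            # iteration's encoder sees the learned higher-level features.
            x = intermediate_block(layers.Concatenate()([x, inputs]), 64)
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(x)  # binary mask
    return Model(inputs, outputs)
```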

IV. MATERIALS AND METHODS
A. DATASET
The dataset of levee crack images has been collected over the years by field inspectors of the New Orleans District of the U.S. Army Corps of Engineers (USACE). The collected images contain cracks in the levee's crest, concrete floodwalls, and slopes, and even in areas surrounding the levee system. The images show cracks of different shapes and sizes on diverse backgrounds and surroundings. Fig. 4 (a)-(d) shows sample images with their ground truth. The levee crack dataset was first introduced in [30], comprises 1650 images, and is used to conduct 10-Fold Cross-Validation of the proposed model and to compare it with the latest encoder-decoder-based image segmentation models. We expanded the overall dataset by annotating 101 more levee crack images using the VGG Image Annotator tool [31]. The tool generates a JSON file with the coordinates of manually labeled crack regions, and a Python script converts these coordinates into the corresponding masks of the input images, as sketched below. The separation of training and independent test images was performed manually to distribute samples with as equal representation as possible in both the training and test datasets. Table 1 lists the number of training and independent test images for the different experiments. One of the main reasons for splitting the dataset and conducting several experiments is to assess the robustness of the models trained on the currently available levee dataset and computational resources, further enabling diligent model selection. The datasets are dominated by non-crack pixels: on average, only five percent of the pixels in the original images are crack pixels, and the remaining ninety-five percent are background pixels. To further analyze the robustness of the models, we also used the road crack dataset DeepCrack, proposed by Liu et al. in their crack detection paper [22]. The DeepCrack test dataset has 237 images with their respective masks.
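A minimal sketch of such a conversion script, assuming VIA 2.x's polygon region format; key names vary across VIA versions, and the output naming convention here is illustrative.

```python
import json
from PIL import Image, ImageDraw

def via_json_to_masks(json_path):
    """Rasterize VIA polygon annotations into binary crack masks."""
    with open(json_path) as f:
        annotations = json.load(f)
    for entry in annotations.values():
        image = Image.open(entry["filename"])
        mask = Image.new("L", image.size, 0)           # black background
        draw = ImageDraw.Draw(mask)
        for region in entry["regions"]:
            shape = region["shape_attributes"]
            points = list(zip(shape["all_points_x"], shape["all_points_y"]))
            draw.polygon(points, fill=255)             # crack region -> white
        mask.save(entry["filename"].rsplit(".", 1)[0] + "_mask.png")
```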

B. PRE-PROCESSING
A significant challenge in building a deep learning model for real-world scenarios is maintaining the quality of the training and evaluation datasets. Fig. 4 shows that the dataset has diverse textures and scenes, cracks of different scales, and undefined boundaries, so the deep learning models must be robust enough to generalize on such data. Thus, the pre-processing approach included carefully selecting original images, generating ground truth, applying augmentation techniques [32], and analyzing the performance of the baseline method. Through an iterative approach, the images and augmentation techniques contributing to the model learning process were determined. The twenty-nine selected augmentation techniques include affine, elastic, and pixel-level transformations such as ColorJitter, GaussianBlur, GaussianNoise, OpticalDistortion, and ElasticTransform, to name a few; a sketch of such a pipeline follows below. Through this iterative approach, we identified that not all augmentation techniques contribute to the learning process, especially on a dataset in which background pixels far outnumber the pixels of the object to segment. Therefore, in the sample experiment and experiment 3, only seventeen and twenty-four augmentations, respectively, were applied to the original training images and masks. Additionally, the augmented levee crack images were resized to 256 × 256 due to computational constraints. Table 1 presents the statistics of the datasets for each experiment.
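A hedged sketch of such an augmentation pipeline; the transform names above match the Albumentations library, which we assume here, and the probabilities, magnitudes, and exact subset per experiment are illustrative rather than the paper's settings.

```python
import albumentations as A
import numpy as np

augment = A.Compose([
    A.Affine(scale=(0.9, 1.1), rotate=(-15, 15), p=0.5),  # affine transform
    A.ElasticTransform(p=0.3),                            # elastic transform
    A.ColorJitter(p=0.3),                                 # pixel-level
    A.GaussianBlur(p=0.3),
    A.GaussNoise(p=0.3),       # Albumentations' name for Gaussian noise
    A.OpticalDistortion(p=0.3),
    A.Resize(256, 256),        # final training resolution
])

image = np.zeros((512, 512, 3), dtype=np.uint8)  # placeholder image
mask = np.zeros((512, 512), dtype=np.uint8)      # placeholder ground truth
out = augment(image=image, mask=mask)            # applied jointly to the pair
image_aug, mask_aug = out["image"], out["mask"]
```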

C. LOSS FUNCTIONS AND EVALUATION METRICS
The choice of loss function and evaluation metrics largely determines the training process and the robustness of the models. Pixel accuracy alone cannot reflect the performance of segmentation models. Thus, the models were assessed on their accuracy in locating crack pixels and on overlap scores between the predicted mask and the ground truth. Equations (1), (2), (3), and (4) give the Intersection over Union (IoU) for crack pixels, the Dice Coefficient, the F1 Score, and the Tversky Index as metrics to evaluate semantic segmentation models. The Dice Coefficient in (2) and the F1 Score in (3) act similarly for binary segmentation tasks such as segmenting crack pixels from the background. Based on the Dice Coefficient, the dice loss in (5) attempts to address the class imbalance between crack and non-crack pixels, as the loss function considers only the segmentation region during training [33]. However, it weights false positive and false negative detections equally, which makes dice loss less suitable when the class imbalance in the dataset is high. Therefore, in the experiments, we also introduced another loss function based on the Tversky Index, the focal Tversky loss in (6), to balance precision and recall on highly imbalanced datasets by adjusting the values of α, β, and γ [34]. Here, Y_predicted and Y_groundtruth represent the predicted set of pixels and the ground truth, while TP, FP, and FN denote true positive, false positive, and false negative segmentations of crack pixels. The parameters α = 0.7 and β = 0.3, which sum to 1, penalize the model for FNs and FPs, respectively, and γ = 0.75 controls the non-linearity of the loss. It is evident from Fig. 4 that the levee crack dataset is highly imbalanced, since the percentage of crack pixels is far smaller than that of non-crack pixels. Therefore, to understand the effects of different loss functions, namely the Dice loss in (5), the Binary Cross-Entropy (BCE) loss in (7), the BCE Dice loss, and the Focal Tversky loss in (6), adapted from [33], we further performed experiments on a sample dataset and recorded the evaluation metrics.
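The referenced equations do not survive in this text; the following standard formulations, consistent with the definitions above, are one common way to write them.

```latex
\begin{align}
\mathrm{IoU}_{\mathrm{crack}} &= \frac{TP}{TP + FP + FN} \tag{1}\\
\mathrm{Dice} &= \frac{2\,\lvert Y_{\mathrm{predicted}} \cap Y_{\mathrm{groundtruth}}\rvert}
  {\lvert Y_{\mathrm{predicted}}\rvert + \lvert Y_{\mathrm{groundtruth}}\rvert} \tag{2}\\
\mathrm{F1} &= \frac{2\,TP}{2\,TP + FP + FN} \tag{3}\\
\mathrm{TI} &= \frac{TP}{TP + \alpha\,FN + \beta\,FP} \tag{4}\\
\mathcal{L}_{\mathrm{dice}} &= 1 - \mathrm{Dice} \tag{5}\\
\mathcal{L}_{\mathrm{FT}} &= (1 - \mathrm{TI})^{\gamma} \tag{6}\\
\mathcal{L}_{\mathrm{BCE}} &= -\frac{1}{N}\sum_{i=1}^{N}
  \bigl[\,y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\bigr] \tag{7}
\end{align}
```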

D. EXISTING MODELS
We compared IterLUNet to U-Net [5] as the baseline model and to three advanced methods: MultiResUNet [6], Attention U-Net [7], and UNet++ [8]. These methods implement encoder-decoder concepts and maintain filter numbers {32, 64, 128, 256, 512}, which are the primary reasons for the comparative analysis. Additionally, the selected models are well established in medical image segmentation, where the datasets have irregular shapes and variable sizes of objects with noisy or ill-defined boundaries. Table 2 shows the total number of parameters and Floating-Point Operations per Second (FLOPs) for all models. It can be observed that IterLUNet has seventy percent fewer parameters to train, on average, than the base models. The design of the proposed model significantly reduces the number of training parameters because of the Depthwise Separable Convolution (DSC) layer. Previous research indicates that DSC layers reduce a model's complexity by maintaining fewer parameters than standard CNNs, as shown in Table 2.

FIGURE 6. Comparison of the UNet++ (M4) and IterLUNet (M5) models trained with different loss functions and the IoU Crack achieved for each independent test image example. Each column represents a mask overlaid on the original image. The red-colored mask is the ground truth, and the blue mask is the segmentation mask predicted by the model trained with Focal Tversky loss. Purple-colored, white-colored, and green-colored masks are the segmentation masks predicted by M4 and M5 trained with BCE loss, Dice loss, and BCE Dice loss, respectively.
E. EXPERIMENTAL SETUP
All segmentation models were implemented using the Keras framework and trained on an NVIDIA K80 GPU. The convolutional layers in each model were initialized using He initialization [37], and a batch size of 4 was used. For the 10-Fold CV, the models were trained to minimize binary cross-entropy with logits using an Adam optimizer for 150 epochs. The initial learning rate (LR) was 1e-3 and decayed by a factor of 0.25 every five epochs when the validation F1 score plateaued, down to a minimum of 15e-6. Furthermore, early stopping was included to avoid overfitting during the model's training on each fold.
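A sketch of how this schedule maps onto Keras callbacks; the early-stopping patience, the monitored quantity (the paper monitors the validation F1 score, which would require registering an F1 metric), and the placeholder fold arrays are our assumptions.

```python
import numpy as np
from tensorflow.keras import callbacks, optimizers

model = build_iterlunet()
model.compile(optimizer=optimizers.Adam(learning_rate=1e-3),
              loss="binary_crossentropy",   # BCE on the sigmoid output
              metrics=["accuracy"])         # an F1 metric would be added here

cbs = [
    # Decay the LR by a factor of 0.25 after five epochs of plateau,
    # down to the stated floor of 15e-6.
    callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.25,
                                patience=5, min_lr=15e-6),
    callbacks.EarlyStopping(monitor="val_loss", patience=15,
                            restore_best_weights=True),
]

# Placeholder fold arrays; real folds come from the levee crack dataset.
x_train = np.zeros((8, 256, 256, 3), dtype="float32")
y_train = np.zeros((8, 256, 256, 1), dtype="float32")
x_val = np.zeros((4, 256, 256, 3), dtype="float32")
y_val = np.zeros((4, 256, 256, 1), dtype="float32")

model.fit(x_train, y_train, validation_data=(x_val, y_val),
          batch_size=4, epochs=150, callbacks=cbs)
```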
For the second experiment, fifteen percent of an extended dataset of 3750 augmented images was used to validate and save the best-performing model. All models were trained to minimize dice loss with an Adam optimizer using a batch size of 4. We used an initial LR of 1e-4, reduced on a plateau by a factor of 0.15 every five epochs down to a minimum of 15e-8. Finally, the model with the lowest validation loss over 80 epochs was saved for evaluation on the independent test datasets.
The loss-function experiments were conducted to analyze the effects of the loss functions on highly imbalanced datasets like the levee crack dataset. The training samples were eighty percent of 1750 augmented images, and the remaining twenty percent were used as validation data. The initial learning rate for the Adam optimizer was 2e-3, decreased by fifteen percent after eight epochs of no improvement in validation loss, down to a minimum of 15e-8. We trained IterLUNet, our proposed model, and UNet++, the best among the existing models, for 150 epochs and saved the best model. All the best models were evaluated on 15 independent levee crack images.
Likewise, images and augmentation techniques were carefully selected in the third experiment based on the analysis of results from experiment 1, experiment 2, and the sample experiment on loss functions. The training samples were eighty percent of 2850 augmented images, and the remaining twenty percent were used as validation data. Here, the Focal Tversky loss function was minimized using training hyperparameters similar to those used in the sample experiment on loss functions.

V. RESULTS AND DISCUSSION
A. 10-FOLD CV PERFORMANCE
The trained models were evaluated using a held-out test dataset. The evaluation metrics, mean IoU (mIoU), IoU for crack pixels, and F1 score (F1), were recorded for each fold. Table 3 shows the average metrics, as percentage ratios (%), of 10-Fold Cross-Validation (FCV) and the hold-out test images for all models. The performance of the proposed architecture in terms of the F1 measure is, on average, 7.4% greater than the baseline U-Net (M1) model. Furthermore, the best-performing model from the 10-FCV was also evaluated on an independent levee crack dataset. It is observed in Fig. 5 that MultiResUNet (M2) detected non-crack pixels better than crack pixels, despite its higher mIoU. Both Attention U-Net (M3) and UNet++ (M4) performed well on the independent levee crack images while generating segmentation masks, as shown in Fig. 5.
Nevertheless, IterLUNet consistently achieved impressive IoU and showed superiority over all the latest models in complex backgrounds. The proposed model detected the boundaries of the cracks more precisely, while the other models struggled to do so. For each architecture, the best-performing model with the lowest gap between training and validation dice coefficient was selected for evaluation on the independent test dataset. As shown in Fig. 5, the results indicate that pixel-wise prediction of cracks on completely independent test data is relatively low for all models; every model had difficulty locating crack pixels in some images. Given the limited size of the levee crack dataset, the ten independent test images did not adequately represent the training and validation images. The challenge was also due to differences in the distribution of crack regions, shapes, and background texture between the independent levee crack dataset and the training data. Additional original images with well-defined crack areas are required to yield a robust, high-performing model. This is the primary reason for performing augmentation and 10-Fold CV to show the need for a robust architecture that generalizes well to unseen levee crack images.

B. ANALYSIS OF LOSS FUNCTIONS
The levee crack dataset is highly class-imbalanced, as the crack pixels to be segmented are a tiny percentage compared to the background pixels. From the analysis of model performance in the 10-Fold CV and experiment 1, we observe that the models learn to classify non-crack pixels better than crack pixels. Therefore, understanding the effect of the objective function used during training is crucial. In Table 4, A, B, C, and D are BCE loss, Dice loss, BCE Dice loss, and Focal Tversky loss, respectively. All models, U-Net (M1), the base model; UNet++ (M4), the best-performing model among U-Net-based models; and IterLUNet (M5), the proposed model, were trained on the sample dataset to minimize these loss (objective) functions. Fig. 6 illustrates a performance comparison between the IterLUNet and UNet++ models trained with the different loss functions. Fig. 6 also emphasizes that IterLUNet regularly performs well across the different experimental setups.
BCE loss, a distribution-based log loss, measures the closeness of the predicted pixels to the actual pixels and penalizes accordingly. The other loss functions, in contrast, are region-based and directly try to maximize their respective evaluation metrics. Table 4 illustrates that the models trained with the Focal Tversky loss function provide a better balance of precision and recall.
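For reference, a minimal sketch of the focal Tversky loss from (6) as it could be implemented for training in Keras/TensorFlow; the smoothing constant is our addition for numerical stability.

```python
import tensorflow as tf

def focal_tversky_loss(alpha=0.7, beta=0.3, gamma=0.75, smooth=1e-6):
    """Focal Tversky loss: alpha > beta penalizes false negatives more
    than false positives; gamma focuses training on harder examples."""
    def loss(y_true, y_pred):
        y_true = tf.reshape(tf.cast(y_true, tf.float32), [-1])
        y_pred = tf.reshape(y_pred, [-1])
        tp = tf.reduce_sum(y_true * y_pred)          # soft true positives
        fn = tf.reduce_sum(y_true * (1.0 - y_pred))  # soft false negatives
        fp = tf.reduce_sum((1.0 - y_true) * y_pred)  # soft false positives
        ti = (tp + smooth) / (tp + alpha * fn + beta * fp + smooth)
        return tf.pow(1.0 - ti, gamma)
    return loss

# usage: model.compile(optimizer="adam", loss=focal_tversky_loss())
```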

C. COMPARATIVE PERFORMANCE ANALYSIS
The comparative performance analysis includes results and evaluation from experiment 2 and experiment 3. All architectures were trained on augmented images and evaluated on two independent test datasets. Table 5 shows the metrics on the independent levee crack test datasets for experiment 2. The proposed model, IterLUNet, outperformed the baseline architecture and the three latest best-performing models. We noticed that increasing the number of original crack images and their ground truth improved the models' performance. Fig. 7 depicts the proposed model's training and validation dice-loss and dice-coefficient curves over 80 epochs for experiment 2. With the decreasing gap between training and validation metrics, the complexity of the proposed model fits the levee crack dataset well. The curves also hint that, since the dice loss is still decreasing, more training epochs could lead to better results.
DeepCrack [22], a public benchmark dataset for evaluating road crack detection systems, is used as an out-of-domain evaluation dataset to assess the models trained on the levee crack dataset. The differences in the predicted segmentation masks overlaid on the original images are shown in Fig. 8 and Fig. 9 for the independent levee crack dataset and the DeepCrack benchmark test dataset, respectively; Table 5 and Table 6 show the metrics. The outcomes indicate that IterLUNet consistently predicts cracks and has better detection ability on unseen images. It can also be observed from Fig. 8 that the models trained on the levee crack dataset are robust in predicting crack regions on highly textured backgrounds and in blurred or unclear images. Together, these results show that the proposed architecture better predicts crack boundaries and shapes.
The most striking finding of this experiment was that IterLUNet is capable of separating the region of interest even from rough backgrounds, as observed in Fig. 8 and Fig. 9. Correspondingly, because of the inception-like module reinforced by the SE block, the IterLBlock can better focus on crack regions, as witnessed in rows 5, 6, and 8 of Fig. 9. Furthermore, the proposed model achieves a better balance of precision and recall, avoiding missed detections of true cracks that could have devastating consequences. Since a model with a higher recall, or true positive rate, is crucial in an automatic crack detection system, such a model can diminish the misidentification of crack pixels, leading toward an AI-based inspection solution.

VI. CONCLUSION
This paper experimentally established that extending an encoder-decoder architecture by connecting the decoder and bottleneck outputs back to the encoder increases model performance. We also demonstrated that an inception-like module, using only informative channel and spatial features through squeeze and excitation block variations, enhances the model's ability to focus on the regions to detect. Accordingly, we proposed an encoder-decoder-based fully convolutional neural network architecture, IterLUNet, to automatically detect cracks on levees using a pixel-wise segmentation approach. Additionally, a benchmark dataset of levee crack images with corresponding ground-truth segmentation masks was introduced. On this dataset, the proposed architecture achieved substantial increases in Dice Coefficient and IoU, validating our hypotheses experimentally, and outperformed all the advanced architectures in terms of 10-Fold CV metrics and metrics on independent test datasets, despite having nearly 63% fewer training parameters. Thus, the proposed concept may help improve overall IoU across semantic segmentation tasks. Code and data are available here.