U-Net-Based CNN Architecture for Road Crack Segmentation

Abstract: Many studies on the semantic segmentation of cracks using machine learning (ML) techniques can be found in the relevant literature. To date, the results obtained are quite good, but the accuracy of the trained models is often evaluated using traditional metrics only, and in most cases the goal is merely to detect the occurrence of cracks. Particular attention should be paid to the thickness of the segmented crack since, in road pavement maintenance, the width of the crack is the main parameter and is the one that characterizes the severity levels. The aim of our study is to optimize the crack segmentation process through the implementation of a modified U-Net model-based algorithm. For this, the Crack500 dataset is used, and the results are compared with those obtained from the U-Net algorithm that is currently found to be the most accurate and performant in the literature. The results are promising and accurate, as the findings on the shape and width of the segmented cracks are very close to reality.


Introduction
Every year, government authorities all around the world check and analyze the quality and performance of roads in order to detect possible road safety issues [1]. Good road conditions are the most important factor for safe driving and traffic conditions. The deterioration of road conditions in the form of cracks significantly degrades the quality and performance of roads and causes a significant decrease in traffic safety [2].
Cracks are a typical type of pavement distress that can compromise the safety of roads and highways. Localizing and repairing cracks is a critical task for the transportation maintenance department in order to keep the roads in excellent condition. Crack identification is an important part of the work.
The manual, labor-intensive, and subjective nature of the traditional task of road crack segmentation by expert professionals makes it highly time-consuming. Therefore, designing and building an automated and effective road crack segmentation system is extremely useful and valuable [3].
The identification of cracks can be carried out on 3D or 2D data and is aimed at assessing their severity levels. Based on the severity levels, the management entity can draw up an efficient maintenance plan [4].
The acquisition of 3D data involves using expensive and complex systems based on LiDAR technology and, in most cases, requires time-consuming and costly processing, particularly when the number of datapoints acquired is large [5]. The acquisition of 2D data, such as images, involves the use of lower-cost systems and significantly reduced processing times [6]. This has driven many researchers to study high-performance algorithms for identifying and segmenting cracks from images [7].
The automated road crack segmentation technique efficiently identifies road cracks and helps qualified technicians to evaluate the road's performance objectively, as well as helps the relevant departments to keep roads in good condition and extend their service life. Given that automated survey systems can identify different types of road cracks at high speed in various situations, even in adverse weather conditions, it seems clear that introducing an intelligent pavement repair technology might lead to more effective outcomes.
In recent years, there have been many achievements in road crack image segmentation algorithms based on computer vision (CV) techniques [8]. Computer vision enables a machine to learn from digital photos and videos and helps it to recognize characteristics and patterns in visual data [9].
However, traditional CV techniques have weak generalization and adaptation abilities and heavily rely on the quality of the pictures. Additionally, the complex environment of the road surfaces results in poor camera settings with issues including low contrast, inconsistent lighting, and significant noise, making it challenging to build an efficient detection model using conventional CV approaches [10].
Progress in the identification of cracks was made thanks to deep learning (DL) techniques. Deep learning, a branch of artificial intelligence (AI), has been very successful in semantic segmentation. The semantic segmentation technique can meet the goals of crack segmentation, as it predicts a classification label for each pixel. Convolutional neural networks (CNNs), a key subfield in DL, provide promising results in the pixel-level detection of target objects in noisy images [11,12].
The CNN's end-to-end segmentation method is another benefit, as the segmentation process requires much less human involvement. Compared to the manually crafted features used in conventional approaches, the features learned by convolutional neural networks offer better representational performance. Leveraging this deep learning capability, several efforts have been devoted to developing reliable feature representations for segmenting images of road cracks. Deep neural networks use these features to identify whether or not there are cracks in the patches of an image by combining the pixel-level classification confidence from several frames under various lighting conditions. The literature shows how well deep networks have worked for detecting pavement cracks using computer vision.
Lei Zhang et al. [6] proposed a crack detection method in which the discriminative features are learned directly from raw image patches using the ConvNets. Allen Zhang et al. [13] used two different CNNs for segmentation. Their approach was to first extract the relevant features and then input the same data into a different CNN. Haifeng Li et al. [14] developed a model employing a windowed intensity path technique to segment the extracted candidate cracks using a multivariate hypothesis test.
Tong et al. [15] also developed another two-stage CNN-based model to detect asphalt pavement crack length. The results presented show that the training strategy used produces an increase in accuracy. Fan et al. [16] used a trained CNN to model crack detection as a multi-label classification problem. Jenkins et al. [17] and Nguyen et al. [18] used a U-Net-based architecture for a semantic pixel-wise segmentation of road and pavement surface cracks. The CNNs proposed by Baoxian Li et al. [19] were used to classify crack patches into five categories based on 3D pavement images. They used WayLink's PaveVision3D Ultra to acquire 3D images.
König et al. [20] presented a novel surface crack segmentation method using an encoder-decoder DL architecture based on a U-Net network, and reported that its performance improves when pretrained encoder networks are inserted.
Fan et al. [21] demonstrated the use of Deep CNNs to detect and recognize cracks as defects with quantifiable properties in applications for crack detection on pavement surfaces (e.g., crack length and size). In a separate paper, the authors proposed a modified version of the U-Net in which two modules were added to the overall architecture to increase the performance of crack segmentation: the dilation module and the convolution and hierarchical feature learning module [22].
Almost all papers found in the relevant literature confirmed the performance of DL models under mostly ideal conditions with the absence of noise, obstacles, shadows, and overexposed areas, and sometimes without considering any rolling shutter effects.
However, such effects can definitely affect the final results of the segmentation process; thus, a growing research interest lies in those DL models and methodologies that maximize the accuracy of segmentation to have an appropriate confidence margin even in cases of poor-quality images analyzed by the trained model.
As a case in point, An et al. [23] highlighted that the integrated use of optical and thermal images in DL models improves crack detection in the presence of shadow, rust, dust, etc.
Other approaches are based on region-based classification or object detection [24][25][26], enabling improved classification capabilities in cases where images are acquired under poor conditions. An in-depth and detailed review of the scientific literature on the use of DL techniques for the analysis of distress and cracks in structures and infrastructures is found in [27].
All this confirms that the focus on developing a methodology that maximizes the accuracy of the DL model is a current and widely studied issue in the research.
This study proposes a technique for semantic crack image segmentation based on a residual structure built on the U-Net architecture with a ResNet50 encoder. Without first identifying a region of interest, this approach can automatically segment the cracks in a pavement image with a complex background by independently learning the characteristics of the cracks and obtaining additional feature information.

Methods and Dataset
The methodology we propose here aims to carry out the segmentation of cracks on road pavement from images acquired with commercial cameras using deep learning CNN techniques implemented in a Python environment.
The dataset used is the one proposed by Yang et al. [28]; the model implemented uses a modified U-Net to improve the automatic crack segmentation process. The proposed methodology is based on the following steps:

• Use of the Crack500 dataset [28];
• Data augmentation on the dataset;
• Implementation of the U-Net model with a ResNet50 encoder pretrained on ImageNet;
• Model training;
• Metrics analysis.
A comparison will be made with the main results reported in the literature, particularly with those obtained by Lau et al. [29]. Figure 1 shows the workflow of the proposed methodology.

Dataset
The Crack500 dataset was proposed by Yang et al. [28] and contains images acquired around the main campus of Temple University (Philadelphia, PA, USA) using mobile phones. It initially consisted of 500 images of varying sizes, which are around 2000 × 1500 pixels. Each crack image has a pixel-level annotated binary map. In this study, it was divided into 250 training samples, 200 test samples, and 50 validation samples.
To make the training process more efficient, a set of data augmentation procedures was conducted on the data, a technique developed to reduce overfitting [30]. Using Python, each image was cropped into six image regions overlapped by 80 pixels, flipped horizontally, and then rotated in 90-degree steps, from 0 degrees up to 270 degrees (Figure 2). As a result, all images and the corresponding masks were obtained at a size of 512 × 512 pixels.
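The augmentation steps above can be sketched as follows; the 3 × 2 arrangement of the six overlapping crops and the `augment` helper are illustrative assumptions (the paper states only that six regions overlap by 80 pixels), not the authors' actual code.

```python
import numpy as np

def augment(image, crop=512, overlap=80, n_cols=3, n_rows=2):
    """Produce augmented patches from one image: six overlapping crops,
    each kept as-is or horizontally flipped, each rotated by
    0/90/180/270 degrees. The 3x2 tiling is an assumption."""
    stride = crop - overlap  # 432-pixel step between crop origins
    patches = []
    for r in range(n_rows):
        for c in range(n_cols):
            y, x = r * stride, c * stride
            region = image[y:y + crop, x:x + crop]
            for flipped in (region, np.fliplr(region)):
                for k in range(4):  # rotations in 90-degree steps
                    patches.append(np.rot90(flipped, k))
    return patches

# Dummy image at the approximate Crack500 size (~2000 x 1500 pixels)
patches = augment(np.zeros((1500, 2000), dtype=np.uint8))
print(len(patches))      # 48 patches per source image
print(patches[0].shape)  # (512, 512)
```

With 6 crops × 2 flip states × 4 rotations, each source image yields 48 patches, which is consistent with the 250 training images expanding to the 12,000 training samples reported in the Results section.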

The Algorithm
The network architecture we propose is a U-Net-based architecture with a ResNet50 encoder. The U-Net was presented by Ronneberger et al. at the MICCAI conference in 2015 [31]. The U-Net is a U-shaped convolutional neural network originally used in the field of medical image segmentation. It has two symmetrical branches and is considered an encoder-decoder network structure. The architecture of the U-Net is shown in Figure 3. A ResNet50 encoder pretrained on the ImageNet dataset [32] was utilized in this encoder-decoder design; the pretrained encoder allows the model to converge quickly. The input image is passed into the pretrained ResNet50 encoder, whose fundamental building blocks are a set of residual blocks. These residual blocks help the encoder extract the relevant features from the input image, and these features are then sent to the decoder. The decoder starts with a transpose convolution that upscales the input feature maps into the proper shape. The corresponding feature maps from the pretrained encoder are then concatenated with these upscaled feature maps using skip connections. By helping the model recover the low-level semantic information from the encoder, these skip connections enable the decoder to produce the necessary feature maps. Two 3 × 3 convolution layers follow, each with a batch normalization layer and a ReLU non-linearity after it. The output of the final decoder block is sent into a 1 × 1 convolution layer, which is then fed into a sigmoid activation function to produce the binary mask.
The architecture of the proposed algorithm is shown in Figure 4. The model uses the Adam optimizer [33] with an initial learning rate of 0.0001, reduced by a factor of 0.1 every 4 epochs, and cross-entropy is established as its loss function. The network converges in 20 epochs. We implemented the network in Python using TensorFlow/Keras. The workstation used to train the neural network has a TITAN X GPU (12 GB VRAM), an Intel Core i7 processor, and 32 GB of RAM.
In our work, the net was trained using the training pairs D = {(x(i), y(i))}, where x(i) is the i-th image patch and y(i) ∈ {0, 1} is the corresponding class label.
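A minimal TensorFlow/Keras sketch of this encoder-decoder design is given below. The skip-connection layer names are those of `tf.keras.applications.ResNet50`; the decoder filter widths and the `build_unet_resnet50` helper are illustrative assumptions, since the paper does not list the exact decoder configuration. Setting `weights="imagenet"` would reproduce the pretrained setting described above.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def decoder_block(x, skip, filters):
    # Transpose convolution upscales; the encoder skip features are
    # concatenated; two 3x3 conv + batch norm + ReLU layers follow.
    x = layers.Conv2DTranspose(filters, 2, strides=2, padding="same")(x)
    x = layers.Concatenate()([x, skip])
    for _ in range(2):
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    return x

def build_unet_resnet50(input_shape=(512, 512, 3), weights=None):
    # weights="imagenet" matches the pretrained encoder used in the paper
    encoder = tf.keras.applications.ResNet50(
        include_top=False, weights=weights, input_shape=input_shape)
    skips = [encoder.get_layer(n).output for n in (
        "conv1_relu",          # 256 x 256
        "conv2_block3_out",    # 128 x 128
        "conv3_block4_out",    # 64 x 64
        "conv4_block6_out")]   # 32 x 32
    x = encoder.output         # 16 x 16 bottleneck
    for skip, f in zip(reversed(skips), (512, 256, 128, 64)):
        x = decoder_block(x, skip, f)
    x = layers.Conv2DTranspose(32, 2, strides=2, padding="same")(x)  # back to 512
    out = layers.Conv2D(1, 1, activation="sigmoid")(x)  # binary crack mask
    return Model(encoder.input, out)

model = build_unet_resnet50()
print(model.output_shape)  # (None, 512, 512, 1)
```

Training would then compile this model with `tf.keras.optimizers.Adam(1e-4)` and binary cross-entropy loss, with the step decay applied via a learning-rate scheduler, as described in the text.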

Evaluating the Segmentation Model
Two common evaluation metrics were utilized to assess the suggested approach in order to objectively estimate the performance of the network model. The F1 score and Intersection over Union (IoU) are the conventional quantitative evaluation metrics utilized in our research. F1 is the combination of Precision and Recall and is computed as the harmonic mean of the two quantities [34]. Precision (P) is the proportion of correctly classified observations per predicted class, whereas Recall (R), or Sensitivity, measures the percentage of actual positives that are correctly identified.
Often, there is an inverse relationship between Precision and Recall: when Precision increases, model sensitivity worsens, and vice versa. For these reasons, it is important to find a balance between the two indicators to obtain a model that best fits the input data. The formulas used are:

P = TP / (TP + FP)

R = TP / (TP + FN)

F1 = 2 × P × R / (P + R)

where TP is the number of true positives (samples correctly classified as positive), FP is the number of false positives (samples incorrectly classified as positive), and FN is the number of false negatives (samples incorrectly classified as negative). We do not consider the transitional areas (0-pixel distance) between non-crack and crack pixels. The F1 score is a combination of Precision and Recall and is a robust indicator for both balanced and unbalanced datasets. In general, F1 values greater than 0.9 indicate a very accurate classification; below 0.5, the classification may be considered inaccurate and therefore unsuitable. Analysis of F1 is necessary when a balance between Precision and Recall is desired.
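These definitions translate directly into code. A minimal NumPy sketch (the `precision_recall_f1` helper is illustrative, not the authors' implementation):

```python
import numpy as np

def precision_recall_f1(pred, target):
    # pred and target are binary masks (1 = crack pixel, 0 = background)
    tp = np.sum((pred == 1) & (target == 1))  # true positives
    fp = np.sum((pred == 1) & (target == 0))  # false positives
    fn = np.sum((pred == 0) & (target == 1))  # false negatives
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)  # harmonic mean of P and R
    return p, r, f1

pred   = np.array([[1, 1, 0, 0]])
target = np.array([[1, 0, 1, 0]])
print(precision_recall_f1(pred, target))  # (0.5, 0.5, 0.5)
```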
Intersection over Union (IoU) is a geometric evaluation metric. It describes the closeness of the predicted results to the ground-truth bounding boxes and is expressed as:

IoU = |Bp ∩ Bg| / |Bp ∪ Bg|

where Bp is the predicted bounding box and Bg is the ground-truth bounding box. In this case, the predicted bounding box represents the mask obtained with the proposed model (prediction output), and the ground-truth bounding box represents the mask used to train the implemented model (target mask).
Thus, the overlap occurs between the two masks, the predicted mask (prediction output) and the original mask (target mask). This means that the IoU is equal to the number of pixels that are common between the target mask and prediction output divided by the total number of pixels in both masks. The higher the overlap, the higher the score; values close to one indicate an excellent overlap, and values below 0.5 indicate a poor overlap. In other words, should the predicted mask be identical to the mask used for training, the IoU would be one. These metrics are commonly used in crack detection, but they do not consider the subjectivity of manually labeled ground truth [35].
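The pixel-wise computation described above can be sketched as follows (the `iou` helper is illustrative, operating on binary masks rather than bounding boxes, as the text explains):

```python
import numpy as np

def iou(pred, target):
    # Number of pixels common to both masks divided by the total
    # number of pixels covered by either mask.
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return inter / union

a = np.array([[1, 1, 0, 0]])
b = np.array([[0, 0, 1, 1]])
print(iou(a, a))  # 1.0 -- identical masks overlap perfectly
print(iou(a, b))  # 0.0 -- disjoint masks do not overlap
```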

Results
The trained model was applied to the Crack500 dataset. The original dataset consisted of 250 training samples, and then the training dataset was artificially increased with the data augmentation technique to improve performance during the training phase. A total of 12,000 images were used for training, 2400 images were used for validation, and 9600 images for testing. Figure 5 shows that both accuracy curves of the training and the validation set increase and stabilize at high and similar values with a gap of 0.006, indicating that the model is learning correctly and generalizing well to unseen data.
Our model has a total of 20.6 million parameters, and the number of FLOPS is 236G; the inference speed on our hardware is 6.7 frames per second on images of 512 × 512 pixels. As a result, this strategy might not be advantageous for real-time applications and looks more suited to batch processing.
The metrics used for evaluating the proposed model are Precision (P), Recall (R), F1, and IoU, as described in Section 2.3. Table 1 shows the results obtained by applying the model trained with the proposed methodology and the results obtained by using the models reimplemented by both Lau et al. [29] and others [18,36] that achieve sufficiently high and comparable accuracies. It should be noted that the results listed in Table 1 are derived from applying the same U-Net architecture to the same training dataset (Crack500). Comparing the results shown in the table, those of Lau et al. emerged as the most accurate, making it the best-performing method at present; we therefore use it as a benchmark.
Compared to the results of Lau et al., our model produced an increase in Precision, a slight decrease in Recall, and an increase in F1 and IoU. The increase in Precision means that our model is more reliable, i.e., there are few false positives. However, the model may miss some events by being less selective or sensitive (lower Recall), i.e., there may be more false negatives even though the model is precise.
Lau et al. implemented a model with the goal of balancing Precision and Recall, since it is valuable for the model to be precise, but it is also important to have adequate sensitivity (Recall). In our case, the increase in Precision and the slight decrease in Recall still produced a balanced result, as can be seen from the F1 score, which is close to 0.76.
The increase in IoU compared to the model of Lau et al. means that the predicted masks are more similar to the original ones (target masks). This is important when estimating crack widths; we assume that the masks used as training data are consistent with reality and, consequently, with the crack width. Crack width is the main parameter used to derive severity levels [4]. Figure 6 shows examples of the results of the trained model on images of some major crack types. The images given as input to the model, shown in columns a and a1, belong to the test dataset and were not used to train the model. Columns c and c1 show the results obtained on those input images, while columns b and b1 show the hand-drawn masks available for the dataset, for visual comparison with our outputs. In almost all cases, our output is more accurate than the hand-drawn masks. In detail, the panels in column "a" show (1-3) transverse cracking, (4-6) longitudinal cracking, and (7-9) block cracking, while the panels in column "a1" show (1-3) portions of alligator cracking, (4-5) edge cracking, and (7-9) portions of non-cracked pavement.
It should be pointed out that the portions of the pavement shown in Figure 6 belong to different types of wear layers, which can be distinguished by different levels of adherence. Indeed, the images display very heterogeneous color scales, which are sometimes marked by the presence of very evident stains due to the presence of aggregates of different types. Columns b and b1 display the target masks used to train the model, and columns c and c1 display the masks obtained by applying the trained model, which are the predicted masks.
In almost all cases, the predicted masks are better than the hand-drawn ones, particularly concerning the crack width: e.g., in row 5, panel c1(5), the crack is better delineated than in the target mask shown in panel b1(5), which is less compliant, in terms of width, with the real configuration shown in panel a1(5). This aspect is crucial since, in the design of maintenance plans, the main parameter regulating the severity levels of the different cracks is their width, in addition to the linear development and the area of extension [37].

Discussion
To further test the performance of the implemented model, the authors applied it to an orthorectified image not belonging to the Crack500 dataset. The image was acquired with a UAV (unmanned aerial vehicle) equipped with a Zenmuse P1 camera with a flight height of about 30 m. It portrays a short stretch of a provincial road with one carriageway and two lanes in each direction (Figure 7). The crack pattern visible in the figure mainly affects one lane. The crack widths were also measured in situ with a caliper and compared with those derived from the mask obtained.

Figure 8 shows the results of applying the proposed model to a section of road pavement characterized by different types of cracking (mainly block, longitudinal, and fatigue cracking) and a wear layer whose texture composition differs from that present in the Crack500 dataset. The three boxes, (1), (2), and (3), show cracks of high, moderate, and low severity levels, respectively. Here, the panels in column (a) show the predicted masks, those in column (b) the predicted masks classified according to crack width, and those in column (c) the measurements made with the caliper.
To quantify the width of the cracks and to assess the sensitivity of the model on width segmentation, the trained model was applied to the orthophoto with a pixel size of 3 mm.
The crack width was calculated using a few functions implemented in Matlab and applied to the model output mask. In particular, the "bwmorph" function removes pixels inside the cracks and keeps only the border pixels (https://www.mathworks.com/help/images/ref/bwmorph.html, accessed on 1 April 2023); the same function also allows us to obtain the skeleton of the cracks. The "bwdist" function, in turn, calculates the distance between the skeleton and the border pixels (https://www.mathworks.com/help/images/ref/bwdist.html, accessed on 1 April 2023). The distance calculated was thus used to produce a raster containing the width of the cracks.
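The skeleton-plus-distance idea described above can be sketched in Python as well; the following is a rough equivalent of the Matlab pipeline, assuming scikit-image and SciPy are available (the function name and the synthetic mask are illustrative, not from the original study):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt
from skimage.morphology import skeletonize

def crack_width_map(mask, pixel_size_mm=3.0):
    """Estimate crack width (mm) along the skeleton of a binary crack mask.

    Mirrors the pipeline described above: skeletonize the mask
    (cf. Matlab's bwmorph 'skel'), compute each crack pixel's distance to
    the nearest background pixel (cf. bwdist), and take twice that distance
    at skeleton pixels as the local crack width. Note that 2*d counts out
    to the centre of the first background pixel, so it slightly
    overestimates the true width (by roughly one pixel).
    """
    mask = mask.astype(bool)
    skeleton = skeletonize(mask)           # 1-pixel-wide centerline
    dist = distance_transform_edt(mask)    # distance to background, in pixels
    width = np.zeros_like(dist)
    width[skeleton] = 2.0 * dist[skeleton] * pixel_size_mm
    return width

# Synthetic example: a 30x30 image with a 5-pixel-wide horizontal "crack".
# At 3 mm/pixel this corresponds to a crack about 15 mm wide.
mask = np.zeros((30, 30), dtype=np.uint8)
mask[12:17, 2:28] = 1
widths = crack_width_map(mask)
```

The resulting `widths` raster is nonzero only on the crack centerline, which matches the raster of crack widths described in the text.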
Looking at Figure 8a, one can notice that our model does not segment cracks below the low severity level, mainly because the resolution of the orthophoto does not allow for the detection of cracks with a width smaller than, or only slightly greater than, the pixel size. Some cracks of low severity were not segmented because the color difference between the crack and the pavement was not sharp enough, which is plausible given that the images were taken at a flight height of about 30 m. Cracks with medium/high severity levels were all segmented, and were also observed in traditional surveys carried out in situ.
In particular, a high severity level crack is shown in panel 1c, which is congruent with the crack width estimated from the mask (panels 1a-1b). In panel 2c, a moderate severity level crack is shown; again, the implemented model accurately segmented the crack as the severity level inferred from the mask (panels 2a-2b) is congruent with that measured in situ. Finally, panel 3c shows a case halfway between low and moderate severity, with the crack width being just over 6 mm. In panel 3b, the part considered falls at the medium severity level (>6 mm) at the node of the crack junction; as one moves away from the junction (toward the right), the severity level turns low (<6 mm). Again, the model was able to segment the crack width with sufficient accuracy; the severity level is congruent with that measured in situ.
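As a minimal sketch of how a width value could be binned into the severity levels discussed above: the 6 mm low/moderate boundary is taken from the discussion, while the 12 mm moderate/high boundary is a hypothetical placeholder (the study does not state it) and should be replaced with the thresholds of the agency's distress catalogue:

```python
def severity_level(width_mm, moderate_mm=6.0, high_mm=12.0):
    """Classify a crack width (mm) into a severity level.

    moderate_mm = 6.0 comes from the discussion above (low < 6 mm,
    moderate > 6 mm); high_mm = 12.0 is a hypothetical placeholder.
    """
    if width_mm < moderate_mm:
        return "low"
    if width_mm < high_mm:
        return "moderate"
    return "high"

# The borderline case of panel 3: just over 6 mm falls in "moderate",
# just under falls in "low".
print(severity_level(5.8), severity_level(6.5))
```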
The segmentation performance is closely related to the resolution and quality of the image and, to some extent, to its exposure. To segment cracks of low or lower severity, it is advisable to use images taken very close to the pavement, preferably from a mobile system mounted on a car and using a medium/high quality camera capable of returning a pixel size no larger than half the width of the crack. The key point is that cracks with medium/high severity levels are identified and segmented in almost all cases, which is a major aspect for decision making and for drafting pavement management plans.
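The half-width rule of thumb above can be expressed as a simple check (the function name is illustrative):

```python
def required_pixel_size_mm(min_crack_width_mm):
    """Largest pixel size (mm) that still resolves a crack of the given
    width, following the half-width rule of thumb stated above."""
    return min_crack_width_mm / 2.0

# A 6 mm crack (the low/moderate boundary) needs a ground sampling
# distance of about 3 mm/pixel, matching the orthophoto used here;
# finer cracks require proportionally finer imagery.
print(required_pixel_size_mm(6.0))
```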

Conclusions
The use of convolutional neural networks for pavement crack detection was the main goal of this study. Our network is a U-Net with a ResNet50 encoder pretrained on the ImageNet dataset. The encoder component has proven to extract image crack features effectively; even cracks with irregular shapes and intricate textures can be handled, owing to the model's ability to accurately represent global context information. The proposed crack segmentation model performs well on the Crack500 dataset, and the mask predictions show that it can accurately segment cracks.
Even though the approach followed in this study demonstrated good performance, much work remains before pavement cracks can be detected fully automatically. A drawback of the proposed approach is that it requires a large number of manually annotated pixel-level crack images to build effective and accurate models. This is a well-known issue in the literature and holds for almost all ML approaches: the performance of the model is directly related to the dataset, and the manual annotation process is time-consuming and subjective. Collecting and labeling data samples takes time and must be performed accurately; synthetic data could be introduced into the model learning process to alleviate this.
The validation of model performance is generally carried out on images acquired in nearly optimal conditions, with no noise, obstacles, shadows, or overexposed areas; applying a model to lower-quality or noisy images could therefore lead to inaccurate results. Nonetheless, applying our model to a real test area led to very promising results, as the severity levels derived from the crack widths computed on the resulting masks are in line with those obtained from traditional in situ surveys. The proposed methodology improves a DL model, referred to in the scientific literature as one of the most accurate CNN-based models for crack segmentation, by modifying its architecture. The improvements made in our model also affect the segmentation of crack width; this aspect is relevant because width is the key parameter for estimating the severity levels that determine which stretches should be prioritized for intervention.
Going forward, we aim to optimize the proposed model and test its performance on other types of datasets. We also hope to build a more sophisticated crack dataset that includes cracks in buildings or bridges, to further improve crack segmentation algorithms.