Automated Pavement Crack Damage Detection Using Deep Multiscale Convolutional Features



Introduction
Pavement crack detection plays an important role in the field of road distress evaluation [1]. Traditional crack detection methods depend mainly on manual work and are limited in two ways: (i) they are time consuming and laborious; (ii) they rely entirely on human experience and judgment. Therefore, automatic crack detection is essential to detect and identify cracks on the road quickly and accurately [2]. This procedure is a key part of intelligent maintenance systems, helping to assess pavement distress where more frequent road status surveys are required. Over the past decade, the development of high-speed mobile cameras and large-capacity hardware storage devices has made it easier to obtain large-scale road images. Through mobile surveying and mapping technology, integrated acquisition equipment is fixed to the rear of the vehicle roof frame to monitor both the road surface and the surrounding environment. The pavement surface images are thus acquired, processed, and stored [3]. Currently, many methods utilize computer vision algorithms to process the collected pavement crack images and then obtain the final maintenance evaluation results [4].
Automatic crack detection is a very challenging image classification task with the goal of accurately marking crack areas. Figure 1 shows examples of data acquired by a mobile pavement inspection vehicle. In a few cases, the cracks have good continuity and obvious contrast, as shown in Figure 1(a). However, in most cases, there is considerable noise in the cracks, which leads to poor continuity and low contrast, as shown in Figure 1(b). Automatic crack detection therefore faces three main challenges. (i) In poorly lit environments and complex backgrounds, interference such as weeds and stains has texture and linearity features similar to those of cracks, resulting in greater intraclass differences. (ii) Boundary blurring occurs between small cracks and local noise. (iii) Blurred, low-quality images are unavoidable when crack data are collected at high speed. These three difficulties create considerable challenges in pavement crack detection.
The recent publications [5][6][7] assumed that crack pixels are generally darker than their surroundings and then used a threshold method to extract the crack area. These methods lack a description of global information and are sensitive to noise. To improve the continuity of crack detection, researchers have attempted to detect cracks by introducing minimal path selection (MPS) [8][9][10], minimal spanning trees (MSTs) [11][12][13], and crack fundamental elements (CFEs) [14]. These methods can partially eliminate noise and improve crack detection continuity. However, they obtain only rough low-level features, and some complex high-level crack features may not be represented and utilized correctly. A random structured forest-based method is presented in [15] to detect cracks automatically. This method can effectively suppress noise by manually selecting crack features and learning internal structures. Although it improves the recognition speed and accuracy, it does not perform well when dealing with complex pavement crack situations. In general, traditional machine learning methods model cracks by manually setting color or texture features. In these methods, the features cover only some specific real-world situations. The set of crack features is simplified and idealized, and thus cannot meet the robust detection requirements for pavement distress.
In recent years, deep learning methods have been widely used to solve complex problems through hierarchical concepts. Deep convolutional neural networks (DCNNs) have shown great advantages in computer vision tasks such as image classification [16][17][18], object detection [19], and semantic segmentation [20,21]. A DCNN can acquire expressive features at different levels because it consists of several trainable layers [22]. The rich hierarchical features of DCNNs have enabled great progress in pixel-level semantic segmentation tasks [23][24] and crack detection. In [25], AlexNet is used to extract crack characteristics, and crack detection is then performed based on probability maps. However, a detailed delineation of the crack could not be obtained. In [26] and [27], 3D crack detection networks based on DCNNs are proposed for automated pixel-level crack detection on 3D asphalt pavement. In [28], an effective detection model for concrete cracks is proposed through two modules, multi-view image feature detection and multitask crack detection. A robust algorithm that postprocesses the output feature maps is proposed in [31] to detect cracks. The DeepCrack network is constructed based on the encoder-decoder architecture of SegNet in [32]; the convolution features generated in the encoder and decoder networks are fused in pairs at the same scale to complete crack detection, but the width information of cracks may not be preserved in the detection results.
Although most of the published methods have achieved good results, automated pavement crack detection in complex backgrounds is still demanding. In this paper, we propose an end-to-end trainable deep convolutional neural network, called CrackSeg, for pixel-wise crack detection in complex scenes. First, a multiscale dilated convolution module is proposed to obtain more abundant crack texture information. To provide the network with both a larger receptive field and a high spatial resolution, multiscale context information is captured with different dilation rates. Second, a pixel-level dense prediction map is generated by an upsampling module that fuses low-level features to recover the crack boundary details. Finally, the model is systematically evaluated on three crack datasets by quantitative evaluation methods, including comparison with manual annotations. The results show that the proposed crack detection method can accurately extract cracks on different pavement types and in complex backgrounds.
The contributions of this paper are threefold:
(1) A novel trainable end-to-end crack segmentation network, CrackSeg, is designed to detect road cracks at the pixel level. The network makes full use of the semantic information of hierarchical convolution features and is very effective for crack detection in complex scenes.
(2) A crack feature detection network based on a joint multiscale dilated convolution module is proposed. While the computational cost is controlled, the multiscale semantic information is fused to obtain more abundant features. In the upsampling stage, the high spatial resolution features of the shallow network are fused to obtain more refined crack segmentation results.
(3) A multisource pavement distress labelling dataset, CrackDataset, is established that reflects the overall situation of road distress in China.
The rest of this paper is organized as follows. Section 2 describes crack detection based on deep learning semantic segmentation. Section 3 demonstrates the effectiveness of the proposed scheme through comparative analyses of experiments. Section 4 discusses the detailed design of the two modules proposed in this paper. Finally, Section 5 concludes the paper.

Materials and Methods
In this section, we introduce a novel end-to-end trainable crack detection DCNN structure based on multiscale features, which is described in three parts. In the first part, the overall structure of the crack detection network is introduced. In the second part, a multiscale dilated convolution module is introduced to obtain more abundant context information from the crack image, and a fused feature map for crack detection is preliminarily obtained. In the third part, we propose a new upsampling scheme based on feature maps of different resolutions.

The Structure of CrackSeg.
The crack detection network is built on a multiscale dilated convolution module and an upsampling module, as shown in Figure 2. A ResNet [31] pretrained model with a dilated network strategy is used to extract the crack characteristics. In the traditional CNN structure, a down-sampling layer effectively increases the receptive field and reduces the number of parameters, but it also reduces the spatial resolution of the learned features, making the final feature map smaller. After the final feature representation, multiscale crack semantic information is obtained by the multiscale dilated convolution module, and global prior information is captured by fusing different levels of semantic information. Finally, the shallow and deep semantic information is fused by the upsampling module so that the output feature map has the same size as the input image, and the probability that each pixel belongs to the crack or noncrack class is calculated by the softmax function. The values generated by the softmax function lie in [0, 1] and represent the probability distribution over the classes. The softmax function can be expressed as:

$$p_k(x; W) = \frac{\exp\big(z_k(x; W)\big)}{\sum_{c} \exp\big(z_c(x; W)\big)} \tag{1}$$

where $W$ and $x$ represent the weights of the network and the input data, respectively, and $z_k(x; W)$ denotes the network output score for class $k$. For pixel-wise prediction, pixels are assigned to different categories by the cross-entropy loss [32], which can be expressed as:

$$L(W) = -\frac{1}{N}\sum_{i=1}^{N}\big[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\big] \tag{2}$$

where $x_i \in [0, 255]$ is the input pixel value, $y_i \in \{0, 1\}$ is the ground truth label, $\hat{y}_i \in [0, 1]$ is the predicted probability for pixel $x_i$, $N$ is the number of pixels, $W$ is the network weight matrix, and $L$ is the loss function.
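The softmax and pixel-wise cross-entropy computations described above can be sketched in NumPy as follows. This is a minimal illustration and not the CrackSeg implementation: the function names, the (H, W, classes) array layout, and the clipping constant `eps` are assumptions made for the example.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax over the class axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def pixelwise_cross_entropy(prob_crack, labels, eps=1e-7):
    """Mean binary cross-entropy over all pixels.

    prob_crack: predicted crack probability per pixel, shape (H, W)
    labels:     ground-truth mask, 1 = crack, 0 = noncrack, shape (H, W)
    eps:        assumed clipping constant that avoids log(0)
    """
    p = np.clip(prob_crack, eps, 1.0 - eps)
    loss = -(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    return loss.mean()

# Toy 2 x 2 "image" with two class scores per pixel: [noncrack, crack].
logits = np.zeros((2, 2, 2))
logits[0, 0] = [0.0, 4.0]   # pixel scored strongly as crack
logits[1, 1] = [4.0, 0.0]   # pixel scored strongly as noncrack
probs = softmax(logits)
labels = np.array([[1, 0], [0, 0]])
loss = pixelwise_cross_entropy(probs[..., 1], labels)
```

Each pixel's two probabilities sum to one, and the loss decreases as the predicted crack probability approaches the ground truth label.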

Multiscale Dilated Convolution.
A top-down convolutional neural network can identify target regions with strong discrimination, but for target regions with weak discrimination, the classification performance is reduced [33]. In DCNNs, the size of the receptive field represents the amount of available information. Increasing the receptive field of the convolution kernels can effectively incorporate the semantic information around the target, thus improving the classification ability for regions with weak discrimination [34]. Dilated convolution is a variant of standard convolution. Zero values are inserted between the elements of the convolution kernel, which enlarges the kernel field of view and enables dense feature extraction in DCNNs without reducing the resolution of the intermediate feature maps:

$$y[i] = \sum_{m} x[i + r \cdot m]\, w[m] \tag{3}$$

where $x$ is the input feature map, $w$ is the convolution kernel, $y$ is the output, and $r$ represents the dilation rate of the convolution kernel, which determines the number of zeros placed between kernel elements ($r - 1$ zeros for rate $r$). Because of the dilation, only $k \times k$ pixels are involved in each convolution calculation for a $k \times k$ kernel, which enlarges the receptive field without losing resolution and at a lower computational cost than a dense kernel covering the same field. Inspired by these findings, a novel classification network with multiple dilated convolution blocks is proposed to generate dense localization. To capture multiscale semantic crack information, the features of three different scales are fused. A multiscale dilated convolution module is constructed by combining multiple branches with different kernels and dilated convolution layers, which forms a merged feature representation for different locations. After fusing features of different sizes and levels, a 1 × 1 convolution is carried out to reduce the dimension of the semantic features to 1/ . The network structure and parameter details of the multiscale dilated convolution module are shown in Figure 3.
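To make the dilated convolution concrete, the sketch below implements a single-channel, "valid" 2-D dilated convolution (in the cross-correlation form used by deep learning frameworks) directly in NumPy. It is an illustrative example only: the function names are hypothetical, and the actual module in the paper operates on multichannel feature maps inside a trained network.

```python
import numpy as np

def effective_kernel_size(k, r):
    """Side length spanned by a k x k kernel with dilation rate r."""
    return k + (k - 1) * (r - 1)

def dilated_conv2d(x, w, r=1):
    """'Valid' dilated cross-correlation of a 2-D map x with a k x k kernel w."""
    k = w.shape[0]
    span = effective_kernel_size(k, r)
    out_h = x.shape[0] - span + 1
    out_w = x.shape[1] - span + 1
    y = np.zeros((out_h, out_w), dtype=float)
    for m in range(k):
        for n in range(k):
            # Each kernel tap reads the input at an offset scaled by r,
            # so only k * k input pixels contribute to each output value.
            y += w[m, n] * x[m * r:m * r + out_h, n * r:n * r + out_w]
    return y

x = np.arange(49, dtype=float).reshape(7, 7)
w = np.ones((3, 3))
y_standard = dilated_conv2d(x, w, r=1)  # 3 x 3 field of view, output 5 x 5
y_dilated = dilated_conv2d(x, w, r=2)   # 5 x 5 field of view, output 3 x 3
```

With the same nine kernel weights, the rate-2 kernel covers a 5 × 5 window instead of 3 × 3, which is the receptive field enlargement the text describes.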
In the multiscale dilated convolution module, two main convolution operations are used: (i) obtaining accurate location mapping through standard convolution kernels to highlight the target areas with strong discrimination; (ii) introducing multiple dilation rates to expand the receptive fields of the convolution kernels and improve the detection of target areas with weak discrimination. Thus, discriminant features from adjacent salient regions are transferred to target-related regions that have not yet been found. We find that convolution blocks with a large dilation rate introduce some irrelevant regions, such as some true-negative regions, which would be highlighted by adjacent discriminant objects. Therefore, a multiscale dilated convolution network with a small dilation rate is proposed.

Journal of Advanced Transportation

Up-Sampling Module.
The multiscale dilated convolution module in the encoding stage can transform the input image into rich semantic visual features. However, these features have a coarse spatial resolution [35]. The purpose of upsampling is to restore these features to the input image resolution and then predict the spatial distribution of cracks. The proposed upsampling module has two main inputs: the low-resolution features with high-level semantic information and the high-resolution features from the bottom of the network. Using features extracted at different scales aggregates local and global context information. As shown in Figure 4, the features of the shallower encoding layers retain more spatial details, which helps to obtain sharper boundaries, whereas the deeper features have stronger representation ability. The upsampling module first upsamples the output features of the multiscale dilated convolution module by a factor of four and then fuses them with the low-level features of the same spatial resolution in the network. The low-level features are first convolved with a 1 × 1 kernel to reduce their dimension. After feature fusion, three convolution operations with a 3 × 3 kernel size are used to improve the feature expression ability. Because the upsampling module is learnable, it can recover the fine information lost in the bilinear upsampling (BU) operation. Details of the parameters of the upsampling module are described in Table 1, where $H$ and $W$ are the height and width dimensions of the input features, respectively.

Figure 2: Illustration of the crack detection network, CrackSeg. The multiscale dilated convolution module is used to capture abundant crack features. After fusion with the lower-level crack features in the network, three consecutive 3 × 3 convolution operations are used to improve the feature expression ability. The output of the last convolution layer is the crack feature map, which is input into the binary classifier for pixel-wise crack prediction.

Dataset.
The data in CrackDataset cover most of the pavement distress types in the whole road network. The images were collected on different pavements, under different illumination conditions, and with different sensors. The ground truth in the dataset provides two types of labels, crack and noncrack. The dataset is divided into three parts. The training set and the validation set are composed of 4736 and 1036 crack images, respectively. The test set contains 2416 images. In addition, two other crack datasets, CFD [15] and AigleRN [10], are used as test sets. The details of the datasets are shown in Table 2.

Implementation Details.
We implement CrackSeg using TensorFlow, an open source platform for deep learning. Because of the large image size, training the CrackSeg network on full images requires a large amount of memory, which overburdens the training process. Additionally, the crack areas occupy a small proportion of the whole image, and many background areas are meaningless for the training process. Therefore, the original road crack images are divided into several small blocks with a size of 256 × 256. To improve the robustness of the model, several transformations are applied to the data, including random flip, color enhancement, and enlargement. We utilize the Adam [36] algorithm to optimize the network. The network is trained with an initial learning rate of 0.0001. The momentum and weight decay are set to 0.9997 and 0.0005, respectively.
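The block-splitting and random-flip steps mentioned above could look roughly like the following NumPy sketch. It is not the authors' preprocessing code: the function names are hypothetical, and two choices are assumptions made for the example, namely that blocks do not overlap and that border regions smaller than a full block are discarded.

```python
import numpy as np

def tile_image(image, mask, size=256):
    """Split an image and its label mask into size x size blocks.

    Assumption: blocks do not overlap, and border strips that cannot
    fill a whole block are dropped.
    """
    h, w = mask.shape[:2]
    tiles = []
    for top in range(0, h - size + 1, size):
        for left in range(0, w - size + 1, size):
            tiles.append((image[top:top + size, left:left + size],
                          mask[top:top + size, left:left + size]))
    return tiles

def random_flip(image, mask, rng):
    """Apply the same random horizontal and vertical flip to image and mask."""
    if rng.random() < 0.5:
        image, mask = image[:, ::-1], mask[:, ::-1]
    if rng.random() < 0.5:
        image, mask = image[::-1], mask[::-1]
    return image, mask

# Example with a synthetic 600 x 800 RGB image and a sparse crack mask.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(600, 800, 3), dtype=np.uint8)
mask = np.zeros((600, 800), dtype=np.uint8)
mask[100:105, 50:400] = 1  # a thin horizontal "crack"

tiles = tile_image(image, mask)
augmented = [random_flip(img, msk, rng) for img, msk in tiles]
```

Flipping the image and the mask together keeps every crack pixel aligned with its label, which is the property any geometric augmentation for segmentation has to preserve.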

Experiments and Analysis
To verify the effectiveness of our scheme, extensive experiments on pavement crack detection were conducted on various images. In this section, we describe the experimental setup and analyze the experimental results. All experiments in our work are performed using an NVIDIA GTX 1080 GPU with 8 GB of on-board memory.

Evaluation Metrics.
In the evaluation of crack detection accuracy, crack and noncrack pixels are considered as two categories. The overall accuracy (OA), precision, recall, F-score, and mIoU are used as the metrics for quantitative performance evaluation and comparison in the experiments. These five indicators can be calculated as follows:

$$
\begin{aligned}
\text{Overall accuracy} &= \frac{TP + TN}{TP + TN + FP + FN},\\
\text{Precision} &= \frac{TP}{TP + FP},\\
\text{Recall} &= \frac{TP}{TP + FN},\\
F\text{-score} &= \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}},\\
\text{mIoU} &= \frac{1}{2}\left(\frac{TP}{TP + FP + FN} + \frac{TN}{TN + FN + FP}\right),
\end{aligned} \tag{4}
$$

where $TP$ represents the number of correctly classified crack (positive) pixels, $TN$ is the number of correctly classified noncrack (negative) pixels, $FP$ is the number of noncrack pixels incorrectly classified as crack, and $FN$ is the number of crack pixels incorrectly classified as noncrack. OA, mIoU, and F-score are comprehensive indicators; the larger the value, the higher the accuracy.

Result and Analysis.
To demonstrate the feasibility of the proposed scheme, we compare CrackSeg with SegNet [38], U-Net [21], PSPNet [39], DeepCrack [31], and DeepLabv3+ [40]. In addition, to verify the advantages of deep learning semantic segmentation models in crack detection, the non-deep-learning method CrackForest [15] is included in the comparative experiments. The quantitative comparison results on our CrackDataset are shown in Table 3, which shows that deep learning methods achieve higher crack detection accuracy in complex backgrounds. Compared with the other deep learning segmentation methods, CrackSeg achieves the highest OA, recall, precision, mIoU, and F-score. The mIoU of CrackSeg is the highest at 73.53%, followed by DeepCrack and DeepLabv3+ with mIoU values of 72.04% and 71.77%, respectively. The mIoU values of CrackForest, SegNet, U-Net, and PSPNet are 14.27%, 2.97%, 2.04%, and 3.90% lower than that of CrackSeg, respectively. The performance improvement is mainly due to the multiscale dilated convolution module in the encoding stage, which captures multiscale context for accurate semantic mining. On the basis of this rich semantic information, the boundary information of the target is recovered by using low-level high-resolution features, and more accurate segmentation results are obtained by using consecutive convolution operations.
Figure 5 shows visual comparisons of the crack detection results of the different methods. The first row shows the original images containing cracks, some of which are accompanied by noise such as shadows, oil spots, and watermarks, which are the main factors affecting crack detection. The experimental results show that the CrackForest method, which is based on traditional machine learning features, can extract cracks in a simple background, but it retains more noise and cannot adapt to automatic crack detection in complex scenes. For SegNet and U-Net, the detection results are acceptable, but these methods produce many false detections in complex backgrounds. DeepCrack performs well in extracting thin cracks in complex backgrounds; however, some width information of cracks is lost in the detection results. DeepLabv3+ performs well in detecting light cracks, but it detects nonexistent cracks because of its large dilation rate. Furthermore, its single convolution kernel size causes the loss of crack information. Our CrackSeg integrates low-level and high-level features in convolution stages at different scales, which further improves the accuracy of crack detection and the robustness of background artifact suppression, effectively eliminates the influence of oil pollution, shadows, and complex backgrounds, and extracts cracks with various complex topologies.

Figure 5 (panel labels): Input, CrackForest [15], SegNet [38], U-Net [21], PSPNet [39].

Network Robustness Analysis.
To verify the stability of the proposed method, CrackSeg is tested on two other datasets, CFD and AigleRN. The visual crack detection results are shown in Figure 6. It is noteworthy that the crack images in these two datasets are not used in the training phase. The results show that the proposed method can extract most pavement cracks and that the model has strong robustness.

Discussion
In this section, to determine the optimal crack characteristics, we discuss the impact of the multiscale dilated convolution module itself. Then, the low-level feature selection and convolution operation structure in the upsampling module are discussed.

Principle for Choosing Multiscale Dilated Convolution.
To compare the effect of the multiscale dilated convolution module on crack detection more clearly, the features are upsampled by a factor of 16 in the upsampling stage by BU, and the final prediction results are obtained. In the experiment, ResNet50 is used as the network backbone to validate the multiscale dilated convolution module. Figure 7 shows the change in mIoU of the different dilated convolution modules over 20 epochs of training. After 14 epochs, each method reaches a stable state. The BaseLine has the lowest performance, and the purple polyline (fusion-S-dilated) has the highest mIoU among the methods. In summary, the multiscale dilated convolution module with fusion features achieves the best results.
As shown in Table 4, the experimental results are compared and analyzed for the selection of high-level features. The mIoU of the BaseLine model using ResNet50 as the feature detection network is only 65.07%. The performance of the ASPP [33] module is 1.25% higher than that of the BaseLine, which shows that dilated convolution can improve the performance of crack detection. To further exploit the effect of dilated convolution, different dilation rates were applied to the final Block4 to increase the receptive field size. The experimental results show that the multiscale dilated convolution modules with a large dilation rate and a small dilation rate improve the performance by 1.47% and 2.05%, respectively. Although dilated convolution with a larger dilation rate has a larger receptive field, it introduces unrelated regions while capturing crack characteristics, which affects the final crack identification. With the smaller dilation rate, better convergence and a better detection effect can be obtained in model training.
To explore the influence of different high-level features on the multiscale dilated convolution module, two high-level crack feature maps, Block3 and Block4, were fused in the experiment, and the network performance improved by 0.82%. The experimental results show that high-level features fused at multiple levels have stronger representation ability, which helps locate crack pixels in the encoding process.
To improve the crack detection accuracy, the features generated by the multiscale dilated convolution module are used as the high-level input feature of the upsampling module, which includes more discriminant semantic information. Low-level features in the network have a high spatial resolution, which retains the details of the crack boundary. After the fusion of low-level features and discriminative high-level features, convolution operations are used in the upsampling module to obtain sharper detection results. Table 5 shows the performance of the different upsampling features and structures. The comparative experiments show that the number of convolutions has a great impact on the final crack detection results of the model. With the low-level features generated by Conv1 fixed, the best results are achieved by using three [3 × 3, 256] convolutions; compared with using one and two convolution operations, the mIoU values are increased by 0.85% and 0.12%, respectively. When four convolution operations are used, the accuracy of crack detection begins to decline. To evaluate the effectiveness of low-level features for boundary restoration, the low-level features generated by Conv1 in the upsampling module are changed to Block1 and to the combination of the two (Conv1 and Block1). As shown in Figure 8, the features generated by the combination of Conv1 and Block1 restore the crack edges best.

Conclusions
In this paper, an end-to-end trainable pavement crack detection framework based on DCNN, CrackSeg, is proposed, which can automatically detect road cracks under complex backgrounds. First, a crack training dataset is established, which covers a wide range of data sources and reflects the overall situation of pavement distress in Liaoning Province, China. Second, through the fusion of high-level features in the backbone network, we propose the multiscale dilated convolution module. By capturing context information at multiple scales, the crack detection network can learn rich semantic information in a complex background. Based on the dilated convolution theory, we thus design a novel network structure that can be inserted into existing semantic segmentation systems to improve the accuracy of crack feature detection. Finally, through the upsampling module, the low-level features and consecutive convolution features are fused to realize pixel-level crack prediction. This feature aggregation, which combines different levels of feature information, can not only fully mine the crack features in the image but also restore and describe the details of the object boundary. The experimental results show that CrackSeg achieves high performance with a precision of 98.00%, recall of 97.85%, F-score of 97.92%, and mIoU of 73.53%, which are higher than those of the other networks. Furthermore, the model has strong stability and robustness against the noise interference caused by shadows, stains, and exposure in the process of data acquisition. The good performance of the CrackSeg network makes large-area automatic crack detection possible.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest regarding the publication of this paper.