Automated Asphalt Highway Pavement Crack Detection Based on Deformable Single Shot Multi-Box Detector Under a Complex Environment

Pavement cracks severely affect highway performance. Thus, implementing high-precision highway pavement crack detection is important for highway maintenance. However, the asphalt highway pavement environment is complex, and some pavement backgrounds make detecting highway pavement cracks more difficult than others. Interference from road markings and surface repairs also complicates the environment and thus the detection of cracks. To reduce this interference, we collected many images of different highway pavement backgrounds. We also improved the single shot multi-box detector (SSD) network and propose a novel network named deformable SSD, created by adding a deformable convolution to the backbone feature extraction network VGG16. We verified our model using the PASCAL VOC2007 dataset and obtained a mean average precision (mAP) 3.1% higher than that of the original SSD model. We then trained and tested the proposed model using our crack detection dataset. We calculated precision, recall, F1 score, AP, mAP, and FPS to examine the performance of our model. The mAP of all categories in the test data was 85.11% using the proposed model, which is 10.4% and 0.55% higher than that of YOLOv4 and the original SSD model, respectively. These findings show that our model outperforms YOLOv4 and the original SSD model and confirm that incorporating a deformable convolution into the SSD network can improve the model's performance. The proposed model is appropriate for detecting pavement crack categories and locations in complicated environments. It can also provide important technical support for highway maintenance.


I. INTRODUCTION
Pavement cracks are the most common and important pavement disease. These cracks may be caused by different factors, such as vehicle load and man-made and natural factors, and are the main manifestation of the early stage of pavement disease. Crack length can range from millimeters to meters, and crack width can range from 1 mm to a few centimeters. Highway pavement cracks damage the pavement structure, reduce running speed, slow down traffic, and shorten road operation times. Serious pavement cracks will also weaken the bearing capacity of the roadbed, give rise to pavement collapse and traffic accidents, affect traffic safety, and cause economic losses. Therefore, the detection of pavement cracks is important for maintaining the highway. Traditional pavement crack detection methods involve using a detection vehicle to collect pavement images, marking the cracks on the images manually, and calculating the length and width of the cracks. However, as China's highway mileage is rather long, the manual detection of cracks consumes considerable manpower and time. Thus, determining how to implement automatic and intelligent pavement crack detection is an urgent need.

The associate editor coordinating the review of this manuscript and approving it for publication was Le Hoang Son.

VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In the field of computer vision, object detection is the most widely used technology compared with image classification and segmentation. Object detection has been used in remote sensing image detection, medical image detection, industrial parts detection, power inspection, crop disease detection, etc. This technology can not only identify the category of the object, but can also locate the position of the object in the image. Therefore, it greatly improves work efficiency, reduces labor costs, and improves the level of automation and intelligence. The residual network (ResNet) solves the degradation problem caused by deepening network layers [1]. Therefore, deep convolutional neural networks (DCNNs) can be integrated into object detection networks as a backbone to upgrade networks such as YOLOv3 [2], YOLOv4 [3], and Faster R-CNN [4]. DCNNs promote the rapid development of object detection networks and achieve higher performance. Moreover, in natural language processing, transformers are the main models, which can not only extract features but also achieve multimodal fusion [5], [6]. Object detection networks based on transformer modules, such as the Swin Transformer, have been proposed and have obtained high performance [7]. These networks improve the practicability of object detection.
Pavement crack detection is a complex vision task. The aim of pavement crack detection is to determine whether any cracks exist that belong to a specified category and to identify the category and location information in the image. Highway asphalt pavement crack detection is more difficult than crack detection on concrete pavement because cracks in asphalt pavement are less obvious than those in concrete pavement, making feature extraction difficult. In addition, interference from other pavement objects, such as crack surface repairs and pavement markings, and interference between classes, including transverse, longitudinal, and map cracks, also increase detection difficulty. Deep learning-based approaches are an intelligent and high-efficiency method for detecting pavement cracks. The rich hierarchical features of DCNNs and the end-to-end trainable network promote pixel-level semantic segmentation tasks [8]–[10]. At present, several crack detection methods have been proposed based on object detection [11], [12] and image block segmentation [13]–[15].
However, these methods have some defects. The existing pavement crack detection datasets include little interference and cannot meet the needs of crack detection under complex environments. Additionally, few datasets contain information on crack locations, so they cannot be used to locate cracks in the image. To address these challenges, we developed a highway pavement crack dataset and conducted experiments employing the original SSD and YOLOv4 models. In these experiments, the SSD model performed better than the YOLOv4 model on our dataset. Specifically, this paper contributes to existing research by developing an improved object detection network that adds a deformable convolution to the SSD network. We verify the proposed model on the VOC2007 dataset and our crack dataset. Our model achieves the highest mAP on pavement crack detection. This paper is organized as follows. Section II introduces the related work. Section III presents a novel model named deformable SSD and introduces its network structure. Section IV describes the procedure for constructing the dataset. Section V presents and discusses the results in detail. Section VI presents the crack detection results. Finally, Section VII presents conclusions and future work.

II. RELATED WORK
A. OBJECT DETECTION
At present, object detection technology can be divided into one-stage and two-stage object detection. Two-stage object detection, which mainly includes R-CNN, Fast R-CNN, and Faster R-CNN, is divided into object classification and object localization tasks, and these two tasks are completed separately [16]–[18]. One-stage object detection is mainly based on the SSD and YOLO series networks, which complete both classification and localization tasks at the same time. The one-stage object detection network is superior to the two-stage object detection network in terms of accuracy and speed, and it has higher practical value [19]–[24]. The R-CNN network was originally proposed by Zhang et al. [25], who applied high-capacity CNNs to bottom-up region proposals in the network. The mAP of the network improved from 35.1% to 53.7%, and it was much faster than the multi-feature, non-linear kernel SVM method. However, R-CNN was slow. Subsequently, Fast R-CNN and Faster R-CNN were proposed to accelerate object detection, building on SPPnet [4]. Faster R-CNN promoted real-time object detection. To eliminate the bottleneck of region proposal computation, region proposal networks (RPNs) that share full-image convolutional features with the detection network were proposed [26]. To further improve accuracy, Cao et al. introduced a rotation-invariant Faster R-CNN object detection network that integrates regularization constraints into the target function of the model [27]. Feature pyramid networks (FPNs) were also proposed to achieve multi-scale object detection [28]. Employing a Faster R-CNN integrated with an FPN obtains better precision and improves the detection speed. Although these models perform well in object detection, they are slower and less accurate than one-stage object detection.
YOLO series networks belong to one-stage object detection. Redmon et al. proposed a novel YOLO network that obtained double the mAP of other real-time detectors [29]. Subsequently, Redmon and Farhadi introduced the improved model YOLOv2 [30] by integrating batch normalization into the network to speed up training. They also added anchor boxes and high-resolution classifiers to the network to improve accuracy. YOLOv2 ran faster than Faster R-CNN with ResNet and the SSD model. Redmon and Farhadi then proposed the YOLOv3 model, which achieves multi-scale object detection [2]. YOLOv3 was as accurate as the SSD model but three times faster; however, it performed worse on medium- and large-sized objects. More recently, the YOLOv4 and YOLOv5 models have been proposed [23], [31]. These models applied the Mish activation function and several data augmentation approaches to improve precision.
Object detection networks based on transformer modules have also performed well [32], [33]. A novel framework called the DEtection TRansformer, or DETR, was proposed [34]. This method needs neither a non-maximum suppression procedure nor anchor generation that encodes prior knowledge. The model's performance, in terms of precision and run-time, was comparable to the well-established and highly optimized Faster R-CNN model. However, DETR had high computation and space complexity when the attention weights were computed in the transformer module. In the encoder, the amount of calculation was proportional to the square of the number of pixels, meaning it had difficulty processing high-resolution features. To address this problem, the deformable DETR was put forward [35], which added a deformable convolution to the network; the validity of the model was proved using the COCO dataset. The DETR and deformable DETR networks added only transformer modules. Beal et al. proposed a vision transformer as a backbone, called ViT-FRCNN, for the detection task [36]. While ViT-FRCNN served as a crucial first step toward a class of transformer-based models, it did not achieve state-of-the-art results and did not consider local features. Liu et al. proposed the Swin Transformer, which employed shifted windows [7] to achieve information interaction between windows, meaning that the model can obtain both semantic and local information. The Swin Transformer outperformed the YOLOv4 and DetectoRS models when tested on the COCO dataset. While transformer-based models perform well on massive quantities of data, they tend to perform worse on datasets smaller than 100k samples.
Several researchers have also proposed improved versions of the SSD model. Fu et al. introduced the Deconvolutional Single Shot Detector (DSSD) by combining a classifier (ResNet-101) with an SSD network [37]. Wang et al. proposed an improved SSD model by combining the advantages of existing target detection approaches [38]. Their model outperformed the Faster R-CNN and R-FCN models for small object detection. Kumar et al. used depth-wise separable convolution to improve the SSD network and achieved high performance on small object detection and real-time detection [39].

B. YOLOv4
In the field of object detection, the YOLOv4 model has high accuracy and speed. On common image datasets, such as the COCO dataset and the VOC dataset, the performance of the YOLOv4 model is better than that of the SSD model. The YOLOv4 network structure mainly consists of three parts: the backbone feature extraction network CSPDarkNet53, spatial pyramid pooling (SPP), and the path aggregation network (PANet). The Mish activation function has also been adopted [40], [41]. Through PANet, we can attain feature maps of three different scales and carry out feature fusion across scales. The YOLOv4 object detection network has been widely applied in the autonomous driving and medicine fields [23], [40].

C. HIGHWAY PAVEMENT CRACK DETECTION
The development of image processing technology improves the level of automatic pavement crack detection. Generally, there is a significant difference between the brightness value of the pavement crack and the pavement background. Consequently, threshold segmentation-based methods can be employed to extract the crack features and complete the crack detection. Li and Liu proposed an approach for crack detection based on a neighboring difference histogram method [42]. They extracted the crack from the pavement background by setting a threshold. However, when the pavement background is dark, such as wet or asphalt pavement, the difference in brightness between the cracks and the pavement background is minimal, and the effect of the threshold segmentation method is poor. In addition, this method requires the threshold value to be set manually, which is largely subjective. Image filtering-based approaches can also be applied to crack detection. Subirats et al. employed a continuous wavelet transform to obtain a binary image to detect cracks [43]. However, the anti-noise ability of this method is poor. Cho et al. proposed a crack width transform for crack detection [44]. Their method can better detect crack width, and although it has a strong anti-noise ability, the threshold still needs to be set manually. Premachandra et al. proposed an image-based automatic road crack detection method employing pixel variance and discriminant analysis, and the method was effective [45]. Chinthaka et al. then used color variance distribution and discriminant analysis for road crack detection and achieved higher precision than conventional approaches [46]. However, in these studies the difference between the cracks and the pavement background was very obvious, so the crack features were easy to extract; when the difference is not obvious, the methods perform worse. Machine learning is also widely used in the field of image processing.
The application of this method to detect cracks further improves the automation level of crack detection. The genetic programming and percolation model-based approach was proposed for concrete pavement crack detection [47]. The algorithm enhances its anti-noise ability, accelerates the detection speed, and upgrades its precision, but its rate of convergence is slow. Shi et al. proposed a new pavement crack detection framework based on random structured forests named CrackForest [48]. As intricate structural features of cracks can be extracted, the framework has high detection precision and fast detection speed. However, the precision and intelligence levels of these methods are lower than those of deep learning approaches.
With the advance of DCNNs, object detection technology has become increasingly mature. CNNs can extract high-dimensional crack features and feed these features into a detector to complete crack classification and localization. This method greatly improves the accuracy and speed of crack detection [49]. Bhat et al. employed a CNN to detect cracks and achieved high precision [50]. However, the dataset they used was too small, so the model's generalization ability was weak. Qu et al. proposed an improved VGG16 network model to detect cracks [51]. The model clearly outperforms VGG16, U-Net, and percolation methods and obtains the highest F1 score on the CFD dataset and the Cracktree200 dataset. Zhang et al. proposed a novel model called APLCNet, which uses instance segmentation to attain pixel-level crack detection [52]. The model obtained higher precision, recall, and F1 scores on the CFD dataset; however, the study mainly employed datasets with concrete pavements, on which the crack features are more distinct. Therefore, these approaches perform well on concrete pavements, but their performance degrades on asphalt pavements. A novel model based on a feature pyramid and a hierarchical boosting network was proposed to detect pavement cracks [53]. The model integrated different levels of crack features and different kinds of crack datasets, including concrete and asphalt pavement image datasets, and therefore has good robustness. Song et al. proposed a network named CrackSeg that employed deep multiscale convolutional features to detect pavement cracks [54]. It achieved high performance in precision, recall, F1 score, and mIOU. However, the network does not make fine divisions among crack categories. Different crack types, such as transverse, longitudinal, and map cracks, cause different kinds of damage to the pavement structure, so a fine classification of crack categories is necessary.
Interference among different kinds of cracks also makes them more difficult to detect. Song et al. divided cracks into transverse, longitudinal, alligator, and block cracks and employed a multi-scale feature attention network to detect the cracks [55]. The classification precision of transverse and longitudinal cracks was above 95%, and the classification precision of alligator and block cracks was higher than 86%. However, the model does not consider the interference of surface repairs and pavement markings. Additionally, the model cannot locate cracks in the image or compute the positioning accuracy. Feng et al. employed an improved SSD model to achieve crack classification and localization, and it performed well [56]. However, they still failed to consider the interference from crack surface repairs and pavement markings. Maeda et al. provided an open road damage dataset and used several detectors to verify the validity of the data [57]; however, they failed to provide comprehensive evaluation indicators.

III. PROPOSED MODEL
A. SINGLE SHOT MULTI-BOX DETECTOR (SSD)
The SSD object detection network has a faster detection speed and higher accuracy than the R-CNN, Fast R-CNN, and Faster R-CNN detection networks, and its training speed and detection speed are faster than those of the YOLOv4 and YOLOv5 detection networks. Through comprehensive comparison, the SSD object detection network performs better and has been widely applied in various fields. Figure 1 shows the structure of the SSD network; its backbone adopts the VGG16 [58] feature extraction network. Extra feature layers are added after the backbone network to achieve multi-scale detection. The sizes of these layers decrease progressively, and the convolutional module for predicting detections differs for each feature layer. Detection predictions are produced by applying convolutional filters to each feature layer. A 3 * 3 * a convolutional kernel is the basic module for predicting the parameters of a detection on a feature layer of size m * m with a channels [59]. The kernel produces a score for a category and a shape offset for the default box coordinates. The kernel is applied at each of the m * m locations and produces bounding box offset values measured relative to a default box position. We input an image of size 300 * 300 * 3 and output feature layers at six scales, which are 38 * 38 * 512, 19 * 19 * 1024, 10 * 10 * 512, 5 * 5 * 256, 3 * 3 * 256, and 1 * 1 * 256, respectively, where 512, 1024, 512, and 256 denote the number of channels, i.e., the number of feature maps extracted from each feature layer. A feature map cell is each grid element in the feature map; for example, 38 * 38 denotes 38 * 38 cells. After that, we input the feature maps of each scale into the classifier and detector for classification and regression prediction, respectively, and then select the optimal prediction box through the non-maximum suppression (NMS) algorithm.
In addition, to improve network performance, the 38 * 38 feature maps are L2 normalized, with the normalization scale initialized to 20.
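As a rough sanity check, the layer sizes above determine the total number of default boxes and the output channels of each 3 * 3 prediction kernel. The sketch below assumes the standard SSD300 configuration with per-cell box counts [4, 6, 6, 6, 4, 4] (introduced in Section III-B) and six classes (our five categories plus background, an assumption for illustration):

```python
# Sketch: total default boxes and prediction-head output channels for
# the six SSD feature layers. Per-cell box counts follow the standard
# SSD300 configuration (assumption; see Section III-B of this paper).
feature_sizes = [38, 19, 10, 5, 3, 1]
boxes_per_cell = [4, 6, 6, 6, 4, 4]
num_classes = 6  # five crack-related categories + background (assumption)

total_boxes = sum(f * f * b for f, b in zip(feature_sizes, boxes_per_cell))
# Each 3 * 3 conv head outputs b * (num_classes + 4) channels:
# class scores plus four box-offset values per default box.
head_channels = [b * (num_classes + 4) for b in boxes_per_cell]

print(total_boxes)     # 8732 default boxes in total
print(head_channels)   # [40, 60, 60, 60, 40, 40]
```

The familiar SSD300 total of 8,732 default boxes falls out of this arithmetic.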

B. DEFAULT BOX
The default box in the SSD network is similar to the prior anchor in the Faster R-CNN network. However, the default boxes are applied to feature maps in feature layers of different scales. Default boxes of six different sizes are defined in the SSD network. Each feature layer initializes some default boxes, and these default boxes are adjusted to achieve optimal classification and regression prediction (see Figure 2). The dotted boxes denote the default boxes, and the red dotted box denotes the ground truth box in Figure 2. The numbers [4, 6, 6, 6, 4, 4] denote the number of default boxes in each feature map cell at the six scales, respectively. For instance, if a feature map with 38 * 38 cells has four default boxes per cell, there are 38 * 38 * 4 = 5,776 default boxes in the feature map. We need to compute the size and center of each default box. The size of the input images is S * S = 300 * 300. The calculation formulas of the default box are shown in Equations (1)-(5).
s_k = s_{\min} + \frac{s_{\max} - s_{\min}}{m - 1}(k - 1), \quad k \in [1, m] \tag{1}

w_k^a = s_k \sqrt{a_r} \tag{2}

h_k^a = s_k / \sqrt{a_r} \tag{3}

s_k' = \sqrt{s_k s_{k+1}} \tag{4}

(cx, cy) = \left(\frac{i + 0.5}{|f_k|}, \frac{j + 0.5}{|f_k|}\right) \tag{5}

where s_k denotes the scale of the default boxes in each feature map; m is the number of feature maps; and s_max and s_min are 0.9 and 0.2, respectively, meaning that the highest layer has a scale of 0.9 and the lowest layer has a scale of 0.2. Different aspect ratios a_r are imposed for the default boxes, and the width w_k^a and height h_k^a of each default box are calculated with Equations (2) and (3). Moreover, for a_r = 1, a default box with the scale s_k' = \sqrt{s_k s_{k+1}} is added, resulting in six kinds of default boxes. The center point (cx, cy) of each cell is computed with Equation (5), where |f_k| denotes the size of the k-th feature map.
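A minimal sketch of these default-box computations, assuming the standard SSD scale rule with s_min = 0.2, s_max = 0.9, and m = 6 feature layers (the aspect-ratio set below is illustrative):

```python
import math

# Sketch of the default-box geometry in Section III-B, assuming the
# standard SSD scale formula with s_min = 0.2, s_max = 0.9 over m = 6
# feature layers. The aspect-ratio set is an illustrative assumption.
def default_box_params(k, m=6, s_min=0.2, s_max=0.9,
                       ratios=(1, 2, 3, 1 / 2, 1 / 3)):
    s_k = s_min + (s_max - s_min) * (k - 1) / (m - 1)
    s_k1 = s_min + (s_max - s_min) * k / (m - 1)
    # (width, height) per aspect ratio: w = s_k * sqrt(a), h = s_k / sqrt(a)
    boxes = [(s_k * math.sqrt(a), s_k / math.sqrt(a)) for a in ratios]
    boxes.append((math.sqrt(s_k * s_k1),) * 2)  # extra box for a_r = 1
    return boxes

def centers(f_k):
    # (cx, cy) for every cell of an f_k * f_k feature map, as in Equation (5).
    return [((i + 0.5) / f_k, (j + 0.5) / f_k)
            for j in range(f_k) for i in range(f_k)]

print(default_box_params(1)[0])  # (0.2, 0.2): square box on the lowest layer
print(centers(2))
```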

C. ENCODE
In the encoding procedure, the offsets g_{xy} and g_{wh} from the default box to the ground truth box are computed as shown in Equations (6) and (7).

g_{xy} = \left(\frac{x_1 - x_0}{w_0 v_0}, \frac{y_1 - y_0}{h_0 v_0}\right) \tag{6}

g_{wh} = \left(\frac{\ln(w_1 / w_0)}{v_1}, \frac{\ln(h_1 / h_0)}{v_1}\right) \tag{7}

where x_1, y_1 are the center coordinates of the ground truth box; h_1, w_1 denote the height and width of the ground truth box; x_0, y_0 are the center coordinates of the default box; h_0, w_0 denote the height and width of the default box, respectively; and v_0 and v_1 are equal to 0.1 and 0.2, respectively.
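The encoding can be sketched as follows; the variance values v_0 = 0.1 and v_1 = 0.2 are taken from the text above, and the box format (center x, center y, width, height) is an assumption for illustration:

```python
import math

# Sketch of the SSD encoding in Equations (6)-(7): offsets from a
# default box (x0, y0, w0, h0) to a ground-truth box (x1, y1, w1, h1),
# scaled by the variances v0 = 0.1 and v1 = 0.2.
def encode(default, gt, v0=0.1, v1=0.2):
    x0, y0, w0, h0 = default
    x1, y1, w1, h1 = gt
    g_cx = (x1 - x0) / (w0 * v0)
    g_cy = (y1 - y0) / (h0 * v0)
    g_w = math.log(w1 / w0) / v1
    g_h = math.log(h1 / h0) / v1
    return (g_cx, g_cy, g_w, g_h)

# A default box that matches the ground truth exactly encodes to all zeros.
print(encode((0.5, 0.5, 0.2, 0.2), (0.5, 0.5, 0.2, 0.2)))  # (0.0, 0.0, 0.0, 0.0)
```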

D. DECODE
The decoding process obtains a prediction box from the predicted offsets according to the formulas shown in Equations (8) and (9).

(x, y) = \left(x_0 + \hat{g}_x v_0 w_0,\ y_0 + \hat{g}_y v_0 h_0\right) \tag{8}

(w, h) = \left(w_0 e^{\hat{g}_w v_1},\ h_0 e^{\hat{g}_h v_1}\right) \tag{9}

where x and y are the center coordinates of the prediction box, and h and w are the height and width of the prediction box.
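A self-contained sketch of the decoding step, i.e., the inverse of the encoding in Equations (6) and (7), under the same variance assumptions (v_0 = 0.1, v_1 = 0.2):

```python
import math

# Sketch of the SSD decoding in Equations (8)-(9): recover a prediction
# box from a default box (x0, y0, w0, h0) and predicted offsets
# (g_cx, g_cy, g_w, g_h), with variances v0 = 0.1 and v1 = 0.2.
def decode(default, g, v0=0.1, v1=0.2):
    x0, y0, w0, h0 = default
    g_cx, g_cy, g_w, g_h = g
    x = x0 + g_cx * v0 * w0
    y = y0 + g_cy * v0 * h0
    w = w0 * math.exp(g_w * v1)
    h = h0 * math.exp(g_h * v1)
    return (x, y, w, h)

box = decode((0.5, 0.5, 0.2, 0.2), (1.0, -1.0, 0.5, 0.5))
print(box)  # center shifted by +-0.02, sides scaled by exp(0.1)
```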

E. LOSS FUNCTION
The loss function of the SSD network is a multi-task loss function, including classification and regression loss (as shown in Equations (10)-(17)). We call a sample containing an object in the default box a positive sample, and a sample with no object in the default box a negative sample. Most default boxes do not include objects, which results in an imbalance between positive and negative samples. To balance them, we set the ratio of positive to negative samples to 1:3. Classification loss includes positive and negative sample loss, and multi-class cross-entropy loss is adopted. Negative samples do not need to be positioned, so regression loss includes only positive sample loss, and the smooth_L1_loss function is used.
L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right) \tag{10}

L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{p}\, \mathrm{smooth}_{L1}\left(l_i^m - \hat{g}_j^m\right) \tag{11}

L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log \hat{c}_i^p - \sum_{i \in Neg} \log \hat{c}_i^0 \tag{12}

where N denotes the number of matched default boxes; i and j denote the indices of the prediction boxes and ground truth boxes, respectively; p is the category number; \hat{c}_i^p denotes the probability that the i-th prediction box predicts category p, with p = 0 being the background; L_{conf} is the confidence loss; L_{loc} is the localization loss; l and \hat{g} denote the predicted box and the encoded ground truth box, respectively; and α is the weight term employed to balance the classification and regression losses and is set to 1.

F. NON-MAXIMUM SUPPRESSION (NMS)
NMS mainly solves the problem of a target being detected many times [60]. First, the box with the highest confidence is selected from all the detection boxes, and then the intersection-over-union (IOU) values between it and the remaining boxes are calculated successively. If a value is greater than a certain threshold (i.e., the overlap is too high), the corresponding box is removed. The process is repeated for the remaining detection boxes until all detection boxes are processed. We set the IOU threshold to 0.45.
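The NMS procedure described above can be sketched as follows. The 0.45 IOU threshold matches the value used in this paper; the (x1, y1, x2, y2, score) box format is an assumption for illustration:

```python
# Minimal NMS sketch matching the procedure described above, with the
# paper's IOU threshold of 0.45. Boxes are (x1, y1, x2, y2, score).
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, thresh=0.45):
    boxes = sorted(boxes, key=lambda b: b[4], reverse=True)
    kept = []
    while boxes:
        best = boxes.pop(0)            # highest remaining confidence
        kept.append(best)
        # drop boxes overlapping the kept box too much
        boxes = [b for b in boxes if iou(best, b) <= thresh]
    return kept

dets = [(0, 0, 10, 10, 0.9), (1, 1, 10, 10, 0.8), (20, 20, 30, 30, 0.7)]
print(len(nms(dets)))  # 2: the overlapping lower-score box is suppressed
```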

G. DEFORMABLE CONVOLUTION
The addition of deformable convolutions to a CNN can upgrade the performance of the network [61], [62]. For example, adding deformable convolution to an object detection network that consists of ResNet or CSPDarkNet as a backbone feature extraction network improves the detection performance of the network [63]. Deformable convolution adds an offset variable at each sampling point to increase its adaptability to geometric deformation compared with standard convolution (as shown in Equations (18) and (19)). Figure 3 shows the realization process of deformable convolution. The convolution kernel of the deformable convolution can adapt to the shape characteristics of the objects and match the shape changes of the objects. The convolution region always covers the surrounding objects. Sampling points of the deformable convolution are not uniformly distributed. Instead, they are distributed in the interior of the detected object according to the shape of the detected object. Deformable convolution has a strong scale modeling ability and a larger receptive field than standard convolution.
y(P_0) = \sum_{P_n \in R} w(P_n) \cdot x(P_0 + P_n) \tag{18}

y(P_0) = \sum_{P_n \in R} w(P_n) \cdot x(P_0 + P_n + \Delta P_n) \tag{19}

where x(P) denotes the pixel value at position P in the convolution window, w is the weight of the convolution kernel, R represents the standard convolution kernel grid of size 3 × 3, P_0 is the center of the window, and \Delta P_n denotes the learned offset of each sampling point.
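The deformable sampling of Equation (19) can be illustrated with a minimal pure-Python sketch: each kernel tap P_n is shifted by an offset ΔP_n, and the resulting (possibly fractional) location is read with bilinear interpolation. The offsets here are fixed toy values; in the real layer they are predicted by an additional convolution:

```python
# Sketch of deformable sampling (Equation (19)): each kernel tap P_n is
# shifted by an offset dP_n, and the fractional location is read with
# bilinear interpolation. Toy offsets; real layers predict them.
def bilinear(img, y, x):
    h, w = len(img), len(img[0])
    y0, x0 = int(y), int(x)  # floor for non-negative coordinates
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    return (img[y0][x0] * (1 - dy) * (1 - dx) + img[y0][x1] * (1 - dy) * dx
            + img[y1][x0] * dy * (1 - dx) + img[y1][x1] * dy * dx)

def deformable_tap(img, p0, taps, offsets, weights):
    # y(P_0) = sum_n w(P_n) * x(P_0 + P_n + dP_n)
    out = 0.0
    for (py, px), (oy, ox), wt in zip(taps, offsets, weights):
        out += wt * bilinear(img, p0[0] + py + oy, p0[1] + px + ox)
    return out

img = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
taps = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 0), (0, 1), (1, -1), (1, 0), (1, 1)]
zero_offsets = [(0.0, 0.0)] * 9       # zero offsets reduce to standard conv
weights = [1 / 9] * 9                 # a 3x3 mean filter
print(deformable_tap(img, (1, 1), taps, zero_offsets, weights))  # ~5.0
```

With all offsets zero, the layer reduces exactly to a standard 3 × 3 convolution, which is why the deformable layer can be initialized from a pretrained standard one.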

H. DEFORMABLE SSD
The original SSD employs VGG16 as the backbone feature extraction network and uses standard convolution to extract the features. We add a deformable convolution (D_Conv) behind Conv7 in the VGG16 network, and Table 1 shows its parameter settings. When crack images are fed into the input layer, the proposed network does the following:
(1) The crack images are resized to 300 * 300.
(2) The resized images are passed through the improved VGG16 network, which extracts feature maps at different scales.
(3) Default boxes are generated for each feature map cell.
(4) The offsets are predicted for the default box shapes in each cell.
(5) Per-class confidence scores are predicted for each box.
(6) Based on the IOU and NMS, the ground truth boxes are matched with the predicted boxes.

IV. DATASET
Crack Dataset: The dataset consists of highway pavement survey images from the national highway and provincial highway in Gansu Province, China. The cracks are about 3 mm wide. The data used in this study were mainly from onboard CCD cameras, which cover most pavement conditions. The pavement images have resolutions of 1688 × 1874.
Training networks with large images would require a large amount of memory, thus overburdening the training process. Further, since the crack region occupies only a small part of the entire image, it is difficult to extract features and recognize cracks. Therefore, to reduce memory usage and improve precision, we divided the original highway crack images into small blocks with a size of 562 × 562 pixels (see Figure 4). Then, the dataset was manually divided into five categories: transverse cracks (1), longitudinal cracks (2), map cracks (3), crack surface repairs (4), and pavement markings (5). Subsequently, we used LabelImg to label the images, including the crack category and location. We also divided the dataset into three sub-datasets: the training set (18,694 pictures), the validation set (2,077 pictures), and the test set (1,483 pictures). Table 2 shows the number of objects per category.
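The block-splitting step can be sketched as follows. The paper does not state how partial edge blocks were handled, so shifting the last block inward (accepting some overlap) is an assumption for illustration:

```python
# Sketch: origins of 562 x 562 blocks tiling a 1688 x 1874 survey image.
# Edge handling is an assumption: the last block is shifted inward so
# every block is full-size, at the cost of some overlap.
def block_origins(length, block=562):
    xs = list(range(0, length - block + 1, block))
    if xs[-1] + block < length:       # remainder at the edge:
        xs.append(length - block)     # one overlapping final block
    return xs

cols = block_origins(1688)   # x-origins
rows = block_origins(1874)   # y-origins
print(cols, rows)
print(len(cols) * len(rows))  # blocks per image
```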

V. RESULTS AND DISCUSSION
Each experiment in our study was performed under a Windows 10 operating system with an Intel(R) Core (TM) i9-9900K CPU at 3.60 GHz and an NVIDIA GTX 2080Ti GPU with 12 GB of memory. To verify the applicability of the proposed model, we first tested our model on the PASCAL VOC2007 dataset. We employed pre-trained weights and transfer learning to train our model and set the initial learning rate to 0.0005 and the number of epochs to 40. One epoch means the model has been trained once over the full training set. Table 3 shows the test results. The mAP of the proposed model was 3.1% higher than that of the original SSD model, and the AP of every class improved. Fine-tuning the model over a long training process is an effective approach to avoiding overfitting and degeneration. To better train the model and obtain better results, we divided the training process into two stages and fine-tuned the training parameters in the second stage. Table 4 shows the training parameters of the different models at different stages. When training deformable SSD, we adopted the idea of transfer learning and used the trained SSD model as the initial weights of deformable SSD. Because the training process was short and the performance was already satisfactory, we did not divide the training of the proposed model into two stages. To highlight the superiority of our model, we compared it with the original SSD model and the YOLOv4 model. We plotted the loss curves of the three models during the training process in Figure 5. The results show that the convergence speed of the proposed model is the fastest and its loss is the smallest.
To evaluate network performance, we calculated the following indexes: precision (P), recall (R), F1 score, AP, mAP, and FPS. Their calculation formulas are shown in Equations (20)-(24), respectively.

P = \frac{TP}{TP + FP} \tag{20}

R = \frac{TP}{TP + FN} \tag{21}

F1 = \frac{2PR}{P + R} \tag{22}

AP = \sum_{k=1}^{N} P(k)\,\Delta r(k) \tag{23}

mAP = \frac{1}{m} \sum_{i=1}^{m} AP_i \tag{24}

In the formulas, true positive (TP) denotes a correct positioning result, and false positive (FP) denotes a wrong positioning result. False negative (FN) means an object is not predicted; for example, there is an object in the picture, but no prediction box is drawn. N denotes the number of predicted samples, \Delta r(k) denotes the change in recall from the (k-1)-th to the k-th sample, and m is the number of categories. FPS denotes frames per second, namely the number of images detected per second, which is an important indicator for evaluating the detection speed of the model.
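These metrics can be sketched directly from their definitions; the TP/FP/FN counts below are illustrative toy values, not the paper's results:

```python
# Sketch of the evaluation metrics in Equations (20)-(24), computed from
# toy TP/FP/FN counts (illustrative values, not the paper's results).
def prf1(tp, fp, fn):
    p = tp / (tp + fp)          # precision
    r = tp / (tp + fn)          # recall
    f1 = 2 * p * r / (p + r)    # harmonic mean of P and R
    return p, r, f1

def mean_ap(aps):
    return sum(aps) / len(aps)  # mAP: mean of the per-class APs

p, r, f1 = prf1(tp=80, fp=20, fn=20)
print(p, r, f1)                 # 0.8 0.8 0.8
print(mean_ap([0.9, 0.8, 0.85]))
```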
We calculated the indicators of the three models on the test set, as shown in Table 5. Our model obtained the highest mAP on the test set, 0.55% and 10.4% higher than the SSD and YOLOv4 models, respectively, and had the best performance. The YOLOv4 network had the worst performance. Because the YOLOv4 model only obtains features at three scales, it has a poor effect on large-scale object detection. However, map cracks usually cover the whole image (as shown in Figure 6), which requires a large-scale object detection feature layer to obtain global features. The SSD network has six scale feature layers, including large-scale feature layers, and it is suitable for map crack detection. The calculated FPS shows that the detection speed of the proposed model is faster than that of the YOLOv4 model and slightly slower than that of the SSD model. Because we used transfer learning to train our model, the training time was greatly shortened. The model needed more memory during training because of the increased model complexity after adding the deformable convolution. The results show that the proposed model and the SSD model achieved great improvements in precision, recall, and F1 score for map cracks, which verifies that the YOLOv4 model performs poorly on large-scale object detection. The proposed model attained the highest AP for map cracks and crack surface repairs and higher AP for the other categories, which indicates that our model is more suitable for crack detection.
Taking map cracks as an example, we plotted the precision and F1-score curves while changing the confidence threshold in Figure 7 and Figure 8. Confidence is a probability that reflects the similarity between the predicted and ground truth objects. Precision should increase as the confidence threshold increases. However, the F1-score first increased and then decreased as the confidence threshold increased. We therefore set the confidence threshold to 0.5 according to the results of our experiments.

VI. CRACK DETECTION RESULTS
As shown in Figure 9, cracks are detected on different pavement types. Cracks, surface repairs, and markings on the pavement are all detected, which is important for removing the interference of surface repairs and markings. The proposed model can detect not only crack pictures under different pavement backgrounds but also crack pictures of different sizes, and it therefore has strong applicability. However, as can be seen in Figures 9 (d), (g), and (i), there are omissions at the image edge and in multi-objective detection. During crack detection, we used a computer with a GPU to achieve fast detection; therefore, a highly configured computer is necessary. Currently, our model is not applicable to crack detection on mobile or other small hardware.

VII. CONCLUSION
We propose a novel object detection model named deformable SSD to detect asphalt highway pavement cracks in complicated environments. We draw the following conclusions: (1) Annotating the crack category and location on a crack detection dataset and using object detection networks to train crack detection models can effectively detect crack class and location. Obtaining this information is important for achieving automatic and intelligent crack detection. (2) Collecting crack images on different pavements to increase sample diversity and including surface repairs and markings in the dataset can reduce interference among classes and upgrade the generalization ability of the model. (3) Improving the SSD network by adding a deformable convolution can upgrade the model's performance. Employing fine-tuning in the training process can avoid overfitting and model degeneration, and transfer learning can accelerate model convergence. Comparative experiments on VOC2007 and our dataset showed the proposed model outperformed the original SSD and YOLOv4 models. (4) The detection results show that our model can detect not only cracks in complicated environments but also multi-objective crack images. However, it is difficult to detect cracks at the image edge. Therefore, we aim to develop approaches to enhance crack detection features at the image edge.