Dynamic Pavement Distress Image Stitching Based on Fine-Grained Feature Matching

Camera-based pavement distress detection plays an important role in pavement maintenance. Duplicate collection of the same distress and multiple overlaps of defects are practical problems that greatly affect detection results. In this paper, we propose a fine-grained feature-matching and image-stitching method for pavement distress detection that eliminates duplications and visually demonstrates local pavement distress. The original images are processed through a hierarchical structure comprising rough data filtering, feature matching, and image stitching. The original data are first filtered based on global positioning system (GPS) information, which avoids full-dataset comparison and improves computational efficiency. A scale-invariant feature transform is introduced for feature matching based on key regions extracted using spectral saliency mapping and bounding boxes. Two parameters, the mean Euclidean distance (MEuD) and the matching rate (MCR), are constructed to identify duplication between two images. A support vector machine is then applied to determine the thresholds of MEuD and MCR. This paper further discusses the correlation between the sampling frequency and the number of detection vehicles. The proposed method can effectively solve the problem of duplications in pavement distress detection and enhances the feasibility of multivehicle, image-based pavement distress detection.


Introduction
Pavement condition measurements are essential for maintenance decisions [1]. Pavement distress detection has traditionally been a highly laborious and time-consuming task [2]. Currently, the most commonly used detection vehicle is a specially modified car with precise but delicate instruments, and the detection process is time-consuming, expensive, and inefficient [3]. With the increasing demand for real-time pavement maintenance, detection methods based on lightweight sensors and rough-set data mining are becoming popular. Automated pavement detection using cameras [4], lasers, and ultrasonic sensors [5] is widely used as a replacement for manual work, which significantly improves efficiency and lowers cost [6]. Among these sensors, the camera is the preferred choice in pavement detection, not only because of its low cost and intuitive data but also because its lightweight and detachable design satisfies the requirements of multiple-vehicle detection and rough-set data collection [7]. Therefore, pavement condition recognition based on video images has become a central issue [8].
With the development of deep learning and computer vision technology, image-processing algorithms now perform well in the automatic identification of pavement distress [9]. Different kinds of pavement defects such as cracks, potholes, and nets can be identified with relatively high accuracy [10,11]; thus, image-based detection has proved to be a reliable and efficient method [12]. Different approaches have been employed for image analysis. The Sobel edge detector recognizes edges in an image by smoothing the image in the direction perpendicular to the derivative before computing the derivative [13]. The Canny method is a multistep algorithm that can detect edges and concurrently suppress noise in an image [14]. The semantic texton forests (STF) algorithm is also used as a supervised classifier on a calibrated region of interest (ROI) in the detection of multiple pavement defects [15]. However, the results of convolutional neural networks (CNNs) are significantly better than those of the aforementioned algorithms in image-based detection [16].
CNNs have become the most popular algorithm and have been constantly improved to better fit the distress detection [17]. CNNs have the advantage of performing feature extraction and predicting crack/noncrack conditions in an integrated and fully automated manner with good prediction performance and a classification accuracy rate (CAR) of 92.08% [18]. Gopalakrishnan et al. employed a deep CNN with transfer learning for pavement distress detection [19]. Jenkins et al. proposed a deep fully CNN to perform pixel-wise classification of surface cracks on roads and pavement images with 92.46% precision [20].
In addition, 3D laser-illuminated cameras are also used to detect pavement deterioration. Li et al. applied a fully automated algorithm for segmenting and enhancing pavement cracks based on 3D pavement images [21]. The depth information collected by 3D techniques helps in analyzing cracks, textures, rutting, etc.
However, some practical problems remain unsolved in road detection using a 2D or 3D camera. High acquisition frequencies are used to minimize the number of missed defects, but they also cause multiple overlaps of defects. In addition, low vehicle speed or traffic congestion often causes image duplications. Such duplication can greatly affect the statistical reliability of pavement health assessment and the calculation of related indices such as the pavement condition index (PCI) [22]. Moreover, because length and area are used as units of summarization to better describe a crack, this problem is of even greater concern.
For comprehensive inspection cars, wheel encoders are adopted to avoid overlaps. However, this solution is not only expensive but also unsuitable for our lightweight equipment, which can be installed and put to work quickly on any car. Therefore, this paper focuses on two existing problems: (1) A defect appearing in different images might be misidentified as different defects because of location and pixel-size discrepancies between the images, as shown in Figure 1(a). (2) A longitudinal crack crossing different frames (Figure 1(b)) might be recognized as several different cracks instead of one long crack.
To solve the problems mentioned above, we propose a pavement distress stitching method to preprocess detected data. On the one hand, stitching is a technology-neutral pattern to use in locating distress over multiple passes, especially over time. It eliminates duplications and orderly sorts the statistical summarizations such as number, length, and area. On the other hand, adjacent defects in consecutive images can be stitched to form a whole lane-level picture of pavement distress. Such panoramic pictures are conducive to manual verification while providing visualizations of the pavement condition.
One of the most crucial parts of image stitching is the feature-matching algorithm. Such algorithms can be divided into three categories: global feature-based matching algorithms, local feature-based matching algorithms, and deep learning algorithms. Global feature-based matching algorithms such as the histogram of oriented gradient (HOG), local binary pattern (LBP), and Haar-like features perform well in human detection [23,24]. Compared with global feature-based matching algorithms, local feature-based matching algorithms are more stable. The scale-invariant feature transform (SIFT) was first proposed by Lowe as a local feature description algorithm based on the analysis of existing invariance-based feature detection methods [25]. SIFT has good stability and invariance, but it imposes a large computational burden [26]. Speeded-up robust features (SURF) was proposed as a replacement for SIFT; it has a lower computational cost for real-time systems at the expense of poorer relative performance [27]. The oriented FAST and rotated BRIEF (ORB) algorithm is rotation invariant and resistant to noise, and it performs almost as well as SIFT while being two orders of magnitude faster [28]. In the field of deep learning, deep matching (DM) is one of the most popular methods for establishing quasi-dense correspondences between images [29]. DM relies on a hierarchical, multilayer, correlational architecture designed for matching images that have high information dimensions and need sophisticated calculation. Moreover, if the feature matrix correlation parameter threshold control is too strict, the angular resolution will consequently decline. Therefore, SIFT is adopted in this paper because of its stability.
Image stitching is one of the main applications of SIFT. Lowe proposed an invariant feature-based approach to fully automatic panoramic image stitching [30], while Xiaoyan et al. created a large field of view for robot control and movement using dynamic image stitching when there was a moving object in the environment [31]. Qiu et al. proposed an image-stitching algorithm based on aggregated star groups to obtain a complete star map [32]. This paper applies the image-stitching method in pavement detection to solve engineering application problems.
Based on the above problems, we present a pavement distress image-stitching method based on a feature-matching algorithm. Since the background of the pavement is monotonous and the algorithm can falsely match the features of the asphalt pavement, we propose the use of the spectral saliency mapping (SSM) method along with a pavement distress bounding box to extract information from dense regions. The scale-invariant features extracted from the key region serve as the stitching points between two images. The remainder of this paper is organized as follows. In Section 2, we present the data processing methods. In Sections 3, 4, and 5, we describe the framework of the proposed approach, where the feature matching, key region extraction, and image stitching are introduced, respectively. In Section 6, we discuss the correlation between the sampling frequency and the number of detection vehicles. In Section 7, we offer the conclusions of this study.
1.1. Data. In our experiment, an integrated detection system was used to collect pavement images. An industrial camera was fixed on the back of the vehicle, facing obliquely downward. The vehicle was also equipped with a GPS unit, which allows the images to be matched to the corresponding locations on the road. Full videos were stored in a vehicle-mounted terminal, while clipped images were uploaded at a frequency of 2 Hz.
Several typical pavement distress types on urban roads in Shanghai are considered in this paper, including cracks, patched cracks, potholes, patched potholes, nets, patched nets, and manhole covers (Figure 2). A 13.2 km road section on Caoan Road in Shanghai was chosen for experiments and validation, as shown in Figure 3. The algorithm processed more than 6000 images and generated bounding boxes when defects were recognized. At the same time, the results were manually calibrated to guarantee accuracy.
Figure 4 illustrates the flow chart of the proposed hierarchical framework for image processing, including rough data filtering, feature matching, and image stitching. The original images are first filtered according to the GPS information, which excludes most of the irrelevant images. For the images that have the most overlap, a feature-matching method is then applied to extract the SIFT features in the key region using SSM and bounding boxes. After the feature-matching process, two or more images are stitched according to the features and the fitted perspective matrix.

Rough Data Filtering Using GPS to Reduce Computational Cost
The purpose of the preprocessing is to reduce the computational cost before further analysis. The basic idea is to select images based on the GPS information, because the locations of potentially matched images must be close. GPS, although not considered accurate enough on its own, excludes a large number of images that are geographically too far apart to be matched and thus serves as a rough data filter that reduces the amount of computation. The GPS module recorded the real-time locations during detection, which were then linked to the images according to their timestamps [33]. The GPS information also makes it easier to manage the statistical data at the road-segment level. Because of the instability of GPS, images within 10 meters ($P_n$) are selected as candidates for matching to make sure that no target picture is omitted. The chance that two defects within 10 meters are too similar to be differentiated by a human or an algorithm is negligible. If this happens, the number of candidate images would exceed the number of detection passes, and in this situation the images need to be checked by a human. The Haversine equation [34] was adopted to calculate the distance between two points from their longitudes and latitudes, as formulated in the following equation:
$$d = 2R \arcsin\left(\sqrt{\sin^{2}\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \cos\varphi_2 \sin^{2}\left(\frac{\lambda_2 - \lambda_1}{2}\right)}\right), \tag{1}$$
where $\varphi_1$/$\varphi_2$ and $\lambda_1$/$\lambda_2$ are the latitude and longitude of point 1 and point 2, respectively, $R$ is the radius of the Earth, and $d$ is the distance between the two points.
The same defects among $P_n$ were searched for and labeled manually to build the ground truth. In most cases, the same defects can be found within 10 meters unless there is a GPS deviation. Therefore, when $P_n$ was an empty set, the GPS locations of the retrieved images ($P_x$) were examined, and the distances and time lags from their adjacent, matched images ($P_k$) were calculated. Figure 5 describes the method of dealing with abnormal data. The collection speed was used as a discriminative index.
When the calculated value was more than 1.5 times the true value, as formulated in equation (2), the location was considered to be in error and was redefined as the time-weighted average of $P_k$:
$$\frac{l\left(\mathrm{GPS}_X, \mathrm{GPS}_K\right)}{\left|t_x - t_k\right|} > 1.5\,v, \tag{2}$$
where $v$ is the true value of the velocity, $l$ is the distance between two locations, $\mathrm{GPS}_X$ is the GPS location of $P_x$ and $t_x$ is the timestamp of $P_x$, and $\mathrm{GPS}_K$ is the GPS location of $P_k$ and $t_k$ is the timestamp of $P_k$.
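The rough filter described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the record layout (dictionaries with `lat`, `lon`, and `t` keys), the function names, and the Earth-radius constant are assumptions introduced here, while the 10 m radius and the 1.5 speed factor come from the text.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters (assumed constant)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(min(1.0, a)))


def candidates_within(query, images, radius_m=10.0):
    """Keep only the images whose GPS tag lies within radius_m of the query image."""
    return [img for img in images
            if haversine_m(query["lat"], query["lon"], img["lat"], img["lon"]) <= radius_m]


def gps_is_suspect(p_x, p_k, true_speed_mps, factor=1.5):
    """Flag p_x when the speed implied by its GPS tag exceeds factor * true speed."""
    dt = abs(p_x["t"] - p_k["t"])
    if dt == 0:
        return False  # no speed can be inferred from identical timestamps
    implied = haversine_m(p_x["lat"], p_x["lon"], p_k["lat"], p_k["lon"]) / dt
    return implied > factor * true_speed_mps
```

Only the images returned by `candidates_within` would then be passed on to feature matching, which is what keeps the method from comparing every pair in the dataset.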

SIFT Feature Matching
SIFT features are located at the scale-space maxima/minima of the difference-of-Gaussian function, which keeps them invariant to rotation, scale, and illumination. They are robust to changes in viewpoint, affine changes, and noise [35]. SIFT feature matching mainly includes the following three steps.

3.1. Feature Detection in Scale Space. This step involves searching for scale-invariant features in the multiscale images in scale space. The scale space is defined by the following convolution operation:
$$L(x, y, \sigma) = G(x, y, \sigma) * I(x, y), \tag{3}$$
where $\sigma$ is the scale-space factor, $G$ is derived from a variable-scale Gaussian distribution, and $I$ is the input image. The difference of Gaussian (DOG) function can be further established from the difference of nearby scales separated by a constant multiplicative factor $k$ as follows:
$$D(x, y, \sigma) = \left(G(x, y, k\sigma) - G(x, y, \sigma)\right) * I(x, y) = L(x, y, k\sigma) - L(x, y, \sigma). \tag{4}$$

3.2. Feature Localization. The candidate feature points extracted from the images in scale space are further refined by a detailed fit to the nearby data to determine their locations, scales, and ratios of principal curvatures. This information allows points that have low contrast or are poorly localized along an edge to be rejected. The DOG function at the candidate feature points $X$ is adopted here to discard unstable features with low contrast in the images:
$$D(X) = D + \frac{1}{2}\frac{\partial D^{T}}{\partial X} X, \tag{5}$$
where $X$ denotes the offset from the location of the extremum, and all extrema with a value of $D(X)$ less than 0.03 are discarded. In this paper, the threshold of the principal curvature is set to 0.6, considering that the edge-detection results of the pavement distress are not obvious.
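The scale-space construction and the contrast check can be illustrated with a short NumPy sketch. It is a simplified sketch under stated assumptions: the values `sigma=1.6` and `k=2**0.5` are commonly used SIFT defaults and are not given in the text, the helper names are hypothetical, image intensities are assumed to be scaled to [0, 1], and the contrast mask is applied to every pixel rather than only to refined extrema.

```python
import numpy as np


def gaussian_blur(img, sigma):
    """Separable Gaussian blur with edge replication (output keeps the input size)."""
    radius = max(1, int(round(3 * sigma)))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(np.asarray(img, dtype=float), radius, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)


def difference_of_gaussians(img, sigma=1.6, k=2 ** 0.5):
    """D = L(k * sigma) - L(sigma), where L is the Gaussian-blurred image."""
    return gaussian_blur(img, k * sigma) - gaussian_blur(img, sigma)


def low_contrast_mask(dog, threshold=0.03):
    """True where |D| is below the contrast threshold (candidates to discard)."""
    return np.abs(dog) < threshold
```

On a uniform patch of pavement the DOG response is essentially zero everywhere, so every pixel falls under the contrast threshold, which is consistent with the observation that plain asphalt yields few stable features.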

3.3. Orientation Assignment and Feature Description. The main direction and auxiliary directions of the key points are assigned according to the gradient direction histogram of the key feature points, and the result is described mathematically as a matrix of 2 * 2 * 8 dimensions. SIFT features are calculated, and the matched features are shown in Figure 6.
In Figure 6(a), the frame indicates the gradient direction of an extracted feature point. In Figure 6(b), a line indicates a link between two matched features. The more links there are, the greater the probability that the images share the same feature. However, the features of both pavement distress and normal pavement are extracted, as shown in Figure 7(a). Because of the similarity of the pavement structure and the pavement markings, matching errors can easily arise. Therefore, a bounding box is needed to restrict feature extraction to the designated area, which can greatly improve the matching accuracy and pertinence, as shown in Figure 7(b).
Meanwhile, the random sample consensus (RANSAC) method was applied once the feature matching was finished. RANSAC was first proposed by Fischler and Bolles as a robust estimation procedure that uses a minimal set of randomly sampled correspondences to estimate image transformation parameters and screen for correct data [36]. In general, different perspectives can be related by a perspective matrix, and RANSAC was used to find the parameters with the maximum likelihood in image matching. Theoretically, all the matched feature points should satisfy the matrix transformation. However, there will always be some errors, and RANSAC rejects the abnormal values. The SIFT-matched results used in this paper were processed with RANSAC, which can effectively improve the reliability and robustness of the feature points. The mean Euclidean distance (MEuD) between two feature points and the matching rate (MCR) were used as indices to evaluate the matching degree.
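The hypothesize-and-verify loop behind RANSAC can be shown with a deliberately simplified model. The sketch below estimates a pure 2D translation instead of the perspective matrix used in the paper, so that a single match is a minimal sample; the function name, the pixel tolerance, and the iteration count are illustrative assumptions.

```python
import random


def ransac_translation(matches, tol=3.0, iterations=200, seed=0):
    """Estimate a 2D shift from point matches while rejecting outliers.

    matches: list of ((x1, y1), (x2, y2)) pairs.
    Returns ((dx, dy), inlier_indices).
    """
    if not matches:
        raise ValueError("at least one match is required")
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(iterations):
        # Minimal sample: one match fully determines a translation.
        (x1, y1), (x2, y2) = rng.choice(matches)
        dx, dy = x2 - x1, y2 - y1
        # Consensus set: matches that agree with this shift within tol pixels.
        inliers = [i for i, ((ax, ay), (bx, by)) in enumerate(matches)
                   if ((ax + dx - bx) ** 2 + (ay + dy - by) ** 2) ** 0.5 <= tol]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    # Refit the shift on every inlier of the best hypothesis.
    n = len(best_inliers)
    dx = sum(matches[i][1][0] - matches[i][0][0] for i in best_inliers) / n
    dy = sum(matches[i][1][1] - matches[i][0][1] for i in best_inliers) / n
    return (dx, dy), best_inliers
```

For the perspective case the structure is the same, but each hypothesis is fitted from four correspondences and the residual is measured after applying the estimated matrix.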
The Euclidean distance indicates the matching degree, and the matching rate illustrates the proportion of correctly matched points. The smaller the Euclidean distance is, the better two feature points match. When two defects are of the same type, there will be more matched features than when they are not. However, the matched SIFT features do not fully indicate whether two objects are the same object. The shortest Euclidean distance can only identify the best match for a SIFT feature point among the feature points of the other image. Hence, it is difficult to judge with complete certainty whether two matched features belong to the same defect using a numerical threshold or a threshold derived from the root mean square of the distance. In this paper, the MEuD and the MCR of the matched feature points were used as indicators for evaluating image similarity. The MEuD is defined in formula (6), where MEuD(S, T) is the root mean square distance between two images (S and T), $m$ is the number of matched SIFT feature points, $j$ is the sequence number of a feature, and $s_{ik}$/$t_{jk}$ are the entries of the SIFT matrices. Because the root mean square distance is affected by the size of the images, the MCR was also used as a similarity evaluation index. The MCR is defined as follows:
$$\mathrm{MCR} = \frac{N_m}{N} \times 100\%, \tag{7}$$
where $N$ represents the number of points and $N_m$ is the number of matched points. The MCR indicates the proportion of all retrieved matched SIFT features. Cross-validation was used to calculate the matching accuracy of the SIFT features. Table 1 shows the SIFT matching results of several selected images in a 10-m-long test section. Five of them are recognized as the same pothole by the algorithm.
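The two similarity indices can be computed as in the sketch below. Since the MEuD formula itself is not reproduced here, the code assumes the reading suggested by its name, namely the plain mean of the Euclidean distances between matched descriptor pairs; the function names and the choice to return `None` when nothing matches are illustrative.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two descriptors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def meud(matched_pairs):
    """Mean Euclidean distance over matched descriptor pairs (assumed reading).

    Returns None when there are no matches, i.e. the index is undefined.
    """
    if not matched_pairs:
        return None
    return sum(euclidean(s, t) for s, t in matched_pairs) / len(matched_pairs)


def mcr(n_matched, n_total):
    """Matching rate N_m / N as a fraction between 0 and 1."""
    return n_matched / n_total if n_total else 0.0
```

Returning `None` for an empty match set mirrors the case of two different defect types, where no matching features are found and the MCR is zero.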
Although SIFT is robust to the shooting angle, the MCR of images separated by a large distance is only 37.39%, while the MCR of images with similar angles is as high as 85.50%. The MEuD of the images is relatively stable, on the order of $10^{4}$. For different types of defects, the MEuD does not exist and the MCR equals zero because no matching features can be found.
A matching test of two hundred pairs of images was performed on the sample library to determine the SIFT-based image matching threshold. A support vector machine (SVM) was used to estimate the separating plane that determines the model threshold. The matching accuracy of the SIFT features, as determined with five-fold cross-validation, is 81.4%. Figure 8 shows the results of the binary classification based on the SVM, in which the dots represent correct matching results and the crosses represent incorrect matching results. The SIFT model is more inclined to identify a mismatch as a correct match because SIFT has a certain degree of angular robustness. Unfortunately, this can easily cause errors due to the effect of shooting angles. Because two different but highly similar defects are unlikely to be present at the same location, the matching accuracy was as high as 92% in the sample set test.

Key Region Extraction
The monotonicity of a pavement results in matching errors of the SIFT features, as shown in Figure 9. To address this, we propose SSM along with bounding boxes generated by the object detection algorithm to extract the key regions.
SSM is a simulation of human visual attention characteristics, which can capture significant changes in an image. It is this dynamic visual attention that makes it easy for a human being to find important information in an image at first glance, instead of searching the elements one by one. From the perspective of information theory, the information processed by human beings is mainly divided into background information and changing information, and human vision is more sensitive to the latter. Although image cutout techniques and semantic segmentation can also separate background and subject information, they can only target specific objects and require a large amount of model training. Moreover, these methods destroy the overall characteristics of an image and can hardly reflect the overall characteristics of real human vision.

Journal of Advanced Transportation
Xiaodi and Liqing found through extensive data analysis that the average log spectrum of input images is positively correlated with the log frequency [37]. The spectral residual of an image in the spectral domain is extracted by subtracting the average log amplitude spectrum from the actual log amplitude spectrum of the image. In this paper, an FFT-based visual saliency model was used to extract the feature regions of the pavement, as shown in the following equation:
$$S(x) = g(x) * \left|\mathcal{F}^{-1}\left[\exp\left(L(f) - A(f) + iP(f)\right)\right]\right|^{2}, \tag{8}$$
where $S(x)$ represents the SSM of image $x$, $g(x)$ is a Gaussian filter used to smooth the SSM, $\mathcal{F}^{-1}$ represents the inverse Fourier transform, $L(f)$ is the log amplitude spectrum of the image, $A(f)$ represents the average log amplitude spectrum, and $P(f)$ represents the phase spectrum of the image. Figure 10 shows the key region extracted using SSM. According to Figure 10, the SSM method has a certain sensitivity to pavement distress, especially patched distress, and the sensitivity is relatively stable regardless of the location in an image. However, this method is not sensitive to potholes or cracks. Therefore, SSM was combined with a bounding box to form the key regions.
After selecting the key regions, SIFT feature extraction was performed on the region locations, and the SIFT descriptors were calculated in the selected region. Each image was rescaled to ensure that the directions were consistent. A K-dimension tree (KD tree) was established, and the k-nearest-neighbors (KNN) algorithm was used to find the K nearest neighbors of each feature, where K was set to 2. The validity of a match needed to be verified once the K neighboring values were found. The validity threshold was 0.6, as shown in inequality (9):
$$\frac{d(\mathrm{NN}_1)}{d(\mathrm{NN}_2)} < 0.6, \tag{9}$$
where NN represents the nearest neighbor, and $d(\mathrm{NN}_1)$ and $d(\mathrm{NN}_2)$ are the distances to the nearest and second-nearest neighbors, respectively. Figure 11 shows the effect of the feature region on the results.
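The spectral residual computation behind the SSM can be sketched with NumPy as follows. This is an illustrative sketch only: the 3x3 averaging window used for the average log amplitude spectrum, the smoothing width `sigma=2.5`, and the helper names are assumptions that are not specified in the text.

```python
import numpy as np


def box_filter3(a):
    """3x3 mean filter with edge replication, used as the local spectral average."""
    p = np.pad(a, 1, mode="edge")
    h, w = a.shape
    out = np.zeros((h, w), dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += p[dy:dy + h, dx:dx + w]
    return out / 9.0


def gaussian_smooth(a, sigma=2.5):
    """Separable Gaussian smoothing g(x) with edge replication."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    k /= k.sum()
    p = np.pad(a, radius, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, p)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)


def spectral_saliency(img):
    """Saliency map from the spectral residual of a 2D grayscale image."""
    f = np.fft.fft2(np.asarray(img, dtype=float))
    log_amp = np.log(np.abs(f) + 1e-8)          # L(f); small offset avoids log(0)
    phase = np.angle(f)                          # P(f)
    residual = log_amp - box_filter3(log_amp)    # L(f) - A(f)
    sal = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    return gaussian_smooth(sal)
```

A key region could then be obtained by thresholding the returned map and combining the result with the detector's bounding box, although the exact combination rule is not detailed in the text.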
When a feature region is not adopted, a large number of matching points fall on the normal pavement, and more mismatches are caused by the uniformity of the pavement. However, when SSM combined with the bounding box is applied, the matching accuracy improves.
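The nearest-neighbor validity check used during matching can be sketched as a ratio test between the two closest descriptors. For brevity this sketch uses a brute-force search instead of a KD tree, which returns the same neighbors but scales worse; the function name is hypothetical, and the 0.6 ratio is the threshold quoted in the text.

```python
import math


def ratio_test_matches(desc_a, desc_b, ratio=0.6):
    """Return (i, j) index pairs that pass the nearest/second-nearest ratio test."""
    matches = []
    for i, d in enumerate(desc_a):
        # K = 2: only the two closest descriptors in the other image matter.
        dists = sorted((math.dist(d, e), j) for j, e in enumerate(desc_b))
        if len(dists) < 2:
            continue
        (d1, j1), (d2, _) = dists[0], dists[1]
        if d1 < ratio * d2:  # accept only clearly unambiguous matches
            matches.append((i, j1))
    return matches
```

A feature lying on plain asphalt tends to have many near-identical neighbors, so its two smallest distances are similar and the test rejects it.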
In addition to the SSM method, bounding boxes are generated to locate the region of interest by the object detection algorithm "you only look once version 3" (YOLOv3) [38]. YOLO is one of the real-time deep CNN methods that aim at detecting objects and is widely applied in traffic management. YOLO reasons globally about the image when making predictions and learns generalizable representations of objects [39]. It has also been shown that YOLO performs well compared with other existing models, such as SSD or R-CNN, in pavement defect recognition [40]. Moreover, YOLOv3 performs best among the four versions of YOLO, especially in small object detection [41]. The precision of the algorithm was 0.7869 with 10,000 pavement images for training and 3,000 images for testing. Additionally, although YOLO consumes a lot of computational power when training the model, not much computational power is needed for prediction.

Image Stitching
After matching the SIFT features in the key region using the SSM and bounding box, the two candidate images that had the most matched features were stitched according to the features and the fitted perspective matrix. After that, the next image was stitched onto the previously stitched images. The stitching results are displayed in Figure 12.
The angle and size of the stitched portion can change under the perspective matrix, so the weighted average fusion approach was used, as given in equation (10), where $p$ represents the synthesized pixel coordinates and $d_l$ and $d_r$ represent the distances of $p_l$ and $p_r$, respectively, from the left and right edges of the image. The images were stitched preferentially according to the number of feature points matched in the image set. The algorithm stopped when the ratio of inliers was less than 50%. A flow chart of the image-stitching algorithm is shown in Figure 13. The current algorithm can process up to 12 images, and the result is shown in Figure 13. The perspective field of view present in the original images makes it a challenge to stitch more images. The distortion becomes serious as the number of stitched images increases, and further study will be carried out to solve this problem.
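Distance-weighted fusion in an overlap region can be sketched as a linear cross-fade. The exact weights of the fusion equation are not reproduced here, so the code assumes a common form in which the left image's weight falls linearly from 1 to 0 across the overlap; the function names are illustrative, and the 50% inlier cut-off is the stopping rule quoted in the text.

```python
def blend_overlap(left_row, right_row):
    """Blend two equal-length rows of pixel values that cover the same overlap.

    The left image dominates at the left edge of the overlap and the right
    image dominates at the right edge (assumed linear weighting).
    """
    n = len(left_row)
    if n != len(right_row):
        raise ValueError("rows must cover the same overlap width")
    if n == 1:
        return [(left_row[0] + right_row[0]) / 2.0]
    out = []
    for x, (pl, pr) in enumerate(zip(left_row, right_row)):
        w_right = x / (n - 1)
        out.append((1.0 - w_right) * pl + w_right * pr)
    return out


def should_stop_stitching(n_inliers, n_matches, min_ratio=0.5):
    """Stop adding images once the inlier ratio drops below min_ratio."""
    if n_matches == 0:
        return True
    return n_inliers / n_matches < min_ratio
```

The cross-fade hides the seam where the warped images differ in brightness or scale, at the cost of some blurring when the alignment is imperfect.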

Calculating the Minimum Number of Sampling Vehicles
It is difficult to obtain all the pavement distress characteristics with a single detection car. For one thing, some pavement defects are always missed in the course of detection, depending on the video sampling rate and the vehicle speed. For another, the algorithm cannot identify all pavement defects, and misdetections exist. Therefore, it is necessary to have multiple detection vehicles whose data are superimposed and matched to show the overall condition of the pavement. The minimum number of required vehicles is discussed here using probability theory, as shown in Figure 14. The precision $p_t$ of the pavement detection algorithm used in this paper is 0.7869, which is the probability that pavement distress is correctly detected. The parameter $p_c$ represents the probability of collecting an image at a certain position on the pavement during detection with a single vehicle; it is a function $p_c(v, f)$ of the traveling speed $v$ and the camera sampling frequency $f$. When $v$ is high and $f$ is low, the detection vehicle may miss information at certain positions on the pavement, so the resulting value of $p_c$ is low. Conversely, when $v$ is low and $f$ is high, the value of $p_c$ is high, but duplications can easily occur. The number of pavement defects detected by the algorithm in one detection pass by a single vehicle is given by the following expression (11):
$$M \times p_c \times p_t, \tag{11}$$
where $M$ represents the actual number of pavement defects. Considering that $v$ is basically the same for different vehicles in the same time period, and that $f$ is also the same, $p$ is assumed to be a fixed value. In view of this, the pavement defects detected by each vehicle follow the same distribution.
Whether a pavement defect $x$ is detected conforms to $N$ independent Bernoulli trials, $x \sim B(N, p_c \times p_t)$, as shown in the following equation:
$$P(x = k) = \binom{N}{k}\left(p_c p_t\right)^{k}\left(1 - p_c p_t\right)^{N-k}. \tag{12}$$
To meet the requirement that more than 95% of the defects are detected by multiple vehicles, the corresponding condition is given by inequality (13). A pixel in a camera image is i * j, and the range that the camera can detect is (u, v). The matrix transformation relationship between the pixels and world coordinates is given by equation (14). The distance along the road is $l \in (u, v)$, and the probability $p_c$ of collecting an image at a certain position on the pavement can be expressed as follows:
$$p_c = \begin{cases} \dfrac{l f}{v} - \varepsilon, & l < \dfrac{v}{f}, \\ 1 - \varepsilon, & l \ge \dfrac{v}{f}, \end{cases} \tag{15}$$
where $\varepsilon$ represents the equivalent loss of the focused image. When the length of road covered by the camera exceeds $v$ multiplied by the collection interval, there will be duplicate areas between the pictures, so the detection probability is $1 - \varepsilon$. According to the conditions set in this paper, $p_c$ is calculated to be 0.67, where $\varepsilon = 0.05$, $v = 50$ km/h, $f = 2$ Hz, and $l = 5$ m. The minimum number of detection vehicles is five, as calculated by formula (16). According to the calculation result, at least five vehicles are needed to form the whole picture of the road surface. Based on the camera parameters used in this experiment, the relationship between speed, sampling frequency, and the minimum number of vehicles is shown in Figure 15. Figure 16 depicts the relationship between sampling frequency and the minimum number of vehicles at a speed of 50 km/h. The sampling frequency is determined by the traffic flow, the number of vehicles, the facilities, and the experimental environment. The purpose of this part is to indicate that the number of detection vehicles is an essential parameter for further field implementation; we therefore conducted theoretical deductions to provide a recommended number of detection vehicles, which can help in field applications.
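The probability argument can be explored numerically with the sketch below. It rests on two assumptions that are not confirmed by the text: that $p_c$ equals the ratio of the road length covered by one frame to the distance travelled between frames, minus $\varepsilon$ (a reading that reproduces the quoted value of 0.67), and that the 95% requirement means a defect is detected by at least one of $N$ independent passes. The function names are hypothetical.

```python
def coverage_probability(speed_kmh, freq_hz, footprint_m, eps=0.05):
    """Assumed form of p_c: covered fraction of the road minus the loss eps."""
    spacing_m = (speed_kmh / 3.6) / freq_hz  # road distance travelled between frames
    if footprint_m >= spacing_m:
        return 1.0 - eps  # consecutive frames overlap, only the loss term remains
    return max(0.0, footprint_m / spacing_m - eps)


def expected_detections(n_defects, p_c, p_t):
    """Expected number of defects found in one pass by a single vehicle."""
    return n_defects * p_c * p_t


def min_vehicles(p_single, target=0.95):
    """Smallest N with 1 - (1 - p_single)^N >= target (assumed criterion)."""
    if not 0.0 < p_single <= 1.0:
        raise ValueError("p_single must be in (0, 1]")
    n, miss = 0, 1.0
    while 1.0 - miss < target:
        n += 1
        miss *= 1.0 - p_single
    return n
```

With the rounded values quoted in the text (0.67 and 0.7869), this criterion sits almost exactly on the 95% boundary after four passes, so the result is very sensitive to rounding and the sketch should not be expected to reproduce the reported figure of five vehicles; the paper's own inequality may be stricter.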

Conclusion
In this paper, we established a feature-matching and image-stitching method for pavement distress detection based on images obtained with multiple vehicles. A large number of pavement images and their corresponding time and positional information were obtained with detection vehicles under controlled acquisition conditions.
A hierarchical framework was built to process the images, including rough data filtering, feature matching, and image stitching. Duplications were effectively eliminated by the three-layer structure comprising GPS, bounding boxes, and SIFT features. GPS is used to avoid full-dataset comparison, which reduces the amount of computation. SIFT was introduced to match features based on the key regions extracted using SSM and the bounding boxes. An SVM was used to analyze the influence of the thresholds of the MEuD and the MCR on the matching classification. The matching accuracy of the SIFT features, calculated with five-fold cross-validation, is 81.4%, and the multilevel comprehensive matching accuracy can reach up to 92.0%. Images that have the most feature matches were stitched according to the matched features and the fitted perspective matrix. We then discussed the correlation between the sampling frequency and the number of detection vehicles and introduced a method to calculate the minimum number of vehicles required.
Not only can the whole lane-level pavement distress be analyzed statistically by eliminating duplications and clustering according to the GPS tag and matched features, but local pavement distress can also be visually represented with the image-stitching algorithm. The algorithm provided in this paper effectively solves the problem of duplications of pavement distress and provides a reliable means for pavement distress detection in a collaborative, multivehicle environment.

Figure 16: The relationship between sampling frequency and minimum number of vehicles at the detection speed of 50 km/h.