Research Article Automatic Identification and Location of Tunnel Lining Cracks

Lining cracks are common in operating tunnels and seriously influence the service life and safety of tunnel engineering. Over the past few years, the use of computer vision to detect tunnel cracks has become a new trend in China and abroad. Through image processing technology and intelligent algorithms, the computer gains a human-like visual perception system that understands, analyzes, and judges input image information, thus recognizing and detecting specific targets. However, current image recognition of tunnel cracks cannot yet satisfy the demands of practical engineering. In this paper, the SSD algorithm is used to analyze features of lining surface images, and a comparative analysis is made of image recognition results, error rate, and running time. The results indicate that the SSD algorithm can accurately and rapidly detect and mark the position of tunnel cracks. The tunnel information obtained from image recognition is then imported into the team's independently developed software GeoSMA-3D, which is useful for determining the tunnel grade.


Introduction
Since the beginning of the 21st century, China has seen rapid development of underground engineering, especially tunnel construction. Water leakage, lining cracks, and differential settlement are common in tunnels during operation and maintenance, and lining cracks are the most common form of structural damage [1]. Lining cracks undermine the stability of the tunnel, and continuous deterioration may result in instability of the tunnel structure [2]. In fact, tunnel damage changes and deteriorates gradually over time; early detection and timely implementation of regulation measures can therefore ensure tunnel safety during operation. It is thus extremely necessary to monitor tunnel damage. Tunnel structural damage is generally detected and recorded through traditional manual routing inspection, which mainly depends on the skill and subjective judgement of inspectors and therefore cannot easily obtain complete, accurate, and detailed information [3]. Besides, a large amount of labor and material is required due to the low efficiency of such detection (see Figure 1). With changing demands for engineering detection, inspection methods have gradually developed from early naked-eye inspection to advanced technologies such as NDT and geophysical detection, and many evaluation methods and models have also been proposed. At the beginning of the 21st century, a photographic car was developed for scanning surface damage on tunnel linings, and the laser scanning method was used for completely scanning the surface conditions of lining concrete structures. An image of a lining crack can be generated through recognition and analysis [4, 5]. Literature [6] has evaluated and studied the application of fiber optic sensors in practical engineering and their influence on tunnel safety.
Besides, literature [7] has also developed a tunnel damage analysis system that can predict the development of lining cracks based on changes in lining strength.
Over the past few years, computer vision inspection technology for collecting and recognizing surface images of tunnel structures has sprung up [8]. However, such inspection produces a large amount of image data about the surface of the tunnel structure; according to existing research, about 200 GB of inspection data may be generated for a tunnel 1 km long. Therefore, it is urgent to develop new technologies for rapidly recognizing and accurately extracting geometric information about crack damage. With the coming of the big data era, computing power and training data size have improved dramatically, and complex models, represented by deep learning, have started to attract attention. Due to its excellent generalization and robustness, deep learning has now been applied to computer vision detection in civil engineering in China and abroad [9, 10].
Target recognition technology originated in the 1940s, but it was not until the 1990s that artificial neural network technology boosted its rapid development. At present, target recognition technology is widely used in license plate recognition, face recognition, and object detection. Early traditional target recognition technology was mainly based on shallow models and required a mass of images to be preprocessed manually, so researchers began to study deeper network models and to extract image features with them. A variety of deep learning models have been proposed, including DBN [11], CNN [12], RCNN [13], FCN [14], and YOLO [15]. The DBN network model greatly contributed to the establishment of artificial neural networks with multiple hidden layers. The DBN model has been successfully applied in many fields but is still in an early stage of development [16]. However, it lacks an effective parallel training method, which produces high computational costs, and this low-efficiency computation hinders the wide application of DBN. The CNN model has been well applied to small-image classification but has low recognition efficiency on large-scale data. In 2012, Krizhevsky et al. achieved good results with a deeper CNN in the Large-Scale Visual Recognition Challenge (LSVRC). RCNN raised the accuracy of target detection to a new level; it consists of four independent steps, candidate window generation, feature extraction, SVM classification, and window regression, which results in low detection efficiency. Therefore, many scholars developed the improved Fast-RCNN, in which the image is sent through the deep network only once instead of sending every candidate window through the network, greatly improving detection speed. As one of the best object detection algorithms, YOLO's main advantage lies in its detection efficiency: built on a CNN, it blurs the boundaries among feature extraction, classification, and window regression, allowing direct and fast detection.
For images of tunnel lining defects, Smith and Brady [17, 18] first extracted 17 low-level features from original images, then input these low-level features into a convolutional neural network to obtain high-level features, and finally input the high-level features into a multilayer perceptron to identify tunnel defects. Sun et al. [19] compared and analyzed the image classification of pavement cracks through three methods, deep convolutional networks, support vector machines, and ensemble learning, and found that the deep convolutional network had better detection effects than the other two. Huang et al. [20] used a deep convolutional network to study the identification of concrete cracks and detected images of any size in combination with the sliding window method. However, the above methods have shown many disadvantages, for example, a single scale of extracted feature maps, poor detection of small target cracks, long data processing times, and large computational requirements. For this reason, this paper uses a deep learning-based SSD method to recognize tunnel cracks rapidly and accurately. There are three steps: first, establish a sample dataset; second, build the network structure; third, evaluate the trained network and its results.

Acquisition of Image Data
As an important subdiscipline of computer science, artificial intelligence aims to implement technologies such as natural language understanding, image recognition, and speech recognition with computers. Machine learning is a branch of artificial intelligence that provides capabilities unattainable through direct programming; in practice, machine learning means collecting data, training models on that data, and finally making predictions with the trained models. Deep learning is a subdiscipline of machine learning, and its core is to automatically combine simple features into more complex features and solve problems based on these features.
Deep learning trains a model on training data: a machine learning algorithm continuously distills a large amount of information from the data into the model, and the trained model then processes similar real-world data, for example, for image classification and target positioning. Image classification distinguishes images with and without cracks but only determines whether an image contains a crack, not the detailed position of the crack. Target positioning frames the crack position in the tunnel image and thereby obtains the detailed position of the crack. Therefore, both recognition and positioning of cracks are required in this paper.

Acquisition of Tunnel Image.
The construction of an image set is the first step and a basic part of image recognition based on deep learning. Images can be gathered in a variety of ways, including original photography, collection of massive data based on modern crawler technology and big data concepts, and direct use of existing open-source databases. Regardless of which method is adopted, the ultimate goal is to collect sufficient image samples containing the feature objects, with diverse backgrounds. The image sample dataset was obtained through image acquisition of the lining surfaces of 12 tunnels in Shenyang by fast mobile detection equipment for tunnel structures [21, 22]. A tunnel crack is shown in Figure 2. The total length is 7358 m for the five long tunnels, 3080 m for the four medium tunnels, and 1138 m for the three short tunnels. The number of cracks in the long tunnels is 139, with a total length of 1169 m, including 93 circular cracks and 46 noncircular cracks, of which 27 are longitudinal cracks. The number of cracks in the medium tunnels is 211, with a total length of 2009 m, including 142 circular cracks with a total length of 1702 m and 69 noncircular cracks with a total length of 306 m, of which 45 are longitudinal cracks with a total length of 267 m. The number of cracks in the short tunnels is 39, with a total length of 1065 m, including one circular crack and 38 noncircular cracks (all longitudinal). The image set for this research was shot independently and was quickly extended without loss by applying initial transformations to the images (rotation, mirroring, and scaling) using general image processing technology. For deep learning training, original images and images extended through such initial transformations can be regarded as carrying the same target features without being identical images. Therefore, when original images are insufficient, initial transformation is a common method of quickly extending the image set.
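The initial transformations described above (mirroring and rotation) can be sketched with plain NumPy; the function name and the variants chosen are illustrative, not taken from the authors' code.

```python
import numpy as np

def extend_image_set(image):
    """Generate simple label-preserving variants of one image
    (mirroring and 90-degree rotations) to extend a training set.
    `image` is an H x W (or H x W x C) numpy array."""
    return [
        image,              # original
        np.fliplr(image),   # horizontal mirror
        np.flipud(image),   # vertical mirror
        np.rot90(image, k=1),
        np.rot90(image, k=2),
        np.rot90(image, k=3),
    ]

# One 4 x 4 grayscale image yields six training variants.
crops = extend_image_set(np.arange(16).reshape(4, 4))
print(len(crops))  # 6
```

Scaling, also mentioned above, changes pixel counts and is usually done with an image library rather than pure array operations, so it is omitted here.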

Establishment of Image Sample Dataset.
In order to reduce the demand on GPU memory, the collected images are cut into 1000 pixel × 1000 pixel subblocks, forming a sample dataset with a total of 20,000 images. Then, the sample dataset is divided into a training set and a verification set at a ratio of 4 : 1 [23].
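The 4 : 1 division can be sketched as follows; this is a minimal illustration, and the function name and fixed shuffle seed are our own rather than from the paper.

```python
import random

def split_dataset(samples, train_ratio=0.8, seed=0):
    """Shuffle the sample list and split it 4 : 1 into a
    training set and a verification set."""
    rng = random.Random(seed)   # fixed seed for a reproducible split
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# 20,000 subblocks -> 16,000 for training, 4,000 for verification
train, val = split_dataset(list(range(20000)))
print(len(train), len(val))  # 16000 4000
```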
A large number of sample images marked with features are required for deep learning-based image recognition to form the training set. The basic principle is to conduct deep learning training with images in which the target features are marked, find and extract the features of the marked targets, and then implement recognition and detection based on the recorded training results. In the image sample dataset, each original image corresponds to a ground truth, a datum reference obtained by manually marking the crack region in the image. Labeling the original images is quite significant in target detection: it records the locations of the target objects in the original images and generates a corresponding xml file for each image to represent the location of the target bounding box. Some labeling tools are simple and easy to use but cannot label multiple targets of the same category in one image; besides, they can only generate a txt file after labeling, which must then be converted into the corresponding xml file with other tools. This paper used the labelImg tool, which can label multiple categories and directly generate xml files.
The label categories can be modified in the file "pre-defined_classes.txt" under the "data" folder of the installation directory. After loading an image with the shortcut "Ctrl+U," you can designate a folder for the generated xml files with "Ctrl+R" and then mark targets in the image by category with rectangular boxes. The software automatically pops up category information after you mark a target, and you can double-click the corresponding name in the pop-up list. When all required targets of each category have been marked in an image, click the save button or use "Ctrl+S" to save the position information and generate the corresponding xml file; after that, you can mark the next image. The crack marking is shown in Figure 3. The prediction process is relatively simple. For each prediction box, first determine its category (the one with the maximum confidence) and its confidence from the category confidences, and discard prediction boxes classified as background; then discard prediction boxes below a confidence threshold (e.g., 0.5). The remaining prediction boxes are decoded, and their real position parameters are obtained from the prior boxes. After decoding, the boxes are ranked in descending order of confidence, and only the top-k (e.g., 400) prediction boxes are kept. Finally, the NMS algorithm is used to filter out prediction boxes with a high degree of overlap; the remaining boxes are the final detection results.
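The filtering pipeline just described (confidence threshold, top-k selection, NMS) can be sketched in NumPy as follows. The thresholds mirror the example values mentioned above, and all names are illustrative.

```python
import numpy as np

def iou(box, boxes):
    """IoU of one (x1, y1, x2, y2) box against an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def filter_predictions(boxes, scores, conf_thresh=0.5, top_k=400, nms_thresh=0.45):
    """Confidence filter -> keep top-k by score -> greedy NMS."""
    keep = scores >= conf_thresh
    boxes, scores = boxes[keep], scores[keep]
    order = np.argsort(-scores)[:top_k]   # descending by confidence
    boxes, scores = boxes[order], scores[order]
    selected = []
    while len(boxes):
        selected.append((boxes[0], scores[0]))  # keep the best box
        if len(boxes) == 1:
            break
        mask = iou(boxes[0], boxes[1:]) < nms_thresh  # drop heavy overlaps
        boxes, scores = boxes[1:][mask], scores[1:][mask]
    return selected

# Two overlapping crack boxes plus one low-confidence box:
boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.3])
kept = filter_predictions(boxes, scores)
print(len(kept))  # 1: the low-confidence box and the overlapping box are removed
```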

Development Environment.
The real-time detection network for tunnel lining cracks, developed based on computer vision, runs on the 64-bit Windows 10 Professional operating platform. In terms of hardware, the computer is equipped with an Intel i7-9700F CPU, an NVIDIA RTX 2070 SUPER graphics card, and a SAMSUNG 970 PRO 500 GB SSD. The development language is Python, with PyCharm as the integrated development environment, supplemented by the visual development tool Qt Designer. The TensorFlow 1.14 framework and the Anaconda package manager are used.

Crack Detection Algorithm.
The main idea of the SSD algorithm is to conduct dense sampling evenly at different locations in the image, using different scales and aspect ratios; a CNN then extracts features and directly performs classification and regression, so the entire process is finished in one step. The main advantage of such one-stage methods is speed. SSD stands for Single Shot MultiBox Detector: "single shot" indicates that the SSD algorithm belongs to the one-stage methods, and "MultiBox" indicates that it uses multibox predictions [24]. The basic structure of the SSD is shown in Figure 4. The SSD uses a CNN to extract detection results directly from different feature maps, with no need for detection after fully connected layers; detection is completed in one step.
As more deep learning models have been proposed and improved, Google, Microsoft, Facebook, and other companies have developed a series of deep learning frameworks, including Google's TensorFlow, Facebook's Torch, Microsoft's CNTK, Fchollet's Keras, DMLC's MXNet, and the Caffe framework jointly developed by the Berkeley Vision and Learning Center and community contributors. These deep learning frameworks are mainly used in image feature recognition, image classification, speech recognition, and natural language processing. With the emergence and application of these frameworks, deep learning technology has developed rapidly.
Based on the existing image training samples, target detection algorithms for deep learning were studied, compared, and screened, including several classic target recognition algorithms based on the convolutional neural network: RCNN, YOLO, and SSD, among which SSD was finally chosen. The SSD algorithm uses feature maps of multiple scales for detection: both large and small feature maps are used. The advantage is that large-scale feature maps (from earlier layers) can be used to detect small objects and small-scale feature maps (from later layers) can be used to detect large objects, as shown in Figure 5. Each unit in the SSD algorithm has prior boxes with different scales and aspect ratios, and the predicted bounding boxes are regressed from these prior boxes to reduce training difficulty. Normally, each unit has multiple prior boxes with different scales and aspect ratios.
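The prior boxes of one feature-map cell can be generated as in the following sketch. The rule of one box per aspect ratio plus an extra square box at the geometric mean of the adjacent layer scales follows standard SSD practice; the specific numbers and names here are illustrative.

```python
import math

def prior_box_shapes(scale, next_scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Widths/heights (relative to image size) of the prior boxes for
    one feature-map cell: one box per aspect ratio at `scale`, plus an
    extra square box at the geometric mean of the adjacent scales."""
    shapes = []
    for ar in aspect_ratios:
        shapes.append((scale * math.sqrt(ar), scale / math.sqrt(ar)))
    extra = math.sqrt(scale * next_scale)   # additional square prior
    shapes.append((extra, extra))
    return shapes

# Illustrative scales for one detection layer and the next:
shapes = prior_box_shapes(0.2, 0.34)
print(len(shapes))  # 4 prior boxes per cell for this layer
```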
For each prior box of each unit, there is a set of independent detection values, which can be divided into two parts. The first part is the confidence, or score, of each category. The SSD algorithm also regards the background as a special category: if the detection targets have c categories, the SSD algorithm needs to predict c + 1 confidence values, the first of which is the score of having no target, i.e., of belonging to the background. In the prediction process, the category with the highest confidence is the category to which the bounding box belongs; in particular, when the first confidence value is the highest, the bounding box contains no target. The second part is the location of the bounding box, containing 4 values (cx, cy, w, h), which represent the center coordinates, the width, and the height of the bounding box. However, the actual predicted value is the transformation of the bounding box relative to the prior box. Let the location of the prior box be d = (d_cx, d_cy, d_w, d_h) and the corresponding bounding box be b = (b_cx, b_cy, b_w, b_h). Then the predicted value l of the bounding box is the transformation of b relative to d:

l_cx = (b_cx − d_cx)/d_w,  l_cy = (b_cy − d_cy)/d_h,  l_w = log(b_w/d_w),  l_h = log(b_h/d_h).  (1)

The process above is the encoding of the bounding box, which is reversed during prediction, that is, decoding. The true location b of the bounding box is obtained from the predicted value l:

b_cx = d_w l_cx + d_cx,  b_cy = d_h l_cy + d_cy,  b_w = d_w exp(l_w),  b_h = d_h exp(l_h).  (2)

In summary, for an m × n feature map, there are mn units in total. If the number of prior boxes set for each unit is k, then each unit requires (c + 4)k predicted values, and all units require (c + 4)kmn predicted values. Because the SSD uses convolutions for detection, a total of (c + 4)k convolution kernels are needed to complete the detection of this feature map.
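The encoding and decoding transformations above can be checked with a small round-trip sketch. The variance scaling used by many SSD implementations is omitted for clarity, and all names are illustrative.

```python
import math

def encode(box, prior):
    """SSD encoding: offsets of a ground-truth box relative to a
    prior box. Boxes are (cx, cy, w, h)."""
    bcx, bcy, bw, bh = box
    dcx, dcy, dw, dh = prior
    return ((bcx - dcx) / dw, (bcy - dcy) / dh,
            math.log(bw / dw), math.log(bh / dh))

def decode(loc, prior):
    """SSD decoding: recover the real box from predicted offsets."""
    lcx, lcy, lw, lh = loc
    dcx, dcy, dw, dh = prior
    return (dw * lcx + dcx, dh * lcy + dcy,
            dw * math.exp(lw), dh * math.exp(lh))

# Decoding inverts encoding exactly.
b = (0.5, 0.5, 0.2, 0.3)        # illustrative bounding box
d = (0.45, 0.55, 0.25, 0.25)    # illustrative prior box
recovered = decode(encode(b, d), d)
```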
The SSD algorithm uses prior boxes (default boxes) of different sizes and aspect ratios to overcome the disadvantages of difficult small-target detection and inaccurate positioning, so this paper used the SSD algorithm to detect cracks [25].

Network Structure.
The SSD algorithm uses VGG16 as the base model and adds convolutional layers on top of VGG16 to obtain more feature maps for detection. The network structure of the SSD is shown in Figure 6; it can be clearly seen that the SSD uses multiscale feature maps for detection. The input image size of the model is 300 × 300.
In this paper, tunnel lining cracks were taken as the feature recognition target, and the collected database of 20,000 pictures was used as the dataset for deep learning training. Under the workstation configuration described above, it took about 6 hours of training for the loss value to stabilize at about 1. All recognition and detection results reported in this paper are based on this image set and its training results and will not be further explained [25]. 300 × 300 images were input into SSD300, and the VGG16 convolutional layers were used to extract features. The two fully connected layers of VGG16 were converted into ordinary convolutional layers (conv6 and conv7 in the figure), and then several additional convolutional layers were appended, ending with a 1 × 1 output via global average pooling. It can be seen from the figure that the SSD connects conv4_3, conv7, conv8_2, conv9_2, conv10_2, and conv11_2 to the final detection/classification layer for regression.
The basic steps of SSD network prediction are as follows: input a picture (300 × 300) into the pretrained classification network (a modified traditional VGG16 network) to obtain feature maps of different sizes; extract the feature maps of the Conv4_3, Conv7, Conv8_2, Conv9_2, Conv10_2, and Conv11_2 layers, and construct 6 default boxes of different sizes at each point of these feature map layers; separately detect and classify the feature maps, generating multiple default boxes that meet preliminary requirements; combine the default boxes obtained from the different feature maps, filter overlapping or incorrect default boxes using NMS, and generate the final set of default boxes, i.e., the detection results.

Figure 4: Basic framework [25].

Unlike YOLO, which has a fully connected layer, SSD directly extracts detection results from the different feature maps through convolution. The output of the 6 specific convolutional layers in the network is convolved with two 3 × 3 convolution kernels. One output is the classification confidence, with 21 confidences generated for each default box (for the VOC dataset containing 20 object classes plus the background class); the other output is the regression for localization, with 4 coordinate values (x, y, w, h) generated for each default box [25].
After the feature maps were obtained, they were convolved to obtain the detection results; take a 5 × 5 feature map used for detection, with its prior boxes obtained as above, as an example. The detection value consists of two parts, the category confidence and the bounding box location, each of which is computed through a 3 × 3 convolution. Let n_k be the number of prior boxes used in the feature map; then the number of convolution kernels required for the category confidence is n_k × c, and the number of convolution kernels required for the bounding box location is n_k × 4. As each prior box predicts one bounding box, SSD300 can predict a total of 38 × 38 × 4 + 19 × 19 × 6 + 10 × 10 × 6 + 5 × 5 × 6 + 3 × 3 × 4 + 1 × 1 × 4 = 8732 bounding boxes, which is a fairly large number [25].
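The total of 8732 boxes follows directly from the six feature-map sizes and their per-cell prior counts:

```python
# (feature map size, prior boxes per cell) for SSD300's six detection layers
feature_maps = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]
total = sum(size * size * k for size, k in feature_maps)
print(total)  # 8732
```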

Training Process.
The training methods of the TensorFlow framework include CPU-based training and GPU-based training. GPU training is well adapted to NVIDIA graphics cards; AMD graphics cards are nominally compatible, but their configuration is more cumbersome and the cards cannot be fully utilized. In order to improve training efficiency on the mass of data obtained, GPU-based TensorFlow was used for training. The files required for training include the model configuration file, an existing transfer learning model, record files, and label files; training can be started after full preparation and configuration. After training starts, there are two ways to stop it: the first is to reach the maximum number of training steps set in the model configuration file, and the second is to stop the training program directly. The training results are automatically saved every few steps, and the saved step count increases as the total number of training steps increases. If there is a need to continue deep learning training after an interruption, training can be restarted in the original training directory.
The parameter configuration of the model file plays a vital role in deep learning. Under the TensorFlow framework, a config file is used as the model configuration file containing the training feature information for the SSD model, mainly including the number of detection target classes (num_classes), the batch size for each training step (batch_size), the learning rate (learning_rate), the maximum number of training steps (num_steps), the path of the checkpoint file (checkpoint), the path of the identification label file (label_map.pbtxt), and other parameters. By adjusting these parameters, the training process can be tuned to affect the efficiency of deep learning and the accuracy of the final recognition.
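A config file of this kind might look like the following sketch. The field names follow the parameters listed above; all values and paths are hypothetical, and a real configuration file contains many more fields.

```
num_classes: 1                         # only the "crack" class
batch_size: 16
learning_rate: 0.004
num_steps: 200000
checkpoint: "path/to/pretrained/model.ckpt"   # transfer learning start point
label_map_path: "data/label_map.pbtxt"
```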
During the training process, it is first necessary to determine which prior box matches each ground truth (real target) in the training image; the bounding box corresponding to the matched prior box will be responsible for predicting it. There are two main principles for matching the prior boxes of SSD with the ground truths. First, for each ground truth in the image, the prior box with the largest IOU is found and matched with that ground truth. In this way, every ground truth is guaranteed to match with some prior box. A prior box that matches a ground truth is usually called a positive sample; conversely, a prior box that matches no ground truth and can only match the background is called a negative sample. There are few ground truths in an image but many prior boxes, so if only the first principle were followed, most prior boxes would be negative samples, resulting in an extreme imbalance between positive and negative samples. Therefore, a second principle is necessary: for the remaining unmatched prior boxes, if the IOU with some ground truth is greater than a certain threshold (usually 0.5), then that prior box also matches this ground truth. This means a ground truth may match multiple prior boxes.
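The two matching principles can be sketched as follows. This is a simplified version (real implementations also resolve conflicts when one prior is the best match for several ground truths), and all names are illustrative.

```python
import numpy as np

def match_priors(iou_matrix, thresh=0.5):
    """Assign priors to ground truths per the two rules above.
    iou_matrix[i, j] = IoU of prior i with ground truth j.
    Returns the matched gt index per prior, or -1 (background)."""
    n_priors, n_gts = iou_matrix.shape
    match = np.full(n_priors, -1)
    # Rule 1: each ground truth claims its best-IoU prior.
    for j in range(n_gts):
        match[np.argmax(iou_matrix[:, j])] = j
    # Rule 2: remaining priors match any gt whose IoU exceeds the threshold.
    for i in range(n_priors):
        if match[i] == -1:
            best = np.argmax(iou_matrix[i])
            if iou_matrix[i, best] >= thresh:
                match[i] = best
    return match

# Four priors, two ground truths; prior 3 stays a negative sample.
iou = np.array([[0.7, 0.1],
                [0.6, 0.2],
                [0.1, 0.3],
                [0.0, 0.0]])
print(match_priors(iou))  # [ 0  0  1 -1]
```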
After the training samples are determined, the loss function is determined. The loss function is defined as the weighted sum of the localization loss and the confidence loss:

L(x, c, l, g) = (1/N) (L_conf(x, c) + α L_loc(x, l, g)),

where N is the number of positive prior boxes, and x^p_ij ∈ {0, 1} is an indicator parameter: x^p_ij = 1 indicates that the i-th prior box matches the j-th ground truth, whose category is p. c is the predicted category confidence, l is the predicted location of the bounding box corresponding to the prior box, and g is the location parameter of the ground truth.
A variety of data augmentation methods are used in the SSD algorithm, including horizontal flipping, cropping, zooming in, and zooming out. Data augmentation can significantly improve algorithm performance; its main purpose is to make the algorithm more robust to targets of different sizes and shapes. Intuitively, data augmentation increases the number of training samples and provides more targets of different shapes and sizes, so that the network learns to be more robust. The loss value of the deep learning model during training, as a function of the number of iterations, is shown in Figure 7.

Training Effect.
After deep learning training, the files starting with events.out.tfevents in the folder where training results are saved are the data files of the TensorBoard visualization tool; TensorBoard can read them to show how parameters such as the loss value and the learning rate vary with the number of training steps. The real-time detection of tunnel lining cracks based on deep learning and computer vision developed in this research supports static image recognition, video file recognition, and real-time camera-based recognition and detection.
Rock images containing cracks were input; the detection results are shown in Figure 8. The SSD algorithm can accurately identify the locations of cracks. A comparison of the SSD algorithm and other detection algorithms is shown in Table 1. Basically, SSD achieves an accuracy comparable to Faster R-CNN and a detection speed comparable to YOLO.
In the future, tunnel patrol robots will be networked for real-time detection and statistics of tunnel lining cracks. The tunnel patrol robot is shown in Figure 9.

Figure 6: SSD network structure [25].

Conclusion
(1) The deep learning-based SSD algorithm has been used to extract advanced features of tunnel crack images, which can effectively avoid noise effects, and a recognition algorithm has been established for tunnel crack images. (2) Compared with traditional algorithms, the SSD algorithm makes improvements in multiscale feature maps, detection by convolution, and the setting of prior boxes. As a result, the SSD algorithm achieves higher accuracy and a better detection effect for small targets.
(3) In terms of crack recognition in images, the accuracy rate in this paper is approximately 90%, far higher than that of other algorithms. With respect to running time, the algorithm in this paper is faster, taking 0.2 s on average. Therefore, the method in this paper can recognize tunnel crack images rapidly and accurately, satisfying the demands of practical engineering. (4) The size of the sample database determines the accuracy of deep learning training and whether overfitting occurs easily during training. This contribution attempts to establish a sample library of tunnel image features, but the current number of images is far from ideal. The next step will be to collect and collate new tunnel samples and to increase the types of tunnel damage the network model can identify, so that the network is not limited to tunnel crack damage.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.