Designing image processing tools for testing concrete bridges by a drone based on deep learning

ABSTRACT Crack detection is one of the crucial aspects of bridge evaluation and maintenance. Several existing image-based methods require capturing the bridge surface and extracting crack features to detect cracks. However, in some positions, such as the space under the bridge deck and around the piers, it is difficult to capture crack images. This paper applies a method to detect cracks on the bridge surface using a drone that can capture images in these challenging positions. Cracks in the video recorded by the drone are identified automatically by a deep learning method. The deep learning model is trained and tested on a dataset of 51,000 images, each sized 244 × 244. The method demonstrates the feasibility of detecting cracks in transport infrastructure, supported by the high accuracy of the experimental results, 95.19%. In addition, the tool can assign an ID containing information to each crack extracted from the video, so that these cracks can later be mapped onto a 3D model of the bridge to study crack development over time when assessing the health of bridges.


Introduction
The condition of cracks on the surface and how they propagate over time are key criteria for assessing the health and longevity of concrete structures in general and concrete bridges in particular. Detecting cracks and other defects is one of the most complex tasks in assessing bridge condition, because crack observation is a challenge on tall bridge structures. Human visual inspection is expensive and unsafe, requiring scaffolding or other specialized bridge inspection equipment that can disrupt traffic on the bridge. It is therefore necessary to use specialized equipment to collect crack samples and other defect data, as well as to create tools that support crack image processing.
Machine learning and image processing approaches have been widely applied in crack detection to replace human visual inspection techniques. However, previous crack detection methods have several limitations. Abdel-Qader et al. (2006) proposed a solution to detect cracks on the bridge deck based on the Principal Component Analysis algorithm; in this approach, the camera position negatively impacts the accuracy of the results. Prateek et al. (2012) built an automatic crack classification system using Support Vector Machine and histogram-based classification algorithms to inspect the bridge deck, but the main disadvantage of this method is its relatively low crack detection accuracy. Recent developments in the AI field have led to a renewed interest in deep learning methods, especially convolutional neural networks (CNNs), for crack detection. Feng et al. (2017) proposed a deep active learning strategy for civil infrastructure defect detection, in which the deep prediction network was initially trained with a small set of images; a prefilter network was used to remove error-free images and reduce the time spent on the standardization process. Cha et al. (2017) used a CNN model to classify the obtained images into two categories, cracked and non-cracked, and the trained model was used to examine full images at different resolutions. In addition, a comparative study by Nhat-Duc et al. (2018) showed that CNN-based methods outperformed the other prediction models, DFP-Canny and DFP-Sobel.
This paper focuses on processing concrete crack images captured in sample videos from unmanned aerial vehicles (UAVs, or drones), which are well suited to detecting bridge cracks. Conventional photography techniques are not effective for capturing images under the bridge deck and around the piers; drones can adequately address this difficult problem at considerably lower cost. An AI-based tool can also be integrated into the UAV to support the detection and identification of defects during inspection. This helps reduce human error, especially in unfavourable working environments with long inspections.
The process of inspecting a bridge with a drone requires constructing a typical flight path for the sample-collection video. The inspector then uses a tool to separate cracked images from the video and determine the points of interest (POIs), the locations that experts should focus on during checking. At these POIs, the operator interacts flexibly with the drone, taking manual control for detailed observation, and can command the drone to leave its trajectory or automatically return to it to continue the flight mission. Cracks and other defects are then measured and identified, and mapped onto the 3D model of the bridge at the workstation. The image processing tool published in this paper has worked well, as shown by the processing results for actual sampled flight video. Data collection by drone for detection and identification was carried out at railway overpass 3, Cau Giay street, Lang Thuong ward, Dong Da district, Hanoi city, Vietnam. This task also complements the typical work-in-progress on data sampling published by the same group of principal authors at the CITA2022 conference, titled 'Planned flight path for UAV in collecting crack images on Concrete Surfaces to assess the structural health of bridges' (Hoang et al., 2020).
One of the most well-known AI algorithms for detecting crack images is the convolutional neural network (CNN), first proposed by LeCun (1989). Due to its efficiency in feature extraction, this network is widely used in computer vision tasks such as image classification, object recognition, and action recognition. Ouellette et al. (2004) introduced an algorithm based on a standard genetic algorithm (GA) to detect cracks automatically, with a crack detection accuracy of approximately 92.3 ± 1.4% on 100 images; however, with such a small dataset it is not possible to fully demonstrate the effectiveness of this method. Zhang et al. (2016) proposed using a deep neural network to classify pavement cracks, but the algorithm lacked strict classification criteria for positive and negative samples, and its detection accuracy reached only 86.96%. Based on transfer learning, Gopalakrishnan et al. (2017) proposed several concrete crack classification models, of which the VGG-16 (Visual Geometry Group Network 16) model performed best. In the same year, Chen et al. (2018) developed an algorithm that can analyse cracks in a single video frame using a convolutional neural network, combined with a Naïve Bayes data-aggregation scheme to synthesize video information. The network trained by Cha et al. (2017), combined with the sliding-window technique, can scan crack images with a resolution greater than 256 × 256 pixels. Wang et al. (2017) used a CNN to detect pavement cracks and principal component analysis (PCA) to classify the detected cracks. Pauly et al. (2017) demonstrated the effectiveness of using deeper networks to improve detection accuracy in computer vision-based pavement crack detection. However, this method has high complexity and low accuracy, so it has not been widely applied.
Hongyan Xu et al. (2019) developed a CNN-based surface defect detection model that takes advantage of the ASPP (Atrous Spatial Pyramid Pooling) module and depthwise separable convolution to obtain more accurate results. The structure of this neural network is shown in Figure 1. The network consists of 28 layers, including 16 convolutional layers and 3 max-pooling layers; the ASPP module occupies 10 of the convolutional layers. The paper shows that the proposed model achieved an accuracy of 96.37% in finding cracks on concrete surfaces, outperforming the methods mentioned above. However, a key advantage of Viraja's model is its simplicity, with fewer layers than the model of Hongyan Xu (Viraja, 2019; Xu et al., 2019). Moreover, Viraja's model is relatively effective on the larger Kaggle dataset of about 40,000 images (Kaggle, 2019). Viraja's model structure is depicted in Figure 1 below. These studies show the potential of convolutional neural networks (CNNs) for the automatic detection of cracks and defects on concrete bridge surface structures. However, Hongyan Xu et al. (2019) used an end-to-end CNN that takes only images and labels as inputs; the limitation of this model is that its 96.37% accuracy in detecting cracks was achieved with only 100 images, without pretraining or fine-tuning.
Another major source of uncertainty in the method is the use of convolution with a stride of two in the last three convolution layers of the network. This replaces the max-pooling layers and therefore loses crack-edge information during the aggregation process. It is essential to develop an image-improvement tool, thereby enhancing sample quality, when replacing the max-pooling layers with convolution.

Proposed CNN model for crack detection
Viraja's model (2019) has a simple structure, so it only works effectively with clear datasets (large cracks) similar to Kaggle's (Kaggle, 2019). When crack images of different sizes from the SDNet2018 data package (2018) are merged in, the model struggles to predict, because images with small cracks are mispredicted as having no cracks.
In this research, Viraja's model was developed further and tested against many deep learning models with different structures based on transformations, for example adding layers (Viraja, 2019). Each developed model has its advantages and disadvantages, as shown in Figure 2 below. A distinctive feature of this model is the parallel structure, in which some data from the front layers is fed directly into the neural network.
Images fed into the model are classified into two types, crack and non-crack. The blue boxes in Figure 2 represent the layers, and multiple layers and nodes in the neural network give a more accurate computational model. In particular, some parallel extraction layers supply computational data to the neural network (including the linear layers), so the amount of useful information available to the neural network during computation increases. As a result, crack detection results are more accurate.
In the model in Figure 2, the numbers 16, 32, 64, 128, and 256 along the top are the depths of the layers below them. The labels Conv and MPool denote the convolution and max-pooling layers, and the numbers following them are the kernel sizes. Each convolution layer is followed by a ReLU layer to non-linearise the values, helping the model converge faster.
In general, the training model consists of three main parts. The first part is a sequence of convolution, ReLU, and max-pooling layers; this part is called serial extraction and significantly reduces the amount of computation while preserving the features of the image. The second part, called parallel extraction, works in tandem with serial extraction and retains some desired data from the front layers for training by the neural network. The last part is the neural network itself (layers 17, 18, and 19 in Figure 3). This design avoids an excessive number of convolution layers and ensures the model works effectively. Furthermore, the max-pooling layers used for parallel data extraction (in the bottom row, layers 20 to 23) are adjusted so that no layer takes excessive or inadequate data. Taking excessive data increases hardware resource use and computation time and slows the model's processing speed significantly; conversely, taking inadequate data significantly harms training. The parallel extraction resources (bottom row) are first merged with the direct resources (top row), and the neural network layers then perform the computations to obtain the results. The alignment of the layers in the model offers a benefit: the number of nodes produced by the flatten layer is reduced significantly, to 28,752. Accordingly, the total number of parameters in the whole model is 16,163,986; the input size is 0.68 MB, and the total size calculated from the model is 122.1 MB.
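The serial/parallel structure described above can be sketched in PyTorch as follows. This is an illustrative reconstruction only: the kernel sizes, the pooled grid used for the parallel branches, and the hidden-layer width are assumptions, not the published configuration, so the parameter count will differ from the paper's 16,163,986.

```python
import torch
import torch.nn as nn

class ParallelExtractionNet(nn.Module):
    """Sketch of a serial Conv/ReLU/MaxPool trunk with parallel side branches.

    Each side branch pools an intermediate feature map to a small grid and
    feeds it directly to the classifier, retaining early-layer information.
    """

    def __init__(self, num_classes: int = 2):
        super().__init__()
        depths = [16, 32, 64, 128, 256]          # layer depths from Figure 2
        blocks, in_ch = [], 3
        for out_ch in depths:
            blocks.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2)))
            in_ch = out_ch
        self.blocks = nn.ModuleList(blocks)
        # parallel extraction: shrink each intermediate map to a 4x4 grid (assumed size)
        self.side_pool = nn.AdaptiveMaxPool2d(4)
        # LazyLinear infers the flattened feature count on the first forward pass
        self.classifier = nn.Sequential(
            nn.LazyLinear(256), nn.ReLU(inplace=True),
            nn.Linear(256, num_classes))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        side = []
        for block in self.blocks:
            x = block(x)
            side.append(self.side_pool(x).flatten(1))  # keep data from this depth
        return self.classifier(torch.cat(side, dim=1))  # merge parallel + direct paths
```

Concatenating the pooled side branches before the linear layers is what keeps the flatten width small while still exposing early-layer detail to the classifier.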
Compared to Viraja's model, the developed model improves the reliability of crack detection. The input image size is 244 × 244 (width × height), larger than that of Viraja's model, at 128 × 128. The number of convolution layers also increases for better feature extraction. However, increasing the number of convolutional layers slows down training and image prediction, so more max-pooling layers are added to keep the model capable of meeting real-time speed requirements. In addition, after each max-pooling layer, a small amount of information is retained and put into the flatten block. This increases the reliability and specificity of the solution, as shown in the test results presented in the next section. The idea of separating images from video (Simonyan and Zisserman, 2014) was developed by the research team into a GUI that is convenient for training, detecting, and storing images in the dataset.

Tasks and methods
Data preprocessing requires considerable effort to collect, process, and label data. Concrete crack datasets are currently shared and published online in the AI community to reduce the difficulty of building datasets. However, for the problem of detecting damage on reinforced concrete surfaces, there is still a lack of datasets that are reliable enough for cracks on the surface of bridge structures. Therefore, self-generating data is necessary. The generated datasets and the existing datasets are used following the process in Figure 3.
For the task of image classification, a deep learning model was used to classify images with and without cracks. This work is divided into three parts: sample processing (data), model training, and model validation. The crack detection procedure is performed as follows. In this study, 51,000 images are used: 40,000 for training and the remaining 11,000 for validation, with a 1:1 ratio of cracked to non-cracked images. The images come in two sizes, 256 × 256 and 227 × 227, and before training they are all resized to 244 × 244 to fit the training model.
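The resizing step described above can be sketched with a small helper. The bilinear interpolation mode is an assumption; the paper does not state which resampling method was used.

```python
import torch
import torch.nn.functional as F

# Dataset split used in the paper: 51,000 images in total,
# 40,000 for training and 11,000 for validation (1:1 crack/non-crack).
TRAIN_COUNT, VAL_COUNT = 40_000, 11_000
INPUT_SIZE = (244, 244)  # model input size (width x height)

def resize_batch(batch: torch.Tensor) -> torch.Tensor:
    """Resize a batch of 256x256 or 227x227 image tensors (N, C, H, W)
    to the 244x244 input size expected by the training model."""
    return F.interpolate(batch, size=INPUT_SIZE, mode="bilinear",
                         align_corners=False)
```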
In Figure 4, the top two rows are representative samples from Kaggle's images, the next four rows are illustrative samples D (rows 3 and 4) and P (rows 5 and 6) from SDNet2018, and the last two rows (rows 7 and 8) are representative samples collected from reality.

Development of training and crack detection tools
Tkinter (2022) is used to design a user interface that facilitates the training process. This tool is used for training and testing, as opposed to building common tools similar to label-processing tools for image samples; it makes it easy for users to adjust parameters during training and to stop and restart training when needed (Cubuk et al., 2018; Wood, 2021).
The steps for predicting whether an image or video frame contains a crack are similar to those for training. However, the process skips some unnecessary stages, such as the processing and quality-enhancement work in the image preprocessing step, and calculating the loss and updating parameters are no longer necessary. In addition, feeding the image into the existing Net and reading the data out of the Net leads to faster prediction.
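A minimal inference sketch of this prediction path, assuming a trained two-class Net, might look as follows; disabling gradient tracking is what removes the loss and parameter-update work.

```python
import torch

@torch.no_grad()  # no loss calculation or parameter updates at prediction time
def predict(net: torch.nn.Module, images: torch.Tensor) -> torch.Tensor:
    """Feed preprocessed images into the existing Net and read out crack labels."""
    net.eval()                    # disable training-only behaviour
    logits = net(images)          # raw scores: crack vs. non-crack
    return logits.argmax(dim=1)   # one 0/1 label per image
```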
The training process is as follows. First, a Net is created, which is the training model, and the parameters obtained from previous training are applied. At the same time, the training model is moved to the GPU for faster processing. The input images are then preprocessed and fed to the network for computation. The output of the network is compared with the labels, the loss function is calculated, and the parameters (params) are saved. In the mode section, training mode is chosen first by default to train and update the parameters for a better model; validation mode is then selected to check whether the model is acceptable on untrained data. Each epoch consists of one training pass and one validation pass. Image preprocessing is performed before prediction by the model: all images are converted to negative form to minimize information loss in the max-pooling layers, and in training mode the images are randomly varied in brightness and rotation to improve the quality of the training images. Loss is the value of the loss function, and acc is the accuracy, calculated as the number of correctly classified images divided by the total number of test images (Figure 5).
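The training pass described above can be sketched as below. This is a hedged reconstruction: the negative-form inversion and brightness-jitter range, the SGD/cross-entropy choices, and the `params.pt` filename are assumptions, not the authors' exact settings (rotation augmentation is omitted for brevity).

```python
import torch
import torch.nn as nn

def augment(images: torch.Tensor) -> torch.Tensor:
    """Training-mode preprocessing sketch: convert to negative form,
    then apply a random brightness change (range assumed)."""
    images = 1.0 - images                                  # negative form
    factor = torch.empty(images.size(0), 1, 1, 1).uniform_(0.8, 1.2)
    return (images * factor).clamp(0.0, 1.0)               # random brightness

def train_one_epoch(net, loader, optimizer, device="cpu"):
    """One training pass: forward, compare output with labels,
    compute the loss, update, and save the params for the next run."""
    net.to(device).train()
    criterion = nn.CrossEntropyLoss()
    total_loss = 0.0
    for images, labels in loader:
        images, labels = augment(images).to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(net(images), labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    torch.save(net.state_dict(), "params.pt")  # params reused on restart
    return total_loss / max(len(loader), 1)
```

In each epoch this training pass would be followed by a validation pass with gradients disabled, as the mode description above requires.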
At the beginning of training, the model's accuracy was 45%. During training, the accuracy curve rises rapidly and then levels off; after 18 training epochs, accuracy reached 92%. At this point, to avoid sudden growth and rapid decreases in the curve that would cost convergence and stability, the learning rate was reduced, so the accuracy graph no longer rises as steeply as before. After 400 epochs, accuracy increased to 96.7%. The stabilization of the graph is shown in Figure 6. The next step is to test directly on several directories of real data packages to evaluate whether the desired accuracy is achieved. As shown in Figure 8, for Kaggle's data package (Kaggle, 2019), a folder of 5,000 images containing only cracks was used for testing. The model correctly detected 4,818 images, an accuracy of 96.36%. Kaggle's data package contains 40,000 images, most of which are clear and easy to assess. Afterwards, an SDNet2018 data package was added to test the reliability of the model. This SDNet2018 package has significantly fewer images than Kaggle's, and its cracks are harder to detect because the images have diverse shapes, are hard to see, and contain noise points similar to cracks in the form of dotted spots. Figure 7 below shows the test results on the SDNet2018 and Kaggle datasets.
For crack images in SDNet2018 subpackage D, the model successfully detected 142 images with cracks, equivalent to 95.94%. Similar tests gave 78.77% on SDNet2018's P package and 97.67% on Kaggle's package.
After achieving positive results with the sample dataset, actual data was collected and 1,800 images were added to the dataset for training, then retested with actual recorded video in both the training and testing processes for crack detection. These results are shown in Figure 8.
The green line in Figure 7 shows the accuracy of model testing with actual video. Initial accuracy was only 78%, because the actual data (cracks) is not identical to the sample data (the images used to train the model). After the first 100 epochs, accuracy reached 93.8%. The increase is gradual, shown on the graph with little ripple, which indicates that the model performs satisfactorily on real data. Finally, real data was tested to see the true value of the model: a video with 158 crack images was fed into the model, which correctly detected 148 of them, equivalent to 95.19%, as shown in Figure 8. Table 1 compares the accuracy of our proposed method with the two previous methods of Viraja (2019) and Vaughn (2022).
Detected cracks are assigned IDs and are mapped onto the 3D model of the bridge at a later step to assess the health of the bridge. The first task is to determine the bounding boxes of the cracks. Accordingly, a recursive function was used to add the bounding boxes of the crack boundaries to the list of bounding boxes, as shown in Figure 9.
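The bounding-box step can be illustrated with a small connected-component sketch over a binary crack mask. This is an assumption-laden stand-in for the paper's recursive function: it uses an explicit stack (equivalent to recursion, but immune to Python's recursion limit) and assumes 4-neighbour connectivity and a nested-list mask format.

```python
def crack_bounding_boxes(mask):
    """Group adjacent crack pixels in a binary mask (list of rows of 0/1)
    and return one (x_min, y_min, x_max, y_max) box per connected crack."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if mask[y][x] and not seen[y][x]:
                # flood-fill this crack region, tracking its extents
                stack, x0, y0, x1, y1 = [(y, x)], x, y, x, y
                seen[y][x] = True
                while stack:
                    cy, cx = stack.pop()
                    x0, y0 = min(x0, cx), min(y0, cy)
                    x1, y1 = max(x1, cx), max(y1, cy)
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w \
                                and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                boxes.append((x0, y0, x1, y1))  # add this crack's bounding box
    return boxes
```

Each box can then carry the crack's ID and position when it is placed on the 3D model.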
Figure 8. Results of training and testing with the sample dataset including 1,800 images collected from reality.
The dimensions of the crack are calculated according to the scale of the image in pixels, which requires the capture-distance parameter to be specified. Currently, the image sets in the SDNet2018 and Kaggle sample datasets carry no capture-distance parameter, so those images cannot yield actual sizes, only the pixel sizes of the recorded dataset itself. The distance from the device to the crack is kept stable, which also meets the requirement of keeping a safe flight distance, as implemented in the drone's flight mission.
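With the capture distance recorded and held stable, the pixel-to-real-size conversion reduces to the pinhole-camera relation sketched below. The focal length in pixels would come from camera calibration, which the paper does not specify, so this is an assumed formulation rather than the authors' exact procedure.

```python
def crack_size_mm(length_px: float, distance_mm: float,
                  focal_length_px: float) -> float:
    """Pinhole-camera estimate of a crack's real length from its pixel length.

    real size = pixel length * capture distance / focal length (in pixels).
    Valid only when the capture distance is known, which the SDNet2018 and
    Kaggle sample datasets do not provide.
    """
    return length_px * distance_mm / focal_length_px
```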
Regarding the contours and shape of the cracks, the model can detect faint and small cracks. However, some pictures still contain small, discrete cracks; such shapes are missed because their images are discarded along with similar small noise speckles. Our video splitter is capable of distinguishing cracks from other concrete defects such as scratches, spots, or protrusions on the concrete surface, as shown in Figure 10.

Conclusion
This project was undertaken to develop a crack detection CNN model with parallel extraction, which retains some desired data from the front layers and feeds it into training by the neural network. The training model is quite stable, as expressed by a graph with little undulation that runs as a continuous curve. In detecting the presence or absence of cracks, our model works effectively, predicting many cracks of various sizes. The model is well designed: in cases where the concrete structure is rough, with many spot holes, it does not mistakenly detect faults or other disturbances as cracks, nor does it mistakenly detect fuzzy cracks. Even when a sample image is low-light (dark), which is the dominant colour of the cracks, the detection process still makes no mistakes. In particular, for cases where the crack image contains noise in the form of long grooves in the upper left corner, which other models often misinterpret as a crack due to its depression and darkening, the designed model still works satisfactorily, reaching 93.67% accuracy in detecting cracks from the actual video.
In addition, our tool can measure cracks based on image pixels from sample videos. On the basis of the model development presented above, the research team has successfully built a GUI tool that separates images from actual recorded video and is also used in network training and validation, where it can automatically detect and separate crack images to add to the datasets and to serve the process of checking and detecting new cracks. From the existing surface cracks and the crack growth detected at subsequent inspections, bridge inspectors can combine these specifications to assess health status, issue traffic warnings, plan maintenance, and predict the life of the structure.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Funding
This research is funded by the science project under grant number B2023-GHA-02.

Notes on contributors
Long Ngo is a sophomore at the Faculty of International Education, University of Transport and Communications (UTC), Hanoi, Vietnam, majoring in Information Technology. He participated in a remote internship in Taiwan and in a student scientific research contest on computer vision. In 2022, his scientific research on 'Bridge Crack Detection Based on Deep Learning using Drone' won an excellent award. His research interests include optimizing decision making, computer vision, embedded systems and edge computing, robotics, autonomous vehicles, and UAVs. Email: longngo02utc@gmail.com

Chieu X. Luong received his Ph.D. degree in Traffic Engineering from the University of Transport and Communications (UTC), Vietnam, in 2018. He is a lecturer at the Faculty of Civil Engineering, UTC, Hanoi, Vietnam. His main research interest is the application of advanced metrological techniques to the quality assessment of building structures, data fusion, computer vision, UAVs, and Intelligent Transportation Systems. Email: chieu1256@utc.edu.vn

Hoang M. Luong is a master's-year aerospace engineering student at the University of Bristol (MEng Aerospace Engineering). In June 2021, Mr Luong successfully completed year three of study, which is equivalent to a Bachelor's degree. In June 2022, he is expected to be awarded the degree of Master of Engineering (MEng) in Aerospace. He is currently a research assistant at the University of Bristol and the University of Transport and Communications. He is interested in composite manufacturing, aircraft structures and materials, and AI applications. Email: zu18420@bristol.ac.uk