An Improved Deep Convolutional Neural Network-Based Autonomous Road Inspection Scheme Using Unmanned Aerial Vehicles

Advancements in artificial intelligence (AI) provide a great opportunity to develop autonomous devices. The contribution of this work is an improved convolutional neural network (CNN) model and its implementation for the detection of road cracks, potholes, and the yellow lane on the road. The purpose of yellow lane detection and tracking is to enable autonomous navigation of an unmanned aerial vehicle (UAV), which follows the yellow lane while detecting and reporting road cracks and potholes to a server over WiFi or 5G. Fabricating one's own data set is a laborious and time-consuming task. The data set was created, labeled, and trained using both the default model and an improved model, and the performance of the two models was benchmarked with respect to accuracy, mean average precision (mAP), and detection time. In the testing phase, the improved model performed better in terms of accuracy and mAP. The improved model is deployed on the UAV using the Robot Operating System (ROS) for the autonomous, real-time detection of potholes and cracks via the UAV's front camera.


Introduction
Deep learning (DL), a subset of machine learning, has gained remarkable interest. It is commonly applied in facial expression recognition, self-driving cars, autonomous systems, etc. [1]. Unmanned aerial vehicles (UAVs) are a well-known and attractive platform for the deployment of DL-based applications, and the literature discusses several deployments of DL-based algorithms on UAVs; moreover, deep-learning-based detection of UAVs themselves has been introduced [2]. A convolutional neural network (CNN) is a deep-learning-based neural network widely used for object detection. SSD and YOLO are single-shot detectors; YOLO, which was developed for real-time detection, provides excellent results [7].
Older techniques generally rely on background subtraction [8] or classifiers such as Haar cascades for object detection [9]. In [10], the detection of disease in radish fields using computer vision and a drone-mounted camera is proposed. In [11], a convolutional neural network is utilized for real-time analysis, detecting cattle with the help of a drone. In [12], an autonomous computer-vision-based detection and landing system is proposed; furthermore, in [13], a drone wireless charging method was implemented using a hill-climbing algorithm. Over the past decade, CNN-based object detection has also been extended to medical images for early diagnosis, as presented in [14].
This paper describes an improved CNN-based algorithm for the autonomous inspection of roads, with model improvements for detecting road cracks, potholes, and the yellow lane in real-time. Specifically, the yellow lane on the road serves as a reference that the UAV tracks and follows autonomously while detecting potholes and cracks. The tiny version of the YOLO object detector is used, with improvements to its activation functions and convolutional layers.
The paper is organized as follows: Section 2 describes related work. Section 3 explains the proposed CNN model for road crack, pothole, and yellow lane detection, along with yellow lane tracking, data set acquisition, and augmentation. Section 4 provides the experimental results and discussion. Finally, Section 5 presents the conclusion and future work.

RELATED WORK
Cracks and potholes are common road pavement defects that are difficult to find during road inspection. Moreover, manual inspection of every road is difficult and costly because it requires significant effort and manpower to find cracks and potholes in time [15]. Therefore, automatic detection of cracks and potholes has been introduced for reliable and speedy analysis of road defects, replacing the slower traditional manual inspection procedures [16]. Autonomous navigation of a UAV using deep neural networks in indoor environments is implemented in [17]; this approach is extended to outdoor environments in [18] for product delivery purposes.
The utilization of drones is increasing rapidly. In some cases they are operated manually through a mobile-application-based joystick, whereas others navigate autonomously by detecting and tracking objects, as implemented in [19]. UAVs are also navigated using GPS and inertial navigation systems, which provide the attitude, position, and velocity information crucial for UAV navigation, as discussed in [20]. Moreover, UAVs are being utilized for search and rescue at sea through CNN-based person detection, as implemented in [21].

Proposed Scheme
Road defects such as cracks and potholes are common problems that should be fixed as soon as possible; however, road inspection requires considerable manpower and is time-consuming. An autonomous road inspection system is therefore proposed, in which a Jetson TX2 communicates with a Bebop drone over WiFi. The Jetson TX2 receives images from the UAV and runs the Robot Operating System (ROS) together with the YOLO object detector. If the detected object class matches the yellow lane class, tracking is initiated: the position and distance of the detected object are used to estimate pitch, roll, altitude, and yaw values with the tracking algorithm, and these values are sent back to the drone so that it tracks and follows the yellow lane. If the detected class is a crack or pothole, the detected image is sent to the server over WiFi or 5G. The flowchart of the system is illustrated in Fig 1.

The tracking and object detection algorithm is implemented using ROS with two nodes: Node 01 serves as the object detection and tracking node, whereas Node 02 is the Bebop drone driver package. The nodes communicate using ROS topics [23]; in this implementation, six topics were utilized, each carrying data as messages between the two nodes. The node graph is illustrated in Fig 4. Node 01 publishes four topics: /UAV/reset, /UAV/land, /cmd_vel and /UAV/takeoff. The /cmd_vel topic conveys the pitch, altitude (Z), yaw, and roll commands to the Bebop drone. Node 01 also subscribes to two topics, /UAV/navdata and /UAV/front/image_raw, which carry navigation and video data, respectively.
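The class-based dispatch described above (track the yellow lane, report cracks and potholes) can be sketched as follows. The class names and the tracker/uplink interfaces are illustrative assumptions for the sketch, not the authors' actual code:

```python
# Hypothetical class labels; the trained model's exact label strings may differ.
YELLOW_LANE, CRACK, POTHOLE = "yellow_lane", "crack", "pothole"

def handle_detection(class_name, image, tracker, uplink):
    """Route one detection: lane detections drive tracking,
    defect detections are reported to the server."""
    if class_name == YELLOW_LANE:
        # the tracker estimates pitch, roll, altitude and yaw from the detection
        return tracker.update(image)
    if class_name in (CRACK, POTHOLE):
        # defect images are uploaded over WiFi or 5G
        uplink.send(class_name, image)
    return None
```

In the real system this logic would run inside Node 01's image callback, with the returned commands published on /cmd_vel.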
Conversely, the four topics published by Node 01 are subscribed to by Node 02, which in turn publishes the two topics /UAV/navdata and /UAV/front/image_raw.
During training, the input image is divided into a grid, and each cell predicts B bounding boxes. Each bounding box has five main characteristics: the coordinates of its center, x and y; its width and height, w and h, respectively; and its confidence, cs. The confidence cs indicates how reliably an object is present in the bounding box. The detection of road cracks, potholes, and the yellow lane is carried out using a lighter version of YOLO, Tiny YOLOv3, with an improved model structure. This version of YOLO is extremely fast and can run on low-power devices such as the Raspberry Pi and the Jetson TX2 because it has only seven convolutional layers and six pooling layers; however, this reduces accuracy. In the improved tiny version of YOLO, additional convolutional layers are added, as shown in Fig 5, and the leaky ReLU activation function is replaced with the Mish activation function, which aids deeper propagation through the hidden layers of the CNN [22], as shown in Fig 6. Mish was chosen to provide deeper information propagation, better capping avoidance, and self-regularization.
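The two activation functions involved can be sketched with their standard definitions, mish(x) = x · tanh(softplus(x)) and leaky ReLU; the leaky slope of 0.1 below is an assumption, since the slope used in the paper's models is not stated:

```python
import math

def softplus(x):
    # numerically stable softplus: ln(1 + e^x)
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def mish(x):
    # Mish activation: smooth, non-monotonic, unbounded above,
    # which helps gradients propagate through deeper networks
    return x * math.tanh(softplus(x))

def leaky_relu(x, alpha=0.1):
    # leaky ReLU baseline used in the default Tiny YOLOv3 (slope assumed)
    return x if x >= 0 else alpha * x
```

For large positive inputs Mish behaves almost linearly (mish(10) ≈ 10), while for negative inputs it is bounded and smooth rather than a fixed linear slope.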

Experimental Analysis
This section describes the details of the self-created data set and provides detailed results obtained using the improved model. Moreover, a comparison of the default and improved models with respect to accuracy and mAP is provided.

Data Set Specifications
Creating one's own data set is a difficult and laborious task. A high-definition (HD) camera was used to create the data set of road cracks, potholes, and yellow lanes. Labelling the created data set is an important task and should be performed carefully to achieve excellent results. The data set was divided into 80% for training and 20% for validation. It comprised three classes, namely cracks, potholes, and yellow lane, and cumulatively consisted of 1000 images. The data set was first resized, then verified by removing bad images, and finally improved before initiating training.
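The 80/20 train/validation split described above might be performed as in the following sketch; the file names and seed are illustrative, not the authors' actual pipeline:

```python
import random

def split_dataset(image_paths, train_ratio=0.8, seed=42):
    """Shuffle the labeled image list reproducibly, then split it
    into training and validation subsets."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    cut = int(len(paths) * train_ratio)
    return paths[:cut], paths[cut:]
```

With the paper's 1000-image data set this yields 800 training and 200 validation images.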

Object Tracking and Navigation
After the yellow lane is detected in an image, the CNN returns a bounding box. The bounding box gives the position of the object in pixel values: (xmin, ymax), (xmax, ymax), (xmin, ymin), and (xmax, ymin), as shown in Fig. 8. The center of the object, which is required for tracking, is calculated from these values as x_c = (xmin + xmax) / 2 and y_c = (ymin + ymax) / 2. Herein, the image center is taken as the origin (0, 0), so the errors between the image center and the object center are e_x(t) = x_c and e_y(t) = y_c. It follows that for the object to be tracked appropriately, e_x(t) and e_y(t) must always be equal or close to zero; that is, the object center must coincide with the middle of the image. The center of the detected yellow-lane bounding box is thus used by the UAV to track and follow the yellow lane by means of roll, pitch, altitude, and yaw movements, as shown in Fig. 9. These four control parameters govern the Bebop drone's movement: roll moves the drone left or right, pitch moves it forward or backward, yaw rotates it counter-clockwise or clockwise, and altitude moves it up or down. The relative distance between the yellow lane and the Bebop drone is also estimated from the width of the detected yellow-lane bounding box: if the width exceeds a defined value, the drone moves backward; otherwise it continues its forward movement.
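The center and error computation above can be sketched as follows; the proportional gain, the fixed pitch step, and the sign conventions are illustrative assumptions, not the authors' tuned controller:

```python
def bbox_center(xmin, ymin, xmax, ymax):
    # center of the detected bounding box in pixel coordinates
    return (xmin + xmax) / 2.0, (ymin + ymax) / 2.0

def tracking_errors(bbox, image_w, image_h):
    # e_x(t), e_y(t): offsets of the object center from the image center;
    # both must approach zero while tracking the yellow lane
    cx, cy = bbox_center(*bbox)
    return cx - image_w / 2.0, cy - image_h / 2.0

def control_commands(bbox, image_w, image_h, width_limit, gain=0.001):
    # map the errors to roll/altitude corrections; the bounding-box width
    # serves as a proxy for relative distance (wider box => too close)
    ex, ey = tracking_errors(bbox, image_w, image_h)
    roll = -gain * ex                              # steer left/right
    altitude = -gain * ey                          # climb/descend
    width = bbox[2] - bbox[0]
    pitch = -0.1 if width > width_limit else 0.1   # back off or advance
    return roll, pitch, altitude
```

A lane centered in a 640x480 frame yields zero roll and altitude corrections and a forward pitch command, matching the behaviour described above.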

Results and Training
In the training stage, output weights are produced after every 1000 iterations, and the weight file with the highest mean average precision (mAP) is kept for testing. After training the improved version of YOLO, the highest mAP achieved was 94%, as shown in the real-time training chart in Fig 10. The detection of potholes, cracks, and yellow lane is shown in Fig 11. The default Tiny YOLOv3 accuracy is 89% with an mAP of 85%, whereas the improved model's accuracy is 95%, a decent improvement. It was therefore observed that changing the model's activation function and making the model deeper improved its accuracy. SGD was employed as the optimizer for training both the default and improved models, with a momentum of 0.9 and a learning rate of 0.001; other parameters are provided in Table 1. The improved model was trained for 10000 iterations, producing a new weight file after every 1000 iterations; the weight file with the highest mAP was used in the testing phase to calculate the accuracy of the model. The model was trained on a powerful Titan RTX GPU, with tensor cores enabled after 3000 iterations, a batch size of 64, and a subdivision of 4. The results of both models are provided in Table 2. The default model detected objects in 4.81 ms, whereas the improved model required 4.84 ms because it is deeper than the default model. The comparison of detection times for both models is shown in Fig. 12.

Performance Metrics for Evaluation
The metrics used to evaluate the detection of road cracks, potholes, and the yellow lane are calculated from the following quantities: True Positive (TP): if the detected centroid falls within a defined object in the class ground truth, the detection is classified as a true positive. Multiple true detections within one frame are counted as a single true positive.
True Negative (TN): a correct detection of a negative frame, i.e., a frame without any of the defined objects.
False Positive (FP): the detected centroid does not fall inside any defined object in the class ground truth.
False Negative (FN): an object defined in the class is missed in the frame.
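The centroid rule that separates true and false positives can be expressed as a simple predicate; the (xmin, ymin, xmax, ymax) box format is assumed:

```python
def centroid_in_box(cx, cy, box):
    """True Positive rule: the detection centroid (cx, cy) falls
    inside a ground-truth bounding box."""
    xmin, ymin, xmax, ymax = box
    return xmin <= cx <= xmax and ymin <= cy <= ymax
```

A detection whose centroid lies outside every ground-truth box of the class is counted as a false positive.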
Precision (Pre): to efficiently evaluate the performance, precision is calculated as Pre = TP / (TP + FP) × 100. Sensitivity (Sen): this metric, also known as the true positive rate or recall, measures the proportion of real instances of the defined class that are correctly detected: Sen = TP / (TP + FN) × 100.
F1-score and F2-score: the harmonic mean between sensitivity and precision, lying in the range [0, 1]. Both scores are used to balance sensitivity and precision. The F1-score is given as F1 = 2 × (Pre × Sen) / (Pre + Sen), and the F2-score as F2 = 5 × (Pre × Sen) / (4 × Pre + Sen). Dice Coefficient: for the pixel-wise comparison between the predicted detection and the ground truth, ranging over [0, 1], the Dice coefficient is used: Dice = 2 × TP / (2 × TP + FP + FN).
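Using the standard formulas above, the metrics can be computed directly from the raw counts, here expressed as fractions in [0, 1] rather than percentages:

```python
def detection_metrics(tp, fp, fn):
    """Compute precision, sensitivity (recall), F1, F2 and the
    Dice coefficient from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    f2 = 5 * precision * sensitivity / (4 * precision + sensitivity)
    dice = 2 * tp / (2 * tp + fp + fn)
    return {"precision": precision, "sensitivity": sensitivity,
            "f1": f1, "f2": f2, "dice": dice}
```

Note that for count-based evaluation the Dice coefficient coincides with the F1-score; it differs only when computed pixel-wise over masks.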

Loss Function
The YOLO training process uses a loss based on sum-squared error [24]. The end-to-end YOLO network simply adds the individual error terms: coordinate errors, classification errors, and IOU (confidence) errors. The formula below expresses the loss function.
The overall loss is a weighted sum of these terms. During training, the model exhibits unstable behaviour and divergence when the classification error is weighted equally with the coordinate error; therefore, the coordinate error weight is fixed to λ_coord = 5. YOLO employs λ_noobj for the IOU error of cells containing no object, to avoid confusion between object and no-object grid cells. The overall loss function obtained while training on the data set can be described as:

Loss = λ_coord Σ_i Σ_j 1_ij^obj [(a_i − â_i)² + (b_i − b̂_i)²] + λ_coord Σ_i Σ_j 1_ij^obj [(√w_i − √ŵ_i)² + (√h_i − √ĥ_i)²] + Σ_i Σ_j 1_ij^obj (C_i − Ĉ_i)² + λ_noobj Σ_i Σ_j 1_ij^noobj (C_i − Ĉ_i)² + Σ_i 1_i^obj Σ_c (R_i(c) − R̂_i(c))²

In the above equation, i runs over the g × g grid cells and j over the B prediction boxes of each cell. The coordinate center of each box is (a, b), and its width and height are w and h, respectively. The prediction box confidence is C, and the class confidence of objects is R. The position loss weight is λ_coord, while λ_noobj weights the confidence loss of cells without objects. The indicator 1_ij^obj is 1 if an object of the trained classes is present in the cell and 0 otherwise.

CONCLUSION AND FUTURE WORK
A convolutional neural network was improved and implemented for the detection of cracks, potholes, and the yellow lane on the road. Autonomous navigation of the UAV is achieved by tracking and following the yellow lane during road inspection, with road damage reported to the server. An HD data set was produced to achieve the best results, and the results obtained from the two models were compared in terms of detection time, mAP, and accuracy. Future work will create a larger data set and compare the results with those of other object detectors, such as Faster R-CNN.