Vehicle Detection Using Different Deep Learning Algorithms from Image Sequence

Image processing has become a very popular topic in recent years, driven by images obtained through photogrammetry, remote sensing and computer vision. Deep learning practices are progressing rapidly alongside this development. Object detection, one of the newer subjects of deep learning, is applied to high-resolution aerial or remote sensing images to extract information from them. Traditional convolutional neural network (CNN) methods perform detection in two stages and remain slow. The You Only Look Once (YOLO) method, which is used for real-time object detection, runs quickly because it uses a single convolutional neural network. In this study, the YOLOv3, YOLOv3-spp and YOLOv3-tiny models were applied in the Google Colab environment using the Python programming language. The YOLO models trained on COCO data were compared on videos obtained separately from a UAV and from a terrestrial camera in order to identify vehicles. The best results were obtained with the YOLOv3-spp model, with an average IoU of 84.88% and a precision of 72.02%.


Introduction
The detection of vehicles ahead and of traffic conditions while driving is an important factor for safe driving, accidental cruising and automatic driving and tracking (Jazayeri et al., 2011). Real-time perception in particular plays an important role in the development of autonomous vehicles, and image processing can be used for these purposes. Image processing, which has become a popular topic in photogrammetry, is used for a variety of applications. In essence, it is a method for extracting useful information from the original images. Advancing technology provides many possibilities for developing such methods, including the detection, identification and classification of objects. In particular, the detection and classification of objects in image content have been one of the main subjects in the literature on computer vision and photogrammetry (Yang et al., 2019).
Convolutional neural networks (CNN) have become a standard tool for many applications in computer vision and machine learning. With the advancement of technology, CNN-based object detectors have taken important and successful steps in object detection. CNN-based approaches fall into two classes: two-stage and one-stage detection methods (Dai, 2019). The two-stage class includes object detection methods such as regions with convolutional neural networks (R-CNN), Fast R-CNN and Faster R-CNN. In R-CNN, the image is first divided into region proposals, and the CNN is then applied to each region in turn. The size of the regions is determined and the correct region is passed to the neural network (Girshick et al., 2014). Although these methods are more accurate than single-stage methods, computing features for each region makes them slow. You Only Look Once (YOLO) is a single-stage object detection method. In addition to classifying the image with the CNN used for object detection, YOLO treats object detection as a regression problem. It obtains the position of the object, its category and the corresponding confidence score, which increases detection speed and allows objects to be detected in real time.
Tashiev et al. (2017) performed real-time classification to provide solutions for intelligent transport systems. Classification was carried out with the convolutional YOLO model on the BIT-Vehicle dataset, whose images were obtained from two different cameras with sizes of 1600x1200 and 1920x1080. The model works with 90.35% accuracy. Han P. et al. (2019) created the DRoINs network model for analyzing changes occurring at different times in aerial images obtained by unmanned aerial vehicles (UAV). To prevent seasonal errors, the dataset contains 1246 pairs of images. The dataset includes 8 categories, including cars, boats, motorcycles, aircraft, tractors, trucks and pickups. The experiment was carried out on a computer with 12 Xeon 2.5 GHz CPU cores, 128 GB of memory, a GeForce GTX 1080 Ti GPU and 44 GB of memory. The model was compared with YOLOS and RPNS models and outperformed them with an mAP of 94.9. In another study (Carlet and Abayowa, 2017), innovations were made to improve the performance of the YOLOv2 model designed for real-time object detection. The YOLOv2 model, which was developed for better detection of vehicles from aerial images, was compared on the VEDAI, DLR3k, NeoVision2-Helicopter, AFVID 1, AFVID 2, AFVID3, AFVID4 and AF Building Camera datasets. The proposed improved YOLOv2 model yielded successful results.
The scope of this study is to compare object detection methods, to which new ones are added every day, on aerial and terrestrial videos in order to find the methods suited to each data group. In addition, the deficiencies and advantages of the methods are revealed, and results that can form the basis for new models are obtained. In addition to the terrestrial video, video obtained from an unmanned aerial vehicle (UAV) was used in the study. Vehicle detection from UAV images has some difficulties due to the extremely high resolution of the images, and these problems were also examined. In particular, the performance of the methods in finding cars in video frames was compared, using precision and average Intersection over Union (IoU) for the car class. In keeping with real-time applications, the methods were run on real data. The results are presented in tables. As a result of the study, inferences are provided about the use of deep learning based vehicle detection in real-time applications.

CNN Structure for Object Detection
CNNs can be defined as systems that give computers the learning ability that is the basic function of the human brain. They were originally inspired by the visual cortex in biology. The visual cortex consists of small cells that are sensitive to specific areas of the visual field (Oztemel, 2012). The foundations of CNNs were laid by the discovery that neurons systematically create visual perception through a horizontal architecture. Convolutional neural networks consist of multiple convolutional, non-linear, pooling and fully connected layers.

Convolutional Layer
The convolutional layer is the first layer of the network, and its input is the array of pixel values of the input image. The input image has a size of m x m x r, where m is the height and width of the image and r is the number of bands (r = 3 for an RGB image). The first layer consists of k filters of size n x n x q. When choosing the filter dimensions, n is smaller than the size of the image, and q is equal to or smaller than the number of bands r. The filters generate k interconnected activation (feature) maps (Ozkan and Ulker, 2017). Each feature map consists of the regions where the individual features of the corresponding filter are extracted. The intensity of these features determines which regions are more important. The smaller filter matrix is slid over the input image (for example, a 3x3 filter is applied to a 5x5 input image). At each step, the filter coefficients are multiplied by the values in the corresponding color channel and the products are summed. The activation map is created by summing the results of all three channels.
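The sliding-filter operation described above can be illustrated with a minimal NumPy sketch. It assumes a stride of 1 and no padding, which the text does not specify, and the function name convolve_valid and the all-ones input and filter are hypothetical values chosen only to make the arithmetic easy to follow.

```python
import numpy as np


def convolve_valid(image, kernel):
    """Slide an n x n x r filter over an m x m x r image (stride 1, no padding).

    Returns an (m - n + 1) x (m - n + 1) activation map.
    """
    m = image.shape[0]
    n = kernel.shape[0]
    out = m - n + 1
    activation = np.zeros((out, out))
    for row in range(out):
        for col in range(out):
            patch = image[row:row + n, col:col + n, :]
            # Multiply filter coefficients by the pixel values in every
            # channel and sum the products over the window and the channels.
            activation[row, col] = np.sum(patch * kernel)
    return activation


image = np.ones((5, 5, 3))    # hypothetical 5x5 RGB input
kernel = np.ones((3, 3, 3))   # hypothetical 3x3 filter covering all 3 bands
activation_map = convolve_valid(image, kernel)
print(activation_map.shape)   # (3, 3)
```

With these values, every entry of the activation map equals 27, the number of coefficients in the 3x3x3 filter.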

Rectified Linear Unit
The rectified linear unit (ReLU) layer follows the convolution layer and applies the activation function. The objective of this layer is to introduce non-linearity into the linear output of the previous layer. This layer also increases the speed of learning in the network.
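As a brief illustration, ReLU replaces every negative value with zero and leaves positive values unchanged. The feature map below is a hypothetical example.

```python
import numpy as np


def relu(x):
    """Rectified linear unit: element-wise max(x, 0)."""
    return np.maximum(x, 0.0)


feature_map = np.array([[-2.0, 1.5],
                        [0.0, -0.5]])   # hypothetical convolution output
activated = relu(feature_map)
print(activated)
```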

Pooling Layer
The pooling layer usually follows the ReLU layer. Its main purpose is to reduce the input size for the next layer (Oztemel, 2012). As in the convolution layer, filters are defined in the pooling layer and applied to the image in a specific order. The output values can be calculated in two different ways: average pooling and maximum pooling. In maximum pooling, the highest pixel value within each filter window is selected. In average pooling, the average of the pixel values within each window is taken.
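The two pooling variants can be sketched as follows. The sketch assumes non-overlapping square windows (a 2x2 window here), which the text does not specify, and the function name pool2d and the input values are hypothetical.

```python
import numpy as np


def pool2d(feature_map, size=2, mode="max"):
    """Pool a 2D feature map over non-overlapping size x size windows."""
    h, w = feature_map.shape
    h_out, w_out = h // size, w // size
    # Crop rows and columns that do not fill a whole window, then split
    # the map into (h_out, size, w_out, size) blocks.
    cropped = feature_map[:h_out * size, :w_out * size]
    blocks = cropped.reshape(h_out, size, w_out, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    if mode == "average":
        return blocks.mean(axis=(1, 3))
    raise ValueError("mode must be 'max' or 'average'")


feature_map = np.arange(16, dtype=float).reshape(4, 4)  # hypothetical input
max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="average")
print(max_pooled)   # [[ 5.  7.] [13. 15.]]
print(avg_pooled)   # [[ 2.5  4.5] [10.5 12.5]]
```

In both cases the 4x4 input is reduced to a 2x2 output, which is the size reduction the pooling layer is meant to provide.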

Fully Connected Layer
In the CNN structure, the ReLU and pooling layers are followed by a fully connected (FC) layer. It connects all neurons from the convolution and pooling layers in different combinations to produce better features.

DropOut Layer
The DropOut layer improves the performance of the network by preventing it from memorizing the training data (overfitting).

Data Used
In the application, a UAV video (van Es, 2017) with 1280x720 resolution and a terrestrial video with 1080x1920 resolution were used. Both videos have a frame rate of 24 frames per second and are about one minute long. The UAV video was captured obliquely rather than at nadir, because the YOLO algorithm does not give results on nadir images due to its training set. The videos were used directly, without any pre-processing steps.

COCO Dataset
The YOLO models use the COCO dataset. COCO is a dataset consisting of 80 separate classes such as cars, dogs, planes, bicycles, bags and phones (Figure 1) (Radovic et al., 2017). The dataset contains 80,000 training images and 40,000 validation images (Lin et al., 2014).

YOLO Object Detection
YOLO is an open-source object detection and classification algorithm based on CNNs. Conventional CNN-based detectors first generate region proposals to suggest bounding boxes. This is followed by classifying and refining the bounding boxes and removing duplicates. All bounding boxes are then re-scored based on the objects found. Finally, the region with the highest score on the image is considered as detected (Figure 2). YOLO, in contrast, predicts which objects are present in an image and where they are in a single look. It performs object detection with spatially separated bounding boxes and makes its predictions over the entire image with a single neural network. The main advantage of this approach is that the whole image is evaluated by a single neural network and objects are detected according to the predicted regions (Radovic et al., 2017). The structure of the YOLO algorithm consists of convolutional neural networks (Figure 3). YOLO starts detecting objects by dividing the input image into an S x S grid. In general operation, only one object is predicted by each grid cell. B bounding boxes are created for each grid cell, and a confidence score is assigned to each box. Each bounding box contains 5 variables: x, y, w, h and the confidence score. Here w is the width and h is the height of the bounding box, and x and y represent the center of the box. For each grid cell, there are C conditional class probabilities (C = 80 for COCO), each giving the probability that the detected object belongs to a particular class (Redmon and Farhadi, 2017). The confidence score measures confidence both in the classification and in the location of the object. The confidence score is zero when the grid cell does not contain any object. The final output of the YOLO prediction is a tensor of size S x S x (B x 5 + C). The main concept of YOLO is to build a single CNN that predicts this tensor, for example one of size (7, 7, 30). Only bounding boxes with a confidence score higher than the 0.25 threshold are used in the final prediction (Radovic et al., 2017).
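The size of the output tensor and the confidence filtering described above can be illustrated with a short Python sketch. The values B = 2 and C = 20 are assumptions, chosen because they reproduce the (7, 7, 30) tensor mentioned in the text; the detection records and their field names are hypothetical and do not correspond to any particular library's output format.

```python
def yolo_output_shape(s, b, c):
    """Shape of the YOLO prediction tensor: S x S x (B * 5 + C)."""
    return (s, s, b * 5 + c)


def filter_by_confidence(detections, threshold=0.25):
    """Keep only detections whose confidence score exceeds the threshold."""
    return [d for d in detections if d["confidence"] > threshold]


# Assumed values: S = 7 grid cells per side, B = 2 boxes per cell, C = 20 classes
shape = yolo_output_shape(7, 2, 20)
print(shape)  # (7, 7, 30)

# Hypothetical detections; (x, y) is the box center, (w, h) its size
detections = [
    {"label": "car", "x": 0.52, "y": 0.40, "w": 0.20, "h": 0.12, "confidence": 0.91},
    {"label": "car", "x": 0.10, "y": 0.75, "w": 0.08, "h": 0.05, "confidence": 0.18},
    {"label": "truck", "x": 0.70, "y": 0.33, "w": 0.25, "h": 0.15, "confidence": 0.47},
]
kept = filter_by_confidence(detections)
print([d["label"] for d in kept])  # ['car', 'truck']
```

With the 80 COCO classes and the same assumed S and B, the same formula gives a tensor of size (7, 7, 90).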
In the loss function of Eq. (1), s is the number of grid cells, classError is the classification error, coordError is the localization error and iouError is the confidence error.
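Assuming the three terms are simply added for every grid cell, as the YOLO loss is commonly written, the total loss can be sketched as follows. The summation limits are an assumption based on the S x S grid described above.

```latex
\mathrm{loss} = \sum_{i=0}^{s^{2}} \left( \mathrm{coordError} + \mathrm{iouError} + \mathrm{classError} \right)
```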

Classification Error.
classError is the classification error (Eq. (2)). It measures how correctly the class of the detected object is predicted.

Localization Error.
The localization loss measures the accuracy of the predicted bounding box. It is computed from the errors in the position and dimensions of the bounding box, as shown in Eq. (3).
Here B is the bounding box, x and y are the offsets to the corresponding cell, w and h are the width and height relative to the image, and Ĉ_i is the confidence score.
A separate confidence error term applies when no object is detected in the box.
In the application, the UAV video (sky tools traffic monitoring), recorded with a DJI INSPIRE 2 + POWERLINE Cam X5S 15mm at a distance of 50 m from the ground, was used. The other video, with a resolution of 1080x1920, was obtained terrestrially. It was recorded manually at 1080p with a 12 MP camera. The algorithms were run on a virtual computer with 12 GB of RAM and an NVIDIA Tesla K80 GPU.

Results and Discussion
The study was carried out on the Google Colab cloud service. The algorithm was run with the open-source neural network library Darknet, using weights pre-trained on the 80,000 training images of the COCO dataset. Only vehicles were considered as objects to be detected.
As a result of the study, vehicles in the video images were detected by YOLOv3, YOLOv3-spp and YOLOv3-tiny. The object type and IoU value for each detected vehicle are displayed by the algorithm on the bounding box. To assess the classification performance of the algorithm, the model output must be compared with the ground truth. In this study, deep learning methods were evaluated across a range of conditions, from simple and fast methods that work on easy, homogeneous backgrounds to methods intended for complex and difficult problems. For a detected object to be considered correct, its IoU value must be 0.5 or above, as in previous studies (Redmon et al., 2017). In the aerial video, an object that is correctly labeled is still not counted as a correct detection if its IoU value is below 0.5. In the comparison against ground truth, the YOLOv3-spp model trained on the COCO dataset was the most successful method, with an average IoU of 84.88% (Table 1). The average IoU reached 88.56% in the terrestrial video (Figure 5) and 81.21% in the UAV video (Figure 4). The YOLOv3-tiny method fails to detect small objects (Yi et al., 2019). Because the UAV video was captured from a distance, objects in it could not be detected with YOLOv3-tiny. Therefore, the YOLOv3-tiny method did not yield any results for the UAV video in the ground truth and precision comparisons. The YOLOv3-spp model was also the most successful in terms of precision, with 72.02% overall: 63.53% in the UAV video and 80.49% in the terrestrial video (Table 2). In both comparisons, the models yielded results in terms of model estimation and accuracy, and they were found to be more suitable, with higher performance, for terrestrial videos. With a dataset suited to the targeted study, vehicles in UAV images can be detected successfully. To increase accuracy, it is important to carry out training and validation on a dataset that matches the data used as input.
With a dataset chosen for the purpose of the study, more successful detections can be achieved. The most important reason for the low accuracy in the aerial video is that the dataset used is more suitable for terrestrial images. With stronger training, however, success rates will rise and increase confidence in deep learning methods. Our study shows high accuracy values compared with other studies on car detection with YOLO algorithms. For the purpose of comparison, studies aiming to detect objects in video or in real time were selected. Our study produced successful results in vehicle detection from both aerial and terrestrial video images (Table 3). Although lower precision was obtained in the aerial video, a precision of 80.49% was obtained in the terrestrial video.
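The IoU criterion used in the evaluation above can be sketched in Python as follows. The box format (x_min, y_min, x_max, y_max), the pairing of each detection with one ground-truth box, and the definition of precision as true positives over all detections are assumptions made for illustration; the text does not state how detections were matched to the ground truth. All box coordinates are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def precision_at_iou(pairs, threshold=0.5):
    """Fraction of detections whose IoU with the ground truth is >= threshold.

    pairs is a list of (predicted_box, ground_truth_box) tuples.
    """
    if not pairs:
        return 0.0
    true_positives = sum(1 for pred, truth in pairs if iou(pred, truth) >= threshold)
    return true_positives / len(pairs)


# Hypothetical detections paired with hypothetical ground-truth boxes
pairs = [
    ((0, 0, 10, 10), (0, 0, 10, 5)),    # IoU = 0.5, counted as correct
    ((0, 0, 10, 10), (5, 5, 15, 15)),   # IoU = 25 / 175, rejected
    ((0, 0, 4, 4), (0, 0, 4, 4)),       # IoU = 1.0, counted as correct
]
precision = precision_at_iou(pairs)
print(round(precision, 4))  # 0.6667
```

The second pair shows the case discussed above: the object may be labeled correctly, but the detection is rejected because its IoU is below 0.5.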

Conclusion
In both comparisons, the models yielded results in terms of model estimation and accuracy, and they were found to be more suitable, with higher performance, for terrestrial videos. Due to its structure, the YOLO algorithm is suitable for real-time vehicle detection. With a dataset suited to the targeted study, vehicles in UAV images can be detected successfully. To increase accuracy, it is important to carry out training and validation on a dataset that matches the data used as input.
With a dataset chosen for the purpose of the study, more successful detections can be achieved. With stronger training, success rates will rise and increase confidence in deep learning methods.