Octave convolution-based vehicle detection using frame-difference as network input

Vehicle detection in video frames has typically been treated the same way as detection in isolated images. However, models designed for isolated images ignore motion cues and struggle to localize fast-moving or partially occluded vehicles in the scene. We therefore combine a classic moving-target detection method with a neural-network detector. In this work, we first propose feeding the three-frame difference image into YOLOv3 as a second input, which carries the motion information of the preceding and following frames and helps detect partially occluded vehicles; second, we reform the network with Octave Convolution to reduce memory and computational cost while boosting accuracy. Experiments on the UA-DETRAC data set show that, compared with the original YOLOv3, the combined method increases AP by 2.31%, recall by 4.01%, and precision by 3.10%, demonstrating that the proposed method is effective.


Introduction
Detection of on-road vehicles is one of the most important tasks in intelligent transportation systems (ITS). Many machine learning and deep learning algorithms have been proposed for object detection and tracking, including supervised and unsupervised machine learning, reinforcement learning, convolutional neural networks, sliding-window approaches, RCNN, Fast RCNN, You Only Look Once (YOLO), and multi-scale deep convolutional neural networks.
Vehicle detection aims to locate and recognize instances of real-world objects such as cars, buses, trucks, and bicycles based on the high-level and low-level features of labeled data. Object detection methods generally fall into three categories: motion-based detection, feature-based detection, and offline detection methods.
Although these deep-learning-based detectors achieve good vehicle detection results, previous neural-network research has treated video detection as a sequence of independent static pictures, achieving real-time performance simply by running static-image detection fast enough. This is feasible, but it also creates a blind spot: the target motion characteristics and the temporal information of the frame sequence, on which traditional surveillance-video methods rely, are not exploited by the neural network.
Chen et al. [1] obtained object edges with the Canny operator, extracted feature points from those edges, computed the optical flow of the feature points with the pyramidal Lucas-Kanade method, clustered the feature points with weighted K-means, and used the clustering results to extract moving vehicle targets from complex dynamic backgrounds.
The above method utilizes inter-frame information and performs well in practical vehicle detection, but it is sensitive to changes in external factors such as scene illumination and vehicle speed.
Teoh et al. [2] proposed a method that detects vehicles using their symmetry. The method extracts symmetric regions of the image as candidate areas, then extracts and enhances the edge information within each candidate area, and finally feeds the processed edge information to a support vector machine classifier for the detection decision. However, relying on vehicle symmetry restricts the camera angle, so the method can only be applied in specific scenes that meet this requirement.
Mori et al. [3] proposed to detect the presence of vehicle targets using the shadows on the underside of the vehicle as a feature by exploiting the feature that the luminance value of the shadow region under the sunlight hardly changes during the day.
These vehicle detection methods mentioned above are simple and intuitive, but are limited to applications in simple environments. In complex environments, these features are susceptible to factors such as vehicle deformation, occlusion and weather changes, and their practical application is not ideal.
All the above studies on vehicle detection inspired the approach proposed in this paper. They led us to consider feeding the neural network with inter-frame difference images as a way to obtain more discriminative features and improve vehicle detection.
This paper contributes to the literature in three aspects: 1. We propose to use the inter-frame difference image as input to You Only Look Once (YOLO)v3 and find that it provides new features for detection. 2. We propose to insert one more backbone into the YOLOv3 structure so that the network can extract features from both the original frames and the inter-frame difference images, yielding better detection results. 3. We propose to replace vanilla convolutions with Octave convolutions in the new two-backbone structure to reduce computational cost and speed up detection.

Related work
The method of inter-frame difference has long played an important role in vehicle object detection. Chen et al. [4] developed a moving vehicle detection method based on the three-frame difference and the union-set operation, which has four steps: first, difference the three grayscale images and enhance the results; second, binarize the images and filter out some interference; third, perform the union-set operation; finally, detect the vehicle with morphological processing and connectivity analysis. Sengar et al. [5] presented an inter-frame differencing and W4-based moving object detection technique, whose key aspect is detecting moving objects accurately even when noise and illumination variations are present in the input frame sequences.
He et al. [6] presented a background modeling algorithm in which a Gaussian mixture model is combined with the three-frame-difference algorithm to achieve moving target detection.
Cui et al. [7] first extract the corresponding frames of the video sequence and apply inter-frame differencing to obtain a foreground map of the moving target; they then feed the foreground map into a CNN for training to obtain a clearer foreground image of the moving target.
Ever since the pioneering work on AlexNet [8] and VGG [9], which achieved astonishing results by stacking sets of convolution layers, researchers have made substantial efforts on object detection and developed many detection frameworks based on deep learning, including RCNN [10], SSD [11], YOLO [12][13][14][15], etc. Huge progress has been made in vehicle detection thanks to these deep learning methods.
Some researchers improved these general frameworks for vehicle object detection. Harikrishnan et al. [16] proposed a modified single-shot multi-box convolutional neural network named Inception-SSD (ISSD) for vehicle detection and a centroid-matching algorithm for vehicle counting. An Inception-like block replaces the extra feature layers in the original SSD to handle multi-scale vehicle detection and enhance the detection of smaller vehicles, and Non-Maximum Suppression (NMS) is replaced with Affinity Propagation Clustering (APC) to improve the detection of nearby occluded vehicles.
There are also researchers combining traditional methods and neural networks, such as Chandrasekar et al. [17], who developed a highly efficient and fast multi-object tracking method using three-frame differencing combined with background subtraction (TFDCBS) coupled with automatic and fast histogram-entropy-based thresholding (HEBT), together with GMPFM-GMPHD filters and a VGG16-LSTM classifier. All of the above studies on vehicle detection inspired the methods proposed in this paper. In addition, another line of research, on aggregating both visual and textual information, is also relevant: Ahmed et al. [18] fed combined features to a fully connected multilayer neural network that estimates a house price as its single output, showing that aggregating visual and textual information yields better estimation accuracy than textual features alone, and that the neural network outperforms an SVM on the same data set.
Inspired by the above research, we began to wonder whether, besides detecting the static image directly, we could use inter-frame difference images to provide additional motion features and improve detection when using a neural-network detector such as YOLO for vehicles in video surveillance.

Proposed methods
In this section, we explain the progressive improvement of our proposed method. First, we generate the frame-difference and feed it into the network to show that it brings new useful features. Then we construct the two-input network, which obtains the features of the original image and the difference image at the same time. Finally, we reform the network with octave convolution.

Using frame-difference as input
Generate frame-difference. In order to extract the information of vehicle movement that exists among video frames, three-frame differencing is employed to generate a new image that accumulates the motion trajectories corresponding to the original frame.
To be specific, as shown in Fig. 1, we select three adjacent frames, the (k−1)th, the kth and the (k+1)th frame, which are RGB images with pixel values between 0 and 255. Let $R_{k-1,k}(x, y)$ denote the absolute difference at location $(x, y)$ on the R (red) channel between the (k−1)th frame and the kth frame, and define $G_{k-1,k}(x, y)$ and $B_{k-1,k}(x, y)$ analogously for the green and blue channels:

$$R_{k-1,k}(x, y) = \left| R_{k}(x, y) - R_{k-1}(x, y) \right| \quad (1)$$

$$G_{k-1,k}(x, y) = \left| G_{k}(x, y) - G_{k-1}(x, y) \right| \quad (2)$$

$$B_{k-1,k}(x, y) = \left| B_{k}(x, y) - B_{k-1}(x, y) \right| \quad (3)$$

Then the three channel differences are synthesized into a single channel, whose pixel value $D_{k-1,k}(x, y)$ equals the maximum among $R_{k-1,k}(x, y)$, $G_{k-1,k}(x, y)$ and $B_{k-1,k}(x, y)$, but no more than 255:

$$D_{k-1,k}(x, y) = \min\!\left(\max\!\left(R_{k-1,k}(x, y),\, G_{k-1,k}(x, y),\, B_{k-1,k}(x, y)\right),\, 255\right) \quad (4)$$

The same operation between the kth frame and the (k+1)th frame yields $D_{k,k+1}(x, y)$. Finally, the two difference maps are merged pixel-wise into the final result, the Alpha channel $Alpha_{k}(x, y)$ (Eq. (5)), where $(x, y)$ denotes the pixel coordinate and $k$ is the frame index. $Alpha_k$ becomes the fourth channel of frame $k$ besides the RGB channels. In practice, the camera shakes slightly while capturing the video, which leaves traces of static objects on the Alpha channel when the three-frame difference method is employed, as shown in Fig. 2. However, that is not necessarily bad.
The final result is more like a combination of edge detection and inter-frame difference: it mixes in some noise, but it also helps capture vehicles that stop over a series of consecutive frames. For example, in Fig. 2d the bus stopping at the bus station would not have left a trace without camera shake. As shown in Fig. 2e-l, when only a small part of a moving vehicle has entered the frame, or the moving vehicle is partially occluded, its trace on the Alpha channel is much easier for human vision to detect, which suggests that it may provide better features for detection.
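As an illustration, the construction of the Alpha channel can be sketched in a few lines of NumPy. The per-channel absolute differences and the channel-wise maximum follow Eqs. (1)-(4); the final merge of the two difference maps is written here as a pixel-wise maximum, which is our assumption since Eq. (5) is not reproduced above.

```python
import numpy as np

def three_frame_difference(prev_frame, cur_frame, next_frame):
    """Build the Alpha channel for frame k from frames k-1, k, k+1.

    Inputs are H x W x 3 uint8 RGB arrays. The merge of the two difference
    maps is taken as a pixel-wise maximum here, which is an assumption;
    the paper's Eq. (5) may define it differently.
    """
    f_prev = prev_frame.astype(np.int16)
    f_cur = cur_frame.astype(np.int16)
    f_next = next_frame.astype(np.int16)

    # Eqs. (1)-(3): per-channel absolute differences.
    diff_prev = np.abs(f_cur - f_prev)   # frame k-1 vs frame k
    diff_next = np.abs(f_next - f_cur)   # frame k vs frame k+1

    # Eq. (4): collapse the three channels by taking the maximum, capped at 255.
    d_prev = np.minimum(diff_prev.max(axis=2), 255)
    d_next = np.minimum(diff_next.max(axis=2), 255)

    # Merge the two difference maps into the Alpha channel (assumed max-merge).
    return np.maximum(d_prev, d_next).astype(np.uint8)

def append_alpha_channel(cur_frame, alpha):
    """Stack Alpha as a fourth channel next to RGB, giving an H x W x 4 input."""
    return np.dstack([cur_frame, alpha])
```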
Use frame-difference. Now that we have the frame-difference $Alpha_k$, how to use it becomes the key question. There are two basic ideas. The first is to apply the frame-difference to the original frame as a mask: if it yields better features or filters some noise, it might work. The second is to replace the original frames with frame-differences: if the frame-difference provides better features, or features different from what the original frame provides, that would be interesting in its own right (Figs. 3, 4 and 5). Using the frame-difference as a mask changes the original image before it is fed into the network; the highlighted part of the difference image overlaps with the vehicle positions. What if we use the difference image as a transparency channel, like the Alpha channel of the PNG format, to filter the original image?
Our experiments show that neither using the difference frame as a transparency channel directly, nor first smoothing it with a Gaussian filter, helps to improve the detection result. On the contrary, these operations introduce interference and increase the difficulty of target detection. As shown in Table 1, compared with the original input, the APs decrease considerably and the detection results are poor after these spatial-domain operations. The mask approach in the spatial domain does not work.
Since a difference image is also an image, we intuitively replace the original video frames with the corresponding difference images as training and testing data. In Table 2, we compare the performance of the original input and the frame-difference input, measured by the PR-curve. Table 2 shows that using the frame-difference as input gives 1.10% better precision and 0.2% better F1 than using the original frames. Hence, frame-difference input reaches the same level as the original input, and is even slightly better overall, which shows that the frame-difference can be treated by the neural network in the same way as a normal image.
New discovery in the feature domain. Detection is based on feature extraction. A discrepancy in the spatial domain does not necessarily mean a difference in the feature domain; for example, changing the hue, saturation or brightness of the input image still yields good detection results because the features remain consistent.
That raises a question: do the frame-difference and the original image have consistent features? If they do, then just as the network can correctly detect vehicles in images whose hue, saturation and brightness have been changed, detection should still succeed if we feed frame-differences into a network trained on original images, or original images into a network trained on frame-differences. However, our experiment shows otherwise.
When we feed the frame-difference or the original image into a network trained on the other type, no vehicles are detected, which means that although training and testing on frame-differences achieves about the same level as on original images, the features extracted from the two kinds of samples are different.
Furthermore, we may be able to exploit the features extracted from these two different sources simultaneously in one multi-input neural network to obtain better detection results. To realize this idea, we reform the structure of the YOLOv3 network we have been using.

Reformation of YOLOv3
Single backbone network. As shown in Fig. 6, a convolutional layer, a batch normalization layer and an activation layer are encapsulated into a module called Darknet_conv_BN_LeakyReLU (DBR) in the implementation of the single backbone network. An important structure of the backbone, the residual unit (ResUnit), is also encapsulated as a module: a residual unit adds its input to the output obtained after passing that input through two DBR modules. The residual block (Resblock) consists of zero padding, a DBR module with strided convolution, and N residual units. DBR × 5 means five DBR modules connected in sequence. The detection head of the network is connected to the outputs of the third, fourth, and fifth Resblock, respectively. The size of "output1" is 14 × 14: the output of the fifth Resblock is passed through DBR × 5 and then through the final fully convolutional layer. The size of "output2" is 28 × 28: the output of the fourth Resblock is combined with the up-sampled output of the fifth Resblock, passed through a series of convolutions, and then fed into the fully convolutional layer. "output3" is 56 × 56 and, similarly, is obtained by combining the output of the third Resblock with the output of the previous up-sampling and applying a series of convolutions.
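For reference, a minimal PyTorch-style sketch of the DBR module and the residual unit described above could look as follows; the implementation framework and module names are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class DBR(nn.Module):
    """Darknet_conv_BN_LeakyReLU: convolution + batch normalization + LeakyReLU."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride,
                              padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.LeakyReLU(0.1, inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class ResUnit(nn.Module):
    """Residual unit: input + (the input passed through two DBR modules)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            DBR(channels, channels // 2, kernel_size=1),
            DBR(channels // 2, channels, kernel_size=3),
        )

    def forward(self, x):
        return x + self.body(x)
```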
One more backbone for frame-difference. The YOLOv3 structure has two parts, a backbone and a detection head. The backbone, called Darknet-53, has 53 layers but is used here without its last fully connected layer. The detection head is connected to three different layers of the backbone.
Since the network needs to extract features from two sources, one more backbone is required. The newly added backbone is identical to the original one: it has the same structure, and its layers connect to the detection head in the same way. The new network structure therefore contains two backbones corresponding to two different inputs, one for the original frames and the other for the frame-difference images. The feature tensors from the two backbones are concatenated and then sent to the detection head.
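A schematic sketch of the two-backbone arrangement is shown below, assuming each backbone returns the three feature maps that feed the detection head; the class names and the per-scale channel concatenation written here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TwoBackboneYOLO(nn.Module):
    """Two identical backbones whose feature maps are concatenated per scale
    before the detection head. The backbone and head modules are passed in
    and stand in for Darknet-53 and the YOLOv3 head; they are not spelled out."""
    def __init__(self, backbone_rgb, backbone_diff, head):
        super().__init__()
        self.backbone_rgb = backbone_rgb    # takes original frames
        self.backbone_diff = backbone_diff  # takes frame-difference images
        self.head = head

    def forward(self, frame, frame_diff):
        # Each backbone returns features from its 3rd, 4th and 5th Resblock.
        f_rgb = self.backbone_rgb(frame)
        f_diff = self.backbone_diff(frame_diff)
        # Concatenate along the channel dimension at every scale.
        fused = [torch.cat([a, b], dim=1) for a, b in zip(f_rgb, f_diff)]
        return self.head(fused)
```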
Octave convolution. In natural images, information is distributed across different frequencies: the high-frequency components usually encode fine details, while the low-frequency components encode global structure. Chen et al. [19] argue that, like natural images, the output feature map of a convolutional layer can be regarded as a mixture of information at different frequencies and can likewise be decomposed into features of different spatial frequencies. From this observation they propose a multi-frequency feature representation that stores high-frequency and low-frequency feature maps in separate groups, as shown in Fig. 7a. As shown in Fig. 7b, the spatial resolution of the low-frequency group can then be safely reduced, since adjacent positions share information, which reduces spatial redundancy. To operate on this new feature representation, a new convolution operator called Octave Convolution (OctConv) is proposed.
It operates on two feature-map tensors one octave apart in frequency and extracts information directly from the low-frequency feature map without first decoding it back to the high resolution, as shown in Fig. 7c.
In the ordinary convolution process, a subset of the feature maps captures spatially low-frequency variations and contains spatially redundant information. To reduce this redundancy, the Octave feature representation explicitly factorizes the feature tensor into two groups corresponding to low and high frequencies. Scale-space theory provides a principled way of creating such a scale space [20] and defines an octave as a division of the spatial dimension by 2; the low- and high-frequency spaces are therefore defined by reducing the spatial resolution of the low-frequency feature maps by one octave. The specific form is as follows. First, let $X \in \mathbb{R}^{c \times h \times w}$ denote the input feature tensor of a convolution layer, where $h$ and $w$ are the spatial dimensions and $c$ is the number of feature maps (channels). Next, we explicitly decompose $X$ along the channel dimension as $X = \{X^H, X^L\}$, where the high-frequency feature map $X^H \in \mathbb{R}^{(1-\alpha)c \times h \times w}$ captures fine details in the spatial domain, and the low-frequency feature map $X^L \in \mathbb{R}^{\alpha c \times \frac{h}{2} \times \frac{w}{2}}$ captures signals that change slowly in the spatial dimension. Here $\alpha \in [0, 1]$ denotes the ratio of channels assigned to the low-frequency part, and the low-frequency feature map is defined one octave lower than the high-frequency one, i.e., at half the spatial resolution, as shown in Fig. 7c.
For ordinary convolution, let $W \in \mathbb{R}^{c \times k \times k}$ denote the $k \times k$ convolution kernel, and let $X, Y \in \mathbb{R}^{c \times h \times w}$ denote the input and output tensors, respectively. Each position of the output tensor $Y$ is computed as in Eq. (12):

$$Y_{p,q} = \sum_{i,j \in \mathcal{N}_k} W_{i+\lfloor k/2 \rfloor,\, j+\lfloor k/2 \rfloor}^{\top} \, X_{p+i,\, q+j} \quad (12)$$

where $(p, q)$ is the location coordinate and $\mathcal{N}_k$ is the $k \times k$ local neighborhood. For simplicity, padding is omitted in all equations, $k$ is assumed to be odd, and the input and output have the same number of channels, i.e., $c_{in} = c_{out} = c$.
For the octave convolution, the design goal is to process the low and high frequencies in their corresponding frequency tensors efficiently, while also allowing effective communication between the high- and low-frequency components. Let $X$ and $Y$ be the input and output tensors, $X = \{X^H, X^L\}$ and $Y = \{Y^H, Y^L\}$, where $Y^H = Y^{H \to H} + Y^{L \to H}$ and $Y^L = Y^{L \to L} + Y^{H \to L}$, as shown in Fig. 10. Here $Y^{H \to H}$ and $Y^{L \to L}$ denote intra-frequency information updates, while $Y^{L \to H}$ and $Y^{H \to L}$ denote inter-frequency information transfer.
To compute the above terms, the convolution kernel is also divided into two parts, $W = [W^H, W^L]$, where $W^H = [W^{H \to H}, W^{L \to H}]$ and $W^L = [W^{L \to L}, W^{H \to L}]$, as shown in Fig. 8.
For the inter-frequency communication $W^{L \to H}$, the up-sampling of the $X^L$ portion of the input tensor can be folded into the convolution, so the up-sampled feature map never needs to be computed and stored explicitly, as shown in Eq. (13):

$$Y^{L \to H}_{p,q} = \sum_{i,j \in \mathcal{N}_k} \left(W^{L \to H}_{i+\lfloor k/2 \rfloor,\, j+\lfloor k/2 \rfloor}\right)^{\top} X^L_{\lfloor p/2 \rfloor + i,\, \lfloor q/2 \rfloor + j} \quad (13)$$

where $\lfloor \cdot \rfloor$ denotes the rounding (floor) operation. For the feature tensor $X^H$, its down-sampling is likewise folded into the convolution, as shown in Eq. (14):

$$Y^{H \to L}_{p,q} = \sum_{i,j \in \mathcal{N}_k} \left(W^{H \to L}_{i+\lfloor k/2 \rfloor,\, j+\lfloor k/2 \rfloor}\right)^{\top} X^H_{2p+0.5+i,\, 2q+0.5+j} \quad (14)$$

where multiplying the position $(p, q)$ by a factor of 2 performs the down-sampling, and the additional shift of half a step keeps the down-sampled map aligned with the input. However, since the index of $X^H$ can only be an integer, we can either round the index to $(2p+i, 2q+j)$ or approximate the value at $(2p+0.5+i, 2q+0.5+j)$ by averaging the 4 adjacent positions. The first choice corresponds to strided convolution and the second to average pooling. Because strided convolution leads to down-sampled maps that are misaligned with the input and to center drift, as shown in Fig. 9, we use average pooling to approximate this value in the rest of the paper. The output tensor $Y = \{Y^H, Y^L\}$ of the octave convolution, implemented with down-sampling by average pooling, can then be written as Eq. (15):

$$Y^H = f(X^H; W^{H \to H}) + \mathrm{upsample}\!\left(f(X^L; W^{L \to H}),\, 2\right)$$
$$Y^L = f(X^L; W^{L \to L}) + f\!\left(\mathrm{pool}(X^H, 2);\, W^{H \to L}\right) \quad (15)$$

where $f(X; W)$ denotes a convolution with parameters $W$, $\mathrm{pool}(X, k)$ is average pooling with kernel size $k \times k$ and stride $k$, and $\mathrm{upsample}(X, k)$ is up-sampling by nearest-neighbor interpolation with a factor of $k$; here $k$ is taken as 2.
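To make Eq. (15) concrete, the following is a minimal PyTorch-style sketch of an OctConv layer; it follows the public description in [19], but the class name, arguments and structure are ours, not the authors' code, and details such as bias handling are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OctConv2d(nn.Module):
    """Octave convolution following Eq. (15): intra-frequency convolutions plus
    inter-frequency exchange via average pooling (H->L) and nearest up-sampling (L->H)."""
    def __init__(self, in_ch, out_ch, kernel_size=3, alpha_in=0.5, alpha_out=0.5):
        super().__init__()
        lo_in, lo_out = int(alpha_in * in_ch), int(alpha_out * out_ch)
        hi_in, hi_out = in_ch - lo_in, out_ch - lo_out
        pad = kernel_size // 2

        def make(ci, co):
            # A path is only created when it has both input and output channels.
            return nn.Conv2d(ci, co, kernel_size, padding=pad, bias=False) if ci > 0 and co > 0 else None

        self.h2h, self.h2l = make(hi_in, hi_out), make(hi_in, lo_out)
        self.l2l, self.l2h = make(lo_in, lo_out), make(lo_in, hi_out)

    def forward(self, x_h, x_l=None):
        # High-frequency output: H->H plus up-sampled L->H (first line of Eq. 15).
        y_h = self.h2h(x_h) if self.h2h is not None else 0
        if x_l is not None and self.l2h is not None:
            y_h = y_h + F.interpolate(self.l2h(x_l), scale_factor=2, mode="nearest")
        # Low-frequency output: L->L plus H->L on the average-pooled input (second line).
        y_l = self.l2l(x_l) if (x_l is not None and self.l2l is not None) else None
        if self.h2l is not None:
            h2l = self.h2l(F.avg_pool2d(x_h, kernel_size=2, stride=2))
            y_l = h2l if y_l is None else y_l + h2l
        return y_h, y_l
```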
The details of the OctConv operator implementation are shown in Fig. 10. It consists of four computational paths, two green paths corresponding to the update of information inside the high-and low-frequency feature maps, and two red paths indicating the exchange of information between the two frequencies.
A characteristic of octave convolution is that the low-frequency feature map has a large receptive field (see Fig. 10). Convolving the low-frequency part $X^L$ with a $k \times k$ convolution kernel effectively enlarges the receptive field by a factor of 2 compared with normal convolution. This helps each octave convolution layer capture more contextual information from distant locations and can potentially improve recognition performance.
Substitution: Octave convolution for vanilla convolutions. We reform the residual blocks in the Darknet-53 backbone of YOLOv3 by replacing vanilla convolutions with Octave Convolution (OctConv). OctConv gives the low-frequency feature maps a larger receptive field: convolving the low-frequency part $X^L$ with $k \times k$ kernels effectively enlarges the receptive field by a factor of 2 compared with vanilla convolution, which helps capture more contextual information from distant locations and can potentially improve recognition performance, as verified in our experiments.
OctConv is backward compatible with vanilla convolution and can be inserted into regular convolution networks. To convert a vanilla feature representation into the multi-frequency feature representation, i.e., at the first OctConv layer, we set $\alpha_{in} = 0$ and $\alpha_{out} = 0.5$; in this case, the OctConv paths related to the low-frequency input are disabled, giving a simplified version with only two paths. At the middle OctConv layers, we set $\alpha_{in} = \alpha_{out} = 0.5$. To convert the multi-frequency feature representation back to a vanilla one, i.e., at the last OctConv layer, we set $\alpha_{out} = 0$; in this situation, the OctConv paths related to the low-frequency output are disabled, producing a single full-resolution output. Because the convolution outputs are split, batch normalization and activation are applied separately after the high-frequency and low-frequency outputs. As shown in Fig. 11, vanilla convolutions are therefore replaced by three new octave convolutions, called 'Init OctConv', 'General OctConv' and 'Final OctConv'.
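With the OctConv2d sketch above, the three variants would differ only in their α settings; the channel counts below are placeholders, not the paper's configuration.

```python
# 'Init OctConv': vanilla input, multi-frequency output.
init_oct = OctConv2d(64, 128, kernel_size=3, alpha_in=0.0, alpha_out=0.5)
# 'General OctConv': multi-frequency input and output.
general_oct = OctConv2d(128, 128, kernel_size=3, alpha_in=0.5, alpha_out=0.5)
# 'Final OctConv': multi-frequency input, single full-resolution output.
final_oct = OctConv2d(128, 256, kernel_size=3, alpha_in=0.5, alpha_out=0.0)
```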
We repackage each module to adapt to the input and output characteristics of octave convolution. As shown in Fig. 12, batch normalization (BN) is applied separately after the high- and low-frequency outputs of each octave convolution, and the Leaky Rectified Linear Unit (LeakyReLU) is used as the activation function of each convolution layer. As shown in Fig. 15, the Resblock that contained n ResUnit modules now contains n−1 'Octave ResUnit' modules and one 'Final Octave ResUnit', and we rename it 'Octave Resblock'. The '5 OCBR' block in Fig. 16 is composed of one 'Init OCBR' and four 'General OCBR' modules. The whole network structure, which contains two backbones and uses octave convolution, is shown in Fig. 17.

Data sets
We analyze the performance of the proposed framework on the public urban traffic data set UA-DETRAC [21], which contains 10 h of video at 25 fps and 960 × 540 resolution, captured with a Canon EOS 550D camera at 24 different locations in China. We modify the UA-DETRAC annotations by collapsing the categories into a single class: all types of vehicles are labeled as vehicle. The model is implemented in Python. The platform has an Intel(R) Xeon(R) Silver 4116 CPU and a single Titan-X GPU.
Although the two backbones of the network conceptually process the original images and the three-frame-difference images in parallel, they are not programmed to run on the single GPU simultaneously but execute in sequence.
The two backbones are independent of each other. The detection results are mapped back to the original image to obtain the final results. One object may be covered by several bounding boxes, so the non-maximum suppression (NMS) method is used to eliminate repeated boxes; after NMS, only the bounding box with the highest score is kept for each object. Finally, the top 20 boxes in descending order of score are kept as the detection results.
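A minimal sketch of this post-processing step is given below, assuming boxes are provided as (x1, y1, x2, y2) arrays with one confidence score per box; the IoU threshold of 0.5 is an illustrative value, not taken from the paper.

```python
import numpy as np

def postprocess(boxes, scores, iou_thresh=0.5, top_k=20):
    """Greedy NMS followed by keeping the top-k boxes by score.
    `boxes` is an (N, 4) array of (x1, y1, x2, y2); `scores` is (N,)."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0 and len(keep) < top_k:
        i, rest = order[0], order[1:]
        keep.append(i)
        # IoU of the selected box with the remaining ones.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter + 1e-9)
        # Drop boxes that overlap the selected box too much.
        order = rest[iou <= iou_thresh]
    return boxes[keep], scores[keep]
```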
The training set has 45,960 frames, the validation set 19,690 frames, and the test set 16,417 frames, all randomly sampled from UA-DETRAC.
Throughout the training, we use a batch size of 8 and the Adam optimizer with default parameters, except that two learning rates are used: 10^-3 and 10^-4. We train the network for 40 epochs with the following learning rate schedule: 10^-3 for 2 epochs, then 10^-4 for 2 epochs.
To verify the performance of the improved network model, we evaluate the algorithm using four metrics: mean average precision (mAP), precision (P), recall (R), and frames per second (FPS). TP (true positive) means the ground-truth box is a vehicle and the prediction is also a vehicle; FP (false positive) means the ground truth is background but the prediction is a vehicle; FN (false negative) means the ground-truth box is a vehicle but the detector predicts background; and TN (true negative) means the ground truth is background and the detector also predicts background. Precision is the percentage of correctly predicted positive cases among all predicted positives, and recall is the proportion of correctly predicted positive cases among all true positives:

$$P = \frac{TP}{TP + FP} \quad (16)$$

$$R = \frac{TP}{TP + FN} \quad (17)$$

The precision rate evaluates how many of the detections are correct, while the recall rate evaluates how many of the true positive cases are actually found. F1-score: to better reflect the balance between precision and recall, the F1-score is their harmonic mean:

$$F1 = \frac{2 \times P \times R}{P + R} \quad (18)$$

AP: the AP value is the average precision of a single category, i.e., the average of the precision values over the recall interval from 0 to 1, which equals the area under the PR curve:

$$AP = \int_{0}^{1} P(R)\, \mathrm{d}R \quad (19)$$

FPS: the number of frames the model processes per second; the higher the value, the faster the detection:

$$FPS = \frac{f_n}{T} \quad (20)$$

where $f_n$ is the total number of images processed by the model and $T$ is the time spent.
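For illustration, the precision, recall and F1 definitions above translate directly into code; the TP/FP/FN counts are assumed to come from matching predictions to ground truth at a fixed IoU threshold, and the numbers in the example are made up.

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall and F1 from raw counts (Eqs. 16-18)."""
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

# Example: 900 correct detections, 60 false alarms, 90 missed vehicles.
p, r, f1 = detection_metrics(tp=900, fp=60, fn=90)
```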
We also compare the number of parameters (Params) and floating-point operations (FLOPs) to measure the complexity of the neural network models.

Experimental comparison of different detection methods
The Faster-RCNN algorithm is widely used in many scenarios due to its excellent detection accuracy, but it is limited by its slow detection speed. The YOLO-v1 and YOLO-v2 algorithms can be used in some real-time applications, such as real-time vehicle detection, thanks to their fast detection speed, but their accuracy is lower than that of Faster-RCNN. The SSD algorithm combines the advantages of YOLO and Faster-RCNN, with fast detection and high accuracy, and is widely used in engineering projects. The YOLO-v3 algorithm surpasses SSD in both detection speed and accuracy, and has received wide attention from academia and industry since its release in 2018.
In this paper, we compare the YOLO-v1, Faster-RCNN, YOLO-v2, Tiny-YOLO and YOLO-v3 algorithms with our algorithm, analyzing the advantages and disadvantages of each by comparing recall, precision and F1 value, and also measuring the frame rate of each algorithm on the same images. Table 3 shows that using octave convolution achieves a higher F1 than normal convolution, by 1.40% when the original image is the single input and by 1.20% when the frame-difference is the single input.

Evaluation of the reformation of Yolov3
Although the network simply reformed into a two-backbone net with normal convolution, using both original frames and frame-differences as input (Table 2), already outperforms the original single-input network by about 1% in F1, the two-backbone network with octave convolution achieves a much larger improvement over the baseline: F1 is higher by 3.50%, recall by 4.01%, precision by 3.10% and AP by 2.31%. The proposed model therefore significantly improves detection while only slightly increasing the amount of computation and the number of parameters.
As shown in Fig. 18, our final proposed method using frame-difference and octave convolution detects partially occluded vehicles that are missed by YOLOv3. We attribute this success to the contribution of the frame-difference.

Performance and comparison of the different framework
From Table 4, it can be seen that the YOLO-v1 algorithm can reach 80.34% in vehicle detection, but the miss rate

Conclusion
This paper proposes a new idea for traffic vehicle detection based on the YOLOv3 network, combining inter-frame differencing and octave convolution. (1) We propose a new way of using the inter-frame difference map: after verifying that using the difference map as a mask to filter the original frame in the spatial domain is not effective, we instead feed the inter-frame difference map into the neural network so that the temporal information it carries can be extracted in the feature domain for target detection. (2) We introduce a dual-input, dual-backbone structure on top of the YOLOv3 network model, consisting of two independent backbones that extract the features of the original frame and of the inter-frame difference map, respectively, and then feed them together into the detection head, allowing the network to fuse the information of both inputs. (3) To alleviate the higher computational overhead and reduced detection speed caused by the dual backbone, we replace the normal convolutions with octave convolutions. The improved YOLOv3 network increases the average detection accuracy by 2.11% and the recall rate by 2.84% compared with YOLOv3 while still meeting the requirement of real-time detection. The good experimental performance confirms the feasibility of the method, which is expected to make up for the shortcomings of existing traffic information collection techniques and to promote related research such as wide-area traffic flow analysis; the network therefore has broad engineering application value and theoretical significance. Given the complexity and diversity of traffic scenes, in future work we may apply more advanced features or detection algorithms to detect vehicle targets, and consider knowledge distillation to compress the network model further so as to improve detection in harsh environments.