Vehicle Detection and Ranging Using Two Different Focal Length Cameras

Vehicle detection is a crucial task for autonomous driving and demands high accuracy and real-time speed. Considering that current deep learning object detection models are too large to be deployed on a vehicle, this paper introduces a lightweight network to modify the feature extraction layer of YOLOv3 and improve the remaining convolution structure; the resulting Lightweight YOLO network reduces the number of network parameters to a quarter. The license plate is then detected to calculate the actual vehicle width, and the distance between vehicles is estimated from this width. This paper proposes a detection and ranging fusion method based on two different focal length cameras to solve the problem of difficult detection and low accuracy caused by the small license plate when the distance is large. The experimental results show that the average precision and recall of the Lightweight YOLO trained on the self-built dataset are 4.43% and 3.54% lower than YOLOv3, respectively, but the computation time of the network decreases by 49 ms per frame. Road experiments in different scenes also show that the long and short focal length camera fusion ranging method dramatically improves the accuracy and stability of ranging. The mean error of the ranging results is less than 4%, and the range of stable ranging reaches 100 m. The proposed method realizes real-time vehicle detection and ranging on the on-board embedded platform Jetson Xavier, which satisfies the requirements of autonomous driving environment perception.


Introduction
Autonomous driving technology not only assists the driver but also improves the safety of the traffic environment. It senses the surrounding environment through various sensors, which are equivalent to the eyes of the vehicle, and then processes the sensory data to provide danger warnings for driving and even control the vehicle [1]. It can be seen that high-precision and fast traffic target sensing technology is crucial for autonomous driving.
At present, the sensing sensors used in autonomous vehicles mainly include Lidar, millimeter-wave radar, ultrasonic radar, and camera [2][3][4][5]. Lidar can scan and measure by transmitting laser pulses to generate a precise map of road scene topography, which can be used for short-distance and long-distance obstacle detection. However, the high price of Lidar is not conducive to mass adoption. Millimeter-wave radar has a low price and a strong ability to cope with environmental changes, but the perceived information is not comprehensive enough because it cannot identify the target. Ultrasonic radar measures short distances and is mostly used in scenes such as automatic parking. Compared with the above radar sensors, vision sensors are inexpensive and informative. The camera can accurately detect traffic targets such as vehicles, pedestrians, lane lines, and traffic signs through image algorithms and estimate the relative distance through a ranging model to complete the comprehensive perception of road information. Besides, some studies integrate multiple sensors such as radar and vision to improve perception efficiency [6,7]. Considering the cost, a camera-based solution is the strategy of choice. For example, MobilEye uses a single camera to implement adaptive cruise control (ACC) and forward collision warning (FCW) [8,9].
Traditional vision-based vehicle detection methods can be divided into two categories: appearance-based methods and motion-based methods. The appearance-based methods use shadow features [10], symmetry properties [11], color [12], texture [13], and headlights/taillights [14,15] to detect vehicles. Also, some methods introduce machine learning classifiers to train on appearance features. Liu et al. used an AdaBoost classifier trained with Haar-like features to detect vehicles [16]. Wei et al. proposed to train a support vector machine (SVM) by combining Haar features and the histogram of oriented gradients (HOG) to extract vehicle positions [17]. The motion-based methods mainly include the optical flow method [18] and the dynamic background modeling method [19]. Fang and Dai proposed to combine the optical flow method and the Kalman filter to realize vehicle detection and tracking [20].
In recent years, deep learning object detection algorithms based on convolutional neural networks have become more popular. One family is convolutional neural networks based on region proposals, such as R-CNN [21], SPP-net [22], and faster R-CNN [23], but their calculation cost is relatively high. The other is convolutional neural networks based on regression, such as YOLO [24], SSD [25], and YOLOv2 [26], which have continuously improved calculation speed. Sang et al. proposed an improved YOLOv2 to increase detection accuracy and speed [27]. However, there are still tradeoffs between accuracy and speed when using convolutional neural networks for real-time vehicle detection.
Vision-based object ranging methods are mainly divided into two categories: stereo vision ranging and monocular vision ranging. Toulminet et al. proposed a stereo vision system to extract three-dimensional features of the preceding vehicle and calculate the distance [28]. However, stereo vision ranging requires calibrating multiple cameras, and there are problems such as matching difficulties and massive calculation, making it unsuitable for traffic target sensing in complex driving environments. Monocular vision ranging is relatively simple; it uses the object detection bounding box to estimate the distance. Distance estimation methods based on monocular vision mainly use the camera pinhole model or inverse perspective mapping (IPM). Adamshuk et al. proposed a distance estimation method based on IPM in the HSV color space, which used the linear relationship between the transformed image pixels and the actual distance [29]. Han et al. proposed calculating the distance based on the vehicle width estimated using the lane line and considered the situation without the lane line, but there is a large error in estimating the widths of the lane line and target vehicle [30]. Mehdi et al. estimated the distance using the height and the pitch angle of the camera by assuming the road is flat, but this method does not consider the lateral distance or the influence of camera attitude angles [31]. The above ranging methods have no length reference from a real-world target and rely solely on the imaging principle of the camera, making high robustness difficult to achieve. Zhao et al. proposed to estimate the distance based on the license plate with a fixed width, but the license plate is difficult to detect accurately when the distance is long and the plate is small, resulting in a limited scope of application [32].
Therefore, in order to solve the above problems, that the deep learning object detection network is difficult to deploy on an embedded platform and that the accuracy and robustness of vehicle ranging methods are unstable due to the lack of an actual length reference, this paper proposes a vehicle and license plate detection model based on Lightweight YOLO and a long and short focal length camera fusion ranging method. The main work of this paper is as follows. The feature extraction network of YOLOv3 [33] has too many convolution layers and network parameters, which takes up a large part of the time in the feature extraction process, resulting in slow network forward propagation. Based on the network design of YOLOv3, this paper combines the lightweight network ShuffleNet [34] to build a Lightweight YOLO object detection model. ShuffleNet uses the depthwise separable convolution (DWConv) [35] to reduce the computational complexity of convolution and introduces channel shuffle to increase the flow of information across channels. Compared to Darknet53, this network structure can map more channel features with lower computational complexity and memory cost.
The ShuffleNet unit is composed of two convolution blocks. As shown in Figure 1, convolution block 1 is a downsampling module: by copying the input features and performing depthwise convolution with a stride of 2, the feature size is reduced by half and the number of channels is doubled. Convolution block 2 preserves the shallow feature semantic information by splitting and splicing the channels and ensures that the size of the output feature is unchanged. At the same time, a channel shuffle is performed between every two convolution blocks, so the feature channels are interleaved across groups, which solves the problems of information duplication and information loss caused by the channel split.
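The channel shuffle operation described above can be sketched in a few lines. This is a NumPy illustration rather than the network code used in the paper; the function name and the (N, C, H, W) layout are assumptions of this example:

```python
import numpy as np

def channel_shuffle(x, groups):
    """Interleave feature channels across groups, as in the ShuffleNet unit.

    `x` has shape (N, C, H, W). The shuffle is a reshape to
    (N, groups, C//groups, H, W), a swap of the two channel axes,
    and a flatten back to (N, C, H, W).
    """
    n, c, h, w = x.shape
    assert c % groups == 0, "channel count must be divisible by groups"
    x = x.reshape(n, groups, c // groups, h, w)
    x = x.transpose(0, 2, 1, 3, 4)  # swap group axis and per-group channel axis
    return x.reshape(n, c, h, w)
```

For example, with 8 channels and 2 groups, channels ordered [0..7] come out as [0, 4, 1, 5, 2, 6, 3, 7], so the next grouped convolution sees inputs from both groups.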
The auxiliary layer of the Lightweight YOLO network is modified based on the original network, retaining the multiscale prediction method and reducing the number of convolution layers and computational complexity of the entire network. Finally, the Lightweight YOLO network structure is shown in Figure 2. The backbone network uses the Shufflenet structure. The shallow and deep features are fused by upsampling the deep feature map. The network outputs prediction tensors at three different scales. The input image size is 416 × 416 × 3, and the output sizes are three characteristic tensors of 13 × 13, 26 × 26, and 52 × 52, which detect objects of different sizes.

Loss Function.
We found that the detection results of object detection models such as YOLO are very accurate in identifying the object, but the boundary position of the detection box is relatively imprecise. The monocular vision ranging method relies mainly on the detection bounding box, and a blurred box boundary causes the ranging accuracy to be low or even invalid, so it is crucial to improve the accuracy of the detection bounding box.
During the training process, the convolutional neural network continuously updates the model parameters through the loss function and backpropagation, reducing the model loss and improving the detection accuracy. The loss function of YOLOv3 consists of three parts: the bounding box prediction L_bbox, the confidence prediction L_conf, and the category prediction L_cls. However, the bounding box prediction loss uses the mean square error, which only reflects the distance attribute between the detection box and the actual bounding box while ignoring the Intersection over Union (IoU), as shown in Figure 3. When two rectangular boxes have the same L2 norm distance, their IoU may be different. It is therefore necessary to introduce an IoU prediction loss into the loss function.
The calculation formula for IoU is as follows:

IoU = |A ∩ B| / |A ∪ B|,    (1)

where A and B indicate the detection box and the ground truth box, respectively. However, when the two boxes are in different overlapping states, the IoU may be the same, so using IoU alone as a loss function has a considerable drawback. Therefore, this paper uses generalized IoU [36] as the optimized IoU loss calculation method. The formula is as follows:

GIoU = IoU − |C \ (A ∪ B)| / |C|,    (2)

where C is the smallest rectangular box containing both A and B. When two rectangles do not coincide, generalized IoU can still describe their relative relationship. As shown in Figure 4, two rectangles still have the same IoU value in different overlapping cases, but the generalized IoU value of the right case is smaller than that of the left case. In other words, generalized IoU can highlight the misalignment between the two rectangles. It can be seen that generalized IoU solves the critical problem that makes IoU unsuitable as a loss function.
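As an illustration, generalized IoU can be computed from corner-format boxes as follows (a minimal sketch; the box format and function name are assumptions of this example, not the paper's implementation):

```python
def giou(a, b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2):
    IoU minus the fraction of the smallest enclosing box C
    that is not covered by the union of A and B."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # intersection area
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union
    # smallest enclosing box C
    c_area = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return iou - (c_area - union) / c_area
```

Unlike plain IoU, the value is negative for disjoint boxes and shrinks as they move apart, so the loss still provides a useful gradient when the prediction does not overlap the ground truth.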
Finally, the loss function used by the Lightweight YOLO is as follows:

Loss = L_GIoU + L_conf + L_cls.    (3)

2.3. Vehicle Tracking. Vehicle tracking adopts the detection-based multiple object tracking method SORT proposed in [37]. The interframe displacement of a vehicle can be modeled as a linear constant velocity model that is independent of other vehicles and camera motion, and the state of each vehicle can be defined as follows:

x = [u, v, s, r, u̇, v̇, ṡ]^T,    (4)

where u and v represent the coordinates of the center point of the target vehicle and s and r represent the area and the aspect ratio of its detection box, respectively. This tracking method uses the Hungarian assignment algorithm [38] to associate detection boxes with target vehicles based on the IoU between the predicted bounding box of each vehicle and the vehicle detection boxes in the current frame. When a detection box is associated with a target vehicle, the detected bounding box is used to update the target vehicle state, where the velocity components are solved optimally via a Kalman filter [39]. If no detection box is associated with the target vehicle, the linear model is used to predict its state.
If the tracker matches a detection box for two consecutive frames, the algorithm judges it to be a valid track and outputs the detection bounding box and the corresponding tracker ID. If the tracker does not match any detection box for three consecutive frames, the target is judged to have disappeared; during the unmatched period before that, the predicted bounding box of the track is output. The details of the tracking algorithm can be found in [37].
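The association step can be sketched as follows. This is a hypothetical illustration: SORT solves the assignment optimally with the Hungarian algorithm, while the sketch below uses a simpler greedy matching that behaves the same when vehicles are well separated. Function names and the IoU threshold are assumptions of this example:

```python
def iou(a, b):
    # Plain IoU of two (x1, y1, x2, y2) boxes.
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def associate(detections, predictions, iou_threshold=0.3):
    """Greedy stand-in for the Hungarian assignment used by SORT [38]:
    repeatedly match the remaining detection/prediction pair with the
    highest IoU, discarding pairs below the threshold."""
    pairs = sorted(((iou(d, p), i, j)
                    for i, d in enumerate(detections)
                    for j, p in enumerate(predictions)), reverse=True)
    used_d, used_p, matches = set(), set(), []
    for score, i, j in pairs:
        if score < iou_threshold:
            break
        if i not in used_d and j not in used_p:
            matches.append((i, j))
            used_d.add(i)
            used_p.add(j)
    return matches  # list of (detection index, prediction index)
```

Each matched detection then updates its track's Kalman filter; unmatched predictions are propagated by the motion model as described above.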

Fusion Ranging
Most existing vehicle ranging methods rely on the camera imaging principle to estimate the distance and are mainly divided into two types: those based on the vehicle position [8,[29][30][31] and those based on the vehicle width [30,32]. The position-based ranging model assumes that the road is flat and is sensitive to noise. The distance measurement model based on the vehicle width is relatively robust, because the pixel width of the vehicle is less affected by changes in the pitch angle of the camera, but the vehicle width is not fixed and cannot be measured accurately. On this basis, this paper proposes a long and short focal length camera fusion ranging method, which first matches the target vehicle and license plate detected by the long and short focal length cameras, then calculates the vehicle width through the license plate information, and finally calculates the distance between the two vehicles using a pinhole model. The specific steps are shown in Figure 5.
Step 1: Capture the image of the road ahead through the short focal length camera and detect vehicle and license plate and then track the vehicle.
Step 2.1: When the tracked vehicle is relatively close, the license plate position can be accurately detected, and the actual vehicle width of the tracked vehicle can be calculated based on the actual width of the license plate by Equation (5). Then go to Step 3.
Step 2.2: When the tracked vehicle is far away, the license plate cannot be accurately detected because the pixel width is small. Capture the current frame image by the long focal length camera and detect vehicle and license plate. Find a vehicle that matches the tracked vehicle by Algorithm 1, and calculate the actual vehicle width based on its license plate width by Equation (5).
Step 3: After obtaining the actual vehicle width of the vehicle, the actual distance of the tracked vehicle can be calculated by Equation (6).
The actual width of the vehicle is calculated as follows:

W = (w / w_l) · W_l,    (5)

where W_l and W represent the actual width of the license plate and the actual width of the vehicle, and w_l and w represent their pixel widths in the image, respectively. The national standard stipulates that the width of the license plate is 440 mm. As shown in the camera pinhole model in Figure 6, the actual distance between the tracked vehicle and the self-vehicle is calculated as follows:

D = f · W / w,    (6)

where f represents the camera pixel focal length. The two images taken at the same location by two cameras with different focal lengths have the same central point position, except that the image captured by the long focal length camera has a narrower field of view, and the relationship between the fields of view of the long and short focal length cameras is as follows:

W_1 / W_2 = f_2 / f_1,    (7)

where W_1 and W_2 are the widths of the fields of view of the short focal length camera and the long focal length camera, and f_1 and f_2 are their focal lengths, respectively. The image captured by the long focal length camera is scaled to match the short focal length image according to the proportional coefficient given by Equation (7). Based on the IoU of the vehicle detection bounding boxes in the two images, the box with the maximum IoU in the scaled long focal length image is the object matched with the tracked vehicle.
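Equations (5) and (6) reduce to two one-line calculations. The sketch below is illustrative; the function names and the pixel focal length argument are assumptions of this example:

```python
PLATE_WIDTH_MM = 440.0  # national standard license plate width W_l

def vehicle_width_mm(plate_px, vehicle_px, plate_width_mm=PLATE_WIDTH_MM):
    # Equation (5): scale the known plate width by the ratio of the
    # vehicle's pixel width w to the plate's pixel width w_l.
    return plate_width_mm * vehicle_px / plate_px

def distance_mm(focal_px, width_mm, vehicle_px):
    # Equation (6): pinhole model, D = f * W / w, with f in pixels.
    return focal_px * width_mm / vehicle_px
```

For example, a plate spanning 44 pixels on a vehicle spanning 180 pixels gives a vehicle width of 1800 mm; with a pixel focal length of 1000 px, that vehicle seen at 90 pixels wide is 20 m away.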
The algorithm for finding a vehicle bounding box in the long focal length image that matches the tracked vehicle in the short focal length image is shown in Algorithm 1.
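The steps of this matching can be sketched as follows, assuming both cameras share the image center and using k = f_short / f_long from Equation (7) as the scaling coefficient. Function names, and the choice to scale about the image center, are illustrative assumptions of this example:

```python
IMG_W, IMG_H = 1280, 720  # image resolution used in the experiments

def iou(a, b):
    # Plain IoU of two (x1, y1, x2, y2) boxes.
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def scale_box(box, k, img_w=IMG_W, img_h=IMG_H):
    # Both cameras share the same optical center, so a long focal length
    # box maps into short focal length coordinates by scaling each corner
    # about the image center with k = f_short / f_long.
    cx, cy = img_w / 2.0, img_h / 2.0
    x1, y1, x2, y2 = box
    return (cx + (x1 - cx) * k, cy + (y1 - cy) * k,
            cx + (x2 - cx) * k, cy + (y2 - cy) * k)

def match_box(short_box, long_boxes, k):
    # Algorithm 1 sketch: the scaled long focal length box with the
    # maximum IoU against the tracked short focal length box is the match.
    scaled = [scale_box(b, k) for b in long_boxes]
    scores = [iou(short_box, s) for s in scaled]
    best = max(range(len(long_boxes)), key=scores.__getitem__)
    return best, scores[best]
```

With the 6 mm and 16 mm lenses used later in the paper, k = 6/16 = 0.375.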
After determining the tracked vehicle, it is also necessary to match the license plate to the corresponding vehicle to obtain the actual vehicle width. We can match the license plate and the vehicle based on the positions of the detection boxes. However, in a road environment with many overlapping vehicles, there may be wrong matches. From prior knowledge, each license plate corresponds to exactly one car, and the vehicle closest to the self-vehicle occupies the most image area. Therefore, if a license plate bounding box falls within two vehicle detection boxes, the corresponding vehicle should be the closer one, whose detection box extends nearer to the image bottom. The resulting matching algorithm between the license plate and the corresponding vehicle is shown in Algorithm 2.
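The plate-to-vehicle matching rule can be sketched as follows (an illustrative version of Algorithm 2; the function names are assumptions of this example):

```python
def contains(vehicle, plate):
    # True if the plate box lies entirely inside the vehicle box.
    vx1, vy1, vx2, vy2 = vehicle
    px1, py1, px2, py2 = plate
    return vx1 <= px1 and vy1 <= py1 and px2 <= vx2 and py2 <= vy2

def match_plate_to_vehicle(plate, vehicle_boxes):
    """Among the vehicle boxes containing the plate, choose the one whose
    bottom edge (y2) is lowest in the image, i.e., the vehicle closest to
    the self-vehicle. Returns the index, or None if no box contains it."""
    candidates = [i for i, v in enumerate(vehicle_boxes) if contains(v, plate)]
    if not candidates:
        return None
    return max(candidates, key=lambda i: vehicle_boxes[i][3])
```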
After the license plate is matched with the corresponding vehicle, the width of the vehicle can be calculated by the width of the license plate.

Experiment
In order to verify the performance of the vehicle detection and ranging method proposed in this paper, including the detection accuracy and speed of the Lightweight YOLO network, the stability of the vehicle tracking algorithm, and the accuracy of the long and short focal length cameras fusion ranging method, the road experiment was carried out. The experiment was implemented on the NVIDIA Jetson Xavier [40]. The camera was installed inside the windshield with a height of 1.3 m. The experimental equipment installation is shown in Figure 7.
The camera image sensor used in the experiment was the OV10635 [41], with a pixel resolution of 1280 × 720 and a sensor imaging area of 5.5104 mm × 3.4188 mm. On the imaging area with width w′, the pixel width w corresponding to a physical width w_m on the sensor is

w = N · w_m / w′,    (8)

where N = 1280 is the horizontal pixel resolution.

Algorithm 1. Input: short focal length vehicle box B_w = (x_w1, y_w1, x_w2, y_w2) and long focal length vehicle box B_l = (x_l1, y_l1, x_l2, y_l2). Output: the B_l matched with B_w. Scale B_l to B̂_l = (x̂_l1, ŷ_l1, x̂_l2, ŷ_l2), then calculate the intersection ν between B_w and B̂_l and select the long focal length box with the maximum IoU.

Combining Equations (6) and (8) gives

D = f_m · N · W / (w′ · w),    (9)

where f_m represents the camera's actual focal length. After testing, the network in this paper can detect objects with a width greater than 15 pixels, and we want to detect objects within 100 m. According to Equation (9), in order to detect the 440 mm wide license plate at 100 m with at least 15 pixels, the actual focal length should be longer than 14.67 mm. Therefore, the long and short focal lengths were selected as 6 mm and 16 mm, respectively, which meets the needs of fusion ranging.
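The focal length requirement can be checked numerically; the sketch below just evaluates Equations (8) and (9) with the sensor parameters given above (function names are illustrative):

```python
SENSOR_WIDTH_MM = 5.5104  # OV10635 imaging area width w'
IMAGE_WIDTH_PX = 1280     # horizontal resolution N

def pixel_width(target_mm, dist_mm, focal_mm):
    # From Equations (6) and (8): pixels spanned by a target of width W
    # at distance D, i.e., w = f_m * N * W / (w' * D).
    return focal_mm * IMAGE_WIDTH_PX * target_mm / (SENSOR_WIDTH_MM * dist_mm)

def min_focal_mm(target_mm, dist_mm, min_px):
    # Inverting Equation (9): the smallest actual focal length that keeps
    # the target at least min_px pixels wide.
    return min_px * SENSOR_WIDTH_MM * dist_mm / (IMAGE_WIDTH_PX * target_mm)
```

For the 440 mm plate at 100 m with a 15 pixel minimum, this gives about 14.68 mm, so the 16 mm long focal length lens satisfies the requirement while the 6 mm lens does not.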

Vehicle and License Plate Detection and Vehicle Tracking Algorithm Experiment.
Firstly, through data screening and self-collection, an image dataset containing different types of vehicles and license plates was established, with a total of 30,000 images, of which 24,000 were used as the training set and 6,000 as the test set. The training set contains 45,608 vehicle targets and 21,888 license plate targets. The test set contains 10,801 vehicle targets and 5,093 license plate targets. All images were labeled with yolo-mark [42].
The Lightweight YOLO model proposed in this paper was implemented with the PyTorch framework, and the Lightweight YOLO and YOLOv3 network models were trained on an NVIDIA GTX 1080Ti. The initial learning rate of the network is 0.001, the weight decay coefficient is 0.0005, and the training strategy uses stochastic gradient descent with a momentum of 0.9.
In order to verify the performance of the Lightweight YOLO network, the detection results were compared with YOLOv3. The evaluation indexes of the experiment are precision, recall, and computation time per frame:

precision = TP / (TP + FP),  recall = TP / (TP + FN),

where TP is the number of targets correctly detected, FN is the number of targets not detected, and FP is the number of targets erroneously detected. When the IoU between a bounding box detected by the network and a test set label is greater than or equal to the set threshold, the detection is considered correct; otherwise, it is regarded as an error detection. The experimental threshold was set to 0.5 [43]. The Lightweight YOLO and YOLOv3 were tested on the test set, and the results are shown in Table 1. Compared with YOLOv3, the average precision and recall of the Lightweight YOLO network decreased by 4.43% and 3.54%, respectively, but the number of parameters is a quarter. The license plate target is relatively small, so its precision is slightly lower than that of the vehicle. However, the computation time per frame is reduced by 49 ms, which achieves real-time detection. In order to verify the detection effect after adding the tracking algorithm, the effect of the tracking algorithm was also tested. We chose five videos, each with 350 frames, and labeled the ground truth with yolo-mark.
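The precision and recall defined above can be computed directly (a trivial sketch with illustrative counts):

```python
def precision_recall(tp, fp, fn):
    # precision = TP / (TP + FP); recall = TP / (TP + FN)
    return tp / (tp + fp), tp / (tp + fn)
```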
Considering that the tracking algorithm used in this paper is based on the detection results, we tested the precision and recall of the bounding boxes generated by the Lightweight YOLO alone and with the tracking algorithm, as shown in Table 2. Compared with the Lightweight YOLO alone, the precision and recall with the tracking algorithm increased by 1.56% and 2.11%, respectively. Although the test sample size is small, it still indicates that detection is better after adding the tracking algorithm, because the tracking algorithm reduces false detections and missed detections.
There are 39 vehicles in the test sequences. In the experiment, the number of ID switches is 52, which means that tracking was interrupted 52 times. Most ID switches occur because the tracked vehicle was completely occluded; fully visible vehicles can be tracked continuously. Given that an occluded vehicle is not a high-priority ranging object, the impact is small; however, this is also a direction for future improvement. After adding the tracking algorithm, the computation time of each frame increases by 5.4 ms, and the whole vehicle detection and tracking algorithm can still run in real time. Figure 8 shows the results of vehicle detection and tracking. It can be seen that the Lightweight YOLO can accurately detect the vehicle and license plate in the image. On the basis of accurate vehicle detection, the detection-based multitarget tracking method can also track vehicles stably.

Fusion Ranging Experiment.
Firstly, the effect of the vehicle detection box matching algorithm between long and short focal length images is verified, as shown in Figure 9. Figure 9(c) shows that the vehicle detection boxes in the long focal length image Figure 9(b), after being scaled by Algorithm 1, almost overlap with the corresponding vehicle detection boxes in the short focal length image Figure 9(a). The IoUs of the two pairs of vehicle detection boxes are 0.79 and 0.71, respectively, which verifies that the scaling relationship from long focal length to short focal length is correct. The experiment verified that detection boxes in different focal length images can be successfully matched through Algorithm 1. The deviation of the IoU in the matching process mainly comes from the jitter between the two cameras caused by vehicle bumps and from the detection accuracy on vehicle images of different sizes.
Then, in order to verify the dynamic performance of the long and short focal length fusion ranging method, four sets of comparative experiments were carried out. In Experiment 1, the vehicle in front gradually moved away from the stationary self-vehicle, corresponding to the case in the ranging algorithm where the pixel width of the vehicle tracked for the first time is large. In Experiment 2, the vehicle in front was getting closer to the self-vehicle, corresponding to the case where the pixel width of the vehicle tracked for the first time is small and the license plate cannot be detected accurately. In Experiments 3 and 4, the target vehicle was in the side lane, and the self-vehicle and the target vehicle were moving relative to each other. The comparisons among the long and short focal length fusion ranging results, the vehicle position-based ranging results, and the actual distance are shown in Figures 10-13, where the vehicle position-based ranging algorithm used the method proposed in [31] and the actual distance was measured by radar.
First of all, it can be seen that the stability of the distance estimation method proposed in this paper is higher than that of the method based on the vehicle position. Especially in the dynamic environment of Experiment 2, when the distance is above 30 m, the results of the position-based method fluctuate wildly, as shown in Figure 11, completely deviating from the actual distance. In the static environment of Experiment 1, the results of the position-based method also show a significant deviation when the distance is greater than 60 m, as shown in Figure 10. In Experiment 3, the results of the position-based method fluctuate when the distance is greater than 40 m, as shown in Figure 12.
Then, in Experiments 1 and 2, the target vehicle was in the lane directly ahead, and the ranging results are always accurate. However, in Experiments 3 and 4, the ranging results show an apparent deviation after a while. The target vehicles of Experiments 3 and 4 were in the side lanes, so, due to the characteristics of the detection algorithm, the extracted bounding box covers not only the rear of the vehicle but also part of its side, and the vehicle width calculated by Equation (5) is larger than the true vehicle width. As the target vehicle moved away, the error of the vehicle pixel width output by the detection algorithm became smaller, which makes the ranging results larger than the actual distance. In Figure 12, the ranging results of the proposed algorithm start to be significantly larger after 35 m. Similarly, in Figure 13, the ranging results are smaller than the ground truth in the later period. Although some errors occurred in Experiments 3 and 4, the accuracy and robustness of the algorithm proposed in this paper are significantly better than the position-based ranging algorithm. Table 3 shows the comparison of ranging performance. The proposed method reduces both the mean error μ_d and the standard deviation σ_d. The mean error percentage μ_e shows little dependence on the distance and is less than 4%. Figure 14 shows the results of vehicle detection based on the Lightweight YOLO and the long and short focal length camera fusion ranging proposed in this paper. The proposed method can accurately detect vehicles and estimate the distance between vehicles under various light and road conditions.
In further analysis, the long and short focal length camera fusion ranging method proposed in this paper is based entirely on the vehicle and license plate detection results, and the error is mainly derived from inaccurate detection boxes, including the side-lane vehicle detection box whose width extends beyond the vehicle rear.

Conclusion
This paper proposes the Lightweight YOLO object detection model and the long and short focal length camera fusion ranging method. The lightweight network ShuffleNet is integrated into the YOLOv3 network to construct the Lightweight YOLO network, whose parameter quantity is only one quarter of the original network, and the generalized IoU loss is added to the loss function. The precision and recall of the proposed method are slightly lower than YOLOv3, but the calculation speed is greatly improved, meeting the demand for real-time detection. Besides, the fusion ranging method uses the license plate width to calculate the actual vehicle width and estimates the distance from it. Through the long and short focal length fusion matching method, it is possible to detect the license plate at a long distance and expand the effective ranging range. The integration of the tracking algorithm also makes it possible to detect the license plate only once to determine the width of a tracked vehicle, reducing the computation of the fusion matching algorithm.
The experimental results show that the Lightweight YOLO object detection model and the long and short focal length camera fusion ranging method can realize real-time, accurate vehicle detection and ranging on the on-board embedded platform, satisfying the requirements of autonomous driving environment perception.

Data Availability
All data included in this study are available upon request by contacting the corresponding author.