ACF Based Region Proposal Extraction for YOLOv3 Network Towards High-Performance Cyclist Detection in High Resolution Images

The You Only Look Once (YOLO) deep network can detect objects quickly with high precision and has been successfully applied to many detection problems. Its main shortcoming is that it usually cannot achieve high precision when detecting small-size objects in high resolution images. To overcome this problem, we propose an effective region proposal extraction method for the YOLO network, forming a complete detection structure named ACF-PR-YOLO, and use the cyclist detection problem to demonstrate the method. Instead of directly using the generated region proposals for classification or regression as most region proposal methods do, we generate large-size potential regions containing objects for the following deep network. The proposed ACF-PR-YOLO structure includes three main parts. Firstly, a region proposal extraction method based on the aggregated channel feature (ACF) is proposed, called the ACF based region proposal (ACF-PR) method. In ACF-PR, ACF is first used to quickly extract candidates, and then a bounding box merging and extending method merges the bounding boxes into correct region proposals for the following YOLO net. Secondly, we design a suitable YOLO net for fine detection in the region proposals generated by ACF-PR. Lastly, we design a post-processing step in which the results of the YOLO net are mapped into the original image to output the detection and localization results. Experiments performed on the Tsinghua-Daimler Cyclist Benchmark, with high resolution images and complex scenes, show that the proposed method outperforms the other tested representative detection methods in average precision: it outperforms YOLOv3 by 13.69% average precision and SSD by 25.27% average precision.


Introduction
In many countries, pedestrians and cyclists are the most vulnerable road users (VRUs) in traffic crashes. Cyclists are more easily involved in traffic crashes because of their relatively fast speed. In recent years, a lot of research has focused on developing Advanced Driver Assistance Systems (ADAS) for collision avoidance with VRUs [1,2]. The detection of VRUs, including cyclists and pedestrians, is still a difficult problem because of diverse cyclist postures, small object sizes, occlusions and relatively fast speed.
Many technologies have been proposed in the past decades. The main technological approaches for detection can be divided into two categories: sensor-based detection methods and vision-based detection methods. Sensors include LiDAR, radar, infrared sensors and so on. Vision-based when dealing with the cyclist detection problem. The main reason is that these methods cannot perform well on detecting small-size objects in high resolution images.
We assume that if the potential regions can be obtained first, the high-resolution images can be cropped into regions of interest (ROIs). The YOLO or SSD based methods can then be applied to these small regions to achieve better performance. Following this hypothesis, we design a cyclist detection framework based on the YOLO network. The proposed framework has three main parts.
Firstly, in order to extract region proposals from high resolution images, we propose a region proposal extraction method based on the aggregated channel feature (ACF) [40], called the ACF based region proposal (ACF-PR) method. In ACF-PR, we first design an ACF based detector to quickly extract candidates, and then a bounding box merging and extending method merges the bounding boxes into correct region proposals for the following YOLO net. Then, a suitable YOLO net is designed for fine cyclist detection in the region proposals generated by ACF-PR. Lastly, we design a post-processing step in which the results of the YOLO net are mapped into the original image to output the detection and localization results. The proposed cyclist detection structure is evaluated on the public Tsinghua-Daimler Cyclist Benchmark (TDCB) [41] and it outperforms the other representative methods in our comparison.
The paper is organized as follows. In Section 2, the proposed cyclist detection structure including a novel region proposal method, a deep learning network and a specific post-processing step is presented. Section 3 shows the evaluation results and comparison with other representative detection methods, and Section 4 gives the final conclusion and future work.

Proposed Methods
Images taken by the on-board camera for cyclist detection usually have high resolution. However, general deep learning algorithms including YOLO, SSD and Faster R-CNN perform relatively poorly on high resolution images. During our experiments, we found that if a high-resolution image is cropped into small regions that contain objects, some deep learning based networks can perform well on these small regions.
Following this methodology, a novel cyclist detection structure is proposed, which contains three main parts: (1) the ACF based region proposal (ACF-PR) method, (2) a YOLO based cyclist detection method, and (3) a post-processing step for fine localization. The overall architecture of the proposed cyclist detection structure is shown in Figure 1.

Figure 1. Overview of the proposed cyclist detection method (input images, ACF-PR, generation of potential regions, YOLOv3, post-processing, detection results). Color images are first processed by the aggregated channel feature region proposal (ACF-PR) method. The proposed ACF-PR method utilizes ACF to get region candidates, and then it performs the analysis of these candidates to generate potential regions. Then the YOLO network utilizes these potential regions as inputs to do fine detection and localization. At last, to get the final result, a post-processing step is performed to merge and map the bounding boxes.
Firstly, the ACF-PR region proposal extraction method is designed. An ACF detector is trained to detect coarse ROIs containing cyclists. Because the regions generated by ACF usually contain only parts of cyclists, we design a box merging and extending method to merge the bounding boxes into correct region proposals for the following YOLO net. Then, a suitable YOLOv3 net is used for fine detection of cyclists in the region proposals generated by ACF-PR. Lastly, we design a post-processing step in which the results of the YOLO net are selected and mapped into the original image to give the detection and localization results.

ACF-PR Region Proposal Generating Method
The original ACF method has achieved good performance in some detection problems. In this paper, we explore a novel use of ACF for region proposal extraction. If ACF is applied directly to cyclist detection, the detected regions may contain only parts of cyclists and many false positives. To resolve this problem, we propose the ACF-PR region proposal generating method. In this method, instead of directly using the generated region proposals for classification as most region proposal methods do, we generate large potential regions containing objects for the following deep network. In this way, regions containing only parts of cyclists can be avoided. The structure of ACF-PR is shown in Figure 2.
Given an input image, ACF computes several channels, sums every block of pixels, smooths the resulting lower resolution channels and uses boosting to distinguish objects. ACF builds a fast feature pyramid $P = \{p_1, p_2, \ldots, p_n\}$, where $n$ represents the number of layers. The channels used are the same as in [40]: normalized gradient magnitude (1 channel), histogram of oriented gradients (6 channels), and LUV color channels (3 channels), for a total of 10 channels.
Boosting is an ensemble learning algorithm that linearly combines weak classifiers into a strong classifier,

$$F(x) = \mathrm{sign}\left(f(x)\right), \qquad f(x) = \sum_{m=1}^{M} a_m G_m(x),$$

where $G_m$ represents a weak classifier, $a_m$ is the weight of $G_m$ in the strong classifier, $M$ is the number of weak classifiers and $\mathrm{sign}(\cdot)$ is the sign function. During training, the classifier produced in each iteration is built on the classifier of the previous iteration,

$$F_m(x) = F_{m-1}(x) + a_m G_m(x),$$

where $F_m(x)$ represents the (unthresholded) classifier produced in the $m$th iteration, so that $f(x) = F_M(x)$. The loss function is the exponential loss used by AdaBoost,

$$L(Y, f(x)) = \exp\left(-Y f(x)\right),$$

where $Y$ represents the label of $x$ and $f(x)$ represents the result we generate. We determine $a_m$ by minimizing the loss function $L(Y, f(x))$.
In this paper, AdaBoost is used to train and combine 4096 depth-2 decision trees over the $h/4 \cdot w/4 \cdot 10$ aggregated features, where $h \times w$ is the input window and 4 is the down-sampling scale. These parameters gave the best performance in our experiments. In the detection process, a multi-scale sliding window is used to scan the image and generate aggregated channel features, and these features are sent into AdaBoost.
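The boosting scheme described above can be illustrated with a minimal sketch of discrete AdaBoost. This is not the detector used in the paper: for brevity it assumes one-dimensional inputs and single-threshold decision stumps instead of depth-2 trees over ACF features, and the function names (`stump`, `train_adaboost`, `predict`) are hypothetical. The weight update uses the standard closed form $a_m = \frac{1}{2}\ln\frac{1 - e_m}{e_m}$ that results from minimizing the exponential loss, where $e_m$ is the weighted error of the weak classifier.

```python
import math

def stump(x, threshold, polarity):
    """Weak classifier G_m: returns +polarity if x >= threshold, else -polarity."""
    return polarity if x >= threshold else -polarity

def train_adaboost(xs, ys, rounds):
    """Train discrete AdaBoost on 1-D samples xs with labels ys in {-1, +1}."""
    n = len(xs)
    weights = [1.0 / n] * n
    model = []  # list of (a_m, threshold, polarity)
    for _ in range(rounds):
        # Pick the stump with the lowest weighted error.
        best = None
        for threshold in sorted(set(xs)):
            for polarity in (1, -1):
                err = sum(w for x, y, w in zip(xs, ys, weights)
                          if stump(x, threshold, polarity) != y)
                if best is None or err < best[0]:
                    best = (err, threshold, polarity)
        err, threshold, polarity = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        a_m = 0.5 * math.log((1 - err) / err)  # minimizer of the exponential loss
        model.append((a_m, threshold, polarity))
        # Re-weight samples: misclassified samples gain weight.
        weights = [w * math.exp(-a_m * y * stump(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model

def predict(model, x):
    """Strong classifier F(x) = sign(sum_m a_m * G_m(x))."""
    f = sum(a_m * stump(x, threshold, polarity) for a_m, threshold, polarity in model)
    return 1 if f >= 0 else -1
```

Each round corresponds to one step of the additive update $F_m = F_{m-1} + a_m G_m$; replacing the stump with a depth-2 tree over the aggregated features would bring the sketch closer to the configuration reported here.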
To adapt to the cyclist sizes in this study, modelDs (model height and width without padding) is set to (50, 32) and modelDsPad (model height and width with padding) is set to (64, 48). The nNeg parameter (max number of negative windows to sample) is set to 10,000 and nAccNeg (max number of negative windows to accumulate) is set to 30,000.
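As a worked example of the feature dimension $h/4 \cdot w/4 \cdot 10$, the short sketch below counts the aggregated features per detection window. It assumes that the input window $h \times w$ equals the padded model size modelDsPad = (64, 48); this is an assumption made for illustration, and the function name `acf_feature_count` is hypothetical.

```python
def acf_feature_count(window_h, window_w, shrink=4, n_channels=10):
    """Number of aggregated channel features in one detection window.

    shrink is the down-sampling scale; n_channels is 1 gradient magnitude
    + 6 gradient orientation + 3 LUV color channels.
    """
    return (window_h // shrink) * (window_w // shrink) * n_channels

# Assumption: the window is the padded model size (64, 48).
features_per_window = acf_feature_count(64, 48)
```

Under this assumption each window yields 16 × 12 × 10 = 1920 candidate features from which the boosted trees select.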
With this training process, we train an ACF detector to perform preliminary detection. Instead of ordering the ACF bounding boxes by confidence as traditional ACF does, the bounding boxes in this method are reordered from left to right and from top to bottom in the image.
In detection, one cyclist detected by ACF may have several different bounding boxes, which causes many false positives. During experiments, we found that the bounding boxes belonging to one cyclist lie close to each other. We therefore design a merging method for bounding boxes belonging to the same object. In this process, all pairs of bounding boxes are divided into two cases according to the distance between them. In one case, two bounding boxes partially overlap or the distance between them is short. In the other case, two bounding boxes are far away from each other.

• In one case, each detected cyclist instance is marked with several different bounding boxes. In order to merge the bounding boxes into a correct one and obtain the entire cyclist instance, two small boxes are merged into one when the distance between them is within a certain range. To show the merging process intuitively, an example of merging is provided in Figure 3. We use $x_{min}$ and $y_{min}$ to represent the minimum values of the x and y coordinates of the two boxes:

$$x_{min} = \min(x_{b1}, x_{b2}), \qquad y_{min} = \min(y_{b1}, y_{b2}),$$

where $(x_{b1}, y_{b1})$ and $(x_{b2}, y_{b2})$ are the coordinates of the top-left points of the two bounding boxes.
We use $x_t$ and $y_t$ to represent the maximum distance thresholds, along the x and y coordinates, within which two bounding boxes can be merged. The values of $x_t$ and $y_t$ are fixed so that the potential regions contain whole objects and avoid the situation where only half of an object is contained; hence, both $x_t$ and $y_t$ are set to 832, which is the maximum size of the cyclist instances. $(w_{b1}, h_{b1})$ and $(w_{b2}, h_{b2})$ represent the widths and heights of the two bounding boxes. If $x_{min} + x_t$ and $y_{min} + y_t$ satisfy the merging condition, the two bounding boxes are merged into one large bounding box. We compare $x_{b1} + w_{b1}$ and $x_{b2} + w_{b2}$ to get the maximum x coordinate of the two bounding boxes, and use this maximum value to calculate the width of the merged bounding box. Similarly, $y_{b1} + h_{b1}$ and $y_{b2} + h_{b2}$ are compared and used to calculate the height of the merged bounding box:

$$x_b = x_{min}, \quad y_b = y_{min}, \quad w_b = \max(x_{b1} + w_{b1},\ x_{b2} + w_{b2}) - x_b, \quad h_b = \max(y_{b1} + h_{b1},\ y_{b2} + h_{b2}) - y_b,$$

where $(x_b, y_b)$ represents the top-left point of the merged bounding box and $(w_b, h_b)$ represents its width and height. After merging, the large bounding box may contain one object or several objects. These merged bounding boxes are extended as potential regions and used as inputs for the following network.
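A minimal sketch of the merging step is given below. Boxes are `(x, y, w, h)` tuples with the top-left corner as origin. Two points are assumptions made for illustration rather than details confirmed by the text: the merge test (`can_merge`) requires that the union of the two boxes fits inside an $x_t \times y_t$ window anchored at $(x_{min}, y_{min})$, and boxes are merged greedily after sorting from left to right and top to bottom. All function names are hypothetical.

```python
X_T = Y_T = 832  # distance thresholds, the maximum size of the cyclist instances

def can_merge(b1, b2, x_t=X_T, y_t=Y_T):
    """Assumed merge test: the union of both boxes fits in an x_t-by-y_t window."""
    x_min, y_min = min(b1[0], b2[0]), min(b1[1], b2[1])
    x_max = max(b1[0] + b1[2], b2[0] + b2[2])
    y_max = max(b1[1] + b1[3], b2[1] + b2[3])
    return x_max <= x_min + x_t and y_max <= y_min + y_t

def merge(b1, b2):
    """Smallest box enclosing b1 and b2: top-left at (x_min, y_min)."""
    x_b, y_b = min(b1[0], b2[0]), min(b1[1], b2[1])
    w_b = max(b1[0] + b1[2], b2[0] + b2[2]) - x_b
    h_b = max(b1[1] + b1[3], b2[1] + b2[3]) - y_b
    return (x_b, y_b, w_b, h_b)

def merge_all(boxes):
    """Greedy merging over boxes sorted left-to-right, then top-to-bottom."""
    merged = []
    for box in sorted(boxes, key=lambda b: (b[0], b[1])):
        for i, group in enumerate(merged):
            if can_merge(group, box):
                merged[i] = merge(group, box)
                break
        else:
            merged.append(box)  # far from every existing group: keep separate
    return merged
```

For example, `merge_all([(100, 100, 40, 80), (120, 110, 40, 80), (1500, 200, 50, 90)])` merges the first two boxes into `(100, 100, 60, 90)` and keeps the distant third box unchanged.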

• In the other case, two bounding boxes are far apart from each other, which means that these boxes belong to different instances and do not need to be merged. In this case, a bounding box may contain the entire object instance, only part of the object instance, or just background. For fine detection and localization, these bounding boxes also need to be sent into the following deep network for further detection. If the distance between two bounding boxes is not within a certain range, these boxes are regarded as two separate objects. In order to contain as many entire object instances as possible, these bounding boxes are extended as potential regions and used as inputs for the following network. Bounding boxes are all extended to $m \times m$ pixels to serve as potential regions, which ensures that the potential regions contain the whole objects. In this study, $m$ is set to 832, which is the maximum size of the cyclist instances. We crop the potential regions according to their coordinates. The relationship between the potential region and the bounding box is expressed in terms of the following quantities: $(x_{po}, y_{po})$ are the (x, y) coordinates of the top-left point of the potential region in the original image; $(x_b, y_b)$ are the (x, y) coordinates of the top-left point of the bounding box before extending; and $w_b$ and $h_b$ are the width and height of the bounding box. At last, these cropped potential regions are sent into the subsequent network.
To illustrate the advantage of ACF-PR, we compare the structures of Fast R-CNN [26], Faster R-CNN [27], the ACF detection method and our method in Figure 4. Fast R-CNN uses the selective search (SS) method to generate bounding boxes. Experiments in [22]
Region proposal methods such as the region proposal network (RPN) in Faster R-CNN generate and select bounding boxes directly, and these bounding boxes are used for regression and classification.
With high resolution input images (2048 × 1024) and a large range of object sizes from 20 pixels to 800 pixels, the feature map of convolutional network may lose some details and make it difficult to detect small size objects. Hence, RPN is not suitable for detecting small-size objects in high resolution images.
Different from RPN, our method only generates potential regions in which cyclists may appear; these potential regions are then sent into the following network for detection. We extract features from these potential regions rather than extracting features to generate potential regions. The main function of ACF-PR is to narrow the detection range. We do not use ACF directly as a region proposal method, because the ACF detection method shown in Figure 4 often generates bounding boxes that contain less than half of the instance. The experiments show that only approximately 69% of cyclists are contained in the detection results of ACF. If these bounding boxes were sent into the detection network, the detection rate would not be higher than 69%. In contrast, the potential regions from our proposed ACF-PR method contain 100% of the cyclists, which ensures a relatively good detection result. In addition, it takes only about 0.18 s per image (2048 × 1024) on a central processing unit (CPU) to generate potential regions, which keeps the detection relatively fast.
Hence, our proposed ACF-PR method is more suitable for cyclist detection than the other region proposal methods, when dealing with images with high resolution.
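The extension of a bounding box to an $m \times m$ potential region, described earlier in this section, can be sketched as follows. The text does not fix how the region is positioned around the box, so this sketch assumes that the region is centered on the bounding box and then shifted to stay inside a 2048 × 1024 image; both the centering and the clamping are assumptions, and `extend_to_region` is a hypothetical name.

```python
def extend_to_region(box, m=832, img_w=2048, img_h=1024):
    """Return (x_po, y_po, m, m): an m-by-m potential region around box.

    box is (x_b, y_b, w_b, h_b). Assumption: the region is centered on the
    box and clamped so that it lies entirely inside the image.
    """
    x_b, y_b, w_b, h_b = box
    x_po = x_b + w_b // 2 - m // 2
    y_po = y_b + h_b // 2 - m // 2
    x_po = min(max(x_po, 0), img_w - m)
    y_po = min(max(y_po, 0), img_h - m)
    return (x_po, y_po, m, m)
```

With these assumptions, a box near the image border such as `(100, 100, 60, 90)` maps to the region `(0, 0, 832, 832)`, while `(1000, 500, 40, 80)` maps to `(604, 124, 832, 832)`.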

YOLO Network for Cyclist Detection
YOLO has proved to have the ability to handle complex tasks, such as pedestrian detection [36], license plate detection [37], vehicle detection [38], and traffic sign detection [39], etc. YOLO is a single deep network which can get predicted bounding boxes and class probabilities at the same time, achieving high accuracy and extremely fast speed.
In this study, we designed a suitable detector based on the YOLOv3 net for fine detection and localization. YOLOv3 has 106 layers, including successive 3 × 3 and 1 × 1 convolutional layers, shortcut connections, up-sample layers, route layers and detection layers. Figures 5 and 6 show that almost all of the cyclists in this study are smaller than 832 pixels, so the input size of YOLOv3 is set to 832 × 832. The structure is shown in Table 1. The shortcut connections are constructed similarly to those in ResNet [42]. The route layers combine two feature maps or fetch the feature map of a previous layer. The up-sample layer up-samples the feature map with a stride of 2 via bilinear interpolation. In addition, batch normalization layers [43] are used to improve convergence. We do not list these layers in Table 1, because each convolutional layer is followed by a batch normalization layer.
Figure 6. The distribution of cyclist instances in test set. The x-axis and y-axis represent the width and height of the ground-truth in pixels. One blue point represents one cyclist instance.
Unlike the previous version, YOLOv3 predicts boxes at three different scales. As shown in Table 1, the three detection layers are designed to perform prediction at three different scales. A concept similar to feature pyramid networks [44] is used to extract features from these three scales. This means that YOLOv3 divides the input image into grids of three different sizes: $S_1 \times S_1$, $S_2 \times S_2$ and $S_3 \times S_3$. If the center of an object falls in the receptive field of a grid cell, this grid cell is responsible for detecting the object. Each grid cell predicts three bounding boxes. Thus, the number of YOLOv3 anchors is 9 and the number of bounding boxes it can produce is $(S_1 \times S_1 + S_2 \times S_2 + S_3 \times S_3) \times 3$. These bounding boxes are analyzed and selected to get the final detection results.
Compared with the previous version, YOLOv3 achieves much better detection performance while remaining fast.
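To make the box count $(S_1 \times S_1 + S_2 \times S_2 + S_3 \times S_3) \times 3$ concrete, the sketch below evaluates it for the 832 × 832 input used here. It assumes the standard YOLOv3 detection strides of 32, 16 and 8, which are not stated explicitly in this paper; the function name is hypothetical.

```python
def yolov3_box_count(input_size=832, strides=(32, 16, 8), boxes_per_cell=3):
    """Grid sizes S_i and total number of predicted boxes for a square input."""
    grids = [input_size // stride for stride in strides]
    total = sum(s * s for s in grids) * boxes_per_cell
    return grids, total

grids, total_boxes = yolov3_box_count()
```

Under the stride assumption, the grids are 26 × 26, 52 × 52 and 104 × 104, giving (676 + 2704 + 10,816) × 3 = 42,588 candidate boxes per potential region before selection.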
In order to get the anchors that YOLOv3 needs, K-means clustering is utilized to determine bounding box priors. We set our anchors on the clustering result of K-means. In this network, the number of anchors is set to 9, which is the same as [31]. At each scale, each cell uses three anchors to predict three bounding boxes.
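A minimal sketch of anchor selection by K-means is given below. It clusters ground-truth (width, height) pairs and assumes the IoU-based similarity commonly used for YOLO anchor priors (two boxes are compared as if they shared the same top-left corner); the paper only states that K-means is used, so the similarity measure, the initialization and the function names are assumptions.

```python
import random

def wh_iou(box, centroid):
    """IoU of two (w, h) boxes aligned at the same top-left corner."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    """Cluster (w, h) pairs into k anchors, returned sorted by area."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)  # initialize from k distinct samples
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Assign each box to the centroid with the highest IoU.
            best = max(range(k), key=lambda j: wh_iou(box, centroids[j]))
            clusters[best].append(box)
        updated = []
        for j, members in enumerate(clusters):
            if members:
                n = len(members)
                updated.append((sum(w for w, _ in members) / n,
                                sum(h for _, h in members) / n))
            else:
                updated.append(centroids[j])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids, key=lambda c: c[0] * c[1])
```

With `k=9`, the three smallest anchors would typically be assigned to the finest detection scale and the three largest to the coarsest one, following the usual YOLOv3 convention.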
The distributions of cyclist instances in the training set and test set are shown in Figures 5 and 6 respectively. The x and y coordinates indicate the width and height of the ground-truth. Each blue point indicates one instance. Comparing the data in Figures 5 and 6, the distributions of these two sets are similar. We use the K-means method to get nine clusters and set the anchors according to the results of K-means. The result is shown in Figure 5.
The inputs of the YOLOv3 network are the outputs of ACF-PR. Therefore, the size of the potential regions $m$ is set to 832, which is equal to the input size of YOLOv3. The outputs of YOLOv3 are the inputs of the post-processing step.

Post-Processing
The detection results of YOLOv3 are based on potential regions and need to be mapped into the original image. Because some potential regions may partially overlap when they are generated, several bounding boxes may be generated for the same object. To solve this problem, a post-processing step including box mapping and non-maximum suppression is designed.
The coordinates of the bounding boxes from YOLOv3 are relative to the potential regions, and the potential regions are cropped from the original images. To get the final detection results, the bounding boxes should be mapped from the potential regions into the original images. The relationship between the coordinates of bounding boxes in potential regions and the final coordinates in the original images is

$$x_b = x_{bp} + x_{po}, \qquad y_b = y_{bp} + y_{po},$$

where $(x, y)$ indicate the x and y coordinates; the subscript $b$ is for the bounding boxes in the original image; the subscript $bp$ is for the bounding boxes in the potential regions; and the subscript $po$ is for the potential regions in the original image. The width, height and class of the bounding boxes do not change when they are mapped into the original image.
After mapping, overlapping detections may appear: one cyclist instance may have a number of bounding boxes associated with it. Non-maximum suppression (NMS) is used to eliminate repeated detections. Let $b_1$ and $b_2$ represent two bounding boxes and $t$ represent the threshold. $\mathrm{IoU}(b_1, b_2)$ is defined as

$$\mathrm{IoU}(b_1, b_2) = \frac{|b_1 \cap b_2|}{|b_1 \cup b_2|},$$

where $b_1 \cap b_2$ means the area of overlap and $b_1 \cup b_2$ means the area of union. If $\mathrm{IoU}(b_1, b_2) > t$, the NMS step retains the bounding box with the highest score as the detection result. After this process, the detection results of YOLOv3 are mapped into the original image.
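The two post-processing operations, coordinate mapping followed by NMS, can be sketched as below. Detections are `(x, y, w, h, score)` tuples with a top-left origin. The threshold value `t=0.5` is a placeholder chosen for illustration, since the value used for NMS is not stated in the text, and the function names are hypothetical.

```python
def map_to_image(det, region_origin):
    """Shift a detection from potential-region coordinates to image coordinates."""
    x_bp, y_bp, w, h, score = det
    x_po, y_po = region_origin
    return (x_bp + x_po, y_bp + y_po, w, h, score)  # size and score unchanged

def iou(b1, b2):
    """Intersection over union of two (x, y, w, h, ...) boxes."""
    iw = max(0, min(b1[0] + b1[2], b2[0] + b2[2]) - max(b1[0], b2[0]))
    ih = max(0, min(b1[1] + b1[3], b2[1] + b2[3]) - max(b1[1], b2[1]))
    inter = iw * ih
    union = b1[2] * b1[3] + b2[2] * b2[3] - inter
    return inter / union if union > 0 else 0.0

def nms(dets, t=0.5):
    """Keep the highest-scoring box among boxes whose IoU exceeds t."""
    kept = []
    for det in sorted(dets, key=lambda d: d[4], reverse=True):
        if all(iou(det, k) <= t for k in kept):
            kept.append(det)
    return kept
```

A typical use is to map the detections of every potential region with `map_to_image`, pool them into one list, and call `nms` once per image, so that duplicates produced by overlapping regions are removed.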

Dataset and Evaluation Protocol
Compared with pedestrian detection, cyclist detection has received far less research attention, and challenging public cyclist datasets are rare. Before 2016, only the KITTI object detection benchmark had cyclist instances; however, it has fewer than 2000 cyclist instances, which is insufficient for cyclist training and testing. In 2016, the public Tsinghua-Daimler Cyclist Benchmark (TDCB) [41] was proposed, which contains more than 10,000 annotated cyclists.
As the only public cyclist detection dataset, TDCB still has some problems. Firstly, some cyclists in this dataset are invisible even to human eyes because of high similarity with the background, small size or occlusions. Secondly, some small cyclists cannot be distinguished from small motorcyclists. Thirdly, the samples in its training set were captured under similar weather and light conditions, and cannot cover the very different weather and light conditions in the test set. These three problems may result in poor generalization ability. Hence, in our experiments, we rebuilt the training set, the validation set and the test set.
We first merge the original test and validation sets into a merged set. From the merged set, 2758 images are randomly selected to form the new test set. The rebuilt training set has 10,000 images, including 380 images randomly selected from the merged set and 9620 images from the original training set; the rest of the images in the merged set form the new validation set. In this way, the weather and light conditions in the three rebuilt sets do not differ much. After rebuilding, the new training set, test set and validation set account for approximately 70%, 20% and 10% of the images respectively.
The method used in the PASCAL object challenges [32] is utilized here to show the relationship between precision and recall. We use $P$ to represent precision and $R$ to represent recall. The precision and recall are calculated as

$$P = \frac{TP}{TP + FP}$$

and

$$R = \frac{TP}{TP + FN},$$

where $TP$ indicates the number of true positives, $FP$ indicates the number of false positives and $FN$ indicates the number of false negatives. The average precision (AP) is used here to represent the performance of a detector. AP is defined as

$$AP = \int_0^1 P(R)\, dR,$$

where $R$ represents recall and $P$ represents precision, both of which are between 0 and 1. $P(R)$ represents the curve composed of $P$ and $R$; Figure 7 is an example. A larger value of AP means better performance.
The PASCAL measure is used to assign the detection results to ground-truth objects, which means that the IoU overlap must exceed the threshold of 0.5. IoU is used here to measure the accuracy of detecting a corresponding object. IoU is defined as

$$\mathrm{IoU} = \frac{\mathrm{area}(DR \cap GT)}{\mathrm{area}(DR \cup GT)},$$

where $DR$ represents the detection region and $GT$ represents the ground-truth region. $DR \cap GT$ means the area of overlap and $DR \cup GT$ means the area of union. The threshold of IoU is set to 0.5, which means that if IoU is larger than 0.5, the object is considered successfully detected.
Our proposed region proposal method ACF-PR was built on the latest version of Dollár's Computer Vision MATLAB Toolbox [40]. When training the detector, the cyclist instances were extracted from the training set using bounding boxes that are higher than 60 pixels and fully visible, and the negative samples were taken from the non-VRU set. YOLOv3 is open source. We trained our network starting from the model pre-trained on ImageNet [35]. The batch size is 64. The value of max-batches is set to 206,000; we used a learning rate of 0.0001 for the first 95,000 batches and 0.00001 for the next 111,000 batches.
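The evaluation metrics defined in this section can be sketched in a few lines. The integral $AP = \int_0^1 P(R)\,dR$ is approximated here by a plain rectangle sum over the recall increments of the ranked detections, without the precision interpolation that some PASCAL protocol versions apply; this simplification, and the function names, are assumptions made for illustration.

```python
def precision_recall(tp, fp, fn):
    """P = TP / (TP + FP) and R = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def average_precision(scored_hits, num_gt):
    """Approximate the area under the P(R) curve.

    scored_hits: list of (score, is_true_positive), where a detection is a
    true positive if its IoU with an unmatched ground truth exceeds 0.5.
    num_gt: number of ground-truth objects (TP + FN).
    """
    tp = fp = 0
    ap = 0.0
    prev_recall = 0.0
    for _, hit in sorted(scored_hits, key=lambda s: s[0], reverse=True):
        if hit:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / num_gt
        ap += precision * (recall - prev_recall)  # rectangle under P(R)
        prev_recall = recall
    return ap
```

Matching detections to ground truth at the IoU threshold of 0.5 is assumed to have been done beforehand, so each detection arrives already labeled as a hit or a miss.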

Evaluation of the ACF-PR Method
In this experiment, we compare ACF-PR with traditional ACF to show the efficiency and improvement of the proposed ACF-PR method.
ACF-PR is designed to generate potential regions in high resolution images. The potential regions are expected to contain all cyclist instances: the more cyclist instances the potential regions contain, the better the detection results can be. In this experiment, a cyclist instance is considered to be contained if more than 50% of its area lies in the extracted potential regions. Table 2 shows that the potential regions generated by ACF-PR contain 100% of the cyclist instances, while the regions from ACF contain only 69.22% of them. The full coverage of ACF-PR ensures relatively good performance in the following detection process, whereas the 69.22% obtained from ACF means that the subsequent detection rate could at most equal 69.22%. In this study, potential regions of size 832 × 832 are sent to the YOLOv3 network for localization and classification; in this case, the average ratio of the area of all potential regions to the area of the original images is 78%. The main reason why ACF-PR has such a large advantage is its box merging and extending mechanism, which greatly reduces the number of cyclist instances that are half detected or missed.
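The containment criterion used in this experiment can be sketched as follows: a ground-truth cyclist counts as contained when more than 50% of its area lies inside a potential region. For simplicity the sketch tests each region separately rather than the union of all regions, which is an assumption; the function names are hypothetical and boxes are `(x, y, w, h)` tuples.

```python
def covered_fraction(gt, region):
    """Fraction of the ground-truth box area that lies inside the region."""
    gx, gy, gw, gh = gt
    rx, ry, rw, rh = region
    iw = max(0, min(gx + gw, rx + rw) - max(gx, rx))
    ih = max(0, min(gy + gh, ry + rh) - max(gy, ry))
    return (iw * ih) / (gw * gh)

def containment_rate(gts, regions, min_fraction=0.5):
    """Share of ground-truth boxes covered by more than min_fraction."""
    contained = sum(
        1 for gt in gts
        if any(covered_fraction(gt, region) > min_fraction for region in regions)
    )
    return contained / len(gts)
```

Applied to all ground-truth boxes of the test set, `containment_rate` would yield the kind of percentage reported in Table 2.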

Comparisons with Other Detection Methods
To evaluate the effectiveness of the proposed method, we compare the performance of ACF-PR-YOLO with other methods including YOLOv3 [31], SSD [28], LDCF [5] and ACF [40]. ACF-PR-YOLO denotes the proposed method, which uses ACF-PR for region proposal and YOLO for detection. YOLOv3 and SSD are two representative one-stage deep learning based detection methods. For YOLOv3, the class number is 1 and the other parameters are the same as in [31]. The backbone network of SSD was VGG16 and its input size was 300 × 300. LDCF and ACF are popular traditional detection methods. Because of limited computer memory, we use 3798 images when training the ACF and LDCF detectors. Although Li et al. have used the TDCB dataset for cyclist detection, we did not compare with them because an additional large training set, which is not publicly available, was used for training in [22].
The curves in Figure 7 show the overall detection performance of all detectors tested on the TDCB dataset. ACF-PR-YOLO outperforms YOLOv3 by 13.69% AP and SSD by 25.27% AP. These results indicate that the representative YOLOv3 and SSD nets perform poorly on the cyclist detection problem, mainly because they have relatively poor performance in detecting small-size objects in high-resolution images. The comparison between ACF-PR-YOLO and YOLOv3 also shows that the gain in AP comes from using the proposed ACF-PR to generate proposal regions. Figure 7 also illustrates that the proposed ACF-PR-YOLO method outperforms LDCF and ACF by 36.45% and 41.68% AP respectively. In our experiments, we found that the LDCF and ACF methods usually extract just part of the objects or extract inaccurate object regions, which is the main reason for their poor performance. This is also why we use ACF to extract region proposals instead of directly using ACF for detection.
From the data in Figure 7, it can be concluded that the proposed ACF-PR-YOLO outperforms the other methods in the comparison. There are three main reasons for this. Firstly, ACF-PR generates potential regions that contain 100% of the cyclists in the region proposal extraction process; this process segments a high-resolution image into small potential regions, which effectively avoids the poor performance of YOLOv3 on high-resolution images. Secondly, on the segmented small-size images, YOLO can perform fine detection and achieve high average precision. Thirdly, the post-processing step selects the most suitable bounding boxes and maps them into the original image.
Table 3 lists the detailed comparison with other popular detection methods including YOLOv3 [31], SSD [28], LDCF [5] and ACF [40], in terms of AP, code type and time consumed per frame. In Table 3, the popular methods achieve relatively low APs ranging from 41.01% to 69.00% in cyclist detection, while the proposed ACF-PR-YOLO detects cyclists with a high AP of 82.69% and an average time of 0.35 s per image (2048 × 1024), running on a 3.20-GHz i5 CPU and a TITAN X GPU. The time consumption of the three parts is listed in Table 4. ACF-PR is written in MATLAB and runs on the 3.20-GHz i5 CPU; it costs about 0.18 s per image. YOLO is written in C and runs on the TITAN X GPU; it costs about 0.164 s. The post-processing step is written in MATLAB and runs on the CPU; it costs 0.003 s.
Some results of our detector on the TDCB dataset in different scenarios are shown in Figure 8. Figure 8 illustrates that our detector performs well in different scenarios. Our method can not only detect cyclists in complex backgrounds and dense crowds, but can also separate cyclists from pedestrians.
Figure 8. Some results of performing our detector on "Tsinghua-Daimler Cyclist Benchmark" with different scenarios. The images from (a) to (h) are the detection results in complex environments. In order to have a better visual effect, the contrast and brightness of the images above are enhanced for display. Blue bounding boxes represent detected cyclists.
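As a quick consistency check on the figures quoted in this section, the sketch below derives the baseline APs from the proposed method's AP and the reported margins, and sums the per-stage timings. All numbers are taken from the text; the variable names are illustrative only.

```python
PROPOSED_AP = 82.69  # AP of ACF-PR-YOLO in percent

# Reported AP margins of ACF-PR-YOLO over each baseline, in percentage points.
MARGINS = {"YOLOv3": 13.69, "SSD": 25.27, "LDCF": 36.45, "ACF": 41.68}

# Baseline AP implied by each margin.
baseline_ap = {name: round(PROPOSED_AP - margin, 2) for name, margin in MARGINS.items()}

# Per-stage time per 2048 x 1024 image, in seconds.
STAGE_SECONDS = {"ACF-PR": 0.18, "YOLOv3": 0.164, "post-processing": 0.003}
total_seconds = round(sum(STAGE_SECONDS.values()), 3)
```

The derived extremes, 69.00% for YOLOv3 and 41.01% for ACF, agree with the AP range stated for the compared methods, and the stage times add up to 0.347 s, consistent with the reported total of about 0.35 s.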

Conclusions
Some representative fast deep networks including YOLO and SSD usually cannot achieve high precision when dealing with small-size objects and high resolution images. To overcome this problem, a framework for cyclist detection in large high-resolution images is presented in this paper. The framework contains an ACF-PR region proposal method, a YOLOv3 net for cyclist detection and a post-processing step.
In order to extract potential regions from high resolution images, the region proposal method ACF-PR is proposed. In ACF-PR, an ACF detector is first used to quickly extract candidates; then a bounding box merging and extending method merges the bounding boxes into correct region proposals for the following YOLO net. A suitable YOLOv3 net is then designed to perform detection in the potential regions generated by ACF-PR. The YOLOv3 net performs better on small-size potential regions than on high-resolution original images. Lastly, a post-processing step is performed to select the most suitable bounding box and map it into the original high-resolution image. We evaluate our method on the public TDCB dataset and compare it with other representative methods. The experiments demonstrate that it outperforms the representative methods in our comparison: it outperforms YOLOv3 by 13.69% average precision and SSD by 25.27% average precision.
Although our algorithm is designed for cyclist detection, it has great potential for detecting other objects. In the future, in order to improve detection performance, we plan to develop an efficient detection algorithm that can adapt to more complex scenarios. Instead of designing a single-frame detector, we plan to perform detection on video and to study the feature relationships between consecutive video frames.
Author Contributions: C.L. and Y.G. designed the methods and wrote the paper. S.L. and F.C. conceived and designed the experiments.

Conflicts of Interest:
The authors declare no conflict of interest. The founding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, and in the decision to publish the results.