Multi-Object Tracking with Correlation Filter for Autonomous Vehicle

Multi-object tracking is a crucial problem for autonomous vehicle. Most state-of-the-art approaches adopt the tracking-by-detection strategy, which is a two-step procedure consisting of the detection module and the tracking module. In this paper, we improve both steps. We improve the detection module by incorporating the temporal information, which is beneficial for detecting small objects. For the tracking module, we propose a novel compressed deep Convolutional Neural Network (CNN) feature based Correlation Filter tracker. By carefully integrating these two modules, the proposed multi-object tracking approach has the ability of re-identification (ReID) once the tracked object gets lost. Extensive experiments were performed on the KITTI and MOT2015 tracking benchmarks. Results indicate that our approach outperforms most state-of-the-art tracking approaches.


Introduction
Multi-object tracking is one of the most fundamental capabilities for the autonomous vehicle. The autonomous vehicle mostly works in the dynamic environment. Only through precisely tracking other dynamic objects' movements, the autonomous vehicle could plan its own trajectory and run smoothly.
Existing multi-object tracking approaches mostly adopt the tracking-by-detection strategy. This is a two-step procedure. In the first step, the potential objects-of-interest are detected using object detection algorithm. These potential objects are then linked across different frames to form the so-called tracklets in the second step.
Recently, one has witnessed great improvements in object detection field. One of the most influential works is the region proposal based Convolutional Neural Network (CNN) [1][2][3][4]. This approach firstly generates object proposals using Region Proposal Network (RPN), and then features are extracted for each proposal. To detect multi-scale objects, the proposals are usually generated from the deep layers, such as the conv-5 layer [3]. However, the spatial resolution of the conv-5 layer is only 1/16 of the original image, making the algorithm difficult to generate small object proposals [5]. To tackle this problem, Liu et al. [6] proposed a Single-Shot multibox Detector (SSD) which gets rid of the region proposal branch. This approach directly generates object proposals on multiple layers. Small object proposals can be generated on shallow layers, and large objects could be generated on deep layers. Nonetheless, as shallow layers only contain low-level features, the small object proposals usually contain many false detections.
In the context of multi-object tracking, the problem of detecting small objects might be alleviated by utilizing temporal information. In this paper, our first contribution is to improve the single-frame object detection algorithm by incorporating temporal information. More specifically, we propose a multi-scale object detector that augments the SSD with temporal Region of Interests (ROIs), as shown Once the objects are detected, they are then linked across different frames using data association algorithm [7][8][9]. No matter what the data association algorithm is, it always relies on an affinity model which describes the similarity of detections. Many recent works try to build robust affinity models.
In [10], the authors proposed a new descriptor named as Aggregated Local Flow Descriptor (ALFD). This descriptor is then used with the motion relation model to perform as the affinity model. The motion relation model describes motion relations between bounding boxes. However, this model only works for a static camera. In [11], the authors used Integral Channel Features (ICF) to incrementally train a Multiple-Instance Learning (MIL) model to act as the affinity model. In [12], the authors proposed using the Support Vector Machine (SVM) to train an affinity model. The SVM performs on a series of low-level features, including the forward-backward optical flow error, the bounding box overlap, the height ratio and the normalized correlation coefficients between the two detections. As these low-level features have less semantic meaning, the SVM is prone to overfitting and needs to be trained for different environments. Recently, deep learning based approaches [13][14][15] have also been used to learn the affinity model. Generally, these models are trained with an end-to-end learning framework and are computationally intensive.
In this paper, we propose a novel compressed deep Convolutional Neural Network (CNN) feature based Correlation Filter (CF) tracker. The CNN feature has abundant geometric and semantic information and is applied in both the object association stage and the lost target re-identification stage. Compared with the point matching based tracking approaches, correlation filter based algorithms have achieved superior performance on several tracking benchmarks [16][17][18]. By combining the CNN feature with the CF tracker, our approach enjoys the merits of both approaches: on the one hand, the tracking module runs faster than 10 Hz due to the efficiency of the CF; and, on the other hand, the compressed CNN feature used in the tracker contains target-specific semantic information. Besides, the feature is directly inherited from the object detection stage, without the need to be re-calculated.
Our main contributions are summarized as follows. Firstly, a multi-scale detector was proposed which can efficiently detect small objects and produces fewer false negatives. Secondly, a novel compressed deep CNN feature based CF tracker, which exploits target-specific semantic information inherited from the detector and has low computational complexity, was. Thirdly, extensive experiments were performed on the KITTI [19] MOT [20] tracking benchmark, and the results show that our approach obtains superior performance to state-of-the-art algorithms.
The rest of the paper is organized as follows: Section 2 introduces the related work. Section 3 describes the details of our approach. Section 4 presents the experimental results and a comparison with state-of-the-art algorithms. Section 5 concludes the paper.

Related Work
Existing multi-object tracking approaches mostly use a two-step procedure consisting of the detection module and the tracking module. In this section, we point out related methods of these modules.

Object Detection
Recently the performance of object detection approaches has improved substantially. Viola et al. [21] utilized AdaBoost as the detector with Haar-like wavelets features. Dalal et al. [22] proposed linear SVM based approach with feature of Histogram of Oriented Gradient (HOG). To alleviate the non-rigid deformation effect, Felzenszwalb et al. [23] introduced an object detection system based on the latent SVM. Current state-of-the-art object detectors are mostly derived from R-CNN (Regions with CNN features) [1,2]. These approaches integrate the feature extraction and object detection into an end-to-end learning framework. These approaches could be roughly divided into two categories: Region Proposal Networks (RPN) based approaches and Single Shot multibox Detector (SSD) based approaches. RPN based methods are composed of two neural network branches: the region proposal network and the classification network [3,4]. These two networks are usually trained separately. For the SSD approach, it directly encapsulates the proposal generation and the classification into a single deep neural network, thus resulting in a much faster detection speed while simultaneously achieving similar accuracy [6] with RPN based approaches.
However, both approaches encounter difficulties in detecting small objects. For the RPN based method, due to the low resolution of the region proposal layer, it could hardly generate small object proposals. This is known as the collapsing bins problem [5]. For the SSD based approach, it directly generates object proposals from multiple layers, and the small object proposals are usually generated from shallow layers which contain mostly low-level features. Therefore, the generated small object proposals usually contain many false negatives. In this paper, we reduce the false negatives of small objects by incorporating temporal ROIs.

Object Tracking
After the potential objects are detected, they will be linked in nearby frames to form tracklets. This procedure is usually known as the data association. The literature on solving the data association problem is vast, including Linear Program based approaches [24], Bayesian filtering based approaches [25,26], graphical model based approaches [10,[27][28][29][30][31], etc. However, most of these approaches are designed for offline usage. To meet the online need for the autonomous vehicle, some approaches [12,32] model the data association problem as a causally optimal assignment problem and solve it with the Hungarian algorithm [33].
For single object tracking, Kalman filter based approaches [34] and Particle filter based approaches [35] are the most widely known. Besides these state estimation based methods, key point matching based approaches [36,37] are widely used for visual object tracking. Recently, Correlation Filter (CF) based methods [38] have been shown to outperform [18] key point matching based approaches which are frequently used in multi-object tracking area [10,12,25]. Since the seminal work in [38], CF has made great progress, and most state-of-the-art CF trackers have begun to use deep CNN features [18]. The usage of deep CNN features can indeed enhance the tracking performance, but it also brings heavily computational load. Furthermore, as the CNN is usually pre-trained from other datasets, it contains little target-specific semantic information [39]. To that end, we propose a novel deep compressed CNN feature based tracking approach in this paper. The CNN features are shared with the object detection module, thus introducing little computational effort. The CNN channels are also further compressed for more efficiency.

The Proposed Approach
In this section, we describe our approach. The proposed approach includes two key modules, the multi-scale object detection module and the compressed CNN feature based Correlation Filtering module (CCF). These two modules are interleaved. By carefully integrating these two modules, our proposed multi-object tracking framework could efficiently track multiple objects, and it also has the ability of re-identification (ReID) once the tracked object gets lost.

Multi-Scale Object Detection
The proposed multi-scale object detection module is based on the Single-Shot multibox Detection (SSD) [6] algorithm. We briefly describe this algorithm first. Then, to adapt to online video detection, the fine-tuning process and the usage of temporal information are introduced.
SSD directly produces object bounding box proposals on different convolutional layers with multiple scales s n and different aspect ratios a r = {1, 2, 3, 1 2 , 1 3 }. The width of the bounding box is w a n = s n √ a r , and the height is h a n = s n / √ a r , where s n is the scale factor for the n-th convolutional layer, and is calculated as: where s min and s max are set to be 0.2 and 0.9, respectively, across all the experiments. To detect larger objects, the width and height of the default box are set to be √ s n s n+1 .
To effectively detect small objects, the data augmentation strategy is usually adopted. In this strategy, the training data are augmented with patches randomly cropped from the input images. The cropped patches are classified into three categories: intact object, part of object or without object. All these cropped patches are then resized to the same size as the input image. In this way, the small object proposals could be generated in deeper layers. It is well known that deeper layers contain more semantic information, which is essential for reducing false detections.
In contrast to augmenting the training data with randomly cropped patches, we augment the training data with object ROI patches. The patches are cropped around objects with size: where S org is the object size and S norm is set to be [512, 512]. These cropped patches are then fused with the original images to form the augmented training set, as shown in Figure 2. In this way, the SSD training set is augmented with multi-scale samples. We termed this approach as SSD fine-tuned on multi-scale training data. This data augmentation strategy, however, could only be performed in the training stage. In the detection stage, there is no mechanism to crop object ROIs, and the smaller object proposals could only be generated in the shallow feature layers. When detecting objects in sequential frames, the temporal information can provide an important position prior for detectors. As the object moves smoothly across sequential frames, its position in the previous frame is highly correlated with the position in the current frame. More precisely, we utilize correlation filter to predict object position in the current frame. The details are introduced in the next section. Based on the predicted position, we can crop object ROIs and generate a set of multi-scale input data in the detection stage. As shown in Figure 1, these ROIs are then resized to a fixed size using bilinear interpolation. In this way, the small objects are upsampled and the corresponding proposals could be generated from deeper layers which contain more valuable semantic information. Experiments show that this semantic information helps reduce the false negatives for small object proposals. The details are described in Section 4.2.

Compressed CNN Feature Based Correlation Filter
For the tracking method, we chose to use the correlation filter as our base approach. Let w represent the weights to be learned, which is multiplied by the input sample feature x. The output is the desired filter response y. For a given training image, multiple training samples are generated by circular shift around the target, and the desired output is modeled as a Gaussian distribution with a small variance, as shown in Figure 3. This problem could be cast as a ridge regression problem: where represents the circular correlation, and λ is the weight of regularization. In the training phase, correlation filter weight W is obtained by solving the ridge regression problem in Equation (3), and the desired output is a Gaussian distribution. In the testing phase, the response map is calculated using Equation (8).
To reduce the computational complexity, the circular correlation in the spatial domain could be transformed into the frequency domain using FFT as: whereŵ,x, andŷ are Fourier transformations of w, x, and y, respectively; and denotes element-wise multiplication.ŵ * is the complex conjugate ofŵ, which ensures the operator to be correlation instead of convolution [40]. The solution to Equation (4) is: To adapt to the object appearance variations, the correlation filter is online updated as: where N t and D t are the numerator and denominator of Equation (5), respectively. t is the frame index and ψ is the updating rate.
Once the weightsŵ are learned in the training stage, they can be used to track the object. The response map for tracking the object is calculated as: whereẑ is the Fourier transform of the feature input, F −1 denotes the inverse Fourier transform, and y o is the response map, as shown in Figure 3. The computational complexity of the correlation filter is O(NM log M), where N is the number of feature channels and M is the length of each feature channel. The original implementation of CF only works with a few channels [38]. These channels correspond to low-level features, including histogram of gradients (HOG), magnitude of gradients, etc. Containing little semantic information, the low-level feature channels hinders the tracking performance of the CF. The ideal feature channels should be compact to ensure computational efficiency and contain discriminative semantic information.
Recently, CNN features have been demonstrated to be discriminative for object tracking [18]. For a typical CNN, it contains both the shallow layers and deep layers. The shallow layers contain geometric information and have high spatial resolution, which is crucial for object localization. A precise object localization is beneficial to the online update of the tracking model. The deep layers contain semantic information that is robust to occlusion, rotation, or illumination variation. The proposed tracker utilizes features from multiple layers of the SSD network, exploiting both geometric and semantic information.
We chose to incorporate the CNN feature into the correlation filter. The CNN layers that we use include conv1-1, conv3-3, and conv5-3 from SSD network with 64, 256, and 512 channels, respectively. To efficiently utilize these CNN feature channels, their dimensions need to be compressed. We compressed the feature channels using the Principal Component Analysis (PCA). The coefficient matrix of PCA was calculated once an object was detected and then kept fixed during the whole tracking process. To retain about 90% of variance, the number of the compressed feature channels was set to be 3, 30 and 10, accordingly. In the training stage, three correlation filters were trained with the three compressed feature channels. In the tracking stage, the three trained correlation filters were applied to obtain three response maps, which were combined linearly to obtain the final response map.
By combining the CF with the CNN feature, our proposed tracker enjoys the merits of both approaches. Furthermore, the CNN features we used in the tracker were directly inherited from the object detector, thus required no further computational load.

Multi-Object Tracking with Object Re-Identification
During the tracking step, the targets may disappear for several factors, such as occlusion, moving out of view, etc. When these temporally disappeared objects re-appear, the tracker should have the ability to recognize them, and assign them to existing tracklets instead of initializing a new tracklet. This procedure is usually known as the object re-identification (ReID).
To realize the ability of ReID, we should have a mechanism to judge whether the tracker output is becoming unreliable. Ideally, the response map of the correlation filter should have only one sharp peak and be smooth in other regions. Therefore, the fluctuation degree of the response map is an ideal indicator for the reliability of the tracker. As introduced in [41], we model the degree of fluctuation using the average peak-to-correlation energy (APCE): where y o is the response map of Equation (8). w, h, y max o , and y min o represent the width, height, and maximum and minimum of y o , respectively.
For the data association algorithm, the key points matching [10,12] and the intersection-over-union (IoU) of adjacent detections are usually adopted as the affinity model. We designed an affinity model using both the geometric cue and the appearance cue. The IoU between the position predicted by tracker and the current detection was used as the geometry cue, and the APCE score of the tracker response map was used as the appearance cue. The data association cost matrix is defined as: where C(t i , d j ) is the association cost, t i is the i-th tracker, and d j is the j-th detection. C(t i , d j ) is calculated as: Based on this cost matrix S, the data association problem was then modeled as an optimal assignment problem, and solved using the Hungarian algorithm [33].
We show an illustrative example of our ReID program in Figure 4. In frame 458, two new tracklets, ID-4 and ID-9, are initialized, with their detection scores larger than the threshold τ 1 . These two tracklets are then completely occluded in frame 461, and their detection scores become zero. Thus, they are labeled as inactive and stored as historical tracklets. In frame 462, vehicle ID-4 re-appears. We calculate the correlative responses of this new detection with existing inactive ID-4 and ID-9 tracklets, and the results are shown in Figure 4c,d. From there, we could observe that, although the maximums of the two responses are similar, their fluctuation degrees are different. The correct ReID (ID-4) response exhibits a single sharp peak, while the wrong ReID (ID-9) response fluctuates intensively. The APCE scores of the two responses in frame 462 are clearly distinguishable, as shown with the red and green point in Figure 4e. This suggests that our ReID mechanism is able to recognize specific target among similar looking objects.  Our proposed framework is summarized as follows. For a new frame, firstly the multi-scale object detection module is used to detect the potential objects. Then, we calculate the correlation coefficients and assign detections to existing tracklets using the Hungarian algorithm. After the assignment, if there are detections left with scores larger than τ 1 , they are treated as potential new objects. Before assigning new IDs to these objects, the object ReID program is performed. We calculate the APCE scores of the new detections with existing inactive tracklets. If the APCE score s a is larger than s s a , we activate the tracklet using this new detection. This suggests that a disappeared target reappears again. Otherwise, we assign a new ID to this object. For the ReID process, s s a is the last saved APCE score of the inactive tracklet. Finally, we update the states of the tracklets using threshold τ 1 and τ 2 . Active tracklets will either be updated if their scores s d are larger than τ 2 , or become inactive if their scores are smaller than τ 1 . This whole procedure is also described in Algorithm 1.

Input: The sequential frames
1: for each frame I in the sequence do 2: Obtain detections D w in the image I; 3: for each tracker T in the tracker set T = {T i } K i=1 do 4: Crop ROI patch R based on tracking output; 5: Obtain detections D r in the patch R; Calculate the correlation coefficients between detections D and trackers T ; 9: Associate D and T using the Hungarian algorithm; 10: for each detection D in D do 11: if D associate with no existing trackers and the score of the detection s d >= τ 1 then

12:
Generate a potential new tracker T new from D 13: for each inactive tracker T in the inactive tracker set T do 14: Calculate the APCE score s a of T new with T ; 15: if s a > s s a then 16: Activate T with T new ; 17: else 18: Assign T new with a new target ID; 19: end if 20: end for 21: end if 22: end for 23: for each tracker T in T do 24: if the score of tracker s d < τ 1 then 25: T becomes inactive, and save the APCE score s a as s s a . T ⇐ T; 26: end if 27: if the score of tracker s d > τ 2 then 28: Update the CCF model of T;

Experimental Setup
We performed experiments on the KITTI and MOT tracking benchmark, and implemented the algorithm on an Intel 3.00 GHz CPU and GeForce GTX 1080Ti GPU with MATLAB.
Datasets and Evaluation Metrics. The KITTI tracking dataset contains training and testing set with 21 and 29 sequences, respectively. The MOT tracking dataset contains 11 training sequences and 11 testing sequences. For these datasets, only the ground-truth of the training set is publicly available. The results for the testing set should be submitted for online evaluation [19,20].
We split the KITTI training set into a small training set with 6 sequences and a validation set with 15 sequences. Parameters were trained on the small training set, and then analyzed on the validation set.
The performance evaluation metrics include the Mostly Track targets (MT), Mostly Lost targets (ML), ID F1 score (IDF1), Multiple Object Tracking Accuracy (MOTA), and Multiple Object Tracking Precision (MOTP) [42,43], where MT and ML account for the number of ground-truth trajectories that are covered by the tracking output at a ratio larger than 80% and smaller than 20%, respectively. IDF1 is the ratio of correctly identified detections over the average number of ground-truth and computed detections [20]. MOTA is defined as: where m t , f p t and mm t denote the number of missed detections, the false positives and the mis-matched objects in frame t, respectively. MOTP is defined as: where d i t denotes the distance between the i-th ground-truth and the corresponding tracked object in frame t. c t is the number of matches in frame t.
Parameters Setting. In Algorithm 1, the birth/death of a tracker is determined by the threshold τ 1 . The update frequency of the CCF model is determined by the threshold τ 2 . These thresholds are achieved by exhaustive search on the small training set, where we used MOTA as the performance indicator. The results are shown in Figure 5. The maximal MOTA was achieved when τ 1 was set to 0.2 and τ 2 was set to 0.6. The update rate of the CCF model in Equations (6) and (7) was set to 0.0025, as suggested in [44].

Ablation Analysis
Multi-scale Augmentation. As shown in Table 1, to demonstrate the effectiveness of the multi-scale augmentation strategy for training data, we compared the performance of various detectors with the same tracker. The detectors included the original SSD, SSD fine-tuned on the training data (Finetuned SSD O ), and SSD fine-tuned on the multi-scale augmented training data (Finetuned SSD OP ). The training data used were the KITTI detection training set. Compared with the original SSD, Finetuned SSD O improved the MOTA, MOTP, and MT by 47.25%, 9.47%, and 54.64%, respectively, and decreased ML by 28.45%. This significant improvement clearly suggests that a fine-tuning step is crucial for data-driven approaches. Compared with Finetuned SSD O , Finetuned SSD OP improved MOTA and MT by 18.09% and 27.06%, respectively, and decreased ML by 14.94%. This performance improvement should be largely attributed to the multi-scale augmentation strategy for the training data. To demonstrate the effectiveness of our temporally augmenting strategy in the detection stage, we evaluated the quantitative performance of various detectors. The results are shown in Table 2 and Figure 6. The original FasterRCNN [3] and fine-tuned SSD detectors were regarded as baseline approaches. The detector named MultiFasterRCNN is the original FasterRCNN augmented with temporal ROIs. The detector named as Finetuned MultiSSD is the fine-tuned SSD augmented with temporal ROIs. Compared with the baseline, the temporally augmenting methods obtained better performance. The improvement should be largely attributed to the incorporation of temporal information.  Table 3 compares the performance of the correlation filter based affinity model with the MDP [12] tracker. Compared with MDP, the HOGCF (Correlation Filter with HOG features) based method increased the MOTA by about 11.75%. This suggests that the correlation filter is a better affinity model than the key point matching based approach used in MDP.
We compared the original implementation of correlation filter HOGCF with our compressed CNN feature based CF tracker (CCF). Results in Table 3 show that CCF achieves a 25.42% increase in MT and 4.6% decrease in ML. This suggests that the usage of compressed CNN feature is indeed much better than the low-level HOG features.

Benchmark Evaluation Results
Results on KITTI Dataset. The results on KITTI tracking testing set are summarized in Table 4. We compared our algorithm to several state-of-the-art approaches, including SSP [28], DCO-X [27], LP-SSVM [30] and NOMT-HM [10]. Some of these approaches could only be performed in the offline setting.
For the online setting, our approach achieved the best result. Compared with the second best, we obtained 3.11% and 2.28% increase in MOTA and MOTP, respectively, and 38.0% decrease in ML. The decrease in ML indicates that our approach has fewer false negatives, which should be largely attributed to the proposed temporal object ROIs. Low missing rate is a crucial property for autonomous vehicles to avoid the miss-detection of some important targets, such as pedestrians or vehicles. For the offline setting, our approach outperforms its counterparts in MOTP, MT and ML. Although the offline method LP-SSVM achieves the best result in MOTA, it utilizes the whole sequence for trajectory generation without considering causality, which is impractical for the autonomous vehicle. Figure 7 shows some qualitative results of our approach. Results on MOT2015 Dataset. We compared our algorithm with the state-of-the-art approaches on MOT2015 dataset. The results are summarized in Table 5. Our approach achieved the best result in MOTA, MT and ML. MDP [12] achieved a better result in IDF1, while it performed much worse in MT and ML. Besides, the MDP tracker needed to be trained for different environments and cannot work in real-time [12]. The oICF tracker [11] is based on Multiple-Instance Learning, and CNNTCM [15] utilizes Siamese CNN to associate objects. The approach of MCFPHD [29] generates the trajectory without considering causality. Compared with these approaches, our approach achieved better performance for MOTA, MT and ML.
MOT2015 dataset contains mostly pedestrian targets which are generally smaller than the vehicle objects in the KITTI dataset. As our approach achieved much better performance in MT and ML, it suggests that our approach is effective in tracking small objects with lower false negatives, as shown in Figure 8.

Conclusions
In this paper, we propose an efficient algorithm for multi-object tracking. Our approach is a two-step procedure. Firstly, we detect objects using a multi-scale detector. The detection module is augmented with temporal information and is quite effective for detecting small objects. In the second step, we associate the detections across frames. A novel compressed deep CNN feature based correlation filter is proposed as the affinity model. Extensive experiments were performed on the KITTI and MOT tracking benchmarks. Results show that the multi-scale detector helps reduce false negatives. It is also demonstrates that the correlation filter is a more suitable affinity model than previous key points matching based approaches. The compressed CNN feature contains more valuable semantic information than traditional low-level features, such as HOG, and this semantic information is beneficial for increasing the tracking performance.