Object Tracking for Satellite Video based on Kernelized Correlation Filters and Three Frame Difference

Recently, correlation filters have yielded many promising results in target tracking. However, when applied to object tracking in satellite video, the kernelized correlation filter (KCF) tracker achieves poor results, because the target is very small compared with the whole image and the target and background are very similar. In this paper, we introduce the KCF tracker to satellite video and use the three-frame-difference algorithm to improve its performance. To reduce the probability of drift in the KCF tracker, a simple but effective target detection method, the three-frame-difference algorithm, is used to relocate the target every few frames. By combining the KCF tracker with the three-frame-difference algorithm, the proposed tracker outperforms top-ranking trackers such as Struck and TLD on two datasets.


Introduction
Remote sensing satellite image processing has attracted much attention and has been widely used in many applications [1][2][3]. Recently, commercial satellites have made great progress in using remote sensing devices to capture very-high-resolution (VHR) satellite videos. Remote sensing videos have great potential in traffic flow detection, forest cover dynamic monitoring, flood disaster monitoring, etc. For the 2016 IEEE GRSS Data Fusion Contest, a high-definition video with a spatial resolution of 1 m, acquired from the International Space Station (ISS), was released [4], and it has drawn much attention to object recognition and tracking for cars, vessels and buildings. The Jilin No. 1 commercial satellite, produced by China, can provide VHR satellite videos at 0.74 m spatial resolution. These advances show that it is now feasible to study object tracking in satellite video.
Generally, tracking algorithms can be categorized into two classes based on their representation schemes: generative models [5][6][7][8] and discriminative models [9][10][11][12][13][14][18][19]. In a generative model, tracking is treated as the problem of searching a neighborhood for the region most similar to the target object. A variety of search algorithms based on generative models have been employed to estimate the object state. For instance, the ℓ1-tracker [5] used a sparse linear combination of target templates and trivial templates to establish a target model. Adam et al. [6] designed an appearance model that uses image fragments to deal with pose variation and partial occlusion. Compared with generative algorithms, discriminative methods have attracted wider attention because they exploit information from both the target and the background. They treat object tracking as a binary classification problem. Hare et al. [9] used a large number of image features to train a classifier based on the structured output support vector machine and Gaussian kernels. Kalal et al. [11] used a set of structural constraints to guide the sampling process of a boosting classifier.
Recently, tracking methods based on correlation filters have been shown to perform well in object tracking. Henriques et al. [14] proposed the kernelized correlation filter (KCF) algorithm, which samples densely in the area around the target and can therefore exploit the abundant information in negative samples. In addition, KCF transforms the computation from the spatial domain into the Fourier domain by constructing a circulant matrix. As a result, the computational cost is reduced substantially. Many subsequent studies show that trackers based on KCF are far ahead of other trackers evaluated on the CVPR 2013 OOTB benchmark.
For a VHR image, a single frame can contain up to six million pixels, more than 100 times as many as an ordinary video frame, while the spatial resolution of satellite imagery is much lower than that of natural images. These factors lead to a higher probability of tracking window drift. Because the surroundings in satellite video change little, we use the three-frame-difference method to detect moving objects. With the assistance of the three-frame-difference method, the drift caused by KCF in object tracking can be reduced.
In this paper, we therefore propose a novel tracking algorithm based on KCF and three-frame difference. The KCF tracker is employed because it takes full advantage of the negative samples in a VHR image and runs at high speed. To reduce drift error, a simple but efficient target detection algorithm, three-frame difference, is used to detect the target. The flowchart of the proposed method is shown in Figure 1.
The rest of this paper is organized as follows. In Section II, we introduce the proposed algorithm. Section III provides experimental results. A conclusion is given in Section IV.

Three Frame Difference
For a given video sequence, we denote the current frame as the k-th frame and the previous frame as the (k-1)-th frame. A binary image is obtained by formula (1):

$$D_k(x, y) = \begin{cases} 1, & |f_k(x, y) - f_{k-1}(x, y)| > T \\ 0, & \text{otherwise} \end{cases} \tag{1}$$

where $f_k(x, y)$ denotes the gray value of pixel $(x, y)$ in the k-th frame and T is a threshold, which is set manually according to experimental results. If T is too large, the target may be missed. If T is too small, too much noise will be detected.

Wojcik and Kaminski [15] proposed the three-frame-difference method. For three sequential frames, the (k-1)-th, k-th and (k+1)-th, we first calculate $D_1(x, y)$ by subtracting the (k-1)-th frame from the k-th frame, and $D_2(x, y)$ by subtracting the k-th frame from the (k+1)-th frame. We then obtain the result $D(x, y)$ by combining $D_1(x, y)$ and $D_2(x, y)$ pixel by pixel with a logical AND, as is standard for three-frame differencing. This method is shown as follows:

$$D_1(x, y) = \begin{cases} 1, & |f_k(x, y) - f_{k-1}(x, y)| > T \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

$$D_2(x, y) = \begin{cases} 1, & |f_{k+1}(x, y) - f_k(x, y)| > T \\ 0, & \text{otherwise} \end{cases} \tag{3}$$

$$D(x, y) = D_1(x, y) \wedge D_2(x, y) \tag{4}$$

Compared with the two-frame-difference method, the three-frame-difference method deals with occlusion more effectively and reduces irrelevant noise points. In addition, the three-frame-difference method is not sensitive to illumination variation. Figure 2 illustrates the three-frame-difference method. Comparing Figures 2d and 2e with Figure 2f, we can conclude that Figure 2f, generated by three-frame difference, gives a much better result than Figures 2d and 2e, because the object outline is clearer and unbroken.
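The thresholded differences and their combination can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes that the two binary masks are combined with a pixel-wise logical AND (the usual choice for three-frame differencing), it represents frames as nested lists of gray values, and the helper names `frame_diff`, `three_frame_difference` and `make_frame` are hypothetical.

```python
def frame_diff(a, b, T):
    """Binary mask that is 1 where |a - b| > T and 0 elsewhere."""
    return [[1 if abs(pa - pb) > T else 0 for pa, pb in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


def three_frame_difference(prev, cur, nxt, T):
    """Combine the two consecutive difference masks with a pixel-wise AND (assumed)."""
    d1 = frame_diff(cur, prev, T)   # k-th frame vs (k-1)-th frame
    d2 = frame_diff(nxt, cur, T)    # (k+1)-th frame vs k-th frame
    return [[p & q for p, q in zip(r1, r2)] for r1, r2 in zip(d1, d2)]


def make_frame(col, width=7, height=3, background=20, target=200):
    """Toy frame: a bright one-pixel-wide target in column `col`."""
    return [[target if c == col else background for c in range(width)]
            for _ in range(height)]


# The target moves one pixel to the right per frame.
prev, cur, nxt = make_frame(2), make_frame(3), make_frame(4)
mask = three_frame_difference(prev, cur, nxt, T=10)
for row in mask:
    print(row)
```

In this toy sequence each two-frame mask marks both the old and the new target column, while the AND of the two masks keeps only the column occupied in the middle frame, which is why the combined mask gives a cleaner outline than either two-frame difference.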

Kernelized Correlation Filters Tracking
The core component of most modern trackers is a discriminative classifier, tasked with distinguishing between the target and the surrounding environment. To cope with natural image changes, this classifier is typically trained with translated and scaled sample patches. Such sets of samples are riddled with redundancies: any overlapping pixels are constrained to be the same. Based on this simple observation, KCF [14] was proposed to take full advantage of negative samples while exploiting this redundancy. In addition, KCF treats tracking as a regression problem rather than a classification problem. For each sample, instead of labeling positive samples as 1 and negative samples as 0, KCF assigns a value in the range [0,1]. Similarly, Struck [9] uses a loss function that assigns a continuous value to each sample.
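A typical tracker based on correlation filters trains the classifier with a target region sample image $x$ of size $I \times J$. By circularly shifting $x$, as shown in (5), the method obtains numerous training samples $x_{i,j}$, where $(i, j) \in \{0, \dots, I-1\} \times \{0, \dots, J-1\}$:

$$X = C(x) = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \\ x_n & x_1 & \cdots & x_{n-1} \\ \vdots & \vdots & \ddots & \vdots \\ x_2 & x_3 & \cdots & x_1 \end{bmatrix} \tag{5}$$

where $x$ represents the base sample (written here in one dimension), and $X$ represents the training samples obtained by circularly shifting $x$. Each training sample $x_{i,j}$ is associated with a label $y_{i,j}$. Then, the algorithm models the target with the filter $w$, which is obtained by minimizing the following objective (6):

$$w = \arg\min_{w} \sum_{i,j} \left| \langle \phi(x_{i,j}), w \rangle - y_{i,j} \right|^2 + \lambda \|w\|^2 \tag{6}$$

where $\phi$ denotes the mapping of features into a kernel space, and $\lambda$ is a regularization parameter. Following [14], we know that $w = \sum_{i,j} \alpha_{i,j}\, \phi(x_{i,j})$, where:

$$\alpha = F^{-1}\left( \frac{F(y)}{F(k^{xx}) + \lambda} \right) \tag{7}$$

In (7), $k^{xx}$ stands for the kernelized correlation [14], and $F$ denotes the discrete Fourier transform. In this paper, we adopt a Gaussian kernel function:

$$k^{xx'} = \exp\left( -\frac{1}{\sigma^2} \left( \|x\|^2 + \|x'\|^2 - 2 F^{-1}\left( F(x) \odot F^{*}(x') \right) \right) \right) \tag{8}$$

where $F^{-1}$ represents the inverse Fourier transform, $F^{*}(x')$ stands for the complex conjugate of $F(x')$, and $\odot$ is the Hadamard (element-wise) product.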
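The reason the Fourier domain helps is that the inner products between one signal and all cyclic shifts of another can be obtained from a single element-wise product of their transforms, which is the term $F^{-1}(F(x) \odot F^{*}(x'))$ inside the Gaussian kernel. The sketch below checks this numerically in one dimension. It is illustrative only: it uses a naive DFT from the standard library in place of an FFT, and the helper names `dft`, `cyclic_shift`, `correlation_explicit` and `correlation_fourier` are hypothetical.

```python
import cmath


def dft(v, inverse=False):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [c / n for c in out] if inverse else out


def cyclic_shift(v, i):
    """Shift v to the right by i positions with wrap-around."""
    n = len(v)
    return [v[(t - i) % n] for t in range(n)]


def correlation_explicit(x, xp):
    """Inner product of x with every cyclic shift of xp, computed one by one."""
    return [sum(a * b for a, b in zip(x, cyclic_shift(xp, i)))
            for i in range(len(x))]


def correlation_fourier(x, xp):
    """Same values from F^{-1}(F(x) * conj(F(xp))), with no explicit shifting."""
    prod = [a * b.conjugate() for a, b in zip(dft(x), dft(xp))]
    return [c.real for c in dft(prod, inverse=True)]


x = [1.0, 2.0, 3.0, 4.0]
xp = [0.5, -1.0, 2.0, 0.0]

# Rows of the circulant matrix built from the base sample x.
circulant = [cyclic_shift(x, i) for i in range(len(x))]

explicit = correlation_explicit(x, xp)
fourier = correlation_fourier(x, xp)
print(all(abs(a - b) < 1e-9 for a, b in zip(explicit, fourier)))
```

With an FFT the Fourier route costs O(n log n) instead of the O(n^2) of the explicit loop, which is where the speed of correlation-filter trackers comes from.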
In the detection phase, we first take the target location in the previous frame as the central position, crop an image patch $z$ of size $I \times J$ from the new frame, and compute the response of the classifier based on (9):

$$\hat{y} = F^{-1}\left( F(k^{xz}) \odot F(\alpha) \right) \tag{9}$$

where $x$ is the learned target appearance model. The response value $\hat{y}$ measures the similarity between the candidate target and the real target. Therefore, the current position of the target can be found by searching for the maximum value of $\hat{y}$, that is:

$$(\hat{i}, \hat{j}) = \arg\max_{i,j} \hat{y}_{i,j} \tag{10}$$

Finally, we output the target location in the current frame, and take the output window as the base sample for the next frame.
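The training and detection steps in (7) to (10) can be sketched on a one-dimensional toy signal. This is a simplified illustration, not the authors' C++ implementation: it uses a naive DFT instead of an FFT, omits feature extraction, windowing and model interpolation, and the function names `gaussian_correlation`, `train` and `detect` as well as the label width are assumptions. In `detect`, the new patch is passed as the first kernel argument so that, with symmetric labels, the index of the peak equals the displacement of the target.

```python
import cmath
import math


def dft(v, inverse=False):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [c / n for c in out] if inverse else out


def gaussian_correlation(a, b, sigma):
    """Gaussian kernel between a and every cyclic shift of b, as in (8)."""
    fa, fb = dft(a), dft(b)
    cross = dft([p * q.conjugate() for p, q in zip(fa, fb)], inverse=True)
    norms = sum(v * v for v in a) + sum(v * v for v in b)
    return [math.exp(-max(norms - 2.0 * c.real, 0.0) / sigma ** 2) for c in cross]


def train(x, y, sigma, lam):
    """(7), kept in the Fourier domain: F(alpha) = F(y) / (F(k^xx) + lambda)."""
    k_f = dft(gaussian_correlation(x, x, sigma))
    return [fy / (fk + lam) for fy, fk in zip(dft(y), k_f)]


def detect(alpha_f, x, z, sigma):
    """(9): classifier response for every cyclic shift of the new patch z."""
    k_f = dft(gaussian_correlation(z, x, sigma))
    resp = dft([fk * fa for fk, fa in zip(k_f, alpha_f)], inverse=True)
    return [r.real for r in resp]


n = 16
x = [0.0] * n
x[4], x[5], x[6] = 0.5, 1.0, 0.5                 # base sample: a small bump
z = [x[(t - 3) % n] for t in range(n)]           # new patch: bump moved by 3
# Gaussian-shaped labels peaking at zero shift (width chosen arbitrarily).
y = [math.exp(-min(i, n - i) ** 2 / 2.0) for i in range(n)]

alpha_f = train(x, y, sigma=0.5, lam=1e-4)
response = detect(alpha_f, x, z, sigma=0.5)
displacement = max(range(n), key=lambda j: response[j])   # (10)
print(displacement)
```

The values sigma = 0.5 and lambda = 1e-4 mirror the settings reported in the implementation details; everything else in the toy setup is chosen only to keep the example small.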

Model Update
For most of the time, the KCF tracker is employed because of its high speed and acceptable accuracy in satellite video. However, the KCF tracker drifts and loses its target for two reasons. First, the target in a VHR image contains little information, because it is very small compared with the whole image. Second, the target and the background are very similar in a VHR image. In this paper, we introduce three-frame difference to correct the drift error caused by the KCF tracker. The specific strategy is as follows: every few frames, or whenever the tracking window drifts drastically, three-frame difference is used to detect the target and output the result. The output window is then used as the input window for the KCF tracker. By repeating this process, the proposed tracker tracks the target with high accuracy at an acceptable speed (50 fps). In short, the proposed tracker uses three-frame difference to correct the error generated by the KCF tracker. The basic procedure of our algorithm is presented in Algorithm 1, and Figure 1 shows the flowchart of the proposed algorithm.

Algorithm 1. The procedure of the proposed algorithm

Input: video frame (t) and $l_{t-1}$
Method:
1. Sample a set of image patches based on $l_{t-1}$, where $l_{t-1}$ is the tracking result in the previous frame.
2. Assign each image patch a weighting factor that reflects its similarity to the result in the previous frame, based on the distance between the pixels.
3. Build the KCF tracker by training on the image patches from step 2. To reduce the probability of overfitting, a 2-norm regularization term is added.
4. Every few frames, use three-frame difference to detect the target and output $l_t$; $l_t$ is then used as the input for step 1.
5. Find the maximum value of $\hat{y}$; the image patch with the maximum value is selected as the tracking result. Combine this with step 4 to output $l_t$.
Output: tracking location $l_t$
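The correction strategy, in which KCF runs on every frame and three-frame difference re-locates the target periodically or after a large jump, can be sketched as a small control loop. This is only one possible reading of the procedure: the callables `kcf_step` and `detect_target`, the parameters `interval` and `max_jump`, and the offline setting (the next frame is already available when a correction is made) are all assumptions introduced for illustration.

```python
import math


def track(frames, init_pos, kcf_step, detect_target, interval=5, max_jump=10.0):
    """Track a target centre through `frames`, correcting KCF drift periodically.

    kcf_step(frame, prev_pos) -> (x, y): hypothetical KCF update for one frame.
    detect_target(prev, cur, nxt, prev_pos) -> (x, y) or None: hypothetical
    three-frame-difference detector for the middle frame `cur`.
    """
    positions = [init_pos]
    for t in range(1, len(frames)):
        prev_pos = positions[-1]
        pos = kcf_step(frames[t], prev_pos)
        jump = math.hypot(pos[0] - prev_pos[0], pos[1] - prev_pos[1])
        needs_check = (t % interval == 0) or (jump > max_jump)
        # Three-frame difference needs frames t-1, t and t+1.
        if needs_check and t + 1 < len(frames):
            detected = detect_target(frames[t - 1], frames[t], frames[t + 1], prev_pos)
            if detected is not None:
                pos = detected  # the detection becomes the next KCF input window
        positions.append(pos)
    return positions


# Toy stand-ins: each "frame" is simply the true target centre, the fake KCF
# step overshoots by 0.5 pixel per frame, and the fake detector is exact.
frames = [(float(t), 0.0) for t in range(12)]
drifting_kcf = lambda frame, p: (p[0] + 1.5, p[1])
exact_detector = lambda prev, cur, nxt, p: cur

positions = track(frames, (0.0, 0.0), drifting_kcf, exact_detector, interval=5)
print(positions[-1])
```

In the toy run the accumulated drift is reset at every fifth frame, so the error stays bounded instead of growing linearly with the sequence length.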

Implementation Details
Since satellite video data are relatively scarce, we use two videos, provided by the 2016 IEEE GRSS Data Fusion Contest (Deimos Imaging and UrtheCast) and by Chang Guang Satellite Co., Ltd. The first video shows the traffic conditions of a harbor in Canada, and the second video shows the traffic conditions of New Delhi. Both videos have a frame size of 3840×2160 pixels and contain 1024 frames. To make our work more meaningful, we select moving trains as the targets. Figure 4 shows the details of the two datasets. We initialize the target position in the first frame, and evaluate the proposed algorithm by comparing the output window with the ground truth window. For comparison, three state-of-the-art tracking algorithms, TLD [11], STRUCK [9] and KCF [14], are evaluated alongside the proposed algorithm. These three trackers have been shown to achieve top performance on the CVPR 2013 OOTB datasets.
The proposed algorithm is implemented in C++ with the OpenCV library on a desktop with a 3 GHz CPU and 8 GB of memory. The proposed algorithm runs at 40 fps with the HARR-like feature [17] and at 50 fps with raw pixels. The size of the search window is set to 1.5 times the target size. The σ used in the Gaussian function is set to 0.5, the cell size of HARR is 4×4 and the number of orientation bins of HARR is 9. The regularization parameter λ is set to $10^{-4}$. We set T in (2) and (3) to 10 when applying the three-frame-difference method to reduce the probability of drift.

Evaluation
Bounding box overlap and the average center location error, as defined in [16], are used to make quantitative comparisons. Table 1 shows the average center location errors (CLE, in pixels). Table 2 shows the average bounding box overlap scores (OS, in %). Figure 5 shows some representative screenshots from the two video sequences.
In Table 1 and Table 2, the best performance is marked with red bold digits and the second best performance is marked with blue bold digits. The character "X" in Table 1 and Table 2 indicates that the tracker fails completely on the dataset. On both datasets, the two variants of the proposed algorithm rank first and second. TLD and STRUCK did not perform well, and even lost the target completely, although the compared algorithms have achieved outstanding results on the CVPR 2013 OOTB.
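For reference, the two metrics can be computed as in the sketch below. It assumes the commonly used definitions (Euclidean distance between box centres, and intersection over union of the two boxes) and an `(x, y, w, h)` box layout; the exact definitions in [16] may differ in detail, and the function names are hypothetical.

```python
import math


def center_location_error(box, gt):
    """Euclidean distance in pixels between the centres of two (x, y, w, h) boxes."""
    cx, cy = box[0] + box[2] / 2.0, box[1] + box[3] / 2.0
    gx, gy = gt[0] + gt[2] / 2.0, gt[1] + gt[3] / 2.0
    return math.hypot(cx - gx, cy - gy)


def overlap_score(box, gt):
    """Intersection area divided by union area of two (x, y, w, h) boxes."""
    iw = max(0.0, min(box[0] + box[2], gt[0] + gt[2]) - max(box[0], gt[0]))
    ih = max(0.0, min(box[1] + box[3], gt[1] + gt[3]) - max(box[1], gt[1]))
    inter = iw * ih
    union = box[2] * box[3] + gt[2] * gt[3] - inter
    return inter / union if union > 0 else 0.0


def average(values):
    """Mean over a sequence of per-frame scores."""
    values = list(values)
    return sum(values) / len(values)


# Two frames of made-up tracker output and ground truth.
tracked = [(0, 0, 10, 10), (5, 0, 10, 10)]
truth = [(0, 0, 10, 10), (0, 0, 10, 10)]

avg_cle = average(center_location_error(b, g) for b, g in zip(tracked, truth))
avg_os = average(overlap_score(b, g) for b, g in zip(tracked, truth))
print(avg_cle, avg_os)
```

A lower average CLE and a higher average OS both indicate better tracking; the overlap score is multiplied by 100 when reported as a percentage.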

Conclusion
In this paper, a new algorithm based on kernelized correlation filters and three-frame difference is proposed for object tracking in satellite video. Considering the speed and accuracy of the tracker, we introduce kernelized correlation tracking to high-resolution satellite video for the first time, and combine it with the three-frame-difference method to reduce the drift arising from the original KCF tracker. By combining KCF and three-frame difference, we can track objects in satellite video accurately at an acceptable speed: 40 fps with the HARR feature and 50 fps with raw pixels. Two satellite video datasets were used to compare the proposed algorithm with three state-of-the-art trackers. The experiments show that the proposed algorithm with the HARR feature ranks first, and the proposed algorithm with raw pixels ranks second, according to the CLE and OS. These results indicate that the proposed algorithm is a robust and efficient method for tracking in satellite video.

Figure 1. A system flowchart of the proposed algorithm. Many image patches are generated through a circulant matrix based on the result from the previous frame. The KCF tracker is built by training on these image patches, and then searches the area around the base image sample. It finds the target location according to the maximum response value of the classifier. In addition, three-frame difference is used to detect the target at regular frame intervals. By combining KCF and three-frame difference, the proposed tracker achieves good performance in object tracking.

Figure 3 shows the process of creating training samples from the target region sample, which represents the 2D image. KCF trains a classifier to map these samples to the Gaussian function labels $y_{i,j}$.

Figure 3. Examples of vertical cyclic shifts of a base sample.Our Fourier domain formulation allows us to train a tracker with all possible cyclic shifts of the base sample, both vertical and horizontal, without iterating them explicitly.

Figure 4. Details of the two datasets evaluated in our experiments.

Figure 5. Screenshots of some tracking results on the video sequences.

Table 1. Average center location errors (in pixels)