Robust Visual Tracking with Discrimination Dictionary Learning

It is challenging to handle the many kinds of appearance variations in visual tracking. Existing tracking algorithms build appearance models upon target templates. Those models are not robust to significant appearance variations caused by factors such as illumination changes, partial occlusions, and scale variation. In this paper, we propose a robust tracking algorithm that represents target candidates with a learnt dictionary. With the learnt dictionary, a target candidate is represented as a linear combination of dictionary atoms. The discriminative information in the learning samples is exploited, and the dictionary learning process can capture appearance variations. Based on the learnt dictionary, we obtain a more stable representation for target candidates. Additionally, the observation likelihood is evaluated based on both the reconstruction error and the dictionary coefficients with an l1 constraint. Comprehensive experiments demonstrate the superiority of the proposed tracking algorithm over several state-of-the-art tracking algorithms.


Introduction
Visual tracking is a fundamental task in computer vision with a wide range of applications, such as intelligent transportation, video surveillance, human-computer interaction, and video editing. The goal of visual tracking is to estimate the states of a tracked target in each frame. Although many tracking algorithms have been proposed in recent decades [1], designing a robust tracking algorithm remains challenging due to factors such as fast motion, out-of-plane rotation, nonrigid deformation, and background clutter.
Based on the types of target observations, visual tracking algorithms can be classified as either generative [2][3][4][5][6] or discriminative [7][8][9][10][11][12][13][14]. Generative tracking algorithms search for the image patch most similar to the tracked target model and take it as the tracking result in the current frame. For a generative algorithm, the primary problem is to build an effective appearance model that is robust to complicated appearance variations.
Discriminative tracking algorithms treat visual tracking as a binary classification problem. The tracked target is distinguished from the surrounding background by learnt classifiers, which compute a confidence value for each target candidate and label it as either a foreground target or a background block. In this work, we propose a generative algorithm. Next, we briefly review works related to our tracking algorithm as well as some recent tracking algorithms.
Kwon et al. [3] decompose the observation model and the motion model into multiple basic observation models and multiple basic motion models, respectively. Each basic observation model covers one type of target appearance variation, and each basic motion model covers a specific motion pattern. A basic observation model and a basic motion model are combined into a basic tracker, making the tracking algorithm robust to drastic appearance changes. He et al. [4] propose an appearance model based on locality sensitive histograms computed at each pixel location; this observation model is robust to drastic illumination variations. In [2], a target candidate is represented by a set of intensity histograms of multiple image patches, each of which votes on the corresponding target position. However, the target is represented by fixed target templates, which are not robust to drastic appearance variations. Wang et al. [6] represent a target candidate based on target templates with an affine constraint, and the observation likelihood is computed based on a learnt distance metric.
Representation techniques with sparsity constraints have been applied to visual tracking [15][16][17][18][19]. Sparse target representations are robust to outliers and occlusions. In [15], Mei et al. use a set of target templates to represent target candidates and model partial occlusions with trivial templates. The algorithm in [15] is robust to partial occlusion, but it is not effective when severe occlusions occur. Zhong et al. [18] propose a collaborative model with a sparsity constraint that combines a generative model and a discriminative model to improve tracking performance. Zhang et al. [16] exploit the spatial layout structure of a target candidate and represent target appearance based on local information and spatial structure. In [19], a target candidate is represented by an underlying low-rank structure with sparse constraints, in which temporal consistency is exploited.
Recently, correlation filter [21][22][23][24] and deep network [25][26][27][28] techniques have been applied to visual tracking. In [23], the tracking algorithm uses different features to learn correlation filters. The proposed appearance model is robust to large-scale variations and maintains multiple modes in a particle filter tracking framework. Liu et al. [22] exploit part-based structure information for correlation filter learning; the learnt filters can accurately distinguish foreground parts from the background. In [28], Ma et al. exploit object features from deep convolutional neural networks. The outputs of the convolutional layers include semantic information and hierarchies, which are robust to appearance variations. Huang et al. [27] propose a tracking algorithm based on deep feature cascades, which treats visual tracking as a decision-making process.
Motivated by the above-mentioned work, we propose an appearance model based on a learnt dictionary. A target candidate is represented by a linear combination of the learnt dictionary atoms. The dictionary learning process can capture appearance variations, so the dictionary atoms cover recent appearance variations and a stable target representation is obtained. The observation likelihood is evaluated based on the reconstruction error with a sparsity constraint on the dictionary coefficients. Extensive experimental results on several challenging video sequences show the robustness and effectiveness of the proposed appearance model and tracking algorithm.
The remainder of this paper is organized as follows. Section 2 presents the proposed tracking algorithm, including the appearance model, the dictionary learning, the observation likelihood evaluation, and the dictionary update. Section 3 compares the tracking performance of the proposed algorithm with some state-of-the-art algorithms. Section 4 concludes this work.

Proposed Tracking Algorithm
In this section, we detail the proposed tracking algorithm, including an appearance model based on a learnt dictionary, discrimination dictionary learning for target representation, and a novel likelihood function. We build the tracking algorithm in a particle filter tracking framework [29], which is widely used in visual tracking due to its effectiveness and simplicity.
In our tracking algorithm, the target state in the first frame is given as $s_1$, and $y_1$ denotes the corresponding target observation. In the first frame, a set of $N$ particles (i.e., target candidates) is extracted and denoted as $X_1 = \{x_1^1, x_1^2, \ldots, x_1^N\}$. These particles are collected by cropping out image regions surrounding the location of $s_1$; they have the same size as $s_1$ and equal importance weights $w_1^i = 1/N$, $i = 1, 2, \ldots, N$. The particles in frame $t$ are denoted as $X_t = \{x_t^i\}_{i=1}^{N}$ and are drawn from a Gaussian distribution whose covariance $\Sigma$ is a diagonal matrix, in which the diagonal entries denote the variances of the 2D position and the scale of a target candidate.
The target state and the corresponding observation in the $t$-th frame are denoted as $s_t$ and $y_t$, respectively. In the particle filter framework, $s_t$ is approximately estimated from the particles as
$$\hat{s}_t = \sum_{i=1}^{N} w_t^i x_t^i,$$
where $w_t^i$ is the weight of the particle $x_t^i$. During tracking, the particle weights are dynamically updated according to the likelihood of the particle $x_t^i$ as
$$w_t^i \propto w_{t-1}^i \, p(y_t^i \mid x_t^i),$$
where $p(y_t^i \mid x_t^i)$ is the likelihood function of particle $x_t^i$, which is introduced in the likelihood evaluation below.
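As an illustration of the particle filter steps above, the following minimal Python sketch propagates particles with Gaussian noise and estimates the state as the weighted mean of the particles. The state layout (x, y, scale) and the noise parameters are illustrative assumptions, not the paper's exact settings.

```python
import random

def propagate(particles, sigma_xy=4.0, sigma_s=0.01):
    """Sample new particles around the previous ones with Gaussian noise.

    Each particle is (x, y, scale); the sigma values are hypothetical
    stand-ins for the diagonal entries of the covariance matrix Sigma.
    """
    return [(x + random.gauss(0.0, sigma_xy),
             y + random.gauss(0.0, sigma_xy),
             s + random.gauss(0.0, sigma_s)) for (x, y, s) in particles]

def estimate_state(particles, weights):
    """Estimate the state as the weighted mean of the particles."""
    total = sum(weights)
    norm = [w / total for w in weights]
    return tuple(sum(w * p[k] for w, p in zip(norm, particles))
                 for k in range(3))
```

Normalizing the weights inside `estimate_state` means callers may pass unnormalized likelihood values directly.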

Target Representations.
In existing algorithms, a target candidate is represented by a linear combination of a set of target templates. These templates are usually generated from tracking results in previous frames and contain noise and uncertainty due to complicated appearance variations, so these tracking algorithms are not robust to drastic variations. Thus, in our tracking algorithm, a target candidate is approximately represented by the atoms of a learnt dictionary.
Based on a learnt dictionary $D = [d_1, d_2, \ldots, d_K]$, a target candidate $y$ is approximately represented as
$$y \approx D\alpha,$$
where $\alpha$ is the coefficient vector for the target candidate $y$ associated with the learnt dictionary $D$. The dictionary coefficient $\alpha$ is evaluated by solving
$$\hat{\alpha} = \arg\min_{\alpha} \|y - D\alpha\|_2^2 + \lambda \|\alpha\|_1,$$
where $\lambda$ is a parameter balancing the reconstruction error against the sparsity of $\alpha$.
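The l1-regularized coding problem for a target candidate can be solved with a generic proximal-gradient (ISTA) routine. The sketch below is one standard solver under small illustrative parameters, not necessarily the solver used by the authors.

```python
import numpy as np

def sparse_code(y, D, lam=0.01, n_iter=200):
    """Solve min_a ||y - D a||_2^2 + lam * ||a||_1 with ISTA."""
    step = 1.0 / (2.0 * np.linalg.norm(D, 2) ** 2 + 1e-12)  # 1 / Lipschitz
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ a - y)        # gradient of the quadratic term
        z = a - step * grad
        a = np.sign(z) * np.maximum(np.abs(z) - lam * step, 0.0)  # soft-threshold
    return a
```

With `D` the identity, the solver reduces to elementwise soft-thresholding of `y`, which is a quick sanity check.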

Dictionary Learning.
In existing tracking algorithms, target candidates are represented by target templates, which are tracking results from previous frames. To improve the tracking performance, we use a learnt dictionary to approximately represent target candidates.
Denote by $T = [t_1, t_2, \ldots, t_n]$ the set of training samples and by $V$ the coding coefficient matrix of $T$ over $D$, i.e., $T \approx DV$. The learnt dictionary should have discriminative capability and should adapt to appearance variations such as partial occlusion, nonrigid deformation, and illumination variation. Based on the learnt dictionary, a stable target representation model is obtained. Motivated by the dictionary learning method in [30], we use a discriminative dictionary learning model
$$\min_{D, V} \; \Phi(T, D, V) + \lambda_1 \|V\|_1 + \lambda_2 f(V),$$
where $\Phi(T, D, V)$ is the discriminative fidelity term, $\|V\|_1$ is the sparsity constraint on the coefficient matrix $V$, $f(V)$ is a discriminative constraint on the coding coefficient matrix $V$, and $\lambda_1$ and $\lambda_2$ are parameters balancing these constraint terms. In [30], the dictionary $D$ includes a set of subdictionaries for all classes. Different from the dictionary learning in [30], in our tracking algorithm $D$ is a one-class dictionary and is learnt from only a set of positive training samples.
In the dictionary learning process, based on the reconstruction error between the training samples $T$ and the dictionary $D$, the discriminative fidelity term is defined as
$$\Phi(T, D, V) = \|T - DV\|_F^2.$$
To improve the discriminative performance of the learnt dictionary, we add the Fisher discriminative criterion to minimize the within-class scatter of the coefficient matrix $V$. Denote by $S_W(V)$ the within-class scatter, which is defined as
$$S_W(V) = \sum_{i} (v_i - m)(v_i - m)^{\top},$$
where $v_i$ is the $i$-th column of the coefficient matrix $V$ and $m$ is the mean of the coefficient vectors. In the learning process, we use the trace of $S_W(V)$ in the constraint term $f(V)$. To prevent some coefficients from becoming too large, a regularized term $\|V\|_F^2$ is added, giving
$$f(V) = \operatorname{tr}(S_W(V)) + \eta \|V\|_F^2,$$
where $\eta$ is a balancing parameter.
The dictionary $D$ and the corresponding coefficient matrix $V$ are updated iteratively: one is updated while the other is fixed. When the dictionary $D$ is fixed, the optimization reduces to
$$\min_{V} \; \|T - DV\|_F^2 + \lambda_1 \|V\|_1 + \lambda_2 \big( \operatorname{tr}(S_W(V)) + \eta \|V\|_F^2 \big).$$
When $V$ is fixed, the dictionary $D$ is updated as
$$\min_{D} \; \|T - DV\|_F^2.$$
In the learning process, the training samples are of prime importance: they should reflect the recent variations of the tracked target and keep enough diversity to adapt to target appearance variations. In the first frame, a set of training samples is collected. First, the initialized target is selected as a training sample. In the meantime, the other training samples are selected by perturbing a few pixels around the center location of the tracked target.
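The alternating updates can be sketched as follows: the V-step takes a single proximal-gradient pass on the smooth terms (fidelity, within-class scatter, and the regularizer) followed by soft-thresholding for the l1 term, and the D-step solves the least-squares problem in closed form. The parameter values and the single-pass V-step are simplifying assumptions for illustration.

```python
import numpy as np

def soft_threshold(Z, t):
    """Elementwise soft-thresholding for the l1 term."""
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def learn_dictionary(T, n_atoms, lam1=0.001, lam2=0.01, eta=1.0, n_iter=50):
    """Alternate a proximal-gradient V-step with a least-squares D-step.

    For a single (positive) class, tr(S_W(V)) is sum_i ||v_i - m||^2 with m
    the mean code, whose gradient with respect to V is 2 (V - m).
    """
    d, n = T.shape
    rng = np.random.default_rng(0)
    D = rng.standard_normal((d, n_atoms))
    D /= np.linalg.norm(D, axis=0)            # unit-norm initial atoms
    V = np.zeros((n_atoms, n))
    for _ in range(n_iter):
        # V-step (D fixed): gradient of fidelity + Fisher + ridge, then shrink
        L = 2.0 * np.linalg.norm(D, 2) ** 2 + 2.0 * lam2 * (1.0 + eta)
        m = V.mean(axis=1, keepdims=True)
        grad = 2.0 * D.T @ (D @ V - T) + 2.0 * lam2 * ((V - m) + eta * V)
        V = soft_threshold(V - grad / L, lam1 / L)
        # D-step (V fixed): closed-form minimizer of ||T - D V||_F^2
        D = T @ np.linalg.pinv(V)
    return D, V
```

In practice the atoms are often renormalized after each D-step; this is omitted here for brevity.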
In order to keep the diversity of the learnt dictionary with respect to appearance variations, we set the size of the training sample set to 25. In subsequent frames, we update the training samples and relearn the dictionary to adapt to target appearance variations. At the current frame, once the tracked target state is computed and located, we crop the corresponding image region and extract a feature vector as a new training sample. Then, the new training sample is added to the set of current training samples and the oldest training sample is swapped out.
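The fixed-size, first-in-first-out training set described above maps naturally onto a bounded deque. This small sketch assumes feature vectors are precomputed and stands in for the paper's sample-collection code.

```python
from collections import deque

BUFFER_SIZE = 25          # size of the training set, as in the text

# deque(maxlen=...) drops the oldest entry when a new one is appended.
samples = deque(maxlen=BUFFER_SIZE)

def add_sample(feature_vector):
    """Add the newest tracking result; the oldest sample is swapped out."""
    samples.append(feature_vector)
```

The bounded deque gives the swap-out behavior for free, with O(1) insertion.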

Likelihood Evaluation.
The similarity between a target candidate and its dictionary-based representation is an important issue in visual tracking. In this work, the similarity is measured as
$$d(y, D\alpha) = \|y - D\alpha\|_2^2 + \gamma \|\alpha\|_1,$$
where $D$ is the learnt dictionary and $\alpha$ is the coefficient vector computed by the sparse coding in the target representation.
Based on the distance between a target candidate and its representation over the dictionary, the target observation likelihood is computed as
$$p(y \mid x) \propto \exp\!\left(-\frac{d(y, D\alpha)}{\sigma^2}\right),$$
where $d(y, D\alpha)$ is the distance between a target candidate $y$ and its reconstruction $D\alpha$, $\sigma$ is the standard deviation of the Gaussian, and $\gamma$ is the positive parameter in the distance term.
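A compact sketch of the likelihood evaluation: the distance combines the reconstruction error with an l1 penalty on the codes and is mapped through a Gaussian kernel. The default values of `gamma` and `sigma` here are illustrative, not the paper's settings.

```python
import numpy as np

def observation_likelihood(y, D, alpha, gamma=0.01, sigma=1.0):
    """Gaussian-kernel likelihood from reconstruction error plus l1 penalty."""
    dist = np.sum((y - D @ alpha) ** 2) + gamma * np.sum(np.abs(alpha))
    return float(np.exp(-dist / sigma ** 2))
```

A candidate that is perfectly reconstructed with small codes gets a likelihood near 1, while a poorly reconstructed candidate decays toward 0.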

Visual Tracking with Dictionary-Based Representation.
By integrating the proposed target representation, the online dictionary learning and updating, and the observation evaluation, the proposed visual tracking algorithm is outlined in Algorithm 1. The particle filter framework is used for all video sequences. For a video sequence, the tracked target is manually selected by a bounding box in the first frame. A set of particles (i.e., target candidates) is then sampled with equal weights in the particle filter framework, the training samples are collected, and a dictionary is learnt in the first frame. In the subsequent tracking process, once the current target state is estimated, the current tracking result is added to the training samples and the dictionary is relearnt from the updated training set.
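The overall loop can be summarized structurally as below. The `score` function is a toy stand-in for the dictionary-based likelihood, the state is a single scalar purely to keep the sketch runnable, and the dictionary relearning step is indicated by a comment.

```python
import random

def track(frames, init_state, n_particles=300, buffer_size=25):
    """Structural sketch of the tracking loop: sample, score, estimate, update."""
    def score(frame, particle):
        # Toy stand-in for the dictionary-based likelihood: particles
        # near the frame's (scalar) target location get larger weights.
        return 1.0 / (1.0 + abs(particle - frame))

    state = init_state
    training = [state]                       # training samples from results
    for frame in frames:
        # sample particles around the previous state (Gaussian motion model)
        particles = [state + random.gauss(0.0, 1.0) for _ in range(n_particles)]
        weights = [score(frame, p) for p in particles]
        total = sum(weights)
        # weighted-mean state estimate
        state = sum(w * p for w, p in zip(weights, particles)) / total
        training.append(state)               # newest tracking result in
        if len(training) > buffer_size:
            training.pop(0)                  # oldest sample out
        # (the dictionary would be relearnt from `training` here)
    return state
```

Running this on a sequence of frames whose true target location is fixed shows the estimated state drifting toward that location over successive frames.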

Experiments
We conduct comprehensive experiments on challenging video sequences and compare the proposed tracking algorithm against some state-of-the-art tracking algorithms: Struck [12], SCM [18], VTD [3], Frag [2], L1 [15], LSHT [4], LRT [19], and TGPR [20]. For fairness, we use the source codes or binary codes provided by the authors and initialize all the evaluated algorithms with default parameters in all experiments. Eight challenging video sequences from a recent benchmark [1] are used to evaluate the tracking performance. Table 2 shows the main challenging attributes of these test sequences. The proposed tracking algorithm is implemented in MATLAB. All experiments are conducted on a PC with an Intel(R) Core(TM) i5-2400 3.10 GHz CPU and 4 GB memory. The number of particles is set to 300. The target features are described by histograms of sparse coding (HSC) [31]. The balancing parameter in the likelihood evaluation is set to 20. In the proposed tracking algorithm, the number of atoms of a learnt dictionary is set to 25.
The average processing time of the proposed tracking algorithm is 1.55 frames per second (FPS). We show the average tracking speed for each sequence in Table 1. Compared with some state-of-the-art tracking algorithms [1], the proposed tracking algorithm is faster than SCM but slower than some others, e.g., Struck, VTD, and LSHT. This is because the online dictionary learning spends time optimizing the target representation. We could learn the dictionary only every five frames, but this may hurt the dictionary's adaptation to complicated tracking environments. In our tracking algorithm, the dictionary is therefore learnt in each frame.

Quantitative Evaluation.
We use four evaluation measures in the experiments: average center location error, success rate, overlap rate, and precision. These measures are adopted in a recent tracking benchmark [1].
Figure 1 shows the precision plots for the 9 tracking algorithms. The average center location errors (CLE) are shown in Table 3. From Figure 1 and Table 3, we can see that the proposed tracking algorithm obtains one of the best two results in six of the eight sequences and achieves the smallest average CLE over all 8 sequences. TGPR achieves robust tracking results on three of the video sequences, LRT obtains the best tracking results on three of the video sequences, and Struck tracks two of the video sequences well, achieving the best CLE on them.
Table 4 presents success rates for the 9 tracking algorithms on the 8 sequences, and Figure 2 shows the corresponding success rate plots. From Table 4 and Figure 2, it can be seen that the proposed tracker performs well and obtains the best tracking results in most of the video sequences. Additionally, TGPR achieves accurate tracking results in two of the video sequences, LRT achieves the best tracking results in three of the video sequences, and LSHT obtains the best tracking results in three of the video sequences.
Table 5 presents the average overlap rates for the 9 tracking algorithms. It can be seen that the proposed tracking algorithm achieves the best or second-best tracking results in most video sequences. Struck, LSHT, LRT, and TGPR also achieve robust tracking results in some video sequences.

Qualitative Evaluation.
Next, we analyze the tracking performance of these tracking algorithms on the 8 video sequences.
In the sequence shown in Figure 3(a), the tracked target is a coupon book against a cluttered background. When the target is occluded, VTD and L1 drift away from it. Frag, TGPR, VTD, and L1 lose the target and track another similar object until the end of the sequence. Struck, SCM, LSHT, LRT, TGPR, and the proposed tracking algorithm can accurately track the target throughout the video sequence.
Figure 3(b) presents some tracking results of the 9 tracking algorithms on the second sequence. The proposed tracking algorithm can learn the appearance variations in the dictionary learning process and accurately tracks the target throughout the video sequence. Struck, LSHT, LRT, and TGPR also track the target until the end of the video sequence and achieve accurate results.
In the sequence shown in Figure 3(c), a football match is going on. The tracked target is a player who is similar to the other players in color and shape. The target undergoes partial occlusion, background clutter, and in-plane and out-of-plane rotations. Due to the influence of background clutter, VTD and L1 lose the target. When similar objects appear around the target, Struck, SCM, Frag, and LSHT follow these distracters and lose the target. In contrast, LRT, TGPR, and the proposed tracking algorithm track the target successfully.
As shown in Figure 3(d), the tracked target undergoes out-of-plane and in-plane rotations, and some other objects are similar to it. Frag loses the target when a very similar distracter appears near it, and L1 and Frag produce inaccurate results when the target rotates in-plane and out-of-plane. Struck, TGPR, and the proposed tracking algorithm can track the target throughout the sequence; among these three, the proposed algorithm achieves the most accurate results in average center location error, success rate, and overlap rate.
Figure 3(e) shows some tracking results on a sequence whose target is a moving face in an indoor room. Influenced by drastic illumination variations, VTD, Frag, and TGPR lose the target after the 40th frame until the end of the sequence. Struck, SCM, LSHT, LRT, and the proposed tracking algorithm successfully track the whole sequence, with the proposed tracking algorithm obtaining the most robust result.
The video sequence shown in Figure 3(f) is captured on an indoor stage with drastic illumination variations. The target is also affected by nonrigid deformation, background clutter, and in-plane and out-of-plane rotations. Struck, SCM, L1, and LRT track the target only up to about the 47th frame because of the appearance variations. VTD obtains inaccurate scale estimates when the target undergoes nonrigid deformation. The proposed tracking algorithm successfully tracks the target through the whole sequence: the learnt dictionary covers the target variations, so a linear combination of its atoms can represent these varying appearances.
In the video sequence shown in Figure 3(g), the target is a moving toy that undergoes illumination variations and in-plane and out-of-plane rotations. L1 loses the target due to the illumination variations and the rotation from up to down, and only relocates it after the 495th frame when the target rotates back from down to up. When the target is affected by illumination variation, LRT drifts away from it until the end of the video sequence. Frag uses fixed target templates, so it cannot adapt to the appearance variations and achieves inaccurate tracking results. Compared with these algorithms, the proposed tracking algorithm obtains more accurate tracking results, which is attributed to the fact that the proposed target representation can learn the appearance variations.
As shown in Figure 3(h), the tracked target is a walking man in an outdoor scene. Due to partial occlusion, nonrigid deformation, and scale variation, L1 loses the target for the rest of the sequence. VTD, TGPR, and LSHT cannot accurately estimate the scale variation. SCM, Struck, LRT, and the proposed tracking algorithm obtain more accurate tracking results.
From the above analysis, we can see that the proposed appearance model is effective. The proposed tracking algorithm is robust to significant appearance variations, e.g., drastic illumination variations, partial occlusion, and out-of-plane rotation.

Conclusion
We have presented an effective tracking algorithm based on a learnt discrimination dictionary. Different from existing tracking algorithms, a target candidate is represented by a linear combination of dictionary atoms. The dictionary is learnt and updated during tracking, which allows it to capture target appearance variations and exploit the discriminative information in the learning samples, which are collected from previous tracking results. The proposed tracking algorithm is robust to drastic illumination variations, nonrigid deformation, and rotation. Experiments on challenging video sequences demonstrate its robustness in comparison with state-of-the-art tracking algorithms.

Figure 1 :
Figure 1: Precision plots in terms of location error threshold (in pixels).

Figure 2 :
Figure 2: Success plots in terms of overlap threshold.

Table 2 :
The main attributes of the 8 video sequences. Target size: the initial target size in the first frame; OPR: out-of-plane rotation; IPR: in-plane rotation; BC: background clutter; IV: illumination variation; Occ: occlusion; Def: deformation; SV: scale variation.

Table 3 :
Average center location errors (in pixels). The best two results are shown in italic and bold fonts, respectively.

Table 4 :
Success rates (%). The best two results are shown in italic and bold fonts, respectively.