
Neurocomputing

Volume 100, 16 January 2013, Pages 31-40

Robust visual tracking based on online learning sparse representation

https://doi.org/10.1016/j.neucom.2011.11.031

Abstract

Handling appearance variations is a very challenging problem in visual tracking. Existing methods usually address it with an appearance model that satisfies two requirements: (1) it can discriminate the tracked target from its background, and (2) it is robust to the target's appearance variations during tracking. Instead of integrating both requirements into a single appearance model, in this paper we propose a tracking method that deals with them separately, based on sparse representation in a particle filter framework. Each target candidate defined by a particle is linearly represented by the target and background templates plus an additive representation error. Discriminating the target from its background is achieved by activating either the target templates or the background templates in the linear system in a competitive manner. The target's appearance variations are modeled directly as the representation error, and an online algorithm learns the basis functions that sparsely span this error. The linear system is solved via ℓ1 minimization, and the candidate with the smallest reconstruction error over the target templates is selected as the tracking result. We test the proposed approach on four sequences with heavy occlusions, large pose variations, drastic illumination changes and low foreground-background contrast. The proposed approach shows excellent performance in comparison with two recent state-of-the-art trackers.

Introduction

The purpose of visual tracking is to estimate the state of a tracked target in a video. It has wide applications such as intelligent video surveillance, advanced human-computer interaction and robot navigation. Tracking is usually formulated as a search task: an appearance model first represents the target, and a search strategy then infers the target's state in the current frame. Therefore, effectively modeling the target's appearance and accurately inferring its state from all candidates are the two key steps of a successful tracking system. Although a variety of tracking algorithms have been proposed over the past decades, visual tracking still cannot meet the requirements of practical applications. The main difficulty is designing a powerful appearance model that not only discriminates the target from its surrounding background but is also robust to the target's appearance variations. On the former issue, promising progress has been made recently by treating visual tracking as a two-class classification or detection problem, since many elegant features from the field of pattern recognition can be used to discriminate the target from its background. The latter, however, is very difficult to achieve because a large number of unpredictable appearance variations occur over time, such as pose changes, shape deformation, illumination changes and partial occlusion.

As shown in Fig. 1(a), the candidate marked by the blue rectangle is a “bad” candidate for tracking because it contains a large number of background pixels; an effective appearance model should not select it as the tracking result. In contrast, although the candidate marked by the red rectangle is partially occluded by the book, it is still a “good” candidate and should be selected. Existing appearance models in visual tracking cannot meet these requirements. For example, the HSV histogram has been widely used to model the target's appearance [1]. We show the HSV histograms of the target template and the two candidates in Fig. 1(b). Although an HSV histogram reflects the difference between the target template and the “bad” candidate, it also models the difference between the target template and the “good” candidate, as shown in the tails of the three histograms. Furthermore, when a similarity measure between two histograms is used as the data likelihood, the tracker may wrongly choose the “bad” candidate as the final tracking result. For example, when the Bhattacharyya coefficient is adopted to measure the similarity between two histograms, the similarity between the target template and the “bad” candidate is 0.852, which is larger than 0.829, the similarity between the target template and the “good” candidate. Therefore, traditional tracking methods that rely solely on an effective appearance model to achieve robust tracking are not always feasible.
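The failure mode described above can be reproduced with a few lines of code. The sketch below computes the Bhattacharyya coefficient between normalized histograms; the three toy histograms are illustrative assumptions, not the paper's actual HSV data, but they show how a candidate containing background pixels can still score higher than a partially occluded target.

```python
import numpy as np

def bhattacharyya(p, q):
    """Bhattacharyya coefficient between two normalized histograms.

    Returns a value in [0, 1]; 1 means identical distributions.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()  # normalize to probability distributions
    q = q / q.sum()
    return float(np.sum(np.sqrt(p * q)))

# Illustrative 3-bin histograms (not the paper's data): the "bad"
# candidate shares coarse color statistics with the template even
# though it contains background pixels, so it scores higher here.
template = [0.5, 0.3, 0.2]
good = [0.6, 0.3, 0.1]    # partially occluded target
bad = [0.5, 0.25, 0.25]   # contains background pixels
print(bhattacharyya(template, good))
print(bhattacharyya(template, bad))
```

With these toy numbers the “bad” candidate indeed obtains the larger coefficient, mirroring the 0.852 vs. 0.829 example in the text.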

Recently, sparse representation has attracted much attention in the field of computer vision [2], [3], [4]. In [2], a robust face recognition method was proposed whose robustness to occlusions was achieved by introducing an error vector into the sparse representation model. Motivated by this idea, Mei and Ling proposed a robust visual tracking method based on sparse representation [3], in which partial occlusion, appearance variations and other challenging issues are absorbed by an error vector represented by a set of trivial templates. We note that [2], [3] use the column vectors of the identity matrix as basis functions to linearly represent the error vector, which has several drawbacks. First, the sparsity assumption may be violated. Each basis function models whether a single pixel position is occluded, under the assumption that occluded pixels occupy a relatively small portion of the image, so that the error vector has few nonzero entries. This assumption may not hold in the real world, especially when the target is severely occluded during tracking. For example, as shown in Fig. 1(a), the candidate marked by the red rectangle has almost half of the face occluded by the book; in this case the representation coefficients are not sparse, as shown in Fig. 2(a). Second, there are no basis vectors corresponding to the background region. The candidate marked by the blue rectangle in Fig. 1(a) contains some background pixels, so its representation coefficients will also not be sparse, as shown in Fig. 2(b). Finally, Mei's method uses a fixed basis to represent the error vector throughout tracking, whereas in practice the error may change with the environment over time; a fixed basis determined before tracking begins cannot effectively adapt to these changes.
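The first drawback can be made concrete numerically. In the sketch below (sizes and data are assumed for illustration, and the notation y = Ta + e is ours, not copied from [3]), a candidate y is generated from target templates T, then half of its pixels are overwritten by an occluder. Because each trivial template of the identity basis accounts for exactly one pixel, the error vector needs one nonzero coefficient per occluded pixel, so under 50% occlusion half the coefficients are nonzero and the representation is not sparse:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 100                  # candidate image, flattened to d pixels
T = rng.random((d, 5))   # 5 target templates as columns
a = np.array([0.6, 0.4, 0.0, 0.0, 0.0])

y = T @ a                # an unoccluded candidate on the target
occluded = rng.choice(d, size=d // 2, replace=False)  # 50% occlusion
y_occ = y.copy()
y_occ[occluded] = rng.random(occluded.size)           # occluder pixels

# With identity trivial templates, the error e = y_occ - T a needs one
# nonzero entry per occluded pixel: here, half of all coefficients.
e = y_occ - T @ a
print(np.count_nonzero(np.abs(e) > 1e-12), "of", d, "error coefficients nonzero")
```

This is exactly the severe-occlusion regime of Fig. 2(a), where the trivial-template coefficients stop being sparse.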

In this paper, motivated by online basis learning for sparse coding [5], we propose a novel tracking framework based on online learning sparse representation, which overcomes the aforementioned problems and achieves more robust results. The key idea is to meet the two requirements, discriminating the target from its background and being robust to its appearance variations, through separate mechanisms within a single sparse representation in a particle filter framework. Specifically, we sparsely represent each target candidate defined by a particle using target templates, background templates and an error basis. Including both target and background templates in the sparse representation makes it possible to discriminate the target from its background: for a candidate covering the target region, only the target templates play key roles in the linear representation, whereas for a candidate covering a background region, only the background templates do. The target's appearance variations are modeled as the error of the linear representation. This error is spanned by a set of basis functions that are learned online as new observations become available. The representation coefficients are computed via ℓ1 minimization. Each candidate is weighted in the particle filter framework according to its residual when projected onto the target templates, and the tracking result is the weighted mean of all particles. The rationale behind the proposed method is twofold. First, although appearance variations are unpredictable during tracking, they can still be compactly represented in a certain subspace, so using a learned basis to represent them keeps the representation coefficients sparse. Second, online basis learning can adaptively model unpredictable appearance variations during tracking, thereby improving the robustness of the tracking method.
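The linear system described above can be sketched in a few lines. The code below is a toy illustration, not the paper's implementation: the dictionary sizes and data are assumed, and a plain ISTA loop stands in for whatever ℓ1 solver the authors use. It stacks target templates T, background templates B and an error basis D into one dictionary, solves the ℓ1-regularized system, and scores the candidate by its reconstruction residual over the target templates alone:

```python
import numpy as np

def soft(x, t):
    """Soft-thresholding operator (proximal map of the l1 norm)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_ista(A, y, lam=0.05, n_iter=500):
    """Solve min_x 0.5*||Ax - y||^2 + lam*||x||_1 by ISTA."""
    L = np.linalg.norm(A, 2) ** 2  # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft(x - (A.T @ (A @ x - y)) / L, lam / L)
    return x

# Toy version of the paper's linear system (all sizes/data assumed):
rng = np.random.default_rng(1)
d, nT, nB, nD = 60, 4, 4, 8
T = rng.random((d, nT))            # target templates
B = rng.random((d, nB))            # background templates
D = rng.standard_normal((d, nD))   # (stand-in for a learned) error basis
A = np.hstack([T, B, D])

y = 0.7 * T[:, 0] + 0.3 * T[:, 1]  # a candidate lying on the target
x = lasso_ista(A, y)
aT = x[:nT]                        # coefficients on the target templates
residual = np.linalg.norm(y - T @ aT)
print("target-template residual:", residual)
```

In the full method this residual would set the particle's weight, with smaller residuals over the target templates indicating better candidates.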

The rest of the paper is organized as follows. Section 2 summarizes related work on visual tracking and sparse representation. Section 3 details the tracking algorithm based on online learning sparse representation. Experimental results on four sequences are reported in Section 4, and Section 5 concludes the paper.

Section snippets

Visual tracking

In this subsection, in order to clarify the motivations of the proposed method, we review the various appearance models and inference methods for visual tracking.

Tracking based on online learning sparse representation

In this section, we give the details of the proposed tracking method based on online learning sparse representation. We briefly review the particle filter based tracking framework and then formulate visual tracking as a sparse representation problem that simultaneously models the target templates, the background templates and the error basis in a linear system. The online learning of the error basis and the template update are then introduced, followed by a speed-up strategy.
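One building block mentioned here, updating the error basis as new observations arrive, can be sketched as a single online step in the spirit of online dictionary learning [5]. This is a simplified illustration, not the paper's update rule: the step size, matrix sizes and data are assumptions, and one gradient step on the reconstruction error followed by column renormalization stands in for the full algorithm.

```python
import numpy as np

def update_basis(D, e, c, lr=0.1):
    """One online update of error basis D from a new error sample e
    with sparse code c: gradient step on 0.5*||Dc - e||^2, then
    renormalize columns to unit norm."""
    D = D - lr * np.outer(D @ c - e, c)  # gradient of 0.5*||Dc - e||^2 in D
    D = D / np.maximum(np.linalg.norm(D, axis=0), 1e-12)
    return D

rng = np.random.default_rng(2)
D = rng.standard_normal((30, 6))
D /= np.linalg.norm(D, axis=0)                 # unit-norm initial basis
c = np.array([1.0, 0.0, 0.5, 0.0, 0.0, 0.0])   # sparse code for e
e = D @ c + 0.01 * rng.standard_normal(30)     # new error observation
D_new = update_basis(D, e, c)
print("residual after update:", np.linalg.norm(D_new @ c - e))
```

In the tracker, such an update would run whenever a new frame yields a fresh representation error, letting the basis track slow changes in the environment.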

Experimental results

In order to validate the effectiveness of the proposed method, we conduct experiments on four challenging sequences that involve severe appearance variations, including heavy occlusions, large pose variations, drastic illumination changes and low foreground-background contrast. The proposed method is compared with two recent state-of-the-art methods, Incremental Visual Tracking (IVT) [20] and ℓ1 minimization tracking (L1) [3]. Note that for all test sequences, we set main

Conclusion

In this paper we propose a robust visual tracking algorithm based on online learning sparse representation. The main contribution is that we integrate the two requirements of a desirable appearance model, discriminating the tracked target from the background and being robust to appearance variations, into a linear representation system in which each target candidate is represented by target templates, background templates and an online learned error basis. To the best of our knowledge, it is the first time

Acknowledgement

This work is supported by the National Natural Science Foundation of China (Grant no.: 61071180 and Key Program Grant no.: 61133003). Shengping Zhang is also supported by funding of Ph.D. student short-term visiting abroad from HIT. Huiyu Zhou is currently supported by UK EPSRC Grant EP/G034303/1 and Invest NI.

Shengping Zhang received the M.S. degree in computer science from the Harbin Institute of Technology, Harbin, China, in 2008. Currently, he is pursuing the Ph.D. degree. His research interests focus on computer vision and pattern recognition, especially on moving objects detection and tracking.

References (39)

  • H. Zhou et al., Object tracking using sift features and mean shift, Comput. Vision Image Understanding (2009)
  • H. Zhou et al., Efficient tracking and ego-motion recovery using gait analysis, Signal Process. (2009)
  • B. Olshausen et al., Sparse coding with an overcomplete basis set: a strategy employed by V1?, Vision Res. (1997)
  • P. Pérez, C. Hue, J. Vermaak, M. Gangnet, Color-based probabilistic tracking, in: Proceedings of European Conference on...
  • J. Wright et al., Robust face recognition via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell. (2009)
  • X. Mei, H. Ling, Robust visual tracking using L1 minimization, in: Proceedings of the 12th International Conference on...
  • S. Zhang, H. Yao, X. Sun, S. Liu, Robust object tracking based on sparse representation, in: Proceedings of SPIE...
  • J. Mairal, F. Bach, J. Ponce, G. Sapiro, Online dictionary learning for sparse coding, in: Proceedings of the 26th...
  • G. Bradski, Computer vision face tracking as a component of a perceptual user interface, Intell. Technol. J. (1998)
  • M. Isard, A. Blake, Contour tracking by stochastic propagation of conditional density, in: Proceedings of the 4th...
  • A. Shahrokni, T. Drummond, P. Fua, Fast texture-based tracking and delineation using texture entropy, in: Proceedings...
  • S. Zhang, H. Yao, S. Liu, Partial occlusion robust object tracking using an effective appearance model, in: Proceedings...
  • S. Zhang, H. Yao, P. Gao, Robust object tracking combining color and scale invariant features, in: Proceedings of SPIE...
  • S. Zhang, H. Yao, S. Liu, Robust visual tracking using feature-based visual attention, in: Proceedings of IEEE...
  • M. Swain et al., Color indexing, Int. J. Comput. Vision (1991)
  • D. Comaniciu et al., Kernel-based object tracking, IEEE Trans. Pattern Anal. Mach. Intell. (2003)
  • S. Birchfield, R. Sriram, Spatiograms versus histograms for region-based tracking, in: Proceedings of IEEE Conference...
  • Q. Zhao, H. Tao, Object tracking using color correlogram, in: Proceedings of IEEE Workshop Performance Evaluation of...
  • A. Adam, E. Rivlin, I. Shimshoni, Robust fragments-based tracking using the integral histogram, in: Proceedings of IEEE...


    Hongxun Yao received the B.S. and M.S. degrees in computer science from the Harbin Shipbuilding Engineering Institute, Harbin, China, in 1987 and in 1990, respectively, and the Ph.D. degree in computer science from Harbin Institute of Technology in 2003. Currently, she is a Professor with the School of Computer Science and Technology, Harbin Institute of Technology. Her research interests include pattern recognition, multimedia technology, and human–computer interaction technology. She has published three books and over 100 scientific papers.

    Huiyu Zhou obtained his Bachelor of Engineering degree in Radio Technology from the Huazhong University of Science and Technology of China, and a Master of Science degree in Biomedical Engineering from the University of Dundee of United Kingdom, respectively. He was then awarded a Doctor of Philosophy degree in Computer Vision from the Heriot-Watt University, Edinburgh, United Kingdom, where he worked with Professors Andy Wallace and Patrick Green for his Ph.D. thesis entitled “Efficient Ego-motion Tracking and Obstacle Detection Using Gait Analysis”.

    Xin Sun received the B.S. and M.S. degree in computer science from the Harbin Institute of Technology, Harbin, China, in 2008 and 2010, respectively. Currently, she is pursuing the Ph.D. degree. Her research interests focus on computer vision and pattern recognition, especially on moving objects detection and tracking.

    Shaohui Liu received the B.S. and M.S. degrees in computational mathematics from Harbin Institute of Technology in 1999 and 2001, respectively, and the Ph.D. degree in computer science from the same university in 2007. He is currently in the faculty of School of Computer Science and Technology at Harbin Institute of Technology. His research interests include computer vision and image and video processing. In particular, he has worked on algorithms for multimedia security, video surveillance and coding. He is the (co-)author of more than 40 papers.
