Neural Networks

Volume 87, March 2017, Pages 132-148
Constructing a meta-tracker using Dropout to imitate the behavior of an arbitrary black-box tracker

https://doi.org/10.1016/j.neunet.2016.12.009

Abstract

Imitating the behaviors of an arbitrary visual tracking algorithm enables many higher-level tasks such as tracker identification and efficient tracker fusion. It is also useful for discovering the features essential to a black-box tracker or for learning from several trackers to form a super-tracker. In this study, we propose a non-linear feature fusion framework, “MIMIC”, that imitates many popular trackers by mixing a pool of heterogeneous features. The MIMIC framework consists of two subtasks: feature selection and feature weight tuning. These subtasks, however, tend to suffer from overfitting when the number of videos available for training is limited. To address this issue, we incorporated the Dropout algorithm into the training, which grants the trained MIMIC tracker a high degree of generalization. Extensive experiments confirmed the effectiveness of the proposed framework, supporting its application to a range of related tasks in visual tracking.

Introduction

Every year, many off-the-shelf trackers are introduced to address various visual tracking challenges. Typically, each study targets one or a few of these challenges: appearance changes (Babenko et al., 2009, Han and Davis, 2005, Henriques et al., 2012, Ross et al., 2008, Zhang et al., 2012, Zhong et al., 2012), abrupt/fast motion (Kwon and Lee, 2008, Zhou et al., 2012), target/non-target confusion (Dinh, Vo, & Medioni, 2011), long-term tracking (Kalal, Mikolajczyk, & Matas, 2012), background clutter (Avidan, 2007, Nguyen and Smeulders, 2004), irregular scale changes (Ning, Zhang, Zhang, & Wu, 2012), occlusions (Bao, Wu, Ling, & Ji, 2012) and moving cameras. However, large-scale benchmarks (Smeulders et al., 2015, Wu et al., 2015) suggested that TLD (Kalal et al., 2012), STRUCK (Hare, Saffari, & Torr, 2011), MIL (Babenko et al., 2009), FBT (Chu & Smeulders, 2012) and SCM (Zhong et al., 2012) have superior overall performance across different challenges. Many of these trackers have solved issues that trouble other methods. While mastering all of these domains to construct the holy grail of trackers is difficult, fusing these trackers to integrate their merits into a unified framework has not been very successful either (Martín & Martínez, 2014). The biggest obstacles to constructing such a holy-grail tracker from currently realized ideas are the complexity of the integrating mechanisms and the contradictory objectives pursued by each of them.

In this study, we propose a framework, MIMIC, to imitate the behaviors of an arbitrary black-box tracker using a relatively simple non-linear tracker that draws on a pool of various features. This study is based on the premise that a linear combination of sufficiently expressive features, along with a flexible but still simple non-linear observation model, can roughly approximate the behaviors of many popular trackers.

Employing multiple features to track objects has been investigated extensively in the visual tracking literature. Frameworks such as particle filters accommodate feature fusion seamlessly (Perez, Vermaak, & Blake, 2004). Although automatic adjustment of feature weights has been the focus of some studies (Chau, Thonnat, Bremond, & Corvee, 2014), many studies opt to select an effective subset of the implemented features to track with, yet the performance depends drastically on how they approach feature extraction and selection. Different features may be obtained by applying various pre-defined templates to a single image patch (Shi and Tomasi, 1994, Viola and Jones, 2004), or be constructed by applying a particular function (e.g., scale) to a single image patch with different parameters (Kwon & Lee, 2010); alternatively, a “feature pool” may consist of several heterogeneous features (Chen, Liu, & Fuh, 2004).

Each employed feature lends its characteristics to the tracker, so the collective behavior of the tracker is affected directly by its active features and/or their corresponding weights. Some features have prominent effects on the tracker, granting it a certain degree of invariance against environmental changes. For instance, incremental PCA gives IVT (Ross et al., 2008) illumination invariance, and sparse-coding-based features render a tracker robust against partial occlusions (Bao et al., 2012). Thus, it is natural to attribute the behaviors of a specific tracker to its employed features. Moreover, the behavior of complicated trackers appears to be closely emulated by fusing several features together.

The behaviors of many existing trackers can be imitated by selecting the right set of features and/or proper weights for them. Without proper features, the observation model cannot deal with occlusion or sudden illumination changes, for instance, and hence fails to track the objects. In this study, we propose a unified framework to emulate the behaviors of these trackers by a proper mixture of feature responses. When features are combined in a tracking-by-detection tracker (such as the Ensemble tracker; Avidan, 2007) or in an optimization framework (such as the Mean-shift tracker; Comaniciu, Ramesh, & Meer, 2003), they lose the flexibility to mimic other trackers. Imitating trackers’ behavior while they are dealing with different tracking challenges, such as occlusions and illumination variations, complicates the imitation problem even more. Such challenges are managed by a combination of specialized features and stochastic sampling in some trackers (e.g., IVT, Ross et al., 2008; MIL, Babenko et al., 2009; or L1APG, Bao et al., 2012), so deterministic feature fusion methods do not work well for imitating their behaviors. Sophisticated trackers (e.g., SCM; Zhong et al., 2012), on the other hand, cannot be approximated with linear motion models or dense sampling. Linear observation models (e.g., in the Kalman Filter tracker; Čehovin, Kristan, & Leonardis, 2011) cannot emulate the non-stationarity of the scene observation, and moreover, more complicated sampling and feature fusion are required to imitate highly adaptive trackers such as STRUCK (Hare et al., 2011). Dense sampling in effect acts like the sliding-window scheme in tracking-by-detection methods, and hence cannot reproduce well the behaviors of probabilistic trackers. A non-linear, non-Gaussian probabilistic framework such as the multi-cue Particle Filter tracker (Perez et al., 2004) is a natural choice to address these issues.

To validate our hypothesis (i.e., that a linear combination of proper features in a non-linear observation model emulates most complicated trackers), a rich pool of features is provided to a particle filter tracker with a few hundred particles and a random-walk motion model. The goal of this tracker is to closely imitate the behaviors of a specific target tracker, in the sense of tracking overlap on unseen test videos. To achieve this, our tracker adjusts the weights of the features in the feature pool based on the outputs of the target tracker while it operates on several videos used as the training data.
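As a rough illustration of the setup described above, the following minimal sketch runs a particle filter with a random-walk motion model whose observation score is a linear combination of feature responses passed through an exponential likelihood. The two toy feature functions, the exponential form of the likelihood, the noise level `sigma` and the multinomial resampling are all assumptions made for this example; they are not taken from the paper.

```python
import math
import random

def pf_step(particles, weights, feature_fns, rng, sigma=4.0):
    """One predict/update/estimate/resample cycle for (x, y) particles."""
    # Predict: random-walk motion model (Gaussian jitter around each particle).
    moved = [(x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma))
             for x, y in particles]
    # Update: a linear combination of feature scores inside a non-linear
    # (exponential) likelihood. The exponential form is an assumption.
    scores = [sum(w * f(p) for w, f in zip(weights, feature_fns)) for p in moved]
    top = max(scores)  # subtract the maximum for numerical stability
    likelihood = [math.exp(s - top) for s in scores]
    total = sum(likelihood)
    probs = [l / total for l in likelihood]
    # Estimate: probability-weighted mean location.
    estimate = (sum(q * x for q, (x, _) in zip(probs, moved)),
                sum(q * y for q, (_, y) in zip(probs, moved)))
    # Resample so the next step starts from likely locations.
    resampled = rng.choices(moved, weights=probs, k=len(moved))
    return resampled, estimate

# Toy usage: two hypothetical cues that each peak at a fixed target location.
rng = random.Random(0)
target = (50.0, 30.0)
features = [
    lambda p: -((p[0] - target[0]) ** 2) / 100.0,  # hypothetical cue 1
    lambda p: -((p[1] - target[1]) ** 2) / 100.0,  # hypothetical cue 2
]
particles = [(rng.uniform(0, 100), rng.uniform(0, 100)) for _ in range(300)]
for _ in range(20):
    particles, estimate = pf_step(particles, [1.0, 1.0], features, rng)
```

In this toy setting the estimate settles near the target location; in an imitation setting, the feature weights passed to `pf_step` would be the quantities tuned against the target tracker's outputs.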

Our tracker, MIMIC, extracts a variety of features from the image frame and weighs them in the observation model so as to minimize the mismatch between its estimate of the object location and that of the target tracker. These weights constitute the parameters to be tuned through supervised learning. Considering the high non-linearity of the objective function representing the mismatch to be minimized, this supervised learning was performed by an evolutionary algorithm, which searches the high-dimensional parameter space for the best combination of feature weights. To select a subset of effective features from the feature pool, the optimization should set the weights of many unnecessary features to zero, which in effect amounts to feature selection. This is done by imposing an L1 regularization term on the objective function.
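The idea of an L1-penalized objective searched by an evolutionary algorithm can be sketched as follows. This excerpt does not specify which evolutionary algorithm or mismatch measure is used, so the example assumes a simple (1+lambda)-style search with Gaussian mutation and a toy quadratic mismatch; the names `objective`, `evolve`, `hidden` and the penalty strength `lam` are hypothetical.

```python
import random

def objective(weights, mismatch, lam=0.05):
    """Mismatch to the target tracker plus an L1 penalty on the weights."""
    return mismatch(weights) + lam * sum(abs(w) for w in weights)

def evolve(mismatch, dim, rng, generations=300, offspring=10, step=0.1, lam=0.05):
    """(1+lambda)-style search: replace the parent whenever a mutant scores lower."""
    best = [rng.random() for _ in range(dim)]
    best_cost = objective(best, mismatch, lam)
    for _ in range(generations):
        for _ in range(offspring):
            # Gaussian mutation; clipping at zero keeps weights non-negative
            # and lets unnecessary features reach a weight of exactly zero.
            child = [max(0.0, w + rng.gauss(0.0, step)) for w in best]
            cost = objective(child, mismatch, lam)
            if cost < best_cost:
                best, best_cost = child, cost
    return best, best_cost

# Toy usage: only features 0 and 2 matter in this stand-in mismatch function.
hidden = [0.8, 0.0, 0.3, 0.0]
mismatch = lambda w: sum((a - b) ** 2 for a, b in zip(w, hidden))
best, best_cost = evolve(mismatch, 4, random.Random(0))
```

The L1 term makes any non-zero weight on an irrelevant feature strictly costly, which is what pushes the search toward sparse weight vectors.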

To suppress over-fitting in this high-dimensional learning task, we employed “Dropout”, as the success of this regularization mechanism has been demonstrated in artificial neural networks (Hinton, Srivastava, Krizhevsky, Sutskever, & Salakhutdinov, 2012). By emphasizing optimization over a model ensemble, this algorithm explores different combinations of the features and prevents over-fitting by fitting the training dataset with a virtual ensemble of models that employ a variety of feature subsets. The proper combination of L1 regularization and the Dropout algorithm is expected to work well in the simultaneous task of feature selection and feature weighting, avoiding over-fitting even when the number of available videos is limited (Fig. 1).
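One way to picture Dropout applied to feature weights is to evaluate a candidate weight vector under several random feature subsets and average the resulting mismatch. The sketch below assumes independent Bernoulli masks and simple averaging; the paper's exact Dropout schedule is not given in this excerpt, and `dropout_mask`, `dropout_cost`, `keep_prob` and `n_masks` are hypothetical names.

```python
import random

def dropout_mask(n_features, keep_prob, rng):
    """Bernoulli mask: each feature is kept with probability keep_prob."""
    return [1 if rng.random() < keep_prob else 0 for _ in range(n_features)]

def dropout_cost(weights, mismatch, rng, keep_prob=0.5, n_masks=20):
    """Average the mismatch over randomly thinned feature subsets."""
    total = 0.0
    for _ in range(n_masks):
        mask = dropout_mask(len(weights), keep_prob, rng)
        # Dropped features contribute nothing to this evaluation.
        thinned = [w * m for w, m in zip(weights, mask)]
        total += mismatch(thinned)
    return total / n_masks

# Toy usage with a stand-in mismatch that simply sums the active weights.
rng = random.Random(1)
mask = dropout_mask(8, 0.5, rng)
cost_fn = lambda w: sum(w)
full_cost = dropout_cost([1.0, 2.0], cost_fn, rng, keep_prob=1.0)
empty_cost = dropout_cost([1.0, 2.0], cost_fn, rng, keep_prob=0.0)
```

Because a weight vector is scored across many thinned subsets, it cannot rely on one fragile combination of features, which is the intuition behind the virtual ensemble mentioned above.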

In summary, our contributions in this study are two-fold:

  • We proposed a unified framework to approximate the behaviors of an arbitrary tracker by employing a weighted subset of features selected from a feature pool in the observation model of our MIMIC tracker.

  • We introduced several applications of this framework including “tracker identification” in which a black-box tracker is identified among other trackers based on its tracking results, and also a couple of novel approaches to performing tracker fusion.

Following this introduction, Section 2 elaborates the proposed framework, and Section 3 demonstrates the wide range of applications of this framework. After the proposed framework is discussed in Section 4, our article is concluded in Section 5 with the outcome of this study and some future directions to improve it.

Method

MIMIC is implemented as a particle filter tracker (hereafter, PFT) employing a linear combination of features in its observation model. Intuitively, MIMIC aims to mirror a given target tracker by maximizing the overlap between its own tracking results and those of the target tracker, by adjusting the weights of the features selected from a predefined feature pool. Fig. 1 depicts its basic scheme.
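The overlap between two trackers' outputs can be measured in several ways. A common choice, assumed here for illustration rather than taken from the paper, is the intersection-over-union (IoU) of the two bounding boxes, averaged over frames. Boxes are assumed to be `(x, y, width, height)` tuples, and `iou` and `mean_overlap` are hypothetical helper names.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Width and height of the intersection rectangle (zero if disjoint).
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def mean_overlap(mimic_boxes, target_boxes):
    """Average per-frame IoU between the imitator and the target tracker."""
    pairs = list(zip(mimic_boxes, target_boxes))
    return sum(iou(a, b) for a, b in pairs) / len(pairs)

# Toy usage: two 2x2 boxes offset by one pixel in each direction.
example = iou((0, 0, 2, 2), (1, 1, 2, 2))  # intersection 1, union 7
```

Maximizing a quantity such as `mean_overlap` over the training videos is equivalent to minimizing a mismatch of the form `1 - mean_overlap`.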

According to the MIMIC framework, a sequence of video frames $I_t$ ($t \in \{1, 2, \ldots, \tau\}$) is provided to

Experiment

We first show the results of multiple experiments to verify how well MIMIC shadows a target tracker. We then extend this scheme by demonstrating its applicability to different tasks. Specifically, the first three experiments were performed to show that the MIMIC framework is capable of approximately achieving the performance of the target tracker (3.1), that it is robust in various scenarios (3.2), and that its intrinsic dynamics are fairly interpretable (3.3). The subsequent six

Discussion

In this study, we proposed a new perspective: a “unified tracker imitation framework”. Such a framework is suitable for approximating the behaviors of trackers that are intelligent but demand heavy computation, by means of a computationally cheap online tracker.

Additionally, the proposed framework made it possible to imitate humans in various tracking scenarios. This can serve as a tool to monitor different features as they try to capture the real world. Furthermore, this framework is capable of

Conclusions

In this study, we introduced a novel approach to imitate the behaviors of an arbitrary target tracker, called MIMIC. We constructed MIMIC based on the assumption that most of the popular trackers can be imitated by a linear fusion of several features to be embedded in the observation model of a non-linear tracker. We employed a particle filter tracker and a rich pool of features to verify this hypothesis. The extensive experiments justified the overall validity of this hypothesis, although

Acknowledgments

This work was supported by the Platform for Dynamic Approaches to Living Systems from the Japan Agency for Medical Research and Development (AMED) and by the Project of Next-Generation Core Robot and AI Technology Development from the New Energy and Industrial Technology Development Organization (NEDO).

References (85)

  • S. Belongie et al.

    Shape matching and object recognition using shape contexts

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2002)
  • K. Briechle et al.

    Template matching using fast normalized cross correlation

  • L. Čehovin et al.

    An adaptive coupled-layer visual model for robust visual tracking

  • H.T. Chen et al.

    Probabilistic tracking with adaptive feature selection

  • D.M. Chu et al.

    Color invariant SURF in discriminative object tracking

  • R.T. Collins et al.

    Online selection of discriminative tracking features

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2005)
  • D. Comaniciu et al.

    Kernel-based object tracking

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2003)
  • Csurka, G., Dance, C., Fan, L., Willamowski, J., & Bray, C. (2004). Visual categorization with bags of keypoints. In...
  • N. Dalal et al.

    Histograms of oriented gradients for human detection

  • Danelljan, M., Hager, G., Shahbaz Khan, F., & Felsberg, M. (2015). Learning spatially regularized correlation filters...
  • Dinh, T.B., Vo, N., & Medioni, G. (2011). Context tracker: Exploring supporters and distracters in unconstrained...
  • W.T. Freeman et al.

    The design and use of steerable filters

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (1991)
  • Y. Gao et al.

    Symbiotic tracker ensemble towards a unified tracking framework

    IEEE Transactions on Circuits and Systems for Video Technology

    (2014)
  • Grabner, H., Leistner, C., & Bischof, H. (2008). Semi-supervised on-line boosting for robust tracking. In...
  • B. Han et al.

    On-line density-based appearance modeling for object tracking

  • Han, Y., Yang, Y., & Zhou, X. (2013). Co-regularized ensemble for feature selection. In...
  • S. Hare et al.

    Struck: Structured output tracking with kernels

  • Henriques, J.F., Caseiro, R., Martins, P., & Batista, J. (2012). Exploiting the circulant structure of...
  • Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R.R. (2012). Improving neural networks by...
  • D.H. Hubel et al.

    Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex

    The Journal of Physiology

    (1962)
  • M. Isard et al.

    Condensation—conditional density propagation for visual tracking

    International Journal of Computer Vision

    (1998)
  • Johannes, M., Polson, N., & Stroud, J. (2006). Exact particle filtering and parameter learning. Technical Report....
  • L. Juan et al.

    A comparison of SIFT, PCA-SIFT and SURF

    International Journal of Image Processing

    (2009)
  • Z. Kalal et al.

    Tracking-learning-detection

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2012)
  • Y. Ke et al.

    PCA-SIFT: A more distinctive representation for local image descriptors

  • T. Kobayashi

    BoF meets HOG: feature extraction based on histograms of oriented pdf gradients for image classification

  • T. Kobayashi et al.

    Image feature extraction using gradient local auto-correlations

  • J.J. Koenderink et al.

    Representation of local geometry in the visual system

    Biological Cybernetics

    (1987)
  • Kristan, M., Pflugfelder, R., Leonardis, A., Matas, J., Čehovin, L., Nebehay, G., & Vojir, T. (2014). The visual object...
  • L.I. Kuncheva

    Combining pattern classifiers: methods and algorithms

    (2004)
  • J. Kwon et al.

    Tracking of abrupt motion using Wang-Landau Monte Carlo estimation

  • J. Kwon et al.

    Visual tracking decomposition
