Constructing a meta-tracker using Dropout to imitate the behavior of an arbitrary black-box tracker

doi:10.1016/j.neunet.2016.12.009

Neural Networks

Volume 87, March 2017, Pages 132-148

https://doi.org/10.1016/j.neunet.2016.12.009

Abstract

Imitating the behavior of an arbitrary visual tracking algorithm enables many higher-level tasks such as tracker identification and efficient tracker fusion. It is also useful for discovering the features essential to a black-box tracker or for learning from several trackers to form a super-tracker. In this study, we propose a non-linear feature fusion framework, “MIMIC”, that imitates many popular trackers by mixing a pool of heterogeneous features. The MIMIC framework consists of two subtasks, feature selection and feature weight tuning. These subtasks, however, tend to suffer from overfitting when the number of videos available for training is limited. To address this issue, we incorporated the Dropout algorithm into the training, which grants the trained MIMIC tracker a high degree of generalization. Extensive experiments confirmed the effectiveness of the proposed framework, which supports its application to related tasks in visual tracking.

Introduction

Every year many off-the-shelf trackers are introduced to address various visual tracking challenges. Typically, those studies aim to handle one or a few of these challenges: appearance changes (Babenko et al., 2009, Han and Davis, 2005, Henriques et al., 2012, Ross et al., 2008, Zhang et al., 2012, Zhong et al., 2012), abrupt/fast motion (Kwon and Lee, 2008, Zhou et al., 2012), target/non-target confusion (Dinh, Vo, & Medioni, 2011), long-term tracking (Kalal, Mikolajczyk, & Matas, 2012), background clutter (Avidan, 2007, Nguyen and Smeulders, 2004), irregular scale changes (Ning, Zhang, Zhang, & Wu, 2012), occlusions (Bao, Wu, Ling, & Ji, 2012) and moving cameras. However, large-scale benchmarks (Smeulders et al., 2015, Wu et al., 2015) suggested that TLD (Kalal et al., 2012), STRUCK (Hare, Saffari, & Torr, 2011), MIL (Babenko et al., 2009), FBT (Chu & Smeulders, 2012) and SCM (Zhong et al., 2012) have overall superior performance in dealing with different challenges. Many of these trackers have solved issues that trouble other methods. Mastering all of these domains to construct the holy grail of trackers is difficult, and fusing these trackers to integrate their merits into a unified framework has not been very successful either (Martín & Martínez, 2014). The biggest obstacles to constructing such a holy-grail tracker from currently realized ideas are the complexity of the integrating mechanisms and the contradictory objectives pursued by each of the trackers.

In this study, we propose a framework, MIMIC, to imitate the behaviors of an arbitrary black-box tracker, with a relatively simple non-linear tracker benefiting from a pool of various features. This study is based on a premise that a linear combination of sufficiently expressive features along with a flexible but still simple non-linear observation model can roughly approximate the behaviors of many popular trackers.

Employing multiple features to track objects has been investigated considerably in the visual tracking literature. Frameworks such as particle filters accommodate feature fusion seamlessly (Perez, Vermaak, & Blake, 2004). Although automatic adjustment of the features’ weights has been the focus of some studies (Chau, Thonnat, Bremond, & Corvee, 2014), many studies opt to select an effective subset of the implemented features to track with, and their performance depends drastically on how they approach feature extraction and selection. Different features may be obtained by applying various pre-defined templates to a single image patch (Shi and Tomasi, 1994, Viola and Jones, 2004), or by applying a particular function (e.g., scale) to a single image patch with different parameters (Kwon & Lee, 2010); alternatively, a “feature pool” may consist of several heterogeneous features (Chen, Liu, & Fuh, 2004).

Each employed feature lends its characteristics to the tracker so that the collective behaviors of the tracker are affected directly by its active features and/or their corresponding weights. Some features have prominent effects on the tracker, which grants it a certain degree of invariance against environmental changes. For instance, incremental PCA brings IVT (Ross et al., 2008) illumination invariance and sparse-coding-based features render the tracker robust against partial occlusions (Bao et al., 2012). Thus, it is natural to attribute the behaviors of a specific tracker to its employed features. Moreover, complicated trackers behave in a way that seems to be closely emulated using several features fused together.

The behaviors of many existing trackers can be imitated, by selecting the right set of features and/or proper weights for them. Without proper features, the observation model does not deal with the occlusion or sudden illumination changes, for instance, hence fails to track the objects. In this study, we propose a unified framework to emulate the behaviors of these trackers by a proper mixture of feature responses. The features when combined in a tracking-by-detection tracker (such as Ensemble tracker Avidan, 2007) or in an optimization framework (such as Mean-shift tracker Comaniciu, Ramesh, & Meer, 2003) lose the flexibility to mimic other trackers. Imitating trackers’ behavior, while they are dealing with different tracking challenges such as occlusions and illumination variations, complicates the imitation problem even more. Such challenges are managed by a combination of specialized features and stochastic sampling in some trackers (e.g., in IVT Ross et al., 2008, MIL Babenko et al., 2009 or L1APG Bao et al., 2012), so deterministic feature fusion methods do not work well on imitating their behaviors. Sophisticated trackers (e.g., SCM Zhong et al., 2012), on the other hand, cannot be approximated with linear motion models or dense sampling. Linear observation models (e.g., in Kalman Filter tracker Čehovin, Kristan, & Leonardis, 2011) cannot emulate the non-stationarity of the scene observation, and moreover, a more complicated sampling and feature fusion are required to imitate highly adaptive trackers such as STRUCK (Hare et al., 2011). Dense sampling in effect acts like the sliding window scheme in tracking-by-detection methods, hence cannot reproduce well the behaviors of probabilistic trackers. A non-linear non-Gaussian probabilistic framework such as the multi-cue Particle Filter tracker (Perez et al., 2004), is a natural choice to address these issues.

To validate our hypothesis (i.e., that a linear combination of proper features in a non-linear observation model emulates most complicated trackers), a rich pool of features is provided to a particle filter tracker with a few hundred particles and a random walk motion model. The goal of this tracker is to imitate the behaviors of a specific target tracker closely, in the sense of tracking overlap on unseen test videos. To achieve this, our tracker adjusts the weights of the features in the feature pool, based on the outputs of the target tracker while it operates on several videos used as the training data.
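The setup described above can be illustrated with a minimal sketch, assuming a 2-D position state, Gaussian jitter for the random walk, and an exponential of the linear feature mix as the non-linear observation model. None of these choices is confirmed by the text, and `toy_scores`, `run_demo` and all parameter values are hypothetical stand-ins for the real feature pool.

```python
import math
import random


def combined_score(feature_scores, weights):
    """Linear combination of per-feature scores for one particle."""
    return sum(w * s for w, s in zip(weights, feature_scores))


def particle_filter_step(particles, weights, score_fn, sigma, rng):
    """One predict / update / resample step for (x, y) particles."""
    # Predict: random-walk motion model (Gaussian jitter is an assumption).
    moved = [(x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma))
             for x, y in particles]
    # Update: exponentiate the linear feature mix; subtracting the maximum
    # keeps exp() from underflowing for every particle at once.
    scores = [combined_score(score_fn(p), weights) for p in moved]
    top = max(scores)
    likelihoods = [math.exp(s - top) for s in scores]
    total = sum(likelihoods)
    probs = [v / total for v in likelihoods]
    # Estimate: probability-weighted mean position.
    estimate = (sum(p * x for p, (x, _) in zip(probs, moved)),
                sum(p * y for p, (_, y) in zip(probs, moved)))
    # Resample in proportion to the normalized likelihoods.
    resampled = rng.choices(moved, weights=probs, k=len(moved))
    return resampled, estimate


def toy_scores(position, target=(50.0, 50.0)):
    """Two made-up features: closeness to a target, and a constant cue."""
    dx, dy = position[0] - target[0], position[1] - target[1]
    return [-(dx * dx + dy * dy) / 50.0, 1.0]


def run_demo(steps=30, n_particles=300, seed=0):
    """Track a static toy target starting from a displaced initial guess."""
    rng = random.Random(seed)
    particles = [(40.0, 40.0)] * n_particles
    estimate = (40.0, 40.0)
    for _ in range(steps):
        particles, estimate = particle_filter_step(
            particles, [1.0, 0.0], toy_scores, 3.0, rng)
    return estimate
```

In this toy run the second feature has weight zero, so only the closeness cue drives the estimate; changing the weight vector is the only handle the imitation procedure would turn.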

Our tracker, MIMIC, extracts a variety of features from the image frame and weights them in the observation model so as to minimize the mismatch between its estimate of the object location and that of the target tracker. These weights constitute the parameters to be tuned through supervised learning. Considering the high non-linearity of the objective function representing the mismatch to be minimized, this supervised learning was performed by an evolutionary algorithm that searches the high-dimensional parameter space for the best combination of feature weights. To select a subset of effective features from the feature pool, the optimization should set the weights of many unnecessary features to zero, which in effect amounts to feature selection. This is done by imposing an L1 regularization term on the objective function.
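As a rough illustration of this idea, the sketch below minimizes a mismatch term plus an L1 penalty with a very simple elitist evolutionary search. The text does not specify which evolutionary algorithm was used, so the mutation scheme, the non-negative weight clipping, `toy_mismatch` and all parameter values are assumptions made for the example.

```python
import random


def objective(weights, mismatch_fn, l1_strength):
    """Mismatch to the target tracker plus an L1 penalty on the weights."""
    return mismatch_fn(weights) + l1_strength * sum(abs(w) for w in weights)


def evolve_weights(mismatch_fn, n_features, l1_strength=0.05,
                   generations=200, offspring=20, step=0.1, seed=0):
    """Elitist search: keep a mutated child only if it lowers the objective."""
    rng = random.Random(seed)
    best = [0.0] * n_features
    best_cost = objective(best, mismatch_fn, l1_strength)
    for _ in range(generations):
        for _ in range(offspring):
            # Gaussian mutation; clipping at zero (an assumption) lets a
            # weight become exactly zero, i.e. the feature is deselected.
            child = [max(0.0, w + rng.gauss(0.0, step)) for w in best]
            cost = objective(child, mismatch_fn, l1_strength)
            if cost < best_cost:
                best, best_cost = child, cost
    return best, best_cost


def toy_mismatch(weights):
    """Hypothetical mismatch: the third feature is irrelevant to the target."""
    return (weights[0] - 1.0) ** 2 + (weights[1] - 0.5) ** 2
```

Because the third weight does not reduce the toy mismatch, the L1 term only ever penalizes it, which is the mechanism by which the penalty pushes unnecessary features toward zero.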

To suppress over-fitting in this high-dimensional learning task, we employed “Dropout”, as the success of this regularization mechanism has been demonstrated in artificial neural networks (Hinton, Srivastava, Krizhevsky, Sutskever, & Salakhutdinov, 2012). By emphasizing optimization over a model ensemble, this algorithm investigates different combinations of the features and prevents over-fitting by approximating the training dataset with a virtual ensemble of models that employ a variety of feature subsets. The proper combination of L1 regularization and the Dropout algorithm is expected to work well in the simultaneous task of feature selection and feature weighting, avoiding over-fitting even when the number of available videos is limited (Fig. 1).
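How Dropout might act on a feature pool can be sketched as follows, assuming that each feature is kept independently with probability `keep_prob` during training and that the full weight vector is scaled by `keep_prob` afterwards, as is common for Dropout in neural networks. Whether MIMIC applies Dropout in exactly this way is not stated in the text; all function names here are hypothetical.

```python
import random


def dropout_mask(n_features, keep_prob, rng):
    """Bernoulli mask: 1 keeps a feature, 0 drops it for this evaluation."""
    return [1 if rng.random() < keep_prob else 0 for _ in range(n_features)]


def masked_response(weights, feature_scores, mask):
    """Observation response of one member of the virtual ensemble."""
    return sum(w * s * m for w, s, m in zip(weights, feature_scores, mask))


def expected_response(weights, feature_scores, keep_prob):
    """Ensemble average, obtained by scaling the full model by keep_prob."""
    return keep_prob * sum(w * s for w, s in zip(weights, feature_scores))


def average_masked_response(weights, feature_scores, keep_prob,
                            n_samples=20000, seed=0):
    """Monte Carlo average over randomly thinned feature subsets."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        mask = dropout_mask(len(weights), keep_prob, rng)
        total += masked_response(weights, feature_scores, mask)
    return total / n_samples
```

The Monte Carlo average and the scaled full model agree in expectation, which is why a single set of weights can stand in for the whole ensemble of feature subsets.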

In summary, our contributions in this study are two-fold:

• We proposed a unified framework to approximate the behaviors of an arbitrary tracker by employing a weighted subset of features selected from a feature pool in the observation model of our MIMIC tracker.
• We introduced several applications of this framework, including “tracker identification”, in which a black-box tracker is identified among other trackers based on its tracking results, and a couple of novel approaches to performing tracker fusion.

Following this introduction, Section 2 elaborates the proposed framework, and Section 3 demonstrates the wide range of applications of this framework. After the proposed framework is discussed in Section 4, our article is concluded in Section 5 with the outcome of this study and some future directions to improve it.

Method

MIMIC is implemented as a particle filter tracker (hereafter, PFT) employing a linear combination of features in its observation model. Intuitively, MIMIC aims to mirror a given target tracker by maximizing the overlap between its tracking results and those of the target tracker, by adjusting the weights of the features selected from a predefined feature pool. Fig. 1 depicts its basic scheme.
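One common way to quantify the overlap between two tracking results is the intersection-over-union of their bounding boxes, averaged over frames. The sketch below assumes boxes given as `(x, y, width, height)` tuples; the text does not say which overlap measure the authors used, so this is an illustrative choice rather than their definition.

```python
def box_overlap(box_a, box_b):
    """Intersection-over-union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the intersection rectangle (zero if disjoint).
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def mean_overlap(boxes_mimic, boxes_target):
    """Average per-frame overlap between two trackers on one video."""
    pairs = list(zip(boxes_mimic, boxes_target))
    if not pairs:
        return 0.0
    return sum(box_overlap(a, b) for a, b in pairs) / len(pairs)
```

A training mismatch could then be taken as one minus the mean overlap, so that maximizing overlap and minimizing mismatch describe the same search.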

According to the MIMIC framework, a sequence of video frames $I_{t}\ (t \in \{1, 2, \dots, \tau\})$ is provided to

Experiment

We first present the results of multiple experiments that verify how well MIMIC shadows a target tracker. We then extend this scheme by demonstrating its applicability to different tasks. Specifically, the first three experiments were performed to show that the MIMIC framework is capable of approximately achieving the performance of the target tracker (3.1), that it is robust in various scenarios (3.2), and that its intrinsic dynamics are fairly interpretable (3.3). The subsequent six

Discussion

In this study, we proposed a new perspective, the “unified tracker imitation framework”. Such a framework is suitable for approximating the behaviors of trackers that are intelligent but computationally demanding, by means of a computationally cheap online tracker.

Additionally, the proposed framework made it possible to imitate humans in various tracking scenarios. This can serve as a tool to monitor different features as they try to capture the real world. Furthermore, this framework is capable of

Conclusions

In this study, we introduced MIMIC, a novel approach to imitating the behaviors of an arbitrary target tracker. We constructed MIMIC based on the assumption that most popular trackers can be imitated by a linear fusion of several features embedded in the observation model of a non-linear tracker. We employed a particle filter tracker and a rich pool of features to verify this hypothesis. The extensive experiments justified the overall validity of this hypothesis, although

Acknowledgments

This work was supported by the Platform for Dynamic Approaches to Living Systems from Japan Agency for Medical Research and Development (AMED) and by the Project of Next-Generation Core Robot and AI Technology Development from New Energy and Industrial Technology Development Organization (NEDO).

References (85)

P. Baldi et al. The dropout learning algorithm. Artificial Intelligence (2014)
D.P. Chau et al. Online parameter tuning for object tracking algorithms. Image and Vision Computing (2014)
M. Heber et al. Segmentation-based tracking by support fusion. Computer Vision and Image Understanding (2013)
T. Kobayashi et al. Motion recognition using local auto-correlation of space–time gradients. Pattern Recognition Letters (2012)
X. Li et al. A survey of appearance models in visual object tracking. ACM Transactions on Intelligent Systems and Technology (2013)
B. Zhong et al. Visual tracking via weakly supervised learning from multiple imperfect oracles. Pattern Recognition (2014)
S. Avidan. Ensemble tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence (2007)
B. Babenko et al. Visual tracking with online multiple instance learning
L. Ballan et al. Effective codebooks for human action representation and classification in unconstrained videos. IEEE Transactions on Multimedia (2012)
C. Bao et al. Real time robust L1 tracker using accelerated proximal gradient approach
S. Belongie et al. Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence (2002)
K. Briechle et al. Template matching using fast normalized cross correlation
L. Čehovin et al. An adaptive coupled-layer visual model for robust visual tracking
H.T. Chen et al. Probabilistic tracking with adaptive feature selection
D.M. Chu et al. Color invariant SURF in discriminative object tracking
R.T. Collins et al. Online selection of discriminative tracking features. IEEE Transactions on Pattern Analysis and Machine Intelligence (2005)
D. Comaniciu et al. Kernel-based object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence (2003)
Csurka, G., Dance, C., Fan, L., Willamowski, J., & Bray, C. (2004). Visual categorization with bags of keypoints. In...
N. Dalal et al. Histograms of oriented gradients for human detection
Danelljan, M., Hager, G., Shahbaz Khan, F., & Felsberg, M. (2015). Learning spatially regularized correlation filters...
Dinh, T.B., Vo, N., & Medioni, G. (2011). Context tracker: Exploring supporters and distracters in unconstrained...
W.T. Freeman et al. The design and use of steerable filters. IEEE Transactions on Pattern Analysis and Machine Intelligence (1991)
Y. Gao et al. Symbiotic tracker ensemble towards a unified tracking framework. IEEE Transactions on Circuits and Systems for Video Technology (2014)
Grabner, H., Leistner, C., & Bischof, H. (2008). Semi-supervised on-line boosting for robust tracking. In...
B. Han et al. On-line density-based appearance modeling for object tracking
Han, Y., Yang, Y., & Zhou, X. (2013). Co-regularized ensemble for feature selection. In...
S. Hare et al. Struck: Structured output tracking with kernels
Henriques, J.F., Caseiro, R., Martins, P., & Batista, J. (2012). Exploiting the circulant structure of...
Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R.R. (2012). Improving neural networks by...
D.H. Hubel et al. Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. The Journal of Physiology (1962)
M. Isard et al. Condensation—conditional density propagation for visual tracking. International Journal of Computer Vision (1998)
Johannes, M., Polson, N., & Stroud, J. (2006). Exact particle filtering and parameter learning. Technical Report....
L. Juan et al. A comparison of SIFT, PCA-SIFT and SURF. International Journal of Image Processing (2009)
Z. Kalal et al. Tracking-learning-detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (2012)
Y. Ke et al. PCA-SIFT: A more distinctive representation for local image descriptors
T. Kobayashi. BoF meets HOG: feature extraction based on histograms of oriented pdf gradients for image classification
T. Kobayashi et al. Image feature extraction using gradient local auto-correlations
J.J. Koenderink et al. Representation of local geometry in the visual system. Biological Cybernetics (1987)
Kristan, M., Pflugfelder, R., Leonardis, A., Matas, J., Čehovin, L., Nebehay, G., & Vojir, T. (2014). The visual object...
L.I. Kuncheva. Combining pattern classifiers: methods and algorithms (2004)
J. Kwon et al. Tracking of abrupt motion using Wang-Landau Monte Carlo estimation
J. Kwon et al. Visual tracking decomposition
