Object Tracking in Frame-Skipping Video Acquired Using Wireless Consumer Cameras

Object tracking is an important and fundamental task in computer vision and its high-level applications, e.g., intelligent surveillance, motion-based recognition, video indexing, traffic monitoring and vehicle navigation. However, the recent widespread use of wireless consumer cameras often produces low-quality videos with frame-skipping, which makes object tracking difficult. Previous tracking methods generally depend heavily on object appearance or motion continuity and cannot be directly applied to frame-skipping videos. In this paper, we propose an improved particle filter for object tracking to overcome the frame-skipping difficulties. The novelty of our particle filter lies in using the detection result of erratic motion to improve the transition model and thereby obtain a better trial distribution. Experimental results show that the proposed approach improves the tracking accuracy in comparison with the state-of-the-art methods, even when both the object and the consumer camera are in motion.


Introduction
Object tracking, in general, is the tracking of an object or objects over a sequence of images. Object tracking is an important task in the field of computer vision and it is usually performed at an early stage in the context of higher-level applications such as automated surveillance, motion-based recognition, video indexing, traffic monitoring and vehicle navigation [1]- [3]. For example, in an automated surveillance system there are at least three key steps: detection of interesting objects, tracking of such objects over frames and analysis of object trajectories to recognize their behaviour. Therefore, object tracking is a critical task in many high-level applications.
Object tracking is a very challenging problem because many difficulties can arise from non-rigid object structures, occlusions, changing appearance patterns of both the object and the scene, etc. Many methods have been designed to overcome these common difficulties. Meanwhile, the availability of low-cost hardware, such as CMOS cameras and microphones that are able to ubiquitously capture video content from the environment, has fostered the development of wireless video sensor networks (WVSNs) [4], [5]. Wireless devices (Fig. 1 shows the wireless consumer cameras used in our experiment in Section 5) allow videos to be retrieved, and tracking in WVSNs is a practical requirement of many real-time applications. The retrieved videos, however, usually suffer from two common difficulties, which together are referred to as the frame-skipping problem (see Fig. 2): one is unexpected frame dropping (missing frames in a continuous video sequence) and the other is a low frame rate. The frame-skipping problem can be caused by various factors, e.g., low hardware cost, low or unstable processing speed in the video sources, frame dropping caused by the transmission conditions, or online compression or decompression that limits the frame rate. Therefore, videos with frame-skipping are common and "normal" for WVSNs, and this property of the video flow itself, frame-skipping, has become an important issue in many applications. Previous tracking methods, in general, depend heavily on object appearance or motion continuity (see Section 2 for a detailed review). These methods often rely on the assumption of temporal continuity, whereas in frame-skipping videos the continuity of a target is often too weak to follow. Essentially, frame-skipping videos make it difficult to obtain the transition model (which describes how objects move between frames).
Meanwhile, identifying the target from frame to frame is difficult due to the absence of context, so we cannot rely on image processing techniques alone. Therefore, most previous tracking methods cannot be directly applied to frame-skipping videos. One feasible solution proposed previously, whether or not the motive was frame-skipping tracking, is the integration of object detection and tracking, because object detection allows the target to be discriminated from other objects [6]-[14]. This solution can partially overcome the frame-skipping difficulties, but applying reliable object detection over a large search space is often costly [6]-[12]. Furthermore, identifying the target requires strong discriminative power, which is usually achieved by massive offline training [13], [14]. However, in many applications of WVSNs, offline training is impossible because both targets and scenarios are unpredictable. Our method, on the other hand, is highly efficient in frame-skipping videos and does not require offline training.
Another important challenge of object tracking is that consumer cameras are frequently rotated or moved during the video capturing process. The motion of consumer cameras makes object tracking more difficult because the non-stationary background is an obstacle to the extraction of moving objects; thus static background subtraction-based methods [6]-[10] are not applicable. When camera motion and possibly cluttered backgrounds appear in some applications, a particle filter (which uses a dynamic model to guide the particle propagation within a limited sub-space of the target state) has been used previously to solve object tracking effectively [15], [21]. However, when object motion becomes unpredictable under frame-skipping conditions, the standard particle filter causes the sample set to depart from the true target state, which eventually leads to tracking loss. In this paper, we propose an improved particle filter with a better transition model to overcome the difficulties of object tracking in frame-skipping videos. In our method, it is motion detection, rather than object detection as in [6]-[14], that plays the key role in object tracking. We apply fast and reliable detection which produces a global description in an acceptable search space. The novelty of our particle filter lies in using the detection result of erratic motion to improve the transition model and thereby obtain a better trial distribution. We compare the tracking accuracy of the proposed approach with the state-of-the-art methods and show that our new method achieves higher accuracy, even when both the object and the consumer camera are in motion.
The remainder of this paper is organized as follows: Section 2 briefly summarizes related works on frame-skipping videos. Section 3 is devoted to analysing the essence of the frame-skipping problem in the probabilistic framework and then reveals the deficiency of the standard particle filter. Section 4 introduces the extraction of erratic motion and then proposes our particle filter with a newly defined transition model for a better trial distribution. Section 5 presents the experimental results, which show that our new method outperforms the state-of-the-art methods, and also discusses the limitations of our method. Section 6 concludes this paper.

Related Works
Frame-skipping events are, in most cases, equivalent to uncertain erratic motion. A large number of the state-of-the-art methods, such as mean shift [18], [19], generally require the kernels or feature patches in consecutive frames to overlap or to lie in very close vicinity of each other. However, some existing publications [6]-[14] have attempted to tackle similar difficulties, whether or not the motive was frame-skipping tracking. A common feature of all these methods is the integration of object detection and tracking. We classify these works into three categories.
i. "Global object detection" for object tracking. These methods use an independent detector to guide the search of an existing tracker when target motion becomes unpredictable, and in most cases they require an object detector fast enough to be applied to the whole frame. Okuma et al. [13] use a boosted detector to amend the trial distribution of the particle filter. However, the boosted detector requires massive offline training. Another similar piece of research on mixture trial distributions is described in [6]. Porikli et al. [17] extend the standard mean shift technique using multiple kernels at motion areas detected by background subtraction to track in both 6 fps (frames per second) and 1 fps fixed-camera videos. In our method, we utilize erratic motion detection (which requires no offline learning) to overcome the frame-skipping problem.
ii. "Object detection and connection" for object tracking. These methods detect the objects of interest and then construct trajectories by analysing motion continuity, object appearance similarity, etc. However, the algorithms of this category [8]-[10] are limited to static background scenes, where a fast change detector is easy to realize. Besides, the trajectories are uncertain and usually cannot be recognized in frame-skipping videos. The methods in category ii are not applicable in many applications of WVSNs because of non-stationary backgrounds, and in most cases they require an object detector fast enough to be applied to the whole frame. In our method, the improved particle filter can be applied to dynamic background scenes.
iii. "Multi-scale or multi-stage object detection" for object tracking. These methods increase the discriminative power by layered sampling of multi-scale likelihoods or multi-stage observations. In [11], multi-scale approaches are designed for erratic motion by layered sampling of multi-scale likelihoods [12]. However, the multi-scale approaches adopt the same observation model at every scale and lose image information in the down-scaling process. Li et al. [14] propose a cascade particle filter with discriminative observers of different life spans. This method treats tracking as a classification problem, distinguishing the tracked human face from the background. Besides, the long-life-span observer of this method requires massive offline training that takes several days. In our method, we integrate erratic motion detection and tracking to obtain a better transition model (rather than, e.g., the observation model proposed in [14]) and thus overcome the frame-skipping problem without massive offline learning.

Problem Analysis
The standard particle filter can effectively overcome difficulties such as camera motion and cluttered backgrounds, which usually appear in applications of WVSNs. However, when object motion becomes erratic and unpredictable under frame-skipping conditions, the standard particle filter leads to tracking loss. Here we first briefly review the basic rules of object tracking in a probabilistic framework and then reveal the deficiency of the standard particle filter in the case of frame-skipping.

Tracking in a Probabilistic Framework
The basic idea of the particle filter, i.e., tracking in a probabilistic framework, is briefly reviewed here.
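To define the tracking problem, we consider a dynamic system represented by the stochastic process $\{x_k\}$ of a target given by

$$x_k = f_k(x_{k-1}, v_{k-1}), \qquad (1)$$

where $f_k: \mathbb{R}^{n_x} \times \mathbb{R}^{n_v} \to \mathbb{R}^{n_x}$ is a possibly nonlinear function, $v_{k-1}$ is the process noise, $n_x$ and $n_v$ are the dimensions of the state and process noise vectors, respectively, and $x_k$ is the hidden state, such as object scale and location. The observation $z_k$ is related to the state by the observation model

$$z_k = h_k(x_k, n_k), \qquad (2)$$

where $n_k$ is the measurement noise.
The tracking problem can then be converted to recursively estimating the posterior density

$$p(x_k \mid z_{1:k}) = \frac{p(z_k \mid x_k) \int p(x_k \mid x_{k-1})\, p(x_{k-1} \mid z_{1:k-1})\, dx_{k-1}}{p(z_k \mid z_{1:k-1})}, \qquad (3)$$

where the normalizing constant $p(z_k \mid z_{1:k-1})$ depends on the likelihood function $p(z_k \mid x_k)$ defined by the observation model in (2) and the known statistics of $n_k$.
Note that in (3), $p(x_k \mid x_{k-1})$ describes a Markov process of order one [20].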


The observation $z_k$ is used to modify the prior density and obtain the required posterior density of the current state.
Equation (3) describes the optimal Bayesian solution. However, this recursive propagation of the posterior density is only a conceptual solution and it cannot be determined analytically. So we need a method to approximate the optimal Bayesian solution such as a particle filter.
A particle filter [15], [21] uses a probabilistic framework to formulate tracking as inference in a Hidden Markov Model (HMM). It is based on a random measure: the posterior density is approximated by a set of weighted particles. Each particle consists of a state and its corresponding probability (weight), and the particle set is denoted by $\{x_k^i, w_k^i\}_{i=1}^{N_s}$, so that

$$p(x_k \mid z_{1:k}) \approx \sum_{i=1}^{N_s} w_k^i\, \delta(x_k - x_k^i), \qquad (7)$$

where $w_k^i$ is given in (6). When $N_s \to \infty$, the approximation presented in (7) approaches the true posterior density $p(x_k \mid z_{1:k})$.
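The predict, weight and resample cycle of a particle filter can be illustrated with a minimal one-dimensional sketch. This is not the tracker used in this paper: the Gaussian transition and observation models, the parameter values and the names `sir_step` and `estimate` are assumptions chosen only for illustration, and the prior is used as the proposal so that each weight is simply multiplied by the likelihood.

```python
import math
import random


def gaussian_pdf(x, mean, sigma):
    """Density of N(mean, sigma^2) evaluated at x."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def normalize(weights):
    """Scale weights to sum to one; fall back to uniform if all are zero."""
    total = sum(weights)
    if total <= 0.0:
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]


def sir_step(particles, weights, z, rng, sigma_trans=1.0, sigma_obs=1.0):
    """One predict/update/resample cycle for a scalar state (illustrative)."""
    # Prediction: sample each particle from an assumed Gaussian transition.
    predicted = [x + rng.gauss(0.0, sigma_trans) for x in particles]
    # Update: with the prior as proposal, the weight is scaled by p(z | x).
    new_weights = normalize(
        [w * gaussian_pdf(z, x, sigma_obs) for w, x in zip(weights, predicted)]
    )
    # Resampling: draw particles in proportion to their weights.
    resampled = rng.choices(predicted, weights=new_weights, k=len(predicted))
    uniform = [1.0 / len(resampled)] * len(resampled)
    return resampled, uniform


def estimate(particles, weights):
    """Weighted mean of the particle set."""
    return sum(w * x for x, w in zip(particles, weights))


if __name__ == "__main__":
    rng = random.Random(0)
    particles = [0.0] * 500
    weights = [1.0 / 500] * 500
    for z in (0.5, 1.0, 1.5):
        particles, weights = sir_step(particles, weights, z, rng)
    print(estimate(particles, weights))
```

Resampling at every step keeps the sketch short; practical implementations often resample only when the weights become too uneven.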

Deficiency of the Standard Particle Filter
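In the standard particle filter, the distribution $p(x_k \mid z_{1:k})$ is propagated in two steps:

Prediction: $$p(x_k \mid z_{1:k-1}) = \int p(x_k \mid x_{k-1})\, p(x_{k-1} \mid z_{1:k-1})\, dx_{k-1} \qquad (8)$$

Update: $$p(x_k \mid z_{1:k}) \propto p(z_k \mid x_k)\, p(x_k \mid z_{1:k-1})$$

According to the standard particle filter, the integral in (8) is calculated by importance sampling, i.e., by means of samples generated from a proposal distribution.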
In practice, a presumed prior distribution is widely used as the proposal distribution. However, when object motion becomes erratic and unpredictable under frame-skipping conditions, such a transition model causes the sample set to depart from the true target state, which eventually leads to tracking loss. Here we give an implementation of the standard particle filter to show the tracking loss in a frame-skipping video. In our case:
• The prior distribution is simply an impulse based on user input. It describes the initial distribution of object states and could be based on an object detector.
• The observation model is a simple HSV histogram-based model (Fig. 3). We specify the likelihood of an object being in a specific state; the likelihood is based on this histogram model.
• The transition model is a Gaussian window around the current state, from which the standard particle filter usually samples the next state.
For a given particle at time k-1, the prediction and measurement process is shown in Fig. 4.
[...]
• [...] $p(z_k \mid x_k)$ over the integral space [6], [7], [13].
• Strengthening the discriminative power of the observation $z_k$ used to modify the prior density and obtain the required posterior density of the current state [8]-[12], [14].
• Improving the transition model.
In this paper, we choose to improve the transition model directly (Case 4). Under the other choices, either the system's efficiency degrades (Cases 1 and 2) or massive offline training is required (Case 3). So how can we overcome the frame-skipping difficulties when the results of erratic motion detection are available?
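The deficiency analysed in this section can be made concrete with a small numerical sketch. It computes, for a one-dimensional Gaussian window centred on the previous state, the probability that a sampled particle lands within a given radius of a target that has jumped by some displacement. The window width, radius, displacement and particle count are hypothetical values chosen for illustration, not measurements from our experiments.

```python
import math


def normal_cdf(x, mean=0.0, sigma=1.0):
    """Cumulative distribution function of N(mean, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))


def hit_probability(displacement, sigma, radius):
    """P(|X - displacement| < radius) for a sample X ~ N(0, sigma^2)."""
    upper = normal_cdf(displacement + radius, 0.0, sigma)
    lower = normal_cdf(displacement - radius, 0.0, sigma)
    return upper - lower


def expected_hits(n_particles, displacement, sigma, radius):
    """Expected number of particles that land near the displaced target."""
    return n_particles * hit_probability(displacement, sigma, radius)


if __name__ == "__main__":
    # Hypothetical numbers: 10-pixel window, 5-pixel target radius.
    print(hit_probability(0.0, 10.0, 5.0))     # smooth motion: roughly 0.38
    print(expected_hits(50, 60.0, 10.0, 5.0))  # 60-pixel jump: far below one
```

Under these assumed numbers, a jump of six window widths leaves essentially no particle near the target, which is why either many more particles or a better transition model is needed.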

Our Improved Particle Filter
Erratic motion detection can produce a global description in a search space. So in Section A, we present how to represent and extract erratic motion as a local motion vector. Then in Section B, the transition model is redefined in the view of target drift over frames based on the local motion vector. Finally in Section C, we describe how this new transition model is integrated into the probabilistic framework to tackle the frame-skipping problem.

Representation and Extraction of Erratic Motion
In this paper, we use the motion history image (MHI) and its hierarchical mechanism [16], [17], a simple and sufficiently fast method that represents motion as successively layered silhouettes which directly encode system time. This representation can be used to segment and measure the motions induced by the object in a video scene. The segmented regions are not "motion blobs", but motion regions naturally connected to the moving parts of the object of interest. First, we label as foreground those pixels that lie more than a set number of standard deviations from the mean RGB background. Then a pixel dilation and region growing method is applied to remove noise and extract the silhouette. The MHI representation is then constructed by successively layering selected image regions over time using a simple update rule:

$$\mathrm{MHI}_\delta(x, y) = \begin{cases} \tau, & \text{if } \Psi(I(x, y)) \neq 0 \\ 0, & \text{else if } \mathrm{MHI}_\delta(x, y) < (\tau - \delta) \end{cases}$$

where each pixel $(x, y)$ in the MHI is marked with the current timestamp $\tau$ if the function $\Psi$ indicates object presence in the current image $I$, and timestamps older than $\tau - \delta$ are reset to 0.
Since motion can be perceived from the displayed timestamp gradients in the template, we can convolve gradient masks with the timestamp values in the MHI to extract a motion vector at each pixel. Gradients of the MHI can be calculated efficiently by convolution with separable Sobel filters in the X and Y directions, yielding the spatial derivatives $F_x(x, y)$ and $F_y(x, y)$, from which the gradient orientation at each pixel follows as $\phi(x, y) = \arctan\left(F_y(x, y) / F_x(x, y)\right)$.
However, the use of a discrete fixed-size gradient mask (i.e., 3 × 3 Sobel gradient masks) limits the range of recoverable motion. When an object moves at different velocities, a fixed-size mask can result in detection failure. Therefore, we use a hierarchical MHI mechanism [16], which extends the original MHI representation into a hierarchical pyramid format, to appropriately extract the MHI motions of different velocities. An image pyramid is constructed by recursively low-pass filtering and subsampling an image until the desired spatial reduction is reached. This permits us to use fixed-size gradient masks at each pyramid level to calculate motions of different speeds. To create the corresponding MHI pyramid, each level of the image pyramid is used to update an MHI of that particular resolution. The algorithm to segment motion regions is then as follows:
Step 1: we choose the pyramid level L with the minimum acceptable temporal disparity (finest temporal resolution).
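The MHI update and the Sobel-based motion orientation described in this section can be sketched for a single pyramid level as follows. This is an illustrative sketch under stated assumptions, not our C++ implementation: images are plain nested lists, the silhouette is assumed to be already extracted, the function names and the `duration` parameter (the time after which a timestamp is cleared) are hypothetical, and the pyramid construction is omitted.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def update_mhi(mhi, silhouette, timestamp, duration):
    """Stamp foreground pixels with the timestamp and clear stale ones."""
    out = [row[:] for row in mhi]
    for y in range(len(out)):
        for x in range(len(out[0])):
            if silhouette[y][x] != 0:
                out[y][x] = timestamp
            elif out[y][x] < timestamp - duration:
                out[y][x] = 0.0
    return out


def sobel_at(mhi, y, x):
    """3x3 Sobel derivatives (Fx, Fy) of the MHI at an interior pixel."""
    fx = fy = 0.0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            value = mhi[y + dy][x + dx]
            fx += SOBEL_X[dy + 1][dx + 1] * value
            fy += SOBEL_Y[dy + 1][dx + 1] * value
    return fx, fy


def motion_orientation(mhi, y, x):
    """Gradient orientation in radians, or None where the MHI is flat."""
    fx, fy = sobel_at(mhi, y, x)
    if fx == 0.0 and fy == 0.0:
        return None
    return math.atan2(fy, fx)


if __name__ == "__main__":
    # A vertical bar moving one column to the right per frame.
    mhi = [[0.0] * 5 for _ in range(5)]
    for t, column in ((1.0, 1), (2.0, 2), (3.0, 3)):
        silhouette = [[1 if x == column else 0 for x in range(5)]
                      for _ in range(5)]
        mhi = update_mhi(mhi, silhouette, t, duration=5.0)
    print(motion_orientation(mhi, 2, 2))  # timestamps increase along +x
```

Using `atan2` instead of a plain arctangent of the ratio avoids a division by zero when the horizontal derivative vanishes and keeps the full angular range.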

The Integration of Motion Detection and Tracking
As analysed in Section 3, when target motion becomes erratic and unpredictable under frame-skipping conditions, the standard particle filter is not applicable; e.g., sampling the next state from a Gaussian window around the current state is no longer a good choice. Therefore, based on the local motion vector mentioned above, we redefine the transition model described in (1) as follows. Given a state [...], where $\mathrm{Drift}(\cdot)$ describes that the state [...] by a predefined threshold to reduce the false alarm rate to within acceptable bounds. The final improved particle filter is presented in Algorithm 1.
Theoretically, a good implementation of the transition model should take into account previous states for velocity and acceleration information. In this paper, we use a second-order autoregressive dynamical model, informed by the detected erratic motion, to predict the next state from the previous two states plus Gaussian noise. Let $x_k$ be the coordinate of a sampled particle at the next time $k$; then $x_k$ is given by (15), where $\bar{x}$ is the centre coordinate of the particles and $g_k$ is Gaussian noise. In our case, the new prediction and measurement process is shown in Fig. 7.
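A second-order autoregressive prediction with an optional drift term can be sketched as follows. The exact form of (15) and the way the drift enters it are not reproduced here, so the additive combination below is an assumption; the coefficients 2.0, -1.0 and 1.0 are the empirical values reported in Section 5, their assignment to the two previous states and the noise term is assumed, and the names `A1`, `A2`, `A3`, `predict` and `drift` are hypothetical.

```python
import random

# Coefficient values from the experimental setup; their roles are assumed.
A1, A2, A3 = 2.0, -1.0, 1.0


def predict(prev, prev2, rng, drift=None, noise_sigma=1.0):
    """Predict the next (x, y) position of a particle.

    prev and prev2 are the positions at times k-1 and k-2. drift is an
    optional (dx, dy) local motion vector, added only when erratic motion
    has been detected (an assumption about how the drift is applied).
    """
    predicted = []
    for axis in range(2):
        noise = rng.gauss(0.0, noise_sigma)
        value = A1 * prev[axis] + A2 * prev2[axis] + A3 * noise
        if drift is not None:
            value += drift[axis]
        predicted.append(value)
    return tuple(predicted)


if __name__ == "__main__":
    rng = random.Random(1)
    # Smooth motion: constant-velocity extrapolation plus noise.
    print(predict((10.0, 5.0), (8.0, 5.0), rng))
    # Erratic motion: the detected drift shifts the prediction.
    print(predict((10.0, 5.0), (8.0, 5.0), rng, drift=(30.0, -4.0)))
```

With the coefficients 2.0 and -1.0, the noise-free part of the model extrapolates the last displacement, i.e., it behaves as a constant-velocity predictor.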

Experimental Results
In this section we demonstrate the benefits of the proposed method. We present the experimental setup in Section A; the metrics for evaluating our method are then summarized in Section B. Following these metrics, we analyse the efficiency of the sampling process in our method in Section C and present quantitative comparisons of object tracking in Section D. Finally, a discussion of our method is given in Section E.

Experimental Setup
In our experiments, the improved particle filter is implemented in C++ on a PC with a Pentium IV 3.0 GHz CPU. Some test videos are taken using wireless consumer cameras (D-Link DCS-5300G, 1/4 inch colour CCD, using the 802.11g wireless technology, Fig. 1). The video resolution of these cameras is fixed at 704×576, whereas their frame rates go down to 6-10 fps.
The nonlinear function parameters in (15) correspond to a priori knowledge about object movements, e.g., the previous two states and the randomness of the movement. These features basically exploit low-level information about the movement characteristics. In our experiments, we set the three parameters empirically to 2.0, -1.0 and 1.0, respectively. Moreover, we abbreviate the standard particle filter as PF in the experiments. The number of particles adopted in the different cases ranges from 50 to 1500.

Algorithm 1 The improved particle filter
Input: [...]
- Initialization: [...]
- if (erratic motion is detected) then [...]
- Draw: [...]
- Draw: [...]
End.

Metrics for Evaluation
To verify the effectiveness of the proposed method, we carry out a systematic objective evaluation using the following metrics [14]: effective sample size (ESS) [21] analysis of the sampling process, tracking error, and computational cost at similar tracking performance.
• ESS analysis of the sampling process. To analyse sampling efficiency, we compute the ESS of importance sampling. Over the same set of tracking sessions, ESS measures the uniformity of the particle weights and is defined by ESS = 1/[...]
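As an illustration of how an effective sample size can be computed from particle weights, the sketch below uses a commonly used estimate, the reciprocal of the sum of squared normalized weights. That this matches the definition used in our evaluation is an assumption, and the function name is hypothetical.

```python
def effective_sample_size(weights):
    """Reciprocal of the sum of squared normalized weights."""
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("weights must have a positive sum")
    normalized = [w / total for w in weights]
    return 1.0 / sum(w * w for w in normalized)


if __name__ == "__main__":
    print(effective_sample_size([0.25, 0.25, 0.25, 0.25]))  # uniform weights
    print(effective_sample_size([1.0, 0.0, 0.0, 0.0]))      # degenerate set
```

Under this definition, uniform weights give an ESS equal to the number of particles, whereas a set dominated by a single particle gives an ESS close to one.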

Efficiency of the Sampling Process
In comparison with PF (with different particle numbers), a quantitative analysis of sampling efficiency is carried out on a test sequence (CAVIAR.avi) from the CAVIAR datasets (free and open). CAVIAR.avi has been randomly down-sampled to 4-6 fps. Fig. 8 (a) presents the quantitative analysis of sampling efficiency as ESS curves.
ESS can be described simply as the number of samples drawn directly from the target distribution that would be equivalent to the $N_s$ weighted samples. Therefore, the higher the ESS, the better the sampling efficiency achieved by the system. Increasing the particle number of PF can compensate for its inaccuracy in the prediction stage under frame-skipping conditions. When the particle number of PF is increased to 250 (5 times the number used in our method), its performance becomes similar to that of the proposed method. According to Fig. 8 (a), ESS is roughly proportional to the particle number. In other words, the ability to predict the state of the target is enhanced at the cost of drawing more particles to cover a larger search space. The ESS of our method is even higher than that of PF with 250 particles, because in the update stage our method enables the particles to converge around high-likelihood regions.

Quantitative Comparison
To validate the sampling effectiveness of the proposed method, quantitative comparisons of tracking are made via tracking errors in two aspects.
• In comparison with PF with different numbers of particles, a quantitative comparison of position error in pixels is carried out on the test sequence (CAVIAR.avi). Fig. 8 (b) illustrates the quantitative analysis of sampling effectiveness: the error rate of our method remains low with the smallest particle number; the tracking error curve also shows that enlarging the particle number of PF can compensate for its poor accuracy under frame-skipping conditions.
• A comparison of position error in size is carried out among PF, mean shift using a colour histogram [18] and our method on the test sequences CAVIAR.avi, lab.avi and football.avi. lab.avi (Fig. 5) was taken by a wireless consumer camera just outside our lab (3-4 fps). football.avi (Fig. 9 (a)) records a player in action on a pitch (2-4 fps). All videos are randomly down-sampled to the corresponding frame rate. As the curves in Fig. 10 indicate, our method shows higher accuracy than the others under frame-skipping conditions.
In addition, in our tracker: 1) the cost of the prediction and update stages of the particle filter mainly depends on $N_s$; 2) for the calculation of the local motion vector, a hierarchical MHI mechanism is adopted to handle different velocities of moving targets efficiently, and a pyramid of images is built so that, in each localized search space, an image is recursively low-pass filtered and sub-sampled until the desired spatial reduction is reached. Our experimental results show that the cost of calculating the local motion vector, together with the basic image processing involved, is roughly comparable to that of PF with 50 particles.
Therefore, the computational cost is compared in terms of the average number of particles evaluated by each observer per frame on three test sequences: CAVIAR.avi, lab.avi and football.avi. The result is shown in Fig. 11. The PF configurations compared, with different particle numbers, achieve tracking performance similar to our approach, yet they evaluate many more particles. Fig. 9 (a) shows a player in action on a pitch. The challenges involved in this video sequence are common in real-life cases, e.g., the camera moves and shakes when following the player's movements, the player's motion itself is unstable, the camera zooms in and out, and the pose changes when he stands up to kick. Our method tracks the player successfully.

Discussion
We also validate the proposed method in a real-life vehicle counting system. Fig. 9 (b) shows that our method correctly tracks a car, even though the video stream acquired by a wireless consumer camera is at 6-10 fps.
However, the discriminative power of our method decreases when cameras move too quickly, because quick camera motion causes too many false motion alarms. Fig. 9 (c) shows the movement of a student (lab2.avi was also taken by a wireless consumer camera just outside our lab, at 3-4 fps). In contrast with the two earlier cases, one great challenge in this video is that the camera moves too quickly, and our method fails to track the student. Moreover, another main limitation of our method concerns multi-object tracking. In fact, tracking multiple interacting objects is itself a much more challenging problem. Multi-object tracking in frame-skipping videos is left for future work.

Conclusion
The key to successful tracking lies in the effective extraction of useful information about the target's state from observations. A good transition model of the target helps this to a great extent. Generally, one can say without exaggeration that a good model is worth a thousand pieces of data [22]. This paper has introduced a redefined transition model for frame-skipping tracking that requires no a priori offline training and imposes no restrictions on image quality or on the objects' shapes and speed. We have contributed to the state of the art by improving the standard particle filter for better tracking. We have compared our method with the state-of-the-art solutions in the literature and observed higher tracking accuracy, even when both the object and the camera are moving. In our future work, we will study how to increase the discriminative power for object tracking when cameras move quickly. Multi-object tracking in frame-skipping videos is another issue to be addressed in our future studies.