TRACK AND CUT: SIMULTANEOUS TRACKING AND SEGMENTATION OF MULTIPLE OBJECTS WITH GRAPH CUTS

Abstract: This paper presents a new method to both track and segment multiple objects in videos using min-cut/max-flow optimizations. We introduce objective functions that combine low-level pixel-wise measures (color, motion), high-level observations obtained via an independent detection module (connected components of foreground detection masks in the experiments), motion prediction and contrast-sensitive contextual regularization. One novelty is that external observations are used without adding any association step. The minimization of these cost functions simultaneously allows "detect-before-track" tracking (track-to-observation assignment and automatic initialization of new tracks) and segmentation of the tracked objects. When several tracked objects get mixed up by the detection module (e.g., a single foreground detection mask for objects close to each other), a second stage of minimization allows the proper tracking and segmentation of these individual entities despite the observation confusion. Experiments on sequences from the PETS 2006 corpus demonstrate the ability of the method to detect, track and precisely segment persons as they enter and traverse the field of view, even in cases of occlusions (partial or total), temporary grouping and frame dropping.


INTRODUCTION
Visual tracking is an important and challenging problem. Depending on the application context, it comes in various forms (automatic or manual initialization, single or multiple objects, still or moving camera, etc.), each of which is associated with an abundant literature. In a recent review on visual tracking (Yilmaz et al., 2006), tracking methods are divided into three categories: point tracking, silhouette tracking and kernel tracking. These three categories can be recast as "detect-before-track" tracking, dynamic segmentation and tracking based on distributions (color in particular).
The principle of "detect-before-track" methods is to match the tracked objects with observations provided by an independent detection module. Such a tracking can be performed with either deterministic or probabilistic methods. Deterministic methods amount to matching by minimizing a distance based on certain descriptors of the object. Probabilistic methods provide means to take measurement uncertainties into account and are often based on a state space model of the object properties.
Dynamic segmentation aims to extract successive segmentations over time. A detailed silhouette of the target object is thus sought in each frame. This is often done by evolving the silhouette obtained in the previous frame toward a new configuration in the current frame. It can be done using a state space model defined in terms of shape and motion parameters of the contour (Isard and Blake, 1998; Terzopoulos and Szeliski, 1993) or by the minimization of a contour energy functional. The contour energy includes temporal information in the form of either temporal gradients (optical flow) (Bertalmio et al., 2000; Cremers and Schnörr, 2003; Mansouri, 2002) or appearance statistics originated from the object and its surroundings in previous images (Ronfard, 1994; Yilmaz, 2004). In (Xu and Ahuja, 2002), the authors use graph cuts to minimize such an energy functional. The advantages of min-cut/max-flow optimization are its low computational cost, the fact that it converges to the global minimum without getting stuck in local minima, and that no prior global shape model is needed.
In the last group of methods ("kernel tracking"), the best location for a tracked object in the current frame is the one for which some feature distribution (e.g., color) is the closest to the reference one for the tracked object. The most popular method in this class is the one of Comaniciu et al. (Comaniciu et al., 2000;Comaniciu et al., 2003), where approximate "mean shift" iterations are used to conduct the iterative search. Graph cuts have also been used for illumination invariant kernel tracking in (Freedman and Turek, 2005).
These three types of tracking techniques have different advantages and limitations, and can serve different purposes. The "detect-before-track" methods can deal with the entries of new objects and the exit of existing ones. They use external observations that, if they are of good quality, allow robust tracking. However, this kind of tracking usually outputs bounding boxes only. By contrast, silhouette tracking has the advantage of directly providing the segmentation of the tracked object. With the use of recent graph cuts techniques, convergence to the global minimum is obtained at modest computational cost. Finally, kernel tracking methods, by capturing the global color distribution of a tracked object, allow robust tracking at low cost in a wide range of color videos.
In this paper, we address the problem of multiple-object tracking and segmentation by combining the advantages of the three classes of approaches. We suppose that, at each instant, the moving objects are approximately known from a preprocessing algorithm. Here, we use a simple background subtraction, but more complex alternatives could be applied. An important novelty of our method is that the use of external observations does not require the addition of a preliminary association step. The association between the tracked objects and the observations is jointly conducted with the segmentation and the tracking within the proposed minimization method. The connected components of the detected foreground mask serve as high-level observations. At each time instant, tracked object masks are propagated using their associated optical flow, which provides predictions. Color and motion distributions are computed on the objects segmented in the previous frame and used to evaluate individual pixel likelihoods in the current frame. We introduce for each object a binary labeling objective function that combines all these ingredients (low-level pixel-wise features, high-level observations obtained via an independent detection module and motion predictions) with a contrast-sensitive contextual regularization. The minimization of each of these energy functions with min-cut/max-flow provides the segmentation of one of the tracked objects in the new frame. Our algorithm also deals with the introduction of new objects and their associated trackers. When multiple objects trigger a single detection due to their spatial vicinity, the proposed method, like most detect-before-track approaches, can get confused. To circumvent this problem, we propose to minimize a secondary multi-label energy function which allows the individual segmentation of the concerned objects.
In section 2, notations are introduced and an overview of the method is given. The primary energy function associated with each tracked object is introduced in section 3. The introduction of new objects and the handling of complete occlusions are also explained in this section. The secondary energy function permitting the separation of objects wrongly merged in the first stage is introduced in section 4. Experimental results are reported in section 5, where we demonstrate the ability of the method to detect, track and precisely segment persons and groups, possibly with partial or complete occlusions and missing observations. The experiments also demonstrate that the second stage of minimization allows the segmentation of individual persons when spatial proximity makes them merge at the foreground detection level.

Principle and Notations
Throughout this paper, P denotes the set of N pixels of a frame from the input image sequence. To each pixel s of the image at time t is associated a feature vector z_{s,t} = (z_{s,t}^(C), z_{s,t}^(M)), where z_{s,t}^(C) is a 3-dimensional vector in RGB color space and z_{s,t}^(M) is a 2-dimensional vector of optical flow values. Using an incremental multiscale implementation of the Lucas-Kanade algorithm (Lucas and Kanade, 1981), the optical flow is in fact only computed at pixels with sufficiently contrasted surroundings. For the other pixels, color constitutes the only low-level feature. However, for notational convenience, we shall assume in the following that optical flow is available at each pixel.
We assume that, at time t, k_t objects are tracked. The i-th object at time t is denoted O_t^(i) and is defined as a mask of pixels, O_t^(i) ⊂ P. The goal of this paper is to perform both segmentation and tracking to get the object O_t^(i) corresponding to the object O_{t−1}^(i) of the previous frame. Contrary to sequential segmentation techniques (Juan and Boykov, 2006; Kohli and Torr, 2005; Paragios and Deriche, 1999), we bring in object-level "observations". They may be of various kinds (e.g., obtained by a class-specific object detector, or motion/color detectors). Here we consider that these observations come from a preprocessing step of background subtraction. Each observation amounts to a connected component of the foreground map after background subtraction (figure 1). The connected components are obtained using the "gap/mountain" method described in (Wang et al., 2000), ignoring small components. For the first frame, the tracked objects are initialized as the observations themselves. We assume that, at each time t, there are m_t observations. The j-th observation at time t is denoted M_t^(j) and is defined as a mask of pixels, M_t^(j) ⊂ P. Each observation is characterized by its mean feature vector:

z̄_t^(j) = (1/|M_t^(j)|) Σ_{s ∈ M_t^(j)} z_{s,t}.   (1)

The principle of our algorithm is as follows.
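As an illustration of the observation-extraction step, the sketch below extracts connected components of a binary foreground mask (4-connectivity, small components discarded) and computes the mean feature vector of a component, as in equation (1). This is a minimal stand-alone Python sketch, not the paper's implementation; the function names and the mask/feature representations are our own assumptions.

```python
from collections import deque

def connected_components(mask, min_size=2):
    """Extract connected components (4-connectivity) of a binary
    foreground mask, discarding components smaller than min_size.
    Each component is returned as a set of (row, col) pixels."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for r in range(h):
        for c in range(w):
            if mask[r][c] and not seen[r][c]:
                comp, queue = set(), deque([(r, c)])
                seen[r][c] = True
                while queue:
                    y, x = queue.popleft()
                    comp.add((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w \
                                and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                if len(comp) >= min_size:
                    components.append(comp)
    return components

def mean_feature(component, features):
    """Mean feature vector over a pixel mask, as in equation (1);
    `features` maps a pixel to its feature tuple."""
    n = len(component)
    dim = len(features[next(iter(component))])
    return tuple(sum(features[p][d] for p in component) / n
                 for d in range(dim))
```

Note that the paper uses the "gap/mountain" method of (Wang et al., 2000) rather than plain connected-component labeling; the sketch only conveys the idea of turning a foreground map into per-blob observations with mean features.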
Let d̄_{t−1}^(i) denote the mean, over all pixels of the object at time t − 1, of the optical flow values:

d̄_{t−1}^(i) = (1/|O_{t−1}^(i)|) Σ_{s ∈ O_{t−1}^(i)} z_{s,t−1}^(M).   (2)

The prediction P_t^(i) is obtained by translating each pixel belonging to O_{t−1}^(i) by this average optical flow:

P_t^(i) = {s + d̄_{t−1}^(i), s ∈ O_{t−1}^(i)}.   (3)

Using this prediction, the new observations, and the color and motion distributions of O_{t−1}^(i), an energy function is built. The energy is minimized using a min-cut/max-flow algorithm, which gives the new segmented object at time t, O_t^(i). The minimization also provides the correspondences of the object with all the available observations.
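The prediction step above (mean optical flow over the object mask, then translation of every pixel) can be sketched as follows. This is a hedged illustration with assumed data structures: the object is a set of (row, col) pixels and `flow` maps each pixel to its (dy, dx) optical flow vector; the rounding to an integer displacement is our own simplification.

```python
def predict_mask(obj_pixels, flow):
    """Predict the object mask at time t by translating every pixel of
    O_{t-1} by the mean optical-flow vector over the object, rounded
    to the nearest integer displacement."""
    n = len(obj_pixels)
    mean_dy = sum(flow[p][0] for p in obj_pixels) / n
    mean_dx = sum(flow[p][1] for p in obj_pixels) / n
    dy, dx = round(mean_dy), round(mean_dx)
    return {(y + dy, x + dx) for (y, x) in obj_pixels}
```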

Energy functions
We define one tracker for each object. To each tracker corresponds, for each frame, one graph and one energy function that is minimized using the min-cut/max-flow algorithm. Nodes and edges of the graph can be seen in figure 2.
Figure 2: Graph for object i at time t. Two observations are available, each giving rise to a special "observation" node. The pixel nodes circled in red correspond to the masks of these two observations. The dashed box indicates the predicted mask.

Graph
The undirected graph G_t = (V_t, E_t) is defined by a set of nodes V_t and a set of edges E_t. The set of nodes is composed of two subsets. The first subset is the set of the N pixels of the image grid P. The second subset corresponds to the observations: to each observation mask M_t^(j) is associated a node n_t^(j). We call these nodes "observation nodes". The set of nodes is thus V_t = P ∪ {n_t^(j), j = 1...m_t}. The set of edges decomposes as E_t = E_P ∪ E_M, where E_P represents all unordered pairs {s, r} of neighboring elements of P, and E_M gathers the edges connecting each observation node n_t^(j) to the pixels of its mask M_t^(j). Segmenting object i amounts to assigning a binary label l_{s,t}^(i) ("bg" or "fg") to each pixel node s and a label l_{j,t}^(i) ("bg" or "fg") to each observation node n_t^(j). The set of all the node labels forms L_t^(i).

Energy
An energy function is defined for each object at each instant. It is composed of unary data terms R_{s,t}^(i) and binary smoothness terms B_{s,r,t}^(i):

E_t^(i)(L_t^(i)) = Σ_{n ∈ V_t} R_{n,t}^(i)(l_n) + Σ_{{s,r} ∈ E_t} B_{s,r,t}^(i) (1 − δ(l_s, l_r)).   (4)

Data term
The data term only concerns the pixel nodes lying in the predicted region and the observation nodes. For all the other pixel nodes, the labeling will only be controlled by the neighbors via the binary terms. More precisely, the first part of the energy in (4) reads:

Σ_{s ∈ P_t^(i)} −ln p_1(z_{s,t} | l_s) + Σ_{j=1..m_t} −ln p_2(z̄_t^(j) | l_j).   (5)

The segmented object at time t should be similar, in terms of motion and color, to the preceding instance of this object at time t − 1. To exploit this consistency assumption, color and motion distributions of the object and of the background are extracted from the previous image. The distribution p_{t−1}^(i,C) for color, respectively p_{t−1}^(i,M) for motion, is a Gaussian mixture model fitted to the set of values {z_{s,t−1}^(C), s ∈ O_{t−1}^(i)}, respectively {z_{s,t−1}^(M), s ∈ O_{t−1}^(i)}. Under an independence assumption for color and motion, the likelihood of the individual pixel feature z_{s,t} according to the previous joint model is:

p_{t−1}^(i)(z_{s,t}) = p_{t−1}^(i,C)(z_{s,t}^(C)) p_{t−1}^(i,M)(z_{s,t}^(M)).   (6)

The two distributions for the background are q_{t−1}^(i,C) and q_{t−1}^(i,M). They are Gaussian mixture models built on the sets {z_{s,t−1}^(C), s ∈ P \ O_{t−1}^(i)} and {z_{s,t−1}^(M), s ∈ P \ O_{t−1}^(i)} respectively. The background likelihood at pixel s then reads:

q_{t−1}^(i)(z_{s,t}) = q_{t−1}^(i,C)(z_{s,t}^(C)) q_{t−1}^(i,M)(z_{s,t}^(M)).   (7)

The likelihood p_1, invoked in (5) within the predicted region, can now be defined as:

p_1(z_{s,t} | l_s = "fg") = p_{t−1}^(i)(z_{s,t}),   p_1(z_{s,t} | l_s = "bg") = q_{t−1}^(i)(z_{s,t}).   (8)

An observation should be used only if it is likely to correspond to the tracked object. Therefore, we use a similar definition for p_2. However, we do not evaluate the likelihood of each pixel of the observation mask but only that of its mean feature z̄_t^(j). The likelihood p_2 for the observation node n_t^(j) is defined as:

p_2(z̄_t^(j) | l_j = "fg") = p_{t−1}^(i)(z̄_t^(j)),   p_2(z̄_t^(j) | l_j = "bg") = q_{t−1}^(i)(z̄_t^(j)).   (9)
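The Gaussian mixture likelihoods and the color/motion independence assumption can be sketched as follows. This is an illustrative Python sketch assuming diagonal-covariance mixture components represented as (weight, means, variances) triples; the paper's actual fitting procedure and parameterization are not specified at this level of detail.

```python
import math

def gmm_pdf(x, components):
    """Density of a diagonal-covariance Gaussian mixture at point x.
    `components` is a list of (weight, means, variances) triples."""
    total = 0.0
    for weight, means, variances in components:
        log_p = 0.0
        for xi, mu, var in zip(x, means, variances):
            log_p += -0.5 * math.log(2 * math.pi * var) \
                     - (xi - mu) ** 2 / (2 * var)
        total += weight * math.exp(log_p)
    return total

def joint_likelihood(z_color, z_motion, color_gmm, motion_gmm):
    """Joint pixel likelihood under the color/motion independence
    assumption: p(z) = p_C(z_C) * p_M(z_M)."""
    return gmm_pdf(z_color, color_gmm) * gmm_pdf(z_motion, motion_gmm)
```

The data term at a pixel of the predicted region would then be the negative log of this joint likelihood (object model for "fg", background model for "bg").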

Binary term
Following (Boykov and Jolly, 2001), the binary term between neighboring pairs of pixels {s, r} of P is based on color gradients and has the form:

B_{s,r,t}^(i) = λ_1 (1/dist(s,r)) exp(−(z_{s,t}^(C) − z_{r,t}^(C))² / σ_T²).   (10)

As in (Blake et al., 2004), the parameter σ_T² is set to ⟨(z_{s,t}^(C) − z_{r,t}^(C))²⟩, where ⟨·⟩ denotes expectation over a box surrounding the object. For edges between a pixel node and an observation node, the binary term is similar, with weight λ_2 and the mean feature z̄_t^(j) of the observation in place of one of the pixel features. Parameters λ_1 and λ_2 are discussed in the experiments.
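The contrast-sensitive smoothness weight and the data-driven choice of σ_T² can be sketched as below; function names and the pair-sampling convention for the expectation are our own assumptions, not the paper's code.

```python
import math

def binary_term(z_s, z_r, lam, sigma2, dist=1.0):
    """Contrast-sensitive smoothness weight between two neighboring
    nodes: large when colors are similar (cutting there is expensive),
    small across strong color edges."""
    d2 = sum((a - b) ** 2 for a, b in zip(z_s, z_r))
    return lam * math.exp(-d2 / sigma2) / dist

def estimate_sigma2(pairs):
    """sigma_T^2 as the mean squared color difference over neighboring
    feature pairs sampled in a box surrounding the object."""
    return sum(sum((a - b) ** 2 for a, b in zip(z_s, z_r))
               for z_s, z_r in pairs) / len(pairs)
```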

Energy minimization
The final labeling of the pixels is obtained by minimizing the energy defined above:

L̂_t^(i) = arg min_{L_t^(i)} E_t^(i)(L_t^(i)).   (11)

This labeling gives the segmentation of the i-th object at time t as:

O_t^(i) = {s ∈ P : l̂_{s,t}^(i) = "fg"}.   (12)
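For a binary energy of this form, the optimal labeling is found by an s-t min-cut. As a self-contained illustration (the paper relies on an efficient dedicated solver; here we use textbook Edmonds-Karp for clarity, which is only practical on tiny graphs), the sketch below builds the usual construction: terminal links carry the unary costs, neighbor links carry the smoothness weights, and nodes on the source side of the cut are labeled "fg".

```python
from collections import deque

def min_cut_labels(num_nodes, source_cap, sink_cap, edges):
    """Binary labeling by s-t min-cut (Edmonds-Karp max-flow).
    source_cap[n] is the cost paid if node n is labeled 'bg' (its
    source link is cut); sink_cap[n] the cost if labeled 'fg';
    edges is a list of (u, v, weight) smoothness links."""
    S, T = num_nodes, num_nodes + 1
    cap = [dict() for _ in range(num_nodes + 2)]

    def add_edge(u, v, w):
        cap[u][v] = cap[u].get(v, 0.0) + w
        cap[v].setdefault(u, 0.0)

    for n in range(num_nodes):
        add_edge(S, n, source_cap[n])
        add_edge(n, T, sink_cap[n])
    for u, v, w in edges:
        add_edge(u, v, w)
        add_edge(v, u, w)

    while True:  # augment along shortest residual paths
        parent = {S: S}
        queue = deque([S])
        while queue and T not in parent:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if T not in parent:
            break
        bottleneck, v = float('inf'), T
        while v != S:
            u = parent[v]
            bottleneck = min(bottleneck, cap[u][v])
            v = u
        v = T
        while v != S:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u

    # nodes still reachable from S in the residual graph form the
    # source side of the min cut and are labeled 'fg'
    reach, queue = {S}, deque([S])
    while queue:
        u = queue.popleft()
        for v, c in cap[u].items():
            if c > 1e-12 and v not in reach:
                reach.add(v)
                queue.append(v)
    return ['fg' if n in reach else 'bg' for n in range(num_nodes)]
```

In practice one would use the Boykov-Kolmogorov solver, whose amortized behavior on grid graphs is what makes the per-frame minimization cheap.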

Handling complete occlusions
When the number of pixels belonging to a tracked object drops to zero, this object is likely to have disappeared, due either to its exit from the field of view or to a complete occlusion. If it is occluded, we want to recover it as soon as it reappears. Let t_0 be the time at which the size drops to zero, and S_t^(i) be the size of object i at time t. The simplest way to handle occlusions is to keep predicting the object using the information available just before its complete disappearance and to minimize the energy function with the distributions computed at time t_0 − 1. However, before being completely occluded, an object is usually partially occluded, which influences its shape, its motion and the feature distributions. Therefore, using only the information at time t_0 − 1 is not sufficient and a more complex scheme must be applied. To this end, we try to find the instant t_p at which the object started to be occluded. A Gaussian distribution N(S̄^(i), σ_S^(i)) on the size of the object is built and updated at each instant. If the current size S_t^(i) deviates from the mean S̄^(i) by more than a fixed multiple of σ_S^(i), then we consider that the object is partially occluded and set t_p = t − 1. The prediction and the distributions are finally built on averages over the 5 frames before t_p. Specific motion models depending on the application could have been used, but this falls beyond the scope of the paper.

Creation of new objects
One advantage of our approach lies in its ability to jointly manipulate pixel labels and track-to-detection assignment labels. This allows the system to track and segment the objects at time t while establishing the correspondence between an object currently tracked and all the approximate object candidates obtained by detection in the current frame. If, after the energy minimization for an object i, an observation node n_t^(j) is labeled "fg", it means that there is a correspondence between the i-th object and the j-th observation. If, for all the objects, an observation node is labeled "bg" (∀i, l_{j,t}^(i) = "bg"), then the corresponding observation does not match any object. In this case, a new object is created and initialized with this observation.
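The creation rule above reduces to a simple scan over the observation labels produced by the per-object minimizations. A minimal sketch, with an assumed representation (a dict mapping each object index to the list of "fg"/"bg" labels its minimization gave to the observation nodes):

```python
def unmatched_observations(obs_labels):
    """obs_labels[i][j] is the label ('fg'/'bg') given to observation
    node j by the minimization for object i. An observation labeled
    'bg' by every object matches no tracker and spawns a new one."""
    if not obs_labels:
        return []
    num_obs = len(next(iter(obs_labels.values())))
    return [j for j in range(num_obs)
            if all(labels[j] == 'bg' for labels in obs_labels.values())]
```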

Segmenting merged objects
Assume now that the results of the segmentations for different objects overlap, that is, ∩_{i∈F} O_t^(i) ≠ ∅, where F denotes the current set of indices of the overlapping objects. In this case, we propose an additional step to determine whether these objects truly correspond to the same one or whether they should be separated. At the end of this step, each pixel of ∩_{i∈F} O_t^(i) must belong to only one object. For this purpose, a new graph G̃_t = (Ṽ_t, Ẽ_t) is created, where Ṽ_t = ∪_{i∈F} O_t^(i) and Ẽ_t is composed of all unordered pairs of neighboring pixel nodes of Ṽ_t. The goal is then to assign to each node s of Ṽ_t a label ψ_s ∈ F. Defining L̃ = {ψ_s, s ∈ Ṽ_t}, the labeling of Ṽ_t, a new energy is defined as:

Ẽ_t(L̃) = Σ_{s ∈ Ṽ_t} −ln p_3(z_{s,t} | ψ_s) + λ_3 Σ_{{s,r} ∈ Ẽ_t} (1/dist(s,r)) exp(−(z_{s,t}^(C) − z_{r,t}^(C))² / σ_3²) (1 − δ(ψ_s, ψ_r)).   (13)
The parameter σ_3² is here set to ⟨(z_{s,t}^(C) − z_{r,t}^(C))²⟩, with the averaging being over i ∈ F and {s, r} ∈ Ẽ_t. The fact that several objects have been merged shows that their respective feature distributions at the previous instant did not suffice to distinguish them. A way to separate them is then to increase the role of the prediction. This is achieved by choosing the function p_3 so that, inside the predicted mask P_t^(i) of object i, the label ψ_s = i is favored. This multi-label energy function is minimized using the α-expansion and swap algorithms (Boykov et al., 1998). After this minimization, the objects O_t^(i), i ∈ F, are updated.
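To illustrate the structure of this multi-label objective (unary likelihood terms plus a Potts-type smoothness penalty), the sketch below minimizes it with iterated conditional modes, a crude local optimizer standing in for α-expansion/swap; it is an illustrative assumption, not the paper's solver, and the data representations are our own.

```python
def icm_separate(pixels, labels_init, unary_cost, neighbors, lam,
                 max_iters=10):
    """Iterated conditional modes on the multi-label objective
    sum_s unary_cost(s, psi_s) + lam * sum_{s,r} [psi_s != psi_r].
    Greedily relabels each node until no move lowers its local cost."""
    labels = dict(labels_init)
    label_set = sorted(set(labels.values()))
    for _ in range(max_iters):
        changed = False
        for s in pixels:
            best_label, best_cost = labels[s], None
            for l in label_set:
                cost = unary_cost(s, l)
                cost += lam * sum(1 for r in neighbors.get(s, ())
                                  if labels[r] != l)
                if best_cost is None or cost < best_cost:
                    best_label, best_cost = l, cost
            if best_label != labels[s]:
                labels[s] = best_label
                changed = True
        if not changed:
            break
    return labels
```

Unlike α-expansion, ICM has no approximation guarantee; it only conveys the form of the energy being minimized.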

Experimental Results
In this section we present various results on a sequence from the PETS 2006 data corpus (sequence 1, camera 4). The robustness to partial occlusions and the individual segmentation of objects that were initially merged are first demonstrated. We then present the handling of missing observations and of complete occlusions on other parts of the video. Following (Blake et al., 2004), the parameter λ_3 was set to 20. Parameters λ_1 and λ_2, however, had to be tuned by hand: λ_1 was set to 10 and λ_2 to 2. The number of components of the Gaussian mixture models was set to 10.

Observations at each time
First results (figure 4) demonstrate the good behavior of our algorithm even in the presence of partial occlusions and of object fusion. Observations, obtained by subtracting the reference frame (frame 10, shown in figure 3(a)) from the current one, are visible in the first column of figure 4. The second column contains the segmentation of the objects with the use of the second energy function. Each tracked object is represented by a different color. In frame 81, two objects are initialized using the observations. Note that the connected component extracted with the "gap/mountain" method misses the legs of the person in the upper right corner. While this impacts the initial segmentation, the legs are included in the segmentation from the subsequent frame onward. Even though, from the 102nd frame on, the two persons at the bottom of the frames correspond to only one observation, our algorithm tracks each person separately (frames 116, 146). Partial occlusions occur when the person at the top passes behind the three other ones (frames 176 and 206); this is well handled by the method, as the person is still tracked when the occlusion stops (frame 248).
In figure 5, we show in more detail the influence of the second energy function by comparing the results obtained with and without it. Before frame 102, the three persons at the bottom generate three distinct observations while, past this instant, they correspond to only one or two observations. Even though the motions and colors of the three persons are very close, the use of the secondary multi-label energy function allows their separation.

Missing observations
Figure 6 illustrates the capacity of the method to handle missing observations thanks to the prediction mechanism. In this test, we performed the background subtraction on only one frame out of three. In figure 6, we compare the segmentations thus obtained with the ones based on observations at each frame. The first column shows the intermittent observations, the second one the masks of the objects obtained in the case of missing observations, and the last one the masks with observations at each time step. Thanks to the prediction, the results are only partially altered by this drastic temporal subsampling of the observations. As one can see, even if one leg is missing in frames 805 and 806, it is recovered as soon as a new observation is available. Conversely, this result also shows that incorporating observations from the detection module yields better segmentations than using predictions alone.

Complete occlusions
Results in figure 7 demonstrate the ability of our method to deal with complete occlusions. In this portion of the video, we synthetically added a vertical white band to the images in order to generate complete occlusions. The reference frame can be seen in figure 3(b). In figure 7, the first column contains the original images (with the white band), the second one the observations and the last one the obtained segmentations. Our algorithm keeps tracking and segmenting the object as it progressively disappears, and resumes tracking and segmenting it as soon as it reappears.

Conclusion
In this paper we have presented a new method to simultaneously segment and track objects. Predictions and observations, composed of detected objects, are introduced in an energy function which is minimized using graph cuts. The use of graph cuts permits the segmentation of the objects at a modest computational cost. A novelty is the use of observation nodes in the graph, which yields better segmentations and also enables the direct association of the tracked objects to the observations (without adding any association procedure). The algorithm is robust to partial and complete occlusions, progressive illumination changes and missing observations. Thanks to the use of a secondary multi-label energy function, our method allows individual tracking and segmentation of objects which were not distinguished from each other in the first stage. The observations used in this paper are obtained by a simple background subtraction based on a single reference frame. Note however that a more complex background subtraction or object detection could be used as well with no change to the approach. As we use the feature distributions of objects at the previous time to define the current energy functions, our method breaks down in extreme cases of abrupt illumination changes. However, by adding an external detector of such changes, we could circumvent this problem by keeping only the prediction and updating the reference frame when the abrupt change occurs. Also, other cues, such as shape, could be added to improve the results. Apart from this rather specific problem, several research directions are open. One of them concerns the design of a unifying energy framework that would allow segmentation and tracking of multiple objects while precluding the incorrect merging of similar objects getting close to each other in the image plane.
Another direction of research concerns the automatic tuning of the parameters, which remains an open problem in the recent literature on image labeling (e.g., figure/ground segmentation) with graph-cuts.