Practical automatic background substitution for live video

In this paper we present a novel automatic background substitution approach for live video. The objective of background substitution is to extract the foreground from the input video and then combine it with a new background. We use a color line model to improve the Gaussian mixture model in the background cut method to obtain a binary foreground segmentation result that is less sensitive to brightness differences. Based on the high quality binary segmentation results, we can automatically create a reliable trimap for alpha matting to refine the segmentation boundary. To make the composition result more realistic, an automatic foreground color adjustment step is added to make the foreground look consistent with the new background. Compared to previous approaches, our method can produce higher quality binary segmentation results, and to the best of our knowledge, this is the first time that such an automatic and integrated background substitution system, able to run in real time, has been proposed, making it practical for everyday applications.

The first step is to extract the foreground from the input video, and the second step is to combine the original foreground with the new background. Given limited computational resources and time, it is even more challenging to achieve satisfactory background substitution results in real time for live video. In this paper, we focus on background substitution for live video and especially live chat video, in which the camera is monocular and static, and the background is also basically static.
Foreground segmentation, also known as matting, is a fundamental problem. Formally, foreground segmentation takes as input an image I, which is assumed to be a composite of a foreground image F and a background image B. The color of the ith pixel can be represented as a linear combination of the foreground and background colors, where α represents the opacity value:

$$I_i = \alpha_i F_i + (1 - \alpha_i) B_i \tag{1}$$

This is an ill-posed problem which needs assumptions or extra constraints to become solvable.
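As a small illustration of the linear compositing model described above, the following sketch blends a single RGB pixel from a foreground color, a background color, and an opacity value. The function name `composite_pixel` and the example colors are hypothetical and only serve the illustration. Note that each pixel yields three equations (one per color channel) but seven unknowns (three for F, three for B, one for α), which is why the inverse problem needs extra constraints.

```python
def composite_pixel(alpha, fg, bg):
    """Blend one RGB pixel as I = alpha * F + (1 - alpha) * B."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return tuple(alpha * f + (1.0 - alpha) * b for f, b in zip(fg, bg))


# A half-transparent red foreground pixel over a blue background pixel.
pixel = composite_pixel(0.5, (200.0, 0.0, 0.0), (0.0, 0.0, 100.0))
```

With α = 1 the foreground color is returned unchanged, and with α = 0 the background color is returned unchanged.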
Generally, existing work on foreground segmentation can be categorized into automatic approaches or interactive approaches. Automatic approaches usually assume that the camera and background are static, and a pre-captured background image is available. They try to model the background using either generative methods [6][7][8][9], or non-parametric methods [10,11]. Those pixels which are consistent with the background model are labeled as background, and the remainder are labeled as foreground.
Some recent works incorporate a conditional random field to include color, contrast, and motion cues, and use graph-cut to solve an optimization problem [12][13][14]. Most online automatic approaches only produce a binary foreground segmentation instead of fractional opacities for the sake of time, and then use feathering [12] or border matting [13] to compute approximate fractional opacities along the boundary. Feathering is a relatively crude, but efficient, technique that fades out the foreground at a fixed rate. Border matting is an alpha matting method that is significantly simplified to only collect the nearby foreground/background samples for each unknown pixel to allow fitting of a Gaussian distribution, which is then used to estimate the alpha value for that pixel. Although border matting also uses dynamic programming to minimize an energy function that encourages alpha values varying smoothly along the boundary, the result of border matting is far from globally optimal.

On the other hand, interactive approaches have been proposed to handle more complicated camera motion [1,15,16]. Since strictly real-time performance is unnecessary for such applications, they compute more precise fractional opacities along the segmentation boundary from the beginning. Such methods require the user to draw some strokes or a trimap in a few frames to indicate whether a pixel belongs to the foreground/background/unknown region. They then solve for the alpha values in the unknown region and propagate the alpha mask to other frames.
In contrast to the large number of publications on foreground segmentation, there are fewer studies on techniques for compositing the original foreground and a new background for background substitution. Since the light sources of the original video and the new background may be drastically different, directly copying the foreground to the new background will not achieve satisfactory results. Some seamless image composition techniques [17,18] may seem relevant at first glance, but they require the original and new backgrounds to be similar. Other color correction techniques based on color constancy [19][20][21][22] are more suitable in our context. Color constancy methods first estimate the light source color of the image, and then adjust pixel colors according to the specified hypothetical light source color.
In this paper, we present a novel practical automatic background substitution system for live video, especially live chat video.
Since real-time performance is necessary and interaction is inappropriate during live chat, our method is designed to be efficient and automatic. We first accomplish binary foreground segmentation by a novel method which is based on background cut [12]. To make the segmentation result less sensitive to brightness differences, we introduce a simplified version of the color line model [23] during the background modeling stage. Specifically, we build a color line for each background pixel and allow larger variance along the color line than in the perpendicular direction. We also include a more recent promising alpha matting method [24] to refine the segmentation boundary instead of feathering [12] or border matting [13]. To maintain real-time performance when including such a complicated alpha matting process, we perform foreground segmentation at a coarser level and then use simple but effective bilinear upsampling to generate a foreground mask for the finer level.
After foreground segmentation, in order to compensate for any lighting difference between the input video and the new background, we estimate the color of the light sources in both the input video and new background, and then adjust the foreground color based on the color ratio of the light sources. This color compensation process follows the same idea as the white-patch algorithm [25], but to our knowledge this is the first time such color compensation has been applied to background substitution. Compared to previous approaches, thanks to its invariance to luminance changes, the binary segmentation result of our method is more accurate, and, thanks to the alpha matting border refinement and foreground color compensation, the appearance of the foreground in our result is more compatible with the new background.
In summary, the main contributions of our paper are:

• A novel practical automatic background substitution system for live video.
• Introduction of a color line model in conjunction with a Gaussian mixture model in the background modeling stage, which makes the foreground segmentation result less sensitive to brightness differences.
• Application of a color compensation step to background substitution, which makes the inserted foreground look more natural in the new background.

Automatic video matting
Unlike interactive video matting methods [1,15,16,26], which need user interaction during video playback, automatic video matting is more appropriate for live video. The earliest kind of automatic video matting problem is constant color matting [27], which uses a constant backing color, often blue, so is usually called blue screen matting. Although excellent segmentation results can be achieved by blue screen matting, it needs extra equipment including a blue screen and careful setting of light sources. More recent video matting methods loosen the requirement for the background to have constant color, and only assume that the background can be pre-captured and remains static or only contains slight movements. They model the background using either generative methods, such as a Bayesian model [6], a self-organized map [7], a Gaussian mixture model [8], independent component analysis [9], a foreground-background mixture model [28], or non-parametric methods [10,11]. Such models allow prediction of the probability of a pixel belonging to the background. These methods can create holes in the foreground and noise in the background if the colors of the foreground and background are similar, because they only make local decisions. Some recent techniques utilize the power of graph-cut to solve an optimization problem based on a conditional random field using color, contrast, and motion cues [12][13][14]; they are able to create more complete foreground masks since they constrain the alpha matte to follow the original image gradient. Other work [29] focuses on foreground segmentation for animation.

In our case, in order to acquire real-time online matting for live video, it is inappropriate to include motion cues. Thus our model is only based on color and contrast, like the work of Sun et al. [12]. We also find that stronger shadow resistance can be achieved by employing a color line model [23]. Another drawback of existing online methods is that they only acquire a binary foreground segmentation and then use approximate border refinement techniques such as feathering [12] or border matting [13] to compute fractional opacities along the boundary. In this paper, we will show that a more precise alpha matting technique can be incorporated while real-time performance can still be achieved by performing foreground segmentation at a coarser level and then using simple bilinear upsampling to generate a finer level foreground mask.

Interactive video matting
Interactive video matting is another popular video matting approach. It no longer requires a known background and static camera, and takes a user-drawn trimap or strokes to tell if a pixel belongs to the foreground/background/unknown region. For images, previous methods are often sampling-based [30], affinity-based [24], or a combination of both [31], computing alpha values for the unknown region based on the known region information. For video, Chuang et al. [15] use optical flow to propagate the trimap from one frame to another. Video SnapCut [1] maintains a collection of local classifiers around the object boundary. Each classifier subsequently solves a local binary segmentation problem, and classifiers of one frame are propagated to subsequent frames according to motion vectors estimated between frames. However, these methods need to take all frames at once to compute reliable motion vectors, which takes a huge amount of time, so they are unsuitable for online video matting. Gong et al. [16] use two competing one-class support vector machines (SVMs) to model the background and foreground separately for each frame at every pixel location, and use the probability values predicted by the SVMs to estimate the alpha matte; they update the SVMs over time. Near real-time performance is available with the help of a GPU, but the method still needs an input user trimap and an extra training stage, so it is inconvenient for live video applications.
There are three main categories of methods for color adjustment to improve the realism of image composites. The first category focuses on color consistency or color harmony. For example, Wong et al. [32] adjust foreground colors to be consistent with nearby surrounding background pixels, but their method fails when nearby background pixels do not correctly represent the overall lighting conditions. Cohen-Or et al. [33] and Kuang et al. [34] consider overall color harmony based on either aesthetic rules or models learned from a dataset, but they tend to focus on creating aesthetic images rather than realistic images. The second category of methods focuses on seamless cloning based on solving a Poisson equation or coordinate interpolation [2,17,18,35,36]. A major assumption in these approaches is that the original background is similar to the new background, which we cannot guarantee in our application. The third category of methods is based on color constancy, estimating the illumination of the image first and then adjusting colors accordingly [19][20][21][22]. In this paper, we utilize the most basic and popular color constancy method, the white-patch algorithm [25], to estimate the light source color, since we need its efficiency for real-time application.

Overview
We now outline our method. The pipeline can be separated into three steps: foreground segmentation, border refinement, and final composition.

Firstly, for the foreground segmentation step, we suppose the background can be pre-captured and remains static. Inspired by background cut [12], we build a global Gaussian mixture background model, local single Gaussian background models at all pixel locations, and a global Gaussian mixture foreground model. But unlike background cut, instead of using an isotropic variance for the local single Gaussian background models, we make the variance along the color line larger than that in the direction perpendicular to the color line. Here the concept of a color line is borrowed from Ref. [23]. The original color line model built multiple curves to represent all colors of the whole image, and assumed that colors from the same object lie on the same curve. Checking which curve a pixel belongs to is a time-consuming process. In order to achieve real-time performance, we adapt the color line model to a much simpler and more efficient version. In our basic version of the color line model, for each pixel we build a single curve color line model, which avoids the process of matching a pixel to one of the curves in the multiple curve model. Furthermore, instead of fitting a curve, we fit a straight line that intersects the origin in RGB space, which means we ignore the non-linear transform of the camera sensor. Our experiments show this simplified model to be sufficient and effective. By utilizing this color line model, we can avoid misclassifying background pixels which undergo color changes due to a shadow passing by, since color changes caused by shadows still remain along the color line. Using this background cut model, we can build an energy function that can be optimized by graph-cut to give a binary foreground segmentation matte.

Secondly, we carry out border refinement for this binary foreground matte. Specifically, we use morphological operations to mark the border pixels between foreground and background. Considering these border pixels to be the unknown region results in a trimap. We then carry out closed-form alpha matting [24], which computes fractional alpha values for these border pixels. It is important to emphasize that a trimap can be computed automatically and reliably in this way only when the binary foreground segmentation result is essentially correct.

Lastly, to perform composition, we estimate the light source colors of the original input video and the new background separately, and adjust the foreground colors accordingly to make the foreground look more consistent with the new background.

Basic background cut model
In this section we briefly describe the background cut model proposed in Ref. [12]. The background cut algorithm takes a video and a pre-captured background as input, and the output is a sequence of binary foreground masks, in which each pixel r is labeled 0 if it belongs to the background or 1 otherwise. Background cut solves the foreground segmentation problem frame by frame. For each frame, the process of labeling can be transformed into solving a global optimization problem. The energy function to be minimized is in the form of a conditional random field:

$$E(X) = \sum_{r} E_d(x_r) + \lambda_1 \sum_{(r,s)} E_c(x_r, x_s) \tag{2}$$

where $X = \{x_r\}$, $x_r$ denotes the label value, r, s are neighbouring pixels in one frame, $E_d$ (the data term) represents per-pixel energy, and $E_c$ is a contrast term computed from neighbouring pixels. Here $\lambda_1$ is a predefined constant balancing $E_d$ and $E_c$, which is empirically set to 30 in our experiments. This is a classical energy function which can be minimized by graph-cut [37].

Now we explain how to construct $E_d$ and $E_c$. First we model the foreground and the background using Gaussian models. For the foreground, we build a global Gaussian mixture model (GMM). For the background, we not only build a global GMM, but also a local single Gaussian distribution model at each pixel location (a per-pixel model). The two global GMMs are defined as

$$p(v_r \mid x_r = i) = \sum_{k=1}^{k_i} w_k^i \, N(v_r \mid \mu_k^i, \Sigma_k^i) \tag{3}$$

where i = 0 and i = 1 stand for background and foreground respectively, $v_r$ denotes the color of pixel r, $k_i$ denotes the number of mixture components, $w_k^i$ denotes the weight of the kth component, N denotes the Gaussian distribution, $\mu_k^i$ denotes the mean, and $\Sigma_k^i$ denotes the covariance matrix. The single Gaussian distribution at every pixel location is defined as

$$p_s(v_r) = N(v_r \mid \mu_r^s, \Sigma_r^s) \tag{4}$$

where $\Sigma_r^s = \sigma_r^s I$, so, following Ref. [12], the variance of the per-pixel model is isotropic. The background global GMM and the background per-pixel model are initialized using pre-captured background data. The foreground global GMM is initialized using pixels whose probabilities are lower than a threshold in the background model. After initialization, these Gaussian models are updated frame by frame according to the segmentation results.
Based on the Gaussian models, the data term $E_d$ is defined as in Ref. [12], with $\lambda_2$ a predefined constant balancing the global GMM and the local per-pixel model, which is empirically set to 0.1 in our experiments. The contrast term $E_c$ likewise follows Ref. [12]: $d_B(r, s)$ is a contrast attenuation term proportional to the contrast with respect to the background, $z_{rs} = \max(\|v_r - v_r^B\|, \|v_s - v_s^B\|)$ measures the dissimilarity between the pre-captured background and the current frame, and β, K, and $\sigma_z$ are predefined constants. In our experiments, we set β = 0.005, K = 1, $\sigma_z$ = 10. The introduction of the contrast attenuation term causes $E_c$ to rely on the contrast from the foreground instead of the background.
The energy function in Eq. (2) can be optimized using the graph-cut algorithm [37]. For more details of the model, please refer to Ref. [12]. One major drawback of this background cut model is that, when the color of a background pixel changes due to changes in illumination, it will have extremely low probability in the per-pixel model, which will cause the pixel to be misclassified as foreground instead of background.
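To make the role of the contrast attenuation term more concrete, the following sketch evaluates a pairwise contrast energy of the kind used in background cut [12]. The exact expression is an assumption here: the current-frame contrast between two neighbouring pixels is divided by a factor that grows with the background contrast and decays as the frame departs from the pre-captured background. The function names are hypothetical, and the default constants are the values β = 0.005, K = 1, σ_z = 10 quoted in the text.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two RGB triples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def attenuated_contrast(v_r, v_s, b_r, b_s, K=1.0, sigma_z=10.0):
    """Frame contrast between neighbours r and s, attenuated where the
    pre-captured background (b_r, b_s) already contains that contrast.
    Assumed form, modelled on background cut [12]."""
    z = max(math.sqrt(sq_dist(v_r, b_r)), math.sqrt(sq_dist(v_s, b_s)))
    attenuation = 1.0 + (sq_dist(b_r, b_s) / K ** 2) * math.exp(-z ** 2 / sigma_z)
    return sq_dist(v_r, v_s) / attenuation


def contrast_energy(x_r, x_s, v_r, v_s, b_r, b_s, beta=0.005, K=1.0, sigma_z=10.0):
    """Cost of giving neighbours r and s the labels x_r and x_s."""
    if x_r == x_s:
        return 0.0  # no cut between equally labeled pixels
    return math.exp(-beta * attenuated_contrast(v_r, v_s, b_r, b_s, K, sigma_z))


# An edge that already exists in the background: cutting there stays expensive.
bg_edge = contrast_energy(0, 1, (0, 0, 0), (100, 100, 100), (0, 0, 0), (100, 100, 100))
# A new edge created by a foreground object over a flat background: cheap to cut.
fg_edge = contrast_energy(1, 0, (200, 50, 50), (0, 0, 0), (0, 0, 0), (0, 0, 0))
```

Under these assumptions, an edge inherited from the background keeps an energy close to 1, while a strong edge introduced by the foreground has an energy close to 0, so the minimum cut prefers to follow foreground boundaries.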

Background cut with color line model
Now we explain how the color line model [23] can improve the effectiveness of the background cut model in the presence of shadows.
Starting from the basic color line model, we make the assumption that colors of a certain material under different intensities of light form a linear color cluster that intersects the origin in RGB space. Suppose the average color at a pixel location is $\mu_r^s = (r, g, b)$. When the illumination of the same pixel location changes, its color will also change from $\mu_r^s$ to $v_r$. According to the color line model, $v_r$ will approximately lie on the line connecting the origin and $\mu_r^s$ in the RGB color space. With this insight, we can decompose $v_r$ into a component parallel to the color line and a component perpendicular to it, and model the deviation in each direction by a Gaussian distribution, giving a likelihood $f(v_r, \mu_r^s)$, where $\sigma_{pe}$ and $\sigma_{pa}$ are the respective variances of the Gaussian distributions for the perpendicular direction and parallel direction. Then the per-pixel single Gaussian distribution Eq. (4) is modified to be

$$p_s(v_r) = f(v_r, \mu_r^s) \tag{10}$$

As discussed before, the color of an object is more likely to fluctuate in the parallel direction than in the perpendicular direction. Therefore, we set $\sigma_{pe} = \sigma_r^s$, $\sigma_{pa} = \lambda_3 \sigma_{pe}$, $\lambda_3 > 1$ to constrain variance in the perpendicular direction and tolerate variance in the parallel direction, which gives our model a strong resistance to shadow. Here we do not build a global color line model as in Ref. [23], which uses multiple color lines for the whole image to replace the global GMM, because it takes a long time to determine which line each pixel belongs to when the number of lines is large (e.g., a model with 40 lines is used in Ref. [23]), precluding real-time performance.
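The effect of the anisotropic variance can be illustrated with a minimal sketch. It assumes one particular form for the per-pixel likelihood: a three-dimensional Gaussian whose variance is λ3 times larger along the line from the origin to the mean color than in the two perpendicular directions. The function name `color_line_likelihood`, the normalization, and the value λ3 = 100 are assumptions made for illustration only; the text above only requires λ3 > 1.

```python
import math


def color_line_likelihood(v, mu, var_pe, lam3=100.0):
    """Anisotropic Gaussian likelihood of color v around mean color mu.

    var_pe is the variance perpendicular to the color line and
    lam3 * var_pe the variance along it (assumed form)."""
    norm_mu = math.sqrt(sum(m * m for m in mu))
    if norm_mu == 0.0:
        raise ValueError("the mean color must not be the origin")
    u = [m / norm_mu for m in mu]                    # unit vector along the color line
    diff = [a - b for a, b in zip(v, mu)]
    d_pa = sum(d * ui for d, ui in zip(diff, u))     # signed offset along the line
    d_pe_sq = sum(d * d for d in diff) - d_pa ** 2   # squared offset across the line
    d_pe_sq = max(d_pe_sq, 0.0)                      # guard against rounding error
    var_pa = lam3 * var_pe
    norm = (2.0 * math.pi) ** 1.5 * math.sqrt(var_pa) * var_pe
    return math.exp(-0.5 * (d_pa ** 2 / var_pa + d_pe_sq / var_pe)) / norm


mu = (120.0, 100.0, 80.0)
shadowed = (72.0, 60.0, 48.0)      # same chromaticity, 60% of the brightness
off_line = (165.0, 46.0, 80.0)     # a similar distance away, but across the line

p_shadow_line = color_line_likelihood(shadowed, mu, var_pe=25.0, lam3=100.0)
p_shadow_iso = color_line_likelihood(shadowed, mu, var_pe=25.0, lam3=1.0)
p_off_line = color_line_likelihood(off_line, mu, var_pe=25.0, lam3=100.0)
```

With λ3 = 1 the sketch reduces to the isotropic per-pixel model, under which the shadowed color is extremely unlikely. With λ3 > 1 the shadowed color remains plausible as background, while a color displaced by a similar distance perpendicular to the line is still rejected.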

Border refinement
After graph-cut, we add an extra hole filling step by applying the morphological close operation to fill small holes in the foreground mask. See Fig. 1 for an example. However, we currently still have a binary foreground matte (see Fig. 1(d)). In this subsection, we explain how to automatically compute fractional alpha values for the segmentation border.
First, we automatically generate a mask covering the segmentation border as the unknown region:

$$U_i = 1 - \mathrm{erode}(F)_i - \mathrm{erode}(B)_i$$

Here $U_i$ denotes the value of the ith pixel of the unknown mask, erode() denotes the morphological erosion operation, F is the binary foreground matte, and B is the binary background matte where $B_i = 1 - F_i$. The morphological operation radius is set to 2 for 640 × 480 input. The eroded foreground mask, eroded background mask, and unknown region mask are painted in white, black, and gray respectively in the final trimap. Using this trimap with one of the most popular alpha matting methods [24], we calculate the fractional alpha values for the unknown region. See Fig. 2 for an example of the generated trimap and alpha matting result.
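The trimap construction can be sketched in a few lines. The following example is illustrative only: it assumes a square structuring element, ignores neighbours that fall outside the image, and encodes white, gray, and black as 255, 128, and 0. None of these choices is specified in the text, and a practical implementation would use an optimized image processing library rather than nested Python loops.

```python
def erode(mask, radius):
    """Binary erosion with a (2*radius+1) x (2*radius+1) square window.
    Neighbours outside the image are ignored (an assumption)."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = int(all(
                mask[yy][xx]
                for yy in range(max(0, y - radius), min(h, y + radius + 1))
                for xx in range(max(0, x - radius), min(w, x + radius + 1))
            ))
    return out


def make_trimap(fg, radius=2):
    """Return a trimap: 255 = sure foreground, 0 = sure background, 128 = unknown."""
    bg = [[1 - v for v in row] for row in fg]
    er_f, er_b = erode(fg, radius), erode(bg, radius)
    return [
        [255 if f else 0 if b else 128 for f, b in zip(row_f, row_b)]
        for row_f, row_b in zip(er_f, er_b)
    ]


# One image row with a foreground/background boundary in the middle.
trimap = make_trimap([[0, 0, 0, 0, 1, 1, 1, 1]], radius=1)
```

The unknown band straddles the binary boundary, and its width is controlled by the erosion radius (2 for 640 × 480 input according to the text).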

Composition
For an ideal final composition, the new composite image should be

$$I_{new} = \alpha F_{old} + (1 - \alpha) B_{new}$$

where $F_{old}$ denotes the original foreground, and $B_{new}$ denotes the new background (Fig. 3(c)). For previous methods for which the pre-captured background (Fig. 3(a)) is unavailable, $F_{old}$ is approximated by $I_{old}$:

$$I_{new} = \alpha I_{old} + (1 - \alpha) B_{new}$$

However, in our case, since the pre-captured background $B_{old}$ is available, we can calculate the original foreground more accurately from Eq. (1):

$$F_{old} = \frac{I_{old} - (1 - \alpha) B_{old}}{\alpha}$$

This gives a final composition formula:

$$I_{new} = I_{old} - (1 - \alpha) B_{old} + (1 - \alpha) B_{new}$$

Directly applying the above composition will create unrealistic results due to the difference in light source colors between the original input and the new background. Thus, we propose a color compensation process to deal with this problem. First, we need to estimate the light source colors of the original input video and the new background image. The white-patch method [25], a popular color constancy method, assumes that the highest values in each color channel represent the presence of white in the image. In this paper, we use the variant of the white-patch method designed for CIE-Lab space, a color space that is naturally designed to separate lightness and chroma. We first calculate the accumulated histogram in the lightness channel L of an image in CIE-Lab space, and consider the 10% of pixels with the largest lightness values to be the white pixels. Figure 3 shows an example of the light source masks. The estimated light source color is then computed as the mean color value of all light source pixels. Denote the estimated light source color of the input video as $c_{old}$, and that of the new background image as $c_{new}$. Then the new composite image after color compensation is obtained by scaling the foreground term of the composition formula by the color ratio of the two light sources, $c_{new} / c_{old}$. Figure 4 compares results with and without light source color compensation. We can clearly see that the result with color compensation is more realistic.
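The light source estimation and the ratio-based adjustment can be sketched as follows. This is a minimal illustration under stated assumptions: lightness values are supplied by the caller (for instance the L channel after a CIE-Lab conversion, which is not shown), the ratio is applied channel by channel to plain color triples, and the function names are hypothetical.

```python
def estimate_light_source(colors, lightness, top_fraction=0.10):
    """Mean color of the brightest pixels, in the spirit of white-patch.

    colors: list of 3-tuples; lightness: one lightness value per pixel."""
    if not colors or len(colors) != len(lightness):
        raise ValueError("colors and lightness must be non-empty and equally long")
    n_top = max(1, int(round(top_fraction * len(colors))))
    order = sorted(range(len(colors)), key=lambda i: lightness[i], reverse=True)
    brightest = [colors[i] for i in order[:n_top]]
    return tuple(sum(c[k] for c in brightest) / n_top for k in range(3))


def compensate(color, c_old, c_new):
    """Scale a foreground color by the channel-wise light source ratio c_new / c_old."""
    return tuple(v * n / o for v, o, n in zip(color, c_old, c_new))


# Ten pixels of increasing lightness; the brightest one defines the light source.
colors = [(20.0 * i, 18.0 * i, 16.0 * i) for i in range(1, 11)]
lightness = [float(i) for i in range(1, 11)]
c_old = estimate_light_source(colors, lightness)       # warm light in the input
c_new = (100.0, 180.0, 240.0)                          # cooler light in the new scene
adjusted = compensate((100.0, 90.0, 80.0), c_old, c_new)
```

In this sketch the adjusted foreground color shifts toward the cooler light of the new background, which is the behaviour the compensation step aims for.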

Results and discussion
In this section, we report results generated under different conditions. All results shown were generated using fixed parameters.
Results for different frames of the same input video. Figure 5 shows that our method can create generally good background substitution results for different frames, no matter what the gesture is. Sometimes there may be residual background between the fingers (e.g., Fig. 5(c)) due to the hole-filling post-processing, but it does not do much harm to the overall effect.
Results for different input videos. Figure 6 shows that our method can deal with different kinds of foreground and background. Color compensation works well for various lighting conditions. Although the matting border is not perfect for Fig. 6(b) due to confusion of hair and background, the composition result is generally good.
Comparison with previous methods. We compare various methods: fuzzy Gaussian [38], adaptive-SOM [7], background cut [12] using RGB color space and CIE-Lab color space, and our color line model. To implement the fuzzy Gaussian and adaptive-SOM methods, we used the code in the BGSLibrary [39]. There are also other background subtraction methods in the BGSLibrary; we chose these two methods because they show the most promising results under real-time conditions. Figure 7 shows foreground masks created by different methods. After the person walks into the picture, some shadow will be cast onto the wall. The fuzzy Gaussian and adaptive-SOM methods create a lot of noise and holes since they do not utilize gradient information between neighbouring pixels. Background cut used in RGB color space does a better job by using the graph-cut model to introduce gradient information. However, it is sensitive to brightness differences, which causes shadow to be misclassified as foreground. If we set the variance of the Gaussian to be larger to tolerate some shadow, part of the true foreground is then misclassified as background. Background cut in CIE-Lab color space also suffers from the same issue. Although allowing a larger variance in the L channel can give greater tolerance to brightness changes, in actual test cases, even when we only increase the variance in the L channel by a small amount, part of the collar disappears. In contrast, using our color line model with background cut consistently creates a better foreground segmentation result.
To further quantitatively evaluate the comparison, we created a large number of "ground truth" foreground masks following a similar approach to one in Ref. [40]. The key idea is to use some balls as the moving foreground objects, and use a circle detection technique to detect the balls, which will automatically create "ground truth" masks for evaluation of our foreground segmentation methods. Specifically, we first calculate the difference image between the pre-captured background and the current frame (where one or more balls appear). Then we perform circle detection using the Hough transform [41] on the difference image, which generally produces reliable and accurate detection results. Finally, we manually eliminate the small number of outliers that occur when circle detection fails. In total, 4105 frames and their circle detection results are collected as the ground truth. Figure 8 shows a few examples. We did not use the ground truth from the VideoMatting benchmark [42], because their synthetic test images do not have shadows on the background, which is one of the fundamental aspects we wish to test.

Using the generated ground truth, we tested different methods including fuzzy Gaussian, adaptive-SOM, background cut using RGB color space and CIE-Lab color space, and our color line model. For fuzzy Gaussian and adaptive-SOM, we used the default parameters provided by the BGSLibrary. For the background cut method, we tested several parameters and gave results with the highest F1 score. Table 1 shows that background cut with our color line model achieves the highest F1 score, the CIE-Lab space method follows closely, and others are substantially worse. However, as we have already shown in Fig. 7, CIE-Lab space has an obvious drawback in actual application scenarios. We also tested an outdoor scene with different methods to show the effectiveness of our model: see Fig. 9. In conclusion, our color line model generally creates a better foreground segmentation boundary, and is effective at coping with differences in brightness.
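For reference, the F1 score of a predicted binary mask against a ground truth mask can be computed as in the sketch below. It uses the standard definition (harmonic mean of precision and recall on foreground pixels); how scores are aggregated over the 4105 frames is not stated in the text, so this example handles a single pair of masks, and the function name is hypothetical.

```python
def f1_score(pred, truth):
    """F1 score of a binary foreground mask against a ground truth mask.

    Both masks are lists of rows holding 0 (background) or 1 (foreground)."""
    tp = fp = fn = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            if p and t:
                tp += 1          # foreground pixel found correctly
            elif p and not t:
                fp += 1          # background reported as foreground
            elif t and not p:
                fn += 1          # foreground pixel missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)


score = f1_score([[1, 1, 0, 0]], [[1, 0, 1, 0]])
```

A shadow misclassified as foreground lowers precision, while a missing part of the true foreground lowers recall, so both failure modes discussed above reduce the score.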
Results with new background. We also tested our color compensation method using new backgrounds with different light sources. In Fig. 10, the first row shows the new input backgrounds, and the second row shows the light source pixel masks. The third row contains the composition results; we can see that the color of the foreground varies correctly according to different backgrounds.
Acceleration. Although we restrict the alpha matting computation to a very small region, it is still computationally expensive. In order to enable our algorithm to run in real time, we first downsample the input frames by a scale of two, carry out foreground cut and alpha matting on the downsampled images, and then upsample the matting result to the original scale. We finish the final composition step at the original scale. We call this process "sampling acceleration". As we can see in Fig. 11, the matting result using sampling acceleration is very similar to the result produced by processing the full frames. If we do not use alpha matting to refine the border, the border is jagged (see Fig. 11(c)).
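The upsampling half of this sampling acceleration can be sketched as follows. The example doubles the resolution of an alpha matte with bilinear interpolation, assuming pixel-centre alignment and clamping at the image border. Both conventions, and the function name, are assumptions; a real-time implementation would rely on an optimized resize routine instead.

```python
def bilinear_upsample2(img):
    """Upsample a 2-D list of floats by a factor of two with bilinear interpolation."""
    h, w = len(img), len(img[0])

    def source_coord(dst, size):
        # Map an output pixel centre to input coordinates, clamped to the image.
        src = min(max((dst + 0.5) / 2.0 - 0.5, 0.0), size - 1.0)
        lo = int(src)
        hi = min(lo + 1, size - 1)
        return lo, hi, src - lo

    out = []
    for y in range(2 * h):
        y0, y1, fy = source_coord(y, h)
        row = []
        for x in range(2 * w):
            x0, x1, fx = source_coord(x, w)
            top = img[y0][x0] * (1.0 - fx) + img[y0][x1] * fx
            bottom = img[y1][x0] * (1.0 - fx) + img[y1][x1] * fx
            row.append(top * (1.0 - fy) + bottom * fy)
        out.append(row)
    return out


# A coarse alpha matte with a hard 0 -> 1 transition becomes a smooth ramp.
fine = bilinear_upsample2([[0.0, 1.0]])
```

Because the coarse matte already carries fractional alpha values along the border, interpolating it gives a smooth boundary at full resolution, whereas upsampling a binary mask would only enlarge the jagged edge.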
Performance. We have implemented our method in C++ on a PC with an Intel 3.4 GHz Core i7-3770 CPU. For a 640 × 480 input video, our background substitution program can run at 10 frames per second using just the CPU, and it can run at a real-time frame rate with GPU parallelization.

Conclusions
In this paper, we have presented a novel background substitution method for live video. It optimizes a cost function based on Gaussian mixture models and a conditional random field, using graph-cut. A color line model is used when computing the Gaussian mixture model to make the model less sensitive to brightness differences. Before final composition, we use alpha matting to refine the segmentation border. Light source colors of the input video and new background are estimated by a simple method, and we adjust the foreground colors accordingly to give more realistic composition results. Compared to previous methods, our approach can automatically produce more accurate foreground segmentation masks and more realistic composition results, while still maintaining real-time performance.

Fig. 1 (a) Pre-captured background. (b) One frame of the input video. (c) Binary foreground matte after graph-cut. (d) Foreground matte after filling holes.

Fig. 3 (a) Pre-captured background. (b) Estimated light source mask of the pre-captured background (a). (c) New background. (d) Estimated light source mask of the new background (c).

Fig. 4 (a) Composite result without color compensation. (b) Composite result with color compensation.

Fig. 8 Example frames for creating ground truth.

Fig. 11 (a) Matting result without sampling acceleration. (b) Matting result with sampling acceleration. (c) Foreground segmentation result with sampling acceleration but without alpha matting border refinement.

Table 1 Method comparison on ground truth dataset