Segmentation of Moving Objects using Background Subtraction Method in Complex Environments

Background subtraction is a widely used approach for localizing moving objects in a video sequence. However, detecting an object against a spatiotemporally varying background, such as rippling water, a moving curtain, changing illumination, or low resolution, is not a straightforward task. To deal with this problem, we propose a background maintenance scheme that updates background pixels by estimating the current spatial variance along the temporal line. The work focuses on making the background model immune to local motion in the background. Finally, the most suitable label assignment for the motion field is estimated and optimized using iterated conditional modes (ICM) under a Markovian framework. Performance evaluation and comparisons with other well-known background subtraction methods show that the proposed method is unaffected by aperture distortion, ghost images, and high-frequency noise.


Introduction
Moving object segmentation in video frames is a key step in many computer vision applications, including human activity analysis, traffic monitoring, and video surveillance [1]. The difficulty of identifying suspicious activities and endangered objects in public places such as shopping centers, airports, and banks has become a matter of concern and has motivated the development of precise and robust surveillance systems [2].
In short, motion detection determines which points or groups of points are non-stationary across two or more consecutive images of a video sequence. Object segmentation and motion perception in video frames are prerequisites for many post-processing steps such as target classification, behavior recognition, and monitoring [3], [4]. Existing methods for motion detection include optical flow [5], frame difference [6], statistical methods [7], and background subtraction [8][9][10][11]. The frame difference method is robust, adapts well to varying environments, and has low computation time and complexity. However, it creates holes inside the target because the relevant pixels are generated incompletely on the foreground mask. Optical flow is a reliable approach for local motion estimation, but it demands dedicated hardware for real-time use. Background subtraction, on the other hand, is a simple way to localize the target without any prior information about the scene. Although it is inexpensive in memory and computation time, it struggles to remain accurate when the background behaves in a spatiotemporally varying manner.
Traditional background subtraction schemes such as the approximated median filter (AMF) [12], the Kalman filter [13], and the single Gaussian filter [14] leave some irrelevant pixels on the foreground because their background maintenance schemes do not correlate spatial and temporal constraints. Adjusting the learning rate of the background pixels is another potential problem in background maintenance [15]. Adaptive algorithms with a fast learning rate quickly absorb environmental noise, but they also prevent the complete set of relevant target pixels from being generated. Algorithms with a slow learning rate, in contrast, are less robust to slow-moving objects and tend to generate multiple images, or ghosts, on the foreground image [16]. Furthermore, these algorithms do not integrate any data validation technique that exploits the inter-pixel relationship to reduce misclassification on the foreground mask.
In this paper, we focus on enhancing the robustness of the background subtraction method under static and dynamic background conditions [17]. Initially, the spatial and temporal constraints are mapped to exclude the impulsive effect of the registered background model. Region-level processing is then conducted to assign the proper label to the moving object on the foreground image. The rest of the paper is organized as follows. Section 2 presents the essential related work on background modeling and foreground validation. Section 3 explains the proposed background model and foreground validation scheme. Experimental results are explored in Section 4. Concluding remarks are given in Section 5.

Related Work
In this section, we give an overview of some well-known background subtraction methods together with their updating schemes and foreground validation tasks. Background subtraction methods differ in the procedure used to update the reference background during motion detection.
In [14], the author suggested a running average (RA) method that uses a first-order recursive filter to update the background model by integrating each new incoming frame. Although it is adaptive and requires little memory, it is sensitive to the ghost effect and to environmental noise. It depends heavily on a fixed learning rate, and the effect of a fast or slow learning rate is discussed in the introduction of this paper. Temporal difference is very robust to environmental noise, but it does not generate all of the relevant pixels on the foreground [18]. In [19], the authors suggest integrating the edges extracted through an SDGD filter to complement the missing pixels on the foreground. However, the background image is updated according to the traditional scheme, and the method requires high computation time because of the second-order derivatives involved.
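The running average update described above can be sketched in a few lines. This is a minimal illustration rather than the implementation of [14]: the function names, the convention that `alpha` weights the incoming frame, and the toy threshold are all assumptions made for the example.

```python
def running_average_update(background, frame, alpha):
    """First-order recursive filter: B_t = (1 - alpha) * B_{t-1} + alpha * I_t.

    alpha is the (assumed) learning rate applied to the new frame.
    """
    return [
        [(1.0 - alpha) * b + alpha * i for b, i in zip(b_row, i_row)]
        for b_row, i_row in zip(background, frame)
    ]


def foreground_mask(background, frame, threshold):
    """Mark a pixel as foreground (1) when it deviates from the background."""
    return [
        [1 if abs(i - b) > threshold else 0 for b, i in zip(b_row, i_row)]
        for b_row, i_row in zip(background, frame)
    ]


# Toy 2x3 frames: a bright object enters the top-right pixel.
background = [[10.0, 10.0, 10.0], [10.0, 10.0, 10.0]]
frame = [[10.0, 10.0, 90.0], [10.0, 10.0, 10.0]]

mask = foreground_mask(background, frame, threshold=20.0)
background = running_average_update(background, frame, alpha=0.05)
```

With a small `alpha` the object pixel only moves the background from 10 to 14 in one step, so a slow or stopped object is absorbed gradually and can leave a ghost; a large `alpha` absorbs it almost immediately, which is the learning-rate trade-off noted in the introduction.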
A simple statistical difference (SSD) model is proposed in [7]. It adapts the background model to environmental changes through the natural variation in the current frame and is less dependent on a threshold value. An absolute difference image is computed by subtracting the current frame from the mean background frame in order to obtain the segmented region on the foreground. However, it is sensitive to environmental noise caused by lighting and to the initial start-up time. In [20], the author proposed a Σ-Δ filter (SDE) to estimate the temporal statistics of each image pixel. However, it always adapts the background model to temporal changes by increasing or decreasing the pixel intensity by unity. The adaptation criterion is independent of the difference image. The foreground pixels are detected by comparing the time variance with the difference image. Because the Σ-Δ method responds only to signals whose absolute time variance is less than unity, it is insufficient for detecting multiple objects.
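The unit-step behavior of the Σ-Δ filter can be illustrated for a single pixel. The sketch below follows the commonly described form of Σ-Δ background estimation and is not taken from [20]; the amplification factor `n`, the initial values, and the function names are assumptions.

```python
def sgn(p):
    """Signum: 1 for positive, -1 for negative, 0 for zero input."""
    return (p > 0) - (p < 0)


def sigma_delta_step(background, variance, pixel, n=2):
    """One Sigma-Delta update for a single pixel.

    Returns the new (background, variance, is_foreground) triple.
    n is an assumed amplification factor for the difference.
    """
    background += sgn(pixel - background)      # background moves by at most one grey level
    delta = abs(pixel - background)            # absolute difference for this pixel
    if delta != 0:
        variance += sgn(n * delta - variance)  # time-variance estimate, also a unit step
    return background, variance, delta > variance


background, variance = 100, 2
history = []
for pixel in [100, 101, 100, 180, 181, 100]:
    background, variance, is_fg = sigma_delta_step(background, variance, pixel)
    history.append((pixel, background, variance, is_fg))
```

Because both estimates change by one grey level per frame, small fluctuations are tracked exactly, while a sudden jump (the values 180 and 181 here) is flagged as foreground and only slowly pulls the background toward it.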
A further enhancement of the Σ-Δ filter is the multiple Σ-Δ method (MSDE) [11]. It computes a set of k backgrounds instead of a single background. Each background has its own weight and confidence coefficient, which vary according to the time variance. Each pixel of the background model is estimated from the values in the set of k backgrounds and is then compared with the current input pixel to determine the foreground mask. An automatic motion detection algorithm is proposed in [8], which detects the moving object using an alarm trigger module. However, it adapts the background model to environmental changes using the traditional approach, and the alarm trigger chain adds computational cost.
In [21], the authors suggest modeling each background pixel with a mixture of Gaussians to deal with complex scenes. To detect the foreground pixels, each input pixel is compared with each Gaussian distribution, and the kernels associated with matched pixels are updated in the background model. However, the Gaussian mixture model (GMM) is computationally costly because of the kernels it has to maintain. It also fails to separate foreground and background pixels that have identical probability distribution functions (camouflage effect), and it is less effective against aperture distortion. In [22], the authors propose training the background model with two Gaussian mixture models whose kernels have identical parameters but different learning rates. Moving pixels are classified with the help of a finite-state machine. However, in the case of slow motion, an operator may need to maintain the system interactively, and an estimated prior is required as input to the finite-state machine. In [23], prior knowledge, which includes spatial and temporal coherence, is fused with the cues provided by the background subtraction scheme through an MRF framework. Although the MRF-based scheme achieves better segmentation, it is not applicable to real-time operation or to large displacements in object motion.
A review of various background subtraction methods and their updating schemes is given in [24]. This review categorizes the background representation frameworks into basic models, statistical models, cluster and sparse models, artificial neural network models, and fuzzy models. In our proposed work, we confine our study to the basic and statistical models, which provide a sufficient numerical foundation for the proposed background maintenance scheme. The basic models use pixel-level processing that can update each pixel of the initially registered background independently, without any prior knowledge or cluster observation of pixels. Statistical modeling, on the other hand, offers robustness against background motion and illumination. In [25], the authors proposed a fuzzy-based approach built on these two models in order to handle dynamic backgrounds. It uses spatial and temporal constraints to enhance performance. However, because of its traditional background maintenance scheme, it could not handle an object that became stationary in the scene.
Another factor that affects the performance of foreground detection is the integration of region-level processing. In [8], the author proposed integrating region-level processing by evaluating block-wise entropy to estimate the initial motion field, but this affected the shape of the object and lost significant parts of it near low-entropy regions. In our method, by contrast, region-level processing is included in the data validation technique to avoid misclassification between stationary and non-stationary pixels on the foreground. Some feature-based and subspace-learning-based background modeling schemes handle complex videos well, but at the cost of higher time and memory complexity [26]. The analysis of the existing background subtraction methods reveals that the simple schemes struggle to generate a reliable moving mask on the foreground, while the complex schemes achieve it at a higher computational cost and complexity. In addition, a robust background modeling scheme must include temporal and spatial constraints in order to obtain a reliable motion mask.

Proposed Method
The proposed framework aims to reduce the complexity of background modeling for moving object detection under a static camera arrangement. The video sequences in our experiments show spatiotemporally varying behavior due to rippling water, moving curtains, changing illumination, and similar effects. To alleviate these problems, the proposed method comprises two stages. The first stage provides a suitable background model followed by an updating scheme. In the second stage, region-level processing is carried out based on the assumptions that neighboring pixels tend to have identical properties and that each pixel in an image may be affected independently. As a result, a set of connected components of relevant pixels is found on the foreground mask under a Markovian framework. The functional block diagram of the proposed method is shown in Fig. 1. The steps used in the developed method to generate an efficient background model offer several advantages over the other methods reported in this paper.

• It does not require memory buffer allocation to generate the reference background model.
• This framework incorporates spatial and temporal features to characterize the registered background appearance and provides better adaptation to temporal changes in the environment.
• The proposed background model is updated using a selective maintenance approach based on the intensity variation of the difference image, which reduces aperture distortion, the ghost effect, and over-segmentation error.
• Finally, the Markovian framework provides a suitable set of connected components on the foreground mask and spatial regularization against illumination discrepancy.
Each of these stages is elaborated in the following subsections.

Generation of Background Model
In our implementation, we take the initial frame $I_0(x,y)$ as the reference background $B_{ref}(x,y)$, which contains no foreground object. The first stage computes a set of stationary pixels using the frame difference and the reference background image. The frame difference separates stationary pixels from non-stationary pixels by means of a suitable threshold. Using the difference between the current frame $I_t(x,y)$ and the previous frame $I_{t-1}(x,y)$, a set of stationary pixels is selected with the aid of the reference background frame $B_{ref}(x,y)$ as follows:

$$B_t^{fD}(x,y) = \begin{cases} B_{ref}(x,y), & |I_t(x,y) - I_{t-1}(x,y)| < \tau_2 \\ B_{ref}(x,y) + \mathrm{sgn}\big(I_t(x,y) - I_{t-1}(x,y)\big), & \text{otherwise} \end{cases} \tag{1}$$

where $B_t^{fD}(x,y)$ is the set of stationary pixels obtained through the frame difference method and $\tau_2$ is a user-defined threshold. The signum function is defined as:

$$\mathrm{sgn}(p) = \begin{cases} 1, & p > 0 \\ 0, & p = 0 \\ -1, & p < 0 \end{cases} \tag{2}$$

where $p$ is the input value.
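The thresholded frame difference that separates stationary from non-stationary pixels can be sketched as follows. This only illustrates the thresholding step and the signum function; it does not reproduce the full selection rule of (1), and the function names and the toy threshold are assumptions.

```python
def sgn(p):
    """Signum of p: 1 if p > 0, -1 if p < 0, and 0 if p == 0."""
    return (p > 0) - (p < 0)


def stationary_mask(current, previous, tau):
    """True where the absolute frame difference stays below the threshold tau."""
    return [
        [abs(c - p) < tau for c, p in zip(c_row, p_row)]
        for c_row, p_row in zip(current, previous)
    ]


# Toy 2x3 frames: only the top-right pixel changes strongly between frames.
previous = [[12, 12, 12], [12, 12, 12]]
current = [[12, 13, 70], [11, 12, 12]]

mask = stationary_mask(current, previous, tau=5)
signs = [[sgn(c - p) for c, p in zip(c_row, p_row)]
         for c_row, p_row in zip(current, previous)]
```

Pixels whose intensity changes by less than `tau` are treated as stationary, so small fluctuations such as sensor noise do not mark a pixel as moving.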
At the same time, we identify the stationary pixels through the background subtraction method, which subtracts the current input frame $I_t(x,y)$ from the reference background $B_{ref}(x,y)$.
The resulting set of stationary pixels is denoted $B_t^{bG}(x,y)$, and $\tau_2$ is a user-defined threshold. The pixels on the current background frame are regis- The initial spatial variance is denoted $\sigma_i^2(x,y)$.
The current change in spatial variance with respect to time is estimated using the initial variance and is denoted $\sigma_c^2(x,y)$, the current spatial variance.
To obtain the moving object, it is desirable to detect the pixels that deviate significantly from the background. Our approach is to categorize the background pixels according to their stationary or non-stationary behavior in consecutive frames. Compared with a stationary pixel, a non-stationary pixel of the background image has different statistics. Non-stationary pixels arise from local motion in the background image, such as rippling water or a moving curtain. We therefore propose a selective maintenance scheme that updates the background model with different learning rates depending on whether a background pixel is stationary or non-stationary. Since the running average is highly adaptive to temporal changes in the environment, we analyze and exploit its properties together with the spatial variance of the current input in order to update the initial background model.
The absolute difference between the current frame and the background frame gives the initial motion field. Ideally, the initial motion field should contain a significant intensity magnitude at foreground pixels and zero intensity at matched pixels. In practice, this ideal is not achieved. The first source of erroneous detection is a similar intensity magnitude of foreground and background pixels, which can cause holes in moving entities and increase the number of false-negative pixels. To minimize this error, we select a learning rate β and update those pixels of the background image that satisfy the first condition of (7).

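Equation (7): three-condition update rule for the background $B_t(x,y)$, computed from $I_t(x,y)$ and $B_{t-1}(x,y)$ with the learning rates β and α, the standard deviations $\sigma_i$ and $\sigma_c$, the factor ψ, and the threshold $\tau_3$.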
In (7), $B_t(x,y)$ is the current updated background and $B_{t-1}(x,y)$ is the previous background or the initial reference frame. $\sigma_i$ is the initial standard deviation of the reference background frame and $\sigma_c$ is the standard deviation of the current frame. ψ ranges from 1 to 3 in this experiment, and $\tau_3$ is a user-defined threshold. The value of β is taken as 0.999 for all video sequences. The value of α ranges from 0.8 to 0.99.
A recursive filter integrates the current image with the difference image to update the pixels of the background model. It updates those pixels that fall under the first condition of (7). As a result, it creates a difference in intensity level between foreground and background pixels that initially have identical magnitude.
The second probable source of erroneous detection arises when the variance of a pixel changes because of local motion in the background. To address this problem, the current and initial spatial variances are blended with the pixels of the background image using a different learning rate α to update the background pixel. This time, the update applies to those pixels that fall under the second condition of (7). As a result, it reduces the number of false-positive pixels. Otherwise, the background model is updated according to the third condition of (7). The absolute difference image between the current frame and the current background is used to compute the initial motion field. The absolute difference image is given as:

$$M_t(x,y) = |I_t(x,y) - B_t(x,y)|$$

where $M_t(x,y)$ is the absolute difference image.

Detection and Labeling of Foreground Object
In real-time applications, the estimate of the initial motion field is perturbed by noise. To address this problem, the optimum labeling of the motion mask is computed using Iterated Conditional Modes (ICM) under a Markovian framework [27]. ICM is computationally efficient and provides robust smoothing of a degraded image by considering the spatial correlation of neighboring pixels. ICM relies on the assumptions that neighboring pixels tend to have equal intensity and that each pixel is corrupted independently with some probability. To estimate the foreground pixels, a Markovian framework requires prior information about the underlying scene. In this experiment, the labels obtained during the estimation of the initial motion field, i.e., the absolute difference image, provide this provisional knowledge of the underlying image. Using a first-order spatial neighborhood, the provisional labels are supplied to the Gibbs prior within the Ising model [28], [29]. ICM updates the current estimated value $R_v$ at pixel $v = (x,y)$ by maximizing its conditional probability given the observation and the labels of its neighbors. Here $s/v$ denotes the set of all neighbors of the pixel $v$, and $\partial v$ is a small set of neighbors of $v$ defined by a first-order neighborhood system. $O_v$ is the observed motion field at $v$, and $R_v$ is the current estimated value of the motion field at $v$.
The posterior probability used to estimate the true image relies on minimizing the potential at the cliques, which is accomplished iteratively. The term $U(R_v)$ is the potential of the neighborhood configuration, obtained by following the Gibbs sampler within the Ising model. In $U(R_v)$, $\alpha_1$ controls the bias between negative and positive pixels and is set to 0 in this experiment, $v_1(R_v)$ is the number of neighbors of $v$ having label $R_v$, and $\beta_1$ is a constant that is experimentally set to 0.0001. Generally, the motion mask has two labels $l_1$ and $l_2$, such that $l_1, l_2 \in \{R_v\}$. The maximum likelihood of each label together with the prior is expressed in (13), where $P_1$ and $P_2$ are the posterior probabilities. The mean parameters $\mu_{l1}$, $\mu_{l2}$ and the variance parameters $\sigma_{l1}^2$, $\sigma_{l2}^2$ of the likelihood functions are calculated from the mean $\mu_d$ and the standard deviation $\sigma_d$ of the initial motion field, which are computed over a frame of width $w$ and height $h$. The value of $\mu_{l1}$ is estimated as $\mu_d$, while $\mu_{l2}$ ranges up to $\mu_d \pm 3\sigma_d$ for the test videos in this paper. The value of $\sigma_{l1}$ is estimated as $\sigma_d$, while the ratio of $\sigma_{l1}$ to $\sigma_{l2}$ is taken as $1.5\sigma_d$ for the static sequences. For the dynamic sequences, the ratio of $\sigma_{l2}$ to $\sigma_{l1}$ is taken as $2.5\sigma_d$. The resulting estimated binary motion mask is denoted $D_t(x,y)$.
To remove unnecessary connected components from the foreground mask and to fill superfluous holes, morphological operations are performed using a structuring element [11]. In this experiment, a morphological opening followed by a closing is performed to find the relevant connected components, using a disk-shaped structuring element of radius 1. The opening of the foreground mask is performed as follows:

$$D_t \circ S = (D_t \ominus S) \oplus S$$

Closing is then performed on the opened image as follows:

$$(D_t \circ S) \bullet S = \big((D_t \circ S) \oplus S\big) \ominus S$$

where $S$ is the structuring element, the operator $\ominus$ performs erosion, and $\oplus$ performs dilation on an image.
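The ICM labeling step can be illustrated with a small two-label example. The sketch below is a generic ICM with a Gaussian data term and an Ising-style smoothness term over a first-order (4-connected) neighborhood; it is not the authors' implementation. The energy form, the parameter `beta` and its value, the hand-picked likelihood parameters, and all function names are assumptions for illustration.

```python
import math


def icm_binary(observed, means, sigmas, beta=4.0, max_sweeps=10):
    """Label each pixel 0 (background) or 1 (moving) by Iterated Conditional Modes.

    observed: 2-D list of motion-field magnitudes O_v.
    means, sigmas: assumed Gaussian likelihood parameters for labels 0 and 1.
    beta: assumed weight of the Ising-style smoothness prior.
    """
    h, w = len(observed), len(observed[0])

    def data_cost(value, lab):
        # Negative log of a Gaussian likelihood, additive constants dropped.
        return (value - means[lab]) ** 2 / (2.0 * sigmas[lab] ** 2) + math.log(sigmas[lab])

    # Provisional labels: the label with the smaller data cost alone.
    labels = [[min((0, 1), key=lambda lab: data_cost(observed[y][x], lab))
               for x in range(w)] for y in range(h)]

    for _ in range(max_sweeps):
        changed = False
        for y in range(h):
            for x in range(w):
                # First-order neighborhood: up, down, left, right.
                neighbours = [labels[ny][nx]
                              for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                              if 0 <= ny < h and 0 <= nx < w]
                # Local energy = data cost minus a reward for agreeing neighbours.
                best = min((0, 1), key=lambda lab: data_cost(observed[y][x], lab)
                           - beta * neighbours.count(lab))
                if best != labels[y][x]:
                    labels[y][x] = best
                    changed = True
        if not changed:
            break
    return labels


# Toy motion field: a 3x3 moving block whose centre pixel is weak (a "hole").
observed = [
    [2, 1, 3, 2, 1],
    [1, 40, 42, 41, 2],
    [2, 39, 18, 40, 1],
    [1, 41, 40, 42, 2],
    [3, 2, 1, 2, 1],
]
means, sigmas = (2.0, 40.0), (3.0, 3.0)
smoothed = icm_binary(observed, means, sigmas)
```

In this toy case the weak centre pixel would be labeled as background from its data cost alone, but the four agreeing neighbors pull it into the moving region, which is the spatial regularization effect that ICM contributes.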

Experimental Results
In this section, seven standard video sequences are used to validate our results qualitatively and quantitatively, and some challenging sequences are analyzed in detail. The primary features of these video sequences are given in Tab. 1. The foreground mask may be distorted by the aperture effect, over-segmentation error, the ghost effect, and the camouflage effect. The aperture effect is related to the problem of finding the actual correspondence between consecutive frames. False-positive pixels cause over-segmentation error. A false copy of the moving object that is generated on the foreground mask and disappears slowly over time is called the ghost effect. The camouflage effect may arise when the foreground and the background have similar intensity. Qualitative evaluation is carried out on the various datasets with respect to these problems.
The IR and MSA video sequences have a static background. In the IR sequence, changes in illumination and shadows cast by the object can hinder the production of a reliable foreground mask. The radial movement of the person may also cause aperture problems in the IR sequence. In the MSA and PETS 2006 sequences, changing illumination and an abandoned object in the scene may degrade the quality of the binary motion mask. As shown in Fig. 2a, our proposed scheme successfully detects the moving person despite the illumination changes and shadows. Moreover, no aperture distortion is seen in the detection results of the IR sequence. In the static backgrounds of the MSA and PETS 2006 sequences, the proposed approach continuously detects the person's activity along with the abandoned bag in consecutive video frames, even when the objects remain motionless for some frames. The detection results of the MSA sequence are shown in Fig. 2b.
The Water Surface (WS) sequence has a dynamic background with changing illumination. This sequence suffers from high-frequency noise caused by rippling water in the background. Moreover, a potential problem arises from the identical pixel intensity of the background vegetation and the foreground pixels below the knee of the person (camouflage effect). Another issue arises in the WS sequence when the person moves slowly and becomes stationary in the scene. As one can observe in Fig. 2c, the proposed method suppresses the false positives induced by illumination changes and the false negatives caused by the similarity between the object and the background.
In the dynamic background of the MR sequence, a waving curtain produces high-frequency noise. In consecutive images, the person's shirt tends to blend with the color of the moving curtain. Moreover, object segmentation in the MR sequence can also be disturbed when the moving object moves slowly or stops.
In all these difficult cases, the background and foreground pixels are clearly separated by bounding the variance of the background model with the proposed method. Our proposal eliminates the high-frequency noise and the false-negative pixels that arise from non-stationary background pixels caused by the moving curtain. The detection results of the MR sequence are shown in Fig. 2d. The fountain (FT) and CANOE sequences also have dynamic backgrounds.
In these sequences, segmentation is difficult because of the high-frequency ripples produced by the fountain and the river. As shown in Fig. 2e, this method adapts to the environmental changes of the dynamic background and produces satisfactory results on the foreground. Figure 3 shows that the proposed method performs better than the other background subtraction methods.
In addition to visual inspection, the performance of the proposed approach is also evaluated quantitatively on the above-mentioned video sequences with respect to their ground truths [1], [15], [17]. Quantitative evaluation against the ground-truth image depends on the true-positive (tp), true-negative (tn), false-positive (fp), and false-negative (fn) pixels. True-positive pixels are moving-object pixels correctly detected by the algorithm. False-positive pixels are those incorrectly detected as foreground. True-negative pixels are correctly detected background pixels, while false-negative pixels are foreground pixels incorrectly detected as background. The relevant pixels on the binary motion mask are analyzed using the Recall metric, which is given as:

$$\mathrm{Recall} = \frac{tp}{tp + fn}$$

The irrelevant pixels on the binary motion mask are analyzed using the Precision metric, which is computed as:

$$\mathrm{Precision} = \frac{tp}{tp + fp}$$

An algorithm must achieve a high Recall without sacrificing Precision, but these two metrics alone do not provide a reliable measurement. Similarity and F1 are two further metrics that are included to make the accuracy measurement reliable in the quantitative analysis. They are given as:

$$\mathrm{Similarity} = \frac{tp}{tp + fp + fn}, \qquad F1 = \frac{2 \cdot \mathrm{Recall} \cdot \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}}$$

The percentage of correct classification (PCC) is the most comprehensive way to assess a classifier's performance, as it includes the tp, tn, fp, and fn parameters. The PCC is given as:

$$\mathrm{PCC} = \frac{tp + tn}{tp + tn + fp + fn}$$

We also analyze the true positive rate (TPR) and the false positive rate (FPR) to compare the misclassification between foreground and background. TPR is equivalent to Recall, while FPR is the fraction of background pixels that are misclassified as foreground. The TPR and FPR are given as:

$$\mathrm{TPR} = \frac{tp}{tp + fn}, \qquad \mathrm{FPR} = \frac{fp}{fp + tn}$$

Table 2 lists the average accuracy rates of this method along with those achieved by the other state-of-the-art background subtraction methods reported in this paper: GMM, MSDE, SDE, SSD, the method of [8], and the method of [25]. The accuracy rates of MSDE, SDE, and SSD for the IR, MR, and WS video sequences are taken from [8], while the remaining accuracy rates for GMM, MSDE, SSD, SDE, the method of [8], and the method of [25] are calculated using the optimum parameters given in [25], [21], [20], [11], [8], [7].
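The evaluation metrics can be computed directly from the four pixel counts. The sketch below uses the standard confusion-matrix definitions of these metrics; the function name, the dictionary keys, and the example counts are illustrative assumptions and are not results from the paper.

```python
def segmentation_metrics(tp, tn, fp, fn):
    """Compute foreground-segmentation metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)        # fraction of true foreground that was detected (TPR)
    precision = tp / (tp + fp)     # fraction of detected foreground that is correct
    return {
        "recall": recall,
        "precision": precision,
        "f1": 2 * recall * precision / (recall + precision),
        "similarity": tp / (tp + fp + fn),
        "pcc": (tp + tn) / (tp + tn + fp + fn),
        "tpr": recall,
        "fpr": fp / (fp + tn),     # fraction of background flagged as foreground
    }


# Hypothetical counts for a 1000-pixel frame.
metrics = segmentation_metrics(tp=80, tn=900, fp=10, fn=10)
```

For these hypothetical counts the PCC is 0.98 even though Similarity is only 0.8, because the large number of true-negative background pixels dominates PCC. This is one reason to report several metrics side by side.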
It is easy to see that the proposed method outperforms the six previously reported methods. For the WS sequence, all metrics of this method exceed 92.36%. For the MR, WS, and MSA sequences, all metrics exceed 82%, which reflects a significant improvement in motion detection under illumination discrepancy and local motion.
In the WS sequence, where this method attains its highest average accuracy rates, F1 and Similarity are up to 56% and 54% higher than those attained by the GMM method. In the FT sequence, where it attains its lowest average accuracy rates, F1 and Similarity are still up to 17% and 23% higher than those attained by the GMM method.
Tab. 2. Average accuracy rates of the proposed method and the compared methods (recovered column headings: Sequences, Evaluation, Proposed Method, Method [25]).

The quantitative analysis of TPR and FPR is shown in Fig. 4. Our method achieves a lower FPR, which indicates that on average fewer than 0.3% of pixels are misclassified by the background subtraction method under study.
The PCC is measured by taking some sampled frames of each video sequence, which are shown in Fig. 5 (a)-(e).
The average PCC measured for the WS, MSA, MR, and IR video sequences is greater than 99.12%, while for the FT sequence it is above 98%. The high PCC values reflect better segmentation and identification of foreground pixels by this method. With regard to time complexity, we perform all experiments in Matlab 7.1 on a 3.2 GHz Intel CPU with 2 GB RAM running Windows 7. To process a 120 × 160 frame, this method takes 0.06 s, while GMM takes 0.48 s. The other methods are faster than our algorithm.

Conclusion
In this paper, we described our contribution to characterizing the background appearance using its principal features and statistics. A test based on the standard deviation of the pixels and the estimated absolute difference image is applied in order to limit the variance caused by local motion and illumination changes in the background. In addition, the most appropriate label assignment for the motion field is estimated and optimized using iterated conditional modes under a Markovian framework. Future work can address drastic illumination changes and multiple moving objects in the scene, but the method can already extract the moving object under moderate illumination discrepancy and local motion. Experimental results indicate that the proposed algorithm tends to localize the object in the scene without over-segmentation error, aperture distortion, or ghost effect. Extensive qualitative and quantitative analyses show that our method attains higher accuracy rates than the other state-of-the-art background subtraction methods reported in this paper.