Improved visual background extractor using an adaptive distance threshold

Abstract. Camouflage is a challenging issue in moving object detection. Even the recent and advanced background subtraction technique, visual background extractor (ViBe), cannot effectively deal with it. To better handle camouflage according to the perception characteristics of the human visual system (HVS) in terms of minimum change of intensity under a certain background illumination, we propose an improved ViBe method using an adaptive distance threshold, named IViBe for short. Different from the original ViBe using a fixed distance threshold for background matching, our approach adaptively sets a distance threshold for each background sample based on its intensity. Through analyzing the performance of the HVS in discriminating intensity changes, we determine a reasonable ratio between the intensity of a background sample and its corresponding distance threshold. We also analyze the impacts of our adaptive threshold together with an update mechanism on detection results. Experimental results demonstrate that our method outperforms ViBe even when the foreground and background share similar intensities. Furthermore, in a scenario where foreground objects are motionless for several frames, our IViBe not only reduces the initial false negatives, but also suppresses the diffusion of misclassification caused by those false negatives serving as erroneous background seeds, and hence shows an improved performance compared to ViBe.


Introduction
In computer vision applications, objects of interest are often the moving foreground objects in a video sequence. Therefore, moving object detection, which extracts foreground objects from the background, has become a hot research topic [1][2][3][4][5][6][7] and has been widely applied to areas such as smart video surveillance, intelligent transportation, and human-computer interaction.
Visual background extractor 8 (ViBe) is one of the most recent and advanced techniques. In a comparative evaluation, 9 ViBe produces satisfactory detection results and has proved effective in many scenarios. For each pixel, the background model of ViBe stores a set of background samples taken in the past at the same location or in the neighborhood. ViBe then compares the current pixel intensity to this set of background samples using a distance threshold. Only if the new observation matches a predefined number of background samples is the pixel classified as background; otherwise the pixel belongs to the foreground. However, ViBe uses a fixed distance threshold in the matching process; hence, it has difficulty handling camouflaged foreground objects (intentionally or not, some objects may differ poorly from the appearance of the background, making correct classification difficult 9 ). Moreover, a "spatial diffusion" update mechanism for background models aggravates the influence of misclassified camouflaged foreground pixels, which decreases the power of ViBe in detecting still foreground objects. Camouflaged foreground objects and still foreground objects are two key causes of false negatives in detection results, and it is imperative to solve these two challenging issues in video surveillance.
In order to address the aforementioned challenges, we propose an improved ViBe method using an adaptive distance threshold (hereafter IViBe for short). In light of the sensitivity of the human visual system (HVS) to intensity changes under a given background illumination, we set an adaptive distance threshold in the background matching process for each background sample in accordance with its intensity. Experimental evaluations validate that, by using features of the HVS and performing background matching with an adaptive distance threshold, IViBe has a better discriminating power for foreground objects with intensities similar to the background, and thus effectively improves the capability of ViBe in coping with camouflaged foreground objects. Furthermore, IViBe also reduces the number of misclassified pixels, which usually serve as erroneous background seeds propagating false negatives. Experimental results show that, compared with ViBe, our IViBe allows a slower inclusion of still foreground objects into the background and has a better performance in detecting static foreground objects.
The rest of this paper is organized as follows. In Sec. 2, we briefly explore the major background subtraction approaches. Section 3 describes our IViBe method, introduces the detailed derivation of our adaptive distance threshold, and analyzes the influence of this adaptive distance threshold together with the "spatial diffusion" update mechanism on the detection results. In Sec. 4, we qualitatively and quantitatively analyze the advantages of our IViBe compared with ViBe. Finally, a conclusion is drawn in Sec. 5.

Related Work
Background subtraction 10 (BS) is an effective way of segmenting the foreground for a stationary camera. In BS methods, input video frames are compared to their current background models, and the regions with significant differences are marked as foreground. BS techniques also adapt their background models to scene changes through online updates and have a moderate computational complexity, which makes them popular methods for moving object detection.
Although the last decade has witnessed numerous publications on BS methods, according to Ref. 13 there are still many challenges not completely resolved in real scenes, such as illumination changes, dynamic backgrounds, bootstrapping, camouflage, shadows, still foreground objects, and so on. In 2014, two special issues 14,15 were published with new developments for dealing with these challenges.
Next, we briefly explore the major BS approaches according to the different kinds of background models they used.

Parametric Models
Gaussian mixture model (GMM) and its improved methods: GMM is a classical and probably the most widely used BS technique. 16 GMM models the temporal distribution of each pixel using a mixture of Gaussians, and many studies have shown that GMM handles gradual illumination changes and repetitive background motion well. In Ref. 17, Lee proposed an adaptive learning rate for each Gaussian model to improve the convergence rate without affecting stability. In Ref. 18, Zivkovic and Van Der Heijden proposed a scheme to dynamically determine the appropriate number of Gaussian components for each pixel based on observed scene dynamics, reducing processing time. In Ref. 19, Zhang et al. used a spatio-temporal Gaussian mixture model incorporating spatial information to handle complex background motion.
Models using other statistical distributions: recently, a mixture of symmetric alpha-stable distributions 20 and a mixture of asymmetric Gaussian distributions 21 have been employed to enhance the robustness and flexibility, respectively, of mixture modeling in real scenarios. They can handle dynamic backgrounds well. In Ref. 22, Haines and Xiang proposed a Dirichlet process Gaussian mixture model that constantly adapts its parameters to the scene in a block-based manner.

Nonparametric Models
Kernel density estimation (KDE) and its improved methods: a nonparametric technique 23 was developed to estimate background probabilities at each pixel from many recent samples over time using KDE. In Ref. 24, Sheikh modeled the background using KDE over a joint domain-range representation of image pixels to sustain high detection accuracy in the presence of dynamic backgrounds.
Codebook and its improved methods: the essential idea behind the codebook 25 approach is to capture long-term background motion with limited memory by using a codebook for each pixel. In Ref. 4, a multilayer codebook-based background subtraction (MCBS) model was proposed. Combining a multilayer block-based strategy with adaptive feature extraction from blocks of various sizes, MCBS can remove most dynamic backgrounds and significantly increase processing efficiency.

Advanced Models
Self-organizing background subtraction (SOBS) and its improved methods: in the 2012 IEEE change detection workshop 26 (CDW-2012), SOBS 27 and its improved method SC-SOBS 28 obtained excellent results. In Ref. 27, SOBS adopted a self-organizing neural network to build a background model, initialized its model from the first frame, and employed regional diffusion of background information in the update step. In 2012, Maddalena improved SOBS by introducing spatial coherence into the background update procedure, leading to the SC-SOBS algorithm, which provides further robustness against false detections. In Ref. 29, three-dimensional self-organizing background subtraction (3D_SOBS) used spatio-temporal information to detect stopped objects. Recently, the 3DSOBS+ 1 algorithm further enhanced the 3D_SOBS approach to accurately handle scenes containing dynamic backgrounds, gradual illumination changes, and shadows cast by moving objects.
ViBe and its improved methods: in the CDW-2012, ViBe 8 and its improved method ViBe+ 30 also achieved remarkable results. Barnich and Van Droogenbroeck proposed a sample-based algorithm that builds the background model by aggregating previously observed values for each pixel location. The key innovation of ViBe is the introduction of a random policy into BS, which makes it the first nondeterministic BS method. In Ref. 30, Van Droogenbroeck and Barnich improved ViBe in many aspects, including an adaptive threshold. They computed the standard deviation of the background samples of a pixel to define a matching threshold. This matching threshold adapts itself to the statistical characteristics of the background samples; however, all background samples of a pixel share the same threshold, so one wrongly updated background sample will affect the thresholds of the other background samples, leading to more misclassification. In Refs. 30 and 31, a new update mechanism separating the "segmentation map" and the "updating mask" was proposed. The "spatial diffusion" update mechanism can be inhibited in the "updating mask" to detect still foreground objects. In Ref. 32, Mould and Havlicek proposed an update mechanism in which foreground pixels can update their background models by replacing the most significant outlying samples. This update policy improves the ability to deal with ghosts.

Human Visual System-Based Models
Visual saliency, another important concept related to the HVS, has already been used in BS methods, e.g., in Ref. 33. In this paper, we propose an improved BS technique that uses a characteristic of the HVS.
We introduce an adaptive distance threshold into ViBe to simulate the capacity of the HVS to perceive noticeable intensity changes, which helps discriminate camouflaged foreground objects and reduces false negatives. Together with ViBe's update policy, our method further improves the ability to detect foreground objects that are motionless for a while. Hence, IViBe improves the ability of ViBe to deal with camouflaged and still foreground objects.

Improved ViBe Method
Our IViBe is a pixel-based BS method. When building the background model for each pixel, it does not rely on a temporal statistical distribution, but instead employs a universal sample-based method. Let x_i be an arbitrary pixel in a video image, and B(x_i) be its background model containing N background samples (values taken in the past at the same location or in the neighborhood):

$$B(x_i) = \{B_1(x_i), B_2(x_i), \ldots, B_N(x_i)\}. \qquad (1)$$

The background model B(x_i) is first initialized from one single frame according to the intensities of pixel x_i and its neighboring pixels, and is then updated online when pixel x_i is classified as background or by a "spatial diffusion" update mechanism.
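The sample-based initialization described above can be sketched as follows (a minimal single-channel sketch; the function and variable names are our own choices, not the authors' implementation):

```python
import numpy as np

def init_background_model(first_frame, N=20, rng=None):
    """Initialize a sample-based background model from a single frame.

    For every pixel, the N background samples are drawn at random from
    the pixel's 8-connected neighborhood (including the pixel itself).
    """
    rng = np.random.default_rng(rng)
    h, w = first_frame.shape
    # Pad with edge values so border pixels also have 8 neighbors.
    padded = np.pad(first_frame, 1, mode="edge")
    model = np.empty((h, w, N), dtype=first_frame.dtype)
    for k in range(N):
        # Random offset in {-1, 0, 1}^2 for every pixel independently.
        dy = rng.integers(-1, 2, size=(h, w))
        dx = rng.integers(-1, 2, size=(h, w))
        ys = np.arange(h)[:, None] + 1 + dy
        xs = np.arange(w)[None, :] + 1 + dx
        model[:, :, k] = padded[ys, xs]
    return model
```

Initializing from a single frame is what allows the method to start segmenting from the second frame onward, at the cost of possible ghosts when the first frame contains moving objects.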
The pixel x_i is classified as a background pixel only if its current intensity I(x_i) is closer than a certain distance threshold R_k(x_i) (1 ≤ k ≤ N) to at least #_min of its N background samples. Thus, the foreground segmentation mask is calculated as

$$F(x_i) = \begin{cases} 1, & \#\{k \mid |I(x_i) - B_k(x_i)| < R_k(x_i)\} < \#_{\min} \\ 0, & \text{otherwise.} \end{cases} \qquad (2)$$

Here, F(x_i) = 1 signifies that the pixel x_i is a foreground pixel, # denotes the cardinality of a set, #_min is a fixed parameter indicating the minimal matching number, and R_k(x_i) is an adaptive distance threshold based on the perception characteristics of the HVS.
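This classification rule can be sketched in vectorized form (an illustrative sketch assuming the absolute intensity difference as the distance; `R` may be ViBe's fixed scalar threshold or a per-sample threshold array as in IViBe):

```python
import numpy as np

def segment(frame, model, R, min_matches=2):
    """Classify each pixel: foreground (1) if fewer than `min_matches`
    background samples lie within the distance threshold R.

    `model` has shape (h, w, N); `R` is either a scalar (original ViBe)
    or an array of shape (h, w, N) with one threshold per sample (IViBe).
    """
    dist = np.abs(frame.astype(np.int32)[..., None] - model.astype(np.int32))
    matches = (dist < R).sum(axis=2)
    return (matches < min_matches).astype(np.uint8)  # 1 = foreground
```

Passing a per-sample `R` array here is the only change IViBe needs in the matching step; the rest of the pipeline is unchanged.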
In Sec. 3.1, we introduce our adaptive distance threshold and its derivation. Section 3.2 shows how our adaptive distance threshold together with the "spatial diffusion" update mechanism affects the detection results.

Adaptive Distance Threshold
In order to better segment foreground objects similar to the background, we introduce an adaptive distance threshold R_k(x_i) for background matching. Different from ViBe, which uses a fixed distance threshold R_k(x_i) = 20 for every background sample, we propose an adaptive distance metric by simulating the characteristics of human visual perception (i.e., Weber's law 35 ).
Weber's law describes the human response to a physical stimulus in a quantitative fashion. The just noticeable difference (JND) is the minimum amount by which stimulus intensity must be changed in order to produce a noticeable variation in the sensory experience. Ernst Weber, a 19th century experimental psychologist, observed that the size of the JND is linearly proportional to the initial stimulus intensity. This relationship, known as Weber's law, can be expressed as

$$\Delta I_{\text{JND}} = c \cdot I, \qquad (3)$$

where ΔI_JND represents the JND, I represents the initial stimulus intensity, and c is a constant called the Weber ratio.
In visual perception, Weber's law actually describes the ability of the HVS to discriminate brightness, and the Weber ratio can be obtained by a classic experiment 36 which consists of having a subject look at a flat, uniformly illuminated area (with intensity I) large enough to occupy the entire field of view, as Fig. 1 shows. An increment of illumination (i.e., ΔI) is added to the field and appears as a circle in the center. When ΔI reaches ΔI_JND, the subject gives a positive response, indicating a perceivable change. In Weber's law, ΔI_JND is in direct proportion to I; hence, ΔI_JND is small against dark backgrounds and large against bright backgrounds.
In BS methods, when comparing the current intensity with the corresponding background model, the distance threshold can actually be considered the critical intensity difference for distinguishing foreground objects from the background. Fortunately, Weber's law describes the capacity of the HVS to perceive noticeable intensity changes, and the JND that the HVS can perceive is in direct proportion to the background illumination. Inspired by Weber's law, we propose an adaptive distance threshold in direct proportion to the background sample intensity; namely, the distance threshold should be low for a dark background sample and high for a bright background sample.
In our method, the mapping to Weber's law is as follows: the background sample intensity B_k(x_i) can be regarded as the initial intensity I, the difference between the current value and each background sample is the intensity change ΔI, and the distance threshold R_k(x_i) can be regarded as the JND (i.e., ΔI_JND). Consequently, on the basis of Weber's law, we set

$$R_k(x_i) = c \cdot B_k(x_i). \qquad (4)$$

In Eq. (4), B_k(x_i) is the known background sample intensity, so to derive the distance threshold R_k(x_i) we first have to obtain the Weber ratio c. However, we cannot directly use the Weber ratio obtained in the classic experiment, because that experiment uses a uniformly illuminated area as the background, whereas our method needs a Weber ratio for a complex image as the background. As described in Ref. 37, "for any point or small area in a complex image, the Weber ratio is generally much larger than that obtained in an experimental environment because of the lack of sharply defined boundaries and intensity variations in the background." Moreover, it is also difficult to obtain the Weber ratio by redoing the classic experiment with a complex image as the background, because such an experiment would need many subjects, and the subjects' evaluation criteria are inconsistent, which would reduce the credibility of the experiment.
Based on the considerations above, we employ a substitute for the subjective evaluations in the classic experiment to derive the Weber ratio c for a complex image as the background. Specifically, the substitute is the difference of the peak signal-to-noise ratio (PSNR 38 ) presented by the motion picture experts group (MPEG). The MPEG recommends that, 38 for an original reference image R and two of its reconstructed images D_1 and D_2, the HVS can perceive that D_1 and D_2 are different only when the difference of PSNR (i.e., ΔPSNR) satisfies

$$|\text{PSNR}(D_1, R) - \text{PSNR}(D_2, R)| \ge 0.5 \text{ (dB)}. \qquad (5)$$

In Eq. (5), the PSNR between a reconstructed image D and its reference R of n pixels is computed from the mean absolute difference:

$$\text{PSNR}(D, R) = 20 \lg \frac{255}{\|D - R\|_1 / n}. \qquad (6)$$

Since ΔPSNR can objectively reflect the ability of the HVS to discriminate intensity changes, we use ΔPSNR as a substitute for the subjects' perception in the classic experiment with a complex image as the background. Here, we first construct a complex image. Suppose there is a complex image whose rows and columns are each divided into 16 equal parts; thus, the complex image is composed of 256 regions of the same size. For each region, the setup is the same as the classic experiment shown in Fig. 1: each region is uniformly illuminated with intensity I, and an increment of illumination (i.e., ΔI) is added to the centered circle. Such a region is called a basic region. The complex image consists of 256 basic regions (with I = 0, 1, …, 255), which are randomly permuted, as shown in Fig. 2. In this way, we construct a complex image as the background to simulate the classic experiment at all intensity levels simultaneously, which makes our derivation general and objective.
All the circles in the basic regions of Fig. 2 simultaneously change their intensities by ΔI. When |ΔI| reaches ΔI_JND for all the basic regions, the HVS can barely perceive the intensity changes of the complex image (let this image be D_1). When |ΔI| = ΔI_JND + ε (ε is a very small constant; for a digital image we set ε = 1) for all the basic regions, the HVS can clearly perceive the intensity changes of the complex image (let this image be D_2). Suppose the complex image shown in Fig. 2 is the original reference image (i.e., R); then D_1 and D_2 can be regarded as two different distorted images that are reconstructed from the same R and are just perceivably distinguishable by the HVS. Accordingly, on the basis of Eq. (3), the 1-norm of the difference between R and D_1 is

$$\|R - D_1\|_1 = \sum_{I=0}^{255} w \cdot c \cdot I, \qquad (7)$$

and the 1-norm of the difference between R and D_2 is

$$\|R - D_2\|_1 = \sum_{I=0}^{255} w \cdot (c \cdot I + 1), \qquad (8)$$

where w denotes the number of pixels in the circle of each basic region in Fig. 2. In accordance with the recommendation of the MPEG, the difference of PSNR between these two reconstructed images (D_1 and D_2) meets the equality in Eq. (5), i.e., ΔPSNR = 0.5 (dB); that is,

$$20 \lg \frac{255}{\|R - D_1\|_1 / n} - 20 \lg \frac{255}{\|R - D_2\|_1 / n} = 0.5, \qquad (9)$$

where n denotes the number of pixels in the complex image. Simplifying Eq. (9), we derive c = 0.13. As a result, we conclude that the relationship between the intensity of a background sample and its corresponding distance threshold is

$$R_k(x_i) = 0.13 \cdot B_k(x_i). \qquad (10)$$

Nevertheless, according to the description of brightness adaptation of the HVS in Ref. 37, we can infer that, in the extremely dark and extremely bright regions of a complex image, the linear relationship in Weber's law cannot precisely describe the relation between the perceptible intensity changes of the HVS and the background illumination. Therefore, our solution is to clamp the distance threshold for background samples whose intensities are too high or too low. After many experiments, we empirically set [10%, 90%] of the entire intensity range as satisfying the linear relationship; namely, the cutoff intensities are T_1 = ⌈255 × 0.1⌉ = 26 and T_2 = ⌈255 × 0.9⌉ = 230. Consequently, the adaptive distance threshold can be calculated as

$$R_k(x_i) = \begin{cases} 0.13 \cdot T_1, & B_k(x_i) < T_1 \\ 0.13 \cdot B_k(x_i), & T_1 \le B_k(x_i) \le T_2 \\ 0.13 \cdot T_2, & B_k(x_i) > T_2, \end{cases} \qquad (11)$$

which is shown in Fig. 3.
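The derivation of c and the clamped threshold can be checked numerically. In Eq. (9), the circle size w and the image size n cancel, leaving 20 lg[(cS + 256)/(cS)] = 0.5 with S being the sum of I over I = 0..255, which gives c in closed form (a sketch under these assumptions; the names are ours):

```python
import numpy as np

# Solve Eq. (9) for the Weber ratio c. The circle area w and the image
# size n cancel out of the PSNR difference, leaving
#   20 * lg((c*S + 256) / (c*S)) = 0.5 (dB),
# where S is the sum of the region intensities I = 0..255.
S = sum(range(256))                      # 32640
c = 256 / (S * (10 ** (0.5 / 20) - 1))   # ≈ 0.132, i.e., 0.13 rounded

# Eq. (11): per-sample threshold R_k = c * B_k, with the linear Weber
# relation applied only inside the empirical cutoffs [T1, T2] = [26, 230].
T1, T2 = 26, 230

def adaptive_threshold(samples):
    return 0.13 * np.clip(np.asarray(samples), T1, T2)
```

Consistent with the analysis in Sec. 3.2, this yields thresholds below ViBe's fixed value of 20 for dark samples (3.38 at the dark cutoff) and above it for bright samples (29.9 at the bright cutoff).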

Background Model Update Mechanism and Impacts of Our Adaptive Distance Threshold Together with Update Mechanism on Detection Results
It is essential to update the background model B(x_i) to adapt to changes in the background, such as lighting changes and variations of the background. The update of background models applies not only to pixels classified as background, but also to a randomly selected pixel in their eight-connected neighborhood. In detail, when a pixel x_i is classified as background, its current intensity I(x_i) is used to randomly replace one of its background samples B_k(x_i) (k ∈ {1, 2, …, N}) with a probability p = 1/φ, where φ is a time subsampling factor similar to the learning rate in GMM (the smaller the φ, the faster the update). After updating the background model of pixel x_i, we randomly select a pixel x_j in the eight-connected spatial neighborhood of pixel x_i, i.e., x_j ∈ N_8(x_i). In light of the spatial consistency of neighboring background pixels, we also use the current intensity I(x_i) of pixel x_i to randomly replace one of pixel x_j's background samples B_k(x_j) (k ∈ {1, 2, …, N}). In this way, we allow a spatial diffusion of background samples in the process of background model update.
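A minimal sketch of this randomized update with spatial diffusion (illustrative, non-optimized loop; border handling here simply clamps the neighbor to the image, whereas implementations may treat borders differently):

```python
import numpy as np

def update_model(frame, model, fg_mask, phi=16, rng=None):
    """Conservative, randomized update with spatial diffusion.

    For each background pixel (fg_mask == 0), with probability 1/phi the
    current intensity replaces a random sample of that pixel's model,
    and (again with probability 1/phi) a random sample of a random
    8-connected neighbor's model.
    """
    rng = np.random.default_rng(rng)
    h, w, N = model.shape
    for y in range(h):
        for x in range(w):
            if fg_mask[y, x]:
                continue  # conservative: foreground never updates the model
            if rng.integers(phi) == 0:  # probability 1/phi
                model[y, x, rng.integers(N)] = frame[y, x]
            if rng.integers(phi) == 0:  # spatial diffusion into a neighbor
                dy, dx = rng.integers(-1, 2), rng.integers(-1, 2)
                ny = min(max(y + dy, 0), h - 1)
                nx = min(max(x + dx, 0), w - 1)
                model[ny, nx, rng.integers(N)] = frame[y, x]
    return model
```

The time subsampling factor phi plays the role described in Sec. 4.1: a larger phi means a slower, more conservative update.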
The advantage of this "spatial diffusion" update mechanism is the quick absorption of certain types of ghosts (a set of connected points, detected as in motion but not corresponding to any real moving object 8 ). Some ghosts result from the removal of parts of the background; therefore, those ghost areas often share similar intensities with their surrounding background. When background samples from surrounding areas diffuse inside the ghosts, these samples are likely to match the current intensities at the diffused locations. Thus, the diffused pixels in the ghosts are gradually classified as background. In this way, the ghosts are progressively eroded until they entirely disappear.
However, the "spatial diffusion" update mechanism is disadvantageous for detecting still foreground objects.In environments where the foreground objects are static for several frames, either because the foreground objects share similar intensities with the background, or due to the noise inevitably emerging in the video sequence, some pixels of the foreground objects may be misclassified as background, and then serve as erroneous background seeds propagating foreground intensities in the background models of their neighboring pixels.Since foreground objects are still for several frames, the background models of the neighboring pixels of these misclassified pixels will suffer from more and more incorrect background samples coming from misclassified foreground intensities.In this way, there will be more misclassified foreground pixels, which will lead to the diffusion of misclassification.
Fortunately, our IViBe employs background matching based on an adaptive distance threshold, which reduces misclassification inside foreground objects, slows down the diffusion of misclassification, and lowers the speed at which still foreground objects are absorbed into the background. First, IViBe makes full use of the adaptive distance threshold to enhance the discrimination of similar foregrounds and backgrounds, and thus reduces the number of misclassified foreground pixels, which decreases the likelihood of erroneous background seeds occurring. Second, even if misclassification emerges inside the foreground objects for some reason and causes foreground intensities to diffuse into the background models of neighboring pixels, the adaptive distance threshold also cuts down the misclassification probabilities of those neighboring pixels inside the foreground objects. From this analysis, we conclude that IViBe has the ability to detect still foreground objects that are present for several frames.
Since we use the adaptive distance threshold of Eq. (10), our threshold for dark areas will be smaller than that of ViBe, hence fewer pixels will be classified as background and then updated; whereas for bright areas, our threshold will be larger than the fixed threshold used by ViBe, so more pixels will be classified as background and then updated. Accordingly, the updating probability is lower for dark areas and higher for bright areas.

Experimental Results
In this section, we first list the test sequences and determine the optimal values of parameters in our IViBe method, and then compare our results with those of ViBe in terms of qualitative and quantitative evaluations.

Experimental Setup

Test sequences
In our experiments, we employ the widely used changedetection.net 26,39 (CDnet) benchmark. We select two sequences to test the capability of the techniques in coping with camouflaged foreground objects. One sequence is called "lakeSide" from the thermal category, and the other is called "blizzard" from the bad weather category. In the lakeSide sequence, two people are indistinguishable in thermal imagery from the lake behind them after they get out of the lake and have the same temperature as the water. This sequence is a truly challenging camouflage scenario, for it is difficult even for human eyes to discriminate these people from the background. In the blizzard sequence, a road is covered by heavy snow during bad weather; meanwhile, some cars passing through are white, and some cars of other colors are partially covered by white snow, which makes correct classification difficult.
Besides, to validate the power of IViBe in coping with still foreground objects, we further choose two other typical sequences from the CDnet.One sequence is called "library" from the thermal category, and the other sequence is called "sofa" from the intermittent object motion category.In the library sequence, a man walks in the scene and selects a book, and then sits in front of a desk reading the book for a long time.In the sofa sequence, several men successively sit on a sofa to rest for dozens of frames, and place their belongings (foreground) aside; for example, a box is abandoned on the ground and a bag is left on the sofa.
Moreover, to test the performance of our method in general environments, we also select the baseline category, which contains four videos (i.e., highway, office, pedestrians, and PETS2006) with a mixture of mild challenges (including dynamic backgrounds, camera jitter, shadows, and intermittent object motion). For example, the highway sequence exhibits subtle background motion, the office sequence suffers from small camera jitter, the pedestrians sequence has isolated shadows, and the PETS2006 sequence has abandoned objects and pedestrians that stop for a short while and then move away. These videos are fairly easy but not trivial to process. 26

Determination of parameter setting
There are six parameters in IViBe: the number of background samples stored in each pixel's background model (i.e., N), the ratio of R_k(x_i) to B_k(x_i) (i.e., c), the cutoff thresholds (i.e., T_1 and T_2), the required number of close background samples when classifying a pixel as background (i.e., #_min), and the time subsampling factor (i.e., φ).
In order to evaluate #_min and N over a variety of values, we introduce the metric called percentage of correct classification 8 (PCC), which is widely used in computer vision to assess the performance of a binary classifier. Let TP be the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives. These raw counts (i.e., TP, TN, FP, and FN) are summed over all the frames with ground-truth references in a video. The definition of PCC is given as follows:

$$\text{PCC} = \frac{TP + TN}{TP + TN + FP + FN} \times 100\%. \qquad (12)$$

Figure 4 illustrates the evolution of the PCC of IViBe on the pedestrians sequence (with 800 ground-truth references) in the baseline category for #_min ranging from 1 to 20. The other parameters are fixed to N = 20, c = 0.13, T_1 = 26, T_2 = 230, and φ = 16. As shown in Fig. 4, when #_min increases, the PCC goes down. The best PCCs are obtained for #_min = 1 (PCC = 99.8310), #_min = 2 (PCC = 99.8324), and #_min = 3 (PCC = 99.7923). In our experiments, we find that for stable backgrounds like those in the baseline category, #_min = 1 can also lead to excellent results, but in more challenging scenarios, #_min = 2 and #_min = 3 are good choices. Since a rise in #_min is likely to increase the computational cost of IViBe, we set #_min = 2.
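The PCC metric above is straightforward to compute once the raw counts have been accumulated over all ground-truth frames (an illustrative helper with a name of our own choosing):

```python
def pcc(tp, tn, fp, fn):
    """Percentage of correct classification (higher is better)."""
    return 100.0 * (tp + tn) / (tp + tn + fp + fn)
```

For example, 50 true positives, 40 true negatives, 5 false positives, and 5 false negatives give a PCC of 90%.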
Having set #_min = 2, we study the influence of the parameter N on the performance of IViBe. Figure 5 shows the PCCs obtained on the pedestrians sequence for N ranging from 2 to 30. The other parameters are fixed to #_min = 2, c = 0.13, T_1 = 26, T_2 = 230, and φ = 16. We observe that higher values of N provide a better performance. However, the PCCs tend to saturate for N ≥ 20. Considering that a large N induces a large memory cost, we select N = 20.
The time subsampling factor φ acts like the learning rate in GMM. A large time subsampling factor indicates a small update probability, so the background samples cannot adapt in time to changes in the real background, such as gradual illumination changes. That is, when using a large φ, there may be more false positives due to an outdated background model. On the contrary, a small φ means that the background samples are very likely to be updated according to the current frame, so a still foreground object may be absorbed into the background much more easily, producing more false negatives. Hence, φ must be adjusted to trade off false positives against false negatives. Besides, φ also affects the speed of our method, because a small φ leads to a much higher computational cost for updating. As in ViBe, we set φ = 16.
Therefore, the parameters of IViBe are set as follows: the number of background samples stored in each pixel's background model is fixed to N = 20; the ratio of R_k(x_i) to B_k(x_i) is set to c = 0.13; the cutoff thresholds are set to T_1 = 26 and T_2 = 230; the required number of close background samples when classifying a pixel as background is fixed to #_min = 2; and the time subsampling factor is fixed to φ = 16.
As for ViBe, the recommended parameters suggested in Ref. 8 have been used: N = 20, R = 20, #_min = 2, and φ = 16.

Other settings
For a fair comparison, no postprocessing techniques (such as noise filtering, morphological operations, connected components analysis, etc.) are applied in our tests, in order to evaluate the unaided performance of each approach.

Visual Comparisons
For qualitative evaluation, we visually compare the detection results of our IViBe with those of ViBe in Figs. 6-10 on the test sequences. Although multiple test frames were used for each test sequence, we show only one typical frame per sequence here due to space limitations.
Figure 6 shows the detection results of the lakeSide sequence. In the input frame shown in Fig. 6(a), after swimming in the lake, the body temperatures of the people are similar to that of the lake; therefore, intensities inside the human bodies (except the heads) are almost the same as the intensities of the lake. In the detection result of ViBe shown in Fig. 6(c), we can see that the child's body is incomplete, with many false negatives. This is mainly because ViBe uses a fixed distance threshold R_k(x_i) = 20, which is large for dark environments, and thus wrongly classifies dark foreground objects as background. However, as shown in Fig. 6(d), our IViBe is able to correctly detect most of the foreground regions owing to its adaptive distance threshold based on the perception characteristics of the HVS.
In Fig. 7, the detection results of the blizzard sequence are depicted. To show the differences more clearly, we enlarge two areas that contain only foreground cars to illustrate the improvement of our method. The blizzard sequence is also very challenging, as shown in Figs. 7(a) and 7(e) [a partially enlarged view of Fig. 7(a)]. Because of the snowfall, most of the cars appear white, which leads to confusion between the passing cars and the road covered by thick snow. As can be seen in Fig. 7(c), and particularly in Fig. 7(g), in the detection result of ViBe there are holes inside the detected cars, and obvious false detections appear in the areas covered by snow. In contrast to ViBe, our IViBe can discriminate subtle variations using the adaptive distance threshold and obtain more complete detection results. As shown in Fig. 7(h), our IViBe achieves an evident improvement over ViBe.
Figure 8 illustrates the detection results of the library sequence. This is an infrared sequence and contains a great deal of noise. In Fig. 8(a), a man is static for a long time while he sits on a chair reading a book. Because of the inevitable noise, in the detection result of ViBe, misclassification emerges in the head, shoulders, and legs of the foreground, later propagates to the neighboring pixels, and finally results in large holes inside the foreground, as shown in Fig. 8(c). Owing to the adaptive distance threshold, our method yields less misclassification in the head, shoulder, and leg regions of the foreground, and also suppresses the propagation of misclassification. These results show that our IViBe is more powerful than ViBe at detecting foreground objects that remain still for some frames.
Figure 9 shows the detection results of the sofa sequence. In Fig. 9(a), we can find an abandoned box (foreground) that is static for a long time in the left corner. Meanwhile, a man sits on the sofa and remains still for quite an extended period. Due to the presence of noise and the adoption of a "spatial diffusion" update mechanism, in the detection result of ViBe, as shown in Fig. 9(c), the box is almost completely eaten up and a large number of false negatives appear inside the man. In Fig. 9(d), a notable improvement is shown in our result: the man is more complete and the top surface of the box is well detected. This improvement is mainly the result of the adaptive distance threshold we use.
Figure 10 shows the detection results of the baseline category. For the highway sequence, our method produces more scattered false positives than ViBe in the dark areas of waving trees and their shadows, but detects more complete cars in the top right corner. For the office sequence, a man stands still for some time while reading a book, and Fig. 10(d) shows that IViBe detects more true positives in the legs of the man compared to ViBe. For the pedestrians sequence, both methods yield similar results with evident shadow areas. For the PETS2006 sequence, a man and his bag remain still for a while, and IViBe detects noticeably more complete results.

Quantitative Comparisons
To objectively assess the detection results, we employ four metrics 26,39 recommended by the CDnet, i.e., recall, precision, F1, and percentage of wrong classification (PWC), to judge the performance of the BS methods at the pixel level. Let TP be the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives. These raw data (i.e., TP, TN, FP, and FN) are summed over all the frames with ground-truth references in a video. For a video v in a category a, these metrics are defined as

$\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP},$

$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \qquad \mathrm{PWC} = \frac{100 \times (FN + FP)}{TP + FN + FP + TN}.$

Table 1 shows that, for the lakeSide sequence, our method achieves an impressive improvement over ViBe. As seen in Table 2, for the blizzard sequence, our precision value decreases by 0.01, but our recall value increases by 0.06. For F1 and PWC, our method achieves a moderate improvement. As shown in Table 3, for the library sequence, our proposed IViBe produces results with all metrics better than those of ViBe. Table 4 shows that, for the sofa sequence, our precision value decreases by 0.13, while our recall value increases by 0.25. For F1 and PWC, our method obtains a remarkable improvement. The experimental results demonstrate that, in scenarios containing camouflaged foreground objects, our IViBe significantly reduces false negatives in the detection results; in environments where the foreground objects are static for some frames, our IViBe slows the rate at which those still foreground objects are eaten up in the detection results.
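The four pixel-level metrics can be computed directly from the summed counts. A minimal sketch follows; the counts in the example call are made-up numbers for illustration only, not values from the paper's tables.

```python
def cdnet_metrics(tp, tn, fp, fn):
    """Pixel-level CDnet-style metrics from counts summed over all
    ground-truth frames of a video."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    # PWC: percentage of wrong classifications among all pixels.
    pwc = 100.0 * (fn + fp) / (tp + fn + fp + tn)
    return {"Recall": recall, "Precision": precision, "F1": f1, "PWC": pwc}

# Made-up counts for illustration.
m = cdnet_metrics(tp=9000, tn=85000, fp=1000, fn=5000)
print({k: round(v, 4) for k, v in m.items()})
# → {'Recall': 0.6429, 'Precision': 0.9, 'F1': 0.75, 'PWC': 6.0}
```

Note the trade-off visible in the paper's tables: raising recall (fewer false negatives) at some cost in precision (more false positives) can still improve the combined F1 and PWC scores.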
To calculate the category-average metrics of the baseline category, we also utilize all the available ground-truth references, that is, frames 470 to 1700 in the highway sequence, frames 570 to 2050 in the office sequence, frames 300 to 1099 in the pedestrians sequence, and frames 300 to 1200 in the PETS2006 sequence. Table 5 shows the category-average metrics for the baseline category using ViBe and our method. As can be seen in Table 5, our method produces results with a larger recall and a smaller precision; however, the overall indicators (F1 and PWC) of the two methods are quite similar.
In general, through quantitative analysis, our IViBe method outperforms ViBe when dealing with camouflaged and still foreground objects, and has a similar performance to ViBe when dealing with normal videos with mild challenges.

Conclusion
According to the perception characteristics of the HVS concerning the minimum intensity changes under certain background illuminations, we propose an improved ViBe method that uses an adaptive distance threshold for each background sample in accordance with its intensity. Experimental results demonstrate that our IViBe effectively improves the ability to deal with camouflaged foreground objects. Since camouflaged foreground objects are ubiquitous in real-world video sequences, our IViBe has strong practical value in smart video surveillance systems. Moreover, owing to this capacity for dealing with camouflaged foreground objects, our IViBe not only cuts down the misclassification of foreground pixels as background, but also suppresses the propagation of such misclassification, especially for pixels inside still foreground objects. Experimental results also show that our method outperforms ViBe in scenarios in which foreground objects remain static for several frames.
PSNR(D, R) is used to estimate the level of errors in a distorted image D relative to its original reference image R. For grayscale images with intensities in the range [0, 255], PSNR(D, R) is defined as

$\mathrm{PSNR}(D, R) = 20 \lg \frac{255}{\frac{1}{n} \|D - R\|_1} = 20 \lg \frac{255}{\frac{1}{n} \sum_{m=1}^{n} |d_m - r_m|} \; (\mathrm{dB}), \quad (6)$

where n is the number of pixels in the original image R, and d_m and r_m denote the intensities of the m'th pixel in D and R, respectively.
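Eq. (6) can be sketched directly; note that, unlike the more common PSNR based on root-mean-square error, this variant places the mean absolute error (the per-pixel L1 norm) in the denominator. The pixel lists below are made-up illustrative values.

```python
import math

def psnr_l1(d, r):
    """PSNR per Eq. (6): 20 * lg(255 / MAE), with MAE the mean
    absolute difference between distorted image d and reference r."""
    assert len(d) == len(r) and len(r) > 0
    n = len(r)
    mae = sum(abs(dm - rm) for dm, rm in zip(d, r)) / n
    return 20 * math.log10(255 / mae)

# Two tiny grayscale "images" flattened to pixel lists (illustrative).
ref = [100, 120, 130, 140]
dist = [102, 118, 135, 140]
print(round(psnr_l1(dist, ref), 2))  # small errors -> high PSNR in dB
```

As with the standard definition, smaller per-pixel errors drive the denominator toward zero and the PSNR toward infinity, so identical images would need special-casing in practice.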

Fig. 1 Basic experimental setup used to characterize brightness discrimination.

Fig. 6 Detection results of the lakeSide sequence: (a) frame 2255 of the lakeSide sequence, (b) ground-truth reference, (c) result of ViBe, (d) result of IViBe.

Fig. 8 Detection results of the library sequence: (a) frame 2768 of the library sequence, (b) ground-truth reference, (c) result of ViBe, (d) result of IViBe.

Fig. 9 Detection results of the sofa sequence: (a) frame 900 of the sofa sequence, (b) ground-truth reference, (c) result of ViBe, (d) result of IViBe.