A Novel and Practical Scheme for Resolving the Quality of Samples in Background Modeling

In view of its important application value, background modeling is studied so widely that many techniques have emerged, which mainly concentrate on the selections of the basic model, the granularity of processing, the components in a framework, etc. However, the quality of samples (QoS) for training has long been ignored. There are two aspects regarding this issue, which are how many samples are suitable and which samples are reliable. To tackle the “how many” problem, in this paper, we propose a convergent method, coined Bi-Variance (BV), to decide an appropriate endpoint in the training sequence. In this way, samples in the range from the first frame to the endpoint can be used for model establishment, rather than using all the samples. With respect to the “which” problem, we construct a pixel histogram for each pixel and subtract one from each bin (called number of intensity values (NoIV-1)), which can efficiently get rid of outliers. Furthermore, our work is plug-and-play in nature, so that it could be applied to diverse sample-based background subtraction methods. In experiments, we integrate our scheme into several state-of-the-art methods, and the results show that the performance of these methods in three indicators, recall, precision, and F-measure, improved from 4.95% to 16.47%, from 5.39% to 26.54%, and from 12.46% to 20.46%, respectively.


Introduction
Background modeling is the basis and also the first step for many vision-based tasks, like foreground detection, objects of interest segmentation, high-level operations [1][2][3][4] (such as target recognition and tracking, pose and gesture analysis, action understanding), etc. In the past decades, a gamut of approaches, ranging from simple methods of applying differencing of adjacent frames to the complex methods of mixing or combining multi-cues, have been proposed. Obviously, not only the accuracy of the background model in detection or other applications has been improved greatly, but the model itself has evolved by assimilating new techniques for pursuing higher efficiency and robustness.
Statistical theory has proven appropriate for modeling the uncertainties of natural scenes [5], and in fact, it yielded the best results in CDnet 2014 [6]. Generally, most statistics-based methods follow a suggestion from Stauffer and Grimson [7] that a recent sequence of observations is more important for parameter estimation, which somewhat coincides with the Markov property [8]. These observations are used to obtain the optimal parameters of the model in the training stage, whether online or offline, and these parameters are then used to initialize the model. Subsequently, many researchers have determined the number of observations, $N$, heuristically or experimentally, by choosing from a range $(N_0, N_m)$ the point $N_i$ ($N_0 \le N_i \le N_m$) at which the performance reaches the beginning of a plateau. For example, Elgammal et al. [9] simply took a sample of size 100 to train and represent the background in their KDE model. The number of samples adopted analytically by Barnich and Droogenbroeck's ViBe model [10] and the SuBSENSE model of St-Charles et al. [11] was 20 and 35, respectively. We found that in CDnet 2014, SuBSENSE performed better than KDE and ViBe, although it needed fewer samples than KDE and only a few more than ViBe.
Therefore, a naturally-asked question is how many samples are needed after all and which samples are the best. All these questions point to an essential problem, which is the quality of samples (QoS) used for training or learning. To the best of our knowledge, QoS has not yet been focused on, and there is no principled way to settle it. In this paper, we tackle the two problems of "how many" samples and "which" samples of QoS.
To obtain an appropriate number in a principled way (that is, the "how many" problem), we apply variance, an important numerical characteristic of random variables in statistics, to pixel values cumulatively (that is, each time from the first frame to the current frame for each pixel point). Thus, a series of variances is obtained, and this series tends to converge at some frame. However, for different pixel points, the variance sequences converge to different values, which can be seen clearly from their distributions, and this means that we cannot use a unified criterion to judge when they converge. We propose a new idea of applying the variance within a window that slides along the previous variance sequence, which brings the different values at the corresponding converging points down to a small positive value (such as 5e-2); hence the name, bi-variance. The proof and more details will be given in Section 3.1.
The "which" problem mainly concerns the effect of noises. We know that noises are always present in videos and, thus, disturb the modeling process of normal intensity values. Methods in the literature either explicitly model the noise by the assumption of some distribution (for example, Gaussian with mean zero) or implicitly regard it as part of normal intensity values. However, noises are always there. To get rid of the noise, we construct the pixel histogram by counting the number of intensity values (NoIV-1) observed for each pixel point, and then at some time instant, we subtract one from those with a number greater than zero. This method obeys an underlying fact that the number of noises is less than that of normal intensity values in each pixel point over a period of time.
In summary, the main contributions of this paper are: (1) a novel and effective convergent method, called Bi-Variance (BV), is proposed and proven in statistical and mathematical ways; (2) a simple denoising method is introduced by subtracting one from the number of intensity values (NoIV-1); (3) our scheme can be easily integrated into other approaches thanks to its plug-and-play nature; and (4) extensive tests are carried out to verify our proposals, and it turns out that the performance indicators of recall, precision, and F-measure are improved to different extents for several contemporary state-of-the-art models.
The rest of the paper is organized as follows: the related work is introduced in Section 2; in Section 3, we present our methods in detail and describe the algorithm in pseudocode; Section 4 is devoted to verifying the effectiveness of our scheme by applying it to several contemporary state-of-the-art models; finally, in Section 5, we conclude our work and discuss how to improve it in future work.

Related Work
Background modeling of a real scene is a basic task for vision applications. From about half a century ago to now, numerous methods have been proposed for modeling the background [5,9-18] (a more detailed introduction to this technique and its applications was given in recent surveys [19][20][21][22][23]). However, most of them focus on how to choose the basic model (for example, Gaussian mixture model [7], kernel- or filter-based model [9,24], sample-based model [25][26][27][28][29], codebook model [30], etc.), the granularity of processing (for example, pixel-wise [13], blob-wise [31], region-based [32,33], or spatiotemporal [5]), the components in a framework [10,11] (for example, one or more initializations, error propagation, and updating, apart from the essential subtraction process), and so on. In this paper, we do not come up with a new theory or method for background modeling, but propose a plug-and-play method to promote the performance of those approaches based on statistics.
Generally, a recent sample of a pixel's intensity values is needed, denoted as $p_1, p_2, \cdots, p_N$, where $N$ is the number of values. Then, the model is constructed by using this sample, and the incoming pixel $p_t$ ($t > N$) is judged as belonging to the background or the foreground. The question is how to decide the value of $N$. Toyama et al. [34] set a sequence with at least 200 frames for training the background model. Elgammal et al. [9,35] decided to use 100 samples to represent their background. Sheikh and Shah [5] skipped the discussion of this parameter and set it to 200 in their implementation. Barnich and Droogenbroeck [10] studied the influence of $N$ by testing it from 2 to 50 and then set it to 20 because the $N$-performance curve became flat as $N$ became greater than 20. After their seminal work, researchers have started to pay some attention to the tradeoff between performance and computational cost by choosing an appropriate value for $N$. For instance, Bloisi and Iocchi [36] initialized it with 30 and then reduced it to 25. By performing a comparison analogous to that of ViBe [10], models like PBAS [13], LOBSTER [37], and SuBSENSE [11] all set their training sequence to 35 experimentally. However, the uncertainties of pixels differ from each other, so it is somewhat unreasonable to use the same number of samples for all pixels, let alone for all scenes. We propose a simple but effective method to cope with this problem by exploiting the power of variance (please refer to Section 3.1).
In order to be consistently robust against various challenges, like illumination change, camouflage, etc., models should handle the influence of noise well. From the literature, we can conclude that the noise is always treated in one of two ways, either explicitly or implicitly. The explicit way [11,24,31,35,38-43] means that the noise is modeled by using some measures, for instance a Gaussian distribution assumption. Tavakkoli et al. [44] claimed that PFinder [31] models each pixel with a single 3D Gaussian and models the noise, then detects pixels whose probabilities are smaller than a threshold chosen heuristically. Soatto et al. [38] assumed that a Gaussian white noise is involved in their description of dynamic texture, which is a generative model for the dynamic background [45]. Jing Zhong and Stan Sclaroff [24], in their ARMA, assumed that the noise was distributed as $N(0, Q)$, where $Q$ is its variance. Furthermore, a band-pass filter was used previously to account for the background changes over time. Crivelli et al. [41] proposed a so-called mixed state statistical framework. In this framework, they assumed that background intensity values were perturbed by Gaussian noise. However, unlike the aforementioned ways, they used a local average of intensity differences to reduce the influence of the noise. For computational reasons, López-Rubio and López-Rubio [39] utilized a diagonal matrix to model the noise, which was introduced by the pixel's quantization and compression. Nikos and Visvanathan [46] utilized spatial information to wipe out the disturbance of the noise. Thereafter, Li et al. [47] added temporal information (that is, spatio-temporal features) to further improve the robustness against noise. The implicit way [5,10,37,48] always views the noise as part of the sources in modeling the background and sometimes adds a post-processing step, for example, to fill holes, remove blinking pixels, etc.
Sheikh and Shah [5] applied a threshold $k$ in their classifier to balance the trade-off between the sensitivity to change and the robustness to noise. Barnich and Droogenbroeck [10] claimed that their model can automatically adapt to the noise because it was constructed from noisy pixel values. SeungJong Noh and Moongu Jeon [48] and St-Charles and Bilodeau [37] simply applied pre- and post-processing to smooth or reduce the noise, respectively. However, Hofmann et al. [13] worried about the side effect of such processing, whether pre or post, which also smooths blobs and leads to less sharp boundaries. Apart from these two types, Elgammal et al. [9,35,40] noted that image noise can be modeled by a zero-mean normal distribution if the scene is completely static, so only local or small global displacements, apart from random noise, were addressed in their method.
No matter which case it is, the noise is always there. In this paper, we try to eliminate the noise to provide high-quality samples to train the model, and thus, it can be robust from the beginning. Section 3.2 will demonstrate our idea of removing the noise away in detail.
In the next section, the methodology, the main part of our paper, will be elaborated.

Methodology
In this section, firstly, we will give the detailed explanations (including motivations, proofs, effects, and analyses) of our methods: bi-variance and NoIV-1, in Sections 3.1 and 3.2, respectively. Then, according to the descriptions presented in the former two parts, in Section 3.3, the corresponding algorithm is demonstrated in pseudocode.

Bi-Variance
As we know, variance [49] is used to measure how far a set of (random) numbers is spread out from the average value. In the case of background modeling, we denote a recent set of intensity values of each pixel as $p_1, p_2, \cdots, p_n, \cdots, p_N$, which are viewed as random numbers. Most pixels' intensity values follow some underlying law (see Figure 1), which motivates us to use the variance to model those values. It can be seen that the values of the pixel intensity fluctuate around some horizontal line and become stable asymptotically. To get the underlying law, we calculate the variances of the cumulative sequences (like "$p_1$", "$p_1, p_2$", "$p_1, p_2, p_3$", etc.), denoted VCS and defined as follows,
$$\mathrm{VCS}_n = \mathrm{var}\big((p_n)\big) = \mathrm{var}(p_1, p_2, \cdots, p_n), \quad n = 1, 2, \cdots, N \tag{1}$$
where $(\cdot)$ represents a sequence [50] and var indicates the math operator "variance". Thus, we can intuitively view the tendency of intensity values changing over time (see Figure 2).
Figure 2. VCS tendencies of the points mentioned in Figure 1 are shown in (a-d), separately. When the tendency reaches a stable status, the Y-axis value is also called the convergent value (cv).
As we expected, although they fluctuate drastically in the early stage, plateaus are always formed at some time instant. For instance, point $p_1$ became stable approximately at Frame 65, and $\mathrm{VCS}_{65}$ was two; point $p_3$ almost became stable at Frame 10, with $\mathrm{VCS}_{10}$ being 0.9. So far, we can conclude, experimentally, that the VCS of each pixel's intensity values was almost convergent, and we call values like $\mathrm{VCS}_{65}$ convergent values (cv). However, the cvs (2 for point $p_1$, 1.9 for point $p_2$, 0.9 for point $p_3$, and 3.6 for point $p_4$) were different, which makes it difficult to judge when the convergent status (that is, the stable status) of each pixel's intensity values is reached, and this is exactly the "how many" sub-problem (the first problem of QoS). Therefore, we need a unified criterion to determine whether the sequence has achieved a convergent status.
Here, we propose a method, coined bi-variance, which views the VCS as a random variable whose values form the sequence $(\mathrm{VCS}_n)$. Then, we place a window (with a size $L$, set to 30 in our work) over $(\mathrm{VCS}_n)$ and calculate the variance within each window while sliding it along $(\mathrm{VCS}_n)$. In this way, we can obtain a unified criterion, which is zero in theory (in practice, a small non-negative real value, like 5e-2 in our experiment). In the following, we will give proofs from two aspects, statistics and mathematics.
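The cumulative variance sequence discussed above can be sketched in a few lines of Python. This is a minimal illustration, assuming NumPy, the population variance, and a synthetic pixel trace; the function name `cumulative_variance` and the test data are hypothetical and are not taken from the original implementation.

```python
import numpy as np

def cumulative_variance(pixels):
    """Return VCS_n = var(p_1, ..., p_n) for n = 1..N (population variance)."""
    p = np.asarray(pixels, dtype=np.float64)
    n = np.arange(1, p.size + 1)
    mean = np.cumsum(p) / n            # running mean of p_1..p_n
    mean_sq = np.cumsum(p * p) / n     # running mean of squares
    # var = E[p^2] - E[p]^2; clip tiny negatives caused by rounding.
    return np.maximum(mean_sq - mean * mean, 0.0)

# Synthetic (hypothetical) pixel trace: intensity 120 with mild noise.
rng = np.random.default_rng(0)
pixels = 120 + rng.normal(0.0, 1.5, size=300)
vcs = cumulative_variance(pixels)
```

Plotting `vcs` against the frame index should give a curve of the kind described for Figure 2: large swings early on, then a plateau whose height depends on the pixel.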
In statistics, assume that $X$ is a random variable; thus, its expectation and variance are $E(X)$ and $D(X)$, respectively, and $D(X)$ can be calculated as follows,
$$D(X) = E\{[X - E(X)]^2\} \tag{2}$$
Then, we define the bi-variance of $X$ as $B(X)$, and:
$$B(X) = D(D(X)) \tag{3}$$
By substituting (2) into (3) and expanding it,
$$B(X) = E\{[D(X) - E(D(X))]^2\} \tag{4}$$
we can see that no matter what the variance of $X$ is, its bi-variance will tend to be zero.
From another perspective, according to (4),
$$B(X) = \frac{1}{n}\sum_{i=1}^{n}\left[D(X_i) - \overline{D(X)}\right]^2 \tag{5}$$
where $\overline{D(X)}$ is the mean of the sequence $(D(X_n))$. In (5), $D(X_i) - \overline{D(X)}$ reflects how far $D(X_i)$ deviates from the mean $\overline{D(X)}$, and from Figure 2, we know that $D(X_i)$ tends toward a stable value, which means that the deviation becomes increasingly weaker (i.e., $D(X_i) - \overline{D(X)}$ approaches zero).
In mathematics, a series $S_n$ [51] can be defined by a sequence $(a_n)$ in its summation form, and a straightforward way [52] of judging whether $S_n$ is convergent is to check whether the term $a_i$ reduces to zero as $i$ grows, meaning that from some position, like $i'$, when $i > i'$, $a_i \to 0$, and thus, the sum of this sequence, $S_n$, will tend to be a fixed value (that is, $S_n = a_1 + a_2 + a_3 + \cdots + a_{i'}$). Therefore, in (5), the terms $[D(X_i) - \overline{D(X)}]^2$ play the role of $a_i$ and tend to zero, and thus, this sequence converges to zero (since $\lim_{n \to \infty} S_n / n = 0$ with $S_n$ fixed). Strictly speaking, vanishing terms alone do not guarantee that a series converges, but the conclusion still holds here because the average of terms that tend to zero also tends to zero. By using the bi-variance on the VCS in a window, we get a new series of values. We call those values the "bi-variance", and they converge in a unified way. Figure 3 illustrates the bi-variances of the VCS in Figure 2. As mentioned before, we utilized a small positive value, 5e-2, as the convergence criterion. Thus, those four points converge approximately at Frames 65, 75, 10, and 55, respectively.
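To illustrate why the windowed variance gives a criterion that does not depend on the plateau height, the following sketch computes the bi-variance of two synthetic VCS curves that settle at different convergent values (2.0 and 3.6). It assumes NumPy 1.20 or later for `sliding_window_view`; the function name `bi_variance` and the exponential test curves are assumptions made for illustration only.

```python
import numpy as np

def bi_variance(vcs, window=30):
    """Variance of every length-`window` slice of the VCS sequence."""
    v = np.asarray(vcs, dtype=np.float64)
    if v.size < window:
        return np.empty(0)
    # Each row is one window position; the result has len(v) - window + 1 entries.
    windows = np.lib.stride_tricks.sliding_window_view(v, window)
    return windows.var(axis=1)

# Two hypothetical VCS curves with the same transient but different plateaus.
t = np.arange(200)
vcs_a = 2.0 + 5.0 * np.exp(-0.1 * t)
vcs_b = 3.6 + 5.0 * np.exp(-0.1 * t)

bv_a = bi_variance(vcs_a)
bv_b = bi_variance(vcs_b)

eps = 5e-2
first_below = int(np.argmax(bv_a < eps))  # index of the first window under eps
```

Because variance is unchanged by adding a constant, `bv_a` and `bv_b` coincide even though the two curves level off at different values, so a single threshold such as 5e-2 can serve both.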

NoIV-1
We know that the histogram of an image (e.g., Figure 4b) is constructed by counting the number of occurrences of each intensity value (from 0 to 255) while scanning the whole image (Figure 4a). Different from the classic image histogram, a pixel histogram is established by counting the number of observed intensity values along the temporal axis (Figure 4c,d). Further, by careful examination of Figure 4d, we can notice that the number of occurrences of some intensity values (like 46, 63, 64) was very small, which is the key observation from which our method derives.
Although we have no idea when and where the noise will appear, we know that normal intensity values occur more often than noise (normally, noise values scatter in different places, and only a few of them occur during a short period). We simply cut each column by one in the pixel histogram (see Figure 5), leading to the disappearance of the noise. However, the effect on normal intensity values (for the case in Figure 5, $I_1$, $I_8$, and $I_n$ are viewed as noise, and the others are normal) depends on their distribution. Therefore, we analyze it as follows.
Assume that the intensity values of one certain pixel $p$ during a time period are $I_1, I_2, \ldots, I_n$. The number of occurrences of each intensity value is denoted by $N_{I_1}, N_{I_2}, \ldots, N_{I_n}$. Thus, the total number of intensity values, $N_T$, is:
$$N_T = \sum_{i=1}^{n} N_{I_i} \tag{7}$$
Then, the contribution of each intensity value is:
$$P_{I_i} = \frac{N_{I_i}}{N_T} \tag{8}$$
After using NoIV-1, the number of intensity values is:
$$N_T' = N_T - N_{L_1} \tag{9}$$
where $N_{L_1}$ indicates the number of the first-level pixel $p_{L_1}$ in the pixel histogram (for example, in Figure 5, the number of squares colored in gray at the bottom). According to (9), we rewrite (8) as:
$$P_{I_i}' = \frac{N_{I_i} - 1}{N_T - N_{L_1}} \tag{10}$$
On the one hand, for those intensity values whose number is only one, the contributions are reduced to zero, like $I_1$, $I_8$, and $I_n$ in Figure 5; on the other hand, for the others, we analyze the ratio of their contributions in (10) and (8) as follows,
$$\frac{P_{I_i}'}{P_{I_i}} = \frac{N_T}{N_T - N_{L_1}} \cdot \frac{N_{I_i} - 1}{N_{I_i}} \tag{11}$$
where $i \in p - p_{L_1}$. Let us have a look at the first item, $\frac{N_T}{N_T - N_{L_1}}$, in (11), in which the numerator is a constant and the denominator, in fact, reflects the shape of the distribution. If $N_{L_1}$ is large, especially if it exceeds half of $N_T$, the shape tends to be flat; otherwise, it is tall and narrow. For the other item, $\frac{N_{I_i} - 1}{N_{I_i}}$, if $N_{I_i} = 1$ (that is, the noise), the effect of NoIV-1 is a fatal blow because the item is zero; if $N_{I_i}$ is greater than one, $\lim_{N_{I_i} \to \infty} \frac{N_{I_i} - 1}{N_{I_i}} = 1$, meaning that the harm to those intensity values with a large number becomes weaker. Therefore, the contributions of the remaining intensity values mainly depend on the shape of their whole distribution, which is very important in determining whether an incoming pixel belongs to the foreground or the background in background models. As for the example in Figure 5, by applying NoIV-1, the contributions of the noise values, $I_1$, $I_8$, and $I_n$, become zero. The intensities left with only one occurrence are weakened (that is, the ratios for $P_{I_2}$, $P_{I_3}$, and $P_{I_9}$ equal $\frac{15}{19}$), while the contributions of $P_{I_4}$, $P_{I_7}$, and $P_{I_{10}}$ are enhanced a little, by a ratio of $\frac{60}{57}$, and those of $P_{I_5}$ and $P_{I_6}$ increase by ratios of $\frac{105}{76}$ and $\frac{45}{38}$, respectively.
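The ratios quoted for the Figure 5 example can be checked with exact fractions. The bin counts below are an assumption: they are not given in the text and were chosen so that the product of $\frac{N_T}{N_T - N_{L_1}}$ and $\frac{N_{I_i} - 1}{N_{I_i}}$ reproduces the quoted values (which requires 30 observations spread over 11 non-empty bins). The variable names are hypothetical.

```python
from fractions import Fraction

# Hypothetical bin counts consistent with the ratios quoted for Figure 5.
counts = {
    "I1": 1, "I2": 2, "I3": 2, "I4": 3, "I5": 8, "I6": 4,
    "I7": 3, "I8": 1, "I9": 2, "I10": 3, "In": 1,
}

n_total = sum(counts.values())                      # N_T
n_level1 = sum(1 for c in counts.values() if c > 0)  # N_L1: non-empty bins

# Contribution of each intensity before and after NoIV-1.
before = {k: Fraction(c, n_total) for k, c in counts.items()}
after = {k: Fraction(c - 1, n_total - n_level1) for k, c in counts.items()}

# Ratio of the two contributions, written as the product of the two items.
ratios = {
    k: Fraction(n_total, n_total - n_level1) * Fraction(c - 1, c)
    for k, c in counts.items()
}

for name in ("I1", "I2", "I4", "I5", "I6"):
    print(name, before[name], after[name], ratios[name])
```

Under these assumed counts the noise bins drop to a ratio of zero, bins with two observations are weakened, and bins with three or more observations gain weight, which matches the qualitative argument above.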
In a real scene, taking point $p_4$ as an example, we can see that some peaks in Figure 6a (denoted by red circles) have disappeared in Figure 6b. Note that point $p_4$ has a unimodal distribution, which features a bump whose contributions at the two ends are smaller than those in the middle. That is why the points that have disappeared are located at the top or bottom of its distribution. The effect of NoIV-1 on the VCS of point $p_4$ can be seen in Figure 7. In Figure 7a, we can see that the VCS curve of point $p_4$ changed considerably: it started from about 4.2, dropped sharply to 2.1, then rose to 4 with fluctuation and tended to be stable at about Frame 100. By using NoIV-1, the VCS curve (denoted by $p_4$ in Figure 7b) experienced fewer big changes and quickly reached the convergent status at about Frame 50. The same effect was reflected by the bi-variance (see Figure 7c,d, respectively).
However, in an extreme case where the scene is very complex, like trees in a storm, the shape of the histogram will be very flat, and there will be many intensities at the bottom of the histogram. That means that all the intensities that occur only once at the bottom of the histogram will be viewed as noise and thus will be deleted according to the rule of NoIV-1. In this situation, one effective way is to maintain the reference samples long term.
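The NoIV-1 operation described in this section, together with the extreme flat-histogram case, can be sketched as follows. This is an illustrative sketch assuming NumPy and 8-bit intensities; the function names and the toy traces are hypothetical, and the sketch operates on the histogram only, since the text does not specify which individual observations are dropped from the sequence.

```python
import numpy as np

def pixel_histogram(pixels, levels=256):
    """Count how often each intensity was observed at one pixel over time."""
    return np.bincount(np.asarray(pixels, dtype=np.int64), minlength=levels)

def noiv_minus_one(hist):
    """Subtract one from every non-empty bin; empty bins stay at zero."""
    h = np.asarray(hist)
    return np.where(h > 0, h - 1, 0)

# Toy trace: values 46 and 64 appear once and play the role of noise.
trace = [50, 50, 50, 51, 51, 52, 52, 52, 52, 46, 64]
before = pixel_histogram(trace)
after = noiv_minus_one(before)

# Extreme case: a perfectly flat histogram (every value seen exactly once).
flat_after = noiv_minus_one(pixel_histogram(range(100, 130)))
```

In the toy trace the two isolated values vanish while the dominant values keep most of their counts. In the flat case every observation is removed, which is the situation the paragraph above warns about.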

Algorithm
According to the descriptions of our methods above, we give the corresponding algorithm in pseudocode in Algorithm 1. Since using bi-variance together with NoIV-1 performs better than using bi-variance alone (see Section 4), in our implementation, we placed NoIV-1 as the first step and then applied bi-variance to the purified results. Note that as we have not optimized our code, especially in NoIV-1, the whole runtime of the training stage of each host method was delayed by several milliseconds to tens of milliseconds (the latter if the scene was challenging, leading to the sequence VCS converging slowly).

Algorithm 1 The implementation of bi-variance and NoIV-1.
Input: an intensity value sequence (p_n) of a pixel P, which is gathered from a scene video or image sequence during a period of time (for example, a set of frames, from the first image of any scene, with the amount specified in each scene's configuration file in our experiment)
Output: the new sequence (p_n) without noises and the convergent frame cf
(p_n) = NoIV-1((p_n)) //remove the noises first (Section 3.2)
seqLen = length((p_n))
for i = 1 : seqLen do
    subSequence = p(1 : i) //the cumulative sequence p_1, ..., p_i
    vcs_std[i] = std(subSequence)
end for
window = L //the length of the sliding window
cnt = 0 //to record consecutive times of the bi-variance below ε, which is introduced in the following
cnt_thres = THRESHOLD //the threshold of cnt
cf = seqLen //initialized for the non-convergent case
wstd = 0 //the variance of vcs_std in the window, i.e., the bi-variance
for j = window : seqLen do
    wstd = std(vcs_std(j − window + 1 : j))
    if wstd < ε then //ε, a small positive real value, like 5e-2
        cnt = cnt + 1
        if cnt ≥ cnt_thres then
            cf = j − cnt_thres
            break
        end if
    else
        cnt = 0
    end if
end for
return (p_n), cf

Figure 7. (a,b) depict the VCS curves of p_4 before and after using NoIV-1, respectively. (c,d) are the bi-variances of p_4 before and after using NoIV-1, respectively. Red rectangles emphasize the obvious changes.
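As a rough companion to Algorithm 1, the convergence test can be written in Python as below. This is a sketch under several assumptions rather than the authors' implementation: it uses NumPy, takes the standard deviation (as the `std` calls in the pseudocode suggest), assumes NoIV-1 has already been applied to the input, and picks `cnt_thres=5` arbitrarily because the value of THRESHOLD is not stated. The function name `convergent_frame` is hypothetical.

```python
import numpy as np

def convergent_frame(pixels, window=30, eps=5e-2, cnt_thres=5):
    """Return the frame at which the windowed std of the cumulative std
    stays below `eps` for `cnt_thres` consecutive windows.
    Falls back to the sequence length when no convergence is found."""
    p = np.asarray(pixels, dtype=np.float64)
    n = np.arange(1, p.size + 1)
    mean = np.cumsum(p) / n
    # Cumulative standard deviation of p_1..p_i for every i.
    vcs_std = np.sqrt(np.maximum(np.cumsum(p * p) / n - mean * mean, 0.0))

    cnt = 0
    for j in range(window, p.size + 1):       # j is a 1-based frame index
        wstd = vcs_std[j - window:j].std()    # frames j-window+1 .. j
        if wstd < eps:
            cnt += 1
            if cnt >= cnt_thres:
                return j - cnt_thres
        else:
            cnt = 0                           # the run was interrupted
    return p.size                             # non-convergent case

# Hypothetical usage on a synthetic trace.
rng = np.random.default_rng(1)
trace = 120 + rng.normal(0.0, 1.5, size=300)
cf = convergent_frame(trace)
training_samples = trace[:cf]   # samples from the first frame to the endpoint
```

Requiring several consecutive windows below the threshold guards against a single quiet window being mistaken for convergence; how many are needed is a tuning choice that the text leaves open.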

Experiment
To validate the effectiveness of our methods, we tested them on a PC with a 3.3-GHz Intel Xeon E3-1230 V2 CPU and 24 GB memory, and the environment was Microsoft Visual Studio 2010. We performed extensive tests on a well-known dataset (CDnet 2014) [6], which is widely adopted by researchers studying background modeling and other topics related to computer vision. CDnet 2014 involves eleven categories and 4-6 scenes in each one (53 scenes in total). We used three performance indicators that are common in the literature: recall, precision, and F-measure. Their definitions are:
$$\mathrm{recall} = \frac{TP}{TP + FN}, \quad \mathrm{precision} = \frac{TP}{TP + FP}, \quad \text{F-measure} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \tag{12}$$
where TP, FP, TN, and FN indicate the number of "true positives", "false positives", "true negatives", and "false negatives", respectively.
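For reference, the three indicators can be computed from pixel counts as in the following sketch. The function name and the guard against empty denominators are assumptions added for illustration; the counts in the usage line are made up.

```python
def recall_precision_f(tp, fp, fn):
    """Recall, precision, and F-measure from true/false positive
    and false negative counts. Returns 0.0 where a ratio is undefined."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return recall, precision, f_measure

# Made-up counts: 80 foreground pixels found, 20 false alarms, 20 misses.
recall, precision, f_measure = recall_precision_f(tp=80, fp=20, fn=20)
```

Note that reducing false positives raises precision without changing recall, whereas reducing false negatives raises recall only.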
Four famous methods (KDE [9], ViBe [10], PBAS [13], SuBSENSE [11]) in the recent two decades were chosen to take the role of a carrier (as claimed before, our methods can be viewed as a plug-in).
Then, we set three scenarios for evaluating the performance of our methods:
1. The basic one is that the original codes of these methods were executed without any modification. Note that because the starting points of the test of each method were different (such as KDE and SuBSENSE testing after their training frames, while ViBe from the first frame and PBAS with the first one blank), we conventionally obeyed the rule of setting the training number, which is in each scene's configuration file of CDnet 2014, as the starting points of the test.
2. The second test utilized the bi-variance to get the "right" number of frames for training based on the basic one.
3. The third scenario was a combination of the basic one with our methods, bi-variance and NoIV-1.

Quantitative Results and Analysis
The quantitative results of four host methods are listed in Tables 1-4, respectively. The last row with the title ratio indicates the performance enhancement by comparing the best results of each method after using bi-variance and NoIV-1 against the basic one. Meanwhile, to provide intuitive knowledge, we depict all of these data in Figure 8. In Tables 1-4, column "times" was only used for the third scenario, in which the method NoIV-1 was executed one time, two times, four times, and seven times, respectively. Firstly, we can notice that the performance of all host methods in the three indicators was enhanced by using bi-variance only or that together with NoIV-1. Secondly, the effects of applying their combination, bi-variance and NoIV-1, were better than those of only utilizing bi-variance. Thirdly, we found that when the method NoIV-1 was executed a different number of times, the final results were different. However, when we executed it four times, each host method in the combined model reached its best performance, and if we executed it three more times, the performance went down.
These phenomena reveal two facts: (1) the noise is a key factor affecting the precision of background models; (2) the number of noise values is small. As Figure 9 shows, each method's performance indicators tend to rise first and then descend. Recall Figure 5 and the analyses in Section 3.2: executing the method "NoIV-1" more times was increasingly harmful to the normal intensities and would destroy (rather than restore) the distribution of the data. Therefore, the tendencies in Figure 9 are reasonable, and the tendency is basically predictable if we execute the method "NoIV-1" more than seven times.
Figure 9. The tendency of each performance indicator of each host method by executing method "NoIV-1" one time, two times, four times, and seven times, respectively, in the combined model. In each broken line, the four vertices represent the four kinds of the different number of times of executing the method "NoIV-1".
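The rise-then-fall tendency can be made plausible with a small experiment on a toy pixel histogram: each extra pass of NoIV-1 removes one more observation from every surviving bin. The bin counts below are hypothetical (30 observations in 11 bins, three of which hold a single noise value), and the function name `apply_noiv` is an assumption.

```python
from collections import Counter

def apply_noiv(counts, times):
    """Apply NoIV-1 `times` times: each pass subtracts one from every
    non-empty bin and drops bins that reach zero."""
    out = Counter(counts)
    for _ in range(times):
        out = Counter({value: c - 1 for value, c in out.items() if c > 1})
    return out

# Hypothetical pixel histogram: intensity label -> number of observations.
counts = {
    "I1": 1, "I2": 2, "I3": 2, "I4": 3, "I5": 8, "I6": 4,
    "I7": 3, "I8": 1, "I9": 2, "I10": 3, "In": 1,
}

# Observations and distinct intensities left after 1, 2, 4, and 7 passes.
remaining = {k: sum(apply_noiv(counts, k).values()) for k in (1, 2, 4, 7)}
surviving_bins = {k: len(apply_noiv(counts, k)) for k in (1, 2, 4, 7)}
print(remaining, surviving_bins)
```

In this toy case the first pass removes only the single-occurrence bins plus one sample from each normal bin, while later passes eat into the normal intensities until only the dominant value survives, which is consistent with the explanation given above.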
In Table 1, for method KDE, the values of the indicators precision and F-measure of bi-variance only are smaller than that of basic (visually in Figure 8, part of bi-variance only's line below that of basic), because the training number of frames of bi-variance only was more than that of basic (100 mentioned in Section 2), leading to more noises being introduced.
In Tables 2 and 3, the ratios of performance enhancement in recall are less than those in precision and F-measure, which shows that our methods effectively reduced the number of FP more than that of FN by referring to their definitions in (12).
Interestingly, in Figure 8, we can also find that SuBSENSE had the best performance in all three scenarios against the others, which is a more recent method and gained a very high ranking in CDnet 2014 [6]. ViBe's performance obtained a relatively big promotion when our methods were applied to its training stage, which actually reflected that the influence of noises cannot be ignored and that an explicit processing will be better (for example, SuBSENSE adopts an explicit way, while ViBe's denoising mechanism is by constructing the model from noisy data straightforwardly).

Qualitative Results and Analysis
The qualitative results (see Figure 10 below) will display each method's cons and pros visually, especially in detail.
In Figure 10, comparing the results of the method KDE in the third row with those of KDE combined with our methods in the fourth row, the obvious change is that the noise was suppressed greatly (especially in the last three scenes).
For methods ViBe and PBAS, the improvement was relatively higher than that in the other two host methods, which can also be noticed in their quantitative results in Figure 8. For example, the method ViBe cannot reduce the disturbance of the white lines on the badminton court in the scene badminton and the white pattern on the T-shirt in the scene office. However, those problems were alleviated after using our methods. Method PBAS was susceptible to illumination change, the dynamic background, and the thermal environment. Those can be identified distinctly in the scenes office, overpass, and park in the seventh and eighth rows.
As for the method SuBSENSE, one of its drawbacks is overfitting, which can be seen in the last two rows; for instance, the one missing leg in the scenes badminton and office, the separated body in the scenes sofa and park, and the connected legs in the scene overpass. Table 5 gives the convergent frames (that is, the desirable number of training frames) of all scenes in each category in CDnet 2014 [6].

Convergent Frames and Analysis
By comparing the convergent frames of different scenes, several highlights can be observed in Table 5:
1. the convergent frame of every scene was different from the others;
2. there was a big gap between the minimal convergent frame, 32, and the maximal convergent frame, 380, across all scenes.
Therefore, we assert that assigning the same $N$ to all scenes is not appropriate, and the exact value of $N$ should be determined by the data themselves. Additionally, the runtime of our method ranged from about 20 ms to less than 2 s, depending on the complexity, the resolution, and the convergent frame of each scene. Since we directly plugged our method into the source code of each host method, it inevitably led to a delay to some extent. We are also considering this issue and intend to optimize our approach for speedup.

Conclusions
In this paper, we focus on the quality of training samples (QoS) for background modeling, and we have proposed a plug-and-play method, bi-variance, to cope with the "how many" problem in QoS and coupled with another denoising method, called NoIV-1, to sweep away those outliers, which refers to the "which" problem of QoS. Four seminal works, KDE [9], ViBe [10], PBAS [13], and SuBSENSE [11], have been introduced to evaluate our methods. In the experiment, we can see that the performance of each method (in three indicators, recall, precision, and F-measure) has been enhanced to different degrees.
By viewing the ratio-rows from Tables 1-4, the ratio of performance promotion was limited, generally speaking. However, how to greatly improve the performance is concerned with the quality of background model itself. Consequently, in our future work, we need to propose a high-quality background model based on the current study.