An Incremental Kernel Density Estimator for Data Stream Computation

Probability density function (p.d.f.) estimation plays a very important role in the field of data mining. The kernel density estimator (KDE) is the most widely used technique for estimating the unknown p.d.f. of a given dataset. Existing KDEs are usually inefficient when handling the p.d.f. estimation problem for stream data because a brand-new KDE has to be retrained on the combination of current data and newly coming data. This process increases the training time and wastes computation resources. This article proposes an incremental kernel density estimator (I-KDE) which deals with the p.d.f. estimation problem in the way of data stream computation. The I-KDE updates the current KDE dynamically and gradually with the newly coming data rather than retraining a brand-new KDE on the combination of current data and newly coming data. The theoretical analysis proves the convergence of the I-KDE provided that the estimated p.d.f. of the newly coming data converges to its true p.d.f. In order to guarantee the convergence of the I-KDE, a new multivariate fixed-point iteration algorithm based on the unbiased cross validation (UCV) method is developed to determine the optimal bandwidth of the KDE. The experimental results on 10 univariate and 4 multivariate probability distributions demonstrate the feasibility and effectiveness of the I-KDE.


Introduction
Probability density function (p.d.f.) estimation [1] uses a nonparametric way to determine the p.d.f. of a random variable (r.v.) based on the given dataset. It plays a very important role in the field of data mining because many machine learning tasks are related to p.d.f. estimation, for example, Bayesian classification [2], density-based clustering [3], feature selection [4], time series analysis [5], and image processing [6]. The most widely used p.d.f. estimator is the Parzen window estimator [7], which is also termed the kernel density estimator (KDE). The KDE uses the superposition of multiple kernels (e.g., triangular kernel, Epanechnikov kernel, biweight kernel, triweight kernel, cosine kernel, and Gaussian kernel) to fit the unknown p.d.f. of the given dataset. How to select an appropriate window bandwidth or kernel size is the core of training an effective KDE: a large bandwidth leads to an oversmoothed estimate, while a small bandwidth results in an undersmoothed estimate. Until now, the studies regarding how to construct KDEs have mainly focused on the selection of optimal bandwidths.
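The effect of the bandwidth described above can be illustrated with a minimal sketch of a univariate Gaussian-kernel KDE. The function names `gaussian_kernel` and `kde`, the toy sample, and the three bandwidth values are illustrative assumptions and are not taken from the article.

```python
import math


def gaussian_kernel(u):
    """Standard univariate Gaussian kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(x, samples, h):
    """Kernel density estimate at point x: average of kernels scaled by h > 0."""
    n = len(samples)
    return sum(gaussian_kernel((x - s) / h) for s in samples) / (n * h)


# Hypothetical toy sample, used only to show the role of the bandwidth.
samples = [-1.2, -0.8, -0.1, 0.3, 0.4, 1.1, 1.9]

# A very small h gives a spiky (undersmoothed) estimate,
# a very large h gives a flat (oversmoothed) one.
for h in (0.05, 0.5, 5.0):
    values = [round(kde(x, samples, h), 4) for x in (-1.0, 0.0, 1.0)]
    print(h, values)
```

Whatever the bandwidth, the estimate remains a valid density: it is nonnegative and integrates to one, so the bandwidth only changes how the probability mass is spread around the samples.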
In order to select the optimal bandwidth, an effective error criterion should be deliberately designed [8]. The mean integrated square error (MISE) is a typical error criterion which measures the expected value of the integrated square error between the estimated p.d.f. and the true p.d.f. Because the error criterion includes an unknown term, i.e., the true p.d.f., the bandwidth selection methods have to use different approximation strategies to replace it and then determine the optimal bandwidths for the specific applications. Many bandwidth selection methods have been developed based on the MISE criterion. The representative works are summarized as follows. The rule of thumb (RoT) [9] is the simplest method; it determines a quick normal scale bandwidth by assuming that the data follow a normal distribution. However, when the data are not close to normal, RoT tends to oversmooth and mask the important features of the data [8]. The bootstrap method [10] uses the p.d.f. estimated with the resampled data to replace the true p.d.f. and then minimizes the bootstrap criterion function to determine the bandwidth for p.d.f. estimation. The unbiased cross validation (UCV) method [11], which is also termed least squares cross validation (LSCV), uses the leave-one-out strategy to estimate the true p.d.f. in the error criterion. The expected value of UCV is equal to the difference between the expected value of the integrated square error (ISE) and a constant related to the true p.d.f. The biased cross validation (BCV) method [12] derives a smoothed objective function to optimize the bandwidth based on an asymptotic MISE. The theoretical analysis showed that BCV had a good convergence rate of the optimal bandwidth. A solve-the-equation approach [13] for univariate p.d.f. estimation was studied based on the plug-in strategy to approximate the true p.d.f. in the asymptotic MISE. Correspondingly, an iterative algorithm was designed to find the optimal bandwidth.
The experimental results demonstrate that the aforementioned MISE-based KDEs and their variants [14][15][16] can obtain good performances when handling p.d.f. estimation problems in a stationary way.
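As a small illustration of the simplest of these selectors, the following sketch computes a normal-scale rule-of-thumb bandwidth for a univariate Gaussian kernel. It assumes the common form h = (4 / (3N))^(1/5) σ, roughly 1.06 σ N^(-1/5); the article does not state which variant of RoT it uses, and the function name `rule_of_thumb_bandwidth` and the toy sample are hypothetical.

```python
import statistics


def rule_of_thumb_bandwidth(samples):
    """Normal-scale bandwidth for a univariate Gaussian kernel.

    Assumes the data are roughly normal; for clearly non-normal data
    this value tends to oversmooth.
    """
    n = len(samples)
    sigma = statistics.stdev(samples)  # sample standard deviation
    return (4.0 / (3.0 * n)) ** 0.2 * sigma


# Hypothetical toy sample.
samples = [0.2, -0.5, 1.3, 0.9, -1.1, 0.4, 0.0, -0.3, 1.8, -0.9]
print(rule_of_thumb_bandwidth(samples))
```

Because the rule needs only the sample size and the standard deviation, it costs O(N) time, which makes it a convenient starting value for the more expensive cross-validation selectors.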
As far as we know, there is no KDE which is specially designed to estimate the p.d.f. for stream data, i.e., to estimate the p.d.f. in a nonstationary or incremental way. Stream data [17] can be regarded as a dynamic dataset whose size gradually increases with the continuous collection of new data. When the data are obtained by the learning system in a batch way, the existing KDEs have to reestablish brand-new p.d.f.s based on the combination of current data and newly coming data. This training mode severely wastes computation resources because the information regarding the p.d.f. estimated with the current data is fully discarded. In addition, it can significantly increase the computation time if the p.d.f. is estimated by putting the current data and newly coming data together. The existing KDEs cannot work well if the total amount of data accumulated over the data batches is beyond the memory capacity of the computer. Incremental learning [18] provides an efficient paradigm to deal with stream data, especially in the big data age, and it also gives a flexible and feasible way to analyze large-scale data.
This article proposes an incremental kernel density estimator (I-KDE) for stream data mining. The I-KDE updates the current KDE dynamically and gradually with the newly coming data rather than retraining a brand-new KDE on the combination of current data and newly coming data. The I-KDE deals with the p.d.f. estimation problem in the way of data stream computation. The theoretical analysis proves the convergence of the I-KDE provided that the estimated p.d.f. of the newly coming data converges to its true p.d.f. In order to guarantee the convergence of the I-KDE, a new multivariate fixed-point iteration algorithm based on the unbiased cross validation (UCV) method [11] is developed to determine the optimal bandwidth of the KDE. The experimental results on 10 univariate and 4 multivariate probability distributions demonstrate the feasibility and effectiveness of the I-KDE in estimating the p.d.f. of stream data. The remainder of this article is organized as follows. In Section 2, we introduce our problem formulation. The I-KDE is presented in Section 3. In Section 4, we report the experimental results to demonstrate the effectiveness of the I-KDE. Finally, we give our conclusions and suggestions for future research in Section 5.

Problem Formulation
Let $D^{(T)} = \bigcup_{t=1}^{T} D_t$ represent the combination of datasets $D_1, D_2, \ldots, D_T$. For any $x_n^{(T)} \in D^{(T)}$, $n = 1, 2, \ldots, N^{(T)}$, there exists $t \in \{1, 2, \ldots, T\}$ such that $x_n^{(T)} \in D_t$. If we want to estimate the underlying p.d.f. for dataset $D^{(T)}$, the mathematical model of the classical KDE [7] is described as

$$\hat{P}^{(T)}_{H^{(T)}}(x) = \frac{1}{N^{(T)}} \sum_{n=1}^{N^{(T)}} K_{H^{(T)}}\left(x - x_n^{(T)}\right),$$

where $K_{H^{(T)}}(\cdot)$ is the standard $D$-variate Gaussian kernel function scaled by the bandwidth and $H^{(T)}$ is the optimal window bandwidth, which can be determined with different bandwidth selection strategies, e.g., the bootstrap method [10], the unbiased cross validation (UCV) method [11], the biased cross validation (BCV) method [12], and the solve-the-equation method [13]. In fact, the estimation paradigm shown in equation (3) [...] and then obtain the estimated p.d.f. [...], where the mathematical symbol $\oplus$ represents the incremental updating of the previous result.

I-KDE: An Incremental Kernel Density Estimator
This section presents an incremental KDE (I-KDE) for stream data mining. Assume the learning system has received $T$ ($T \geq 1$) batches of data $D_1, D_2, \ldots, D_T$ and the newly coming dataset is $D_{T+1}$. [...] respectively, where $H_{T+1}$ is the optimal bandwidth of the p.d.f. $\hat{P}_{T+1}(x)$ estimated with the newly coming data $D_{T+1}$.

In equation (10), we can find that it consumes less time to determine the bandwidth $H_{T+1}$ than to determine $H^{(T)}$ and $H^{(T+1)}$ when $N^{(T)} \gg N_{T+1}$ and $N^{(T+1)} \gg N_{T+1}$. Thus, the I-KDE only considers how to determine the optimal bandwidth for the newly coming data $D_{T+1}$ and gives up calculating the bandwidths for both the current data $D^{(T)}$ and the combination $D^{(T+1)}$ of current data and newly coming data. Then, equation (9) can be represented as

$$\hat{P}^{(T+1)}(x) = \frac{N^{(T)}}{N^{(T+1)}} \hat{P}^{(T)}(x) + \frac{N_{T+1}}{N^{(T+1)}} \hat{P}_{T+1}(x). \quad (11)$$

According to equation (11), we can iteratively derive the following equations:

$$\hat{P}^{(t)}(x) = \frac{N^{(t-1)}}{N^{(t)}} \hat{P}^{(t-1)}(x) + \frac{N_t}{N^{(t)}} \hat{P}_t(x), \quad t = T, T-1, \ldots, 2. \quad (12)$$

When the learning system only receives the data $D_1$,

$$\hat{P}^{(1)}(x) = \hat{P}_1(x) \quad (13)$$

holds for $D^{(1)} = D_1$. Substituting equation (12) into equation (11) yields the mathematical model of the I-KDE as

$$\hat{P}^{(T+1)}(x) = \sum_{k=1}^{T+1} \frac{N_k}{N^{(T+1)}} \hat{P}_k(x), \quad (14)$$

where $N_k$ is the number of samples in $D_k$, $N^{(T+1)} = \sum_{k=1}^{T+1} N_k$, and $H_k$ ($k = 1, 2, \ldots, T+1$) is the optimal bandwidth of the p.d.f. $\hat{P}_k(x)$ estimated with the data block $D_k$. Equation (14) reveals that the KDE trained on the union of different data blocks can be decomposed into different KDEs which are trained independently on the corresponding data blocks. In equation (14), we can find that the I-KDE is an asymptotic integration model of a series of KDEs which are trained gradually on the different data blocks. The I-KDE provides an effective way to deal with the p.d.f. estimation problem for stream data and meanwhile makes it possible to estimate the p.d.f. for large-scale data by partitioning it into different data blocks. Figure 1 depicts the procedure of the I-KDE.

We provide a brief analysis of the computational complexity of the I-KDE. The classical unbiased cross validation (UCV) method [11] is used in this article to determine the optimal bandwidth. Its computational complexity is $O(N^2 D)$, where $N$ is the number of samples and $D$ is the number of sample dimensions. Assume the learning system receives $T$ batches of data $D_1, D_2, \ldots, D_T$; the time complexity of the I-KDE is

$$\sum_{t=1}^{T} O\left(N_t^2 D\right). \quad (15)$$

If we reestablish a brand-new p.d.f. for each batch of data by training the KDE on the combination of current data and newly coming data, the time complexity of the full retraining scheme is

$$\sum_{t=1}^{T} O\left(\left(N^{(t)}\right)^2 D\right), \quad (16)$$

where

$$O\left(\left(N^{(T)}\right)^2 D\right) \quad (17)$$

is the time complexity of estimating the p.d.f. for data $D^{(T)} = \bigcup_{t=1}^{T} D_t$. By comparing equation (15) with equation (16), we can see that the time complexity of the I-KDE is far less than the time complexity of the retraining scheme, i.e.,

$$\sum_{t=1}^{T} N_t^2 D \ll \sum_{t=1}^{T} \left(N^{(t)}\right)^2 D.$$

In addition, by comparing equation (15) with equation (17), we can see that the I-KDE is able to deal with the p.d.f. estimation for large-scale data. Figure 2 presents the comparison of time complexity between the I-KDE and the retraining scheme.

Now, we give the theoretical analysis of the convergence of the I-KDE. First, a lemma regarding the consistency of the probability distribution function is given.

Proof. For any [...] is [...] Thus, the number of [...] Then, the probability distribution function $F(x)$ is [...] This concludes the proof.

Theorem 1. [...], $X_T$ are mutually independent and there exists $\varepsilon_t > 0$ such that [...] where [...]

Proof. Based on Lemma 1, we have [...] Then, [...] For the given [...] This completes the proof.
Note. When $\varepsilon_t \to 0$ for $t = 1, 2, \ldots, T$, then [...] Theorem 1 demonstrates the convergence of the I-KDE when the estimated p.d.f. $\hat{P}_t(x)$ converges to the true p.d.f. $P_t(x)$. In order to obtain an accurate p.d.f. estimation for equation (14), we use the multivariate fixed-point iteration to design the bandwidth optimization method based on the UCV error criterion. The formulation [12] of the UCV error criterion for $D$-dimensional [...] where [...] The optimal bandwidth for the estimated p.d.f. [...] It is very difficult to calculate the analytic solution of $h_d^{(t)}$, $d = 1, 2, \ldots, D$, from equation (31). However, an iterative function with respect to $h_d^{(t)}$ can be derived by simplifying equation (31): [...] Furthermore, a multivariate fixed-point iteration algorithm can be designed based on the aforementioned iterative function to determine the optimal bandwidth [...]

Figure 1: When the newly coming data $D_{T+1}$ arrive at the learning system, the I-KDE uses equation (14) to update the current p.d.f. rather than reestimate the p.d.f. for the combination of current data and newly coming data.
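To make the UCV-based bandwidth selection concrete, the following sketch evaluates the standard univariate UCV (least squares cross validation) criterion for a Gaussian kernel and minimizes it over the bandwidth. It is an illustration under stated assumptions, not the article's method: the multivariate fixed-point iteration is replaced here by a plain golden-section search over log(h), and the names `ucv` and `minimize_ucv`, the search interval, and the sample are hypothetical.

```python
import math
import random


def ucv(h, xs):
    """Univariate UCV criterion for a Gaussian kernel and bandwidth h.

    First term: integral of the squared estimate (Gaussian convolved
    with itself). Second term: leave-one-out estimate of the cross term.
    """
    n = len(xs)
    sq, loo = 0.0, 0.0
    for i in range(n):
        for j in range(n):
            u = (xs[i] - xs[j]) / h
            sq += math.exp(-0.25 * u * u) / math.sqrt(4.0 * math.pi)
            if i != j:
                loo += math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    return sq / (n * n * h) - 2.0 * loo / (n * (n - 1) * h)


def minimize_ucv(xs, lo=1e-2, hi=2.0, iters=40):
    """Golden-section search over log(h); a swapped-in, simple optimizer."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = math.log(lo), math.log(hi)
    c, d = b - g * (b - a), a + g * (b - a)
    fc, fd = ucv(math.exp(c), xs), ucv(math.exp(d), xs)
    for _ in range(iters):
        if fc < fd:  # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - g * (b - a)
            fc = ucv(math.exp(c), xs)
        else:  # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + g * (b - a)
            fd = ucv(math.exp(d), xs)
    return math.exp(0.5 * (a + b))


rng = random.Random(1)
xs = [rng.gauss(0.0, 1.0) for _ in range(150)]  # hypothetical sample
h_opt = minimize_ucv(xs)
print(h_opt, ucv(h_opt, xs))
```

Each evaluation of the criterion visits every pair of samples, which is the source of the quadratic cost in the number of samples. A golden-section search only finds a local minimum inside the chosen interval, so the interval should bracket a plausible bandwidth such as the rule-of-thumb value.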

Experimental Results and Analysis
In this section, we conduct three experiments to demonstrate the feasibility and effectiveness of our proposed I-KDE based on 10 univariate and 4 multivariate probability distributions, of which the details are listed in Table 1, where [...] is the mean vector, [...] quadrimodal normal probability distributions. For each distribution, 1000 samples are randomly generated in the Matlab programming environment. The experimental results are shown in Figure 3 and confirm that our designed fixed-point iteration algorithm can find the optimal bandwidths for univariate and multivariate UCV error criteria.
For the univariate probability distribution, e.g., the beta distribution in Figure 3(a), we can see that the UCV error first decreases and then increases as the bandwidth parameter increases. For either initial bandwidth value, 0.001 or 1, our designed fixed-point iteration algorithm can find the optimal bandwidth 0.048 for the I-KDE. For the multivariate probability distribution, e.g., the 2-dimensional quadrimodal normal distribution in Figure 3(d), we know that its UCV error has the minimum value −0.020. Algorithm 1 determines the optimal bandwidth vector (0.451, 0.575). The learning curves corresponding to the bandwidth optimization demonstrate the convergence of Algorithm 1.

Convergence of I-KDE.
The objective of this experiment is to check whether the estimation of the I-KDE, which is trained by means of data stream computation, can converge to the estimation of the KDE, which is trained on the combination of current data and newly coming data. The initial bandwidths in Algorithm 1 are determined based on the RoT method. The experimental results are presented in Figure 4.
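The comparison described above can be sketched in a few lines. The example assumes that the incremental update is a sample-size-weighted sum of per-block KDEs, and it uses a rule-of-thumb bandwidth for every block instead of the UCV-based fixed-point iteration; the class `IncrementalKDE`, the helper names, the block sizes, and the evaluation grid are illustrative assumptions.

```python
import math
import random
import statistics


def gaussian_kde(x, samples, h):
    """Univariate Gaussian KDE evaluated at x with bandwidth h."""
    c = 1.0 / (len(samples) * h * math.sqrt(2.0 * math.pi))
    return c * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)


def rot_bandwidth(samples):
    """Normal-scale rule of thumb; a stand-in for the UCV-based bandwidth."""
    return (4.0 / (3.0 * len(samples))) ** 0.2 * statistics.stdev(samples)


class IncrementalKDE:
    """Weighted sum of per-block KDEs; only the new block needs a bandwidth."""

    def __init__(self):
        self.blocks = []  # list of (samples, bandwidth) pairs
        self.total = 0    # number of samples seen so far

    def update(self, block):
        block = list(block)
        self.blocks.append((block, rot_bandwidth(block)))
        self.total += len(block)

    def pdf(self, x):
        return sum(len(b) / self.total * gaussian_kde(x, b, h)
                   for b, h in self.blocks)


rng = random.Random(0)
stream = [[rng.gauss(0.0, 1.0) for _ in range(200)] for _ in range(5)]

ikde = IncrementalKDE()
for block in stream:
    ikde.update(block)

# Reference: one KDE retrained on all the data pooled together.
pooled = [x for block in stream for x in block]
h_pooled = rot_bandwidth(pooled)
grid = [-4.0 + 0.1 * i for i in range(81)]
max_gap = max(abs(ikde.pdf(x) - gaussian_kde(x, pooled, h_pooled)) for x in grid)
print(round(max_gap, 4))
```

In this sketch the blocks themselves are still stored, since a KDE needs its samples at evaluation time; what the incremental scheme avoids is recomputing a bandwidth for the pooled data each time a block arrives.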
For the univariate case, we use 10 different probability distributions as shown in Table 1. For each distribution, 5000 samples are randomly generated. The parameter settings of these p.d.f.s are summarized as follows: $\alpha = 2$ and $\beta = 2$ for beta, $k = 2$ for chi-squared, $\mu = 1.5$ for exponential, $n_1 = 100$ and $n_2 = 100$ for F, $\alpha = 9$ and $\beta = 0.5$ for gamma, $\mu = 0$ and $\sigma = 0.25$ for lognormal, $\mu = 0$ and $\sigma = 1$ for normal, $\sigma = 1$ for Rayleigh, $v = 1$ for Student's t, and $\alpha = 1$ and $\beta = 5$ for Weibull. In Figure 4(a), we can see that the red and green curves almost coincide as new data arrive. It is very hard to distinguish the red and green curves. A similar situation exists in the multivariate case. For the multivariate case, 5 different probability distributions are selected. In Figure 4, the difference between the two estimates is measured with the Jensen-Shannon (JS) divergence [20]. The smaller the JS divergence is, the more similar the two p.d.f.s are. In Figure 4, we can see that the JS divergences are small, and the quantitative measures also reflect that the differences between the KDEs and I-KDEs are small.
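For readers who want to reproduce this kind of comparison, the sketch below computes the JS divergence between two densities evaluated on a common grid. The article does not say how its JS divergence is discretized or which logarithm base it uses, so the grid normalization, the natural logarithm, and the function names `kl` and `js_divergence` are assumptions.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_divergence(p, q):
    """JS divergence of two nonnegative vectors on the same grid.

    Both vectors are normalized to sum to one first. With the natural
    logarithm the result lies between 0 and log(2).
    """
    sp, sq = sum(p), sum(q)
    p = [v / sp for v in p]
    q = [v / sq for v in q]
    m = [0.5 * (a + b) for a, b in zip(p, q)]  # midpoint distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical density values of two estimators on a shared grid.
kde_values = [0.05, 0.20, 0.40, 0.25, 0.10]
ikde_values = [0.06, 0.19, 0.39, 0.26, 0.10]
print(js_divergence(kde_values, ikde_values))
```

Unlike the Kullback-Leibler divergence, this measure is symmetric and stays finite even when one estimate is zero where the other is not, which makes it convenient for comparing two estimated densities.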
We also test the convergence performance of the I-KDE on the testing data corresponding to the beta ($\alpha = 2$ and $\beta = 2$) and 2-dimensional trimodal normal ($\Theta^{(1)}$ [...]) distributions. We check the ratio of testing samples with the smaller absolute error to all testing samples. In Figure 5, we can see that the win ratios (i.e., the proportions of data points with smaller estimation errors) of the I-KDE and KDE are almost the same as the number of data blocks increases. This indicates that the I-KDE has an estimation capability equivalent to that of the KDE. Figure 5 also provides the comparison of training time. We can find that the training time of the I-KDE is far less than that of the KDE. Figure 6 shows the convergence tendency of the I-KDE on the 2-dimensional bimodal normal ($\Theta^{(1)} = (-5, -5)$, $\Theta^{(2)} = (5, 5)$, $\Sigma^{(1)} = \Sigma^{(2)} = [0.5, 0; 0, 0.5]$, and $\varepsilon_1 = \varepsilon_2 = 1/2$) distribution. 2000 training samples are randomly generated and partitioned into 20 data blocks. 20000 testing samples are selected with the incremental steps of (0.01, 0.01) in the space of [−9.9, 10] ∪ [0.1, 10]. In Figure 6, the black points are training samples and the red points are the testing samples estimated by the I-KDE with the smaller absolute error. We can see that the estimation performance of the I-KDE gradually converges to a stable state as the number of training samples increases.
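The training-time gap reported above is consistent with the quadratic cost of UCV stated in Section 3. The following sketch counts operations for the setting of 2000 two-dimensional training samples split into 20 equal blocks, assuming a cost of exactly N²D per bandwidth selection with constant factors ignored; the function names `ikde_cost` and `retrain_cost` are hypothetical.

```python
def ikde_cost(batch_sizes, dim):
    """Cost when a bandwidth is selected only for each new block."""
    return sum(n * n * dim for n in batch_sizes)


def retrain_cost(batch_sizes, dim):
    """Cost when a bandwidth is reselected on all data after each block."""
    total, seen = 0, 0
    for n in batch_sizes:
        seen += n                  # size of the pooled data so far
        total += seen * seen * dim
    return total


batches = [100] * 20  # 2000 samples in 20 blocks
print(ikde_cost(batches, 2))     # 400000
print(retrain_cost(batches, 2))  # 57400000
```

Under these assumptions the retraining scheme needs 143.5 times as many operations as the incremental one, and the ratio keeps growing as more blocks arrive because the pooled sample size enters quadratically.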

Estimation Performance of I-KDE.
Algorithm 1: A multivariate fixed-point iteration algorithm to determine the optimal bandwidth for the I-KDE. The iteration stops when the $L_1$-norm change of the bandwidth $H^{(t)}$ between successive iterations falls below the threshold $\xi$.

The experimental results are listed in Table 2 and confirm that the I-KDE is statistically convergent to the KDE. Taking the exponential distribution as an example, we use the Wilcoxon signed-ranks test [21] to check whether the difference between the I-KDE and KDE is significant. Table 3 lists the comparative results of the I-KDE and KDE on 10 different training datasets of the exponential distribution. Let $R^+$ be the sum of ranks for the training datasets on which the KDE outperforms the I-KDE and $R^-$ be the sum of ranks for the training datasets on which the I-KDE outperforms the KDE. We can calculate [...] We construct the statistic $z$ [...], which is distributed approximately normally. The null hypothesis that the KDE and I-KDE perform equally well can be rejected if $z$ is smaller than −1.96 at the significance level of 0.05. Because $z > -1.96$, we accept the null hypothesis, i.e., the difference between the KDE and I-KDE is not significant. A similar result exists for the other distributions in Table 2. To sum up, we demonstrate that the I-KDE can obtain p.d.f. estimation performance equivalent to that of the KDE with less training time. The experimental results reveal that it is feasible to use the I-KDE to deal with the p.d.f. estimation problem of large-scale data in the way of data stream computation.
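The test procedure described above can be sketched as follows. The example assumes the usual normal approximation of the Wilcoxon signed-ranks statistic, z = (T − N(N+1)/4) / sqrt(N(N+1)(2N+1)/24) with T = min(R+, R−), average ranks for ties, and zero differences split evenly between the two sums. The function name `wilcoxon_z` and the error values are made up for illustration and are not the values of Table 3.

```python
import math


def wilcoxon_z(errors_ikde, errors_kde):
    """Return (R+, R-, z) for paired estimation errors.

    R+ sums the ranks of datasets where the KDE has the smaller error,
    R- those where the I-KDE has the smaller error.
    """
    diffs = [a - b for a, b in zip(errors_ikde, errors_kde)]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:  # assign average ranks to tied absolute differences
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = 0.5 * (i + j) + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    zero_half = 0.5 * sum(r for r, d in zip(ranks, diffs) if d == 0)
    r_plus = sum(r for r, d in zip(ranks, diffs) if d > 0) + zero_half
    r_minus = sum(r for r, d in zip(ranks, diffs) if d < 0) + zero_half
    t = min(r_plus, r_minus)
    z = (t - n * (n + 1) / 4.0) / math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    return r_plus, r_minus, z


# Hypothetical errors on 10 training datasets.
errors_kde = [0.031, 0.027, 0.040, 0.022, 0.035, 0.029, 0.033, 0.026, 0.038, 0.030]
errors_ikde = [0.032, 0.025, 0.043, 0.021, 0.034, 0.033, 0.031, 0.029, 0.037, 0.030]
r_plus, r_minus, z = wilcoxon_z(errors_ikde, errors_kde)
print(r_plus, r_minus, z)
print("reject" if z < -1.96 else "do not reject")
```

With only 10 datasets the normal approximation is rough, so exact critical values of the signed-ranks statistic would be the more careful choice at that sample size.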

Conclusions and Future Works
In this article, we proposed an incremental kernel density estimator (I-KDE) to deal with the probability density function (p.d.f.) estimation problem in the way of data stream computation. The I-KDE updates the current KDE dynamically and gradually with the newly coming data rather than retraining a brand-new KDE on the combination of current data and newly coming data. The theoretical analysis proved the convergence of the I-KDE provided that the estimated p.d.f. of the newly coming data converges to its true p.d.f. The experimental results on 10 univariate and 4 multivariate probability distributions demonstrated that the I-KDE obtained p.d.f. estimation performance equivalent to that of the KDE with less training time and thus indicated that the I-KDE can be used to deal with the p.d.f. estimation problem of large-scale data in the way of data stream computation. In the future, we will try to combine the I-KDE with the random sample partition (RSP) model [22,23] of big data and seek practical applications for the I-KDE, e.g., Bayesian classification, density-based clustering, and big data reduction [24].

Data Availability
The data used in our manuscript can be accessed by readers via our BaiduPan at https://pan.baidu.com/s/1wgj-gTzEZL51WTtl2RpCzA with the extraction code "kai5".

Conflicts of Interest
The authors declare that they have no conflicts of interest.