A Possible World-Based Fusion Estimation Model for Uncertain Data Clustering in WBNs

In data clustering, the measured data are usually regarded as uncertain data. As a probability-based clustering technique, the possible world model can cluster uncertain data naturally. However, the possible world method must satisfy two conditions: the data of the different possible worlds must be determined, and the corresponding probability of occurrence of each world must be determined. Existing methods mostly make multiple measurements and treat each measurement as the deterministic data of one possible world. In this paper, a possible world-based fusion estimation model is proposed, which converts the deterministic data into a probability distribution according to an estimation algorithm, so that the corresponding probability is confirmed naturally. Further, in the clustering stage, the Kullback–Leibler divergence is introduced to describe the relationships of probability distributions among different possible worlds. Then, an application in wearable body networks (WBNs) is given, and some interesting conclusions are drawn. Finally, simulations show better performance when the relationships between features in the measured data are more complex.


Introduction
Clustering is a machine learning technique that puts similar objects into the same cluster. Clustering techniques play an important role in many areas, such as health care and action recognition in the medical domain [1,2], behavior surveillance and battlefield prediction in the military field [3,4], resource and information management in the communications field [5,6], and so on. Many clustering methods have been proposed; according to the clustering principle, they can be divided into three principal types: distance-based, density-based, and connectivity-based [7,8].
Most clustering methods focus on deterministic data. Unfortunately, almost all clustering data are collected by the corresponding equipment, which entails measuring errors. In this case, the uncertain data can describe the measurement data better. For acquiring better and more appropriate results, the fusion estimation methods such as the Bayes-based [9], Kalman-based [10], or artificial intelligence-based [11,12] methods are commonly used to estimate the measurements.
Fusion estimation is a technology that uses the computing power of the data acquisition equipment to de-noise the measurement data and remove redundancy from it according to certain rules. It focuses on mining data information and designing the corresponding estimation algorithms.

Related Works
In this section, the processing technologies of uncertain data are introduced in detail. The collected data that come from acquisition equipment contain noise, which means the collected data contain great uncertainty. Therefore, it is necessary to perform fusion estimation processing on the data first, and use the rules and redundancy of the data itself to improve the data accuracy and reduce the uncertainty of the data.
Commonly used fusion estimation algorithms include the Bayes filter (BF) [17], Kalman filter (KF) [18], extended Kalman filter (EKF) [19], unscented Kalman filter (UKF) [20], and particle filter (PF) [21]. Among them, BF and KF are estimators for linear systems: BF can theoretically estimate data with an arbitrary noise distribution, and KF is the special case of BF in which the noise is Gaussian white noise. EKF, UKF, and PF are estimators for nonlinear systems: EKF is suitable for weakly nonlinear systems; UKF is suitable for strongly nonlinear systems but has high computational complexity; and PF samples the conditional probability density directly, where the probability density is often approximated with EKF or UKF. The estimation precision of PF is higher than that of EKF or UKF used alone, but its computational cost is much higher than that of EKF and UKF.
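As a concrete illustration of a fusion estimation step, the following is a minimal scalar Kalman filter sketch in Python. The random-walk state model and the noise parameters are illustrative assumptions, not the paper's exact filter:

```python
import numpy as np

def kalman_filter_1d(measurements, q=1e-4, r=0.25, x0=0.0, p0=1.0):
    """Minimal scalar Kalman filter for a random-walk state.

    q: process-noise variance, r: measurement-noise variance.
    Returns the filtered estimates and their error variances.
    """
    x, p = x0, p0
    estimates, variances = [], []
    for z in measurements:
        # Predict: random-walk model x_t = x_{t-1} + w_t
        p = p + q
        # Update with measurement z_t = x_t + v_t
        k = p / (p + r)          # Kalman gain
        x = x + k * (z - x)
        p = (1.0 - k) * p
        estimates.append(x)
        variances.append(p)
    return np.array(estimates), np.array(variances)

# Noisy measurements of a constant true value 1.0
rng = np.random.default_rng(0)
zs = 1.0 + 0.5 * rng.standard_normal(200)
est, var = kalman_filter_1d(zs)
```

Note that the filter outputs both an estimate and an error variance, which is exactly the mean/covariance pair the proposed model later treats as a probability distribution.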
In [22], the authors argued that two possible world-based clustering algorithms suffered from the following issues: (1) they dealt with each possible world independently and ignored the consistency principle across different possible worlds; (2) they required an extra post-processing procedure to obtain the results, which meant that their effectiveness was highly dependent on the post-processing method, and their efficiency was also poor. To solve these problems, Liu et al. [22] proposed the possible world-based consistency learning model for clustering uncertain data (PWCLU), which considers the consistency principle during the clustering/classification procedure, holding that the clustering results in each possible world are consistent, and thus achieves satisfactory performance. In their setting, several types of equipment were used to collect the same data; each piece of data from one piece of equipment was considered to belong to one possible world, and every possible world was assigned an equal probability. However, the authors only gave an algorithm for dealing with finitely many possible worlds.
On the other hand, clustering algorithms usually require a method to describe the distance between two datasets. In uncertain data, the distance can be expressed as a probability distribution in most cases. Therefore, a method of describing the distance between probability distributions is required. Sinkkonen and Kaski [23] studied the problem of learning groups or categories that were local in the continuous primary space but homogeneous according to the distributions of an associated auxiliary random variable over a discrete auxiliary space. In their model, Kullback-Leibler divergence was used to calculate the distance between two probability distributions.
In this paper, a possible world-based fusion estimation model (PWFEM) is proposed for clustering uncertain data. The proposed model removes the assumption of the consistency principle of [22]. Moreover, two PWFEM-based methods are given. One generalizes PWCLU to continuous possible worlds and is based on numerical distance; it is therefore called PWFEM-nd. The other is based on probability distribution distance and is named PWFEM-pd. Then, an application in WBNs is discussed. Two specific distance functions, corresponding to the numerical distance and the probability distribution distance, respectively, are introduced to prove that PWFEM-nd is equivalent to PWFEM-pd under certain circumstances. Finally, the simulations are discussed; they show good performance of the models.

Preliminaries
In this section, some necessary definitions and assumptions are given for possible world and Kullback-Leibler divergence; the assumptions of independence for each component of the datasets and the structure of the data are also given.

Definition of Possible World
Let O = {O_1, O_2, ..., O_n} ∈ R^{N×n} be an uncertain dataset, where each O_i is not deterministic data but a probability distribution. If O is a discrete probability distribution, a possible world pw is one realization of the uncertain data O; it is deterministic data occurring with probability P(pw). If O is a continuous probability distribution, O can be described by a probability density function f(pw), where pw is the value of the random variable O. Then, ∫_D f(pw) dpw = 1, where D is the support of O.
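For intuition, the discrete case can be sketched directly: under an independence assumption across components, every combination of component outcomes is one possible world, and its probability is the product of the component probabilities. The numbers below are a hypothetical toy example:

```python
from itertools import product

# Each uncertain object O_i is a discrete distribution: value -> probability.
O1 = {0.9: 0.7, 1.4: 0.3}
O2 = {2.0: 0.5, 2.5: 0.5}

# Every combination of outcomes is one possible world pw, and P(pw) is the
# product of the component probabilities (independence assumed).
worlds = []
for combo in product(O1.items(), O2.items()):
    values = tuple(v for v, _ in combo)
    prob = 1.0
    for _, p in combo:
        prob *= p
    worlds.append((values, prob))

total = sum(p for _, p in worlds)  # probabilities over all worlds sum to 1
```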

Definition of Kullback-Leibler Divergence
Let p(x) and q(x) be two distributions of a random variable X. The Kullback-Leibler divergence of p(x) from q(x) is:

D_KL(p‖q) = ∫ p(x) log (p(x)/q(x)) dx.

Some Assumptions
Assumption 1. Almost all possible worlds exhibit the same class labels and cluster structures; possible worlds with different class labels and cluster structures occur only with small probabilities.

Assumption 2. In Section 5, it is assumed that for all x_i, x_j ∈ X, x_i + x_j also follows a Gaussian distribution.

Assumption 3. In Section 5, it is assumed that the wearable nodes keep a stable state while collecting the data, so the covariance matrix does not change.

Possible World-Based Fusion Estimation Model (PWFEM)
In this section, the details of the PWFEM are introduced in three parts. The first part introduces data fusion estimation. The second part introduces the calculation process of the distribution distance. The third part introduces the clustering method based on the possible world.

Data Fusion Estimation
The collected data can be divided into two types: filterable data and high-accuracy data. Without loss of generality, it is assumed that the measurement data at time t are M_t = {M_t^p, M_t^a}, where M_t^p = {z_1^p, z_2^p, ..., z_{r_t}^p} are the filterable data and M_t^a = {z_1^a, z_2^a, ..., z_{s_t}^a} are the high-accuracy data.
Corresponding to the possible world, the filterable data are probabilistic data, while the high-accuracy data are numeric data. It is assumed that the clustering data in a possible world at time t have the format X_t = {X_t^p, X_t^n}, where X_t^p = {x_1^p, x_2^p, ..., x_{r_t}^p} are the probability data and X_t^n = {x_1^n, x_2^n, ..., x_{s_t}^n} are the numeric data.
In most cases, the filterable data can be obtained according to the Kalman-based filter. The high accuracy data can be converted to filterable data by the Gaussian distribution, whose expectation is zero and whose variance is small. The details are as follows.
The measurement data are first converted to clustering data as follows. If the filterable data satisfy a state function and a measurement function of the standard state-space form, an appropriate filter algorithm can be used to solve them. If the filtered result is X̂_t^p, the probability data can be written as X_t^p = X̂_t^p + ω_{t/t-1}, where ω_{t/t-1} is the filter estimation error. Similarly, the numeric data can be written as X_t^n = M_t^a + ω_t^a, where ω_t^a follows a Gaussian distribution with zero mean and small variance.
Then, we have X_t = {X_t^p, X_t^n} and Ω_t = {ω_{t/t-1}, ω_t^a}. Moreover, we let Ω_t = {ω^p, ω^a}, which is a time-invariant Gaussian distribution. Therefore, according to Assumption 2, X_t can be written as a multivariate Gaussian distribution. Based on the above, the structure of the clustering data can be confirmed. Next, the distance-based functions need to be confirmed.
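The construction above can be sketched as follows. The concrete numbers (filter estimate, error covariance, and the small variance eps for the high-accuracy component) are illustrative assumptions:

```python
import numpy as np

# Hypothetical outputs at time t: the filter gives an estimate and an error
# covariance for the filterable components; the high-accuracy components are
# plain numbers.
x_p_hat = np.array([1.2, 0.4])            # filtered estimate (mean of X_t^p)
P_p = np.array([[0.04, 0.00],
                [0.00, 0.09]])            # filter error covariance
x_a = np.array([5.0])                     # high-accuracy measurement
eps = 1e-6                                # small variance for "exact" data

# Stack the means and build a block-diagonal covariance, so the whole
# clustering datum X_t is one multivariate Gaussian (cf. Assumption 2).
mu = np.concatenate([x_p_hat, x_a])
cov = np.zeros((3, 3))
cov[:2, :2] = P_p
cov[2, 2] = eps
```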

Distance Calculation Method Based on KL Divergence
Almost all clustering algorithms need to calculate distances. In the PWFEM, there are two types of data: filterable and high-accuracy. For the high-accuracy data, the Euclidean distance can be used, while the KL divergence can be used to process the filterable data. In this section, the distance calculation method based on KL divergence is introduced in detail.
KL divergence measures the degree of difference between two distributions from the perspective of information entropy. Assume that p(x) and q(x) are two distributions of a random variable X. Then, the KL divergence is:

D_KL(p‖q) = ∫ p(x) log (p(x)/q(x)) dx.

In the discrete case, the calculation formula is:

D_KL(p‖q) = ∑_x p(x) log (p(x)/q(x)).

Assume that the probability distributions are Gaussian, P ∼ N(µ_1, Σ_1) and Q ∼ N(µ_2, Σ_2), and that the dimension of the data is n. Plugging P ∼ N(µ_1, Σ_1) and Q ∼ N(µ_2, Σ_2) into the definition gives:

D_KL(P‖Q) = (1/2) [ log (|Σ_2|/|Σ_1|) − n + tr(Σ_2^{−1} Σ_1) + (µ_2 − µ_1)^T Σ_2^{−1} (µ_2 − µ_1) ].

Moreover, if Σ_1 = Σ_2 = Σ, we get:

D_KL(P‖Q) = (1/2) (µ_2 − µ_1)^T Σ^{−1} (µ_2 − µ_1).

In this way, the distance between two probability distributions is obtained. Then, the clustering method based on the possible world can be used.
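The Gaussian formula above can be checked numerically; a small sketch:

```python
import numpy as np

def kl_gaussian(mu1, S1, mu2, S2):
    """KL divergence D(P||Q) between multivariate Gaussians
    P = N(mu1, S1) and Q = N(mu2, S2)."""
    n = mu1.shape[0]
    S2_inv = np.linalg.inv(S2)
    diff = mu2 - mu1
    return 0.5 * (np.log(np.linalg.det(S2) / np.linalg.det(S1)) - n
                  + np.trace(S2_inv @ S1) + diff @ S2_inv @ diff)

mu1 = np.array([0.0, 0.0])
mu2 = np.array([1.0, 0.0])
S = np.eye(2)
# Equal covariances: the formula collapses to
# (1/2)(mu2 - mu1)^T S^{-1} (mu2 - mu1) = 0.5
d = kl_gaussian(mu1, S, mu2, S)
```

With equal covariances the result reduces to the quadratic term alone, matching the last formula above.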

The Clustering Method Based on the Possible World
In [22], the authors used an adaptive, local-structure learning method to calculate the consensus affinity matrix. In their model, the collected numerical data are used to fit the probability density function (PDF) of the uncertain data. However, the authors gave no algorithm for the case where the PDF is given directly. Moreover, their method needs a sizable quantity of data. In this paper, Assumption 1 is proposed instead of the consistency principle.
According to Assumption 1 above, the probability of each possible world should be considered when calculating the consensus affinity matrix, so the objective function weights the distances d_ij in each possible world pw by its probability P(pw). According to the conclusion of [22], the parameter t can be adjusted through α, and in the optimization result the distances D_i^{pw} are taken in ascending order, from small to large.
According to the formulas above, extra information about the classes is required to confirm t; if there is no such information, it is set as t = N. Finally, an optimized normalized distance matrix S* is needed for clustering the training set, which satisfies the following optimal model, where S_i = [s_1i, s_2i, ..., s_ni]^T and S = [S_1, S_2, ..., S_n]^T. According to the objective function (20) and Equation (19), and using the properties of expectation and variance, Equation (23) can be reduced so that (7) is equivalent to a simpler optimal model, whose optimal solution can be obtained easily. Now, another understanding of a possible world is presented. Recall the definition of the possible world: the construction of an uncertain dataset and its PDF f(pw) are known. If the dimensions of O_i (i = 1, 2, ..., n) are finite, assumed to be {o_ij}_{j=1}^n, the edge probability density function (EPDF) for the jth dimension of O_i (i = 1, 2, ..., n) is obtained by marginalizing f(pw) over the remaining dimensions. Here, it is assumed that distance(O_i, O_j) is the distance between the random variables O_i and O_j; the consensus affinity matrix S can then be obtained from these distances. According to the analysis above, if there is no extra information about the classes, the optimal solution for the objective function (15) follows. Compared with (12), the distribution is used instead of the expectation of point distances. Therefore, (12) is appropriate for possible worlds that include fewer and simpler random variables, while (16) is, in theory, appropriate for possible worlds with complex random variables.
So far, once the distance-based function is confirmed, the optimized consensus affinity matrix S for all possible worlds can be worked out.
According to the calculations above, the closer two data objects are, the larger s_ij is. Therefore, the value of s_ij may be useless when s_ij < p (a distance threshold), so the matrix S may need to be pruned to remove the meaningless entries. The pruning is divided into two steps: removal and normalization. In the removal step, the meaningless values are replaced by 0. In the normalization step, the meaningful values are recalculated so that the normalization constraint on S still holds. Algorithm 1 shows the pruning procedure. Moreover, in spectral analysis, given a nonnegative affinity matrix S, the corresponding Laplacian matrix L_s can be calculated as L_s = D_s − (S^T + S)/2, where D_s is a diagonal matrix whose ith diagonal element is ∑_{j=1}^n (s_ij + s_ji)/2. The Laplacian matrix L_s has the following important property [24].

Theorem 1. Let S be a nonnegative affinity matrix; then, the multiplicity k of the eigenvalue 0 of the Laplacian matrix L_s equals the number of connected components in the graph associated with the affinity matrix S.
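Theorem 1 can be verified numerically on a toy affinity matrix with two disconnected blocks (the matrix below is made up for illustration):

```python
import numpy as np

# Affinity matrix with two disconnected blocks (two connected components).
S = np.array([[0.0, 0.9, 0.0, 0.0],
              [0.9, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.8],
              [0.0, 0.0, 0.8, 0.0]])

W = (S + S.T) / 2.0                  # symmetrize: (S^T + S)/2
D = np.diag(W.sum(axis=1))           # degree matrix D_s
L = D - W                            # Laplacian L_s = D_s - (S^T + S)/2

eigvals = np.sort(np.linalg.eigvalsh(L))
k = int(np.sum(eigvals < 1e-10))     # multiplicity of eigenvalue 0
# Theorem 1: k equals the number of connected components (here, 2).
```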
It is assumed that the eigenvalues {σ_i}_{i=1}^n of the Laplacian matrix L_s are ordered from small to large. According to the properties of the Laplacian matrix L_s, we have the following conclusion: if the number of clusters k is unknown, a threshold Th is set to decide k, taking k as the number of eigenvalues smaller than Th. Finally, the eigenvectors of the eigenvalues σ_1 to σ_k form the matrix U ∈ R^{n×k}, and the k-means clustering algorithm is used to cluster the rows of U. The clustering result is that of the training set. Algorithm 2 for processing S is shown as follows.

Algorithm 2 for processing S:
Input: the matrix S ∈ R^{n×n} and the clustering threshold Th.
Processing: compute the Laplacian matrix L_s and its eigenvalues, determine k, form the matrix U from the eigenvectors of σ_1 to σ_k, and cluster the rows of U with the k-means method. These are also the clustering results for the training set. Therefore, the clusters {C_i}_{i=1}^k and the numbers of cluster members {n_i}_{i=1}^k are obtained.

Updating
After clustering the training set, the data in the test set should be assigned to the clusters determined above. The test set is given as O^{test} = {O_i}, where O_i = [o_1i, o_2i, ..., o_ni]^T is a data object. The clustering updating algorithm for the test set is divided into two steps, clustering and updating, whose details are shown in Algorithm 3.

Simulations
In this section, comparisons with three state-of-the-art uncertain data clustering algorithms are conducted on real benchmark datasets. Moreover, an uncertain dataset that obeys the multivariate Gaussian distribution is generated, and the parameters in the PWFEM model are discussed.

Table 1 lists the datasets used, together with their numbers of objects, attributes, and classes.
These datasets were originally established as collections of data with determinate values. We then followed the method in [27] to generate uncertainty in these datasets; the generation method is shown in Algorithm 4.
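The exact generation method of [27] is not reproduced here; the following is only an assumed stand-in that perturbs each attribute with zero-mean Gaussian noise scaled to its range:

```python
import numpy as np

def make_uncertain(X, noise_level=0.1, seed=0):
    """Perturb a deterministic dataset X (n x d) with zero-mean Gaussian
    noise whose standard deviation is noise_level times each attribute's
    range. This is an assumed stand-in, not the procedure of [27]."""
    rng = np.random.default_rng(seed)
    ranges = X.max(axis=0) - X.min(axis=0)
    noise = rng.standard_normal(X.shape) * (noise_level * ranges)
    return X + noise

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
Xu = make_uncertain(X)
```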

The Clustering Accuracy
In this part, two widely used evaluation metrics, accuracy (ACC) and normalized mutual information (NMI), are adopted to compare the different clustering algorithms. The proposed clustering algorithms, PWFEM-nd and PWFEM-pd, are compared with three state-of-the-art uncertain data clustering algorithms: UK-means [26], REP [27], and PWCLU [22]. Each clustering algorithm was run 100 times, and the maximum, minimum, mean, and variance of the ACC were calculated for each algorithm. The comparisons were simulated for two cases: in case 1, the true mean value and variance are known; in case 2, only finite measurement results that obey the given PDF are available.
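The ACC metric can be sketched as follows. Since predicted cluster ids are only meaningful up to relabeling, ACC is the best accuracy over all mappings of cluster ids to class labels. This is the generic definition (a brute-force version for a small number of clusters), not code from the compared papers:

```python
from itertools import permutations
import numpy as np

def clustering_acc(y_true, y_pred):
    """Clustering accuracy: best accuracy over all one-to-one mappings of
    predicted cluster ids to true class labels (brute force over
    permutations; fine for a small number of clusters)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    true_labels = np.unique(y_true)
    pred_labels = np.unique(y_pred)
    best = 0.0
    for perm in permutations(true_labels, len(pred_labels)):
        mapping = dict(zip(pred_labels, perm))
        mapped = np.array([mapping[p] for p in y_pred])
        best = max(best, float(np.mean(mapped == y_true)))
    return best

# Perfect up to relabeling -> ACC is 1.0
acc = clustering_acc([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2])
```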
In order for the proposed model to be executed properly, the exact values of the expectation and covariance need to be known. However, the datasets used in this simulation do not give those values, so approximate values were calculated instead as E = X̄ and Cov = Cov(X), where X = {x_i}_{i=1}^n is the dataset, X̄ is its sample mean, and Cov(X) is its sample covariance. The comparisons of ACC for each algorithm in case 1 are shown in Table 2. As shown in Table 2, on the wine and glass datasets, PWFEM-nd shows the best performance in terms of the maximum, minimum, and mean values. Unfortunately, it shows the worst performance for those values on the iris, Ecoli, and PhishingData datasets. As for the proposed PWFEM-pd, it shows the best maximum values on all datasets except wine and glass.
According to their respective algorithms, there may be many reasons for the results above. Some likely explanations are presented next.
Firstly, it is important to note that the UK-means, REP, PWCLU, and PWFEM-nd use the mean value only. Therefore, their variance values are zeros, which means the clustering results never change throughout the 100 iterations. Only PWFEM-pd uses the variance of uncertain data.
Secondly, UK-means clusters the dataset directly, while REP, PWCLU, PWFEM-nd, and PWFEM-pd cluster the dataset indirectly, using the model based on the possible world. Moreover, PWCLU uses the Euclidean distance (‖·‖_2), PWFEM-nd uses the cosine similarity, and PWFEM-pd uses the Kullback-Leibler divergence. Compared with PWCLU, PWFEM-nd combines the distributions of each component in a datum. Moreover, PWFEM-pd calculates the distance between distributions directly, while PWCLU and PWFEM-nd reduce the distributions to summary statistics (mean value and variance). Therefore, the clustering accuracy of PWFEM-pd may be higher than that of PWCLU in most cases. PWFEM-pd can also be regarded as clustering with different, randomly obtained covariances: if a covariance close to the true covariance is acquired, a high clustering accuracy is attained.
For a clearer view of the changing of clustering accuracy with different covariances, see Figure 1.
As shown in Figure 1, the ACC of PWFEM-pd is sensitive to the covariance of the uncertain data. On the other hand, the impacts caused by the covariances of different datasets lead to different results. In Figure 1a,c,d,f, the ACC is highly dependent on the covariance. In Figure 1e, the ACC is divided into two parts, one around 0.51 and the other around 0.34, when different covariances are given. Moreover, in Figure 1b, the ACC is stable around 0.5 for most covariances.
According to the analysis above, only for the proposed models, PWFEM-nd and PWFEM-pd, is the ACC sensitive to the covariance. Then, changes of the mean values are added to the simulations. Therefore, the simulation results are given for case 2, which uses the generation method proposed at the beginning of this section; the results of case 2 are shown in Table 3 and Figure 2. As shown in Table 3, when comparing the maximum and minimum values, the clustering results of all clustering methods change, which means all the clustering methods are sensitive to the mean value. Moreover, the sensitivity of each clustering method varies. The fluctuation ranges of all clustering methods are most drastic on iris and glass. On the other hand, the clustering accuracy of the PWFEM-pd algorithm is always higher than that of PWFEM-nd, but its stability is lower. Besides, compared with Table 2, the NMI values are lower than the ACC values for the same dataset, which means that in the clustering results of the model the accuracy of each class is inconsistent: some classes have high precision and some have low precision.
For a clearer view of the changing of clustering accuracy with different covariances and mean values, see Figure 2.
As shown in Figure 2, PWFEM-pd has a fluctuation similar to that shown in Figure 1. Unfortunately, this clustering method is sensitive to both the mean value and the covariance, so it is hard to distinguish the main cause. Next, the remaining four clustering methods are discussed.
Firstly, similar to the conclusion from Table 2, all clustering methods show drastic fluctuations in Figure 2c,d. UK-means and CK-means show drastic fluctuations in Figure 2a and are stable in Figure 2b,e,f. PWCLU is stable in Figure 2a,b,e,f. PWFEM-nd is stable in Figure 2a,b,f, while it is stable at two levels in Figure 2e.
According to the analysis above, the behavior of the proposed methods is clearer. However, the variation tendency with respect to the mean value and covariance is not yet clear. Therefore, a specific dataset was generated to investigate these issues.
Figure 2. NMI with different clustering algorithms for 100 iterations in case 2.

The Simulation with a Specific Dataset
In this part, a specific dataset is generated to analyze the impacts of the mean value and covariance. The generated dataset consists of two dimensions, and the number of data points was set at 1000. It was divided into three clusters, whose centers were [0, 0], [100, 0], and [0, 100]. The distance between a datum and its center was randomly distributed in [0, r]. The variance for each dimension was σ_i (i = 1, 2), and it was set as σ_1 = σ_2 = σ. The correlation coefficient of the two dimensions was ρ. Therefore, the covariance of this dataset was:

[ σ   ρσ ]
[ ρσ  σ  ]

Next, the parameters r, σ, and ρ are discussed. In this simulation, σ = 2, ρ = 0, and r ranged from 1 to 100. As shown in Figure 3, the ACCs of all methods were 1 up to about r = 50 and then decreased with increasing r. These simulation results are in accordance with common sense.

On the other hand, if ρ = 0 and r is fixed, σ scales the distances between the data evenly. Therefore, it cannot affect the clustering results, and the simulation proves it: the ACC curves of all methods are lines parallel to the X-axis, so the figure was omitted.

Finally, the simulation for ρ is discussed with σ = 2, r = 20, 40, 60, and 80, and −1 < ρ < 1. The simulation results are shown in Figure 4. As shown in Figure 4, when r < 50, the ACCs are stable for all methods over −1 < ρ < 1. This is because the cluster structure is prominent in this condition, so the effect of ρ on the clustering result is weak. Moreover, when r > 50, the ACCs of UK-means, CK-means, PWCLU, and PWFEM-nd show significant changes in (−1, −0.7) and (0.7, 1). In these two intervals, ρ makes the data points even messier; therefore, the ACCs of the clustering results decrease if the data points are not processed. On the other hand, the ACCs become stable when −0.7 < ρ < 0.7; in this range, the effect of ρ on the clustering results is weak.
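The specific dataset above can be generated with a short sketch. The sampling of the offset within radius r (uniform radius and angle) is an assumption; the paper only states that the distance is randomly distributed in [0, r]:

```python
import numpy as np

def make_dataset(n=1000, r=20.0, sigma=2.0, rho=0.0, seed=0):
    """Three-cluster 2-D dataset: centers at (0,0), (100,0), (0,100);
    each point within distance r of its center; measurement covariance
    [[sigma, rho*sigma], [rho*sigma, sigma]]."""
    rng = np.random.default_rng(seed)
    centers = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 100.0]])
    labels = rng.integers(0, 3, size=n)
    # Uniform radius in [0, r] and uniform angle (an assumed sampling scheme).
    radius = rng.uniform(0.0, r, size=n)
    angle = rng.uniform(0.0, 2.0 * np.pi, size=n)
    offsets = np.stack([radius * np.cos(angle), radius * np.sin(angle)], axis=1)
    X = centers[labels] + offsets
    cov = np.array([[sigma, rho * sigma], [rho * sigma, sigma]])
    return X, labels, cov

X, y, cov = make_dataset()
```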

Conclusions
In this paper, a possible world-based fusion estimation model for uncertain data is proposed. It includes two methods, PWFEM-nd and PWFEM-pd. PWFEM-nd takes a data perspective and uses a bottom-up method to cluster the data, while PWFEM-pd clusters according to the uncertain data directly. Both methods rely heavily on the probability density distribution of the uncertain data. We performed simulations and confirmed that the proposed methods show better clustering performance, and that the accuracy is highly dependent on the accuracy of the covariance.
The discussion in the last section is incomplete. Obviously, the problem becomes more complex as the dimension increases, and only some simple conclusions are given in the simulation. In addition, the exact covariance cannot usually be obtained in real scenarios. In any case, the proposed methods provide a new way to treat uncertain data clustering, and the issues mentioned above will be addressed in future works.