Graph Embedding with Data Uncertainty

Spectral-based subspace learning is a common data preprocessing step in many machine learning pipelines. The main aim is to learn a meaningful low-dimensional embedding of the data. However, most subspace learning methods do not take into consideration possible measurement inaccuracies or artifacts that can lead to data with high uncertainty. Thus, learning directly from raw data can be misleading and can negatively impact the accuracy. In this paper, we propose to model artifacts in training data using probability distributions; each data point is represented by a Gaussian distribution centered at the original data point and having a variance modeling its uncertainty. We reformulate the Graph Embedding framework to make it suitable for learning from distributions, and we study as special cases the Linear Discriminant Analysis and the Marginal Fisher Analysis techniques. Furthermore, we propose two schemes for modeling data uncertainty based on pair-wise distances, one for the unsupervised and one for the supervised context.


Introduction
With the advancement of data collection processes, high dimensional data are available for applying machine learning approaches. However, the impracticability of working in high dimensional spaces due to the curse of dimensionality, together with the realization that the data in many problems reside on manifolds of much lower dimension than the original space, has led to the development of spectral-based subspace learning (SL) techniques. Spectral-based methods rely on the eigenanalysis of scatter matrices. SL aims at determining a mapping of the original high-dimensional space into a lower-dimensional space that preserves properties of interest in the input data. This mapping can be obtained using unsupervised methods, such as Principal Component Analysis (PCA) [1,2], or supervised ones, such as Linear Discriminant Analysis (LDA) [3] and Marginal Fisher Analysis (MFA) [4]. Despite the different motivations of these spectral-based methods, a general formulation known as Graph Embedding was introduced in [4] to unify them within a common framework.
For low-dimensional data, where dimensionality reduction is not needed and classification algorithms can be applied directly, many extensions modeling input data inaccuracies have recently been proposed [5,6]. In [6], data points are replaced by probability distributions modeling the artifacts, and an SVM classifier was extended to operate on data distributions. However, for high dimensional data, where dimensionality reduction is needed, traditional methods, such as LDA and MFA, do not take into consideration that the provided data can be exposed to measurement inaccuracies or artifacts. Thus, learning directly from data can lead to a biased or erroneous embedding of the high dimensional data [7,8,5,6]. Extensions of some SL methods taking into account the presence of outliers and noise in the data were proposed to account for this problem, such as the methods in [9,10] for LDA, and the method in [11] for PCA.
In this paper, we propose a novel spectral-based subspace learning framework, called Graph Embedding with Data Uncertainty (GEU), in which input data uncertainties are taken into consideration. Instead of relying on the training data directly, we model each data point by a multivariate Gaussian distribution centered at the position of the original measurement and having a covariance matrix accounting for its uncertainty. To this end, we reformulate the Graph Embedding framework to operate on distributions at the individual data point level, allowing us to determine a mapping from the input data space into a lower-dimensional space via optimizing properties of interest defined over these distributions. The outcome is a more robust data embedding scheme. As special cases of the proposed framework, we investigate extensions of the LDA and MFA techniques, which we refer to as GEU-LDA and GEU-MFA, respectively. An example of the decision boundaries obtained by using the original MFA, MFA with augmented data, and GEU-MFA on 2-D synthetic data forming two classes is illustrated in Figure 1. The incorporation of data uncertainty shifts the decision boundary of the original approach. We note that by using more augmented data the decision boundary of MFA shifts toward that of GEU-MFA.
Furthermore, we theoretically show that under the proposed GEU framework, the rank of matrices involved in the optimization problem, i.e., the scatter matrices, increases compared to the original methods. As a result, methods formulated under the proposed framework lead to an increased number of projection directions. This is because the covariances employed to model the uncertainty at the level of the individual data point introduce a regularization term to both scatter matrices. Thus, an indirect advantage of formulating traditional SL methods, such as LDA, under the proposed framework is that it allows for addressing the small sample size problem [12], even for problems formed by two classes.
Although the focus in this paper is on LDA and MFA, the proposed GEU framework, operating on generic graph structures, can directly be used to obtain robust solutions for other SL methods formulated under the Graph Embedding framework. The contributions of the paper are as follows:
• We propose a novel spectral-based subspace learning framework which takes into consideration uncertainties in the input data.
• We reformulate the Graph Embedding framework to operate on distributions at individual data points. In this way, we provide a generic approach for accounting for data uncertainties in a multitude of SL methods expressed under the Graph Embedding framework.
• We study as special cases of the proposed framework GEU-LDA and GEU-MFA, and we theoretically show that considering uncertainty leads to an increased number of projection directions.
• We propose two schemes to model uncertainty of each sample based on pair-wise distances of data points in the original space.
The remainder of the paper is organized as follows. Section 2 provides a brief review of the related work. Section 3 describes in detail the proposed GEU framework. Section 4 provides the conducted experimental analysis, and Section 5 concludes our work.

Graph Embedding
Graph Embedding [4,13,14] is a general framework encapsulating several SL methods as special cases. Data points are modeled as vertices of two graph structures, namely an intrinsic graph expressing data relationships to be emphasized and a penalty graph expressing data relationships to be suppressed. Using such intrinsic and penalty graphs, the optimization problems of SL methods, such as LDA, PCA, and MFA, can be formulated.
Given a set of data points $\{x_i\}_{i=1}^{N}$ and their corresponding class labels $\{c_i\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^D$ for $i = 1, \dots, N$, the goal in Graph Embedding is to determine a mapping which maps each $x_i$ to a lower-dimensional representation $y_i \in \mathbb{R}^d$, $d < D$. This is achieved by forming a weighted (intrinsic) graph $G = \{X, W\}$, where $X = [x_1, \dots, x_N]$ is the vertex set and $W \in \mathbb{R}^{N \times N}$ the graph weight matrix whose elements encode the pair-wise relationships between the graph vertices $x_i$. Furthermore, a penalty graph $G^p = \{X, W^p\}$ can be defined on the same graph vertices, whose weight matrix $W^p \in \mathbb{R}^{N \times N}$ expresses pair-wise relationships to be penalized.
The graph preserving criterion is formulated as follows:

$$\mathbf{y}^* = \underset{\mathbf{y}^T B \mathbf{y} = c}{\arg\min} \; \sum_{i \neq j} \|y_i - y_j\|^2 W_{ij}, \tag{1}$$

where $c$ is a constant and $B$ can be defined as a constraint matrix, e.g., $B = I$ to enforce orthogonality constraints, or as a scatter matrix based on the Laplacian of the penalty graph, i.e., $B = L^p = D^p - W^p$.
For a linear data mapping, i.e., $\mathbf{y} = X^T \mathbf{v}$, where $\mathbf{v} \in \mathbb{R}^D$ is a unitary projection vector mapping $x_i \in \mathbb{R}^D$ to $y_i \in \mathbb{R}$, Eq. (1) can be rewritten as follows:

$$\mathbf{v}^* = \underset{\mathbf{v}^T X B X^T \mathbf{v} = c}{\arg\min} \; \mathbf{v}^T X L X^T \mathbf{v}, \tag{2}$$

where $L = D - W$ is the Laplacian matrix, with $D$ being the diagonal degree matrix defined by $D_{ii} = \sum_j W_{ij}$. In this case, the solution of the optimization problem in Eq. (2) is given by solving the generalized eigenvalue decomposition problem

$$X L X^T \mathbf{v} = \lambda X B X^T \mathbf{v}, \tag{3}$$

and keeping the eigenvector corresponding to the smallest (positive) eigenvalue.
To obtain more than one projection direction, the corresponding projection matrix V ∈ R D×d is formed by the eigenvectors corresponding to the d smallest eigenvalues.
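The linear Graph Embedding solution described above can be sketched in a few lines of numerical code. The sketch below is illustrative, not the authors' implementation; the function name `graph_embedding` and the small ridge added for numerical stability are assumptions not present in the paper.

```python
import numpy as np
import scipy.linalg


def graph_embedding(X, W, Wp, d):
    """Linear Graph Embedding sketch.

    X:  (D, N) data matrix, one sample per column.
    W:  (N, N) intrinsic graph weight matrix (symmetric).
    Wp: (N, N) penalty graph weight matrix (symmetric).
    d:  target dimensionality.
    Returns the (D, d) projection matrix V.
    """
    L = np.diag(W.sum(axis=1)) - W        # intrinsic Laplacian L = D - W
    Lp = np.diag(Wp.sum(axis=1)) - Wp     # penalty Laplacian L^p = D^p - W^p
    A = X @ L @ X.T                       # numerator scatter  X L X^T
    # Small ridge keeps the denominator scatter positive definite for the
    # solver; it is a numerical convenience, not part of Eq. (2).
    B = X @ Lp @ X.T + 1e-8 * np.eye(X.shape[0])
    # Generalized eigenproblem A v = lambda B v; eigenvalues come back in
    # ascending order, so the first d eigenvectors are the ones kept.
    eigvals, eigvecs = scipy.linalg.eigh(A, B)
    return eigvecs[:, :d]
```

Low-dimensional projections are then obtained as `Y = V.T @ X`.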
Specific selections of W and W p lead to different subspace learning methods.
For LDA, the within-class scatter matrix $S_w$ and the between-class scatter matrix $S_b$ are given by

$$S_w = X \Big( I - \sum_{c=1}^{C} \frac{1}{N_c}\, e_c e_c^T \Big) X^T, \tag{4}$$

$$S_b = X \Big( \sum_{c=1}^{C} \frac{1}{N_c}\, e_c e_c^T - \frac{1}{N}\, e e^T \Big) X^T, \tag{5}$$

where $C$ is the number of classes, $N_c$ is the cardinality of class $c$, $e \in \mathbb{R}^N$ is the vector with all elements equal to 1, and $e_c \in \mathbb{R}^N$ is a vector with the elements corresponding to data points of class $c$ equal to one and the rest equal to zero.
Thus, LDA can be formulated in the Graph Embedding framework by using the graph weight matrices

$$W_{ij} = \frac{\delta_{c_i, c_j}}{N_{c_i}}, \tag{6}$$

$$W^p_{ij} = \frac{1}{N} - \frac{\delta_{c_i, c_j}}{N_{c_i}}, \tag{7}$$

where $N_{c_i}$ is the cardinality of the class which $x_i$ belongs to, and $\delta_{c_i, c_j} = 1$ if $c_i = c_j$ and $0$ otherwise. MFA is formulated by using the graph weight matrices

$$W_{ij} = \begin{cases} 1, & \text{if } i \in N^+_{k_1}(j) \text{ or } j \in N^+_{k_1}(i), \\ 0, & \text{otherwise,} \end{cases} \tag{8}$$

$$W^p_{ij} = \begin{cases} 1, & \text{if } (i,j) \in P_{k_2}(c_i) \text{ or } (i,j) \in P_{k_2}(c_j), \\ 0, & \text{otherwise,} \end{cases} \tag{9}$$

where $N^+_{k_1}(j)$ is the set of the $k_1$ nearest neighbors of $x_j$ in the same class, and $P_{k_2}(c)$ is the set of the $k_2$ nearest pairs among the set $\{(i,j) : c_i = c,\, c_j \neq c\}$. Here, we should note that several other methods which employ pair-wise similarity/distance measures, e.g., [8,15,16,17,18,19,13,20], can be formulated using the Graph Embedding framework.
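The LDA graph weights above can be built directly from the class labels. The following is a minimal sketch under the same-class/penalty weighting just described (the function name `lda_graph_weights` is illustrative):

```python
import numpy as np


def lda_graph_weights(labels):
    """Build the LDA intrinsic (W) and penalty (Wp) weight matrices.

    W_ij  = 1/N_{c_i} if x_i and x_j share a class, 0 otherwise.
    Wp_ij = 1/N - W_ij.
    """
    labels = np.asarray(labels)
    N = len(labels)
    same_class = (labels[:, None] == labels[None, :]).astype(float)
    # N_{c_i}: size of the class each sample belongs to.
    class_sizes = np.array([np.sum(labels == c) for c in labels])
    W = same_class / class_sizes[:, None]
    Wp = 1.0 / N - W
    return W, Wp
```

With these matrices, $X L X^T$ recovers $S_w$ and $X L^p X^T$ recovers $S_b$, so LDA is obtained as a special case of the graph preserving criterion.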

Learning with uncertainty
Research in uncertainty has gained a lot of attention lately in many branches of science [21,22], since data can be subject to measurement inaccuracies and artifacts. Taking this into consideration in the data modeling and learning process is critical for building robust models. Exploiting uncertainty in machine learning has been studied from many different viewpoints. Methods dealing with uncertainty can be grouped into two different categories: sample-wise uncertainty modeling and feature-wise uncertainty modeling.
In sample-wise uncertainty, the noise is modeled at the sample level. The main assumption in such methods is that few training data points are outliers and thus they need to be suppressed or partially suppressed to not affect the solution of the subsequent processing steps. Various robust extensions of SL methods have been proposed to reduce the sensitivity of a classifier to outliers [9,10,11,23,24,25,26,7]. In [23] and [24] for example, robust extensions of LDA were proposed by reducing the sensitivity of the model to outliers.
In feature-wise uncertainty, the noise is modeled at the data dimension level.
The main assumption in such methods is that certain data dimensions are corrupted by noise. This type of noise modeling was employed to extend SVM in [6]. For SL, feature-wise uncertainty is used in [9], where a robust extension of LDA is proposed. Instead of using point estimates of speech data, a probabilistic description based on Gaussian distributions at the individual data point level is used as input to LDA. In our work, we use a similar uncertainty modeling. However, we note two key differences: i) Our work is based on the Graph Embedding framework formulation of SL and, thus, it is not restricted to LDA.
ii) We propose two schemes to model the uncertainty of each sample based on pair-wise distances of data points in the original space. Thus, our approach of modeling uncertainty is not restricted to speech data and can be applied to any data, even when an explicit noise propagation model is absent.

Graph Embedding with Data Uncertainty
Let us denote by $\{y_i\}_{i=1}^{N}$ the set of random Gaussian variables expressing the low-dimensional representations of the input data $x_i$, $i = 1, \dots, N$. We express the graph preserving criterion using $y_i$ as follows:

$$\mathbf{y}^* = \arg\min \sum_{i \neq j} E\big(\|y_i - y_j\|^2\big) W_{ij}, \tag{10}$$

where $E(\cdot)$ denotes the expectation operator. For a Gaussian uncertainty, i.e., $y_i \sim \mathcal{N}(\mu_{y_i}, \sigma_i^2)$, the pair-wise differences $z_{ij} = y_i - y_j$ between $y_i$ and $y_j$ are also random variables following a Gaussian distribution:

$$z_{ij} \sim \mathcal{N}\big(\mu_{y_i} - \mu_{y_j},\; \sigma_i^2 + \sigma_j^2\big). \tag{11}$$

Thus, the expectation term in Eq. (10) can then be rewritten as follows:

$$E\big(\|y_i - y_j\|^2\big) = \|\mu_{y_i} - \mu_{y_j}\|^2 + \sigma_i^2 + \sigma_j^2. \tag{12}$$

By substituting Eq. (12) into Eq. (10), we get

$$\mathbf{y}^* = \arg\min \sum_{i \neq j} \big(\|\mu_{y_i} - \mu_{y_j}\|^2 + \sigma_i^2 + \sigma_j^2\big) W_{ij}. \tag{13}$$

The first term of the summation is equivalent to the original Graph Embedding criterion and depends on $E(\mathbf{y}) = \boldsymbol{\mu}_y$, i.e., the expectation of $\mathbf{y}$:

$$\sum_{i \neq j} \|\mu_{y_i} - \mu_{y_j}\|^2 W_{ij} = 2\, \boldsymbol{\mu}_y^T L \boldsymbol{\mu}_y. \tag{14}$$

By defining $\boldsymbol{\sigma} = [\sigma_1^2, \dots, \sigma_N^2]^T$, the second term in the summation can be expressed as follows:

$$\sum_{i \neq j} \big(\sigma_i^2 + \sigma_j^2\big) W_{ij} = 2 \sum_i D_{ii}\, \sigma_i^2. \tag{15}$$

Thus, using Eq. (14) and Eq. (15), and dropping the common factor of 2, which does not affect the minimizer, our new graph preserving criterion is given as follows:

$$\mathbf{y}^* = \arg\min \Big( \boldsymbol{\mu}_y^T L \boldsymbol{\mu}_y + \sum_i D_{ii}\, \sigma_i^2 \Big). \tag{16}$$

For a linear data mapping $y_i = \mathbf{v}^T x_i$ and modeling each data point in the input space using a Gaussian distribution, i.e., $x_i \sim \mathcal{N}(\mu_{x_i}, \Sigma_{x_i})$, each $y_i$ corresponds to a linear projection of a Gaussian, which is itself a Gaussian distribution: $y_i \sim \mathcal{N}(\mathbf{v}^T \mu_{x_i},\, \mathbf{v}^T \Sigma_{x_i} \mathbf{v})$. Thus, the second term in Eq. (16) can be written as follows:

$$\sum_i D_{ii}\, \sigma_i^2 = \mathbf{v}^T \Big( \sum_i D_{ii}\, \Sigma_{x_i} \Big) \mathbf{v}. \tag{17}$$

The equality in Eq. (17) follows from $\sigma_i^2 = \mathbf{v}^T \Sigma_{x_i} \mathbf{v}$. Based on the above, the final form of Eq. (16) is

$$\mathbf{v}^* = \arg\min\; \mathbf{v}^T \Big( \boldsymbol{\mu} L \boldsymbol{\mu}^T + \sum_i D_{ii}\, \Sigma_{x_i} \Big) \mathbf{v}, \tag{18}$$

where $\boldsymbol{\mu} = [\mu_{x_1}, \dots, \mu_{x_N}]$. Following a derivation similar to the above, a corresponding criterion can be formulated over the penalty graph, leading to the constraint:

$$\mathbf{v}^T \Big( \boldsymbol{\mu} L^p \boldsymbol{\mu}^T + \sum_i D^p_{ii}\, \Sigma_{x_i} \Big) \mathbf{v} = c. \tag{19}$$

The solution of the optimization problem in Eq. (18) is given by solving the following generalized eigenvalue decomposition problem

$$\Big( \boldsymbol{\mu} L \boldsymbol{\mu}^T + \sum_i D_{ii}\, \Sigma_{x_i} \Big) \mathbf{v} = \lambda \Big( \boldsymbol{\mu} L^p \boldsymbol{\mu}^T + \sum_i D^p_{ii}\, \Sigma_{x_i} \Big) \mathbf{v}, \tag{20}$$

and keeping the eigenvector corresponding to the smallest (positive) eigenvalue.
To obtain more than one projection direction, the corresponding projection matrix V ∈ R D×d is formed by the eigenvectors corresponding to the d smallest eigenvalues.
From Eq. (18), we can observe that when uncertainty is not used, i.e., when $\Sigma_{x_i}$ is equal to zero, the Gaussian distributions of $x_i$ become equivalent to Dirac delta functions. Hence, in that case, Eq. (18) becomes equivalent to Eq. (2) and the solution of the proposed approach is equivalent to that of the original Graph Embedding framework. It should be noted that, as explained above, the projection $y_i$ obtained for each data point $x_i$ is also a random variable, characterised by the mean $E(y_i) = \mathbf{v}^T \mu_{x_i}$ and variance $\sigma^2_{y_i} = \mathbf{v}^T \Sigma_{x_i} \mathbf{v}$. One can use this additional information for the projected data or employ only the first-order approximation, i.e., the mean $E(y_i)$, as the final projection of the original sample $x_i$. In this paper, we use the latter in the classification step. Moreover, the additional terms $\sum_i D_{ii} \Sigma_{x_i}$ and $\sum_i D^p_{ii} \Sigma_{x_i}$ introduced to the scatter matrices defined over the intrinsic and penalty graphs act as regularization terms leading to full-rank matrices. This is due to the fact that the Gaussian covariance matrix $\Sigma_{x_i}$ is a strictly positive-definite matrix. Hence, modeling uncertainty at the individual data point level results in an intuitive regularization procedure, increasing the number of projection directions. This allows avoiding the small sample size problem of LDA [12] and provides more projection directions, even for binary problems.
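The GEU eigenproblem differs from the original Graph Embedding one only by the two covariance-weighted regularization terms. The sketch below illustrates this; the function name `geu_embedding` is ours, and the per-sample means are taken to be the data points themselves, as in the proposed modeling.

```python
import numpy as np
import scipy.linalg


def geu_embedding(X, W, Wp, Sigmas, d):
    """GEU sketch: Graph Embedding with per-sample covariances.

    X:      (D, N) matrix of means (the original data points).
    W, Wp:  (N, N) intrinsic and penalty weight matrices.
    Sigmas: list of N (D, D) covariance matrices, one per sample.
    d:      target dimensionality.
    """
    N = X.shape[1]
    L = np.diag(W.sum(axis=1)) - W
    Lp = np.diag(Wp.sum(axis=1)) - Wp
    Dii, Dpii = W.sum(axis=1), Wp.sum(axis=1)
    # Uncertainty-induced regularizers: sum_i D_ii Sigma_i (and penalty analog).
    R = sum(Dii[i] * Sigmas[i] for i in range(N))
    Rp = sum(Dpii[i] * Sigmas[i] for i in range(N))
    A = X @ L @ X.T + R       # mu L mu^T  + sum_i D_ii   Sigma_i
    B = X @ Lp @ X.T + Rp     # mu L^p mu^T + sum_i D^p_ii Sigma_i
    eigvals, eigvecs = scipy.linalg.eigh(A, B)  # smallest eigenvalues first
    return eigvecs[:, :d]
```

With positive-definite covariances, both scatter matrices become full rank, which is exactly the regularization effect discussed above.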

Uncertainty estimation
In the proposed GEU framework, we encode the uncertainty of each individual data point by a Gaussian distribution centered at the position of the data point and having a variance which needs to be appropriately determined to reflect the properties of the problem at hand. However, data is commonly available without such uncertainty information. We propose two schemes for defining such a variance estimate based on pair-wise distances between data points, one for the unsupervised and one for the supervised setting.
In both techniques, each sample $x_i$ is represented by its mean $E(x_i) = x_i$ and a diagonal covariance $\Sigma_i$, obtained by applying the diag(·) operator to the discrepancy between $x_i$ and $x_i^*$ and scaling by a constant $\sigma$, where $x_i^*$ is the closest data point to $x_i$ in the admissible set. For the unsupervised case, the admissible set is composed of all the training data except $x_i$; for the supervised case, it is composed of all the training data of the same class as $x_i$, excluding $x_i$ itself.
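The exact closed form of $\Sigma_i$ is not reproduced here; the sketch below uses one plausible instantiation consistent with the description, a $\sigma$-scaled diagonal of the squared element-wise difference to the nearest admissible point. Both this exact form and the function name `estimate_covariances` are illustrative assumptions.

```python
import numpy as np


def estimate_covariances(X, labels=None, sigma=1.0):
    """Per-sample diagonal covariances from nearest-neighbor discrepancies.

    X: (N, D) array of samples.
    labels=None gives the unsupervised admissible set (all other samples);
    otherwise the admissible set is restricted to same-class samples.
    Note: in the supervised case every class must contain >= 2 samples.
    """
    N = X.shape[0]
    Sigmas = []
    for i in range(N):
        if labels is None:
            admissible = [j for j in range(N) if j != i]
        else:
            admissible = [j for j in range(N)
                          if j != i and labels[j] == labels[i]]
        dists = [np.linalg.norm(X[i] - X[j]) for j in admissible]
        j_star = admissible[int(np.argmin(dists))]   # closest point x_i*
        diff = X[i] - X[j_star]
        # Assumed form: sigma * diag of the squared element-wise difference.
        Sigmas.append(sigma * np.diag(diff ** 2))
    return Sigmas
```

The resulting list of covariance matrices can be passed directly to the GEU optimization described in the previous section.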

Experiments and analysis
In this section, we evaluate the traditional subspace learning techniques LDA and MFA as special cases of the proposed framework.
For all testing scenarios, we rely on the Nearest Neighbor classifier. For the evaluation, we use three different datasets:
• Breast Cancer Wisconsin dataset [27]: a binary classification dataset composed of 569 samples with 32 features. An explicit uncertainty estimate is proposed in [6]. We use a random 5-fold split for the evaluation of the different approaches and keep the folds fixed across methods.
• Cifar2: We use two classes, "cat" and "dog", from the original Cifar10 [28]. We randomly sample 900 images per class for the training. For the

MFA
MFA is an SL technique which characterizes the intraclass compactness in the intrinsic graph and the interclass separability in the penalty graph. It can be formulated using the Graph Embedding framework as explained in Section 2.
Thus, it can be extended using our framework to incorporate data uncertainty using Eq. (18)-(20). For higher values of (d, k), the performance of both approaches increases and they tend to perform similarly.
In Figure 3, we show the performance of the variants of MFA as a function of the number of training samples on Cifar2. We note that incorporating uncertainty improves the performance. In Table 1, we show the robustness of MFA [4], RMFA [4], and our proposed approach with both variants of uncertainty estimation, i.e., GEU-MFA-U and GEU-MFA-S, on the three datasets with different additional noise levels. We repeat each experiment ten times and report the average accuracy achieved by each method. We note that the proposed methods outperform the original MFA for all noise levels. We also note that the accuracies of all methods drop markedly at higher noise levels. The supervised technique for estimating the uncertainty achieves the top performance except for the Yale B Face dataset with no additional noise, where the best performance is achieved by GEU-MFA-U.

LDA
In Figure 4, we evaluate the performance of LDA, GEU-LDA-U, and GEU-LDA-S as a function of the number of training samples on Cifar2. We repeat each experiment ten times and report the mean and the variance of the accuracies for all training sizes. Similar to MFA, incorporating uncertainty yields a performance boost for both variants of uncertainty estimation compared to the original LDA. We also note that for higher numbers of training samples, the performance gap decreases. Both variants of uncertainty estimation achieve similar performance across the different training sizes.
We report the performance of LDA [30], regularized LDA [4], Robust Sparse Linear Discriminant Analysis (RSLDA) [23], Uncertain Linear Discriminant Analysis (ULDA) [9], GEU-LDA-U, and GEU-LDA-S on the three datasets for different noise levels in Table 2. We repeat each experiment ten times and report the average accuracy achieved by each approach. For the clean Cifar2 dataset, the best accuracy is achieved by GEU-LDA-U, while for the noisy Cifar2, GEU-LDA-S achieves the best results. The regularized LDA yields the best accuracy for the Cancer and Yale B (noise=10%) datasets. However, for the other two variants of the Yale B dataset, the highest accuracy is achieved by the proposed GEU-LDA methods.
Table 2: Classification accuracy of LDA [30], RLDA [4], RSLDA [23], ULDA [9], GEU-LDA-U, and GEU-LDA-S on the different datasets.

Conclusion
In this work, we introduced a novel spectral-based dimensionality reduction framework called Graph Embedding with Data Uncertainty (GEU) that reformulates Graph Embedding to consider input data uncertainties and artifacts. We model the uncertainty around each data point by a multivariate Gaussian distribution centered at the original sample, with a covariance matrix characterizing the uncertainty of the corresponding sample along each feature dimension. Two techniques to generate the distribution of each data point were proposed based on the pair-wise distances between samples. The modeled uncertainty introduces a regularization term that increases the rank of the scatter matrices and the number of available projection directions compared to the original subspace learning methods. We studied as special cases of the proposed framework the traditional subspace learning techniques LDA and MFA. The proposed framework was extensively evaluated on three datasets and led to performance improvements over the original methods as well as over competing methods that consider uncertainty.