Mismatch in the Classification of Linear Subspaces: Sufficient Conditions for Reliable Classification

This paper considers the classification of linear subspaces with mismatched classifiers. In particular, we assume a model where one observes signals in the presence of isotropic Gaussian noise and the distribution of the signals conditioned on a given class is Gaussian with a zero mean and a low-rank covariance matrix. We also assume that the classifier knows only a mismatched version of the parameters of the input distribution in lieu of the true parameters. By constructing an asymptotic low-noise expansion of an upper bound to the error probability of such a mismatched classifier, we provide sufficient conditions for reliable classification in the low-noise regime that are able to sharply predict the absence of a classification error floor. Such conditions are a function of the geometry of the true signal distribution, the geometry of the mismatched signal distribution, as well as the interplay between such geometries, namely, the principal angles and the overlap between the true and the mismatched signal subspaces. Numerical results demonstrate that our conditions for reliable classification can sharply predict the behavior of a mismatched classifier both with synthetic data and in motion segmentation and hand-written digit classification applications.


I. INTRODUCTION
Signal classification is a fundamental task in various fields, including statistics, machine learning and computer vision. One often approaches this problem by leveraging the Bayesian inference paradigm, where one infers the signal class from signal samples or measurements based on a model of the joint distribution of the signal and signal classes [1, Chapter 2].
Such a joint distribution is typically inferred by relying on pre-labeled data sets. However, in practical applications, the methods used to estimate the distributions from training data inevitably lead to signal models that are not perfectly matched to the underlying one. This can be due to an insufficient amount of labeled data, noise in the pre-labeled data [2], [3], [4], or non-stationary statistical behaviour [5].
It is therefore relevant to ask the question: What is the impact that a mismatched classifier, i.e. a classifier that infers the signal classes based on an inaccurate model of the data distribution in lieu of the true underlying data distribution, has on classification performance?
We answer this question for the scenario where the data classes are constrained to lie approximately on a low-dimensional linear subspace embedded in the high-dimensional ambient space. Indeed, there are various problems in signal processing, image processing and computer vision that conform to such a model, some of which are:
• Face Recognition: It can be shown that, provided that the Lambertian reflectance assumption is verified, the set of images taken of the same subject under different lighting conditions can be well approximated by a low-dimensional linear subspace embedded in the high-dimensional space [6]. This is leveraged in several face recognition applications [7]-[9].
• Motion Segmentation: It can also be shown, under the assumption of the affine projection camera model, that the coordinates of feature points associated with rigidly moving objects across different video frames lie in a 4-dimensional linear space [10], [11], [12]. This is leveraged in [10] to design subspace clustering algorithms that can perform motion segmentation.
• In general, (affine) subspaces or unions of (affine) subspaces can also be used to model other data such as images of handwritten digits [13].
Our contributions include:
• We derive an upper bound to the error probability associated with the mismatched classifier for the case where the distribution of the signal in a given class is Gaussian with zero mean and low-rank covariance matrix.
• We then derive sufficient conditions for reliable classification in the asymptotic low-noise regime. Such conditions are expressed in terms of the geometry of the true signal model, the geometry of the mismatched signal model and the interaction of these geometries (via the principal angles associated with the subspaces of the true and mismatched signal models as well as the dimension of the intersection of such subspaces).
• We finally provide a number of results, both with synthetic and real data, showing that our sufficient conditions for reliable classification are sharp. In particular, we also use our theoretical framework to determine the number of training samples needed to achieve reliable classification in motion segmentation and handwritten digit classification applications.

A. Related Work
The concept of model mismatch has been widely explored by the information theory and communication theory communities. For example, in lossless source coding problems, mismatch between the distribution used to encode the source and the true distribution is shown to lead to a compression rate penalty which is determined by the Kullback-Leibler (KL) distance between the mismatched and the true distributions [14, Theorem 5.4.3].
In channel coding problems, mismatch has an impact on the reliable information transmission rate that has been characterized via inner and outer bounds to the achievable rate and error exponents of different channel models [15]-[19]. The problem of mismatched quantization is considered in [20].
The concept of mismatch has also been explored in the machine learning literature [5]. In particular, [5] studies the impact on classification performance of training sets consisting of biased samples of the true distribution, expressing classification error bounds as a function of the sample bias severity and type. The effect of label noise in the training sets is also considered in classification algorithms such as Support Vector Machines [3] and Logistic Regression classifiers [4]. See also [2] for an overview of the literature on classification in the presence of label noise.
Signal classification and estimation using mismatched models is also considered in [21]-[24]. For example, [23] expresses bounds to the error probability in the presence of mismatch via the f-divergence between the true and mismatched source distributions, and [24] expresses the mean-squared error penalty in the presence of mismatch in terms of the derivative of the KL distance between the true and the mismatched distributions with respect to the decoder signal-to-noise ratio (SNR). In particular, the work in [23] is closely related to our work in the sense that it also establishes bounds to the error probability in the presence of mismatch. The bounds presented in [23] are more general, since they do not assume a particular form of the probability density functions. Our work, on the other hand, leverages the assumption that signals are contained in linear subspaces in order to derive an upper bound that sharply predicts the presence or absence of an error floor. The bounds in [23] fail to capture the presence or absence of an error floor when specialized to the proposed signal model.

B. Organization
The remainder of this paper is organized as follows: Section II introduces the observation and signal models, the Mismatched Maximum-a-Posteriori (MMAP) classifier and the geometrical quantities associated with the signal and the mismatched model that are essential for the description of the MMAP classifier performance. The upper bound to the error probability associated with the MMAP classifier and the asymptotic expansion, which provide sufficient conditions for reliable classification in the low-noise regime, are given in Section III. In Section IV the theoretical results are validated via numerical experiments. Applications of the proposed bound in a motion segmentation task and in a hand-written digit classification task are given in Section V. The paper is concluded in Section VI. The proofs of the results are given in the Appendix.

C. Notation
We use the following notation in the sequel: matrices, column vectors and scalars are denoted by boldface upper-case letters (X), boldface lower-case letters (x) and italic letters (x), respectively. I_N ∈ R^{N×N} denotes the identity matrix and 0_{M×N} ∈ R^{M×N} denotes the zero matrix. The subscripts are omitted when the dimensions are clear from the context. e_k denotes the k-th basis vector in R^N. The transpose, rank and determinant operators are denoted as (·)^T, rank(·) and |·|, respectively. ‖x‖ denotes the Euclidean norm of the vector x and ‖X‖_2 denotes the spectral norm of the matrix X. The image of a matrix is denoted by im(·) and the kernel of a matrix is denoted by ker(·). The sum of subspaces A and B is denoted as A + B and the orthogonal complement of A is denoted as A^⊥. log(·) denotes the natural logarithm, and the multivariate Gaussian distribution with mean µ and covariance matrix Σ is denoted as N(µ, Σ). We also use the following asymptotic notation:

II. PROBLEM STATEMENT
We consider a standard observation model: where y ∈ R^N represents the observation vector, x ∈ R^N represents the signal vector and n ∼ N(0, σ²I) ∈ R^N represents the observation noise, with σ² denoting the noise variance per dimension. We also assume that the signal x ∈ R^N is drawn from a class c ∈ {1, . . . , C} with prior probability P(c = i) = p_i, and that the distribution of the signal x conditioned on a given class c = i is Gaussian with zero mean and (possibly) low-rank covariance matrix Σ_i ∈ R^{N×N}, i.e.
with rank(Σ_i) = r_i ≤ N. Therefore, conditioned on a given class c = i, the signal lies on the linear subspace spanned by the eigenvectors associated with the positive eigenvalues of the covariance matrix Σ_i.
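As a concrete illustration, the observation model above can be simulated directly: a class-conditional signal is a random combination of the columns of an orthonormal basis U_i, weighted by the square roots of the positive eigenvalues, plus isotropic noise. The sketch below (function name and toy dimensions are our own, not from the paper) draws one observation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_observation(U, lam, sigma2, rng):
    """Draw y = x + n with x = U @ (sqrt(lam) * z), z ~ N(0, I_r).

    U   : (N, r) orthonormal basis of the class signal subspace
    lam : (r,) positive eigenvalues of the class covariance
    """
    N, r = U.shape
    x = U @ (np.sqrt(lam) * rng.standard_normal(r))  # signal on the subspace
    n = np.sqrt(sigma2) * rng.standard_normal(N)     # isotropic Gaussian noise
    return x + n

# toy class: a rank-2 subspace embedded in R^4
U = np.linalg.qr(rng.standard_normal((4, 2)))[0]
y = sample_observation(U, np.array([2.0, 1.0]), sigma2=0.01, rng=rng)
```

Note that with sigma2 = 0 every sample lies exactly in im(U), matching the statement above that the noiseless signal lives on the subspace spanned by the eigenvectors with positive eigenvalues.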
The classification problem involves inferring the correct class label c associated with the signal x from the signal observation y. It is well known that the optimal classification rule, which minimizes the error probability, is given by the Maximum-A-Posteriori (MAP) classifier [1, Chapter 2.3]: where p(c = i|y) represents the a posteriori probability of the class label c = i given the observation y. The Mismatched MAP (MMAP) classifier follows the same rule, but with the mismatched model parameters in place of the true ones, where p̃(y|c = i) denotes the mismatched probability density function of the observation y given the class label c = i.
The probability of error associated with an MMAP classifier is given by: where and u(·) is the unit-step function. This error probability cannot be calculated in closed form, but it can be easily bounded.
Our goal is to study the performance of the MMAP classifier by establishing conditions, which are a function of the geometry of the true and mismatched signal models as well as the interaction of such geometries, for reliable classification in the low-noise regime, i.e., such that lim_{σ²→0} P(e) = 0.
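Under the model above, the mismatched classifier treats y | c = i as N(0, Σ̃_i + σ²I), so the MMAP rule reduces to comparing Gaussian log-likelihoods plus log-priors. A minimal sketch of this rule (the function name is ours; this is the standard Gaussian discriminant computation implied by the model, not code from the paper):

```python
import numpy as np

def mmap_classify(y, mis_covs, priors, sigma2):
    """MMAP rule: argmax_i  log p_i + log N(y; 0, Sigma_tilde_i + sigma2 * I)."""
    N = y.shape[0]
    scores = []
    for Sig, p in zip(mis_covs, priors):
        C = Sig + sigma2 * np.eye(N)          # mismatched observation covariance
        _, logdet = np.linalg.slogdet(C)
        quad = y @ np.linalg.solve(C, y)      # y^T C^{-1} y
        scores.append(np.log(p) - 0.5 * logdet - 0.5 * quad)
    return int(np.argmax(scores))

# two rank-1 "classes" along e1 and e2 in R^3
covs = [np.diag([1.0, 0.0, 0.0]), np.diag([0.0, 1.0, 0.0])]
label = mmap_classify(np.array([1.0, 0.0, 0.0]), covs, [0.5, 0.5], sigma2=0.01)
```

Adding σ²I keeps the (possibly singular) low-rank covariances invertible, which is exactly why the low-noise limit σ² → 0 is the delicate regime studied in the sequel.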

A. Geometrical Description of the Signals
Our characterization of the performance of the MMAP classifier will be expressed via various quantities that embody the geometry of the true signal model, the geometry of the mismatched signal model, and their interplay. (We assume that C and σ² are known. Since we study the scenario where σ² → 0, the assumption that σ² is known exactly is immaterial.) The quantities central to the analysis are given in Table I and the relationships between the presented quantities are summarized in Table II.
1) Quantities associated with the geometry of the true signal model or the mismatched signal model: The signal space corresponding to class i and the mismatched signal space corresponding to class i, which are subspaces of R^N, are denoted as im(Σ_i) and im(Σ̃_i), respectively. An orthonormal basis for im(Σ_i) is denoted as U_i ∈ R^{N×r_i} and an orthonormal basis for im(Σ̃_i) is denoted as Ũ_i ∈ R^{N×r̃_i}; these quantities follow directly from the truncated eigenvalue decompositions of Σ_i and Σ̃_i, where Λ_i = diag(λ_i^1, . . . , λ_i^{r_i}) ∈ R^{r_i×r_i} and Λ̃_i = diag(λ̃_i^1, . . . , λ̃_i^{r̃_i}) ∈ R^{r̃_i×r̃_i} are diagonal matrices containing the positive eigenvalues of Σ_i and Σ̃_i, respectively. Note that im(Σ_i) = im(U_i) and im(Σ̃_i) = im(Ũ_i).
2) Quantities associated with the interplay between the geometries of the mismatched signal models: We consider quantities that reveal the relationship between the mismatched signal spaces of classes i and j. In particular, such quantities follow from the decomposition of the subspace im(Σ̃_i + Σ̃_j) = im(Ũ_i) + im(Ũ_j), which spans the mismatched signal subspaces of classes i and j:
• Ũ∩_ij represents an orthonormal basis for the intersection im(Σ̃_i) ∩ im(Σ̃_j) and r∩_ij is the dimension of im(Σ̃_i) ∩ im(Σ̃_j). This intersection is associated with class i as well as class j;
• Ũ_ij ∈ R^{N×r_ij} represents an orthonormal basis for the orthogonal complement of im(Σ̃_i) ∩ im(Σ̃_j) in im(Σ̃_i) and r_ij is the codimension of im(Σ̃_i) ∩ im(Σ̃_j) in im(Σ̃_i). im(Ũ_ij) can be interpreted as the subspace of the mismatched signal space corresponding to class i that is only associated with class i and not with class j;
• Ũ_ji ∈ R^{N×r_ji} represents an orthonormal basis for the orthogonal complement of im(Σ̃_i) ∩ im(Σ̃_j) in im(Σ̃_j) and r_ji is the codimension of im(Σ̃_i) ∩ im(Σ̃_j) in im(Σ̃_j). im(Ũ_ji) can be interpreted as the subspace of the mismatched signal space corresponding to class j that is only associated with class j and not with class i.
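The dimensions r∩_ij, r_ij and r_ji above can be computed from ranks alone, using the identity dim(A ∩ B) = dim A + dim B − dim(A + B). A small sketch (the function name and tolerance are our own choices):

```python
import numpy as np

def interplay_dims(Ui, Uj, tol=1e-10):
    """Dimensions of the sum, intersection and relative complements of
    im(Ui) and im(Uj), given orthonormal bases Ui and Uj."""
    ri, rj = Ui.shape[1], Uj.shape[1]
    r_sum = np.linalg.matrix_rank(np.hstack([Ui, Uj]), tol=tol)  # dim(A + B)
    r_cap = ri + rj - r_sum                                      # dim(A ∩ B)
    return {"cap": r_cap,        # r∩_ij
            "only_i": ri - r_cap,  # codimension r_ij of the intersection in im(Ui)
            "only_j": rj - r_cap}  # codimension r_ji of the intersection in im(Uj)
```

For example, two planes in R^4 sharing a single line give cap = 1 and one extra direction on each side.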
3) Quantities associated with the interplay between the geometry of the true signal model and the mismatched signal model: We also consider quantities that capture the interaction between the signal space corresponding to class i and the mismatched signal spaces of classes i and j. Such quantities are given by a decomposition of im(Σ_i) = im(U_i) into im(V_ij) + im(W_ij), where im(W_ij) can be interpreted as the subspace of the signal space corresponding to class i that is orthogonal to im(Ũ_ij), and im(V_ij) can be interpreted as the subspace of the signal space of class i that is not orthogonal to im(Ũ_ij).
4) Principal angles and distance between subspaces: Finally, our results will also be expressed via the principal angles between certain subspaces. In particular, consider a subspace Y with an orthonormal basis Y ∈ R^{N×y}, where y = dim(Y), and a subspace Z with an orthonormal basis Z ∈ R^{N×z}, where z = dim(Z), and define k = min(y, z).

Then the principal angles between Y and Z are given by the singular value decomposition (SVD): where H ∈ R^{y×y} and J ∈ R^{z×z} are orthonormal matrices and D ∈ R^{y×z} is a rectangular diagonal matrix containing the singular values (the cosines of the principal angles). The principal angles are used to define various distances on a Grassmann manifold [26]. We will predominantly use the max correlation distance between two subspaces, which is a function of the smallest principal angle θ_1, and the min correlation distance between two subspaces, which is a function of the largest principal angle θ_k between the two subspaces. Note that we slightly abuse notation in the second term of (10) and (11), as Y and Z are bases for the subspaces, not subspaces.
5) Interpretation: It is instructive to cast some insight on the role of these various quantities in the characterization of the performance of the MMAP classifier.
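The SVD recipe above is directly computable: the singular values of Y^T Z are the cosines of the principal angles. The sketch below also forms the two correlation distances as sines of the extreme angles, following the common Grassmann-distance conventions (the exact distance definitions (10) and (11) are in the elided equations, so treat these as an assumption):

```python
import numpy as np

def principal_angles(Y, Z):
    """Principal angles (radians, ascending) between im(Y) and im(Z),
    for orthonormal bases Y and Z."""
    s = np.linalg.svd(Y.T @ Z, compute_uv=False)   # cosines, descending
    return np.arccos(np.clip(s, -1.0, 1.0))        # angles, ascending

def max_correlation_distance(Y, Z):
    return np.sin(principal_angles(Y, Z)[0])       # uses smallest angle theta_1

def min_correlation_distance(Y, Z):
    return np.sin(principal_angles(Y, Z)[-1])      # uses largest angle theta_k
```

For two lines in the plane at 45 degrees, the single principal angle is pi/4 and both distances equal sin(pi/4).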
Consider a two-class classification problem that involves distinguishing class 1 from class 2 in the low-noise regime (so y ≈ x). It is clear that the MMAP classifier will associate an observation y ∈ im(Ũ_12) with class 1 and an observation y ∈ im(Ũ_21) with class 2; in turn, the MMAP classifier may associate an observation y ∈ im(Ũ∩_12) with either class 1 or class 2. In general, an observation associated with class 1 is such that y ∈ im(U_1) = im(V_12) + im(W_12).
The following example demonstrates the classification of y|c = 1 by the MMAP classifier when the covariance matrices are assumed to be diagonal.

Example 1:
We take the covariance matrices to be as follows. The relevant quantities (see Table I) are given accordingly, and we also determine im(W_12) and im(V_12). Assume now that y ∈ im(V_12) and note that im(V_12) = im(Ũ_12). Therefore, y ∈ im(V_12) will be classified as class 1 by the MMAP classifier. In contrast, assume now that y ∈ im(W_12) and note that im(W_12) contains im(Ũ_21). Therefore, y ∈ im(W_12) may be classified as class 2.
Next, we modify the mismatched model of class 2, which leads to Ũ_21 = e_4. Note now that im(W_12) does not contain im(Ũ_21) and y ∈ im(W_12) will not be associated uniquely with class 2 by the MMAP classifier.
It is now clear that the relationship between subspaces im(W 12 ) and im( Ũ 21 ) will play a role in the characterization of conditions for perfect classification in the low-noise regime.
The next example demonstrates the role of principal angles in the conditions for perfect classification in the low-noise regime.
Example 2: We take the signal space bases as follows. The relevant quantities (see Table I) are given accordingly, and the geometry of the signals and decision regions is presented in Fig. 2 (a). Note now that y|c = 1 ∈ im(U_1) can potentially be associated with the correct class 1 depending on the distance (computed according to an appropriate metric) between im(V_12) and im(Ũ_12) and the distance between im(V_12) and im(Ũ_21). In particular, here the angle between im(V_12) and im(Ũ_12) is greater than the angle between im(V_12) and im(Ũ_21), which leads to misclassification of signals from class 1. On the other hand, if we modify the mismatched bases so that the angle between im(V_12) and im(Ũ_12) is smaller than the angle between im(V_12) and im(Ũ_21), we obtain perfect classification of signals from class 1 in the low-noise regime. This case is presented in Fig. 2 (b).
The ensuing analysis shows how these various quantities, which are readily computed from the underlying geometry of the true subspaces and the mismatched ones, can be used as a proxy to define sufficient conditions for perfect classification in the low-noise regime. In particular, these quantities bypass the need to compute the decision regions associated with the MMAP classifier in order to quantify performance.

III. CONDITIONS FOR RELIABLE CLASSIFICATION
We now consider (sufficient) conditions for reliable classification in the low-noise regime. We derive these conditions directly from a low-noise expansion of an upper bound to the error probability associated with the MMAP classifier.
The following upper bound to the probability of error associated with an MMAP classifier will play a key role in the analysis.
Theorem 1: The error probability associated with the MMAP classifier in (7) can be bounded as follows: where
• If ∃(i, j) with i ≠ j such that Σ_ij ⋡ 0 (i.e., Σ_ij is not positive semidefinite), then P(e) ≤ P̄(e) = 1.
Proof: The proof appears in Appendix.
This upper bound to the error probability of the MMAP classifier can capture the fact that the error probability may tend to zero as the noise power approaches zero, depending on the relation between the true signal parameters and the mismatched ones. In particular, the upper bound to the misclassification probability of class i is expressed as a function of the covariance matrix of class i, the mismatched covariance matrix of class i and the mismatched covariance matrices of classes j ≠ i. In contrast, the bound proposed in [23] expresses the upper bound to the error probability as a function of the sum of f-divergences between the true and the mismatched distributions of class i, for all classes i. Therefore, it does not capture the interplay between mismatched models of different classes. In addition, when specialized to the proposed signal model, the bound in [23] always predicts the presence of an error floor (see Section IV).
The following Theorem presents a low-noise expansion of the upper bound to the error probability of the MMAP classifier.
Theorem 2: The upper bound to the error probability of the MMAP classifier in (13) can be expanded as follows:
• Assume that ∀(i, j), i ≠ j, conditions (15) and (16) hold, and take d = min_{(i,j): i≠j} d_ij, with d_ij given by (17); then the expansion (19) holds, where A > 0.
• Assume ∃(i, j), i ≠ j, such that condition (15) or (16) does not hold. Then the upper bound exhibits an error floor.
Proof: The proof appears in the Appendix.
The expansion of the upper bound to the error probability embodied in Theorem 2 provides a set of conditions, which are a function of the geometry of the true signal model, the geometry of the mismatched signal model, and the interaction of the geometries, that enable us to understand whether or not the upper bound to the error probability may exhibit an error floor. In particular, in view of the fact that we use the union bound in order to bound the error probability of a multiclass problem in terms of the error probabilities of two-class problems, these conditions have to hold for every pair of class labels (i, j), i ≠ j. We can note that:
• The upper bound to the probability of error exhibits an error floor if either (15) or (16) is not satisfied for some pair (i, j), i ≠ j. The interpretation of condition (15) is straightforward by noting that the subspace im(W_ij) contains vectors of class i that are orthogonal to the subspace im(Ũ_ij), which is the subspace uniquely associated with class i. Then, condition (15) states that such vectors must also be orthogonal to the mismatched subspace uniquely associated with class j, i.e., im(Ũ_ji).
The interpretation of condition (16) is obtained by reformulating the expression in terms of the norm of the projection of x onto im(Ũ_ij). Therefore, (16) requires that the norm of vectors in im(V_ij), which are associated with class i, projected onto im(Ũ_ij), which is also associated with class i, is greater than the norm of vectors in im(V_ij) projected onto im(Ũ_ji), which is associated with class j. Equation (21) is also implied by the requirement that the largest principal angle between im(V_ij) and im(Ũ_ij) is smaller than the smallest principal angle between im(V_ij) and im(Ũ_ji). A demonstration of this condition is provided by Example 2 in Section II-A.
• On the other hand, the upper bound to the probability of error does not exhibit an error floor if conditions (15) and (16) are satisfied for all pairs (i, j), i ≠ j, and d > 0. In particular, necessary and sufficient conditions for d > 0 depend on the dimension of the various subspaces and their relation, i.e., s^V_ij > 0 for all pairs (i, j) such that r̃_j − r̃_i ≤ 0 is necessary and sufficient for d > 0. For example, if the rank of all covariance matrices associated with the mismatched model is the same, i.e., if r̃_i = r̃ for i = 1, . . . , C, then s^V_ij > 0, ∀(i, j), i ≠ j, is necessary and sufficient for d > 0. Note that a positive value of s^V_ij indicates that there is at least one vector in im(U_i) that is not contained in im(Ũ_ij)^⊥, or equivalently, there exists at least one vector in im(U_i) that has a non-zero projection onto im(Ũ_ij), therefore leading to reliable classification of signals from class i.
• Note that the parameters α_ij do not play a role in the characterization of the necessary and sufficient conditions for d > 0. In fact, the conditions for d_ij > 0 do not depend on a particular value of α_ij, provided that α_ij ∈ (0, 1/|r̃_j − r̃_i|).
• Note also that the value of d represents a measure of robustness against noise in the low-noise regime, as it determines the speed at which the upper bound to the error probability decays with 1/σ². In particular, higher values of d represent higher robustness against noise in the low-noise regime. For example, on assuming r̃_i = r̃ for i = 1, . . . , C, we observe that larger values of s^V_ij correspond to larger values of d. Therefore, as expected, higher levels of robustness are obtained when the overlap between im(U_i) and im(Ũ_ij)^⊥, i.e., the dimension of im(W_ij), is reduced. We also discuss how the value of d_ij in equation (17) relates to the value of d_ij for the non-mismatched case. In particular, we assume that r_i = r_j = r̃_i = r̃_j and that the true and the mismatched covariance matrices are diagonal. Then d_ij takes a simple form in the non-mismatched case and in the mismatched case. Therefore, in the non-mismatched case d_ij is at most r_i and it decreases as the dimension of the intersection of the signal spaces of classes i and j increases. In the mismatched case d_ij is also at most r_i, but it decreases as the dimension of the intersection of the signal space of class i and the mismatched signal space of class j increases, and as the dimension of the intersection of the signal space of class i and the noise subspace of the mismatched classifier, i.e., ker(Ũ_i^T) ∩ ker(Ũ_j^T), increases. It can also be easily verified that the value of d for a non-mismatched 2-class problem obtained in [27] matches the value of d derived via the proposed bound. Note that the bound analyzed in [27] is different from the bound proposed in this paper and is only valid for non-mismatched models.
• The constant A in (19) distinguishes the upper bounds for different mismatched models with a constant d, in the low-noise regime, and is determined as the ratio of volumes of subspaces associated with the true and mismatched signal subspaces and their interaction. See the Appendix for the detailed expression.
Theorem 2 therefore leads immediately to sufficient conditions for reliable classification in the low-noise regime.
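The principal-angle form of the condition discussed above is easy to test numerically: compare the largest principal angle between im(V_ij) and im(Ũ_ij) with the smallest principal angle between im(V_ij) and im(Ũ_ji). A sketch (function and variable names are ours):

```python
import numpy as np

def angle_condition_holds(V, Ui_only, Uj_only):
    """True iff the largest principal angle between im(V) and im(Ui_only)
    is smaller than the smallest principal angle between im(V) and im(Uj_only).
    All inputs are orthonormal bases (columns)."""
    def angles(A, B):
        s = np.linalg.svd(A.T @ B, compute_uv=False)
        return np.arccos(np.clip(s, -1.0, 1.0))
    return bool(angles(V, Ui_only).max() < angles(V, Uj_only).min())

# a line 10 degrees away from e1: close to "class i only", far from "class j only"
c, s = np.cos(np.deg2rad(10)), np.sin(np.deg2rad(10))
V = np.array([[c], [s], [0.0]])
ok = angle_condition_holds(V, np.eye(3)[:, :1], np.eye(3)[:, 1:2])
```

Swapping the roles of the two one-dimensional subspaces flips the verdict, mirroring the two configurations of Example 2.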
Proof: The proof appears in Appendix.
Note that the conditions in Corollary 2 are implied by the conditions in Corollary 1 (and are hence weaker).
The conditions for reliable classification are particularly simple for the scenario where the true and mismatched covariance matrices are diagonal.
Proof: The proof appears in Appendix.
Note that in the diagonal case the sufficient conditions for perfect classification simplify to inclusions of subspaces. Recall Example 1, where we demonstrated that the signals in im(W_ij) may be associated with class i or with class j. Condition (26) formalizes the intuition that the signals in im(W_ij) must be orthogonal to im(Ũ_ji), which is uniquely associated with class j.
We finally illustrate how our conditions cast insight onto the impact of mismatch for a two-class case where the mismatched subspaces are a rotated version of the true signal subspaces, with orthogonal rotation matrices Q_1 and Q_2 and s_12, s_21 > 0. The proof is in the Appendix.
This example provides sufficient conditions for reliable classification in the low-noise regime by relating the degree of mismatch, measured in terms of the spectral norm of the matrix I − Q_i, i = 1, 2, to the minimum principal angle between subspaces. It states that the larger the minimum principal angle between the spaces spanned by signals of class 1 and class 2, i.e., the larger 1 − δ_12, the more robust the classifier is against mismatch, where the level of mismatch is measured by ε_1 + ε_2. The maximum robustness against mismatch is obtained when δ_12 = 0, which means that signals from class 1 and class 2 are orthogonal.
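As a quick check of how this mismatch level behaves, note that for a planar rotation Q by angle φ one has ‖I − Q‖_2 = 2 sin(φ/2), so small rotations of the subspaces correspond to small mismatch levels (the helper below is our own illustration, not from the paper):

```python
import numpy as np

def rotation_mismatch_level(phi):
    """Spectral norm ||I - Q||_2 for a 2-D rotation Q by angle phi."""
    Q = np.array([[np.cos(phi), -np.sin(phi)],
                  [np.sin(phi),  np.cos(phi)]])
    return np.linalg.norm(np.eye(2) - Q, ord=2)

level = rotation_mismatch_level(0.3)   # small rotation -> small mismatch level
```

This follows because (I − Q)(I − Q)^T = (2 − 2 cos φ) I, so every singular value of I − Q equals sqrt(2 − 2 cos φ) = 2|sin(φ/2)|.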
This example also provides a rationale for state-of-the-art feature extraction mechanisms where the signal classes are transformed via a linear operator Φ prior to classification. In particular, assume that Σ_1 and Σ_2 correspond to the covariances of signals in classes 1 and 2 after the transformation Φ: the example suggests that the operator Φ should transform the signal covariances so that δ_12 is small (i.e., so that the signals from classes 1 and 2 are close to orthogonal) in order to create robustness against mismatch. Such an approach is considered, for example, in [28], where signals are transformed by a matrix that promotes large principal angles between the subspaces. Note that the work in [28] is not motivated on the basis of robustness against mismatch, but rather by intuitive insight about the classification of signals that lie on subspaces.

IV. NUMERICAL RESULTS
We now show that our conditions for reliable classification in the low-noise regime are sharp, by revisiting Examples 1 and 2 presented in Section II-A. The model parameters and results are summarized in Table III.
Fig. 3 shows the estimated true error probability, which is obtained via simulation, the upper bound to the error probability given in Theorem 1, and the bound proposed in [23] (using the KL-divergence) as a function of σ². Note that the proposed upper bound to the error probability and the derived sufficient conditions give sharp predictions of an error floor, and also that the bound proposed in [23] always exhibits an error floor.
In case (a), condition (15) in Theorem 2 is not satisfied for (i, j) = (1, 2), i.e., im(W_12) = im([e_2, e_3]) ⊈ im(Ũ_21)^⊥ = im(e_3)^⊥; therefore, via Theorem 2, we conclude that the upper bound exhibits an error floor. The results in Fig. 3 show that in this case the true error probability also exhibits an error floor. In case (b), conditions (15) and (16) are satisfied and d > 0. Therefore, via Theorem 2, the upper bound to the error probability approaches zero, which also implies that the true error probability approaches zero in the low-noise regime.
For cases (c) and (d) the intuition is provided by Corollary 2, where in the case of one-dimensional subspaces the concept of principal angles simply reduces to the notion of the angle between two lines. In particular, in case (c) condition (25) in Corollary 2 is not satisfied for (i, j) = (1, 2), and we observe an error floor in the true error probability. On the contrary, in case (d) the conditions (25) in Corollary 2 are satisfied, which immediately implies perfect classification in the low-noise regime.
We now explore how different mismatched models affect the value of d. Consider the following 2-class example in R^6 with orthonormal basis vectors e_i, i = 1, . . . , 6, where the signal spaces are as follows, and various mismatched signal spaces are given in (32), (33) and (34). It is straightforward to verify that the sufficient conditions for perfect classification given by Theorem 2 hold for all three pairs of mismatched models (32), (33) and (34). Furthermore, one can also determine the values of d as 0.5, 1 and 1.5 for the mismatched models given by (32), (33) and (34), respectively, where the values of d do not depend on α_ij. As observed in Section III, a higher value of d implies higher robustness to noise. Simulation results of the true error probability and the values of the upper bounds given in Theorem 1 are plotted in Fig. 4. (In our simulations, signals are drawn independently from the true distribution and are classified by the MMAP classifier. In Fig. 4, solid black, blue and red lines correspond to the simulated error probabilities for the examples given by (32), (33) and (34), respectively, and dashed black, blue and red lines correspond to the upper bound given in Theorem 1 for the same examples.) One can observe that increasing values of d (associated with the upper bound to the error probability) correspond to a steeper decrease of the true error probability as σ² → 0. Moreover, the values of d obtained via the upper bound match the values of d obtained from the simulation of the true error probability for all the examples (32)-(34).
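The simulations described above follow a simple recipe: draw signals from the true class-conditional distributions, add noise at a given σ², classify with the mismatched model, and count errors. A self-contained sketch with a toy 2-class problem (the diagonal covariances below are illustrative stand-ins, not the models of (32)-(34)):

```python
import numpy as np

rng = np.random.default_rng(1)

def mc_error_prob(covs_true, covs_mis, sigma2, n_trials, rng):
    """Monte Carlo estimate of P(e) for an MMAP classifier with equal priors.
    Assumes diagonal true covariances for simple signal sampling."""
    C = len(covs_true)
    N = covs_true[0].shape[0]
    errors = 0
    for _ in range(n_trials):
        c = rng.integers(C)                               # true class
        std = np.sqrt(np.diag(covs_true[c]))              # per-dimension std
        x = std * rng.standard_normal(N)                  # signal
        y = x + np.sqrt(sigma2) * rng.standard_normal(N)  # noisy observation
        scores = []
        for S in covs_mis:                                # mismatched model
            M = S + sigma2 * np.eye(N)
            _, logdet = np.linalg.slogdet(M)
            scores.append(-0.5 * logdet - 0.5 * y @ np.linalg.solve(M, y))
        errors += int(np.argmax(scores) != c)
    return errors / n_trials

# well-separated classes with mild mismatch: error decays as sigma2 -> 0
covs_true = [np.diag([1.0, 0.0, 0.0]), np.diag([0.0, 1.0, 0.0])]
covs_mis  = [np.diag([0.8, 0.0, 0.0]), np.diag([0.0, 1.2, 0.0])]
p_hi = mc_error_prob(covs_true, covs_mis, sigma2=0.5,   n_trials=2000, rng=rng)
p_lo = mc_error_prob(covs_true, covs_mis, sigma2=0.001, n_trials=2000, rng=rng)
```

With the two class subspaces orthogonal and only a mild scaling mismatch, the estimated error probability drops sharply as σ² decreases, consistent with the absence of an error floor.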

V. APPLICATIONS
We finally show how our theory can also capture the impact of mismatch on classification performance in applications involving real-world data. We consider a motion segmentation application, where the goal is to segment a video into multiple rigidly moving objects, and a hand-written digit classification application. In both tasks we concentrate on a supervised learning approach, in which we are given a number of labeled samples used to estimate the model (training set) and a number of unlabeled samples that we want to classify (testing set). Our aim is to determine the minimum size of the training set needed to guarantee reliable classification of the testing set.

A. Datasets
For the motion segmentation task we use the Hopkins 155 dataset [29], which consists of video sequences with 2 or 3 motions in each video. The motion segmentation problem is usually solved by extracting feature points from the video and tracking their positions over different frames. In more detail, in this application, observation vectors y are obtained by stacking the coordinate values associated with a given feature point across different frames, and the objective of motion segmentation is to classify each feature point as belonging to one of the moving objects in the video [10].
Theoretical results show that the feature point trajectories belonging to a given motion lie approximately on a 3-dimensional affine space or a 4-dimensional linear space [10]-[12]. We validate this empirically by observing the decay of the singular values of the data matrix associated with a given motion, which is shown in Fig. 5 (a). Note that the singular values are close to zero for singular value indices greater than 4.
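The rank-validation step described above can be sketched as follows; the trajectory matrix here is synthetic (columns lying near a random 4-dimensional subspace of R 58 plus small noise), standing in for the actual Hopkins data:

```python
import numpy as np

# Synthetic stand-in for one motion's trajectory matrix: columns lying
# near a 4-dimensional subspace of R^58, plus small noise (the actual
# Hopkins data is not used here).
rng = np.random.default_rng(0)
N, n, r = 58, 200, 4
U = np.linalg.qr(rng.standard_normal((N, r)))[0]   # orthonormal basis, rank 4
Y = U @ rng.standard_normal((r, n)) + 1e-3 * rng.standard_normal((N, n))

s = np.linalg.svd(Y, compute_uv=False)
s = s / s[0]                       # normalized singular values, as in Fig. 5
print(np.round(s[:6], 4))          # the first 4 dominate, the rest collapse
```

The sharp drop after the fourth normalized singular value is the signature of an (approximately) rank-4 model.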
For the experiment we consider a video with 3 motions (denoted as "1RT2RCR" in the dataset), where the numbers of samples in class 1, class 2 and class 3 are 236, 142 and 114, respectively. This video was picked because it offers the maximal number of feature points (samples) per motion. The ranks of the true and the mismatched covariances are always set to 4. We also split the dataset samples randomly into a training set and a testing set, where the training set contains n max = 90 samples per class.
For the hand-written digit classification task we use the MNIST dataset [30], which consists of 28 × 28 grey scale images of hand-written digits between 0 and 9.We obtain observation vectors y by vectorizing the images.
The decay of the singular values associated with the data matrix of MNIST digits is shown in Fig. 5 (b). Note that the singular values do not approach zero as fast as in the case of the Hopkins dataset. We can argue that the classes in the MNIST dataset are only "approximately low-rank", i.e. the covariance matrix associated with class i can be expressed as Σ i = Σ̃i + δI, where Σ̃i is low-rank and δ > 0 accounts for the deviation from the perfectly low-rank model. In view of the presented signal model, this can be interpreted as classification of signals with low-rank covariance matrix Σ̃i at finite σ 2 = δ. The sufficient conditions for perfect classification in the case of the "approximately low-rank" model then predict the number of training samples required to achieve the best possible error rate for the given classification problem.
The ranks of the true and the mismatched covariances are always set to 20 in the experiments. This rank captures approximately 90% of the energy of the signals. The split into training and testing sets is provided by the MNIST dataset, where the training set contains approximately n max = 5000 samples per class.
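The rank-selection rule implied above (the smallest rank capturing a target fraction of the signal energy) can be sketched as follows; the data matrix is a random stand-in with a decaying spectrum, not the actual MNIST data:

```python
import numpy as np

# Smallest rank capturing 90% of the signal energy, computed from the
# cumulative squared singular values; the data matrix is a random
# stand-in with a decaying spectrum, not the actual MNIST data.
rng = np.random.default_rng(1)
X = rng.standard_normal((784, 500)) * np.linspace(3.0, 0.1, 784)[:, None]

s = np.linalg.svd(X - X.mean(axis=1, keepdims=True), compute_uv=False)
energy = np.cumsum(s**2) / np.sum(s**2)
r = int(np.searchsorted(energy, 0.90)) + 1   # smallest r with >= 90% energy
print(energy[r - 1] >= 0.90)                 # True
```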

B. Methodology
We obtain the class-conditioned covariance matrices by retaining only the first r principal components of the estimated covariances obtained via the maximum likelihood (ML) estimator (which is equivalent to computing the empirical covariance matrix) for each class. The covariance matrix associated with the "true model" of class i is obtained by estimating the covariance matrix on all available data samples of class i, and the covariance matrices associated with the "mismatched model" of class i are obtained by estimating the covariance matrix on n i data samples of class i.
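A minimal sketch of this estimator under the paper's zero-mean model (the function name and the toy dimensions are ours):

```python
import numpy as np

def rank_r_covariance(X, r):
    """ML (empirical) covariance of the columns of X under the zero-mean
    model, truncated to its first r principal components."""
    S = (X @ X.T) / X.shape[1]
    w, V = np.linalg.eigh(S)           # ascending eigenvalues
    idx = np.argsort(w)[::-1][:r]      # indices of the r largest
    return (V[:, idx] * w[idx]) @ V[:, idx].T

rng = np.random.default_rng(2)
X = rng.standard_normal((10, 300))
S_r = rank_r_covariance(X, 4)
print(np.linalg.matrix_rank(S_r))      # 4
```

The truncation is what produces the low-rank covariances on which the mismatched classifier operates.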
Results are produced as follows: in each run, n i samples are drawn at random from the training set for various values of n i , i = 1, . . ., C, and the signal covariances are estimated. The error rate of the MMAP classifier is then evaluated on the testing set. At the same time, we also determine whether the sufficient conditions for perfect classification in Theorem 2 hold. We run 1000 experimental runs with the Hopkins dataset, where in each run the dataset is split at random into training and testing sets. We run 20 experimental runs with the MNIST dataset, where in each run the draw of the n i samples from the training set is random for i = 1, . . ., C.
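The evaluation step can be sketched as follows, assuming the MMAP rule assigns each sample to the class maximizing log p i + log N(y; 0, Σ̃i + σ 2 I), consistently with the paper's zero-mean Gaussian model (the function name, interface and toy subspaces below are ours):

```python
import numpy as np

def mmap_classify(Y, covs, priors, sigma2):
    """Assign each column of Y to the class i maximizing
    log p_i + log N(y; 0, S_i + sigma2*I), i.e. the (possibly mismatched)
    MAP rule under the zero-mean Gaussian model.  Name and interface
    are ours, not the paper's."""
    N = Y.shape[0]
    scores = []
    for p, S in zip(priors, covs):
        C = S + sigma2 * np.eye(N)
        _, logdet = np.linalg.slogdet(C)
        quad = np.sum(Y * np.linalg.solve(C, Y), axis=0)  # y^T C^{-1} y per column
        scores.append(np.log(p) - 0.5 * (logdet + quad))
    return np.argmax(np.stack(scores), axis=0)

# Toy check: two rank-1 classes on orthogonal lines in R^4, small noise.
rng = np.random.default_rng(3)
e = np.eye(4)
covs = [np.outer(e[:, 0], e[:, 0]), np.outer(e[:, 1], e[:, 1])]
Y0 = e[:, [0]] * rng.standard_normal(100) + 1e-2 * rng.standard_normal((4, 100))
acc = np.mean(mmap_classify(Y0, covs, [0.5, 0.5], 1e-4) == 0)
print(acc)
```

With well-separated subspaces and small noise, nearly all class-0 samples are recovered, matching the low-noise regime studied in the paper.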
The particular choice of samples in the training set can lead to high variability in the mismatched models, especially for a small number of training samples. Therefore, in the following, we report the results as follows:
• we state that analysis predicts reliable classification if the sufficient conditions in Theorem 2 hold with probability p p over the different experimental runs;
• we state that simulation predicts reliable classification if the true error probability is 0 with probability p p over the different experimental runs;
• if the simulated error rate exhibits an error floor, we report the worst-case error rate with probability p p : the error rate that is achieved with probability at least p p over all experimental runs.

C. Results
The results for the Hopkins dataset are reported in Fig. 6. We observe that the phase transition predicted by the analysis approximates reasonably well the phase transition obeyed by the simulation. In particular, we can use our theory to gauge the number of training samples required for perfect classification in the low-noise regime. As expected, we also observe that larger values of p p give more conservative estimates of the required number of training samples. This holds for both simulation and analysis.
We also observe that identical trends hold for other values of n 3 . In particular, for n 3 < 30 neither simulation nor analysis shows a phase transition (these experiments are not reported in view of space limitations). In contrast, for n 3 ≥ 30 both simulation and analysis predict a phase transition in the error probability.
The results for the MNIST dataset are reported in Fig. 7. Note that the number of training samples per class is the same for all classes, i.e. n i = n, i = 1, . . ., C.
In contrast to the results with the Hopkins dataset, the error rate obtained on the MNIST dataset exhibits an error floor.
However, we observe that the worst-case error rate decreases with the number of training samples and reaches the error floor for a sufficiently large number of training samples. We also observe that the phase transition obtained via Theorem 2 predicts reasonably well the number of training samples needed to reach the error floor.
Finally, note that real data are not drawn from Gaussian distributions nor lie on perfect linear subspaces (the two main assumptions underlying our analysis). Nevertheless, we have shown that the derived bound has practical value even when these two assumptions do not hold strictly.

VI. CONCLUSION
This paper studies the classification of linear subspaces with mismatched classifiers, i.e. classifiers that operate on a mismatched version of the signal parameters in lieu of the true signal parameters. In particular, we have developed a low-noise expansion of an upper bound to the error probability of such a mismatched classifier. This expansion equips one with a set of sufficient conditions, which are a function of the geometry of the true signal distributions, the geometry of the mismatched signal distributions, and their interplay, to understand whether it is possible to classify reliably in the presence of mismatch in the low-noise regime.
Such sufficient conditions are shown to be sharp in the sense that they can predict the presence (and the absence) of a classification error floor both in experiments involving synthetic data as well as experiments involving real data.These conditions have also been shown to gauge well the number of training samples required for reliable classification in a motion segmentation application using the Hopkins 155 dataset and a hand-written digit classification application using the MNIST dataset.
Overall, we argue that our conditions can also be used as a proxy to develop linear feature extraction methods that are robust to mismatch.In particular, our study suggests that such methods ought to orthogonalize the different classes as much as possible in order to tolerate model mismatch.This intuition has been pursued in recent state-of-the-art linear feature extraction methods.

A. Preliminaries
We introduce additional quantities and Lemmas that are useful for the proofs.
a) Quantities: We define the projection operators:

where U i , Ũi and Ũ ij are given as in Section II-A. In addition to the bases U i and Ũi for im(Σ i ) and im( Σ̃i ), respectively, we also introduce bases for ker(Σ i ) and ker( Σ̃i ), denoted U ⊥ i ∈ R N ×(N −ri ) and Ũ⊥ i ∈ R N ×(N −ri ) , respectively. We define the projection operators onto these subspaces:

Fig. 7. The worst-case error rate and the phase transition predicted via Theorem 2 for a given probability pp, for the classification of MNIST digits (panel (c): classification of 10 digits; axes: training samples per class n versus worst-case error rate). Solid red and blue lines correspond to worst-case error rates for pp = 0.9 and pp = 1, respectively. Dashed vertical lines denote the phase transition predicted via Theorem 2 for pp = 0.9 (red) and pp = 1 (blue).
We also define

and write

where we have used the facts that P i + K i = I, P̃i + K̃i = I and P̃i − P̃j = P̃ ij − P̃ ji . The last equality follows directly from the definitions of P̃ ij and P̃ ji and the definitions of Ũ ij , Ũ ji and Ũ∩ ij given in Section II-A:

Finally, we present a decomposition of x ∈ R N . We write

where

for some vectors

b) Lemmas: Lemma 1: The following equality holds:
Proof: By leveraging the definition of P ij in (36) we have

Lemma 2: The following statement holds:

Proof: First, note that

Then we write

Note that the singular values of (V ij ) T Ũ ij and (V ij ) T Ũ ji correspond to the cosines of the principal angles between im(V ij ) and im( Ũ ij ), and between im(V ij ) and im( Ũ ji ), respectively. We then consider the SVDs

where the dimensions of the matrices H ij , H ji , D ij , D ji , J ij and J ji follow from the dimensions of V ij , Ũ ij and Ũ ji as shown in (9). We can now express (51) as

It is straightforward to see that (50) implies (51).

Lemma 3: The following equalities and inequalities hold:

Proof: The inequality in (57) is due to the fact that x ∈ im(L i ) = im(U i ) and 1/(λ i 1 + 1) is a lower bound to the minimum positive eigenvalue of L i . The inequality in (58) is due to the fact that L̃j is positive semidefinite and that 1/ λ̃ i ri is an upper bound to the largest eigenvalue of L̃i . The equality in (59) follows from the definition of the projector K i .
Lemma 4: Assume that im(W ij ) ⊆ im( Ũ ji ) ⊥ and (60)

Denote by c 0 the smallest eigenvalue of (62)

Proof: Note that (49) implies x W ∈ im(W ij ) = im(Σ i ) ∩ ker( P̃ ij ), and condition (61) also implies x W ∈ ker( P̃ ji ) = im( Ũ ji ) ⊥ . Then, we can write

and we note that condition (61) implies the lower bound

Moreover, all the eigenvalues of , and, on leveraging the Cauchy-Schwarz inequality, we also have

B. Proof of Theorem 1
We prove Theorem 1 by using the fact that the unit step function satisfies u(x) ≤ exp(αx) for all x and any α > 0, and by leveraging the union bound.
Recall from (7) that the error probability associated with the MMAP classifier can be expressed as

where P (e|c = i) = P (ĉ ≠ i|c = i) is the error probability for signals in class i. Via the union bound, we can state that

where

P (ĉ = j|c = i) = ∫ p(y|c = i) · u( log [ p̃j p̃(y|c = j) / ( p̃i p̃(y|c = i) ) ] ) dy . (68)

We denote P (ĉ = j|c = i) = P (e ij ). Now, by letting α ij > 0 ∀i ≠ j, we can upper bound the step function to obtain

P (e ij ) ≤ ∫ p(y|c = i) · exp( α ij log [ p̃j p̃(y|c = j) / ( p̃i p̃(y|c = i) ) ] ) dy (69)

where we recall

If Σ ij ≻ 0 ∀i ≠ j, then the integral in (69) converges ∀i ≠ j. Therefore, we can bound the error probability as follows:

where

If ∃i ≠ j such that Σ ij is not positive definite, then the integral in (69) does not converge. Therefore, we trivially set the upper bound to the error probability to P (e) = 1.
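The two ingredients of the proof, the exponential bound on the step function and the resulting Chernoff-type bound on P (e ij ), can be checked numerically on a toy one-dimensional example of our choosing:

```python
import numpy as np

# Monte Carlo check of the Chernoff-type step on a toy 1-D example of
# our choosing: f_i = N(0, 1), f_j = N(0, 4), equal priors, so that
# P(log ratio >= 0 | c = i) <= E[exp(alpha * log ratio) | c = i].
rng = np.random.default_rng(4)
y = rng.standard_normal(200_000)                    # samples from f_i
log_ratio = -0.5 * np.log(4) - y**2 / 8 + y**2 / 2  # log f_j(y) - log f_i(y)
alpha = 0.5
p_err = np.mean(log_ratio >= 0)        # pairwise error probability P(e_ij)
bound = np.mean(np.exp(alpha * log_ratio))
print(p_err <= bound < 1)              # True: the bound holds and is non-trivial
```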

C. Proof of Theorem 2
The proof is presented in two parts. First, we establish sufficient conditions for Σ ij ≻ 0; second, we establish conditions under which the upper bound to the probability of misclassification approaches zero as the noise approaches zero.
1) Positive Definiteness of Σ ij : The following two Lemmas give sufficient conditions for Σ ij ≻ 0.
Lemma 5: where c 0 is the smallest eigenvalue of

Proof: To show this we first produce the lower bound

where we have used the equalities and inequalities (57)-(59) and (62). Now, by using standard algebraic manipulations, it is possible to show that the choice (74) leads to C ≻ 0, hence (75) holds.

Lemma 6: Then

Proof: We prove the Lemma by constructing the lower bound

by using the equalities and inequalities (57)-(59) and (62), and by noting that x V = 0. The choice (81) then leads to (82).
2) Low-noise Expansion: To obtain the low-noise expansion of the upper bound to the error probability, we first present two supporting Lemmas.
Lemma 7: Assume that condition (72) given in Lemma 5 holds. Assume also that s V ij > 0 and (73) and (74) given in Lemma 5 hold, or that s V ij = 0 and (81) given in Lemma 6 holds. Then K ij ⪰ 0 and rank(K ij ) = N + s V ij − r i .

Proof: Assume that (72), s V ij > 0, (73) and (74) are satisfied. By definition, im(W ij ) = im(Σ i ) ∩ ker( P̃ ij ) and, as a consequence of (72), it also holds that im(W ij ) ⊆ ker( P̃ ji ), which leads to im( ). Namely, by leveraging the equality in (59) and the inequality in (62), we can write

where x ⊥ and x V have been defined in (47) and (48). If α ij < c 0 /(c 0 + 1), then the right-hand side of (85) is strictly positive unless x = 0. Then, since condition (74) implies α ij < c 0 /(c 0 + 1), we can conclude that , where we have used the fact that the eigenvalues of P ji − P̃ ji are contained in the interval [−1, 1]. Since (81) implies α ij < 1, we conclude, via an argument similar to that in the previous paragraph, that

Lemma 8: Assume that condition (72) given in Lemma 5 holds. Assume also that s V ij > 0 and (73) and (74) given in Lemma 5 hold, or that s V ij = 0 and (81) given in Lemma 6 holds. Then, as σ 2 → 0, we can write

where r Kij = rank(K ij ), and v ij is given as

Proof: Note first that the sufficient conditions imply K ij ⪰ 0 via Lemma 7. We can write the eigenvalue decomposition of K ij :

where

Now, we can write

where

We also denote by E i1...im the principal submatrix of order N − m obtained by deleting the rows and columns i 1 , . . ., i m of the matrix E.
Note that E i1...im = P T i1...im E P i1...im , where the matrix P i1...im ∈ R N ×(N −m) is obtained by picking all the columns of the identity matrix whose indices differ from i 1 , . . ., i m . Then, the Poincaré separation theorem [32, Corollary 4.3.37] guarantees that the eigenvalues of E i1...im are bounded by the minimum and maximum eigenvalues of E, which correspond to the minimum and maximum eigenvalues of L ij . Moreover, as σ 2 → 0, while the diagonal elements of (1/σ 2 ) Λ Kij grow unbounded, the eigenvalues of L ij , and therefore also the determinant of E i1...im , remain bounded.
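The Poincaré separation step can be illustrated numerically on a random symmetric matrix of our choosing:

```python
import numpy as np

# Eigenvalues of any principal submatrix of a symmetric matrix E lie
# between lambda_min(E) and lambda_max(E) (Poincare separation theorem,
# [32, Corollary 4.3.37]); E here is a random symmetric matrix.
rng = np.random.default_rng(5)
A = rng.standard_normal((6, 6))
E = (A + A.T) / 2
lmin, lmax = np.linalg.eigvalsh(E)[[0, -1]]
sub = np.delete(np.delete(E, [1, 4], axis=0), [1, 4], axis=1)  # drop rows/cols 1, 4
w = np.linalg.eigvalsh(sub)
ok = lmin <= w.min() and w.max() <= lmax
print(ok)  # True
```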
Then, we can use the determinant decomposition in [33, Theorem 2.3]:

where

and the summation is over all possible ordered subsets of m indices from the set {1, . . ., r Kij }. Otherwise, if r Kij < N :

where

Now we show that (87) holds. We first assume rank(K ij ) = N , take the right-hand side of (92) and multiply it by

Note now that for all S m , m = 1, . . ., N − 1, lim σ 2 →0 (σ 2 ) r Kij S m = 0 and lim σ 2 →0 (σ 2 ) r Kij |L ij | = 0. Therefore, (87) holds for the case rank(K ij ) = N . To show the derivation of v ij for the case rank(K ij ) < N we use the same technique, where we multiply the right-hand side of (94) by (1/σ 2 ) r Kij (σ 2 ) r Kij to get

As σ 2 → 0 we can write

This concludes the derivation of (87). Note also that v ij > 0, since the pseudo-determinant and the determinants in (87) are strictly positive.
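The pseudo-determinant appearing in the expansion above (the product of the nonzero eigenvalues of a PSD matrix) can be sketched as follows; the function name, tolerance and example matrix are ours:

```python
import numpy as np

def pseudo_det(M, tol=1e-10):
    """Pseudo-determinant of a symmetric PSD matrix: the product of its
    nonzero eigenvalues (function name and tolerance are ours)."""
    w = np.linalg.eigvalsh(M)
    return float(np.prod(w[w > tol]))

# Rank-2 example: eigenvalues {3, 2, 0}, so the pseudo-determinant is 6.
M = np.diag([3.0, 2.0, 0.0])
print(pseudo_det(M))  # 6.0
```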
We now provide the low-noise expansion of the upper bound to the probability of misclassification.
Assume that the sufficient conditions for positive definiteness of Σ ij ∀i ≠ j do not hold. Then the upper bound to the probability of error is chosen to be P (e) = 1, so that in general it does not tend to zero as σ 2 tends to zero.
Assume now that the sufficient conditions for Σ ij ≻ 0 given in the first part of the proof hold ∀i ≠ j. Then the upper bound to the probability of misclassification can be written as follows: 9

P (e) = Σ i Σ j≠i p i ( p̃j / p̃i ) … (98)

We now produce a low-noise expansion of (98) in order to understand whether or not lim σ 2 →0 P (e) = 0. The following low-noise expansions are trivial:

The low-noise expansion of |Σ ij | is more involved and is provided in Lemma 8. It then follows immediately that the low-noise expansion of each term in the upper bound to the probability of error in (98) is given by

where

Assume s V ij > 0 ∀(i, j), i ≠ j, and

d min (U i , Ũi ) < d max (U i , Ũj ) ∀(i, j), i ≠ j . (106)

Note that d min (U i , Ũi ) < d max (U i , Ũj ) implies

where we have used the result of Lemma 2 in Appendix A. By taking x ∈ W ij or x ∈ V ij it is straightforward to show that (108) implies (15) and (16), thus obtaining conditions identical to those in Corollary 1.

E. Proof of Corollary 3
We prove the corollary by showing that in the diagonal case (16) always holds. Note first that

It is also straightforward to establish that (16) holds if and only if im(V ij ) ⊆ im( Ũ ji ) ⊥ , and this always holds since im(V ij ) ⊆ im( Ũ ij ) and im( Ũ ij ) ⊆ im( Ũ ji ) ⊥ .

F. Derivation of Example 3
We prove statement (30) by showing that, together with s 12 , s 21 > 0, it implies the sufficient conditions for perfect classification in Corollary 2. Assume U i and U j are given and the singular values of (U i ) T U j are known. We also know that Ũj = Q j U j . We can write

On leveraging [34, Theorem 1], we can state that the i-th singular value d̃i associated with (U i ) T Ũj lies in the interval , where d i is the i-th singular value of (U i ) T U j . Then, we can write the upper bound

where the first inequality follows from the SVD separation theorem [35, Theorem 2.2]. Note also that

where the singular values of (U i ) T Ũi are bounded by 1 ± ‖(U i ) T (Q i − I)U i ‖ 2 . By leveraging (111) we can further bound the singular values as 1 ± ε i .
Note now that 1 − δ 12 > ε 1 + ε 2 if and only if 1 − ε 1 > δ 12 + ε 2 , which implies

and is also equivalent to

where max k cos θ k ((U 1 ) T Ũ2 ) denotes the cosine of the smallest principal angle between im(U 1 ) and im( Ũ2 ), and min k cos θ k ((U 1 ) T Ũ1 ) denotes the cosine of the largest principal angle between im(U 1 ) and im( Ũ1 ). The equivalence between (113) and (114) follows directly from the definitions of the min and max correlation distances. It is now easy to verify that 1 − ε 1 > δ 12 + ε 2 implies (114), since 1 − ε 1 is a lower bound to the cosine of the largest principal angle between U 1 and Ũ1 , and δ 12 + ε 2 is an upper bound to the cosine of the smallest principal angle between U 1 and Ũ2 .
Finally, the same arguments can be used to show that d min (U 2 , Ũ2 ) < d max (U 2 , Ũ1 ).This concludes the proof.

Fig. 1. System model.

Each singular value d l corresponds to the cosine of the principal angle θ l between Y and Z, i.e., d l = cos(θ l ) [25, Chapter 8.7].

Fig. 2. The two plots illustrate the decision regions associated with the 2-class MMAP classifier for different values of U 1 , U 2 , Ũ1 and Ũ2 in the limit σ 2 → 0. Transparent blue and red regions indicate the decision regions where the MMAP classifier outputs class labels 1 and 2, respectively. Blue lines represent the signal subspace im(U 1 ) and red lines represent the signal subspace im(U 2 ). The dashed blue line represents the mismatched signal subspace im( Ũ1 ). The subspace bases are given in Example 2.

Fig. 4. Black, blue and red lines correspond to the simulated error probabilities for the examples given by (32), (33) and (34), respectively. Dashed black, blue and red lines correspond to the upper bound given in Theorem 1 for the examples given by (32), (33) and (34), respectively.

Fig. 3. Simulation results for the examples in Table 1. In all plots, the black line corresponds to the true error probability P (e) obtained via simulation, the red line corresponds to the proposed upper bound to the error probability P (e) given in Theorem 1, and the dashed orange line corresponds to the upper bound in [23] (with KL-divergence).

Fig. 5. Normalized singular values of data matrices corresponding to: (a) motions in the Hopkins dataset and (b) digits in the MNIST dataset. For the Hopkins dataset only the first 15 out of 58 singular values are shown. For the MNIST dataset only the first 200 out of 784 (= 28 × 28) singular values are shown, for the first 3 classes.