$\ell_1$ least squares for sparse high-dimensional LDA

Abstract: This paper studies high-dimensional linear discriminant analysis (LDA). First, we review the $\ell_1$ penalized least squares LDA proposed in [10], which circumvents estimation of the troublesome high-dimensional covariance matrix. We then establish detailed theoretical analyses of this sparse LDA. Specifically, we prove that the penalized estimator is $\ell_2$ consistent in the high-dimensional regime and that the misclassification error rate of the penalized LDA is asymptotically optimal under a set of reasonably standard regularity conditions. These theoretical results complement those of [10]; together, they give a fuller understanding of the $\ell_1$ penalized least squares LDA (also called the Lassoed LDA).


Introduction
Classification problems are important in many fields such as pattern recognition and bioinformatics. There are several classic classification methods, including LDA (linear discriminant analysis), logistic regression, naive Bayes, and the SVM (support vector machine). LDA is very popular due to its simplicity, robustness, and good performance in practice. When the number of features (or predictors), denoted by p, is fixed, LDA is provably optimal under some regularity conditions (see standard statistical textbooks such as [1]).
However, modern technology makes it easy to obtain data with a very large number of features, which makes applying LDA in practice challenging. One problem is that when p is larger than n (the number of observations), the sample covariance matrix of the predictors behaves very poorly; it is not even invertible. [2] pointed out that LDA performs poorly in this regime and can even do no better than random guessing when p is large. Many researchers have noticed that in high-dimensional classification problems it is critical to have a "good" estimate of the covariance matrix. [2] showed that one can use a diagonal matrix instead, following the idea of naive Bayes and assuming that the features are independent. [7] pointed out that even if a "naive Bayes" (or independence) rule is used, the performance of LDA is still poor when too many features enter the model, so they proposed using only a small number of selected features. The assumption that the features are independent, however, is rather restrictive. [15] proposed a covariance-regularized method that estimates the covariance matrix by shrinkage. [12] studied sparse LDA by assuming that both the covariance matrix and the difference between the mean vectors of the two classes are sparse. Let Σ be the shared covariance matrix and δ be the difference between the mean vectors. [6] proposed a sparse LDA by directly assuming that $\Sigma^{-1}\delta$ is sparse; their method circumvents estimation of the inverse covariance matrix altogether.
We would like to emphasize that there is a clean connection between LDA and least squares, first established by [8]. Using this connection, one can in fact solve the LDA problem by directly using vanilla least squares. The main contributions of this paper are as follows.
1. Under the restricted eigenvalue condition on the covariance matrix and some other regularity conditions, the Lassoed LDA estimator (defined later) is proved to be $\ell_2$ consistent in the high-dimensional regime.
2. The misclassification error rate of the Lassoed discriminant analysis is shown to be asymptotically optimal.
We want to emphasize that the analysis of sparse LDA via the Lasso is quite different from the analysis of the Lasso in the sparse linear regression model. The main reason is that, in the usual Lasso problem, the response y is a linear combination of the predictors $X_1, \ldots, X_p$ plus additive noise. For LDA, although the least squares criterion $\min_{\beta} \|Y - X\beta\|_2^2$ is used, there is no functional relationship between the binary vector Y and the predictor matrix X.
The rest of this paper is organized as follows. In Section 2, we first give a brief review of the Lassoed discriminant analysis proposed in [10], and then establish the $\ell_2$ consistency of the Lasso estimator for sparse high-dimensional LDA problems. We also prove that the misclassification error rate of the Lassoed LDA is asymptotically optimal. We conclude in Section 3. All proofs are deferred to the appendix (Section 4).

Sparse LDA and its asymptotic properties
In this section, we first review the procedure of sparse LDA via the penalized least squares (the Lasso). We then establish the $\ell_2$ consistency of the Lasso for high-dimensional LDA problems under the restricted eigenvalue condition and some other mild regularity conditions; the restricted eigenvalue condition on the covariance matrix has been proved to be much weaker than the irrepresentable condition. Finally, we establish the asymptotic optimality of the corresponding misclassification error rate, in the sense that the Lassoed LDA tends to achieve the Bayes error.
Throughout this paper, we assume that $x_i^{(g)}$, $i = 1, \ldots, n_g$, are generated independently from normal distributions $N(\mu^{(g)}, \Sigma)$, $g = 1, 2$, which share the same covariance matrix $\Sigma$ but have different means $\mu^{(g)}$. Let $\mu = (\mu^{(1)} + \mu^{(2)})/2$ and $\delta = \mu^{(1)} - \mu^{(2)}$ be the average of and the difference between the two population means, $n = n_1 + n_2$ be the total sample size, $\Delta_p^2 = \delta^T\Sigma^{-1}\delta$ be the squared Mahalanobis distance between the two populations, and $\hat\mu^{(g)}$, $\hat\Sigma^{(g)}$ be the usual maximum likelihood estimates. All constants $C$ below are positive but may differ from one occurrence to another.

Review of sparse LDA
To make the paper self-contained, we first review LDA and explain the connection between LDA and least squares in the low-dimensional case. Then we briefly introduce the direct method for solving the sparse high-dimensional LDA problem proposed in [10], which exploits this connection between LDA and least squares.
The classic LDA approaches the classification problem via the Bayes rule: at the population level, with equal prior weight for each class, a new data point x is classified to g = 1 if and only if
$(x - \mu)^{T}\beta^{*} \ge 0, \qquad (2.1)$
where $\beta^* = \Sigma^{-1}\delta$ is called the Fisher classification direction. However, the population parameters in (2.1) are unknown in practice, so their plug-in estimates are used instead, yielding the sample LDA rule (2.2). The LDA rule (2.2) can be explained from another point of view. First, we assign a label to each data point: the label is 1 if the point comes from $N(\mu^{(1)}, \Sigma)$ and $-1$ if it comes from $N(\mu^{(2)}, \Sigma)$; in fact, any binary coding works. Then, for notational simplicity, we pool all of the data points together: let $z_i = x_i^{(1)}$ for $i = 1, \ldots, n_1$ and $z_{n_1+k} = x_k^{(2)}$ for $k = 1, \ldots, n_2$, so that the class label is $y_i = 1$ for $i = 1, \ldots, n_1$ and $y_{n_1+k} = -1$ for $k = 1, \ldots, n_2$. We also define centered versions of the labels and the design matrix: the centered class labels are the $y_i$ minus their overall mean, and $X = [x_1, x_2, \ldots, x_n]^{T} \in \mathbb{R}^{n \times p}$, where $x_i$ is $z_i$ minus the pooled sample mean, is called the centered design matrix.
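As a concrete illustration of this coding and centering step, here is a minimal Python sketch that builds the pooled sample, the ±1 labels, and their centered versions; the simulated model and all sizes below are hypothetical illustrative choices, not the paper's settings.

import numpy as np

rng = np.random.default_rng(0)
p, n1, n2 = 50, 40, 30                        # hypothetical sizes
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # an AR(1)-type covariance
delta = np.zeros(p); delta[:5] = 1.0          # a sparse mean difference mu1 - mu2
x1 = rng.multivariate_normal(delta / 2, Sigma, size=n1)    # sample from N(mu1, Sigma)
x2 = rng.multivariate_normal(-delta / 2, Sigma, size=n2)   # sample from N(mu2, Sigma)

# Pool the two samples and code the labels as +1 / -1 (any binary coding works).
Z = np.vstack([x1, x2])                       # rows z_1, ..., z_n
y = np.concatenate([np.ones(n1), -np.ones(n2)])

# Centered class labels and centered design matrix.
Y = y - y.mean()
X = Z - Z.mean(axis=0)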
Using the notation above, the following lemma gives the connection between LDA and least squares (see Chapter 4 of [9]), from which we can estimate the Fisher classification direction $\beta^*$ without separately estimating the mean vectors and the inverse of the covariance matrix.
Lemma 2.1. There exists a positive constant C such that $\hat\beta_{ols} = C\hat\beta_{lda}$.
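As a quick numerical sanity check of Lemma 2.1 (a minimal sketch; the dimensions, means, and covariance below are hypothetical, not the paper's settings), one can verify in a low-dimensional example that the OLS coefficients from regressing the centered labels on the centered predictors are elementwise proportional to the plug-in LDA direction $\hat\Sigma^{-1}(\hat\mu^{(1)} - \hat\mu^{(2)})$.

import numpy as np

rng = np.random.default_rng(1)
p, n1, n2 = 5, 200, 150                       # low-dimensional: p much smaller than n
Sigma = 0.6 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
x1 = rng.multivariate_normal(np.ones(p), Sigma, size=n1)
x2 = rng.multivariate_normal(np.zeros(p), Sigma, size=n2)

Z = np.vstack([x1, x2])
y = np.concatenate([np.ones(n1), -np.ones(n2)])
Y, X = y - y.mean(), Z - Z.mean(axis=0)

beta_ols = np.linalg.lstsq(X, Y, rcond=None)[0]            # least squares direction

mu1, mu2 = x1.mean(axis=0), x2.mean(axis=0)
S = ((x1 - mu1).T @ (x1 - mu1) + (x2 - mu2).T @ (x2 - mu2)) / (n1 + n2)  # pooled MLE covariance
beta_lda = np.linalg.solve(S, mu1 - mu2)                   # plug-in LDA direction

print(beta_ols / beta_lda)                                 # elementwise ratio: a constant vector

The printed elementwise ratio is a constant vector (up to floating-point error), i.e., $\hat\beta_{ols} = C\hat\beta_{lda}$.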
Lemma 2.1 gives a clean connection between LDA and least squares, though it only holds in the low-dimensional setting. The useful property of this connection is that it lets us circumvent estimation of the troublesome high-dimensional covariance matrix when extending LDA to the high-dimensional regime via least squares. [10] used this connection and proposed a direct approach to sparse discriminant analysis, defined as follows:
$(\hat\beta_0, \hat\beta(\lambda)) = \arg\min_{\beta_0 \in \mathbb{R},\, \beta \in \mathbb{R}^p} \frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - \beta_0 - z_i^{T}\beta\bigr)^2 + \lambda\|\beta\|_1. \qquad (2.3)$
[10] mainly showed that the Lasso estimator defined in (2.3), also called the Lassoed LDA estimator, can consistently identify the important features under the irrepresentable condition and some other regularity conditions. In the next subsection, we derive the $\ell_2$ consistency of the Lasso estimator for sparse high-dimensional LDA under the restricted eigenvalue condition on the covariance matrix, together with a set of reasonably standard mild conditions. The whole procedure of using the Lasso to solve sparse LDA is summarized in Algorithm 1 and illustrated in the sketch following this paragraph.
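The following minimal Python sketch carries out the whole pipeline; the simulated model, the sizes p, n1, n2, q, the tuning constant in λ, and the use of scikit-learn's Lasso are all illustrative assumptions, not the paper's implementation. It simulates two Gaussian classes, codes and centers the data as above, solves the $\ell_1$ penalized least squares problem, and classifies a new point with the estimated direction.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
p, n1, n2, q = 300, 75, 75, 5                 # high-dimensional illustration: p > n = n1 + n2
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
beta_star = np.zeros(p); beta_star[:q] = 1.0  # an assumed sparse Fisher direction beta*
delta = Sigma @ beta_star                     # mean difference, so that Sigma^{-1} delta = beta*
x1 = rng.multivariate_normal(delta / 2, Sigma, size=n1)
x2 = rng.multivariate_normal(-delta / 2, Sigma, size=n2)

# Step 1: pool, code labels as +1 / -1, and center (as in Section 2.1).
Z = np.vstack([x1, x2]); n = n1 + n2
y = np.concatenate([np.ones(n1), -np.ones(n2)])
Y, X = y - y.mean(), Z - Z.mean(axis=0)

# Step 2: Lassoed LDA estimator: l1-penalized least squares of Y on X.
# scikit-learn's Lasso minimizes (1/(2n))||Y - Xb||_2^2 + alpha*||b||_1, matching the scaling in (2.3).
lam = 0.5 * np.sqrt(np.log(p) / n)            # lambda of order sqrt(log p / n); the constant 0.5 is arbitrary
beta_hat = Lasso(alpha=lam, fit_intercept=False, max_iter=100000).fit(X, Y).coef_

# Step 3: classify a new point by the sign of (x - mu_hat)^T beta_hat (equal prior weights).
mu_hat = (x1.mean(axis=0) + x2.mean(axis=0)) / 2
x_new = rng.multivariate_normal(delta / 2, Sigma)           # a point drawn from class 1
label = 1 if (x_new - mu_hat) @ beta_hat > 0 else 2
print(label, np.flatnonzero(beta_hat))        # predicted class and selected features

In practice the constant in λ (or λ itself) would be chosen by cross-validation; the value above is only for illustration.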

$\ell_2$ consistency
This subsection focuses on the $\ell_2$ consistency of the Lassoed LDA estimator. Although $\hat\beta(\lambda)$ is not close to $\beta^*$ itself, it can be proved to be $\ell_2$ consistent for an intermediate vector $\tilde\beta$ defined in (2.4), which is a positive constant multiple of $\beta^*$ (see Lemma 2.2). Since rescaling the direction vector does not change the separating hyperplane, the linear separator based on $\tilde\beta$ is the same as that based on $\beta^*$.
Lemma 2.2. There exists a positive constant C such that $\tilde\beta = C\beta^*$. Specifically, C admits an explicit expression involving $\theta_n = n_1/n$, the sample-size proportion of the first class.
This result can be derived from [1]. In the following, we derive the conditions under which the $\ell_2$ consistency holds. Certainly, the Lassoed LDA cannot work for an arbitrary p. First, we need the sample size n, the number of predictors p, and the number of relevant features of the discriminant direction, $q := \#\{j : \beta^*_j \ne 0\}$, to satisfy condition (C1). When $\Delta_p^2$ is too small, that is, when the two populations $N(\mu^{(g)}, \Sigma)$ are too close, we cannot expect any classifier to perform well, since the Bayes rule is then essentially random guessing; this can be seen directly from (2.7) in the next subsection. So we need $\Delta_p^2$ to be bounded away from zero. We also need to balance the sample sizes of the two classes and to control the maximum eigenvalue of the covariance matrix; such requirements are commonly used in high-dimensional settings and are listed in (C2). The key condition for the $\ell_2$ consistency of Lasso-type estimators in the high-dimensional linear model is the restricted eigenvalue condition, which was first proposed in [3] and is stated in Definition 2.3.

Definition 2.3 (Restricted Eigenvalue Condition). The matrix $X \in \mathbb{R}^{n\times p}$ is said to satisfy the restricted eigenvalue condition $RE(\gamma, c_0)$ over the support set S if
$\frac{1}{n}\|X\nu\|_2^2 \ge \gamma\,\|\nu\|_2^2$ for all $\nu \ne 0$ with $\|\nu_{S^c}\|_1 \le c_0\,\|\nu_S\|_1$.
Several classes of matrices have been proved to satisfy the restricted eigenvalue condition with high probability ([11, 5]); for example, design matrices whose rows are drawn i.i.d. from a broad class of multivariate normal distributions. The restricted eigenvalue condition has also been proved to be nearly necessary for controlling the $\ell_2$ error in the minimax sense in high-dimensional regression. Now, under the conditions given above, we state the first result of this paper.
Theorem 2.4 bounds the distance between $\hat\beta(\lambda)$ and $\tilde\beta$: as long as (p, n, q) grow subject to condition (C1), $\hat\beta(\lambda)$ is $\ell_2$ consistent for $\tilde\beta$ with high probability, which is the first contribution of this paper. To prove Theorem 2.4, we first give a deterministic result in Lemma 2.5. For brevity, we use Y to denote the vector of centered class labels $y_i$, $i = 1, \ldots, n$ (defined below).
Note that the Lasso estimate $\hat\beta(\lambda)$ defined in Equation (2.3) is equivalent to the following Lasso problem without an intercept:
$\hat\beta(\lambda) = \arg\min_{\beta \in \mathbb{R}^p}\ \frac{1}{2n}\|Y - X\beta\|_2^2 + \lambda\|\beta\|_1,$
where Y and X collect the centered class labels and the centered observations $z_i$, respectively.
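This equivalence between the intercept version (2.3) and the intercept-free problem on centered data is easy to check numerically; the sketch below (with arbitrary illustrative data and an arbitrary penalty level) fits both problems with scikit-learn's Lasso and compares the coefficient vectors.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, p = 80, 120                                # illustrative sizes
Z = rng.normal(size=(n, p))                   # raw observations z_i
Z[:50, :3] += 1.0                             # shift "class 1" in the first three coordinates
y = np.where(np.arange(n) < 50, 1.0, -1.0)    # +1 / -1 class labels

lam = 0.1                                     # an arbitrary penalty level
with_int = Lasso(alpha=lam, fit_intercept=True, max_iter=100000).fit(Z, y)
Y, X = y - y.mean(), Z - Z.mean(axis=0)       # centered labels and design matrix
no_int = Lasso(alpha=lam, fit_intercept=False, max_iter=100000).fit(X, Y)

print(np.abs(with_int.coef_ - no_int.coef_).max())   # essentially zero: the two problems coincide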

Lemma 2.5. Suppose that X satisfies the restricted eigenvalue condition RE(γ, 3). Then, for any $\beta \in \mathbb{R}^p$ with $\beta_{S^c} = 0$ and any λ satisfying inequality (2.6), every Lasso estimator $\hat\beta(\lambda)$ satisfies the $\ell_2$, $\ell_1$ and prediction-error bounds (1), (2) and (3).
This is a purely deterministic result: no matter what the relationship between Y and Xβ is, the conclusions hold as long as inequality (2.6) and the restricted eigenvalue condition hold. We use this lemma to bound the distance between the Lasso estimator $\hat\beta(\lambda)$ and $\tilde\beta$. For ordinary linear regression, with β taken to be the true regression parameter, $Y - X\beta$ is the noise vector and is usually assumed to be (sub-)Gaussian, so it follows directly from Lemma 2.5 that one can choose a suitable λ such that inequality (2.6), and hence the results (1), (2) and (3), hold with high probability. For classification, however, Y is binary coded and $Y - X\tilde\beta$ is clearly not an i.i.d. noise vector, so there is no immediately obvious choice of λ satisfying inequality (2.6) with high probability. Fortunately, with the upper bounds on $(\mu^{(1)} - \mu^{(2)})^T\tilde\beta$ and $\tilde\beta^T\Sigma\tilde\beta$ given in the appendix, we can find a suitable λ such that inequality (2.6) holds with high probability.
Theorem 2.6. There is a positive constant C, depending on $\theta_n$, $\Delta_p^2$ and $\lambda_{\max}(\Sigma)$, such that with probability greater than $1 - O(p^{-1})$ inequality (2.6) holds whenever $\lambda \ge C\sqrt{\log p / n}$.
The right-hand side in Theorem 2.6 is of the same order as the λ used in linear regression, and the proof of Theorem 2.6 is the most difficult part of the paper. Once we have Theorem 2.6, together with Lemma 2.5 we immediately obtain the consistency results in Theorem 2.4. We postpone the proofs of Theorem 2.4, Lemma 2.5 and Theorem 2.6 to the appendix. Theorem 2.4 gives the $\ell_1$ and $\ell_2$ consistency of the penalized least squares LDA estimator, which, together with the variable selection property established in [10], ensures that $\hat\beta(\lambda)$ is a good estimator of the Fisher classification direction $\beta^*$. The next subsection shows that the misclassification error rate of this sparse LDA method is asymptotically optimal, which further supports using the Lassoed LDA in practice.

Asymptotic optimality
Recall that $W(x) = (x - (\hat\mu^{(1)} + \hat\mu^{(2)})/2)^T\hat\beta$ is the estimated discriminant function, where $\hat\beta$ can be either $\hat\beta_{lda}$ ($\hat\beta_{ols}$) or $\hat\beta(\lambda)$, and $U(x) = (x - \mu)^T\beta^*$ is the optimal (true but unknown) discriminant function. Being linear in x, U(x) and W(x) are both normally distributed no matter which class x belongs to. Through simple calculations we obtain the misclassification error R of U(x), namely the Bayes error $R = \Phi(-\Delta_p/2)$ in (2.7), and the misclassification probability $R_n$ of W(x) conditional on the training samples. It is well known that in the classic setting where p is fixed, Fisher's LDA asymptotically attains the optimal misclassification rate defined in (2.7). For the high-dimensional extension, a natural question is whether this optimality result still holds. We answer this question in this subsection under conditions (C1), (C2) and (C3), together with a new condition (C4).
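For completeness, here is the short, standard calculation behind the optimal rate in (2.7) under the stated Gaussian model with equal priors (notation as above):
\[
R \;=\; \Pr\bigl(U(x) \le 0 \mid x \sim N(\mu^{(1)}, \Sigma)\bigr),
\qquad
U(x) \sim N\Bigl(\tfrac{1}{2}\,\delta^{T}\beta^{*},\ \beta^{*T}\Sigma\beta^{*}\Bigr)
     = N\Bigl(\tfrac{\Delta_p^{2}}{2},\ \Delta_p^{2}\Bigr),
\qquad\text{so}\qquad
R \;=\; \Phi\Bigl(-\tfrac{\Delta_p}{2}\Bigr),
\]
using $\delta^{T}\beta^{*} = \delta^{T}\Sigma^{-1}\delta = \Delta_p^{2}$ and $\beta^{*T}\Sigma\beta^{*} = \Delta_p^{2}$; by symmetry the same rate is obtained when x comes from the second class, and $R \to 1/2$ (random guessing) as $\Delta_p \to 0$.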
The asymptotic optimality result for the misclassification rate is very similar to the results in [6], where the authors proposed a Dantzig-selector-type method for the sparse LDA problem. Owing to the similarity between the Lasso and the Dantzig selector, our results and proof techniques are also very similar: both bound functions of a few normal statistics with high probability. We borrow several techniques from [6].

Theorem 2.7. Under the same conditions as in Theorem 2.4, together with condition (C4) and condition (2.8), the misclassification rate of the Lassoed LDA is asymptotically optimal, in the sense that $R_n/R \to 1$, with probability greater than $1 - O(p^{-1})$.
[6] provided a similar result for their Dantzig-selector-type sparse LDA, which, expressed in the notation of the present paper, takes a slightly different form. That the two rates differ slightly is to be expected given the similarity of the Dantzig selector and the Lasso (see [3]). Note that $\Delta_p^2 = \delta^T\Sigma^{-1}\delta$ and $\beta^* = \Sigma^{-1}\delta$; as long as conditions (C2) and (C4) hold, we have $\Delta_p = O(\|\beta^*\|_2)$, as the one-line bound displayed below shows. When the entries of $\beta^*_S$ are bounded below and above, $\|\beta^*\|_2 = O(\sqrt{q})$, so that $\Delta_p = O(\sqrt{q})$. In this case, the asymptotic rates of the Dantzig-selector-type sparse LDA and the Lassoed LDA are exactly the same. The result in Theorem 2.7 is called asymptotically optimal, a definition that can be found in [12]. If condition (2.8) is relaxed, we can prove that this sparse LDA method is asymptotically sub-optimal, also in the sense of [12] (see Theorem 2.8).
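The step $\Delta_p = O(\|\beta^*\|_2)$ used above follows from the bounded maximum eigenvalue of $\Sigma$ assumed in (C2):
\[
\Delta_p^{2} \;=\; \delta^{T}\Sigma^{-1}\delta \;=\; \beta^{*T}\Sigma\,\beta^{*}
\;\le\; \lambda_{\max}(\Sigma)\,\|\beta^{*}\|_{2}^{2},
\qquad\text{hence}\qquad
\Delta_p \;\le\; \sqrt{\lambda_{\max}(\Sigma)}\;\|\beta^{*}\|_{2} \;=\; O\bigl(\|\beta^{*}\|_{2}\bigr).
\]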
The proofs of Theorems 2.7 and 2.8, both adapted from [6], are postponed to the appendix. As in the analysis following Theorem 2.7, condition (2.9) can be rewritten in a form that is contained in condition (C1); so in this case we can derive the asymptotic sub-optimality of the penalized least squares LDA without any further condition.

Conclusions
Efficient high-dimensional discriminant analysis is in great demand in today's real applications. Fisher's LDA is a fundamental method, and extending it to high dimensions is crucial. We studied the asymptotic properties of the Lassoed LDA estimator; these large-sample results support using this simple method to solve the LDA problem. [10] gives a variable-selection consistency result under the irrepresentable condition, and we now provide an $\ell_2$ consistency result under the restricted eigenvalue condition. These results look similar to those for linear regression, but the proofs here are considerably more involved because the response (class label) and the separator (hyperplane) do not have the straightforward stochastic relationship of the linear regression model.

Appendix
This appendix contains the technical proofs of our results. We first prove Theorem 2.4 using Lemma 2.5 and Theorem 2.6. Then we prove Lemma 2.5 and Theorem 2.6. Finally, we prove Theorems 2.7 and 2.8.

Proof of Theorem 2.4
Proof. When λ is chosen to be $c_\lambda\sqrt{\log p/n}$ for the same positive constant $c_\lambda$ as in Theorem 2.6, Theorem 2.6 implies that inequality (2.6) holds with probability greater than $1 - O(p^{-1})$. Then, by Lemma 2.5, we obtain the upper bounds (4.1) and (4.2) on the $\ell_1$ and $\ell_2$ norms of the difference between $\hat\beta(\lambda)$ and $\tilde\beta$. From inequalities (4.1) and (4.2), together with condition (C1), we immediately obtain the consistency of $\hat\beta(\lambda)$.

Proof of Lemma 2.5
Proof. Without risk of confusion, we write $\hat\beta$ for the Lassoed estimator $\hat\beta(\lambda)$. Note that $\hat\beta$ minimizes $\frac{1}{2n}\|Y - X\beta\|_2^2 + \lambda\|\beta\|_1$, so for any $\beta \in \mathbb{R}^p$ with $\beta_{S^c} = 0$ its objective value is no larger than that of β. Rearranging the terms and writing $\hat\beta = \beta + \nu$, where $\langle\cdot,\cdot\rangle$ denotes the inner product of two vectors, gives a basic inequality for ν; combined with the condition RE(γ, 3) on X, this yields the bounds (1), (2) and (3) stated in Lemma 2.5.

Three useful lemmas
Before giving the proof of Theorem 2.6, we first introduce three technical lemmas which are the main tools used in this subsection. First, the definition of a sub-exponential random variable and its corresponding concentration inequality are given.
Proof. By definition, we need to prove inequality (4.3) for all $|\lambda| < \frac{1}{4}$. Since $(X_1, X_2)$ follows a joint normal distribution, $X_2$ can be written as a linear function of $X_1$ plus an error term that has a normal distribution with mean 0 and variance $(1-\rho^2)\sigma_2^2$ and is independent of $X_1$. When $|\lambda| \le \frac{1}{4}$, inequality (4.4) holds by the moment generating function of the chi-square distribution, and given (4.4), to prove inequality (4.3) it suffices to verify the remaining bound.

Lemma 4.3 (Gaussian Concentration Inequality). Let X be a vector of n independent standard normal random variables and let $f : \mathbb{R}^n \to \mathbb{R}$ be an L-Lipschitz function. Then, for all $t > 0$,
$\Pr\bigl(|f(X) - \mathbb{E}f(X)| \ge t\bigr) \le 2\exp\bigl(-t^2/(2L^2)\bigr).$

Lemma 4.4 (Concentration Inequality for Sub-exponential).
For a sub-exponential random variable X with parameters (σ, b) and any $t \ge 0$,
$\Pr\bigl(|X - \mathbb{E}X| \ge t\bigr) \le 2\exp\Bigl(-\tfrac{1}{2}\min\bigl\{\tfrac{t^2}{\sigma^2},\, \tfrac{t}{b}\bigr\}\Bigr).$
The proofs of Lemma 4.3 and Lemma 4.4 can be found in [4]; we omit them for brevity. Now we begin to prove the main result of this section.

Proof of Theorem 2.6
Proof. Recall the definition of $\tilde\beta$ in (2.4). Multiplying both sides on the left by $(\mu^{(1)} - \mu^{(2)})^T$ and by $\tilde\beta^T\Sigma$, and rearranging the terms, we obtain expressions involving $\omega_n = \theta_n(1-\theta_n)\Delta_p^2$, which is bounded below by condition (C2). As a consequence, $(\mu^{(1)} - \mu^{(2)})^T\tilde\beta$ falls in (c, 2) for some c > 0, and $\tilde\beta^T\Sigma\tilde\beta$ is bounded above. We prove the theorem in three steps according to this decomposition, and draw the conclusion in step four.
Step one: From the normal distributions of $\hat\mu^{(1)}$ and $\hat\mu^{(2)}$, and by the independence of the two samples, we can characterize the distribution of their difference. Applying the Gaussian concentration inequality (Lemma 4.3) then gives the desired high-probability bound.

Step two: Decompose $\xi_2$ into three parts. We prove that $\|\xi_2\|_\infty$ tends to zero with high probability by showing that the infinity norms of all three parts tend to zero separately.
Part 1: Note that $\|\xi_1\xi_1^T\tilde\beta\|_\infty = \|\xi_1\|_\infty\,|\xi_1^T\tilde\beta|$, and $\xi_1^T\tilde\beta$ is a Gaussian random variable with mean zero and variance $\frac{n}{n_1 n_2}\tilde\beta^T\Sigma\tilde\beta$, which is bounded above.
Part 2: For each $j \in \{1, 2, \ldots, p\}$, the j-th element of $(\mu^{(1)} - \mu^{(2)})\xi_1^T\tilde\beta$ is a Gaussian random variable with mean zero and variance proportional to $\frac{n}{n_1 n_2}$, which is bounded above. Applying the Gaussian concentration inequality, we immediately obtain the corresponding tail bound, with $c_3 = \frac{16\lambda_{\max}(\Sigma)}{\theta_n(1-\theta_n)}$.
Part 3: From the analysis before step one, together with the result of step one, we obtain inequality (4.8), which, combined with inequalities (4.6) and (4.7), implies that
$\Pr\Bigl(\|\xi_2\|_\infty > (c_1 c_2 + c_3 + 2c_1)\sqrt{\tfrac{\log p}{n}}\Bigr) \le \tfrac{4}{p}.$