Hierarchical clustered multiclass discriminant analysis via cross-validation

Linear discriminant analysis (LDA) is a well-known method for multiclass classification and dimensionality reduction. However, in general, ordinary LDA does not achieve high prediction accuracy when observations in some classes are difficult to classify. This study proposes a novel cluster-based LDA method that significantly improves the prediction accuracy. We adopt hierarchical clustering, and the dissimilarity measure of two clusters is defined by the cross-validation (CV) value. Therefore, clusters are constructed such that the misclassification error rate is minimized. Our approach involves a heavy computational load because the CV value must be computed at each step of the hierarchical clustering algorithm. To address this issue, we develop a regression formulation for LDA and construct an efficient algorithm that computes an approximate value of the CV. The performance of the proposed method is investigated by applying it to both artificial and real datasets. We show, from both numerical and theoretical viewpoints, that the proposed method provides high prediction accuracy with fast computation.


Introduction
Linear discriminant analysis (LDA) is a well-known technique for multiclass classification problems based on the covariance structure of each class (Fisher, 1936; Rao, 1948; Fukunaga, 2013). Furthermore, it is a dimension reduction tool for understanding the positional relationship among multiple classes in high-dimensional data. The implementation of LDA is easy because it results in a generalized eigenvalue problem. LDA can also be extended to nonlinear classification through the kernel method (kernel Fisher discriminant analysis; Mika et al., 1999; Baudat and Anouar, 2000).
Sparse multiclass LDA (Clemmensen et al., 2011;Shao et al., 2011;Witten and Tibshirani, 2011b;Safo and Ahn, 2016) is a useful technique for interpreting the classification results for high-dimensional data. Wu et al. (2017) introduced a hybrid version of LDA and deep neural networks, and applied it to person re-identification.
In practice, observations in some classes are often easy to classify, whereas those in other classes are difficult to classify. Figure 1(a) shows 2D projections of 36-dimensional data points obtained by LDA of mouse consomic strain data (Takada et al., 2008). The number of classes is 30. Most species are similar; however, some species have different characteristics.
For example, classification of an input whose label is either "MSM" or "B6-11MSM" seems easy.
Meanwhile, inputs with other labels may be difficult to classify correctly; the data points in similar classes are not sufficiently separated in the two-dimensional space. Therefore, ordinary LDA may lead to a high misclassification error rate.
To address this issue, one can adopt the following two-stage procedure. In the first stage, we calculate a representative point in each class (e.g., the mean value) and apply standard cluster analysis (Ward's method, k-means clustering, etc.) to the representative points. In the second stage, we separately apply LDA to each cluster. With the two-stage procedure, each cluster has different classification boundaries, leading to more flexible boundaries compared with ordinary LDA. Thus, the misclassification error rate can be reduced. Figure 1(b) shows the dendrogram of Ward's method for the mean vectors of the classes for the mouse consomic data. The result suggests that three clusters can be constructed: "MSM", "B6-11MSM", and "other 28 classes".
Thus, one can create the following classification rule: when a new input is observed, we first classify it into one of these three clusters. If the input is classified into "other 28 classes", we then classify it into one of the 28 classes. A similar two-stage procedure based on linear and nonlinear discriminant functions has been proposed by Huang and Su (2013).
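The first stage of this idea can be sketched in a few lines: cluster the class mean vectors with Ward's method to obtain candidate clusters of classes. The setup below is a hypothetical toy example, not the mouse data, and the variable names are illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Toy setup (hypothetical): 6 class means falling into two distant groups.
means = np.vstack([rng.normal(0.0, 1.0, (3, 2)),
                   rng.normal(10.0, 1.0, (3, 2))])

# Stage 1: Ward's method on the class means defines the clusters of classes.
Z = linkage(means, method="ward")
metaclass = fcluster(Z, t=2, criterion="maxclust")  # one label per class
print(metaclass)  # classes 0-2 share one cluster, classes 3-5 the other

# Stage 2 (not shown): fit a separate LDA within each cluster of classes.
```

In the second stage, one LDA classifier would be fitted on the pooled observations of each cluster, and a top-level LDA would route a new input to a cluster first.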
However, improvement of the prediction accuracy is not guaranteed with ordinary clustering algorithms, such as Ward's method. In general, cluster analysis and classification have different purposes: cluster analysis finds collections of classes on the basis of the similarity of two classes (e.g., Euclidean distance or Mahalanobis distance), whereas LDA constructs the classification rule. Combining two methods with different purposes does not usually guarantee minimization of the overall error rate (e.g., Yamamoto and Terada, 2014; Kawano et al., 2015, 2018). Therefore, it is crucial to develop a clustering technique that aims to minimize the error rate. In this study, we propose a novel cluster-based LDA method to minimize the misclassification error rate. We adopt hierarchical clustering to construct clusters of classes. A crucial aspect is how to determine the clusters. Ordinary hierarchical clustering merges clusters on the basis of a dissimilarity measure between two clusters; however, such a measure does not aim to decrease the error rate. This study uses the leave-one-out cross-validation (CV) value because it is an unbiased estimator of the misclassification error rate (Lachenbruch, 1967). The CV value is computed for the following two-stage classification rule. In the first stage, we adopt LDA to create a classification rule that allocates the input to some cluster of classes. In the second stage, we again apply LDA within each cluster. The clusters are constructed such that the CV error is minimized.
The proposed approach involves a heavy computational load because the CV value must be computed at each step of the hierarchical clustering algorithm. Therefore, it is necessary to construct an efficient algorithm to compute the CV value. One can develop a regression formulation and use an efficient algorithm for ridge regression (see, e.g., Konishi and Kitagawa, 2008). Cawley and Talbot (2003) introduced an efficient algorithm using a regression formulation given by Xu et al. (2001); however, their method applies only to two-class LDA. Ye (2007) proposed a multivariate regression formulation for multiclass LDA; however, his method can be used only when the number of dimensions of the projected space, say D, is fixed. In other words, we cannot select the value of D. In practice, an analyst often needs to determine D for interpretation purposes (e.g., D = 2 for visualization). Moreover, the prediction accuracy is strongly dependent on D. Thus, Ye's (2007) approach is limited. In this study, we develop a novel regression formulation for multiclass classification in which the user can determine D. Accordingly, we derive an efficient algorithm that computes an approximate CV value. A theoretical justification for the approximation is also presented.
Monte Carlo simulation is conducted to investigate the performance of the proposed procedure. The usefulness of the proposed procedure is illustrated through the analysis of mouse consomic strain data. We provide an R package hclda for implementing our algorithm; it is available at https://github.com/keihirose/hclda.
The remainder of this article is organized as follows. Section 2 briefly reviews multiclass LDA. Section 3 formulates the proposed algorithm on the basis of a two-stage procedure with LDA. Section 4 derives an efficient algorithm for computing the CV value in multiclass LDA.
Section 5 discusses the effectiveness of our procedure on the basis of Monte Carlo simulation.
Section 6 presents real data analysis. Finally, Section 7 concludes the paper. Some technical proofs are deferred to the appendices.


Linear discriminant analysis

Let \bar{x}_j be the sample mean vector in class j and \bar{x} be the overall sample mean vector:

\bar{x}_j = \frac{1}{n_j} \sum_{i: y_i = j} x_i, \qquad \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i,

where n_j is the number of observations in class j. Let S_B be the between-class covariance matrix and S_W be the within-class covariance matrix:

S_B = \frac{1}{n} \sum_{j=1}^{J} n_j (\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})^T, \qquad S_W = \frac{1}{n} \sum_{j=1}^{J} \sum_{i: y_i = j} (x_i - \bar{x}_j)(x_i - \bar{x}_j)^T.

Through LDA, a p-dimensional predictor x is transformed into a D-dimensional space (D < \min(J - 1, p)) by a linear transformation matrix A, which is obtained by maximizing the ratio of the between-class covariance to the within-class covariance:

\hat{T} = \mathop{\arg\max}_{A} \; \mathrm{tr}\{(A^T S_W A)^{-1} A^T S_B A\}. \tag{1}

The optimization problem given by Eq. (1) results in a generalized eigenvalue problem; the d-th column of \hat{T} (d = 1, \dots, D) is given by the eigenvector corresponding to the d-th largest eigenvalue of the matrix S_W^{-1} S_B. The minimum eigenvalue of S_W is often small when p is large. Indeed, the inverse matrix of S_W does not exist when p > n. In such cases, one can use the ridge penalization S_{W,\delta} := S_W + \frac{\delta}{n} I, where \delta \geq 0 is a regularization parameter. The optimization problem is then expressed as

\hat{T}_\delta = \mathop{\arg\max}_{A} \; \mathrm{tr}\{(A^T S_{W,\delta} A)^{-1} A^T S_B A\}.

Now, we consider a multiclass classification problem. Given x \in \mathbb{R}^p, the p-dimensional observation x is transformed as z := \hat{T}_\delta^T x. The class of x, say G(x), is assigned using the Euclidean distance in the transformed space (Witten and Tibshirani, 2011a):

G(x) = \mathop{\arg\min}_{j \in \{1, \dots, J\}} \| z - \hat{T}_\delta^T \bar{x}_j \|^2. \tag{2}
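The ridge-regularized eigenproblem above can be sketched directly in numpy. This is a plain illustrative implementation (`lda_transform` is our own name; the paper's implementation is the R package hclda), not the authors' code:

```python
import numpy as np

def lda_transform(X, y, D, delta=1e-5):
    """Ridge-regularized multiclass LDA: return, as columns of T, the
    eigenvectors of S_{W,delta}^{-1} S_B for the D largest eigenvalues."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    S_B = np.zeros((p, p))
    S_W = np.zeros((p, p))
    for j in np.unique(y):
        Xj = X[y == j]
        mj = Xj.mean(axis=0)
        S_B += len(Xj) * np.outer(mj - xbar, mj - xbar)
        S_W += (Xj - mj).T @ (Xj - mj)
    S_B /= n
    S_W /= n
    S_Wd = S_W + (delta / n) * np.eye(p)  # ridge penalization S_{W,delta}
    evals, evecs = np.linalg.eig(np.linalg.solve(S_Wd, S_B))
    order = np.argsort(evals.real)[::-1]  # sort eigenvalues, largest first
    return evecs[:, order[:D]].real
```

Classification as in Eq. (2) then amounts to projecting with `X @ T` and assigning each point to the nearest transformed class mean.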

Proposed method
As described in the Introduction, observations in some classes are easy to classify whereas those in others are difficult to classify. To address this issue, we define sets of classes whose observations may not be easy to classify with ordinary LDA; we refer to each such set as a metaclass. Suppose that we have a set of m metaclasses, M_m = {C_1, . . . , C_m}, where each C_i (i = 1, . . . , m) is a metaclass. The metaclasses satisfy C_i ⊆ G (i = 1, . . . , m), C_i ∩ C_k = ∅ (i ≠ k), and C_1 ∪ · · · ∪ C_m = G. On the basis of the metaclasses, we conduct two-stage LDA for a given input x: first, LDA on the metaclasses allocates x to some metaclass; second, LDA within the selected metaclass allocates x to a class. The two-stage algorithm is summarized in Algorithm 1.
The two-stage algorithm is implemented once a set of metaclasses M = {C 1 , . . . , C m } is determined. To determine the metaclasses, we conduct hierarchical cluster analysis as follows.
The initial set of metaclasses is M_J = {C_1, . . . , C_J} with C_j = {G_j} (j = 1, . . . , J). At each step, we merge the two metaclasses C_j and C_k for which the CV value is minimized. The details of the CV are presented in Section 4; its definition is given by Eq. (3). The procedure is summarized in Algorithm 2.

Algorithm 1 Two-stage LDA.
1: Input x and a set of metaclasses M = {C_1, . . . , C_m}.
2: Perform LDA on M to predict the metaclass of x. Suppose that x is allocated to C_j.
3: if C_j includes more than one class then
4:   Perform LDA on C_j and allocate x to a class of C_j.
5: else
6:   Allocate x to the single class in C_j.
7: end if

Algorithm 2 Hierarchical LDA. The details of two-stage LDA are provided in Algorithm 1.
1: Let C_j^0 = G_j (j = 1, . . . , J). Define a set of metaclasses as M_J = {C_1^0, . . . , C_J^0}.
2: for t = 1, . . . , J − 2 do
3:   for all pairs (j, k) with j < k do
4:     Define the candidate set of metaclasses A^(j,k) by merging C_j and C_k in M_{J−t+1}.
5:     Perform CV in two-stage LDA based on the set of metaclasses A^(j,k), and obtain the CV value CV(j, k).
6:   end for
7:   Find an index (j_0, k_0) whose CV value CV(j_0, k_0) is minimized.
8:   Merge C_{j_0} and C_{k_0} to obtain M_{J−t}.
9: end for


Efficient algorithm for CV of LDA

CV is adopted to construct the metaclasses. Typically, 5- or 10-fold CV is used in classification problems. However, in practice, it can be unstable when n_j is small, as in the case of the mouse consomic data. Thus, we adopt leave-one-out CV.
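The greedy merge loop of Algorithm 2 can be sketched as a short procedure with the CV evaluation abstracted into a callback. Both `hierarchical_metaclasses` and `cv_error` are illustrative names, not part of the hclda package:

```python
def hierarchical_metaclasses(classes, cv_error):
    """Greedy merging: at each step, merge the pair of metaclasses whose
    merge yields the smallest CV value. `cv_error` maps a partition
    (a tuple of frozensets) to an error estimate."""
    M = [frozenset([g]) for g in classes]
    history = [tuple(M)]  # partition at t = 0
    while len(M) > 1:
        best_err, best_M = None, None
        for j in range(len(M)):
            for k in range(j + 1, len(M)):
                cand = [M[i] for i in range(len(M)) if i not in (j, k)]
                cand.append(M[j] | M[k])  # merge metaclasses j and k
                err = cv_error(tuple(cand))
                if best_err is None or err < best_err:
                    best_err, best_M = err, cand
        M = best_M
        history.append(tuple(M))
    return history  # partitions for t = 0, 1, ..., J-1
```

In the actual method, `cv_error` would evaluate leave-one-out CV of the two-stage rule; any partition-scoring function can be plugged in here.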
To implement CV, for each i = 1, . . . , n, we compute the D largest eigenvalues and eigenvectors of (S_{W,δ}^{(−i)})^{-1} S_B^{(−i)}, where S_B^{(−i)} and S_W^{(−i)} are the between-class and within-class covariance matrices constructed from {x_1, . . . , x_n} \ {x_i}. The resulting transformation matrix \hat{T}_δ^{(−i)} yields the transformed value z^{(−i)} := (\hat{T}_δ^{(−i)})^T x_i, and the ith observation is allocated by the rule in Eq. (2); denote the resulting class by G^{(−i)}(x_i).
The CV value is calculated as

\mathrm{CV} = \frac{1}{n} \sum_{i=1}^{n} I\left( y_i \neq G^{(-i)}(x_i) \right), \tag{3}

where I(·) is an indicator function.
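A direct (slow) implementation of this leave-one-out CV value simply refits the rule n times. The sketch below uses generic, hypothetical `fit`/`predict` callbacks rather than the paper's two-stage LDA:

```python
import numpy as np

def loocv_error(X, y, fit, predict):
    """Direct leave-one-out CV of Eq. (3): refit without observation i and
    count misclassifications. This is the O(n)-refits baseline that the
    efficient algorithm of this section is designed to avoid."""
    n = len(y)
    errors = 0
    for i in range(n):
        mask = np.arange(n) != i          # drop the ith observation
        rule = fit(X[mask], y[mask])
        errors += int(predict(rule, X[i]) != y[i])
    return errors / n
```

With LDA as the fitted rule, each of the n refits requires a fresh eigendecomposition, which motivates the fast approximation developed next.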
Our hierarchical LDA algorithm needs at least O(J^2) evaluations of the CV; hence, it becomes slow when J is large. Moreover, leave-one-out CV needs the eigenvalues and eigenvectors of (S_{W,δ}^{(−i)})^{-1} S_B^{(−i)} for i = 1, . . . , n, which involves a heavy computational load for large p. Therefore, an efficient algorithm for computing the CV is required. In this section, we present such an algorithm by extending the efficient CV for two-class LDA described in Cawley and Talbot (2003) to the multiclass case.

Least-squares formulation
We consider the following ridge regression problem:

\min_{\beta_0, \beta} \left\{ \| y - \beta_0 1_n - X \beta \|^2 + \delta \| \beta \|^2 \right\}, \tag{5}

where y = (y_1, . . . , y_n)^T is a response vector, 1_n is an n-dimensional vector whose elements are 1, β_0 is an intercept, X = (x_1, . . . , x_n)^T is a design matrix, β is a regression coefficient vector, and δ ≥ 0 is a regularization parameter.
The response vector y corresponds to the output of the classification problem. For example, in the two-class classification problem (i.e., J = 2), we may use y_1 = \frac{1}{n_1} 1_{n_1} and y_2 = -\frac{1}{n_2} 1_{n_2} (Cawley and Talbot, 2003). With these response vectors, the estimate of the regression coefficient vector \hat{β} is parallel to the eigenvector of S_{W,δ}^{-1} S_B. However, to the best of our knowledge, the construction of y for multiclass LDA has not been investigated thus far. To construct the response vectors for multiclass LDA, we first define scalars ξ_jd (j = 1, . . . , J; d = 1, . . . , D), and then define the response vectors y_jd as

y_{jd} = \xi_{jd} 1_{n_j}. \tag{6}

Then, we obtain the following proposition.

Proof. The proof is given in Appendix A.
With Proposition 4.1, Eq. (5) is expressed as Eqs. (7) and (8). We then have the following theorem.

Theorem 4.1. Let \hat{β}_d be the solution for β_d, i.e., the regression coefficient vector that satisfies Eqs. (7) and (8), and let \hat{t}_{d,δ} be the dth column vector of \hat{T}_δ. Then, we obtain Eq. (9): the estimate of the regression coefficient vector, \hat{β}_d, is parallel to the transformation vector of multiclass LDA, \hat{t}_{d,δ}.
Proof. The proof is given in Appendix B.
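The two-class special case of this parallelism (with the Cawley–Talbot response coding above) can be checked numerically. The sketch below is a verification under assumed notation, with the ridge scaled as S_{W,δ} = S_W + (δ/n)I as in Section 2; it is not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, p, delta = 40, 60, 5, 1e-3
X = np.vstack([rng.normal(0.0, 1.0, (n1, p)),
               rng.normal(2.0, 1.0, (n2, p))])
# Cawley-Talbot coding: 1/n1 for class 1, -1/n2 for class 2.
y = np.concatenate([np.full(n1, 1.0 / n1), np.full(n2, -1.0 / n2)])
n = n1 + n2

# Ridge regression with an intercept (center X and y, then solve).
Xc = X - X.mean(axis=0)
yc = y - y.mean()
beta = np.linalg.solve(Xc.T @ Xc + delta * np.eye(p), Xc.T @ yc)

# Two-class ridge LDA direction: S_{W,delta}^{-1} (xbar_1 - xbar_2).
m1, m2 = X[:n1].mean(axis=0), X[n1:].mean(axis=0)
S_W = ((X[:n1] - m1).T @ (X[:n1] - m1)
       + (X[n1:] - m2).T @ (X[n1:] - m2)) / n
t = np.linalg.solve(S_W + (delta / n) * np.eye(p), m1 - m2)

# Cosine similarity of the two directions: parallel up to a scalar.
cos = beta @ t / (np.linalg.norm(beta) * np.linalg.norm(t))
print(round(abs(cos), 6))  # → 1.0
```

The equality is exact here by a Sherman–Morrison argument: the total covariance differs from S_W by a rank-one term in the mean-difference direction, which does not change the direction of the solution.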

Construction of an efficient algorithm
We consider the problem of CV with the regression formulation in Eq. (5). We remove the ith observation from \tilde{X} and y_d, and construct an (n − 1)-dimensional response vector y_d^{(−i)} and the corresponding design matrix \tilde{X}^{(−i)}. The ridge estimate based on \tilde{X}^{(−i)} and y_d^{(−i)} is denoted by \hat{α}^{(−i)}. Based on this regression coefficient vector and Eq. (9), the transformed value for the dth dimension is obtained, and from Eq. (2), the class for the ith observation (i = 1, . . . , n), G^{(−i)}(x_i), is given by Eq. (10). The problem with using Eq. (10) directly is its computational cost. We therefore use the approximations given by Eqs. (11) and (12); in particular, λ_d^{(−i)} is approximated by λ_d^{*(−i)} defined in Eq. (12). The definition of λ_d^{*(−i)} in Eq. (12) is derived from the following relationship between \tilde{X} and λ_d.

Lemma 4.1. The following equation holds:

Proof. The proof is presented in Appendix D.
Based on the approximations given by Eqs. (11) and (12), the class allocation is now defined as G^{*(−i)}(x_i).

Efficient computation of the approximate CV
However, the following theorem leads to efficient computation of \tilde{x}^T \hat{α}^{*(−i)}.

Theorem 4.2. Given an input vector \tilde{x} = (1, x^T)^T, \tilde{x}^T \hat{α}^{*(−i)} can be computed in closed form without refitting the regression.

Proof. The proof is presented in Appendix C.
Using Theorem 4.2, we can evaluate \tilde{x}_i^T \hat{α}^{*(−i)} using c_i = (\tilde{X}^T \tilde{X} + ∆)^{-1} \tilde{x}_i and a_i = (\hat{y}_{id} − y_{id})/(1 − h_{ii}), where h_{ii} denotes the ith diagonal element of the hat matrix H. These values are required to compute G^{*(−i)}(x_i). Using the above formulae is far more efficient than direct computation of ((\tilde{X}^{(−i)})^T \tilde{X}^{(−i)} + ∆)^{-1} because it requires only O(n) operations once \hat{y} and H are computed.
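The O(n) shortcut rests on the standard leave-one-out identity for penalized least squares, y_i − \hat{y}_i^{(−i)} = (y_i − \hat{y}_i)/(1 − h_{ii}), the same quantity as a_i above. The check below uses generic regression data (not the LDA responses) and assumed notation for ∆:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, delta = 30, 4, 1e-2
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=0.1, size=n)

Xt = np.hstack([np.ones((n, 1)), X])              # add intercept column
Delta = delta * np.eye(p + 1)
Delta[0, 0] = 0.0                                  # do not penalize intercept
A_inv = np.linalg.inv(Xt.T @ Xt + Delta)
H = Xt @ A_inv @ Xt.T                              # hat matrix
yhat = H @ y

# Leave-one-out predictions in O(n) once H and yhat are known:
loo_shortcut = y - (y - yhat) / (1 - np.diag(H))

# Direct refits for comparison (one solve per left-out observation):
loo_direct = np.empty(n)
for i in range(n):
    m = np.arange(n) != i
    b = np.linalg.solve(Xt[m].T @ Xt[m] + Delta, Xt[m].T @ y[m])
    loo_direct[i] = Xt[i] @ b
print(np.allclose(loo_shortcut, loo_direct))  # → True
```

Note that this identity covers only the hat-matrix part of the speed-up; the method additionally approximates the leave-one-out responses and eigenvalues, as discussed above.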
Algorithm 3 summarizes our efficient algorithm for CV of multiclass LDA.  Under Assumption 4.1, we obtain the following proposition.

Proposition 4.2. Under Assumption 4.1, ξ_jd converges almost surely to a constant value ξ_{0,jd}.

Proof. The proof is presented in Appendix E.

Since y_jd = ξ_jd 1_{n_j} as in Eq. (6), Proposition 4.2 implies that y_jd^{(−i)} is well approximated by y_jd^{*(−i)} in Eq. (14) when n is large. Therefore, it would be reasonable to perform CV using y_jd^{*(−i)} instead of y_jd^{(−i)}. In Section 5.4, we investigate the numerical performance of this CV; the results show that the approximation works well when n is large.

Monte Carlo simulation


Data generation
In the numerical experiment, the label of the ith observation, y i , is generated according to a multinomial distribution with probability P (y i = j) = 1/J (i = 1, . . . , n; j = 1, . . . , J).
We then define the mean vector of each cluster, say µ j (j = 1, . . . , J). Given label y i , the ith predictor vector is generated from x i ∼ N p (µ y i , I p ), (i = 1, . . . , n).
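The generating process above can be written in a few lines (`generate` is an illustrative helper using 0-based labels):

```python
import numpy as np

def generate(n, J, p, mus, rng):
    """Draw labels uniformly from {0, ..., J-1}; given y_i, draw
    x_i ~ N_p(mu_{y_i}, I_p). `mus` is a (J, p) array of class means."""
    y = rng.integers(0, J, size=n)          # P(y_i = j) = 1/J
    X = mus[y] + rng.normal(size=(n, p))    # unit-variance Gaussian noise
    return X, y
```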
Here, two simulation models are considered as follows.
Model 1 We set J = 9, p = 2, and n = 200. The mean vectors µ_j (j = 1, . . . , 9) are chosen so that each class is well separated from the others. Figure 2(a) shows the data points generated from Model 1; because the classes are well separated, ordinary LDA is expected to perform well when D = 2. We investigate whether our method, HLDA, performs well even when clustering is not necessarily required for classification.
Model 2 The mean vector of each class belongs to one of three clusters of centers: 1_p, 10 · 1_p, and −10 · 1_p. We set J = 30, p = 20, and n = 600, except in Section 5.4, where the performance of our fast CV is investigated.
The 2D projection of the data points is shown in Figure 2(b). In this case, ordinary LDA may not perform well for D = 2 because of the difficulty in classification within a cluster.
We expect HLDA to perform better than LDA.
We employ R 4.0.2 to implement our algorithm. The Rcpp package is used for fast computation, and the OpenBLAS library is used for matrix computation. The experiments are run on Amazon Web Services (AWS) with Intel Xeon Platinum 8175 processors (3.1 GHz), 32 vCPUs, 128 GB of memory, and CentOS 7. Throughout the experiments, the ridge parameter is δ = 10^{−5}.

Illustration of our method
With our HLDA algorithm, the number of clusters is determined by the number of steps of the hierarchical clustering algorithm, t; the number of clusters is J − t. Note that HLDA is equivalent to LDA when t = 0 and when t = J − 1. We first apply the HLDA algorithm to the datasets shown in Figure 2. For Model 1, the CV error rate does not decrease with clustering; the cluster analysis does not always decrease the CV error rate. Indeed, CV selects t = 0, which is identical to the ordinary LDA classification rule. Thus, HLDA selects an appropriate classification rule even if clustering is not required. For Model 2, our HLDA algorithm significantly decreases the CV error rate when t ≥ 1, which suggests that clustering improves the prediction accuracy considerably. The number of clusters should be relatively large to ensure a sufficiently low CV error rate.

Prediction accuracy
We compare the prediction accuracies of three methods: LDA, HLDA, and Ward's hierarchical clustering. The training set with n observations is generated, and the classification rules for the three above-mentioned methods are then constructed. The test data with n observations are generated from the same distribution as the training set, and the error rate is computed.
This procedure is repeated 100 times. The error bar represents one standard error. The number of clusters of HLDA is determined such that the CV value is minimized; the number of clusters of Ward's method is set equal to that selected by HLDA.
For Model 1, HLDA performs much better than existing methods when D = 1. Therefore, our clustering method significantly improves the accuracy. When D = 2, LDA is expected to perform well, as shown in Figure 2(a). Indeed, LDA achieves good accuracy when D = 2.
Nevertheless, HLDA and LDA provide nearly identical performances; HLDA may perform well even if clustering is not required. Ward's method performs well but is slightly worse than LDA and HLDA.
For Model 2, HLDA performs much better than existing methods, especially when D ≤ 3.
When D = 4, LDA performs well but is outperformed by HLDA. We observe that the error rates become nearly 0% within only a few steps of the HLDA algorithm when D = 4. Figure 5 depicts the clustering results of the dataset shown in Figure 2(b). The number of clusters is 3 (i.e., t = 27) because we assume three clusters of mean vectors in Model 2.
We remark that the 2D plots are depicted for one of the 100 datasets; we observed similar tendencies for most of the datasets. The results show that HLDA produces different clusters compared with Ward's method. In particular, HLDA often creates a cluster whose classes are separated, whereas Ward's method cannot produce such a cluster. The error rates of HLDA and Ward's method are 4.83% and 7.17%, respectively; thus, HLDA achieves higher accuracy than Ward's method.

Comparison of CV computation
We discuss the numerical investigation of our fast CV computation, described in Section 4.
In particular, we focus on the comparison of the computational timings and the accuracy of fast CV.

Comparison of computational timings
For the timing comparison, the datasets are generated from Model 2 described in Section 5.1 with various dimensions: p = 20, 50, 100, 200, 500, 1000. The number of observations, n, is n = 6000. We employ the fast and ordinary CV algorithms and compare their computational timings over 10 runs.
Figure 6(a) shows the computational timings of the fast and ordinary CV algorithms. The error bar represents the maximum and minimum values over 10 runs. The results show that our CV is generally faster than ordinary CV. In particular, the difference between the timings of these two CV methods increases with p; therefore, our algorithm is efficient for high-dimensional data. The difference is caused by the fact that ordinary CV requires the computation of eigenvalues and eigenvectors of a p × p matrix for each observation.

Accuracy of fast CV
The error rates of the two CV algorithms are also compared; the two algorithms yield slightly different values owing to the approximations used in the fast CV computation (see Section 4.2 for details). Note that the approximation of fast CV is justified for sufficiently large n; thus, we compare the CV values under various sample sizes, n = 300 × 2^{k−1} (k = 1, . . . , 8), and examine whether fast CV numerically converges to ordinary CV as n increases. The dimension of the predictor variable is p = 20. Figure 6(b) depicts the error rates of the two CV computations. For comparison, we also compute the apparent error rate (AER). The error bar represents the maximum and minimum values over 10 runs. The results show that fast CV and AER underestimate the true CV value because they use test data for training. Fast CV provides a better approximation of the true CV value than AER, especially for small and moderate sample sizes. For example, when n = 2400 (i.e., n_j ≈ 80), the difference between the two CV errors is small, whereas the AER does not provide a good approximation of the true CV. For sufficiently large n, the three error rates are almost identical.

Real data analysis
We apply our method to the mouse consomic strain data (Takada et al., 2008). The error rate as a function of D is shown in Figure 7(b). The results imply that HLDA outperforms the existing methods for any D. Ward's method performs worse than LDA, which suggests that it cannot achieve higher prediction accuracy than LDA for this dataset.

Conclusion
We considered a multiclass classification problem based on dimensionality reduction. In particular, we mainly dealt with a dataset that consists of a relatively large number of classes; for example, in real data analysis, we analyzed consomic strain data consisting of 30 classes.
Through numerical experiments, we showed that our method achieves higher prediction accuracy than ordinary LDA while maintaining its advantages, including visualization.
Our approach is based on hierarchical clustering; therefore, we cannot apply it to a dataset that consists of a massive number of classes (e.g., tens of thousands of classes) owing to the prohibitively heavy computational load. As an alternative to hierarchical clustering, non-hierarchical clustering, such as k-means clustering or convex clustering (Chen et al., 2015; Donnat and Holmes, 2019), can be applied to large-scale datasets. In the future, it would be interesting to extend our method to a non-hierarchical clustering technique.

D Proof of Lemma 4.1
From Eq. (9), the stated equation follows. The proof is complete.

E Proof of Proposition 4.2
It is sufficient to show that ξ_jd converges almost surely to some constant value because y_jd is defined as y_jd = ξ_jd 1_{n_j}. Under Assumption 4.1, S_{B,δ} and S_{W,δ} converge almost surely to some constant matrices, say Σ_{B,δ} and Σ_{W,δ}, respectively. Let (λ_{0,d}, s_{0,d}) be the dth largest eigenvalue and the corresponding eigenvector of Σ_{W,δ}^{-1} Σ_{B,δ}.