MANOVA, LDA, and FA criteria in clusters parameter estimation

: Multivariate analysis of variance (MANOVA) and linear discriminant analysis (LDA) apply such well-known criteria as the Wilks’ lambda, Lawley–Hotelling trace, and Pillai’s trace test for checking quality of the solutions. The current paper suggests using these criteria for building objectives for finding clusters parameters because optimizing such objectives corresponds to the best distinguishing between the clusters. Relation to Joreskog’s classification for factor analysis (FA) techniques is also considered. The problem can be reduced to the multinomial parameterization, and solution can be found in a nonlinear optimization procedure which yields the estimates for the cluster centers and sizes. This approach for clustering works with data compressed into covariance matrix so can be especially useful for big data.

ABOUT THE AUTHOR Stan Lipovetsky, PhD, senior research director, GfK, Marketing Sciences. Stan has numerous publications in multivariate statistics, multiple criteria decision-making, econometrics, microeconomics, and marketing research.
The methods of multivariate statistical analysis are widely applied in marketing research. Clustering technique described in the current paper can be especially useful for big data with possible hundreds of thousands or millions of observations when regular clustering algorithms presents a hard computational burden. The suggested method operates with the data already compressed into covariance matrix. When the cluster centers and sizes are estimated, the actual clustering, or assignment of each observation to one or another cluster can be performed by allocating them to the closest cluster.

PUBLIC INTEREST STATEMENT
The paper describes several main multivariate statistical techniques, such as multivariate analysis of variance, linear discriminant analysis, and factor analysis in relation to cluster analysis. It shows that known in those techniques criteria of quality of solutions can be used for data clustering as well. These criteria are employed to find cluster centers and sizes, because optimizing such objectives corresponds to the best distinguishing between clusters. The problem is expressed via multinomial parameterization of the clusters characteristics, and solution can be found in optimization procedure yielding estimates for clusters. This approach uses only sample covariance matrix and not the observations themselves, so it can be especially valuable in difficult clustering tasks on big data from data bases, data warehouses, and data-mining problems.
numerical covariates (MANCOVA) is a similar extension of ANCOVA. Linear discriminant analysis (LDA) and MANOVA can be considered as dual techniques-in LDA the independent variables are the predictors (or attributes) and the dependent variables are the groups, while in MANOVA vice versathe independent variables are the groups (dummy variables identifying the clusters belonging) and the dependent variables are the attributes. Both these techniques can be presented via the canonical correlation analysis (CCA) between two sets of variables, with an additional step of prediction of one set by another one. When the canonical aggregates of the variables are obtained, all the tests for multivariate variables can be reduced to the known tests for one variable. LDA consists in testing significance of the discriminant functions of the attributes (with additional classification), and MANOVA-in testing significance of differences between groups' vectors of means (with additional identification which of the attributes have different means across the groups). There are various criteria known for testing quality of LDA and MANOVA solutions (see, for instance, Dillon & Goldstein, 1984;Härdle & Simar, 2012;Izenman, 2008;Timm, 1975).
The current paper considers possibilities to apply these criteria for clustering purposes. Indeed, if such criteria permit testing quality of LDA and MANOVA solutions, they can be optimized to obtain the best distinguishing between the clustering groups as well. As it is well-known in multivariate statistical analysis the total (T) variance-covariance matrix can be presented as the sum T = B + W of the so-called "Between" (B) and "Within" (W) matrices defined by the variance-covariance between and within the groups, respectively. The loadings for aggregation of the attributes in LDA or MANOVA are commonly found by maximizing a criterion of the quotient of the between-to-within quadratic forms which can be presented via the generalized eigenproblem with these matrices. Using its eigenvalues, it is possible to test the quality of LDA and MANOVA solutions. For this aim, the so-called Wilks' lambda criterion Λ (a multivariate generalization of F-statistics) compares the determinants (generalized variances) calculated by within and total variances. Wilks' Λ varies from 0 to 1, and Λ = 0 indicates that the groups' mean vectors differ, while Λ = 1 shows that the groups' means are the same. Thus, minimizing an objective of the Wilks' criterion by the parameters of the cluster centers permits to find them, as well as the group base sizes, which define the best distinguishing between groups. Wilks' lambda has the so called U-distribution which is tabulated, for instance, in Timm (1975). It also can be numerically approximated and presented as a regular F-test.
There are other overall tests on significance known in MANOVA, for instance, Hotelling T 2 (a multivariate generalization of the t-test for comparing vectors of means for two groups), and its generalization to Lawley-Hotelling trace criterion. The maximization of this criterion can be used for clustering problem. A modification of this criterion is given by another criterion widely used in MANOVA-the so-called Pillai's trace. It also can be used for clustering aims via the corresponding maximized objective. Such criteria are very convenient for clustering problem because they do not require calculation of each eigenvalue but only their total, which coincides with the trace, or sum of the diagonal elements of the matrix used in maximization. Practical application of all these tests in LDA and MANOVA are demonstrated, for instance, throughout the monograph (Timm, 1975). There are other tests, like Roy's largest eigenvalue max , and more specific ones like Levene's test to define whether the variances between groups are equal, partial eta-square (similar to partial F-statistics) to see the variance explained by individual independent variable, and other extensions to the multivariate statistics. However, for clustering aims, we are interested in the criteria operating with the totals of the eigenvalues which can be reduced to some functions of the main variance-covariance T, B, and W matrices.
Contemporary cluster analysis includes a large spectrum of methods developed in the areas of segmentation, pattern recognition, machine learning, data mining, and others (for instance, see Bishop, 2006;Eldén, 2007;Frey & Dueck, 2007;Gan, Ma, & Wu, 2007;Hastie, Friedman, & Tibshirani, 2001;Lipovetsky, 2012;Lipovetsky, Tishler, & Conklin, 2002;Liu & Motoda, 2008;Nowakowska, Koronacki, & Lipovetsky, 2014;Ripley, 1996). Various works suggest different ways to divide data into groups of observations more closely related within each group in comparison with the relations between the groups by variance-covariance matrices (Brusco & Steinley, 2007;DeSarbo, Carroll, Clark, & Green, 1984;Friedman & Meulman, 2004;Heiser & Groenen, 1997;Szekely & Rizzo, 2005). The current paper suggests a new approach based on the special objectives corresponding to the MANOVA and LDA criteria which optimization guarantees the best quality of the distinguishing among the groups estimated using the same criteria. This problem can be reduced to the optimizing procedure for nonlinear approximation of the covariance matrix by the total of the outer products of the distances from cluster centers to total center. The means and sizes of the clusters can be found using parameterization via multinomial shares-a technique developed and successfully applied for solving various problems in regression, principal components, singular value decomposition, and clustering (Lipovetsky, 2009a(Lipovetsky, , 2009b(Lipovetsky, , 2013a. The relation with factor analysis by least squares (LS) and generalized least squares (GLS) objectives (Joreskog, 1977;Jöreskog, 1967;Jöreskog & Goldberger, 1972;Lawley & Maxwell, 1971;Maxwell, 1983) are discussed as well. This approach can be especially useful for large data-sets with thousands and millions of observations because it operates not with original multiple observations but with the data compressed into covariance matrix.

Criteria for finding cluster centers and sizes
Let X denote N by n matrix with elements x ij (rows i = 1, 2, … , N of observations by the variables in n columns x 1 , x 2 , … , x n ). The elements of the total matrix S tot of second moments are defined as: where M j denotes the mean value of each x j . Let observations be divided into K subsets, and these clusters are numbered as q = 1,2, … , K, and each q-th cluster have N q observations, so their total equals the sample base: Consider decomposition of the cross-product (1) into the items related to the data subsets with sizes (2). Such a transformation is known in ANOVA (Ladd, 1966;Lipovetsky & Conklin, 2005) and can be presented as: where x q ij indicates that the i-th observation by the j-th variable belongs to the q-th cluster, and m q j is the mean value of each j-th variable within q-th cluster. The relation between the subsets and total means for each j-th variable is as follows: where both sides express simply the total by each x j .
The double sum in (3) equals the pooled second moment within each cluster: The last sum in (3) corresponds to the weighted by group sizes second moment between the cluster means centered by the total means: So (3) can be presented as the matrix sum: with the outer product of vectors of distances from the centers m q for each q-th cluster to the total center M, where each vector m q consists of the means m q j by all the variables, and the vector M contains the total means M j . For a given matrix S tot (7), the data clustering corresponds to maximizing the distances between the groups and minimizing them within the groups. If observations within each q-th cluster collapse to one point of its center, the elements of the matrix S within (5) reach zero. Thus, to find clusters, we can minimize the total of squared elements of matrix S within , or in other words-the total of differences between elements of known matrix S tot and unknown matrix S between in (7): The objective (8) presents the squared Frobenius norm for a matrix (also known as Hilbert-Schmidt, or Schur norm). This formulation corresponds to the LS objective for the nonlinear regression model of fitting the values in S tot by the known vector M and the sets of unknown constants N q and unknown vectors m q . Estimation of these parameters in the approach of multinomial parameterization is considered in Lipovetsky (2013aLipovetsky ( , 2013b where the problem is reduced to nonlinear regression modeling. A brief description of this technique is given in Appendix. The basic relation (7) is also the fundamental equation for MANOVA and LDA where it is usually presented in one of the following known notations: where S within = W = E denotes the Within (W), or Error (E) matrix, and S between = B = H denotes the Between (B), or Hypothesis (H) matrix of second moments. Finding vectors of loadings, or discriminant functions α in LDA or MANOVA is commonly performed by maximizing the criterion of the Rayleigh quotient of the between-to-within quadratic forms: which can be presented as the generalized eigenproblem: with the eigenvalues of the matrix E −1 H.
The so-called Wilks' lambda criterion (a multivariate generalization of F-statistics) compares the generalized variances within the groups and in the whole data-set: This criterion (12) can be used for finding the groups' centers. For this aim it can be rewritten via the means of S between (6) as the unknown parameters of interest and minimized: S tot = S within + S between = W + B = E + H, where in the numerator we have minimization similar to used in (8), and in the denominator is the constant of the determinant of the total variance-covariance matrix. The difference in LS (8) and Wilks' (13) criteria consists in using Euclidean norm squared or the generalized variance in determinants, respectively.
Another overall test in MANOVA is Hotelling T 2 -statistic, a multivariate generalization of the t-test for comparing vectors of means for two groups, and its generalization to Lawley-Hotelling trace (Tr, the total of diagonal elements) criterion: The maximized criterion (14) can be used for clusters parameter estimation by trace of the matrix: with the same S between (6).
A modification of (14) widely used in MANOVA is the so-called Pillai's trace: For estimation of cluster centers this test corresponds to the objective for maximization: Both (15) and (17) criteria use optimization by the same matrix S −1 tot S between of fitting used in (13) as well.
All these criteria are convenient for clusters parameter estimations because they do not require calculation of each eigenvalue but work with the total matrices. The meaning of all these objectives, including LS (8), is similar-to identify the parameters of cluster centers by closeness of S between to S tot (7).
The LS (8) and MANOVA (12)-(17) objectives for clusters parameter estimations correspond to the criteria in Joreskog's classification for methods of factor analysis (Joreskog, 1977;Jöreskog, 1967;Jöreskog & Goldberger, 1972;Lawley & Maxwell, 1971;Maxwell, 1983) based on a given covariance matrix S approximated by a matrix Σ of lower rank (built as the product of a matrix of loadings and its transposed). Joreskog (1977) distinguishes the unweighted least-squares (ULS) the generalized least-squares and the maximum likelihood (ML) Our notations for the matrices S tot and S between in (8) correspond to the matrices S and Σ in Joreskog's notations. The total of all elements squared, or squared Frobenius norm in (8), can be equally presented as the trace of a matrix multiplied by its transposition. Thus, we see that up to the constant ½, the expression (8) The objective (8) can be transformed as follows: Besides of (8), skipping the constant term ||S tot || 2 in (21), it is possible to use the objective of total residual sum of squares in minimizing the following deviations: This objective coincides with GLS (19), up to the term ½. MANOVA (13), (15), (17) objectives also use the matrix S −1 tot S between which corresponds to Σ −1 S in (19) and (20).
Quality of data fit for the objective (8) can be estimated via pseudo-R 2 similar to the coefficient of multiple determination in nonlinear regression, defined as: where F min is the residual sum of squares in the minimum of the objective (8), divided by the total sum of squares of all the elements in the fitted matrix expressed via the trace of the squared matrix S tot . This measure will be in favor of the ULS results obtained by the objective (18). Similarly for the objective (22), the pseudo-R 2 can be defined as one minus F min divided by the original sum of squares equal n for the identity matrix I n . This measure would correspond to the quality of fit for the GLS results (19).
The objectives (8) and (22), and the presentation in (18) and (19), are theoretically meaningful and correspond to MANOVA relations (12)-(17). In implementation for the numerical estimations, the totals of the squared deviations of the numerical covariance matrix' elements and their parametric counterparts are used, and there the first objective (8), or ULS (18), is preferable because it does not include the inversion of the covariance matrix which could be prone to multicollinearity in the data.

Numerical examples
Consider the iris data (Fisher, 1936) on the measured sepal and petal length and width of fifty iris specimens for each of three species, Iris setosa (SE), Iris versicolor (VE), and Iris virginica (VI). This data can be found in the Iris file available in the software package (S-PLUS'2000(S-PLUS' , 1999, or in R datasets. The variables are highly correlated: except the sepal width, the correlations range from 0.81 to 0.96. In Table 1, the first three numerical columns show the means of the variables for each kind of iris, the next three columns show the groups centers and sizes estimated by the ULS (18), then the next three columns show the results by the GLS (19), and the last three columns present the regular K-means clustering for comparison. The last row presents the pseudo-R 2 (23), which is, of course, favorable to the criterion (8), but important is that the quality of the ULS is the same as of K-means. The vectors of cluster centers for the ULS outperform the GLS-they are noticeably closer to the original centers of the iris specimens. ULS results are very similar to K-means as well, but in contrast to K-means the ULS cluster centers and base sizes are obtained using only covariance matrix, without the data-set itself.
Estimation by the objective (8), or ULS (18), does not require inversion of the covariance matrix, thus, the clustering results are more robust and less prone to multicollinearity within the data. It is the reason why ULS regularly outperforms the GLS technique (19) or (22), which employs the inverted covariance matrix with possible inflated values of elements and leads to worse clustering results that was observed by various data-sets.

Summary
This work considers the problem of finding cluster centers and sizes by fitting covariance matrix with the between-cluster matrix of a lower rank constructed by outer products of the parameters of cluster centers weighted by cluster sizes. Relation of this approach to the criteria from multivariate analysis of variance, MANOVA, and linear discriminant analysis, LDA, in the objectives for optimization in cluster analysis is discussed. Such criteria as Wilks' lambda, Lawley-Hotelling trace, and Pillai's trace for building objectives for finding clusters parameters produces the best distinguishing between the clusters. Solutions can be found in a nonlinear optimization procedure with the multinomial parameterization which yields estimates for the cluster centers and sizes.
This approach can be especially useful for big data-sets. Indeed, for a big data with possible hundreds of thousands or millions of observations any regular clustering algorithm presents a hard computational burden, while the suggested method operates with the data already compressed into covariance matrix. When the cluster centers and sizes are estimated, the actual clustering, or assignment of each observation to one or another cluster can be performed by allocating them to the closest cluster due to the shortest distance to the centers. The described approach employs only the sample covariance matrix and not the observations themselves, so it can be valuable in difficult clustering tasks on huge data-sets from data bases, data warehouses, and data mining problems. It can also be useful for finding cluster structure when the data itself is already unavailable for any reason and only the covariance matrix can be used. In (A2) we have share parameters: N q /N with their total equal to one due to (2), and the means m q N q / (MN) with the total equal to one due to (4). The multinomial parameterizations for these shares can be defined by the new sets of unknown parameters as following. Instead of the shares N q /N, let us use the multinomial parameterization with the first parameter put to zero and needed K−1 parameters α q . Similarly, in place of the shares m q j N q ∕(M j N) in (A2), for each variable x j we define a new multinomial parameterization: with the needed (K−1)n parameters q j . Using parameterization (A3) and (A4) in place of the shares in the diagonal matrices (A2) and substituting the expression (A2) as the outer product into objective (A1) yields: with g q denoting a vector of n-th order of multinomial shares (A4) for each q-th cluster, and the expression g q M corresponds to the element-wise product of two vectors.
The objective (A5) corresponds to the nonlinear regression of the dependent variable presented by the values of elements in the matrix C by the values of the total means within the complex structure of the unknown parameters. Minimization (A5) by the parameters α q and