Coupled regularized sample covariance matrix estimator for multiple classes

The estimation of covariance matrices of multiple classes with limited training data is a difficult problem. The sample covariance matrix (SCM) is known to perform poorly when the number of variables is large compared to the available number of samples. In order to reduce the mean squared error (MSE) of the SCM, regularized (shrinkage) SCM estimators are often used. In this work, we consider regularized SCM (RSCM) estimators for multiclass problems that couple together two different target matrices for regularization: the pooled (average) SCM of the classes and the scaled identity matrix. Regularization toward the pooled SCM is beneficial when the population covariances are similar, whereas regularization toward the identity matrix guarantees that the estimators are positive definite. We derive the MSE optimal tuning parameters for the estimators as well as propose a method for their estimation under the assumption that the class populations follow (unspecified) elliptical distributions with finite fourth-order moments. The MSE performance of the proposed coupled RSCMs are evaluated with simulations and in a regularized discriminant analysis (RDA) classification set-up on real data. The results based on three different real data sets indicate comparable performance to cross-validation but with a significant speed-up in computation time.

Elias Raninen, Student Member, IEEE, Esa Ollila, Senior Member, IEEE

I. INTRODUCTION
An increasingly common scenario in modern supervised learning problems is that the dimension p of the data is large compared to the number of available training samples n or exceeds it multifold (p ≫ n). Such scenarios are commonly referred to as high-dimensional or insufficient sample support problems. In this paper, we address the problem of high-dimensional covariance matrix estimation in a multiclass setup, where there are K different classes or populations, each comprising n_k, k = 1, . . . , K, independent and identically distributed (i.i.d.) p-dimensional samples. Estimates of the covariance matrices are needed in many multivariate analysis problems, such as principal component analysis and canonical correlation analysis [1], discriminant analysis [2], and Gaussian mixture models (GMM) [3], as well as in many engineering applications, e.g., array signal processing [4], genomics [5], portfolio optimization in finance [6], and graphical models [7]. The success of the analysis is often directly linked with the accuracy of the estimated covariance matrix.
The covariance matrix of class k ∈ {1, . . . , K} is defined as
\[
\mathbf{\Sigma}_k = \mathbb{E}\big[(\mathbf{x}_{ik} - \boldsymbol{\mu}_k)(\mathbf{x}_{ik} - \boldsymbol{\mu}_k)^\top\big],
\]
where x_ik denotes the ith sample from class k and µ_k = E[x_ik] is the mean of class k. The conventional estimate for the covariance matrix is the unbiased sample covariance matrix (SCM), defined for class k by
\[
\mathbf{S}_k = \frac{1}{n_k - 1} \sum_{i=1}^{n_k} (\mathbf{x}_{ik} - \bar{\mathbf{x}}_k)(\mathbf{x}_{ik} - \bar{\mathbf{x}}_k)^\top,
\]
where x̄_k = (1/n_k) Σ_i x_ik is the sample mean of class k. In high-dimensional settings, the SCM is known to work poorly due to its high variability. Furthermore, if n_k < p, then the SCM is singular, and hence, its inverse cannot be computed.
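As a concrete illustration, the unbiased SCM of a single class can be computed as follows (a minimal NumPy sketch; the function name `class_scm` is ours, not from the paper):

```python
import numpy as np

def class_scm(X):
    """Unbiased sample covariance matrix of one class.

    X : (n_k, p) array of i.i.d. p-dimensional samples from class k.
    """
    n_k = X.shape[0]
    xbar = X.mean(axis=0)          # sample mean of the class
    Xc = X - xbar                  # center the samples
    return Xc.T @ Xc / (n_k - 1)   # unbiased SCM

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 4))  # n_k = 500 samples, p = 4 variables
S = class_scm(X)
# S is symmetric and, for n_k > p, almost surely positive definite.
```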
Better estimators can be developed by using regularization, where the key idea is to shift or shrink the estimator toward a predetermined target or model. This can significantly decrease the variance of the estimator and improve the overall performance by reducing its mean squared error (MSE). This phenomenon can be understood via the following well-known bias-variance decomposition of the MSE. For an estimator Σ̂_k of Σ_k, the MSE can be written as
\[
\mathrm{MSE}(\hat{\mathbf{\Sigma}}_k)
= \mathbb{E}\big[\|\hat{\mathbf{\Sigma}}_k - \mathbf{\Sigma}_k\|_F^2\big]
= \mathbb{E}\big[\|\hat{\mathbf{\Sigma}}_k - \mathbb{E}[\hat{\mathbf{\Sigma}}_k]\|_F^2\big]
+ \big\|\mathbb{E}[\hat{\mathbf{\Sigma}}_k] - \mathbf{\Sigma}_k\big\|_F^2,
\]
where the first term on the right-hand side is the variance and the second term is the squared bias of the estimator. Since the SCM is unbiased, its MSE is equal to its variance. By using a regularized SCM (RSCM), however, it is possible to reduce the MSE significantly at the cost of introducing some bias.
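The bias-variance trade-off above can be observed numerically. The following Monte Carlo sketch (ours, not from the paper) compares the empirical MSE of the SCM with that of a simple RSCM that shrinks toward the scaled identity target with a fixed, not MSE-optimized, weight:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, trials = 50, 20, 200
Sigma = np.eye(p)                         # true covariance (spherical here)
alpha = 0.3                               # fixed, illustrative shrinkage weight

mse_scm = mse_reg = 0.0
for _ in range(trials):
    X = rng.standard_normal((n, p))       # N(0, I) samples, so Cov = Sigma
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / (n - 1)               # unbiased SCM
    I_S = (np.trace(S) / p) * np.eye(p)   # scaled identity target
    R = alpha * S + (1 - alpha) * I_S     # regularized SCM
    mse_scm += np.linalg.norm(S - Sigma, "fro") ** 2 / trials
    mse_reg += np.linalg.norm(R - Sigma, "fro") ** 2 / trials
# With p large relative to n, the biased but low-variance regularized
# SCM attains a much smaller empirical MSE than the unbiased SCM.
```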
In general, regularization combines an unstructured estimate with a predefined model or target. The target in regularization can be decided based on prior knowledge, assumptions, or a property that we want to enforce. Regularization can be accomplished in different ways. For instance, the SCM can be combined linearly with a target matrix as, e.g., in [6], [8], [9], [5], [10], [11], [12], and [13]. The approaches for parameter tuning differ between the methods. A popular approach, taken in [8], is based on estimating the asymptotically optimal (in terms of minimizing the MSE) tuning parameters. Under additional distributional assumptions, the method of [8] was later improved for Gaussian samples in [10] and for elliptically distributed samples in [13]. Other approaches for tuning parameter selection include, for example, the expected likelihood approach [14], [15]. Alternatively, the eigenvalues of the SCM can be transformed non-linearly toward a specific structure as in [16] and [17]. Multiple targets can also be used. For instance, in [18], a double shrinkage covariance matrix estimator was considered, which shrinks simultaneously toward a spherical matrix and a diagonal matrix. A somewhat related RSCM formulation was proposed in [3] in the setting of Gaussian mixture models. In [19], the estimator of [8] was extended to multiple simultaneous target matrices, which satisfy a certain target structure. In [20], a linear multi-target shrinkage covariance matrix estimator was proposed, which optimizes the tuning parameters using low-complexity leave-one-out cross-validation. There are also methods tailored for multiclass problems. For instance, [21] considered covariance matrix estimation from two possibly mismatched data sets using the maximum likelihood principle. In our previous work [22], linear pooling of SCMs was considered and applied to portfolio optimization. In the context of discriminant analysis classification, a popular RSCM was proposed in [2].
Inspired by [2], this paper focuses on two specific target matrices: the pooled SCM and the spherical matrix.
The rest of the paper is organized as follows. In Section II, we give some background and motivation for the proposed estimator. In Section III, we analyze some properties of the proposed estimator as well as discuss the optimization of the tuning parameters. In Section IV, we show how the theoretical tuning parameters can be estimated in practice when the unknown class populations follow unspecified elliptically symmetric distributions. Section V discusses some practical considerations and how to use the method in choosing the tuning parameters in RDA. In Section VI, synthetic simulation studies are conducted in order to assess the MSE performance of the estimator. The method is then compared to cross-validation in choosing the tuning parameters for RDA classification using three different real data sets. Lastly, Section VII concludes.
Notation: Throughout the paper, unless stated otherwise, all norms are Frobenius norms, defined by ‖A‖²_F = ⟨A, A⟩_F = tr(AᵀA), where the inner product of two matrices (of appropriate dimensions) A and B is defined by ⟨A, B⟩_F = tr(AᵀB). For any square matrix A, we frequently use the notation I_A = (1/p) tr(A) I and A_I = A − I_A. For a vector a ∈ ℝ^p, the Euclidean norm is defined as ‖a‖ = √(aᵀa). We define ℝ_{≥0} = {a ∈ ℝ : a ≥ 0}. For a scalar variable a, we use the shorthand notation ∂_a = ∂/∂a for the partial derivative with respect to a. For scalars a < b and c, we define the clamp or clip function [c]_a^b = max{a, min{b, c}}, which projects c onto the interval [a, b]. Lastly, Unif{a, b} and Unif(a, b) denote the discrete and continuous univariate uniform distributions on the set {a, a + 1, . . . , b − 1, b} and on the open interval (a, b), respectively.
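The operators I_A, A_I, and the clamp function defined above are used repeatedly in what follows; a small sketch (helper names are ours):

```python
import numpy as np

def I_of(A):
    """I_A = (tr(A)/p) * I, the scaled identity part of a square matrix A."""
    p = A.shape[0]
    return (np.trace(A) / p) * np.eye(p)

def clamp(c, a, b):
    """[c]_a^b = max{a, min{b, c}}: project c onto the interval [a, b]."""
    return max(a, min(b, c))

A = np.diag([1.0, 2.0, 3.0])
# tr(A)/p = 2, so I_A = 2*I, and A_I = A - I_A is trace-free.
```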

II. BACKGROUND AND MOTIVATION
In multiclass problems, when the classes can be assumed to have a similar covariance structure, it is beneficial to shrink the individual class covariance matrix estimates toward the pooled (average) SCM of the classes,
\[
\mathbf{S} = \sum_{k=1}^{K} \pi_k \mathbf{S}_k, \qquad \text{where } \pi_k = \frac{n_k}{n_1 + n_2 + \cdots + n_K}. \tag{1}
\]
For example, the methods proposed in [2], [23], and [24] used the convex combination
\[
\hat{\mathbf{\Sigma}}_k(\beta) = \beta \mathbf{S}_k + (1 - \beta)\mathbf{S}, \qquad \beta \in [0, 1], \tag{2}
\]
as an estimate for the class covariance matrix. The methods in [2], [23], and [24] were designed for discriminant analysis (DA) classification, in which the problem is to classify a new sample x to one of the K classes using the discriminant rule
\[
\hat{k} = \mathop{\arg\min}_{k} \; (\mathbf{x} - \hat{\boldsymbol{\mu}}_k)^\top \hat{\mathbf{\Sigma}}_k^{-1} (\mathbf{x} - \hat{\boldsymbol{\mu}}_k) + \ln \det \hat{\mathbf{\Sigma}}_k - 2 \ln \hat{p}_k, \tag{3}
\]
where µ̂_k and Σ̂_k denote estimates of the mean and the covariance matrix of class k, respectively, and p̂_k denotes the estimated prior probability of class k. Different DA methods differ in the approach used to estimate the class means and class covariance matrices. For example, quadratic discriminant analysis (QDA) uses the sample means and the SCMs in (3). In linear discriminant analysis (LDA), the pooled SCM S in (1) is used for all classes. If the true covariance matrices are equal, Σ_1 = Σ_2 = · · · = Σ_K ≡ Σ, then LDA is well justified since the pooled SCM is an unbiased estimator of Σ, i.e., E[S] = Σ. However, even in the case that the true population covariance matrices differ substantially, LDA often outperforms QDA when the sample size is small compared to the dimension [23], [25]. This is due to the high variance of the class SCMs compared to the pooled SCM. The partially pooled estimator (2) includes QDA and LDA as special cases when β = 1 and β = 0, respectively. In a very high-dimensional case, when p > Σ_j n_j, the partially pooled estimator (2) is no longer positive definite. In regularized discriminant analysis (RDA) [2], this problem is solved by coupling the pooled estimator with a spherical target matrix, that is, by regularization toward a scaled identity matrix via
\[
\hat{\mathbf{\Sigma}}_k(\alpha, \beta) = \alpha \hat{\mathbf{\Sigma}}_k(\beta) + (1 - \alpha)\, \mathbf{I}_{\hat{\mathbf{\Sigma}}_k(\beta)}, \qquad \alpha \in [0, 1], \tag{4}
\]
where Σ̂_k(β) is given in (2) and I_{Σ̂_k(β)} = (1/p) tr(Σ̂_k(β)) I. The success of regularization depends on a proper selection of the tuning parameters. In classification problems, cross-validating the classification error is the standard method for choosing the tuning parameters. Cross-validation can, however, be computationally very costly for large data sets.
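The coupled estimator (4), built from the pooled SCM (1) and the partial pooling (2), can be sketched as follows (an illustrative NumPy implementation; the function name `coupled_rscm` is ours):

```python
import numpy as np

def coupled_rscm(scms, weights, k, alpha, beta):
    """Coupled RSCM of eq. (4) for class k (illustrative sketch).

    scms    : list of class SCMs S_1, ..., S_K
    weights : pi_k = n_k / (n_1 + ... + n_K) for each class
    """
    p = scms[0].shape[0]
    S_pool = sum(w * S for w, S in zip(weights, scms))  # pooled SCM, eq. (1)
    Sk_beta = beta * scms[k] + (1 - beta) * S_pool      # partial pooling, eq. (2)
    target = (np.trace(Sk_beta) / p) * np.eye(p)        # scaled identity target
    return alpha * Sk_beta + (1 - alpha) * target       # eq. (4)

S1 = np.diag([1.0, 2.0])
S2 = np.diag([3.0, 4.0])
est = coupled_rscm([S1, S2], [0.5, 0.5], 0, 0.5, 0.5)
```

Note that the α-shrinkage in (4) preserves the trace of Σ̂_k(β), and the result is positive definite for α < 1 even when the class SCMs are singular.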
In certain high-dimensional binary classification settings, it is possible to leverage results from random matrix theory in order to estimate the (asymptotically) optimal tuning parameters that minimize the classification error [26], [27].
The main contribution of this work is to develop a computationally efficient method for choosing the tuning parameters in a multiclass covariance matrix estimation setting, where there can be more than two classes and where the RSCMs are defined in (4). Specifically, we find estimates of the tuning parameters that minimize the class-specific MSE,
\[
(\alpha_k^{\circ}, \beta_k^{\circ}) = \mathop{\arg\min}_{\alpha, \beta \in [0,1]} \; \mathrm{MSE}\big(\hat{\mathbf{\Sigma}}_k(\alpha, \beta)\big), \tag{5}
\]
for each population k = 1, . . . , K. The expressions for the optimal tuning parameters depend on unknown population parameters since the MSE expression involves the unknown covariance matrix. However, we show that the MSE can be estimated fairly easily by assuming that the class populations follow unspecified elliptically symmetric distributions.
It is worth noting that in RDA [2] the tuning parameters are common across the classes, whereas in this paper we use class-specific tuning parameters. However, in cases where it is useful to use common tuning parameters, they can easily be acquired by averaging, as explained in Subsection V-B. Hence, our proposed method can be used to obtain tuning parameters for the original RDA [2] framework, for which there already exist widely established toolboxes in programming languages such as R (see, e.g., [28], [29]). Lastly, an important distinction from [2] is that our method is not only targeted at classification problems but is suitable for other applications as well. For example, [21] considered the problem of estimating a covariance matrix from two data sets, where the population covariance matrix of the first data set is different but close to the population covariance matrix of the second data set. Problems of this type are encountered in radar processing as well as in hyperspectral imaging applications, where the additional data sets may have been acquired with slightly different measurement configurations, and hence have slightly different population parameters.
The current paper extends our earlier preliminary work in [30], which considered the estimation of the MSE optimal tuning parameters for the partially pooled estimator in (2). Here, we consider the more general estimator in (4), which includes additional shrinkage toward the scaled identity matrix and is therefore applicable also in cases where p > Σ_j n_j.
III. ESTIMATOR

Let us first consider four special cases of the estimator (4):

(C1) The unpooled regularized SCM estimator omits the pooled SCM and only shrinks toward the scaled identity matrix: Σ̂_k(α_k, β_k = 1) = α_k S_k + (1 − α_k) I_{S_k}. This type of shrinkage is typically considered in single-class covariance matrix estimation (see, e.g., [8] and [13]).

(C2) The partially pooled estimator omits regularization toward the scaled identity and only shrinks toward the pooled SCM: Σ̂_k(α_k = 1, β_k) = β_k S_k + (1 − β_k) S.

(C3) The fully pooled estimator uses the pooled SCM for every class k and shrinks it toward the scaled identity matrix: Σ̂_k(α_k, β_k = 0) = α_k S + (1 − α_k) I_S. Such shrinkage can be considered if all classes have an identical distribution.

(C4) The scaled identity estimator uses the partially pooled estimator to scale the identity matrix: Σ̂_k(α_k = 0, β_k) = I_{Σ̂_k(β_k)}.

Since it is clear that the tuning parameters are class-specific, we drop the subscripts from α_k and β_k and denote them from now on simply by α and β.
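The special cases (C1)-(C4) can be checked numerically from the general form (4); a small self-contained sketch (function names are ours):

```python
import numpy as np

def rscm(S_k, S_pool, alpha, beta):
    """Eq. (4): alpha * Sk(beta) + (1 - alpha) * I_{Sk(beta)} (sketch)."""
    p = S_k.shape[0]
    Sk_beta = beta * S_k + (1 - beta) * S_pool
    return alpha * Sk_beta + (1 - alpha) * (np.trace(Sk_beta) / p) * np.eye(p)

p = 2
S_k = np.diag([1.0, 3.0])
S_pool = np.diag([2.0, 2.0])
alpha, beta = 0.4, 0.7
I_of = lambda A: (np.trace(A) / p) * np.eye(p)

# (C1): beta = 1 removes the pooled SCM from the combination.
# (C3): beta = 0 shrinks the pooled SCM toward the scaled identity.
# (C4): alpha = 0 returns the scaled identity target itself.
```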
[Figure 1: The NMSE of the estimator (4) as a function of the tuning parameters (α, β), corresponding to setup A of the numerical study of Section VI. The small gray dots depict the estimated tuning parameters (showing 400 realizations of the 4000 Monte Carlo trials). The blue square denotes the mean (m_α, m_β) of the estimated tuning parameters over the Monte Carlo runs. The black triangle denotes the optimal tuning parameter pair (α°, β°). The special cases (C1–C4) correspond to the edges of the (α, β)-plane.]

As can be observed from Figure 1, the optimal tuning parameter pair is in this case closest to I_S. Hence, in this particular case, a good strategy would be to use the pooled SCM and regularize it toward a scaled identity matrix, i.e., to use the fully pooled estimator (C3). As can be noted, the proposed method is able to automatically choose the tuning parameters in a near-optimal way.
Let us now give an expression for the MSE of the estimator.
Theorem 1. The MSE of the estimator (4) is a bivariate polynomial of the form
\[
\mathrm{MSE}\big(\hat{\mathbf{\Sigma}}_k(\alpha, \beta)\big) = \sum_{i=0}^{2} \sum_{j=0}^{2} C_{ij}\, \alpha^i \beta^j, \tag{6}
\]
where the coefficients C_{ij} depend on the scalars tr(Σ_j), ‖Σ_j‖²_F, and ⟨Σ_i, Σ_j⟩_F, as well as on the sample sizes and the elliptical kurtosis parameters of the classes.

Proof. See Appendix A for the proof and the expressions for the coefficients C_{ij}.
The estimation of the coefficients C ij is deferred to Section IV. Given estimates of the coefficients, the critical points of (6) can be solved numerically (see Appendix A).
A useful property of the MSE polynomial in (6) is that given a fixed value of either α or β, the optimization of the remaining tuning parameter is a convex problem. This is known as biconvexity.
Theorem 2. The MSE (6) is a biconvex function. The optimal tuning parameter α given a fixed value of β ∈ [0, 1] is
\[
\alpha^{\circ}(\beta) = \left[ -\frac{C_{10} + C_{11}\beta + C_{12}\beta^2}{2\,(C_{20} + C_{21}\beta + C_{22}\beta^2)} \right]_0^1. \tag{7}
\]
Likewise, the optimal tuning parameter β given a fixed value of α ∈ [0, 1] is
\[
\beta^{\circ}(\alpha) = \left[ -\frac{C_{01} + C_{11}\alpha + C_{21}\alpha^2}{2\,(C_{02} + C_{12}\alpha + C_{22}\alpha^2)} \right]_0^1. \tag{8}
\]
The optimal tuning parameters for the special cases (C1–C4) can be solved using (7) and (8) of Theorem 2.
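Given estimates of the coefficients, the biconvex structure suggests alternating the two one-dimensional convex updates. The following sketch assumes the clipped-ratio form of the coordinate updates under the polynomial parameterization MSE = Σ_{ij} C_ij α^i β^j (our parameterization; the paper's exact coefficient expressions are in its appendix), and is demonstrated on a hypothetical toy polynomial rather than a real MSE:

```python
import numpy as np

def alt_minimize(C, iters=50):
    """Alternate the one-dimensional convex updates for alpha and beta.

    C[i, j] is the coefficient of alpha^i * beta^j in a biconvex
    bivariate polynomial on [0, 1] x [0, 1].
    """
    clamp = lambda c: max(0.0, min(1.0, c))
    alpha, beta = 0.5, 0.5
    for _ in range(iters):
        # The partial derivative in alpha is linear in alpha (biconvexity),
        # so each update is a clipped ratio.
        alpha = clamp(-(C[1,0] + C[1,1]*beta + C[1,2]*beta**2)
                      / (2*(C[2,0] + C[2,1]*beta + C[2,2]*beta**2)))
        beta = clamp(-(C[0,1] + C[1,1]*alpha + C[2,1]*alpha**2)
                     / (2*(C[0,2] + C[1,2]*alpha + C[2,2]*alpha**2)))
    return alpha, beta

# Toy polynomial (alpha - 0.3)^2 + (beta - 0.7)^2 + 0.1*alpha*beta:
C = np.zeros((3, 3))
C[0,0], C[1,0], C[2,0] = 0.58, -0.6, 1.0
C[0,1], C[0,2], C[1,1] = -1.4, 1.0, 0.1
a, b = alt_minimize(C)  # converges near (0.2657, 0.6867)
```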

A. Partially pooled estimator without identity shrinkage
Let us next highlight certain properties of the partially pooled estimator (2) of (C2), corresponding to the case Σ̂_k(α = 1, β). The propositions below provide additional insight into the optimal parameter β in this case.

Proposition 1. For the partially pooled estimator Σ̂_k(α = 1, β), the optimal tuning parameter satisfies β° < 1 almost surely.
Proof. The proof follows from showing that the numerator of (8) is always strictly less than the denominator, which is always positive. By setting α = 1, one finds after some algebra that subtracting the numerator from the denominator yields a quantity that is strictly positive almost surely (a.s.) for any continuous distribution.
This has the important implication that in a multiclass problem it is always possible to reduce the MSE of the SCM by using regularization toward the pooled SCM.
In the special case that all of the population covariance matrices are equal, we would like to use only the pooled SCM. This is exactly what happens, as shown in the next proposition.

Proposition 2. If Σ_1 = · · · = Σ_K and π_1 = · · · = π_K, then the optimal tuning parameter of the partially pooled estimator Σ̂_k(α = 1, β) is β° = 0.
Proof. The proof follows from showing that if the true population covariance matrices and the sample sizes are equal, then the numerator of (8) is zero while the denominator is non-zero. Observe that if Σ_k = Σ_j and π_k = π_j for all j, then the terms of the numerator cancel, and hence β° = 0.

B. Streamlined analytical estimator
An alternative estimator to (4) can be obtained if we change the α-regularization target so that the estimator becomes
\[
\tilde{\mathbf{\Sigma}}_k(\alpha, \beta) = \alpha \hat{\mathbf{\Sigma}}_k(\beta) + (1 - \alpha)\, \mathbf{I}_{\mathbf{T}}, \tag{9}
\]
where T ∈ {S_k, S} and Σ̂_k(β) is defined in (2). This simplifies the expression for the MSE and allows for an analytical solution for the tuning parameters. In [3] and [18], somewhat similar formulations were used with different targets, in a GMM setting and a single-class covariance matrix estimation setting, respectively. Note that the difference between (9) and (4) is in the scale of the identity target. The trace of (9) (the sum of the eigenvalues) depends on α, whereas in (4) this is not the case. However, when tr(I_T) ≈ tr(I_{Σ̂_k(β)}), the performance of the two estimators is expected to be similar. The MSE of (9) is given in the next theorem.
Theorem 3. The theoretical MSE of the (streamlined analytical) estimator (9) is a bivariate polynomial of the form
\[
\mathrm{MSE}\big(\tilde{\mathbf{\Sigma}}_k(\alpha, \beta)\big) = \sum_{i=0}^{2} \sum_{j=0}^{2} B_{ij}\, \alpha^i \beta^j.
\]
If the unconstrained minimizer lies in the interior of [0, 1] × [0, 1], the optimal tuning parameters are obtained in closed form from the stationarity conditions ∂_α MSE = 0 and ∂_β MSE = 0. Otherwise, the optimal parameters are on the boundary of the feasible set [0, 1] × [0, 1].

Proof. See Appendix B for the proof and the expressions for the coefficients B_{ij}.

IV. ESTIMATING THE TUNING PARAMETERS
The MSE and the optimal tuning parameters in Theorem 1 and Theorem 3 depend on the coefficients C_ij and B_ij, which in turn are functions of the unknown scalars tr(Σ_k), ‖Σ_k‖²_F, and ⟨Σ_i, Σ_j⟩_F, as well as of the elliptical kurtosis parameters of the classes. Hence, estimates of the optimal tuning parameters can be formed by a plug-in method, i.e., by replacing the above unknown parameters with their estimates. Such estimates are constructed in this section under the very general assumption that the samples are generated from unspecified elliptical distributions with finite fourth-order moments. Elliptical distributions are a location-scale family generalizing spherical distributions, and they include many common distributions, such as the multivariate normal (MVN) distribution and Student's t-distribution, to name a few. The probability density function (p.d.f.) of an elliptically distributed sample x from class k is, up to a normalizing constant, of the form
\[
f(\mathbf{x}) \propto |\mathbf{\Sigma}_k|^{-1/2}\, g_k\big((\mathbf{x} - \boldsymbol{\mu}_k)^\top \mathbf{\Sigma}_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k)\big),
\]
where µ_k and Σ_k are the mean and the positive definite covariance matrix of the distribution, respectively. The function g_k : ℝ_{≥0} → ℝ_{≥0} is called the density generator, which determines the distribution. For example, g_k(t) = exp(−t/2) defines the MVN distribution. For references about elliptical distributions, see, e.g., [31] and [32].
Estimates of the unknown scalars tr(Σ_k) and ‖Σ_k‖²_F can be obtained by finding estimates of the scale parameter η_k and the sphericity parameter γ_k, defined as
\[
\eta_k = \frac{\operatorname{tr}(\mathbf{\Sigma}_k)}{p}
\quad \text{and} \quad
\gamma_k = \frac{p\,\|\mathbf{\Sigma}_k\|_F^2}{\operatorname{tr}(\mathbf{\Sigma}_k)^2}.
\]
Observe that I_{Σ_k} = η_k I, tr(Σ_k) = p η_k, and ‖Σ_k‖²_F = γ_k η_k² p. The elliptical kurtosis κ_k enters the coefficients via the second-order moments of the SCM,
\[
\mathbb{E}\big[\|\mathbf{S}_k - \mathbf{\Sigma}_k\|_F^2\big]
= \tau_{1k} \operatorname{tr}(\mathbf{\Sigma}_k)^2 + (\tau_{1k} + \tau_{2k}) \|\mathbf{\Sigma}_k\|_F^2,
\]
where τ_{1k} = 1/(n_k − 1) + κ_k/n_k and τ_{2k} = κ_k/n_k.
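The identities relating η_k and γ_k to tr(Σ_k) and ‖Σ_k‖²_F can be verified numerically. A small sketch using an AR(1) covariance matrix (the standard definitions of scale and sphericity are assumed, as in [13] and [22]):

```python
import numpy as np

p, rho = 5, 0.6
# AR(1) covariance: Sigma[i, j] = rho^|i - j|
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))

eta = np.trace(Sigma) / p                                          # scale
gamma = p * np.linalg.norm(Sigma, "fro")**2 / np.trace(Sigma)**2   # sphericity

# Identities used in the text: I_Sigma = eta * I, tr(Sigma) = p * eta,
# and ||Sigma||_F^2 = gamma * p * eta^2. Sphericity satisfies
# 1 <= gamma <= p, with gamma = 1 iff Sigma is proportional to I.
```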
In the following subsections, we show how to estimate the unknown parameters η_k, γ_k, κ_k, and ⟨Σ_i, Σ_j⟩_F. We use the same approach as in [22].

A. Estimation of the elliptical kurtosis
The elliptical kurtosis of class k is estimated by the sample average
\[
\hat{\kappa}_k = \max\left\{ \frac{-2}{p+2},\; \frac{1}{3p} \sum_{j=1}^{p} \hat{k}_{jk} \right\},
\]
where k̂_{jk} = m̂⁽ʲ⁾_{4k}/(m̂⁽ʲ⁾_{2k})² − 3 is the sample excess kurtosis of the jth variable of class k, and
\[
\hat{m}^{(j)}_{qk} = \frac{1}{n_k} \sum_{i=1}^{n_k} \big( (\mathbf{x}_{ik})_j - (\bar{\mathbf{x}}_k)_j \big)^q
\]
is the qth order sample moment. Here, the notation (·)_j picks the jth variable from the vector. The maximum constraint ensures that κ̂_k respects the theoretical lower bound of −2/(p + 2) [33]. Note that finite fourth-order moments are required in order to have a finite elliptical kurtosis.
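A sketch of this kind of kurtosis estimator follows (ours; the paper's estimator may differ in finite-sample details, but the idea — average marginal excess kurtosis divided by 3, clamped at the theoretical lower bound — is as described above):

```python
import numpy as np

def elliptical_kurtosis(X):
    """Average marginal excess kurtosis divided by 3, clamped from below."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    m2 = np.mean(Xc**2, axis=0)             # 2nd order sample moments
    m4 = np.mean(Xc**4, axis=0)             # 4th order sample moments
    k_hat = np.mean(m4 / m2**2 - 3.0) / 3.0
    return max(k_hat, -2.0 / (p + 2))       # theoretical lower bound [33]

rng = np.random.default_rng(2)
X = rng.standard_normal((5000, 5))
kappa = elliptical_kurtosis(X)  # near 0 for Gaussian data
```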

B. Estimation of the sphericity
It would be natural to use the SCM in order to develop an estimator for γ_k. However, following [34], [13], [30], and [22], we use the simple and well-performing estimator
\[
\hat{\gamma}_k = \frac{p\, n_k}{n_k - 1} \left( \operatorname{tr}\big(\hat{\mathbf{S}}_{\mathrm{sgn},k}^2\big) - \frac{1}{n_k} \right),
\]
which uses the sample spatial sign covariance matrix (SSCM), defined as
\[
\hat{\mathbf{S}}_{\mathrm{sgn},k} = \frac{1}{n_k} \sum_{i=1}^{n_k} \frac{(\mathbf{x}_{ik} - \hat{\boldsymbol{\mu}}_k)(\mathbf{x}_{ik} - \hat{\boldsymbol{\mu}}_k)^\top}{\|\mathbf{x}_{ik} - \hat{\boldsymbol{\mu}}_k\|^2},
\]
where the mean µ_k is estimated by the spatial median [35], µ̂_k = arg min_µ Σᵢ ‖x_ik − µ‖. In particular, in [13], the SSCM-based sphericity estimator was compared to an SCM-based sphericity estimator when sampling from elliptical distributions. The SSCM-based estimator performed better in all cases except when sampling from an MVN distribution. Furthermore, when µ is known, it was shown in [22] that the estimator is asymptotically unbiased as the dimension grows if γ_k/p → 0 as p → ∞. This assumption was shown to hold, for example, for the first-order autoregressive (AR(1)) covariance matrix but not for the compound symmetry (CS) covariance matrix (see Section VI for the definitions of these covariance matrix structures) [22].
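An SSCM-based sphericity estimate of the above form can be sketched as follows. For brevity, this sketch centers with the sample mean instead of the spatial median used in the paper (an assumption that matters mainly for heavy-tailed data), and the finite-sample correction shown is one plausible form of the estimator described in the text:

```python
import numpy as np

def sscm(X, mu):
    """Sample spatial sign covariance matrix around a given center mu."""
    U = X - mu
    U /= np.linalg.norm(U, axis=1, keepdims=True)  # project onto unit sphere
    return U.T @ U / X.shape[0]                    # note: tr(sscm) = 1

def sphericity(X):
    """SSCM-based sphericity estimate (sketch; sample-mean centering)."""
    n, p = X.shape
    S_sgn = sscm(X, X.mean(axis=0))
    return p * n / (n - 1) * (np.trace(S_sgn @ S_sgn) - 1.0 / n)

rng = np.random.default_rng(3)
X = rng.standard_normal((500, 10))  # spherical data: true sphericity is 1
g = sphericity(X)
```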

C. Estimation of the scale and the inner products
The scale η_k, as well as the inner products ⟨Σ_i, Σ_j⟩_F for i ≠ j, can be estimated by using the SCMs, i.e., by η̂_k = tr(S_k)/p and ⟨S_i, S_j⟩_F, respectively. However, regarding the inner products ⟨Σ_i, Σ_j⟩_F, we use the same SSCM-based estimator as in [22], which is
\[
\widehat{\langle \mathbf{\Sigma}_i, \mathbf{\Sigma}_j \rangle}_F
= \hat{\eta}_i \hat{\eta}_j \, p^2 \, \big\langle \hat{\mathbf{S}}_{\mathrm{sgn},i}, \hat{\mathbf{S}}_{\mathrm{sgn},j} \big\rangle_F.
\]
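The idea is that the SSCMs, whose traces equal 1, are scaled back to covariance scale by pη̂. A sketch under the same simplifications as before (sample-mean centering instead of the spatial median; the exact scaling used by the paper is an assumption here):

```python
import numpy as np

def sscm(X):
    U = X - X.mean(axis=0)                         # spatial median in the paper;
    U /= np.linalg.norm(U, axis=1, keepdims=True)  # sample mean used for brevity
    return U.T @ U / X.shape[0]

def cross_inner_product(Xi, Xj):
    """SSCM-based estimate of <Sigma_i, Sigma_j>_F for i != j (sketch)."""
    p = Xi.shape[1]
    eta_i = np.trace(np.cov(Xi, rowvar=False)) / p  # scale from the SCM trace
    eta_j = np.trace(np.cov(Xj, rowvar=False)) / p
    return eta_i * eta_j * p**2 * np.trace(sscm(Xi) @ sscm(Xj))

rng = np.random.default_rng(4)
Xi = rng.standard_normal((1000, 10))
Xj = rng.standard_normal((1000, 10))
est = cross_inner_product(Xi, Xj)  # true value <I, I>_F = p = 10
```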

V. PRACTICAL CONSIDERATIONS
In this section, we first discuss how to compute the estimates of the optimal tuning parameters in practice. Then, we discuss regularizing the tuning parameters themselves, which in effect makes the method compatible with RDA [2].
A. Computation of the optimal tuning parameters

A straightforward way of solving for the optimal tuning parameters (5) of the proposed estimator (4) is to form a two-dimensional grid of (α, β) ∈ [0, 1] × [0, 1] and choose the point that yields the minimum (estimated) MSE in (6). The solution can be further fine-tuned via alternating convex minimization by iterating between (7) and (8). Convergence of the iterations is addressed in Appendix A-C. This method of finding the optimal tuning parameters was used in the simulations of Section VI, where it is denoted by POLY (due to the polynomial structure of the MSE in Theorem 1). In the simulations, the (α, β)-grid was constructed from all pairs (α, β) with α, β ∈ {0, 0.05, . . . , 1}.
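The grid-search step can be sketched as follows, demonstrated on a hypothetical toy polynomial rather than a real estimated MSE (function names and the coefficient layout are ours):

```python
import numpy as np

def poly_mse(C, alpha, beta):
    """Evaluate the bivariate polynomial sum_ij C[i, j] * alpha^i * beta^j."""
    a = np.array([1.0, alpha, alpha**2])
    b = np.array([1.0, beta, beta**2])
    return a @ C @ b

def grid_search(C, step=0.05):
    """POLY-style grid search over (alpha, beta) in [0, 1]^2 (sketch)."""
    grid = np.arange(0.0, 1.0 + step / 2, step)
    best = min((poly_mse(C, a, b), a, b) for a in grid for b in grid)
    return best[1], best[2]

# Toy polynomial with minimum at (0.3, 0.7), no cross terms:
C = np.zeros((3, 3))
C[2, 0], C[1, 0] = 1.0, -0.6   # (alpha - 0.3)^2 up to a constant
C[0, 2], C[0, 1] = 1.0, -1.4   # (beta - 0.7)^2 up to a constant
a, b = grid_search(C)
```

In practice the grid minimum would then be refined by the alternating convex updates, as described above.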
An alternative way of solving for the optimal tuning parameters is to find the critical points of the MSE polynomial (6), which is addressed in Appendix A.
Given estimates for the coefficients C ij , evaluating the equations (6), (7), and (8) is computationally very light. Most of the computation is due to estimating the coefficients and not in optimizing the tuning parameters.
The streamlined analytical estimator (9) with T = S, which is discussed in Section III-B, is denoted by POLYs. In this case, the tuning parameters are found using the plug-in estimates of the optimal values stated in Theorem 3.

B. Averaging of the tuning parameters
The estimation accuracy of the optimal tuning parameters (α_k, β_k), k = 1, . . . , K, depends on the estimates of the statistical population parameters discussed in Section IV. If the estimation of these parameters is difficult for some data sets, which can happen if the elliptical distribution assumption does not fit the data well, regularization of the tuning parameters themselves can be useful. A possible way to accomplish this is to average the tuning parameters of the classes by using ᾱ = (1/K) Σ_k α̂_k and β̄ = (1/K) Σ_k β̂_k in place of the class-specific tuning parameters. Averaging the estimated class tuning parameters is reasonable as long as the optimal tuning parameters are not too different from each other, which is the case when the class covariance matrices are similar. If the population covariance matrices are very different from each other, averaging the estimated tuning parameters might degrade the performance. However, by using (ᾱ, β̄) as common tuning parameters for all classes in (4), the resulting RSCM estimator becomes compatible with the RDA proposed in [2]. Thus, this method can be used instead of cross-validation as an alternative way of choosing the tuning parameters for RDA. In Section VI, we illustrate that this approach enables a significant speed-up in the computation of RDA with no effective loss in performance.
The averaged version of the estimator (4), which uses the mean of the estimated tuning parameters for all classes, is denoted by POLY-Ave. Regarding the streamlined analytical estimator (9), the averaged version is denoted by POLYs-Ave.
VI. NUMERICAL STUDIES

In this section, the performance of the proposed estimators is reported and compared against other competing methods. First, in Subsection VI-A, the MSE performance of the method is examined with synthetic simulations. Then, in Subsection VI-B, the method is applied to RDA classification and compared to cross-validation in choosing the tuning parameters.

A. Synthetic simulations
We evaluated the empirical NMSE performance of the proposed methods as well as the accuracy of the plug-in estimates η̂_k, κ̂_k, γ̂_k, and the estimates of the inner products ⟨Σ_i, Σ_j⟩_F. The results were averaged over 4000 Monte Carlo trials. We simulated four different setups: A, B, C, and D. In each setup, we generated K = 4 classes, each of dimension p = 200. The data was generated from a multivariate Student's t_ν-distribution with various degrees of freedom ν, and the means of the classes were generated from the standard normal distribution N_p(0, I) and held fixed over the Monte Carlo trials for setups A, B, and C. In setup D, the means as well as the other parameters were randomly generated anew for each Monte Carlo trial. The four different setups were as follows.

D. Randomized: For each k = 1, 2, 3, 4 and each Monte Carlo trial:
• Σ_k is either AR(1) or CS with equal probability, with its parameter drawn from Unif(0, 0.9).

In addition to the proposed estimators, the simulations include the sample covariance matrix (SCM); the pooled SCM (1) (POOL); the partially pooled estimator (2) using the method from [30] (PPOOL) for choosing β_k; and the method from [13] (ELL1), which is designed for single-class covariance matrix estimation and uses the convex combination β_k S_k + (1 − β_k) η̂_k I. We also include the estimators proposed in [22] (LIN1 and LIN2), which use a nonnegative linear combination of the class SCMs; LIN2 differs from LIN1 in that it incorporates additional shrinkage toward the identity matrix. In addition, we include the shrinkage covariance matrix estimators proposed in [20] (LOOCV1 and LOOCV2). These methods estimate the covariance matrix via a nonnegative linear combination of the SCM and target matrices using (low-complexity) leave-one-out cross-validation. In the simulations, for each class k, LOOCV1 uses the pooled SCM and the identity matrix as target matrices for regularizing the SCM of class k, whereas LOOCV2 uses the individual SCMs of the other classes as well as the identity matrix.
The empirical NMSE, Ave{‖Σ̂_k − Σ_k‖²_F / ‖Σ_k‖²_F}, is reported for each class in Table I. As can be noted, in setups A, B, and C, the proposed methods (POLY, POLYs, POLY-Ave, and POLYs-Ave) yielded significantly lower NMSE than SCM, POOL, PPOOL, ELL1, LIN1, and LOOCV1. The only methods that performed comparably well to POLY and POLYs were LIN2 and LOOCV2. In setup A (AR(1) case), the best performing methods were POLY, POLYs, and LIN2, which all had a similar NMSE. In setup B (CS case), POLY and POLYs performed best (LIN2 and LOOCV2 had a slightly higher NMSE or a higher standard deviation). In setup C (mixed case), LIN2 and LOOCV2 had similar performance and a slightly lower NMSE than the proposed methods. In setup D (randomized case), LIN2 and LOOCV2 had slightly lower NMSE than POLY and POLYs; however, LOOCV2 had over two times the standard deviation compared to LIN2, POLY, and POLYs. In summary, when the covariance matrices had a similar structure (setups A and B), the proposed POLY and POLYs methods performed best, whereas when the covariance matrices had different structures, LOOCV2 and LIN2 had a slight edge. The averaged versions of the proposed methods, POLY-Ave and POLYs-Ave, performed well when the covariance matrices had a similar structure (setups A and B), whereas in setups C and D their performance degraded compared to the methods using class-specific tuning parameters. Lastly, the performance of POLY was almost identical to that of POLYs, as was the performance of POLY-Ave to that of POLYs-Ave.

Figure 3 shows the boxplots of the estimated statistical parameters η_k, κ_k, γ_k, and the inner products ⟨Σ_i, Σ_j⟩_F for setup C, where classes 1 and 2 have an AR(1) covariance structure and classes 3 and 4 have a CS structure. The medians of the estimated scales η_k coincide with the true values (shown as black triangles in the plot).
The elliptical kurtosis κ_k was more difficult to estimate for heavier-tailed distributions (classes 2 and 4 with ν = 8 degrees of freedom) than for lighter-tailed distributions (classes 1 and 3 with ν = 12 degrees of freedom). The sphericity γ_k was well estimated for classes 1 and 2, which both had the AR(1) covariance structure. Regarding classes 3 and 4, which had the CS covariance structure, the sphericity estimates were noticeably biased. This was most likely due to the sphericity estimator not being asymptotically unbiased for covariance matrices with CS structure [22]. The inner products were well estimated except for those pairs for which both covariance matrices had a CS structure, i.e., ⟨Σ_i, Σ_j⟩_F for (i, j) ∈ {(3, 3), (3, 4), (4, 4)}.
The estimated tuning parameters for setup C are shown in Figure 2, which depicts the theoretical NMSE of the estimator as a function of the tuning parameters. The optimal tuning parameter pair (α°, β°) is denoted by the black triangle, and the small gray dots correspond to the estimated tuning parameters (the first 400 estimates from the 4000 Monte Carlo trials are shown). The mean of the estimated tuning parameters, denoted by the blue square, was very close to the optimal value, although a slight bias could be observed. The estimated tuning parameters were clustered tightly around the optimal point, especially for classes 1 and 2, whose covariance matrices had an AR(1) structure. Regarding classes 3 and 4, whose covariance matrices had the CS structure, there was slightly more spread in the estimates.

B. Discriminant analysis
We evaluated the performance of the proposed method POLY-Ave in discriminant analysis classification. The implementation was written in R, using three different real data sets: Sonar, Vowel, and Ionosphere, which were obtained from [36] via the R package mlbench [37]. Features that were constant for some of the classes were removed from the data sets. Specifically, from the Vowel data set we removed the first feature, and from the Ionosphere data set we removed the first two features. The final specifications of the data sets are given in Table II. In our implementation, we used the package SpatialNP [38] for computing the SSCM more efficiently, and the package tictoc [39] for recording the computation times of the methods.
In the simulations, we randomly selected a proportion of the samples as training data for estimating the sample means and RSCMs. The remaining data was used as test data to estimate the classification accuracy using the classification rule given in (3). The training set size is expressed as the fraction of the total number of samples used for training; e.g., a training set size of 0.3 means that 30% of the data was used for training and the remaining 70% was used as test data to estimate the classification accuracy. The results were averaged over 10 independent Monte Carlo repetitions for each training set size.
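As a minimal sketch of this evaluation protocol (helper names are ours, not from the paper), each Monte Carlo repetition draws a random train/test split of the sample indices:

```python
import numpy as np

def split_data(n, train_frac, rng):
    """Randomly split n sample indices: a fraction train_frac
    (e.g., 0.3 for 30%) goes to training, the rest to testing."""
    perm = rng.permutation(n)
    n_train = int(round(train_frac * n))
    return perm[:n_train], perm[n_train:]

rng = np.random.default_rng(1)
# one Monte Carlo repetition with a training set size of 0.3
train_idx, test_idx = split_data(200, 0.3, rng)
```

The classification accuracy on `test_idx` would then be averaged over the 10 repetitions for each training set size.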
For comparison with our proposed method, we included two different approaches to cross-validation. The first chooses the tuning parameters from a two-dimensional grid of candidate values as the minimizer of the 5-fold or 10-fold cross-validated misclassification rate. In 10-fold cross-validation, the grid consists of all pairs (α, β) with α, β ∈ {0, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1}; in 5-fold cross-validation, the corresponding set is {0, 0.25, 0.5, 0.75, 1}. This method is denoted by 5-CV and 10-CV and was implemented using the packages caret [40] and klaR [28], [29]. The second cross-validation method also selects the tuning parameters that minimize the cross-validated misclassification rate but, instead of using a predefined two-dimensional grid, it uses the Nelder–Mead algorithm included in the R package klaR [28], [29]. This method was implemented with both 5-fold and 10-fold cross-validation and is denoted by 5-CV-NM and 10-CV-NM, respectively.

Figure 4 depicts the mean classification accuracy on the test set as a function of the training set size as well as the median computation time. All of the tested methods gave comparable classification performance; depending on the data set and training set size, there were only minor differences. The computation time was, however, substantially longer for the cross-validation based methods, and it tended to increase with the training set size faster than for the proposed method. The chosen tuning parameter values are shown in Figure 5. For the cross-validation based methods, the tuning parameter values varied considerably across training set sizes, whereas they remained much more stable for our proposed method.
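The grid-based cross-validation baseline can be sketched as follows. This is a minimal illustration, not the caret/klaR implementation: `cv_error(alpha, beta)` stands for a user-supplied callback returning the K-fold cross-validated misclassification rate for one tuning parameter pair.

```python
import itertools
import numpy as np

def cv_grid_select(cv_error, n_folds=10):
    """Return the (alpha, beta) pair minimizing a K-fold
    cross-validated misclassification rate over the grids used in
    the text: step 0.125 for 10-fold, step 0.25 for 5-fold."""
    step = 0.125 if n_folds == 10 else 0.25
    grid = np.arange(0.0, 1.0 + step / 2, step)  # includes both endpoints
    return min(itertools.product(grid, grid),
               key=lambda ab: cv_error(*ab))
```

For the 10-fold grid this evaluates 9 × 9 = 81 candidate pairs, which is the main source of the longer computation times reported above.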

VII. CONCLUSIONS
Two regularized sample covariance matrix estimators for multiclass problems, specified in (4) and (9), were considered in this work, and their MSE-optimal class-specific tuning parameters were derived in Theorem 1 and Theorem 3, respectively. Since the optimal tuning parameters depend on the unknown scalar parameters in (10), a method for their estimation was proposed. The usefulness of the method was supported by numerical simulations, which demonstrated very good MSE performance compared to similar methods. By averaging the class-specific tuning parameters, the method was also applied to choosing the tuning parameters in an RDA classification framework. The proposed approach achieved classification performance comparable to cross-validation on each of the three tested real data sets, but with significantly faster computation. Code for the proposed methods in the Matlab, R, and Python programming languages is available at https://github.com/EliasRaninen.

A. Notation and useful identities
Let A and B be symmetric positive semidefinite matrices of the same size, and let ⟨A, B⟩ = tr(A⊤B) denote the Frobenius inner product with induced norm ‖A‖ = ⟨A, A⟩^{1/2}.

B. MSE of the estimator
The estimator Σ̂_k(α, β) in (4) can be rewritten so that its squared error is a polynomial in the tuning parameters. Taking the expectation of the squared error gives the MSE,

MSE(Σ̂_k) = α²(β²C_22 + βC_21 + C_20) + α(βC_11 + C_10) + β²C_02 + βC_01 + C_00,

with the coefficients C_mn = E[C_mn], where m, n ∈ {0, 1, 2}.
For example, one of the coefficients involves an expected squared Frobenius norm plus cross terms of the form Σ_{i≠j} π_i π_j ⟨Σ_i, Σ_j⟩_F. The estimation of these terms is explained in Section IV.
The optimal tuning parameters can be solved exactly as follows. Let L_k = MSE(Σ̂_k). Setting the gradient equation ∂L_k/∂α = 2α(β²C_22 + βC_21 + C_20) + βC_11 + C_10 equal to zero and solving for α gives

α = −(1/2) · (βC_11 + C_10)/(β²C_22 + βC_21 + C_20)    (11)

when β²C_22 + βC_21 + C_20 ≠ 0, which holds a.s. for continuous distributions (see Appendix A-C). By substituting (11) into the equation ∂L_k/∂β = 0, after some algebra, we obtain a rational function. The zeros of the rational function can be computed from the roots of its numerator, which is a quintic (fifth-order) polynomial in β. To this end, general polynomial solvers can be used. The critical points of the MSE are obtained by substituting the roots into (11). The optimal tuning parameters then correspond to the critical point that yields the minimum estimated MSE. If this critical point is not in the feasible set [0, 1] × [0, 1], the other critical points and the boundaries need to be considered, i.e., the special cases C1–C4 (see Section III). For this purpose, equations (7) and (8) of Theorem 2 can be used. In practice, it may be simpler to find an approximately optimal tuning parameter pair by forming a two-dimensional grid of (α, β) values and choosing the point with the smallest estimated MSE, as explained in Subsection V-A. The approximate solution can then be further fine-tuned by iterating (7) and (8) (see Appendix A-C).
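The grid-based alternative can be sketched as follows. This is a minimal illustration under the assumption that the plug-in coefficient estimates C_mn are available in a dictionary keyed by (m, n) (our representation, not the paper's); the quadratic form evaluated below is the one implied by the gradient equations.

```python
import numpy as np

def grid_min_mse(C, n=101):
    """Evaluate the estimated MSE surface
    L(a, b) = a^2 (b^2 C22 + b C21 + C20) + a (b C11 + C10)
              + b^2 C02 + b C01 + C00
    on an n-by-n grid over [0, 1] x [0, 1] and return the
    minimizing (alpha, beta) pair and the attained value."""
    g = np.linspace(0.0, 1.0, n)
    A, B = np.meshgrid(g, g, indexing="ij")
    L = (A**2 * (B**2 * C[2, 2] + B * C[2, 1] + C[2, 0])
         + A * (B * C[1, 1] + C[1, 0])
         + B**2 * C[0, 2] + B * C[0, 1] + C[0, 0])
    i, j = np.unravel_index(np.argmin(L), L.shape)
    return g[i], g[j], L[i, j]
```

The grid minimizer can then serve as the starting point for the fine-tuning iterations of (7) and (8).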

C. Alternate convex minimization
By considering one of the tuning parameters α or β fixed, the optimization over the remaining one is a convex problem; the objective is thus biconvex. To prove this, we show that the second derivatives of the MSE L_k are positive (implying strict convexity). Indeed, we have

∂²L_k/∂α² = 2β²C_22 + 2βC_21 + 2C_20 > 0,
∂²L_k/∂β² = 2α²C_22 + 2C_02 > 0.

The first expression is an upward-opening quadratic function of β. Hence, it is positive if its discriminant is negative (i.e., there are no real roots). Indeed, the discriminant C_21² − 4C_22C_20 is negative, which follows from the Cauchy–Schwarz inequality. The second expression is also positive since C_22 and C_02 are a.s. both positive.
The corresponding (unconstrained) solution for α given a fixed value of β was already given in (11). To guarantee that α, β ∈ [0, 1], a projection onto the feasible set by the clip function [·]_0^1 has to be applied. Hence, we obtain (7) and (8).
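As an illustration, the alternating scheme can be sketched as follows, again assuming the coefficient estimates C_mn are stored in a dictionary keyed by (m, n) (our representation). The clipped closed-form updates below mirror (11) and its β counterpart obtained from ∂L_k/∂β = 0.

```python
def clip01(x):
    """Clip function projecting x onto [0, 1]."""
    return min(max(x, 0.0), 1.0)

def alternate_min(C, alpha0=0.5, beta0=0.5, n_iter=200, tol=1e-12):
    """Coordinate-wise (biconvex) minimization of the quadratic MSE
    surface: each update solves the convex one-dimensional problem in
    closed form and clips the solution to [0, 1]."""
    a, b = alpha0, beta0
    for _ in range(n_iter):
        # alpha update: minimizer of L_k in alpha for fixed beta
        a_new = clip01(-0.5 * (b * C[1, 1] + C[1, 0])
                       / (b**2 * C[2, 2] + b * C[2, 1] + C[2, 0]))
        # beta update: minimizer of L_k in beta for fixed alpha
        b_new = clip01(-0.5 * (a_new**2 * C[2, 1] + a_new * C[1, 1] + C[0, 1])
                       / (a_new**2 * C[2, 2] + C[0, 2]))
        if abs(a_new - a) < tol and abs(b_new - b) < tol:
            a, b = a_new, b_new
            break
        a, b = a_new, b_new
    return a, b
```

Each update strictly decreases the objective unless the iterate is already a fixed point, which is what the convergence argument below relies on.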
Lastly, we make some remarks about the convergence of sequentially iterating (7) and (8). Let α^(i) and β^(i) denote the tuning parameter values at the ith iteration. Denote the MSE function to be minimized as L_k : X × Y → R, where X = [0, 1] and Y = [0, 1]. Since the MSE L_k is bounded from below and the sequence {L_k(α^(i), β^(i))}_{i∈N} is monotonically decreasing, it converges [41, Theorem 4.5]. Then, since L_k is continuous, X and Y are closed sets, and both (7) and (8) have unique solutions (due to strict convexity) in the compact set [0, 1], by [41, Theorem 4.9] we have lim_{i→∞} ‖(α^(i+1), β^(i+1)) − (α^(i), β^(i))‖ = 0. Furthermore, since X and Y are subsets of R, the algorithm is coordinate-wise and the sequence {(α^(i), β^(i))}_{i∈N} converges [41, p. 398]. If the accumulation point of the sequence is in the interior of [0, 1] × [0, 1], then it is a stationary point [41, Corollary 4.10]. In theory, however, a stationary point can be a global minimum, a local minimum, or a saddle point. Therefore, it is important to select the starting point for the iterations carefully.
By setting ∂L_k/∂β = 0, it is easy to see that (α = 0, β = −B_10/B_11) is a critical point. Note, however, that when α = 0, the estimator does not depend on β. If instead α ≠ 0, solving ∂L_k/∂β = 0 for β gives an expression for β in terms of α. Substituting this expression into ∂L_k/∂α = 0 and solving for α yields (12), and substituting the result back into the expression for β gives (13). The nature of a critical point is determined by the Hessian matrix H of second partial derivatives, whose determinant is det(H) = L_αα L_ββ − L_αβ². For the critical point (α = 0, β = −B_10/B_11), we have L_ββ = 0 and L_αβ = B_11, so the determinant of the Hessian is −B_11² < 0. This implies that the eigenvalues of the Hessian matrix have different signs, and hence this critical point is a saddle point. Regarding the other critical point, defined by (12) and (13),
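The saddle-point classification via the sign of det(H) can also be checked numerically. The helper below is our own illustration (not from the paper): it approximates the Hessian determinant of a smooth bivariate function by central finite differences.

```python
def hessian_det(L, a, b, h=1e-5):
    """Finite-difference determinant of the Hessian of a smooth
    bivariate function L at the point (a, b). A negative value
    indicates a saddle point, as in the classification of the
    critical point (alpha = 0, beta = -B10/B11) above."""
    laa = (L(a + h, b) - 2.0 * L(a, b) + L(a - h, b)) / h**2
    lbb = (L(a, b + h) - 2.0 * L(a, b) + L(a, b - h)) / h**2
    lab = (L(a + h, b + h) - L(a + h, b - h)
           - L(a - h, b + h) + L(a - h, b - h)) / (4.0 * h**2)
    return laa * lbb - lab**2
```

For example, the function L(α, β) = αβ has a saddle at the origin (negative determinant), whereas L(α, β) = α² + β² has a minimum there (positive determinant).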