Tuning the Parameters for Precision Matrix Estimation Using Regression Analysis

Precision matrix, i.e., inverse covariance matrix, is widely used in signal processing, and often estimated from training samples. Regularization techniques, such as banding and rank reduction, can be applied to the covariance matrix or precision matrix estimation for improving the estimation accuracy when the training samples are limited. In this paper, exploiting regression interpretations of the precision matrix, we introduce two data-driven, distribution-free methods to tune the parameter for regularized precision matrix estimation. The numerical examples are provided to demonstrate the effectiveness of the proposed methods and example applications in the design of minimum mean squared error (MMSE) channel estimators for large-scale multiple-input multiple-output (MIMO) communication systems are demonstrated.


I. INTRODUCTION
Covariance matrix reveals the marginal correlations between variables, while precision matrix (i.e., inverse covariance matrix) encodes conditional correlations between pairs of variables given the remaining ones [1].For Gaussian graphical models, the sparsity pattern of the precision matrix depicts the graph structure [2].Both the covariance matrix and precision matrix are used extensively in signal processing and machine learning [3], [4].This paper deals with the estimation of precision matrix, which is required in applications such as minimum mean squared error (MMSE) estimation [5], [6], array signal processing [7], correlation analysis [8] and linear discriminant analysis (LDA) [9].A critical challenge is that when the training data is limited, sample covariance matrix (SCM) is ill-conditioned and even singular.A precision matrix constructed by directly inverting the SCM, referred to as sample precision matrix (SPM) below, may suffer from significant errors.In this case, regularization, which often imposes a priori assumptions on the structure of covariance or precision matrix, may reduce the number of free parameters to be estimated.By tuning the regularization parameter properly, a good tradeoff between bias and variance may be achieved and the overall estimation accuracy can be improved.
The associate editor coordinating the review of this manuscript and approving it for publication was Taufik Abrao.
In order to optimize the performance of regularized estimators, parameters such as bandwidth, sparsity level, and shrinkage coefficients must be tuned properly.There are significant results about data-driven, automatic parameter tuning for covariance matrix estimation [10], [12]- [14], [17], [20], [30].For banding-based designs, the distribution-free resampling methods have shown to be able to select the bandwidth [14], [30] for covariance matrix estimation under several criteria.However, they may not yield satisfactory performance for applications where the precision matrices are indeed required.
The choice of parameters for regularized precision matrix estimation may be cast as a model (order) selection problem.Typical tools include information criteria (IC) [31]- [35].IC generally require knowledge about the data distribution VOLUME 7, 2019 This work is licensed under a Creative Commons Attribution 4.0 License.For more information, see http://creativecommons.org/licenses/by/4.0/and quantitative analysis of the complexity of the candidate models.There may be applications where the distribution is unknown and/or it is challenging to quantize the model complexity (e.g., for shrinkage estimators).Furthermore, some IC methods may not select the model that possesses the best prediction power, which can be critical for certain applications [36], [37].Random matrix theory (RMT)-based methods have been proposed for shrinkage-based highdimensional precision matrix estimation [21]- [23].However, it is unclear how to generalize them to precision matrix estimators employing more general forms of regularization.
In this paper, we introduce simple yet effective methods based on cross-validation (CV) [36] to select the parameters for the precision matrix estimation.We focus on banding-based estimators where a bandwidth constraint is imposed on the covariance or precision matrix [14]- [16], [19], [20], [25].We target the tuning of regularization parameters for minimizing the Frobenius norm of the error of precision matrix estimation.Since CV requires a form of prediction error to be used as a proxy for measuring the quality of parameter estimation, we propose to use two types of regression errors for this purpose: One uses a regression interpretation of the precision matrix itself and the other uses a similar interpretation of the modified Cholesky factor of the precision matrix [25].The resulting bandwidth selection methods are distribution-free and the CV scores can be easily computed.We show that the proposed methods can select precision matrix estimators achieving good out-of-sample prediction power.They can be directly used for other precision matrix estimators such as graphical lasso [28] and reduced-rank estimators, as shown by numerical results.The application of the proposed methods in designing MMSE channel estimators for large-scale multiple-input multiple-output (MIMO) communication systems is also demonstrated.This paper is presented in part at IEEE ICASSP 2018 [38].
Notation: X † and X −1 denote the conjugate transpose and the inverse of the matrix, respectively.X denotes the estimate of X. denotes element-wise product.||X|| F denotes the Frobenius norm of a matrix.Tr(X) denotes the trace of a matrix.vec(•) denotes vectorization and ⊗ Kronecker product.

II. REGRESSION ANALYSIS-BASED PARAMETER SELECTION FOR REGULARIZED PRECISION MATRIX ESTIMATION
In this section, we first introduce an example of banding-based estimation of covariance and precision matrix and the bandwidth selection problem.We then discuss two bandwidth selection methods based on CV and regression analysis, which may be applied to more general regularized designs.

A. AN EXAMPLE OF BANDING-BASED PRECISION MATRIX ESTIMATION
Consider an N -dimensional signal y with mean zero, covariance matrix and precision matrix −1 .Suppose we have T independent and identically distributed training samples and let y t be the t-th sample.The SCM [3] is computed as If the above SCM is full-rank, the SPM can be obtained as When the sample size T is not much larger than the dimensionality N , both the SCM and SPM can suffer from significant errors.
Regularization is usually applied to improve the accuracy of covariance matrix estimation and may be generalized to precision matrix estimation in different ways.In this section, we take the banding technique of [14] as an example to motivate the study of parameter selection problem.(Other forms of regularized designs will be considered later.)The banding estimators have shown to provide significant improvements when the variables are assumed to have a natural ordering, where entries far away from each other have weak correlations [16], [19], [20].With a bandwidth K , we can generate from SCM the following banded covariance matrix estimate [14] where denotes element-wise product and B K is defined as A regularized estimate of the precision matrix can then be obtained by inverting K , i.e., Note that the precision matrix estimate here K is generally not banded and the bandwidth K in (3) actually refers to the bandwidth of the corresponding covariance matrix estimate K .

B. PARAMETER SELECTION
The banding-based estimate (1) K in (5) can potentially improve the accuracy of precision matrix estimation.This is illustrated in Fig. 1.The autoregressive (AR) model, which has been widely considered in similar studies [11], [14], is assumed for the true covariance matrix of y, with its (i, j)-th entry given by where ρ is a constant.We use the normalized quadratic losses as the performance metrics for covariance and precision matrix estimation, respectively, where and denote estimates of and , respectively, and • F denotes the Frobenius norm.From Fig. 1, it can be seen that the regularization parameter K has a crucial influence on the estimation accuracy.Furthermore, under the given performance metric, the bandwidth optimal for the covariance matrix estimation can be suboptimal for the precision matrix estimation, as demonstrated by the different bandwidths required for optimizing the estimates of and by K and (1) It is thus critical to properly tune the parameter K to optimize performance.
The parameter tuning problem may be formulated as a model selection problem and solved using information criteria (IC), e.g., Akaike information criterion (AIC) [31], Bayesian information criterion (BIC) [32] and their generalizations [33]- [35].This is generally implemented by optimizing the log-likelihood of the model, penalized by the model complexity.In particular, the BIC and similar methods achieve model selection consistency for parametric models [34], [35], which is critical when the goal is to interpret the observed data and/or infer the features of the data.AIC tends to choose models that exhibit better predictive performance, which may be more relevant for applications where model-based prediction or estimation is of interest.Such methods formally require the knowledge of the distribution of the data, which may not be always well specified.It may also be nontrivial to quantize the model complexity due to the constraints involved in the model.Furthermore, sometimes models with the same number of parameters need to be compared, e.g., the shrinkage estimators [21]- [23] need to tune the continuous-valued shrinkage coefficients.In such cases, it may be challenging to apply IC to tune the parameters.It is also noted in [19] that AIC and BIC may not work well in high-dimensional applications and a hypothesis testing procedure may be used to determine the bandwidth for Gaussian data.
Cross validation (CV) is another universal tool for model selection.It is especially useful in selecting a model with good prediction power [34], [36], [39], e.g., for future signal or data processing tasks.This is because CV, by splitting the data into training and validation sets, can measure directly the prediction error of a model.For covariance matrix estimators, CV and its variants have been successfully used for determining the shrinkage coefficients [10], [38], bandwidth [14, Eq. ( 24)] and thresholds [17], where the difference between the SCM constructed from the validation set and the regularized covariance matrix estimate constructed from the training set is minimized.They are shown to be able to select near-oracle parameters for minimizing the Frobenius norm of the error of covariance matrix estimation.However, as shown in Fig. 1, inverting a regularized covariance matrix optimized under a certain criterion such as the mean squared error (MSE) does not necessarily optimize the precision matrix estimation under the same criterion.On the other hand, directly replacing the covariance matrices of [14, Eq. ( 24)] by precision matrices is not feasible in low-sample support cases when the needed SPM cannot be computed.
In this work, we apply the universal tool of CV to select the parameter for banding-based precision matrix estimators.Following the general principle of CV, we repeatedly split the overall dataset into the training and validation subsets.For each split, a regularized precision matrix estimate is constructed using only the training subset.By fitting the estimated model on the validation set, a prediction error is obtained for each data split.The parameter that minimizes the average prediction error is finally chosen.Since the true precision matrix is unknown, it is infeasible to use directly the error of precision matrix estimation as the prediction error for CV.It is thus an important task to find proper proxies of the estimation performance.Log-likelihood is a candidate choice, which requires knowledge of the data distribution.
We propose to apply two types of CV-based regression errors as proxies of the estimation performance, which do not require knowledge of the distribution.This exploits the notion that the precision matrix gives coefficients for regressing an entry of the signal y on a subset of the remaining entries.We introduce two such CV schemes, CV-I and CV-II, based on the precision matrix and its Cholesky factor, respectively.Note that both schemes can be applied to choose parameters for general regularized estimators of precision matrix, not restricted to the banding example discussed above.

C. CV-I
The first CV scheme (CV-I) exploits a regression interpretation of the precision matrix itself.In order to illustrate the basic idea, let us partition the entries of a signal vector y into two disjoint subsets y = y (1)  y (2) , where the lengths of y (1) and y (2) are N 1 and N 2 , respectively.Accordingly, let us partition the covariance matrice of y as VOLUME 7, 2019 Therefore, from the submatrices 11 and 21 of the precision matrix , we can construct an N 2 × N 1 matrix It can be easily seen that W is the matrix of coefficients for regressing y (1) on y (2) and is the linear minimum mean squared error (LMMSE) estimator for estimating y (1) from y (2) .For notational simplicity, the above assumes contiguous partitioning of y.However, noncontiguous partitioning can be easily handled by transforming it into the contiguous form of (8) using a permutation matrix.
The above interpretation links the precision matrix to regression analysis of the data.This has been exploited for deriving regularized precision matrix estimates [25].Given the training data, one can estimate by conducting a regression analysis of the training samples.Constraints on the regression coefficients can be imposed to obtain different regularized estimators [29].In this work, we exploit the above regression interpretation for another purpose, i.e., for determining the bandwidth for the regularized precision matrix estimation.The rationale is that, if we have a better estimate of the true precision matrix , then from it a linear predictor constructed as (13) (with replaced by ) should lead to a smaller error ξ 1 of predicting y (1) from y (2) , where is obtainable from the training data and the estimate of the precision matrix.We propose to use the above prediction error as a performance metric to choose the bandwidth K using a CV procedure that involves data splits in not only the time domain (w.r.t. the indexes of the samples) but also in the space domain (w.r.t. the indexes of the entries of each sample).Let I be the number of time-domain splits and J the number of space-domain splits for each time-domain split, respectively.The overall training data Y = [y 1 , y 2 , • • • , y T ] are pre-partitioned in the time into approximately equal-size subsets {Y i } [36].For the i-th time-domain split, the validation set is chosen as Y i and the remaining data in Y, denoted by Y ∼i , is used as the training set.An estimate of the precision matrix with regularization parameter K is constructed from Y ∼i .For the j-th space-domain split, Y i is further split into two disjoint subsets Y (1) i,j and Y (2) i,j .A predictor W K ,i,j is then constructed using the estimated precision matrix to predict Y (1) i,j from Y (2) i,j .Note that the subscript K in W K ,i,j is introduced to indicate the dependency of the estimators on the regularization parameter K .The prediction error is then used to assess the quality of the precision matrix estimation.Different patterns of data splits may be used in the time and space domains.The parameter is then chosen as where the cost function represents the total prediction error For simplicity, we assume that for all the splits the numbers of rows and columns of Y i,j are fixed.Notice that the estimator for estimating is computed from a permutation of the regularized precision matrix estimate K ,i obtained from the training sample where j is the permutation matrix that is consistent with the pattern of the j-th space-domain split of Y i .Summarizing, the CV-I cost with parameters (I , J ) is given as A grid search of K can be conducted to choose the minimizer of L 1 (K ) as the bandwidth.
In a special case where J = N and every single entry of y is estimated from the remaining entries, we obtain the leaveone-out (LOO) partition in the space domain.In this case, Y (1) i,j reduces to a row vector and K ,i,j,11 becomes a scalar number.It can be shown that the performance metric ( 16) can be rewritten as where D K ,i denotes the diagonal matrix whose diagonal entries are the same as those of K ,i .
The above bandwidth selection method is based on the regression analysis of the original signal y.Alternatively, we may consider a generalized cross validation (GCV) treatment [40].Instead of conducting the regression analysis of the entries of y, we can consider the regression analysis of the linearly transformed signal where V = UF, with F being the discrete Fourier transform matrix and U the eigenvector matrix of the covariance matrix .In this case, the precision matrix of y is given by This is a circulant matrix with equal diagonal entries given by 1 N tr( ), i.e., Note also that for an arbitrary y By replacing the true precision matrix as its estimate, the LOOCV cost function of (20) applied to the transformed signal ( 21) is then written as Note that the result in (25) does not explicitly require the calculation of (21).In other words, (21) only serves as a proxy for deriving the generalized CV expression.

D. CV-II
The second type of regression-based CV scheme (CV-II) uses the modified Cholesky factor of the precision matrix as described below.For an arbitrary positive-definite precision matrix , the modified Cholesky factorization can be written as where D is a diagonal matrix with positive diagonal entries, the modified Cholesky factor T is a lower triangular matrix with unit diagonal entries as follows The coefficients {φ nm } have a regression interpretation [29], [25]: The LMMSE estimate of the n-th entry y (n)  of the signal y from its precedents y (1) , y (2) , • • • , y (n−1) in y can be computed as Furthermore, the mean squared error (MSE) of this estimator is equal to the n-th diagonal entry of D, i.e., where we defined y (1)  = 0. From the regression interpretation of the Cholesky factor coefficients, T can be used for predicting the entries of a sample y from their precedents using (28).Now let us replace the true precision matrix by its regularized estimate.It is expected that if a good precision matrix estimate is obtained, then the Cholesky factor-based prediction of the entries of y has a small average error.Assume the same time-domain data splits as for CV-I.The error for predicting the entries of signal Y i in the i-th split is then given by where T K ,i is the modified Cholesky factor of the precision matrix estimate K ,i constructed using only Y ∼i and the bandwidth K .Using this error, we obtain an alternative CV scheme which minimizes the total prediction error given by In order to obtain L 2 (K ), we need to find the modified Cholesky factor T K ,i for each time-domain data split and each bandwidth K .In contrast to CV-I, the Cholesky factorization approach does not require explicit data splits in the space domain.

E. APPLICATIONS TO GENERAL PRECISION MATRIX ESTIMATORS
Both CV-I ( with the cost functions given by ( 19), ( 20) or ( 25)) and CV-II ( with the cost function given by ( 31)) utilize only the outcome of the precision matrix estimator, and are transparent to the detailed estimation process.Therefore, they can be used as universal tools for tuning the regularization parameters for precision matrix estimators such as those based on rank reduction, sparsity and shrinkage.Fast implementations may be derived for certain types of estimators.A related scheme was discussed in [46] for choosing the shrinkage coefficients, where a holdout strategy which splits the data only once in the time domain (i.e., I = 1) and N times in the space domain is introduced.It has been shown that near-oracle choice of the shrinkage coefficients is achieved for a linear estimation application.This work generalizes the method of [46] by allowing more general data splits and also different regression analysis to be used.

III. NUMERICAL EXAMPLES
In this section, we present numerical examples to demonstrate the effectiveness of the proposed bandwidth selection methods for regularized precision matrix estimation and also show their applications to MMSE channel estimation in large-scale MIMO communication systems.

Example 1 (Precision Matrix Estimation via Covariance Matrix Banding):
We first apply the proposed CV methods to choose the bandwidth K for the banded covariance matrix estimator in (3) such that the resulting precision matrix estimator K in (5) achieves good performance.We assume zero-mean, complex-valued Gaussian data but our methods do not rely on knowledge about the distribution.We use the normalized squared Frobenius norm of the estimation error (7) as the performance metric and define its average as the normalized MSE (NMSE) for the precision matrix estimation.An example of the AR covariance matrix model in (6) is presented in Fig. 2 and 3. CV-I and CV-II with I = 2 and 5 time-domain data splits are tested.For each of the I time-domain splits, T /I distinct samples of y t are chosen to form the validation data set Y i .When applying CV-I with J space-domain splits of Y i , N /J distinct, uniformly spaced rows of Y i are chosen as the target of prediction.The CV-I costs are computed using (19) and (20), respectively, for J = 2 or N .
Banding may destroy the positive-definiteness of the precision matrices and in this case the regression interpretations of the precision matrix become invalid.We thus exclude the candidate precision matrix estimates that are not positivedefinite.Similarly, if the precision matrix estimate K ,i in the i-th time-domain data split is not positive-definite, the candidate bandwidth K is excluded, which can lead to a performance loss due to the exclusion of certain valid positive-definite precision matrix estimate K .Techniques that can be used to restore the positive-definiteness, such as [41], may be employed but its study is beyond the scope of this paper.In Fig. 2 and 3, the results labeled by ''oracle'' use the bandwidths that minimize the Frobenius norm loss of ( 7), which can be obtained only when the true precision matrix is known.These results are used to benchmark the performance of our proposed CV schemes.It can be seen that the banding estimators significantly outperform the SPM which is the (pseudo)inverse of the SCM, especially when the number of samples is smaller than the dimensionality.It is seen that for this example, CV-I performs slightly better than CV-II when T is large.
The simulation results also suggest that the numbers of data splits (I , J ) influence the bandwidth selection behavior.When the number of samples T is large, more splits in the time and space domains lead to better estimation under the MSE criterion, as seen from Fig. 3.However, it is also shown in Fig. 2 that for low sample supports (with a small T ) a smaller number of splits yields better performance.In general, the complexity increases with the number of data splits but fast computations may be obtained for the leave-oneout (LOO) implementations.To our best knowledge, there are no universal data-driven rules for choosing the optimal values of (I , J ), though this problem is similar to the problem of automatically choosing between AIC and BIC, where a higher-level of CV may be used [45].
Example 2 (Precision Matrix Estimation via Cholesky Factor Banding): Directly banding the SCM may destroy the positive definiteness, making the resulting covariance and precision matrix estimates invalid for certain applications.One way of maintaining positive-definiteness and also producing a banded precision matrix estimate is to find the banded Cholesky factor of the precision matrix directly from the training data following the regression interpretation [20] as introduced in Section II-D.For the modified Cholesky factorization in (26), we assume that the precision matrix is banded with a bandwidth K , i.e., φ n,m = 0 for |m − n| > K in (27).In this case, the LMMSE estimate of the n-th entry y (n) of y from its precedent entries can be rewritten as where . . .
and s = max(1, n − K ).It can be verified that the LMMSE estimator can be written as As will be discussed next, this can be exploited to develop a regularized, column-by-column estimator of the precision matrix [25].Consider a case with B training samples We can first compute the estimate of the covariance matrix of y as the banded SCM which requires about 2NBK flops, where a flop refers to an operation of complex numbers (e.g.addition and multiplication).The sample covariance matrix n,K of y (n) and the sample cross-covariance σ n,K between y (n) and y (n) can then be obtained by directly extracting the relevant entries of K .They are then used to replace the covariance matrices in (34), resulting in which can be computed via Cholesky factorization and backward/forward substitutions with about K 3 3 + 2 K 2 flops for each n.The diagonal entries of D are estimated as costing about 2BK flops for each n.Finally, the entries of w (n)  and D n,n are plugged into ( 27) and ( 26) to produce a banded estimate of the precision matrix (2) which requires about N (K 2 + K ) flops.We can analyze the complexity of the above estimator with the CV methods used for selecting the bandwidth of  35)-( 38) are repeated for I data splits and the complexities are summarized in Table 1.The total complexity for obtaining  For CV-II, note that T K ,i , which is triangular with a bandwidth of K , is directly available from the estimation process.While evaluating (31), the complexity of finding T K ,i Y i is about (2K + 1)NT /I flops and that for the squared Frobenius norm is about 2NT /I flops.The overall complexities for CV-I and CV-II are also summarized in Table 1, which shows that CV-II has a complexity slightly lower than CV-I for this example.
Note that the above complexities should be accumulated for the candidates of K tested.As a comparison, the SPM, which is the inverse of the SCM, has a computational cost of O(N 2 T +N 3 ), which can be significant when N is large.From the analysis above, the complexity of the proposed methods can be moderate when the candidate bandwidths K are small.
In Fig. 4, we show the precision matrix estimation K for the AR model in (6) with two different values of ρ.When ρ is increased from 0.2 to 0.9, the decaying of the off-diagonal entries of the covariance matrix becomes slower and the condition number of is also increased.The proposed CV schemes achieve near-oracle performance, and the overall precision matrix estimation schemes significantly outperform the SPM.We can show that the inverse Cholesky factor-based estimator (2) K significantly outperforms (1) K when ρ is large.It can be shown that the true precision matrix of the above AR model has a bandwidth of 1.The banded estimate (2) K above does not exploit the feature of the AR model and may choose a bandwidth different from the true one.Table 1 shows the accuracy of the bandwidth chosen by CV-I for different settings, which is defined as the ratio of the correct bandwidth selections among 100 trials.It is seen that the correct bandwidth is selected with a high probability for the AR model.We can show that CV-II exhibits a similar behavior for this example.
Example 3 (Fractional Gaussian Noise (FGN) Model): We also test the proposed methods on the fractional Gaussian noise (FGN) model considered in [14]: Bandwidth selection accuracy of the proposed CV-I method.
5. MSE of estimating the precision matrix using different designs and CV bandwidth selection for the fractional Gaussian noise (FGN) model with N = 100, I = 2, J = N, K max = min(N, T (I − 1)/I).Note that the CV choices achieve nearly the same performance as the oracle choice for (1) .
Fig. 5 shows the results of precision matrix estimation for N = 100 and H = 0.9.Note that this is an ill-conditioned case where has a condition number of 222.The design K in (5) based on covariance matrix banding achieves less significant improvements compared to SPM because the off-diagonal entries of decay slowly.However, K is still very effective.Again, our proposed bandwidth selection methods are able to achieve near-optimal NMSE performance for both precision matrix estimators.
Example 4 (Graphical Lasso-Based Sparse Precision Matrix Estimation): We also test the proposed CV methods for sparse precision matrix estimation based on graphical lasso, which maximizes the penalized log-likelihood as where S is the SCM, logdet(•) denotes the logarithm of the determinant, Tr(•) the trace, || || 1 the sum of the absolute values of the entries of , and λ controls the sparsity level of the estimated precision matrix, which should be carefully tuned.For such an estimator, the extended BIC (EBIC) proposed in [35] is shown to provide a consistent estimate of λ that can give the correct sparsity pattern for large sample sets.The proposed CV methods are compared with EBIC with γ = 0.5 (see [35] for the parameter γ of the EBIC).Fig. 6 demonstrates the squared Frobenius norm of the error  of estimating the precision matrix .The sparsity patterns associated with the parameters chosen by CV and EBIC are illustrated in Fig. 7.It can be seen that the EBIC chooses a parameter that leads to a sparsity pattern closer to the truth.However, CV is significantly better for choosing λ to minimize the squared Frobenius norm.This confirms the consistency of EBIC and also demonstrates that CV is more suitable when precise estimation of the precision matrix is required.Note that model order selection and parameter estimation are different tasks.In fact, a small perturbation of the model parameters can significantly affect the model order (e.g., the number of nonzero entries) but the error measured (e.g., using the Frobenius norm) in the parameter estimation can be negligible.

B. APPLICATION TO MIMO CHANNEL ESTIMATION
A potential application of the proposed techniques is the design of practical MMSE estimators for MIMO channels [6].Consider a point-to-point system with N t FIGURE 8. NMSE of MIMO channel estimation with N t = 8, N r = 32 and an exponential covariance matrix model.The precision matrix of the received signal is estimated using (2) of (38).The pilot-to-noise ratio is set to 0 dB.The precision matrix has a size of 256 × 256.''MMSE'' indicates performance of the MMSE estimator with perfect knowledge of the covariance and precision matrices.
transmitting antennas and N r receiving antennas.Let B be the length of the pilot sequence.In the training stage, the received signal matrix can be modeled as where Y ∈ C N r ×B is the received signal matrix, H ∈ C N r ×N t the channel matrix, P ∈ C N t ×B the pilot matrix, and N ∈ C N r ×B the noise which is uncorrelated with H. Vectorizing Y in (41) gives where y = vec(Y), P = P T ⊗ I, h = vec(H), n = vec(N), vec(•) denotes vectorization, and ⊗ Kronecker product.We assume a Rayleigh fading channel and denote by h E[hh † ] ∈ C N t N r ×N t N r the covariance matrix of the channel vector h.We also assume that the disturbance n is Gaussian-distributed with zero mean and identity covariance matrix.
The MMSE estimate of h from y is computed as [6] h MMSE = h P † y y, (43) where y = ( P h P † + I) −1 is the precision matrix of y.In order to compute h MMSE , h and y , which can be very large, must be estimated.In communication systems, h is not directly observable.When orthogonal training signal with P = √ PI is applied, where P determines the power for training signals, it can be shown that Therefore, if a good estimate of matrix y of y can be obtained, we can then also find an estimate of h as h = 1 P ( −1 y − I) according to (44) and the MMSE channel estimator can then be constructed.For simplicity, the Kronecker channel covariance matrix is assumed [42] our numerical studies, i.e., where t and r are, respectively, the transmitter side and receiver side covariance matrix.Note that, however, the precision matrix estimators of this paper do not rely on the above Kronecker structure and are applicable to general cases.
Example 5 (Exponential Covariance Model): We now show an example in Fig. 8, where the transmitter and receiver covariance matrix follow the exponential model [6]: We set r t = 0.7 e −j0.9349π and r r = 0.9 e −j0.9289π but similar results can be obtained for other settings.For simplicity, we consider a reduced-complexity (RC) implementation: 95% of the total training samples of y are used to construct the precision matrix estimates using (38), while the 5% remaining samples are used to select the bandwidth K .In this case, I = 1 for ( 20) and (31).Note that this holdout strategy [47] is a special case of the CV method and incurs a very low complexity when the considered bandwidth is small and is particularly suitable for wireless communication applications where low-complexity algorithms are preferred.The performance is compared with the ''oracle'' MMSE channel estimation performance that is achievable given the considered banded precision matrix estimators and the least squares (LS) channel estimator that does not need knowledge of the precision matrix.From the simulation results, it can be seen that when the number (T ) of samples is small, the MMSE channel estimator constructed using the SPM estimate of y is significantly poorer than the LS estimator which does not require any knowledge of y .Therefore, an enhanced estimate of the precision/covariance matrix is necessary to exploit the potential of the MMSE channel estimator.Our proposed methods achieve near-oracle bandwidth selection when applied to the channel estimation problem.Furthermore, they also achieve near-oracle performance for precision matrix estimation, as demonstrated by the NMSE results in Fig. 9.

Example 6 (One-Ring Covariance Model and Reduced-Rank (RR) Estimator):
We also consider the one-ring model considered in [43], where the channel covariance matrix h tends to be low-rank.In this case, the received signal may be decomposed into components in a low-rank signal subspace and a noise subspace.Exploiting this, a regularized precision matrix estimate can be obtained as where K denotes the assumed rank of the signal subspace, Q is the eigenvector matrix of the SCM, and the diagonal matrix K is defined as λ k denotes the k-th largest eigenvalue of the SCM .This reduced-rank estimate K can also be used to generate an estimate of h by exploiting (44).Once the rank K is determined, the MMSE channel estimator can then be constructed following the principal component analysis (PCA) as where Q K and K denote the eigenvector and eigenvalue matrix corresponding to the rank-K signal subspace.Due to its simplicity (without a Cholesky factorization step), the CV-I cost in ( 25) is used to determine the rank K .Similarly to Example 5, 95% of the samples are used to construct the estimator while the 5% remaining samples Y i are used for rank selection.The CV cost is then computed as where K is diagonal and Q † Y i is independent of K .The simulation results are presented in Fig. 10, where the precision matrix estimators (38) and (48), labeled by (2) and (3) , respectively, are compared.It is shown again that the MMSE channel estimator can significantly outperform the LS estimator.For each given type of precision matrix estimator, the NMSE performance achievable with the proposed CV methods is nearly the same as that with the ''oracle'' parameter choice and their difference is hardly visible in Fig. 10.This suggests that the regression analysis-based methods are effective for choosing the parameter for the channel estimation application.It is also seen that, for the one-ring model which has a low-rank channel covariance matrix, the reduced-rank design provides better performance than the banding design but also requires higher complexity due to the eigenvalue decomposition needed.

IV. CONCLUSION
In this paper, we introduced two data-driven, distribution-free cross-validation methods for tuning the for banding-based precision matrix estimation.These methods are based on regression interpretations of the precision matrix and its Cholesky factor and can be applied to general regularization schemes.Numerical examples show that the proposed methods can approximate the oracle bandwidth choices for precision matrix estimation under a quadratic loss and are effective a MIMO channel estimation application where the MMSE estimator is constructed from training samples.The complexity of proposed schemes generally with the number of data splits.This problem may be alleviated by reusing calculations among different data splits.holdout implementation may also achieve satisfactory perin certain applications.

FIGURE 1 .
FIGURE 1.An example with N = 50, T = 200 and [ ] i ,j = 0.5 |i −j | .The regularization parameter K is the bandwidth for the banded covariance matrix estimator in (3).The optimal K for estimating and are 2 and 4 respectively.Inverting 2 leads to an estimate of about 7 dB worse than (1) 4 .
2, and E[•] denotes mathematical expectation.The precision matrix is then computed as
that in total we have T training samples equally split into I subsets.Then a training subset has B = T − T /I samples and a validation subset Y i has T /I samples.The operations of (

( 2 )
K for I splits is about C (35) + C (36) + C (37) + C (38) flops.For CV-I, we take the LOO implementation (20) as an example.Evaluating K ,i Y i costs about (4K + 1)NT /I flops as K ,i has a bandwidth of K and Y i has T /I columns.Multiplying the result by [D K ,i ] −1 takes NT /I flops and computing the squared Frobenius norm takes 2NT /I flops.

TABLE 1 .
Approximate computational complexities of the estimators in Example 2. We assume that there are T training samples equally split into I subsets for CV, each training subset has B = T − T I samples, each validation set has of T I samples, and intermediate results are stored and reused for evaluating (35) for the I splits.

FIGURE 6 .
FIGURE 6. Snapshot of the normalized squared Frobenius norm of the precision matrix estimation error and the parameters elected by the EBIC CV methods.The AR model with N = 100, ρ = 0.5 is assumed, T = 500, I = 5, and J = N.

JIANGTAO
XI received the B.E. degree from the Beijing Institute of Technology, Beijing, China, in 1982, the M.E.degree from Tsinghua University, Beijing, in 1985, and the Ph.D. degree from the University of Wollongong, Wollongong, Australia, in 1996, all in electrical engineering.He was a Postdoctoral Fellow with the Communications Research Laboratory, McMaster University, Hamilton, ON, Canada, from 1995 to 1996, and a Member of Technical Staff with Bell Laboratories, Lucent Technologies Inc., NJ, USA, from 1996 to 1998.From 2000 to 2002, he was the Chief Technical Officer with TCL IT Group Company, China.In 2003, he rejoined the University of Wollongong as a Senior Lecturer, where he is currently a Professor and the Head of the School of Electrical, Computer, and Telecommunications Engineering.His research interest includes signal processing and its applications in various areas, such as optoelectronics, instrumentation and measurement, and telecommunications.YANGUANG YU received the B.E. degree from the Huazhong University of Science and Technology, China in 1986, and the Ph.D. degree from the Harbin Institute of Technology, China, in 2000.She was with the College of Information Engineering, Zhengzhou University, China, on various appointments including a Lecturer, from 1986 to 1999, an Associate Professor from 2000 to 2004, and a Professor, from 2005 to 2007.From 2001 to 2002, she was a Postdoctoral Fellow with the Opto-Electronics Information Science and Technology Laboratory, Tianjin University, China.She also had a number of visiting appointments including a Visiting Fellow at the Optoelectronics Group, Department of Electronics, University of Pavia, Italy, from 2002 to 2003, a Principal Visiting Fellow with the University of Wollongong, Australia, from 2004 to 2005, a Visiting Associate Professor and Professor with the Engineering School, ENSEEIHT, Toulouse, France, in 2004 and 2006.She joined the University of Wollongong, in 2007, where she is currently an Associate Professor with the School of Electrical, Computer, and Telecommunications Engineering.Her research interests include semiconductor lasers with optical feedback and their applications in sensing and instrumentations, and secure chaotic communications.She is also interested in signal processing and its applications to 3-D profile measurement and telecommunication systems.PHILIP O. OGUNBONA received the Ph.D. and DIC degrees in electrical engineering from Imperial College, London.He is currently a Professor and the Co-Director of the Advanced Multimedia Research Lab, University of Wollongong, Australia.His research interests include signal and image processing, computer vision, pattern recognition, and machine learning.He is a Fellow of the Australian Computer Society.
NMSE of the estimation of the precision matrix of the received signal with N t = 8, N r = 32 and an exponential covariance matrix model.
FIGURE 10.NMSE of the channel estimation with N t = 8, N r = 32 and the one-ring covariance matrix model.The model is generated as [44, Eq. (46)] with an average angular spread of 10 • and the central angles are randomly distributed.