High-dimensional autocovariance matrices and optimal linear prediction ∗

Abstract: A new methodology for optimal linear prediction of a stationary time series is introduced. Given a sample $X_1, \ldots, X_n$, the optimal linear predictor of $X_{n+1}$ is $\tilde X_{n+1} = \phi_1(n)X_n + \phi_2(n)X_{n-1} + \cdots + \phi_n(n)X_1$. In practice, the coefficient vector $\phi(n) \equiv (\phi_1(n), \phi_2(n), \ldots, \phi_n(n))'$ is routinely truncated to its first $p$ components in order to be consistently estimated. By contrast, we employ a consistent estimator of the $n \times n$ autocovariance matrix $\Gamma_n$ in order to construct a consistent estimator of the optimal, full-length coefficient vector $\phi(n)$. Asymptotic convergence of the proposed predictor to the oracle is established, and finite sample simulations are provided to support the applicability of the new method. As a by-product, new insights are gained on the subject of estimating $\Gamma_n$ via a positive definite matrix, and four ways to impose positivity are introduced and compared. The closely related problem of spectral density estimation is also addressed.


Introduction
Let $X_1, \ldots, X_n$ be the realization of a covariance stationary time series with mean zero and autocovariance function $\gamma_k = E[X_t X_{t-k}]$. We consider the problem of predicting $X_{n+1}$ based on these observed data. With respect to Mean Squared Error (MSE), the optimal linear predictor is
$$\tilde X_{n+1} = \phi_1(n)X_n + \phi_2(n)X_{n-1} + \cdots + \phi_n(n)X_1, \quad (1)$$
where the coefficients $\phi_i(n)$ are given by
$$\phi(n) = \Gamma_n^{-1}\gamma(n) \quad (2)$$
(see e.g. p. 167 in Brockwell and Davis, 1991). In equation (2), $\Gamma_n = [\gamma_{|i-j|}]_{i,j=1}^n$ is the autocovariance matrix of $X_1, \ldots, X_n$, and $\gamma(n) = [\gamma_1, \ldots, \gamma_n]'$ is the vector of covariances at lags $1, \ldots, n$. Predictor (1) is an oracle because the coefficients $\phi_1(n), \ldots, \phi_n(n)$ are unknown. In practice, the coefficient vector $\phi(n) \equiv (\phi_1(n), \phi_2(n), \ldots, \phi_n(n))'$ is routinely truncated to its first $p$ components in order to be consistently estimated; this procedure is equivalent to fitting an autoregressive AR($p$) process to the data. The resulting predictor is
$$\hat X^{AR}_{n+1} = \hat\phi_1 X_n + \hat\phi_2 X_{n-1} + \cdots + \hat\phi_p X_{n-p+1}, \quad (3)$$
where the coefficient vector is typically estimated by the Yule-Walker equations
$$\hat\phi(p) = (\hat\phi_1, \ldots, \hat\phi_p)' = \hat\Gamma_p^{-1}\hat\gamma(p). \quad (4)$$
In (4), $\hat\gamma_k = n^{-1}\sum_{t=1}^{n-|k|} X_t X_{t+|k|}$ is the sample autocovariance at lag $k$, $\hat\gamma(p) = [\hat\gamma_1, \ldots, \hat\gamma_p]'$, and $\hat\Gamma_p = [\hat\gamma_{|i-j|}]_{i,j=1}^p$. Interestingly, $\hat\Gamma_p$ is positive definite for any $p$ as long as $\hat\gamma_0 > 0$, which is a sine qua non. In addition, for any finite $p$, $\hat\gamma(p)$ and $\hat\Gamma_p$ are consistent for their respective targets $\gamma(p)$ and $\Gamma_p$. Unfortunately, when $p$ is large, problems ensue. For example, when $p = n$, Wu and Pourahmadi (2009) showed that the sample autocovariance matrix $\hat\Gamma_n = [\hat\gamma_{|i-j|}]_{i,j=1}^n$ is not a consistent estimator of $\Gamma_n$ in operator norm. Hence, equation (4) cannot be used with $p = n$ to give a consistent estimator of the full coefficient vector $\phi(n)$.
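To fix ideas, here is a minimal sketch (in Python) of the classical truncated approach of eqs. (3)-(4): fit an AR($p$) model by the Yule-Walker equations and form the one-step predictor. The helper names (sample_acvf, ar_yule_walker_predict) are illustrative rather than from the paper, and the series is assumed to have mean zero.

```python
import numpy as np

def sample_acvf(x, max_lag):
    """Biased sample autocovariances gamma_hat_0, ..., gamma_hat_{max_lag}."""
    n = len(x)
    xc = x - x.mean()              # the paper assumes mean zero; centering is harmless
    return np.array([xc[:n - k] @ xc[k:] / n for k in range(max_lag + 1)])

def ar_yule_walker_predict(x, p):
    """One-step AR(p) predictor of eq. (3) with coefficients from eq. (4)."""
    g = sample_acvf(x, p)
    Gamma_p = np.array([[g[abs(i - j)] for j in range(p)] for i in range(p)])
    phi = np.linalg.solve(Gamma_p, g[1:p + 1])   # Yule-Walker equations (4)
    return phi @ x[::-1][:p]                     # phi_1 X_n + ... + phi_p X_{n-p+1}

rng = np.random.default_rng(0)
x = rng.standard_normal(200)
print(ar_yule_walker_predict(x, p=5))
```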
In the present work, we investigate an alternative approach to estimating all $n$ coefficients in the oracle predictor (1); this allows for the complete process history to be used in prediction. The estimated prediction coefficients $\hat\phi_1(n), \ldots, \hat\phi_n(n)$ are given by the $n$-dimensional Yule-Walker equations:
$$\hat\phi(n) = (\hat\phi_1(n), \ldots, \hat\phi_n(n))' = (\hat\Gamma_n^*)^{-1}\tilde\gamma(n), \quad (5)$$
where $\hat\Gamma_n^*$ is a positive definite version of the $n \times n$ banded and tapered estimate of the autocovariance matrix $\Gamma_n$ introduced in McMurry and Politis (2010), and $\tilde\gamma(n)$ is the corresponding estimate of the autocovariance vector; see Section 3.2 for details.
It has been widely thought until now that an estimate such as the one in (5) is not feasible. For example, on p. 717 of the recent work by Bickel and Gel (2011) it is stated that "given n observations, it is impossible to estimate n AR parameters sufficiently well for prediction purposes." The present work demonstrates that this is not the case. In addition, we discuss an intermediate approach, i.e., an analog of (4) but with p that can be arbitrarily large as long as p ≤ n.
The remainder of the paper is structured as follows. Section 2 provides the background on the estimators $\hat\Gamma_n^*$ and $\tilde\gamma(n)$ that are required in order to estimate the prediction coefficients. Section 3 contains our main asymptotic results; in particular, the consistency of $\hat\phi(n)$ is shown, and the resulting predictor is shown to be asymptotically equivalent to the oracle predictor (1). Section 4 presents four ways to correct our matrix estimator in order to ensure positive definiteness, and therefore invertibility, in finite samples. Section 5 contains the results of finite-sample simulation studies and a real data experiment. Section 6 summarizes our results. All technical proofs have been placed in Section 7. Our paper concludes with an Appendix that shows how the positive definiteness corrections described in Section 4 can find application in the related problem of spectral density estimation.

Estimation set-up
2.1. Estimating the n × n autocovariance matrix $\Gamma_n$

The accuracy of the coefficients estimated by equation (5) rests on the ability to accurately estimate $\Gamma_n = [\gamma_{|i-j|}]_{i,j=1}^n$. However, as mentioned in the introduction, Wu and Pourahmadi (2009) showed that the sample autocovariance matrix $\hat\Gamma_n = [\hat\gamma_{|i-j|}]_{i,j=1}^n$ is not a consistent estimator of $\Gamma_n$ in operator norm. In order to achieve consistency, they introduced an $l$-banded estimate that leaves the $2l+1$ main diagonals of the sample autocovariance matrix intact and sets the remaining entries to 0. Under conditions on $l$ and short range dependence assumptions on $\{X_t\}_{t\in\mathbb Z}$, they established the asymptotic consistency of the banded matrix. Their banded estimator is optimal for MA($q$) models since, in that case, it corresponds to the parametric estimator. However, if the autocorrelation does not vanish after a finite lag, the banded estimator can behave erratically (Politis, 2011).
To improve performance when the autocorrelation does not vanish, McMurry and Politis (2010) proposed a banded and tapered matrix estimator in which the 2l + 1 main diagonals of the sample autocovariance matrix are kept intact but the remaining entries are gradually tapered to zero. The gradual taper can substantially improve finite sample performance when the autocorrelation does not vanish, at little cost when it does. The asymptotic convergence rates given in McMurry and Politis (2010) were similar to those in Wu and Pourahmadi (2009).
Remark 1. Recently, Cai, Ren and Zhou (2013) showed that the banded and tapered estimator also enjoys a (slightly) improved rate of convergence as compared to the purely banded estimator; for their proof, they used the trapezoidal taper proposed by Politis and Romano (1995) but it is conjectured that the same holds true for the family of so-called 'flat-top' tapers as long as they are continuous-see Politis (2001) for more details.
The banded and tapered estimator of McMurry and Politis (2010) is defined as
$$\hat\Gamma_n = [\tilde\gamma_{|i-j|}]_{i,j=1}^n, \quad (6) \qquad \text{where} \quad \tilde\gamma_s = \kappa(|s|/l)\,\hat\gamma_s. \quad (7)$$
In the above, $\kappa(\cdot)$ can be the aforementioned trapezoidal taper, i.e.,
$$\kappa(x) = \begin{cases} 1 & \text{if } |x| \le 1 \\ 2 - |x| & \text{if } 1 < |x| \le 2 \\ 0 & \text{if } |x| > 2. \end{cases} \quad (8)$$
More generally, $\kappa(\cdot)$ can be any member of the flat-top family of functions defined in Politis (2001), i.e., $\kappa(\cdot)$ is given as
$$\kappa(x) = \begin{cases} 1 & \text{if } |x| \le 1 \\ g(|x|) & \text{if } 1 < |x| \le c_\kappa \\ 0 & \text{if } |x| > c_\kappa, \end{cases} \quad (9)$$
where the function $g(\cdot)$ satisfies $|g(x)| < 1$, and $c_\kappa$ is a constant satisfying $c_\kappa \ge 1$. The matrix estimator (6) has a banding parameter $l \ge 0$. The flat-top tapering leaves the $2l+1$ main diagonals of the sample autocovariance matrix intact, and gradually down-weights more distant diagonals. In order to cover the possibility of the data at hand being uncorrelated, it is useful to adopt the convention that when $l = 0$, the resulting $\hat\Gamma_n$ matrix is given by $\hat\gamma_0 I$; this is equivalent to adopting that $0/0 = 0$ in the context of eq. (7).
The trapezoidal taper given in (8) is very convenient, and has been shown to have good performance in different practical settings; we will also employ it in the numerical work in this paper. Nevertheless, our theoretical results apply to a broad class of weight functions including pure banding (no taper) as used in Wu and Pourahmadi (2009), and ultra-smooth tapers such as that suggested in McMurry and Politis (2004). These possibilities are captured by different choices of the function $g(\cdot)$ and the constant $c_\kappa$ in (9); e.g., letting $c_\kappa = 1$ corresponds to pure banding.
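As a concrete illustration, the following sketch builds the banded and tapered estimate (6) with the trapezoidal taper (8); the biased sample autocovariance and the $l = 0$ convention follow the text, while the function names are ours.

```python
import numpy as np
from scipy.linalg import toeplitz

def sample_acvf(x, max_lag):
    """Biased sample autocovariances gamma_hat_0, ..., gamma_hat_{max_lag}."""
    n = len(x)
    xc = x - x.mean()
    return np.array([xc[:n - k] @ xc[k:] / n for k in range(max_lag + 1)])

def kappa_trapezoid(x):
    """Trapezoidal flat-top taper of eq. (8): 1 on [0,1], linear down to 0 on [1,2]."""
    return np.clip(2.0 - np.abs(x), 0.0, 1.0)

def tapered_acvf_matrix(x, l):
    """Banded and tapered estimator of eq. (6): Toeplitz in kappa(|s|/l) * gamma_hat_s."""
    n = len(x)
    g = sample_acvf(x, n - 1)
    if l == 0:                                  # convention: l = 0 gives gamma_hat_0 * I
        return g[0] * np.eye(n)
    return toeplitz(kappa_trapezoid(np.arange(n) / l) * g)

rng = np.random.default_rng(0)
x = rng.standard_normal(300)
G_hat = tapered_acvf_matrix(x, l=5)
print(G_hat.shape, bool(np.allclose(G_hat, G_hat.T)))   # (300, 300) True
```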
Note that $\hat\Gamma_n$ as defined by (6) is asymptotically positive definite, but for finite samples it can have negative eigenvalues. For the remainder of the paper, we assume that it has been corrected to positive definiteness, if needed, as described in Section 4. The positive definite version of the matrix $\hat\Gamma_n$ will be denoted by $\hat\Gamma_n^*$.

Estimating the length n vector φ(n)
After the above preparatory work, we are able to define the proposed new predictor as
$$\hat X_{n+1} = \hat\phi_1(n)X_n + \hat\phi_2(n)X_{n-1} + \cdots + \hat\phi_n(n)X_1, \quad (10)$$
where the coefficients $\hat\phi_1(n), \ldots, \hat\phi_n(n)$ are given by equation (5) in conjunction with the estimates from equations (6) and (7). We call predictor (10) the Full-Sample Optimal (FSO) predictor since, as shown in Section 3, it is a consistent proxy for the oracle optimal predictor (1). By comparison, Bickel and Gel (2011) have recently investigated a predictor for $X_{n+1}$ that uses the upper-left $p_n \times p_n$ submatrix of the banded sample autocovariance matrix $\hat\Gamma_n$ with $p_n = o(n)$. Their estimator is designed for an "on-line" prediction problem that allows for the parameters to be updated after each new observation at relatively low computational cost, and the resulting prediction for $X_{n+1}$ is a linear combination of $X_n, \ldots, X_{n-p_n+1}$. This is still an AR-type predictor as in (3), but they use a higher order $p_n$ than the one obtained by minimizing AIC or a related criterion. Letting
$$\hat\Gamma_{p_n} = [\tilde\gamma_{|i-j|}]_{i,j=1}^{p_n} \quad \text{and} \quad \tilde\gamma(p_n) = [\tilde\gamma_1, \ldots, \tilde\gamma_{p_n}]', \quad (11)$$
we can construct an alternative predictor that is based on a partial sample, i.e., a predictor as in (3) with $p_n$ that can be arbitrarily large as long as $p_n \le n$. This new predictor is defined as
$$\hat X^{p_n}_{n+1} = \hat\phi^{p_n}_1(n)X_n + \hat\phi^{p_n}_2(n)X_{n-1} + \cdots + \hat\phi^{p_n}_{p_n}(n)X_{n-p_n+1}, \quad (12)$$
where the length-$p_n$ coefficient vector $\hat\phi^{p_n}(n)$ is given by
$$\hat\phi^{p_n}(n) = (\hat\Gamma^*_{p_n})^{-1}\tilde\gamma(p_n), \quad (13)$$
where $\hat\Gamma^*_{p_n}$ is the matrix that results after $\hat\Gamma_{p_n}$ in (11) is corrected to positive definiteness. We call predictor (12) the Partial-Sample Optimal (PSO) predictor as it will be shown to be a consistent proxy for the oracle optimal Partial-Sample predictor, i.e., the optimal linear predictor of $X_{n+1}$ given the last $p_n$ observations; recall that the oracle predictor is constructed using the (unrealistic) knowledge of the whole autocovariance structure.
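The FSO predictor (10) then amounts to one linear solve. The self-contained sketch below uses a simple eigenvalue floor as a stand-in for the positive definiteness corrections of Section 4, with threshold constants chosen only for illustration, and assumes $l \ge 1$ and a mean-zero series.

```python
import numpy as np
from scipy.linalg import toeplitz

def fso_predict(x, l, eps=20.0, beta=1.0):
    """One-step FSO prediction, eq. (10): solve the n-dim Yule-Walker system (5)."""
    n = len(x)
    xc = x - x.mean()
    g = np.array([xc[:n - k] @ xc[k:] / n for k in range(n)])   # sample autocovariances
    gt = np.clip(2.0 - np.arange(n) / l, 0.0, 1.0) * g          # tapered, eqs. (7)-(8), l >= 1
    G = toeplitz(gt)                                            # banded/tapered matrix, eq. (6)
    gamma_vec = np.r_[gt[1:], 0.0]                              # estimate of gamma(n), lags 1..n
    d, T = np.linalg.eigh(G)                                    # crude PD correction (cf. Sec. 4.1)
    G_star = (T * np.maximum(d, eps * g[0] / n ** beta)) @ T.T
    phi = np.linalg.solve(G_star, gamma_vec)                    # eq. (5)
    return phi @ x[::-1]                                        # prediction of X_{n+1}

rng = np.random.default_rng(1)
x = rng.standard_normal(200)
print(fso_predict(x, l=4))
```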

Data-based choice of the banding parameter l
The FSO and PSO predictors of equations (10) and (12) clearly depend on the choice of the banding parameter l. One possible approach to choosing it in a data-dependent way is the following rule, which was introduced for density and spectral density estimation in Politis (2003). McMurry and Politis (2010) further showed this rule produces approximately correct rates for autocovariance matrix estimation and good finite sample performance.
Empirical rule for picking l. Let $\varrho_k = \gamma_k/\gamma_0$ and $\hat\varrho_k = \hat\gamma_k/\hat\gamma_0$. Let $\hat l$ be the smallest positive integer such that $|\hat\varrho_{\hat l + k}| < c(\log n/n)^{1/2}$ for $k = 1, \ldots, K_n$, where $c > 0$ is a fixed constant, and $K_n$ is a positive, nondecreasing sequence that satisfies $K_n = o(\log n)$.
Remark 2. The empirical rule for picking $l$ remains valid for all $c > 0$ and $1 \le K_n \le n$, although different choices of $c$ and $K_n$ can lead to very different finite sample performances. Nonetheless, there are some guidelines for practically useful choices. The factor $(\log n)^{1/2}$ varies slowly, so it has little influence. For example, if $\log$ is taken to denote the base 10 logarithm, then for sample sizes between 100 and 1000, as is quite typical, $(\log n)^{1/2}$ varies between 1.41 and 1.73. Thus, if $c$ is chosen to be around 2 and $K_n$ about 5, Bonferroni's inequality implies that the bound $\pm c(\log n/n)^{1/2}$ can be used as the critical value for an approximate 95% test of the null hypothesis that $\varrho(l+1), \ldots, \varrho(l+K_n)$ are all simultaneously equal to zero. We have found that values in this range work well in practice.
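A direct implementation of the empirical rule is sketched below, with the illustrative defaults $c = 2$ and $K_n = 5$ and the base-10 logarithm mentioned in Remark 2.

```python
import numpy as np

def pick_l(x, c=2.0, Kn=5):
    """Empirical rule: smallest l with |rho_hat_{l+k}| < c*sqrt(log10(n)/n), k = 1..Kn."""
    n = len(x)
    xc = x - x.mean()
    g = np.array([xc[:n - k] @ xc[k:] / n for k in range(n)])
    rho = g / g[0]
    band = c * np.sqrt(np.log10(n) / n)
    for l in range(1, n - Kn):
        if np.all(np.abs(rho[l + 1:l + Kn + 1]) < band):
            return l
    return n - Kn - 1          # fallback if the criterion is never met

rng = np.random.default_rng(2)
x = rng.standard_normal(400)
print(pick_l(x))               # a small l is expected for white noise
```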

Basic assumptions
The convergence of $\hat\Gamma_n^*$ to $\Gamma_n$, the primary result underpinning our present work, is established in McMurry and Politis (2010) under physical dependence measure conditions (Wu, 2005). In order to state our results, we briefly describe these conditions.
Let $\{\epsilon_i, i \in \mathbb Z\}$ be a sequence of i.i.d. random variables. Assume that $X_i$ is a causal function of $\{\epsilon_i\}$, i.e.,
$$X_i = f(\ldots, \epsilon_{i-1}, \epsilon_i),$$
where $f$ is a measurable function such that $X_i$ is well defined and $E X_i^2 < \infty$. In order to quantify dependence, let $\epsilon_i'$ be an independent copy of $\epsilon_i$, $i \in \mathbb Z$, let $X_i' = f(\ldots, \epsilon_{-1}, \epsilon_0', \epsilon_1, \ldots, \epsilon_i)$, and define the physical dependence measure $\delta_\alpha(i) = \|X_i - X_i'\|_\alpha = (E|X_i - X_i'|^\alpha)^{1/\alpha}$.
Note that the difference between $X_i$ and $X_i'$ is due only to the difference between $\epsilon_0$ and $\epsilon_0'$, and therefore $\delta_\alpha(i)$ measures the dependence of $X_i$ on an event $i$ units of time in the past. To measure the cumulative dependence across all time, the quantity $\Delta_\alpha = \sum_{i=0}^{\infty}\delta_\alpha(i)$ is employed. These notions of dependence underlie the following assumptions which, in conjunction with further assumptions about the weight function $\kappa(\cdot)$, the bandwidth $l$, and the underlying process, will be sufficient to establish the consistency of the FSO predictor (10).

Assumption 1. $E X_t^4 < \infty$ and $\Delta_4 < \infty$.
Assumption 2. The weight function $\kappa$ is a 'flat-top' taper defined by eq. (9), where the function $g(\cdot)$ and the constant $c_\kappa$ satisfy $|g(x)| < 1$ for all $x$, and $c_\kappa \ge 1$.
Assumption 3. The quantity $r_n$ of eq. (14) converges to zero as $n \to \infty$.
All asymptotic results and order notations in the paper will be understood to hold as $n \to \infty$ without explicitly denoting it. In fact, Assumption 3 necessitates that $n \to \infty$; furthermore, the banding parameter $l$ may have to diverge at an appropriate rate to ensure the convergence of (14). However, if it so happens that $\gamma_i = 0$ for all $i$ greater than some $q$, e.g., under a moving average MA($q$) model, $l$ does not need to diverge; any finite value of $l$ would be acceptable as long as it is at least $q$.
Assumption 4. The spectral density of $\{X_t\}_{t\in\mathbb Z}$, defined as
$$f(\omega) = \frac{1}{2\pi}\sum_{k=-\infty}^{\infty}\gamma_k e^{-ik\omega},$$
satisfies $0 < c_1 \le f(\omega) \le c_2 < \infty$ for all $\omega$ and some positive constants $c_1$ and $c_2$.
Remark 3. Under different regularity conditions, Xiao and Wu (2012) were able to show a sharper result, with a faster rate $r_n'$ in place of $r_n$, using a flat-top taper as in our Assumption 2 and some constant $C > 0$ in the bound. For instance, Xiao and Wu (2012) assume $E|X_t|^{4+\delta} < \infty$ for some $\delta > 0$, whereas we allow for the possibility that $\delta = 0$. Nonetheless, we note that $r_n$ can be replaced by $r_n'$ in all asymptotic results of our paper provided our Assumptions 2 and 4 hold together with the conditions of Theorem 4 in Xiao and Wu (2012).

Estimating the length n vector γ(n)
Implicit in the $n$-dimensional Yule-Walker equations (5) is the need for consistent estimation of the length $n$ vector of autocovariances $\gamma(n) = [\gamma_1, \ldots, \gamma_n]'$. The vector of sample autocovariances $\hat\gamma(n) = [\hat\gamma_1, \ldots, \hat\gamma_n]'$ is not a consistent estimator of $\gamma(n)$; in fact, $\hat\gamma(n)$ misbehaves. To see why, recall that the periodogram of the centered data vanishes at frequency zero; this implies the identity $\sum_{i=1}^{n}\hat\gamma_i = -\hat\gamma_0/2$ which, of course, has no reason to hold for the true $\gamma_i$. By contrast, the flat-top weighted estimator $\tilde\gamma(n) = [\tilde\gamma_1, \ldots, \tilde\gamma_n]'$ defined in equation (7) is consistent, as Lemma 1 shows; let $|v|_2$ denote the $l^2$ norm of the vector $v$. Notice that $\tilde\gamma(n)$ is closely related to the first row of $\hat\Gamma_n$, which is a consistent estimator of $\Gamma_n$; the only difference is that while $\tilde\gamma(n) = [\tilde\gamma_1, \ldots, \tilde\gamma_n]'$, the first row of $\hat\Gamma_n$ is $[\tilde\gamma_0, \tilde\gamma_1, \ldots, \tilde\gamma_{n-1}]'$. However, the Yule-Walker equations (5) require a positive definite version of $\hat\Gamma_n$, denoted $\hat\Gamma_n^*$ (see Section 4). By looking at the first row of such a $\hat\Gamma_n^*$, we can obtain alternative estimates of $\gamma(n)$ that are also consistent, as Lemma 2 shows.
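The identity above is easy to verify numerically; the following snippet computes sample autocovariances with divisor $n$ from mean-centered data and compares the two sides.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(500)
xc = x - x.mean()
n = len(xc)
g = np.array([xc[:n - k] @ xc[k:] / n for k in range(n)])
print(g[1:].sum(), -g[0] / 2)   # the two values coincide up to rounding error
```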

Optimal prediction using the full sample
Assumptions 1-4 are sufficient to ensure the vector convergence of the estimated prediction coefficients $\hat\phi(n)$ given by (5) to the optimal prediction coefficients $\phi(n)$ given by (2).
Theorem 2. Under Assumptions 1-4, $|\hat\phi(n) - \phi(n)|_2 = O_P(r_n)$.

Corollary 2 of Wu and Pourahmadi (2009) establishes the same rate of convergence for the vector of prediction coefficients resulting from purely banded estimates of $\Gamma_n$ and $\gamma(n)$. In addition, Corollary 1 of Bickel and Gel (2011) establishes the convergence of a vector of prediction coefficients of length $p_n$ to the optimal vector of the same length; this is similar in spirit to our Theorem 2 but using $p_n$ of smaller order than $n$.
The vector convergence of the estimated prediction coefficients $\hat\phi(n)$ to $\phi(n)$ shown in Theorem 2 is important, but it is not by itself sufficient to establish the convergence of the resulting predictor to the oracle predictor; this convergence is the subject of our Theorem 3 and is our main theoretical result. To show that the FSO predictor $\hat X_{n+1}$ converges to the oracle predictor $\tilde X_{n+1}$ we will need two modest additional assumptions.
Assumption 5. There exists an integer sequence $k_n \to \infty$ such that:
ii. $r_n k_n^{1/2} \to 0$, where $r_n$ is as given in (14).
Assumption 6. $n^{1/2}\sum_{i=n+1}^{\infty}|\phi_i| \to 0$.

Remark 4. The rate $k_n$ described in Assumption 5 is required to exist in order to establish the asymptotic optimality of the FSO predictor (10). However, it is not a tuning parameter, and does not need to be estimated and/or chosen by the practitioner.
Remark 5. It can easily be seen that Assumptions 5 and 6 impose few additional restrictions on the process $\{X_t\}_{t\in\mathbb Z}$, as the following discussion shows.
i. Assumption 5i requires that $k_n$ grows slightly faster than $l$. The optimal $l$ depends on the rate of decay of $|\gamma_i|$ (see Corollary 1 of McMurry and Politis, 2010). If $|\gamma_i| = O(i^{-d})$ for some $d > 1$, then the optimal $l$ is proportional to $n^{1/(2d)}$; if $|\gamma_i|$ decays exponentially, then it is sufficient for $l$ to grow logarithmically.
ii. Assumption 6 is satisfied whenever $|\phi_i| \le C_\phi i^{-k}$ for $i > I_0$, some $k > 3/2$, and some $C_\phi > 0$.
iii. Assumptions 5ii and 5iii require some balancing of convergence rates, but they can be achieved with only modest restrictions on the underlying process. As long as $|\phi_i|$ decays at a rate faster than $i^{-3/2}$ (as required by Assumption 6), the term in Assumption 5iii will converge to 0 provided $k_n > C_{k_n} n^{1/2+\epsilon}$ for some $C_{k_n} > 0$ and some $\epsilon > 0$. As long as $|\gamma_i| < C_\gamma i^{-k}$ for $i > I_0$ for some $I_0$ and some $k > 2$, this allows for convergence of the prediction when $l$ is the asymptotically optimal bandwidth. If $\phi_i$ and $\gamma_i$ decay faster, conditions 5ii and 5iii will be satisfied by a wider range of $k_n$ and continue to allow for the optimal $l$.
Remark 6. Our Theorems 2 and 3 are expected to hold true verbatim if the estimated autocovariances $\tilde\gamma_s$ that constitute the entries of matrix (6) are also thresholded as described in Section 2.3 of Paparoditis and Politis (2012). It is less clear how the estimator will perform if the entries of the autocovariance matrix are only thresholded without use of the flat-top weight function $\kappa(\cdot)$, i.e., a thresholded version of the sample autocovariance matrix as in Xiao and Wu (2012) and Section 2.2 of Paparoditis and Politis (2012). The reason is that our proof of Theorem 3 relies on the rapid decay of the $\hat\phi_i(n)$ as $i$ increases; this is in part ensured because the non-zero diagonals of $\hat\Gamma_n$ are constrained to a band of width proportional to $l$, which grows slowly with $n$.

Optimal prediction using the partial sample
Theorem 3 demonstrates the asymptotic consistency of the estimated prediction coefficients when the $n \times n$ covariance matrix $\hat\Gamma_n^*$ is used. An approach more in line with traditional AR model fitting, or the work of Bickel and Gel (2011), would be to fit an AR model of order $p_n < n$, where $p_n$ could potentially grow with $n$. This would entail one-step ahead prediction that uses only the last $p_n$ observations; the prediction coefficients $\hat\phi(p_n)$ are given by (13).
Theorem 2 carries over to this lower order setting without modification, although it should be emphasized that if $p_n$ grows slowly enough, faster convergence rates than the one given below are possible; for example, if $p_n$ is constant, then the actual convergence rate will be $n^{-1/2}$, i.e., $|\hat\phi(p_n) - \phi(p_n)|_2 = O_P(n^{-1/2})$.

The extension of Theorem 3 requires a strengthening of Assumption 6. Our arguments depend on the closeness of $\phi(p_n)$ to the corresponding AR($\infty$) coefficients; this closeness improves as $p_n$ increases, necessitating the following assumption.
Assumption 7. Let $k_n$ be as in Assumption 5. Then either
a. $p_n \le k_n$ for all $n$, or
b. $p_n > k_n$ and $n^{1/2}\sum_{i=p_n+1}^{\infty}|\phi_i| \to 0$.

Remark 7. In the case where $p_n > k_n$, Assumption 7b is only a modest strengthening of Assumption 5iii.
Corollary 2. Let $1 \le p_n \le n$. Under Assumptions 1-5 and 7, $\hat X^{p_n}_{n+1} - \tilde X^{p_n}_{n+1} \overset{P}{\longrightarrow} 0$, where $\hat X^{p_n}_{n+1}$ is the PSO predictor of eq. (12) with coefficients $\hat\phi^{p_n}(n)$ obtained from eq. (13), and $\tilde X^{p_n}_{n+1}$ is its oracle counterpart of order $p_n$.

Corollary 1 is quite similar to Corollary 1 in Bickel and Gel (2011); however, their result requires $p_n = o(n)$ whereas ours is valid for all non-negative sequences $p_n \le n$. In addition, neither Bickel and Gel (2011) nor Wu and Pourahmadi (2009) provide a result similar to Corollary 2.
Remark 8. The FSO predictor (10) and the PSO predictor (12) are based on eq. (5) and (13) respectively, which employ the matrix estimator $\hat\Gamma_n^*$ and the vector estimator $\tilde\gamma(n)$. Of course, using the positive definite matrix estimator is necessary because the finite-sample inverse is needed. Note, however, that we could equally have chosen the vector estimator $\tilde\gamma^*(n)$ of Lemma 2 instead of $\tilde\gamma(n)$ in the Yule-Walker equations (5) and (13). All our asymptotic results of Section 3 on FSO/PSO predictors remain true verbatim with such a choice.

Corrections towards positive definiteness
Under Assumptions 1-4, the matrix $\hat\Gamma_n$ of eq. (6) will have eigenvalues bounded away from zero with probability tending to one as $n \to \infty$. However, for finite samples, $\hat\Gamma_n$ will occasionally have eigenvalues that are negative and/or positive but too small. Since the inverse of $\hat\Gamma_n$ is a key element in prediction, the matrix $\hat\Gamma_n$ must be corrected to achieve finite-sample positive definiteness and avoid ill-conditioning. In this section, we present four ways to implement such a correction. The method presented in Section 4.1 was originally proposed in McMurry and Politis (2010); we now complete that proposal by observing the need to rescale the matrix after it has been corrected. The methods in Sections 4.2, 4.3, and 4.4 are novel.

Eigenvalue thresholding
In the context of the Linear Process Bootstrap, McMurry and Politis (2010) suggested correcting the eigenvalues obtained in the spectral decomposition
$$\hat\Gamma_n = T_n D T_n', \quad (17)$$
where $T_n$ is an orthogonal matrix, and $D$ is diagonal with $i$th entry denoted $d_i$. McMurry and Politis (2010) showed that the adjusted estimate
$$\hat\Gamma_n^\epsilon = T_n D^\epsilon T_n', \quad \text{with } D^\epsilon \text{ diagonal and } d_i^\epsilon = \max\{d_i,\ \epsilon\hat\gamma_0/n^\beta\}, \quad (18)$$
is positive definite but maintains the same asymptotic rate of convergence as $\hat\Gamma_n$; in the above, $\epsilon > 0$ and $\beta > 1/2$ are some fixed numbers. For the purposes of the Linear Process Bootstrap, it had been found that the simple choices $\epsilon = 1$ and $\beta = 1$ worked well in practice. In the present context, however, we found that $\epsilon = 1$ sometimes produced unstable predictions. A much larger $\epsilon$, of the order of 10 or 20, seems to solve the problem; we used $\epsilon = 20$ and $\beta = 1$ in the simulations. Note that the average eigenvalue of the sample autocovariance matrix equals $\hat\gamma_0$, which is our best estimator of var$[X_t]$; similarly, the average eigenvalue of $\hat\Gamma_n$ equals $\tilde\gamma_0 = \hat\gamma_0$. However, the threshold correction (18) increases the average eigenvalue of the estimated matrix, implicitly suggesting an increased estimate of var$[X_t]$ (see Appendix A for the connection of the eigenvalues of $\Gamma_n$ to the spectral density, and therefore also to var$[X_t]$). Consequently, it is intuitive to rescale the estimate $\hat\Gamma_n^\epsilon$ in order to ensure that its average eigenvalue remains equal to $\tilde\gamma_0 = \hat\gamma_0$.
Another way to motivate rescaling the corrected matrix estimate is to note that the Yule-Walker equations (5) should be scale invariant, i.e., invariant upon changes of var$[X_t]$. In fact, they are often defined via a correlation matrix and vector instead of a covariance matrix and vector. To turn $\tilde\gamma(n)$ into a vector of correlations, we just divide it by $\hat\gamma_0$. Dividing $\hat\Gamma_n^*$ by $\hat\gamma_0$ should then provide a correlation matrix, hence the need for rescaling.
The rescaled estimate is thus given by
$$\hat\Gamma_n^* = \frac{\hat\gamma_0}{\bar d^{\,\epsilon}}\,\hat\Gamma_n^\epsilon, \quad \text{where } \bar d^{\,\epsilon} = n^{-1}\sum_{i=1}^{n} d_i^\epsilon \text{ is the average eigenvalue of } \hat\Gamma_n^\epsilon. \quad (19)$$
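A minimal sketch of the thresholding-plus-rescaling steps (18)-(19) follows, assuming the threshold level $\epsilon\hat\gamma_0/n^\beta$ described above; the function name is ours.

```python
import numpy as np

def threshold_and_rescale(G, eps=20.0, beta=1.0):
    """Eigenvalue hard threshold as in (18), then rescale as in (19) so that
    the average eigenvalue equals the lag-0 autocovariance G[0, 0]."""
    n = G.shape[0]
    d, T = np.linalg.eigh(G)
    d_eps = np.maximum(d, eps * G[0, 0] / n ** beta)
    G_eps = (T * d_eps) @ T.T
    return (G[0, 0] / d_eps.mean()) * G_eps
```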

Shrinkage of problematic eigenvalues towards positive definiteness
Section 4.1 described a hard-threshold adjustment to the eigenvalues of $\hat\Gamma_n$ in order to render it positive definite. An alternative approach is to make the adjustment based on a positive definite estimate of $\Gamma_n$; this approach is novel in the literature on estimating large Toeplitz matrices and/or spectral densities; for the latter, see Appendix A.
If the flat-top weight function (8) is replaced by a weight function with a positive Fourier transform satisfying $\kappa(0) = 1$, such as Parzen's piecewise cubic lag window (Brockwell and Davis, 1991, p. 361), the resulting estimator $\hat\Gamma^{pd}_n$ will be positive definite and consistent, albeit with a slower rate of convergence than $\hat\Gamma_n$. Since $\hat\Gamma^{pd}_n$ and $\hat\Gamma_n$ are both Toeplitz, they are asymptotically diagonalized by the same orthogonal matrix (Grenander and Szegő, 1958). Therefore, letting $T_n$ be the orthogonal matrix from equation (17), the matrix defined as $\check D = T_n' \hat\Gamma^{pd}_n T_n$ will be close to diagonal, and its diagonal entries will approximate the eigenvalues of $\hat\Gamma^{pd}_n$. Let $\check d_1, \ldots, \check d_n$ be the diagonal entries of $\check D$. We then produce adjusted eigenvalues $d_i^\star$ of $D$ (as in (17)) by the following shrinkage rule: let $d_i^+ = \max\{d_i, 0\}$; then $d_i^\star$ is given by (20), where $\tau_n = c/n^a$ for constants $c > 0$ and $a > 1/2$. Let $D^\star$ be a diagonal matrix with diagonal elements $d_1^\star, \ldots, d_n^\star$, and define the shrinkage estimator $\hat\Gamma^\star_n = T_n D^\star T_n'$, which is positive definite and maintains the same asymptotic properties as $\hat\Gamma_n$ as long as the constant $a$ in (20) is greater than 1/2. However, if $a$ is chosen too large, the shrinkage correction will be ineffective for small samples. Finally, note that a rescaling step as given in eq. (19) must be performed here as well; hence, our final estimator is given by
$$\hat\Gamma_n^* = \frac{\hat\gamma_0}{\bar d^{\,\star}}\,\hat\Gamma_n^\star, \quad (21)$$
where $\bar d^{\,\star} = n^{-1}\sum_{i=1}^{n} d_i^\star$ is the average eigenvalue of $\hat\Gamma_n^\star$.

Remark 9. Shrinking the PSO predictor towards a positive definite estimator is not expected to perform well when $p_n \ll n$; this is because $\hat\Gamma^{pd}_{p_n}$ and $\hat\Gamma_{p_n}$ are less close to being diagonalizable by the same orthogonal matrix when $p_n$ is not large.

Shrinkage towards white noise
Section 4.2 proposed shrinking $\hat\Gamma_n$ towards the positive definite estimator $\hat\Gamma^{pd}_n$. The shrinking was selective: only problematic eigenvalues were corrected, as in the threshold method of Section 4.1. We now propose a different correction that is based on shrinking the corresponding spectral density estimate toward that of a white noise with the same variance, in effect adjusting all eigenvalues. This approach is novel in the literature on estimating large Toeplitz matrices and spectral densities, and provides substantial computational benefits. However, the notion of shrinking covariance matrices towards the identity has been previously employed by Ledoit and Wolf (2003, 2004) in a different context, namely as a tool to regularize the sample covariance matrix based on a sample consisting of multiple i.i.d. copies of a random vector.
Recall that, up to a factor of $2\pi$, the eigenvalues of $\hat\Gamma_n$ are asymptotically given by the values of the corresponding spectral density estimate evaluated at the Fourier frequencies; see e.g. Gray (2006). Thus, negative eigenvalues correspond to negative values in the estimated spectral density. The estimated spectral density can be made positive, while keeping $\hat\gamma_0$ fixed, by shrinking $\tilde\gamma_i$ (for $i \ne 0$) towards zero by a constant factor $s \in (0, 1]$, chosen to ensure that the minimum of the estimated spectral density is greater than or equal to $\epsilon\hat\gamma_0/(2\pi n^\beta)$. To elaborate, if the minimum of the estimated spectral density happens to be greater than or equal to $\epsilon\hat\gamma_0/(2\pi n^\beta)$, then no correction is needed; if not, then $s$ is chosen so that the minimum of the corrected spectral density is exactly equal to $\epsilon\hat\gamma_0/(2\pi n^\beta)$. See Appendix A for more details.
The same adjustment can be applied to the estimated autocovariance matrix, resulting in the shrinkage corrected version of $\hat\Gamma_n$ given by
$$\hat\Gamma_n^* = s\,\hat\Gamma_n + (1-s)\,\hat\gamma_0 I_n, \quad (22)$$
where $I_n$ is the identity matrix and $s \in (0, 1]$. If all the eigenvalues $d_i$ are greater than or equal to $\epsilon\hat\gamma_0/n^\beta$, then we let $s = 1$. Otherwise, we let $s$ be the maximum value that ensures that the minimum eigenvalue of $\hat\Gamma_n^*$ is exactly equal to $\epsilon\hat\gamma_0/n^\beta$. Estimator (22) has several appealing properties. First, it keeps the estimated variance of the process fixed at $\hat\gamma_0$, i.e., there is no need for rescaling. Second, the shrinkage estimator $\hat\Gamma_n^*$ remains banded and Toeplitz, so fast, memory efficient Toeplitz equation solving algorithms (Brent, Gustavson and Yun, 1980) can always be used. Third, the estimate itself does not require numerical diagonalization of $\hat\Gamma_n$, since $s$ can be determined by evaluating the corresponding spectral density estimate.
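The following sketch implements the white-noise shrinkage (22); for brevity it finds the minimum eigenvalue directly rather than through the spectral density, and the default constants are only illustrative.

```python
import numpy as np

def shrink_to_white_noise(G, eps=1.0, beta=1.0):
    """Eq. (22): G* = s*G + (1-s)*g0*I with the largest s in (0,1] keeping
    the minimum eigenvalue of G* at or above eps*g0/n^beta."""
    n = G.shape[0]
    g0 = G[0, 0]
    floor = eps * g0 / n ** beta
    lam_min = np.linalg.eigvalsh(G).min()
    if lam_min >= floor:
        s = 1.0
    else:
        # eigenvalues of G* are s*(lam - g0) + g0; solve for s at lam = lam_min
        s = (g0 - floor) / (g0 - lam_min)
    return s * G + (1.0 - s) * g0 * np.eye(n), s
```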

Shrinkage towards a 2nd order estimate
Section 4.2 suggested shrinking the smaller eigenvalues of $\hat\Gamma_n$ towards a second order target. Section 4.3 introduced the idea of shrinking all the eigenvalues of $\hat\Gamma_n$ towards those of a white noise process. An approach that combines the most appealing features of these methods is to shrink the whole of $\hat\Gamma_n$ towards a positive definite, 2nd order estimate of $\Gamma_n$.
Let $\hat\Gamma^{pd}_n$ be as defined in Section 4.2, and define the corrected estimator by
$$\hat\Gamma_n^* = s\,\hat\Gamma_n + (1-s)\,\hat\Gamma^{pd}_n. \quad (23)$$
The shrinkage factor $s \in [0, 1]$ is chosen to raise the minimum eigenvalue of $\hat\Gamma_n^*$ as close as possible to $\epsilon\hat\gamma_0/n^\beta$ while keeping $s$ in the desired range. Our algorithm exploits the connection between Toeplitz matrices and the spectral density and is described in detail in Appendix A.3.
Estimator (23) is particularly appealing because $\hat\Gamma_n^*$ remains banded and Toeplitz, and thus can be inverted via a fast algorithm. In addition, $\hat\Gamma_n^*$ has no need for rescaling, as it has $\hat\gamma_0$ on the main diagonal. Finally, using the second order estimator as the target feels less arbitrary than shrinking towards white noise. But the reason that both corrections work well, both asymptotically and in simulations, is that the correction is a small one, i.e., $s$ tends to one in large samples; thus, the target is not meant to be achieved but gives only a general direction for the correction; see the Appendix for more discussion of the spectral density analog.
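The sketch below illustrates (23) under two simplifying assumptions: the 2nd order target is a Parzen lag-window estimate with a user-supplied bandwidth M, and s is located by bisection on the minimum eigenvalue rather than by the spectral-density algorithm of Appendix A.3.

```python
import numpy as np
from scipy.linalg import toeplitz

def parzen(x):
    """Parzen's piecewise cubic lag window."""
    x = np.abs(x)
    return np.where(x <= 0.5, 1 - 6 * x**2 + 6 * x**3,
                    np.where(x <= 1.0, 2 * (1 - x)**3, 0.0))

def shrink_to_second_order(G, g_raw, M, eps=10.0, beta=1.0, iters=40):
    """Eq. (23): G* = s*G + (1-s)*G_pd, taking s as large as possible subject
    to lambda_min(G*) >= eps*g0/n^beta (located here by simple bisection)."""
    n = G.shape[0]
    G_pd = toeplitz(parzen(np.arange(n) / M) * g_raw[:n])  # positive definite 2nd order target
    floor = eps * g_raw[0] / n ** beta
    lam_min = lambda s: np.linalg.eigvalsh(s * G + (1 - s) * G_pd).min()
    if lam_min(1.0) >= floor:
        return G, 1.0
    lo, hi = 0.0, 1.0                                       # assumes lam_min(0) >= floor
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if lam_min(mid) >= floor else (lo, mid)
    return lo * G + (1 - lo) * G_pd, lo
```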

Remarks on the four correction methods
Let $\hat\Gamma_n^*$ denote the corrected (and rescaled, if needed) matrix according to one of the methods presented in Sections 4.1, 4.2, 4.3, or 4.4. By construction, $\hat\Gamma_n^*$ is positive definite but maintains the same fast asymptotic rate of convergence as $\hat\Gamma_n$. The proof of the following corollary is similar to the proofs of Corollaries 2 and 3 in McMurry and Politis (2010).
In addition, the positive definite estimator $\hat\Gamma_n^*$ may find other applications when a consistent estimator of $\Gamma_n^{-1}$ is needed. For example, in the aforementioned Linear Process Bootstrap of McMurry and Politis (2010), the un-scaled threshold estimator $\hat\Gamma_n^\epsilon$ discussed in Section 4.1 was employed. We conjecture that using the rescaled estimator $\hat\Gamma_n^*$ of eq. (19) may improve the finite-sample performance of the Linear Process Bootstrap by better capturing/preserving the scale of the problem. In addition, the estimators $\hat\Gamma_n^*$ from Sections 4.2, 4.3, and 4.4 are directly applicable to the Linear Process Bootstrap, and may also improve its performance.
Remark 10. The problem of estimating high-dimensional covariance and/or precision matrices emerges under many settings different from ours; see e.g. Basu and Michailidis (2014), Bickel and Levina (2008a,b), Cai and Liu (2011), Cai and Zhou (2012a,b), and Chen, Xu and Wu (2013). As the method of shrinkage towards the identity has been found useful by Ledoit and Wolf (2003, 2004), it is conjectured that the above four methods of correcting a matrix towards positive definiteness may be found useful in different such contexts.
Remark 11. Among the four correction methods, the two shrinkage estimators, namely estimators (22) and (23), may prove especially useful in the case of very large data sets. The reason is that they both result in a banded Toeplitz matrix that can be calculated easily, stored efficiently, and inverted via fast algorithms. Recall that the system $Tb = z$ with $T$ Toeplitz can be solved in $O(n \log^2 n)$ time using $O(n)$ memory (Brent, Gustavson and Yun, 1980).
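To illustrate the computational point, the SciPy Levinson-type solver below solves a symmetric Toeplitz system directly from its first column, avoiding formation and inversion of the full matrix; the numbers are made-up tapered autocovariances.

```python
import numpy as np
from scipy.linalg import solve_toeplitz, toeplitz

first_col = np.array([2.0, 0.8, 0.3, 0.1, 0.0])   # illustrative tapered autocovariances
rhs = np.array([0.8, 0.3, 0.1, 0.0, 0.0])          # illustrative right-hand side gamma(n)
phi_fast = solve_toeplitz(first_col, rhs)           # Levinson recursion: O(n^2) time, O(n) memory
phi_dense = np.linalg.solve(toeplitz(first_col), rhs)
print(np.allclose(phi_fast, phi_dense))             # True
```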

Simulations and numerical experiments
We tried a variety of simulation experiments. For each simulated data set, we used the first $n$ observations to predict the $(n+1)$st observation. Each prediction was made using 16 approaches:
• The FSO predictor with the threshold correction to positive definiteness described in Section 4.1, together with rescaling to keep the average eigenvalue the same. Two versions of the estimate of $\gamma(n)$ were considered: the first was the version given by (7), and the second was given by the first row of $\hat\Gamma_n^*$, i.e., $([\hat\Gamma_n^*]_{1,2}, \ldots, [\hat\Gamma_n^*]_{1,n}, 0)'$; see Section 3.2 for details. In the simulation tables, these estimates are denoted FSO-Th-Raw and FSO-Th-Shr respectively.
• The FSO predictor with shrinkage to positive definiteness described in Section 4.2. Both raw and shrunken versions of the $\gamma(n)$ estimate were considered. These predictions are denoted respectively FSO-PD-Raw and FSO-PD-Shr.
• The FSO predictor with shrinkage towards white noise, as described in Section 4.3. Both raw and shrunken versions of the $\gamma(n)$ estimate were considered. In the simulation tables, these estimates are denoted respectively FSO-WN-Raw and FSO-WN-Shr.
• The FSO predictor with shrinkage towards a 2nd order estimate described in Section 4.4. Both raw and shrunken versions of the $\gamma(n)$ estimate were considered. In the simulation tables, these estimates are denoted respectively FSO-2o-Raw and FSO-2o-Shr.
• The FSO predictor using a rectangular weight function $\kappa(\cdot)$ along with either the adaptive bandwidth choice (ABC) described in Section 2.3 or the subsampling bandwidth choice (SSBC) described in Wu and Pourahmadi (2009); both bandwidth choices were considered in combination with raw and shrunken versions of the $\gamma(n)$ estimate. These estimates are denoted Rect-ABC-Raw, Rect-ABC-Shr, Rect-SSBC-Raw, and Rect-SSBC-Shr. All matrices were corrected to positive definiteness by shrinkage to white noise.
• The PSO predictor with a threshold correction and $p_n = (n\,p_{aic})^{1/2}$, together with the raw estimate of $\gamma(n)$; this estimator is denoted PSO-Th-Raw. We found that the shrunken version of the $\gamma(n)$ estimate performed erratically in this setting, and the results are omitted. Note that $p_{aic}$ denotes the AR order chosen by minimization of the AIC criterion.
• The PSO predictor with shrinkage towards white noise and $p_n = (n\,p_{aic})^{1/2}$, together with the shrunken estimate of $\gamma(n)$. This estimator is denoted PSO-WN-Shr.
• An AR($p_{aic}$) prediction with $p_{aic}$ chosen by AIC minimization, denoted AR.
• A version of the method described in Bickel and Gel (2011), denoted BG, with $p_n = n^{1/2}$.
Accuracy of all predictions is described by root mean square prediction error (RMSPE) taken across all simulations. The trapezoidal taper of equation (8) was used throughout. The predictor described in Bickel and Gel (2011) has coefficients given by $(\hat\Gamma^{k}_{p_n})^{-1}\hat\gamma(p_n)$, where $\hat\Gamma^{k}_{p_n}$ is the $k$-banded version of the $p_n \times p_n$ autocovariance matrix with $p_n = o(n)$, and $\hat\gamma(p_n)$ is the vector of autocovariances at lags $1, \ldots, p_n$. In our simulations, we found that $\hat\Gamma^{k}_{p_n}$ was frequently not positive definite. Bickel and Gel (2011) recommend either considering a reduced set of banding parameters $k$ or tapering $\hat\Gamma_{p_n}$ with a positive definite taper (Xiao and Wu, 2012). We found the first approach to produce unstable predictions, so we focused on the second. In particular, we tapered the entries of $\hat\Gamma_{p_n}$ by the Parzen piecewise cubic lag window (Brockwell and Davis, 1991, p. 361), and chose the width of the lag window by cross-validation over the values from 1 to $3p_n$ plus $\infty$ (no tapering).
For the implementation of the FSO predictor (10), the employed threshold correction used constants $\epsilon = 20$ and $\beta = 1$. When shrinking towards a positive definite estimator, we used constants $c = 6$ and $a = 0.55$. For shrinkage towards white noise, we scaled the off-diagonals of $\hat\Gamma_n$ until the smallest eigenvalue was at least $\max\{10\hat\gamma_0/n,\ 0.5 \times \lambda_{\min}(\hat\Gamma^{pd}_n)\}$. Shrinkage towards a 2nd order estimate used a threshold of $10\hat\gamma_0/n$. All second order estimates used the Parzen piecewise cubic lag window with bandwidth chosen by the plug-in approach proposed by Politis (2003).

AR(1) prediction
For the first experiment, we simulated AR(1) time series of length 201 and used the first 200 data points to predict the 201st. Each simulation was repeated 1000 times, and the root mean square prediction errors are shown in Table 1. This simulation should favor the AR predictor (3) since it directly fits an AR model.
The banding of $\hat\Gamma_n$ implies that the FSO predictor (10) is based on a model whose autocovariances vanish for lags bigger than $2l$, in effect an MA model of order $2l$. Hence, if a dataset can be well approximated by a low-order AR model, it is expected that the AR predictor (3) will have an advantage over a method that is trying to approximate the low-order AR by a high-order MA. Table 1 shows that the FSO predictor (10) is competitive with (and even better than) the AR($p$) model for small values of the AR coefficient, but becomes less competitive as the AR coefficient becomes larger; this is not surprising since accurate approximation of such models by a moving average would require a very high order MA model.

Table 1. Root mean square prediction errors (standard error in parentheses) for AR(1) processes.

Standard errors for the RMSPE estimates are shown in parentheses in Table 1. The standard errors for the differences in RMSPE between our methods and the AR predictions tend to increase with the magnitude of the AR parameter. When the AR parameter is $-0.1$ or $0.1$, the standard errors for these differences tend to be around 0.005. When the AR parameter is $-0.5$ or $0.5$, the standard errors for these differences tend to be around 0.009. When the AR parameter is $-0.9$ or $0.9$, the standard errors for these differences tend to be around 0.014.

MA(1) prediction
For the second experiment, we simulated MA(1) time series of length 201 and used the first 200 data points to predict the 201st. Each simulation was repeated 1000 times, and the root mean square prediction errors are shown in Table 2. This simulation should favor the FSO predictor (10) since it estimates the correlation structure of an MA model directly.
Note that the FSO predictor (10) is competitive with the AR($p$) model for all values of the MA coefficient, and frequently shows slightly better performance. Standard errors for the RMSPEs are shown in parentheses in Table 2. Standard errors for the differences in RMSPEs between our methods and the AR($p$) predictions were approximately 0.004 when the MA parameter was $-0.1$ or $0.1$, 0.006 when the MA parameter was $-0.5$ or $0.5$, and 0.010 when the MA parameter was $-0.9$ or $0.9$.
Remark 12. The AR(1) simulations suggest that the shrunken estimate of $\gamma(n)$ tends to outperform the raw estimate when the AR parameter is large. The improvement seems to come at little to no cost, so the shrunken estimate seems advisable in practice; further evidence to support this point is provided in Section 5.6.

MA(2) prediction
In the next simulation, we considered a wide range of MA(2) processes, with coefficients $\theta_1$ and $\theta_2$ ranging from $-1$ to $1$ in steps of $1/3$. Several of these combinations of parameters, for example $\theta_1 = -1$ and $\theta_2 = 0$, have MA polynomials with roots on the unit circle, causing the spectral density to have a corresponding zero. These simulations are expected to cause trouble for all approaches to prediction since $\Gamma_n$ is not invertible for large $n$, but the troubles have been somewhat masked by the correction to positive definiteness.
For the simulation, 1000 data sets of sizes 101 and 501 were generated for each combination of θ 1 and θ 2 , and the final observation was predicted using all preceding observations; results are given in Tables 3 and 4. Note that for the larger sample size, one of our estimators was the best performing in more than half of the MA(2) cases under consideration. For the smaller sample size, our estimators were consistently competitive except for the case θ 2 = 1.
Standard errors for the RMSPEs in the MA(2) case were again consistently close to 0.024. The five-number summary for the simulations of size 100 was: min = 0.021, $Q_1$ = 0.024, $Q_2$ = 0.026, $Q_3$ = 0.028, max = 0.051. Standard errors for the simulations of size 500 were almost identical.
Remark 13. The shrinkage corrected FSO and PSO predictors produced very similar results across most of the simulations; this is not surprising since, in general, $\hat\phi_i(n)$ decays rapidly as $i$ increases. As long as $p_n \gg l$, the coefficients of the FSO and PSO predictors that are significantly different from zero agree almost exactly.

Real data experiment
Table 3. Root mean square prediction errors for MA(2) processes with $n = 100$.
Table 4. Root mean square prediction errors for MA(2) processes with $n = 500$.

For our final prediction experiment, we tried our methods on real data using time series from the M3 competition database (Hyndman, Akram and Bergmeir, 2013). In order to avoid the need for seasonal adjustment, we first selected only those time series measured yearly. We then further restricted attention to those time series which were not found to be nonstationary, i.e., for which the test of Kwiatkowski et al. (1992) could not reject the null hypothesis of absence of a unit root at the $\alpha = 0.05$ level. The end result was 105 time series of lengths between 20 and 47. Each of these time series was then rescaled to have variance 1 so that prediction errors would have approximately the same scale. The experiment consisted of predicting the second to last and the last value in each series using all preceding observations, i.e., having a testing set of size 2 for each series; this resulted in 210 total predictions. Root mean square prediction errors (obtained as an average over the 210 predictions) corresponding to the different methods are shown in the first column of Table 5.
To corroborate and add weight to these findings, we then reversed time and used the later times in each series to predict the first and second observations. Time reversal produces stationary time series with the same covariance structures, allowing us to perform an additional 210 predictions. Results are shown in the second column of Table 5.
In both the original and time reversed real data experiments, the benchmark AR prediction outperformed all other methods. Our procedure is analogous to nonparametric spectral estimation, so in some sense it is not surprising that it is not competitive with sample sizes between 20 and 50. More surprisingly, the subsampling bandwidth choice outperformed the adaptive bandwidth choice, and the predictors using shrunken estimates of $\gamma(n)$ underperformed compared to those using raw estimates. Both these phenomena seem to contradict the findings of the simulations; we suspect the sample sizes were so small that they did not allow the asymptotics to kick in.

Relative performance of different matrix estimators
Rescaling of the threshold corrected matrix as given in eq. (19) is a new proposal in the literature. Similarly, the shrinkage corrected matrices described in Sections 4.2, 4.3, and 4.4 are also novel. For this reason, we also carried out a small simulation study demonstrating their ability to improve estimates of $\Gamma_n$. Data sets of size $n = 200$ were used throughout, generated from some simple AR(1) and MA(1) models, along with the ARMA(2,1) model $X_t - 0.7X_{t-1} + 0.5X_{t-2} = \epsilon_t - 0.3\epsilon_{t-1}$. Each data set was used to estimate the autocovariance matrix, and the estimate was then compared to the true autocovariance matrix in operator norm. 1,000 replications were performed for each model. Average differences in operator norm are shown in Table 6. The estimators that have been corrected to positive definiteness and then scaled to keep the average eigenvalue unchanged show a consistent advantage over the initial estimate $\hat\Gamma_n$ and the unadjusted threshold corrected matrix. Shrinkage to white noise and to a second order estimate both show strong performance, with the former particularly strong for the MA processes, and the latter stronger for the AR processes.

Relative performance of autocovariance vector estimators
Section 3.2 introduced two estimates of the autocovariance vector $\gamma(n)$. The first, $\tilde\gamma(n)$, is the raw or unadjusted banded and tapered estimate. The second, $\tilde\gamma^*(n)$, is the shrunken version taken from the first row of the estimated autocovariance matrix after correction to positive definiteness. These two estimators have similar theoretical performance, but it is unclear which is preferable in application.
In order to compare performance, we conducted a small simulation study using a selection of AR(1) and MA(1) models, along with the ARMA(2,1) model $X_t - 0.7X_{t-1} + 0.5X_{t-2} = \epsilon_t - 0.3\epsilon_{t-1}$, each with a sample size of $n = 200$. $l^2$ norm errors are shown in Table 7. All the shrinkage type estimators, except possibly the selective shrinkage towards a positive definite estimate, seem to consistently improve on the raw estimate $\tilde\gamma(n)$; shrinkage towards white noise is a particularly strong performer here.

Conclusions
The thrust of this paper was to demonstrate the viability and asymptotic consistency of the FSO linear predictor (10) that uses the complete process history.
A key element here is an accurate estimate of the full n × n autocovariance matrix given a sample of size n. As a by-product, we also show the consistency of the PSO linear predictor (12) which is an AR(p) predictor based on the last p data values for any p ≤ n; this is a substantial strengthening of previous results which had required p = o(n). In simulations, it is shown that the FSO and PSO predictors are competitive as compared to the state-of-the-art linear predictor which amounts to fitting an AR(p) model with p chosen by AIC minimization.
As part of our investigations, we have introduced several refinements to the current state of the art in estimating large autocovariance matrices under the restriction that they are finite-sample positive definite and not ill-conditioned. In particular, when using the eigenvalue threshold correction, we noted the necessity of rescaling the matrix so that the mean eigenvalue remains unchanged. In addition, we introduced three new corrections to positive definiteness, namely shrinking towards positive definiteness, shrinking towards the (rescaled) identity/white noise, and shrinking towards a 2nd order estimate. All three corrections are shown to work well with the shrinkage towards white noise appearing to have a small finite sample performance advantage over shrinking towards a 2nd order estimate.
In particular, note that the estimators resulting from shrinkage towards either white noise or a second order estimate both result in a banded Toeplitz matrix. As such, they can be calculated easily, stored efficiently, and inverted via fast algorithms; this property is especially important in the case of very large sample sizes.
Finally, in Appendix A we use these insights into large covariance matrix estimation to refine flat-top kernel spectral density estimates in order to ensure their positivity.
Proof of Lemma 1.
Bounds for the above three terms are obtained in the proof of Theorem 1 in McMurry and Politis (2010); combining those bounds yields a bound with a constant $d_2$ that depends on $E X_i^4$ and $\Delta_4$ but not on $l$ or $n$; this establishes the convergence of $\tilde\gamma(n)$ to $\gamma(n)$ at the same rate at which $\hat\Gamma_n$ converges to $\Gamma_n$, described in Corollary 1 of McMurry and Politis (2010).
The final equality is established in the proof of Theorem 1 in McMurry and Politis (2010); see also Section 3.2.
We now turn our attention to $A_2$, which can be handled by Corollary 3 in McMurry and Politis (2010). Since $\sum_{i=1}^{\infty}|\gamma(i)| < \infty$, the result follows.

Proof of Theorem 3. We compare the FSO predictor $\hat X_{n+1}$ to the oracle optimal prediction $\tilde X_{n+1}$ based on the following decomposition:
$$\hat X_{n+1} - \tilde X_{n+1} = \underbrace{\sum_{i=1}^{k_n}\big(\hat\phi_i(n) - \phi_i(n)\big)X_{n+1-i}}_{A} + \underbrace{\sum_{i=k_n+1}^{n}\hat\phi_i(n)X_{n+1-i}}_{B} - \underbrace{\sum_{i=k_n+1}^{n}\phi_i(n)X_{n+1-i}}_{C}. \quad (24)$$
The basic idea of the proof is that we can let $k_n \to \infty$ slowly enough that the first term goes to 0 by the Cauchy-Schwarz inequality. Since the coefficients $\hat\phi_i(n)$ and $\phi_i(n)$ decay quickly as $i$ increases, by allowing $k_n$ to grow fast enough the second two terms can also be shown to converge to 0, again by Cauchy-Schwarz.
We begin with term A. By the Cauchy-Schwarz inequality,
$$|A| \le |\hat\phi(n) - \phi(n)|_2 \Big(\sum_{i=1}^{k_n} X_{n+1-i}^2\Big)^{1/2}.$$
By Assumption 5ii this term tends to 0. Term B will be handled by Proposition 2.2 of Demko, Moss and Smith (1984), which shows that, as long as $\hat\Gamma_n^*$ is a banded matrix, which it will be with probability tending to 1 (for small samples $\hat\Gamma_n^*$ may be corrected to positive definiteness and, depending on the technique used, no longer banded),
$$\big|[(\hat\Gamma_n^*)^{-1}]_{ij}\big| \le C_2\,\lambda^{|i-j|}, \quad (25)$$
where $C_2$ and $\lambda < 1$ depend only on the largest and smallest eigenvalues of $\hat\Gamma_n^*$. Since, with probability tending to 1, these are bounded away from 0 and from above, for large enough $n$, $C_2$ and $\lambda$ can be chosen independent of $n$, with (25) holding with probability tending to 1.
Since, by Assumption 5i, $k_n$ grows faster than $l$, there is no loss in considering only $i > 2l$. In this case, (25) yields an exponentially decaying bound (26) on $|\hat\phi_i(n)|$ for $i > 2l$, where the bound holds with probability tending to 1.
By (26), one obtains the bound (27). By (27) and the Cauchy-Schwarz inequality, term B converges to 0 by Assumption 5i. By the Cauchy-Schwarz inequality, term C can be bounded as in (28), where $\phi_i$ denotes the corresponding AR($\infty$) coefficient; inequality (28) follows from the variant of Baxter's inequality (Baxter, 1962, 1963) given in Lemma 2.2 of Kreiss, Paparoditis and Politis (2011), and holds for all $n > N_0$ for some positive $N_0$. The first term in (28) converges by Assumption 6. The second term in (28) converges by Assumption 5iii.
Proof of Corollary 1. For any sequence $p_n < n$, $\hat\Gamma^*_{p_n}$ converges to $\Gamma_{p_n}$ as fast or faster than the convergence of the larger $n \times n$ matrices; this is because the absolute row sum norm of the difference of the smaller matrices is bounded from above by the maximum absolute row sum norm of the difference of the larger matrices. Similarly, the convergence of $\tilde\gamma(p_n)$ to $\gamma(p_n)$ is not made worse; see Section 3.2. Finally, the eigenvalues of $\hat\Gamma^*_{p_n}$ and $\Gamma_{p_n}$ have the same positive upper and lower bounds as their larger counterparts; see Lemma 4.1 in Gray (2006). Therefore, the proof of Theorem 2 carries over directly.
Proof of Corollary 2. In the case that $p_n \le k_n$, terms B and C in (24) are 0, and the Cauchy-Schwarz inequality can be used directly on term A, giving the desired result. If $p_n > k_n$, term B in decomposition (24) is again handled by Proposition 2.2 of Demko, Moss and Smith (1984), with the only change being that the sum (27) stops at $p_n$. The challenge comes with term C, where Baxter's inequality is now used to compare $\phi_i(p_n)$ and $\phi_i$. Since $p_n < n$, this approximation becomes worse, and the first half of term C becomes a term of order $n^{1/2}\sum_{i=p_n+1}^{\infty}|\phi_i|$, which converges to 0 by Assumption 7.
Then we can define a corrected flat-top spectral density estimator $\hat f^{\,\star}_{io}(\omega)$, where $\tau_n = c/n^a$ for constants $c > 0$ and $a > 1/2$. Since $a > 1/2$, the correction by the factor $\tau_n$ is asymptotically negligible, so that $\hat f^{\,\star}_{io}(\omega)$ enjoys the same fast rate of convergence as $\hat f_{io}(\omega)$.
Using the formula for the Fourier coefficients and noting that $\kappa(0) = \kappa_{2o}(0) = 1$, it follows that
$$\int_{-\pi}^{\pi}\hat f_{io}(\omega)\,d\omega = \int_{-\pi}^{\pi}\hat f_{2o}(\omega)\,d\omega = \hat\gamma_0,$$
i.e., the area under any choice of spectral density estimate equals the sample autocovariance at lag zero, which is our best estimate of var$[X_t]$. Note, however, that the shrinkage estimator $\hat f^{\,\star}_{io}(\omega)$ has an area that is larger than $\hat\gamma_0$, therefore implying a bigger estimate for var$[X_t]$. This is not intuitive, and hence $\hat f^{\,\star}_{io}(\omega)$ should be appropriately rescaled; our final, rescaled shrinkage estimator is denoted $\hat f^{\,*}_{io}(\omega)$.

A.2. Shrinkage toward white noise

As described in Section 4.3, we may shrink $\tilde\gamma_i$ (for $i \ne 0$) towards zero by a factor $s \in (0, 1]$ chosen to ensure that the minimum of the estimated spectral density is greater than or equal to $\epsilon\hat\gamma_0/(2\pi n^\beta)$. The resulting estimator $\hat f^{\,*}_{io}(\omega)$ is positive while maintaining the same fast asymptotic rate of convergence as $\hat f_{io}(\omega)$. Note that, by construction, $\hat f^{\,*}_{io}(\omega) \ge \epsilon\hat\gamma_0/(2\pi n^\beta)$ for all $\omega$. By analogy to Section 4.3, the estimator $\hat f^{\,*}_{io}(\omega)$ has no need for rescaling as it maintains the same area under the curve as $\hat f_{io}(\omega)$, and therefore is associated with an estimate of var$[X_t]$ given by $\tilde\gamma_0 = \hat\gamma_0$.
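The spectral-domain version of the white-noise shrinkage can be sketched as follows; the grid-based minimum and the trapezoidal taper (with $l \ge 1$) are illustrative simplifications.

```python
import numpy as np

def flat_top_spectral_estimate(x, l, grid=512):
    """f_hat(w) = (1/2pi) * [g0 + 2 * sum_k kappa(k/l) g_k cos(k w)] on a frequency grid."""
    n = len(x)
    xc = x - x.mean()
    g = np.array([xc[:n - k] @ xc[k:] / n for k in range(n)])
    gk = np.clip(2.0 - np.arange(n) / l, 0.0, 1.0) * g          # trapezoidal flat-top weights
    omega = np.linspace(-np.pi, np.pi, grid)
    f = (gk[0] + 2.0 * np.cos(np.outer(omega, np.arange(1, n))) @ gk[1:]) / (2 * np.pi)
    return omega, f, g[0]

def shrink_spectrum_to_white_noise(f, g0, n, eps=1.0, beta=1.0):
    """Shrink f towards the flat spectrum g0/(2 pi) until min f* >= eps*g0/(2 pi n^beta)."""
    target = g0 / (2 * np.pi)
    floor = eps * g0 / (2 * np.pi * n ** beta)
    fmin = f.min()
    s = 1.0 if fmin >= floor else (target - floor) / (target - fmin)
    return s * (f - target) + target

rng = np.random.default_rng(4)
x = rng.standard_normal(200)
omega, f_hat, g0 = flat_top_spectral_estimate(x, l=4)
f_star = shrink_spectrum_to_white_noise(f_hat, g0, len(x))
print(f_hat.min(), f_star.min())      # the corrected estimate respects the floor
```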
The reason that both shrinking towards white noise and shrinking towards a 2nd order estimate work well-asymptotically and in finite samples-is explained in the following remark.
Remark 14. Spectral estimators such as $\hat f_{io}$ and $\hat f_{2o}$ can alternatively be expressed as weighted local averages of the periodogram (see Brockwell and Davis, 1991). Since the periodogram is (approximately) unbiased, the bias in spectral estimation is due to the local averaging that, in effect, "trims the hills, and fills the valleys". The fact that $\hat f_{io}(\omega)$ is less biased than $\hat f_{2o}(\omega)$ implies that $\hat f_{io}(\omega)$ can follow "the hills and the valleys" better than $\hat f_{2o}(\omega)$. In that sense, shrinking $\hat f_{io}(\omega)$ towards the spectral density of a white noise is tantamount to shrinking $\hat f_{io}(\omega)$ towards $\hat f_{2o}(\omega)$ for all $\omega \in [-\pi, \pi]$; the goal of shrinkage towards either target is a flatter version of $\hat f_{io}(\omega)$. Of course these targets are not meant to be achieved, just to give a general direction for the correction.

A.5. Numerical illustrations
Although asymptotically negligible, the corrections discussed in Sections A.1-A.4 can dramatically improve finite sample performance. Figure 1 provides an illustration using a dataset simulated from the ARMA(2,1) model $X_t - 0.7X_{t-1} + 0.5X_{t-2} = \epsilon_t - 0.3\epsilon_{t-1}$ with $n = 100$. Notably, this was not a dataset selected at random; it was chosen among many realizations of datasets from this ARMA model because, for this particular dataset, $\hat f_{io}$ behaves poorly at $\omega = 0$, necessitating substantial correction.

In addition, we tried a formal simulation experiment to compare the various corrections to positive definiteness in spectral density estimation using different AR(1) and MA(1) models, as well as the aforementioned ARMA(2,1) model of Figure 1. For each simulated dataset of size $n = 200$, we estimated the spectral density using the uncorrected infinite order estimate and the methods described in Sections A.1-A.4. The thresholds for correction were the same as those used in the corresponding autocovariance matrix simulations.
Mean integrated square errors are shown in Table 8. All of the new correction methods show substantial improvement over $\hat f_{io}$. As in the matrix estimation set-up of Section 5.5, shrinkage towards white noise and towards a 2nd order estimator appear particularly powerful. Shrinkage towards white noise seems to perform better for MA processes, while shrinkage towards the 2nd order estimator appears to have a small edge for AR processes.