Adaptive Higher-order Spectral Estimators

Many applications involve estimation of a signal matrix from a noisy data matrix. In such cases, it has been observed that estimators that shrink or truncate the singular values of the data matrix perform well when the signal matrix has approximately low rank. In this article, we generalize this approach to the estimation of a tensor of parameters from noisy tensor data. We develop new classes of estimators that shrink or threshold the mode-specific singular values from the higher-order singular value decomposition. These classes of estimators are indexed by tuning parameters, which we adaptively choose from the data by minimizing Stein's unbiased risk estimate. In particular, this procedure provides a way to estimate the multilinear rank of the underlying signal tensor. Using simulation studies under a variety of conditions, we show that our estimators perform well when the mean tensor has approximately low multilinear rank, and perform competitively when the signal tensor does not have approximately low multilinear rank. We illustrate the use of these methods in an application to multivariate relational data.

Often, a tensor is corrupted by noise. The model we consider for this is: ...,i K ] ∼ N (0, τ 2 ) independent for i k = 1, . . . , p k , and k = 1, . . . , K, where Θ ∈ R p 1 ×···×p K is the signal and E ∈ R p 1 ×···×p K is the additive Gaussian measurement error or noise with mean 0 and various τ 2 . The performance of an estimator t(X ) ∈ R p 1 ×···×p K can be evaluated by statistical risk under quadratic loss, i.e. mean squared error (MSE): where i = (i 1 , . . . , i K ) is a K-tuple of tensor indices.
In the matrix variate case, X ∈ R p×n , an investigator often believes that the mean is well approximated by a low rank matrix. There has been much work on "denoising" (or mean estimation) in matrix variate data by using this knowledge. A typical estimation scheme begins by computing the singular value decomposition (SVD) of X: where, in the case n ≥ p, U ∈ R p×p is orthogonal, D = diag(σ 1 , . . . , σ p ) with σ 1 ≥ . . . ≥ σ p ≥ 0, and V ∈ R n×p contains orthonormal columns. The columns of U and V are, respectively, the left and right singular vectors of X and the diagonal elements of D are the singular values. A key property of the SVD is that the number of non-zero singular values of X is precisely the rank of X. One widely studied approach to estimating Θ when it is assumed that Θ has nearly low rank is to shrink the singular values of X towards 0 while keeping the singular vectors unchanged, thereby inducing an (approximately) low rank estimate. The resulting "spectral" estimator t(X ) of Θ then takes the form t(X ) = U f (D)V T where f (D) = diag(f 1 (σ 1 ), . . . , f K (σ K )) and each f i (·) shrinks the singular values towards 0. These estimators are orthogonally equivariant, meaning that t(W XZ T ) = W t(X)Z T for orthogonal matrices W, Z [Shabalin and Nobel, 2013]. Early work on singular value shrinkage estimation from a non-statistical perspective began with Eckart and Young [1936], where they proved that the best rank r approximation to the data matrix X ∈ R p×n (in terms of sum of squared differences from X) is found with the shrinkage function: where 1(·) is the indicator function. We call (4) the truncation estimator. However, approximating the data X well is not the same as estimating the underlying signal Θ well. In terms of estimating Θ, the matrix X is unbiased, minimax, and the maximum likelihood estimator under normally distributed errors. However, it is well known that shrinkage estimators, such at that of Stein [1981] can uniformly dominate X in terms of risk. This seminal shrinkage estimator, in the context of matrix estimation, is given by where λ > 0 is some tuning parameter. For data that exhibit associations between the rows and/or columns of the mean matrix, the estimator of Efron and Morris [1972a], given by was introduced and results in different amounts of shrinkage for each singular value. Efron and Morris [1976] improved upon this estimator with a generalization of both (5) and (6), given by where λ > 0 and γ > 0 are tuning parameters.
More recent work has focused on estimators whose functions f i (·) induce sparsity in the singular values, which may be more appropriate than (5), (6), and (7) in cases where the true signal itself has (approximately) low rank. Motivated by penalized maximum likelihood estimation, the hardthresholding estimator and the soft-thresholding estimator were introduced [Candès et al., 2013, for example]. Here, (y) + = max(y, 0) is the "positive part" function. A clever shrinkage function that includes (8), (9), and a truncated version of (6) [Verbanck et al., 2015] as special cases is that of Josse and Sardy [2015]: This estimator was inspired by the adaptive LASSO [Zou, 2006]. A variety of other shrinkage estimators have also been developed [Nadakuditi, 2014, Shabalin andNobel, 2013]. All of these estimators are specific to matrix-variate data. If one were to apply these matrix methods to a tensor, one would first convert the tensor into a matrix. For a K-dimensional tensor, such "matricization" destroys the indexing structure along all but one of the dimensions. This may be detrimental to estimation if, in addition to a data set having approximately low rank, it also has approximately low multilinear rank (see Section 2), that is, "matricizing" along each index set, or "mode", results in a low rank matrix.
An extreme simulated example that exhibits this phenomenon is presented in Figure 1. There, we plotted the mode-specific singular values of a tensor that we generated to have full rank along one mode and low ranks along two modes. That is, we plotted the singular values of each matricization of the tensor. If an analyst were presented with a noisy version of this tensor and only matricizing along the first mode, then they would only observe a noisy realization of the solid lines, which would suggest the data are full rank. However, the second and third modes have low-rank structure and shrinking the singular values along these additional modes may improve estimation.
In this article, we introduce a family of estimators that shrink tensor-valued data towards having (approximately) low multilinear rank. We perform this shrinkage on a reparameterization of the higher-order singular value decomposition (HOSVD) of De Lathauwer et al. [2000], where we shrink the mode-specific singular values of the data tensor towards zero. We consider classes of such "higher-order spectral estimators", where a class is defined by a mode-specific shrinkage function indexed by a tuning parameter. We propose to adaptively select the tuning parameters by minimization of an unbiased estimate of the risk.
Our paper is organized as follows. In Section 2, we review tensors and the HOSVD. We then present how one may define functions that shrink the mode-specific singular values of the HOSVD. In particular, we present two specific estimators that shrink the data tensor towards having (approximately) low multilinear rank and provide some discussion on the intuition behind these estimators.
In Section 3, we review Stein's unbiased risk estimates (SURE), then derive the SURE for a broad class of higher-order spectral estimators. In Section 4 we present simulations demonstrating that (1) tensor specific methods perform better when the mean tensor has approximately low multilinear rank; (2) when the mean tensor has low multilinear rank our methods accurately estimate the multilinear rank; and (3) tensor specific methods perform competitively when the signal tensor does not have approximately low multilinear rank. In Section 5 we illustrate the use of these methods in an application to multivariate relational data. We finish with a discussion in Section 6.
2 The higher-order SVD and higher-order spectral estimators Some tensor data sets have approximately low multilinear rank, which we now define. Recall that the rank of a matrix is the dimension of the vector space spanned by its columns and rows. Define the k-mode vectors of a tensor X ∈ R p 1 ×···×p K as the p k -dimensional vectors formed from X by varying i k and keeping the other indices fixed. The k-mode rank r k is the dimension of the span of the k-mode vectors, and the multilinear rank of the K-order tensor X is the K-tuple, (r 1 , . . . , r K ). Define the k-mode matricization [Kolda and Bader, 2009], or k-mode unfolding, of X to be X Then, equivalently, r k is the rank of X (k) .
The SVD , presented in Section 1, has been used to shrink matrix valued data towards low rank. One generalization of the SVD to tensors is the HOSVD of De Lathauwer et al. [2000], which relates directly to multilinear rank.
Definition 1 (HOSVD of De Lathauwer et al. [2000]). Let X (k) = U k D k V T k be the SVD of each k-mode unfolding of X . Let S = (U T 1 , . . . , U T K ) · X , then is the higher-order singular value decomposition (HOSVD).
The product "·" in (11) between a list of matrices, {U 1 , . . . , U K } for U k ∈ R p k ×p k , and a tensor, S ∈ R p 1 ×···×p K is called the Tucker product. The Tucker product is defined through the k-mode matricizations of (U 1 , . . . , U K ) · S: where "⊗" is the Kronecker product. The "core array", S has the property of all-orthogonality where The HOSVD is multilinear rank-revealing in the same way the SVD is rank-revealing. That is, let D k = (S (k) S T (k) ) 1/2 = diag(σ k 1 , . . . , σ k p k ) be the mode specific singular values of X . Then the multilinear rank of X is (r 1 , . . . , r K ) if D k contains r k non-zero mode-specific singular values. In the core array, this is equivalent to S containing zeros everywhere except in one of the "corners": S [1:r 1 ,...,1:r K ] , where 1 : r k = 1, . . . , r k . It is possible, then, to shrink S towards having (approximately) low multilinear rank by shrinking the elements in S towards 0. We propose doing this via a re-parameterization of S, given as follows: where S = (D 1 , . . . , D K ) · V. Our higher-order spectral estimators shrink S by shrinking each mode-specific D k . We abuse notation a little by allowing "·" to also represent a binary operator between two lists of matrices whose operation is component-wise multiplication. This should not cause confusion because (A 1 B 1 , . . . , Using reparameterization (12), we now define higher-order spectral estimators of Θ under the model (1).
Each of the matrix shrinkage functions listed in Section 1 (4)-(10) may, in principle, be applied to each mode in our higher-order spectral estimator (13). We focus on two examples of higher-order spectral estimators. One of these is a generalization of the matrix truncation estimator (4) and the other is a generalization of the matrix soft-thresholding estimator (9). The former can be used to choose the multilinear rank of Θ, the latter is for estimation of Θ when we suspect that the mean tensor has approximately low multilinear rank.
Example: Truncated HOSVD to find the multilinear rank. The first step in many tensor applications is to choose the multilinear rank of the underlying signal, a difficult task [Timmerman and Kiers, 2000, Kiers and Kinderen, 2003, Ceulemans and Kiers, 2006. The methods in this paper present a way to choose the multilinear rank. The truncated HOSVD is one popular way to induce low multilinear rank [De Lathauwer et al., 2000]. Given multilinear rank (r 1 , . . . , r K ), it is found by taking the HOSVD (11) and setting all elements in S except the "corner" S [1:r 1 ,...,1:r K ] to 0. The truncated HOSVD may be viewed as a higher-order spectral estimator (13), where This sets to 0 all but r k of the mode-specific singular values, resulting in an estimate of Θ that has multilinear rank (r 1 , . . . , r K ). The set of all possible multilinear ranks defines a class of reduced rank estimators of Θ. In this paper, we suggest adaptively selecting an estimator from this class by minimizing an unbiased estimate of the risk.
Example: Mode-specific soft-thresholding. Shrinking all of the singular values can generally improve estimation over just truncating the smallest few singular values. A popular form of shrinkage that accomplishes this, a result of nuclear-norm regularization, is the soft-thresholding estimator (9). The second estimator we explore is obtained by applying soft-thresholding to the mode-specific singular values: As with the previous example, the set of (λ 1 , . . . , λ K ) defines a class of estimators. We propose adaptively selecting a member of this class by minimizing an unbiased estimate of the risk. A few words are in order about the mode-specific soft-thresholding estimator in (15). First, we note that the resulting core array (f 1 (D 1 )D −1 1 , . . . , f K (D K )D −1 K ) · S is not generally all-orthogonal. Hence, the f k (D k ) are not actually the new mode-specific singular values of the estimator t(X ). That is, it would be incorrect to think that subtracting off λ 1 from the first-mode singular values means that the new first-mode singular values are σ 1 i 1 − λ 1 . We are altering the mode-specific singular values, but the relationship is complex. Rather, the proper intuition for shrinkage functions of the form (15) is that the larger the value of λ k , the more dispersed the resulting mode-specific singular values tend to be on a normalized scale. Likewise, the more negative the value of λ k to the singular values the less dispersed the resulting mode-specific singular values tend to be. To gain intuition regarding this phenomenon, we provide an extreme case. We generated a 10 × 10 × 10 tensor where each mode had approximately the same singular values. The first-mode specific singular values were (947,873,844,801,746,698,675,597,524,472). We applied the mode specific soft-thresholding function (15) to each mode with λ 1 = 500, λ 2 = 0, λ 3 = −10000. We then calculated the mode-specific singular values of the resulting tensor and compared these to the original mode-specific singular values, scaled to sum to one. The comparisons can be found in Figure 2. The changed (and normalized) singular values are more dispersed for the first mode, remain relatively unchanged for the second, and are less dispersed for the third.
We have found that we can improve performance (with respect to MSE) by adding an overall scale tuning parameter. That is, we consider a shrinkage estimator of the form: where c > 0 is the overall scale parameter,

Stein's unbiased risk estimate
Both shrinkage function (14) and (16) define classes of estimators, indexed by tuning parameters. Ideally, we would like to choose these tuning parameters by minimizing the risk (2). However, because the mean Θ is unknown, minimization of (2) with respect to the tuning parameters is not possible. One approach for selecting an estimator from one of these classes is to minimize a risk estimate that does not depend on the unknown parameter. One such estimate is Stein's unbiased risk estimate: Theorem 1 (Stein [1981]). Under the model (1), suppose t : Then where div(·) is the divergence of t(·). We denote Stein's unbiased risk estimate (SURE) as "Almost differentiable" basically means differentiable everywhere except on a set of Lebesgue measure zero [Stein, 1981, Definition 1]. Because the SURE (18) does not depend on the parameter values Θ, we can minimize the SURE and use this minimization as a proxy for minimizing the risk. In many cases, adaptive estimators obtained by minimizing SURE over a class of estimators yields improved risk performance, as was observed by Candès et al. [2013] in the matrix case.
The difficult part of (18) is calculating the divergence. We will spend the next two subsections performing this task. First, we will calculate the differentials for the elements of the altered HOSVD (12) in Subsection 3.1. Then we will use these differentials to derive the divergence of estimators of the form (13) in Subsection 3.2. This divergence can then be inserted into (18) to obtain the SURE.

Differentials of the HOSVD
In this subsection, we calculate the differentials for the elements in the altered HOSVD (12). In what follows, we will assume that X has full multilinear rank. Given that p k ≤ p/p k for all k = 1, . . . , K, where p = K k=1 p k , this rank condition is fulfilled almost surely for data X that have a p.d.f. that is absolutely continuous with respect to Lebesgue measure on R p 1 ×···×p K [de Silva and Lim, 2008, Proposition 7.2].
An outline of the derivation is as follows: Because each U k and D k from the HOSVD is from the SVD of X (k) = U k D k V T k , the calculation begins by recognizing that the differentials of the U k 's and the D k 's are the same as in the matrix case. The differentials can then be re-written as functions of the terms in the HOSVD. To obtain the differential of V, we write X = (U 1 , . . . , U K )·(D 1 , . . . , D K )·V and apply the chain rule to each U k , each D k , then to V. We then solve for the differential of V, which may be written in terms of the differentials of the U k 's and the D k 's.
Proof of Theorem 2. Denote the differential of a function g at X with increment ∆ as dg [∆]. Since U k and D k are the left singular vectors and the singular values, respectively, of X (k) for each k = 1, . . . , K, the differentials, dU k [∆] and dD k [∆], are the same as in Candès et al. [2013] and have a closed form solution, given by where Candès et al. [2013] to get (19). Let . Then from (4.8) of Candès et al. [2013] we have and so We now derive dV [∆]. Let U = (U 1 , . . . , U K ) and D = (D 1 , . . . , D K ). Also note that dX [∆] = ∆. Using the chain rule, and following Chapter 8, Section 1, Equations (15) and (16) of Magnus and Neudecker [1999] for the differential of matrix multiplication and the Kronecker product, we have where From (22), we solve for dV[∆] and have where

Divergence of higher-order spectral estimators
In this section, we show that the divergence of higher-order spectral estimators of the form (13) can be found in the following theorem.
Theorem 3. The divergence of estimators of the form (13) is where Sum(A) is the sum of all elements in the tensor A, and C ∈ R p 1 ×···×p K such that where • is the outer product and U k[:,i k ] is the i k th column of U k . Note that where E i is the p 1 × · · · × p K array with a one in position (i 1 , . . . , i K ) and zeros everywhere else. Similar to the arguments of Candès et al. [2013], also note that ∆ i forms an orthonormal basis for R p 1 ×···×p K , and so where , is the usual Euclidean inner product. From the chain rule, we have: where "•" now means composition. Hence, where The outline of the derivation of the divergence is as follows. The ultimate goal is to obtain the (32) and plug that into (31). We will first calculate all of the differentials that are in (32), then we will determine the (i 1 , . . Then we will simplify (31). These latter two steps may be found in Appendix A. We begin with the differentials. From (19), we have Similarly, from (20), we have Also, from the chain rule, we have that We have just completed all of the calculus necessary to obtain the divergence, and the remainder of the calculation is simplification. That is, we can use equations (25), (31), (32), (34), (36), and (37) to calculate a closed-form expression for the divergence. This simplification is relegated to Appendix A.
We now present the formula for the SURE for all higher-order spectral estimators of the form (13): Theorem 4 (SURE for (13)). Under the model (1), suppose t(·) in (13) is almost differentiable and for which (17) holds. Then This SURE formula is applicable for all shrinkage functions of the form (13) where f k (D k ) = diag(f k 1 (σ k 1 ), . . . , f k p k (σ k p k )). For such shrinkage functions, the shrinkage being applied to each singular value is a function only of that singular value. However, it is possible to construct estimators which use all of the mode k singular values to shrink each mode k singular value, e.g. if we were to use a shrinkage function analogous to those of (5) or (7). For such estimators, we prove in Appendix C that the form of the divergence is very similar as in (28). The only difference is that one replaces d . That is, for such shrinkage functions, df k (D k ) is a diagonal matrix containing only the diagonal of the Jacobian matrix of the transformation diag(D k ) → diag(f (D k )).

Simulation studies
In this section, we consider four competitors to the mode-specific soft-thresholding estimator (16) and the truncated HOSVD (14). We will compare these estimators assuming the error variance τ 2 is one. The first competitor is X , which is the maximum likelihood estimator and the uniformly minimum variance unbiased estimator. However, the risk-performance of this estimator is known to be dominated by our second competitor, the James-Stein estimator (5) [Stein, 1981]. This estimator may be derived from an empirical Bayes argument where Θ [i] ∼ N (0, γ 2 ) [Efron and Morris, 1972b]. As such, it should perform well when the entries of Θ are centered about 0. For a matrix parameter Θ, Efron and Morris [1972a] developed an empirical Bayes estimator that performs better than the James-Stein estimator when Θ exhibits empirical correlation along the rows. With this in mind, our third estimator is obtained by applying the Efron-Morris estimator (6) to the first mode matricization of the data tensor. However, the Efron-Morris estimator does not induce low rank estimates, and so our fourth and final competitor is the matrix soft-thresholding estimator (9) applied to the first mode matricization of X , and whose tuning parameter is chosen with the SURE formula from Candès et al. [2013]. This estimator should improve on the Efron-Morris estimator when Θ (1) has approximately low rank.
For each scenario, we re-scaled Θ to have Frobenius norm √ 1000, so that E[||E|| 2 ] = 1000 = ||Θ|| 2 . For each Θ, we simulated X [i] ∼ N (Θ [i] , 1), calculated the six estimators given this data tensor, and calculated the squared error loss for each estimator. We repeated this process 500 times. Box plots of the losses for each of the six Θ values are given in Figure 3.
The James-Stein estimator (5) is expected to perform well in Scenario A as it can be viewed as an empirical Bayes procedure for the prior with which Θ was actually generated. Indeed, from Figure 3 (A), the James-Stein estimator does perform best, but the mode-specific soft-thresholding estimator performs almost as well, even though there is no correlation along any of the modes of the mean tensor.
For scenario B, we expect the matrix soft-thresholding estimator (9) to do well. Since the mean tensor in this scenario has approximately low rank only along the first mode, estimators that shrink towards the space of low multilinear rank tensors should be over-fitting and should not perform well. From Figure 3 (B), the matrix soft-thresholding estimator does perform best, but surprisingly, the mode-specific soft-thresholding estimator does equally well.
For Scenario C, we expect the matrix soft-thresholding estimator (9) and the Efron-Morris estimator (6) to perform well. There is temporal correlation along one of the modes of the mean tensor. We take into account the temporal correlation of the mean by performing soft-thresholding along this mode. However, from Figure 3 (C), we see that the mode-specific soft-thresholding estimator performed best.
The matrix soft-thresholding estimator (9) was designed to do well when the mean matrix is of low rank. This is exactly the situation in Scenario D, as a tensor with low rank along one mode may be matricized to form a low rank matrix. However, from Figure 3 (D), for our one Θ value, the mode-specific soft-thresholding estimator performs best.
As for Scenario E, we expect the mode-specific soft-thresholding estimator (16) to do well, as the mean tensor has approximately low multilinear rank, but it is not exactly low multilinear rank. Figure 3 (E) reveals the mode-specific soft-thresholding estimator does indeed perform better than the other estimators.
We expect the truncated HOSVD (14) to do well in Scenario F because the mean tensor has low multilinear rank, and the truncated HOSVD is correctly shrinking toward this structure. From Figure 3 (F), we see that the truncated HOSVD does indeed perform best in terms of loss. However, the mode-specific soft-thresholding estimator does not perform much worse. The estimators that do not take into account the tensor indexing perform about twice as bad as these tensor-specific estimators.
For scenarios C and D, we emphasize here that we are looking at the risk only at a few points in the parameter space. There are likely points where the matrix-soft thresholding estimator performs better than the tensor estimators. However our mode-specific soft-thresholding estimator did not perform poorly under any of our simulated mean tensors.
Our procedure for the truncated HOSVD produces a multilinear rank with the smallest SURE. It is of interest to know if this multilinear rank provides a good estimate of the true rank of Θ. We evaluated this possibility in simulation Scenarios D and F. In Scenario F, where the tensor had dimension (10, 10, 10) and the true multilinear rank was (5, 5, 5), this SURE method correctly estimated the multilinear rank in 92.6% of trials. In Scenario D, where the true multilinear rank was (5, 10, 10), the results of the simulation study can be found in Table 1. There, we see that the rank of the first mode is correctly estimated in 97% of trials. The rank of the second and third modes are correctly estimated a majority of the time. q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q  Table 1: Proportion of times each rank is estimated based on SURE for each mode over 500 repetitions when the true multilinear rank is (5, 10, 10).

Multivariate relational data example
In this section, we demonstrate the applicability of our estimators to multivariate relational data. Such data may be viewed as a three-way tensor X where entry X [i,j,k] is the value of relation type k from node i to node j. One example of such a data set is a social network in which multiple types of relations are measured between individuals. As another example, in sports statistics, round robin interaction data consist of outcomes of competitions between teams. In this section we illustrate our methods with round robin data from the 2014-2015 regular season of the National Basketball Association (NBA). The NBA consists of a Western conference and an Eastern conference of fifteen teams each, where intra-conference play has three to four games per year per pair of teams and inter-conference play is limited to two games a season per pair of teams. For each conference, we created a four dimensional tensor where element Y [i,j,k, ] is statistic k obtained by team i while playing team j either during team i's first home ( = 1) or first away ( = 2) game against team j during the season. The statistics we considered were free-throw percentage, two-point field goal percentage, and three-point field goal percentage. We thus have two tensors each of dimension 15 × 15 × 3 × 2, one for each of the two conferences. In this section, we illustrate the utility of tensor shrinkage by predicting late season relational basketball statistics from early season data. Our approach is analogous to that of Efron and Morris [1975], who illustrated the utility of vector shrinkage estimation by predicting late season baseball batting averages from data on early season batting averages. The statistics in our data set are all empirical proportions. We model the elements of Y with a binomial model, where all elements are independent, given the p i,j,k, 's. We apply an arc-sin transformation to the data tensor to stabilize the variance: From the central limit theorem, we have approximately where Θ [i,j,k, ] = (n i,j,k, ) 1/2 arcsin(2p i,j,k, − 1), resulting in the model in (1).
A commonly used representation of a mean tensor Θ is an ANOVA decomposition, such as whereΘ [i,j,k, ] contains all of the interaction effects. Note that 1 T p 1 α = 0, 1 T p 2 β = 0, 1 p 3 γ = 0, and 1 T p 4 δ = 0, where 1 p k is the vector of ones of length p k . The tensorΘ also satisfiesΘ (k) 1 p/p k = 0 for all k = 1, 2, 3, 4. Suppose we obtain the maximum likelihood estimates of µ, α, β, γ, and δ by fitting a main-effects ANOVA model. We then calculate the residual tensor, This residual tensor has an expected value ofΘ. It was proposed in Stein [1966] and Efron and Morris [1972a] that we estimate the interaction effectsΘ with a vector shrinkage-type estimator on the residuals. If the interactionsΘ are close to zero -when the interaction effects are smallthen such estimators will adaptively shrink the residuals towards zero. However, these estimators were developed to adapt to patterns in vectors or matrices of residuals, and not tensors of residuals.
In contrast, our approach should be able to adapt to these patterns along any of the four modes of the residual tensor. We applied mode-specific soft-thresholding and the truncated HOSVD to the array of residuals R from the main effects ANOVA model. These methods suggest that the residual tensor should be heavily shrunk both towards zero and towards low multilinear rank structure. For the West, the Frobenius norm of the residual tensor was 38.38, while the Frobenius norm of the resulting shrunken residual tensor using the mode-specific soft-thresholding estimator was 7.81. In the East, the values were 38.95 and 6.97, respectively. We also used SURE to estimate the multilinear rank of each residual tensor using the truncated HOSVD. The estimated multilinear rank of the residual tensor of the Western conference was 2 × 3 × 1 × 2, and for the Eastern conference the estimated multilinear rank was 4 × 2 × 1 × 1. These are very small ranks compared to the dimensions of the tensors 15 × 15 × 3 × 2.
An ad hoc evaluation of the performance of our estimators can be obtained by predicting game statistics after the first home and first away games. Since some teams only play each other three times, we do not have late season data on all possible combinations of team pairs by home versus away games. For the late season data we do have, we present the squared error losses for predicting the statistics of the remaining part of the season for each conference in Table 2. The different estimators are (1) the raw data array X , (2) the mean estimates of the main-effects ANOVA model, (3) the mode-specific soft-thresholding shrunken residual tensor added to the mean estimates of the main-effects ANOVA model, (4) the truncated HOSVD shrunken residual tensor added to the mean estimates of the main-effects ANOVA model, and (5) an estimator derived from logistic regression using the main-effects of each mode. The losses are with respect to the arc-sin transformed data. The poor performance of X is unsurprising. The amount of shrinkage that our estimators produce indicates that the fully saturated model is over-fitting and that most of the information is contained in the main-effects. However, our mode-specific soft-thresholding estimator is also fitting the fully saturated model and it performs comparable to the main-effects ANOVA model, even improving the predictions for the Eastern conference. X 2410 2476  ANOVA 1344 1364  Mode-specific Soft-thresholding 1327 1385  Truncated HOSVD 1391 1451  Logistic Regression 1481 1552   Table 2: Squared error losses when predicting the statistics of the remaining games of the season.

Discussion
This paper introduced new classes of shrinkage estimators for tensor-valued data that are higherorder generalizations of existing matrix spectral estimators. Each class is indexed by tuning parameters whose values we chose by minimizing an unbiased estimate of the risk. In terms of MSE, these estimators outperform their matrix counterparts when the mean has approximately low multilinear rank and they perform competitively when the mean does not have low multilinear rank. There has been some recent work on penalized optimization methods for estimating signal tensors in the presence of Gaussian noise [Signoretto et al., 2010, Tomioka et al., 2011a,b, Liu et al., 2013, Tomioka and Suzuki, 2013. Usually, these estimators are defined as the minimizers of a penalized squared error empirical loss, where the penalty is usually some generalization of the nuclear norm to tensors (for example, the sum of the nuclear norms of the K matricizations of a tensor). These estimators, though similar in spirit, are very different from our approach. The main advantage of our estimators is their simplicity -they are simply functions of the HOSVD (13) for which there are efficient and accurate numerical procedures to compute.
We have presented a way to adaptively choose the tuning parameters of our higher-order spectral estimators by minimizing the SURE. This approach is applicable, not just for the truncated HOSVD (14) and the mode-specific soft-thresholding (16) estimators, but also for all estimators of the form (13) that satisfy the conditions of Theorem 1. Although we found that adaptively choosing the tuning parameters by minimizing the SURE worked well under the scenarios we studied, there are other ways to select tuning parameters. In the case of matrix spectral estimators, others have chosen the amount of shrinkage by minimax considerations Morris, 1972a, Stein, 1981], cross-validation [Bro et al., 2008, Owen and Perry, 2009, Josse and Husson, 2012, and asymptotic considerations [Gavish and Donoho, 2014a,b]. Exploring these methods for our higher-order spectral estimators (13) is a current research area of the authors.
In this paper, we focused on estimators of the form (13). If the mean tensor is believed to have approximately low multilinear rank, we should shrink the core array through the Tucker product along the modes to obtain this low multilinear rank. The form of our higher-order spectral estimators (13) allows us to use the mode-specific singular values to determine the form and amount of shrinkage that should be performed to each mode of the core array. However, different classes of higher-order spectral estimators can be studied. In the Appendix D, we explore functions that shrink each element of the core array individually: This class of estimators can be used, for example, to induce zeros in the core array, which has applications in increasing the interpretability of a higher-order generalization of principal components analysis [Henrion, 1993, Kiers et al., 1997, Murakami et al., 1998, Andersson and Henrion, 1999, De Lathauwer et al., 2001, Martin and Van Loan, 2008.
Although the error variance τ 2 in (1) might be known in some settings, such as fMRI data sets [Candès et al., 2013], in most applied situations the variance would not be unknown. There are matrix-specific estimates of the variance that can be applied to tensor-variate datasets by first matricizing along each mode. In our software, we have implemented the methods described in Choi et al. [2014] and Gavish and Donoho [2014a]. Though, instead of plugging in an estimate of the variance into the SURE formula (18), there has been a recent suggestion to use a generalized SURE formula [Sardy, 2012, Josse andSardy, 2015]: This formula is motivated by generalized cross-validation [Golub et al., 1979] and is an approximation to SURE [Josse and Sardy, 2015]. Importantly, GSURE does not require the variance to be known, and so its minimization may be accomplished without an estimate of τ 2 . For our higherorder spectral estimators, we have already accomplished the hard work of calculating the divergence in this paper, and implementing GSURE is an easy application of this result. Our software allows for GSURE implementation for the estimators discussed in this article.
All methods discussed in this paper are implemented in the R package hose available at https://github.com/dcgerard/hose.
Code and instructions to reproduce all of the results of this paper are available at https://github.com/dcgerard/hose paper/tree/master/reproduce sure.

A Simplification of the divergence
We will need the (i 1 , . . . , i K )th element of U T ·df [∆ i ] in (32). There are three terms in (32). We will deal with them one by one. First, we will work with the first term of (32), Now we work with the second term of (32), K k=1 df (D) k [∆ i ] · V. We have that: since It remains to work with the third term in (32) We have: We now need to obtain dV[∆ i ] [i] . From (25), we have There are three terms in (42). Let us deal with them one by one. The first term in (42) is The second term in (42) is The third term in (42) is To obtain the third term in (32), we need only plug in (43), (44), and (45) into (42). And then we need to plug in (42) into (41).
We will now show that the divergence is of the form: for H k in (29) and C ∈ R p 1 ×···×p K in (30). The term f (D) · D −1 · C is from the first and second parts of (32), whereas the terms K k=1 H k · S 2 are from the second part of (32) and were already derived in (40). Let us find C.
. Ignoring the second term in (32), we have that the sum of the first and third terms in (32) is equal to: After rearranging summands, we obtain: And after factoring out K k=1 (σ k i k ) −1 , we get: That is,

B Details of optimization
We now provide some brief details on our optimization strategy when considering only the modespecific soft-thresholding estimator. Let To update each λ k , we simply apply a general purpose univariate optimizer (e.g. Brent's method [Brent, 1971]). To update c, we have , where we are summing over the set of i k 's such that σ k i k > λ k for k = 1, . . . , K. Then the minimum c occurs at (b − d − e)/a. This is a global minimizer, conditional on the λ k 's, since a > 0.

C General spectral functions
In Section 3.1, we assumed that the spectral functions were of the form: That is, we only used σ k i when determining the amount of shrinkage to perform on σ k i . In this section, we will extend these results to weakly differentiable functions of the form: where D + p k is the space of p k by p k diagonal matrices with non-negative diagonal elements. This will allow us to use σ k 1 , . . . , σ k p k to determine the amount of shrinkage to perform on σ k i . These types of spectral functions might be desirable if, for example, we wished to develop a generalization of estimator (7). Let s k = (σ k 1 , . . . , σ k p k ) T be the vector of the kth mode specific singular values. We look at functions where R p k + is the space of p k vectors with non-negative elements. Then The derivation of the SURE is the same as in Section 3.1 except for the second term in (32): We have: By the chain rule: where J g k (s k ) is the Jacobian matrix of g k evaluated at s k . We know from (37) that So ds k [∆ i ] contains zeros except in the i k th position. Hence Inserting (50) into (49), we get: That is, we only need the (i k , i k )th element of the Jacobian matrix of the spectral function. Let J k (D k ) = diag(J g k (s k ) [1,1] , . . . , J g k (s k ) [p k ,p k ] ) for k = 1, . . . , K. The divergence is now of the form: Sum f (D) · D −1 · C + K k=1 Q k · S 2 .

D SURE for estimators that shrink elements in S
Consider the HOSVD (11). In this section, we will find the SURE for estimators of the form: where (g(S)) [i] = g i (S [i] ).
That is, we shrink each element of S separately. An example of such a function is to soft-threshold each element of S: where sign(x) is −1 of x < 0, 1 if x > 0, and 0 if x = 0. Such a function induces 0's in the core array, which has applications to increasing interpretability of higher-order PCA [Henrion, 1993, Kiers et al., 1997, Murakami et al., 1998, Andersson and Henrion, 1999, De Lathauwer et al., 2001, Martin and Van Loan, 2008. Inducing 0's in the core array is usually performed by applying orthogonal rotations along each mode. Our approach provides an alternative mechanism to induce 0's in the core array.
Theorem 5. The differentials of U k and S are given in equations (21) and (52) where dŨ k [∆] is defined in (33).
The derivation of the divergence for functions of the form (51) is very similar to that in Section 3.2. The divergence may still be found from (31). From the chain rule, we have: where this "•" means composition and dU k [∆ i ] is from (23). Hence, where dŨ k [∆ i ] is from (33), noting that the relationship in (36) still holds. From the chain rule we have: We need the (i 1 , . . . , i K )th element of Note that for any A ∈ R p 1 ×···×p K dŨ k [∆ i ] · A We can rearrange the summations in the left part of (55) by switching the order of the j and the i k and then altering the notation of the dummy variables to obtain: Hence, the SURE for these higher-order spectral functions (51) is: