Shrinkage estimation with a matrix loss function

Consider estimating the n by p matrix of means of an n by p matrix of independent normally distributed observations with constant variance, where the performance of an estimator is judged using a p by p matrix quadratic error loss function. A matrix version of the James-Stein estimator is proposed, depending on a tuning constant. It is shown to dominate the usual maximum likelihood estimator for some choices of of the tuning constant when n is greater than or equal to 3. This result also extends to other shrinkage estimators and settings.


Introduction
Shrinkage estimators are usually set in the context of vector data. If x(n × 1) is a random vector with mean θ, then a shrinkage estimator of θ takes the formθ a = x − ag(x; u) (1.1) where a > 0 is a tuning parameter, and g(x, u) (n×1) is a "shrinkage function", depending on the data x, and possibly on extra information in an auxiliary random variable (or random vector) u. Let F (x, u) denote the joint distribution of x and u, depending on θ. The classic James-Stein estimator (Stein, 1956;James and Stein, 1961) is a special case in the setting x ∼ N n (θ, σ 2 I n ), n ≥ 3. When σ 2 is known, the shrinkage function is given by When σ 2 is unknown, the shrinkage function is given by where u ∼ σ 2 χ 2 ν is an auxiliary random variable independent of x which is used to estimate σ 2 .
The objective in shrinkage estimation is to estimate the vector parameter θ, where the performance of an estimatorθ =θ(x) is judged by the scalar loss function In order to guarantee that the shrinkage estimator dominates the simple unbiased estimatorθ 0 = x, the usual strategy is to demonstrate the "crossproduct inequality" for all θ, where g = g(x, u) is a function of the random vector x (and of u when present). The last inequality has been included to ensure that g is nontrivial. Throughout the paper we assume that x and g(x, u) have finite second moments. Then the following well-known result holds.
Theorem 1. Let x(n × 1) be a random vector and u be an auxiliary random variable such that E(x) = θ under a probability model F depending on θ. Also suppose there exists a shrinkage function g = g(x, u) such that the cross-product inequality (1.5) holds. Then the shrinkage estimatorθ a in (1.1) dominates the simple estimatorθ 0 = x under the scalar loss function (1.4) provided the tuning parameter a satisfies 0 < a < 2.
For the James-Stein estimator, Stein's Lemma (Stein, 1981) states that the cross-product inequality for known σ 2 holds for n ≥ 3 and is actually an equality. That is, if x ∼ N n (θ, σ 2 I n ), n ≥ 3, then (1.7) where A = A(λ 2 ) depends on λ 2 = θ T θ/σ 2 and 0 < A < ∞. Stein's Lemma can be proved using integration by parts (e.g. Efron and Morris (1976) or Stein (1981)). An equality also holds in the analogue of (1.7) for the unknown σ 2 case since E(u) = νσ 2 , E(u 2 ) = ν(ν + 2)σ 2 in (1.3). Hence Theorem 1 for the James-Stein estimator, in both the known and unknown σ 2 cases, can be strengthened to conclude that the optimal value of the tuning constant is a = 1, uniformly over all θ. The purpose of this paper is to extend James-Stein and other shrinkage estimators to a matrix setting where the data take the form of an n × p matrix X with mean E(X) = Θ. The objective is to estimate Θ using a p × p matrix quadratic loss function, (1.8) and associated risk function R matrix (Θ, Θ) = E{L matrix (Θ, Θ)}. Note that this loss function pools errors across the n rows, but treats the p columns separately. Hence we look for an estimator which shrinks across the n rows, but does not shrink across columns.
Other authors have considered the use of shrinkage methods in a matrix setting; see, e.g. Morris (1972, 1976); Haff (1977); Zheng (1986); Ghosh and Shieh (1991); Tsukuma and Kubokawa (2007); Tsukuma (2009). However, these papers use a scalar squared error loss function and so are not directly relevant here. There seems to be little work focused on a matrix loss function.
Section 2 states the main result in the matrix setting, with some discussion given in Section 3.

Matrix data
Suppose the data take the form of an n × p matrix X, plus auxiliary random variables u = (u 1 , . . . , u p ) T , when present. Let Θ = E F (X) denote the n × p matrix of means, where F denotes the joint distribution of X and u. The objective is to estimate Θ under the p × p matrix quadratic loss function (1.8). Let x (j) denote the jth column of X.
Suppose that for each column j = 1, . . . , p, there is a shrinkage function g (j) = g (j) (x (j) , u j ). A natural estimator is the "diagonal shrinkage estimator", defined by applying the vector shrinkage estimator separately to each column of X. That is, defineΘ a =Θ a (X) in terms of its columnsθ a,(j) bŷ using (1.1). Note that the shrinkage applied to each column does not depend on the data in other columns. We use the term "diagonal" because in the setting (1.2) the estimator can also be written in matrix form using a diagonal matrix, Given two estimatorsΘ (1) andΘ (2) depending on X, say thatΘ (1) strictly dominatesΘ (2) if R matrix (Θ (1) , Θ) < R matrix (Θ (2) , Θ) for all Θ, where "<" means that the difference between the right-and left-hand sides is a positive-definite matrix. The following theorem is the main result of this paper.
Theorem 2. Let X(n × p) be a random matrix and u = (u 1 , . . . , u p ) T be a vector of auxiliary random variables such that E F (X) = Θ under a probability model F depending on Θ, and the data {x (j) , u j } are independent for different j. Suppose there exist shrinkage functions g (j) = g (j) (x (j) , u j ) such that the cross-product inequality (1.5) holds for each j = 1, . . . , p. Then the shrinkage estimatorΘ a in (2.1) dominates the simple estimatorΘ 0 = X under the matrix loss function (1.8) provided the tuning parameter a satisfies 0 < a < 2/p.
Proof. The proof makes use of the following inequality, where α is a p × 1 vector and G is an n × p matrix with columns g (j) , j = 1, . . . , p, p j,k=1 (2. 2) The two inequalities follow from two versions of the Cauchy-Schwarz inequality.. To showΘ a dominatesΘ 0 = X for a particular choice of a, we need to show that Equivalently we need to show that α T Rα < nα T I n α = n for all Θ, (2.3) where R = R matrix (Θ a , Θ) and α is an arbitrary standardized p-dimensional vector, α T α = 1. The left-hand side of (2.3) can be written as p j,k=1 ), so δ j ≥ γ j > 0. In going from the second to the third line of (2.4) notice that many of the off-diagonal terms vanish because the different columns are independent and E(x (j) − θ (j) ) = 0. The fourth line follows from the third line by the Cauchy-Schwarz based inequality (2.2). The last line follows from the fifth line by simple properties of quadratic functions.
Comments (a) The allowable interval for a decreases with p. This property is related to the result that for a matrix loss function, it is harder to dominate the maximum likelihood estimator than for a scalar loss function. (b) For the James-Stein case, the p-dimensional result is less powerful than the one-dimensional result. In one dimension a = 1 is optimal;θ 1 dominatesθ a for all other choices of a. In contrast, if p > 1 there is no single choice of a forΘ a which dominates all other choices. (c) Further, at least for the James-Stein case, the interval (0, 2/p) is the best possible interval for a. If a < 0 or a > 2/p, it is possible to find values of Θ such thatΘ a does not dominateΘ 0 .
Here is a simple construction in the case of known variance σ 2 = 1. Recall x ij ∼ N(θ ij , 1) independently for i = 1, . . . , n, j = 1, . . . , p. Let α j = 1/ √ p, j = 1, . . . , p. Let θ * be a n-vector of unit size, θ * T θ * = 1, and suppose all of the columns of Θ are equal to the same multiple of θ * , θ (j) = κθ * . For large κ it is straightforward to show that for all j, where m = n − 2. Further (2.2) becomes an equality in this setting so that the risk in (2.3) reduces to Ignoring the remainder term, the quadratic function of a in (2.5) is less than n for 0 < a < 2/p and exceeds n for a < 0 or a > 2/p. Hence for any specific choice of a < 0 or a > 2/p, α T Rα > n for sufficiently large κ.
The same argument works for the case of unknown σ 2 . (d) In the vector case, if the shrinkage function g is re-scaled to cg for some constant c > 0, then the cross-product inequality needs minor adjustment and the allowable interval for the tuning parameter a changes from (0, 2) to (0, 2/c). The scaling convention for the cross-product inequality chosen in this paper has been made to make the treatment of different columns as consistent as possible in the extension to the matrix case. (e) Efron and Morris (1972) proposed the "matrix" James-Stein estimator and investigated its properties under the scalar loss function (1.4). However, its properties under the matrix loss function (1.8) are unknown.

Discussion
For the classic vector James-Stein estimator there are several ingredients in the formulation of the problem and the estimator such as the following: (a) normality of the data, (b) uncorrelated components, (c) the specific choice (1.2) for the shrinkage function g, and (d) the assumption that the range of possible values for θ spans all of R n .
Each of these ingredients can be relaxed, either individually or in combination. Here are some examples.
(a) relax normality to (i) more general spherical distributions (Brandwein and Strawderman, 1991;Cellier and Fourdrinier, 1995) or (ii) independent components (Shinozaki, 1984); (b) allow correlated normal or more general elliptic distributions (Fourdrinier, Strawderman and Wells, 2006); (c) use other shrinkage estimators such as (i) subspace shrinkage, or more generally (ii) Bayes or generalized Bayes estimators based on superharmonic prior distributions (Stein, 1981); (d) relax the range of possible values for θ from all of R n to a specified cone (Fourdrinier, Strawderman and Wells, 2006).
In each case the improved performance of the shrinkage estimator is justified by a version of the cross-product inequality. Hence in each case there is an immediate extension to the matrix case.
Another direction in which the paper might be extended is to allow dependence between the columns. At least in the normal case with a known p × p covariance matrix Σ, it is possible to adapt the results of this paper.
Thus let X(n × p) follow an np-dimensional normal distribution with mean E(X) = Θ, with independent rows and with common covariance matrix Σ within each row. Let A be a matrix square root of Σ −1 , so that AA T = Σ −1 . Then Y = XA has independent columns. Hence the methodology of Section 2 can be applied to Y to yield an estimatorΦ a of Φ = ΘA. Back-transforming yields an estimatorΘ a =Φ a A −1 which dominatesΘ 0 in the matrix sense (1.8), provided 0 < a < 2.
It is not clear to what extent these results carry over when Σ needs to be estimated. Further, note that A is only defined up to a multiplication on the left by a p × p orthogonal matrix. Thus the methodology of Section 2 defines a whole family of estimators, each with the same statistical properties. It is not clear whether it might be possible to combine them in some way to yield a superior estimator.