Evaluating the Contributions of Individual Variables to a Quadratic Form

Summary. Quadratic forms capture multivariate information in a single number, making them useful, for example, in hypothesis testing. When a quadratic form is large and hence interesting, it might be informative to partition the quadratic form into contributions of individual variables. In this paper it is argued that meaningful partitions can be formed, though the precise partition that is determined will depend on the criterion used to select it. An intuitively reasonable criterion is proposed and the partition to which it leads is determined. The partition is based on a transformation that maximises the sum of the correlations between individual variables and the variables to which they transform under a constraint. Properties of the partition, including optimality properties, are examined. The contributions of individual variables to a quadratic form are less clear‐cut when variables are collinear, and forming new variables through rotation can lead to greater transparency. The transformation is adapted so that it has an invariance property under such rotation, whereby the assessed contributions are unchanged for variables that the rotation does not affect directly. Application of the partition to Hotelling's one‐ and two‐sample test statistics, Mahalanobis distance and discriminant analysis is described and illustrated through examples. It is shown that bootstrap confidence intervals for the contributions of individual variables to a partition are readily obtained.


Introduction
Quadratic forms feature as statistics in various multivariate contexts. Well-known examples include Hotelling's T² statistic and the Mahalanobis distance. When the value of a quadratic form is large, an obvious question is: which variables cause it to be large? To illustrate, suppose x is an observation that should come from a distribution with mean μ and variance Σ. However, it appears to come from a different distribution because the Mahalanobis distance, equal to the quadratic form (x − μ)′Σ⁻¹(x − μ), is large. It might be helpful to have a measure of the contribution of individual variables to the size of this quadratic form.
When variables are correlated, it is not immediately apparent that a sensible answer to this question can be given. However, we shall argue that the question can be answered in a meaningful way and we will propose a method of partitioning a quadratic form into contributions from individual variables. This does not imply that there is a "best" way of forming such a partition, other than in some simple situations where arguments of symmetry can be used. However, although a partition of a quadratic form may be arbitrary to a degree, it can still be useful and informative. We show that the partition we propose meets certain optimality criteria.
Our method of forming a partition is based on a transformation that we call the corr-max transformation. Garthwaite, Critchley, Anaya-Izquierdo & Mubwandarikwa (2012) focussed on a transformation referred to as the cos-square transformation, but also proposed a second transformation called the cos-max transformation. The latter is closely related to the corr-max transformation that we introduce here. However, while the cos-max transformation was designed to transform a data matrix, the intended use of the corr-max transformation is the transformation of a random vector. The cos-max transformation adjusts a data matrix by a minimal amount while yielding a matrix with orthonormal columns; each of the original variables is associated with exactly one of these columns. The corr-max transformation yields a vector whose covariance matrix is proportional to the identity matrix, while each of the original variables is associated with exactly one component of the transformed vector. The strength of the associations is measured by correlations and the transformation is chosen to maximise the sum of these correlations (hence our name for the transformation).
Collinearities between variables will reduce the strength of some associations. The variables that are involved in a collinearity can be identified using the cos-max transformation (Garthwaite et al. 2012). The coordinate axes corresponding to these variables can then be rotated to yield a set of variables with little collinearity. We adapt the corr-max transformation so that contributions to the quadratic form, as measured by the partition, will only change for those variables that are affected by the rotation. We refer to this feature as the rotation invariance property.
The task of determining which variables have most influence on a Mahalanobis distance has attracted some attention in the literature. The Mahalanobis-Taguchi system, which features in statistical process control, estimates the covariance matrix from 'normal' data and computes the Mahalanobis distances for a set of 'abnormal' data points, in order to determine signal-to-noise ratios for individual variables and hence identify variables that are useful diagnostics of abnormality (Taguchi & Jugulum 2002; Das & Datta 2007). In ecology, the Mahalanobis distance has been used in the construction of maps that show suitable habitat areas for a particular species. Pixels on the map are equated to points in multidimensional space on the basis of environmental variables and the Mahalanobis distance is used to give a measure of the distance from a point to the mean of the ecological niche. To identify the minimum set of basic habitat requirements for a species, Rotenberry, Preston & Knick (2006) proposed a decomposition of the Mahalanobis distance that exploits the eigenvectors of Σ. Based on work in Rotenberry, Knick & Dunn (2002), they argued that the variables that load heavily on the eigenvectors corresponding to the smallest eigenvalues are the most influential in determining habitat suitability. Calenge, Darmon, Basille & Jullien (2008) added a step to the method of Rotenberry et al. (2006), forming a further eigenvector decomposition with the aim of producing biologically meaningful axes. The decomposition we propose here is simpler to implement and has a straightforward interpretation, making it more likely to be put into practice. Rogers (2015) adopted it as a tool for identifying the key climate variables in determining future changes in the distribution of vector-borne diseases, illustrating its use through application to dengue, an important tropical disease.
He referred to the contributions of variables being measured on the Garthwaite-Koch scale, citing a technical report (Garthwaite & Koch 2013) that forms the basis of the present paper.
In Section 2 we argue that the value of a quadratic form can be meaningfully partitioned into separate contributions of individual variables and give the criteria that determine the corr-max transformation and our proposed partition. In Section 3 we obtain the transformation and the partition. In Section 4 the transformation is adapted to have the rotation invariance property and ways to exploit this property are suggested. In Section 5 we describe use of the partition in contexts where Hotelling's T 2 statistic or Mahalanobis distance arise, and in discriminant problems involving two groups. Bootstrap confidence intervals for the contributions of individual variables can be constructed to quantify uncertainty in these contributions and to increase our insight into the relative importance of these contributions. These ideas are illustrated in Section 6. Concluding comments are given in Section 7.
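As a concrete sketch of the bootstrap idea, the contributions of individual variables can be recomputed on resampled rows of the data and percentile intervals formed. The code below is our own illustration with simulated data; the function name, the simulated values and the use of NumPy are assumptions, not taken from the paper.

```python
import numpy as np

def t2_contributions(data, mu):
    # Per-variable contributions to n (xbar - mu)' S^{-1} (xbar - mu),
    # computed via the corr-max vector w = (D S D)^{-1/2} D (xbar - mu).
    n = len(data)
    s = np.cov(data.T)                          # sample covariance matrix
    d = np.diag(1.0 / np.sqrt(np.diag(s)))      # D: reciprocal standard deviations
    vals, vecs = np.linalg.eigh(d @ s @ d)      # eigendecomposition of D S D
    w = vecs @ np.diag(vals ** -0.5) @ vecs.T @ d @ (data.mean(axis=0) - mu)
    return n * w ** 2                           # contributions sum to T^2

rng = np.random.default_rng(2)
data = rng.normal([0.3, 0.0], 1.0, size=(100, 2))   # hypothetical sample
mu0 = np.zeros(2)

# Percentile bootstrap: resample rows, recompute the contributions
boot = np.array([t2_contributions(data[rng.integers(0, 100, 100)], mu0)
                 for _ in range(999)])
ci = np.percentile(boot, [2.5, 97.5], axis=0)   # 95% interval per variable
```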

Rationale for a partition
Let Q be the quadratic form

Q = (X − μ)′Σ⁻¹(X − μ),                (1)

where X = (X₁,…, Xₘ)′ is an m × 1 random vector whose variance is proportional to Σ and where μ is a given m × 1 vector that is not necessarily the mean of X. This type of quadratic form arises in various applications. For example, in Hotelling's one-sample T² statistic, X would take the value of a sample mean, Σ would be the population variance, and μ would be the hypothesised population mean. The purpose of this paper is to give a method of evaluating the contributions of individual variables to Q. Before doing so, we must first consider whether it is possible, in principle, to meaningfully answer the question, What are the contributions of individual variables to a quadratic form?
Clearly a good answer can easily be given when Σ is the identity matrix: the contribution of each variable is then the square of the corresponding component of x − μ. Extension to the case where Σ is diagonal is obvious. However, if Σ is not diagonal then it is less clear that Q can be partitioned between variables in a meaningful way. To examine this issue, we consider an example.
Specifically, let

Σ⁻¹ = ( 1  0  0
        0  1  ρ
        0  ρ  1 ),   0 < ρ < 1,

and, to aid explanation, suppose the three components of x = (x₁, x₂, x₃)′ correspond to standardised variables: age (x₁), height (x₂) and weight (x₃). In this example the contribution of age (x₁) to Q is always clear, since Q = x₁² + x₂² + x₃² + 2ρx₂x₃. If x₂ = x₃, then height and weight contribute equally to Q, from symmetry. Hence, even though Σ⁻¹ is not diagonal, the contributions of each variable to Q can be determined: age contributes x₁² while height and weight each contribute (Q − x₁²)/2. To expand this example, suppose x₃ is slightly greater in magnitude than x₂. Then the contribution of age to Q would still be x₁² while, in dividing the balance of Q between height and weight, it seems reasonable to give weight slightly the greater portion. Other situations are also readily constructed where common sense can indicate, approximately, the contributions of each variable to Q. Usually though, there will be no partition of Q that is unquestionably better than any alternative. However, it may still be the case that sensible methods of partitioning Q broadly agree on the contributions made by individual variables. We construct a partition that helps interpret the results of some statistical analyses by giving a clearer relationship between the data variables and a test statistic or some other quantity that is based on Q. The transformation that underlies the partition is defined in the next section.
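The common-sense answer above can be checked numerically. In the sketch below (our own illustration; the value ρ = 0.5 and the data vector are hypothetical), the corr-max partition developed later in the paper recovers exactly these contributions:

```python
import numpy as np

rho = 0.5                                   # assumed off-diagonal value, for illustration
sigma_inv = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, rho],
                      [0.0, rho, 1.0]])
sigma = np.linalg.inv(sigma_inv)
x = np.array([0.7, 1.0, 1.0])               # height = weight, hypothetical values

d = np.diag(1.0 / np.sqrt(np.diag(sigma))) # D: reciprocal standard deviations
vals, vecs = np.linalg.eigh(d @ sigma @ d) # eigendecomposition of D Sigma D
w = vecs @ np.diag(vals ** -0.5) @ vecs.T @ d @ x   # corr-max vector
q = x @ sigma_inv @ x

assert np.isclose(w[0] ** 2, x[0] ** 2)     # age contributes x1 squared
assert np.isclose(w[1] ** 2, w[2] ** 2)     # height and weight contribute equally
assert np.isclose(np.sum(w ** 2), q)        # contributions sum to Q
```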
Before ending this section we introduce further notation that will be used in the remainder of the paper. Bold-face italic capital letters X, Y, W, etc. denote m × 1 random vectors. Subscripts are added to denote components of a vector: X = (X₁,…, Xₘ)′, W = (W₁,…, Wₘ)′, etc. The notation Σ̂ is used to denote a generic estimate of the m × m population variance matrix Σ. Likewise Σ̂₁ is used to denote the standard unbiased estimate of Σ given by one sample and Σ̂ₚ is used to denote the standard pooled estimate of Σ based on independent samples from two populations that both have variance Σ. The symbols σᵢ², σ̂ᵢ² are used to denote the ith diagonal entries of Σ and Σ̂, respectively. The symbols D and D̂ are used to denote m × m diagonal matrices with ith diagonal entries equal to σᵢ⁻¹ and σ̂ᵢ⁻¹, respectively (i = 1,…, m). Thus DΣD and D̂Σ̂D̂ have diagonal entries equal to 1. The data matrix whose rows are the n observations is denoted by X = (x₁,…, xₙ)′, an n × m matrix; its ith column contains the data on the ith variable. The symbols A, Â, B, C and H are used to denote m × m matrices and Γ and Γ_d are used to denote m × m and d × d orthogonal matrices, respectively.

The corr-max transformation
To form our partition, we consider transformations of the form

W = A(X − μ),                (2)

where W is an m × 1 vector and

W′W = (X − μ)′Σ⁻¹(X − μ)                (3)

for any value of X. Then Q = ∑ᵢ₌₁ᵐ Wᵢ², where W = (W₁,…, Wₘ)′, so W yields a partition of Q. The partition will be useful and meaningful if (a) the components of W are uncorrelated and have identical variances, and (b) it is reasonable to identify Wᵢ with the ith x-variable, as the contribution of that x-variable to Q can then sensibly be defined as Wᵢ².
The following theorem gives the transformation that maximises ∑ᵢ₌₁ᵐ cor(Xᵢ, Wᵢ) under the constraints that (2) and (3) hold, where cor(·, ·) denotes correlation. Proofs of theorems are given in Appendix A.
Theorem 1. Suppose W = A(X − μ) and var(X) ∝ Σ. If (3) holds for all X, then the components of W are uncorrelated with identical variances. If, in addition, A is chosen to maximise ∑ᵢ₌₁ᵐ cor(Xᵢ, Wᵢ), then

A = (DΣD)^{−1/2} D,

where D is a diagonal matrix such that DΣD has diagonal entries equal to 1.
We define the corr-max transformation to be the transformation given by (2) with A = (DΣD)^{−1/2} D. From Theorem 1, this transformation yields a W that satisfies requirement (a). For (b), we first note that it is always possible to scale and translate Xᵢ so that it has the same variance and the same mean as Wᵢ, whence the degree to which Xᵢ equates to Wᵢ would primarily be determined by its correlation with Wᵢ. (Perfect correlation would imply that they were identical.) Moreover, scaling and translation do not change the nature of a variable. Otherwise, for example, temperature measurements on the Celsius and Fahrenheit scales would not be equivalent. Hence, the degree to which Wᵢ equates to the ith x-variable is largely determined by cor(Xᵢ, Wᵢ). Consequently under a sensible criterion the corr-max transformation satisfies (b) as fully as possible, since it maximises ∑ᵢ₌₁ᵐ cor(Xᵢ, Wᵢ) when the constraint equations (2) and (3) hold. The extent to which the corr-max transformation satisfies (b) is discussed further in Section 7.
Theorem 1 completes the specification of our partition for the case where Σ is known. To summarise, if X takes the value x and var(X) ∝ Σ, the corr-max transformation yields the new vector

w = (DΣD)^{−1/2} D(x − μ)                (5)

and the contribution of the ith x-variable to Q(x) = (x − μ)′Σ⁻¹(x − μ) is defined to be wᵢ² (i = 1,…, m). When Σ is unknown, we replace it in the foregoing method with an estimate, Σ̂ say. In some contexts this type of substitution can have drawbacks but here it seems appropriate, since it yields properties similar to Theorem 1, but in terms of maximising sample correlations, which we denote by corₛ(·, ·), rather than population correlations. This result is given in Theorem 2. Its proof is similar to that of Theorem 1 and is omitted.
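A minimal numerical sketch of this recipe (our own code; the function and variable names are not from the paper):

```python
import numpy as np

def corr_max_partition(x, mu, sigma):
    """Partition Q = (x-mu)' Sigma^{-1} (x-mu) into per-variable contributions
    w_i**2 via the corr-max vector w = (D Sigma D)^{-1/2} D (x - mu),
    where D holds reciprocal standard deviations. A sketch of equation (5)."""
    d = np.diag(1.0 / np.sqrt(np.diag(sigma)))       # D
    vals, vecs = np.linalg.eigh(d @ sigma @ d)       # D Sigma D is symmetric
    r_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T   # (D Sigma D)^{-1/2}
    w = r_inv_sqrt @ d @ (np.asarray(x) - mu)
    return w ** 2                                    # contributions, summing to Q

# Hypothetical covariance matrix and observation, for illustration only
sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 1.0, 0.4],
                  [0.2, 0.4, 1.0]])
x, mu = np.array([1.0, -0.5, 2.0]), np.zeros(3)
contrib = corr_max_partition(x, mu, sigma)
q = (x - mu) @ np.linalg.inv(sigma) @ (x - mu)
assert np.isclose(contrib.sum(), q)                  # partition reproduces Q
```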
Theorem 2. Suppose that the sample variance of X is proportional to Σ̂ and ∑ⱼ₌₁ᵐ corₛ(Xⱼ, Wⱼ) is to be maximised, subject to W = A(X − μ) and W′W = (X − μ)′Σ̂⁻¹(X − μ) for any X. Then

A = (D̂Σ̂D̂)^{−1/2} D̂,

where D̂ is diagonal and D̂Σ̂D̂ has diagonal entries equal to 1.
While the corr-max transformation yields a sensible method of partitioning Q into contributions of individual variables, other reasonable methods may well give a slightly different partition, but differences should be small when there is a close relationship between each Wᵢ variable and the x-variable with which it is paired. Information about the strength of these relationships is provided by the correlations between Xᵢ and Wᵢ (i = 1,…, m). The following theorem gives a simple means of finding the values of these correlations and, more generally, the correlations cor(Xᵢ, Wⱼ) and corₛ(Xᵢ, Wⱼ) for i = 1,…, m; j = 1,…, m. It has the interesting implications that cor(Xᵢ, Wⱼ) = cor(Xⱼ, Wᵢ) and corₛ(Xᵢ, Wⱼ) = corₛ(Xⱼ, Wᵢ) for all i and j, since (DΣD)^{1/2} and (D̂Σ̂D̂)^{1/2} are both symmetric matrices.
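The claimed symmetry is easy to verify numerically. The following sketch (our own, with an arbitrary covariance matrix) checks that the matrix of correlations between the components of X and W equals (DΣD)^{1/2} and is therefore symmetric:

```python
import numpy as np

# Hypothetical covariance matrix, for illustration only
sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.5],
                  [0.3, 0.5, 1.5]])
d = np.diag(1.0 / np.sqrt(np.diag(sigma)))
r = d @ sigma @ d                                   # correlation matrix D Sigma D
vals, vecs = np.linalg.eigh(r)
r_sqrt = vecs @ np.diag(np.sqrt(vals)) @ vecs.T     # (D Sigma D)^{1/2}
r_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
a = r_inv_sqrt @ d                                  # corr-max matrix A

cov_xw = sigma @ a.T                                # cov(X, W) when var(X) = Sigma
cor_xw = d @ cov_xw                                 # var(W) = I, so this is cor(X_i, W_j)
assert np.allclose(cor_xw, r_sqrt)                  # equals (D Sigma D)^{1/2}
assert np.allclose(cor_xw, cor_xw.T)                # cor(X_i, W_j) = cor(X_j, W_i)
```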
So far we have only considered the partition of a quadratic form, but the corr-max transformation also gives a useful partition of the bilinear form U′Σ⁻¹V, provided var(U) ∝ Σ and var(V) ∝ Σ. Theorem 4 gives the relevant result. Its proof follows from the proof of Theorem 1.

Theorem 4. Suppose var(U) ∝ Σ and var(V) ∝ Σ, and put W* = AU and W• = AV, where A = (DΣD)^{−1/2} D is the corr-max matrix of Theorem 1. Then (W*)′W• = U′Σ⁻¹V, and A maximises both ∑ᵢ₌₁ᵐ cor(Uᵢ, Wᵢ*) and ∑ᵢ₌₁ᵐ cor(Vᵢ, Wᵢ•).
From this, and from the theorem, it is reasonable to identify the ith components of W* and W• with the ith components of U and V, respectively. Then U′Σ⁻¹V = ∑ᵢ₌₁ᵐ Wᵢ* Wᵢ•, so the ith term of the sum gives the contribution of the ith variable to the bilinear form. In Section 5 we use the theorem to form a partition of Fisher's linear discriminant function. When Σ is estimated from data, results corresponding to Theorem 4 hold with A = (D̂Σ̂D̂)^{−1/2} D̂.

Rotation invariance property
When the correlations between Xᵢ and Wᵢ are weak for some values of i, there will generally be strong collinearities between some of the x-variables. The standard diagnostic for detecting a collinearity is the variance inflation factor. Suppose the values of X₁,…, Xₘ are observed on each of n items (n > m) and let Rⱼ² denote the multiple correlation coefficient when Xⱼ is regressed on the other x-variables. Then the variance inflation factor for Xⱼ, VIFⱼ say, is defined to be (1 − Rⱼ²)⁻¹. This will be large if Xⱼ is involved in a collinearity. Garthwaite et al. (2012) showed that the x-variables involved in a collinearity can be identified using the cos-max transformation. Example 2 in Section 5.2 illustrates this.
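As a sketch (the simulated data and variable names are our own), VIFⱼ can be computed either from its regression definition or, equivalently, as the jth diagonal entry of the inverse correlation matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 4
z = rng.normal(size=(n, m))
z[:, 3] = z[:, 0] + z[:, 1] + 0.1 * rng.normal(size=n)   # induce a collinearity
zc = (z - z.mean(axis=0)) / z.std(axis=0)                # centre and scale
r = (zc.T @ zc) / n                                      # sample correlation matrix

# VIF_j = (1 - R_j^2)^{-1}; equivalently the jth diagonal entry of r^{-1}
vif = np.diag(np.linalg.inv(r))
for j in range(m):
    others = [k for k in range(m) if k != j]
    beta, *_ = np.linalg.lstsq(zc[:, others], zc[:, j], rcond=None)
    resid = zc[:, j] - zc[:, others] @ beta
    r2 = 1.0 - resid @ resid / (zc[:, j] @ zc[:, j])     # multiple R-squared
    assert np.isclose(vif[j], 1.0 / (1.0 - r2))
```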
Collinear variables can be replaced by non-collinear variables via orthogonal rotation of coordinate axes. This can clarify the relationship between x-variables and a quadratic form, as examples will illustrate. Only axes corresponding to collinear variables need be rotated. The results of a rotation are sensitive to scale, so before rotation we scale the x-variables. This is the same as in principal component analysis, where variables are frequently scaled to have identical variances before applying the principal component transformation (which is an orthogonal rotation).
Here var(X) ∝ Σ and DΣD has diagonal entries all equal to 1, so the components of DX have identical variances. Let Γ be an m × m orthogonal matrix and put Y = ΓD(X − μ), so that Y is obtained by a re-scaling of X − μ, followed by an orthogonal rotation. Suppose that we want to apply a transformation W = CY in such a manner that there are large correlations between Yᵢ and Wᵢ for i = 1,…, m.
The components of Y are not all equally important: after rotation some components will have a smaller variance than others, and those with smaller variances are deemed to be less important, as in principal components analysis. The corr-max transformation would maximise ∑ᵢ₌₁ᵐ cor(Yᵢ, Wᵢ); instead we choose C to maximise ∑ᵢ₌₁ᵐ {var(Yᵢ)}^{1/2} cor(Yᵢ, Wᵢ). This gives greater weight to the Yᵢ with greater variance. Theorem 5 gives the resulting matrix C and properties of the transformation.
The transformation from X − μ to W will be referred to as the adapted corr-max transformation. It is identical to the ordinary corr-max transformation if there is no rotation, that is, when Γ = I. If Σ is unknown, we replace it with an estimate, Σ̂, and denote the resulting vector by Ŵ. The contribution of the ith variable to the quadratic form is evaluated as (wᵢ)², where wᵢ is the value taken by the ith component of W or Ŵ. From equation (8) we obtain the same result whether (a) we multiply D(X − μ) by the rotation matrix Γ and transform the result, or (b) we transform D(X − μ) and multiply the result by Γ. That is, with the adapted corr-max transformation, the operations of rotation and transformation are commutative.
This property allows us to rotate the coordinate axes corresponding to x-variables involved in a collinearity while neither affecting the identity of other x-variables, nor altering assessments of the latter variables' contributions to the quadratic form. To elucidate, suppose that we want to rotate the first d of the m axes. Then the rotation matrix has the block-diagonal form

Γ = ( Γ_d   0
      0     I ),

where Γ_d is a d × d orthogonal matrix and I is the (m − d) × (m − d) identity matrix. Multiplying X by Γ only changes the first d components of X and leaves its other components unchanged, so the latter components are the original variables. Moreover, under the transformation in equation (7), the last m − d components of W are unaffected by Γ_d; the rotation only changes the first d components. Thus, under the adapted corr-max transformation, the rotation of selected axes will leave some variables unchanged (those corresponding to unrotated axes) and the contributions of those variables to the quadratic form, as measured by the partition, will also be unchanged. We refer to this as the rotation invariance property. Ideally, a partition yields orthogonal components that are closely related on a one-to-one basis with meaningful quantities. When these quantities cannot be the original x-variables because of a collinearity, the rotation invariance property suggests that we might rotate the axes corresponding to variables involved in the collinearity, and then apply the transformation. There should still be close pairwise relationships between each unrotated variable and the variable to which it transforms, as these relationships are not compromised by the rotation. Also, there should now be close relationships between the quantities obtained through rotation and the variables to which they transform.
A rotation is attractive if it yields meaningful quantities. If, say, the only collinearity was between the first two variables, X₁ and X₂, a sensible rotation might be

Γ₂ = (1/√2) ( 1   1
              1  −1 ),

which constructs two new variables, one proportional to the sum of X₁ and X₂, and the other proportional to their difference. This will often create variables that have a natural interpretation and the new variables will also have a low correlation if the variance of X₁ is similar to the variance of X₂. If rotation is used to counteract more than one distinct collinearity between the x-variables, then Γ should have a block-diagonal form, with a separate block for each collinearity. An example is given in Section 5.2. When a collinearity involves more than two variables, constructing meaningful variables that have low correlations can be a challenging task. An approach based on orthogonal contrasts that might sometimes be useful is described in Appendix B. While rotation can be helpful when collinearities are present, we should stress that rotation is never essential. The standard corr-max transformation of Section 3 can be applied whenever Σ is a positive-definite matrix, even if the variables are highly correlated, and it will yield a sensible partition of a quadratic form, as we discuss further in Section 7. Hence axes should only be rotated when the new variables that are constructed have an understandable interpretation.
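Under our reading of the commutativity property above, the adapted corr-max vector equals the rotation matrix applied to the ordinary corr-max vector. The sketch below (our own illustration; the covariance matrix and data are hypothetical) checks the rotation invariance property when only the first two axes are rotated:

```python
import numpy as np

def inv_sqrt(mat):
    # Inverse symmetric square root via eigendecomposition
    vals, vecs = np.linalg.eigh(mat)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

# X1 and X2 nearly collinear; values are hypothetical
sigma = np.array([[1.0, 0.9, 0.1],
                  [0.9, 1.0, 0.2],
                  [0.1, 0.2, 1.0]])
x, mu = np.array([1.2, 0.8, -0.5]), np.zeros(3)
d = np.diag(1.0 / np.sqrt(np.diag(sigma)))
w_ord = inv_sqrt(d @ sigma @ d) @ d @ (x - mu)      # ordinary corr-max vector

# Rotate only the first two axes: sum and difference of X1 and X2
g2 = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)
gamma = np.eye(3)
gamma[:2, :2] = g2
w_adapt = gamma @ w_ord                             # rotation commutes with transform

q = (x - mu) @ np.linalg.inv(sigma) @ (x - mu)
assert np.isclose(np.sum(w_adapt ** 2), q)          # still partitions Q
assert np.isclose(w_adapt[2] ** 2, w_ord[2] ** 2)   # unrotated variable unchanged
```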

Applications
In Section 5.1 we describe some common applications in which the corr-max transformation yields a partition that quantifies the contributions of individual variables to a test statistic. In Section 5.2 an example is given in which collinearity is present and some x-axes are rotated while applying the transformation.

Hotelling's T², Mahalanobis distance and discriminant analysis
The standard application in which the partition is useful is where a statistic of interest, Q̂ say, has the form

Q̂ = c (X − μ)′Σ̂⁻¹(X − μ),                (11)

with Σ̂ an estimate of Σ, var(X) ∝ Σ and c a known positive scalar. From equation (5), the corr-max transformation yields W = (D̂Σ̂D̂)^{−1/2} D̂(X − μ), and the contribution of the ith x-variable to Q̂ is evaluated as c wᵢ², where (w₁,…, wₘ)′ is the value of W given by data. Before the partition can be applied, X, Σ̂, c and μ must be identified and it must be checked that var(X) ∝ Σ. (The matrix D̂ is obtained from Σ̂.) The individual contributions, c wᵢ² for i = 1,…, m, then follow automatically. After using the transformation the analyst should examine the correlations between components of W and the corresponding components of X; rotation of x-axes might be considered if some correlations are low. (In our examples we consider rotating axes when correlations are 0.8 or lower.) In the following four applications, the first three have precisely the form given in (11), while the fourth is closely related to it.
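The recipe in the preceding paragraph can be collected into a single helper. This is a sketch, not the authors' code; the function name and the returned diagnostics are our own:

```python
import numpy as np

def partition_statistic(x, mu, sigma_hat, c=1.0):
    """Contributions c * w_i**2 of each variable to c (x-mu)' Sigma_hat^{-1} (x-mu),
    together with the pairing correlations diag((D Sigma_hat D)^{1/2}) used to
    judge whether rotation of axes should be considered (low values flag it)."""
    d = np.diag(1.0 / np.sqrt(np.diag(sigma_hat)))
    vals, vecs = np.linalg.eigh(d @ sigma_hat @ d)
    w = vecs @ np.diag(vals ** -0.5) @ vecs.T @ d @ (np.asarray(x) - mu)
    pairing = np.diag(vecs @ np.diag(np.sqrt(vals)) @ vecs.T)
    return c * w ** 2, pairing

# Hypothetical inputs, for illustration only
sigma_hat = np.array([[1.0, 0.3], [0.3, 2.0]])
x, mu = np.array([1.0, 2.0]), np.zeros(2)
contrib, pairing = partition_statistic(x, mu, sigma_hat, c=5.0)
q = 5.0 * (x - mu) @ np.linalg.inv(sigma_hat) @ (x - mu)
assert np.isclose(contrib.sum(), q)     # contributions sum to the statistic
```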
(a) Hotelling's one-sample T² statistic. A random sample of size n is taken from N(μ, Σ), giving a sample mean X̄ and sample covariance Σ̂₁. The standard test of the hypothesis μ = μ₀ is based on Hotelling's one-sample T² statistic,

T₁² = n (X̄ − μ₀)′Σ̂₁⁻¹(X̄ − μ₀).                (12)

Let the role of X in (11) be played by X̄, so that var(X̄) = Σ/n. The partition is obtained by putting Σ̂ = Σ̂₁, c = n and μ = μ₀.

(b) Hotelling's two-sample T² statistic. Two random samples of sizes n₁ and n₂ are drawn from the multivariate normal distributions, N(μ₁, Σ) and N(μ₂, Σ), that have the same covariance matrix. Then the hypothesis μ₁ = μ₂ is tested using Hotelling's two-sample T² statistic,

T₂² = {n₁n₂/(n₁ + n₂)} (X̄₁ − X̄₂)′Σ̂ₚ⁻¹(X̄₁ − X̄₂),                (13)

where X̄₁ and X̄₂ are the sample means and Σ̂ₚ is the pooled estimate of Σ derived from the two samples. Let the role of X in (11) be played by X̄₁ − X̄₂, so var(X) ∝ Σ. Put Σ̂ = Σ̂ₚ, c = n₁n₂/(n₁ + n₂) and μ = 0 to obtain the contributions of individual variables to T₂².

(c) Mahalanobis distance. If X⁽¹⁾ and X⁽²⁾ are two m × 1 vectors, then the Mahalanobis distance between them is

D² = (X⁽¹⁾ − X⁽²⁾)′Σ̂⁻¹(X⁽¹⁾ − X⁽²⁾).                (14)

Here X⁽¹⁾ and X⁽²⁾ must be independent, but either or both of them could be individual observations, or sample means, or one of them could be a vector of known constants. We suppose var(X⁽ⁱ⁾) = kᵢΣ (i = 1, 2), where k₁ or k₂ (but not both) may equal 0. We also suppose that E(Σ̂) ∝ Σ so, for example, Σ̂ might be the maximum likelihood estimate or an unbiased estimate of Σ. Let X = X⁽¹⁾ − X⁽²⁾, so var(X) ∝ Σ. Put c = 1 and μ = 0. Then the partitioning gives the contributions of individual variables to the Mahalanobis distance.

(d) Fisher's linear discriminant function. Suppose an observation needs to be classified as belonging to one of two classes that are characterised by the multivariate normal distributions N(μ₁, Σ) and N(μ₂, Σ), with sample means X̄₁ and X̄₂ and common estimated covariance matrix Σ̂ₚ.
A new observation X* is classified as belonging to class 1 on the basis of Fisher's linear discriminant function if

r(X*) = (X̄₁ − X̄₂)′Σ̂ₚ⁻¹ {X* − (X̄₁ + X̄₂)/2} > 0.                (15)

Consider the transformations

W* = (D̂Σ̂ₚD̂)^{−1/2} D̂ {X* − (X̄₁ + X̄₂)/2}                (16)

and

W• = (D̂Σ̂ₚD̂)^{−1/2} D̂ (X̄₁ − X̄₂).                (17)

Since var(X̄₁ − X̄₂) ∝ Σ and var[X* − (X̄₁ + X̄₂)/2] ∝ Σ, Theorem 4 applies. Hence the ith components of both W• and W* can be identified with the ith x-variable.
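A numerical sketch of this bilinear partition (simulated data; all names are our own) confirms that the per-variable products of the two transformed vectors sum to the discriminant score:

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, m = 60, 60, 3
x1 = rng.normal([0.0, 0.0, 0.0], 1.0, size=(n1, m))   # hypothetical class 1 sample
x2 = rng.normal([1.0, 0.5, 0.0], 1.0, size=(n2, m))   # hypothetical class 2 sample

# Pooled covariance estimate and corr-max matrix A = (D Sp D)^{-1/2} D
sp = ((n1 - 1) * np.cov(x1.T) + (n2 - 1) * np.cov(x2.T)) / (n1 + n2 - 2)
d = np.diag(1.0 / np.sqrt(np.diag(sp)))
vals, vecs = np.linalg.eigh(d @ sp @ d)
a = vecs @ np.diag(vals ** -0.5) @ vecs.T @ d

x_star = np.array([0.8, 0.4, -0.2])                   # hypothetical new observation
u = x1.mean(axis=0) - x2.mean(axis=0)                 # mean difference
v = x_star - 0.5 * (x1.mean(axis=0) + x2.mean(axis=0))
w_u, w_v = a @ u, a @ v                               # the two transformed vectors
contrib = w_u * w_v                                   # per-variable contributions
r = u @ np.linalg.inv(sp) @ v                         # Fisher's discriminant score
assert np.isclose(contrib.sum(), r)
```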
We use two examples to explore how the transformation and partition work in practice.
In the first example we apply the transformation without rotation of variables and consider applications (a), (c) and (d). In the second example, given in the next subsection, we illustrate application (b) and apply the transformation to both rotated and un-rotated variables.

Example 1: Swiss bank notes

Flury & Riedwyl (1988) present data on 100 genuine Swiss 1000-franc bank notes. Six measurements were made on each note: length (length), left-ht (height measured on the left), right-ht (height measured on the right), lower (distance from the inner frame to the lower border), upper (distance from the inner frame to the upper border), and diagonal (length of the diagonal). These measurements are the data values of X = (X₁,…, X₆)′. Their sample standard deviations are (0.388, 0.364, 0.355, 0.643, 0.649, 0.447) and the reciprocals of these standard deviations form the diagonal entries of D̂. If the corr-max transformation is applied to a vector X to yield a vector W, the correlations between components of X and the corresponding components of W are equal to the diagonal entries of (D̂Σ̂D̂)^{1/2}. These diagonal entries are 0.96, 0.90, 0.91, 0.91, 0.91 and 0.98. They are all large, indicating close one-to-one relationships between each x-variable and its corresponding component of W, so rotation of x-variables is unnecessary.
Hotelling's one-sample T² statistic might be used to test the hypothesis that the population mean vector is, say, μ₀ = (215.007, 129.979, 129.756, 8.369, 10.233, 141.562)′. These values have been chosen so that, for each variable, the hypothesised population mean exceeds the sample mean by 0.1 standard deviations. The value of the test statistic, given by equation (12), is T₁² = 8.74. We have already calculated D̂ and D̂Σ̂D̂. Setting X and μ equal to x̄ and μ₀ respectively in equation (5) gives w = −(0.051, 0.053, 0.055, 0.164, 0.182, 0.138)′. As c = 100, the contribution of the ith x-variable to T₁² is 100wᵢ², so the contributions of the six x-variables are 0.51², 0.53², 0.55², 1.64², 1.82² and 1.38². These values sum to 8.75, which differs slightly from T₁² because we have listed all contributions to 2 decimal places only, and not given their precise values. The actual sum of the contributions equals T₁², as the theory tells us. Although for each component the sample mean differs from the hypothesised population mean by an equivalent amount, the last three x-variables (lower, upper and diagonal) make larger contributions to the T₁² statistic than the first three x-variables (length, left-ht and right-ht).
As an example involving Mahalanobis distance, suppose the measurements for an additional banknote that might be a forgery are x₂ = (215.8, 129.7, 129.0, 6.9, 8.6, 143.2)′. The Mahalanobis distance between x₂ and the mean value of X in the sample of 100 genuine banknotes, x̄, is given by equation (14) with X⁽¹⁾ = x̄ and X⁽²⁾ = x₂. The value of this distance is 55.69, which gives clear evidence that the note is a forgery (p < 0.0001). Our partition can be used to determine which characteristics of the new banknote distinguish it from the genuine banknotes. We put X = x̄ − x₂ and μ = 0 in equation (5) to obtain W. As c = 1, the contribution of the ith x-variable to the Mahalanobis distance is the square of the ith component of W. These squared values are (8.64, 0.87, 4.54, 16.66, 15.12, 9.86). Hence the measurements that most distinguish the new banknote from genuine banknotes are X₄ (lower) and X₅ (upper).
The Swiss bank notes dataset given by Flury & Riedwyl (1988) contained 100 faked bank notes in addition to the 100 genuine notes. As an example that involves Fisher's discriminant rule, we consider the task of using these data to classify a note as genuine or from the same population as the fakes.
Table 1 summarises the analysis. The first two rows, X̄₁ and X̄₂, show the sample means of the genuine and faked bank notes, respectively. The note to be classified is X*. Equation (15) gives −20.34 as the value of r(X*), indicating that the new note should be classified as coming from the same population as the fakes. Applying equations (16) and (17) gives the contributions shown in Table 1. The last three variables, lower, upper and diagonal, underlie the outcome of the discrimination rule, as they make much larger contributions to r(X*) (in absolute value) than the first three variables.
In Section 1 we noted that Rotenberry et al. (2006) examined eigenvectors corresponding to small eigenvalues in order to determine influential variables on a Mahalanobis distance. Before leaving this example we illustrate their method by applying it to the case where we have 100 genuine banknotes and an additional banknote that might be a forgery. Their approach is to decompose the quadratic form as ∑ₖ₌₁ᵐ λₖ⁻¹ {eₖ′(x̄ − x₂)}², where λ₁,…, λₘ and e₁,…, eₘ are the eigenvalues and corresponding eigenvectors of Σ̂. Based on just the eigenvector corresponding to the smallest eigenvalue, X₂ and X₃ (left-ht and right-ht) are clearly the most important variables, since in the case of that eigenvector they have much larger coefficients (in absolute value) than the other variables. However, if the three eigenvectors corresponding to the three smallest eigenvalues are all considered relevant, then deciding which x-variables are important is not clear-cut and requires the analyst to make an intuitive judgment. Moreover, there is no obvious method of determining the relative quantitative importance of different variables, and with any such method the answers are likely to depend on whether the three smallest eigenvalues or only the very smallest are considered "small".
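Their decomposition can be sketched as follows (simulated data and names are our own): each eigenvector of Σ̂ contributes one term to the Mahalanobis distance, and terms associated with small eigenvalues can dominate:

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(size=(100, 4))            # hypothetical 'genuine' sample
xbar, s = data.mean(axis=0), np.cov(data.T)
x_new = np.array([1.0, -2.0, 0.5, 0.0])     # hypothetical new observation

lam, e = np.linalg.eigh(s)                  # eigenvalues/eigenvectors of Sigma-hat
terms = (e.T @ (x_new - xbar)) ** 2 / lam   # one term per eigenvector
d2 = (x_new - xbar) @ np.linalg.inv(s) @ (x_new - xbar)
assert np.isclose(terms.sum(), d2)          # terms recover the Mahalanobis distance
```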

Collinearities, rotation and quadratic forms
Some advantages of the (un-adapted) corr-max transformation are diminished when strong collinearities are present: not every x-variable will be closely related to the transformed variable with which it is paired. Here we examine a dataset in which collinearities are present and illustrate use of the cos-max transformation matrix to identify the variables that are collinear. To identify collinearities we apply the cos-max transformation to data that have been standardised to have means of 0 and variances of 1, making the cos-max and corr-max transformations very similar, as will be seen.
The dataset contains two strata whose means will be compared using Hotelling's two-sample T² statistic. We partition the test statistic into the contributions of individual variables and variable combinations by applying the adapted corr-max transformation. The rotation matrix (Γ) we use in the transformation creates meaningful non-collinear variables from the variables that are involved in the collinearities.

Example 2: Female and male athletes
The data relate to the following nine measurements (X_1, …, X_9) that were made on n_1 = 100 female and n_2 = 102 male athletes at the Australian Institute of Sport (Cook & Weisberg 1994): Wt (weight), Ht (height), Rcc (red blood cell count), Hg (hemoglobin), Hc (hematocrit), Wcc (white blood cell count), Ferr (plasma ferritin concentration), Bfat (% body fat), and SSF (sum of skin folds). It is assumed that the two groups (females and males) may have different means, μ_1 and μ_2, but have a common covariance matrix Σ. Let S_p denote the pooled estimate of Σ and let D be the diagonal matrix whose non-zero entries are the reciprocals of the pooled standard deviations; the pooled correlation matrix is then DS_pD. Under the cos-max transformation, a data matrix X is transformed to (X'X)^{-1/2}X'. Let X_s denote the data matrix after variables have been centred and scaled so that the correlation matrix of X_s is X_s'X_s. If we put (X_s'X_s)^{-1/2} = H = (h_1, …, h_m)' then, as Garthwaite et al. (2012) pointed out, the variance inflation factor for the jth variable (VIF_j) is equal to h_j'h_j. Moreover, if VIF_j is large, indicating a collinearity, then large components of h_j correspond to the variables that underlie the collinearity. In the present example, X_s'X_s = DS_pD, so examining the rows of (DS_pD)^{-1/2} identifies the variables involved in collinearities. (When X_s'X_s = DS_pD, the corr-max and cos-max transformations are the same.) We put (DS_pD)^{-1/2} = (h_1, …, h_m)' and give the values of the h_j in Table 2. Values greater than 0.80 in absolute value are given in bold-face type. The last column of the table gives the VIF for each variable; e.g. 8.15 is the VIF for X_5 and equals h_5'h_5. A VIF above 10 is often treated as indicative of collinearity (Neter, Wasserman & Kutner 1983, p. 392). On this basis, X_8 (Bfat) and X_9 (SSF) are involved in collinearities and, from the bold-face numbers in the display of h_8 and h_9, there is a collinearity between them. Weaker boundaries for flagging a collinearity have also been proposed; Menard (1995, p.
66) suggests a VIF above 5 should raise concern and O'Brien (2007) reports that boundary values as low as 4 have been suggested as rules of thumb. A boundary of 4 or 5 would indicate one further collinearity, between X_4 (Hg) and X_5 (Hc).

Table 2. Rows of (DS_pD)^{-1/2} and variance inflation factors for data on athletes

If the corr-max transformation is applied to X = (X_1, …, X_9)', the sample correlations between each x-variable and the variable to which it transforms can be computed. The correlations for Hg and Hc are a little low, suggesting that remedial action might be taken to offset both the mild collinearity between this pair of variables and the stronger collinearity between Bfat and SSF. To rotate the axes associated with these variable pairs we replace the corr-max transformation by the adapted corr-max transformation given by equation (9), with the rotation matrix set equal to a block-diagonal orthogonal matrix whose non-trivial blocks form the sums and differences (scaled by 1/√2) of Hg and Hc and of Bfat and SSF. We refer to the variables to which Bfat and SSF transform as B+S and B−S, with analogous labels for the variables formed from Hg and Hc. Before rotation, the sample means for the female and male athletes (x̄_1 and x̄_2) and the pooled standard deviations were computed; the reciprocals of the standard deviations constitute the non-zero (diagonal) entries of the matrix D. When Hotelling's T^2 test is used to compare the means of the two groups we obtain a two-sample T^2 statistic equal to 1199.1. This value gives, as one might expect, very clear evidence of differences between the two groups (p < 0.0001). However, the question of which quantities contribute most to this value is still relevant. The partition allows us to evaluate the contributions of individual x-variables/variable combinations to the statistic as proportional to the squares of the components of w: 3.08, 2.19, 1.17, 3.02, 0.24, 0.00, 1.64, 5.81, 6.61.
(When multiplied by n_1 n_2/(n_1 + n_2), which here equals 100 × 102/(100 + 102), these sum to the value of the two-sample T^2 statistic, 1199.1, apart from rounding error.) On the scale given by our partition, the largest contributors to the size of T^2 are the average of Bfat and SSF (contributing 24%) and the difference between these same two quantities (contributing 28%). With the other pair of variables whose axes were rotated, Hg and Hc, their average makes a substantial contribution (13%) while the contribution from their difference is only 1%.
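As a quick arithmetic check, the scaling and the quoted percentages can be reproduced directly from the squared components of w reported above:

```python
# Squares of the components of w reported above, one per variable/combination.
contribs = [3.08, 2.19, 1.17, 3.02, 0.24, 0.00, 1.64, 5.81, 6.61]

n1, n2 = 100, 102
scale = n1 * n2 / (n1 + n2)            # the multiplier n1*n2/(n1 + n2)

total = scale * sum(contribs)          # recovers the two-sample T^2 statistic
shares = [round(100 * c / sum(contribs)) for c in contribs]

assert abs(total - 1199.1) < 2         # agreement apart from rounding error
assert shares[7] == 24 and shares[8] == 28   # B+S and B-S
assert shares[3] == 13 and shares[4] == 1    # the Hg/Hc combinations
```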

Bootstrap confidence intervals
The corr-max transformation gives point estimates of the contributions of individual variables to a quadratic form. Obtaining theoretical results that give interval estimates of these contributions is difficult, but the bootstrap can be used to obtain approximate confidence intervals. We elucidate the procedure through examples.

Confidence interval for contributions to a Mahalanobis distance
In Example 1 there were 100 genuine Swiss 1000-franc bank notes and an additional bank note that might be a forgery. The Mahalanobis distance between the potential forgery and the mean of the genuine bank notes was 55.69 and the contributions of the six individual variables were estimated as (8.64, 0.87, 4.54, 16.66, 15.12, 9.86). To obtain bootstrap confidence intervals for these contributions we generated 100 000 resamples from the 100 genuine bank notes. Each resample was a random sample of size 100 drawn with replacement from the 100 genuine notes.
Each resample was used in the same way as the original sample. The Mahalanobis distance between the potential forgery and the mean of the resample was calculated, with the resample being used to estimate the covariance matrix, Σ. The contributions of individual variables to the Mahalanobis distance were then evaluated using the corr-max transformation. This gave 100 000 estimates of the contribution of each variable and the kth smallest of these is equated to the (k/1000)th percentile of the bootstrap distribution. The median for a variable's contribution is thus the 50 000th smallest value and the endpoints of an approximate 95% confidence interval are the 2500th smallest and 2500th largest values.
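The percentile bookkeeping can be sketched as follows; for brevity the sketch uses B = 1000 resamples rather than 100 000, and a sample mean on synthetic data stands in for a variable's contribution:

```python
import numpy as np

rng = np.random.default_rng(3)
sample = rng.normal(loc=5.0, scale=2.0, size=100)   # stand-in for the data

B = 1000                       # the text uses 100 000 resamples
stats = np.empty(B)
for b in range(B):
    boot = rng.choice(sample, size=len(sample), replace=True)
    stats[b] = boot.mean()     # stand-in for a variable's contribution

stats.sort()                   # kth smallest ~ (100k/B)th percentile
median = stats[B // 2 - 1]
k = 25                         # analogue of the 2500th value in the text
lo, hi = stats[k - 1], stats[B - k]   # kth smallest and kth largest: 95% CI

assert lo < median < hi
assert lo < sample.mean() < hi
```

This is the simple percentile method; the same sorted-order bookkeeping applies whatever statistic is computed from each resample.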
(This is the simplest method of forming bootstrap confidence intervals. As is well known, it typically works reasonably well but produces some bias, so work is underway to explore its performance in the current context and compare it with other bootstrap methods.) Figure 1 gives 'pseudo-boxplots' for the contributions of each of the six variables. As in a conventional boxplot, the ends of the box indicate the interquartile range of the data and the line within the box marks the median. However, we used the whiskers to depict the central 95% confidence interval, rather than the trimmed range. The circles show the point estimates (8.64,…, 9.86) given by the actual data. The figure indicates that measurements of the height on the left and right sides (left-ht and right-ht) contribute comparatively little to the Mahalanobis distance, while the distances from the inner frame to the lower border (lower) and from the inner frame to the upper border (upper) contribute noticeably more.
Other firm conclusions are difficult to draw, because there is substantial uncertainty about the contributions of the variables.

Confidence interval for contributions to a two-sample T 2 statistic
Example 2 involves a group of 100 female athletes and a group of 102 male athletes. Nine variables were measured on each athlete and two pairs of variables were rotated to reduce collinearities. The two-sample T^2 statistic for comparing the two groups was calculated and gave overwhelming evidence that the groups differed. To form bootstrap confidence intervals for the contributions of individual variables to this statistic, the groups must be resampled separately: a resample consists of the measurements of 100 athletes randomly drawn with replacement from the female athletes and 102 athletes drawn with replacement from the male athletes. The T^2 statistic was determined for each of 100 000 resamples and the contribution of individual variables/variable combinations to the statistic in each resample was evaluated using the adapted corr-max transformation.
Pseudo-boxplots derived from the results are given in Figure 2. These show that the primary contributions to the T^2 statistic are clearly from B+S and B−S, the combination variables that are formed from the sums and differences of Bfat (percentage of body fat) and SSF (sum of skin folds). Other variables contribute noticeably less, but the only variable that clearly makes almost no contribution is the white blood cell count (Wcc). The confidence intervals are skewed to the right and the larger contributions tend to have wider confidence intervals. These appear to be characteristic traits and can also be seen in Figure 1.

Concluding comments
The corr-max transformation is straightforward to calculate. The matrices Σ and D are readily determined and a spectral decomposition gives, say, DΣD = HΛH', where Λ is a diagonal matrix of eigenvalues of DΣD and H is an orthogonal matrix whose columns are eigenvectors. After H and Λ have been determined, (DΣD)^{-1/2} is set equal to HΛ^{-1/2}H'. Hence, the corr-max transformation and the partition it yields are readily implemented in any programming language that offers matrix functions. To facilitate use of the partition in some important applications, programs have been written in R that determine the contributions of individual variables to a quadratic form in the contexts of Hotelling's one-sample and two-sample T^2 tests, Mahalanobis distance, and the classification of an item to one of two populations on the basis of Fisher's linear discriminant function. These programs are available from URL: http://users.mct.open.ac.uk/paul.garthwaite.
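A minimal NumPy sketch of this computation (not the authors' R programs) follows; it builds (DΣD)^{-1/2} by spectral decomposition and checks that the resulting contributions sum exactly to the quadratic form:

```python
import numpy as np

def inv_sqrt(M):
    """Symmetric inverse square root M^{-1/2} = H L^{-1/2} H'."""
    lam, H = np.linalg.eigh(M)
    return H @ np.diag(lam ** -0.5) @ H.T

def corrmax_partition(x, mu, Sigma):
    """Contributions W_i^2 of each variable to (x-mu)' Sigma^{-1} (x-mu)."""
    D = np.diag(1.0 / np.sqrt(np.diag(Sigma)))   # reciprocal standard deviations
    W = inv_sqrt(D @ Sigma @ D) @ D @ (x - mu)   # corr-max transform
    return W ** 2

rng = np.random.default_rng(4)
A = rng.normal(size=(6, 6))
Sigma = A @ A.T + 6 * np.eye(6)   # a positive-definite covariance matrix
mu = np.zeros(6)
x = rng.normal(size=6)

contribs = corrmax_partition(x, mu, Sigma)
quad = (x - mu) @ np.linalg.solve(Sigma, x - mu)

# The partition is exact: the contributions sum to the quadratic form.
assert np.isclose(contribs.sum(), quad)
```

The same few lines underlie the one-sample and two-sample T^2 partitions; only the choice of x, mu and Sigma changes.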
The rotation of variables has received much attention in this paper, and further comment is needed to give a balanced perspective on its role in partitioning a quadratic form. As in equation (5), let W = (DΣD)^{-1/2}D(X − μ). When the correlations are high between each component of W and the corresponding component of X, then the partition is clearly a sensible way of evaluating the contribution of each x-variable. When some of these correlations are low, they can sometimes be increased dramatically through rotations that yield interpretable variables. This potential benefit of rotation was illustrated in Section 5.2. However, finding suitable rotations that yield interpretable variables is not always possible. Moreover, even when such rotations can be found, there are attractions in the simplicity of forming a partition that retains the original x-variables. We briefly return to the athletes data to show that low correlations do not preclude a transparent relationship between the x-variables and the contributions allocated to them by the partition.
Let X^# denote the difference between an athlete's measurements and the average for their gender. Put X* = DX^#, so that the components of X* = (X*_1, …, X*_9)' are standardised values of each variable. Let (W_1, …, W_9)' = W = (DS_pD)^{-1/2}X*. Then W_i^2 is the contribution of the ith variable to the quadratic form in equation (11). We focus on the two most highly correlated variables, Bfat (X_8) and SSF (X_9). The partition uses the rows h_8 and h_9 of (DS_pD)^{-1/2} given in Table 2 to determine W_8 and W_9, and hence the contributions of these two variables to the quadratic form.
Our method is derived from a clear, understandable criterion that gives it a sound basis. In our experience the method has never given an evaluation that seems unreasonable and we recommend its use for the decomposition of a quadratic form for any positive-definite matrix Σ. In reporting results, the method used to obtain the decomposition should be stated so as to define the evaluated contributions unambiguously.

Proof of Theorem 1. For any X, by assumption (X − μ)'Σ^{-1}(X − μ) = W'W = (X − μ)'A'A(X − μ). Choosing X so that only one entry of (X − μ) is non-zero shows that the diagonal entries of Σ^{-1} and A'A are equal. Choosing X so that only two entries of (X − μ) are non-zero then does the same for off-diagonal entries, so Σ^{-1} = A'A. Since var(X) ∝ Σ, it follows that var(W) ∝ AΣA' = A(A'A)^{-1}A' = I. Consequently the components of W are uncorrelated and have identical variances. For the next part of the theorem, let W* = A{X − E(X)} and var(X) = kΣ. Then E(W*) = 0 and var(W*) = kAΣA' = kA(A'A)^{-1}A' = kI. Let Z = D{X − E(X)} and put Z = (Z_1, …, Z_m)', so that E(Z) = 0, var(Z) = kDΣD, and var(Z_i) = k for i = 1, …, m. We know that cov(Z, W*) = D var(X) A' = kDΣD(DΣD)^{-1/2} = k(DΣD)^{1/2} and that cor(Z_i, W*_j) = cov(Z_i, W*_j)/[var(Z_i) var(W*_j)]^{1/2}. Since var(Z_i) = var(W*_j) = k, both cor(Z_i, W*_j) and cor(X_i, W*_j) equal the (i, j)th entry of (DΣD)^{1/2}. Similar reasoning shows that cor_s(X_i, W_j) is the (i, j)th entry of the sample version of (DΣD)^{1/2}.
Proof of Theorem 5. Part (i) follows from reasoning similar to the first steps of the proof of Theorem 1. To prove (ii), let Ω = DΣD and let V = Y − E(Y), so that E(V) = 0 and var(V) ∝ Ω. Put W* = CV, so that var(W*) ∝ CΩC' = C(C'C)^{-1}C' = I since C'C = Ω^{-1}, and the required result follows. The proof of (iv) is analogous to the proof of Theorem 3. Part (v) is immediate from equation (7).
such that Σ_{j=1}^{d} c_{ij} = 0, Σ_{j=1}^{d} c_{ij}^2 = 1 and, for i ≠ k (k = 1, …, d − 1), Σ_{j=1}^{d} c_{ij}c_{kj} = 0. The set of orthonormal contrasts is not unique, giving flexibility in constructing the Y_i. To form a rotation matrix R_d from these contrasts, we set the (i, j)th entry of R_d equal to c_{ij} (i = 1, …, d − 1; j = 1, …, d) and the (d, j)th entry of R_d equal to d^{-1/2} (j = 1, …, d).
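To illustrate the construction, the sketch below uses Helmert contrasts, one standard choice of orthonormal contrasts, to build the rotation matrix for d = 4 and verifies its properties:

```python
import numpy as np

d = 4

# Helmert contrasts: one valid choice of d-1 orthonormal contrasts.
C = np.zeros((d - 1, d))
for i in range(1, d):
    C[i - 1, :i] = 1.0
    C[i - 1, i] = -i
    C[i - 1] /= np.sqrt(i * (i + 1))   # scale so each row has unit length

# Rotation matrix: contrast rows, plus a final row with entries d^{-1/2}.
R = np.vstack([C, np.full(d, d ** -0.5)])

assert np.allclose(R @ R.T, np.eye(d))   # rows are orthonormal
assert np.allclose(C.sum(axis=1), 0)     # each contrast sums to zero
```

Any other set of orthonormal contrasts (sums to zero, unit length, mutually orthogonal) would serve equally well, reflecting the flexibility noted above.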