Permutation inference for canonical correlation analysis

Canonical correlation analysis (CCA) has become a key tool for population neuroimaging, allowing investigation of associations between many imaging and non-imaging measurements. As age, sex and other variables are often a source of variability not of direct interest, previous work has used CCA on residuals from a model that removes these effects, then proceeded directly to permutation inference. We show that a simple permutation test, as typically used to identify significant modes of shared variation on such data adjusted for nuisance variables, produces inflated error rates. The reason is that residualisation introduces dependencies among the observations that violate the exchangeability assumption. Even in the absence of nuisance variables, however, a simple permutation test for CCA also leads to excess error rates for all canonical correlations other than the first. The reason is that a simple permutation scheme does not ignore the variability already explained by previous canonical variables. Here we propose solutions for both problems: in the case of nuisance variables, we show that transforming the residuals to a lower dimensional basis where exchangeability holds results in a valid permutation test; for more general cases, with or without nuisance variables, we propose estimating the canonical correlations in a stepwise manner, removing at each iteration the variance already explained, while dealing with different number of variables in both sides. We also discuss how to address the multiplicity of tests, proposing an admissible test that is not conservative, and provide a complete algorithm for permutation inference for CCA.


. Introduction
Canonical correlation analysis (cca) (Jordan, ; Hotelling, ) is a multivariate method that aims at reducing the correlation structure between two sets of variables to the simplest possible form (hence the name "canonical") through linear transformations of the variables within each set.Put simply, given two sets of variables, the method seeks linear mixtures within each set, such that each resulting mixture from one set is maximally correlated with a corresponding mixture from the other set, but uncorrelated with all other mixtures in either set.
From a peak use through from the late 's until mid-'s, the method has recently regained popularity, presumably thanks to its ability to uncover latent, common factors underlying association between multiple measurements obtained, something relevant in recent research, particularly in the field of brain imaging, that uses high dimensional phenotyping and investigates between-subject variability across multiple domains.This is in contrast to initial studies that introduced cca to the field (Friston et al., , ; Worsley, ; Friman et al., , , ) for investigation of signal variation in functional magnetic resonance imaging (fmri) time series.For example, Smith et al. ( ) used cca to identify underlying factors associating brain connectivity features to various demographic, psychometric, and lifestyle measures; Rosa et al. ( ) used a sparse cca method to investigate differences in brain perfusion after administration of two distinct antipsychotic drugs; Miller et al. ( ) used cca to identify associations between imaging and non-imaging variables in the uk Biobank; Drysdale et al. ( ) used cca to investigate associations between brain connectivity and clinical assessments, and found two canonical variables that would allow classification of participants into distinct categories (but see Dinga et  ) used a combination of pattern classification algorithms and cca to study imaging and behavioral correlates of the subjective perception of oneself belonging to a particular gender.In most of these between-subject, group level studies, putative nuisance variables or confounds were regressed out from the data before proceeding to inference, and all of them used some form of permutation test to assess the significance of the results.In a recent review, Wang et al. ( ) described a permutation procedure for cca as "a random shuffling of the rows (. . . ) of the two variable sets".
Permutation tests are well known and widely used.Among their many benefits, these tests lead to valid inferences while requiring assumptions that are commonly satisfied in between-subject analyses, such that of exchangeability of observations under the null hypothesis.However, here we show that simple implementations of permutation inference for cca are inadequate on four different grounds.First, simple, uncorrected permutation p-values are not guaranteed to be monotonically related to the canonical correlations, leading to inadmissible results; for the same reason, multiple testing correction using false discovery rate is also inadmissible.Second, except for the highest canonical correlation, a simple one-step estimation of all others without considering the variability already explained by previous canonical variables in relation to each of them also leads to inflated per comparison error rates and thus, invalid results.Third, regressing out nuisance variables without consideration to the introduction of dependencies among observations caused by residualisation leads to an invalid test, with excess false positives.Fourth, multiple testing correction using the distribution of the maximum test statistic leads to conservative results, except for the highest canonical correlation.
In this paper we explain and discuss in detail each of these problems, and offer solutions that address each of them.In particular, we propose a stepwise estimation method for the canonical correlations and canonical variables that remains valid even when the number of variables is not the same for both sides of cca.We propose a method that transforms residualised data to a lower dimensional basis where exchangeability -as required for the validity of permutation tests -holds.We also propose that inference that considers multiple canonical correlations should use a closed testing procedure that is more powerful than the usual correction method used in permutation tests that use the distribution of the maximum statistic; the procedure also ensures a monotonic relationship between p-values and canonical correlations.Finally, we provide a complete, general algorithm for valid inferences for cca.

. . Notation and general aspects
Thorough definition and derivation of cca can be found in many classical textbooks on multivariate analysis (e.g., Kendall, ; Mardia et al., ; Brillinger, ; Muirhead, ; Seber, ; Krzanowski, ; Anderson, ); the reader is referred to these for a comprehensive overview.Here we present concisely and only allude to the distinction between population (ρ) and sample (r) canonical correlations where strictly needed.Let Y N ×P and X N ×Q be each one a collection of, respectively, P and Q variables observed from N subjects, N P + Q.Without loss of generality, assume that P Q, that the columns of Y and X are mean-centered, that these matrices are of full rank, and define E(Y Y) = Σ YY , E(X X) = Σ XX , and E(Y X) = Σ YX = Σ XY .The goal of cca is to identify canonical coefficients or canonical weights A P ×K = [a 1 , . . ., a K ] and B Q×K = [b 1 , . . ., b K ], K = min(P, Q), such that the pairs (u k , v k ) of canonical variables, defined as: have correlations r k that are maximal, under the constraint that U U = V V = I.Estimation of A and B amounts to finding the K solutions to: where the unknowns are r k , a k and b k ; r k are the sample canonical correlations, i.e., the correlations between the estimated canonical variables u k and v k .The coefficients a k are eigenvectors of Σ −1 YY Σ YX Σ −1 XX Σ XY , whereas b k are eigenvectors of Σ −1 XX Σ XY Σ −1 YY Σ YX ; the respective eigenvalues (for either a k or b k , as these eigenvalues are the same) are r 2 k .For convenience, we call canonical component the ensemble formed by the k-th canonical correlation, its corresponding pair of canonical variables, and associated pair of canonical coefficients; canonical variables may also be termed modes of variation.
The typical method for estimation involves an iterative procedure that finds one r k and a k at a time, with b k computed as a function of these.However, the method proposed by Bjorck and Golub ( ) is more concise and numerically more stable; it is described in the Appendix (Algorithm ).The canonical correlations are then produced in descending order, r 1 . . .r k . . .r K 0; this positiveness of all canonical correlations is a consequence of these values being explicitly maximised during estimation; reversal of the sign of the coefficients a k can always be accompanied by the reversal of the sign of the corresponding coefficients b k in the other side (and of u k and v k ), to no net effect on r k .That is, the signs of the canonical variables and coefficients are indeterminate, and any solution is arbitrary; nothing can be concluded about the specific direction of effects with cca.

. . Parametric inference
The distribution of the sample canonical correlations r k is intractable, even under assumptions of normality and independence among subjects, and is a function of the population correlations ρ k (Constantine, ; James, ).This difficulty led to the development of a rich asymptotic theory (Fisher, ; Hsu, ; Lawley, ; Fujikoshi, ; Glynn and Muirhead, ).However, these approximations have been shown to be extremely sensitive to departures from normality, or require additional terms that further complicate their use (Muirhead and Waternaux, ); Brillinger ( ) recommended resampling methods to estimate parameters used by normal approximations, which otherwise can be biased (Anderson, ).These difficulties pose challenges for inference.Even though some computationally efficient algorithms have been proposed (Koev and Edelman, ), these tests continue to be rarely used.Instead, a test based on whether a certain number of correlations are equal to zero has been proposed.The null hypothesis is where the constant C = 0 if there are no nuisance variables (Section .).Under the null hypothesis, λ k follows an approximate χ 2 distribution with degrees of freedom ν = (P − k + 1)(Q − k + 1) if each of the columns of Y and X have values that are independent and identically distributed following a normal distribution (but see Glynn and Muirhead, , for a different expression).Unfortunately, this test is sensitive to departures from normality, particularly in the presence of outliers, and its use has been questioned (Seber, ; Harris, ).Another test statistic is based on Roy ( ) , and is simply: The critical values for the corresponding parametric distribution at a given test level α can be found in the charts provided by Heck ( ), using as parameters s = min(P, ), where the constant C = 0 if there are no nuisance variables (Section .), or in tables provided by Kres ( ); more recent approximations for normally distributed data can be found in Chiani ( ) and Johnstone and Nadler ( ).Some approximations, however, are considered conservative (Harris, , ).Note that, while Roy ( ) proposed the use of the largest value as test statistic, which would then be r 2 k=1 , any given null H 0 k at position k must have already considered the previous canonical components, from 1 until k − 1, such that the maximum statistic, after the previous canonical correlations have been removed from the model, is always the current one.A similar reasoning holds for the smallest canonical correlations in the test proposed by Wilks ( ).This feature is exploited in the stepwise rejective procedure proposed in Section . .

. . Permutation inference
The above problems can be eschewed with the use of resampling methods, such as permutation.An intuitive (but inadequate) permutation test for cca could be constructed by randomly permuting the rows of Y or X.For each shuffling of the data, indicated by j = {1, . . ., J}, a new set of canonical correlations (r k ) * j and associated statistics (λ k ) * j would be computed.A p-value would be obtained as is the indicator (Kronecker) function, which evaluates as if the condition inside the brackets is true, or otherwise, and the index j = 1 corresponds to the unpermuted data (i.e., no permutation, with the data in their original ordering).
Such a naïve procedure, however, would ignore the fact that, this resampling scheme matches the first null hypothesis H 0 1 , but not the subsequent ones.For a given canonical correlation at position k being tested, k > 1, one must generate a permutation that satisfy the corresponding null H 0 k , but not necessarily {H 0 1 , . . ., H 0 k−1 }.Otherwise, latent effects represented by the corresponding earlier canonical variables [u 1 , . . ., u k−1 ] and [v 1 , . . ., v k−1 ] would, in the procedure above, remain in the Y and X at the time these are permuted.However, the variance associated with these earlier canonical variables would have already been explained through the rejection of their respective null hypothesis up to H 0 k−1 .This variance is now a nuisance for the positions from k (inclusive) onward.It contains information that are not pertinent to position k and subsequent ones, and that therefore should not be used to build the null distribution, i.e., variance should not be re-used in the shufflings that lead to a given (r k ) * j or subsequent correlations.Fortunately, cca is invariant to linear transformations that mix the variables in Y or in X.Since the canonical variables are themselves linear transformations of these input variables (Equation ), a second cca using U and V in place of the initial Y and X leads to the same solutions.Yet, unless P = Q, V will not span the same space as X.In principle, this would be inconsequential as far as the canonical variables are concerned.However, ignoring the variability in X not contained in V would again affect the pvalues should U and V be used in a permutation test, as the permuted data would not be representative of the original (unpermuted) that led to these initial canonical variables.To mitigate the problem, include into the matrix of canonical coefficients their orthogonal complement, i.e., compute Ṽ = X [B, null (B )], then use Ṽ instead of V as a replacement for X.In this paper we adopted the convention that P Q, but the same procedure works in reverse and, algorithmically, it might as well be simpler to compute also a Ũ = Y [A, null (A )] and use it instead of U as a replacement for Y.
While these transformations do not change in any way the canonical components, they allow the construction of an improved algorithm that addresses the issue of variability already explained by canonical variables of lower rank (i.e., the ones with order indices smaller than that of a given one).It consists of running an initial cca using Y and X to obtain Ũ and Ṽ, then subject these to a second cca and permutation testing while, crucially, at each permutation, iteratively repeating cca K times, each using not the whole Ũ and Ṽ, but only from the k-th component onwards, i.e., [ũ k , . . ., ũP ] and [ṽ k , . . ., ṽQ ] for the test about the k-th canonical correlation.Of note, a test level α does not need to be specified at the time in which the above iterative (stepwise) procedure is performed; instead, and in combination with the multiple testing procedure described below, adjusted p-values are are computed, which then are used to accept or reject the null for the k-th correlation.Algorithm (Section .) shows the procedure in detail (the algorithm contains other details that are discussed in the next sections).
A number of further aspects need be considered in permutation tests: the number of possible reorderings of the data, the need for permutations that break the association between the variables being tested, the random selection of permutations from the permutation set when not all possible permutations can be used, the choice of the test statistic, how to correct for the multiplicity of tests, the number of permutations to allow narrow confidence intervals around p-values, among others.These topics have been discussed in Winkler et al. ( , ) and references therein and will not be repeated here.However, for cca, some aspects deserve special treatment and are considered below.

. . Choice of the statistic
Asymptotically, using Wilks' statistic λ k or Roy's θ k are expected to lead to the same conclusion since all correlations are sorted in descending order: if r k = 0, then all subsequent ones must be zero; likewise, if r k > 0, then clearly at least one correlation between k and K is larger than zero, which has to include r k itself.Moreover, permutation under the null is justifiable in the complete absence of association between the two sets, which implies that, under the null H 0 k , all correlations r k , r k+1 , . . ., r K are equal to zero.With finite data, however, one statistic can be more powerful than the other in different settings; their relative performance is studied in Sections and .
Computationally, Wilks' requires more operations to be performed compared to Roy's statistic.Since the relationship between r k and θ k is monotonic, the two are permutationally equivalent, and using r k alone is sufficient, which makes Roy's the absolute fastest.However, even for Wilks', the amount of computation required is negligible compared to the overall number of operations needed for estimation of the canonical coefficients, such that in practice, the choice between the two should be in terms of power.
In either case, while inference refers to the respective null hypothesis at position k, it is not to be understood as inference on the index k.Rather, the null is merely conditional on the nulls for all previous correlations from 1 to k − 1 having been rejected.Rejecting the null implies that the correlation observed at position k is too high under the null hypothesis of no association between the two variable sets after all previous (from 1 to k − 1) canonical variables have been sequentially removed, as described in Section . .In Algorithm (Section .) this is done in line , that uses as inputs to cca the precomputed canonical variables only from position k onwards, as opposed to all of them.

. . Multiplicity of tests
For either of these two test statistics, the ordering of the canonical correlations from larger (farther from zero) to smaller (closer to zero) imply that rejection of the null hypothesis at each k must happen sequentially, starting from k = 1, using the respective test statistic and associated p-value until the null H 0 k , for some k = {1, . . ., K}, is not rejected at a predefined test level α.Then, at that position k, the procedure stops, and the null is retained from that position (inclusive) onward until the final index K.
The ordering of the canonical correlations brings additional consequences.First, because rejection of H 0 k implies rejection of all joint (intersection) hypotheses that include H 0 k , that is H 0 1 , . . ., H 0 k−1 , such sequentially rejective procedure is also a closed testing procedure (ctp), which controls the amount of any type i error across all tests, i.e., the familywise error rate (fwer) in the strong sense (Marcus et al., ; Hochberg and Tamhane, ).Thus, there is no need for further correction for multiple testing.Another way of stating the same is that the test for a given r k , k > 1, has been "protected" by the test at the position k = 1 at the level α.Adjusted pvalues (in the fwer sense) can be computed as [p k ] fwer = max(p 1 , . . ., p k ), that is, the fwer-adjusted p-value for the canonical component k is the cumulative maximum p-value up to position k.Such adjusted p-values can be considered significant if their value is below the desired test level α.
The second consequence is that fwer adjustment of p-values using the distribution of the maximum statistic (not to be confused with the cumulative maximum described in the above paragraph) will be conservative for all canonical components except the first.The reason is that the maximum statistic is always the distribution of the first, which is stochastically dominant over all others.The distribution of the maximum is usually sought as a shortcut to a ctp when the condition of subset pivotality holds (Westfall and Young, ), as that reduces the computational burden from 2 K tests to only K tests.Interestingly, the ordering of the canonical correlations from largest to smallest leads to a ctp that does not use the distribution of the maximum, and yet requires only K tests, regardless of whether subset pivotality holds.
A third consequence is that using permutation p-values outside the above sequentially rejective procedure that controls fwer is not appropriate since these simple, uncorrected p-values are not guaranteed to be monotonically related to the canonical correlations r k .Attempts to use these uncorrected p-values outside a ctp would lead to paradoxical results whereby higher, stronger canonical correlations might not be considered significant, yet later ones, smaller, weaker, could be so; that is, it would make the test inadmissible (Lehmann and Romano, , p. ).For the same reason, such simple p-values should not be subjected to correction using false discovery Table : Taxonomy of canonical correlation analysis with respect to nuisance variables.In all cases, the aim is to find linear combinations of variables in the left and in the right sets, such that each combination from one set is maximally correlated with a corresponding combination from the other, but uncorrelated with all other combinations from either set.

Name
Left set Right set ).Let the Z be a N × R matrix of nuisance variables, including an intercept.Partial cca consists of considering Z nuisance for both Y and X.This is distinct from part cca, which consists of considering Z a nuisance for either Y or X, but not both.Finally, bipartial cca consists of considering Z a nuisance for Y, while considering another set of variables W, of size N × S, a nuisance for X.In all three cases, such nuisance variables can be regressed out from the respective set of variables of interest, then the respective residuals subjected to cca (Table ).In the parametric case, inference can proceed using the distribution of λ k or θ k (Equations and ) using C = R for partial or part, and C = max(R, S) for bipartial cca (Timm and Carlson, ; Lee, ).Permutation inference, however, requires further considerations, otherwise, as shown in Section , results will be invalid.Consider first the case without nuisance variables.Let M = [Y, X] N ×(P +Q) be the horizontal concatenation of the two sets of variables whose association is being investigated.Both Y and X occupy an N -dimensional space and, therefore, so does M.A random permutation of the rows of either of the two sets of variables will not affect their dimensionalities.For example M * = [PY, X] continues to occupy the same N -dimensional space as M.
However, residualisation changes this scenario.Let R Z = I − ZZ + be the residual forming matrix associated with the nuisance variables Z, with the symbol + representing the Moore-Penrose pseudo-inverse.R Z has the following interesting properties: R Z = R Z (symmetry) and R Z R Z = R Z (idempotency), both of which will be exploited later.In partial cca, Z can be regressed out from Y and X by computing ] be the concatenation of the residualised sets Y and X with respect to Z.While Y occupies an N -dimensional space, Ỹ occupies a smaller one; its dimensions are, at most, of a size given by the rank of R Z , which is N − R assuming Y and Z are of full rank.The same holds for X and X and, therefore, for M and M.
Permutation affects these relations: while M * still occupies a space of N dimensions as the unpermuted M, M * , differently than M, may now occupy a space with dimensions anywhere between N − R and N , depending on a given random permutation.With fewer effective observations determined by this lower space after residualisation, and the same number of variables, the sample canonical correlations in the unpermuted case are stochastically larger than in the permuted, which in turn leads to an excess of spuriously small p-values.For not occupying the same space as the original, the permuted data are no longer a similar realisation of the unpermuted, thus violating exchangeability, and specifically causing the distribution of the test statistics to be unduly shifted to the left.
Here the following solution is proposed: using the results from Huh and Jhun ( ), let Q Z be a N ×N semi-orthogonal basis (Abadir and Magnus, , p. ) for the column space of R Z constructed via, e.g., spectral or Schur decomposition, such that since, as discussed earlier, R Z is symmetric and idempotent, and While pre-multiplication by Q Z does not affect the cca results , it changes the dependence structure among the rows of the data: M = [ Ỹ, X] occupies an N -dimensional space, and so does M * = [P Ỹ, X], for a permutation matrix P of size N × N , such that exchangeability holds, thus allowing a valid permutation test.
The treatment of partial cca, as described above, can be seen as a particular case of bipartial cca in which W = Z, that is, the set of nuisance variables in both sides is the same.Of course, for bipartial cca proper, this equality does not necessarily hold, and the two sets may be different in different ways: Z may be entirely orthogonal to W, or some or all variables from one set may be fully represented in the other, either directly (e.g., some of the variables present in both sets), or as linear combinations of one set in the other, or it may be that these two sets are simply not orthogonal.The direct strategy of computing R W = I − WW + and its respective semi-orthogonal matrix Q W leads to difficulties because, unless R = S, the products Ỹ = Q Z Y and X = Q W X will not have the same number of rows: A more general solution, that accommodates bipartial and, therefore, is a generalisation for all cases of nuisance variables in cca, consists of randomly permuting rows of Ỹ and/or X using, respectively, permutation matrices P Y and P X of respective sizes N and N , therefore permuting in the lower dimensional space where Ỹ and X are exchangeable, then, crucially, reestablishing the original number N of rows using the property that the transpose of a semi-orthogonal matrix is the same as its inverse (Q = Q + ), to only then perform cca.Therefore, cca is computed using Left and right sides will continue to have rank N and N respectively, will have already been permuted, and will both have N rows.The procedure is fully symmetric in that, when the permutation matrices P Y and P X are both identity matrices (of sizes N and N , respectively), which is equivalent to no permutation, the expressions for each side reduce to the residualised data As originally proposed, in the context of the general linear model (glm), Huh and Jhun ( ) This simplification holds true also for the glm (not discussed in this article).
Table : Proposed permutation method for the various cases of cca, with respect to nuisance variables.

Name
Left set Right set cca ("full", no nuisance)  .Full and partial have matching number of rows in both sides, such that only one side needs be permuted; part and bipartial, however, have at the time of the permutation a different number of rows in each side, such that both can be permuted separately through the use of suitably sized permutation matrices PY and PX; PY is size N × N for full cca, and N × N for the three other cases; PX is size N × N for full and for part cca, and N × N for the two other cases.thus addressing the above problem of the unpermuted test statistic having a different and stochastically dominant distribution over that of the permuted data.Table summarises the proposed solution for all cases, including part cca.

. . Restricted exchangeability
The above method uses the Huh-Jhun semi-orthogonal matrix applied to cca and leads to a valid permutation test provided that there are no dependencies among the rows of M. That is, the method takes into account dependencies introduced by the regression of Z and/or W out from Y and/or X, but not dependencies that might already exist in the data, and which generally preclude direct use of permutation tests.However, structured dependencies, such as those that may exist, for instance, in studies that involve repeated measurements, or for those in which participants do not constitute independent observations, e.g., sib-pairs, as in the Human Connectome Project (hcp; Van Essen et al., ), can be treated by allowing only those permutations that respect such dependency structure (Winkler et al., ).Unfortunately, the Huh-Jhun semi-orthogonal matrix does not respect such structure, blurring information from observations across blocks, and preventing the definition of a meaningful mapping from the N original observations that define the block structure to the N or N observations that are ultimately permuted.
Such mapping, whereby each one of the N and N rows of, respectively, Ỹ and X corresponds uniquely to one of the N rows of the original data Y and X, can be obtained using a different method, due to Theil ( , ), and reviewed in detail by Magnus and Sinha ( ).Consider first the case of Z.In the Theil method, that here is adapted for , where the exponent −1/2 represents the positive definite matrix square root, and S is a N × N selection matrix, N = min(N , N ), that is, an identity matrix from which some max(R, S) rows have been removed.Pre-multiplication of a matrix by a selection matrix deletes specific rows, i.e., the ones that correspond to columns that are all zero in the selection matrix (Figure ).The Q Z Y thusly computed are the best linear unbiased residuals with scalar covariance (blus), in that they are unbiased estimates of S Y , where Y are the (unknown) true errors after the nuisance effects of Z have been removed from Y; S Y contains the variance of interest, which may be shared among linear combinations of variables in both sides; it is an estimate of that is subjected to cca and statistical testing.For partial cca, Q Z is the same for both sides; for bipartial cca, similar computations hold for the other side, i.e., Q W = R W S (SR W S ) −1/2 .Table summarises the two methods.
To construct a permutation procedure for cca that respects the block structure, the Theil method can be used to compute Q instead of the Huh-Jhun approach.Choose max(R, S) observations to be removed from both sides (for partial cca, R = S since W = Z).Construct the selection matrix S of size N × N , define the exchangeability blocks based on N observations, compute Q Z and Q W using the same S for both (for part cca, use the same strategy as for bipartial, replacing R W for I), residualise (in the blus sense) the input variables by computing Ỹ and X.These have the same number of rows, and the dependencies among these rows is the same for both sides; thus, only one side needs be subjected to random permutations that respect such existing dependencies.Optionally, after permutation, the A selection matrix is an identity matrix from which some specific rows have been removed.Pre-multiplication by a selection matrix deletes specific rows (those that correspond to columns that are all zero in the selection matrix).
Table : The semi-orthogonal matrix Q (QZ and/or QW, subscripts dropped), discussed in Sections .and ., is not unique.Two principled methods to obtain it are below.

Method Matrix
Theil ( ) Q = RS (SRS ) −1/2 Huh and Jhun ( ) QEQ = R (via svd or Schur) R is the residual-forming matrix (RZ or RW, for the respective set of nuisance variables, subscript dropped); since R is idempotent, all its eigenvalues (the diagonal elements of E) are equal to or .In the Theil method, S is a N = N × (N − R) (for Z) or N = N × (N − S) (for W) selection matrix; the matrix square root (in the exponent −1/2) is the positive definite solution.In the Huh-Jhun method, after Schur or svd factorisations of R are computed, the R or S columns of Q that have corresponding zero eigenvalues in the diagonal of E are removed, such that Q computed from the factosisation is reduced from size N × N to N × N or to N × N .At the end of these computations (see Algorithm , "semiortho", in the Appendix), for both methods Q Q = I, QQ = R, and Q = Q + .Both methods aim at obtaining residuals with a scalar covariance matrix σ 2 I. Theil explictly seeks blus residuals.However, strictly, S does not need to be a selection matrix: choose S to be QR (not to be confused with qr decomposition) using Q computed with the Huh-Jhun approach.Then, following Magnus and Sinha ( , Theorem , p. ), it can be shown that Huh-Jhun also provides blus residuals.
number N of observations may be reestablished by pre-multiplication by Q Z and Q W . Finally, cca is performed, with observation to the aspects discussed in Sections .and . .A detailed algorithm is presented in Section . .It remains to be decided how to select the max(R, S) observations to be dropped.In principle, any set could be considered for removal, provided that the removed rows of Z or W form a full rank matrix.Some informed choices, however, could be more powerful than others.One of the conclusions from Winkler et al. ( ) is that the complexity of the dependence structure and the ensuing restrictions on exchangeability leads to reductions in power.Thus, natural candidates for removal are observations that, once removed, cause the overall dependence structure to be simpler.For example, it is sometimes the case that some observations are so uniquely related to all others that there are no other observations like them in the sample.These observations, therefore, cannot be permuted with any other, or perhaps with only a few.Their contribution to hypothesis testing in the permutation case is minimal, and their removal are less likely to affect a decision on rejection of the null hypothesis.Consider for example a design that has many monozygotic, dizygotic, and non-twin pairs of subjects, and that in the sample, there happens to be a single pair of half-siblings.It is well known that, for heritable traits, genetic resemblance depends on the kinship among individuals; half-siblings are expected to have a different degree of statistical dependency among each other compared to each one of the other types of sibships in this sample.Thus, in there being just one such pair, it would be reasonable to prioritise it for exclusion, while keeping others.

. . General algorithm
A set of steps for permutation inference for cca is described in Algorithm .In it, input variables Y and X will have been mean-centered before the algorithm begins, or an intercept will have been included as nuisance variable in both Z and W. P = {(P Y , P X ) j } is a set containing pairs of permutation matrices indexed by j = {1, . . ., J}.In this set, the first permutation is always "no permutation", i.e., (P Y , P X ) j=1 = (I N ×N , I N ×N ), such that (λ k ) * j=1 = λ k , for all k.For the cases in which only one side of cca needs be permuted (Table ), or for the cases in which R = S, or when there are dependencies among the data such that the Theil method is used to construct Q (Table ), then (P X ) j can be set as I for all j.Details on how P is defined in observance to the null hypothesis and respecting structured dependencies among the data have been discussed in Winkler et al. ( , ).In the algorithm, P can be larger, equal, or smaller than Q.Optional input arguments are the matrices with nuisance variables Z and W, and the selection matrix S. If Z is supplied but not W, then the algorithm performs part or partial cca, depending on the Boolean argument partial; if both Z and W are supplied, the algorithm performs bipartial cca; if neither is supplied, then "full" cca is performed.If S is supplied, then the blus residuals based on Theil are used; otherwise, Huh-Jhun residuals are used.For either of these two cases, the semi-orthogonal matrix Q is computed using a separate, ancillary function named "semiortho", described in the Appendix.
An initial cca using residualised data is done in line ; this uses another ancillary function, named "cca", and also described in the Appendix; this function returns three results: the canonical coefficients A and B, and the canonical correlations r k .The canonical coefficients are used to compute the canonical variables U and V, augmented by their orthogonal complement needed to ensure that they span the same space as the variables subjected to this initial cca; the canonical correlations are ignored at this point and not stored (hence the placeholder "_").A counter c k for each canonical component is initialised as .
The core part of the algorithm are the two loops that run over the permutations in P and the K canonical components (between lines and ).At each permutation j, cca is executed K times.In each, the columns of U and V that precede the current k are removed, such that their respective variances are not allowed to influence the canonical correlations at position k.At each permutation, the K canonical correlations are obtained (the third output from the function "cca") and used to compute the associated test statistic.As shown, Wilks' statistic, λ k , is used, simplified by the removal of the constant term, which does not affect permutation p-values.For numerical stability, sum of logarithms is favoured over the logarithm of a product (compare line with Equation ).For inference using Roy's statistic, replace the condition (λ k ) * j λ k for (r k ) * j r k in line ; this modification alone is sufficient as θ k is permutationally equivalent to r k .In that case, computations indicated in line are no longer needed and can be removed to save computational time.
Whenever the statistic for the correlation at position k in a given permutation is higher or equal than that for the unpermuted data, the counter c k is incremented (line ).After the loop, the counter is converted into a p-value for each k.These simple, uncorrected p-values, however, are not Algorithm : Permutation inference for cca.
Require: Y N ×P , X N ×Q , P. Optional: Z N ×R , W N ×S , S, partial. Inputs.
If left-side nuisance were defined.
: R Z ← I − ZZ + Residual forming matrix due to Z. : If W not given, and this is partial cca. : If right-side nuisance were defined.
Canonical variables from residualised X. : for k ∈ {1, . . ., K} do For each component. : Initialise a counter.: end for : for all (P Y , P X ) j ∈ P do For each permutation. : for k ∈ {1, . . ., K} do For each component. : If statistic after permutation is larger.
: useful.Instead, fwer-adjusted p-values are computed under closure using the cumulative maximum, i.e., the p-value for r k is the largest (least significant) uncorrected p-value up to position k.The algorithm returns then these adjusted p-values, which can be compared to a predefined test level α to establish significance.Note that α itself is never used in the algorithm.
As presented, the algorithm does not cover dimensionality reduction or any penalty to enforce sparse solutions for cca.Dimensionality reduction using methods such as principal component analysis (pca), if included, would be performed after residualisation, but before cca.Thus, in the algorithm, pca or ica, if executed, would be done between lines and .As for the many forms of sparse or penalised cca (Nielsen, ; ), in principle these can be incorporated into the algorithm through the replacement of the classical cca in lines and for one of these methods.

. Evaluation Methods
In this section we describe the synthetic data and methods used to investigate error rates and power under the different choices for the various aspects presented in Section at each stage of a permutation test for cca, providing empirical evidence for the approach proposed.An overview of these aspects and choices at each stage is shown in Table .For each case, we use a series of simulation scenarios: each consists of a set of synthetic variables constructed using random values drawn from a normal or a non-normal (kurtotic or binary) probability distribution, sometimes with or without dimensionality reduction using principal components analysis (pca; Hotelling, ; Jolliffe, ), sometimes with or without signal, and sometimes with or without nuisance variables.We also consider cases with large sample sizes and large number of variables.An overview of these scenarios (there are twenty of them) is in Table .We start by investigating aspects related to the estimation of the canonical components at each permutation.Specifically, we consider (a) a one-step estimation of all canonical components, from 1 to K, versus (b) sequential estimation that removes, for the k-th canonical component in a given permutation, the variance already explained by the previous ones, as described in Table : A valid permutation test for cca needs to consider several computational and statistical aspects, which can be seen as steps in a procedure; each of them may be addressed using different strategies, some of which are studied below.The recommendations for the different options for each case (column "Use") are based on theoretical grounds shown in the sections as indicated in the table, and verified empirically using synthetic data (Section ).
Step Can or should be used.
• Can but should not be used.Cannot or should not be used.In the table, R and S refer to the number of nuisance variables other than the intercept, which is always included (so the number of nuisance variables in left and right sides for all the simulation scenarios was always, respectively R + 1 and S + 1).For partial cca, the number of nuisance variables on one side is always the same as in the other, i.e., S = R, but that does not have to be for bipartial cca, even though here the same size was used.The case with larger samples was used for investigation of partial cca.
Section . .With respect to the inclusion of the complement of the canonical coefficients, we consider (a) without the inclusion of the null space of the canonical coefficients, versus (b) with its inclusion so as to ensure that all variance from the original data not explained in the initial cca is considered in the estimation at every permutation, as described in Section . .With respect to multiple testing, we consider the following strategies: variables on the right side (X) (the procedure is symmetric; the choice of sides is arbitrary and does not affect results); for these six scenarios, data are drawn from one of three possible distributions: a normal distribution with zero mean and unit variance, a Student's t distribution with variable degrees of freedom ν = {2, 4, 6, 8, 10} (kurtotic), or a Bernoulli distribution with parameter q = 0.20 (binary).Analyses with and without dimensionality reduction to variables using pca are considered.The number of permutations used to compute p-values was set as J = 2000, with 2000 realisations (repetitions), thus allowing the computation of error rates.
We then turn our attention to aspects related to nuisance and residualisation discussed in Sections .and . .We consider (a) simple residualisation, (b) residualisation using the Huh-Jhun method, and (c) residualisation using the Theil method.For this purpose, scenarios vii-xvii are constructed similarly as i-vi, except that a third set Z of R = 15 variables is used as nuisance for partial cca, whereas two other scenarios, xiii and xiv, a fourth set of variables W of S = 15 variables is used as nuisance for bipartial cca.
The impact of ignoring, in samples substantially larger than the number of variables, the dependencies introduced by the residualisation of both sides of cca is studied with scenarios xv and xvi, that consider samples progressively larger, N = {100, 200, . . ., 900, 1000}, while keeping the other parameters similar as in scenarios vii and viii.Finally, we briefly investigate power and the choice of the test statistic: we consider (a) Wilks' statistic (λ k ), as well as (b) Roy's largest root (θ k ), as discussed in Sections .and . .We define scenarios xvii and xviii similarly as i, this time including a strong, true signal in one canonical component, thus named "sparse", or multiple, weaker signals shared across multiple (half of the smaller set, thus, "dense").For all scenarios, an intercept is always included as nuisance variable in both sides such that the actual number of nuisance variables is R + 1 and S + 1 for each side, respectively.To report confidence intervals ( %), the Wilson ( ) method is used.

. Results
In the results below, the Sections .and .establish empirically that with an estimation method (i) that includes the null space of the canonical coefficients, (ii) that finds the canonical correlations in an iterative manner, and (iii) that after computing p-values through a closed testing procedure, the error rates are controlled.The subsequent results, from Section .onwards, consider only this valid approach.

. . Estimation strategies
Not including the complement of the canonical coefficients (null space) caused error rates to be dramatically inflated, well above the expected test level α = 0.05 ( %), regardless of whether the estimation used the single step or the stepwise procedure, and regardless of any of the multiple testing correction strategies discussed; these results are shown in Table .Even when the null space of the canonical coefficients was included, a single step procedure was never satisfactory.To understand this, consider the following consequence of the theory presented in Sections .and .: for a valid, exact test in cca, the expected error rate for each H 0 k , i.e., the per comparison error rate (pcer; Hochberg and Tamhane, ) is α for k = 1, but for k = 2 it is α × α, since the null can only be rejected if the previous one has also been declared significant at α.More generally, the pcer for a valid test is α k for the k-th test, i.e., for the k-th canonical correlation.If the test level is set at %, then the pcer is % for k = 1, .% for k = 2, .% for k = 3, and so forth.Error rates above this expectation render the test invalid; below render it conservative.In the simulations, a single step procedure never led to an exact test, with or without consideration to multiple testing, as shown in Table . .

. Multiple testing
As with the pcer, it is worth mentioning what the expected fwer for a valid, exact test is.That expectation is the test level itself, i.e., α.Any Table : Observed per comparison error rate (%) and % confidence intervals for the first canonical correlations in scenario i, assessed using the Wilks' statistic and three different multiple testing correction methods; the observed familywise error rate (fwer) for each case is also shown.Valid methods should have a fwer close to the nominal %, and pcer close to the nominal (5%) k ; see Sections .and .for details.Using the Roy's statistic led to similar results as with Wilks (not shown).Dimensionality reduction with pca led to similar results for the case in which the null space is included (not shown).For the case in which the null space is not included, results are not comparable with the ones above because, after pca, P = Q in the simulations, such that there is no null space to be considered as the matrices with canonical coefficients in both sides are then square.
higher error rate renders a test invalid; lower error rate renders it conservative, though valid.Table shows the fwer for the three different correction methods considered.
If the null space was not included, since the pcer was not controlled, the fwer could not be controlled either (first two columns).If the null space of the canonical coefficients was included (last two columns), even though the single step estimation controlled the pcer, the fwer was not controlled for the simple, uncorrected p-values (third column, upper panel), which is not surprising.It should be emphasised, however, that these simple p-values have another problem: they are not guaranteed to be monotonically related to the respective canonical correlations, such that it is possible that, using these p-values, the null hypothesis could be rejected for some canonical correlation, but retained for another that happens to be larger than the former.The use of such uncorrected, simple p-values, therefore, constitutes a test that is inadmissible.The problem with lack of monotonicity with uncorrected p-values is less severe if estimation is done in a stepwise manner (fourth column, upper panel), but is nonetheless still present, as shown in Figure , and has potential to lead to an excess fwer, even though that did not occur in these simulations.
For the other two correction methods, when the null space of the canonical coefficients was included in the estimation process, fwer was controlled (third and fourth columns of Table , middle and lower panels), but there are particularities.Using the distribution of the maximum (lower panel) led to very conservative pcer, for both single step or stepwise estimation, whereas correction with closure led to invalid pcer for single step estimation (third column, middle panel).
The only configuration that led to exact (neither conservative or invalid) control over pcer and fwer, and a monotonic relationship between canonical correlations and associated p-values, is the one in which a stepwise estimation was performed, with the null space of the canonical coefficients included, and with correction using a closed testing procedure (fourth column, middle panel of Table ).Moreover, the fwer, when controlled using the cumulative maximum or the distribution of the maximum statistic, is guaranteed to match the pcer for k = 1: in the former case, any further rejection of the null is conditional on the first one having been rejected; in the latter, the distribution of the maximum coincides with the distribution Figure : Relationship between canonical correlations (horizontal axes) and associated pvalues (vertical axes) for realisations of scenario i, considering two estimation methods (single step and stepwise) and three multiple testing correction methods (uncorrected, corrected using the cumulative maximum, and corrected using the distribution of the maximum statistic).The figure complements Table by showing example realisations that average to the error rates shown in the table for the cases in which the null space is included.For simple, uncorrected p-values, the test is inadmissible; for corrected using the distribution of the maximum statistic, the test is overly conservative; single step does not control the familywise error rate.
Table : Observed per comparison error rate (pcer, %) and % confidence intervals for the first canonical correlations in scenarios vii (partial cca) and xiii (bipartial cca), for the three different methods considered for treatment of nuisance variables.

Simple residuals
Huh-Jhun Theil Estimation included the null space of the canonical coefficients and a stepwise procedure, assessed using the Wilks' statistic, and corrected using a closed testing procedure (ctp).
The ctp guarantees that the familywise error rate (fwer) matches the pcer for the first canonical correlation (i.e., for k = 1).Using the Roy's statistic led to similar results as with Wilks'; likewise, dimensionality reduction with pca led to similar results (not shown).
of the first as the canonical correlations are ranked from largest to smallest.

. . Nuisance variables
For partial cca, simple residualisation, even using the above procedure (stepwise estimation, null space included, correction via closure), resulted in the error rates being dramatically inflated, as shown in Table .The Huh-Jhun and the Theil residualisation methods, in contrast, resulted in the error rates being controlled at the nominal level, with no excess false positives.For bipartial cca, the problem did not happen in the simulation settings: simple residualisation of both sides by entirely different sets of variables did not cause the error rates to be inflated; yet, using Huh-Jhun or Theil also produced nominal error rates, suggesting that these could be used in any configuration of nuisance variables, regardless of whether those in one side are not independent from those in the other.
Table : Observed per comparison error rate (pcer, %) and % confidence intervals for the first canonical correlation in scenarios i-vi (without nuisance) and vi-xii (partial cca, with Huh-Jhun), considering different distributions for the data.

Distribution
Without nuisance Partial cca ν: Degrees of freedom of the Student's t distribution used to simulate data; q: Parameter of the Bernoulli distribution used to simulate data.Estimation used the null space of the canonical coefficients and a stepwise procedure, assessed using the Wilks' statistic, and corrected using a closed testing procedure (ctp).The ctp guarantees that the familywise error rate (fwer) matches the pcer for the first canonical correlation (i.e., for k = 1).
Using the Roy's statistic led to similar results as with Wilks'; using Theil led to similar results as Huh-Jhun; likewise, dimensionality reduction with pca led to similar results (not shown).

. . Non-normality
Without nuisance variables and with kurtotic data simulated using a Student's t distribution with a small number of degrees of freedom, ν = {2, 4, 6, 8, 10}, as well as with binary data simulated using a Bernoulli distribution with parameter q = 0.20, error rates were controlled nominally, as shown in Table .In partial cca, however, even using the Huh-Jhun method, highly kurtotic data led to excess error rates.In particular, for the simulated data using a Student's t distribution with degrees of freedom of only ν = 2, the observed error rate was .%, for a test level of %; using the Theil method led to also inflated error rate in this case, with .% ( % confidence interval: .-. , not shown in the table).For ν 4, error rates were controlled at the nominal level, for both Huh-Jhun (Table ) and Theil (not shown).

. . Large samples
Increasing the sample size while keeping the number of variables fixed progressively reduced the amount of errors for the simple residualisation method to treat nuisance variables, as shown in Table ; the trend was similar with or without dimensionality reduction using pca.The reduction in the error rate as the sample size increased did not affect Huh-Jhun or Theil methods, for which error rates were already controlled even with a relatively smaller sample compared to the number of variables.

. . Dimensionality reduction
Dimensionality reduction with pca did not affect error rates (pcer and fwer) with respect to single step vs. stepwise estimation of canonical coefficients, nor correction for multiple testing, nor method for addressing nuisance variables.That is, these results (not shown) were indistinguishable from those obtained without pca (shown above).Moreover, as the simulations used the same number of principal components for both sides of cca, including or not the null space could not have affected results, as P = Q after dimensionality reduction.Using pca did yield higher power to detect effects, for both Wilks' and Roy's test statistics (Table , next item).This apparent extra power can be attributed to the smaller number of variables after pca, as the principal components that were retained contained most of the simulated signal, which, given the reduced dimensionality of the set of data, could then be detected with higher likelihood.

. . Choice of the statistic
The results above, that consider solely the error rates, and are based on results with the Wilks' statistic (λ k ), are essentially the same for Roy's largest root (θ k ; results not shown).That is, results regarding the estimation strategies, multiple testing, nuisance variables, non-normality, behaviour with large samples, and dimensionality reduction with pca, are virtually the same for Wilks' and Roy's statistics.In the presence of synthetic signal, however, the two test statistics diverged.Table shows that, with signal spread across multiple canonical components (i.e., "dense"), Wilks' is substantially more powerful than Roy's statistic.With signal concentrated in just one (the first) canonical variable (i.e., "sparse"), the trend reverses, and Roy's become more powerful than Wilks'.

. . Permutation tests
Compared to univariate, multivariate tests pose the problem of establishing the distributional form for more complicated test statistics; in the Without pca: P = 20, Q = 16.With pca: P = Q = 10.Estimation included the null space of the canonical coefficients and a stepwise procedure, assessed using the Wilks' statistic, and corrected using a closed testing procedure (ctp).The ctp guarantees that the familywise error rate (fwer) matches the pcer for the first canonical correlation (i.e., for k = 1).Using the Roy's statistic led to similar results as with Wilks'.The confidence intervals are wider than for other tables because the number of realisations (and also of permutations) was smaller (Table ) Table : Observed power (%) and % confidence intervals for the first canonical correlation in scenarios xvii and xviii, that included a synthetic signal added to either one or half ("sparse" or "dense", respectively) of the initial variables, thus captured by only one (sparse case) or multiple (dense case) canonical correlations.

Signals
Estimation used the null space of the canonical coefficients and a stepwise procedure, assessed using the Wilks' statistic, and corrected using a closed testing procedure (ctp).The ctp guarantees that the familywise error rate (fwer) matches the pcer for the first canonical correlation (i.e., for k = 1).
parametric case, inference is marred by a set of difficulties: the assumption that all observations are independent and identically distributed following normal theory, the extremely complicated formulas for the density of the canonical correlations, which further depend on the (unknown) population canonical correlations, the sensitivity of asymptotic approximations to departures from assumptions, bias in estimations of parameters, and the validity of these approximations only for particular cases.
Permutation tests address these difficulties in different ways, and their advantages are well known (Ludbrook and Dudley, ; Nichols and Holmes, ; Good, ; Pesarin and Salmaso, ): no underlying distributions need be assumed, non-independence and even heteroscedastic variances can be accommodated, non-random samples can be used, and a wide variety of test statistics are allowed.Moreover, all information needed to build the null distribution lie within the data, as opposed to in some idealised population.
These many benefits extend to inference for multivariate methods.In the case of cca, one benefit is immediately obvious: the complicated formulas and charts for the distribution of the canonical correlations can be bypassed completely, thus with no need to appeal to distributional assumptions.In effect, as shown in Section ., even with all variables not following a normal distribution, error rates were still controlled at the nominal level.It should be noted, however, that extremely kurtotic data, such as that generated with a Student's t distribution with extremely low degrees of freedom, caused results to be invalid in the presence of nuisance variables, even with the Huh-Jhun or Theil methods.Such data, however, are rare (recall that with degrees of freedom, the Student's t distribution has infinite variance); most applications of cca investigate datasets that have variables with data that have diverse distributional properties.
Yet, although in the univariate case, algorithms for permutation inference tend to be relatively straightforward to implement and do lead to valid results, for cca, the theory presented in the previous sections and the results with synthetic data show that a simple permutation algorithm that does not consider aspects such as a stepwise estimation of the canonical correlations, nor the inclusion of the null space of canonical coefficients when the two sets of variables do not have the same size, or that does not accommodate specific treatment for nuisance variables, or addresses multiplicity respecting the ordering of the canonical correlations, lead to invalid results.

. . Estimation and multiple testing correction
Results from Sections .and .show that the estimation method that leads to exact, valid results (neither conservative or invalid) is the one that estimates one canonical correlation one at a time, in a stepwise, iterative manner, that includes the null space of the canonical coefficients when the sets of variables have different sizes (i.e., when P = Q), and that computes adjusted p-values using a closed testing procedure.All alternative approaches led to either invalid or conservative results when considering pcer or the fwer.
It should be emphasised, however, that there are cases in which the naïve permutation method, described at the beginning of Section .remains valid.The method is valid whenever only the first (k = 1) canonical component is of interest, and there are no nuisance variables or, if there are nuisance variables, those in the left and right side (Z and W) are completely orthogonal (thus, excluding partial cca).Even though the naïve method was not explicitly tested, it is equivalent to the single step method with the null space included, which in the simulations led to an error rate of .% at test level .
(Table ).The reason why it remains valid is that, if interest is only in the first canonical component, there is no need to perform an initial cca to allow stepwise removal of previous (before the current k components).Moreover, there is no multiple testing to be considered.
The last column of Table may suggest that uncorrected p-values (upper panel) and a ctp (middle) are equivalent for stepwise estimation.They are not, and their differences are manifest in two ways, both previously discussed: first, uncorrected p-values are not monotonically related to the canonical correlations (Figure ), and second, fwer has potential to be higher than the pcer for k = 1, even though that did not happen in the simulations.

. . Inference in the presence of nuisance variables
It is sometimes the case that known, spurious variability needs to be taken into account.For example, variables such as age and sex are often considered confounds.Merely regressing out such nuisance variables from all other variables that are subjected to cca, then proceeding to a simple permutation test, leads to inflated error rates and an invalid test, as expected from Section ., and evidenced by the results in Section . .The depen-dencies among observations introduced through the residualisation renders the data no longer exchangeable.
This inflated error rate, even after multiple testing correction, is the probably the most striking finding of the current study, as the results can be dramatically affected, particularly if the number of nuisance variables is relatively large compared to the sample size, as shown in Section . .Transformations that make residuals exchangeable again, through the use of a lower dimensional basis where exchangeability holds, namely, the Huh-Jhun and Theil methods, mitigate the problem, as evidenced by the theory and through the simulations.
Even though both methods led to similarly controlled error rates, they are not equivalent: Huh-Jhun always leads to same canonical components as they would have been obtained from the residualised data, whereas the Theil method can allow for multiple, different solutions depending on the choice of the selection matrix S. Theil ( ) suggested that the choice of the observations to be dropped should consider power; here we suggest that the choice of S can be based on restrictions on exchangeability: if all original data are freely exchangeable, the Huh-Jhun method is a preferable choice in that it does not require an additional arguments that affect the results; however, it does require a Schur or singular value decomposition of the residual-forming matrix, which is a rank-deficient matrix, such that numerical stability should also be a factor for consideration.
For bipartial cca, while error rates were controlled even in the simple residualisation case, it should be noted that Z and W were generated independently in the simulations, such that they were expected to be orthogonal.With real data, possible overlap among columns or linear combinations of columns between Z and W create a case that would lie between the two extremes of partial and bipartial cca.In such case, and given the results for partial cca, error rates are not expected to be controlled with simple residualisation.Huh-Jhun and Theil, being able to deal with the most extreme case of dependencies between Z and W (that is, when the two are the same, which define partial cca), constitute a general solution to all cases.

. . Relationship with the glm
The dangers of residualising both dependent and independent variables in the general linear model (glm) with respect to nuisance variables, then proceeding to a permutation test, as proposed originally by Kennedy ( ) are well known (Anderson and Robinson, ).It is not a complete surprise, therefore, that permutation inference for cca would lead to invalid results in similar settings.The original Huh and Jhun ( ) method (see also Kherad-Pajouh and Renaud, ) was proposed for the glm as a way to address shortcomings of the Kennedy method in accommodating nuisance variables.Both Kennedy and Huh-Jhun were evaluated by Winkler et al. ( ): among the methods that can be considered for permutation inference in the glm, Huh-Jhun is the only that cannot be used directly with exchangeability blocks, as the reduction to a lower dimensional space does not respect the block structure.The solution proposed here for permutation inference for cca in the presence of exchangeability blocks, that uses the Theil method, is expected to solve the same problem also for the glm, i.e., as a replacement for Huh-Jhun in cases where the data have a block dependence structure, as it does for freely exchangeable data (Ridgway, ).As in the univariate case, permutation tests in the presence of nuisance variables are approximate.Their exactness is in the sense that, under the null hypothesis, the probability of finding a p-value smaller or equal to the test level is the test level itself.Such tests are not perfectly exact as the true relationship between the nuisance variables and the variables of interest are not known and needs be estimated.Even in the absence of nuisance variables, however, permutation tests that use only a fraction of the total number of possible permutations are also approximate, for not covering the whole permutation space (the number of potential permutations tends to be very large, and grows very rapidly with increases in sample size).The same holds for other resampling methods that do not use all possible rearrangements of the data.Regardless of the reason why the tests are approximate, results are known to converge asymptotically to the true pvalues.

. . Choice of the statistic
Among the two test statistics considered, Wilks' (λ k ) tends to be more powerful than Roy's (θ k ) for effects that span multiple canonical components; the converse holds for signals concentrated in only a few of the canonical components, i.e., when many of the canonical variables are zero; in these cases, Roy's tend to be more powerful than Wilks', as shown in Section . .The respective formulas (Equations and ) give insight on why that is the case: Roy's statistic is invariant to canonical correlations other than the first (largest), whereas Wilks' pool information across all correlations; past simulations, reviewed by Johnstone and Nadler ( ), corroborate to the finding.
The use of these two statistics for any canonical correlation other than the first (i.e., for k > 1) is possible in the proposed iterative procedure because, for the current position k being tested, all the variance associated with the previous canonical components at positions {1, . . ., k − 1} will have already been removed from the model (Sections .and .), such that the largest canonical correlation (Roy's statistic) is the current one being tested; for Wilks', the procedure holds because these earlier canonical correlations are not marked as zero; instead, they are ignored altogether when the statistic is computed, as if the previous canonical components have never existed.
Wilks' lambda and Roy's largest root are not the only possible statistics that can be considered for cca, and permutation tests allow the use of yet others.Some, such as Hotelling-Lawley and Pillai-Bartlett, were considered by Friederichs and Hense ( ).Using simulations and Monte Carlo results, the authors found that parametric distributions of these classical multivariate statistics were accurate, and could be obtained quickly at low computational cost; it should be noted, however, that the study used normally distributed simulated data, in which case parametric assumptions are known to hold.

. . Relationship with previous studies
While a number of studies have used permutation tests with cca, not many investigated the performance of these tests.Nandy and Cordes ( ) proposed a non-parametric strategy for inference with cca for the investigation of task-based fmri time series: the method uses a resting-state (no task-related activity) dataset to build the null distribution; as resampling time series can be challenging due to temporal autocorrelation, the null distribution uses multiple voxels selected far apart from each other so as to also avoid issues with spatial correlation.The approach differs from the one presented here in that it uses subject-level time series (as opposed to between-subject analyses), is specific to brain imaging (the proposed method is general) and does a resampling method that shares similarities with, yet is not the same as permutation.Eklund et al. ( ) specifically used permutation tests for cca with fmri time series whitened with a combination of methods to allow permutation; the authors demonstrated that permutation tests for both the glm and cca could be greatly accelerated through the use of graphics processing units (gpus).
Kazi-Aoual et al. ( ) proposed analytical formulas for the first three moments of the permutation distribution of Pillai's trace for cca; these moments can be used to fit a Pearson ( ) type iii distribution, from which p-values can be obtained.Legendre et al. ( ) studied parametric and permutation tests for redundancy analysis (rda; Rao, ) and for canonical correspondence analysis (referred to also by the acronym "cca"; ter Braak, ); the authors found that a simultaneous test of all canonical eigenvalues for the respective axes (eigenvectors of predicted response variables in a linear model) in rda, despite simple, is not valid, whereas a marginal test on each eigenvalue, as well as a "forward" test in which previously tested canonical axes are added to a matrix of nuisance variables, performs well, even if conservatively for axes other than the first.Yoo et al. ( ) investigated the relationship between cca and regression, proposing the use of permutations and studying cases without nuisance variables; in the method, for the k-th canonical correlation, variance not already explained by canonical variables {1, . . ., k} in one of the sides is permuted, whereas the other variables remain fixed.Turgeon et al. ( ) considered using a small number of permutations for cca, recording of the empirical distribution function, then using it to estimate the parameters of a Tracy-Widom distribution (Tracy and Widom, ; Johnstone, ) for cases in which the number of observations is smaller than the number of variables in either Y or X; the distribution is then used to obtain p-values; data are assumed to follow a normal distribution, and inference is for the largest canonical correlation.
Permutation tests for the method of partial least squares (pls; ) used a permutation test to investigate how differences in the strength of association between variables (magnitude of estimates) further differed between two or more groups.These would be equivalent to, in the context of cca, testing whether canonical correlations obtained across different groups would differ.Le The current paper therefore fills a substantial knowledge gap, whereby not many studies considered at all the validity of permutation inference with cca, but those that did approach the topic were not sufficiently general; none covered the topics discussed here.Moreover, in principle, the method as proposed can be used with subject-level fmri or other timeseries data provided that whitening has been successful in removing temporal dependencies.Additionally, given the conceptual similarity between pls and cca, it is possible that permutation inference for pls would require similar strategies as described in Section , particularly in the presence of nuisance variables and re-use of variance already explained.Whether that is the case, it is a question that remains open for future investigation.

. . Recommendations
Given the above results, the main recommendations for permutation inference for cca can be summarised as follows: • When studying a given k-th canonical variable or canonical correlation, 1 < k K, remove the effects of the previous ones, i.e., the variance from one set that has already been explained by the other, as represented by the earlier canonical variables.These effects are surely significant (regardless of the test level), otherwise the current canonical variable or correlation would not be under consideration.Ignoring the earlier ones cause the error rates be inflated (empirical evidence provided in Section .).
• For sets of variables with different sizes (i.e., P Q), ensure that the variability not represented by the canonical variables produced at the first permutation is considered in all and every permutation.That is, include the null space of the canonical coefficients when computing the variables subjected to permutation.Not including the null space leads to excess false positives (empirical evidence provided in Section .).
• Do not use simple p-values for inference, and make sure that a closed testing procedure is used.Using simple, uncorrected p-values has two negative consequences: (i) both the pcer and fwer are inflated, and (ii) since simple p-values are not guaranteed to be monotonically related to the canonical correlations, the resulting test is inadmissible (empirical evidence provided in Section .).
• For the same reason, do not use fdr to correct for multiple testing after using simple p-values: while the p-values themselves satisfy the requirements of fdr, they lead to an inadmissible test even after correction, leading to non-sensical results whereby a stronger canonical correlation may be less significant than a weaker one (empirical evidence provided in Section .).
• While valid, inference using the distribution of the maximum statistic across canonical correlations leads do conservative results, except for the first canonical correlation (empirical evidence provided in Section .).
• If regressing out nuisance variables from both sets of variables subjected to cca, make sure that the residuals are transformed to be exchangeable, e.g., with the Huh-Jhun or Theil methods, then permuted accordingly.Failure to observe this recommendation leads to excess false positives, particularly when the number of nuisance variables is a large fraction of the sample size (empirical evidence provided in Sections .and .).
All these recommendations are integrated into Algorithm .

. Conclusion
As evidenced by the theory and simulations in the previous sections, a simple permutation procedure leads to invalid results: (i) simple p-values are not admissible for inference in cca, lead to excess pcer and fwer, and cannot be corrected using generic methods based on p-values such as fdr; (ii) ignoring the variability already explained by previous canonical variables leads to inflated error rates for all canonical correlations except for the first; (iii) regression of the same set of nuisance variables from both sides of cca without further consideration leads to inflated error rates; and (iv) the classical method for multiple testing correction, that uses the distribution of the maximum statistic leads to conservative results.The use of a stepwise estimation procedure, transformation of the residuals to a lower dimensional basis where exchangeability holds, and correction for multiple testing via closure, ensures the validity of permutation inference for cca.

Appendix A. Ancillary functions
Algorithm requires two relevant ancillary functions: one to compute the semi-orthogonal matrix Q, and another to conduct the cca proper and obtain the canonical coefficients A and B; these two functions are described in pseudo-code in Algorithms and .The "semiortho" function takes as input a residual-forming matrix R and, optionally, a selection matrix S. If S is supplied, it computes Q using the Theil method; otherwise, it uses the Huh-Jhun method (Table ).
As shown, "semiortho" uses Schur decomposition for Huh-Jhun, but that decomposition can be replaced by singular value decomposition (svd) or qr decomposition.Another possibility consists of never using R directly, computing instead an orthogonal basis for the null space of Z (not shown; it would require taking Z as an input argument).All these are expected to produce the same results.However, as the residual-forming matrix is rank deficient and idempotent, all its eigenvalues are identical to or .Thus, considerations about numerical stability and float point arithmetic (Moler, ), as well as speed, should determine the best choice for a particular programming language or computing architecture.
The "cca" function takes as main inputs the sets of variables Y and X.These will have been mean-centered and possibly residualised outside the function, such that no further mean-centering or residualisation is performed; if mean-centering was performed, then, at a minimum, the other two arguments are R = S = 1; if other variables were regressed out, as in part, partial, or bipartial cca, then R and S are supplied with their corresponding values.The algorithm uses the method described by Bjorck and Golub ( ), and is based on results of Olkin ( ) and Golub ( ); additional details can be found in Seber ( ). Inside this function, variables Q and R (subscripts omitted) refer to the factors of a qr factorization, hence with a different meaning than the similarly named matrices used elsewhere this paper.In the algorithm, Y and X are subjected to qr decomposition al., ); Kernbach et al. ( ) used cca to identify connectivity patterns in the default mode network associated with patterns of connectivity elsewhere in the brain; Bijsterbosch et al. ( ), Xia et al. ( ), and Mihalik et al. ( ) likewise used cca to identify associations between functional connectivity and various indices of behaviour and psy-chopathology, whereas Sui et al. ( ) used a combination of multivariate methods, including cca, to investigate brain networks associated with composite cognitive scores; Li et al. ( ) used cca to investigate, among subjects, the topography of the global fmri signal and its relationship with a number of cognitive and behavioral measurements; Ing et al. ( ) used cca to identify symptom groups that were correlated with brain regions assessed through a diverse set of imaging modalities; Alnaes et al. ( ) used cca to investigate the association between imaging measurements and cognitive, behavioral, psychosocial and socioeconomic indices; Clemens et al. ( similarly defined matrix for the column space of RW, N = N − S. The bipartial cca case generalizes all others: for "full" cca, RW = RZ = IN×N , and so, QW = QZ = IN×N ; for partial cca, W = Z; for part cca RW = IN×N , and so, QW = IN×N .For full and partial, pre-multiplication by QZ can be omitted since Q Z QZ = IN×N , such that results do not change.Once these simplifications are considered, the general bipartial cca case reduces to the other three as shown in the Table Figure :A selection matrix is an identity matrix from which some specific rows have been removed.Pre-multiplication by a selection matrix deletes specific rows (those that correspond to columns that are all zero in the selection matrix).
(a) simple, uncorrected p-values, [p k ] unc , (b) corrected under closure, [p k ] clo , and (c) corrected using the distribution of the maximum statistic [p k ] max ; both [p k ] clo and [p k ] max offer fwer control, as discussed in Section . .Keeping the same notation, we define scenarios i-vi consisting of N = 100 observations, with P = 16 variables on the left side (Y) of cca and Q = 20 (c) Corrected, distribution of the maximum, [p k ] max .
Table:Observed familywise error rate (fwer, %, that matches the pcer for k = 1) and % confidence intervals for scenarios xv and xvi, used to investigate effect of different sample sizes, for the three different methods for dealing with nuisance variables.

Wilks
Floch et al. ( ) investigated strategies for dimensionality reduction and regularisation for imaging and genetic data, whereas Grellmann et al. ( ) compared the performance of variants of cca and pls for similar problems.Both studies used direct permutation of the data, and were mostly focused on the relative performance of the different methods, offering no specific treatment of nuisance variables or the other aspects considered here, and which concern validity.Monteiro et al. ( ) investigated a strategy for sparse pls and sparse cca in which data are split into training and hold out, and inference uses permutation of the training data, with coefficients applied to test data, in which measurements of association are computed.
is a residual forming matrix that considers the nuisance variables in Z, and is computed as IN×N − ZZ + , where the symbol + represents a pseudo-inverse.RW is computed similarly, considering the nuisance variables in W. The choice which set is on left or right side is arbitrary.