Permutation tests for the equality of covariance operators of functional data with applications to evolutionary biology

In this paper, we generalize the metric-based permutation test for the equality of covariance operators proposed by Pigoli et al. (2014) to the case of multiple samples of functional data. To this end, the non-parametric combination methodology of Pesarin and Salmaso (2010) is used to combine all the pairwise comparisons between samples into a global test. Different combining functions and permutation strategies are reviewed and analyzed in detail. The resulting test allows to make inference on the equality of the covariance operators of multiple groups and, if there is evidence to reject the null hypothesis, to identify the pairs of groups having different covariances. It is shown that, for some combining functions, step-down adjusting procedures are available to control for the multiple testing problem in this setting. The empirical power of this new test is then explored via simulations and compared with those of existing alternative approaches in different scenarios. Finally, the proposed methodology is applied to data from wheel running activity experiments, that used selective breeding to study the evolution of locomotor behavior in mice.


Introduction
In recent years, an increasing number of applications has involved data that are best described as being functional. Examples can be found in medicine (West et al., 2007), Moreover, if the null hypothesis H 0 is rejected, we would like to identify which pairs of groups show a difference between covariance operators. To do this, we will rely on the non-parametric combination methodology introduced by Pesarin and Salmaso (2010) for multivariate permutation, which enables to combine many different partial tests in an overall global test. In our case, the idea is to combine all the pairwise comparisons between the q samples in order to obtain the p-value of the global test. Using this method, the post-hoc comparisons are straightforward: the global p-value and the partial p-values of the pairwise group comparisons are computed simultaneously. However, some care is required when jointly analyzing the latter, because a multiple testing problem arises. Thus, we suggest to use a step-down approach to control the family-wise error rate. The empirical power of the proposed test is evaluated through simulation studies and compared with those of previously proposed testing procedures. Finally, we analyze the covariance operators of wheel-running mice activity curves. These data have been collected during an evolutionary biology experiment to investigate the evolutionary behaviour of these activity trait (see Swallow et al., 1998;Koteja et al., 1999;Kane et al., 2008).

Testing equality of covariance operators
In this section, we describe the proposed strategy to test the equality of covariance operators across multiple groups, which allows for the use of the most appropriate metric for covariance operators in the problem at hand and, at the same time, for the investigation of pairwise difference between groups. First, we discuss a few possible choices of distance between covariance operators.

Metrics for covariance operators
Let x be a random function which takes values in L 2 (Ω), Ω ⊆ R, such that E(||x|| 2 L 2 (Ω) ) < +∞. The covariance operator Σ x is defined, for g ∈ L 2 (Ω), as Σ x g(t) = Ω c x (s, t)g(s)ds, where c x (s, t) = cov(x(s), x(t)) = E [(x(s) − E [x(s)]) (x(t) − E [x(t)])] . Then, Σ x is a trace class, self-adjoint, compact operator on L 2 (Ω) with non negative eigenvalues (see, e.g., Bosq, 2012, Section 1.5). Indeed, any compact operator T has a canonical decomposition that implies the existence of two orthonormal bases {u k }, {v k } for L 2 (Ω) such that T f = k σ k f, v k u k , or, equivalently, T v k = σ k u k , where v, v indicates the inner product in L 2 (Ω) and the non negative real numbers {σ k } k∈N , are called the singular values of T . If the operator is self-adjoint, there exists an orthonormal basis {v k } such that T f = k λ k f, v k v k , or, equivalently, T v k = λ k v k and the sequence {λ k } ∈ R is called the sequence of eigenvalues for T . A compact operator T is said to be trace class if the trace tr(T ) = k T e k , e k < +∞ for every orthonormal basis {e k }. A compact operator T is said instead to be Hilbert-Schmidt if its Hilbert-Schmidt norm is bounded, i.e., ||T || 2 HS = tr(T T ) < +∞, where T denotes the adjoint operator of T . The Hilbert-Schmidt norm is a generalization of the Frobenius norm for finite-dimensional matrices.
It is then possible to embed the space of covariance operators in the space of Hilbert-Schmidt operators and use the Hilbert-Schmidt distance ||Σ 1 − Σ 2 || HS to measure the distance between two covariance operators Σ 1 and Σ 2 . However, this is an extrinsic metric based on the above embedding and thus ignores the geometry of the space of covariance operators, such as the trace class property and the non negativity constraints on the eigenvalues. Pigoli et al. (2014) show that when the covariance operator is the object of interest for the statistical analysis, taking into account the property of the space leads to tests with higher empirical power. This motivated the introduction of new metrics such as the square root distance and the Procrustes distance.
Let Σ be a self-adjoint trace class operator, there exists a Hilbert-Schmidt self adjoint where λ k are eigenvalues and v k eigenfunctions of Σ. The square root distance between two covariance operators Σ 1 and Σ 2 is therefore defined as The square root distance is based on the mapping of the two operators Σ 1 and Σ 2 from the space of covariance operators to the space of Hilbert-Schmidt operators, through the square root map. This is a particular choice among a family of such maps that transform the covariance operator Σ to a Hilbert-Schmidt operator L so that Σ = LL . It is easy to see that L is defined up to a unitary operator R, since (LR)(LR) = LRR L = LL = Σ. Therefore, it is natural to follow a Procrustes approach to minimize the distance with respect to this arbitrary unitary operator. Pigoli et al. (2014) define the square of the Procrustes reflection size-and-shape distance between two covariance operators Σ 1 and Σ 2 as where L i are such that Σ i = L i L i , for i = 1, 2, and O(L 2 (Ω)) is the space of unitary operators on L 2 (Ω).

Non-parametric combination
In this section we describe how it is possible to test the global hypothesis that all the covariance operators are equal across the groups by combining pairwise group comparisons which are based on the two-sample permutation test described in Pigoli et al. (2014). This approach will allow us to use any metric in the definition of the test statistics without making any assumption on the data generating process. Let us assume we have q independent groups of functional data x i1 , . . . , x in i ∈ L 2 (Ω), i = 1, . . . , q.
and they are independent and identically distributed samples from a random process with distribution P i , mean µ i and covariance operator Σ i . In the following, we denote with x i the vector of observations (x i1 , . . . , x in i ) from group i. We would like to test if the covariance operators are all equal. The global null hypothesis can be viewed as an intersection of partial null hypotheses and the global alternative hypothesis as the union of the corresponding alternative hypotheses, i.e.
The idea is to combine the k = q(q − 1)/2 two-sample tests for each pair of groups in a global test, using the non-parametric combination algorithm of Pesarin and Salmaso (2010).
Let T ij = d(S i , S j ) be the test statistic of our choice, associated to the partial test H ij 0 of groups i and j respectively, where S i , S j are sample covariance operators of the corresponding groups and d(·, ·) is some distance between covariance operators. In particular, in this work we consider the square root, Procrustes and Hilbert-Schmidt distances defined in Section 2.1. Let us define by T = (T 1,2 , T 1,3 , . . . , T q−1,q ), the vector of all partial test statistcs T ij , with 1 ≤ i < j ≤ q.
The partial tests H ij 0 : d(Σ i , Σ j ) = 0 against H ij 1 : d(Σ i , Σ j ) = 0 marginally satisfy the assumptions required for the test (i.e. they are marginally unbiased, consistent and significant for large values) for any of the distances presented in Pigoli et al. (2014). Therefore, the considered algorithms can be applied to any functional dataset using the vector of test statistics T.
The partial test statistics in T are combined by a function Ψ that must satisfy the properties indicated by Pesarin and Salmaso (2010): 1. Ψ is non-decreasing in each argument, 2. If one or more arguments are zero, Ψ attains its supremum valueΨ, possibly not finite.
3. For all α > 0, the critical value T α Ψ of Ψ is assumed to be finite and strictly smaller thanΨ.
Also, the curves must be centred around the sample mean of each group, because exchangeability of the observations is required in order to apply permutations.
We indicate by x We obtain the following algorithm: Algorithm 2.1 (Multiple-sample permutation test for the equality of covariance operators). Let x ij , i = 1, . . . , q, j = 1, . . . , n i be the considered dataset.
where m i is the sample mean of x i , for all i = 1, . . . , q, j = 1, . . . , n i .
2. Let T (0) be the k-dimensional vector containing the pairwise distances between the sample covariance operators of the centred groups x 3. For b = 1, . . . , B, consider a random permutation u (b) of the data labels and compute the k-dimensional vector T (b) containing the distances between the sample covariance operators of the groups of the permuted data set, d(S b=1 is a random sampling from the permutational distribution of the random vector T.
Proposition 1. If we make the additional assumptions that, when n goes to infinity, then so also do the sample sizes of all groups and that the number B of Monte Carlo iterations goes to infinity while k and α remain fixed, then it is possible to prove that the test we obtain is strongly consistent and unbiased for the overall null hypothesis H 0 against the alternative H 1 .

Synchronized permutation tests
When data belong to multiple groups, a few different strategies can be used to generate the permuted samples. In particular, Step 3. of Algorithm 2.1 requires to generate a certain number of permutations of the original dataset. In Solari et al. (2009), three different ways of permuting data are proposed. The simplest idea is to perform permutations involving the whole data set, so-called pooled permutations. This can be done because, under H 0 , the observations of all groups are exchangeable. However, this strategy does not allow to test also the partial hypotheses, since each comparison involves not only the observations belonging to the pair of considered groups, but also those of the other groups. Therefore, the resulting global p-value is correct, but the partial p-values would not be accurate when doing post-hoc comparisons. The second proposal is to apply paired permutations, that is while comparing the i-th and j-th groups, the inference is made on each paired vector (x i , x j ) independently. The result would be opposite than the one obtained with pooled permutations: the partial tests are exact, just like in the two-sample case, but the global test is not reliable since this method does not take into account the dependencies between the marginal tests.
Therefore, we want paired permutations to be done not independently but jointly. At the same time, we would like to keep the partial comparisons separate, so as to be able to do post-hoc comparison with no additional computational effort. Then, if the design is balanced, i.e. n 1 = · · · = n q =n, a further possibility explored by Solari et al. (2009) is to apply synchronized permutations, exchanging the same number ν of units between each pair of blocks. Applying synchronized permutations allows both maintaining the dependencies among partial tests and involving the observations of each comparison at the same time. First of all, we build the pseudo-data matrix where each column is formed by the samples from two different groups and we have as many columns as needed to account for all the groups pairings. Then, we can apply constrained synchronized permutations, that is to exchange units in the same original position within each block. This can be done by permuting the rows of the pseudo-data matrix. Since there are N CSP = 2n n possible ways to exchange units in the first pair of blocks, N CSP is the cardinality of the constrained synchronized permutations. Since the test statistic is symmetric, the number of possible distinct values of the vector of test statistics T is N CSP /2.

Post-hoc analysis
After performing the global test, if the null hypothesis H 0 is rejected in favour of the alternative H 1 , it is often of interest to find out which of the data samples led to this conclusion. One of the advantages of the non-parametric combination methodology is that partial p-values are computed at the same time of the global one. Therefore, the post-hoc comparisons can be done with a small computational effort. We investigate here the methods that allow to control the family-wise error rate, in order to simultaneously assess which of the partial null hypotheses H ij 0 are rejected. First, we recall the resampling step-down method proposed by Westfall and Young (1993). The idea is that, rather than adjusting all p-values according to the minimum p-value distribution, one should only adjust the minimum p-value using this distribution and then adjust the remaining p-values according to smaller and smaller sets of p-values. The effect of using restricted sets of p-values is to make the adjusted p-values smaller, thereby improving the power of the method.
Let the ordered partial p-values have indexes r 1 , . . . , r k so thatλ (1) =λ r 1 ,λ (2) = λ r 2 , . . . ,λ (k) =λ r k . The step-down adjusted p-values are defined sequentially as follows: The use of max operator insures that the order of the adjusted p-values is the same as that of the original p-values. Westfall and Young (1993) proved that this procedure controls the family-wise error rate in the strong sense. Pesarin and Salmaso (2010) showed that the resampling method proposed by Westfall and Young (1993) is equivalent to iteratively use the non-parametric combination with the Tippett combining function Ψ Tippett (Birnbaum, 1954): Step-down method for the Tippett combining function).
Let λ (1) , . . . , λ (k) be the increasing ordered p-values corresponding to the set of partial hypotheses.  Furthermore, Lehmann and Romano (2006) presented a similar step-down method, that uses the test statistics T ij instead of the p-values λ ij . This method is equivalent to the one based on the Tippett combining function but allows to avoid the computations of the permutational distributions of the partial p-values. For example, let us suppose that the individual tests H ij 0 are based on test statistics T ij with large values indicating evidence against the partial null hypotheses. Let K = {H ij 0 , 1 ≤ i < j ≤ q} be the set of all the partial test hypotheses andK a subset of K,K ⊆ K. First of all, we have to define the critical value of the combined test of all the hypotheses contained inK at level α ∈ [0, 1] so that the family-wise error rate is controlled in the strong sense. Many definitions are possible, as long as the properties indicated in Lehmann and Romano (2006), Theorem 9.1.3 are verified. We choose to use the definition given in Solari et al. (2009), where the critical value ofK at level α is defined as the m-th smallest value among the permutation distributions of TK = max

For
, where m = B − Bα . For this reason we will refer to this as the step-down method for the max T combining function. The algortithm is defined as follows: be the corresponding hypotheses. 2. For i = 2, . . . , k, let K i be the set of hypotheses not previously rejected, i.e. Lastly, when using another combining function, it is possible to use the closed testing procedure of Marcus et al. (1976). This method is based on the idea that one may reject any hypothesis H ij 0 , while controlling the family-wise error rate, when the test of H ij 0 itself is significant and the test of every intersection of partial hypotheses that includes H ij 0 is significant. Hence,λ ij is the maximum of all the p-values of the partial hypotheses containing H ij 0 . This method has two major drawbacks: it requires a greater number of computations and it is very conservative. However, it is a useful tool when the use of the Tippett or max T combining functions is not suitable.  3 Simulation studies

Synthetic data sets
We generate synthetic data sets as follows. All the curves are generated on an equispaced grid of p = 31 points on Ω = [0, 1] and the sample size of each group isn = 20. Unless otherwise stated, curves are simulated from a multivariate Gaussian process. We consider q different groups (with q varying across simulation studies) and for all q groups the mean function is equal to sin(x), x ∈ [0, 1]. The covariance operator of each group varies according to the test case. Let Σ 1 and Σ 2 be the sample covariance operators of the male and female subjects in the Berkeley growth study dataset described in Ramsay and Silverman (2005), rescaled to [0, 1].
The two test cases represent two different ways in which the null hypothesis can be violated. The second case pertains to a difference in the total variation between groups, while the first test case presents also a difference in shape between covariance operators.
Each permutation test is performed with B = 1000 iterations of the Monte Carlo Algorithm 2.1 and is repeated for 1000 replicates of the simulated dataset.
In the following, we use this simulated data to evaluate the empirical size and the empirical power of the proposed test when using different distances between covariance operators.
All the functions needed to apply the permutation tests to these simulated data have been made available in the R package "fdcov" (Cabassi and Kashlak, 2016).

Empirical size and power of the test
We consider first a simulation with q = 3 groups, where the first group has covariance operator Σ 1 and the others two covariance operators Σ(γ). Figures 2 and 3 show the empirical power of the global and partial tests done using the synchronized permutations, the max T combining function and the Procrustes, square root and Hilbert-Schmidt distance, for the first and second test cases respectively. It is evident that the test has greater empirical power when using Procrustes and square root distances, with the latter being in this case preferable due to the lower computational cost. Moreover, the global test appears to have the correct level for all the distances while the partial tests are conservative for γ = 0, as expected, and the proportion of rejection for the partial test between the second and third group (which have equal covariances) is less than 5% for all values of γ.
We want now to explore how the performance of the test changes when the number of groups increases. Figure 4(a) shows the empirical power of the global test using the square root distance when the number of groups goes from 4 to 10, always with the first group with covariance operator Σ 1 and all the others with covariance operator Σ(γ), with γ varying from 1 to 5. It is possible to see that the level of the test is respected for all numbers of groups while the empirical power tends to decrease when the number of groups increases. This is due to the fact that only q partial tests out of q(q − 1)/2 are bringing information about the violation of the null and they form a smaller and smaller proportion of all the partial tests when q increases. If we instead have half of the groups with covariance operator Σ 1 and half with covariance operator Σ(γ), the loss of empirical power when q increases is smaller, as shown by the empirical power curves reported in Figure 4(b). This is because of the larger proportion of false partial hypothesis.

Comparison with the other existing tests
We compare now the proposed method, using the square root distance and the max T combining function, with some alternative approaches to test for the difference between covariance operators. We consider first a generalization of the Levene's test (Anderson, 2006) which is sensitive only to the difference in total variation between groups and it is implemented using the permutational analysis of variance. Paparoditis and Sapatinas (2016) introduced an empirical bootstrap approach based on Hilbert-Schmidt distance (or alternatively, on other test statistics based on the Karhunen-Loéve expansions of the covariance operators). In the interest of a fair comparison, we apply here the same procedure to the test statistics based on the square root distance. It should be noted however that the theoretical properties of this modified procedure still need to be studied. Finally, we consider the test based on the concentration inequalities method of Kashlak et al. (2016). Figures 5 show the empirical power of the generalisation of Levene's test (Anderson, 2006), the empirical bootstrap by Paparoditis and Sapatinas (2016) and the concentration inequalities method of Kashlak et al. (2016) for the two test cases, compared to the results obtained using the proposed permutation test. Here data are simulated from q = 3 groups with the first group having covariance operator Σ 1 and the other two covariance operators Σ(γ). It appears that the permutation test and the empirical bootstrap have approximately the same empirical power in both test cases. On the contrary, Levene's test performs very differently. As expected, it outperforms the others  in the second test case, where it captures very well the differences in scale, but it is dramatically less powerful in the first test case, where the difference between the covariance operator is mostly in shape. The non-asymptotic test of Kashlak et al. (2016) is slightly less powerful than the permutation test and the empirical bootstrap but it has the advantage of being much less computationally expensive than the resampling-based methods.
We want also to explore what happens when data are generated from a non Gaussian distribution. We simulated data from a multivariate t distribution with 4 degrees of freedom and correlation matrix implied by the covariance operator Σ 1 for the first group and Σ(γ) for the other two groups. Here it is not possible to apply the non-asymptotic test of Kashlak et al. (2016), because calibration parameters are not yet available when data are not Gaussian. Figure 6 shows the empirical power for the permutation test, the empirical bootstrap test and the Generalized Levene's test. Here the permutation test appears to perform slightly better then the bootstrap, while Levene's test is again performing very well in the second tests case but not in the first. Overall, the empirical power of all tests is lower than in the Gaussian case, but they respect the nominal level.

Application to evolutionary biology
In this section we apply the proposed permutation test to a data set of interest in evolutionary biology. The question of interest here is whether there is a difference in the covariance operator of a function-valued trait (Kingsolver et al., 2001;Stinchcombe et al., 2012) among experimental lines of mice with known differences in evolutionary histories.

Data set
Data were collected from aging house mice (Mus domesticus) that were members of the 16th generation of a selective breeding experiment for increased voluntary wheel-running behavior (Swallow et al., 1998). This experiment produced four replicate lines selected for the total number of wheel revolutions run on days 5 and 6 of a six day exposure to running wheels that occurred when the mice were six to eight weeks of age, and four replicate control lines that were randomly bred each generation (see Swallow et al., 1998, for additional details). At generation 16, a total of 360 mice were used to establish an aging colony (Morgan et al., 2003;Bronikowski et al., 2006). Half of the mice in the colony were from the four high-selected lines and half were from the four control lines that were randomly bred with respect to running behavior, and half of each selection group was housed with running wheels (active mice) and half was housed in cages without wheels (sedentary mice). One male from one of the control lines died of unknown causes during the early stages of the experiment. Each week every mouse was measured for body mass and food consumed, and each active mouse had the total number of wheel revolutions run that week recorded (see Morgan et al., 2003;Bronikowski et al., 2006, for more details).
Herein we examine only data from the active mice from both selection groups from the first 80 weeks of the experiment, as reported by Morgan et al. (2003). The variables in the dataset are: a unique id number for each mouse and id of fullsib family from which the mouse was drawn; the age and sex of the mouse; the line number (lines 1, 2, 4, 5 are control lines, the others are selected lines); the week of wheel measure and the number of revolutions run during the week. Total activity, measured as number of revolutions run in a given day, can be decomposed into the product of mean velocity and duration of activity. Thus, increased total activity levels could be accomplished by an increase in mean velocity, an increase in the amount of time spent running, or a combination of both.
The raw data are presented in Figure 7. Each line connects the number of revolutions run by each mouse during the first 80 weeks of the experiment (Morgan et al., 2003). Mice identified by ID numbers 90183 and 90224 are taken as examples of the selected and control lines respectively. The corresponding wheel-running functions have been highlighted in each figure. The first one is a male belonging to family number 29 from line 1 (control), while the second one is a female belonging to family number 11 from line 3 (selected).
At several times during the experiment, data collection was skipped for one or two weeks. In these cases, the data collected after the skipped week(s) was divided by number of weeks, giving multiple weeks in a row with the same value. This is easily seen in Figure 7 at weeks 38, 39, 40, when the values are constant for each mouse, because the wheel revolutions recorded for week 40 were divided by 3 and assigned to weeks 38 and 39 as well as 40. The weeks in which this occurred are: 34; 35; 38; 39; 40; 50; 51; 72; 73. We regularized data using cubic smoothing splines. In particular, we used the the routine spline.smooth() of the R package "stats" (R Core Team, 2016). Since individual mice can have their own biological clock, curves are aligned to remove phase variability (Ramsay and Silverman, 2005), via the elastic analysis described in Tucker et al. (2013) and implemented in the R package "fdasrvf" (Tucker, 2016). Figure 8 shows the smooth and aligned wheel-running activity curves.

Missing observation
In the voluntary wheel running activity data set, all groups (experimental lines) are composed of 20 mice. However, one of the mice died of unknown causes during the early stages of the experiment and therefore one group has only 19 observations. For this reason, in order to apply the synchronized permutations, we have to prove that the presence of a missing observation does not affect the inference. Following the guidelines given by Pesarin and Salmaso (2010), we give a new formulation of the test that takes into account the presence of missing data. Thanks to this, we are able to prove that it is possible to apply the proposed test to an unbalanced data set with one missing observation, under certain assumptions on the process that generates the missing observations. Consider again a functional data set of the form X = {x ij , i = 1, . . . , q, j = 1, . . . , n i }, that consists of q ≥ 2 samples of size n i ≥ 2. The groups are related to q levels of a treatment and the data x ij are supposed to be independent and identically distributed with distributions P i ∈ P, i = 1, . . . , q. In order to take into account that, for whatever reason, some of the data are missing, Pesarin and Salmaso (2010) suggested to consider the inclusion indicator associated to the considered data set, that is where o ij = 1 if x ij has been observed and collected, o ij = 0 otherwise. We denote with o i the vector of observation indicators (o i1 , . . . , o in i ) from group i. This indicator represents the observed configuration in the data set. Hence, the data set can be seen as the pair of matrices (X, O). Therefore we would like to perform the following test: Thus, if we assume that data are jointly exchangeable under the null hypothesis with respect to the groups, we can, again, utilize a permutation test. Let us represent by P i the joint multivariate distribution of (x i , o i ), i = 1, . . . , q under the null hypothesis. Then it holds: The idea of Pesarin and Salmaso (2010) is to break down the null hypothesis in the following way: Furthermore, we assume that the missing data are missing completely at random. In this case, we can condition with respect to the observed inclusion indicator and ignore H O 0 , because O does not provide any information about treatment effects (Rubin, 1976). In other words, the partial hypotheses on O are true by assumption and the null hypothesis can be simplified: We indicate by O * any permutation of O, the permutational vector of inclusion indicators, and by κ * = [κ * 1 , . . . , κ * q ] the corresponding vector of counts of valid observations in each group, where Then we can group the set of all permutations of the dataset, according to the vectors of actual sample sizes of valid data κ * . Now, let T be the vector of partial test statistics based on functions of sample valid data; we denote its permutation distribution as F [T|(X, O)], T ∈ R k . Pesarin and Salmaso (2010)  holds for every T ∈ R k , for every permutation O * of O and for all data sets X. In the case of the tests for covariance operators, this is true because the test statistic T Ψ of the global test is a combination of the partial test statistics of the pairwise comparisons between the groups. These, in turn, depend only on the distances between covariance operators and their permutations. We can suppose that, under the null hypothesis, the permutation distribution of the partial test statics T ij depends essentially on the number κ * i , κ * j of summands. Thus, just like in the case of the multivariate analysis of variance studied in Pesarin and Salmaso (2010), the previous distributional equality is equivalent to Hence, we would like our partial test statistics to be invariant with respect to κ * and for all X. Now, suppose that we are in the balanced case, i.e. n 1 = · · · = n q =n and one observation is missing in one of the groups, say group a, where 1 ≤ a ≤ q. In the wheel-running data set, for instance, q = 8,n = 20 and one observation is missing in group 1. All the pairwise comparisons between groups i and j with 1 ≤ i < j ≤ q and i, j = a are not affected by the problem of missing data since κ * i = κ * j =n. As regarding the others, at each iteration of the algorithm, we could have κ * a =n and κ * j =n − 1 or viceversa, depending on the permutation. However, since distances are symmetric, this two cases are permutationally equivalent under the null hypothesis and Equation (1) is always satisfied. For this reason, we can apply the synchronised permutations as usual. At each iteration of Algorithm 2.1 the sample covariance of each permuted group is computed only with the available data. This is more complicated when the number of missing data becomes greater than one, since the vector κ * of actual sample sizes can assume other values.

Hypothesis testing
We can finally apply the test to the smoothed and aligned wheel-running activity curves. The aim of the analysis is to check if the covariance operators of the eight groups of mice are the same and, if this is not the case, to identify which lines have different covariances. This is necessary for two reasons. First, the covariance operator is in itself of biological interest for exploring which type of variability is environmental in nature and which is due to genetic components. Second, inference on the mean functions often requires the assumption of equality of covariance operator and it is important to be able to check this assumption.
We want then to test the hypothesis H 0 : {Σ 1 = · · · = Σ 8 } against H 1 : {at least one of the equalities is not true}.
To this end, we use the Monte Carlo Algorithm 2.1 to obtain an estimate of the permutation test proposed in Section 2.2. We have shown in the previous section that synchronized permutations can be used, even if one of the observations is missing. We use here the square root distance between covariance operators as partial test statistic and we choose the Tippett combining function. We set the number of iterations B to 1000.
The p-values of the partial tests between each pair of lines, adjusted with the stepdown method, are reported in Figure 9. The p-value of the global test (< 0.001) indicates that there is strong evidence to reject the null hypothesis. This is due mainly to the first group of mice for which many partial null hypotheses are rejected (i.e., the differences between the covariance operator of line 1 and the covariance operators of these others lines are significant) and, when using the max T combining function, we reject H 0 even if only one of the partial tests is rejected.
The results of this test are somewhat surprising. First, no systematic differences were detected between the selected lines and the control lines, while under directional selection there is at least a theoretical expectation that genetic variances and covariances would change between selected and unselected populations (e.g. Falconer and Mackay, 1996), and work on the 31st generation of mice from this selection experiment has demonstrated some changes in the genetic variances of wheel running over the first 6 days of wheel running (Careau et al., 2015). Second, the results suggest that line 1 randomly differs from one other control line and one other selected line. Such random differences in biological populations can be caused by genetic drift occurring during the selection experiment or by founder effects when the original base population was randomly subdivided into eight lines. Indeed, the trait of selection itself (wheel running on days 5 and 6 of a 6 day exposure) and underlying physiological traits (e.g., basal metabolic rate) demonstrate the effects of drift and/or founder effects (Swallow et al., 1998;Kane et al., 2008). The results presented herein are suggestive of similar processes influencing the phenotypic covariance structure of wheel running across age, which presents interesting possibilities of additional biological experiments.

Conclusions and further developments
We extended the application of hypothesis tests that take into account the geometry of the space of covariance operators to the case of multiple groups, using a permutation approach. In particular, synchronized permutations allow us to make inference also on the pairwise comparison between groups while controlling the family-wise error rate. We illustrate via simulation studies that the proposed test has the correct effect size and indeed the square root distance and the Procrustes distance lead to higher empirical power in the multiple groups comparison as well. While we have shown that the method can be applied in the case of a missing observation, a more general treatment of the case of unbalanced design and missing data is scope for future works.
We have also shown that the empirical power for the global test is comparable to those obtained using bootstrap approximation in the Gaussian case and slightly better in the non Gaussian case. It is worth to notice that, while simulation results shows the bootstrap approach to be promising as well for the global test, its property has not yet rigorously studied for test statistics based on metric different from the Hilbert-Schmidt distance and this is an interesting direction for future research.
The application of the procedure to the mice voluntary wheel running activity curves shows that, while a difference between covariance operators is indeed present, this is not caused by selection itself. Instead it would appear that random biological processes such as genetic drift or founder effects are influencing the covariance operators of the phenotypic curves, suggesting further investigation of this trait and demonstrating the importance of random processed during evolution.