1 Introduction

Consider J dependent groups and let \(\theta _j\) denote some measure of location associated with the jth group. Of course, a common goal is to test

$$\begin{aligned} H_0: \theta _1 =\cdots = \theta _J. \end{aligned}$$
(1)

The classic approach to testing this hypothesis is based in part on the variation of the J measures of location. That is, it is based on an estimate of a particular measure of effect size:

$$\begin{aligned} \Lambda = \sum (\theta _j -\bar{\theta })^2, \end{aligned}$$
(2)

where \(\bar{\theta }=\sum \theta _j/J\). The standard approach to testing (1) is to determine whether the estimate of \(\Lambda \) is sufficiently large to justify rejecting the null hypothesis.

Recent years have seen an increased interest in measures of effect size that reflect differences among measures of location in conjunction with some measure of dispersion. When comparing three or more independent groups, there are now a variety of scale invariant measures of effect size that might be used. That is, they are scale free; multiplying the data by some constant c, \(c \ne 0\), does not alter their value. Included are robust, heteroscedastic measures of effect size that take into account differences among the measures of location in conjunction with some robust measure of dispersion (e.g., Wilcox, 2022).

A minor goal here is to suggest a robust, scale invariant measure of effect size for characterizing the extent to which dependent groups differ. The basic idea is to measure the distance of an estimate of the null vector from an estimate of \((\theta _1, \ldots , \theta _J)\), the center of the data cloud. Perhaps the most obvious approach is to use Mahalanobis distance, but there are two concerns with this approach. First, it is not robust (e.g., Rousseeuw and Leroy, 1987). That is, even a small shift in a distribution can alter its value tremendously. In particular, a relatively large distance assuming normality can be rendered arbitrarily small by even a small shift toward a heavy-tailed distribution where outliers are likely to occur. A known strategy for dealing with this issue is to replace the mean and covariance matrix with some robust analog. There are a variety of possibilities (e.g., Wilcox, 2022). However, there is a second concern: robust analogs of Mahalanobis distance are reasonable provided the unknown multivariate distribution is elliptically contoured. A goal here is to avoid this restriction. This is done via the notion of projection distance (e.g., Wilcox, 2022, section 6.2.5), which takes into account the overall structure of the data cloud.

By design, the measure of effect size considered here, \(\Xi \), is equal to zero when (1) is true. There are two main goals. The first is to report simulation results on how well a method for testing

$$\begin{aligned} H_0: \Xi =0 \end{aligned}$$
(3)

performs in terms of controlling the Type I error probability. Because the method for testing (3) is sensitive to different features of the data compared to a conventional method for testing (1), there are situations where the power of the proposed method can be substantially higher, as will be illustrated. Not surprisingly, there are situations where the conventional method for (1) has higher power than the proposed method.

Now consider two independent groups where for each group there are K measures for each participant. Let \(\Xi _{\ell }\) (\(\ell =1\), 2) denote \(\Xi \) for the \(\ell \)th group. The method for testing (3) is readily extended to testing

$$\begin{aligned} H_0: \Xi _1 = \Xi _2. \end{aligned}$$
(4)

The second goal is to report simulations on how well the proposed method performs when testing (4). As will be evident, testing (4) can detect differences between groups that are missed when comparing the marginal measures of location instead.

There are numerous robust location estimators that might be used when dealing with dependent random variables (e.g., Wilcox, 2022). The focus here is on a 20% trimmed mean associated with the marginal distributions, with the understanding that arguments can be made for using a variety of alternative estimators. For notational convenience, consider a single random sample \(X_1, \ldots , X_n\) and let \(X_{(1)} \le \cdots \le X_{(n)}\) denote the values written in ascending order. Let \(\gamma \) denote the amount of trimming, \(0 \le \gamma <0.5\). Let \(g=[\gamma n]\), where \([\gamma n]\) is the value of \(\gamma n\) rounded down to the nearest integer. The sample trimmed mean is computed by removing the g largest and g smallest observations and averaging the values that remain. More formally, the sample trimmed mean is

$$\begin{aligned} \bar{X}_t = \frac{X_{(g+1)} + \cdots + X_{(n-g)}}{n-2g}. \end{aligned}$$
(5)

Here, \(\gamma =0.2\) is used because it performs nearly as well as the sample mean under normality while guarding against low efficiency when dealing with heavy-tailed distributions where outliers are likely to occur.
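To make (5) concrete, the following R code is a minimal sketch of the 20% trimmed mean; the function name tmean is illustrative only. Base R's mean(x, trim = 0.2) removes the same number of observations, \(g=[0.2n]\), from each tail, so the two computations agree.

```r
# A minimal sketch of the trimmed mean in (5); tmean is an illustrative name.
tmean <- function(x, tr = 0.2) {
  x <- sort(x)                     # X_(1) <= ... <= X_(n)
  n <- length(x)
  g <- floor(tr * n)               # g = [gamma * n], rounded down
  mean(x[(g + 1):(n - g)])         # average the middle n - 2g order statistics
}

set.seed(1)
x <- rnorm(20)
c(tmean(x), mean(x, trim = 0.2))   # base R's trimmed mean gives the same value
```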

The paper is organized as follows. Section 2 describes the details of \(\Xi \). Section 3 describes a simple method for testing (3) as well as (4). Section 4 reports simulation results on how well the method controls the Type I error probability. The power of the proposed method for testing (3) is compared to a standard method for testing (1). The proposed methods are illustrated in Section 5.

2 The Proposed Method

First consider the issue of measuring effect size based in part on the marginal trimmed means but with the additional goal of taking into account the overall dispersion and structure of the data cloud. Let \(\hat{\varvec{\mu }}_t = (\bar{X}_{t1}, \ldots , \bar{X}_{tJ})\), where \(\bar{X}_{tj}\) is the sample trimmed mean based on the jth marginal distribution. Let \(\tilde{\varvec{\mu }}_t =(\bar{X}_G, \ldots , \bar{X}_G)\) denote the estimate of the measures of location when (1) is true, where \( \bar{X}_G =\sum \bar{X}_{tj}/J\). As previously noted, the basic idea is to use an estimate of a projection-type distance of the null vector, \(\tilde{\varvec{\mu }}_t \), from an estimate of the center of the data cloud, \(\hat{\varvec{\mu }}_t\). Projection distance has a close connection to the Donoho and Gasko (1992) approach to defining the notion of halfspace depth derived by Tukey (1975).

For notational convenience, let \(\textbf{Y}_i= (X_{i1}, \ldots , X_{iJ})\), \(i =1, \ldots , n\), and \(\textbf{Y}_{n+1}=\tilde{\varvec{\mu }}_t \). An outline of the projection distance estimator is as follows. For each i (\(i= 1, \ldots , n+1\)), project the data onto the line between \(\hat{\varvec{\mu }}_t\) and \(\textbf{Y}_i\). For each projection, compute a standardized distance between \(\hat{\varvec{\mu }}_t\) and \(\textbf{Y}_{n+1}\). The maximum of these \(n+1\) distances is the projection distance between \(\hat{\varvec{\mu }}_t\) and \(\textbf{Y}_{n+1}\), which is taken to be \(\hat{\Xi }\), the estimate of the effect size, \(\Xi \).

The computational details are as follows. For any i, \(i=1, \ldots , n+1\), let

$$\begin{aligned} \textbf{U}_i = \textbf{Y}_i - \hat{\varvec{\mu }}_t, \end{aligned}$$
$$\begin{aligned} B_i = \textbf{U}_i \textbf{U}_i^{\prime } \end{aligned}$$
(6)

and for any j (\(j=1, \ldots , n+1\)) let

$$\begin{aligned} W_{ij} = \sum U_{ik} U_{jk}, \end{aligned}$$
$$\begin{aligned} T_{ij} = \frac{W_{ij}}{B_i} (U_{i1}, \ldots , U_{iJ}) \end{aligned}$$
(7)

and

$$\begin{aligned} G_{ij} = \Vert T_{ij}\Vert , \end{aligned}$$

where \(\Vert T_{ij}\Vert \) is the Euclidean norm. Now let

$$\begin{aligned} g_{ij} = \frac{G_{ij}}{q_2-q_1}, \end{aligned}$$
(8)

where \(q_1\) and \(q_2\) are estimates of the lower and upper quartiles, respectively, based on the values \(G_{i1}, \ldots , G_{i,n+1}\). Here, the quartiles are estimated with the ideal fourths estimator (Frigge et al., 1989). The projection distance of \(\textbf{Y}_{n+1}\) is \(\max g_{i,n+1}\), the maximum being taken over \(i=1, \ldots , n+1\); this maximum is taken to be \(\hat{\Xi }\).
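The following R code provides a minimal sketch of the computation of \(\hat{\Xi }\) just described, following (6), (7) and (8); the function name proj.es is illustrative and is not the published WRS code. The quartiles \(q_1\) and \(q_2\) are obtained with quantile(..., type = 8), which corresponds to the ideal fourths estimator.

```r
# A minimal sketch of the estimate of Xi based on projection distance.
# x is an n-by-J matrix of dependent measures; proj.es is an illustrative name.
proj.es <- function(x, tr = 0.2) {
  x <- as.matrix(x)
  n <- nrow(x); J <- ncol(x)
  mu.hat <- apply(x, 2, mean, trim = tr)       # marginal trimmed means
  mu.null <- rep(mean(mu.hat), J)              # null vector: grand mean in each position
  Y <- rbind(x, mu.null)                       # Y_1, ..., Y_n and Y_{n+1}
  U <- sweep(Y, 2, mu.hat)                     # U_i = Y_i - mu.hat
  B <- rowSums(U^2)                            # B_i = U_i U_i'
  g.null <- numeric(n + 1)                     # will hold g_{i,n+1} for each i
  for (i in 1:(n + 1)) {
    if (B[i] == 0) next                        # skip degenerate directions
    W <- as.vector(U %*% U[i, ])               # W_{ij}, j = 1, ..., n+1
    G <- abs(W) / sqrt(B[i])                   # G_{ij} = ||(W_{ij}/B_i) U_i||
    q <- quantile(G, c(0.25, 0.75), type = 8)  # ideal fourths
    if (diff(q) > 0) g.null[i] <- G[n + 1] / diff(q)   # g_{i,n+1} in (8)
  }
  max(g.null)                                  # projection distance of Y_{n+1}, i.e., Xi-hat
}
```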

3 Methods M and C

This section describes a method for testing (3) and (4). It is noted that when working with a robust estimator, a percentile bootstrap method often performs relatively well when testing hypotheses and computing confidence intervals (e.g., Wilcox, 2022). But here this approach proved rather unsatisfactory. When testing (4), even when both sample sizes were equal to 200, the actual level was well below the nominal 0.05 level. An alternative approach was found to be more satisfactory in simulations.

3.1 Method M

Method M, aimed at testing (3), is similar in spirit to Student’s t-test and the ANOVA F test: determine the null distribution under normality and investigate the impact of non-normality on the actual level of the test statistic. More precisely, momentarily assume multivariate normality with mean vector \(\textbf{0}\) and covariance matrix equal to the identity matrix. Next, estimate the null distribution via a simulation. That is, given n and J, generate n vectors of observations from a J-variate normal distribution and compute an estimate of \(\Xi \), \(\hat{\Xi }^*\). Repeat this B times yielding \(\hat{\Xi }^*_1, \ldots , \hat{\Xi }^*_B\). Here, \(B=2000\) is used. Let \(\hat{\Xi }^*_{(1)} \le \cdots \le \hat{\Xi }^*_{(B)}\) denote \(\hat{\Xi }^*_1, \ldots , \hat{\Xi }^*_B\) written in ascending order. The null hypothesis is rejected at the \(\alpha \) level if \(\hat{\Xi }\) is greater than or equal to the \(1-\alpha \) quantile of the estimated null distribution. Here, the \(1-\alpha \) quantile of the null distribution is simply taken to be \(\hat{\Xi }^*_{(c)}\), where \(c\) is \((1-\alpha )B\) rounded to the nearest integer. A p-value for testing (3) is

$$\begin{aligned} \frac{1}{B} \sum I(\hat{\Xi }^*_b \ge \hat{\Xi }), \end{aligned}$$
(9)

where the indicator function \(I(\hat{\Xi }^*_b \ge \hat{\Xi })=1\) if \(\hat{\Xi }^*_b \ge \hat{\Xi }\) and is equal to zero otherwise. This will be called method M henceforth.
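To illustrate method M under the assumptions just stated, the following R code is a minimal sketch that estimates the null distribution of \(\hat{\Xi }\) by generating data from a J-variate standard normal distribution and returns the p-value in (9); it reuses the hypothetical proj.es function sketched in Section 2 and is not the published WRS code.

```r
# A minimal sketch of method M; x is the observed n-by-J data matrix.
method.M <- function(x, tr = 0.2, B = 2000) {
  n <- nrow(x); J <- ncol(x)
  xi.hat <- proj.es(x, tr = tr)                    # observed effect size estimate
  xi.null <- replicate(B, {
    z <- matrix(rnorm(n * J), nrow = n, ncol = J)  # data generated under the null: N(0, I)
    proj.es(z, tr = tr)
  })
  p.value <- mean(xi.null >= xi.hat)               # the p-value in (9)
  list(estimate = xi.hat, p.value = p.value)
}
```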

3.2 Method C

As for testing (4), method C mimics method M. That is, for each group, momentarily assume multivariate normality with mean vector \(\textbf{0}\) and covariance matrix equal to the identity matrix and use a simulation to estimate the null distribution of \(D=\hat{\Xi }_1-\hat{\Xi }_2\). That is, generate \(n_1\) vectors of values for the first group and \(n_2\) vectors for the second group, compute the estimated difference yielding \(D^*\), and repeat this process B times yielding \(D^*_1, \ldots , D^*_B\). Let \(D^*_{(1)} \le \cdots \le D^*_{(B)}\) denote the \(D^*_1, \ldots , D^*_B\) values written in ascending order. Let \(\ell = \alpha B/2\) rounded to the nearest integer and let \(u=(1-\alpha /2)B\) rounded to the nearest integer. Now reject at the \(\alpha \) level if \(D \le D^*_{(\ell )}\) or if \(D \ge D^*_{(u)}\). Let

$$\begin{aligned} P= \frac{1}{B} \sum I(D < D^*_b). \end{aligned}$$
(10)

A p-value is

$$\begin{aligned} 2 \min \{P, 1-P\}. \end{aligned}$$
(11)
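A corresponding minimal sketch of method C is given below, again reusing the hypothetical proj.es function; x1 and x2 are the \(n_1 \times K\) and \(n_2 \times K\) data matrices for the two independent groups.

```r
# A minimal sketch of method C for two independent groups of dependent measures.
method.C <- function(x1, x2, tr = 0.2, B = 2000) {
  n1 <- nrow(x1); n2 <- nrow(x2); K <- ncol(x1)
  D <- proj.es(x1, tr = tr) - proj.es(x2, tr = tr)  # observed difference in effect sizes
  D.null <- replicate(B, {
    z1 <- matrix(rnorm(n1 * K), nrow = n1)          # group 1 under the null: N(0, I)
    z2 <- matrix(rnorm(n2 * K), nrow = n2)          # group 2 under the null: N(0, I)
    proj.es(z1, tr = tr) - proj.es(z2, tr = tr)
  })
  P <- mean(D < D.null)                             # P in (10)
  list(difference = D, p.value = 2 * min(P, 1 - P)) # p-value in (11)
}
```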

4 Simulation Results

Simulations were used to assess the ability of methods M and C to control the probability of a Type I error when dealing with non-normal distributions as well as situations where the correlations among the random variables differ from zero. All of the simulation results are based on 2000 replications. The marginal distributions were taken to be one of four g-and-h distributions (Hoaglin, 1985), a family that contains the standard normal distribution as a special case. If Z has a standard normal distribution, then by definition

$$\begin{aligned} V = \left\{ \begin{array}{ll} \frac{\exp (gZ)-1}{g} \exp (hZ^2/2), & \text{if } g>0\\ Z\exp (hZ^2/2), & \text{if } g=0 \end{array} \right. \end{aligned}$$

has a g-and-h distribution where g and h are parameters that determine the first four moments. The four distributions used here were the standard normal (\(g=h=0\)), a symmetric heavy-tailed distribution (\(h=0.2\), \(g=0.0\)), an asymmetric distribution with relatively light tails (\(h=0.0\), \(g=0.2\)), and an asymmetric distribution with heavy tails (\(g=h=0.2\)). Table 1 shows the skewness (\(\kappa _1\)) and kurtosis (\(\kappa _2\)) for each distribution. Additional properties of the g-and-h distribution are summarized by Hoaglin (1985). Data were generated from a multivariate normal distribution with a common Pearson’s correlation, \(\rho \), equal to zero or 0.5. Then the marginal distributions were transformed to a g-and-h distribution. This transformation slightly alters the value of Pearson’s correlation, but the R function rngh in the R package WRS adjusts for this. The sample sizes were taken to be \(n=25\), 50, 100 and 200.
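For completeness, the following R code is a minimal sketch of the data generation just described: multivariate normal data with common correlation \(\rho \), with each marginal then transformed to a g-and-h distribution. The function names are illustrative, and the sketch omits the adjustment to Pearson’s correlation made by the R function rngh in the WRS package.

```r
# A minimal sketch of generating n vectors whose marginals have g-and-h
# distributions and whose underlying normal correlations equal rho.
library(MASS)  # for mvrnorm

ghtrans <- function(z, g = 0, h = 0) {
  # transform standard normal values z to a g-and-h distribution
  if (g > 0) ((exp(g * z) - 1) / g) * exp(h * z^2 / 2) else z * exp(h * z^2 / 2)
}

rgh <- function(n, J, rho = 0, g = 0, h = 0) {
  Sigma <- matrix(rho, J, J); diag(Sigma) <- 1     # common correlation rho
  z <- mvrnorm(n, mu = rep(0, J), Sigma = Sigma)   # multivariate normal data
  apply(z, 2, ghtrans, g = g, h = h)               # transform each marginal
}
```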

Table 1 Some properties of the g-and-h distribution
Table 2 Estimated Type I errors using methods M and RMT, \(J=4\)

Table 2 shows the estimated Type I error probability when \(J=4\) and when testing at the 0.05 level. To add perspective, results are also reported for a method for testing (1) based on a 20% trimmed mean, which is labeled method RMT. The computational details are summarized in Wilcox (2022, section 8.1.1) and, for brevity, are not repeated here. The results for \(n=100\) provided no new insights, so they are not reported.

Although the seriousness of a Type I error depends on the situation, Bradley (1978) suggests that as a general guide, when testing at the 0.05 level, the actual level should be between 0.025 and 0.075. RMT satisfies Bradley’s criterion. The largest estimates for methods M and RMT are 0.058 and 0.053, respectively. As for method M, when dealing with a relatively heavy-tailed distribution, the estimates drop below 0.025 for \(n=25\) and 50. The lowest estimate was 0.016. For \(n=200\), the estimates indicate that M satisfies Bradley’s criterion except for \(\rho =0.5\), \(g=0\) and \(h=0.2\), the estimate being 0.022.

Table 3 shows the results when \(J=6\). Again, both methods perform well in terms of avoiding actual Type I error probabilities well above the nominal level. The main difficulty is that for method M, estimates drop below 0.025, the lowest being 0.014. As in Table 2, it is heavy-tailed distributions that cause problems, especially when the variables have a relatively high correlation.

Table 3 Estimated Type I errors using methods M and RMT, \(J=6\)

Table 4 compares the power of methods M and RMT when \(n=50\). This was done by generating data as done for Table 2 and then transforming the first marginal distribution to \(\sigma X_{i1}+\delta _1\) (\(i=1, \ldots , n\)). The second marginal distribution was transformed to \(X_{i2}+\delta _2\). As might be expected, the difference in power can be quite large because the two methods are sensitive to different features of the data. The main point is that there are situations where each method offers a distinct advantage over the other. That is, in practical terms, given the goal of maximizing power, the choice of method can make a substantial difference, with the optimal choice depending on the unknown nature of the distributions.

Table 4 Estimated power using methods M and RMT, \(J=4\), \(n=50\)

Finally, Tables 5 and 6 report results on the ability of method C to control the Type I error probability. Now there are situations where the estimates exceed 0.06, the largest estimate being 0.069. For equal sample sizes, the lowest estimate is 0.015, which occurred when \(n=25\), \(\rho =0.5\) and \(g=h=0.2\). For unequal sample sizes, there is only one estimate less than 0.025, namely 0.022, which occurred for \(n_1=25\), \(n_2=50\), \(\rho =0.5\) and \(g=h=0.2\).

Table 5 Estimated Type I errors using method C, equal sample sizes, \(K=4\)
Table 6 Estimated Type I errors using method C, unequal sample sizes, \(K=4\)

5 An Illustration

The methods are illustrated with data from the Well Elderly 2 study (Clark et al., 2011) where the general goal was to assess the efficacy of an intervention strategy aimed at improving the physical and emotional health of older adults. The focus here is on cortisol levels measured at four different times: upon awakening, 30-45 minutes after awakening, just before lunch and just before dinner. Past studies indicate that cortisol has a connection with psychosocial factors (Kirschbaum et al. 1995; Chida and Steptoe, 2009) including depression and anxiety disorders (e.g., Stetler and Miller, 2005; Bhattacharyya et al., 2008). These studies have focused on the cortisol awakening response (CAR), which is the difference between a participant’s cortisol level upon awakening and the level measured 30-45 minutes later.

It is briefly noted that both methods M and RMT reject at the 0.05 level, indicating that the cortisol levels differ over time. What is more interesting are the results based on method C when comparing two groups based on a measure of depressive symptoms. The first group consisted of participants with a score greater than 15, which generally is taken to indicate minor depressive symptoms or worse; the sample size is 61. The second group consisted of the participants with a score less than or equal to 15; the sample size is 114.

First focus on the first three cortisol measures. Viewing the data in the context of a between-by-within ANOVA design, the between group main effect yielded a p-value equal to 0.214 based on a Welch-type analog for trimmed means, which is described in Wilcox (2022, section 8.6). For each of the three measures taken, the two groups did not differ significantly at the 0.05 level based on a method for comparing 20% trimmed means derived by Yuen (1974). The corresponding p-values are 0.124, 0.299 and 0.740. Comparing the groups based on CAR, again the groups did not differ significantly; the p-value is 0.464. That is, past studies suggest that these two groups would differ based on CAR, but this was not verified here. However, based on method C, again using the first three cortisol measures, the p-value is less than 0.001. The estimated effect sizes are 2.01 for the depressive group and 1.23 for the non-depressive group. The ratio of these two estimates is 1.6. Repeating the analysis using all four cortisol measures, now the p-value based on method C is 0.022.

Method C was applied again for a separate group where measures were taken after intervention. Now the sample sizes for the depressed and not depressed groups are 65 and 169, respectively. Comparing the depressed group to the not depressed group based on the first three cortisol measures, the p-value is 0.048. But using all four measures the p-value is 0.574.

6 Concluding Remarks

All indications are that methods M and C avoid actual Type I error probabilities well above the nominal level. The main difficulty occurs when dealing with distributions that are skewed and relatively heavy tailed: there are situations where the actual level was estimated to be below 0.025 when testing at the 0.05 level.

There are many possible variations of methods M and C. For example, replace the 20% trimmed mean with the median or perhaps a robust M-estimator. There are also numerous affine equivariant estimators that effectively deal with outliers in a manner that takes into account the overall structure of the data cloud (e.g., Wilcox, 2022, section 6.3.13). And there are alternative ways of computing the depth of a point in a data cloud. The practical advantages of these variations remain to be determined.

It is not being suggested that inferential methods based on measures of location should be replaced by methods that compare scale invariant measures of effect size. The idea is that different methods provide different perspectives and that multiple perspectives can provide a more nuanced understanding of data. The illustration based on method C demonstrates this point.

Finally, R functions for applying methods M and C are stored in the file Rallfun-v40, which can be downloaded from https://osf.io/nvd59/quickfiles. The R function rmES.pro applies method M and bwESP.GLOB.B applies method C.