Sample Size Requirements to Test Subgroup-Specific Treatment Effects in Cluster-Randomized Trials

Cluster-randomized trials (CRTs) often allocate intact clusters of participants to treatment or control conditions and are increasingly used to evaluate healthcare delivery interventions. While previous studies have developed sample size methods for testing confirmatory hypotheses of treatment effect heterogeneity in CRTs (i.e., targeting the difference between subgroup-specific treatment effects), sample size methods for testing the subgroup-specific treatment effects themselves have not received adequate attention—despite a rising interest in health equity considerations in CRTs. In this article, we develop formal methods for sample size and power analyses for testing subgroup-specific treatment effects in parallel-arm CRTs with a continuous outcome and a binary subgroup variable. We point out that the variances of the subgroup-specific treatment effect estimators and their covariance are given by weighted averages of the variance of the overall average treatment effect estimator and the variance of the heterogeneous treatment effect estimator. This analytical insight facilitates an explicit characterization of the requirements for both the omnibus test and the intersection–union test to achieve the desired level of power. Generalizations to allow for subgroup-specific variance structures are also discussed. We report on a simulation study to validate the proposed sample size methods and demonstrate that the empirical power corresponds well with the predicted power for both tests. The design and setting of the Umea Dementia and Exercise (UMDEX) CRT in older adults are used to illustrate our sample size methods. Supplementary Information The online version contains supplementary material available at 10.1007/s11121-023-01590-6.


Introduction
Pragmatic cluster-randomized trials (CRTs) are commonly conducted in healthcare delivery systems and adopt cluster randomization due to logistical, administrative, or political considerations (Turner et al., 2017a(Turner et al., , 2017b)).While the overall average treatment effect has been the primary focus in many CRTs, there is an emerging interest in understanding whether the intervention is effective in pre-specified participant subgroups, such as those defined by baseline demographics or clinical characteristics (Bowden et al., 2021;Cox & Kelcey, 2022;Dong et al., 2018Dong et al., , 2021a, b;, b;Gabler et al., 2009;Kravitz et al., 2004;Li & Konstantopoulos, 2023;Spybrook et al., 2016).Participant subgroups can respond to the intervention differently for various reasons, such as differential access to healthcare and differences in clinical characteristics.With an increasing number of CRTs conducted under routine healthcare conditions with the inclusion of broader eligible populations, there is also a greater need to assess how participant-level or cluster-level factors moderate the intervention effect, facilitating the development of interventions to reduce known health disparities and improve health equity.Subgroup analyses based on health equity variables are not uncommon in pragmatic trials.For example, Nicholls et al. (2023) reviewed 62 pragmatics trials of people with dementia published from 2014 to 1 3 2019 and identified 10 studies reporting subgroup analyses across health equity variables; the majority of these studies employed an interaction test.In addition, Starks et al. (2019) conducted a systematic review of CRTs published between 1/1/2010 and 3/29/2016 that focused on cardiovascular disease, chronic lower respiratory disease, and cancer.They reported that 16 out of 64 CRTs examined heterogeneity of treatment effects among demographic participant subgroups but noted a lack of guidance on subgroup analyses for CRTs.Given this context, statistical methods that address sample size and power considerations with a focus on subgroupspecific treatment effects (also referred to as stratumspecific effects by epidemiologists or sometimes simple main effects by social scientists) in pragmatic CRTs are vitally important but remain relatively underdeveloped.
There have been several recent efforts to develop explicit sample size methods for testing confirmatory hypotheses about treatment effect heterogeneity in CRTs.For example, assuming a linear mixed analysis of the covariance model, Yang et al. (2020) proposed an analytical sample size expression for the treatment-by-covariate interaction test and pointed out that results depend on both the intracluster correlation coefficient (ICC) of the outcome and that of the covariate (or sometimes referred to as the effect modifier).The ICC measures the degree of similarity between outcomes measured within the same cluster and plays an important role in planning CRTs (Eldridge et al., 2009).Tong et al. (2022) relaxed the equal cluster size assumption and investigated the impact of variable cluster sizes on power for an interaction test in CRTs.They found that the coefficient of variation of the cluster size (defined as the standard deviation of cluster size divided by the mean cluster size) has minimal impact on the variance of the interaction effect estimator, as long as the effect modifier is measured at the participant level.Li et al. (2022) generalized these sample size procedures for testing heterogeneity of treatment effect to accommodate three-level CRTs with randomization carried out at either the cluster or subcluster level.
While these prior efforts have primarily focused on sample size requirements for testing differences between subgroup-specific treatment effects, sample size methods for testing the subgroup-specific treatment effects themselves have not received adequate attention.In principle, the test of the subgroup-specific treatment effect addresses the question of whether the intervention is effective in one or more subpopulations, as defined, for example, by sex, race, baseline comorbidities, or other health equity variables.In addition, the power analysis for detecting an intervention effect in any subgroup or all subgroups may not be the same as applying a standard power evaluation for an overall effect but with a smaller sample size, because the target hypotheses can be different and because a common practice for data analysis proceeds with an analysis of covariance model including a treatment-by-subgroup interaction that provides a unifying analysis framework for assessment of both overall average treatment effect and subgroup-specific treatment effects (Yang et al., 2020).To fill this important methodological gap, we propose formal sample size procedures for testing subgroup-specific treatment effects in CRTs based on the linear mixed analysis of the covariance model.We focus on a continuous outcome and a binary subgroup variable (measured either at the participant level or cluster level).We outline explicit expressions for power and its key determinants when the focus is on subgroup-specific treatment effects.Finally, we carry out a simulation study to validate our analytical expressions under different target null and alternative hypotheses relevant to subgroup analyses in CRTs.
Our proposed sample size methods are illustrated in the context of the Umea Dementia and Exercise (UMDEX) study (Toots et al., 2016), a CRT evaluating a high-intensity functional exercise program versus a seated control activity to reduce decline in independence in activities of daily living (ADLs) among older people with dementia in residential care facilities.To reduce the risk of contamination, naturally occurring clusters consisting of residents with cognitive impairment who were inhabitants of the same wing, unit, or floor were randomized to receive the intervention or the control (both delivered at the cluster level).Specifically, the study involved 36 clusters of 3 to 8 participants each and considered a continuous primary outcome.To detect potential differences in exercise effects among subpopulations defined by dementia type, prespecified subgroup analyses by dementia type were performed.Dementia type was dichotomized as Alzheimer's versus non-Alzheimer's dementia (including vascular, mixed Alzheimer's and vascular, frontotemporal, Lewy body, and Parkinson's dementia), as the majority of previous trials only included individuals with Alzheimer's disease (Toots et al., 2016).We will formally quantify the sample size required to achieve sufficient power for subgroup analyses.Although dementia type is not a traditional health-equity effect modifier, it defines important subgroups in gerontological research.Additionally, we would like to emphasize that the proposed methods can be applied to any binary effect modifier including typical health equity variables such as race/ethnicity and socioeconomic status.

Linear Mixed Analysis of the Covariance Model
We consider a CRT with n clusters, where n 1 clusters are rand- omized to the intervention condition, and the remaining n − n 1 clusters to the control condition.The randomization proportion is defined as = n 1 ∕n .We write Y ij as the quantitative out- come of participant j in cluster i and denote the total number of participants in cluster i as m i (i = 1, … , n) .Furthermore, Z i ∈ {0, 1} denotes the treatment status for cluster i .Suppose S ij is a binary subgroup variable taking values in {0, 1} .For example, S ij = 1 indicates a resident with Alzheimer's disease, and S ij = 0 indicates a resident with non-Alzheimer's dementia in the UMDEX study (or S ij could be referred to as a binary demographic variable such as sex in other contexts).A common analytical model for examining the subgroup-specific treatment effect in CRTs is the analysis of the covariance model: b and ij ∼ N 0, 2 are the random clus- ter-level intercept and error, respectively.In addition, which is the treatment effect among the subgroup defined by the collection of indices 0 = (i, j);S ij = 0 (in the UMDEX study, 0 refers to the subgroup of participants with Alzheimer's disease), 3 is the main effect of the subgroup variable, and 4 is the treatment-by-subgroup interaction.The model also implies the treatment effect among the subgroup Specifically, previous research focused on testing for 4 , whereas this paper focuses on testing for Δ 0 , Δ 1 .In Model (1), the total variance of the outcome is defined as 2 y|s,z = 2 b + 2 and the ICC of the outcome is given by the ratio of the between-cluster variance and the total variance, or y|s,z = 2 b 2 b + 2 (Eldridge et al, 2009).Estimating the subgroup-specific treatment effects Δ 0 , Δ 1 requires estimating the regression parameters, which typically proceeds via maximum likelihood techniques.In the design stage, sample size calculations often assume that the variance components (and hence the ICC) are known and require an explicit characterization of the variance expressions.Specifically, if we represent the collection of design points for each participant as T and the design matrix for each cluster as X i = X i1 , … X im i T , then the best unbiased linear estimator for regression coefficients 1 − y|s,z I m i + y|s,z J m i is the compound symmetric variance matrix ( I m i is the m i × m i identity matrix and J m i is the m i × m i matrix of ones), and T is the collection of all outcomes in cluster i .The estimators for the subgroup-specific treatment effects are then given by Δ0 = ̂ 2 and Δ1 = ̂ 2 + ̂ 4 , whose variance expressions are of interest for study design calculations.To obtain these variances, it is useful to study the variancecovariance matrix for ̂ , given by For simplicity, we make the conventional assumption of equal cluster sizes such that m i = m for all i. (1)

Variance for Subgroup-Specific Treatment Effect Estimators
We first review key existing results to set the stage for introducing our new results.Assuming Model (1), Yang et al. (2020) developed an explicit expression for Var ̂ 4 , which was referred to as the variance of the heterogeneous treatment effect estimator.The variance for this treatment-by-subgroup interaction effect estimator takes the following explicit form.
where p 1 = P[S ij = 1] is the marginal probability of the sub- group population 1 , and p 0 = 1 − p 1 is the marginal prob- ability of the subgroup population 0 .Three key observations follow.First, 2 HTE depends on both the ICC of the outcome adjusting for the subgroup variable ( y|s,z ), as well as the ICC of the subgroup variable itself ( s ).In CRTs, both the outcome of interest and baseline covariates can have non-zero ICCs (Raudenbush, 1997), and when interest lies in detecting heterogeneity of treatment effects, these two ICC parameters play an equally important role in determining study power.Second, while 2 HTE is monotonically increasing in s , it has a parabolic relationship with y|s,z ; therefore, a larger value of the outcome ICC does not always inflate 2 HTE when holding all other parameters constant (Yang et al., 2020).Thirdly, as a special case of Eq. ( 2), when the subgroup variable is measured at the cluster level (cluster subgroups are formed such that S ij = S i for all j ), we have s = 1 , and expression (2) reduces to a much simpler expression ̃ 2 HTE = Var( ̂ 4 ) = 2 y {1+(m−1) y|s,z } (1− )p 1 p 0 nm .More generally, Tong et al. (2022) have shown that under Model (1) with the subgroup variable measured either at the participant level or cluster level, the variance for the overall average treatment effect estimator (overall average treatment effect parameter defined as which also has the identical form to the variance of an average treatment effect estimator without the subgroup variable (Murray, 1998), with the caveat that the outcome ICC is now defined adjusting for the subgroup indicator.While these variance expressions have been characterized in prior literature, we provide new insights pertaining to the variances of the subgroup-specific treatment effect estimators in Result 1. Brief derivation details are found in Web Appendix A.
Result 1.Under Model (1) with the subgroup variable measured either at the participant level or cluster level, the variances of the subgroup-specific treatment effect estimators and their covariance are given by weighted averages of the variance of the overall average treatment effect estimator and the variance of the heterogeneous treatment effect estimator; that is, where 2 ATE and 2 HTE are defined above in Eqs. ( 2) and (3) respectively.
Several observations follow from Result 1. Firstly, the variance of each subgroup-specific treatment effect estimator does not depend on the true effect size and is only a function of 2 ATE , 2 HTE as well as the marginal prevalence of the subgroup indicator, p 1 .As expected, a larger subgroup (e.g., 1 increases in size when p 1 moves closer to 1) corresponds to a smaller variance of the associated subgroup-specific treatment effect estimator.When p 1 = 0.5 , the two subgroup sizes are balanced in expectation such that Secondly, the covariance between the subgroup-specific treatment effect estimators is the difference between the variance of the overall average effect estimator and that of the heterogeneous effect estimator scaled by p 1 p 0 .In the extreme case where the subgroup variable is defined at the cluster level (in which case the covariate ICC s = 1 , then Cov Δ0 , Δ1 = 2 ATE − p 1 p 0 ̃ 2 HTE = 0 and the two subgroup effect estimators are uncorrelated (just like conducting two separate studies), regardless of the subgroup proportions.In this case, Var Δ0 = 2 ATE ∕p 0 and Var Δ1 = 2 ATE ∕p 1 and the vari- ance of the subgroup treatment effect estimator is inversely proportional to the size of the subgroup; in addition,

HTE
. Finally, if we set s = y|s,z = 0 , the result is applicable for subgroup analyses in individually randomized trials where data are often assumed to be independent.

Sample Size Estimation Based on Omnibus Test
The explicit characterization of the covariance matrix for Δ0 , Δ1 provides an analytically tractable approach for quantifying the power for testing the subgroup-specific treatment effects.We first consider the null hypothesis that the intervention has no effect in both subgroups, corresponding to testing In this case, an investigator may declare the treat- ment a success if an effect is observed in at least one subgroup.A possible test statistic for H 0 is the F-statistic, given by the quadratic form where ΩΔ is the estimated covariance matrix of the subgroup-specific treatment effect estimators ( Δ0 , Δ1 ) with elements defined in Result 1 (an explicit expression is before Eq. ( 4)).Under H 0 , F * approximately follows a central F -distribution with the numerator and denominator degrees of freedom (2, n − 2) , where n − 2 was chosen as the between-within degrees of freedom (# of clusters − # of cluster-level covariates) to reflect a penalty due to at least two cluster-level parameters in Model (1); an alternative choice of degrees of freedom, such as n − 4 , can be made with a cluster-level subgroup variable.We consider the Ftest rather than the 2 -test because the former usually has a more robust small-sample performance (Roy et al., 2007;Tian et al., 2022).Under the alternative, F * approximately follows a non-central F-distribution with noncentrality parameter = Δ 0 , Δ 1 Ω −1 Δ Δ 0 , Δ 1 T , where we have obtained from Result 1 that Therefore, for a nominal type I error rate , the power under a given effect size Δ 0 , Δ 1 is where F 1− (2, n − 2) is the critical value of the central F(2, n − 2) distribution, and f (x; , 2, n − 2) refers to the probability density function of the noncentral F( , 2, n − 2) distribution.Finally, to determine the required sample size, one could fix the type I error rate ( ), randomization proportion ( ), subgroup proportions ( p 1 , p 0 ), outcome variance ( 2 y|s,z ), outcome ICC ( y|s,z ), subgroup variable ICC ( s ), cluster size ( m ), and effect sizes ( Δ 0 , Δ 1 ) and specify a series of integers n .Then the required sample size can be obtained as the smallest integer that provides the pre-specified power ( 1 − ) using Eq. ( 4).The role of n and m can be switched in this procedure to solve for the required cluster size given the number of clusters.

Sample Size Estimation Based on Intersectionunion Test
Alternatively, a more stringent testing framework can be considered such that the null hypothesis would only be rejected when there is a treatment effect in both subgroups.That is, an investigator would declare the intervention a success only if a treatment effect is observed in both subgroups.
In this case, one may be interested in testing H 0 ∶ Δ 0 = 0 and/or Δ 1 = 0 versus H 1 ∶ Δ 0 ≠ 0 and Δ 1 ≠ 0 and employ the intersection-union test based on the linear mixed analysis of the covariance model.Of note, the null space is composite as it includes the following three cases: treatment has no effect on both subgroups, treatment has no effect on subgroup 0 , and treatment has no effect on subgroup 1 .For testing this composite null, we consider the bivariate Wald test statistic, = 0 , 1 T , where represent the standard erroradjusted treatment effect estimators.Therefore, one can show that follows a multivariate normal distribution and correlation matrix Φ , whose diagonal elements are given by 1 and off-diagonal elements by Var Δ0 Cov Δ0 , Δ1 Var Δ1 −1∕2 (Tian et al., 2022).Given the total number of clusters n and cluster size m , the power func- tion to simultaneously detect the treatment effect in both subgroups is given by where c 0 , c 1 are two subgroup-specific critical values for rejecting the null, and g(a, b) is the density function of the Wald test statistics under the alternative.While a typical choice of g is the multivariate normal distribution, we fol- low Yang et al. (2022) and consider a bivariate t-distribution with location vector , shape matrix Φ , and degrees of freedom n − 2 as this has been shown to have better con- trol of type I error rates in small samples (by partially accounting for the variability in estimating the covariance parameters).The specification of critical values can lead to intersection-union tests with different operating characteristics (Kordzakhia et al., 2010), and we adopt a simple approach such that c 0 = c 1 = t (n − 2) , which is the (1 − ) quantile of the univariate t-distribution.That is, we reject H 0 when  0 > t  (n − 2) and  1 > t  (n − 2) .This specification of critical values is at most conservative such that the type I error rate is controlled to be strictly below within the composite null space (Li et al., 2020).Of note, the performance of this approach can critically depend on the number of clusters.For example, when the number of clusters is small, the estimated degrees of freedom n − 2 may be very small, and therefore, the test may be conservative (Davis-Plourde et al., 2023).Finally, for sample size determination, one can use the power Eq. ( 5) and solve for n or m given pre-specified values of all other design parameters using the procedure described for the omnibus test.

The Role of ICC Parameters
To provide some intuition on how the ICC parameters affect study power, we numerically explore the relationship between power and the two relevant ICC parameters ( y|s,z and s ) for the omnibus test and intersection-union test in ( 5) Figs. 1 and 2. We consider a CRT with equal allocation to both arms with = 1∕2 and assume n = 30 clusters, cluster size m = 100 , total variance of the outcome 2 y|s,z = 1 , the treatment effect among the subgroup 0 is Δ 0 = 0.3 , and the treatment effect among the subgroup 1 is Δ 1 = 0.4 and vary the prevalence of the subgroup indicator by choosing p 1 ∈ {0.3, 0.5, 0.7} .In Fig. 1, we observe that the power of the omnibus test monotonically decreases in s but has a parabolic relationship with y|s,z .In general, power is not too sensitive to s , especially when y|s,z is small.But power often increases as the prevalence of the subgroup with a larger treatment effect increases.In Fig. 2, we observe that the power of the intersection-union test monotonically decreases in both s and y|s,z , is more sensitive to changes in y|s,z than in s , and appears to be more sensitive to changes in s than the omnibus text, particularly for larger values of s .In practice, we recommend exploring the sensitivity of sample size and power under varying a priori estimates for the two ICC parameters, as our numerical results illustrate that power may change according to different ICC assumptions.

Simulation Design
We follow the ADEMP framework proposed by Morris et al. (2019), which breaks down the simulation study into five key elements: aims, data-generating mechanisms, estimands, methods, and performance measures.

Aims
This simulation study aims to assess the performance of our sample size formulas with equal randomization ( = 1∕2 ) and equal subgroup proportions ( p 1 = p 0 = 1∕2 ), for both the omnibus test and the intersection-union test.The primary objectives are to verify that the empirical type I error rate is controlled at or under the nominal level and empirical power is close to that predicted by the formula.

Data-Generating Mechanisms
From the proposed sample size formulas, the total number of clusters depends on the following parameters: nominal type I error rate ( ), power ( 1 − ), total variance of the outcome ( 2 y|s,z ), ICC of the outcome ( y|s,z ), ICC of the subgroup variable ( s ), cluster size ( m ), and effect sizes.Throughout, we fixed the total variance 2 y|s,z at 1, nominal 1 3 type I error rate at 5%, desired power 1 − at 80%, 2 = 0.2 and 4 = 0.1 for the omnibus test, and 2 = 0.3 and 4 = 0.1 for the intersection-union test and varied the remaining parameters.That is, the true subgroup-specific treatment effects were Δ 0 = 0.2 and Δ 1 = 0.3 for the omnibus test and Δ 0 = 0.3 and Δ 1 = 0.4 for the intersec- tion-union test.Different effect sizes were chosen for different tests to ensure a realistic range of the predicted number of clusters.We considered three levels of cluster size m ∈ {20, 50, 100} , three levels of ICC for the outcome conditional on the subgroup variable y|s,z ∈ {0.02, 0.05, 0.1} , and three levels of ICC for the subgroup variable s ∈ {0.1, 0.25, 0.5} ; s was chosen to follow pre- vious simulations for the interaction tests (Tong et al., 2022;Yang et al., 2020) and to better illustrate its potential impact on power.In summary, we considered 3 × 3 × 3 = 27 parameter combinations for each test.In each scenario, the total number of clusters n was deter- mined as the smallest number that ensured the predicted power was at least 80%.To assess the empirical type I error rate, both 2 and 4 were fixed at 0 for the omnibus test, while only 2 was fixed at 0 for the intersection-union test.For each sample size obtained from the respective formula, we then simulated the binary subgroup variable S ij from the beta-binomial model with the cluster- specific probability of the subgroup population 1 as p i ∼ Beta(q 1 , q 2 ) and S ij ∼ Bernoulli(p i ) , where q 1 and q 2 were determined by the marginal probability of the Fig. 1 Power of the omnibus test with n = 30, m = 100, 2 y = 1, Δ 0 = 0.3, Δ 1 = 0.4 at p 1 ∈ {0.3, 0.5, 0.7} as a function of (A) y|s,z when fixing s = 0.2 ; (B) y|s,z when fixing s = 0.5 ; (C) s when fixing y|s,z = 0.05 ; (D) s when fixing y|s,z = 0.2 subgroup population 1 as p 1 = q 1 ∕(q 1 + q 2 ) and the ICC of the subgroup variable as s = 1 + q 1 + q 2 −1 .For each scenario, we also simulated the outcome Y ij from Model (1), by fixing 1 = 0 and 3 = 0.15 (in theory the power is not affected by these two parameters).

Estimands
Given our focus on testing, the estimand aspect of the ADEMP framework could be interpreted as the empirical power and empirical type I error rate of each test, estimated by the formula predictions.

Methods
In each scenario, 2000 data replications were generated and analyzed for the evaluation of the empirical type I error rate under the null and empirical power under the alternative hypothesis.As the nominal type I error rate was 5%, according to the margin of error from a binomial model with 2000 replications, we considered an empirical type I error rate from 4.0 to 6.0% as close to the nominal.Similarly, as the predicted power was at least 80% for each scenario, we considered an empirical power differing at most 2.0% from the predicted power as acceptable.In addition, for each scenario, Fig. 2 Power of the intersection-union test with n = 30, m = 100, σ 2 y = 1, Δ 0 = 0.3, Δ 1 = 0.4 at p 1 ∈ {0.3, 0.5, 0.7} as a function of (A) y|s,z when fixing s = 0.2 ; (B) y|s,z when fixing s = 0.5 ; (C) s when fixing y|s,z = 0.05 ; (D) s when fixing y|s,z = 0.2 we also provide a comparison with a back-of-the-envelope approach.This approach estimates the required number of clusters n c by first using our formula ignoring any intraclus- ter correlations with y|s,z = s = 0 (i.e., assuming individual randomization) and then multiplying the required sample size by the conventional design effect for CRTs: 1 + (m − 1) y|s,z .We also calculate the actual predicted power using our formula based on n c to compare the perfor- mance of the back-of-the-envelope approach to our method, as well as relative saving in the required number of clusters ( =

Performance Measures
To assess the performance of the sample size formulas for each test, we compute both the empirical type I error rate and empirical power in each simulation scenario.The empirical type I error rate is calculated as the percentage of times a null hypothesis is rejected when the null is actually true; the empirical power is calculated as the percentage of times a null hypothesis is rejected when the null is actually false.

Simulation Results
All statistical analyses were conducted with R, version 4.2.2, and the convergence rate was 100% for each scenario.Table 1 summarizes the estimated required number of clusters ( n ) using the proposed formula, empirical type I error, empirical power, and predicted power, for the omnibus test.For all scenarios, the type I error rates were all within the acceptable range, and the empirical power corresponded well with the predicted power.In the last three columns of Table 1, we present the results for the back-of-the-envelope approach.We The treatment effect among the subgroup 0 is Δ 0 = 0.2 , and the treatment effect among the subgroup 1 is Δ 1 = 0.3 .In the last two columns, we estimate the required number of clusters n c using the proposed formula with y|s,z = s = 0 and the design effect with the true value of y|s,z and then obtain the actual predicted power (actual power) using our formula based on n c as well as the true values of y|s,z and s a All scenarios assume a CRT with equal randomization ( = 0.5 ) and equal subgroup proportions ( p 1 = p 0 = 0.5 ) and a quantitative outcome having variance 2 y = 1 b The type I error rates were all within the acceptable range (from 4.0% to 6.0% ), and the empirical power corresponded well with the predicted power (differing at most 2.0% from the preficted power) observe that for the omnibus test, simply inflating the sample size with the conventional design effect always leads to a larger sample size than the proposed method, and the relative savings in the required number of clusters ranges from 8.3 to 54.5% across the scenarios considered.Table 2 summarizes the estimated required number of clusters ( n ) using the proposed formula, empirical type I error, empirical power, and predicted power, for the intersection-union test.With a small cluster size ( m = 20, 50 ), the intersection-union test provided conservative type I error rates ( < 4.0% ); for the larger cluster size ( m = 100 ), the type I error rates grew closer to nominal.The empirical power corresponded well with the predicted power across almost all scenarios.Finally, we also observe that for the intersection-union test, inflating the sample size under individual randomization via the simple design effect always results in a larger number of clusters than the proposed method and may therefore lead to unnecessary use of resources.Specifically, the relative savings in the required number of clusters ranges from 17.4 to 59.1% across the scenarios considered.Therefore, our method is especially attractive when the available numbers of clusters or resources are limited.

Illustrative Data Example
We illustrate our sample size methods by calculating the required number of clusters (i.e., groups of participants) in the context of the UMDEX study, which was introduced in the "Introduction" section.Recall that an important aim of the study was to investigate intervention effects in different subpopulations defined by dementia type: Alzheimer's disease versus non-Alzheimer's dementia.Clusters were Table 2 Simulation scenarios a , estimated required number of clusters n based on the proposed formula, empirical type I error rates (emp.size), empirical power (emp.power), and predicted power (pred.power) for the intersection-union test.The treatment effect among the subgroup 0 is Δ 0 = 0.3 , and the treatment effect among the subgroup 1 is Δ 1 = 0.4 .In the last two columns, we estimate the required number of clusters n c using the proposed formula with y|s,z = s = 0 and the design effect with the true value of y|s,z and then obtain the actual predicted power (actual power) using our formula based on n c as well as the true values of y|s,z and s a All scenarios assume a CRT with equal randomization ( = 0.5 ) and equal subgroup proportions ( p 1 = p 0 = 0.5 ) and a quantitative outcome having variance 2 y = 1 b Starred text indicates the type I error rates were smaller than 4.0% , which were conservative but the tests were still valid; double starred text indicates the empirical power was more than 2.0% smaller than the predicted power randomized to either the exercise or control activities in a 1:1 ratio.The intervention exercise program consisted of five exercise sessions lasting approximately 45 min, each held per 2-week period for 4 months (40 sessions in total).The primary outcome of ADL independence was measured at the patient level at 4 months using the FIM, a 13-item instrument with items rated on a scale from total assistance (1) to complete independence (7) and a total score ranging from 13 to 91.We treat the FIM as a continuous outcome with larger values indicating more independence in ADLs.First, suppose the investigators were interested in the omnibus test, demonstrating a treatment effect in at least one of the two subgroups.They need to determine the required number of clusters to achieve at least 80% power at the 5% nominal test size.The target effect size for the subgroup with non-Alzheimer's dementia was a standardized difference of Δ 0 ∕ y|s,z = 0.7 and for the subgroup with Alzheimer's disease was Δ 1 ∕ y|s,z = 0.5.Furthermore, the assumed ICC of the subgroup variable (dementia type) was s = 0.2 , and of the outcome FIM adjusted for the subgroup indicator was y|s,z = 0.04 .The anticipated prevalence of patients with Alzheimer's disease was p 1 = 36% , and the anticipated number of patients per cluster was assumed to be m = 10 .With these assumptions and inverting power Eq. ( 4), the required number of clusters can be calculated as 18 with a predicted power of 85.5%. Figure 3 shows a Fig. 3 Predicted power contours for the omnibus test as a function of s and y|s,z at m = {10, 20, 30, 40} , with n = 18 , Δ 0 = 0.7 × y , Δ 1 = 0.5 × y for the UMDEX study sensitivity analysis of power as a function of s and y|s,z and for m = {10, 20, 30, 40} , with n = 18 .When y|s,z ≤ 0.09 , the predicted power decreases as s or y|s,z increases.Overall, even with 10 patients per cluster, the power of the study remains above 0.8 as long as y|s,z ≤ 0.06 .However, using the back-of-the-envelope approach, the required number of clusters would be 20 (i.e., 2 more clusters compared to the proposed method) with an actual power of 89.8%.
Second, suppose the investigators were interested in the intersection-union test, demonstrating treatment effects in both subgroups.Again, they need to determine the required number of clusters to achieve at least 80% power at the 5% nominal test size.Keeping all of the other assumptions as in the omnibus test and using Eq. ( 5), the required number of clusters would be 34 with a power of 80.6%.With the same assumptions, more clusters are required for the intersection-union test because of a more stringent requirement of treatment effects in both subgroups based on the omnibus test.Figure 4 shows a sensitivity analysis of power as a function of s and y|s,z at m = {10, 20, 30, 40} , assuming n = 34 .When y|s,z ≤ 0.09 , the predicted power decreases as s or y|s,z increases, and the power appears to be more sensitive to s compared to the omnibus test.Overall, even with 10 patients per cluster, the power of the study remains above Fig. 4 Predicted power contours for the intersection-union test as a function of s and y|s,z at m = {10, 20, 30, 40} , with n = 34 , Δ 0 = 0.7 × y , Δ 1 = 0.5 × y for the UMDEX study 0.8 as long as ≤ 0.04 and s ≤ 0.25 .However, using the back-of-the-envelope approach, the required number of clusters would be 42 (i.e., 8 more clusters compared to the proposed method) with an actual power of 87.7%.
In this example, we additionally consider the sample size required for the interaction test (Yang et al., 2020).Suppose the investigators were interested in the interaction test for heterogeneity of treatment effects, demonstrating a difference in treatment effects between the two subgroups.Keeping all of the other assumptions as in the omnibus test and the intersection-union test and using the method in Yang et al. (2020), the required number of clusters would be 284 with a power of 80.2%.Note that the interaction test requires a much larger sample size because the effect size for the between-subgroup difference is much smaller than the effect size for each subgroup.This comparison demonstrates that sample size requirements for subgroup-specific treatment effects and those for testing treatment effect heterogeneity do not have a clear nesting relationship, and the choice between them ultimately depends on the scientific study objective.Finally, we calculated the required sample size for testing the overall average treatment effect ( H 0 ∶ 2 + p 1 4 = 0 versus H 1 ∶ 2 + p 1 4 ≠ 0 ).Keeping all assumptions as above, using a t-test with the variance (3) suggested that the required num- ber of clusters would be 12 with a power of 85.9%; thus, a smaller sample size is sufficient for testing the overall average treatment effect.

Discussion
It is increasingly important for investigators to explicitly formulate health equity objectives about testing subgroupspecific treatment effects and then design the trial accordingly, i.e., with adequate power to address the health objectives.Accordingly, there is an emerging need to study sample size requirements for such objectives, especially for cluster-randomized designs.Recently, the National Institute on Aging (NIA) IMbedded Pragmatic Alzheimer's disease (AD) and AD-Related Dementias (AD/ADRD) Clinical Trials (IMPACT) Collaboratory considered the need to "clearly state health-equity-relevant aims & hypotheses" and "be explicit in sample size justification with regard to the health equity objective" in their health equity best practices guidance document (NIA IMPACT Collaboratory, 2022).It is therefore critically important to integrate health equity considerations in the design stage of a pragmatic trial (Nicholls et al., 2023), for which we contribute analytical power and sample size formulas.We consider a simple yet common case with a binary subgroup variable and clarify the ingredients that determine the variance of the subgroup-specific treatment effect under a linear mixed analysis of the covariance model.On some occasions, we recognize that many CRTs include the analysis of the overall treatment effect as the primary analysis and the subgroup analysis as secondary.In those cases, our methods can help provide a context to interpret the subgroup results and address questions of how many more clusters are needed if the study aims to generate evidence on subgroup treatment effects.Alternatively, if the sample size is driven solely by the overall average treatment effect, it is still helpful to know what power is available to detect plausible subgroup treatment effects, even if it is not 80% .Moreover, we consider both the omnibus test and the intersection-union test.The choice of tests depends on the study context and scientific question, and our work allows investigators to focus on either test, as well as compare the sample size implications of the two tests.In addition, for the omnibus test, power generally decreases in the ICC of the subgroup variable and has a parabolic relationship with the ICC of the outcome conditional on the subgroup variable; for the intersection-union test, power monotonically decreases in both ICC parameters.This information can help investigators specify ICC parameters that are likely to provide conservative sample size estimates if accurate information on design parameters is unavailable at the design stage.Finally, even though our data example does not include a multilevel intervention which has two or more levels of intervention at the same time or in close temporal proximity (Agurs-Collins et al., 2019), our approach remains applicable to a multilevel intervention as long as the study considers cluster-level randomization.
Importantly, this paper focuses on testing subgroupspecific treatment effects and has a distinct focus from the previous research on the heterogeneity of treatment effects (i.e., an interaction test).In our perspective, these two analyses provide complementary evidence by addressing different aspects of how intervention affects subpopulations in CRTs.The choice between testing subgroup-specific treatment effects and testing heterogeneity of treatment effects ultimately depends on the study objective, that is, whether the study aims to test the treatment effect in each subgroup or the difference in treatment effects between the subgroups.The required sample size to detect the subgroup-specific treatment effects is expected to be different than that for testing the treatment difference, even under the same effect size specifications, as demonstrated in the illustrative example.The reason is that our methods depend on the size of effects specified for each subgroup, while the interaction test depends only on the difference between two subgroups.Furthermore, there exist practical situations with an unbalanced distribution of an important subgroup variable (e.g., gender identity) whereby the trial could only include very few people in one group but many more in another.This 1 might lead to challenges in powering the interaction test (since p 1 p 0 is close to zero), but one may still have a chance to identify treatment effect signals in at least one subgroup with sufficient power.Given the importance of addressing sex-gender considerations in trial designs, our methodology allows investigators to ensure that there is adequate power in at least the larger of the two subgroups even when the study may not be adequately powered for detecting heterogeneity of treatment effects.
Although Model ( 1) is a commonly used analytical model, it assumes that the correlation among participant outcomes within the same cluster is the same between the two subgroups.Ignoring the difference in correlations among members of the same cluster in the two subgroups, when it exists, may lead to an inflated type I error rate.Extending Model (1), we can include a random coefficient for S ij to allow the outcome correlation among participants from the same cluster to differ between subgroups.Specifically, one can consider the model: where the parameters are similarly interpreted as those in Model (1), except for the addition of the random clusterlevel slope, c i ∼ N(0, 2 c ) .Model (6) encodes three outcome ICCs: the ICC between different participants in subgroup 0 , the ICC between different participants in different subgroups, and the ICC between different participants in subgroup 1 .The closed-form formulas for Var Δ0 , Var Δ1 , and Cov Δ0 , Δ1 under Model (6) are analytically less trac- table due to the complexity of the correlation structure.Therefore, in Web Appendix B, we propose an efficient Monte-Carlo sample size procedure through simulating data and inverting the correlation matrix, as an extension to allow for different outcome ICCs in different subgroups.
Our development of sample size procedures focusing on testing subgroup-specific treatment effects for a binary subgroup variable represents an endeavor to improve standards for confirmatory subgroup analyses in CRTs but comes with several limitations that we plan to address in future work.First, there are scenarios where more than two subgroups are of interest, and our method can be generalized to accommodate multiple subgroups following the derivations in Sect.3.2 of Yang et al. (2020).However, the final covariance matrix for the subgroup-specific treatment estimators may depend on two ICC parameters of two dummy variables, and the sample size procedure is inevitably more complicated.In addition, for a total study sample size, an increasing number of subgroups will on average lead to smaller subgroup sample sizes, which could diminish power.In future work, it would be worth exploring the implications of the number of subgroups ( 6) on study power for both the omnibus and intersectionunion testing frameworks.Second, we assumed equal cluster sizes to simplify derivations.Such an assumption is routinely made in CRTs at the design stage but could be violated in practice.It would be interesting to extend our sample size formulas to accommodate variable cluster sizes, perhaps along the lines of van Breukelen et al. (2007) and Tong et al. (2022).Third, our work assumed the effect of the treatment group variable is constant across clusters.An extension of our work to varying effects can be made by including an additional random slope for Z i in Model (1) or Model (6) (Tong et al., 2023).A Monte-Carlo procedure similar to that developed in Web Appendix B can be used for sample size estimation but requires assumptions on additional ICC parameters which will be explored in future work.Fourth, we only considered equal subgroup proportions in the simulation study.In addition, smaller numbers of clusters, such as 8, 10, or 12, which may occur in some CRTs, were not considered in the simulation study.Possible challenges and additional scenarios with unequal subgroup proportions and with a small number of clusters will be addressed in future work.Finally, we have assumed that the outcome is continuous, and analysis is based on linear mixed analysis of covariance.An extension of our work to binary and categorical outcomes will be pursued in future work.However, in some cases, the sample size results developed for continuous outcomes can still be used to provide an approximate calculation for binary outcomes, providing that the target effect size is on the risk difference scale.The accuracy of this approximate procedure remains to be investigated in the context of subgroup analyses.

Table 1
power) for the omnibus test.