Power calculation for detecting interaction effect in cross-sectional stepped-wedge cluster randomized trials: an important tool for disparity research

Background The stepped-wedge cluster randomized trial (SW-CRT) design has become popular in healthcare research. It is an appealing alternative to traditional cluster randomized trials (CRTs) since the burden of logistical issues and ethical problems can be reduced. Several approaches for sample size determination for the overall treatment effect in the SW-CRT have been proposed. However, in certain situations we are interested in examining the heterogeneity in treatment effect (HTE) between groups instead. This is equivalent to testing the interaction effect. An important example includes the aim to reduce racial disparities through healthcare delivery interventions, where the focus is the interaction between the intervention and race. Sample size determination and power calculation for detecting an interaction effect between the intervention status variable and a key covariate in the SW-CRT study has not been proposed yet for binary outcomes. Methods We utilize the generalized estimating equation (GEE) method for detecting the heterogeneity in treatment effect (HTE). The variance of the estimated interaction effect is approximated based on the GEE method for the marginal models. The power is calculated based on the two-sided Wald test. The Kauermann and Carroll (KC) and the Mancl and DeRouen (MD) methods along with GEE (GEE-KC and GEE-MD) are considered as bias-correction methods. Results Among three approaches, GEE has the largest simulated power and GEE-MD has the smallest simulated power. Given cluster size of 120, GEE has over 80% statistical power. When we have a balanced binary covariate (50%), simulated power increases compared to an unbalanced binary covariate (30%). With intermediate effect size of HTE, only cluster sizes of 100 and 120 have more than 80% power using GEE for both correlation structures. With large effect size of HTE, when cluster size is at least 60, all three approaches have more than 80% power. When we compare an increase in cluster size and increase in the number of clusters based on simulated power, the latter has a slight gain in power. When the cluster size changes from 20 to 40 with 20 clusters, power increases from 53.1% to 82.1% for GEE; 50.6% to 79.7% for GEE-KC; and 48.1% to 77.1% for GEE-MD. When the number of clusters changes from 20 to 40 with cluster size of 20, power increases from 53.1% to 82.1% for GEE; 50.6% to 81% for GEE-KC; and 48.1% to 79.8% for GEE-MD. Conclusions We propose three approaches for cluster size determination given the number of clusters for detecting the interaction effect in SW-CRT. GEE and GEE-KC have reasonable operating characteristics for both intermediate and large effect size of HTE. Supplementary Information The online version contains supplementary material available at 10.1186/s12874-024-02162-0.


Background
The push towards studying and improving racial and ethnic disparities in healthcare has grown in recent years.Across various settings, retrospective studies continue to demonstrate that Black patients are less likely to obtain adequate care and outcomes than that of White patients [1][2][3][4].Within the field of palliative care, Black patients are less likely to receive advanced care planning and to enroll in palliative care than other racial groups [5][6][7].One step towards mitigating these findings includes implementing interventions, first introduced through randomized controlled trials, which work to close the health disparity gap between minority and non-minority patients.Examples include properly training and removing racial bias among physicians [6], improving communication between physicians and patients [8,9], and integrating automated tools into physicians' decision making to prompt care [10].While the interventions should work to increase quality of care among all patients, in order to truly tackle the issue of disparity, they should have a greater impact on minority patients, as their baseline care is usually less adequate than that of non-minorities.In the statistical models analyzing the impact of these interventions, the main interest is the magnitude and significance of the interaction term between race and the exposure variable.
Interventions aimed at reducing racial disparities in healthcare often tackle change at the provider or clinic level, suggesting the use of cluster randomized trials (CRT).This has been conducted in several published trials.In a two-arm, parallel group CRT, researchers aimed to increase unmet palliative care needs among both Black and White intensive care unit (ICU) families by introducing an automated web app, aimed at improving communication between physicians and families [9].While this study relied on a parallel group CRT, these designs are not always feasible, particularly in cases where the intervention cannot be denied to half of the participants.In these situations, the stepped-wedge cluster randomized trial (SW-CRT) is an appealing design [11].Appealing features of SW-CRTs include having each cluster act as its own control and not needing to withhold the intervention from any patient due to the key feature of only needing unidirectional cross-over of clusters from control to intervention in a time-staggered manner.However, from a statistical perspective, the SW-CRT design gives us complex correlation structures compared to the standard CRT design due to the uni-directional crossover from control to intervention.
Several approaches for optimal sample size determination of SW-CRT for estimating overall treatment effect (OTE) are available in published literature [12][13][14][15][16][17] for both continuous and binary outcomes.Various software implementations of these approaches are summarized in existing literature [18,19].In the study design stage for CRT, detecting the overall intervention effect is the main consideration in the sample size determination/power calculation, and examination of differential effect by group is the secondary objective [9,20].However, in trials assessing health disparities, the key interest is in examining whether the intervention effect differs among study subgroups.In healthcare delivery research, examination of differential effect by group would be also considered in the study planning stage.Sample size calculation methods for continuous outcomes based on the interaction level have been studied extensively in recent advances [21,22].However, conducting power calculations around the interaction level in these types of trials is challenging, and sample size determination for detecting an interaction effect between intervention status variable and a key covariate in the SW-CRT study has not been proposed yet.Since binary outcome measures are the most common outcome in SW-CRTs for healthcare delivery research, we focus on binary outcomes in this study.
There is limited evidence of similar health-equity focused CRTs properly powering their studies for the interaction term or providing enough transparency of methodologies used.The parallel-group CRT assessing palliative care needs among ICU families powered their study around the OTE for all patients as well as by independently powering each racial subgroup for identical OTEs.However, the study admitted they likely did not have enough power for the interaction effect between the exposure and race [9].Similarly, a SW-CRT aimed at increasing the rates of advanced care planning (ACP) among African-American patients through a structured ACP intervention conducted their main sample size calculation around the OTE, and they briefly mentioned they had sufficient power in the secondary objective for detecting a reduction in the racial disparity gap [8].Limited information was provided on how this calculation was conducted, preventing future researchers from utilizing their design.Aside from the statistical drawbacks that come with under-powering a study, failing to address that the effect size for the minority group should be larger than that of the non-minority group is an issue.By designing the study to achieve the same effect size among all subgroups, the issue of the disparity gap will remain.Instead, these types of studies should be powered to detect a larger effect size in the minority group.
Researchers at Mount Sinai Hospital System (MSHS) are planning to conduct a cross-sectional SW-CRT for advanced cancer patients who are at high risk for dying within a short period of time (less than 6 months).There would be two choices for the patients and their family members that oncologists are expected to discuss as the 'goals of care' (GoC): (1) remaining in active treatment with curative intent and (2) moving into hospice care for management of pain and discomfort only.A machinelearning based predictive model for 6-month predictive mortality has already been optimized with retrospective data and validated with prospective data.A clinical decision support system (CDSS) with information from the predictive model will guide the physician through the electronic medical record (EMR) system to have a GoC for the patient predicted at high-risk for mortality.Physicians in the intervention arm will also be trained on how to have the GoC with patients and family members and will be shown data on low GoC conversations for the minority (Black and Hispanic) patients at MSHS.Therefore, the intervention of the planned SW-CRT is a combination of training and a CDSS based alert about prediction.The physicians in the control arm will have neither the CDSS alert nor the training.The binary outcome in this case is having the GoC conversation or not.The binary covariate is race (White/Black).It is hypothesized that the proportion of GoC in Black patients will increase at a higher rate than for White patients with the same intervention.This is equivalent to testing the hypothesis that the interaction effect between intervention and race is significant.
In this article, our main focus of the study design is on the cross-sectional SW-CRT design, in which the outcome is measured on individuals within each cluster at each time period.For example, in a crosssectional SW-CRT to evaluate a hospital-based psychological care quality improvement intervention for cancer patients, the binary outcome of health-related quality of life improvement (improved vs worse/ unchanged) was measured among cross-sections of individuals attending the hospitals during each timeperiod [6].We propose approaches to this problem in a cross-sectional SW-CRT setting with an individual binary covariate.We utilize the generalized estimating equation (GEE) model setting for these formulations for binary outcomes.We organize this article in the following manner.In Sect.2, we introduce the three proposed approaches for obtaining the optimal sample size for detecting interaction effect in the cross-sectional SW-CRT.In Sect.3, we describe the planned simulation study for estimating comparative empirical type I and type II error rates for several settings.In Sect.4, we show simulation results.Section 5 discusses a motivational example for this manuscript.We provide an illustration of the proposed approaches using a real data example from the MSHS study.We conclude our article in Sect.6 and Sect.7 with some discussion and plans for expanding our study to continuous and censored endpoints as well as detecting the interaction effect with cluster-level covariates.

Statistical approaches in the cross-sectional SW-CRT
In general, the design and analysis for SW-CRT has been based on either conditional model approaches or a marginal model approach.Conditional model approaches require specification of fixed effects and random effects.Linear mixed models (LMM) are used for continuous outcomes, while generalized linear mixed models (GLMM) are used for binary outcomes.When we consider a simpler model without an interaction between intervention and covariates, fixed effects include period effect terms and intervention effect term, while random effects include random cluster effect and cluster-by-time interaction.For the closed cohort SW-CRT, the random effect for the repeated measures from an individual within the same cluster is needed.Using these random effect terms, conditional approaches enable us to have flexible random effect structures.While the LMM benefits from the flexible random effect structures due to the interpretation that the intervention effect is unrelated to these random effect structures.The interpretation of the intervention effect is not straightforward for the GLMM since it depends on the specification of the random effect structure.For sample size determination, flexible modeling of random effects requires distributional assumptions.These assumptions need more information in the study design stage.However, the marginal model approach provides intuitively straightforward interpretation of intervention effect since it focuses on the population-averaged intervention effect.This approach needs two specifications: marginal mean model and a working correlation structure.Unlike conditional model approaches, the intervention effect interpretation does not depend on correlation structure modeling [23].

Models for estimating the heterogeneity of effect in the cross-sectional SW-CRT
We consider the marginal model via the generalized estimating equation (GEE) approach for the cross-sectional SW-CRT design, where we focus on the interaction effect between the intervention and a binary individual covariate.Let Y ijk be a binary outcome of the k th individual ( k ∈ {1, ..., m i } ) in the i th cluster ( i ∈ {1, ..., I} ) at the j th time interval ( j ∈ {1, ..., J } ), where m i denotes cluster size for cluster i, I denotes the total number of clusters, and J denotes total time steps and Y ijk = 1 denotes the event of interest and Y ijk = 0 otherwise.A GEE model is formu- lated for the cross-sectional SW-CRT as follows: where µ ijk = E Y ijk , denotes the marginal mean response of Y ijk , θ 0 is the baseline log-odds of the out- come in the control group corresponding to the reference group for the binary covariate ( X ijk ∈ {0, 1} , X ijk = 0 represents the reference group and X ijk = 1 represents the other), and γ j is the period fixed effect for the j th time interval.W ij is the design indicator; W ij = 1 means that all individuals in cluster i at time interval j receive the intervention and W ij = 0 otherwise.θ 1 is the overall treatment effect (OTE), θ 2 is main effect of binary covari- ate, and θ 3 captures the interaction between treatment and the binary covariate (i.e., heterogeneity in treatment effect denoted by HTE).For identification, we set γ 1 = 0 .We use � = (γ 2 , • • • , γ J , θ 0 , θ 1 , θ 2 , θ 3 ) ∈ R J +3 for the parameters and variance of Y ijk is defined as υ ijk , where R d represents the space of all d-dimensional real vec- tors and υ ijk = µ ijk (1 − µ ijk ) for binary outcome.Hence model (1) can be written as where as the vector of length J − 1 with all elements equal to 0 except the j − 1 th element, which is equal to 1. Now by stacking M ijk 's as the matrix of Jm i rows and J + 3 col- umns, namely, We reach (1) ⊺ and the function logit here is applied to the vector u i elementwise.
If the HTE is to be tested with respect to X ijk , the inter- action effect parameter θ 3 , instead of the OTE θ 1 , should be considered for the sample size calculation.In this case, the null hypothesis H 0 : θ 3 = 0 is to be tested against an alternative hypothesis H a : θ 3 = δ for some prespeci- fied HTE level δ = 0 .For the purpose of simplification, we assume the SW-CRT has equal cluster sizes, i.e.
If the HTE level θ 3 is estimated by a consistent, asymptotically normally distributed estimator θ 3,m , then the power is approximately calculated using the two-tailed Wald test, where is the standard normal distribution function and z 1−α/2 is the (1 − α/2) th standard normal quantile.When estimating the OTE, t distribution is also recommended particularly for SW-CRT designs with small number of clusters: where t,df is the cumulative t distribution function with certain degree of freedom (df ).Although the proposed method can be used to determine either the number of clusters or cluster size, in this article we focus on determining cluster size and we provide R code for both methods.The use of formula (2) or (3) requires an approximation of Var θ 3,m , which is the (J + 3, J + 3) th element in the model-based variance-covariance matrix, Var m where m is the GEE estimator of .In this article, we provide the sandwich form of approximation for Var m 's, which arises from the GEE method.Due to the presence of the individual binary covariate, the closed form of the sample size calculation is not available.We consider two correlation structures: a simple exchangeable correlation structure and a nested exchangeable correlation structure which can be based on the values of within-period correlation and betweenperiod correlation.The intraclass correlation (ICC) measures the correlation on the outcome for different individuals in the same cluster within a given time period.The within-period correlation is same as the intraclass correlation (ICC).The cluster autocorrelation (CAC) is the correlation between the population means from the (2) same cluster at two different time periods and is defined as the ratio of between-period correlation and withinperiod correlation.Hence, a simple exchangeable correlation structure has the same value for within-period correlation and between-period correlation (i.e., CAC = 1), and a nested correlation structure has different values for within-period correlation and between-period correlation.In general, between-period correlation is smaller than within-period correlation (0 < CAC < 1).Now denote the ICC as α for j′ � = j .Then the variance-covariance matrix of the binary outcomes is i for cluster i , where I n is the n × n identity matrix, J n is an n × n matrix with all elements equal to 1, and for cluster i .Equation ( 6) leads to a model-based (naïve) variance estimator.
i with α and ρ also obtained by solving the GEE (4).Note that for power calculation we do not have data.Instead, we have assumptions about the values of α and ρ .Hence we may compute Var m with- out any data analysis, i.e.
Var m is a deterministic quantity rather than a random variable.We call power calculation method (2) or (3) based on the Var m with specific values of α and ρ "GEE" for simplicity.

Bias-correction sandwich variance approaches
It is well-known that Var θ 3,m , the (J + 3, J + 3) th ele- ment of Var m , is less than the true value of Var θ 3,m due to ignoring the O P I −1 term particu- larly for cases with small I .From the perspective of data analysis, if the number of clusters I is small, bias- correction techniques are recommended for obtaining Var m to mitigate increased risk of type I errors [24].In this article, we consider two bias-correction techniques from data analysis to adjust Var θ 3,m for small number of clusters: 1) the Kauermann and Carroll (KC) [25] Correction; and 2) the Mancl and DeRouen (MD) [26] Correction.Both bias-correction techniques lead to the following modified form of approximation for Var m : where i is the corresponding residual-modified version of i with the form where the residuals modification matrices F i is defined as one of and depending on the selection of bias-correction technique (KC or MD).The matrix H i is defined as Note that by writing the right hand-side of ( 8) collapses to Var m , which means Var m is also a particular case of the right hand-side of (8) like the KC and MD cases.For this reason, we may distinguish our power calculation methods ( 8) by the choice of F i in the right hand-side of (8), namely, "GEE" refers to the method with F i = I Jm i ; "GEE-KC" refers to the method with F i = F KC i ; and "GEE-MD" refers to the method with F i = F MD i .Again, we do not estimate Var m because I Jm i , F KC i , and F MD i are known given the assumption of R i (α, ρ) 's and the parameters therein at the design stage..In other words, GEE-KC and GEE-MD are proposed to improve the predicted power based on GEE in the situation that number of clusters is small.

Simulation study
We performed a simulation study to examine the operating characteristics of the three proposed methods described in Sect. 2 to detect the interaction effect between intervention and binary individual covariate in cross-sectional SW-CRT.The simulation study was conducted using statistical software R version 4.2.2.

Scheme for generating simulation datasets
We simulate binary outcomes Y ijk 's based on marginal means u i and correlation matrix R i (α, ρ) .Note that we have where To simulate Y i , we consider the copula-based method for multivariate binary outcomes [27,28] which assumes Y ijk = 1 Z ijk ≤ � −1 µ ijk , for j = 1, . . ., J and k = 1, . . ., m , where 1{•} is the indicator function, is the standard normal CDF and the random vector, marginally).The elements of i corresponding to the location of Z ijk and Z ij ′ k ′ within Z i are defined as For all j, j′ = 1, . . ., J and k, k′ = 1, . . ., m .To ensure that R i (α, ρ) is the correlation matrix of Y i , ξ ijk,ij ′ k ′ is determined by solving the equation where C ξ ijk,ij ′ k ′ is a bivariate Gaussian copula with corre- lation parameter ξ ijk,ij ′ k ′ .We compute the bivariate cop- ula components using the "pbinormcop" function of the R package "VGAM".
Note that the resulting i is not always positively definite [29].In this case, we need to modify this matrix to force it to become non-negative definite such that we may simulate the random vector Z i using the modified i .An eigenvalue modification trick [30] is employed for this purpose.Values for parameters for the simulation study setups are shown in Table 1.We set θ 0 = log(0.15/0.85)representing 15% prevalence of outcome for control group with reference group of binary covariate.This prevalence rate came from our motivational example.We included increasing secular time trend by specifying γ 2 = 0.1, γ 3 = 0.2, γ 4 = 0.3, and γ 5 = 0.4 .In power calculation the secular time trend fixed effects should be determined either from existing literature or by a preliminary analysis.In the Supplementary materials of ( 16), a demonstration of the 'power.ap'function in the R package CRTpowerdist considers categorical time trend using coefficients 1, 2, 3, and 4 for Gaussian outcomes; for binary outcomes using logit link these values are too large for computation.As a result, we use 0.1, 0.2, 0.3, and 0.4 instead Setup for simulation study for parameter values except interaction along with the number of clusters, time points, and cluster size

Value for parameter
Reason for determining value θ 0 = log(0.15/0.85) the true log odds ratio of the outcome in the control group ( W ijk = 0 ) with reference group for binary covariate ( X ijk = 0 ) for k th individual in i th cluster at j th period.We assume that prevalence of outcome is 15% For a simple exchangeable correlation structure For a nested exchangeable correlation structure for demonstration purpose only.We simulated binary covariate, X ijk using the cluster prevalence levels 30% and 50%, respectively.These two values also came from our motivational example in which 30% of patients are Black and 48% of patients are female.We plan to evaluate the statistical operating characteristics of HTE when we have imbalanced and balanced binary covariates.For OTE, we set θ 1 = log(1.35)and θ 1 = log(1.68)representing a small and intermediate effect sizes for the intervention effect and θ 2 = log(1.5)for the binary covariate.The main quantity of interest is HTE, θ 3 .We considered two values, log(1.5)and log(2) , to achieve an intermediate and large effect size, respectively.

SW-CRT design related parameters
Relatively small number of clusters are often used in cross-sectional SW-CRT.A literature review found that 50% of 56 cross-sectional SW-CRTs reviewed had fewer than 10 clusters [31].Therefore, we simulated eight clusters (I = 8), five time periods (J = 5), and several fixed cluster sizes (20, 40, 60, 80, and 120).Hence, the total numbers of subjects in this simulation study were 800, 1,600, 2,400, 3,200, and 4,800 individuals, respectively.
Since we used GEE models, we needed to specify working correlation matrix structure.Although there are three distinct correlation structures that are typically used in the cross-sectional SW-CRT: simple exchangeable correlation structure, nested exchangeable correlation structure, and exponential decay correlation structure, we decided to simulate the first two because they are most commonly used in simulation studies [17,32,33].For both correlation structures used, we set ICC to be 0.1.We considered a CAC value of 0.8 for the nested exchangeable correlation structure.

Measure of operating characteristic for simulation study
In the simulation study we examined simulated power and empirical type I error rate for three proposed approaches from 1,000 generated datasets using the simulation setup shown in Table 1.
For the null scenario with θ 3 = 0 , the empirical Type I error rate ( ψ 0 ) was calculated as the proportion of false rejections among the 1,000 tests for θ 3 ; for the nonnull scenarios ( θ 3 = log(1.5)and log(2) ), the simulated power ( ϕ 0 ) was calculated as the proportion of correct rejec- tions among the 1,000 tests for θ 3 .When fitting simulated samples using GEE, the naive standard error estimation is considered because modifications KC and MD are added in the approximation of Var m .

Comparison between simulated power and predicted power for sample size determination
We conduct a simulation study to examine sample size determination, provided in terms of cluster size, from each approach given the numbers of clusters and periods.We consider two distinct values for the prevalence of binary covariate (30% and 50%), the OTE effect size ( log(1.35) and log(1.68)), and the HTE effect size ( log(1.5) and log(2)).
The predicted powers/Type I errors are obtained by our proposed methods (GEE, GEE-KC and GEE-MD) using Eq. ( 2) given the design parameters.The simulated powers/empirical Type I errors are obtained by simulation based on the same set of design parameters.While the simulated powers are usually seen as the true powers for the given design due to the law of large numbers, they might be very time-consuming depending on the complexity of the simulation scheme and the true value of OTE/HTE.For this reason, we expect to have priori information of a suitable range of cluster sizes such that our simulated power may achieve the target power (80% for example) without conducting simulation studies.To this end, mathematical approximation methods are employed to compute the predicted powers/Type I errors as fast feedbacks to guessing values of cluster sizes and to help determine the suitable cluster sizes.

Sensitivity analysis
In the sensitivity analysis for the three proposed approaches using different numbers of clusters and cluster sizes on simulated power, we consider two additional numbers of clusters (I = 20 and 40) and five additional cluster sizes (m = 20, 40, 60, 80, and 100) given a prevalence rate of 50% for the binary covariate and an intermediate OTE effect size ( θ 1 = log(1.68)).Once again, we consider intermediate and large effect sizes for HTE, the quantity of interest.From different combinations of the number of clusters and cluster sizes, we examine whether an increase in the number of clusters improves statistical power compared to an increase in cluster size indirectly.We also examine the tradeoff between the number of clusters and cluster size in simulated power.We consider two total number of observations per step (160 and 800) and then two combinations of the number of clusters and cluster sizes for a given total number of observations.One pair represents small number of clusters and relatively bigger cluster size and another pair represents larger number of clusters and small cluster size.

Table 2
Required cluster sizes to achieve 80% predicted power to detect interaction effect sizes for a simple exchangeable correlation structure with the number of clusters = 8 and ICC = 0.1 Estimated required cluster size was determined to achieve at least 80% predicted power.Bias of the standard error SE, empirical Type I error ψ 0 , simulated power ϕ 0 (GEE fitted using both naïve and robust SE, respectively), and predicted power ϕ were obtained from the GEE power calculator with ICC = 0.1 using three methods GEE, GEE-KC, and GEE-MD for various combinations of HTE θ , we also report the average of cluster sizes obtained by GEE and GEE-KC as a fourth method.Under scenario of prevalence = 30%, the cluster sizes are taken as multiples of ten intentionally to generate integer number of observations for both X = 0 and X = 1 within each cluster.And for the same reason, only even numbers are taken into account under the scenario of prevalence = 50% Boldfaced simulated power denotes simulated power falls outside of 95% confidence interval for predicted power  3 Required cluster sizes to achieve 80% predicted power to detect interaction effect sizes for a nested exchangeable correlation structure with the number of clusters = 8, ICC = 0.1, and CAC = 0.8 Estimated required cluster size was determined to achieve at least 80% predicted power.Bias of the standard error SE, empirical Type I error ψ 0 , simulated power ϕ 0 (GEE fitted using both naïve and robust SE, respectively), and predicted power ϕ were obtained from the GEE power calculator with ICC = 0.1 and CAC = 0.8 using three methods GEE, GEE-KC, and GEE-MD for various combinations of HTE θ , we also report the average of cluster sizes obtained by GEE and GEE-KC as a fourth method.Under scenario of prevalence = 30%, the cluster sizes are taken as multiples of ten intentionally to generate integer number of observations for both X = 0 and X = 1 within each cluster.And for the same reason, only even numbers are taken into account under the scenario of prevalence = 50% Boldfaced simulated power denotes simulated power falls outside of 95% confidence interval for predicted power

Comparison of sample size determination
In Tables 2 and 3, we evaluate the difference between predicted standard error (SE) and empSE defined in Table 10 of [34] ( SE), simulated power ( ϕ 0 ) and empirical Type I error ( ψ 0 ) based on the naïve and robust SEs of GEE for a determined cluster size satisfying 80% power from our three approaches using the following values (I = 8, J = 5, θ 0 = log(0.15/0.85)).We evaluate this for a simple exchangeable correlation structure (ICC = 0.1) and for a nested exchangeable correlation structure (ICC = 0.1, CAC = 0.8).We also consider two different values for effect sizes of OTE and HTE as well as the prevalence rate of the binary covariate.The cluster sizes are taken as multiples of ten intentionally to generate integer number of observations for both X = 0 and X = 1 within each cluster.For the same reason, only even numbers are taken into account under the scenario of prevalence = 50%.For all cases, empirical Type I errors fall into the 95% confidence interval of nominal significance level (5%).For both a simple exchangeable correlation structure and a nested exchangeable correlation structure, simulated powers of GEE-KC and GEE-MD are above the upper limit of the 95% confidence interval of the predicted power ( ϕ ) for both intermediate and large effect sizes of HTE.GEE maintains the simulated power inside of the 95% confidence interval of the predicted power in both correlation structures when naïve SE is used.If robust SE is used, simulated powers of GEE are also above the upper limit of the 95% confidence interval of the predicted power ( ϕ ) for both intermediate and large effect sizes.When we calculate the difference between simulated power ( ϕ 0 ) and predicted power ( ϕ ), GEE shows the smallest among the three approaches.This comes from relatively bigger sample size determination for the two bias-correction approaches provided.
Based on Tables 2 and 3, statistical significance is determined by the naïve SE, and GEE-MD consistently over-predicts the cluster sizes while GEE consistently under-predicts the cluster sizes.However, when the statistical significance is determined by the robust SE, all three methods tend to consistently over-predict the cluster sizes due to the widely-acknowledged phenomenon that robust SE (without any bias correction) leads to inflated empirical type-I error [13,35].Hence the resulting cluster sizes from GEE and GEE-MD may provide a priori range of cluster sizes such that the SW-CRT design may achieve the target power when the data is analyzed with the naïve SE.However, if the data is analyzed with the robust SE, the resulting cluster sizes from GEE are sufficiently large, in which the empirical type-I error should be the major problem.Finally, testing the simulated power/empirical Type I error obtained from the cluster sizes within this range yields more accurate cluster size selection.Moreover, GEE-KC also tends to over-predict the cluster sizes.Hence it is natural to consider the performance of the average cluster size between the cluster sizes obtained by GEE and GEE-KC particularly for the low HTE scenarios when using naïve SE.We add this as a fourth method and report the empirical type-I error and power in Tables 2 and 3 as well.The corresponding predicted power is also directly calculated as the average of predicted powers obtained by GEE and GEE-KC.In Supplementary materials, we show the tradeoff between the number of clusters and cluster size and impact of ICC and CAC on simulated power (Table S1A, Table S1B, and Table S1C).In general, when we have more clusters per step, simulated power increases and when we have more observations per step, the increase in simulated power is slightly bigger than that of fewer observations per step.When ICC and/or CAC decrease, simulated power decreases.
We also perform the same simulation study by replacing the required cluster sizes and predicted powers based on (2) by their analogues based on t-distribution with degrees of freedom equal to 4 (we have 8 clusters and 4 cluster level parameters θ 0 , θ 1 , θ 2 , θ 3 , hence we choose the degrees of freedom 4 = 8-4 [36]).In Table S2A and B, the GEE method provides sufficiently large power compared with the corresponding predicted powers when the prevalence is 30%.For prevalence 50%, GEE method shows similar results as the scenarios based on normal Wald test.However, GEE-KC and GEE-MD tend to generate even larger power compared to the corresponding scenarios based on normal Wald test.For the empirical type-I error, the model-based standard error for GEE leads to very small results for t-distribution as shown in Web Fig. 3 in the Supplementary materials of [13].

Simulated powers
We show simulated powers for a simple exchangeable correlation structure (Table 4) and a nested exchangeable correlation structure (Table 5).For both correlation structures, simulated powers are close to each other but a nested exchangeable correlation structure case has slightly lower power compared to a simple exchangeable correlation structure case.Among three approaches, GEE has the largest simulated powers and GEE-MD has the smallest simulated power.This is a similar pattern as shown in [17] for the OTE.Given cluster size of 120, GEE has over 80% statistical power.When we have a balanced binary covariate (50%), simulated power increases compared to unbalanced binary covariate (30%) as shown in Table 4 Simulated powers for a simple exchangeable correlation structure with the number of clusters = 8, cluster size = 120, and ICC = 0.1 Prevalence Table 5 Simulated powers for a nested exchangeable correlation structure with the number of clusters = 8, cluster size = 120, ICC = 0.1, and CAC = 0.8 Prevalence   Tables 4 and 5.When we have a large effect size for HTE, simulated power is over 80% for both small and intermediate effect sizes of OTE.

Simulated powers in the sensitivity analysis
First, we examine the effect of different cluster sizes on the simulated power with the same setups.In Tables 6 and 7, five additional cluster sizes along with cluster size of 120 are considered (20, 40, 60, 80, and 100).With the intermediate effect size of HTE, only cluster sizes of 100 and 120 have more than 80% power using GEE for both correlation structures.With the large effect size of HTE, when cluster size is at least 60, all three approaches have more than 80% power.With cluster size of 40, GEE and GEE-KC have more than 80% power.Next, we examine the effect of different number of clusters on the simulated power.In Tables 8, 9, 10, and 11, we use 20 and 40 for the number of clusters, respectively.When we have 20 clusters, we have more than 80% power for all six different cluster sizes with the large effect size of HTE.When we have a cluster size of 60 or more, all three approaches have more than 80% power with the intermediate effect size of HTE (Tables 8 and 9).When we use 40 clusters, we have more than 80% power across all six different cluster sizes except GEE-MD for cluster size of 20 (Tables 10 and 11).When we compare the impact of increasing cluster size versus increasing the number of clusters, the latter has slightly higher gain in power.For cluster size change from 20 to 40 with 20 clusters, power increases from 53.1% to 82.1% for GEE; 50.6% to 79.7% for GEE-KC; and 48.1% to 77.1% for GEE-MD (Table 6 first two columns).When the number of clusters changes from 20 to 40 with cluster size fixed at 20, power increases from 53.1% to 82.1% for GEE; 50.6% to 81% for GEE-KC; and 48.1% to 79.8% for GEE-MD (Tables 8 and 10; column 1 each).

Power calculation for the example
Retrospective data for MSHS oncology practices show a racial disparity in who receives the goals of care (GoC) conversations: 35% of White patients get GoC while only 15% of minority patients get GoC.Hence, the primary objective of this SW-CRT trial is to evaluate whether our intervention of a CDSS guided alert system, combined with training in communication and knowledge about the disparity provided to physicians, in the intervention arm results in a higher increase in the proportion of GoC in minority patients as compared to that in White patients.In other words, the main focus is the HTE, the interaction effect of intervention by race.Our hypothesis is that the rate of GoC increases from 15 to 30% (15% increase) in minority patients and from 35 to 40% (5% increase) in non-minority patients.Design parameters for this SW-CRT are shown in Table 12.Based on this setup, statistical powers to detect the HTE were 17.8% for GEE, 15.4% for GEE-KC, and 13.4% for GEE-MD.Given a cluster size of 15 and 80% power, the effect size of HTE should be more than two times larger than the original value of HTE (= 1.96) in Table 12 (4.24for GEE, 4.7 for GEE-KC, and 5.51 for GEE-MD).Given an HTE effect size of 1.96, bigger cluster sizes are needed (81 for GEE, 90 for GEE-KC, and 117 for GEE-MD).

Discussion
Detection of interactions between treatment effects and patient or cluster descriptors in SW-CRT is critical for optimizing the healthcare delivery process.In this article, we propose three different approaches for determining sample size and statistical power for the interaction between binary intervention and patient level covariate in SW-CRT.We show through an illustrative example that a much larger sample size is needed for detecting an interaction effect.In this work we focus on crosssectional SW-CRT and we deal with two generic correlation structures and two bias-correction techniques.Hence, determination of cluster size should be developed  the true log odds ratio of the outcome in the control group ( W ijk = 0 ) with reference group for binary covariate ( X ijk = 0 ) for k th individual in i th cluster at j th period.We assume that prevalence of outcome is 15% In this article we consider binary outcomes, while an extension of this work to consider continuous and censored endpoints along with cluster level covariates is of interest for future study.In this article, we focus on the immediate treatment effect model, though we acknowledge there is the possibility of a time-varying treatment effect as proposed in [37].Further research should be done for examining different patterns of treatment effect between subgroups.
It is worth noting that in Tables 2 and 3, the empirical Type I errors ( ψ 0 ) are well-maintained, which deviates from the pattern of inflated empirical Type I error observed by previous studies [13,36].By comparing our study design with these previous studies, we found two potential reasons for the empirical Type I errors not to be inflated: 1) the treatment effect of interest (in this study we focus on HTE while the previous studies [13,36] focus on the OTE, in Web Fig. 3 in the Supplementary materials of [13] the resulting empirical type-I error under scenarios with small number of clusters is slightly inflated above the 95% upper limit of 0.05 even with the model-based SE); 2) the choices of ICC and CAC for the simulation (in [13] the ICC is chosen to represent the small correlations commonly reported in parallel CRTs.In such cases, using model-based SE might be punished by higher level of misspecification of correlation structure while the robust SE fails to converge due to small number of clusters.As a result, both choices are likely to cause the inflation of empirical type-I error).To investigate these two potential reasons, we compute the difference between the empirical variances based on 1,000 simulations and the model-based calculated variances as shown in Table 13, which indicates that the Naïve (model-based) SE estimator underestimates the true SE (consider the empirical SE as true) for all cases as is wellknown.However, when estimating OTE θ 1 the difference Var θ 1,m − Var θ 1,m is quite sensitive to the selection of the ICC α and CAC ρ ; while using the same data-gen- erating procedure to estimate the HTE θ 3 the difference Var θ 3,m − Var θ 3,m seems to be less sensitive to the selection of the CAC ρ .In APPENDIX we show the empirical and model-based calculated variances.The effect of CAC ρ on the model-based Var θ 1,m seems to disappear due to replacing N i with M i in the inversion of matrix.Similar results can be found in [22] where empirical type-I errors are stably located between 0.04 to 0.06 across all scenarios compared with the inflated type-I error from Fig. 2 of [13].Last but not least, our selection of θ 1 = log(1.68)and θ 2 = log(1.5)happens to reduce the difference between Var θ 3,m and Var θ 3,m for the Table 13 Difference between empirical variance and the model-based variance calculation for OTE θ 1 = 0 and HTE θ 3 = 0 .The number of clusters I = 8 , the total time steps is J = 5 , the cluster size is m = 40 for each cluster, the prevalence of X ijk is 50% when estimating HTE, γ j = 0.1(j − 1) for j = 1, 2, 3, 4, 5 , and θ 0 = log(0.15/0.85)Table 14 Empirical Type I error for OTE θ 1 = 0 and HTE θ 3 = 0 based on 1,000 simulations using naïve standard error estimator provided by gee function using R.The number of clusters I = 8 , the total time steps is J = 5 , the cluster size is m = 40 for each cluster, the prevalence of X ijk is 50% when estimating HTE, γ j = 0.1(j − 1) for j = 1, 2, 3, 4, 5 , and θ 0 = log(0.15/0.85) The empirical and model-based calculated variances for the OTE θ 1 = 0 and the HTE θ 3 = 0 were shown in APPENDIX cases with ρ = 1 , which leads to even better results of empirical Type I error in Table 10.The empirical Type I errors for all settings in Table 13 are given in Table 14.We observe the inflated Type I errors for cases with ρ = 0.5 when estimating the OTE.As to the HTE, the empirical Type I errors are well-maintained for all cases.We acknowledge that there might be other factors affecting the difference Var θ 3,m − Var θ 3,m as well as the performance of empirical Type I errors, such as the prevalence of the covariate X ijk within each cluster.Thus, the situation might be more complicated than our observation from Tables 13 and 14.We believe that further investigations are needed to explore how the difference Var θ 3,m − Var θ 3,m might be controlled such that the empirical Type I error can be maintained well.
Our work highlights the importance of properly conducting sample size calculations around the interaction of the exposure and patient level covariate in health disparities research.With the need to introduce interventions aimed at closing the racial gap, properly designing and powering these trials is critical in ensuring statistical results are reliable.The previous trials mentioned in the Background section introduce important and promising interventions, but further potential exists to more directly target the interaction effect.In the case of Cox et al. 's trial [9], the ICU web app is targeted to benefit Black and White families with the same effect size.While beneficial, this alone will not reduce the existing baseline gap between the two subgroups.In the case of Ejem et al. 's trial [8], limited information is provided on how to recreate the racial-disparity targeted sample size calculation.Our work aims to provide researchers with a transparent and accessible tool to design their own SW-CRTs around similar interaction effects.We also acknowledge that the field of health disparities research requires a thoughtful, comprehensive, and nuanced approach.While studying the interaction term is important, investigators should not solely rely on this term when making conclusions about disparities.In addition to the presence or absence of a statistically significant interaction, additional factors may exacerbate health disparities, such as the distribution of the exposure and outcome prevalence across subgroups [38].We also note that the detection and interpretation of the interaction term is scale-dependent.The comparison of changes in the interaction term may be different when interpreted on a relative risk scale versus a risk difference scale.In this paper we focus the interaction on the odds ratio scale [39].Our work is one step in the comprehensive framework of working to reduce disparities in healthcare.

Conclusion
We propose three approaches for cluster size determination given the number of clusters.These methods can be also applied for determining the number of clusters given cluster size.GEE has reasonable operating characteristics for determining cluster size in both intermediate and large effect sizes of HTE.Both GEE-KC and GEE-MD provide relatively large sample sizes compared to GEE.R codes used for this manuscript are available in the Supplementary material.
By solving the GEE (3), the resulting m satisfies By ignoring the O P I −1 term if I is sufficiently large, we have As a result, Var m can be approximately calcu- lated as 3, γ 5 = 0.4 Increasing secular trend θ 1 = log(1.35);θ 1 = log(1.68)Chosen arbitrarily for small and intermediate effect size for intervention effect 30%, 50% Prevalence rate for binary covariate θ 2 = log(1.5)Chosen arbitrarily for intermediate effect size for binary covariate θ 3 = log(1.5);θ 3 = log(2) Chosen arbitrarily for intermediate and large effect size for interaction I = 8, 20, 40 Number of clusters J = 5 Number of time steps m = 20, 40, 60, 80, 120 Cluster size Hence, I = 8 & m = 20 and I = 40 & m = 4 for 160 observations per step and I = 8 & m = 100 and I = 40 & m = 20 for 800 observations per step.We also examine the impact of ICC and CAC on simulated power for the above setup.We consider three setups for ICC and CAC: ICC = 0.1

Table 6
Predicted powers for three effect sizes in combination with different cluster sizes and the number of clusters = 8, prevalence rate = 50%, OTE = log(1.68)for a simple exchangeable correlation structure

Table 7
Predicted powers for three effect sizes in combination with different cluster sizes and the number of clusters = 8, prevalence rate = 50%, OTE = log(1.68)for a nested exchangeable correlation structure

Table 8
Predicted powers for three effect sizes in combination with different cluster sizes and the number of clusters = 20, prevalence rate = 50%, OTE = log(1.68)for a simple exchangeable correlation structure

Table 9
Predicted powers for three effect sizes in combination with different cluster sizes and the number of clusters = 20, prevalence rate = 50%, OTE = log(1.68)for a nested correlation structure

Table 10
Predicted powers for three effect sizes in combination with different cluster sizes and the number of clusters = 40, prevalence rate = 50%, OTE = log(1.68)for a simple exchangeable correlation structure

Table 11
Predicted powers for three effect sizes in combination with different cluster sizes and the number of clusters = 40, prevalence rate = 50%, OTE = log(1.68)for a nested exchangeable correlation structure

Table 12
Setup for a motivational example Number of clusters, and time steps.Equal sizes for every cluster for other types of design for SW-CRT such as closed cohort and open cohort.Other types of bias-correction technique are also considered.In reality, the fixed cluster size assumption is too strong, and unequal cluster sizes should be considered in the sample size determination.