Simulation-based power and sample size calculation for designing interrupted time series analyses of count outcomes in evaluation of health policy interventions

Objective The purpose of this study was to present the design, model, and data analysis of an interrupted time series (ITS) model applied to evaluate the impact of health policy, systems, or environmental interventions using count outcomes. Simulation methods were used to conduct power and sample size calculations for these studies. Methods We proposed the models and analyses of ITS designs for count outcomes using the Strengthening Translational Research in Diverse Enrollment (STRIDE) study as an example. The models we used were observation-driven models, which bundle a lagged term on the conditional mean of the outcome for a time series of count outcomes. Results A simulation-based approach with ready-to-use computer programs was developed to calculate the sample size and power of two types of ITS models, Poisson and negative binomial, for count outcomes. Simulations were conducted to estimate the power of segmented autoregressive (AR) error models when autocorrelation ranged from −0.9 to 0.9, with various effect sizes. The power to detect the same magnitude of parameters varied largely, depending on the testing level change, the trend change, or both. The relationships between power and sample size and the values of the parameters were different between the two models. Conclusion This article provides a convenient tool to allow investigators to generate sample sizes that will ensure sufficient statistical power when the ITS study design of count outcomes is implemented.


Introduction
Interrupted time series (ITS) analysis is a strong quasi-experimental design that can be used to evaluate the effectiveness of a populationlevel intervention that is clearly defined at a given time point ( [1][2][3]). ITS designs usually involve repeatedly collecting a particular aggregate level outcome pre-and post-intervention ( [4,5]). The segmented time series regression model ( [2]) with one discontinuity time point is the general tool used to evaluate such data, in which each segment can have a different level, trend, or both. That is, two-line segments are fitted simultaneously and separated at the intervention time point. A change in the "level" of the outcome is indicated by a discontinuity at the time point when the intervention was introduced, and the change in the "trend" is revealed by a change of slope. Statistical hypothesis tests [6] are typically used to detect changes in outcome after the implementation of intervention. ITS is typically used when randomized trials are infeasible and has been extensively used on evaluating public health and health service interventions ( [3,7]). The assumptions and advantages of using ITS analysis have been thoroughly discussed ( [8,9]). Although most studies have focused on aggregated level single-arm ITS design, two-arm ITS design ( [4]) and individual level ITS models ( [10]) have also been discussed.
Modeling the time series of the observed count data is a more challenging task than creating time series models for continuous data. Unlike modeling a normal time series of continuous data, according to Jung et al. [11], a potential model for the time series of the count data must be able to characterize both the dependence structure and the overdispersion of data. Several models have been proposed and categorized into two types ( [12]): observation-driven models, which bundle a lagged term on the conditional mean of the outcome; and parameter-driven models, driven by a dynamic process, which are reviewed by Cameron and Trivedi [13]. That is, observation-driven models directly model the conditional mean of current count data to historical data, and parameter-driven models can be considered to be a generalized linear model (GLM) with a pre-specified dependence structure.
For the observation-driven models, the two most commonly used models are the generalized linear autoregressive moving average (GLARMA) model and the log-linear (LL) model. The GLARMA model was proposed by Shephard [14] and Davis et al. [15], and the LL model, first proposed by Zeger and Qaqish [16], has been further investigated by Fokianos and Fried [17,18], Woodard, Mateeson, and Henderson [19] and Douc, Doukhan, and Moulines [20]. Further discussion on theoretical properties, like the stationarity and ergodicity of the GLARMA and LL models, can be found in Dunsmuir and Scott [21] and Liboschik et al. [22]. The most common parameter-driven model is the Zeger model [23]. Considering the Gaussian linear process of the conditional mean of the outcome, the Zeger model was studied by Zeger [23] and Davis et al. [24]. Its equivalent logarithmic form was studied by Chan and Ledorter [25], Kuk and Cheng [26], Jung and Liesenfeld [27], and Jung and Tremayne [28].
Though count outcomes are the common practice in policy research, the ITS design of count outcomes has only made limited appearances in the literature. For instance, Walter et al. [29] modeled injury count data using the negative binomial log-linear model and fit the model by the maximum likelihood estimator. Wang, Olivier, and Grzebieta [30] considered the same model and compared the estimation performance between the maximum likelihood estimator, the full Bayesian estimator, and the empirical Bayesian estimator via simulation. However, the power of the statistical tests in ITS analyses with count data has never been studied. To address this gap, in this manuscript we conducted simulations to estimate the power and sample sizes in various settings. Here, we only considered the most basic two-phase single-arm ITS design for count outcomes. More complicated three-phase two-arm models are beyond the scope of this paper. A similar study on the two-phase ITS design of continuous data outcomes was conducted by Zhang, Wagner, and Ross-Degnan [6]. Herein, we solely focus on the observation-driven model for a time series of count data, in particular the LL models. We only consider the observation-driven models because they are designed to allow the likelihood to be evaluated easily, but the parameter-driven models usually involves high-dimensional integration, which is computationally infeasible [15]. Table 1 Estimated power testing H 0 : β 2 ¼ β 3 ¼ 0 for the Poisson time series with a conditional mean model LL (0,1) when β 2 þ β 3 ¼ �0:25; � 0:5; � 1 based on 200 simulated data sets and a statistical significance level of 0.05. The symbol "-" indicates that more than one fourth of the data sets cannot be successfully generated.

Exemplar study: Strengthening Translational Research in a Diverse Enrollment (STRIDE) study
The power and sample size calculation for the ITS design of count outcomes were motivated by the required statistical analysis of data generated from the STRIDE study, an ongoing five-year study aimed at developing an intervention to increase the engagement of African Americans and Latinos in translational research ( [31]). Since the primary outcome of the study is the number of African Americans and Latinos enrolled in ongoing translational clinical trials, to mitigate their historical underrepresentation in translational research, the STRIDE study is a representative example of the ITS design of count data.
The STRIDE project is a partnership of the CTSAs (Clinical and Translational Science Awards program) at the University of Massachusetts Medical School, the University of Alabama at Birmingham, and Vanderbilt University-three geographically diverse sites with large African American and Latino populations. The STRIDE intervention was motivated by previous studies of exposed barriers to research participation ( [32][33][34]). Participant and systematic barriers include limited research literacy, lack of trust stemming from historical abuses, lack of research staff training in appropriate cultural competency skills, and confusion of informed consent procedures in research. To overcome these barriers, the proposed multi-level intervention contains three components: (1) storytelling for the promotion of research literacy; (2) simulation-based training to improve culturally appropriate recruitment and informed consent; and (3) an electronic consent platform to enhance cultural competency. The STRIDE intervention builds synergistically on emerging work at each institution to create a new intervention that addresses barriers on multiple levels. The primary outcome of the STRIDE project is the number of recruitments of African Americans and Latinos, as well as the total recruitment.
To test the effectiveness of the STRIDE intervention, we have recruited ongoing translational clinical studies at each of the three partnering CTSA hubs. Both the interventions and contemporaneous controls (i.e., clinical trials without STRIDE intervention) are introduced at each of the CTSA hubs. Each participating university layers the STRIDE intervention on one study, with another study serving as the unintervened control. Thus, using the number of African American and Latino participants recruited, or the total number of participants, as the prime response variable (outcome), the STRIDE intervention will be evaluated by the two-arm ITS design and will include six ongoing translational research studies. Three studies will receive the intervention and comprise the study group, and the remaining three unintervened studies will comprise the comparison group. The study outcomes are collected on a weekly basis. The change in study outcomes will be examined based on a two-phase framework (pre-implementation versus post implementation).

Design and analysis of a single-arm ITS study with count outcomes
The STRIDE study has motivated our investigation of a time series Table 2 Estimated power testing H 0 : β 2 ¼ β 3 ¼ 0 for the negative binomial time series with a conditional mean model LL (0,1) when β 2 þ β 3 ¼ � 0:25; � 0:5; � 1 based on 200 simulated data sets and a statistical significance level of 0.05. The symbol "-" indicates that more than one fourth of the data sets cannot be successfully generated.
In a two-phase ITS study, if all study subjects and sites are planned to be exposed to an intervention over time, then such a study is a single-arm ITS study. Let Y t represent the count outcome variable that is measured at time point t, let T t be the actual or converted study time (in the simulation, we also considered the logarithm of the actual time to avoid model explosion) from the start to the end of the study, let X t be a binary indicator for the second phase of the study, and let t 0 be the time point after the onset of intervention.

Observation-driven model
Here, we give a brief introduction of the modeling framework for the observation-driven segment regression time series model of count outcomes. For a single-arm ITS design of count outcomes, a common kind of observation-driven time series build model on the logarithm of the conditional mean of the response Y t can be written as conditioning on the past responses and means, the function g joints current outcome with past outcomes that are correlated in the time series, T t is the actual time of the study, t 0 is the time point of intervention, X t is the binary indicator for the second phase of the study, and β 0 ;β 1 ;β 2 ; β 3 , and θ are unknown parameters.
In observation-driven models, the effect of covariates on the outcome or its mean is complicated and difficult to interpret because the conditional mean also dependents on past outcomes ( [15]). For the ITS design, the coefficient β 0 is the regression intercept representing the starting level of the logarithm of the conditional mean, β 1 is the slope of the logarithm of the conditional mean before the implementation of the intervention, β 2 represents the change in the level of the logarithm of the conditional mean caused by the intervention versus non-intervention, and β 3 represents the difference in the slopes of the logarithm of the conditional mean caused by the intervention versus non-intervention. The focus of the ITS analysis is to examine the significance of β 2 , which indicates an immediate intervention effect on the level change of the conditional mean, and the significance of β 3 , which indicates the intervention effect in terms of the change in the trend of the conditional mean. Note that the purpose of subtracting t 0 , the time point after the onset of intervention, from the study time T t is to maintain the interpretation of the corresponding regression coefficients β 3 .
while p and q are nonnegative integers less than t. A variety of choices for g were proposed. For example, when gðF t is a scaling residual and ν t is some scaling function of μ t , and θ ¼ fall α j and γ j g, the model is a generalized linear autoregressive moving average (GLARMA) model [14]; when gðF t 1 Þ ¼ γ j lnðY t j þ 1Þ, it is a log-linear (LL) model. Here, we will focus on LL models with low orders, i.e., small values of p and q. Specifically, we model the time series of counts via the LL model with p ¼ 0 and q ¼ 1, denoted by LL (0,1), which has the form Where the logarithm of the mean linearly depends on the logarithm of the last observation, which positively or negatively depends on γ 1 . Since we use some logarithm functions in this model, it is hard to develop formulas for the mean or the autocovariance function of lnμ t or Y t . The most commonly used distribution for count data is Poisson distribution, in which the conditional distribution of response Y t on past history F t 1 is denoted by Y t jF t 1 � Poissonðμ t Þ, and the density has the form Poisson distribution is simple and popular. However, Poisson distribution is known to have equal mean and variance, which can be unrealistic in some settings. A more appropriate and flexible model for modeling count data with a larger overdispersion than Poisson (i.e., with greater variability) is negative binomial distribution. Denoting the conditional distribution of response Y t on past history F t 1 to be Y t jF t 1 � NBðμ t ; φÞ, the density function for negative binomial can be expressed as where φ > 0, with variance μ t þ μ 2 t φ . For many observation-driven models of count time series, the stationarity and ergodicity of the process, which are used to develop consistency and asymptotic normality, are only partially discussed in some special and simple scenarios, the majority of which are still unclear. For Poisson responses with η t ¼ η 0 (constant), model (2) has a stationary distribution when jγ 1 j < 1 . More discussion on the stationarity and ergodicity of GLARMA and LL models can be found in Dunsmuir and Scott [21] and Liboschik et al. [22].

Simulation-based sample size and power calculation
We used a simulation-based method to calculate the power of different statistical tests under different scenarios (different sample size 1; the right panel is for the negative binomial time series with β 2 þ β 3 ¼ 1.

Table 3
Estimated power testing H 0 : β 2 ¼ 0 for the Poisson time series with a conditional mean model LL (0,1) when β 2 ¼ �0:25; � 0:5; � 1 based on 200 simulated data sets and a statistical significance level of 0.05. The symbol "-" indicates more than one fourth of the data sets cannot be successfully generated.  Table 4 Estimated power testing H 0 : β 2 ¼ 0 for the negative binomial time series with a conditional mean model LL (0,1) when β 2 ¼ �1; � 2; � 3 based on 200 simulated data sets and a statistical significance level of 0.05. The symbol "-" indicates that more than one fourth of the data sets cannot be successfully generated.  and parameter values) for the two-phase single-arm ITS design of the count outcomes. For an arbitrary two-sided statistical test with the null hypothesis H 0 : β ¼ 0 versus the alternative H 1 : β 6 ¼ 0, where β can be either a univariate regression coefficient or any combination of multiple coefficients defined in Section 3.2. Here, we considered three null hypotheses in our simulation study: (i) β 2 ¼ β 3 ¼ 0, to test whether any changes (level, trend or both) exist after intervention; (ii) β 2 ¼ 0, to test the change on level after intervention; and (iii) β 3 ¼ 0, to test any trend changes after intervention. In this simulation-based sample size and power calculation, we considered the logarithm of actual time to avoid model explosion. β 2 represented the change in the level of the logarithm of the conditional mean caused by intervention versus non-intervention, and β 3 represented the difference in the slopes of the logarithm of the conditional mean caused by intervention versus non-intervention. For these three hypothesis tests, chi-square (Wald) tests were employed as test statistics, and the empirical power of these tests were calculated via simulation. For any statistical tests, the power under a pre-specified significance level is defined as the probability that rejecting the null hypothesis conditioning with the alternative hypothesis is true, i.e., PðReject H 0 jH 1 is trueÞ. Since this probability is generally unknown, we used simulation to estimate the power. For the simulation-based method, a large number of datasets were randomly generated from the ITS model we introduced in Section 3.2, with pre-specified non-zero coefficients, and statistical hypothesis tests were conducted for each dataset. Then, the empirical power was estimated as the frequency that the null hypothesis was rejected divided by the total number of datasets. Denoting R as the number of datasets, this estimated power will approach the true power if the R is large enough. In our simulation study, we used R ¼ 200 and a significance level of 0.05 for all cases.
We Negative values for the parameters indicate a "decrease" (either level, trend, or both) after intervention, and positive values indicate an "increase" after intervention. We chose different parameter values between the Poisson and negative binomial models because negative binomial models usually use modeling count data with larger overdispersion than Poisson models. We also considered different values for coefficient γ 1 in model (2), which represents the degree of dependence between the current conditional mean μ t and historical outcomes. Here, we considered all cases from 0.9 to 0.9, with a step of 0:2 and case γ 1 ¼ 0, which represents the case with no correlation.   Table 5 Estimated power testing H 0 : β 3 ¼ 0 for the Poisson time series with a conditional mean model LL (0,1) when β 3 ¼ �0:01; � 0:05; � 0:10 based on 200 simulated data sets and a statistical significance level of 0.05. The symbol "-" indicates that more than one fourth of the data sets cannot be successfully generated.   Table 6 Estimated power testing H 0 : β 3 ¼ 0 for the negative binomial time series with a conditional mean model LL (0,1), when β 3 ¼ �0:05; � 0:10; � 0:25 based on 200 simulated data sets and a statistical significance level of 0.05. The symbol "-" indicates that more than one fourth of the data sets cannot be successfully generated.

Results
Tables 1 and 2 show the estimated power for testing hypothesis (i) H 0 : β 2 ¼ β 3 ¼ 0 for the Poisson and negative binomial time series for model (2), with β 2 þ β 3 ¼ �0:25; �0:5; �1; based on a significance level of 0.05. The estimated power increased as γ 1 , the sample size increased, or the values of the parameter became more significant (i.e., the absolute value of β 2 þ β 3 became greater). The trends of the estimated power of γ 1 and sample size n are illustrated by the surface plots in Fig. 1. Table 3 and Table 4 show the estimated power for testing hypothesis (ii) H 0 : β 2 ¼ 0 for the Poisson and negative time series for model (2) and the pre-specified parameter values in the level change based on a significance level of 0.05. We considered β 2 ¼ �0:25; �0:5; �1 for the Poisson time series in Table 3, and β 2 ¼ �1; �2; �3 for the negative binomial time series in Table 4. For the Poisson models, the estimated power increased as γ 1 , the sample size increased, or the values of the parameter became more significant. For negative binomial models, the results were similar to those of the Poisson models, but the estimated power was decreased for very large values of γ 1 . The trends for the estimated power of γ 1 and sample size n are illustrated by the surface plots in Fig. 2. Table 5 and Table 6 show the estimated power testing H 0 : β 3 ¼ 0 for the Poisson and negative time series with model (2) and the prespecified values of the trend change parameter based on a significance level of 0.05. We considered β 3 ¼ �0:01; �0:05; �0:10 for the Poisson time series in Table 5, and β 3 ¼ �0:05; �0:1; �0:25 for the negative binomial time series in Table 6. Similar to the previous test, the estimated power increased as γ 1 , the sample size increased, or the values of the parameter became more significant for the Poisson time series. For the negative binomial time series, again, the estimated power increased first and then decreased as γ 1 increased. This phenomenon can be more clearly observed for large values of the parameter. Further, when the value of the parameter was negative, the estimated power increased first and then decreased as the values of the parameter decreased. The difference in the estimated power between the parameter values of the opposite signs is due to the fact that count data are defined based on the non-negative support. Thus, models are built on the logarithm of the conditional mean of the responses. The trends of the estimated power of γ 1 and sample size n are illustrated by the surface plots in Fig. 3.
For large absolute values of γ 1 , the time series were more likely to explode, i.e., the data in the certain time series can increase (or decrease) so fast that the computer program cannot generate values over a certain threshold because of this rapid expansion. It was also often impossible to generate a time series with the desired sample size. This situation usually happened for large sample sizes. Estimations do not exist for these exploded models, since data cannot be successfully generated, so the estimated powers are marked with the symbol "-" in the tables when more than one fourth of the simulations (more than 50 times) could not generate a time series with the specified length.

Discussion
ITS is a powerful yet simple quasi-experimental design that has been widely applied to many population-based public health and health service intervention studies ( [2,7]). In this article, we studied the models of ITS design for count outcomes. More specifically, we discussed low order log-linear models for ITS design, a special type of observation-driven model, with two distribution specifications (Poisson and negative binomial). Our study was motivated by the STRIDE study, which was designed based on the state-of-the-art power calculation method of the two-arm two-phase ITS design of continuous outcomes (the rate of African American and Latino participants recruited) proposed by Zhang et al. [6]. Because we were also interested in the number of African American and Latino participants recruited and the total number of participants, similar power calculation method using ITS design for count outcomes needed to be investigated. Herein, a simulation-based method was applied to demonstrate the power of hypothesis tests on level change, trend change, and the change of both (the sum of the level change and trend change) under different values of parameters, sample sizes, and autocorrelation coefficients (γ 1 ) under pre-specified conditions. We focused our attention on single-arm ITS studies. Tests for two-arm ITS studies require future investigation. As anticipated, for Poisson models, the estimated power increased as γ 1 , the sample size increased, or the values of the parameter became more significant. For the negative binomial method, the estimated power increased as the sample size increased, or values of the parameter became more significant. However, the change of power showed a U-shape pattern as γ 1 increased for tests on level change and trend change and also increased as γ 1 increased for the tests of the total change. Further, summarizing the results across the six tables, the power of the hypothesis tests with the same level of parameter values can vary widely depending on the type of tests (level, trend, or both) and the model specifications.
Like most ITS designs, our simulation-based power and sample size calculations were based upon models at the aggregated data level. For instance, in the STRIDE study, the aggregate number of participants of African Americans and Latino descent will be collected weekly. However, this type of analysis will not only lose information when individual level data are unavailable, but can also give an incomplete conclusion if the total number of participants increases simultaneously. Thus, although aggregate level ITS designs are the common practice, power and effect size calculations based on such an approach only consider the number of time tables, but not the number of observations at each time window. For this reason, individual level ITS designs for count data or ITS designs that account for the number of observations at each time window need to be further investigated.
This study has several limitations. Firstly, we only considered observation-driven ITS models. Previous studies suggested that parameter-driven models are usually more complicated and computationally intensive because full likelihood of these models involve highdimensional integration. Yet parameter-driven models have better interpretability for their parameters than observation-driven models.
Thus, the performance of parameter-driven models for ITS design, based on count outcomes, needs to be further studied and compared with our proposed models. Secondly, it may be too simplistic to assume that an intervention is implemented at a single time point. Using the STRIDE study as an example, it is reasonable to assume that a "ramp-up" period is required to allow the research assistants to complete their training and for the intervention to achieve full implementation. Further, the study contains a comparison group. Although the ITS study may still be valid with the absence of a control study ( [7]), and adapt the three-phase design to a two-phase design ( [35]), the strength of the inference will be weaker. Therefore, the power and effect size calculations of count outcomes for more complicated models like two-arm three-phase ITS design should be further investigated. Thirdly, as mentioned above, the integrated level ITS design does not consider the number of individuals at each timetable. Using the STRIDE study as an example, this limitation may yield incomplete conclusions, since we expect an increase in the number of African Americans, Latinos, and total participants. Individual-level ITS design could be a reasonable approach to overcome this issue, though only a few health policy studies ( [36]) have taken such an approach. Fourthly, excessive zeros are an issue in health policy studies, including the STRIDE study. Our ongoing research seeks to extend our work to zero-inflated Poisson or zero-inflated negative binomial models.

Conclusions
Sample size and power calculations were conducted for ITS studies of count outcomes using an observation-driven model through the simulation-based methods presented in this article. Results varied among the different model specifications and the target of the study (i.e., investigating level change, trend change, or both).

Author's contributions
BZ, WL, and SY presented the research idea. WL performed the numerical simulation. BZ and SY wrote the original manuscript with support from MAP, EJR, MID, and CL. The STRIDE principal investigators, KGS, PAH, SCL, and JJA, motivated the research idea and helped supervise the project.

Declaration of competing interest
None.