Accounting for baseline trends in intervention studies: Methods, effect sizes, and software

Abstract: Single-case experimental studies are relevant and important to substantiate the effectiveness of behavioral, clinical, and educational interventions. When a baseline trend is present in intervention studies, it is challenging for researchers to determine, visually or statistically, if an effect is due to the intervention or to the naturally occurring trend. In this paper, we demonstrated and appraised four methods that can assess an intervention effect in the presence of baseline trends. These four methods quantified an intervention effect as a phase change effect size (i.e., mean phase difference and the slope and level change) or a nonoverlap effect size (i.e., Tau-U_AB-A and Tau_c). Empirical data from an intervention study were used in the demonstration. All methods were evaluated in terms of types of intervention effects assessed, assumptions, specialized computing tools, and application issues (e.g., missing scores). To empower researchers to use these four methods, we provide a summary of each method’s appropriate inferences about an intervention effect and its computing tools’ strengths and limitations. Based on the results, we recommend using multiple baseline trend-controlled methods to draw a conclusion about an intervention effect.


ABOUT THE AUTHOR
Our research team has been interested in investigating practical and theoretical issues surrounding the definition, application, and computation of effect sizes for between-subjects and within-subjects designs, including single-case experimental designs. The issues we have investigated and contributed toward the effect size literature include the impact of APA and AERA guidelines on effect size reporting practices, alternative effect sizes beyond Cohen's d, how to deal with baseline trends, the treatment of missing data, the treatment of ties, and the strengths and limitations of computing tools for the computation of effect sizes.

PUBLIC INTEREST STATEMENT
How can we know that an intervention worked, despite a natural improvement in a participant's behavior? Clinicians, educators, service providers, and specialists often rely on a single-case design to determine if an intervention has effectively changed a specific behavior, such as disruptive behavior, in a participant. A basic single-case design consists of a baseline phase and an intervention phase. If a behavior shows an improvement trend during the baseline phase, such as a decrease in disruptive behavior or an increase in communication, it is difficult to determine whether the behavior improved due to the intervention alone or due to a combination of the intervention and the behavioral trend. Using real-world data, this article demonstrates four methods that can untangle a baseline trend from an intervention effect. It is hoped that this article empowers practitioners and researchers to appropriately assess an intervention effect in the presence of baseline trends.
In intervention studies, the effectiveness of an intervention or treatment is often studied using single-case experimental designs (SCEDs). SCEDs have been used in subfields of psychology, such as neuropsychology (Zermatten, Rochat, Manolov, & Van der Linden, 2018), health psychology (Sniehotta, Presseau, Hobbs, & Araújo-Soares, 2012), clinical psychology (Morgan & Morgan, 2001), and counseling psychology (Ellis, Hutman, & Chapin, 2015). A SCED study collects detailed information about a few participants' target behavior over an extended period of time in order to determine the effect of an intervention. SCED studies typically consist of a baseline (or A) phase and an intervention (or B) phase in which the participants serve as their own controls. Consequently, data are correlated serially over time or sessions.
While visual analysis is the primary method of evaluating an intervention effect in SCEDs (Kazdin, 2011), statistical methods are increasingly employed to complement visual analysis (Manolov, Gast, Perdices, & Evans, 2014). For both visual and statistical analyses of SCED data, one challenge often encountered by practitioners and researchers is the presence of a baseline trend, because it may persist into the intervention phase (Sullivan, Shadish, & Steiner, 2015). If an undesirable behavior decreases (a downward trend), or a desirable behavior increases (an upward trend), during the baseline phase, it may be argued that the behavior has naturally improved without the intervention. According to Parker, Cryer, and Byrns (2006), 41% of 165 data graphs, based on an AB design, showed an improvement trend in the baseline phase. When baseline trends were present, researchers could not reach a consensus based on visual analysis, or yield valid conclusions about the intervention effect (Lieberman, Yoder, Reichow, & Wolery, 2010; Mercer & Sterling, 2012).
Likewise, the statistical analyses (e.g., effect size, its statistical test, and interpretation) of SCED data are compounded by baseline trends (Campbell, 2004). It is therefore imperative that researchers be informed of ways of dealing with an existing baseline trend when assessing an intervention effect in SCEDs. A number of methods and computing tools have been proposed to fit a straight line as the baseline trend (Manolov, 2018), according to What Works Clearinghouse's standards (Kratochwill et al., 2010). Yet these methods and computing tools have not been critically assessed in terms of their ability to deal with practical issues such as how to properly interpret each method's proposed effect size index when assumptions are violated, when missing data are present, or when scores are tied. Furthermore, the functionalities of specialized computing tools written for baseline trend-controlled methods are not widely known.
This study seeks to fill these voids by demonstrating and evaluating four baseline trend-controlled methods and their computing tools. Specifically, this study aims to (I) use empirical data (Lambert, Cartledge, Heward, & Lo, 2006) to define and illustrate four methods that account for baseline trends in SCEDs; (II) investigate each method's proposed intervention effect size in terms of its proper interpretation, assumptions, and specialized computing tools; and (III) resolve each method's application issues, including unequal phase lengths, tied scores, missing data, out-of-range predicted scores, and unequally spaced sessions.
The empirical data are hereafter referred to as the Lambert data. The Lambert data have been used in methodological articles to illustrate statistical methods suitable for SCED studies (e.g., Chen, Peng, & Chen, 2015; Shadish, 2014). Lambert et al. (2006) implemented the use of response cards in two fourth-grade math classes to determine if the intervention minimized students' disruptive behaviors while learning math. Their study employed a reversal design with two baseline phases (A1 and A2) and two intervention phases (B1 and B2). A disruptive behavior was recorded in 10 intervals of each study session. The dependent variable was the number of intervals in which at least one disruptive behavior was observed; hence, data ranged from 0 to 10 for each student in each study session.

The Lambert data
To appraise the methods examined in this paper, we focused on data obtained from three students (S1, S2, and S3) during the second baseline (A2) and the second intervention (B2) phases (see Figure 1). These students' data were well suited for evaluating the four methods for five reasons. First, the three students' data exhibited different baseline trends during the A2 phase. According to Figure 1, Student S1 showed an upward trend, Student S2 showed a flat or no trend, and Student S3 showed a downward trend. The manner in which the four methods dealt with different trends, or lack thereof, was of interest to this study.
Second, the intervention was implemented in the two math classes for different lengths of time. The different phase lengths in the Lambert data could present a challenge when student-level results are integrated into class-level results. Third, there were tied scores for all three students. These tied scores required adjustments when a trend-controlled method based on ranks of scores is applied. In this paper, we demonstrated how adjustments could be implemented. Fourth, several intervention sessions were missed; hence, missing scores occurred during those sessions. If and how each baseline trend-controlled method dealt with missing scores was of interest to our evaluation because missing scores frequently occurred in SCED studies (Chen, Feng, Wu, & Peng, 2019). Lastly, when a method used a trend line to account for the baseline trend, predicted scores derived from the trend line might be outside the legitimate score range of 0 to 10 for the Lambert data. It was, therefore, important to investigate if and how each method dealt with out-of-range predicted scores while controlling for a baseline trend.

Methods, effect sizes, and software accounting for baseline trends
To identify methods suitable for our examination, we used the search terms ("single-case" OR "single-subject") AND ("trend") in the Web of Science database, along with references located in published journal articles. We limited our search to methods proposed after 2009. Our search identified four methods that were suitable for our investigation and assessment. These four methods do not require a model for data or analysis and are applicable in simple as well as complex designs. Those methods that could not be fully implemented without specialized computing tools were excluded, such as the graph rotation for overlap and trend method (Parker, Vannest, & Davis, 2014).

Figure 1. The number of intervals with disruptive behaviors in A2 and B2 phases for Students S1, S2, and S3. S1's and S2's data were replicated from Lambert et al. (2006). S3's data were suggested by one student in Lambert et al. (2006).
The four methods quantified an intervention effect as a phase change effect size (slope change and/or level change), or a nonoverlap effect size, after adjusting it for a baseline trend. Below we define each method in terms of a step-by-step algorithm, followed by a demonstration. A discussion of each method's effect size and its proper interpretation, assumptions, and application issues is presented at the end of each effect size section. An alpha level of .05 was pre-selected as a criterion for statistical significance for any statistical test of a null hypothesis. Table 1 summarizes each method's effect size index, assumptions, computing tools, and application issues.

Phase change effect sizes
Two methods computed the phase change as an index of an intervention effect: the mean phase difference (MPD) method and the slope and level change (SLC) method.

The MPD method
This method was proposed by Manolov and Solanas (2013) to quantify the phase difference between the observed intervention scores and intervention scores predicted from a baseline trend. Since 2013, multiple variations of MPD have been proposed. In this paper, we refer to the MPD initially proposed by Manolov and Solanas (2013) as MPD1, and subsequent variations as MPD2 to MPD8. MPD1 to MPD4 are conceptually and computationally simpler than MPD5 to MPD8, which were recently proposed by Manolov, Solanas, and Sierra (2018). MPD1 to MPD4 are presented in this section, whereas MPD5 to MPD8 are presented in supplemental materials available from https://osf.io/h75fd/?view_only=0c40cd0678ed45798501de418cca0f44

MPD1
The computation of MPD1 consists of four steps.
Step 1: Compute the differenced baseline scores.
Step 2: Compute the mean of the differenced baseline scores and treat it as the slope (b_A) of the baseline trend.
Step 3: Use the slope (b_A) to compute the predicted intervention scores.
Step 4: Obtain residuals by subtracting the predicted intervention scores from the observed intervention scores. The mean of the residuals from Step 4 is the MPD1.
The differenced baseline score of Step 1 is the difference between a baseline score and its preceding baseline score. Thus, the first differenced baseline score is the second baseline score minus the first baseline score, and the second differenced baseline score is the third baseline score minus the second baseline score. This definition of differenced scores is referred to as first-order differencing (Manolov & Solanas, 2013). First-order differencing assumes that a score is influenced by its immediately preceding score. The number of differenced baseline scores is one fewer than the number of baseline scores, or (n_A − 1).

In Step 2, the mean of the differenced baseline scores is the slope (b_A) of the baseline trend. b_A can be calculated directly as the difference between the last and the first baseline scores divided by (n_A − 1), without considering the baseline scores in between. The slope is used in Step 3 to derive the predicted intervention score ŷ_(n_A+j) at the time point (n_A + j) according to Equation (1):

ŷ_(n_A+j) = y_1 + b_A × (n_A + j − 1),   (1)

where y_1 = the first baseline score, and j = the time point in the intervention phase = 1, 2, 3, …, n_B. Therefore, the first predicted intervention score ŷ_(n_A+1) = y_1 + b_A × n_A, and the last predicted intervention score ŷ_(n_A+n_B) = y_1 + b_A × (n_A + n_B − 1).

In Step 4, MPD1 = the mean of the residuals:

MPD1 = [Σ_{j=1}^{n_B} (y_(n_A+j) − ŷ_(n_A+j))] / n_B,

where y_(n_A+j) = the observed intervention score at the (n_A + j)th time point, and ŷ_(n_A+j) = the predicted intervention score obtained from Equation (1) at the same time point. By definition, MPD1 is a measure of phase change due to level change and/or slope change (Manolov & Solanas, 2013).
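The four steps above can be sketched in a few lines of Python (a minimal sketch under the equally-spaced-sessions assumption; the function name is ours, and this is not the authors' MPD R function, which also handles missing scores):

```python
def mpd1(baseline, intervention):
    """MPD1 (Manolov & Solanas, 2013): mean difference between observed
    intervention scores and scores predicted from the baseline trend."""
    n_a = len(baseline)
    # Steps 1-2: the mean of the first-order-differenced baseline scores,
    # which equals (last - first) / (n_A - 1)
    b_a = (baseline[-1] - baseline[0]) / (n_a - 1)
    # Step 3: predicted intervention score at time point (n_A + j), Equation (1)
    predicted = [baseline[0] + b_a * (n_a + j - 1)
                 for j in range(1, len(intervention) + 1)]
    # Step 4: MPD1 = mean of the residuals (observed - predicted)
    residuals = [y - yhat for y, yhat in zip(intervention, predicted)]
    return sum(residuals) / len(residuals)
```

For instance, when the intervention scores merely continue a baseline that rises by one point per session, MPD1 is zero, reflecting no effect beyond the trend.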
An empirical benchmark of MPD1 has been established to facilitate its interpretation. Solomon, Howard, and Stein (2015) compiled a distribution of empirical MPD1s based on 131 studies of different sorts (school-wide positive behavior support, teacher performance feedback, math interventions, and classroom-based individual behavior interventions). Solomon et al. (2015) reported the first quartile of the empirical MPD1 distribution to be 1.22, the second quartile to be 1.90, and the third quartile to be 2.80.

MPD2
Manolov and Rochat (2015) modified the computation of MPD1 to use the median of the baseline scores as the constant in Equation (1), instead of the first baseline score. According to Manolov and Rochat (2015), the modified predicted intervention score ŷ′_(n_A+j) at the time point (n_A + j) is obtained from Equation (2):

ŷ′_(n_A+j) = y_A_Mdn + b_A × (n_A + j − (n_A + 1)/2),   (2)

where y_A_Mdn = the median of the baseline scores, b_A = b_A from Equation (1), and (n_A + 1)/2 is the median session of the baseline phase.
Similar to MPD1, MPD2 = the mean of the residuals = [Σ_{j=1}^{n_B} (y_(n_A+j) − ŷ′_(n_A+j))] / n_B, where y_(n_A+j) = the observed intervention score at the (n_A + j)th time point, and ŷ′_(n_A+j) = the predicted intervention score defined in Equation (2) at the same time point.
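MPD2 differs from MPD1 only in how the trend line is anchored; a hypothetical Python sketch (ours, not the authors' tool) makes the change explicit:

```python
import statistics

def mpd2(baseline, intervention):
    """MPD2 (Manolov & Rochat, 2015): the trend line passes through the
    median baseline score at the median baseline session, Equation (2)."""
    n_a = len(baseline)
    b_a = (baseline[-1] - baseline[0]) / (n_a - 1)  # same slope as MPD1
    y_mdn = statistics.median(baseline)             # median of baseline scores
    mid = (n_a + 1) / 2                             # median baseline session
    predicted = [y_mdn + b_a * (n_a + j - mid)
                 for j in range(1, len(intervention) + 1)]
    residuals = [y - yhat for y, yhat in zip(intervention, predicted)]
    return sum(residuals) / len(residuals)
```

Anchoring at the median makes the prediction less sensitive to an atypical first baseline score; for a perfectly linear baseline, MPD1 and MPD2 coincide.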
MPD1 to MPD4 can be computed using a free web-based tool available from https://manolov.shinyapps.io/MPDExtrapolation/, under the tab "Mean phase difference". When missing scores occur, researchers can obtain MPD1 to MPD4 from the MPD R function included in supplemental materials.

Demonstration
For S1 and S2, MPD1 to MPD4 were obtained from the web-based tool mentioned above. For S3, we took the missing sessions (27 and 32) into consideration when computing MPD1 to MPD4 using the MPD R function.
According to Table 2, Student S1's MPD1 was −9.57, MPD2 was −12.07, MPD3 was −7.79, and MPD4 was −8.00. For Student S2, MPD1 and MPD2 were identical (= −9.43), and MPD3 and MPD4 were identical (= −8.00), because S2's baseline data displayed a flat or no trend in Figure 1. These MPDs were interpreted as follows: for S1 and S2, there was a decrease of at least seven intervals containing disruptive behaviors during the intervention phase from what was predicted from the baseline trend. Furthermore, the MPD1s for the two students were both greater, in absolute value, than the third quartile of 2.80 (Solomon et al., 2015). We, therefore, concluded that the response card intervention was effective in decreasing disruptive behaviors for Students S1 and S2, after accounting for an upward trend for S1 and no trend for S2. Student S3's MPD1 (= 3.57), MPD2 (= 3.07), MPD3 (= 0.50), and MPD4 (= 0.39) were all positive, suggesting an increase in disruptive behaviors during the intervention phase, after accounting for a downward baseline trend. These results did not support an effective intervention for S3, beyond a decreasing trend already occurring during the baseline phase. Because there is no statistical test of a sample MPD1 to MPD4, we interpreted these MPDs descriptively.

The SLC method
The SLC method was proposed by Solanas, Manolov, and Onghena (2010) to quantify slope change and level change separately. The SLC method consists of three steps.
Step 1: Remove the baseline trend from baseline and intervention scores.
Step 2: Compute the slope change based on the detrended intervention scores.
Step 3: Compute the level change based on the detrended and slope-change-controlled intervention scores and the detrended baseline scores.

Table 2. MPD1, MPD2, MPD3, and MPD4 computed for Students S1, S2, and S3 of the Lambert data. Note. Values under each data graph show the mean of residuals for its corresponding MPD. The data graphs were copied from the free web-based tool available at https://manolov.shinyapps.io/MPDExtrapolation/, except for S3. In addition to the data graphs, the returned results from the web-based tool show the values of the baseline trend (b_A) and MPD in red font.

To remove the baseline trend in Step 1, Equation (3) is applied:

D_t = y_t − b_A × (t − 1),   (3)

where D_t is the detrended score at the tth time point, y_t is the observed score at the same time point t, and b_A is the slope in Equation (1).
Step 2's slope change (sc) is the difference between the last detrended intervention score and the first detrended intervention score, divided by (n_B − 1):

sc = (D_(n_A+n_B) − D_(n_A+1)) / (n_B − 1).   (4)

Step 3's level change (lc) is the difference between the mean of the detrended and slope-change-controlled intervention scores and the mean of the detrended baseline scores. The detrended and slope-change-controlled intervention scores, or DSCC_t, are given by Equation (5):

DSCC_t = D_t − sc × (t − n_A − 1),   (5)

where D_t is the detrended intervention score at the tth time point in the intervention phase. Thus,

lc = [Σ_{t=n_A+1}^{n_A+n_B} DSCC_t] / n_B − [Σ_{t=1}^{n_A} D_t] / n_A.   (6)

Both sc and lc can be computed using a free web-based tool at https://manolov.shinyapps.io/Change/, under the tab "Slope and level change". When there are missing scores, researchers can obtain sc and lc from the SLC R function included in supplemental materials.
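The three SLC steps can be sketched in Python (our sketch under the equally-spaced-sessions assumption, not the SLC R function):

```python
def slc(baseline, intervention):
    """Slope and level change (Solanas, Manolov, & Onghena, 2010)."""
    n_a, n_b = len(baseline), len(intervention)
    b_a = (baseline[-1] - baseline[0]) / (n_a - 1)  # baseline slope, as in MPD1
    # Step 1: detrend every score, D_t = y_t - b_A * (t - 1); here the loop
    # index starts at 0, so it already equals (t - 1)
    d = [y - b_a * t for t, y in enumerate(baseline + intervention)]
    d_base, d_int = d[:n_a], d[n_a:]
    # Step 2: slope change from the first and last detrended intervention scores
    sc = (d_int[-1] - d_int[0]) / (n_b - 1)
    # remove the slope change within the intervention phase
    dscc = [x - sc * j for j, x in enumerate(d_int)]
    # Step 3: level change = difference of the detrended phase means
    lc = sum(dscc) / n_b - sum(d_base) / n_a
    return sc, lc
```

For a baseline rising by one point per session followed by a flat intervention phase at a higher level, the sketch separates the within-phase slope change from the between-phase level change.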

Demonstration
For Students S1 and S2, we computed sc and lc using the web-based tool listed above (see Table 3). For S3, we took the missing sessions (27 and 32) into consideration and used the SLC R function to compute sc and lc. For Student S1, an sc of −1.09 was interpreted as an average decrease of 1.09 intervals with disruptive behaviors from the previous session at any given intervention session. An lc of −7.59 was interpreted as S1 exhibiting an average of 7.59 fewer intervals with disruptive behaviors in the intervention phase than in the baseline phase, after accounting for an upward baseline trend. For Student S2, an sc of −0.16 was interpreted as an average decrease of 0.16 intervals with disruptive behavior from the previous session at any given intervention session. An lc of −8.66 was interpreted as S2 exhibiting an average of 8.66 fewer intervals with disruptive behaviors in the intervention phase than in the baseline phase with no trend. For both S1 and S2, their lcs suggested a noticeable decrease from the A2 to the B2 phase. Their scs suggested a downward trend during the intervention phase. Both results supported an effective intervention for S1 and S2, after considering the baseline trend.
Student S3's sc (= 0.73) was interpreted as an average increase of 0.73 intervals with disruptive behavior from a previous session at any given intervention session. Student S3's lc (= −0.94) was interpreted as S3 exhibiting an average of 0.94 fewer intervals with disruptive behaviors in the intervention phase than in the baseline phase, after accounting for a downward baseline trend. These results did not support an effective intervention for S3, beyond a decreasing trend already occurring during the baseline phase.

Discussion of the MPD and the SLC methods
MPD1 to MPD4, sc, and lc are applicable to interval-level or ratio-level scores. These six effect sizes can be computed for individuals, as well as for a class/group as an average, even if phase lengths are unequal. These six effect sizes can be interpreted directly and descriptively, if measurements are meaningful, as in the Lambert data. MPD1 can also be compared relatively to the quartiles of empirical MPD1s compiled by Solomon et al. (2015). Because the sampling distributions of these six effect sizes are not available, researchers cannot interpret them inferentially.

Note (Table 3). The graphs were generated in a row for each student under the tab "Slope and level change" available at https://manolov.shinyapps.io/Change/, except for S3. The first graph, with the caption "Original data points and detrended asterisks", presents the observed scores as black dots and the detrended scores of Equation (3) as red asterisks. "Baseline trend =", shown in green font, is the value of the baseline trend (or b_A). The second graph, with the caption "First and last data points and detrended asterisks", presents a green line connecting the first and the last detrended baseline scores, and a red line connecting the first and the last detrended intervention scores. "Change in slope =", shown in red font, is the value of sc. The third graph, with the caption "Detrended data with no phase B slope", presents the detrended baseline scores and the detrended and slope-change-controlled intervention scores as black dots. The two blue-dashed lines represent the mean of the detrended baseline scores and the mean of the detrended and slope-change-controlled intervention scores, respectively. "Net change in level =", shown in blue font, is the value of lc.

The calculation of MPD1 to MPD4, sc, and lc assumes that (1) the baseline trend is linear and persists throughout the entire intervention phase; (2) the baseline trend (i.e., b_A) can be determined by the first and the last baseline scores; (3) baseline and intervention sessions are equally spaced; and (4) the first-order-differenced baseline scores can be used to effectively remove the baseline trend (Manolov & Solanas, 2013).
The third assumption is derived from the fact that differenced scores, defining b A in Equations (1 and 3), should correspond to time points, not session numbers. Linking scores with real time points is also recommended in the visual analysis according to "The Single-Case Reporting Guideline In BEhavioural Interventions (SCRIBE) 2016: Explanation and Elaboration" (Tate et al., 2016, p. 23).
When baseline sessions are not equally spaced, b_A should be adjusted for the unequal spacing between sessions. For example, assume four baseline scores of (3, 5, 7, 9) were obtained on Days 1, 3, 4, and 5, respectively; the mean of the differenced baseline scores (namely, b_A) should be adjusted to (9 − 3)/(5 − 1) = 1.5, rather than (9 − 3)/(4 − 1) = 2. Likewise, the predicted intervention scores of Equations (1) and (2) need to be adjusted accordingly by treating the intervention sessions that would equalize the unequally spaced sessions as missed sessions in the MPD and the SLC R functions.
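The adjustment in the worked example amounts to dividing the score change by the elapsed time rather than by the session count; as a one-line Python sketch (function name ours):

```python
def adjusted_slope(scores, days):
    """Baseline slope adjusted for unequally spaced sessions: the change in
    score divided by the elapsed time, not by (number of sessions - 1)."""
    return (scores[-1] - scores[0]) / (days[-1] - days[0])
```

Applied to the example above, adjusted_slope([3, 5, 7, 9], [1, 3, 4, 5]) reproduces the adjusted slope of 1.5.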
Tied scores are acceptable to the MPD and the SLC methods because these methods are not based on ranks of scores. These methods' web-based tools cannot accept missing scores. The missing scores from Sessions 27 and 32 for S3 were entered as "NA" into the MPD and the SLC R functions before MPD1 to MPD4, or sc and lc, were computed. Missing scores may also be imputed by statistical methods in order to maintain the design structure of an intervention study (Chen et al., 2019; Peng & Chen, 2018, in press).
Out-of-range predicted scores of Equations (1) and (2) were adjusted for MPD3 and MPD4, by the MPD web-based tool and by the MPD R function, to a user-specified maximum or minimum, according to Faith et al. (1997). The SLC method does not have to deal with out-of-range predicted scores.

Nonoverlap effect sizes
Two methods computed the degree of nonoverlap as an index of the intervention effect: the Tau-U_AB-A method and the baseline-corrected Tau (Tau_c) method. Nonoverlap is defined as the percentage of intervention scores that do not overlap with baseline scores. The higher the nonoverlap, the greater the intervention effect.

The Tau-U_AB-A method

The Tau-U_AB-A method adjusts the Tau-U_AB index for the baseline trend. Both Tau-U_AB and Tau-U_AB-A are nonoverlap effect sizes proposed by Parker, Vannest, Davis, and Sauber (2011). Both are derived from Kendall's Tau and the Mann-Whitney U; hence the name Tau-U. The Tau-U_AB-A method consists of three steps.
Step 1: Compute Tau-U_AB between baseline scores and intervention scores.
Step 2: Compute the baseline trend in terms of Kendall's S_A.
Step 3: Adjust Tau-U_AB for the baseline trend to yield Tau-U_AB-A.

Tau-U_AB in Step 1 is defined as the ratio of Kendall's S_AB divided by the product of the number of baseline scores (n_A) and the number of intervention scores (n_B):

Tau-U_AB = S_AB / (n_A × n_B).   (7)

S_AB in Equation (7) is calculated from pairwise comparisons of baseline scores with intervention scores using a matrix suggested by Parker et al. (2011). Figure 2 presents such a matrix based on S1's data. In the Figure 2 matrix, baseline (A2) scores are listed as rows in ascending order of sessions; the intervention (B2) and baseline (A2) scores are listed as columns in reversed order of sessions. Cell entries are the results of pairwise comparisons. When a column value is greater than a row value, a "+" is recorded in their intersection cell. Conversely, when a row value is greater than a column value, a "−" is recorded in the cell. When the column value and the row value are tied, a "T" is recorded. The shaded area in the matrix is used to compute S_AB = the number of "+" minus the number of "−". The number of tied comparisons (i.e., Ts) is not considered in computing S_AB.
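The shaded-area computation amounts to comparing every baseline score with every intervention score; a Python sketch of Equation (7) (ours, not the Tauu R function):

```python
def tau_u_ab(baseline, intervention):
    """Tau-U_AB (Parker et al., 2011): S_AB / (n_A * n_B), where S_AB is the
    number of "+" cells minus the number of "-" cells; ties ("T") are ignored."""
    s_ab = sum((b > a) - (b < a)          # +1 for "+", -1 for "-", 0 for "T"
               for a in baseline for b in intervention)
    return s_ab / (len(baseline) * len(intervention))
```

Each comparison contributes +1, −1, or 0 to S_AB, so tied pairs neither help nor hurt the nonoverlap index.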
Tau-U_AB ranges from −1 to 1. A value of −1 indicates that all baseline scores are greater than the intervention scores. A value of 1 indicates that all baseline scores are smaller than the intervention scores; 0 means a complete overlap between the two phases. Because lower scores in the Lambert data meant a lower occurrence of disruptive behaviors, or better performance, an effective response card intervention should result in a Tau-U_AB closer to −1 than to 1 or 0.
In Step 2, S_A is computed from the unshaded area of the Figure 2 matrix: S_A = the number of "+" minus the number of "−". Similar to S_AB, a "+", "−", or "T" in the unshaded area of Figure 2 denotes the corresponding column value being greater than, smaller than, or equal to the row value, respectively.
Step 3 computes Tau-U_AB-A from Tau-U_AB according to Equation (8) (Parker et al., 2011), where S_A is computed in Step 2:

Tau-U_AB-A = (S_AB − S_A) / (n_A × n_B + n_A × (n_A − 1)/2).   (8)

The Tauu R function from http://ktarlow.com/stats/r/tauu.txt (Tarlow, 2017b, March) computes Tau-U_AB-A under "ab.mina" and Tau-U_AB under "ab", according to Parker et al. (2011). The definition of Tau-U_AB-A permits its interpretation to be relative to Tau-U_AB, or to Tau-U_AB-A's empirical distribution (Solomon et al., 2015). According to Solomon et al. (2015), the first quartile of Tau-U_AB-A = .28, the second quartile = .47, and the third quartile = .57.
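Extending the Tau-U_AB computation with the baseline-trend term S_A gives Tau-U_AB-A. In the sketch below (ours, not the Tauu R function), the denominator is assumed to count the A-versus-B pairs plus the baseline-trend pairs, n_A·n_B + n_A(n_A − 1)/2; this is our reading of Parker et al. (2011), not a verified transcription of the Tauu code:

```python
def tau_u_ab_minus_a(baseline, intervention):
    """Tau-U_AB-A: Tau-U_AB adjusted for the baseline trend S_A.  The
    denominator n_A*n_B + n_A*(n_A-1)/2 counts the A-vs-B pairs plus the
    baseline-trend pairs (our reading of Parker et al., 2011)."""
    n_a, n_b = len(baseline), len(intervention)
    # S_AB: baseline-vs-intervention pairwise comparisons (shaded area)
    s_ab = sum((b > a) - (b < a) for a in baseline for b in intervention)
    # S_A: later-vs-earlier comparisons among baseline scores (unshaded area)
    s_a = sum((baseline[j] > baseline[i]) - (baseline[j] < baseline[i])
              for i in range(n_a) for j in range(i + 1, n_a))
    return (s_ab - s_a) / (n_a * n_b + n_a * (n_a - 1) / 2)
```

With a flat baseline, S_A = 0 and the adjustment only enlarges the denominator, pulling the index toward zero relative to Tau-U_AB.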

Demonstration
Using the Tauu R function, we computed Tau-U_AB for Students S1, S2, and S3 to be −.92, −1.00, and −.98, respectively (see Table 4). These Tau-U_ABs were interpreted as follows: 92%, 100%, and 98% of the pairs of data showed improvement from the A2 phase to the B2 phase for Students S1, S2, and S3, respectively.
Using the Tauu R function, we computed Tau-U_AB-A for Students S1, S2, and S3 to be −.78, −.79, and −.56, respectively. Student S1's Tau-U_AB-A meant that 78% of pairs of data showed improvement after the upward baseline trend was removed from Tau-U_AB. It was a decrease in absolute value from its corresponding Tau-U_AB = −.92, but greater, in absolute value, than the third quartile of Solomon et al.'s (2015) distribution of Tau-U_AB-A. Student S2's Tau-U_AB-A meant that 79% of pairs of data showed improvement after the somewhat flat baseline trend was removed. It was a decrease in absolute value from its corresponding Tau-U_AB = −1.00, but greater, in absolute value, than the third quartile of Solomon et al.'s (2015) distribution. Student S3's Tau-U_AB-A meant that 56% of pairs of data showed improvement after the downward baseline trend was removed. It was a decrease in absolute value from its corresponding Tau-U_AB = −.98 and was similar, in absolute value, to the third quartile of Solomon et al.'s (2015) distribution. We, therefore, concluded that the response card intervention improved S1's, S2's, and S3's disruptive behaviors.

Figure 2. Student S1's data matrix.
Note (Table 4). Tau-U_AB, Tau-U_A, and Tau-U_AB-A are obtained from the Tauu R function available at http://ktarlow.com/stats/r/tauu.txt (Tarlow, 2017b, March). Tau-U_AB is shown under "ab"; Tau-U_AB-A is shown under "ab.mina".

The Tau_c method
The Tau_c method consists of four steps.
Step 1: Quantify a baseline trend as a Kendall's Tau (τ) correlation coefficient (Kendall, 1962) by correlating baseline scores and their time points.
Step 2: Test the baseline trend for statistical significance with Kendall's S test.
Step 3: If Kendall's S test is statistically significant according to a researcher's pre-specified criterion, perform the Theil-Sen regression on the baseline data to remove the baseline trend from both the baseline and the intervention scores.
Step 4: Compute a Kendall's Tau correlation coefficient (Kendall, 1962) between the residuals, obtained from the Theil-Sen regression line, and a dummy variable coded as 0 for all baseline residuals and 1 for all intervention residuals. Because the residuals are uncorrelated with the baseline trend, Kendall's Tau from Step 4 is called the baseline-corrected Tau in Tarlow (2017a), or Tau_c in this paper.
If the Step 2 test is not statistically significant, Tau_c is not computed and an uncorrected Tau (or Tau_noc in this paper) is computed instead. If Tau_noc is computed in Step 2, Steps 3 and 4 are not performed. Kendall's τ from Step 1, Tau_noc from Step 2, and Tau_c from Step 4 are all Kendall's Tau correlation coefficients. They can be calculated using Tarlow's tool from http://www.ktarlow.com/stats/tau.
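The detrending-and-correlation core of Steps 3 and 4 can be sketched in Python (ours, not Tarlow's tool; for brevity the sketch always detrends, omitting the Step 2 significance test, and uses a brute-force tie-corrected (tau-b) Kendall correlation):

```python
import statistics

def kendall_tau_b(x, y):
    """Tie-corrected (tau-b) Kendall correlation by brute-force pairwise counts."""
    n = len(x)
    s = ties_x = ties_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            a = (x[j] > x[i]) - (x[j] < x[i])
            b = (y[j] > y[i]) - (y[j] < y[i])
            s += a * b
            ties_x += a == 0
            ties_y += b == 0
    pairs = n * (n - 1) / 2
    return s / ((pairs - ties_x) * (pairs - ties_y)) ** 0.5

def baseline_corrected_tau(baseline, intervention):
    """Steps 3-4 of the baseline-corrected Tau (Tarlow, 2017a): remove the
    Theil-Sen baseline trend, then correlate residuals with a phase dummy."""
    n_a = len(baseline)
    # Theil-Sen slope: median of all pairwise slopes among baseline scores
    slope = statistics.median(
        (baseline[j] - baseline[i]) / (j - i)
        for i in range(n_a) for j in range(i + 1, n_a))
    scores = baseline + intervention
    residuals = [y - slope * t for t, y in enumerate(scores, start=1)]
    dummy = [0] * n_a + [1] * len(intervention)  # 0 = baseline, 1 = intervention
    return kendall_tau_b(residuals, dummy)
```

With a rising baseline that merely continues into the intervention phase, the detrended residuals drop at the phase change, so the sketch returns a negative Tau even though the raw scores never decrease.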

Kendall's τ from Step 1 is identical to Tau-U A (Parker et al., 2011) in Table 4, if baseline scores are not tied. If some baseline scores are tied, as in Students S1, S2, and S3's data, Equation (9) corrects Kendall's τ for tied scores:

τ = S A / [√(n A(n A − 1)/2 − U) × √(n A(n A − 1)/2)],   (9)

where S A is identical to the numerator of Tau-U A in Table 4, and U = the correction variable for ties in baseline scores. The correction variable U is determined from Equation (10):

U = Σ (1/2)u(u − 1),   (10)

where u = the number of tied baseline scores in each set, and (1/2)u(u − 1) is the number of pairwise comparisons among tied scores in each set. For example, for Student S1, there were three 8s (the first set of tied baseline scores) and three 10s (the second set of tied baseline scores). Thus, U = Σ (1/2)u(u − 1) = (1/2) × 3 × (3 − 1) + (1/2) × 3 × (3 − 1) = 6 for S1. The second square root of the denominator in Equation (9) contains no correction for ties in the baseline time points because there are none among time points. In Step 2, the S test of Kendall's τ is performed as a normal approximate test when the number of baseline scores (n A) is 10 or more (Kendall, 1962). If the S test of Step 2 turns out to be statistically insignificant, according to a pre-specified α, Tau noc is computed instead of Tau c. Tau noc is computed between the observed baseline and intervention scores and a dummy variable, coded as 0 for all baseline scores and 1 for all intervention scores, according to Equation (11):

Tau noc = S AB / [√(n(n − 1)/2 − U AB) × √(n(n − 1)/2 − (1/2)n A(n A − 1) − (1/2)n B(n B − 1))],   (11)

where n = n A + n B. The S AB in Equation (11) is identical to the numerator of Tau-U AB in Equation (7), and U AB = the correction variable for ties in baseline and intervention scores. U AB is determined from Equation (12):

U AB = Σ (1/2)u AB(u AB − 1),   (12)

where u AB = the number of tied scores within and across baseline and intervention phases in each set. For Student S1, there were six sets of tied scores, including three 3s, three 8s, three 10s, two 4s, two 1s, and two 0s. Thus, U AB = Σ (1/2)u AB(u AB − 1) = (1/2) × 3 × (3 − 1) × 3 + (1/2) × 2 × (2 − 1) × 3 = 12 for S1.
The second square root of the denominator in Equation (11) contains the correction for ties in the dummy variable, namely, −(1/2)n A(n A − 1) − (1/2)n B(n B − 1), because the dummy variable is tied on 0s and 1s. Tau noc is tested using a normal approximate test when the sample size (n A + n B) is 10 or more (Kendall, 1962).
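Equations (11) and (12) can be checked in Python against Student S1's scores as we transcribed them from Figure 2 (baseline Sessions 15–22, intervention Sessions 23–31). This is our sketch, not the Tauu R function:

```python
import math

def tau_noc(baseline, intervention):
    """Uncorrected Tau between scores and a 0/1 phase dummy,
    with the tie corrections of Equations (11) and (12)."""
    scores = baseline + intervention
    n = len(scores)
    n_a, n_b = len(baseline), len(intervention)

    # S_AB: Kendall's S between scores and the phase dummy; only
    # baseline-intervention pairs contribute (dummy difference = 1)
    s_ab = sum(((scores[j] > scores[i]) - (scores[j] < scores[i]))
               for i in range(n_a) for j in range(n_a, n))

    # U_AB: sum of u(u - 1)/2 over each set of tied scores (Equation 12)
    u_ab = sum(c * (c - 1) // 2
               for c in (scores.count(v) for v in set(scores)))

    pairs = n * (n - 1) / 2
    dummy_ties = n_a * (n_a - 1) / 2 + n_b * (n_b - 1) / 2
    return s_ab / (math.sqrt(pairs - u_ab) * math.sqrt(pairs - dummy_ties))

# Student S1's scores as we read them from Figure 2
s1_a = [3, 8, 8, 6, 10, 10, 10, 8]    # baseline, Sessions 15-22
s1_b = [3, 4, 1, 3, 2, 4, 0, 1, 0]    # intervention, Sessions 23-31
print(round(tau_noc(s1_a, s1_b), 2))  # -0.7
```

For these data the function gives U AB = 12 and Tau noc ≈ −.70, matching the value reported for S1 in the demonstration.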
If the S test of Step 2 is statistically significant, the Tau c method proceeds to Step 3.
Step 3 determines the slope and the intercept of the Theil-Sen regression line. The Theil-Sen regression is a nonparametric robust regression method suitable for skewed and heteroscedastic data. The Theil-Sen slope (b TS) is the median slope computed over all pairs of baseline scores, according to Equation (13):

b TS = median of (y k − y k′)/(k − k′), for k, k′ = 1, 2, . . ., n A and k ≠ k′,   (13)

where y k is the baseline score at the kth session and y k′ is the k′th session score. The Theil-Sen intercept (a TS) is computed by Equation (14). The a TS and b TS obtained from the Theil-Sen regression make the correlation coefficient between the baseline scores and their residuals equal to approximately zero. The a TS and b TS are subsequently used to yield residuals (e k) for all baseline and intervention scores, as in Equation (15):

e k = y k − (a TS + b TS × k).   (15)

In Step 4, Tau c is computed according to Equation (16) between the residuals derived in Step 3 and a dummy variable, coded as 0 for all baseline residuals and 1 for all intervention residuals:

Tau c = S c / [√(n(n − 1)/2 − U c) × √(n(n − 1)/2 − (1/2)n A(n A − 1) − (1/2)n B(n B − 1))],   (16)

where S c is computed from shaded cells of a data matrix, similar to the shaded cells in Figure 2, for which the residuals of Equation (15) are listed as rows and columns, and U c = the correction variable for ties in the residuals. U c is determined from Equation (17):

U c = Σ (1/2)u c(u c − 1),   (17)

where u c = the number of tied residuals in each set across baseline and intervention phases. Similar to Tau noc, the second square root of the denominator in Equation (16) contains the correction for ties in the dummy variable. Tau c is tested using a normal approximate test when the sample size (n A + n B) is 10 or more (Kendall, 1962). Table 5 presents results obtained from Tarlow's tool.

Demonstration

In Step 1, Kendall's τs of the baseline trend for Students S1, S2, and S3 were τ = .48 (p = .158), .28 (p = .437), and −.75 (p = .031), respectively. The τs of Students S1 and S2 were not statistically significant at α = .05. Tarlow's tool indicated that there was no need to correct for these students' baseline trends. We therefore computed Tau noc for S1 and S2 as −.70 (p = .002) and −.75 (p = .001), respectively. These results indicated a statistically significant decrease in S1's and S2's intervals with disruptive behaviors from the baseline to the intervention phase. We, therefore, concluded that the response card intervention was effective for S1 and S2.
For S3, Tarlow's tool computed Tau c = .56 (p = .017) because Kendall's τ of −.75 was statistically significant. Although Tau c was statistically significant, its sign was positive, indicating an increase in S3's intervals with disruptive behaviors from the baseline to the intervention phase. Such a result contradicted S3's data displayed in Figure 1. According to S3's residual graph shown in the upper panel of Table 5, the residuals obtained for intervention sessions (Sessions 24 to 32) were increasingly larger than 0. This phenomenon is likely to occur when data are bounded, such as the Lambert data bounded by 0 and 10 (Tarlow, 2017b). It was, therefore, difficult to determine if the response card intervention was effective for S3 based on Tarlow's Tau c = .56.
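Steps 3 and 4 can be sketched in Python as follows. This is our illustration rather than Tarlow's tool: the Theil-Sen intercept below uses the common median-of-(y − b × session) form, which may differ in detail from Equation (14). The toy data reproduce, in miniature, the bounded-data anomaly seen for S3: the last score cannot fall below the floor of 0, so its residual is positive and Tau c comes out positive even though the decrease continues.

```python
import math
from statistics import median

def theil_sen(y):
    """Theil-Sen slope over all pairs of baseline scores (Equation 13)."""
    n = len(y)
    b = median((y[j] - y[i]) / (j - i)
               for i in range(n) for j in range(i + 1, n))
    # Intercept: median of y_k - b*k is one common choice; the paper's
    # Equation (14) may use a different form.
    a = median(y[k] - b * (k + 1) for k in range(n))
    return a, b

def dummy_tau(values, n_a):
    """Tie-corrected Tau between values and a 0/1 phase dummy (Eq. 16-17)."""
    n = len(values)
    s = sum(((values[j] > values[i]) - (values[j] < values[i]))
            for i in range(n_a) for j in range(n_a, n))
    u = sum(c * (c - 1) // 2
            for c in (values.count(v) for v in set(values)))
    pairs = n * (n - 1) / 2
    n_b = n - n_a
    dummy_ties = n_a * (n_a - 1) / 2 + n_b * (n_b - 1) / 2
    return s / (math.sqrt(pairs - u) * math.sqrt(pairs - dummy_ties))

def tau_c(baseline, intervention):
    """Steps 3-4: detrend all scores with the baseline Theil-Sen line,
    then correlate residuals with the phase dummy (Equation 15 onward)."""
    a, b = theil_sen(baseline)
    scores = baseline + intervention
    resid = [scores[k] - (a + b * (k + 1)) for k in range(len(scores))]
    return dummy_tau(resid, len(baseline))

# A steady downward baseline continued into the intervention phase, with
# the final score stuck at the floor of 0: Tau_c is positive, not ~0.
print(round(tau_c([10, 9, 8, 7, 6, 5], [4, 3, 2, 1, 0, 0]), 2))  # 0.3
```

Because the fitted line predicts −1 at the final session while the score is bounded at 0, one residual is +1, which is enough to flip the sign of Tau c in this small example.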

Discussion of the Tau-U AB-A and the Tau c methods
Tau-U AB-A and Tau c are applicable to scores measured at least at the ordinal level. Both can be computed for individuals, as well as for a class/group as an average, even if phase lengths are unequal. Tau-U AB-A can be interpreted relatively, by comparison to Tau-U AB or to the quartiles of empirical Tau-U AB-A values compiled by Solomon et al. (2015). Because there is no sampling distribution for a sample Tau-U AB-A, it is inappropriate to interpret Tau-U AB-A inferentially (Peng & Chen, 2018, April). Tau c can be interpreted descriptively as well as inferentially, as demonstrated above.
In contrast to MPD and SLC, Tau-U AB-A and Tau c do not assume that the baseline trend is linear; they assume that the baseline trend persists throughout the intervention phase. The computation of Tau-U AB , Tau-U AB-A , Tau c or Tau noc is not affected by unequally spaced sessions because all are based on ranks of scores, not time points. The calculation of Tau-U AB or Tau-U AB-A is not adjusted for tied scores. Therefore, when tied scores occur, neither Tau-U AB nor Tau-U AB-A can reach its theoretical maximum (Peng & Chen, 2018, April). The computation of Tau c is adjusted for tied scores, as shown in Equations 16 and 17.
The Tau-U AB-A method and the Tauu R function, as well as the Tau c method and Tarlow's tool, treated missed sessions, such as Sessions 27 and 32 in S3, as nonexistent. Consequently, in Table 5 for S3, the residual obtained for Session 28 is plotted for Session 27 and the residual obtained for Session 34 is plotted for Session 32.
Because Tau-U AB-A does not compute predicted scores, out-of-range predicted scores are not an issue for this method. They are an issue with the Tau c method: out-of-range predicted scores in Equation (15) were not adjusted by Tarlow's tool. They can be adjusted to a user-specified maximum or minimum, following Faith et al. (1997), in the Tauc R function available from the supplemental materials. After adjusting the out-of-range predicted scores for S3, Tau c = .07 (p = .818), which is statistically insignificant and different from Tarlow's Tau c (= .56, p = .017). Based on the adjusted Tau c = .07, we concluded that the response card intervention was ineffective for S3.
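The adjustment can be sketched in Python as clamping each predicted score to the scale bounds before taking residuals. This is our sketch of the idea behind the adjustment, assuming bounds of 0 and 10 as in the Lambert data, not the Tauc R function itself:

```python
def clamped_residuals(scores, a, b, lo=0, hi=10):
    """Residuals from a fitted trend line, with each predicted score
    clamped to the scale bounds (after Faith et al., 1997)."""
    resid = []
    for k, y in enumerate(scores, start=1):
        pred = max(lo, min(hi, a + b * k))  # pull predictions back in range
        resid.append(y - pred)
    return resid

# With the line 11 - session, Session 12 is predicted at -1; clamping the
# prediction to 0 makes the final residual 0 instead of a spurious +1.
series = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 0]
print(clamped_residuals(series, a=11, b=-1)[-1])  # 0
```

Applied to the toy example above, the clamp removes the positive residual that made Tau c positive, mirroring how the adjusted Tau c for S3 fell from .56 to .07.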
When deciding whether to report Tau-U AB-A or Tau-U AB, researchers are faced with different recommendations in the literature. Parker et al. (2011) recommended removing the baseline trend, and therefore reporting Tau-U AB-A, if Tau-U A (Table 4) was at least .40. In Parker et al.'s (2011) analysis, Tau-U A = .40 corresponded to the 75th percentile of 382 empirical Tau-U A values derived from published articles. We agree with Brossart, Laird, and Armstrong's (2018) recommendation to report Tau-U AB-A when there are theoretical and empirical reasons for it. According to Brossart et al. (2018), a possible theoretical reason is that the researcher suspects a Hawthorne effect. A possible empirical reason is that the baseline trend tests as statistically significant, although the power of such a test is low when there is a small number of baseline scores. Similarly, the choice between Tau c and Tau noc depends on the significance test of Kendall's τ from Step 1. Tarlow (2017a) recommended at least 10 baseline scores in order to achieve reasonably large statistical power for testing Kendall's τ.
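Parker et al.'s (2011) cutoff can be written as a one-line check. The helper below is ours, and we take the absolute value of Tau-U A so that downward baseline trends such as S3's also qualify, an assumption not spelled out in the original rule:

```python
def remove_baseline_trend(tau_u_a, cutoff=0.40):
    """Parker et al. (2011): report Tau-U AB-A (i.e., remove the
    baseline trend) when Tau-U A reaches the .40 cutoff."""
    return abs(tau_u_a) >= cutoff

# Baseline trend values from the demonstration (S2 and S3)
print(remove_baseline_trend(0.28))   # False: report Tau-U AB
print(remove_baseline_trend(-0.75))  # True: report Tau-U AB-A
```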

Conclusion
We defined and illustrated four methods that can account for baseline trends while assessing an intervention effect. We demonstrated these methods using three students' data extracted from Lambert et al. (2006) and specialized computing tools. The three students' data displayed an upward, flat, or downward trend in the baseline phase. All methods successfully incorporated the three types of baseline trends into the assessment of the response card intervention for the students. Results revealed that the response card intervention was effective for Students S1 and S2. Such a conclusion was supported by all four methods investigated in this paper. Three methods (MPD, SLC, and Tau c) concluded that the intervention was ineffective for S3. Because S3 showed a downward trend (or decrease in intervals with disruptive behaviors) during the baseline phase, a further decrease shown in the intervention phase was not sufficient, according to the three methods, to claim an effective intervention. Yet Tau-U AB-A suggested that the intervention was effective for S3. In light of S3's data display and proper interpretations of the results obtained from the four methods, we concluded that the intervention was not effective for S3. If the intervention were implemented over a longer phase for S3, the data and conclusion might be different. These findings highlight the importance of employing multiple baseline trend-controlled methods and of implementing an intervention sufficiently in order to fully assess an intervention effect for each participant. We agree with Kratochwill et al. (2013) that multiple statistical analyses should be used to assess an intervention effect in SCED studies.

Note to Table 5. The results are generated from Tarlow's tool available at http://www.ktarlow.com/stats/tau, except for the lower panel of S3 (the residual graph adjusting for out-of-range predicted scores). Edited results are shown in bold. To perform the significance test of Step 2, users need to first enter data into Tarlow's tool, then click the button "Test for Baseline Trend". Tarlow's tool returns Kendall's τ in Equation (9) under "Baseline Trend" and presents the recommendation.
Our study also investigated the proper interpretation of each method's proposed effect size, assumptions, and ways in which several application issues can be resolved. Our findings are summarized in Table 1. The MPD and SLC methods require that data be at least interval-level measurements and the baseline trend be linear. The calculation of the baseline trend in MPD and SLC also assumes that the baseline trend can be determined solely by the first and last baseline scores. When applying these two methods, researchers need to attend to the equally spaced measurement issue. Unequally spaced measurements may be resolved by the MPD R and SLC R functions available from the supplemental materials. Although the web-based computing tools for MPD and SLC do not accept missing values, data graphs are generated to facilitate the interpretation of the results for datasets without missing scores.
The Tau-U AB-A and Tau c methods are both meaningful, even when there are outliers, because they are based on ranks of scores. The issue of out-of-range predicted scores is likely to be encountered with MPD1, MPD2, and Tau c. In this paper, we provided a solution that adjusts the out-of-range predicted scores to the nearest maximum or minimum. Manolov (2018) proposed another strategy, which permits the baseline trend to level off at a certain point during the intervention phase (see supplemental materials).
Each of the four methods reviewed and evaluated in this paper has its unique merits. Of the four methods, Tau c is the only index that allows for an inferential interpretation of the intervention effect. The other three methods yield effect sizes that should be interpreted descriptively, but not inferentially. It should be noted that a causal interpretation of Tau c, or of other effect size measures, requires a sound study design. Researchers are encouraged to follow the SCED design standards presented in What Works Clearinghouse (Institute of Education Sciences, 2017) when designing an intervention study. Furthermore, according to What Works Clearinghouse, six data features should be examined to determine the effectiveness of an intervention: level change, trend, variability, immediacy of the effect, overlap, and consistency of data in similar phases. To fully examine the six features of a participant's data, researchers can consult Chen et al. (2015) and Manolov (2018). Given the importance of SCED studies in establishing and confirming evidence-based intervention practices, we hope that this paper has empowered practitioners and researchers to appropriately employ methods to disentangle baseline trends from intervention effects.