Playing the long game: A multivariate multilevel non-linear growth curve model of long-term effects in a randomized trial of the Good Behavior Game

This cluster randomized controlled trial (RCT) examined the impact of the Good Behavior Game (GBG) on children's developmental trajectories of disruptive behavior, concentration problems, and prosocial behavior from middle childhood (ages 6–7 years) to early adolescence (ages 10–11 years). Seventy-seven schools in England were randomly assigned to intervention and control groups. Allocation was balanced by school size and the proportion of children eligible for free school meals. Children (N = 3084) ages 6–7 years at baseline were the target cohort. Outcome measures, assessed via the Teacher Observation of Child Adaptation Checklist, were taken prior to randomization (baseline – Time 1) and annually for the next 4 years (Time 2 to Time 5). During the 2-year main trial period (Time 1 to Time 3), teachers of this cohort in intervention schools implemented the GBG, whereas their counterparts in the control group continued their usual practice. A multivariate multilevel non-linear growth curve model indicated that the GBG reduced concentration problems over time. In addition, the model also revealed that the intervention improved prosocial behavior among at-risk children (e.g., those with elevated symptoms of conduct problems at Time 1, n = 485). No intervention effects were unequivocally found in relation to disruptive behavior. These findings are discussed in relation to the extant literature, strengths and limitations are noted, and practical and methodological implications are highlighted.


Introduction
The Good Behavior Game (GBG) is a school-based, universal behavior management intervention implemented by classroom teachers. The core components of the GBG consist of classroom rules, team membership, monitoring behavior, and the use of positive reinforcement. Children work in teams to win the GBG to access agreed-upon rewards. The GBG is played during a normal classroom activity for a specified time period during which the class teacher monitors contraventions to four rules: (a) we will work quietly, or at a noise level appropriate to a given activity; (b) we will be polite to others; (c) we will get out of our seats with permission; and (d) we will follow instructions. Teams with four or fewer rule breaks at the end of the game are deemed the winners and are rewarded. Over time, the game develops in terms of the frequency and duration of gameplay as well as the nature and timing of rewards (Donaldson & Wiskow, 2017). The theoretical bases of the GBG include behaviorism (i.e., contingency management and the replication of rewarded behavior), social learning theory (i.e., learning of appropriate behavior modeled by other team members) and life course/social field theory (i.e., promotion of adaptive processes to enable children to meet social task demands in the classroom; Kellam et al., 2011).
In the decades since its original development (Barrish et al., 1969), 14 randomized controlled trials (RCTs) of the GBG have been conducted of the program being implemented universally during a normal school day. Six of these were conducted in the United States (Dolan et al., 1993;Hansen et al., 2010;Ialongo et al., 2019Ialongo et al., , 1999Reid et al., 1999;Tolan et al., 2020); two each in the Netherlands (van Lier et al., 2004;Witvliet et al., 2009) and Canada (Dion et al., 2011;Jiang et al., 2018); and one each in Belgium (Leflot et al., 2010), Northern Ireland (O'Keeffe, 2019), Estonia (Streimann et al., 2020), and England (for a review, see Humphrey et al., 2021). The current paper reports on the findings from the RCT conducted in England. A fifteenth RCT based in the United States reports on the impact of the GBG in the context of an afterschool program . Collectively, these trials have provided robust evidence that the intervention leads to statistically significant changes in a range of salient outcomes, including conduct problems and peer relations, with intent-to-treat (ITT) effect sizes typically in the g = 0.10-0.20 range (Smith et al., 2021). This evidence is broadly consistent with intervention effect sizes reported in meta-analyses of a range of universal behavior management approaches (Korpershoek et al., 2016) and school-based preventive interventions more generally (Tanner-Smith, Durlak, & Marx, 2018). Furthermore, this evidence aligns with the view that the behavior of most children is typically not a cause for significant concern (Department for Education, 2012).
In this article we report findings from the first RCT of the GBG in England. The main project report (Humphrey et al., 2018) for this RCT documents the main ITT and subgroup moderator analysis findings in relation to reading scores and behavioral outcomes using point-in-time estimates focusing on the end of the intervention period (noted as Time 3 in Design below). In brief, Humphrey et al.'s ITT analysis found no evidence that the GBG improves children's reading or behavior. Furthermore, the subgroup moderator analysis revealed no significant differential gains for children eligible for free school meals or for boys at risk of developing conduct problems.
We build and extend on the above findings in several ways. First, we used growth curve models, as opposed to point-in-time estimates. Second, we examined long-term, as opposed to short-term, intervention effects. Third, these intervention effects were considered at both main and subgroup levels; in terms of the latter, we examined sex and conduct problems risk status, with effects examined separately and in combination, as opposed to solely the latter. In doing so, our intended contribution was to extend the knowledge base regarding the scope, specificity, and timing of intervention effects. Put another way, we go beyond what works by asking for which outcomes? fFor whom? and When?
Disruptive behaviors (e.g., talking out, getting out of seat, touching others, being disobedient or aggressive) were our primary focus because these behaviors are key proximal outcomes of the GBG (Chan et al., 2012) and are developmentally significant as they are predictive of adult anti-social behavior and related outcomes (e.g., arrest for a violent offence; Hubbard et al., 2006). However, the nature of the intervention also means that reductions in concentration problems (e.g., difficulties in paying attention, staying on task, resisting distractions) and improvements in prosocial behavior (e.g., compliance with rules, demonstrating empathy, social problemsolving) following exposure to the GBG are feasible, and indeed these too are developmentally important. For example, children with early attention difficulties are 40% less likely to graduate from high school (Rabiner et al., 2016).
We focused on intervention effects over the long-term rather than immediately following the conclusion of implementation because, from a theoretical perspective, preventive effects are hypothesized to take time to emerge, especially when a relatively small proportion of the population have, or are at risk of, developing problems in the first place (Hill et al., 2016). Thus, as Greenberg and Abenavoli (2017) noted, complete evaluation of universal interventions requires data collection over extended periods in order for changes among intervention recipients to consolidate, for small but key changes to snowball, and for the members of the control group to exhibit difficulties of the kind that are the focus of prevention efforts. Despite this, long-term follow-up is far from the norm. For example, in Durlak et al. (2011) seminal meta-analysis of universal social and emotional learning interventions, only 15% of studies collected follow-up data at least 6 months after a given intervention ended. Even in cases where long-term follow-up is implemented, studies are limited by a reliance on point-in-time estimates that do not analyze the developmental process of growth, although there are a couple of notable exceptions (e.g., Nix et al., 2016). Because a key purpose of universal interventions is to alter developmental trajectories, it is important that this is reflected in the analytical techniques adopted by researchers (Greenberg & Abenavoli, 2017).
In addition to examining growth over time, it is also important to recognize that the effects of universal interventions may vary across specific strata of the population. In the case of the GBG, there are theoretical and empirical reasons why we might expect effects to vary by sex and/or conduct problems risk status. Regarding sex, the intervention procedures are likely to appeal particularly to boys given the gendered socialization of competitiveness (Gneezy et al., 2009). In relation to conduct problems risk status, it stands to reason that those whose behavior is already a significant cause for concern would stand to gain the most benefit from the GBG, especially given its emphasis on adaptive socialization processes (e.g., alerting children to, and rewarding them for, meeting social task demands in the classroom; Kellam et al., 2011). Finally, on the basis of these two strands of inquiry, we might also reasonably expect to observe amplified intervention effects at their intersection (e.g., boys exhibiting elevated symptoms of conduct problems; Kellam et al., 1994). However, the nature and magnitude of such effects could feasibly vary by outcome domain, supporting the adoption of a multivariate approach reported herein. whereas three or more data points potentially allow for curves to be fitted.
In our specific case, we assumed that children's 4-year trajectories of disruptive behavior, concentration problems, and prosocial behavior are systematic since they follow a pattern over the time period (specified as a reasonably complex function of time), and they are variable because children have varying rates of change across time (growth). These trajectories are also group-specific and outcome-specific since they estimate differences between groups defined by individual characteristics (e.g., sex) or school characteristics (e.g., trial arm), in each of the three developmental outcome measures. Considering all of the above, the research question we sought to address was: What are the effects of the Good Behavior Game on the long-term trajectories of disruptive behavior, concentration problems, and prosocial behavior?

Design
A cluster-RCT design was used and is described in more detail elsewhere (Humphrey et al., 2018;trial registration: ISRCTN64152096). Seventy-seven schools were randomly allocated by an independent trial unit to deliver the GBG (intervention) or continue usual practice (control) for a period of two years. A minimization algorithm was used to ensure balance across trial arms with respect to school size and the proportion of children eligible for free school meals. Outcome data were collected at baseline (pre-  (Humphrey et al., 2018) represents the period of GBG implementation in the intervention arm of the trial and T3 to T5 represents a clean follow-up phase (Humphrey et al., 2021; i.e., none of the trial sample were exposed to the GBG during this period). Fig. 1 depicts the flow of participants through the study.
Ethical approval was granted by the authors' host institution (The University of Manchester, Ref: 15126). All schools signed a Memorandum of Agreement confirming their willingness to participate. Opt-out consent was sought from parents/caregivers, of whom 68 (2.2%) exercised their right to opt their children out of the trial. Finally, children were provided with information about the study (including their guarantee of anonymity and right to withdraw) and were asked to give their assent to participate; none declined assent or exercised their right to withdraw from the study.

Schools
Trial schools' composition mirrored that of primary schools in England in respect to size and the proportion of children speaking English as an additional language; however, trial schools contained significantly larger proportions of children with special educational needs and disabilities and children eligible for free school meals, which was used as a proxy to indicate lower income or socioeconomic disadvantage (Department for Work and Pensions, 2013), in addition to lower rates of absence and attainment. Intervention and control schools did not differ substantively on any of these characteristics (e.g., intervention schools had an average of 27.56% of children eligible for free school meals, whereas control schools had an average of 24.46%; Humphrey et al., 2018, p. 34).
Within intervention and control schools, 279 classroom teachers of children in the trial cohort participated during the main trial period (T1 to T3), of whom 139 were in control schools and 140 were in GBG treatment schools. Specific to different phases of the trial, 135 classroom teachers taught the trial cohort from T1 to T2 and 144 classroom teachers taught the trial cohort from T2 to T3. Teachers in GBG schools received 2 days of initial training and an additional single day of follow-up training midway through their first year of implementation (approximately 21 h total). This pattern was repeated in the second year of the trial as children moved up into new classes. On-going support for implementation was provided by trained GBG coaches employed by Mentor UK, who in turn were supported by staff at the American Institutes for Research.

Children
At baseline, 3084 children ages 6-7 years from participating schools comprised the trial cohort. No new joiners were allowed to enroll after T1. Those attending intervention and control schools did not differ significantly with respect to sex, free school meals eligibility (FSM), English as an additional language (EAL), or special educational needs and disabilities (SEND; Humphrey et al., 2018). Lost to follow-up numbers are reported in Fig. 1. Lost to follow-up refers to data on outcomes that were not possible to collect. These data are independent of whether a given school has continued or ceased implementation. For example, a school may have discontinued implementation but continued to provide outcome data. Furthermore, some schools may have not returned all surveys in all waves and some may have dropped out of one wave and returned in a subsequent one.
The at-risk subsample was comprised of 485 (15.7%) children scoring in the borderline or abnormal range (i.e., a score of 3 or more out of 10) on the teacher informant-report version of the Strength and Difficulties Questionnaire (SDQ; Goodman, 1997) at baseline. Two hundred seventy-two students attended intervention schools and 213 students attended control schools; 143 students were female and 342 students were male.

Disruptive behavior, concentration problems, and prosocial behavior
The Teacher Observation of Classroom Adaptation-Checklist (TOCA-C; Koth, Bradshaw, & Leaf, 2009) is a 21-item instrument that provides indices of children's disruptive behavior (nine items), concentration problems (seven items), and prosocial behavior (five items). Teachers read statements about a child (e.g., "pays attention") and endorse items on a 6-point scale (from never to always).
Evidence has suggested that data derived from the TOCA-C are internally consistent (i.e., all subscales α > 0.86) and have a factor structure that is invariant across gender, race, and age (Koth et al., 2009). Furthermore, Bradshaw, Waasdorp, and Leaf (2012) demonstrated that the data derived from the TOCA-C are sensitive to change in universal preventive interventions. Kourkounasiou and Skordilis (2014) provided further evidence of divergent validity, concurrent validity, and test-retest reliability. We also provide evidence of strict longitudinal measurement invariance in the current study.

Co-variates
Outcome data were supplemented by socio-economic and demographic information obtained from the National Pupil Database and a conduct problems risk status indicator derived from the teacher informant-report version of the SDQ (Goodman, 1997) that was administered at baseline (see Children above). School level data were drawn from the Department for Education website. Table 1 provides a summary description of the variables used in the current study.

GBG implementation and the counterfactual
Data derived from an online GBG scoreboard developed by the research team for use by participating teachers indicated that children in the intervention arm were exposed to the GBG for an average of 1066 min between T1 and T3 (SD = 719.5). In terms of frequency and duration, teachers played the game approximately twice a week between T1 and T2 and between once and twice a week between T2 and T3; the average game session length in both years was approximately 15 min. Nine GBG schools formally ceased implementation prior to T3, although their dosage data are included in the above estimates. Likewise, their data on outcomes, when provided, were also included in our analyses, hence they were not always "lost to follow-up".
Other dimensions of implementation were assessed via a structured observation schedule administered annually by a member of the research team. In relation to fidelity, a list of required steps outlined in the GBG manual (Ford et al., 2014) were scored on a binary yes/no scale. Fidelity was high (70%) in both years, indicating that teachers followed most of the prescribed procedures associated with the GBG. Quality was rated on a 5-item scale (each scored 0-2) and was also high (70%) in both years, indicating enthusiastic and engaging delivery. Almost all children in a given class were present when the game was played throughout the main trial period (>95% reach from attendance count). Participant responsiveness was rated on an 8-item scale (each scored 0-2) and indicated that children responded favorably (e.g., correcting their behavior following an infraction) when the game was being played (75% from T1 to T2; 69% from T2 to T3; Humphrey et al., 2018).
In terms of the counterfactual, surveys of usual practice in behavior management (derived from Reupert & Woodcock, 2010) among teachers in control schools revealed that 95% reported that they established and maintained a set of classroom rules, 90% reported communicating clear expectations about rules and children's responsibilities (e.g. through posters), 100% reported that they observed and monitored children's behavior in the classroom, 60% used prizes as rewards for good behavior daily or weekly, and 67% used group rewards daily or weekly (Humphrey et al., 2018). These data appear to indicate relatively low program differentiation; however, as a counterpoint we note that the idea of an untreated control group in the context of school-based preventive interventions has long been regarded as a fantasy (Durlak, 2015).

Analytical strategy 2.4.1. Multivariate multilevel growth curve modeling
Growth curve models are a type of statistical model for repeated measures in which occasions (or time points) are nested within individuals. Under a multilevel modeling framework, the total variance of the outcome of interest is split into variance between occasions (within individuals) and variance between individuals. Time is modeled explicitly in the model as a lower-level (occasions) Note. FSM = eligible for free school meals.
covariate and its slope is allowed to vary randomly across the higher-level units (individuals); in other words, a growth curve model is a random slopes model. Time can also be treated flexibly to allow for non-linear effects. The multilevel specification of a growth curve model also has the advantage of allowing for further higher levels and multivariate outcomes. The approach presented here is largely equivalent to a latent growth curve modeling approach, with the notable exception that the multilevel specification fits a single (invariant) within-individual variance, which is a reasonable assumption when time is treated flexibly (Goldstein, 2011). This model also assumes uncorrelated within-individual residuals, which is reasonable when the repeated observations are not very close together in time (Goldstein, 2011;Hox et al., 2017). The model we fit in this paper was a multivariate non-linear growth curve model, the algebraic form of which is provided in Appendix A. The intercept was allowed to vary randomly across children and schools, which allowed us to examine the split of the total variance across time, children, and schools. This model treats time as a cubic polynomial term and allows the slope of the linear term for time (i.e., growth rate) to vary randomly across children. The model was estimated by using the Markov Chain Monte Carlo (MCMC) algorithm as implemented in MLwiN (Rasbash et al., 2020), which we called from within R (R Core Team, 2019) using the package "R2MLwiN" (Zhang et al., 2016). The MCMC chains use the coefficient estimates from the Iterative Generalized Least Squares (IGLS) algorithm as starting values. Overall model fit was evaluated by using the Deviance Information Criterion (DIC; Spiegelhalter et al., 2002). Details of the code can be found in Appendix B.
The multivariate multilevel specification allowed us to measure the effect of the GBG simultaneously on our three outcomes while controlling for the correlation between them (which helps to avoid confounding) and a set of socio-economic, demographic, behavioral, and school-level baseline characteristics. As noted by Troncoso (2019), this specification also allowed us to test cross-outcome hypotheses (e.g., whether the effects of a particular covariate differs across outcomes). A further advantage is that the variancecovariance matrix is efficiently estimated even in the presence of missing data (Goldstein, 2011), rendering it equivalent to full information maximum likelihood (FIML) because it uses all the available information and results are therefore unbiased under the assumption of data Missing at Random (MAR). Our full model preserved 2938 children (95.27%) in 77 schools (100%); time-specific missing values for the outcomes by trial arm are reported in Fig. 1. We fitted a multilevel model for missingness in the full model, which failed to find enough evidence that our main covariates predicted missingness. Details of this procedure are provided in Appendix C.
Even though this is a trial, effect sizes cannot be reported as there is no exact equivalent in the Bayesian framework. Instead, we report standardized coefficients, as well as the Bayes Factor (BF) and Posterior Model Probabilities (PMP) for the GBG effect estimate over time, following Moerbeek's (2019) "informative hypothesis evaluation" approach.

Longitudinal measurement invariance
Item-level TOCA-C analyses were performed to determine the longitudinal invariance of the three subscales, following Grimm et al. (2017) approach. We used the R package "lavaan" (Rosseel, 2012) to run confirmatory factor analyses. Details of the procedure, code, and outputs are available on request. Concisely, the procedure involved replicating the factor structure over the five time points for each subscale of the TOCA-C and fitting increasingly constrained models to determine whether the fit of the measurement model significantly worsened. The first fitted model, termed "configural", was an unconstrained model with factor loadings, observed variables means, variances, and covariances estimated freely. The configural model (Measurement Model 1) was used as the baseline comparison for three other models. The weak invariance model (Measurement Model 2) constrained the factor loadings across time points to equality and the strong invariance model (Measurement Model 3) constrained the means of the observed variables to equality across time while keeping the constraints of Measurement Model 2. Finally, the strict invariance model (Measurement Model 4) placed additional constraints by setting residual variances to be equal across time. A comparison of the model fit of these models for all outcomes can be seen in Table 2.
Given the oversensitivity of the Chi-squared test for large samples, we judged the goodness of fit by using the Comparative Fit Index (CFI), Tucker Lewis Index (TLI), Root Mean Square Error Approximation (RMSEA), and Standardized Root Mean Square Residual (SRMR) measures. The freely estimated measurement models (i.e., configural) displayed better goodness of fit measures than the constrained models (i.e., weak, strong, and strict) for all outcomes. Nevertheless, strict invariance models for all outcomes  (Hu & Bentler, 1999). The largest departure from this was the disruptive behavior subscale, which yielded the lowest CFI and TLI values (0.922 and 0.919, respectively); however, RMSEA and SRMR values were still below conventional thresholds (i.e., < 0.06 and < 0.08, respectively; Hu & Bentler, 1999), as was also the case for concentration problems and prosocial behavior. In conclusion, the assumption of strict invariance was justifiable since the models for all three outcomes showed adequate fit. This allowed us to use the TOCA-C composite scores because strict invariance avoids confounding change in scores over time with changes in reliability of the items (Newsom, 2015). Table 3 displays the means and standard deviations over time by trial arm. The means for concentration problems and prosocial behavior tended to decrease each year, with standard deviations remaining relatively stable. The opposite occurred with disruptive behavior as observed means tended to increase with time. Table 3 provides indications of small differences by trial arm. Children in GBG schools tended to have lower mean concentration problems (2.352) and disruptive behavior (1.732) than those in control schools (2.392 and 1.74, respectively) by the end of the period of study, having started off at marginally higher means (2.602 and 1.709 in GBG, as opposed to 2.548 and 1.612 in control). The opposite trend occurred for prosocial behavior where children in GBG schools began with slightly lower means (4.893) than the control group (4.946) but tended to surpass them by T5 (4.916 in GBG and 4.842 in control). However, these time trends are only descriptive and they most likely obscure underlying patterns of variation that we unveil in the following sections.

Variation across outcomes and levels
We fitted an unconditional means model (i.e., empty multivariate multilevel model) to assess the weight of the contribution of each level to the total variation in concentration problems, disruptive behavior, and prosocial behavior. This is a model in which no explanatory variables are added as it is only used to decompose the total variance into variance within children (Level 1), between children (Level 2), and between schools (Level 3). Although no explanatory variables are included, the model does control for the correlation between outcomes; hence, the means and variances presented in Table 4 are more accurate than the observed mean trajectories presented in Table 3.
The correlations between outcomes display notable differences across levels. Concentration problems and disruptive behavior were strongly and positively associated at the level of children and schools; however, they were more weakly associated at the within-child level. This may indicate that time-invariant factors (i.e., invariant child characteristics) played a more important role in the relationship between these outcomes, especially concentration problems and disruptive behavior. This is further supported by the variance composition presented in Table 5.
It is worth noting that the variation in prosocial behavior was greater within children than between children (55.12% and 38.05%, respectively), as opposed to the variation in concentration problems (39.88% and 56.23%, respectively) and disruptive behavior (38.81% and 55.72%, respectively). Even though differences between children comprised a large proportion of the total variation in prosocial behavior, it appears as if this developmental process was more markedly an individual process over time. Judging by the proportion of variance that was attributable to differences between schools, prosocial behavior (6.82%) appeared to be slightly more driven by the wider school context than concentration problems (3.88%) and disruptive behavior (5.47%). Fig. 2 shows quite distinct average temporal trajectories for all three outcomes by trial arm. Fig. 2A indicates that concentration problems tended to decrease over time, but they did so more rapidly for children in GBG schools. Fig. 2B highlights that disruptive behavior increased over time with those in the intervention group apparently making shallower increases and a lower expected average by the end of the study period. Fig. 2C reveals a downward trajectory in prosocial behavior; however, those in GBG schools    Fig. 2. Predicted standardized scores for concentration problems, disruptive behavior and prosocial behavior by trial arm. P. Troncoso and N. Humphrey remained roughly at the same average they started at and were apparently higher than their counterparts in usual practice schools.

A conditional growth model for the effect of the GBG on children's trajectories of disruptive behavior, concentration problems, and prosocial behavior
To assess these trends more robustly, we began by fitting an intermediate unconditional model and found that time was a relevant factor in all three processes before controlling for covariates, although some differences were noted. Concentration problems and prosocial behavior exhibited strong evidence of non-linear (cubic) trends; however, the former increased over time and the latter decreased over time. Disruptive behavior also displayed an increase over time; however, there was not enough evidence to assert it was a non-linear trend. We kept these non-linear terms for all outcomes in the subsequent models as they do display better overall fit (assessed via deviance information criterion comparison) than other intermediate models without them. Full details of these intermediate models are available on request. In Table 6, the fixed part of the full, final model is displayed (the random part of this model is available in Appendix D). Children in GBG schools demonstrated a notable decrease in concentration problems over time (posterior mean = − 0.151; SD = 0.061; 95% credible intervals = − 0.270, − 0.033); each year, those in the intervention arm of the trial displayed a mean linear decrease of 0.15 SD with respect to the previous year, as compared to their counterparts in control schools, although this effect changed across the study period. This trend is more easily interpreted by inspecting Fig. 2A.
In Table 7, we present the Bayes Factor (BF) estimates and their corresponding posterior model probabilities (PMP). For all outcomes, the informative hypotheses are split into two options: (a) outcomes for GBG participants were better than the outcomes of the control group (H1) and (b) outcomes for GBG participants were equal to or worse than the outcomes of the control group (H2). In the case of concentration problems and disruptive behavior, if the GBG produces better outcomes (H1), then we would expect the difference between the mean of the control group and the trial arm to be positive (H1: Control -GBG > 0), which would indicate that GBG participants had less concentration problems and disruptive behavior than the control group. Conversely, for prosocial behavior, our informative hypothesis was that if GBG produces better outcomes, then the difference between the mean of the control group and the trial arm was expected to be negative (H1: Control -GBG < 0). The Bayes Factor estimates for informative hypothesis H1 is BF 1 and consequently the Bayes Factor estimates for informative hypothesis H2 (the complement of H1) is BF 2 . Their corresponding posterior model probabilities (PMP 1 and PMP 2 ) indicate the estimated probability of the hypothesis being true, given the observed data. An additional indication of the strength of the evidence is given by the Bayes Factor estimate of H1 over H2 (BF 1,2 ), which provides a measure of how likely H1 is in comparison with H2. If BF 1,2 is over 1, then H1 is more likely than H2; if it is less than 1, then H1 is less likely than H2 and if it is 1, both hypotheses are equally likely. Following van Doorn et al. (2020), BF 1,2 estimates between 1 and 3 would be considered "weak" evidence in favor of H1 over H2; values between 3 and 10 would be considered "moderate" evidence; and "strong" evidence would be backed up by estimates over 10. Nevertheless, Moerbeek (2019) cautioned against the stringent use of interpretative rules for Bayes Factor estimates and so these are used only as broad guides here. Table 7 shows higher posterior model probabilities supporting the informative hypothesis that GBG children had less concentration problems (H1) from T2 onwards, with moderate evidence (BF 1,2 > 5) of better GBG outcomes at T2 and T5. We did not find reliable evidence of an intervention effect on trajectories of disruptive behavior or prosocial behavior, with 95% credible intervals of the Table 6 Fixed-effects parameters of the full multivariate multilevel non-linear growth curve model for concentration problems, prosocial behavior, and disruptive behavior (standardized).  (2019). Trajectories mix well with approximately normallydistributed posteriors; however, they are not presented here as they exceed the scope of this paper. Full details are available on request.
posterior mean crossing zero for both outcomes (Table 6) and only weak evidence in favor of H1 over H2 arising from the BF estimates (BF 1,2 < 3) in Table 7. Nevertheless, Table 7 also shows strong evidence (BF 1,2 = 13.52) supporting the hypotheses of better outcomes for GBG pupils in prosocial behavior at T5, with a notably higher posterior model probability (PMP 1 = 0.931) than the alternative (PMP 2 = 0.069). Subgroup effects were modeled as main effects rather than in interaction with time in the interests of parsimony (e.g., doing so would require seven additional interaction terms to be introduced to an already complex model) and interpretation of other findings (e. g., doing so would affect the meaning of the main time * GBG interaction such that it would refer to females not at risk in the intervention arm, which was not our intention). At-risk children (e.g., those with elevated symptoms of conduct problems at T1) in the intervention arm (n = 272) of the trial recorded a noteworthy higher posterior mean for prosocial behavior when compared to their counterparts in the control group (n = 213) with a posterior mean estimate of 0.377 (SD = 0.115; 95% credible intervals = 0.152, 0.602), which implies that they are expected to have a prosocial score nearly two fifths of a standard deviation higher than those in the control group across all time points. This finding provides robust evidence that the GBG contributed to improved prosocial behavior outcomes among at-risk children. However, we did not find equivalent effects for disruptive behavior or concentration problems, with credible intervals of the posterior mean crossing zero for both outcomes. Finally, we found no reliable evidence of differential intervention effects for males (control n = 837; GBG n = 786) or indeed at-risk males (control n = 162; GBG n = 180) for any of our three outcomes.

Discussion
In our main report on the first trial of the GBG in England (Humphrey et al., 2018), we examined its effects on reading and behavioral outcomes in the short-term using point-in-time estimates, finding no evidence of its efficacy. The current study built upon and extended this work by using growth curve models as opposed to point-in-time estimates to examine long-term, as opposed to shortterm, main and subgroup (sex and conduct problems risk status, with effects examined separately and in combination, as opposed to solely the latter) effects of the intervention on children's behavioral outcomes. We fitted a multivariate multilevel growth curve model that examined trajectories that spanned two years of implementation and two additional years of follow-up. Our analyses revealed that the intervention altered trajectories of concentration problems, with those exposed to the GBG experiencing a mean linear decrease of 0.151 SD with respect to the previous year (and strong support for better outcomes at T5), relative to their counterparts in control schools (although this effect changes through time, as evidenced by the polynomial terms for time). However, we did not find reliable evidence of an intervention effect on trajectories of disruptive behavior or prosocial behavior, with the notable exception of higher posterior probabilities supporting better outcomes for GBG at T5. In terms of subgroup moderator effects, we found robust evidence that the intervention benefited those children with elevated symptoms of conduct problems at baseline, with a 0.377 SD difference between children in GBG and control schools across time. However, we did not find equivalent effects for disruptive behavior or concentration problems. Finally, we found no reliable evidence of differential intervention effects for males, or indeed at-risk males, for any of our three outcomes.
Our model provided robust evidence that the GBG influences the trajectory of children's concentration problems over time. The direction of this effect is consistent with both the theorized effects of the intervention (Chan et al., 2012) and developmental trends in children's capacity to pay attention, stay on task, and resist distractions during the elementary school years (e.g., Best et al., 2009). Thus, although we observed a general trend indicative of reductions in concentration problems over four years, a notably sharper Table 7 Bayes factors and posterior model probabilities for the difference in outcomes between children in the trial and the control group over time. decrease was evident among children in the intervention arm of the trial who were exposed to the GBG for the first two years of this period (see Fig. 2A). These results align with those of Leflot et al. (2010) who identified a positive effect of the GBG on children's ontask behavior and also align with van Lier et al. (2004) who demonstrated that the intervention impacted developmental trajectories of attentional difficulties. Our findings extend the work of these authors in important ways. In the case of the former, we show the impact on developmental trajectories as opposed to point-in-time estimates. In the case of the latter, although these authors did focus on developmental trajectories, their data points only covered the period of GBG implementation, as opposed to the longer-term follow-up period included here. The other notable intervention effect identified in our model related to improvements in prosocial behavior among at-risk children. This finding mirrors that of Jiang et al. (2018) who identified a subgroup moderator effect on high-risk students' prosocial behavior in their GBG trial, albeit with similar distinctions to those noted above (e.g., point-in-time estimation and time period), alongside different criteria for at-risk status (which was defined using the borderline/abnormal range scores for prosocial behavior as opposed to conduct problems). This effect is particularly noteworthy given the general downward trend in prosocial behavior observed in the sample over the course of the study (see Fig. 2C). It is also interesting to consider in the context of the other (null) outcomes for this subgroup; it might suggest, for example, that prosocial behavior is more malleable and amenable to a relatively low intensity intervention than disruptive behavior.
Our analyses revealed no unequivocal trial effects in relation to disruptive behavior. One possible explanation is that teachers implementing the GBG were much more sensitized to the types of behaviors (e.g., talking out, getting out of seat, touching others, being disobedient or aggressive) being assessed in this particular subscale of the TOCA-C given their central focus in the initial training and subsequent delivery (e.g., infractions associated with the four classroom rules related to acts of disruption; by contrast, there is much less direct focus on concentration and prosocial behavior). In tentative support of this proposition, Fig. 2B illustrates that at T1, which is when all teachers who completed ratings were naïve to the GBG, differences between intervention and control arms were negligible. However, ratings at T2 and T3, which is when teachers in the intervention arm were trained and implementing the GBG, their ratings indicated higher rates of disruptive behavior. In the follow-up period (i.e., T4 and T5), teachers providing ratings of children's behaviors were again naïve to the GBG, and it is in this period that we see the disruptive behavior trajectories of the two trial groups cross over, with relative stability in the GBG arm contrasted with increased disruptive behavior among children in the control arm. Future research could address this issue using independent observational data rather than teacher ratings.

Strengths and limitations
The current study benefits from several strengths that contribute to the security of the findings reported herein. We used a clusterrandomized design and adopted an analytical technique that took account of the hierarchical and clustered nature of the dataset, thus allowing us to model the long-term developmental trajectories of our three outcome variables. This multivariate multilevel non-linear growth curve model is particularly well-suited to researching the effects of universal preventive interventions, whose effects may take time to emerge (Greenberg & Abenavoli, 2017). The trial was large and well powered. Attrition was within acceptable limits and missing data were handled using all available information (see Method section). Balance on observables in the analysis was good, with negligible differences between trial arms for children's outcomes at baseline. The use of cluster randomization minimized the possibility of contamination and the allocation procedure itself was conducted independently of the evaluation team by a trial unit. The outcome measure used (i.e., TOCA-C) is psychometrically robust (Koth et al., 2009) and has been demonstrated to be sensitive to change in previous trials of universal preventive interventions (e.g., Bradshaw et al., 2012).
However, there are also several limitations that should be acknowledged. First, out of necessity, teachers providing ratings via the TOCA-C were not blinded to trial allocation status. This introduces the possibility the effects identified are the result of biased ratings. However, this seems quite unlikely given the relative preponderance of null findings (e.g., for disruptive behavior). Furthermore, it should be noted that ratings on only two of the five data points (i.e., T2, T3) were provided by teachers directly involved in implementation of the GBG. Second, we did not assess compliance effects. In the context of this GBG trial, we understand compliance as surpassing a given threshold of dosage due to the low variability in fidelity (e.g., the median or the 75th percentile). As noted earlier, nine of the 38 schools in the trial arm ceased implementation prior to the end of the main trial period (T3). Thus, although our study conforms to the 'analyze as you randomize' principle that is designed to provide an unbiased estimate (Gupta, 2011), we cannot rule out the possibility that the presence and magnitude of intervention effects reported here is attenuated by lack of intervention compliance (e.g., dosage variability) among some schools. In other words, we may have observed more and larger intervention effects had the game been played with greater frequency and/or had several schools not discontinued implementation prior to T3. Nevertheless, given that dosage is only observed in the intervention arm, the statistical approach to address this in combination with longitudinal outcome trajectories requires substantial work exceeding the scope of this study. Finally, although the overall trial sample size was very large, the same cannot not be said for our subsample of at-risk boys (11% of overall sample; n = 342). Thus, it may well be that the impact of the GBG was too small to be ascertained in this subgroup for this specific sample due to loss of statistical power. In other words, assuming constant sample size, a larger difference or smaller within-group variation would have had to be recorded.

Implications
This study provides robust evidence that the GBG can be used as an efficacious means through which to alter developmental trajectories of children's concentration problems. The ability to pay attention, stay on task, and resist distractions are key social task demands in the classroom context that yield powerful concurrent and prospective benefits in learning and attainment throughout childhood and adolescence (Breslau et al., 2010). Hence, even the relatively small gains (as judged by conventional effect size standards, although we note the growing resistance to reflexive reference to these; e.g., Tanner-Smith et al., 2018) over time evidenced here can be considered practically significant. This is particularly the case given the relatively low input required to bring about these gains; recall that teachers spent an average of only 30 min per week playing the GBG, and this was further reduced in the second year of implementation. It should also be noted that because the game itself is designed to be played during a normal classroom activity, it produces relatively little displacement of academic curriculum time (Ford et al., 2014).
The observed improvements in prosocial behavior among children in GBG schools considered to be at risk by virtue of their emergent conduct problems is also significant from a practical perspective. Life-course research indicates that childhood conduct problems, particularly among boys, are associated with a two-to-threefold increase in early adulthood public sector costs (mainly via the criminal justice system) and significantly higher rates of unemployment (D'Amico et al., 2014;Knapp et al., 2011). Middle childhood is a particularly critical window for intervention in this regard. Notwithstanding the lack of impact on disruptive behavior for this subgroup, which may be more resistant to low intensity intervention and are likely driven by factors beyond the reach of schools when deeply entrenched (Moffitt & Scott, 2008), the moderate observed improvements in behavorial indicators of empathy, compassion, and other facets of prosocial behavior yield great promise given their primacy in terms of children's social adaptational status.
From a methodological standpoint, the current study demonstrates the value and utility of applying growth curve modeling to research the effects of universal preventive interventions. Furthermore, our findings align directly with Greenberg and Abenavoli's (2017) argument that trials of such interventions should always be designed with longer-term follow-up built in from the outset. Given the inevitable variability in implementation, however, an interesting avenue for future research in this space is to develop and make use of fused approaches in which the relationship between intervention compliance and outcome trajectories are modeled. Accordingly, at the time of writing, we are working on ways to implement a statistical approach that combines complier average causal effect estimation and growth curve modeling.

Conclusion
This study has demonstrated the impact of the Good Behavior Game on children's developmental trajectories of concentration problems, in addition to resulting in notable improvements in prosocial behavior among those with elevated conduct problems. In doing so, it has also highlighted the value and utility of growth curve modeling of intervention effects and including data points that extend well beyond the conclusion of a given period of implementation. Thus, in several ways we have shown that playing the "long game" may come with benefits.

Funding
This work was supported by the Education Endowment Foundation (no grant number), the National Institute for Health Research (grant number 14/52/38), and the Economic and Social Research Council (grant number ES/V011243/1). The corresponding author is also affiliated to the Scottish Centre for Administrative Data Research (SCADR), which is part of the Administrative Data Research UK (ADR UK) partnership, funded by the Economic and Social Research Council.

Declarations of interest
None.

Appendix B. Exemplar code to fit the models
This is an example of how the models presented here can be fitted in the R package "R2MLwiN". The model in Appendix A would be fitted as follows: Model_A <-runMLwiN(c(concentration, disruptive, prosocial) ~ 1 + time + I(time^2) + I(time^3) + gbg + gbg*time + I(time^2)*gbg + I(time^3)*gbg + (1 | sid) + (1 + time | pid) + (1 | waveid), D=c("Multivariate Normal"), data = data, estoptions = list(EstM = 1)) "Model_A" is a multivariate multilevel unconditional linear growth curve model fitted via the default settings of the MCMC algorithm as implemented in MLwiN. As mentioned before, the model in Appendix A has been kept deliberately simpler than the actual full model to improve readability. The full model presented in this article is fitted as follows: This code presupposes the installation of R, MLwiN and the R package "R2MLwiN". Indentation is not required. Categorical variables ("fsm", "gbg", "male" and "risk") are coded as binary.

Appendix C. Multilevel model for missingness in the full model
We fitted a model to explore the factors associated with missingness in the full model. The outcome variable is a binary variable to indicate whether the observation (child) is missing from the full fitted model ( Table 6). The covariates of the full model were used only as main effects. Table C.1 shows there is not enough evidence to support the hypothesis that any covariate is associated with missingness in the final model (confidence intervals overlap with zero). A caveat to these results is that the at-risk indicator is missing for 114 out of 147 children who were missing from the final model. A full information maximum likelihood (FIML) specification for the missingness model was attempted, but led to estimation issues, hence the model presented in Table C.1 is for complete cases of the atrisk indicator. All other observations are included (N = 2970; 96.3% of the full sample). Given the above, the data are unlikely to be missing completely at random (MCAR) and hence we assume them to be missing at random (MAR).