Effects of school-based physical activity interventions on mental health in adolescents: The School in Motion cluster randomized controlled trial

Purpose: To investigate the effects of two school-based physical activity interventions on mental health in Nor-


Introduction
The leading causes of disability among children and adolescents worldwide are mental and substance use disorders (Erskine et al., 2015). In Norway, 10% of boys and 28% of girls graduating from lower secondary school report mental health problems (Bakken, 2019). Furthermore, from 2014 to 2019, the proportion of Norwegian adolescent boys and girls reporting mental health problems increased from 8% to 10% and from 21% to 27%, respectively (Bakken, 2015, 2019). Similar increases in adolescent mental health problems have also been found internationally (Collishaw, 2015). For adolescents, mental disorders such as depression have been associated with poor academic achievement, low school attendance and alcohol and drug use (Fröjd et al., 2008; Glied & Pine, 2002). Adolescents who experience mental disorders are also more likely to develop similar or more severe conditions as adults (Kessler et al., 2007). Adolescent mental health therefore also represents an economic challenge, as mental disorders topped the list of the costliest conditions in Norway in 2013 (Kinge, Saelensminde, Dieleman, Vollset, & Norheim, 2017). The above-mentioned studies provide a solid foundation for establishing feasible methods to prevent these problems.
Physical activity (PA) as a treatment against mental health problems in clinical adolescent populations has been researched extensively, and two recent reviews concluded that a large variety of PA interventions and moderate to high intensity aerobic exercise were likely to have a positive effect on depression (Bailey, Hetrick, Rosenbaum, Purcell, & Parker, 2017;Biddle, Ciaccioni, Thomas, & Vergeer, 2019). Studies examining the potential effect of PA on internalizing and externalizing problems in generally healthy adolescent community populations show mixed or weak results, indicating a potential ceiling effect due to a smaller potential for improvement (Spruit, Assink, van Vugt, van der Put, & Stams, 2016). However, the data is inconclusive and further research in community populations (Biddle et al., 2019), for instance, among adolescent school students, is warranted. Schools are optimal arenas for intervening, as researchers can reach students equally across sex, SES and ethnicity. The World Health Organization (WHO) has recommended that schools take part in promoting PA, for the purpose of raising children's and adolescents' PA levels and improving health (WHO, 2010). Since students spend a significant portion of the day at school, there is an opportunity to facilitate increased PA during school hours (Hills, Dengel, & Lubans, 2015). Potentially, this would make an impact on Norwegian 15-year-olds' PA levels, as only half of this demographic meet the recommended 60 min of daily moderate to vigorous PA (Dalene et al., 2018). Additionally, concerns that cognition or academic achievement would be negatively affected by increasing school-based PA at the expense of theoretical subjects are not supported by research (Donnelly et al., 2016;Singh et al., 2019).
The few studies examining how school-based PA interventions affect mental health in children and adolescents show mixed results. Bonhauser et al. (2005), Casey et al. (2014) and Lubans, Smith, et al. (2016) examined low-SES populations and showed, respectively, a decrease in anxiety and an increase in self-esteem, improvements in health-related quality of life for adolescent girls, and improved well-being for boys. Christiansen et al. (2018), Smith et al. (2018) and Eather, Morgan, and Lubans (2016) examined generally healthy community populations and found no overall effect. However, Smith et al. (2018) found a tendency toward an effect on self-esteem in the overweight/obese subgroup. Eather et al. (2016) found that the subgroup with the most psychological difficulties at baseline showed beneficial effects on self-esteem, perceived body fat, perceived appearance and physical self-concept. Lastly, Christiansen et al. (2018) found improvements in self-worth for students who did not participate in leisure-time sports. These findings suggest that school-based PA interventions are unlikely to elicit detectable effects on mental health outcomes in generally healthy adolescent populations. However, the findings also suggest that certain subgroups may benefit more from a school-based PA intervention than the average population. This substantiates findings of Cerin (2010), which indicated that PA effects can be heterogeneous, and therefore warrants investigation of relevant subgroups. In this context, relevant subgroups have been shown to display lower PA levels and poorer mental health than the population outside of the subgroups.
This is the case among immigrants (Abebe, Lien, & Hjelde, 2014;Sagatun, Kolle, Anderssen, Thoresen, & Søgaard, 2008;Singh, Yu, Siahpush, & Kogan, 2008), low socioeconomic status (SES) populations (Bøe, Øverland, Lundervold, & Hysing, 2012;Heelan et al., 2010), girls (Bakken, 2019;Dalene et al., 2018) and poor mental health populations (Pinto Pereira, Geoffroy, & Power, 2014). Although school-based PA interventions have been shown to be effective on mental health in low-SES groups (Casey et al., 2014;Lubans, Smith, et al., 2016) and poor mental health groups (Eather et al., 2016), the knowledge base is limited. Furthermore, PA has been shown to be beneficial for mental health among immigrants (Siddiqui, Lindblad, & Bennet, 2014); however, to the authors' knowledge, no studies have examined the association between PA and mental health among immigrant adolescents. Although the use of subgroup analyses is debated (Sun et al., 2012;Wang & Ware, 2013), Biddle et al. (2019) recommended that future research should focus on the potentially different effects between sexes and between those with different mental health conditions.
The primary aim of the present paper was to assess the effect of two school-based PA interventions on adolescents' mental health. We hypothesized heterogeneous effects, so a secondary aim was to analyze subgroups. A third aim was to examine the subscales of the mental health instrument to establish which aspects of mental health a potential effect could be attributed to.

Design and participants
This paper presents data from the School in Motion study. Briefly, School in Motion was a multicenter study, designed as a cluster randomized controlled trial (RCT) involving four test centers in Norway. The primary aim of the study was to assess whether two different school-based PA interventions affected PA levels. The secondary aims were to assess the effects on physical fitness, mental health, learning environment and academic achievement. Thirty lower secondary schools accepted the invitation to participate and were randomized into three groups: two intervention groups and one control group. A neutral third party was responsible for the randomization process, after which one of the schools withdrew from the study. Students attending ninth grade during the intervention period were invited to participate, and we obtained informed parental consent from 76% of the eligible students (n = 2084). The intervention period was 29 weeks. The study is registered at ClinicalTrials.gov (ID: NCT03817047). Fig. 1 shows the participant flow from enrollment to post-testing.

Interventions
Intervention model 1 (M1), named "Active learning", consisted of weekly physically active academic lessons (30 min/week), PA not connected to a curriculum (30 min/week), and one additional physical education (PE) lesson (45-60 min/week). The purpose of M1 was to increase PA levels and assess the feasibility of incorporating PA into theoretical subjects in lower secondary school. We encouraged the schools to incorporate PA into all theoretical subjects, but they were ultimately in charge of which subjects they wanted to be included in the project. The intervention draws on three theoretical perspectives: Physical Literacy (Whitehead, 2010), Self-efficacy (Bandura, 1982) and Basic Psychological Needs Theory (Ryan & Deci, 2002). In short, the intervention is theorized to increase students' motivation for PA, to let students' physical learning influence other types of learning and to increase self-efficacy. In turn, increased self-efficacy is associated with mental health outcomes, such as anxiety and depression (Muris, 2002). Intervention model 2 (M2), named "Don't worry, be happy", consisted of one additional PE lesson (45-60 min/week) and one additional PA lesson (45-60 min/week). The purpose of M2 was not only the PA dose itself, but also to encourage students to pursue activities of their own interest in groups they formed themselves. They were allowed to choose what they wanted to do and small "activity groups" were formed based on students' choices. The intervention draws on three theoretical perspectives: Positive Youth Development, Relational Developmental Systems (Lerner, Hershberg, Hillard, & Johnson, 2015) and Positive Movement Experiences (Agans, Sävfenbom, Davis, Bowers, & Lerner, 2013). In short, the intervention is theorized to help students develop social relationships and experience positive emotions while participating in activities that are meaningful to them.
In turn, the intervention is thought to increase participants' motivation for PA and influence mental health through a psychosocial mechanism. Although activities in M2 were mostly student-led, both interventions were formally delivered by the teachers. Extended descriptions of the design and implementation of the M2 intervention have been published elsewhere (Åvitsland, Ohna, et al., 2020).
Teachers from both intervention groups participated in a workshop where they received instructions and training. In addition, halfway through the intervention period, another workshop was organized, in which teachers from different schools could discuss their progress and solutions with each other and the researchers. Considering the design of M2, teachers providing this intervention did not require extensive training. The teachers in M1, however, had access to an online "tool-kit" containing various suggestions for physically active academic lessons. Throughout the intervention period, researchers from each test center kept in touch regularly with schools via emails and visited the schools at least twice per semester.

Mental health
To measure mental health, the participants completed the Strengths and Difficulties Questionnaire (SDQ; Goodman, 1997) at baseline (T1, May-August 2017) and at follow-up (T2, April-June 2018). The SDQ is a self-report questionnaire consisting of 25 items, divided into five subscales, each containing five items. The main outcome variable is psychological difficulties, expressed by the total difficulties score (TDS), which is made up of four subscales: 1) emotional problems (worry, unhappy, nervous, scared), 2) conduct problems (temper tantrums, fights, lies, steals), 3) hyperactivity (restless, fidgety, distracted) and 4) peer problems (solitary, not liked, bullied). Each subscale is scored from 0 to 10, and TDS ranges from 0 to 40. A higher score signifies greater psychological difficulties. In large populations, TDS can detect changes in psychopathology on each point of the scale, and therefore it can be seen as a dimensional measure that can indicate a general mental health state in children and adolescents (Goodman & Goodman, 2009). The TDS scale can be divided into three levels, in order to identify the risk of mental disorders: "normal" (0-15), "borderline" (16-19) and "abnormal" (20-40), which is characterized as being at risk of developing mental disorders (Goodman, Ford, Simmons, Gatward, & Meltzer, 2000). The psychometric properties of the SDQ have been validated in many countries, including Norway (Van Roy, Veenstra, & Clench-Aas, 2008).
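As an illustration of the scoring described above, the following minimal Python sketch sums four subscale scores into a TDS and assigns the band quoted in the text. It assumes that integer subscale scores have already been computed; it does not reproduce the official item-level SDQ scoring syntax, and the function names are hypothetical.

```python
# Band limits on the total difficulties score (TDS), as quoted in the text.
BANDS = (("normal", 0, 15), ("borderline", 16, 19), ("abnormal", 20, 40))


def total_difficulties(emotional, conduct, hyperactivity, peer):
    """Sum the four difficulty subscales (each 0-10) into a TDS (0-40)."""
    subscales = (emotional, conduct, hyperactivity, peer)
    if any(not 0 <= score <= 10 for score in subscales):
        raise ValueError("each subscale score must be between 0 and 10")
    return sum(subscales)


def tds_band(tds):
    """Return the band name for an integer TDS."""
    for name, low, high in BANDS:
        if low <= tds <= high:
            return name
    raise ValueError("TDS must be an integer between 0 and 40")


if __name__ == "__main__":
    tds = total_difficulties(emotional=6, conduct=4, hyperactivity=7, peer=3)
    print(tds, tds_band(tds))
```

The prosocial subscale is left out on purpose, since the text states that TDS is built from the four difficulty subscales only.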

Subgroups
To allow for the examination of interaction effects and subgroup analyses, participants self-reported their sex and status as either immigrant or non-immigrant. Immigrant status was determined by foreign or native birthplace (born in Norway, yes/no). We used national registries to obtain SES, which is expressed as the parent with the highest education level (Erola, Jalonen, & Lehti, 2016). SES was divided into four subgroups: 1) lower secondary school or less, 2) upper secondary school, 3) less than four years university education, and 4) four years or more university education. We created baseline TDS subgroups based on the predetermined cutoffs defining "normal", "borderline" and "abnormal" TDS. This paper adheres to the three most critical criteria used to assess credibility of subgroup effects by Sun et al. (2012): subgroup variables must be assessed at baseline, subgroup hypotheses must be specified ahead of analyses and there must be an interaction effect.

Physical activity
PA was objectively measured with Actigraph accelerometers, models GT3X and GT3X+ (Actigraph, LLC, Pensacola, Florida, USA). Measurements were carried out over seven consecutive days, and participants were instructed to wear the accelerometer on their right hip, and to take it off during sleep, or when in contact with water. The accelerometers were initialized to start recording at 06:00 the day after the participants started wearing them. We excluded data recorded between 00:00 and 06:00, and intervals with more than 20 consecutive minutes without accelerations. Days with more than 480 min of active recording were considered valid. School-time PA was defined as occurring between 08:00 and 14:00. Schooldays with more than 40% of active recording during school-time were considered valid. Valid school-time measurements were also determined by manually coding each class' schedules to verify that valid measurements occurred on days when the intervention was scheduled. The subsequent analyses included only participants with at least two days/schooldays with valid measurements. We used Actilife software (Actigraph, LLC, Pensacola, Florida, USA) to initialize and download the accelerometer data. STATA (Stata Statistical Software, StataCorp LP) was used to process and analyze the raw data. Epoch was set to 10 s. Overall PA is expressed as average counts per minute (counts·min⁻¹). To assess average minutes per day and per school-day spent sedentary or in moderate to vigorous PA, time registered with <100 counts per minute and >1999 counts per minute, respectively, was divided by valid days/school-days of assessment. We used established cut-points for moderate to vigorous PA of 2000 counts per minute, which is equivalent to a >4 km/h walking speed, among adolescents (Kolle, Steene-Johannessen, Andersen, & Anderssen, 2010). The interventions' effects on PA are outlined in a separate paper (Kolle et al., 2020, submitted).
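The wear-time and cut-point rules above can be sketched in a few lines of Python. This is a simplified illustration, not the STATA processing used in the study. It assumes that the 10 s epochs have already been aggregated to counts per minute, that "without accelerations" means zero counts, and that the night-time exclusion has been applied beforehand. All names are hypothetical.

```python
from itertools import groupby

SEDENTARY_MAX = 100   # counts per minute below this are sedentary
MVPA_MIN = 2000       # counts per minute at or above this are MVPA
NONWEAR_RUN = 20      # zero-count runs longer than this are non-wear
VALID_DAY_MIN = 480   # a day needs more than this many wear minutes


def wear_minutes(cpm_day):
    """Drop runs of more than NONWEAR_RUN consecutive zero-count minutes."""
    kept = []
    for is_zero, run in groupby(cpm_day, key=lambda c: c == 0):
        run = list(run)
        if is_zero and len(run) > NONWEAR_RUN:
            continue  # treated as non-wear time
        kept.extend(run)
    return kept


def summarize(days):
    """Summarize one participant; returns None with fewer than two valid days."""
    valid = [w for w in map(wear_minutes, days) if len(w) > VALID_DAY_MIN]
    if len(valid) < 2:
        return None
    n = len(valid)
    total_counts = sum(sum(w) for w in valid)
    total_minutes = sum(len(w) for w in valid)
    return {
        "valid_days": n,
        "cpm": total_counts / total_minutes,
        "sedentary_min_per_day": sum(c < SEDENTARY_MAX for w in valid for c in w) / n,
        "mvpa_min_per_day": sum(c >= MVPA_MIN for w in valid for c in w) / n,
    }


if __name__ == "__main__":
    # Made-up day: 30 non-wear minutes, then 500 minutes of wear.
    day = [0] * 30 + [50] * 300 + [2500] * 60 + [500] * 140
    print(summarize([day, day]))
```

The school-time variables would follow the same pattern on the 08:00 to 14:00 window, with the 40% wear criterion replacing the 480 min threshold.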

Adherence to protocol
Several measures were taken to ensure adherence to protocol. To assess fidelity, adaptation, quality and responsiveness, qualitative process evaluations were carried out on both interventions. The process evaluation for M2 has been published elsewhere (Åvitsland, Ohna, et al., 2020), while the process evaluation for M1, is provisional and only available in Norwegian (Kolle et al., 2019). Dose delivered was also measured: One teacher liaison from each school was responsible for reporting the intervention components as executed/not executed on an online platform. Dose delivered is expressed as the mean percentage of intervention components that were executed relative to the total number of intervention components that were possible to execute during the 29-week intervention period.
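The dose-delivered measure is a simple proportion, sketched below in Python. The function names are hypothetical, and the default of 120 planned minutes per week is taken from the figure reported with the adherence results.

```python
def dose_delivered(executed, possible):
    """Percentage of planned intervention components reported as executed."""
    if possible <= 0:
        raise ValueError("possible must be positive")
    return 100.0 * executed / possible


def delivered_minutes(dose_percent, planned_minutes_per_week=120):
    """Convert a dose percentage into weekly minutes of additional PA."""
    return dose_percent / 100.0 * planned_minutes_per_week


if __name__ == "__main__":
    # Reported average doses: 81% (M1) and 80% (M2).
    print(round(delivered_minutes(81)), round(delivered_minutes(80)))
```

With the reported averages of 81% and 80%, this gives roughly 97 and 96 minutes per week.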

Statistical analyses
All analyses were performed in IBM SPSS Statistics 25 (IBM, Armonk, New York, USA). SDQ data were managed and organized into the predetermined scales by the syntax provided by the SDQ information web page (Youthinmind, 2018). We report descriptive statistics from T1 and T2 as means and standard deviations (SD). We used Cronbach's alpha to assess the internal consistency of TDS and its subscales. The respective results from T1 and T2 were as follows: emotional problems (0.67 and 0.71), conduct problems (0.51 and 0.53), hyperactivity (0.66 and 0.68), peer problems (0.61 and 0.61) and TDS (0.62 and 0.61). We tested the baseline differences between M1 and control, and between M2 and control, using one-way ANOVA with Fisher's LSD post hoc test.
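For readers unfamiliar with the internal consistency statistic reported above, the following Python sketch computes Cronbach's alpha from a respondents-by-items table using the standard formula, alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores). It is an illustration only; the study used SPSS, and the example data are made up.

```python
from statistics import variance


def cronbach_alpha(rows):
    """rows: one list of item scores per respondent (same items, no missing)."""
    k = len(rows[0])
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item (column) and of the summed scale score.
    item_vars = [variance(col) for col in zip(*rows)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


if __name__ == "__main__":
    # Hypothetical responses from four respondents on a two-item scale.
    print(round(cronbach_alpha([[2, 3], [4, 4], [3, 5], [5, 5]]), 3))
```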
Of the students who consented to participate in the study (n = 2084), 83% (n = 1728) completed the SDQ at T1. Missing values between enrollment and T1 have been described previously (Åvitsland, Leibinger, et al., 2020). Of the completers at T1, 20% (n = 337) did not complete the SDQ at T2, resulting in the complete case group (completers at T1 and T2) including 1391 participants. To examine if the data were missing completely at random (MCAR), Little's MCAR test was conducted with the variables TDS at T1 and T2, sex, SES and immigrant status. The test did not indicate that data were compatible with MCAR (chi square = 39.408; DF = 9; p < .001). We used logistic regression to examine whether TDS at T1 or T2, sex, SES, immigrant status and three auxiliary physical fitness variables (cardiorespiratory fitness, muscular strength and body composition; see Åvitsland, Leibinger, et al., 2020), could predict the likelihood of being a complete case or having at least one missing TDS value. The results (odds ratio; 95% CI; p) indicated that missingness did not depend on the outcome (1.0; 0.91 to 1.1; p = .658), baseline TDS (1.0; 0.95 to 1.05; p = .821) or any other variable included in the model. We therefore assumed the possibility that complete case analyses could produce unbiased results (Hughes, Heron, Sterne, & Tilling, 2019). Two final post hoc tests were carried out to assess whether experimental groups influenced missingness: First, the logistic regression described above was stratified for experimental group to examine whether variables predicted missingness differently between M1, M2 and control. Second, ANOVA with Fisher's LSD post hoc test was used to assess whether TDS at T1, among those with missing TDS at T2, was different between M1, M2 and control. Neither of these tests indicated differences between the groups.
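The participant counts given at the start of this paragraph can be checked with a few lines of Python arithmetic. The variable names are ours; the numbers are those reported above.

```python
consented = 2084      # students with parental consent
completed_t1 = 1728   # completed the SDQ at T1
missing_t2 = 337      # T1 completers without SDQ at T2

complete_cases = completed_t1 - missing_t2
t1_rate = completed_t1 / consented
dropout_rate = missing_t2 / completed_t1

if __name__ == "__main__":
    print(complete_cases, round(100 * t1_rate), round(100 * dropout_rate))
```

The rounded values agree with the reported 1391 complete cases, 83% completion at T1 and 20% missing at T2.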

Intervention effects
To test the effect of the interventions, we conducted complete case analyses using linear mixed effects models, with schools as a random effect. We tested the effect of group (M1, M2 and control) on change in the dependent variables, in models that included the respective baseline variable as a covariate. Moderating effects for change in the dependent variables were determined by testing a categorical subgroup * group interaction in models controlling for main effects on group and subgroup. The moderating variables were sex, SES, immigrant status and TDS level at baseline. In cases where the results indicated an interaction effect, subgroups were analyzed separately by stratifying the data file. If a model indicated results that we interpreted to be compatible with an effect on TDS at T2, subsequent analyses were conducted with the TDS subscales as dependent variables. We report the estimated mean difference in change between groups (b), 95% confidence intervals (CI), exact p-values and the intraclass correlation coefficient (ICC) for the cluster effect of schools. The estimated mean difference in change is expressed by measurement units on the scale of the dependent variable, adjusted for potential baseline differences (M1-control and M2-control).
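One way to write down the model described above is given below. This is our reading of the description, not a specification provided by the authors; the symbols are our own notation. It assumes a random intercept for school j, with student i nested in school j, control as the reference group, and change defined as the T2 score minus the T1 score.

```latex
\begin{aligned}
\Delta y_{ij} &= y^{T2}_{ij} - y^{T1}_{ij}, \\
\Delta y_{ij} &= \beta_0 + \beta_1\,\mathrm{M1}_j + \beta_2\,\mathrm{M2}_j
                 + \beta_3\,y^{T1}_{ij} + u_j + \varepsilon_{ij}, \\
u_j &\sim \mathcal{N}(0, \sigma_u^2), \qquad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma_\varepsilon^2), \\
\mathrm{ICC} &= \frac{\sigma_u^2}{\sigma_u^2 + \sigma_\varepsilon^2}.
\end{aligned}
```

Under this reading, the reported estimates b for M1-control and M2-control correspond to \beta_1 and \beta_2. The moderation models would add a subgroup main effect and subgroup-by-group product terms.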
We have not used the Bonferroni adjustment, as we concur with its critics (Moran, 2003;Nakagawa, 2004;Perneger, 1998) who argued that the adjustment increases the risk of type 2 error, and that "simply describing what tests of significance have been performed, and why, is generally the best way of dealing with multiple comparisons" (Perneger, 1998).
In an effort to adhere to the American Statistical Association's statement on statistical significance and p-values (Wasserstein & Lazar, 2016), and more recent recommendations in a special issue of the journal The American Statistician (Wasserstein, Schirm, & Lazar, 2019), we do not dichotomously interpret the p-values to be either significant or non-significant. Instead, we interpret the p-values as continuous quantities that express how compatible the observed data are with the null hypotheses: smaller p-values indicate greater incompatibility with the null hypotheses. Based on the continuous p-values, the size of the unstandardized regression estimates (b) and the limits of the confidence intervals, we interpret how compatible the results are with our hypotheses and present the results accordingly (Amrhein, Greenland, & McShane, 2019; Greenland et al., 2016). Other factors that influence our interpretations are related to prior evidence, plausibility of mechanism, study design and data quality (McShane, Gal, Gelman, Robert, & Tackett, 2019).

Multiple imputation
Despite the plausible MCAR assumption, the high amount of missing values increases the possibility that the data might be missing at random (MAR) or not missing at random (MNAR). Therefore, multiple imputation was employed. With five imputations and ten iterations, missing data were imputed with TDS at T1 and T2, moderating variables and other auxiliary variables (muscular strength, body composition and cardiorespiratory fitness), using the automatic procedure. We did not impute on the SDQ subscale variables, because SPSS cannot carry out multiple imputation with that many missing values (Mustillo & Kwon, 2015). Analyses that showed results compatible with effects from the complete case data were repeated on the imputed dataset. Complete case results and multiple imputation results are presented as recommended by Manly and Wells (2015) and Sterne et al. (2009).
Table 1 shows the number of participants with valid data from each group at both time points, baseline characteristics and PA levels, results from T1 and T2, and the distribution of participants within the TDS subgroups "normal", "borderline" or "abnormal" at T1 and T2. The results were compatible with baseline differences between intervention groups and control group: M1 had 6% higher TDS (b = 0.6; 95% CI = 0.003 to 1.17; p = .049) than control, and M2 had 3.4% lower SES (−0.2 to 0.0; p = .039) than control. Furthermore, regarding school-time PA, M1 had 13% fewer counts per minute (b = −65.0; 95% CI = −90.2 to −39.8; p < .001), 6% more sedentary time (b = 14.2; 95% CI = 10.2 to 18.2; p < .001) and 14% less moderate to vigorous PA (b = −4.3; 95% CI = −5.9 to −2.6; p < .001) than control.

Adherence to protocol
Average dose delivered in the M1 schools was registered to be 81%, ranging from 72% to 95%. Average dose delivered in the M2 schools was 80%, ranging from 67% to 93%. This is equivalent to 97 and 96 min, respectively, out of 120 possible minutes per week of additional school-based PA. The compliance for reporting was 98.2%.
Table 2 shows the estimated mean difference in change between the intervention groups and the control group. The ICC was 0.007 for TDS, which indicates small to no difference between schools (Killip, Mahfoud, & Pearce, 2004). For the overall population, the results were incompatible with an effect of M1 or M2 on TDS. We interpreted the results to be incompatible with interaction effects for sex (p = .150) or SES (p = .951), and compatible with interaction effects for immigrant status (p = .061) and baseline TDS levels (p = .008). The subsequent subgroup analyses showed beneficial results. In the abnormal TDS subgroup, results were compatible with a mean difference in change for TDS in favor of M1, compared to their control group counterparts (b = −2.9; 95% CI = −5.73 to −0.07; p = .045). Relative to the estimated baseline levels within the abnormal TDS subgroup (23 points), M1 reduced TDS by 22%, while the control condition reduced TDS by 9%. Subsequent analyses of the SDQ subscales showed that the result could mainly be attributed to difference in change for conduct problems (b = −0.99; 95% CI = −2.02 to 0.04; p = .058) and hyperactivity (b = −1.13; 95% CI = −2.1 to −0.19; p = .019).
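The percentage changes reported for the abnormal TDS subgroup can be converted back to scale points with simple arithmetic, as in the Python sketch below. The helper name is hypothetical. Because the percentages are rounded and the estimate b is model-adjusted, the result is only expected to be close to the reported b = −2.9, not identical to it.

```python
def change_in_points(baseline, percent_change):
    """Convert a percentage change relative to baseline into scale points."""
    return baseline * percent_change / 100.0


baseline_abnormal = 23                                     # estimated baseline TDS
m1_change = change_in_points(baseline_abnormal, -22)       # about -5.06 points
control_change = change_in_points(baseline_abnormal, -9)   # about -2.07 points
difference = m1_change - control_change                    # about -2.99 points

if __name__ == "__main__":
    print(round(m1_change, 2), round(control_change, 2), round(difference, 2))
```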

Intervention effects
In the immigrant subgroup, results were compatible with a mean difference in change for TDS in favor of both M1 (b = −1.6; 95% CI = −3.53 to 0.27; p = .093) and M2 (b = −2.1; 95% CI = −4.36 to 0.21; p = .075), compared to their control group counterparts. Relative to the estimated baseline levels within the immigrant subgroup (11 points), TDS increased 5% in M1, 0% in M2 and 18% in the control group. Subsequent analyses of the SDQ subscales showed that the result in favor of M1 could mainly be attributed to difference in change for emotional problems (b = −1.1; 95% CI = −1.89 to −0.29; p = .008), and the result in favor of M2 could mainly be attributed to difference in change for emotional problems (b = −1.0; 95% CI = −1.99 to −0.07; p = .036) and hyperactivity (b = −0.94; 95% CI = −1.90 to 0.02; p = .055).
The analyses on the immigrant/non-immigrant subgroups with the SDQ subscales as dependent variables showed an unexpected result: in the non-immigrant subgroup, there was compatibility with a mean difference in change for peer problems indicating an increase from M2 (b = 0.32; 95% CI = 0.03 to 0.62; p = .034), compared to the corresponding control subgroup. This warranted further investigation into the subscale peer problems to understand whether there were heterogeneous effects between specific subgroups. The results were compatible with a mean difference in change indicating that M2 increased peer problems among non-immigrant girls (b = 0.42; 95% CI = 0.11 to 0.72; p = .010) and in the subgroup with borderline TDS at baseline (b = 0.89; 95% CI = 0.1 to 1.67; p = .029), compared to their respective control group counterparts. Relative to the estimated baseline levels within the non-immigrant girls subgroup (1.6 points), peer problems increased 19% in M2, while their control group counterparts displayed a 13% decrease. Similarly, relative to estimated baseline levels within the borderline TDS at T1 subgroup, (3.2 points), peer problems increased 6% for M2, while their control counterparts displayed a 31% decrease.
Note. We display results from analyses of intervention groups as a whole, and subgroups that show heterogeneous effects. TDS = total difficulties score. T1 = baseline. M1 = intervention group 1. M2 = intervention group 2. Results interpreted to be compatible with effects are accentuated in bold.
Table 1. Participants' demographic characteristics, school-time physical activity at T1 and SDQ scores at T1 and T2. Presented as means with standard deviations (SD). Valid n is presented for each variable. School-time physical activity is reported as average minutes per school day for participants with at least two valid days of wear time. CPM = counts per minute; MVPA = moderate to vigorous physical activity; TDS = total difficulties score.

Multiple imputation
Compared to the complete case results, the linear mixed effects model conducted on the imputed dataset showed results that were less compatible with effects on TDS, although the unstandardized coefficients and confidence intervals showed similar tendencies (Table 3). These results may be biased, however, because of the large amount of data that were imputed and the majority of missing data existed in the outcome variable (Hughes et al., 2019;Lee & Carlin, 2012). Additionally, multiple imputation on datasets containing cluster randomized groups and subgroups may skew the imputed values toward the mean (Sullivan, White, Salter, Ryan, & Lee, 2018). This may explain why the results for the immigrant subgroup and the abnormal TDS subgroup were less compatible with effects than in the complete case analyses. For these reasons, the emphasis in the discussion will be placed on the complete case results.
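Estimates from several imputed datasets are commonly combined with Rubin's rules. The text does not state how the five imputed datasets were pooled, so the Python sketch below is only an illustration of that standard technique. The input numbers are hypothetical and are not results from this study.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool per-imputation estimates and their squared standard errors."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching lists from at least two imputations")
    q_bar = mean(estimates)             # pooled point estimate
    within = mean(variances)            # average within-imputation variance
    between = variance(estimates)       # between-imputation variance
    total = within + (1 + 1 / m) * between
    return q_bar, total ** 0.5          # pooled estimate and standard error


if __name__ == "__main__":
    # Hypothetical coefficients and squared standard errors from five imputations.
    b = [-1.2, -1.5, -0.9, -1.4, -1.1]
    se2 = [0.64, 0.60, 0.70, 0.62, 0.66]
    print(pool_rubin(b, se2))
```

The between-imputation term shows why heavy imputation widens the pooled interval: the more the estimates differ across imputed datasets, the larger the total variance.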

Discussion
The aim of this paper was to assess the effect of two school-based PA interventions on adolescents' mental health. The complete case results indicate that the interventions did not affect TDS in the overall population. Subgroup analyses, however, showed beneficial effects from both interventions. Specifically, M1 reduced TDS in the subgroup with the highest levels of psychological difficulties at T1 and both interventions prevented an increase in TDS for a majority in the immigrant subgroup. Analyses of the SDQ subscales revealed, surprisingly, that M2 caused peer problems to increase in both the non-immigrant girls subgroup and the borderline TDS at T1 subgroup.
Although the immigrant subgroups' mixed model p-values were nonsignificant in the traditional sense, they were low enough to indicate that the data conformed more to the hypothesis of an effect than to the null hypothesis (Greenland et al., 2016). Furthermore, compared to M1, the difference in change for 95% of the control group spanned from a 3.5 points bigger increase to a 0.3 points lower increase. Compared to M2, the difference in change for 95% of the control group spanned from a 4.4 points bigger increase to a 0.2 points lower increase. The two upper limits indicate an increase with no practical implications, while the lower limits indicate substantial clinically significant effects. This interpretation is based on the findings by Goodman and Goodman (2009), suggesting that every one-point increase in TDS represents a 16%-23% increased likelihood of developing a mental disorder. Therefore, even though the p-values were above the traditional significance level and the confidence intervals contained the null value, a majority of the immigrant subgroup who received the interventions may have experienced a substantial decrease in the likelihood of developing a mental disorder, compared to the respective control subgroup.

Overall population
The lack of an overall effect was not surprising, considering that 85% and 83% of the respective M1 and M2 populations had normal levels of psychological difficulties. Similar ceiling effects were suspected in comparable studies by Eather et al. (2016), Smith et al. (2018) and Christiansen et al. (2018). Although our intervention period of 29 weeks was longer than in similar studies (Eather et al., 2016; Lubans, Smith, et al., 2016; Smith et al., 2018), it might have been too short to attenuate an overall increase in psychological difficulties among generally healthy adolescents. Furthermore, although 80-81% of the dose was registered as delivered, analyses on PA outcomes show that M1 slightly increased school-time PA levels, while M2 did not (Kolle et al., 2020). This difference may have been caused by the design of the M2 intervention: All three M1 intervention components were teacher-led and were anticipated to be performed with moderate to high intensity. M2, however, contained two lessons that were mainly student-led. The process evaluation substantiates the hypothesis that the extensive freedom and student-led activities that characterized M2 sometimes led to truancy (Åvitsland, Ohna, et al., 2020). Additionally, the intervention specified that students were allowed to choose their preferred activity, which involved everything from low intensity walking to high intensity soccer. These factors may have resulted in a dichotomization among the M2 participants, characterized by physical activity and inactivity.
Table 2. Estimated mean difference in change in dependent variables, between groups.
Table 3. Estimated mean difference in change in Total Difficulties Score between groups, using the imputed dataset.
It is also important to note that the control population was not physically inactive: The Norwegian lower secondary school curriculum mandates at least two PE lessons per week and students can also opt in for the elective subject of physical activity and health (often organized in weekly 90-min lessons). In addition, 63% of lower secondary school students in Norway participate in leisure time sports (Bakken, 2019). Among upper secondary school students, however, only 40% participate in leisure time sports and it is possible that an intervention continuing into this period of adolescence would have shown an effect in the overall population.

Abnormal TDS at T1
TDS decreased in the abnormal TDS at T1 subgroups of M1, M2 and control alike, indicating a regression to the mean. However, the abnormal TDS at T1 subgroup that received the M1 intervention displayed a reduction in TDS that was more than twice as large as the reduction in the corresponding control subgroup. The almost 3-point larger mean reduction in M1 may be a substantial, clinically significant difference (Goodman & Goodman, 2009). The results concur with previous studies that also found effects in similar subgroups (Christiansen et al., 2018; Eather et al., 2016; Smith et al., 2018).
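The regression to the mean noted above, and why the comparison with the control subgroup matters, can be illustrated with a minimal simulation. This is a sketch under stated assumptions, not an analysis of the study data: scores are standardized rather than on the SDQ scale, no intervention is applied, and the function name, sample size and cutoff are hypothetical choices made for illustration.

```python
import random
import statistics


def simulate_regression_to_mean(n=10000, cutoff_quantile=0.9, seed=0):
    """Return mean T1 and T2 scores for participants selected on high T1 scores.

    Hypothetical illustration: each simulated participant has a stable
    underlying score plus independent measurement noise at each time point.
    No intervention effect is simulated.
    """
    rng = random.Random(seed)
    true_score = [rng.gauss(0, 1) for _ in range(n)]
    t1 = [t + rng.gauss(0, 1) for t in true_score]
    t2 = [t + rng.gauss(0, 1) for t in true_score]

    # Select the subgroup with the highest observed baseline scores,
    # analogous to selecting on an "abnormal" score at T1.
    cutoff = sorted(t1)[int(cutoff_quantile * n)]
    selected = [i for i in range(n) if t1[i] >= cutoff]

    mean_t1 = statistics.mean(t1[i] for i in selected)
    mean_t2 = statistics.mean(t2[i] for i in selected)
    return mean_t1, mean_t2


mean_t1, mean_t2 = simulate_regression_to_mean()
# The selected subgroup scores lower at T2 even though nothing was done to it.
print(f"Selected subgroup mean at T1: {mean_t1:.2f}, at T2: {mean_t2:.2f}")
```

Because the selected subgroup declines without any intervention in this sketch, a decline within an intervention subgroup is only interpretable relative to the decline in the corresponding control subgroup.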
The effect on TDS could be attributed to reductions in conduct problems and hyperactivity, although this does not align with the meta-analysis by Ahn and Fedewa (2011), which did not show an association between PA and conduct problems. However, a possible explanation for the present reduction is that PA can influence mental health through a behavioral mechanism, for example by improving coping and self-regulation (Lubans, Richards, et al., 2016a), which are inversely associated with conduct problems in adolescents (Ebata & Moos, 1991). The reduction in hyperactivity is supported by a substantial amount of evidence suggesting that PA has a beneficial effect on hyperactivity through neurobiological pathways (Gapin, Labban, & Etnier, 2011).

Immigrants
The immigrant subgroups that received the M1 and M2 interventions displayed a smaller increase and no increase in TDS, respectively, compared to the corresponding control subgroup. Mainly, the effect from M1 could be attributed to reductions in emotional problems, and the effect from M2 could be attributed to reductions in emotional problems and a prevented increase in hyperactivity. These effects might be explained by a psychosocial mechanism (Lubans, Richards, et al., 2016a). Social support may be the mechanism that was affected, as it has been shown to be associated with sports participation (Babiss & Gangwisch, 2009) and the M2 intervention was designed to facilitate positive social relationships. Immigrant adolescents experience less social support than non-immigrants, and therefore might have a larger potential for change (Oppedal & Røysamb, 2004). In turn, social support can stave off emotional problems (Garnefski & Diekstra, 1996) and is inversely associated with hyperactivity (Mastoras, Saklofske, Schwean, & Climie, 2015). Self-esteem is another potential psychosocial mechanism that may explain the results, considering that PA is associated with adolescent self-esteem (Dale, Vanderloo, Moore, & Faulkner, 2019), which, in turn, is associated with hyperactivity (Edbom, Lichtenstein, Granlund, & Larson, 2006) and emotional health (Moksnes & Espnes, 2012). Moreover, adolescent immigrants may have lower self-esteem than their non-immigrant peers (Bankston & Zhou, 2002). Discussing potential reasons why immigrant adolescents experience less social support or have a poorer sense of self than their non-immigrant peers is beyond the scope of this paper. However, racial discrimination occurs, and can influence connected outcomes such as perceived physical appearance, feelings of belonging to a peer group and identity development (Virta, Sam, & Westin, 2004).
To the authors' knowledge, no previous study on the effect of school-based PA interventions on mental health in adolescent populations has specifically identified immigrants as a subgroup. However, in a recent study on an adult immigrant population in Sweden, a four-month lifestyle intervention positively influenced mental health, and the authors emphasized increased PA and social support as potential causes (Siddiqui, Lindblad, Nilsson, & Bennet, 2019). Furthermore, in a qualitative study, "enhanced self-confidence, happiness, and lower stress" were frequently reported as experienced benefits from PA among adult and adolescent immigrants of different origins in the USA (Wieland et al., 2015). The immigrant experience of school-based PA interventions has been neglected in previous research. Similar studies in the future should identify this subgroup, not only to assess quantitative effects, but also to assess why this subgroup may benefit from increasing school-time PA.

Non-immigrant girls and borderline TDS at T1
In the subgroups non-immigrant girls and borderline TDS at T1 that received the M2 intervention, peer problems increased, while their respective control counterparts displayed decreases. Although the differences in change were small to moderate (0.4 and 0.9 points, respectively), the percentage results relative to baseline gave reason for concern. Scoring high on peer problems is characterized by the SDQ as being solitary, having few friends, being bullied by others and getting along better with adults than with peers. The negative effects were surprising, considering that the intervention was designed to facilitate social relationships through PA. Additionally, participation in team sports has been associated with fewer mental problems than participation in individual sports (Breistøl, Clench-Aas, Van Roy, & Raanaas, 2017), and a recent review by Pels and Kleinert (2016) concluded that PA could contribute to reducing loneliness. The process evaluation of the M2 intervention offers a possible explanation for the negative effect among girls (Åvitsland, Ohna, et al., 2020): The formation of activity groups could lead to some girls feeling ostracized, while planning and cooperating within the group sometimes led to disagreements, which could cause one or several girls to leave their groups. Previous research on sex differences in adolescent peer relationships has shown that social anxiety, expressed by a "fear of negative evaluation from peers, and more social avoidance and distress in new situations" (La Greca & Lopez, 1998), is more prevalent in girls than in boys. Furthermore, while boys tend to thrive in social groups, girls tend to place more emphasis on dyadic relationships (Prinstein, Borelli, Cheah, Simon, & Aikins, 2005), perhaps because they experience fewer conflicts in these relationships. According to Xie, Swift, Cairns, and Cairns (2002), conflicts among girls involving social aggression, e.g., exclusion, isolation and gossiping, most often occur in groups of four or more members, a common group size in the M2 intervention. The potential sources of conflict for the non-immigrant girls may also have been the reason for the negative effect in the borderline TDS subgroup; however, it is unclear why the negative effect occurred specifically in this subgroup.

Strengths and limitations
Strengths of this study include a large sample of an understudied population, the use of a cluster-RCT design and multilevel analyses of whole groups and subgroups. The mental health outcome variable comprises four subscales, which can be helpful for interpreting the explanatory mechanisms. Except for the non-immigrant girls subgroup, all subgroups were determined a priori and fulfil most of the credibility criteria set by Sun et al. (2012). However, limitations must be addressed and there are at least six specific factors that increase the uncertainty of the results: 1) The participating schools volunteered and may be systematically different from schools that declined to participate. This may restrict the generalizability of the results. 2) There is also the possibility of a collider bias, i.e., that the mental health of participants predicted different rates of participation at follow-up, depending on the experimental groups. Between T1 and T2, M1, M2 and the control group lost 22%, 36% and 40% of their respective participants. The loss in M2 can be explained by the school withdrawing from the project. The high attrition in the control group, however, cannot be accounted for and may have influenced the results in ways we cannot determine. 3) Although we assumed that complete case analyses could produce unbiased results, the assumption is uncertain. The high levels of missing values may have influenced the results. 4) The subgroups in which we interpreted effects contained between 5% and 47% of the full population, and testing of the smallest subgroups may have been low in power. The small subgroups may also not be representative of equivalent subgroups outside of the study population. 5) The internal consistencies of TDS and subscales were below the recommended cutoff point of 0.7 (Bland & Altman, 1997). This could be due to poor understanding of the questions, unwillingness to answer the questions honestly, or actual low consistency. Although this contributes to uncertainty, it should be noted that the relevance of Cronbach's alpha has been criticized (Sijtsma, 2009). It should also be considered a limitation that we only used one measure of mental health, as there may be other instruments that are more sensitive to change. 6) Lastly, schools are complex contexts and our interventions can be characterized as complex (Craig et al., 2008). This means that there are many potential interacting systems that we cannot control and that may influence the results.
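For reference, the internal consistency statistic discussed under limitation 5 is conventionally defined as shown below. The notation is the standard textbook form and is an assumption here, as the original text does not give a formula: $k$ is the number of items in a scale, $\sigma^2_{Y_i}$ is the variance of item $i$, and $\sigma^2_X$ is the variance of the summed scale score.

```latex
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} \sigma^2_{Y_i}}{\sigma^2_X}\right)
```

Under this definition, alpha rises with the number of items as well as with the covariance among them, so scales with few items tend to yield lower values at a given level of inter-item correlation.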

Conclusions
The School in Motion cluster-RCT, with two intervention arms spanning 29 weeks, did not affect psychological difficulties in the overall population (14-15-year-olds in Norway attending ninth grade). Results indicated beneficial effects in two subgroups: those with the highest baseline levels of psychological difficulties and immigrants. The effects could be attributed either to a reduction, or prevented increase, in conduct problems, hyperactivity and/or emotional problems. The present results indicate that school-based PA interventions may cause clinically significant changes in psychological difficulties in these subgroups, and these changes may reduce the odds of developing mental disorders. The M2 intervention may be beneficial for immigrants in its current form, although further studies are needed with adapted versions to avoid an increase in peer problems in some subgroups. Future research should focus on the causal relationship between school-time PA and adolescent immigrants' mental health, as no previous research exists on this subject.

Declaration of competing interest
The authors declare that they have no competing interests.