Improving Numeracy Skills in First Graders with Low Performance in Early Numeracy: A Randomized Controlled Trial

Children with low performance in early numeracy are at risk of facing learning difficulties in mathematics, but few trials have examined how this can be ameliorated. A total of 120 first-grade children (M age = 6.4 years) were randomly assigned to an intervention or a control condition. The 14-week intervention targeted early numeracy skills and was delivered in small groups three times a week. Immediately after the initial 8-week intervention phase, small to moderate positive effects were found on early numeracy (d = 0.19), word problem solving (d = 0.41), and approximate number sense (d = 0.35). However, only the effects on word problems were significant, and all effects disappeared after the children undertook a second 6-week intervention phase. Overall, results indicate that (a) early numeracy skills are malleable in low-performing children, but (b) frequent and long-term interventions are needed for the positive effects to last.


Previous Numeracy Interventions in Kindergarten and Early Grades
There are a number of meta-analyses of interventions targeting mathematical learning difficulties (Chodura et al., 2015; Dennis et al., 2016; Gersten et al., 2009; Jitendra et al., 2018; Kroesbergen & Van Luit, 2003; Monei & Pedro, 2017; Wang et al., 2016). Chodura et al. (2015), whose inclusion criteria overlap with the sample in our study, examined the effects of interventions in 6- to 12-year-olds with mathematical difficulties in pre/post control group studies. Based on 35 studies overall, there were large intervention gains in number skills and arithmetic (d = 0.83). Furthermore, moderator analyses showed that the most efficient interventions were based on direct and assisted teaching. However, most studies had poor or no randomization and low power (<30 in each group).
In another meta-analysis, Dennis et al. (2016) also examined children with mathematical difficulties in pre/post control group studies. Their results showed promising findings for interventions that included peer-assisted training and explicit instructions in small groups, but overall effects were only moderate (d = 0.55, k = 25). The reason for the discrepancy in mean effect size between these two meta-analyses may lie in the selection criteria, as Dennis et al. (2016) used a more lenient definition of mathematical difficulties than Chodura et al. (2015). Ultimately, both meta-analyses show that mathematics interventions can be effective, but that moderators may be important to the size of the effects (Dennis et al., 2016).
If we examine the randomized trials included in these reviews more closely, they show mixed effects. For instance, Fuchs et al. (2013) examined the effects of a strategic number knowledge intervention for first graders in three groups: a control group, a group with strategic counting without speeded practice (i.e., reinforcing thoughtful application to support reasoning of strategies during fact retrieval), and a group with strategic counting and speeded practice (i.e., promoting quick response to support fact retrieval). This intervention lasted for 16 weeks (30-minute sessions, three times weekly). Strategic counting without speeded practice was seen to improve number combination fluency compared to the control condition on immediate posttest (d = 0.43), while strategic counting with speeded practice improved number combination fluency and transfer to procedural calculations compared to both competing conditions on immediate posttest (d = 0.67). Gersten et al. (2015) conducted a scale-up trial of the "Number Rockets" (Fuchs et al., 2005) intervention program, a small-group intervention for at-risk first graders focusing on number operations (30 hours of small-group work). Children in the intervention group outperformed the control group on a broad measure of mathematics proficiency on the immediate posttest (d = 0.34). Clarke et al. (2016) examined the efficacy of a kindergarten early numeracy intervention program ("ROOTS"), focusing on developing whole-number understanding for children assessed as at-risk in mathematics (50 20-minute sessions over 10 weeks). Results across 29 classrooms showed that children in the intervention group outperformed the control group (d = 0.28 for oral counting, d = 0.75 for early numeracy, and d = 0.48 for early number sense). However, no effects were found in the follow-up posttest, indicating that the initial positive impacts of the intervention did not remain long-term. Hence, Clarke et al. (2016) raised the concern that mathematics interventions have limited impact on long-term achievement. Notably, effects from ROOTS were replicated by Doabler et al. (2016) with similar findings (effect sizes ranging from d = 0.31 to 1.08).
A consistent finding across studies is that effects fade out after a seemingly effective mathematics intervention has ended (Bailey, 2019). Little is known about the nature of fade-out effects and their influencing factors in the context of randomized controlled trial (RCT) studies. Two hypotheses have been suggested. The first, the constraining content hypothesis, suggests that fade-out effects are due to environmental factors, given that subsequent instruction does not build on the skills learned during the intervention. The second, the preexisting differences hypothesis, suggests that fade out is due to stable, underlying characteristics in mathematics that cause children to revert to their previous individual trajectories (Bailey et al., 2016).

Research Aim and Questions
Evidently, previous studies have identified promising effects immediately after intervention, but these effects fade out as soon as the intervention is taken away. Here, we present a study that will contribute to knowledge in this area. First, we examine effects from an early numeracy intervention in a developmental period in which numeracy skills are presumed to be malleable. Second, the present study attempts to prevent fade-out effects by adding a second intervention phase that serves as a refresher of the intervention content.
Accordingly, we aim to respond to the following research questions:

RQ1. Does an early numeracy intervention lead to pretest/posttest differences between treatment and control groups in early numeracy, word problem solving, arithmetic skills, and approximate number sense (immediate intervention effects)?
RQ2. Does including a second intervention phase lead to pretest/follow-up test differences in outcomes between treatment and control groups (follow-up intervention effects)?

Participants
All children born in 2010 and attending first grade in two municipalities in Norway, where children start school at the age of six, were invited to participate in the study; this resulted in 369 initial participants. The CONSORT diagram in Figure A1 in the online supplemental appendix depicts the flow of participants throughout the study (Schulz et al., 2010). Ethical approval was obtained from the Norwegian Social Science Data Services, and informed parental consent was given. The children were selected based on a screening with the Early Numeracy Screener (Lopez-Pedersen et al., 2021), consisting of 52 items measuring early numeracy skills, understanding numerical relational skills, counting skills, and basic arithmetic skills. The tasks in the screening measure are like those assessed by other early numeracy measures (Clements et al., 2008; Jordan et al., 2007). The reliability of the screening measure in our sample was Cronbach's α = .943. We selected the 32% of children with the lowest scores on the Early Numeracy Screener (n = 120, 57% girls) for further participation in the study (M age = 77 months, SD = 3.94 months).
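The reliability coefficient reported for the screener, Cronbach's α, can be illustrated with a minimal sketch. The function name `cronbach_alpha` and the toy 0/1 item scores are hypothetical and are not the study's data; the sketch only shows how α is computed from the item variances and the variance of the total score.

```python
from statistics import variance

def cronbach_alpha(scores):
    """Cronbach's alpha for a list of rows (one row per child, one column per item)."""
    k = len(scores[0])                      # number of items
    items = list(zip(*scores))              # transpose: one tuple per item
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - item_var_sum / total_var)

# Hypothetical toy data: 4 children, 3 dichotomously scored items.
toy_scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
alpha = cronbach_alpha(toy_scores)
print(round(alpha, 3))
```

With 52 items, as in the screener, the same function would be applied to a matrix with 52 columns; values close to 1 indicate that the items rank children consistently.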
Two of the authors randomized the children at the individual level by using random.org (https://www.random.org); using the same program, we applied blocking to ensure an equal number of children in both groups. The study had little attrition: only 5.8% (n = 7) of participants dropped out due to moving school districts. Little's MCAR (Missing Completely at Random) Test (R. J. A. Little, 1988) of the pretest data indicated that the data were likely to follow a missing-completely-at-random mechanism rather than a missing-at-random mechanism, χ²(105) = 79.0, p = .97. We therefore performed the full-information maximum-likelihood procedure to handle the missing data (Enders, 2010).
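As an illustration of how blocking yields equally sized groups, the following is a minimal sketch of permuted-block randomization. It is not the procedure run on random.org: the block size of 4, the seed, the child identifiers, and the use of Python's pseudo-random generator are all assumptions made for the example.

```python
import random

def block_randomize(child_ids, block_size=4, seed=None):
    """Assign children to two arms in shuffled blocks with equal allocation per block."""
    if block_size % 2 != 0:
        raise ValueError("block_size must be even for 1:1 allocation")
    rng = random.Random(seed)
    assignment = {}
    for start in range(0, len(child_ids), block_size):
        block = child_ids[start:start + block_size]
        # Each full block contains the same number of labels for both arms.
        labels = ["intervention", "control"] * (block_size // 2)
        rng.shuffle(labels)
        for child, label in zip(block, labels):
            assignment[child] = label
    return assignment

# Hypothetical identifiers for the 120 selected children.
children = [f"child_{i:03d}" for i in range(120)]
groups = block_randomize(children, block_size=4, seed=2016)
n_intervention = sum(1 for g in groups.values() if g == "intervention")
n_control = sum(1 for g in groups.values() if g == "control")
print(n_intervention, n_control)
```

Because 120 is divisible by the assumed block size, every block is complete and the two arms end up with 60 children each, regardless of the seed.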

Measures
Children were assessed individually at preintervention (t1), at immediate posttest at the end of the first 8 weeks of intervention (t2), at immediate second posttest after receiving the secondary intervention once a week for 6 weeks (t3), and at follow-up 6 months after the intervention ended (t4). All testing was conducted by trained research assistants in the children's schools during school time. Internal consistencies of all measures were satisfactory (see Table 1).
Early numeracy skills. Early numeracy skills were assessed using items of counting and numerical relational skills from a test custom-developed for this study. This test consisted of 24 counting tasks, measuring number-quantity correspondence, enumeration, and number sequences. A further 24 items measured numerical relational skills, such as comparing numbers; for example, identifying the smallest/largest number within the number range 1 to 201 (e.g., 22-19-28), identifying quantities with instructions such as "one more than," "one less than," and items measuring ordinal numbers without time limit. Each item was given one point for the correct answer and zero for the incorrect answer.
Note. Word problems TM = Word problem items from the research-developed early numeracy test, Word problems W = Word problem items from Wechsler Intelligence Scale for Children (4th ed.; WISC-IV). Measures of counting and relational skills at t4 were changed in order to avoid ceiling effects. Items that 95% of the children solved correctly at t3 were removed, and the remaining items were of increased difficulty. Thus, lower means are due to fewer items.
Word problem solving. Word problem solving was assessed using items from two tests: WISC-IV arithmetic tasks (Wechsler et al., 2003) and a test custom-developed for this study. The WISC-IV arithmetic tasks contained 34 arithmetic word problems, with a time limit of 30 seconds per item and a stopping rule of four consecutive errors. The custom-developed test comprised eight word problem items.
In both tests, the children were given word problems (read aloud to them) and then asked to solve them mentally, reporting their answers to the assessor. Each item was given one point for the correct answer and zero for the incorrect answer.
Arithmetic skills. Arithmetic skills were assessed by measuring addition and subtraction skills using items from a test developed for this study. The children were asked to perform 10 addition tasks (using paper and pencil) without a time limit. Eight of the tasks were in the number range of 0 to 20, and two were in the range of 10 to 30. For the subtraction items, the children were asked to perform 10 subtraction tasks (paper and pencil) without a time limit. All tasks were in the number range of 0 to 20. Each item was given one point for the correct answer and zero for the incorrect answer.
Approximate number sense (ANS). Approximate number sense (ANS) was assessed by measuring dot and digit comparison skills using two tasks from the Test of Basic Arithmetic and Numeracy Skills (Brigstocke et al., 2016). For the dot comparison tasks, the test presented arrays of dots randomly arranged within a 2.5 cm² box on a white background. A series of items with two adjacent boxes were given, and the children were asked to quickly tick the box with the largest number of dots and to complete as many boxes as possible within 30 seconds. The digit comparison tasks were presented in columns of two digits next to each other, and the children were asked to mark the larger of the two numbers. The children completed as many tasks as they could within 30 seconds and were given one point for each correct answer and zero for each wrong answer.

Intervention Program
When designing interventions for early numeracy skills, targeted skills generally fall into three domains: understanding numerical relations, counting skills, and basic arithmetic skills (e.g., Aunio & Räsänen, 2016; Jordan et al., 2009; Purpura et al., 2013). The present intervention program is focused on counting in the number range 1 to 20 and is based on the model of development of core numeracy skills theorized by Aunio and Räsänen (2016). Explicit teaching serves as an instructional feature because it has repeatedly been shown to be an effective approach (e.g., Chodura et al., 2015; Kroesbergen & Van Luit, 2003). (See Tables A1 and A2 in the online supplemental materials for the detailed content of each intervention session.)
The intervention sessions were conducted in small pullout groups consisting of four to six children and started with a short warm-up activity related to the content to be taught or with brief repetition of skills practiced in the previous session. A teacher-led activity followed, including modeling of new concepts and strategies. This was followed by children working in pairs, with hands-on activities (e.g., games and using manipulatives) guided by the teacher. At the very end of each session, the children completed a short individual written task. After every two small-group sessions, each child attended one 15 to 20-minute individual session with the intervention teacher. The objective of this individual session was to give the teacher an opportunity to work even more closely with the children and to give the teacher additional insight into each child's learning trajectory in early numeracy learning.

Procedure
The intervention condition comprised two phases. The first phase was administered to the children three times a week for 8 weeks by trained teachers and special educational needs teachers at school. A total of 24 sessions (16 small-group and eight individual sessions) were delivered, amounting to approximately 130 minutes each week. The second phase of the intervention started 2 weeks after the first intervention phase had ended. This phase consisted of six instructional sessions, once a week, over a total of 6 weeks. Content-wise, the sessions in the second phase repeated those of the initial 8-week intervention. Each intervention teacher received the material prior to the start of intervention, as well as training and practice in using the material.

Treatment Fidelity
During the intervention, we monitored the implementation of the intervention program using audio recordings. In addition, we used logs to note the children's attendance.
A random selection of 10% of the sessions across all schools was checked, and at least one session per teacher was checked. These sessions demonstrated 100% consistency between the audio recordings and the events reported in the logs. There was little absence from the intervention: on average, the children completed 27 out of 30 intervention sessions, yielding an absence rate of 10%.

Statistical Analyses
We performed structural equation modeling (SEM) because it has the advantage of being a flexible framework for integrating observed and unobserved variables and including multiple groups and time points (Kline, 2016). We specified models that represented the key constructs in our study as either manifest or latent variables. These models describe the intervention effects at two measurement points (after controlling for children's performance at t1) and allowed us to examine the immediate and follow-up effects of the intervention. To sustain acceptable power, we decided to include only three rather than all four measurement points in the analytic models and to estimate separate models for each construct. In fact, power analyses in the R package "WebPower" version 0.5.2 (Zhang & Yuan, 2018) indicated reasonable power to detect small but significant intervention effects (see Supplementary Material S3, https://osf.io/pb2zn/). Specifically, to test the immediate intervention effects, we chose the outcome variables at t2 and t3, given that these two measurement points were close to each other. To test the follow-up effects, we chose the outcome variables at t4.
All variables were standardized on the dependent variables; thus, the path coefficients can be interpreted as differences in SD units (Cohen's d). We used intention to treat (ITT) analyses that included all children who received the pretest, irrespective of how many sessions they had actually taken. We performed all analyses using the R packages "lavaan" version 0.6-6 (Rosseel, 2012), "semTools" version 0.5-3 (Jorgensen et al., 2020), and "semPlot" version 1.1.2 (Epskamp, 2015), utilizing maximum-likelihood estimation and treating missing data via the full-information maximum-likelihood procedure (Enders, 2010). Supplementary Materials S3 and S4 (https://osf.io/pb2zn/) provide the respective syntax and output of the analyses in R.
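To make the interpretation of a coefficient "in SD units" concrete, the sketch below standardizes an outcome on its overall standard deviation and then takes the difference between group means. This is a simplified illustration, not the lavaan models used in the study: it ignores the pretest covariate and the latent-variable structure, and the function name and scores are hypothetical.

```python
from statistics import mean, pstdev

def standardized_group_difference(treated, control):
    """Mean difference between groups after z-scoring the outcome on the full sample."""
    all_scores = list(treated) + list(control)
    m, sd = mean(all_scores), pstdev(all_scores)
    z_treated = [(x - m) / sd for x in treated]
    z_control = [(x - m) / sd for x in control]
    # With a 0/1 treatment dummy, this difference equals the regression slope
    # of the standardized outcome on the dummy (no covariates).
    return mean(z_treated) - mean(z_control)

# Hypothetical posttest scores, for illustration only.
treated_scores = [12, 14, 16]
control_scores = [10, 12, 14]
effect = standardized_group_difference(treated_scores, control_scores)
print(round(effect, 2))
```

Note that Cohen's d is conventionally computed with the pooled within-group standard deviation; standardizing on the overall outcome SD, as here, gives a closely related but not identical quantity when the groups differ.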
The data presented in this study have a partially nested design; that is, while students who were assigned to the intervention condition received the intervention in small groups (four to six students per group, 12 groups in total), students in the control condition did not. Several univariate and multivariate approaches have been proposed to efficiently consider this data structure, such as multigroup multilevel structural equation modeling (e.g., Candlish et al., 2018; Sterba et al., 2014). Although we intended to do so, we were unable to model the partially nested structure of the data explicitly; given the small and unbalanced sample sizes at the cluster level, the parameters derived from such models were unacceptably large and did not result in reliable estimates of the intervention effects.
To evaluate the fit of the structural equation models that contained latent variables, we considered the common guidelines for model fit. These guidelines suggest an acceptable fit to the data if the Comparative Fit Index (CFI) exceeds .95, the Root Mean Square Error of Approximation (RMSEA) is less than .08, and the Standardized Root Mean Square Residual (SRMR) is less than .10 (Hu & Bentler, 1999; Marsh et al., 2005). We further tested for the invariance of the measurement models over time by specifying a configural, a metric, and a scalar invariance model, and comparing them against each other (T. D. Little, 2013). The more constrained model could be retained if the CFI did not decrease by more than .010, the RMSEA did not increase by more than .015, and the SRMR did not increase by more than .030 after introducing the equality constraints on factor loadings and intercepts to the model (see Khojasteh & Lo, 2015; Putnick & Bornstein, 2016).
Table 1 shows the means, standard deviations, and reliabilities for all measures. The distribution of the variables was acceptable, except for subtraction, which had a floor effect at pretest. Supplementary Material S1 exhibits the full correlation matrix of these variables.
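The fit guidelines and the invariance decision rule stated in this section can be written down as two simple checks. The sketch below is only an illustration of those cutoffs; the function names and the fit values in the example dictionaries are hypothetical and do not come from the study's output.

```python
def acceptable_fit(cfi, rmsea, srmr):
    """Fit guidelines as stated in the text: CFI > .95, RMSEA < .08, SRMR < .10."""
    return cfi > 0.95 and rmsea < 0.08 and srmr < 0.10

def retain_constrained_model(less_constrained, more_constrained):
    """Retain the more constrained model if fit does not deteriorate beyond
    the stated limits: CFI drop <= .010, RMSEA rise <= .015, SRMR rise <= .030."""
    cfi_drop = less_constrained["cfi"] - more_constrained["cfi"]
    rmsea_rise = more_constrained["rmsea"] - less_constrained["rmsea"]
    srmr_rise = more_constrained["srmr"] - less_constrained["srmr"]
    return cfi_drop <= 0.010 and rmsea_rise <= 0.015 and srmr_rise <= 0.030

# Hypothetical fit indices for a configural and a metric invariance model.
configural = {"cfi": 0.990, "rmsea": 0.040, "srmr": 0.040}
metric = {"cfi": 0.985, "rmsea": 0.045, "srmr": 0.050}
print(acceptable_fit(**configural))
print(retain_constrained_model(configural, metric))
```

In practice the comparison would be applied twice, first from the configural to the metric model and then from the metric to the scalar model.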

Measurement invariance testing.
Given that the indicators of the models of approximate number sense and word problem solving were closely related (see Supplementary Material S1 and S3, https://osf.io/pb2zn/), we represented these constructs as latent variables. When testing the invariance of the measurement model, we found that scalar invariance held between groups for the construct of word problem solving. Hence, group comparisons were not affected by the differential functioning of the indicators of the two constructs (Millsap, 2011), and sufficient comparability over time was evident for the structural equation modeling of the intervention effects. As for approximate number sense over time, we found support for metric invariance (see Supplementary Material S2, https://osf.io/pb2zn/). This finding applied to both sets of measurement occasions: t1, t2, and t3, and t1, t2, and t4. Scalar invariance held for both sets across the two groups (control vs. intervention group). Thus, for word problem solving and approximate number sense, we carried out the analyses using latent variables. For early numeracy and arithmetic, we did not obtain measurement invariance. Due to the lack of invariance and the poor fit of the models with latent variables, we carried out these analyses on the observed manifest variables only (for details, see Supplementary Material S1, https://osf.io/pb2zn/).

Structural Equation Modeling of the Intervention Effects
Early numeracy skills
First and second posttest. As mentioned, we used counting skills and numerical relations as manifest scores due to a lack of invariance and examined the effects of the intervention on each separately. The resulting models were exactly identified and had a perfect fit to the data. Figure 1 shows the effects at the first posttest (immediately after 8 weeks of training three times per week) and at the second posttest (after an additional 6 weeks of training once a week). Overall, the intervention effects were small and nonsignificant (counting skills: d = −0.09, p > .10 for both t2 and t3; numerical relations: d = 0.17 for t2 and d = −0.03, p > .10 for t3).
Follow-up test 6 months after intervention. We specified a model like the one presented in Figure 1, except with t4 instead of t3. The model was exactly identified and thus had a perfect fit to the data. The treatment effects for t4 were not significant (counting skills: d = 0.02, p = .90; numerical relational skills: d = −0.19, p = .23).

Word problems
First and second posttest. Since we obtained invariance, the model for word problems consisted of one latent variable, with word problems from WISC-IV and from the test developed for the study as indicators. Figure 2 shows the effects of the intervention on word problems at the first posttest and second posttest. This model exhibited an excellent fit to the data: χ²(9) = 6.69, p = .67, RMSEA = .000 (90% CI [.000, .198]), CFI = 1.000, SRMR = .036. There was a significant and moderate effect at first posttest (d = 0.41, p < .05). However, when children received the sessions only once a week, the effects faded out at second posttest (d = 0.04, p = .81). Notably, after correcting for multiple significance tests using the Benjamini-Hochberg procedure (Benjamini & Hochberg, 1995), results were no longer significant (p = .15).
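The Benjamini-Hochberg adjustment mentioned above can be sketched in a few lines. The set of p values entering the study's correction is not listed in this section, so the four p values below are hypothetical and serve only to show how an unadjusted p below .05 can exceed .05 after adjustment; the function name is also an assumption.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical unadjusted p values for four outcome tests.
raw_p = [0.04, 0.20, 0.50, 0.81]
adj_p = benjamini_hochberg(raw_p)
print([round(p, 2) for p in adj_p])
```

In this toy example the smallest p value is multiplied by the number of tests (4) and is no longer below .05 after adjustment, which mirrors the pattern reported for word problems.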

Note. STDY parameters shown. The variable treatment is binary (1 = intervention group, 0 = control group). ANS = approximate number sense. *p < .05. **p < .10, ns = statistically not significant (p > .10).

Discussion
Overall, the intervention produced positive benefits (d = 0.20) on early numeracy learning, but these were not significant. There were moderate and significant effects on word problem solving (d = 0.41), but after correcting for multiple significance tests, results at posttest for word problems were no longer significant. In addition, effects on all four outcome measures were reduced and faded out at the second test (after the second intervention phase) and at the follow-up test (6 months after the intervention) compared to the immediate posttest. Fade-out effects indeed took place for all four outcome variables, indicating that the second phase of the intervention did not successfully prevent or ameliorate such effects.

Immediate Intervention Effects and Transfer Effects
Given the efforts to construct and implement an intervention for children who struggle with numeracy skills, our results can be considered somewhat disappointing. There was a significant effect on word problems, but as mentioned, this was no longer significant after controlling for multiple significance tests. However, it should be noted that whether such a procedure should be employed is debatable (e.g., see Gelman et al., 2012; Rothman, 1990). Indeed, it has been argued that these kinds of correction procedures are too conservative and lead to elevated levels of Type II errors (Rothman, 1990).
As for the reasons behind the rather weak effect found in our study, one is that the power level was based on overly optimistic assumptions of how large the effects would be, and we did not have sufficient power to detect the effects around 0.2 to 0.3 Cohen's d. Another reason is that the duration of the intervention in our study was 8 weeks, three times per week. Although this intervention intensity was comparable to many other studies (see Chodura et al., 2015;Dennis et al., 2016), our intervention study comprised a mere 24 sessions compared to 48 sessions in Fuchs et al. (2005), an additional 30 hours on top of typical classroom instruction in Gersten et al. (2015), and 50 sessions in the "ROOTS" program by Clarke et al. (2016). A third reason could be that, since the children did not have particularly severe mathematical learning difficulties, some of the intervention content may have been too easy for them.
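The power argument can be illustrated with a rough normal-approximation calculation for a simple two-group comparison. This is not the WebPower analysis reported for the study: it assumes 60 children per group (120 randomized into equal groups), a two-sided test at α = .05, and no covariates or clustering. Adjusting for the pretest, as the actual models do, would raise power somewhat.

```python
from statistics import NormalDist

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z test for effect size d."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic under the alternative.
    noncentrality = d * (n_per_group / 2) ** 0.5
    return z.cdf(noncentrality - z_crit) + z.cdf(-noncentrality - z_crit)

# Assumed group size of 60; effect sizes in the range discussed in the text.
for effect_size in (0.2, 0.3, 0.5):
    print(effect_size, round(approx_power(effect_size, 60), 2))
```

Under these assumptions, power for effects of 0.2 to 0.3 stays well below the conventional .80 benchmark, which is consistent with the concern raised in the paragraph above.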
Considering the effect on word problems, it should be noted that word problems were not directly trained in the intervention, so this may support theories of knowledge transfer, at least within the same domain (Taatgen, 2013). One reason for this effect could be that improving children's numeracy and arithmetic skills in general will help them solve word problems. It may also be that, in small groups in which the teachers provided explicit instructions, the activity of solving and reasoning about mathematical tasks (and using expressive language to talk about mathematical problems) enhanced the children's quantitative language skills, thereby helping them to solve word problems on their own. However, results for word problems must be interpreted with the caveat that their effects were no longer present after correcting for multiple significance tests.

Second Intervention Phase
As for follow-up versus immediate effects, the effects in the current study faded when the initial weeks of the intervention had ended. In previous studies examining intervention in young at-risk children, only one of the randomized controlled trials has reported results on follow-up effects. This trial also showed clear fade-out effects. This is particularly problematic for interventions that build early numeracy skills because most children are likely to eventually acquire at least minimal levels of these skills soon after entering school. Indeed, much of the fade-out effect in early childhood interventions has been attributed to this type of catch-up among the larger population of children (Bailey et al., 2016). Thus, fade-out effects have important implications for teaching. Early interventions do not imply that the children's challenges are solved but that children who experience problems are likely to need interventions regularly so that they do not fall back into a lower developmental trajectory.
As for the reasons why intervention effects fade out, our study was not designed directly to examine the nature of fade-out effects. Considering the constraining content hypothesis (Bailey et al., 2016), our study attempted to sustain the intervention effect by adding a second intervention phase. However, our study did not incorporate environmental factors such as how teachers could build on the skills the children had learned after the intervention ended or how they were instructed in ordinary classroom settings. We also did not plan how teachers could sustain the effects after the intervention ended. Furthermore, the preexisting differences hypothesis suggests that fade out is due to stable, underlying characteristics in mathematics that cause children to revert to their previous individual trajectories (Bailey et al., 2016). If this is the case, mathematical learning difficulties are hard to ameliorate, and this may also be related to the intensity of our intervention. To tackle these preexisting differences, the intervention should be maintained for longer if the new trajectories are to be sustained.

Recommendations for Future Studies and Conclusion
In future studies, it will be important to conduct more well-controlled and well-powered studies to increase our knowledge about how difficulties in learning mathematical skills can be prevented and ameliorated. Moreover, the fade-out effects in this and other studies underline that future studies should be designed with interventions featuring more sessions and over longer periods, as well as interventions that pause for a certain period to discover how more persistent effects might be achieved. For instance, an intervention could be implemented in blocks (with multiple periods of intervention phases) to see if it is possible to prevent fade out. Indeed, one recent language intervention has successfully applied such a procedure (Hagen et al., 2017). Furthermore, it could also be important to include active control groups to control for nonspecific effects (i.e., that the intervention group is given more attention than the control group).
Early numeracy skills and early mathematical development can be seen as gatekeeper skills. Children with low performance in early numeracy are at risk of facing learning difficulties in mathematics. From this and other studies, it is clear that mathematical skills can be improved despite high stability in rank order between children. However, it is important to note that improvement will require great effort, and that most studies have inadequate intervention intensity to achieve this, particularly in the long run. Even though our study included a repetition phase, this was not sufficient to gain lasting improvements. Thus, to achieve lasting improvements, future studies are likely to need a new continuous take on interventions, with only short breaks in between each phase. Ultimately, it seems unlikely that this type of 10- to 20-week intervention will be helpful for most children who struggle.

Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.