Equal opportunities for all? Analyzing within-country variation in school effectiveness

The present study addresses the equality of school effectiveness across schools. One central aim of compulsory education is for students to learn equally well at all schools in a country even if these schools differ in terms of student composition. However, measuring equality of learning gains independently from selection effects usually requires longitudinal data. This study suggests a different approach and proposes a new measure for the equality of school effectiveness across schools. We applied a two-level regression discontinuity approach to estimate the between-school variation in added-year effects on mathematics and science achievement that result from an additional year of schooling, after controlling for the effects of age and student selection (i.e., between-school differences in achievement levels). We utilized data from a total of 13 samples from Nordic and other European countries that assessed students from two adjacent grades at the same schools. The samples stemmed from TIMSS 1995 and 2015 and covered both primary and secondary school levels. The main findings indicated that although schools differed in initial achievement levels in all samples, schools in some countries, such as Norway and Cyprus, attained a high degree of equality of school effectiveness (i.e., of the effect of an additional year of schooling). Although schools with a more privileged student composition had higher achievement levels than less privileged schools, their school effectiveness did not usually differ significantly. Both age and an additional year of schooling had positive effects on mathematics and science achievement; however, effect sizes differed considerably between the 13 samples. We discuss the implications of the proposed school effectiveness measure, which is based on a regression discontinuity approach.
We conclude that countries, such as Nordic ones, should consider extending their participation in international large-scale assessments with additional grades from the same schools in future cycles. This design would enable a multitude of robust school effectiveness studies in the future.

There are numerous reasons why students have differing learning opportunities. The students' socioeconomic background is among the most important ones, with children of highly educated, well-earning parents also being privileged in terms of educational achievement and attainment (Jerrim & Macmillan, 2015; Sirin, 2005; Strietholt et al., 2019). Family background disparities are, by implication, very difficult to overcome for both education stakeholders and education systems. However, one aim of compulsory primary and secondary education is to foster learning for all students, independent of their family backgrounds (Jackson, 2013; Marks, 2005; Schütz et al., 2008). The social-democratic Nordic countries, in particular, aim for their schools to provide the same learning opportunities to all students and to foster learning outcomes equally across schools (Blossing et al., 2014; Frønes et al., 2020; Yang Hansen et al., 2014).
If an education system did, indeed, succeed in fulfilling the goal of providing the same learning opportunities at all schools, then school effectiveness measures should not vary between schools. Clearly, the ambition to have equally effective schools does not simply mean that all schools have the same mean achievement outcomes. Different student enrollment policies typically result in different student compositions across schools. Regardless of whether these policies relate to geographical catchment areas, merit-based selection, or more direct monetary mechanisms, the likely result is that schools will end up with different socioeconomic student compositions. This is important because the socioeconomic composition of the student body is associated with the students' initial mean achievement levels (Mullis et al., 2017, 2020; OECD, 2019b). Instead, equality of school effectiveness implies that students should make the same learning progress, independent of their school's socioeconomic composition. Hence, ideally, the learning gains per school year (a measure for school effectiveness) should be the same across all schools in a given country even if they do vary in terms of their student bodies (i.e., student selection). We consider equality of school effectiveness to be an important facet of educational effectiveness because it depicts whether some students are left behind at their schools (Bosker & Scheerens, 1994; Chapman et al., 2012; Townsend, 2007). This is also relevant information for policymaking because the policies that affect the allocation of highly qualified teachers and adequate resources for all schools, as well as other school-level policies, can more easily be changed through political reforms than, for instance, the socioeconomic backgrounds of students.
One way to disentangle school effectiveness from student selection effects at the school level is to use longitudinal studies with large sample sizes. In such studies, the between-school differences in the learning progress found between measurement points can be isolated from the achievement differences that had already existed at the first point of measurement. Such large-scale longitudinal studies are available in a number of countries (see overview in Blossfeld et al., 2019). Some of these countries also extend their participation in cross-sectional international assessments with a longitudinal follow-up measurement (e.g., OECD, 2012;Prenzel et al., 2006). Others, such as the Nordic countries, are able to longitudinally connect central data registers of their entire populations with results from standardized achievement tests and teacher-set grades (see e.g., Raeder et al., 2020;Figlio et al., 2016). Without a doubt, such longitudinal designs provide unique research opportunities for tracing school effectiveness differences between schools. However, these studies have some key disadvantages. They are expensive and

Utilizing regression discontinuity designs to estimate the equality of school effectiveness
The basic principle of the regression discontinuity approach is that small, random differences in the so-called running or forcing variable make a pronounced difference in terms of treatment because the so-called scoring rule applies (Angrist & Pischke, 2009; Imbens & Lemieux, 2008). In the present study, this means that small differences in the birth date (running variable age) decide whether a student is placed in a lower grade (control group) or upper grade (treatment group) due to school entry regulations that have fixed birth-date-related thresholds (scoring rule). In Norway, for instance, December-born children enter school a whole year earlier than children who are born only one month later, in January, because of school entry policies. This is illustrated in Fig. 1, where the age in months determines whether students attend the lower or upper of two adjacent grades. In terms of the outcome (the achievement in a standardized test), both older age (running variable) and an added year of schooling (treatment) should have positive effects. Due to brain maturation processes, older children should perform better on tests than their younger classmates (age effect). Children who go to school one year longer should outperform schoolmates in a lower grade due to the learning effects of formal education (added-year effect).
Fig. 1 Schematic illustration of the regression discontinuity approach. Figure displays the results of the regression (dashed line) of achievement scores (y-axis) on age in months (x-axis) and an added year of schooling in a sample of students (grey dots) from two adjacent grades. Within grades, the slope reflects the effect of age. The discontinuity in the regression reflects the effect of an added year of schooling.
Steinmann and Olsen, Large-scale Assessments in Education (2022) 10:2
In the regression discontinuity approach, the outcome (achievement score) is regressed on both the treatment (added year of schooling) and the running variable (age in months). The discontinuity of this regression at the scoring rule threshold reflects the treatment effect, i.e., the added-year effect. In Fig. 1, this regression is depicted as a dashed line. This so-called sharp regression discontinuity approach can be applied if enough students comply with the scoring rule, i.e., attend the formally correct grade in accordance with their birth date and school entry regulations. This is the case in education systems that have strict school entry regulations as well as rare grade retentions and accelerations. Other prerequisites that must be met to apply regression discontinuity designs (Schochet et al., 2010; see Appendix A) are usually fulfilled when regressing achievement on age and grade (Luyten, 2006;Luyten et al., 2017).
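The sharp regression discontinuity regression described above can be sketched on simulated data. This is a minimal illustration, not the estimation used in the study: the parameter values (intercept 500, added-year effect 30, age effect 1.5 per month), the sample size, and the helper names `ols` and `simulate_students` are assumptions made for the example, and a plain single-level OLS fit stands in for the weighted models reported later.

```python
import random

def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    M = [[sum(row[a] * row[b] for row in X) for b in range(k)]
         + [sum(row[a] * yi for row, yi in zip(X, y))] for a in range(k)]
    for col in range(k):
        # Partial pivoting for numerical stability.
        piv = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(k):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [v - f * p for v, p in zip(M[r], M[col])]
    return [M[i][k] / M[i][i] for i in range(k)]

def simulate_students(n_per_month=100, intercept=500.0, added_year=30.0,
                      age_slope=1.5, resid_sd=10.0, seed=1):
    """Hypothetical data: all students comply with the scoring rule."""
    rng = random.Random(seed)
    rows = []
    for age in range(-11, 13):        # age in months relative to the threshold
        d = 1 if age >= 1 else 0      # sharp rule: upper grade if and only if age >= 1
        for _ in range(n_per_month):
            score = (intercept + added_year * d + age_slope * age
                     + rng.gauss(0, resid_sd))
            rows.append((d, age, score))
    return rows

rows = simulate_students()
X = [[1.0, d, a] for d, a, _ in rows]   # columns: constant, treatment, running variable
y = [score for _, _, score in rows]
b0, b1, b2 = ols(X, y)
print(f"intercept={b0:.1f}  added-year effect={b1:.1f}  age effect per month={b2:.2f}")
```

The coefficient on the treatment dummy is the estimated discontinuity at the threshold, while the coefficient on age is the within-grade slope.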
The findings of these studies can be summarized as follows. Both age and schooling usually have positive effects on scores in more general cognitive as well as academic achievement tests. However, the literature provides rich and nuanced information beyond this simple fact. First, effect sizes vary considerably between countries (Luyten, 2006;Luyten & Veldkamp, 2011;Singh, 2020;Webbink & Gerritsen, 2013). Moreover, based on studies that are repeated at regular intervals, the age effects fluctuate considerably over time within the same country (Olsen & Björnsson, 2018).
Second, effect sizes of age and added-year effects vary between younger and older students. Over their school career, both age and added-year effect sizes typically decrease (Cliffordson, 2010;Luyten et al., 2017;Olsen & Björnsson, 2018;Webbink & Gerritsen, 2013). However, this finding might also result from an increasing between-student variability in test scores, which would mean that absolute age and added-year effects are evaluated against a larger variance in older students. In one study with vertically linked scores, the schooling effect did not decrease over the school career (Kyriakides & Luyten, 2009).
Third, the literature indicates interesting differences between test domains. In comparison to tests on academic achievement domains that are closely tied to school curricula, more general cognitive ability tests indicate smaller added-year effects (Cahan et al., 2008;Cliffordson & Gustafsson, 2008;Luyten et al., 2017). The obvious explanation for this finding is that schooling seems to have greater effects in areas that are explicitly taught at school (e.g., mathematics, reading) than on more general cognitive abilities (e.g., generic problem solving). Furthermore, two studies found that age effects on cognitive ability test scores decreased more over the school career than age effects on academic achievement test scores (Kyriakides & Luyten, 2009;Luyten et al., 2017). In terms of the added-year effect, no such differences between the two test domains were revealed (Luyten et al., 2017).
Extending the regression discontinuity approach to measure the equality of school effectiveness
The generic example of regression discontinuity in Fig. 1 provides a simple illustration of how this approach effectively disentangles the effects of age and schooling. By translating this approach to a multi-level framework, however, one can further separate the between-school differences in school effectiveness from student selection effects. This approach is illustrated in Fig. 2. Just as before, an outcome (achievement score) is regressed on the treatment (added year of schooling) and on the running variable (age in months). A two-level regression model includes a random intercept, a random slope for the added-year effect, and a fixed slope for the age effect. Hence, the model captures the differences in intercepts and added-year effects between schools, while the age effect is the same across schools. This is illustrated for three schools in Fig. 2. Example school A starts off with a higher achievement level in the lower grade than schools B and C (cf. intercepts in Fig. 2) but has a smaller added-year effect (cf. regression discontinuities in Fig. 2). School B has the lowest achievement levels in the lower grade but the largest added-year effect. Age effects (cf. slopes in Fig. 2) are the same at all schools.
Fig. 2 Schematic illustration of between-school variation in student selection and added-year effects. Figure displays the results of the regression (dashed lines) of achievement scores (y-axis) on age in months (x-axis) and an added year of schooling in a sample of students from two adjacent grades at three schools (A, B, and C). The between-school variation in the intercepts reflects differences in student selection. The between-school variation in the added-year effects (regression discontinuities) reflects unequal school effectiveness.
The two-level regression discontinuity approach allows for the estimation of the degree to which schools vary in their school effectiveness (i.e., the added-year effect), while controlling for the between-school differences in student selection (i.e., the intercept) that already preexisted in the lower grade. Although it is probably very difficult for countries to reduce the between-school variation in selection mechanisms (i.e., intercepts), which occur due to, for instance, residential segregation, countries can nevertheless aim to minimize the between-school variation in school effectiveness. Similarly, it is probably unavoidable for schools with a privileged student composition to have higher achievement intercepts (Mullis et al., 2017, 2020; OECD, 2019b). However, countries can aim to attain the same added-year slopes at all schools, regardless of their socioeconomic composition.
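The three-school pattern described for the schematic can be made concrete with a few lines of arithmetic. The numbers below are purely hypothetical and were chosen only to mirror the qualitative description (school A: highest intercept, smallest added-year effect; school B: lowest intercept, largest added-year effect; a common age slope); `SCHOOLS`, `AGE_SLOPE`, and `predicted_score` are illustrative names.

```python
# Hypothetical school-specific parameters (not estimates from the study).
AGE_SLOPE = 1.0  # common age effect per month, identical at all schools
SCHOOLS = {
    "A": {"intercept": 540.0, "added_year": 15.0},
    "B": {"intercept": 470.0, "added_year": 40.0},
    "C": {"intercept": 500.0, "added_year": 25.0},
}

def predicted_score(school, age, upper_grade):
    """Predicted achievement: school intercept + school added-year effect + age effect."""
    p = SCHOOLS[school]
    return p["intercept"] + p["added_year"] * upper_grade + AGE_SLOPE * age

for name in SCHOOLS:
    lower = predicted_score(name, age=0, upper_grade=0)   # oldest lower-grade student
    upper = predicted_score(name, age=1, upper_grade=1)   # youngest upper-grade student
    # The jump between the two students is the added-year effect plus one month of age.
    jump = upper - lower - AGE_SLOPE
    print(f"school {name}: lower={lower:.0f}, upper={upper:.0f}, added-year effect={jump:.0f}")
```

In this toy setup, the spread of the intercepts corresponds to student selection and the spread of the added-year effects corresponds to unequal school effectiveness.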

Previous two-level regression discontinuity findings
The variation in the added-year effects between schools, which serves as the central measure of interest in the present study, has rarely been investigated in the past. Only six of the above-mentioned regression discontinuity studies applied a multi-level extension to investigate the between-school variation in intercepts and added-year slopes.
One study that investigated age and added-year effects on mathematics and reading outcomes between grades 7 and 8 in 53 schools in the US found significant between-school variance in both intercepts and added-year slopes (Ali & Heck, 2012). At schools with a disadvantaged socioeconomic student composition, both intercepts and added-year slopes were lower than in schools with a more privileged student body. Another US study of the mathematics and reading domains also found significant between-school variations in both age and added-year effects in grade 4 and 5 students from 198 schools (Heck & Moriyama, 2010). In contrast to the first study, however, it found no significant correlation between the schools' added-year effects and socioeconomic compositions. Both studies were based on complex approaches, where age and added-year effects were embedded into multi-level structural equation models with multiple predictors and interaction terms at student and school levels. Hence, the interpretation of the results is not straightforward.
A multi-level regression discontinuity study that used data from 15-year-old students from 270 schools in England found a significant between-school variation in intercepts but not in added-year effects (Luyten et al., 2008). In complex models with multiple predictors, schools with disadvantaged student compositions showed lower intercepts but higher added-year effects than schools with more privileged student bodies. Another study that used primary school level data from England found that 18 schools varied significantly in both intercepts and added-year effects, but it did not investigate associations with school composition.
In a multi-level regression discontinuity study that used data from six secondary school grades at six Cypriot schools, the schools significantly differed in added-year effects on mathematics but not on Greek language or cognitive ability tests (Kyriakides & Luyten, 2009). The study, possibly due to the small sample of schools, found no significant association between added-year effects and any student background characteristics. Another study used primary school level data from eight countries that participated with two adjacent grades in the Third International Mathematics and Science Study (TIMSS) in 1995 (Luyten, 2006). In all countries, both intercepts and added-year effects on mathematics and science test scores varied between schools, but this variation was not significant in all countries. The between-school variations of intercepts exceeded those of the added-year effects. Interestingly, the countries showed very different degrees of between-school variation of added-year effects. In none of the countries did the schools with a more privileged student body significantly differ from others in terms of added-year effects. However, these models contained numerous further variables at both the student and school levels, which again complicates the interpretability of their estimates.
In summary, existing studies showed that schools usually differ in intercepts internationally, i.e., achievement levels in lower grades. Furthermore, schools with a more privileged student composition were found to have higher intercepts than others. These findings are unsurprising because achievement intercepts reflect the fact that schools recruit from residential areas that are differently composed as well as because schools might have different school effectiveness levels in prior grades. In contrast, not all studies found pronounced between-school variation in added-year effects. While this might relate to the sometimes small school sample sizes, among other issues, it could also reflect the fact that some countries, indeed, manage to attain a very similar school effectiveness across schools. In most studies, school composition was not significantly associated with the schools' added-year effects. However, these studies included a multitude of additional predictor variables and different interaction terms at individual and school levels in their models, which may obscure and complicate the interpretability of their findings.

Advantages of the regression discontinuity approach for investigating the equality of school effectiveness
The previous literature has demonstrated that the multi-level regression discontinuity approach can be used to investigate the equality of school effectiveness by modeling the between-school variation of added-year effects. In comparison with alternative longitudinal designs, this approach has several advantages (Angrist & Pischke, 2009). First, longitudinal analyses are more demanding to carry out, both in terms of cost and the time that elapses before data can be analyzed. For the regression discontinuity approach, in contrast, student data of at least two adjacent grades can be collected at the same point in time. Second, the regression discontinuity approach does not suffer drawbacks from students changing schools and other sample attrition causes as longitudinal studies usually do. Third, it is much easier to repeat the same regression discontinuity study after a few years in order to obtain trend information than to repeat a longitudinal study. Fourth, it is much easier to conduct the same regression discontinuity study in multiple countries in order to enable a comparison of the equality of school effectiveness, as was done in TIMSS 1995. Unfortunately, the subsequently repeated TIMSS cycles (Trends in International Mathematics and Science Study) only targeted one grade per school in their international design. In addition, it is important to emphasize that studies that did directly compare the results of longitudinal and regression discontinuity approaches found that these reached very similar conclusions at the country level (Perry, 2017; Singh, 2020). Nevertheless, the regression discontinuity approach is also associated with potential limitations. The most important one for the current study is that its design requires large complier proportions, i.e., large shares of students who are in the formally correct grade given their birth date and school entry regulations.
In countries with strict school entry policies and low rates of grade repetition and acceleration, such as the Nordic countries, the regression discontinuity approach is a promising alternative to longitudinal designs for obtaining robust estimations of the inequality of school effectiveness.

The present study
This study aims to disentangle the between-school variation in school effectiveness from the between-school variation in student selection. We applied the above-described two-level regression discontinuity approach to estimate the between-school variation in added-year effects (school effectiveness) and intercepts (student selection). Furthermore, this study investigates whether the schools' added-year effects and intercepts differ depending on their socioeconomic student composition. This study uses Norway as a showcase with other carefully selected Nordic and European countries serving as a basis for comparison. Altogether, 13 samples from 7 countries, 2 grade levels, and TIMSS 1995 and 2015 are investigated, all of which were identified as being compliant with the standards for conducting regression discontinuity analyses. On this basis, the current paper demonstrates how the equality of school effectiveness measure can be compared between countries and over a 20-year interval. Furthermore, the study includes both primary and secondary school levels. However, their scores are not directly comparable because they were scaled independently. Complementing previous research, the central strength of this study is that we ran simpler and more targeted models to identify the between-school variation in school effectiveness. In addition, the reanalyzed data from previous studies are complemented with an analysis of new data.

Data
The main data source of interest came from Norway's extended participation in TIMSS 2015, where the international sampling design was complemented by the assessment of 2 adjacent grades at the same schools and at both primary (grades 4 and 5) and secondary (grades 8 and 9) school levels. To compare these findings with other relatively similar countries and with an earlier point in time, we added 11 primary and secondary school samples from TIMSS 1995 in which 2 grades from the same schools were assessed. This study partly uses the same datasets as Luyten (2006). We applied a strict criterion for selecting countries from the TIMSS 1995 study in order to ensure that only samples with a high share of compliers were included: the proportion of students in the formally correct grades, given their birth dates and the countries' school entry regulations, was at least 93% in the samples included for analysis. In some of the countries excluded from the analysis, such as France, Germany, and Ireland, the shares of compliers were well below 80%. In addition, we tested the assumptions for regression discontinuity designs for all samples (see below). Table 1 illustrates the sample preparation for the 13 selected samples. First, we identified those students with complete birth and test date information who attended schools that assessed two adjacent grades. Second, we identified the share of compliers (see columns 1-3 in Table 1). Third, we only included students from school samples with at least 10 lower and 10 upper grade students in order to enable analyses at the school level (see columns 4-11 in Table 1). Apart from the Norway 1995 secondary school sample, the number of students exceeded 2,000 and the number of schools exceeded 50 in the analysis samples. With the exception of the Sweden 1995 secondary school sample, the lower and upper grades contained approximately balanced numbers of students.
It should be noted here that there are differences in grade labels and mean age (see columns 7, 8, 10, and 11 in Table 1) and that these reflect different policies for school starting ages in the selected countries.
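The sample preparation steps described above (share of compliers, then the minimum of 10 students per grade and school) can be sketched as follows. The record layout (`school`, `grade`, `expected_grade`) and the function names are hypothetical, and the sketch assumes that the per-grade minimum is counted among compliers, which the text does not state explicitly.

```python
from collections import Counter

def complier_share(students):
    """Share of students whose attended grade equals the formally correct grade."""
    return sum(s["grade"] == s["expected_grade"] for s in students) / len(students)

def analysis_sample(students, min_per_grade=10):
    """Keep compliers at schools with enough compliers in every assessed grade.

    Assumption: the threshold is applied to complier counts per school and grade.
    """
    compliers = [s for s in students if s["grade"] == s["expected_grade"]]
    counts = Counter((s["school"], s["grade"]) for s in compliers)
    grades = sorted({s["grade"] for s in compliers})
    keep = {s["school"] for s in compliers
            if all(counts[(s["school"], g)] >= min_per_grade for g in grades)}
    return [s for s in compliers if s["school"] in keep]

def make(school, grade, n, expected=None):
    """Build n toy student records (hypothetical helper for this example)."""
    return [{"school": school, "grade": grade, "expected_grade": expected or grade}
            for _ in range(n)]

students = (make("S1", "lower", 12) + make("S1", "upper", 11)
            + make("S1", "lower", 1, expected="upper")   # one non-complier
            + make("S2", "lower", 12) + make("S2", "upper", 6))

share = complier_share(students)
sample = analysis_sample(students)
print(f"complier share: {share:.3f}")
print("schools in analysis sample:", sorted({s["school"] for s in sample}))
```

In the toy data, school S2 is dropped because it has fewer than 10 upper-grade students, and the single non-complier at school S1 is removed before the per-grade counts are taken.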

Outcome variables
The outcome variables were student achievement scores in mathematics and science. In both TIMSS 1995 and 2015, students responded to standardized paper-pencil-based tests that contained multiple choice and open-ended items (Martin et al., 2016; Martin & Kelly, 1996). Two age-adequate test versions for primary and secondary school students were used, respectively. Per age group, students from the lower and upper grades responded to the same tests, and the achievement scores were scaled to allow for the comparison of lower and upper grade students' scores. As achievement scores, TIMSS provided five plausible values per student, which were computed using conditioning techniques with test and background data. Per age group and achievement domain, the scores had an international mean of 500 and standard deviation of 100 (Martin & Kelly, 1997; Martin et al., 2016). Plausible values contained no missing values.

Predictor variables
We used two main predictor variables: added year of schooling (treatment) and age (running variable). Per country sample, we recoded the grade variable to an added-year dummy variable for treatment (1 = upper grade) versus control (0 = lower grade) group membership. Age in months was determined based on the students' birth month and year relative to the test month and year. To facilitate the interpretability of intercepts in later regression models, we recoded the age in months relative to the threshold age between grades (0 = oldest students in lower grade, 1 = youngest students in upper grade) so that the age variable ranges between −11 (youngest students in lower grade) and 12 (oldest students in upper grade). Since we only included students attending formally correct grades, none of the students in the analysis samples were formally too young or old for their grade. As mentioned above, students with missing test or birth date information were excluded because their formally correct grades could not be determined. In the original samples, the share of missing date values was in the 0-2% range, except for the Scotland 1995 primary school sample (11%). The grade variable contained no missing values.
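The age recoding can be illustrated with a small sketch. It assumes a Norway-style entry rule in which each grade consists of one calendar-year birth cohort (so January-born students are the oldest in their grade), a hypothetical test date of May 2015, and hypothetical cohort years; the function names are illustrative, and other countries' cutoffs would require a different threshold.

```python
def age_in_months(birth_year, birth_month, test_year, test_month):
    """Age in whole months at the time of testing."""
    return (test_year - birth_year) * 12 + (test_month - birth_month)

def centered_age(birth_year, birth_month, test_year, test_month, lower_cohort_year):
    """Age relative to the oldest lower-grade student (coded 0).

    Assumption: calendar-year cohorts, so the oldest lower-grade student
    was born in January of lower_cohort_year.
    """
    threshold = age_in_months(lower_cohort_year, 1, test_year, test_month)
    return age_in_months(birth_year, birth_month, test_year, test_month) - threshold

def added_year_dummy(centered):
    """1 = upper grade (treatment), 0 = lower grade (control), for compliers."""
    return 1 if centered >= 1 else 0

# Hypothetical setup: test in May 2015, lower grade born in 2005, upper grade in 2004.
TEST = (2015, 5)
for label, (by, bm) in {"youngest lower": (2005, 12), "oldest lower": (2005, 1),
                        "youngest upper": (2004, 12), "oldest upper": (2004, 1)}.items():
    a = centered_age(by, bm, *TEST, lower_cohort_year=2005)
    print(f"{label}: age={a:+d}, added-year dummy={added_year_dummy(a)}")
```

Under these assumptions the recoded age spans 24 consecutive months, with the grade threshold falling between the values 0 and 1.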
Furthermore, we used the school-level aggregated number of books at students' homes as a proxy for socioeconomic school composition. In both TIMSS 1995 and 2015, the books at home variable was assessed through student questionnaires and ranged from 1 = none or very few (0-10 books) to 5 = enough to fill three or more bookcases (more than 200 books). We aggregated the books at home variable while removing missing values. Between 0% (in the Norway 1995 secondary school sample) and 16% (in the Greece 1995 primary school sample) of students in the analysis samples had not responded to the books at home questionnaire item. However, the aggregated number of books variable was available for all schools in all samples, except for two schools in the Scotland 1995 secondary school sample. These two schools were excluded from the analyses that included the socioeconomic school composition.
Table 1 Overview of countries and student samples. The complier samples contain students who were in the formally correct grade. The analysis samples contain those compliers who were at schools with at least 10 students in the lower and 10 students in the upper grade in the samples.

Statistical analyses
We conducted three sets of analyses. First, we ran regression discontinuity analyses with a student-level focus to illustrate general age and added-year effects on mathematics and science achievement and to ensure that regression discontinuity analyses prerequisites were met. Second, we ran school-level analyses to disentangle the between-school differences in school effectiveness and student selection. These constitute the central analyses in the present study. Third, we regressed this between-school variation in selection and added-year effects on schools' socioeconomic composition. We ran separate analyses for all 13 samples and for both outcome domains. We used Mplus version 8.6 (Muthén & Muthén, 2017) for the analyses and the R package MplusAutomation (Hallquist et al., 2018) to handle the Mplus results. We applied the student sampling weights "TOTWGT" to account for the stratified complex sampling designs (Meinck, 2020) and accommodated the nesting of students in schools in all analyses. The analyses were run separately for the five plausible values and the results were combined using Rubin's (1987) rules.
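Combining results across the five plausible values follows Rubin's rules: the pooled estimate is the mean of the per-analysis estimates, and the total variance adds the between-analysis variance, inflated by (1 + 1/m), to the mean within-analysis variance. The sketch below illustrates the arithmetic only; the function name `pool_rubin` and the input numbers are hypothetical and are not results from the study.

```python
import math
import statistics

def pool_rubin(estimates, std_errors):
    """Pool m estimates and their standard errors using Rubin's (1987) rules."""
    m = len(estimates)
    pooled = statistics.mean(estimates)
    within = statistics.mean([se ** 2 for se in std_errors])  # mean sampling variance
    between = statistics.variance(estimates)                  # variance across analyses (m - 1 divisor)
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)

# Hypothetical added-year estimates from five plausible-value analyses.
estimates = [30.0, 32.0, 31.0, 29.0, 33.0]
std_errors = [2.0, 2.0, 2.0, 2.0, 2.0]
pooled_estimate, pooled_se = pool_rubin(estimates, std_errors)
print(f"pooled estimate = {pooled_estimate:.2f}, pooled SE = {pooled_se:.3f}")
```

The pooled standard error exceeds each individual standard error because it also reflects the uncertainty carried by the variation between plausible values.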

Regression discontinuity analyses with student-level focus
In the present study, the regression discontinuity approach implies regressing the achievement score $Y$ of student $i$ on the added-year dummy $D$ and age $A$. To consider the clustering of students in schools, we ran a two-level regression discontinuity model where student $i$ was nested in school $j$:
$$Y_{ij} = \beta_0 + \beta_1 D_{ij} + \beta_2 A_{ij} + \varepsilon_{ij} \tag{1}$$
where $\varepsilon_{ij}$ denotes the student-level residual. This regression disentangles the general effect of an added year of schooling $\beta_1$ from the general effect of an additional month of age $\beta_2$ across schools (see Fig. 1). The intercept $\beta_0$ reflects the estimated achievement score of the oldest students in the lower grades ($A = 0$ and $D = 0$). This two-level model estimates the same intercept $\beta_0$, schooling effect $\beta_1$, and age effect $\beta_2$ across schools (i.e., fixed intercept and slopes). A number of prerequisites must be met to estimate these models and interpret them as robust regression discontinuity results (Angrist & Pischke, 2009; Schochet et al., 2010). These considerations and findings are summarized in Appendix A. From the results presented in Appendix A, we concluded that the regression discontinuity prerequisites were met and that the approach was applicable for all samples.

Regression discontinuity analyses with school-level focus
Since the main aim of the present study is to estimate the between-school variation in added-year and student selection effects within countries, we consequently extended the two-level regression model with random effects for the intercept and added-year slope:
$$Y_{ij} = \beta_{0j} + \beta_{1j} D_{ij} + \beta_2 A_{ij} + \varepsilon_{ij} \tag{2}$$
where $\varepsilon_{ij}$ denotes the student-level residual. Both the intercept and added-year slope were modelled to vary between schools, as follows:
$$\beta_{0j} = \gamma_{00} + u_{0j} \tag{3}$$
and
$$\beta_{1j} = \gamma_{10} + u_{1j} \tag{4}$$
Inserting (3) and (4) in (2) results in:
$$Y_{ij} = \gamma_{00} + \gamma_{10} D_{ij} + \beta_2 A_{ij} + u_{0j} + u_{1j} D_{ij} + \varepsilon_{ij} \tag{5}$$
In these models, the intercept $\gamma_{00}$ reflects the mean achievement score of the oldest students in the lower grades across schools and $u_{0j}$ reflects its between-school variation. The between-school variation in the intercept reflects all sorts of student selection differences between schools as well as prior school effectiveness differences. Therefore, the intercept can be expected to vary between schools in all countries. The parameter $\gamma_{10}$ reflects the mean effect of an additional year of schooling across schools and $u_{1j}$ reflects its between-school variation, i.e., the degree to which schools vary in the added-year effect on achievement. The added-year effect can be interpreted as a measure of school effectiveness, and the associated between-school variation can accordingly be regarded as an indicator for the equality of school effectiveness. Ideally, schools should not vary in this school effectiveness measure in education systems. The parameter $\beta_2$ reflects the general effect of age on student achievement. It should be noted here that this age effect does not vary between schools in the model. To be able to compare the results of the countries, we z-standardized the achievement plausible values for each country before running these school-level analyses.
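To build intuition for a model with a school-specific intercept and a school-specific added-year effect, the sketch below simulates such data and then fits a separate regression per school. This is a crude two-step stand-in, not the random-effects estimation run in Mplus: it lets the age slope differ by school instead of fixing it, ignores sampling weights, and the spread of per-school estimates overstates the true between-school variation because it also contains sampling error. All parameter values and function names are hypothetical.

```python
import random
import statistics

def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    M = [[sum(row[a] * row[b] for row in X) for b in range(k)]
         + [sum(row[a] * yi for row, yi in zip(X, y))] for a in range(k)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(M[r][col]))  # partial pivoting
        M[col], M[piv] = M[piv], M[col]
        for r in range(k):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [v - f * p for v, p in zip(M[r], M[col])]
    return [M[i][k] / M[i][i] for i in range(k)]

def simulate_school(rng, gamma00, gamma10, beta2, sd_u0, sd_u1, resid_sd, n_per_month=2):
    """One school: random intercept deviation u0 and added-year deviation u1."""
    u0, u1 = rng.gauss(0, sd_u0), rng.gauss(0, sd_u1)
    rows = []
    for age in range(-11, 13):
        d = 1 if age >= 1 else 0
        for _ in range(n_per_month):
            y = gamma00 + u0 + (gamma10 + u1) * d + beta2 * age + rng.gauss(0, resid_sd)
            rows.append((d, age, y))
    return (u0, u1), rows

def per_school_fits(n_schools=50, gamma00=0.0, gamma10=0.4, beta2=0.02,
                    sd_u0=0.5, sd_u1=0.2, resid_sd=0.8, seed=7):
    """Fit one OLS regression per school and return true and estimated added-year effects."""
    rng = random.Random(seed)
    fits = []
    for _ in range(n_schools):
        (u0, u1), rows = simulate_school(rng, gamma00, gamma10, beta2, sd_u0, sd_u1, resid_sd)
        b = ols([[1.0, d, a] for d, a, _ in rows], [y for _, _, y in rows])
        fits.append({"true_added_year": gamma10 + u1, "est_added_year": b[1],
                     "true_intercept": gamma00 + u0, "est_intercept": b[0]})
    return fits

fits = per_school_fits()
est = [f["est_added_year"] for f in fits]
print("mean added-year effect across schools:", round(statistics.mean(est), 2))
print("SD of per-school added-year effects:", round(statistics.stdev(est), 2))
```

A proper two-level estimator separates sampling error from true between-school variation, which is why the study relies on a random-effects model instead of a two-step procedure of this kind.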

Regression discontinuity analyses with school-level focus and socioeconomic school composition predictor
To further illustrate the different meanings of between-school variation in intercepts and added-year effects, we extended the school-level model by regressing both the intercept $\beta_{0j}$ and the added-year slope $\beta_{1j}$ on socioeconomic school composition $C_j$, i.e., the aggregated mean number of books at home, as follows:

$$\beta_{0j} = \gamma_{00} + \gamma_{01} C_j + u_{0j} \tag{6}$$

$$\beta_{1j} = \gamma_{10} + \gamma_{11} C_j + u_{1j} \tag{7}$$

Inserting (6) and (7) in (2) results in:

$$Y_{ij} = \gamma_{00} + \gamma_{01} C_j + u_{0j} + \gamma_{10} D_{ij} + \gamma_{11} C_j D_{ij} + u_{1j} D_{ij} + \beta_2 A_{ij} + r_{ij} \tag{8}$$

Therefore, the parameter $\gamma_{01}$ depicts the degree to which the intercept, i.e., the mean achievement of the oldest students in the lower grades, is associated with socioeconomic school composition. This association should be positive in all countries because privileged students, on average, score higher on achievement tests. The parameter $\gamma_{11}$ reflects the extent to which the mean added-year effect, i.e., the school effectiveness measure, relates to the socioeconomic school composition. Ideally, added-year effects should be unrelated to schools' socioeconomic composition. In other words, schools should promote the same learning outcomes, regardless of their socioeconomic composition. Just as we had done for the achievement plausible values, we also z-standardized the aggregated number of books variable before running these school-level analyses.

Descriptive statistics
Table 2 summarizes the descriptive statistics for the 13 samples. The fact that the schools in the samples had different socioeconomic student compositions is reflected in the intra-class correlations (ICCs) of the number of books at home variable (see column 1 in Table 2). The lowest ICC was observed in the Norway 1995 primary school sample and the highest in the Scotland 1995 secondary school sample. These books at home ICC values can reflect different selection mechanisms, including residential segregation between social groups. Similarly, columns 2-3 show the ICCs of the mathematics and science achievement scores. Again, the lowest ICCs were observed in the Norway 1995 primary school sample and the highest in the Scotland 1995 secondary school sample. These achievement ICCs can also reflect residential segregation mechanisms as well as differences in the effectiveness of schools. Across the three ICCs, the scores for school segregation by socioeconomic status and achievement were rather low in Norway, Iceland, and Cyprus in comparison to Greece, Scotland, Sweden, and the Slovak Republic.

The mathematics and science score distributions in the lower and upper grades are depicted in columns 4-11 in Table 2. Within domains and countries, the students in the upper grades scored higher than those in the lower grades. These differences in means reflect that the upper-grade students were both older and had attended school for an additional year. These gaps between younger and older students tended to be larger in the primary than in the secondary school samples. It should be noted that the secondary school samples did not show systematically higher means than the primary school samples because TIMSS scales achievement scores independently for primary and secondary school populations. Within both grades and domains, the countries differed in their mean achievement scores. These between-sample differences in mean achievement levels reflect all sorts of heterogeneity, including the fact that the samples differed slightly in terms of attended grades and ages (see Table 1).

Results of the regression discontinuity analyses with student-level focus
In the first set of analyses, we ran regression discontinuity models that disentangled the age and added-year effects without displaying the between-school differences. Figure 3 graphically illustrates the results of these analyses for the 13 samples (see Appendix B for numerical results). It should be noted that the y-axes in the panels in Fig. 3 are centered around the samples' overall means, with scale limits of 100 points above and below these means. Therefore, the y-axes of the panels are comparable even though the samples' general achievement levels differ. The results for mathematics are displayed as solid lines and those for science as dashed lines. The fact that the mean scores by months of age (dots for mathematics and triangles for science) deviated only slightly and unsystematically from the regression lines illustrates that the regression models reflect the data well. As expected, we observed more deviations in the smaller samples (cf. Table 1).

Fig. 3 Results of the regression discontinuity analyses with student-level focus. The figure displays achievement score (y-axes) means per month of age (x-axes) for mathematics (dots) and science (triangles) domains. The lines illustrate regression discontinuity results for mathematics (solid lines) and science (dashed lines). The slopes illustrate age effects. The regression discontinuities between age = 0 (oldest students in lower grade) and age = 1 (youngest students in upper grade) illustrate the added-year effects across all schools. The y-axes are centered around the samples' overall mean achievement, with the scale limited to 100 points above and below those means. The findings are displayed numerically in Appendix B. The analyses are based on unstandardized achievement scores.
Figure 3 illustrates that we always found positive effects of age, as indicated by the slopes, and positive effects of an additional year of schooling, as indicated by the regression discontinuities between the oldest students in the lower grades (age = 0) and the youngest students in the upper grades (age = 1). In some cases, we observed very similar achievement levels, age, and added-year effects in the mathematics and science domains, which is reflected in an overlap of regression lines. In other samples, there were greater differences between the two domains.
Conforming to the literature, we observed greater age effects, as indicated by steeper slopes, and greater added-year effects, as indicated by greater regression discontinuities, in the younger samples in comparison to the older ones (see Fig. 3). This can be explained by the fact that the greatest age effects are to be expected when brain maturation proceeds most rapidly, i.e., in young children. However, the fact that the primary and secondary school tests were scaled independently prohibits direct comparisons.
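As a rough illustration of how a student-level regression discontinuity model separates the added-year effect from the age effect, the following sketch fits an ordinary least-squares regression of achievement on a grade dummy and age in months to simulated data. It is a simplified, single-level stand-in rather than the two-level model estimated in this study; the helper functions `solve` and `ols`, the sample size, and all "true" parameter values are hypothetical and chosen only for illustration.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical "true" values, not estimates from this study.
TRUE_B0, TRUE_B1, TRUE_B2, NOISE_SD = 500.0, 30.0, 2.0, 10.0

rng = random.Random(0)
rows, y = [], []
for age in range(-11, 13):        # age 0 = oldest students in the lower grade
    d = 1 if age >= 1 else 0      # added-year dummy: 1 = upper grade
    for _ in range(100):          # 100 simulated students per month of age
        rows.append([1.0, float(d), float(age)])
        y.append(TRUE_B0 + TRUE_B1 * d + TRUE_B2 * age + rng.gauss(0.0, NOISE_SD))

b0, b1, b2 = ols(rows, y)
print(f"intercept={b0:.1f}  added-year effect={b1:.1f}  age effect per month={b2:.2f}")
```

Because grade and age are strongly correlated by design, the added-year effect is estimated less precisely than it would be for an uncorrelated predictor, which is one reason why small samples show more deviations from the regression lines.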

Results of the regression discontinuity analyses with school-level focus
Since the aim of this study is to disentangle the between-school variation in student selection and school effectiveness, we ran two-level regression discontinuity models, as in Eq. (5). The results are summarized in Table 3 for mathematics and in Table 4 for science. In contrast to the previously presented unstandardized results, we used the z-standardized achievement scores in these analyses in order to make the results more comparable across the different samples. Hence, intercepts reflect the mean z-standardized achievement scores of the oldest students in the lower grade across schools. The variance of the intercepts depicts the extent to which schools varied in these achievement levels. This parameter can thus reflect between-school differences in both student selection and prior school effectiveness. We found statistically significant between-school variation in intercepts in all samples. We observed lower degrees of between-school variation in intercepts in Norway, Iceland, Sweden, and Cyprus in comparison to Scotland, Greece, and the Slovak Republic (see Tables 3 and 4). We did not observe systematic differences between the primary and secondary school samples from the same countries.
Added-year slope parameters, shown in Tables 3 and 4, reflect the mean effects of an additional year of schooling across schools. At the primary school level, these mean added-year effects ranged between 0.305 standard deviations (SD) in Scotland 1995 and 0.634 SD in Norway 1995 in mathematics and between 0.202 SD in Scotland 1995 and 0.448 SD in Iceland 1995 in science. At the secondary school level, the mean added-year effects ranged between 0.089 SD in Cyprus 1995 and 0.332 SD in the Slovak Republic 1995 in mathematics and between 0.084 SD in Norway 2015 and 0.308 SD in the Slovak Republic 1995 in science. Consistent with the literature and student-level findings in Fig. 2, we observed moderate to large added-year effects at the primary school level and small to moderate added-year effects at the secondary school level.
In our models, the variances of the added-year slopes reflect how much schools varied in added-year effects after taking into account the between-school differences in student selection, i.e., the intercepts. Consequently, these parameters reflect the extent to which education systems attained equal school effectiveness at all schools. In both mathematics and science, we observed low degrees of between-school variation in added-year slopes in primary school samples, especially in Norway, Iceland, and Cyprus 1995. At the secondary school level, we observed similarly low degrees of between-school variation in added-year slopes in the Norway 1995 and 2015 samples as well as the Cyprus 1995 sample. More pronounced between-school variations in these added-year effects were found in the secondary school samples in Sweden, Scotland, and the Slovak Republic 1995. It is interesting to contrast the two Norway samples for primary grades. Although the mean added-year effect was substantially smaller in 2015 in comparison to 1995, the associated between-school variability increased, becoming statistically significant. In this comparison, however, it should be kept in mind that Norway participated with second and third graders in 1995 but with fourth and fifth graders in 2015. Similarly, in interpreting the results of the secondary school sample in Sweden 1995, it should be considered that this sample contained about twice as many lower-grade as upper-grade students. This might point to sampling issues that limit the interpretability of the findings.
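To make a slope variance more tangible, it can be translated into a range of school-specific added-year effects. The following sketch assumes that these effects are normally distributed around the mean added-year effect, which is the usual assumption for random effects in two-level models; the function name `plausible_range` and the input values are hypothetical and are not taken from Tables 3 and 4.

```python
import math


def plausible_range(mean_effect, slope_variance, z=1.96):
    """Interval expected to cover about 95% of school-specific added-year effects,
    assuming they are normally distributed around the mean effect."""
    sd = math.sqrt(slope_variance)
    return mean_effect - z * sd, mean_effect + z * sd


# Hypothetical inputs in SD units, for illustration only.
low, high = plausible_range(mean_effect=0.30, slope_variance=0.01)
print(f"{low:.3f} to {high:.3f}")
```

With a slope variance close to zero, this range collapses onto the mean added-year effect, which corresponds to a high degree of equality of school effectiveness.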
Age slope parameters reflect the difference that one month of age makes for achievement scores, independent of grade level and school attended. At the primary school level, these slopes ranged between 0.024 SD in Iceland 1995 and 0.033 SD in Scotland 1995 in mathematics and between 0.026 SD in Iceland 1995 and 0.034 SD in Norway 1995 in science. This means that the effects of one year of age (i.e., multiplied by 12) on mathematics and science achievement ranged from moderate to large effect sizes (0.288-0.408 SD). At the secondary school level, age slopes ranged between 0.006 SD in Scotland and the Slovak Republic 1995 and 0.019 SD in Cyprus 1995 in mathematics and between 0.005 SD in the Slovak Republic 1995 and 0.021 SD in Iceland 1995 in science. The effects of one year of age on mathematics and science achievement therefore ranged from small to moderate effect sizes (0.060-0.252 SD). Hence, we found more pronounced age effects in younger groups, which conforms to the literature and student-level findings (see Fig. 2).
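The conversion from monthly age slopes to yearly age effects is a simple multiplication by 12. The short sketch below reproduces the bounds of the two ranges reported above from the smallest and largest monthly slopes across both domains; the function name `yearly_effect` is introduced here only for illustration.

```python
MONTHS_PER_YEAR = 12


def yearly_effect(monthly_slope_sd):
    """Convert an age slope per month (in SD units) into an effect per year of age."""
    return round(monthly_slope_sd * MONTHS_PER_YEAR, 3)


# Smallest and largest monthly slopes across mathematics and science, as reported above.
primary = [yearly_effect(s) for s in (0.024, 0.034)]
secondary = [yearly_effect(s) for s in (0.005, 0.021)]
print("primary:", primary)
print("secondary:", secondary)
```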
The residual variances shown in Tables 3 and 4 reflect the degree to which student achievement scores deviated from the values that were predicted on the basis of student age, grade, and attended school.

Results of the regression discontinuity analyses with school-level focus and the socioeconomic school composition predictor
The results of further regression discontinuity analyses, which included the socioeconomic school composition predictor as in Eq. (8), are depicted in Tables 5 and 6 for mathematics and science, respectively. Both socioeconomic school composition and achievement scores were z-standardized in these analyses. The results showed positive and, with one exception, significant associations between socioeconomic composition and achievement intercepts of schools. In other words, the more privileged the student composition of a school, the higher the mean achievement of its oldest students in lower grades. These regression coefficients ranged from moderate to large effect sizes in both mathematics (between 0.270 SD in the Iceland 1995 secondary school sample and 1.188 SD in the Cyprus 1995 secondary school sample) and science (between 0.399 SD in the Norway 1995 primary school sample and 1.175 SD in the Cyprus 1995 secondary school sample).
Table 5 Two-level regression discontinuity results for mathematics achievement as an outcome and socioeconomic school composition as a predictor. The columns depict the results of the two-level regression model described in Eq. (8).

Table 6 Two-level regression discontinuity results for science achievement as an outcome and socioeconomic school composition as a predictor. The columns depict the results of the two-level regression model described in Eq. (8).

In contrast, with one exception, the results showed no significant associations between schools' compositions and added-year effects. Only in the Scotland 1995 secondary school sample was a more privileged student composition significantly associated with more pronounced added-year effects in science. However, it is important to stress that we did not always observe a significant amount of between-school variation in the added-year effects to begin with (see Tables 3 and 4). Therefore, the results of the association between the schools' compositions and added-year effects should not be overinterpreted for the primary school samples in Norway, Iceland, and Cyprus 1995 and for the secondary school samples in Norway and Cyprus 1995 in both mathematics and science.
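The interpretation of the composition coefficients can be made concrete with the fixed part of the model in Eq. (8), in which the expected added-year effect of a school with composition C equals the mean added-year effect plus the cross-level interaction coefficient times C. The sketch below is purely illustrative: the function names and all coefficient values are hypothetical, and the random effects and residual are omitted.

```python
def expected_score(c, d, a, g00, g01, g10, g11, b2):
    """Fixed part of the composition model: expected z-score for school composition c,
    added-year dummy d, and age a (random effects and residual omitted)."""
    return g00 + g01 * c + g10 * d + g11 * c * d + b2 * a


def added_year_effect(c, g10, g11):
    """Expected added-year effect in a school with composition c."""
    return g10 + g11 * c


# Hypothetical coefficients: composition raises the intercept (g01 > 0)
# but is unrelated to the added-year effect (g11 = 0).
params = dict(g00=-0.2, g01=0.5, g10=0.3, g11=0.0, b2=0.02)

jumps = {}
for c in (-1.0, 0.0, 1.0):  # less privileged, average, more privileged composition
    jumps[c] = expected_score(c, 1, 0, **params) - expected_score(c, 0, 0, **params)
    print(f"composition={c:+.1f}  level={expected_score(c, 0, 0, **params):+.2f}  "
          f"added-year effect={jumps[c]:.2f}")
```

In this hypothetical setting, schools differ in achievement levels by composition, yet the gain from an additional year of schooling is identical across them, which is the pattern that would indicate equal school effectiveness.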

Discussion
The main aim of the present study is to utilize two-level regression discontinuity analyses to disentangle the between-school variation in school effectiveness (i.e., added-year effects) from the between-school variation in student selection (i.e., intercepts). To achieve this aim, we used a total of 13 samples, seven Nordic and six other European samples, which covered both primary and secondary school levels and were obtained from TIMSS 1995 and 2015. In all samples, we found considerable between-school differences in intercepts, i.e., achievement levels. This variation can reflect both selection and prior effectiveness differences. Conforming to expectations, the intercepts were higher in schools with a more privileged student composition in all samples. Especially in Sweden, Scotland, and the Slovak Republic, we also found between-school variation in added-year effects, i.e., school effectiveness. Norway and Cyprus in particular, however, attained a low degree of between-school variation in added-year effects, i.e., quite similar added values of an additional year of schooling across schools. Furthermore, added-year effects were not significantly associated with socioeconomic student composition, except for the Scotland 1995 secondary school level sample. Although schools with a privileged socioeconomic composition usually had higher intercepts, they did not always have higher added-year effects. In addition, even though schools differed in student selection, some countries seemed to be able to attain a high degree of equality of school effectiveness. This is the most central implication of our study.
Norway is emphasized in the study because it had implemented the design with two adjacent grades at the primary and secondary school level in both 1995 and 2015. In brief, the analyses revealed that the return of an added year of schooling was substantially lower in 2015, particularly at the primary school level. Additionally, the between-school variability in this added-year effect increased, becoming significant at the primary school level. However, we are cautious not to draw definite conclusions from this finding. Norway participated in TIMSS 2015 with higher grade levels and, hence, older students than in 1995. This, in itself, would already lead to lower expected grade effects. Furthermore, Olsen and Björnsson (2018) illustrated that the age effect is not very robust between cycles of international large-scale assessments. Indirectly, this lack of age effect robustness should also affect the robustness of added-year effects over time.
However, this study has a number of additional central implications. It shows that the regression discontinuity approach can be utilized to measure the equality of school effectiveness independently from selection effects. In comparison with longitudinal approaches for measuring school effectiveness, the regression discontinuity approach has several key advantages (cf. Luyten et al., 2009; Perry, 2017; Singh, 2020). It can be implemented in a cross-sectional assessment of two adjacent grades from the same schools. This makes it more time- and cost-efficient than longitudinal designs and minimizes attrition issues. Another advantage is that it disentangles schooling from pure age effects. Furthermore, this approach can be implemented in a comparatively easy manner both internationally and in repeated cycles. We conclude that the suggested measure for the equality of school effectiveness can serve as a feasible and important evaluation criterion when assessing educational inequalities in school effectiveness research.
In comparison with earlier studies that used TIMSS 1995 data, the main contribution of our study is that a less complex and more targeted model was applied in order to strengthen the interpretability of the estimates of the between-school variation in school effectiveness. Additionally, this study was the first to utilize Norway's extension of the TIMSS 2015 design, and it explicitly tested whether the core assumptions of regression discontinuity designs were met in the investigated countries.
Together with the summarized literature (Luyten, 2006; Luyten & Veldkamp, 2011; Luyten et al., 2017; Olsen & Björnsson, 2018; Webbink & Gerritsen, 2013), our study illustrates that even though an additional year of schooling and age usually have positive effects on achievement outcomes, these effects vary considerably between countries, domains, and grade levels as well as over time. This is an interesting finding in light of the common rule of thumb that "learning gains on most national and international tests during one year are equal to between one-quarter and one-third of a standard deviation" (Woessmann, 2016, p. 6; cf. also OECD, 2019a). Given the findings, this rule of thumb seems overgeneralized. Instead, different benchmarks should be developed for different countries, outcome domains, and school career stages.

Limitations
Our study also has some core limitations. First, we focused on only a small number of countries and on two points in time because international large-scale assessments usually assess only one grade per school. However, it would be interesting to compare these findings with findings from other countries, especially lower-income countries (cf. Heyneman & Loxley, 1983).
Third, the regression discontinuity approach is only applicable in countries that have only one fixed school entry date per year and high proportions of students in formally correct grades. The literature usually applies a 95% complier threshold (e.g., Cliffordson, 2010; Luyten, 2006); at 93%, we were slightly more liberal in this study. Furthermore, we simply excluded non-compliers, while some authors suggest more complex procedures such as instrumental variable approaches in a fuzzy regression discontinuity framework (e.g., Hahn et al., 2001; Webbink & Gerritsen, 2013). Fourth, the two-level regression discontinuity approach for identifying the equality of school effectiveness requires, per country, a substantial number of schools that are large enough. In the present study, we also included samples comprising comparatively few schools (see Table 1), and the results obtained should, hence, be interpreted with appropriate caution. We excluded schools with fewer than 11 students in upper as well as lower grades because the approach is not applicable to very small schools. Fifth, we found greater age and added-year effects in primary school samples. However, we would like to stress that these are not directly comparable because the achievement tests are not vertically scaled.
Sixth, our approach requires that the sampling mechanisms do not differ between the lower and upper grades of the same schools. We only included samples where the TIMSS technical reporting did not indicate any sampling issues. However, in the secondary school sample in Sweden 1995, there were approximately twice as many students in the lower than in the upper grade samples (see Table 1), which may indicate sampling issues. Therefore, these findings should be interpreted with caution.

Conclusion
We conclude that the two-level regression discontinuity design can be applied if countries assess large enough samples of students from two adjacent grades at the same schools and if the share of students in formally correct grades is high enough in these countries. This can be achieved by extending the international designs of large-scale assessment studies with relatively low additional costs and efforts. The resulting data can be utilized to determine the (equality of the) effectiveness of schooling independent of selection and age effects. In addition, this approach makes it possible to investigate what kinds of schools provide greater added-year effects (e.g., Ali & Heck, 2012; Heck & Moriyama, 2010). For instance, the approach can be used to test whether schools with a high instructional quality, certain school resources, or high average teacher competencies attain greater added-year effects than others.
Consequently, we recommend that countries, such as the Nordic ones, assess additional grades through national extensions of large-scale assessments like TIMSS (cf. also Van Damme et al., 2010). When extending the international design in this manner, it would be advisable to assess the same grades over repeated cycles. Unlike in the case of Norway, this would facilitate more direct comparisons over time. Even if countries already have longitudinal data from registries and national assessments, extending international large-scale assessments like this has the advantage of making the findings comparable with those of other countries as well as of allowing rich questionnaire data to be taken into consideration.

Appendix A: Regression discontinuity prerequisite checks
In order to ensure that the regression discontinuity approach is applicable and delivers robust results, we followed the standards of the What Works Clearinghouse (Schochet et al., 2010) and ran numerous prerequisite check analyses. These analyses, which are described in detail in this section, showed that the regression discontinuity approach could be applied to the current case of achievement discontinuity in lower- and upper-grade students from the same schools as well as to the datasets at hand.

Running variable requirements
The running or forcing variable, in this case age in months, should have at least four unique values below and above the threshold and should not be confounded with other interventions. In the present study, the running variable age consisted of 12 unique values in both lower and upper grades. Furthermore, it seems unlikely that age might be confounded with additional interventions because students of different ages have, within the same grades, the same curricula and teachers.
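The first of these requirements is easy to verify mechanically. The following sketch is a minimal illustration and assumes that age is coded in months relative to the threshold, with values at or below 0 belonging to the lower grade; the function name `enough_unique_values` and its default arguments are hypothetical.

```python
def enough_unique_values(ages, threshold=0, minimum=4):
    """Check that the running variable takes at least `minimum` distinct values
    on each side of the threshold (values <= threshold count as 'below')."""
    below = {a for a in ages if a <= threshold}
    above = {a for a in ages if a > threshold}
    return len(below) >= minimum and len(above) >= minimum


# Twelve months of age in each grade, as in the present study.
study_like_ages = list(range(-11, 13))
print(enough_unique_values(study_like_ages))

# A running variable with only two distinct values above the threshold fails the check.
print(enough_unique_values([-3, -2, -1, 0, 1, 2]))
```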

Integrity of the running variable
It should be ensured that the running variable, in this case age, is not systematically manipulated to change the treatment assignment, i.e., school entry. Manipulating both the birth date of a child and the scoring rule, i.e., school entry regulations, is unlikely. School entry rules are determined by policymakers and rarely changed. The months of age frequency did not systematically change around the threshold (see Fig. 4), which could otherwise be interpreted as an indication of systematic birth date manipulations by parents. Furthermore, the frequency of non-compliers was below 8% in all 13 samples.

Low attrition rates
Researchers should also ensure that attrition by treatment status, running variable, and outcome is low so that the generalizability of the findings and the absence of systematic biases can be assumed. In the present study, there was no attrition by treatment status because all students were either in the lower or upper grades. Attrition by age, the running variable, was negligible (≤ 2%; see Measures section) with one exception. There was also no attrition by outcomes because the achievement plausible values were available for all students. The findings are, however, only generalizable to populations who participated in the TIMSS tests and who were not otherwise excluded (see Data section).

Continuity of the relationship between the running variable and the outcome
Furthermore, it should be established that, in the absence of the treatment, the relationship between the running variable and the outcome would be smooth at the threshold. In this study, this means that if the younger and older students entered school at the same time and attended the same grade, we should not observe a discontinuity in the achievement regression. This might not be the case if student selection mechanisms differed between the two grades, for instance. We approximated this by showing that correlates of achievement, namely gender (see Fig. 5) and the number of books at home (see Fig. 6), did not systematically change at the threshold. In addition, we observed no other achievement discontinuities within the grades (see mean scores by age in months in Fig. 2).

Appropriate functional form and bandwidth
Finally, it should be ascertained that the functional form assumption holds empirically and that the effects do not drastically change when investigating subsets of cases closer to the threshold. Our model assumed the same linear effects of age in both grades, after controlling for the added-year effect. We ran alternative models: one with an additional quadratic term for age (see Table 7), one with an additional interaction term between age and grade (see Table 8), and one with a reduced sample of students in the bandwidth of six months before and six months after the threshold (see Table 9). With one exception for each, the regression parameters of the additional quadratic or interaction terms were not statistically different from zero across samples and domains. When only analyzing the subset of students who were closest to the threshold, both age and grade effects were either statistically significant and positive or not significantly different from zero.

Fig. 6 Mean number of books at home by age in months in the 13 samples. The x-axis depicts the age in months and the y-axis depicts the mean numbers of books at home. The books at home variable was assessed through student questionnaires and ranged from 1 = none or very few (0-10 books) to 5 = enough to fill three or more bookcases (more than 200 books). The black dots indicate lower grade students and the grey dots upper grade students.

Table 9 Results of the regression discontinuity analyses with a smaller age bandwidth. The rows depict the results of the regression model described in Eq. (1) with a reduced age bandwidth (six months before and after threshold) for the different samples. The columns depict the parameters with standard errors (SE) in parentheses. The analyses are based on unstandardized achievement scores.

Table 10 Results of the regression discontinuity analyses with student-level focus. The rows depict the results of the regression model described in Eq. (1) for the different samples. These results are illustrated graphically in Fig. 2.