Academic tracking is related to gains in students' intelligence over four years: Evidence from a propensity score matching study

Ability grouping or tracking during secondary schooling is widespread. Previous research shows that academic track schools are more successful than non-academic track schools in teaching mathematics, reading and foreign languages. Reasons include a more favorable student composition and higher instructional quality. However, there is less evidence on whether between-track differences are large enough to differentially affect students' cognitive development. We used data from a large Hamburg panel study to test this hypothesis (N = 8628). By employing several propensity score matching algorithms, we formed parallelized samples of academic track and either non-academic track students or comprehensive school students. After four years of tracking, academic track students showed considerably higher intelligence scores than their counterparts at the non-academic tracks and slightly higher scores than students at the comprehensive schools. Our results underline the importance of a cognitively stimulating learning environment in school to support students' cognitive development. © 2016 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).


Introduction
Some schools teach reading, mathematics and sciences more effectively than others. School effectiveness research mainly agrees with this statement (Reynolds et al., 2014). However, increasing students' general cognitive abilities is usually not an explicit goal of schooling (Adey, Csapó, Demetriou, Hautamäki, & Shayer, 2007). Yet, the question arises whether school quality indicators not only result in different subject-specific outcomes but also differentially affect students' general cognitive abilities. This question is relevant against the background of broad evidence regarding the importance of intelligence for numerous factors of life quality such as educational success, employment status, higher income, better health, higher life expectancy, and enduring partnerships (Der, Batty, & Deary, 2009; Gottfredson, 2003; Wrulich et al., 2013). Therefore, and in light of an increasingly complex environment, a closely related, albeit not identical, construct, namely domain-general problem solving, has received a lot of attention from educational researchers, to the point of its inclusion in the PISA 2012 cycle (Programme for International Student Assessment; Greiff et al., 2014).
Most recently, to address the question of school quality effects on students' intelligence, Becker, Lüdtke, Trautwein, Köller, and Baumert (2012) took advantage of structural features of the German school system: The explicit between-school tracking during secondary schooling in Germany goes along with significant advantages for the academic tracks in terms of teacher qualification, cognitively demanding instruction and student composition (Klusmann, Kunter, Trautwein, Lüdtke, & Baumert, 2008; Retelsdorf, Butler, Streblow, & Schiefele, 2010; Trautwein, Lüdtke, Marsh, Köller, & Baumert, 2006) and resulted in a clear advantage in psychometric intelligence scores for academic track students compared to a matched sample of non-academic track students. Our own study extends the findings of Becker et al. in several directions: First, employing the German adaptation of Cattell's Culture Fair Intelligence Test (Cattell, 1960; Weiß, 1998), we use a more comprehensive instrument of psychometric intelligence. Second, the sample in our study is eight times larger and considerably more heterogeneous concerning student prior achievement and social background. Third, we not only use students from non-academic tracks but also students from non-tracked comprehensive schools as an additional and more challenging comparison group to the academic track students.

Tracking and student achievement
Many school systems integrate some sort of grouping of students, at least during secondary schooling, based on the assumption that teaching is easier and more effective in homogeneous groups (LeTendre, Hofer, & Shimizu, 2003). Grouping can take place within class, on a course level (setting or streaming) or on a school level (tracking). The placement of students often depends on their achievement (ability grouping; Trautwein et al., 2006). Differences between groups or tracks are expected for two main reasons, compositional effects and institutional effects (Maaz, Trautwein, Lüdtke, & Baumert, 2008). Compositional effects refer to the more favorable student composition at academic track schools. On average, students there show higher achievement and higher cognitive abilities along with a more favorable social background. This allows for interactions between students which are more cognitively activating. Institutional effects refer to the fact that tracks differ in their pedagogical response to the different groups in terms of curricular foci, teacher qualification and instructional quality (Ireson & Hallam, 2001). Concerning the curriculum, in Germany, for example, academic track students are required to learn a second foreign language (Kultusministerkonferenz, 2006). In their language lessons they focus more on literature, while in the non-academic track the focus is more on basic linguistic skills (Klieme et al., 2008). Academic track teachers have greater content knowledge and greater pedagogical content knowledge. This results in cognitively more activating instruction, for example by encouraging students to discuss and validate different solution paths of a specific task instead of training one correct solution (Baumert et al., 2010; Klusmann et al., 2008; Retelsdorf et al., 2010).
Research on the effects of tracking has shown that academic track students indeed reach a higher level of achievement than students on other, more vocationally oriented tracks, even when controlling for intake differences between tracks. This effect is most pronounced for mathematics achievement (Becker, Lüdtke, Trautwein, & Baumert, 2006; Guill & Gröhlich, 2013; Opdenakker & Van Damme, 2006), but can also be found for French as a foreign language (Neumann et al., 2007). Findings for reading achievement are less consistent and, if track differences exist, effect sizes are lower (Retelsdorf, Becker, Köller, & Möller, 2012).

Tracking and intelligence
Increasing students' general cognitive abilities is neither just another subject in school nor an explicit aim of systematic instruction (for a criticism, see Adey et al., 2007; similarly for domain-general problem solving, Greiff et al., 2014).
When speaking of students' cognitive abilities or their intelligence we think of their "ability to understand complex ideas, to adapt effectively to the environment, to learn from experience, to engage in various forms of reasoning, to overcome obstacles by taking thought" (Neisser et al., 1996, p. 77). In some models, intelligence is differentiated into a crystallized component, that is, acquired abilities, and a fluid component, that is, the capacity to analyze and solve novel problems independent of cultural experiences and acquired abilities (Cattell, 1963; Horn, 1994). There is also evidence that fluid intelligence coincides with the g factor, the common factor resulting from factor analyses of broad ranges of intellectual tasks (Jensen, 2002). According to Cattell's investment theory, this is because at the beginning of an individual's development her or his fluid intelligence is invested in all kinds of complex learning tasks, resulting in high correlations between acquired, crystallized abilities (Valentin Kvist & Gustafsson, 2008). From a developmental perspective following a Piagetian tradition, fluid intelligence is also modelled as developing through four reconceptualization cycles.
School-age children are either in the cycle of rule-based reasoning (6–11 years) or principle-based reasoning (11–18 years). Each cycle consists of two phases, the latter implying the full mastery of the thinking possibilities of the new cycle. Growth through these cycles is characterized by change in the nature of representations and their inferential interlinking (Christoforides, Spanoudis, & Demetriou, 2016).
There is no doubt about the substantial influence of genetic disposition on an individual's intelligence (Plomin, 2003). However, we know from various fields that a cognitively stimulating environment also has positive effects on individual cognitive abilities. This has been shown, for example, for challenging work environments (Schooler, Mulatu, & Oates, 1999), memory training programs (Jaeggi, Buschkuehl, Jonides, & Perrig, 2008), music practice (Schellenberg, 2006) and direct or content-based training programs (Adey et al., 2007). Last but not least, there is strong evidence regarding the impact of quantity of schooling on students' intelligence. As Ceci (1991) documented, especially when using natural experiments, every year of schooling brings with it substantial IQ score gains of 2–6 points. However, it remains unclear whether school quality differences are substantial enough to affect students' general cognitive abilities differentially. In tracked school systems the compositional and institutional effects described above work consistently across all academic subjects. Concerning compositional effects, following Vygotsky's concept of mediated learning experiences, the interaction with peers who are slightly ahead in terms of cognitive functioning should stimulate learning processes (Adey et al., 2007), and these peers are more likely to be found at the academic tracks. Concerning institutional effects, across all subjects there is more stimulation of advanced reflection at the academic tracks, e.g. when learning to identify the common structure of a drama in different plays or when learning the requirements of valid mathematical proofs. It is known from content-specific training programs that they transfer to the students' fluid intelligence and can either improve the students' efficiency of reasoning within a given developmental cycle (Papageorgiou, Christou, Spanoudis, & Demetriou, 2016) or accelerate the transition to the following cycle (Christoforides et al., 2016).
In sum, because of the more activating environment in academic tracks one might expect a positive influence of academic tracks on their students' intelligence.
The effect of tracking on students' intelligence development has been investigated several times. Findings from Swedish (e.g. Balke-Aurell, 1982; Härnqvist, 1968), Israeli (Shavit & Featherman, 1988) and US American studies (Rosenbaum, 1975) during the last decades consistently show higher intelligence scores for students on academically oriented tracks compared to students on vocationally oriented tracks. Cliffordson and Gustafsson (2008) demonstrated advantages for different academic profiles (social sciences vs. technical) on the respective components of an intelligence test. All of these studies found systematic differences in the social and cognitive composition of the students at the onset of tracking. They usually controlled for at least some of these intake differences using standard least-squares regression analyses. However, they all have been criticized either for controlling only a few variables and potentially failing to control for all the selection bias or for relying on regression analyses without fulfilling their preconditions, e.g. by extrapolating results for subjects without comparable individuals in the control group (Brody, 1992).
In their study, Becker et al. (2012) made considerable efforts to overcome these disadvantages. In Germany, after primary school students continue on different formal educational tracks, these being either vocational (hereafter: non-academic track) or academic. In the Becker et al. study tracking started after six years of primary schooling. Previous research has shown that placement into the different tracks is highly predictable from students' prior academic achievement and their social background (e.g., Maaz et al., 2008; Pietsch & Stubbe, 2007). Becker et al. (2012) used propensity score matching (PSM) as a pre-processing method to address the systematic intake differences. PSM basically consists of matching individuals based on their probability (conditional on the covariates) of receiving the treatment in question (see methods for further details). This way they were able to parallelize the academic and non-academic track students on numerous covariates, including pre-tracking intelligence scores, test achievement scores, grades, and social background indicators. As an indicator of students' psychometric intelligence they used the 25-item Figure Analogies subscale of a slightly adapted German version of Thorndike's Cognitive Abilities Test (KFT 4–13+; Heller, Gaedike, & Weinläder, 1985; Thorndike & Hagen, 1971). The PSM analysis revealed that after four years of tracking academic track students showed significantly higher mean intelligence scores with an average effect size of d = 0.46. While the methodological approach of Becker et al. overcomes some of the shortcomings of earlier studies, their rather small and homogeneous sample has some limitations that are addressed in the present study.

This study
The present study adds to the work on effects of academic tracking on students' psychometric intelligence by replicating the findings of Becker et al. (2012) and extending them in several meaningful directions. While Becker et al.'s findings rely on only one subscale of a test battery, the participants in our study completed all four subtests of the German adaptation of Cattell's (1960) Culture Fair Intelligence Test (CFT 20; Weiß, 1998). Like the Figure Analogies, these subtests load on an inductive reasoning factor and a higher-order general factor (Carroll, 1993). The CFT 20 is a more comprehensive instrument of psychometric intelligence but, like the Figure Analogies, is constructed of material not directly covered in school.
Like Becker et al. (2012) we investigate the effect of academic tracking within the German school system. We analyze the effects of academic tracking after an equal time span of four years allowing a direct comparison of the results.
As tracking started two years earlier in our sample from the federal state of Hamburg, we extend Becker et al.'s (2012) finding to students two years younger at the onset of tracking. This enables us to include all students in the compulsory school system in our analyses, while Becker et al. lost those students who graduated from the least demanding track (as required) at the end of grade 9. Situated in a metropolitan area and including a large number of students with an immigrant background, our sample is much more heterogeneous than the Becker et al. sample, which included few immigrants. As the qualitative differences between the tracks apply to our study as well, in line with Becker et al. we expect to find larger intelligence score gains in the academic track compared to the non-academic track (Hypothesis 1).
A specific feature of the Hamburg school system allows us a further extension. Besides the tracked school system, Hamburg offers comprehensive schools. Here, the career paths of the students are less pre-determined and students are prepared for later vocational or academic orientation in shared classrooms. Only in some subjects like mathematics and English does within-school streaming take place, starting no earlier than grade 7. Comprehensive schools are therefore rather similar to secondary schools in non-tracked school systems. The staff consists of teachers qualified for academic track schools and those qualified for non-academic track schools. Comprehensive schools usually attract more academically oriented students (Behörde für Schule und Berufsbildung, 2011), although substantially fewer than the purely academic tracks. Given that comprehensive schools still have less favorable institutional and compositional characteristics than academic track schools, we expect to find larger intelligence score gains in the academic tracks than in the comprehensive schools, even if the difference in intelligence score gains between academic track students and comprehensive school students might be smaller than between academic and non-academic track students (Hypothesis 2).

Sample
The data came from the Hamburg school achievement study "Aspects of learning background and learning development" (abbreviated LAU for its German name; Behörde für Schule und Berufsbildung, 2011). The study started in September 1996 with the complete cohort of grade 5 students (LAU 5) at the onset of secondary schooling, continued in September 1998 with the cohort of grade 7 students (LAU 7) and in September 2000 with the grade 9 students (LAU 9). Intelligence was measured in LAU 5 and in LAU 9. This enabled us to examine the development of intelligence scores over a period of four years for all those students with a normal school career (e.g., without retention).
All LAU tests and questionnaires were administered on two (LAU 5) or four (LAU 9) consecutive days and altogether took two school lessons (45 min each) per day. They were administered by trained administrators. Participation in the achievement tests was obligatory, while participation in the intelligence tests in grade 9 and in the student and parent questionnaires required parental permission. About 13,000 students participated at each measurement point.
Our analytic sample was limited to those students who took part in both the LAU 5 and the LAU 9 assessment. This was true for 9864 of the 13,026 LAU 5 students (75.7%). Furthermore, we excluded those students who changed their track during this time period (between LAU 5 and LAU 9). This resulted in a total drop-out rate of 33.7% and an analytic sample of 8628 students. Drop-out rates differed between the tracks. While the academic track had a nearly average drop-out rate of 34.9%, the rate was considerably higher in the non-academic tracks (41.5%) and much lower for the comprehensive school students (24.8%). Besides track changes, the drop-out was attributable to grade repetition, premature end of the school career and family relocation out of the Hamburg area. Track changes mainly took place between academic and non-academic tracks. Descents from the academic to the non-academic track were four times more frequent than ascents in the opposite direction. Premature school ending was found more often in the non-academic tracks, and grade repetition rates vary systematically between the tracks in favor of comprehensive school students and, to a lesser extent, of academic track students (Prenzel, Zimmer, Drechsel, Heidemeier, & Draxler, 2005). On average, the drop-out students showed lower test achievement results and a less favorable social background than the longitudinal students (see Table A.1 in the online supplemental material for detailed descriptive analyses).
In the analytic sample 3545 students attended the academic track (41.1%), 2168 the non-academic track (25.1%) and 2915 comprehensive schools (33.8%). The students came from 183 different schools. At 49.8%, girls were slightly underrepresented. A total of 23.1% of the students also spoke a language other than German at home, reflecting the high immigrant proportion in Germany's larger cities.

Instruments
Dependent variable. The short form of the "Grundintelligenztest Skala 2 – CFT 20" (Weiß, 1998) was used as a measure of the students' intelligence. This is the German adaptation of Cattell's (1960) "Culture Fair Intelligence Test – Scale 2". The CFT 20 is intended as a measure of fluid intelligence. It consists of four subtests, namely series (12 items), classification (14 items), matrices (12 items) and topologies (8 items). Each subtest is highly speeded, taking between 3 and 4 min. The whole test with instructions takes about 35 min. Each subtest contains figural stimuli and students have to choose one of five answer options (multiple-choice format) according to rules derived from the given figures. Usually, only the sum score over all subtests is interpreted. The test comes in two versions identical in content and differing only in item sequence within the subtests to hinder students from copying from their neighbor. The same test material was administered at T1 (Grade 5) and T2 (Grade 9). The reliability as measured by Cronbach's α was α = .82 at T1 (Behörde für Schule und Berufsbildung, 2011) and α = .84 at T2. The four-year retest stability in our sample was satisfactory with r = .57. The manual of the CFT 20 (Weiß, 1998) reports an internal (split-half) consistency of r_tt = .90 (short form) and a two-week retest reliability of r_tt = .77 for the complete test.
Control variables. PSM should include those variables predicting the students' assignment to the treatment conditions, that is, the different tracks, as well as confounder variables that are associated with both the treatment and the outcome measure. First, we used all available achievement measures at T1 to control for potential selection biases. The list of measures included the students' reading, language (grammar and vocabulary), orthography and mathematics scores from the achievement test battery KS HAM 4/5 in Grade 5 (Mietzel & Willenberg, 1996). Their reliability scores were between α = .85 and α = .90 (Behörde für Schule und Berufsbildung, 2011).
Additionally, we used the students' grades at the end of primary schooling as achievement indicators. They covered the subjects German, mathematics, social studies and sciences, music and art. Grades ranged from 1 (very good) to 6 (fail) and were reverse coded, with higher values representing better grades. The primary school teachers also recommended whether, besides comprehensive schools, the academic track or the non-academic track was most appropriate for the individual student. Grades and primary school recommendation are the most important predictors of track choice in Germany, followed by social background indicators (Maaz et al., 2008).
Students' academic self-concept was measured by an 11-item scale, rating statements like "I have no trouble to understand complex relationships at once" on a four-point rating scale with higher scores indicating a more positive self-concept (α = .87; Behörde für Schule und Berufsbildung, 2011).
Parents reported their highest school leaving certificate and their highest post-secondary school degree. This information was combined to form two dummy variables indicating whether at least one parent finished the academic track successfully and whether at least one parent has a university degree. We further used parents' reports about cultural possessions such as the number of books at home as indicators of cultural capital (Bourdieu, 1977). Migration background was coded if the parents mentioned speaking a language in addition to German at home. Additionally, the students' age and gender were used as control variables.

Treatment of missing values
Due to the obligatory student participation the missing data rate for the LAU 5 achievement and intelligence tests was only 5.8%. However, for additional information provided by students, parents and teachers in LAU 5 the proportion of missing values was about 23.3%. The missing data rate for the LAU 9 intelligence test was 20.8%.
Multiple imputation is currently considered the preferable approach to deal with missing data and to avoid biased parameter estimates (Schafer & Graham, 2002). All covariates in the propensity score model, the outcome variable and some additional correlated variables (i.e., auxiliary variables) were included in the imputation model. Due to their low proportion of missing values (on average below 7%) we used LAU 7 and LAU 9 achievement test scores (English, reading and mathematics) and grades for imputing missing values of the LAU 9 intelligence test. However, we used no variable affected by the treatment to impute pre-treatment variables (Langenskiöld & Rubin, 2008). To account for the clustered data structure, class means of grade 5 intelligence scores were included in the imputation model.
We used the multiple imputation by chained equations method (van Buuren & Groothuis-Oudshoorn, 2011) which is implemented in the package mice 2.22 in the R environment (R 3.1.3, R Core team, 2015). Trace plots indicated a successful convergence of the algorithm for the means and variance of the imputed variables. In total, we imputed 10 data sets which were further analyzed separately. All parameter estimates and standard errors were combined by Rubin's (1987) rules.
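The pooling step can be illustrated with a minimal Python sketch of Rubin's rules for a single parameter. This is not the study's code (the analyses were run in R with mice), and the estimates and standard errors below are hypothetical values chosen only to show the arithmetic: the pooled estimate is the mean across imputed data sets, and the total variance adds the average within-imputation variance to the between-imputation variance inflated by (1 + 1/m).

```python
import math
from statistics import mean, variance

def pool_rubin(estimates, std_errors):
    """Combine per-imputation estimates and standard errors with Rubin's rules."""
    m = len(estimates)
    pooled = mean(estimates)                      # pooled point estimate
    within = mean(se ** 2 for se in std_errors)   # average within-imputation variance
    between = variance(estimates)                 # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between        # total variance of the pooled estimate
    return pooled, math.sqrt(total)

# Hypothetical results from m = 3 imputed data sets (illustration only).
estimates = [0.38, 0.40, 0.42]
std_errors = [0.05, 0.05, 0.05]
pooled_estimate, pooled_se = pool_rubin(estimates, std_errors)
```

The pooled standard error is larger than any single within-imputation standard error whenever the estimates differ across imputations, which is how the procedure carries the uncertainty due to missing data into the final result.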

Propensity score matching
Propensity score matching was used as a pre-processing strategy to control for systematic differences between the tracks on a large set of covariates (Ho, Imai, King, & Stuart, 2007). In a first step, the probability of attending an academic track (i.e., the propensity score) was estimated for each student by using logistic regression analyses. In the next step we matched students from different tracks who were similar on their propensity scores to create samples of students which were balanced on all covariates. For the matching procedures we used the software package MatchIt 2.4-21 by Ho, Imai, King, and Stuart (2011).
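The first step, estimating propensity scores by logistic regression, can be sketched as follows. This is a generic, dependency-free Python illustration rather than the study's implementation: the two covariates, the ten students and their track assignments are made up, and the model is fitted by plain gradient ascent instead of the routines used in R.

```python
import math

def sigmoid(z):
    """Logistic function mapping a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def linear_predictor(beta, x):
    """Intercept plus weighted sum of the covariates."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))

def fit_logistic(X, y, learning_rate=0.1, n_iter=5000):
    """Fit a logistic regression by batch gradient ascent on the log-likelihood."""
    n_obs = len(X)
    beta = [0.0] * (len(X[0]) + 1)  # intercept first, then one slope per covariate
    for _ in range(n_iter):
        gradient = [0.0] * len(beta)
        for x, treated in zip(X, y):
            residual = treated - sigmoid(linear_predictor(beta, x))
            gradient[0] += residual
            for j, v in enumerate(x, start=1):
                gradient[j] += residual * v
        beta = [b + learning_rate * g / n_obs for b, g in zip(beta, gradient)]
    return beta

# Hypothetical z-standardized covariates: [prior achievement, prior intelligence].
X = [[1.2, 0.8], [0.5, 1.0], [0.9, -0.2], [-0.1, 0.4], [-0.2, 0.3],
     [0.3, -0.5], [-0.8, 0.1], [-1.1, -0.9], [-0.4, -1.2], [0.4, 0.5]]
# 1 = academic track, 0 = comparison group (made-up assignments).
y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

beta = fit_logistic(X, y)
propensity_scores = [sigmoid(linear_predictor(beta, x)) for x in X]
```

Each entry of `propensity_scores` is the estimated probability of attending the academic track given the covariates. Students from different tracks with similar scores are the candidates for matching.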
Following the recommendation to use different matching techniques to test the stability and robustness of the findings (Morgan & Winship, 2015), we used four matching algorithms: (a) nearest-neighbor matching without replacement and a ratio of 1:1 (i.e., each academic track student is matched to exactly one student from the comparison group). Matched pairs did not have to be identical on their propensity score, but we allowed for small differences, the so-called caliper, of c = 0.1; (b) nearest-neighbor matching without replacement (see Footnote 1), a caliper of c = 0.1 and a ratio of 1:5 (i.e., allowing up to 5 available students from the comparison groups to be matched to one academic track student); (c) nearest-neighbor matching without replacement, a caliper of c = 0.1 and a ratio of 5:1 (i.e., allowing up to 5 available students from the academic track to be matched to 1 student from the comparison group); and (d) full matching discarding students outside the area of common support (ACS), which is the region of overlap between the propensity score distributions of treatment and control individuals. Here, all students within the ACS were included. We restricted the minimum ratio of students from the comparison groups to academic track students permitted within a matched set to 0.025 to prevent an extremely high weighting of a small number of cases (Stuart & Green, 2008).
Footnote 1. 1:k matching with replacement is the more common technique, but in our case it resulted in extremely high weightings of some individuals (sometimes up to 400 times) and an insufficient reduction of selection bias; it was therefore not used further.
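Algorithm (a) can be sketched in a few lines of Python. This is a simplified greedy variant under stated assumptions, not the MatchIt implementation: the unit ids and scores are hypothetical, treated units are processed from highest to lowest score (one possible ordering), and the caliper is applied as an absolute difference on the score scale, whereas matching software may define it in standard deviation units of the distance measure.

```python
def nearest_neighbor_match(treated, controls, caliper=0.1):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    treated, controls: dicts mapping a unit id to its propensity score.
    Returns a list of (treated_id, control_id) pairs.
    """
    available = dict(controls)  # controls that have not been used yet
    pairs = []
    # Assumption: process treated units from highest to lowest score.
    for t_id, t_score in sorted(treated.items(), key=lambda item: -item[1]):
        if not available:
            break
        # Closest remaining control on the propensity score.
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]  # without replacement: each control is used once
    return pairs

# Hypothetical propensity scores (illustration only).
treated = {"t1": 0.90, "t2": 0.62, "t3": 0.55}
controls = {"c1": 0.60, "c2": 0.50, "c3": 0.20}
pairs = nearest_neighbor_match(treated, controls)
```

In this toy example `t1` stays unmatched because no control lies within the caliper, which mirrors how academic track students without a comparable partner drop out of the matched samples.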
Matching was performed twice, once for the academic track students compared to the non-academic track students and once for the academic track students compared to the comprehensive school students.
We screened all matched samples for balance on simple covariate comparisons, quadratic and interaction terms. We report standardized mean differences and variance ratios as indicators of remaining differences in the matched samples which are independent of the sample size (Stuart, 2010). Concerning the analysis of the treatment effect, we chose regression analyses to control for all covariates of the propensity score model as a double-robust check (Ho et al., 2007). The type = complex analysis option of Mplus 7.3 (Muthén & Muthén, 2008–2014) was used to account for the nested structure of the data.
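The two balance indicators can be computed as in the following Python sketch. The covariate values are hypothetical, and the sign convention (academic track minus comparison group) is an assumption; the reference standard deviation is that of the academic track sample before matching, and the variance ratio is Var(Non-AT)/Var(AT), as described in the table notes.

```python
from statistics import mean, stdev, variance

def standardized_mean_difference(at_matched, comparison_matched, reference_sd):
    """Mean difference (AT minus comparison) in units of a fixed reference SD."""
    return (mean(at_matched) - mean(comparison_matched)) / reference_sd

def variance_ratio(comparison_matched, at_matched):
    """Variance of the comparison group divided by variance of the AT group."""
    return variance(comparison_matched) / variance(at_matched)

# Hypothetical z-standardized covariate values (illustration only).
at_before_matching = [0.2, 0.9, 1.4, 0.5, 1.0]  # full academic track sample
at_matched = [0.2, 0.5, 0.9]                    # academic track students kept by matching
non_at_matched = [0.0, 0.5, 0.8]                # their matched comparison students

reference_sd = stdev(at_before_matching)
d = standardized_mean_difference(at_matched, non_at_matched, reference_sd)
vr = variance_ratio(non_at_matched, at_matched)
balanced = abs(d) < 0.25  # threshold for acceptable differences cited from Stuart (2010)
```

Using a fixed pre-matching standard deviation keeps the denominator constant across matching algorithms, so changes in d reflect changes in the mean difference only.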

Results
We present our results in three steps. First, we describe the areas of common support between the academic track sample and both comparison groups. Second, we describe to what extent the matching algorithms succeeded in removing differences in the covariates' distributions. Third, we present the effect of academic tracking on students' intelligence scores.

Area of common support
Figs. 1 and 2 illustrate the distributions of the propensity scores of attending the academic track for academic track, non-academic track and comprehensive school students. The propensity scores were transformed to logits to normalize the skewed distributions in both groups. As expected, the academic track students showed a higher propensity of attending an academic track (given the covariates of the estimation model) than either the non-academic track students (Fig. 1) or the comprehensive school students (Fig. 2). However, for both comparisons there was a substantial overlap of the distributions. Of the academic track students 49.4% (N = 1754) had a potential matching partner and 77.7% of the non-academic track students (N = 1684) could serve as their matching partners. The area of common support was even larger for academic track students compared to comprehensive school students (see Fig. 2). With 95.2%, nearly all academic track students (N = 3377) had a potential matching partner and 84.1% (N = 2452) of the comprehensive school students could serve as their potential match.

Sample differences before and after propensity score matching
Table 1 illustrates sample differences before matching. The academic track students were a positively selected group. They had higher pre-treatment intelligence scores, higher test achievement scores, better grades, more primary school recommendations for the academic track and a more favorable social background than the other groups. Differences between academic track students and either comprehensive school students or non-academic track students showed the same pattern on nearly all covariates. The absolute values of the differences were smaller between academic track and comprehensive school students than between academic track and non-academic track students.
In Table 2 we present sample differences after 1:1 nearest-neighbor matching of academic track and non-academic track students as an example of the effects of matching on group differences. No mean difference reached statistical significance. Standardized mean differences were in all cases below d = 0.1, with only a minimal positive tendency in favor of academic track students remaining. The pattern was the same after 1:5 and 5:1 nearest-neighbor matching. After full matching, the remaining group differences were a little larger, but with the exception of one case (number of books at home, d = 0.27) still below the criterion of d = 0.25 for acceptable group differences after matching (Stuart, 2010). The balance diagnostics of the academic track-to-comprehensive school matching yielded similar results. We present detailed balance statistics for all simple covariates for each group comparison and each matching algorithm in the online supplemental material (see Tables A.4 to A.10). A screening of the quadratic and interaction terms supported the overall impression of good balance between the groups (see Table A.11 for details). Table 3 demonstrates the differential efficiency described for the various matching algorithms (Stuart, 2010). Sample sizes were smallest for 1:1 nearest-neighbor matching, increased for the ratio matchings and were largest for full matching.
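How the share of students inside the area of common support can be determined is sketched below in Python. The sketch assumes a simple minimum/maximum definition of the overlap region on the logit scale, which may differ from the rule applied by the matching software, and all propensity scores are hypothetical.

```python
import math

def logit(p):
    """Transform a probability to the log-odds scale."""
    return math.log(p / (1 - p))

def common_support(treated_logits, control_logits):
    """Overlap region: from the larger of the two minima to the smaller of the two maxima."""
    lower = max(min(treated_logits), min(control_logits))
    upper = min(max(treated_logits), max(control_logits))
    return lower, upper

def share_within(values, lower, upper):
    """Proportion of values that fall inside [lower, upper]."""
    return sum(lower <= v <= upper for v in values) / len(values)

# Hypothetical propensity scores (illustration only).
treated_scores = [0.35, 0.55, 0.80, 0.95]
control_scores = [0.05, 0.20, 0.40, 0.70]

treated_logits = [logit(p) for p in treated_scores]
control_logits = [logit(p) for p in control_scores]
lower, upper = common_support(treated_logits, control_logits)
share_treated = share_within(treated_logits, lower, upper)
share_control = share_within(control_logits, lower, upper)
```

Because the logit transformation is monotone, the same students fall inside the overlap region on the probability scale and on the logit scale; the transformation only changes the shape of the plotted distributions.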

Effects of academic tracking
Note. Grades reverse coded, from 1 = fail to 6 = very good. AT = academic track. Non-AT = non-academic track. CS = comprehensive school. Continuous variables were z standardized. Cohen's d is computed relative to the AT and with the SD of the academic track sample before matching.

Table 2
Univariate findings after 1:1 matching of academic track and non-academic track students for covariates at the beginning of grade 5 (N = 638).
Note. Grades reverse coded, from 1 = fail to 6 = very good. AT = academic track. Non-AT = non-academic track. Continuous variables were z standardized. Cohen's d is computed relative to SD of the academic track sample before matching. VR = variance ratio, Var(Non-AT)/Var(AT).

After four years of tracking, the academic track students' mean intelligence score was significantly higher than that of the matched group of non-academic track students. Using the standard deviation of the academic track sample before matching, the effect size was about d = 0.40 (see Table 4). Academic track students also reached significantly higher intelligence test scores than their matched counterparts in the comprehensive schools. However, the effect was smaller and varied between d = 0.28 for the variations of nearest-neighbor matching and d = 0.17 for full matching.
The interpretation of the effect size estimates from PSM analyses relies on the assumption that there are no unmeasured confounder variables. Therefore, we conducted sensitivity analyses to test the robustness of our effects (VanderWeele & Arah, 2011). Concerning the effect of academic tracks compared to non-academic tracks of about 0.4 SD, a possible unobserved confounder with a moderate (small/large) effect size of 0.3 (0.1/0.5) would have to differ about 1.3 SD (3.9/0.8) between treatment and control group (after controlling for all observed covariates) to eliminate the effect of academic tracking on students' intelligence (0.4/0.3 = 1.3). Concerning the effect of academic tracking compared to comprehensive schools of 0.17–0.28 (depending on the matching algorithm), a possible unobserved confounder with a moderate (small/large) effect size would have to differ about 0.6–0.9 SD (1.7–2.8/0.3–0.6) between treatment and control group (after controlling for all observed covariates) to eliminate the effect of academic tracking on students' intelligence.
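The arithmetic behind these sensitivity figures can be reproduced with a short sketch. It assumes the simplified relation used in the text: the group difference a confounder needs in order to cancel an effect is the effect divided by the confounder's effect size on the outcome. The function name and the comparison labels are hypothetical.

```python
def required_confounder_difference(tracking_effect, confounder_effect):
    """Group difference (in SD) an unmeasured confounder would need
    in order to fully explain away the tracking effect."""
    return tracking_effect / confounder_effect


# Effect sizes taken from the text; 0.40 is the rounded AT vs. non-AT effect.
comparisons = {
    "AT vs. non-AT": 0.40,
    "AT vs. CS (full matching)": 0.17,
    "AT vs. CS (nearest neighbor)": 0.28,
}

for label, effect in comparisons.items():
    for confounder_effect in (0.1, 0.3, 0.5):  # small, moderate, large
        needed = required_confounder_difference(effect, confounder_effect)
        print(f"{label}: confounder effect {confounder_effect} "
              f"needs a difference of {needed:.1f} SD")
```

With the rounded effect of 0.40 the small-confounder case gives 4.0 rather than the 3.9 reported in the text, presumably because the text worked from an unrounded estimate.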

Discussion
The results of our analyses are in line with our hypotheses: Students on academic tracks show greater intelligence score gains than students on other tracks. On a descriptive level, this effect was most pronounced compared to non-academic track students but also present compared to comprehensive school students.
While the direction of the effect supports the findings of Becker et al. (2012) and underlines the reliability of the effect of academic tracking on students' intelligence (see Simons, 2014, for the value of direct replication), the effect sizes we found were lower. Becker et al. reported an average effect size of d = 0.46, while it was d = 0.40 for academic track students compared to non-academic track students in our study. When using the standard deviation of the control group as a reference, as Becker et al. did, it was reduced even further to d = 0.31. A possible reason for this difference might be the younger age of our sample at the onset of tracking (grade 5 vs. grade 7 students in Becker et al.'s sample). In the German school system, grades 5 and 6 are conceptualized as an observation stage (Kultusministerkonferenz, 2006) during which changes between the tracks are meant to be easy. This might reduce those curricular and instructional differences which contribute to the differential development of students' intelligence. For example, academic track students do not start to learn a second foreign language before 7th grade.
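The two effect sizes reported for our own comparison imply a ratio between the two reference standard deviations. As a worked step, assume that both values rest on the same raw mean difference. The symbols Δ, SD_AT (academic track SD before matching) and SD_C (control-group SD) are introduced here for illustration only:

```latex
\[
d_{\mathrm{AT}} = \frac{\Delta}{SD_{\mathrm{AT}}} = 0.40, \qquad
d_{\mathrm{C}} = \frac{\Delta}{SD_{\mathrm{C}}} = 0.31
\quad\Longrightarrow\quad
\frac{SD_{\mathrm{C}}}{SD_{\mathrm{AT}}} = \frac{0.40}{0.31} \approx 1.29
\]
```

Under that assumption, the control-group SD would be roughly 29% larger than the academic track SD, which is why the choice of reference SD matters when comparing effect sizes across studies.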
The smaller differences between academic track students and comprehensive school students compared to non-academic track students are descriptive and in line with the more favorable institutional and compositional characteristics of comprehensive schools compared to non-academic track schools. Therefore, a direct comparison of comprehensive school students and non-academic track students in the full range of their area of common support would be an interesting topic for future research.

Strengths and limitations
Our study has several strengths: The results are based on a large and heterogeneous sample of one of Germany's metropolitan areas. Using the CFT 20 (Cattell, 1960; Weiß, 1998), we employed a rather broad measure of intelligence. Employing PSM as a pre-processing method, we made a substantial effort to eliminate potential confounders of the tracking effect. These measures contribute to the validity of our findings.
There are some limitations of our study, too: Our results apply only to those students with a normal school career without grade repetition or track changes. It would be interesting to investigate in future research whether students who descend from an academic track to the cognitively less activating environment of a non-academic track hold or lose their advantage in terms of intelligence scores and, vice versa, whether students ascending to the academic track can catch up with their peers who were in the academic track from the onset of tracking. However, such a research question would require an additional assessment of intelligence at the moment when students change their track.
Given that the CFT 20 is not constructed to differentiate between developmental cycles, it remains open to future research whether the students at the academic track acquired new thinking possibilities or whether they became more efficient in dealing with problems on their developmental level (Christoforides et al., 2016). Given the age span of our sample from about 10 to 15 years, both seem possible.
Furthermore, our results are limited to those students in the area of common support. Effects of academic tracking are not testable for the non-academic track students at the lower end or the academic track students at the upper end of the propensity score distribution. There are no students to compare these students to. This limitation is less pronounced for the academic track/comprehensive school comparison, where the overlap of the distributions is much broader.
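What restricting the analysis to the area of common support means can be sketched as follows. The sketch assumes a simple min/max overlap rule on estimated propensity scores, which is one common convention and not necessarily the rule applied in this study. The function names and the scores are hypothetical.

```python
def common_support(treated_scores, control_scores):
    """Return the (lower, upper) bounds of the propensity score range
    covered by both groups, or None if the ranges do not overlap."""
    lower = max(min(treated_scores), min(control_scores))
    upper = min(max(treated_scores), max(control_scores))
    if lower > upper:
        return None
    return lower, upper


def within_support(scores, bounds):
    """Keep only the scores that fall inside the common support bounds."""
    lower, upper = bounds
    return [s for s in scores if lower <= s <= upper]


# Hypothetical propensity scores (probability of academic track attendance).
academic = [0.35, 0.60, 0.80, 0.97]
non_academic = [0.02, 0.10, 0.40, 0.75]

bounds = common_support(academic, non_academic)
kept_academic = within_support(academic, bounds)
kept_non_academic = within_support(non_academic, bounds)
print(bounds, kept_academic, kept_non_academic)
```

In this toy example the academic track students with the highest scores and the non-academic track students with the lowest scores drop out, mirroring the limitation described above.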
Propensity score matching relies on the assumptions that there are no unmeasured confounders of the treatment effect and that the covariates are measured without error. With regard to measurement error, if the grade 5 intelligence is not perfectly measured, for example, we would overestimate the effect of academic tracking because unreliable intelligence scores result in an under-adjustment for preexisting mean differences between non-academic track (or comprehensive school) and academic track students (for an illustration see Maxwell & Delaney, 2004, p. 427). This biasing impact of fallible covariates on the estimated treatment effect is also known in the literature as a regression-to-the-mean effect (Althauser & Rubin, 1971). Following a formula proposed by Althauser and Rubin (1971) and assuming a reliability of .80, we estimated that up to 36% of the effect of academic tracks compared to comprehensive schools and 26% of the effect of academic tracks compared to non-academic tracks might be due to measurement error of the grade 5 intelligence scores.² However, we would also like to point out that in our study the intelligence at grade 5 was measured using four subtests, which allows for a more comprehensive and reliable assessment than in previous studies (e.g., Becker et al., 2012). In addition, recent methodological research using simulation studies suggests that the inclusion of many covariates may help to mitigate the harmful influence of measurement error (Steiner, Cook, & Shadish, 2011). Concerning unmeasured confounders, one might for example think of fundamental processes like working memory and processing speed limiting the future cognitive potential of the students.
To investigate the robustness of our results, we conducted a sensitivity analysis, which revealed that any unmeasured confounder would have to have quite a strong impact either on the students' intelligence or on the track assignment to fully eliminate the effect of academic tracking on students' intelligence. Given the list of covariates that were taken into account in the matching process, we consider the existence of such an unobserved confounder unlikely. However, the effect of academic tracking compared to non-academic tracking is more robust against unmeasured confounders than its effect compared to comprehensive schools. Future studies might integrate additional measures of cognitive functioning into the matching procedure to further reduce the risk of unmeasured confounders.

Conclusion and implications for future research
Our study further supports the position that not only subject-specific competencies but also general cognitive abilities can be improved by a cognitively stimulating learning environment in school (Adey et al., 2007). The learning environment in academic tracks can be characterized by compositional and institutional effects and their interaction (Maaz et al., 2008). All of these are relevant for subject-specific achievement (Guill & Gröhlich, 2013; Opdenakker & Van Damme, 2006). Future research should try to disentangle these effects to gain more insight into their relative importance for supporting the optimal development of students' general cognitive abilities.
Thinking again of the importance of intelligence for many aspects of success and life satisfaction, one would wish to offer the cognitive advantages of an academic track to more or even all students. However, it is not possible to enroll all students at academic track schools without changing these schools themselves (at least their student composition). On a system level, an increasing number of German federal states, including Hamburg since 2010, have chosen to combine all kinds of non-academic tracks into one new track where students are prepared in shared classrooms either for a later vocational or a later academic orientation, similar to the Hamburg comprehensive schools described here. The result is a two-pillar system of academic tracks and comprehensive tracks. The federal state of Berlin even requires the same teacher education program for both types of tracks, which should increase the general level of teacher competencies in the comprehensive track.
Given that changes on the education system level are draining and not per se effective (Hattie, 2009), one would wish that students could fully develop their cognitive potential while at any existing track. Increasing the level of cognitively activating instruction for all students in the normal classroom, within-class ability grouping with more demanding tasks for the more able students, and enrichment courses on an academic track level at the non-academic track and comprehensive schools might be measures to help students at these schools further develop their cognitive potential. Additionally, all tracks might consider integrating direct intervention programs to foster cognitive abilities into their curricula, as these have been shown to have effect sizes going beyond the tracking effects (Adey et al., 2007). These training programs can be content-specific (e.g., Papageorgiou et al., 2016) or address more general competences like deductive reasoning (Christoforides et al., 2016), as both improve the students' cognitive abilities.
² [...] that is due to measurement error in a covariate as follows: $\mathrm{Bias} = \beta \cdot (\mu_{X_1} - \mu_{X_2}) \cdot \dfrac{\sigma_e^2/\sigma_x^2}{1 + \sigma_e^2/\sigma_x^2}$, where $\beta$ is the regression coefficient of the intelligence score at T2 on the intelligence score at T1 (assumed the same in both tracks), and $\mu_{X_1}$ and $\mu_{X_2}$ are the mean values of the intelligence scores at T1 in the academic track and the non-academic track (or comprehensive school) sample. Assuming a reliability of .80, the variances $\sigma_e^2$ and $\sigma_x^2$ are given as .20 and .80.
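A minimal sketch of this bias computation follows. It reads the footnote formula as Bias = β · (μX1 − μX2) · (σ²e/σ²x) / (1 + σ²e/σ²x) and assumes unit total variance, so that σ²e = 1 − reliability and σ²x = reliability. The function name and the input values are hypothetical. The 36% and 26% figures in the text cannot be reproduced here because they depend on the study's own β and T1 means.

```python
def measurement_error_bias(beta, mean_t1_group1, mean_t1_group2, reliability):
    """Bias in the estimated treatment effect caused by an unreliable
    covariate, under the reading of the formula described above."""
    error_variance = 1.0 - reliability   # sigma_e^2, assuming unit total variance
    true_variance = reliability          # sigma_x^2
    ratio = error_variance / true_variance
    return beta * (mean_t1_group1 - mean_t1_group2) * ratio / (1.0 + ratio)


# Hypothetical inputs: beta = 0.7, T1 means differ by 0.5 SD, reliability .80.
bias = measurement_error_bias(beta=0.7, mean_t1_group1=0.5,
                              mean_t1_group2=0.0, reliability=0.80)
print(f"bias = {bias:.3f} SD")
```

With a reliability of .80 the variance term reduces to .25/1.25 = .20, so the bias is one fifth of β times the T1 mean difference.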