Investigating cohort effects of early foreign language learning

ABSTRACT With the rapid implementation of early foreign language programmes in the state of North-Rhine-Westphalia, Germany, first for Grade 3 (ages 8–9 years) in 2003 and then from Grade 1 (ages 6–7 years) in 2008, primary school teachers had to adapt quickly to teaching a foreign language in Grade 1. Teachers had little experience with teaching languages to very young learners, and curricula and materials had not been tested prior to implementation. This study investigates the development of receptive English proficiency across three large cohorts (N = 7,289). The first cohort started in Grade 3, the second cohort was the first to start in Grade 1, and the third cohort started in Grade 1 six years after the initial implementation. Propensity scores and the associated sampling weights were used to compare cohorts without the influence of confounding variables. Results confirmed a slight advantage for an earlier start in primary school for students’ receptive proficiency in Grade 5. The results further indicate that proficiency scores did not improve from the first cohort of students starting in Grade 1 to the cohort starting six years later. Systemic changes in teacher education for language specialists in primary education may not yet have been able to affect student outcomes.


Introduction
The shift of language learning into earlier years of primary school has gained popularity across the globe. In Europe, early language learning (ELL) has now been introduced into the primary curriculum in almost every EU member state (Eurostat 2019). The popularity of ELL in Europe is a consequence of the considerable support by the European Council and its 'mother tongue+2' policy (Council of the European Union 2002). Across Europe, for example, most countries have introduced foreign language education early in primary school before students turn eight years old (European Commission/EACEA/Eurydice 2017). The value of foreign language skills and cultural competencies is an integral part of this policy, reflecting skills that are seen as valuable for employment and social integration and as fundamental to appreciating cultural and linguistic diversity. The belief that an early start to foreign language learning in primary school is 'easier' and has a positive, lasting effect on language proficiency is a central reason for its introduction and popularity. Research in second language acquisition (SLA), however, provides only limited support for this belief when minimal input is provided, e.g. in classroom learning. The recent increase in research on language learning outcomes has started a much-needed debate about effective ways of teaching foreign languages to younger learners and the policies necessary to provide a supportive educational setting (Hayes 2020; Wilden and Porsch 2020).
extent, due to a more developed L1 understanding, including L1 literacy, which can support L2 learning. The increased linguistic understanding also allows for more complex language contrasts. In many contexts, language learning may also be facilitated by intensive language programmes. Well-articulated programmes in primary schools, for example, immersion or partial immersion programmes (Elsner 2013) or Content and Language Integrated Learning (CLIL) approaches, have been shown to foster language development and allow attainment of high proficiency levels early on in students' academic careers (Jaekel et al. 2017; Rumlich 2016).
The goals of ELL are generally to gradually develop language skills and allow for language learning to occur over a more extended period with modest short-term goals in most contexts (Nikolov 2009). In Germany, the goals of ELL range from A1 to A2 according to the Common European Framework of Reference for Languages (CEFR; Kultusministerkonferenz 2013). These goals are in line with the generally modest achievements reported for ELL (Curtain 2009).
Younger learners, particularly girls, have shown positive attitudes towards ELL (Jia and Aaronson 2003; Moyer 2004; Szpotowicz, Mihaljevic Djigunovic, and Enever 2009). However, these attitudes are not always consistent and may decline over time (Chambers 2000; Mihaljevic Djigunovic and Krevelj 2009; Nikolov 1999). Contextual and process variables have been shown to impact the attitudes students have. Making adequate progress is vital to maintain learners' attitudes and motivation (Csizér and Kormos 2009) and to avoid boredom, which helps maintain enthusiasm for language learning, particularly for struggling learners (Bolster 2009).

Individual differences and their relevance for ELL
In the highly diverse contexts of public education, it is crucial to consider the effects individual differences have on academic outcomes. The research literature on individual differences in SLA, particularly for learners in primary through high school settings, is still limited. Awareness of these relationships is paramount in developing curricula, adjusting teaching to the point that learner needs are met, and providing an accurate representation of the context and outcomes in research. While the field of individual differences is vast, in this review, we will only focus on variables included in the study (for a more detailed discussion on individual differences, see, for example, Dörnyei 2010).
In the context of this study, the age of onset of learning English as a foreign language (EFL), gender, cognitive abilities, cultural capital, learners' L1, and students' self-concept were included in the conceptualisation of the study. Students who indicated English as their first language were excluded from the analyses. In primary and secondary education, mixed results have been reported for the impact of gender on language outcomes. However, the main tendencies support slight advantages for girls in the areas of listening and reading (Bos 2007; 2009; Jaekel et al. 2017; May 2007). Cognitive abilities are a significant predictor of academic success and, more specifically, of performance in the context of test-taking (Muñoz 2008). Figural analogy-based tests have shown their predictive relevance for a variety of language assessments (Dallinger 2015; Jaekel 2020; Jaekel et al. 2017) and allow for a language-independent evaluation of cognitive abilities. Objectified cultural capital, i.e. book ownership, is an indicator of family literacy and also serves as a proxy for students' socio-economic background (Sieben and Lechner 2019). Book capital is correlated with the academic success of students and is broadly used in educational contexts (Graaf, Graaf, and Kraaykamp 2000). In the context of SLA, Jaekel et al. (2017) have shown that book ownership predicts reading and listening outcomes in EFL.
Despite increasing linguistic diversity in classrooms, research on the impact of students' L1 on learning a foreign language in school settings remains scarce. An increasing emphasis on research in this area is particularly vital as foreign language education in schools most often relies on reference to the majority L1, particularly when contrasting grammatical constructs. Research on the L1 impact on foreign language learning highlights the positive impact of biliteracy on L3 learning rather than bilingualism alone (Rauch 2014). In the ELL context that relies more on acquisition-based teaching approaches, multilingual learners may not be at a disadvantage or potentially even benefit from their multilingual background if literacy is not a focus yet (Hesse, Göbel, and Hartig 2008). With a shift to a stronger literacy focus that relies more on explicit grammar learning and contrasting language, these multilingual learners could be at a disadvantage.
Self-concept is a dynamic, hierarchical, domain-specific, situated belief system that encompasses one's beliefs about one's abilities in a particular domain (Marsh, Xu, and Martin 2012; Mercer 2011; Pajares and Schunk 2005). While self-concept has been shown to predict academic achievement, the relationship is better described as reciprocal, i.e. academic achievement predicts self-concept and vice versa. Self-concept research has only recently received more attention in SLA (Mercer 2011). For this study with young participants, we consider the L2 self-concept to be a developing belief that is malleable through interactions with one's environment (see also Kangasvieri and Leontjev 2021; Waddington 2019).

Motivation for the study
The starting point for this study was a key contextual limitation of Wilden, Porsch, and Ritter (2013) and Jaekel et al. (2017), requiring replication in another sample. Replication studies in the social sciences provide added support for the original study through validation, provide data to support the generalisability of outcomes (Porte and McManus 2019), and offer an opportunity to address the potential shortcomings of the original study and expand the scope. The previous studies (Jaekel et al. 2017; Wilden, Porsch, and Ritter 2013) presented data of two cohorts, first recruited in 2010. Cohort 1 (C1) started EFL in Grade 3, and cohort 2 (C2) in Grade 1. Wilden, Porsch, and Ritter (2013) reported that students who started in Grade 1 (C2) significantly outperformed students who started in Grade 3 (C1) in both listening and reading skills. C2 students received a total of 105 hours of additional English lessons (Jaekel et al. 2017). Importantly, this cohort was the first-ever cohort of students in the state to learn English in Grade 1. Accordingly, the cohort's context was confounded with several potential issues related to the implementation of ELL: (1) a lack of primary school EFL teachers majoring in English instruction, (2) the curriculum was new and not yet widely tested, (3) materials were new or not yet readily available, and (4) teachers in primary schools were adapting their lessons to the younger beginning age with learners that were starting to build their German literacy in Grade 1.
What has (and has not) changed since C2 data were collected?
Since the initial introduction of ELL in primary schools in Germany, universities have established and expanded programmes to train primary school teachers for English. However, at the time of the initial two cohorts, the large majority of English teachers lacked training (Edelenbos, Johnstone, and Kubanek 2006) and had limited experience teaching a foreign language. By 2018, when this study was conducted, teachers had gained experience, and more teachers with a language degree started working in primary schools. Textbooks are generally used for language learning in Germany. When ELL was shifted to Grade 1, most available books were aimed at Grade 3, building on students' already developing literacy skills in German and their experience learning in school. Since the initial two cohorts, textbooks have been tailored to the needs of 1st graders, requiring only limited literacy skills. State curricula were focused more on oracy skills initially (MSW -NRW 2003) and moved to incorporate literacy skills more (MSW -NRW 2008). During the initial two cohorts' data collection, teachers had little experience working with the curriculum. When data for C3 were collected, teachers had had time to adjust their classes accordingly. Despite these changes, ELL remains a niche subject in primary school (van Ackern 2021). This is also reflected in a lack of a national curriculum, which the SLA field has long called for to ensure cohesion. Following these contextual changes and developments, a key aim of this current study is to evaluate the contextual effects of the 6-year implementation on students' EFL learning progress.

Research questions
In this study, we aim to answer two research questions:
1. How does a start of EFL in Grade 1 versus Grade 3 affect English listening and reading proficiency in Grade 5? (replication study)
2. How has the 6-year implementation of EFL in Grade 1 affected English listening and reading proficiency of a new cohort in Grade 5?
These questions will be tested while controlling for individual student differences and contextual variations using propensity score matching (PSM).
Hypothesis 1: C2 and C3 will perform significantly better than C1 in English listening and reading assessments due to the additional 1 ½ years of instruction.
Hypothesis 2: C3 will perform significantly better than C2 in English listening and reading assessments. This hypothesis is based on the available research literature discussed in the literature review and the hypotheses we proposed in Jaekel et al. (2017) that skills should improve with curricular adjustments and updated teaching materials.

Study context
The current study initially started as part of the longitudinal, multi-disciplinary Ganz In -All Day-Schools for a Brighter Future project endorsed by the Ministry of Education in North-Rhine Westphalia, Germany (MSW -NRW 2015). Overall, 31 grammar schools participated in 2010 and 2012 (see Figure 1) as part of a quasi-experimental design. Participating schools covered large areas of the state, both in rural and urban contexts.
Germany's secondary school system differs between states. Generally, students are streamed into different secondary schools after Year 4. The Gymnasium (grammar school) and Gesamtschule (comprehensive school) offer high school qualifications that provide full access to tertiary education. With the focus on grammar schools, participants in this study will have shown better than average academic aptitude at their primary school as grammar schools generally attract students with better grades or more promising academic development. The project from which the data originate focused on improving graduation rates in schools with, on average, higher rates of students from immigrant and lower socio-economic background families. Therefore, comparing these data with other grammar schools needs to be done with caution.
The current study includes three cohorts and adds a natural experiment setting (i.e. the curriculum implementation) to the original quasi-experimental design. Previously, we compared two cohorts of ELLs at the beginning of secondary school in Grades 5 (ages 9-10 years) and 7, respectively (see Figure 1). C1 started EFL in Grade 3 of primary school, while C2 started in Grade 1. In 2008, students in C2 were the first cohort of students in the state to receive EFL lessons from Grade 1. The new C3 entered Grade 1 in 2014, six years after C2, and was assessed in the fall of 2018. Therefore, the three cohorts that are compared in this study differ in their ELL experience while stemming from the very same geographical and educational context. C1 received 140 hours of EFL instruction, while C2 and C3 received an additional 105 hours for a total of 245 hours across their primary school years.
Consequently, data for C3 in Grade 5 were collected to ascertain if a trialled curriculum would yield better outcomes. In 2018, eight schools participated in the replication study, which followed the same procedure as the previous data collections. All previous participating schools, as well as other regional schools, were contacted and asked to take part in this replication study. Participation was voluntary. Two of the schools that participated in the Ganz In project and the data assessments in 2010/2011 and 2012/2013 took part in this replication study. Six new schools from the same region agreed to participate in the study.

Participants
Participants were all Grade 5 students (average age of 10 years) at participating schools whose parents consented. In addition, students assented to participate in the study and had the opportunity to opt out themselves. At each of the participating schools, all Grade 5 classes participated in this study. Overall, 26% of students reported that they speak a language other than German at home.

Procedure
The paper-pencil assessments and student questionnaires were conducted in schools during regular school hours. In addition, parental questionnaires were collected.

Instruments
The dependent variables 'English reading' and 'Listening proficiency' were assessed using previously validated scales from the Evening study (Evaluation Englisch in der Grundschule [Evaluation of English as a Foreign Language in Primary School]; Engel and Ehlers 2013). For listening, students answered 28 multiple-choice questions targeting picture recognition (17) and sentence completion (11) in German. For reading, 20 multiple-choice and 4 open-answer items assessed text understanding. All items on the reading and listening tests were coded dichotomously. The fit of the items and the consistency of the tests were evaluated within independent one-parameter logistic models. All items showed a reasonable mean-square (MNSQ) outfit and infit, ranging from a minimum of 0.90 to a maximum of 1.15. The reliability of the person estimator reached a WLE reliability of 0.66 for reading and 0.67 for listening. Based on those evaluations, the test scores were derived as sums (Rost 2004).
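The infit and outfit statistics mentioned above can be illustrated with a minimal sketch. It assumes the textbook definitions for a one-parameter logistic (Rasch) model: outfit is the mean of the squared standardised residuals, and infit is the sum of squared residuals divided by the sum of the model variances. The function names, ability values, and responses are hypothetical and are not taken from the study, which does not report how the statistics were computed.

```python
import math

def rasch_prob(theta, b):
    """Probability of a correct answer under the one-parameter logistic model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_fit(responses, thetas, b):
    """Return (outfit, infit) mean-square statistics for a single item.

    responses: dichotomous (0/1) answers of each person to the item
    thetas:    ability estimates of the same persons (hypothetical values)
    b:         difficulty of the item
    """
    sq_resid = []   # squared residuals (x - p)^2
    variances = []  # model variances p * (1 - p)
    for x, theta in zip(responses, thetas):
        p = rasch_prob(theta, b)
        sq_resid.append((x - p) ** 2)
        variances.append(p * (1.0 - p))
    # Outfit: unweighted mean of squared standardised residuals.
    outfit = sum(r / v for r, v in zip(sq_resid, variances)) / len(responses)
    # Infit: information-weighted version of the same quantity.
    infit = sum(sq_resid) / sum(variances)
    return outfit, infit

# Hypothetical toy data for one item answered by six students.
responses = [1, 0, 1, 1, 0, 1]
thetas = [0.8, -0.5, 1.2, 0.1, -1.0, 0.4]
outfit, infit = item_fit(responses, thetas, b=0.0)
print(f"outfit MNSQ = {outfit:.2f}, infit MNSQ = {infit:.2f}")
```

Values close to 1 indicate that the observed responses vary about as much as the model predicts, which is the sense in which the reported range of 0.90 to 1.15 is described as reasonable.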
Cognitive abilities were assessed with the Figural Analogy Form B subtest of the Kognitiver Fähigkeitstest 4-12+R (KFT; Cognitive Abilities Test) by Heller and Perleth (2000), as it allows for an estimate of students' general cognitive abilities independent of their L1 proficiency. In the present sample, the test reached good reliability in line with the norm-sample values (α norm = .94, α all cohorts = .92, α cohort
Demographic variables, including student age, gender (0 = male, 1 = female), cultural capital, and home language, were based on students' responses to questionnaires. Regarding cultural capital, the students were asked how many books were present in their homes. Five categories were offered: 1 '0-10', 2 '11-25', 3 '26-100', 4 '101-200', or 5 'more than 200'. For home language, participants were asked which language they regarded as their mother tongue. The answer was dichotomised into 1, 'L1 is the language of the country of the test', and 0, 'L1 is another language'. Where data were not available from students because of non-response, parental responses were used instead. All items originated from the international TIMSS and PIRLS assessments and were adapted for use in Germany (e.g. Bos 2009).
To map self-concept, three individual idiosyncratic variables were used, which asked to what extent the child believed that they would get good grades in English (20a), learn quickly (20b) or refer to themselves as a 'lost cause' (20c; Wilden, Porsch, and Ritter 2013). In addition, the participants' last report card grades in German, mathematics, and English were collected (from very good (1) to unsatisfactory (6)).

Statistical analysis plan
Only students who had completed at least one of the two parts of the test in English were included. The overall percentages of missing values ranged from 18.8% for gender to 22.5% in the test for cognitive abilities.
For the average distribution of background variables within the cohorts before imputation, please see Table 1. A structural difference distinguishes the first two cohorts from C3. The cognitive abilities and the amount of objectified cultural capital of the sample in C3 are significantly lower than in C1 and C2. Additionally, the proportion of L1 students differs slightly.
Missing values were then imputed using the R package mice (van Buuren and Groothuis-Oudshoorn 2011), and the weighting and comparison of more than two non-equivalent groups were carried out with the R packages twang (Ridgeway et al. 2014) and survey (Lumley 2021). M = 5 imputed datasets were generated. Next, propensity score weights for all three points of measurement were calculated. The point of measurement, therefore, was used as the treatment variable for the derivation of propensity score weights. As the estimand, the average treatment effect was chosen. This effect estimates the change in the outcome if the treatment were applied to the entire population (Ridgeway et al. 2014). The depth of interactions accounted for was 2. The stopping method was the mean of the effect size. All five propensity score matchings (imputation 1-5) were evaluated for balance by comparing weighted and unweighted differences as well as the distribution of the propensity scores between cohorts.
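The weighting logic described above can be sketched in a few lines. The example assumes the standard inverse-probability form of average-treatment-effect weights for more than two groups, in which each student is weighted by 1 / P(observed cohort | covariates). It is written in Python rather than R, and the propensities are estimated from simple cell counts instead of the boosted models that twang fits. The toy sample, the single binary covariate, and all function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy sample: (cohort, many_books), with many_books coded 0/1.
students = (
    [("C1", 1)] * 3 + [("C1", 0)] * 1
    + [("C2", 1)] * 2 + [("C2", 0)] * 2
    + [("C3", 1)] * 1 + [("C3", 0)] * 3
)

def propensity_table(students):
    """Estimate P(cohort | covariate) from cell counts.

    Assumption: this stands in for a fitted propensity model.
    """
    cell = Counter(students)
    by_x = Counter(x for _, x in students)
    return {(c, x): n / by_x[x] for (c, x), n in cell.items()}

def ate_weights(students, table):
    """ATE weight of each student: 1 / P(observed cohort | covariate)."""
    return [1.0 / table[(c, x)] for c, x in students]

def cohort_means(students, weights):
    """Weighted mean of the covariate within each cohort."""
    num = defaultdict(float)
    den = defaultdict(float)
    for (c, x), w in zip(students, weights):
        num[c] += w * x
        den[c] += w
    return {c: num[c] / den[c] for c in den}

table = propensity_table(students)
unweighted = cohort_means(students, [1.0] * len(students))
weighted = cohort_means(students, ate_weights(students, table))
print("unweighted:", unweighted)
print("weighted:  ", weighted)
```

In this toy sample the unweighted cohort means of the covariate differ, whereas the weighted means all coincide with the mean of the combined sample. This is the kind of balance that the comparison of weighted and unweighted differences is meant to check.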
The population means, as well as the weighted means of the background variables, are given in Table 2. After applying the propensity score weights, the background characteristics are balanced across cohorts, so that the cohort-specific means of the dependent variables can be compared.
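Because the analyses were run on five imputed datasets, the resulting estimates have to be combined into a single value. The sketch below shows Rubin's rules, a common way of pooling estimates after multiple imputation. That the authors pooled in exactly this way is an assumption, and the numbers and function name are hypothetical.

```python
from statistics import mean, variance

def pool_rubin(estimates, squared_ses):
    """Pool one estimate across imputed datasets with Rubin's rules.

    estimates:   point estimate from each imputed dataset
    squared_ses: squared standard error from each imputed dataset
    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(squared_ses)      # average within-imputation variance
    between = variance(estimates)   # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total ** 0.5

# Hypothetical mean scores and squared standard errors from M = 5 imputations.
estimates = [14.2, 14.5, 14.1, 14.4, 14.3]
squared_ses = [0.040, 0.042, 0.039, 0.041, 0.040]
pooled, se = pool_rubin(estimates, squared_ses)
print(f"pooled estimate = {pooled:.2f}, pooled SE = {se:.3f}")
```

The between-imputation term inflates the standard error to reflect the uncertainty introduced by the missing values.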

Results
Table 3 presents average scores in reading and listening for the three cohorts. Based on a multidimensional generalised linear model with cohort as a categorical predictor for the reading and listening scores, C2 and C3 differ significantly from C1. With C2 selected as the reference level, C1 scored significantly lower, while C3 did not score significantly higher (Table 4).
Regarding Research Question 1, the results demonstrate an advantage for English skills for starting ELL in Grade 1 (C2 and C3) compared to a later start in Grade 3 (C1), confirming Hypothesis 1 and replicating Wilden, Porsch, and Ritter's (2013) findings. When transformed into Cohen's d, the differences can be expressed as a small to medium effect of d = −0.313 between cohort 1 and cohort 2 and d = −0.352 between cohort 1 and cohort 3. The effect of the difference between cohort 2 and cohort 3 is d = −0.052.
As for Research Question 2, the values in Table 3 show a slightly higher score for C3 in their English proficiency scores. However, the multidimensional generalised linear model does not deem the proficiency difference between C2 and C3 to be significant. Hypothesis 2, i.e. that C3 would perform significantly better than C2 after 6 years of ELL implementation, was not confirmed.
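As a reminder of how such effect sizes are read, the following sketch computes Cohen's d from two group means using the pooled standard deviation. This is the textbook formula. The study derived its values from the weighted model, so the exact procedure may differ, and the means, standard deviations, and sample sizes below are hypothetical.

```python
import math

def cohens_d(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / (n_a + n_b - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var)

# Hypothetical scores: a later-starting group (a) and an earlier-starting group (b).
d = cohens_d(mean_a=30.0, sd_a=6.0, n_a=200, mean_b=32.0, sd_b=6.5, n_b=220)
print(f"d = {d:.3f}")
```

With group a entered first, a negative d means that group a scored lower than group b, which matches the sign convention of the values reported for cohort 1 relative to cohorts 2 and 3.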

Discussion
There are two main findings from this study. Firstly, as expected, the study replicated Wilden, Porsch, and Ritter's (2013) findings that students starting ELL in Grade 1 (C2 and C3) scored slightly higher on receptive skills tests than students who started in Grade 3 (C1), confirming Hypothesis 1. Secondly, there were no differences in receptive proficiency scores in Grade 5 between the first-ever cohort of students learning English from Grade 1 onwards and another cohort of students six years later. Hypothesis 2, that ELL in primary school has become more efficient, i.e. implying that students attain higher levels of language proficiency, with expected growing teacher experience, professional development, and trialled curricula, could not be verified. Mean score comparison demonstrates that the two early starting cohorts' mean proficiency scores did not differ significantly, consequently refuting this hypothesis.
Propensity score matching allowed us to compare students across cohorts with very similar traits to focus on sample level changes, i.e. analyses focused on the three cohorts within their contextual constraints, not on the traits of students. Propensity scores and associated sampling weights were used to compare non-equivalent cohorts without the influence of confounding variables. Students in the three cohorts were matched based on cognitive abilities, book capital, and L1, which predicted receptive proficiency, as well as on grades in English, mathematics, and German and facets of self-concept. This matching procedure allowed us to compare the three cohorts while avoiding confounding contextual differences.
As data were collected after 6-9 weeks in Grade 5, proficiency scores reflect the outcome of primary school English classes rather than gains made after the transition to secondary school. The difference in receptive proficiency between C2 and C3 is negligible and indicates no systemic improvement or learning effect. The results offer two possible explanations: (1) the initial implementation of ELL already achieved the maximum possible outcome for receptive skills, or (2) few changes regarding teacher training, curricular adjustments, materials, and teaching approaches have been made and/or few gains achieved.
The first possible explanation posits that there are no further gains possible. If the maximum outcome had already been achieved, the results would be underwhelming. Considering that ELL at this scale, i.e. mandatory state-level implementation, is still very new, this explanation is not likely, because educational research has shown that systemic change is often evaluated in decades (Tenorth 2010). A key factor that likely impacted student outcomes is teachers' training and professional qualities. Teacher training in Germany focused on primary school ELL was increasing capacity to meet the level of the new demand at the time of the data collection, but the impact of these new training programmes may need more time to have an effect. Since our study focused on receptive skills, there may be improvements in productive skills that this study could not uncover.
The second possible explanation suggests that few or no changes were made over the six-year period between the two assessments of the cohorts that started in Grade 1 or that the effects of adjustments have not yet trickled down to the classroom to have a meaningful impact. It is likely that a combination of the two is the cause, as state- or system-level changes take time. At the time of data collection, few teachers had extensive experience in teaching a foreign language to young learners. At this early stage in the implementation, many teachers likely lacked specific ELL education, professional development (Edelenbos, Johnstone, and Kubanek 2006; Mihaljevic Djigunovic, Nikolov, and Otto 2008), or (confidence in their) English language proficiency (Edelenbos, Johnstone, and Kubanek 2006; Jaekel et al. 2017; Piske 2013). The lack of qualified primary school English teachers was addressed with intensive courses that prepared licensed teachers to teach without having studied the language. However, it will take time to introduce new licensed primary school teachers with a specialisation in English across the country. This is an important development to consider along with the presented results as research has shown the positive impact teacher quality and experience have on student attainment in general (Clotfelter, Ladd, and Vigdor 2006; Harris and Sass 2011). In the context of language education, teachers' language proficiency (Unsworth et al. 2014), teaching quality, and teacher qualification (Wilden and Porsch 2020) have been shown to have a profound effect on students' language attainment. With increasing numbers of highly qualified language teachers entering the workforce, findings may show different trends in future cohorts.
The outcome suggests that more focused research is needed that includes quantifiable teacher data such as English proficiency, time spent abroad, experience teaching young learners, or efficacy in teaching English. Contextual variables are also important to consider in order to identify universal versus context-specific factors of teaching foreign languages to young learners. Future studies should also include productive skills and other outcome variables that ELL may impact, such as long-term motivation and efficacy, to determine the overall benefit of an early start to language learning. International, longitudinal studies could help the field better understand how different educational contexts impact ELL.
The results do not warrant a recommendation about whether a start in Grade 1 or Grade 3 would be better for learners in the long run. The study only investigated receptive skills and, as such, cannot make predictions about productive skills development for EFL.
Lastly, while a focus on measurable outcomes is highly desirable in today's education contexts for policymakers and parents, other curricular goals that are not easily quantifiable, such as intercultural communicative competence, must be considered in decisions about ELL policies.

Limitations
As with every study, there are limitations. This study did not assess productive skills or collect teacher variables. These are two important factors that need to be considered in future research. As students in C2 and C3 both started early and received 105 more hours of English lessons than C1 in their first two years of primary school, the two effects cannot be disentangled to explain which one is the cause of the better performance of C2 and C3.

Conclusion
An earlier start to ELL allows schools to offer lower intensity programmes across the primary school years. The data demonstrate that an earlier start with only 1-2 language lessons per week resulted in a slight advantage for listening and reading skills early on in Grade 5. The results of this study suggest that we need to have patience and foresight when implementing an educational policy as wide-reaching as ELL. Furthermore, to help better understand foreign language education in primary school in mainstream education, research should increasingly include teacher variables.

Figure 1. Project outline; only Grade 5 listening and reading data were included in the analyses.

Table 1. Means, standard deviations, and one-way ANOVA or χ² test of background variables by cohort.

Table 2. Comparison of population means and weighted means.

Table 3. Pooled and weighted average reading and listening scores by cohorts.

Table 4. Pooled and weighted generalised linear effects between cohorts on reading and listening (Reference = Cohort 2).