Moving beyond binary language status in research: Investigating early foreign language learning and linguistic distance

Globalization and migration continue to shape our societies, including educational contexts such as school classrooms. In response to young learners' linguistic needs, particularly in the context of foreign language learning in Germany, educational approaches need to be adapted to meet the needs of multilingual students. Current binary approaches to accounting for students' diverse linguistic backgrounds in research assume a high degree of homogeneity among multilingual students. Linguistic distance measures may provide alternative, more fine-grained, continuous tools to account for linguistic diversity. This study employs lexical linguistic distance to account for young language learners' linguistic diversity in a reanalysis of Jaekel et al. (2017). Additionally, mixed-effects modeling was employed, instead of the structural equation modeling used previously, to account for within-class effects. The results indicate that linguistic distance provides additional information beyond binary language status. Mixed-effects modeling yields comparable results with the same tendencies but offers a more nuanced perspective on the data.


Introduction
Increasing globalization and migration developments continue to shape societies and classrooms alike (Borgonovi & Ferrara, 2020), which necessitates a change or adaptation of educational approaches such as teaching as well as assessments to cater to the needs of all students, including those with a multilingual background (Hall & Cook, 2012; Singleton & Aronin, 2018). Linguistic diversity and heterogeneity in schools have long been acknowledged as the norm rather than the exception (e.g., May, 2013); however, there remains a discrepancy between research findings and actual teaching approaches in schools (Krulatz, Neokleous & Lorenz, 2023). In addition, whereas there is work on multilingual teaching, including plurilingual approaches such as pedagogical translanguaging (Cenoz, 2017; Cenoz & Gorter, 2021), as well as multilingual testing and assessment (De Angelis, 2021), many schools keep following monolingually oriented teaching and assessment approaches (Hall & Cook, 2012; Illman & Pietilä, 2018).
Notwithstanding, the issue of how to fully account for the diverse language repertoires of speakers is also relevant in research. Linguistic and educational studies addressing how monolinguals, bilinguals, and multilinguals acquire additional foreign languages often struggle to represent the linguistic heterogeneity found in today's classrooms (Mavrou & Chao, 2023). Frequently, including in our own work, linguistic background or language repertoires are simplified to categorical variables, such as a Turkish, French, or Chinese speaker learning English as a foreign language, or comparisons are drawn between bi-/multilinguals versus monolinguals acquiring a foreign language (see, for example, Jaekel et al., 2017; Lorenz, 2022; Lorenz et al., 2020). With such approaches, speakers of different languages are understood as occupying a position on the same level, ignoring the different linguistic starting points for learning a language these learners have based on the linguistic proximity or distance between the language currently studied and their previously acquired language(s) (e.g., Muñoz et al., 2018; Schepens et al., 2016). This approach is at odds with previous research that has identified an effect of typological similarity concerning additional foreign language learning (e.g., Cenoz, 2001; Rothman, 2011). Moreover, students are often selected to participate in a study because they belong to a specific speech community (e.g., Brandt et al., 2017; Klinger et al., 2022). This limits the categorical levels one needs to control for in a study and facilitates analysis and interpretation. However, at the same time, language backgrounds that are relatively infrequent in a particular speech community are rarely considered or often excluded from studies due to their small numbers. In order to truly represent the heterogeneity found in today's classrooms, all linguistic repertoires need to be included to obtain more stable, trustworthy, and generalizable results.
Relatively recently, a number of studies have started to operationalize different learner groups with a continuous measure instead of binary (monolingual versus bilingual, native versus non-native) or categorical (Russian, Turkish, Vietnamese, …) variables (e.g., Schepens et al., 2016; van der Slik, 2010; see also section Research outcomes). These continuous measures approximate linguistic distance (LD) and have the advantages of (1) differentiating between the various speaker groups and (2) ensuring that all languages can be included within one study. Different LD measures have been proposed: cognate, lexical, (phylo)genetic, and morphological (see section Different LD measures).
When considering multilingual students' achievement in large-scale assessments, we consistently find that, on average, they lag behind their monolingual native-speaking peers in academic achievement (OECD, 2018). However, in the context of mathematics assessments in the Programme for International Student Assessment (PISA), teachers did not expect lower achievement from their multilingual students; instead, they overestimated results (Hachfeld et al., 2010). The authors conclude that "good oral language skills of bilingual students may mask potential language problems in the academic domain" (Hachfeld et al., 2010, p. 87). These outcomes reflect the distinction Cummins (2000) made between cognitive academic language proficiency required in schools versus basic interpersonal communicative skills used in everyday contexts. This study outlines the interaction effects between language background, immigration, and the linguistic complexity of the respective tasks.
Within the multilingual landscape, it is also important to consider that across the globe, early language learning (ELL), especially English as a Foreign Language (EFL), has been well-established, often commencing in the first years of elementary school. While ELL students show positive language development, generally meeting or exceeding curricular goals (e.g., De Bot, 2014; Enever, 2011), older students tend to progress more rapidly, building on their more advanced literacy skills and cognitive maturity (Jaekel et al., 2017; Muñoz, 2008). Nevertheless, the long-term development of ELL students' EFL proficiency has received increasingly positive evaluations (Jaekel et al., 2022; Porsch et al., 2023). However, the introduction of ELL in year 1 of elementary school in Germany has also been criticized, following the declining outcomes in majority language literacy in national and international comparative studies (Mullis et al., 2017; OECD, 2023). These concerns and an overall tight elementary school curriculum have led to the reversal of educational policies in some contexts, for example, moving ELL back from year 1 to later years (e.g., MSW-NRW, 2020). Majority language literacy has been shown to significantly affect foreign language proficiency (e.g., Baumert et al., 2020), and multilingual students and those from low socioeconomic backgrounds tend to perform lower in language assessments (Edele et al., 2018; Jaekel et al., 2017), unless learners' background variables such as biological sex, first language (L1), or socioeconomic status are controlled for (Hopp et al., 2021). However, plurilingual or immersive approaches have demonstrated encouraging outcomes, particularly in diminishing or eliminating disparities between multilingual learners and monolingual speakers (Steinlen & Piske, 2018).
In the following contribution, we first offer a close look at LD, including its definition(s) and different LD measures, and subsequently zoom into language, educational, as well as societal research outcomes of studies employing LD as a numeric measure. We then introduce the current study, a re-analysis of data used in Jaekel et al. (2017), and subsequently present the methodology and results, zooming in on LD effects on ELL based on mixed-effects regression models. Finally, these findings will be discussed in light of Jaekel et al.'s (2017) findings, in which LD was not considered, and ultimately placed within the current debate on educational policies.
Measuring of differences in language learning: a closer look at linguistic distance
LD as a concept has been defined and ultimately operationalized in different ways. It aims to measure how different or similar languages are to quantify these differences for further analysis. Crystal (1987, p. 371) refers to LD as "[…] structural closeness of languages to each other […]." From a language typology perspective, LD could be understood as the degree of difference or similarity between languages. When LD is based on language typology, it refers to comparing languages based on their structural features rather than their historical relationships. These structural features can include aspects of phonology, morphology, syntax, and semantics (Llama et al., 2010; Rothman, 2015). In this context, a small LD would indicate that two languages share many structural features, making them typologically similar. Conversely, a large LD would suggest that the languages are typologically distinct, with few or no shared structural features. As a measure, LD helps us to understand the relationship and divergence among languages.
More narrowly, LD has also been defined within the LD scales and measurements employed in studies. For example, LD measures based on the Levenshtein distance, a string metric for measuring the difference between two words calculated based on the number of letter deletions, substitutions, and insertions (Levenshtein, 1966), have used this distinction to define LD (see, for example, Petroni & Serva, 2010). However, it can also be understood as a learnability index (Schepens et al., 2016; Chiswick & Miller, 2005). Where languages share similar linguistic features, language learning is assumed to be enhanced by facilitative transfer. When investigating native versus non-native speakers or immigrant versus non-immigrant participants, LD could also be used to move beyond a binary perspective, providing a more fine-grained, continuous scale to account for language differences and similarities (Jaekel et al., 2023).
In view of Crystal's definition from 1987, LD can be considered an established property of language variation. Yet linguistic, educational, and other research has so far sparsely employed it and largely relied on categorical or binary language status. Undoubtedly, the challenge of creating a genuinely comprehensive measure that accounts for all linguistic features has been a significant barrier (Gooskens, 2007). The complexity of encapsulating all aspects of linguistic variation accounting for multiple dimensions, among others lexical, phonological, and grammatical similarities and differences, into a single numeric score has likely deterred many linguists from adopting LD measures. Instead, they have tended to rely on simpler, binary variables that are more easily agreed upon and accessible.
Additionally, researchers may have hesitated to use reduced or approximate measures, feeling that such simplifications might not capture the full complexity of linguistic relationships. While current linguistic distance measures are approximations, focusing on specific aspects like grammar or lexicon but overlooking others, they nevertheless represent progress in the field.

Different LD measures
Traditional language family trees do not offer the possibility to quantify differences between languages easily or lack the nuance that other measures provide. Overall, quantifying LD constitutes a "highly complex" task (Crystal, 1987, p. 371), and many different approaches have been taken to establish reliable and valid instruments. The methods used have predominantly adhered to typological strategies where features or structures such as lexical, morphological, phonetic, or syntactical elements between languages have been compared and quantified.
A prominent LD measure is cognate distance, which compares languages by the percentage of cognates they share, i.e., words that are related to or derived from a word in another language and resemble it in spelling and/or pronunciation. Lexicostatistical measures have been used to establish measures alongside language family trees that (a) confirm the structure of language family trees, and (b) provide a quantifiable means of comparing languages based on word similarity/cognateness (Dyen et al., 1992; McMahon & McMahon, 2005). The idea of a "shared lexicon" is easily relatable and beneficial for non-linguists without any specific theoretical understanding of grammatical or morphological rules (Rothman, 2015).
Beyond cognate distance, lexical distance offers another means to compare words across languages. Many lexical distance measures incorporate the Levenshtein distance (Levenshtein, 1966), representing the number of character changes necessary to change one word to another. For example, the Levenshtein distance between house (English) and casa (Spanish) is 4, while the distance from house to Haus (German) is only 2 (not accounting for the change of the capital letter). The project team around the Automated Similarity Judgement Program (ASJP; Wichmann et al., 2020), for example, based their lexical distance measure on the Levenshtein distance. The ASJP LD score, derived from a 40-word Swadesh list (Swadesh, 1955) of words that are universally significant and culturally independent, is normalized for word length differences by the ASJP software, yielding the Levenshtein Distance Normalized (LDN) score (Wichmann et al., 2010). This is done by dividing the Levenshtein distance by the number of symbols in the longer word. A final normalization, termed Levenshtein Distance Normalized Divided (LDND), adjusts for random word similarities (Wichmann et al., 2020).
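The word-level calculation described above can be illustrated with a minimal Python sketch. It is not the ASJP software: the function names are hypothetical, and it compares plain lowercase spellings, whereas ASJP operates on its own phonetic transcription of the Swadesh-list items. It only shows how the Levenshtein distance and its length-normalized variant (LDN) are obtained for a single word pair.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from the first i characters of a to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Levenshtein distance divided by the length of the longer word (LDN)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


# Illustrative use with the word pairs from the text (lowercased spellings,
# an assumption made here; ASJP would use phonetic transcriptions).
for other in ("casa", "haus"):
    print(other, levenshtein("house", other), normalized_levenshtein("house", other))
```

Dividing by the longer word's length bounds the score between 0 (identical) and 1 (no shared positions), which makes word pairs of different lengths comparable before they are averaged over a word list.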
Another sophisticated means of calculating LD is the index method using data from the World Atlas of Language Structures (WALS; Dryer & Haspelmath, 2013; see, for example, Berthelé et al., 2022; Chai & Bao, 2023). These methods build on extensive work in contrastive linguistics, allowing LD measures to be multidimensionally founded, drawing from phonological, grammatical, and lexical language properties. The method is advantageous as it allows flexibility to adjust measures to particular fields, such as beginners' foreign language school grammar, or particular research interests, such as a focus on morphology and LD. A drawback of the index method is that the WALS and other databases do not include all relevant languages or that individual language properties may not have been examined yet and would be missing from the linguistic distance measure.
(Phylo)genetic distances refer to the difference between languages based on their evolutionary history. Historical linguistics and language phylogeny used this concept when studying the evolutionary relationships among languages. Methods based on phylogenetic distance aim to construct a phylogenetic tree that aligns with a matrix of pairwise genetic distances (Cavalli-Sforza et al., 2018; van der Slik, 2010; Wichmann et al., 2010). For a detailed discussion and outline of the calculation and theoretical foundation of genetic distances, please refer to Cavalli-Sforza et al. (2018).
Alternative measures have included language learning difficulty scores. The Foreign Service Institute of the US Department of State categorized languages according to the number of hours it takes an average American to reach general professional proficiency (https://www.state.gov/foreign-language-training/; see, for example, Chiswick & Miller, 2005).
More recently, in order to compare second-language writing, Chai and Bao (2023) categorized learners of Chinese as those experienced with or without logographic writing.
While the various LD measures have their respective advantages and drawbacks, it must be acknowledged that they can only approximate the linguistic complexity of languages without being able to fully model it. We also have to acknowledge that LD generally focuses on a standard of a language, neglecting the many varieties or dialects that exist (see, for example, LD in the Arabic context).
For this study, we opted for lexical distance (ASJP) due to the availability of measurements that include almost all language comparisons, ensuring wider inclusion for languages that may otherwise have not been included in our analyses, for example, Malayalam or Wolof. The measure has been widely used and assessed and is reliable and valid (e.g., Berthelé et al., 2022; Borgonovi & Ferrara, 2020; Jaekel et al., 2023). While other measures might yield important results, considering the research context of ELL, the transfer of words might be the most immediate benefit for learners and relatable for researchers and practitioners alike. Further, the lexicon is more transparent to language learners and non-linguists learning a (new) language (see Rothman, 2015).
While LD promises a more fine-grained perspective on language similarities and differences, social, cultural, and economic factors are intertwined with language variables, particularly in migration-related contexts. To accurately disentangle any linguistic variables within such a study context, it is essential to statistically control for additional background traits such as parental education or socioeconomic status, as is standard in educational psychology. Furthermore, from a linguistic perspective, incorporating variables related to language acquisition, such as the onset of second language (L2) learning, age of arrival, and the specific variety or dialect spoken by participants, can provide further valuable insights.

Language outcomes
A number of studies have included LD when investigating different learner populations and their respective outcomes in a variety of language measures. Some have identified a statistically significant effect of lexical LD of the L1 on performance in the majority language as well as the foreign language. For example, Jaekel et al. (2023) found a negative effect of lexical LD to English on English reading and English listening comprehension tests. Thus, students who speak a language at home that is more similar to English were shown to have an advantage in English reading and listening, compared to their peers whose L1 is more distant to English. These findings were part of a study where the authors additionally controlled for numerous social variables such as cognitive ability, socioeconomic status, or sex in a large sample (n = 3,179) of a linguistically heterogeneous group of secondary school students (year 5; age 10) growing up in Germany (see also Jaekel et al., 2024). In a different setting, namely with young (age 7 and 9) Danish and Spanish/Catalan learners of English, Muñoz et al. (2018) reported a significant effect of lexical LD to English in a test assessing receptive vocabulary skills. Even though the Danish students had fewer hours of English instruction, the smaller LD to English (compared to the LD of Spanish to English) was argued to be an advantage over their Spanish/Catalan peers, visible in the comparable English scores of the two groups. A strong effect of LD on foreign language listening and reading (English, French, or Spanish) was confirmed in a study by Lindgren and Muñoz (2013). They compared school-aged foreign language learners from Croatia, Denmark, Italy, Poland, Spain, Sweden, and the UK and tested their foreign language reading and listening skills. LD, based on cognate LD, was the strongest predictor of the test scores. Moreover, Berthelé et al. (2022) analyzed 14-year-old multilingual French learners in Switzerland and found a decrease in French listening, reading, writing, and speaking scores with increasing lexical LD. Schepens et al. (2016) examined state exam results of Dutch universities in a large sample of multilinguals (n = 39,300). They found significant effects of both LD measures they employed, namely lexical and morphological LD. Thus, the learnability of the third language, Dutch, was less successful if the previously acquired languages were lexically more distant and showed a lower morphological complexity than Dutch (Schepens et al., 2016; see also Schepens et al., 2013). In addition, they could also demonstrate that the distance effect of the speakers' L1 was stronger than that of the second language (L2). Similarly, van der Slik (2010) compared adult migrants from Western European countries speaking 11 Western European languages in their speaking and writing performance in Dutch as an L2. He controlled for different learner characteristics (age of arrival, length of residence, hours of studying Dutch, years of education, gender) as well as context characteristics. The latter included schooling quality and status of being monolingual versus multilingual and two types of LD measures, namely cognate and genetic LD. Van der Slik (2010) identified a significant positive effect of cognate LD and a significant effect of genetic LD. With a similar population but in a different setting, Chiswick and Miller (2005) analyzed English proficiency among a large sample (> 450,000) of male adult immigrants in the US and Canada. They used a language score that measured learnability difficulty of English to operationalize LD and found that if confounding variables such as age, years of schooling, length of stay in the new country, among others, were held constant, LD had a significantly negative effect on the immigrants' English proficiency (see also Isphording & Otten, 2011, who report similar results on language fluency of immigrants, based on lexical LD, in the German context, or Isphording & Otten, 2013, for the US, Germany, and Spain). Furthermore, Crossley et al. (2019) investigated oral language competences, more specifically frequency effects in English production, in the US by learners with different L1s, relying on Chiswick and Miller's (2005) LD measure (i.e., learnability difficulty). They found LD to affect the learners' word frequency, with a smaller distance to English resulting in the production of more frequent words. Moreover, Mavrou and Chao (2023) analyzed the L2 Spanish writing competence of approximately 300 immigrant workers with different L1s. They employed cognate LD and found that this variable was among the significant predictors of L2 writing scores. Thus, similarly to the studies reported above, they identified a positive association between linguistic proximity (i.e., small LD score) and a language score (here, L2 writing). Those learners who spoke languages with a smaller cognate LD to Spanish had an advantage in writing in Spanish compared to those with a larger cognate LD (see also Isphording & Otten, 2013).
However, some studies have not detected a significant effect of LD. Shatz (2021) reported, based on L2 English learner corpus data, that LD was not a significant predictor for lexical diversity when controlling for English proficiency. Similarly, Shatz (2021) could also not detect a significant effect of LD on English vocabulary use when English proficiency and word frequency were accounted for in the analysis. The author argues that even though lexical similarity can be facilitative in foreign language learning and processing, its effect on lexical diversity and vocabulary use is limited. He suspects "learners to have similar lexical diversity, regardless of the lexical similarity between their L1 and the target L2, at least in certain task-based settings" (Shatz, 2021, p. 57; see also Shatz, 2021, pp. 88-89).

Educational outcomes
Beyond the effects of predicting performance in language subjects, some studies identified an association between LD and general educational success. Jaekel et al. (2024) report an adverse effect of paternal LD to German on scores in a mathematics test among year 5 (age 10) students in Germany. In addition, Borgonovi and Ferrara (2020) used data from the large-scale PISA assessment. They investigated how lexical LD impacts academic achievement in reading, mathematics, and science in a sample of >30,000 15-year-old secondary school students with a migrant background. For all three academic achievement variables, Borgonovi and Ferrara (2020) identified an adverse effect of lexical LD. In contrast, they could not find a systematic significant association between lexical LD and a sense of belonging.

Societal outcomes
Even apart from language effects or general education outcomes, several studies identified a relation between LD and broader societal consequences. For example, Strøm et al. (2018) compared male foreign-born versus male native Italian workers and demonstrated that LD, which they used as a proxy for proficiency, negatively affected their wages. They showed that LD interacted with months of experience, namely that the positive effect of work experience was reduced with increasing LD. Similarly, Chiswick and Miller (2012) analyzed the wages of immigrants in the US, based on census data from 2000, in relation to the LD of their native language to English. The authors hypothesized a direct effect of LD or language skills on the job search and found that initially, the earnings of immigrants were lowest for those with languages most distant from English. In addition, they could show that with increasing length of stay in the US, wages increased most for immigrants with languages most distant to English (Chiswick & Miller, 2012). In addition, Bredtmann et al. (2017) studied the effects of migrant networks and LD and how these two relate to selecting a new country of residence. The data they considered came from the 2007 European Labour Force Survey, which provides information on approximately 1.8 million people based in the EU and their countries of birth and new locations of residence. LD was operationalized using Levenshtein distance and thus represents the average phonetic similarity (Bredtmann et al., 2017). They showed that both measures, migrant networks and LD, affected the choice of the new country of residence: whereas networks had a positive effect, a negative effect could be attributed to LD. Finally, Mavrou and Chao (2023) argued that LD has economic consequences, especially for immigrants. They explain that language skills represent part of the human capital (e.g., Darvin & Norton, 2016), which necessarily has important consequences for life quality (see also Isphording & Otten, 2011). By and large, adequate literacy skills and career opportunities are mutually dependent, and language could be understood as a gateway to a more successful social and economic integration (Mavrou & Chao, 2023).
As the previous discussions have shown, LD impacts the acquisition of languages in different ways, and can additionally have broader educational or even societal consequences. It is thus imperative to further study the significance of LD, including different measures, to understand its influence better and assess its relative importance for learners of foreign or additional languages compared to other influential variables.

Research aims and questions
The current study aims to reanalyze data used in a study by Jaekel and colleagues (2017) that assessed English reading and listening comprehension skills of year 5 and year 7 secondary school students in Germany (MSW-NRW, 2015). Jaekel et al. (2017) applied structural equation modeling (SEM) and controlled for students' L1 with a binary variable (German versus non-German). In the year 5 cohort, they did not find a significant effect of L1 on receptive English skills, but a small effect was attested in the year 7 cohort. In contrast to Jaekel et al. (2017), the current study (1) employs a numeric/continuous measure to account for LD instead of a binary variable to represent the heterogeneity of the students and (2) uses mixed-effects modeling to account for the nested data via multilevel analyses. In the following, we focus on answering two research questions (RQs):
RQ1: How does using a continuous measure of LD instead of binary language status (German versus non-German) affect reading and listening proficiency in secondary school students (years 5 and 7) in Germany?
RQ2: How do outcomes differ when we use mixed-effects modeling, separately for reading and listening comprehension, instead of SEM?

Study context
The study's data were collected in a multi-year (2010-2018), multi-site project involving 31 grammar schools in North-Rhine Westphalia (NRW), Germany. Germany's three-tiered secondary school system streams students into tracks (Hauptschule, Realschule, Gymnasium (grammar school), and Gesamtschule (comprehensive school)) based on teacher evaluations and academic performance (Ditton & Krüsken, 2006). Students attending grammar schools, the traditional educational path to tertiary education, have generally demonstrated higher academic achievement across their four years in elementary school. German and Mathematics grades are particularly considered for admission.
The state of NRW is characterized by a linguistically and culturally diverse population, particularly in urban areas. At the time of the study, 26.2 % of the overall student population reported a migration background; however, only 13.5 % of grammar school students had a migration background (IT.NRW, 2012). Considering the sample, participating schools' classroom composition varied greatly. Across all schools, 29.7 % of the multilingual students reported that both parents were born abroad, and 54.3 % reported that both parents were born in Germany (Schwanenberg & Schurig, 2015). Considering the heterogeneous context of the study, a central aim was to increase the graduation rates of students from underrepresented populations, including lower socioeconomic families and those with a migration background.
For this study, data from two cohorts of students were assessed: one who started English in year 3 (late starters) and one who started in year 1 (early starters). The start of ELL instruction in Germany varies by state, with the majority commencing in year 3. However, by 2013, six states had opted for ELL in year 1 (Standing Conference of the Ministers of Education, 2013). For instance, NRW exemplifies this earlier integration, having introduced EFL in year 1 in 2008, following its initial inclusion in the elementary curriculum in 2003 (MSW-NRW, 2008).

Sample
The parents of all students were asked to give their voluntary consent for their children to participate. Only students who took part in both proficiency tests, i.e., reading and listening comprehension, in the two cohorts were included in the study. This resulted in groups with a size of n = 6,460 in year 5 and n = 5,917 in year 7, respectively. Consequently, students who left grammar school between years 5 and 7 must be taken into account when interpreting the results.
The descriptive values of both samples are provided in Table 1. The economic capital was taken from a parental questionnaire used in year 5. If there was no data in year 5, this information was taken from an additional survey in year 7. The information on gender, the number of books at home, and the language spoken at home was taken from the student questionnaire. The sample matches what Jaekel et al. (2017) outlined, but more students were included due to the different handling of the missing values. To address the missing data in the samples, multiple imputation (k = 5) was applied. The imputation was done with the mice package (van Buuren & Groothuis-Oudshoorn, 2011) and mitml (Grund et al., 2016) in R (R Core Team, 2024). The dependent variables were not imputed. The variance on the school level was taken into account by using a two-level normal model with homogeneous within-group variances for the imputation of continuous and ordinal variables (Grund et al., 2016). Further variables were imputed by predictive mean matching. For more details on the procedure, see the R code (Schurig et al., 2024).

Instruments
Reading and listening skills were assessed in years 5 and 7 using standardized tests previously used in Germany. The year 5 assessment involved picture recognition and sentence completion for listening with 28 items (α = 0.68) and 20 multiple-choice and four open-answer questions for reading (Engel & Ehlers, 2013; α = 0.71). In year 7, listening was assessed based on sentence completion with 17 items and through eleven multiple-choice questions (α = 0.89). The reading test in year 7 consisted of 11 multiple-choice items and 15 open-answer questions (Institut zur Qualitätssicherung im Bildungswesen [Institute for Educational Quality Improvement], 2014; α = 0.79).
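The α values reported above are reliability coefficients, here read as Cronbach's alpha (an interpretation, since the coefficient is not named explicitly). The following minimal Python sketch shows how that coefficient is computed from an item-score matrix. The function name `cronbach_alpha` and the toy data are hypothetical and are not taken from the study, whose analyses were run in R.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of rows (one row of item scores per student)."""
    k = len(scores[0])  # number of items
    # Variance of each item across students
    item_vars = [pvariance([row[i] for row in scores]) for i in range(k)]
    # Variance of the students' total scores
    total_var = pvariance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical toy data: 4 students x 3 dichotomously scored items
toy_scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
alpha = cronbach_alpha(toy_scores)
print(f"alpha = {alpha:.2f}")
```

Alpha rises when items covary strongly relative to their individual variances, which is why tests with more (and more homogeneous) items tend to show higher values.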

Mixed effects models
When analyzing educational data, nested data structures and unbalanced designs often result in problems with the analysis of variance (see Green & Tukey, 1960). This includes threats to the central assumptions of linear models, such as group membership (e.g., schools or classes) or incomplete independence (as in time series) affecting the variance. These dependencies are often referred to as hierarchies or nestedness. In traditional analysis strategies, these elements of variance were often regarded as disturbances and ignored or reduced as far as possible by design. In more modern approaches, however, they are seen as a potential source of information. To address and distinguish those sources of variance, different but similar (or even identical) models can be used. These include hierarchical models, multilevel models, random effects models, mixed effects models, and varying coefficients models. The core of those models is a distinction between constant or fixed effects across hierarchical or nested factors and varying or random effects across those factors. See Gelman and Hill (2018) for other conceptual frameworks that do not rely on this distinction.
When talking about those effects, the first question is: What is the difference between fixed and random variables and their effects? The fixed part of the model is usually composed of the main analytic parts. When addressing the change in reading proficiency across time between three experimental groups, the experimental groups are usually the fixed effects. Those main effects are the effects in question. The focus lies on the differences between the experimental manipulations and nothing else. The students' learning, on the other hand, could be considered to be "random." Often, the interest in random effects refers only to the general variability. Fixed effects can be continuous or categorical in mixed models. Ideally, fixed effects are repeatable and are linked to the dependent variable. Random effects are sampled randomly and exert idiosyncratic effects. Random-effect factors are always categorical (e.g., grouping). For a comprehensive explanation of those differences, see Howell (2013).
Often a nested data structure is analyzed in which different data levels are distinguished. This shows the high degree of proximity to multilevel analysis. The first level (level 1) may consist of students, the second (level 2) of classrooms, and possibly a third (level 3) of schools. However, multiple observations across time (level 1) within students (level 2) within classes (level 3) within experimental groups (level 4) might also be addressed. Change across time within groups can be decomposed into components of the individual level, group membership, time, and interactions. This, in turn, can be processed in a single summarizing model, avoiding problems such as economic errors and error accumulation and, at the same time, fully utilizing the volume of a sample (see Campbell & Kenny, 1999).
To run a mixed model, we should answer three questions:
• What are the grouping variables to control for?
• What are the effects in question (fixed effects)?
• What kind of random effects are there? Are the intercepts or also the slopes random?
In linear models, such as an analysis of variance, these effects might be addressed by F statistics for all effects without distinction. In mixed models, effects are estimated and (if feasible) significance tests are conducted for the fixed effects while controlling for the variance of the random factors. Random effects can be modeled as random intercepts and random slopes. The random intercepts account for variability in the intercept for each level of the random-effect factor. For example, different chains may have different starting levels of compositionality (Winter & Wieling, 2016). In contrast, a random slope models variability in the effect of a certain predictor on the dependent variable for each level of the random-effect factor. For example, different chains may vary in how fast compositionality increases over time. Our study does not explicitly model the slopes per group but controls the intercepts by group. This shrinks the estimates of the intercepts toward the common mean (see Kreft & de Leeuw, 1998), unpooling the (possibly) biased estimates. The word "unpooling" derives from the modeling approaches Gelman and Hill (2018) described. Pooled models apply estimations to grouped data, ignoring the grouping. Unpooled models can be understood as separate estimations for each group. However, this does not consider the limitation of sample sizes for each group. The approach in mixed models is to estimate specific intercepts and then to pool the estimates. This is sometimes referred to as partial pooling.
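The shrinkage described above can be illustrated with a small Python sketch. It is a deliberate simplification: the variance components `sigma2` (within groups) and `tau2` (between group intercepts) are treated as known, whereas a mixed model estimates them from the data, and the grand mean is a plain average. All names and numbers are hypothetical.

```python
from statistics import mean


def partial_pooling(groups, sigma2, tau2):
    """Shrink each group's mean toward the grand mean.

    groups: dict mapping a group name to a list of scores
    sigma2: within-group (residual) variance, assumed known here
    tau2:   between-group variance of the intercepts, assumed known here
    """
    all_scores = [y for ys in groups.values() for y in ys]
    grand_mean = mean(all_scores)  # complete pooling: grouping ignored
    shrunk = {}
    for name, ys in groups.items():
        n = len(ys)
        # Weight of the group's own mean (no pooling); grows with group size
        weight = tau2 / (tau2 + sigma2 / n)
        shrunk[name] = weight * mean(ys) + (1 - weight) * grand_mean
    return grand_mean, shrunk


# Hypothetical toy data: one larger and one very small school
toy_groups = {"school_a": [10, 12, 14, 12], "school_b": [20]}
grand_mean, shrunk = partial_pooling(toy_groups, sigma2=4.0, tau2=1.0)
print(grand_mean, shrunk)
```

In this toy example, the single-student school is pulled much more strongly toward the grand mean than the four-student school, which is the sense in which small groups borrow strength from the full sample.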
In the current analysis, random intercepts are assumed. This means that there is an assumed variance between the means of the schools but no assumed variance in the individual slopes.
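A random-intercept specification of this kind can be written as follows. The notation is an assumption introduced here for illustration (i indexes students, j indexes schools, x_{pij} are the covariates) and is not taken from the original analysis code.

```latex
\begin{align}
  y_{ij} &= \beta_0 + \sum_{p=1}^{P} \beta_p \, x_{pij} + u_{0j} + e_{ij}, \\
  u_{0j} &\sim \mathcal{N}\bigl(0, \tau_0^2\bigr), \qquad
  e_{ij} \sim \mathcal{N}\bigl(0, \sigma^2\bigr)
\end{align}
```

Here \beta_0 and \beta_p are the fixed effects, u_{0j} is the school-specific deviation from the overall intercept, and e_{ij} is the student-level residual. Because no term of the form u_{pj} x_{pij} appears, the slopes are held constant across schools.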
The central analysis strategy is a backward-stepping approach within the framework of linear mixed models with lme4 (Bates et al., 2015). Backward-stepping was chosen to address the assumed confounding among correlated variables in the model and to derive the most parsimonious model. In the first step, models with all reasoned covariates are calculated for reading and listening in years 5 and 7. Then, variables that did not contribute meaningfully to the variance explained were taken out stepwise until either all remaining variables were significant or the ratio of the likelihoods was significant. The cohort variable was never taken out because it represents the quasi-experimental variable in question. Multicollinearity was checked by calculating all four models with all variables on the non-imputed datasets, using the performance package (Lüdecke et al., 2019). The highest variance inflation factor (VIF) across all four models is 1.72, and the lowest tolerance is 0.58, indicating acceptably low correlations. The variable concerned is the L1 in year 5. The complete reduction steps are outlined in the supplementary file (https://osf.io/6eafs/).
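The two collinearity figures reported above are two views of the same quantity, since tolerance is the reciprocal of the VIF. The short Python sketch below checks this arithmetic; the function names are hypothetical helpers, and the R² mentioned in the comments refers to regressing one predictor on all the others.

```python
def tolerance(vif):
    """Tolerance is the reciprocal of the variance inflation factor."""
    return 1.0 / vif


def vif_from_r2(r2):
    """VIF of a predictor, given the R^2 from regressing it on the other predictors."""
    return 1.0 / (1.0 - r2)


reported_vif = 1.72                # highest VIF reported in the text
tol = tolerance(reported_vif)      # share of variance NOT shared with other predictors
implied_r2 = 1.0 - tol             # share of variance shared with other predictors
print(f"tolerance = {tol:.2f}, implied R^2 = {implied_r2:.2f}")
```

A VIF of 1.72 thus corresponds to roughly 42 % of the predictor's variance being shared with the remaining predictors.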

Results
Table 2 provides the results of the mixed effects models in year 5. For reading, all covariates except the LD from German showed significant effects. For listening, cohort and cultural capital showed no significant effects. The explained variance is 25 % in reading and 23 % in listening by fixed effects alone (RB1) and 75 % and 65 %, respectively, by fixed and random effects (RB2). The variance explained is calculated by using the procedure from Raudenbush and Bryk (2002). Additional estimations of the first-level variance (SB by Snijders and Bosker, 2012) and multilevel variance partitioning ([MVP] by LaHuis et al., 2014) are given. For more details, see Grund et al. (2016). The intra-class correlation (ICC) only ranged from 1 % to 3 %, but the variance explained by fixed and random effects together differed considerably from that explained by the fixed effects alone. LD to English shows significant effects in both models. While L1 retained a significant effect in the year 5 reading model when LD was controlled for, it did not in the year 5 listening model.
The outcome from year 7 is shown in Table 3. Here, the results differ in multiple aspects from the results in year 5. In the reading model, economic and cultural capital, L1, and both LD measures failed to reach significance. For listening, gender, economic capital, L1, and the LD to English did not show significant effects. The explained variance of fixed effects was higher because of the inclusion of the year 5 proficiencies, reaching 31 % in reading and 32 % in listening. The variance explained by fixed and random effects is in line with the model of year 5, reaching 73 % and 75 %. The ICC is again small, with values of 2 % and 3 %.
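For readers less familiar with the ICC, the conventional definition for a two-level random-intercept model is given below. The symbols are assumptions introduced for illustration: \tau_0^2 denotes the variance of the school intercepts and \sigma^2 the student-level residual variance, both usually taken from a model without predictors.

```latex
\begin{equation}
  \mathrm{ICC} = \frac{\tau_0^2}{\tau_0^2 + \sigma^2}
\end{equation}
```

An ICC of 0.03, for example, means that about 3 % of the total variance in scores lies between schools and about 97 % lies between students within schools.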

Discussion
This study explored the relationship of LD with the English attainment of young language learners across school years 5 and 7. We employed LD as an alternative to binary language status and followed a mixed-effects model approach, confirming central tendencies in our reanalysis of Jaekel and colleagues (2017). We also gained differentiated insights into the association of LD with English outcomes in contrast to binary language status.
Research question 1 addressed how using a continuous measure of LD instead of controlling for binary language status (German versus non-German) affected reading and listening proficiency in secondary school students (years 5 and 7) in Germany. Whereas in Jaekel et al. (2017), there was no significant effect of the L1 on a combined measure of English receptive skills among the year 5 students, we found a significant effect of LD to English for reading and listening skills separately. The higher the LD of the L1 to the target language, English, the lower the students' reading and listening comprehension scores. Given the language outcomes reported above, this finding is expected and in line with previous research using LD as a continuous measure (e.g., Berthelé et al., 2022; Jaekel et al., 2023; Mavrou & Chao, 2023; Schepens et al., 2016). This is particularly interesting, as the model analyzing English reading comprehension additionally controlled for language status (i.e., German versus non-German L1). No effect, however, can be reported for LD to German. In the year 7 data, Jaekel et al. (2017) reported a small significant effect for L1 status, with advantages for students whose L1 was German. In the current models, neither LD to English predicting English reading or listening skills nor LD to German predicting English reading skills was significant. The only significant effect detected was LD to German for predicting English listening skills. As before, the effect for LD was negative, meaning that with increasing LD, the English listening scores decreased. This could potentially be attributed to the increased emphasis on grammar teaching and the introduction of more complex structures in school year 5/6 or in grammar schools, which might have been discussed in reference to German or contrasting German with English constructs (Kultusministerkonferenz (KMK), 2019). However, the effect of LD to German in year 7 is relatively small compared to that of LD to English in year 5. This suggests that the LD to English has a stronger impact on learners at an earlier stage, immediately after transitioning to secondary school. The use of LD alongside language status across all models clearly demonstrates that LD reaches significant levels in every model but one. Language status, on the other hand, only gains significance in year 5. All other models either include LD to English or German or, in the case of reading in year 7, no language-related variable gained significance. Overall, we must acknowledge that of the retained predictors in our models, LD_E and LD_G have the lowest effect sizes. However, we believe it is important to highlight that the sample constitutes a selective group of students who had to demonstrate their academic potential before admission to grammar school. Considering that excellent German skills are at the core of this rigorous selection process, LD should no longer be a factor at this point. However, despite their "academic elite" status, LD still mattered, even if only to a comparably small extent.
Notes: RB1: Fixed effect variance by Raudenbush and Bryk (2002); RB2: Fixed and random effect variance by Raudenbush and Bryk (2002); SB: Fixed effect variance by Snijders and Bosker (2012); MVP: Fixed effect variance by LaHuis et al. (2014); LRT: Likelihood Ratio Test; LD_E = Linguistic distance to English; LD_G = Linguistic distance to German.
Even though the results of Jaekel et al. (2017) and our current results differ somewhat, the underlying tendencies previously shown remain the same. Nevertheless, the advantage of using LD as a continuous measure lies in the nuanced assessment of the English learners. Whereas in Jaekel et al. (2017), language status among the year 7 students could only discriminate between those who have German as an L1 versus those who spoke other languages at home, ascribing an advantage to the former, in the current study, we could show that a simple binary comparison does not reflect the underlying differences. Students with different L1s that are more or less similar to German or English (i.e., smaller or greater LD) show differences in their English scores. However, it would be too simplistic to ascribe an advantage to students with German as an L1 only, since students whose L1 is less distant to either English or German, such as Dutch, are indeed more advantaged than German learners of English. These results suggest that students with different linguistic backgrounds have different starting points or developmental trajectories based on LD (Muñoz et al., 2018), at least for the two language skills assessed here, and that pedagogical approaches to bridging the linguistic gaps may be warranted. Further, the models outline a stronger association of LD with EFL outcomes in year 5, suggesting a stronger effect of the language gap on students' English attainment.
Using LD as a numeric measure introduces ontological implications for understanding individual multilingualism and language relationships. It challenges traditional notions of discrete language boundaries by considering the nuanced interactions between languages, varieties, or dialects. As an alternative to binary or categorical language status, employing LD measures recognizes the fluid nature of linguistic repertoires, where an individual's competence may span a continuum of related linguistic systems rather than existing in isolated compartments. It prompts us to reconsider how we conceptualize language acquisition and interaction within multilingual speakers. This approach accounts for the lived realities of speakers who navigate and negotiate multiple linguistic identities. Such an ontological shift emphasizes the importance of understanding multilingualism not merely as the sum of discrete languages known by an individual but as a complex, interwoven fabric of linguistic competencies and affiliations. A binary or categorical variable falls short of appreciating and valuing linguistic diversity.
While the debate over a complete model of linguistic distance continues, significant progress has been made. These advancements suggest that future research could greatly benefit from incorporating these more nuanced approaches to understanding multilingualism and language relationships. Research question 2 aimed to answer how the outcomes might differ when using mixed-effects modeling. In the previous publication, listening and reading scores were operationalized as latent scores in the SEM. While the SEM provides a holistic overview of the data, including the impact of learner characteristics, relationships between these individual differences and the two receptive outcomes may have been masked, because the two outcome variables were treated as one. The advantage for early starters in year 5, as shown by the effects of the cohorts, for example, is only significant for reading, but not for listening in the full model (Table 2). For year 7, the advantage for late starters remains; however, it reaches significant levels for listening and only marginally for reading (Table 3). Consequently, this may explain the fact that in Jaekel et al. (2017), for example, L1 was associated with the latent outcome variable in year 7, but not in year 5. Furthermore, in the current investigation, the random intercepts of schools were included in addition to measures of LD.
It also has to be stated that the effects in the SEM were standardized and, therefore, easier to compare than the effects in the mixed modeling approach. Gender, economic capital, cultural capital, and cognitive abilities showed similar, but more nuanced effects. But how can we explain that in 2017, L1 played a role in year 7 but not in year 5, whereas in the current study, we see LD to English explain the variance in year 5 but not in year 7? Moreover, how can we explain that only LD to German significantly explains English listening comprehension in year 7 while adding no effect in year 5?
To address the latent change, only those students who participated at both measurement time points (i.e., years 5 and 7) could be included. This means that students with non-majority L1s were underrepresented in year 7, as they more often leave grammar school between years 5 and 6. Social disparities are traditionally hidden behind the changes in school type in Germany. Students from lower social groups, especially those with a migration background, not only find it more difficult to get into schools with higher qualifications, they also have greater problems staying there (Autorengruppe Bildungsberichterstattung, 2006, p. 52). The current study used individual scores for listening and reading and the data from all students with sufficient proficiency scores. Thus, the change was only included in the form of the covariation between and across the scores. Moreover, the random effects were added. In combination, this led to a clear and comprehensible shift in the statistical variance explanation. Whereas in 2017, 14 % of the variance of the latent variable on English proficiency was explained in year 5 and 49 % in year 7, with the approach in this paper, roughly 24 % was explained in year 5 and 31 to 32 % in year 7 by the fixed effects alone. When taking into account the intercepts on the school level, the spread or social reference standard, so to say, the explained variance increases to very roughly 70 % in all models. In summary, the mixed-effects models are particularly useful when very flexible modeling, possibly also of quasi-experimental conditions, stratifications, or time series, is required to accurately predict a manifest (measured) variable. However, if whole theoretical structures, potentially with several exogenous or endogenous latent variables, are to be identified, structural equation approaches are specifically beneficial.

Conclusions
This study set out to investigate the association of linguistic distance with EFL receptive skills for early language learning contexts across years 5 and 7 as an alternative to binary language status. We reanalyzed Jaekel and colleagues' (2017) data, comparing a cohort learning English from year 1 and another learning English from year 3 onwards. Using mixed-effects models, the study confirmed the overall tendency of the benefits of an earlier start for year 5 and a reversed outcome favoring late starters in year 7.
From a methodological perspective, differentiated results were derived because the research question differs from that in Jaekel et al. (2017) and relates more strongly to univariate prediction models that are as well controlled as possible. The current article addressed the limitations of the previous publication in dealing with missing values and possible group effects, and it compared predictors using sequential prediction models rather than controlling the mean change of a holistic construct. Linear mixed models are a highly flexible approach to modeling hierarchical or nested systems within a single model. They can be combined with multiple imputation and, through non-linear functions, with categorical and continuous independent variables. However, the possibility alone of modeling everything in a unified framework is a rather poor rationale. Many parameters might be realistic and testable, but they are hard to present and interpret without theoretical assumptions. Often, there is not enough theory or an insufficient sample size to interpret effects. This argument can also be extended to the choice of backward-stepping regression. The method can be criticized for being overly simplistic and solely data-driven. At the same time, however, in this application, it offers the possibility of resolving the complex structures and making them carefully explainable.
With LD as a continuous measure instead of the binary variable employed in Jaekel et al. (2017), a more differentiated perspective on how language background affects (early) foreign language learning could be offered. Thus, we submit that even though the results of the current analysis report the same tendencies as disclosed in the 2017 study, LD, as a more realistic approximation of students' heterogeneous language repertoires, is to be favored when investigating (foreign language) learning in multilingual settings. The exact operationalization of LD, the potential differences that arise with the choice of LD measure, and their association with different language skill assessments remain to be investigated further.

Table 1
Descriptive values in year 5 and year 7.

Table 2
Mixed effect model in year 5.