The role of working memory in young second language learners' written performances

This study investigated the role of working memory (WM) in the second language (L2) writing performance of young English language learners. It also examined how L2 writing achievement relates to task type and grade level and whether the effect of cognitive abilities varies across task types and grade levels. The participants were 94 young learners (Grades 6 and 7) in Hungary, who performed four writing task types as part of the TOEFL Junior™ Comprehensive test battery and completed cognitive tests that assessed their WM functions. Participants scored high on the email writing and integrated Listen-Write tasks. Irrespective of WM functions, on average learners in Grade 7 outperformed those in Grade 6 on the Listen-Write task and the Email task. Students gained lower scores on the non-academic version of an editing task than on most other types of tasks. WM functions had no significant relationship with L2 writing scores, except for the academic editing task. In Grade 7, the effect of WM was not significant on the integrated Listen-Write task, but it resulted in a change in the expected score. Learners with high working memory in Grade 6 showed somewhat more consistent performance across tasks than did learners with low working memory.


Introduction
A large number of second language (L2) learners in instructed settings are young learners (Butler, 2017), as foreign languages are often a compulsory school subject. In particular, English, being a lingua franca, is regarded as an important target language because of its key role in international education and employment opportunities. Given the vital role of English for the future of young learners, the teaching and assessment of English language skills for this age group has recently become the focus of growing interest (Nikolov, 2016; Wolf & Butler, 2017). In the construction and evaluation of teaching tasks and assessment tools for young learners, special attention is being paid to the developing cognitive capacity and affective characteristics of younger learners (Bailey, 2017). With a focus on assessment, research has also begun to investigate the age appropriateness, validity and reliability of young learners' tests in various educational contexts (Papageorgiou & Cho, 2014; Papp & Walczak, 2016; Wolf & Butler, 2017).
The present study contributes to this line of research by exploring the role of working memory (WM) functions in the writing performances of young L2 English learners, adopting a cognitive perspective on L2 writing (cf. Cumming, 2016). Despite a growing body of research into the relationship between WM functions and adult L2 performance (see meta-analysis by Linck et al., 2014), we still have a limited understanding of how young learners' WM relates to their L2 task performance. Yet, individual differences in WM are particularly relevant for children whose attentional regulation mechanisms are still developing (Jarvis & Gathercole, 2003). Study of the link between WM storage capacity and attention regulation ability is also important because teaching and assessment tasks can vary in their demands on attentional resources (Robinson, 2001). Hence, tasks that are excessively taxing on the storage and processing functions of WM and attentional resources might not contribute to children's L2 development and might unfairly disadvantage children with lower levels of WM functioning in high-stakes and classroom testing contexts. Therefore, it seems critical to examine whether certain task types, especially more complex ones which involve the integration of writing with skills such as reading or listening, create an unnecessarily high working memory load and thereby restrict the potential of L2 learning through writing (Byrnes & Manchón, 2014). As assessment of L2 writing skills is also a critical part of writing pedagogy that can provide diagnostic information, it is also important to examine how working memory limitations can result in construct-irrelevant variance in the assessment of young L2 learners.
In the present study, 94 L2 English learners in Hungary, between 11 and 14 years of age, completed the writing section of the computer-administered TOEFL® Junior™ Comprehensive test battery. A key objective of this test is to equip teachers with valid and reliable information about young learners' L2 skills that can inform future instructional design. The writing component of the test comprises four task types: two editing tasks, one email writing task, an opinion essay and a listen-write task in which students have to write a summary of an aural text. In addition, the participants of this study were administered three age-appropriate WM tasks that assessed their phonological short-term memory, the storage, and the processing and central executive (CE) functions of WM. Cumulative Link Mixed Models (CLMMs; Agresti, 2010) were used to examine the relationships between the learners' writing test scores and WM functions and potential changes in this relationship depending on task type and grade level.
In the following, we will first review research on the role of WM functions in L2 writing with a focus on young learners. Next, we will describe the research design of our study and data collection procedures. We then report the results of the CLMMs and discuss them in light of theoretical perspectives on the English-L2 writing performance of young learners. We conclude with practical implications of our findings for the teaching and evaluation of L2 writing for young learners.

Writing and WM models
Writing is a complex, meaning-making cognitive process in which a variety of social and cognitive factors play a role (Byrnes & Manchón, 2014; MacArthur & Graham, 2016). From existing cognitive accounts describing writing processes (e.g., Hayes & Flower, 1980; Bereiter & Scardamalia, 1987; see review by Cumming, 2016), we adopted Kellogg's (1996) writing model in our study since this model has been successfully employed in prior L2 research (e.g., Kormos, 2012; Manchón, Murphy & Roca de Larios, 2009; Révész, Michel, & Lee, 2017) and explicitly links writing processes to WM functions (see Olive, 2012, for a review on L1 writing). Kellogg (1996) distinguishes three highly interactive and recursive sub-processes: formulation, execution and monitoring. During formulation, writers plan the content of their text by retrieving ideas from long-term memory and translate this content into linguistic form by drawing on the processes of lexical retrieval, syntactic encoding and the expression of cohesive relationships. The execution stage involves actual motor movements (holding and moving a pen, typing on a keyboard). Monitoring entails revision and editing behaviour to ensure that the composed text complies with the intentions of the writer. Among others, Bourdin and Fayol (2002) showed that each of these sub-processes draws heavily on attentional resources. WM is responsible for the short-term storage, active processing, and manipulation of information in the cognitive system, and it is therefore a highly relevant construct for writing processes (see reviews by Olive, 2012, and MacArthur & Graham, 2016). In our research, we defined WM according to Baddeley's (2003) highly influential model (adapted from Baddeley & Hitch, 1974). This specifies WM as consisting of a Central Executive and three domain-specific slave-systems for phonological, visual/spatial, and episodic information, respectively.
The phonological loop is responsible for the short-term retention and manipulation of verbal information, the visuo-spatial sketchpad stores and processes visual and spatial information, while the episodic buffer merges smaller pieces of information into episodes. The CE controls attentional processes, such as focusing, dividing and switching attention. It is also responsible for the activation and inhibition of processing routines and regulates how information is exchanged between short-term slave-systems and the long-term memory system. An important assumption of this WM model is that the CE as well as the three slave-systems are limited in capacity. Given that all information will first be processed by WM before it may enter long-term memory, its limited capacity "acts as a bottleneck for learning" (Gathercole & Alloway, 2008, p. 12).
Earlier research into the role of WM in L2 processing and learning has demonstrated that individuals with high WM capacity achieve higher language proficiency, have a larger vocabulary size, and are in general more competent users of the four language skills (e.g., meta-analysis: Linck et al., 2014; reviews: Juffs & Harrington, 2011; Kormos, 2012). In addition, individual differences in the storage and processing functions of WM and attention regulation have been shown to play a role in how learners perform on language tests. For example, Mitchell, Jarvis, O'Malley, and Konstantinova (2015) found a strong relationship between advanced learners' TOEFL iBT scores and WM capacity.
language and ideas to be conveyed. In developing writers, such as young learners, the mechanical processes of writing letters, be it by hand or on a keyboard, are also likely to need some attention and mental effort (Berninger, 1999; Olive, 2012). In sum, it is expected that limitations in the storage, processing and attention regulation functions of WM might impact all stages of the writing process (Kormos, 2012). Johnson (2017) hypothesized that larger phonological short-term memory (PSTM) storage capacity might allow for the processing of longer and more complex lexical units and syntactic structures. He also assumed that visuo-spatial short-term memory assists in planning and monitoring processes, since these are related to the processing of visual information (Kellogg, 1996). Individuals with larger WM storage capacity and more efficient CE functioning are also expected to benefit during all stages of writing because the coordination of parallel processing of information and switching between sub-tasks of text composition all draw heavily on the availability of attentional resources (Révész et al., 2017).
While several studies have explored the role of WM functions for L1 writing in adults (e.g., Hoskyn & Swanson, 2003; Olive, Kellogg, & Piolat, 2008), as well as children and adolescents (see review by McCutchen, 2011), to date, only a handful of studies have examined WM effects in L2 writing. Kormos and Sáfár (2008) investigated how Hungarian secondary school learners performed on writing tasks from a general language proficiency exam (Cambridge First Certificate Examination) and on tasks targeting PSTM and the simultaneous storage and processing function of WM. They found that, for those students who were at a pre-intermediate level of English, PSTM scores (i.e., performance on a non-word repetition test) correlated moderately and positively with their writing scores. In contrast, performance on the backward digit span task, which was used to assess the storage and processing function of WM, was not significantly associated with students' writing scores. Adams and Guillot (2008) studied 12- to 15-year-old bilingual students' spelling and writing in French and English in relation to their PSTM, verbal and visual memory span. While no significant links between verbal or visual working memory and text composition emerged in either language, a significant relationship was found between PSTM scores and spelling in English.
More recently, Révész et al. (2017) investigated the relationship between adult L2 learners' WM functions and their writing processes and product when completing an opinion-writing task from an international language proficiency exam. Key-stroke logging, eye-tracking and stimulated recalls were used to capture participants' writing processes while composing the essays. This revealed that individuals with better task-switching ability paused for shorter periods between sentences, while those who had better ability to update information paused less frequently between paragraphs. Participants with smaller visual short-term memory capacity gazed at the instructions more often. Regarding text quality, a surprising finding was that those with higher PSTM scores used words from the first 1000 most frequent word-band more often, indicating a potentially negative role of the storage capacity of PSTM in lexical selection.
Zalbidea (2017) explored how WM mediates task complexity effects in oral and written L2 tasks in Spanish. In the written modality, only one significant correlation emerged with WM functioning: the number of errors against gender and plural markings on complex written tasks showed a strong negative correlation with WM scores. Zalbidea argued that more efficient WM functioning allows learners, even during complex tasks, to devote their limited attention to accuracy. Zalbidea's (2017) study highlights that it is important to consider that the effects of WM on performance might depend on the complexity of the task learners perform. Robinson (2001) defines task complexity as "the result of attentional, memory, and other information processing demands imposed by the structure of the task on the language learner" (p. 29). Task complexity has been shown to affect spoken and written L2 performance (see meta-analysis by Jackson & Suethanapornkul, 2013). Robinson (2007) also argues that as task complexity increases, learners with more efficient WM functioning will be more successful in handling the growing cognitive and linguistic task demands. To date, very few studies have investigated interactions between task complexity and working memory in L2 performance. Kim, Payant and Pearson (2015) showed that L2 learners with a high level of WM functioning noticed more recasts during complex than simple oral tasks. Students with less efficient WM functioning were found to notice recasts less frequently regardless of the complexity of tasks. Zalbidea (2017), as mentioned, also revealed that only written performance on complex tasks was associated with WM test scores.

WM and L2 writing of young learners
The above review indicates that the role of WM functions in the L2 writing of older learners is far from conclusive. Even less is known about the impact of individual differences in WM on young L2 learners' writing processes and achievements. First, young learners' writing skills are still developing in their L1 throughout adolescence (Kellogg, 2008) because of the ongoing process of cognitive development. Through schooling, children also gain more expertise in writing as a technical skill (i.e., spelling, handwriting, keyboarding), in text composition and as a reader-oriented cognitive activity (Berninger, 1999; Isbell, 2017; MacArthur & Graham, 2016; McCutchen, 2011; Olive, 2012). Both cognitive development and experience are thought to have an impact on different aspects of text quality, such as syntactic complexity, lexical sophistication and discourse quality (Berninger, 1999; Kellogg, 2008; McCutchen, 2011). In addition, individual variation in WM functioning and differences among children in their cognitive developmental trajectories can put "a fundamental brake on the writing skill of developing writers throughout childhood, adolescence, and young adulthood" (Kellogg, 2008, p. 8; see also Berninger, 1999; Olive, 2012). Therefore, we can hypothesize that these maturational processes also affect L2 writing development. However, one of the few recent studies in this area did not find any significant links between the non-verbal intelligence and writing skills of young learners, either in L2 English or in L1 German, in a bilingual schooling context (Steinlen, 2018).
Additionally, empirical work on the complex interactions of WM and task characteristics in the performance of young L2 learners is limited. Recent developments in the context of L2 assessment, however, have led to an increased usage of communicative goal-oriented task types such as integrated tasks that require a combination of multiple language skills (Cumming et al., 2005; Cushing-Weigle, 2002; Wolf & Butler, 2017). These tasks require students to reproduce information from written or aural input in writing (Read-Write or Listen-Write tasks) or speaking (Read-Speak or Listen-Speak tasks), and thus might be demanding for individuals with less efficient WM functioning, particularly for young learners. As a result, students with low WM capacity might be disadvantaged in these integrated skills tasks when they are used in assessment, be it in high-stakes tests or classroom-based contexts, and the test might be unfairly biased towards those with better WM abilities. In other words, individual differences in WM functioning might create construct-irrelevant variance in integrated task performances, and the diagnostic and evaluative information gained about writing skills might be inaccurate. As young learners' cognitive abilities are still undergoing development, it is also possible that the role of WM functioning in L2 writing performance changes across grade levels. Therefore, it is imperative to investigate the role of WM in integrated tasks performed by younger learners for whom test results might determine their educational and professional future. Research on the effects of cognitive capacity limitations on various types of writing tasks might also yield useful information for teachers by helping them identify learners who might need additional support with certain types of tasks.

The present study: research questions
To the best of our knowledge, no earlier work has investigated young learners' performance on integrated tasks in relation to individual differences in WM functions. The present study aims to address this gap and uses performance data from several task types of a standardized computer-based test specifically designed for young L2 learners. Our study is also novel because it applies several tools to measure the different functions of WM (storage, processing, executive functioning) in order to provide insights into how cognitive capacity limitations might be related to performance differences. Furthermore, to examine whether writing experience and cognitive maturation might influence the relationship between cognitive functioning and writing test scores, we explored L2 writing in two different grade levels. More specifically, our study addressed the following research question: RQ: What is the role of WM functioning, grade level and task type in the L2 writing performances of young English language learners?
Our hypotheses were that students' performances would differ across grade levels due to cognitive maturity and longer period of instruction. We also expected that task demands would exert a significant effect on students' writing scores and that children with more efficient WM functioning would achieve higher marks than those with less efficient WM functioning. It was further hypothesized that the role of WM functioning would be stronger in the lower grade and in tasks with higher formulation demand and which require the integration of listening and writing skills (cf. Kellogg, 2008).

Research methodology
The research was undertaken with young learners in Grades 6 and 7 (11-14 years old), in Hungary. They performed the Writing subsection of the TOEFL® Junior™ Comprehensive test battery, which included four task types (Editing, Email, Opinion, Listen-Write). These tasks were designed by the test developers (Educational Testing Service, ETS) to reflect the most important writing domains that children might encounter in bilingual education contexts and to tap different writing processes (e.g., monitoring and composing). The four writing task types also vary in formulation demands. The Listen-Write task requires the rendering of given information, whereas in the Opinion task students need to express their own views. The Email task contains required elements of information as well as some optionality for individual ideas. The Editing task has no formulation demands as in this task students must detect and correct errors.

Participants
The participants were 94 L2 English learners in two primary schools in Budapest, Hungary. In both schools, which follow a bilingual education programme, children start learning English from Grade 1, have five English language classes per week all through primary school (up to Grade 8) and study a variety of subjects in English. The children receive content-based instruction through the medium of English in arts, music, science and physical education in lower primary school (Grades 2 to 4), and in history and science in upper primary school (Grades 5 to 8), thus constituting a content-based language instruction (CLIL) context. Both schools are state-owned, and students receive education free of charge. Students learn approximately one third of the subjects (e.g., science, art, music and physical education) in English. The writing instruction the children received in the two schools was similar, and very similar amounts of class time were devoted to writing activities.
Fifty-six children attended School A and 38 School B. Forty-five percent were boys and fifty-five percent girls. Their ages ranged between 11 and 14 years (M age = 12.22, SD = .78). Fifty-four percent were enrolled in Grade 6 and 46 percent in Grade 7. All children had been learning English since Grade 1. Based on the students' overall performance on the full TOEFL® Junior™ Comprehensive test battery, their English-L2 proficiency ranged from A2 to B2 on the CEFR: 31 percent were at A2, 24 percent at B1, and 45 percent at B2 level. 1

Background questionnaire
To establish a demographic profile of the participant group, we designed a short bio-data questionnaire, asking participants about their gender, age, grade level, language(s) spoken at home, residence abroad, length of learning English and use of English outside the school context. The questionnaire was developed in English, translated into Hungarian and administered using the online survey tool Qualtrics.

Writing tasks
The pupils completed the computer-based TOEFL® Junior™ Comprehensive test battery, which involves reading, writing, listening, and speaking tasks in English. In this study, we examined performance on the writing section only. For reasons of test security, the exact content of the tasks cannot be revealed, but a general description of the four task types is as follows. The first type is an Editing task which requires test-takers to correct four errors in a paragraph of a non-academic and an academic text, respectively (Ed1/Ed2). In the second one, an Email task, test-takers write a reply to an email. In the third one, an Opinion task, they compose a paragraph of 100-150 words in which they express their opinion on a topic. In the last one, an integrated Listen-Write task, test-takers first listen to a teacher talking about an academic topic while seeing a picture with animations, which include key information. The teacher's presentation lasts for about 90 s, and learners can take notes while listening. Test-takers are then asked to write a summary paragraph and check their responses for grammar and spelling. While typing the paragraph, the task instructions and illustration on the computer screen remain visible (see Fig. 1). 2

Tasks measuring WM functioning
The participants also performed a series of WM tasks that take into account young learners' cognitive functions still under development (Gathercole et al., 2004). The specific WM tests were chosen based on three criteria: (1) language independence, (2) suitability for young learners, (3) feasibility regarding time restrictions of the research. The tasks aimed to measure the storage, processing and task-switching functions of WM. More specifically, to measure the storage and processing functions of WM, we used visual forward and backward digit span tasks. Digit span tests were originally developed for assessing intellectual abilities (IQ) in young children and are routinely used with children as young as four years old, as well as up to adulthood (see e.g., Gathercole et al., 2004; Jarvis & Gathercole, 2003). In our study, we chose visual over auditory span tasks or those using letters, (non-)words or sentences, because the former are seen to be less dependent on language than other versions. In addition, visual span tasks allowed us to test students in groups. Furthermore, in their review of WM tests for children and adolescents, Jarvis and Gathercole (2003) report that digit span tasks are suitable for the age group of 11-14-year-olds. We used digit span versions based on Woods et al. (2011; Experiment 1).
Footnote 2: The time provided for the writing tasks was the same as the regular standardised limits for the TOEFL® Junior™ Comprehensive test and also proved to be sufficient for the participating students. The mean number of words written was within the required word-limit (Email task Mean = 78 words; Opinion task Mean = 110 words; Listen-Write task Mean = 111 words). However, it is important to note that text length is not considered in scoring students' performances.
As a measure of the task-switching function of WM, we opted for the Symmetry Span task (SymSpan; Kane et al., 2004). The SymSpan task asks participants to remember the location of a sequence of blocks (e.g., in a 4 × 4 grid) while being interrupted by a decision task on the symmetry of a black-and-white block pattern (Conway et al., 2005; see Fig. 2).
The updating function of the WM was not measured in a separate test because recent research has shown large overlap between the updating and the processing functions of WM (Indrarathne & Kormos, 2018). Although we assessed students' inhibitory control with a Stop-Signal task (Logan, 1994), we do not report results relating to this task, since students' performance on this was highly positively skewed and the kurtosis value was also very high, indicating the presence of a large number of outliers. The Inquisit Web tool (www.millisecond.com) was used to set up and administer all WM tests.

Procedures
Ethical approval for the research was granted by the relevant ethics review committee at the researchers' institution, and consent was obtained from parents as well as the children. The WM and CE tasks and the TOEFL® Junior™ Comprehensive test were piloted with 14 students. Only minor rewordings of the Hungarian instructions were needed for the WM tasks. The piloting of the language test showed that the test was suitable for the children in terms of language proficiency level, structure and timing. The participants' perceptions of the test were also found to be positive. Furthermore, the children demonstrated an appropriate level of computer literacy and typing skills, and no technical issues arose during test performance. In both the pilot and the main study, the children's classroom teachers familiarized the learners with the test, using publicly available sample materials and the TOEFL® Junior™ Comprehensive test handbook.
Data collection took place in two consecutive sessions in Spring 2017. Participants first completed the WM tasks and then the TOEFL® Junior™ Comprehensive test. Finally, they completed a short online bio-data questionnaire. All instruments were group administered, with one of the authors and two research assistants overseeing the procedures. The participants spent 25-35 min in total on the WM tasks: the visual forward and backward digit span tasks took about 5 min each and the SymSpan and Stop-Signal tasks about 10 min each. All TOEFL® Junior™ Comprehensive tasks were computer-administered. The Editing tasks had a time limit of 2.5 min each. Learners had 7 min for composing in the Email task, 10 min for the Opinion task, and 10 min to write a summary in the Listen-Write task (see Table 1 for an overview). Throughout the experiment, the learners were given several breaks (in line with the regulations of the TOEFL® Junior™ Comprehensive test).

Scoring and analysis
Experienced raters from the TOEFL® Junior™ Writing rater pool scored the pupils' performances on the writing tasks based on the TOEFL® Junior™ Comprehensive performance descriptors (https://www.ets.org/s/toefl_junior/pdf/toefl_junior_comprehensive_writing_scoring_guides.pdf). 3 Accordingly, participants could achieve a maximum score of 4 on each part of the writing section, where the top level indicates that the writer produced an accurate and coherent text by using simple and complex sentences to provide key information. A top score also demonstrates that the test-taker understood and accurately conveyed key ideas and sufficient supporting detail. A score of zero represents either no response or a response that is off-task. The maximum total score participants could achieve was 4 on each task, i.e., 16 in total (scores on the two Editing tasks were averaged).
Following Woods et al. (2011), the Forward and Backward Digit Span scores provide an estimate of the score each participant was expected to get correct 50 percent of the time based on overall performance during all 14 trials. The SymSpan score gives the sum of all accurately recalled items in a correct order (Conway et al., 2005).
To investigate the relationships between the predictor variables and the TOEFL Writing scores, we used CLMMs (Christensen, 2015; Agresti, 2010), also known as multilevel ordinal regression models, to analyse the 470 observations (94 students completing five writing tasks each) using the clmm function in the ordinal package (Christensen, 2015) in R (R Core Team, 2018). CLMMs were appropriate for two reasons. First, we had a crossed random effect: each student completed five writing tasks. Second, the outcome variable (writing performance) can be considered an ordinal scale whereby there is ordering of the levels (from 0 to 4) and an upper (4) and lower (0) limit for each writing task. CLMMs allow us to account for the potential ceiling and floor effects imposed by these limits. The predictor variables included: Grade (Grade 6 vs Grade 7), Writing Task (Task 1 Edit 1, Task 1 Edit 2, Task 2 Email, Task 3 Opinion, Task 4 Listen-Write), and a composite score of WM (see next section for details).
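For readers less familiar with this model class, a cumulative logit mixed model with a by-student random intercept can be written as below. This is a generic sketch in the parameterisation documented for clmm, not a reproduction of the fitted model; the symbols \theta_k, \beta, u_i and \sigma_u are notation assumed here for illustration.

```latex
\operatorname{logit} P(Y_{ij} \le k)
  = \theta_k - \left( \mathbf{x}_{ij}^{\top} \boldsymbol{\beta} + u_i \right),
\qquad k = 0, 1, 2, 3,
\qquad u_i \sim \mathcal{N}\!\left(0, \sigma_u^{2}\right)
```

Here Y_ij is the score (0 to 4) of student i on task j, the four ordered thresholds \theta_0 < ... < \theta_3 separate the five score categories, x_ij collects the predictors (Grade, Writing Task and the WM composite), and u_i is the random intercept for student i. Under this sign convention, a positive coefficient shifts probability towards higher scores.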

Descriptive statistics and correlational analyses
Table 2 presents the descriptive statistics for the writing tasks. Participants performed particularly well on the Email and Listen-Write tasks, with mean scores above 3 out of 4. Students also scored high on the Opinion task with a mean score of approximately 3. The two Editing tasks received lower scores. Table 3 provides an overview of the descriptive statistics for the WM data. Despite their young ages, our participants achieved high Forward and Backward Digit Span scores of 6 and 5.5, respectively. In comparison, Jarvis and Gathercole (2003) obtained scores of 5.2 and 4.6 for 14-year-olds, while the adolescents in Kormos and Sáfár (2008) achieved 5.3 on the backward digit span. Participants reached a task switching score (SymSpan task) of almost 19. Standard deviations and minimum and maximum scores indicate that the young learners in our study displayed a wide range of cognitive abilities.
Correlational analyses (Table 4) between the three different WM tests showed that the forward and backward versions of the Digit Span test assessed a partly overlapping construct, as indicated by the strong correlation between the two measures. The Symmetry Span test correlated moderately with the two digit spans.
To minimise the risk of Type I and Type II errors due to potential multicollinearity (Tu et al., 2005), we explored whether it would be appropriate to create a composite score of the three WM test scores. A principal component analysis with the three test scores obtained a Kaiser-Meyer-Olkin measure of sampling adequacy of .64. This lies above the recommended minimum value of .50 (Pett et al., 2003). A Bartlett's Test of Sphericity (approximate χ²(3) = 73.70, p < .001) also indicated that the correlation matrix could be combined into one factor. The total variance table revealed that, together, the span scores had an Eigenvalue of 1.95, which could explain 65.10% of the variance with respective initial Eigenvalues of 65% (ForwardDigitSpan), 22% (BackwardDigitSpan) and 13% (SymSpan). The factor loadings showed that the three components contributed to the score to a similar extent: ForwardDigitSpan = .83; BackwardDigitSpan = .87; SymSpan = .72. Based on these commonalities among the three span tasks, and the results of the factor analysis, we deemed it appropriate to create a composite score using regression factor scores (Tabachnick & Fidell, 2001).
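The reported figures can be cross-checked with a few lines of arithmetic. The sketch below uses only values stated in the text; it does not re-run the principal component analysis, and the variable names are illustrative. It relies on three standard properties: the eigenvalues of a correlation matrix sum to the number of variables, the squared loadings on a component sum to its eigenvalue, and Bartlett's test has p(p - 1)/2 degrees of freedom for p variables.

```python
# Values reported in the text; nothing here is estimated from raw data.
n_measures = 3                     # forward span, backward span, SymSpan
first_eigenvalue = 1.95            # reported Eigenvalue of the first component
loadings = [0.83, 0.87, 0.72]      # reported loadings on the first component

# Eigenvalues of a correlation matrix sum to the number of variables,
# so a component's share of variance is eigenvalue / n_measures.
variance_share = first_eigenvalue / n_measures

# The squared loadings on a component sum to that component's eigenvalue
# (up to rounding of the reported two-decimal loadings).
sum_sq_loadings = sum(value ** 2 for value in loadings)

# Bartlett's test of sphericity has p(p - 1) / 2 degrees of freedom.
bartlett_df = n_measures * (n_measures - 1) // 2

print(f"share of variance for component 1: {variance_share:.3f}")
print(f"sum of squared loadings: {sum_sq_loadings:.3f}")
print(f"Bartlett degrees of freedom: {bartlett_df}")
```

The share of variance comes out at about 65 percent and the degrees of freedom at 3, matching the reported values; the sum of squared loadings is close to 1.95, with the small gap attributable to rounding.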

The role of WM in writing scores
To answer our research question, we started by fitting a series of models, beginning with a minimal model containing just the random effects of students on intercepts, and progressively increasing the model complexity by adding fixed effects and interaction terms. The minimal model (Model 1) was compared to a model including terms corresponding to the fixed effects of Grade Level, Task and WM (Model 2). The Likelihood Ratio Tests (LRT) revealed that the additional complexity of the model was justified. Model 2 provided a better fit for the data than Model 1, χ²(6) = 190.32, p < .01. Next, we compared Model 2 to a model with added interactions of Grade Level by Task, Grade Level by WM, and Grade Level by Task by WM (Model 3). We found that the inclusion of interactions further improved the model fit, χ²(13) = 22.46, p < .05. Thus, based on both the theoretical interest associated with interactions and improvement in the model fit to the data, we decided to keep the interactions.
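The likelihood ratio tests reported above compare a chi-square statistic against its degrees of freedom. The sketch below, which is not the authors' analysis code, recomputes the upper-tail probabilities from the reported statistics using only the Python standard library. The helper `chi2_sf` is a hypothetical name; it relies on the standard recurrence for the regularised upper incomplete gamma function and assumes integer degrees of freedom.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) for a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed-form starting values for df = 2 (even case) or df = 1 (odd case).
    if df % 2 == 0:
        q, nu = math.exp(-half), 2
    else:
        q, nu = math.erfc(math.sqrt(half)), 1
    # Recurrence: Q(nu + 2) = Q(nu) + (x/2)^(nu/2) * exp(-x/2) / Gamma(nu/2 + 1).
    while nu < df:
        q += half ** (nu / 2.0) * math.exp(-half) / math.gamma(nu / 2.0 + 1.0)
        nu += 2
    return q


# Test statistics and degrees of freedom as reported in the text.
p_model2_vs_model1 = chi2_sf(190.32, 6)
p_model3_vs_model2 = chi2_sf(22.46, 13)
print(f"Model 2 vs Model 1: p = {p_model2_vs_model1:.3g}")
print(f"Model 3 vs Model 2: p = {p_model3_vs_model2:.3f}")
```

Under these assumptions, the first comparison yields a vanishingly small p-value, while the second falls just under .05, which agrees with the reported significance levels and shows that the evidence for the interaction terms is comparatively marginal.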
In the next step of the analysis we found that the Maximum Likelihood Model was too complex for the underlying data. The random effects structure exceeded the number of observations in the data set; therefore, the model did not converge as it was overparameterised. Following the recommendation of Bates, Kliegl, Vasishth, and Baayen (2015), to keep the model parsimonious, we established the utility of random slopes using the LRT. We found that the addition of random slopes did not improve the model fit. Consequently, our final model contained the random intercept of students only.
A summary of the final model is presented in Table 5 where we supplement the log-odds estimates with Odds Ratio (OR) estimates. It is important to mention that the summary table of the final model should not be interpreted directly, as all the coefficients are estimated against the reference level categories. For example, the significant coefficient for Grade 7 indicates that Grade 7 students were 3.86 times more likely to obtain a higher writing score than Grade 6 students, but only for the reference level of the writing task (Task 4 Listen-Write) and keeping WM at the average. For a more intuitive interpretation of estimates, subsequent multiple comparisons tables should be considered (Tables 6-9), where we adjusted the p-value for multiple comparisons using normal approximation. Expected Score Change estimates in Tables 6-8 demonstrate the impact of the model's estimates on expected writing scores for the subgroup being compared. This allows for a more meaningful interpretation of the estimates and we recommend considering both the OR and Expected Score Change estimates alongside the significance values. For example, in Table 7 we can see that, keeping WM at the average, Grade 7 students were 3.86 times more likely to score higher on Task 4 Listen-Write than Grade 6 students. Although this difference is significant, on average it is not expected that Grade 7 students will have a higher score on that task than Grade 6 students. The expected writing scores were calculated by considering both the model estimates for between-predictor comparisons and the four intercept parameters, which show the log-odds thresholds that specify the expected writing score.
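To illustrate how a significant odds ratio can leave the expected score unchanged while a smaller, non-significant one can change it, here is a minimal Python sketch of one plausible reading of the procedure: the expected score is taken to be the category whose threshold interval contains the linear predictor on the log-odds scale. The threshold values and baseline predictors are hypothetical and chosen only for illustration; the odds ratios 3.86 and 1.52 are those reported in the text. The function names are our own.

```python
import bisect
import math

# HYPOTHETICAL log-odds thresholds for 0|1, 1|2, 2|3 and 3|4 (not the paper's estimates).
THRESHOLDS = [-4.0, -2.0, 0.0, 2.5]


def expected_score(eta: float, thresholds=THRESHOLDS) -> int:
    """Score category (0-4) whose threshold interval contains the linear predictor."""
    return bisect.bisect_right(thresholds, eta)


def shift_by_odds_ratio(eta: float, odds_ratio: float) -> float:
    """Move a linear predictor by log(OR), the log-odds equivalent of the odds ratio."""
    return eta + math.log(odds_ratio)


# HYPOTHETICAL baseline predictors, both inside the interval for a score of 3.
far_from_boundary = 1.0   # well below the 3|4 threshold
near_boundary = 2.2       # just below the 3|4 threshold

# Large OR, baseline far from the boundary: the expected score does not change.
print(expected_score(far_from_boundary),
      expected_score(shift_by_odds_ratio(far_from_boundary, 3.86)))

# Small OR, baseline near the boundary: the expected score crosses into the next category.
print(expected_score(near_boundary),
      expected_score(shift_by_odds_ratio(near_boundary, 1.52)))
```

The design choice here is to treat significance (the size of the log-odds shift relative to its uncertainty) and meaningfulness (whether the shift crosses a threshold) as separate questions, which is why the text recommends reading the OR and the Expected Score Change side by side.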
To look into the role of task type and grade level in L2 writing performances, we investigated how tasks varied across grade levels and WM (Table 6). Keeping WM constant at a z-score of zero, both grade levels were on average more likely to have a higher writing score on Task 2 Email versus every other task except Task 4 (Listen-Write). The biggest difference for both Grade levels was between Task 2 Email and Task 1 Ed1 where Grade 7 students were 40.85 times (Grade 6 were 37.71 times) more likely to score higher on Task 2 Email than Task 1 Ed1, and the expected writing score change for Grade 7 was from 2 (Task 1 Ed1) to 4 (Task 2 Email). However, in some cases the significant difference between Task 2 Email and other tasks did not result in any meaningful expected score change. For example, students in Grade 6 were 3.33 (1/.30) times less likely to score higher in Task 3 Opinion than in Task 2 Email, but for both tasks the average expected writing score is 3. This apparent contradiction occurs because the baseline group (Task 3 Opinion) sits at the lower end of the threshold 3 interval, so the significant but small OR change leaves the resulting score within threshold 3 for the comparison group (Task 2 Email). On the other hand, students in Grade 7 were 1.52 times more likely to score higher in Task 2 Email versus Task 4 Listen-Write, and although the difference in log-odds between the two tasks is not significant, it resulted in an expected writing score change from 3 to 4, whereby Grade 7 students were expected to score 4 on Task 2 Email but only 3 on Task 4 Listen-Write. In this case, this happens because the baseline scenario is close to the 3|4 threshold and does not require an increase in OR of significant magnitude to cross that boundary; we describe this as a meaningful effect.

Note. a Score that a participant is expected to get correct 50 percent of the time based on overall performance during all 14 trials; b Sum of all correctly recalled squares of correctly recalled sets; SE = Standard Error.

M. Michel, et al. Journal of Second Language Writing 45 (2019) 31-45

With one z-score increase in WM, students in both Grades were more likely to score highest on Task 2 (Email) versus any other task. However, in most cases for Grade 6 students, the differences between Email writing and other tasks resulted in no change in the expected writing score. Task 1 Ed1 was the only task among the Grade 6 students which had an expected writing score of 2. Consequently, those with higher WM in Grade 6 were significantly more likely to score lower on Task 1 Ed1 than on any other task. For example, students in Grade 6 were 49.40 times more likely to score higher on Task 2 Email than Task 1 Ed1.
Students with higher WM in Grade 7 also scored significantly lower in Task 1 Ed1 than on any other task, with the most meaningful change being between Task 1 Ed1 (expected writing score 2) versus Task 4 Listen-Write and Task 2 Email (both had expected writing scores of 4). Overall, the number of significant differences between the tasks reduced with a single z-score increase in WM, but this did not reduce the number of meaningful differences between the tasks among Grade 7 students. Table 7 indicates that from Grade 6 to Grade 7 the writing skills have improved in terms of log-odds. However, in most tasks, including Task 4 Listen-Write, on average it is not enough to see a detectable change in the average writing score. Interestingly, Grade 7 students were 3 times more likely to score higher on Task 2 Email than Grade 6 students, and although this difference is not significant, it is meaningful since it results in an expected score change from 3 for Grade 6 to 4 for Grade 7.
Our comparisons also revealed that students with above average WM had higher writing score log-odds on Task 1 Ed2 than those with lower WM (Table 8). Grade 6 students with a WM score of one standard deviation higher than average students were 4.10 times more likely to have a higher writing score on Task 1 Ed2 than the students with average WM, whereas their Grade 7 counterparts were 4.06 times more likely to have a higher writing score on this task. Additionally, the influence of WM on Task 1 Ed2 was meaningful since students in both grades with above average WM performance were expected to score 3, whereas those with average WM performance were expected to score 2. Interestingly, WM had a non-significant, but meaningful, effect on Task 4 Listen-Write scores among Grade 7 students, whereby those with higher WM were 1.49 times more likely to have a higher writing score (expected score of 4) than those with lower WM (expected score of 3). Table 9 demonstrates that an increase in WM resulted in a marginally better improvement in log-odds for one Grade level versus the other, but this was not significant. We do not show the expected score change in Table 9 since the ORs here describe changes in strength of WM influence between two participant groups rather than its influence on a specific group as evaluated earlier.

Note 1. * = p < .05; ** = p < .01; *** = p < .001. Note 2. Grade 6 is the reference level for Grade; Task 4 Listen-Write is the reference level for Task. Note 3. zWM is centred and standardised WM. Note 4. OR refers to Odds Ratio.

Note 1. * = p < .05; ** = p < .01; *** = p < .001. Note 2. † = keeping WM constant at a z-score of 0; ‡ = after an increase in WM by a z-score of 1.

Discussion
This study set out to investigate the role of WM functioning, grade level, and task type in L2 writing performances of young English language learners. We also aimed to establish whether WM functions play a differential role in L2 writing performance depending on grade level and task type.
Our results showed that the participants performed well on the writing tasks of the computer-administered TOEFL ® Junior™ Comprehensive test. After nearly six and seven years of intensive English learning (5 h per week language classes plus approximately 5-7 h of content-based instruction per week) respectively, they reached A2 to B1 level, some even performing at B2 (cf. Tannenbaum & Baron, 2015, p. 16). This shows that the test tasks were within the competence of the targeted sample of young learners in the investigated CLIL context. It was also encouraging to see that participants did particularly well on the Email and Listen-Write tasks where, on average, they scored above 3 out of 4 points. The fact that the integrated task type (Task 4 Listen-Write) elicited similarly high scores as the Email and Opinion tasks, but that the two Editing tasks (Task 1 Ed 1/2) significantly differed from the other task types, gives further support to earlier calls to create and apply instructional and assessment tasks that reflect target language use domains in academic contexts (Cumming et al., 2005; Cushing-Weigle, 2002; So et al., 2015) (see below for more discussion on the nature of the Editing task type). It also provides empirical evidence for the value of recent endeavours in L2 test design to assess test-takers by means of integrated task types that combine multiple language skills (So et al., 2015). This authentic integrated task type seems to present an appropriate-level challenge to L2 learners in a CLIL context and may prove to have high instructional value.
The descriptive statistics for the WM and CE test scores revealed that the young learners in our study, aged 11-14 years, achieved relatively high scores on the forward and backward digit span tasks (around 6 and 5.5, respectively) (see Jarvis & Gathercole, 2003, and Kormos & Sáfár, 2008, for comparisons). The mean task switching score (SymSpan task) in this group was almost 19, which is a lower capacity than that typically found in adult samples (cf. Foster et al.'s, 2015, mean score of 26.6 for healthy adults).
Given the range of scores in the WM and CE tests, it was unexpected to find that WM functioning had a limited effect on the L2 writing scores of the young learners, except for the academic version of the editing task (Task 1 Edit 2) and the integrated Listen-Write task in Grade 7. The lack of association between WM functioning and performance on most of the writing tasks in this study is surprising because WM functions have been assumed to influence the coordination, parallel processing of information and switching between sub-tasks of text composition (Olive, 2012; Révész et al., 2017). Our results, however, are partially in line with earlier research findings with older learners where significant relationships between writing scores and WM emerged only for the storage function of PSTM (Adams & Guillot, 2008; Kormos & Sáfár, 2008), but where limited or no relationships were found for the simultaneous storage and processing function of WM or executive control (Révész et al., 2017; Steinlen, 2018; Zalbidea, 2017). On the one hand, the findings of this study might show that varying levels of WM functions do not seem to cause construct-irrelevant variance in most of the writing tasks of the TOEFL ® Junior™ Comprehensive test. Therefore, these types of tasks may serve as accurate tools for teachers and other stakeholders for gaining information about young learners' L2 writing skills in a CLIL context regardless of differences in WM functioning. On the other hand, the results might also indicate that the instruction the participants received in their current CLIL context might have been beneficial in reducing variance in L2 writing skills that can potentially be caused by differential WM functioning among children.

Note 1. *** = p < .001.

Table 9
Multiple comparisons: how log-odds change in Grade 7 against Grade 6 for a one unit increase in WM per task.
One of the significant relationships between WM and writing performance was found in the Editing task where students had to find and correct errors in an academic text (Task 1 Edit 2) but not in the version that contained a non-academic text. This finding might corroborate the interpretation of Zalbidea (2017), who concluded that more efficient WM functions allowed her participants to devote more attention to accuracy, which was also an important aspect of the editing task in our study. In fact, in several other test systems, editing tasks are classified as language-in-use tasks (see, e.g., the Austrian school-leaving exam for foreign languages; Froetscher, 2016). The Editing task, which involves metacognitive processing, assesses a specific aspect of the writing process, which is strongly inter-related with reading, namely the monitoring stage (Kellogg, 1996). The fact that WM effects were found for the academic editing task, but not for the non-academic version of the task, suggests that participants with more efficient WM functioning could have been more successful in co-ordinating their reading and monitoring processes when reading and monitoring accuracy in an academic text. Similar WM effects when reading complex texts have been detected in studies with monolingual children. These studies found that comprehension monitoring ability, which is often assessed with editing tasks similar to those in our research, is related to WM functioning (e.g., Oakhill, Hart, & Samols, 2005). These results suggest that teachers might need to provide more assistance to young L2 learners with less efficient WM functioning to detect errors in their writing.
Another interesting finding of our study was the non-significant, but meaningful influence of WM functioning on performance in the Listen-Write task in Grade 7. The Listen-Write task can be considered a relatively complex task type, as it requires young learners to recall and summarize the content of aural input with support from visual input. Therefore, it is possible that the ability to efficiently coordinate attentional processes assists young learners in successfully executing these listening and writing processes. What is surprising, however, is that we found that WM functioning only played a meaningful role in Grade 7. We would have expected performance in this integrated task to be more prone to WM effects in the lower Grade 6 (cf. Kellogg, 2008). The expected score change can probably be attributed to the fact that the expected score of students with average WM was closer to the highest score for Grade 7 than for Grade 6. Thus, Grade 7 students did not need to improve as much as Grade 6 students to cross the threshold into a higher score. Nonetheless, in assessment contexts, especially if they are high-stakes, this score change might have an effect on the final grade, especially if only a few tasks are used.
The influence of an increase in WM values on writing score log-odds was found to be similar for both grade levels. It should be noted that there was some variation and overlap in participants' ages within grade levels and the sample size of the grade level groups was relatively small, which might also account for this finding. However, a more likely explanation is that the one grade level difference in CLIL schooling in early adolescent years might not impact on how WM functioning influences L2 writing achievement.
Regarding differences across grade level, the analysis showed that, on average and irrespective of WM functions, participants in Grade 7 consistently outperformed Grade 6 learners, but this difference was only significant in Task 4 (Listen-Write). This difference, however, was not detectable as a change in the expected average writing scores. The fact that young learners with an added year of writing experience were found to perform somewhat better on this task might indicate that they have more-developed skills of coordinating the simultaneously ongoing operations that are necessary to perform successfully on this task type (Kellogg, 2008; Olive, 2012). The results also show a meaningful, albeit not significant, difference between Grade 7 and Grade 6 students in the Email task. The one additional year of language instruction and cognitive maturity might have assisted Grade 7 students in this task type, which was also relatively demanding in that it required comprehending the writing prompt (email) and responding with appropriate information.
This study also aimed to explore how task type might influence the writing performance of young learners and how WM functions might mediate task type effects. When WM scores were kept constant, students achieved significantly lower scores on the editing task than on the other types of tasks. In Grade 6, significant differences in the academic and non-academic versions of the editing tasks were also detected. When WM values were increased by one z-score, we found that both Grade 6 and Grade 7 learners scored significantly lower on the non-academic versions of the editing task than on other types. In this analysis, which models performance with high WM abilities, the students also received lower scores for editing a non-academic text than an academic one. As the test included only one non-academic and one academic editing task, it is difficult to attribute these findings to the academic nature of the text or to the specific type of the task. However, from the perspective of cognitive validity (O'Sullivan & Weir, 2011), the use of the editing task in the assessment and teaching of young L2 learners, especially if the text contains academic content, needs to be carefully considered.
A further task-related difference was that when WM scores were kept at average, both Grade 6 and Grade 7 participants scored significantly lower on the task where they had to elaborate their opinion than on the email writing task. Nevertheless, the difference was only meaningful in the sense that it resulted in an expected grade change in Grade 7. Although the difference between these two tasks was not statistically significant for participants with above-average WM scores in Grade 7, we detected a meaningful change between the scores. The opinion-writing task differs from the email task in that it requires students to formulate their own ideas, whereas in the email task some of the content is specified in the prompt. This difference in formulation demands and students' potential lack of experience of writing in the argumentative genre in the Hungarian context might explain the lower performance of our participants on the opinion task. In fact, work by Olive (2012) and colleagues suggests that the maturation of cognitive effort in writing might become specifically visible on argumentative tasks. Indeed, the difference in scores seems to widen with age and exposure to L2, which suggests that young L2 writers would benefit from more practice in expressing their own opinion in writing and from more explicit instruction in argumentative genres across the curriculum.

Conclusions and implications
This study investigated the writing performances of young English-L2 learners and their relationship with individual differences in WM functioning. We also examined how L2 writing achievement might relate to task type and grade level and whether the effect of WM functioning on L2 writing performances varies across different types of tasks and years of study. Our research showed that young L2 learners performed particularly well on the Email writing and integrated Listen-Write tasks. When compared to Grade 6 students, Grade 7 participants scored statistically significantly higher on the Listen-Write task and meaningfully higher on the Email task. Both of these tasks are cognitively demanding as the integrated Listen-Write task requires writers to summarize information they had previously heard, and in the Email task students have to understand the original message in the email and respond to it. Differences in WM functioning played a limited role when young learners completed the written component of the TOEFL ® Junior™ Comprehensive test-battery. The only task where statistically significant WM effects were detected was the academic editing task and a meaningful but non-significant WM influence was found for Grade 7 students in the Listen-Write task. The results also revealed that learners with high WM functions showed somewhat more consistent performance across tasks than did learners with low WM functions.
The findings of the present study have valuable implications for the practice of teaching and assessing L2 writing for young learners. First, the fact that individual differences in WM functions showed only limited interactions with scores on the writing section of the computer-administered TOEFL ® Junior™ Comprehensive test-battery provides support for the cognitive validity of this assessment tool. These findings indicate that, except for the academic editing task and potentially the Listen-Write task, young learners with less efficient WM functioning do not seem to be disadvantaged in the written tasks of the test. This finding is important for assessment and instructional task design, because it shows that students who have below average WM functions, a group that might include learners with specific learning difficulties (cf. Kormos, 2017), can perform on these tasks to the best of their knowledge, even under standard test administration and teaching conditions. In combination with our earlier findings that the same group of L2 learners also demonstrated positive attitudes and task motivation towards this test (Kormos, Brunfaut, & Michel, submitted), these results suggest that the writing tasks of the TOEFL ® Junior™ Comprehensive test are appropriately tailored to the characteristics of young L2 learners (Bailey, 2017) and can yield useful diagnostic information on the writing development of young L2 learners in CLIL contexts. The results of our study might also contribute to positive washback effects on the teaching of writing for these students (Cheng, Watanabe, & Curtis, 2004).
If standardized tests for young learners assess performance on tasks sampling from the target language use domain (e.g., informal and formal school interactions) and task types reflecting that domain (e.g., email writing or integrated tasks such as the Listen-Write task), it is hoped that material designers and teachers will also use these tasks more frequently in coursebooks and in the classroom.
This study is not without its limitations, however. First, it should be noted that we investigated a specific group of young learners that attended bilingual schools in Hungary. Data collection took place in the students' school environment and the assessment scores were only made available to the students and their parents, not to their teachers, and did not count towards their academic grades at school. Therefore, our sample and the conditions in which the test was administered might not be representative of the wide variety of young L2 learners and contexts in which students take the TOEFL ® Junior™ Comprehensive test-battery or similar young learners' tests.
A potential further limitation is the fact that we analysed our data using a composite score for the different WM tasks. However, this was essential for modelling, since we wanted to minimise the risks of Type I and Type II errors which are associated with multicollinearity. Although correlational analyses with individual WM tests, which we do not report in this paper, did not reveal a different pattern of relationships, in the future it would be interesting to explore interrelations of young L2 learners' writing scores with the individual components of WM to test the predictions of Kellogg's (1996) model of the writing process. Future work could also follow recent endeavours in researching adult L2 writing from a process-oriented perspective (e.g., Révész, Michel, and Lee, 2019, who used keystroke logging, eye-tracking, and stimulated recall). Another potential direction for future research could be the exploration of how cognitive maturation and experience (Berninger, 1999; McCutchen, 2011) might change the writing product and process of children and adolescents. Finally, replications of this project could be conducted using similar types of tasks with different content, because in our study, except for the editing task, each task type was represented by only one task.