The Process Corpus of English in Education: Going beyond the written text

The Process Corpus of English in Education (PROCEED) is a learner corpus of English which, in addition to written texts, consists of data that make the writing process visible in the form of keystroke log files and screencast videos. It comes with rich metadata about each learner, including indices of exposure to the target language and cognitive measures such as working memory or fluid intelligence. It also includes an L1 component which is made up of similar data produced by the learners in their mother tongue. PROCEED opens new perspectives in the study of learner writing, by going beyond the written product. It makes it possible to investigate aspects such as writing fluency, use of online resources, cognitive phenomena like automaticity and avoidance, or theoretical modelling of the writing process. It also has applications for teaching, e.g. by showing students screencast video clips from the corpus illustrating effective writing strategies, as well as for testing, e.g. by establishing a corpus-derived standard of writing fluency for learners at a certain proficiency level.


INTRODUCTION: FROM WRITTEN PRODUCT TO WRITING PROCESS
The first electronic corpus ever, the Brown Corpus, was a corpus of written English. Since then, many corpora have been collected that represent written language. Among learner corpora, i.e. corpora consisting of language produced by foreign or second language (L2) learners, 64 per cent are made up of written texts only (and 12 per cent of both written texts and spoken transcripts) according to the current version of the Learner Corpora around the World list maintained by the Centre for English Corpus Linguistics (2020). Examples of written learner corpora include the International Corpus of Learner English, the Longman Learners' Corpus, the International Corpus of Crosslinguistic Interlanguage or the Written Corpus of Learner English. These and other written corpora have yielded invaluable insights into writing: its lexico-grammatical features, the way sentences and paragraphs are organised, how genres can be characterised linguistically, what errors writers tend to make, etc.
What these corpora give access to is the written product, that is, the final output of the writing act. Most written texts, however, go through several stages of editing and revision before they reach the final stage, when the text is offered to the reader. These intermediate states of the text are lost in a typical written corpus. The aim of the resource that is introduced in this article, the PROcess Corpus of English in EDucation (PROCEED), is to make the whole writing process visible. To illustrate the difference between written product and writing process, one can consider example (1), a sentence taken from PROCEED and produced by a French-speaking learner of English. This sentence is the result of as many as twenty-eight different stages, as visible in PROCEED and as represented in (2), where strikethrough indicates text that has been deleted and the grey font shows a word in which one or several letters have been inserted.
(1) Our actual society is dominted by technology and science. A lot of experiments concentrate lately on the effects of those new developments on the human being.
(2) a. In b. In c.
What is not visible in (2), but is visible in the PROCEED data, is the fact that the learner has paused on several occasions while typing this sentence. For example, in (2u), there is a long pause of 23 seconds just after that, which may be indicative of the learner's difficulty in finishing the sentence. There is also a seven-second pause before the insertion of the s-letter at the end of development (2ab), which seems to correspond to a reviewing of the whole sentence, resulting in a last correction.
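The idea of a sentence passing through successive stages, as in (2), can be made concrete with a minimal sketch in Python. It assumes a hypothetical edit format (the `Edit` class, its field names and the sample edits are invented for illustration and do not reflect the actual PROCEED or Inputlog file structure) and simply rebuilds every intermediate state of a text from a list of insertions and deletions.

```python
from dataclasses import dataclass


@dataclass
class Edit:
    """One hypothetical logged edit (not the real Inputlog format)."""
    kind: str         # "insert" or "delete"
    pos: int          # character offset in the current text
    text: str = ""    # inserted text, used when kind == "insert"
    length: int = 0   # number of characters removed, used when kind == "delete"


def replay(edits):
    """Return the successive states of the text, one per edit."""
    state = ""
    states = []
    for e in edits:
        if e.kind == "insert":
            state = state[:e.pos] + e.text + state[e.pos:]
        elif e.kind == "delete":
            state = state[:e.pos] + state[e.pos + e.length:]
        else:
            raise ValueError(f"unknown edit kind: {e.kind}")
        states.append(state)
    return states


# Invented edits: a misspelling is repaired, then a plural -s is added late.
edits = [
    Edit("insert", 0, text="new developement"),
    Edit("delete", 11, length=1),
    Edit("insert", 15, text="s"),
]
states = replay(edits)
for number, text in enumerate(states, start=1):
    print(number, text)
```

Only the last element of `states` corresponds to what a traditional written corpus would contain; the earlier elements are the intermediate states that a product-only corpus loses.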
This example is an illustration of Murray's (1980: 3) witty remark that "process can not be inferred from product any more than a pig can be inferred from a sausage." It also points to the importance of considering the writing process next to the written product. Indeed, there have been calls in the literature to pay attention to the writing process. Back in the 1980s, Hairston (1982: 84)
Among the studies that use writing process data, the setting tends to be experimental, with data being collected specifically for this particular study, often among a small group of participants. In Breuer (2019), for example, the keystroke log files produced by 10 German students writing three texts in English and two in German are used to investigate the students' fluency in L1 (mother tongue) and L2 writing, revealing a higher degree of fluency in L1 than in L2 for most students. Sullivan and Lindgren (2002) test the pedagogical use of keystroke log files among four learners of English required to write a narrative text and demonstrate the positive effect of observing one's own composing process. In Elola and Mikulski (2016), a comparison is drawn between the screen activity of six learners of Spanish as a foreign language and 12 learners of Spanish as a heritage language, which brings to light similarities between the two groups (e.g. transfer of writing processes from the L1) as well as differences (e.g. more surface revisions but fewer meaning revisions in Spanish as a foreign language). The term corpus is hardly ever used in such studies, which may suggest that the data are not meant as a durable and reusable resource. A notable exception is Wengelin (2006), who describes her data sets, consisting of keystroke log files for Swedish texts, as corpora. Moreover, she shows how the techniques of corpus linguistics can be applied to the study of pauses in writing by looking for 'microcontexts' made up of a pause preceded and followed by certain elements (e.g. a pause preceded by a typed letter and followed by a deletion). Cislaru and Olive (2018) similarly refer to their process data (different versions of texts in French, together with the keystroke log files) as a corpus. In addition, they explicitly mention corpus linguistics as one of the frameworks they draw inspiration from. Hamel and Séror (2016: 156) also use the term corpus to describe a collection of screencast videos showing the writing process of L2 learners of French and English. They point out that such corpora represent new and exciting forms of empirical data which, once anonymized, could contribute to learner corpus projects that might be shared with others.
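A search for microcontexts of the kind just mentioned (a pause preceded by a typed letter and followed by a deletion) could look roughly like the sketch below. It is not Wengelin's procedure: the event format (a list of `(time_ms, key)` pairs), the key name `"BACKSPACE"` and the 2000 ms pause threshold are all assumptions made for the sake of illustration.

```python
def find_microcontexts(events, threshold_ms=2000):
    """Find pauses that follow a typed letter and precede a deletion.

    events: list of (time_ms, key) pairs in chronological order
            (a hypothetical format, not an actual keylogger export).
    Returns a list of (time_of_letter_ms, pause_length_ms) pairs.
    """
    hits = []
    for (t_prev, k_prev), (t_next, k_next) in zip(events, events[1:]):
        gap = t_next - t_prev
        is_letter = len(k_prev) == 1 and k_prev.isalpha()
        if gap >= threshold_ms and is_letter and k_next == "BACKSPACE":
            hits.append((t_prev, gap))
    return hits


# Invented keystrokes: "tha", a 3-second pause, a deletion, then "e ".
events = [
    (0, "t"), (150, "h"), (300, "a"),
    (3300, "BACKSPACE"), (3450, "e"), (3600, "SPACE"),
    (6000, "c"),
]
microcontexts = find_microcontexts(events)
print(microcontexts)
```

The second long interval in the sample (before "c") is not returned, since it follows a space rather than a letter and is not followed by a deletion. Other microcontexts can be defined by changing the two conditions on the surrounding keys.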
The Process Corpus of English in Education (PROCEED), as its name indicates, was designed as a corpus right from the start, meant as a durable and reusable resource bringing together a substantial amount of data supposed to be representative of a larger population. It relies on both keylogging and screencasting. It also comes with rich metadata and comparable data in the learners' L1. The resource is described in more detail in Section 2, while Section 3 provides an overview of some of the research perspectives that the corpus offers. Section 4 concludes the article.

A project in learner corpus research
PROCEED can be described as a new type of learner corpus in the typology of learner corpora (cf. Gilquin 2015), namely a 'process learner corpus', which shows the process through which a text is composed on computer by language learners. It makes the writing process visible through keylogging and screencasting, two complementary methods to record the activity of writing a text on computer. The corpus aims to contribute to learner corpus research by providing a resource that allows for a novel and fine-grained approach to written performance, in the original sense of 'performance', that is, the process of doing something (in this case, writing a text).
The corpus project started in February 2017 with the collection of writing process data among a group of higher intermediate to advanced, mostly French-speaking students majoring in English at the University of Louvain (Belgium). Since then, additional data have been collected at least once a year among a new cohort of students each year. This is seen as the first step towards setting up an international project that seeks to collect similar data in other countries, among learners of English with different mother tongue backgrounds.

The data
Like traditional written learner corpora, PROCEED includes texts written by learners.  Wengelin's (2006) study, mentioned in Section 1). Inputlog has a replay function, which makes it possible to reconstruct the writing process in a video-like manner on the basis of the stored data. However, the function comes with a warning that an error-free replay of the process files cannot be guaranteed and with a recommendation for researchers relying on replay to resort to screencasting.
Although PROCEED is first and foremost a learner corpus, consisting of non-native data produced by language learners, it was deemed relevant to include L1 data representing the learners' writing process in their mother tongue. This is because writing processes are said to display "conspicuous individual differences" (Sasaki 2000: 262), which may partly be the result of idiosyncratic behaviours that are language-independent, and hence valid regardless of whether the writer is writing in their L1 or in an L2. Comparing writers' behaviours in L1 and L2 is not only intrinsically interesting (cf. Thorson 2000; Stevenson et al. 2006), but it can also help distinguish these language-independent features from those that are due to the non-native nature of the writing process. The L1 data are collected according to the same principles as the L2 data: the learners have about 45 minutes to write a 350-word argumentative text on one of several set topics/quotes, while their screen and keyboard activity is recorded with their permission.

The metadata
As is the case with most learner corpora, PROCEED comes with rich metadata describing learners' profiles and collected via a questionnaire to be filled in by each
Because typing speed is essential when considering aspects of the writing process such as fluency, learners are required to carry out a copy task, both in English and in their L1. The copy task was designed by the developers of Inputlog, within which the results of the task can be analysed. It can be done online, with the output file being directly downloadable from the website. It involves several activities: pressing two keys one after the other as quickly as possible, copying a sentence as many times as possible, copying combinations of three words and copying blocks of consonants.
The analysis of writing process data can provide insights into more cognitive aspects of language performance (cf. Section 3.1). For this reason, the PROCEED metadata also include measures of learners' cognitive abilities, which can be related to the writing process data and possibly account for some of the individual variation.

Writing process research
Besides the kind of research that is traditionally possible on the basis of written learner corpora, the PROCEED data have great potential for research into the writing process. By combining keylogging and screencasting, they present an accurate picture of the way learners of English compose their texts, with unprecedented detail on the actual mechanics of the process. This information can be used for descriptive, explanatory and theoretical purposes.
In terms of description, the keylogging data provide comprehensive statistics about aspects that have to do with writing fluency (number, duration and location of pauses, type and number of revisions, etc.). As against the conventional approaches that measure fluency as the number of words produced overall or the mean number of words produced per minute (cf. Sasaki 2004), the keylogging-based approach considers writing fluency in its multidimensionality (cf. Van Waes and Leijten 2015). This focus on the notion of fluency also opens up new possibilities for comparing learner writing and speech. In addition, keylogging and screencasting data make it possible to examine the use of online resources during the writing process, such as secondary sources (Leijten et al. 2019) or writing tools (Gilquin and Laporte forthcoming, based on the annotation of PROCEED videos with ELAN). The data could also be used to carry out a dynamic discourse analysis, looking at how discourse is created in real time (e.g. paragraph formation, development of rhetorical functions) or what strategies learners adopt to compose a text (e.g. linear composition or outline that is progressively fleshed out).
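To give an idea of what such multidimensional fluency statistics involve, the following Python sketch derives a few of them (number and mean duration of pauses, share of time spent pausing, revision keystrokes, production rate) from a keystroke sequence. The event format, the key names and the 2000 ms pause threshold are assumptions chosen for illustration; this is not the set of measures computed by Inputlog.

```python
from statistics import mean

REVISION_KEYS = {"BACKSPACE", "DELETE"}  # assumed key names


def fluency_summary(events, threshold_ms=2000):
    """Summarise pausing and revising in a list of (time_ms, key) pairs."""
    if len(events) < 2:
        raise ValueError("at least two keystrokes are needed")
    times = [t for t, _ in events]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    pauses = [g for g in gaps if g >= threshold_ms]
    total_ms = times[-1] - times[0]
    produced = sum(1 for _, key in events if key not in REVISION_KEYS)
    return {
        "pause_count": len(pauses),
        "mean_pause_ms": mean(pauses) if pauses else 0.0,
        "pause_time_share": sum(pauses) / total_ms if total_ms else 0.0,
        "revision_keystrokes": len(events) - produced,
        "keys_per_minute": produced * 60000 / total_ms if total_ms else 0.0,
    }


# Invented keystrokes covering six seconds of writing.
events = [
    (0, "t"), (150, "h"), (300, "a"),
    (3300, "BACKSPACE"), (3450, "e"), (3600, "SPACE"),
    (6000, "c"),
]
summary = fluency_summary(events)
for name, value in summary.items():
    print(name, value)
```

Two writers with the same number of words per minute can differ considerably on these separate dimensions, which is the point of treating fluency as more than a single rate.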
A further use of PROCEED is for explanatory purposes. The writing process data can help account for the origin of certain features of the finished texts. A lack of tense agreement between main clause and subclause, for example, may turn out to be due to the fact that the tense of the main verb was changed at some stage but the writer failed to adapt the tense of the verb in the subclause (cf. Gilquin 2021). The data can also help uncover more cognitive aspects of writing performance (cf. Spelman Miller et al. 2008). Revisions may thus point to a lack of automaticity for certain language components (e.g. the subject-verb agreement rule, if the verb form regularly needs to be revised) or to phenomena of avoidance (e.g. avoidance of the passive, if passive structures are systematically aborted), which are typically very difficult to discover on the basis of written texts only. Seeing what words are produced together in one go (the so-called 'bursts', see Chenoweth and Hayes 2001) can also give an indication of the constructions that are stored as wholes in the mind (Gilquin 2020).
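A simplified way of identifying bursts is to cut the stream of typed characters wherever the writer pauses for longer than a given threshold. The sketch below does this in Python. It is an approximation for illustration only: the 2000 ms threshold, the `(time_ms, character)` event format and the helper `typed` are assumptions, and revisions are ignored, so it should not be read as the procedure used by Chenoweth and Hayes (2001) or in the PROCEED studies.

```python
def typed(text, start_ms, step_ms=150):
    """Helper for the example: simulate typing `text` at a steady pace."""
    return [(start_ms + i * step_ms, ch) for i, ch in enumerate(text)]


def pause_bursts(events, threshold_ms=2000):
    """Split (time_ms, character) events into pause-delimited bursts."""
    bursts = []
    current = []
    previous_time = None
    for time_ms, ch in events:
        long_pause = (
            previous_time is not None
            and time_ms - previous_time >= threshold_ms
        )
        if long_pause and current:
            bursts.append("".join(current))
            current = []
        current.append(ch)
        previous_time = time_ms
    if current:
        bursts.append("".join(current))
    # Drop surrounding whitespace and any empty bursts.
    return [b.strip() for b in bursts if b.strip()]


# Invented timing: three words typed fluently, a long pause, then one word.
events = typed("a lot of", 0) + typed(" experiments", 4000)
bursts = pause_bursts(events)
words_per_burst = [len(b.split()) for b in bursts]
print(bursts, words_per_burst)
```

Word sequences that recur as a single burst across many writers would be candidates for constructions stored as wholes, whereas sequences regularly interrupted by pauses would not.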
From a theoretical perspective, writing process data such as those found in PROCEED can help develop or improve models of writing, as shown in Leijten et al. (2014) with keylogging data. The design of PROCEED, consisting of texts produced by the same writers in their L1 and in L2 English, could lead to the development of bilingual writing models representing native and non-native writing, and showing how L1 and L2 writing abilities interact with each other. The metadata associated with each writer might even make it possible to adapt a general writing model to individual variation, most notably through the empirical measures of working memory, which is part and parcel of most writing models (cf. Kellogg 1996; Hayes 2012).

Teaching and testing applications
Next to its use for research purposes, PROCEED also has potential applications for teaching and testing. The most immediate pedagogical application is to use PROCEED as a local learner corpus, that is, a corpus that is collected by the teacher among, and for the benefit of, his or her own students (Seidlhofer 2002). In other words, the learners are both contributors to and users of PROCEED. Once data have been collected from a group of learners, each of them can be given access to their screencast video and be required to watch (part of) it, so as to become aware of how they actually compose a text. Additionally, clips from some learners' videos can be selected and shown to the members of the group, to illustrate effective strategies that could be useful to them (e.g. highlighting words to be checked later in a dictionary, so that the flow of ideas does not get interrupted). Learners can also be presented with some statistics describing their writing behaviour. On the basis of a keystroke log file, Inputlog can generate a user report that summarises some important facts about the user's writing process, such as the time they have been writing vs. pausing or the number of revisions they have made (Vandermeulen et al. 2020). The report also includes a graph representing the writing process which, with some explanations, could help learners visualise their own writing behaviour, and possibly compare it with the behaviour of other learners in the group or that of native writers (see Gilquin 2019 for a pedagogical intervention based on PROCEED as a local learner corpus). The PROCEED data can also be used as pedagogical materials for learners other than those among whom the data were collected. Video clips illustrating different writing strategies (effective or less effective) could be shown to learners to help them reflect on the act of writing and how best to compose a text.
The process graphs generated by Inputlog could also be used as a basis to exemplify various writing behaviours (e.g. revising the text as one goes along or leaving some time at the end to revise the whole of it).
The writing process data from PROCEED can also serve testing purposes. While the testing of writing skills typically only relies on the quality assessment of the finished text, considering the writing process too could result in a more fine-grained evaluation of writing performance (cf. Ranalli et al. 2018). Thus, it would make sense, as is the case for speech, to include a criterion like writing fluency, which would aim to assess how smooth the writing process is. The PROCEED data, and in particular the analysis of the keystroke log files, could provide the necessary statistics to empirically assess the writing fluency of the learners who contributed to the corpus. Their writing fluency in the mother tongue could even be taken into account to provide a tailor-made yardstick for each learner. Another aspect that could be relevant to the evaluation of writing skills is consultation behaviour, that is, the way in which learners resort to online writing tools like dictionaries or thesauri, as using these tools effectively may be seen as an important component of writing performance. Again, this can be examined empirically for the contributors to the corpus, using the screencast videos. The analysis of such aspects of the writing process in PROCEED could also help improve writing assessment on a more general level, for other learners than those who contributed to the corpus. By bringing together data from a large number of participants, PROCEED can be said to be representative of a certain population of learners. It can therefore be exploited to determine the typical writing behaviour of learners at a given proficiency level, for example in terms of pausing time or number of revisions, and to set this as the expected standard. Other learners with a similar profile can then be evaluated against this corpus-derived standard.
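How a learner might be evaluated against such a corpus-derived standard can be sketched as follows. The example compares one learner's share of pausing time with the values of a reference group by means of a z-score and a percentile rank. The function name and all the figures are invented for illustration; they are not PROCEED results, and the choice of these two statistics is only one possible way of operationalising the standard.

```python
from statistics import mean, stdev


def compare_to_reference(value, reference):
    """Position one learner's measure within a reference group.

    Returns (z_score, percentile_rank), where percentile_rank is the
    proportion of reference values that are lower than or equal to `value`.
    """
    if len(reference) < 2:
        raise ValueError("the reference group needs at least two values")
    spread = stdev(reference)
    if spread == 0:
        raise ValueError("the reference values do not vary")
    z_score = (value - mean(reference)) / spread
    rank = sum(1 for r in reference if r <= value) / len(reference)
    return z_score, rank


# Invented pause-time shares for learners at one proficiency level.
reference_group = [0.30, 0.35, 0.40, 0.45, 0.50]
learner_share = 0.48

z_score, rank = compare_to_reference(learner_share, reference_group)
print(round(z_score, 2), rank)
```

The same function could be applied to a learner's L2 value relative to their own L1 value divided out beforehand, which would correspond to the tailor-made yardstick mentioned above.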

CONCLUSION
This article has introduced a new resource, PROCEED, which also represents a new type of corpus to investigate learner writing. Its unique combination of written texts, screencast videos, keystroke log files, rich metadata including cognitive measures, and equivalent L1 data offers an unparalleled opportunity to study the process through which learners write texts. It also opens new perspectives in terms of research and applications: study of writing fluency and comparison with spoken fluency; analysis of learners' use of online writing tools; dynamic discourse analysis taking the development of discourse into account; exploration of cognitive aspects of writing performance; theoretical modelling of the bilingual writing process; pedagogical interventions involving learners' examination of their own writing behaviour; addition of a 'process' component to the assessment of writing skills, based on corpus-derived standards; etc.
While collecting and analysing corpus data of the PROCEED type implies different routines than those followed in traditional learner corpus research, this description of the PROCEED project will hopefully have demonstrated the value of what could be referred to as 'process learner corpus research', and the significance of its possible applications. The potential of PROCEED will arguably continue to increase as the corpus keeps growing in size and in diversity of learner profiles.