Corpora for Linguists vs. Corpora for Learners: Bridging the Gap in Italian L2 Learning and Teaching

This paper aims to shed light on how research findings stemming from Learner Corpus Research (LCR) can inform the development of Data-driven learning (DDL) pedagogical activities. By doing this, it seeks to show how the gap between corpora built to be used by linguists and those tailored for learners can be filled. It starts by defining what a corpus is and how second language learning studies can benefit from the research findings based on corpora, but also from the direct use of corpora in the classroom. Then, it provides an overview of the available native and learner corpora of Italian, and how corpora in general can be adapted for DDL purposes. Finally, it describes an example of how an LCR finding can be used to develop DDL activities. It concludes with some desiderata for the future.


Introduction
A corpus is widely defined as an authentic, representative and machine-readable collection of linguistic data (McEnery, Xiao, Tono 2006, 5). It is authentic because it includes only instances of language that are used by real speakers, as opposed to a single ideal speaker; it is representative because it aims to include a sample of language data that is appropriate to a given aim of inquiry, thus being able to reflect a certain genre or variety; finally, it is machine-readable in the sense that the data is stored in electronic format and is searchable through a dedicated software interface, so that large quantities of data can be processed at one time.
L2 learning studies can benefit from corpora in a number of ways. The most notable advantage of using corpora is that they provide better descriptions of real language usage by native speakers, which means that teachers and, even more so, syllabus designers can rely on better sources to identify and sequence L2 learning aims. This potential of corpora has been clear since the very early stages of corpus construction, when Randolph Quirk stated that corpora were a necessity in order to deal with the inadequacy of the teaching materials of the time, which kept reflecting a kind of language that learners would not find in real-life communicative practices (Quirk 1960).
Another major advantage of using corpora in L2 learning studies is that corpora are able to provide empirical evidence of real language use not only in relation to native speakers, but also in relation to the learners themselves. This is the case with learner corpora, where teachers and researchers are able to detect the most frequently occurring errors, along with the most frequently occurring traits distinguishing native and non-native uses of the language (Granger 1996; 2015). Learner corpora that are constructed on the basis of more than one data collection point in time are also able to trace developmental patterns in the learning process, which are argued to be typically non-linear (Larsen-Freeman 1997).
In general, two main modalities in using corpora for second language learning have been identified: the indirect use, when the data derived from the corpus is not immediately visible to the learners, and the direct use, when data is immediately visible to the learners. This dichotomy was originally proposed by Leech (1997) in relation to the direct vs. indirect aspect, and later extended by Meunier (2010) in relation to when the data may be made visible to the learners, either immediately or not.
The indirect uses of corpora in second language learning can be seen in approaches such as Contrastive Interlanguage Analysis (Granger 1996; 2015), where varieties of native and non-native instances of language use are compared and analysed, but also in language testing, where errors found in a learner corpus are used as distractors in the development of multiple-choice items. Furthermore, indirect uses are found in lexicographical practices catering for learners (Granger, Paquot 2015; Paquot 2012; Spina 2010) and also in coursebook design (McCarten 2010).
Direct uses, on the other hand, aim to make corpus data immediately visible to the learners, so that they can explore it and use it within learning activities. This can be done either in a computer-based format, where the learners interrogate a corpus themselves (Mueller, Jacobsen 2016), or in a paper-based format, where concordance lines are selected beforehand by the teacher and printed on paper (Boulton 2010).
We can see how the indirect uses of corpora in second language learning partially overlap with Learner Corpus Research (LCR), while the direct uses of corpora mostly coincide with Data-driven learning (DDL). Our purpose here is to show how LCR can inform DDL through learner-friendly corpora.

2
Corpora for All?
As we have seen, native and learner corpora can be variously used in relation to second language learning. In the following paragraphs we will first describe the main corpora that are available for Italian, then we will identify the main traits characterising corpora that are specifically built for learners, and finally we will see what has been done so far, from this perspective, in the context of Italian L2 learning and teaching.

An Overview of Italian Native and Learner Corpora
A number of native and learner corpora have been constructed for the Italian language. As for native corpora, we can mention large written corpora like Paisà (Lyding et al. 2014), based on web data, the Repubblica corpus ( Baroni et al. 2004

Towards Learner-Friendly Corpora
In order to be usable and useful for learners of a second language, corpora need to satisfy some basic requirements. First, the texts contained in the corpus need to be suitable for the learner's needs. These needs refer primarily to the difficulty level of the text: reference corpora are perhaps the most widely constructed and used corpora in any language, but because they are constructed with the purpose of describing a language in a representative way, from the perspective of native speakers, they would be usable mostly by advanced learners, who constitute only a small proportion of the language learner population. By contrast, the texts contained in a learner-friendly corpus would need to be suitable in terms of proficiency level, reflecting either one broad level or different proficiency or difficulty levels. Ideally, the nature of the texts contained in a learner-friendly corpus should also reflect the learners' interests, so that they are even more motivated to use the tool.
Second, a learner-friendly corpus would need to have a user-friendly interface, a learner-friendly output and simple querying systems. Reference corpora built for linguists most typically require a user to formulate a query by using specific kinds of syntax. A learner would instead need basic query forms that are graphically appealing and that make the output easy to understand and easy to use for whatever learning need he or she may have.
A number of learner-friendly corpora have already been developed. One example is SCoRE (Sentence Corpus of Remedial English; Chujo, Oghigian 2012; Chujo, Oghigian, Akasegawa 2015), a corpus made of sentences that were extracted from a reference corpus and then manually emended by language experts in order to make them suitable for English language learners at beginner level. Another example is SkELL (Sketch Engine for Language Learning; Baisa, Suchomel 2014), which is based on an algorithm that automatically selects 40 good examples for learners, drawn from a very large reference corpus; these examples are selected on the basis of criteria formulated with the aim of excluding, for instance, long subordinate clauses and/or words that are rare or technical. Finally, the idea of corpora based on graded readers has also been explored, but these corpora are yet to be put into practice on a systematic basis. In this case, texts are written by native experts in order to cater for different proficiency levels (Allan 2009; Gavioli, Aston 2001), though generally keeping in mind issues of authenticity and appropriateness (Hendry, Sheepy 2017).
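The kind of criteria-based selection described for SkELL can be illustrated with a minimal sketch. This is not SkELL's actual algorithm: the function names, the thresholds (`max_tokens`, `min_word_freq`) and the idea of counting word frequencies over the candidate pool itself are all assumptions made for illustration. Only the limit of 40 examples and the general aim of excluding long sentences and rare words come from the description above.

```python
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"\w+", sentence.lower())


def select_examples(sentences, max_tokens=15, min_word_freq=2, limit=40):
    """Keep short sentences whose words are all reasonably frequent.

    max_tokens and min_word_freq are illustrative thresholds,
    not values taken from SkELL.
    """
    # Word frequencies are computed over the candidate pool itself (an assumption).
    freq = Counter(tok for s in sentences for tok in tokenize(s))
    selected = []
    for s in sentences:
        tokens = tokenize(s)
        if not tokens or len(tokens) > max_tokens:
            continue  # skip empty or overly long sentences
        if any(freq[t] < min_word_freq for t in tokens):
            continue  # skip sentences containing rare words
        selected.append(s)
        if len(selected) == limit:
            break
    return selected
```

In a realistic setting the frequency list would more plausibly come from a large reference corpus than from the candidate sentences, and length would be only one of several readability criteria.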
The examples just outlined pertain to English language learning. With regard to Italian, two main kinds of attempts have been made. On the one hand, the texts contained in the corpora that were to be used by learners were chosen according to the specific learning aims of the learners, be it creative writing (Kennedy, Miceli 2001) or Italian for Specific Purposes (Polezzi 1993). [...] Just like the English version, the Italian version of SkELL allows learners to type a word or word string into a search box and then explore it in terms of concordances, word sketches showing the words that most frequently co-occur with the search word, and finally a word cloud showing synonyms and other words that are similar to it.
Other ideas as to how learner-friendly corpora can be used with learners are outlined in Naismith's article (2016). The author's primary aim is to show second language teachers how easy-to-use corpora can be integrated into second language lessons. He proposes tools such as Google Books Ngram Viewer, whose interface is very similar to that of Google and which can therefore be easily integrated into lessons as needed. With Google Books Ngram Viewer, the learner can look up the frequency of occurrence of words or word combinations over time. Two or more forms can be compared, for instance, in order to see which one is more frequently used in the present. The frequency is shown by means of a line graph. Another tool proposed by Naismith is Justtheword: here, the learner can easily extract quantitative information about the usage of a word or word combination; in this case, the information is shown with examples as well as numerically, and an incorporated Wordle extension allows the results to be visualised in the form of a word cloud.
These are only some of the ways in which corpora have been adapted to the needs of learners, instead of catering solely for the needs of researchers. In the following section we will describe how a specific kind of learner-friendly corpus, namely SkELL, can be used to apply a finding from an LCR study.

3
From LCR to DDL
This section describes how LCR and DDL can meet through it-SkELL. The empirical evidence emerging from the investigation of a learner corpus provides crucial data for the teacher: it can indicate error frequencies and frequent non-native usage patterns, and also shed light on how they both develop over time. In the following two sections we will describe an LCR study on Italian, focused on collocations and based on the LOCCLI corpus, and then show how one of its findings can be turned into a DDL activity.

An LCR Study on Italian
The study (Spina 2019) investigates the developmental patterns of phraseological errors made by Chinese beginner and pre-intermediate learners of Italian in the use of noun + adjective (tempo libero 'free time') and adjective + noun (bel tempo 'nice weather') lexical combinations, and it is based on an error-annotated sub-sample of the LOCCLI corpus. The main aim of the study is to verify the hypothesis that time affects errors in the combinations of nouns with adjectives produced by these learners. One of the major findings is that noun + adjective and adjective + noun combinations display opposite behaviours across time with respect to the production of these specific phraseological errors: errors decrease after six months for adjective + noun combinations, while they significantly increase for noun + adjective combinations. An error of the type Ho trovato gli spagnoli ragazzi sono non più belli di italiani ragazzi ('I found that Spanish boys are not more good-looking than Italian boys'), where the correct form ragazzi spagnoli ('Spanish boys') is replaced by a form with the adjective wrongly preceding the noun, therefore tends to increase over time.

Building an LCR-Informed DDL Activity
The main finding of the study above is that errors in noun + adjective combinations tend to increase over six months. This finding may lead the teacher to place a particular focus on this kind of error. The focus does not need to be explicit or out of context: it can stem from the observation of an error in a learner's writing or speech. The teacher will know that this kind of error does not tend to decrease over time but, on the contrary, tends to increase, which means that it may deserve some additional attention.
Starting from the example given in the previous section, the teacher may point out the error to the student(s) without providing any kind of corrective feedback. The student(s) will then be asked to open SkELL and search for spagnoli ragazzi, and will see an empty screen with no results. At this point, the teacher will invite the student(s) to search again with the two words inverted, ragazzi spagnoli. This time the student(s) will find a number of examples, which will allow them to infer the regularity: in this combination, the adjective always follows the noun. This corresponds to the simplest type of DDL activity: it guides learners to observe whether a certain form that they have used in writing or speech is actually used by native speakers or not. Other activity types may involve a number of different steps building up a guided discovery process, like the many examples shown in Sinclair (2003).
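The logic of this two-step search can be sketched in a few lines of code. This is only an illustration of the comparison the learner carries out, not of how SkELL works internally: the function names are hypothetical and the three sample sentences are invented for the sketch, standing in for native-speaker corpus data.

```python
import re


def concordance(sentences, phrase, width=30):
    """Return keyword-in-context lines for an exact word sequence."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    lines = []
    for s in sentences:
        for m in pattern.finditer(s):
            left = s[max(0, m.start() - width):m.start()]
            right = s[m.end():m.end() + width]
            lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines


# Invented sample sentences, used here only for illustration.
SAMPLE = [
    "I ragazzi spagnoli sono arrivati ieri.",
    "Abbiamo conosciuto dei ragazzi spagnoli a Perugia.",
    "Molti ragazzi spagnoli studiano in Italia.",
]


def compare_orders(sentences, learner_form, inverted_form):
    """Count the hits for the learner's form and for the inverted word order."""
    return {
        form: len(concordance(sentences, form))
        for form in (learner_form, inverted_form)
    }
```

Calling `compare_orders(SAMPLE, "spagnoli ragazzi", "ragazzi spagnoli")` on the invented sample mirrors what the learner sees on screen: no hits for the first form and several for the second.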
The data used in a simple activity such as the one just shown can be reused in subsequent lessons, with the aim of recycling the given structure and increasing the frequency with which it appears in the input.
For example, the set of concordance lines can be used to construct a multiple-sentence gap-fill exercise, such as the one shown in figure 3. The main advantage of a multiple-sentence gap-fill exercise over a single-sentence gap-fill is that the learner is able to test an initial hypothesis against multiple examples. In this case, there are 12 sentences containing the same word combination and gapped in relation to the same member of the word combination, namely the noun collocate.
By looking at the first sentence, the learner will be able to notice that the noun can only be masculine and plural, considering that these are the properties characterising the adjective spagnoli. So, ideally with a partner or with a group of peers in the classroom, the learner will start exploring options: studenti or ragazzi, for instance. The context here is that of education, since the sentence seems to evoke an educational program that caters for people aged between 18 and 22, so the missing noun could be either of the two. Then, the learner or group of learners will proceed to the second example. Here, the context does not seem to be related to education, so of the two initial hypotheses, studenti or ragazzi, only the second can be retained. The third example has the potential to clear up any doubts: the article used before the noun is i, which is not used in front of nouns that begin with st. As a result, the only possible noun occurring in all of these examples is ragazzi. The students may not be able to find all the clues provided by the concordance lines immediately. This is why, while fostering group work, the teacher will have provided a guided-discovery procedure that the learners can follow. This procedure most typically comes in the form of guiding questions (as shown in Sinclair 2003), but it can also come in the form of questions with a series of options, especially in cases where the proficiency level of the classroom is low, thus providing additional scaffolding. The teacher may also circulate throughout the classroom while the students engage in explorative activities such as this one, and provide even further scaffolding whenever needed.
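A multiple-sentence gap-fill of the kind discussed above could be generated from a set of concordance lines along the following lines. This is a minimal sketch under stated assumptions: the function name `make_gap_fill` and the gap marker are hypothetical, and the concordance lines are assumed to be available as plain strings (for instance, copied from the tool's output).

```python
import re


def make_gap_fill(sentences, target, gap="______"):
    """Blank out every occurrence of `target` and number the resulting items."""
    pattern = re.compile(r"\b" + re.escape(target) + r"\b", re.IGNORECASE)
    items = []
    for s in sentences:
        # Only sentences that actually contain the target become exercise items.
        if pattern.search(s):
            items.append(pattern.sub(gap, s))
    return [f"{i}. {item}" for i, item in enumerate(items, start=1)]
```

Because the same noun is gapped in every item, the retained context (articles, adjective agreement, verbs) carries the clues that learners combine across sentences, which is the point of the multiple-sentence format.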
An activity such as this one starts out focused on a single word combination, which in this case is ragazzi spagnoli, but as the learners go through it, it can easily extend into different areas: as mentioned previously, the teacher can draw the students' attention to the use of articles depending on how the words following them begin, but also to the verbs that this noun phrase co-occurs with. What do these ragazzi spagnoli actually do? Not all examples will provide an answer to this, but those that do can lead the students to observe that the ragazzi spagnoli develop things (line 3), are invited as guests somewhere (line 5), are accompanied by teachers somewhere (line 7), are part of mobility programs (line 8), are always ready to joke and laugh and talk (line 9), and take part in training programs (line 11). A teacher will easily see great potential in all of these examples, and will be able to further extend the DDL activity in the directions deemed closest to the learners' needs.
This would not be possible with a traditional single-sentence gap-fill activity. What students most typically find themselves doing are exercises with lists of single sentences containing specific learning aims, in terms of grammatical features or lexis. Each sentence presents a co-text for the learning aim, but is ultimately devoid of context. The time spent on each item will be limited and the attention span of the learner will be likely to decrease as he or she proceeds, because each new sentence will evoke a different kind of thematic context.
In the case of a DDL activity stemming from a single word combination, such as the one we presented, the discovery process that is initiated can take a number of different turns, all of which will be logically linked. This is how the student will find himself or herself in the position of a detective, a research scientist or a traveler (Bernardini 2000; Cobb 1999; Johns 1997), while the teacher will become a demonstrator, a collaborator or a guide (Boulton 2011; Charles 2014; Frankenberg-Garcia 2012). In this way, the discovery process put into place by the DDL activity will allow the learner to gain insight into language usage and the form varieties in an autonomous way: as Cobb points out in his contribution on constructivism, "knowledge encoded from data by learners themselves will be more flexible, transferable, and useful than knowledge encoded and transmitted to them by an instructor" (Cobb 1999, 15).
However, what this study attempts to show is that a further step can be taken. If a DDL activity is linked to the empirical evidence deriving from an LCR finding, that activity is likely to be more effective, because its focus will be empirically motivated.
We have stated that bridging the gap between LCR and DDL can be done by using LCR findings to inform DDL practices. What this means in practice is using LCR findings to decide what to focus on and how to sequence and modulate it within pedagogical material design. If a teacher is informed about the empirical evidence attached to a specific kind of error, he or she will be able to make an informed decision as to how to treat it whenever it is encountered. If an LCR finding says that a particular error is not very common, the teacher can decide to treat it as something temporary; if, on the contrary, a finding says that an error is quite common, and becomes more so as time goes by, the teacher can decide to devote special attention to it, in order to counteract the learning pattern that the LCR finding has revealed for that form.

Conclusions
In this article we tried to explore ways in which the uses of corpora built for linguists can be merged with the uses of corpora built for learners. More fruitful exchanges between the two imply a number of desiderata. In terms of learner corpora of Italian, we need more collaborative work that is able to ensure more accurate corpus design criteria. Learner corpora for Italian need to be larger in size, their longitudinal dimension needs to be increased, and the computational tools that are used to process and extract data from them need to be more sophisticated.
On the other hand, the effectiveness of different kinds of learner-friendly corpora should be further explored, in relation to the specific needs of the different teaching contexts. Tools that are suitable for use by intermediate and lower-intermediate learners of Italian should be explored and new ones should be created, considering that this is generally the largest group of learners of Italian. Furthermore, in order to integrate LCR findings into DDL practices, and teaching practices in general, teachers need to be aware of research findings, and researchers need to be aware of teachers' needs: progress in language learning and teaching methods is arguably based on good research and good education policies, but also on awareness of learner needs and teacher needs. As a result, the gap that needs to be bridged is not only that between LCR and DDL, but also that between researchers and teachers.