CL-MoNoise: Cross-lingual Lexical Normalization

Social media is notoriously difficult to process for existing natural language processing tools, because of spelling errors, non-standard words, shortenings, and non-standard capitalization and punctuation. One method to circumvent these issues is to normalize the input data before processing. Most previous work has focused on a single language, usually English. In this paper, we are the first to propose a model for cross-lingual normalization, with which we participate in the WNUT 2021 shared task. To this end, we use MoNoise as a starting point and make a simple adaptation for cross-lingual application. Our proposed model outperforms the leave-as-is baseline provided by the organizers, which copies the input. Furthermore, we explore a completely different model, which converts the task to a sequence labeling task. Performance of this second system is low, as it does not take capitalization into account in our implementation.1


Introduction
Lexical normalization is the task of converting non-canonical text to its canonical equivalent on the word level. As is common for Natural Language Processing (NLP) tasks, most of the previous work has been done on English data (Han and Baldwin, 2011; Baldwin et al., 2015). However, for lexical normalization there have also been many attempts for other languages and language pairs (Plank et al., 2020; Sidarenka et al., 2013; Alegria et al., 2013; Ljubešić et al., 2017a; Barik et al., 2019; van der Goot et al., 2020; Schuur, 2020; Erjavec et al., 2017; Ljubešić et al., 2017b; Çolakoglu et al., 2019; van der Goot and Çetinoglu, 2021), which have been combined into one benchmark for the WNUT 2021 shared task (van der Goot et al., 2021a). Even though data has been available for multiple languages, most work has focused on one language, and to the best of our knowledge no one has attempted to solve this task cross-lingually. If successful, a cross-lingual lexical normalization model would open up possibilities for lexical normalization for languages in which no training data is available. In this work, we use the MoNoise model (van der Goot, 2019a) as a starting point, as it is the only normalization model that is open source and has models available for the languages we target. Furthermore, it is heavily dependent on raw data to generate candidates and features, which makes it relatively easy to adapt to a cross-lingual setup (Section 2). We refer to our new model as CL-MoNoise.

1 https://bitbucket.org/robvanderg/cl-monoise/
In addition to our cross-lingual model, we also evaluate an out-of-the-box sequence labeler: the string2string task type of MaChAmp (van der Goot et al., 2021b), which was originally created for the purpose of lemmatization.

CL-MoNoise
MoNoise is a two-step modular normalization model. It first generates potential normalization candidates, and then ranks these in a second step. For both of these steps a variety of modules is used, and all modules used for generation are also used to generate features for the ranking. The most important candidate generation modules are: the Aspell spell checker2, the closest words in a Twitter word2vec (Mikolov et al., 2013) embedding space, and a lookup list based on the training data. Features from these modules are then complemented by n-gram probabilities based on Wikipedia and Twitter data, Aspell dictionary presence, and some language-agnostic features, like punctuation detection and the length of the candidate (number of characters). For more details on MoNoise, we refer to van der Goot (2019a).

Figure 1: A diagram of our proposed setup to use MoNoise in a cross-lingual setup with source language l1 and target language l2. All boxes are parts of the model. It should be noted that "Aspell", "N-grams" and "Embeddings" are swapped at test time with target-language versions, and the random forest classifier from train time is reused. "Misc" here represents all remaining features of MoNoise that can be considered language-agnostic.

To retrain MoNoise, new raw data was collected to base its n-gram probabilities and word embeddings on. We downloaded Twitter data from 2012-2020 from archive.org, filtered it with the fastText language classifier, and used the most recent Wikidump for each language.3 This is the exact same data as used by the MoNoise submission provided by the organizers of the shared task (van der Goot et al., 2021a). We train MoNoise models for each of the source languages, and evaluate them on all the training sets of the other languages to pick the optimal source language for each target language. MoNoise is a supervised model, but many of its features are derived from language-specific unsupervised components: the Aspell spell checker, word embeddings, and n-gram probabilities. We adapt to the new languages by replacing these language-specific modules at run time. In other words, we train a model on language A, with Aspell, word embeddings and n-gram probabilities based on language A; we then employ this trained model on language B, and use the Aspell, word embeddings and n-gram probabilities of language B. No adaptations of the code of MoNoise were necessary, as the data directory is simply a parameter.
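This swap can be sketched in a few lines. All names and numbers below are illustrative, and a tiny perceptron stands in for MoNoise's random forest ranker; the key point is that every language-specific feature is computed through a swappable resource bundle:

```python
class Resources:
    """Stand-in for the language-specific modules MoNoise swaps at test
    time: an Aspell-like lexicon and n-gram counts from raw data."""
    def __init__(self, lexicon, ngram_counts):
        self.lexicon = lexicon
        self.ngrams = ngram_counts

def features(word, candidate, res):
    # Language-specific evidence flows only through `res`, so swapping
    # the bundle swaps the language without retraining the ranker.
    return [
        1.0 if candidate in res.lexicon else 0.0,  # dictionary presence
        float(res.ngrams.get(candidate, 0)),       # raw-corpus evidence
        float(abs(len(word) - len(candidate))),    # language-agnostic
        1.0 if word == candidate else 0.0,         # leave-as-is signal
    ]

def train_ranker(X, y, epochs=20):
    # Tiny perceptron for illustration; MoNoise uses a random forest.
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if sum(wj * xj for wj, xj in zip(w, xi)) > 0 else 0
            if pred != yi:
                w = [wj + (yi - pred) * xj for wj, xj in zip(w, xi)]
    return w

def best_candidate(word, candidates, res, w):
    score = lambda c: sum(wj * xj for wj, xj in zip(w, features(word, c, res)))
    return max(candidates, key=score)

# Train on source-language (l1) resources ...
src = Resources({"you", "the"}, {"you": 50, "u": 3})
w = train_ranker([features("u", c, src) for c in ("u", "you")], [0, 1])

# ... and at test time hand the same trained ranker l2 resources.
tgt = Resources({"je", "de"}, {"je": 40, "jij": 20})
best_candidate("j", ("j", "je"), tgt, w)  # picks "je"
```

The ranker never sees language-B data at train time; only its input features change, which mirrors how CL-MoNoise keeps the trained classifier and replaces the resource directory.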
We are aware that this setup constitutes an unrealistic setting, as the cross-lingual setup is not pure: annotated data is used to decide which source model to pick. Alternatives could include automatic selection of models based on the input, or simply using language distances (for example from lang2vec (Littell et al., 2017)). However, we consider this work to be an exploratory analysis, and attempt to validate whether this (cross-lingual) direction is feasible; we leave other strategies for model selection for future work.

3 Of 01-08-2021. Available: https://robvanderg.github.io/blog/twit_embeds.htm
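The selection procedure we do use amounts to a simple argmax over a grid of scores. The accuracies below are invented for illustration; in the paper, this grid comes from evaluating each trained source-language model on every other language's training set:

```python
# Hypothetical grid: grid[target][source] = score of the model trained on
# `source` when evaluated on `target`. All numbers are made up.
grid = {
    "da": {"sl": 0.71, "nl": 0.66, "de": 0.64},
    "nl": {"tr": 0.68, "de": 0.63, "sl": 0.59},
}

def pick_source(target, grid):
    # Pick the source language whose model scores highest on the target.
    return max(grid[target], key=grid[target].get)
```

An automatic alternative would replace the scores with language distances, removing the need for annotated target-language data.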
We use Aspell's bad-spellers suggestion mode when it performs better in-language (for all datasets except SR, ID-EN, and SL). For the code-switched language pairs (ID-EN, TR-DE), we use the code-switching version of MoNoise (van der Goot and Çetinoglu, 2021), more specifically the multi-lingual model, because we do not assume language labels to be available.
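As a concrete illustration, Aspell can be queried in its Ispell pipe mode. The wrapper below is a sketch (it assumes the aspell binary and a matching dictionary are installed): it passes the bad-spellers suggestion mode and parses the "&" lines of the pipe output:

```python
import shutil
import subprocess

def parse_aspell_line(line):
    # Pipe-mode output for a misspelling looks like:
    #   & <word> <count> <offset>: sugg1, sugg2, ...
    # Correct words produce "*" lines, which we skip.
    if not line.startswith("&"):
        return {}
    head, _, suggs = line.partition(":")
    word = head.split()[1]
    return {word: [s.strip() for s in suggs.split(",")]}

def aspell_suggest(words, lang, sug_mode="bad-spellers"):
    """Query Aspell in pipe mode (-a); returns {} if aspell is missing."""
    if shutil.which("aspell") is None:
        return {}
    proc = subprocess.run(
        ["aspell", "-a", "--lang", lang, f"--sug-mode={sug_mode}"],
        input="\n".join(words), capture_output=True, text=True)
    out = {}
    for line in proc.stdout.splitlines():
        out.update(parse_aspell_line(line))
    return out

# The parser alone can be exercised on a canned output line:
parse_aspell_line("& helo 3 0: hello, help, halo")
```

The bad-spellers mode widens Aspell's suggestion search, which yields more (but noisier) normalization candidates.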

MaChAmp
As an alternative model, we evaluate the string2string task type of MaChAmp (van der Goot et al., 2021b). This task type uses the Wagner-Fischer algorithm (Wagner and Fischer, 1974) implementation from UDPipe Future (Straka, 2018). This algorithm finds the character edit operations that transform the original word into its normalized form. Training then becomes a sequence labeling problem, where for each word its correct transformation is predicted. At runtime, the predicted transformation is applied to the original word to obtain the final normalization.
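The edit-script idea can be sketched as follows. This is a simplified re-implementation for illustration, not the UDPipe Future code: a Wagner-Fischer table is backtraced into an operation sequence, which serves as the label for a word and is replayed at prediction time:

```python
def edit_script(src, tgt):
    """Minimal character edit script turning src into tgt (Wagner-Fischer)."""
    n, m = len(src), len(tgt)
    # dp[i][j] = cost of turning src[:i] into tgt[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(dp[i - 1][j] + 1,                        # delete
                           dp[i][j - 1] + 1,                        # insert
                           dp[i - 1][j - 1] + (src[i - 1] != tgt[j - 1]))
    # Backtrace from the bottom-right corner into an op sequence.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and dp[i][j] == dp[i - 1][j - 1] + (src[i - 1] != tgt[j - 1])):
            ops.append(("keep",) if src[i - 1] == tgt[j - 1]
                       else ("sub", tgt[j - 1]))
            i, j = i - 1, j - 1
        elif j > 0 and dp[i][j] == dp[i][j - 1] + 1:
            ops.append(("ins", tgt[j - 1]))
            j -= 1
        else:
            ops.append(("del",))
            i -= 1
    return ops[::-1]

def apply_script(src, ops):
    """Replay a predicted edit script on a (possibly unseen) word."""
    out, i = [], 0
    for op in ops:
        if op[0] == "keep":
            out.append(src[i]); i += 1
        elif op[0] == "sub":
            out.append(op[1]); i += 1
        elif op[0] == "ins":
            out.append(op[1])
        else:  # del
            i += 1
    return "".join(out)

apply_script("u", edit_script("u", "you"))  # -> "you"
```

Because the script, not the output string, is the label, the same label learned from one word pair (e.g. append "g") generalizes to other words with the same pattern.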
One main weakness of this approach is that it lowercases all text first, and then tries to predict the conversion to capitals where necessary. While this makes sense for lemmatization (its original use case), this removal of information is probably suboptimal for lexical normalization, as capitals are often kept.

Development Phase
For the CL-MoNoise model, we tune the source language separately for each dataset. Results are shown in Table 1. Most of the best source languages can be explained by language relatedness, but in some cases the best source language is surprising: Slovenian scores best for Danish, Turkish is best for Indonesian-English, and Turkish is best for Dutch. We inspected the correct replacements, and found to our surprise that they were not words that exist in both languages, nor were they only some very frequent words. Instead, the word embeddings and Aspell features seem to have generalized well in spite of the language variety. For MaChAmp, we use all default hyperparameters, and compare the difference in performance between MBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020), which have both been pre-trained on all the languages targeted in MultiLexNorm.

Test Phase
Results of our two models and all models provided by the organizers are shown in Table 3. The MFR baseline is hard to beat; at first sight, this also seems true compared to CL-MoNoise, but we should take into account that MFR was trained on in-language training data, whereas CL-MoNoise was not. CL-MoNoise is indeed the best performing system not trained on in-language data, as LAI is the only competitor there; all other participants in the shared task used in-language training data (van der Goot et al., 2021a). The worst scores for both MaChAmp and CL-MoNoise are on the Italian dataset, which is probably because quite a few language-specific constructions are normalized there (van der Goot et al., 2020), and for MaChAmp it matters that capitalization is corrected. Performance on German, Turkish and the code-switched Turkish-German is relatively high, because they all have highly relevant source languages. For Indonesian-English, performance dropped a lot compared to the scores on the development set (Table 1), and it might have been safer to use English as a source language instead of Turkish.
Table 2 shows that XLM-R outperforms MBERT for all languages. Furthermore, it becomes clear that capitals are a main weakness of MaChAmp: the uncased scores for DE and NL, which were the only development datasets with capitalization correction in the annotation, are much higher. This is because capitalization was not taken into account in the conversion algorithm of MaChAmp, probably because it was geared towards lemmatization, and lemmas are commonly lowercased.

Table 3: Results of the models provided by the organizers (grey) and our proposed models. The source language used for CL-MoNoise is shown in the row.