On Language Models for Creoles

Creole languages such as Nigerian Pidgin English and Haitian Creole are under-resourced and largely ignored in the NLP literature. Creoles typically result from the fusion of a foreign language with multiple local languages, and the process by which grammatical and lexical features are transferred to the creole is complex. While creoles are generally stable, the prominence of some features may be much stronger with certain demographics or in some linguistic situations. This paper makes several contributions: We collect existing corpora and release models for Haitian Creole, Nigerian Pidgin English, and Singaporean Colloquial English. We evaluate these models on intrinsic and extrinsic tasks. Motivated by the linguistics literature on creoles, we compare standard language models with distributionally robust ones and find that, somewhat surprisingly, the standard language models are superior. We investigate whether this is an effect of over-parameterization or relative distributional stability, and find that the difference persists in the absence of over-parameterization, and that drift is limited, confirming the relative stability of creole languages.


Introduction
A creole language arises when a pidgin, developed by adults for use as a second language, becomes the native and primary language of their children. (A pidgin is a grammatically simplified language that develops between two or more groups that do not have a language in common; both pidgins and creoles are sometimes referred to as contact languages.) Although a large portion of creole languages have their roots in Western European colonialism and slavery, creole languages still serve as important lingua francas in multi-ethnic and multilingual communities, and creoles are often an important part of the local identity. Moreover, there are more than a hundred million speakers of creole languages worldwide (Fig. 1), with similar needs for technological assistance, and yet creoles are still largely absent from NLP research (Joshi et al., 2020). Haitian Creole, for example, has 9.6 million speakers as of today; Nigerian Pidgin English has 100 million speakers, and Singaporean Colloquial English (Singlish) has 3.5 million speakers. This paper sets out to collect existing resources for these three languages and provides language models for them. In doing so, we wish to take the nature of creole languages into account, not necessarily assuming that our best approaches to modeling non-creole language are also best for creole languages.
The nature of creole languages has been a matter of much debate in linguistics during the last decade (Sessarego, 2020): Some see creole languages as natural stages in language change cycles (Aboh, 2015), while others see them as a distinct typological class with unique characteristics, including, for example, a very simple morphology (McWhorter, 1998). Another feature of creoles is that they exhibit significant variation across groups of speakers (Patrick, 1999). Winford (1999) goes as far as to call creoles continua that cannot be captured under a single grammar.
Consider the pair of sentences in Figure 2.

Figure 2: Example sentences in Singlish featuring multilingual vocabulary, Chinese-style topic prominence combined with a subordinate clause with English word order, and a final interjection representing a discourse particle, a common feature of Singlish. Example from https://languagelog.ldc.upenn.edu/nll/?p=25758.
Here, according to the authors, both sentences are valid utterances in Singlish, and they both mean "John is so lecherous", but the first would more likely come from a speaker of Chinese, and the second from a Malay speaker. From this, we derive the conjecture that creole language models can benefit from learned mixtures of source languages. Training on mixtures of source languages has been applied to language modeling of code-switched language (Pratapa et al., 2018), and it is clear from examples such as the one in Fig. 2 that creole languages, at the sentence level, share commonalities with code-switched language, with vocabularies drawn from multiple source languages. To exploit synergies with learned mixtures of source languages, and to obtain robust performance across related, but unseen distributions, we explore ways of training creole language models with distributionally robust objectives (Oren et al., 2019). Our results below, however, show that, somewhat surprisingly, this conjecture is probably not true, at least not in a straightforward way.
Contributions We combine existing datasets and present pretrained language models for the following creole languages: Nigerian Pidgin English, Singaporean Colloquial English (Singlish), and Haitian Creole. We perform intrinsic evaluation (word prediction), as well as extrinsic evaluation (part-of-speech tagging and named entity recognition). Comparing language models trained with empirical risk minimization to language models trained with robust objectives, we observe that training with multiple related languages does not improve creole modeling and, somewhat surprisingly, that models trained with empirical risk minimization are superior to models robust across domains. We hence investigate why this is: in particular, whether it is due to over-parameterization, insufficient regularization, or relative distributional stability (Ben-David et al., 2007). We observe no significant difference for language models with fewer parameters or a higher degree of regularization. On the other hand, we find that the underlying reason might be the relative stability of the creoles, which show no significant drift.
Related Work

NLP research on creoles Despite the unique features of creoles, which make them an interesting application for multilingual and cross-lingual NLP, and despite the open-ended debate about the linguistic nature of creoles (Sessarego, 2020), little attention has been devoted to creoles in NLP. (We present work related to the specific creoles in focus in this paper in §3.) One relevant work by Murawaki (2016) explored the typological status of creoles and introduced a method for statistical modeling of creole genesis. The authors first reported that binary SVM classification of creole and non-creole languages failed to distinguish the two classes, even though their underlying distributions are quite different. They then introduced a statistical model of creoles, formulated as a mixture of their influential languages and an inferred "restructurer", a set of possible linguistic feature distributions observed across the languages included in their experiments. Overall, this work showcases how statistical modeling can be useful for investigating the language evolution of creoles; however, there is no discussion of how these findings could help extend current NLP methods to creoles.

NLP research on pidgins and code-switching
Creoles are pidgins that have consolidated over time to become a first language for new generations of speakers. The NLP literature on pidgins is even more sparse than the literature on creoles, as many pidgins that did not undergo creolization have gone extinct, such as Maritime Polynesian Pidgin (Kriegel, 2016). The code-switching literature, however, is also relevant, as both pidgins and creoles draw from multiple languages. Importantly, pidgins differ from code-switching and mixed language in that code-switching typically occurs between bilingual or highly proficient speakers of two languages. Pidgins, on the other hand, are derived from multiple languages and spoken by those who do not fluently speak every language involved. In contrast to the literature on pidgins and creoles, the NLP literature on code-switching is rich; we refer readers to Çetinoglu et al. (2016) and Dogruöz et al. (2021) for an overview.

Computational research on language evolution
Research on creoles is more common in the field of language evolution than in NLP. In particular, work on creoles in this field typically focuses on their computational modeling, their emergence (Nakamura et al., 2009), and their evolution (Jansson et al., 2015; Furman and Nitschke, 2020). Other creole modeling efforts in this space may be more tailored towards specific linguistic insights (Parkvall, 2008). While these studies demonstrate that work on creoles is being done in a computational space, it is difficult to apply conclusions from them to NLP, because distinct empirical assumptions are made in these two research areas.
Distributionally robust optimization Effectively learning to model and predict underrepresented subdistributions has always been a challenge in machine learning, e.g., when predicting rare classes (Scheirer et al., 2013; Fei and Liu, 2016), classes of examples from rare domains (Zheng et al., 2020), or minority groups (Hashimoto et al., 2018). Underrepresented data is often ignored or learned poorly by models (Feldman and Zhang, 2020), compared to its over-represented counterparts. Distributionally Robust Optimization (DRO) (Hashimoto et al., 2018) aims to minimize the loss on all sub-populations, rather than minimizing their average (Ben-Tal et al., 2013). DRO has been particularly useful in the domain of algorithmic fairness (Hashimoto et al., 2018), but has also been found to boost performance on underrepresented domains in language modeling (Oren et al., 2019) and is generally applicable in situations with drift (Koh et al., 2021).
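To make the group-robust objective concrete, below is a minimal sketch of one group-DRO step in the style popularized by the WILDS benchmark: per-group losses are computed within a batch, group weights are updated by an exponentiated-gradient step, and the weighted sum is returned as the robust loss. All names and the step size are illustrative assumptions, not taken from any released code.

```python
import torch

def group_dro_loss(per_example_loss: torch.Tensor,
                   group_ids: torch.Tensor,
                   group_weights: torch.Tensor,
                   step_size: float = 0.01) -> torch.Tensor:
    """One group-DRO step: upweight the currently worst-performing groups.

    group_weights should be initialized uniform (summing to 1) and carried
    across batches by the caller; it is updated in place here.
    """
    n_groups = group_weights.numel()
    group_losses = torch.zeros(n_groups, device=per_example_loss.device)
    for g in range(n_groups):
        mask = group_ids == g
        if mask.any():
            group_losses[g] = per_example_loss[mask].mean()
    # Exponentiated-gradient ascent on the group weights: groups with higher
    # loss receive exponentially more weight, then weights are re-normalized.
    group_weights.mul_(torch.exp(step_size * group_losses.detach()))
    group_weights.div_(group_weights.sum())
    # The robust loss is the weighted sum of per-group losses; gradients flow
    # through group_losses only, since the weights were detached.
    return group_weights.detach() @ group_losses
```

In contrast, an ERM step simply returns per_example_loss.mean(), ignoring group membership.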

Creoles and Corpora
While creole languages are spoken by hundreds of millions of people, and are often a lingua franca within a larger community, only a handful of resources presently exist for creoles. Challenges to collecting data for creole languages include non-standardized orthography, e.g., in Haitian Creole (Hewavitharana et al., 2011), and the specific contexts in which a creole is used: it may not be used in official capacities for news, education, and official documents, even if it is widely used in most other aspects of life (Shah-Sanghavi, 2017). This, of course, complicates data collection. In this work, we focus on the following creoles, as they each have a diverse linguistic makeup and some existing datasets.

Nigerian Pidgin English West Africa is one of the world's most linguistically diverse regions, with Nigeria alone having over 400 languages (Ufomata, 1999). Recent work to advance African NLP has led to the creation of several datasets in Nigerian Pidgin English (Agić and Vulić, 2019; Ogueji and Ahia, 2019; Ndubuisi-Obi et al., 2019; Caron et al., 2019; Oyewusi et al., 2021; Adelani et al., 2021; Oyewusi et al., 2020), which makes it relatively well-resourced in comparison to other creole languages. Nigerian Pidgin English, also referred to simply as Nigerian Pidgin, can further be understood as a member of the larger family of West African Pidgins: many West African countries have their own variation of this creole, but all share influences from many of the same languages, such as Igbo, Hausa, and Yoruba.
The first sizeable Nigerian Pidgin dataset comes from Agić and Vulić (2019), who collected parallel text from several magazines written by a religious society, with parallel translations in many languages. This dataset was used in the first attempts to develop baselines for machine translation of Nigerian Pidgin English (Ogueji and Ahia, 2019; Ahia and Ogueji, 2020). Furthermore, Ogueji and Ahia (2019) also introduced the first corpus of Nigerian Pidgin English to facilitate machine translation from Nigerian Pidgin into English. Ndubuisi-Obi et al. (2019) introduced a code-switching corpus of news articles and online comments in both Nigerian Standard English and Nigerian Pidgin; they discuss challenges of working with Nigerian Pidgin, such as non-standardized spelling, and find that different topics prompt code-switching to Nigerian Pidgin over Nigerian Standard English. More task-specific Nigerian Pidgin datasets have also been introduced, for example for Universal Dependency parsing.

Singlish Singaporean Colloquial English, also known as Singlish, has English as a source language, but also draws parts of its grammar and vocabulary from languages such as Mandarin, Cantonese, Hakka, Hokkien, Malay, and Tamil. Presently, few publicly available datasets exist in Singlish, as this creole is primarily used for informal conversation and not for official purposes. The largest relevant corpus is the National University of Singapore SMS Corpus (Chen and Min-Yen, 2015), which consists of over 67,000 text messages written by Singaporeans. Qualitatively, we observed that this dataset is much closer to Standard English, albeit with noise from outdated SMS language, than the example provided in Fig. 2; within this data, we still observe many hallmark features of Singlish, such as discourse markers and vocabulary from the relevant languages. Tan et al. (2020) have also released a web crawler that collects posts from a popular Singaporean hardware forum, where discussion is often in Singlish; they use the resulting Singlish corpus to investigate the role of inflection for NLP with non-standard forms of English. Beyond plain-text corpora, Wang et al. (2017) introduced a dependency treebank for Singlish.

Datasets for Creole Language Models
We experiment with training language models for creoles on a mixture of creole data and additional data from languages influential to each creole.
Data splits We begin with the creole datasets noted in Tab. 1 and combine them with data from higher-resource languages that have been influential to each creole. We combine a fixed number of these examples into a MIXED-LANGUAGE dataset, as described in Tab. 2; the CREOLE-ONLY datasets, in contrast, contain only the creole data. To gauge which influential languages actually surface in the creole data, we run a language identification algorithm over the creole sentences. Consider the following examples, with the identifier's confidence scores for the influential languages:

• Singlish: "treat him makan lah" (en: 88.19%, ms: 4.34%, ta: 0.04%, zh: 0.01%)

• Nigerian Pidgin: "Pikin wey like to play wit wetin no dey common and sabi one particular subject reach ground" (en: 87.46%, pt: 0.23%, yo: 0.03%)

• Haitian Creole: "Infomation sou kestion te tranble a ak lekol" (fr: 3.50%, es: 0.08%, yo: 0.01%)

While the language identification algorithm is not perfect, its confidence scores still reflect the high-level trends in the creole examples: English and Malay ("makan") are indeed present in the Singlish sample, and English and Portuguese ("pikin", "sabi") are present in the Nigerian Pidgin example. For the Haitian Creole example, however, none of our chosen languages received high scores from the language identification algorithm, which raises the question: were there other languages with higher confidence from the language identification algorithm?
To ensure that the languages we chose are well represented in the creole examples, we examine the distribution of identified languages across examples in our CREOLE-ONLY datasets in Fig. 3. From this, we observe that restricting identification to languages specifically related to the creole (i.e., the same languages included in the MIXED-LANGUAGE datasets) is more reliable than trusting the language identifier to pick the five languages with the highest confidence: there appears to be a bias toward falsely predicting European languages, even on creole data unrelated to those languages, as well as some strange outliers, such as Tagalog being the third most commonly predicted language for Haitian Creole sentences. We also see that Haitian Creole itself was a commonly identified language, which could explain the low confidence scores for French and Spanish in the example above. Finally, among our specifically chosen languages for the creoles, we see that, although the source language (e.g., English or French) is most dominant, the other languages are still well distributed, with the exception of Yoruba. We surmise that the densely distributed, low confidence scores for Yoruba can be attributed to Yoruba being a lower-resourced language.
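The preceding discussion does not name the specific language identification tool, so, purely as an illustration, the sketch below uses fastText's publicly available lid.176 model as a stand-in identifier and reads off confidences for each creole's influential languages. The language sets mirror the examples above; all names here are assumptions, not the paper's actual pipeline.

```python
import fasttext

# Stand-in identifier: fastText's lid.176 model, downloadable from
# https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
lid_model = fasttext.load_model("lid.176.bin")

# Influential languages per creole, mirroring the examples above (assumed here;
# the paper lists its choices in Tab. 2).
CREOLE_LANGS = {
    "singlish": ["en", "ms", "ta", "zh"],
    "nigerian_pidgin": ["en", "pt", "yo"],
    "haitian_creole": ["fr", "es", "yo"],
}

def language_scores(sentence: str, creole: str) -> dict:
    """Identifier confidence for each of the creole's influential languages."""
    labels, probs = lid_model.predict(sentence, k=176)  # score all 176 languages
    scores = {label.replace("__label__", ""): float(p)
              for label, p in zip(labels, probs)}
    return {lang: scores.get(lang, 0.0) for lang in CREOLE_LANGS[creole]}

print(language_scores("treat him makan lah", "singlish"))
# e.g. {'en': 0.88, 'ms': 0.04, 'ta': 0.0004, 'zh': 0.0001} (illustrative values)
```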

Experiments
In this section, we detail our experimental setups. We make our code and models publicly available.

Training
Using the datasets described above, we conduct several experiments to assess how different training strategies affect the modeling of creoles. We conduct all experiments with both English BERT and multilingual mBERT models (Devlin et al., 2019). As our baseline, we consider the pretrained BERT-Base and mBERT models, and evaluate them on our development splits for the creoles. We then assess the effectiveness of two popular training strategies: Empirical Risk Minimization (ERM) and Distributionally Robust Optimization (DRO). In this case, ERM consists of masked language modeling over all data points in each dataset, similar to what is done during pretraining.
For DRO, we utilize the WILDS library (Koh et al., 2020), which uses metadata associated with the input data to form the groups for DRO. We investigate three grouping strategies: grouping with language information as metadata (DRO-Language), as well as two control experiments. In the first control experiment, we assign all training examples to the same group (DRO-One), such that DRO optimizes over only one large group. In the second control experiment, we randomly assign examples to one of four groups (DRO-Random). The motivation for these control experiments is to ensure that any improvements for DRO are actually grounded in the language information, and not an artifact of the WILDS grouping algorithm.
In DRO-Language, information about the examples' language makeup is used to determine the groups. In MIXED-LANGUAGE, we rely on our knowledge of where the examples were sampled from, while in CREOLE-ONLY we subdivide the creole examples depending on their etymology. Specifically, grouping is done as follows in the two data setups outlined in §4:

• MIXED-LANGUAGE: Here, grouping is done over the languages in the training data. For example, in the case of Nigerian Pidgin, if a sentence originally comes from the Yoruba corpus, it is assigned to the Yoruba group, and similarly for Nigerian Pidgin and the other languages listed in Tab. 2 for each creole.
• CREOLE-ONLY: Here, as we only have the creole samples, grouping is done over the confidence scores for the collection of influential languages (see §4). An example is assigned to one of 2^N groups, representing the combinations of detected languages in a sentence, where N is the number of languages listed in Tab. 2 (Langs) for each creole. The presence of a language is derived from its confidence score under the language identifier: if there is a confidence of 0.1% or higher that the language is represented in the sentence, it is considered present (see the sketch after this list).
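A minimal sketch of this CREOLE-ONLY grouping rule, assuming the per-language confidence scores are already available as a dictionary (the function and threshold names are illustrative):

```python
CONFIDENCE_THRESHOLD = 0.001  # the 0.1% presence rule described above

def creole_only_group(confidences: dict, langs: list) -> int:
    """Map per-language confidences to one of 2**N groups via a bitmask."""
    group = 0
    for i, lang in enumerate(langs):
        if confidences.get(lang, 0.0) >= CONFIDENCE_THRESHOLD:
            group |= 1 << i  # language i counts as "present" in this sentence
    return group

# For Singlish with langs = ["en", "ms", "ta", "zh"], the scores
# {"en": 0.88, "ms": 0.04, "ta": 0.0004, "zh": 0.0001} mark only en and ms
# as present, yielding group 0b0011 = 3 out of 2**4 = 16 possible groups.
```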

Evaluation
We perform two types of evaluation: intrinsic, based on the MLM training objective, and extrinsic, on traditional downstream NLP tasks.
Intrinsic evaluation We evaluate our language models intrinsically with the following metrics:

• Precision at k (P@k): Precision of the language model in predicting a random masked token per sentence. This allows us to assess general performance under the training objective. In the following, we report P@1; results for k = {5, 10} are in the App.

• Dictionary-based precision at k (P_D@k): Due to their nature, most of the words in a creole sentence come from the corresponding source language (see Fig. 3). Hence, for a more principled measurement of precision, we collect online dictionaries for our creoles. We perform the same MLM task as above, but this time only mask words belonging to the creole dictionaries. By doing so, we obtain a more accurate measure of what the LMs have learned. We again report results at k = 1 here, and refer the reader to the App. for k = {5, 10}.

• Mean pseudo-log-likelihood score (PLL): Following recent studies (Shin et al., 2019; Wang and Cho, 2019; Salazar et al., 2020), we measure the pseudo-log-likelihood scores from MLMs, given by summing the conditional log probabilities $\log P_{\mathrm{MLM}}(w_t \mid w_{\setminus t})$ of each token $w_t$ in a sentence $w = w_1, \ldots, w_T$. These are obtained in BERT by replacing $w_t$ with the special [MASK] token. Here, we report the mean score over the evaluation corpus:

$$\mathrm{PLL}(C) = \frac{1}{|C|} \sum_{w \in C} \sum_{t=1}^{|w|} \log P_{\mathrm{MLM}}(w_t \mid w_{\setminus t}; \theta) \quad (1)$$

where $C$ denotes the evaluation corpus, and $\theta$ denotes a model's parameters.

Table 3: Intrinsic evaluation: Precision@1 (P@1), Precision@1 for words in our creole dictionary (P_D@1), and average pseudo-log-likelihood score (PLL). We report results for MIXED-LANGUAGE (top) and CREOLE-ONLY (bottom). We note that ERM consistently outperforms the language models trained with robust objectives.
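As a reference point, here is a hedged sketch of how the per-sentence PLL in Eq. (1) can be computed with HuggingFace Transformers, masking one token at a time; the checkpoint name is a placeholder rather than the exact model used in our experiments.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Placeholder checkpoint; substitute the creole language model under evaluation.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased").eval()

@torch.no_grad()
def pseudo_log_likelihood(sentence: str) -> float:
    """Sum of log P_MLM(w_t | w_{\\t}) over all tokens, masking one at a time."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for t in range(1, len(ids) - 1):  # skip the [CLS] and [SEP] positions
        masked = ids.clone()
        masked[t] = tokenizer.mask_token_id
        logits = model(input_ids=masked.unsqueeze(0)).logits[0, t]
        total += torch.log_softmax(logits, dim=-1)[ids[t]].item()
    return total

# The corpus-level score of Eq. (1) is then the mean over all sentences.
print(pseudo_log_likelihood("Pikin wey like to play"))
```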

Extrinsic evaluation
We also perform an extrinsic evaluation of our models on downstream tasks, where datasets are available. Specifically, we train and evaluate models for Nigerian Pidgin NER and part-of-speech tagging with Universal Dependencies (Nivre et al., 2020, UPOS), as well as Singlish UPOS. We fine-tune our pretrained language models on the training sets of these tasks and evaluate them on the corresponding test sets.

Framework
We write our code in PyTorch (Paszke et al., 2019). For language model training, we rely on the HuggingFace Transformers library (Wolf et al., 2019), and on the WILDS library (Koh et al., 2020) for DRO. Models are fine-tuned for 100,000 steps with a batch size of 16. For downstream tasks, we use MaChAmp (van der Goot et al., 2021) and train our models for 10 epochs. The best checkpoints are selected based on performance on the dev sets. Unless otherwise specified, we use the default hyperparameters. Our experiments are run on one NVIDIA TitanX GPU in a shared cluster.
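For concreteness, the ERM fine-tuning setup could be reproduced roughly as follows with the Transformers Trainer; the data file, checkpoint, and masking probability are placeholder assumptions, while the 100,000 steps and batch size of 16 come from the text above.

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # placeholder
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

# Hypothetical text file with one creole sentence per line.
dataset = load_dataset("text", data_files={"train": "creole_train.txt"})["train"]
dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True),
                      remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="creole-mlm",
                           max_steps=100_000,
                           per_device_train_batch_size=16),
    train_dataset=dataset,
    # Standard BERT-style masking; the 15% rate is an assumption.
    data_collator=DataCollatorForLanguageModeling(tokenizer,
                                                  mlm_probability=0.15),
)
trainer.train()
```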

Results and Analyses
Intrinsic evaluation The main finding of the intrinsic evaluation is that ERM outperforms DRO for all grouping strategies across all metrics. We also observe that P_D@k is a more difficult task than standard precision at k with randomly masked tokens (see App. §A for full results with both BERT and mBERT). Moreover, we find that the DRO models often have much higher perplexity than ERM. Finally, the results show that, between the MIXED-LANGUAGE and CREOLE-ONLY experiments, the latter performed better, demonstrating that training on additional data was not useful for learning language models for creoles. While we only report results for BERT here, we observe the same patterns with mBERT (see App. §A).
Extrinsic evaluation Here, we observe the same trend as in the intrinsic evaluation: ERM performs better than DRO (see Tab. 4). Although DRO-Language performs better than ERM on both NER and UPOS for Nigerian Pidgin, the gap between the scores is too small to draw concrete conclusions from.
There are several factors that could have caused the DRO models to perform worse than ERM. We explore their effects below.

Over-parameterization Over-parameterization is known to be problematic for DRO. To investigate its role in our experiments, we ran additional MIXED-LANGUAGE experiments on Nigerian Pidgin English with different-sized BERT models, namely BERT-Tiny, BERT-Small (Jiao et al., 2020), and BERT-Base. The results in Tab. 5 demonstrate that over-parameterization was not a leading cause of DRO's failure; otherwise, we would expect the smaller BERT versions to perform relatively better compared to the corresponding ERM runs. Instead, we see that standard BERT works well for this task, and over-parameterization is not the cause of DRO's poor performance in our experiments.
Regularization A lack of regularization has also been discussed as a problem for DRO, with increased regularization being necessary for worst-group generalization. To investigate this potential weakness in our experiments, we ran additional experiments using BERT-Small on MIXED-LANGUAGE data for Nigerian Pidgin English, trying different weight decay values (Tab. 6). If our DRO models were suffering from insufficient regularization, we would expect increasing weight decay to boost performance. However, we find no meaningful effect of this hyperparameter, which leads us to believe that insufficient regularization is not a driving factor in the underperformance of DRO compared to ERM.
Drift and creole stability Creole languages arise from pidgins, which are initially developed for use as a second language. Recent years have seen renewed interest in the classic question of the relationship between pidgin and creole formation and second language acquisition (Plag, 2009). To investigate the matter of creole stability, we follow Ben-David et al. (2007) and calculate the proxy A-distance (PAD) between different domains of creole data (see Tab. 7). Specifically, we train an SVM on the BERT encodings to discriminate between domains. Our A-distance results suggest that creole languages do not exhibit more drift than English when the data are comparable. This potentially explains why distributionally robust language models do not outperform regular language models trained with empirical risk minimization objectives.
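For reference, the PAD computation reduces to training a domain classifier on the encodings and rescaling its test error. Below is a minimal sketch, under the assumption that sentence encodings (the text says only "BERT encodings", so pooling is left to the caller) are precomputed as NumPy arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

def proxy_a_distance(enc_a: np.ndarray, enc_b: np.ndarray) -> float:
    """PAD = 2 * (1 - 2 * err), where err is the domain classifier's test error."""
    X = np.vstack([enc_a, enc_b])
    y = np.concatenate([np.zeros(len(enc_a)), np.ones(len(enc_b))])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y)
    err = 1.0 - LinearSVC().fit(X_tr, y_tr).score(X_te, y_te)
    # PAD approaches 2 when the domains are perfectly separable (err = 0) and
    # 0 when they are indistinguishable (err = 0.5), i.e., little or no drift.
    return 2.0 * (1.0 - 2.0 * err)
```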

Conclusion
In this paper, we bring creole languages to the attention of the NLP community. We collect data and train baseline language models for three creoles, and evaluate these models on the downstream tasks of part-of-speech tagging and named entity recognition. Based on previous work suggesting the instability of creole languages (Winford, 1999; Patrick, 1999), we explore the impact of more robust learning objectives for masked language modeling of creoles, but our results show that vanilla empirical risk minimization is superior. We show that this is not the result of over-parameterization or lack of regularization, but instead suggest that it is a result of the relative stability of creole languages. We note that it remains possible that significant improvements could be achieved by modeling dynamics specific to creole languages, i.e., the processes that govern their development, including social factors (Holm, 2000) and second language acquisition dynamics (Plag, 2009).

Acknowledgments
We would like to thank the anonymous reviewers and Samson Tan