Towards Word Sense Disambiguation for Latvian

The goal of this paper is to describe the current state of word sense disambiguation for Latvian, reviewing the available data and potential problems, and to describe the exploration of word sense disambiguation methods based on BERT contextual embeddings in order to apply them to the Latvian language. Training is performed on a recently developed dataset of sense example sentences. The experiments in this paper demonstrate the feasibility of the approach by applying a mixture-of-experts approach to word sense disambiguation to the data, developing the first proof-of-concept WSD system for Latvian using state-of-the-art approaches. An evaluation of the WSD solution was performed on a selection of 18 highly ambiguous words, demonstrating reasonable performance.


Introduction
Word Sense Disambiguation (WSD) is the task of associating words in context with their possible meanings contained in a pre-defined sense inventory. The goal of this project is to develop a word sense disambiguation system for Latvian based on the recently developed Latvian WordNet dataset that includes a substantial number of sentence examples matched to specific word senses.
In computational linguistics, tasks involving semantic analysis and natural language understanding (NLU) inevitably require treatment of word meaning and its ambiguity, either as a separate component explicitly performing word sense disambiguation (WSD), or by supplying information about possible word senses and their relations as additional input to an NLP system that solves a particular task implicitly involving WSD. Examples of NLP tasks that require WSD include semantic parsing, information extraction, information retrieval, abstractive text summarization and dialogue systems. WSD tools and sense-annotated corpora built with the assistance of these tools are also of high value in lexicographic research and digital humanities.
A WSD system for Latvian has long been needed, but until now no WSD solutions have been available, because an appropriate sense inventory and data for training and evaluation became available only recently. In this paper we describe the application of current state-of-the-art research on English WSD to develop such a solution for Latvian.

Related work
For the Latvian language, there has been no published research on general-purpose automatic word sense disambiguation. The closest relevant work is the analysis of the linguistic principles of word sense disambiguation used in preparing this dataset (Lokmane et al., 2021) and earlier work on word sense disambiguation in the very restricted domain of controlled natural language using logic reasoning (Bārzdiņš et al., 2007).
For other languages, on the other hand, word sense disambiguation has been a widely researched topic. Most of the advanced research has been performed on English, with later application of these approaches to other languages. The primary restriction on applying these methods to Latvian has been the lack of an appropriate word sense inventory and of data to develop and evaluate such systems, but such data has recently been made available. However, it is plausible that adapting these methods to Latvian may require research and modifications, as differences in linguistic properties such as morphological variation and less strict word order often require changes in the NLP methods used (Bender, 2011).
The most commonly used approach in current state-of-the-art WSD systems for English relies on training many lemma-specific classifiers ("word experts"), each disambiguating the senses of one lemma. This approach was successfully used both before the widespread application of contextual word embeddings (Iacobacci et al., 2016) and, with significantly improved results, after applying the improved embeddings from BERT (Devlin et al., 2019) and related models (Hadiwinoto et al., 2019; Vial et al., 2019).
An alternative approach applies pretrained language models in a more direct manner, recasting the task as sentence pair classification, which is one of the main tasks for pretrained BERT models. This can be done by pairing the sentence context with each of the candidate glosses (Huang et al., 2019; Blevins and Zettlemoyer, 2020) or, interestingly, by concatenating all the glosses for the target word into a single 'sentence' and attempting span extraction to determine the most appropriate choice (Barba et al., 2021). For multilingual approaches, zero-shot learning from multilingual embeddings achieves competitive results.
While these approaches are technically different, they achieve similar performance on English datasets. Complex models that integrate many different types of data achieve an accuracy improvement of around 2 percentage points (Song et al., 2021), but at this early stage of research these accuracy differences are less significant than the model implementation aspects. A review of research on non-English datasets for languages linguistically similar to Latvian reveals multiple projects using earlier methods. The advancements of the last two years described above do achieve improved results, but published work does not yet evaluate the effectiveness of these state-of-the-art approaches for non-English languages, so this needs experimental validation.

Task and evaluation dataset
The system was trained and evaluated on the set of sentences used as corpus examples in the Latvian WordNet dataset, which have been manually linked to specific word senses and subsenses.
The sense inventory comes from the same dataset. It has a two-level granularity, listing senses which may then be split into subsenses. It is worth noting that the number of senses differs substantially between words: many rare words or terms have just one sense and need no disambiguation, while some words have 10-20 subsenses grouped into five or more conceptually different senses.
For evaluation we selected 18 words covering the main parts of speech (7 verbs, 5 nouns, 3 adjectives and 3 adverbs). They were chosen from among the most frequent words in the corpus, taking those that had multiple senses, were linguistically interesting, and had a sufficient number of annotated examples. 60% of the available annotated sentences were used to train the "word expert" models and the remaining 40% were used as test data for evaluation.
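As an illustration of the 60/40 division described above, the following is a minimal sketch and not the procedure actually used. It assumes that the split is made separately for each lemma after a seeded random shuffle, which the text does not specify. The function name split_per_word, the tuple layout and the placeholder lemmas and subsense identifiers are all hypothetical.

```python
import random
from collections import defaultdict


def split_per_word(examples, train_fraction=0.6, seed=0):
    """Split (lemma, sentence, subsense) tuples into train/test sets per lemma.

    Assumption: a seeded random shuffle within each lemma; the paper does
    not say how the 60/40 division was drawn.
    """
    by_lemma = defaultdict(list)
    for example in examples:
        by_lemma[example[0]].append(example)

    rng = random.Random(seed)
    train, test = {}, {}
    for lemma, items in by_lemma.items():
        items = items[:]  # copy so the caller's list is left untouched
        rng.shuffle(items)
        cut = round(len(items) * train_fraction)
        train[lemma] = items[:cut]
        test[lemma] = items[cut:]
    return train, test


# Hypothetical toy data: two placeholder lemmas with made-up subsense labels.
examples = [("lemma_a", f"sentence {i}", f"1.{i % 2 + 1}") for i in range(10)]
examples += [("lemma_b", f"sentence {i}", "1") for i in range(5)]

train, test = split_per_word(examples)
print(len(train["lemma_a"]), len(test["lemma_a"]))
```

A stratified split that keeps every subsense represented in both parts would be a reasonable alternative when some subsenses have very few examples.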

Model architecture
For initial proof-of-concept validation and testing of the suitability of the data, a transformer-based deep learning model for word sense disambiguation was developed. It is very similar to that of Hadiwinoto et al. (2019), pretraining on a large relevant corpus and fine-tuning for the classification of specific words.
The pretrained model used was a small version of the BERT architecture (6 layers, 8 attention heads, hidden size 256) trained on a combination of the Balanced Corpus of Modern Latvian (Levane-Petrova, 2019), Latvian Wikipedia and a web blog corpus, which is a reasonably diverse selection of approximately 50 million tokens. The small size of the pretrained model facilitates rapid experimentation, as the model can be fine-tuned in a minute without the use of a GPU.
A standard sequence classification architecture is used on top of this pretrained model: the output is pooled and a single linear layer performs the actual classification. Each word has a separate classification layer ("word expert"), with the number of classes determined by the number of subsenses in the dataset. The pretrained model is not updated during training. This approach was chosen as the most popular approach seen in the literature for integrating example sentence training data, and it has shown good results for other languages. The technical implementation was done in PyTorch using the Hugging Face transformers library.
Table 1 shows the results for classifying the annotated sense examples for a selection of 18 words. It is worth noting that these numbers are pessimistic: the selection focuses on highly ambiguous words with a large set of overlapping subsenses and excludes the less frequent "easy" words, and the selection of examples overrepresents rare subsenses. These numbers are therefore not directly comparable to, e.g., English WSD datasets. The results achieve a significant improvement over a naive baseline of choosing the most popular sense in the training data.
Table 1. Accuracy of the proposed WSD model on a selection of 18 words, measured separately for the fine-grained subsense annotation and for the main senses of the word.
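The "word expert" architecture described in this section (a frozen encoder, pooling, and one linear layer per word) might be sketched in PyTorch roughly as follows. This is an illustration under stated assumptions, not the authors' implementation. Mean pooling over non-padding tokens is an assumption, since the text says only that the output is pooled. A plain nn.Embedding stands in for the pretrained BERT encoder so that the example runs without downloading a model. The class name WordExpertWSD, the lemma names and the subsense counts are hypothetical.

```python
import torch
from torch import nn


class WordExpertWSD(nn.Module):
    """Frozen encoder with one linear classification head per lemma."""

    def __init__(self, encoder, hidden_size, subsense_counts):
        super().__init__()
        self.encoder = encoder
        for param in self.encoder.parameters():
            param.requires_grad = False  # the pretrained model is not updated
        # One "word expert" per lemma; output size = number of subsenses.
        self.experts = nn.ModuleDict({
            lemma: nn.Linear(hidden_size, n_subsenses)
            for lemma, n_subsenses in subsense_counts.items()
        })

    def forward(self, lemma, token_ids, attention_mask):
        with torch.no_grad():
            hidden = self.encoder(token_ids)  # (batch, seq_len, hidden)
        # Assumed pooling: mean over the non-padding positions.
        mask = attention_mask.unsqueeze(-1).to(hidden.dtype)
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.experts[lemma](pooled)  # (batch, n_subsenses)


torch.manual_seed(0)
# Stand-in encoder (assumption): maps token ids to 256-dimensional vectors.
encoder = nn.Embedding(100, 256)
model = WordExpertWSD(encoder, 256, {"lemma_a": 4, "lemma_b": 7})

token_ids = torch.randint(0, 100, (2, 5))
attention_mask = torch.ones(2, 5, dtype=torch.long)
logits = model("lemma_a", token_ids, attention_mask)
print(logits.shape)
```

With this layout, training a single word expert would only require passing model.experts[lemma].parameters() to the optimizer, which is consistent with the encoder staying fixed.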

Results and conclusions
The main result of this research is the development of the first proof-of-concept system of word sense disambiguation for Latvian, applying state-of-the-art approaches which are very recent and have not yet been widely applied to languages similar to Latvian.
An immediate observation is the high variability of the accuracy for different words. It is plausible that this may reflect the relative distance between the senses, as for some words the senses may be fundamentally different and involve separate domains, while for others the difference may be a relatively narrow semantic change and because of this hard to disambiguate both for humans and automated systems.
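The per-word figures discussed here, accuracy at the subsense level, accuracy at the main sense level, and the most-frequent-sense baseline, could be computed along the following lines. This is a minimal sketch that assumes a hypothetical identifier scheme in which "2.1" denotes subsense 1 of sense 2; the actual identifiers in the dataset may differ. It also assumes that the baseline is taken at the subsense level. All labels below are made-up toy data, not results from the paper.

```python
from collections import Counter


def main_sense(subsense_id):
    """Map a subsense id to its main sense (hypothetical "sense.subsense" ids)."""
    return subsense_id.split(".")[0]


def accuracy(gold, predicted, key=lambda label: label):
    """Fraction of predictions that match gold after applying `key` to both."""
    correct = sum(key(g) == key(p) for g, p in zip(gold, predicted))
    return correct / len(gold)


def most_frequent_baseline(train_labels, n_predictions):
    """Always predict the label seen most often in the training data."""
    label = Counter(train_labels).most_common(1)[0][0]
    return [label] * n_predictions


# Toy labels for one word (illustrative only).
train_labels = ["1.1", "1.1", "2.1", "1.2", "1.1"]
gold = ["1.1", "1.2", "2.1", "2.2"]
predicted = ["1.1", "1.1", "2.1", "2.1"]

subsense_acc = accuracy(gold, predicted)
sense_acc = accuracy(gold, predicted, key=main_sense)
baseline_acc = accuracy(gold, most_frequent_baseline(train_labels, len(gold)))
print(subsense_acc, sense_acc, baseline_acc)
```

In this toy case the errors confuse subsenses of the same main sense, so the sense-level accuracy is higher than the subsense-level accuracy.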
The effect of the number of samples and the number of subsenses on accuracy is not obvious from the observed data. Each subsense in the dataset was allocated corpus examples for the primary needs of the dictionary, so words with more senses and subsenses also have more training and test examples.
A review of errors seems to indicate that in many cases, especially for the subsense distinction, a human would need a larger context than a single sentence in order to be certain about the proper interpretation. The current system works on a sentence basis, but in principle a longer context could be supplied.
The data seems to indicate that a larger set of senses is harder to disambiguate but it is not conclusive and it is plausible that the most significant factor is how different the specific senses are from each other. This should also align with how difficult it is for human annotators to assign senses, but this would need further research work.

Future work
An obvious future extension is to replace the currently used small transformer model with a model that is larger and has been pretrained on larger corpora. LVBERT (Znotins and Barzdins, 2020) is a possible candidate, but it is trained on a large news corpus, which raises concerns about the omission of large classes of word senses such as those of colloquial language. It may be that a new transformer model would need to be trained on a more diverse corpus, such as the recently updated Latvian National Corpora Collection (Saulīte et al., 2022), and that experiments would be needed with the various updated transformer approaches that improve on the original BERT structure. We also received notice of a combined Lithuanian-Latvian pretrained model (Ulčar et al., 2021) that has the potential for improved results.
Another direction of future work is to prepare a representative evaluation corpus by annotating all word senses for a balanced set of running text. This data is required for proper evaluation to have a realistic frequency distribution, as rare examples are overrepresented in dictionary data.
Literature on the English state of the art suggests that substantial improvements can be achieved by integrating knowledge from the WordNet graph (Kumar et al., 2019; Bevilacqua and Navigli, 2020). This relies on a wide-coverage sense graph, which is currently not available for Latvian, but there is potential that such a high-coverage graph could be developed soon through automated transfer of semantic links from the Princeton WordNet.
There seems to be potential to improve accuracy by directly applying more lexical data. There is work on using the "supersenses" from the WordNet hypernym ontology (Levine et al., 2019) and on integrating synset gloss embeddings (Huang et al., 2019) as an additional data source for classification, so integrating gloss data would be a reasonable next step for extending the model. It is worth noting that these papers fully replace the supervised sentence examples, but combining the approaches could also yield useful results.
One future application of the developed system would be automatic disambiguation of Latvian corpora to enable corpus search for specific word senses, not only lemmas.