Multilingual Transformers for Named Entity Recognition

Different methods for automatic named entity recognition (NER) have been researched for many years. Today, the most common technique for training named entity recognition models is fine-tuning of large pre-trained language models. In this paper, we investigate the performance of various multilingual NER models in the state-of-the-art natural language processing framework Flair and compare them against the multilingual NER solution of the MAPA anonymization toolkit and the multilingual BERT model fine-tuned for NER. We demonstrate that in multilingual settings the best results can be achieved with a fine-tuned XLM-R model, while in the case of Latvian (monolingual settings), the more targeted LitLat BERT model leads to the best results.


Introduction
Automatic methods for named entity recognition (NER) have been researched for many years, mostly in the context of information extraction, data mining, and entity linking.
NER is an important constituent of different natural language understanding tasks, such as semantic annotation, question answering, ontology population, and opinion mining (Marrero et al., 2013). A rather new direction that has recently raised interest in NER is text de-identification and anonymization. The need for text data anonymization (or at least de-identification) has become essential with the changes in legislation on the protection of natural persons with regard to the processing of personal data3 and the opening of public service data.
The goal of the anonymization task is to anonymize text (see Fig. 1), making it impossible to identify the natural person mentioned in the text directly or indirectly. Text anonymization is an extremely difficult task, because even seemingly unrelated facts may allow an informed attacker to identify involved persons (Gianola et al., 2020; Lison et al., 2021).
De-identification is an easier task, aiming to delete specific identifiers (e.g., person names, place names, etc.) which could allow identification of an individual (Chevrier et al., 2019), or to replace such identifiers with similar ones, thus producing a readable text which does not reveal the identity of the original entity. The traditional automatic text de-identification approach uses NER to identify named entities and other identifiers and then hides the identified entities using either data masking or pseudonymization techniques. Usually, the performance of a text de-identification system is evaluated by comparing the output of the de-identification system to a human-annotated gold standard and calculating precision, recall, and F1-measure (Pilán et al., 2022). The state-of-the-art technique for named entity identification and recognition is fine-tuning a large pre-trained language model for the NER task. This approach became popular with the introduction of BERT models (Devlin et al., 2019a) in 2018. Since then, different transformer models have been introduced (e.g., XLM-R (Conneau et al., 2020), ALBERT (Lan et al., 2020), BART (Lewis et al., 2020) and others).
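The masking step of this traditional pipeline can be illustrated with a minimal sketch. The span format `(start, end, label)` and the example spans below are illustrative assumptions, not the output of any particular NER toolkit:

```python
# Minimal sketch of the data-masking step: character spans returned by
# a NER model (format assumed here for illustration) are replaced with
# label placeholders, producing a de-identified but still readable text.

def mask_entities(text, spans):
    """Replace each (start, end, label) span with a [LABEL] placeholder.

    Spans are applied right-to-left so that earlier character
    offsets remain valid after each replacement.
    """
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text

text = "John Smith visited Riga in 2021."
spans = [(0, 10, "PER"), (19, 23, "LOC")]
print(mask_entities(text, spans))  # → "[PER] visited [LOC] in 2021."
```

Pseudonymization would differ only in the replacement: instead of a placeholder, a surrogate entity of the same type would be substituted.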
In this work, we explore the use of multilingual Transformer models for the NER task in multilingual settings, using the Flair (Akbik et al., 2018) and MAPA libraries. A NER model using fine-tuned mBERT pre-trained by Devlin et al. (2019a) is also included as a baseline. As we are primarily interested in Baltic languages, we also include LitLat BERT (Ulčar and Robnik-Šikonja, 2020) in our evaluation. While our overall goal is to analyse the applicability of different NER solutions in the de-identification process, in this paper we mainly focus on the performance of multilingual Transformer models in the NER task, treating all named entity types equally.

Related Work
The CLARIN Resource Families4 list of tools for named entity recognition5 includes 26 tools that support named entity recognition: 17 tools support a single language (4 Dutch, 2 English, 1 Finnish, 2 German, 1 Icelandic, 1 Greek, 1 Hungarian, 1 Latvian, 3 Polish, 1 Portuguese), while the rest have a very broad multilingual scope; however, none of the tools supports all European languages, including Baltic languages.
The use of transfer learning for building multilingual NER has been explored by Rahimi et al. (2019), using multilingual fastText (Bojanowski et al., 2017) embeddings and the WikiAnn NER corpus (Pan et al., 2017) for training. As the sequential tagger, a BiLSTM-CRF (Fang and Cohn, 2016) is used. Tedeschi et al. (2021) use Wikipedia as a source for multilingual annotations and proceed to build a multilingual NER model using multilingual BERT embeddings. The dataset produced in this manner contains the CoNLL entity types (PER, ORG, LOC, and MISC). As the sequential tagger, an architecture based on a BiLSTM with a CRF layer (Lafferty et al., 2001) is used.
Baumann (2019) evaluated multilingual language models used to train NER models in English and German on the CoNLL dataset. For English, the NER model using the native BERT model outperforms the one using mBERT. For German, the NER model using mBERT embeddings outperformed most other NER models except the one using Flair (Akbik et al., 2018) embeddings, showing the importance of the embedding layer.
Besides dedicated NER tools and natural language processing pipelines that support NER, several de-identification toolkits, such as Presidio or MAPA, include solutions for the NER task. The MAPA toolkit contains a built-in NER component, while Presidio uses either Flair (Akbik et al., 2019a), spaCy (Honnibal and Montani, 2017), or other NER models.
The Multilingual Anonymisation Toolkit for Public Administrations (MAPA) project6 aims to leverage natural language processing tools for text de-identification. It relies on named entity recognition and classification techniques that detect relevant entities and decide which ones must be masked. MAPA entity detection models use multilingual BERT to convert text into embeddings, which are fed into a two-level classifier that detects entities according to the MAPA two-level scheme. For comparison purposes with other NER models, in this work we evaluate only Level 1 annotations.
Flair7 is a state-of-the-art natural language processing (NLP) framework designed to facilitate the training and distribution of sequence labeling, text classification, and language models (Akbik et al., 2019a). Flair includes text embedding libraries, NLP models, and helper classes intended for data handling. The text embedding libraries included in the Flair framework allow the use and combination of different word and document embeddings, including ELMo (Peters et al., 2018), Flair (Akbik et al., 2018), and the Transformer-based BERT (Devlin et al., 2019b) and XLM-R (Conneau et al., 2020) embeddings. The NLP models are exposed to the user through standardized training and hyperparameter selection routines, allowing researchers to mix various models and embeddings (Akbik et al., 2019b). The Flair framework allows the use of document-level features with transformer-based models, achieving state-of-the-art NER results (Schweter and Akbik, 2020).

Datasets, Annotations and Formats
To test the performance of transformers on the NER task, two multilingual datasets (BSNLP 2021 and EUR-LEX) and two monolingual datasets (one Latvian and one English) were used. The size and languages each dataset represents differ; Table 1 summarizes the number of entities in each dataset used for analysis. The formats of these datasets include CoNLL-2003, CoNLL-U, WebAnno TSV, JSON, JSON Lines, and the BSNLP format. In order to train the baseline BERT and Flair models, all datasets were converted into a CoNLL-like format: one token and its named entity label per line, with empty lines representing sentence and document boundaries. To train the MAPA named entity recognizer, the datasets were converted into the JSON Lines format.
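The target CoNLL-like format described above can be sketched as follows. The input structure (a list of sentences, each a list of `(token, label)` pairs) is an assumption for illustration; the actual converters operate on the source formats listed above:

```python
# Sketch of the CoNLL-like target format used for training: one token
# and its label per line, with an empty line marking a sentence
# (or document) boundary.

def to_conll(sentences):
    """Serialize [[(token, label), ...], ...] into CoNLL-like text."""
    lines = []
    for sentence in sentences:
        for token, label in sentence:
            lines.append(f"{token} {label}")
        lines.append("")  # empty line = sentence boundary
    return "\n".join(lines)

sentences = [
    [("John", "B-PER"), ("visited", "O"), ("Riga", "B-LOC"), (".", "O")],
]
print(to_conll(sentences))
```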
The datasets also differ with respect to the labeled entity types (see Table 2). Even in cases when the same named entity types are annotated, domains and annotation guidelines may differ. In the next subsections, we summarize information about each dataset.

The English Conll 2003 Named Entity Dataset
The English CoNLL-2003 named entity dataset (Tjong Kim Sang and De Meulder, 2003) consists of training, development, and test files and a file with unannotated data. The annotated data are tokenized, part-of-speech tagged, and chunked. Named entity annotation has been performed manually, mostly following the MUC conventions (Chinchor and Robinson, 1998), using 4 named entity categories: location (LOC), person (PER), organization (ORG), and MISC (containing all names not included in the other categories). Data is

Multilingual BSNLP 2021 Dataset
The BSNLP dataset8 is an annotated corpus in 6 languages: Bulgarian, Czech, Polish, Russian, Slovenian, and Ukrainian, created for the shared task of the 8th BSNLP: Balto-Slavic Natural Language Processing Workshop (Piskorski et al., 2021). The dataset contains five types of named entity mentions: person (PER), organization (ORG), location (LOC), event (EVT), and product (PRO). Each named entity mention has a unique identifier assigned to link the same entity across different languages and documents, and a lemma or canonical form of the multi-word expression, if the given named entity is a multi-word entity. Documents in this corpus are represented as a text file with metadata and a corresponding annotation file. The annotation file contains the identifier of the text file and a list of tab-separated fields: mention, its base form, named entity type, and named entity ID. In order to train and test NER systems, we converted the data into a CoNLL-2003-like data format, keeping only the word and named entity label in each line.
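The core of such a conversion is aligning the mention strings from the annotation file with the tokenized text to produce B-/I- labels. A simplified sketch (whitespace tokenization and exact string matching are simplifying assumptions; real data requires proper tokenization and base-form matching):

```python
# Simplified sketch of mention-to-token alignment: BSNLP-style
# annotations list entity mentions as strings, which must be projected
# onto the token sequence as CoNLL-style BIO labels.

def bio_labels(tokens, mentions):
    """mentions: list of (mention_string, entity_type) pairs."""
    labels = ["O"] * len(tokens)
    for mention, etype in mentions:
        m_tokens = mention.split()
        # Find every occurrence of the mention's token sequence.
        for i in range(len(tokens) - len(m_tokens) + 1):
            if tokens[i:i + len(m_tokens)] == m_tokens:
                labels[i] = f"B-{etype}"
                for j in range(i + 1, i + len(m_tokens)):
                    labels[j] = f"I-{etype}"
    return labels

tokens = "Angela Merkel met leaders in Berlin".split()
print(bio_labels(tokens, [("Angela Merkel", "PER"), ("Berlin", "LOC")]))
# → ['B-PER', 'I-PER', 'O', 'O', 'O', 'B-LOC']
```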

Multilingual MAPA Dataset
The MAPA de-identification package contains 12 annotated documents from the Court of Justice of the European Union. The documents are translated into 24 European languages, annotated according to the MAPA annotation schema, and available at ELRC-SHARE as 24 separate datasets9. The MAPA annotation schema has 2 levels of entities. The top level contains 6 entity classes: Person, Time, Location, Organization, Amount, and Vehicle. The second annotation level consists of attributes or subdivisions of top-level entities, e.g., the entity Person may have second-level annotations "title", "initial name", "marital status", and so on. The second level contains 57 entity classes; however, most of them are not used in the MAPA EUR-LEX dataset. The data files are provided in WebAnno TSV 3.2 format (Eckart de Castilho et al., 2016). To train MAPA toolkit models, a conversion to JSON format is performed using the conversion script included in the MAPA project. The 12 documents for each language were split into 8 training, 2 development, and 2 test documents, keeping translated versions of the same document together.
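The key property of this split is that it is decided once on document identifiers, so that all 24 translations of a document land in the same partition and no test document leaks into training through a translation. A sketch (the identifiers are illustrative):

```python
# Sketch of a translation-safe split: the 8/2/2 partition is made over
# document identifiers, not individual files, so every language's copy
# of a document ends up in the same partition.

def split_documents(doc_ids, n_train=8, n_dev=2):
    """Partition document ids into train/dev/test id sets."""
    train = set(doc_ids[:n_train])
    dev = set(doc_ids[n_train:n_train + n_dev])
    test = set(doc_ids[n_train + n_dev:])
    return train, dev, test

doc_ids = [f"doc{i:02d}" for i in range(12)]  # illustrative ids
train, dev, test = split_documents(doc_ids)
# A file in any language is then routed by its document id:
# e.g. every translation of "doc09" goes to the dev set.
```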

Latvian FullStack Corpus
The FullStack-LV corpus10 (Grūzītis et al., 2019) contains hierarchical named entity annotations in Latvian. In cases when outer-layer annotations contain nested named entities, the nested entities are annotated as inner-layer entities. The FullStack-LV corpus contains 3947 paragraphs of text with 9697 outer entities and 944 inner entities (Gruzitis et al., 2018).
In contrast with the MAPA dataset, where second-level entities are parts of the first-level entity, in the FullStack-LV dataset inner entities are independent of outer entities.
Because most of the tools tested are intended to detect single-level named entities, we discard the inner entity annotations. The dataset has been labeled using 9 entity types: Geopolitical, Person, Time, Location, Product, Organization, Money, Event, and Entity (for all entities which do not belong to any other class). The data is provided in a version of the CoNLL-2003 data format, one document per file. The data was converted to a simpler version of the CoNLL-2003-like format, keeping only the word and entity label in each line and joining all files of the training, development, and test sets into the corresponding files.

Experimental Setup
To evaluate different NER models, we train a NER model using a given framework, language model, and dataset. All NER models were trained for 20 epochs or until early stopping conditions were met. The training parameters (the number of epochs, batch size, and learning rate) were selected as recommended in the documentation. Each experiment was run 3 times and the results were averaged. For a baseline, we fine-tune the multilingual BERT model in the same way as in the original BERT paper (Devlin et al., 2019a): by adding a single output classification layer and jointly fine-tuning all parameters on the NER task. Fine-tuning is performed on the BERT-base Multilingual Cased model11 using batch size 16 and learning rate 2e-5 for 20 epochs.
Within the Flair framework, we explore multiple approaches for NER model training: multilingual BERT (mBERT) and multilingual XLM-RoBERTa (XLM-R). We also test LitLat BERT (Ulčar and Robnik-Šikonja, 2020), which has been pre-trained on Latvian, Lithuanian, and English data. The "Flair LitLat", "Flair mBERT", and "Flair XLM-R" models were trained using the Flair sequence labeling approach (Akbik et al., 2018) with the LitLat, multilingual BERT, or XLM-R language models. The "Flert mBERT" and "Flert XLM-R" models were trained using Flair's "Flert" (Schweter and Akbik, 2020) approach with the multilingual BERT or XLM-R language models. The Flert approach uses a transformer model fine-tuned using the target sentence and document-level features (64 subtokens of left and right context).
The MAPA named entity hierarchy is hard-coded to detect 6+59 named entities by default; therefore, the MAPA named entity recognizer was adapted to each dataset by editing the appropriate labeling hierarchy. Training was performed using the default settings specified in the MAPA documentation.

Results
The overall results of the evaluation are summarised in Table 3. For comparison, we also include the CoNLL-2003 named entity recognition result (BERT-base) from the original BERT paper (Devlin et al., 2019a). For both multilingual datasets, the best results are achieved by the Flert XLM-R model. For Latvian, LitLat BERT is the best in this evaluation, but it still performs worse than the monolingual LVBERT (Znotiņš and Barzdins, 2020), and none of the multilingual models exceeds the F1 score of the English monolingual BERT.
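The F1 scores compared here are, as is standard for NER, computed at the entity level in the style of the CoNLL-2003 evaluation: a prediction counts as correct only if both the span and the type match exactly. A minimal sketch over well-formed BIO label sequences (this is an illustration of the metric, not the exact evaluation script used):

```python
# Entity-level micro F1 over BIO sequences, CoNLL-style: credit is
# given only for exact (span, type) matches, never for partial overlap.

def spans(labels):
    """Extract (start, end, type) entity spans from a BIO sequence."""
    result, start, etype = [], None, None
    for i, label in enumerate(labels + ["O"]):  # sentinel closes last span
        if start is not None and (label == "O" or label.startswith("B-")
                                  or label[2:] != etype):
            result.append((start, i, etype))
            start = None
        if label.startswith("B-"):
            start, etype = i, label[2:]
    return result

def entity_f1(gold, pred):
    g, p = set(spans(gold)), set(spans(pred))
    tp = len(g & p)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "B-ORG"]
print(entity_f1(gold, pred))  # → 0.5: one of two entities matched exactly
```

The exact-match criterion is why the boundary errors discussed below (e.g. a missed token at the edge of an ORG or LOC mention) cost a full entity in both precision and recall.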

Named Entity Recognition of Baltic Languages
In monolingual settings, when NER is trained only on Latvian language data from the FullStack-LV dataset, the best result (81.97 F1 points) is achieved with LitLat BERT (pre-trained on Latvian, Lithuanian, and English data) as the language model. We also compare the fine-tuned Latvian BERT model (Viksna and Skadina, 2020) results (81.91 F1 points) to the Flair NER model using the LitLat BERT language model. The LitLat Flair NER model's F1 score of 81.97 is close to the state-of-the-art monolingual LVBERT (Znotiņš and Barzdins, 2020) result (82.6 F1 points) on the FullStack-LV dataset. However, LitLat BERT performs poorly for languages on which it has not been pre-trained.
In the case of the multilingual EUR-LEX dataset, the XLM-R model demonstrates the best result for Latvian, reaching an F-score of 75, while the mBERT model's results are the best for Lithuanian, reaching an F-score of 80 (see Table 4).
The fact that the results achieved for Latvian are in many cases lower than for other languages could be explained by the morphological richness of Latvian, its share in the training data of large multilingual language models, as well as the size of the training data.

Named Entity Recognition of Slavic Languages
For Slavic languages, we tested the Flert mBERT model, which demonstrated results (96.56 F-score) similar to the Flert XLM-R model. Although Slavic languages are morphologically rich and thus complicated to process, all models demonstrate better performance for Slavic languages than for English. This could be explained by the fact that the BSNLP data belongs to a specific domain: news about a few selected themes (Brexit, Nord Stream, Ryanair, and others).

Most errors for the Event (EVT) class happen when an event entity contains an entity of another class, e.g., "Meeting Trump-Putin in Helsinki" (here both Person (PER) and Location (LOC) are present), or when an Event entity is not detected at all, e.g., "vietnam war", "All-Star" or "referendum of 2016". Entities of the LOC type are most often misclassified as non-entities, e.g., "punjabi", "statue of Jesus Christ", or have a wrong boundary, as in "airport Charleroi" and "airport Brussel", where "airport" is not detected as part of the LOC entity. Similarly, most of the errors for the Organisation (ORG) class are boundary errors with LOC entities, e.g., "Pakistan government" or "The Group of European Conservatives and Reformists". 47 instances of organizations are misclassified as Products (PRO), e.g., "Facebook", "YouGov", "Funke". The Person class is misclassified as a non-entity in cases where it refers to groups of people ("Republican senators") or ethnicity ("pakistanis [wife]"). Product entities are mostly misclassified as non-entities, and also as organisations ("The Economist", "Bloomberg").

Transformer models in Multilingual Settings -Case of EUR-LEX
In the case of EUR-LEX, the Flert XLM-R model shows the best performance among the multilingual models used. The NER model based on the trilingual LitLat, trained using EUR-LEX data, performs unexpectedly well, considering that most of the training data are in languages it has never seen before. Finally, since the MAPA EUR-LEX dataset is small and the test set consists of around 200 entity mentions per language, even a few misclassified entities lead to large F1 score differences.

Conclusions
In this paper, we examined the NER performance of several toolkits which use multilingual Transformer language models. Our experiments show that multilingual Transformer models achieve NER performance comparable to monolingual Transformer models trained and fine-tuned on target language data. However, if a language model (such as LitLat) is fine-tuned for a language that is not represented in its pre-training data, the resulting model demonstrates poor performance. In our experiments, the NER model using fine-tuned XLM-R (the "Flert XLM-R") shows the best results for the multilingual NER task. Such multilingual NER could be used for named entity identification in multilingual de-identification tools, allowing the use of a single model instead of separate models for each language. However, the analysis of the applicability of particular NER solutions in a de-identification pipeline, taking into account the importance of different entity types in the de-identification process, is the next task of our research.