Multilingual Multiword Expression Identification Using Lateral Inhibition and Domain Adaptation

Abstract: Correctly identifying multiword expressions (MWEs) is an important task for most natural language processing systems, since their misidentification can result in ambiguity and misunderstanding of the underlying text. In this work, we evaluate the performance of the mBERT model for MWE identification in a multilingual context by training it on all 14 languages available in version 1.2 of the PARSEME corpus. We also incorporate lateral inhibition and language adversarial training into our methodology to create language-independent embeddings and improve its capabilities in identifying multiword expressions. The evaluation of our models shows that the approach employed in this work achieves better results than the best system of the PARSEME 1.2 competition, MTLB-STRUCT, on 11 out of 14 languages for global MWE identification and on 12 out of 14 languages for unseen MWE identification. Additionally, averaged across all languages, our best approach outperforms the MTLB-STRUCT system by 1.23% on global MWE identification and by 4.73% on unseen MWE identification.


Introduction
Natural language processing (NLP) is a significant domain of artificial intelligence, with applications ranging from language translation to text classification and information retrieval. NLP allows computers to interpret and process human language, enabling them to perform tasks such as understanding and responding to questions, summarizing texts, and detecting sentiments. Some phenomena present in language can preclude its correct understanding by machines (and sometimes even by humans). One such phenomenon is the multiword expression (MWE): a group of words that functions as a unit and conveys a specific meaning that is not the sum of the meanings of the component words (i.e., the expression lacks compositionality). Examples of MWEs include idioms (e.g., "break a leg", used to wish someone good luck), collocations (e.g., "take an exam"), and compounds (e.g., "ice cream"), with different authors adopting a broader or narrower definition of the term. The number of MWEs in a language is relatively high. The authors of [1] synthesized papers reporting the number or proportion of MWEs in different languages: English, with an almost equal number of MWEs and single words; French, with 3.3 times more MWE adverbs than single adverbs and 1.7 times more MWE verbs than single verbs; and Japanese, in which 44% of the verbs are MWEs. Despite being so numerous in the dictionary, MWEs occur in corpora with low frequency [2].
Identifying and processing MWEs is crucial for various NLP tasks [3]. In machine translation, for instance, the correct translation of an MWE often depends on the specific context in which it appears. If an MWE is translated literally rather than appropriately localized for the target language, the resulting translation may be difficult for native speakers to understand or may convey a wrong meaning [4]. In text classification tasks, MWEs are considered essential clues regarding the sentiment or topic of a text [5]. Additionally, MWEs can help disambiguate the meaning of a query, improving the accuracy of search engines in information retrieval [6].
Developing systems that can handle multiple languages is another important NLP area. In particular, the ability to accurately process and analyze text in various languages is becoming increasingly important as the world becomes more globalized and interconnected. For example, multilingual NLP systems can improve machine translation, allowing computers to translate text from one language to another accurately. This can be particularly useful in situations where there is a need to communicate with speakers of different languages, such as in global business or international relations. In addition to its practical applications, multilingual NLP is an important area of study from a theoretical perspective. Research in this field can help shed light on the underlying principles of language processing and how these principles differ across languages [14,15].
Multilingual Transformer models have become a popular choice for multilingual NLP tasks due to their ability to handle multiple languages and achieve strong performance on a wide range of tasks. Based on the Transformer architecture [16], these models are pretrained on large amounts of multilingual data and can be fine-tuned for specific NLP tasks, such as language translation or text classification. Some models that have become influential in this area include the multilingual bidirectional encoder from transformers (mBERT) [17], cross-lingual language model (XLM) [18], XLM-RoBERTa (XLM-R) [19], and multilingual bidirectional auto-regressive transformers (mBART) [20]. One of the essential benefits of multilingual Transformer models is their ability to transfer knowledge between languages. These models can learn common representations of different languages, allowing them to perform well on tasks in languages on which they have not been specifically trained. Thus, multilingual Transformer models are a good choice for NLP tasks that involve multiple languages, such as machine translation or cross-lingual information retrieval [21].
In this work, we leverage the knowledge developed in the two research areas (i.e., MWEs and multilingual NLP) to improve the results obtained at the PARSEME 1.2 shared task [13]. We explore the benefits of combining them in a single system by jointly fine-tuning the mBERT model on all languages simultaneously and evaluating it on each language separately. In addition, we try to improve the performance of the overall system by employing two mechanisms: (i) the newly introduced lateral inhibition layer [22] on top of the language model and (ii) adversarial training [23] between languages. Other researchers have experimented with the latter algorithm and have shown that it can provide better results in the right setting [24]; however, we are the first to experiment with and show the advantages of lateral inhibition in multilingual adversarial training.
Our results demonstrate that by employing lateral inhibition and multilingual adversarial training, we improve the results obtained by MTLB-STRUCT [25], the best system in edition 1.2 of the PARSEME competition, on 11 out of 14 languages for global MWE identification and 12 out of 14 languages for unseen MWE identification. Furthermore, averaged across all languages, our highest-performing methodology achieves F1-scores of 71.37% and 43.26% for global and unseen MWE identification, respectively. Thus, we obtain an improvement of 1.23% for the former category and a gain of 4.73% for the latter category compared to the MTLB-STRUCT system.
The rest of the paper is structured as follows. Section 2 summarizes the contributions of the PARSEME 1.2 competition and the main multilingual Transformer models. The following section, Section 3, outlines the methodology employed in this work, including data representation, lateral inhibition, adversarial training, and how they were employed in our system. Section 4 describes the setup (i.e., dataset and training parameters) used to evaluate our models. Section 5 presents the results, and Section 6 details our interpretation of their significance. Finally, our work is concluded in Section 7 with potential future research directions.

Multilingual Transformers
This subsection presents the three most influential multilingual language models (MLLMs): mBERT, XLM, and XLM-R. The mBERT model, similar to the original BERT model [17], is a Transformer model [16] with 12 hidden layers. However, while BERT was trained solely on monolingual English data with an English-specific vocabulary, mBERT is trained on the Wikipedia pages of 104 languages and uses a shared word-piece vocabulary. mBERT has no explicit markers indicating the input language and no mechanism specifically designed to encourage translation-equivalent pairs to have similar representations within the model. Although simple in its architecture, mBERT generalizes across languages surprisingly well thanks to its multilingual representations, even though it was never explicitly trained for cross-lingual generalization. The central hypothesis is that using word pieces common to all languages, which must be mapped to a shared space, may lead to other co-occurring word pieces being mapped to this shared space [26].
XLM resulted from various investigations made by the authors into cross-lingual pretraining. They introduced the translation language modeling (TLM) objective, which extends the masked language modeling (MLM) objective to pairs of parallel sentences. The reasoning behind this objective is straightforward: if the model needs to predict a masked word within a sentence in a given language, it can consider both that sentence and its translation into a different language. Thus, the model is motivated to align the representations of both languages in a shared space. Using this approach, XLM obtained state-of-the-art (SOTA) results on supervised and unsupervised machine translation using the WMT'16 German-English and WMT'16 Romanian-English datasets [27], respectively. In addition, the model also obtained SOTA results on the Cross-lingual Natural Language Inference (XNLI) corpus [28].
In contrast to XLM, XLM-R does not use the TLM objective; instead, it trains RoBERTa [29] on a large multilingual dataset extracted from CommonCrawl (http://commoncrawl.org/ last accessed on 21 April 2023), covering 100 languages and totaling 2.5 TB of text. It is trained using only the MLM objective, similar to RoBERTa, the main difference between the two being the vocabulary size, with XLM-R using 250,000 tokens compared to RoBERTa's 50,000 tokens. Therefore, XLM-R is significantly larger, with 550 million parameters, compared to RoBERTa's 355 million parameters. The main distinction between XLM and XLM-R is that XLM-R is fully self-supervised, whereas XLM requires parallel examples that may be difficult to obtain in large quantities. In addition, this work demonstrated for the first time that it is possible to develop multilingual models that do not compromise performance in individual languages: XLM-R obtained results similar to those of monolingual models on the GLUE [30] and XNLI benchmarks.

PARSEME 1.2 Competition
We present the results obtained by the systems participating in edition 1.2 of the PARSEME shared task [13] on discovering VMWEs that were not present (i.e., were not seen) in the training corpus. We will not focus on the previous editions of this shared task for two reasons: (i) the corpora were different, on the one hand, concerning the distribution of seen and unseen VMWEs in the train/dev/test sets, and, on the other hand, smaller for some languages; and (ii) the focus in the last edition, unlike the first two, was on the systems' ability to identify VMWEs unseen in the train and dev corpora, exploring alternative ways of discovering them. Thus, in a supervised machine learning approach, the systems were expected to learn some characteristics of seen VMWEs and, based on those, find others in the test dataset.
The competing systems used recurrent neural networks [25,31-33], but also exploited the syntactic annotation of the corpus [34,35] or association measures [34,35]. The shared task was organized in two tracks, closed and open. The former allowed only the use of the train and dev sets provided by the organizers, as well as of the raw corpora provided for each language, with sizes between 12 and 2474 million tokens. The latter track allowed the use of any existing resource for training the system; examples of such resources include VMWE lexicons in the target language or another language (exploited through their translation into the target language) and language models (monolingual or multilingual BERT [25,33], XLM-RoBERTa [32]). Only two systems participated in the closed track, while seven participated in the open one.
The best-performing system in the open track is MTLB-STRUCT [25]. It is a neural language model relying on pre-trained multilingual BERT and learning both MWEs and syntactic dependency parsing, using a tree CRF network [36]. The authors explain that the joint training of the tree CRF and a Transformer-based MWE detection system improves the results for many languages.
The second and third places in the same track are occupied by the model called TRAVIS [33], which came in two variants: TRAVISmulti (ranked second), which employs multilingual contextual embeddings, and TRAVISmono (ranked third), which employs monolingual ones. These systems rely solely on embeddings; no other features are used. The author claims that monolingual contextual embeddings generalize much better than multilingual ones, especially concerning unseen MWEs.

Methodology
In this work, we perform two kinds of experiments: (i) training a model using only the data for a specific language (referred to as monolingual training) and (ii) pooling the corpora of multiple languages, training the multilingual model on the combined data, and then evaluating the trained model on the test set of each language (referred to as multilingual training). For the latter, we also perform additional experiments to improve the results by employing the lateral inhibition and adversarial training mechanisms, as depicted in Figure 1.

Data Representation
BERT has significantly impacted the field of NLP and has achieved SOTA performance on various tasks. Its success can be attributed to the training process, which involves learning from large amounts of textual data using a Transformer model and then fine-tuning it on a smaller amount of task-specific data. The masked language modeling objective used during pre-training allows the model to learn effective sentence representations, which can be fine-tuned for improved performance on downstream tasks with minimal task-specific training data. The success of BERT has led to the creation of language-specific versions of the model for various languages, such as CamemBERT (French) [37], AfriBERT (Afrikaans) [38], FinBERT (Finnish) [39], and RoBERT (Romanian) [40].

Figure 1. The architecture of our approach: the mBERT feature extractor F (green) generates the token embeddings; the MWE label classifier C (blue) incorporates the lateral inhibition mechanism in its architecture; and the language classifier LD (orange) backpropagates a gradient that is reversed and scaled by λ before it is fed into the feature extractor.
The scarceness of data and resources has resulted in recent advances in NLP being limited to English and a few high-resource languages rather than being more widely applicable across languages. To address this issue, MLLMs have been developed and trained using large amounts of unlabeled textual data collected from multiple languages. These models are designed to benefit lower resource languages by leveraging their shared vocabulary, genetic relatedness, or contact relatedness with higher resource languages [41,42]. Many different MLLMs are available, which vary in terms of their architecture, training objective, data used for pre-training, and the number of languages covered. However, in our experiments, we employ only the mBERT model because it allows us to provide a cleaner comparison with the monolingual BERT models and thus emphasizes the strengths of our approach.

Lateral Inhibition
The biological process of lateral inhibition represents the capacity of excited neurons to reduce the activity of their neighbors [43]. In the visual cortex, this process is associated with increased perception in challenging environments, such as low-lighting conditions. Previously, we proposed implementing the lateral inhibition mechanism in artificial neural networks (ANNs) to improve the named entity recognition task [22,44]. The intuition behind introducing this mechanism is that it reduces the noise associated with word representations in some instances, such as less frequent words or contexts.
The implementation uses an additional ANN layer that filters the values of a neuron from a previous layer (the word embedding representation) based on the values of the other adjacent neurons in that layer. Equation (1) describes the new layer's forward pass:

LI(X) = X × Diag(Θ(X × ZeroDiag(W) + B))    (1)

Here, X is the layer's input vector (a token embedding representation), Diag is a matrix with the diagonal set to the vector given as a parameter, ZeroDiag produces a matrix with the value zero on the main diagonal, and W and B represent the weights and bias. Θ is the Heaviside function, described in Equation (2):

Θ(x) = 1 if x > 0, and 0 otherwise    (2)

The derivative of the Heaviside function in the backward pass is approximated with the derivative of the sigmoid function σ using a scaling parameter k [45] (see Equation (3)), a method known as surrogate gradient learning [46]:

dΘ(x)/dx ≈ k · σ(kx) · (1 − σ(kx))    (3)
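To make the layer concrete, here is a minimal NumPy sketch of Equations (1)-(3). The function names and the exact gating form are our illustrative reading of the description above (gate each embedding component by a step function of a weighted sum of the other components), not the authors' reference implementation:

```python
import numpy as np

def heaviside(x):
    # Equation (2): step activation, 1 where x > 0, else 0
    return (x > 0).astype(float)

def surrogate_grad(x, k=10.0):
    # Equation (3): backward-pass surrogate, the derivative of sigmoid(k*x)
    s = 1.0 / (1.0 + np.exp(-k * x))
    return k * s * (1.0 - s)

def lateral_inhibition(X, W, B):
    # Equation (1): each component of X is gated by a step function of a
    # weighted sum of the *other* components (diagonal of W zeroed out)
    ZW = W - np.diag(np.diag(W))   # ZeroDiag(W)
    gate = heaviside(X @ ZW + B)   # 0/1 mask per embedding component
    return X * gate                # equivalent to X @ Diag(gate)
```

In training, the forward pass uses the hard `heaviside` gate, while `surrogate_grad` replaces its zero-almost-everywhere derivative during backpropagation.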

Adversarial Training
In recent years, the adversarial training of neural networks has had a significant influence, particularly in computer vision, where generative unsupervised models have demonstrated the ability to generate new images [47]. A crucial challenge in adversarial training is finding the proper balance between the generator and the adversarial discriminator, and several methods have been proposed to stabilize the training process [48-50]. Building on these ideas, Joty et al. [51] introduced cross-lingual adversarial neural networks designed to learn discriminative yet language-invariant representations. In this work, we use the same methodology to learn task-specific representations in a cross-lingual setting and improve the predictive capabilities of a multilingual BERT model.
Our approach is rooted in the Domain Adversarial Neural Network (DANN) algorithm, initially designed for domain adaptation [52]. DANN consists of a deep feature extractor F, responsible for extracting relevant features f from the input data, and a deep label classifier C, which uses those features to make predictions about the label of the input x. Together, these two components form a standard feed-forward architecture. In order to improve the performance of the model on a target domain where labeled data are scarce, an additional component is added to the architecture, called a domain classifier D, which is responsible for distinguishing between samples from the source and target domains d. This domain classifier is connected to the feature extractor via a gradient reversal layer, which multiplies the gradient by a negative constant during training. The gradient reversal layer helps ensure that the feature distributions over the two domains are as similar as possible, resulting in domain-invariant features that can better generalize to the target domain. The overall training process minimizes the label prediction loss on the source examples and the domain classification loss on all samples. Thus, the following equations are used to update the parameters of each of the three components:

θ_C ← θ_C − α · ∂L_y/∂θ_C
θ_D ← θ_D − α · ∂L_d/∂θ_D
θ_F ← θ_F − α · (∂L_y/∂θ_F − λ · ∂L_d/∂θ_F)

where θ_C are the parameters of the label classifier, L_y is the loss obtained by the label classifier when predicting the class labels y, θ_D are the parameters of the domain classifier, L_d is the loss obtained by the domain classifier when predicting the domain labels d, θ_F are the parameters of the feature extractor, λ is the hyperparameter used to scale the reverse gradients, and α is the learning rate.
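The gradient reversal layer at the heart of DANN can be sketched in a few lines. The class below is a framework-free illustration (in a real system this would be a custom autograd op in the training framework); the name `GradientReversal` and the default λ are our assumptions:

```python
class GradientReversal:
    """Identity in the forward pass; flips and scales gradients in the
    backward pass, so the feature extractor is pushed *away* from features
    that help the domain classifier."""

    def __init__(self, lam=0.01):
        self.lam = lam  # the scaling hyperparameter lambda

    def forward(self, x):
        # features pass through unchanged
        return x

    def backward(self, grad_output):
        # gradient flowing back to the feature extractor is reversed
        # and scaled by lambda
        return -self.lam * grad_output
```

Because the forward pass is the identity, inserting this layer between F and D changes nothing at inference time; it only alters the direction of the training signal reaching F.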

Monolingual Training
In the monolingual training experiments, we treat the MWE task as sequence tagging, so we try to predict a label for each input token. To attain that, we employ a feed-forward layer that maps the embeddings produced by a BERT model into the specific MWE class logits and then apply the softmax activation function to obtain the probabilities. This mechanism is succinctly described in the following equation:

p_i = softmax(W^T e_i + b)    (5)

where p_i are the MWE class probabilities for token i, e_i is the embedding produced by the language model, W^T is the transpose of the feed-forward layer's weight matrix, and b is its bias. We use the same BERT models for each language as in [25].
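A minimal NumPy sketch of this tagging head (the function name `mwe_head` and the shapes are illustrative assumptions) could look as follows:

```python
import numpy as np

def softmax(z):
    # subtract the row-wise max for numerical stability
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mwe_head(E, W, b):
    """Map token embeddings to MWE class probabilities.

    E: (num_tokens, hidden) token embeddings e_i from the language model
    W: (hidden, num_classes) feed-forward weights, b: (num_classes,) bias
    Returns p_i = softmax(W^T e_i + b) for every token i.
    """
    return softmax(E @ W + b)
```

Each row of the result sums to one, giving a proper probability distribution over the MWE classes for the corresponding token.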

Multilingual Training
We fine-tune the mBERT model for multilingual training using the same methodology as in the monolingual case. However, we improve the predictions by first employing the lateral inhibition layer on top of the embeddings. The lateral inhibition layer has been shown to improve the performance of language models in named entity recognition tasks [22,44,53], and we believe that it would do the same for MWE identification since the methodology is similar for the two tasks. Therefore, the equation that describes the resulting system becomes:

p_i = softmax(W^T LI(e_i) + b)    (6)

where LI is the lateral inhibition layer and the rest of the terms are the same as in Equation (5).
We also adapt the multilingual training by employing the DANN algorithm with a language discriminator instead of the domain discriminator. Thus, we create language-independent features out of the mBERT model by reversing the gradient that comes out of the language discriminator when backpropagating through the language model. The gradient reversal mechanism in our system is described by the following equations:

θ_C ← θ_C − α · ∂L_y/∂θ_C
θ_LD ← θ_LD − α · ∂L_ld/∂θ_LD
θ_F ← θ_F − α · (∂L_y/∂θ_F − λ · ∂L_ld/∂θ_F)

where θ_C are the parameters of the MWE classifier, L_y is the loss obtained by the MWE classifier when predicting the MWE labels y, θ_LD are the parameters of the language discriminator, L_ld is the loss obtained by the language discriminator when predicting the language labels ld, θ_F are the parameters of the mBERT model (i.e., the feature extractor in DANN), λ is the hyperparameter used to scale the reversed gradients, and α is the learning rate.

Finally, we employ the lateral inhibition layer and the DANN methodology with a language discriminator on the mBERT model for multilingual training. The forward procedure of this approach, which computes the loss between the predicted MWE probabilities for a given text and the corresponding ground truths, as well as the loss between the predicted language probabilities and the corresponding ground truths, is described in Algorithm 1 as follows:
• Tokenize the text using the mBERT tokenizer, obtaining the tokens tok_i (Line 1).
• Generate the multilingual embeddings emb_i for each of the above tokens tok_i using the mBERT model (Line 2).
• Apply the lateral inhibition layer to each of the embeddings emb_i (Line 3).
• Use the MWE classifier on the lateral inhibition layer's output to produce the probabilities ŷ_i of a token belonging to a certain MWE class (Line 4).
• Use the language discriminator on the embedding emb_[CLS] corresponding to the token [CLS] to produce the probabilities l̂d of the text belonging to a certain language (Line 5).
• Compute the loss L_y between the predicted MWE probabilities and the ground truth MWE labels (Line 6) and the loss L_ld between the predicted language probabilities and the ground truth language labels (Line 7).
In Algorithm 2, we outline the backward procedure used to update the parameters of our models as follows:
• Compute the gradients ∇_C for the MWE classifier using the MWE loss L_y (Line 1).
• Compute the gradients ∇_LD for the language discriminator using the language discriminator loss L_ld (Line 2).
• Compute the gradients ∇_F of the mBERT model using ∇_C and −∇_LD multiplied by λ (Line 3).
• Update the model parameters (i.e., θ_C, θ_LD, and θ_F) using the gradient descent algorithm (Lines 4-6).
Algorithm 1: Forward pass of the multilingual training with lateral inhibition and language adversarial training.
Input: text, ground truth MWE labels y_i, and ground truth language labels ld
Output: MWE identification loss L_y and language discrimination loss L_ld
1: tok_i ← tokenize(text)
2: emb_i ← mBERT(tok_i)
3: li_i ← LI(emb_i)
4: ŷ_i ← C(li_i)
5: l̂d ← LD(emb_[CLS])
6: L_y ← loss(ŷ_i, y_i)
7: L_ld ← loss(l̂d, ld)

Algorithm 2: Backward pass of the multilingual training with lateral inhibition and language adversarial training.
Input: MWE identification loss L_y, language discrimination loss L_ld, and reversed gradient scaling factor λ
Output: Parameters θ_C, θ_LD, and θ_F
1: ∇_C ← compute_gradients(L_y)
2: ∇_LD ← compute_gradients(L_ld)
3: ∇_F ← ∇_C − λ∇_LD
4: θ_C ← θ_C − α∇_C
5: θ_LD ← θ_LD − α∇_LD
6: θ_F ← θ_F − α∇_F
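The parameter updates of Algorithm 2 can be sketched compactly with plain gradient descent. Scalar placeholders stand in for the gradient tensors, and the function name and argument layout are our assumptions rather than the authors' code:

```python
def dann_update(theta_C, theta_LD, theta_F,
                grad_C, grad_LD, grad_F_task, grad_F_lang,
                alpha=3e-5, lam=0.01):
    """One update step of Algorithm 2.

    The MWE classifier and the language discriminator take standard
    gradient steps, while the shared mBERT feature extractor receives the
    task gradient minus the lambda-scaled (i.e., reversed) language
    gradient, pushing its features toward language independence.
    """
    theta_C = theta_C - alpha * grad_C                               # Lines 1, 4
    theta_LD = theta_LD - alpha * grad_LD                            # Lines 2, 5
    theta_F = theta_F - alpha * (grad_F_task - lam * grad_F_lang)    # Lines 3, 6
    return theta_C, theta_LD, theta_F
```

Note the sign on the language term: a positive λ subtracts the discriminator's gradient from the feature extractor's update, which is exactly what the gradient reversal layer produces during backpropagation.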

Dataset
The corpus used to evaluate our models is the PARSEME dataset version 1.2. The corpus was manually annotated with VMWEs of several types. Some are universal because they exist and were annotated in all languages in the project. These universal types are verbal idioms (e.g., the Romanian "a face din țânțar armăsar"-eng. "to make a mountain out of a molehill") and light verb constructions (e.g., the Romanian "a face o vizită"-eng. "to pay a visit") in which the verb is light in the sense that its semantic contribution to the meaning of the whole expression is almost null, its role being rather only that of carrying the verb-specific morphological information, such as tense, number, or person. There are also light verb constructions in which the verb carries a causative meaning (e.g., the Romanian "a da bătăi de cap"-eng. "to give a hard time"), and they are also annotated in all languages. The types of VMWEs that apply only to some of the languages in the project are called quasi-universal: inherently reflexive verbs (e.g., the Romanian "a-și imagina"-eng. "to imagine (oneself)"), verb-particle constructions (e.g., "to give up"), multi-verb constructions (e.g., "make do"), and inherently adpositional verbs (e.g., "to rely on"). For Italian, a language-specific type was defined, namely inherently clitic verbs (e.g., "prendersela"-eng. "to be angry").
The sizes of the datasets used in the PARSEME shared task edition 1.2 vary considerably across languages, with token counts reaching 1015 k (RO), while the number of annotated VMWEs ranges from 662 (GA) to 9164 (ZH). The dataset split was made to ensure a higher number of unseen VMWEs in the dev (100 unseen VMWEs with respect to the train set) and test (300 unseen VMWEs with respect to the train + dev files) sets. More statistics regarding the PARSEME 1.2 dataset are depicted in Table 1.
In addition to the annotation with VMWEs, the multilingual PARSEME corpus is also tokenized and morphologically and syntactically annotated, mostly with UDPipe [54]. Thus, the syntactic analysis follows the principles of Universal Dependencies (https://universaldependencies.org/ last accessed on 21 April 2023) [55].

Fine-Tuning
We followed the fine-tuning methodology employed by MTLB-STRUCT (the corresponding configuration files for each language are available at https://github.com/shivaat/MTLB-STRUCT/tree/master/code/configs last accessed on 21 April 2023) with the tree conditional random fields [56] disabled. Thus, we trained our models for 10 epochs using a batch size of 32 and the Adam optimizer [57] with a learning rate of 3 × 10^-5. We set the maximum input sequence length to 150, the scaling parameter k, used in the gradient approximation of the lateral inhibition Heaviside function, to 10, which was empirically shown to create a good enough surrogate gradient [22], and the hyperparameter λ, used in the DANN algorithm for scaling the reversed gradient, to 0.01. We did not employ k-fold cross-validation in our experiments, and we measured model performance in terms of precision, recall, and F1-score at the token level using the following equations:

P = TP / (TP + FP)
R = TP / (TP + FN)
F1 = 2 · P · R / (P + R)

where TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives. As suggested by the PARSEME 1.2 competition evaluation methodology (https://www.davidsbatista.net/blog/2018/05/09/Named_Entity_Evaluation/ last accessed on 21 April 2023), we compute the strict variant of the F1-score. Thus, we consider the predicted label of a group of tokens a true positive only if it perfectly matches the ground truth [58].
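The strict-matching criterion can be illustrated with the following sketch, which treats each annotated MWE as a hashable span identifier (e.g., a tuple of sentence id, token positions, and label). This is an illustration of the metric, not the official PARSEME evaluation script:

```python
def strict_f1(pred_spans, gold_spans):
    """Strict (exact-match) precision, recall, and F1 over MWE spans.

    A predicted span counts as a true positive only if it matches a gold
    span exactly; partial overlaps count as both a false positive and a
    false negative.
    """
    pred, gold = set(pred_spans), set(gold_spans)
    tp = len(pred & gold)          # exact matches
    fp = len(pred - gold)          # predicted but not in gold
    fn = len(gold - pred)          # gold but never predicted exactly
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

For example, predicting spans {A, B} against gold spans {B, C} yields one true positive, one false positive, and one false negative, so precision, recall, and F1 are all 0.5.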

Results
The results of our evaluation for both monolingual and multilingual training, with and without lateral inhibition and adversarial training, for all 14 languages are displayed in Table 2. We improved the performance of MTLB-STRUCT, the best overall system according to the competition benchmark (https://multiword.sourceforge.net/PHITE.php?sitesig=CONF&page=CONF_02_MWE-LEX_2020___lb__COLING__rb__&subpage=CONF_40_Shared_Task last accessed on 21 April 2023), on 11 out of 14 languages for global MWE prediction (the three remaining languages are German, Italian, and Romanian) and on 12 out of 14 languages for unseen MWE prediction (the two remaining languages are German and Greek). Of all the cases where our methods underperformed, the only large difference occurred for German, our best system falling behind the MTLB-STRUCT system by approximately 3.43% on global MWE prediction and approximately 6.57% on unseen MWE prediction. We believe that this is because the MTLB-STRUCT team employed the German BERT (https://huggingface.co/bert-base-german-cased last accessed on 21 April 2023), while we still used the mBERT model for this language.
For the global MWE prediction, we managed to improve the performance in 11 languages: the highest F1-score was obtained by the monolingual training once (i.e., Chinese), by the simple multilingual training three times (i.e., Greek, Irish, and Turkish), by the multilingual training with lateral inhibition three times (i.e., French, Hebrew, and Polish), by the multilingual adversarial training once (i.e., Basque), and by the multilingual adversarial training with lateral inhibition three times (i.e., Hindi, Portuguese, and Swedish). On the other hand, for the unseen MWE prediction, we achieved better results in 12 languages. The simple multilingual training obtained the highest F1-score only once (i.e., Swedish), the multilingual training with lateral inhibition three times (i.e., French, Turkish, and Chinese), the multilingual adversarial training five times (i.e., Irish, Hebrew, Hindi, Polish, and Romanian), and the multilingual adversarial training with lateral inhibition three times (i.e., Basque, Italian, and Portuguese). Also, the monolingual training did not achieve the highest F1-score for unseen MWE prediction in any language. These findings are summarized in Table 3.

Table 2. The results obtained by the monolingual and multilingual training, together with the results obtained by the best system of the PARSEME 1.2 competition, MTLB-STRUCT. LI is the lateral inhibition component, while Adv is the domain adaptation technique for cross-lingual MWE identification. We measure the precision (P), recall (R), and F1-score (F1) for each global and unseen MWE identification experiment. The best results in each language are highlighted in bold.

We further compared the average scores across all languages obtained by our systems.
In Table 4, we compare our results with those obtained by each system at the latest edition of the PARSEME competition (https://multiword.sourceforge.net/PHITE.php?sitesig=CONF&page=CONF_02_MWE-LEX_2020___lb__COLING__rb__&subpage=CONF_50_Shared_task_results last accessed on 21 April 2023): MTLB-STRUCT [25], Travis-multi/mono [33], Seen2Unseen [34], FipsCo [10], HMSid [35], and MultiVitamin [32]. For global MWE identification, we outperformed the MTLB-STRUCT results in all the multilingual training experiments, the highest average F1-score being obtained by the simple multilingual training without lateral inhibition or adversarial training. It achieved an average F1-score of 71.37%, an improvement of 1.23% over the MTLB-STRUCT F1-score (i.e., 70.14%).

For unseen MWE identification, we improved the average results obtained by MTLB-STRUCT using all the methodologies employed in this work. The highest average F1-score was obtained by the multilingual adversarial training, with 43.26%, outperforming the MTLB-STRUCT system by 4.73%.

Table 4. The average precision (AP), recall (AR), and F1-scores (AF1) over all languages obtained by our systems, compared with the results obtained by each system at the PARSEME 1.2 competition on global and unseen MWE identification. We also depict the number of languages used to train each system (#Lang). The best results are highlighted in bold.

Discussion
According to our experiments, the average MWE identification performance can be improved by approaching this problem with a multilingual NLP system, as described in this work. It is also instructive to examine how much improvement we brought over the PARSEME 1.2 competition's best system. These results are shown at the top of Figure 2 for global MWE prediction and at its bottom for unseen MWE prediction. The most significant relative improvements were achieved in Irish, where multilingual training combined with adversarial training boosted the performance by 45.32% for global MWE prediction and by 90.78% for unseen MWE prediction. Conversely, for the same language, monolingual training decreased the system's performance on global MWE prediction by 8.71% and only slightly increased it, by 2.86%, on unseen MWE prediction. We believe that these improvements in Irish were due to the benefits of multilingual training, since this language contained the least amount of training sentences (i.e., 257 sentences), and previous research has shown that superior results are obtained when such fine-tuning mechanisms are employed [59]. However, Hindi also contains a small number of training samples (i.e., 282 sentences), yet our multilingual training results for it are worse than for Irish. We attribute this to the language imbalance in the mBERT pre-training data [60] and to the linguistic isolation of Hindi, since no related languages appear in the fine-tuning data [61].
The second highest improvements for global MWE prediction were achieved in Swedish, with 2.45% for the monolingual training, 4.26% for the multilingual training, 4.17% for the multilingual training with lateral inhibition, 4.65% for the multilingual adversarial training, and 5.92% for the multilingual adversarial training with lateral inhibition. We observe a relatively large gap between the first and second place, but again we attribute it to the small number of training sentences for Irish compared to Swedish. For unseen MWE prediction, the second highest improvements were attained in Romanian, with 43.62% for the monolingual training, 44.00% for the multilingual training, 32.56% for the multilingual training with lateral inhibition, 49.47% for the multilingual adversarial training, and 40.32% for the multilingual adversarial training with lateral inhibition. In addition, the improvements are more uniform for unseen MWE prediction than for global prediction.

Figure 2. Improvements brought by our methodologies (i.e., Monolingual, Multilingual, Multilingual+LI, Multilingual+Adv, and Multilingual+LI+Adv) on global (top) and unseen (bottom) MWE prediction compared to the results of MTLB-STRUCT, the best system in the PARSEME shared task edition 1.2.
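Note that the per-language percentages discussed here are relative gains over the baseline F1-score, unlike the absolute differences in average F1 reported earlier. A minimal sketch, with purely illustrative numbers (not taken from the tables):

```python
def relative_improvement(ours, baseline):
    """Relative improvement (%) of our F1-score over a baseline F1-score."""
    return (ours - baseline) / baseline * 100.0

# Hypothetical example: a baseline F1 of 50.0 improved to 55.0
# is a 5-point absolute gain, but a 10% relative improvement.
print(round(relative_improvement(55.0, 50.0), 2))  # 10.0
```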

Conclusions and Future Work
Failure to identify MWEs can lead to misinterpretation of text and errors in NLP tasks, making this an important area of research. In this paper, we analyzed the performance of MWE identification in a multilingual setting, training the mBERT model on the combined PARSEME 1.2 corpus using all 14 languages it contains. In addition, to boost the performance of our system, we employed lateral inhibition and language adversarial training in our methodology, intending to create embeddings that are as language-independent as possible. Our evaluation results highlighted that, through this approach, we managed to improve on the results obtained by MTLB-STRUCT, the best system of the PARSEME 1.2 competition, on 11 out of 14 languages for global MWE identification and 12 out of 14 for unseen MWE identification. Thus, with the highest average F1-scores of 71.37% for global MWE identification and 43.26% for unseen MWE identification, we surpass MTLB-STRUCT by 1.23% on the former task and by 4.73% on the latter.
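The language-adversarial component summarized above is commonly realized with a gradient-reversal layer (GRL) placed between the encoder and a language discriminator. The sketch below illustrates only the GRL mechanics in plain Python; it is a hypothetical stand-in for the actual mBERT-based implementation, in which `forward`/`backward` would operate on tensors inside the training graph.

```python
# Minimal sketch of a gradient-reversal layer (GRL), the usual mechanism
# behind language-adversarial training (illustrative, not the authors' code).
class GradientReversal:
    def __init__(self, lambda_=1.0):
        self.lambda_ = lambda_  # strength of the adversarial signal

    def forward(self, x):
        # Identity: the language discriminator sees the embeddings unchanged.
        return x

    def backward(self, grad_output):
        # Flip and scale: the encoder receives the *negated* discriminator
        # gradient, pushing it toward language-independent embeddings.
        return [-self.lambda_ * g for g in grad_output]

grl = GradientReversal(lambda_=0.5)
print(grl.forward([1.0, -2.0]))   # [1.0, -2.0]
print(grl.backward([0.2, 0.4]))   # [-0.1, -0.2]
```

Training the discriminator to identify the input language while the encoder receives the reversed gradient is what drives the embeddings toward language independence.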
Possible future work directions involve analyzing how language-independent the features produced by mBERT become when lateral inhibition and adversarial training are involved, together with an analysis of other models that produce multilingual embeddings, such as XLM or XLM-R. In addition, we intend to apply these two methodologies, with possible extensions, to multilingual training beyond MWE identification, targeting tasks such as language generation or named entity recognition. Finally, since the languages in the PARSEME 1.2 dataset may share similar linguistic properties, we would like to explore how language groups improve each other's performance in the multilingual scenario.

Conflicts of Interest:
The authors declare no conflict of interest.