Exploring zero-shot and joint training cross-lingual strategies for aspect-based sentiment analysis based on contextualized multilingual language models

ABSTRACT Aspect-based sentiment analysis (ABSA) has attracted many researchers' attention in recent years. However, the lack of benchmark datasets for specific languages is a common challenge because of the prohibitive cost of manual annotation. The zero-shot cross-lingual strategy can be applied to close this gap. Moreover, previous works mainly focus on improving the performance of supervised ABSA with pre-trained language models. Therefore, there are few to no systematic comparisons of the benefits of multilingual models in zero-shot and joint training cross-lingual settings for the ABSA task. In this paper, we focus on zero-shot and joint training cross-lingual transfer for ABSA. We fine-tune the latest pre-trained multilingual language models on the source language and then use them directly to predict in the target language. For the joint learning scenario, the models are trained on the combination of multiple source languages. Our experimental results show that (1) fine-tuning multilingual models achieves promising performance in the zero-shot cross-lingual scenario; (2) fine-tuning models on the combined training data of multiple source languages outperforms training on monolingual data in the joint training scenario. Furthermore, the experimental results indicate that choosing a language other than English as the source language can give promising results in the low-resource language scenario.


Introduction
Nowadays, the development of social networks and e-commerce helps users share and quickly consult feedback about the products and services of business organizations. Customers tend to refer to comments before making decisions. In addition, users' comments are a valuable resource for business organizations to analyse, develop, and improve their products and services to provide the best customer experience. Unfortunately, manually processing a massive volume of comments is not feasible. Hence, the Opinion Mining task has attracted much attention from researchers and business organizations in the field of Natural Language Processing (NLP). Most recent research has focused on solving this task at the aspect level, called Aspect-based Sentiment Analysis, so that more fine-grained information can be extracted automatically from the large volume of comments. For example, consider a review from the restaurant domain: 'This place is great, but the food is not delicious'. There are two aspect categories (Restaurant#General and Food#Quality) in this sentence, but their sentiment polarities are opposite (positive for Restaurant#General, negative for Food#Quality). Analysing which categories receive positive or negative comments can help improve the service and attract new customers.
There are currently more than 7000 languages worldwide (Joshi et al., 2020a); however, most recent research focuses only on resource-rich languages such as English, Chinese, Arabic, etc. These languages have abundant and diverse annotated resources and tools for various NLP tasks. Moreover, building a new dataset for a specific language requires considerable resources and manual annotation cost. Therefore, the lack of annotated datasets for low-resource languages has become a challenge for researchers in the NLP field. Recently, Hedderich et al. (2021) presented an overview of different approaches to improving performance on low-resource languages, including data enhancement, multilingual language models, etc. Following the success of multilingual BERT (Devlin et al., 2019), many pre-trained multilingual transformer models have been released, such as XLM-R (Conneau et al., 2020), InfoXLM (Chi et al., 2021a), and XLM-Align (Chi et al., 2021b), which are beneficial for low-resource languages where task-specific annotated data is scarce. Several recent studies take advantage of existing pre-trained multilingual language models to improve system performance on different tasks such as sentiment analysis (Kumar & Albuquerque, 2021; Pei et al., 2022; Sarkar et al., 2019; Sultan et al., 2020), Named Entity Recognition (Arkhipov et al., 2019; Pires et al., 2019; Sharma et al., 2022b), Hate Speech Detection (Sharma et al., 2022a), and other tasks (Bhatnagar et al., 2022; Sun et al., 2022). In addition, pre-trained multilingual language models have been applied to zero-shot cross-lingual transfer for various NLP tasks (Artetxe & Schwenk, 2019; Keung et al., 2020; Kim et al., 2021; Lauscher et al., 2020a; Nooralahzadeh et al., 2020; Pamungkas et al., 2021; Phan et al., 2021). However, most of these studies only compared the performance of mBERT with XLM-R and used English as the source language.
There is no fair comparison when the amount of training data is different between languages in the zero-shot learning scenario.
No study has explored state-of-the-art (SOTA) pre-trained multilingual language models for ABSA in both zero-shot and joint training cross-lingual scenarios. Therefore, the major objective of this study is to investigate the effectiveness of fine-tuning multilingual contextualized language models in both scenarios on two main ABSA tasks: Aspect Category Detection (ACD) and Category-Sentiment Classification (CSC). In this work, we make the following contributions to the ABSA task:
• Firstly, we explore the power of zero-shot transfer learning for five languages in the context of lacking labelled training data in the resource-poor target language.
• Secondly, we conduct experiments to answer a research question: 'Why not use another language instead of English as the source language for ABSA tasks in the zero-shot scenario?', because many previous studies trained models on English data and tested them on non-English languages (Keung et al., 2020; Lin et al., 2019) in the zero-shot setting.
• Thirdly, we evaluate the system's performance in a joint training strategy by mixing the training data of the source and target languages and by combining multiple source languages.
• Finally, several of the latest multilingual language models, such as InfoXLM and XLM-Align, have not been explored before on zero-shot and joint learning cross-lingual tasks, especially for low-resource languages. In this paper, we therefore also investigate these latest pre-trained multilingual transformer models for the two ABSA tasks.
The remainder of this paper is structured as follows: Section 2 presents a survey of previous studies on ABSA task, zero-shot and joint learning research. Section 3 describes the methodology using different pre-trained language models in zero-shot and joint learning scenarios. Section 4 presents the experimental results. Our conclusions are found in the final section.

Related work
This section consists of three sub-sections covering the associated studies for the ABSA problem, zero-shot and joint training cross-lingual in the NLP field. The purpose of this paper is to explore the performance of zero-shot and joint training cross-lingual for ABSA tasks. Therefore, we survey the most recent work concerning the ABSA in Section 2.1, zero-shot learning in Section 2.2 and joint learning in Section 2.3.

Aspect-based sentiment analysis
In recent years, contextual language models have improved system performance in the field of ABSA. First, many datasets have been published for the research community through the SemEval shared tasks in 2014 (Pontiki et al., 2014), 2015 (Pontiki et al., 2015), and 2016 (Pontiki et al., 2016). These shared tasks provided several datasets from various domains in languages such as English, Chinese, Dutch, etc. These datasets are very popular and serve as benchmarks for many ABSA tasks. Recently, pre-trained BERT (Devlin et al., 2019) language models have shown their effectiveness in various ABSA tasks. Sun et al. (2019) presented four methods based on fine-tuning BERT with an auxiliary sentence for the (T)ABSA problem. They transformed this problem into a sentence-pair classification task and fine-tuned the pre-trained BERT model. Their experimental results demonstrated the advantages of sentence-pair classification based on BERT for the ABSA task; however, their models require considerable computational resources and training time. Hoang et al. (2019) presented an overview of fine-tuning the pre-trained BERT model to address the out-of-domain ABSA problem at both levels of datasets by using the sentence-pair classification approach. However, the authors conducted experiments only on English rather than on the other languages in the SemEval datasets. Another study presented an end-to-end BERT-based neural architecture for extracting aspect terms with their corresponding sentiment polarity. The authors formulated the two tasks as a sequence labelling problem and used the pre-trained BERT embedding as the embedding layer, combined with several different layers (linear layer, recurrent network, self-attention, and conditional random field layer) on top of BERT. Their experimental results showed that the BERT-based model is a powerful architecture for improving the performance of aspect-based sentiment analysis.
On the other hand, Xu et al. (2019) proposed a BERT-based post-training model for the OTE task and Review Reading Comprehension (RRC) to enhance domain awareness. Because the training corpus of BERT differs from the review corpus, this post-training approach adapts BERT using two unsupervised objectives on the task-specific corpus to learn domain-aware contextualized representations. Rietzler et al. (2020) analysed the behaviour of domain-specific and cross-domain post-training techniques based on BERT language modelling for the Aspect-Target Sentiment Classification task. The experimental results indicated that domain-specific language model fine-tuning produces state-of-the-art performance. Unfortunately, this requires a sufficiently large domain-specific corpus and enough resources to train the BERT model for a specific domain, which may not be feasible for low-resource languages or for studies with limited computational resources. Subsequently, Karimi et al. (2021) showed that using adversarial training with domain-specific post-trained BERT could further improve ABSA performance. In addition, they showed that the number of training epochs and the dropout value can significantly affect the model's performance. Song et al. (2020) investigated the potential of BERT intermediate layers to improve the performance of BERT fine-tuning using LSTM pooling or an attention mechanism. The experimental results demonstrated the effectiveness of the proposed approach for the ABSA problem. Wan et al. (2020) proposed a novel architecture that relies on the BERT language model to address the limitation of implicit terms in the review. Their model is able to capture the dependence of sentiments on both term expressions and aspect categories in the sequence through joint learning.
Most of the above research focuses on high-resource languages such as English, leveraging the available monolingual language models to improve supervised learning performance on the ABSA problem.

Zero-shot cross-lingual
The development of deep learning architectures has achieved significant success in many areas; however, it requires a sufficient amount of labelled training data. Furthermore, labelling is time-consuming and expensive for several tasks; therefore, zero-shot learning methods (Larochelle et al., 2008) have been studied on various topics in NLP, especially for languages without the resources to train supervised models. Jebbara and Cimiano (2019) addressed the lack of annotated data for specific languages by applying a zero-shot cross-lingual approach to the opinion target expression task. To do that, the authors aligned embeddings to compute a cross-lingual representation of two languages based on FastText embeddings (Bojanowski et al., 2017). They presented experiments using a Convolutional Neural Network as a baseline model and demonstrated the effectiveness of zero-shot learning. However, the authors ignored the influence of the data size of the target language. Lauscher et al. (2020b) presented extensive experiments on zero-shot cross-lingual transfer using multilingual pre-trained language models (mBERT, XLM-R) on different NLP tasks. The authors analysed the conditions and factors that affect the performance of cross-lingual transfer, such as linguistic similarity and the size of the pre-training data. van der Heijden et al. (2020) presented a comprehensive comparison of multilingual word and sentence representations for Named Entity Recognition and Part-of-Speech tagging in zero-shot learning settings. The results showed that pre-trained multilingual BERT outperformed other supervised models. However, the authors compared the mBERT model with the XLM transformer model (Conneau & Lample, 2019) instead of XLM-R.
Recently, Pamungkas et al. (2021) investigated the zero-shot learning approach for hate speech detection based on knowledge from a resource-rich language. However, the authors only transferred knowledge from English; therefore, the effectiveness of using knowledge from other languages was not studied. Kim et al. (2021) presented parallel-labelled cross-lingual named entity recognition in English and Korean to develop a zero-shot learning model. They fine-tuned mBERT on English, transferred the trained model to Korean, and compared it with the embedding and annotation projection approaches. The experimental results showed that the word order of the target language is important in cross-lingual learning. Kumar and Albuquerque (2021) applied the XLM-R model to transfer knowledge from English to Hindi for sentiment analysis. Unfortunately, the authors only compared the XLM-R large model with deep learning approaches, so it is difficult to conclude that XLM-R is suitable for the zero-shot scenario. Phan et al. (2021) presented a study on zero-shot cross-lingual learning with pre-trained multilingual models (mBERT and XLM-R) for two sub-tasks of the ABSA problem. Unfortunately, the authors did not account for differences in the number of training samples among languages, which leads to unfair comparisons between models and languages.

Joint training
The joint training scenario trains one model on multiple languages, because many languages share common features such as morphological, phonological, and syntactic phenomena (Ammar et al., 2016; Bender, 2011; Mulcaire et al., 2018). As a result, training on multiple languages can improve the performance of models on related languages. Ammar et al. (2016) found that training a model on treebanks of multiple languages outperformed training on monolingual data for parsing tasks. However, the authors employed a traditional deep learning model (LSTM) combined with static multilingual word embeddings instead of contextual word representations. Mulcaire et al. (2018) also applied this idea by combining training data across languages for semantic role labelling. The experimental results showed that joint learning could achieve better performance than training on monolingual data. Aharoni et al. (2019) presented extensive experiments in multilingual neural machine translation by training many languages in a single model. This demonstrated that multilingual joint learning is beneficial in various NLP tasks. The authors employed the XLM-R language model as the baseline. Recently, the development of multilingual pre-trained language models has brought many benefits to low-resource languages. With so many pre-trained language models and languages available, how to choose among them to improve performance on a specific language is an interesting problem.

Cross-lingual framework
Figures 1 and 2 provide an overview of the zero-shot and joint learning cross-lingual strategies in our experiments. We apply both strategies to two main ABSA tasks: Aspect Category Detection and Category-Sentiment Classification. First, we present the problem formulation of the two experimental tasks. Second, we summarize the methodology used to build the base models for the two tasks. Third, we detail the zero-shot cross-lingual transfer learning approach built on the base models for five languages. Finally, we explain the cross-lingual joint training approach. As shown in Figure 1, the model is trained on the training data of one source language (e.g. Spanish) based on a multilingual pre-trained language model. Then the trained model is tested on target languages (e.g. English, French) in a zero-shot manner. In joint learning, by contrast, the model is trained on the combined data of two source languages (e.g. Spanish and French), as in Figure 2.

Problem formulation
• Aspect Category Detection: The purpose of this task is to identify which of the pre-defined entity E and attribute A pairs are mentioned in a given sentence. A review of length N can be represented as $X_r = \{w_1, w_2, \ldots, w_N\}$, where $w_i$ denotes the i-th word. To tackle this task, we treat it as a multi-label classification problem where the output is a binary vector $Y = \{y_1, y_2, \ldots, y_C\}$, where $y_c$ denotes the c-th aspect category and C is the number of aspect categories of the specific domain.
• Category-Sentiment Classification: A review sentence of length N can again be represented as $X_r = \{w_1, w_2, \ldots, w_N\}$, where $w_i$ denotes the i-th word. The goal of the CSC task is to detect the aspect categories and identify the associated sentiment polarities in the sentence. Formally, let the output Y be the set of one-hot vectors where $y^a_i$ represents the i-th category in the set of C aspect categories and $y^p_i$ represents the sentiment corresponding to the i-th aspect category, drawn from the positive, neutral, and negative sentiment labels.
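To make the two target formats concrete, the following minimal Python sketch encodes the example review from the introduction. The three-category inventory and the helper names `encode_acd` and `encode_csc` are illustrative assumptions, not the label set or code used in the experiments.

```python
# Hypothetical, shortened category inventory (the real domain has more categories).
CATEGORIES = ["RESTAURANT#GENERAL", "FOOD#QUALITY", "SERVICE#GENERAL"]
POLARITIES = ("positive", "neutral", "negative")


def encode_acd(annotations):
    """Multi-label ACD target: y_c = 1 if the c-th category is mentioned."""
    return [1 if category in annotations else 0 for category in CATEGORIES]


def encode_csc(annotations):
    """CSC target: one (category, polarity) pair per mentioned category."""
    pairs = []
    for category in CATEGORIES:
        if category in annotations:
            polarity = annotations[category]
            if polarity not in POLARITIES:
                raise ValueError(f"unknown polarity: {polarity}")
            pairs.append((category, polarity))
    return pairs


# 'This place is great, but the food is not delicious'
example = {"RESTAURANT#GENERAL": "positive", "FOOD#QUALITY": "negative"}
acd_target = encode_acd(example)
csc_target = encode_csc(example)
```

The ACD target only records which categories occur, while the CSC target additionally carries the polarity of each detected category.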

Methodology
In recent years, deep contextual language models have been introduced and have achieved SOTA results in various downstream NLP tasks. These models are pre-trained on a large unlabelled corpus and then fine-tuned on downstream tasks. The aim of this study is to perform zero-shot and joint training cross-lingual transfer for ABSA tasks in five languages using transfer learning techniques based on pre-trained language models. However, pre-training such models requires a large amount of data and computational resources, which might not be available for low-resource languages. Therefore, pre-trained multilingual language models have been released to address this gap. According to Kalyan et al. (2021), many multilingual language models are available. We employ mBERT and XLM-R because they support the languages in our experimental datasets. Moreover, we employ the latest SOTA multilingual models, InfoXLM and XLM-Align, because of their ability in cross-lingual NLP tasks. Below, we summarize the pre-trained multilingual transformer models used in this paper:
• mBERT: This is the BERT architecture (Devlin et al., 2019) trained on the Wikipedia text of the 104 highest-resource languages with two objectives: masked language modelling (MLM) and next sentence prediction (NSP).
• XLM-R: An optimized version of BERT trained with the MLM objective on 2.5 TB of data across 100 languages filtered from Common Crawl text (Conneau et al., 2020). This model outperforms the mBERT model in a variety of cross-lingual NLP tasks.
• InfoXLM: Chi et al. (2021a) presented a new cross-lingual pre-trained language model, named InfoXLM. This model is trained on monolingual and parallel data by jointly training a cross-lingual contrastive objective with multilingual masked language modelling and translation language modelling. The pre-training data is similar to that of the XLM-R model (Conneau et al., 2020). The experimental results demonstrated that InfoXLM achieved better cross-lingual transferability.
• XLM-Align: This is a cross-lingual language model pre-trained with a denoising word alignment task (Chi et al., 2021b). The model's training process consists of two steps: (1) self-labelling word alignments for translation pairs; (2) randomly masking tokens in the bitext sentence. Extensive experiments on cross-lingual tasks showed that this model is effective on various datasets such as Sentence Classification (XNLI, PAWS-X), Question Answering (XQuAD, MLQA, and TyDiQA), etc.
To explore the performance of zero-shot and joint training cross-lingual approaches in different languages, we fine-tune the above models following the recommendations of previous work (Devlin et al., 2019) and add task-specific linear layers for each task. The details of the two base models are described in the following sections.

Aspect category detection
Aspect Category Detection is a multi-label classification task in which zero or more aspect categories can be detected in a sentence. Figure 3 illustrates the architecture for this task. Let the input sentence consist of a sequence of words $X = \{w_1, w_2, \ldots, w_N\}$, where $w_i$ denotes the i-th word. After pre-processing the text, two special tokens, [CLS] and [SEP], are added to the beginning and end of the sequence. Because the experimental data consists of sentence-level reviews, we pad sentences to a uniform length. The maximum length is the length of the longest sentence in the dataset and is ensured to be shorter than the maximum input length of the transformer models. Then, this sequence is fed directly to the pre-trained multilingual language model to obtain the token representations. The output is a sequence of hidden states $H^L_X \in \mathbb{R}^{N \times dim_h}$, where $dim_h$ is the dimension of the representation vector and $h^L_i$ is the hidden state of the i-th input token at the L-th (last) transformer layer. The final hidden state $h^L_{CLS}$ of the [CLS] token in the last layer is used as the representation of the input review. Finally, a task-specific fully connected layer with a sigmoid activation is added on top of the model. The model's output is a probability vector whose length equals the number of pre-defined categories, with the sigmoid function producing a probability for each aspect category. An aspect category is assigned to the review if its probability is greater than a threshold. The threshold is optimized on the validation set using grid search. We employ binary cross-entropy as the loss function between the predicted probabilities and the true labels.
Figure 3. The overall architecture is based on the pre-trained transformer language models for the Aspect Category Detection task.
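The classification head described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the logits stand in for the output of the fully connected layer, the loss is averaged over categories, and micro-averaged F1 is assumed as the selection criterion for the threshold grid search, since the text does not state which metric is used.

```python
import math


def sigmoid(z):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy over the C categories of one review."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def predict(y_prob, threshold):
    """Assign a category when its probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in y_prob]


def micro_f1(gold, pred):
    """Micro-averaged F1 over all (review, category) decisions."""
    tp = fp = fn = 0
    for g_row, p_row in zip(gold, pred):
        for g, p in zip(g_row, p_row):
            tp += g == 1 and p == 1
            fp += g == 0 and p == 1
            fn += g == 1 and p == 0
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def tune_threshold(val_probs, val_gold, grid):
    """Grid search: keep the threshold with the best validation micro-F1."""
    return max(
        grid,
        key=lambda t: micro_f1(val_gold, [predict(p, t) for p in val_probs]),
    )


logits = [2.0, -1.0, 0.0]              # assumed output of the linear layer
probs = [sigmoid(z) for z in logits]
loss = binary_cross_entropy([1, 0, 0], probs)
```

A single shared threshold is assumed here; tuning one threshold per category would be an equally plausible reading of the description.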

Category-Sentiment classification
The output of this task is the set of {Aspect category, Sentiment} pairs mentioned in a given review. This compound task therefore detects the aspect categories and their corresponding sentiment polarities simultaneously. To deal with this problem, a multi-task approach based on the BERT models, inspired by previous works (Dai et al., 2019; Schmitt et al., 2018; Van Thin et al., 2022), is employed to predict the output of each aspect category with its corresponding sentiment as a one-hot vector with four elements. The first element indicates whether the aspect category is mentioned in the review, while the other three elements represent the three sentiment polarities of the category; for example, the pair 'Quality, positive' is encoded as [0 1 0 0]. As shown in Figure 4, we have C softmax output layers corresponding to the C aspect categories. We could train a model for each aspect category independently; however, this would not help the model exploit correlated information between categories. Therefore, we build a multi-task architecture to utilize the correlation and influence between the aspect categories in the review. As in the ACD architecture, we use the last hidden state of the [CLS] token, $h^L_{CLS}$, as the representation of the input review and feed it into C fully connected layers with softmax activation, one for each aspect category a among the C aspect categories, where the weight W and bias b of each layer are learned during training. Our model is optimized by minimizing the sum of the categorical cross-entropy losses over the C aspect categories, where $y^a_i$ is the true one-hot vector for category a and $\hat{y}^a_i$ is the predicted probability vector for category a.
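A plausible reading of this description is that each category a has its own head computing softmax(W^a h + b^a) over four slots, and that the training loss adds up the per-category cross-entropies. The sketch below illustrates that reading in plain Python. The slot order after the first two positions and the category names are assumptions; the source only fixes that [0 1 0 0] denotes 'positive'.

```python
import math

# Assumed slot order; only the 'positive' position is given in the text.
LABELS = ["none", "positive", "neutral", "negative"]


def one_hot(label):
    """Four-element one-hot target for a single aspect category."""
    return [1 if name == label else 0 for name in LABELS]


def softmax(logits):
    """Numerically stable softmax over the four slots of one category head."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def csc_loss(targets, logits_per_category, eps=1e-12):
    """Sum over the C category heads of the categorical cross-entropy."""
    loss = 0.0
    for y, logits in zip(targets, logits_per_category):
        probs = softmax(logits)
        loss -= sum(y_i * math.log(max(p_i, eps)) for y_i, p_i in zip(y, probs))
    return loss


def decode(logits_per_category, categories):
    """Return the (category, sentiment) pairs whose head does not pick 'none'."""
    pairs = []
    for category, logits in zip(categories, logits_per_category):
        best = max(range(len(LABELS)), key=lambda i: logits[i])
        if LABELS[best] != "none":
            pairs.append((category, LABELS[best]))
    return pairs


categories = ["FOOD#QUALITY", "SERVICE#GENERAL"]   # hypothetical subset
head_logits = [[0.1, 2.0, 0.0, -1.0], [3.0, 0.0, 0.0, 0.0]]
predicted_pairs = decode(head_logits, categories)
```

Because all heads share the same [CLS] representation, a gradient step on the summed loss updates the encoder with signal from every category at once, which is the stated motivation for the multi-task design.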

Zero-shot cross-lingual transfer learning
In most NLP studies, zero-shot cross-lingual transfer learning means that a model trained on the source language is used to predict on the target language without any target-language training data (Karthikeyan et al., 2019; Keung et al., 2019, 2020; Lauscher et al., 2020a; Nooralahzadeh et al., 2020). Based on this scenario, we experiment in two steps: (1) fine-tuning pre-trained multilingual language models on the training data of the source language; (2) transferring the fine-tuned weights to evaluate on the test data of the target language. Figure 5 shows this strategy for the zero-shot cross-lingual evaluation between two languages in our experiments. Unlike previous works, we train and transfer models from different source languages instead of using only English. Furthermore, we evaluate the performance when using a combination of multiple source languages as training data.
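The two-step protocol can be outlined as a small evaluation loop. The sketch below is illustrative only: `fine_tune` and `evaluate` are hypothetical callables standing in for the actual training and scoring code, and the data containers are assumed to be plain dictionaries keyed by language code.

```python
LANGS = ["en", "fr", "es", "nl", "ru"]


def zero_shot_grid(fine_tune, evaluate, train_sets, test_sets):
    """Score every source -> target pair with source != target."""
    scores = {}
    for source in LANGS:
        model = fine_tune(train_sets[source])        # step 1: source data only
        for target in LANGS:
            if target != source:
                # step 2: apply the fine-tuned weights to the target test set
                scores[(source, target)] = evaluate(model, test_sets[target])
    return scores


def multi_source_zero_shot(fine_tune, evaluate, train_sets, test_sets, target):
    """Train on all source languages pooled together, excluding the target."""
    pooled = [
        example
        for lang in LANGS
        if lang != target
        for example in train_sets[lang]
    ]
    return evaluate(fine_tune(pooled), test_sets[target])
```

With five languages the grid contains 20 source-target pairs, and the target language never contributes training data in either function.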

Cross-lingual joint training
In zero-shot learning, the model is trained and tested on two different languages, while in multilingual joint learning, the model is trained on the combination of source and target language data. For example, suppose we have labelled training data in English and French, denoted $L_{en}$ and $L_{fr}$. The task is to use $L_{en}$ and $L_{fr}$ to train a model that classifies review texts in the target language, French. This approach has been shown to be beneficial in various cross-lingual tasks (Aharoni et al., 2019; Johnson et al., 2017; Mulcaire et al., 2018; van der Heijden et al., 2020; Zhou et al., 2016). To evaluate the benefit of joint training based on multilingual language models, we conduct two experiments for the ABSA task: (1) combining the full training data of a source and target language pair as the new training set and then evaluating on the test set of the target language; (2) combining the entire training data of all languages to train a model and then evaluating it on the test set of each language.
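The two joint training configurations differ only in how the training set is assembled. The following sketch shows one way to build both sets; the function names are hypothetical, and shuffling the combined data with a fixed seed is an assumption rather than a detail given in the text.

```python
import random


def joint_pair_train_set(train_sets, source, target, seed=0):
    """Experiment (1): full source data plus full target data."""
    combined = list(train_sets[source]) + list(train_sets[target])
    random.Random(seed).shuffle(combined)  # assumed: mix languages within batches
    return combined


def joint_all_train_set(train_sets, seed=0):
    """Experiment (2): training data of every language, target included."""
    combined = [
        example for lang in sorted(train_sets) for example in train_sets[lang]
    ]
    random.Random(seed).shuffle(combined)
    return combined
```

In both cases the resulting model is evaluated on the unchanged test set of the target language, so any gain over the monolingual model comes from the additional training data.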

Datasets
We use the SemEval 2016 dataset (Pontiki et al., 2016) for the restaurant domain to conduct all experiments, covering English (en), French (fr), Spanish (es), Dutch (nl), and Russian (ru). These datasets have different sizes, and this difference greatly affects the experimental results in the zero-shot and joint learning scenarios, especially for languages with a lot of training data. To address this, we use iterative stratification (Sechidis et al., 2011) to re-sample the datasets for our experiments. The datasets are split into sub-datasets of about 1200 training samples and 400 testing samples for each of the five languages. The statistics of the datasets are shown in Table 1.

Experimental settings
In this study, we utilize the pre-trained multilingual language models available in the Hugging Face library (Wolf et al., 2020), including mBERT 1 , XLM-R 2 , XLM-Align 3 , and InfoXLM 4 for the base version, and XLM-R 5 and InfoXLM 6 for the large version. For the zero-shot and joint learning cross-lingual evaluation, we use only the best individual base model because of limited computational resources. For the hyper-parameters, we apply cross-validation on the training set to choose the optimal values for each model and language. These parameters, such as the learning rate and the number of epochs, are listed in Table 2. For both tasks, we use the AdamW optimizer with a different learning rate for each language. Batch sizes are selected from 16, 32, and 64. Our experiments showed that the models require more epochs to avoid underfitting because both tasks are multi-label classification tasks. Therefore, we set the number of epochs to 30 and 40 for the ACD and CSC tasks, respectively, with an early stopping strategy.

Results
This section presents the experimental results of three scenarios: (1) performance of different pre-trained language models; (2) evaluation of zero-shot learning; (3) evaluation of joint learning scenario.

Performance of multilingual models
We first compare the effectiveness of multilingual pre-trained language models on the two ABSA tasks. Tables 3 and 4 show the micro-averaged F1 score per model per language for the ACD and CSC tasks, with the highest scores per language shown in bold and underlined. Note that the results represent the performance of models trained and tested on data of the same language. It can be observed that the large models further improve the results over the base models. The InfoXLM large model achieved the highest scores in most languages, with the exception of Dutch, for the ACD task, and it is the most effective model in all languages for the CSC task. Among the base language models, the XLM-Align model achieved high scores for all languages except Dutch and Russian for the ACD task. One reason for the worse performance of XLM-Align compared with XLM-R in Dutch and Russian is the size of the pre-training data, as noted in a previous study (Lauscher et al., 2020b). Specifically, the pre-training data of the XLM-R model is larger than that of XLM-Align, and XLM-R outperforms XLM-Align in Dutch and Russian by 3.33% and 0.2%, respectively. Comparing InfoXLM with XLM-Align shows that the two models are quite competitive: which one is better depends on the language and task, but the difference is not significant. The findings for the CSC task are similar. These results show that the performance of pre-trained multilingual language models varies with the language. Therefore, we recommend that future studies compare these models on the specific language of interest, particularly for low-resource languages, to select the best model.

Zero-shot cross-lingual
In this section, we present the zero-shot results in the following two scenarios. First, we investigate the performance of each model trained on a source language and evaluated on a target language. Second, we examine the effectiveness of models trained on multiple source languages without the target language. We choose XLM-Align as the primary model for this setting. Table 5 presents the zero-shot cross-lingual results on source-target language pairs for the ACD and CSC tasks. The rows and columns represent the source and target languages, respectively. The best performance for each target language (column) is shown in bold. As shown in Table 5, training the model on English achieves the best scores on French for both tasks. Moreover, we observe that the performance of the XLM-Align model differs between the two tasks for some language pairs in this setting. For example, using French as the source language gives the best score on Dutch for the ACD task; however, the best source language for the CSC task is English. The CSC task is a compound task that aims to assign the set of aspect categories and their corresponding sentiment, which may yield different results from ACD. In addition, the distribution of training data across sentiment polarity classes also differs between languages. This leads to inconsistent experimental results between the two tasks. Moreover, our experimental results indicate the effect of language similarity on zero-shot transfer learning. For example, the model trained on English (a Germanic language) data produces higher scores on typologically or etymologically related languages such as French and Spanish (Romance languages) than on Russian (a Slavic language). The reason is that much of modern English vocabulary is borrowed from the Romance languages (Şenel et al., 2017).
This demonstrates that the choice of source language for knowledge transfer in the zero-shot setting also influences performance in the target language. Figure 6 compares the best scores of the monolingual models with those of the zero-shot cross-lingual setting. In general, the performance of zero-shot learning is lower than that of the monolingual model, which is acceptable given that no training data in the target language is used. Specifically, the gap between the highest zero-shot cross-lingual results and the monolingual results ranges from 3.89% to 9.81% for the ACD task and from 4.96% to 7.12% for the CSC task.
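The way a source-target score matrix is read can be sketched as follows: for each target language (column), take the best-scoring source language (row), then compare that score with the monolingual score. All numbers below are hypothetical placeholders chosen only for illustration; they are not the values reported in Table 5 or Figure 6, and the helper name `best_source_per_target` is likewise an assumption.

```python
from typing import Dict, Tuple

# HYPOTHETICAL placeholder scores, not results from the paper.
# zero_shot[source][target] = micro-F1 when training on `source`
# and testing on `target`.
zero_shot: Dict[str, Dict[str, float]] = {
    "en": {"fr": 0.70, "es": 0.66},
    "fr": {"en": 0.68, "es": 0.69},
    "es": {"en": 0.64, "fr": 0.67},
}
# HYPOTHETICAL monolingual scores (trained and tested on the same language).
monolingual: Dict[str, float] = {"en": 0.80, "fr": 0.75, "es": 0.74}


def best_source_per_target(
    scores: Dict[str, Dict[str, float]]
) -> Dict[str, Tuple[str, float]]:
    """For every target language, return the (source, score) pair
    with the highest zero-shot score in that column."""
    best: Dict[str, Tuple[str, float]] = {}
    for source, row in scores.items():
        for target, score in row.items():
            if target not in best or score > best[target][1]:
                best[target] = (source, score)
    return best


best = best_source_per_target(zero_shot)

# Gap between the monolingual score and the best zero-shot score.
gaps = {
    target: round(monolingual[target] - score, 4)
    for target, (_, score) in best.items()
}

for target, (source, score) in sorted(best.items()):
    print(f"target={target} best_source={source} "
          f"zero_shot={score:.2f} gap={gaps[target]:.2f}")
```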
Moreover, we conduct an experiment to assess the effectiveness of training the model on a combination of multiple languages. In this experiment, we combine the training data of all source languages, excluding the target language. Table 6 presents the scores of this experiment for the two tasks. Training the model on the combined data yields better scores than training on a single source language in all languages for both tasks. In particular, there is a remarkable increase for the ACD task in Dutch, Spanish, and English, with +7.33%, +5.25%, and +4.41%, respectively. For the CSC task, the micro-averaged F1-score also improves by 3.35% to 7.67% across all languages.

Figure 6. Comparison of the best results of the monolingual models, which are both trained and tested on target language data (blue bars), with the zero-shot cross-lingual setting (red bars).

Table 7 shows the results of joint training on language pairs for the two tasks. The main diagonal contains the best scores from Tables 3 and 4, where the model is trained and tested on data of the target language. In general, the joint training approach can improve the performance of the model; however, the improvement depends on the language pair. We found that training on pairs of linguistically related languages, such as English with Dutch, French, or Spanish, is more beneficial than training on other pairs. In addition, instead of joint training on each language pair, we combine the training data of all languages, including the target language, into the final training set. The results of this experiment are summarized in Table 8.
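The training-set compositions used in the different settings can be summarized in a short sketch. It assumes that each language's training data is a list of examples and that combining languages means simple concatenation; the function names, the language codes, and the `Example` type are illustrative assumptions rather than the authors' implementation.

```python
from typing import Dict, List, Tuple

Example = Tuple[str, str]  # (review text, label); placeholder representation
Corpus = Dict[str, List[Example]]

LANGS = ["en", "fr", "es", "nl", "ru"]  # the five languages in the study


def zero_shot_single(train: Corpus, source: str) -> List[Example]:
    """Zero-shot, single source: train on one source language only."""
    return list(train[source])


def zero_shot_multi(train: Corpus, target: str) -> List[Example]:
    """Zero-shot, multiple sources: all languages except the target."""
    return [ex for lang in LANGS if lang != target for ex in train[lang]]


def joint_pair(train: Corpus, source: str, target: str) -> List[Example]:
    """Joint training on a language pair: source plus target data."""
    return list(train[source]) + list(train[target])


def joint_all(train: Corpus) -> List[Example]:
    """Joint training on all languages, including the target."""
    return [ex for lang in LANGS for ex in train[lang]]


# Toy corpus with one dummy example per language (placeholder data).
toy: Corpus = {lang: [(f"review in {lang}", "FOOD#QUALITY")] for lang in LANGS}

multi_nl = zero_shot_multi(toy, target="nl")
pair_nl = joint_pair(toy, source="en", target="nl")
print(len(zero_shot_single(toy, "en")), len(multi_nl),
      len(pair_nl), len(joint_all(toy)))
```

In both zero-shot settings the target language contributes no training examples; the test set is always drawn from the target language.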

Discussion
First, we discuss the role of source languages in zero-shot cross-lingual transfer learning based on our experimental results. We found that most previous studies (Larochelle et al., 2008) choose English as the source language for knowledge transfer in zero-shot learning for the following reasons: (1) English is one of the resource-rich languages in the research community (Joshi et al., 2020b); (2) English is among the languages with the largest share of the pre-training data in most current pre-trained multilingual language models (Chi et al., 2021a, 2021b; Conneau et al., 2020; Devlin et al., 2019). However, our experimental results in Table 5 indicate that languages other than English can serve as the source language. For example, transferring knowledge from French produced the highest scores in Spanish for the ACD task, while Spanish gave the best score for Russian in both tasks. In addition, we observe that source languages that belong to the same language group as the target language and have closer linguistic relations, such as English and Dutch or Spanish and French, or that have a close lexical distance (Şenel et al., 2017), such as English and French, produced the highest scores for the target language in the zero-shot setting. Therefore, the linguistic relationship and lexical distance between the source language and the target language play an important role in zero-shot cross-lingual learning. Second, Tables 9 and 10 show the performance of the three approaches for the ACD and CSC tasks, respectively. Generally, we observe that the results of zero-shot learning, in which the model is trained on the source language and tested on the target language, are always lower than the results of monolingual learning, in which the model is trained and tested on the target language. Another interesting point is that training the model on multiple source languages yields a higher F1-score than training on a single source language for all languages in both tasks.
In particular, we notice that zero-shot learning in the multiple-source setting gives results close to those of the monolingual setting for the CSC task in all languages except Russian. The reason might be related to the distance between languages and to the distribution of aspect categories with their corresponding sentiment polarity. Because the CSC task treats the combination of category and polarity as the ground-truth label and has a large imbalance between the classes, combining multiple source languages can increase and diversify the training samples. Comparing the joint training results with those of the other approaches, we can see that joint training improves the performance of the model in all languages, particularly for closely related languages. Furthermore, by combining the training data of all languages, including the target language, the model performs better than when using only one source language for all languages; the difference is significant for the CSC task.
To explore the effectiveness of the joint learning approach, we conducted experiments with different sizes of the training set in the source and target languages. We consider two scenarios: (1) combining the full training set of the source language with part of the training set of the target language; (2) conversely, combining part of the training set of the source language with the full training set of the target language. The source language for each target language is selected based on the joint learning results (see Table 7). For example, the combination of English and Dutch produces the best score for Dutch. As shown in Figures 7 and 8, the model's performance increases with the number of training samples in the target language. These results indicate that adding more annotated data in the target language improves the performance of multilingual models over training on the source language data alone.
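The two data-size scenarios can be sketched as follows. The sketch assumes that "part of the training set" is a random sample of a given fraction; the sampling procedure, the fraction values, and the function names are assumptions made for illustration, since they are not specified in the text.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def take_fraction(examples: Sequence[T], fraction: float, seed: int = 0) -> List[T]:
    """Return a random subset containing `fraction` of the examples.
    Random sampling with a fixed seed is an assumption of this sketch."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return shuffled[: round(len(shuffled) * fraction)]


def scenario_one(source: Sequence[T], target: Sequence[T], fraction: float) -> List[T]:
    """Scenario (1): full source training set + part of the target set."""
    return list(source) + take_fraction(target, fraction)


def scenario_two(source: Sequence[T], target: Sequence[T], fraction: float) -> List[T]:
    """Scenario (2): part of the source training set + full target set."""
    return take_fraction(source, fraction) + list(target)


# Placeholder data: ten dummy examples per language.
source_train = [f"en_{i}" for i in range(10)]
target_train = [f"nl_{i}" for i in range(10)]

for fraction in (0.0, 0.5, 1.0):
    s1 = scenario_one(source_train, target_train, fraction)
    s2 = scenario_two(source_train, target_train, fraction)
    print(f"fraction={fraction}: scenario 1 size={len(s1)}, "
          f"scenario 2 size={len(s2)}")
```

With a fraction of 0.0, scenario (1) reduces to the zero-shot single-source setting, and with a fraction of 1.0 both scenarios coincide with joint training on the full language pair.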

Conclusion
In this paper, we studied the ability of different contextualized multilingual language models in the zero-shot and joint training cross-lingual settings. We conducted experiments on two sub-tasks of the ABSA problem for five languages. For the zero-shot cross-lingual setting, we explored two strategies for fine-tuning the models: (1) training on a single source language; (2) training on multiple source languages. The results showed that it is beneficial to take advantage of multiple source languages in a zero-shot cross-lingual setting. Moreover, the experimental results indicate that the selection of source languages also plays an important role in achieving good results for the target language. Although our results indicated that the performance of zero-shot learning is not as good as that of the monolingual setting, the results are encouraging given that no training data is available for the target language.
For the joint training cross-lingual study, two experiments were conducted to demonstrate the effectiveness of the multilingual language models on the mixed dataset. We found that jointly training a model on a group of linguistically related languages can perform better than training on monolingual data. Through extensive experiments in several languages, we demonstrated the efficacy of cross-lingual joint training. Furthermore, we explored combining the source language data with varying amounts of training samples from the target language. The results indicated that the performance of the model increases with the number of data samples in the target language.
Finally, based on the performance of fine-tuning various pre-trained multilingual language models in five languages, we recommend that future studies compare the performance of different models in the specific language of interest in order to select the best model, especially for low-resource languages. In future work, we plan to examine a broader set of languages (Asian, African, etc.) and tasks to obtain more comprehensive evaluations.

Funding
This research is funded by the University of Information Technology, Vietnam National University Ho Chi Minh City [grant number D1-2023-01].

Notes on contributors
Dang Van Thin is currently a PhD student at the University of Information Technology, VNUHCM. He received his Bachelor's and Master's degrees in computer science from the University of Information Technology, Vietnam National University Ho Chi Minh City, Vietnam, in 2017 and 2020, respectively. He is also a member of the Multimedia Communications Laboratory (MMLab), and his research interests include natural language processing, machine learning, deep learning, and their applications.
Hung Quoc Ngo was awarded a PhD degree in Computer Science by University College Dublin, Ireland. He received a Master's degree in Computer Science from the University of Science, VNUHCM, Vietnam. He is currently a lecturer with the School of Business Technology, Retail, and Supply Chain, Technological University Dublin, Ireland. He was involved in the BioCaster project, building a geographical ontology, integrating the geo-ontology into the Global Health Monitor system, and building the webpage for publishing project results. Recently, he has built knowledge graphs for digital agriculture in the CONSUS project at University College Dublin. His research interests are natural language processing, knowledge management, and data analytics.
Dr Duong Ngoc Hao is a lecturer in the Department of Maths and Physics at the University of Information Technology, VNUHCM. He received the B.S. degree in Math-Informatics from HCMC University of Education and the M.S. degree from the University of Science, Vietnam National University, Ho Chi Minh City, Vietnam. He received his Ph.D. degree from the Institute of Mechanics, Vietnam. His interests are mathematics for computer science, machine learning algorithms, and natural language processing.
Ngan Luu-Thuy Nguyen is a scientist at the University of Information Technology, Vietnam National University, Ho Chi Minh City, Vietnam. She received her PhD degree in information science and technology from the University of Tokyo, Japan. She was a postdoctoral researcher at the National Institute of Informatics, Japan from 2012 to 2013. Her research interests include natural language processing and data analysis.