Arabic stance detection of COVID-19 vaccination using transformer-based approaches: a comparison study

Reema Khaled AlRowais (Department of Computer Sciences, King Saud University, Riyadh, Saudi Arabia)
Duaa Alsaeed (King Saud University, Riyadh, Saudi Arabia)

Arab Gulf Journal of Scientific Research

ISSN: 1985-9899

Article publication date: 23 November 2023


Abstract

Purpose

Automatically extracting stance information from natural language texts is a significant research problem with various applications, particularly after the recent explosion of data on the internet via platforms like social media sites. A stance detection system helps determine whether the author is in favor of, against or neutral toward a given target. Most research in stance detection focuses on the English language, while little research has been conducted on the Arabic language.

Design/methodology/approach

This paper aimed to address stance detection on Arabic tweets by building and comparing different stance detection models using four transformers, namely Araelectra, MARBERT, AraBERT and Qarib. Starting from the pretrained weights of these transformers, the authors performed extensive experiments fine-tuning them for the task of stance detection on Arabic tweets.

Findings

The results showed that the AraBERT model performed better than the other three models, with a 70% F1 score, followed by the Qarib model with a 68% F1 score.

Research limitations/implications

A limitation of this study is the imbalanced dataset and the limited availability of annotated stance detection (SD) datasets in Arabic.

Originality/value

This study provides a comprehensive overview of the current resources for stance detection in the literature, including the datasets and machine learning methods used. The authors also examined the models and analyzed the obtained findings in order to recommend the best-performing models for the stance detection task.

Citation

AlRowais, R.K. and Alsaeed, D. (2023), "Arabic stance detection of COVID-19 vaccination using transformer-based approaches: a comparison study", Arab Gulf Journal of Scientific Research, Vol. ahead-of-print No. ahead-of-print. https://doi.org/10.1108/AGJSR-01-2023-0001

Publisher: Emerald Publishing Limited

Copyright © 2023, Reema Khaled AlRowais and Duaa Alsaeed

License

Published in Arab Gulf Journal of Scientific Research. Published by Emerald Publishing Limited. This article is published under the Creative Commons Attribution (CC BY 4.0) licence. Anyone may reproduce, distribute, translate and create derivative works of this article (for both commercial and non-commercial purposes), subject to full attribution to the original publication and authors. The full terms of this licence may be seen at http://creativecommons.org/licences/by/4.0/legalcode


1. Introduction

Nowadays, social networking sites on the Internet, known as the new social media, are developing and spreading rapidly. People can communicate their feelings and stances on social media in various ways, which has prompted researchers to analyze posts on social media to gain insights. Twitter is one of the most popular social media platforms, allowing users to publish whatever they want. Since people can easily share their locations, opinions and feelings within a short text, Twitter-based research has become more realistic and accessible.

Given the significance that social media plays in modern culture and its power over various sectors of society, the field of stance detection (SD) is unquestionably essential. Its purpose is to collect valuable data for decision making by individuals, businesses, or even governments. Given the volume of data uploaded on social networking platforms, mining information from them is crucial. Thus, we can use this feature to know people's opinions or stances on specific issues (Al-Ghadir, Azmi, & Hussain, 2021).

Natural language processing (NLP) is the ability of machines to understand human language. A subfield of NLP, SD, is one of the most important research topics in NLP for the automatic analysis of content (Al-Ghadir et al., 2021), and it is the classification of the author's position toward a specific target into one of three categories: in favor, against or neutral. SD is a recent NLP topic, and a significant amount of the research is in the English language. In contrast, in Arabic NLP, it is still underexplored and in its early stages. Since Arabic SD is in its infancy, it presents significant research problems with several practical implications, which motivates conducting Arabic SD research.

The aim of this paper is to investigate and compare the use of different transformers in SD to analyze the stances of Arabic society toward COVID-19 vaccines through the social networking site Twitter. People’s positions can help the government visualize how people cope with the situation.

In the current research trend, transformer models are being used for different NLP tasks, including SD. However, such research is readily available for languages such as English and Chinese, while there is minimal research on Arabic SD and a general lack of Arabic resources. Therefore, this project fills this research gap by tackling the SD problem in Arabic using various transformer models. To achieve this purpose, we chose four transformer models, namely Araelectra (Antoun, Baly, & Hajj, 2021), MARBERT (Abdul-Mageed, Elmadany, & Nagoudi, 2021), AraBERT (Antoun, Baly, & Hajj, 2021) and Qarib (Chowdhury et al., 2020), all pretrained on huge text corpora. This project retrains those transformers on the specific problem of SD using the dataset previously introduced in a related work (Mubarak, Hassan, Chowdhury, & Alam, 2022), where it was used with two transformers, namely AraBERT and Qarib. However, our objective extended beyond simple comparison: we aimed to leverage the latest transformers, such as Araelectra and MARBERT, to train and develop new stance detection models.

In addition to exploring newer transformers, we retrained and developed new models using the transformers (AraBERT and Qarib) used in the related work (Mubarak et al., 2022). Our comprehensive experiments involved fine-tuning these four transformers for the task of stance detection on Arabic tweets. The primary goal was to assess their performance and compare our results to the models developed in the related work.

The remaining sections of this paper are organized as follows: Section 2 gives background on the subject matter, including SD and transformer-based models. Section 3 presents related literature in the field. The proposed method is presented in Section 4, while Section 5 discusses the experimental results. Section 6 discusses limitations, and Section 7 concludes this paper with the main findings and future work.

2. Background

This section of the paper provides some background information needed to understand the underlying concepts of NLP and SD. First, it provides the theoretical background on SD, followed by an overview of transformer-based models.

2.1 Stance detection (SD)

The position or standing of a person, object, concept or opinion is referred to as stance (Jannati, Mahendra, Wardhana, & Adriani, 2018). A stance can be favor, against or neutral: "favor" means directly or indirectly supporting someone or something, or disagreeing with or criticizing something or someone opposed to the target; "against" means directly or indirectly rejecting or criticizing someone or something, or supporting something or someone opposed to the target; and "neutral" means taking a neutral posture or failing to effectively communicate one's perspective within the text (Jannati et al., 2018).

SD in NLP is the task of determining whether the author favors, is against or has no opinion on a specific event or topic. It is commonly thought of as a subproblem of sentiment analysis (SA), and it seeks to determine the author's stance toward a target. SA refers to the computational treatment of sentiments and opinions in texts; this problem is commonly equated with detecting a text producer's sentiment polarity, and a classification result, such as positive, negative or neutral, is expected from the SA technique.

The primary difference between the SA and SD problems is that the former is concerned with sentiment without a specific target, whereas the latter is concerned with a specific target (Küçük & Can, 2020). Also, within the same text, the sentiment and the stance toward the target may not match at all; that is, the text's polarity may be positive while the stance is against, and vice versa, as in "we live in a sad world when wanting equality makes you a troll" (Küçük & Can, 2020).

SD is widely recognized to have various practical applications, such as opinion detection, emotion recognition, sarcasm detection, fake news detection and claim validation (Padnekar, Kumar, & Deepak, 2020). It has many uses in the fields of public opinion, politics and marketing. SD is particularly interesting in the field of social media analytics, as it can aid in determining the positions of many users, possibly millions, on various issues (Darwish, Stefanov, Aupetit, & Nakov, 2020). It has also been used in a variety of studies as a way to connect language forms and social identities to better understand the backgrounds of people who hold a polarized stance (ALDayel & Magdy, 2021). However, while humans are perfectly capable of determining the correct stance, ML models typically fall short (Schiller, Daxenberger, & Gurevych, 2021): dataset sizes vary, and ML models that have only been trained on a single dataset tend to underperform when applied to new domains.

2.2 Transformer-based models

Artificial intelligence (AI) and the computational theory of learning introduced ML, which investigates the analysis and development of algorithms that learn from raw data, train a system and generate predictions based on the training data (Chauhan & Singh, 2018). In 1959, Arthur Samuel, a pioneer in the field, defined ML as a "field of study that gives computers the ability to learn without being explicitly programmed" (Chauhan & Singh, 2018). It focuses on how data and algorithms can be used to imitate humans in learning, analysis and decision making while improving accuracy (Chauhan & Singh, 2018).

Conventional ML algorithms and techniques aim to train a system using a training set to produce a trained model (Chauhan & Singh, 2018). However, depending on the data used with the algorithms, the learning process for these algorithms can be supervised, unsupervised or semi-supervised.

Despite being widely used, conventional ML techniques are limited in their ability to analyze data in its natural form. These methods require a high level of understanding and experience; for example, feature selection demands careful engineering (Chauhan & Singh, 2018).

Deep learning (DL), on the other hand, is an advanced ML approach for teaching computers to automatically extract, analyze and understand relevant information from raw data. The outcomes of DL are far superior to those of the conventional ML (Chauhan & Singh, 2018).

DL has been hailed for not requiring as much manual feature engineering as traditional techniques; thus, DL models not only outperform traditional ML but also require less human work, making their adoption easier (Magnini, Lavelli, & Magnolini, 2020). On the other hand, DL algorithms require a massive amount of data to train a network, unlike conventional ML (Chauhan & Singh, 2018).

DL algorithms use multiple processing layers to learn hierarchical data representations and have shown excellent results in numerous fields. In the past few years, NLP has been increasingly focusing on the use of new DL approaches, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) (Young, Hazarika, Poria, & Cambria, 2018). Although CNNs and RNNs are widely used, the primary disadvantage of these DL models is that they require a large number of labeled instances for training, which are expensive to build and maintain (Kalyan, Rajasekharan, & Sangeetha, 2021). Fortunately, one solution that addresses this very problem is transfer learning. This is an ML method in which we reuse a previously trained model as the foundation for a new model on a new task. Simply put, a model trained on one task is repurposed on a second related task as an optimization that leads to faster progress when modeling the second task (Alyafeai, AlShaibani, & Ahmad, 2020). Figure 1 illustrates the basic transfer learning pipeline. As shown, the data needed for building the source model are very large; however, when the knowledge is transferred and the pretrained source model is used to develop a new task-specific model, the data required are far smaller.

Transformer-based models enable transfer learning: they are pretrained on large corpora, which means a large dataset of labeled instances is not needed for a new task, and the pretrained model can be fine-tuned on data from a specific domain (Jurafsky & Martin, n.d.; Alyafeai et al., 2020).

Transformers are a recent class of DL models built on neural networks. They primarily use the self-attention mechanism to extract intrinsic features and hold great promise for AI applications (Vaswani et al., 2017; Han et al., 2021).

Antoun, Baly et al. (2021) proposed AraBERT, the first transformer-based language model for Arabic. AraBERT was evaluated on different Arabic NLP tasks and achieved state-of-the-art performance in SA, question answering and named-entity recognition. ARBERT and MARBERT, provided by Abdul-Mageed et al. (2021), support transfer learning on modern standard Arabic (MSA) and Arabic dialects and were pretrained on massive datasets: ARBERT was trained on 61 GB of MSA, and MARBERT was trained on 28 GB of dialectal Arabic from social media. QARiB, provided by Chowdhury et al. (2020), was evaluated on Arabic text categorization and was trained on the Arabic GigaWord corpus, the Abulkhair Arabic Corpus, OpenSubtitles and 50 million tweets. The topologies, sizes and types of training data used by these models differ (Abu Farha & Magdy, 2021).
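As a brief illustration of how such pretrained checkpoints can be loaded for fine-tuning, the following minimal sketch uses the Hugging Face transformers library. The model identifiers in the comments are the checkpoints commonly published on the Hugging Face Hub and are our assumption, not names stated in this paper.

```python
# A minimal sketch (not the authors' exact code) of loading Arabic transformer
# checkpoints with a fresh three-class stance classification head
# (pro-vaccination, neutral, anti-vaccination). The Hub model IDs below are
# assumptions, not identifiers given in this paper.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CHECKPOINTS = {
    "AraBERT": "aubmindlab/bert-base-arabertv02",
    "Araelectra": "aubmindlab/araelectra-base-discriminator",
    "MARBERT": "UBC-NLP/MARBERT",
    "Qarib": "qarib/bert-base-qarib",
}

def load_model(name: str):
    """Return (tokenizer, model) for one of the four transformers."""
    checkpoint = CHECKPOINTS[name]
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)
    return tokenizer, model
```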

3. Related work

This section reviews recent work related to SD in Twitter data. From our review, we found that most work in SD can be categorized into three approaches: rule-based ML (RBML), DL algorithms and transformer-based methods.

Several studies have used RBML for SD. Tun and Hninn Myint (2019) proposed a two-phase approach for stance classification on Twitter. In the first phase, a Naive Bayes classifier was used to determine whether a tweet's stance was neutral or non-neutral. In the second phase, a decision tree (DT) classifier was used to classify the non-neutral tweets as favor or against. The method was tested on the SemEval 2016 Task A and Task B datasets (4,870 tweets) and compared with the majority-class and SVM-ngrams baselines. The proposed approach performed well, with an F1 score of 68.6 in Task A and 59.2 in Task B, compared to the best baseline F1 scores of 68.98 in Task A and 29.72 in Task B (Tun & Hninn Myint, 2019).

To detect the stance of people's tweets, Aldayel and Magdy (2019) employed additional features about the users, such as their topic postings, the networks they followed, the websites they frequently visited and the content they liked. To classify the stance of tweets, the researchers used an SVM model with a linear kernel. A macro-average of the F1 scores for the "against" and "favor" classes was calculated, with the F1 score for the "none" class omitted. On the SemEval-2016 Task 6 dataset, the method showed promising results, with an F1 score of 72.49% compared to the baseline's 68.98%. Another study (Al-Ghadir et al., 2021) used the same dataset and examined different classifiers, namely SVM, k-nearest neighbor (K-NN), weighted K-NN (WKNN) and class-based K-NN (CKNN), for SD. WKNN was the strongest classifier, with an F1 score of 76.45%.

Darwish et al. (2020) developed an unsupervised framework for determining Twitter users' opinions on sensitive topics. Their method employed dimensionality reduction using the Uniform Manifold Approximation and Projection (UMAP) algorithm to project users into a low-dimensional space, followed by mean-shift clustering to identify core users representing the various stances. They compared their framework to previous methods based on semi-supervised or supervised categorization. The results showed that, using UMAP, these setups were able to identify groupings of users according to their main stance on contentious topics with more than 98% accuracy. Their framework offers two key advantages: first, it creates clusters without requiring labeled users; second, it does not require domain- or topic-level knowledge to conduct the labeling.

Kovacs, Cotfas, Delcea, and Florescu (2023) analyzed tweets about COVID-19 vaccination, using an SVM model to identify the gender of the author with 85% classification accuracy. They also detected the stance and showed that RoBERTa was the most effective classifier, with an accuracy of 93.64%.

On the other hand, new approaches relying on DL have been applied in the SD field. The fake news challenge (FNC) benchmark dataset has been used in several studies of SD. Padnekar et al. (2020) presented a stance prediction architecture based on bidirectional long short-term memory (Bi-LSTM) and an autoencoder. The system was tested using the FNC-1 corpus, which has a training set of 50,000 instances and a test set of 25,000 instances with stance labels; their proposed method achieved 94% classification accuracy. Another study (Santosh, Bansal, & Saha, 2019) used the same dataset provided by the FNC and aimed to demonstrate the effectiveness of the Siamese adaptation of LSTM networks for SD; it achieved an FNC score of 0.85, compared with the strongest baseline's FNC score of 0.819. Also, Mohtarami et al. (2018) used the same dataset with an end-to-end memory network combining CNN and LSTM components, achieving a macro-F1 of 56.75 against a baseline macro-F1 of 40.33.

Sobhani, Inkpen, and Zhu (2017) presented a multitarget stance dataset in which each instance is labeled with respect to several targets. They proposed a bidirectional RNN-based attentive encoder–decoder to capture the interdependence between stance labels for multiple targets. For example, a model built using this approach should be able to classify a tweet in terms of both Clinton and Trump at the same time. While the framework allows for more than two targets, it is still limited to a small number of targets.

Liviu-Adrian et al. (2021) and Ahmed et al. (2023) investigated the opinion dynamics surrounding the COVID-19 vaccine by performing sentiment analysis on various vaccine-related tweets.

Recently, most research in NLP in general, and in SD in particular, has focused on transformer approaches. Schiller et al. (2021) presented a multidataset learning (MDL) benchmark based on the BERT architecture; their findings indicate that multitask and multidataset learning can improve the accuracy and robustness of SD, and that transfer learning can also enhance performance. Another study (Lin, Wu, Chou, Lin, & Kao, 2020) used BERT to detect the stance in FNC-1 and achieved an accuracy of 88.7%. Ghosh, Singhania, Singh, Rudra, and Ghosh (2019) examined SD approaches on two datasets (SemEval 2016 and the multiperspective consumer health query (MPCHI) dataset) and found that the pretrained BERT model outperforms existing approaches for SD, with an F1 score of 0.751 on the SemEval dataset and 0.756 on MPCHI. Müller, Salathé, and Kummervold (2020) proposed COVID-Twitter-BERT (CT-BERT), a model pretrained on a large corpus of COVID-19-related Twitter messages; one of their evaluation datasets covered the stance toward maternal vaccines. The results show that CT-BERT performs better, with an average F1 score of 0.833, compared to BERT-LARGE (Devlin, Chang, Lee, & Toutanova, 2019) with an average F1 score of 0.802.

From the reviewed studies, one can say that most work in SD targets the English language, with minimal work on other natural languages. One study of SD in another language, conducted by Kayalvizhi, Thenmozhi, and Chandrabose (2021), addressed Italian: the BERT transformer was applied to detect authors' stance in their tweets. Their model achieved an average F1 score of 0.4707, outperforming the encoder–decoder model, which gave an average F1 score of 0.4473.

Alhindi, Alabdulkarim, Alshehri, Abdul-Mageed, and Nakov (2021) presented a new Arabic SD dataset (AraStance), which contains 4,063 claim–article pairs from various domains and Arab countries. They investigated a variety of BERT-based models pretrained on Arabic or multilingual data, fine-tuned and applied to their dataset. The best model had a macro F1 score of 78% and an accuracy of 85%.

Table 1 summarizes the reviewed studies on SD. To conclude, the majority of SD research is in the English language, and there is a gap in the Arabic language. The literature review has also shown many developments in the field of NLP and in the use of transfer learning, which encouraged us to investigate the use of transformers for SD in Arabic.

4. Proposed method

This section discusses the proposed method and shows the workflow of the experiments. First, we perform some preprocessing operations on the dataset. Second, we choose several pretrained transformer-based models and retrain them on the SD problem using an Arabic SD dataset. Since the data used to train these pretrained models come from different applications and tasks, it is necessary to fine-tune these models for our own domain and target task using domain-related data; in our case, we need to fine-tune the models for SD using COVID-19 vaccination Twitter data. Third, we use a set of evaluation metrics to evaluate the performance of all retrained transformer-based models. Finally, we analyze and compare the models' performance. The framework adopted in this study is illustrated in Figure 2. Next, we discuss in brief the different parts of this framework.

4.1 Dataset

In this study, the models are trained on "ArCovidVac" (Mubarak et al., 2022), a manually annotated Arabic tweet dataset for COVID-19 vaccination. The dataset contains 10k tweets covering many Arab regions. The stance was annotated using the following labels: pro-vaccination (positive), neutral and anti-vaccination (negative). To collect the tweets, the authors used the twarc [1] search API with a set of vaccine-related keywords between January 5 and February 3, 2021. They used the Appen [2] crowdsourcing platform for manual annotation and applied standard quality-control measures (Alshaabi et al., 2021) to improve annotation quality; for example, to participate in the annotation activity, each annotator was required to complete at least 70% of the tweets. Moreover, they calculated the annotation agreement using Cohen's kappa coefficient and obtained a score of 0.82, indicating high annotation quality (Mubarak et al., 2022). Figure 3 shows the distribution of the dataset.

In the experimental setting, regular training used an 80:20 split, and 5-fold validation with an ensemble approach was used for more reliable results.
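The sketch below illustrates one way to realize this setup with scikit-learn; the helper name and the use of stratified splits are our assumptions, not details given in the paper.

```python
# A minimal sketch (our assumption of the setup, not the authors' code) of the
# 80:20 split and the 5-fold validation described above.
from sklearn.model_selection import train_test_split, StratifiedKFold

def make_splits(texts, labels, seed=42):
    # Regular training: 80:20 split, stratified to preserve label proportions.
    train_x, test_x, train_y, test_y = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=seed)

    # 5-fold validation: each fold yields (train indices, validation indices);
    # an ensemble can average the predictions of the five fold models.
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    fold_indices = list(folds.split(train_x, train_y))
    return (train_x, train_y), (test_x, test_y), fold_indices
```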

4.2 Data preprocessing

Texts collected from the web are unstructured. The data included irrelevant content such as English letters, special characters and punctuation, as well as inconsistent orthographic forms such as Hamza-Alif (أ) versus bare Alif (ا). They must be converted to a machine-readable format using data cleaning and preprocessing techniques.

The following are the data cleaning and preprocessing steps usually applied to an Arabic language corpus (a code sketch follows the list):

  1. Remove numbers, all English letters and newlines.

  2. Remove all words containing an underscore, hashtag sign or @ sign from the corpus.

  3. Remove special characters including symbols and emojis.

  4. Remove repeated (elongated) characters.

  5. Arabic normalization, i.e. the unification of characters that have many forms, as demonstrated in Figure 4.
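The following minimal sketch implements these steps; the exact regular expressions and the normalization mapping are our assumption of a typical implementation, not the authors' published code (the paper's Figure 4 shows the exact character mapping used).

```python
# A minimal preprocessing sketch for Arabic tweets following the steps above.
# Regexes and the normalization mapping are assumptions, not the authors' code.
import re

def clean_tweet(text: str) -> str:
    text = text.replace("\n", " ")                    # 1. newlines
    text = re.sub(r"\S*[_#@]\S*", " ", text)          # 2. words with _, # or @
    text = re.sub(r"[A-Za-z0-9]+", " ", text)         # 1. English letters, numbers
    text = re.sub(r"[^\u0600-\u06FF\s]", " ", text)   # 3. symbols, emojis, punctuation
    text = re.sub(r"(.)\1{2,}", r"\1", text)          # 4. repeated (elongated) chars
    # 5. Arabic normalization (common variants; Figure 4 shows the exact mapping)
    text = re.sub(r"[أإآ]", "ا", text)                # Hamza/Madda Alif -> bare Alif
    text = text.replace("ى", "ي")                     # Alif Maqsura -> Ya
    return re.sub(r"\s+", " ", text).strip()
```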

4.3 Fine-tuning transformer models

As mentioned earlier, this project aims to explore the use of transformers in Arabic SD. Thus, we performed extensive experiments retraining several pretrained transformer models for the SD problem using COVID-19 vaccination Twitter data. These transformers are as follows:

  1. Araelectra, based on ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately), was selected as it has proved effective in Arabic SA and named-entity recognition. Its training dataset was 77 GB and was mostly made up of news articles.

  2. The second transformer is MARBERT, which has been examined on a variety of NLP applications such as SA, classification and dialect identification. MARBERT was trained on an Arabic Twitter dataset of 1B tweets, covering both MSA and dialectal Arabic.

  3. AraBERT is an Arabic-specific BERT provided by Antoun, Baly et al. (2021). AraBERT was trained on about 24 GB of data with a 64k vocabulary size. It has been used in various Arabic NLP tasks, such as SA, named-entity recognition and question answering.

  4. Qarib was provided by Chowdhury et al. (2020). This model was trained using a variety of data sources, including news articles and tweets.

4.4 Fine-tuning process

Since the pretrained models were trained on data from a different domain and for a different NLP task, one needs to retrain the model for the specific task using problem-domain data. This retraining involves fine-tuning: a supervised learning method in which the weights of a pretrained model are used as the initial weights for a new model trained on a related task (Imran & Amin, 2021; Rifat & Imran, 2021). It is an extremely effective training strategy that takes a pretrained model and trains it on a dataset relevant to the task at hand; it not only expedites training but also produces state-of-the-art models for a wide range of NLP tasks.

In this project, to retrain a pretrained model on a new task, as is the case with transformers, we need to tune the training parameters to get the best performance from each model.
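A minimal sketch of one fine-tuning run is shown below, assuming the Hugging Face Trainer API and the hyperparameters reported in Table 2; the dataset wrapper and the helper names are ours, not the authors'.

```python
# A minimal sketch (our assumption, not the authors' code) of one fine-tuning
# run with the Hugging Face Trainer, using the Table 2 hyperparameters.
import torch
from transformers import Trainer, TrainingArguments

class StanceDataset(torch.utils.data.Dataset):
    """Wrap tokenized tweets and integer stance labels (0/1/2)."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

def fine_tune(tokenizer, model, train_x, train_y, val_x, val_y,
              lr=8e-5, epochs=4, batch_size=128, max_len=120):
    # Tokenize with the sequence length chosen in section 4.4 (120 tokens).
    enc = lambda xs: tokenizer(xs, truncation=True, padding="max_length",
                               max_length=max_len)
    args = TrainingArguments(output_dir="out", learning_rate=lr,
                             per_device_train_batch_size=batch_size,
                             num_train_epochs=epochs,
                             evaluation_strategy="epoch")
    trainer = Trainer(model=model, args=args,
                      train_dataset=StanceDataset(enc(train_x), train_y),
                      eval_dataset=StanceDataset(enc(val_x), val_y))
    trainer.train()
    return trainer
```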

Training parameters: When training a model, hyperparameter tuning uses the processing infrastructure to evaluate multiple hyperparameter configurations. It can provide us with optimal hyperparameter values, thereby increasing the prediction accuracy of our model. Hyperparameters are critical for improving model performance in NLP. The main parameters are as follows:

  1. Batch size: The number of instances (token sequences) in each mini-batch. We used batches of 128 sequences, since our hardware has enough memory (Izsak, Berchansky, & Levy, 2021).

  2. Learning rate: Starting at 2e–5, the learning rate gradually warms up to its peak value before it begins to decay. We tested a range of peak learning rates between 2e–5 and 10e–5.

  3. Epoch: This is the number of passes over the entire dataset. Since our hardware memory is limited, we tested 2–4 epochs.

  4. Sequence length: The input sequence length must also be determined. Figure 5 presents the frequency distribution of text sequence lengths: the average sequence length is 20 and the maximum is 120. Therefore, to preserve as much of the data as possible, we selected the largest sequence length present in the data (120); a short sketch of this length analysis follows the list.
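The following is a minimal sketch (our assumption) of how the sequence-length distribution can be inspected to choose the maximum input length:

```python
# Tokenize every tweet and inspect the length distribution to pick max_length.
from statistics import mean

def choose_max_length(tokenizer, texts):
    lengths = [len(tokenizer(t)["input_ids"]) for t in texts]
    print(f"average length: {mean(lengths):.1f}, maximum length: {max(lengths)}")
    return max(lengths)  # 120 for this dataset, per Figure 5
```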

To determine the optimal number of epochs, we manually tuned the hyperparameters over the suggested values, training for up to four epochs while observing the validation loss.

In our fine-tuning process, we followed the fine-tuning strategy recommended by Clark, Luong, Le, and Manning (2020) and Mubarak et al. (2022), and we used the same training parameters for each of the models. Table 2 lists the values tested for each parameter; a sketch of the resulting search loop follows.
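The short sketch below shows the hyperparameter search over the Table 2 grid, reusing the hypothetical load_model and fine_tune helpers from the earlier sketches and assuming the splits produced above are in scope.

```python
# A short sketch (our assumption) of the search over the Table 2 grid.
LEARNING_RATES = [2e-5, 3e-5, 5e-5, 8e-5, 9e-5, 10e-5]
EPOCHS = [2, 3, 4]

results = {}
for name in ["Araelectra", "MARBERT", "AraBERT", "Qarib"]:
    for lr in LEARNING_RATES:
        for n_epochs in EPOCHS:
            tokenizer, model = load_model(name)  # fresh weights for each run
            trainer = fine_tune(tokenizer, model, train_x, train_y,
                                val_x, val_y, lr=lr, epochs=n_epochs)
            results[(name, lr, n_epochs)] = trainer.evaluate()
```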

4.5 Evaluation

Based on our literature review, the measures commonly used for evaluating the performance of AI models in NLP in general, and SD models in particular, are accuracy, precision, recall and F1 score. Thus, we use these metrics to evaluate and compare the performance of the different models; accordingly, we selected the best-performing model to use in our dashboard. These evaluation metrics depend on counting the following factors:

  1. True positive (TP): Correctly classify an observation as positive

  2. False positive (FP): Wrongly classify a negative observation as positive

  3. True negative (TN): Correctly classify an observation as negative

  4. False negative (FN): Wrongly classify a positive observation as negative

Table 3 shows the description and equation of each metric (Jurafsky & Martin, n.d.); a sketch of computing these metrics follows.
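The Table 3 metrics can be computed with scikit-learn as in the minimal sketch below; the macro averaging choice is our assumption, and the function can be passed to the Trainer via compute_metrics=compute_metrics in the earlier fine-tuning sketch.

```python
# A minimal sketch (our assumption) of computing accuracy, precision, recall
# and F1 as defined in Table 3, usable as a Trainer compute_metrics callback.
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)          # predicted stance class per tweet
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="macro", zero_division=0)
    return {"accuracy": accuracy_score(labels, preds),
            "precision": precision, "recall": recall, "f1": f1}
```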

5. Experiments and results

In this section, we describe the experimental setup in detail, utilizing the four different models. We investigated a large variety of hyperparameters and analyzed the effect of each hyperparameter on model performance, synchronizing the learning rate schedule with each epoch for each model.

As planned, we trained the four transformers (Araelectra, Marabert, AraBERT and Qarib). For fine-tuning the training parameters, we tested different settings for each of the four transformers, using different values for the learning rate and number of epochs as shown in Table 4, with the batch size and sequence length fixed as in Table 2.

In the following subsections, we discuss the performance of the developed models using each of the four transformers.

5.1 Results of Araelectra

The stance detection model developed with Araelectra was tested with different parameter settings; the results are shown in Table 4. The performance of the model improved after 4 epochs compared to 2 and 3 epochs. Within the four epochs, the peak performance was attained at a learning rate of 9e–5, where the model reached an accuracy of 83% and an F1 score of 66%. The worst performance with Araelectra was at two epochs and a learning rate of 2e–5, with an accuracy of 80% and an F1 score of 33%.

5.2 Results of Marabert

Table 5 presents the performance of the Marabert model with different parameters. As with Araelectra, the performance improved after 4 epochs. The best performance was attained at a learning rate of 5e–5, with an accuracy of 84% and an F1 score of 67%. Marabert's worst performance was at a learning rate of 2e–5 and two epochs.

5.3 Results of Qarib

Qarib is another transformer used in Mubarak et al. (2022) to develop an SD model; the authors reported their best result as an accuracy of 81% and an F1 score of 63% with three epochs and a learning rate of 8e–5. As with AraBERT, we tested Qarib with several parameter settings; the results are shown in Table 6. A similar effect to that seen with AraBERT was observed with Qarib: performance improved with the increase in the number of epochs and changes in the learning rate. The best results achieved by Qarib were an accuracy of 84% and an F1 score of 68%, with four epochs and a learning rate of 8e–5.

5.4 Results of AraBERT

As discussed earlier in the literature review, the AraBERT transformer has been used to develop an SD model (Mubarak et al., 2022). In their experiments, the authors achieved an accuracy of 82.2% and an F1 score of 62.2% at a learning rate of 8e–5 and 3 epochs. To compare the performance of the different transformers with AraBERT and to investigate the effect of changing the learning parameters, we tested AraBERT again on the same dataset but with different parameters, i.e. different values of the learning rate and number of epochs; in addition, a sequence length of 120 was used, which was not reported in their work. The performance achieved in the different settings is reported in Table 7. As shown, increasing the number of epochs improved the performance of our model compared to what was achieved in Mubarak et al. (2022). The best performance was an accuracy of 85% and an F1 score of 70% at a learning rate of 8e–5 and 4 epochs.

5.5 Comparison between different transformers

To compare all transformers tested in this work, we compared the best performance achieved by each transformer; Table 8 and Figure 6 show those results. The accuracy of all transformers ranged between 83% and 85%, showing small differences. However, the F1 measure shows a wider range of differences, from 66% to 70%. Taking all evaluation metrics into consideration (accuracy, F1, precision and recall), we found that AraBERT outperformed the other transformers in all measures with a setting of an 8e–5 learning rate and four epochs. Conversely, Araelectra had the worst performance of all models in all metrics.

As discussed earlier, the performance of different transformers was affected by the change in the number of epochs. Figure 7 shows the different F1 scores achieved with different epochs. As clearly seen, all models (Araelectra, Marabert, AraBERT and Qarib) reached their peak at four epochs.

We also compared the performance of Qarib and AraBERT in our work with the results reported in Mubarak et al. (2022). As seen in Table 8, both AraBERT and Qarib have enhanced results compared to what was achieved in (Mubarak et al., 2022) in terms of accuracy, precision, recall and F1.

The change in training parameters was shown to enhance performance. Figure 8 compares the results achieved with both transformers in our work and those reported in Mubarak et al. (2022). AraBERT's accuracy improved from 82% to 85% and its F1 increased from 62% to 70%. A similar observation holds for Qarib: its accuracy increased from 81% to 84% and its F1 from 62% to 68%.

6. Limitations

This study has limitations posed by the imbalanced dataset and its potential impact on model performance. As part of our future work, we plan to address this limitation by undertaking a new study to construct our own dataset for Arabic stance detection.

7. Conclusion and future work

In this study, we aimed to investigate and compare the performance of transformer-based models in Arabic stance detection. We developed transformer-based models for Arabic SD, applied as a case study on COVID-19 vaccination using Arabic Twitter data, namely the ArCovidVac dataset (Mubarak et al., 2022).

In our extensive experiments, we tested four Arabic transformers (Araelectra, Marabert, AraBERT and Qarib). In terms of stance classification performance, our AraBERT-based model outperformed the other tested transformers, representing a high-performance classifier with an F1 score of 70%. It also outperformed the model developed by Mubarak et al. (2022), whose F1 score was 62%. The worst-performing model was based on Araelectra, with an F1 score of 66%.

Research in Arabic SD faces a significant challenge due to the lack of Arabic stance detection datasets, which limits the options for alternative sources of data. We encountered this challenge in our research, and to overcome it, we turned to the dataset previously introduced in Mubarak et al. (2022) as a starting point. This was also a good option because our aim was to compare the use of transformers in Arabic SD with the results of the study by Mubarak et al. (2022) that introduced this dataset.

While the dataset helped us establish a foundation for our work, we recognized the need for improvements in the methodology to address the issues related to dataset imbalance.

As part of our future work, we acknowledge the importance of addressing the problem of imbalanced data directly. We intend to explore techniques specifically designed to tackle the class imbalance issue in Arabic SD datasets and compare the results to our results found in this study.

In addition, another challenge is that the dataset is small, which may affect the retraining and learning process. Therefore, a future extension of this work would be to experiment with further datasets and explore the effect of dataset size on the performance of different transformers. To accomplish this, we need to build a new, larger corpus for SD from Arabic Twitter data. We have already started by collecting a new dataset of tweets to construct a corpus with a balanced distribution across the three stance labels, unlike the ArCovidVac dataset (Mubarak et al., 2022). Part of the process of creating a corpus is data annotation; for this purpose, three Arabic native speakers were recruited to annotate the collected tweets. Annotation guidelines were given to the annotators to explain the labels and ensure accurate labeling; the adopted guidelines are similar to those proposed by Alhindi et al. (2021).

Another research direction to expand and enhance this work is to test the models on other case studies not related to COVID-19, using Twitter data on general or other trending topics on social media.

Figures

Figure 1: Transfer learning outline
Figure 2: Framework of stance detection
Figure 3: Distribution of the dataset
Figure 4: Arabic text normalization
Figure 5: Sequence length distribution
Figure 6: The performance of the different models
Figure 7: F1 over the number of epochs
Figure 8: Comparison between our results and those reported in Mubarak et al. (2022)

Table 1: A summary of stance detection studies

Source | Natural language | Data type and name | Techniques | Compared with | Result
ML approaches | | | | |
Tun and Hninn Myint (2019) | English | Twitter (SemEval2016 Task A and Task B) | Naive Bayes and decision tree classifiers | SVM-ngrams: F1 score 68.98 in Task A; majority class: F1 score 29.72 in Task B | F1 score 68.6 in Task A; F1 score 59.2 in Task B
Aldayel and Magdy (2019) | English | Twitter (SemEval2016 Task 6) | SVM model with a linear kernel | Linear SVM model, F1 score of 68.98% (Mohammad, Kiritchenko, Sobhani, Zhu, & Cherry, 2016) | F-measure of 72.49%
Al-Ghadir et al. (2021) | English | Twitter (SemEval2016 Task 6) | WKNN, CKNN and SVM | | WKNN was the most powerful classifier, F-score of 76.45%
Darwish et al. (2020) | English and Turkish | Two dataset types: (1) labeled "Kavanaugh" (English), "Trump" (English) and "Erdogan" (Turkish); (2) unlabeled tweets collected on six polarizing topics in the USA | Unsupervised approach: UMAP and mean shift | Supervised approach: SVM and fastText (Joulin, Grave, Bojanowski, & Mikolov, 2016), precision 86.0% | Precision 99.1% cluster purity
Ahmed et al. (2023) | English | "COVID-19 All Vaccines Tweets" | Extra tree classifier (ETC) | TF-IDF, BoW, Word2Vec | ETC with BoW achieved 92% accuracy
DL approaches | | | | |
Sobhani et al. (2017) | English | Tweets collected on the 2016 US election, with four presidential aspirants ("Donald Trump," "Hillary Clinton," "Ted Cruz" and "Bernie Sanders") as targets | Deep RNNs (Seq2Seq) | SVM (Pedregosa et al., 2011), F-macro of 52.05 | F-macro of 54.81
Padnekar et al. (2020) | English | "Fake News Challenge (FNC)" | BiLSTM | | Accuracy 94%
Santosh et al. (2019) | English | FNC | Siamese adaptation of LSTM networks | Baseline (MLP-6), FNC score 0.819 | FNC score of 0.85
Mohtarami et al. (2018) | English | FNC | CNN + LSTM (sMemNN) | Baseline CNN and LSTM, macro-F1 of 40.33 | Macro-F1 of 56.75
Transformer approaches | | | | |
Schiller et al. (2021) | English | Combined datasets from different domains (ibmcs, semeval2019t7, semeval2016t6, fnc1, snopes, scd, perspectrum, iac1, arc, argmin) | BERT | | Transfer learning and multi-dataset learning can improve performance
Alhindi et al. (2021) | Arabic | Arabic Stance Detection dataset (AraStance) of 4,063 claims, covering false and true claims from multiple domains (e.g. politics, sports, health) and several Arab countries | BERT | | Accuracy of 85% and a macro F1 score of 78%
Lin et al. (2020) | English | "Fake News Challenge" Stage 1 (FNC-1) | BERT | CNN + LSTM, macro-F1 of 40.33 (Mohtarami et al., 2018) | Macro-F1 of 75.96
Kayalvizhi et al. (2021) | Italian | Italian tweets about the Sardines movement | BERT | Encoder–decoder model, F1 score of 0.4473 | F1-average score of 0.47
Ghosh et al. (2019) | English | SemEval 2016 and MPCHI | BERT | CNN (pkudblab at SemEval-2016 Task 6, n.d.), F1 score of 0.690 (SemEval dataset) | F1 score of 0.75
Müller et al. (2020) | English | Maternal Vaccine Stance (MVS) | CT-BERT | BERT-LARGE (Devlin et al., 2019), mean F1 score of 0.802 | Mean F1 score of 0.833

Source(s): Authors’ own work

Table 2: Parameter values

Parameter | Value
Learning rate | 2e–5 to 10e–5
Batch size | 128
Number of epochs | 2–4
Sequence length | 120

Source(s): Authors’ own work

Table 3: Evaluation metrics

Metric | Description | Equation
Accuracy (AC) | The proportion of correctly classified observations | (TP + TN) / (TP + TN + FP + FN)
Precision (P) | The ratio of observations correctly classified as positive to all observations classified as positive | TP / (TP + FP)
Recall (R) | The ratio of observations correctly classified as positive to all actual positive observations | TP / (TP + FN)
F1 score (F1) | The harmonic mean of precision and recall | 2PR / (P + R)

Source(s): Authors’ own work

Table 4: The performance of Araelectra

Learning rate | Epochs | Accuracy | F1 | Recall | Precision
2e–5 | 2 | 80% | 33% | 35% | 51%
3e–5 | 2 | 80% | 40% | 40% | 43%
5e–5 | 2 | 81% | 54% | 50% | 69%
2e–5 | 3 | 80% | 43% | 44% | 44%
3e–5 | 3 | 81% | 54% | 51% | 66%
5e–5 | 3 | 82% | 60% | 57% | 66%
2e–5 | 4 | 81% | 52% | 50% | 68%
3e–5 | 4 | 82% | 60% | 57% | 65%
5e–5 | 4 | 83% | 64% | 61% | 67%
8e–5 | 4 | 83% | 65% | 63% | 68%
9e–5 | 4 | 83% | 66% | 65% | 67%
10e–5 | 4 | 83% | 64% | 62% | 67%

Source(s): Authors’ own work

Table 5: The performance of Marabert

Learning rate | Epochs | Accuracy | F1 | Recall | Precision
2e–5 | 2 | 83% | 53% | 50% | 75%
3e–5 | 2 | 84% | 60% | 56% | 69%
5e–5 | 2 | 84% | 63% | 60% | 68%
2e–5 | 3 | 84% | 63% | 59% | 69%
3e–5 | 3 | 84% | 64% | 61% | 68%
5e–5 | 3 | 84% | 66% | 65% | 68%
2e–5 | 4 | 84% | 63% | 61% | 67%
3e–5 | 4 | 84% | 66% | 64% | 68%
5e–5 | 4 | 84% | 67% | 65% | 68%
6e–5 | 4 | 84% | 66% | 64% | 68%
8e–5 | 4 | 84% | 65% | 62% | 68%

Source(s): Authors’ own work

Table 6: The performance of Qarib

Learning rate | Epochs | Accuracy | F1 | Recall | Precision
2e–5 | 2 | 83% | 54% | 49% | 70%
3e–5 | 2 | 83% | 59% | 55% | 68%
5e–5 | 2 | 83% | 64% | 60% | 70%
2e–5 | 3 | 83% | 61% | 57% | 69%
3e–5 | 3 | 84% | 66% | 63% | 70%
5e–5 | 3 | 84% | 67% | 65% | 70%
2e–5 | 4 | 84% | 65% | 61% | 69%
3e–5 | 4 | 84% | 66% | 64% | 70%
5e–5 | 4 | 84% | 68% | 65% | 70%
8e–5 | 4 | 84% | 68% | 67% | 68%

Source(s): Authors’ own work

Table 7: The performance of AraBERT

Learning rate | Epochs | Accuracy | F1 | Recall | Precision
2e–5 | 2 | 82% | 55% | 50% | 68%
3e–5 | 2 | 83% | 60% | 55% | 68%
5e–5 | 2 | 84% | 64% | 61% | 70%
2e–5 | 3 | 83% | 62% | 59% | 68%
3e–5 | 3 | 84% | 66% | 64% | 70%
5e–5 | 3 | 85% | 68% | 66% | 71%
2e–5 | 4 | 84% | 66% | 64% | 70%
3e–5 | 4 | 85% | 68% | 67% | 70%
5e–5 | 4 | 85% | 69% | 68% | 71%
8e–5 | 4 | 85% | 70% | 68% | 72%
9e–5 | 4 | 85% | 70% | 69% | 71%
10e–5 | 4 | 85% | 70% | 69% | 71%
11e–5 | 4 | 85% | 70% | 69% | 70%
12e–5 | 4 | 84% | 69% | 69% | 69%

Source(s): Authors’ own work

Table 8: Results for the different models

Model | Learning rate | Epochs | Accuracy | F1 | Recall | Precision
Our results | | | | | |
Araelectra | 9e–5 | 4 | 83% | 66% | 65% | 67%
Marabert | 5e–5 | 4 | 84% | 67% | 65% | 68%
AraBERT | 8e–5 | 4 | 85% | 70% | 68% | 72%
Qarib | 8e–5 | 4 | 84% | 68% | 67% | 68%
Results reported in Mubarak et al. (2022) | | | | | |
AraBERT (Mubarak et al., 2022) | 8e–5 | 3 | 82% | 62% | 62% | 61%
Qarib (Mubarak et al., 2022) | 8e–5 | 3 | 81% | 62% | 65% | 64%

Source(s): Authors’ own work

Notes

References

Abdul-Mageed, M., Elmadany, A., & Nagoudi, E. M. B. (2021). Arbert & MARBERT: Deep bidirectional transformers for Arabic, ArXiv:2101.01785 [Cs]. Available from: http://arxiv.org/abs/2101.01785

Abu Farha, I., & Magdy, W. (2021). Benchmarking transformer-based language models for Arabic sentiment and sarcasm detection. Proceedings of the Sixth Arabic Natural Language Processing Workshop (pp. 21–31). Available from: https://aclanthology.org/2021.wanlp-1.3

Ahmed, S., Khan, D. M., Sadiq, S., Umer, M., Shahzad, F., Mahmood, K., … Ashraf, I. (2023). Temporal analysis and opinion dynamics of COVID-19 vaccination tweets using diverse feature engineering techniques. PeerJ Computer Science, 9, e1190. doi: 10.7717/peerj-cs.1190.

Al-Ghadir, A. I., Azmi, A. M., & Hussain, A. (2021). A novel approach to stance detection in social media tweets by fusing ranked lists and sentiments. Information Fusion, 67, 29–40. doi: 10.1016/j.inffus.2020.10.003.

Aldayel, A., & Magdy, W. (2019). Your stance is exposed! Analysing possible factors for stance detection on social media. Proceedings of the ACM on Human-Computer Interaction, 3(CSCW), 205:1–205:20. doi: 10.1145/3359307.

ALDayel, A., & Magdy, W. (2021). Stance detection on social media: State of the art and trends. Information Processing and Management, 58(4), 102597. doi: 10.1016/j.ipm.2021.102597.

Alhindi, T., Alabdulkarim, A., Alshehri, A., Abdul-Mageed, M., & Nakov, P. (2021). AraStance: A multi-country and multi-domain dataset of Arabic stance detection for fact checking, ArXiv:2104.13559 [Cs]. Available from: http://arxiv.org/abs/2104.13559

Alshaabi, T., Dewhurst, D. R., Minot, J. R., Arnold, M. V., Adams, J. L., Danforth, C. M., & Dodds, P. S. (2021). The growing amplification of social media: Measuring temporal and social contagion dynamics for over 150 languages on Twitter for 2009-2020. EPJ Data Science, 10(1), 15. doi: 10.1140/epjds/s13688-021-00271-0.

Alyafeai, Z., AlShaibani, M. S., & Ahmad, I. (2020). A survey on transfer learning in natural language processing, ArXiv:2007.04239 [Cs, Stat]. Available from: http://arxiv.org/abs/2007.04239

Antoun, W., Baly, F., & Hajj, H. (2021). AraBERT: Transformer-based model for Arabic language understanding, ArXiv:2003.00104 [Cs]. Available from: http://arxiv.org/abs/2003.00104

Antoun, W., Baly, F., & Hajj, H. (2021). AraELECTRA: Pre-Training text discriminators for Arabic language understanding, ArXiv:2012.15516 [Cs]. Available from: http://arxiv.org/abs/2012.15516

Chauhan, N. K., & Singh, K. (2018). A review on conventional machine learning vs deep learning. In 2018 International Conference on Computing, Power and Communication Technologies (GUCON) (pp. 347–352). doi: 10.1109/GUCON.2018.8675097.

Chowdhury, S. A., Abdelali, A., Darwish, K., Soon-Gyo, J., Salminen, J., & Jansen, B. J. (2020). Improving Arabic text categorization using transformer training diversification. Proceedings of the Fifth Arabic Natural Language Processing Workshop (pp. 226–236). Available from: https://aclanthology.org/2020.wanlp-1.21

Clark, K., Luong, M.-T., Le, Q. V., & Manning, C. D. (2020). Electra: Pre-training text encoders as discriminators rather than generators. ArXiv Preprint ArXiv:2003.10555.

Jurafsky, D., & Martin, J. H. (n.d.). Speech and language processing. Available from: https://web.stanford.edu/~jurafsky/slp3/ (accessed 11 December 2021).

Darwish, K., Stefanov, P., Aupetit, M., & Nakov, P. (2020). Unsupervised user stance detection on twitter. In Proceedings of the International AAAI Conference on Web and Social Media (Vol. 14, pp. 141–152). Available from: https://ojs.aaai.org/index.php/ICWSM/article/view/7286

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). Bert: Pre-Training of deep bidirectional transformers for language understanding, ArXiv:1810.04805 [Cs]. Available from: http://arxiv.org/abs/1810.04805

Ghosh, S., Singhania, P., Singh, S., Rudra, K., & Ghosh, S. (2019). Stance detection in web and social media: A comparative study, ArXiv:2007.05976 [Cs], 11696, 75–87. doi: 10.1007/978-3-030-28577-7_4.

Han, K., Wang, Y., Chen, H., Chen, X., Guo, J., Liu, Z., … Tao, D. (2021). A survey on vision transformer, ArXiv:2012.12556 [Cs]. Available from: http://arxiv.org/abs/2012.12556.

Imran, A. A., & Amin, M. N. (2021). Deep bangla authorship attribution using transformer models. In D. Mohaisen, & R. Jin (Eds.), Computational Data and Social Networks (pp. 118–128). Springer International Publishing. doi: 10.1007/978-3-030-91434-9_11.

Izsak, P., Berchansky, M., & Levy, O. (2021). How to train bert with an academic budget. ArXiv Preprint ArXiv:2104.07705.

Jannati, R., Mahendra, R., Wardhana, C. W., & Adriani, M. (2018). Stance classification towards political figures on blog writing. In 2018 International Conference on Asian Language Processing (IALP) (pp. 96–101). doi: 10.1109/IALP.2018.8629144.

Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2016). Bag of tricks for efficient text classification, ArXiv:1607.01759 [Cs]. Available from: http://arxiv.org/abs/1607.01759

Kalyan, K. S., Rajasekharan, A., & Sangeetha, S. (2021). Ammus: A survey of transformer-based Pretrained models in natural language processing, ArXiv:2108.05542 [Cs]. Available from: http://arxiv.org/abs/2108.05542

Kayalvizhi, S., Thenmozhi, D., & Chandrabose, A. (2021). SSN_NLP@SardiStance: Stance detection from Italian tweets using RNN and transformers. In V. Basile, D. Croce, L. C. Passaro, & M. Maro (Eds.), EVALITA Evaluation of NLP and Speech Tools for Italian—December 17th, 2020: Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian Final Workshop (pp. 220–223). Accademia University Press. Available from: http://books.openedition.org/aaccademia/7207

Kovacs, E.-R., Cotfas, L.-A., Delcea, C., & Florescu, M.-S. (2023). 1000 Days of COVID-19: A gender-based long-term investigation into attitudes with regards to vaccination. IEEE Access, 11, 25351–25371.

Küçük, D., & Can, F. (2020). Stance detection: A survey. ACM Computing Surveys, 53(1), 12:1–12:37. doi: 10.1145/3369026.

Lin, S.-X., Wu, B.-Y., Chou, T.-H., Lin, Y.-J., & Kao, H.-Y. (2020). Bidirectional perspective with topic information for stance detection. 2020 International Conference on Pervasive Artificial Intelligence (ICPAI) (pp. 1–8). doi: 10.1109/ICPAI51961.2020.00009.

Liviu-Adrian, C., Delcea, C., Roxin, I., Ioanăş, C., Gherai, D. S., & Tajariol, F. (2021). The longest month: Analyzing COVID-19 vaccination opinions dynamics from tweets in the month following the first vaccine announcement. IEEE Access, 9, 33203–33223.

Magnini, B., Lavelli, A., & Magnolini, S. (2020). Comparing machine learning and deep learning approaches on NLP tasks for the Italian language. Proceedings of the 12th Language Resources and Evaluation Conference (pp. 2110–2119). Available from: https://aclanthology.org/2020.lrec-1.259

Mohammad, S., Kiritchenko, S., Sobhani, P., Zhu, X., & Cherry, C. (2016). SemEval-2016 task 6: Detecting stance in tweets. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016) (pp. 31–41). doi: 10.18653/v1/S16-1003.

Mohtarami, M., Baly, R., Glass, J., Nakov, P., Marquez, L., & Moschitti, A. (2018). Automatic stance detection using end-to-end memory networks. Available from: https://arxiv.org/abs/1804.07581v1

Müller, M., Salathé, M., & Kummervold, P. E. (2020). COVID-twitter-BERT: A natural language processing model to analyse COVID-19 content on twitter, ArXiv:2005.07503 [Cs]. Available from: http://arxiv.org/abs/2005.07503

Mubarak, H., Hassan, S., Chowdhury, S. A., & Alam, F. (2022). ArCovidVac: Analyzing Arabic tweets about COVID-19 vaccination. ArXiv Preprint ArXiv:2201.06496.

Padnekar, S. M., Kumar, G. S., & Deepak, P. (2020). BiLSTM-autoencoder architecture for stance prediction. In 2020 International Conference on Data Science and Engineering (ICDSE) (pp. 15). doi: 10.1109/ICDSE50459.2020.9310133.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., & Vanderplas, J. (2011). Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 12, 2825–2830.

pkudblab at SemEval-2016 Task 6: A Specific Convolutional Neural Network System for Effective Stance Detection (n.d). OpenAIRE - explore, Available from: https://explore.openaire.eu/search/publication?pid=10.18653%2Fv1%2Fs16-1062 (accessed 1 May 2022).

Rifat, M. R. I., & Imran, A. A. (2021). Incorporating transformer models for sentiment analysis and news classification in Khmer. In D. Mohaisen, & R. Jin (Eds.), Computational Data and Social Networks (pp. 106117). Springer International Publishing. doi: 10.1007/978-3-030-91434-9_10.

Santosh, T. Y. S. S., Bansal, S., & Saha, A. (2019). Can Siamese Networks help in stance detection?. Proceedings of the ACM India Joint International Conference on Data Science and Management of Data (pp. 306–309). doi: 10.1145/3297001.3297047.

Schiller, B., Daxenberger, J., & Gurevych, I. (2021). Stance detection benchmark: How robust is your stance detection?. KI - Künstliche Intelligenz. doi: 10.1007/s13218-021-00714-w.

Sobhani, P., Inkpen, D., & Zhu, X. (2017). A dataset for multi-target stance detection. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, (Vol. 2), Short Papers, 551–557. Available from: https://aclanthology.org/E17-2088

Tun, Y. M., & Hninn Myint, P. (2019). A two-phase approach for stance classification in twitter using name entity recognition and term frequency feature. In 2019 IEEE/ACIS 18th International Conference on Computer and Information Science (ICIS) (pp. 7781). doi: 10.1109/ICIS46139.2019.8940282.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems (Vol. 30). Available from: https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html

Young, T., Hazarika, D., Poria, S., & Cambria, E. (2018). Recent trends in deep learning based natural language processing [review article]. IEEE Computational Intelligence Magazine, 13(3), 55–75. doi: 10.1109/MCI.2018.2840738.

Acknowledgements

This research project was supported by a grant from the “Research Center of College of Computer and Information Sciences”, Deanship of Scientific Research, King Saud University.

Corresponding author

Reema Khaled AlRowais can be contacted at: rkalrowais@ksu.edu.sa
