Predicting the popularity of tweets by analyzing public opinion and emotions in different stages of Covid-19 pandemic

In this study, public opinion and emotions regarding different stages of the Covid-19 pandemic, from the outbreak of the disease to the distribution of vaccines, were analyzed to predict the popularity of tweets. More than 1.25 million English tweets were collected, posted from January 20, 2020, to May 29, 2021. Five sets of content features, including topic analysis, topics plus TF-IDF vectorizer, bag of words (BOW) by TF-IDF vectorizer, document embedding, and document embedding plus TF-IDF vectorizer, were extracted and applied to supervised machine learning algorithms to generate a predictive model for the retweetability of posted tweets. The analysis showed that tweets with higher emotional intensity are more popular than tweets that merely convey information about the Covid-19 pandemic. This study can help detect public emotions during the pandemic and after vaccination and predict the retweetability of posted tweets at different stages of the Covid-19 pandemic.


Introduction
The coronavirus pandemic, also known as Covid-19, began in December 2019 when several patients from Wuhan, Hubei province, in China reported severe health symptoms. Since then, Covid-19 has spread across the globe. According to the World Health Organization (WHO) report of July 14th, 2021, there had been 187,519,798 confirmed cases, including 4,049,372 deaths (WHO Coronavirus Disease Dashboard, www.who.int). In the very early stages of the pandemic, the WHO advocated for isolation and self-quarantine of affected individuals to reduce the number of cases and mortality rates, leading to the largest lockdown in history. Spending time at home and searching for Covid-19-related news became a common preoccupation, and many turned to social media platforms such as Twitter, which became one of the most important means of sharing information and expressing feelings regarding Covid-19 ( Mohammed & Ferraris, 2021 ; Su, Venkat, Yadav, Puglisi, & Fodeh, 2021 ; Younis et al., 2020 ).
Twitter users can "retweet" or forward a posted tweet to their network, which speeds up the information-sharing process. Thus, retweets can represent Twitter users' interests on a large scale. The popularity of tweets is measured by their content and the volume of retweets. Shahi, Dirkson, and Majchrzak (2021) conducted an exploratory study to examine the sources, spread, and content of misinformation in tweets related to the Covid-19 pandemic. Yousefinaghani, Dara, Mubareka, Papadopoulos, and Sharif (2021) examined the content of four million tweets to learn about public opinion regarding the Covid-19 vaccine. Using Twitter data from several mega-cities worldwide, Yao, Yang, Liu, Keith, and Guan (2021) employed machine learning techniques to analyze the public's response to the Covid-19 pandemic. To the best of our knowledge, none of the previous studies have investigated the patterns in public responses to the pandemic from its onset to vaccine distribution by analyzing the content of tweets and predicting the popularity of tweets.
This study addresses this gap by collecting tweets generated from January 2020 to May 2021 and by analyzing public opinions and emotions through advanced machine learning techniques, including the latent Dirichlet allocation (LDA) topic model ( Blei, Ng, & Jordan, 2003 ) and the CrystalFeel algorithm ( Gupta & Yang, 2018 ). More importantly, the extraction of different categories of content features and the building of a predictive model that assesses the popularity of tweets by the number of retweets (based on the content of posted tweets) address another gap in the literature. The research objectives for this study are as follows: (i) detecting public emotions at different stages of the Covid-19 pandemic using Twitter data; (ii) exploring the dominant English topics related to Covid-19 on Twitter and the sentiment associated with them; and (iii) building a predictive model for the retweetability of posted tweets based on their content. The contribution of this study to the literature can be summarized as follows: (i) Analyzing 1,251,216 randomly selected tweets from January 20, 2020 to May 29, 2021, spanning the early stages of the pandemic through the distribution of vaccines, helps in understanding public opinions and emotions at different points of the ongoing pandemic. (ii) This study applied the latent Dirichlet allocation (LDA) topic model and the CrystalFeel algorithm to detect four basic emotions (fear, anger, joy, and sadness) at different stages of the Covid-19 pandemic. (iii) The proposed approach extracts five different sets of content features from the posted tweets and applies them to three base supervised machine learning algorithms and an ensemble voting classifier to predict the retweetability of the posted tweets. (iv) The experimental results are then compared on four metrics (accuracy, F1-score, recall, and precision) to choose the model with the highest performance.
The study further compared the execution time for running each model to choose the most efficient model. The remainder of this study is organized as follows. Section 2 reviews the literature, specifically the background on the impact of social media, and particularly Twitter, during pandemics. The research methodology is introduced in Section 3. The experimental design and analysis, along with the models' results, are discussed in Section 4. The discussion and implications of the research are presented in Section 5. The conclusions and limitations of our work are discussed in Section 6.

Literature review
During the Covid-19 pandemic, social media platforms such as Facebook, Instagram, TikTok, and Twitter became even more important as a means to interact and connect with others. Visits to Twitter increased by 36 percent in 2020 compared with the previous year, and users in the United States spent an average of 32.7 min on the platform per day. Access to large datasets on various platforms offers opportunities for scholars to use advanced computational science to gain insights ( Kar & Dwivedi, 2020 ). For instance, Mishra, Urolagin, and Jothi (2019) applied term frequency-inverse document frequency (TF-IDF) and cosine similarity to hotel reviews to generate a recommendation system for suggesting suitable hotels to customers. Chintalapudi, Battineni, Canio, Sagaro, and Amenta (2021) analyzed medical records from digital health systems from 2018 to 2020 by implementing a text-mining approach to gain insights into improving healthcare quality and assessing patient feedback. Rajendran and Sundarraj (2021) conducted experiments in two domains, movies and restaurants, to gather users' browsing histories, generate topics using latent Dirichlet allocation (LDA) models, and extract user preferences to enhance a recommendation algorithm. Mishra, Urolagin, and Jothi (2020) also used review data to apply a sentiment intensity analyzer and generate a recommendation system for tourist points of interest. This research contributes to two research streams: the impact of media, and particularly Twitter, during pandemics, and retweeting behavior based on the content of tweets.

Media's impact and particularly Twitter during pandemics
Regarding the first research stream, Odlum and Yoon (2015) studied the use of Twitter during the Ebola outbreak to monitor information sharing among users and examine users' behavior and their knowledge of the disease during the pandemic. The results of this study revealed the pattern in the spread of information among the public and highlighted the value of Twitter as a tool for spreading public awareness. Lazard, Scheinfeld, Bernhardt, Wilcox, and Suran (2015) used textual analysis to examine public concerns about the Ebola virus and interest in safety information. The study highlighted the efficiency of using Twitter in public health communication. Jain and Kumar (2015) examined the use of Twitter in the 2015 H1N1 outbreak (also known as swine flu) to create an inspection system by analyzing information relevant to influenza (H1N1) and enhancing public awareness in India. They classified tweets as either relevant or irrelevant to study public opinion regarding H1N1. Their results highlighted the importance of social media for tracking a disease. Szomszor, Kostkova, and Louis (2011) analyzed tweets and online media related to the swine flu pandemic of 2009 to identify the popularity of true information. They found that poorly represented scientific information can still be shared in public and cause harm. Furthermore, several studies have examined Twitter content to analyze how the public expresses their feelings at the onset of pandemics ( Baboukardos, Gaia, & She, 2021 ; Garcia & Berton, 2021 ; S. Kaur, Kaul, & Zadeh, 2020 ; Ridhwan & Hargreaves, 2021 ). Following a quasi-inductive approach, Mittal, Ahmed, Mittal, and Aggarwal (2021) found that the majority of Twitter users tend to share positive content regarding the lockdown, but their opinions could swing over the course of the pandemic based on recent developments.
Some studies analyzed tweets with a focus on the public's emotions during the Covid-19 pandemic ( Gupta et al., 2021 ; Kabir & Madria, 2021 ), while others focused on public opinions following the rollout of Covid-19 vaccines ( Sv, Tandon, Vikas, & Hinduja, 2021 ; Yousefinaghani et al., 2021 ). Kabir and Madria (2021) developed a neural network model to automatically detect a variety of emotions in tweets on Covid-19. They randomly selected ten thousand tweets in English from the United States for their analysis, and their results showed that negative emotions increased during the pandemic. Kaur, Mittal, Khosla, and Mittal (2021) discussed the use of advanced machine learning tools to predict and analyze the impact of quarantine during the Covid-19 pandemic. Rustam et al. (2021) identified sentiments regarding Covid-19 from tweets using a supervised machine learning approach to understand how people made informed decisions on handling their circumstances during the pandemic. Mishra, Urolagin, Jothi, Neogi, and Nawaz (2021) used an LDA model on almost 20,000 tweets from the tourism sector and its hospitality and healthcare sub-domains during the Covid-19 pandemic to identify frequent terms and applied state-of-the-art deep learning algorithms to generate a robust sentiment prediction model. This study contributes to this research stream by analyzing 1,251,216 Covid-19-related tweets from January 20, 2020, to May 29, 2021 to investigate Twitter users' opinions and feelings about the Covid-19 pandemic during its different phases, including the early stage of the disease, during the lockdown, and after the distribution of vaccines.

Retweeting behavior
Several studies have contributed to this field by proposing methods for predicting the results of important events, such as games and political elections, using data on the volume of retweets ( Abdullah, Nishioka, Tanaka, & Murayama, 2015 ; Liang et al., 2016 ). Some studies explored the reasons why users retweet certain information without applying machine learning techniques for prediction. Boyd, Golder, and Lotan (2010) empirically examined several case studies on Twitter to understand and analyze the motivations behind retweeting behavior. Their study highlighted that bias in interpreting tweets caused the spread of false information on Twitter.
Kwak, Lee, Park, and Moon (2010) studied the impact of retweeting on information sharing. To evaluate the popularity of tweets, they ranked users based on their numbers of followers and followings compared to the volume of retweets. The results of this study showed that the volume of retweets, driven by a tweet's content, has a stronger impact than the number of people who follow the user's Twitter account. Naveed, Gottron, Kunegis, and Alhadi (2011) examined the impact of a tweet's content on its retweet volume. They analyzed two different levels of content-based features in tweets and predicted the retweetability of a given tweet. Guidry, Waters, and Saxton (2014) analyzed the content of 3415 Twitter updates from 50 nonprofit organizations to examine which types of content are likely to be retweeted and to learn how to engage audiences and facilitate discussions. Marino and Lo Presti (2018) examined the content of tweets by European Commissioners and proposed a retweetability rate to measure citizen engagement based on social media content in response to certain events. Chung, Woo, and Lee (2020) collected tweets from Women Who Code (WWC) over a one-year period to examine whether certain content and features, such as hashtags and photos, resulted in differences in retweet volume. Rao, Vemprala, Akello, and Valecha (2020) studied the alarming vs. reassuring retweet distribution patterns related to Covid-19. To the best of our knowledge, none of the Covid-19-related studies used an advanced machine learning predictive model to examine the retweetability of tweets based on content. Neogi, Garg, Mishra, and Dwivedi (2021) generated models to categorize and analyze sentiments based on a collection of tweets pertaining to the protests of Indian farmers. We contribute to this research stream by examining content-based features for predicting the popularity of tweets based on the volume of retweets during the Covid-19 pandemic.

Topic modeling on tweets related to Covid-19
Recently, several studies adopted topic modeling analysis on tweets to identify public concerns about Covid-19. Abd-Alrazaq, Alhuwail, Househ, Hamdi, and Shah (2020) examined English tweets related to Covid-19 from February 2020 to March 2020 by adopting latent Dirichlet allocation (LDA) for topic analysis. Mackey et al. (2020) explored tweets related to Covid-19 symptoms from March 2020 and applied the biterm topic model (BTM) to examine content related to symptoms, testing, and recovery of individuals who had been infected with Covid-19. Stokes, Andy, Guntuku, Ungar, and Merchant (2020) analyzed the public response to Covid-19 based on real-time analysis of 94,467 comments about the pandemic made in a public forum in March 2020. They adopted the LDA technique by defining 50 topics and reviewing the top ten words associated with each topic. Lwin et al. (2020) examined worldwide trends of four basic emotions (i.e., fear, anger, sadness, and joy) during the pandemic by analyzing more than 20 million tweets from January 28 to April 9, 2020. They adopted a lexical approach using the CrystalFeel algorithm and used "wuhan", "corona", "nCov", and "Covid" as search keywords to generate word clouds related to emotions. Cinelli et al. (2020) collected data related to Covid-19 on Twitter, Instagram, YouTube, Reddit, and Gab to examine public engagement on the topic of Covid-19. They extracted all the topics related to Covid-19 by generating word embeddings for the text corpus and then analyzed the topics. This study contributes to the literature by employing the LDA algorithm to identify the most popular topics related to Covid-19 as content features and applying them to the CrystalFeel algorithm to examine the public's basic emotions about the Covid-19 pandemic.

Research method
In this study, the primary objective was to identify public concerns and basic emotions related to the Covid-19 pandemic in its early stages, during the lockdown, and after the distribution of vaccines. Five sets of content features, including topic modeling, topics plus the TF-IDF vectorizer, BOW by the TF-IDF vectorizer, document embedding, and document embedding plus the TF-IDF vectorizer, are then selected. The five sets of features are applied as inputs to the selected classifiers to compare the accuracy of the prediction performance for tweet popularity based on the volume of retweets. Fig. 1 illustrates the system architecture of the research study.

Tweet collection and preprocessing
To implement this study, a subset of a dataset of Covid-19-related tweets collected by Chen, Lerman, and Ferrara (2020) from January 20, 2020, to May 29, 2021 was examined. English tweets from each month were randomly sampled, narrowing the dataset down to 1,251,216 tweet IDs. The tweet IDs were then hydrated into complete tweet records using the Hydrator software.
A laptop with an Intel Core i7-8750H processor was used for analyzing the data.
Table 1 shows the relevant information about the dataset and an example of one unique record. The data were imported into the Python console using the numpy, nltk, and pandas packages. In Table 1 , the user ID represents a unique identifier for the user who posted the tweet, and EN in our dataset refers to English. Furthermore, the number of tweets issued by the user ID is shown as the user status count, which describes the user's activity on Twitter. The number of times the tweet was shared with the user ID's network is given as the retweet count.
The raw texts were further cleaned by removing punctuation, usernames, URL links, numbers, pictures, and emojis, and the text was converted to lowercase. Furthermore, stop words such as "the", "of", "in", and "at" were removed. The cleaned tweets were then tokenized from sentences into words for further analysis.
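The cleaning pipeline above can be sketched in a few lines of Python; the regular expressions and the abbreviated stop-word list below are illustrative stand-ins for the full NLTK stop-word list used in the study.

```python
import re

# A small illustrative stop-word list; the study used the full NLTK list.
STOP_WORDS = {"the", "of", "in", "at", "a", "an", "is", "to", "and"}

def clean_tweet(text: str) -> list:
    """Lower-case, strip URLs/usernames/punctuation/numbers, drop stop words, tokenize."""
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # URL links
    text = re.sub(r"@\w+", " ", text)               # usernames
    text = re.sub(r"[^a-z\s]", " ", text)           # punctuation, numbers, emoji residue
    return [t for t in text.split() if t not in STOP_WORDS]

tokens = clean_tweet("Stay safe! Cases rising in the US: https://t.co/x @WHO #covid19")
# tokens -> ['stay', 'safe', 'cases', 'rising', 'us', 'covid']
```

The resulting token lists are the word-level units fed to the later topic and feature-extraction steps.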

Retweetability measure
To measure the popularity of tweets based on the volume of retweets, we considered tweets that had at least one retweet during the period from January 20, 2020, to May 29, 2021. Tweets with at least one retweet were labeled as popular and the rest as non-popular, yielding the binary response variable for the subsequent analysis.

Features extraction
Five different categories of features were chosen for this study: (i) topic modeling, (ii) topic modeling plus TF-IDF vectorizer, (iii) BOW by TF-IDF vectorizer, (iv) document embedding, and (v) document embedding plus TF-IDF vectorizer. The following subsections will cover each set of content features, particularly topic modeling and how basic emotions related to Covid-19 were detected using the CrystalFeel algorithm.

Topic analysis using the LDA model
Due to the large volume of tweets and retweets, topic modeling was used to classify text data pertaining to Covid-19 based on the frequency of words in each document. The latent Dirichlet allocation (LDA) model ( Blei et al., 2003 ) was applied to identify the most popular topics in tweets related to Covid-19. The LDA model is an unsupervised machine learning algorithm that detects a given number of topics within documents, each with a certain probability. Note that each topic is also represented as a probability distribution over words. LDA models a corpus of documents through the following generative process ( Blei et al., 2003 ): for each document, topic proportions θ are drawn from a Dirichlet distribution with parameter α; then, for each word position n, a topic z_n is drawn from θ and a word w_n is drawn from the topic's word distribution parameterized by β. The probability of the observed data is computed as follows:

p(w | α, β) = ∫ p(θ | α) ( ∏_{n=1}^{N} Σ_{z_n} p(z_n | θ) p(w_n | z_n, β) ) dθ

In the above equation, w_n and z_n are word-level variables, and θ is a document-level variable. This research aimed to find the optimal number of topics within the documents by calculating the coherence score ( Röder, Both, & Hinneburg, 2015 ), which measures the coherence of the topics by the normalized pointwise mutual information (NPMI) metric. NPMI is defined as follows:

NPMI(w_i, w_j) = PMI(w_i, w_j) / ( −log P(w_i, w_j) )

where the topic coherence is automatically computed from the pointwise mutual information (PMI) metric:

PMI(w_i, w_j) = log ( P(w_i, w_j) / ( P(w_i) P(w_j) ) )

and P(w_i) and P(w_j) are the probabilities of word w_i and word w_j within the documents, and P(w_i, w_j) is the joint probability of words w_i and w_j. Given the size of the dataset in this study, applying the LDA model was one of the most effective methodologies for extracting the features. In this study, Python Scikit-learn's LatentDirichletAllocation function is used with a learning decay of 0.85. Learning decay is a parameter controlling the learning rate, and its value must be set between 0.5 and 1 to guarantee asymptotic convergence. Fig. 2 shows the optimal number of topics along with the coherence score for the whole dataset. A higher coherence score indicates a better number of topics within the documents.
The highest coherence value is 0.6088, indicating 38 topics for the whole dataset. Fig. 3 shows the word cloud of the most frequent words for all 38 topics. The LDA algorithm was further applied to the tweets from each phase of the Covid-19 pandemic to identify the most frequent topics, which were then passed to the CrystalFeel algorithm to detect the four emotions: fear, anger, sadness, and joy.

CrystalFeel algorithm
Previous studies analyzed the four emotions in different periods of the pandemic using the CrystalFeel algorithm ( Garcia & Berton, 2021 ; Lwin et al., 2020 ; Shah, Yan, Qayyum, Naqvi, & Shah, 2021 ), which has been shown in recent works to be accurate. In this study, the emotional strength scores of the CrystalFeel algorithm ( R. K. Gupta & Yang, 2018 ) were used to label the dominant emotions of fear, anger, sadness, and joy at different phases of the pandemic, according to the timeline of WHO tweets and U.S. news during the ongoing Covid-19 pandemic. In the CrystalFeel algorithm, topics are labeled based on the emotion score (i.e., emotional valence, which refers to the feelings' polarity) in three categories: (i) no specific emotion; (ii) if the valence score is higher than 0.520, the emotion category is "joy"; (iii) if the valence score is lower than 0.480, the emotion category is (1) "anger" if and only if the anger intensity score is higher than both the fear and sadness intensity scores, (2) "fear" if and only if the fear intensity score is higher than both the anger and sadness intensity scores, and (3) "sadness" if and only if the sadness intensity score is higher than both the anger and fear intensity scores ( Garcia & Berton, 2021 ). Fig. 4 illustrates Algorithm 1.

Algorithm 1: Emotion labeling in CrystalFeel
1: Output: labeled topics based on emotion scores
2: emotion-category = "no specific emotion";
3: if valence-score > 0.520 then
4:   emotion-category = "joy";
5: else if valence-score < 0.480 then
6:   if (anger-score > fear-score) and (anger-score > sadness-score) then
7:     emotion-category = "anger";
8:   else if (fear-score > anger-score) and (fear-score > sadness-score) then
9:     emotion-category = "fear";
10:  else if (sadness-score > anger-score) and (sadness-score > fear-score) then
11:    emotion-category = "sadness";
12:  end
13: end

The results of the CrystalFeel analysis from January 2020 to May 2021 are shown in Table 2.
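The labeling rule can be written as a small function. Note that the valence and intensity scores themselves come from CrystalFeel; the function below (called with hypothetical score values) only reproduces the thresholding logic described above.

```python
def crystalfeel_label(valence: float, anger: float, fear: float, sadness: float) -> str:
    """Map CrystalFeel valence/intensity scores to an emotion category,
    following the thresholds reported by Garcia and Berton (2021)."""
    if valence > 0.520:
        return "joy"
    if valence < 0.480:
        if anger > fear and anger > sadness:
            return "anger"
        if fear > anger and fear > sadness:
            return "fear"
        if sadness > anger and sadness > fear:
            return "sadness"
    return "no specific emotion"

# Hypothetical scores for a fearful tweet: low valence, fear dominant.
label = crystalfeel_label(valence=0.30, anger=0.2, fear=0.7, sadness=0.4)  # "fear"
```

Tweets whose valence falls in the neutral band [0.480, 0.520] receive no specific emotion label.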
For each month, the LDA algorithm was applied to the randomly selected tweets, and the top ten words for each topic were then extracted and used as inputs for the CrystalFeel algorithm. After the first reported human-to-human transmission of Covid-19 occurring outside of China, the public response to Covid-19-related news was characterized by anger. By the onset of the pandemic, the public response had turned to fear. However, there was a sudden increase in joy in April 2020, which marked the beginning of the "lockdown" in the U.S., after the announcement that the unemployment benefits from the U.S. Department of Labor would amount to $600 per week. In countries that experienced the longest stay-at-home orders, this joy turned to fear, anger, and sadness in the following months. The announcement that the former president and first lady had tested positive for Covid-19 marked the point when public emotion turned to fear in October 2020. By the end of 2020, public emotions were characterized by joy following the WHO announcement that the Pfizer and Moderna Covid-19 vaccines were effective. However, 2021 began with fear related to the U.S. 2020 presidential election. The remaining months of 2021 were characterized by joy with the distribution of vaccines in the United States and throughout the world and reopening plans for restaurants and indoor spaces. Fig. 6 shows the line graph of all four basic emotions from January 2020 to May 2021, along with three examples of important events that occurred during the time period.

Bag of words using the term frequency-inverse document frequency (TF-IDF) vectorizer
N-gram analysis for extracting features is one of the most reliable, efficient, and fastest techniques for text classification. The process starts by preprocessing language documents, removing unnecessary information (e.g., punctuation, numbers, tags) while keeping the necessary terms. N-grams are sequences of words from the documents, and "n" corresponds to the window size of the words in the text analysis. In this study, the window size for the n-gram analysis is one, i.e., a bag of words, which generates the vocabulary list of all unique words and their frequencies in the documents. To enhance the performance of the classification models, the TF-IDF vectorizer was used to weight the n-gram profiles ( Hassan, Gomaa, Khoriba, & Haggag, 2020 ; Nasser, Karim, el Ouadrhiri, Ali, & Khan, 2021 ). The highest TF-IDF weight occurs when a word has a high term frequency (TF) in a tweet and a low document frequency (DF) in the entire dataset. In this study, the TF-IDF weighting introduced by Salton and Buckley (1988) was applied to the documents; it is an older method compared to the other aforementioned features. The TF-IDF method assumes that the important words in a given document appear frequently in that document but rarely in other documents, which aids in recognizing meaningless terms. The frequency of word i within document j is denoted as tf_{i,j}, while the number of documents containing word i is denoted as df_i. The importance w_{i,j} of word i for document j is highest for a large tf_{i,j} and a small df_i and is calculated as follows:

w_{i,j} = tf_{i,j} × log( N / (df_i + 1) )

where N is the total number of documents, and log( N / (df_i + 1) ) represents the inverse document frequency for word i. Table 3 provides examples of tweets along with their top words, TF-IDF values, and detected emotions.

Document embedding
Doc2vec, or document embedding, is the extension of word embedding for text analysis. Word2Vec can convert tokenized words into a vector that represents the vocabulary of texts within documents. Word2Vec enables exploration of the correlations among words and their contextual information and constructs the network of words. Doc2vec builds a numerical representation of a document, treating a group of words as a unique document to achieve sentence embedding. Thus, when training Word2Vec ( Mikolov, Yih, & Zweig, 2013 ), Doc2vec is also trained. One of the main learning algorithms for Doc2vec that is implemented in this research is the distributed bag of words version of the paragraph vector (PV-DBOW), which is based on skip-gram. In PV-DBOW, each text is associated with a specific paragraph vector, and each word is associated with a specific word vector in the whole dataset. The gensim package was imported into Python to create the document-to-vector model, learn the network of documents, and detect similar tweets based on vector distance.

Supervised machine learning algorithms
The Scikit-learn package in Python 3.8 was used to implement three base and effective supervised machine learning algorithms: (i) the random forest (RF) classifier ( Breiman, 2001 ), (ii) the stochastic gradient descent (SGD) classifier ( Zhang, 2004 ), and (iii) the logistic regression (LR) classifier ( Hosmer, Lemeshow, & Sturdivant, 2013 ), as well as an ensemble voting classifier combining the three (i.e., RF, SGD, and LR) to enhance accuracy and reduce the classifiers' error rates. Each classifier and the ensemble approach are explained in detail in the following subsections. Note that, in this study, the ensemble voting classifier is referred to as EVC.

Random forest classifier
The random forest classifier is a supervised machine learning algorithm. It consists of tree classifiers, where each tree is grown with a random vector that is distributed independently and identically, and each tree casts a vote for the most popular class of the input vectors ( Breiman, 2001 ). The RF workflow can be split into two stages: random forest creation and prediction from the created RF classifier ( Biau & Scornet, 2016 ). The algorithm has the following steps ( Neogi, Garg, Mishra, & Dwivedi, 2021 ).
Step 1 RF randomly selects "m" features from a total of "M" features, where m ≪ M.
Step 2 RF calculates the node "d" among the "m" features using the best split point.
Step 3 RF uses the optimal split to break the node into child nodes.
Step 4 RF repeats steps 1 to 3 iteratively until the number of nodes reaches the maximum allocated value.
Step 5 RF builds a forest by repeating steps 1 to 4 "n" times to create "n" trees. In this study, the RF classifier's accuracy was compared with that of the stochastic gradient descent (SGD), logistic regression (LR), and ensemble voting (EVC) classifiers.
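In scikit-learn terms, the procedure above is encapsulated by RandomForestClassifier; the snippet below is a sketch on synthetic data standing in for the extracted content features, with class_weight="balanced" mirroring the class weighting the study applies against the popular/non-popular imbalance.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the TF-IDF / embedding feature matrices.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.75, random_state=0)

# n_estimators is the number of trees ("n" in Step 5);
# class_weight="balanced" up-weights the smaller class.
rf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=0)
rf.fit(X_tr, y_tr)
accuracy = rf.score(X_te, y_te)
```

Each of the 100 fitted trees votes on a class, and the forest aggregates the votes for the final prediction.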

Stochastic gradient descent classifier
The stochastic gradient descent (SGD) classifier is a supervised machine learning algorithm and a very powerful classifier for building a predictive model ( Zhang, 2004 ). The algorithm has the following steps.
Step 1 SGD selects a random initial value for the parameters.
Step 2 SGD computes the gradient of the loss function with respect to each parameter.
Step 3 SGD evaluates the gradient at the current parameter values.
Step 4 SGD calculates the step size for each feature with respect to the learning rate of the algorithm.
Step 5 SGD calculates the new parameters.
Step 6 SGD repeats steps 2 to 5 until the gradient approaches zero. In the SGD classifier, the learning-rate value has a significant impact on the behavior of gradient descent. Thus, the learning rate in the Python code is set to "optimal" and the loss function is set to "log", which yields logistic regression, a probabilistic classifier. The log loss function gives the probability of false classifications ( Rustam et al., 2021 ) and can be defined as:

log-loss = −(1/N) Σ_{i=1}^{N} [ y_i log p(y_i) + (1 − y_i) log(1 − p(y_i)) ]

where N is the number of instances, y_i is the outcome of the i-th instance, and p(y_i) is the predicted probability for the i-th instance.

Logistic regression classifier
The logistic regression (LR) classifier is a supervised machine learning algorithm that is used to model the probability of a binary classification problem ( Hosmer, Lemeshow, & Sturdivant, 2013 ). The LR algorithm has the following steps.
Step 1 the LR classifier initializes the model parameters.
Step 2 the LR classifier determines the cost function.
Step 3 the LR classifier updates the parameters via gradient descent, w := w − α ∂J(w)/∂w, where α is the learning rate, x_i ∈ { x_1, x_2, …, x_n } is the input, and y is the target variable.
Step 4 the LR classifier calculates the output with the highest probability.
Step 5 the LR classifier repeats steps 1 to 4 and updates the model for each training instance in the dataset. In this classifier, scikit-learn's LogisticRegression uses liblinear as the solver parameter, an algorithmic approach to optimizing the loss function that supports both L1 and L2 regularization for penalizing model complexity. Note that liblinear applies a Newton method for the LR classifier ( Lin, Weng, & Keerthi, 2007 ).
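Correspondingly, the liblinear-backed logistic regression can be sketched as below on synthetic data; penalty="l2" is one of the two regularization options (L1/L2) that liblinear supports.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.75, random_state=0)

# solver="liblinear" optimizes the loss with a Newton-based method
# and supports both L1 and L2 penalties.
lr = LogisticRegression(solver="liblinear", penalty="l2", class_weight="balanced")
lr.fit(X_tr, y_tr)
pred = lr.predict(X_te)          # class with the highest predicted probability
score = lr.score(X_te, y_te)
```

As with the other base models, class_weight="balanced" reflects the imbalance handling discussed in the experimental section.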

Ensemble voting classifier
An ensemble approach is a combination of classifiers that improves the performance of a classification system ( Li, Zong, & Wang, 2007 ). Classic machine learning methods are trained by using one classification method on the dataset, while an ensemble approach is trained by using multiple classifiers. The error rate of an ensemble approach is typically lower than the error rates of the individual classifiers. To combine the decisions of RF, SGD, and LR, this study used soft voting in the ensemble approach. A convex combination of the predicted class probabilities was applied to the individual classifiers. The weights for the classifiers summed to one, and the weighting was chosen based on classifier performance due to its simplicity and accurate results ( Pierola, Epifanio, & Alemany, 2016 ). In the soft-voting approach, the predict_proba attribute is used to obtain the class probabilities, and the training set and data points are shuffled for the RF, SGD, and LR classifiers. Each classifier computes its prediction, and with the soft-voting technique, the weighted majority vote determines the final prediction ( Kumari, Kumar, & Mittal, 2021 ). Fig. 7 illustrates Algorithm 2 for the soft-voting technique.

Algorithm 2: Soft-voting ensemble
1: Input: Tweets_attributes, label
2: Training_data, Testing_data = split (Tweets_attributes, label)
3: Return Training_data, Testing_data
4: Voting = "soft"
5: RF = Random_Forest (Training_data, Training_label, Testing_data)
6: SGD = SGD (Training_data, Training_label, Testing_data)
7: LR = Logistic_Regression (Training_data, Training_label, Testing_data)
8: Procedure Ensemble_Model (Training_data, Training_label, Testing_data)
9:   Soft_voting_classifier = concatenate (RF, SGD, LR);
10:  Soft_voting_classifier.fit (Training_data, Training_label)
11:  predictions = Soft_voting_classifier.predict (Testing_data)
12: end Procedure

Experimental design and analysis
As mentioned in the research method section, tweets related to the Covid-19 pandemic were collected using Twitter APIs ( Chen, Lerman, & Ferrara, 2020 ) and keywords such as Covid, corona, pandemic, and similar terms. The study randomly chose 1,251,216 tweets written in English that were posted between January 20, 2020 and May 29, 2021. The tweets were labeled as popular or non-popular based on the number of retweets. Each of the classification models used a grid search to find optimal hyper-parameters, implemented with the GridSearchCV object of scikit-learn in Python for all classification models. The results of the models were obtained using five-fold cross-validation with a split ratio of 0.75 to train the classifiers. The optimal hyper-parameters for all the proposed classifiers are summarized in Table 4 . Furthermore, to overcome the imbalanced-data problem, the class weight for each classifier was modified such that a higher weight is given to the smaller class to produce optimal results.
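A sketch of this tuning setup, with a hypothetical parameter grid and synthetic data (the actual optimal values are those reported in Table 4):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic imbalanced data standing in for the labeled tweets
X, y = make_classification(n_samples=200, n_features=10,
                           weights=[0.35, 0.65], random_state=0)

# 0.75 split ratio to train the classifiers, as described above
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.75, random_state=0)

# Hypothetical grid for illustration only
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 10]}
search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_grid,
    cv=5,  # five-fold cross-validation
)
search.fit(X_tr, y_tr)
print(search.best_params_, round(search.score(X_te, y_te), 3))
```

`class_weight="balanced"` reweights the loss inversely to class frequencies, giving the smaller class a higher weight.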

Model
The binary response variable in this study was popular versus non-popular tweets based on the volume of retweets, where tweets with at least one retweet were labeled as popular and tweets with no retweets were labeled as non-popular. Since there were 435,900 non-popular tweets and 815,316 popular tweets, this was an imbalanced dataset. The parameters of the ensemble voting classifier were set as follows:

Parameter | Value | Description
voting | 'soft' | Predicts the class label based on the argmax of the sums of the predicted probabilities
flatten_transform | TRUE | Affects the shape of the transform output, a matrix of (n_samples, n_classifiers * n_classes)
weights | [45, 35, 20] | Weights the class probabilities before averaging
To avoid misleading results due to the imbalanced dataset, an oversampling technique in which the minority class is duplicated was adopted to keep all the relevant information in the training set. Furthermore, three main sets of content features and their combinations were utilized as inputs for three robust and effective machine learning classifiers and an ensemble voting classifier suited to imbalanced datasets, and were used to predict retweetability. To enhance the performance of the classifiers, the feature-extraction module of the scikit-learn package in Python 3.8 was used to extract the lexical features and weight them with a TF-IDF vectorizer. The gensim package was applied for Doc2vec and LDA, and the LatentDirichletAllocation function from the scikit-learn package was used for topic analysis. The parameters of the classifiers were also adjusted to prevent poor results: all the classifiers were modified by setting the class weights to "balanced" in their cost function, so that the penalty for misclassifying the minority class is higher. The scikit-learn Python package provides these class weights for the classifiers. Furthermore, an ensemble voting classifier was applied to enhance the accuracy of prediction and reduce bias and error rate. This study utilized an ensemble of random forest, stochastic gradient descent, and logistic regression by applying the soft voting technique. This research addressed two main components of generating a prediction model: first, tuning the hyper-parameters of each base model, and second, weighting the base models by adopting a soft voting technique to create the prediction model, as explained in the following sections.
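The oversampling step, duplicating minority-class rows after TF-IDF weighting, can be sketched as follows (toy tweets and labels for illustration, not the study's data):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.utils import resample

tweets = ["covid vaccine rollout", "stay home stay safe",
          "pandemic lockdown news", "corona cases rising"]
labels = np.array([1, 1, 1, 0])  # 1 = popular, 0 = non-popular (toy labels)

# TF-IDF weighting of the lexical features
X = TfidfVectorizer().fit_transform(tweets)

# Duplicate minority-class row indices until the classes are balanced
minority = np.where(labels == 0)[0]
majority = np.where(labels == 1)[0]
upsampled = resample(minority, replace=True,
                     n_samples=len(majority), random_state=0)
balanced_idx = np.concatenate([majority, upsampled])
print(len(balanced_idx))  # 6: three majority rows plus three duplicated minority rows
```

Sampling indices rather than rows keeps the sparse TF-IDF matrix intact; `X[balanced_idx]` then yields the balanced training set.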

Training time and system configuration
The classification models were trained for 250 epochs on a system with 32 GB of RAM and a GPU with 8 GB of RAM. The unsupervised machine learning algorithms took more than 30 h to train, whereas the supervised machine learning algorithms were efficient and took less time to run and produce outcomes. However, creating an ensemble voting classifier for each set of features took more time for both training and executing the models. By optimizing the hyper-parameters with GridSearchCV, the performance of the classification models improved and the runtime became more efficient.

Evaluation metrics
To evaluate the performance of the selected classifiers, four metrics were chosen: (i) accuracy, (ii) precision, (iii) recall, and (iv) F1-score.
The accuracy score is the ratio of correct predictions to total predictions and ranges between zero and one:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP denotes true positives, FP false positives, TN true negatives, and FN false negatives. The precision score indicates the proportion of true positive predictions among all positive predictions and lies between zero and one:

Precision = TP / (TP + FP)

The recall score represents the completeness of a classifier, the number of true positives divided by the total number of true positives and false negatives, and also lies between zero and one:

Recall = TP / (TP + FN)

The F1-score is the harmonic mean of precision and recall, and its value lies between zero and one:

F1-score = 2 × (Precision × Recall) / (Precision + Recall)

The execution time for running the classifiers was also used to compare which classifier achieves accurate results in a shorter time. Table 5 summarizes the results of the three supervised machine learning classifiers and the ensemble voting classifier with the five sets of features on the Covid-19 tweets. As shown in Table 5 , topic modeling yields the lowest accuracy of all the feature sets for every classifier. In the first category of features, using topic modeling, the EVC has an accuracy of 0.6861, RF 0.6239, SGD 0.555, and LR 0.5506. Adding TF-IDF weighting to topic modeling enhanced the accuracy and F1-score of all the classifiers; in particular, the accuracy of the EVC rose by 26.43 percentage points to 0.9504, and its F1-score increased by 32 percentage points to 0.95. Furthermore, with topics plus TF-IDF vectorizer, the RF classifier has an accuracy of 0.9381 and an F1-score of 0.94, the SGD classifier has an accuracy of 0.9304 and an F1-score of 0.93, and the LR classifier has an accuracy of 0.9293 and an F1-score of 0.93.
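The four metrics above can be computed directly from the confusion-matrix counts; a short sketch with hypothetical predictions:

```python
# Hypothetical true labels and predictions for illustration
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 1]

# Confusion-matrix counts
tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)  # accuracy 0.75; precision, recall, F1 all 0.8
```

The same values are returned by scikit-learn's accuracy_score, precision_score, recall_score, and f1_score on binary labels.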
Moreover, BOW by TF-IDF vectorizer for the EVC has an accuracy close to that of topics plus TF-IDF vectorizer, at 0.9437 with an F1-score of 0.94, but it also has a longer runtime of 14,769.07 s. For the RF classifier with BOW by TF-IDF vectorizer, the accuracy is 0.9368, slightly lower than with topics plus TF-IDF vectorizer, and the F1-score of 0.94 is equal to that of topics plus TF-IDF vectorizer. For the SGD classifier with BOW by TF-IDF vectorizer, however, the accuracy of 0.9339 and F1-score of 0.94 are slightly higher than with topics plus TF-IDF vectorizer. For the LR classifier with BOW by TF-IDF vectorizer, the accuracy is 0.9308, lower than that of the SGD classifier. For the fourth set of features, the Doc2vec vectorizer, although the accuracies of all the classifiers are higher than with the first set of features (topic modeling), the performance is low compared with the other feature sets. Adding TF-IDF weighting to the Doc2vec model improved the accuracy of all the classifiers compared with applying the Doc2vec feature alone, with the following increases: 1.93 percentage points for the EVC, 3.78 for the RF, 12.5 for the SGD, and 12.4 for the LR. In sum, the EVC achieved the highest accuracy compared with the RF, SGD, and LR classifiers for all five sets of features, particularly when using the topics plus TF-IDF vectorizer feature set, with a runtime of 12,420.34 s. Table 5 also shows that although the RF, SGD, and LR classifiers had the shortest runtimes of all the models, their accuracy was not as high as that of the ensemble approach. Fig. 8 shows the F1-score for the four classifiers and all five sets of features.
Applying the ensemble approach with the soft voting technique, however, increased the runtime for all five sets of features. The runtime of each model depends on the complexity of the base learners and the size of the dataset. Fig. 9 compares the runtime of the models using the ensemble approach with their accuracy. Among all the feature sets, topics plus TF-IDF vectorizer has the highest accuracy, and its runtime is relatively short compared with BOW by TF-IDF vectorizer.

Discussion
Inaccurate information related to the ongoing Covid-19 pandemic and the safety of vaccines and their side effects spread quickly through social media, especially via retweets on Twitter. It has therefore become more important to address misinformation ( Budhwani & Sun, 2020 ; Forati & Ghose, 2021 ; Singh et al., 2020 ). Prior research has explored essential characteristics of retweet prediction, including retweeting behaviors, emoji and playfulness engagement, and number of followers. However, less progress has been made in exploring the content of tweets and in predicting retweetability over the phases of the pandemic, from the initial spread of the virus to the distribution of vaccines. In this study, the content and popularity of tweets and public opinion and emotions were analyzed according to the number of retweets occurring during different phases of the Covid-19 pandemic. Five different sets of content features (i.e., topic modeling, BOW by TF-IDF vectorizer, topics plus TF-IDF vectorizer, Doc2vec, and Doc2vec plus TF-IDF vectorizer) were selected, compared, and then used for three effective and robust classifiers (random forest, stochastic gradient descent, and logistic regression) and an ensemble voting classifier, a meta-classifier, to evaluate and compare the outcomes. The results provided strong support for the study's contributions: a novel approach to extracting features from tweets and predicting their retweetability using supervised machine learning algorithms.
The results of this study showed that topics plus TF-IDF vectorizer outperformed the other sets of features for all the base classifiers and the ensemble voting classifier. The result of BOW by TF-IDF vectorizer as a content feature set was very close to that of topics plus TF-IDF vectorizer. One possible explanation is that all tweets pertained to Covid-19, so the performance of the basic text representation was close to that of topic modeling. Moreover, the results of all the experiments in this study confirmed that the EVC has the highest accuracy compared with the state-of-the-art methods.

Implications of this study
The results of this study have several theoretical and practical implications. To the best of our knowledge, this is the first study to use an up-to-date dataset covering tweets from the onset of the pandemic to the distribution of vaccines, and the first to utilize unsupervised machine learning algorithms such as LDA and document embedding to extract features and apply them to supervised machine learning algorithms (random forest, stochastic gradient descent, and logistic regression) and an optimal ensemble voting model of the selected classifiers to build a predictive model for retweetability. Furthermore, by applying the LDA algorithm, the most popular topics for each month were identified. The CrystalFeel algorithm was employed to label public emotions in response to the Covid-19 pandemic, to analyze the patterns in public opinion and emotions, and to extract the most effective features for the predictive model.
In terms of practical implications, the results of this research can be adopted to create a recommendation system for tweets that are relevant to certain events, or as a means of obtaining a higher number of retweets. Identifying patterns in public emotions during an ongoing pandemic can help public health authorities make strategic decisions regarding communication during critical events. The findings of this study show that although negative emotions such as anger, fear, and sadness were dominant in the early stages of the Covid-19 pandemic, the vaccine rollout and published results on vaccine effectiveness had a positive influence on public emotions. Furthermore, the findings of this study can help to detect and minimize misleading information related to Covid-19 on Twitter.

Conclusion
In this study, the popularity of tweets (based on the number of retweets) was predicted by extracting content features from tweets written in English on the Twitter platform from January 20, 2020, to May 29, 2021. This study shows that the popularity of tweets can be inferred from their content and from certain repeated terms during important events such as the Covid-19 pandemic. This section discusses the findings of the study and its limitations. The results revealed how public opinion changed throughout the stages of the Covid-19 pandemic. The study aimed to select effective features from the content of the posted tweets by applying unsupervised machine learning algorithms and then to use them as inputs for the selected supervised machine learning algorithms to predict retweetability. Identifying negative and misleading sentiments on popular social media platforms such as Twitter can help to prevent the spread of misinformation, and promoting accurate information and positive sentiments can enhance public awareness regarding events such as pandemics. In the proposed approach, the most popular topics at different stages of the pandemic were first identified using LDA, and the emotional intensity was detected by employing the CrystalFeel algorithm ( Gupta & Yang, 2018 ) for four emotions: fear, anger, joy, and sadness. These were then used as one category of content features, along with other feature sets, as inputs to the selected classifiers. The results showed that the topics plus TF-IDF vectorizer feature set had the highest accuracy compared with the other sets of content features, and that the ensemble voting classifier combining three machine learning algorithms (random forest, stochastic gradient descent, and logistic regression) had the highest performance compared with the state-of-the-art classifiers.
The analysis in this study was limited to tweets written in English and related to Covid-19. Future studies can expand the analysis to different languages. Furthermore, the findings of this study are limited to users of the Twitter platform; future research can explore text content from other social platforms to compare the results.