Exploring the Evolution of Sentiment in Spanish Pandemic Tweets: A Data Analysis Based on a Fine-Tuned BERT Architecture

The COVID-19 pandemic has had a significant impact on various aspects of society, including economic, health, political, and work-related domains. The pandemic has also caused an emotional effect on individuals, reflected in their opinions and comments on social media platforms, such as Twitter. This study explores the evolution of sentiment in Spanish pandemic tweets through a data analysis based on a fine-tuned BERT architecture. A total of six million tweets were collected using web scraping techniques, and pre-processing was applied to filter and clean the data. The fine-tuned BERT architecture was utilized to perform sentiment analysis, which allowed for a deep-learning approach to sentiment classification. The analysis results were graphically represented based on search criteria, such as "COVID-19" and "coronavirus". This study reveals sentiment trends, significant concerns, relationships with announced news, public reactions, and information dissemination, among other aspects. These findings provide insight into the emotional impact of the COVID-19 pandemic on individuals and the corresponding impact on social media platforms.


Introduction
At the beginning of the COVID-19 pandemic, people were isolated from workplaces, schools, recreational centers, parks, and sports venues, with consequences for their mental health [1]. Problems such as stress, anxiety, fear, bad mood, irritability, frustration, and boredom, among others, arose with the emergence of the pandemic [2]. At the start of the pandemic, there was very little information, leading many researchers to access data and information from various sources during the COVID-19 outbreak. All of these data reflected the points of view of individuals, organizations, and government agencies on the coronavirus [3].
In parallel, there has been a progressive worldwide increase in the use of the internet and its various tools for sharing information in recent years [4]. Ordinary users have shifted from being consumers to being producers of data through different sources, such as blogs, websites, social networks, and apps, among others [5]. Nowadays, social media platforms are the most widely used tools for spreading data. For example, Twitter, a social media platform with over 500 million users worldwide, is where opinions and comments about any topic, ranging from criticism of or congratulations for products and services to sports and political issues, are most widely spread [6].
During the pandemic, social media platforms began to play an important role in times of confinement, as they were tools that kept people connected with the world without leaving home [7]. In the case of Twitter, it was used to share opinions about COVID, show points of view, and in some cases, write comments based on the sentiments people were feeling at that moment [3]. For this reason, sentiment analysis (SA) techniques began to be applied to decipher and discover the emotions of different Twitter users [8] during the pandemic. The most-used technique to address SA is machine learning (ML), which depends on the existence of pre-labeled training documents, i.e., those that have already been assigned a polarity. Within the literature related to ML, there are works such as the one by De Freitas et al. [22], which performed an ontology-supported analysis in the domain of movies and hotels in Portuguese. Other examples include the study by Steinberger et al. [20], which presented a supervised approach to restaurant reviews in Czech, and the study by Manek et al. [23], which proposed a system in the English language based on the GINI index of movies.
In recent years, more advanced and precise approaches have used deep learning [24][25][26]. Deep learning consists of algorithms based on neural networks that achieve a high level of generalization in data processing; this is accomplished via multiple stacked layers that alternate linear and non-linear transformations [27].
Various social networks have been studied. In the work by Bonifazi et al. [28], based on Reddit data, the concept of scope in social network analysis was addressed, proposing a multidimensional view that included temporal scope. The study introduced the concept of sentiment scope and developed a general framework for analyzing it on any social network platform. Bonifazi et al. [29] presented three novel approaches to analyzing COVID-19 Reddit posts: dynamic classification, virtual subreddit creation, and user community identification, addressing gaps in previous research.
Several works are found in the literature related to sentiment analysis on Twitter that revolve around COVID-19. There are works on English tweets [30], in which 500,000 tweets were analyzed in April 2020, finding that 36% of people had optimistic opinions, while only around 14% were negative. Likewise, an analysis was carried out in March 2020 on 50,000 tweets from several countries, resulting in most people having a positive and hopeful approach [31]. However, there were cases of fear, sadness, and disgust worldwide. Another contribution is found in the study by Boon-Itt et al. [32], where over 100,000 tweets were analyzed between December 2019 and March 2020, resulting in people having a negative perspective towards COVID-19, and the findings of predominant themes, such as pandemic emergency and learning how to control the disease.
In other languages, there are contributions in which more than 6 million comments in English and Portuguese were analyzed, resulting in the discovery of sentiment trends and relationships with announced news, as well as a comparison of human behavior in two different geographical locations affected by the pandemic [33]. In another study [3], data were processed from users who shared their location as being "Nepal" between 21 May 2020 and 31 May 2020. The results of the study concluded that while most people in Nepal adopted a positive and hopeful approach, there were cases of fear, sadness, and disgust. Another contribution is found in a study [34] where Twitter messages collected during the first months of the COVID-19 pandemic in Europe were analyzed, finding that lockdown announcements correlated with a deterioration of mood in almost all surveyed countries, recovering in a short time.
Other contributions have focused on a Twitter analysis considering several aspects, such as the impact of vaccines on the pandemic [35,36], the impact of COVID on the financial market [37], the use or non-use of masks [38], the effects of confinement [39], and the relationship of COVID with announcements and news [34], among others. In the case of Colombia, some contributions can be found, such as a study [40] where 38,000 Twitter posts on COVID-19 vaccination were analyzed, highlighting opposition to the government with feelings of anger. In addition, a study analyzed the sentiments of 72,000 Twitter comments related to isolation, identifying the most frequent themes and words, resulting in fear as the predominant sentiment during the entire confinement period [41].
In another study [42], the researchers proposed a new approach to the sentiment analysis of COVID-19 news headlines using deep neural networks, achieving high accuracy and analyzing over 73,000 pandemic-related tweets from six global channels. In a study by Kumari et al. [43], a deep learning mechanism for the sentiment analysis of COVID-19-related Twitter data was proposed, utilizing an intelligent lead-based BiLSTM and intelligent lead optimization to reduce loss and improve accuracy. The proposed mechanism outperformed the baseline KNN technique, as assessed by metrics such as accuracy, sensitivity, and specificity.
Using pre-trained models for Spanish text has seen significant advancements in recent years. A BERT-based pre-trained language model (BETO) was introduced, trained exclusively on Spanish data from various sources, including Wikipedia [44]. This model has been utilized in different applications for the sentiment analysis of Spanish tweets [45], achieving accuracy levels of approximately 65%. Furthermore, another study presented the CaTrBETO model, which combines a caption transformer (CaTr) and BETO to understand the sentiments in tweets by jointly analyzing images and text [46]. The impact of COVID-19 on mental and physical health was discussed in [25], with social media found to be inducing fear and anxiety. That study analyzed Twitter data using a Hybrid Heterogeneous Support Vector Machine (H-SVM) for sentiment classification, outperforming RNN and SVM. In another paper [47], the sentiment analysis of Twitter posts related to COVID-19 from March to mid-April 2020 was addressed, using natural language processing techniques and seven different LSTM-based deep learning models. The models were trained to classify tweets into three classes (negative, neutral, and positive), an advance compared with traditional machine learning classifiers. Similar works proposed different deep learning models. Jojoa et al. [48] proposed a DistilBERT transformer model for detecting positive and negative sentiments in open-text survey responses during the COVID-19 pandemic. The study aimed to contribute to understanding people's sentiments during the pandemic and to addressing future challenges related to lockdown. In another study [49], a new approach to the sentiment analysis of Moroccan tweets related to COVID-19 was proposed. The model achieved high accuracy (86%) and outperformed well-known machine learning algorithms.
The study showed that users' sentiments change over time and are affected by the evolution of the epidemiological situation in Morocco.
In general, efforts have been made to develop systems that automatically analyze large amounts of data through natural language processing, machine learning, data mining, and cloud computing. The majority of the approaches focus on detecting polarity, while the more advanced techniques involve deep learning. There have been several contributions in various languages, including English, Spanish, French, and Chinese. Sentiment analysis has been applied to analyze tweets about COVID-19 in different languages, finding various sentiments such as fear, optimism, and disgust, among others.

Materials and Methods
Sentiment analysis has become an essential tool for understanding human behavior and its relation to relevant social events or phenomena, such as the COVID-19 pandemic. In this sense, the methodology described in this section is a significant contribution to the analysis of sentiments expressed in Spanish tweets collected on Twitter during the period of 2020-2021.
The methodology consisted of several phases (see Figure 1), starting with data extraction through web-scraping techniques on the Twitter platform. Next, the pre-processing stage took place, which involved cleaning and filtering the collected data and eliminating irrelevant or duplicated words or terms. Fine-tuning of the BERT architecture was then applied, a deep-learning technique that allows a language model to be trained to identify and classify sentiments expressed in Spanish tweets. Once sentiment extraction had been carried out, the data analysis phase proceeded, which involved interpreting and visualizing the obtained results. The methodology described allowed for the identification of trends and patterns in the evolution of sentiments over time.

Data Extraction
The data extraction process was carried out using data scraping on Twitter. Scraping is a data-gathering technique that transforms the unstructured data found on the web into structured data, which is then stored locally for further analysis. The process was implemented in Python. The general steps for data extraction are shown in Figure 2, and Algorithm 1 describes the general procedure applied in this phase.
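The daily collection loop of such a procedure can be sketched as follows. This is a minimal sketch, not the study's actual code: `fetch_page` is a hypothetical stand-in for whatever scraping client is used, since the exact library is not specified here.

```python
from datetime import date, timedelta

def collect_day(fetch_page, day, terms, daily_cap=10_000):
    """Collect up to `daily_cap` Spanish-language tweets for one day.

    `fetch_page(term, day, cursor)` is a hypothetical scraping helper that
    returns (list_of_tweets, next_cursor_or_None)."""
    collected = []
    for term in terms:
        cursor = None
        while len(collected) < daily_cap:
            tweets, cursor = fetch_page(term, day, cursor)
            collected.extend(tweets)
            if cursor is None:  # no more results for this search term
                break
        if len(collected) >= daily_cap:
            break
    return collected[:daily_cap]

def collect_range(fetch_page, start, end, terms):
    """Iterate day by day over [start, end], storing one batch per day."""
    batches = {}
    day = start
    while day <= end:
        batches[day] = collect_day(fetch_page, day, terms)
        day += timedelta(days=1)
    return batches
```

The per-day cap and the day-by-day iteration mirror the collection scheme described later in the Results section (roughly 10,000 tweets per day over the analysis period).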

Data Pre-Processing
In this phase, a series of regular expressions were used to pre-process the text, removing irrelevant elements such as mentions, links, emojis, and punctuation marks. In addition, the hashtags were expanded and replaced by their corresponding text, removing the "#" symbol and separating the words that compose them. The hashtag expansion was carried out using a dictionary of 637,000 Spanish words. This pre-processing step facilitated the analysis of sentiments in social networks, since it reduced the complexity of the data by eliminating irrelevant elements and standardizing the information. We considered that expanding the hashtags and replacing them with their corresponding text would preserve key elements for sentiment identification. In this way, the application of sentiment analysis techniques, such as machine learning, natural language processing, and data mining, was facilitated.
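A minimal sketch of this cleaning stage follows; the patterns below are representative assumptions, since the study's exact regular expressions are not reproduced here.

```python
import re

# Representative patterns for the elements the pre-processing removes.
MENTION = re.compile(r"@\w+")                                  # user mentions
URL = re.compile(r"https?://\S+")                              # links
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")     # common emoji ranges
PUNCT = re.compile(r"[^\w\s]")                                 # punctuation, incl. "#"
SPACES = re.compile(r"\s+")

def clean_tweet(text: str) -> str:
    """Strip mentions, links, emojis, and punctuation, then normalize
    whitespace and case. Hashtag expansion is a separate, dictionary-based
    step and is not shown here."""
    text = MENTION.sub(" ", text)
    text = URL.sub(" ", text)
    text = EMOJI.sub(" ", text)
    text = PUNCT.sub(" ", text)
    return SPACES.sub(" ", text).strip().lower()
```

Note that `\w` in Python's Unicode mode already covers accented Spanish letters, so words such as "cuarentena" or "información" survive the punctuation filter intact.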

Fine-Tuning of BERT Architecture
BERT has achieved state-of-the-art performance in various NLP tasks by employing a transformer-based architecture [50] (see Figure 3). The model is pre-trained on a large amount of text data and fine-tuned on a smaller labeled dataset for specific tasks [14]. One of the most important aspects of the BERT architecture is that it captures a vast amount of contextual information from a large corpus of text through bidirectional and masked language modeling. This permitted us to understand the context and relationships between words and phrases better than would be possible with other architectures. Unlike context-free models, such as Word2Vec and GloVe, BERT offers contextual embeddings for every word within a text, which is a crucial aspect of its architecture. Additionally, BERT's ability to handle long-term dependencies and its bidirectional nature allow it to perform well on tasks that require more complex language understanding, such as sentiment analysis. Finally, BERT's architecture is highly modular and can be fine-tuned for specific tasks with relative ease, making it a versatile tool for natural language processing applications.
The BERT architecture comprises three primary components: an embedding layer, several transformer layers [51], and a task-specific output layer. In the embedding layer, each input word token is transformed into a continuous vector space through an embedding matrix. In the subsequent transformer layers, each word token interacts with all of the other word tokens via an attention mechanism, allowing for the exchange of information between them [52,53]. The main weakness of BERT is its inability to handle very long text sequences, as it considers only up to 512 tokens. This means that long sequences must be split into multiple short sequences of 512 tokens, which can limit the technique's effectiveness for certain applications [50]. However, in the case of Twitter analysis, the length of the sequence to be analyzed is not a significant limitation due to the inherent characteristics of tweets.
Fine-tuning BERT with tweets in Spanish was necessary to improve its performance on Spanish text, which has its own linguistic characteristics, such as differences in syntax, morphology, and vocabulary, compared to other languages. By fine-tuning BERT on Spanish tweets, the model could better capture the nuances of the sentiments in Spanish text and therefore produce more accurate results.
The general methodology for the fine-tuning process of the BERT-based architecture followed the classical guidelines for this type of process in machine learning. It began with the selection of a pre-trained architecture. Next, the dataset to be used in the model adjustment phase was defined in order to train and validate the model. The dataset was selected based on criteria such as language, origin (Twitter), and the amount of available data.
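As an illustration of the task-specific output layer mentioned above, the following sketch maps a [CLS] embedding to sentiment classes with a linear layer followed by a softmax. The dimensions, weights, and label set are toy values for illustration, not the study's actual parameters.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def classify(cls_vector, weights, bias,
             labels=("negative", "neutral", "positive")):
    """Task-specific output layer: a linear map from the [CLS] embedding
    to one logit per sentiment class, followed by a softmax."""
    logits = [sum(w * x for w, x in zip(row, cls_vector)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)
    return labels[probs.index(max(probs))], probs
```

During fine-tuning, only this small head is trained from scratch; the pre-trained embedding and transformer layers are adjusted with a low learning rate.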

Sentiment Extraction
Each tweet was fed into a classification model based on the BERT architecture for sentiment analysis. The input data was preprocessed using techniques such as tokenization, attention masks, and padding to create suitable input features for the model. To perform this task efficiently on large volumes of data, the processing was carried out using the CuDF library, which leverages the power of GPU acceleration. The resulting sentiment scores were then used to generate insights into the attitudes and opinions expressed in the tweets, which could be used for a variety of applications in fields such as social media analytics, market research, and public opinion monitoring.
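The input-feature preparation described above (tokenization, attention masks, and padding) can be illustrated with a toy encoder. The tiny vocabulary and special tokens below are stand-ins for the real BERT tokenizer's, which would be used in practice.

```python
def encode(tokens, vocab, max_len=16, pad_id=0):
    """Toy version of BERT input-feature preparation: map tokens to ids,
    truncate/pad to a fixed length, and build the matching attention mask
    (1 = real token, 0 = padding)."""
    ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens][: max_len - 2]
    ids = [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]  # BERT's special tokens
    mask = [1] * len(ids)
    ids += [pad_id] * (max_len - len(ids))
    mask += [0] * (max_len - len(mask))
    return ids, mask
```

The attention mask lets the model ignore padding positions, so tweets of different lengths can be processed in uniform batches on the GPU.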

Data Analysis
In the data analysis phase, we used the CuDF library to preprocess and manipulate the data. In this phase, sentiments were analyzed over time, discriminating by sentiment type. We performed an analysis of relevant events during the pandemic and searched for patterns to determine if they influenced the generation of sentiments analyzed in the collected data. To show and analyze the evolution of the sentiments, we created visualizations, such as time-series graphs and heatmaps, highlighting changes and trends in the sentiments over time. We also performed statistical analysis and hypothesis testing to identify significant differences in the sentiments between time periods, sentiment types, and other relevant variables.
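A minimal sketch of this kind of analysis follows, assuming per-tweet sentiment labels keyed by day; the study's actual statistics and CuDF pipeline are not reproduced here.

```python
from collections import defaultdict
import math

def daily_positive_rate(records):
    """records: iterable of (day, label) pairs.
    Returns {day: share of tweets labeled 'positive'} for a time series."""
    counts = defaultdict(lambda: [0, 0])  # day -> [positives, total]
    for day, label in records:
        counts[day][0] += label == "positive"
        counts[day][1] += 1
    return {d: p / n for d, (p, n) in counts.items()}

def two_proportion_z(p1, n1, p2, n2):
    """z statistic for the difference between two proportions, e.g. the
    positive-tweet share in two time periods (a standard hypothesis test)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se
```

The resulting daily rates feed the time-series plots, and the z statistic gives a simple way to check whether the sentiment share differs significantly between two periods.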

Results
The experiments were conducted on a computer equipped with an Intel Xeon Gold 6230R 26-core processor running at 2.10 GHz, 256 GB of DDR4-2933 memory, and two NVIDIA RTX A6000 48 GB GPUs.

Data Extraction
This phase involved the collection of 6,306,621 tweets written in the Spanish language, at an approximate rate of 10,000 tweets per day, during the analysis period. The tweet collection process was consistent throughout most of the analysis days, with a few exceptions where the API returned errors, leading to some missing data. The total number of tweets per day obtained is presented in Figure 4. The data were stored using the Hierarchical Data Format version 5 (HDF5), an open-source file format that enables efficient storage of large and complex data, making it an ideal choice for handling the extensive dataset in this study.
Different search terms, such as COVID, COVID19, COVID-19, Coronavirus, and SARS-CoV-2, were used. The data were collected day by day, from 1 April 2020 to 30 December 2021, until 10,000 comments written in the Spanish language were obtained for each day. The parameters used and the extracted fields are described in Table 1.

Data Pre-Processing
The Algorithm 2 procedure was applied to each of the collected tweets, using the CuDF library for GPU acceleration. In this work, the replacement of emojis with words associated with the sentiment they reflect was not addressed. Table 2 shows an example of the result of applying this procedure. Some hashtags were not properly expanded, even though a dictionary of 637,000 words in Spanish was used. This study did not consider the analysis of possible errors in the expansion of hashtags or the impact these could have on the classification of the sentiments they represent.
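Dictionary-based hashtag expansion of the kind described can be sketched as a dynamic-programming word segmentation. The small dictionary below is a toy stand-in for the 637,000-word Spanish dictionary used in the study; hashtags with no full segmentation are left unexpanded, which is one way expansion can fail.

```python
def expand_hashtag(tag, dictionary):
    """Split a hashtag body into dictionary words, e.g. '#QuedateEnCasa' ->
    'quedate en casa'. Returns None when no complete segmentation exists."""
    word = tag.lstrip("#").lower()
    n = len(word)
    best = [None] * (n + 1)  # best[i] = a segmentation of word[:i], or None
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and word[j:i] in dictionary:
                best[i] = best[j] + [word[j:i]]
                break
    return " ".join(best[n]) if best[n] is not None else None
```

When several segmentations exist, this greedy variant returns the first one found; a production version might prefer the segmentation with the fewest or most frequent words.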

Training Data-Set Selection
In selecting a training dataset for fine-tuning the sentiment analysis model to apply to Spanish tweets, we considered data that had a range of sentiment labels to ensure that the model could learn to distinguish between positive, negative, and neutral sentiments. The Sentiment140 dataset was translated into Spanish; it contains a total of over 1.6 million tweets with positive, negative, and neutral labels. This dataset had the advantage of being large and diverse, covering a wide range of topics and sentiments. Another option was the Dataset de Twitter en Español-SemEval 2017 Task 4A, which consisted of tweets in Spanish with positive, negative, and neutral labels. This dataset was created for a sentiment analysis task and was specifically designed for training and evaluating sentiment analysis models. Finally, the Emotex dataset, which contained tweets in Spanish annotated with emotional categories, may have been useful for sentiment analysis in the context of emotional recognition. This dataset had the advantage of providing more detailed information about the emotions expressed in the tweets. Table 3 summarizes the key characteristics of these datasets.

Pre-Trained BERT Model Selection
Various pre-trained models of Bidirectional Encoder Representations from Transformers (BERT) were available for processing text with tweet characteristics, each with their advantages and limitations. BERT-base, one of the most popular models, comprised 110 million parameters and could handle a wide range of natural language processing (NLP) tasks. BERT-large, on the other hand, was capable of processing large volumes of data due to its 340 million parameters. DistilBERT, a lighter version of BERT with approximately half the parameters of BERT-base, was more efficient in terms of memory and processing. RoBERTa, a BERT-based model, utilized a more advanced pre-training algorithm and had slightly better accuracy than BERT in various NLP tasks. Another architecture considered was ALBERT, which used a factorized embedding parameterization to reduce the number of parameters in the model while maintaining its performance. ALBERT was especially suitable for resource-limited devices due to its reduced memory and computational requirements. Table 4 shows a description of the analyzed models, including their most relevant features.

Training Data-Set Selection
In selecting a training dataset for fine-tuning the sentiment analysis model to apply to Spanish tweets, we considered data that had a range of sentiment labels to ensure that the model could learn to distinguish between positive, negative, and neutral sentiments. The Sentiment140 dataset was translated into Spanish, with the dataset containing a total of over 1.6 million tweets with positive, negative, and neutral labels. This dataset had the advantage of being large and diverse, covering a wide range of topics and sentiments. Another option was the Dataset de Twitter en Español-SemEval 2017 Task 4A, which consisted of tweets in Spanish with positive, negative, and neutral labels. This dataset was created for a sentiment analysis task and was specifically designed for training and evaluating sentiment analysis models. Finally, the Emotex dataset, which contained tweets in Spanish annotated with emotional categories, may have been useful for sentiment analysis in the context of emotional recognition. This dataset had the advantage of providing more detailed information about the emotions expressed in the tweets. Table 3 summarizes the key characteristics of these datasets.

. Pre-Trained BERT Model Selection
Various pre-trained models of Bidirectional Encoder Representations from Transformers (BERT) were available for processing text with tweet characteristics, each with their advantages and limitations. BERT-base, one of the most popular models, comprised 110 million parameters and could handle a wide range of natural language processing (NLP) tasks. BERT-large, on the other hand, was capable of processing large volumes of data due to its 340 million parameters. DistilBERT, a lighter version of BERT with approximately half the parameters of BERT-base, was more efficient in terms of memory and processing. RoBERTa, a BERT-based model, utilized a more advanced pre-training algorithm and had slightly better accuracy than BERT in various NLP tasks. Another architecture to consider was ALBERT, which used a factorized embedding parameterization to reduce the number of parameters in the model while maintaining its performance. AL-BERT was especially suitable for resource-limited devices due to its reduced memory and computational requirements. Table 4 shows a description of the analyzed models, including their most relevant features.

Training Data-Set Selection
In selecting a training dataset for fine-tuning the sentiment analysis model to apply to Spanish tweets, we considered data that had a range of sentiment labels to ensure that the model could learn to distinguish between positive, negative, and neutral sentiments. The Sentiment140 dataset was translated into Spanish, with the dataset containing a total of over 1.6 million tweets with positive, negative, and neutral labels. This dataset had the advantage of being large and diverse, covering a wide range of topics and sentiments. Another option was the Dataset de Twitter en Español-SemEval 2017 Task 4A, which consisted of tweets in Spanish with positive, negative, and neutral labels. This dataset was created for a sentiment analysis task and was specifically designed for training and evaluating sentiment analysis models. Finally, the Emotex dataset, which contained tweets in Spanish annotated with emotional categories, may have been useful for sentiment analysis in the context of emotional recognition. This dataset had the advantage of providing more detailed information about the emotions expressed in the tweets. Table 3 summarizes the key characteristics of these datasets.

. Pre-Trained BERT Model Selection
Various pre-trained models of Bidirectional Encoder Representations from Transformers (BERT) were available for processing text with tweet characteristics, each with their advantages and limitations. BERT-base, one of the most popular models, comprised 110 million parameters and could handle a wide range of natural language processing (NLP) tasks. BERT-large, on the other hand, was capable of processing large volumes of data due to its 340 million parameters. DistilBERT, a lighter version of BERT with approximately half the parameters of BERT-base, was more efficient in terms of memory and processing. RoBERTa, a BERT-based model, utilized a more advanced pre-training algorithm and had slightly better accuracy than BERT in various NLP tasks. Another architecture to consider was ALBERT, which used a factorized embedding parameterization to reduce the number of parameters in the model while maintaining its performance. AL-BERT was especially suitable for resource-limited devices due to its reduced memory and computational requirements. Table 4 shows a description of the analyzed models, including their most relevant features.

Training Data-Set Selection
In selecting a training dataset for fine-tuning the sentiment analysis model to apply to Spanish tweets, we considered data that had a range of sentiment labels to ensure that the model could learn to distinguish between positive, negative, and neutral sentiments. The Sentiment140 dataset was translated into Spanish, with the dataset containing a total of over 1.6 million tweets with positive, negative, and neutral labels. This dataset had the advantage of being large and diverse, covering a wide range of topics and sentiments. Another option was the Dataset de Twitter en Español-SemEval 2017 Task 4A, which consisted of tweets in Spanish with positive, negative, and neutral labels. This dataset was created for a sentiment analysis task and was specifically designed for training and evaluating sentiment analysis models. Finally, the Emotex dataset, which contained tweets in Spanish annotated with emotional categories, may have been useful for sentiment analysis in the context of emotional recognition. This dataset had the advantage of providing more detailed information about the emotions expressed in the tweets. Table 3 summarizes the key characteristics of these datasets. Various pre-trained models of Bidirectional Encoder Representations from Transformers (BERT) were available for processing text with tweet characteristics, each with their advantages and limitations. BERT-base, one of the most popular models, comprised 110 million parameters and could handle a wide range of natural language processing (NLP) tasks. BERT-large, on the other hand, was capable of processing large volumes of data due to its 340 million parameters. DistilBERT, a lighter version of BERT with approximately half the parameters of BERT-base, was more efficient in terms of memory and processing. RoBERTa, a BERT-based model, utilized a more advanced pre-training algorithm and had slightly better accuracy than BERT in various NLP tasks. 
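Combining candidate corpora such as these requires mapping their heterogeneous label schemes onto a single target set before fine-tuning. A minimal sketch of such label harmonization follows; the per-dataset label encodings below are illustrative assumptions, not the actual file formats of the corpora discussed:

```python
# Harmonize sentiment labels from heterogeneous corpora into one
# target scheme: POSITIVE / NEGATIVE / NEUTRAL.
# The per-dataset raw label values below are illustrative assumptions.
LABEL_MAPS = {
    "sentiment140": {0: "NEGATIVE", 2: "NEUTRAL", 4: "POSITIVE"},
    "semeval2017_4a": {
        "positive": "POSITIVE",
        "negative": "NEGATIVE",
        "neutral": "NEUTRAL",
    },
}

def harmonize(dataset: str, raw_label) -> str:
    """Map a dataset-specific label onto the shared three-class scheme."""
    try:
        return LABEL_MAPS[dataset][raw_label]
    except KeyError:
        raise ValueError(f"Unknown label {raw_label!r} for dataset {dataset!r}")

# Example: Sentiment140 encodes polarity numerically.
print(harmonize("sentiment140", 4))  # POSITIVE
```

An explicit mapping table of this kind also documents every labeling decision, which matters when the corpora disagree on what counts as "neutral".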

Pre-Trained BERT Model Selection
Various pre-trained models of Bidirectional Encoder Representations from Transformers (BERT) were available for processing text with tweet characteristics, each with their advantages and limitations. BERT-base, one of the most popular models, comprised 110 million parameters and could handle a wide range of natural language processing (NLP) tasks. BERT-large, on the other hand, was capable of processing large volumes of data due to its 340 million parameters. DistilBERT, a lighter version of BERT with approximately half the parameters of BERT-base, was more efficient in terms of memory and processing. RoBERTa, a BERT-based model, utilized a more advanced pre-training algorithm and had slightly better accuracy than BERT in various NLP tasks. Another architecture to consider was ALBERT, which used a factorized embedding parameterization to reduce the number of parameters in the model while maintaining its performance. ALBERT was especially suitable for resource-limited devices due to its reduced memory and computational requirements. Table 4 shows a description of the analyzed models, including their most relevant features.
For this work, we considered using BERT-base as it was a widely used and tested model that could handle a variety of NLP tasks. Additionally, the parameter count of this model was sufficient for effectively processing tweet data. The pre-trained model was initialized, with the parameters released in one study [54], and made available in another [55].

Selecting the Optimal Fine-Tuning Parameters
To select the optimal fine-tuning parameters for the BERT-base architecture, we considered the trade-off between model complexity and generalization, starting from the configurations recommended in the original model release. The approach was to use a small learning rate during the initial stages of fine-tuning, followed by a gradual increase as training progressed. Another important consideration was the batch size, which should be large enough to ensure the efficient use of computational resources, but small enough to avoid overfitting. The number of epochs was likewise chosen carefully to prevent overfitting and ensure optimal model performance. Finally, during fine-tuning, techniques such as early stopping and dropout were used to prevent overfitting and improve the generalization performance of the model. We performed a grid search over batch sizes of {8, 10, 16, 32} and learning rates of {5 × 10⁻⁶, 1 × 10⁻⁵, 3 × 10⁻⁵, 5 × 10⁻⁵} to determine the optimal values for our specific task.
The optimal values are shown in Table 5. We formed a balanced dataset of 45,000 tweets and split it into 60% for training, 20% for validation, and 20% for the test set, taking care to keep an even distribution of the target classes in each partition. This partitioning was employed to evaluate the performance of the fine-tuned model while preserving its generalization, since separate subsets were held out for validation and testing. The results on the test set, consisting of 9000 tweets, are shown in Table 6.
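The stratified 60/20/20 partitioning and the grid search described above can be sketched as follows. Here `evaluate_config` is a hypothetical stand-in for one complete fine-tuning run (training BERT-base with the given batch size and learning rate and returning its validation score); the full training loop is outside the scope of this sketch:

```python
import random
from collections import defaultdict

def stratified_split(samples, train=0.6, val=0.2, seed=42):
    """Split (text, label) pairs 60/20/20 while preserving class balance."""
    by_class = defaultdict(list)
    for s in samples:
        by_class[s[1]].append(s)
    rng = random.Random(seed)
    tr, va, te = [], [], []
    for group in by_class.values():
        rng.shuffle(group)
        n_tr = int(len(group) * train)
        n_va = int(len(group) * val)
        tr += group[:n_tr]
        va += group[n_tr:n_tr + n_va]
        te += group[n_tr + n_va:]
    return tr, va, te

def grid_search(evaluate_config, batch_sizes=(8, 10, 16, 32),
                learning_rates=(5e-6, 1e-5, 3e-5, 5e-5)):
    """Return the (batch_size, learning_rate) pair with the best
    validation score, trying every combination in the grid."""
    best, best_score = None, float("-inf")
    for bs in batch_sizes:
        for lr in learning_rates:
            score = evaluate_config(bs, lr)  # one fine-tuning run
            if score > best_score:
                best, best_score = (bs, lr), score
    return best, best_score
```

With 45,000 balanced samples, `stratified_split` yields 27,000/9000/9000 partitions, each with the same class proportions as the whole.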

Sentiment Extraction and Data Analysis
The sentiment analysis phase consisted of evaluating the sentiments using the trained model. Three sentiment classes were used: POSITIVE, NEGATIVE, and NEUTRAL. The results show that the number of positive tweets was significantly lower than in the other categories; the category with the highest number of tweets was the neutral one, as shown in Figure 5. Some examples are shown in Table 7. Additionally, an emotion was classified for each tweet, and the results are shown in Figure 6: the predominant emotion was anger, followed by joy, although sadness and fear also appeared as relevant emotions. Figure 7 shows the number of tweets in each sentiment class. Positive tweets were few compared to the other classes, while the daily counts of neutral and negative tweets were significantly correlated; it is probable that the neutral category is where the classifier presented the most uncertainty. To ascertain whether specific users had a significant impact on the daily trend of the collected data, the number of unique users in the group of tweets was also evaluated. Figure 8 shows the daily average of tweets in blue together with the daily average of unique users in green.
The difference of approximately 2000 between these lines suggests that some tweets originated from the same users. Still, there were around 8000 unique users per day, suggesting heterogeneity in the gathered data.
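The daily aggregation behind Figure 8 amounts to grouping tweets by calendar day and counting both rows and distinct author IDs; a minimal sketch follows (the `(date, user_id)` pair representation is an assumption, not the paper's actual storage schema):

```python
from collections import defaultdict

def daily_activity(tweets):
    """tweets: iterable of (date, user_id) pairs.
    Returns {date: (tweet_count, unique_user_count)}."""
    users_per_day = defaultdict(set)   # distinct authors per day
    tweets_per_day = defaultdict(int)  # total tweets per day
    for date, user in tweets:
        tweets_per_day[date] += 1
        users_per_day[date].add(user)
    return {d: (tweets_per_day[d], len(users_per_day[d]))
            for d in tweets_per_day}

# A gap between the two counts signals repeated authors on that day.
stats = daily_activity([
    ("2020-03-01", "a"), ("2020-03-01", "a"), ("2020-03-01", "b"),
    ("2020-03-02", "c"),
])
print(stats["2020-03-01"])  # (3, 2)
```

Plotting the two per-day series against each other gives exactly the blue and green lines described above.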

Table 7. Example results of sentiment classification on Spanish tweets (columns: Sentiment; Example (Raw Input)).

Using Emojis and Pre-Trained Language Model in Spanish
A pre-trained model called RoBERTuito [56] was used; it was trained on a Spanish Twitter-specific corpus and incorporates the interpretation of emojis. Therefore, in this section, we present the tweets without removing the emojis and analyze the results.
The RoBERTuito model decreased the number of tweets classified as neutral and increased the number of negative and positive tweets (see Figure 9).
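The comparison behind Figure 9 can be reproduced by tallying, tweet by tweet, how RoBERTuito's label differs from the BERT-base label. A hedged sketch over two aligned lists of pre-computed predictions (both lists are assumed inputs; obtaining them requires running the respective models):

```python
from collections import Counter

def label_shifts(base_preds, new_preds):
    """Count (base_label, new_label) transitions between two models'
    predictions over the same, identically ordered tweets."""
    if len(base_preds) != len(new_preds):
        raise ValueError("prediction lists must be aligned")
    return Counter(zip(base_preds, new_preds))

shifts = label_shifts(
    ["NEU", "NEU", "NEG", "POS"],   # BERT-base labels (toy data)
    ["NEG", "POS", "NEG", "POS"],   # RoBERTuito labels (toy data)
)
# Tweets leaving the neutral class under the second model:
neu_out = sum(n for (a, b), n in shifts.items() if a == "NEU" and b != "NEU")
print(neu_out)  # 2
```

The full transition counter also shows directly which classes absorb the former neutrals, i.e., whether the shift favors the negative or the positive class.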


In Table 8, examples are shown of tweets that were classified as neutral by the BERT-base model and whose sentiment classification varied with the RoBERTuito model. Some examples were accurately classified; for others, however, the sentiment assignment was less clear. Emojis can impose a burden on classification that is not easy to quantify, due to the high informality and freedom of their use.
The use of pre-trained models in sentiment analysis poses challenges when attempting to measure their precision, due to the absence of a manually labeled reference dataset. Our collected dataset for sentiment analysis was not manually labeled; hence, it was difficult to evaluate the precision of the pre-trained models used.
In the fine-tuning process of the models, a specific set of tweets was used for training. According to one researcher [56], the datasets used to train the RoBERTuito model were the same datasets used to fine-tune the BERT-base model, thereby invalidating the statistical comparisons of both models. Due to the impossibility of establishing a comparative statistical framework, qualitative comparisons will be made based on the results obtained in this study (see Table 7). This will allow for a descriptive and subjective evaluation of the performance of the pre-trained models in the task of sentiment analysis.
Figure 9. Number of tweets: (a) total number of tweets classified by sentiment during the defined time period; (b) number of tweets broken down by year using RoBERTuito.
Table 8. A comparison of tweet classifications between the two models (BERT-base → RoBERTuito).
NEU → NEG: La señora de la alita ya reabrió su negocio, solo delivery, obvio tienta pero es , encima la casa de al lado y 2 casas más abajo de la suya tienen personas con covid.
NEU → NEG: "Cerrado por COVID-19": cercaron una cuadra de un barrio por caso positivo
NEU → NEG: https://t.co/HFtsSTG10W

Bias and Representativeness of the Data on Twitter
The utilization of only Twitter as a data source in this research may incur a bias due to the platform's characteristics and its usage patterns among users. It is noteworthy that Twitter does not encompass the entire Spanish-speaking population. The profiling of users based on variables such as age, gender, socioeconomic status, and education, among other factors, would be crucial in assessing the data's representativeness on Twitter. Furthermore, certain users may exhibit a more dominant presence on the platform than others, thus impeding the generalizability of the findings to the broader population. Notably, the analysis of Twitter disregards individuals who do not utilize this platform, which may impart a bias on the results and generate incomplete or inaccurate conclusions. Consequently, the outcomes of this study may not be generalizable to the wider Spanish population and may not be a true reflection of reality.
Another interesting aspect of the data is its breakdown by country. However, Twitter's own API reports that there is no guarantee that the reported location is accurate; in some cases, if the user has not granted location permissions, this value is estimated and has low precision.
Unfortunately, a comparable breakdown by the social groups accessing the platform is not possible through the API. Even where it is possible, it may carry significant imprecision, in the sense that some user characteristics are self-reported and not validated by the social network itself.
Alternative social media and digital platforms, such as Facebook, Instagram, and TikTok, may also be indicative of the general population. Each one has distinct attributes and attracts different user groups, resulting in potential variability in the biases depending on the platform employed.

Table 8. Cont.
NEU → POS: Medidas de apoyo al sector cultural y deportivo ante el COVID-19. !!Objetivo: proteger a un sector fundamental en nuestro país para que nadie se quede atrás ante la emergencia sanitaria y social. #ProtegemosLaCultura #ProtegemosElDeporte #EsteVirusLoParamosUnidos https://t.co/EadvNWph
* Some mentions and web addresses were modified to anonymize the text.


Considering Emojis in Sentiment Analysis
We consider the fact that emojis play a significant role in modern communication, as they provide additional contextual information and can enrich the interpretations of text. However, we also recognize that the appropriate use of emojis in sentiment analysis is a challenging aspect and does not necessarily guarantee a significant improvement in the accuracy of classification models.
In our research, we have observed that emojis can be useful in certain contexts, as demonstrated by some researchers [57,58] who found that incorporating emojis in NLP models improved sentiment analysis performance. However, other studies have reported less conclusive results. For instance, in [59], the use of emojis and hashtags was found not to significantly improve performance.
One of the most relevant aspects when including emojis is addressing the challenges that arise in sentiment analysis. These challenges include differences in emoji representations across different platforms, variations in emoji usage across cultures and social contexts, and how emojis can affect the polarity of a message. The sentiment of a tweet and the sentiment conveyed by an emoji should be analyzed independently; we should also consider how their combination affects the overall sentiment of the message [60].
In another study [61], the researchers established that emojis, while ubiquitous in modern digital communication, presented differences in their usage across languages and countries. The study analyzed the relationship between emoji usage and linguistic and geographic factors, using millions of tweets to do so. The results showed that the diversity in emoji usage varied across different languages and countries. Another researcher [62] highlighted variations in the interpretations of emojis across different cultures and communities, which could impact the effectiveness of sentiment analysis when emojis are included. In a study carried out in two Spanish cities [63], it was concluded that the overall semantics of the subset of emojis studied were preserved in both of these cities, but some of them were interpreted differently. In another study [64], researchers discussed the ambiguity in the meaning of emojis and the need for a sense network for emojis, indicating that there may be difficulties in interpreting the meaning and sentiment of emojis in sentiment analysis.
In our study, we examined over six million Spanish-language tweets from diverse sources, ranging from Mexico to Argentina and even including sources in Europe. Given the diversity of cultures represented, we believe that the appropriate use of the semantic meaning of emojis is an important topic. However, given the scope of this study, we decided to exclude emojis to prevent the analysis from becoming overly complex. Nevertheless, we reported the results of processing the tweets with a model that includes emoji interpretation. The findings on its benefits are inconclusive, and a much deeper analysis is required in this regard. Further research into cultural differences in emoji use and interpretation could provide valuable insights into the effectiveness of incorporating emojis in sentiment analyses for various populations.

Conclusions
This study collected a total of 6,306,621 tweets in the Spanish language, using various search terms related to COVID-19. The data were stored using the Hierarchical Data Format version 5 (HDF5), an open-source file format suited to handling large and complex data.
The collected tweets were pre-processed using the cuDF library for GPU acceleration. The sentiment analysis was performed using a BERT-base architecture trained on a dataset of over 1.6 million tweets with positive, negative, and neutral labels. A balanced dataset of 45,000 tweets was used for fine-tuning the model, which was then evaluated on a test set of 9000 tweets.
The results of this study revealed that the BERT-based model achieved an accuracy of 89%, with the neutral category having the highest number of tweets, followed by the negative category, while positive tweets were relatively few. This study also found that the predominant emotion expressed in the tweets was anger, followed by joy, while sadness and fear were also significant. The analysis of unique users revealed a high level of heterogeneity in the data collected.
Regarding future work, we propose replacing emojis with words corresponding to the sentiments they represent, which would prevent the loss of representative sentiment information. In the same vein, a sarcasm detector could decrease the false-positive rate, considering that this type of text is frequently employed on social networks such as Twitter.
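The proposed emoji-to-word substitution could start from a simple dictionary replacement applied before text cleaning; the tiny mapping below is an illustrative assumption, not a validated emoji sentiment lexicon:

```python
# Replace a few common emojis with Spanish sentiment words before
# tokenization, so their polarity is not lost during cleaning.
# This mapping is illustrative only, not a validated lexicon.
EMOJI_WORDS = {
    "😀": "alegría",
    "😢": "tristeza",
    "😡": "enojo",
    "😱": "miedo",
}

def replace_emojis(text: str) -> str:
    """Substitute known emojis with sentiment words, padded with
    spaces so they tokenize as separate words."""
    for emoji, word in EMOJI_WORDS.items():
        text = text.replace(emoji, f" {word} ")
    return " ".join(text.split())  # normalize whitespace

print(replace_emojis("Cerrado por COVID-19 😢😡"))
# Cerrado por COVID-19 tristeza enojo
```

A production version would need a far larger mapping and, as discussed above, per-culture validation of each emoji's intended sentiment.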