Credibility Analysis on Twitter Considering Topic Detection

Twitter is one of the most popular sources of information available on the internet. Thus, many studies have proposed tools and models to analyze the credibility of the information shared. The credibility analysis on Twitter is generally supported by measures that consider the text, the user, and the social impact of text and user. More recently, identifying the topic of tweets has become an interesting aspect for many applications that analyze Twitter as a source of information, for example, to detect trends, to filter or classify tweets, to identify fake news, or even to measure a tweet's credibility. In most of these cases, hashtags are important elements for identifying the topics. In a previous work, we presented a credibility model based on text, user, and social credibility measures, and a framework called T-CREo, implemented as an extension of Google Chrome. In this paper, we propose an extension of our previous credibility model by integrating the detection of the topic in the tweet and calculating the topic credibility measure by considering hashtags. To do so, we evaluate and compare different topic detection algorithms and integrate the one with the best results into our framework T-CREo. To evaluate the performance improvement of our extended credibility model and show the impact of hashtags, we performed experiments in the context of fake news detection using the PHEME dataset. Results demonstrate an improvement in our extended credibility model with respect to the original one, with up to 3.04% F1 score when applying our approach to the whole PHEME dataset and up to 9.60% F1 score when only considering tweets that contain hashtags from the PHEME dataset, demonstrating the impact of hashtags in the topic detection process.


Introduction
Social networks have become tools in people's daily life to share, for example, their opinions, feelings, and stories [1], as well as to support their professional life to communicate news, disasters, accidents, etc. [2]. Thus, social media contributes significantly to a variety of situations, such as awareness [3], disaster notifications [4], entertainment, communication, news and social interaction, information sharing, information seeking, self-documentation, and self-expression [5].
Among the current social media platforms, Twitter is one of the most widely used throughout the world [6], having 650 million registered users [7]. It is the largest social network used to write and read people's short texts (called tweets) about anything in life, with a maximum of 280 characters, mixed with contextual clues, such as URLs, tags, [...] T-CREo, we performed experiments in the context of fake news detection using the PHEME dataset. Results show an improvement in the extended credibility model with respect to the original, i.e., up to 3.04% F1 score when applying our approach to the whole PHEME dataset and up to 9.60% F1 score when only considering tweets that contain hashtags.
In summary, the main contributions of this work are as follows: (i) The integration, into a previous credibility model [22], of the topic detection algorithm that evaluates hashtags, hence obtaining an extended model of credibility that considers four dimensions: text, user, social, and topic credibility measures. (ii) A comparative evaluation (in terms of precision, recall, and F1 metrics) of several topic detection algorithms, based on sequential k-means, latent semantic indexing (LSI), non-negative matrix factorization (NMF), and latent Dirichlet allocation (LDA). (iii) The implementation of the extended model within the T-CREo framework [41], with the topic credibility measure based on NMF, derived from the best results of the comparative study, which allows a quantitative and qualitative evaluation of our extended credibility model in the context of detection of fake news.
This paper is structured as follows: Section 2 presents relevant works related to topic analysis and credibility. Section 3 describes the methods used for the topic analysis including topic detection algorithms, metrics for model evaluation, and distance measures. Section 4 presents the comparative evaluation of the topic detection algorithms considered. Section 5 shows the extended credibility model. Section 6 presents the qualitative and quantitative evaluation of our approach. Finally, Section 7 presents the conclusions and future work.

Related Work
The need for a topic detection system, particularly associated with Twitter, is motivated by the amount of information in microblogs and its use to spread information and express opinions. With this massive amount of Twitter data, users can get saturated and miss important topics [33]. Topic detection, as a technique for discovering the main topics automatically, can therefore help in many applications that analyze Twitter. Accordingly, the literature contains a large number of scientific papers focused on topic detection on Twitter. For example, a simple search of the Scopus database with the keywords topic + detection + Twitter from 2009 to 2022 returns 1692 related articles. In this section, we describe only a few of the most relevant studies related to our work.
There is a large variety of techniques used for topic analysis on Twitter. Works that use machine learning techniques are basically based on supervised learning [27,29,34,42,43] or consider a hybrid approach, which includes latent Dirichlet allocation (LDA) [44] or graphs [28]. Sentiment analysis and topic detection on Twitter have been combined with the Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture (a technique that simplifies the inference process assuming that each document is the result of a single topic) to analyze the content related to COVID-19 from Brazil and the USA [30]. Pattern mining (a frequent pattern mining technique), which takes frequency and utility into account at the same time, has also been used for topic detection in Twitter streams [45]. Lee et al. [43] classified Twitter Trending Topics into 18 general categories, such as sports, politics, technology, by using a text-based classification with a Bag-of-Words approach and a network-based classification. Mottaghinia et al. [34] explored different approaches to detect topics of tweets and classified these approaches into four classes of categories: with word embedding or without word embedding, specified or unspecified, offline or online, and supervised or unsupervised. The authors summarized their advantages and disadvantages and concluded that depending on the application, one of the categories may be more suitable than the other. These works are examples of the different approaches that can be used to detect topics on Twitter; however, they do not propose credibility models based on the detected topics, as we do in our proposal.
Many works focused on Twitter credibility analysis have considered topic detection as an important aspect to improve the calculation of the credibility level of tweets. In the context of the Great Eastern Japan Earthquake, Namihira et al. [27] proposed a method based on the topic and opinion classification for automatically assessing the credibility of information. For this, the authors calculated the ratio of the positive opinions to all opinions about a topic. To identify the topic of a tweet and generate a topic model, the LDA algorithm was used, and to identify if an opinion of the tweet was positive or negative, a sentiment analysis algorithm was applied. Hamdi et al. [28] proposed an approach to evaluate information sources of fake news (SOFN) in terms of credibility on Twitter, based on user features (e.g., created at, name, default profile, default profile image, favorites count, statuses count, description), social graph of users (followers/following graph), and topic annotations. Binary machine learning classifier models are fed with these features to predict SOFN. A web interface framework, implemented as a web plug-in system, was proposed by Tan S. [29] to compare tweets to relevant news headlines. Considering that the news headlines are true, tweets were classified as entailment, neutral, or contradiction with respect to a specific topic. To do so, the author used four classification models: a logistic regression model based on a count vectorizer, a support vector machine based on text content features, a feedforward network using GloVe word embeddings, and an RNN-based LSTM sentence encoder with a multilayer perceptron classifier. Yang et al. [3] designed a crowdsourcing-based credibility framework for Twitter in the context of disaster-awareness situations.
This framework is able to calculate in real-time the topic-level credibility (i.e., emergency situations), by analyzing the text, linked URLs, number of retweets, and geographic information extracted from both a tweet's text and external URLs, which are kept in a database. Thus, the credibility of a detected event is increased when multiple sources (the three factors) refer to the same event: tweets, the linked URLs, and retweets referring to the same event. Thus, the tweet credibility score is calculated based on the information contained in its text and URL, and the accumulated credibility score for each event is calculated based on the number of tweets and retweets associated with the same event. Similar to these works, many other studies propose credibility models related to a specific topic. The topic does not influence the credibility level; however, it is used to classify or filter the tweets. In our work, we identify the topic of the tweet, both to filter the tweet and to impact the level of credibility. Moreover, we also analyze other aspects of the tweet to determine the level of credibility.
Some of these works have used URLs present in the tweet to support the topic analysis. Similarly, other studies consider other aspects in the tweet. Alrubaian et al. [46] proposed a hybrid approach to credibility analysis to identify implausible content on Twitter and prevent the proliferation of fake or malicious information. For this, they designed an automated classification system with four components: a reputation-based component, a credibility classifier engine, a user experience component, and a feature rank algorithm. The classifier engine component distinguishes between credible and noncredible content from a user-tagged dataset considering extracted features at the tweet, user, and hybrid levels. These features include structural aspects of the tweet, such as length, number of tags, mentions, positive and negative words, and URLs and hashtags. The classifier used for this component is the naïve Bayes classifier with a feature rank process. Shao et al. [18] introduced Hoaxy, a platform for the collection, detection, and analysis of online misinformation and its related fact-checking efforts. The platform collects and tracks misinformation and fact checking. The components of this platform consist of a monitor, a database, and different data sources (social networks and news sites). The monitor has a URL tracker for the Twitter API and a set of crawlers for both fake news and fact-checking websites. The extraction from social networks is performed via a stream API and, for the news sites, it uses an RSS (Rich Site Summary) parser and Scrapy Spider technologies. Then, the collected data are stored in a database for future analysis. The aim of this work was to characterize the relation between the overall social sharing activity of misinformation and fact checking.
Similar to these works, we consider hashtags as an extra factor in the topic analysis. In contrast with all the works described in this section, we propose the use of Hellinger distance for comparing the semantic proximity between the topic associated with a tweet and the topic associated with its hashtags. The topics are obtained using the NMF model, which produced better results than other topic detection algorithms such as K-means, latent semantic indexing (LSI), and latent Dirichlet allocation (LDA).

Topic Detection Methods
Ibrahim et al. [33] presented a survey about tools and approaches for topic detection from Twitter streams. They categorized the topic detection techniques into five categories (clustering, frequent pattern mining, exemplar-based, matrix factorization, and probabilistic) and evaluated their performance using three Twitter datasets. In terms of precision, the best results were obtained with Soft Frequent Pattern Mining and Bngram, a clustering technique; in terms of recall, the best results were obtained with Column Subset Selection, a matrix factorization technique. A good balance between recall and precision was obtained with an exemplar-based topic detection model. Considering this survey, we implement and test several of the categorized algorithms to select one to be integrated into the credibility model.
We have considered three categories of techniques: clustering, matrix factorization, and probabilistic techniques. The algorithms were selected considering the libraries available to implement them, mainly scikit-learn (https://scikit-learn.org/stable/, accessed on 1 June 2022), and the results they obtained in similar contexts. The algorithms used are sequential k-means for clustering techniques, non-negative matrix factorization (NMF) and latent semantic indexing (LSI) for matrix factorization techniques, and LDA for probabilistic techniques.

Clustering Models
Sequential k-means [47] is one of the most popular clustering techniques. This algorithm groups observations into k groups based on their characteristics. So, we can partition n observations into k clusters, $S = \{S_1, S_2, \ldots, S_k\}$, such that the within-cluster distance (WC) is minimized [33], as shown in Equation (1).

$$WC = \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^{2} \tag{1}$$

Here, $\mu_i$ denotes the centroid of cluster $S_i$. Equation (1) is minimized by an iterative process that we can explain as follows:
• Initialization: Select k random points as representative centroids.
• Repeat until convergence:
  - Assign each data point to the cluster of the nearest centroid.
  - Recompute each cluster centroid as the average of the assigned points.
The sequential k-means is able to update the existing clusters by applying only new data points [48].
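The sequential update can be illustrated with a minimal sketch. It is not the implementation evaluated in this work: the function name `sequential_kmeans`, the toy points, and the choice of seeding the centroids with the first k points (instead of random points) are assumptions made only to keep the example deterministic.

```python
import math

def sequential_kmeans(points, k):
    """One-pass sequential k-means: each new point updates one centroid."""
    # Assumption: the first k points seed the centroids (the text uses random points).
    centroids = [list(p) for p in points[:k]]
    counts = [1] * k
    for x in points[k:]:
        # Assign the new point to the nearest centroid.
        j = min(range(k), key=lambda i: math.dist(centroids[i], x))
        counts[j] += 1
        # Shift that centroid so it remains the running mean of its points.
        centroids[j] = [c + (xi - c) / counts[j] for c, xi in zip(centroids[j], x)]
    return centroids, counts

data = [[0.0, 0.0], [10.0, 10.0], [0.0, 2.0], [10.0, 12.0], [2.0, 0.0], [12.0, 10.0]]
centroids, counts = sequential_kmeans(data, k=2)
```

Because each centroid is a running mean, clusters can be refreshed with new data points only, without revisiting the points already processed.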

Matrix Factorization Model
From these techniques, we are interested in the latent semantic indexing (LSI) and non-negative matrix factorization (NMF) algorithms.

Latent Semantic Indexing
LSI [49] is a popular text analysis technique. To extract the conceptual content of a document, it is necessary to establish associations between those terms that occur in similar contexts [50]. Thereby, the main idea is to match topics by concepts instead of by terms [51].
Given a data matrix $X_{n \times d}$ (n documents and d terms), LSI factorizes it into the product of three matrices, $U D V^{T}$, which is known as singular value decomposition (SVD) [52], as shown in Equation (2).

$$X = U D V^{T} \tag{2}$$

This can be interpreted as projecting the data matrix X into a lower-dimensional space whose bases are latent topics.
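As a hedged illustration of this projection, the sketch below factorizes a small made-up document-term matrix with NumPy and keeps the two largest singular values. The matrix values and the choice of r = 2 latent topics are assumptions for the example only; in practice, scikit-learn's `TruncatedSVD` is a common way to obtain the same kind of projection on sparse TF-IDF data.

```python
import numpy as np

# Toy document-term matrix X (n = 4 documents, d = 5 terms); values are made up.
X = np.array([
    [2.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0, 1.0],
    [0.0, 0.0, 2.0, 1.0, 1.0],
])

# Thin SVD: X = U @ diag(s) @ Vt, with singular values in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep the r largest singular values; each kept direction acts as a latent topic.
r = 2
doc_topic = U[:, :r] * s[:r]   # documents expressed in the r-dimensional topic space
X_r = doc_topic @ Vt[:r, :]    # rank-r approximation of X
```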

Non-Negative Matrix Factorization (NMF)
LSI has two disadvantages: (i) the factorized matrices may have negative values that do not have an intuitive interpretation; (ii) the bases are latent and cannot be easily interpreted [33]. In contrast, NMF [53] is another class of techniques that guarantees that the factorized matrices contain non-negative values. Figure 1 shows a graphical representation of this process, also represented in Equation (3) [54].

$$X \approx W H, \quad W \geq 0, \; H \geq 0 \tag{3}$$

Matrix X is projected into a lower-dimensional space spanned by a set of latent topics, where the coefficients of each document with respect to these bases are contained in the rows of the matrix W and each base is represented by one row in the matrix H [33].

Probabilistic Model
We have considered LDA among the probabilistic models. LDA is a probabilistic topic modeling approach, where each document is considered a combination of several topics and the characteristics of every topic are determined by a word distribution [55]. Figure 2 shows an illustration of the definition of LDA [56], which considers M documents, each containing a vector θ of topic proportions. Each word w is generated by first choosing a topic z from a multinomial parameterized by θ and then choosing a word from a multinomial conditioned on the selected topic. In our case, we treat each tweet as if it were one of the M documents. LDA has poor performance in the case of short texts [21] because the topics learned by this algorithm are formally multinomial distributions over words and only the top words are used to identify the subject area or give an interpretation of a topic.
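The sketch below shows, under stated assumptions, how the per-document topic proportions θ can be obtained with scikit-learn. The toy tweets, the number of topics, and the use of the online learning method are assumptions for illustration and do not reproduce the tuned configuration of the experiments.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up cleaned tweets; each one plays the role of one of the M documents.
tweets = [
    "election vote president campaign",
    "president campaign election debate",
    "match goal team league",
    "team league goal striker",
]

# LDA models word counts, so raw term counts are used instead of TF-IDF here.
counts = CountVectorizer().fit_transform(tweets)

# Assumption: 2 topics and online variational updates, for illustration only.
lda = LatentDirichletAllocation(n_components=2, learning_method="online", random_state=0)
theta = lda.fit_transform(counts)   # M x k matrix; row m holds the topic proportions of tweet m
```

With only a handful of words per tweet, the estimated proportions rest on very little evidence, which is consistent with the weakness on short texts noted above.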

Comparative Evaluation of Topic Detection Algorithms
The primary goal of having the new measure associated with topic analysis is to improve the credibility model calculation using topic modeling algorithms. Our topic analysis is based on the best topic detection model considering different metrics for comparison. For this, our methodology involves preprocessing, processing, and tuning steps for each algorithm and an evaluation process to compare them. Figure 3 shows a summary of the steps of the process followed in this evaluation phase. Although we have evaluated only four topic detection algorithms, this methodology can be followed to evaluate other approaches, which we intend to conduct as future work.

The Dataset Description
The dataset used for this analysis corresponds to data collected by Quezada et al. [57], a CC BY 4.0 licensed public access database (https://figshare.com/articles/dataset/tweets_csv_gz/3465974, accessed on 1 June 2022). The data contain tweets gathered from news headlines from a manually curated list of well-known news media accounts (e.g., @CNN, @BreakingNews, @BBCNews) on Twitter. The dataset is composed of a total of 43,256,261 tweets distributed across 5234 different events.
According to the privacy and tweet availability terms of Twitter (https://developer.twitter.com/en/developer-terms/agreement-and-policy, accessed on 1 June 2022), most available datasets only provide the ids of tweets, so it is necessary to use the Twitter API to extract their actual texts. This API has a limit (https://developer.twitter.com/en/docs/twitter-api/rate-limits, accessed on 1 June 2022) of 900 tweet requests every 15 min, which allowed us to obtain at most 86,400 tweets per day; therefore, we only collected 2,000,000 tweets to create our dataset, which took us almost 1 month.
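The daily figure follows directly from the rate limit, as the short calculation below shows. The constant names are illustrative; the resulting lower bound of about 23 days assumes uninterrupted collection at the maximum rate, which is consistent with the reported duration of almost 1 month.

```python
REQUESTS_PER_WINDOW = 900   # tweet requests allowed per window (from the text)
WINDOW_MINUTES = 15
TARGET_TWEETS = 2_000_000

windows_per_day = 24 * 60 // WINDOW_MINUTES               # 15-minute windows in one day
tweets_per_day = REQUESTS_PER_WINDOW * windows_per_day    # maximum tweets retrievable per day
days_needed = TARGET_TWEETS / tweets_per_day              # lower bound on collection time
```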
The obtained dataset was imbalanced: in some cases, an event has fewer than 1000 tweets and, in other cases, an event has more than 100,000 tweets. In order to obtain a balanced dataset, we selected 250 events (which will be used later as topics) with 8000 tweets each. This balanced dataset is a subset of the original dataset and contains 2,000,000 tweets for training and testing purposes. For the event (topic) selection, each topic included in our balanced dataset has to satisfy two conditions: at least 8000 tweets belong to the topic, and these 8000 tweets include 400 tweets with at least 1 hashtag. The dataset is structured by a tweet id and a topic id. The tweet id corresponds to the internal id provided by Twitter, and the topic id is the original identifier provided by the dataset, between 1 and 5234, associated with an event (a topic, in our terms). Note that some topic identifiers are not considered in our subset since only 250 topics were selected. Table 1 shows an example of a list of the first six tweet ids in our dataset.
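The two selection conditions can be sketched as a simple filter. The function name `eligible_events`, the in-memory dictionary layout, and the rule of keeping the first 250 eligible events by identifier are assumptions; the text does not state how the 250 events were picked among the eligible ones.

```python
def eligible_events(events, min_tweets=8000, min_hashtag_tweets=400, max_events=250):
    """Return ids of events that satisfy both selection conditions.

    `events` maps an event id to the list of tweet texts collected for it.
    """
    selected = []
    for event_id, tweets in sorted(events.items()):
        with_hashtag = sum(1 for t in tweets if "#" in t)
        if len(tweets) >= min_tweets and with_hashtag >= min_hashtag_tweets:
            selected.append(event_id)
    # Assumption: keep the first max_events eligible ids.
    return selected[:max_events]

# Toy data with scaled-down thresholds to keep the example small.
toy = {1: ["a #x", "b", "c"], 2: ["a", "b"], 3: ["#y z", "#w", "q", "r"]}
chosen = eligible_events(toy, min_tweets=3, min_hashtag_tweets=1)
```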

Preprocessing
Tweets are very unstructured, short texts with misspelled words, irrelevant characters, emojis, unconventional syntax, and hashtags, among other elements, as well as stop words, prepositions, punctuation symbols, etc., which make the task of topic detection algorithms more difficult. For that reason, it is necessary to clean the data as part of the topic detection process. Therefore, the first step in the process is to clean the text of irrelevant words, such as usernames, URLs, emojis, and invalid characters. After that, the cleaned tweets have to be converted to a format suitable as input for the algorithms. The format that we used is TF-IDF (Term Frequency-Inverse Document Frequency). In this preprocessing step, the following tasks are executed:
• Tokenization: the text is split at each blank character to create a list of single tokens (stand-alone words, numbers, signs, or a concatenated string such as a URL).
• Removing mentions: usernames that begin with the '@' symbol and are followed by text (e.g., @jimcramer, @apple) are removed.
• Removing special characters: characters such as %, *, !, [, ) are removed to preserve the focus on words in every tweet.
• Removing Web URLs: URLs are not considered in our topic modeling approach because they contain unspecific and hardly interpretable information.
• Removing numbers: numbers are not considered because they generally do not contain semantically viable information for our purposes.
• Removing hashtags (e.g., #AAPL, #AppleSnob), emojis, symbols, and emoticons.
• Removing frequent words and stopwords that would not provide specific semantics. These are commonly words that do not carry distinct semantic meaning, e.g., the, an, and, what.
Table 2 shows the result of applying our preprocessing step to five random tweets. After the cleaning task, we split our dataset into training and testing sets. The proportion was 95% for training, i.e., 7600 (tweets) × 250 (topics) = 1,900,000 tweets, and 5% for testing, i.e., 400 × 250 = 100,000 tweets.
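The cleaning tasks and the per-topic split can be sketched as follows. This is an illustrative approximation, not the preprocessing code used in the experiments: the regular expressions, the tiny stop-word list, the English-only character filter, the order of the steps, and the rule of holding out the last 400 tweets of a topic are all assumptions.

```python
import re

# Assumption: a tiny illustrative stop-word list; the list actually used is not given.
STOPWORDS = {"the", "an", "a", "and", "what", "is", "in", "to", "of"}

def clean_tweet(text):
    """Return the list of lowercase tokens left after the cleaning tasks."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # Web URLs
    text = re.sub(r"[@#]\w+", " ", text)                  # mentions and hashtags
    text = re.sub(r"[^A-Za-z\s]", " ", text)              # numbers, symbols, emojis
    tokens = text.lower().split()                         # split on blank characters
    return [t for t in tokens if t not in STOPWORDS]

def split_topic(tweets, n_test=400):
    """Hold out the last n_test tweets of one topic for testing (assumed rule)."""
    return tweets[:-n_test], tweets[-n_test:]

tokens = clean_tweet("@jimcramer What is the price of #AAPL in 2022? See https://t.co/abc123 !!")
train, test = split_topic(list(range(8000)))
```

The cleaned token lists would then be joined back into strings and vectorized with TF-IDF before training.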

Processing
For the processing step, we train the algorithms for topic detection explained in Section 3: sequential k-means (KMEANS), latent semantic indexing (LSI), non-negative matrix factorization (NMF), and latent Dirichlet allocation (LDA). As the dataset was previously categorized by topic, we have proceeded with a supervised learning approach. The algorithms were executed iterating several times over different configurable variables to obtain the best results considering the evaluation metrics. The hyperparameters used are summarized in Table 3. The rest of the parameters correspond to the default values for each algorithm (https://scikit-learn.org/stable/modules/generated/sklearn.decomposition/, https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans/, accessed on 1 June 2022). In the case of LDA, if the data size is large, the online update method is much faster than the batch update method. The experiments were performed using Python 3.8.10 and the sklearn 1.1.1 library, on a computer with 16 GB memory, 8 AMD vCPUs, an 80 GB disk, and SFO3-Ubuntu 20.04 (LTS) x64.

Evaluation Metrics for the Models
We have used classical metrics to measure the models' performance, which are as follows [58]:
• Precision: the ratio TP/(TP + FP), where TP is the number of true positives and FP is the number of false positives.
• Recall: the ratio TP/(TP + FN), where FN is the number of false negatives.
• F1 score: it can be interpreted as the harmonic mean of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. The F1 score is defined as $F1 = \frac{2 \times precision \times recall}{precision + recall}$.
Table 4 summarizes the results of the considered algorithms, i.e., K-means, LSI, NMF, and LDA, with their metrics (precision, recall, and F1 score). The evaluations were executed considering our testing set composed of 400 × 250 = 100,000 tweets. Figure 4 shows a graphical comparison between the metrics of each model generated. In terms of precision, recall, and F1, NMF shows the best results with 0.74, 0.76, and 0.75, respectively. The second best algorithm is LDA with 0.71, followed by K-means and LSI. According to these results, the NMF algorithm is the most suitable for topic analysis in this scenario.
Note that for the comparative evaluation phase, the topic detection models were trained without hashtags; thus, the prediction of the topic is only based on the cleaned text. In the proposed topic detection measure, the topics of hashtags are also identified and compared with the one predicted from the plain text. The following section describes the original credibility model proposed in [22] and the extension proposed in this work by considering the topic credibility measure based on the analysis of hashtags.
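As a small worked example, the helper below computes the three metrics from raw counts and checks that the F1 score reported for NMF is consistent with its precision and recall. The function names and the sample counts are illustrative; how the per-topic scores were averaged over the 250 topics is not stated here, so the sketch covers a single class only.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

# Made-up counts for one topic.
p, r = precision_recall(tp=8, fp=2, fn=2)
f1 = f1_score(p, r)

# Consistency check on the NMF scores reported in the text (0.74 and 0.76).
nmf_f1 = f1_score(0.74, 0.76)
```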

An Extended Credibility Model Proposal: Adding Topic Measure
The credibility model proposed in [22] takes into account three measures: (i) Text Credibility; (ii) User Credibility; and (iii) Social Credibility. By default, each measure contributes around 33% of the tweet credibility level. In the following, we first describe the original credibility model [22]; afterward, we present the extension of this model by considering topic detection. Figure 5 shows a general view of the original credibility model. Text Credibility is entirely related to the post's text, while User Credibility and Social Credibility are calculated using users' attributes. Each credibility measure is based on several components that we call filters. Hence, the model is easy to implement, flexible, and extensible. It does not need advanced data manipulation, which makes it well suited to real-time applications.

Text Credibility
Text credibility syntactically analyzes the content of the post (without checking the author attributes) through SPAM, bad words, and misspelling filters, as shown in Definition 1. $w_{SPAM}$, $w_{BadWords}$, and $w_{MisspelledWords}$ represent user-defined parameters to indicate the weights that the user gives to each filter, such that $w_{SPAM} + w_{BadWords} + w_{MisspelledWords} = 1$.

User Credibility
User credibility analyzes only the user as a unit of the platform, without being influenced by other users, as it is described in Definition 2.

Definition 2. User Credibility (UserCred).
Given a set of metadata of a user who published a post, p.user, User Credibility is a function, denoted as UserCred(p.user), that returns a measure ∈ [0, 100], defined as $UserCred(p.user) = Verif\_Weight(p.user) + Creation\_Weight(p.user)$.

Social Credibility
Social credibility is focused on the relations between a user account and the other accounts on the social media platform. It considers the number of followers and following (see Definition 3). The MAX_FOLLOWERS constant is supplied by the user; for example, in [22] it is set to 2 million. FFproportion is a simple proportion that increases the credibility if the user has more followers than followings. The purpose of this function is to discredit bots, which tend to have more followings than followers.

Credibility Level
The credibility of a post is a weighted sum of the three credibility measures described previously. Definition 4 shows how it is calculated. According to the social network, the respective features for User Credibility and Social Credibility have to be identified and obtained.

Definition 4. Credibility Level (Cred).
Given a post, p, the Credibility Level is a function, denoted as Cred(p), that returns a measure ∈ [0, 100] of its level of credibility, defined as

$$Cred(p) = weight_{text} \times TextCred(p.text) + weight_{user} \times UserCred(p.user) + weight_{social} \times SocialCred(p.user)$$

where
• $weight_{text}$, $weight_{user}$, and $weight_{social}$ are user-defined parameters to indicate the weights that the user gives to Text Credibility, User Credibility, and Social Credibility, respectively, such that $weight_{text} + weight_{user} + weight_{social} = 1$; by default, they are around 33%;
• TextCred(p.text), UserCred(p.user), and SocialCred(p.user) represent the credibility measure related to the text, the user, and the social impact of p, respectively.

Extended Credibility Model with Topic Credibility
Once we have obtained the best trained model for topic detection, as explained in Section 4 and Figure 3, we propose the analysis of the hashtags of tweets to support topic detection. Figure 6 shows the process followed. The tweet is preprocessed and split into words to identify hashtags in the text. Then, with the topic detection model selected in the previous comparative evaluation phase, the topic of the text is determined, and the topics of the hashtags are identified by applying the same process as for the text, i.e., with each hashtag as an input of the NMF algorithm. Afterward, a similarity measure is calculated between the topics of the text and the topics of the hashtags. In this way, we can check whether the hashtags used in a tweet are coherent with the topic treated in it. To define how coherent a tweet is with respect to its hashtags, we use the Hellinger distance. The Hellinger distance is a metric to measure the difference between two probability distributions. It is the probabilistic analog of Euclidean distance [59]. The Hellinger distance forms a bounded metric on the space of probability distributions over a given probability space. When comparing a pair of discrete probability distributions, the Hellinger distance is preferred because P and Q are unit length vectors as per the Hellinger scale [60]. This distance has been applied to other similar problems, e.g., to calculate similarity between topics [61], to find the distance between two documents [62], or to compare the distance between tweet corpora [59]. Because the output of our model is a one-dimensional vector with a probability distribution of topics associated with a tweet or hashtag, the Hellinger distance is appropriate for our problem. This distance allows calculating the semantic proximity between the topic associated with a tweet and the topic associated with its hashtags.
Then, when the Hellinger distance score (HDS) approaches 1, the topics diverge, and therefore become vaguely related; when the score approaches 0, the topics become closely related.
Let us consider f (x) and g(x) as absolutely continuous functions. The square of the Hellinger distance is defined as shown in Equation (4) [63], where f and g are constrained to be probability density functions that integrate to 1 by definition.
Using these functions, it is possible to expand the square in the integral and obtain an alternative form for two probability distributions, P and Q, as shown in Equation (5).
where
• P = probability distribution for the cleaned text;
• Q = probability distribution for the hashtag.
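For reference, the standard forms of the Hellinger distance are given below; Equations (4) and (5) are assumed to follow them. The symbols $p_i$ and $q_i$ (the probabilities that P and Q assign to topic i) and k (the number of topics) are notation introduced here for the discrete case.

```latex
\begin{align}
H^{2}(f, g) &= \frac{1}{2} \int \left( \sqrt{f(x)} - \sqrt{g(x)} \right)^{2} dx
             = 1 - \int \sqrt{f(x)\, g(x)}\; dx \\
H(P, Q)     &= \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{k} \left( \sqrt{p_i} - \sqrt{q_i} \right)^{2}}
             = \sqrt{1 - \sum_{i=1}^{k} \sqrt{p_i\, q_i}}
\end{align}
```

The second equality in each line follows from expanding the square and using the fact that each distribution integrates (or sums) to 1, which also shows that the distance is bounded between 0 and 1.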
To interpret the results of the Hellinger distance, we used the dissimilarity score (Figure 7) [64]. This means if the result is closer to 1, the dissimilarity is high; otherwise, if the result is closer to 0, the dissimilarity is low. If a tweet has two or more hashtags, the model is used to obtain the topic associated to the tweet, as well as the topic associated to the individual hashtags. Afterwards, HDS will be calculated for each tweet-hashtag pair and these results will be averaged to obtain a single HDS. Definition 5 formally describes the topic credibility measure.

Definition 5. Topic Credibility (TopicCred).
Given the text of a post, p.text, Topic Credibility is a function, denoted as TopicCred(p.text), that returns a measure ∈ [0, 100], defined from the average HDS between the text and each of its hashtags, where
• NMF is the topic detection algorithm;
• HDS is the Hellinger distance between the topics of the tweet (p.text) and the topics of the hashtag p.text.hashtag_i;
• n is the number of hashtags.
For example, in a tweet with two hashtags, #1 and #2, the model finds the topic probability distribution of the plain text of the tweet (without the hashtags), Ttext, and the topic probability distribution of each hashtag, T#1 and T#2, in order to compare how far hashtag #1 and hashtag #2 are from the plain text. HDS is calculated for each hashtag, HDS(Ttext, T#1) and HDS(Ttext, T#2), and these HDS values are averaged in order to calculate the Topic Credibility measure.
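The two-hashtag example can be sketched in a few lines. The topic distributions below are made up, and the input vectors are assumed to be already normalized so that they sum to 1. The mapping from the averaged HDS to a value in [0, 100], taken here as 100 × (1 − mean HDS), is an assumption chosen so that closer topics yield higher credibility; it is not confirmed as the exact formula of Definition 5.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2)

def mean_hds(text_topics, hashtag_topics):
    """Average HDS between the text and each hashtag distribution."""
    scores = [hellinger(text_topics, h) for h in hashtag_topics]
    return sum(scores) / len(scores)

def topic_cred(text_topics, hashtag_topics):
    # Assumption: low distance maps to high credibility via 100 * (1 - mean HDS).
    return 100.0 * (1.0 - mean_hds(text_topics, hashtag_topics))

# Made-up topic distributions over three topics.
t_text = [0.7, 0.2, 0.1]
t_h1 = [0.7, 0.2, 0.1]   # hashtag #1: same topics as the text
t_h2 = [0.0, 0.0, 1.0]   # hashtag #2: mostly unrelated to the text

score = topic_cred(t_text, [t_h1, t_h2])
```

In this toy case, the first hashtag contributes a distance of 0 and the second a large distance, so the averaged score lands between the two extremes.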
Our new credibility model is composed of the Text Credibility, User Credibility, Social Credibility, and Topic Credibility, as shown in Figure 8, where the Hashtag Filter represents the process of topic detection described in Figure 6. However, there are scenarios where the trained model is incapable of assigning a topic to a given tweet or hashtag. If the percentage of association of a certain term with each of the topics in the model is at or below a certain threshold (in this case, set to 0.05), the term cannot be assigned to a topic; therefore, HDS cannot be calculated. In this scenario, the topic detection measure should not be considered in the credibility model. Considering both scenarios, the new credibility measure is formally defined in Definition 6.
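Definition 6. Given a post p, the credibility measure, denoted here as Cred(p), is defined as

$$\mathrm{Cred}(p) = \begin{cases} weight1_{text} \times \mathrm{TextCred}(p.text) + weight1_{user} \times \mathrm{UserCred}(p.user) + weight1_{social} \times \mathrm{SocialCred}(p.user) + weight1_{topic} \times \mathrm{TopicCred}(p.text), & \text{if a topic can be assigned;} \\ weight2_{text} \times \mathrm{TextCred}(p.text) + weight2_{user} \times \mathrm{UserCred}(p.user) + weight2_{social} \times \mathrm{SocialCred}(p.user), & \text{otherwise,} \end{cases}$$

where
• weight1_text, weight1_user, weight1_social, and weight1_topic are user-defined parameters to indicate the weights that the user gives to Text Credibility, User Credibility, Social Credibility, and Topic Credibility, respectively, such that weight1_text + weight1_user + weight1_social + weight1_topic = 1;
• weight2_text, weight2_user, and weight2_social are user-defined parameters to indicate the weights that the user gives to Text Credibility, User Credibility, and Social Credibility, respectively, such that weight2_text + weight2_user + weight2_social = 1.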
In the default configuration, under the first scenario, where topic analysis is possible, all weights (weight1_text, weight1_user, weight1_social, and weight1_topic) are set to 25%, while for the second scenario, weight2_text, weight2_user, and weight2_social are set to 33.33%.

Table 5 shows several examples of tweets that contain hashtags, the value of HDS (which reflects the relation between the text and its hashtags), and the credibility measures obtained with the original model and with the extended model. These results show that the HDS measure directly affects the credibility value whenever a tweet has at least one hashtag. In most cases, a close relationship between the text and the hashtag increases the credibility, whereas a distant relationship decreases it. Note that tweet #3 has an HDS value very close to 0, since Gurlitt was a collector who owned several artworks. Therefore, as the text is related to the hashtag, the distance is very small (0.04) and the credibility with the extended model increases (to 71.08%) with respect to the original model (63.05%). This is unlike tweet #4, where the algorithm fails to associate the hashtag #BREAKING with the information in the text, which speaks of a tragedy that occurred in Paris; therefore, credibility decreases with the extended model (70.86%) with respect to the original model (74.45%). For all true tweets (#1 to #3), the extended model reports better credibility than the original model. For the fake tweets, the extended model decreases the global credibility in two of the three.

The topic measure was implemented in the T-CREo framework to calculate the global credibility of the tweet, which is the final step in the whole process of topic detection described in Figure 6.
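The two scenarios and their default weights can be illustrated with a short Python sketch. The function names `has_topic` and `global_credibility` are hypothetical, as is the convention of passing `None` when no topic could be assigned; only the 0.05 threshold and the 25% and 33.33% default weights come from the text.

```python
from typing import Optional, Sequence, Tuple

TOPIC_THRESHOLD = 0.05  # association threshold stated in the text


def has_topic(dist: Sequence[float], threshold: float = TOPIC_THRESHOLD) -> bool:
    """A term can be assigned a topic only if at least one topic
    association is strictly above the threshold."""
    return max(dist, default=0.0) > threshold


def global_credibility(
    text_cred: float,
    user_cred: float,
    social_cred: float,
    topic_cred: Optional[float] = None,
    w1: Tuple[float, float, float, float] = (0.25, 0.25, 0.25, 0.25),
    w2: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3),
) -> float:
    """Weighted sum of the credibility measures (each in [0, 100]).

    If topic_cred is None (no topic could be assigned, so HDS is
    undefined), fall back to the three-measure weights w2.
    """
    if topic_cred is None:
        return w2[0] * text_cred + w2[1] * user_cred + w2[2] * social_cred
    return (w1[0] * text_cred + w1[1] * user_cred
            + w1[2] * social_cred + w1[3] * topic_cred)


# Hypothetical component scores for one tweet.
without_topic = global_credibility(80.0, 60.0, 40.0)
with_topic = global_credibility(80.0, 60.0, 40.0, topic_cred=100.0)
```

With equal weights, a topic score above the mean of the other three measures raises the global value, and a topic score below that mean lowers it.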
Figure 9 shows the T-CREo front-end as a Google Chrome Extension, while Figure 10 shows the credibility values under an account's timeline. The following section evaluates our model with respect to the original one.

Qualitative and Quantitative Evaluation
To evaluate our proposal, we performed two sets of experiments: a survey that measures human perception of credibility, and a quantitative evaluation on a dataset from the fake news domain.

Qualitative Analysis
To evaluate human perception, we used the survey proposed in [65], in which ten tweets were randomly selected from Twitter. This survey (the form is available at https://forms.gle/2uZNYze2YJSmCT1v7, accessed on 1 July 2022) collects the opinions of 40 participants with undergraduate and postgraduate degrees in different areas of study, using the following question Q: How credible is the following tweet? Participants rated each tweet on a scale from 1 to 10, where 1 means not believable at all and 10 means totally believable. The tweets were also evaluated using the original credibility model and the extended version. Table 6 shows the results obtained in this test. The human perception values are similar to those of the original and extended credibility models (an average difference of 10.11%), which supports the validity of the models. Since most of the tweets do not have hashtags, the results of the extended model remained almost the same as those of the original one, with the exception of tweet ID:xxxxxx4739103236099, which has a hashtag: our model obtained 72.33% for it, while the original model obtained 69.78%. To improve this evaluation, a new survey on tweets that contain hashtags is planned for future work.

Quantitative Analysis
In the domain of fake news, most studies apply machine learning techniques for binary classification [66][67][68] (whether a tweet is fake or not). These studies are evaluated on benchmarks that consist of labeled tweets from different topics. One well-known and available dataset is PHEME, proposed by Zubiaga et al. [69], which contains 6424 labeled tweets, grouped into 9 events.
Since credibility is a percentage between 0 and 100, we establish a threshold in [0, 100]. When the credibility value is less than the threshold, the tweet is considered fake. To evaluate our proposal, we calculated the F1 score, based on the precision and recall defined in Section 4.4. The threshold is varied in steps of 5 in order to evaluate several cases.
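The evaluation procedure can be sketched as follows. This is a minimal illustration, not the evaluation code used for the experiments: the function names and sample data are hypothetical, and the sketch assumes that "fake" is the positive class, which is consistent with the 0% F1 scores reported at low thresholds (where no tweet falls below the threshold).

```python
from typing import Dict, Sequence, Tuple


def prf_at_threshold(creds: Sequence[float], is_fake: Sequence[bool],
                     threshold: float) -> Tuple[float, float, float]:
    """Precision, recall and F1 when a tweet is labeled fake
    if its credibility is strictly below the threshold."""
    tp = fp = fn = 0
    for cred, fake in zip(creds, is_fake):
        predicted_fake = cred < threshold
        if predicted_fake and fake:
            tp += 1
        elif predicted_fake and not fake:
            fp += 1
        elif fake:  # fake tweet that was not flagged
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def sweep(creds: Sequence[float], is_fake: Sequence[bool],
          step: int = 5) -> Dict[int, Tuple[float, float, float]]:
    """Evaluate every threshold from 0 to 100 in the given step."""
    return {t: prf_at_threshold(creds, is_fake, t) for t in range(0, 101, step)}


# Hypothetical credibility values and ground-truth labels.
creds = [30.0, 55.0, 70.0, 90.0]
labels = [True, True, False, False]
results = sweep(creds, labels)
```

At threshold 0 nothing is flagged, so all three metrics are 0; at threshold 100 every tweet in this sample is flagged, so recall is 1 while precision drops to the share of fake tweets.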
In the following sections, the results are described by event. The original model and extended model are renamed as OM and EM, respectively.

Event "Putin Missing"
This event has 238 tweets divided into rumors and nonrumors, of which 107 have at least one hashtag. Figure 11 charts the results. The biggest F1 score difference between the OM and the EM was obtained for the threshold of 55%, where our EM had 25.00% and the OM had 16.99%. Overall, our EM had an increase of 0.17% in the F1 score. In all cases, the F1 score of the EM was equal to or greater than that of the OM.
Figure 11. Results of the event "Putin missing".
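The two summary figures reported for each event, the largest per-threshold gap and the overall change, could be derived from the two F1 curves as in the sketch below. The function name `compare_models` and the sample values are hypothetical, and the sketch assumes that the overall change is the mean of the per-threshold F1 differences (EM minus OM).

```python
from typing import Dict, Tuple


def compare_models(f1_om: Dict[int, float],
                   f1_em: Dict[int, float]) -> Tuple[int, float, float]:
    """Return (threshold with the largest EM - OM gap, that gap,
    mean gap over all thresholds). Both inputs map threshold -> F1 (%)."""
    diffs = {t: f1_em[t] - f1_om[t] for t in f1_om}
    best_threshold = max(diffs, key=diffs.get)
    mean_diff = sum(diffs.values()) / len(diffs)
    return best_threshold, diffs[best_threshold], mean_diff


# Hypothetical F1 scores (in %) for three thresholds.
om = {50: 10.0, 55: 20.0, 60: 30.0}
em = {50: 10.0, 55: 26.0, 60: 33.0}
best_t, best_gap, mean_gap = compare_models(om, em)
```

A negative mean gap would correspond to an event where the extended model performs worse on average, even if it wins at individual thresholds.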

Event "Charlie Hebdo"
This event has 2079 tweets divided into rumors and nonrumors, of which 254 have at least one hashtag. Figure 12 charts the results. For thresholds less than 40%, both our EM and the OM obtained 0% F1 scores. The biggest difference in F1 score was for the threshold of 60%, where the OM obtained 7.45% and our proposal obtained 9.14%. Overall, our model had an increase of 0.56%.

Event "Prince Toronto"
This event has 233 tweets divided into rumors and nonrumors, of which 83 have at least one hashtag. Figure 13 charts the results. For thresholds of 100%, 95%, 90%, and 85%, our extended model obtained the same F1 score as the original model (100%). The biggest difference was obtained for the threshold of 60% (25.75% for the EM and 16.73% for the OM). Overall, our EM had an increase of 1.95%.

Event "Ottawa Shooting"
This event has 890 tweets divided into rumors and nonrumors, of which 221 have at least one hashtag. Figure 14 charts the results. For thresholds less than 40%, both our EM and the OM obtained 0% F1 scores. The biggest difference was obtained for the threshold of 65%, where our EM obtained a 27.63% F1 score, while the OM obtained 25.40%. Overall, the F1 score of our model decreased by 0.38%.
Figure 14. Results of the event "Ottawa shooting".

Event "Gurlitt"
This event has 138 tweets divided into rumors and nonrumors, of which 12 have at least one hashtag. Figure 15 charts the results. For thresholds greater than 65%, our EM has a similar F1 score to the OM. The biggest difference was obtained for the threshold of 50%, where our EM had an F1 score of 13.69%, while the OM had 3.12%. Overall, our model had an increase of 4.80%.
Figure 15. Results of the event "Gurlitt".

Event "Ebola"
This event has 14 tweets divided into rumors and nonrumors, none of which has a hashtag. Therefore, the results for both models are the same. Figure 16 charts the results.

Event "Germanwings"
This event has 469 tweets divided into rumors and nonrumors, of which 30 have at least one hashtag. Figure 17 charts the results. For thresholds greater than 70%, our EM has a similar F1 score to the OM. The biggest difference with respect to the OM was obtained for the threshold of 65% (41.92% for the EM and 40.11% for the OM). Overall, our model had an increase of 0.16%.

Event "Ferguson"
This event has 1143 tweets divided into rumors and nonrumors, of which 1143 have at least one hashtag. Figure 18 charts the results. For thresholds less than 45%, both our EM and the OM obtained 0% F1 scores. The biggest difference was obtained for the threshold of 55%, where our EM had an F1 score of 11.64%, while the OM had 1.98%. Overall, our model had an increase of 1.71%.

Event "Sydney Siege"
This event has 1221 tweets divided into rumors and nonrumors, of which 137 have at least one hashtag. Figure 19 charts the results. For thresholds greater than 80%, our EM has similar precision to the OM. The biggest difference was obtained for the threshold of 75%, where our EM had a 44.44% F1 score, while the OM had 43.15%. Overall, our model had an increase of 0.02%.
Figure 19. Results of the event "Sydney Siege".

All Tweets from PHEME Dataset
Figure 20 shows the precision, recall, and F1 score by threshold for all tweets from the PHEME dataset. We can observe that our EM has better results than the OM for thresholds from 45% to 75% (up to 3.04% difference for the threshold of 60%). For other thresholds, the F1 score values are similar for both models. The best F1 score was obtained with the threshold set at 95% (47.43%).

Figure 20. All tweets from the PHEME dataset by threshold.

All Tweets from PHEME Dataset with Hashtags
In Figure 21, we show the precision, recall, and F1 score by threshold for all tweets that have at least one hashtag. We can observe that our EM has better results for thresholds from 40% to 75% (up to 9.60% difference for the threshold of 60%). For other thresholds, the F1 score values are similar for both models. The best F1 score was obtained with the threshold set at 95% (56.41%). Note that for the quantitative experiment, the initial NMF model trained with 250 topics was used, i.e., the model was not trained with the PHEME dataset; thus, better precision values could be obtained by training the model with the PHEME dataset. Training the initial model with 250 topics yields a generalized model that also works with topics that are not included in, but are related to, the original dataset.

Conclusions and Future Work
In this work, we extended a credibility model by adding topic analysis for tweets that have hashtags, since it is currently very common on Twitter to use hashtags to label a tweet with words that are trending or relevant. To do so, we first compared different topic detection algorithms, evaluated them using precision, recall, and F1 score, and selected the one that gave the best results, which was NMF.
Of the 6424 tweets in the PHEME dataset, only 1987 have hashtags. This means that the remaining tweets keep the same credibility value as in the original model. This can be seen in the "Ebola" event: since it did not contain any hashtags, the extended model obtained the same results as the original one. Moreover, the event with the greatest difference between the models is "Gurlitt". This event is about stolen works from the Gurlitt Collection (art collected by Cornelius Gurlitt); therefore, most of the tweets talk about art and museums, and the hashtags used for this event are also related to these words, for example, "#Entertainment", "#museum", and "#art". In this event, the extended model increases the average of the F1 score values by 4.80%. Overall, the extended model improves on the original one in the PHEME dataset. The improvement is modest (up to 3.04% F1 score for the 60% threshold) because only 30% of the tweets had hashtags; for the case where all tweets have hashtags, the improvement is more significant (up to 9.60% F1 score).
Even though the NMF model was not trained on the PHEME dataset, we obtained high F1 score values for events such as Prince Toronto (97.80% F1 score for the 80% threshold), Ebola (100% F1 score for thresholds greater than 80%), Germanwings (64.96% F1 score for the 90% threshold), Ferguson (approximately 39% F1 score for thresholds greater than 75%), and Sydney Siege (51.90% F1 score for the 90% threshold). We attribute this to the large number of topics with which the model was trained, which allows it to return related topics.
We are currently working on extending this model by considering retweets, likes, and other attributes to measure the social impact of a tweet, which in turn could improve the measure of credibility, and applying it to other languages, such as Spanish and French. Moreover, we are planning to use a larger dataset to have a greater variety of topics and keywords that can be used for analysis, as well as to evaluate other topic detection algorithms such as neural language models and community detection.
Finally, the present study shows the feasibility of integrating topic analysis into our credibility framework and of considering certain associated semantics. However, this remains a challenge. Other semantic aspects can also be incorporated, such as the following: How do hashtags impact the human perception of the credibility of tweets? Is there coherence between the tweet image and its topic? Is the URL associated with the tweet concordant with its topic? Some of these questions will be considered in future work, with additional experiments for qualitative and quantitative assessment.