Study of the Yahoo-Yahoo Hash-Tag Tweets Using Sentiment Analysis and Opinion Mining Algorithms

Abstract: Mining opinion on social media microblogs presents opportunities to extract meaningful insight from the public on trending issues like "yahoo-yahoo", which in Nigeria is synonymous with cybercrime. In this study, content analysis of selected historical tweets from the "yahoo-yahoo" hash-tag was conducted for sentiment and topic modelling. A corpus of 5500 tweets was obtained and pre-processed using a pre-trained tweet tokenizer, while Valence Aware Dictionary for Sentiment Reasoning (VADER), the Liu Hu method, Latent Dirichlet Allocation (LDA), Latent Semantic Indexing (LSI), and Multidimensional Scaling (MDS) graphs were used for sentiment analysis, topic modelling, and topic visualization. Results showed the corpus had 173 unique tweet clusters, 5327 duplicate tweets, and a frequency of 9555 for "yahoo". Further validation using the mean sentiment scores of ten volunteers returned R and R² of 0.8038 and 0.6402; 0.5994 and 0.3463; and 0.5999 and 0.3586 for Human and VADER, Human and Liu Hu, and Liu Hu and VADER sentiment scores, respectively. While VADER outperforms Liu Hu in sentiment analysis, LDA and LSI returned similar results in topic modelling. The study confirms VADER's performance on unstructured social media data containing non-English slangs, conjunctions, emoticons, etc., and shows that emojis are more representative of sentiments in tweets than the text.


Introduction
The continuous rise in Internet technology and various social media (SM) platforms has made effective communication and interaction possible among people from diverse social and cultural backgrounds [1,2]. This growth in technological advancement has also introduced some downsides, such as the proliferation of cyber-crime [3]. SM has become a significant aspect of online activity [4] and plays a crucial part in cybercrime [5] and cyber terrorism-related operations [6]. Cybercrime, which is one of the popular forms of deviance among youth in Nigeria, is still a serious problem affecting the country's image [7,8]. The perpetrators are supported by some people and social institutions when they make illegitimate money; hence, the increasing justification of illegality [9,10]. The phrase Yahoo-Yahoo originated from the use of Yahoo emails and Yahoo instant messenger as a dominant medium of communication between perpetrators and victims [11]. This popular term refers to activities that entail using computers, phones, and the Internet to defraud unsuspecting victims, especially those outside the country. The likelihood of fraudulent users integrating new approaches without necessarily applying extensive technical knowledge of the Internet could result in fraudulent activity [12].
The rising popularity of cyber-crime in Nigeria [13] can be connected to the current state of economic instability, the high unemployment rate among able-bodied youths, the erosion of traditional values of integrity, the quick-money syndrome, etc. To curb these illegal activities, institutions such as the Economic and Financial Crimes Commission (EFCC) were established in Nigeria and have recorded several arrests and prosecutions of cyber-crime suspects [14]. It is expected that with the apprehensions and prosecutions, more understanding of the "modus operandi" of culprits will emerge. However, crime may not be static, as suspects could adopt new methods when the old ones become known to the public and law enforcement agencies. Cybercrime has gone beyond the notorious 419 email and SMS scams [15] to apply more sophisticated methods, making SM users even more vulnerable [16]. Recently, SM platforms such as Facebook, Instagram, Twitter, Google+, and Pinterest have become popular as crucial data sources in research studies relating to sentiment analysis [17,18]. SM can accommodate information on different subjects, thus increasing and improving communication among users, and participants can form groups with a common interest and express themselves freely [19].
The importance of SM opinion cannot be over-emphasized, as this medium serves as the most accessible way to get large, valuable, and rich details of information (especially on the subject matter) within a short period. The Twitter platform is a social microblog site that generates about 330 million tweets every month across different countries [20]. Twitter has recently been used to mine opinions and trending topics to understand users' behaviors and attitudes through predefined information such as user description, location, status, and other attributes. Twitter also allows the exchange of media such as text, images, videos, etc., and has the potential to facilitate research on social phenomena based on sentiment analysis, using Natural Language Processing and Machine Learning techniques to interpret sentimental tendencies related to users' opinions and make predictions about real-world events [21].
Analyzing different trending topics on Twitter may create insight into polarized opinions on various issues such as politics, celebrities, national disasters, corporations, etc., for real-world event prediction. Previous studies have shown that this practice falls within socioeconomic cyber-crime [22], and its continued popularity can be attributed to the influence of friends [7,8,23]. The relationships between the factors influencing these activities and the learning process are depicted in Figure 1, as adapted from [24].
This study is motivated by the increasing information on SM, majorly Twitter, considering the great benefit to the Government and all related stakeholders. We have considered the effect on a developing country, Nigeria (as a case study): a fast-growing economy and the most populous country in Africa.
This paper aims to analyze the Yahoo-Yahoo hashtag tweets on SM using sentiment analysis and opinion mining algorithms, with the following specific objectives:

1. Collect tweets based on the Yahoo-yahoo hashtags using the Orange Twitter API.
2. Pre-process and tokenize the tweets using a pre-trained tweet tokenizer.
3. Conduct unsupervised lexicon-based sentiment analysis on the tweet corpus using the Liu Hu and VADER techniques, respectively.
4. Carry out topic modeling to detect abstract topics in the corpus using Latent Dirichlet Allocation (LDA) and Latent Semantic Indexing (LSI) algorithms, respectively.
5. Validate the topic modeling using Multidimensional Scaling (MDS) graphs and Marginal Topic Probability (MTP).
The rest of the paper is organized as follows: Section 2 discusses the related work; Section 3 provides a detailed description of the research methodology; Section 4 discusses in detail the implementation and results obtained; the paper concludes with future directions in Section 5.

Literature Review
This section discusses previous research endeavors on sentiment analysis and opinion mining using Twitter data in detail. The specific focus is on cyber-crime and Twitter data; state-of-the-art methods proposed in the literature are carefully studied for contributions and future recommendations.
In [25], the Authors proposed a temporal topic detection model to infer predictive topics over time. The Authors developed a dynamic vocabulary to detect topic trends rather than word dictionaries, using Twitter data to predict the Chicago crime trend. The study concluded that the use of content-based features improves overall prediction performance. In [21], the Authors presented a statistical analysis based on an L1-regularized regression algorithm for detecting cyber-attacks using social sentiment sensors on Twitter. Kounadi et al. in [26] examined Twitter messages for detecting homicide crime in London based on spatial and temporal analysis. The Authors adopted two pre-processing methods from link correspondence and the home estimation model. Hariani and Riadi in [3] analyzed Twitter data for cyberbullying using a naïve Bayes classifier and TF-IDF weighting. The Authors claimed their classification results were able to detect cyberbullying on SM, that the effect of this bullying is more psychological, and reported a prediction accuracy of 77.73%. Sharma et al. in [27] proposed a sentiment analysis of Twitter data using the Valence Aware Dictionary for Sentiment Reasoning (VADER) method for detecting cybersecurity and cyber-crime. The Authors concluded that Asian nations are majorly affected by cybersecurity challenges when compared to other European Union countries. Al-Garadi et al. in [28] proposed a supervised machine learning approach using four classifiers, namely Support Vector Machine (SVM), Naïve Bayes (NB), K-Nearest Neighbor (KNN), and the Random Forest (RF) classifier, for detecting cyber-crime on the Twitter network. The results show that integrating the Synthetic Minority Over-Sampling Technique (SMOTE) with RF gave the best performance of 94.3% compared with the other machine learning classifiers.
The application of Deep Learning (DL) methods has also been proposed in previous studies. Al-Smadi et al. [29] carried out a sentiment analysis of Arabic hotel reviews using a Deep Recurrent Neural Network (DRNN) and SVM. The DRNN had a faster execution time during the training and testing of the models. Founta et al. in [30] presented an architecture based on DL for detecting multiple online abusive behaviors among Twitter users. The proposed approach gave a significant performance in detection rate, increasing AUC from 92% to 98%. Like the previous study, [31] also applied a DL method based on a Convolutional Neural Network (CNN) to detect cyber-bullying using Instagram images and text data. The detected bullying words are further analyzed using the NB classifier to detect potential cyberbullying threats effectively.
After a review of previous studies, to the best of our knowledge, this study is the first to analyze the Twitter dataset for understanding and identifying sentiments towards yahoo-yahoo cyber-crime. The summary of the related work on cybercrime analysis using Twitter data is shown in Table 1.
Table 1. Summary of related work on cybercrime analysis using Twitter data.

- [26] Machine learning based on logistic regression: results show the proposed method could be effective and reliable for investigating the crime.
- Proposed methods were useful to predict possible cyber-attacks (cyber-attack detection).
- The proposed method gave a reliable capacity to predict relevancy, with an improvement in accuracy of more than 6%.
- Improved prediction accuracy for the detection of social tension topics in Russia (social tension detection).
- [35] CyberEM model based on pattern clustering and an NMF-based (non-negative matrix factorization) event aggregation algorithm: the model was able to discover cybersecurity events and update event aggregation online (event detection).
- Development of a cost-sensitive model (cyberbullying detection).
- [37] K-means clustering algorithm and Random Forest algorithm: the methods showed significant prediction power in detecting cyberbullying.
- Classification results showed very high performance at reducing false positives and promising results with respect to false negatives (cyber hate speech).

Research Method
The research methodology employed in this study is presented in this section. The data analyzed is based on the content of the tweets and other metadata. Duplicated tweets were detected and filtered to analyze unique tweets from the dataset. This study adopted the Liu Hu [39] and VADER [40] methods for sentiment analysis. The research approach was divided into three modules, as shown in Figure 2: Data Collection, Pre-processing, and Data Analysis.

Data Collection
Twitter data was chosen based on its popularity with microblog services for sentiment and opinion analysis in detecting cyber-bullying, cyber-terrorism, etc. [17]. The Twitter API was employed in streaming live tweets for the past 14 days on the Orange Data mining toolbox [41]. To use the Twitter API, it is required to obtain the Twitter API credentials, which contain the key and secret passwords. With the API, query parameters relating to specific keywords such as 'wordlist query', 'search by', 'language', 'allow retweets', etc., can be set, and the data obtained can be saved in Comma-Separated Values (CSV) format. For this study, the search query's keyword was "yahoo-yahoo", and a maximum of 5500 tweets was returned. The complete dataset is available in the Zenodo repository at [42]. Figure 3 shows the Twitter dataset obtained using the #yahooyahoo on a data table containing the tweet content as well as 17 other metadata fields, including 'Author ID', 'Date', 'Language', 'Location', 'Number of Likes', 'Number of Retweets', 'In Reply To', 'Author Name', 'Author Description', 'Longitude and Latitude', etc.

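The Orange Twitter widget handles the API call internally. As a rough illustration only, the query settings and CSV export described above can be sketched in plain Python; the field names (`wordlist_query`, `allow_retweets`, etc.) mirror the widget's labels, not actual Twitter API parameter names:

```python
import csv

def build_query(wordlist, language="en", allow_retweets=True, max_tweets=5500):
    """Assemble search settings equivalent to the Orange Twitter widget's
    options (illustrative field names, not real Twitter API parameters)."""
    return {
        "wordlist_query": " OR ".join(wordlist),
        "language": language,
        "allow_retweets": allow_retweets,
        "max_tweets": max_tweets,
    }

def save_tweets_csv(path, tweets, fields):
    """Persist fetched tweets (a list of dicts) in CSV format, as in the study."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fields)
        writer.writeheader()
        writer.writerows(tweets)

# The study's query: keyword "yahoo-yahoo", capped at 5500 tweets.
params = build_query(["yahoo-yahoo"])
```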

Data Pre-Processing
The Twitter dataset was pre-processed by breaking the tweets into smaller pieces such as words, phrases, or bi-grams, called tokens. Normalization was done on the tweets to generate n-grams, tags with spoken tags, and partial language markings. Other pre-processing tasks carried out on the tweets include:
1. Converting all characters in the corpus to lowercase;
2. Removing all HTML tags from a string;
3. Removing all text-based diacritics and accents;
4. Removing URLs, articles, and punctuation;
5. Filtering stop words using a lexicon and regular expressions.

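A minimal sketch of the pre-processing steps above in plain Python; the stop-word set here is an illustrative subset, not the lexicon used in the study:

```python
import re
import string
import unicodedata

# Illustrative subset only; the study used a full stop-word lexicon.
STOP_WORDS = {"the", "a", "an", "and", "at", "in", "is", "of", "to"}

def preprocess(tweet):
    """Apply the listed pre-processing steps to one tweet, returning tokens."""
    text = tweet.lower()                                  # 1. lowercase
    text = re.sub(r"<[^>]+>", " ", text)                  # 2. strip HTML tags
    text = unicodedata.normalize("NFKD", text)            # 3. split off diacritics...
    text = "".join(c for c in text if not unicodedata.combining(c))  # ...and drop them
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)    # 4a. remove URLs
    text = text.translate(str.maketrans("", "", string.punctuation))  # 4b. punctuation
    return [t for t in text.split() if t not in STOP_WORDS]  # 5. stop words
```

For example, `preprocess("Check <b>thé</b> scam at http://x.co and RT!")` yields `["check", "scam", "rt"]`.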

Sentiment Analysis
Sentiment analysis aims to extract users' emotions from texts at the sentence, document, or aspect/feature level. It determines the feeling being projected from each tweet as either positive, negative, or neutral. The NLTK sentiment modules in Orange are based on sentiment lexicon approaches and contain the Liu Hu [39] and VADER [40] techniques. The lexicon-based process is an unsupervised machine learning method that employs a dictionary or lexicon list. Each lexicon entry is associated with a sentiment strength representing a positive or negative orientation [43].
The Liu Hu method [39] involves an examination of the lexicon and classifies tweets into negative, positive, and neutral sentiment by simply summing up the sentiment scores of all sentiment words in a tweet or sentence segment, while VADER examines the lexicon and additionally applies heuristic rules.
VADER was proposed in [40]. Unlike Liu Hu [39], VADER has its sentiment orientation divided into four categories: positive, negative, neutral, and a final compound score for analyzing sentiment. The compound score is calculated in Equation (1) by finding the sum of each word's valence scores in the lexicon, adjusted according to the rules, and then normalizing:

compound = x / sqrt(x^2 + α)    (1)

where x = sum of valence scores of the constituent words, and α = normalization constant (default value is 15). The scores are thus normalized into a continuous polarity range between −1 and +1, representing the most extreme negative and most extreme positive sentiments. The VADER compound score is a single unidimensional measure of a tweet's sentiment.
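The compound-score normalization x / sqrt(x^2 + α) can be implemented directly; a minimal sketch, not NLTK's own implementation:

```python
import math

def vader_compound(valence_sum, alpha=15):
    """Normalize the summed word valences x into (-1, 1): x / sqrt(x^2 + alpha).
    alpha = 15 is VADER's default normalization constant."""
    return valence_sum / math.sqrt(valence_sum ** 2 + alpha)
```

For instance, a summed valence of 4 normalizes to roughly 0.718, and the function is symmetric, so −4 gives roughly −0.718.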
The tweet corpus was collected, and duplicate tweets were removed using the Manhattan distance as the distance measure between tweets, with single linkage and a distance threshold of 0.5. The VADER and Liu Hu sentiment analysis methods were then applied, and the results were obtained using heat maps and sentiment scores. Algorithm 1 represents the duplicate detection and sentiment analysis workflow.
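The duplicate-detection step might be sketched as follows, assuming tweets are represented as bag-of-words count vectors; note that Orange's widget may normalize the vectors, so with raw integer counts a 0.5 threshold only groups exact duplicates, which matches the retweet-clustering use here:

```python
from collections import Counter

def manhattan(a, b):
    """L1 (Manhattan) distance between two bag-of-words Counters."""
    return sum(abs(a[k] - b[k]) for k in set(a) | set(b))

def cluster_duplicates(tweets, threshold=0.5):
    """Greedy single-linkage grouping: a tweet joins the first cluster
    containing a member within `threshold` distance of it.
    A sketch of the study's duplicate-detection step, not Orange's code."""
    bags = [Counter(t.split()) for t in tweets]
    clusters = []  # each cluster is a list of tweet indices
    for i, bag in enumerate(bags):
        for cluster in clusters:
            if any(manhattan(bag, bags[j]) <= threshold for j in cluster):
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

Each resulting cluster corresponds to one unique tweet, and its size is the retweet/duplication frequency.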

Ground Truth Generation for Sentiment Analysis
To evaluate and validate the VADER and Liu Hu models for sentiment analysis, we took a cue from [40] by employing a human-centered approach: ten human experts individually reviewed all 173 unique tweets and gave their sentiment score on a scale of −3 to +3, representing extreme negative to extreme positive sentiment, while a score of zero was considered a neutral sentiment. The mean opinion score (MOS) of the human subjective scores is obtained as the averaged result of the set of individual human subjective sentiment scores. The mean opinion score (MOS) is defined as:

MOS = Σ i · p(i)    (2)

where i = tweet sentiment score, p(i) = tweet sentiment score probability, and S = number of independent observers. Ten observers were assigned to grade the 173 tweets, thus S = 10 and the number of tweets was 173. The grading scale was maintained as −3, −2, −1, 0, +1, +2, and +3, representing extreme negative to extreme positive sentiments.
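The MOS computation above reduces to a probability-weighted average of the observers' scores, which can be sketched as:

```python
from collections import Counter

def mean_opinion_score(scores):
    """MOS = sum over i of i * p(i), where p(i) is the fraction of observers
    giving score i; equivalent to the plain average of the scores."""
    n = len(scores)
    counts = Counter(scores)
    return sum(i * c / n for i, c in counts.items())
```

For example, ten observers scoring one tweet as [2, 2, 1, 1, 0, 0, 0, 1, 2, 1] yield an MOS of 1.0 for that tweet.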

Topic Modelling
Topic modeling is used to detect abstract topics in the corpus or data table based on word clusters and their respective frequency in each document or tweet, as in this case study. It has been applied in Natural Language Processing (NLP) to discover topics and extract semantic meaning from unordered documents, especially in applications such as SM, text mining, and information retrieval. This study uses the modeled topics to facilitate understanding of the emotions and conversations among the respondents in the corpus under study.
LDA is a three-level hierarchical Bayesian model in which each item of a collection is modeled as a finite mixture over an underlying set of topics. It is easier to interpret than LSI but slower.
The LSI model returns topics with negative and positive keywords that have negative and positive weights on the topic. The positive weights are words that are highly representative of the topic and contribute to its occurrence; for negative weights, the topic is more likely to occur if those words appear less in it. The modeled topics were visualized using Multidimensional Scaling (MDS) graphs, a low-dimensional projection of the topics as points. MDS attempts to fit the distances between the points as closely as possible. Algorithm 2 represents the workflow for topic modeling as conducted in this study.
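As a toy illustration of how LSI derives topic term weights, the dominant left singular vector of a term-document matrix X can be recovered in pure Python by power iteration on X·Xᵀ (the terms and counts below are made up for illustration; the study used Orange's LSI implementation, and singular vectors are only defined up to sign):

```python
def top_lsi_topic(term_doc, terms, iters=100):
    """Power iteration on X @ X.T yields the leading left singular vector
    of X, i.e. the dominant LSI topic as (term, weight) pairs, sorted by
    absolute weight. Sketch only; real LSI keeps several singular vectors."""
    m, n = len(term_doc), len(term_doc[0])
    # Term-term co-occurrence matrix X @ X.T
    xxT = [[sum(term_doc[i][d] * term_doc[j][d] for d in range(n))
            for j in range(m)] for i in range(m)]
    v = [1.0] * m
    for _ in range(iters):
        w = [sum(xxT[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return sorted(zip(terms, v), key=lambda p: -abs(p[1]))

# Toy corpus: rows = terms, columns = tweets (counts are invented).
terms = ["yahoo", "pastor", "arrested", "money"]
X = [[2, 1, 1, 0],
     [1, 2, 0, 0],
     [0, 0, 1, 2],
     [0, 0, 2, 1]]
topic = top_lsi_topic(X, terms)
```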

Results and Discussion
As outlined in Section 3, the results obtained from implementing the research methodology are presented and discussed in this section. The Twitter data mining API, widgets, and workflow engine for text mining in the Orange Data mining toolbox [41] were used primarily for the data collection and implementation of the study.

Results of Pre-Processing and Tokenization
A pre-trained tweet tokenizer was used for pre-processing the corpus texts. By setting the document frequency range, tokens outside the range are removed. 75,280 tokens of 3968 types were generated using a document frequency of 0.00-1.00, while 16,620 tokens of 5 types were returned for a document frequency in the range 0.10-0.90. Figure 4 shows the tokens' visualization and their frequency in the tweet dataset through a word cloud: the larger the word in the cloud, the higher its frequency. Only tokens with frequencies higher than 100 were recorded; 3960 tokens appeared more than a hundred times. Table 2 shows the 12 most frequent tokens, with "yahoo" on top of the chart at a frequency of 9555.
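The token counting behind the word cloud and the document-frequency filtering can both be sketched with `collections.Counter` (the documents below are toy examples; the study's frequencies came from the 5500-tweet corpus):

```python
from collections import Counter

def token_frequencies(docs):
    """Corpus-wide token counts, e.g. for sizing words in a word cloud."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.split())
    return counts

def df_filter(docs, lo=0.10, hi=0.90):
    """Keep token types whose document frequency (fraction of documents
    containing the token) lies in [lo, hi], mirroring the ranges used above."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))  # count each type once per document
    return {t for t, c in df.items() if lo <= c / n <= hi}
```

A token appearing in every document (document frequency 1.0) is dropped by the default 0.10-0.90 range, which is why the tighter range returned so few token types.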

Results of Geolocation
From Figure 5, the color bar differentiates the number of tweets originating from each country in the range 0 to 20 on a scale of 0-4, 5-9, 10-14, and 15-20, respectively.The white locations had no tweets from the #yahooyahoo, while the colored ones had tweets of varying amounts.

From the geolocation analysis in Figure 5, some of the countries in the range 0-4 (color code 1) on the world map are Ghana, French Guyana, South Africa, Tanzania, Uganda, Kenya, France, and Iran; Canada and the United States of America are in the range 5-9 (color code 2), while Northern Ireland, the United Kingdom, and Nigeria are in the range 10-14 (color code 3) with 9, 10, and 11 tweets, respectively. Finally, Spain fell into the last category with 16 tweets. From the dataset, and contrary to expectation, Spain has the highest number of tweets with the #yahooyahoo on Twitter, followed by Nigeria. However, our tweet dataset also confirmed that most Twitter users prefer to set their location as anonymous.

Results of Duplicate Detection
The dataset was filtered for unique tweets using duplicate detection to remove duplicate tweets from the 5500 tweets. With the linkage set to single and the distance threshold = 0.5, the duplicate detection workflow returned 173 unique clusters and their sizes, where a cluster represents a unique tweet and the size is the frequency of retweets or duplication in the dataset. Thus, 173 unique tweets and 5327 duplicates were returned. Tweet C91 is the largest cluster, with 484 retweets. Table 3 shows the top twenty tweets, cluster, number of retweets, and content. The 173 unique tweet clusters were adopted for further analysis in the study. To maintain the privacy and the 'right to be forgotten' of the tweets' Authors, their Twitter handles were anonymized using the serials uTweet1, ..., uTweet173 for unique tweets 1 to 173, respectively.


Result of Sentiment Analysis
From the result of the sentiment analysis, the Liu Hu method returned a single sentiment score for each of the 173 tweets, in the range +12.5 to −12.5, classified as 43 (24.86%), 86 (49.71%), and 44 (25.43%) for positive, neutral, and negative tweets, respectively. Figure 6 presents the visualization of the Liu Hu sentiment classification using the heat map. The VADER method classified the 173 unique tweets by returning the negative, neutral, positive, and compound scores, as shown in Figure 7 using the heat map. The right-hand side (RHS) consists of 142 tweets. 73 (42.20%) tweets were negative, as depicted by the blue color on the heat map; their compound sentiment scores ranged from −0.0516 to −0.9393, with very low positive sentiment scores, close to zero (0). For the tweets classified as neutral, the positive, negative, and compound sentiment orientation scores all returned zero (0), while the neutral scores were all one (1); this returned 43 (24.86%) neutral tweets. The third class on the heat map's RHS are tweets whose compound sentiment scores are above zero (0) but below 0.5. They do not represent negative or neutral sentiments and are also below the 0.5 threshold to be considered positive sentiment tweets. These classes of tweets are referred to as no-zone sentiment tweets, with 26 (15.03%) tweets; their compound sentiment scores ranged between 0.0202 and 0.4767, and their color on the heat map is not distinct. The left-hand side (LHS) of the heat map in Figure 7 has 31 instances. These are classified by VADER as positive tweets; their compound scores range between 0.5106 and 0.9508, and they are represented by the visible yellow color on the heat map. The result showed that a chunk of the tweets classified as neutral by Liu Hu are on the RHS of the VADER heat map.
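The four-way split used in this discussion can be stated as a small helper; the thresholds are taken from the text above, and this is our reading of the heat-map classes rather than an Orange or NLTK function:

```python
def classify_compound(score):
    """Map a VADER compound score to the four classes discussed:
    positive (>= 0.5), 'no-zone' (0 < score < 0.5),
    neutral (== 0), and negative (< 0)."""
    if score >= 0.5:
        return "positive"
    if score > 0:
        return "no-zone"
    if score == 0:
        return "neutral"
    return "negative"
```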

Results of Ground Truth Generation for Sentiment Analysis
Figure 8 shows the summary of the ground truth generation and validation of the sentiment analysis using subjective human scores alongside the VADER and Liu Hu methods. The mean opinion score (MOS) of the 10 volunteers on the 173 unique tweets was obtained. Figure 8a,c,e present the pairwise comparisons among the Human, VADER, and Liu Hu sentiment scores, which returned R and R² of 0.8038 and 0.6402 (Human and VADER), 0.5994 and 0.3463 (Human and Liu Hu), and 0.5999 and 0.3586 (Liu Hu and VADER), respectively.

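The validation against the human ground truth reduces to a Pearson correlation between two score lists, with R² as its square; a minimal pure-Python sketch (no SciPy assumed):

```python
def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient R between two score lists;
    R**2 gives the coefficient of determination reported in the validation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

Applied to the MOS vector and each method's score vector over the 173 tweets, this yields the R values reported for the Human-VADER, Human-Liu Hu, and Liu Hu-VADER pairs.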


Results of Topic Modelling
The LDA and LSI models were applied for topic modelling. Using a document frequency range of 0.10-0.90, only 16,617 tokens of five (5) types were returned.
In this case, the LDA and LSI models returned only one topic with the same keywords: yahoo, rt, pastor, forget, and adeboye. However, using the document frequency range 0.00-1.00, 75,280 tokens of 3968 types were returned. We set out for six topics using LDA and LSI; the topics and their keywords are shown in Table 4. Figure 9 shows the cloud of words that constitutes the LDA- and LSI-generated topics 1 to 6, while each topic's selected top ten keywords and weights are presented in Tables 5 and 6. The topics are represented as points in the Multidimensional Scaling (MDS) graph, where the size of a point is a function of the Marginal Topic Probability (MTP) of the topic extracted from the tweet corpus: the bigger the point, the more strongly the topic is represented by the words in the corpus. Only the LDA topics are visualized using MDS because LDA is easier to interpret than LSI, even though it is more computationally intensive. The visualization of the LDA topics with MDS shows that topic 6 has the highest marginal probability of 0.244889, followed by topic 3 and topic 4.
Figure 10 shows the LDA topics using MTP with multidimensional scaling points.We further used the box plots to visualize the words that are most representative of each topic.The box plot sorts the variables (words) by separating the selected subgroup values.The subgroup 'yes' represents the weights of the most representative words of the topic selected on the MDS graph.Table 7 shows the top ten most representative words for LDA topics 1-6 selected on the MDS graph and visualized on the box plot.The words are sorted by their order of relevance to the topics.We further used the box plots to visualize the words that are most representative of each topic.The box plot sorts the variables (words) by separating the selected subgroup values.The subgroup 'yes' represents the weights of the most representative words of the topic selected on the MDS graph.Table 7 shows the top ten most representative words for LDA topics 1-6 selected on the MDS graph and visualized on the box plot.The words are sorted by their order of relevance to the topics.
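The document-frequency filtering step described above can be sketched in pure Python. This is an illustrative sketch only, not the study's actual pipeline (which likely relied on a topic-modelling library such as Gensim), and the tokenized tweets below are hypothetical:

```python
from collections import Counter

def filter_by_doc_freq(docs, lo=0.10, hi=0.90):
    """Keep only tokens whose document frequency lies within [lo, hi],
    mirroring the 0.10-0.90 range applied before topic modelling."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each token at most once per document
    keep = {tok for tok, count in df.items() if lo <= count / n <= hi}
    return [[tok for tok in doc if tok in keep] for doc in docs]

# Hypothetical tokenized tweets for illustration
docs = [
    ["yahoo", "rt"],
    ["yahoo", "pastor"],
    ["yahoo", "arrested"],
    ["yahoo", "rt"],
]
# 'yahoo' appears in 100% of documents, so it is dropped at hi=0.90
print(filter_by_doc_freq(docs))  # [['rt'], ['pastor'], ['arrested'], ['rt']]
```

This illustrates why the narrow 0.10-0.90 range left only five token types: in a corpus dominated by near-duplicate retweets, the most characteristic words (such as "yahoo") occur in nearly every document and are excluded, while widening the range to 0.00-1.00 retains them.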
In Figures 11 and 12, we present subplots of selected LDA topics from the MDS graph as displayed on the box plot. The box plot changes by closing the separation between the yes and no subgroups.

Discussion
From the pre-processing and tokenization, one might wonder what a token or word like 'pastor' has to do with cyber-crime. This does not mean pastors are involved; rather, in this context, based on the tweets mined, some authors classified the way of life of some Nigeria-based pastors as a form of "yahoo", which means cyber-crime in Nigeria. The word 'arrested' relates to the landlord who harboured fraudsters in Ibadan, a city in Southwest Nigeria; a man found with a PS4 was also 'arrested' at Ojuelegba in Lagos, whom the policemen assumed to be a "yahoo-boy". 'Pastor Adeboye' appeared among others due to a statement he made: "one of my sons once told me that he was always excited to resume in the office every Monday because he would get to see his secretary again. I told him to fire (sack) her immediately. Nothing and no one is worth your marriage". With this, Pastor Adeboye was hash-tagged with "yahooyahoo". uTweet 90, "My friend just updated on his status that policemen arrested him at Ojuelegba for having a ps4 in his bag, claiming that he was a yahoo boy", generated many retweets that were responsible for tokens such as bag, ps4, arrested, and rt. Another user, uTweet 84, was quite sentimental in his opinion: "I'm not in support of Yahoo yahoo, it's really bad but let's face the fact that it's yahoo yahoo that's still lowering poverty", while uTweet164 argued in support that "it has saved people and promoted more business."
The sentiment analysis results confirm the advantage of VADER, which, aside from its simplicity and computational efficiency, is its accuracy across domains, especially on social media text. The results showed good performance of the sentiment analysis on the 'yahoo-yahoo' text, which contains several slangs popular among Nigerian youths, capital letters, conjunctions, emoticons, punctuation, etc. The results also confirmed the strength of VADER on English and non-English text, as our corpus contains a lot of broken or informal English grammar. This was a clear challenge for the Liu Hu method, which classified 86 (49.71%) tweets as Neutral, many of which were misclassified, as also noted in [40].
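VADER returns a compound score in the range [−1, 1], which is then mapped to a sentiment label. A minimal sketch of that mapping is shown below; the ±0.05 cutoffs are the thresholds conventionally recommended in VADER's documentation and are assumed here rather than taken from the study (which additionally reported a 'no-zone' band):

```python
def classify_compound(score, pos_th=0.05, neg_th=-0.05):
    """Map a VADER compound score (range -1..1) to a sentiment label,
    using the conventional +/-0.05 thresholds (an assumption here)."""
    if score >= pos_th:
        return "positive"
    if score <= neg_th:
        return "negative"
    return "neutral"

# e.g. the tweets noted later with compound scores of 0.5106-0.9508
# all fall on the positive side of the threshold
print(classify_compound(0.5106))  # positive
print(classify_compound(-0.7))    # negative
print(classify_compound(0.0))     # neutral
```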
The topics obtained from LDA and LSI contain words that are consistent with those that appear more than 100 times. There are strong similarities between the keywords in the topics obtained from the LSI and LDA models. The word 'yahoo' is highly positively representative of topics 1, 3, 4 and 6 in the LDA, with weights 0.170, 0.101, 0.092 and 0.234, respectively, and of topic 1 in LSI with a weight of 0.087. 'Yahoo' is also representative of LSI topic 2, with a negative weight of −0.232 in Table 5. It was observed from Figure 9 (subplot 5) of the LDA-generated topics and Table 4 that LDA topic five is made up of words with very low weights compared to the other LDA topics. It was also observed from Table 5 that LDA topics 1, 4, and 6 have top keywords that, when combined, form the tweets below. This may not be unconnected with the fact that the tweets formed by these top words of topics 1, 4, and 6 had 72, 1418, and 501 retweets, respectively, in the corpus.
Topic 1: uTweet82, "Someone said yahoo yahoo is youth empowerment, i weep for my country." Topic 4: uTweet90, "My friend just updated on his status that policemen arrested him at Ojuelegba for having a ps4 in his bag, claiming that." Topic 6: uTweet84, "I'm not in support of yahoo yahoo, it is really bad but let us face the fact that its yahoo yahoo that's still lowering poverty."
It was observed from the topic modelling that words such as racism, judge, homosexuality, terrifying, etc., shown in red font in the LSI topics, have very strong negative weights. Topic 1 contains positive words only, topics 2 to 5 have a mix of negative and positive words, while topic 6 has only negatively representative words. Emojis have strong weights in both the LDA and LSI generated topics, with positive contributions towards their respective topics, as shown in Figure 9 and Tables 5 and 6. This observation is consistent with [45], namely that emoticons and their associated sentiment usually dominate the sentiment conveyed by textual data whenever they are used. They express emotions and sentiments such as 'laughing and rolling on the floor', 'face with tears of joy', 'sweat droplets', 'loudly crying face', 'grinning face with sweat', etc. Hence, the sentiment expressed by the emoticons usually conveys the central sentiment of the tweet's textual content.
Also observed from the box plots in Figures 11 and 12 is the notable separation between the yes and no subgroups for topics with high MTP. The 'yes' subgroup represents the words for the selected topic, while the 'no' subgroup comprises the other words in the corpus. Subplot 6 in Figure 12 shows the good separation between the word frequency for LDA topic six and all others. The MTP of each topic is shown by its 'yes' subgroup as 2.41322 × 10⁻⁵, 2.1151 × 10⁻⁵, 0.226381, 0.195948, 0.112567 and 0.244889 for topics 1, 2, 3, 4, 5 and 6, respectively. The gap between the subgroups is consistent with the sizes of the points and the MTP values of the topics, as shown in Figure 10.

Conclusions
In this study, a content analysis of Twitter data using 5500 tweets from the #yahooyahoo hash-tag was conducted to assess SM users' opinions on the cyber-crime popularly called "yahoo-yahoo". A convenience sample of opinions (tweets) collected from the Twitter application was used for the study. Semi-structured Twitter data were collected from various verified and unverified authors. The results give a detailed analysis of people's sentiments towards yahoo-yahoo. Although the geolocation showed that more users tweeted on the topic from Spain, a closer look into the corpus suggests otherwise: because of privacy concerns, many users do not declare their location on Twitter. It can also be concluded that the LDA and LSI modelled topics showed a representative reflection of the tweet corpus. Although LSI is said in the literature to be more computationally demanding and is often less preferred than LDA, we observed that the insight it provides by identifying negatively representative words along with the positively representative ones is very significant for topic modelling and for gaining insights from tweets. Emojis carry strong weights in determining sentiment and in contributing to topic modelling, and more attention should be paid to sentiment analysis using emojis. The discussion of yahoo-yahoo as a cybercrime was largely seen as negative to society, with the sentiment distribution from VADER at 42.20%, 24.86%, 15.03%, and 17.92% for negative, neutral, no-zone, and positive sentiment tweets, respectively.
For future research, the authors plan to create long short-term memory (LSTM) deep learning models for sequence-to-label classification problems in tweets; test the models with new narratives to evaluate their performance; conduct internal and external validity checks of the study to ascertain that the results obtained are meaningful and trustworthy; and collect very large historical tweet datasets on trending national issues, such as #COVID-19Nigeria, #EndSARS, #LekkiMassacre, etc., to evaluate the proposed future research directions.

Figure 1 .
Figure 1. Relationships between predisposing factors and cyber-crime in Nigeria.


Figure 3 .
Figure 3. Screenshot showing the Data Table with the Tweet contents and other metadata.


Figure 4 .
Figure 4. Word Cloud showing the tokens and their frequency/weight from the pre-processed tweet dataset.


Figure 5 .
Figure 5. Map showing the locations of authors, with frequency represented by color weight.

… has 31 instances. These are classified by VADER as positive tweets; their compound scores range between 0.5106 and 0.9508. They are represented with the visible yellow color on the heat map. The result showed that a chunk of the tweets classified as Neutral by Liu Hu are on the RHS of the VADER heat map.
[Table 3, rows 19-21 (cluster, retweets, content):]
19 | C109 | 81 | Between 2017 and January 2020, F.G. has repatriated $1.89 Billion of Abacha Loot.
20 | C96 | 79 | To SARS you are doing yahoo yahoo o. they should just arrest themselves.
21 | C71 | 79 | Someone said Yahoo Yahoo is now a course in his University.

Figure 6 .
Figure 6. Heat map showing sentiments classification using Liu Hu method.


Figure 7 .
Figure 7. Heat map showing sentiments classification using VADER method.
Figure 8 shows the summary of the ground truth generation and validation of the sentiment analysis using subjective human scores alongside the VADER and Liu Hu methods. The Mean Opinion Score (MOS) of 10 volunteers on the 173 unique tweets was obtained. Figure 8a,c,e shows the line plots of VADER compound, Liu Hu, and Human sentiment scores, respectively. The figures show the trends between the sentiment scores returned by each approach and, ultimately, how the classification of each tweet's sentiment differs. Figure 8b,d,f are scatter plots between Human and VADER, Human and Liu Hu, and Liu Hu and VADER sentiment scores, with 0.8038 and 0.6402; 0.5994 and 0.3463; 0.5999 and 0.3586 as the correlation score (R) and coefficient of determination (R²), respectively. The blue line across the plots shows the intercept. The relationship between Human and VADER sentiment scores was the most significant, as in Figure 8b. We can also roughly consider each of the subplots as having four quadrants: true negatives (lower left), true positives (upper right), false negatives (upper left), and false positives (lower right), representing the accuracy of the tweet sentiment classification. More data points in the false-negative and false-positive quadrants imply a decrease in the sentiment classification accuracy.
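The R values above are Pearson correlation coefficients, and R² is simply their square. Given two equal-length lists of sentiment scores, they can be computed as in the self-contained sketch below; the human and VADER scores shown are hypothetical stand-ins for the study's data:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical human vs. VADER compound scores for a handful of tweets
human = [0.8, -0.5, 0.1, 0.9, -0.7]
vader = [0.7, -0.4, 0.0, 0.95, -0.6]
r = pearson_r(human, vader)
r2 = r ** 2  # coefficient of determination, reported alongside R in the study
```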


Figure 8 .
Figure 8. Subplots showing validation results of the tweet sentiment classification.Subplot: (a) Line plot of VADER sentiment scores; (b) Scatter plot of Human and VADER sentiment scores; (c) Line plot of Liu Hu sentiment scores; (d) Scatter plot of Human and Liu Hu sentiment scores; (e) Line plot of Human sentiment scores; (f) Scatter plot of Liu Hu and VADER sentiment scores.


Table 4 .
LDA and LSI generated topics with keywords.

Figure 10 .
Figure 10. Visualizing LDA topics using MTP with multidimensional scaling points.


Figure 12 .
Figure 12. Box plots showing Marginal Topic Probability of topics 4, 5 and 6 using Latent Dirichlet Allocation (LDA).

Algorithm 1 :
Duplicate Detection and Sentiment Analysis Workflow.
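The duplicate-detection step of the workflow named above can be sketched as grouping verbatim-identical tweets into clusters; applied to the study's 5500 tweets, this produced 173 unique clusters and 5327 duplicates. The sketch below is illustrative only, and the case/whitespace normalization is an assumption, not the study's documented procedure:

```python
from collections import Counter

def cluster_duplicates(tweets):
    """Group verbatim-duplicate tweets into clusters; return the clusters,
    the number of unique clusters, and the duplicate count."""
    # Assumed normalization: trim whitespace and lowercase before comparing
    clusters = Counter(t.strip().lower() for t in tweets)
    unique = len(clusters)
    duplicates = sum(count - 1 for count in clusters.values())
    return clusters, unique, duplicates

# Hypothetical mini-corpus: two identical tweets and one distinct tweet
tweets = ["Yahoo yahoo must stop", "yahoo yahoo must stop", "RT arrested at Ojuelegba"]
_, unique, duplicates = cluster_duplicates(tweets)
print(unique, duplicates)  # 2 1
```

Note that, as in the study, the counts satisfy total = unique + duplicates (5500 = 173 + 5327), since each cluster contributes one unique tweet and the rest are duplicates.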

Table 2 .
The topmost frequent tokens with their frequency/weight.


Table 3 .
Top twenty tweets showing the cluster, number of retweets, and content.

Table 5 .
LDA selected topics with top 10 words and weights.

Table 6 .
LSI selected topics with top 10 words and weights.

Table 7 .
Top ten words by order of relevance to the topic in the corpus extracted from the box plot.
