Fighting the COVID-19 Infodemic with NeoNet: A Text-based Supervised Machine Learning Algorithm

The spread of the Coronavirus pandemic has been accompanied by an infodemic. The false information embedded in the infodemic affects people’s ability to access safety information and follow proper procedures to mitigate the risks. Here, we present a novel supervised machine learning text mining algorithm that analyzes the content of a given news article and assigns a label to it. The NeoNet algorithm is trained on noun-phrase features, which together contribute a network model. The algorithm was tested on a real-world dataset, where it predicted the labels of never-seen articles and flagged those that are suspicious or disputed. In five different fold comparisons, NeoNet surpassed prominent contemporary algorithms such as Neural Networks, SVM, and Random Forests. The analysis shows that the NeoNet algorithm predicts the label of an article with 100% precision using a non-pruned model. This highlights the promise of detecting disputed online content that may contribute negatively to the COVID-19 pandemic. Indeed, machine learning combined with powerful text mining and network science provides the necessary tools to counter the spread of the misinformation, disinformation, fake news, rumors, and conspiracy theories associated with the COVID-19 infodemic.


Introduction
Without doubt, the Coronavirus pandemic has affected the world around us in unprecedented ways. In particular, an emerging infodemic of news articles, social media posts, and publications has accompanied the global pandemic and circulated a vast volume of information, some of which is misleading [1][2][3][4][5][6][7] . According to the World Health Organization, an infodemic is "an overabundance of information -some accurate and some not" 8 . This means that our digital world is riddled with an enormous amount of misinformation and disinformation resulting from fake news articles, careless social media posts, or publications that have not gone through a rigorous peer-review process 9 . As a result, rumors, conspiracy theories, and stigma are linked to the ongoing COVID-19 pandemic and circulated on social media platforms and news networks. The impact of the infodemic on the general public is unquestionable, as it makes it hard for people to identify reliable guidelines from trustworthy sources 10 . Clearly, the spread of misinformation and disinformation existed long before the pandemic. It has also been considered a social determinant of health due to its impact 11 . The coronavirus infodemic has many aspects: (1) The spread of rumors across the world has led to inappropriate behavior and has caused adverse effects on people's physical and mental health 2,12 . (2) Conspiracy theories have spread widely during the pandemic in an attempt to explain the unusual circumstances 13 . In fact, similar theories emerged during the SARS outbreak in China and the Ebola outbreak in the Congo 14 . (3) Misinformation and public health damage related to bad advice tweeted by people in authority. For instance, Orso et al. stated that, in a tweet, the French minister of health warned the citizens of his country not to use certain drugs (e.g., cortisone); the advice went viral during the pandemic. Later on, clinical trials proved that cortisone is beneficial.
Clearly, such events can have the effect of discouraging significant treatments; in this case, any reference to cortisone was eliminated 15 . (4) Stigma, which overwhelmed social media in the form of hashtags, contributed to a backlash against countries and people (e.g., stigma against China and Chinese people) 14 . (5) Disinformation, which is an intentional act of delivering false information to mislead the general public. A significant instance that took place during the pandemic was the promotion of vitamin D by an Indonesian author. The article and its recommendations turned out to be from a suspicious source, as the authors' names were never linked to the listed affiliation. The article was downloaded 17,000 times and mentioned 8,000 times on social media platforms. The matter was made worse when the article was also broadcast by DailyMail, a major news network, in an article entitled: "Terrifying chart shows how Covid-19 patients who end up in hospital may be almost certain to die if they have a vitamin D deficiency" 16 . Indeed, it is terrifying to witness major news organizations making life-and-death assertions based on suspicious sources such as this publication.
The above evidence begs the questions: "Who do you trust? How do we better mitigate?" Several efforts have emerged to address the flood of information and provide guidelines and recommendations on how to answer such questions. In fact, this exact question is addressed in an article titled: "Who do you trust? The digital destruction of shared situational awareness and the COVID-19 infodemic" 17 . The authors of the referenced article described how the digital disruption of social media and search engines is responsible for the digital destruction through the act of propagating misinformation. The article urged the development of new methods and approaches to establish and build trust between users and their platforms. Another prominent reference is titled: "How to fight an Infodemic: The Four Pillars of Infodemic Management" 18 . The pillars included (1) monitoring of information, (2) knowledge refinement and quality improvement processes (e.g., fact-checking), (3) the presentation of timely and accurate knowledge that minimizes or eliminates commercial and political influence, and (4) advocating for facts and science, which have often been overtaken by social media advertisements presenting "inappropriate content". Another effort that addressed the trustworthiness of online knowledge and information sources introduced a COVID-19 infodemic crowdsourcing framework 8 . The effort resulted in recommendations that overlap with the four pillars presented in 18 (e.g., knowledge refinement for fact-checking). The recommendations stated the importance of using computational methods such as artificial intelligence (AI) and machine learning (ML) to produce insights that enable decision making to manage the infodemic.
Given the wide spectrum of issues associated with the COVID-19 infodemic and its assessments 19 , and the recommendations made by scholars in the field, we believe that computational scientists have a significant role to play in the fight against this infodemic. In particular, artificial intelligence, natural language processing, and machine learning must demonstrate their full potential in this fight. As demonstrated by the DailyMail news article, disinformation thrives in major news networks. The impact of such articles is clearly magnified when they are also socialized on social media platforms such as Twitter and Facebook. It is imperative to address misinformation and disinformation at the source (i.e., the news article) before it is socialized on social media and becomes viral. An essential and urgent step at this point is the development of a mechanism that analyzes the content of a news article to assess how viable it is from a linguistic point of view. We argue that each news article must pass a label-prediction step; otherwise, it must be flagged as potentially untrustworthy. This has to be accomplished by measuring the quality of the linguistic aspects of the article. Noun-phrases, for example, are essential for making up the main facts of each article. Therefore, any computational mechanism must utilize the noun-phrases to decide if an article should pass the label-prediction process. The final outcome generates a COVID-19 SAFE/DISPUTED label accordingly. In the past few months, Twitter started flagging socialized content of political dispute with "this claim is disputed". Twitter has also taken more advanced measures and applied filters to remove vaccine misinformation from the platform 20 . In the ideal scenario, we envision that our mechanism would be adopted by all major social media platforms to flag socialized news articles as SAFE or DISPUTED COVID-19 articles.

The Role of Machine Learning and Text Mining in Misinformation and Fake News Classification
From the motivation presented in the Introduction section above, it has become clear that computational science in general, and machine learning and computational linguistics in particular, must be at the forefront of the fight against the infodemic. Prior to the COVID-19 infodemic, machine learning and natural language processing played an essential role in fighting misinformation and fake news 21 . We believe that innovating new solutions that leverage the power of both fields is the right step to take in this fight.
The literature is rich in valuable methods and algorithms that demonstrate machine learning algorithms and natural language processing approaches, individually or as hybrids. Here we share the background and approaches that represent the backbone of the methods of this paper. In the early 2000's, Soon et al. claimed that training a machine learning algorithm with specific linguistic features holds promise for classifying text in general. The authors claimed that their algorithm was the first learning-based system trained on bigram features to achieve results comparable to non-learning methods 22 .
Mackey et al., in their efforts to identify suspected fake content on social media, combined natural language processing and machine learning. The approach identified keywords associated with the pandemic and suspected marketing 23 . By analyzing millions of social media posts, the authors adopted a deep learning algorithm that detected high volumes of suspicious and untrustworthy products.
Liu et al. presented a survey-like paper to demonstrate the various applications of combining natural language processing and machine learning, specifically the method of training algorithms using word features (bigrams) 24 . Bigrams are a sequence of two words that appear in the text (e.g., global pandemic) 25 . They provide more valuable and richer textual features than their single high-frequency word counterparts. Aphiwongsophon et al. demonstrated how well-known ML algorithms (e.g., NaiveBayes 26,27 and Support Vector Machines 28-30 ) can be used to detect fake news. Their results showed promise, with an accuracy of 96% or better 31 .
Following a similar path, H. Ahmed et al. also used classical machine learning algorithms (i.e., a variation of support vector machines), but trained them using n-gram features 32 . The accuracy of their algorithm was lower than the previous methods (92%). The authors, however, argued that training the algorithm with n-grams yields better feature quality than high-frequency features that do not contribute to the context of the dataset.
Another interesting approach, by Conroy et al., also used machine learning to detect deception when identifying fake news. The approach combined machine learning, linguistic features (e.g., n-grams), and network analysis for networks of linked data. The authors claim that both linguistic and network analysis methods have shown high accuracy in classification tasks for detecting fake news. The authors conclude their research with the following recommendations: (1) achieving maximum performance requires deeper linguistic analysis, and (2) the utilization of linked data and corresponding formats will assist in achieving up-to-date fact checking 33 .

Materials and Methods
With the previous introduction and the recommendations made to fight the COVID-19 infodemic, we present a novel supervised machine learning algorithm, which we call NeoNet. The algorithm is specifically designed for COVID-19 news classification. The overall approach of the NeoNet algorithm is centered around a bigram network. We applied the TFIDF algorithm 34 to extract bigrams (pairs of words), which are the bridge to identifying discriminant features. The bigram features naturally present themselves as a network, which we use as a training model. Hence, the role of feature selection using TFIDF to identify bigrams is significant. TFIDF features serve two purposes: (1) they provide discriminant features that contribute significantly to the training phase of the algorithm, and (2) they provide linked features that take the mere article contents to a connectivity level. The result is an interesting network model that offers a platform for testing whether a new article is relevant in terms of both content and connectivity. In this section, we discuss how the algorithm is designed, implemented, and tested. In particular, we present the cornerstone steps that lead to determining the class of news articles: (1) TFIDF feature selection, which is used to extract bigram features from news articles 23,35 , (2) a TFIDF bigram-based network model, which we use for training the algorithm before it is able to predict the label of new articles, and (3) a supervised machine learning algorithm, which predicts the final label for each news article as a SAFE/DISPUTED COVID-19 news article.
The dataset used in this work is the COVID-19 News Articles Open Research Dataset, which is available at Kaggle 36 . The data exists as a Comma Separated Value (CSV) file comprising seven columns. The ones of interest here are the article title, article description, and full article. The dataset contains 6782 articles, collected from the website of the Canadian CBC News network. The preprocessing of the text is done using the Pandas 37 framework, and the linguistic analysis is done using TextBlob 38 . We split the dataset into 10 folds, where each partition contained 500 articles. We used one fold for training the algorithm and another five to test it. For each test fold, we set the minimum support parameter to a certain threshold and compared the performance.
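The fold split described above can be sketched as follows. This is a minimal pure-Python stand-in (the paper loads the CSV with Pandas); the placeholder article list and the helper name make_folds are illustrative assumptions.

```python
# Minimal sketch of the 10-fold split described above; the placeholder
# list stands in for the 6782 articles loaded from the Kaggle CSV.
def make_folds(articles, fold_size=500, n_folds=10):
    """Split the article list into n_folds partitions of fold_size each."""
    return [articles[i * fold_size:(i + 1) * fold_size]
            for i in range(n_folds)]

articles = [f"article {i}" for i in range(6782)]   # placeholder contents
folds = make_folds(articles)
train_fold = folds[0]       # one fold trains the model
test_folds = folds[1:6]     # five held-out folds are used for testing
```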

TFIDF Feature Selection and Model Construction
Every training model starts with a good representation of the data items. For text classification in particular, feature selection is the prerequisite step for such a task. Various approaches are designed around the idea of selecting the set of words that best represents a document (or a set of documents). The most common text feature selection approach is based on the idea of term frequency. Specifically, the Term Frequency and Inverse Document Frequency (TFIDF) method 32,34 has been the most dominant. In this section, we discuss how we extracted the bigram features needed for training the NeoNet classifier. For this task, we used a COVID-19 news dataset that is trustworthy and publicly available (published on Kaggle 36 ). Because raw text presents users with inherent issues (e.g., format, encoding, and punctuation), we performed a preprocessing step to address such issues.
We split the list of articles into 10 folds of 500 articles. We used one fold for feature selection using the TFIDF algorithm. Given that a TFIDF feature can be one word or more, we calibrated the algorithm to capture features of exactly two words (i.e., bigrams). TFIDF scores each feature and ranks them accordingly. When TFIDF was run against the training fold (500 articles), it produced 193,914 bigrams. Clearly, this causes the model to be noisy and could also lead to an overfitting problem. Therefore, we selected only the top 500 ranked features and ignored the rest. Table 1 shows a sample of the top-ranked features selected from the training dataset before the noise removal. The first column shows the feature's order in the data, the second column shows the actual feature being selected, and the last column displays the rank of the feature in the dataset.
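The bigram scoring step can be sketched in pure Python as follows. The toy corpus, helper names, and top_k cutoff are illustrative assumptions (the paper ranks 193,914 bigrams extracted from 500 articles and keeps the top 500), and this plain sum-of-TF-IDF ranking is only one simple variant of the TFIDF family.

```python
# Minimal sketch of bigram TF-IDF scoring: extract length-two features,
# score each by term frequency weighted by inverse document frequency,
# and keep the top-ranked ones.
import math
from collections import Counter

def bigrams(text):
    words = text.lower().split()
    return list(zip(words, words[1:]))

def tfidf_bigrams(corpus, top_k=5):
    """Score every bigram by TF-IDF summed over the corpus; return top_k."""
    doc_bigrams = [bigrams(doc) for doc in corpus]
    df = Counter()                       # document frequency per bigram
    for bg in doc_bigrams:
        df.update(set(bg))
    n_docs = len(corpus)
    scores = Counter()
    for bg in doc_bigrams:
        tf = Counter(bg)
        for gram, count in tf.items():
            idf = math.log(n_docs / df[gram])
            scores[gram] += (count / len(bg)) * idf
    return [gram for gram, _ in scores.most_common(top_k)]

corpus = [
    "health officials warn of the coronavirus outbreak",
    "the coronavirus outbreak strained public health systems",
    "clinical trials for the novel coronavirus began today",
]
top = tfidf_bigrams(corpus, top_k=3)
```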
Bigrams, as a means of network construction, have been widely used in various computational problems 39 . We present an incremental network construction approach that is well known in the literature from prominent algorithms (e.g., Prim's algorithm 40,41 ), which start with an empty set of nodes and incrementally add new nodes, one node at a time. We follow the same method of construction. Our goal is to add all the bigrams that meet a certain criterion. The bigram extraction step, discussed above, produced the set of length-two features. Length-two features not only capture the core features necessary for classification but also offer a network model that can be used for training a classification algorithm. They offer a source-target mechanism where the source and target are nodes in the network connected by an edge. The continued addition of new bigrams forges an incremental linkage. The final outcome of this process is a graph whose structure and characteristics depend on the dataset being analyzed (i.e., healthcare, politics, business, etc.). For the COVID-19 domain, following the incremental process ensures an upfront production of high-quality features. The network ensures that classified bigrams are related to the content and are not the result of a verbatim exact match.
The following example demonstrates how a TFIDF feature of length two can provide the foundation for constructing the needed network. Consider a sentence such as (Top U.S. health official Dr. Anthony Fauci said it has a "clear cut, significant, positive effect in diminishing the time to recovery" 42 after favorable results of a clinical trial.). The TFIDF feature extraction step produces the bigrams (health official) and (clinical trials). These two bigrams contribute four different nodes (unigrams), namely (health, official, clinical, trials). They also contribute two edges: an edge between health and official, and another between clinical and trials. As we analyze more sentences, we encounter the mention of (strict public health measures). In turn, this contributes another bigram (health measures). Putting these bigrams together and connecting them based on the bigram relationship forms a graph where the node health is connected to both official and measures. The continuous addition of bigrams extracted from the training dataset results in a much larger network. Figure 1 displays a wordcloud of the top-100 features of the training set. The figure shows how relevant the features selected using the TFIDF algorithm are. Figure 1: a wordcloud representation of the top-100 features selected using TFIDF. Here we observe relevant features such as coronavirus, health, outbreak, covid, pandemic, social distancing, novel, spread, etc.
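The worked example above can be expressed directly in code. This sketch uses a plain adjacency dict rather than a graph library; the bigram list is taken from the example, and the helper name build_network is ours.

```python
# Incremental construction of the bigram network: each bigram
# (source, target) contributes two nodes and one undirected edge.
from collections import defaultdict

def build_network(bigram_features):
    adj = defaultdict(set)
    for source, target in bigram_features:
        adj[source].add(target)
        adj[target].add(source)
    return adj

# Bigrams from the example in the text.
features = [("health", "official"), ("clinical", "trials"),
            ("health", "measures")]
net = build_network(features)
# The node "health" is now connected to both "official" and "measures",
# mirroring the shared-node linkage described above.
```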
Upon construction, the network training model ended up with only 471 nodes. This is explained by the fact that some bigrams share a common word. For example, ('health issues') and ('health problems') have the node ('health') in common. While such bigram features would be used as-is in other machine learning algorithms, a network model such as ours naturally prunes repeated features that can lead to overfitting, a problem that other algorithms suffer from. Table 2 summarizes the training model, which was constructed from the top-500 TFIDF features selected, and Figure 2 shows the training model after construction. Table 2 describes the structural properties of the network training model: number of nodes, number of edges, and average node degree. Figure 2 shows a simplified version of the network training model, constructed from 100 features originally extracted from the news articles. Here we see terms such as coronavirus outbreak, Covid 19, novel coronavirus, pandemic, positive cases.

The Design of NeoNet Classifier
The previous step explained how a network-based training model is derived from a given set of news articles. Here we present the algorithmic steps that lead to labeling a new article that has yet to be seen by our algorithm. The algorithm is controlled by a configuration parameter which we call minimum support, inspired by the Apriori algorithm [43][44][45] . The minimum support guarantees that a certain number of bigrams are present in each article; otherwise, the article is labeled as suspicious. It ensures that the article contains sufficient content that contributes to the training model. If the article does not meet this condition, it will not be classified as SAFE. Clearly, an article that does not have a minimum number of TFIDF features does not communicate significant facts worthy of reading 46 . As for the percentage generated by the minimum confidence, it guides the setting of the minimum support and helps set it to a sufficient level. This becomes significant for long vs. short articles. Long articles are expected to have a higher number of TFIDF features than shorter articles. If the minimum support parameter is set too low, this percentage helps correct the issue and ensures that news articles are not classified as "SAFE" when they should be classified as "SUSPICIOUS". The NeoNet algorithmic steps are described below and also expressed in pseudocode in Algorithm 1. Algorithm 1 shows the steps of the NeoNet algorithm in pseudocode, starting from when the bigram features are extracted until a classification label is generated.
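The labeling step can be sketched as below. This is our minimal reading of the minimum-support check, not the authoritative Algorithm 1; the helper names, toy edge set, and threshold value are assumptions for illustration.

```python
# Hedged sketch of the NeoNet labeling step: an article is SAFE only if
# enough of its bigrams appear as edges in the training network.
def article_bigrams(text):
    words = text.lower().split()
    return set(zip(words, words[1:]))

def classify(article, model_edges, min_sup=15):
    matched = article_bigrams(article) & model_edges
    return "SAFE" if len(matched) >= min_sup else "DISPUTED"

# Toy training edges; the real ones come from the TFIDF bigram model.
edges = {("health", "officials"), ("coronavirus", "outbreak"),
         ("clinical", "trials")}
label = classify("health officials tracked the coronavirus outbreak",
                 edges, min_sup=2)   # two bigrams match the model
```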

Experiments and Results
Using the training model resulting from the bigram feature selection step, we conducted a series of classification (testing) experiments. Using five different folds of the dataset and different configurations of the minimum support parameter, we measured the precision of the NeoNet algorithm. The rationale behind this is to arrive at a threshold that produces the best outcome. The minimum support parameter is based on the number of bigrams produced by each article. The higher the number of matching bigrams, the higher the chances of an article being classified as a positive COVID-19 class. The experimentation guides the algorithm to identify a reasonable threshold. Requiring a very high number of bigrams would lead to classifying only articles that are extremely similar in content. As a result, the algorithm would miss articles that belong to the COVID-19 class but are less similar to the training model. On the other hand, a very low threshold would classify any article with a slight overlap as positive and would make the algorithm imprecise. We used the following min-sup configurations: [5,10,15,20,25] bigrams. Figure 3 shows the five different test sets and how they were classified by NeoNet at the various minimum support levels. The analysis shows that mandating that at least 5 bigrams match the training set produced a precision between 99.6% and 100%. With a more demanding parameter (min-sup = 25), the classification results fluctuate between 91.18% and 95.59%. The experiments showed that a value in the middle (min-sup = 15) is reasonable, producing classification precision that fluctuates between 97.99% and 99.20%. Each fold is plotted using five different curves, one for each minimum support value. Figure 3 shows the performance of NeoNet with the five different configurations of min-sup [5,10,15,20,25].
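The threshold trade-off described above can be illustrated with a small sweep. The matched-bigram counts below are hypothetical, and the fraction of articles labeled SAFE is only a stand-in for the precision the paper measures on labeled folds.

```python
# Sweep the min-sup thresholds used in the paper over hypothetical
# matched-bigram counts for one fold: stricter thresholds label fewer
# articles SAFE, mirroring the precision/coverage trade-off above.
def safe_fraction(matched_counts, min_sup):
    labels = [count >= min_sup for count in matched_counts]
    return sum(labels) / len(labels)

matched_counts = [30, 22, 18, 12, 7, 4]   # hypothetical per-article counts
sweep = {s: safe_fraction(matched_counts, s) for s in [5, 10, 15, 20, 25]}
```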
The plots are displayed from left to right according to the values of the configurations. We have shown above how NeoNet can be controlled using the minimum support, making it flexible for various scenarios. However, the algorithm also performs exceptionally well without such configuration when needed. In this section we show its performance compared with the most prominent algorithms (e.g., SVM, neural networks, random forests). Given that such algorithms do not necessarily utilize a similar notion of minimum support/confidence, we set the parameter to (min-sup = 15). The algorithm is trained using a 500-article fold. The rest of the dataset was split into 5 folds, each of 500 articles.
As expected, each fold of the dataset was tested against NeoNet and compared with a counterpart algorithm. The algorithm was tested on each of the five folds and compared against all other algorithms to measure the precision achieved across the 5 folds. Figure 4 shows this comparison.

Discussion and Conclusion
In this article, we discussed how the COVID-19 pandemic has been accompanied by an infodemic. In particular, we discussed the various aspects of the infodemic and how it presents a serious health threat to the general public due to the misinformation/disinformation that may exist at the source (e.g., scientific publications, fake news, and social media posts). For instance, we presented evidence of disinformation in a publication that was eventually presumed to be from a suspicious source. The article reported health issues associated with vitamin D. Once published, it was also highlighted by a major news organization (i.e., DailyMail). The matter was made worse when the DailyMail news article 47 was also socialized on Facebook and Twitter. Clearly, such misinformation (or disinformation in this case) threatens the world's public health.
This paper also highlighted the various efforts that have been undertaken by the scientific community in the fight against the infodemic and the recommendations that have been made. One specific reference, Eysenbach, addressed the infodemic and introduced four pillars that must be observed in order to win this fight. The recommendations included information monitoring and encouraging knowledge refinement and quality improvement processes. Our research takes such recommendations seriously and implements them accordingly. Specifically, we presented an information monitoring and knowledge refinement solution that addresses the problem at the source. The research also included a diligent literature review of the specific tools and research methods that should be used. The technical recommendations were influenced by advances in machine learning, computational linguistics, and network science.
Indeed, this paper has presented a novel machine learning algorithm that utilizes knowledge refinement produced by natural language processing to produce training features. We then empowered the algorithm with a network model. Such a model offered both the structural components (i.e., nodes and edges) and the node degree centrality to perform knowledge refinement when constructing the training model. This led to the generation of highly representative features and eliminated noise by using the degree centrality as a heuristic.
As for the actual step of training and testing the algorithm, we selected a trustworthy set of news articles. The network model took text classification from a mere content-matching level to a connectivity level expressed by the underlying relationships that make up the training model. A future direction of this work will consider developing an adaptive approach to set the minimum support configuration automatically.
We will also consider promoting the algorithm to be multi-lingual and test it against various datasets from various news organizations.
It is worth mentioning that the algorithm can function on other text sources, not only news. Specifically, we expect the algorithm to function in the exact same way, without any modifications, to classify medical publications. The setting of the minimum support would be calibrated by experimentation. In this case, the algorithm may be trained on PubMed abstracts and doctors' notes to provide COVID-19-specific features that may only be clinical. Our reason to believe that it would be successful is that the algorithm is trained using bigram features. Such features are highly significant in the context of doctors' notes and publications, since they may reference entities such as organs, diseases, genes, proteins, indications, symptoms, etc. Eventually, the training model would be rich, and suspected sources would fail to classify positively against the model. To conclude, we produced an algorithmic approach that shows promise for fighting the COVID-19 infodemic because of its powerful demonstration and flexibility in classifying news articles as "SAFE" or "DISPUTED". We also believe that training the algorithm using scientific literature and doctors' notes may eliminate suspicious publications such as the one that promoted vitamin D. This is yet to be explored in future publications.
Funding: This research is partly funded by IU of Madinah, Tamayoz initiative project 23/40, and National Security Agency #22341 Cyber Institute.

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: The data supporting the reported results are publicly available on Kaggle as the COVID-19 News Articles Open Research Dataset 36 .