Knowledge Graph informed Fake News Classification via Heterogeneous Representation Ensembles

Increasing amounts of freely available data, both in textual and relational form, offer the exploration of richer document representations, potentially improving model performance and robustness. An emerging problem in the modern era is fake news detection -- many easily available pieces of information are not necessarily factually correct, and can lead to wrong conclusions or be used for manipulation. In this work we explore how different document representations, ranging from simple symbolic bag-of-words to contextual, neural language model-based ones, can be used for efficient fake news identification. One of the key contributions is a set of novel document representation learning methods based solely on knowledge graphs, i.e. extensive collections of (grounded) subject-predicate-object triplets. We demonstrate that knowledge graph-based representations already achieve performance competitive with conventionally accepted representation learners. Furthermore, when combined with existing contextual representations, knowledge graph-based document representations can achieve state-of-the-art performance. To our knowledge, this is the first larger-scale evaluation of how knowledge graph-based representations can be systematically incorporated into the process of fake news classification.


Introduction
Identifying fake news is a crucial task in the modern era. Fake news can have devastating implications for society; the uncontrolled spread of fake news can, for example, undermine democracy by altering the course of elections through targeted information spreading [1]. In times of a global pandemic, fake news can endanger global health, for example by reporting that using bleach can stop the spread of the Coronavirus [2,3], or that vaccines are harmful to human health. With the development of the information society, the increasing capability to create and spread news in various formats makes the detection of problematic news even harder.
For media companies' reputation it is crucial to avoid distributing unreliable information. With the ever-increasing number of users and potential fake news spreaders, relying only on manual analysis is becoming unmanageable given the number of posts a single person can curate on a daily basis. Therefore, the need for automated detection of fake news is more important than ever, making it also a very relevant and attractive research task.
By being able to process large collections of labeled and unlabeled textual inputs, contemporary machine learning approaches are becoming a viable solution to tasks such as automatic credibility detection [4]. One of the key problems, however, concerns the representation of such data in a form suitable for learning. Substantial advancements have been made in this direction in recent years, ranging from large-scale curated knowledge graphs that are freely accessible to contextual language models capable of capturing subtle differences between a multitude of texts [5]. This work explores how such technologies can be used to help prevent the spread of problematic content at scale.
With the advancements in the fields of machine learning and natural language processing, various computer-understandable representations of texts have been proposed. While recent work has shown that leveraging background knowledge can improve document classification [6], this path has not yet been sufficiently explored for fake news identification. The main contributions of this work, which significantly extends our conference paper [7], are as follows. The remainder of the paper is structured as follows. In Section 2 we present the relevant related work. In Section 3 we describe the text and graph representations used in our study and present the proposed method, followed by the empirical evaluation in Section 4. We discuss the obtained results in Sections 5 and 6 and finish with the concluding remarks in Sections 7 and 8.

Related Work
We next discuss the considered classification task and the existing body of literature related to the identification/detection of fake news. The fake news text classification task is defined as follows: given a text and a set of possible classes (e.g., fake and real) to which a text can belong, an algorithm is tasked with predicting the correct class label assigned to the text. Most frequently, fake news text classification refers to the classification of content originating from social media.
Early solutions to this problem used hand-crafted features of the authors (instances), such as word and character frequencies [8]. Other fake news related tasks include the identification of a potential author as a spreader of fake news and the verification of facts. Many of the contemporary machine learning approaches are based on deep neural-network models [9].
Despite the fact that neural network based approaches outperform other approaches on many tasks, they are not directly interpretable. On the other hand, more traditional machine learning methods such as symbolic and linear models are easier to interpret and reason with, despite being outperformed by contemporary deep-learning methods. To incorporate both viewpoints, a significant amount of research has been devoted to the field of neuro-symbolic computing, which aims to bring together the robustness of neural networks and the interpretability of symbolic approaches. For example, Wang et al. [10] recently explored document representation enrichment with symbolic knowledge. In their approach, the authors enrich a two-part model: a text-based model consisting of statistical information about the text, and a knowledge model based on entities appearing in both the KG and the text.
Further, Ostendorff et al. [6] explored a similar idea, learning separate embeddings of knowledge graphs and texts and later fusing them into a single representation. An extension to the work of Ostendorff et al. was performed by Koloski et al. [11], where a promising improvement of the joint representations was observed. This approach showed potentially useful results, improving the performance over solely text-based models.
A variety of approaches achieve state-of-the-art results on tasks related to fake news detection. Currently, the transformer architecture [12] is commonly adopted for various downstream learning tasks. The winning solution to the COVID-19 Fake News Detection task [13] utilized a fine-tuned BERT model trained on Twitter data scraped from the COVID-19 period, January 12 to April 16, 2020 [14,9]. Other solutions exploited recent advancements in the field of Graph Neural Networks and their applications to such classification tasks [15]. However, for some tasks the best-performing models are SVM-based models that consider more traditional n-gram-based representations [16]. Interestingly, stylometry-based approaches were shown [17] to be a potential threat to the automatic detection of fake news: machines are able to generate consistent writing regardless of the topic, while humans tend to be biased and make inconsistent errors when writing about different topics. Additionally, researchers have explored how traditional machine learning algorithms perform on such tasks given a single representation [18]. The popularity of deep learning and the successes of Convolutional and Recurrent Neural Networks motivated the development of models following these architectures for matching the headline and the body text of an article [19]. Lu and Li [20] proposed a solution for a more realistic scenario of detecting fake news on social media platforms, incorporating graph co-attention networks over information about the news itself, as well as about the authors and the spread of the news. However, individual document representations suitable for solving a given problem are mostly problem-dependent, motivating us to explore representation ensembles, which potentially capture different aspects of the represented text and thus generalize better.

Proposed methodology
In this section we explain the proposed knowledge-based representation enrichment method. First, we define the relevant document representations, followed by concept extraction and knowledge graph (KG) embedding. Finally, we present the proposed combination of the constructed feature spaces. A schematic overview of the proposed methodology is shown in Figure 1. We begin by describing the bottom part of the scheme (yellow and red boxes), followed by the discussion of KG-based representations (green box). Finally, we discuss how the representations are combined ("Joint representation") and learned from (final step of the scheme).

Existing document representations considered
Various document representations capture different patterns across the documents. For the text-based representations we focused on exploring and exploiting the methods we had already developed in our submission to the COVID-19 fake news detection task [7]. We next discuss the document representations considered in this work.
Hand-crafted features. We use stylometric features inspired by early work in authorship attribution [8], focusing on word-level and character-level statistics. The character-based features consist of the counts of digits, letters, spaces, punctuation marks, hashtags and each of the five vowels, yielding a final statistical representation of 10 features.
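A minimal sketch of how such character-level statistics could be computed is given below; the feature set follows the description above, while the function and variable names are our own and purely illustrative.

```python
import string

VOWELS = "aeiou"

def char_stats(text: str) -> list:
    """Compute the 10 character-level stylometric features described above:
    counts of digits, letters, spaces, punctuation marks, hashtags,
    and each of the five vowels."""
    text_lower = text.lower()
    features = [
        sum(ch.isdigit() for ch in text),
        sum(ch.isalpha() for ch in text),
        sum(ch.isspace() for ch in text),
        sum(ch in string.punctuation for ch in text),
        text.count("#"),
    ]
    features += [text_lower.count(v) for v in VOWELS]
    return features

# Example usage
print(char_stats("Vaccines do NOT contain 5G chips! #covid19"))
```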
Latent Semantic Analysis. Similarly to the solution of Koloski et al. [21] to the PAN 2020 shared task on Profiling Fake News Spreaders on Twitter [22], we applied a low-dimensional space estimation technique. First, we preprocessed the data by lower-casing the document content and removing hashtags, punctuation and stop words. From the cleaned text, we generated POS tags using the NLTK library [23]. Next, we used the prepared data for feature construction, following the technique of Martinc et al. [24], which iteratively weights and chooses the best n-grams. We used two types of n-grams: word-based n-grams of sizes 1 and 2, and character-based n-grams of sizes 1, 2 and 3. We generated word and character n-grams and weighted them with TF-IDF, keeping only the m most frequent word and character n-grams. We then performed SVD [25] of the TF-IDF matrix, obtaining the LSA representation of the documents. For each of our tasks, the final representation consists of 2,500 word and 2,500 character features (i.e. 5,000 features in total), reduced to 512 dimensions with the SVD.
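A minimal sketch of the TF-IDF plus SVD (LSA) pipeline under the settings above (2,500 word and 2,500 character n-grams, reduced to 512 dimensions); the scikit-learn parameter values shown are illustrative rather than the exact configuration used in the experiments.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline, make_union

# Word 1-2 grams and character 1-3 grams, each capped at 2,500 features.
word_tfidf = TfidfVectorizer(analyzer="word", ngram_range=(1, 2), max_features=2500)
char_tfidf = TfidfVectorizer(analyzer="char", ngram_range=(1, 3), max_features=2500)

lsa = make_pipeline(
    make_union(word_tfidf, char_tfidf),   # 5,000-dimensional sparse TF-IDF space
    TruncatedSVD(n_components=512),       # dense 512-dimensional LSA representation
)

# Given a sufficiently large training corpus:
# X_lsa = lsa.fit_transform(train_documents)   # shape: (n_documents, 512)
```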
Contextual features. For capturing contextual features we utilize embedding methods that rely on the transformer architecture [12], namely sentence-BERT, XLM, dBERT and RoBERTa. First, we applied the same preprocessing as described in Subsection 3.1.
After obtaining the preprocessed texts, we embedded every text with a given transformer model to obtain its contextual vector representation. As the transformer models work with a limited number of tokens, the obtained representations were 512-dimensional, a property of the used pre-trained models. This did not represent a drawback, since most of the available documents were shorter than this maximum length. The contextual representations were obtained via pooling-based aggregation of intermediary layers [29].
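A minimal sketch of obtaining pooled contextual document embeddings with the sentence-transformers library; the concrete checkpoint below is an illustrative assumption, not necessarily the one used in the experiments.

```python
from sentence_transformers import SentenceTransformer

# Any pre-trained sentence-transformer checkpoint can be substituted here.
model = SentenceTransformer("distilbert-base-nli-mean-tokens")

documents = [
    "Drinking bleach cures COVID-19.",
    "Health authorities recommend vaccination for high-risk groups.",
]
# Mean pooling over token embeddings yields one fixed-size vector per document.
embeddings = model.encode(documents)
print(embeddings.shape)
```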

Knowledge graph-based document representations
We continue the discussion by presenting the key novelty of this work: document representations based solely on the existing background knowledge.
To be easily accessible, human knowledge can be stored as a collection of facts in knowledge bases (KB). The most common way of representing such knowledge is by connecting two entities with the relationship that relates them. Formally, a knowledge graph can be understood as a directed multigraph, where both nodes and links (relations) are typed. A concept can be an abstract idea such as a thought, a real-world entity such as a person (e.g., Donald Trump) or an object (e.g., a vaccine), and so on. An example fact is a subject-predicate-object triplet such as (Donald Trump, instance of, human).
In order to learn and extract patterns from facts, computers need to represent them in a useful manner. To obtain such representations we use six knowledge graph embedding techniques: TransE [30], RotatE [31], QuatE [32], ComplEx [33], DistMult [34] and SimplE [35]. The goal of a knowledge graph embedding method is to obtain a numerical representation of the KG or, in the case of this work, of its entities. The considered KG embedding methods also aim to preserve relationships between entities. The aforementioned methods and the corresponding relation properties they preserve are listed in Table 1. It can be observed that RotatE is the only method capable of modeling all five relation properties.
The GraphVite library [36] provides approaches that map aliases of concepts and entities to their corresponding embeddings. To extract the concepts from the documents we first preprocess the documents with the following pipeline: punctuation removal; stopword removal for words appearing in NLTK's English stopword list; lemmatization via the NLTK WordNetLemmatizer.
In the obtained texts, we search for concepts (token sets) consisting of uni-grams, bi-grams and tri-grams appearing in the knowledge graph. The concepts are identified via exact string alignment. With this step we obtain a collection of candidate concepts C_d for each document d.
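A minimal sketch of the exact-match concept extraction is shown below; the alias-to-entity mapping `alias2entity` stands in for the GraphVite-provided alias lookup and is an assumption of this sketch.

```python
from nltk import word_tokenize  # requires nltk.download("punkt")
from nltk.util import ngrams

def extract_candidate_concepts(text, alias2entity):
    """Return KG entities whose alias exactly matches a uni-, bi- or tri-gram
    of the (already preprocessed) document text."""
    tokens = word_tokenize(text.lower())
    candidates = set()
    for n in (1, 2, 3):
        for gram in ngrams(tokens, n):
            alias = " ".join(gram)
            if alias in alias2entity:
                candidates.add(alias2entity[alias])
    return candidates

# Example with a toy alias lookup (real mappings come from Wikidata5M/GraphVite).
alias2entity = {"donald trump": "Q22686", "vaccine": "Q134808"}
print(extract_candidate_concepts("Donald Trump questioned the vaccine rollout", alias2entity))
```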
From the candidate concepts mapped to each document, we developed three different strategies for constructing the final representation. Let $e_c$ denote the embedding of concept $c$, and let $C_d$ be the set of candidate concepts identified in document $d$. The simplest aggregation (AGG-AVERAGE) weights all concepts equally and represents a document as the element-wise average of its concept embeddings:
$$r_d = \frac{1}{|C_d|} \sum_{c \in C_d} e_c.$$
Albeit one of the simpler schemes, this aggregation already offered document representations competitive with many existing mainstream approaches. The key parameter of such representations is the embedding dimension, which was in this work set to 512.
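A minimal sketch of the averaging aggregation, assuming `entity_embeddings` maps entity identifiers to 512-dimensional vectors (e.g., pre-trained TransE embeddings of Wikidata5M entities); the random vectors below are stand-ins for illustration only.

```python
import numpy as np

def aggregate_average(concepts, entity_embeddings, dim=512):
    """AGG-AVERAGE: element-wise mean of the embeddings of the concepts
    found in a document; a zero vector if no concept was matched."""
    vectors = [entity_embeddings[c] for c in concepts if c in entity_embeddings]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

# Example with random stand-in embeddings.
rng = np.random.default_rng(0)
entity_embeddings = {"Q22686": rng.normal(size=512), "Q134808": rng.normal(size=512)}
doc_vector = aggregate_average({"Q22686", "Q134808"}, entity_embeddings)
print(doc_vector.shape)  # (512,)
```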

Construction of the final representation
Having presented how document representations can be obtained from knowledge graphs, we next give an overview of the considered document representations used for subsequent learning, followed by the considered representation combinations. The overview is given in Table 2. To exploit the potential of multi-modal representations we consider the Merged strategy, where we concatenate the obtained language-model and knowledge graph representations. As previously mentioned, we consider two different scenarios for the KG part:
• LM+KG - we combine the induced KG document representations with the text-based methods explained in Subsection 3.2.
• LM+KG+KG-ENTITY - we additionally combine the document representations with the entity-level KG representations derived from the document metadata.
Having discussed how the constructed document representations can be combined systematically, we next present the final part needed for classification - the representation ensemble model construction.
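A minimal sketch of how the feature spaces are stacked into a joint representation; the matrix names and toy dimensions are illustrative.

```python
import numpy as np

def stack_representations(*blocks):
    """Concatenate per-document representation blocks column-wise (Merged strategy)."""
    return np.hstack(blocks)

n_docs = 4
lm_features = np.random.rand(n_docs, 512)         # contextual language-model block
kg_features = np.random.rand(n_docs, 512)         # document-level KG block
kg_entity_features = np.random.rand(n_docs, 512)  # metadata/entity-level KG block

X_lm_kg = stack_representations(lm_features, kg_features)
X_lm_kg_entity = stack_representations(lm_features, kg_features, kg_entity_features)
print(X_lm_kg.shape, X_lm_kg_entity.shape)  # (4, 1024) (4, 1536)
```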

Classification models considered
We next present the different neural and non-neural learners, which consider the constructed representations discussed in the previous section.
Representation stacking with linear models. The first approach to utilizing the obtained representations was via linear models that take the stacked representations and learn a classifier on top of them. We considered a LogisticRegression learner and a stochastic gradient descent (SGD) based learner, optimized via either the log or the hinge loss function. We applied the learners to the three different representation scenarios.
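A minimal sketch of linear stacking with scikit-learn; the hyperparameter values and the random toy data are placeholders, not the grids used in the experiments.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, SGDClassifier

# Toy stand-in for the stacked (concatenated) document representations.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(100, 1024)), rng.integers(0, 2, size=100)
X_val, y_val = rng.normal(size=(20, 1024)), rng.integers(0, 2, size=20)

logreg = LogisticRegression(max_iter=1000)
sgd_hinge = SGDClassifier(loss="hinge")     # hinge loss (linear SVM objective)
sgd_log = SGDClassifier(loss="log_loss")    # logistic loss ("log" in older scikit-learn)

for clf in (logreg, sgd_hinge, sgd_log):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, clf.score(X_val, y_val))
```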
Representation stacking with neural networks. Since we have various representations, both of the textual patterns and of the embeddings of the concepts appearing in the data, we propose learning an intermediate joint representation with a neural network. For this purpose, we stack the inputs into a heterogeneous representation and learn intermediate representations from them with a neural network architecture. The schema of our proposed neural network approach is shown in Figure 3. We tested three different neural networks for this task. [SNN] Shallow neural network. This network uses a single hidden layer to learn the joint representation.
[5Net] Five hidden layer neural network. The original approach we proposed to solve the COVID-19 Fake News Detection problem featured a five-layer neural network to learn the intermediate representation [7]. We alter the original network by adding the KG representations to the input layer.
[LNN] Log(2) scaled neural network. Deeper neural networks in some cases appear to be more suitable for representation learning tasks.
To exploit this hypothesis we propose a deeper neural network with a domino-like decay of layer widths: for $n$ intermediate layers, the first intermediate layer consists of $2^n$ neurons, the second of $2^{n-1}$ neurons, and so on, until the final (activation) layer, whose size equals the number of unique outputs.
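A minimal sketch of such a log(2)-scaled architecture in PyTorch; the use of ReLU activations and the example layer count are our assumptions, and the exact architecture may differ.

```python
import torch.nn as nn

def build_lnn(input_dim, n_layers, n_classes):
    """Log(2)-scaled feed-forward network: hidden widths 2**n, 2**(n-1), ...,
    followed by an output layer with one unit per class."""
    widths = [2 ** k for k in range(n_layers, 0, -1)]
    layers, prev = [], input_dim
    for w in widths:
        layers += [nn.Linear(prev, w), nn.ReLU()]
        prev = w
    layers.append(nn.Linear(prev, n_classes))
    return nn.Sequential(*layers)

# Example: 1024-dimensional stacked input, hidden widths 2**10 down to 2**1, two classes.
model = build_lnn(input_dim=1024, n_layers=10, n_classes=2)
print(model)
```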

Empirical evaluation
In this section, we first describe the four data sets used for benchmarking our method. Next, we discuss the empirical evaluation of the proposed method, focusing on the problem of fake news detection.

Data sets
In order to evaluate our method we use four different fake news problems.
We consider a fake news spreaders identification problem, two binary fake news detection problems and a multilabel fake news detection problem. We next discuss the data sets related to each problem considered in this work.
COVID-19 Fake News detection data set [13,38] contains social media posts related to COVID-19, labeled as either real or fake.
Profiling Fake News Spreaders is an author profiling task organized within the PAN 2020 workshop [22]. In author profiling tasks, the goal is to decide whether an author is a spreader of fake news, based on a collection of posts the author has published. The problem is posed in two languages, English and Spanish. For each author, 100 tweets are given, which we concatenate into a single document representing that author.
FNID: FakeNewsNet [40] is a data set containing news from the PolitiFact website. The task is binary classification with two labels, real and fake. For each news article, the full text, the speaker and the controversial statement are given.
The data splits are summarised in Table 3.

Document to knowledge graph mapping
For each article we extract the uni-grams, bi-grams and tri-grams that also appear in the Wikidata5M KG. Additionally, for the Liar and the FakeNewsNet data sets we provide a KG embedding based on the aggregated concept embeddings of their metadata. In the case of the Liar data set we use, when present, the speaker, the party they represent, the country the statement relates to, and the topic of the claim. In all evaluation experiments we use the AGG-AVERAGE aggregation of concepts.

Classification setting
We use the train split of each data set to learn the models, and the validation split to select the best-performing model for the final test set evaluation. For both the linear stacking and the neural stacking we define custom grids for hyperparameter optimization, explained in the following subsections.
• 5Net: fixed layer sizes as in [7].
We considered batches of size 32 and trained the models for a maximum of 1,000 epochs with an early stopping criterion: if the result did not improve for 10 successive epochs, we stopped the optimization.
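A minimal sketch of the training loop with this early stopping criterion; monitoring validation F1 and the `train_one_epoch` helper are assumptions of this sketch.

```python
def train_with_early_stopping(model, train_loader, evaluate, max_epochs=1000, patience=10):
    """Train for up to max_epochs; stop if the validation score has not improved
    for `patience` successive epochs. `evaluate` returns the monitored score (e.g., F1)."""
    best_score, epochs_without_improvement = float("-inf"), 0
    for epoch in range(max_epochs):
        model.train_one_epoch(train_loader)  # assumed helper: one pass over batches of size 32
        score = evaluate(model)
        if score > best_score:
            best_score, epochs_without_improvement = score, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return model, best_score
```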

Baselines
The proposed representation-learner combinations were trained and validated by using the same split structure as provided in a given shared task, hence we compared our approach to the state-of-the-art for each data set separately.
As the performance metrics differ from data set to data set, we compare our approach with the state-of-the-art with regard to the metric that was selected by the shared task organizers.

Quantitative results
In this section, we evaluate and compare the quality of the representations obtained for each problem described in Section 4. For each task we report four metrics: accuracy, F1-score, precision and recall. The results are shown in Table 4: among others, we observe an improvement in F1-score and a 26.70% gain in recall, and the proposed methodology improves over the current best-performing model by a margin of 3.22%. The evaluations of the remaining tasks with respect to the models are shown in Table 5 and Table 6.

Task 4: COVID-19
The text-based representations outperformed the derived KG representations in terms of all metrics. However, the combined representation of text and knowledge significantly improved the score, with the biggest gain coming from the joint intermediate representations. The best-performing representation for this task was the one learned on the concatenated representation via an SNN with 1024 nodes. This data set did not contain metadata information, so we omitted the KG-ENTITY evaluation. The evaluation of this task with respect to the models is shown in Table 7. The proposed method of stacking representation ensembles outscored all other representations on all of the problems. The gain in recall and precision is evident for every problem, since the introduction of conceptual knowledge informs the textual representations about the concepts and the context. The best-performing models were the ones that utilized both the textual representations and the factual knowledge of the concepts appearing in the data.

Qualitative results
In the following section we further explore the constructed multi-representation space. In Subsection 6.1, we investigate whether it is possible to pinpoint which parts of the space were the most relevant for a given problem. In Subsection 6.3, we analyze whether the predictions can be explained with state-of-the-art explanation methods.

Relevant feature subspaces
We next present a procedure and the results for identifying the key feature subspaces relevant for a given classification task. We extract such features via supervised feature ranking, i.e. the process of prioritizing individual features with respect to a given target space. In this work we considered mutual information-based ranking [44], as the considered spaces were very high-dimensional. The ranking indicates that for data sets like AAAI-COVID19, mostly LSA and statistical features are sufficient.
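A minimal sketch of the mutual information-based feature ranking over the stacked feature space; the feature block names and toy data are illustrative.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# X is the stacked representation, y the class labels; block_names marks the
# origin (e.g., "LSA", "statistical", "LM", "KG") of each column.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))
y = rng.integers(0, 2, size=200)
block_names = np.array(["LSA"] * 10 + ["statistical"] * 10 + ["LM"] * 10 + ["KG"] * 10)

scores = mutual_info_classif(X, y)
# Aggregate importances per feature block to see which subspace matters most.
for block in np.unique(block_names):
    print(block, scores[block_names == block].mean())
```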

Exploratory data analysis study on the knowledge graph features from documents
In this section we analyze how representative the concept matching is. As described in Subsection 3.2, for each document we first generate the n-grams and extract those present in the KG. For each data set we present the ten most frequent extracted concepts. First we analyze the induced concepts for all four data sets, followed by the concepts derived from the document metadata for the LIAR and FakeNewsNet data sets. The retrieved concepts are shown in Figure 6. Finally, we discuss the different concepts identified as the most frequent across the data sets. Even though in data sets like FakeNewsNet and LIAR-PANTS the most common concepts include well-defined entities such as 'job', the PAN2020 mapping indicates that this is not always the case. Given that only for this data set the most frequent concepts also include e.g. numbers, we can link this observation to the type of the data: noisy, short tweets. Having observed no significant performance decrease in this case, we conducted no additional denoising ablations, even though such an endeavor could be favourable in general.
Next we analyze how much concept coverage the method achieves per data set. We present the distribution of induced knowledge graph concepts per document for every data set in Figure B.9 in the Appendix. The number of found concepts is comparable across data sets.
The chosen data sets have more than 98% of their instances covered by additional information from one or more concepts. For the LIAR data set we fail to retrieve concepts for only 1.45% of the instances, and for COVID-19 for only 0.03% of the instances. In the case of the PAN2020 and FakeNewsNet data sets we succeed in providing one or more concepts for all examples. Additional distribution details are given in Appendix B.

Evaluation of word features in the data
To better understand the data sets and the obtained models, we inspected the words in the COVID-19 Fake News detection data set as features of the prediction model.
We were interested in words that appeared in examples with different contexts but belonging to the same class. To find such words, we weighted them with the TF-IDF measure, calculated the variance of these features separately for each class, and extracted those with the highest variance within their class.
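A minimal sketch of this per-class TF-IDF variance analysis; the function and variable names are illustrative, and the toy documents only demonstrate the mechanics.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def high_variance_words(documents, labels, target_class, top_k=10):
    """Return the words with the highest TF-IDF variance among documents
    of a single class, as described above."""
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(documents).toarray()
    mask = np.array(labels) == target_class
    variances = tfidf[mask].var(axis=0)
    top = np.argsort(variances)[::-1][:top_k]
    return [vectorizer.get_feature_names_out()[i] for i in top]

docs = ["bleach cures covid", "vaccines cause illness", "who reports new cases", "cases rise again"]
labels = ["fake", "fake", "real", "real"]
print(high_variance_words(docs, labels, "fake", top_k=3))
```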
We mapped the extracted words to WordNet [45] and generalized them using Reasoning with Explanations (ReEx) [46] to discover their hypernyms, which can serve as human-understandable explanations. Figure 7 shows the words with the highest variance in their respective class, while Figure 8 shows the discovered hypernyms of the words with the highest variance for each class. If examined separately, most words found based on variance offer very little as explanations. A couple of words stand out, however; since this is a COVID-19 news data set it is not surprising that words such as "new", "covid19", "death" and "case" are present across different news examples in both classes. Because COVID-19 related news and tweets from different people often contain contradictory information and statements, there must be fake news about vaccines and certain substances among them, which could explain their inclusion among words appearing in examples belonging to the "fake" class. Words found in examples belonging to the "real" class seem to be more scientific and concerned with measurements, for example "ampere", "number" and "milliliter". After generalizing the words found with variance, we can examine what those words have in common. "Causal agent" results from the generalization of words in both the fake and real classes, which implies that news of both classes try to connect causes to certain events. These explanations also reveal that different measures, attributes and reports can be found in examples belonging to the "real" class.
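A minimal sketch of hypernym-based generalization with NLTK's WordNet interface; ReEx performs a more involved search over the hypernym hierarchy, and this only illustrates the underlying hypernym lookup.

```python
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

def hypernym_chain(word, depth=3):
    """Return up to `depth` successive hypernyms of the first noun sense of a word."""
    synsets = wn.synsets(word, pos=wn.NOUN)
    if not synsets:
        return []
    chain, current = [], synsets[0]
    for _ in range(depth):
        hypernyms = current.hypernyms()
        if not hypernyms:
            break
        current = hypernyms[0]
        chain.append(current.name())
    return chain

print(hypernym_chain("milliliter"))  # successive generalizations of the measurement unit
```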

Discussion
The fake news problem space captured in the aforementioned data sets

Conclusions
We compared different representation methods for texts, graphs and concepts, and proposed a novel method for merging them into a more efficient representation for the detection of fake news. We analysed statistical features, the matrix factorization-based LSA embedding, and neural sentence representations (sentence-BERT, XLM, dBERT and RoBERTa). We proposed a concept enrichment method for document representations based on data from the WikiData5M knowledge graph. The proposed representations significantly improve the model expressiveness and improve classification performance on all tackled tasks.
The drawbacks of the proposed method include memory consumption and the growth of computational complexity with the introduction of high-dimensional spaces. In order to cope with this scalability issue, we propose exploring dimensionality-reduction approaches such as UMAP [47], which map the original space to a low-dimensional manifold. Another problem is choosing the right approach for concept extraction from a given text. Furthermore, a potential drawback of the proposed method is the relatively restrictive entity-to-document mapping. By adopting some form of fuzzy matching, we believe we could, as further work, improve the mapping quality and with it the resulting representations.
As further work, we propose exploring attention-based mechanisms to derive explanations of feature significance for the classification of an individual instance.
Additionally, we would like to explore how other aggregation methods, such as AGG-TF and AGG-TF-IDF, perform on the given problems. The intensive research on Graph Neural Networks represents another potential direction for extending our method.

Appendix A. Examples of extracted concepts and their source documents
• report: "In an Aaj Tak news report the Chinese prime minister said 'Reading Quran and offering namaz is the only cure for COVID-19.'"
• chinese: "In an Aaj Tak news report the Chinese prime minister said 'Reading Quran and offering namaz is the only cure for COVID-19.'"

Appendix B. Distribution of concepts
In this subsection we showcase the distribution of concepts per data set, shown in Figure B.9.

Appendix C.1. Evaluation of spaces
Knowledge graph-only representations yielded overly general spaces, making them the lowest-performing spaces for the COVID-19 task. A notable improvement for this data set was achieved by adding language-model representations to the knowledge graph representations. The worst-performing combinations are listed in Table C.14, while the best-performing combinations are listed in Table C.

Appendix C.2. Conclusion
In this section we discuss the main highlights of the extensive ablation studies targeting the performance of different feature space combinations. The main conclusions are as follows.
In the evaluation of spaces study, we analyzed how combining various spaces before learning common joint spaces impacts performance. Two main takeaways emerge from the study: 1. knowledge graph-based representations on their own are too general for tasks where the main type of input are short texts; however, including additional statistical and contextual information about such texts was shown to improve performance. 2. representations capable of capturing different types of relation properties (e.g., symmetry, antisymmetry, inversion, etc.) in general perform better than the others.