Unsupervised Heterogeneous Graph Neural Networks for One-Class Tasks: Exploring Early Fusion Operators

Heterogeneous graphs are an essential structure for modeling real-world data through different types of nodes and the relationships between them, including multimodal data, which comprises different data types such as text, image, and audio. Graph Neural Networks (GNNs) are a prominent graph representation learning method that takes advantage of the graph structure and its attributes and, when applied to a multimodal heterogeneous graph, learns a single semantic space for the different modalities. Consequently, they allow multimodal fusion through simple operators such as sum, average, or multiplication, generating unified representations that consider the supplementary and complementary relationships between the modalities. In multimodal heterogeneous graphs, the labeling process tends to be even more costly due to the multiple modalities analyzed, in addition to the class imbalance inherent to some applications. To overcome these problems in applications that comprise a class of interest, One-Class Learning (OCL) is used. Given the lack of studies on multimodal early fusion in heterogeneous graphs for OCL tasks, we proposed a method based on unsupervised GNNs for heterogeneous graphs and evaluated different early fusion operators. In this paper, we extend that work by evaluating the behavior of the main GNN convolutions in the method. We highlight that average, addition, and subtraction were the best early fusion operators. In addition, GNN layers that do not use an attention mechanism performed better. In this way, we argue for heterogeneous graph neural networks in multimodal settings that use simple early fusion operators, instead of the commonly used concatenation, and less complex convolutions.


Introduction
Early data fusion generates new multimodal, robust, and unified representations considering supplementary and complementary modalities of different data types, such as audio, image, and text (Baltrušaitis et al., 2018). These heterogeneous and multimodal data can be modeled using heterogeneous graphs. Graphs offer a powerful structure for modeling real-world problems by explicitly capturing the relations between entities, since graphs deal better with abstract concepts such as relations and interactions (Rahman, 2017). Heterogeneous graphs model real-world problems with a natural structure that enriches the solution of the tasks associated with these problems (Zhou et al., 2020; Xia et al., 2021). In addition, heterogeneous graphs allow the modeling of different relations between different graph nodes, which enriches the graph representation by modeling more information and modalities (Wang et al., 2022).
In the scenario of heterogeneous graphs, each type of node can be considered a modality. In this way, we can exploit modality fusion to generate better representations for the problem. Early fusion studies for multimodal data mainly explore the concatenation of the modalities (Beserra, 2022). Concatenation increases the dimensionality of the vectors (doubling in the case of two modalities, tripling in the case of three, and so on). On the other hand, few studies explore different early fusion strategies, such as vector operators between feature vectors in the latent spaces of each modality (Beserra et al., 2020; Beserra and Goularte, 2023), for instance, addition, average, subtraction, and minimum, among others (Beserra, 2022).
Existing studies do not investigate multimodal fusion operators for heterogeneous graphs. This research gap is particularly promising as many multimodal applications are now being modeled using heterogeneous graphs (Guo et al., 2019). Examples include multimodal document classification (which may include text, images, and audio) (Liu et al., 2019); recommendation systems, in which multimodal fusion in heterogeneous graphs can enhance accuracy by considering different types of information (e.g., browsing history and personal preferences) (Guo et al., 2020); event detection in multimedia-based social networks (Schinas et al., 2015); and content retrieval based on multimodal content (Kumar et al., 2013).
In different scenarios of data modeled through heterogeneous graphs, there is a class of interest, as in the detection of hit songs (da Silva et al., 2022), detection of fake news (de Souza et al., 2022), recommendation (Gôlo et al., 2022), and detection of events of interest (Nguyen and Grishman, 2018). In these scenarios, the studies model the data through heterogeneous graphs and explore One-Class Learning (OCL) (Emmert-Streib and Dehmer, 2022; Tax, 2001). Studies use OCL because it can learn to classify the interest class with only labels of this class available. Therefore, OCL reduces the user's labeling effort, is more appropriate for imbalanced classification scenarios, and does not need to cover the scope of the non-interest class (or classes) (Khan and Madden, 2014; Alam et al., 2020).
In the one-class heterogeneous graph learning literature, studies learn representations for nodes through Graph Neural Networks (GNNs), methods considered state-of-the-art for representation learning in graphs (Wu et al., 2020). GNNs are neural networks applied to data modeled through graphs, capable of learning new, more robust representations that capture structural features, given the graph node relations, and node features, given the initial representations of the nodes. At the end of learning, the GNNs produce representations for the different types of nodes at the same semantic level, which makes it possible to use simple operators, such as addition, average, or multiplication, to combine the representations of the different types of nodes through early fusion and improve the task solved by one-class learning (Atrey et al., 2010; Jakob et al., 2021; Gôlo et al., 2021). On the other hand, studies in the one-class heterogeneous graph learning literature concatenate the learned representations (Huang et al., 2022; Zhou and Mao, 2022; Gôlo et al., 2022) or use only one type of node (da Silva et al., 2022; Ganz et al., 2023) in the one-class learning step.
This article is an extended version of Gôlo et al. (2023a), which proposes a graph neural network (GNN) method for heterogeneous graphs that explores different types of early fusion operators to deal with multiple modalities. We perform an extensive empirical evaluation of fusion operators for the representations of different types of nodes learned by GNNs and by graph regularization for one-class tasks. We propose a generic pipeline that learns representations in heterogeneous graphs through a Graph Autoencoder (GAE) (Kipf and Welling, 2016) and performs the classification with the One-Class Support Vector Machine (OCSVM) algorithm (Schölkopf et al., 2001). In this extended version, we explore three GNN layers in our results: Graph Convolutional Network (GCN) (Kipf and Welling, 2017), Graph SAmpling and aggreGatE (GraphSAGE) (Hamilton et al., 2017), and Graph Attention Network (GAT) (Velickovic et al., 2018). We also add more analysis and discussion. Our pipeline can represent any type of heterogeneous graph and can be applied to any one-class problem. Based on the experiments conducted, we answered the following research questions:
1. Does the early fusion of the regularized representations or of those learned by GNN improve the performance of one-class tasks?
2. Which early fusion operator generates better representations for one-class problems?
3. What is the best representation obtained through graphs to apply early fusion in one-class tasks?
4. Which Graph Neural Network layer obtains better representations for one-class tasks considering unsupervised representation learning?
We performed an extensive experimental evaluation to answer these research questions. We consider four one-class problems using four different datasets: hit song detection, movie recommendation, events of interest detection, and fake news detection. We represent the nodes of all the naturally modeled heterogeneous graphs through an unsupervised GAE considering the GCN, GraphSAGE, and GAT layers and classify the representations of interest using OCSVM.
We use seven fusion operators to generate the representations of interest: addition, subtraction, average, multiplication, concatenation, minimum, and maximum. We use k-fold cross-validation to run our experiments, the t-Distributed Stochastic Neighbor Embedding algorithm to reduce the dimension of the representations and visualize the embeddings, and the $f_1$-macro to evaluate the fusion operators. In summary, our contributions are:
• We present a model that incorporates additional graph modalities into the target nodes of classification, facilitating the exploration of graph heterogeneity through their combination;
• While most existing methods explore the concatenation operator for modality fusion, we investigate and evaluate the impact of alternative types of early fusion operators (addition, subtraction, multiplication, maximum, minimum, and average) to advance studies on heterogeneous scenarios modeled through graphs in one-class tasks;
• We introduce the application of different GNN layers for unsupervised learning in one-class tasks, enabling progress in selecting appropriate layers for various one-class problems.
The early fusion of the representations improved the performance in one-class learning in all datasets considering representations generated by the graph regularization and the GAE. The GAE representations performed better than the regularized representations in most evaluation scenarios. We highlight the Average, Addition, and Subtraction operators as the fusion operators that obtained the best results for one-class tasks considering data modeled through heterogeneous graphs. Two-dimensional projections showed the effectiveness of the fusion operators. We highlight the two-dimensional projection of the representations of the Addition, Average, and Subtraction operators.
The remainder of the article is organized as follows. Section 2 presents the background for the paper. Section 3 presents related work on early fusion in heterogeneous graphs for one-class problems. Section 4 presents our proposal for representation learning in heterogeneous graphs, early fusion, and one-class learning. Section 5 presents the experimental evaluation with information about the datasets, experimental setup, results, and discussion. Finally, Section 6 presents the study's conclusions and future work.

Background
In graphs in which not all nodes have an initial representation, we first need to obtain initial representations for all nodes. Thus, we use a regularization framework to obtain an attribute vector for each graph node, enabling Graph Autoencoders to learn representations in the regularized graph. We present the regularization in Section 2.1. Afterward, we can apply a graph neural network to learn new, more robust representations for the nodes from the representations obtained. We present the graph autoencoders in Section 2.2 as an unsupervised graph neural network.

Regularization
We denote $G = (V, E)$ as a graph in which $V$ is the set of vertices and $E$ is the set of edges. In addition, we associate $G$ with a matrix $F$ of $f$-dimensional node attribute vectors. However, there are scenarios where only a subset $V_F \subseteq V$ of the graph nodes has an associated attribute vector, which makes the use of graph neural networks impracticable. Thus, a solution is to use a regularization framework for learning graph representations (do Carmo and Marcacini, 2021). Equation 1 defines the objective function to be minimized by the process. The first term requires the attribute vectors of neighboring nodes $u$ and $v$ to be similar. The second term, weighted by a factor $\mu \in \mathbb{R}$, controls how much of the initial attribute vector we want to preserve during the procedure. This optimization problem can be solved using an iterative label propagation method (Zhou and Schölkopf, 2004). At the end of the process, we have a matrix of node attributes $X \in \mathbb{R}^{|V| \times f}$, in which all graph nodes have a feature vector. After obtaining a feature vector for each node, we can obtain robust and tuned representations for the graph nodes through the heterogeneous graph autoencoders that we present in the next section.
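The iterative propagation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a quadratic objective of the form $\sum_{(u,v) \in E} \lVert f_u - f_v \rVert^2 + \mu \sum_{v \in V_F} \lVert f_v - x_v \rVert^2$, which matches the verbal description of the two terms but may differ from Equation 1 in weighting or normalization. Setting the gradient to zero for one node gives the update used here. The function name `regularize` and its arguments are hypothetical.

```python
def regularize(neighbors, initial, dim, mu=1.0, iterations=100):
    """Propagate features so that every node ends up with a vector.

    neighbors: dict node -> list of adjacent nodes (undirected graph).
    initial:   dict node -> initial feature vector, only for nodes in V_F.
    dim:       dimensionality f of the feature vectors.
    mu:        strength of the pull toward the initial vector (assumed form).
    """
    # Nodes without an initial vector start from zeros (an assumption).
    feats = {v: list(initial.get(v, [0.0] * dim)) for v in neighbors}
    for _ in range(iterations):
        updated = {}
        for v, adj in neighbors.items():
            # Smoothness term: sum of the current neighbor vectors.
            total = [sum(feats[u][k] for u in adj) for k in range(dim)]
            weight = float(len(adj))
            if v in initial:
                # Fidelity term: pull toward the initial vector.
                total = [t + mu * x for t, x in zip(total, initial[v])]
                weight += mu
            # Isolated nodes without an initial vector keep their value.
            updated[v] = [t / weight for t in total] if weight else feats[v]
        feats = updated
    return feats
```

On a path graph a, b, c in which only a and c have initial vectors, node b receives a vector between those of its neighbors, while a and c stay close to their initial values by an amount controlled by `mu`.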

Graph Autoencoders
Graph neural networks are a graph representation learning method that generalizes convolutions to graphs. They consist of iterative updates of the node representations through neighborhood aggregation (Wu et al., 2020). After $k$ iterations, the GNN aggregates structural information from the $k$-hop neighborhood of the nodes (Xu et al., 2019). Formally, a GNN is defined as $Z = \mathrm{GNN}(A, X)$, generating a matrix of latent representations $Z \in \mathbb{R}^{|V| \times d}$, in which $A \in \mathbb{R}^{|V| \times |V|}$ is the adjacency matrix.
Among the GNN convolutions, the Graph Convolutional Network (GCN) (Kipf and Welling, 2017) is a spectral convolution method based on the Laplacian of a graph. The $l$-th layer of GCN is defined in Equation 2, in which $h_i^{(l)}$ and $W^{(l)}$ are, respectively, the representation of node $i$ and the parameters of the $l$-th layer. GraphSAGE (SAGE) (Hamilton et al., 2017) is a non-spectral convolution that generalizes GCN to use trainable aggregation functions. The $l$-th layer of SAGE is formalized in Equation 3, which concatenates the representation of the current node with the aggregated representation of its neighborhood, $h_{N_i}^{(l)}$. In the original paper, the authors evaluated SAGE with non-trainable aggregation functions, such as the simple element-wise mean of the neighborhood representations defined in Equation 4, and with aggregation functions with trainable parameters, such as the max-pooling described in Equation 5, in which the parameters with subscript agg are the learnable weights.
Using learnable aggregation functions, the Graph Attention Network (GAT) (Velickovic et al., 2018) is a convolution that incorporates the attention mechanism into the aggregation. The mechanism assigns different weights (or importance levels) to each node in a node's neighborhood. GAT uses multiple independent heads, where each head attends to different particularities of the neighborhood. Equation 6 defines the $l$-th GAT layer, in which the attention coefficient of the $h$-th head of the $l$-th layer denotes the importance of node $j$ to node $i$, and the representations of the $H$ heads are concatenated. The dynamic attention (Brody et al., 2022), normalized over the neighborhood $N_i$, is computed with trainable weights $W^{(l,h)}$ and $a^{(l,h)}$, as shown in Equation 7.
In this paper, we propose to use a Graph Autoencoder (GAE) to obtain representations in an unsupervised way (Kipf and Welling, 2016). The GAE is an unsupervised training framework for GNNs whose objective is to compress the structural information of the graph into a lower-dimensional space by reconstructing the adjacency matrix. The encoder consists of the GNN layers, and the decoder predicts the adjacency matrix through an inner product of the representations obtained, described as $\hat{A} = \sigma(Z Z^T)$. In practice, as the adjacency matrix is sparse, we use negative sampling of edges that do not exist in the original matrix. After obtaining representations through the GAE, we can apply different operators to the representations because the nodes' representations are at the same semantic level.
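To make the decoder and the negative sampling concrete, the sketch below scores node pairs with the inner-product decoder and computes a reconstruction loss over existing edges and sampled non-edges. It is an illustration only: the encoder and the training loop are omitted, the binary cross-entropy form of the loss is the usual choice for GAEs but is assumed here, and all function names are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def edge_probability(z_u, z_v):
    """Inner-product decoder: sigma(z_u . z_v) for one node pair."""
    return sigmoid(sum(a * b for a, b in zip(z_u, z_v)))

def sample_negative_edges(num_nodes, positive_edges, k, rng):
    """Sample up to k undirected node pairs that are not existing edges."""
    existing = {frozenset(e) for e in positive_edges}
    max_possible = num_nodes * (num_nodes - 1) // 2 - len(existing)
    k = min(k, max_possible)
    negatives = set()
    while len(negatives) < k:
        u, v = rng.randrange(num_nodes), rng.randrange(num_nodes)
        pair = frozenset((u, v))
        if u != v and pair not in existing:
            negatives.add(pair)
    return sorted(tuple(sorted(p)) for p in negatives)

def reconstruction_loss(Z, positive_edges, negative_edges):
    """Mean binary cross-entropy over existing edges and sampled non-edges."""
    eps = 1e-12  # avoids log(0)
    pos = [-math.log(edge_probability(Z[u], Z[v]) + eps)
           for u, v in positive_edges]
    neg = [-math.log(1.0 - edge_probability(Z[u], Z[v]) + eps)
           for u, v in negative_edges]
    return (sum(pos) + sum(neg)) / (len(pos) + len(neg))
```

Embeddings that place connected nodes close together and unconnected nodes far apart yield a low loss, which is the signal an encoder would be trained on.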

Related Work
This section presents work related to multimodal fusion on data modeled through heterogeneous graphs for problems solved by one-class learning. Huang et al. (2022) proposed to detect intrusions in systems through one-class learning. The study models the graph considering process and file nodes. The authors propose a directed heterogeneous graph neural network to learn the representations of the process and file nodes. In the graph modeling, the authors use as edges the relations in which a process forks a process and a process accesses a file. In the heterogeneous GNN, the authors proposed a new type of aggregation that considers the directionality of the graph, since directionality influences the intrusion detection task in a graph of processes and files. The authors concatenate the node representations and use the Deep Support Vector Data Description (DeepSVDD) algorithm (Ruff et al., 2018) to detect the anomalies. The authors used real host data from an enterprise dataset and obtained better results than baselines such as DeepSVDD and other GNN-based methods.
Gôlo et al. (2022) recommended movies through one-class learning. The authors proposed a framework that combined enriched graph modeling for the recommendation, representation learning through an unsupervised GNN, and a one-class learning algorithm. Naturally, movies are connected with users (user ratings for movies). For the enriched modeling, the authors added nodes for keywords, genres, and movie reviews. The study used a link prediction strategy, i.e., predicting whether an edge between the types of nodes exists, to learn the node representations through the unsupervised GNN. After learning the representations, the authors concatenated the representations related to rating 5 to train the one-class classifier (recommendation). The authors used One-Class Support Vector Machines (OCSVM) to classify the recommendations. The work used a movie recommendation dataset and enriched it with IMDB data. The proposal performed better than baselines such as BERT and end-to-end GNNs.
da Silva et al. (2022) proposed to detect hit songs through one-class learning. The authors modeled the songs through graphs and used a GNN to learn a robust node representation. In the modeling, da Silva et al. (2022) connect the songs with their respective artists and enrich the modeling with relations between artists. The authors use an unsupervised heterogeneous GNN that learns representations with a loss function that keeps connected nodes with similar representations and unconnected nodes with less similar representations. The study uses the OCSVM considering only music-type nodes to classify hit songs. The authors used a Spotify dataset and obtained better results than baselines such as BERT and the concatenation of BERT and the artist representation.
Ganz [...] Wang et al. (2021). The loss function penalizes nodes of interest outside the hypersphere and unsupervised nodes inside the hypersphere. In the data modeling, Zhou and Mao (2022) modeled the events through the texts and generated graphs with two types of nodes: sentences and entities. The work uses a Graph Attention Network to learn representations at the same semantic level and later concatenates the learned representations to generate a new one. The authors use a dataset and four variations of it, each with an argument of interest. The proposed method performed better than other state-of-the-art methods.
Studies that model data through heterogeneous graphs for one-class tasks either concatenate the representations learned at the same semantic level or use only one node type, even when representations from other node types are available. Therefore, these studies do not explore other heterogeneous representations or other fusion operators for representations at the same semantic level, which could yield better representations and consequently improve the results on one-class tasks. In this sense, the next section presents a method based on GNNs that considers different fusion operators on data modeled through heterogeneous graphs to solve one-class problems.

Early Fusion On Heterogeneous Graph Neural Networks For One-Class Learning
The first step of our pipeline is the regularization (Section 2.1), and the second is the representation learning through a Graph Autoencoder (Section 2.2). Next, the fusion operators aggregate information from the different node types in the graph into a single fused representation. Section 4.1 presents the early fusion process. We then submit these new representations to a one-class learning algorithm, which classifies the instances as belonging or not to the interest class. Finally, Section 4.2 presents the one-class learning algorithm. Figure 1 summarizes the proposed method.

Heterogeneous Early Fusion
After obtaining the representations, we divide the nodes by node type. For instance, consider a graph with three node types, $V_F, V_a, V_b \subset V$, in which $a$ and $b$ are the node types of the heterogeneous graph whose representations were obtained through regularization, and $V_F$ is the set of main nodes that have the initial representation. In our scenarios, the sets $V_F$ are the news, events, music, and items. We employ early fusion operators by combining the node representations $Z_F$ with $Z_a$ and $Z_b$, in which $Z_a$ and $Z_b$ are the representations generated by the GNN for node types $a$ and $b$. Given a node $v_i \in V_F$, the neighboring nodes of $v_i$ are first grouped into a single representation for each node type through an average. We define this first step in Equation 9, in which $d_{a v_i}$ is the representation for node type $a$, obtained by averaging the representations of all neighbors of $v_i$ of type $a$ ($Z_{a v_i}$). We apply this process to all neighboring node types of $v_i$, obtaining $D_{v_i} = \{d_{a v_i}, d_{b v_i}\}$. Finally, we define an operator $op$, i.e., the early fusion operator. We use the operators addition, subtraction, multiplication, minimum, maximum, average, and concatenation. It is worth mentioning that concatenation increases the dimensionality of the new representation (doubling in the case of two modalities, tripling in the case of three, and so on), while the other operators maintain the dimension of the modalities. For a main node $v_i$, the fusion process consists of applying the operator $op$ to all generated node type representations $D_{v_i}$. We define the process of combining the representations in Equation 10, in which $\lambda_{v_i}$ is the fused representation generated for the node $v_i$, and $z_{v_i}$ is the representation generated through the GAE for the node $v_i$. With the new fused representations, we can apply one-class learning (OCL). We present OCL in the next section.
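The two steps described above (a per-type average of the neighbors, followed by the fusion operator) can be sketched in a few lines. This is a hedged illustration rather than the authors' code: the function names are hypothetical, the operand order for subtraction (main node first, then the node types in sorted order) is an assumption, and node types with no neighbors are simply skipped.

```python
from functools import reduce

def mean_vector(vectors):
    """Element-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Element-wise binary operators, applied left to right over the operands.
ELEMENTWISE = {
    "addition": lambda a, b: a + b,
    "subtraction": lambda a, b: a - b,
    "multiplication": lambda a, b: a * b,
    "minimum": min,
    "maximum": max,
}

def early_fusion(z_main, neighbors_by_type, op):
    """Fuse the main node vector with one averaged vector per node type.

    z_main:            representation of the main node v_i.
    neighbors_by_type: dict node type -> list of neighbor representations.
    op:                one of the seven operator names used in the paper.
    """
    parts = [list(z_main)] + [
        mean_vector(neighbors_by_type[t])
        for t in sorted(neighbors_by_type)
        if neighbors_by_type[t]  # skip types with no neighbors
    ]
    if op == "concatenation":
        return [x for part in parts for x in part]
    if op == "average":
        return mean_vector(parts)
    combine = ELEMENTWISE[op]
    return [reduce(combine, column) for column in zip(*parts)]
```

For a main node with two neighbor types and 2-dimensional vectors, every operator except concatenation returns a 2-dimensional vector, while concatenation returns a 6-dimensional one.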

One-Class Learning
After obtaining a fused representation, we can apply one-class learning algorithms to classify interest instances. We use the One-Class Support Vector Machine (OCSVM), which is based on the Support Vector Machine (Schölkopf et al., 2001). The binary SVM aims to generate a hyperplane with the maximum separation margin between the two classes. In the OCSVM, the algorithm treats fictitious instances close to the origin as the counterexamples of the interest class in order to fit a maximum-margin separating hyperplane (Schölkopf et al., 2001). Formally, OCSVM uses Equation 11, with its accompanying constraints, to create the maximum-margin hyperplane between the interest class and the origin instances, in which $\Lambda_{int}$ is the set of fused representations of interest, $c$ are the coefficients of the separating hyperplane, $\nu \in (0, 1]$ is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors, $\varepsilon_{\lambda_i}$ is the distance from a node $\lambda_i$ to the separating hyperplane, $\rho$ is the classification error threshold, and $\phi(\lambda_i)$ is the kernel-induced mapping of the node into a linearly separable space. After creating the hyperplane, the function $f(\lambda_i)$ indicates whether node $\lambda_i$ belongs to the interest class, returning +1 (the interest side of the hyperplane) or −1 (the origin side of the hyperplane). The function $f(\lambda_i)$ is given by Equation 13, in which sgn() is the sign function, which returns −1 when $c \cdot \phi(\lambda_i) - \rho$ is negative and +1 when it is greater than or equal to 0.
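For reference, the standard ν-OCSVM primal problem and decision function of Schölkopf et al. (2001), written with the symbols defined in this section, take the form shown next. This is the textbook formulation, given here under the assumption that Equations 11 to 13 follow it; it is not a transcription of the paper's equations.

```latex
\begin{aligned}
\min_{c,\,\varepsilon,\,\rho}\quad
  & \frac{1}{2}\lVert c \rVert^{2}
    + \frac{1}{\nu\,|\Lambda_{int}|}
      \sum_{\lambda_i \in \Lambda_{int}} \varepsilon_{\lambda_i}
    - \rho \\
\text{subject to}\quad
  & c \cdot \phi(\lambda_i) \ge \rho - \varepsilon_{\lambda_i},
    \qquad \varepsilon_{\lambda_i} \ge 0,
    \qquad \forall\, \lambda_i \in \Lambda_{int} \\
f(\lambda_i)
  &= \operatorname{sgn}\bigl(c \cdot \phi(\lambda_i) - \rho\bigr)
\end{aligned}
```

In this form, a larger ν tolerates more training instances on the origin side of the hyperplane, which is why ν bounds the fraction of training errors from above.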

Experimental Evaluation
In the experimental evaluation, we compare seven early fusion operators and the non-use of operators. We used the OCSVM algorithm to compare the operators. Our goal is to demonstrate that the fusion of node representations through other operators outperforms concatenation and the non-use of operators, which are commonly used in the literature for one-class learning on heterogeneous graphs. The next sections present the datasets used in the experimental evaluation, the experimental settings, the results, and the discussion.

Datasets
We use four datasets to evaluate our proposal. The first is a fake news dataset commonly used in one-class studies (Gôlo et al., 2023b). The dataset name is Fact Checked News (FCN). Interest instances are fake news, and outliers are real news. The second dataset is a recommender system dataset for movies used and enriched by Gôlo et al. (2022). In this dataset, interest instances are relations between users and items with a rating of five, and outliers are relations with a rating of one. The third is a hit song prediction dataset collected by da Silva et al. (2022), in which interest nodes are hit songs and outliers are other songs. Finally, the fourth dataset is an event dataset used by Mattos and Marcacini (2021). The dataset name is GoldStd, a 5W1H event dataset generated from news text. The dataset has 13 classes related [...]
We model the fake news dataset with a bipartite graph, in which the nodes are documents and terms (words). Edges are relations between documents and terms, i.e., if the document has the term, we add an edge. We model the recommender systems dataset with a heterogeneous graph with five node types (users, items, keywords, genre, and review) and four edge types (user-item, keyword-item, genre-item, and review-item) (Gôlo et al., 2022). For the hit song dataset, we have artist and song nodes. The songs and artists are directly related, while pre-annotated data give the relations between artists (da Silva et al., 2022). In this sense, we have edges between artists and between artists and songs. Finally, we model the event dataset through a heterogeneous graph with nine node types (event, what, who, when, where, why, how, IPTC code, and cluster code), and each event has edges with the nodes what, who, when, where, why, and how. Cluster nodes keep the graph connected with respect to the event nodes (Mattos and Marcacini, 2021). IPTC codes are nodes related to the event topic considering the media topics extracted from the International Press Telecommunications Council (IPTC). Thus, we also have edges between event and cluster nodes and between event and IPTC nodes. Table 1 shows a summary of the datasets.
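As an illustration of the simplest of these models, the bipartite document-term graph of the fake news dataset can be built as sketched next. The lowercase whitespace tokenization and the function name are assumptions made for the example; the paper does not state how terms were extracted.

```python
def build_document_term_graph(documents):
    """Return (document_nodes, term_nodes, edges) of a bipartite graph.

    documents: dict doc_id -> raw text.
    An edge (doc_id, term) is added when the document contains the term.
    Tokenization is a plain lowercase whitespace split (an assumption).
    """
    edges = set()
    for doc_id, text in documents.items():
        for term in set(text.lower().split()):
            edges.add((doc_id, term))
    document_nodes = sorted(documents)
    term_nodes = sorted({term for _, term in edges})
    return document_nodes, term_nodes, sorted(edges)
```

Documents that share a term become two-hop neighbors through that term node, which is what allows a GNN or the regularization to propagate information between them.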

Experimental Settings
It is important to emphasize that we decided to use a new experimental setting to standardize the experimental evaluation across these four datasets. Therefore, even though we use the same datasets as da Silva et al. (2022) and Gôlo et al. (2022), we decided not to use the same experimental configurations as those one-class graph studies. We represent one node type in each dataset to create the $V_F$ set and perform the regularization. After regularization, all nodes have a feature vector.
The main nodes for all our datasets have textual content (news text, event descriptions, music lyrics, and item (movie) overviews). To represent the textual content, we use variations of the pre-trained model Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019), since this model obtained state-of-the-art results for textual data (Otter et al., 2020). BERT is a pre-trained neural network based on the transformer architecture. This architecture has attention mechanisms that focus on the main words in the sentence (Vaswani et al., 2017). Devlin et al. (2019) trained the BERT model on a large textual corpus. The model represents sentences based on their context and outperforms other natural language processing models in different tasks and languages (Otter et al., 2020). BERT can extract semantic and syntactic characteristics from the text, generating dynamic embeddings (Otter et al., 2020).
We chose BERT variations according to each dataset. We used the multilingual BERT model for fake news because the news is in Portuguese. We also use this model to represent events. For the songs, we used a BERT model pre-trained on lyrics. Finally, we represent the overviews and reviews of the movies from the recommendation dataset with the all-MiniLM-L6-v2 model, which had the highest number of downloads for the sentence similarity task.
For the GAE, Early Fusion, and One-Class Support Vector Machines, we have the following parameters:
• Heterogeneous Early Fusion: operators = {addition, subtraction, average, minimum, maximum, multiplication, concatenation}.
in which n is the dimension of the input data and o is the variance of the representations.
We use 5-fold cross-validation in the one-class learning stage, i.e., when applying the OCSVM in the classification step of the pipeline. In this procedure, we apply 5-fold cross-validation considering only the interest class of each dataset. The procedure consists of dividing the interest class into folds and iteratively using 4 folds to train and the remaining fold to test. We also add the non-interest set to the test set. Finally, we use the macro $f_1$-score as the evaluation measure. The $f_1$-macro is the arithmetic mean of the $f_1$-scores of the classes. We present the $f_1$-score in Equation 14, in which TP (True Positives) is the number of positive instances that the algorithm has correctly classified; TN (True Negatives) is the number of negative instances that the algorithm has correctly classified; FP (False Positives) is the number of negative instances that have been classified as positive; and FN (False Negatives) is the number of positive instances classified as negative.
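The evaluation protocol described above can be sketched as follows. The per-class score uses the standard definition $f_1 = 2TP / (2TP + FP + FN)$, assumed here to be equivalent to Equation 14. The round-robin fold assignment and the function names are hypothetical, since the paper does not state how the folds were drawn.

```python
def f1_for_class(y_true, y_pred, positive):
    """f1-score treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

def macro_f1(y_true, y_pred, classes=(1, -1)):
    """Arithmetic mean of the per-class f1-scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

def one_class_folds(interest, non_interest, k=5):
    """Yield (train, test, test_labels) for one-class cross-validation.

    Only interest instances are split into folds. Each test set holds one
    interest fold (label +1) plus the whole non-interest set (label -1).
    """
    folds = [interest[i::k] for i in range(k)]  # round-robin split (assumed)
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        test = folds[i] + non_interest
        labels = [1] * len(folds[i]) + [-1] * len(non_interest)
        yield train, test, labels
```

Because the non-interest set appears in every test fold, the two classes in a test set can be very unbalanced, which is one reason to report the macro average rather than accuracy.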

Results and Discussion
Tables 2, 3, 4, and 5 present the experimental evaluation results considering the regularized representations of the heterogeneous graph (REG) and the representations learned by the Graph Autoencoder (GAE) considering the GCN, GAT, and SAGE layers for each of the four datasets. We present the values of $f_1$-macro and the respective standard deviations. Each column presents the result for an early fusion operator or for the non-use of operators (Without). Each row presents the results for the regularized representations or for the representations learned by the GCN, GAT, and SAGE layers. We bold the best results. In cases of ties, we highlight the result with the smallest standard deviation.

Does the early fusion of the regularized or GNN representations improve the performance of one-class tasks?
The subtraction operator obtained the highest values of $f_1$-macro in the music and fake news datasets. The minimum and multiplication operators obtained the highest values of $f_1$-macro in the event dataset. On the other hand, the non-use of operators obtained the highest values of $f_1$-macro in the recommender systems dataset. Thus, in general, the use of early fusion operators improves the performance of one-class tasks. Even though not using fusion operators in the recommendation dataset generates better results in most representations, the concatenation operator obtained the highest $f_1$-macro in this dataset.
Concatenation is the most used early fusion operator in the literature. Few studies (Beserra et al., 2020; Beserra, 2022; Beserra and Goularte, 2023) use operators such as the ones presented in this research, and even fewer use operators to fuse features at a medium semantic level. However, concatenation, in addition to increasing the dimensionality of the generated vector (doubling in the case of two modalities, tripling in the case of three, and so on) and the cost of the algorithm, did not obtain the best results, with the exception of one scenario. On the other hand, the other operators, in addition to obtaining the best results, are also space efficient, since they generate a new representation with the same dimension as the modalities.
We can observe that early fusion does not benefit the classification performance in the recommender systems dataset. We believe that two factors together favored the non-use of early fusion operators. The first factor is the number of modalities. The recommender systems dataset has five modalities (node types) in total (users, items, keywords, genre, and review), and the greater the number of modalities, the more challenging it is to combine them. This was not the only factor, since the event dataset also has many modalities and early fusion still benefits its classification performance. This leads us to the second factor, a characteristic that differentiates these two datasets: how the classification is carried out. In the event dataset, we classify nodes of interest. On the other hand, in the recommender systems dataset, we classify interactions between users and items, which did not benefit from using early fusion operators. It is worth mentioning that we carried out the early fusion without weighting each modality, which may also explain why the early fusion operators brought no benefit.

Which early fusion operator generates better representations for one-class problems?
For the music dataset, considering the regularized representations, the multiplication operator performed better than the others. For the GCN and GAT layers on GAE, the subtraction operator outperformed the other operators. Finally, in the SAGE layer, the multiplication operator obtained the highest f1-macro. The minimum operator, no operator, no operator, and the minimum operator obtained the worst results for the REG, GCN, GAT, and SAGE representations, respectively. For the fake news dataset, considering the regularized representations and the GCN layer, the subtraction operator performed better than the others. For the GAT layer, the maximum operator outperformed the other operators. Finally, in the SAGE layer, the multiplication operator obtained the highest f1-macro. The maximum operator, no operator, the minimum operator, and the minimum operator obtained the worst results for the REG, GCN, GAT, and SAGE representations, respectively.
Generally, the subtraction operator generates better representations for one-class problems. On the other hand, we highlight some interesting particularities in the best results when comparing operators. In the fake news dataset, subtracting the term representations from the document representation differentiated the document representations and improved the classification performance. When investigating the collection procedure of this dataset, we observed that the authors collected fake and real news on the same topic (politics). Therefore, when modeling the heterogeneous graph as a bipartite graph of terms and documents, documents of different classes share words, and subtracting these shared representations removes redundant information between documents of different classes, which improves the classification. In addition, the event dataset has few main nodes (see Table 1) in relation to other node types, and therefore some operators obtain the same results. This fact also influenced the f1-macro of 1 obtained in all folds by the minimum and multiplication operators.
The recommender systems dataset was the only one in which not using an operator achieved the best result. We obtain these results in the regularized representation and in the GAT and SAGE layers. However, the best overall result in this dataset comes from the concatenation operator in the learned GCN representations. This dataset was also the only one that obtained its best result with the concatenation operator. Thus, it is interesting to highlight that, depending on the representation explored, it may be preferable to use no operator at all or to use a more costly one such as concatenation.
In addition to the analysis of which fusion operators obtain higher f1-macro scores for each dataset, we carried out a general analysis of all operators in all scenarios. We provide a statistical significance analysis for the results through the critical difference diagram proposed by Demšar (2006). First, the Friedman test is performed to reject the null hypothesis, and then we proceed with a post-hoc analysis based on the Wilcoxon-Holm method to generate the average ranking and the critical difference. Figure 2 presents the result of the Friedman test with post-hoc analysis based on the Wilcoxon-Holm method through the critical difference diagram (Ismail Fawaz et al., 2019). The diagram presents the methods' average rankings. Methods connected by a line do not present statistically significant differences between them. The average operator obtains the best average ranking, followed by the sum, concatenation, without, and subtraction operators. The minimum, maximum, and multiplication operators obtain the worst average rankings. Furthermore, the average operator obtained a statistically significant difference from the minimum operator.
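The statistical procedure described above can be sketched with SciPy as follows. This is a minimal illustration under stated assumptions: the scores are synthetic, the operator names `op_a`, `op_b`, and `op_c` are hypothetical, and the Holm step-down correction is written by hand rather than taken from the code used to draw the critical difference diagram.

```python
import itertools
import numpy as np
from scipy.stats import friedmanchisquare, rankdata, wilcoxon

def average_ranks(scores):
    """scores: (n_scenarios, n_methods), higher is better; rank 1 is the best."""
    ranks = np.vstack([rankdata(-row) for row in scores])
    return ranks.mean(axis=0)

def holm_pairwise(scores, names, alpha=0.05):
    """Pairwise Wilcoxon signed-rank tests with Holm's step-down correction."""
    pairs = list(itertools.combinations(range(len(names)), 2))
    pvals = [wilcoxon(scores[:, i], scores[:, j]).pvalue for i, j in pairs]
    m = len(pairs)
    significant = {}
    rejecting = True  # once one hypothesis is kept, all later ones are kept too
    for k, idx in enumerate(np.argsort(pvals)):
        rejecting = rejecting and bool(pvals[idx] <= alpha / (m - k))
        i, j = pairs[idx]
        significant[(names[i], names[j])] = rejecting
    return significant

# Synthetic f1-macro scores: 16 scenarios x 3 hypothetical operators.
rng = np.random.default_rng(0)
names = ["op_a", "op_b", "op_c"]
scores = rng.uniform(0.5, 0.6, size=(16, 3))
scores[:, 0] += 0.3  # make op_a the best method in every scenario

friedman_stat, friedman_p = friedmanchisquare(*scores.T)
avg = average_ranks(scores)
significant = holm_pairwise(scores, names)
print(friedman_p, dict(zip(names, avg)), significant)
```

Pairs flagged as not significant are the ones a critical difference diagram would connect with a line.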

What is the best representation obtained through graphs to apply early fusion in one-class tasks?
Representation learning through GAE generated better results in all datasets. Furthermore, most of the time, regardless of the chosen fusion operator, GAE representations generate a better f1-macro. We note exceptions in the fake news dataset, in which the GCN and GAT representations generated a smaller f1-macro than the regularized representation for some operators. A likely explanation for these results is the type of dataset used. This dataset has fake and real news that is well-behaved in context, type, topic, and veracity. Therefore, good initial representations of fake news (BERT) already separate the classes very well, as shown in the results of other studies of fake news detection through one-class learning that use this dataset (de Souza et al., 2021; Gôlo et al., 2021; de Souza et al., 2022; Gôlo et al., 2023b). Thus, with a very robust initial representation, the regularization already adds enough information to generate good representations for the terms of the bipartite graph and does not need the graph autoencoder.
In addition to comparing the operators' performances, we performed another experiment to analyze the fused representations. Figures 3, 4, and 5 present two-dimensional projections of the fused representation considering each operator in the fake news dataset with the graph autoencoder representations. We chose this dataset because, in this scenario, the operators have the highest f1-macro and |V_f|. We generated the projections using t-Distributed Stochastic Neighbor Embedding (t-SNE) for the analysis (Van der Maaten and Hinton, 2008).
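A two-dimensional projection of this kind can be sketched with scikit-learn. The data below are synthetic stand-ins for fused representations, and the perplexity, initialization, and seed are assumed values, since the text does not report the t-SNE settings.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-ins for fused node representations (30 real, 30 fake, dim 16).
rng = np.random.default_rng(0)
real = rng.normal(loc=0.0, scale=1.0, size=(30, 16))
fake = rng.normal(loc=3.0, scale=1.0, size=(30, 16))
fused = np.vstack([real, fake])
labels = np.array([1] * 30 + [-1] * 30)  # 1 = real news, -1 = fake news

# Assumed hyperparameters; perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=10, init="pca", random_state=0)
projection = tsne.fit_transform(fused)
print(projection.shape)
```

Each row of `projection` can then be drawn as a point colored by its entry in `labels` to inspect the overlap between classes.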
In the t-SNE results for the GCN layer, the non-use of fusion and the minimum operator obtain the worst visual results. On the other hand, the other operators separated the classes satisfactorily, showing good visual results. In operators with good results, we noticed that few real news items lie closer to fake news and far from the real news region. In addition, we observe a few real news items grouped in the bottom-right fake news region, which could be the differential between operators for better or worse results, i.e., how operators represent this news group directly impacts their performance. We can observe this fact in the two-dimensional projections of the concatenation, addition, and average operators, which project these real news items into a smaller region than the subtraction operator does, and the subtraction operator obtained the best f1-macro result.
In the t-SNE results for the GAT and SAGE layers, we note a different behavior. All the operators separated the classes satisfactorily, showing good visual results. We highlight that these layers outperform the GCN layer and do not show the few real news items grouped in the bottom-right fake news region that appear with the GCN layer. This difference may be what made the one-class learning model obtain higher f1 values. Therefore, we emphasize that adding attention to the edges and sampling the neighboring nodes that will be aggregated improved the representation learning through graph neural networks to detect fake news through one-class learning.
We also analyzed the best GCN and OCSVM parameters for the early fusion scenario for data modeled through heterogeneous graphs to solve one-class tasks. We present the best parameters to guide the choice of parameters in future studies on one-class learning and heterogeneous graphs. We highlight the GAE architecture with one layer containing 32 neurons, a learning rate of 1e-4, and a patience of 100. For the OCSVM, the polynomial and sigmoid kernels obtained the best results. Regarding ν, smaller values between 0.05 and 0.15 were better for the sigmoid kernel, while values between 0.40 and 0.65 were better for the polynomial kernel. Finally, γ = 1/(n·o) gave the best results in most scenarios. Notably, the projected representations have a curved separation, which indicates the advantage of the polynomial kernel over the others.
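A search over the OCSVM kernels and ν values discussed above could be sketched with scikit-learn as follows. The data are synthetic, the exact ν grid is hypothetical, the helper name `grid_search_ocsvm` is illustrative, and scikit-learn's `gamma="scale"` heuristic is used only as a stand-in for the γ setting mentioned in the text.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.svm import OneClassSVM

def grid_search_ocsvm(x_train, x_val, y_val, kernels, nus, gamma="scale"):
    """Train on the class of interest only; pick the setting with best f1-macro."""
    best = {"kernel": None, "nu": None, "f1_macro": -1.0}
    for kernel in kernels:
        for nu in nus:
            model = OneClassSVM(kernel=kernel, nu=nu, gamma=gamma).fit(x_train)
            pred = model.predict(x_val)  # +1 = class of interest, -1 = outlier
            score = f1_score(y_val, pred, average="macro")
            if score > best["f1_macro"]:
                best = {"kernel": kernel, "nu": nu, "f1_macro": score}
    return best

# Synthetic fused representations: interest class near 0, outliers near 4.
rng = np.random.default_rng(0)
interest = rng.normal(0.0, 1.0, size=(200, 8))
outliers = rng.normal(4.0, 1.0, size=(50, 8))
x_train = interest[:150]
x_val = np.vstack([interest[150:], outliers])
y_val = np.concatenate([np.ones(50), -np.ones(50)])

kernels = ["poly", "sigmoid"]
nus = [0.05, 0.10, 0.15, 0.40, 0.50, 0.65]  # hypothetical grid within the ranges above
best = grid_search_ocsvm(x_train, x_val, y_val, kernels, nus)
print(best)
```

Only examples of the class of interest are used for fitting, which is what makes the setting one-class; the labeled validation split serves only to compare settings.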

Which Graph Neural Network layer obtains better representations for one-class tasks considering unsupervised representation learning?
Figure 6 presents the best results of the GCN, GAT, and SAGE layers for each dataset independent of the operator, i.e., the best result of each layer without requiring the same operator in the comparison. In general, GCN obtains the best f1 values and GAT obtains the worst. In the music and recommender systems datasets, GCN outperformed the other layers. SAGE outperformed the other layers in detecting fake news. On the other hand, the GCN and SAGE layers tie in the event dataset.
Once again, the particularity of the graph modeling of the fake news dataset influences the best result. The SAGE layer performs sampling when aggregating information, i.e., considering the bipartite modeling of documents and terms, only a portion of the terms is selected in the sampled aggregation, which improved representation learning and consequently the one-class learning to detect fake news. For the other one-class tasks with other graph modelings, we recommend the simple and traditional GCN layer.
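The sampled aggregation described above can be illustrated with a small NumPy sketch. It covers only the neighbor sampling and mean aggregation step, omitting the learned weights and the non-linearity of a full SAGE layer; the function name, the toy bipartite graph, and the feature values are illustrative assumptions.

```python
import numpy as np

def sampled_mean_aggregate(features, neighbors, num_samples, rng):
    """Average the features of at most `num_samples` sampled neighbors per node."""
    aggregated = np.zeros_like(features)
    for node, neigh in neighbors.items():
        neigh = np.asarray(neigh)
        if len(neigh) == 0:
            continue  # isolated node: keep the zero vector
        if len(neigh) > num_samples:
            neigh = rng.choice(neigh, size=num_samples, replace=False)
        aggregated[node] = features[neigh].mean(axis=0)
    return aggregated

# Toy bipartite graph: documents 0 and 1, terms 2, 3, and 4.
features = np.arange(10, dtype=float).reshape(5, 2)
neighbors = {0: [2, 3, 4], 1: [3, 4]}
rng = np.random.default_rng(0)

full = sampled_mean_aggregate(features, neighbors, num_samples=3, rng=rng)
sampled = sampled_mean_aggregate(features, neighbors, num_samples=2, rng=rng)
print(full[0], sampled[0])
```

With `num_samples=2`, document 0 aggregates only two of its three terms, so its aggregated vector depends on which terms were drawn.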

Conclusions and Future Work
In this article, we aim to answer some research questions and contribute to one-class tasks on data modeled through heterogeneous graphs. To answer the research questions "Does pre-merging the representations regularized and learned by GNN improve the performance of one-class tasks?", "Which pre-merging operator generates better representations to solve one-class problems?", and "What is the best representation obtained through graphs to apply pre-merging in one-class tasks?", we propose the use of a graph neural network method considering different early fusion operators in four different one-class tasks. Our objective was to compare the performance of seven fusion operators, evaluate the impact of using the operators, and evaluate the impact of using graph neural networks in these scenarios.
The results of the study showed that the early fusion of the regularized representations and of the representations learned by GCN, GAT, and SAGE improved the performance of one-class learning in the four datasets modeled through heterogeneous graphs. The representations learned through GCN and SAGE obtained better results, with GCN representations obtaining better results in most datasets. In twelve of sixteen scenarios, fusion operators had a positive impact, improving the classification performance in the four datasets used. We highlight the average, addition, and subtraction operators as the best early fusion operators for one-class tasks in which data is modeled using heterogeneous graphs. On the other hand, we highlight the non-use of operators for recommending movies of interest.
In future work, we intend to propose a GNN that learns the representations for the different types of nodes while learning to combine the modalities biased by some task. In this sense, we intend to explore one-class graph neural networks (Wang et al., 2021), considering a heterogeneous version of the data in which the method will learn how to combine the heterogeneous data into a single fused representation through neurons. We intend to explore this pipeline in one-class edge classification in the homogeneous and heterogeneous scenarios. Furthermore, we intend to explore weights in the early fusion.

Figure 1. Proposed pipeline with five steps for early fusion for one-class learning on heterogeneous graph neural networks.

Figure 2. Friedman test with post-hoc analysis based on the Wilcoxon-Holm method through the critical difference diagram (Ismail Fawaz et al., 2019) for all operators in all scenarios. The diagram presents the methods' average rankings. Methods connected by a line do not present statistically significant differences between them.

Figure 3. Two-dimensional projections (t-SNE) of each fused representation considering each operator and the non-use of an operator in the fake news dataset for the GCN layer. The colors indicate the classes: real news (orange) and fake news (blue). Operators that show less overlap between classes are more promising for one-class learning.
is node degree with self-loop added, and σ a non-linear activation function like ReLU. It is worth noting that h
Ganz et al. (2023) detect backdoor software through one-class learning. The authors model code activities through collaborative graphs with commit nodes, branches, files, developers, and methods (functions). Ganz et al. (2023) represent the graph's nodes through an unsupervised heterogeneous GNN, specifically, a variational graph autoencoder. After learning the representations for the different types of heterogeneous graph nodes, the authors use only the commit node to detect software backdoors through Deep SVDD. The authors used a dataset extracted from GitHub repositories. Ganz et al. (2023) generated anomalies synthetically for training. The study uses state-of-the-art methods from the literature and the OCSVM, Deep SVDD, Local Outlier Factor, Elliptic Envelope, and Isolation Forest as baselines. The study has competitive results with the advantages of detecting anomalies in different nodes and proposing an interpretable model. Zhou and Mao (2022) perform the extraction of arguments in events by classifying arguments of interest and non-interest. The authors proposed a new loss function based on hyperspheres. This function can be adapted for one-class learning and has been proposed as an adaptation of loss from

Table 1. Number of nodes, edges, and nodes with initial features for all datasets.

Table 3. Results for the seven operators considering f1-macro in the Fake News dataset. We show results for regularized representations and graph autoencoder representations with GCN, GAT, and SAGE layers. Bold values indicate the best results.

Table 4. Results for the seven operators considering f1-macro in the Event dataset. We show results for regularized representations and graph autoencoder representations with GCN, GAT, and SAGE layers. Bold values indicate the best results.

Table 5. Results for the seven operators considering f1-macro in the Rec. Sys. dataset. We show results for regularized representations and graph autoencoder representations with GCN, GAT, and SAGE layers. Bold values indicate the best results.