Graph Neural Networks for Text Classification: A Survey

Text classification is an essential and fundamental problem in Natural Language Processing. While many recent text classification models apply sequential deep learning techniques, graph neural network-based models can directly deal with complex structured text data and exploit global information. Many real text classification applications can be naturally cast into a graph, which captures words, documents, and corpus global features. In this survey, we bring the coverage of methods up to 2023, including corpus-level and document-level graph neural networks. We discuss each of these methods in detail, covering the graph construction mechanisms and the graph-based learning process. Beyond the technical survey, we look at the issues behind, and future directions for, text classification using graph neural networks. We also cover datasets, evaluation metrics, and experiment design, and present a summary of published performance on the publicly available benchmarks, including a comprehensive comparison between different techniques and the pros and cons of various evaluation metrics.


INTRODUCTION
Text classification aims to classify a given document into certain pre-defined classes and is considered a fundamental task in Natural Language Processing (NLP). It includes a large number of downstream tasks, such as topic classification [141] and sentiment analysis [108]. Traditional text classification methods build representations of the text using N-grams [15] or Term Frequency-Inverse Document Frequency (TF-IDF) [33] and apply traditional machine learning models, such as SVM [41], to classify the documents. With the development of neural networks, more deep learning models have been applied to text classification, including convolutional neural networks (CNN) [47], recurrent neural networks (RNN) [109], attention-based models [114], and large language models [23].
However, these methods are either unable to handle the complex relationships between words and documents [133] or cannot efficiently explore context-aware word relations [143]. To resolve such obstacles, graph neural networks (GNNs) have been introduced. GNNs operate on graph-structured data, so a graph must first be built for text classification. There are two main approaches to constructing graphs: corpus-level graphs and document-level graphs.
The datasets are either built into one or several corpus-level graphs representing the whole corpus, or into numerous document-level graphs, each representing a single document.

Text Classification Tasks
Text classification involves assigning a pre-defined label to a given text sequence. The process typically involves encoding pre-processed raw text into numerical representations and using classifiers to predict the corresponding categories.
Typical sub-tasks include sentiment analysis, topic labelling, news categorization, and hate speech detection. Certain frameworks can be extended to advanced applications such as information retrieval, summarisation, question answering, and natural language inference. This paper focuses specifically on GNN-based models used for typical text classification.
• Sentiment Analysis is a task that aims to identify the emotional states and subjective opinions expressed in the input text, such as reviews, micro-blogs, etc. This can be achieved through binary or multi-class classification.
Effective sentiment analysis can aid in making informed business decisions based on user feedback.
• Topic Classification is a supervised task that aims to automatically understand the text content and classify it into multiple domain-specific categories, typically more than two. The data sources may be gathered from different domains, including Wikipedia pages, newspapers, scientific papers, etc.
• Junk Information Detection involves detecting inappropriate social media content. Social media providers commonly use approaches such as hate speech, abusive language, advertising, or spam detection to remove such content efficiently.

Text Classification Development
Many traditional machine learning methods and deep learning models are selected as baselines for comparison with GNN-based text classifiers. We mainly summarize these baselines into three types:
Traditional Machine Learning: In earlier years, traditional methods such as Support Vector Machines (SVM) [140] and Logistic Regression [29] utilized sparse representations like Bag of Words (BoW) and TF-IDF. However, more recent work [62,99,135] has focused on dense representations, such as Word2vec, GloVe, and FastText, to mitigate the limitations of sparse representations. These dense representations are also used as inputs for more sophisticated methods, such as Deep Averaging Networks (DAN) [38] and Paragraph Vector (Doc2Vec) [51], to achieve new state-of-the-art results.
Sequential Models: RNNs and CNNs have been utilized to capture local-level semantic and syntactic information of consecutive words from input text bodies. The upgraded models, such as LSTM [31] and GRU [17], have been proposed to address the vanishing or exploding gradient problems caused by vanilla RNN. CNN-based structures have been applied to capture N-gram features by using one or more convolution and pooling layers, such as Dynamic CNN [43] and TextCNN [47]. However, these models can only capture local dependencies of consecutive words. To capture longer-term or non-Euclidean relations, improved RNN structures, such as Tree-LSTM [108] and MT-LSTM [66], and global semantic information, like TopicRNN [24], have been proposed. Additionally, graph [92] and tree structure [84] enhanced CNNs have been proposed to learn more about global and long-term dependencies.
Attentions and Transformers: attention mechanisms [6] have been widely adopted to capture long-range dependencies, as in hierarchical attention networks [1] and attention-based hybrid models [132]. Self-attention-based transformer models have achieved state-of-the-art performance on many text classification benchmarks via pre-training, generating strong contextual word representations. However, these models only focus on learning the relations within the input text bodies and ignore the global, corpus-level information. Researchers have proposed combining the benefits of attention mechanisms and Graph Neural Networks (GNNs) to learn both the relations within the input text bodies and the global, corpus-level information, such as VGCN-BERT [73] and BertGCN [63].

Outline
The outline of this survey is as follows:
• Section 1 presents the research questions and provides an overview of applying Graph Neural Networks to text classification tasks, along with the scope and organization of this survey.
• Section 2 provides background information on text classification and graph neural networks and introduces the key concepts of applying GNNs to text classification from a designer's perspective.
• Section 3 and Section 4 discuss previous work on Corpus-level Graph Neural Networks and Document-level Graph Neural Networks, respectively, and provide a comparative analysis of the strengths and weaknesses of these two approaches.
• Section 5 introduces the commonly used datasets and evaluation metrics in GNN for text classification.
• Section 6 reports the performance of various GNN models on a range of benchmark datasets for text classification and discusses the key findings.
• The challenges for the existing methods and some potential future works are discussed in Section 7.
• In Section 8, we present the conclusions of our survey on GNN for text classification and discuss potential directions for future work.

Definition of Graph
A graph in this paper is represented as G = (V, E), where V and E represent the set of nodes (vertices) and the set of edges of G, respectively. A single node in the node set is represented as v_i ∈ V, and e_ij = (v_i, v_j) ∈ E denotes an edge between nodes v_i and v_j. The adjacency matrix of graph G is represented as A ∈ R^(n×n), where n is the number of nodes in graph G. If e_ij ∈ E, then A_ij = 1; otherwise A_ij = 0. In addition, we use X and E to represent the node and edge representations in graph G, where X ∈ R^(n×m) and E ∈ R^(n×c). x_i ∈ R^m represents the m-dimensional vector of node v_i and e_ij ∈ R^c represents the c-dimensional vector of edge e_ij (most recent studies set c = 1 to represent a weighting scalar). A^e denotes the edge-feature-weighted adjacency matrix.
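As a concrete illustration of these definitions, the adjacency and feature matrices of a toy graph can be built as follows (a minimal sketch; the edge list and feature dimension are invented for illustration):

```python
import numpy as np

# Toy graph G = (V, E) with n = 4 nodes and undirected edges
# (the edge list is invented for illustration).
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n = 4

# Binary adjacency matrix: A_ij = 1 if e_ij is in E, else 0.
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0   # undirected: symmetric entries

# Node feature matrix X in R^(n x m), here with m = 2 random features.
X = np.random.rand(n, 2)

print(A.sum(axis=1))  # node degrees: [2. 2. 3. 1.]
```

Row sums of A recover the node degrees, which the degree matrix D in the following sections places on its diagonal.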
Lastly, traditional graph-based methods are comparatively time-inefficient; for example, Graph Edit Distance-based graph matching methods have exponential time complexity [104].

Foundations of GNN
To tackle the limitations of traditional graph-based algorithms and better represent non-Euclidean relations in practical applications, Graph Neural Networks were proposed by [100]. GNNs have a unified graph-based framework and simultaneously model the graph structure, node, and edge representations. This section provides the general mathematical definitions of Graph Neural Networks. The general forward process of GNN can be summarised as follows:

H^(l) = F(A, H^(l−1))

where A ∈ R^(n×n) represents the weighted adjacency matrix and H^(l) ∈ R^(n×d) is the updated node representation at the l-th GNN layer, obtained by feeding the (l−1)-th layer node features H^(l−1) ∈ R^(n×k) (k is the dimension of the previous layer's node representations) into pre-defined graph filters F.
The most commonly used graph filtering method is defined as follows:

H^(l) = σ(Ã H^(l−1) W)

where Ã = D^(−1/2) A D^(−1/2) is the normalized symmetric adjacency matrix, A ∈ R^(n×n) is the adjacency matrix of graph G, and D is the degree matrix of A with D_ii = Σ_j A_ij. W ∈ R^(k×d) is the weight matrix and σ is the activation function. Stacking two GNN layers based on the above filter yields a vanilla Graph Convolutional Network (GCN) [122] framework for text classification:

Y = softmax(Ã ReLU(Ã H W_0) W_1)

where W_0 and W_1 represent different weight matrices for the two GCN layers and H is the input node features. The ReLU function is used for non-linearisation and softmax is used to generate the predicted categories Y.
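The two-layer GCN forward pass above can be sketched in a few lines of NumPy (an illustrative implementation; the toy graph, feature sizes, and random weights are placeholders, not taken from any cited model):

```python
import numpy as np

def normalize_adj(A):
    """Symmetric normalization D^(-1/2) (A + I) D^(-1/2), with self-loops added."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def softmax(Z):
    e = np.exp(Z - Z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def gcn_forward(A, H, W0, W1):
    """Two-layer GCN: Y = softmax(A_norm ReLU(A_norm H W0) W1)."""
    A_norm = normalize_adj(A)
    H1 = np.maximum(A_norm @ H @ W0, 0)      # first layer + ReLU
    return softmax(A_norm @ H1 @ W1)         # second layer + softmax

# Toy example: 4 nodes, 3 input features, 2 classes (all sizes illustrative).
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = rng.random((4, 3))
Y = gcn_forward(A, H, rng.random((3, 8)), rng.random((8, 2)))
print(Y.shape)  # (4, 2); each row is a distribution over classes
```

Each output row sums to one, i.e. a per-node class distribution, matching the node-classification view of text classification used later in this survey.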

GNN for Text Classification
In this paper, we mainly discuss how GNNs are applied to text classification tasks. Before presenting the specific applications in this area, we first introduce the key concepts of applying GNNs to text classification from a designer's view. Suppose that addressing a text classification task requires designing a graph G = (V, E). The general procedure includes Graph Construction, Initial Node Representation, Edge Representations, and Training Setup.

Graph Construction.
Some applications have explicit graph structures, including constituency or dependency graphs [110], knowledge graphs [77,87], and social networks [20], which can be used without constructing a graph structure and defining corresponding nodes and edges. However, for text classification, the most common graph structures are implicit, which means we need to define a new graph structure for a specific task, such as designing a word-word or word-document co-occurrence graph. In addition, for text classification tasks, the graph structure can generally be classified into two types:
• Corpus-level/Document-level: Corpus-level graphs intend to construct the graph to represent the whole corpus, such as [63,68,123,133], while document-level graphs focus on representing the non-Euclidean relations existing in a single text body, like [16,86,143]. Suppose a specific corpus C contains a set of documents (text bodies); a corpus-level graph is built over the whole set, while a document-level graph is built for each document separately.
After designing the graph scale for the specific task, specifying the graph type is also important to determine the nodes and their relations. For text classification tasks, the commonly used graph construction approaches can be summarized as:
• Homogeneous/Heterogeneous Graphs: homogeneous graphs have a single node and edge type, while heterogeneous graphs have various node and edge types. For a graph G = (V, E), we use N_v and N_e to represent the number of types of V and E. If N_v = N_e = 1, G is a homogeneous graph. If N_v > 1 or N_e > 1, G is a heterogeneous graph.
• Static/Dynamic Graphs: Static graphs use a graph structure constructed from various external or internal information to enhance the initial node representation, such as dependency or constituency graphs [110], co-occurrence between word nodes [143], TF-IDF between word and document nodes [53,123,133], and so on. In contrast, a dynamic graph's initial representations or graph topology change during training, without requiring certain domain knowledge and human effort. The feature representations or graph structure can be jointly learned with the downstream tasks and optimised together. For example, [120] proposed a novel topic-aware GNN text classification model with dynamically updated edges between topic nodes and others (e.g. document, word). Piao et al. [95] also designed a dynamic-edge-based graph to update the contextual dependencies between nodes. Additionally, [16] proposes a dynamic GNN model that jointly updates the edge and node representations simultaneously. We provide more details about the above-mentioned models in Section 3 and Section 4.
Another widely used pair of graph categories is directed versus undirected graphs, based on whether the edges are bi-directional or not. For text classification, most GNN designs adopt undirected graphs.
In addition, these graph type pairs are not mutually exclusive, which means the types can be combined.

Initial Node Representation.
Based on the pre-defined graph structure and specified graph type, selecting appropriate initial node representations is a key procedure to ensure the proposed graph structure can effectively learn node representations. According to the node entity type, the existing node representation approaches for text classification can generally be summarised as:
• Word-level Representation: non-contextual word embedding methods such as GloVe [93], Word2vec [81], and FastText [13] are widely adopted by many GNN-based text classification frameworks to numerically represent the node features. However, these embedding methods are restricted to capturing only syntactic similarity and fail to represent the complex semantic relationships between words; moreover, they cannot capture the meaning of out-of-vocabulary (OOV) words, and their representations are fixed. Therefore, some recent studies select ELMo [94], BERT [23], or GPT [97] to get contextual word-level node representations. Notably, even though one-hot encoding is the simplest word representation method, many GNN-based text classifiers use one-hot encoding and achieve state-of-the-art performance. A few frameworks use randomly initialised vectors to represent the word-level node features.
• Document-level Representation: similar to other NLP applications, document-level representations are normally acquired by aggregating the word-level representations via some deep learning framework. For example, some researchers extract the last hidden state of an LSTM, or use the [CLS] token output from BERT, to numerically represent the input text body. Furthermore, using TF-IDF-based document vectors is also a commonly used way to represent document-level nodes.
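A minimal sketch of one common document-node initialisation mentioned above, averaging word-level vectors (the vocabulary and embeddings here are random stand-ins for GloVe/Word2vec lookups):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical vocabulary and pretrained word vectors
# (random stand-ins for GloVe/Word2vec).
vocab = {"graph": 0, "neural": 1, "network": 2, "text": 3}
word_vectors = rng.random((len(vocab), 5))   # 5-dimensional embeddings

def doc_node_init(tokens, vocab, word_vectors):
    """Document-level node representation: mean of the word-level vectors,
    skipping out-of-vocabulary tokens."""
    ids = [vocab[t] for t in tokens if t in vocab]
    if not ids:
        return np.zeros(word_vectors.shape[1])
    return word_vectors[ids].mean(axis=0)

doc = ["graph", "neural", "network", "unknown"]
v = doc_node_init(doc, vocab, word_vectors)
print(v.shape)  # (5,)
```

Swapping the mean for an LSTM's last hidden state or BERT's [CLS] output, or a TF-IDF vector, gives the other initialisations discussed above.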
Most GNN-based text classification frameworks compare the performance of different node representation methods to conduct quantitative analysis, and provide reasonable justification for the effectiveness of the selected initial node representation given the defined graph structure.

Edge Features.
Well-defined edge features can effectively improve graph representation learning efficiency and performance by exploiting more explicit and implicit relations between nodes. Based on the predefined graph types, the edge feature types can be divided into structural features and non-structural features. Structural edge features are acquired from explicit relations between nodes, such as dependency or constituency relations between words, word-word adjacency relations, etc. These relations between nodes are explicitly defined and are also widely employed in other NLP applications. However, the more commonly used edge features are non-structural features, which exist implicitly between the nodes and are applied to specific graph-based frameworks. The typical non-structural edge features were first defined by [47] for GNN-based text classification tasks, including:
• PMI measures the co-occurrence between two words in a sliding window and is calculated as:

PMI(i, j) = log( p(i, j) / (p(i) p(j)) )
p(i, j) = #W(i, j) / #W
p(i) = #W(i) / #W

where #W is the total number of sliding windows, and #W(i) and #W(i, j) are the number of windows containing word i, and both words i and j, respectively.
• TF-IDF is the broadly used weight for the edges between document-level nodes and word-level nodes.
Besides these two widely used implicit edge features, some specific edge weighting methods have been proposed to meet the demands of particular graph structures and exploit more information from the input text bodies.
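The sliding-window PMI weighting above can be sketched as follows (an illustrative implementation of the window-counting scheme; the toy documents and window size are arbitrary):

```python
import math
from itertools import combinations
from collections import Counter

def pmi_edges(docs, window_size=3):
    """Word-word edge weights: PMI(i, j) = log(p(i, j) / (p(i) p(j))),
    with probabilities estimated from sliding-window counts;
    only positive PMI values are kept as edges."""
    word_windows = Counter()   # #W(i): windows containing word i
    pair_windows = Counter()   # #W(i, j): windows containing both i and j
    total = 0                  # #W: total number of windows
    for doc in docs:
        for start in range(max(1, len(doc) - window_size + 1)):
            window = doc[start:start + window_size]
            total += 1
            for w in set(window):
                word_windows[w] += 1
            for a, b in combinations(sorted(set(window)), 2):
                pair_windows[(a, b)] += 1
    edges = {}
    for (a, b), n_ab in pair_windows.items():
        pmi = math.log(n_ab * total / (word_windows[a] * word_windows[b]))
        if pmi > 0:
            edges[(a, b)] = pmi
    return edges

docs = [["graph", "neural", "network", "for", "text"],
        ["text", "classification", "with", "graph", "network"]]
edges = pmi_edges(docs)
print(edges)
```

Dropping non-positive PMI values, as TextGCN does, keeps only word pairs that co-occur more often than chance.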

Training Setup.
After specifying the graph structure and types, the graph representation learning tasks and training settings also need to be determined to decide how to optimise the designed GNNs. Generally, the graph representation learning tasks can be categorised into three levels including Node-level, Graph-level and Edge-level. Node-level and Graph-level tasks involve node or graph classification, clustering, regression, etc, while Edge-level tasks include link prediction or edge classification for predicting the relation existence between two nodes or the corresponding edge categories.
Similar to other deep learning model training settings, GNNs can also be divided into supervised, semi-supervised and unsupervised training settings. Supervised training provides labelled training data, while unsupervised training utilises unlabelled data to train the GNNs. However, compared with supervised or unsupervised learning, semi-supervised learning methods are broadly used by GNNs designed for text classification applications, and they can be classified into two types:
• Inductive Learning adjusts the weights of the proposed GNN based on a labelled training set, learning the overall statistics to induce a general trained model for subsequent processing. The unlabelled set can then be fed into the trained GNN to compute the expected outputs.
• Transductive Learning intends to exploit the labelled and unlabelled sets simultaneously, leveraging the relations between different samples to improve the overall performance.

CORPUS-LEVEL GNN FOR TEXT CLASSIFICATION
We define a corpus-level Graph Neural Network as "constructing a graph to represent the whole corpus", thus, only one or several graphs will be built for the given corpus. We categorize Corpus-level GNN into four subcategories based on the types of nodes shown in the graph.

Notations and Descriptions
G : A graph.
V : The set of nodes in a graph.
E : The set of edges in a graph.
e_ij : An edge between node v_i and node v_j.
N(v) : The neighbours of a node v.
A : The graph adjacency matrix.
Ã : The normalized adjacency matrix A.
Ã^k, k ∈ Z : The k-th power of Ã.
[A; B] : The concatenation of A and B.
D : The degree matrix of A.
W^(l) : The weight matrix of layer l.
H ∈ R^(n×d) : The feature matrix of a graph.
H^(l) : The feature matrix of a graph at layer l.
h_i ∈ R^d : The feature vector of node v_i.
h_i^(l) : The feature vector of node v_i at layer l.
Z ∈ R^(n×d) : The output feature matrix of a graph.
z_i ∈ R^d : The output feature vector of node v_i.

Document and Word Nodes as a Graph
Most corpus-level graphs include word nodes and document nodes, with word-document edges and word-word edges. By applying a k-layer GNN (normally k = 2 or 3), word nodes serve as a bridge to propagate information from one document node to another.

PMI and TF-IDF as graph edges: TextGCN, SGC, S²GC, NMGC, TG-Transformer, BertGCN.
TextGCN [133] Yao et al. [133] build a corpus-level graph with training document nodes, test document nodes and word nodes. Before constructing the graph, a common preprocessing method [47] is applied: words appearing fewer than 5 times, or appearing in the NLTK [11] stopwords list, are removed. The edge value between a document node and a word node is the TF-IDF value, and that between word nodes is the PMI value. The adjacency matrix of this graph is:

A_ij = PMI(i, j)     if i, j are words and PMI(i, j) > 0
       TF-IDF(i, j)  if i is a document and j is a word
       1             if i = j
       0             otherwise
A two-layer GCN is applied to the graph, and the dimension of the second layer's output equals the number of classes in the dataset. Formally, the forward propagation of TextGCN is:

Z = softmax(Ã ReLU(Ã X W_0) W_1)

where Ã is the normalized adjacency matrix of A, and X is the one-hot embedding. W_0 and W_1 are learnable parameters of the model.
The representation on training documents is used to calculate the loss and that on test documents is for prediction.
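Assembling the TextGCN adjacency matrix from its TF-IDF and PMI blocks can be sketched as follows (an illustrative layout that assumes document nodes are indexed first; the toy values are random placeholders):

```python
import numpy as np

def build_textgcn_adj(tfidf, pmi):
    """Assemble a TextGCN-style adjacency matrix from a document-word TF-IDF
    block (n_docs x n_words) and a symmetric word-word PMI block.

    Layout:  [[ I         TF-IDF ]
              [ TF-IDF^T  PMI    ]]   with A_ii = 1 (self-loops)."""
    n_docs, n_words = tfidf.shape
    n = n_docs + n_words
    A = np.zeros((n, n))
    A[:n_docs, n_docs:] = tfidf
    A[n_docs:, :n_docs] = tfidf.T
    A[n_docs:, n_docs:] = np.maximum(pmi, 0)   # keep positive PMI only
    np.fill_diagonal(A, 1.0)                   # self-loops
    return A

rng = np.random.default_rng(0)
tfidf = rng.random((2, 3))            # 2 documents x 3 words (toy values)
p = rng.standard_normal((3, 3))
pmi = (p + p.T) / 2                   # symmetric toy PMI block
A = build_textgcn_adj(tfidf, pmi)
print(A.shape)  # (5, 5)
```

Normalizing this matrix and feeding it to a two-layer GCN with one-hot inputs reproduces the forward pass described above.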

Manuscript submitted to ACM
TextGCN is the first work that treats a text classification task as a node classification problem by constructing a corpus-level graph, and it has inspired many subsequent works.
Based on TextGCN, several works follow the same graph construction method and node initialization but apply different graph propagation models.
SGC [123] To make GCN more efficient, SGC (Simple Graph Convolution) removes the nonlinear activation functions in GCN layers. The K-layer propagation of SGC is therefore:

Ŷ = softmax(Ã … Ã Ã X W^(1) W^(2) … W^(K))

which can be reparameterized into:

Ŷ = softmax(Ã^K X W)

where K is 2 when applied to text classification tasks. With a smaller number of parameters and only one feedforward layer, SGC saves computation time and resources while improving performance.
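The reparameterization means Ã^K X can be precomputed once, after which training is just a linear classifier on the propagated features (a sketch; the toy Ã and X are arbitrary):

```python
import numpy as np

def sgc_features(A_norm, X, K=2):
    """Precompute A_norm^K X once; SGC training then reduces to a single
    linear layer, softmax((A_norm^K X) W)."""
    for _ in range(K):
        X = A_norm @ X
    return X

A_norm = np.array([[0.5, 0.5],
                   [0.5, 0.5]])   # toy normalized adjacency
X = np.eye(2)                     # one-hot features
F = sgc_features(A_norm, X, K=2)
print(F)  # [[0.5 0.5], [0.5 0.5]]
```

Because the propagation is fixed, the expensive sparse multiplications happen once rather than at every training step, which is where SGC's speedup comes from.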
S²GC [148] To solve the oversmoothing issue in GCN, Zhu and Koniusz [148] propose Simple Spectral Graph Convolution (S²GC), which includes self-loops using the Markov Diffusion Kernel. The output of S²GC is calculated as:

Ŷ = softmax( (1/K) Σ_{k=0}^{K} Ã^k X W )

and can be generalized into:

Ŷ = softmax( (1/K) Σ_{k=0}^{K} ((1−α) Ã^k X + α X) W )

Similarly, K = 2 on text classification tasks, and α denotes the trade-off between the self-information of the node and consecutive neighbourhood information. S²GC can also be viewed as introducing skip connections into GCN.
NMGC [53] Rather than summing each GCN layer as in S²GC, NMGC applies min pooling using the Multi-hop neighbour Information Fusion (MIF) operator to address oversmoothing problems. The MIF function is defined as:

MIF(K) = min(Ã X W, Ã² X W, …, Ã^K X W)

NMGC-K first applies a MIF(K) layer and then a GCN layer, where K is 2 or 3. For example, when K = 3, the output is:

Ŷ = softmax(Ã ReLU(MIF(3)) W_1)

NMGC can also be treated as a skip connection in Graph Neural Networks, which makes the shallow layers of the GNN contribute to the final representation directly.
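The MIF operator's element-wise min pooling over the first K hops can be sketched as follows (illustrative only; the matrix sizes and the shared weight W are placeholders):

```python
import numpy as np

def mif(A_norm, X, W, K=3):
    """Multi-hop Information Fusion: element-wise min over A_norm^k X W
    for k = 1..K, so every hop must 'agree' for a large activation."""
    hops, AX = [], X
    for _ in range(K):
        AX = A_norm @ AX          # A_norm^k X after k multiplications
        hops.append(AX @ W)
    return np.minimum.reduce(hops)

rng = np.random.default_rng(0)
A_norm = np.full((3, 3), 1 / 3)   # toy normalized adjacency
X = np.eye(3)                     # one-hot node features
W = rng.random((3, 4))            # shared weight matrix (placeholder)
M = mif(A_norm, X, W, K=3)
print(M.shape)  # (3, 4)
```

Taking the minimum rather than the sum keeps the result bounded by the nearest-hop signal, which is how NMGC counteracts oversmoothing.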
TG-Transformer [137] TextGCN treats the document nodes and word nodes as the same type of node during propagation. To introduce heterogeneity into the TextGCN graph, TG-Transformer (Text Graph Transformer) adopts two sets of weights, for document nodes and word nodes respectively. To cope with a large corpus graph, subgraphs are sampled from the TextGCN graph using the PageRank algorithm [88]. The input embedding of each node is the sum of three types of embeddings: pretrained GloVe embedding, node type embedding, and Weisfeiler-Lehman structural encoding [85].
BertGCN [63] To combine BERT [45] and TextGCN, BertGCN enhances TextGCN by initializing the document nodes with the BERT [CLS] output of each epoch and replacing the word input vectors with zeros. BertGCN trains BERT and TextGCN jointly by interpolating their outputs:

Ŷ = λ Ŷ_GCN + (1 − λ) Ŷ_BERT

where λ is the trade-off factor. To optimize memory during training, a memory bank is used to track the document inputs, and a smaller learning rate is set for the BERT module to maintain the consistency of the memory bank. BertGCN shows that, with the help of TextGCN, BERT can achieve better performance.
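The interpolation is a one-line combination of the two predicted class distributions (a sketch; the distributions and λ here are toy values):

```python
import numpy as np

def bertgcn_predict(y_gcn, y_bert, lam=0.7):
    """Interpolate the two predicted class distributions:
    Y = lam * Y_GCN + (1 - lam) * Y_BERT, where lam is the trade-off factor."""
    return lam * y_gcn + (1 - lam) * y_bert

y_gcn = np.array([0.8, 0.2])    # toy class distributions
y_bert = np.array([0.4, 0.6])
y = bertgcn_predict(y_gcn, y_bert, lam=0.5)
print(y)  # [0.6 0.4]
```

Setting λ = 0 recovers pure BERT and λ = 1 pure TextGCN, so λ directly controls how much the graph module contributes.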

Multi-Graphs/Multi-Dimensional Edges: TensorGCN, ME-GCN.
TensorGCN [68] Instead of constructing a single corpus-level graph, TensorGCN builds three independent graphs: Semantic-based graph, Syntactic-based graph, and Sequential-based graph to incorporate semantic, syntactic and sequential information respectively and combines them into a tensor graph.
The three graphs share the same set of TF-IDF values for the word-document edges but different values for the word-word edges. The semantic-based graph extracts semantic features from a trained Long Short-Term Memory (LSTM) [36] model and connects words sharing high similarity. The syntactic-based graph uses the Stanford CoreNLP parser [76] and constructs edges between words when they have a high probability of having a dependency relation. For the sequential-based graph, the PMI value is applied, as in TextGCN.
The propagation includes intra-graph propagation and inter-graph propagation. The model first applies the GCN layer on three graphs separately as intra-graph propagation. Then the same nodes on three graphs are treated as a virtual graph and another GCN layer is applied as inter-graph propagation.

ME-GCN [118]
To fully utilize the corpus information and analyze the rich relational information of the graph, ME-GCN (Multi-dimensional Edge-Embedded GCN) builds a graph with multi-dimensional word-word, word-document and document-document edges. Word2vec and Doc2vec embeddings are first trained on the given corpus, and the similarity of each dimension of the trained embeddings is used to construct the multi-dimensional edges. The trained embeddings also serve as the input embeddings of the graph nodes. During propagation, GCN is applied on each dimension first, and the representations from different dimensions are either concatenated or fed into a pooling method to get the final representation of each layer.

Making TextGCN Inductive: HeteGCN, InducT-GCN, T-VGAE.
HeteGCN [98] HeteGCN (Heterogeneous GCN) optimizes TextGCN by decomposing the TextGCN undirected graph into several directed subgraphs. Subgraphs of the TextGCN graph are combined sequentially as different layers: the feature graph (word-word graph), the feature-document graph (word-document graph), and the document-feature graph (document-word graph). Different combinations were tested, and the best model is:

Ŷ = softmax(A_{w−d}ᵀ ReLU(A_{w−w} X W^(0)) W^(1))

where A_{w−w} and A_{w−d} are the adjacency matrices of the word-word subgraph and the word-document subgraph. Since the input of HeteGCN is the word node embeddings, without using document nodes, it can also work in an inductive way, while the previous corpus-level graph text classification models are all transductive.

InducT-GCN [119]
InducT-GCN (InducTive Text GCN) aims to extend the transductive TextGCN into an inductive model. Instead of using the whole corpus to build the graph, InducT-GCN builds a graph over the training corpus only and uses each document's TF-IDF vector as its input embedding, which aligns with the one-hot word embeddings.
The weights are learned following TextGCN but InducT-GCN builds virtual subgraphs for prediction on new test documents.
T-VGAE [127] T-VGAE (Topic Variational Graph Auto-Encoder) applies a Variational Graph Auto-Encoder to the latent topic of each document to make the model inductive. A vocabulary graph A_v, which connects the words using PMI values, is constructed, while each document is represented using its TF-IDF vector. All the document vectors are stacked into a matrix X, which can also be treated as a bipartite document-word graph. Two graph auto-encoder models are applied on A_v and X respectively. The overall encoding workflow is:

Z_v = Encoder_A(A_v, I)
Z_d = Encoder_X(X, Z_v)

where I is an identity matrix. Encoder_A and the decoders follow [48], while Encoder_X is a unidirectional message passing variant of Encoder_A. The training objective is minimising the reconstruction error, and Z_d is used for the classification task.

Document Nodes as a Graph
To show the global structure of the corpus directly, some models only adopt document nodes in the non-heterogeneous graph.
knn-GCN [9] knn-GCN constructs a k-nearest-neighbours graph by connecting the documents with their nearest neighbours using the Euclidean distances of the document embeddings. The embeddings are generated in an unsupervised way: either using the mean of pretrained GloVe word vectors or applying LDA [12]. Both GCN and an Attention-based GNN [111] are used as the graph model.
TextGTL TextGTL constructs three document similarity graphs from different information sources, and the node representation is mixed as the output of each layer and the input for the next layer for all three graphs:

H^(l+1) = σ(Ã H^(l)_mix W^(l)),  H^(l)_mix = H^(l)_{G₁} + H^(l)_{G₂} + H^(l)_{G₃}

where H^(0) is the TF-IDF vector of the documents. Data augmentation with super nodes is also applied in TextGTL to strengthen the information in the graph models.
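The k-nearest-neighbour construction used by knn-GCN can be sketched as follows (illustrative only: a brute-force O(n²) distance computation on toy 2-D document embeddings):

```python
import numpy as np

def knn_graph(doc_emb, k=2):
    """Connect each document to its k nearest neighbours
    by Euclidean distance (brute force, O(n^2))."""
    n = doc_emb.shape[0]
    # Pairwise squared Euclidean distances.
    d = ((doc_emb[:, None, :] - doc_emb[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d, np.inf)          # exclude self-distance
    A = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(d[i])[:k]:
            A[i, j] = A[j, i] = 1.0      # symmetrize the k-NN relation
    return A

# Toy 2-D document embeddings: two tight clusters.
emb = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
A = knn_graph(emb, k=1)
print(A)
```

In practice the embeddings would come from averaged GloVe vectors or LDA topic proportions, as described above, rather than toy coordinates.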

Word Nodes as a Graph
By neglecting the document nodes in the graph, a graph with only word nodes shows good performance in deriving the graph-based embedding and is used for downstream tasks. Since no document nodes are included, this method can be easily adapted as an inductive learning model.
VGCN-BERT [73], for example, builds a word-only vocabulary graph based on word co-occurrence statistics and applies a vocabulary GCN over it, where the BERT embedding is used as the input. The graph word embeddings are then concatenated with the BERT embeddings and fed into BERT as extra information.

Extra Topic Nodes in the Graph
Topic information of each document can also provide extra information in corpus-level graph neural networks. Several models also include topic nodes in the graph.

Single Layer Topic nodes: HGAT, STGCN.
HGAT [64] HGAT (Heterogeneous GAT) applies LDA [12] to extract topic information for each document; the top topics with the largest probabilities are selected and connected with the document. Instead of using the words directly, to utilize external knowledge HGAT applies the entity linking tool TAGME¹ to identify the entities in the document and connects them. The semantic similarity between entities, computed using pretrained Word2vec with a threshold, is used to define the connectedness between entity nodes. Since the graph is heterogeneous, a HIN (heterogeneous information network) model is implemented, which propagates solely on each sub-graph depending on the node type. An HGAT model is applied by considering type-level attention and node-level attention. For a given node, type-level attention learns the weights of different types of neighbouring nodes, while node-level attention captures the importance of different neighbouring nodes while ignoring the type. By using this dual attention mechanism, HGAT can capture information about both type and node at the same time.

STGCN [130]
In terms of short text classification, STGCN (Short-Text GCN) applies BTM to obtain topic information, avoiding the data sparsity problem that LDA suffers from. The graph is constructed following TextGCN, with extra topic nodes included. The edge values of word-topic and document-topic pairs come from BTM, and a classical two-layer GCN is applied.
The word embeddings learned from STGCN are concatenated with BERT embeddings and a bi-LSTM model is applied for final prediction.

Multi-layer Topic Nodes: DHTG.
DHTG [120] To capture different levels of information, DHTG (Dynamic Hierarchical Topic Graph) introduces hierarchical topic-level nodes into the graph, from fine-grained to coarse. The Poisson gamma belief network (PGBN) [145] is used as a probabilistic deep topic model. The first-layer topics come from combinations of words, while deeper layers are generated from the previous layers' topics with the weights of PGBN, and these weights serve as the edge values between adjacent layers of topics. For topics on the same layer, cosine similarity is chosen as the edge value. A two-layer GCN is applied, and the model is learned jointly with PGBN, which makes the topic edges dynamic.

Critical Analysis
Compared with sequential models like CNN and LSTM, corpus-level GNNs are able to capture the global corpus structure information, with word nodes as bridges between document nodes, and show great performance without using external resources like pretrained embeddings or pretrained models. However, the improvement in performance is marginal when pretrained embeddings are included. Another issue is that most corpus-level GNNs are transductive, which is not applicable in the real world. Meanwhile, constructing the whole corpus into a graph requires a large memory space, especially when the dataset is large.

¹ https://sobigdata.d4science.org/group/tagme/
A detailed comparison of corpus-level GNN is displayed in Table 2.

DOCUMENT-LEVEL GNN FOR TEXT CLASSIFICATION
By constructing the graph based on each document, a graph classification model can be used as a text classification model. Since each document is represented by one graph and new graphs can be built for test documents, the model can easily work in an inductive way.

Local Word Consecutive Graph
The simplest way to convert a document into a graph with words as nodes is by connecting the consecutive words within a sliding window.
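This construction can be sketched in a few lines (an illustrative sketch; `build_window_graph`, the window convention and the toy sentence are our own choices, not taken from any particular paper):

```python
from collections import defaultdict

def build_window_graph(tokens, window=2):
    """Connect each word to the following words that fall inside a
    sliding window of the given size; edge values count co-occurrences."""
    nodes = sorted(set(tokens))
    idx = {w: i for i, w in enumerate(nodes)}
    edges = defaultdict(int)
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            a, b = idx[w], idx[tokens[j]]
            if a != b:
                edges[(min(a, b), max(a, b))] += 1  # undirected co-occurrence count
    return nodes, dict(edges)

# With window=2 only consecutive words are connected.
nodes, edges = build_window_graph("the cat sat on the mat".split(), window=2)
```

Larger window sizes simply connect each word to more of its right-hand neighbours, densifying the graph.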

Simple consecutive graph models: Text-Level-GNN, MPAD, TextING.
Text-Level-GNN [37] Text-Level-GNN applies a small sliding window and constructs a graph with a small number of nodes and edges for each document, which saves memory and computation time. The edge values are trainable and shared across graphs when they connect the same two words, which also brings global information.
Unlike corpus-level graph models, Text-Level-GNN applies a message passing mechanism (MPM) [30] instead of GCN for graph learning. For each node, the neighbour information is aggregated using max-pooling with trainable edge values as the AGGREGATE function, and then a weighted sum is used as the COMBINE function. To get the representation of each graph, sum-pooling and an MLP classifier are applied as the READOUT function. The propagation can be written as $M_n^{(l)} = \max_{a \in \mathcal{N}_n} e_{an} r_a^{(l)}$ and $r_n^{(l+1)} = (1-\eta_n) M_n^{(l)} + \eta_n r_n^{(l)}$, where $r_n^{(l)}$ is the $n$-th word node representation at layer $l$ and $e_{an}$ is the edge weight from node $a$ to node $n$. A two-layer MPM is applied, and the input of each graph is pretrained GloVe vectors.
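A toy dense version of this AGGREGATE/COMBINE step might look like the following (a hedged numpy sketch assuming every node has at least one neighbour; in the actual model the edge weights and the combination weight eta are trainable, and the graph is sparse):

```python
import numpy as np

def mpm_layer(r, e, eta):
    """One message-passing step: AGGREGATE = max-pooling over neighbour
    states scaled by edge weights, COMBINE = weighted sum with the node's
    own state. Assumes every node has at least one neighbour."""
    n = r.shape[0]
    m = np.zeros_like(r)
    for i in range(n):
        msgs = e[:, i:i + 1] * r         # scale each neighbour state by its edge weight
        msgs[e[:, i] == 0] = -np.inf     # mask non-neighbours before max-pooling
        m[i] = msgs.max(axis=0)
    return (1 - eta) * m + eta * r       # COMBINE: weighted sum

r = np.array([[1., 0.], [0., 1.]])       # toy node states
e = np.array([[0., 1.], [1., 0.]])       # toy symmetric edge weights
out = mpm_layer(r, e, eta=0.5)
```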
MPAD [86] MPAD (Message Passing Attention Networks) connects words within a sliding window of size 2 but also includes an additional master node connecting all nodes in the graph. The edge only shows the connectedness of each pair of word nodes and is fixed. A variant of Gated Graph Neural Networks is applied where the AGGREGATE function is the weighted sum and the COMBINE function is GRU [18]. Self-attention is applied in the READOUT function.
To learn the high-level information, the master node is directly concatenated with the READOUT output, working as a skip connection mechanism. To get the final representation, each layer's READOUT results are concatenated to capture multi-granularity information. Pretrained Word2vec is used as the initialization of word nodes input.
TextING [143] To simplify MPAD, TextING drops the master node from the document-level graphs, which makes the graph sparser. Compared with Text-Level-GNN, TextING keeps fixed edges. Similar AGGREGATE and COMBINE functions are applied following Gated Graph Neural Networks (GGNN) [58], with the weighted sum and GRU.
However, for the READOUT function, soft attention is used and both max-pooling and mean-pooling are applied to make sure that "every word plays a role in the text and the keywords should contribute more explicitly".
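A simplified version of such a readout, combining a soft-attention gate with both mean- and max-pooling, could be sketched as follows (the function name, weight matrices and shapes are arbitrary illustrative choices, not the exact TextING parameterisation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def readout(h, w_att, w_emb):
    """Soft-attention readout: gate each node's features, then combine
    mean-pooling and max-pooling so every word contributes while
    strongly gated keywords contribute more."""
    gate = sigmoid(h @ w_att)     # per-node soft attention in (0, 1)
    feat = np.tanh(h @ w_emb)     # non-linear node feature
    gated = gate * feat
    return gated.mean(axis=0) + gated.max(axis=0)

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))                                # 4 word nodes, dim 8
doc_vec = readout(h, rng.normal(size=(8, 8)), rng.normal(size=(8, 8)))
```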

Advanced graph models: MLGNN, TextSSL, DADGNN.
MLGNN [61] MLGNN (Multi-level GNN) builds the same graph as TextING but introduces three levels of MPM: bottom-level, middle-level and top-level. In the bottom-level MPM, the same method as Text-Level-GNN is applied with pretrained Word2vec as the input embedding, but the edges are non-trainable. In the middle level, a larger window size is adopted and Graph Attention Networks (GAT) [115] are applied to learn information from distant word nodes. In the top-level MPM, all word nodes are connected and multi-head self-attention [114] is applied. By applying three different levels of MPM, MLGNN learns multi-granularity information well.
DADGNN [70] DADGNN (Deep Attention Diffusion GNN) constructs the same graph as TextING but uses attention diffusion to overcome the oversmoothing issue. Pretrained word embeddings are used as the input of each node and an MLP layer is applied. Then, the graph attention matrix is calculated based on the attention over the hidden states of each node. The diffusion matrix is calculated as $\tilde{A} = \sum_{k=0}^{K} \theta_k A^k$, where $A$ is the graph attention matrix and the $\theta_k$ are learnable coefficients. $A^k$ plays the role of connecting $k$-hop neighbours, and Liu et al. [70] use $K \in [4, 7]$ in practice. A multi-head diffusion matrix is applied for layer propagation.
TextSSL [95] To solve the word ambiguity problem and capture word synonymity and dynamic contextual dependency, TextSSL (Sparse Structure Learning) learns the graph using intra-sentence neighbours and inter-sentence neighbours simultaneously. The local syntactic neighbours are defined as the consecutive words, and trainable edges across graphs are also included by using Gumbel-softmax. By applying sparse structure learning, TextSSL manages to select edges with dynamic contextual dependencies.
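As an illustration of the attention-diffusion idea used by DADGNN, a truncated diffusion matrix of the form $\sum_k \theta_k A^k$ can be computed as follows (a hedged sketch; `attention_diffusion`, the toy attention matrix and the fixed `theta` values are our own illustrative choices, whereas DADGNN learns the coefficients):

```python
import numpy as np

def attention_diffusion(A, theta):
    """Build the diffusion matrix sum_k theta_k * A^k from an attention
    matrix A, so that the k-th term connects k-hop neighbours."""
    D = np.zeros_like(A)
    P = np.eye(A.shape[0])   # A^0
    for t in theta:
        D += t * P
        P = P @ A            # next power of A
    return D

A = np.array([[0.0, 1.0], [1.0, 0.0]])            # toy 2-node attention matrix
D = attention_diffusion(A, theta=[0.5, 0.25, 0.125])
```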

Global Word Co-occurrence Graph
Similar to the TextGCN graph, document-level graphs can also use PMI as the word-word edge values.
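PMI-weighted edges of this kind can be estimated from sliding-window co-occurrence counts, for example (an illustrative sketch; `pmi_edges`, the window size and the positive-PMI cutoff follow the common TextGCN-style recipe but are our own coding choices):

```python
import math
from collections import Counter
from itertools import combinations

def pmi_edges(tokens, window=5):
    """Estimate PMI(i, j) = log( p(i, j) / (p(i) p(j)) ) from sliding
    windows, keeping only positive values as edge weights."""
    windows = [tokens[k:k + window]
               for k in range(max(1, len(tokens) - window + 1))]
    n = len(windows)
    w_count, pair_count = Counter(), Counter()
    for w in windows:
        seen = set(w)
        w_count.update(seen)                           # window counts per word
        pair_count.update(combinations(sorted(seen), 2))  # window counts per pair
    edges = {}
    for (i, j), c in pair_count.items():
        pmi = math.log((c / n) / ((w_count[i] / n) * (w_count[j] / n)))
        if pmi > 0:                                    # keep positive PMI only
            edges[(i, j)] = pmi
    return edges

edges = pmi_edges("x y a b c d x y".split(), window=2)
```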

Only global word co-occurrence: DAGNN.
DAGNN [125] To address the long-distance dependency, hierarchical information and cross-domain learning challenges in domain-adversarial text classification tasks, Wu et al. [125] propose DAGNN (Domain-Adversarial Graph Neural Network). Each document is represented by a graph with content words as nodes and PMI values as edge values, which can capture long-distance dependency information. Pretrained FastText is chosen as the input word embedding to handle the out-of-vocabulary issue, and a GCN model with skip connections is used to address the oversmoothing problem. The propagation is formulated as $H^{(l+1)} = \sigma(\tilde{A} H^{(l)} W^{(l)}) + H^{(l)}$. To learn the hierarchical information of documents, DiffPool [136] is applied to assign each document into a set of clusters. Finally, adversarial training is used to minimize the loss on source tasks and maximize the differentiation between source and target tasks.
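A skip-connected GCN layer of this kind can be sketched as follows (a hedged numpy sketch assuming the common residual form $H^{(l+1)} = \sigma(\tilde{A} H^{(l)} W^{(l)}) + H^{(l)}$; the function name and the toy inputs are our own):

```python
import numpy as np

def gcn_skip_layer(A_hat, H, W):
    """One GCN layer with a residual (skip) connection:
    H' = ReLU(A_hat @ H @ W) + H, where A_hat is a normalized adjacency."""
    return np.maximum(A_hat @ H @ W, 0.0) + H

n, d = 3, 4
A_hat = np.full((n, n), 1.0 / n)   # toy row-normalized adjacency
H = np.ones((n, d))                # toy node features
H_next = gcn_skip_layer(A_hat, H, np.eye(d))
```

The residual term keeps each node's own representation in the output, which is what counteracts oversmoothing as layers are stacked.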

Combine with Extra Edges: ReGNN, GFN.
ReGNN [56] To capture both global and local information, ReGNN (Recursive Graphical Neural Network) uses PMI together with consecutive-word connections as the word edges. The graph propagation function is the same as GGNN, while additive attention [7] is applied in aggregation. Pretrained GloVe embeddings are the input of each word node.
GFN [20] GFN (Graph Fusion Network) builds four types of graphs using word co-occurrence statistics, PMI, the similarity of pretrained embeddings and the Euclidean distance of pretrained embeddings. Although four corpus-level graphs are built, the graph learning actually happens on the subgraphs of each document, making the method a document-level GNN. For each subgraph, each type of graph is learned separately using the graph convolutional method, and then a fusion method of concatenation is used. After an MLP layer, average pooling is applied to get the document representation.

Other word graphs
Some other ways of connecting words in a document have been explored.
HyperGAT [25] Ding et al. [25] propose HyperGAT (Hypergraph Attention Networks), which builds hypergraphs for each document to capture high-level interactions between words. Two types of hyperedges are included: sequential hyperedges connecting all words in a sentence, and semantic hyperedges connecting the top-K words after obtaining the topic of each word using LDA. Like traditional hypergraph propagation, HyperGAT follows the same two updating steps but with an attention mechanism to highlight the key information: node-level attention is applied to learn hyperedge representations, and edge-level attention is used for updating node representations.
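A minimal sketch of the node-hyperedge incidence matrix for the sequential hyperedges could look like this (our own simplified code; semantic LDA hyperedges would add further columns, and HyperGAT's attention layers are not reproduced here):

```python
import numpy as np

def sequential_incidence(sentences, vocab):
    """Build the node-hyperedge incidence matrix H where each column is
    one sequential hyperedge connecting all words of a sentence."""
    H = np.zeros((len(vocab), len(sentences)))
    index = {w: i for i, w in enumerate(vocab)}
    for j, sent in enumerate(sentences):
        for w in sent:
            H[index[w], j] = 1.0   # word w participates in hyperedge j
    return H

vocab = ["graph", "neural", "text", "model"]
H = sequential_incidence([["graph", "neural"], ["text", "model", "graph"]], vocab)
```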
IGCN [110] Contextual dependency helps in understanding a document, and graph neural networks are no exception. IGCN constructs the graph from the dependency graph to show the connectedness of each pair of words in a document. Then, the word representations learned from a Bi-LSTM using POS embeddings and word embeddings are used to calculate the similarity between each pair of nodes. Attention is applied to the output to find the important relevant semantic features.
GTNT [80] Words with higher TF-IDF values should connect to more word nodes. With this in mind, GTNT (Graph Transformer Networks based Text representation) uses sorted TF-IDF values to determine the degree of each node and applies the Havel-Hakimi algorithm [32] to determine the edges between word nodes. A variant of GAT is applied during model learning. While GAT's attention score is mutual for two nodes, GTNT uses relevant importance to adjust the attention score from one node to another. Pretrained Word2vec is applied as the input of each node.
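The degree-driven construction can be sketched with a plain Havel-Hakimi routine (illustrative only; `havel_hakimi_edges` is our own naming, and GTNT's actual mapping from TF-IDF values to target degrees is not reproduced here):

```python
def havel_hakimi_edges(degrees):
    """Realize a degree sequence as an edge set with the Havel-Hakimi
    procedure: repeatedly connect the highest-degree node to the next
    highest-degree nodes, raising an error if the sequence is not graphical."""
    remaining = dict(enumerate(degrees))
    edges = set()
    while remaining:
        order = sorted(remaining, key=lambda i: (-remaining[i], i))
        v = order[0]
        d = remaining.pop(v)
        if d == 0:
            continue
        targets = order[1:d + 1]
        if len(targets) < d or any(remaining[t] == 0 for t in targets):
            raise ValueError("degree sequence is not graphical")
        for t in targets:
            edges.add((min(v, t), max(v, t)))
            remaining[t] -= 1
    return edges

edges = havel_hakimi_edges([2, 2, 1, 1])
```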

Critical Analysis
Most document-level GNNs connect consecutive words as edges in the graph and apply a graph neural network model, which makes them similar to CNNs, where the receptive field enlarges as the graph model goes deeper. Also, the major differences among document-level GNNs lie in the details of the graph models, e.g. different pooling methods and different attention calculations, which diminishes the distinct contribution of these works. Compared with corpus-level GNNs, document-level GNNs adopt more complex graph models but also suffer from the out-of-memory issue when the number of words in a document is large.
A detailed comparison of document-level GNN is displayed in Table 2.

Datasets
There are many popular text classification benchmark datasets, while this paper mainly focuses on the datasets used by GNN-based text classification applications. Based on the purpose of the applications, we divide the commonly adopted datasets into three types: Topic Classification, Sentiment Analysis and Other. Most of these text classification datasets contain a single target label for each text body. The key information of each dataset is listed in Table 3.
Manuscript submitted to ACM
Table 2. Detailed comparison of models: whether external resources are used, how the edges and node inputs are constructed, and whether learning is transductive or inductive. GloVe and Word2vec are pretrained if not specified. "emb sim" is short for "embedding similarity"; "dep graph" is short for "dependency graph".

(Columns of Table 2: Model; External Resource; Graph (Edge Construction, Node Initialization); Learning.)

Topic Classification.
Topic classification models aim to classify input text bodies from diverse sources into predefined categories. News categorization is a typical topic classification task that obtains key information from news and classifies it into corresponding topics. The input text bodies are normally paragraphs or whole documents, especially for news categorization, while there are also some short text classification datasets from certain domains such as micro-blogs, bibliographies, etc.
Some typical datasets are listed below:
• Ohsumed [40] is acquired from the MEDLINE database and further processed by [133], who select certain documents (abstracts) and filter out documents belonging to multiple categories. The remaining documents are classified into 23 cardiovascular diseases. The statistics of the Ohsumed dataset processed by [133] are presented in Table 3 and are directly employed by other related works.
• R8 / R52 are two subsets of the Reuters-21578 dataset, which contain 8 and 52 news topics from the Reuters financial newswire service, respectively.
• 20NG is another widely used news categorization dataset that contains 20 newsgroups. It was originally collected by [50], but the collection procedures are not explicitly described.
• Database systems and Logic Programming (DBLP) is a topic classification dataset for classifying computer science paper titles into six topics [80]. Different from paragraph- or document-based topic classification datasets, DBLP aims to categorise scientific paper titles into corresponding categories, so the average input length is much lower than the others.
• DBpedia [52] is a large-scale multilingual knowledge base containing 14 non-overlapping categories. Each category contains 40,000 samples for training and 5,000 samples for testing.
• WebKB [19] is a web page topic classification dataset with long text bodies.
• TREC [57] is a question topic classification dataset that categorises a question sentence into one of 6 question categories.

Sentiment Analysis.
The purpose of sentiment analysis is to analyse and mine the opinion of textual content, which can be treated as a binary or multi-class classification problem. Existing sentiment analysis tasks draw on movie reviews, product reviews, user comments, social media posts, etc. Most sentiment analysis datasets aim to predict people's opinions from one or two input sentences, where the average length of each input text body is around 25 tokens.
• Movie Review (MR) [89] is a binary sentiment classification dataset for movie reviews, with positive and negative samples equally distributed. Each review contains only one sentence.
• Stanford Sentiment Treebank (SST) [106] is an upgraded version of MR that contains two subsets, SST-1 and SST-2. SST-1 provides five fine-grained labels, while SST-2 is a binary sentiment classification dataset.
• Internet Movie Database (IMDB) [74] is also an equally distributed binary classification dataset for sentiment analysis. Different from other short text classification datasets, the average number of words per review is around 221.
• Yelp 2014 [109] is a large-scale binary sentiment analysis dataset of longer user reviews collected from Yelp.com.
Certain binary sentiment classification benchmark datasets are also used by GNN-based text classifiers. Most of them are gathered from shorter user reviews or comments (normally one or two sentences) on different websites, including Amazon Alexa Reviews (AAR), Twitter US Airline (TUA), and Youtube comments (SenTube-A and SenTube-T) [113].

Other Datasets.
There are some datasets targeting other tasks, including hate detection, grammaticality checking, etc. For example, ArangoHate [4] is a hate detection dataset, a sub-task of intent detection, which contains 2,920 hateful documents and 4,086 normal documents, resampled from the merged datasets of [21] and [121]. In addition, [26] propose another large-scale hate language detection dataset, FountaHate, which classifies tweets into four categories with 53,851 normal, 14,030 spam, 27,150 abusive, and 4,965 hateful samples. Since there is no officially provided training and testing split ratio for the above datasets, the numbers presented in Table 3 follow the ratio (train/development/test of 85:5:10) defined by [73].

Dataset Summary.
Since an obvious limitation of corpus-level GNN models is their high memory consumption [25,37,137], datasets with a smaller number of documents and smaller vocabulary sizes, such as Ohsumed, R8/R52, 20NG and MR, are widely used to ensure corpus-level graphs can be feasibly built and evaluated. For document-level GNN based models, larger datasets like AG-News can be adopted without the memory consumption problem. From Table 3, we find that most of the related works mainly focus on GNNs applied to topic classification and sentiment analysis, which means the role of GNNs in other text classification tasks, such as spam detection, intent detection and abstractive question answering, needs to be further explored. Another observed trend is that short text classification has gained less attention than long document classification. In this case, GNNs for short text classification may be a promising direction for future exploration.

Performance Metrics.
In terms of evaluating and comparing the performance of proposed models with other baselines, accuracy and F1 are the most commonly used metrics for overall performance analysis, ablation studies and breakdown analysis. We use $TP$, $FP$, $TN$ and $FN$ to represent the number of true positive, false positive, true negative and false negative samples, and $N$ is the total number of samples.
• Accuracy and Error Rate: these are basic evaluation metrics adopted by many GNN-based text classifiers such as [54,67,120,133,137]. Most of the related papers run all baselines and their models 5 or 10 times to report the mean ± standard deviation of accuracy for more convincing results. They can be defined as:
$$\text{Accuracy} = \frac{TP + TN}{N}, \qquad \text{Error Rate} = 1 - \text{Accuracy}.$$
• Precision, Recall and F1: these are metrics for measuring performance, especially on imbalanced datasets. Precision measures the relevancy of the results, while recall measures how many truly relevant results are retrieved. F1 is the harmonic mean of precision and recall. These three measurements can be defined as:
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.$$
A few papers utilise only recall or precision to evaluate performance [80]. However, precision and recall are more commonly used together with F1 or accuracy to evaluate and analyse performance from different perspectives, e.g. [56,64,73,127]. In addition, based on different application scenarios, different F1 averaging methods are adopted to measure the overall F1 score of multi-class classification tasks (with $C$ classes), including:
• Macro-F1 applies the same weight to all categories by taking the arithmetic mean of the per-class F1 scores:
$$\text{Macro-}F1 = \frac{1}{C} \sum_{i=1}^{C} F1_i.$$
• Micro-F1 is calculated from the overall precision and recall across all classes:
$$\text{Micro-}F1 = \frac{2 \cdot P \cdot R}{P + R}, \quad \text{where } P = \frac{\sum_{i=1}^{C} TP_i}{\sum_{i=1}^{C} (TP_i + FP_i)}, \quad R = \frac{\sum_{i=1}^{C} TP_i}{\sum_{i=1}^{C} (TP_i + FN_i)}.$$
• Weighted-F1 is the weighted mean of the per-class F1 scores, where the weight $w_i$ is proportional to the number of occurrences of the $i$-th class:
$$\text{Weighted-}F1 = \sum_{i=1}^{C} w_i \cdot F1_i, \qquad w_i = \frac{N_i}{N}.$$
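These averaging schemes can be checked with a small from-scratch implementation (an illustrative sketch; `f1_report` and the toy labels are our own choices, and libraries such as scikit-learn provide equivalent functionality):

```python
from collections import Counter

def f1_report(y_true, y_pred, classes):
    """Compute macro-, micro- and weighted-F1 for single-label
    multi-class predictions, following the standard definitions."""
    per_class, support = {}, Counter(y_true)
    tp_all = fp_all = fn_all = 0
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        per_class[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        tp_all, fp_all, fn_all = tp_all + tp, fp_all + fp, fn_all + fn
    macro = sum(per_class.values()) / len(classes)          # unweighted mean
    micro_p = tp_all / (tp_all + fp_all) if tp_all + fp_all else 0.0
    micro_r = tp_all / (tp_all + fn_all) if tp_all + fn_all else 0.0
    micro = (2 * micro_p * micro_r / (micro_p + micro_r)
             if micro_p + micro_r else 0.0)
    weighted = sum(support[c] / len(y_true) * per_class[c] for c in classes)
    return macro, micro, weighted

macro, micro, weighted = f1_report([0, 0, 1, 1], [0, 1, 1, 1], classes=[0, 1])
```

Note that for single-label multi-class tasks, micro-F1 coincides with accuracy.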

Other Evaluation Aspects.
Since two limitations of GNN-based models are time and memory consumption, besides the commonly used qualitative performance comparison, reporting and comparing the GPU or CPU memory consumption and the training time efficiency of proposed models is also adopted by many related studies to demonstrate practicality in real-world applications. In addition, based on the novelties of the various models, specific evaluation methods are conducted to demonstrate the proposed contributions.
• Memory Consumption: [25,37,70] list the memory consumption of different models to comprehensively evaluate the proposed models in terms of computational efficiency.
• Parameter Sensitivity is commonly investigated in GNN studies to examine the effect of different hyperparameters, e.g. varying sliding window sizes and embedding dimensions, presenting model sensitivity via line charts, as in [25,64,70].
• Number of Labelled Documents is a widely adopted evaluation method in GNN-based text classification models [25,54,64,80,98,120,133], which analyses the performance trend under different proportions of training data to test whether the proposed model can work well with limited labelled training data.
• Vocabulary Size is similar to the number of labelled documents, but investigates the effect of using different vocabulary sizes during the GNN training stage, as adopted by [120].

Metrics Summary.
For general text classification tasks, accuracy, precision, recall and the various F1 averages are commonly used evaluation metrics for comparison with other baselines. However, for GNN-based models, reporting the model performance alone cannot effectively reflect the multiple aspects of the proposed models. Hence, many papers conduct additional evaluations to analyse GNN-based classifiers from multiple views, including time and memory consumption, model sensitivity and dataset quantity.

PERFORMANCE
While different GNN text classification models may be evaluated on different datasets, there are some datasets that are commonly used across many of these models, including 20NG, R8, R52, Ohsumed and MR. The accuracy of various models assessed on these five datasets is presented in Table 4. Some of the results are reported as the average accuracy over ten runs with the standard deviation, while some only report the average accuracy. Several conclusions can be drawn:
• Models that use external resources usually achieve better performance than those that do not, especially models with BERT and RoBERTa [63,134].
• Under the same setting, such as using GloVe as the external resource, Corpus-level GNN models (e.g. TG- Transformer [137], TensorGCN [68]) typically outperform Document-level GNN models (e.g. TextING [143], TextSSL [95]). This is because Corpus-level GNN models can work in a transductive way and make use of the test input, whereas Document-level GNN models can only use the training data.
• The advantage of Corpus-level GNN models over Document-level GNN models only applies to topic classification datasets and not to sentiment analysis datasets such as MR. This is because sentiment analysis involves analyzing the order of words in a text, which is something that most Corpus-level GNN models cannot do.

CHALLENGES AND FUTURE WORK

Model Performance
With the development of pre-trained models [45,71], prompt learning methods [28,69] achieve great performance on text classification. GNNs applied to text classification without this pre-training style cannot reach such good performance. For both corpus-level and document-level GNN text classification models, researching how to combine GNN models with these pretrained models to improve pretrained model performance is promising future work. Meanwhile, more advanced graph models can be explored, e.g. more heterogeneous graph models on word and document graphs, to improve model performance.

Graph Construction
Most GNN text classification methods use a single, static-value edge to construct graphs based on document statistics.
This approach applies to both corpus-level and document-level GNNs. However, to better explore the complex relationships between words and documents, more dynamic hyperedges could be utilized. Dynamic edges in GNNs can be learned from various sources, such as the graph structure, document semantic information, or other models, and hyperedges can be built for a more expressive representation of the complex relationships between nodes in the graph.

Application
While corpus-level GNN text classification models have demonstrated good performance without using external resources, these models are mostly transductive. To apply them in real-world settings, an inductive learning approach should be explored. Although some inductive corpus-level GNNs have been introduced, the large amount of space required to construct the graph and the inconvenience of incremental training still present barriers to deployment.
Improving the scalability of online training and testing for inductive corpus-level GNNs represents a promising area for future work.

CONCLUSION
This survey article introduces how Graph Neural Networks have been applied to text classification in two different ways: corpus-level GNN and document-level GNN, with a detailed structural figure. The details of these models have been introduced and discussed, along with the datasets commonly used by these methods. Compared with traditional machine learning and sequential deep learning models, graph neural networks can explore the relationship between words and documents in the global structure (corpus-level GNN) or the local document (document-level GNN), yielding good performance. A detailed performance comparison is provided to investigate the influence of external resources, model learning methods, and different types of datasets. Furthermore, we discuss the challenges of GNN text classification models and potential future work.