A Comprehensive Evaluation of Graph Kernels for Unattributed Graphs

Graph kernels are of vital importance in the field of graph comparison and classification. However, how to compare and evaluate graph kernels, and how to choose an optimal kernel for a practical classification problem, remain open problems. In this paper, a comprehensive evaluation framework for graph kernels is proposed for unattributed graph classification. According to the kernel design methods, the whole graph kernel family can be categorized along five different dimensions, and several representative graph kernels are then chosen from these categories to perform the evaluation. Using a variety of real-world and synthetic datasets, the kernels are compared on criteria such as classification accuracy, F1 score, runtime cost, scalability and applicability. Finally, quantitative conclusions are discussed based on analyses of the extensive experimental results. The main contribution of this paper is a comprehensive evaluation framework for graph kernels, which is significant for graph-classification applications and future kernel research.


Introduction
Graphs are important structures for information representation, in which nodes and edges respectively represent the entities and the relationships in the real world. Graph processing has been widely used in many scientific fields such as image processing [1], biochemical research [2], social networks [3] and natural language processing [4]. Meanwhile, graph comparison plays a core role in data mining and target recognition in these fields. For instance, two molecules with the same chemical properties usually have similar topological structures [5]. Thus, one can successfully perform a prediction for an unknown molecule via topology comparison with known ones.
It has been reported that exact graph comparison is equivalent to sub-graph isomorphism detection, which is a well-known NP-hard problem [6]. Inexact substitutions have to be explored, such as approximate graph edit distance [7], topological descriptors [8] and graph kernels.
The construction of graph kernels has been extensively investigated in the past decades. Through a well-defined kernel function, two samples in the graph space could be mapped into a real number, which represents the quantitative similarity between two graphs. Moreover, extracting the graph features explicitly is unnecessary in this procedure.
However, as stated in [9], an exact graph kernel must be able to detect the dissimilarity between any two non-isomorphic graphs. Thus it is at least as hard as graph isomorphism detection, a problem which has neither been proven NP-complete nor been solved by any known polynomial-time algorithm. So far, all the existing graph kernels are actually inexact, making distinct tradeoffs among classification accuracy, applicability and runtime performance.
Formally, a kernel k on a set S is defined via a feature mapping φ from S into a reproducing kernel Hilbert space (RKHS) H:

k(x, x′) = ⟨φ(x), φ(x′)⟩_H (1)

where x, x′ ∈ S and ⟨·, ·⟩_H is the inner product operation in the RKHS. An illustration of the kernel method is shown in Figure 1. The most obvious advantage of a kernel is that a problem which cannot be linearly separated is changed into a linearly separable problem through the kernel mapping. Specifically, we call the method a graph kernel if the input space is the graph space. From Equation (1), it is clear that via a feature extraction method φ, graphs can be mapped into vectors in an RKHS, and the inner product of two vectors can represent the graph similarity.
Usually, the graph similarity can be computed directly, without an explicit definition of the feature extraction method φ. In the real world, it is quite difficult to compute an explicit embedding representation of structured data. Therefore, compared to topological descriptors and other graph comparison methods, a graph kernel can achieve a smart and accurate similarity measurement because of its implicit feature extraction. In the field of machine learning, graph kernels bridge the gap between graph-based data and a large group of kernel-based machine learning algorithms, including support vector machines (SVM), kernel regression, and kernel principal component analysis (kPCA) [9].
According to Mercer's theorem [16], a valid graph kernel must be symmetric and positive semi-definite (p.s.d.):


1.
Symmetric. Obviously, for two graphs G_A and G_B, k(G_A, G_B) = k(G_B, G_A).

2.
p.s.d. For a dataset with n graphs, any finite sequence of graphs g_1, g_2, · · · , g_n and any choice of arbitrary real numbers c_1, c_2, · · · , c_n satisfy ∑_{i=1}^{n} ∑_{j=1}^{n} k(g_i, g_j) c_i c_j ≥ 0, i.e., all the eigenvalues of the kernel matrix K_{n×n} = [k(g_i, g_j)] for the dataset are nonnegative.
Note that although the learning ability of valid graph kernels has been proven, there is no theoretical result demonstrating that invalid graph kernels cannot support learning algorithms. Actually, there exist discussions of how similarities (i.e., non-p.s.d. kernels) can support learning [17,18]. Therefore, some graph kernels are not proven to be p.s.d. in the literature.

Kernel Groups
In the literature, the existing graph kernels are usually designed based on distinct topological analyses of graph structure. According to five different dimensions of kernel design, the whole family of graph kernels can be categorized as follows:

1. Framework: R-convolution kernels vs. Information theoretic kernels

In 1999, Haussler proposed R-convolution kernels [19], which relate the similarity of two graphs to the occurrences of identical or similar substructures in them. Following the formal definition in [20], for two graphs G_1 and G_2, {S_{1;1}, · · · , S_{1;n_1}, · · · , S_{1;N_1}} and {S_{2;1}, · · · , S_{2;n_2}, · · · , S_{2;N_2}} are the sub-graph sets of G_1 and G_2 respectively. A standard R-convolution kernel K for these two graphs is defined as:

K(G_1, G_2) = ∑_{n_1=1}^{N_1} ∑_{n_2=1}^{N_2} δ(S_{1;n_1}, S_{2;n_2}) (2)

where δ denotes a Dirac kernel:

δ(S_{1;n_1}, S_{2;n_2}) = 1 if S_{1;n_1} ≅ S_{2;n_2}, and 0 otherwise (3)

where S_{1;n_1} ≅ S_{2;n_2} indicates that the substructure S_{1;n_1} is isomorphic (or approximately isomorphic) to S_{2;n_2}. So far, most existing kernels [9,10,21-23] belong to this group. Intuitively, two similar graphs should share many common substructures. However, the main drawback of R-convolution kernels is that they neglect the relative locations of substructures, because R-convolution kernels cannot establish reliable structural correspondences among substructures [21]. Meanwhile, the tottering problem and the complexity of graph decomposition challenge the development of R-convolution kernels [6].
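As a minimal sketch of Equation (2), suppose each substructure has already been reduced to a canonical hashable label (an assumption made here; real kernels use isomorphism tests or canonical forms instead). The Dirac kernel then reduces to label equality, and the double sum collapses to a product of label counts:

```python
from collections import Counter

def r_convolution_kernel(subs1, subs2):
    """Eq. (2): count the pairs (S1, S2) with delta(S1, S2) = 1.

    subs1, subs2: lists of canonical substructure labels, so the Dirac
    kernel reduces to exact label equality (an assumption of this sketch).
    """
    c1, c2 = Counter(subs1), Counter(subs2)
    # Summing delta over all substructure pairs equals the product of
    # the multiplicities of each shared label.
    return sum(c1[lbl] * c2[lbl] for lbl in c1.keys() & c2.keys())
```

For example, `r_convolution_kernel(["a", "a", "b"], ["a", "c"])` counts two matching pairs, since the label "a" occurs twice in the first graph and once in the second.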
Recently, Bai et al. utilized information-theoretic methods to compute the divergence between the probability distributions of two graphs as the similarity measurement [20,24-28]. By mapping all the data points of the input space into a fitted distribution in a parametric family S, a kernel can be defined over the distributions. This group of kernels over data points, expressed in terms of distributions, can be automatically induced in the original input space. Therefore, this framework provides an alternative way of defining kernels that maps graphs onto a statistical manifold. In real-world applications, these kernels outperform linear kernels with SVM classifiers. Some of them build a bridge between kernel methods and information theory, and thus have an information-theoretic interpretation; such methods are called information theoretic kernels. However, the high computational complexity of the information entropy is the bottleneck of this group.

Graph Pre-processing: Aligned Kernels vs. Unaligned Kernels
Graph alignment locates the mapping relationship between the node sequences of two graphs. It is a pre-processing procedure applied to the original graphs: common substructures will be pinned to the same positions in the two graphs after alignment. By assigning parts of one object to parts of the other, the most similar parts of the two objects can be found. Finding such a bijection is known as the assignment problem and is well studied in combinatorial optimization. This approach has been successfully applied to graph comparison, e.g., in general graph matching as well as kernel-based classification. In contrast to convolution kernels, assignments establish structural correspondences, thereby also alleviating the problem of diagonal dominance. An accurate similarity computation can then be achieved without false positives. The research on optimal assignment kernels was reported in [29], in which each pair of structures is aligned before comparison. However, the similarities derived in this way are not necessarily positive semi-definite (p.s.d.) and thus do not give rise to valid kernels, which severely limits their use in kernel methods [30]. Kriege et al. discussed the conditions under which an optimal assignment kernel is guaranteed to be positive semi-definite [31].
The performance of an aligned kernel depends on the alignment accuracy. During graph alignment, every node/subgraph in one graph can map to only one node/subgraph in the other graph, which leads to information loss. Therefore, unaligned kernels are more common in the literature.
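To make the assignment idea concrete, here is a brute-force sketch of the optimal assignment score between two objects whose parts have pairwise similarities. It enumerates all bijections, so it is only feasible for tiny part sets; practical optimal assignment kernels use the Hungarian algorithm or hierarchy-based schemes instead. The similarity matrix `sim` is a hypothetical input, not a structure defined in the paper.

```python
from itertools import permutations

def optimal_assignment_score(sim):
    """Best total similarity over all bijections between the parts of
    two objects; sim[i][j] is the similarity of part i of the first
    object to part j of the second (a square matrix is assumed)."""
    n = len(sim)
    return max(sum(sim[i][perm[i]] for i in range(n))
               for perm in permutations(range(n)))
```

As noted above, the pairwise similarity matrix produced this way over a whole dataset is not necessarily p.s.d.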

Matching Pattern: Global kernels vs. Local kernels
A novel choice of matching pattern is usually the core of designing a new graph kernel. The similarity computation of a kernel mainly depends on the exploration of the matching patterns.
Most existing graph kernels belong to the group of local kernels, which focus on local patterns of the data structure, such as walks [22], paths [23], sub-trees [21] and limited-size sub-graphs [10]. For example, the random walk kernel embeds a graph into a feature space in which the feature vector consists of the numbers of walks of different lengths.
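For instance, the number of length-k walks in a graph is the sum of the entries of the k-th power of its adjacency matrix, which is the kind of local feature a random walk kernel aggregates. A small pure-Python sketch:

```python
def count_walks(adj, length):
    """Sum of the entries of adj**length, i.e., the number of walks of
    the given length in the graph (adj: adjacency matrix as nested lists)."""
    n = len(adj)
    power = [row[:] for row in adj]  # adj**1
    for _ in range(length - 1):
        # Plain matrix multiplication: power <- power @ adj.
        power = [[sum(power[i][k] * adj[k][j] for k in range(n))
                  for j in range(n)] for i in range(n)]
    return sum(sum(row) for row in power)
```

For the triangle graph this yields 6 walks of length 1 (its directed arcs) and 12 walks of length 2.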
A few studies have instead explored global characteristics of graphs, such as the Lovász number [32] and the whole probability distribution [11]; kernels based on these methods are called global kernels. In this kernel group, all the members directly obtain the similarity between graphs from the whole structural information, so graph decomposition is not necessary. In the most recent research, some hybrid patterns have been utilized to design kernels [33]. Because graph matching is a special case of sub-graph matching, the kernels based on graph matching are assigned to the local-pattern group in this paper.

Computing Models: Quantum kernels vs. Classical kernels
Quantum computation differs from its classical counterpart. It can store, process and transport information using the peculiar properties of quantum mechanics, such as superposition and entanglement. These properties result in an exponential increase in the dimensionality of the state space, which is the basis of the quantum speedup.
As the quantum counterpart of the random walk, the quantum walk [34] has become a novel computing model for analyzing graph structures. Owing to quantum parallelism and quantum interference, the quantum amplitudes of quantum walks on a graph can represent more complex structural information. Kernels based on the discrete-time edge-based quantum walk (DTQW) [12,35] or the continuous-time node-based quantum walk (CTQW) [11,21] are called quantum kernels, and the others are called classical kernels. In fact, the quantum walk is the only method used to design quantum kernels in the literature.
Here we use the DTQW as an instance. The discrete-time quantum walk is the quantum counterpart of the classical random walk. We denote the state space of the DTQW as E, the set of directed arcs of a graph (each undirected edge {u, v} yields the two states |uv⟩ and |vu⟩). A general state of the quantum walk is:

|ψ⟩ = ∑_{(u,v)∈E} α_{uv} |uv⟩ (4)

where the quantum amplitudes α_{uv} are complex. Using the Grover diffusion matrix, the entries of the transition matrix U are:

U_{(u,v),(w,x)} = δ_{vw} (2/d_v − δ_{ux}) (5)

where U_{(u,v),(w,x)} is the quantum amplitude for the transition |uv⟩ → |wx⟩, d_v denotes the degree of vertex v and δ_{ux} is the Kronecker delta, i.e., δ_{ux} = 1 if u = x and 0 otherwise. Different from the random walk, where probability propagates, what propagates during a quantum walk is the quantum amplitude. Therefore, quantum interference occurs between two crossing walks. In this paper, we only consider quantum computing as a novel computation model and evaluate the quantum kernels by simulating the quantum walk on a classical computer.
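A small sketch of this construction, indexing U by (source arc, target arc) with the δ_vw(2/d_v − δ_ux) entries described above; the resulting matrix is real orthogonal, i.e., unitary:

```python
def grover_dtqw_matrix(edges):
    """Transition matrix of the discrete-time quantum walk whose states
    are the directed arcs of an undirected graph (Grover diffusion)."""
    # Each undirected edge {u, v} contributes the two arcs (u, v), (v, u).
    arcs = sorted({(u, v) for u, v in edges} | {(v, u) for u, v in edges})
    degree = {}
    for u, v in arcs:  # arc tails enumerate each vertex once per neighbour
        degree[u] = degree.get(u, 0) + 1
    n = len(arcs)
    # U[i][j]: amplitude for the transition arc i -> arc j, nonzero only
    # when the walk continues through the shared vertex (v == w).
    U = [[0.0] * n for _ in range(n)]
    for i, (u, v) in enumerate(arcs):
        for j, (w, x) in enumerate(arcs):
            if v == w:
                U[i][j] = 2.0 / degree[v] - (1.0 if u == x else 0.0)
    return arcs, U
```

On the triangle graph, for example, every vertex has degree 2, so the amplitude for backtracking (x = u) is 0 and the walk deterministically continues around the cycle.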

Methodologies: Topological kernels vs. Entropy kernels
Graph and sub-graph matching is the key procedure for all the kernels. For most of the existing kernels, the match mapping is computed via topological analysis, and graph isomorphism or sub-graph isomorphism is the main method in this kernel group. However, pairwise matching costs a great deal of time in kernel computation. Therefore, adding simple pattern constraints (e.g., edges, fixed-length walks, triangles) or constructing the product graph are common ways to locate the matching. The product graph, constructed from two graphs, is usually an auxiliary structure for locating common sub-graphs [22]. However, the product graph will be large if the two graphs are big, which may still lead to unacceptable complexity.
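A sketch of the direct (tensor) product construction: its nodes are pairs, two pairs are adjacent exactly when both coordinates are adjacent in their own graph, and common walks of the two graphs correspond to walks in the product. This also shows why the product can blow up, since it has n1·n2 nodes:

```python
def direct_product_adjacency(adj1, adj2):
    """Adjacency matrix of the direct product of two graphs: node
    (i1, i2) is adjacent to (j1, j2) iff i1~j1 in G1 and i2~j2 in G2."""
    n1, n2 = len(adj1), len(adj2)
    size = n1 * n2
    prod = [[0] * size for _ in range(size)]
    for i1 in range(n1):
        for j1 in range(n1):
            if adj1[i1][j1]:
                for i2 in range(n2):
                    for j2 in range(n2):
                        if adj2[i2][j2]:
                            # Flatten the node pair (i, j) to i * n2 + j.
                            prod[i1 * n2 + i2][j1 * n2 + j2] = 1
    return prod
```

For the triangle and a single edge, the product is a 6-cycle: 6 nodes, each of degree 2.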
In information theory, mutual information quantifies the divergence between two probability distributions. Utilizing the information entropy of the probability distributions of substructures is a novel trend for searching for similar substructures; these methods are therefore called entropy kernels in this paper.
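As an example of the information-theoretic machinery these kernels rely on, the Jensen-Shannon divergence between two discrete distributions — the standard quantity behind, e.g., the QJS-style kernels — is the entropy of the mixture minus the mean of the individual entropies (a textbook definition, sketched here):

```python
from math import log2

def shannon_entropy(dist):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * log2(p) for p in dist if p > 0)

def jensen_shannon_divergence(p, q):
    """JS divergence: H((p+q)/2) - (H(p) + H(q)) / 2. It lies in [0, 1]
    bits and is 0 iff the two distributions coincide."""
    mix = [(a + b) / 2 for a, b in zip(p, q)]
    return shannon_entropy(mix) - (shannon_entropy(p) + shannon_entropy(q)) / 2
```

A kernel can then be built by mapping the divergence to a similarity, e.g., exp(−JS(p, q)) or 1 − JS(p, q), over substructure distributions.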
Here 15 popular graph kernels are chosen and shown in Table 1. Every kernel can be grouped according to the above five dimensions. From the groups of these kernels, we find that:

•
Most of these kernels are R-convolution, unaligned and local-pattern kernels.
•
All the entropy kernels here utilize the quantum walk model to compute the probability distribution.
•
All of the information theory kernels here belong to the group of entropy kernels. Meanwhile, some R-convolution kernels, which detect similar substructures via the computation of information entropy, also belong to this group.

Complexity Analysis
The computation of a graph kernel is costly and complex: as the graph size and the dataset size increase, the computational cost grows. In this section, we make a qualitative comparison of the runtime costs of the 15 chosen graph kernels through an analysis of time complexity. Table 2 shows the time complexities of all 15 mentioned kernels. Suppose that the base dataset has K graphs and each graph has N unattributed nodes and E undirected, unweighted edges. All the complexities are denoted in terms of these three parameters.

Quantitative Evaluation
In this section, plenty of graph datasets are utilized to perform a quantitative evaluation on graph kernels in terms of many criteria such as classification accuracy, runtime cost, scalability and applicability. From the aforementioned kernel methods, several graph kernels are evaluated using the following tests.

Datasets
Both real-world datasets and synthetic datasets are used in this paper to evaluate these graph kernels.
The real-world datasets. According to the main scope of graph classification, 31 graph datasets from the real world are chosen, including 20 chemical datasets, five image datasets, two social network datasets, three hand-writing datasets and one fingerprint dataset. Among the 31 datasets, some are multi-class, such as COIL-DEL; the others are binary-class datasets (given in Table A1 in Appendix A). For each object in these datasets, its topological structure is extracted as an unattributed graph, and we try to find the relationship between the natural characteristics and the graph structure. All the datasets can be downloaded from the benchmark website [39].
The synthetic datasets. In order to further evaluate the scalability and the applicability of these kernels, some synthetic datasets are chosen or generated, including random graphs, cospectral graphs, regular graphs and strongly regular graphs. Table 3 lists the random graph datasets used for the scalability evaluation. According to different node and edge levels, 100 sample graphs of the same size are randomly generated for each class in these datasets. To test the set-based scalability, a dataset consisting of different numbers of random graphs of the same graph size is also generated. Table 4 collects three kinds of similar graphs. CosGraph includes 5048 pairs of 10-node graphs; each pair of graphs has the same graph spectrum and is called a cospectral graph pair. RegGraph and SRGraph consist of 31 classes of regular graphs and 11 classes of strongly regular graphs respectively. Within one class, each graph is regular or strongly regular but not isomorphic to the others. All the synthetic datasets can be downloaded from our github website [40].
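The paper does not state its random graph generator; a plausible sketch that fixes both the node and edge counts, as the scalability datasets require, is the Erdős-Rényi G(n, m) model (an assumption made here):

```python
import random
from itertools import combinations

def random_graph(n_nodes, n_edges, seed=None):
    """Sample a simple undirected graph with exactly n_nodes nodes and
    n_edges edges, uniformly over edge sets (the G(n, m) model; an
    assumption, since the paper does not name its generator)."""
    rng = random.Random(seed)
    all_pairs = list(combinations(range(n_nodes), 2))
    if n_edges > len(all_pairs):
        raise ValueError("too many edges for a simple graph")
    # Sample without replacement, so no duplicate or self-loop edges.
    return rng.sample(all_pairs, n_edges)
```

For example, `random_graph(50, 150)` matches the 50-Node&150-Edge configuration used below.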

Evaluation Criteria
In this paper, our evaluation of graph kernels will mainly focus on several criteria as follows:

1.
Accuracy. Accuracy is the most important criterion for comparing the graph kernels in classification. In this paper, C-SVM [41] is utilized for the 10-fold cross-validation test. In particular, the 10-fold division is the same for all the kernels in every single comparison, and 100 random tests are repeated. We use the average proportion of correctly labelled test samples as the accuracy result. Meanwhile, the F1 score (macro F1) is used to measure the classification ability for multi-class problems.

2.
Runtime. This criterion mainly focuses on the computational cost of a graph kernel for a graph dataset. Because the training procedure is a post-processing step, we only consider the runtime cost of computing the kernel matrix.

3.
Scalability. To evaluate the runtime cost more clearly, scalability is further used to unveil the behavior of the computational time as the graph size or the number of graphs in the dataset increases.

4.
Applicability. Theoretically, a complete graph kernel fits the general graph family, i.e., if graph G_A is not isomorphic to graph G_B, then k(·, G_A) ≠ k(·, G_B). However, inexact graph kernels may fail to distinguish some similar graphs, especially cospectral graphs and regular graphs. We utilize the failure rate as the applicability measurement for the graph kernels.
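The macro F1 used in criterion 1 — described above as the harmonic mean of the class-averaged precision and class-averaged recall — can be sketched as follows (this is one of several common macro-F1 variants; the form here is chosen to match the paper's description):

```python
def macro_f1(y_true, y_pred):
    """Harmonic mean of macro-averaged precision and macro-averaged
    recall over all classes (matches the description in the text)."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    avg_p = sum(precisions) / len(precisions)
    avg_r = sum(recalls) / len(recalls)
    return 2 * avg_p * avg_r / (avg_p + avg_r) if avg_p + avg_r else 0.0
```

A perfect classifier scores 1.0 and a classifier that is always wrong scores 0.0, regardless of class imbalance.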
All the experiments were tested in Matlab R2010b on an Intel Xeon Core E5-1620 CPU with 8 GB memory. All the runtime consumption tests are executed with a single thread.

Results
All the experimental results are shown and analyzed in this subsection.

Accuracy Results
For all the tests on the real-world datasets, a single label is predicted for the graph under test to indicate which class the graph belongs to. A graph is considered correctly classified when its predicted label equals its true label. The average classification accuracy results are shown in Figure 2; the real-world datasets used in the tests are given in Appendix A and the accuracy data are given in Table A2 in Appendix B.
Considering the multi-class cases, macro F1 is used as another criterion to evaluate the accuracy performance; it is the harmonic mean of the average precision and the average recall over all the classes in the datasets. Figure 3 illustrates the macro F1 results. The datasets are the multi-class cases in Appendix A, and the macro F1 results (in percentage) are given in Table A3 in Appendix B. The detailed information about this test is shown in Table A2.
The results show that most of the kernels achieve only a little over 50% accuracy and a little over 0.2 F1 score. The reasons are threefold: (1) there are many multi-class datasets, such as COIL (100 classes) and MSRC (20 classes), which are very difficult to recognize correctly; (2) unattributed graphs contain limited information about real-world objects, because they neglect the attributes of nodes and edges; (3) compared to the novel multi-kernel method [42], it is more difficult for a single-kernel method to capture the many complex features of objects.
In addition, we apply statistical tests to the accuracy results. Since some of the accuracy results cannot satisfy the normality assumption, a non-parametric test, the Kruskal-Wallis test, is used. The resulting significance is 0.631 (at the 95% confidence level), which means that there is no significant difference between the accuracy results of the different graph kernels. This result is expected, since there is no best kernel for all datasets; precisely for this reason, researchers have to design specific graph kernels for a specific given problem. Therefore, a comprehensive kernel evaluation framework is quite useful for kernel comparison.
Although not significant, there is still a slight advantage for the kernels QJSU, WLK and GHK (which may be due to chance). It can be inferred that some distinguishing ability could be achieved through quantum interference or sub-tree matching. The advantages may be more obvious when particular classification cases are considered. Therefore, for specific cases, the evaluation process should be reproduced, and it is suggested that statistical tests be performed to fully analyze the performance of the kernels.

Runtime Results
Table 5 shows the runtime costs of the 10 chosen kernels for the real-world datasets. For each kernel, the runtimes on all 31 real-world datasets are computed, and the maximal, minimal and average runtimes are listed. Compared with the analyses in Table 2, the runtime evaluation is approximately consistent with the theoretical time-complexity comparison.
Generally, SPK is the fastest method because of the fast computation of the shortest path. Meanwhile, WLK and AGK show outstanding runtime performance as well. However, the quantum kernels cost more than the classical ones because simulating a quantum walk on a classical computer consumes substantial runtime. Especially in the case of the DTQW, the runtime cost is nearly the square of that of RWK, which may be owing to the computation of the evolution of the quantum state.
The Kruskal-Wallis test is also applied to the runtime results. At the 95% confidence level, the significance is 0, which means that there are significant differences between the runtime results of the different graph kernels.

Scalability Results
To further explore scalability, we generate some random graph sets to test the runtime trend as the number of nodes, the number of edges and the set size increase.

•
Node-based Scalability
The dataset 200-Edge is designed to test the node-based scalability of graph kernels. All the graphs in the dataset have 200 edges; therefore, the graph density decreases as the graph size increases. Figure 4 shows the runtime comparison of the kernels in the node-based scalability test. Note that the size range of most of the concerned graphs is not large, so the x-axis does not span orders of magnitude. In Figure 4a, QJSK, DQMK and AGK have the best scalabilities because their runtime costs remain nearly the same as the graph size increases. These kernels mainly focus on local patterns related to graph edges; QJSK utilizes the DTQW, a quantum walk over edges. Therefore, when the number of edges is fixed, these kernels are nearly unaffected. However, the other kernels in Figure 4b show bad scalabilities, especially QJSU, which has the sharpest ascending trend as the graph size increases; the main reason is that the CTQW is a costly procedure.

•

Edge-based Scalability
The dataset 50-Node is generated to test the edge-based scalability of graph kernels. All the graphs have 50 nodes; unlike the dataset 200-Edge, the graph density increases as the number of edges increases. Figure 5 shows the runtime comparison of the kernels in the edge-based scalability test. Most kernels show good scalability when the graph density increases (see Figure 5a), except AGK, QJSK and DQMK in Figure 5b. RWK locates the walk patterns of the graphs and is therefore sensitive to the graph density. QJSK and DQMK utilize the edge-based discrete-time quantum walk to compute the similarity, which is a high-complexity operation when the graph is dense. Therefore, the runtime costs of these three kernels increase significantly as the number of edges increases (see Figure 5b). This observation is nearly opposite to the result of the node-based scalability experiment above.

•

Set-based Scalability
The dataset 50-Node&150-Edge is designed to test the set-based scalability of graph kernels. All the graph samples have the same numbers of nodes and edges. Figure 6 shows the runtime comparison of the kernels in the set-based scalability test. By the formal definition of a kernel, the kernel matrix used in kernel-based classification is pairwise, so its size grows with the number of graphs in the dataset. Therefore, all the kernels cost more runtime as the graph set grows. Compared with the other methods, the increasing trends of the runtime costs of QJSK, DQMK and RWK are more significant. For a large dataset, these three kernels cannot finish within an acceptable time.

•

Normalized Evaluation
For every kernel, the runtime cost is related to several factors, as listed in Table 2. The factor set is assumed to be $\{x, y, \ldots, z\}$. Take factor $x$ as an example. In order to make a normalized, standard evaluation of the $x$-based scalability of a kernel method, we fix all the other factors in the dataset and use Equation (6) to compute the normalized scalability, where $T_{x,y,\ldots,z}$ denotes the runtime cost on a dataset with the factors $\{x, y, \ldots, z\}$. Strictly, the $x$-based scalability should be evaluated by the derivative of the sub-function, $\partial T_x / \partial x$: the higher this value is, the worse the $x$-based scalability. However, for an arbitrary kernel, the sub-cost $T_x$ is difficult to measure. As an approximation, we assume that the factors are independent of each other and that the $x$-based asymptotic complexity is about $O(x^k)$. It can easily be shown that Equation (6) then approximates $\partial T_x / \partial x$. Note that even if the scalability equals 1, it does not mean that the runtime cost changes linearly with the factor; this function should be regarded as an approximate way to measure scalability quantitatively.
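Equation (6) itself is not reproduced in this text, but under the stated assumption that the $x$-based cost behaves like $T_x \approx c \cdot x^k$ with all other factors fixed, the exponent $k$ can be estimated from two runtime measurements. The following sketch is one plausible estimator consistent with that description (the function name and interface are our own):

```python
import math

def scalability_exponent(x1, t1, x2, t2):
    """Estimate k in T(x) ~ c * x**k from two (factor value, runtime)
    measurements taken with all other dataset factors held fixed."""
    return math.log(t2 / t1) / math.log(x2 / x1)

# If doubling the node count quadruples the runtime, the estimated
# exponent is ~2, i.e. roughly quadratic node-based cost.
print(scalability_exponent(100, 1.0, 200, 4.0))  # 2.0
```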
The node-based, edge-based and set-based normalized scalabilities of the 10 graph kernels are given in Table 6.

Table 6. The three kinds of normalized scalabilities of the 10 mentioned kernels. The results in red italic font denote the bad scalabilities observed in Figures 4-6, while the results in blue bold font are outstanding scalabilities.


Applicability Results
Some similar but non-isomorphic graphs are usually difficult to distinguish via inexact graph comparison methods. Therefore, a given graph kernel may not be applicable to certain kinds of graphs. In this subsection, the ability to distinguish similar graphs is used to compare the applicability of the graph kernels. Here, similar graphs are graph pairs or graph groups with similar structures. Table 7 shows the failure rates of the graph kernels in distinguishing the similar graph pairs collected in Table 4, including the cospectral graphs, regular graphs and strongly regular graphs.
The Kruskal-Wallis test is conducted at the 95% confidence level, and the resulting significance is 0.014; therefore, the choice of graph kernel has a significant influence on applicability. RWK is the worst kernel: it cannot distinguish any of these similar graphs. WLK can locate the difference between the cospectral graphs but fails for the regular graphs. On the contrary, DQMK, LTK, QJSU and AGK achieve the best distinguishing abilities, even for the strongly regular graphs.
Generally, the quantum kernels show better applicability, because slight topological differences are amplified by quantum interference, which yields better distinguishing ability. Therefore, when the sample graphs are similar and difficult to classify, quantum kernels are the better choice.
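The difficulty with similar graphs can be made concrete with a classic cospectral pair: the 4-cycle plus an isolated vertex and the star $K_{1,4}$ have identical adjacency spectra even though they are obviously non-isomorphic, so any comparison method that depends only on the spectrum cannot tell them apart. A small check (a standard textbook example, not one of the paper's datasets):

```python
import numpy as np

def adjacency(n, edges):
    """Dense adjacency matrix of an undirected graph on n vertices."""
    A = np.zeros((n, n))
    for u, v in edges:
        A[u, v] = A[v, u] = 1.0
    return A

# C4 plus an isolated vertex (vertex 4 is isolated)
A1 = adjacency(5, [(0, 1), (1, 2), (2, 3), (3, 0)])
# Star K_{1,4} (vertex 0 is the centre)
A2 = adjacency(5, [(0, 1), (0, 2), (0, 3), (0, 4)])

ev1 = np.sort(np.linalg.eigvalsh(A1))
ev2 = np.sort(np.linalg.eigvalsh(A2))
print(np.allclose(ev1, ev2))                             # True: same spectrum
print(sorted(A1.sum(axis=0)) == sorted(A2.sum(axis=0)))  # False: degree sequences differ
```

Both spectra are {-2, 0, 0, 0, 2}, yet the degree sequences already certify non-isomorphism, which is why kernels with finer structural resolution are needed for such pairs.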

Discussion
According to the evaluations in Section 4, seven criteria are considered for each kernel: classification accuracy, F1 score, runtime cost, node-based scalability, edge-based scalability, set-based scalability and distinguishing ability. The normalized ability value (using the ability $X$ of kernel $K$ as an example) is defined as
$$\hat{X}_K = \frac{X_K - X_{worst}}{X_{best} - X_{worst}},$$
where $X_{best}$ and $X_{worst}$ are the ability values of the optimal kernel and the worst kernel among all 10 mentioned kernels, respectively. According to Section 2.3, all the graph kernels can be grouped along five different dimensions. Here we focus on the comparison of the graph kernel groups to explore the advantages and disadvantages of each group. For each kernel group, we compare the average abilities of all the kernels in the group. Figure 7 shows the comparison results of the five graph kernel groups. In the radar charts in Figure 7, the bigger the ability scope is, the better the graph kernel group is; and for each criterion, the bigger the value is, the stronger the ability is.
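The normalized ability value described above reads like standard min-max normalization: 1 for the best kernel on a criterion, 0 for the worst. A minimal sketch under that assumption (the function name is ours):

```python
def normalize_ability(x_k, x_best, x_worst):
    """Map an ability value into [0, 1]: 1 for the best kernel on this
    criterion, 0 for the worst (assumed min-max normalization)."""
    return (x_k - x_worst) / (x_best - x_worst)

# Accuracy 0.85 on a criterion where the kernels range from 0.60 to 0.90
print(normalize_ability(0.85, x_best=0.90, x_worst=0.60))  # ~0.833
```

Note that for criteria where smaller is better (e.g. runtime cost), the "best" kernel is the one with the smallest raw value, so the same formula still yields 1 for the best performer.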
Through the statistical analyses, we find that:
• R-convolution kernels perform better on the scalabilities and runtime cost, while the information theory kernels show better accuracy and applicability. The information theory kernels utilize the global probability distribution diffusion of two graphs to measure graph similarity; therefore, compared with the local pattern matching of the R-convolution kernels, they achieve better accuracy and applicability.
• Aligned kernels have stronger applicability and node-based scalability but are weaker than unaligned kernels on the other criteria. Through graph alignment, the vertex mapping is found before the kernel computation, and the slight differences between similar graph pairs can also be located in the alignment procedure. Therefore, after graph alignment, the kernel methods can utilize the vertex mapping directly, which leads to good node-based scalability.
• Global kernels, quantum kernels and entropy kernels are worse than their counterpart kernel groups in all the other criteria except the distinguishing ability (applicability). This reveals that if good applicability is needed, more complex computations are required in the kernel method, as in these kinds of graph kernels.
Above all, the R-convolution, unaligned, local-pattern, classical and topological kernels have a better ability scope and show more advantages. However, these kinds of kernels lack powerful applicability. The reasons are twofold. Firstly, a complete graph kernel can distinguish all non-isomorphic graph pairs, which means it has the best possible applicability; however, computing a complete kernel is NP-hard, so achieving powerful applicability carries a great computational cost. Secondly, distinguishing slight differences can lead to bad generalization performance and thus to lower accuracy.


Conclusions
In this paper, a comprehensive evaluation of graph kernels for unattributed graphs is introduced. According to five different dimensions of their design details, all the existing graph kernels can be catalogued, and 10 representative graph kernels are chosen and thoroughly compared using plenty of real-world and synthetic datasets. For each kernel, we focus on seven criteria to evaluate its performance, namely, classification accuracy, F1 score, runtime cost, node-based scalability, edge-based scalability, set-based scalability and applicability. Through the kernel group comparison, it is found that the R-convolution, unaligned, local-pattern, classical and topological kernels achieve better performance on all the criteria except for applicability.
The ten chosen graph kernels may not be enough to represent all existing graph kernel methods. Therefore, some of the conclusions in this paper should be seen as guidelines, useful for choosing an optimal kernel for graph classification or for designing a novel kernel. As future work, more kernels will be included and graph kernels for attributed graphs will be considered.

Acknowledgments: The authors appreciate the kind comments and professional criticisms of the anonymous reviewers. They have greatly enhanced the overall quality of the manuscript and opened numerous perspectives geared toward improving the work.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A The Real-World Datasets
All the real-world datasets are listed in Table A1.