1 Introduction

Fig. 1

Overview of the operations in our approach. The original Kandinsky Pattern (KP) data set can be used on its own to train a GNN. The same GNN (no retraining, frozen weights) is used to make predictions for the KP concept validation benchmark. Several state-of-the-art explainers (analyzed in Sect. 3.2) can be used to interpret the decision-making process of the GNN and compute relevance values for the important components of the input graphs (comprising nodes, edges, walks and sub-graphs). The graphs are ranked by their total relevance, divided into two classes according to a threshold, and form the input to ILP. ILP then produces a set of logic rules that serve as comprehensible explanations. Independently thereof, human-defined relevance representing domain knowledge expectations passes through a similar set of operations, producing another set of rules. Comparing the two rule sets with respect to their generalization capabilities is the ultimate goal of this research work and can be performed by either a human or a computer agent

Machine learning models like neural networks have become popular, e.g., for the classification of large amounts of data. However, maintaining transparency for human decision-makers remains an open issue due to the black-box character of such models [1]. Especially in the area of graph classification, where data has a complex structure, it is of great importance to gain insights into the reasons behind a classification outcome. Graphs consist of nodes and edges and represent, for example, manifold connections and relationships between entities like people, molecules, or locations [2]. The goal of graph classification is to find structures in a graph that are representative of a class (e.g., the molecular composition of a substance to distinguish carcinogenic from harmless chemicals [3]). This can be achieved, for instance, by generalizing over the similarity of attribute values or sub-graphs, over the size and density of a graph, or by considering distances between nodes in a network [4, 5].

Graph classification research has already introduced many suitable mathematical frameworks for finding similarity and dissimilarity in graphs to distinguish classes as efficiently and reliably as possible, e.g., kernel-based and graph embedding methods [6,7,8], as well as logic-based approaches [8]. The high performance of Graph Neural Networks (GNNs), for example in applications of biology and medicine, makes them a promising approach to automated classification and knowledge discovery in large and complex data sets [3, 6, 9,10,11]. However, although using GNNs mostly eliminates the need for feature selection and the pre-definition of discriminating mathematical functions to fit different data distributions, it is still difficult or even impossible for a human to understand why a GNN has reached a classification decision and whether its generalization relies on features that are relevant from an application domain's and user's perspective.

Approaches are needed that can be used to validate whether a GNN has learned concepts that can be described by humans. This is crucial for gaining an understanding of whether a GNN fulfills its purpose.

Methods for explaining GNNs are currently mainly based on identifying the most relevant input features w.r.t. the predictive outcome [12,13,14], e.g., by using generative techniques [15], decomposition by means of backpropagation [16], perturbations of input [17], approximation of decision boundaries with surrogate models [18, 19] or gradient-based methods [20]. Current xAI methods for GNNs are capable of extracting and highlighting the relevance for nodes, edges, features, paths, or even sub-graphs, depending on the particular approach. This is an important property because, as mentioned earlier, graphs are usually composed of complex, highly relational structures. The existing GNN explainers can thus help in detecting bias and overfitting in models [21] and their application can lead to actionable insights [22].

However, although the existing methods for explaining GNNs are based on techniques that have, in part, been found useful in past experiments and studies with humans [23, 24], there is still a lack of frameworks for evaluating the produced explanations for GNNs and relational data. Specifically, there is a need for methods to evaluate the output of a GNN w.r.t. its conceptual validity. A concept can be represented by a human-understandable expression, in contrast to raw features [25]. It can describe some property or relation, e.g., spatial relationships between cells extracted from medical images. A concept may further be characterized by sub-concepts. Analogously, a graph may be decomposed into sub-graphs, where some sub-graphs may represent relevant sub-concepts. This means that the human-understandable expression can be encoded as a sub-graph with its nodes' and edges' attributes.

There already exists work that evaluates GNNs w.r.t. learning concepts [26,27,28,29,30,31]. However, these works do not evaluate existing relevance-based GNN explainers regarding the fit between the relevance they compute and the features that represent important sub-concepts from an application domain's and user's perspective. In particular, we identified three properties that limit the explanatory capabilities of current methods and that should be addressed by new evaluation frameworks.

First, the explanatory power of relevance is limited if it remains unknown whether a GNN has found sub-concepts to be relevant that are important from the perspective of the application domain or the user [32]. Therefore, a human should be involved in the evaluation of explanatory power. Second, the meaning of individual relevance values can be multifaceted and may depend on the context or classification problem [33]. Therefore, conceptual knowledge can help to assign explicit meaning to relevance. Last but not least, current explanatory approaches do not produce natural language statements or language-like outputs [34]. Visualizations may not be expressive enough, especially for complex, relational data [35,36,37]. Language accordingly facilitates conceptual validation complementary to visualizations and should therefore be incorporated into GNN evaluation frameworks.

In recent years, symbolic approaches such as Inductive Logic Programming (ILP) have regained popularity due to their relational concept learning capabilities as well as the expressiveness and comprehensibility of their nearly natural-language output [38,39,40,41]. Further, symbolic AI promises to open a path to understandable descriptions for explaining, validating, and improving the decisions of neural networks [35, 37, 42, 43]. ILP in particular supports learning from small amounts of data and has shown promising results in generalizing from the output of neural network explainers, e.g., for explaining and validating the classification of relational concepts in images [35, 37]. Furthermore, ILP is suitable for putting the human in the loop to enhance robustness and trustworthiness [24, 34, 35, 44, 45].

Further related work has introduced, for example, neuro-symbolic approaches that generate textual output in the form of domain-specific languages [46, 47] or logic programs [48, 49], where describing simple concepts detected in the input examples is the foremost goal [50,51,52]. Lifted Neural Networks [53] encode deep relational learning to provide insights into complex relations that go beyond current GNN capabilities. Furthermore, there exist approaches that incorporate logic-based symbolic domain knowledge into GNNs with the goal of improving their performance [42].

In contrast to existing work, as a first attempt, we provide a framework that applies ILP to GNN explainer output and to human-expected explanations in order to validate whether the respective GNN assigned relevance to important relational sub-concepts. By applying ILP to the explainer output, we take a step toward providing verbal explanations for relevant graph sub-structures in the form of logic rules, thereby helping humans, such as developers and domain experts, validate the GNN's decisions against human knowledge. In particular, we want to transparently and comprehensibly evaluate the GNN's ability to use sub-concepts that humans also find relevant. As a first step, we test to what extent ILP can learn models that faithfully find sub-concepts according to the distribution of relevance in the human-expected output of GNN explainers.

With the help of a simulated benchmark data set, we represent a human domain expert’s expectations of a GNN explainer output and provide a pipeline that transforms this output into symbolic Prolog syntax (an introduction to the syntax can be found in [8]). We then apply ILP as an interpretable machine learning approach to the symbolic input. In ILP systems, learning is usually performed by contrasting examples that belong to a target class with examples that do not belong to the target class (concept learning). In our approach, the task of the ILP component is to find commonalities in more relevant sub-graphs of a graph, while discriminating them from less relevant sub-graphs. The goal is to derive comprehensible rules that characterize the sub-concepts that may be present in an explainer’s output, that is, a set of reasons for a GNN’s decision (see Fig. 1).

Our approach is based on the assumption that GNN explainers should assign high accumulated relevance scores to sub-concepts that are known to be important for the classification task. Based on this assumption, we hypothesize that important sub-concepts will be included in ILP-generated explanations for a well-performing GNN, given that the relevance of sub-concepts learned by a GNN is taken into account in the ILP learning process. In our proposed method, the examples are graphs for which we simulate the relevance computation of a GNN explainer and which are translated into a symbolic Prolog representation. The examples are divided into two sets (more and less relevant sub-graphs) based on an adaptable relevance threshold. Then, ILP learns a set of rules optimized to cover the more relevant sub-graphs and none of the less relevant sub-graphs. The rules may then serve as language-like explanations for a GNN's decision. This way, we combine relevance scores from explainers with symbolic representations in order to facilitate conceptual validation of GNNs.

The paper is organized as follows. First, we introduce our concept learning data set in Sect. 2, in particular the use case of symmetry in Kandinsky Patterns [54] used to benchmark our ILP-based proof-of-concept implementation. We state which outcomes the benchmark should produce accordingly. Second, we introduce the foundations of GNNs and state-of-the-art explainers for GNNs in Sect. 3; some of them inspired the design of our benchmark data set. In Sect. 4, we describe our proposed ILP-based approach that learns symbolic representations from the sub-concepts present in the most and least relevant human-expected explainer outputs. Our method is introduced as a first attempt to give insight into a GNN’s decision w.r.t. learned sub-concepts. Then, we present the experimental setting of our evaluation and the results accordingly (see Sect. 5). Finally, we conclude our work and point out future prospects in Sect. 6.

Fig. 2

Three representative images from the vertical symmetry Kandinsky Pattern. The first one, on the left, shows an image obeying the vertical symmetry concept. The second one in the middle shows the “counterfactual” case, where the elements are slightly perturbed from vertical symmetry. The third one on the right contains a representative image from the “false” class, where the elements are non-symmetrical and are positioned far away from the vertical symmetry positions

2 A Kandinsky Pattern Benchmark for Conceptual Validation

Kandinsky Pattern data sets are artificially created to test the reasoning and generalization capabilities of state-of-the-art neural networks [54]. We chose a Kandinsky Pattern data set as a benchmark since the structure of some patterns is inherently relational; the patterns were originally introduced in the medical domain, specifically for the classification and understanding of histopathological images [55]. Several patterns explore positional, arithmetic, and grouping concepts of an inherently relational nature, especially as complexity increases when they are combined. We therefore consider it suitable for developing an evaluation framework for GNNs that may be put into practice for recognizing characteristics in relational input, e.g., in complex medical data.

Kandinsky Patterns are a strongly simplified mathematical abstraction of histopathological images and the organizational rules that underlie them. A Kandinsky Figure (KF) is a square image consisting of a gray background and \(n \in \mathbb {N}\) geometric objects (see Fig. 2). Each object is defined by its shape, color, size, and position within the figure. The objects follow certain constraints; specifically, they do not overlap, are not cropped at the border of the image, and are easily recognizable and clearly distinguishable by a human observer. Given a domain of object shapes, colors, sizes, and positions, a Kandinsky Pattern (KP) is the subset of all possible KFs in this domain that adhere to a certain ground truth [54]. An example of such a ground truth could be a natural language statement, such as "The figure is vertically symmetrical". If a KF belongs to a KP, it is labeled "true", otherwise "false" or "counterfactual" (a slight variation of the concept used in the class "true"), as illustrated in Fig. 2. For our experiments, the set of contrastive figures consists of "false" cases.
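
To make this structure concrete, the following minimal sketch shows one way a KF could be represented programmatically; the class and attribute names are illustrative assumptions and do not reflect the data format of any particular implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class KandinskyObject:
    shape: str      # e.g., "circle", "square", "triangle"
    color: str      # e.g., "red", "yellow", "blue"
    size: float     # relative to the image width
    x: float        # center position, normalized to [0, 1]
    y: float

@dataclass
class KandinskyFigure:
    objects: List[KandinskyObject]
    label: str      # "true", "counterfactual", or "false" w.r.t. a given KP

# A tiny example figure with two mirrored objects (vertical mirror axis at x = 0.5).
kf = KandinskyFigure(
    objects=[
        KandinskyObject("circle", "yellow", 0.1, 0.3, 0.5),
        KandinskyObject("circle", "yellow", 0.1, 0.7, 0.5),
    ],
    label="true",
)
```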

Some of the more complex KPs contain areas where elements of the same color are concentrated. Others take into account relationships between objects that may be far apart, expressed as spatial relations such as "left of" and "above".

KPs can be analyzed as images, e.g., with the help of object segmentation applied to individual KFs. Alternatively, graph extraction methods can be applied, as is done, for example, for histopathological images [56,57,58]; KFs can then be represented as graphs, where objects are represented as nodes and edges express the topology of the graph.

Subsequently, such graphs can be automatically classified, e.g., by training and applying a GNN model. The model's decision to assign a KF example to a certain KP can then be explained with the help of methods that reveal which structures in the graph input were relevant to the decision.

One KP that is considered both simple and important is symmetry. In this paper, we focus on this particular KP for several reasons. First, its underlying concepts can be easily defined by humans in terms of spatial, geometrical properties (e.g., "similar objects have equal coordinates along the mirror axis", such as equal y-coordinates for a vertical mirror axis) and can thus be easily validated. Moreover, symmetry exemplifies that relevance may have different meanings. Consider two figures, each containing similar objects: one symmetrical and one non-symmetrical constellation. In the symmetrical case, two objects would be connected by a horizontal line, whereas the line between the same objects in the non-symmetrical case would be diagonal w.r.t. the mirror axis. Still, in both cases, the lines connecting the two objects may receive considerably high relevance in the classification task. The reason may be that the connecting line holds important information in both cases: horizontal or diagonal alignment. Nevertheless, in the symmetrical case, the relevance assigned to the horizontal line may be higher compared to the relevance of the line in the non-symmetrical case. There exist constellations in the KP data set where objects in non-symmetrical figures may be connected by a horizontal line as well. It is the GNN's task to differentiate these cases, which should be reflected in our concept-based explanations.
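
As an illustration of how easily such a spatial concept can be formalized, the following sketch checks a constellation of objects for vertical symmetry by searching for a mirrored partner with the same appearance for every object; the tolerance value and the tuple encoding are assumptions made for this example only.

```python
def is_vertically_symmetric(objects, axis_x=0.5, tol=0.02):
    """objects: list of (shape, color, size, x, y) tuples with normalized coordinates."""
    remaining = list(objects)
    while remaining:
        shape, color, size, x, y = remaining.pop()
        # Look for a partner mirrored across the vertical axis with equal appearance and y.
        for i, (s2, c2, sz2, x2, y2) in enumerate(remaining):
            if (s2, c2) == (shape, color) and abs(sz2 - size) < tol \
                    and abs((axis_x - x) - (x2 - axis_x)) < tol and abs(y2 - y) < tol:
                remaining.pop(i)
                break
        else:
            # An object lying (approximately) on the mirror axis is its own partner.
            if abs(x - axis_x) >= tol:
                return False
    return True

print(is_vertically_symmetric([("circle", "yellow", 0.1, 0.3, 0.5),
                               ("circle", "yellow", 0.1, 0.7, 0.5)]))  # True
```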

Note that we also considered symmetry because it is one of the most important criteria for distinguishing non-cancerous from cancerous structures, and because it is a pattern that allows for expressing gradual deformation from symmetric to non-symmetric constellations, which corresponds to the structural changes a disease causes. The detection and use of domain-specific symmetries in data and models is also a recent line of research, in which GNNs play a decisive role [59].

For our benchmark, we take symmetrical and non-symmetrical KFs similar to the ones presented in Fig. 2 and transform them into graphs, where objects are represented by nodes, which in turn are connected by edges. We simulate the relevance output of a GNN explainer and assign the respective scores to the nodes' attributes and to the edges.

We thereby assume the following relationships between relevance scores and the input for well-performing GNNs. (1) If the classification decision is symmetric for a truly symmetrical figure, horizontal or vertical edges, as well as equal coordinate values of any two nodes along the symmetry axis, show higher relevance than in non-symmetrical figures. (2) If the classification decision is non-symmetric for a truly non-symmetrical figure, diagonal edges and unequal coordinates of any two nodes along an axis show higher relevance compared to symmetrical figures. (3) Some features of the input may receive higher relevance for both classification outcomes if they are decisive (e.g., edges connecting pairs of very similar objects compared to edges that connect very different objects). (4) Some features, like color or shape, should be irrelevant unless the input data set itself is not representative, in which case a GNN might include irrelevant features in its decision-making process.

Since explanations need to be expressive enough to induce justified trust in humans, producing verbal or language-like representations of the relations between the relevant components of a KF is desirable. Language-like expressions can be generated with the help of ILP in the form of logic rules. Our benchmark data set is designed in such a way that ILP should generate rules that adhere to our assumptions. To test this, we generate all possible sub-graphs for an input graph of a KF and assign relevance to the nodes and edges according to our assumptions. In this way, we realize the mapping of relevance to sub-concepts (e.g., "a horizontal connection between two objects" gets a higher relevance in symmetrical figures compared to non-symmetrical ones). The assignment of relevance was inspired by existing GNN explainers. To provide the reader with a better background, we introduce GNNs as well as GNN explainers in the next section.

3 Graph Neural Networks

We briefly introduce GNNs and existing GNN explainers. Further, we describe how we intend to bridge the gap between relevance-based explainers and the conceptual validation of GNNs using ILP.

3.1 GNN Architectures

GNNs were designed to make predictions and accomplish tasks on graph-structured data. In contrast to CNNs, which process grid-structured inputs in a fixed order, GNNs operate on graph structures without a pre-specified ordering. Tasks that they can effectively solve include node classification, link prediction, and graph classification.

One of the first architectures, the Graph Convolutional Network (GCN) [60], took inspiration from CNNs, meaning that information about each element is computed by aggregating and combining information from neighboring elements. Just as the grid-structured neighborhood of a pixel is used to compute the filter values of a CNN, the k-hop neighborhood of a node is utilized and updated during the training process.

Equation 1 describes the main operations performed in a GNN, which update the values of the initial features h on nodes and edges in a manner that is guided by training. Here, k is the number of aggregation iterations and corresponds to the number of hops in the neighborhood \(\mathcal {N}\) that will be considered. The aggregation operation (Eq. 1) uses all features of the neighboring nodes (denoted by u); these can take any form and may numerically encode several characteristics, such as size, color, and shape.

$$\begin{aligned} a_v^{(k)}&= \text {AGGREGATE}^{(k)} \Big ( \Big \{ h_u^{(k-1)}: u \in \mathcal {N}(v) \Big \} \Big ) \\ h_v^{(k)}&= \text {COMBINE}^{(k)} \Big ( h_v^{(k-1)}, a_v^{(k)} \Big ) \end{aligned}$$
(1)

After the aggregation operation, the combine operation provides the new values for the features of node v (Eq. 1), which are typically called embeddings.
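
For readers less familiar with this scheme, the following minimal sketch illustrates one AGGREGATE/COMBINE iteration of Eq. 1 on an adjacency-list representation, using mean aggregation and a fixed linear combine step; in an actual GNN these functions are parameterized and learned, so the sketch is conceptual only.

```python
import numpy as np

def message_passing_step(h, neighbors, W_self, W_agg):
    """One AGGREGATE/COMBINE iteration in the spirit of Eq. 1.

    h:         (num_nodes, d) matrix of node features h^(k-1)
    neighbors: dict mapping node index v to a list of neighbor indices N(v)
    W_self, W_agg: weight matrices of the combine step (fixed here for illustration)
    """
    h_new = np.zeros_like(h)
    for v in range(h.shape[0]):
        # AGGREGATE: here a simple mean over the neighbors' features.
        a_v = np.mean(h[neighbors[v]], axis=0) if neighbors[v] else np.zeros(h.shape[1])
        # COMBINE: merge the node's previous features with the aggregated message.
        h_new[v] = np.tanh(h[v] @ W_self + a_v @ W_agg)
    return h_new

# Toy example: 3 nodes on a path (0-1-2) with 2-dimensional features.
h = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
neighbors = {0: [1], 1: [0, 2], 2: [1]}
rng = np.random.default_rng(0)
h1 = message_passing_step(h, neighbors, rng.normal(size=(2, 2)), rng.normal(size=(2, 2)))
```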

These two operations underlie all tasks that GNNs solve; depending on the task, the basic scheme can be extended with additional functionality serving different purposes and goals [60]. In particular, the Graph Isomorphism Network (GIN) architecture learns the combine function with Multi-Layer Perceptrons (MLPs) [3]. This provides the flexibility needed for an injective aggregation, and thus the maximum possible discrimination ability, while mapping similar graph structures to nearby embeddings. Graph classification with this architecture can succeed in separating different KP classes. xAI methods that uncover the GNN's decision-making process can help researchers see whether those decisions are based on elements that let humans understand the causal factors of a decision [61].

3.2 Explainability of Graph Neural Networks

Several methods have been proposed to uncover the underlying decisions made by a GNN classifier by computing relevance scores for nodes, edges, features thereof, walks and sub-graphs. They provide the means toward transparency and, ideally, could be used in combination with symbolic approaches to generate verbal or language-like explanations for increased comprehensibility and better control of the overall system to satisfy the validation and improvement requirements of human-centric AI [10, 35, 45].

However, currently none of the existing GNN explainers generates verbal or language-like explanations, and the vast majority of them searches for the substructure that has the highest relevance according to specially designed losses, metrics, and assumptions. In the following paragraphs, we give a short overview.

GNNExplainer: For node as well as graph classification tasks, the GNNExplainer identifies important sub-graphs in the input graphs that are to be explained [17]. Graph and node features are masked; the search for an adequate mask is performed by an optimization algorithm. A substructure is considered a sufficiently good explanation if it has a high mutual information value [62] w.r.t. the predicted label distribution. A variant of this method, the PGExplainer [63], uses a deep neural network to parameterize the generation process of the masks and the explanations in general, thereby explaining multiple instances of sub-graphs at the same time.
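
To convey the general mask-search idea behind such perturbation-based explainers, the sketch below scores randomly sampled edge subsets by how well they preserve the predicted class while staying sparse. This is a drastically simplified, gradient-free illustration and not the actual GNNExplainer algorithm, which optimizes a continuous mask via a mutual-information objective; `predict_proba` is a placeholder for any trained GNN.

```python
import math, random

def edge_mask_search(edges, predict_proba, target_class, n_samples=200, sparsity=0.1):
    """Randomly search for a small edge subset that preserves the prediction.

    edges:         list of (u, v) tuples of the input graph
    predict_proba: callable mapping a list of kept edges to class probabilities
    """
    best_mask, best_score = edges, -math.inf
    for _ in range(n_samples):
        mask = [e for e in edges if random.random() < 0.5]   # random candidate subset
        p = predict_proba(mask)[target_class]
        score = math.log(p + 1e-9) - sparsity * len(mask)    # faithfulness vs. sparsity
        if score > best_score:
            best_mask, best_score = mask, score
    return best_mask
```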

GNN-LRP: This method [16] applies the Layer-wise Relevance Propagation (LRP) principle [64] to GNN architectures that fulfill the required constraints. The main strategy of LRP, namely the backpropagation of relevance from the output to the input, layer by layer, reflecting the intensity of the learned weights in proportion to each other, is transferred to the GNN message-passing scheme (see Sect. 3.1). By applying Taylor decomposition w.r.t. the graph's Laplacian matrix, the relevance of each possible walk of a particular size in the graph is computed. This method, although time- and memory-consuming, has several benefits beyond the fact that it computes both positive and negative relevance. By returning a relevance value for each walk, the "flow" of the relative importance of a node is captured in the relevance of its neighboring nodes.

MotifExplainer: This method [65] tries to find motifs (sub-graphs that may be characteristic, e.g., cycles) in the computation graph of each node that is classified. Domain knowledge is incorporated to effectively reduce the search space by selecting motifs that are known to play a role in the prediction; this produces explanations with both high fidelity and high sparsity. The motifs are fed into the model, their embeddings as well as the performance of the model are recorded, and an additional attention model provides the weights by which sub-graphs are selected. Motifs that yield a similar prediction performance and receive a high attention weight are the ones characterized as relevant [14].

GraphLIME: Similar to LIME [66], GraphLIME [18] tries to find a local surrogate model to explain each node. The N-hop neighborhood of a node is sampled to compute the most representative features, which then serve as the explanation of the prediction.

PGM-Explainer: The PGM-Explainer is one of the first counterfactual xAI methods for GNNs [19]; its main goal is to answer the question of what would happen if the input were different. Counterfactuals are usually created by adding or removing features from the input, which creates a broad spectrum of possibilities on graph data sets. Perturbing the topology of a graph, guided by the change in performance, leads to the computation of a Probabilistic Graphical Model (PGM) [67, 68] with the help of structure and parameter learning. The resulting PGM's random variables (also expressed through their conditional probability tables) determine the distribution of values that node features can take.

XGNN: This method uses Reinforcement Learning to guide the search toward the substructure in the graph that was relevant to the decision [69]. It takes prior knowledge into account by starting with an initial state, a small sub-graph, and trying to find a substructure that keeps the prediction performance intact through exploration and exploitation, while also selecting graphs that are in accordance with domain knowledge.

3.3 Toward Verbal and Trustworthy Explanations

While metrics such as fidelity exist to measure the efficiency of an explanation in an algorithmic way, we believe that for humans, verbal explanations may provide insight into a GNN's decision-making in the most natural way. Furthermore, verbal explanations could be utilized to validate the graph explainer itself. The graphs extracted from symmetrical and non-symmetrical KFs provide an ideal benchmark for such experiments, as they can be categorized into sub-concepts (the sub-graphs with attributes) and verbally described by a human.

In our benchmark, we put a special focus on simulating the expectations of a human user w.r.t. explanations for GNNs. For this particular data set, the input graph is perturbed in all possible ways and, assuming a correct GNN prediction, the relevance values a human would expect for nodes, edges, and sub-graphs are computed. We chose this approach strategically to enable valuable human-in-the-loop input in a post-hoc procedure, rather than requiring the human to mask the extracted input graph according to domain-specific concepts. Allowing the explainer to perturb only features with or without a concept would not be feasible, since it would require modifying each explainer at its very core. This would not be an adequate step toward a general tool for the validation of any explanation inferred by any relevance-based explainer.

In this work, we explore and investigate a new strategy to map human-masked concepts to a learnable model parameter (here: symbolic predicates) inferred by an ILP system. This is done by decomposing the input graph into sub-graphs that contain or do not contain the sub-concepts of interest. After each decomposition, relevance scores are re-calculated for our simulated explainer. We argue that the distribution of relevance scores should remain robust to these concept-wise perturbations and thus will converge to a model parameter.

As a first investigation, we use a synthetic data set with hypothetical relevance scores to validate the general feasibility of our approach. Once the general applicability of the presented approach is fully validated, we will explore relevance scores that are produced by the explainers described in Sect. 3.2.

4 Combining Relevance Scores with Symbolic Predicates

Fig. 3

The goal of our approach: Learning symbolic representations of disjoint and common features within sub-concepts of symmetrical and non-symmetrical KFs that help to find relevant explanations for how a GNN discriminates between symmetry and non-symmetry

As motivated in the introduction of this paper, ILP is an interpretable machine learning approach suitable for generating language-like, expressive explanations. It has been applied to explainer output in previous works to introduce relational learning for explaining predictions of CNNs [35, 37]. In this work, we integrate ILP to create a comprehensible validation framework for graph classification with GNNs.

In this section, we first give an overview of our approach. We then explain how our benchmark data set is created and which assumptions underlie its design. Subsequently, we introduce ILP as a concept learning approach and how it can be applied to derive comprehensible, language-like rules composed of symbolic predicates that may characterize the sub-concepts that are present in an explainer’s output, that is, a set of reasons for a GNN’s decision. Finally, we state our hypotheses w.r.t. results we expect from applying our approach.

4.1 From Relevance to Sub-Concepts

The usual GNN-based graph classification pipeline, including the explanation of predictive outcomes, would basically consist of three steps. First, the GNN is trained and tested on pre-labeled graph-structured input examples to achieve the highest possible classification performance. Second, the learned GNN model is ready to be applied to yet unseen graph-structured examples from the same domain in order to derive class labels. In the third step, developers and decision-makers can ask for explanations for individual classified examples. These explanations are computed by GNN explainers and their extensions (e.g., ILP as an extension to produce comprehensible output for GNN explainers). Recall Fig. 1 for an overview.

Our work has two branches as far as relevance distributions are concerned. First, we use human domain knowledge to model the expected relevance values; second, we have already started analyzing GNN explainers, as presented in Sect. 3.2. The relevance values form the basis for the symbolic representation and the input to ILP, after transformations that are clarified in Sect. 5. As presented in Fig. 1, our ultimate goal is to compare the explanations expected by humans with the ones created by explainers and thereby achieve concept validation.

Figure 3 illustrates our benchmark, which is based on KFs of vertical symmetry. KFs are depicted on the left side, and the respective explainer output for their graph representation is on the right side of part (a). Each object (yellow triangle, blue circle, etc.) is represented as a node, which is connected through edges to all the other nodes. The red coloring of nodes and edges illustrates the relevance assigned by a GNN explainer or given by humans: the more intense the red color, the higher the relevance of the component. Attributes of objects, such as shape, size, color, and the coordinates of an object's center point, are also assigned relevance values. In order to derive the relevance for sub-concepts and not for the whole graph of a KF, we decompose each example. Part (b) illustrates the human-understandable sub-concepts (e.g., "a single red triangle", "two yellow circles side by side", etc.). At this point, we want to point out that ILP is capable of generalizing from concrete attribute values by introducing variables, making it possible to express more abstract sub-concepts like "some objects side by side". Part (c) depicts the sub-graphs derived from decomposed KFs, assigned with relevance based on our assumptions as described in Sect. 2. Since our assumption is that important sub-concepts accumulate more relevance, we sum the relevance r over all components of each sub-graph G to derive its total relevance R. We then transform the sub-graphs with relevance R into a symbolic representation in Prolog syntax (see part (d)). An example can be found in Fig. 5.
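
To make the symbolic representation of part (d) concrete, the following sketch emits Prolog facts for a single sub-graph in the spirit of Fig. 5. The predicate names follow those mentioned in Sect. 4.3; the exact encoding used in our implementation may differ.

```python
def subgraph_to_prolog(graph_id, nodes, edges, total_relevance):
    """nodes: dict id -> (shape, color, x, y); edges: list of (id1, relation, id2)."""
    facts = [f"has_relevance({graph_id}, {total_relevance:.2f})."]
    for obj_id, (shape, color, x, y) in nodes.items():
        facts.append(f"contains({graph_id}, {obj_id}).")
        facts.append(f"is_a({obj_id}, {shape}).")
        facts.append(f"has_color({obj_id}, {color}).")
        facts.append(f"has_coordinates({obj_id}, {x:.2f}, {y:.2f}).")
    for a, relation, b in edges:
        facts.append(f"{relation}({a}, {b}).")
    return "\n".join(facts)

print(subgraph_to_prolog(
    "g1",
    {"o1": ("circle", "yellow", 0.30, 0.50), "o2": ("circle", "yellow", 0.70, 0.50)},
    [("o1", "left_of", "o2")],
    0.80,
))
```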

Note that our goal is not to learn an interpretable surrogate model for a GNN with ILP, as presented in [37], but to identify the sub-concepts that are present in the more relevant sub-graphs in contrast to the less relevant sub-graphs. This is a different type of learning task that is not based on splitting symmetrical and non-symmetrical examples. It takes into account that structures present in the most relevant sub-graphs from symmetrical and non-symmetrical KFs do not have to be disjoint. This applies, for example, to the edge between two horizontally located nodes, which have the same y-position in the symmetrical case but not in the non-symmetrical case. In both cases, the edge holds important information about the class: it is either a horizontal or a diagonal line.

As outlined on the right side of Fig. 3, ILP learns rules from the set of all sub-graphs (represented in Prolog) that apply only to the most relevant sub-graphs. Thus, each rule may contain a symbolic representation of sub-concepts in the form of so-called predicates to learn symmetry and non-symmetry.

To generate the input set for ILP, we rank sub-graphs according to their relevance sum. We then divide the sub-graphs w.r.t. their rank at a threshold. A justified criticism of this approach is that small sub-graphs may have a lower total relevance than bigger sub-graphs. We address this in our benchmark by keeping the total relevance roughly constant among important sub-graphs and distributing it over the structures that represent important sub-concepts. Note that for real explainer output, normalizing the relevance prior to ranking would be necessary. Sub-graphs with a relevance sum above a configurable threshold t are considered members of the class of "more relevant" examples; the rest are considered "less relevant". By learning from such examples with ILP, we generalize over symbolized sub-graphs of symmetrical and non-symmetrical KFs. This way, we aim to find the commonalities in the most relevant sub-concepts while discriminating them from the least relevant sub-concepts, both for symmetrical and non-symmetrical examples.
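
The ranking and threshold split can be summarized by the following minimal sketch; the resulting positive and negative example sets would then be passed, together with the background knowledge, to the ILP system. The data layout is an assumption made for illustration.

```python
def split_by_relevance(subgraphs, threshold):
    """subgraphs: list of (example_id, total_relevance) pairs.

    Returns the "more relevant" (positive) and "less relevant" (negative) example ids
    that form the input to the ILP concept learning step.
    """
    ranked = sorted(subgraphs, key=lambda pair: pair[1], reverse=True)
    positives = [ex for ex, r in ranked if r >= threshold]
    negatives = [ex for ex, r in ranked if r < threshold]
    return positives, negatives

positives, negatives = split_by_relevance(
    [("g1", 0.80), ("g2", 0.55), ("g3", 0.20), ("g4", 0.10)], threshold=0.5)
# positives -> ['g1', 'g2'], negatives -> ['g3', 'g4']
```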

4.2 A Generated Benchmark for Vertical Symmetry

Fig. 4

The fully connected graph for four objects with the pre-specified relevance per node and edge in a non-symmetric case. The relevance of the whole graph is the sum of the relevance of all its components. The relevance spectrum has values from 0 to 1 (from blue to red)

For the validation of GNNs and GNN explainers with the help of ILP, we provide a benchmark data set based on the KP for vertical symmetry. Every image in the data set contains either a vertically symmetric constellation of objects or not. Currently, our benchmark includes graphs with one to four objects which have a predefined size, color, and shape, e.g., one triangle and one circle left of the mirror axis and similarly on the right. We created both symmetric cases and cases in which the positions of the objects are slightly perturbed. In particular, to create non-symmetrical constellations, each coordinate of the four nodes is a noisy version of the corresponding coordinate in the symmetric case. The added noise is sampled eight times in total (twice per node, once for each coordinate) from a normal distribution \(\mathcal {N}(\mu , \sigma ^2)\) with \(\mu =0\) and \(\sigma = 0.1\). The benchmark generator makes sure that the produced objects do not overlap and that they have distinct relations to each other, as presented in Fig. 2. We consider all possible sub-graphs that can be generated from input graphs with one, two, three, and four objects. For each set of objects, we compute every possible graph with every possible combination of edges. We further include all possible sets of nodes that are not connected by edges.
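
The two generation steps described above can be sketched as follows: a symmetric constellation is perturbed with Gaussian noise, and all sub-graphs are enumerated as combinations of node subsets and edge subsets. This is a simplified stand-in for the generator in our repository; in particular, it does not reproduce the exact example counts reported in Sect. 5, which result from the richer, directed relations.

```python
import itertools
import numpy as np

rng = np.random.default_rng(42)

def perturb(coords, sigma=0.1):
    """Add N(0, sigma^2) noise to every coordinate of a symmetric constellation."""
    return [(x + rng.normal(0, sigma), y + rng.normal(0, sigma)) for x, y in coords]

def all_subgraphs(nodes):
    """Enumerate every node subset together with every subset of its possible edges."""
    for k in range(1, len(nodes) + 1):
        for node_subset in itertools.combinations(nodes, k):
            possible_edges = list(itertools.combinations(node_subset, 2))
            for m in range(len(possible_edges) + 1):
                for edge_subset in itertools.combinations(possible_edges, m):
                    yield node_subset, edge_subset

symmetric = [(0.3, 0.3), (0.7, 0.3), (0.3, 0.7), (0.7, 0.7)]   # mirrored around x = 0.5
non_symmetric = perturb(symmetric)
num_subgraphs = sum(1 for _ in all_subgraphs(range(4)))        # (node subset, edge subset) pairs
```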

The total relevance R of each sub-graph G can range from 0 to 1 and is the sum of all relevance values of the nodes and edges \(\in \) G. The relevance distribution inside G is defined using domain knowledge about sub-concepts that may be important for differentiating between symmetry and non-symmetry from a human's perspective. The following principles were applied. First, R is distributed non-uniformly within a sub-graph to account for the fact that not all components may be equally important, especially in larger sub-graphs. This is preserved even for sub-graphs where important sub-concepts occur multiple times, under the assumption that GNNs weigh features efficiently and not for the sake of completeness. Second, in accordance with our assumptions presented in Sect. 2, horizontal or vertical edges, as well as equal coordinate values of any two nodes along the symmetry axis, get a higher R in symmetrical figures compared to non-symmetrical ones. Diagonal edges and unequal coordinates of any two nodes get a higher R in non-symmetrical figures compared to symmetrical ones. Further, some edges get a higher R in both cases if the connection they express is independently important, e.g., a line between a pair of objects (horizontal in the symmetrical case and diagonal in the non-symmetrical case). Sub-graphs that contain no edges, and for which a GNN would rather consider the size, shape, or color of an object, are assigned a lower R in order to account for the considerable irrelevance of such features w.r.t. symmetry. Of course, symmetry holds only if the mirrored objects are equal in appearance, but in general, concrete size, shape, or color values should not play a role. The distribution of relevance in a sub-graph is exemplified in Fig. 4, which shows a non-symmetrical example in which edges that are nearly horizontal get a higher relevance than others.
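
The orientation-related part of these principles could be realized as in the following sketch, which rewards axis-aligned edges in symmetrical figures and diagonal edges in non-symmetrical ones; the concrete scores are illustrative placeholders, and the benchmark additionally raises the relevance of edges between decisive object pairs in both classes, as described above.

```python
def edge_relevance(p1, p2, figure_is_symmetric, tol=0.05):
    """Simulated relevance for the edge between object centers p1 and p2 (normalized coords)."""
    axis_aligned = abs(p1[1] - p2[1]) < tol or abs(p1[0] - p2[0]) < tol
    if figure_is_symmetric:
        return 0.30 if axis_aligned else 0.05   # horizontal/vertical edges matter for symmetry
    return 0.30 if not axis_aligned else 0.05   # diagonal edges matter for non-symmetry

def total_relevance(node_scores, edge_scores):
    """Total relevance R of a sub-graph: the sum over the scores of its components."""
    return sum(node_scores) + sum(edge_scores)
```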

We further create sub-graphs on four levels of expressiveness, named single objects, x-axis, y-axis, and alignment relations. In the case of single objects, all object constellations are created without any edge connecting them, i.e., all graphs with no edges. In the x-axis case, all possible graphs are created; however, the connection between two nodes is only kept if it expresses a relation w.r.t. the x-axis ("left-of" and, correspondingly, "right-of"). Similarly, for the y-axis case, only the "down-of" and "up-of" relations are kept. For the alignment case, the eight spatial relationships are: "top-left", "top-middle", "top-right", "middle-left", "middle-right", "bottom-left", "bottom-middle", and "bottom-right". To compute them, each object in an image is taken as an imaginary center, and any object connected to it either lies inside (is contained in) one of the eight sectors or is found on one of the eight lines that separate the sectors. If a particular relation is found between two objects, then the corresponding inverse relation holds for the same two objects in the opposite direction. The code to generate and re-use our benchmark can be found in the corresponding GitHub repository (Footnote 1).
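
One possible way to compute the alignment relation between two connected objects is to compare their center coordinates sector-wise, as sketched below; the tolerance that decides when an object lies on a separating line, and the convention that the y-coordinate increases upward, are assumptions for this illustration.

```python
def alignment_relation(center, other, tol=0.01):
    """Return the alignment relation of `other` w.r.t. `center` (both (x, y) tuples).

    The plane around `center` is divided into eight sectors; offsets below `tol` in one
    direction place the object on a separating line (e.g., "top-middle").
    Assumes y grows upward; for image coordinates, "top" and "bottom" would be swapped.
    """
    dx, dy = other[0] - center[0], other[1] - center[1]
    horiz = "middle" if abs(dx) < tol else ("right" if dx > 0 else "left")
    vert = "middle" if abs(dy) < tol else ("top" if dy > 0 else "bottom")
    if vert == "middle" and horiz == "middle":
        return None                     # coincident centers: no relation
    if vert == "middle":
        return f"middle-{horiz}"        # e.g., "middle-left", "middle-right"
    if horiz == "middle":
        return f"{vert}-middle"         # e.g., "top-middle", "bottom-middle"
    return f"{vert}-{horiz}"            # e.g., "top-left", "bottom-right"

print(alignment_relation((0.5, 0.5), (0.8, 0.5)))  # "middle-right"
```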

4.3 Learning Symbolic Predicates with ILP

ILP can be applied to symbolic input to induce a set of logic rules that separates a target class from contrastive classes. Most ILP frameworks perform concept learning, meaning induction based on a set of so-called positive examples, belonging to the target class, and a set of so-called negative examples, not belonging to the target class. The input data, the algorithm, as well as the learned set of rules are represented as logic programs that are interpretable for humans [35]. To be more specific, ILP is based on first-order logic, and program code is represented in Prolog, a declarative programming language. We briefly summarize the key terminology of ILP and Prolog based on the introduction given in [44].

Prolog programs consist of a finite set of Horn clauses. A Horn clause is a disjunction of literals with at most one positive literal and can be represented as an implication, written in the form

$$\begin{aligned} A_0 \leftarrow A_{1} \wedge \ldots \wedge A_{n},\ \text {where}\ n \ge 0. \end{aligned}$$

Each literal \(A_i\) has the form \(p(t_1,\ldots ,t_m)\), consisting of a predicate p and terms \(t_i\). Terms are either constant symbols, variables, or composite expressions. The literal \(A_0\) is called the head and the conjunction of elements \(\bigwedge _{i=1}^n A_i\) is called the body of a clause. In Prolog, the implication \(\leftarrow \) is denoted by \(:-\) and the comma represents conjunction \(\wedge \). Clauses end in a full stop.

Basic Prolog concepts include facts, rules, and queries. Horn clauses with an empty body are called facts and express unconditional statements; otherwise, they are called rules and express conditional statements. Semantically, we can read a clause as "the conclusion (or head) \(A_0\) is true if every \(A_i\) in the body is true". Facts are literals with constant terms. Rules express a logical implication to describe that a condition holds if a combination of literals holds, supported by given facts. Queries can be used to retrieve information from a Prolog program. Queries can either be facts, to check for their truth or falsity, or composite expressions, to retrieve terms that make these expressions true. Since we do not apply queries directly in our approach, we focus on facts and rules in the following paragraphs.

To learn symbolic rules that contain relevant sub-concepts to explain the classification of KFs, we applied the ILP framework Aleph [70]. Aleph uses inverse entailment to induce a theory, consisting of Prolog rules, from a set of positive and negative examples and background knowledge that describes these examples. The default algorithm performs four steps. First, it selects an example to be generalized. Second, it constructs a most specific clause based on a pre-defined syntactical restriction. Then, a clause that is more general than the most specific clause and which entails the selected example is found, and finally, all the examples that are covered by the more general clause (rule) are removed from the set of candidate examples to be generalized.

Figure 5 serves as an illustrative example for our KP benchmark. At the top, it contains an excerpt from the so-called background knowledge, describing one sub-graph of a KF represented by symbolic facts. The facts express which objects are contained in a graph (both denoted by their respective identifiers) and which spatial relationships hold between the contained objects (e.g., "left_of"), as well as the attributes of the objects in terms of coordinates (denoted by "has_coordinates"), shape (denoted by "is_a"), and color (denoted by "has_color"). We omit the relevance values for all predicates, except for the relevance of the whole graph (denoted by "has_relevance"). These relevance values are excluded from the ILP learning process, since the relevance is solely used to split the examples into positive and negative ones. At the bottom, Fig. 5 shows a rule that may be learned by Aleph to discriminate between more relevant and less relevant sub-graphs. It is a rule that covers only vertically symmetrical examples. It represents the sub-concept of equal y-position of contained objects. It may also contain rather irrelevant sub-concepts, like the color of an object.

Fig. 5

Prolog representation of an exemplary KF sub-graph (top) and of a Prolog rule that applies to it (bottom)

4.4 Expected Explanations for Symmetric Kandinsky Pattern Classification

In accordance with our assumptions presented in Sect. 2 and the design of our benchmark, which is in line with these presuppositions (see Sect. 4), we expect ILP to learn a set of rules, some of which may cover only symmetrical or only non-symmetrical examples and some of which may cover both. For vertical symmetry, we expect ILP to consider equal y-coordinates of objects as well as horizontal and vertical connections between such objects, while for the non-symmetrical examples we expect ILP to consider diagonals between objects with different y-coordinates. We further expect that some sub-concepts, like the shape and color of objects, might be included for symmetry if they are equal for different objects, and for non-symmetry if they are not.

5 Evaluation

In this section, we describe the experimental setting and the obtained results.

5.1 Experimental Setting

For our experiments, we distinguish four levels of expressiveness, as presented in Sect. 4. We want to test different levels of detail in the background knowledge for representing sub-graphs of the original graph and, thus, sub-concepts.

Fig. 6

Plot of the sorted human-expected relevance, as well as some characteristic sub-graphs. The red vertical line represents the threshold t that divides the graph data set into more and less relevant sub-graphs (here, at t = 0.5)

Determined by the number of possible combinations, the set of input examples contains 20 examples for single objects and 216 examples each for the x-axis relational case, the y-axis relational case, and the more advanced spatial alignments. We set the relevance threshold to a particular value and split the input examples in order to assign them either to the target class or to the contrastive case: examples with less relevance than the threshold value become negative examples, and examples with relevance equal to or greater than the threshold become positive examples. Figure 6 shows an exemplary split at a threshold of 0.5. We performed experiments for different relevance thresholds, ranging from a minimum threshold of 0.05 (meaning 5% of a maximum relevance of 1) to a maximum threshold of 0.95, to test Aleph's robustness against imbalanced input data and its generalization capabilities. Different thresholds lead to different numbers of positive and negative examples and, thus, to imbalanced input. Figure 7 shows the distribution of examples according to the threshold for single objects, whereas Fig. 8 depicts the distribution for spatial relations (along the x-axis, y-axis, and as alignments). As expected, the threshold regulates the balance or imbalance of the input data set for Aleph. This is in line with the fact that relevance values computed by GNN explainers (see Sect. 3.2) are generally not equally distributed over sub-graphs. This property is used to explore the capabilities of Aleph to learn from small and potentially imbalanced data sets.

Fig. 7

The number of positive and negative examples per threshold for single objects

Fig. 8

The number of negative and positive examples per threshold for spatial relations

The unequal distribution of relevance values puts the generality of explanations produced by ILP to the test. Therefore, we examine the robustness of explanations depending on the distribution of the relevance across sub-graphs and the chosen threshold. We observe that the best balance for single objects is at t = 0.25 (see Fig. 7), while for relations it is at t = 0.55 (depicted in Fig. 8). In the case of single objects, a split of examples at a threshold of 0.05 yields no negative examples, whereas a threshold of 0.35 yields no positive examples. Similarly, for relations, the corresponding thresholds are 0.05 and 0.85. This is in accordance with the fact that sub-graphs do not accumulate enough relevance at higher thresholds.

5.2 Results

Fig. 9

Exemplary rules learned for each use case (Single Objects, x-Axis, y-Axis and Alignments)

This section presents the results for each of our four levels of expressiveness. In particular, we list some rules that had the highest number of covered positive examples while covering as few negative examples as possible and some interesting cases that demonstrate the suitability of learned symbolic predicates for validation.

Figure 9 lists nine exemplary rules that stood out from our results. The first rule was learned for spatial alignment at the threshold of 0.25. Although there was just a small number of contrasting examples (see Fig. 8), the rule is quite representative of symmetrical examples (covering 98 of them), as it contains a predicate that imposes the condition of the same y-coordinate value on two related objects, C and B. The rule further states that the graph must contain an object D that is yellow. Something similar applies to the second rule, which was learned for the same use case for sub-graphs from non-symmetrical examples. There is a yellow object E; in contrast to the first rule, the predicate expressing the spatial relationship describes a diagonal (B is middle-right top-right of C). Another interesting rule is the third one. Here, ILP found a set of predicates that covers symmetrical as well as non-symmetrical examples, where the sub-graphs had a total relevance higher than the threshold of 0.55. The rule contains several diagonals, e.g., the diagonals top-middle-top-left and bottom-middle-bottom-left. Comparing rules 4 and 5, which were learned for x-Axis relations, one at a threshold of 0.45 and the other at a threshold of 0.55, we can see that rule 4, which covers only symmetrical examples, again contains a predicate for the same y-coordinate. Rule 5, on the other hand, covers only non-symmetrical examples and contains no predicates that are representative of symmetry. Instead, the colors of objects seem to provide a more general pattern here by being different (yellow, blue). Rules 6 and 7 were both learned for y-Axis relations. Their comparison shows how important it is to balance positive and negative examples, meaning examples with higher relevance and those with lower relevance, in order to avoid noise. Rule 6, which was learned at a threshold of 0.25, covers a total of 98 symmetrical examples. However, the rule contains a rather irrelevant restriction of an object to the color yellow. In contrast, rule 7, which was learned at a threshold of 0.55 (the most balanced case for relations; see Fig. 8), includes no statements about color. Instead, only predicates of containment and spatial relations were induced. For the single objects, we selected two interesting rules. While rule 8 was learned for 12 symmetrical as well as 12 non-symmetrical examples, rule 9 covered only four symmetrical examples. The first one includes information such as the kind of shape (here: circle), and the second one again contains a statement about the same y-coordinate.

Our results show that aligning the relevance threshold with the distribution of relevance in the sub-graphs produces rules that contain symbolic predicates that match with human-expected sub-concepts.

6 Discussion and Conclusion

The results in the previous section show that symbolic predicates can be learned that describe the sub-concepts representative of a set of sub-graphs, either for symmetrical or non-symmetrical examples or for both. We have learned that symmetrical and non-symmetrical examples are often covered by separate rules, even though they share the positive label, which is a precondition for finding sub-concepts that distinguish the two groups. We could observe that the investigation of positive and negative labels provides information about the relevance distribution of the explanations. The rules, on the other hand, may indicate whether there is noise due to over-represented, rather irrelevant sub-concepts. At the same time, symbolic rules that cover a high number of examples in a rather balanced setting of positive and negative examples may be representative of the concepts underlying the decision of a GNN, given that the explanations are gathered from a GNN explanation method. Currently, we have applied our approach solely to human-expected relevance scores. In the future, we will take into account the model accuracy of a GNN (where low accuracy may be a driver for noise in the sub-concepts) as well as the output of real explainers (which may cause imbalanced input data). We have already started experiments using a Graph Convolutional Network (GCN) [60] and are currently investigating the output of different state-of-the-art explainers. In order to mitigate noise and to improve a GNN model with a human-in-the-loop, a corrective feedback technique could be integrated, similar to previous work [35].

Our benchmark data set could be utilized in additional experiments to compare the correctness, similarities, and differences of explanations by humans and AI systems, as well as for an investigation of how a human could learn from comprehensible AI. Our presented approach is a first attempt toward generating comprehensible output for relevance-based methods (as presented in Sect. 3.2) that explain a GNN's classification. For this, a rigorous comparison of different explanation methods as well as data sets is necessary. While we applied our approach to the Kandinsky Pattern of vertical symmetry, future work could include further relational data sets and examples. A comparison of GNNs with ILP models in terms of classification performance and interpretability could be another interesting investigation. We further note that our approach is a combination of model-specific and model-independent methods w.r.t. GNN architectures. While relevance-based explanations are computed based on the inner workings of a specific GNN architecture, ILP allows for conceptual abstraction. That is, our approach can be applied to any explainer that computes a linear or ordinal relevance score for features that can be easily modified, such as nodes, edges, and attributes. Our implementation is the very first step toward comparing different GNN explanations and, consequently, their symbolic representations. In the future, our research will also include user studies in order to verify our approach. Ultimately, we intend to extend our proposed technique toward actionable, interactive systems by combining GNNs, xAI, and a logical approach that is understandable to humans.