A Technique for Determining Relevance Scores of Process Activities using Graph-based Neural Networks

Process models generated through process mining depict the as-is state of a process. Through annotations with metrics such as the frequency or duration of activities, these models provide generic information to the process analyst. To improve business processes with respect to performance measures, process analysts require further guidance from the process model. In this study, we design Graph Relevance Miner (GRM), a technique based on graph neural networks, to determine the relevance scores for process activities with respect to performance measures. Annotating process models with such relevance scores facilitates a problem-focused analysis of the business process, placing these problems at the centre of the analysis. We quantitatively evaluate the predictive quality of our technique using four datasets from different domains, to demonstrate the faithfulness of the relevance scores. Furthermore, we present the results of a case study, which highlight the utility of the technique for organisations. Our work has important implications both for research and business applications, because process model-based analyses feature shortcomings that need to be urgently addressed to realise successful process mining at an enterprise level.


Introduction
The purpose of business process management (BPM) is to improve business processes [1]. A central role in process improvement is played by the process analyst [2], who is responsible for 'monitoring, measuring, and providing feedback on the performance of a business process' [3, p.45]. The ongoing implementation of information systems in organisations, along with the subsequently enhanced availability of event log data, has enabled process analysts to discover as-is models of processes through process mining with relative ease [4]. However, the crucial challenge lies in identifying potential areas for process improvements (i.e., process analysis) with respect to a strategic goal [5]; this requires analytical capabilities such as Pareto or root cause analysis [2].
A business process can be defined as a 'completely closed, timely, and logical sequence of activities' [6, p.3] that realises an outcome valuable to a customer [7]. The effectiveness (i.e., customer value) and efficiency (e.g., timely, logical sequence, resource utilisation) of a business process are monitored using key performance indicators (KPIs) as aggregated measures of process outcomes; in the context of BPM, these are often referred to as process performance indicators (PPIs) [8]. Thus, to improve a business process, it is essential for a process analyst to understand the relevance of individual process activities in terms of their impact on the dimensions expressed by these performance measures.
For example, we consider a travel reimbursement process at a university; it aims for a high degree of compliance with travel policies. Observing that the KPI ratio of budget violations increases, the process analyst must understand which activities in the process should be redesigned to improve the process; hence, they must evaluate the KPI. In Figure 1, two discovered process models are presented.
On the left, the most frequent path is shown, annotated with the number of occurrences for each activity. The process analyst can deduce information about the reimbursement process's execution from a generic perspective but not with respect to the budget violations. The right-hand process model indicates the most relevant path in terms of budget violations, and each process activity is annotated with a relevance score expressing its importance thereto. The process analyst can directly identify which activities should be considered for redesign and can also suggest which of these activities should be considered first.
Process model-based analysis (that is, process analysis based on the discovered process model) is able to make users aware of the business processes behind the data and can subsequently guide process analysts as they improve these processes [9]. To facilitate analysis beyond the simple discovery of a process, the process model must provide information suitable for the improvement initiative. Therefore, we design a technique to determine the relevance scores of process activities with respect to a performance measure extracted from the event log data.
Determining relevance scores for process activities is an interesting challenge, owing to the plurality of relationships between activities. For instance, an activity may or may not occur; if it does, then it may occur towards the start or end of a process, once or multiple times, and before, after, or between other activities, etc. Understanding these complex relationships and their influences on process performance is a difficult task.
One paradigm to address this challenge is machine learning (ML). ML techniques can automatically learn models from the data that map relationships and effects. Evermann et al. [10] showed that with deep learning (DL), predictive models can be learned from event log data more accurately than with traditional ML techniques. Deep neural networks (DNNs) were shown to be able to learn the intricate structures of a business process using multi-representation learning [11]. However, DNNs often struggle to intuitively represent the learned structures; this is commonly referred to as the black-box problem [12].
Graph-based neural networks (GNNs) are a relatively new group of DNNs; they have proved useful in domains where the input data have a graph structure, such as in chemistry and molecular biology [13]. Compared to traditional DNNs such as multi-layer perceptrons, GNNs can compute graph data directly [14].
In particular, the structure of the input graph can be matched directly to the topology of the GNN; this allows for direct inferences to be made between the relevance of network nodes and graph nodes. Gated graph neural networks (GGNNs) are a variant of GNNs; they were designed to tackle temporal dependencies in the data [15]; such dependencies are a significant aspect of event log data. Therefore, the main idea of this paper is to design a GGNN-based technique, referred to as Graph Relevance Miner (GRM), to determine the relevance scores (with respect to a given performance measure) of process activities from event log data. First, we transform process instances using a prediction label (i.e., the performance measure), converting them into instance graphs (IGs). Second, we input these graphs into the GGNN model for training and testing. Finally, we input individual or multiple instances into the GGNN model, to determine the relevance scores.
The remainder of this paper is organised as follows. In Section 2, we introduce the preliminary information regarding event logs and GNNs. Next, we present the design of our technique in Section 3. In Section 4, we present the results from our evaluation of the technique, obtained for four different real-life datasets; then, we describe our case study. In Section 5, we summarise our contributions, review the related works, and consider the limitations of our study. Lastly, in Section 6, we conclude the paper with a brief summary of the technique's potential impacts on research and business applications, and we highlight possible future research directions.

Event Logs
Process mining is a technology that facilitates the discovery, analysis, and enhancement of process models, using the data extracted from event logs [4].
An event log is structured into traces, which are in turn structured into events.
Thus, based on Polato et al. [16], we define the terms event universe, event, trace, and event log as follows:
Definition 1 (Event universe). $\mathcal{E} = \mathcal{A} \times \mathcal{C} \times \mathcal{T}$ is the event universe, in which $\mathcal{A}$ is the set of process activities, $C$ the set of process instances (cases), $\mathcal{C}$ the set of case IDs with the bijective projection $id: \mathcal{C} \to C$, and $\mathcal{T}$ the set of timestamps.
To consider time, a process instance $c \in C$ contains all past and future events, whereas the trace $\sigma_c$ of $c$ contains all events up to the current time instant.
Definition 2 (Event). An event $e \in \mathcal{E}$ is a tuple $e = (a, c, t)$, where $a \in \mathcal{A}$ is the process activity, $c \in \mathcal{C}$ is the case ID, and $t \in \mathcal{T}$ is its starting timestamp.
Given an event $e$, we define the projection functions $F_p = \{f_a, f_c, f_t\}$: $f_a: e \to a$, $f_c: e \to c$, and $f_t: e \to t$.

Definition 3 (Trace). A trace is a non-empty sequence $\sigma = \langle e_1, \dots, e_{|\sigma|} \rangle$ of events. A trace can also be considered as a sequence of vectors, in which a vector contains all or part of the information relating to an event (e.g., an event's activity), that is, $\langle x^{(1)}, \dots, x^{(|\sigma|)} \rangle$, where $x^{(i)} \in \mathbb{R}^{n \times 1}$ is a vector, and the superscript denotes the time-ordering of the events.
Definition 4 (Event log). An event log $L_\tau$ for time instant $\tau$ is the set of traces such that $\forall \sigma_c \in L_\tau, \exists c \in C$ with $\forall e \in \sigma_c: id(f_c(e)) = c \wedge f_t(e) \le \tau$ (i.e., all events of the observed cases that have already happened).
Finally, our technique assumes a labelled event log for training. Thus, we define the term label.
Definition 5 (Label). Given a trace $\sigma = \langle e_1, \dots, e_k, \dots, e_{|\sigma|} \rangle$, we can define its label as $f_l(\sigma) = l$. In this paper, a label represents a certain outcome of a process; for example, 'loan is accepted' or 'loan is not accepted' in the case of a loan application process.
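To make Definitions 2 to 5 more tangible, the following minimal Python sketch represents events as tuples, derives the traces of an event log for a time instant tau, and looks up a label per trace. The activity names, the integer timestamps, and the `outcomes` lookup table are illustrative assumptions and are not taken from the datasets used in this paper.

```python
from collections import defaultdict
from typing import NamedTuple


class Event(NamedTuple):
    activity: str   # a: the process activity
    case_id: str    # c: the case ID
    timestamp: int  # t: starting timestamp (integer time instants for simplicity)


def event_log(events, tau):
    """Return {case ID: trace}, keeping only events with timestamp <= tau."""
    traces = defaultdict(list)
    for e in events:
        if e.timestamp <= tau:
            traces[e.case_id].append(e)
    # Each trace is ordered by time, mirroring the time-ordering of a trace.
    return {c: sorted(t, key=lambda e: e.timestamp) for c, t in traces.items()}


def label(trace, outcomes):
    """f_l: map a trace to its outcome label via a hypothetical lookup table."""
    return outcomes[trace[0].case_id]


# Hypothetical events of a loan application process.
events = [
    Event("Submit application", "c1", 1),
    Event("Submit application", "c2", 2),
    Event("Check application", "c1", 3),
    Event("Decide", "c1", 7),
]
log = event_log(events, tau=5)
outcomes = {"c1": "loan is accepted", "c2": "loan is not accepted"}
print([e.activity for e in log["c1"]], label(log["c1"], outcomes))
```

The event with timestamp 7 lies after tau = 5 and is therefore not part of the trace of case c1.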

Graph Neural Networks
GNNs [13] are a type of neural network in which the network architecture is defined according to a graph structure. Because graphs constitute an integral part of these neural networks, we define the term graph first.
Definition 6 (Graph). A tuple $G = (V, E)$ is a graph, where $V$ is a set of nodes and $E$ a set of edges. A node $v \in V$ has a unique value assigned to it, whilst an edge is a pair $e = (v, v') \in V \times V$. The node vector (node representation or node embedding) for node $v$ is denoted by $h_v \in \mathbb{R}^D$, where $D$ denotes the vector dimensionality of node $v$. Graphs can also contain node labels $l_v \in \{l_1, \dots, l_{|V|}\}$ for each node $v$, as well as edge labels (edge types) $l_e \in \{l_1, \dots, l_{|E|}\}$ for each edge.
Furthermore, we define four functions to help us manage these graphs.
The recurrent function $f_{lt}$ is shared among all nodes. Its input parameters include the states of the neighbouring nodes (i.e., nodes that are directly connected) and $l_{f_{nbr}(v)}$ (features of the neighbouring nodes). For this, Scarselli et al. [13] have suggested decomposing $f_{lt}(\cdot)$ into a sum of terms describing ingoing and outgoing edges, where $f_{lt}$ is either a feed-forward neural network or a linear function of $h_v$.
The terms' parameters differ according to the label configuration (i.e., $l_{(v',v)}$ or $l_{(v,v')}$, each of which represents edge type and direction). For example, in the linear case, $f_{lt}$ combines the sparsity matrix (or adjacency matrix) $A^{(l_v, l_{(v',v)}, l_{v'})}$, which contains the weight of the edge running from node $v'$ to $v$, with the bias $b^{(l_v, l_{(v',v)}, l_{v'})}$ of this edge. Both the weight and bias are learnable parameters.
After computing node representations using the propagation model, the output model maps these representations and their corresponding labels to an output. Depending on the problem to be addressed, the output can be graph-based, node-based, or edge-based. In this work, we focus on graph-based outputs, because the outcome of a process is not determined by a single node or edge. The graph-based output $\hat{o}$ is calculated by a local output function $f_{lo}$. Similar to the function $f_{lt}$, $f_{lo}$ is either a feed-forward neural network or a linear function of $h_v$. To summarise, the computations described in $f_{lt}$ and $f_{lo}$ can be interpreted as feed-forward neural networks or linear functions, and their (internal) parameters are updated through a gradient-descent strategy.
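Lastly, we adopt a framework for standardising GNNs, referred to as message passing neural networks (MPNNs) [17], which comprises a message-passing phase and a readout phase. First, in the message-passing phase, a message function $M_t$ calculates the message for node $v$, where $h_v$ and $h_w$ are the node representations of nodes $v$ and $w$, respectively; $e_{(v,w)}$ represents the features of the edge running from node $v$ to node $w$. Then, a node update function $U_t$ calculates node $v$'s new node representation $h_v^{(t+1)}$, as formalised in Eq. (6). Second, the readout phase uses function $R$ to omit the input parameter $h^0$ of the GNN output function $f_{lo}$, as formalised in Eq. (7).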
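The two phases of this framework can be illustrated with a small, library-free Python sketch. It assumes the common formulation in which the message for a node is the sum of the message function over its neighbours; the toy message, update, and readout functions below are placeholders chosen for readability and are not the learned functions of a GGNN.

```python
def message_passing(h, neighbours, message_fn, update_fn, readout_fn, steps):
    """Run `steps` rounds of message passing, then read out a graph-level value.

    h: dict mapping node -> state vector (list of floats)
    neighbours: dict mapping node v -> list of (w, e_vw) pairs
    """
    for _ in range(steps):
        new_h = {}
        for v, h_v in h.items():
            # Message for v: sum of message_fn over all neighbours w of v.
            m_v = [0.0] * len(h_v)
            for w, e_vw in neighbours.get(v, []):
                contribution = message_fn(h_v, h[w], e_vw)
                m_v = [a + b for a, b in zip(m_v, contribution)]
            new_h[v] = update_fn(h_v, m_v)
        h = new_h  # all nodes are updated synchronously
    return readout_fn(h)


# Toy choices (assumptions for illustration only).
def weighted_copy(h_v, h_w, e_vw):
    return [e_vw * x for x in h_w]


def average(h_v, m_v):
    return [0.5 * (a + b) for a, b in zip(h_v, m_v)]


def total(h):
    return sum(sum(state) for state in h.values())


h0 = {"a": [1.0], "b": [2.0]}
neighbours = {"a": [("b", 1.0)], "b": [("a", 1.0)]}
result = message_passing(h0, neighbours, weighted_copy, average, total, steps=1)
print(result)
```

After one round, both nodes hold the average of their own state and their neighbour's state, so the readout sums two identical values.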

Gated Recurrent Units of the Gated Graph Neural Network
In this paper, we adopt the GGNN architecture described in Li et al. [15].
This architecture extends the 'vanilla' GNN of Scarselli et al. [13], using gated recurrent units (GRUs). A GRU [18] can be considered as a logical unit employing two gates to control the information flow over time. These two gates are referred to as the reset and update gates. The reset gate determines the quantity of past information (from previous time steps) to be forgotten; conversely, the update gate determines the quantity of past information to be propagated to the future. Given a sequence of inputs, a GRU computes the sequence of activations via the recurrent equations (8) to (11), where sig denotes the sigmoid activation function, $r$ is the reset gate vector, $z$ is the update gate vector, $\odot$ indicates a point-wise multiplication, $h$ is a hidden state vector, $b$ is a bias vector, and $W$ and $U$ are weight matrices. To summarise, the set $\theta = \{W, U, b\}$ includes the GRU's learnable parameters (i.e., its weights and biases). Finally, we define the projection function $f_{GRU}$, which applies Eqs. (8) to (11).
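As an illustration of how such a unit operates, the following sketch implements one step of a GRU in plain Python. It assumes the widely used formulation in which the new state interpolates between the previous state and a candidate state, weighted by the update gate; the parameter names (`Wz`, `Uz`, `bz`, and so on) are hypothetical, and the exact form and ordering of the equations used in this paper may differ.

```python
import math


def sig(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def gate(W, U, b, x, h, activation):
    """activation(W x + U h + b), applied element-wise."""
    return [activation(wx + uh + bias)
            for wx, uh, bias in zip(matvec(W, x), matvec(U, h), b)]


def gru_step(x, h, p):
    z = gate(p["Wz"], p["Uz"], p["bz"], x, h, sig)   # update gate
    r = gate(p["Wr"], p["Ur"], p["br"], x, h, sig)   # reset gate
    reset_h = [r_i * h_i for r_i, h_i in zip(r, h)]  # point-wise product of r and h
    candidate = gate(p["Wh"], p["Uh"], p["bh"], x, reset_h, math.tanh)
    # Interpolate between the old state and the candidate state.
    return [(1.0 - z_i) * h_i + z_i * c_i
            for z_i, h_i, c_i in zip(z, h, candidate)]


# With all parameters set to zero, both gates equal 0.5 and the candidate is 0,
# so the step simply halves the previous state.
zeros = [[0.0, 0.0], [0.0, 0.0]]
params = {name: zeros for name in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}
params.update({name: [0.0, 0.0] for name in ("bz", "br", "bh")})
h_next = gru_step(x=[1.0, 1.0], h=[2.0, -4.0], p=params)
print(h_next)
```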

GRM: Determining Relevance Scores of Process Activities with GGNNs
GRM determines the relevance scores for activities using the event log data. In the following, we refer to the representative event log $L^{ex}_\tau$ (as depicted in Table 1) to describe our technique's steps. $L^{ex}_\tau$ comprises the trace $\sigma_1$, which represents Case 1 of a reimbursement process for business travel (this case originates from the bpi2020pl event log, which is introduced in Section 4.3). Along with the three control-flow attributes Case, Activity, and Timestamp, the event log includes the data attribute Travel expense overspent. The data attribute takes either the value 'true' or 'false'; this indicates whether or not the travel expense was excessive. We consider this data attribute as the target attribute for learning, and we use it to apply our technique's GGNN model M.
To begin, GRM loads an event log $L_\tau$. This event log $L_\tau$ is transformed into a dataset D, where each instance represents a sequence of activities; for the running example, the trace $\sigma_1$ is given in (12).

Event Log Transformation
Second, GRM transforms the dataset D into a set of IGs I. To this end, van Dongen and van der Aalst [19] and Diamantini et al. [20] have proposed methods to map sequences of events (i.e., traces) onto directed graphs of events, to enhance the transparency of the event log's traces in an isolated or aggregated manner. In these methods, the node of a graph instance represents an event.
However, we are here interested in the relevance scores of activities (i.e., event types) on the prediction outcome. Thus, we introduce a definition of the IG, in which a node denotes an activity.
Definition 8 (Process instance graph). Given the trace $\sigma_c$ representing a sequence of activities for dataset D, an IG is a tuple of two elements $\Psi_{\sigma_c} = (V_{\Psi_{\sigma_c}}, E_{\Psi_{\sigma_c}})$, where $V_{\Psi_{\sigma_c}}$ denotes the set of nodes extracted from the trace $\sigma_c$, and $E_{\Psi_{\sigma_c}}$ denotes the set of edges extracted from the trace $\sigma_c$. For an activity $a \in \sigma_c$, we define the projection function $f_v: a \to v$; $\forall a \in \sigma_c$, we apply $f_v(\cdot)$ to obtain $V_{\Psi_{\sigma_c}}$. Hence, each activity $a \in \sigma_c$ is mapped to a node $v \in V_{\Psi_{\sigma_c}}$. Furthermore, we add the pseudo-activity 'Start/End' in the form of a node to the set of nodes $V_{\Psi_{\sigma_c}}$. An edge $e$ connecting two nodes is represented by a tuple of two temporally ordered nodes.
We introduce the activity 'Start/End' as a node in $V_{\Psi_{\sigma_c}}$ of $\Psi_{\sigma_c}$, to indicate the start and end of the original trace $\sigma_c$. The GGNN model M expects the IGs to contain such a 'Start/End' activity, to preserve the correct ordering of the instances' activities in the model-learning and prediction phases. For example, the trace $\sigma_1$ from (12) is transformed into the IG $\Psi_{\sigma_1}$, as depicted in Figure 3.
Furthermore, the GGNN model M requires IGs in which each input edge has a discrete edge type assigned to it [15]. Thus, the edge type is also stored in the respective edges of $E_{\Psi_{\sigma_c}}$. For example, following the insertion of the edge type 'Start' between the source and target node, the edge is represented as a triple of source node, edge type, and target node. Figure 4 depicts the IG of our running example $\Psi_{\sigma_1}$, including its edge types.
In the last step of event log transformation, we numerically encode the categorical node label (i.e., activity) and edge label values of the IGs. The GGNN used in this paper requires a numerical encoding of the input data for calculating forward- and backward-propagation [15]. To ensure this, we one-hot encode the categorical label values of the IGs' nodes and edges (i.e., source node, edge type, and target node).
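The transformation described above can be sketched in a few lines of Python. The sketch maps every distinct activity of a trace to one node, adds the 'Start/End' node, derives typed edges from directly succeeding activities, and one-hot encodes labels. Only the edge type 'Start' is named in the text; the types 'End' and 'Forward' as well as the example activities are assumptions made for illustration.

```python
START_END = "Start/End"


def build_instance_graph(trace):
    """Return (nodes, edges); edges are (source, edge type, target) triples."""
    nodes = [START_END] + sorted(set(trace))
    path = [START_END] + list(trace) + [START_END]
    edges = []
    for i, (source, target) in enumerate(zip(path, path[1:])):
        if i == 0:
            edge_type = "Start"
        elif i == len(path) - 2:
            edge_type = "End"      # assumed name for the closing edge
        else:
            edge_type = "Forward"  # assumed name for all other edges
        triple = (source, edge_type, target)
        if triple not in edges:    # keep each typed edge only once
            edges.append(triple)
    return nodes, edges


def one_hot(value, vocabulary):
    return [1 if value == item else 0 for item in vocabulary]


# Hypothetical trace in which 'Submit' occurs twice but yields a single node.
trace = ["Submit", "Approve", "Submit", "Pay"]
nodes, edges = build_instance_graph(trace)
encoded_edges = [
    (one_hot(s, nodes), one_hot(t, ["Start", "Forward", "End"]), one_hot(d, nodes))
    for s, t, d in edges
]
print(nodes)
print(edges)
```

Because a node denotes an activity rather than an event, repeated activities create cycles in the graph instead of additional nodes.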

GGNN Model Creation and Training
GRM uses a GGNN to create and train the model M for process outcome prediction, using the set of IGs I. From the created model, activity relevance scores for individual IGs are determined during prediction. We select a GGNN model because it can directly manage the graph-oriented structure of process data; expressed otherwise, it can explicitly map process activities of IGs as nodes and even the relationships between these process activities as edge types and directions. Typically, other ML or DL algorithms are incapable of understanding the semantics of a process to the same extent, because they do not encode node and edge information separately from each other, and some neglect edge information entirely. Therefore, GGNN models are a promising candidate for capturing process semantics.
For our GGNN architecture, we used an adapted version of the architecture proposed in Li et al. [15]. Their GGNN extends the 'vanilla' GNN of Scarselli et al. [13] by using GRUs [18] and backpropagation through time (BPTT) [21] for parameter learning. GRUs resolve the problem of gradient vanishing [22], which occurs when performing backpropagation in GNNs for longer sequences [15]. Generally, an event log includes several sequences exceeding 100 steps [23]. In addition, the BPTT gradient-based technique enables us to learn the internal parameters of a GGNN more efficiently [14]. Such efficient computation is necessary because event logs can contain several million events.
According to the MPNN framework of Gilmer et al. [17], the architecture of the GGNN model can be described in terms of message-passing and readout phases. The message-passing phase receives as its input IGs of I and returns abstract node representations. In our case, an IG's nodes represent process activities. In the message-passing phase, these node representations $h^{(t+1)}$ are calculated via two steps. First, for a node $v$, it calculates messages $m_v^{(t+1)}$ by applying the message function $M_t$, as formalised in Eq. (14).
Messages express the interactions between nodes; here, these are the interactions between process activities of IGs. Second, given these messages, the node representation $h_v^{(t+1)}$ of node $v$ at time $(t + 1)$ can be calculated by using the node update function $f_{GRU}$, as shown in Eq. (15). The readout phase receives the final node representations and maps these to a predicted process outcome $\hat{o}$ via the readout function $R$, as formalised in Eq. (16). The predicted process outcome $\hat{o}$ is a real value lying between 0 and 1. The readout function $R$ (Eq. (16)) can be further described as follows [15]:
$h_G = \tanh\left(\sum_{v \in V} \mathrm{sig}\left(i(h_v^{(T)}, h_v^0)\right) \odot \tanh\left(j(h_v^{(T)}, h_v^0)\right)\right), \quad \hat{o} = \mathrm{sig}(h_G)$,
where the term $\mathrm{sig}(i(h_v^{(T)}, h_v^0))$ calculates the node relevance $r_v$ for node $v$, and the term $\tanh(j(h_v^{(T)}, h_v^0))$ returns the node representation of node $v$. $i$ and $j$ are neural networks. Both neural networks take as their inputs the concatenation of the final node representation $h_v^{(T)}$ and the initial node representation $h_v^0$. A graph-based representation vector $h_G$ is calculated by point-wise multiplying the output of both terms for each node, constructing the sum over all nodes, and inputting this through a tanh activation function. Then, a sigmoid function (sig) is applied to the vector $h_G$, to obtain a process outcome prediction $\hat{o}$.
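The readout described above can be sketched as follows. For brevity, the sketch assumes that the networks i and j return a single real value per node, so that the graph representation is a scalar; the two stand-in functions `i_net` and `j_net` and the node states are hypothetical and only illustrate the data flow.

```python
import math


def sig(x):
    return 1.0 / (1.0 + math.exp(-x))


def readout(h_final, h_initial, i_net, j_net):
    """Return (predicted outcome, {node: relevance score})."""
    relevance, total = {}, 0.0
    for v in h_final:
        features = h_final[v] + h_initial[v]  # concatenation of both states
        relevance[v] = sig(i_net(features))   # soft-attention weight r_v
        total += relevance[v] * math.tanh(j_net(features))
    h_graph = math.tanh(total)                # graph-based representation
    return sig(h_graph), relevance


# Hypothetical stand-ins for the learned networks i and j.
def i_net(features):
    return sum(features)


def j_net(features):
    return features[0] - features[-1]


h_final = {"Submit": [0.4, -0.2], "Approve": [1.5, 0.3]}
h_initial = {"Submit": [1.0, 0.0], "Approve": [0.0, 1.0]}
outcome, relevance = readout(h_final, h_initial, i_net, j_net)
print(outcome, relevance)
```

The relevance scores are a by-product of the prediction: the same sigmoid term that weights each node in the sum is reported as that node's score.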

More specifically, the term $\mathrm{sig}(i(h_v^{(T)}, h_v^0))$ operates as a soft-attention mechanism; it determines which activities of the IG $\Psi_{\sigma_c}$ are of greater and lesser relevance to the current graph-based process outcome. The term returns for each node $v$ a real-valued relevance score $r_v$. To capture the relevance scores for all activities of the IG $\Psi_{\sigma_c}$, we store these in a relevance score vector $r_{\Psi_{\sigma_c}} \in \mathbb{R}^{|V| \times 1}$.
Then, we min-max normalise the activities' relevance scores of $r_{\Psi_{\sigma_c}}$. For this, the minimum is set to 0, and the maximum is set to 1. Note: The relevance scores of the activities excluded from the IG $\Psi_{\sigma_c}$ are set to zero.
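A minimal sketch of this normalisation step is given below. The handling of the degenerate case in which all scores are equal (here: all scores become zero) is an assumption, as are the activity names.

```python
def normalise_relevance(scores, all_activities):
    """Min-max normalise the scores of one IG; absent activities receive 0."""
    lowest, highest = min(scores.values()), max(scores.values())
    span = highest - lowest
    normalised = {
        # Assumption: if all scores are identical, every score is set to 0.
        activity: (score - lowest) / span if span > 0 else 0.0
        for activity, score in scores.items()
    }
    return {activity: normalised.get(activity, 0.0) for activity in all_activities}


raw_scores = {"A": 0.25, "B": 0.5, "C": 0.75}  # activities contained in the IG
activities = ["A", "B", "C", "D"]               # 'D' does not occur in the IG
print(normalise_relevance(raw_scores, activities))
```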

Procedure
The goal of the evaluation is to assess (1) the efficacy of our technique and (2) its effectiveness [24]. We consider GRM to be efficacious if it delivers relevance scores for process activities with a high faithfulness [25]. Therefore, we evaluate the predictive quality (which determines the quality of the relevance scores) of the model and compare it against those of other state-of-the-art techniques. Furthermore, we verify the relevance scores by repeating the experiments after removing each instance's most/least relevant activity from one of the datasets, to observe changes in predictive quality. Thus, to evaluate the efficacy of GRM, we test the following hypotheses: (1) GRM exhibits a similar or superior predictive quality to other state-of-the-art algorithms for outcome prediction and (2) removing activities identified by GRM as being most relevant has a stronger negative impact on predictive quality than removing activities that it identifies as least relevant.
We consider GRM effective if it fulfils the stated goal of supporting process analysts in improving business processes. More specifically, we aim to close the gap between process discovery and process analysis, using GRM. We conduct a case study to evaluate the usefulness of the relevance scores determined through GRM for process analysts, in terms of identifying the root causes of process performance issues in the process flow.

Setup
To improve model generalisability, we randomly shuffle the process instances of each event log. For this, we perform a process-instance-based sampling to consider the process-instance-affiliations of event log entries. For each event log, we perform ten-fold cross-validation. Thus, in every iteration, the event log's process instances are split into a 90%-training and 10%-testing set. Additionally, we use 10% of the training set for validation; this prevents overfitting by implementing early stopping after ten epochs (i.e., learning iterations).
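The splitting scheme can be sketched as follows. The sketch shuffles case IDs (not individual events), builds ten folds, and reserves 10% of each training portion for validation; the seed, the helper names, and the synthetic case IDs are illustrative assumptions.

```python
import random


def case_folds(case_ids, k=10, seed=42):
    """Shuffle the distinct case IDs and distribute them over k folds."""
    ids = sorted(set(case_ids))
    random.Random(seed).shuffle(ids)
    return [ids[i::k] for i in range(k)]


def train_validation_test(folds, test_index, validation_share=0.1):
    """Use one fold for testing; split the rest into training and validation."""
    test = folds[test_index]
    rest = [c for i, fold in enumerate(folds) if i != test_index for c in fold]
    n_validation = round(len(rest) * validation_share)
    return rest[n_validation:], rest[:n_validation], test


cases = [f"case_{i}" for i in range(100)]
folds = case_folds(cases)
train, validation, test = train_validation_test(folds, test_index=0)
print(len(train), len(validation), len(test))
```

Splitting by case ID ensures that all events of a process instance end up in the same partition.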
We measure predictive quality (i.e., efficacy) using the following metrics: Area under the receiver operating characteristic (ROC) curve ($AUC_{ROC}$), Specificity, and Sensitivity [28]. $AUC_{ROC}$ measures a classifier's ability to avoid false classifications [28]. A major advantage of the $AUC_{ROC}$ over other popular measures, such as the Accuracy (overall correctness of a classifier) or F1-score (harmonic mean of Precision and Recall), is that it remains unbiased for a highly imbalanced class label distribution [29]. In outcome prediction scenarios, the distribution of class labels is typically imbalanced [23]. Additionally, we use Specificity (true negative rate (TNR) = 1 - false positive rate (FPR)) and Sensitivity (true positive rate (TPR)) to measure the classifiers' predictive quality.
The FPR and TPR are mapped onto the ROC curve's horizontal and vertical axes, respectively. Therefore, the Specificity and Sensitivity allow us to better interpret the $AUC_{ROC}$. For significance testing, we perform a Friedman test followed by a Nemenyi test (post hoc) as suggested by Demšar [30] for each dataset and metric. Finally, to gauge the robustness of the classifiers' predictions, we evaluate the standard deviation over the ten folds for each measurement.
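For reference, the three metrics can be computed as in the following sketch. The AUC is obtained from its rank-based interpretation (the probability that a randomly chosen positive instance is scored higher than a randomly chosen negative one, counting ties as one half); the labels, scores, and the 0.5 decision threshold are illustrative assumptions.

```python
def auc_roc(y_true, y_score):
    """Rank-based AUC: share of (positive, negative) pairs ordered correctly."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def sensitivity_specificity(y_true, y_pred):
    """Return (TPR, TNR) for binary labels encoded as 0 and 1."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


y_true = [1, 1, 0, 0]
y_score = [0.9, 0.4, 0.6, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in y_score]
print(auc_roc(y_true, y_score), sensitivity_specificity(y_true, y_pred))
```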
For the second part of the evaluation (i.e., evaluating the effectiveness of GRM), we use the best models (in terms of the $AUC_{ROC}$) from the first part of the evaluation, to determine relevance scores for single instances. We use the directly-follows graph (DFG) miner in pm4py to identify the process from the event log, and we colour the activities according to their relevance. DFGs are a user-friendly visualisation that 'shows which activities can follow another directly' [31]. To visualise multiple instances (i.e., the event log), we split them by outcome label (because the relevance scores are only useful for the predicted outcome of the instance) into two datasets and aggregate the relevance scores for each by finding the mean value.
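The aggregation described above can be sketched without any process mining library. The sketch counts directly-follows relations and averages the relevance scores per outcome label; it does not use the pm4py API, and the traces and scores are hypothetical.

```python
from collections import Counter, defaultdict


def directly_follows(traces):
    """Count how often an activity is directly followed by another one."""
    counts = Counter()
    for trace in traces:
        counts.update(zip(trace, trace[1:]))
    return counts


def mean_relevance_by_outcome(instances):
    """instances: iterable of (outcome label, {activity: relevance score})."""
    sums = defaultdict(lambda: defaultdict(float))
    sizes = Counter()
    for outcome, scores in instances:
        sizes[outcome] += 1
        for activity, score in scores.items():
            sums[outcome][activity] += score
    # Activities missing from an instance contribute a score of zero.
    return {
        outcome: {a: s / sizes[outcome] for a, s in activities.items()}
        for outcome, activities in sums.items()
    }


traces = [["A", "B", "C"], ["A", "C"]]
instances = [
    (True, {"A": 1.0, "B": 0.5}),
    (True, {"A": 0.5}),
    (False, {"B": 1.0}),
]
dfg = directly_follows(traces)
mean_scores = mean_relevance_by_outcome(instances)
print(dict(dfg), mean_scores)
```

The mean scores per outcome label can then be mapped to node colours of the DFG.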

Data
We evaluate GRM using four real-life event logs, whose characteristics are summarised in Table 2. Three of them originate from the Business Process Intelligence challenges; the other was provided by a mid-sized German home appliances vendor.
bpi2017w [32] contains event data describing the loan application process of a Dutch financial institute. We only consider workflow events, which are executed by humans. For the outcome prediction target, we select the attribute accepted. Therefore, GRM determines the relevancy of process activities with respect to the acceptance or rejection of a loan.
bpi2018al [33] describes the European Union's application process for German farmers (Application log). For the outcome prediction target, we select the attribute rejected, which is highly imbalanced. Therefore, GRM determines the relevancy of process activities with respect to the rejection or acceptance of direct payment applications.
bpi2020pl [34] describes the reimbursement process at the Eindhoven University of Technology (Permit log). For the outcome prediction target, we select the attribute travel expense overspent. Therefore, GRM determines the relevancy of process activities with respect to the adherence or non-adherence to travel budgets.
sp2020 [35] represents a customer service process for faulty home appliance devices in need of repair. We collected the dataset and published it along with documentation as part of this research [35]. The process begins with the creation of the repair order; then, it proceeds through the reception and analysis of the device, extending up to the actual repair and the final return of the device to the customer. We choose the attribute customer repair on time as the outcome prediction target. Therefore, GRM determines the relevancy of process activities with respect to the meeting or falling short of service agreements (in terms of repair time) with customers.
To run the experiments, we implemented GRM using Python. For reproducibility, the source code, event logs, and results can be found on GitHub.

Results for Predictive Quality
Table 3 presents the results (averaged over ten folds) for GRM and the baseline techniques. In terms of $AUC_{ROC}$, GRM outperforms all three baseline techniques for each dataset (significantly for bpi2018al). Taking a closer look (by considering Specificity and Sensitivity), it is seen that GRM is consistently significantly superior for the less frequent class (in the bpi2017w event log, the negative class is underrepresented, whereas in the other three logs it is overrepresented). We can see that the more distorted the class of interest is, the better GRM's results are for the weaker class compared to the baseline techniques. This observation accords well with the research in Kratsch et al. [29], which found that DL techniques outperformed traditional ML techniques for imbalanced target variables in process outcome prediction.
However, our results show that of the DL architectures, GGNNs clearly outperform LSTMs.
Meanwhile, GRM performs significantly worse on three of four datasets for the more frequent class. While GRM still performs reasonably well for some of the datasets (e.g., bpi2018al and sp2020), the results also suggest that GRM performs poorly for a more frequent class (e.g., bpi2020pl).
This part of the evaluation did not aim to prove that GRM is superior to other state-of-the-art techniques but rather to assure a reasonable predictive quality relative to the baselines. The $AUC_{ROC}$ values for GRM are better than those of the baseline techniques for all datasets; thus, we are confident that GRM can compete against state-of-the-art predictive business process monitoring (PBPM) techniques. However, when using GRM to determine the relevance scores for the more frequent class, the predictive quality (sensitivity or specificity), operating as a proxy for the faithfulness of the relevance scores, must first be assured by the process analyst.
To further substantiate the validity of the relevance scores, we created two new datasets from the sp2020 dataset, by removing the least and most relevant activity from each instance, respectively. The $AUC_{ROC}$ results in Table 4 confirm the hypothesis that removing an activity results in a lower predictive quality (i.e., less information for the model). More importantly, it confirms our hypothesis that the effect of removing the most relevant activity is significant (for $AUC_{ROC}$ and Specificity). Removing the least relevant activity from an instance has little impact on the $AUC_{ROC}$, and the Sensitivity even improves slightly. There is a noticeable difference for Specificity; however, this did not prove to be significant.
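The construction of the two perturbed datasets can be sketched as follows. The sketch removes all occurrences of the most or least relevant activity from a single instance; the tie-breaking rule (alphabetical order) and the example scores are assumptions made for illustration.

```python
def drop_activity(trace, relevance, which="most"):
    """Remove every occurrence of the most or least relevant activity."""
    pick = max if which == "most" else min
    # Sorting makes the choice deterministic when scores are tied (assumption).
    candidates = sorted(set(trace))
    target = pick(candidates, key=lambda activity: relevance.get(activity, 0.0))
    return [activity for activity in trace if activity != target]


trace = ["A", "B", "A", "C"]
relevance = {"A": 0.2, "B": 1.0, "C": 0.0}  # hypothetical normalised scores
without_most = drop_activity(trace, relevance, which="most")
without_least = drop_activity(trace, relevance, which="least")
print(without_most, without_least)
```

Applying the function to every instance of a dataset yields the two variants on which the model is then retrained and evaluated.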

Case Study
To evaluate the utility of GRM , we conducted a case study. For this, we sought an organisation that was actively engaging in process improvement and had event log data available for the respective processes. The company that provided us with the sp2020 event log is a premium supplier for home appliances, who strives for service excellence. Their portfolio comprises roughly 25 products (not considering remakes of device types). Their sales are exclusively performed by retail partners (i.e., no direct sales); however, customer service is primarily delivered by the company itself, giving it strategic value. One of their important target measures is the percentage of service orders fulfilled within five business days.
Several workshops were held, in which we learned about the company, its products, and the customer service process; these workshops included a visit to the repair shop. In return, we introduced them to process mining and began a data-driven analysis of their customer service process. In a joint effort between the head of customer service (as process owner), process analysts, and customer service agents (as process participants), we implemented GRM to identify and analyse the process in terms of delayed repairs.
From the class distribution in Table 2, it is evident that only 32.6% of all repairs could be completed within the desired time-frame of five business days.
Hence, the company was eager to improve their repair time and subsequent customer satisfaction. However, their process analysts struggled to identify the root causes for delays within the process execution. Figure 5 illustrates a DFG mined from the sp2020 event log. The frequency was represented by the activities' colours and edge thicknesses (darker blue/thicker = more frequent).
This process visualisation represents the current process-discovery capabilities of process mining software, as we identified from a recent market study.
While it offered the process analysts some insights pertaining to the repair time (e.g., a high degree of variation was found for non-timely service orders), the analysts struggled to identify root causes for the delays from the process flow. Log filtering was applied as a possible method for isolating the issues, although this was predominantly a tedious procedure of trial and error.
In contrast, Figure 6 shows a DFG mined from the sp2020 event log, augmented with the relevance scores determined through GRM. Each process activity is coloured according to its relevance score, which was determined by averaging the scores of all instances with the same outcome prediction (i.e., either positive or negative) contained in the log (darker colours correspond to higher relevance).
Presented with Figure 6 in a workshop, the process analysts were immediately drawn towards the process activity Approved. According to the process stakeholders, the activity indicates that the customers were required to provide approval for costs that were incurred for the repair but not covered by the warranty. Further analysis showed that the process was indeed delayed when the activity Approved occurred, not only whilst waiting for the approval but also because it occasionally took several days to even request approval from the customer. The company implemented an immediate redesign of the process, by starting low-cost repairs without approval; this was because the risk of losing an unsatisfied customer through long repair times exceeded the risk of bearing the costs. Looking at the next most relevant process activities, StatusRequest simply indicated that the customer became impatient with the long waiting times, whilst StockEntry suggested that missing spare parts delayed the process. Here, an immediate action was to increase the stock for all service points.
To summarise, GRM facilitated a problem-focused analysis of the business process, as opposed to the 'traditional', process mining-based process discovery, which is driven by the frequency of process activities and connections. Both figures provided insights regarding the delays in repair time. However, the process analysts found it easier to analyse the process model that was augmented with relevance scores based on the business goal. The activities marked as more relevant drew their attention and triggered immediate discussions, resulting in process redesign ideas.

Discussion
Process analysis, and root cause analysis in particular, is a challenging task and a significant endeavour for organisations, owing to the continuous need to improve business processes for lasting competitiveness [5]. Manual analysis of a process can be costly and time-consuming. Process mining has emerged as a data-driven technology to support process analysts. By definition, process discovery in BPM facilitates the identification of the as-is process (model) of an organisation [7, p.155], which is therefore the objective of discovery techniques in process mining. The discovered process model encourages data analysis from a process perspective, rather than, for example, tables or column charts, which omit the process dimension behind the data. As such, process models are an excellent starting point for process analysis. However, decision support systems in BPM must guide process analysts even further in their search for performance issues such as bottlenecks or rework.
Existing studies on process model-based analysis have tried to incorporate this aspect. Seeliger et al. [36] presented ProcessExplorer to suggest similar subsets of the process to the analyst. However, whilst recommendations were shown next to the process model, the process model itself was not enriched. Mannhardt [37] presented a multi-perspective process explorer allowing for the projection of performance statistics onto the process model. However, the performance statistics relied solely upon frequency and were not learned, as the relevance scores in our GGNN-based technique are. Another example of process model-based analysis was presented by van Eck et al. [9], who designed an extension of their composite state machine miner, a process discovery technique. They coloured the process nodes according to their degree of artefact interaction; that is, the 'correlations between sets of artefact states or transitions' [9]. However, their technique did not directly permit root cause analysis with respect to performance indicators.
Furthermore, the authors stated a limitation of their study: they did not evaluate their work with domain experts. We provide this evaluation via our case study.
To bridge the gap between process discovery and process analysis, we developed GRM as a process model-based analysis technique, to identify the relevance of process activities with respect to a business goal (i.e., a process outcome). The quantitative evaluation of GRM ensures trust in the validity of the relevance scores. GRM provides a reasonable predictive quality, because it can compete against state-of-the-art techniques for process outcome prediction. We were able to demonstrate the impact of process activities that were identified as more relevant on the predictive quality of the model. Furthermore, we evaluated GRM via a case study; the results suggest that the method can help process analysts identify root causes in the process flow and address performance issues.
Besides these contributions, this work also features several limitations. First, we evaluated the utility of GRM using only a single case study. While the results of this case study were promising, further qualitative evaluations should be conducted. Second, we argued that business process outcomes are imbalanced and that the problematic outcome is typically less frequent. While the case study showed that a violation of this assumption does not necessarily impact the utility of the relevance scores, it remains an aspect that should be carefully evaluated when applying GRM. Third, we did not further evaluate the impact of incorrect predictions on the relevance scores. Currently, we consider the relevance scores of an instance in the context of the predicted label, and this label may be incorrect. In future work, we plan to conduct a more detailed analysis of the faithfulness of the relevance scores on an instance level. Finally, GGNNs are computationally expensive to train. We did not rigorously evaluate efficiency; however, from the run-times it was evident that the experiments for the baseline approaches (especially RF) ran significantly faster than those conducted with GRM. Whilst this does not impair the theoretical contributions of our work, it may hinder its adoption in practice.

Conclusion
In this paper, we presented GRM, a GNN-based technique for determining activity relevance scores. The technique was split into three steps: transforming the input data into IGs, creating and training the GGNN model, and making predictions while determining the relevance scores. We validated GRM quantitatively, using four different datasets. We demonstrated the utility of relevance scores in closing the gap between process discovery and process analysis, by augmenting the discovered process models with outcome-oriented measures. A case study was conducted; the results suggest that GRM can guide process analysts in their search for process performance issues within the process flow.
Our work has important implications for both research and business applications. In terms of the academic community, our work is an example of applying ML to process mining to produce results on a process-model level rather than on an instance level [38]. While the latter approach might be suitable for supporting operations, process analysts and managers require actionable insights to achieve long-term improvements for business processes [38].
We present a novel method with a problem-focused approach. Our technique requires a business goal to be specified in the form of a performance measure.
The model learns towards this goal rather than using heuristics (e.g., frequency, similarity, or distance measures); such techniques can provide more guidance for a process analyst considering a business problem [9].
We envisage major implications for business practice. Process mining is rapidly gaining momentum in practice. Davenport recently suggested that it might even trigger 'a new era of process management' [39]. Most commercial process mining vendors seek to bridge the gap between process discovery and analysis, by adding business intelligence capabilities to their solutions (see footnote 5). Several solutions also offer root cause analysis, combined with the deviations identified through conformance checking. Our work highlights the potential of process model-based analysis. Placing the discovered process model at the centre of the analysis facilitates a process-aware analysis of the data. To be useful for analysis beyond discovery, the identified process models must offer additional information besides frequency and throughput times.
By proposing GRM, our work gives practitioners a method of enriching their process mining analyses with relevance scores oriented towards a business goal. The implementation of process mining often requires large investments from organisations in infrastructure and software licensing, as well as a skilled workforce. Consequently, organisations are under pressure to realise a fast return on this investment. Techniques and methods facilitating problem-focused analysis are thus urgently required in practice.
We see several directions for future research. We presented GRM as a technique to close the gap between process discovery and process analysis. However, we believe that GNNs can be of use in other phases of the BPM life-cycle. In another work, we have shown how GNNs can provide explainability for predictions in the (predictive) monitoring phase [40]. Furthermore, the relevance scores could be used as inputs for some of the redesign heuristics proposed by Reijers and Mansar [41]. For example, the heuristic task elimination suggests unnecessary tasks that can be removed from a business process. GRM could indicate irrelevant process activities with respect to a defined process goal.
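As a rough sketch of how relevance scores might feed the task elimination heuristic, the snippet below lists activities whose averaged relevance falls under a cut-off. The function name, the threshold value, the use of absolute values (in case scores are signed), and the toy activities Archive and Notify are all assumptions made for illustration; a low score would only nominate an activity for closer inspection and does not by itself show that the activity is unnecessary.

```python
def elimination_candidates(avg_relevance, threshold=0.05):
    """Return activities whose absolute averaged relevance is below `threshold`.

    The result is ordered from least to more relevant. Both the threshold
    and the use of absolute values are assumptions made for this sketch.
    """
    low = [
        (abs(score), activity)
        for activity, score in avg_relevance.items()
        if abs(score) < threshold
    ]
    return [activity for _, activity in sorted(low)]


# Toy scores with invented values; "Archive" and "Notify" are hypothetical.
toy_scores = {"Approved": 0.70, "StockEntry": 0.40, "Archive": 0.01, "Notify": 0.03}
candidates = elimination_candidates(toy_scores)
```

The cut-off would need to be chosen per process and per performance measure, ideally together with the process stakeholders.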
We proposed GRM as a technique capable of capturing the semantics of a business process. However, we think that this capability of GNNs could be exploited even further. Future work should consider using a formal process model such as a Petri net (instead of a graph) as input. This would allow analysts to determine not only relevance scores for activities but also, for example, transitions that represent decision points in the process. Following this line of thought, GNNs could also be used in combination with decision models [42].
Finally, we plan to extend GRM by considering the relevance of contextual attributes of events and process instances alongside the process flow. Contextual information can make a valuable contribution to predictive models [43]. Potential challenges include incorporating context into the GGNN architecture and visualising the contextual attributes in the process model.