Supervised learning of process discovery techniques using graph neural networks

Automatically discovering a process model from an event log is the prime problem in process mining. This task has so far been approached as an unsupervised learning problem through graph synthesis algorithms. Algorithmic design decisions and heuristics allow for efficiently finding models in a reduced search space. However, design decisions and heuristics are derived from assumptions about how a given behavioral description (an event log) translates into a process model and were not learned from actual models, which introduces biases into the solutions. In this paper, we explore the problem of supervised learning of a process discovery technique. We introduce a technique for training an ML-based model using graph convolutional neural networks, which translates a given input event log into a sound Petri net. We show that training this model on synthetically generated pairs of input logs and output models allows it to translate previously unseen synthetic and several real-life event logs into sound, arbitrarily structured models of accuracy and simplicity comparable to existing state-of-the-art techniques in imperative mining. We analyze the limitations of the proposed technique and outline avenues for future work.


Introduction
Automated process discovery (APD) is the problem of discovering a process model M from an event log L [1]. State-of-the-art techniques approach APD as an unsupervised learning problem, trying to achieve Pareto-optimality of M regarding fitness and precision wrt. L, generalization wrt. future traces not yet seen in L, and structural simplicity of M [2], and further to ensure soundness of M [3]. APD of flow-based models, such as Petri nets, is primarily approached algorithmically by synthesizing a graph from behavioral abstractions of L [4-6], as an optimization problem over linear [7] or logical constraints [8], or by genetic algorithms searching for optima in the space of models [2].
Reviews and benchmarks observe that, despite impressive progress, no unsupervised APD technique consistently returns fitting, precise, simple, and sound models on all problem instances in feasible time [9,10]. Specifically, each technique is based on different algorithmic design decisions and uses different heuristics for efficiently finding models in the available search space, resulting in an inherent bias favoring some quality criteria over others that cannot be overcome [10]. These design decisions and heuristics are derived from assumptions about how a given behavioral description (an event log) translates into a process model; they are made in an ad-hoc manner rather than derived systematically, leading to low generalization [11], as we discuss in Section 2.
In contrast, human modelers train their modeling skills in a supervised fashion by learning which model structures are adequate solutions for which behavior, and then apply these skills to build a solution piece-wise along the information provided [12-14]. In this paper, we study whether it is possible to design a process discovery technique that more directly emulates a human modeler, i.e., to (1) learn how to construct models from examples of event logs and corresponding models, and then (2) transfer this learned knowledge to construct process models for event logs not seen previously.
We formulate the supervised process discovery problem. We want to develop a technique t that can train, for given pairs ⟨L_i, M_i⟩, i = 1, …, k, of (synthetically generated) event logs and corresponding process models, a discovery function d(·) = t(⟨L_1, M_1⟩, …, ⟨L_k, M_k⟩) so that d(·) can translate an (unseen, real-life) event log L into a sound model M = d(L) that has high accuracy wrt. L and is structurally simple. Where supervised learning with artificial negative events [15] learns a process model, the supervised process discovery problem is to learn the discovery technique itself.
In the following, we limit ourselves to imperative models and propose a first solution to this problem for Petri nets as target modeling language, where d = (f, N_1, …, N_k) is an algorithm f constructing and updating a graph G using graph convolutional neural networks N_1, …, N_k. f first encodes the general translation task from a given log L into the space of possible models as a graph G. G encodes the input log L with edges to a template Petri net M having one transition t_a per activity label a in L and candidate places and arcs between these transitions. f then uses N_1, …, N_k to select which candidate places shall remain in the model based on the information in L. The N_i update state vectors at event and transition nodes and at edges between nodes of G to propagate behavioral information from the event log to the transitions. The technique t trains N_1, …, N_k on the pairs ⟨L_i, M_i⟩ by following an iterative approach based on the process of process modeling observed in human modelers [13], to learn which places of target model M_i shall remain given the structure of the input event log L_i. Through training, the N_1, …, N_k learn how to update the weights in G to select places from the available candidates; f uses beam search [16] and S-coverability checks [3] to prune the search space.
As supervised process discovery has not been attempted before in this general form, we conducted experiments to test whether it is feasible to learn from synthetic data a process discovery technique with performance comparable to state-of-the-art techniques on real-life problem instances. With the aim of enabling follow-up research, all experiments were conducted on a regular PC without relying on GPU support. We implemented the above technique using DGL (https://docs.dgl.ai/) with PyTorch and trained d(·) on a synthetically generated training set of 2000 block-structured process models and corresponding event logs of varying size and complexity. We evaluated the models returned by d(·) wrt. accuracy and simplicity on (1) previously unseen synthetic event logs of generated block-structured models with the same representational bias as the training data, and on (2) unseen real-life event logs outside of the representational bias of the training data.
Our results show that d(·) trained with our technique is able to (1) discover process models with high accuracy wrt. unseen synthetic logs and even wrt. the structure of the data-generating process; accuracy and simplicity are comparable to state-of-the-art APD techniques. Further, (2) applying d(·) on unseen real-life event logs also results in process models with high accuracy and high simplicity, comparable to state-of-the-art APD techniques. Specifically, unlike other APD techniques, d(·) is capable of returning sound, simple, non-block-structured process models despite being trained only on block-structured models. However, for some problem instances the produced models may only be easy sound (contain dead parts but no deadlocks).
We replicated these findings in an additional sensitivity analysis where we varied the complexity and representational bias of the training data and obtained similar performance of d(·), specifically wrt. real-life event logs. These findings not only demonstrate the feasibility of solving the supervised process discovery problem; they also show that our technique can effectively and robustly generalize beyond the block-structured representational bias of the synthetic training data to solve unseen real-life problem instances outside that bias.
This paper is an extension of work originally presented at ICPM [17]. We more explicitly discuss the requirements, design decisions, and parameters of the defined learning task. We generalize the proposed approach by clearly stating all contained parameters, rather than proposing a fixed set, and explain the concepts in more detail. Further, we differentiate between the fundamental approach and the various heuristics used to obtain feasible performance on a standard PC; a consequence is a correctness proof for returning sound models. For the evaluation, the design process and parameter selection are explained in detail. Both the quantitative and the qualitative evaluation are extended: the former with more stable precision and recall measures, the latter with a more detailed discussion of the produced process models. To enable further research on this novel problem, we systematically explore the reasons for the technique's current shortcomings and various assumptions in our problem formulation and technique, and outline several open learning problems and research questions as avenues for future work on supervised process discovery. For one of these new research questions, we conducted an additional structured experiment on sensitivity to the training data.

Related work
We discuss literature on quality measures, biases and design decisions in APD, modeling as a human task, and graph neural networks.
The quality of a process model M wrt. an event log L is assessed along 5 criteria. Fitness is the share of process executions (traces) of L that is described or accepted by M; precision is the share of process executions described by M that is also in L. Alignment-based fitness and estimating precision via escaping edges (only the first steps of M not in L) [18] are most widely used [10]. The monotone precision/recall measures [19] rank models more consistently but are slower to compute. Generalization is the likelihood that M accepts another unseen trace from the process that generated L and can only be estimated [18]. Simplicity states how clearly M describes the logic or cause-effect relations of the process that generated L and is estimated through graph size and complexity [10]. M is sound if every partial execution from the initial state can be extended to a terminating execution in a designated final state [3]. A sound workflow net N can be decomposed into so-called S-components; conversely, if N cannot be decomposed into S-components, i.e., is not S-coverable, it is not sound. We exploit this property to exclude non-sound solutions.
The fundamental challenge in APD is that the target search space of models is too large to exhaust [2,7]. Genetic algorithms [2] effectively explore the search space but take too long to find a satisfactory model for practical applications [10]. Time-efficient algorithmic solutions to APD synthesize a graph from behavioral abstractions of the log [4-6]. Thereby, design decisions and heuristics bias the algorithm regarding fitness, precision, simplicity, and soundness. Enforcing a specific representational bias [20] in the problem formulation, e.g., limiting the search space to block-structured models [2,4], ensures simplicity and soundness, although sound models with high fitness and precision may lie outside the chosen representational bias [10]. Algorithmic design decisions can favor or even guarantee solutions for a specific quality criterion at the cost of a loss in another criterion, e.g., ensuring fitness [4,7,8] lowers precision [8,10]. Heuristics-based filtering and pattern detection on behavioral abstractions of event data result in models of high fitness and precision [5,6], which in turn may be unsound and have high complexity [10]; the heuristics may not generalize to new data [21] or larger samples [11]. Techniques relying on behavioral abstractions of event logs [4-6,22] fail when the log contains behaviors not preserved by the abstraction [22,23]. Techniques avoiding behavioral abstractions solve an optimization problem over linear [7,24] or logical constraints [8] over the event log, which ensures fitness at the cost of precision and soundness [8,10] or prohibitively high running times [24].
The cognitive process of humans creating models has been studied empirically. Humans create models by iterating three phases: comprehend (a chunk of the information about the process), model (by adding or removing formal model structures related to the comprehended chunk), and reconcile (by reorganizing the model layout to better comprehend the created model structures) [13]. Empirical evaluations have shown that breadth-first modeling strategies along the structure of the target model lead to the highest precision and recall [12].
Neural networks have not been exploited for APD as of yet. However, they have been developed for various graph problems [25]. Such networks learn representations of graphs, nodes, and edges based on an information propagation process to perform tasks like node classification or link prediction. Generative graph neural networks have been developed that learn and represent conditional structures on graphs [26]. The attention mechanism, which can redistribute the weights of different inputs, has proven useful in generative tasks [27] and has been exploited in graph neural networks as well [28]. For recurrent generative tasks, heuristic search algorithms like beam search with and without length normalization have been proposed to traverse the search space efficiently and effectively [29].
Within process mining research, graph neural networks have been used to determine the relevance of process activities for performance in a prediction setting [30]. Similar to our approach, the authors represent the input log as a graph and learn the relation between activities within the graph and the performance of the process. Peeperkorn et al. [31] have shown that RNN and LSTM networks do not easily learn a representation of the structure of a particular process to then correctly reproduce traces of the input log, especially in the presence of parallelism and loops. Neither of these works trains a neural network to generate the process model itself, which is the focus of this research.

Defining the learning task
The fundamental task of process discovery is to translate a behavioral specification, in this case an event log L, into a formal process model, in this case a Petri net N. Human modelers solve the same task, albeit manually and typically from a natural language specification [12]. They do so through a semi-structured process called the process of process modeling [13,14], which has been studied empirically. In the following, we recall in Section 3.1 the process of process modeling and define in Section 3.2 three formal learning problems for emulating the process of process modeling by an automated process discovery technique. We then encode all three learning problems in a graph (Section 3.3), allowing us to define a supervised learning task for process discovery over this graph (Section 3.4).
We chose to focus on Petri nets, an imperative modeling language [32], in this research for the following reasons. The process of process modeling has been primarily studied for imperative modeling languages; the learning problems we identify subsequently may not generalize to the creation of models in declarative modeling languages [33]. Further, Petri nets employ only three concepts: two types of nodes (places and transitions, easily distinguished through the bipartite structure) and arcs. This simplifies the formalization of the learning tasks, as no types or modalities of nodes or edges have to be encoded in the problem formulation.

Process of process modeling
The process of process modeling has been observed empirically [13,14]. Human modelers solve the task of creating a process model from a behavioral specification by iterating over three cognitive phases, as illustrated in Fig. 1. We explain these phases for the behavioral specification of an event log L = [⟨a, b, c, d⟩, ⟨a, c, b, d⟩, ⟨a, b, c, e⟩, ⟨a, c, b, e⟩], i.e., a is followed by b and c in parallel, followed by a choice between d and e.
During the comprehension phase, the modeler relates the specification to the partial model created so far, and identifies how to further encode behavioral information in the specification using constructs of the modeling language [34]. For example, in the situation of Fig. 1 the modeler faces two choices: 1. Where ''best'' to continue modeling: to further model the behavioral information after a, i.e., ⟨a, b, c, …⟩, ⟨a, c, b, …⟩, or to model the behavior after b, i.e., ⟨…, b, d⟩, ⟨…, b, e⟩? 2. How to encode the behavioral information between, e.g., a, b, and c: as a sequence using a new place between b and c, or as parallelism using a new place between a and c?
The first choice determines the order in which the modeling task is solved, whereas the second choice determines whether the resulting model is correct for the given specification. Empirical observations suggest that a breadth-first traversal along the graph being constructed results in higher precision and recall of human-created models [12]. In Fig. 1, the modeler should pick ''a new place between a and c'', as only this choice will satisfy all traces.
During the modeling phase, the modeler changes the model on the modeling canvas by creating new modeling constructs or removing existing modeling constructs. In our example, the modeler adds place p_2 and arcs (a, p_2) and (p_2, c), see Fig. 1 (top right). Skilled modelers can solve modeling tasks by only adding modeling constructs (and not removing any) through careful choices during longer comprehension phases [13].
During the reconciliation phase, the modeler improves the alignment between the created model and the behavioral information in the specification. This can be observed in modelers adjusting the layout to better represent the behavioral information from the specification, e.g., by moving place p_1 so that p_1 and p_2 visually convey a parallel split [35], see Fig. 1 (bottom). For most modelers, organizing the available information through reconciliation prior to comprehending new information improves model quality and reduces modeling errors [14], as it reduces mental effort in human modelers [34].
Human modelers perform these three phases iteratively, multiple times, until they have obtained a model they consider to adequately describe the behavioral specification. Individual phases may be skipped depending on modeler preferences or the specific situation on the modeling canvas [14]. When training human modelers in their abilities to create models, modelers are given example specifications and feedback about whether their created models are indeed adequate. The human modeler then adapts their modeling process based on this feedback.

Formalizing the modeling process for automated process discovery
In the following, we investigate how to encode principles of the process of process modeling for automating the task of creating a Petri net N that describes a given event log L. At a minimum, we need to formalize the following:

R1 Relating behavioral information from log L to a (partial) model N, as observed in comprehension and reconciliation.
R2 Encoding which behavioral information is described by a particular model construct in the (partial) model.
R3 Determining where and by which formal model constructs the partial model should be extended, as observed in the comprehension phase.
R4 Extending a partial model with a new model construct (to subsequently relate the extended model to L), as observed in the modeling phase.
R5 Deciding when the partial model is considered complete and to stop extending the model.
We can omit deleting model constructs, as we start from an empty canvas and modelers also succeed by only adding model constructs. We disregard model layout and focus on creating model structure only, as a model layout can be computed from the model structure. Thus, we omit the activities a human modeler undergoes during reconciliation. While a human modeler has an unbounded ''solution space'' for identifying new model constructs, we have to bound this solution space to make the task computationally feasible. We do so by encoding the discovery problem in a graph G which contains both the event log L and a candidate Petri net N with a fixed set of transitions T (one per activity in L) and candidate places P. Links in graph G from L to N then formalize R1. We can then define two primary learning tasks that correspond to human decision making in the process of process modeling:

L1 To iteratively select the most suitable candidate place p_i from P and to extend the partial model with the chosen candidate places p_i, formalizing R3 and R4.
L2 To detect when the model is complete and to return the sub-graph of N with the selected candidate places as the resulting model, formalizing R5.
Note that while a human modeler often works in chunks of multiple model elements [13], we here formulated L1 to extend the partial model with one place at a time to simplify the learning task. To be able to learn making these decisions, we also have to learn:

L3 To represent and iteratively propagate behavioral information from L to the partial model N (comprehension) and between the nodes of the partial model N (reconciliation), formalizing R2.
Fig. 2 summarizes this approach on a high level. Section 3.3 details how to encode the discovery problem in a graph G. Section 3.4 defines learning tasks L1 and L2. In Section 4, we explain how to train a function d to learn how to solve L1 and L2 in an iterative manner by also learning L3.

Encoding the discovery problem in a graph
We encode the problem of translating a given input event log L into a Petri net N as a graph G consisting of three parts, as illustrated in Fig. 3.

(1) We encode L as a trace graph. (2) We encode the solution space of all possible models for N as a candidate Petri net (overapproximating the required places and arcs). (3) Links from event nodes in the trace graph to transitions in the candidate Petri net encode which transition shall describe which event.
After recalling some notation, we first define the basic solution space in Section 3.3.2 and then discuss heuristics to reduce the solution space for more efficient search in Section 3.3.3.

Preliminaries
We write A for the (finite) set of activity names, extended with > ∈ A and | ∈ A for artificial start and end. A log L (over A) is a finite multiset of traces, where a trace is a finite sequence σ = ⟨>, a_1, …, a_n, |⟩ ∈ A*. Each occurrence of an activity a_i ∈ σ is called an event.
The α-relations over A [22] serve as basis for many behavioral abstractions of an event log L over A. Let a, b ∈ A. Directly-follows: a >_L b iff ⟨…, a, b, …⟩ ∈ L; k-eventually-follows: a >^k_L b iff b occurs at most k events after a in some trace of L. We recall notations for Petri nets with τ-transitions and refer to [3] for definitions. A Petri net N = (P, T_A ∪ T_τ, F) has places P, transitions T = T_A ∪ T_τ, where T_A and T_τ are visible and invisible transitions (i.e., the T_τ do not occur in observable firing sequences of N), respectively, and arcs F. In a workflow net N, every node of N is on a path (along F) from the unique source place i ∈ P (no incoming arcs) to the unique sink place o (no outgoing arcs).
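To make these relations concrete, the following minimal Python sketch computes the directly-follows and k-eventually-follows relations for the running example log of Section 3.1. The list encoding of traces and the helper names are our assumptions for illustration, not the paper's implementation:

```python
def directly_follows(log):
    """a >_L b: b occurs immediately after a in some trace of L."""
    rel = set()
    for trace in log:  # traces include the artificial start '>' and end '|'
        rel.update(zip(trace, trace[1:]))
    return rel

def eventually_follows(log, k):
    """a >^k_L b: b occurs at most k positions after a in some trace."""
    rel = set()
    for trace in log:
        for i, a in enumerate(trace):
            for b in trace[i + 1 : i + 1 + k]:
                rel.add((a, b))
    return rel

# Running example from Section 3.1
log = [['>', 'a', 'b', 'c', 'd', '|'], ['>', 'a', 'c', 'b', 'd', '|'],
       ['>', 'a', 'b', 'c', 'e', '|'], ['>', 'a', 'c', 'b', 'e', '|']]
assert eventually_follows(log, 1) == directly_follows(log)  # k = 1 coincides with >_L
```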

Encoding the solution space as a graph
The graph G consists of a trace graph, a candidate Petri net graph, and links.
We encode the log L as the trace graph (V_event, E_event) containing: 1. one start and one end event node v_>, v_| ∈ V_event, and 2. for each trace σ = ⟨>, a_1, …, a_n, |⟩ ∈ L, a chain of unique event nodes v_σ,1, …, v_σ,n, one per event a_i, connected by edges from v_> through the chain to v_|.

The candidate Petri net graph (P, T, F) defines a superset of the candidate places P and arcs F needed to describe the behavior in L; within this superset we search for the target model N. T contains one transition t_a ∈ T for each activity a ∈ A in L (>, | ∈ A). Similar to the α-algorithm [22], we fully characterize each candidate place p and its incoming and outgoing arcs by a pair (X, Y), X, Y ⊆ T, of input and output transitions. The pair (X, Y) defines place p_{X,Y} with arcs (t, p_{X,Y}), t ∈ X, and (p_{X,Y}, t), t ∈ Y. Within the set of all possible candidate places (and arcs) P = {(X, Y) | X, Y ⊆ T} we search for the Petri net (P′, T, F′), with places P′ ⊆ P and corresponding arcs F′ ⊆ F, that best describes L.
Pre-processing can reduce this search space, see Section 3.3.3. Links are edges from event nodes to transition nodes. For each event node v_σ,i representing an event of activity a_i, we add a directed edge (v_σ,i, t_{a_i}) to transition t_{a_i}. Thus, the graph structure encodes which transition nodes shall model which events.
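As an illustration, the following Python sketch assembles the three parts of G as plain node and edge lists. The tuple-based node encoding and the helper name are hypothetical simplifications; the actual implementation uses DGL graphs:

```python
def build_discovery_graph(log, candidate_places):
    """Sketch: G = trace graph + candidate Petri net + links.
    candidate_places is a set of pairs (X, Y) of input/output activity labels."""
    activities = {a for trace in log for a in trace}
    transitions = {a: ('transition', a) for a in activities}   # one t_a per activity
    nodes, edges = list(transitions.values()), []
    for X, Y in candidate_places:                               # candidate place p_(X,Y)
        p = ('place', (X, Y))
        nodes.append(p)
        edges += [(transitions[a], p) for a in X] + [(p, transitions[b]) for b in Y]
    start, end = ('event', '>'), ('event', '|')                 # shared v_> and v_|
    nodes += [start, end]
    for s, trace in enumerate(log):
        prev = start
        for i, a in enumerate(trace[1:-1], start=1):            # unique event nodes per trace
            v = ('event', (s, i, a))
            nodes.append(v)
            edges.append((prev, v))                             # trace-graph chain
            edges.append((v, transitions[a]))                   # link: t_a models this event
            prev = v
        edges.append((prev, end))
    edges += [(start, transitions['>']), (end, transitions['|'])]  # links for start/end
    return nodes, edges
```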

Reducing the solution space
We can limit the set P of candidate places to contain only those places which are compatible with the behavioral information in log L. For this, we use the α-relations, which overapproximate the behavior in L. Behavior not described by the α-relations is not in L, and N does not have to describe such behavior. Thus, we can limit the candidate places P to those places justified by the α-relations, as follows.
For every two activities a and b where the eventually-follows relation a >^k_L b holds for some k up to a specified maximum K, the one-to-one place ({a}, {b}) is added. Note that all such places have a single incoming and a single outgoing transition, i.e., for a given maximum K we define the set P^(1-1) = {({a}, {b}) | a >^k_L b, k ≤ K}. A one-to-many place ({a}, T_out), with one incoming transition a and many outgoing transitions T_out, is constructed by combining one-to-one places ({a}, {b_1}), …, ({a}, {b_m}) ∈ P^(1-1) into ({a}, {b_1, …, b_m}); these places form the set P^(1-n). Many-to-one places (T_in, {b}) ∈ P^(n-1) are defined correspondingly.
Lastly, the many-to-many places P^(n-n) are constructed by combining many-to-one and one-to-many places in a similar fashion. We can limit G to contain only the candidate places P^(1-1) ∪ P^(1-n) ∪ P^(n-1) ∪ P^(n-n). Places not contained in this set are inconsistent with the behavioral information in L and not needed to describe L (up to distance k).
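A sketch of the candidate construction for the one-to-one and one-to-many cases, reusing eventually_follows from the earlier sketch. The explicit subset enumeration and its bound max_out are simplifying assumptions of ours for tractability of the illustration:

```python
from itertools import combinations

def one_to_one_places(log, K):
    """P^(1-1): a place ({a}, {b}) for every pair with a >^k_L b for some k <= K."""
    return {(frozenset({a}), frozenset({b})) for (a, b) in eventually_follows(log, K)}

def one_to_many_places(p11, max_out=3):
    """Sketch of P^(1-n): combine one-to-one places sharing input transition a
    into places ({a}, T_out); max_out bounds |T_out| (assumption, not in the paper)."""
    outputs = {}
    for X, Y in p11:
        (a,), (b,) = X, Y
        outputs.setdefault(a, set()).add(b)
    places = set()
    for a, outs in outputs.items():
        for n in range(2, min(max_out, len(outs)) + 1):
            for T_out in combinations(sorted(outs), n):
                places.add((frozenset({a}), frozenset(T_out)))
    return places
```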

Defining the learning task
Graph G essentially defines the learning task of identifying the subset P′ ⊆ P that restricts the transitions T to the behavioral patterns present in L, e.g., choices and parallelism.

A process discovery algorithm, such as the α-algorithm, constructs this subset of P from L, e.g., from the α-relations alone.

In contrast, a human modeler learns which candidate places to select from pairs ⟨L_i, N_i⟩ of example specifications L_i and models N_i, as explained in Section 3.1. In Section 3.2, we identified two primary decision problems L1 and L2.
To solve L1 (to select the most likely place to add to the model), we propose to train a function d on examples ⟨L_i, N_i⟩ to (learn to) estimate the likelihood that a specific place v_i ∈ P is best suited to be included in N, given that places v_1, …, v_{i−1} were chosen before, i.e., p(v_i | v_1, …, v_{i−1}) should be maximal.
This corresponds to choosing the next ''best'' modeling construct on a given partial model during the comprehension phase, as illustrated in Fig. 1, where the modeler estimates p(p_3 | p_1, p_2) to be highest. However, we have to make sure that the entire set P′ of currently selected places describes the most likely solution for a specific log L, i.e., the joint probability p(P′, π | L) = ∏_{i=1}^{|P′|} p(v_{π(i)} | v_{π(1)}, …, v_{π(i−1)}) for an ordering π of P′ has to be maximal. Note that the actual ordering of P′ is irrelevant in a graph, as the order in which nodes were added to the graph does not influence what the graph currently represents; i.e., in Fig. 1 (top left), the modeler does not have to know whether first p_2 was added and then p_1, or vice versa. Therefore, ideally p(P′, π | L) is independent of π. The marginal joint probability models this probability over all permutations P(P′) of P′: p(P′ | L) = ∑_{π ∈ P(P′)} p(P′, π | L). Thus, p(P′ | L) can be used to find arg max_{P′} p(P′ | L) for event log L, i.e., to find among the candidates the place p_{i+1} ∈ P \ P′ so that the probability p(P′ ∪ {p_{i+1}} | L) is maximal for the given log L.
The second decision problem L2 is to estimate the likelihood that the set of chosen candidate places P′ is incomplete for L. As for L1, we use a marginal joint probability p_add(P′ | L) to model the probability that either another candidate place should be added or the places P′ sufficiently describe L.
In Section 4, we present an approach to learn estimators for the parameters of p(P′ | L) and p_add(P′ | L) by training on pairs ⟨L_i, N_i⟩ of example logs L_i and models N_i. In line with the sequential process of process modeling explained in Section 3.1, we make both estimators part of a sequential procedure d that selects candidate places one-by-one.

Approach
We detail our approach for learning, from training examples ⟨L_i, N_i⟩, a function d(L) = N that can construct a Petri net N from an event log L. We first encode the discovery problem for L in the graph G (see Section 3.3). We define d itself as a sequential procedure that calls upon a set of (graph) neural networks to emulate specific steps of the process of process modeling, as visualized in Fig. 4.
Each (graph) neural network solves one of the learning problems identified in Section 3.

Sequential candidate selection
Function d uses two decision making neural networks (NNs), the ''Select candidate'' NN (for L1) and the ''Stop'' NN (for L2). However, as stated above, these NNs have to make their decisions based on behavioral information from L encoded in the trace graph in G.
Graph convolutional neural networks (GCNs) can propagate and process information through G to encode this behavioral information in a node's feature vector h_i (also called node embedding). We therefore introduce a feature vector h_i for each node v_i ∈ V of G; h_i allows encoding how v_i is related to each activity a ∈ A recorded in L. Initially, each node v_i is given a one-hot encoded feature vector h_i^(0) of length |A| denoting its activity label, concatenated with its frequency in the event log, if available. The h_i^(0) for candidate places are zero vectors.
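A minimal sketch of this initialization, assuming the node encoding of the earlier graph-construction sketch and a single appended frequency entry (the exact feature layout and the freq dictionary are our assumptions):

```python
import torch

def initial_features(nodes, activities, freq):
    """h_i^(0): one-hot activity label (length |A|) concatenated with the node's
    frequency in L; candidate places keep the zero vector."""
    idx = {a: i for i, a in enumerate(sorted(activities))}
    feats = []
    for kind, payload in nodes:
        h = torch.zeros(len(activities) + 1)
        if kind == 'event':
            _, _, a = payload if isinstance(payload, tuple) else (None, None, payload)
            h[idx[a]] = 1.0
            h[-1] = freq.get(a, 0.0)       # frequency of the activity in the log
        elif kind == 'transition':
            h[idx[payload]] = 1.0
            h[-1] = freq.get(payload, 0.0)
        feats.append(h)                     # places: zero vector
    return torch.stack(feats)
```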
GCNs allow updating a node's feature vector h_i based on the feature vectors of its (indirect) neighbors, taking the surrounding graph structure into account. This allows us to train a GCN ''propagation network 1'' (PN1) to process and propagate the behavioral information from the event log nodes in G to the place nodes in G (for L3), such that the NNs can make their decisions for selection, further detailed in Section 4.1.1.
d then solves the task of sequentially identifying the candidate places P′ in two steps: deciding which candidate place to select next (Section 4.1.2), and deciding when to stop selecting more places (Section 4.1.3). For both, a regular ''decision making'' NN is sufficient to model the probability distribution on nodes for selection; see Fig. 4.
As the method is sequential, more propagation through G is required by a second GCN, PN2, to aggregate the NNs' past decisions, as future decisions are conditionally dependent on them (Section 4.1.4). This corresponds to learning task L3 for repeated reconciliation phases in the process of process modeling, see Section 3.2. These four NNs thus each have their own smaller learning task, aligned with the steps in the process of process modeling. Fig. 4 shows how they are connected and modify G, which is similar to the graph generation process proposed in [26]. The following sections detail each NN's learning task.

Propagation network 1
For the initial propagation of behavioral information from the trace graph to all candidate places P in G, we use a GCN with a K-headed attention mechanism as described in [28]. The number of layers l in a GCN controls how many propagation steps are performed. Each place p in G has at least one input transition t_a and one output transition t_b; to propagate at least the behavioral information from the predecessor and successor events of both a and b, PN1 should contain at least three layers (as illustrated in Fig. 5). The attention mechanism can be used to give unequal weights to nodes and is necessary because of the different node classes in G, which are not equally important for this initial step.
The result of this step is the same graph with updated node embeddings, as computed by the internal weights of the GCNs as follows. For the first l−1 layers of the network, the nodes' embeddings are updated by the multi-head attention update of [28]:

h_i^(l+1) = ∥_{k=1}^{K} ∑_{j ∈ N(i)} α_{ij}^{(l),k} (h_j^(l) W_k^(l)),

where h_i^(l) denotes the embedding of node v_i at layer l, ∥ denotes concatenation over the K attention heads, N(i) denotes the outgoing neighbors of node v_i, α_{ij}^{(l),k} are the learned attention coefficients, and W_k^(l) is the weight matrix of the GCN at layer l and attention head k. To also aggregate information from incoming neighbors, all arcs in G, except for the links, are made bi-directional, where a vector d_ij encodes the direction of the arc: [1, 0] if from v_i to v_j, and [0, 1] if from v_j to v_i. Furthermore, self-loops are added to retain a node's own embedding. The update function of the last layer, having a single attention head, is:

h_i^(L) = ReLU( ∑_{j ∈ N(i)} α_{ij}^{(L)} (h_j^(L−1) W^(L)) ),

with ReLU the rectified linear activation function. The nodes' embeddings h_i are used by the GCN to encode the behavioral information and are only interpretable by the ''Select candidate'' and ''Stop'' NNs, to be used later for classification of the places. Such information is similar to how a human modeler would aggregate information from the event log and the place's structural properties to decide whether it fits the process.
Think of, e.g., the frequencies of preceding and succeeding activities that determine choices or parallelism.
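The following sketch approximates PN1 with DGL's built-in GATConv layers: three layers, multi-head attention on the first two, a single ReLU-activated head on the last. The explicit direction features d_ij of the text are approximated here by simply making the graph bidirected, so this is a simplification under stated assumptions rather than the paper's exact network:

```python
import dgl
import torch
from dgl.nn import GATConv

class PropagationNet(torch.nn.Module):
    """Sketch of PN1: 3-layer GCN with K-headed attention, single head at the end."""
    def __init__(self, in_dim, hid_dim, heads=4):
        super().__init__()
        self.l1 = GATConv(in_dim, hid_dim, num_heads=heads)
        self.l2 = GATConv(hid_dim * heads, hid_dim, num_heads=heads)
        self.l3 = GATConv(hid_dim * heads, hid_dim, num_heads=1, activation=torch.relu)

    def forward(self, g, h):
        h = self.l1(g, h).flatten(1)   # concatenate the K heads
        h = self.l2(g, h).flatten(1)
        return self.l3(g, h).squeeze(1)

# Toy usage: a 3-node chain, made bi-directional with self-loops as in the text
g = dgl.add_self_loop(dgl.to_bidirected(dgl.graph(([0, 1], [1, 2]))))
h = PropagationNet(in_dim=8, hid_dim=16)(g, torch.randn(3, 8))
```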

''Select candidate'' network
We now determine, for each candidate place v_i ∈ P \ P′ that was not selected yet, the probability p_i that v_i should be added next, and pick arg max_{v_i ∈ P\P′} p_i. v_i is marked as selected by adding a feature value to its embedding. To compute p_i, we define the ''Select candidate'' network SCN as a regular NN with input vector h_i (as returned by the preceding step) and a single output value p_i. SCN learns the weight matrix W to map h_i to p_i as a single fully-connected layer.
By normalizing all the candidates' scores using the Softmax function, we get a probability p_v for each candidate node v: p_v = exp(h_v W) / ∑_{u ∈ P\P′} exp(h_u W), with exp(x) = e^x.
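A sketch of this classifier in PyTorch (class and variable names are ours): a single fully-connected layer scores each not-yet-selected candidate's embedding, and Softmax over the candidates yields the selection distribution:

```python
import torch

class SelectCandidateNet(torch.nn.Module):
    """SCN sketch: h_i -> score via one fully-connected layer, Softmax over candidates."""
    def __init__(self, dim):
        super().__init__()
        self.w = torch.nn.Linear(dim, 1, bias=False)   # the weight matrix W

    def forward(self, h_candidates):                    # (n_candidates, dim)
        scores = self.w(h_candidates).squeeze(-1)
        return torch.softmax(scores, dim=0)             # probability p_v per candidate

probs = SelectCandidateNet(dim=16)(torch.randn(5, 16))
next_place = probs.argmax().item()                      # arg max over P \ P'
```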

''Stop'' network
We define the ''Stop'' network SN for deciding when to stop adding candidate places. SN takes as input all Petri net node embeddings h_i, v_i ∈ V \ V_event, and outputs a probability p_add that another place should be added; we then make the binary decision to stop adding by sampling from a Bernoulli distribution with parameter p_add. SN has two layers. Layer 1 aggregates the h_i, v_i ∈ V \ V_event, into a graph embedding h_G by

h_G = ∑_{v ∈ V \ V_event} SIGMOID(h_v W_a) ⊙ (h_v W_g),

with learnable weight matrices W_a and W_g. SIGMOID(h_v W_a) serves as a ''gating function'' determining how much each node should contribute to h_G through the weights in W_a [26], where SIGMOID(x) = 1/(1 + e^{−x}) is used to map a value from R to a value between 0 and 1.
Layer 2 computes the probability p_add to add more nodes from h_G by p_add = SIGMOID(h_G W_d), with a learnable weight matrix W_d. h_G W_d is a score which is converted to a probability using the logistic sigmoid function.
If the network's decision is to stop, we are done. Otherwise, we continue at Section 4.1.4.
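The two layers of SN can be sketched as follows (a simplification with assumed dimensions; the gated sum follows the graph readout of [26]):

```python
import torch

class StopNet(torch.nn.Module):
    """SN sketch: gated sum of net-node embeddings into h_G, then p_add via sigmoid."""
    def __init__(self, dim, graph_dim):
        super().__init__()
        self.w_a = torch.nn.Linear(dim, graph_dim, bias=False)  # gating weights W_a
        self.w_g = torch.nn.Linear(dim, graph_dim, bias=False)  # node projection W_g
        self.w_d = torch.nn.Linear(graph_dim, 1, bias=False)    # decision weights W_d

    def forward(self, h_nodes):                                 # (n_nodes, dim)
        h_g = (torch.sigmoid(self.w_a(h_nodes)) * self.w_g(h_nodes)).sum(dim=0)
        return torch.sigmoid(self.w_d(h_g))                     # p_add

p_add = StopNet(dim=16, graph_dim=32)(torch.randn(7, 16))
add_another = bool(torch.bernoulli(p_add))                      # stop if False
```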

Propagation network 2
This step in the process is similar to the one in Section 4.1.1, where information is propagated through the graph by a GCN with a multi-headed attention mechanism; here it is used to process the previously made decision. It is necessary for the GCN to have at least two layers to ensure that information about the previous decision reaches other candidates. Recall that the decisions are encoded as features in the embeddings. By propagating this information two steps through the graph, d can encode whether other candidates fit the already selected candidates. The nodes' embeddings are updated in a similar fashion as described in Section 4.1.1, after which we loop back to Section 4.1.2 to make the next decision.

Training and inference
During training on instances ⟨L_j, N_j⟩ with known solutions N_j = (P_j, T_j, F_j), the weight matrices in the NNs are optimized such that a specified loss function is minimized. The loss function to be minimized during training is

l = − ∑_{i=1}^{|P_j|} log p(v_i | v_1, …, v_{i−1}),

with known places P_j = P′ ⊆ P to be selected and p(v_i | v_1, …, v_{i−1}) the probability of choosing candidate place v_i after having selected candidate places v_1 to v_{i−1}.
It is intractable to learn the marginal joint probability because many permutations may exist, so a consistent ordering should be selected. With π as a random ordering, problems of reproducibility arise, but a canonical ordering can give a lower bound. For graphs, completely canonical orderings do not exist, but a breadth- or depth-first search comes closest. For learning, we choose π based on a breadth-first search on N_j. The choice of BFS over DFS stems from human process modelers preferring to build the Petri net in this manner, as elaborated in Section 3.1. How a different choice for π changes the method's ability to approximate the marginal joint probability is left to future work.
With π and Eq. (10), the model learns the joint distribution q = p(P′, π | L) of the data by maximizing the expected joint log-likelihood E_{p_data(P′, π | L)}[log q]. With a dataset that is sufficiently large and varied, E_{p_data(P′, π | L)}[log q] approaches E_{p(P′ | L)}, the true probability distribution of the correct selection of candidates P′ from P given an event log L, solving the learning task.
Note that with this loss function l, the learning task is essentially to rediscover the process model N from which log L was generated, by re-identifying the places of the original model given during training as candidate places P′.
Since d is an autoregressive model, computing the loss during training over the complete generated sequence of candidate places, which possibly contains mistakes, is slow and redundant. An early mistake results in more mistakes, and the loss does not directly steer in the right direction to correct for the first mistake, causing slow and unstable convergence. Teacher forcing [36] addresses the slow convergence and instability by enforcing the breadth-first ordering π during the generation process and correcting the prediction after every candidate choice.
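A sketch of one teacher-forced training step under these choices; embed_fn is a hypothetical stand-in for the PN1/PN2 propagation that recomputes candidate embeddings after each selection:

```python
import torch

def teacher_forced_loss(select_net, embed_fn, graph, target_places_bfs):
    """Negative log-likelihood of the ground-truth places in BFS order pi,
    with 100% teacher forcing: the true place is fed back after every step."""
    loss, selected = torch.zeros(1), []
    for v_true in target_places_bfs:            # indices of places in P', in order pi
        h = embed_fn(graph, selected)           # embeddings conditioned on past picks
        probs = select_net(h)                   # p(. | v_1, ..., v_{i-1})
        loss = loss - torch.log(probs[v_true])  # one summand of Eq. (10) per step
        selected.append(v_true)                 # teacher forcing: use the true place
    return loss
```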
During inference (the counterpart of training where d is used on unseen data), teacher forcing is inaccessible, and therefore the sampled candidates are selected by the NNs without correction. Since we assume no conditional independence between the place selections, a suboptimal choice at the beginning of the generation process can lead to a low joint probability at the end, which is exactly what we want to maximize. The extent to which teacher forcing is used during training determines the mismatch between training and inference. With 100% teacher forcing, the model does not know how to deal with mistakes and could have a hard time correcting for them.
To mitigate this, beam search is used: a heuristic search algorithm to find the highest joint probability, which helps to ''correct'' mistakes by going back to a previous state where the model was more confident. It ''knows'' it made a mistake when the joint probability gets very low. Using beam search with a beam width of b, d selects the b candidates with the highest conditional probability at each step from b unfinished runs, resulting in a set of b^2 (un)finished runs, of which the b runs with the highest joint probabilities are taken to the next step. Afterwards, the b Petri nets with the highest joint probabilities are returned.
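The following generic sketch shows this search procedure; the expand callback, which would wrap the ''Select candidate'' and ''Stop'' networks and yield log-probability increments for extended runs, is our abstraction:

```python
import heapq

def beam_search(expand, initial_state, b):
    """Keep the b partial runs with the highest joint log-probability.
    expand(state) yields (delta_log_prob, next_state, finished) triples."""
    beam, finished = [(0.0, initial_state, False)], []
    while beam:
        candidates = []
        for logp, state, done in beam:
            if done:
                finished.append((logp, state))  # completed run leaves the beam
            else:
                candidates += [(logp + d, s, f) for d, s, f in expand(state)]
        beam = heapq.nlargest(b, candidates, key=lambda c: c[0])
    return heapq.nlargest(b, finished, key=lambda c: c[0])  # b best Petri nets
```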

Extensions
The basic approach described so far suffices to train a model on a training dataset and perform process discovery on new unseen data. However, some limitations arise: there is no mechanism ensuring that the produced process models are sound or even connected. Furthermore, the set of candidates only contains places, meaning that the resulting process model only contains places and transitions from the set of distinct activities, while models of real-life processes often require invisible transitions to describe skips or iterations. Lastly, maximizing the joint probability gives rise to a problem caused by the fact that probabilities are multiplied: the joint probability always decreases when new choices are added, hence shorter solutions are generally preferred over longer ones, although this is not always the best solution. Here, we provide some extensions that can be included in d to tackle these limitations; they are illustrated in Fig. 4.
Ensuring soundness. Firstly, we do not add a candidate place v_i chosen by SCN if the net N with v_i is not S-coverable (a polynomial check on the structure of N), as then the resulting net would not be sound [3]; in this case, we exclude v_i and let SCN return the next most probable candidate (see Fig. 3). To check for S-coverability, the unfinished net is completed by creating a unique sink place, synchronizing all current sink places by a final transition, following the algorithm from Fig. 5 in [37]. Note that with a finite beam width (see Section 4.2), this could result in traversing the search space without finding a sound model. A simple fallback method, besides increasing the beam width, is to decrease the number of trace variants taken into account when constructing the graph G, which reduces P and therefore the search space. To minimize the decrease in fitness, the most frequent trace variants should be encoded in G. S-coverability can only ensure reachability of the final marking but cannot avoid dead transitions; in this case the net is called easy sound. For ensuring proper soundness, a global check for a workflow net structure [3] could be introduced, but this can only be done on the finished net. It would require unrestricted backtracking, which we want to avoid to obtain feasible performance.

Ensuring connectedness. Secondly, SN (Section 4.1.3) may decide to stop while some transitions do not have an incoming or outgoing place (not a workflow net); we check for non-connectedness of N and override the stop decision in this case. Note that overriding the decisions reduces the joint probability, since d is forced to select probabilities that are not the highest. However, beam search mitigates this, similarly as shown before, by enabling d to search in another direction where its decisions are not overruled.
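A sketch of the selection loop with this check; is_s_coverable, with_place, and with_unique_sink are hypothetical helpers standing in for the structural check and the net completion from [37]:

```python
def select_next_place(candidates_by_prob, partial_net, is_s_coverable):
    """Skip candidates whose addition would make the completed net non-S-coverable;
    return the most probable candidate that keeps the net (easy) sound."""
    for place in candidates_by_prob:                 # descending probability
        completed = partial_net.with_place(place).with_unique_sink()
        if is_s_coverable(completed):
            return place
    return None   # fallback: widen the beam or encode fewer trace variants in G
```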
Adding invisible transitions. Invisible τ-transitions are needed to model silent skips in real-life event logs, while they are not recorded in L. We add τ-transitions T_τ to the candidate nodes (previously only P) for the SCN to select. In principle, the set of all possible candidate τ-transitions is T_τ = {(X, Y) | X, Y ⊆ P}; however, for feasibility, we only add candidate τ-transitions between two already selected candidate places: T_τ = {(x, y) | x, y ∈ P′}. Candidate τ-transitions are added to G after selecting a new place, and their feature vectors are defined as the sum of the feature vectors of their neighboring nodes.

Avoiding bias towards short solutions. Length normalization is proposed in [29] to account for the fact that solutions of different lengths are compared; it takes the average of the log probabilities of the choices rather than the product. Heuristically, an α parameter is added as an exponent to the solution length. The score function s of a sequence of choices P′ is then s(P′) = (1/|P′|^α) ∑_{i=1}^{|P′|} log p(v_i | v_1, …, v_{i−1}). More extensions could be introduced, e.g., adding transitions with duplicate labels or adding silent transitions between sets of already selected candidate places.

Correctness of the approach
In Section 4.2, we included beam search for inference in our approach as a generic heuristic to prune the search space of candidate places in order to achieve feasible performance. The beam search heuristic is ''unaware'' of the subsequent S-coverability check ensuring easy soundness (see Section 4.3). As a result, the heuristic may only explore candidate places leading to non-S-coverable models, and hence not find an easy sound solution.
By the following Theorem 1, we show that this limitation is only due to the beam search heuristic and that, with the extensions of Section 4.3, our approach (without heuristics) can always find an (at least) easy sound Petri net.
Theorem 1 (With Infinite Beam Width, the Produced Petri Net is Always at Least Easy Sound). Let L be an event log and let d be the ML model from Sections 4.1 and 4.2 with the extensions from Section 4.3. Assuming the beam width is infinite, N = d(L) is always at least easy sound.
Proof. To show that N = d(L) is always at least easy sound, we prove that, independent of the selected silent transitions T′_τ and the binary decision of the ''Stop'' network, there exists a solution in the search space that is connected and easy sound.
Let c(P, T, F) and s(P, T, F) be predicates denoting connectedness and S-coverability of a Petri net (P, T, F). Then we have to show that there exists a set of places P_s ⊆ P such that c(P_s, T, F) and s(P_s, T, F) hold (Eq. (13)). We prove this statement by construction of P_s.
We have to find a subset of candidate places P_s ⊆ P such that the net (P_s, T, F) is connected and easy sound. With P = {(X, Y) | X, Y ⊆ T} the complete solution space, it is trivial that such a subset exists, namely a state machine respecting the directly-follows graph. With a reduced solution space as described in Section 3.3.3, the set of candidate places, assuming a minimal K = 1, equals the set Cnd(L) from Def. 19 in [38]. With the subset of maximal places Sel(L) ⊆ P, the net (Sel(L), T, F) provides a solution that is a sound Petri net, as shown in [38].
Either the algorithm selects an S-coverable solution different from P_s, or, since the beam width is infinite, the complete search space {P′ | P′ ⊆ P} can be exhausted and eventually an arbitrary permutation P′_s of P_s is selected. Furthermore, let T′_τ = ∅ be the set of selected silent transitions. After selecting P′_s and T′_τ, let P^(v) and T^(v)_τ denote the sets of valid candidate places and silent transitions, i.e., those whose addition preserves connectedness and S-coverability. In the selection loop, either P′_s is extended by a place p ∈ P^(v) that further ensures S-coverability, or T′_τ is extended by a silent transition t_τ ∈ T^(v)_τ. By definition of connectedness and P^(v), the resulting Petri net is still connected and S-coverable; i.e., the solution remains correct. Assuming, in the worst case, that the binary decision of the ''Stop'' network is to always continue, there is a point when P^(v) is empty. When P^(v) = ∅, selecting silent transitions cannot cause P^(v) ≠ ∅, hence there is also a point when T^(v)_τ = ∅ and the selection terminates. The resulting net N is the produced Petri net, which is at least easy sound, since N is connected and S-coverable. □

While an infinite beam width leads to practically infeasible running times, the result suggests that easy sound models can be ensured by a sufficiently large beam width or by new heuristics that retain S-coverable solutions. Note that with the fallback method described in Section 4.3 of reducing the number of trace variants for constructing the graph, the number of candidates reduces, and therefore also the search space and the beam width required for ensuring an at least easy sound process model.

Evaluation
We assess the feasibility of our approach for solving the supervised process discovery problem. We specifically aim to answer the following question: does training d to rediscover N_i on synthetic problem instances ⟨L_i, N_i⟩ suffice for d to generalize, i.e., does applying d on unseen (1) synthetic and (2) real-life data yield sound, accurate, and simple models? Thus, our objective is not to identify a set of parameters for d that outperforms all other methods, but to assess whether there are parameters by which our approach is capable of generalizing from synthetic training data ⟨L_i, N_i⟩ to unseen problem instances.
While parameter tuning is not the objective of this experiment, we summarize in Section 5.1 considerations and observations from our research that we used for designing the experimental setup and for choosing parameters for training d. Section 5.2 describes the exact experimental setup. We report our findings regarding models in Section 5.3 and regarding running times in Section 5.4. Later, in Section 7, the sensitivity of the method's performance with regard to the training data is analyzed. To encourage future research, we restrict ourselves to the use of a regular PC without the need of GPU support for training or inference.

Evaluation design process
On the one hand, the process discovery problem is to obtain a function that is able to generate fitting and precise process models from event logs of a variety of real-life processes that are not available when that function is defined or learned, i.e., generalization. On the other hand, our method d is designed to learn how to rediscover Petri nets from a known training dataset. Since a large quantity of ⟨L, N⟩ pairs for real-life processes does not exist, a synthetic dataset is required.
This raises the following three questions to answer in our experiment. (Q1) Can the NN actually learn to solve the rediscovery task by minimizing the loss function? This can be evaluated using a test dataset with the same general characteristics, i.e., representational bias, as the training data. As this is a hard problem on its own, we need to evaluate what happens when exact rediscovery fails: (Q2) Does the NN return a model that comes close to the original Petri net only in terms of model structure, or also in terms of its observed behavior, i.e., the log? The behavioral similarity to a model can be assessed using conformance metrics; in this regard, we effectively evaluate the algorithm's performance in terms of discovery rather than rediscovery.
Finally, (Q3) Does the trained NN generalize to real-life data, which is possibly outside of the representational bias of the training data? Since ground truth process models are missing for real-life processes, we also base this evaluation on the conformance metrics, i.e., process discovery. Evaluating process discovery quality in (Q2) and (Q3) by comparing a model to its input log lacks a ground truth. We therefore use as baseline for comparison a selection of state-of-the-art algorithmic process discovery methods.
The set of parameters to choose for training and evaluating d ranges from the generated training dataset to the neural networks' inner settings and the framework's components:

P1 The hyperparameters of the neural networks;
P2 The set of parameters to generate the synthetic training dataset;
P3 The choice of k in the k-eventually-follows relations constructing the candidate places;
P4 The choice of T_τ constructing the candidate silent transitions;
P5 The addition of frequency information in the nodes' initial feature vectors;
P6 The use of teacher forcing during training;
P7 The validity checks on the selected candidates, e.g., S-coverability and connectedness;
P8 The use of beam search and length normalization.
Since we are aiming to evaluate feasibility rather than to find optimal parameters, we select a single set of hyperparameters for the NNs (P1), based on the experiments of the similar graph generation process proposed in [26].
For (P2), we base the generation of the synthetic training dataset on a number of aspects from domain knowledge. For simple and quick generation, process trees are chosen, which can be converted to Petri nets. The structural characteristics are manually adjusted such that the process models appear, visually, like handcrafted process models for real-life processes of intermediate complexity, with the hypothesis that these are close to the representational bias of real-life processes. The one-hot encoder in our approach limits the number of distinct activities to take into account, which we determine similarly. To determine the size of the training data, we observed in initial experiments overfitting with a small dataset, after which we manually increased the training data incrementally until we noticed convergence without clear signs of overfitting, while keeping a feasible limit for reproducibility and the encouragement of future research. For the other parameters (P3-P8), we aim to keep them as simple as possible while utilizing the techniques for stability and feasibility in terms of training and inference.

Experimental setup
We can now formulate the evaluation wrt. three objectives, based respectively on the questions above: (O1) Is the loss function of Eq. (10) able to optimize the NNs in d to achieve high precision/recall in selecting the ground truth places on synthetic test data? (O2) Does the loss function also optimize fitness and precision of the discovered model wrt. the input model on synthetic test data, compared to state-of-the-art methods? (O3) How does the quality of models compare to the other methods on real-life data, i.e., does the ML program generalize beyond the task it was trained on?
For training and testing, we used the PTAndLogGenerator plugin in ProM [39] to generate 2663 problem instances ⟨L_i, N_i⟩: we generated 2663 random process trees, converted each to a block-structured Petri net N_i, and generated the corresponding event log L_i by simulation. The parameters for generating this dataset (P2) are listed in Appendix A.1. For feasibility, as described in Section 4.3, we append T_τ only with silent transitions between two already selected places (P4); therefore, silent transitions with multiple input or output places are labeled distinctly.
Appendix A.2 lists the exact hyperparameters for the neural networks (P1), which we based on the experiments in [26]. As mentioned in Section 5.1, the other parameters are chosen to be as simple as possible, with k set to 1 (P3), frequency information included (P5), 100% teacher forcing during training (P6), and validity checks for both S-coverability and connectedness (P7) (cf. Appendices A.3 and A.4). With the objective of demonstrating feasibility only, we neither optimized the parameters in any way nor claim they are optimal. We do, however, analyze the sensitivity to the representational bias in the training data in Section 7.
For (O1), d is trained and tested on the dataset for 100 epochs, with a 75/25 (2000/663) train/test split, while keeping track of the loss and the percentages of true/false positive places to assess the ability to learn the rediscovery task (cf. Q1).
For (O2), inference is performed using the trained d on the test dataset from (O1), with a beam width limited to 10 for feasibility and without length normalization (P8); the beam width is decreased by 1 after each selected place. For the produced process models, we measure both entropy-based and alignment-based fitness and precision [18,40], used to obtain F-scores, and simplicity scores based on the inverse arc degree [41], to assess the ability of process discovery (cf. Q2). (For the entropy-based measures, we used partial trace matching, since exact trace matching resulted in invalid results and/or timeouts of 300 s; furthermore, we capped results to 1 when this value was exceeded, presumably due to rounding errors.) Selected from the benchmark presented in [10], we compared to the Inductive Miner (IM) [4], the Split Miner (SM) [6], and the Heuristics Miner (HM) [42]. The more recent miner PIM [43] was not included, as no public implementation was available. We do not consider declarative miners [33,44] in this evaluation, as entropy-based fitness and precision do not extend to declarative models, and as alignment-based precision is unreliable [45], a fair comparison is impossible. Furthermore, no quantitative measures have been proposed yet for syntactically comparing declarative and imperative models regarding simplicity, also given that either paradigm facilitates different forms of understanding [32].
For (O3), inference is performed on eleven real-life datasets selected from the BPI challenges. Because of the input size of 20 and the artificial start and end transitions, only the 18 most frequently occurring activities are taken from each dataset. Starting with a beam width of 50, we again decremented it after each selected place. Furthermore, we only used a sample of size between 8 and 75 (depending on the size of G) from each log to keep the number of candidate places tractable. Note that conformance checking is always done on the complete event log. We refer to Appendices A.5 and A.6 for more details on the evaluation data and the parameters chosen for inference.
All experiments are performed on an Intel i7-8705G CPU (3.10 GHz) with 16 GB RAM and no GPU support; the source code and manual are available at gitlab.com/dominiquesommers/apdml.

(O1) Effectiveness of the loss function wrt. model structure (rediscovery)
We discuss our findings regarding the ability of the technique to exactly rediscover the Petri net from which the input log was generated, based on synthetic data.
Fig. 6 shows that the loss converges towards zero over the 100 epochs, but does not reach zero. However, we argue that this is expected, since the ordering of candidates is not completely canonical (cf. Section 4.2). Inference during training shows a similar trend for selecting true positive candidate places, with convergence at ∼87% of places for the unseen test data. Whether this rediscovery accuracy is sufficient to achieve generalization is to be concluded from the evaluation on conformance metrics, assessing whether the produced process models come close behaviorally when the Petri net is not exactly rediscovered. Additional experiments in further research are necessary should this accuracy need to be improved. The number of false positives keeps decreasing after the loss has converged, due to the use of teacher forcing causing a mismatch between training and inference. Note that the loss is directly related to the number of true positives, but not to the false positives, as teacher forcing corrects d after every selection during training.
We inspected the classified candidate places from the test dataset in more detail by computing for each place the α-relations associated with it, e.g., does a place model that a follows b, or that b and c are in conflict? Comparing the α-relations associated with a place (both in the original and in the discovered model) with the α-relations computed from the log, we see some interesting differences, as shown in Fig. 7. A large portion of the incorrectly classified places are under- and/or overspecified, meaning that the α-relations in the event log provide respectively fewer or more directly-follows relations than those found in the incoming and outgoing transitions of the place. Mistakes caused by overspecification could mean that the model has difficulty knowing which information is relevant for making the decision. Furthermore, it is interesting to see that a large portion of the false positives is neither under- nor overspecified, indicating that the α-algorithm [22] would have selected this place too. However, it is known that α-relations do not encapsulate all behavioral information, and more information encoded in the event log (e.g., eventually-follows relations) is used for making decisions. In earlier experiments from [46], we analyzed these places in more detail, concluding that often an incorrectly classified place (A, B) is either a subset of one of the correct places P_s or vice versa.
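As an illustration of the relations discussed here, the following sketch derives directly-follows counts and the standard α-style causality/parallelism relations from a log given as lists of activity labels. This is our own minimal formulation of the textbook relations, not the paper's implementation.

# Sketch: directly-follows relation and alpha-relations from an event log
# represented as a list of traces, each a list of activity labels.
from collections import defaultdict

def directly_follows(log):
    df = defaultdict(int)
    for trace in log:
        for a, b in zip(trace, trace[1:]):
            df[(a, b)] += 1                 # a is directly followed by b
    return df

def alpha_relations(df):
    pairs = set(df)
    causal, parallel = set(), set()
    for a, b in pairs:
        if (b, a) in pairs:
            parallel.add((a, b))            # a || b: both orders observed
        else:
            causal.add((a, b))              # a -> b: only one order observed
    return causal, parallel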

(O2) Effectiveness of the loss function wrt. behavioral properties (process discovery)
We discuss our findings regarding the ability of the technique to discover process models that are accurate wrt. the input event log and simple, based on synthetic data.
With the selected beam width for inference, our approach produced an easy sound process model for 65% of the test dataset. The limited beam width causes the search space to be insufficiently traversed to always find a valid Petri net. Recall from Theorem 1 that this can be prevented by widening the beam and/or decreasing the number of trace variants. We observed (see the discussion below) that for the failing cases, the processes are more complex in structure. As this likely negatively impacts the performance of the other methods as well, we chose to only discuss measures for the subset where our method returned a model, to avoid a negative bias towards the other methods. The scatter plot in Fig. 8 shows the entropy- and alignment-based fitness and precision scores as well as the F-scores and simplicities of the process models produced by each method on this 65% subset. The means and medians are summarized in Table 1.
Looking at the median centroids of all evaluation measures, IM clearly comes closest to the ground truth (of block-structured models). In terms of fitness and precision, our approach outperforms both SM and HM, while lagging behind IM. Our approach and SM discover simpler models with a small trade-off in F-score. HM lacks in F-score while achieving high simplicity. The large gap between the median and the mean for our approach shows that for most samples the conformance is very high and for a few samples it is very low. Looking further into these, most of the lower-scoring samples are not sound (only easy sound) and the processes are complex in structure: parallelism in the upper part of the process tree with multiple parallel constructs and loops in the lower parts, (1) causing a large number of candidate places, and (2) suggesting that the generated event logs may have been too small to provide sufficient behavioral information such as directly-follows completeness [23].
To evaluate relative performance per log, we compute the entropy-based F-score ratios of our approach over each other approach: a ratio > 1 means our approach performed better on a log. The histogram in Fig. 9 reiterates the better performance of IM over our approach but, more interestingly, shows that our approach outperforms SM on many logs while struggling more significantly on a smaller subset of logs. These results, however, are limited to the 65% of the logs on which our method returned a model with the parameters chosen for this experiment.
While the experiment does show feasibility for process discovery, more research into the parameters of the method and the size of the training data is required to increase robustness, so that accurate and simple results are returned for all logs, as we further discuss in Section 6.

(O3) Generalization towards unseen real-life data (process discovery)
We finally discuss our findings regarding the ability of the technique to generalize beyond the characteristics of training data by analyzing the models discovered from real-life event logs.
Fig. 10(a) and Table 2 show the entropy- and alignment-based fitness and precision scores for each method and real-life dataset separately. Missing entries are caused either by memory or unknown errors in the evaluation software, or by the process model not being valid, i.e., not at least easy sound. The IM is known to have low precision, and this is where our approach outperforms the other methods on average. In terms of fitness, the other methods perform similarly, all slightly better than our approach.
Similarly, Fig. 10(b) and Table 2 show the F-scores and simplicity scores. The low precision of the IM causes low F-scores. The SM generally achieves F-scores similar to the HM, but has higher simplicity. Our approach competes with the best-scoring methods on all datasets in terms of F-score and has the highest simplicity (except for BPIC '12); the average score plot in Fig. 10(b) reinforces this observation.
Fig. 11(a) shows the model discovered by our approach on the Road Traffic Fine dataset. The Petri net is not block-structured but sound, proving generalization beyond the block-structured training data. Furthermore, the model captures a complex synchronization of two parallel branches (repeated payments, sending the fine) by an optional appeals procedure (with a penalty in parallel) in an unstructured loop: payments can resume after the conclusion of appeals. For comparison, Figs. 11(b) to 11(d) show the models discovered by the other methods. The SM seems to have captured the process best, with a very simple model of labeled transitions and many τ-transitions. For the IM, it is clear that the produced model has low precision, having a τ-transition to skip almost every activity. The HM produced a very complex, spaghetti-like model, not exposing the process structure at all. Additionally, Fig. 12 shows the model discovered by our approach on the BPI 2017 _O prefixed dataset. Again, the τ-transition in the Petri net is used differently than as a simple skip or cycle step.

Running time
Regarding complexity and running times, creating graph G from L is linear in the number of unique trace variants and, by pruning candidate places, polynomial in the number of activities (we observed 1-10 s). Information propagation and the computation of the embeddings are independent of the size of G. Place selection is linear in the number of selected candidates (on avg. 1 s per place) multiplied by the beam width. When d struggles to select places that retain the S-coverability property, substantially more candidates have to be considered. We observed running times of several minutes, while the algorithmic methods complete in a matter of seconds [10].
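As a rough back-of-the-envelope aggregation of these observations (the symbols below are ours, introduced only for illustration):

t_{\text{total}} \approx t_G + n_{\text{places}} \cdot b \cdot t_{\text{place}}, \qquad t_G \approx 1\text{–}10\,\text{s}, \quad t_{\text{place}} \approx 1\,\text{s},

where n_{places} is the number of selected places and b the beam width; with b = 50 this readily accounts for the observed minutes-long running times.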

Discussion and new research problems
The results of Section 5 demonstrate that machine learning is capable of learning a function to map from event logs to process models, and that this function, learned on synthetic training data, generalizes to unseen real-life data. Consequently, our experiments show that it is feasible to automate the task of ''learning how to create a process model from a behavioral specification'' and to reach performance similar to state-of-the-art algorithms. Yet, although our approach is based on how human modelers (learn to) model, a skilled human modeler would still outperform our, and other, automated methods when given enough time.
Our results are subject to numerous design decisions and parameters in our solution, which consists of the specific problem formulation in Section 3 and the approach in Section 4. In the following, we revisit these design decisions and parameters and identify new learning problems (LP) for the general problem and new technical research questions (RQ) for the approach.

Generalizing the learning problem
The specific learning problems and approach were formulated for a very limited setting of creating imperative models in the form of Petri nets. Human modelers learn modeling in other modeling languages and modeling paradigms as well. This gives rise to the following two learning problems.

LP1 How to adapt the learning problem to other imperative target languages such as BPMN?

LP2 How to adapt the learning problem to other declarative target languages such as DCR graphs [33] or Declare [44]?
In general, our method is not limited to Petri nets and could be adapted to other target languages. The main challenge lies in adapting the selection of candidate places to a selection of model constructs (nodes, edges) in the respective target languages. However, the parameter space to learn in these languages is potentially larger. For declarative languages, such as DCR graphs, the model not only has to decide on the inclusion of a constraint edge but also on the constraint type of the edge. Richer languages such as BPMN not only have far more modeling constructs but also allow nested graph structures through scopes and sub-processes. Reliably discovering such structures requires, next to the ordering of activities, additional event attributes to be present in the input [47], which is currently not included in our approach.
We not only constrained the solution space to Petri nets, but also to only select candidates for a model with uniquely labeled transitions.
LP3 How to adapt the learning problem to allow unbounded construction of the target model (instead of selection from possible options)?
Learning how and when to freely introduce additional nodes, especially multiple transition nodes with the same activity name, would be closer to emulating how a human modeler creates models. This would require learning methods that can dynamically grow the graph G that defines the learning problem in Section 3.3. Dynamically growing G most likely also requires considering the following:

LP4 How to learn to identify larger ''chunks'' or patterns of modeling constructs to extend a given model, as human modelers typically do? [13]

LP5 How to learn to modify previously created model structures to reconcile them with model structures created later?
While LP5 is ideally avoided in model creation, human modelers approach particularly large and complex modeling problems by first finding local solutions to smaller problems that are then adapted to be correctly combined into a larger solution, for example to ensure soundness of the model [48,49]. Solving LP5 requires introducing more global reasoning over the constructed model than our approach does, which may also ameliorate the costly search for candidates that ensure S-coverability. Notably, solving LP5 could enable adapting or repairing a given process model rather than creating an entirely new model [50]. Further, the learning problem in this paper focused on learning model structure only. However, model layout (and other forms of secondary notation) is central to model understanding [51,52].

LP6 How to learn to generate adequate model layouts during model creation?
Solving this learning problem also has to consider the task [53] and user background [54].

Further research questions on the approach
Orthogonal to the above variations of the learning problem, our approach itself is characterized by a number of choices and design decisions that invite further research.
Our approach trained method d on a fixed synthetically generated dataset of block-structured process models and corresponding event logs of limited size.
RQ1 What is the impact of the size of the training data on the ability of d to generalize to other kinds of data? How large does the training data have to be to achieve generalization?

RQ2 What is the impact of the representational bias in the training data on the ability of d to generalize to other kinds of data? Could synthetic data generated from more complex workflow patterns [35] improve quality?

RQ3 Can the quality of d be improved by training on data sampled from human modelers instead of synthetically generated data?
During evaluation we observed that d failed to find a solution for some real-life data sets as the search space was too large, but succeeded in finding a good solution for the full data when given only a sample of the real-life data.
RQ4 What is the relation between the behavioral information contained in the sample given as input and the quality of the returned model? What is the smallest sample from which an adequate model can still be generated?
RQ1-RQ4 give rise to a number of structured experiments. Answers to these questions would also be important for answering the more generalized learning problems. In Section 7, we take a first step in exploring the more generalized learning problem through a structured experiment on the sensitivity of the method to the representational bias in the training data (RQ2). Further, method d was trained with a fixed set of parameters, such as the graph G, the neural networks' hyperparameters, the use and balance of teacher forcing for convergence stability, the use of beam search and length normalization as search heuristics for finding the highest joint probability, and the breadth-first set ordering π for selecting the place candidate. While we chose the parameters for our experiment based on available literature and initial experimentation (cf. Section 5.2), their general influence on the method is not yet known.
RQ5 What is the impact of each parameter on the quality of models discovered by d on synthetic data and on unseen real-life data?

RQ6 Are there other designs for d (graph G, neural network hyperparameters, etc.) than explored in this research that improve the quality of models discovered by d?

RQ7 How to efficiently search for and determine the right parameter combination to train d with desired properties?
While RQ5 gives rise to a number of structured experiments on the implemented approach, RQ6 and RQ7 require further research drawing on developments in graph neural networks [55] and also meta-learning [56]. A particularly interesting direction is to train d not only from a finite set of examples but to allow self-correction through generative adversarial networks [57] or reinforcement learning [58], where conformance checking could provide feedback on the adequacy of the created model. The problems and research questions described above outline new lines of research on APD with the aim to obtain an automated method that performs comparably to a human modeler in learning how to model.

Sensitivity analysis
The discovery technique we described in Section 4 is trained on pairs ⟨L_i, N_i⟩ of event logs and process models. We demonstrated in Section 5 that synthetically generated training data ⟨L_i, N_i⟩ is sufficient to achieve generalization to unseen real-life event logs. However, we do not know whether and how much properties of the input data influence this ability to generalize. In the following, we study the sensitivity of the performance of our approach to properties of the training data.
This analysis follows from the further research questions proposed in Section 6.2, where each type of parameter in the framework is listed for further evaluation. RQ1 and RQ5-RQ7 focus on analyzing the impact of parameters that are important in any neural network approach, while RQ2-RQ4 are specific to the automated process discovery problem. RQ2 is especially interesting. In Section 5 we chose the parameters for generating the set of synthetic process models (see Appendix A.1) based on whether the resulting models are visually similar to real-life processes in their complexity. Consequently, our results rely on the assumption that, by visual similarity, the complexity of the training data sufficiently captures various properties of process models that also occur in real-life data.
This raises the question whether using training data with different characteristics (or representational bias) would affect the performance of the ML model on (1) test data with a similar bias as well as on (2) real-life data with a bias independent of the training data. For a first analysis of sensitivity, we vary the complexity of the models in the training data. We compare the influence of training data with structurally simpler models (complexity = low) as well as structurally more complex models (complexity = high) against the training data of Section 5 (complexity = medium) on the performance of the ML model. To assess this influence, we formulate three hypotheses which are tested in this evaluation.

(H1) We hypothesize that an ML model trained on simple synthetic data (complexity = low) converges quicker than the reference ML model (complexity = medium) of Section 5 and easily learns the training data while generalizing to synthetic test data with the same characteristics (complexity = low). For the ML model trained on complex data (complexity = high), the training will take longer and achieve lower accuracy on corresponding test data (complexity = high), since it faces a harder instance of the learning problem.

(H2) The ML model trained on (complexity = low) will achieve lower accuracy than the reference ML model (complexity = medium) on the real-life data, as its representational bias is too far off from the real-life data and it fails to capture the more complex structures of real-life data. The ML model trained on (complexity = high), when applied to the real-life data, has learned sufficiently to capture the comparatively less complex structure of the real-life data and will score similarly to the reference ML model trained on (complexity = medium).
(H3) Alternatively to (H2), the approach is not sensitive to the training data: both an ML model trained on simple data and one trained on complex data can learn the process structures sufficiently, as the building blocks are present in both cases, and both generalize to real-life processes comparably to the reference ML model.

Experimental setup
To assess the impact of the model characteristics (or representational bias) in the training data, we vary the parameters for generating the synthetic process models as follows.
As the synthetic data used for training the reference ML model is deemed intermediate in complexity, we explore the two extremes with data sets containing process models that are either structurally simple or structurally complex. For learning a representation of process behavior, a process is considered complex when it contains many parallel and loop constructs as opposed to sequences and choices, as representations for parallelism and loops are difficult to learn in neural networks [31] and, depending on the representation, both constructs may not be distinguishable from each other [23]. We therefore chose to vary complexity in the training data by varying the relative share of parallel and loop operators in the synthesized process trees. The parameters for generating the synthetic data sets are summarized in Table 3.
Thus, process models with complexity = low primarily consist of sequences and choices, while in process models with complexity = high, 50% of the constructs are complex (parallel or loop); a sketch of such a generation setup is given below. All other parameters of the experiment remain unchanged, ensuring that the remainder of the experiment is the same as in Section 5 for a fair comparison of the different ML models. Note that finding the optimal parameters is not the objective of this analysis and therefore an extensive hyper-parameter search is omitted.
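As an illustration of this setup, the following sketch generates training pairs with varying operator shares, assuming pm4py's bundled reimplementation of the PTAndLogGenerator. The module path, parameter keys, and concrete shares are illustrative assumptions to be checked against the installed version (only the 50% complex-construct share for complexity = high follows from the text above); this is not our exact configuration.

# Sketch: generating training pairs <L_i, M_i> of varying complexity,
# assuming pm4py's reimplementation of PTAndLogGenerator.
import pm4py
from pm4py.algo.simulation.tree_generator import algorithm as tree_generator

# Illustrative operator shares per complexity level; with complexity = high,
# 50% of the constructs are parallel or loop.
OPERATOR_SHARES = {
    "low":    {"sequence": 0.70, "choice": 0.30, "parallel": 0.00, "loop": 0.00, "or": 0.0},
    "medium": {"sequence": 0.45, "choice": 0.35, "parallel": 0.10, "loop": 0.10, "or": 0.0},
    "high":   {"sequence": 0.25, "choice": 0.25, "parallel": 0.25, "loop": 0.25, "or": 0.0},
}

def generate_pair(complexity: str):
    # "min"/"mode"/"max" bound the number of activities (illustrative values).
    params = {"mode": 10, "min": 5, "max": 15, **OPERATOR_SHARES[complexity]}
    tree = tree_generator.apply(parameters=params)  # random process tree
    log = pm4py.play_out(tree)                      # simulate an event log from it
    return log, tree                                # one training pair <L_i, M_i>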
For each representational bias, we trained and evaluated the ML model as described in Section 5. To test (H1)-(H3), we then measure, for each representational bias, (O1) how the performance compares to the reference ML model from Section 5 on unseen synthetic data, and (O2) on unseen real-life data.

(O1) Performance on unseen synthetic data
Similar to the experiment in Section 5.3.2, the limited beam width causes d to produce at least easy sound Petri nets for only a subset of the test data. The ML model trained on (complexity = low) returned an easy sound model for 75% of the test logs.
The scatter plot in Fig. 13 shows the entropy- and alignment-based fitness and precision scores as well as the F-scores and simplicities of all models produced by each method on this subset of the low-complexity test dataset, with means and medians summarized in Table 4.
For the ML model trained on (complexity = high), easy sound process models were discovered for only 40% of the logs in the test dataset; the same conformance metrics for this subset are shown in Fig. 14 and summarized in Table 5 for each method.
In terms of the conformance metrics, where the ML models succeed, the performances are similar to the intermediate ML model relative to the other methods. In line with (H1), the models trained with our method perform better on the simple and worse on the complex training and test data, respectively. The success rates for the two data sets confirm our observation from Section 5.3.2 that a larger beam width is required the more complex the data, which is to be expected since the search space is generally larger for more complex processes.

(O2) Generalization towards unseen real-life data
Fig. 15 and Table 6 show the entropy- and alignment-based fitness and precision scores as well as the F-scores and simplicities of the models produced for each real-life data set by the ML models trained on data with low and high complexity, compared to the reference model of Section 5 (complexity = medium).
For some of the real-life event logs, the ML models trained on complex and/or simple data did not produce an at least easy sound Petri net, which is again due to the traversed search space. Since exactly the same inference parameters are used as in the experiment of Section 5, this shows that the parameters which reduce the search space depend not only on the input event log but also on the ML model itself, and should be adjusted accordingly for optimal results. For the other datasets, the conformance results are very similar across the different ML models, without a clear overall winner.
Regarding the hypotheses, the results lean towards (H3) rather than (H2), as both ML models are able to generalize to data outside their representational bias similarly to the intermediate ML model. This shows that the impact of the training data on model performance on real-life data is minimal, though the heuristics for inference appear to be sensitive to the training data in some cases.

Conclusion and future work
We reconsidered the problem of automatically constructing process models from event data at a fundamental level. The classical process for engineering a new process discovery algorithm encodes the relation between event logs and model constructs in fixed rules. These fixed rules are based on knowledge of the semantics of modeling constructs (which constructs formally describe the behavior) and heuristics for how to choose among multiple possible modeling constructs in case of insufficient information (e.g., infrequent or deviating behaviors). Experiments repeatedly show that all known algorithms for discovering imperative models suffer from the fact that the chosen heuristics do not generalize well to new real-life data sets, i.e., data sets whose characteristics were either not known during algorithm design or could not be encoded in the algorithm's rules.
In this paper, we explored whether it is possible at all to automatically learn a process discovery algorithm for imperative models, through supervised learning from training data, so that this algorithm can successfully generate sound, accurate, and simple models from input event logs not seen before, i.e., is able to generalize to unseen data. This is a new formulation of the process discovery problem, which we call supervised process discovery.
Our study and experiments show that this problem can indeed be solved for discovering Petri nets. Our approach is based on the process of how human modelers create models. It encodes the relation between the input event log and the solution space of possible models in a graph, and learns how to infer a valid solution by information propagation using graph neural networks. Our experiments demonstrate the feasibility of solving the supervised process discovery problem. The trained algorithm performed well on unseen synthetic problem instances similar to the training data (block-structured process models), reaching performance comparable to state-of-the-art process discovery techniques. More importantly, for real-life problem instances with characteristics different from the training data, the trained algorithm returned sound models with accuracy comparable to, and simplicity better than, state-of-the-art techniques. The results are robust under changing the bias and complexity of the training data. This suggests that the described technique is effectively able to learn how to make choices among multiple modeling constructs in a generalizable manner, i.e., applicable to unseen data.
We argue that this study lays the foundation for introducing machine learning to automated process discovery and provides a first framework to do so in a systematic manner. In Section 6 we identified various novel learning tasks and research questions to improve and generalize the various components and building blocks of the problem formulation itself and of our proposed approach. This specifically includes the open problem of whether a process discovery algorithm can also be learned for declarative process modeling languages [33]. The fact that our results were obtained on standard PC hardware without GPU support lowers the barrier for subsequent research and adoption.
Besides these general alleys of future work, the specific approach and findings are subject to several limitations which we discuss next.
A limiting factor for the proposed method to scale to large processes is the choice of the initial features as one-hot encodings, allowing only a predefined maximum number of distinct activities. Future research on a different label encoder that scales with the activity cardinality is needed to counter this limitation.
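To illustrate the limitation, the following is a minimal sketch of a fixed-size one-hot encoder; max_activities is an assumed parameter mirroring the input size of 20 used in our experiments, and the function names are illustrative.

# Sketch: fixed-size one-hot activity encoding. The vocabulary size is
# fixed at training time, so logs with more distinct activities than
# max_activities cannot be encoded at all.
import numpy as np

def one_hot_encode(activity, vocabulary, max_activities=20):
    if len(vocabulary) > max_activities:
        raise ValueError("log has more distinct activities than the encoder supports")
    vec = np.zeros(max_activities)
    vec[vocabulary.index(activity)] = 1.0   # one fixed position per activity
    return vec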
Robustness has been a focus in automated process discovery, since a wide variety of processes exists in terms of structure and algorithmic approaches often tackle only a specific kind of process. This has been a problem for our method as well, where in some cases no sound model was discovered at all. In Section 4.4 we have proven that this is due to the heuristics of the beam search limiting the traversed search space. The sensitivity analysis wrt. the complexity of the training data confirmed that the beam search heuristic limits inference in finding sound solutions in large search spaces (of more complex models). In our experiments, we reduced the number of traces to reduce the size of G until a sound model was discovered. Although this is justified by the fact that human modelers also only look at the most frequent/important traces and build a model based on those, it is not ideal since possibly valuable information is lost. Future research has to investigate how to tune beam search (or other heuristics) to not exclude sound solutions from the search space, and to develop proper fallback methods to ensure robustness.
Finally, many algorithmic process discovery methods have adjustable parameters for balancing between fitness and precision of the model to be produced. In our method, this balance is encoded implicitly in the training data, so it consequently learns the same balance. With various balances encoded explicitly in the training data, a conditional version of our method using additional input features could steer this balance during inference.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

• SCN: 1 layer with 17 neurons, output size 1;
• SN: 1 layer with 16 neurons, output size 1.

A.3. Graph construction and feature encoding
The search space of candidate places is reduced using the k-eventually-follows relation with k = 1, as described in Section 3.3.3. We also included event frequency in the node features. The set of candidate silent transitions T_τ is limited to adding candidate silent transitions only between two already selected candidate places: T_τ = {(x, y) | x, y ∈ P′}, where P′ ⊆ P is the set of selected candidate places.
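A minimal sketch of such pruning follows, under the assumption that a candidate place (A, B) is kept only if every input transition is eventually followed (here with k = 1, i.e., directly followed) by every output transition somewhere in the log; the exact pruning rule and bounds in our implementation may differ.

# Sketch: pruning candidate places with the eventually-follows relation,
# k = 1 (directly-follows). df is a dict/set of observed (a, b) pairs.
from itertools import combinations

def candidate_places(activities, df, max_side=2):
    candidates = []
    for n in range(1, max_side + 1):
        for m in range(1, max_side + 1):
            for A in combinations(activities, n):
                for B in combinations(activities, m):
                    # keep (A, B) only if every a in A is directly
                    # followed by every b in B in the log
                    if all((a, b) in df for a in A for b in B):
                        candidates.append((set(A), set(B)))
    return candidates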

A.4. Framework components
For training d, 100% teacher forcing is used. S-coverability and connectedness checks are included in the candidate selection.

A.5. Evaluation data
The number of trace variants given as input is decreased to 30 for the training data to limit the graph size. For the real-life datasets, the number of trace variants is selected such that they cover at least 80% of the most frequent traces, with a minimum of 30 and a maximum of 75. Because d did not find sound process models for a couple of datasets, we manually decreased the input size, resulting in the lower numbers of included trace variants listed in Table 8. For each dataset, the second row shows the included number of most frequent trace variants, the third shows the total number of trace variants, and the last row shows what percentage the included variants cover in terms of most frequent traces.
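This selection can be sketched as follows (illustrative parameter and function names; see Table 8 for the actual numbers per dataset):

# Sketch: select the most frequent trace variants until they cover at
# least 80% of the traces, clamped between 30 and 75 variants.
from collections import Counter

def select_variants(log, coverage=0.8, lo=30, hi=75):
    counts = Counter(tuple(trace) for trace in log)
    total = sum(counts.values())
    covered, selected = 0, []
    for variant, freq in counts.most_common():
        selected.append(variant)
        covered += freq
        # stop once the coverage target and the minimum count are reached
        if covered / total >= coverage and len(selected) >= lo:
            break
    return selected[:hi]                 # never exceed the maximum count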

A.6. Inference
The beam width b for beam search was set to 10 for the synthetic data and 50 for the real-life data, decrementing b after each choice. Note that the conformance checking during evaluation was done on the complete event log. We did not use length normalization during inference.
(top left) The modeler created a partial model with transitions a, …, e, initial place p1 and place p2 between a and b. The modeler now has to identify how to encode the behavioral information after b, i.e., ⟨a, b, c, …⟩, ⟨…, b, d⟩, ⟨…, b, e⟩. This requires two choices.

Fig. 2. High-level overview of the approach. Section 4.3 introduces extensions to f to ensure S-coverability and the possibility of incorporating invisible τ-transitions.

Fig. 6. Training and testing curves of loss, % true positives, and % false positives (red: training data, blue: test data). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 7. Statistics of the classified places and their relative specifiedness in terms of the α-relations.

Fig. 8. Conformance results on 65% of the synthetic test dataset, with mean and median centroids.

Fig. 9. Histogram of (entropy-based) F-score ratios comparing our approach to the ground truth and other methods.

Fig. 11. Discovered Petri nets for the Road Traffic Fine dataset.

Fig. 13. Conformance results on the test data with (complexity = low) for the logs where the ML returned a model (75%), with mean and median centroids.

Fig. 14. Conformance results on the test data with (complexity = high) for the logs where the ML returned a model (40%), with mean and median centroids.

Fig. 15. Conformance results on real-life datasets of the three trained NNs of our approach.

The ''Select candidate'' network solves L1, i.e., estimating the parameters of the joint probability p(P′ | L) of selecting only the subset P′ ⊆ P with arg max_{P′} p(P′ | L) needed to describe the behavioral relations between transition nodes available from an event log L. The ''Stop'' network solves L2, i.e., estimating the parameters of the joint probability p_add(P′ | L) to stop adding places. However, we do not directly translate from L to p(P′ | L) and p_add(P′ | L) but introduce a latent parameter space of node embeddings to encode and propagate behavioral relations from L to the nodes in N.

Table 1. Results on artificial data shown as mean (median), with 65% success.

Table 2. Conformance results on real-life datasets.

Table 3. Parameters for generating synthetic data for the sensitivity analysis, using the PTAndLogGenerator [39].

Table 4. Results on (complexity = low) synthetic data shown as mean (median), with 75% success.

Table 5. Results on (complexity = high) synthetic data shown as mean (median), with 40% success.

Table 7. Parameters for generating synthetic data for the evaluation, using the PTAndLogGenerator [39].