Discovery of process variants based on trace context tree

Process variants usually exhibit a high degree of internal heterogeneity, in the sense that the executions of the process differ widely from each other due to contextual factors, human factors, or deliberate business decisions. Understanding the differences among process variants helps analysts and managers make informed decisions about how to standardise or otherwise improve a business process. Existing process variant mining approaches typically fall short of fully supporting semantic process variability mining; in particular, they rarely take activity behaviour relationships and trace context semantics into consideration. Here, we propose a semantic process variant discovery method that aims to distinguish similar-but-different behaviours directly from event logs. More specifically, we adapt the concepts of benchmark log and trace context tree to formalise the context semantics of an event log and to classify benchmark logs into several parts; the clustered trace cohorts are then mapped back to discover the configurable process variants. In the experimental part, several performance metrics of the proposed method are evaluated on real-world event logs, supporting its usefulness. The experimental results show that the proposed method is able to distinguish similar-but-different behaviours and is superior to the characteristic trace clustering method using conventional neural networks.


Introduction
Owing to inevitable software maintenance and the adaptability of process models, many process models in practical applications are derived from the same base model in order to match the increasing individualisation of customer demands. This kind of configurable process model is gaining importance because it offers benefits such as reusability and flexibility compared with traditional predefined business process models. Configurable process models derived from the same base model usually have a high level of similarity and cannot be differentiated from each other using state-of-the-art similarity measurements. This is especially true when process variants are mined directly from event logs, without relying on any a priori reference model.
CONTACT Huan Fang fanghuan0307@163.com
Process variant analysis is a set of techniques to analyse event logs produced during the execution of a process, in order to identify and explain the differences between two or more process models (van der Aalst, 2022). The goal of process variant analysis is to help business analysts or stakeholders to understand why and how multiple variants of a process differ (Lopez-Martinez-Carrasco et al., 2021).
In this setting, a process variant is a subset of executions of a business process that can be distinguished from others based on some characteristics (Taymouri et al., 2021). In this work, we call sets of execution logs that have a high degree of similarity similar-but-different process variants. For example, an organisation may orchestrate a given business process differently in different settings, such as product sales processes in different countries (say C1, C2, C3, C4) or accounting processes in different branches (say C1, C2, C3, C4). Since the actual executions of the same process vary with time and geography, we obtain four similar-but-different process variants, one for each of these countries or branches. In these variants, relevant event data such as location, business modules, products, and customer types may change, but the main process models are similar and can be divided into differentiated clusters. The sub-models of the clusters are functionally homogeneous but can be differentiated from each other by a number of partial variations, and these similar models can be formalised, understood, and expressed as process variants.
Process variants have proven to be a mainstream technique for developing flexible business systems that adapt to different markets, and a wide range of methods for process variant analysis have been proposed in the past decade, such as configurable BPMN (Business Process Modelling Notation) and configurable Petri nets (Van Den Ingh et al., 2021). Recent process variant mining or discovery techniques can be divided into three categories: (1) process variability modelling methods, which mainly deduce process variants through configurable operations on a base model; (2) configurable process mining methods, which discover process variants through semantic trace fragmentation or slicing operations; (3) trace clustering methods based on machine learning, which use characteristic data clustering methods to extract process variants.
Due to the interdisciplinary nature of this field, the existing methods and the types of differences they can identify vary widely. The challenges encountered in process variant discovery relate to model creation and configuration. Recently, process mining has offered advanced techniques to discover configurable process models, check their conformance, and enhance them using a collection of event logs that captures traces recorded during the execution of process variants (Bettina et al., 2022). However, existing works in configurable process mining do not incorporate semantics into the resulting model. Historically, semantic process mining has been applied to event logs to improve process discovery with respect to semantics (De Leoni et al., 2016; Khannat et al., 2021).
This paper integrates the advantages of configurable process mining and trace clustering methods and presents a log-based process variant discovery method. The main contributions of this paper are as follows. (1) The formalisation of the behaviour semantics of an event log, enriching the collection of event logs with configurable benchmark log concepts that capture the variability of elements present in the logs. This is an important step towards discovering semantically enriched process variants. First, the concepts of benchmark log and trace context tree are formalised to describe the behaviour semantics of an event log, where the kth-strict order relationship between activities is highlighted. Then, a weighted frequency cosine similarity measurement is presented in order to select the representative activity nodes and their neighbourhood length in the benchmark log. Finally, the context tree of the event log is constructed in the form of a frequent pattern tree (abbreviated as FP tree). (2) The construction of a configurable process variant discovery method based on the trace context tree, named the semantic α splitting method. The semantic α splitting method discovers trace clusters directly from an event log and combines behaviour profiles with trace clustering techniques. In the experimental part, results show that the semantic α splitting method can identify process variants that cannot be distinguished by existing methods, and that it achieves higher fitness and precision than characteristic trace clustering methods using a conventional neural network (abbreviated as CNN), especially in scenarios where different variants have different probability distributions.
The trace semantic α splitting method starts from the viewpoint of configurable process mining and improves on the characteristic trace clustering method in fitness and precision. It has three advantages. First, it constructs a configurable process mining framework that incorporates behaviour profiles, and thus enriches the process variant mining techniques available in configurable process mining. Second, it builds a trace similarity measurement that incorporates various behaviour relationships between activities in the event log, and thus extends the classical similarity measurements in machine learning. Third, it simplifies the traces of the event log to the greatest extent while preserving the behavioural relationships of activities.
In addition to these three advantages, the main innovation of the proposed method is that trace semantic extraction captures both short and long dependencies among activities, expressed as a trace context tree with neighbourhood length k (k ≥ 1).
The remainder of this paper is structured as follows. Section 2 reviews some basic concepts and notations, and Section 3 introduces the related work. Section 4 presents an illustrative motivation example. Section 5 introduces the proposed method, named the semantic α splitting method, which is based on the activity context of the event log. Section 6 conducts experiments and analyses the experimental results. Finally, Section 7 concludes this paper.

Preliminaries
In this section, we briefly review some terminology, such as events, traces, event logs, and log behaviour profiles, based on previous work (Agostinelli et al., 2023; Lu et al., 2022; Wang et al., 2022), in order to improve the readability of this paper.
A business process is a set of activities executed in a given setting to achieve a predefined business objective. An activity is an expression of the form $A(a_1, a_2, \ldots, a_{n_A})$, where $A$ is the activity name and each $a_i$ is an attribute name. We call $n_A$ the arity of $A$. The attribute names of an activity are all distinct, but different activities may contain attributes with matching names (Agostinelli et al., 2023). We assume a finite set $Act$ of activities, all with distinct names; thus, activities can be identified by their name instead of by the whole tuple. Every attribute $a_i$ of an activity $A$ is associated with a type $D_A(a_i)$, i.e. the set of values that can be assigned to $a_i$ when the activity is executed (Agostinelli et al., 2023).
An event is the execution of an activity and is formally captured by an expression of the form $e = A(v_1, v_2, \ldots, v_{n_A})$, where $A \in Act$ is an activity name with $v_i \in D_A(a_i)$ (Agostinelli et al., 2023). The set of events is denoted as $Event$.
A trace is formally defined as a finite sequence of events $\sigma = e_1, e_2, \ldots, e_n$ with $e_i = A_i(v_1, v_2, \ldots, v_{n_{A_i}})$. Traces model process executions, i.e. the sequences of activities performed by a process instance CID. A finite collection of executions into a set $L$ of traces is called an event log (Agostinelli et al., 2023).
In the behaviour profile $B_L$ of $L$, exclusiveness relations do not appear, because if a pair of events are exclusive to each other, then they never occur in the same trace. However, the converse does not necessarily hold. So, we cannot deduce the exclusiveness relationship from the event log alone.
Definition 2.6 (Frequent pattern tree):
A frequent pattern tree is a triple (Tr, P, H), where: (1) Tr is the root node of the tree; (2) P is the item prefix subtree, whose node items are denoted as tr = (tname, count, nodelink), where tname represents the identifier of the node item, count represents the number of subpaths from the root to it, and nodelink represents the next node with the same identifier tname in the prefix subtree; (3) H = (iname, hnodelink) is the frequent item header table, where iname stands for the frequent item identification domain, and hnodelink is a pointer to the first frequent item node with the same item identifier in the prefix subtree.

Related work
Process variant analysis is a rather broad topic, and the main research question (RQ for short) of this field can be summarised as "given a set of executions of two or more process variants, how to identify and explain the differences among the variants?" (Taymouri et al., 2021). The RQ can be tackled from three different perspectives: process variability modelling, configurable process mining, and trace clustering.

Process variability modelling methods
From the perspective of process variability modelling, process variants take the form of a cluster of process models, and different operational management of the base model can generate process cohorts (Döhring et al., 2014; Li et al., 2011; Rosa et al., 2017; Taymouri et al., 2021; Van Den Ingh et al., 2021). In this scenario, the above-mentioned RQ can be simplified into a fine-grained problem named RQ1.

RQ1:
Given a set of two or more process variant models, how to identify and explain the differences among variants?
Suppose that the base model of a specific business process is known. Process variant clusters or families can then be obtained by applying personalised configuration operations to the base model. These configurable operations are performed by stakeholders, process managers, or end-users, and can be expressed in forms such as Not-Functional-Requirement (Taymouri et al., 2021), reasonable process fragments (Schunselaar et al., 2012), and declarative variability rules (van Beest et al., 2019).
Although the digital and the physical worlds are closely aligned, and it is possible to track operational management processes in detail to some extent, identifying these variants remains challenging (Pourbafrani et al., 2020). When model-based comparison is applied to process variants, the key problem is that the variants are compared in terms of their model structure, whereas we aim to compare their behaviour. A low-level behavioural representation such as transition systems is therefore preferred over high-level process modelling languages such as BPMN or Petri nets. However, low-level modelling methods suffer from state explosion. Moreover, existing process variability modelling methods mostly take the control-flow perspective, and model-based approaches are unable to detect differences in terms of frequency or other perspectives. Therefore, additional comprehensive techniques could be taken into consideration to support advanced process variability modelling.

Configurable process mining methods
Process mining (van der Aalst, 2022) is a body of methods and tools to analyse business execution logs (named event logs), and many organisations have adopted process mining tools that use these event data to discover and analyse the actual execution of their business processes. In this context, an event log describes everything that occurred during the execution of the relevant system by the end-users, such as events, activities, time stamps, and case instances.
From the perspective of process mining, process variants are ultimately formalised as a cluster of process models, so the RQ defined above can be further refined into a fine-grained problem named RQ2.

RQ2:
Given a set of event logs of two or more process variants executions, how to identify and explain the differences among process variants?
This kind of process variant mining method starts from event logs directly and does not depend on any a priori knowledge about the business process model; its first step is usually to split the event logs into cohorts using trace merging or splitting operations. Chan et al. used process mining technology to mine configurable process models from event log collections and proposed a frequency-based method to guide the configuration process for discovering process variants (Chan et al., 2014). Folino et al. proposed a method for automatic discovery from the perspective of the control flow of event logs (Folino et al., 2015). This method generates collections of workflow patterns of process models; each workflow pattern describes a trace cluster, which can further be used to discover process variants. Bolt et al. integrated the control-flow and performance perspectives of the event log in order to detect related process variants in an interactive manner (Bolt et al., 2017). To address the lack of semantics in the models produced by existing configurable process discovery work, Khannat et al. proposed a method that enriches the collection of event logs with configurable process ontology concepts by introducing semantic annotations, which capture the variability of elements in the event log (Khannat et al., 2021).
Besides the studies mentioned above, there are related studies on process fragmentation or slicing for process variants. Because process variants are composed of several process fragments with commonalities and differences, they usually have high similarity. These similarities can be used to merge a cluster of variants. Hasankiyadeh et al. used a process slicing algorithm to identify process fragments in the event log (Hasankiyadeh et al., 2014). Later, Pourmasoumi et al. proposed an algorithm to extract morphological fragments from the event log (Pourmasoumi et al., 2017). Reusing the extracted morphological fragments is expected to reduce the cost of designing a new process model and to speed up the design process. Hmami et al. introduced configurable process models and the variability concept into change mining approaches, and proposed an approach for merging and filtering a collection of event logs from the same family with respect to variability (Hmami et al., 2021). This method aims to enhance change mining from a collection of event logs and to detect changes in variable fragments of the obtained event log.
Historically, semantic process mining has been applied to event logs to improve process discovery with respect to semantics. At present, the greatest challenge in configurable process mining is how to introduce semantics into the mining procedure so that similar-but-different variants can be distinguished effectively. This motivates our choice, in this paper, of a context tree that combines frequency and behaviour relationships as the semantic expression of an event log.

Trace clustering methods
From the perspective of machine learning, process variants are ultimately formalised as a cluster of process traces, so the RQ defined above is equivalent to RQ2.
Machine learning is the systematic design, analysis and study of algorithms and systems that learn from past experience. Machine learning is inherently a multidisciplinary field. For RQ2, trace clustering is a suitable technique for dividing an event log into several clusters or cohorts.
Most state-of-the-art process model discovery methods focus on how to find well-structured and understandable process models (De Leoni et al., 2016; Zandkarimi et al., 2020). However, due to the highly flexible and complex nature of processes, it may be particularly difficult to find the process models actually being executed in some real-life environments, such as health care, product development, and customer support. When processing such unstructured processes, process mining algorithms can generate an incomprehensible "spaghetti-like" process model (De Koninck et al., 2021). One of the main reasons is the diversity of event logs, that is, there are local, non-significant differences between process execution instances. These process execution instances contain many implicit process variants, and the model of each variant describes its own subset of logs better than the complete process model does.
Therefore, researchers have proposed many different techniques to solve the problems caused by the local diversity of event logs. In addition to event log filtering, event log conversion, event log sampling, and specially developed process mining algorithms (De Leoni et al., 2016), another way to overcome this limitation is trace clustering (De Koninck et al., 2021; Delias et al., 2023; Luengo & Sepúlveda, 2011; Tariq et al., 2021; Tavares et al., 2022; Vertuam Neto et al., 2021; Xu & Liu, 2019; Zandkarimi et al., 2020).
Trace clustering techniques divide the log into more homogeneous subsets, reducing the number of process instances that participate in the analysis at one time, and use similarity metrics to measure the similarity between process instances. Reported results indicate that trace clustering enhances process mining, because the traces in each cluster can be analysed independently in a flexible environment. In subsequent research, many researchers have tried to improve trace clustering algorithms from the perspectives of feature extraction, trace coding, and distance measurement, in order to enable process mining algorithms to generate more accurate process models.

Summary of related work
As outlined in the previous subsections, a wide range of methods have been proposed to tackle the problem of process variant mining. However, because of the heterogeneous nature of the underlying algorithms, there remain deficiencies and challenges, shown in Table 2, that need further study and improvement. State-of-the-art log-based process variant mining methods, whether from configurable process mining or trace clustering, are designed to distinguish similar-but-different behaviours of process variants, but only a few studies take the relationships between activities or event semantics into consideration.
Behaviour profiles have been proven to be an effective technique for evaluating the relationships between activities in event logs (Tang et al., 2022). Therefore, building on our previous work (Fang et al., 2020), this paper takes the relationships between activities, named activity context semantics, into consideration and proposes a new log-based approach to discover process variants. The approach incorporates the concept of the frequent pattern tree from the pattern mining field (Borah & Nath, 2018) in order to distinguish similar-but-different behaviours among process variant clusters.

Motivation
In order to provide a better understanding of process variants mining method, we first introduce an example to illustrate the motivation of this paper.
Assume that there are two sets of event logs, $L_1$ and $L_2$. Model $M_1$, shown in Figure 1(a), is mined from $L_1$ using the inductive mining algorithm on the pm4py platform (pm4py is the leading open source process mining platform written in Python). Model $M_2$, shown in Figure 1(b), is obtained by applying four personalised operations to model $M_1$: Move($M_1$, E, start, I), Move($M_1$, J, B, end), Insert($M_1$, X, E, B), and Move($M_1$, I, D, H). Here, Insert($M_1$, $X_1$, $X_2$, $X_3$) inserts activity $X_1$ after $X_2$ and before $X_3$ in model $M_1$, and Move($M_1$, $X_1$, $X_2$, $X_3$) moves activity $X_1$ to the position after $X_2$ and before $X_3$ in model $M_1$.
Suppose that $L_1$ and $L_2$ are known, while the a priori models $M_1$ and $M_2$ remain unknown. Let $L = L_1 \cup L_2$, i.e. the two groups of event logs are mixed together. We apply inductive mining on pm4py to log $L$; the mined model is shown in Figure 2. The fitness of the model for log $L$ is 1, but the precision is only 0.41.
Trace clustering is a common method for enhancing the discovery of process models and has been widely used in the field of process mining. Here, we apply a trace clustering method with the event frequency coding technique, which yields the two model clusters shown in Figure 3. In order to further evaluate the effectiveness of different methods in identifying similar-but-different process variants, we conduct a series of trace clustering and process mining experiments on log L; the experimental results are listed in Table 3.
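The event frequency coding mentioned here can be illustrated with a minimal sketch. It is not the implementation used in the experiments: the toy log, the helper names `frequency_encode` and `kmeans`, and the deterministic seeding of centroids are all assumptions made for illustration. Each trace is encoded as a vector of activity counts, and the vectors are grouped with a plain k-means loop.

```python
from collections import Counter

def frequency_encode(trace, alphabet):
    """Encode a trace as a vector of activity occurrence counts."""
    counts = Counter(trace)
    return [counts[a] for a in alphabet]

def squared_distance(u, v):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))

def kmeans(vectors, k, iterations=20):
    """Plain k-means; the first k distinct vectors seed the centroids (an assumption)."""
    centroids = []
    for v in vectors:
        if v not in centroids:
            centroids.append(list(v))
        if len(centroids) == k:
            break
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign every vector to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Recompute each centroid as the mean of its members.
        for c in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical toy log: two variants that share the start and end activities.
log = [
    ["A", "B", "C", "D"],
    ["A", "C", "B", "D"],
    ["A", "X", "E", "D"],
    ["A", "E", "X", "D"],
]
alphabet = sorted({a for trace in log for a in trace})
vectors = [frequency_encode(trace, alphabet) for trace in log]
labels = kmeans(vectors, k=2)
print(labels)
```

Note that frequency coding discards the order of activities: the first two toy traces receive identical vectors although B and C are swapped. This is the kind of similar-but-different behaviour that a purely frequency-based encoding cannot separate.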
Trace clustering methods evidently enhance process mining approaches (as shown in Table 3), since the fitness and precision of each cluster are higher than those obtained using process mining alone. It is noteworthy that these 4 […]
Motivated by this example, this paper proposes a novel approach to discover process variants, named the trace semantic α splitting method. It is assumed that no reference model is known, and the process variants are mined directly from the event log. The intuitive idea of our approach is to extract the benchmark log from the event log by trace compression. Trace compression preserves properties such as the activity behaviour profile and trace frequencies, while the obtained benchmark log simplifies the initial log to the greatest extent and thereby reduces the complexity of the corresponding log processing algorithms.

Benchmark log extraction
In order to discover process variants from event log, this section formalises the concept of context log, and uses it as the basis of benchmark log extraction.
Let $L = \{\sigma_i : i \geq 1\}$ be the available event log set, which is a simplified formalisation of $L = \{CID, \sigma, Lattr_{\{1,2,\ldots,m\}}\}$, and let $A = \{A_i : i \geq 1\}$ be the activity set. For a given activity $a \in A$, a sub-log related to the activity $a$ is denoted as […], and $\pi_X(\sigma_i)$ represents the projection of event sequence $\sigma_i$ on set $X$.

Definition 5.1 (kth-strict order relationship):
In the event log $L = \{\sigma_i : i \geq 1\}$, two activities $x$ and $y$ are in kth-strict order relationship, denoted as $x \rightarrow_k y$, if and only if […]
Since the relationship $x \rightarrow y$ means that there is a flow relationship between activity $x$ and activity $y$, and the kth-strict order relationship has more relaxed preconditions than the strict order relationship, we can choose a reasonable value of $k$ in $x \rightarrow_k y$ to limit the neighbourhood length of the activity $x$.

Definition 5.2:
(1) $A_a^k = \{a\} \cup \{a_i :$ […] (2) $Pr(L, A)$ is a mapping function extended from $\pi_A(\sigma_i)$, which represents the projection of event log $L$ on activity set $A$, $Pr(L,$ […]
The event log in Table 4 is used to illustrate the concept of context log. Activity z is selected as an example. According to the kth-strict order relationship, activities b, c, i, d, l and activity z are in 1st-strict order relationship, and activities b, c, e, h, m, i and activity z are in 2nd-strict order relationship. Therefore, according to Definition 5.2, $b \rightarrow z$ and $z \rightarrow_2 m$ can be obtained. If the length is limited to 2, the 2nd-context alphabet corresponding to activity z is […]
Since there may be more than one context log for a given activity, the concept of weighted frequency cosine similarity is proposed on the basis of cosine similarity; it helps to select the most suitable log from the context logs as the benchmark log.
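The projection and the context alphabet can be illustrated with a short sketch. The projection follows the description of $\pi_X(\sigma_i)$ and $Pr(L, A)$ directly. The neighbourhood rule, however, is an assumption made only for illustration: here an activity belongs to the context alphabet of $a$ when it occurs at most $k$ positions before or after some occurrence of $a$, which may differ from the exact condition in Definitions 5.1 and 5.2. The function names and the toy log are hypothetical.

```python
def project(trace, activities):
    """pi_X(sigma): keep only the events of the trace whose activity is in `activities`."""
    return [a for a in trace if a in activities]

def project_log(log, activities):
    """Pr(L, A): project every trace of the log on the activity set."""
    return [project(trace, activities) for trace in log]

def context_alphabet(log, a, k):
    """Assumed neighbourhood rule: activities within k positions of an occurrence of a."""
    alphabet = {a}
    for trace in log:
        for i, activity in enumerate(trace):
            if activity == a:
                start = max(0, i - k)
                alphabet.update(trace[start:i + k + 1])
    return alphabet

# Hypothetical toy log (not the log of Table 4).
toy_log = [
    ["a", "b", "z", "c", "m"],
    ["a", "d", "z", "l"],
]
alphabet_1 = context_alphabet(toy_log, "z", 1)
alphabet_2 = context_alphabet(toy_log, "z", 2)
context_log_1 = project_log(toy_log, alphabet_1)
print(sorted(alphabet_1), context_log_1)
```

Under this assumed rule, a larger $k$ can only enlarge the alphabet, so the context log for $k = 1$ is always a projection of the context log for $k = 2$.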

Definition 5.3 (Benchmark log):
Let $L = \{\sigma_i : i \geq 1\}$ be the available event log set, $A = \{A_i : i \geq 1\}$ be the activity set, $A_a^k$ be the kth-context alphabet corresponding to the activity $a$, and $L_a^k$ be the context log of the activity $a$, […]

Definition 5.4 (Weighted frequency cosine similarity):
Let $X(x_j^{\rightarrow}, x_j^{\parallel}, x_j^{\leftarrow^{-1}})$ and $Y(y_j^{\rightarrow}, y_j^{\parallel}, y_j^{\leftarrow^{-1}})$ be two three-dimensional vectors, where $x_j^{\alpha}$ ($\alpha \in \{\rightarrow, \parallel, \leftarrow^{-1}\}$) is the frequency of each trace in the context log with $\alpha$ relationship, $y_j^{\alpha}$ is the frequency of each trace in the original event log with $\alpha$ relationship, $\omega_j^{\alpha}$ is the weight distribution of the $\alpha$ relationship, and $p_i$ is the percentage of each trace frequency in the event log $L$. The weighted frequency cosine similarity between $X$ and $Y$ is denoted as COS(X, Y): […]
If COS(X, Y) = 1, then X and Y match exactly; conversely, if COS(X, Y) = 0, then X and Y do not match at all. The closer the weighted frequency cosine similarity is to 1, the higher the matching degree.
Therefore, the value of COS(X, Y) can be used to select the benchmark log.
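To make the selection step concrete, the sketch below uses a generic weighted cosine similarity between two three-dimensional relation-frequency vectors. It is an assumption-laden stand-in, not the formula of Definition 5.4: the trace percentage $p_i$ is omitted, and the function names, weights, and candidate vectors are hypothetical.

```python
import math

def weighted_cosine(x, y, weights):
    """Generic weighted cosine similarity (a stand-in, not the exact COS(X, Y))."""
    numerator = sum(w * a * b for w, a, b in zip(weights, x, y))
    norm_x = math.sqrt(sum(w * a * a for w, a in zip(weights, x)))
    norm_y = math.sqrt(sum(w * b * b for w, b in zip(weights, y)))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return numerator / (norm_x * norm_y)

def select_benchmark(candidates, reference, weights):
    """Return the name of the candidate context log most similar to the reference."""
    return max(candidates, key=lambda name: weighted_cosine(candidates[name], reference, weights))

# Hypothetical relation frequencies of the original log and of two context logs.
weights = [0.5, 0.3, 0.2]
reference = [10, 4, 6]
candidates = {"k=1": [9, 1, 1], "k=2": [5, 2, 3]}
best = select_benchmark(candidates, reference, weights)
print(best)
```

In this toy setting the candidate for k=2 is exactly proportional to the reference vector, so its similarity is 1 and it is selected.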

Context tree construction based on FP tree
Generally, different process variants share a set of common activities, and process variants are usually realised through individual orchestration and configuration of these common activities. Therefore, this subsection starts with the common activities of traces and gives definitions of the trace context and the context tree of an event log, in order to illustrate the semantic contexts of activities and traces.

Definition 5.5 (Trace context):
Let $L = \{\sigma_i : i \geq 1\}$ be the available event log set, $\sigma_i$ be a trace, and LCP be the longest common prefix of the traces in log $L$. SP is called the context of the trace $\sigma_i$ if and only if $SP = \{d \in 2^{\sigma_i} \mid \sigma_i = LCP|d\}$, where the symbol "|" represents the concatenation operator.
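As a minimal sketch of Definition 5.5, the longest common prefix of a log and the context of a trace (the remainder d with trace = LCP|d) can be computed as follows. Traces are represented as lists of activity names; the function names and the toy log are hypothetical.

```python
def longest_common_prefix(log):
    """Return the longest prefix shared by all traces of the log."""
    if not log:
        return []
    prefix = []
    # zip stops at the shortest trace, so the prefix never overruns a trace.
    for events in zip(*log):
        if all(e == events[0] for e in events):
            prefix.append(events[0])
        else:
            break
    return prefix

def trace_context(trace, lcp):
    """Return d such that trace = lcp | d, where | is concatenation."""
    return trace[len(lcp):]

toy_log = [
    ["a", "b", "c", "d"],
    ["a", "b", "e"],
    ["a", "b", "c", "f"],
]
lcp = longest_common_prefix(toy_log)
contexts = [trace_context(trace, lcp) for trace in toy_log]
print(lcp, contexts)
```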
As the activity common prefix can be represented by a prefix tree, in order to effectively identify the semantic context, a novel prefix tree structure named context tree is introduced here on the basis of the frequent pattern tree (Definition 2.6).

Definition 5.6 (Context Tree):
A triple CT = (Tr, P, H) that fulfils the following conditions is called a context tree: (1) Tr is the root node of the context tree; (2) P is the context prefix subtree, whose nodes have the form t = (tname, count, nodelink), where tname represents the activity name of the node, count represents the number of subpaths from the root to it, and nodelink represents the next node with the same identifier tname in the prefix subtree (null if there is none); (3) H = (iname, hnodelink) is the context header table, where iname represents an activity name and hnodelink is a pointer to the first node with that activity name in the prefix subtree.
The context tree corresponding to the event log in Table 4 is shown in Figure 4. Each trace in the event log is represented as a branch of the context tree. The context tree has a top-down layout, and traces with the same prefix share a branch from the root node. In addition, the context header table speeds up retrieval during the dynamic construction and querying of the tree.
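The structure in Definition 5.6 can be sketched as a prefix tree with a header table. This is an illustrative data structure under stated assumptions, not the authors' Algorithm 2: the class and method names are hypothetical, `count` is interpreted as the number of inserted traces passing through a node, and the toy traces are not those of Table 4.

```python
class Node:
    """A node t = (tname, count, nodelink) of the context prefix subtree."""
    def __init__(self, tname, parent=None):
        self.tname = tname        # activity name
        self.count = 0            # number of inserted traces passing through this node
        self.nodelink = None      # next node with the same activity name, or None
        self.parent = parent
        self.children = {}        # activity name -> child node

class ContextTree:
    def __init__(self):
        self.root = Node(None)
        self.header = {}          # iname -> first node with that activity name

    def insert(self, trace):
        """Add one trace as a branch; traces with a shared prefix share nodes."""
        node = self.root
        for activity in trace:
            child = node.children.get(activity)
            if child is None:
                child = Node(activity, node)
                node.children[activity] = child
                self._link(child)
            child.count += 1
            node = child

    def _link(self, new_node):
        """Append a new node to the nodelink chain of its activity name."""
        first = self.header.get(new_node.tname)
        if first is None:
            self.header[new_node.tname] = new_node
            return
        while first.nodelink is not None:
            first = first.nodelink
        first.nodelink = new_node

    def nodes_named(self, name):
        """Follow the header table and nodelinks to visit all nodes of an activity."""
        node = self.header.get(name)
        while node is not None:
            yield node
            node = node.nodelink

tree = ContextTree()
for trace in [["a", "b", "c"], ["a", "b", "d"], ["a", "c"]]:
    tree.insert(trace)
print(tree.root.children["a"].count, len(list(tree.nodes_named("c"))))
```

The header table lets a query reach every occurrence of an activity without traversing the whole tree, which is the retrieval benefit described above.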

Selection of activity node and neighbourhood length
Because of the different choices of activity nodes and neighbourhood lengths, the context logs extracted from the same event log are likely to differ, which results in different benchmark logs. If $A_a^k$ is calculated for every activity node in the activity set, the number of context logs generated is very large, which makes the selection of the benchmark log computationally expensive.
To reduce the computational cost, the activity node and its neighbourhood length should be selected carefully.
Firstly, it is reasonable to narrow the nodes in the activity set to a controllable range. Activity nodes are selected mainly according to the number of times they occur in the event log (i.e. their frequency), and the activity nodes are sorted by frequency in descending order. The average frequency of the activity nodes is then calculated, and the activity nodes that occur less frequently are excluded, which reduces the selection range. The remaining activity nodes can be further divided into clusters based on frequency, and a representative activity node can be selected from each cluster for the next operation.
Secondly, it is also crucial to choose the length of the neighbourhood in order to select the appropriate context log. For the selected activity node $a$, the corresponding kth-context alphabet $A_a^k$ is determined; that is, the alphabet contains the context activities of the activity node.
In summary, for the selected activity node $a$, an important step is determining the value of $k$ in $A_a^k$, i.e. the neighbourhood length.
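The frequency-based narrowing of activity nodes can be sketched as follows. The text does not specify how the remaining activities are split into clusters, so the split into an upper and a lower half of the frequency ranking is an assumption; the function name and the toy log are hypothetical.

```python
from collections import Counter

def select_candidate_activities(log):
    """Rank activities by frequency, drop those below the average, split into 2 groups."""
    freq = Counter(a for trace in log for a in trace)
    average = sum(freq.values()) / len(freq)
    # most_common() returns activities sorted by frequency in descending order.
    kept = [(a, n) for a, n in freq.most_common() if n >= average]
    half = (len(kept) + 1) // 2   # assumed split rule: upper and lower half of the ranking
    return kept[:half], kept[half:]

toy_log = [
    ["a", "b", "c"],
    ["a", "b", "d"],
    ["a", "c"],
    ["a", "b"],
]
high_group, low_group = select_candidate_activities(toy_log)
print(high_group, low_group)
```

A representative node would then be picked from each group, and the neighbourhood length k chosen for it.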

Algorithms and complexity analysis
Here, we propose the semantic α trace splitting method to discover process variants directly from an event log. Three algorithms (Algorithms 1-3) are formalised to illustrate the detailed procedures of this method.

Algorithms
Algorithm 1 Benchmark log extraction algorithm.
Input: Event log $L$, threshold $\varphi$, and weight values $\omega_j^{\alpha}$ ($\alpha \in \{\rightarrow, \parallel, \leftarrow^{-1}\}$).
Output: Benchmark log $L_B^k$.
1: Extract all activities in $L$ to form the activity set $T$;
2: Count the frequency of each activity in $T$, and form an activity frequency vector $F$;
3: Sort $F$ in descending order;
4: Compute the average frequency thr of $T$, delete all activities with frequency lower than thr, and update $T$ and $F$;
5: According to the frequency in $F$, divide $T$ into 2 subgroups: $subT_1$, $subT_2$;
6: for i = 1 to 2 do
7:   Randomly select a representative node $N_i$ in $subT_i$;
8:   Set k = 1;
9:   if the number of activities in the […]
     Determine neighbourhood length is equal to k;
     Count the frequency of context log $L_a^k$, and calculate $x_j^{\alpha}$ ($\alpha \in \{\rightarrow, \parallel, \leftarrow^{-1}\}$), the frequency of each trace in the context log with $\alpha$ relationship;
20: end for
21: According to Definition 5.4, calculate the weighted frequency cosine similarity value of each context log, and select the context log that has the maximum value as the benchmark log
Algorithm 1 extracts the benchmark log from the initial event log. Four parameters must be given as constants in advance: the threshold $\varphi$ and the weight values $\omega_j^{\alpha}$ ($\alpha \in \{\rightarrow, \parallel, \leftarrow^{-1}\}$). It is suggested that $\varphi$ be a user-defined value between 0.6 and 0.8. Algorithm 2 constructs the context tree on the basis of benchmark logs, and Algorithm 3 uses benchmark logs and context trees as input to discover process variants. It is noteworthy that the final mined process variants are depicted in the form of Petri nets.

Complexity analysis
Given an event log L, let n be the number of activities contained in the log and m the number of traces; L is used as input to analyse the complexity of each algorithm. The core of Algorithm 1 is to extract the benchmark log: it calculates the average frequency avg of all activities in the event log, deletes the activity nodes whose frequencies are lower than avg from the alphabet, and classifies the remaining activities. After this series of operations, p × k benchmark logs are obtained, so the time complexity of extracting the benchmark log is O(pk). The core of Algorithm 2 is to construct the context tree: assuming that the number of activities in the longest common prefix is z and the number of activities in the remaining activity sequence is q, the corresponding time complexity is O(m(z + q)). The core of Algorithm 3 is mining process variants: suppose that there are x clusters and that the traces in the benchmark log are added to the clusters using the context tree, so the time complexity is O(mx). Additionally, the complexity of mapping the clusters of the benchmark log to their counterparts in the original event log is O(x). Therefore, the total time complexity of mining process variants from the event log is O(pk + m(z + q) + x), which is of the same order as O(n^2).

Experiment and evaluation
In this section, we apply the proposed method to three kinds of real, complex industrial logs, showing that, although it is designed to distinguish similar-but-different behaviours, such as those in a bank's credit authorisation system (BPIC 2012) (Bautista et al., 2012), the proposed method can also provide insights and unveil some deficiencies of existing methods. Importantly, the proposed process variant mining method can remedy the deficiency of trace clustering in variant mining, especially in the scenario where different variants have imbalanced distributions. The first part of this section describes the variant mining procedure in the credit authorisation system (BPIC 2012), the second part provides a comparative analysis between this work and existing methods, and the third part provides an in-depth discussion of the findings and potential limitations.

Algorithm 3 (continued).
Calculate context(bt_i) = π_A(bt_i);
5: end for
6: Compute the distance of any pair of traces bt [...]
[...] Merge two traces bt_i, bt_j into the same cluster, and mark the leaves of traces bt_i, bt_j in CT as the label of cluster bt_i;
9: until the number of all clusters is less than or equal to CluNum
10: for each trace t_i in L do
11: Assign each trace t_i the same cluster identifier as its projection trace in the benchmark log;
12: end for
13: Form the trace clusters Clusters of event log L;
14: Transform Clusters into the form of the process variant model PV in Petri net;
15: return PV;

Case study
In practice, banks may apply different credit authorisation policies to different types of lenders. Based on these policies, event logs are generated during the execution of the credit business process from BPIC 2012, which is taken as a case study to validate the effectiveness of the method proposed in this paper. Using the credit model alignment technology (Borah & Nath, 2018) for change operations, the operational logs were extracted from the CPN Tools platform; the extracted event logs are shown in Table 5. There are 23 activities in the event log, and the activity label table is N = {a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w}. The meaning of each label in Table 5 is as follows: a (accepting credit applications), b (starting to process credit applications), c (registering customer information), d (checking customer credit), e (contacting the bank), f (checking company funds), g (check interest), h (stop inspection), i (verify residential area type), j (check property information), k (request the mortgage insurer), l (verify the information is qualified), m (approving loan type), n (approval request), o (check income), p (archive request), q (contact customer), r (check documents), s (verify the loan amount), t (verify the system funds), u (end the verification), v (end the inspection phase), w (the loan is successful).
In Algorithm 1, lines 1-2 are executed to obtain the frequency of each activity in the activity label table N; the frequencies are listed in Table 6 in descending order. The average frequency avg of these activities is calculated as 166.87 and is used as the threshold value.
Every activity node whose frequency is below avg is filtered out according to Algorithm 1 in order to reduce the number of activity nodes in the activity alphabet N, resulting in the filtered activity alphabet N = {a, b, w, c, d, e, h, n, o, p}. Line 5 of Algorithm 1 is executed to classify the nodes in the activity alphabet N into two categories according to their frequencies, i.e. 350 and 200, respectively, and line 7 is executed to randomly select a representative node from each category, namely b and h, which are used in the subsequent operations.
After that, lines 10-11 of Algorithm 1 are executed to calculate the context activities corresponding to the kth-strict order relationship of the representative nodes b and h for each length k; the activity nodes $A_i^k$ are listed in Table 7. To avoid underfitting due to a small number of activity nodes, the number of activity nodes is kept within a certain range by the judging condition in line 12 of Algorithm 1. Lines 13-14 of Algorithm 1 are designed to keep the number of activities contained in the activity relationship table within a threshold range, which is a user-defined parameter (in this paper, the threshold range is set to 60%-80%). Therefore, any length l below or above k is not taken into consideration, ensuring that the contextual alphabets have a reasonable complexity.
There are 23 activities in total in the activity alphabet, so the number of activity nodes allowed for length k ranges from 13.8 to 18.4 (60%-80% of 23); this range is used to obtain $A_i^k$, where i = b or h. From Tables 8 and 9, it can be deduced that the neighbourhood length selected for node b is k = 5, and that the lengths selected for node h can be k = 5, k = 6 or k = 7 according to the range 13.8-18.4. However, since the activity alphabets corresponding to lengths k = 6 and k = 7 of node h are the same, k is set to 6 for node h.
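The selection of admissible neighbourhood lengths can be checked with a short Python sketch. The alphabet sizes are the counts reported in Tables 8 and 9 for the rows whose length label is legible; the function name `candidate_lengths` and its parameters are hypothetical names introduced for illustration.

```python
def candidate_lengths(alphabet_sizes, total_activities, low=0.6, high=0.8):
    """Return the lengths k whose context alphabet size lies within
    [low, high] times the total number of activities."""
    lower, upper = low * total_activities, high * total_activities
    return [k for k, size in sorted(alphabet_sizes.items())
            if lower <= size <= upper]


# Number of activities in the context alphabet for each length k,
# as reported in Tables 8 (node b) and 9 (node h).
sizes_b = {4: 13, 5: 16, 6: 22, 7: 22}
sizes_h = {3: 11, 4: 13, 5: 14, 6: 16, 7: 16}

lengths_b = candidate_lengths(sizes_b, total_activities=23)
lengths_h = candidate_lengths(sizes_h, total_activities=23)
print(lengths_b, lengths_h)
```

The sketch only returns the admissible lengths; the final choice among several candidates for node h (k = 6 in the case study) remains a separate decision.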
The context logs of nodes b and h are extracted from the event log by mapping L to $A_i^k$ (line 17). The details are shown in Table 10, where each row represents a trace in the context log.
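As an illustration of this mapping, the sketch below projects every trace of a log onto a context alphabet and aggregates the multiplicities of identical projections. It assumes one-letter activity labels and that empty projections are discarded; the names `project` and `context_log`, the toy log and the toy alphabet are all hypothetical.

```python
from collections import Counter


def project(trace, alphabet):
    """Keep only the activities of the trace that belong to the alphabet,
    preserving their order."""
    return "".join(a for a in trace if a in alphabet)


def context_log(log, alphabet):
    """Project each (trace, multiplicity) pair onto the alphabet and sum the
    multiplicities of traces that become identical after projection."""
    projected = Counter()
    for trace, count in log:
        p = project(trace, alphabet)
        if p:  # assumption: empty projections are dropped
            projected[p] += count
    return projected


# Made-up log and alphabet, for illustration only.
toy_log = [("abrcdhw", 2), ("abrdchw", 1), ("abshw", 3), ("abthw", 1)]
toy_alphabet = set("abcdhw")
ctx = context_log(toy_log, toy_alphabet)
print(dict(ctx))
```

Traces that differ only in activities outside the alphabet collapse into one projected trace, which is how the context log becomes shorter and smaller than the original log.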
Hereafter, the context logs in Table 10 are used as the basis for calculating Equation (1) (lines 18-20). The weights of the three behavioural relations are set as follows: the weight of the strict order relation is 45%, the weight of the interleaving order relation is 30%, and the weight of the strict inverse order relation is 25%. The calculation results are listed in Table 11.

Table 8. Activity alphabet with node b of length k.
Length | Activity alphabet | Number of activities
[...] | [...] r, c, d, s, f, e, t, u, g, h | 11
k = 4 | a, r, c, d, s, f, e, t, u, g, h, i, v | 13
k = 5 | a, r, c, d, s, f, e, t, u, g, h, i, v, j, k, o | 16
k = 6 | a, r, c, d, s, f, e, t, u, g, h, i, v, j, k, o, l, m, n, w, p, q | 22
k = 7 | a, r, c, d, s, f, e, t, u, g, h, i, v, j, k, o, l, m, n, w, p, q | 22

Table 9. Activity alphabet with node h of length k.
Length | Activity alphabet | Number of activities
[...] | [...] i, n, w, d, c, j, k | 8
k = 3 | e, i, n, w, d, c, j, k, f, m, l | 11
k = 4 | e, i, n, w, d, c, j, k, f, m, l, a, b | 13
k = 5 | e, i, n, w, d, c, j, k, f, m, l, a, b, o | 14
k = 6 | e, i, n, w, d, c, j, k, f, m, l, a, b, o, p, q | 16
k = 7 | e, i, n, w, d, c, j, k, f, m, l, a, b, o, p, q | 16

As indicated in Table 11, $L_h^6$ is selected as the benchmark log. Based on it, line 3 of Algorithm 2 is executed to split each trace in the benchmark log $L_h^6$ into two parts: the longest common prefix of the traces and the remaining activity sequence. Taking the first trace in the benchmark log $L_h^6$ as an illustration, ab is the longest common prefix and the remaining activity sequence is $d_1 = \langle cdehijlkmnopw \rangle$. After that, Algorithm 2 is executed to update the context tree with new activities iteratively, and the final context tree is constructed as shown in Figure 5.
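The splitting step can be reproduced with a small Python sketch that uses the traces and frequencies reported for the benchmark log $L_h^6$ in the case study. The tree built here is a plain prefix tree with frequency counts, which is a simplifying assumption: the context tree of Algorithm 2 may carry additional information, and the function names are hypothetical.

```python
import os

# Traces of the benchmark log with their frequencies, as reported in the case study.
benchmark_log = {
    "abcdehijlkmnopw": 47, "abdcehijklmnoqw": 28, "abcdehijlkmnop": 25,
    "abhw": 50, "abcdfhnw": 48, "abdcfhnw": 52, "abhoqw": 48, "abhopw": 52,
}


def longest_common_prefix(traces):
    """Character-wise longest common prefix of all traces."""
    return os.path.commonprefix(list(traces))


def build_prefix_tree(log):
    """Insert every trace into a nested-dict prefix tree; each node stores
    how many trace occurrences pass through it."""
    root = {"count": 0, "children": {}}
    for trace, count in log.items():
        node = root
        node["count"] += count
        for activity in trace:
            node = node["children"].setdefault(
                activity, {"count": 0, "children": {}})
            node["count"] += count
    return root


lcp = longest_common_prefix(benchmark_log)
remaining = {trace: trace[len(lcp):] for trace in benchmark_log}
tree = build_prefix_tree(benchmark_log)

print(lcp, remaining["abcdehijlkmnopw"])
```

Storing counts on the nodes makes the shared prefix visible directly: every node on the common prefix carries the total frequency of the log, and the tree branches where the traces start to differ.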
In Algorithm 3, each trace in the benchmark log initially forms a cluster, and then the trace distance measurement $distance(bt_i, bt_j) = 1 - |\pi_A(bt_i) \cap \pi_A(bt_j)| / |A|$ is used to merge the two nearest traces into the same cluster iteratively, until the number of clusters is less than or equal to the value set a priori. After the trace clustering in the benchmark log is completed, the counterpart of each benchmark trace is identified in the original event log. Finally, the mined process variant models are depicted in the form of Petri nets. Based on the context tree in Figure 5, the benchmark log is clustered into 4 clusters, i.e. $L_{h_1}^6 = \{\langle abcdehijlkmnopw \rangle^{47}, \langle abdcehijklmnoqw \rangle^{28}, \langle abcdehijlkmnop \rangle^{25}\}$, $L_{h_2}^6 = \{\langle abhw \rangle^{50}\}$, $L_{h_3}^6 = \{\langle abcdfhnw \rangle^{48}, \langle abdcfhnw \rangle^{52}\}$, and $L_{h_4}^6 = \{\langle abhoqw \rangle^{48}, \langle abhopw \rangle^{52}\}$. The last 3 lines of Algorithm 3 are executed to map the benchmark log to the original event log, and the final process variant models are mined in the form of Petri nets, as shown in Figure 6.
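The distance measurement and the iterative merging can be sketched in Python as follows. Several points are assumptions made for this sketch: A is taken as the set of all activities occurring in the benchmark traces, $\pi_A$ is read as the set of activities of a trace restricted to A, and clusters are compared by single linkage. The sketch does not use the context tree, so it is not expected to reproduce the four clusters reported above; all function names are hypothetical.

```python
from itertools import combinations


def activity_set(trace, alphabet):
    """Set of activities of the trace that belong to the alphabet."""
    return {a for a in trace if a in alphabet}


def distance(t1, t2, alphabet):
    """distance = 1 - |shared activities| / |A|, as in the formula above."""
    shared = activity_set(t1, alphabet) & activity_set(t2, alphabet)
    return 1 - len(shared) / len(alphabet)


def merge_until(traces, alphabet, clu_num):
    """Start with one cluster per trace and repeatedly merge the two nearest
    clusters (single linkage, an assumption) until clu_num clusters remain."""
    clusters = [[t] for t in traces]
    while len(clusters) > clu_num:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: min(distance(x, y, alphabet)
                                 for x in clusters[pair[0]]
                                 for y in clusters[pair[1]]),
        )
        clusters[i].extend(clusters[j])  # i < j, so deleting j is safe
        del clusters[j]
    return clusters


traces = ["abcdehijlkmnopw", "abdcehijklmnoqw", "abcdehijlkmnop", "abhw",
          "abcdfhnw", "abdcfhnw", "abhoqw", "abhopw"]
A = set("".join(traces))  # assumption: A is the alphabet of these traces
clusters = merge_until(traces, A, clu_num=4)
print(distance("abhw", "abhopw", A), clusters)
```

Note that, read literally, this distance is not zero for identical traces unless a trace covers the whole alphabet, which is one reason the additional structure of the context tree matters for the grouping.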
As each activity in Figure 6 has a specific meaning, the credit processes can be interpreted as housing loan, student loan, commercial loan, and microloan processes. These 4 process variants share many commonalities while exhibiting a certain degree of difference. Distinguishing the behaviours of these similar-but-different process variants can enhance process mining and facilitate subsequent management by organisational managers.

Comparative analysis
In order to validate the feasibility of our approach, we have evaluated two kinds of real-life event cases, and compared this work with the activity recommendation approach (Chan et al., 2014) and the trace clustering method (Xu & Liu, 2019) using CNN coding (Pan et al., 2020).
The activity recommendation approach (Chan et al., 2014) develops process variants by using event logs to recommend activities in the process model, provided that the process model is known. The main deficiencies of this method are as follows: (1) it requires an a priori known process model; (2) it has a large computational complexity. By comparison, the trace semantic α splitting method proposed in this paper overcomes these two problems. Table 12 gives an illustrative comparison between these two methods, where n_A represents the number of activities, n_P represents the number of business process variants, n represents the maximum number of public activities located on a layer, and k represents the number of layers considered. To validate the event complexity calculated by the method in this paper, the dataset used for the experiments in this section consists of two parts: the event logs of the bank credit process and those of the library book checkout and return process; the data can be obtained from the website given in footnote 2. The main indices of the event logs are shown in Table 13.
For an effective comparison, we calculate: (1) the average number of configuration steps required to derive the process variants; (2) the proportion of traces in the log that can be completely replayed by the process variants; (3) the accuracy of the model. The results are shown in Table 14. Furthermore, we conduct another comparison experiment with the trace clustering method using CNN, based on the same two datasets. The experimental results are listed in Table 15, and graphical comparisons are shown in Figure 7.
From Tables 14 and 15, the following conclusions can be drawn.
(1) Compared with the work of Chan et al. (2014), the average numbers of configuration steps required to derive the process variants are similar for both methods, with our method slightly better than the activity recommendation method; furthermore, the differences between our work and Chan et al. (2014) are statistically significant in fitness and accuracy, as the P-values of the Wilcoxon test are equal to 0.0017 for both indicators. Hence, the proposed method is superior to the activity recommendation method on these indicators.
(2) Compared with the work of Xu and Liu (2019), the differences in fitness and accuracy are not statistically significant, as the P-value of the Wilcoxon test is 0.27 for fitness and 0.74 for accuracy, both greater than the significance level (0.05). However, the mean fitness and the mean accuracy of our method are both higher than those in Xu and Liu (2019).
In order to further discuss the differences between our work and the trace clustering method in Xu and Liu (2019), the next subsection presents two further experiments on event logs with different distributions of the variants.

[...] relationships of the log, so the proposed method can effectively capture the behavioural differences of different variants. Admittedly, as the proposed method requires additional procedures for activity context calculation within the neighbourhood length k, its running time is longer than that of the characteristic clustering methods. Taking the BPIC2015 event log as an example, the characteristic trace clustering method using CNN takes 3.97 seconds, while our method takes 16.87 seconds. Based on the three medium-scale datasets mentioned, a detailed execution time comparison is depicted in Figure 9.

Conclusions and future work
Process variants are a set of models or execution logs that have a high degree of similarity, yet the behaviour of each process variant differs from the others. When only event logs are given, how to mine process variants remains an open and difficult problem. At present, the latest trace clustering methods can significantly improve process mining in fitness and precision. However, for process variant mining, state-of-the-art trace clustering methods cannot work effectively because of the inherently high similarity of variants. To the best of our knowledge, only a few studies aim at discovering process variants directly from the perspective of configurable process mining without relying on any a priori process model.
In this work, we propose a semantic α splitting method based on the activity context of the event log to discover process variants directly from event logs and obtain the process variant clusters. Unlike previous work based on configuration operations, the proposed method combines the advantages of configurable process mining and trace clustering methods, and presents a trace similarity measurement incorporating the behaviour profiles of the event log. The main innovation of the proposed method is that it extracts the trace semantics in the form of a trace context tree, where the short and long dependencies of activities are expressed as the kth-strict order relationship. The kth-strict order relationship is a kind of behaviour profile; it helps to simplify the event logs and hence to reduce the calculation complexity. The paper achieves two main objectives: (1) A framework of semantic process variant mining is constructed, which depicts the semantics of the event log as a context tree and simplifies the event log by using the activity alphabet within the neighbourhood length k. This approach processes traces directly without converting them into any other form, and shortens the traces to be handled in the event log.
(2) A process variant discovery method is presented, which can effectively discover the process cohorts. The mined process variants are hard to identify because of their inherently high similarity. We conduct a series of experiments based on real-life datasets, and compare the proposed work with methods in configurable process mining and trace clustering. The experiments and discussions demonstrate that the proposed method works effectively with high fitness and precision and, most importantly, that it can discover variants that cannot be mined through the characteristic trace clustering method.
While the proposed method has the above-mentioned advantages, it also has some shortcomings. For example, it has a longer execution time than characteristic trace clustering methods, although this execution time is still in an acceptable range. As future work, we will focus on improving the performance of configurable process mining algorithms and on developing a plug-in component for ProM (Process Mining Framework). In addition, exploring a comprehensive configurable process mining method that involves multi-perspective event attributes, such as resources and organisations, will be a key direction of future research.

Figure 1. A reference model and its process variant. (a) The mined reference model M_1 for log L_1 using the inductive mining method. (b) A process variant model M_2 derived from M_1 by personalised operations.

Figure 2. The mined model of log L through inductive mining.

Figure 3. Two clusters deduced from log L through the trace clustering method (Xu & Liu, 2019). (a) The first cluster model through the trace clustering method. (b) The second cluster model through the trace clustering method.

Definition 5.2 (Context Log): Let $A_a^k$ be the kth-context alphabet corresponding to the activity a, and $L_a^k$ be the context log of the activity a, $L_a^k = \{t_i \mid t_i \in Pr(L, A_a^k)\}$, where:

Figure 4. The context tree corresponding to the event logs in Table 2.

Algorithm 3 Process variants discovering algorithm.
Input: Benchmark log $L_B^k$, initial event log L, context tree CT, number of clusters CluNum.
Output: Trace clusters Clusters, process variants PV.
1: Convert the trace multi-set of $L_B^k$ to the trace set $setL_B^k$;
2: Assign a cluster label to each leaf of CT;
3: for each trace bt_i in $L_B^k$ do
4: len [...]

Figure 6. Process variants (a), (b), (c), (d) found in the event log. (a) Housing loan process. (b) Student loan process. (c) Commercial loan process. (d) Small loan process.

Figure 8. Fitness and precision comparisons for event logs with different variant distributions. (a) Fitness of L. (b) Precision of L. (c) Fitness of BPIC2015. (d) Precision of BPIC2015.

Figure 9. Execution time of different methods.

Table 1. An example of event logs.

Table 2. The comparisons of different process variants discovery methods.

Table 3. Experimental results comparison among different methods for log L = L 1 L 2.

[...] characteristic clustering methods have a common bottleneck: the fitness and precision are both equal to 1 in one cluster, whereas precision is relatively low in the other cluster. This indicates that characteristic trace clustering methods cannot distinguish the above-mentioned similar-but-different behaviours of process variants. The experimental results in Table 3 also verify that characteristic trace clustering and process variant mining methods differ from each other, especially in the scenario of distinguishing similar-but-different behaviours.

Table 4. An example of event log L to illustrate context log.

Algorithm 2 Context tree construction algorithm.
Context tree CT.
1: Set i = 1, num is the number of traces contained in $L_B^k$;
2: Initialise context tree CT = ∅;
3: Calculate the longest common prefix LCP of the traces in $L_B^k$ and its frequency cnt;
4: Add the sequential activity nodes in LCP to the tree CT, T = Addnode(CT, LCP, cnt), activity node set N = π_A(LCP);
5: for i = 1 to num do
[...] N = N ∪ {π_A(d_i)};
18: Update context tree CT with the new activity set N;
19: return CT;

Table 5. Event logs of credit disposal processes.

Table 6. Frequency of activities.

Table 7. kth-strict order relationships of nodes b and h.

Table 8. Activity alphabet with node b of length k.

Table 10. Context logs for nodes b and h.
Table 11. Weighted frequency cosine similarity calculation results.
Table 12 gives an illustrative

Table 12. Comparison of process variants discovery methods.

Table 13. Holistic information of datasets.

Table 17. Comparisons among different methods for BPIC 2015 log.