Automated Discovery of Declarative Process Models with Correlated Data Conditions

Automated process discovery techniques enable users to generate business process models from event logs extracted from enterprise information systems. Traditional techniques in this field generate procedural process models (e.g., in the BPMN notation). When dealing with highly variable processes, the resulting procedural models are often too complex to be practically usable. An alternative approach is to discover declarative process models, which represent the behavior of the process as a set of constraints. Declarative process discovery techniques have been shown to produce simpler models than procedural ones, particularly for processes with high variability. However, the bulk of approaches for automated discovery of declarative process models focus on the control-flow perspective, ignoring the data perspective. This paper addresses the problem of discovering declarative process models with data conditions. Specifically, the paper tackles the problem of discovering constraints that involve two activities of the process such that each of these two activities is associated with a condition that must hold when the activity occurs. The paper presents and compares two approaches to the problem of discovering such conditions. The first approach uses clustering techniques in conjunction with a rule mining technique, while the second approach relies on redescription mining techniques. The two approaches (and their variants) are empirically compared using a combination of synthetic and real-life event logs. The experimental results show that the former approach outperforms the latter when it comes to re-discovering constraints artificially injected in a log. Also, the former approach is in most cases more computationally efficient. On the other hand, redescription mining discovers rules with higher confidence (and lower support), suggesting that it may be used to discover constraints that hold for smaller subsets of cases of a process.


Introduction
Automated process discovery techniques take as input an event log recording the execution of instances of a business process over a period of time, and produce as output a process model that captures the behavior observed in the log. The process models produced by automated process discovery techniques mainly fall into two categories: procedural and declarative. The dichotomy procedural versus declarative when choosing the most suitable language to represent the output of a process discovery technique has been widely studied [1,2]: procedural languages are suitable for processes with low variability (i.e., relatively low number of variants), whereas declarative languages are suitable for processes with high variability [3,4,5,6].
This paper focuses on the problem of discovering declarative process models from event logs. The paper tackles a still open challenge in this field, namely that of discovering multi-perspective declarative process models, i.e., models that take into account both the control-flow perspective (which events occur and in which order) and the data perspective, specifically the (data) conditions that hold when a given event occurs. For example, in the context of a loan application process, we aim at discovering rules like: "when an applicant having a salary lower than 24 000 euros per year submits a loan application, eventually an assessment of the application will be carried out, and the type of the assessment is complex." In this example, a response Declare constraint (the submission of an application is eventually followed by an assessment) is satisfied only when both a condition on the payload of the activation (i.e., on the salary of the applicant submitting the application) and a condition on the payload of the target (i.e., on the type of assessment) are satisfied. The former condition is called an activation condition, the latter is called a target condition, and when both conditions are present in a Declare constraint, we say that the constraint contains two correlated data conditions.
The paper proposes two approaches for automated discovery of declarative process models with correlated data conditions. Both techniques start by discovering a set of frequent constraints from an event log. A frequent constraint is a constraint having a high number of constraint instances, i.e., pairs of events (one activation and one target) satisfying it. In the first approach, we cluster the target payloads to find groups of targets with similar payloads. These groups are used as labels for rule mining. In particular, the labels together with the features extracted from the activation payloads are used as input of a classification problem to discover correlations between the activation payloads and the target payloads. In the second approach, we apply redescription mining to sets of activation and target payloads to find the rules that correlate these two sets.
This article is an extended and revised version of a conference paper [7]. The conference version focused on the first type of approach (clustering followed by rule discovery). In this article, we present an alternative approach based on redescription mining and we empirically evaluate the tradeoffs between these two approaches and their variants. This article also includes a more extensive validation that considers both the accuracy and scalability of the proposed approaches.
The paper is structured as follows. Section 2 provides the necessary background to understand the rest of the paper, and presents related work. Section 3 introduces a running example used to illustrate the proposed techniques. Section 4 presents the proposed techniques, while Section 5 discusses their evaluation on synthetic and real-life logs. Finally, Section 6 concludes the paper and spells out directions for future work.

Background and Related Work
This section introduces the notion of event log (Section 2.1). It then defines the declarative process modeling notations used in the rest of the paper (Sections 2.2 and 2.3) and provides an overview of related work (Section 2.4).

Event Log
The starting point for process mining is an event log. Event logs record the execution of business processes. Each event in a log refers to an activity (i.e., a well-defined step in a business process) and is related to a particular case (i.e., a process instance). Events that belong to a case are ordered and constitute a single "run" of the process (often referred to as a trace of events). Event logs may store additional information about events such as resources (i.e., people and/or devices) executing or initiating the activities, timestamps indicating when the events occur, and data elements associated with the events. Data elements stored in the log can be either event attributes, i.e., data produced by the activities of a business process, or case attributes, namely data that are associated with the whole process instance. In this paper, we assume that all attributes are globally visible and can be accessed/manipulated by all activity instances executed inside the case.
Let Σ be the set of activities. Then, t ∈ Σ* is a trace over Σ, i.e., a sequence of activities that encodes a case. An event log E is a multi-set over Σ*.

Declare
Declare is a declarative process modeling language originally introduced by Pesic and van der Aalst in [3]. Instead of explicitly specifying the flow of the interactions among process activities, Declare describes a set of constraints that must be satisfied throughout the process execution. The possible orderings of activities are implicitly specified by constraints and anything that does not violate them is possible during execution. In comparison with procedural approaches that produce "closed" models, i.e., all that is not explicitly specified is forbidden, Declare models are "open" and tend to offer more possibilities for the execution. In this way, Declare enjoys flexibility and is very suitable for highly dynamic processes characterized by high complexity and variability due to the changeability of their execution environments.
A Declare model consists of a set of constraints over activities. Constraints, in turn, are based on templates. Templates are patterns that define parameterized classes of properties, and constraints are their concrete instantiations (we indicate template parameters with capital letters and concrete activities in their instantiations with lower case letters). Templates have graphical representations and their semantics has been captured using various formalisms [4], making them verifiable and executable. Each constraint inherits the graphical representation and semantics from its template. Table 1 summarizes some Declare templates (the reader can refer to [8] for a full description of the language). Here, the F, X, G, and U LTL (future) operators have the following intuitive meaning: formula Fφ_1 means that φ_1 holds sometime in the future, Xφ_1 means that φ_1 holds in the next position, Gφ_1 says that φ_1 holds forever in the future, and, lastly, φ_1 U φ_2 means that sometime in the future φ_2 will hold and until that moment φ_1 holds (with φ_1 and φ_2 LTL formulas). The O and Y LTL (past) operators have the following meaning: Oφ_1 means that φ_1 holds sometime in the past, and Yφ_1 means that φ_1 holds in the previous position.
The major benefit of using templates is that analysts do not have to be aware of the underlying logic-based formalization to understand the models. They work with the graphical representation of templates, while the underlying formulas remain hidden.
Table 2: Semantics of MP-Declare constraints in MFOTL (excerpt).
…: … ϕ_c(x, y)) ∨ F_I(B ∧ ∃y.ϕ_c(x, y)))))
response: G(∀x.((A ∧ ϕ_a(x)) → F_I(B ∧ ∃y.ϕ_c(x, y))))
alternate response: G(∀x.((A ∧ ϕ_a(x)) → X(¬(A ∧ ϕ_a(x)) U_I (B ∧ ∃y.ϕ_c(x, y)))))
chain response: … ϕ_c(x, y)) ∨ F_I(B ∧ ∃y.ϕ_c(x, y)))))
not response: G(∀x.((A ∧ ϕ_a(x)) → ¬F_I(B ∧ ∃y.ϕ_c(x, y))))
not precedence: G(∀x.((B ∧ ϕ_a(x)) → ¬O_I(A ∧ ∃y.ϕ_c(x, y))))
not chain response: G(∀x.((A ∧ ϕ_a(x)) → ¬X_I(B ∧ ∃y.ϕ_c(x, y))))
not chain precedence: G(∀x.((B ∧ ϕ_a(x)) → ¬Y_I(A ∧ ∃y.ϕ_c(x, y))))

Consider, for example, the response constraint G(a → Fb). This constraint indicates that if a occurs, b must eventually follow. Therefore, this constraint is satisfied for traces such as t_1 = ⟨a, a, b, c⟩, t_2 = ⟨b, b, c, d⟩, and t_3 = ⟨a, b, c, b⟩, but not for t_4 = ⟨a, b, a, c⟩ because, in this case, the second instance of a is not followed by a b. Note that, in t_2, the considered response constraint is satisfied in a trivial way because a never occurs. In this case, we say that the constraint is vacuously satisfied [9]. In [10], the authors introduce the notion of behavioral vacuity detection according to which a constraint is non-vacuously satisfied in a trace when it is activated in that trace. Intuitively, an activation of a constraint in a trace is an event whose occurrence imposes, because of that constraint, some obligations on other events (targets) in the same trace. For example, a is the activation for the response constraint G(a → Fb) and b is the target, because the execution of a forces b to be executed, eventually. In Table 1, for each template, the corresponding activation is specified. An activation of a constraint can be a fulfillment or a violation for that constraint. When a trace is perfectly compliant with respect to a constraint, every activation of the constraint in the trace leads to a fulfillment. Consider, again, the response constraint G(a → Fb). In trace t_1, the constraint is activated and fulfilled twice, whereas, in trace t_3, the same constraint is activated and fulfilled only once. On the other hand, when a trace is not compliant with respect to a constraint, an activation of the constraint in the trace can lead to a fulfillment but also to a violation (at least one activation leads to a violation). In trace t_4, for example, the response constraint G(a → Fb) is activated twice, but the first activation leads to a fulfillment (eventually b occurs) and the second activation leads to a violation (b does not occur subsequently). An algorithm to discriminate between fulfillments and violations for a constraint in a trace is presented in [10].
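The distinction between fulfillments, violations, and vacuous satisfaction for the response constraint G(a → Fb) can be illustrated with a minimal Python sketch. It is not the algorithm of [10]; the function name and the representation of traces as lists of activity labels are assumptions made for illustration only.

```python
def classify_response_activations(trace, activation, target):
    """Split the activations of response(activation, target) into
    fulfillments and violations (0-based positions in the trace)."""
    fulfillments, violations = [], []
    for i, event in enumerate(trace):
        if event != activation:
            continue
        # A response activation is fulfilled if the target occurs later on.
        if target in trace[i + 1:]:
            fulfillments.append(i)
        else:
            violations.append(i)
    return fulfillments, violations


traces = {
    "t_1": ["a", "a", "b", "c"],
    "t_2": ["b", "b", "c", "d"],
    "t_3": ["a", "b", "c", "b"],
    "t_4": ["a", "b", "a", "c"],
}
for name, trace in traces.items():
    fulfilled, violated = classify_response_activations(trace, "a", "b")
    # A trace without activations satisfies the constraint only vacuously.
    vacuous = not fulfilled and not violated
    print(name, "fulfillments:", fulfilled, "violations:", violated, "vacuous:", vacuous)
```

On the four traces above, the sketch reports two fulfillments for t_1, no activation for t_2, one fulfillment for t_3, and one fulfillment plus one violation for t_4, in line with the discussion in the text.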
Tools implementing process mining approaches based on Declare are presented in [11]. The tools are implemented as plug-ins of the process mining framework ProM.

Multi-Perspective Declare
In this section, we illustrate a multi-perspective version of Declare (MP-Declare) introduced in [12]. This semantics is expressed in Metric First-Order Linear Temporal Logic (MFOTL) and is shown in Table 2.
We describe here the semantics informally and we refer the interested reader to [12] for more details. To explain the semantics, we introduce some preliminary notions.
The first concept we use is the one of payload of an event. Consider, for example, that the execution of an activity Submit Loan Application (S) is recorded in an event log and, after the execution of S at timestamp τ_S, the attributes Salary and Amount have values 12 500 and 55 000. In this case, we say that, when S occurs, two special relations hold: event(S) and p_S(12 500, 55 000). In the following, we identify event(S) with the event S itself and we call (12 500, 55 000) the payload of S.
Note that all the templates in MP-Declare in Table 2 have two parameters, an activation and a target (see also Table 1). The standard semantics of Declare is extended by requiring two additional conditions on data, i.e., the activation condition ϕ_a and the correlation condition ϕ_c. As an example, we consider the response constraint "activity Submit Loan Application is always eventually followed by activity Assess Application" having Submit Loan Application as activation and Assess Application as target. The activation condition is a relation (over the variables corresponding to the global attributes in the event log) that must be valid when the activation occurs. If the activation condition does not hold, the constraint is not activated. The activation condition has the form p_A(x) ∧ r_a(x), meaning that when A occurs with payload x, the relation r_a over x must hold. For example, we can say that whenever Submit Loan Application occurs, and the amount of the loan is higher than 50 000 euros and the applicant has a salary lower than 24 000 euros per year, eventually an assessment of the application must follow. In case Submit Loan Application occurs but the amount is lower than 50 000 euros or the applicant has a salary higher than 24 000 euros per year, the constraint is not activated.
The correlation condition is a relation that must be valid when the target occurs. It has the form p_B(y) ∧ r_c(x, y), where r_c is a relation involving, again, variables corresponding to the (global) attributes in the event log but, in this case, relating the payload of A and the payload of B. A special type of correlation condition has the form p_B(y) ∧ r_c(y), which we call target condition, since it does not involve attributes of the activation.
In this paper, we aim at discovering constraints that correlate an activation and a target condition. For example, we can find that whenever Submit Loan Application occurs, and the amount of the loan is higher than 50 000 euros and the applicant has a salary lower than 24 000 euros per year, then eventually Assess Application must follow, and the assessment type will be Complex and the cost of the assessment higher than 100 euros.
Finally, in MP-Declare, a time condition can also be specified through an interval I = [τ_0, τ_1) indicating the minimum and the maximum temporal distance allowed between the occurrence of the activation and the occurrence of the corresponding target.
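To make the interplay of activation condition, target condition, and time condition concrete, the following Python sketch checks a response constraint of this kind on a single trace. It is an illustration under stated assumptions rather than the MFOTL semantics of [12]: events are represented as hypothetical (activity, timestamp, payload) triples with numeric timestamps, and the function and attribute keys are chosen for this example only.

```python
def mp_response_holds(trace, activation, target, act_cond, tgt_cond, interval=None):
    """Check a response constraint with an activation and a target condition.

    trace: list of (activity, timestamp, payload) triples ordered by timestamp.
    interval: optional pair (tau_0, tau_1); a target at temporal distance d
    from the activation is accepted only if tau_0 <= d < tau_1.
    """
    for i, (activity, timestamp, payload) in enumerate(trace):
        # The constraint is activated only if the activation condition holds.
        if activity != activation or not act_cond(payload):
            continue
        fulfilled = False
        for later_activity, later_timestamp, later_payload in trace[i + 1:]:
            if later_activity != target or not tgt_cond(later_payload):
                continue
            distance = later_timestamp - timestamp
            if interval is None or interval[0] <= distance < interval[1]:
                fulfilled = True
                break
        if not fulfilled:
            return False  # this activation is a violation
    return True


def activation_condition(payload):
    return payload["Amount"] > 50000 and payload["Salary"] < 24000


def target_condition(payload):
    return payload["Assessment Type"] == "Complex" and payload["Assessment Cost"] > 100


SUBMIT, ASSESS = "Submit Loan Application", "Assess Application"
trace = [
    (SUBMIT, 0, {"Salary": 12500, "Amount": 55000}),
    (ASSESS, 3, {"Assessment Type": "Complex", "Assessment Cost": 140}),
]
print(mp_response_holds(trace, SUBMIT, ASSESS, activation_condition, target_condition))
print(mp_response_holds(trace, SUBMIT, ASSESS, activation_condition, target_condition,
                        interval=(0, 2)))
```

The first call succeeds because the only activation is followed by a matching target; the second fails because the target lies outside the hypothetical interval [0, 2).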

Discovery of Data-Aware Declarative Process Models
In [13], the authors propose a data-aware technique for the discovery of declarative models. The technique uses a data-aware extension of the Declare language defined in terms of LTL-FO (First Order Linear Temporal Logic). Given a data constraint, the approach can be used to discover data conditions that discriminate cases in which the constraint is satisfied and cases in which it is violated.
In [14], the authors use correlations to prune a discovered declarative model and to disambiguate event associations. As a result, the discovered process models only show the more meaningful constraints and become more readable.
The work in [15] presents another approach for the multi-perspective discovery of declarative process models. This work is based on RelationalXES, a relational database architecture for storing event log data. Once stored, relational event data can be queried with conventional SQL. Queries capture the semantics of MP-Declare and can be customized. However, the queries have to be manually specified.
One existing technique for the discovery of imperative process models can identify the conditions of the decision rules at branching points [16]. The technique combines existing methods for discovering process models (e.g., Petri nets) and decision trees. The identified conditions compare variables with some constant values. This technique cannot be used to discover a condition over more than one variable. The technique presented in [17] overcomes this limitation by combining standard methods for decision tree learning with a technique for discovering the (likely) invariants of execution logs [18].

Running Example
Figure 1 shows a fictitious MP-Declare model that we will use as a running example throughout this paper. This example models a process for loan applications in a bank. When an applicant submits a loan application with an amount higher than 50 000 euros and she has a salary lower than 24 000 euros per year, eventually an assessment of the application will be carried out. The assessment will be complex and the cost of the assessment higher than 100 euros. This behavior is described by response constraint C_1 in Figure 1. When an applicant submits a loan application with an amount higher than 100 000 euros, eventually a complex assessment with cost higher than 100 euros is performed (C_2). In other cases, a simple assessment will be carried out with the cost of the assessment not exceeding 100 euros (constraints C_3 and C_4). The outcome is always positive when the salary of the applicant is greater than 70 000. This is reflected in response constraint C_5. Besides the application assessment, two additional checks can be performed before or after the assessment: the career check and the medical history check. A career check with a coverage lower than 15 years is required if the application assessment is simple (responded existence constraint C_6). The career of the applicant should be checked with a coverage higher than 15 years if the application assessment is complex (responded existence constraint C_7). If the career check covers less than 5 years, a medical history check should be performed immediately after and its cost is lower than 100 euros (chain response constraint C_8). If the career check covers more than 5 years, the medical history check is more complex and more expensive (its cost is higher than 100 euros). This behavior is described by chain response constraint C_9 in Figure 1. If the outcome of an application assessment is notified and the result of the outcome is accepted, then this event is always preceded by an application submission whose applicant has a salary higher than 12 000 euros per year (precedence constraint C_10).

Enhancing Declare Rules with Data-Aware Conditions
We propose two alternative approaches to discover data-aware rules from event logs. The outline of these two approaches is shown in Figure 2. Both approaches start from the extraction of instances of Declare constraints discovered from the log. The discovery of Declare constraints is carried out using the approach presented in [19]. In this paper, we assume that the discovered Declare constraints are already given and we enhance them with data-aware conditions. A constraint instance is a pair consisting of an activation and the corresponding target of a Declare constraint. Both activation and target are also equipped with the corresponding payloads. We extract the activation and target payloads of each constraint instance to form a set of (unlabeled) fulfillment feature vectors. Activations that cannot be associated with any target (representing a violation of the constraint) are put into violation feature vectors and labeled as violated.
In the first approach, the target payloads of previously obtained fulfillment feature vectors are clustered to find groups of targets with similar payloads. For each cluster its description is discovered. These descriptions are used as labels in combination with the activation payloads to generate a set of labeled fulfillment feature vectors. The labeled violation and fulfillment feature vectors are then used as input for rule mining. This procedure allows for finding correlations between the activation payloads and the target payloads.
In the second approach, we apply redescription mining algorithms using both the features of activation and target payloads. There are two major types of redescription mining algorithms. The first family of algorithms proposes to grow two decision trees in an alternating way and then join them in the leaves. By traversing the obtained trees, we can get the correlations between activation and target payloads. An alternative technique is to extend the correlations greedily starting from the simplest ones (containing one literal from each side). In this paper, we consider both approaches. Note that the core parts of our approaches (highlighted with blue rectangles in Figure 2) are independent of the procedure used to extract constraint instances. Thus they can be used in combination with any approach for constraint instances extraction.

Constraint Instances Extraction
The first step of the algorithm is to extract the instances of a given Declare constraint. This is done by creating two vectors idx1 and idx2 that represent activation and target occurrences. Then, the algorithm processes each trace in the input log to find the events corresponding either to the activation or to the target of the constraint, and their indexes are collected in the corresponding vector. For example, for trace t = SSSASASSA and Response(S,A), we have idx1 = (1; 2; 3; 5; 7; 8) and idx2 = (4; 6; 9). Then, based on the template, the constraint instances are identified as follows:
• (Not) Response. For each element idx1_i from the activation vector idx1, we take the first element idx2_j from the target vector idx2 that is greater than idx1_i.
• (Not) Chain Response. Here, we check the existence of pairs (i, j) from idx1 and idx2 where j − i = 1.
• Alternate Response. In this case, for each element idx1_i from idx1, we take the first element idx2_j from idx2 that is greater than idx1_i. However, we identify a constraint instance only if there are no elements from idx1 that lie between idx1_i and idx2_j.
• Precedence. For precedence, chain precedence and alternate precedence, the logic is almost the same as for their response counterparts. However, for precedence rules, idx1 is considered as the target vector and idx2 as the activation vector. In addition, idx1 has to be reversed.
• (Not) Responded Existence. We associate each element idx1_i from the activation vector idx1 with the first element from the target vector idx2.
Note that, here, to extract constraint instances, we consider the closest pairs of activation and target occurrences. However, the approach is easily adaptable to use other strategies for constraint instance extraction. If we enumerate the occurrences of S and A in t, we have t = S_1 S_2 S_3 A_1 S_4 A_2 S_5 S_6 A_3. A constraint instance that consists of an activation and a target is called a constraint fulfillment. The constraint fulfillments of the standard Declare templates instantiated with activities (A, S) and (S, A) are listed in Table 3 and in Table 4.
The procedures used to identify constraint fulfillments are also used to identify constraint violations (i.e., activations that cannot be associated with any target). The constraint fulfillments are used to create fulfillment feature vectors, while constraint violations are used to generate violation feature vectors. We stress again that these procedures only provide an example of how to identify temporal patterns in a log. Any semantics (also beyond standard Declare) can be used to identify frequent constraints.
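The extraction rules listed above can be sketched in Python as follows. This is a minimal illustration of the textual description, not the authors' implementation; the function names are hypothetical, positions are 1-based as in the example, and an activation paired with None stands for an activation without an associated target.

```python
def occurrence_indexes(trace, activity):
    """1-based positions of an activity in a trace."""
    return [i for i, event in enumerate(trace, start=1) if event == activity]


def response_instances(idx1, idx2):
    """Pair each activation with the first later target, if any."""
    return [(i, next((j for j in idx2 if j > i), None)) for i in idx1]


def chain_response_instances(idx1, idx2):
    """Pair each activation with a target in the next position, if any."""
    targets = set(idx2)
    return [(i, i + 1 if i + 1 in targets else None) for i in idx1]


def alternate_response_instances(idx1, idx2):
    """Like response, but no other activation may lie between the pair."""
    instances = []
    for i, j in response_instances(idx1, idx2):
        blocked = j is None or any(i < k < j for k in idx1)
        instances.append((i, None if blocked else j))
    return instances


def precedence_instances(idx1, idx2):
    """Activations are in idx2; idx1 is scanned in reverse order so that
    the closest earlier target is found first."""
    return [(j, next((i for i in reversed(idx1) if i < j), None)) for j in idx2]


def responded_existence_instances(idx1, idx2):
    """Pair each activation with the first target occurrence in the trace."""
    first_target = idx2[0] if idx2 else None
    return [(i, first_target) for i in idx1]


trace = "SSSASASSA"
idx1 = occurrence_indexes(trace, "S")
idx2 = occurrence_indexes(trace, "A")
print(idx1, idx2)
print(response_instances(idx1, idx2))
print(alternate_response_instances(idx1, idx2))
```

For the example trace, the response pairs are (1, 4), (2, 4), (3, 4), (5, 6), (7, 9), and (8, 9), whereas alternate response keeps a target only for the activations at positions 3, 5, and 8.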

Features Encoding
Constraint fulfillments and violations are used to create a set of fulfillment and violation feature vectors, respectively. Fulfillment feature vectors consist of the payloads of activations and targets, while violation feature vectors consist of the payloads of activations only. Violation feature vectors do not have a corresponding target and are labeled as "violated". Assume that we have a constraint instance where the activation Submit Loan Application has payload (12 500, 55 000) (see Section 2.3). If this activation cannot be associated with any target, we generate the violation feature vector:
(Salary = 12 500, Amount = 55 000, Label = violated)
If the same activation is part of a constraint instance of a frequent constraint with target Assess Application and payload (Complex, 140), we generate the (unlabeled) fulfillment feature vector:
(Salary = 12 500, Amount = 55 000, Assessment Type = Complex, Assessment Cost = 140)
Violation and fulfillment feature vectors are then used as the main input data for the core parts of our approaches.
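A possible encoding step is sketched below in Python. The function name and the dictionary layout are assumptions of this sketch: the activation and target payloads are kept under separate keys so that the projections on either payload, which the two approaches rely on, remain easy to compute.

```python
def encode_feature_vectors(instances):
    """Turn constraint instances into fulfillment and violation feature vectors.

    instances: list of (activation_payload, target_payload) pairs of dicts,
    where target_payload is None if the activation has no associated target.
    """
    fulfillments, violations = [], []
    for activation_payload, target_payload in instances:
        if target_payload is None:
            # Violation vectors hold the activation payload only, plus a label.
            violations.append(
                {"activation": activation_payload, "target": None, "label": "violated"}
            )
        else:
            # Fulfillment vectors hold both payloads and are still unlabeled.
            fulfillments.append(
                {"activation": activation_payload, "target": target_payload}
            )
    return fulfillments, violations


instances = [
    ({"Salary": 12500, "Amount": 55000},
     {"Assessment Type": "Complex", "Assessment Cost": 140}),
    ({"Salary": 12500, "Amount": 55000}, None),
]
fulfillments, violations = encode_feature_vectors(instances)
print(fulfillments)
print(violations)
```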

Approach 1: Clustering + Rule Mining
Starting from the fulfillment feature vectors, we use clustering to find groups of payloads that are similar. In particular, a modification of the K-Medoids clustering algorithm is used. With this modification, we can handle categorical attributes as well as numerical ones. In order to compute the distance between two feature vectors i and j, we use the Gower distance
d(i, j) = (1/n) Σ_{f=1}^{n} d^(f)_{i,j},
where n denotes the number of features, while d^(f)_{i,j} is the normalized distance between feature vectors i and j when considering only feature f. For nominal attributes, we calculate the distance as follows:
d^(f)_{i,j} = 0 if x_{i,f} = x_{j,f}, and d^(f)_{i,j} = 1 otherwise,
where x_{i,f} denotes the value of feature f in feature vector i. For interval scaled attribute values, we use the distance:
d^(f)_{i,j} = |x_{i,f} − x_{j,f}| / (max_f − min_f),
where max_f and min_f are the maximum and minimum observed values of attribute f.
At each iteration of the K-Medoids algorithm, we compute one centroid per cluster in such a way that: 1) for categorical attributes, the most frequent value is taken; 2) for numerical attributes, the average value is computed. For each computed centroid, the closest real feature vector is selected as the medoid of the current iteration. After obtaining the medoids, the feature vectors are relabeled correspondingly and the next iteration starts. The clustering stops when the medoids have converged to some feature vectors or after N iterations, where N is given as an input parameter.
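As an illustration, the Gower distance (the average over all features of a per-feature distance that is 0 or 1 for nominal values and the range-normalized absolute difference for numerical values) can be computed as in the following Python sketch. The function name, the dictionary representation of feature vectors, and the ranges argument are assumptions made for this example.

```python
def gower_distance(u, v, ranges):
    """Gower distance between two feature vectors given as dicts.

    ranges maps each numerical feature to its (min, max) observed values;
    features that do not appear in ranges are treated as nominal.
    """
    total = 0.0
    for feature in u:
        if feature in ranges:
            low, high = ranges[feature]
            # Interval-scaled feature: difference normalized by the range.
            total += abs(u[feature] - v[feature]) / (high - low) if high > low else 0.0
        else:
            # Nominal feature: 0 if the values match, 1 otherwise.
            total += 0.0 if u[feature] == v[feature] else 1.0
    return total / len(u)


u = {"Assessment Type": "Complex", "Assessment Cost": 140}
v = {"Assessment Type": "Simple", "Assessment Cost": 60}
print(gower_distance(u, v, {"Assessment Cost": (40, 240)}))
```

With the hypothetical cost range [40, 240], the two target payloads differ by 1 on the nominal feature and by 80/200 = 0.4 on the numerical one, which gives a distance of 0.7.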
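The iteration described above can be sketched in Python as follows. This is a simplified reading of the procedure rather than the implementation used in the paper: the initial medoids are passed in explicitly, the medoid of a cluster is searched among the members of that cluster, and empty clusters keep their previous medoid; all three choices, as well as the function names, are assumptions of this sketch.

```python
from collections import Counter


def gower(u, v, ranges):
    """Gower distance for dict vectors; features in `ranges` are numerical."""
    parts = []
    for f in u:
        if f in ranges:
            low, high = ranges[f]
            parts.append(abs(u[f] - v[f]) / (high - low) if high > low else 0.0)
        else:
            parts.append(0.0 if u[f] == v[f] else 1.0)
    return sum(parts) / len(parts)


def centroid(members, ranges):
    """Average for numerical features, most frequent value for categorical ones."""
    center = {}
    for f in members[0]:
        values = [m[f] for m in members]
        if f in ranges:
            center[f] = sum(values) / len(values)
        else:
            center[f] = Counter(values).most_common(1)[0][0]
    return center


def assign(vectors, medoids, ranges):
    """Index of the closest medoid for every vector."""
    return [
        min(range(len(medoids)), key=lambda c: gower(v, medoids[c], ranges))
        for v in vectors
    ]


def k_medoids(vectors, initial_medoids, ranges, max_iterations=100):
    """Return (medoids, labels) after convergence or max_iterations rounds."""
    medoids = list(initial_medoids)
    for _ in range(max_iterations):
        labels = assign(vectors, medoids, ranges)
        new_medoids = []
        for c, old in enumerate(medoids):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if not members:
                new_medoids.append(old)  # keep the medoid of an empty cluster
                continue
            center = centroid(members, ranges)
            # The medoid is the real feature vector closest to the centroid.
            new_medoids.append(min(members, key=lambda m: gower(m, center, ranges)))
        if new_medoids == medoids:
            break  # the medoids have converged
        medoids = new_medoids
    return medoids, assign(vectors, medoids, ranges)


vectors = [
    {"Assessment Type": "Simple", "Assessment Cost": 60},
    {"Assessment Type": "Simple", "Assessment Cost": 80},
    {"Assessment Type": "Simple", "Assessment Cost": 90},
    {"Assessment Type": "Complex", "Assessment Cost": 140},
    {"Assessment Type": "Complex", "Assessment Cost": 160},
    {"Assessment Type": "Complex", "Assessment Cost": 200},
]
ranges = {"Assessment Cost": (60, 200)}
medoids, labels = k_medoids(vectors, [vectors[0], vectors[5]], ranges)
print(medoids, labels)
```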
Once the clusters have been constructed, we apply a direct rule-based classification algorithm called RIPPER (Repeated Incremental Pruning to Produce Error Reduction) to search for the distinctive features that can be used to describe them. In particular, we build the classifier by using as feature vectors the projections of the fulfillment feature vectors on the target payloads and the cluster ids as labels. In this way, we can describe each cluster in terms of characteristics of target payloads in that cluster. The algorithm builds the rules greedily by adding a new condition using the conjunction operator as long as information gain improves. The initially obtained rule set is then pruned and simplified. The output of RIPPER is a decision table. For a 2-class problem, RIPPER selects one class as positive and the other as negative, and then learns rules for the positive class. The negative class is described by the default rule (none of the above rules is satisfied). For multi-class problems, it picks the class with the smallest prevalence (fraction of instances that belong to a class) and considers it as the positive class, while all the other classes are considered to be negative. In this way, it transforms a multi-class problem into a 2-class problem. When the rules for the positive class have been discovered, this class is not considered anymore and the algorithm picks the next smallest class as the positive class, while treating the remaining classes as negative. The procedure repeats until 2 classes are left. Then, it solves the 2-class problem as described above, and the class with the largest representation becomes the default class. Figure 3 shows two clusters (Cluster1 and Cluster2) associated with the response constraint with target Assess Application (colored in green and red respectively).
We use a bidimensional representation here to show that the clusters can be characterized using the Assessment Cost attribute (Assessment Cost ≤ 100 and Assessment Cost > 100, respectively) and the Assessment Type attribute (Assessment Type = Simple and Assessment Type = Complex, respectively). In particular, the conjunction Assessment Cost ≤ 100 ∧ Assessment Type = Simple characterizes Cluster1, whereas the conjunction Assessment Cost > 100 ∧ Assessment Type = Complex characterizes Cluster2. These clusters/conditions are used as labels to build labeled fulfillment feature vectors. The features of the labeled fulfillment feature vectors come from the projections of the fulfillment feature vectors on the activation payloads.
Assume again that we have a constraint instance where the activation Submit Loan Application has payload (12 500, 55 000). If this activation is part of a constraint instance with target Assess Application, the condition describing the cluster of the target payload is attached as a label to the activation payload, which yields a labeled fulfillment feature vector. Labeled fulfillment feature vectors and violation feature vectors are then used as input data to RIPPER again. This time, the output of RIPPER provides a set of rules that correlates payloads of activations and targets of a given Declare constraint.
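The preparation of the input for this second run of RIPPER can be sketched as follows. The Python code below does not implement RIPPER itself; it only shows, under illustrative assumptions (function names, dictionary layout, alphabetical tie-breaking between equally frequent classes), how the labeled training set is assembled and in which order a RIPPER-style learner would treat the classes.

```python
from collections import Counter


def label_fulfillments(fulfillments, cluster_ids, descriptions):
    """Replace each target payload by the description of its cluster."""
    return [
        {**vector["activation"], "Label": descriptions[cluster_id]}
        for vector, cluster_id in zip(fulfillments, cluster_ids)
    ]


def class_order(labels):
    """Classes by increasing prevalence; the most frequent one is the default."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (counts[label], label))
    return ordered[:-1], ordered[-1]


descriptions = {
    0: "Assessment Cost <= 100 and Assessment Type = Simple",
    1: "Assessment Cost > 100 and Assessment Type = Complex",
}
fulfillments = [
    {"activation": {"Salary": 12500, "Amount": 55000},
     "target": {"Assessment Type": "Complex", "Assessment Cost": 140}},
    {"activation": {"Salary": 40000, "Amount": 20000},
     "target": {"Assessment Type": "Simple", "Assessment Cost": 60}},
    {"activation": {"Salary": 35000, "Amount": 30000},
     "target": {"Assessment Type": "Simple", "Assessment Cost": 80}},
]
violations = [{"Salary": 15000, "Amount": 60000, "Label": "violated"}]

# Cluster ids of the three target payloads, e.g., as produced by the clustering step.
training_set = label_fulfillments(fulfillments, [1, 0, 0], descriptions) + violations
learned_first, default_class = class_order([v["Label"] for v in training_set])
print(training_set)
print(learned_first, default_class)
```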

Approach 2: Redescription Mining
Redescription mining is a family of unsupervised descriptive knowledge discovery approaches that aim at finding correlations between subsets of elements in a dataset by using two or more disjoint sets of descriptive attributes. In particular, the input of a redescription mining algorithm is a tuple (E, V_L, V_R, A_L, A_R), where E denotes a set of entities that are characterized by two different views V_L and V_R, respectively. These views are described by two sets of attributes A_L and A_R. In our case, the entities are the fulfillment feature vectors, the views V_L and V_R represent activations and targets of these vectors, and A_L and A_R are the attributes of their corresponding payloads.
The output of the algorithm is a set of redescriptions R that describe relations between the two different views. In particular, a redescription r ∈ R is a logical formula that consists of two parts, r_L and r_R, where r_L contains literals from A_L, and r_R consists of literals from A_R, respectively. For example, for a redescription Amount > 100 000 => Assessment Type = Complex, Amount > 100 000 represents r_L, and Assessment Type = Complex is r_R.
There are two types of redescription mining approaches. The first approach is based on classification and regression trees. The main idea is to iteratively grow two decision trees (one for each view) that will be joined in their leaves. The trees grow in an alternating way, meaning that the prediction vector derived for one tree in a certain step is used to grow the other tree in the next step. Figure 4 illustrates the application of this approach for the response constraint with activation Submit Loan Application (left hand side) and target Assess Application (right hand side). As can be seen, there are two classes of behaviors (colored in green and red), i.e., when Amount > 100 000, or Amount > 50 000 and Salary ≤ 24 000, then Assessment Type = Complex; in all the other cases, Assessment Type = Simple. The redescriptions are obtained by traversing the trees from the root node of the first tree to the root node of the second one. Table 5 shows the redescriptions derived from the trees in Figure 4, for example:
Amount ≤ 100 000 & Salary ≤ 24 000 & Amount ≤ 50 000 => Assessment Type = Simple
An alternative approach is to grow the redescriptions greedily, starting from a pair of singleton queries (i.e., one variable on each side) and extending them by appending a literal on either side using conjunctions or disjunctions. This procedure can be stopped when the maximum length of a query is reached or when the addition of a new condition does not improve the accuracy of the redescription. In this paper, we will consider a tree-based redescription approach called SplitT [20] and a greedy algorithm called ReReMi [21].
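The greedy strategy described above can be illustrated with a minimal sketch. It is not the ReReMi implementation: it assumes that accuracy is measured as the Jaccard coefficient between the sets of entities supporting each side, it extends queries with conjunctions only (disjunctions are omitted for brevity), and the attribute names, thresholds, and the function name `greedy_redescription` are hypothetical.

```python
from itertools import product

def support_set(entities, side, literals):
    """Indices of entities whose view `side` (0 = left, 1 = right)
    satisfies every literal of the query (conjunction)."""
    return {i for i, e in enumerate(entities)
            if all(pred(e[side]) for _, pred in literals)}

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def accuracy(entities, left, right):
    """Assumed accuracy measure: Jaccard coefficient of the two supports."""
    return jaccard(support_set(entities, 0, left),
                   support_set(entities, 1, right))

def greedy_redescription(entities, left_literals, right_literals, max_len=3):
    # Start from the best pair of singleton queries (one literal per side).
    left, right = max(
        (([l], [r]) for l, r in product(left_literals, right_literals)),
        key=lambda q: accuracy(entities, q[0], q[1]),
    )
    best = accuracy(entities, left, right)
    while True:
        # Candidate extensions: append one unused literal to either side.
        candidates = []
        if len(left) < max_len:
            candidates += [(left + [l], right)
                           for l in left_literals if l not in left]
        if len(right) < max_len:
            candidates += [(left, right + [r])
                           for r in right_literals if r not in right]
        if not candidates:
            break  # maximum query length reached or no literal left
        cand = max(candidates, key=lambda q: accuracy(entities, q[0], q[1]))
        score = accuracy(entities, cand[0], cand[1])
        if score <= best:
            break  # the extension does not improve the accuracy
        left, right, best = cand[0], cand[1], score
    return left, right, best

# Hypothetical feature vectors: (activation payload, target payload).
entities = [
    ({"Amount": 120_000, "Salary": 20_000}, {"Cost": 200}),
    ({"Amount": 110_000, "Salary": 50_000}, {"Cost": 80}),
    ({"Amount": 30_000, "Salary": 20_000}, {"Cost": 90}),
    ({"Amount": 130_000, "Salary": 22_000}, {"Cost": 250}),
    ({"Amount": 20_000, "Salary": 60_000}, {"Cost": 70}),
]
left_literals = [
    ("Amount > 100000", lambda p: p["Amount"] > 100_000),
    ("Salary <= 24000", lambda p: p["Salary"] <= 24_000),
]
right_literals = [("Cost > 100", lambda p: p["Cost"] > 100)]

left, right, score = greedy_redescription(entities, left_literals, right_literals)
left_names = {name for name, _ in left}
right_names = {name for name, _ in right}
```

On this toy data, each singleton pair covers one extra entity on the left side, and adding the second left literal removes that mismatch, so the procedure stops with a two-literal left query.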

Evaluation
We implemented the Clustering + Rule Mining approach as an open-source prototype tool, 2 while, for the Redescription Mining approach, we configured an existing redescription mining tool, i.e., Siren. 3 We used these two tools to conduct a series of experiments aimed at understanding the relative strengths of the two discovery approaches. In particular, we wanted to examine their capability to rediscover the behavior injected into an artificial log, and assess their scalability (using both artificial and real-life logs) and their applicability to real-life event logs. Accordingly, we investigated the following three research questions:
• RQ1. Which of the approaches better rediscovers constraints artificially injected into an event log?
• RQ2. How do the approaches perform relative to each other in terms of execution time?
• RQ3. What are the characteristics of the constraint sets generated by the proposed approaches when applied to real-life logs?
RQ1 focuses on the evaluation of the rediscovery accuracy of the proposed approaches. RQ2 investigates the scalability of the approaches. RQ3 deals with the validation of the discovery approaches in real scenarios. The characteristics evaluated with RQ3 are the number of data conditions discovered for a single original control-flow constraint, their complexity, and also support and confidence of the enhanced constraints. In this evaluation, we call support of a constraint the percentage of feature vectors of that constraint where both activation and target conditions hold (over the total number of feature vectors for that constraint), and confidence the percentage of feature vectors where both activation and target conditions hold over the number of feature vectors where the activation condition holds.
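The definitions of support and confidence given above can be made concrete with a small sketch. It assumes that a feature vector is represented as a pair (activation payload, target payload) and that data conditions are plain predicates. The function name, attribute names, and values are hypothetical, and the metrics are returned as fractions rather than percentages.

```python
from typing import Callable, Dict, List, Tuple

Payload = Dict[str, object]
FeatureVector = Tuple[Payload, Payload]  # (activation payload, target payload)
Condition = Callable[[Payload], bool]

def support_and_confidence(
    vectors: List[FeatureVector],
    activation_cond: Condition,
    target_cond: Condition,
) -> Tuple[float, float]:
    # Feature vectors where both the activation and the target condition hold.
    both = sum(1 for a, t in vectors if activation_cond(a) and target_cond(t))
    # Feature vectors where the activation condition holds.
    activated = sum(1 for a, _ in vectors if activation_cond(a))
    support = both / len(vectors) if vectors else 0.0
    confidence = both / activated if activated else 0.0
    return support, confidence

# Hypothetical feature vectors of a single constraint.
vectors = [
    ({"Amount": 120_000}, {"AssessmentCost": 80}),
    ({"Amount": 150_000}, {"AssessmentCost": 200}),
    ({"Amount": 30_000}, {"AssessmentCost": 90}),
    ({"Amount": 20_000}, {"AssessmentCost": 300}),
]
support, confidence = support_and_confidence(
    vectors,
    activation_cond=lambda a: a["Amount"] > 100_000,
    target_cond=lambda t: t["AssessmentCost"] <= 100,
)
```

In this example, both conditions hold in one of the four feature vectors and the activation condition holds in two of them, so the support is 0.25 and the confidence is 0.5.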

Datasets
The artificial logs used in this evaluation were generated using the MP-Declare Log Generator tool [22]. To answer RQ1, we generated a log by simulating the model in Figure 1. This log contains 5000 cases with average length of 5 events, leading to a total of 25 000 events. In order to have a sufficient number of feature vectors available, the activities that are involved in the constraints of the model (Submit Loan Application, Assess Application, Notify Outcome, Check Career, Check Medical History) appear in every case, leading to over 5000 feature vectors for each constraint.
There are two factors that affect the execution time of the algorithms (investigated with RQ2): the number of feature vectors and the payload size. In order to assess how the former factor affects the execution time, we created a set of artificial logs with a growing number of feature vectors for one specific MP-Declare constraint (1000, 5000, 10 000, 20 000, 30 000, 50 000 and 100 000 feature vectors), with a default payload size of 6 attributes. To assess the impact of the latter factor, we generated a set of artificial logs with increasing event payload size (5, 10, 15, 20, 25 and 30 attributes), with a default number of feature vectors equal to 1000.
In addition to these artificial logs, we used seven real-life logs for answering RQ2 and RQ3. We considered all real-life event logs from the 4TU Data Center collection 4 having at least three event payload attributes. These are BPIC 2011, 5 all the logs of BPIC 2013, 6,7,8 BPIC 2017, 9 Sepsis, 10 and the Road Traffic Fine Management log. 11 These logs cover various domains such as healthcare, IT support management, banking and public administration. The logs were preprocessed by deleting those cases that contain missing values in the event payloads and by removing redundant attributes (i.e., case attributes that always have the same value in a case or attributes that do not store any valuable information that can be used to discriminate the different clusters of behaviors, e.g., the "Variant" attribute in the Sepsis log or the "EventID" attribute in BPIC 2017). For the BPIC 2017 log, we also removed duplicated events. An overview of the logs' characteristics is provided in Table 6. All the logs used in the evaluation, as well as instructions on how to reproduce the experiments, are publicly available. 12

RQ1: Rediscovery Accuracy
The objective of this first experiment was to test whether the proposed approaches are able to rediscover the original constraints that were injected into a synthetic log. The constraints in question are those shown in Figure 1. We evaluated two different variants of the K-Medoids + RIPPER approach: one (described in Sect. 4.3), where the fulfillment feature vectors to be clustered contain both the activation and target payloads, and a variant of it, where the fulfillment feature vectors only contain target payloads. In this way, we tested whether the information about activation payloads in the clustering phase is relevant to improve the accuracy of the results. For the Redescription Mining approach, we used the ReReMi algorithm with default settings and the SplitT algorithm with a maximum tree depth of 100 and a minimum node size equal to 5% of the number of feature vectors.
To measure the rediscovery accuracy we used recall, precision, and F-score. In this setting, recall is the percentage of injected constraints that were correctly rediscovered, while precision represents the fraction of discovered constraints that match the injected behavior. F-score is the harmonic mean of precision and recall.
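As a minimal illustration of these three measures, the sketch below computes them from two sets of constraint identifiers. It assumes that a discovered constraint counts as correct when its identifier matches an injected one, which simplifies the manual matching of data conditions. The function name and the identifiers are hypothetical and do not reproduce the results reported in the tables.

```python
from typing import Set, Tuple

def rediscovery_scores(injected: Set[str], discovered: Set[str]) -> Tuple[float, float, float]:
    # Constraints that were both injected and discovered.
    matched = injected & discovered
    # Recall: fraction of injected constraints that were rediscovered.
    recall = len(matched) / len(injected) if injected else 0.0
    # Precision: fraction of discovered constraints that match injected behavior.
    precision = len(matched) / len(discovered) if discovered else 0.0
    # F-score: harmonic mean of precision and recall.
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return recall, precision, f_score

# Hypothetical identifiers: four injected constraints, three discovered ones.
injected = {"C1", "C2", "C3", "C4"}
discovered = {"C1", "C2", "C5"}
recall, precision, f_score = rediscovery_scores(injected, discovered)
```

Here two of the four injected constraints are rediscovered (recall 0.5) and two of the three discovered constraints are correct (precision 2/3), which gives an F-score of 4/7.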
The constraints discovered by the algorithms are shown in Tables 7-10. The constraints that were correctly re-discovered are marked in bold and the id identifying the constraint in the original model in Figure 1 is specified between brackets. Sometimes a constraint was not correctly re-discovered, but instead a semantically similar version of it was discovered. In this case, the id identifying the constraint in the original model is specified between brackets, but the row is not marked in bold.
As we can see from the tables, sometimes the rediscovered constraints were not the same as the injected ones, but they were semantically identical (e.g., "Simple Assessment" always goes in pair with "Assessment Cost" lower than or equal to 100, therefore these two conditions are interchangeable). The variant of K-Medoids + RIPPER that clusters only target payloads discovered a less accurate set of constraints when compared to its alternative (clustering of both activation and target payloads). In particular, in the former, the wrong separating point between clusters (e.g., Cost < 150 and Cost ≥ 150) was often identified. The SplitT algorithm did not discover any constraints involving numerical attributes only (e.g., C_8, C_9). On the other hand, ReReMi discovered several redundant constraints describing the same behavior from different angles (see, e.g., the first two constraints in Table 9). Although SplitT discovered the smallest number of constraints, the complexity of their data conditions is the highest. In contrast, ReReMi discovered the largest number of constraints with the smallest average length of the data conditions (both activation and target conditions mostly consist of one atomic condition only).
Table 11 reports the rediscovery accuracy in terms of number of constraints discovered, recall, precision, and F-score for all four approaches. The results show that K-Medoids + RIPPER with clustering of activation and target payloads was able to rediscover almost all the original constraints (with a recall of 0.9). Meanwhile, SplitT achieved the highest rediscovery precision (0.75). Overall, K-Medoids + RIPPER with clustering of activation and target payloads has the highest F-score. Given that clustering of both activation and target payloads clearly leads to higher rediscovery precision and recall than clustering of only the target payloads, in the next experiments, we only consider the former variant.
We also checked how rediscovery precision and recall are affected by the number of clusters for the K-Medoids + RIPPER approach (with clustering of both activation and target payloads). The results are shown in Figure 5. We observe that the number of clusters significantly affects the rediscovery accuracy. As the number of clusters increases starting from the original one (in this particular case, 2), precision and recall decrease to the point that it is no longer possible to rediscover any originally-injected behavior.

RQ2: Scalability
We tested the scalability of the three approaches (K-Medoids + RIPPER with clustering of activation and target payloads, ReReMi and SplitT) on both artificial and real-life logs. Specifically, we measured how the execution time varies based on the size of the log in terms of number of feature vectors and event payload size. The experiments were run on a Windows 10 x64 machine, equipped with an Intel Core i5-5200U CPU at 2.20 GHz and 16 GB of memory.
The results are shown in Figure 6. We can see that the execution time of K-Medoids + RIPPER has a stronger dependency on the total number of feature vectors than the two Redescription Mining approaches. In particular, the execution time for K-Medoids + RIPPER rapidly rises with the increase of the total number of feature vectors. On the other hand, the execution time for K-Medoids + RIPPER is less sensitive to the payload size. Note that, in Figure 6b, we do not report the execution time of ReReMi since it grows exponentially, from around 300 seconds for a payload of size 5 to more than 1 hour for a payload with 15 attributes.
For evaluating the scalability of the approaches on the real-life logs, we used K-Medoids + RIPPER with three different settings (k equal to 2, 5 and 10), ReReMi with default settings and SplitT with a maximum tree depth of 100 and a minimum node size equal to 5% of the number of feature vectors. The results are presented in Table 12.
In most cases, K-Medoids + RIPPER outperforms the other approaches in terms of time performance. This can be observed for all logs except for the Road Traffic Fine Management and Sepsis logs. This can be related to the fact that these logs contain a weaker signal (smaller event payloads for Sepsis, shorter traces for Road Traffic Fine Management), which also leads the Redescription Mining approaches to discover fewer data conditions.

RQ3: Characteristics of the Discovered Data Conditions using Real-Life Logs
To answer RQ3, we used the real-life logs to discover a set of control-flow Declare constraints with the Declare Miner [19]. Then, for each discovered control-flow constraint, we ran all the proposed approaches, recording the number of discovered data conditions, their average length, and the average support and confidence of the MP-Declare constraints derived from the original Declare constraint. Finally, we computed the average of each of the recorded metrics over all the discovered Declare constraints. The setup of each approach is the same as the one used to answer RQ2 (see Section 5.3). The results are presented in Table 13. From Table 13, we can see that K-Medoids + RIPPER with k equal to 2 discovers the smallest number of data conditions, while keeping the constraint confidence high (always greater than 0.8). Overall, K-Medoids + RIPPER produces fewer conditions than the two Redescription Mining approaches. This is due to the fact that RIPPER minimizes the number of conditions discovered, and these conditions do not overlap. In contrast, ReReMi and SplitT try to discover all possible conditions that satisfy the filter criteria, which often leads to discovering redundant rules. On the other hand, the Redescription Mining approaches tend to discover rules with very high confidence, whereas, for K-Medoids + RIPPER, the confidence of the discovered constraints decreases as k increases.

Threats to validity
A possible threat to internal validity is posed by the use of a single log to address RQ1 (rediscovery accuracy). The log is generated using a limited set of constraints and these constraints are relatively simple, not involving a high number of attributes at the same time. To mitigate this threat, we decided to address RQ1 at a qualitative level instead of approaching this question quantitatively. In particular, we tried to rediscover constraints of different types, and with data conditions involving both categorical and numerical attributes.
A further threat to internal validity is the limited use of parameter values to configure the approaches at hand. Specifically, in the case of ReReMi and SplitT, we used default parameters and increased the maximal depth of the trees only, while in the case of K-Medoids + RIPPER, we used k equal to 2, 5 and 10.
A potential threat to external validity is given by the use of a limited number of real-life logs (seven). However, these logs originate from different domains and exhibit different characteristics, providing a good representative set of real-life event logs. To ensure the full reproducibility of the results, we have released all the preprocessed real-life logs as well as the artificially generated ones used in our experiments.

Conclusion
We presented two approaches to enhance Declare constraints with data conditions that relate the occurrence of pairs of events in a case of an event log (correlated data conditions). The first approach combines clustering and rule mining techniques, while the second approach relies on Redescription Mining.
Overall, the experimental results show that the clustering-based approach outperforms Redescription Mining in terms of its ability to rediscover constraints artificially injected in a log, in terms of number of conditions discovered (lower number of conditions), and in terms of computational efficiency, provided that the number of feature vectors is not very large. However, the experiments showed that the accuracy of the clustering-based technique is highly dependent on the number of clusters given as input. Hence, this technique requires careful parameter tuning.
The experiments also showed that the Redescription Mining approaches discover constraints with higher confidence (and lower support). This latter observation suggests that these techniques may be used to effectively discover outlier behaviors, i.e., constraints that are less frequently activated, but, when activated, are in most of the cases satisfied.
While we have shown that the proposed techniques address the problem of discovering Declare constraints with correlated data conditions and that they scale up to handle real-life event logs, the usefulness of the discovered rules in practical settings still needs to be studied. In this respect, a possible avenue for future work is to conduct case studies to determine if the sets of constraints produced by the proposed techniques can provide insights to business analysts, beyond what can be achieved by simply discovering plain (non-data-enhanced) Declare constraints or, alternatively, by discovering procedural process models enhanced with data conditions.