Measuring the interestingness of temporal logic behavioral specifications in process mining

The assessment of behavioral rules with respect to a given dataset is key in several research areas, including declarative process mining, association rule mining, and specification mining. An assessment is required to check how well a set of discovered rules describes the input data, as well as to determine to what extent data complies with predefined rules. In declarative process mining in particular, Support and Confidence are used most often, yet they are reportedly unable to provide sufficiently rich feedback to users and cause rules representing coincidental behavior to be deemed representative of the event logs. In addition, these measures are designed to work on a predefined set of rules, thus lacking generality and extensibility. In this paper, we address this research gap by developing a measurement framework for temporal rules based on Linear-time Temporal Logic with Past on Finite Traces (LTLpf). The framework is suitable for any temporal rule expressed in a reactive form and for custom measures based on the probabilistic interpretation of such rules. We show that our framework can seamlessly adapt well-known measures from the association rule mining field to declarative process mining. Also, we test our software prototype implementing the framework on synthetic and real-world data, and investigate the properties characterizing those measures in the context of process analysis.


Introduction
Measuring the degree to which process traces comply with behavioral rules is key in process analysis branches such as conformance checking [1], compliance assessment [2], and discovery of process constraints [3]. To date, several measures have been defined to this end, yet there are two major problems with their application.
First, measures adopted for process mining are defined inconsistently for specific applications. For example, Support and Confidence are among the most frequently used measures. However, their definition has been customized to the specification languages in use and even to the specific mining algorithms under analysis. For instance, there is a significant difference between the definition of Support used in [3] (percentage of traces fully compliant with a rule) and that of [4] (percentage of activations that lead to a fulfillment), such that the Support of the rule ''If a is executed, then b will be executed later'' on a set of traces like {⟨a, b, c, d⟩, ⟨a, b, c, a⟩, ⟨a, c⟩} is equal to 0.33 according to [3] and 0.5 according to [4]. Furthermore, those measures are defined ad hoc for specific sets of rules, like Declare [5] templates. Such issues hinder the fair comparison and eventually the advancement of rule-based process mining.
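The divergence between the two Support definitions can be reproduced with a small sketch (the helper names are ours; the rule is Response(a, b), i.e., ''If a is executed, then b will be executed later''):

```python
def trace_support(traces, act, tgt):
    """Support as in [3]: fraction of traces fully compliant with Response(act, tgt)."""
    def compliant(tr):
        return all(tgt in tr[i + 1:] for i, e in enumerate(tr) if e == act)
    return sum(compliant(t) for t in traces) / len(traces)

def activation_support(traces, act, tgt):
    """Support as in [4]: fraction of activations that lead to a fulfillment."""
    fulfilled = total = 0
    for tr in traces:
        for i, e in enumerate(tr):
            if e == act:
                total += 1
                fulfilled += tgt in tr[i + 1:]  # bool counts as 0/1
    return fulfilled / total

log = [("a", "b", "c", "d"), ("a", "b", "c", "a"), ("a", "c")]
round(trace_support(log, "a", "b"), 2)       # -> 0.33 (only the first trace complies)
activation_support(log, "a", "b")            # -> 0.5  (2 of 4 activations fulfilled)
```

Only the first trace satisfies every activation of the rule, whereas two out of the four occurrences of a across the log are eventually followed by b, which yields the two diverging values.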
Second, the opportunity to adopt available measures from association rule mining has been largely missed so far. A plethora of measures that are reportedly superior to Support and Confidence [6] have been proposed in this field. Support measures only the satisfaction frequency of a rule and Confidence its validity. Although those are crucial aspects in the assessment of a rule, they do not suffice to avoid spurious results [7]. Markedly, various directions have been explored in prior research, among others by revising Support and Confidence [8][9][10], and by defining complementary measures [11][12][13]. However, none of these measures accounts for the temporal perspective, which is a first-class citizen dimension in process mining.
In this paper, we address the research challenge of defining a general and comprehensive measurement system. More specifically, we propose a framework based on formal semantics grounded in Linear-time Temporal Logic with Past on Finite Traces (LTLpf) to express Reactive Constraints (RCons) [14] in a way that abstracts from specific rule-specification languages. Such constraints are rules of the form ''if A then B'', thus binding the satisfaction of an antecedent A to the occurrence of a consequent B, wherein both A and B are temporal formulas. We show that a probabilistic interpretation of the fine-grained temporal-logic evaluation of any such formulas allows us to employ all available association rule mining measures as-is for temporal rules. Markedly, the framework has linear time and space complexity with respect to the input size.
Our contribution extends concepts from association rule mining to temporal logic specifications. In this way, we define a foundation upon which the fitness between measures and data analysis scenarios can be discussed in future research. We conduct an extensive set of simulation experiments, the results of which demonstrate that, driven by known properties, the measures respond differently to changes in the behavior evidenced by event logs. This is an important finding that highlights the need to select measures according to the application context, confirming previous findings for association rules [15] in the realm of temporal logic specifications.
This paper is an extension of our previous conference paper [16] presented at the 2nd International Conference on Process Mining (ICPM 2020). We extend the contribution in the following aspects:
• We extend the framework to provide measures at the level of the event log, and not only descriptive statistics at the trace level (Section 4);
• We revise the experiments based on the new theoretical extension. In particular, we score the proposed measures and rank them in order to identify the best candidates for rule discovery (Section 6);
• We analyze the memory consumption of the framework along with the time performance (Section 5);
• We extend the discussion of the interestingness measures used and exploit their known properties for a better understanding of log behavior (Sections 3 and 6).
Additionally, we prove the linear-time performance of the RCons verification in the Appendix.
The remainder of this paper is structured as follows. Section 2 discusses prior research on measures for declarative process mining and specification mining. Section 3 defines preliminaries upon which we define our framework. Section 4 defines the measurement framework. Section 5 presents a computational study of the framework and Section 6 shows the results of an array of simulation experiments and discusses them. Finally, Section 7 summarizes the contribution of the paper and points to opportunities for future research.

Related work
Behavioral rules have been widely used to support application scenarios such as association rule mining in machine learning, process discovery and conformance checking in process mining, and specification mining in software engineering. The assessment of rules with respect to the available data is a key component of all these techniques.
In association rule mining, interestingness measures are used to discriminate candidate pairs of relevant co-occurring events. A common technique is to discover frequent rules above a certain Support threshold (frequency) and to prune the results below a certain Confidence threshold (validity). For example, [17] discovers associations between items through an Apriori algorithm based on the downward-closure property of the Support measure. Nevertheless, the use of Support and Confidence alone is reportedly not sufficient to avoid a large number of spurious results [7], i.e., the discovery of rules that are frequently satisfied by the data although merely by chance, thus threatening their statistical validity. A plethora of new measures have been proposed in the literature to overcome the limits of using only Support and Confidence [11], yet the employment of Support and Confidence as the main interestingness measures remains widespread. The main goal driving the development of better measures is indeed the exclusion of spurious rules, so as to let the more crucial ones stand out. Several measures directly improve on or refine the results of Support and Confidence (e.g., Lift [8] scales Confidence with the Support of the consequent of a rule), others combine different measures (e.g., Added Value [12] subtracts the Prevalence from the Confidence of a rule), and further ones convey complementary information (e.g., Specificity [11] measures to what extent the absence of the consequent is related to the absence of the antecedent of a rule).
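To illustrate why Support and Confidence alone can be misleading, consider a sketch (with made-up counts) of a rule whose consequent is nearly ubiquitous: Support and Confidence look strong, but Lift reveals that antecedent and consequent are statistically independent.

```python
# Hypothetical contingency counts: B occurs in 90% of transactions regardless of A.
n_ab, n_a_nb, n_na_b, n_na_nb = 45, 5, 45, 5   # |AB|, |A notB|, |notA B|, |notA notB|
n = n_ab + n_a_nb + n_na_b + n_na_nb           # 100 transactions in total

p_a  = (n_ab + n_a_nb) / n     # P(A)  = 0.5
p_b  = (n_ab + n_na_b) / n     # P(B)  = 0.9
p_ab = n_ab / n                # P(AB) = 0.45

support    = p_ab              # 0.45 -> the rule looks frequent
confidence = p_ab / p_a        # 0.9  -> the rule looks valid
lift       = p_ab / (p_a * p_b)  # 1.0 -> yet A and B are independent: a spurious rule
```

With P(AB) = P(A)P(B), the apparent strength of the rule is entirely explained by the prevalence of B, which is exactly the kind of spurious result discussed in [7].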
In declarative process discovery, interestingness measures are used to prune candidate rules based on user-defined thresholds. This pruning approach is used for Declare discovery in [3,4] and for DCR graphs discovery in [18]. These techniques are mainly based on Support and Confidence, which lead to the aforementioned limits [7]. In addition, the definitions of these metrics also differ depending on the techniques that use them. For instance, the Support measure presented in [3] does not correspond to the Support measure of [4], although both are expressly defined for the sole Declare constraints.
In the area of conformance checking, interestingness measures are used to check the degree of conformance of a rule with respect to an execution trace. In [19], Linear Temporal Logic (LTL) rules are checked against each trace in a given event log. This approach is highly generic, as it supports any custom LTL formula, but it reports only binary results, i.e., whether a rule holds in a trace or not. In [20], Burattin et al. use measures like the fulfillment ratio and the violation ratio, based on counting the activations of a rule (intuitively, the occurrences of its antecedent) that lead to a fulfillment and those that lead to a violation in an event log. However, these metrics are specifically bound to the set of Declare rules, thus not providing a general measurement framework applicable to generic types of rules.
In specification mining, interestingness measures are also used to prune candidate temporal specifications based on user-defined thresholds. Interestingly, specification mining and declarative process discovery are two largely overlapping concepts from distinct fields. Yang et al. [21] discover 2-value temporal patterns using a trace measure that quantifies partial satisfactions of a rule. Yet, the technique lacks generality as it is limited only to alternation patterns (similar to AlternateResponse and AlternatePrecedence in Table 1) and the adopted computation heuristics are tailored to the software domain. Le et al. [6] emphasize the limits of using only Support and Confidence measures and investigate properties of other measures reviewed in [11]. Their results demonstrate that there are several measures outperforming Support and Confidence, and that the combination of different measures yields better results. However, they limit their study to 2-value temporal patterns (specifically, Response and Precedence in Table 1). Furthermore, their computation of the probability for a temporal specification is based on a sliding window technique [22]: traces are read in chunks of the size of a given window, then the probability of a rule is the percentage of windows in which it is satisfied. They test the effect of different window sizes, showing that their results depend not only on the input rules and the data, but also on this parameter selection.
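The window-based probability of [22] can be sketched as follows; `window_probability` and `response_a_b` are our own illustrative names, and the rule check is a simplified, window-local variant of Response. Running it with widths 2 and 3 on the same trace yields different values, which illustrates the dependence on the window-size parameter noted above.

```python
def window_probability(trace, width, holds):
    """Probability of a rule as the fraction of sliding windows that satisfy it."""
    windows = [trace[i:i + width] for i in range(len(trace) - width + 1)]
    return sum(holds(w) for w in windows) / len(windows)

def response_a_b(w):
    """Window-local Response(a, b): every a in the window is followed by b within it."""
    return all("b" in w[i + 1:] for i, e in enumerate(w) if e == "a")

trace = ("a", "b", "c", "a", "b", "c")
window_probability(trace, 2, response_a_b)  # -> 0.8  (4 of 5 windows satisfy the rule)
window_probability(trace, 3, response_a_b)  # -> 0.75 (3 of 4 windows satisfy the rule)
```

The same rule on the same data thus receives different scores purely as an effect of the chosen window size.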
Lemieux et al. [23] extend specification mining to arbitrary LTL specifications (implicitly on finite traces) beyond 2-value templates. However, they resort to the sole Support and Confidence measures to prune uninteresting results, thus incurring the already mentioned statistical limits [7].
The aforementioned shortcomings of quality measures are also discussed in the field of sequence mining [24] when discovering patterns (specifically subsequences) to classify sequential data. Egho et al. [25] highlight how measures like Confidence and Lift alone lead to unstable classification results for subsequences; they propose a probabilistic Bayesian-based measure to overcome this instability and to avoid the requirement of setting thresholds for measures. It falls under the family of techniques based on the minimum description length principle, like [26], where an encoding scheme is used to discover a minimal set of subsequences that can reproduce the original data. Notably, subsequence interleaving patterns are only a subset of the patterns expressible with LTL formulae. Works adopting behavioral rules for classification like [27], on the other hand, fall back to the sole employment of Support.
In summary, despite the discussion in different fields on measures and the problem of spurious relations, there is no technique that supports at the same time a comprehensive and extensible multi-measurements assessment of rules and its applicability on general temporal logic specifications.

Preliminaries
To develop our framework, we build on the sound foundations of LTLpf. In this section, we introduce the fundamentals of LTLpf formulae (Section 3.1) and interestingness measures for association rules (Section 3.2).

Linear-time Temporal Logic with Past on Finite Traces (LTLpf)
As the formal foundations of our framework, we consider rules specified in Linear Temporal Logic on Finite Traces (LTLf) [28] as used in Declare [5,29]. LTLf has the same syntax as LTL [30]. Its semantics is interpreted on finite traces (here abstracted as finite sequences of symbols), thus taking into account that business processes are assumed to eventually terminate [31]. Declare focuses on a set of specific LTLf formulas. Table 1 illustrates some of the most important Declare rules for business process specifications.
LTLpf is an extension of LTLf supporting the expression of properties of the past (hence the ''p'' suffix) [14]. Well-formed Linear-time Temporal Logic with Past on Finite Traces (LTLpf) formulae are built from an alphabet Σ ⊇ {a} of propositional symbols and are closed under the boolean connectives, the unary temporal operators ◯ (next) and ⊖ (previous), and the binary temporal operators U (Until) and S (Since). From these basic operators, the following can be derived: the classical boolean abbreviations True, False, ∨, →; the constant tEnd ≡ ¬◯True, denoting the last instant of a trace; the constant tStart ≡ ¬⊖True, denoting the first instant of a trace; ◇ϕ ≡ True U ϕ, indicating that ϕ holds true eventually before tEnd; ϕ1 W ϕ2 ≡ (ϕ1 U ϕ2) ∨ □ϕ1, which relaxes U as ϕ2 may never hold true; ◆ϕ ≡ True S ϕ, indicating that ϕ holds true at some point in the past, after tStart; □ϕ ≡ ¬◇¬ϕ, indicating that ϕ holds true from the current instant till tEnd; ⊟ϕ ≡ ¬◆¬ϕ, indicating that ϕ holds true from tStart to the current instant.
Given a finite trace t of length n ∈ ℕ, the satisfaction of an LTLpf formula ϕ at a given instant i (1 ≤ i ≤ n) is defined by induction on the structure of ϕ. A formula ϕ is satisfied by a trace t, written t ⊨ ϕ, iff t, 1 ⊨ ϕ. One of the central properties of LTLpf and LTLf is that a deterministic finite state automaton (DFA) Aϕ can be computed such that, for every trace t, we have t ⊨ ϕ iff t belongs to the language recognized by Aϕ, as illustrated in [14,32,33].
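As an illustration of the inductive semantics, the following is a minimal evaluator for a fragment of LTLpf, using 0-based indices and a tuple-encoded syntax tree (both conventions are ours, not part of the framework):

```python
def holds(phi, t, i):
    """Evaluate an LTLpf formula phi at instant i (0-based) of the finite trace t.
    Formulae are nested tuples, e.g. ("U", ("true",), ("atom", "c"))."""
    op = phi[0]
    if op == "true":  return True
    if op == "atom":  return t[i] == phi[1]
    if op == "not":   return not holds(phi[1], t, i)
    if op == "and":   return holds(phi[1], t, i) and holds(phi[2], t, i)
    if op == "next":  return i + 1 < len(t) and holds(phi[1], t, i + 1)
    if op == "prev":  return i >= 1 and holds(phi[1], t, i - 1)
    if op == "U":     # Until: phi2 holds at some k >= i, phi1 holds in between
        return any(holds(phi[2], t, k)
                   and all(holds(phi[1], t, j) for j in range(i, k))
                   for k in range(i, len(t)))
    if op == "S":     # Since: phi2 held at some k <= i, phi1 holds after it
        return any(holds(phi[2], t, k)
                   and all(holds(phi[1], t, j) for j in range(k + 1, i + 1))
                   for k in range(i + 1))
    raise ValueError(f"unknown operator {op}")

def eventually(phi): return ("U", ("true",), phi)  # derived diamond operator
def once(phi):       return ("S", ("true",), phi)  # derived past diamond operator

trace = ("a", "b", "c")
holds(eventually(("atom", "c")), trace, 0)  # True: c eventually occurs
holds(once(("atom", "a")), trace, 2)        # True: a occurred in the past
```

This naive evaluator is exponentially less efficient than the automaton-based approach mentioned above, but it makes the inductive definitions concrete.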
Without loss of generality, in this paper we abstract traces as finite strings of symbols representing events. We assume that every event reports on the execution of exactly one task and that LTLpf formulae use those tasks as their propositional symbols, the so-called Declare assumption [32]. A trace extracted from the real-world Sepsis event log [34] is, e.g., t Sepsis = ⟨ER Registration, ER Triage, ER Sepsis Triage, CRP, Lactic Acid, IV Liquid, IV Antibiotics⟩. Notice that this trace complies with the constraints that Mannhardt et al. identified as normative for the Sepsis treatment process [35], including the following ones: (i) Init(ER Registration), i.e., every trace begins with the registration at the emergency department; (ii) AtMostOne(ER Triage), i.e., the triage in the emergency room occurs at most once in a process run; (iii) Response(ER Triage, ER Sepsis Triage), i.e., the ER Triage procedure should be eventually followed by the Sepsis-specific triage; (iv) Precedence(ER Sepsis Triage, IV Antibiotics), i.e., the intravenous injection of antibiotics must be preceded by the ER Sepsis Triage procedure. Table 17 contains an extended set of Declare constraints that a correct execution of the Sepsis treatment process should fulfill. For the sake of succinctness, we may use single-letter identifiers in place of full-length task names whenever suitable in the following examples.
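The compliance of t Sepsis with constraints (i)-(iv) can be checked with a few one-line predicates (our own simplified encodings of the Declare templates, not the paper's implementation):

```python
t_sepsis = ("ER Registration", "ER Triage", "ER Sepsis Triage",
            "CRP", "Lactic Acid", "IV Liquid", "IV Antibiotics")

def init(a, t):           # Init(a): the trace starts with a
    return len(t) > 0 and t[0] == a

def at_most_one(a, t):    # AtMostOne(a): a occurs at most once
    return t.count(a) <= 1

def response(a, b, t):    # Response(a, b): every a is eventually followed by b
    return all(b in t[i + 1:] for i, e in enumerate(t) if e == a)

def precedence(a, b, t):  # Precedence(a, b): every b is preceded by a
    return all(a in t[:i] for i, e in enumerate(t) if e == b)

init("ER Registration", t_sepsis)                           # True
at_most_one("ER Triage", t_sepsis)                          # True
response("ER Triage", "ER Sepsis Triage", t_sepsis)         # True
precedence("ER Sepsis Triage", "IV Antibiotics", t_sepsis)  # True
```

All four predicates return True on t Sepsis, matching the normative constraints listed above.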
An event log is a multi-set of traces, i.e., traces can recur multiple times in an event log. The cardinality of the event log is the sum of the multiplicities of its traces. Considering the Sepsis event log, the multiplicity of t Sepsis is 13 (i.e., t Sepsis occurs 13 times in the event log). Other exemplary traces of that log are t ′ Sepsis = ⟨ER Registration, ER Triage, ER Sepsis Triage⟩ (occurring 35 times) and t ′′ Sepsis = ⟨ER Registration, ER Triage, ER Sepsis Triage, Leucocytes, CRP⟩ (with a multiplicity of 24). The cardinality of the event sub-log consisting of the traces above is thus 72. The cardinality of the whole Sepsis event log is 1050.
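The multiset view of an event log maps naturally to a counter structure; the sketch below (with the multiplicities reported above) recomputes the sub-log cardinality of 72.

```python
from collections import Counter

# Event log as a multiset of traces, keyed by the trace with its multiplicity,
# using the multiplicities of the three Sepsis traces discussed in the text.
log = Counter()
log[("ER Registration", "ER Triage", "ER Sepsis Triage",
     "CRP", "Lactic Acid", "IV Liquid", "IV Antibiotics")] = 13
log[("ER Registration", "ER Triage", "ER Sepsis Triage")] = 35
log[("ER Registration", "ER Triage", "ER Sepsis Triage",
     "Leucocytes", "CRP")] = 24

sum(log.values())  # cardinality of this sub-log: 13 + 35 + 24 = 72
```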

Interestingness measures for association rules
In this section, we revisit key findings of research on quality measures in association rule mining. More specifically, we build on prior research that surveys measures in the areas of software engineering and association rule mining, namely [11,15,36,37]. These works are particularly suited as a foundation due to their wide coverage of measures and their comparative study of both formal and user-perceived measure properties. The rules under study are in the form ''if A then B'', where A is called the antecedent of the rule, and B its consequent. We refer to A and B as the variables of the rule.
Specifically, we consider only probability-based objective measures, i.e., measures depending only on the data, as opposed to those requiring user-provided parameters. Objective measures are based on the probabilities derived from the contingency table of the occurrences of the variables, as depicted in Table 2. Table 3 presents the list of measures covered in this study. In the following, we provide a brief description of each measure.
Support [17] measures the frequency of the co-occurrence of the antecedent and the consequent. The Support of the sole antecedent of the rule is also called Coverage, while the Support of the consequent is called Prevalence.
Confidence [17] measures the co-occurrences of the antecedent and the consequent in the fraction of data containing the antecedent.
Recall [11] measures the co-occurrences of the antecedent and the consequent in the fraction of data containing the consequent.
Specificity [11] measures the co-absences of the antecedent and the consequent in the fraction of data not containing the antecedent.
Accuracy [11] measures the fraction of the data either containing both the consequent and the antecedent or neither of the two.
Lift [8] scales the Confidence by the probability of the consequent, to check if the co-occurrence of the antecedent and consequent is more likely than their independence.
Leverage [9] measures the difference between the Confidence of the rule and the independent occurrence of its variables.
Added Value [12] measures the difference between the Confidence of the rule and the probability of the consequent alone, to check if the conditioned occurrence of the consequent differs from its unconditioned occurrence.
Relative Risk [38] measures the ratio of the conditional probability of the consequent given the antecedent to the conditional probability of the consequent given the negation of the antecedent.
Jaccard's Coefficient [39] measures the similarity between the variables using the ratio of their co-occurrence to the union of all their independent occurrences.
Certainty Factor [10] measures the ratio of the Added Value of the rule to the Added Value of the consequent alone, in order to see the variation of probability in the data containing the antecedent.
Odds Ratio [40] measures the ratio of the probability of having the consequent when the antecedent is present to the probability of having the consequent when the antecedent is not present.
Odds Multiplier [40] measures the ratio of the probability of having the antecedent when the consequent is present to the probability of having the antecedent when the consequent is not present.
Yule's Q and Yule's Y [41,42] are normalizations of the Odds Ratio that center it around 0 and make it range between −1 and 1.
Klosgen's Measure [43] weights the Support of the rule using its Added Value.
Conviction [13] measures the occurrences of the antecedent without the consequent in comparison to their independence.
Interestingness Weighting Dependency [44] combines Support and Lift of a rule and explicitly gives weights to each of them to let the user decide their relative importance.
Collective Strength [12] measures the ratio of the agreement ratio (number of non-violations per expected number of non-violations) to the violation ratio (number of violations per expected number of violations).
Laplace Correction [45] is a variation of Confidence that takes small data samples into account.
Gini Index [46] measures whether the entropy introduced by a rule brings a marked difference.
J-Measure [47] is an entropy-based measure for the information content of a rule.
One-way Support and Two-way Support [48] combine respectively the Confidence and the Support of a rule with the degree of independence between the variables.
Two-way Support Variation [48] measures the change in the Two-way Support.

Linear Correlation Coefficient [49] measures Pearson's correlation between the variables, computed as (P(AB) − P(A)P(B)) / √(P(A)P(B)P(¬A)P(¬B)).
Piatetsky-Shapiro [9] measures the difference between the co-occurrence of the antecedent and the consequent and their independent frequency.
Cosine [37] measures the geometric mean between Lift and Support of a rule.
Information Gain [51] is the logarithm of the Lift.
Sebag-Schoenauer [52] measures the proportion of positive to negative occurrences of the antecedent.
Least Contradiction [53] measures the difference between positive and negative occurrences of the antecedent, weighted by its frequency.
Example and Counterexample Rate [11] measures the proportion of the antecedent occurrences with and without the consequent.
Zhang [54] measures the positive or negative association between the antecedent and the consequent.
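Several of the measures listed above depend only on P(A), P(B), and P(AB), and can thus be computed side by side from the same contingency probabilities. The sketch below covers a sample of them, following the definitions given in this section (the function and key names are ours):

```python
def measures(p_a, p_b, p_ab):
    """A sample of the measures in Table 3, from P(A), P(B), and P(AB)."""
    conf = p_ab / p_a
    return {
        "support":           p_ab,
        "confidence":        conf,
        "recall":            p_ab / p_b,
        "lift":              p_ab / (p_a * p_b),
        "added_value":       conf - p_b,
        "jaccard":           p_ab / (p_a + p_b - p_ab),
        "certainty_factor":  (conf - p_b) / (1 - p_b),
        "piatetsky_shapiro": p_ab - p_a * p_b,
    }

m = measures(p_a=0.5, p_b=0.6, p_ab=0.4)
m["confidence"]  # -> 0.8
```

For these illustrative probabilities, Lift exceeds 1 and Added Value is positive, both signaling a positive association between the variables.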
Different studies have been dedicated to the analysis of general properties of measures [9,11,15,37]. Properties describe the response of measures under certain conditions; therefore, they can be used to group similar measures and to select the appropriate ones depending on the context. For example, we will analyze the sensitivity of measures to the increase of noise in the data as an important selection criterion for rule monitoring or discovery. We delve deeper into this aspect in Section 6. In this paper, we focus specifically on a subset of the properties proposed in [37] and [15], as their meaning and effects are reportedly recognizable in a clear manner by the final user. The selected properties are explained below and associated to each measure M in Table 3.
P1. Null invariance [37]. The measure is unaffected by traces containing neither A nor B. Therefore, it assesses whether the traces unrelated to the rule affect the measurement or not. To satisfy this property, the measure should not vary when |¬A¬B| increases in the contingency table while the other values remain fixed. In Table 3, we use the '✓' or '-' symbols to indicate whether the property holds or not, respectively. For example, Confidence and Recall enjoy this property, whereas Support, Leverage and Collective Strength do not.
P2. Asymmetric processing of variables [15]. The measure is asymmetric under variable swap, i.e., the measure of ''if A then B'' differs from that of ''if B then A''. The measure enjoys this property if it varies upon the swapping of the values of |¬AB| and |A¬B| in the contingency table. In Table 3, every measure is marked with ''asym'' or ''sym'' to indicate whether the property is enjoyed or not, respectively. For instance, Support and Leverage are symmetric under variable swap, whereas Confidence, Recall and Collective Strength process the variables asymmetrically.
P3. Variation with occurrences of B in the absence of A [15]. The value of the measure varies when the occurrences of B in the absence of A increase. In other words, this property focuses on whether the independent occurrence of B influences the measure. Given an ''if A then B'' rule, if B is very likely to occur regardless of A, the influence of A on B may be questioned. To verify this, the value of the measure varies when |¬AB| increases in the contingency table (and the other values remain fixed). In Table 3, measures are marked with '↘' if the variation is a decrease (as in the case of Support), '↗' if it is an increase (e.g., Leverage), '?' if the variation can be either a decrease or an increase depending on the values of B (e.g., Collective Strength), and '→' if the value does not vary at all (e.g., Confidence).
P4. Reference situations: independence [15]. If the variables are independent, then the measure exhibits a known value. The variables are considered independent when their joint probability is equal to the product of their respective probabilities, i.e., P(AB) = P(A)P(B). The measure should have a constant and known value in that case. In Table 3, measures are labeled as ''const'' (e.g., Lift) if this property holds, and ''var'' otherwise (e.g., Support).
P5. Reference situations: logical rule [15]. If the rule is always satisfied, then the measure exhibits a known value. An ''if A then B'' rule is always satisfied if P(A¬B) = 0. In other words, if there are no counterexamples in the data, the value of a measure that enjoys this property is a known constant (be it a number or a tendency to infinity). In Table 3, measures are labeled as ''const'' if this property holds (as in the case of Confidence), and ''var'' otherwise (see, e.g., Lift).
P6. Trend with P(A¬B) [15]. If the number of counterexamples to the rule rises, the value of the measure reacts by exhibiting a decreasing trend that denotes a higher or lower sensitivity. For an ''if A then B'' rule, a higher number of counterexamples translates into an increase of P(A¬B). Against that increase, the measure may show a fast (convex), linear, or slow (concave) decrease. Measures in Table 3 are labeled as convex (e.g., Conviction), linear ('↘', e.g., Support), or concave (e.g., Recall) accordingly.
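Property P1 (and its absence) can be checked numerically: starting from a contingency table and letting only |¬A¬B| grow, a null-invariant measure such as Confidence keeps its value, whereas Support does not. A small sketch with made-up counts:

```python
def from_counts(n_ab, n_a_nb, n_na_b, n_na_nb):
    """Support, Confidence and Lift from the four contingency-table counts."""
    n = n_ab + n_a_nb + n_na_b + n_na_nb
    p_a, p_b, p_ab = (n_ab + n_a_nb) / n, (n_ab + n_na_b) / n, n_ab / n
    return {"support": p_ab,
            "confidence": p_ab / p_a,
            "lift": p_ab / (p_a * p_b)}

base    = from_counts(30, 10, 20, 40)    # |AB|, |A notB|, |notA B|, |notA notB|
more_nn = from_counts(30, 10, 20, 400)   # only |notA notB| grows tenfold

# Confidence stays at 0.75 (up to rounding): it is null-invariant (P1 holds).
# Support shrinks as unrelated traces are added: P1 does not hold for it.
```

The same experiment can be repeated for any row of Table 3 to reproduce the '✓'/'-' labels of P1.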
We will resort to these properties to examine the quality measures in the context of process mining.
In the following section, therefore, we extend the aforementioned measures to temporal process rules.

Temporal-extended measurement framework
Our framework addresses the limits of Support and Confidence measurements by building on LTLp f formal semantics and the spectrum of measures defined in different areas of computer science. Furthermore, it is generic as it allows for the usage of any probabilistic measure (including those of Table 3) on any temporal-logic-based rules specification. To this end, Section 4.1 formalizes the reactive temporal specification of rules, Section 4.2 discusses their probabilistic interpretation, and Section 4.4 defines the overall framework.

Reactive temporal specification
Our first building block is the concept of Reactive Constraint (RCon), originally introduced in [14], the paper which we extend here. A rule typically expresses that the occurrence of given preconditions (activator) implies certain consequences (target). The reactive nature of this kind of rule lies in the fact that the condition on the target is exerted only if the activator is verified. We codify this intuition in RCons, whose semantics is based on LTLp f .

Definition 4.1 (Reactive Constraint (RCon)). Given an alphabet Σ of propositional symbols, a Reactive Constraint (RCon) is an expression Ψ ≜ ϕα ▷ ϕτ, wherein the activator ϕα and the target ϕτ are LTLpf formulae over Σ.
An RCon is interpreted as follows: each time the activator is True, the target should be True at that point of the trace. For example, a ▷ ◇c is an RCon stating that every time a (the activator, ϕα) is True, then also ◇c (the target, ϕτ) must evaluate to True. That RCon corresponds to Response(a, c) in Declare, as it requires that if a occurs in a trace, it must be eventually followed by c. Likewise, c ▷ ◆d corresponds to Precedence(d, c) in Declare, because it requires that every time c (the activator) occurs in a trace, it has to be preceded by d (the target). Table 1 provides a list of standard Declare constraints expressed in the form of RCons.

Table 4: Evaluation (0 is False and 1 is True) and probabilistic interpretation of RCon a ▷ ◇c.

Table 5: Evaluation (0 is False and 1 is True) and probabilistic interpretation of RCon c ▷ ◆d.

Table 6: Evaluation (0 is False and 1 is True) and probabilistic interpretation of RCon (◆b ∧ ◇e) ▷ (¬c ∨ ◇f).

An RCon that goes beyond the standard repertoire of Declare is (◆b ∧ ◇e) ▷ (¬c ∨ ◇f): its activator ϕα = ◆b ∧ ◇e is satisfied between the occurrence of b and the occurrence of e in a trace; its target is the formula ϕτ = ¬c ∨ ◇f, which evaluates to True if either c is False, or c occurs and is eventually followed by f. Because at every event of the trace (i.e., any point in time) both the activator and the target can be either True or False, the evaluation of an RCon can result in either of the following four combinations.

Definition 4.2 (RCon Evaluation).
Given an RCon Ψ ≜ ϕα ▷ ϕτ and a trace t of length n ∈ ℕ, let t i denote the ith event in the trace (1 ⩽ i ⩽ n). For each t i ∈ t, the possible evaluations of Ψ are the four combinations of the truth values of ϕα and ϕτ at t i.
For example, the second and third rows of Tables 4-6 show the evaluation of the RCons a ▷ ◇c (i.e., Response(a, c), Table 4), c ▷ ◆d (i.e., Precedence(d, c), Table 5), and (◆b ∧ ◇e) ▷ (¬c ∨ ◇f) (Table 6) on the trace ⟨a, b, c, d, f, c, e, c, h⟩. Notice that ϕα and ϕτ are evaluated separately at every event of a trace. The RCon evaluation can be performed efficiently based on the automaton-based technique defined in [14], adapted for offline verification. The full discussion of this aspect can be found in the Appendix, but we briefly outline the rationale here. Intuitively, we resort to [14, Theorem 4]: an RCon can be separated into pure-past, pure-present and pure-future components, whose respective sub-formulae contain only past temporal operators, none, or only future ones. As they are LTLpf formulae, all components correspond to finite state automata (FSAs). The key point is that, by mirroring pure-past formulae and reversing their automata, a single replay of the sub-trace from the beginning to the activator event keeps track of the truth value of the pure-past formula up to that point. As we have knowledge of the whole trace, and thus of the suffix too a fortiori, we can apply the same principle to pure-future formulae: a single replay from the end of the trace to the activator event keeps track of the truth value of the pure-future formula from that point onwards.

Table 7: Contingency table of the probabilities of an RCon ϕα ▷ ϕτ in a trace.
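Under our reading of the operators, the per-event evaluation of the RCon corresponding to Response(a, c) (activator: the current event is a; target: c occurs from the current event onwards) can be sketched on the example trace as follows, in the spirit of the evaluation rows of Tables 4-6 (our own encoding, not the prototype's):

```python
trace = ("a", "b", "c", "d", "f", "c", "e", "c", "h")
n = len(trace)

# Activator and target evaluated separately at every event (cf. Definition 4.2).
activator = [e == "a" for e in trace]              # phi_alpha: current event is a
target    = ["c" in trace[i:] for i in range(n)]   # phi_tau: eventually c, from i on

pairs = list(zip(activator, target))
sum(a and t for a, t in pairs)                 # events where both hold: 1 (the a)
sum((not a) and (not t) for a, t in pairs)     # events where neither holds: 1 (the h)
```

The four counts of (activator, target) combinations across the trace are exactly the entries that populate the contingency table of Table 7.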
From this optimization, it follows that any LTLpf formula can be evaluated at each event by reading the trace only twice (as in [4,55], though for any RCon and not just Declare constraints): once from tStart to tEnd (past components) and once from tEnd to tStart (future components). This result implies that the computational cost depends linearly on the number of events in the event log and on the number of rules to verify. Specifically, given an event log L of cardinality |L|, assuming that (i) every trace t ∈ L has a length of up to n, and (ii) |R| rules are under analysis, the cost to evaluate all rules on L is O(|L| × n × |R|).
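The two replays can be sketched as two linear sweeps that precompute, for every position, the truth value of a pure-future and a pure-past sub-formula (here simple occurrence checks stand in for the general automata-based construction discussed in the Appendix):

```python
def eventually_marks(trace, sym):
    """Backward pass: m[i] is True iff sym occurs at position i or later."""
    m, seen = [False] * len(trace), False
    for i in range(len(trace) - 1, -1, -1):
        seen = seen or trace[i] == sym
        m[i] = seen
    return m

def once_marks(trace, sym):
    """Forward pass: m[i] is True iff sym occurs at position i or earlier."""
    m, seen = [False] * len(trace), False
    for i, e in enumerate(trace):
        seen = seen or e == sym
        m[i] = seen
    return m

trace = ("a", "b", "c", "d", "f", "c", "e", "c", "h")
fut  = eventually_marks(trace, "c")  # one backward sweep, O(n)
past = once_marks(trace, "b")        # one forward sweep, O(n)
# Any RCon whose components are pure-past or pure-future can now be
# evaluated at every event with a single additional O(n) scan over fut/past.
```

Each sweep visits every event once, which is where the linear dependence on the number of events comes from.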

Probabilistic interpretation on a trace
The evaluation of RCons indicates whether a rule holds true or false within a trace. In real life, traces often contain noise or partially deviate from the desired process specifications. On those occasions in which the trace contains events that do not satisfy the rule, we are interested in understanding to what degree the rule is satisfied. As we have previously defined the notion of satisfaction for ϕα and ϕτ on single events (Definition 4.2), we can devise a probabilistic interpretation of RCons over traces.

Definition 4.3 (Probability of an LTLpf Formula in a Trace).
Given an LTLpf formula ϕ and a trace t of length |t| = n, we define the probability of ϕ in t² as the proportion of the events in t that satisfy ϕ:

P(ϕ, t) = |{i : 1 ⩽ i ⩽ n, t_i ⊨ ϕ}| / n

Definition 4.4 (Joint Probability of LTLpf Formulae in a Trace).
Given two LTLpf formulae ϕ_1 and ϕ_2 and a trace t of length n, we define the probability of the intersection of ϕ_1 and ϕ_2 in t (joint probability) as the proportion of the events in t that satisfy both ϕ_1 and ϕ_2:

P(ϕ_1 ∩ ϕ_2, t) = |{i : 1 ⩽ i ⩽ n, t_i ⊨ ϕ_1 and t_i ⊨ ϕ_2}| / n

The probabilities of the evaluations of activator and target of an RCon follow from the above definitions (Table 7 shows the resulting contingency table).

² Notice that we use the comma in P(ϕ, t) and similar following expressions to separate the parameters, namely the formula to be evaluated (here, ϕ) and the structure on which the formula is analyzed (here, t).

Table 9: Trace measures computation and event log statistics of a sample of measures. The statistics are computed skipping divisions of zero by zero (marked with ''NaN''), whenever they occur.
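Definitions 4.3 and 4.4 reduce to counting over the event-wise evaluations of activator and target. A minimal sketch (helper name and dictionary keys are ours) that fills the contingency probabilities of Table 7 from two boolean vectors:

```python
def trace_probabilities(act, tgt):
    """Contingency probabilities of an RCon in a trace: act[i] and tgt[i]
    are the event-wise evaluations of activator and target."""
    n = len(act)
    p_a = sum(act) / n                                  # P(act)  (Def. 4.3)
    p_t = sum(tgt) / n                                  # P(tgt)  (Def. 4.3)
    p_at = sum(a and t for a, t in zip(act, tgt)) / n   # joint   (Def. 4.4)
    return {"P(act)": p_a, "P(tgt)": p_t, "P(act,tgt)": p_at,
            "P(act,!tgt)": p_a - p_at, "P(!act,tgt)": p_t - p_at}
```

The remaining cell of the contingency table, P(¬act, ¬tgt), follows as 1 minus the sum of the other three joint entries.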

For example, Tables 4-6 report the evaluations of the aforementioned RCons, including (b ∧ e) → (¬c ∨ f), on trace ⟨a, b, c, d, f, c, e, c, h⟩. Table 8 summarizes the results of Table 6 in a contingency table. In association rule mining, rules are in the form ''if A then B'', given an antecedent A and a consequent B. The probabilities defined above allow for the application of measures defined for association rule mining [11] to the context of temporal logic specifications over finite traces. To that end, it suffices to map ϕ_α to A and ϕ_τ to B, thus having P(A) as P(ϕ_α, t), P(B) as P(ϕ_τ, t), and P(AB) as P(ϕ_α ∩ ϕ_τ, t). For example, Table 9 shows some measures computed from the probabilities associated with the activator and the target of RCon (b ∧ e) → (¬c ∨ f).
These probabilities pertain to the events in the traces, intuitively answering the question: ''How likely is it that an event satisfies the constraint?''. It follows that the measures based on them also pertain to events with respect to traces, and that their statistics over entire event logs preserve the focus on the single events.
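With the mapping above, classic association-rule measures become one-line formulas over the trace probabilities. A sketch of a few of them (function name is ours; the Specificity variant chosen here is P(¬B | ¬A), one of its common definitions):

```python
import math

def arm_measures(p_a, p_b, p_ab):
    """Support, Confidence, Lift, and Specificity from the probabilities
    of activator (A) and target (B); 0/0 divisions yield NaN, mirroring
    the ''NaN'' convention of the paper's tables."""
    def div(x, y):
        return x / y if y != 0 else math.nan
    support = p_ab                                   # P(AB)
    confidence = div(p_ab, p_a)                      # P(B|A)
    lift = div(p_ab, p_a * p_b)                      # P(AB) / (P(A)P(B))
    specificity = div(1 - p_a - p_b + p_ab, 1 - p_a) # P(!B|!A)
    return support, confidence, lift, specificity
```

For instance, with P(A) = P(B) = 0.5 and P(AB) = 0.25 (independent activator and target), Lift evaluates to 1.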

Probabilistic interpretation on an event log
Following the probability definition for LTLpf formulae over traces, it is of interest to define similar probabilities over event logs. Intuitively, if the trace probabilities assess the likelihood of the rule being correct in the events within a trace, event log probabilities should assess the likelihood of the rule being correct in the traces of an event log. As previously mentioned, the descriptive statistics of trace measures across an event log are suitable for this purpose because they preserve the focus on the events. To achieve this goal, we first derive the conditional probability of the target given the activator in a trace, i.e., P(ϕ_τ | ϕ_α, t). Intuitively, this is the probability for the target to hold true when the activator holds true. Notice that this viewpoint is conceptually closer to the notion of Reactive Constraint than the joint probability of activator and target. Furthermore, the conditional interpretation of rules is also more in line with their human interpretation [56]. This makes the conditional probability a suitable means for the probabilistic analysis of a constraint in a trace as a whole.

Definition 4.5 (Conditional Probability of LTLpf Formulae in a Trace).
Given two LTLpf formulae ϕ_1 and ϕ_2 and a trace t of length n, we define the conditional probability of ϕ_2 given ϕ_1 over t as the proportion of events satisfying ϕ_2 among those that satisfy ϕ_1:

P(ϕ_2 | ϕ_1, t) = P(ϕ_1 ∩ ϕ_2, t) / P(ϕ_1, t)

From the above definition, it follows that P(ϕ_1 ∩ ϕ_2, t) = P(ϕ_2 | ϕ_1, t) × P(ϕ_1, t).

To devise the probability of an RCon in an event log L (henceforth, event log probability), we have to detect the portion of the event log satisfying an LTLpf formula. To this end, we split the event log into a sub-log that has only the traces in which the activator occurs at least once (i.e., every t ∈ L such that P(ϕ_α, t) > 0), and the complementary sub-log consisting of the traces in which the activator does not occur (i.e., every t ∈ L such that P(ϕ_α, t) = 0). Given the above considerations and the definition of conditional probability for RCons in single traces (Definition 4.5), we devise a probabilistic interpretation for RCons over event logs as follows.

Definition 4.6 (Conditional Probability of LTLpf Formulae in an Event Log).
Let ϕ_1 and ϕ_2 be two LTLpf formulae and L an event log of cardinality |L|. We say that ϕ_1 is non-null in a trace t ∈ L if and only if P(ϕ_1, t) > 0. If P(ϕ_1, t) = 0, we say that ϕ_1 is null in t. The conditional probability of ϕ_2 given ϕ_1 in L is the portion of the event log that consists of traces for which ϕ_1 is non-null and satisfies ϕ_2, given the satisfaction of ϕ_1:

P(ϕ_2 | ϕ_1, L) = (Σ_{t ∈ L : P(ϕ_1, t) > 0} P(ϕ_2 | ϕ_1, t)) / |L|

The conditional probability of ϕ_2 given ¬ϕ_1 in L is the portion of the event log that consists of traces for which ϕ_1 is null and satisfies ϕ_2, given the satisfaction of ¬ϕ_1:

P(ϕ_2 | ¬ϕ_1, L) = (Σ_{t ∈ L : P(ϕ_1, t) = 0} P(ϕ_2 | ¬ϕ_1, t)) / |L|

Table 10 shows the resulting contingency table. In the following, we provide the proof of the correctness of our approach (Theorem 4.1), which rests on the fact that the two sub-logs above partition L.

Proof. In light of the fact that there cannot be a trace where P(ϕ_1, t) is both 0 and not 0 at the same time, the proof of Theorem 4.1 proceeds as follows:

Σ_{t ∈ L : P(ϕ_1, t) > 0} 1 + Σ_{t ∈ L : P(ϕ_1, t) = 0} 1 = |L|    (8)
|L|_{P(ϕ_1) > 0} + |L|_{P(ϕ_1) = 0} = |L|    (9)
|L| = |L| ■

The probabilities defined as above permit the application of the association rule mining measures presented in Section 3.2 over an entire event log. In light of the contingency table in Table 10, it suffices to map the probabilities therein to those of the association-rule contingency table; in particular, P(AB) maps to P(ϕ_τ | ϕ_α, L). We remark that this non-trivial mapping from P(AB) to P(ϕ_τ | ϕ_α, L) is intuitively rooted in the inherent nature of ''if A then B'' rules such as the RCons, as evidenced in [56], and its soundness is evidenced by Theorem 4.1. For example, Table 11 shows a few measures computed from the probabilities of the RCon (b ∧ e) → (¬c ∨ f) over a log composed of five traces. We remark that this result is distinct from the mere aggregation of trace measures. For example, comparing the average of the Support values in Table 9 (0.44) with the Support value presented in Table 11 (0.67), we observe that the former is the average of measures computed event-wise within each trace, whereas the latter stems from the trace-wise probabilities over the whole log.

Table 11
Event log probabilities and a sample of measures for the RCon (b ∧ e) → (¬c ∨ f).

Measurement system
Given an event log L, a set of RCons R, and a set of probabilistic measures M as input, our framework returns the measurement of every measure in M for each constraint in R, both over every single trace t ∈ L and over the entire event log L. More specifically, the output can be reported at three different levels of detail: the evaluation of the constraints on the single events, the measures on the single traces (together with their descriptive statistics over the log), and the measures on the entire event log. For example, Table 9 shows some trace-level measures together with their descriptive statistics, and Table 11 shows the corresponding event-log-level measures for the RCon (b ∧ e) → (¬c ∨ f). Since being able to perceive the overall status of a constraint is as important as the possibility to analyze its details in single traces, we report the entire statistical distribution of a measure across the event log to provide a complete information spectrum. Fig. 1 depicts the pipeline of the framework from the input to the output. In the first stage, an RCon is evaluated on each trace of the event log. Then, the evaluation result is used to compute the probabilities of the rule. On top of them, the measures of the rule in each trace and in the entire event log are computed. Also, descriptive statistics over the event log are reported for each trace measure.
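The staged pipeline just described could be orchestrated roughly as follows (a sketch under our own conventions: `rcon(trace)` yields the event-wise boolean vectors of activator and target, and `m(act, tgt)` returns a trace-level measure; the log-level measures of Definition 4.6 are omitted for brevity):

```python
from statistics import mean, pstdev

def measure(log, rcons, measures):
    """Sketch of the measurement pipeline: event-level evaluation,
    trace-level measures, and their descriptive statistics over the log."""
    out = {}
    for r_name, rcon in rcons.items():
        evals = [rcon(t) for t in log]                       # stage 1: events
        for m_name, m in measures.items():
            per_trace = [m(act, tgt) for act, tgt in evals]  # stage 2: traces
            out[(r_name, m_name)] = {
                "traces": per_trace,
                "mean": mean(per_trace),                     # stage 3: stats
                "std": pstdev(per_trace),
            }
    return out
```

A usage example with a toy "rule" and a Support-like trace measure:

```python
log = [list("aab"), list("abb")]
rcons = {"r": lambda t: ([e == "a" for e in t], [e == "a" for e in t])}
measures = {"Support": lambda act, tgt:
            sum(a and t for a, t in zip(act, tgt)) / len(act)}
out = measure(log, rcons, measures)
```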
We remark that the design of the RCons is crucial for the evaluation and the computation of the measures, especially in terms of the definition of their activator. Let us take as an example the RespondedExistence(a, b) constraint from the repertoire of Declare (see Table 1). The classical LTLf formula underlying RespondedExistence(a, b) for whole-trace evaluations is ¬◊a ∨ ◊b [28]. However, the formulation of the rule as an RCon can lead to different interpretations, such as those reported in Table 12. All those formulations are legitimate, as they entail that the occurrence of a in the trace demands the occurrence of b. However, the difference in the way the activator is represented turns out to be crucial. The activator, indeed, encodes when the rule is of interest. For example: are we interested in each occurrence of task a or only in its eventual occurrence in the trace? Do we want the rule to be satisfied at every point of the trace or just at its beginning? These choices have a clear impact on the measures. Table 12 presents the evaluation of a trace with the different formulations and their trace measurements for Confidence and Support. Confidence is equal to 1, as each time the activator holds true, the target holds true as well. Support (i.e., the frequency of ϕ_α ∩ ϕ_τ) varies considerably instead. Notice that this phenomenon carries neither a good nor a bad connotation, but stresses the idea that full control over the formula implies a mindful decision about its design and, subsequently, about picking the right measures for it. In Table 1, we devised the RCon formulae based on the activators of Declare templates described in [3,20]; hence, e.g., the choice for RespondedExistence of the first option presented in Table 12.
While Declare templates are reasonably simple and well-known standard cases, encoding the right activator is crucial for the design of custom RCons and their measures. Lastly, we remark that all the variants in Table 12 take exactly the same amount of computational time to be checked, as any formula requires a trace to be read only twice, as described in Section 4.1.
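To make the effect of the activator choice concrete, the following sketch evaluates three illustrative activator encodings (ours, not the exact formulations of Table 12) against the same target, "b occurs somewhere in the trace", on the trace ⟨d, a, b, c, a⟩:

```python
trace = list("dabca")
n = len(trace)
target = ["b" in trace] * n  # target holds at every event of this trace

# Three illustrative ways of encoding the activator "a occurs":
activators = {
    "a at this event":      [e == "a" for e in trace],
    "a somewhere in trace": ["a" in trace] * n,
    "a at or before here":  ["a" in trace[:i + 1] for i in range(n)],
}
results = {}
for name, act in activators.items():
    both = sum(a and t for a, t in zip(act, target))
    results[name] = (both / n, both / sum(act))  # (Support, Confidence)
```

Since the target holds everywhere, Confidence is 1 for all three variants, while Support varies with the number of events at which the activator is deemed to hold: 0.4, 1.0, and 0.8, respectively.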
In summary, we have described in this section a novel measurement framework for reactive temporal specifications based on LTLp f , supporting probabilistic interestingness measures at

Table 12
Measurements of a constraint expressed with different formulations on trace ⟨d, a, b, c, a⟩.

both trace and event log levels. The framework is designed to be suitable for any custom formula in the form of a Reactive Constraint, and for any measure that is based on the probability of the activator and target of the constraints. Therefore, it supports template sets like Declare and all the interestingness measures from association rule mining seen in Table 3, though it is not limited to them. Next, we evaluate our approach through tests conducted with our implemented prototype.

Implementation and performance analysis
We have implemented our measurement framework as a proof-of-concept software prototype built upon the existing declarative process specification processor tool Janus [14,16]. The Java source code can be found at github.com/Oneiroe/Janus. The core component of the software is the RCon verification engine, upon which a declarative process discovery module and the present declarative rules measurement module are built independently. All the process specifications used in the following experiments are discovered with the discovery module implementing the technique presented in [14]. In the remainder of this section, we first report on the results of a time and space analysis with simulated data. Then, we investigate the computational performance on real-world event log datasets. The results demonstrate the practical feasibility and applicability of our approach.

Time analysis
To assess the efficiency of our implemented technique, we measure its time performance with synthetic event logs against an increase in the data size (i.e., the cardinality of the event log and the length of its traces) and the model size (i.e., the number of rules). We repeated every experiment 10 times to smooth random factors. The reported results average over those of the single repetitions. The machine used for the experiments was equipped with an Intel Core i5-7300U CPU at 2.60 GHz (quad-core), 16 GB of RAM, and an Ubuntu 18.04 LTS operating system.

Table 13: The set of Declare rules used in the experiments, including Init(a), Response(e, f), AlternateSuccession(y, z), ChainSuccession(j, k), and NotCoExistence(0, 1).
To test the response of our implemented framework against the input data size, we set up a controlled experiment in which we first generate logs of varying sizes that are compliant with a fixed set of rules, resorting to the simulation engine of MINERful [57]. Thereupon, we compute the measures listed in Table 3 against all the rules of a larger test specification (not fully compliant with the event log). For every run, we recorded the wall-clock time of our prototype.
The starting set of rules stems from the Declare repertoire of templates [5] and is provided in Table 13. Notice that the set contains all the rule templates seen in Table 1 and is designed in a way that every constraint insists on different tasks.
The test model consists of 649 constraints extracted by the discovery algorithm of Janus (setting the Support and Confidence threshold parameters to 0.05 and 0.8, respectively) from a synthetic event log of 834 963 events, 500 traces, and tasks in {a, . . . , z, 0, 1} that is compliant with the initial model.³ Given the test specification obtained as described above, we performed two tests of 65 iterations each: (1) increasing the length of the traces (with a step of 100 events per iteration, keeping the number of traces per event log equal to 500), and (2) increasing the number of traces in the event log (with a step of 50 new traces per iteration, keeping the trace lengths between 900 and 1000 events). Fig. 2 illustrates the results of both experiments. We observe that the factor actually influencing the wall-clock time is the total amount of events rather than the trace length: indeed, Fig. 2 shows that the recorded timings of both experiments tend to lie on the same line. This experimental result confirms the linear relation between the total number of events in the log and the computational performance illustrated in Section 4.1.
Next, we investigate the response of the framework to an increase in the model size. To do so, we first generate an event log containing 1000 traces with a trace length between 100 and 500 events from the simulation of the rules in Table 13. Thereupon, we use the discovery algorithm of Janus to automatically retrieve different test models with varying levels of compliance. To that end, we make the Confidence threshold range from 1.0 (full model compliance) down to 0.0 with a step of 0.05. The rationale is: the lower the Confidence threshold, the higher the number of constraints in the test model. Then, we calculate all the measures in Table 3 for every constraint of each test model. The time taken for the measurements is shown in Fig. 3. Notice that the computation time is linearly dependent on the number of rules to check, in line with the theoretical computational cost presented in Section 4.1.

Space analysis
The space consumption of our technique depends on the data structures required to store the multi-level results. More specifically, four multidimensional matrices are used, containing respectively (1) the evaluation at the event level, (2) the measures at the trace level, (3) their descriptive statistics over the log, and finally (4) the measures at the event log level.
Considering |L| the number of traces in the log, |E| the total number of events in a log, |R| the number of constraints, |M| the number of measures, the sizes of the matrices are respectively the following: (1) Events evaluation: |E| × |R|, wherein each cell contains two boolean values (i.e., the evaluation of the activator and target of the constraint); (2) Trace measures: |L| × |R| × |M|, having a real number in every cell; (3) Trace measures statistics: |R| × |M|, containing seven real numbers per cell (for the mean, geometric mean, variance, population variance, standard deviation, maximum value, and minimum value, respectively); (4) Event log measures: |R| × |M|, with a real number each.
The events matrix is optimized as a bit matrix, where two bits are sufficient to store the boolean results of the evaluation of both the activator and the target for one event. We implemented our framework in Java, so we employ 1-byte Byte objects and 4-byte Float numbers (6 decimal digits are sufficiently accurate for our purpose). Taking these indicators into account, we can estimate the space consumption; for example, assuming that |L| = 1000, the maximum number of events in a trace n is 50, |R| = 100, and |M| = 30, the space demands can be estimated accordingly. The most memory-demanding data structures are those that pertain to the events evaluation and the trace measurements matrices. The former is bigger than the latter only if the average number of events per trace is greater than 4 times the number of measures used, i.e., |E|/|L| > 4 × |M|. In our experiments, even using all the 37 measures of Table 3, this has never occurred.
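The estimate can be reproduced with a short calculation, following the matrix sizes given above (a sketch assuming 1 byte per event-evaluation cell and 4-byte floats, and ignoring constant JVM overheads; function and key names are ours):

```python
def space_estimate(n_traces, max_trace_len, n_rules, n_measures):
    """Rough per-matrix memory estimate in bytes for the four data
    structures: event evaluations, trace measures, their statistics
    (7 values per rule/measure pair), and log measures."""
    n_events = n_traces * max_trace_len  # upper bound on |E|
    return {
        "events_eval": n_events * n_rules * 1,
        "trace_measures": n_traces * n_rules * n_measures * 4,
        "trace_stats": n_rules * n_measures * 7 * 4,
        "log_measures": n_rules * n_measures * 4,
    }

est = space_estimate(1000, 50, 100, 30)
# trace_measures (~12 MB) dominates events_eval (~5 MB) here, since
# |E|/|L| = 50 < 4 * |M| = 120
```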
As with the experiments for the evaluation of time, we empirically analyze the space consumption through simulations, controlling the number of events n per trace, the number of traces |L|, and the number of constraints under analysis |R|. To measure the memory consumed by the data structures, we perform a bit serialization of the matrix objects listed above. This allows us to have a precise measure of the space consumed by every object, though it unavoidably requires the available memory to be twice the strictly necessary amount. Fig. 4 illustrates the results of our experiments. As can be seen, the resulting linear trends are in line with the expectations, modulo the constant factors introduced by the Java Virtual Machine objects. It can be noticed that the number of constraints to check, being a common factor among all the objects, increases the overall required memory more quickly than the other parameters. We remark that, depending on the desired outcome, not all the measures nor all the matrices are necessary. For example, if only the log measures are desired, the trace measures and their statistics can be ignored, and vice versa. The events evaluation is the only mandatory object, upon which all other computations are based.
At present, our implementation works in-memory; thus, it is assumed that all the objects fit in main memory (which proved to be sufficient for all the real-life logs under analysis). However, we remark that every measure calculation (both at log and trace levels) is independent of the others, thus it is possible either (i) to compute one measure or constraint at a time in a pipeline, in order to reduce the memory load, or (ii) to distribute the workload, by making each system compute one measure per constraint independently. Both are interesting directions for future upgrades of our implementation.

Application on real-world event logs
To test the performance in real settings as well, we compute the rule measures on 13 event logs, whose characteristics are reported in Table 14. Twelve of those event logs are openly available⁴ and belong to the Business Process Intelligence Challenge (BPIC) collection, a Road-Traffic Fines Management Process (RTFMP), and the aforementioned Sepsis event log. In addition, we analyze the performance of our prototype on an event log stemming from a partner of a smart-city project in which the authors are involved (labeled as ''Smart city'' in Table 14). We included the Smart city event log due to its considerable size: as can be noticed from the table, it is the one bearing the largest number of events in this experiment.
For each log, we ran the discovery algorithm of Janus [14] in order to extract a test model to check the event log against. We tuned the parameters of the discovery algorithm to obtain a set of rules with which the event log complies for the most part (Confidence threshold of 0.8), even though the constrained tasks may co-occur infrequently (Support threshold of 0.05). Table 14 illustrates the results. For each event log we report, along with the number of traces, the occurring tasks, and the events, the number of constraints in the test model, the total time from the launch to the termination of the software (''Time''), the time to evaluate the rules on events (''Checks''), the time to compute the measures both at the trace and log levels (''Measures''), and the total space consumed by the data structures of our tool (''Space'') against its expected value (''Expectation''). We remark that the space consumption is consistent with our theoretical expectation and that the wall-clock time remains within acceptable ranges, as the slowest run takes around 4.5 min to check about 600 constraints in a considerably large event log such as BPIC17 [58], handling around 2.5 GB of data.

Analysis of custom rules
In order to demonstrate that our framework can handle any Reactive Constraint beyond the standard Declare repertoire, we applied our approach to compute the discussed measures of a custom rule on the Sepsis event log. We name the custom rule BidirectionalTimeConsequent(a, b, c), as its RCon formulation is a → (◊b ∨ ⬥c): it states that if a occurs, it is expected that either c occurred before it or b will occur afterwards. Table 15 reports the measures at log and trace level for BidirectionalTimeConsequent(Admission NC, CRP, IV Liquid) calculated on the Sepsis real-life log [34]. As can be noticed, Confidence and Recall are relatively high (0.82 and 0.79, respectively), and the values of Coverage and Prevalence (0.76 and 0.79, respectively) suggest a frequent occurrence of activator and target. The value of Lift is greater than 1, which denotes a dependency between activator and target (especially at the trace level). The detailed results of the evaluation on each trace can be found at oneiroe.github.io/DeclarativeMeasurements-static.

⁴ https://data.4tu.nl/
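The event-wise evaluation of such a custom rule is straightforward to sketch (our reading of the rule's intent, with "c occurred strictly before" and "b occurs strictly after"; the function name is illustrative):

```python
def bidirectional_time_consequent(trace, a, b, c):
    """Event-wise activator/target vectors of a rule stating: whenever
    a occurs, c already occurred before or b still occurs later."""
    n = len(trace)
    act = [e == a for e in trace]                    # activator: a occurs
    c_before = [c in trace[:i] for i in range(n)]    # past component
    b_after = [b in trace[i + 1:] for i in range(n)] # future component
    tgt = [cb or ba for cb, ba in zip(c_before, b_after)]
    return act, tgt
```

From these vectors, the trace and log probabilities, and hence all the measures discussed above, follow exactly as for the standard Declare templates.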
The capability of our framework to handle non-standard rules opens up new possibilities for the extensibility of Declare as a declarative specification language, which from its very inception was claimed to be open to customization through the definition of new rules according to the process analyst's needs [65].

Evaluation
In this section, we report on experiments that show interesting implications of having a vast availability of measures with customization options. Specifically, Section 6.1 investigates over which measures can be of interest in the scope of declarative specification discovery, and Section 6.2 shows how the properties of measures can be exploited to characterize the alterations of constraints when the underlying process changes.
All the experimental data (code, input data, results) can be found at https://oneiroe.github.io/DeclarativeMeasurements-static. In the following experiments, we resort to the following tools: (i) the Janus discovery algorithm [14] for the discovery of declarative models from event logs; (ii) the simulation engine of MINERful [57] for the generation of event logs complying with given declarative specifications; (iii) the error injection engine of MINERful [66] for the controlled insertion of noise into event logs; (iv) the declarative model simplification technique of MINERful [33] for the removal of redundancies from declarative specifications.

Ranking experiment
The objectives of this experiment are the following: (1) empirically showing that relying on more measures than Support and Confidence alone is effective in characterizing process rules in an event log; (2) highlighting insightful measures in a declarative process mining context.
To achieve both objectives, we rank the measures according to how many correct and interesting rules they are able to recognize. We take inspiration from the correct rules at N experiment introduced by the seminal work of Le and Lo [6]. Given an event log L, a ground-truth set of rules R_G satisfied in L, a set of rules R_D ⊃ R_G also containing rules loosely satisfied in L, the set of interestingness measures M of Table 3, and a predefined threshold N, we compute the value of each measure m ∈ M for all the rules r ∈ R_D on L. Then, for every measure m ∈ M, we group the rules associated with a common value and sort those groups accordingly. This leads to a separate sorting of the rules for every measure. If, for instance, rules r_1 and r_2 have a Confidence of 1.0, rule r_3 has a Confidence of 0.9, and rules r_4, r_5 and r_6 have a Confidence of 0.8, the top-N sets for Confidence are: {r_1, r_2} in the top-1, {r_1, r_2, r_3} in the top-2, and {r_1, r_2, r_3, r_4, r_5, r_6} in the top-3. Intuitively, a good measure should assign high scores to correct rules. Therefore, we finally count how many of the rules in R_G are within the top-N sets. We repeated the experiment 10 times and considered the average of the results to avoid fluctuations caused by the random factors of simulation. We performed the experiment with N set to 1, 5, 10, 25, 50, 100, 200, 500, 1000, and 1500, i.e., ranging from considering only the best-scoring rules to considering all the rules in R_D. The final ranking of a measure is computed as the average of its rankings for each N. Table 16(a) shows the final rankings for this experiment. Together with the ranking, for each measure we also report the number of correct rules (''Correct'' column) and the average ratio of correct rules over the total number (''Ratio'' column) in the top-N sets, averaged over the 10 repetitions of the experiment. We ran our experiments with three different event logs: (i) a simulation of a synthetic process specification (Section 6.1.1); (ii) a simulation of a synthetic process specification with random changes in the event log so as to mimic partial non-compliance (Section 6.1.2); (iii) a real-life event log (Section 6.1.3). At the end of this section, we draw some conclusions from the obtained results.

Table 15: Measures resulting from the evaluation of constraint BidirectionalTimeConsequent(Admission NC, CRP, IV Liquid) on the Sepsis event log [34].
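The top-N selection with ties, as in the Confidence example above, can be sketched as follows (function name is ours):

```python
def correct_at_n(scores, ground_truth, n):
    """Correct-rules-at-N: take the rules carrying the N highest distinct
    measure values (ties kept together) and count how many ground-truth
    rules appear among them. Returns (correct, total) for the top-N set."""
    top_values = set(sorted(set(scores.values()), reverse=True)[:n])
    top_rules = {r for r, v in scores.items() if v in top_values}
    return len(top_rules & set(ground_truth)), len(top_rules)
```

Running it on the example from the text, with r_1 and r_3 as ground truth, top-1 yields {r_1, r_2} (1 correct out of 2), and top-3 yields all six rules (2 correct out of 6); the ''Ratio'' column corresponds to the first value divided by the second.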

Process simulation
More specifically, we simulate the specification described in Table 13. Notice that the rules are designed not to interfere with one another, and each of them constrains different tasks. The simulation produces an event log that is fully compliant with the rules in Table 13. From the simulated event log, we discover a new process specification with loose bounds (Support and Confidence thresholds set to 0.05 and 0.5, respectively), so as to discover also infrequent and seldom-violated rules. We simplify the resulting set of rules by removing those that do not belong to the ground-truth rules yet strictly subsume or are entailed by them, in order to avoid misleading results. For example, if ChainResponse(o, p) and Response(o, p) both belong to the returned set of rules, the former is retained and the latter is removed, because ChainResponse(o, p) is part of the ground-truth specification (see Table 13) and is subsumed by Response(o, p). The full detail of the technique that deals with the removal of redundant rules is out of scope for this paper; the interested reader can find a detailed description of the problem and the approach in [33]. The simplified discovered model consists of 1310 rules on average. Thereupon, we apply our measurement framework to compute the measures in Table 3 at the event log level for all the discovered rules. Finally, we sort the rules according to each measure, and rank the measures according to how many of the original rules are among the top-N sets.
Notice that Support ranks only sixteenth. Confidence, by contrast, is at the top of the ranking. It should be observed, however, that the experiment considers by design rules that are never violated (i.e., those with maximal Confidence), hence the top position of this measure. Nevertheless, there are two measures that match Confidence, namely (i) the Example and Counterexample Rate and (ii) the Sebag-Schoenauer measure. This is in accordance with P5, as the absence of counterexamples for the rules makes measures with known maximal values rank the original, correct rules highly. The ''Ratio'' column reported in Table 16(a) helps to distinguish the accuracy of the results. For example, Odd Multiplier and Odds Ratio have the same rank (i.e., the same amount of correct rules identified), although Odds Ratio returns 3 times more rules than Odd Multiplier within the top-N sets.

Process simulation with noise
The previous experiment tests a perfectly compliant setup where the reference rules are never violated. For this reason, we conduct a modified version of the experiment, this time injecting noise into the event log in order to check whether the ranking is preserved in non-optimal situations. Specifically, we randomly delete or duplicate 5% of the occurrences of every task. The results can be found in Table 16(b). It can be seen that, also in the presence of partial non-compliance, Confidence and the Example and Counterexample Rate continue to be at the top of the ranking, while the Sebag-Schoenauer measure drops into the second half of the list, together with Yule's Q, Yule's Y, and Odds Ratio. This sudden change due to noise is motivated by the fact that these measures are convex, as we discussed with regard to P6, so they decrease rapidly in the presence of counterexamples. Support persists in the middle of the ranking, with 17 measures scoring better than it. Lift and Information Gain perform far better in the presence of noise. Understanding how measures react to changes in the data (noise, in this case) is key for process drift analysis [67], where the evolution of the process is the main objective. We study the influence of noise on the measurements in more detail in Section 6.2.

Table 16: Ranking of measures according to the simulation experiment with a fully compliant simulated event log (left), with an altered event log (center), and with a real-life event log (right).

Real-life event log
We replicate our ranking experiment on a real-life event log, specifically the Sepsis dataset [34]. In [35], Mannhardt and Blinde illustrate a procedural model, discovered with the help of domain experts, representing the sepsis treatment process at a hospital. We manually translated that model into Declare rules, which we use as a ground-truth specification to replicate the ranking experiment. The model consists of 100 rules, listed in Table 17. Because it is a real-life event log, we do not inject any noise. Table 16(c) illustrates the results. It can be seen that Confidence and the Example and Counterexample Rate remain at the top of the ranking. Also, while Recall claims the first position for the amount of correct rules reported, it also returns a high number of incorrect rules, as signaled by its low average ratio score. In this case, Support drops to the bottom of the ranking.
These experiments suggest possible candidate measures to be used for declarative process discovery. We tested different scenarios, thereby showing how the measures' scores vary in each of them and how the evaluation of rules can benefit from the perspectives of multiple measures that were not previously available for temporal specifications. However, we remark that some measures should be handled in a specific manner. For example, Coverage is a descriptive measure reporting on the portion of the event log in which the rule activator occurs: this characteristic determines its low ranking in all the previous experiments. A low score in these experiments does not imply that a measure is of scarce use in general, though. Indeed, based on the current results, we can only highlight the measures ranking high on average across these experiments as excellent options. The informed proposal of best practices for the selection of measures, and combinations thereof, depending on the analysis purposes (e.g., discovery) constitutes a highly interesting outlook for research.

Sensitivity and resilience to noise
In this section, we study in detail the effect of noise injection in the event log on the measurements of the constraints. We experimentally observed in Section 6.1.2 that measures react differently to alterations of the expected behavior in an event log. A measure may ''sense'' the alteration in the data with respect to a rule or remain unaffected. Also, if the alteration is perceived, the magnitude of the measure's reaction may differ, in light of the different properties that the measures enjoy, as discussed in Section 3.2. Markedly, we will empirically demonstrate that the properties originally defined for association rules hold also in the context of temporal rules. The possibility to characterize the evolution of a constraint is crucial in continuous measurement settings such as streaming analysis [68] or drift analysis [67]. Informed decisions on the measures to monitor, based on their characteristics and the properties they enjoy, are therefore key. Providing a set of guidelines to support such decisions is out of scope for this paper and paves the path for future research avenues. Nevertheless, with this experiment, we hope to provide useful preliminary indications in that sense. We also remark that, while we refer to uncommon or unexpected events as ''noise'' or ''errors'' (reflecting our experimental setting), the same evolution of the measures would occur in the case of process improvements or changes in the normative aspects the process is subject to.
We conducted the experiment as follows. We took as a reference model the set of rules in Table 13 and we simulated it to generate a clean event log that is compliant with it. We remark again that the rules are designed not to interfere with one another. In this way, it is possible to observe the response of measures at varying noise levels targeting one constraint at a time, thus limiting the effect of cross-interference.
Thereupon, we injected noise in the event log and calculated all the measures in Table 3 for the reference model. In particular, we made use of the following types of noise [66]:
• Event insertion: spurious events are included in the traces (mimicking, e.g., double records, alien events, etc.);
• Event deletion: events are expunged from the event log (mimicking, e.g., missing records, uncommitted transactions, etc.);
• White noise: events are randomly inserted and deleted.
Addressing one rule at a time, we studied (1) the direct effect of noise on the constraint, by altering the occurrences of its activator and target via insertions and deletions, and (2) the indirect effect, by altering the occurrences of the other tasks in the event log with white noise. We spread the noise over the event log according to a controlled probability variable. For instance, setting the noise injection as the deletion of occurrences of task a with a probability of 20 % results in the removal of 20 % of the occurrences of task a from the event log, picked at random.
More specifically, for every rule in Table 13, we ran a separate experiment for (i) event-insertion noise affecting the activator or (ii) the target, (iii) event-deletion noise affecting the activator or (iv) the target, and (v) white noise affecting neither the activator nor the target. For each of the combinations above, we let the error-injection probability range from 0 % to 100 % with a step of 10 %. Because of the random factor, we repeated each experiment 10 times and recorded the average results. Fig. 5 shows the results of our experiments on constraint Response(e, f). The Response constraint imposes that the target occurs eventually after every occurrence of the activator. Plotting all the measures together in a static image within the boundaries of a page results in a complex intertwining of lines, hampering the readability of results. Thus, we report here only the top-10 measures among the averaged results of the ranking experiment presented in Section 6.1, namely Confidence, Example and Counterexample Rate, Laplace Correction, Least Contradiction, Accuracy, Cosine, Jaccard, Sebag-Schoenauer, Conviction, and Odd Multiplier, plus Support as a baseline reference. Furthermore, as the measures have different ranges, we normalize them between 0 and 1 in order to compare the different trends. The normalization formula used is the min–max scaling v̂_m = (v_m − min_m) / (max_m − min_m), where v_m is the value of measure m, and max_m and min_m are the maximum and minimum values of the range of the measure, respectively (in case of infinite ranges, the maximum value supported by the software is considered). The full set of plots with all the measures and constraints can be interactively explored at https://github.com/Oneiroe/DeclarativeMeasurements-static.
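A minimal sketch of this normalization step; the capping constant standing in for ''the maximum value supported by the software'' is our own choice:

```python
def min_max_normalize(value, range_min, range_max, cap=1e9):
    """Rescale a measure value to [0, 1] via min-max scaling.
    Infinite range bounds are capped before rescaling."""
    lo = max(range_min, -cap)
    hi = min(range_max, cap)
    v = min(max(value, lo), hi)  # clamp capped-out values into the range
    return (v - lo) / (hi - lo)

min_max_normalize(0.0, -1.0, 1.0)          # a [-1, 1] measure at 0 maps to 0.5
min_max_normalize(2.0, 1.0, float("inf"))  # infinite upper bound is capped
```

In this way, measures with ranges as diverse as [0, 1], [−1, 1], and [0, ∞) become comparable on the same plot.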
For the Response constraint, the selected measures are overall particularly sensitive to the deletions of the target and influenced by both the deletions and insertions of the activator, while mostly insensitive to spurious insertions of the target and white noise. More specifically, we can derive the following observations.
• The deletion of events that satisfy the target leads to more violations of the rule, i.e., lower P(ϕ τ |ϕ α , L) and higher P(¬ϕ τ |ϕ α , L). The negative effect is thus reflected in the constant decrease of all measures. The rapidity at which this phenomenon occurs is particularly interesting. In accordance with property P6, the decrease of concave measures (e.g., Example and Counterexample Rate) is slower than that of linear ones (e.g., Confidence), and convex measures decrease faster than all the others (e.g., Sebag-Schoenauer).
• The deletion of events that satisfy the activator, instead, does not bring further violations, but a higher P(¬ϕ α , L).
As a consequence, the measures are mostly stable until the error rate approaches 100 %. As per P5, measures with a decreasing trend (e.g., Laplace Correction or Accuracy) are susceptible to the frequency of the rule activator in the event log, while measures depending only on rule satisfaction (e.g., Conviction and Odd Multiplier) are constant at all error rates except 100 % (in which case they are no longer defined, due to a zero-by-zero division in their formula).
• The insertion of more events that satisfy the target leads to a higher P(ϕ τ , L) but does not influence the constraint satisfactions, as P(ϕ τ |ϕ α , L) remains constant. We remark that the target ϕ τ of Response(e, f) is f; thus, injecting more occurrences of f implies an increase of P(ϕ τ , L) proportional to the error rate. Following from this consideration and in light of P3, all the selected measures remain constant.
• The effect of inserting more events that satisfy the activator is less clear-cut, as the new tasks may lead either to more rule satisfactions, i.e., a higher P(ϕ τ |ϕ α , L), or to more violations, i.e., a higher P(¬ϕ τ |ϕ α , L). As Response(e, f) requires the occurrence of f eventually after e, at any distance in the trace, in this experiment the injections mostly lead to satisfactions. That is why most of the measures show only a slightly decreasing trend. Notably, Conviction and Odd Multiplier sense this alteration more markedly than all the other measures.
• Lastly, the insertion of events related to neither the target nor the activator mostly does not alter the measures. The satisfactions and violations remain constant, whilst the only increase is in P(¬ϕ α , L). Nevertheless, considering P1 and P5, we can distinguish the measures fluctuating due to the frequency change (e.g., Cosine and Jaccard) from the unaffected ones (e.g., Confidence and Conviction).
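To make the activation-based probabilities above concrete, the following sketch counts fulfilled and violated activations of Response(e, f) over a toy log and derives Confidence as the fraction of fulfilled activations, i.e., an estimate of P(ϕ τ |ϕ α , L). The helper names are illustrative and not part of our prototype, and the scan-ahead check is specific to the Response semantics:

```python
def response_counts(trace, activator="e", target="f"):
    """Count activations of Response(activator, target) in one trace,
    split into fulfilled (target occurs later) and violated ones."""
    fulfilled = violated = 0
    for i, event in enumerate(trace):
        if event == activator:
            if target in trace[i + 1:]:  # target occurs eventually afterwards
                fulfilled += 1
            else:
                violated += 1
    return fulfilled, violated

def confidence(log, activator="e", target="f"):
    """Fraction of fulfilled activations across the whole log."""
    counts = [response_counts(trace, activator, target) for trace in log]
    fulfilled = sum(f for f, _ in counts)
    violated = sum(v for _, v in counts)
    total = fulfilled + violated
    return fulfilled / total if total else None

log = [["e", "f", "g"], ["e", "g"], ["e", "f", "e"]]
confidence(log)  # 2 fulfilled activations out of 4 in total -> 0.5
```

Deleting occurrences of f from such a log turns fulfilled activations into violated ones, which is exactly the mechanism behind the decreasing trends discussed above.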
Because of space limits, we cannot illustrate the outcome for the other rules of Table 13. The interested reader can find the full set of experimental data and results at https://oneiroe.github.io/DeclarativeMeasurements-static in a digital interactive format, which is more suitable for data exploration and browsing.

Discussion
The applications of our framework presented in Sections 6.1 and 6.2 allow for some reflections on the employment and availability of the measures. First, the evolution of the measures reflects whether the alteration in the data affects the target or the activator of the rule (or both). This gives more insight into the analyzed phenomena than merely knowing whether a rule is satisfied at a given moment.
Nevertheless, given a constraint, not all data alterations are perceived by all measures. This implies that, depending on the requirements of the analysis, the choice of the measure is crucial. We reported, for example, an experiment to identify measures suitable for discovery. Furthermore, given a certain alteration, it is possible to identify groups of measures with similar trends that focus on the same aspects of a rule, as can be easily noticed in Fig. 5. Within such groups, it is then possible to select one representative measure.
To this end, the properties discussed in Section 3.2 are clearly a guiding criterion for the selection of measures, as markedly reflected in our experiments, in which measures with similar trends could be distinctly identified by the properties they enjoy. Ultimately, this choice is strictly related to the requirements and goals of the event log analysis.
For instance, P6 turned out to be particularly relevant. While measures may have similar trends, the magnitude of such trends, reflected in the steepness of the curves with which the measures evolve (i.e., concave, convex, or linear), indicates how quickly a measure reacts to changes in the data. Furthermore, it shows the range of tolerance before the alteration in the data becomes too large to recognize the specific constraint behavior. A resilient measure, with a slower decrease, is desirable to sense whether the fundamental characteristics of a log are still visible despite the deviations (e.g., to implement discovery algorithms that are robust to noise). On the other hand, a sensitive measure, with a faster decrease, is desirable when exceptions to the rules are accepted only in a very limited manner (e.g., to monitor normative processes).
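To illustrate the three curve shapes, consider the following measures written as functions of the conditional satisfaction probability c = P(ϕ τ |ϕ α , L); the definitions follow the association rule mining literature, while the framing as functions of c alone is our simplification (it holds when the activator frequency is fixed):

```python
def confidence_measure(c):
    # Linear in c.
    return c

def example_counterexample_rate(c):
    # (n_sat - n_viol) / n_sat = (2c - 1) / c; concave in c.
    return (2 * c - 1) / c

def sebag_schoenauer(c):
    # n_sat / n_viol = c / (1 - c); convex in c, unbounded near c = 1.
    return c / (1 - c) if c < 1 else float("inf")

for c in (1.0, 0.9, 0.8, 0.5):
    print(c, confidence_measure(c),
          round(example_counterexample_rate(c), 3),
          sebag_schoenauer(c))
```

Tabulating the values for decreasing c makes the different steepness of the three curves, and hence their different tolerance to deviations, directly visible.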
An approach enabling such a vast number of measures for temporal specifications is presented in the seminal work of Le and Lo [6]. We extend their investigation along two main lines: first, our approach can handle arbitrary temporal formulae as activator and target, as opposed to single-task variables only (which also restricts the analysis to the sole Response and Precedence patterns). Second, our computation of the rule probabilities processes the whole trace at once, while the calculation scheme in [6] relies on a sliding window, whose size has to be set manually, thereby influencing the results.

Conclusion
In this paper, we presented a comprehensive measurement framework for declarative specifications modeled as Reactive Constraints. Given an event log and a set of custom probabilistic measures, the framework accepts as input any RCon and returns as output the evaluation of the rule for each event of the log, the computed measures for all the traces together with their statistics over the entire event log, and the computed measures over the entire log. The framework goes beyond the current state of the art as it is not limited to a specific set of measures or rules. The experiments conducted reveal the possibility to characterize the behavior of a given constraint through the combination of different measures, which sense the behavior recorded in the log differently. Also, while the choice of the measures to employ is highly context-dependent, we showed how the properties of the measures can be used to guide the selection, as their effects are clearly visible in the results.
Future work. Different possibilities open up upon the foundations of this measurement framework. The possibility to characterize a phenomenon by studying the evolution of different measures can be exploited for, e.g., the dynamic recognition of exceptions in process monitoring [69], the identification of process drifts [67], or the analysis of process variants [70]. Also, the framework can be employed as a post-processing tool for multi-measure filtering of the results of declarative process discovery techniques like [3,4,14].
As measures react differently to diverse stimuli for distinct types of rules, a method to find the best combination of measures depending on the analysis context turns out to be key. To this end, future research could resort to existing techniques like [71] or develop novel multi-measure heuristics. Also, the measures can be integrated for the assessment of multi-constraint specifications as a whole, as in [1,20].
While the analysis of multiple measures at once may be overwhelming for a human, machine learning techniques could benefit from the great amount of information returned by the proposed framework, as they are designed to deal with large sets of multidimensional data. Therefore, the framework also seems promisingly exploitable for feature selection tasks in sequence classification [24].
Finally, we observe that the implementation of this framework can largely benefit from run-time optimization for the verification of the rules' automata, particularly as far as the recognition of permanent violations and satisfactions is concerned [72]. The design and integration of such dedicated techniques serves as an impulse for future research.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.