Outcome-Oriented Prescriptive Process Monitoring Based on Temporal Logic Patterns

Prescriptive Process Monitoring systems recommend, during the execution of a business process, interventions that, if followed, prevent a negative outcome of the process. Such interventions have to be reliable, that is, they have to guarantee the achievement of the desired outcome or performance, and they have to be flexible, that is, they have to avoid overturning the normal process execution or forcing the execution of a given activity. Most of the existing Prescriptive Process Monitoring solutions, however, while performing well in terms of recommendation reliability, provide the users with very specific (sequences of) activities to be executed, without considering the feasibility of these recommendations. To address this issue, we propose a new Outcome-Oriented Prescriptive Process Monitoring system that recommends temporal relations between activities to be guaranteed during the process execution in order to achieve a desired outcome. This softens the requirement of executing a given activity at a given point in time, thus leaving more freedom to the user in deciding the interventions to put in place. Our approach defines these temporal relations as Linear Temporal Logic over finite traces patterns, which are used as features to describe the historical process data recorded in an event log by the information systems supporting the execution of the process. The encoded log is used to train a Machine Learning classifier to learn a mapping between the temporal patterns and the outcome of a process execution. The classifier is then queried at runtime to return, as recommendations, the most salient temporal patterns to be satisfied in order to maximize the likelihood of a certain outcome for an input ongoing process execution. The proposed system is assessed using a pool of 22 real-life event logs that have already been used as a benchmark in the Process Mining community.


Introduction
Prescriptive Process Monitoring [1,2] is a branch of Process Mining [3] that, leveraging historical process data recorded in an event log, aims at providing users with recommendations that, when followed during the execution of a business process, improve the probability of avoiding negative outcomes or optimize performance indicators. For example, a Prescriptive Process Monitoring system might recommend the interventions to carry out, or the activities to execute, in order to minimize the likelihood of a patient going to intensive care, or the time required to discharge a patient from the hospital.
The recommended interventions have to be reliable, that is, they have to guarantee that the desired outcome or a good process performance is achieved, but, at the same time, they have to be flexible enough to avoid recommending interventions that cannot be realized, for instance because a certain activity cannot be executed at a certain point in time during the process execution, or because it has a cost that cannot be afforded. Most of the existing state-of-the-art approaches in the Prescriptive Process Monitoring field, however, mainly focus on returning reliable predictions, while neglecting the flexibility aspect. Most of them, indeed, do not take into account whether the recommended intervention is feasible or affordable in terms of costs.
In this paper, we propose a new Outcome-Oriented Prescriptive Process Monitoring system that aims at ensuring not only the reliability of the provided recommendations, but also their flexibility. In particular, the proposed system returns different recommendations expressed in terms of temporal relations between activities to be preserved in order to maximize the likelihood of achieving a desired outcome (e.g., avoiding a negative outcome), or to optimize a performance indicator of interest. The returned recommendations are prioritized based on their predicted impact on the process outcome. Each recommendation is composed of a temporal relation and a corresponding advice, i.e., "it cannot be violated" or "it has to be satisfied". Recommendations based on temporal relations between activities, differently from recommendations based on activities (or sequences of activities) to be executed, provide the user with more flexibility in choosing the interventions that best fit the current circumstances of an ongoing process execution. In a healthcare scenario, for example, the system could recommend executing an activity no more than once in order to minimize the likelihood of a patient going to intensive care. Furthermore, additional flexibility is provided by the system since users can choose among different prioritized recommendations.
The approach proposed in this paper consists of two steps. In a first step, an encoding based on temporal relations between activities is used to encode the historical process data recorded in an event log. The encoded log is then used to train a Machine Learning (ML) classifier. In a second step, given an ongoing process case σ k , the classifier is inspected in order to extract temporal relations between activities (classification rules) characterizing cases similar to the ongoing execution and leading to the desired outcome. Our assumption is indeed that temporal relations characterizing executions similar to the ongoing case σ k and leading to the desired outcome can convey effective recommendations towards that outcome. The proposed solution has been evaluated on a pool of 22 real-life event logs that have already been used as a benchmark in the Process Mining community [4].
The paper is structured as follows. Section 2 introduces the main concepts useful for understanding the paper. Section 3 and Section 4 introduce the proposed approach and its application in a concrete use case. In Section 5, an extensive experimental evaluation is presented, while Section 7 concludes the paper and spells out directions for future work.

Background
In this section, we introduce the main concepts needed for understanding the remainder of the paper.

Events, Traces and Logs
The main basic concept in Process Mining [3] is the event record (or simply event) that represents the occurrence of an activity in a business process. An event is associated with three mandatory attributes: the event class (or activity name) that states the name of the activity the event refers to, the timestamp that specifies when the event occurred and the case id, which is an identifier of the case of the business process in which the event occurred. For example, a hospital might carry out procedures for the treatment of Sepsis, whose executions are logged in the hospital information system. Each treated Sepsis case is labeled with a case id and every event during this treatment (for example, the triage in the emergency room or the administering of particular antibiotics) is associated only with this case. In general, each event represents an activity occurred at a certain point in time in a given case. In addition, events can have other attributes related to the data payload: the so-called event-specific attributes. In our Sepsis case example, an event attribute for the activity related to the administration of particular antibiotics is the dosage attribute. Finally, case attributes refer to the whole case and are shared by all the events in the same case. In our Sepsis case example, case attributes are the age and the sex of a patient affected by Sepsis and the corresponding values will be the same for each event in the case. The value of a case attribute does not change during the case execution, i.e., it is static. The event-specific attributes, instead, are dynamic, as they change their value based on the event.
We now provide some formal definitions. We denote with Σ the set of all the activity names and with E the universe of all events. A case is the sequence of events generated by a given process execution.
Definition 2 (Case) A case is a non-empty sequence σ = ⟨e_1, ..., e_n⟩ of events such that ∀i ∈ {1, ..., n}, e_i ∈ E and ∀i, j ∈ {1, ..., n}, e_i.c = e_j.c, that is, all events in the sequence refer to the same case.
Consistently with the literature on Process Mining, many business process tasks focus only on the activity names of a case. Therefore, it is customary to perform the projection of the activity names from a case to a trace.
We denote with S the universe of all possible traces and we use the symbol σ for indicating both cases and traces when there is no risk of ambiguity. An event log L is a set of complete cases (i.e., cases recording complete process executions). For instance, in the Sepsis example, we can consider an event log containing two cases σ 1 and σ 2 (see Figure 1). The activity name of the first event in case σ 1 is ER registration; this event occurred at 11:15 AM and it refers to case A. The first two attributes are static and related to the age (27) and the sex of the patient (male), respectively. These have the same values for all the events in the case. The other attributes are event-specific and show that amountPaid is 10 for the first event and 15 for the last one. Note that not all events carry every possible event attribute. For example, the first event of case σ 2 does not have the attribute amountPaid.
Given a case σ = ⟨e_1, ..., e_n⟩ and a positive integer k < n, σ_k = ⟨e_1, ..., e_k⟩ is the prefix of σ of length k. Furthermore, we define the prefix log as the log composed of all possible case prefixes, which is typically used in Predictive and Prescriptive Process Monitoring settings [4].
Definition 4 (Prefix Log) Given a log L, the prefix log L* of L is the event log that contains all prefixes of L, i.e., L* = {σ_k : σ ∈ L, 1 ≤ k < |σ|}.
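The prefix-log construction can be sketched in a few lines of Python (an illustrative sketch with traces represented as lists of activity names; the function names are ours, not the paper's):

```python
def prefixes(trace):
    """All proper prefixes of a trace: sigma_k for 1 <= k < |sigma|."""
    return [trace[:k] for k in range(1, len(trace))]

def prefix_log(log):
    """Prefix log L*: every proper prefix of every case in L."""
    return [p for trace in log for p in prefixes(trace)]

log = [["a", "b", "c"], ["a", "b"]]
print(prefix_log(log))  # [['a'], ['a', 'b'], ['a']]
```

Note that a complete case of length n contributes n − 1 prefixes, which is why prefix logs grow considerably larger than the original log.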

declare
As stated above, the recommendations of the proposed Prescriptive Process Monitoring system are given in the form of temporal relations between activities to be performed during the execution of a process. Such recommendations have to be expressed with a clear semantics for users. To this aim, as a formal basis for specifying such temporal relations/patterns, we adopt the customary choice of Linear Temporal Logic over finite traces (LTL f ) [5]. This logic is at the basis of the well-known declare [6] constraint-based process modeling language. LTL f has exactly the same syntax as standard LTL but, differently from LTL, it interprets formulae over an unbounded, yet finite, linear sequence of states. Given an alphabet Σ of atomic propositions (in our setting, the activity names of events), an LTL f formula ϕ is built by extending propositional logic with the temporal operators X (next), G (globally), F (eventually) and U (until). The semantics of LTL f is given in terms of finite traces denoting finite, possibly empty, sequences σ = ⟨a_0, ..., a_n⟩ of elements of 2^Σ, containing all possible propositional interpretations of the propositional symbols in Σ. In this paper, consistently with the literature on Process Mining, we make the simplifying assumption that at each point of the sequence one and only one element from Σ holds. Under this assumption, σ becomes a total sequence of activity name occurrences from Σ, matching the standard notion of trace. Table 1 shows the semantics of the LTL f operators.
Given a trace σ, the evaluation of a formula ϕ is done in a given position of the trace, and the notation σ, i |= ϕ is used to express that ϕ holds in position i of σ. The notation σ |= ϕ is used as a shortcut for σ, 0 |= ϕ, that is, to indicate that ϕ holds over the entire trace σ starting from the very beginning. A formula ϕ is satisfiable if it admits at least one trace σ such that σ |= ϕ. A set of formulae M = {ϕ_0, ..., ϕ_n} is a model for a log L, denoted with L |= M, if σ |= ϕ_0 ∧ ... ∧ ϕ_n for each σ ∈ L.
Table 1: Semantics of the LTL f operators.
Xϕ: ϕ has to hold in the neXt position of a sequence.
Gϕ: ϕ has to hold always (Globally) in the subsequent positions of a sequence.
Fϕ: ϕ has to hold eventually (in the Future) in the subsequent positions of a sequence.
ϕUψ: ϕ has to hold in a sequence at least Until ψ holds; ψ must hold in the current or in a future position.
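The finite-trace semantics of these operators translates directly into a small recursive evaluator. The sketch below assumes one activity per position (as stated above) and uses a tuple-based formula representation of our own choosing:

```python
def holds(phi, trace, i=0):
    """Check trace, i |= phi over a finite trace (a list of activity names).

    Formulas are strings (atomic activities) or nested tuples:
    ("X", phi), ("G", phi), ("F", phi), ("U", phi, psi).
    """
    if isinstance(phi, str):                       # atomic proposition
        return i < len(trace) and trace[i] == phi
    op = phi[0]
    if op == "X":                                  # neXt position
        return i + 1 < len(trace) and holds(phi[1], trace, i + 1)
    if op == "G":                                  # Globally from i onwards
        return all(holds(phi[1], trace, j) for j in range(i, len(trace)))
    if op == "F":                                  # eventually (Future)
        return any(holds(phi[1], trace, j) for j in range(i, len(trace)))
    if op == "U":                                  # phi Until psi
        return any(holds(phi[2], trace, j)
                   and all(holds(phi[1], trace, k) for k in range(i, j))
                   for j in range(i, len(trace)))
    raise ValueError(f"unknown operator {op}")

trace = ["a", "b", "c"]
print(holds(("F", "c"), trace))       # True: c eventually occurs
print(holds(("X", "b"), trace))       # True: b holds in the next position
print(holds(("U", "a", "b"), trace))  # True: a holds until b does
```

On finite traces, Xϕ is false in the last position regardless of ϕ, which is one of the points where LTL f diverges from standard LTL.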
declare [6] is a declarative process modeling language based on LTL f . More specifically, a declare model fixes a set of activities, and a set of constraints over such activities, formalized using LTL f formulae. The overall model is then formalized as the conjunction of the LTL f formulae expressing its constraints. Among all possible LTL f formulae, declare selects some predefined patterns. Each pattern is represented as a declare template, i.e., a formula with placeholders to be replaced by concrete activities to obtain a constraint. We denote placeholders in declare templates with capital letters and concrete activities in declare constraints with lower case letters. Table 2 reports the main declare templates together with their LTL f semantics and a textual description.
For binary constraints (i.e., constraints involving two activities), one of the two activities, i.e., the activity triggering the constraint, is called activation, and the other one, i.e., the one that satisfies the constraint, is called target. For example, for constraint response (a, b), a is an activation, since the execution of a forces b to be executed eventually; b is, instead, the target, since its execution guarantees the constraint satisfaction. An activation of a constraint can be a fulfillment (if there is a target that satisfies the activation) or a violation for such a constraint. When a trace satisfies a constraint, every activation of the constraint in the trace leads to a fulfillment. For example, constraint response (a, b) is activated and fulfilled twice in trace ⟨a, a, b, c⟩, whereas, in trace ⟨a, b, c, b⟩, the same constraint is activated and fulfilled only once. When a trace does not satisfy a constraint, an activation of the constraint in the trace can lead to a fulfillment, but also to a violation (at least one activation leads to a violation). In ⟨a, b, a, c⟩, for example, constraint response (a, b) is activated twice: the first activation leads to a fulfillment (b occurs eventually), but the second activation leads to a violation (b does not occur after the second activation). A pending activation is an activation that is not yet fulfilled in a prefix of a trace. For example, given the prefix ⟨a, a, b, a, c⟩ of a certain trace, constraint response (a, b) has one pending activation in the last occurrence of a, since it is not currently followed by any occurrence of b, but can still be satisfied in the future, considering that the prefix is (by definition) not complete. We denote by |activations|, |fulfillments|, |violations| and |pendings| the number of activations, fulfillments, violations and pending activations in a trace, respectively.
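For the response template, these four counts can be computed with a single scan of the trace. The sketch below is ours; the `done` flag distinguishes complete traces (where an unfulfilled activation is a violation) from prefixes (where it is pending):

```python
def response_counts(trace, a, b, done):
    """|activations|, |fulfillments|, |violations|, |pendings| of
    response(a, b) in a trace; `done` tells whether the trace is complete."""
    activations = fulfillments = violations = pendings = 0
    for i, activity in enumerate(trace):
        if activity == a:
            activations += 1
            if b in trace[i + 1:]:
                fulfillments += 1   # a later b satisfies this activation
            elif done:
                violations += 1     # the trace is over: unfulfillable
            else:
                pendings += 1       # b may still occur in the future
    return activations, fulfillments, violations, pendings

# The examples from the text:
print(response_counts(["a", "a", "b", "c"], "a", "b", True))        # (2, 2, 0, 0)
print(response_counts(["a", "b", "a", "c"], "a", "b", True))        # (2, 1, 1, 0)
print(response_counts(["a", "a", "b", "a", "c"], "a", "b", False))  # (3, 2, 0, 1)
```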
When testing a trace for satisfaction over one of the declare constraints, the presence of an activation in the trace triggers the clause verification, requiring the (non-)execution of an event containing the target in the same trace. The notion of activation is related to the notion of vacuity detection in model checking [7,8]. For example, in constraint response (a, b), if a never occurs in a trace, then the constraint is "vacuously" satisfied, that is, satisfied without showing any form of interaction with the trace.
declare templates can be gathered into four main groups according to their semantics [9] (see the declare template groups in Table 2):
Existence (E): the templates in this group have only one parameter and check either the number of its occurrences in a trace or its position in the trace.
Choice (C): the templates in this group have two parameters and check whether (at least) one of them occurs in a trace.
Positive Relations (PR): the templates in this group have two parameters and check the relative position of the two corresponding activities.
Negative Relations (NR): the templates in this group have two parameters and check that the two corresponding activities do not occur together or do not occur in a certain order.
Hereafter, we denote with A the set of declare templates, i.e., A = E ∪ C ∪ PR ∪ NR. Furthermore, given a set of activities Σ, we denote with A_Σ the set of declare templates instantiated over the activities in Σ.
Table 2 (excerpt, existence group E):
existence(n, A): A has to occur at least n times.
absence(n + 1, A), i.e., ¬existence(n + 1, A): A has to occur at most n times.
exactly(n, A), i.e., existence(n, A) ∧ absence(n + 1, A): A has to occur exactly n times.
init(A): Each case has to start with A.
When a process execution is ongoing, the satisfaction of the corresponding trace prefix against a declare constraint is not boolean. In particular, the Runtime Verification (RV) satisfaction value of a declare constraint ϕ in a trace prefix σ_i (indicated as [σ_i |= ϕ]_RV) is defined according to the four-valued semantics introduced in [10]. In particular, a constraint in an ongoing process execution can be:
Possibly Satisfied: the constraint is satisfied in the current position of the trace, but might be violated in the future.
Possibly Violated: the constraint is violated in the current position of the trace, but might be satisfied in the future.
Satisfied: the constraint is permanently satisfied and can no longer become violated in the future positions of the trace.
Violated: the constraint is permanently violated and can no longer become satisfied in the future positions of the trace.
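For a response-style constraint, this four-valued state follows directly from the counts introduced earlier. The sketch below is an assumption of ours for this template family only; the actual criteria are template-specific (cf. Table 3):

```python
def rv_state(violations, pendings, done):
    """Four-valued RV state of a response-style constraint (a sketch:
    criteria differ per template, cf. Table 3). At runtime a pending
    activation makes the constraint only *possibly* violated, since the
    target may still occur before the case ends."""
    if done:
        return "Satisfied" if violations == 0 and pendings == 0 else "Violated"
    return "Possibly Violated" if pendings > 0 else "Possibly Satisfied"

print(rv_state(0, 1, done=False))  # Possibly Violated
print(rv_state(0, 0, done=True))   # Satisfied
```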
The RV satisfaction value of a constraint in a trace depends on the type of constraint. Table 3 shows the criteria to determine the RV satisfaction value of a constraint in a trace for each declare template.
Table 3: Criteria for identifying the RV satisfaction values of a constraint in a trace, where a = |activations|, f = |fulfillments|, v = |violations|, p = |pendings|, and done is a boolean value specifying whether the trace is complete or not.

[Table 3 body: one row per template (response, not response, chain response, precedence, not precedence, absence(n + 1), chain precedence, not chain precedence, alternate precedence, ...) giving, over a, f, v, p and done, the criteria under which the constraint is Possibly violated, Possibly satisfied, Violated or Satisfied; the individual cell criteria are not recoverable from this extraction.]

Method
Our Outcome-Oriented Prescriptive Process Monitoring system focuses on prescribing interventions on the process control flow (i.e., on the activities to be executed) in order to maximize the likelihood of achieving a certain outcome. Specifically, our system prescribes to users temporal relations between activities that have to be preserved or violated in order to achieve a desired outcome. For example, to minimize the likelihood of a patient going to intensive care in a Sepsis case, the prescription (at a certain point in time of the case) could be that activity Antibiotics treatment should be immediately followed by Leucocytes test.
These prescriptions need to be: D1: reliable, to ensure the achievement of the desired outcome; D2: flexible, so as to provide users with enough freedom in the application of the suggested recommendations.
To meet these desiderata, we propose a Prescriptive Process Monitoring system that: 1. encodes the traces of a historical event log with temporal relations among activities (D2); 2. learns, with an ML classifier, correlations between these temporal relations and case outcomes (D1); 3. generates a prioritized list of prescriptions/recommendations R (i.e., an ordered list of temporal relations to satisfy or violate) for an ongoing case (D2).
Given A_Σ, the set of declare constraints instantiated over the set Σ of the activities in a log, we define a prescription/recommendation r ∈ R as a pair ⟨ϕ, c⟩, where ϕ ∈ A_Σ is a temporal constraint and c is a condition specifying whether the constraint ϕ has to be satisfied or violated when the next activities of the process case are performed. These recommendations guide the user to achieve the desired outcome. More details on these conditions are provided in Section 3.3. For instance, in the example above, R = {⟨ϕ, c⟩} contains a unique recommendation ⟨ϕ, c⟩, where ϕ is constraint chain response (Antibiotics treatment, Leucocytes test) and c is the condition "It should not be violated". Figure 2 shows an overview of our proposal. Given a labeled training log L train, in which each trace σ_i is associated with a label y(σ_i) (specifying whether a given desired outcome has been achieved or not in that trace), each trace σ_i in L train is encoded, using a declare encoder e, into a feature vector x_i. The feature vector is composed of p features, each representing the grounding on log activities of a declare template, i.e., a declare constraint ϕ ∈ A_Σ (Section 3.1). Each trace is encoded based on whether it satisfies or not the declare constraint associated with each feature. The encoded traces {⟨x_i, y(σ_i)⟩}_{i=1}^{n} are then fed into an ML classifier to learn a classification task according to the given labeling (Section 3.2). The learned classifier f_θ is then queried by a generator of recommendations using the encoded prefixes of a prefix log L* test. The aim of the query is to extract from f_θ a set of recommendations R that maximizes the likelihood of a positive outcome for a prefix σ_k ∈ L* test (Section 3.3).

Encoding Traces Using LTL f Temporal Patterns
The proposed approach encodes temporal relations between log activities by using a sequence encoder. Each sequence σ i in L train or L * test is transformed into a vector e(σ i ).
Definition 5 (Sequence/trace encoder) A sequence (or trace) encoder e : S → X 1 × · · · × X p is a function that takes a (partial) trace σ i and transforms it into a feature vector x i = e(σ i ) in the p-dimensional vector space X 1 × · · · × X p , with X j ⊆ R, 1 ≤ j ≤ p being the domain of the j-th feature.
Specifically, we adopt an encoding based on the declare semantics reported in Table 3 so as to obtain, for each (prefix) trace σ i , a feature vector x i . The features used to build the feature vector are obtained by instantiating all the declare templates in Table 2 with all the combinations 1 of activities available in L train ∪ L * test (representing the alphabet Σ). Each element of the feature vector is then a value representing whether each of those declare constraints ϕ j ∈ A Σ is (possibly) satisfied or (possibly) violated in σ i .
The possible feature values for the j-th feature are: 0 if constraint ϕ_j is violated in σ_i, 1 if it is satisfied, 2 if it is possibly violated, and 3 if it is possibly satisfied. After the encoding phase, the event log (L train or L* test) is transformed into a matrix of numerical values, where each row corresponds to a sequence and each column corresponds to a declare constraint. Each entry is the RV satisfaction value of the constraint in the sequence (expressed as an integer value, as just explained).
We present now an example of trace encoding, using trace σ_i = ⟨a, b, c, a, b, c, c, a, b⟩, alphabet Σ = {a, b, c}, and response as the only template used in the encoding, so that the features are the nine constraints response (x, y) with x, y ∈ Σ. The feature values are determined using the semantics reported in Table 3. For example:
• if σ_i is a complete trace, constraint response (a, c) is violated, since the third activation (the third occurrence of a) leads to a violation (it is not eventually followed by c). It is hence encoded with 0;
• if σ_i is a complete trace, constraint response (a, b) is satisfied. It is hence encoded with 1;
• if σ_i is a prefix, constraint response (a, b) is possibly satisfied, since there are no pending activations for this constraint (but the constraint can still be violated in the future). The constraint is then encoded with 3;
• if σ_i is a prefix, constraint response (b, c) is possibly violated, since the last occurrence of b is a pending activation for this constraint. The constraint is hence encoded with 2.
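The example values above can be reproduced with a small function mapping each response constraint to its integer code (0 violated, 1 satisfied, 2 possibly violated, 3 possibly satisfied); this is an illustrative sketch restricted to the response template:

```python
def encode_response(trace, a, b, done):
    """Integer RV code of response(a, b): 0 violated, 1 satisfied,
    2 possibly violated, 3 possibly satisfied."""
    pending = any(activity == a and b not in trace[i + 1:]
                  for i, activity in enumerate(trace))
    if done:
        return 1 if not pending else 0   # complete trace: boolean verdict
    return 2 if pending else 3           # prefix: only "possible" verdicts

sigma = ["a", "b", "c", "a", "b", "c", "c", "a", "b"]
print(encode_response(sigma, "a", "c", True))   # 0: last a never followed by c
print(encode_response(sigma, "a", "b", True))   # 1: every a is followed by b
print(encode_response(sigma, "a", "b", False))  # 3: no pending activations
print(encode_response(sigma, "b", "c", False))  # 2: last b is pending

# The feature vector enumerates all activity pairs of the alphabet:
vector = [encode_response(sigma, x, y, True) for x in "abc" for y in "abc"]
```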
Assuming a fixed ordering of the nine response constraints used as features, the encoding of σ_i is the vector of the corresponding integer values. This operation is performed by the declare encoder e in Figure 2, leveraging the semantics reported in Table 3, where the parameter done is True when the training log L train is provided as input, and False when the test prefix log L* test is provided as input.
Differently from standard methods for trace encoding [4], this kind of encoding allows for the generation of easily understandable recommendations R based on the simple and intuitive declare temporal patterns. These recommendations are extracted from an ML model f_θ, trained on the event log L train encoded with the declare encoding just introduced, by using a rule extraction technique. However, this kind of encoding has the drawback of creating very long feature vectors x_i. Indeed, for a set of activities Σ and a given declare template involving an activation and a target activity, the number of generated features is O(|Σ|^2). For instance, in a very simple domain where |Σ| = 10, the number of features generated using the templates shown in Table 2 is 4 · 10 + 14 · 10^2 = 1440, where 4 is the number of unary templates and 14 the number of binary templates. Such large feature vectors have the disadvantage of including irrelevant or redundant features, which are time-consuming for the ML algorithm used to train the classifier and make the classification problem harder. We address this problem by adopting a feature selection strategy. In particular, using the Apriori algorithm described in [11], we select the most frequent (pairs of) activities according to a user-defined threshold, and we instantiate the declare templates only over those activities. The obtained features are then ranked according to their mutual information score [12] with the class label. In our experiments, we set the Apriori algorithm threshold to 5% in order to select a sufficiently large number of features to be ranked based on the mutual information score. The number of top most informative features to retain, instead, was selected through a grid search, as explained in Section 5.3.
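The mutual-information ranking step can be sketched with the standard discrete estimator. This is a minimal implementation of ours; in practice an off-the-shelf library routine would be used:

```python
from collections import Counter
from math import log

def mutual_information(column, labels):
    """MI (in nats) between a discrete feature column and the class labels."""
    n = len(column)
    joint = Counter(zip(column, labels))
    px, py = Counter(column), Counter(labels)
    return sum((c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

def top_features(matrix, labels, k):
    """Indices of the k encoded feature columns with the highest MI score."""
    columns = list(zip(*matrix))
    ranked = sorted(range(len(columns)),
                    key=lambda j: mutual_information(columns[j], labels),
                    reverse=True)
    return ranked[:k]

# Column 0 perfectly predicts the label; column 1 is constant (MI = 0).
matrix = [[1, 5], [1, 5], [0, 5], [0, 5]]
labels = [1, 1, 0, 0]
print(top_features(matrix, labels, 1))  # [0]
```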

Training a Classifier for Reliable Recommendations
Desideratum D1 requires reliable recommendations, that is, recommendations that, if followed, help achieve the desired process outcome. We therefore need an effective mapping function f_θ between the recommendations (expressed in terms of satisfaction or violation of certain declare constraints) and the outcome of a sequence σ_i. The following definition formalizes the outcome of a complete trace with a known class label, given the set S of all possible sequences.
Definition 6 (Labeling function) A labeling function y : S → Y maps a trace σ i to its class label y(σ i ) ∈ Y with Y being the domain of the class labels.
For classification tasks, Y is a finite set of categorical outcomes. In this paper, we only consider binary outcomes, i.e., Y = {0, 1}. For instance, in the Sepsis case example, a case σ i can be labeled as positive (y(σ i ) = 1) if the patient does not need to go to intensive care, or as negative (y(σ i ) = 0) in the opposite case. For building the mapping function f θ between the feature vectors x i = e(σ i ) and their labels y(σ i ), we train an ML classifier.
Definition 7 (Classifier) A classifier f θ : X 1 × · · · × X p → Y is a function that takes an encoded vector x ∈ X 1 × · · · × X p and estimates its class label.
The variable θ is characteristic of the classifier and indicates a set of parameters to be learned to have a reliable estimation of the class label. This set of parameters is learned by training the classifier through a learning algorithm whose input is the training log L train .

Generating Recommendations as LTL f Temporal Patterns
The trained ML classifier f_θ is then used to extract a prioritized list of recommendations R for a given prefix σ_k ∈ L* test. As ML classifiers, we use Decision Trees (DTs) since (i) they have shown good performance in Predictive Process Monitoring [4] (D1); and (ii) the most important features are explicitly available in the model, so that recommendations can be sorted based on their discriminativeness (D2). In particular, in DTs, the features closest to the root are the most discriminative ones. This provides a natural prioritization of the effectiveness of the recommendations.
A path from the root to a leaf of a DT simply consists of a set of features along with their corresponding values learned during the training process. Each trace is mapped to a path in the DT that can be used to classify it. The paths in the DT are identified using decision points expressed as conditions on the feature values. Therefore, the path itself can be seen as a classification rule for an input trace. In our case, a classification rule for a (complete) trace σ_i has the following form:

(ϕ_0 = val_0) ∧ (ϕ_1 = val_1) ∧ ... → y(σ_i),

where ϕ_j ∈ A_Σ and, since the classifier is trained over complete traces, val_j can be satisfied, if σ_i |= ϕ_j, or violated otherwise. A good DT contains a set of paths that are able to discriminate, in an effective way, between L+ train = {σ_i ∈ L train : y(σ_i) = 1} (the subset of traces of L train with a positive label) and L− train = {σ_i ∈ L train : y(σ_i) = 0} (the subset of traces of L train with a negative label). Note that L+ train and L− train form a partition of L train, i.e., L train = L+ train ∪ L− train and L+ train ∩ L− train = ∅. Before explaining the details of our method to generate recommendations, we introduce some preliminary notions. Given a DT, let P = {p_0, p_1, ...} be the set of its paths from the root to the leaves. A single path p from the root to a leaf is defined as:

p = ⟨(ϕ_0, val_0), (ϕ_1, val_1), ..., polarity, impurity, #PosSamples, #NegSamples⟩,

where (ϕ_0, val_0), (ϕ_1, val_1), ... are the feature-value pairs belonging to the path, polarity and impurity are the majority class and the impurity value (computed using either the Gini index or the entropy) of the leaf node of the path, and #PosSamples and #NegSamples are the number of positive and negative training samples matching the path.
Given a trace prefix σ_k, our proposal is to derive a set of recommendations R from P by finding a positive path p ∈ P+ (that is, a path with a positive polarity and hence likely to lead to a positive outcome) with feature-value pairs matching as much as possible the ones appearing in the encoding of σ_k. Our assumption is that a positive path very similar (according to a similarity score) to σ_k can convey effective recommendations for achieving a positive process outcome. However, the similarity between a path and a prefix may not be sufficient to find a unique path providing good recommendations. Indeed, for short prefixes, many paths in P could have the same similarity score with respect to σ_k, due to the small number of activities in σ_k. To better discriminate among different paths with the same similarity score, we therefore select the path with the lowest impurity and the highest probability. We formalize these ideas with the notion of recommendation score ρ(σ_k, p) between a prefix σ_k and a path p, defined as:

ρ(σ_k, p) = λ_1 F(σ_k, p) + λ_2 (1 − impurity(p)) + λ_3 prob+(p),    (1)

where F is a fitness function measuring the similarity between σ_k and path p, the term weighted by λ_2 refers to the purity of the leaf node of p (i.e., the complement of its impurity), and the term weighted by λ_3, prob+(p), is the probability of path p classifying correctly a positive sample (P+ is the set of paths leading to a positive outcome). All the weighted terms of Eq. (1) are numbers between 0 and 1, and the weights λ_1, λ_2 and λ_3 are hyperparameters of the generation algorithm such that λ_1 + λ_2 + λ_3 = 1. The fitness function F is computed as the average compliance of the learned satisfaction values of the declare constraints in path p with the RV satisfaction values of these constraints in prefix σ_k. Given a path p, let rule(p) = ⟨(ϕ_0, val_0), (ϕ_1, val_1), ...⟩ be the sequence of pairs of constraints and their satisfaction values in path p.
The fitness function F is then defined as:

F(σ_k, p) = (1 / |rule(p)|) Σ_{(ϕ, val) ∈ rule(p)} C(val, [σ_k |= ϕ]_RV),    (2)

where the compliance function C (Eq. (3)) returns higher values if the learned satisfaction value val for ϕ is similar to the RV satisfaction value of ϕ in σ_k, [σ_k |= ϕ]_RV. Given a prefix σ_k and a DT with P+ the set of paths leading to a positive outcome in the DT, we define the path p* that conveys the best recommendations as the positive path that maximizes the recommendation score ρ with σ_k:

p* = argmax_{p ∈ P+} ρ(σ_k, p).

The extraction of the recommendations R from p* is straightforward. Let rule(p*) be the sequence of constraints and their satisfaction values encoded in p*. A recommendation is generated for each pair (ϕ, val) in rule(p*) by comparing again val with [σ_k |= ϕ]_RV. Table 4 shows the rules for generating recommendations from val and [σ_k |= ϕ]_RV. The idea is to provide a recommendation so that the ongoing trace σ_k becomes compliant with the classification rule. A full compliance between val and [σ_k |= ϕ]_RV results in a case completely in line with the positive classification rule of the DT. In this case, no prescription is needed (see rows 2 and 6). On the other hand, if a contradiction occurs between val and [σ_k |= ϕ]_RV, the case cannot be fixed anymore (see rows 1 and 5). In the other situations, a recommendation ⟨ϕ, c⟩ is provided (see rows 3-4 and 7-8). The recommendation suggests the action to take and is composed of a declare constraint ϕ and a condition c expressing whether the constraint should be satisfied or not.
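The scoring and recommendation-generation steps can be sketched as follows. The concrete compliance values (1 for agreement, 0.5 for a repairable mismatch, 0 for a permanent contradiction) and the normalization of the third term are illustrative assumptions of ours, not necessarily the paper's exact definitions:

```python
COMPATIBLE = {"satisfied": {"satisfied", "possibly satisfied"},
              "violated": {"violated", "possibly violated"}}

def compliance(val, rv):
    """Assumed compliance C(val, [sigma_k |= phi]_RV)."""
    if rv in COMPATIBLE[val]:
        return 1.0                # RV state agrees with the learned value
    if rv in ("satisfied", "violated"):
        return 0.0                # permanent contradiction: unfixable
    return 0.5                    # "possibly" the wrong way: repairable

def score(prefix_rv, path, lambdas, total_pos):
    """rho(sigma_k, p): weighted fitness, leaf purity and positive share."""
    l1, l2, l3 = lambdas
    fit = sum(compliance(val, prefix_rv[phi])
              for phi, val in path["rule"]) / len(path["rule"])
    return (l1 * fit + l2 * (1 - path["impurity"])
            + l3 * path["pos_samples"] / total_pos)

def recommend(prefix_rv, rule):
    """Turn the best path's rule into (constraint, condition) pairs."""
    recommendations = []
    for phi, val in rule:
        rv = prefix_rv[phi]
        if rv in COMPATIBLE[val] or rv in ("satisfied", "violated"):
            continue              # already in line, or no longer fixable
        condition = ("it has to be satisfied" if val == "satisfied"
                     else "it has to be violated")
        recommendations.append((phi, condition))
    return recommendations

prefix_rv = {"phi1": "possibly violated", "phi2": "possibly satisfied"}
rule = [("phi1", "satisfied"), ("phi2", "satisfied")]
print(recommend(prefix_rv, rule))  # [('phi1', 'it has to be satisfied')]
```

Note how the constraint ordering in `rule` is preserved in the output, which is what yields the prioritized list of recommendations described next.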
The resulting list of recommendations R is returned already ordered by importance, considering that the recommendations extracted from the feature-value pairs that, in the DT, are the closest to the root best discriminate between a positive and a negative outcome for σ_k. Therefore, the corresponding recommendations can be presented to the users with a higher priority, to allow them to choose the most important recommendations to adopt in case it is not possible to follow all of them.

Running Example

We illustrate our approach with a running example based on the Sepsis condition (sepsis case 2) and the existence family of declare templates E. We choose this family as it allows us to provide a simple but concrete description of our Prescriptive Process Monitoring system. As stated in Section 5.1, the sepsis case 2 dataset contains 782 cases. Among them, the cases with a positive label are the ones related to patients that do not need to go to intensive care. Therefore, our Prescriptive Process Monitoring system will provide recommendations for avoiding the admission of a patient to intensive care.

Preprocessing and encoding
The dataset contains 24 activity names representing standard activities being performed during a Sepsis case. During a preprocessing phase, the system discards the activity names that are infrequent in the dataset, based on a user-defined threshold. In our case, we select the activity names that appear in at least 5% of the cases in the dataset. After this step, 12 activity names remain: ER Registration, ER Triage, ER Sepsis Triage, CRP, LacticAcid, Leucocytes, IV Liquid, IV Antibiotics, Admission NC, Release A, Return ER, Release B. These activity names are combined with the existence templates in E (Table 2), thus obtaining 48 features for the trace encoding.
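The existence-family encoding can be illustrated with a simplified sketch. The actual encoding in the paper is computed with the Declare4Py library; the template semantics below are a simplified reading (with exactly meaning "exactly once"), used here only to show how the 4 templates × 12 activity names yield the 48 features:

```python
from collections import Counter

# Illustrative existence-family encoding of a single trace:
# existence(a): a occurs at least once; absence(a): a never occurs;
# init(a): a is the first event; exactly(a): a occurs exactly once.
def encode_trace(trace, activities):
    counts = Counter(trace)
    feats = {}
    for a in activities:
        feats[f"existence({a})"] = counts[a] >= 1
        feats[f"absence({a})"] = counts[a] == 0
        feats[f"init({a})"] = bool(trace) and trace[0] == a
        feats[f"exactly({a})"] = counts[a] == 1
    return feats
```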
Assuming that our feature selection algorithm considers the top h features, with h corresponding to half of the number of the original features, after the feature selection, the resulting features are reduced to 24. The obtained features are in our case: existence (ER Registration), existence (ER Triage), existence (ER Sepsis Triage), init (CRP), exactly (CRP), absence (LacticAcid), absence (Leucocytes), exactly (Leucocytes), exactly (IV Liquid), existence (IV Antibiotics), existence (Admission NC), absence (Admission NC), exactly (Admission NC), existence (Release A), absence (Release A), init (Release A), exactly (Release A), existence (Return ER), absence (Return ER), init (Return ER), exactly (Return ER), existence (Release B), absence (Release B), exactly (Release B). The dataset has 782 traces that are divided into a training set L train (625 traces, 80%) and a test set (157 traces, 20%) from which the prefix log L * test is extracted. The traces in L train and L * test are all encoded using the above features.

The Machine Learning Classifier
The trained DT is shown in Figure 3. The DT is relatively small, with only 5 paths and a depth of 5. In spite of this, the DT does not underfit, but discriminates well between positive and negative samples, as shown in Table 7 of Section 5. By looking at the DT, there are 3 paths leading to a positive outcome. The most likely one, node #0 → node #8, has maximal purity (1), and 85.2% of the positive samples follow this path (460 out of 540). Path node #0 → node #1 → node #3 → node #7 also has maximal purity, although, in this case, only 6.9% of the positive samples follow it (37 out of 540). Finally, only 6.7% of the positive samples follow the least likely path node #0 → node #1 → node #3 → node #4 → node #5, which also has a high entropy (0.874).
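Such positive paths, with their purity and share of positive samples, can be enumerated from a scikit-learn decision tree roughly as follows (an illustrative sketch, not the paper's code; it works whether `tree_.value` stores per-class counts or proportions):

```python
from sklearn.tree import DecisionTreeClassifier

# Enumerate the majority-positive root-to-leaf paths of a trained DT,
# returning (node path, leaf purity, share of all positive samples).
def positive_paths(clf, pos_class=1):
    t, pos_idx = clf.tree_, list(clf.classes_).index(pos_class)

    def pos_count(node):  # positive samples reaching the node
        v = t.value[node][0]
        return v[pos_idx] / v.sum() * t.n_node_samples[node]

    total_pos = pos_count(0)
    stack, out = [(0, [0])], []
    while stack:
        node, path = stack.pop()
        if t.children_left[node] == -1:          # leaf node
            v = t.value[node][0]
            purity = v[pos_idx] / v.sum()
            if purity >= 0.5:                    # majority-positive leaf
                out.append((path, purity, pos_count(node) / total_pos))
        else:
            stack.append((t.children_left[node], path + [t.children_left[node]]))
            stack.append((t.children_right[node], path + [t.children_right[node]]))
    return out
```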

Recommendation Generation
The recommendations are generated using the recommendation score defined in Eq. (1). The hyperparameters λ 1 , λ 2 , λ 3 that weight fitness, purity and positive sample probability in the definition of the recommendation score are found through grid search and their optimized values are, in our example, 0.4, 0.4 and 0.2. This means that fitness and purity have higher importance with respect to the positive sample probability.
We now show some examples of recommendations generated starting from a given prefix and the above DT. For prefix σ15 = ⟨ER Sepsis Triage, ER Registration, ER Triage, CRP, LacticAcid, Leucocytes, IV Antibiotics, IV Liquid, Admission NC, CRP, Leucocytes, Admission NC, CRP, Leucocytes, Release B⟩, the positive path in the DT matching the prefix with the highest recommendation score is node #0 → node #1 → node #3 → node #7. This path has a higher recommendation score with respect to the other positive paths, as its fitness and purity values are equal to 1. The high value of the fitness function is due to the full compliance between the feature values in the encoding of σ15 and the ones in the path. The path having the highest positive sample probability (node #0 → node #8) has, instead, a lower fitness value (0.5) that leads to a lower recommendation score. The third path has a lower fitness value (0.875), lower purity and lower positive sample probability. Since constraint existence(Admission NC) is already satisfied in prefix σ15, the generated recommendations are {⟨existence(Release A), It should not be SATISFIED⟩, ⟨exactly(Release B), It should not be VIOLATED⟩}. The first recommendation has a higher priority with respect to the second one. It suggests that, in order to avoid intensive care, in case of hospitalization, activity Release A should not be performed. The second recommendation states, instead, that activity Release B has to be performed exactly once and, since this activity has already been performed in σ15, it should not be performed again.
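The generation rules of Table 4 can be sketched as a small lookup. This is a plausible reading of the table, not its verbatim content: val is the satisfaction value learned in the path, rv the RV state of the constraint in the prefix ("sat"/"viol" permanent, "temp_sat"/"temp_viol" temporary), and the wording of the conditions is illustrative:

```python
# One recommendation per (constraint, learned value, RV state) triple:
# None means either "already compliant" or "cannot be fixed anymore".
def recommend(phi, val, rv):
    if (val, rv) in {(True, "sat"), (False, "viol")}:
        return None  # full compliance: no prescription needed (rows 2, 6)
    if (val, rv) in {(True, "viol"), (False, "sat")}:
        return None  # permanent contradiction: unfixable (rows 1, 5)
    table = {(True, "temp_viol"): "It should be SATISFIED",
             (True, "temp_sat"): "It should not be VIOLATED",
             (False, "temp_sat"): "It should be VIOLATED",
             (False, "temp_viol"): "It should not be SATISFIED"}
    return (phi, table[(val, rv)])  # rows 3-4 and 7-8
```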
As a second example, we consider prefix σ5 = ⟨IV Liquid, ER Registration, ER Triage, ER Sepsis Triage, IV Antibiotics⟩. In this case, the path with the highest recommendation score in the DT is node #0 → node #8. This path has high purity and positive sample probability (1 and 85.2%, respectively) that counterbalance a modest fitness value (0.5). This fitness value is due to the low similarity of prefix σ5 with the path (constraint existence(Release A) is satisfied in the path but possibly violated in σ5). The other positive paths have, instead, a higher fitness (0.67 for node #0 → node #1 → node #3 → node #7, and 0.875 for node #0 → node #1 → node #3 → node #4 → node #5) that, however, is not sufficient to counterbalance their low positive sample probability in the recommendation score. Therefore, in this case, only one recommendation is provided ({⟨existence(Release A), It should be SATISFIED⟩}), indicating that, for a Sepsis case with a clinical history similar to σ5, in order to avoid intensive care, activity Release A should occur at least once.

Evaluation
To assess the validity of our proposal, we need to answer the following research questions:

RQ1. Are the recommendations extracted from a classifier trained on declare constraints effective for achieving a desired outcome in a business process execution?

RQ2. Are there statistical differences in using different families of declare constraints for extracting effective recommendations for a business process?
To answer these research questions, on the one hand, we check that the adoption of the recommendation set R leads to a positive outcome for a given trace of a business process. On the other hand, we also show that, if the recommendations are not followed, the process tends to achieve a negative outcome.
In particular, we developed an experiment protocol, applied to each event log L in a pool of datasets, to test the proposed Outcome-Oriented Prescriptive Process Monitoring system.

Datasets
As a pool of datasets, we adopt the one used in [4] as a benchmark for Outcome-Oriented Predictive Process Monitoring. Such well-known and standard datasets allow us to have a robust and significant evaluation of our Prescriptive Process Monitoring system. Following [4], we used eight real-life event logs publicly available in the 4TU Centre for Research Data and discarded the private Insurance dataset (since it is not publicly available). In most of the datasets, several labeling functions y have been applied, i.e., different desired outcomes are specified in each dataset. These labelings on the eight initial event logs lead to 22 different prescriptive tasks and datasets. We now provide more details on the original logs, the used labeling functions and the resulting Prescriptive Process Monitoring tasks.

BPIC 2011. This event log was originally published in relation to the Business Process Intelligence Challenge (BPIC) that took place in 2011. It refers to cases from the Gynaecology department of a Dutch Academic Hospital. Each case records procedures and treatments (stored as activities) applied to a given patient. There are four different labeling functions based on four LTL formulas [13], that is, the class label for a case σ is defined according to the satisfaction of an LTL formula ϕ in each trace σ. The four LTL rules used are the following:
• bpic2011 1: ϕ = F(tumor marker CA-19.9) ∨ F(ca-125 using meia);
• bpic2011 2: ϕ = G(CEA-tumor marker using meia → F(squamous cell carcinoma using meia));
• bpic2011 3: ϕ = (¬histological examination-biopsies nno) U (squamous cell carcinoma using meia);
• bpic2011 4: ϕ = F(histological examination-big resectiep).
For example, the labeling for bpic2011 1 expresses the fact that at least one of the activities tumor marker CA-19.9 or ca-125 using meia must eventually happen during a case. It is trivial to see that, when one of these events occurs, the class label y(σ) becomes known. Therefore, the evaluation step would be biased by this phenomenon. To solve this issue, all the cases have been cut exactly before the occurrence of one of these events. The same cut is performed exactly before the occurrence of histological examination-biopsies nno in bpic2011 3 and before histological examination-big resectiep in bpic2011 4. Regarding bpic2011 2, no cut is necessary, as it is never possible to infer the class label before the end of the case. Indeed, the class label is true if and only if every occurrence of CEA-tumor marker using meia is eventually followed by squamous cell carcinoma using meia, and this constraint is never permanently satisfied or violated before the end of the case.
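The labeling and cutting for an "eventually" formula F(a) can be sketched as (an illustrative helper, not the benchmark's preprocessing code):

```python
# Label a completed trace with respect to F(target): the label is 1 iff
# the target activity occurs, and the trace is cut just before its first
# occurrence so that the label cannot be read off the remaining prefix.
def label_and_cut(trace, target):
    if target in trace:
        return 1, trace[:trace.index(target)]
    return 0, trace
```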
BPIC 2012. This event log refers to the execution history of a loan application process in a Dutch Financial Institution. Each case stores the events related to a particular loan application. The available labelings are based on the final outcome of a loan application, i.e., on whether the application is accepted, rejected, or canceled. This is a multi-class classification problem, but, as in [4], the labelings are considered as three separate binary classification tasks. In the experiments, these tasks are referred to as bpic2012 accepted, bpic2012 cancelled, and bpic2012 refused.
BPIC 2015. This event log refers to the application process for building permits in 5 Dutch Municipalities. Each log comes from a single Municipality and is taken as a single dataset with its own labeling function, defined similarly to BPIC 2011, that is, according to the satisfaction/violation of an LTL formula ϕ. Each dataset is denoted as bpic2015 i, where i = 1 . . . 5 indicates the number of the Municipality. The adopted labeling function is:
• bpic2015 i: ϕ = G(send confirmation receipt → F(retrieve missing data)).
Similarly to bpic2011 2, no trace cutting has been performed as the satisfaction/violation of ϕ can be evaluated only at the completion of the case.
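Checking a labeling of the form G(a → F(b)), as used for bpic2011 2 and bpic2015 i, amounts to verifying that every occurrence of a is eventually followed by b, and this can be decided only on the completed trace (an illustrative helper):

```python
# Label for G(a -> F(b)): 1 iff every occurrence of activity a is
# eventually followed by an occurrence of activity b in the trace.
def response_label(trace, a, b):
    pending = False            # an occurrence of a still awaiting a b
    for event in trace:
        if event == a:
            pending = True
        if event == b:
            pending = False    # any later b discharges all earlier a's
    return 0 if pending else 1
```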
BPIC 2017. This event log originates from the same Financial Institution as bpic2012, but with an improvement of the data collection process, resulting in a richer and cleaner dataset. As for bpic2012, the event cases record execution traces of a loan application process and three separate labelings based on the outcome of the application are applied, i.e., bpic2017 accepted, bpic2017 cancelled, and bpic2017 refused.
Hospital billing. This dataset contains cases regarding a billing procedure for medical services. The cases come from the ERP system of a Hospital and the labelings for this log are:
• hospital 1: the billing procedure is not eventually closed;
• hospital 2: the billing procedure is reopened.
Production. This event log contains cases of a manufacturing process. Each case stores information about the activities, workers and/or machines involved in the production process of an item. The labeling is based on whether, in a case, there are rejected work orders, or not.
Sepsis cases. This dataset records hospitalizations of patients with symptoms of the life-threatening Sepsis condition in a Dutch Hospital. Each case stores events from the patient's registration in the Emergency Room (ER registration) to the discharge from the Hospital. Laboratory tests together with their results are also recorded as events. The reasons for the discharge are available in an anonymized format. Three different labelings for this log are available:
• sepsis 1: the patient returns to the Emergency Room within 28 days from the discharge;
• sepsis 2: the patient is (eventually) admitted to intensive care;
• sepsis 3: the patient is discharged from the Hospital for a reason different from Release A (i.e., the most common release type).
Traffic fines. This event log comes from the ERP of an Italian local Police Force. The events in the log refer to the notifications sent about a fine and the (partial) repayments. Additional case/event attributes include, for instance, the reason, the total amount, and the amount of repayments for each fine. The available labeling is based on whether the fine is repaid in full or is sent for credit collection.

The adopted 22 datasets exhibit different characteristics, shown in Table 5. The production log is the smallest one, with 220 cases, while the traffic log is the largest, with 129,615 cases. The datasets with the longest cases are the bpic2011 datasets, where the longest case has 1814 events. On the other hand, the traffic log contains the shortest cases (their length varies from 2 to 20 events). The class labels are most imbalanced in the hospital billing 2 dataset, where only 5% of cases are labeled as positive (class label = 1). Conversely, in the bpic2012 accepted, bpic2017 cancelled and traffic datasets, the classes are balanced. Concerning the event classes, traffic fines 1 has the lowest number of distinct activity names (10). On the other hand, the logs with the highest number of event classes are the bpic2015 logs, containing up to 396 event classes.

Offline Evaluation of a Prescriptive Process Monitoring System
One of the main challenges when evaluating Prescriptive Process Monitoring systems is the lack of adoption of those systems by real users [14]. One of the possibilities when testing these systems is, therefore, to resort to an offline evaluation based on a "what-if" simulation [14], in order to evaluate the effectiveness of the set of recommendations R for a prefix σ_k. The idea is to evaluate the consequences of (not) following the recommendations R at step k on the whole trace σ. We hence evaluate the effectiveness of R for a prefix σ_k by checking whether the recommendations in R have been followed in σ, and by comparing the expected outcome of σ with its actual label y(σ). We expect that, if the recommendations are followed, the outcome will be positive, and that, if they are not followed, the outcome will be negative. Let p* be the path of the DT from which the set R has been computed. A high similarity between σ and p* means that the recommendations have been followed by the execution σ, and hence we expect a positive outcome. The prediction related to trace σ will hence be classified as a true positive (TP) if y(σ) = 1 or as a false positive (FP) if y(σ) = 0. Symmetrically, if there is no similarity between σ and p*, the recommendations have not been followed by σ, and we expect a negative outcome. The prediction related to trace σ will hence be classified as a true negative (TN) if y(σ) = 0 or as a false negative (FN) if y(σ) = 1.
The similarity between σ and p* is computed by leveraging F(σ, p*) (Eq. (2)). Differently from the general formula, however, in this case the compliance function C is applied to the whole trace and, therefore, does not need to take into account temporary violations/satisfactions of the declare constraints in p*. Specifically, a fitness threshold th_fit is used to evaluate the similarity between the whole trace σ and the path p*: if F(σ, p*) is higher than or equal to th_fit, the recommendations in R are considered followed; a similarity lower than th_fit means that the trace did not follow the recommendations. Adopting a fitness threshold is necessary, as requiring a similarity of exactly 1 between σ and p* could be too restrictive and lead to a high number of false negatives. This is in line with a realistic situation in which some of the recommendations are not followed by a process manager because they are not strictly necessary for the positive outcome of the process. In our experiments, the optimal fitness thresholds have been selected via grid search. Table 6 summarizes the confusion matrix entries.
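The confusion-matrix assignment described above can be sketched as follows (names are illustrative; the threshold default is one of the grid-search values):

```python
# "What-if" offline evaluation: a completed trace counts as having
# followed the recommendations when its fitness with the selected path p*
# reaches the threshold; the confusion-matrix entry then depends on the
# actual label y of the trace.
def confusion_entry(fitness, y, th_fit=0.75):
    followed = fitness >= th_fit
    if followed:
        return "TP" if y == 1 else "FP"
    return "TN" if y == 0 else "FN"
```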
To assess the accuracy of our approach, we compute precision, recall and F-score as follows:

prec = TP / (TP + FP);   rec = TP / (TP + FN);   F-score = 2 · prec · rec / (prec + rec).
We use the F-score rather than accuracy, as many of the datasets used in the evaluation are imbalanced towards the negative class, and the accuracy could be biased by the true negatives, leading to non-reliable results.
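For completeness, the metrics can be computed directly from the confusion-matrix counts (zero denominators are not handled in this sketch):

```python
# Precision, recall and F-score from confusion-matrix counts.
def f_score(tp, fp, fn):
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)
```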

Experimental Setup
In this section, we provide some details about the experimental setup. All the experiments were carried out using Python 3.6, the Declare4Py library [15] (for the declare encoding of the traces) and version 0.24 of the scikit-learn library [16] (for building and querying the classifiers).
Preprocessing. Mimicking real-life situations, in which the prediction model is trained on historical data and the recommendations are produced for ongoing cases, the event logs have been first chronologically ordered and then split into training and test sets. Specifically, the cases in the event logs have been ordered according to their start time, and the first 80% (i.e., all cases that started before a given date) has been used for the construction of the training and validation logs, while the remaining 20% has been used to create the test event log L*test. Since the last cases of the training and validation log could still be incomplete when the test period starts, we removed from these cases the events overlapping with the test period, as in [4]. The training and validation event logs are then obtained so that the first 70% of the whole event log (L_train) is used for training the prediction model, while about 10% of the event log (L_val) is used for the optimization of the hyperparameters.
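The temporal split can be sketched as follows (a simplified version: the removal of events overlapping the test period is omitted, and cases are assumed to be (start_time, trace) pairs):

```python
# Chronological 70/10/20 split (train/validation/test) by case start time.
def chrono_split(cases):
    cases = sorted(cases, key=lambda c: c[0])  # order by start time
    n = len(cases)
    n_train, n_val = int(n * 0.7), int(n * 0.1)
    train = cases[:n_train]
    val = cases[n_train:n_train + n_val]
    test = cases[n_train + n_val:]
    return train, val, test
```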

ML Classifier Training
The training of the DT has been performed with a grid search to tune the hyperparameters, with 5-fold cross-validation on L_train. The ranges of values used for the hyperparameters are: i) the Gini index or the entropy criterion for the computation of the impurity; ii) [4, 6, 8, 10, ∞] for the maximum depth of the DT; iii) the use (or not) of class weights during the training, to avoid poor performance due to the imbalance of the datasets (see, for example, hospital billing 2 in Table 5); iv) [0.1, 0.2, 0.3, 2] for the minimum number of samples required to split an internal node (float values indicate a percentage of the training data); v) [1, 10, 16] for the minimum number of samples required to consider a node a leaf; vi) the number of most informative features to use in the feature selection phase, i.e., 50%, 30% and the square root of the total number of initial features (after ranking them by using the mutual information score).
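Assuming scikit-learn, the grid over ranges i)-v) can be set up as below (a sketch: the feature selection of point vi) is omitted, ∞ is expressed as None, and the parameter names are scikit-learn's):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# 5-fold cross-validated grid search over the DT hyperparameters.
param_grid = {
    "criterion": ["gini", "entropy"],            # i) impurity measure
    "max_depth": [4, 6, 8, 10, None],            # ii) None = unbounded
    "class_weight": [None, "balanced"],          # iii) imbalance handling
    "min_samples_split": [0.1, 0.2, 0.3, 2],     # iv) floats = fractions
    "min_samples_leaf": [1, 10, 16],             # v)
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5, scoring="f1")
```

`search.fit(X_train, y_train)` then exposes the tuned tree as `search.best_estimator_`.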
Recommendation Generation. Using the trained DT, the λ parameters in Eq. (1) have been optimized through grid search on the prefix log L*val extracted from the validation log L_val. The set P+ in the same equation has been filtered to contain only paths with at least 3 training samples.
Evaluation. The values k for the prefix lengths range from 1 to a maximum that changes according to the dataset. We adopted the same criteria used in [4] for the maximum value: 9 for the traffic fines dataset, the minimum between 20 and the 90th percentile of the case lengths for the bpic2017 datasets, and the minimum between 40 and the 90th percentile of the case lengths for the other datasets. This choice is due to the low number of long cases (beyond the 90th percentile) in the prefix test logs, which could produce results with no statistical significance for high values of k. The optimal fitness threshold th_fit has been found by applying grid search on L*val using the values 0.55, 0.65, 0.75 and 0.85. These values have been chosen considering that values outside this range could bias the system towards a low precision or a low recall.
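The per-dataset cap on k can be written compactly (the dataset name keys are illustrative):

```python
import numpy as np

# Maximum prefix length per dataset, following the criteria of [4]:
# lengths is the list of case lengths in the log.
def max_prefix_length(dataset, lengths):
    p90 = float(np.percentile(lengths, 90))
    if dataset == "traffic_fines":
        return 9
    if dataset.startswith("bpic2017"):
        return min(20, p90)
    return min(40, p90)
```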

Results
Since the proposed approach leverages a DT trained on L_train to provide recommendations, we first inspect the performance of the DT in classifying the outcome of the (complete) traces in L_val and L_train as positive or negative. Table 7 shows the average F-score of the DT on L_val and L_train. We notice that the majority of the classifiers have good performance on both the training and validation folds, with an absence (or a low degree) of overfitting. However, the hospital billing 2 and the sepsis cases 1 datasets present low performance on both L_val and L_train. This underfitting is due to the insufficient information carried by the adopted encoding. We checked this aspect by inspecting the traces of these datasets with the Disco tool. We noticed that both positively and negatively labeled traces present a very similar control flow. Therefore, any ML classifier taking as input traces encoded with our declare-based encoding does not have sufficient information for discriminating between positive and negative samples. As future work, we aim at enriching our encoding with information regarding the data payloads attached to events and their execution times to overcome this issue. In general, no significant difference between the families of declare constraints used for trace encoding is found in these results. Only the sepsis cases 2 dataset benefits from the higher number of features provided by the A family.

We now discuss the results related to the returned recommendations. Figures 4 and 5 show the trend of the F-score for different prefix lengths k computed on each prefix σ_k in L*test. Both figures show cumulative results, that is, the TP, TN, FP and FN at a given prefix length k are summed to the corresponding ones at each prefix length j ≤ k. This is done to avoid that the results are influenced by the small number of traces that reach the longest prefix lengths [4].
The average over all the prefixes is reported in Table 8. The results are positive in general, that is, the proposed Prescriptive Process Monitoring system returns recommendations that guarantee a positive outcome in a trace. When the recommendations are not followed, instead, the corresponding traces have, in most of the cases, a negative outcome (RQ1). For some datasets, our system reaches an F-score of 100% along all prefixes. This is due to the encoding with declare patterns, which creates a semantically rich feature space that allows a crisp discrimination between regions containing only positive and regions containing only negative samples. In addition, the resulting DTs have a depth of at most 4, with a consequently lower number of temporal relations to satisfy. Therefore, the fitness and the overall performance of the system increase.
Although, in most of the cases, the high discriminating power of the temporal constraints on the control flow obtained with the declare encoding guarantees accurate results, for some logs the declare encoding does not achieve such a good performance. The production dataset, for instance, has a DT with depth 8 containing several paths. In this situation, finding the best path is harder, the fitness score that can be achieved is lower and, as a consequence, the overall performance of the Prescriptive Process Monitoring system decreases. Moreover, our system performs poorly on the sepsis cases 1 dataset due to the poor performance obtained by the DT (see Table 7).
We stress the fact that other encodings, like the ones used in [4], are based on a fine-grained vectorization of the log traces that minimizes the loss of information in each trace. For example, the well-known index encoding [17] assigns to position i of the feature vector the activity name of the event occurring at position i in the trace. Therefore, the inference of high-level relations in the control flow of a trace, such as response(A, B) or existence3(A), is left to the ML system. In this case, the semantics inference is limited by the expressive capabilities of the ML system being used. The declare encoding, on the other hand, is less fine-grained, as it abstracts the temporal order of the events with the constraint families. Since these relations are explicitly defined, the ML system only needs to infer the correlation between such temporal patterns and the trace labeling. For this reason, using the declare encoding, even simple ML models (like DTs) can easily capture those correlations, thus improving their performance.
Since the use of declare patterns as features in Predictive and Prescriptive Process Monitoring is in its early stages of adoption [18], we are interested in studying the effect of the different families of constraints on the quality of the recommendations. Figure 6 is derived from Table 8 and shows the critical difference diagram of the declare families obtained by using the Nemenyi test with a significance level of 0.05 (as proposed in [19]). The diagram reports the average ranking of each family according to the F-score results in Table 8. Groups of families whose difference is not statistically significant (at p < 0.05) are connected. We can observe that the PR and A families obtained the best results for the majority of the datasets. The results obtained with these two families are not significantly different. The C and E families also have similar results, while both of them perform worse than PR (and this difference is statistically significant).
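The average rankings underlying the critical difference diagram can be computed as follows (a sketch: rank 1 is the best family on a dataset; the Nemenyi post-hoc test itself is not included):

```python
import numpy as np
from scipy.stats import rankdata

# Average rank of each constraint family across datasets;
# scores maps family name -> list of F-scores (one per dataset).
def average_ranks(scores):
    fams = sorted(scores)
    mat = np.array([scores[f] for f in fams])       # families x datasets
    per_dataset = np.array([rankdata(-mat[:, d])    # rank 1 = highest F-score
                            for d in range(mat.shape[1])])
    return dict(zip(fams, per_dataset.mean(axis=0)))
```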
The lower performance of these families is due to the limited expressivity of their constraints. The performance of the NR family is close to the performance of C and E, even though NR contains relation constraints. This is due to the lower discriminating power of the constraints in NR, which negate the occurrence of a target activity (when the activation occurs) rather than explicitly constraining the occurrence of a specific target activity (when the activation occurs), as the PR constraints do. Positive relation (and existence) constraints seem hence to contribute most to the good performance of our Prescriptive Process Monitoring system (RQ2).

Limitations
The main limitation of our work relates to the fact that our Prescriptive Process Monitoring system has been evaluated in an offline scenario. In particular, we evaluated our system by performing a "what-if" analysis. We tried to mitigate this limitation by running our experiments on several different real-life datasets. However, for a final deployment in a real organization, our system would require further evaluations with real users employing the system in their daily work. This requires a user-friendly Graphical User Interface (GUI) that allows users to interact with the system. To this aim, in the future, we plan to embed our recommendation system in the Nirdizati tool [20], an open-source web-based Predictive Process Monitoring engine.
Moreover, in real scenarios, it could happen that users are not familiar with declare constraints. Therefore, a human-understandable rendering of the prescriptions could be more effective. This could be achieved with the use of Natural Language Generation techniques for persuasive messages [21], where recommendations are passed as input. The resulting persuasive natural language sentences contain an effective description of the constraints to satisfy, their importance for the achievement of a positive outcome of the process and some explanations to motivate them.

Related Work
Prescriptive Process Monitoring methods can be categorized according to whether the recommended interventions focus on the control flow, on the resources, or on other perspectives [22].
The interventions involving the control flow usually prescribe a set of activities to perform next [23,24,25,26,27]. The next best activity can be prescribed in different domains and to different users, e.g., to employment companies to help customers in finding the most suitable job [25], to business analysts to improve the execution time, the customer satisfaction or the service quality of a process [23], or to doctors in order to identify the most appropriate treatment based on the conditions of a patient [27].
Another group of prescriptions focuses on the resource perspective [28,29], e.g., which resource should perform the next activity. Also in this case, prescriptions can be applied to different domains. For example, in [29], prescriptions are related to which police officer is best suited for the next task based on their predicted performance in a driving license application process. In [28], recommendations on the repairs to carry out are provided to mechanics to guarantee that they complete their work within a predefined time.
Few works prescribe interventions regarding both control flow and resources [30,31,32]. For instance, in [30], an intervention to make an offer to a client together with the specific clerk that is the most suitable one to carry out the task are prescribed. In [31], the next activity and the specialist that should perform it are recommended to resolve open tickets in an IT service management process.
Finally, a last group of works focuses on other types of interventions. For instance, in [1,2], the authors propose a method in which a cost model is used to control the creation of alerts in order to reduce the projected cost for a particular event log. In [33], the trade-off between the earliness and the accuracy of the predictions for proactive process adaptation is discussed. The approach presented in [34] uses online reinforcement learning to learn when to initiate proactive process adjustments based on forecasts and their run-time dependability. In [30], the authors tackle the problem of recommending interventions for avoiding an undesired outcome when a limited amount of resources is available.
The approach proposed in our work focuses on control flow but, differently from existing works, it does not merely prescribe a sequence of activities to perform next but, rather, a set of temporal constraints that have to be satisfied. Temporal relations between activities provide more sophisticated and flexible recommendations (since they do not require the mandatory execution of a certain activity at a given point in time).

Conclusion
The proposed Outcome-Oriented Prescriptive Process Monitoring approach aims at providing recommendations that maximize the likelihood of a positive process outcome. Differently from state-of-the-art works, the proposed approach does not recommend specific activities to be executed, but temporal properties between activities that need to be preserved or violated. This type of recommendation avoids forcing the execution of specific activities during the process execution, thus providing more flexibility in the way the process can be executed. The approach has been evaluated on a pool of 22 real-life event logs already used as a benchmark for Predictive Process Monitoring in [4]. For most of the datasets, we achieved a good accuracy.
One of the limitations of the current Prescriptive Process Monitoring system is that the encoding we use is based on declare patterns, i.e., on pure control-flow features. These features do not take into account the data payload, such as, for instance, the used resources. The data payload can be injected into the encoding in a principled way, as the declare language has already been extended to include data conditions [35]. This would improve the system performance in those cases in which the temporal constraints between activities alone are not sufficient for an effective outcome-based discrimination of positive and negative cases, such as in the hospital billing 2 and the sepsis cases 1 datasets (see Section 5.4). Future work will focus on this research direction. Another avenue for future work is to extend our current approach to other types of prediction tasks, like, for example, regression-based ones. Finally, in the future, we plan to extend our current evaluation, which is mainly based on a "what-if" analysis, by deploying the proposed approach in an organization environment and by performing experiments with real users.