Group-Based Privacy Preservation Techniques for Process Mining

Process mining techniques help to improve processes using event data. Such data are widely available in information systems. However, they often contain highly sensitive information. For example, healthcare information systems record event data that can be utilized by process mining techniques to improve the treatment process, reduce patients' waiting times, improve resource productivity, etc. However, the recorded event data include highly sensitive information related to treatment activities. Responsible process mining should provide insights about the underlying processes, yet, at the same time, it should not reveal sensitive information. In this paper, we discuss the challenges regarding directly applying existing well-known group-based privacy preservation techniques, e.g., k-anonymity, l-diversity, etc., to event data. We provide formal definitions of attack models and introduce an effective group-based privacy preservation technique for process mining. Our technique covers the main perspectives of process mining including control-flow, time, case, and organizational perspectives. The proposed technique provides interpretable and adjustable parameters to handle different privacy aspects. We employ real-life event data and evaluate both data utility and result utility to show the effectiveness of the privacy preservation technique. We also compare this approach with other group-based approaches for privacy-preserving event data publishing.


Introduction
Process mining employs event data to discover, analyze, and improve the real processes [1]. Indeed, it provides fact-based insights into the actual processes using event logs. There are many algorithms and techniques in the field of process mining. However, the three basic types of process mining are (1) process discovery, where the goal is to learn real process models from event logs, (2) conformance checking, where the aim is to find commonalities and discordances between a process model and an event log, and (3) process re-engineering (enhancement), where the aim is to extend or improve a process model using different aspects of the available data.

arXiv:2105.11983v1 [cs.DB] 25 May 2021
An event log is a collection of events where each event is described by its attributes [1]. The typical attributes required for the main process mining algorithms are case identifier, activity, timestamp, and resource. The case identifier refers to the entity that the event belongs to, the activity is the process step associated with the event, the timestamp is the time at which the event occurred, and the resource is the performer of the activity. In human-centered processes, case identifiers refer to persons. For example, in a patient treatment process, the case identifiers refer to the patients whose data are recorded. Moreover, the resource attribute often refers to the persons performing activities, e.g., in the healthcare context, the resources refer to the doctors or nurses performing activities for the patients. The event attributes are not limited to the above-mentioned ones, and an event may also carry other case-related attributes, so-called case attributes, e.g., age, salary, disease, etc., which could be considered as sensitive person-specific information. Table 1 shows a sample event log.
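To make the attribute structure concrete, the following sketch groups events into per-case traces ordered by timestamp. The activity, resource, and timestamp values are hypothetical illustrations, not the contents of Table 1:

```python
from collections import defaultdict

# A minimal, hypothetical event log: each event carries the typical
# attributes (case id, activity, timestamp, resource).
events = [
    {"case": 1, "activity": "register",   "time": "08:30", "resource": "employee 4"},
    {"case": 1, "activity": "visit",      "time": "10:00", "resource": "doctor 3"},
    {"case": 2, "activity": "register",   "time": "08:45", "resource": "employee 4"},
    {"case": 1, "activity": "release",    "time": "12:15", "resource": "employee 6"},
    {"case": 2, "activity": "blood test", "time": "09:30", "resource": "nurse 1"},
]

def traces_by_case(events):
    """Group events by case id; each case's events are ordered by timestamp."""
    grouped = defaultdict(list)
    for e in sorted(events, key=lambda e: e["time"]):
        grouped[e["case"]].append((e["activity"], e["resource"]))
    return dict(grouped)

print(traces_by_case(events)[1])
# -> [('register', 'employee 4'), ('visit', 'doctor 3'), ('release', 'employee 6')]
```

The resulting sequence per case is exactly the trace notion used throughout the paper.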
Orthogonal to the three mentioned types of process mining, different perspectives are also defined including control-flow, organizational, case, and time perspective [1]. The control-flow perspective focuses on activities and their order, which are often utilized by process discovery and conformance checking techniques. The organizational perspective focuses on resources and their relations, which are exploited by social network discovery techniques. The case perspective is focused on case-related attributes, and the time perspective is concerned with the time-related information, which can be used for performance and bottleneck analyses.
With respect to the main attributes of events, two different perspectives for privacy in process mining can be considered in human-centered processes: the resource perspective and the case perspective. The resource perspective focuses on the privacy rights of the individuals performing activities, and the case perspective concerns the privacy rights of the individuals whose data are recorded and analyzed. Depending on the context, the relative importance of these perspectives may differ. However, often the case perspective is more critical for privacy than the resource perspective. For example, in the healthcare context, activity performers could be publicly available. However, what happens for a specific patient and her/his personal information should be kept private. In this paper, we focus on the case perspective. In principle, when event logs explicitly or implicitly include personal data, privacy concerns arise that must be taken into account according to regulations such as the European General Data Protection Regulation (GDPR) [40].
In this paper, we describe disclosure risks and linkage attacks against event logs. The attack models are formally defined based on the available event attributes. We discuss the challenges regarding directly applying group-based privacy preservation techniques, e.g., k-anonymity [35], l-diversity [21], etc., to event logs. We extend the work described in [31], where the TLKC-privacy is introduced as an effective group-based privacy preservation technique for process mining. The TLKC-privacy exploits some restrictions regarding the availability of background knowledge in the real world to deal with process mining-specific challenges. This technique is focused on control-flow, time, and case perspectives. TLKC-privacy generalizes several traditional privacy preservation techniques, such as k-anonymity, confidence bounding [41], (α,k)-anonymity [42], and l-diversity.
The extended privacy preservation technique covers all the main perspectives of process mining including control-flow, time, case, and organizational perspectives. It improves the adjustability of the technique by adding new parameters to balance privacy guarantees against the loss of accuracy. Moreover, a new utility measure is defined to tackle the drawbacks of the current approach. To evaluate the extended technique, we employ real-life event logs and evaluate both data utility and result utility. We also compare the extended TLKC-privacy with the main algorithm and other group-based approaches for privacy-preserving event data publishing. Our experiments show that the proposed approach maintains high data and result utility, assuming realistic types of background knowledge. Figure 1 shows a general overview of privacy-related activities in process mining which are discussed in this paper.
The rest of the paper is organized as follows. In Section 2, we explain the motivation and challenges. Section 3 provides preliminaries on event logs and different types of background knowledge. In Section 4, we provide formal models of the attacks. Privacy preservation techniques are discussed in Section 5. In Section 6, the experiments are presented. Section 7 outlines related work, and Section 8 concludes the paper.

Motivation and Challenges
To motivate the necessity to deal with privacy issues in process mining, we describe the disclosure risks using an example in the health-care context. Consider Table 1 as part of an event log recorded by an information system in a hospital. Note that each case has a sequence of events that are ordered based on the timestamps. This sequence of events is called a trace which is a mandatory attribute for a case [1]. For example, case 1, which could be interpreted as patient 1, is first registered by employee 4, then visited by doctor 3, and at the end released from the hospital by employee 6.
Suppose that an adversary knows that a victim patient's data are in the event log (as a case). With little information about some event attributes that belong to the patient, the adversary is able to connect the patient to the corresponding case id, so-called case disclosure [29]. Consequently, two types of sensitive person-specific information are revealed: (1) the complete sequence of events belonging to the case, and (2) sensitive case attributes. (1) and (2) are generally called attribute disclosure. (1) is also called trace disclosure, which is a specific type of attribute disclosure [29]. For example, if the adversary knows that two blood tests were performed for the victim patient, the only matching case is the case with id 2. This attack is called a case linkage attack. After the case re-identification, the sensitive case attributes are disclosed, e.g., the disease of patient 2 is infection. This is called an attribute linkage attack. Moreover, the complete sequence of events performed for patient 2 is disclosed, which contains private information, e.g., the complete sequence of activities performed for the case, the resources who performed the activities for the case, or the exact timestamp of doing a specific activity for the case. We call this attack trace linkage, which is a specific type of attribute linkage attack.
Note that the attribute linkage attack does not necessarily need to be launched after the case linkage, i.e., if more than one case corresponds to the adversary's knowledge while all the matching cases have the same value for the sensitive case attribute(s) or the same sequence of event attributes (e.g., the same sequence of activities), the attribute linkage/trace linkage could happen without a successful case linkage attack. For example, if the adversary knows that the activity visit has been performed by the resource doctor 3 for a victim patient, case 1 and case 6 match this background knowledge. However, they both have the same sequence of activities and resources ⟨(RE, E4), (VI, D3), (RL, E6)⟩. Consequently, the adversary learns the complete sequence of activities and the resources who performed the activities. Several group-based privacy preservation techniques, such as k-anonymity [35], l-diversity [21], and t-closeness [19], have been introduced to deal with similar attacks in the context of relational databases. In such techniques, the data attributes are classified into four main categories: explicit identifiers, quasi-identifiers, sensitive attributes, and non-sensitive attributes. The explicit identifiers are the attributes that can be used to uniquely identify the data owner, e.g., national id. The quasi-identifiers are a set of attributes that could be exploited to uniquely identify the data owner, e.g., {age, gender, zipcode}. The sensitive attributes consist of sensitive person-specific information, e.g., disease or salary, and the non-sensitive attributes contain all the attributes that do not fall into the previous three categories [7]. Assuming that explicit identifiers are suppressed or replaced with dummy identifiers, the group-based privacy preservation techniques aim to perturb potential linkages by generalizing the records into equivalence classes, i.e., groups of records having the same values on the quasi-identifiers.
These techniques are effective for anonymizing relational data. However, they are not easily applicable to event data due to some specific properties of event data.
In process mining, the explicit identifiers (i.e., actual case identifiers) do not need to be stored and processed, and case identifiers are often dummy identifiers, e.g., incremental IDs. As described in the above-mentioned examples, a trace can be considered as a quasi-identifier and, at the same time, as a sensitive attribute. In other words, a complete sequence of events belonging to a case is sensitive person-specific information; at the same time, part of a trace, i.e., only some of the event attributes, can be exploited as a quasi-identifier to launch case linkage and/or attribute linkage attacks.
The quasi-identifier role of traces in process mining causes significant challenges for group-based privacy preservation techniques because of two specific properties of event data: the high variability of traces and the typical Pareto distribution of trace variants. Considering only activity as the main event attribute in a trace, the variability of traces in an event log is high for the following reasons: (1) there could be tens of different activities which can happen in any order, (2) one activity or a group of activities can happen repetitively, and (3) traces can contain any non-zero number of activities, i.e., they have various lengths. Note that this variability becomes even higher when events contain more attributes, e.g., resources. In an event log, trace variants are often distributed similarly to a Pareto distribution, i.e., few trace variants are frequent and many trace variants are unique. Enforcing group-based privacy preservation approaches on such sparse, high-dimensional data is a significant challenge, and often valuable data needs to be suppressed in order to achieve the desired privacy requirements [6].

Preliminaries
In this section, we provide formal definitions for event logs and background knowledge. These formal models will be used in the remainder for describing the attack scenarios and the approach.

Event Log
We first introduce some basic notations. For a given set A, A* is the set of all finite sequences over A, and B(A) is the set of all multisets over A. For A_1, A_2 ∈ B(A), A_1 ⊆ A_2 if A_1(a) ≤ A_2(a) for all a ∈ A. A finite sequence over A of length n is a mapping σ ∈ {1, ..., n} → A, represented as σ = ⟨a_1, a_2, ..., a_n⟩ where a_i = σ(i) for any 1 ≤ i ≤ n. |σ| denotes the length of the sequence. For σ_1, σ_2 ∈ A*, σ_1 ⊑ σ_2 if σ_1 is a subsequence of σ_2, e.g., ⟨a, b, c, x⟩ ⊑ ⟨z, x, a, b, b, c, a, b, c, x⟩. For σ ∈ A*, {a ∈ σ} is the set of elements in σ, and [a ∈ σ] is the multiset of elements in σ, e.g., [a ∈ ⟨x, y, z, x, y⟩] = [x^2, y^2, z]. For x = (a_1, a_2, ..., a_n) ∈ A_1 × A_2 × ... × A_n, π_A_i(x) = a_i is the projection of the tuple x on the element from the domain A_i, 1 ≤ i ≤ n.
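The subsequence relation ⊑ and the multiset notation [a ∈ σ] can be mirrored in a few lines of Python. This is only an illustration of the notation, not part of the paper's formalism:

```python
from collections import Counter

def is_subsequence(s, t):
    """sigma1 ⊑ sigma2: s can be obtained from t by deleting elements
    while preserving the order of the remaining ones."""
    it = iter(t)
    return all(x in it for x in s)  # each x must appear after the previous match

def multiset(seq):
    """[a ∈ sigma]: the multiset of elements of a sequence, as a Counter."""
    return Counter(seq)

# The examples from the text:
assert is_subsequence(list("abcx"), list("zxabbcabcx"))
assert multiset("xyzxy") == Counter({"x": 2, "y": 2, "z": 1})
```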
Definition 1 (Process Instance, Trace). We define P = C × E * × S as the universe of all process instances. C is the universe of case identifiers. E = A × R × T is the universe of main event attributes for process mining where A is the universe of activities, R is the universe of resources, and T is the universe of timestamps. S ⊆ D 1 ∪ ... ∪ D m is the universe of sensitive case attributes where D 1 ,...,D m are the universes of different case attributes, e.g., disease, salary, age, etc. Given a process instance p = (c, σ, s) ∈ P, σ ∈ E * is called the trace attribute of the case c.
Definition 3 (Perspective, Projection). Let P = C × E* × S be the universe of process instances. ps ∈ {A, R, A × R, A × T, R × T, A × R × T} is a perspective which can be used to project traces of an event log EL ⊆ P. For σ = ⟨(a_1, r_1, t_1), ..., (a_n, r_n, t_n)⟩ ∈ E*, such that there exists (c, σ, s) ∈ EL, π_ps(σ) is the projection of the trace on the given perspective, e.g., for ps = A × R, π_ps(σ) = ⟨(a_1, r_1), ..., (a_n, r_n)⟩ is the projection of the trace on the activities and resources. We denote PS = {A, R, A × R, A × T, R × T, A × R × T} as the universe of perspectives.
Definition 4 (Set of Activities/Resources in an Event Log). Let P = C × E* × S be the universe of process instances, and EL ⊆ P be an event log. A_EL = {a ∈ A | ∃_(c,σ,s)∈EL a ∈ π_A(σ)} is the set of activities in the event log, and R_EL = {r ∈ R | ∃_(c,σ,s)∈EL r ∈ π_R(σ)} is the set of resources in the event log.
Definition 5 (Set of Traces/Variants in an Event Log). Let P = C × E* × S be the universe of process instances, EL ⊆ P be an event log, and ps ∈ PS be a perspective. EL_ps = [π_ps(σ) | (c, σ, s) ∈ EL] is the multiset of traces in the event log w.r.t. the given perspective. ĒL_ps = {π_ps(σ) | (c, σ, s) ∈ EL} is the set of variants, i.e., unique traces, w.r.t. the given perspective (the overline distinguishes the set of variants from the multiset of traces EL_ps), e.g., ĒL_A is the set of unique traces w.r.t. the activities.
Definition 6 (Directly Follows Relations). Let EL ⊆ P be an event log, ps ∈ {R, A} be a perspective, ĒL_ps be the set of variants, and EL_ps be the multiset of traces in the event log EL w.r.t. the given perspective ps. DF_EL_ps = {(x, y) ∈ ps × ps | x >_EL_ps y} is the set of directly follows relations w.r.t. the given perspective. x >_EL_ps y iff there exists a trace σ ∈ ĒL_ps and 1 ≤ i < |σ|, s.t., σ(i) = x and σ(i+1) = y. |x >_EL_ps y| = Σ_σ∈ĒL_ps EL_ps(σ) × |{1 ≤ i < |σ| | σ(i) = x ∧ σ(i+1) = y}| is the number of times x is directly followed by y in EL.
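Definition 6 can be computed directly from a multiset of trace variants; the variants below are hypothetical, chosen only to illustrate the counting:

```python
from collections import Counter

def directly_follows(variants):
    """|x > y|: how often activity x is directly followed by y, summed over
    all traces. `variants` maps a variant (tuple of activities) to its
    frequency in the log, i.e., the multiset of traces of Definition 5."""
    counts = Counter()
    for variant, freq in variants.items():
        for x, y in zip(variant, variant[1:]):  # consecutive pairs in the trace
            counts[(x, y)] += freq
    return counts

df = directly_follows({("a", "b", "c"): 3, ("a", "c"): 1})
# DF relations: ('a','b') and ('b','c') occur 3 times each, ('a','c') once
```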

Definition 7 (Variant Frequency).
Let P = C × E* × S be the universe of process instances, and EL ⊆ P be an event log. Given a perspective ps ∈ PS, freq_EL_ps : ĒL_ps → [0, 1] is a function that retrieves the relative frequency of the variants in the event log w.r.t. the given perspective, where ĒL_ps is the set of variants and EL_ps is the multiset of traces. freq_EL_ps(σ) = EL_ps(σ) / |EL_ps|, and Σ_σ∈ĒL_ps freq_EL_ps(σ) = 1.

Table 2 shows the process instance representation of the event log shown in Table 1, where timestamps are represented as "day-hour:minute". In this event log, disease is the attribute which is considered as the sensitive one.
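A minimal sketch of Definition 7, computing relative variant frequencies from a list of (projected) traces; the traces are hypothetical:

```python
from collections import Counter

def variant_frequencies(traces):
    """freq(sigma) = EL_ps(sigma) / |EL_ps| for each variant sigma."""
    counts = Counter(tuple(t) for t in traces)  # multiset of traces
    total = sum(counts.values())                # |EL_ps|
    return {variant: c / total for variant, c in counts.items()}

freqs = variant_frequencies([["a", "b"], ["a", "b"], ["a", "c"], ["a", "b"]])
print(freqs[("a", "b")])  # -> 0.75; the frequencies always sum to 1
```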

Background Knowledge
Regarding the quasi-identifier role of traces, we consider four main types of background knowledge including set, multiset (mult), sequence (seq), and relative time difference (rel ). Using set as the type of background knowledge, we assume that an adversary knows a subset of some event attributes contained in the trace attribute of a victim case. In the multiset type of background knowledge, the assumption is that an adversary knows a subset of some event attributes included in the trace attribute of a victim case as well as the frequency of the elements.
In the sequence type of background knowledge, we suppose that an adversary knows a subsequence of some event attributes included in the trace attribute of a victim case. The exact timestamps of events in an event log impose a high risk regarding the linkage attacks, such that little time-related knowledge may easily single out specific events and, consequently, lead to case re-identification. For performance analysis in process mining, we need the time-related information. However, the timestamps do not necessarily need to be the actual ones. Therefore, we make all the timestamps relative, as defined in Definition 8.
Using relative timestamps does not eliminate time-based attacks, since the time differences are real and can be exploited by an adversary. The relative time difference type of background knowledge is an extension of the sequence type, where the assumption is that an adversary knows a subsequence of some event attributes as well as the relative time differences between the elements. Figure 2 shows the classification of background knowledge based on the types and event attributes. In the following, we provide formal definitions for different categories of background knowledge based on the main event attributes, i.e., activity, resource, and timestamp. Moreover, one can see that there is a relation between type, attribute, and perspective, i.e., a combination of type and attribute can be mapped to a perspective. For example, if type = rel and att = ar, the corresponding perspective is ps = A × R × T, and if type ∈ {set, mult, seq} and att = re, the corresponding perspective is ps = R.
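Making timestamps relative can be sketched as follows; the event layout and minute-based timestamps are illustrative assumptions:

```python
def make_relative(trace):
    """Replace absolute timestamps by offsets from the first event.
    Each event is a (activity, resource, timestamp-in-minutes) triple."""
    if not trace:
        return []
    t0 = trace[0][2]  # timestamp of the first event becomes the origin
    return [(a, r, t - t0) for (a, r, t) in trace]

trace = [("register", "e4", 510), ("visit", "d3", 600), ("release", "e6", 735)]
print(make_relative(trace))
# -> [('register', 'e4', 0), ('visit', 'd3', 90), ('release', 'e6', 225)]
```

Note that the offsets (90 and 225 minutes here) preserve the real time differences, which is exactly why the time-based attacks discussed above remain possible.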
Definition 9 (Background Knowledge Based on Activities). Let EL be an event log, and A_EL be the set of activities in the event log. bk_set,ac(EL) = 2^(A_EL), bk_mult,ac(EL) = B(A_EL), and bk_seq,ac(EL) = A_EL* are the sets of candidates of background knowledge based on the activity attribute of the events for the set, multiset, and sequence types of background knowledge. For example, {a, b, c} ∈ bk_set,ac(EL), [a^2, b] ∈ bk_mult,ac(EL), and ⟨a, b, c⟩ ∈ bk_seq,ac(EL).
Definition 10 (Background Knowledge Based on Resources). Let EL be an event log, and R_EL be the set of resources in the event log. bk_set,re(EL) = 2^(R_EL), bk_mult,re(EL) = B(R_EL), and bk_seq,re(EL) = R_EL* are the sets of candidates of background knowledge based on the resource attribute of the events for the different types of background knowledge.

Definition 11 (Background Knowledge Based on Activities&Resources).
Let EL be an event log, A_EL be the set of activities in the event log, and R_EL be the set of resources in the event log. bk_set,ar(EL) = 2^(A_EL × R_EL), bk_mult,ar(EL) = B(A_EL × R_EL), and bk_seq,ar(EL) = (A_EL × R_EL)* are the sets of candidates of background knowledge based on the activity and resource attributes of the events for the various types of background knowledge.
Definition 12 (Background Knowledge Based on Time Differences Between Relative Timestamps). Let EL be an event log, A_EL be the set of activities in the event log, R_EL be the set of resources in the event log, and T be the universe of timestamps. bk_rel,ac(EL) = (A_EL × T)*, bk_rel,re(EL) = (R_EL × T)*, and bk_rel,ar(EL) = (A_EL × R_EL × T)* are the sets of candidates of background knowledge based on the time differences between relative timestamps. Note that in Definition 12, other attributes are also present. However, our focus is on the time differences between relative timestamps. Therefore, we refer to this category of background knowledge as time-based.

Figure 3 shows our simple scenario of data collection and data publishing. With respect to the types of data holder models introduced in [15], we consider a trusted model. In trusted data holder models, the data holder is trustworthy, and on the data holder's side, only simple anonymization techniques need to be applied, e.g., suppressing real identifiers. However, the data recipient, i.e., a process miner, is not trustworthy and may attempt to identify sensitive information about record owners, i.e., cases. Given a process instance p = (c, σ, s) ∈ P, both σ and s are considered as sensitive person-specific information, and part of the trace σ can be exploited as the quasi-identifier to re-identify the owner of the process instance, i.e., c, and/or to learn the sensitive information belonging to the data owner, i.e., σ and/or s.

Attack Models
In the following, we provide formal definitions and examples for the attack scenarios based on the main event attributes, i.e., activity, resource, and timestamp. Note that the examples are based on the event log shown in Table 2.

Activity-based Attacks
In the activity-based scenarios, we assume that the adversary's knowledge is about the activities performed for a victim case. In the following, we provide formal models based on the introduced types of background knowledge.
-Based on a set of activities (A1): In this scenario, we assume that the adversary knows a subset of activities performed for a case, and this information can lead to the case linkage and/or attribute linkage attacks. Given EL as an event log, we formalize this scenario by a function match_set,ac^EL : 2^(A_EL) → 2^EL. For A ∈ bk_set,ac(EL), match_set,ac^EL(A) = {(c, σ, s) ∈ EL | A ⊆ {a ∈ π_A(σ)}}. For example, if the adversary knows that {VI, IN} is a subset of activities performed for a case, the only matching case is case 4. Therefore, both the sequence of events and the sensitive attribute are disclosed.
-Based on a multiset of activities (A2): In this scenario, we assume that the adversary knows a sub-multiset of activities performed for a case, and this information can result in the linkage attacks. Given EL as an event log, we formalize this scenario as follows. match_mult,ac^EL : B(A_EL) → 2^EL. For A ∈ bk_mult,ac(EL), match_mult,ac^EL(A) = {(c, σ, s) ∈ EL | A ⊆ [a ∈ π_A(σ)]}. For example, if the adversary knows that [HO, BT^2] is a multiset of activities performed for a case, the only matching case is case 2. Consequently, the complete sequence of events and the disease are disclosed.
-Based on a sequence of activities (A3): In this scenario, we assume that the adversary knows a subsequence of activities performed for a case, and this information can lead to the linkage attacks. Given EL as an event log, we formalize this scenario by a function match_seq,ac^EL : A_EL* → 2^EL. For σ ∈ bk_seq,ac(EL), match_seq,ac^EL(σ) = {(c, σ', s) ∈ EL | σ ⊑ π_A(σ')}. For example, if the adversary knows that ⟨RE, VI, HO⟩ is a subsequence of activities performed for a case, case 5 is the only matching case.
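The three activity-based match functions can be sketched together. The mini event log below is hypothetical, but the set example mirrors the {VI, IN} scenario above:

```python
from collections import Counter

def matches(log, bk, kind):
    """Cases whose activity trace matches background knowledge `bk`.
    `log` maps case id -> sequence of activities; `kind` selects the
    set (A1), multiset (A2), or sequence (A3) attack model."""
    def is_subseq(s, t):
        it = iter(t)
        return all(x in it for x in s)
    hits = []
    for case, trace in log.items():
        if kind == "set" and set(bk) <= set(trace):
            hits.append(case)
        elif kind == "mult" and not Counter(bk) - Counter(trace):
            hits.append(case)  # bk is a sub-multiset of the trace's activities
        elif kind == "seq" and is_subseq(bk, trace):
            hits.append(case)
    return hits

# Hypothetical mini log in the spirit of Table 2 (activities only).
log = {1: ["RE", "VI", "RL"], 2: ["RE", "BT", "HO", "BT", "VI"], 4: ["RE", "VI", "IN"]}
print(matches(log, ["VI", "IN"], "set"))   # -> [4]: a unique match enables case linkage
print(matches(log, ["BT", "BT"], "mult"))  # -> [2]
```

A matching set of size one means the case linkage attack succeeds; a larger matching set can still enable attribute linkage when all matches share the same sensitive value.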

Resource-based Attacks
In the resource-based scenarios, we assume that the adversary's knowledge is about the resources who perform activities for a victim case. In the following, we provide formal models based on the main types of background knowledge.
-Based on a set of resources (R1): In this scenario, we assume that the adversary knows a subset of resources involved in performing activities for a victim case, and this information can lead to the case linkage and/or attribute linkage attacks. Given EL as an event log, we formalize this scenario as follows. match_set,re^EL : 2^(R_EL) → 2^EL. For R ∈ bk_set,re(EL), match_set,re^EL(R) = {(c, σ, s) ∈ EL | R ⊆ {r ∈ π_R(σ)}}. For example, if the adversary knows that {E1, D2} is a subset of resources involved in handling a victim case, case 5 is the only matching case. Therefore, both the sequence of events and the sensitive attribute are disclosed.
-Based on a multiset of resources (R2): In this scenario, we assume that the adversary knows a sub-multiset of resources involved in performing activities for a victim case, and this information can lead to the linkage attacks. Given EL as an event log, we formalize this scenario as follows. match_mult,re^EL : B(R_EL) → 2^EL. For R ∈ bk_mult,re(EL), match_mult,re^EL(R) = {(c, σ, s) ∈ EL | R ⊆ [r ∈ π_R(σ)]}. For example, if the adversary knows that [N1^2, E3] is a multiset of resources who performed activities for a victim case, the only matching case is case 2.
-Based on a sequence of resources (R3): In this scenario, we assume that the adversary knows a subsequence of resources who performed activities for a victim case, and this information can result in the linkage attacks. Given EL as an event log, we formalize this scenario by a function match_seq,re^EL : R_EL* → 2^EL. For σ ∈ bk_seq,re(EL), match_seq,re^EL(σ) = {(c, σ', s) ∈ EL | σ ⊑ π_R(σ')}. For example, if the adversary knows that ⟨E4, D2⟩ is a subsequence of resources who performed activities for a victim case, the only matching case is case 4.

Activity&Resource-based Attacks
In the activity&resource-based scenarios, we assume that the adversary's knowledge is about activities and the corresponding resources who perform activities for a victim case. In the following, we provide formal models based on the main types of background knowledge.
-Based on a set of (activity,resource) pairs (AR1): In this scenario, we assume that the adversary knows a subset of (activity,resource) pairs included in the trace attribute of a victim case, and this information can result in the case linkage and/or attribute linkage attacks. Given EL as an event log, we formalize this scenario as follows. match_set,ar^EL : 2^(A_EL × R_EL) → 2^EL. For AR ∈ bk_set,ar(EL), match_set,ar^EL(AR) = {(c, σ, s) ∈ EL | AR ⊆ {(a, r) ∈ π_A×R(σ)}}. For example, if the adversary knows that {(HO, E6)} is a subset of (activity,resource) pairs contained in the trace attribute of a victim case, case 5 is the only matching case, which results in the disclosure of the whole sequence of events and the sensitive attribute.
-Based on a multiset of (activity,resource) pairs (AR2): In this scenario, we assume that the adversary knows a sub-multiset of (activity,resource) pairs included in the trace attribute of a victim case. Given EL as an event log, the scenario can be formalized as follows. match_mult,ar^EL : B(A_EL × R_EL) → 2^EL. For AR ∈ bk_mult,ar(EL), match_mult,ar^EL(AR) = {(c, σ, s) ∈ EL | AR ⊆ [(a, r) ∈ π_A×R(σ)]}. For example, if the adversary knows that [(BT, N1)^2] is a multiset of (activity,resource) pairs included in the trace attribute of a victim case, the only matching case is case 2.
-Based on a sequence of (activity,resource) pairs (AR3): In this scenario, we assume that the adversary knows a subsequence of (activity,resource) pairs included in the trace attribute of a victim case, and this information can lead to the linkage attacks. Given EL as an event log, we formalize this scenario by a function match_seq,ar^EL : (A_EL × R_EL)* → 2^EL. For σ ∈ bk_seq,ar(EL), match_seq,ar^EL(σ) = {(c, σ', s) ∈ EL | σ ⊑ π_A×R(σ')}. For example, if the adversary knows that ⟨(RE, E4), (VI, D2)⟩ is a subsequence of (activity,resource) pairs included in the trace attribute of a victim case, case 4 is the only matching case.

Time-based Attacks
As we discussed in subsection 3.2, after making the timestamps relative, the time differences are still real and can be exploited by an adversary. In the following, we extend the attacks of the type sequence, i.e., A3, R3, AR3, with the time-related information.
-Based on relative time differences between activities (AT): In this scenario, we assume that the adversary knows a subsequence of activities and also the time differences between the activities. Given EL as an event log, the scenario is formalized as follows. match_rel,ac^EL : (A_EL × T)* → 2^EL. For σ ∈ bk_rel,ac(EL), match_rel,ac^EL(σ) = {(c, σ', s) ∈ EL | σ ⊑ π_A×T(relative(σ'))}. For example, if an adversary's knowledge is ⟨HO, VI⟩, both case 2 and case 3 get matched. However, if the adversary further knows that, for the victim case, the visit was performed in the morning of the next day, the only matching case is case 2.
-Based on relative time differences between resources who performed activities (RT): According to this scenario, the adversary knows a subsequence of resources and the time differences between the resources involved in handling a case. Given EL as an event log, we formalize this scenario by a function match_rel,re^EL : (R_EL × T)* → 2^EL. For σ ∈ bk_rel,re(EL), match_rel,re^EL(σ) = {(c, σ', s) ∈ EL | σ ⊑ π_R×T(relative(σ'))}. For example, if an adversary's knowledge is ⟨E1, E3⟩, both case 2 and case 3 get matched. However, if the adversary further knows that, for the victim case, employee 3 performed hospitalization more than one hour after registration, case 3 is the only matching case.
-Based on relative time differences between (activity,resource) pairs (ART): In this scenario, the assumption is that the adversary knows a subsequence of (activity,resource) pairs and the time differences between these pairs. Given EL as an event log, we formalize this scenario as follows. match_rel,ar^EL : (A_EL × R_EL × T)* → 2^EL. For σ ∈ bk_rel,ar(EL), match_rel,ar^EL(σ) = {(c, σ', s) ∈ EL | σ ⊑ relative(σ')}. For example, case 1 and case 6 have the same sequence of (activity,resource) pairs. However, if the adversary knows that, for a victim case, it took almost four hours to get released by employee 6 after being visited by a doctor, the possible matching cases narrow down to only one case, which is case 6.
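Relative-time matching in the spirit of the ART scenario can be sketched as follows. The trace layout, the minute-based timestamps, and the tolerance parameter `tol` are illustrative assumptions:

```python
def match_rel(log, acts, gaps, tol=0):
    """Cases containing the activity subsequence `acts` such that the time
    difference between consecutive matched events equals the corresponding
    entry of `gaps` (within `tol` minutes).
    `log` maps case id -> [(activity, timestamp-in-minutes), ...]."""
    def occurs(trace, start, k, last_t):
        if k == len(acts):
            return True  # all knowledge elements matched
        for j in range(start, len(trace)):
            a, t = trace[j]
            if a == acts[k] and (k == 0 or abs((t - last_t) - gaps[k - 1]) <= tol):
                if occurs(trace, j + 1, k + 1, t):
                    return True
        return False
    return [c for c, trace in log.items() if occurs(trace, 0, 0, None)]

# Cases 1 and 6 share the same activity sequence; only the time difference
# between VI and RL (about four hours) singles out case 6.
log = {
    1: [("RE", 0), ("VI", 90), ("RL", 200)],
    6: [("RE", 0), ("VI", 60), ("RL", 300)],
}
print(match_rel(log, ["VI", "RL"], [240]))  # -> [6]
```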

Privacy Preservation Techniques
Traditional k-anonymity and its extended privacy preservation techniques assume that an adversary could use all of the quasi-identifier attributes as background knowledge to launch linkage attacks. According to the types of background knowledge introduced in Section 3, this assumption means that the background knowledge of an adversary is bk rel,ar which covers all the information contained in a trace. In the following, we show the results of applying two baseline methods with respect to the aforementioned assumption.

Baseline Methods
In this subsection, we introduce two baseline methods to apply k-anonymity on event logs: Baseline-1 and Baseline-2. Baseline-1 is a naïve k-anonymity approach where we remove all the trace variants occurring fewer than k times. Baseline-2 maps each violating trace variant, i.e., a variant that does not fulfill the desired k-anonymity requirement, to the most similar non-violating subtrace by removing events. In Baseline-2, if there exists no non-violating subtrace, the whole trace variant is removed. Suppose that Table 3 is part of an event log recorded by an information system in a hospital that needs to be published after applying k-anonymity. Note that, for the sake of simplicity, the time differences between relative timestamps are represented by integers. Since all the traces in this event log are unique, if we apply k-anonymity with any value of k greater than 1 using Baseline-1, all the traces are removed. If we apply Baseline-2 with k = 2, the result is the event log shown in Table 4. One can see that, for such a weak privacy requirement, 12 events are removed. If we use k = 4, Table 5 shows the result, where 18 events are removed, which is more than half of the events.
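The two baselines can be sketched as follows. Note that this Baseline-2 sketch only searches non-violating prefixes, a simplification of mapping each violating variant to its most similar non-violating subtrace:

```python
from collections import Counter

def baseline1(traces, k):
    """Baseline-1: drop every trace whose variant occurs fewer than k times."""
    counts = Counter(tuple(t) for t in traces)
    return [t for t in traces if counts[tuple(t)] >= k]

def baseline2(traces, k):
    """Baseline-2 (simplified): replace each violating trace by its longest
    prefix that already occurs as a non-violating variant; drop the trace if
    no such prefix exists. The actual method considers arbitrary subtraces."""
    counts = Counter(tuple(t) for t in traces)
    ok = {v for v, c in counts.items() if c >= k}  # non-violating variants
    out = []
    for t in traces:
        v = tuple(t)
        if v in ok:
            out.append(list(v))
            continue
        for n in range(len(v) - 1, 0, -1):  # longest surviving prefix first
            if v[:n] in ok:
                out.append(list(v[:n]))
                break
    return out

log = [["a", "b"], ["a", "b"], ["a", "b", "c"]]
print(baseline1(log, 2))  # -> [['a', 'b'], ['a', 'b']]
print(baseline2(log, 2))  # -> [['a', 'b'], ['a', 'b'], ['a', 'b']]
```

The example shows the trade-off discussed above: Baseline-1 loses the third case entirely, while Baseline-2 keeps it at the cost of removing its last event.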
In [13], the PRETSA method is introduced as a group-based privacy preservation technique for process mining where the authors apply k-anonymity and t-closeness on event data for privacy-aware process discovery. However, PRETSA focuses on the resource perspective of privacy while we focus on the case perspective. The PRETSA method assumes a prefix of activity sequences as the background knowledge, and each violating trace is mapped to the most similar non-violating trace. In [31], PRETSA_case is introduced as a variant of the PRETSA method where only the k-anonymity part is considered, and the focus is on the privacy of cases rather than resources. Therefore, PRETSA_case is a specific type of Baseline-2 where the background knowledge is a specific type of bk_seq,ac, i.e., a prefix of activity sequences rather than any subsequence.

TLKC-Privacy (Extended)
As discussed in [31], it is almost impossible for an adversary to acquire all the information about a target victim, and it requires non-trivial effort to gather each piece of background knowledge. The TLKC-privacy model exploits this limitation and assumes that the adversary's background knowledge is bounded by at most L values of the quasi-identifier, i.e., by the size (power) of the background knowledge. Based on the types of background knowledge illustrated in Figure 2, TLKC-privacy considers all the types, i.e., set, multiset, sequence, and relative. However, it focuses on the activity attribute (ac) and the timestamps which are included in the relative type. In this paper, the technique is extended with the resource attribute, i.e., merely resource (re) and activity along with resource (ar) are also considered. In the following, we bound the power of the different types of background knowledge (Definitions 9-12) with L as the maximal size of candidates.
Definition 13 (Bounded Background Knowledge). Let EL be an event log, type ∈ {set, mult, seq, rel} be the type of background knowledge, att ∈ {ac, re, ar} be the event attribute of background knowledge, and L be the size of background knowledge. bk^L_{type,att}(EL) = {cand ∈ bk_{type,att}(EL) | |cand| ≤ L} are the candidates of the background knowledge whose sizes are bounded by L.
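As an illustration, the bounded candidate sets for the sequence and set types of background knowledge can be enumerated by brute force (a sketch under the assumption that traces are given as tuples of attribute values; function names are ours):

```python
from itertools import combinations

def candidates_seq(trace, L):
    """Sequence-type candidates: order-preserving subsequences of length <= L."""
    out = set()
    for size in range(1, min(L, len(trace)) + 1):
        for idx in combinations(range(len(trace)), size):
            out.add(tuple(trace[i] for i in idx))
    return out

def candidates_set(trace, L):
    """Set-type candidates: subsets of the trace's values of size <= L."""
    values = sorted(set(trace))
    out = set()
    for size in range(1, min(L, len(values)) + 1):
        out |= {frozenset(c) for c in combinations(values, size)}
    return out

def bk_bounded(traces, L, cand_fn):
    """bk^L: union of the bounded candidates over all traces of the log."""
    out = set()
    for t in traces:
        out |= cand_fn(tuple(t), L)
    return out
```

The multiset and relative types follow the same pattern with richer candidate elements (value multiplicities, resp. relative timestamps).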
In the TLKC-privacy model, T ∈ {seconds, minutes, hours, days} refers to the accuracy of timestamps, e.g., T = minutes indicates that the accuracy of timestamps is limited to the minutes level, L refers to the power of the background knowledge, K refers to the k in the k-anonymity definition, and C refers to the bound on the confidence regarding the sensitive attribute values in a matching set. We denote by EL(T) the event log with the accuracy of timestamps at level T. The general idea of TLKC-privacy is to ensure that any background knowledge of size L in EL(T) is shared by at least K cases, and that the confidence of inferring a sensitive value in S is not greater than C.
Definition 14 (TLKC-Privacy). Let EL ⊆ P be an event log, L be the maximal size of background knowledge, T ∈ {seconds, minutes, hours, days} be the accuracy of timestamps, type ∈ {set, mult, seq, rel}, and att ∈ {ac, re, ar}. EL(T) satisfies TLKC-privacy if and only if, for any cand ∈ bk^L_{type,att}(EL(T)): |match_{type,att}(cand)| ≥ K, and Pr(s | cand) = |{p ∈ match_{type,att}(cand) | π_S(p) = s}| / |match_{type,att}(cand)| ≤ C for any s ∈ S, where 0 < C ≤ 1 is a real number as the confidence threshold, and π_S(p) is the projection of the process instance p on the sensitive attribute value.
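A direct (non-optimized) check of these two conditions can be sketched as follows, assuming sequence-type matching, a log given as (trace, sensitive value) pairs, and a precomputed candidate set (all names are ours):

```python
def matches(cand, trace):
    """Sequence-type matching: cand is an order-preserving subsequence of trace."""
    it = iter(trace)
    return all(any(x == y for y in it) for x in cand)

def satisfies_tlkc(log, cands, K, C):
    """log: list of (trace, sensitive_value) pairs; cands: bounded candidates.
    Every candidate must be shared by at least K cases, and no sensitive value
    may be inferable from a matching set with confidence greater than C."""
    for cand in cands:
        group = [s for trace, s in log if matches(cand, trace)]
        if not group:
            continue
        if len(group) < K:
            return False
        if any(group.count(s) / len(group) > C for s in set(group)):
            return False
    return True
```

The other background knowledge types only change the `matches` predicate; the K and C conditions stay the same.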
The TLKC-privacy model provides a major relaxation of traditional k-anonymity based on the reasonable assumption that the adversary has restricted knowledge. It generalizes several privacy preservation techniques, including k-anonymity, confidence bounding, (α, k)-anonymity, and l-diversity. It also provides interpretable parameters. Note that the type and attribute of the background knowledge implicitly determine the perspective (Figure 2).

Privacy Measure
In this subsection, we define (minimal) violating traces w.r.t. the privacy requirements of TLKC-privacy.
Definition 15 (Violating Trace). Let EL ⊆ P be an event log, L be the maximal size of background knowledge, T ∈ {seconds, minutes, hours, days} be the accuracy of timestamps, type ∈ {set, mult, seq, rel}, att ∈ {ac, re, ar}, ps ∈ PS be the corresponding perspective w.r.t. the given type and att, and σ' ⊑ π_{ps}(σ) such that (c, σ, s) ∈ EL(T). σ' is a violating (sub)trace with respect to the TLKC-privacy requirements if there exists a cand ∈ bk^L_{type,att}(EL(T)) such that |match_{type,att}(cand)| < K or Pr(s | cand) > C for some s ∈ S.
An event log satisfies TLKC-privacy if all violating traces w.r.t. the given privacy requirement are removed. A naïve approach is to determine all violating traces and remove them. However, this approach is inefficient due to the large number of violating traces, even for a weak privacy requirement. Moreover, as demonstrated in [31], TLKC-privacy is not monotonic w.r.t. L. In fact, the anonymity threshold K is monotonic w.r.t. L, i.e., if L' ≤ L and C = 100%, an event log EL which satisfies TLKC-privacy must satisfy TL'KC-privacy. However, the confidence threshold C is not monotonic w.r.t. L, i.e., if σ is a non-violating trace, its subtraces may or may not be non-violating. Therefore, we have to make sure that the conditions hold for any L' ≤ L. To this end, in the following, we define the extended version of minimal violating traces w.r.t. the different perspectives.
Definition 16 (Minimal Violating Trace). Let EL ⊆ P be an event log, L be the maximal size of background knowledge, T ∈ {seconds, minutes, hours, days} be the accuracy of timestamps, type ∈ {set, mult, seq, rel}, att ∈ {ac, re, ar}, ps ∈ PS be the corresponding perspective w.r.t. the given type and att, and σ' ⊑ π_{ps}(σ) such that (c, σ, s) ∈ EL(T). σ' is a minimal violating trace if σ' is a violating trace (Definition 15) in EL, and every proper subtrace of σ' is non-violating. We denote by MVT^{EL}_{ps} the set of minimal violating traces in the event log EL w.r.t. the perspective ps.
Every violating trace in an event log is either a minimal violating trace or it contains a minimal violating trace. Therefore, if an event log contains no minimal violating trace, then it contains no violating trace. Note that the set of minimal violating traces in an event log is much smaller than the set of violating traces, which results in better efficiency when removing violating traces.
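Given the set of violating traces, the minimal ones can be filtered out with a straightforward sketch (quadratic in the number of violating traces, which is acceptable precisely because that set is small; function names are ours):

```python
from itertools import combinations

def proper_subtraces(trace):
    """All proper subtraces obtained by removing at least one event."""
    for size in range(1, len(trace)):
        for idx in combinations(range(len(trace)), size):
            yield tuple(trace[i] for i in idx)

def minimal_violating(violating):
    """Keep only the violating traces none of whose proper subtraces is violating."""
    v = {tuple(t) for t in violating}
    return {t for t in v if not any(s in v for s in proper_subtraces(t))}
```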

Utility Measure
In the TLKC-privacy model, the maximal frequent traces are defined as a measure of data utility, where traces contain activity and timestamp attributes. Since we extend the TLKC-privacy preservation technique to cover all the main perspectives of process mining, the utility measure also needs to be extended. In the following, we provide an extended version of the utility measure considering the perspectives.
Definition 17 (Maximal Frequent Trace). Let EL be an event log, and ps ∈ PS be a perspective. For a given minimum support threshold Θ, a nonempty trace σ' ⊑ π_{ps}(σ) such that (c, σ, s) ∈ EL is maximal frequent in EL if σ' is frequent, i.e., the frequency of σ' is greater than or equal to Θ, and no supertrace of σ' is frequent in EL. We denote by MFT^{EL}_{ps} the set of maximal frequent traces in the event log EL w.r.t. the perspective ps.
The goal of data utility is to preserve as many MFTs as possible w.r.t. the given perspective. For example, in the control-flow perspective, i.e., ps = A, the goal is to preserve the maximal frequent traces w.r.t. the activities. Note that in an event log, the set of maximal frequent traces is much smaller than the set of frequent traces. Moreover, any subtrace of a maximal frequent trace is also a frequent trace, and once all the MFTs are discovered, the support counts of any frequent subtrace can be computed by scanning the data once.
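A brute-force sketch of mining maximal frequent traces (treating Θ as an absolute support count and subtraces as order-preserving subsequences; exponential in the trace length, so only for illustration):

```python
from collections import Counter
from itertools import combinations

def subseqs(trace):
    """All nonempty order-preserving subsequences of a trace."""
    for size in range(1, len(trace) + 1):
        for idx in combinations(range(len(trace)), size):
            yield tuple(trace[i] for i in idx)

def is_subseq(small, big):
    """True if small is an order-preserving subsequence of big."""
    it = iter(big)
    return all(any(x == y for y in it) for x in small)

def maximal_frequent(traces, theta):
    """Traces with support >= theta that have no frequent proper supertrace."""
    support = Counter()
    for t in traces:
        for s in set(subseqs(tuple(t))):  # count each subtrace once per case
            support[s] += 1
    frequent = [s for s, c in support.items() if c >= theta]
    return {f for f in frequent
            if not any(len(g) > len(f) and is_subseq(f, g) for g in frequent)}
```

In practice, dedicated maximal-sequential-pattern miners avoid this enumeration; the sketch only mirrors Definition 17.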

Balancing Privacy and Utility
As discussed in the privacy measure section, to provide the desired privacy requirements, all the minimal violating traces need to be removed. However, this should be done w.r.t. the utility measure. According to Definition 16, every proper subtrace of a minimal violating trace is not violating. Therefore, a minimal violating trace can be removed after removing one event of the trace. This event needs to be chosen w.r.t. both utility and privacy measures. To this end, a greedy function is defined to choose an event to remove from the minimal violating traces such that it maximizes the number of removed minimal violating traces, i.e., privacy gain, yet, at the same time, minimizes the number of removed maximal frequent traces, i.e., utility loss.
Definition 18 (Score, Privacy Gain, Utility Loss). Let EL be an event log, ps ∈ PS be a perspective, and events_{ps}(EL) = {e ∈ π_{ps}(σ) | (c, σ, s) ∈ EL} be the set of events in the event log w.r.t. the given perspective. score^{EL}_{ps}: E → R_{>0} is a function which retrieves the score of the events in the event log w.r.t. the perspective. For e ∈ events_{ps}(EL), score^{EL}_{ps}(e) = PG^{EL}_{ps}(e) / (UL^{EL}_{ps}(e) + 1).
PG^{EL}_{ps}(e) is the number of MVTs containing the event e, i.e., PG^{EL}_{ps}(e) = |{x ∈ MVT^{EL}_{ps} | e ∈ x}|, and UL^{EL}_{ps}(e) is the number of MFTs containing the event e, i.e., UL^{EL}_{ps}(e) = |{x ∈ MFT^{EL}_{ps} | e ∈ x}|. Note that in the score (Definition 18), 1 is added to the denominator to avoid dividing by zero (when e does not belong to any MFT). The event e with the highest score is called the winner event, denoted by e_w. Algorithm 1 summarizes all the steps of TLKC-privacy. In the following, we show how the algorithm works on the event log in Table 3. Suppose that Table 3 shows a simple event log EL where timestamps are represented by integer values as hours. The first line in Algorithm 1 generates the set of maximal frequent traces (MFT^{EL}_{ps}) and the set of minimal violating traces (MVT^{EL}_{ps}) from the event log EL with T = hours, L = 2, K = 2, C = 50%, Θ = 25%, Disease as the sensitive attribute S, and bk^{EL}_{rel,ar} as the background knowledge, i.e., ps = A × R × T. Figure 4 shows the MFT^{tree}_{ps} and MVT^{tree}_{ps} generated by line 2 in Algorithm 1, where each root-to-leaf path represents one trace, and each node represents an event in a trace with its frequency of occurrence. Table 6 shows the initial score of every event (node) in the MVT^{tree}_{ps} (score^{EL}_{ps}(e)). Line 4 determines the winner event e_w, which is (VI, D1, 5). Line 5 deletes all the MVTs and MFTs containing the winner event e_w, i.e., subtree 2 and the path (RE, E4, 1), (VI, D1, 5) of subtree 1 in the MVT^{tree}_{ps}, and the path (HO, E3, 4), (VI, D1, 5), (BT, N1, 7) of subtree 4 in the MFT^{tree}_{ps} are removed, and the frequencies are updated. Line 6 updates the scores based on the new frequencies of the events. Table 7 shows the remaining events in MVT^{tree}_{ps} with the updated scores. Line 7 adds the winner event to a suppression set Sup_{EL}. Lines 4-7 are repeated until there is no node left in MVT^{tree}_{ps}. The resulting privacy-aware event log is shown in Table 8.
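The core loop of Algorithm 1 (lines 4-7) can be sketched without the tree data structures, operating directly on the sets of MVTs and MFTs (a simplification of the paper's tree-based implementation; names are ours):

```python
def greedy_suppression(mvts, mfts):
    """Repeatedly pick the winner event maximizing PG / (UL + 1), delete every
    MVT and MFT containing it, and collect it in the suppression set."""
    mvts, mfts = set(mvts), set(mfts)
    suppressed = []
    while mvts:
        events = {e for t in mvts for e in t}
        def score(e):
            pg = sum(1 for t in mvts if e in t)   # privacy gain
            ul = sum(1 for t in mfts if e in t)   # utility loss
            return pg / (ul + 1)
        winner = max(events, key=score)
        mvts = {t for t in mvts if winner not in t}
        mfts = {t for t in mfts if winner not in t}
        suppressed.append(winner)
    return suppressed
```

The loop terminates because each iteration removes at least every MVT containing the winner event, and the winner is always contained in at least one MVT.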
Compared to Table 4 and Table 5, which are the results of applying traditional k-anonymity using Baseline-2, Table 8 shows that TLKC-privacy removes fewer events (only 6), despite the stronger privacy requirements. Table 8 is derived from Table 3 with T = hours, L = 2, K = 2, C = 50%, Θ = 25%, S = Disease, and bk^{EL}_{rel,ar}.

New Utility Measure and New Score
In this subsection, we first describe the shortcomings of the utility measure and the score introduced in [31] (extended in Definition 17 and Definition 18); then, we introduce a new utility measure and a new score to overcome these drawbacks. According to Definition 18, the score is calculated based on the existence of events in the set of minimal violating traces and the set of maximal frequent traces. However, the sizes of these sets, and consequently the included events, highly depend on the corresponding parameters. The set of MVTs is obtained based on T, L, K, C, and bk_{type,att}, while the set of MFTs is discovered based on Θ and the given perspective. Therefore, some of the events included in the set of minimal violating traces may not be included in the set of maximal frequent traces. Consequently, the score of the corresponding events is merely calculated based on the effect on the privacy gain. When two or more events have the same score based on the privacy gain, the algorithm assumes an equal effect on the data utility aspect and randomly chooses one of the events, which is not a valid assumption. Another problem with the current score is that, even when there are maximal frequent traces in which the event is included, the score does not differentiate the corresponding MFTs based on their frequencies in the event log. For example, suppose that for two events e_1 and e_2 in the minimal violating traces there are two maximal frequent traces MFT_1 and MFT_2 such that e_1 is only included in MFT_1, i.e., UL(e_1) = 1, and e_2 is only included in MFT_2, i.e., UL(e_2) = 1. Hence, both events get the same score for the utility aspect. However, the corresponding MFTs may have completely different frequencies in the event log, which leads to a different impact on the utility. This issue is particularly pronounced when the frequency threshold (Θ) is rather low.
For example, if Θ = 50%, then the frequencies of MFT_1 and MFT_2 in the event log could differ by up to 50%. Furthermore, the current score is not normalized, and it is not possible for the user to adjust the effect of each aspect on the score. For example, one may want to give more weight to the data utility aspect than to the privacy gain aspect.
To overcome the above-mentioned shortcomings, we define a new utility measure that is able to capture the impact of every single event on the data utility. We also define a new score based on the new utility measure, which provides normalized scores whose aspects can be weighted by the user. In the new utility measure (Definition 19), we consider the relative frequency of the variants containing the given perspective of the event as the basis of the utility.
Definition 20 (New Score). Let EL be an event log, ps ∈ PS be a perspective, events_{ps}(EL) = {e ∈ π_{ps}(σ) | (c, σ, s) ∈ EL} be the set of events in the event log w.r.t. the given perspective, α be the coefficient of privacy gain (0 ≤ α ≤ 1), β be the coefficient of utility loss (0 ≤ β ≤ 1), and α + β = 1. n-score^{EL}_{ps}: E → R_{>0} is a function which retrieves the score of the events in the event log w.r.t. the perspective. For e ∈ events_{ps}(EL), n-score^{EL}_{ps}(e) = α · rPG^{EL}_{ps}(e) + β · nUL^{EL}_{ps}(e), where rPG^{EL}_{ps}(e) is the relative value of the privacy gain, i.e., rPG^{EL}_{ps}(e) = |{x ∈ MVT^{EL}_{ps} | e ∈ x}| / |MVT^{EL}_{ps}|.
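Since Definition 19 is not reproduced here, the following sketch substitutes an assumed utility-loss term: one minus the relative frequency of the variants containing the event, so that removing an event occurring only in rare variants is considered cheaper. Only `rpg` follows Definition 20 verbatim; `nul` is our stand-in:

```python
def n_score(e, mvts, variant_freq, alpha=0.5, beta=0.5):
    """New score (sketch): alpha * rPG + beta * nUL with alpha + beta = 1.
    rpg follows Definition 20; nul is an ASSUMED stand-in for Definition 19,
    namely one minus the relative frequency of the variants containing e."""
    assert abs(alpha + beta - 1.0) < 1e-9
    rpg = sum(1 for t in mvts if e in t) / len(mvts)
    total = sum(variant_freq.values())
    in_freq = sum(f for v, f in variant_freq.items() if e in v)
    nul = 1 - in_freq / total
    return alpha * rpg + beta * nul
```

Because both terms lie in [0, 1] and the coefficients sum to 1, the score is normalized and the α/β trade-off is directly interpretable.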
Algorithm 2 shows the new algorithm based on the new utility measure and the new score, where maximal frequent traces are no longer used, and the score of the events included in the minimal violating traces is calculated based on the new score. Note that in both Algorithm 1 and Algorithm 2, the perspective is derived from the type and attribute of the background knowledge (Figure 2).

Experiments
In this section, we evaluate the extended TLKC-privacy by applying it to real-life event logs. We explore the effect of applying the technique on both data utility and result utility. The results are also compared with the baseline methods. The result utility analysis evaluates the similarity of specific results obtained from the privacy-aware event log to the same type of results obtained from the original event log, while the data utility analysis compares the privacy-aware event log with the original event log. As discussed in [29], the result utility analysis is highly dependent on the underlying algorithm generating the specific results, and the data utility analysis provides a more general evaluation. We perform the evaluation for the three main perspectives: control-flow, organizational, and time. For the result utility analysis, in each perspective, we focus on a specific type of result: for the control-flow perspective, we focus on process discovery; for the organizational perspective, we perform social network discovery; and for the time perspective, we perform bottleneck analysis.
The implementation, as a Python program, is available on GitHub.

Experimental Setup
For the experiments, we employ three human-centered event logs, where the case identifiers refer to individuals: Sepsis-Cases, BPIC-2012-APP, and BPIC-2017-APP. Sepsis-Cases [22] is a real-life event log containing events of sepsis cases from a hospital. BPIC-2012-APP [37] is also a real-life event log, describing a loan application process at a Dutch financial institute. BPIC-2017-APP also pertains to a loan application process of a Dutch financial institute. Table 9 shows the general statistics of these event logs. The Sepsis-Cases event log was included in the experiments because it has some challenging features for privacy preservation techniques: 80% of its traces are unique based on the activity perspective, which imposes significant challenges for privacy-preserving process discovery algorithms [31,13,23]. BPIC-2017-APP has similar properties w.r.t. the resource perspective, i.e., 76% of its traces are unique w.r.t. the resource perspective. Note that Sepsis-Cases does not contain resource information and cannot be used for the organizational perspective analysis. We employ BPIC-2012-APP and BPIC-2017-APP for the organizational perspective. Table 10 shows some statistics about the variants with respect to the different perspectives. For example, as mentioned, in Sepsis-Cases, 80% of the traces are unique from the activity perspective, and in BPIC-2017-APP, 76% of the traces are unique from the resource perspective.
Overall, we performed more than 1000 experiments for the four different types of background knowledge and the different perspectives. 200 different settings were used based on the following values for the main parameters: L ∈ {2, 3, 4, 5, 6}, K ∈ {20, 30, 40, 50, 60}, C ∈ {0.2, 0.3, 0.4, 0.5}, and T ∈ {hours, minutes}. We consider equal weights for the privacy gain and utility loss aspects of the score, i.e., α = 0.5 and β = 0.5. In Sepsis-Cases, "diagnose" and "age" are considered as the sensitive case attributes. The numerical attributes are converted to categorical attributes using boxplots, such that all the values greater than the upper quartile are categorized as high, the values less than the lower quartile are categorized as low, and the values in between are categorized as middle. Note that the confidence value C should not be greater than 0.5, i.e., there must be at least two different sensitive values for a victim case. To show and interpret the results of the experiments, we focus on one weak and one strong setting. We use T = minutes, L = 2, K = 20, and C = 0.5 as the weak setting, and T = minutes, L = 6, K = 60, and C = 0.2 as the strong setting. Note that in the experiments, TLKC refers to the algorithm presented in [31], which has been extended here w.r.t. the different perspectives, i.e., Algorithm 1, and TLKC-EXT refers to Algorithm 2.
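The quartile-based categorization described above can be sketched as follows (assuming a plain list of numeric values; `statistics.quantiles` with n = 4 yields the lower quartile, the median, and the upper quartile):

```python
import statistics

def categorize(values):
    """Map numeric values to low / middle / high using the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return ["low" if v < q1 else "high" if v > q3 else "middle" for v in values]
```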

Control-flow Perspective
In this subsection, we evaluate the effect of applying the extended TLKC-privacy on the result utility and the data utility with respect to the control-flow perspective. We perform the control-flow perspective analysis for both event logs.

Result Utility
As mentioned, for the result utility analysis of the control-flow perspective, we focus on process discovery. The main goal is to find out how accurately a process model discovered from a privacy-aware event log captures the behavior of the original event log. To this end, we first discover a process model M' from the privacy-aware event log EL'. Then, for M', we calculate fitness, precision, and f1-score, as model quality measures, w.r.t. the original event log EL.
Fitness quantifies the extent to which the discovered model can reproduce the traces recorded in the event log [4]. Precision quantifies the fraction of the traces allowed by the model that is also seen in the event log [5], and the f1-score combines fitness and precision: f1-score = (2 × precision × fitness) / (precision + fitness). For process discovery, we use the inductive miner infrequent algorithm [16] with the default parameters (noise threshold 0.2). Figure 5 shows the results of the experiments for the quality measures. We consider four variants of our privacy preservation technique based on the introduced types of background knowledge where the attribute is activity, i.e., bk_{set,ac}, bk_{mult,ac}, bk_{seq,ac}, and bk_{rel,ac}. Note that applying privacy preservation techniques may improve some quality measures. However, the aim is to provide results as similar as possible to the original ones, not to improve the quality of the discovered models. Therefore, we include the results from the original event log to compare the proximity of the values. Figures 5a and 5b show how the mentioned quality measures are affected by applying our method with the weak and strong settings (for TLKC, we set Θ = 0.5). We compare the measures with the results from the original process model and the introduced baseline methods. If we only considered the quality measures, Baseline-2 would be marked as the best one, since it results in better f1-score values. However, the baseline methods remove more events from the original event log. Consequently, the corresponding privacy-aware event logs contain significantly less behavior compared to the original event log, and the resulting models have high precision and f1-score. The result utility analyses show that the extended version of TLKC-privacy leads to results more similar to the original ones, specifically for the set and multiset types of background knowledge.
However, the results obtained based on the relative type of background knowledge have a worse fitness value, which is not surprising given that the assumed background knowledge is considerably strong and, at the same time, difficult to obtain in reality. Note that the baseline methods do not protect event data against the attribute linkage attack and thus provide weaker privacy guarantees.

Data Utility
For the data utility analysis, we utilize the earth mover's distance, as proposed in [29]. The earth mover's distance describes the distance between two distributions [34]. In an analogy, given two piles of earth, it describes the effort required to transform one pile into the other. Assume EL is the original event log, EL' a privacy-aware event log, and ps ∈ PS the perspective of the analysis. The data utility is calculated as follows: du(EL, EL') = 1 − min_{r ∈ RA} ul(r, EL_{ps}, EL'_{ps}), where ul(r, EL_{ps}, EL'_{ps}) is the distance between the traces of the two event logs projected on the given perspective. Note that r ∈ RA is a reallocation function, and the normalized edit distance (Levenshtein) [18] is used to calculate the distance between variants. It should also be noted that for the control-flow perspective, ps = A. Figure 6 shows the results of the data utility analysis, where we compare TLKC and TLKC-EXT, which provide the same privacy guarantees. As can be seen, for the weak privacy setting, the data utility results are similar, and TLKC-EXT performs slightly better for the stronger types of background knowledge. For the strong privacy setting, TLKC-EXT performs considerably better for the multiset and sequence types of background knowledge. Comparing the data utility analysis with the result utility analysis shows that the model quality measures alone cannot precisely evaluate the effectiveness of privacy preservation techniques. For example, in the result utility analysis, for both the weak and the strong setting, TLKC-EXT results in an acceptable f1-score value. However, the data utility analysis shows that the utility loss is indeed high for this type of background knowledge.
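The normalized Levenshtein distance used between variants can be sketched as follows (standard dynamic programming; normalizing by the length of the longer variant is our assumption about the exact normalization used in [18]):

```python
def levenshtein(a, b):
    """Classic edit distance between two variants (sequences of activities)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def normalized_levenshtein(a, b):
    """Normalized to [0, 1] by the length of the longer variant."""
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))
```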
As already mentioned, the Sepsis-Cases event log is a significantly challenging dataset for privacy preservation techniques due to the high uniqueness of its variants. To show the effectiveness of our privacy preservation technique on other event logs, we perform the same type of analyses for BPIC-2012-APP, considering only the strong setting. Figure 7 shows that both the data and the result utility are high, even for the strong types of background knowledge.

Organizational Perspective
In this subsection, we evaluate the effect of applying the extended TLKC-privacy on the result and data utility of the organizational perspective. The experiments for this perspective are done on BPIC-2012-APP, which includes resource information.

Result Utility
For the result utility analysis of the organizational perspective, we focus on social network discovery techniques. There are different methods for discovering social networks from event logs, such as those based on causality, joint activities, joint cases, etc. [3]. Here, we focus on the handover technique, which is causality-based. This technique monitors, for individual cases, how work moves from resource to resource, i.e., there is a handover relation from individual r_1 to individual r_2 if there are two subsequent activities where the first is performed by r_1 and the second is performed by r_2. Figure 8 shows the handover networks discovered from the original event log and a privacy-aware event log when the relation threshold is 0, i.e., all the handovers are included. The privacy-aware event log was obtained using the TLKC-EXT privacy preservation technique with the strong setting and set as the type of background knowledge. As expected, the density of the network discovered from the privacy-aware event log is lower than that of the original handover network. However, by focusing on some specific nodes, one can see that the basic concepts are preserved. For example, node 11339 in the original handover network has the following set of input links {11302, 11003, 11300, 11121, 11122, 11180, 10932, 10861} and no output link (excluding the self-loop), and in the network discovered from the privacy-aware event log, only the input link from node 11121 is removed.
To quantify the similarity of the social networks resulting from an original and a privacy-aware event log, we use a set of measures similar to the quality measures of process models, i.e., fitness, precision, and f1-score. Consider SN = (R_{EL}, DF^{R}_{EL}) and SN′ = (R_{EL′}, DF^{R}_{EL′}) as the handover social networks obtained from an original event log and its corresponding privacy-aware event log, respectively. Since both TLKC and TLKC-EXT provide privacy guarantees by removing events, the set of vertices of SN′ is a subset of the vertices of SN, i.e., R_{EL′} ⊆ R_{EL}. However, the set of edges of SN′ is not necessarily a subset of the edges of SN, i.e., SN′ is not necessarily a subgraph of SN. The fitness (F_{sn}) and precision (P_{sn}) for handover networks are computed over the shared edges, i.e., F_{sn} = |DF^{R}_{EL} ∩ DF^{R}_{EL′}| / |DF^{R}_{EL}| and P_{sn} = |DF^{R}_{EL} ∩ DF^{R}_{EL′}| / |DF^{R}_{EL′}|. The f1-score for handover networks (F1_{sn}) is the harmonic mean of F_{sn} and P_{sn}.
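A sketch of the handover relation and the edge-set comparison, consistent with the description above (each trace is given as the sequence of resources performing its events; expressing the fitness and precision formulas as edge-set ratios is our reconstruction):

```python
def handover_edges(resource_traces):
    """Handover-of-work relation: r1 -> r2 whenever r2 performs the activity
    directly after r1 within the same case."""
    edges = set()
    for resources in resource_traces:
        edges |= set(zip(resources, resources[1:]))
    return edges

def graph_fitness_precision(orig_edges, anon_edges):
    """Edge-set comparison of two networks: (fitness, precision, f1-score)."""
    common = orig_edges & anon_edges
    fitness = len(common) / len(orig_edges) if orig_edges else 1.0
    precision = len(common) / len(anon_edges) if anon_edges else 1.0
    f1 = 2 * fitness * precision / (fitness + precision) if fitness + precision else 0.0
    return fitness, precision, f1
```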
Fig. 8: (a) The original handover network. (b) The handover network discovered from the privacy-aware event log. (c) The relations of resource 11339 in the original handover network: the node has 8 input links and no output link, except the self-loop. (d) The relations of resource 11339 in the handover network resulting from the privacy-aware event log: the node has 7 input links and no output link, except the self-loop.

Figure 9 shows the similarity of the handover social networks after applying the TLKC-EXT privacy model with the strong setting to BPIC-2012-APP and BPIC-2017-APP. The precision is high for all the types of background knowledge, i.e., the handover social networks obtained from the privacy-aware event logs often do not contain edges that do not exist in the original network. The fitness decreases when the background knowledge becomes stronger, i.e., the SN′s obtained based on stronger assumptions for the background knowledge have fewer edges in common with SN.

Data Utility
For the data utility analysis of the organizational perspective, we utilize the earth mover's distance, similar to the data utility analysis of the control-flow perspective. Here, the perspective is resource, i.e., ps = R. Figure 10 shows the results of the data utility analysis for BPIC-2012-APP, considering different types of background knowledge and using TLKC-EXT as the privacy preservation technique. As can be seen, the data utility preservation is above 0.5, even for the strong types of background knowledge.

Time Perspective
We evaluate the effect on performance analyses by analyzing the bottlenecks w.r.t. the mean duration of cases between activities. Since privacy preservation techniques may remove some activities, we cannot compare the bottlenecks in the original process model with the bottlenecks in a process model discovered from a privacy-aware event log. Therefore, we first project the original event log on the activities existing in the privacy-aware event log. Then, we discover a performance-annotated directly follows graph DFG from the projected event log and compare it with the performance-annotated directly follows graph DFG' from the privacy-aware event log. A DFG is a graph where the nodes represent activities and the arcs represent causalities. Two activities a_1 and a_2 are connected by an arrow when a_1 is frequently followed by a_2 [17]. Figure 11 (set and multiset as the types of background knowledge) and Figure 12 (sequence and relative as the types of background knowledge) show the results for Sepsis-Cases using TLKC-EXT with the strong setting. As can be seen, the bottlenecks in DFG and DFG' are the same for all the variants, except for the DFGs discovered using bk_{rel,ac}, where the assumed background knowledge is relative, which is significantly strong, and for which our data utility analysis in Section 6.2 demonstrated a low data utility preservation for Sepsis-Cases. Note that the mean duration of the cases is different in DFG and DFG' due to the relative timestamps in the privacy-aware event logs.
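Building a performance-annotated DFG from timestamped traces can be sketched as follows (a minimal version that annotates each arc with the mean duration between directly-following activities; frequency filtering of arcs is omitted):

```python
from collections import defaultdict

def performance_dfg(traces):
    """Performance-annotated DFG: traces are sequences of (activity, timestamp)
    pairs; each arc carries the mean duration between directly-following
    activities."""
    durations = defaultdict(list)
    for trace in traces:
        for (a1, t1), (a2, t2) in zip(trace, trace[1:]):
            durations[(a1, a2)].append(t2 - t1)
    return {arc: sum(d) / len(d) for arc, d in durations.items()}
```

Bottleneck candidates are then simply the arcs with the largest mean durations.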
We also evaluate the similarity of the directly follows graphs (DFGs) resulting from an original event log and its corresponding privacy-aware event log. Let DFG = (A_{EL}, DF^{A}_{EL}) and DFG' = (A_{EL'}, DF^{A}_{EL'}) be the directly follows graphs obtained from an original event log and its corresponding privacy-aware event log, respectively. To compare these graphs, we follow the same approach taken for quantifying the similarity of social networks: the fitness (F_{dfg}) and precision (P_{dfg}) for DFGs are calculated analogously, and the f1-score for DFGs (F1_{dfg}) is the harmonic mean of F_{dfg} and P_{dfg}.

Fig. 13: The DFG comparison based on fitness (F_{dfg}), precision (P_{dfg}), and f1-score (F1_{dfg}) for the graphs obtained from the Sepsis-Cases, BPIC-2012-APP, and BPIC-2017-APP event logs. The privacy preservation technique is TLKC-EXT with the strong setting.

Figure 13 shows the similarity of DFGs after applying the TLKC-EXT privacy model with the strong setting to Sepsis-Cases, BPIC-2012-APP, and BPIC-2017-APP. The precision is always high, i.e., the DFGs obtained from the privacy-aware event logs often do not contain directly follows relations that do not exist in the original DFG. For the Sepsis-Cases event log, the fitness decreases when the background knowledge becomes stronger, i.e., the DFG's obtained based on stronger assumptions for the background knowledge preserve fewer directly follows relations of the original DFG. The fitness for the BPIC event logs only drops for the relative type of background knowledge, which is considerably strong.

Related Work
In process mining, the research field of confidentiality and privacy has recently been receiving more attention. In this section, we discuss the work that has been done in this rapidly growing research field.
In [2], Responsible Process Mining (RPM) is introduced as the sub-discipline which focuses on the possible negative side-effects of applying process mining. As discussed, the TLKC-privacy model generalizes several privacy preservation techniques, including k-anonymity, confidence bounding, (α, k)-anonymity, and l-diversity, and it provides interpretable and tunable parameters.
Similar to the main approach, we implemented four variants of the extended version with respect to the four different types of background knowledge, considering all the main perspectives. The effectiveness of the different variants in the different perspectives was evaluated based on real-life event logs. Both data and result utility were analyzed to evaluate the effectiveness. Overall, more than 1000 experiments were performed for the different types of background knowledge, considering the different perspectives, and the results were given for a weak and a strong setting. Our experiments showed that the extended TLKC-privacy performs better than the previous version w.r.t. data utility preservation. However, in event logs with a high ratio of unique traces, when the assumed type of background knowledge is very specific, e.g., relative, group-based privacy preservation techniques may not be able to preserve the general data utility, and this negative effect cannot be observed by result utility analyses alone.