An Improved Sanitization Algorithm in Privacy-Preserving Utility Mining

High-utility pattern mining is an effective technique that extracts significant information from varied types of databases. However, the analysis of data with sensitive private information may cause privacy concerns. To achieve a better trade-off between utility maximization and privacy preservation, privacy-preserving utility mining (PPUM) has become an important research topic in recent years. The MSICF algorithm is a sanitization algorithm for PPUM. It selects the victim item based on the conflict count and identifies the victim transaction based on the concept of utility. Although MSICF is effective, its heuristic selection strategy can be improved to obtain a lower ratio of side effects. In this paper, we propose an improved sanitization approach named the Improved Maximum Sensitive Itemsets Conflict First Algorithm (IMSICF) to address this issue. It dynamically calculates the conflict counts of sensitive items during the sanitization process. In addition, IMSICF chooses the transaction with the minimum number of nonsensitive itemsets and the maximum utility of a sensitive itemset for modification. Extensive experiments have been conducted on various datasets to evaluate the effectiveness of the proposed algorithm. The results show that IMSICF outperforms other state-of-the-art algorithms in terms of minimizing side effects on nonsensitive information. Moreover, the influence of correlation among itemsets on the performance of various sanitization algorithms is observed.


Introduction
Data mining is used to discover decision-making knowledge and information from massive data [1][2][3][4]. In a cooperative project, data are shared among different companies for mutual benefit. However, this brings the risk of disclosing sensitive knowledge contained in a database [5]. Sensitive knowledge can be represented as a set of frequent patterns and high-utility patterns with security implications [6,7]. Thus, a data owner wants to hide sensitive information before a database is released. To solve this problem, privacy-preserving data mining (PPDM) has been proposed and has become an important research direction [8]. PPDM methods have been applied in various fields, such as cloud computing, e-health, wireless sensor networks, and location-based services [9].
One way to conceal sensitive knowledge is to sanitize a database by modifying some items in it. Atallah et al. [10] first proved that the optimal sensitive-knowledge-hiding problem is NP-hard and proposed a sanitization algorithm based on a heuristic strategy. Since then, many works have been completed. However, the existing hiding approaches for protecting high-utility itemsets sanitize a database solely on the basis of the concept of utility; the side effects on nonsensitive information are not taken into account. Thus, the damage to nonsensitive knowledge is serious, and database quality is low after a database is modified. To address this problem, we propose an improved approach called the Improved Maximum Sensitive Itemsets Conflict First Algorithm (IMSICF) for hiding sensitive high-utility itemsets.
This algorithm is based on the MSICF algorithm and makes the following improvements: (1) For the victim item selection, the conflict count of each sensitive item is dynamically calculated in the sanitization process, which ensures that the item with the maximum conflict count is chosen to be sanitized. (2) For the victim transaction selection, the transaction supporting the least number of nonsensitive itemsets and having the maximal utility of a sensitive itemset is chosen to be modified, which effectively reduces the undesired side effects produced by the sanitization process. (3) The conflict degree is defined to reflect the correlation among sensitive itemsets, and the influence of the conflict degree on sanitization performance is observed. The rest of the paper is organized as follows. Section 2 reviews related works. In Section 3, preliminary knowledge of high-utility itemset mining is introduced. Section 4 describes the hiding strategy of the proposed sanitization approach. Section 5 gives experimental results and analysis. Finally, conclusions are drawn in Section 6.

Related Works
In this section, related works on privacy-preserving data mining are reviewed.
Most previous studies focused on hiding sensitive itemsets from frequent itemset mining approaches. Verykios et al. [11,12] proposed five approaches for achieving PPDM.
The first three algorithms are used to protect sensitive association rules. A sensitive rule is hidden by reducing its support or confidence. The last two algorithms are used to conceal sensitive itemsets. However, all five algorithms only hide rules that are supported by disjoint frequent itemsets. Oliveira and Zaïane [13] presented a one-scan algorithm that needs to scan a database only once. A disclosure threshold is introduced to balance privacy protection and knowledge disclosure. Amiri [14] developed three algorithms, namely, aggregate, disaggregate, and hybrid, for hiding sensitive frequent itemsets. Aggregate conceals sensitive itemsets by deleting transactions. Disaggregate sanitizes a database by removing some items. The hybrid algorithm is the combination of the previous two. In terms of execution time and side effects, the hybrid algorithm is recommended for database sanitization.
Gkoulalas-Divanis and Verykios [15] introduced a new approach for hiding sensitive itemsets by inserting some synthetic transactions into an original database. The hybrid database is generated based on constraints on the border itemsets. Wu and Huang [16] described two greedy approaches, namely, a greedy approximation algorithm and an exhaustive algorithm, for concealing sensitive association rules. Both algorithms include a sanitization procedure and an exposure procedure. The greedy approximation algorithm always outperforms the other one because the cost is recalculated when items are modified. However, the greedy algorithm takes a lot of time to expose missing rules.
Hong et al. [17] used the technique of term frequency-inverse document frequency (TF-IDF) to sanitize a database. The transaction with more sensitive items and less influence on other transactions is selected for modification. However, the scalability of this approach is poor.
Le et al. [18,19] applied lattice theory to hide sensitive association rules, and two approaches called HCSRIL and AARHIL were proposed based on the intersection lattice of frequent itemsets. Both algorithms select the victim item with the least impact on the generating set for sanitization.
The AARHIL algorithm has better performance than HCSRIL in terms of missing cost since the victim transaction selection is improved. Shah et al. [20] adopted the genetic algorithm to hide sensitive association rules. A fitness function is used to evaluate whether to modify a transaction or not. A transaction with lower fitness is modified with higher probability because it contains more sensitive items and a minimal number of data items.
Cheng et al. [21][22][23][24][25] applied multiobjective optimization algorithms, such as NSGA-II and HypE, to PPDM. The algorithms in [21][22][23][24] conceal sensitive association rules by modifying some items. The sanitization approach in [25] hides sensitive itemsets by removing some items. The key issue in an optimization algorithm is to design the objective functions, which are based on the side effects on a database. Besides, Cheng et al. [26] also proposed a greedy algorithm for hiding sensitive rules. The information on nonsensitive itemsets is considered in the selection of the victim transaction; thus, the side effects are effectively reduced. Lin et al. [27] presented a multiobjective algorithm (NSGA2DT) for hiding sensitive itemsets with transaction deletion. A Fast SoRting strategy and the prelarge concept are utilized to accelerate the iterative process. The above methods focus on protecting sensitive knowledge in frequent itemset mining. However, they are not suitable for modifying the quantities of items in a transactional database. Recently, various methods have been developed for protecting high-utility itemsets, and PPUM has become an important research issue. Yeh et al. [28,29] presented the HHUIF and MSICF algorithms to conceal sensitive high-utility itemsets. Both algorithms sanitize the original database based on the concept of utility; in addition, the MSICF algorithm takes the conflict count of sensitive items into account. Rajalaxmi and Natarajan [30] identify the victim transaction with the maximum number of sensitive items. Then, the item with the maximal utility is selected for modification. However, the impact on nonsensitive itemsets is disregarded. Lin et al. [31] proposed a GA-based algorithm for PPUM that inserts appropriate transactions into an original database. A function with three factors is used to determine the transactions for insertion. However, this algorithm produces some spurious itemsets after the sanitization process.
Yun and Kim [32] presented an algorithm called FPUTT to improve the efficiency of the HHUIF algorithm. A tree structure is utilized to accelerate the sanitization process. However, the results of FPUTT in terms of side effects on nonsensitive knowledge are the same as those of HHUIF. Lin et al. [33,34] then developed two approaches, MSU_MAU and MSU_MIU, for PPUM. For each sensitive itemset SH_i, the transaction with the maximum utility of SH_i is selected for modification. Then, the victim item is chosen based on the maximum or minimum utility. Lin et al. [35] designed a genetic algorithm to hide sensitive HUIs by transaction deletion. The prelarge concept is adopted to accelerate the evolution process. However, some spurious itemsets are produced by the sanitization process. Li et al. [36] formulated the hiding process as a constraint satisfaction problem. Integer linear programming is adopted in the designed algorithm to obtain a lower ratio of side effects produced in the hiding process.
Rajalaxmi and Natarajan [37] proposed two approaches named MSMU and MCRSU to hide sensitive frequent and utility itemsets. Both algorithms conceal the itemsets until their support and utility fall below the given thresholds, respectively. Liu et al. [38] presented a novel sanitization algorithm called HUFI to conceal sensitive frequent and utility itemsets. The concept of maximum boundary value is introduced to determine the hiding strategy.
Thus, the approach outperforms the other algorithms in minimizing the side effects. Besides the above works, Le et al. [39] proposed an efficient algorithm for hiding high-utility sequential patterns, which relies on a novel structure to enhance the sanitization process.

Preliminaries
Some preliminary definitions of high-utility itemset mining are introduced in this section [40,41]. In addition, the sanitization problem is described [42,43].

Definitions.
Let I = {I_1, I_2, ..., I_m} be a set of distinct items. Let D = {T_1, T_2, ..., T_n} be a transaction database, where each transaction T_i has a unique identifier TID and T_i ⊆ I. Each item has an external utility value, which reflects the importance of the item. An itemset is a subset of I, and it is called a k-itemset if it contains k items.

Definition 1. Each item i_t is assigned an external utility value, which is denoted as eu(i_t). For example, in Table 1, eu(b) = 3.

Definition 2.
Each item i_t in a transaction T is assigned an internal utility value, which is denoted as iu(i_t, T). For example, in Table 1, iu(b, T_1) = 1.

Definition 3.
The utility of an item i_t in a transaction T is denoted as u(i_t, T) and defined as u(i_t, T) = iu(i_t, T) × eu(i_t). For example, in Table 1, u(d, T_2) = 1 × 6 = 6.

Definition 4.
The utility of an itemset SH_i in a transaction T is denoted as u(SH_i, T) and defined as u(SH_i, T) = Σ_{i_t ∈ SH_i} u(i_t, T).

Definition 5. The utility of an itemset SH_i is denoted as u(SH_i) and defined as u(SH_i) = Σ_{T ∈ D, SH_i ⊆ T} u(SH_i, T).

Definition 6. The utility of a transaction T is denoted as tu(T) and defined as tu(T) = Σ_{i_t ∈ T} u(i_t, T). Examples of all three values can be computed from Table 1.

Definition 7.
The user-specified minimum utility threshold is denoted as minutil. A pattern X is a high-utility itemset if u(X) ≥ minutil; otherwise, it is a low-utility itemset. High-utility itemset mining discovers the itemsets whose utility values are no less than minutil.
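Definitions 1-7 can be illustrated with a small sketch. The toy database below is hypothetical (only eu(b) = 3, iu(b, T1) = 1, and u(d, T2) = 6 match the examples cited from Table 1), and all helper names are ours:

```python
from itertools import combinations

# Toy database for Definitions 1-7. Only eu(b) = 3, iu(b, T1) = 1, and
# u(d, T2) = 6 match the cited examples; everything else is hypothetical.
eu = {"a": 2, "b": 3, "d": 6}           # external utilities eu(i_t)
db = {                                  # TID -> {item: iu(i_t, T)}
    "T1": {"a": 2, "b": 1},
    "T2": {"a": 1, "d": 1},
    "T3": {"b": 2, "d": 2},
}

def u_item(item, tid):                  # u(i_t, T) = iu(i_t, T) * eu(i_t)
    return db[tid][item] * eu[item]

def u_itemset(itemset, tid):            # u(SH_i, T); 0 if T does not contain SH_i
    if not set(itemset) <= set(db[tid]):
        return 0
    return sum(u_item(i, tid) for i in itemset)

def utility(itemset):                   # u(SH_i) over all supporting transactions
    return sum(u_itemset(itemset, tid) for tid in db)

def tu(tid):                            # transaction utility tu(T)
    return sum(u_item(i, tid) for i in db[tid])

def high_utility_itemsets(minutil):     # naive enumeration of all HUIs
    items = sorted(eu)
    return {X: utility(X)
            for k in range(1, len(items) + 1)
            for X in combinations(items, k)
            if utility(X) >= minutil}
```

With minutil = 15, only {d} and {b, d} (each with utility 18) qualify in this toy database; a real miner such as EFIM prunes the search space instead of enumerating all subsets.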

Sanitization Problem Description.
Given a transaction database D, the minimum utility threshold minutil, and the high-utility itemsets mined from D based on minutil, let SH = {SH_1, SH_2, ..., SH_t} be a set of sensitive high-utility itemsets, where SH_i is an itemset that needs to be hidden. A sensitive transaction is a transaction supporting at least one sensitive itemset. The sanitization problem is to transform an original database D into a sanitized database D′ so that all sensitive itemsets are hidden while, at the same time, the side effects on the database and on nonsensitive knowledge are minimized.
One way to sanitize a database is to modify some items in it.
A modified item contained in D is a victim item, denoted as I_vic. A transaction supporting I_vic is a victim transaction, denoted as T_vic.

The Hiding Approach
In this section, a sanitization approach named the Improved Maximum Sensitive Itemsets Conflict First Algorithm (IMSICF) is presented in detail. The victim item is selected based on the conflict count, which is calculated dynamically. Moreover, the victim transaction is selected based on the side effects on nonsensitive knowledge, which effectively reduces the missing costs. To better illustrate how the IMSICF algorithm works, an example is given.

The Sanitization Process of Hiding Sensitive Itemsets
Definition 9. Let SH = {SH_1, SH_2, ..., SH_t} be a set of sensitive high-utility itemsets. The conflict count of a sensitive item i_t in SH is denoted as Icount(i_t) and defined as the number of sensitive itemsets in SH that contain i_t. The conflict degree of SH is defined as (Σ_{i_t ∈ SH} Icount(i_t)) / n_i, where n_i refers to the number of distinct items contained in SH. The conflict degree reflects the correlation among sensitive itemsets. A higher conflict degree indicates that the sensitive itemsets have more common items.
For example, given a set of sensitive itemsets SH = {{b, e}, {e, f}}, the conflict degree of SH is 4/3 because Icount(b) = 1, Icount(e) = 2, and Icount(f) = 1.

Definition 10. Let SH_i be a sensitive itemset and minutil be the minimum utility threshold. To hide SH_i, the minimum utility to be reduced is defined as u(SH_i) − minutil + 1 and denoted as diffu, where u(SH_i) is the utility of SH_i.

Let D be a database and SH_i a sensitive itemset. To hide SH_i, the utility of SH_i should be reduced until it falls below the minimum utility threshold; namely, diffu of SH_i should be lower than or equal to zero. The strategy for hiding SH_i is to sanitize some items in the selected transactions in D. The sanitization process of hiding SH_i is shown in Figure 1. First, the victim item is identified. Then, a sensitive itemset containing the victim item is determined. Next, the victim transaction is selected to be modified. After the victim item and transaction are determined, the original database is sanitized. In the following, the sanitization process is described in detail.
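The conflict count and conflict degree of Definition 9 can be sketched as follows, reproducing the example values; the function names are ours:

```python
# Conflict count and conflict degree (Definition 9) for SH = {{b, e}, {e, f}}.
SH = [{"b", "e"}, {"e", "f"}]

def icount(item, sh):
    # Icount(i_t): number of sensitive itemsets in SH containing i_t
    return sum(1 for itemset in sh if item in itemset)

def conflict_degree(sh):
    # Sum of conflict counts over the n_i distinct items in SH.
    distinct = set().union(*sh)
    return sum(icount(item, sh) for item in distinct) / len(distinct)
```

Here Icount(b) = 1, Icount(e) = 2, Icount(f) = 1, and the conflict degree is (1 + 2 + 1)/3 = 4/3.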

The Victim Item Selection.
Let D be a database and SH = {SH_1, SH_2, ..., SH_t} a set of sensitive itemsets. In the MSICF algorithm, the items in SH are sorted in descending order of conflict counts. Then, the victim item is selected based on the sorted results. Because the order of the sorted items is fixed, the victim item selection cannot change as the original database is modified. Thus, we improve the selection of the victim item. In our approach, the item with the maximum conflict count is selected to be sanitized, that is, I_vic = argmax_{i_t ∈ SH} Icount(i_t). Once a sensitive itemset is hidden, the conflict count of each sensitive item is recalculated over the sensitive itemsets that remain to be hidden. In this way, we ensure that the victim item always has the maximum conflict count at each step of the sanitization process.
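The dynamic selection rule above can be sketched as follows; the function name and the plain-dictionary counting are our assumptions:

```python
def select_victim_item(sh):
    """I_vic = argmax Icount(i_t), computed over the sensitive itemsets
    that still remain to be hidden, so the counts are always current."""
    counts = {}
    for itemset in sh:
        for item in itemset:
            counts[item] = counts.get(item, 0) + 1
    return max(counts, key=counts.get)

remaining = [{"b", "e"}, {"e", "f"}]
victim = select_victim_item(remaining)   # "e": it appears in both itemsets
# Once an itemset is hidden, it is dropped from `remaining`, and the next
# call recomputes the counts on the modified set, unlike MSICF's fixed order.
```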

The Victim Transaction Selection.
Let D be a database and SH_i a sensitive itemset. In the MSICF algorithm, the transaction with the maximum utility of the victim item is selected as the victim transaction. However, the damage to nonsensitive knowledge is not taken into account. To reduce the side effects on nonsensitive information, the transaction that causes the minimum impact should be modified with priority. Thus, we assign a transaction weight to each sensitive transaction, which is used to determine the victim transaction. The transaction weight is computed as

tw(T) = u(SH_i, T) / (NSHC(T) + 1),    (2)

where u(SH_i, T) is the utility of SH_i in transaction T and NSHC(T) is the number of nonsensitive itemsets supported by T. A transaction with the maximum utility of SH_i indicates that the deletion of a victim item will decrease more utility. Moreover, a transaction supporting the minimum number of nonsensitive itemsets indicates that modifying it will generate fewer side effects. Thus, the transaction having the maximum transaction weight is sanitized first. Because NSHC(T) is zero when a transaction does not support any nonsensitive itemset, we set the denominator of Formula (2) to NSHC(T) + 1.

4.1.3. The Original Database Sanitization. Let SH_i be a sensitive itemset, I_vic a victim item, and T_vic a victim transaction. If diffu of SH_i is not greater than u(I_vic, T_vic) − eu(I_vic), the victim item is not removed from the victim transaction. Instead, iu(I_vic, T_vic) is reduced to iu(I_vic, T_vic) − ⌈diffu/eu(I_vic)⌉, where eu(I_vic) is the external utility of I_vic. Otherwise, I_vic is removed from T_vic, and the utility of SH_i is reduced by u(I_vic, T_vic).
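The modification step can be sketched as below, assuming the ceiling is taken when reducing the internal utility and that the item is removed outright in the opposite branch; the function name and return convention are ours:

```python
import math

def sanitize_victim(iu_vic, eu_vic, diffu):
    """Return (new internal utility of I_vic in T_vic, utility removed).
    A new internal utility of 0 means I_vic is removed from T_vic."""
    u_vic = iu_vic * eu_vic                    # u(I_vic, T_vic)
    if diffu <= u_vic - eu_vic:
        # Partial reduction: decrease the quantity, keep the item.
        new_iu = iu_vic - math.ceil(diffu / eu_vic)
        return new_iu, (iu_vic - new_iu) * eu_vic
    # Otherwise the victim item is removed from the victim transaction.
    return 0, u_vic
```

For instance, with iu = 5, eu = 3, and diffu = 7, the quantity drops by ⌈7/3⌉ = 3, removing 9 units of utility while keeping the item in the transaction.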

The Sketch of the IMSICF Algorithm.
The pseudocode of the IMSICF algorithm is shown in Algorithm 1. Initially, the conflict count of each sensitive item in SH is calculated (Line 2). Then, the item with the maximum conflict count is selected as the victim item I_vic (Line 3). After the sanitized item is identified, the sensitive itemsets containing I_vic are hidden one by one. For a sensitive itemset SH_i, the minimum utility to be reduced is computed (Line 5). The utility of SH_i is reduced until diffu is less than or equal to zero. To hide SH_i, the sensitive transactions of SH_i are identified. Then, the transaction weight of each transaction is calculated according to Formula (2). The transaction with the maximum weight is selected as the victim transaction T_vic. This is reasonable since the minimum side effects on nonsensitive information are generated by modifying the selected T_vic (Lines 7-8). Next, the victim item in T_vic is modified, and the database and itemsets are updated, respectively (Lines 9-16). If diffu of SH_i is not greater than zero, the sensitive itemset is removed from SH (Line 18). The algorithm terminates when all sensitive itemsets are hidden.

An Illustrative Example. For a given transaction database in Table 1, the high-utility itemsets derived at minutil = 60 are listed in Table 2. The user-specified sensitive itemsets are {b, e} and {e, f}, which are identified in boldface in Table 2. The proposed algorithm (IMSICF) is applied to hide the sensitive itemsets.
To hide the sensitive itemsets SH = {{b, e}, {e, f}}, the conflict count of each item contained in SH is calculated. The results are Icount(b) = 1, Icount(e) = 2, and Icount(f) = 1. Item e is selected as the victim item because it has the maximum conflict count. Then, the sensitive itemset for hiding is randomly selected from among the ones containing the victim item. Let us assume that the selected itemset is {b, e}.
The minimum utility to be reduced is diffu = 72 − 60 + 1 = 13, and the sensitive transactions are ST = {T_1, T_3, T_4}. The transaction weight of each transaction is assigned according to Formula (2). Because tw(T_1) = 21/4, tw(T_3) = 9/2, and tw(T_4) = 42/6, transaction T_4 is chosen for modification. After the victim item and transaction are identified, the item is sanitized according to the original database sanitization method described in Section 4.1.3.
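The weight computation in this example can be checked with a short sketch; the per-transaction utilities and nonsensitive counts below are the values implied by the stated weights (21/4, 9/2, 42/6), not taken directly from Table 1:

```python
# Transaction weights for the example: diffu = 13, ST = {T1, T3, T4}.
# u({b, e}, T) and NSHC(T) are the values implied by the stated weights.
u_sh = {"T1": 21, "T3": 9, "T4": 42}    # u(SH_i, T) per sensitive transaction
nshc = {"T1": 3, "T3": 1, "T4": 5}      # nonsensitive itemsets supported by T

tw = {t: u_sh[t] / (nshc[t] + 1) for t in u_sh}   # Formula (2)
victim = max(tw, key=tw.get)                      # T4: highest weight, 7.0
```

Note that the three utilities sum to 21 + 9 + 42 = 72 = u({b, e}), consistent with Definition 5.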

Experimental Analysis
To evaluate the performance of the IMSICF algorithm, a series of experiments have been conducted on various real and synthetic datasets, in which IMSICF is compared with the state-of-the-art algorithms. Besides, the experimental results are discussed in this section.

Experimental Data.
The experiments were conducted on a 2.8 GHz Intel Xeon E5-2360 processor with 8 GB RAM. To evaluate the performance of the proposed algorithm, four state-of-the-art sanitization algorithms, namely, HHUIF, MSICF, MSU_MIU, and MSU_MAU, were used for comparison. Because the PPUMGAT algorithm hides sensitive itemsets by transaction deletion, we do not compare the proposed algorithm with it. All of the algorithms were implemented in Java. Four datasets [33] were used to run the programs. The characteristics of these datasets are displayed in Table 3. The density is measured as the average transaction length divided by the number of items. For each dataset, the external utility values were generated with a Gaussian normal distribution, and the internal utility values are random numbers ranging from 1 to 10. The EFIM algorithm [44] was used to mine high-utility itemsets, and the minimum utility thresholds for mushroom, Foodmart, T25I10D10K, and T20I6D100K were set at 8.66%, 0.045%, 0.24%, and 0.17%, respectively. The sensitive itemsets were randomly selected from the mined itemsets.

The proposed algorithm identifies the victim item based on the conflict count of each sensitive item. Thus, the conflict degree of the sensitive itemsets is presented to observe how the correlation among the itemsets influences the performance of the sanitization algorithms. Besides, the sensitive percentage is used to evaluate the scalability of the sanitization approaches. The sensitive percentage is measured as the number of sensitive itemsets divided by the number of high-utility itemsets; this parameter ranges from 0.1% to 0.5%.

[Algorithm 1: The IMSICF algorithm. Input: a database D, a set of sensitive itemsets SH = {SH_1, SH_2, ..., SH_t}, and the given utility threshold minutil. Output: the sanitized database D′.]
Moreover, two-way ANOVA is used to evaluate the differences between the compared approaches. Two-way ANOVA is a comparison of means between groups that have been split on two independent variables (called factors). The P value indicates whether the difference between the sanitization algorithms is significant: if the P value is below 0.05, there is a significant difference between the compared approaches; otherwise, there is no significant difference.

Performance Measurement.
To evaluate efficiency, the execution time of the sanitization process is measured, excluding the data preprocessing stages. In addition, five performance measures are used to evaluate effectiveness, summarized as follows.
(1) Hiding failure (HF): the proportion of the sensitive itemsets that fail to be hidden, which is calculated as HF = |SH(D′)| / |SH(D)|, where SH(D′) and SH(D) are the sensitive high-utility itemsets mined from the sanitized database D′ and the original database D, respectively. (2) Missing cost (MC): the proportion of nonsensitive itemsets that are hidden by accident after sanitization, which is computed as MC = |NSH(D) − NSH(D′)| / |NSH(D)|, where NSH(D′) and NSH(D) are the nonsensitive itemsets discovered from the databases D′ and D, respectively.
(3) Artificial cost (AC): the proportion of artificial itemsets produced by sanitization, which is computed as AC = |H(D′) − H(D)| / |H(D′)|, where H(D′) and H(D) are the high-utility itemsets mined from the databases D′ and D, respectively.
(4) Itemset utility similarity (IUS): it reveals the utility loss for the discovered itemsets caused by the sanitization process, which is calculated as IUS = Σ_{X ∈ HUIs(D′)} u(X) / Σ_{X ∈ HUIs(D)} u(X), where the two sums denote the total utility of the high-utility itemsets discovered from the databases D′ and D, respectively.
(5) Database utility similarity (DUS): it reveals the utility loss for the original database caused by the sanitization process, which is calculated as DUS = Σ_{T_i ∈ D′} tu(T_i) / Σ_{T_i ∈ D} tu(T_i), where the two sums denote the total utility of the databases D′ and D, respectively.
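The five measures can be sketched on hypothetical itemset collections; the sets and utilities below are invented for illustration, using the standard set-difference formulations consistent with the definitions above:

```python
# The five measures on hypothetical itemset collections for D and D'.
def ratio(num, den):
    return num / den if den else 0.0

H_D  = {("a",), ("b", "e"), ("e", "f"), ("c", "d")}  # HUIs mined from D
H_Dp = {("c", "d"), ("a", "c")}                      # HUIs mined from D'
SH   = {("b", "e"), ("e", "f")}                      # sensitive itemsets

NSH_D, NSH_Dp = H_D - SH, H_Dp - SH
HF = ratio(len(SH & H_Dp), len(SH & H_D))    # sensitive itemsets still exposed
MC = ratio(len(NSH_D - NSH_Dp), len(NSH_D))  # nonsensitive itemsets lost
AC = ratio(len(H_Dp - H_D), len(H_Dp))       # spurious itemsets produced

u_D  = {("a",): 30, ("b", "e"): 72, ("e", "f"): 65, ("c", "d"): 40}
u_Dp = {("c", "d"): 38, ("a", "c"): 20}
IUS = ratio(sum(u_Dp.values()), sum(u_D.values()))  # itemset utility similarity

tu_D, tu_Dp = 500, 430                       # total database utilities
DUS = ratio(tu_Dp, tu_D)                     # database utility similarity
```

In this toy case, both sensitive itemsets are hidden (HF = 0), but one nonsensitive itemset is lost (MC = 0.5) and one spurious itemset appears (AC = 0.5).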

Execution Time.
The results of the execution times under various sensitive percentages are plotted in Figure 2. It is clear that the runtime increases with the growth of the sensitive percentage.
This is reasonable because an increasing number of sensitive itemsets requires more transactions to be modified. From Figure 2, we can also observe that the proposed algorithm takes more time than the other algorithms. The reason is that HHUIF, MSICF, MSU_MIU, and MSU_MAU sanitize the database based only on the concept of utility, whereas IMSICF needs to calculate the number of nonsensitive itemsets supported by each sensitive transaction, which costs a lot of time. Besides, note that the runtime on mushroom is longer than on the other datasets because mushroom is a much denser dataset.
The results of the execution times under various conflict degrees for different databases are plotted in Figure 3. We can see that the runtime decreases with the growth of the conflict degree in most cases. This is reasonable because the higher the conflict degree is, the more sensitive itemsets will be concealed at the same time. From Figure 3, we also find that the proposed algorithm IMSICF costs more time than the other algorithms in most cases.
This is because IMSICF takes a lot of time to calculate the number of nonsensitive itemsets supported by each sensitive transaction. However, IMSICF performs the best on Foodmart in Figure 3 because Foodmart is a very sparse dataset compared to the other datasets. A sparse dataset implies that the number of nonsensitive itemsets supported by each transaction is small; thus, the execution time on Foodmart is less than on the other datasets. Based on two-way ANOVA, there is a significant difference between the execution times of the sanitization algorithms for the T25I10D10K and mushroom datasets (P = 0.0037 in Figure 2(a) and P = 0.002 in Figure 2(d)). For the T20I6D100K and Foodmart datasets, there is no significant difference between the sanitization algorithms (P = 0.13 > 0.05 in Figure 2(b) and P = 0.44 in Figure 2(c)).

Missing Costs.
The results of the missing costs under various sensitive percentages are shown in Figure 4. It can be observed that the missing costs increase with the rise of the sensitive percentage due to the increasing number of modified transactions. From the results in Figure 4, it is also clear that the proposed algorithm prevents more nonsensitive itemsets from being overhidden. An important reason is that the transaction with the minimum number of nonsensitive itemsets and the maximum utility of a sensitive itemset is chosen for modification. A second reason is that the conflict count of each item contained in the sensitive itemsets is recalculated once a sensitive itemset is hidden. However, the algorithms MSU_MAU, MSU_MIU, HHUIF, and MSICF select the victim item based only on the value of utility; thus, the side effects on nonsensitive information are disregarded in the sanitization process. Moreover, based on two-way ANOVA, there is a significant difference between the missing costs of the sanitization algorithms for the various datasets (P = 1.46 × 10^−8 in Figure 4(a), P = 9.43 × 10^−6 in Figure 4(b), P = 4.24 × 10^−7 in Figure 4(c), and P = 0.0006 in Figure 4(d)).
The results of the missing costs under various conflict degrees are shown in Figure 5. From the results in Figure 5, it can be observed that the missing costs decrease as the conflict degree increases. The reason is that the higher the conflict degree is, the more sensitive itemsets will be hidden by modifying a victim item. Correspondingly, the number of nonsensitive itemsets concealed by mistake decreases. From Figure 5, we can also find that the proposed algorithm outperforms the other algorithms. This is caused by the sanitization strategy of IMSICF. Besides, it is noted that the number of missing nonsensitive itemsets on mushroom is larger than on the other datasets. The reason is that mushroom is much denser than the other datasets, which indicates that modifying it causes more side effects on nonsensitive itemsets. Moreover, based on two-way ANOVA, there is a significant difference between the missing costs of the sanitization algorithms for the various datasets (P = 0.004 in Figure 5(a), P = 0.021 in Figure 5(b), P = 2.04 × 10^−7 in Figure 5(c), and P = 0.02 in Figure 5(d)).

Itemset Utility Similarity.
The results of the itemset utility similarity under various sensitive percentages for different datasets are plotted in Figure 6. We find that the IUS values decrease with the growth of the sensitive percentage. This is reasonable because an increase in the number of sensitive itemsets causes more transactions to be sanitized; thus, the damage to the nonsensitive itemsets correspondingly increases. From the results in Figure 6, we also observe that the proposed algorithm outperforms the other four algorithms in most cases. The reason is that IMSICF takes the side effects on nonsensitive knowledge into account when identifying the victim transaction, whereas the other algorithms select the victim transaction based on the value of utility, without considering the damage to nonsensitive information caused by the sanitization process. Besides, it is interesting to see that the IUS values of mushroom are lower than those of the other datasets since mushroom is a very dense dataset.
The results of the itemset utility similarity under various conflict degrees for different datasets are plotted in Figure 7. It can be observed that the conflict degree has a great impact on the itemset utility similarity. With the growth of the correlation among the sensitive itemsets, more sensitive itemsets are hidden when a sensitive itemset is sanitized.
Thus, fewer nonsensitive itemsets are concealed in error after the database sanitization, and the IUS values correspondingly increase. From Figure 7, it is also clear that the proposed algorithm outperforms the other algorithms under various conflict degrees. The reason is that the impact on nonsensitive information is considered in the IMSICF algorithm, and the conflict count of each sensitive item is dynamically calculated. In contrast, the other algorithms only take the concept of utility into account in the sanitization process.
Based on two-way ANOVA, there is a significant difference between the IUS values of the sanitization algorithms for the various datasets (P = 0.0046 in Figure 6(a), P = 0.025 in Figure 6(b), P = 1.25 × 10^−7 in Figure 6(c), and P = 0.019 in Figure 6(d)).

Database Utility Similarity.
The results of the database utility similarity under various sensitive percentages are shown in Figure 8. It can be seen that the DUS values decrease with the growth of the sensitive percentage because more items are modified to hide more sensitive itemsets. As shown in Figure 8, we also observe that the IMSICF algorithm outperforms the other approaches except the MSU_MIU algorithm. The reason is that MSU_MIU identifies the victim transaction T_vic with the maximum utility of a sensitive itemset X, and the item contained in X with the minimum utility is selected as the victim item I_vic. In addition, the utility of X is reduced by u(X, T_vic) when I_vic is removed from T_vic. Thus, the database utility similarity of MSU_MIU is higher than that of the other algorithms. However, the proposed algorithm IMSICF selects the sanitized item based on the side effects on nonsensitive information; thus, IMSICF performs worse than MSU_MIU in terms of DUS. Moreover, the HHUIF, MSICF, and MSU_MAU algorithms select the victim item with the maximum utility; hence, these algorithms perform worse than the previous two. Based on two-way ANOVA, there is a significant difference between the DUS values of the sanitization algorithms for the various datasets (P = 3.27 × 10^−8 in Figure 8(a), P = 2.55 × 10^−7 in Figure 8(b), P = 2.43 × 10^−7 in Figure 8(c), and P = 1.11 × 10^−8 in Figure 8(d)).
The results of the database utility similarity under various conflict degrees for different datasets are shown in Figure 9. The DUS values increase as the conflict degree increases. This is reasonable because fewer items are sanitized when the sensitive itemsets have more common items. In Figure 9, we also find that the MSU_MIU algorithm performs the best in terms of DUS under various conflict degrees. The reason is that the item with the minimal utility is chosen for modification. The proposed algorithm IMSICF has better performance than MSU_MAU, HHUIF, and MSICF because those algorithms identify the victim item with the maximum utility. Besides, note that the DUS of mushroom is lower than that of the other datasets because the number of items supported by a transaction in mushroom is much higher than in the other datasets. Moreover, based on two-way ANOVA, there is a significant difference between the DUS values of the sanitization algorithms for the various datasets (P = 0.0016 in Figure 9(a), P = 0.039 in Figure 9(b), P = 1.13 × 10^−6 in Figure 9(c), and P = 0.0015 in Figure 9(d)).
The above experimental results demonstrate that the proposed algorithm IMSICF outperforms the other state-of-the-art algorithms in terms of MC and IUS. The reason is that IMSICF selects the victim transaction based on the side effects on nonsensitive itemsets. Besides, the MSU_MIU algorithm performs better than the other algorithms in terms of DUS. This is reasonable because the victim item with the minimum utility is chosen for modification, and the utility of a sensitive itemset is reduced by the utility of that itemset in the identified transaction when a victim item is removed. In addition, it can be observed that the density of a dataset affects the performance of itemset hiding.

Conclusions
In this paper, an improved sanitization algorithm called IMSICF is proposed for privacy-preserving utility mining.
This algorithm identifies the victim item with the maximum conflict count, which is dynamically computed in the sanitization process. Then, a sensitive itemset containing the victim item is selected to be hidden. The transaction with the maximum utility of the currently hidden sensitive itemset and the minimum count of nonsensitive itemsets is chosen for modification. Hence, the side effects on nonsensitive knowledge are effectively reduced. In our experiments, real and synthetic datasets are used to evaluate the performance of the proposed algorithm. The experimental results show that IMSICF outperforms the state-of-the-art algorithms in missing cost and itemset utility similarity, at the expense of a degradation in efficiency. Besides, it is observed that the conflict degree of sensitive itemsets has a great impact on the performance of the sanitization algorithms. For future work, we will focus on preserving other forms of sensitive knowledge, such as frequent and utility itemsets.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.