Integrated Association Rules Complete Hiding Algorithms

This paper presents a database security approach for the complete hiding of sensitive association rules using six novel algorithms. These algorithms use three new weights to reduce the required database modifications and to support complete hiding, while also reducing knowledge distortion and data distortion. The complete weighted hiding algorithms improve hiding failure by 100 %, and they need only a single scan of the database to gather the information required for the hiding process. The proposed algorithms are built within the database structure, which enables the sanitized database to be generated at run time as needed.


Introduction
The main objective of association rule hiding is to hide the sensitive rules, not the sensitive data [1]. This is done by sanitizing the data so that association rule mining algorithms can still extract all the non-sensitive rules but can no longer extract the sensitive ones. Sanitization is performed by making changes to the original dataset.
Complete hiding means the capability to hide all the sensitive association rules (zero hiding failure). This paper is organized as follows: the association rule hiding process and related work are discussed in Sec. 2 and Sec. 3, respectively. The proposed solution and the experimental results are explained in Sec. 4 and Sec. 5, respectively. Finally, conclusions are given in Sec. 6.

2. Association Rule Hiding Process

Problem Description
The general definition of the problem is as follows. We have a transactional dataset (database) D that contains sensitive information which needs to be protected from inference. Applying an association rule mining algorithm to this dataset, with the parameters Minimum Confidence Threshold (MCT) and Minimum Support Threshold (MST), generates a set of association rules R.
R is divided into two subsets: a set of sensitive rules R_sen that needs to be protected, and a set of non-sensitive rules R_non−sen. The solution is to generate a sanitized database D′ which, when subjected to rule mining techniques, generates a new set of association rules R′. This new set consists of the non-sensitive association rules R′_non−sen and the sensitive rules that could not be hidden, R_non−Hide; in addition, there is a set of lost non-sensitive rules that were not meant to be hidden. Fig. 1 demonstrates the association rule hiding rule sets [5].

Problem Formulation
The following notation is used in the problem formulation:
• I = {i_1, i_2, ..., i_m}: a finite set of m literals. Each member of I is called an item.
• X: an itemset, where X ⊆ I.
• t: a transaction, which is a set of items, t = {i_k | i_k ∈ I, k ≤ m}.
• D = {t_1, t_2, ..., t_n | n ∈ N}: the database, given as a set of transactions.
• An itemset X ⊂ I is supported by a transaction t ∈ D if X ⊆ t.
• sup(X): the support of X, which is the frequency of the itemset X in the database. It is defined as:

$$\mathrm{sup}(X) = \frac{|X(t)|}{|D|} \qquad (1)$$

where X(t) = {t ∈ D | t contains X}. If sup(X) ≥ MST, then the itemset X is described as a frequent itemset.
An association rule is represented as I_L → I_R, where I_L ∩ I_R = ∅ and I_L, I_R ⊂ I; I_R is the RHS (Right Hand Side) itemset and I_L is the LHS (Left Hand Side) itemset. The support of a rule I_L → I_R is the support of the itemset I_L ∪ I_R, as in Eq. (2) [13]:

$$\mathrm{sup}(I_L \rightarrow I_R) = \mathrm{sup}(I_L \cup I_R) \qquad (2)$$

The confidence of the rule I_L → I_R is defined as [13]:

$$\mathrm{conf}(I_L \rightarrow I_R) = \frac{\mathrm{sup}(I_L \cup I_R)}{\mathrm{sup}(I_L)} \qquad (3)$$
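As an illustration of the support and confidence definitions, the following minimal Python sketch evaluates them on a small toy database. The data and the function names (`sup`, `rule_sup`, `conf`) are hypothetical and chosen only for this example; support is taken as a relative frequency, which is an assumption consistent with thresholds expressed in percent.

```python
from fractions import Fraction

# Toy transactional database (hypothetical data, not from the paper).
D = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
]

def sup(itemset, db):
    """Relative support: share of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return Fraction(sum(1 for t in db if itemset <= t), len(db))

def rule_sup(lhs, rhs, db):
    """Support of the rule lhs -> rhs, i.e. the support of lhs ∪ rhs."""
    return sup(set(lhs) | set(rhs), db)

def conf(lhs, rhs, db):
    """Confidence of lhs -> rhs: sup(lhs ∪ rhs) / sup(lhs)."""
    return rule_sup(lhs, rhs, db) / sup(lhs, db)

print(sup({"a"}, D))               # 4/5
print(rule_sup({"a"}, {"b"}, D))   # 3/5
print(conf({"a"}, {"b"}, D))       # 3/4
```

With MST = 5 % and MCT = 50 %, the toy rule a → b would be reported as a strong rule, since its support (60 %) and confidence (75 %) both exceed the thresholds.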
The following property of itemset support holds [13]: if I_L, I_R ⊆ I and I_L ⊆ I_R, then sup(I_L) ≥ sup(I_R). This means that if an itemset is frequent, then all itemsets that are subsets of it are also frequent.
The main hiding approaches, shown in Fig. 2, can be described as follows:
• Decreasing the confidence, as in ISL (Increase Support of LHS) and DSR (Decrease Support of RHS).
• Decreasing the support, as in DSL (Decrease Support of LHS).
Fig. 2: The main hiding approaches.
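The effect of the three approaches on a single rule can be seen on a toy example. The sketch below is only illustrative: the database, the rule {a, c} → {b}, and the helper names (`count`, `support`, `confidence`, `modified`) are hypothetical, and each approach is applied to just one transaction.

```python
from fractions import Fraction

def count(itemset, db):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in db if set(itemset) <= t)

def support(lhs, rhs, db):
    return Fraction(count(set(lhs) | set(rhs), db), len(db))

def confidence(lhs, rhs, db):
    return Fraction(count(set(lhs) | set(rhs), db), count(lhs, db))

def modified(db, index, add=(), remove=()):
    """Return a copy of `db` in which transaction `index` gains/loses items."""
    out = [set(t) for t in db]
    out[index] = (out[index] | set(add)) - set(remove)
    return out

# Hypothetical toy database and sensitive rule {a, c} -> {b}.
D = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b", "c"}, {"a", "c"},
     {"a"}, {"c"}, {"b"}, {"b", "c"}]
LHS, RHS = {"a", "c"}, {"b"}

dsr = modified(D, 0, remove={"b"})  # DSR: drop an RHS item from a fully supporting transaction
isl = modified(D, 4, add={"c"})     # ISL: complete the LHS in a transaction without the RHS
dsl = modified(D, 0, remove={"a"})  # DSL: drop an LHS item from a fully supporting transaction

for name, db in [("original", D), ("DSR", dsr), ("ISL", isl), ("DSL", dsl)]:
    print(name, support(LHS, RHS, db), confidence(LHS, RHS, db))
# original 3/8 3/4
# DSR 1/4 1/2
# ISL 3/8 3/5
# DSL 1/4 2/3
```

In this toy case ISL lowers only the confidence, while DSR and DSL lower both the support and the confidence of the rule; repeating the modification on further transactions eventually pushes the rule below MCT or MST.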

Association Rule Hiding Measures
The performance of association rule hiding algorithms is evaluated with the following commonly used measures, which are applied here to the proposed weighted algorithms.
• Hiding Failure (HF): HF measures the sensitive rules that are not hidden and can still be mined from the sanitized dataset. It is defined as the ratio of the number of sensitive rules that remain discoverable in the sanitized dataset to the total number of sensitive rules to be hidden in the original dataset, as shown in the next equation [5]:

$$HF = \frac{S_R(D')}{S_R(D)} \qquad (4)$$

where D is the original dataset, D′ is the sanitized dataset, and S_R(·) is the number of sensitive association rules discoverable in the given dataset.
• Misses Cost (MC): MC measures the amount of non-sensitive association rules (lost rules) that are hidden by accident after sanitization. It is calculated by counting the non-sensitive rules hidden by the sanitization process and dividing this count by the number of all non-sensitive rules in the original dataset D, using the following formula [5]:

$$MC = \frac{\tilde{S}_R(D) - \tilde{S}_R(D')}{\tilde{S}_R(D)} \qquad (5)$$

where \tilde{S}_R(·) is the number of non-sensitive association rules discoverable in the given dataset.
• Artificial Patterns (AF): AF measures the artificial association rules (ghost rules) that cannot be extracted from the original dataset but can be extracted from the sanitized dataset [11]. These rules are created during the sanitization process due to the addition of noise to the data. AF is calculated by:

$$AF = \frac{|R'| - |R \cap R'|}{|R'|} \qquad (6)$$

where R is the set of association rules discovered in the original database D, R′ is the set of association rules discovered in the sanitized database D′, and |X| denotes the cardinality of X.
• Knowledge Distortion (KD): KD is the total knowledge distortion. It is calculated as the sum of the amount of missing non-sensitive rules (Misses Cost, MC) and the amount of ghost rules (Artificial Patterns, AF) [11], as shown in Eq. (7):

$$KD = MC + AF \qquad (7)$$
• Data Distortion (DD): DD measures the difference between the sanitized database and the original database:

$$DD = \frac{|T_{vi}|}{T_N} \qquad (8)$$

where |T_vi| is the number of victim transactions that are modified in dataset D in order to hide the sensitive rules, and T_N is the total number of transactions in the dataset [11].
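To make the measures concrete, the sketch below computes them from plain sets of rule identifiers. It is a minimal illustration under stated assumptions: rules are represented by hypothetical labels, DD is taken as the ratio of modified transactions to all transactions, and the function names are not from the paper.

```python
from fractions import Fraction

def hiding_failure(sensitive, rules_after):
    """HF: share of sensitive rules still minable from the sanitized database."""
    return Fraction(len(sensitive & rules_after), len(sensitive))

def misses_cost(rules_before, sensitive, rules_after):
    """MC: share of non-sensitive rules lost by sanitization."""
    non_sensitive = rules_before - sensitive
    lost = non_sensitive - rules_after
    return Fraction(len(lost), len(non_sensitive))

def artificial_patterns(rules_before, rules_after):
    """AF: share of rules in the sanitized database that are ghost rules."""
    return Fraction(len(rules_after) - len(rules_before & rules_after), len(rules_after))

def data_distortion(modified_transactions, total_transactions):
    """DD (assumed form): modified victim transactions over all transactions."""
    return Fraction(modified_transactions, total_transactions)

# Hypothetical rule sets: r1, r2 are sensitive; g1 is a ghost rule.
R = {"r1", "r2", "r3", "r4", "r5", "r6"}
R_sen = {"r1", "r2"}
R_after = {"r2", "r3", "r4", "r5", "g1"}

hf = hiding_failure(R_sen, R_after)      # 1/2  (r2 was not hidden)
mc = misses_cost(R, R_sen, R_after)      # 1/4  (r6 was lost)
af = artificial_patterns(R, R_after)     # 1/5  (g1 is new)
kd = mc + af                             # 9/20
dd = data_distortion(3, 100)             # 3/100
print(hf, mc, af, kd, dd)
```

Complete hiding corresponds to `hiding_failure(...) == 0`, that is, no sensitive rule appears in the rule set mined from the sanitized database.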

Related Work
In 2005, S. Wang et al. [2] proposed the DSR (Decrease Support of RHS) algorithm and the ISL (Increase Support of LHS) algorithm. DSR decreases the support and confidence of the sensitive rule below MST and MCT, respectively, in order to hide it. ISL works by raising the support of the LHS of the sensitive rule, so that its confidence is reduced below the MCT. The DSR results show no hiding failure, while ISL may fail when there are no appropriate transactions to modify.
In 2010, Modi et al. [3] created a new algorithm, DSRRC (Decrease Support of RHS Items of Rule Cluster), to reduce hiding side effects by grouping the sensitive rules by similarity of RHS before starting the hiding process. This algorithm has two drawbacks:
• it increases the execution time, because the database must be reordered after each change,
• it does not maintain data quality.
In 2011, Jain et al. [4] proposed a new algorithm that hides a rule by reducing and increasing the support of the RHS and LHS items of the rule at the same time. The advantage of this algorithm is that it uses less processing power than the previous work, because it minimizes the data updates needed to hide a set of rules.
In 2012, Shah et al. [6] proposed RRLR (Remove and Reinsert LHS of Rule) and ADSRRC (Advanced Decrease Support of RHS Items of Rule Cluster) to enhance the performance of DSRRC. ADSRRC and DSRRC group the sensitive rules that share the same RHS. ADSRRC is faster than DSRRC because it starts by sorting the transactions by sensitivity in descending order. RRLR can hide association rules with multiple RHS items.
In 2012, Jain et al. [7] introduced a new method called Representative Rule (RR), in which sensitive rules can be hidden without major changes to the database. It is based on altering the position of items so that the support of the frequent itemsets remains the same. The side effect of RR is the confidence computation for the non-strong rules, i.e. those that have confidence lower than MCT.
In 2013, Domadiya et al. [8] proposed Modified Decrease Support of RHS Items of Rule Clusters (MDSRRC). It can hide rules with multiple items in the LHS and RHS. It begins by deleting the items with the highest values among the sensitive rule items, based on the RHS, which decreases the database modifications. MDSRRC has more benefits than DSRRC, including fewer side effects and improved data quality.
In 2013, Dhutraj et al. [9] introduced a new algorithm that uses both the DSR and ISL methods. This algorithm has two disadvantages:
• it cannot hide association rules with multiple items in the RHS and LHS,
• it has high memory usage.
In 2014, Cheng et al. [10] proposed a hybrid algorithm that combines a data distortion algorithm with a genetic algorithm, named Evolutionary Multiobjective Optimization (EMO). The selection of the items to delete needs more effort. The algorithm can effectively hide sensitive rules while generating fewer side effects, but it suffers from a high count of lost rules.
In 2015, Cheng et al. [11] proposed an improved hybrid algorithm based on EMO that changes the hiding method from deleting items to adding items. It is called the EMO-AddItem algorithm (HypE). It can perform the hiding task with less knowledge distortion for most test cases.

Proposed Solution
In this work, a program is implemented for the Apriori algorithm, which is the most popular algorithm for finding all frequent itemsets and learning association rules. This program is used to generate the association rules from the dataset, and the results are verified by using the Waikato Environment for Knowledge Analysis (WEKA) Apriori associations tool [14]. In addition, six weighted hiding algorithms were designed as follows:
• W_ISL: Weighted Increase Support of LHS,
• W_DSL: Weighted Decrease Support of LHS,
• W_DSR: Weighted Decrease Support of RHS,
• W_C_DSL_DSR_C: Weighted Complete Hiding by integrating W_DSL and W_DSR using the minimum changes method,
• W_C_ISL_DSR_C: Weighted Complete Hiding by integrating W_ISL and W_DSR using the minimum changes method,
• W_C_ISL_DSR_S: Weighted Complete Hiding by integrating W_ISL and W_DSR using the minimum Sensitive Rule Weight (SRW) method.
Algorithms W_ISL, W_DSL and W_DSR support complete hiding under certain conditions that depend on the database, the sensitive rules and the algorithm itself. The integrated algorithms W_C_DSL_DSR_C, W_C_ISL_DSR_S, and W_C_ISL_DSR_C support complete hiding for all sensitive rules.

Results Validations
Validation is performed by comparing the results of the proposed algorithms with those of Algo1.a (based on increasing the support of the left hand side) [15], WSDA (Weight based Sorting Distortion Algorithm) [16], SIF-IDF (Sensitive Items Frequency-Inverse Database Frequency) [17], and the EMO-AddItem algorithm (Evolutionary Multiobjective Optimization using Adding Items) proposed in the work of Cheng et al. 2015 [11]. All results are calculated for the same MST = 5 % and MCT = 50 %, using the same dataset, the same sensitive rules, and the same Apriori settings as applied in Cheng et al. 2015.

Utilizing Victim Transaction Weights in the Six Proposed Weighted Algorithms
• Transaction Frequent Rule Weight (TFRW): Each victim transaction V_i is assigned a transaction weight TFRW(V_i), calculated as the count of the frequent and non-sensitive rules that are fully supported by transaction V_i.
• Non-sensitive Rules Weight (NSRW): Each victim transaction V_i is assigned a non-sensitive rules weight NSRW(V_i), calculated as the count of the frequent and non-sensitive rules that are supported by transaction V_i and can be hidden while applying the sensitive rule hiding changes.
• Sensitive Rules Weight (SRW): Each victim transaction V_i is assigned a sensitive rules weight SRW(V_i), calculated as the count of sensitive rules that can be hidden by using transaction V_i with a given hiding method.

Reuse Victim Transactions (RVT):
RVT is a new method that collects all possible victim transactions for all sensitive rules and then allows the hiding algorithm to select the victim transactions from this collection. It then applies the selection method in a way that supports reusing a victim transaction to hide more than one sensitive rule. It is used with the SRW(V_i) weight to reduce the transaction data distortion and the total database modifications.

Basic Notation and Definitions
• Let sup(R) be the initial support of the rule R.
• Let conf(R) be the initial confidence of the rule R.
• Let T_L be the transactions that support I_L, and let |T_L| be the count of the transactions that support I_L.
• Let SSM be the Support Safety Margin threshold; it is used with the DSL method.
• Let CSM be the Confidence Safety Margin threshold; it is used with the DSR (CSM_DSR) and ISL (CSM_ISL) methods.
• Let Csup(R) be the minimum changes needed to hide a rule by changing the rule support; it represents the minimum number of transactions that must be changed to decrease sup(R) in order to hide rule R.
• Let Cconf(R) be the minimum changes needed to hide a rule by changing the rule confidence; it represents the minimum number of transactions that must be changed to decrease conf(R) in order to hide rule R.
• Transactions which are updated to hide the rule R are called rule victims, victim(R).
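The text does not spell out closed-form expressions for Csup(R) and Cconf(R) at this point, so the following sketch shows one plausible way to obtain them by direct search. It rests on assumptions that should be read as such: a rule counts as hidden once its confidence falls below MCT − CSM (or its support below MST − SSM); one DSR change removes an RHS item from a transaction supporting I_L ∪ I_R; one ISL change makes one more transaction support I_L without supporting I_R; and one DSL change removes an LHS item from a transaction supporting I_L ∪ I_R. The function names and the counts in the example are hypothetical.

```python
from fractions import Fraction

def min_changes_dsr(n_lr, n_l, mct, csm):
    """Smallest k with (n_lr - k) / n_l < mct - csm."""
    for k in range(n_lr + 1):
        if Fraction(n_lr - k, n_l) < mct - csm:
            return k
    return None  # only if the target threshold is not positive

def min_changes_isl(n_lr, n_l, n_total, mct, csm):
    """Smallest k with n_lr / (n_l + k) < mct - csm."""
    for k in range(n_total - n_l + 1):
        if Fraction(n_lr, n_l + k) < mct - csm:
            return k
    return None  # not enough remaining transactions

def min_changes_dsl(n_lr, n_total, mst, ssm):
    """Smallest k with (n_lr - k) / n_total < mst - ssm."""
    for k in range(n_lr + 1):
        if Fraction(n_lr - k, n_total) < mst - ssm:
            return k
    return None  # only if the target threshold is not positive

# Thresholds and margins as listed among the algorithm inputs of the paper.
MST, MCT = Fraction(5, 100), Fraction(50, 100)
SSM, CSM_ISL, CSM_DSR = Fraction(1, 10000), Fraction(8, 10000), Fraction(9, 10000)

# Hypothetical counts: 100 transactions, 40 support I_L, 30 support I_L ∪ I_R.
n_total, n_l, n_lr = 100, 40, 30
c_dsr = min_changes_dsr(n_lr, n_l, MCT, CSM_DSR)
c_isl = min_changes_isl(n_lr, n_l, n_total, MCT, CSM_ISL)
c_dsl = min_changes_dsl(n_lr, n_total, MST, SSM)
print(c_dsr, c_isl, c_dsl)  # 11 21 26
```

In this example the confidence-based DSR change count is the smallest, which is the kind of comparison the integrated algorithms rely on when they pick a hiding method per rule.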

1) For the W_DSR method: the W_DSR confidence hiding minimum changes are applied to the transactions that fully support I_L and I_R (T_LR). The condition for W_DSR complete hiding is defined as |T_LR| ≥ Cconf(R_i), where |T_LR| is the count of transactions that fully support I_L and I_R.

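2) For the W_ISL method: the W_ISL confidence hiding minimum changes are applied to the transactions that partially support I_L but do not support I_R (T_LpRn). The condition for W_ISL complete hiding is defined as |T_LpRn| ≥ Cconf(R_i), where |T_LpRn| is the count of transactions that partially support I_L but do not support I_R.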

3) For the W_DSL method: the W_DSL support hiding minimum changes are applied to the transactions that fully support I_L and I_R (T_LR). The condition for W_DSL complete hiding is defined as |T_LR| ≥ Csup(R_i), where |T_LR| is the count of transactions that fully support I_L and I_R.

4) For the W_C_DSR method: the complete hiding condition is |T_LR| < Cconf(R_i), so we need to increase the number of transactions that fully support I_L and I_R (T_LR) by making (Cconf(R_i) − |T_LR|) changes in transactions that do not support I_L but support I_R.

5) For the W_C_ISL method: the complete hiding condition is |T_LpRn| < Cconf(R_i), so we need to increase the number of transactions that partially support I_L but do not support I_R (T_LpRn) by making (Cconf(R_i) − |T_LpRn|) changes in transactions that do not support I_L but support I_R.

6) For the W_C_DSL method: the complete hiding condition is |T_LR| < Cconf(R_i), so we need to increase the number of transactions that fully support I_L and I_R (T_LR) by making (Cconf(R_i) − |T_LR|) changes in transactions that support I_L but do not support I_R.

Victim Transaction Weights:
1) Transaction Frequent Rule Weight TFRW ("the lower the better"): each victim transaction V_i is assigned a transaction weight TFRW(V_i), which is calculated as the count of frequent and non-sensitive rules that are fully supported by transaction V_i:

$$TFRW(V_i) = |RF(V_i)|$$

where |RF(V_i)| is the count of frequent and non-sensitive rules that are fully supported by victim transaction V_i.
2) Non-sensitive Rules Weight NSRW ("the lower the better"): each victim transaction V_i is assigned a non-sensitive rules weight NSRW(V_i), which is calculated as the count of frequent and non-sensitive rules that are supported by transaction V_i and can be hidden while applying the sensitive rule hiding changes.
For W_DSR, the algorithm selects all frequent and non-sensitive rules that are fully supported by transaction V_i and whose RHS is the same as I_R of the sensitive rule R_i.
For W_DSL, the algorithm selects all frequent and non-sensitive rules that are fully supported by transaction V_i and whose LHS is the same as I_L of the sensitive rule R_i.
For W_ISL, the algorithm selects all frequent and non-sensitive rules that are fully supported by transaction V_i and whose LHS is the same as the new I_L item values for the sensitive rule R_i.

3) Sensitive Rules Weight SRW ("the higher the better"): each victim transaction V_i is assigned a sensitive rules weight SRW(V_i), which is calculated as the count of sensitive rules that can be hidden by using transaction V_i with a hiding method:

$$SRW(V_i) = |Sens\_Rules(V_i)|$$

where the Sens_Rules function is calculated with respect to the given hiding method.
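The three weights can be sketched for the W_DSR case as simple counts over rule lists. This is an illustrative sketch only: the `Rule` type, the helper names, and the toy rules are hypothetical, and it assumes that a transaction can be used to hide a sensitive rule under DSR exactly when it fully supports that rule.

```python
from typing import FrozenSet, NamedTuple

class Rule(NamedTuple):
    lhs: FrozenSet[str]
    rhs: FrozenSet[str]

def rule(lhs, rhs):
    return Rule(frozenset(lhs), frozenset(rhs))

def fully_supports(t, r):
    """True if transaction t contains every item of the rule's LHS and RHS."""
    return (r.lhs | r.rhs) <= t

def tfrw(t, non_sensitive):
    """TFRW: non-sensitive frequent rules fully supported by t (lower is better)."""
    return sum(1 for r in non_sensitive if fully_supports(t, r))

def nsrw_dsr(t, sensitive_rule, non_sensitive):
    """NSRW for W_DSR: supported non-sensitive rules sharing the sensitive RHS."""
    return sum(1 for r in non_sensitive
               if fully_supports(t, r) and r.rhs == sensitive_rule.rhs)

def srw_dsr(t, sensitive):
    """SRW for W_DSR (assumed): sensitive rules fully supported by t (higher is better)."""
    return sum(1 for r in sensitive if fully_supports(t, r))

# Hypothetical victim transaction and rule sets.
V = {"a", "b", "c", "d"}
non_sensitive = [rule("a", "c"), rule("c", "d"), rule("b", "c"), rule("a", "e")]
sensitive = [rule("b", "d"), rule("a", "d"), rule("e", "d")]

print(tfrw(V, non_sensitive))                    # 3
print(nsrw_dsr(V, sensitive[0], non_sensitive))  # 1
print(srw_dsr(V, sensitive))                     # 2
```

Here removing item d from V would help hide two sensitive rules at once (SRW = 2) while putting one non-sensitive rule with the same RHS at risk (NSRW = 1).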

Proposed Weighted Hiding Association Rule Algorithms
In this section, all proposed algorithms are explained. The inputs and outputs of the algorithms are summarized as follows.
Algorithm inputs:
• a finite transaction database D,
• MST = 5 % and MCT = 50 %; SSM for the DSL method = 0.0001, CSM for the ISL method = 0.0008, and CSM for the DSR method = 0.0009,
• the set R_sen of sensitive rules (10 rules).
Algorithm output: a sanitized database D′.

1) W_ISL: Weighted Increase Support of LHS
This method is based on increasing the support of the sensitive rule LHS by updating the selected transactions that partially support the rule LHS and do not support the rule RHS. Complete hiding is achieved if the number of available transactions is higher than or equal to the hiding minimum changes Cconf(R_i).

2) W_DSL: Weighted Decrease Support of LHS
This method is based on decreasing the support of the sensitive rule LHS by updating the selected transactions that fully support the rule LHS and RHS. Complete hiding is achieved if the number of available transactions is higher than or equal to the hiding minimum changes Csup(R_i).

3) W_DSR: Weighted Decrease Support of RHS
This method is based on decreasing the support of the sensitive rule RHS by updating the selected transactions that fully support the rule LHS and RHS. Complete hiding is achieved if the number of available transactions is higher than or equal to the hiding minimum changes Cconf(R_i).
Main steps of the weighted hiding algorithms: the main steps, and the differences when the RVT method is applied, are shown in Fig. 2 and Fig. 3 and are explained as follows:
• calculate the hiding minimum changes required for hiding by this method,
• calculate TFRW(V_i), SRW(V_i), and NSRW(V_i) for all victim transactions of this method,
• for each sensitive rule R_i in R_sen, select a number of transactions equal to the hiding minimum changes (Cconf(R_i) for W_ISL and W_DSR, or Csup(R_i) for W_DSL), ordered by NSRW(V_i) ascending, SRW(V_i) descending, and TFRW(V_i) ascending,
• get the final transaction change set by grouping the common transactions for all sensitive rules, where a common transaction is a transaction that is used to hide more than one sensitive rule,
• update the database with the final transaction change set to get the sanitized database D′.

Notes:
• if the number of available transactions in step 3 is less than the hiding minimum changes, then hiding failure occurs for this sensitive rule,
• we tested different ordering methods for step 3, such as TFRW ascending, SRW(V_i) descending, or SRW(V_i) descending and TFRW ascending,
• SRW(V_i) and the grouping of transactions in step 4 are used to reduce the transaction data distortion and the total database modifications,
• NSRW(V_i) or TFRW(V_i), or both, are used to reduce the knowledge distortion.
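The selection and grouping steps can be sketched as follows. This is a simplified illustration, not the PL/SQL implementation used in the paper: the candidate records, their weights, and the name `select_victims` are hypothetical, and the minimum changes per rule are assumed to be already known.

```python
from collections import defaultdict

def select_victims(candidates, min_changes):
    """Pick victims per rule and group the transactions shared by several rules.

    candidates:  {rule_id: [{"tid": ..., "nsrw": ..., "srw": ..., "tfrw": ...}, ...]}
    min_changes: {rule_id: number of transactions that must be changed}
    Returns (plan, failed): plan maps tid -> rules hidden via that transaction,
    failed lists rules without enough candidate transactions (hiding failure).
    """
    plan = defaultdict(list)
    failed = []
    for rule_id, cands in candidates.items():
        need = min_changes[rule_id]
        if len(cands) < need:
            failed.append(rule_id)
            continue
        # Order by NSRW ascending, SRW descending, TFRW ascending.
        ranked = sorted(cands, key=lambda c: (c["nsrw"], -c["srw"], c["tfrw"]))
        for c in ranked[:need]:
            plan[c["tid"]].append(rule_id)
    return dict(plan), failed

# Hypothetical candidate victim transactions with precomputed weights.
candidates = {
    "R1": [{"tid": 1, "nsrw": 0, "srw": 2, "tfrw": 5},
           {"tid": 2, "nsrw": 0, "srw": 1, "tfrw": 3},
           {"tid": 3, "nsrw": 1, "srw": 2, "tfrw": 1}],
    "R2": [{"tid": 1, "nsrw": 0, "srw": 2, "tfrw": 5},
           {"tid": 4, "nsrw": 2, "srw": 1, "tfrw": 2}],
    "R3": [{"tid": 5, "nsrw": 0, "srw": 1, "tfrw": 0}],
}
min_changes = {"R1": 2, "R2": 1, "R3": 2}

plan, failed = select_victims(candidates, min_changes)
print(plan)    # {1: ['R1', 'R2'], 2: ['R1']}
print(failed)  # ['R3']
```

Transaction 1 is reused for two sensitive rules, so only two transactions are modified in total; rule R3 has fewer candidates than required changes and is reported as a hiding failure.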

4) W_C_DSL_DSR_C
With this method, we calculate the hiding minimum changes for W_DSL and W_DSR and check whether enough transactions are available for complete hiding. We then select, for each sensitive rule, the hiding method that supports complete hiding and achieves the minimum changes when the two methods are compared.
Steps of the algorithm:
• select the hiding method for each sensitive rule R_i based on support of complete hiding and minimum changes,
• calculate TFRW(V_i), SRW(V_i), and NSRW(V_i) for all victim transactions of the R_i rules hidden by the W_DSL method,
• calculate TFRW(V_i), SRW(V_i), and NSRW(V_i) for all victim transactions of the R_i rules hidden by the W_DSR method,
• for each sensitive rule R_i in R_sen, select a number of transactions equal to the hiding minimum changes required for its hiding method. This selection is based on ordering all suitable transactions by NSRW(V_i) ascending, SRW(V_i) descending, and TFRW ascending,
• get the final transaction change set by grouping the common transactions for all sensitive rules,
• update the database with the final transaction change set to get the sanitized database D′.
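The per-rule method selection in the first step can be sketched as a small function. The inputs are assumed to be precomputed: for each candidate method, the number of available victim transactions and the hiding minimum changes. The function name, the tie-breaking by method name, and the example figures are hypothetical.

```python
def choose_method(options):
    """Return the method that allows complete hiding with the fewest changes.

    options: {method_name: (available_transactions, min_changes)}
    A method is feasible when available_transactions >= min_changes.
    Ties are broken by method name (an arbitrary choice for this sketch).
    Returns None when no method supports complete hiding.
    """
    feasible = {m: need for m, (avail, need) in options.items() if avail >= need}
    if not feasible:
        return None
    return min(feasible, key=lambda m: (feasible[m], m))

# Hypothetical per-rule figures: (available transactions, minimum changes).
rules = {
    "R1": {"W_DSL": (40, 26), "W_DSR": (40, 11)},
    "R2": {"W_DSL": (5, 3), "W_DSR": (5, 8)},
    "R3": {"W_DSL": (2, 4), "W_DSR": (2, 6)},
}
chosen = {r: choose_method(opts) for r, opts in rules.items()}
print(chosen)  # {'R1': 'W_DSR', 'R2': 'W_DSL', 'R3': None}
```

The same shape of comparison would apply when the candidate pair is W_ISL and W_DSR; only the way the available transactions and minimum changes are computed differs.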

5) W_C_ISL_DSR_C
With this method, we calculate the hiding minimum changes for W_ISL and W_DSR and check whether enough transactions are available for complete hiding. We then select, for each sensitive rule, the hiding method that supports complete hiding and achieves the minimum changes when the two methods are compared.
Steps of the algorithm:
• calculate Cconf(R_i) for both methods, W_ISL and W_DSR,
• select the hiding method for each sensitive rule R_i based on support of complete hiding and minimum changes,
• calculate TFRW(V_i), SRW(V_i), and NSRW(V_i) for all victim transactions of the R_i rules hidden by the W_DSR method,
• for each sensitive rule R_i in R_sen, select a number of transactions equal to the hiding minimum changes required for its hiding method. This selection is based on ordering all suitable transactions by NSRW(V_i) ascending, SRW(V_i) descending, and TFRW ascending,
• get the final transaction change set by grouping the common transactions for all sensitive rules,
• update the database with the final transaction change set to get the sanitized database D′.

6) W_C_ISL_DSR_S
With this method, we calculate the hiding minimum changes for W_ISL and W_DSR and check whether enough transactions are available for complete hiding. The hiding method for each sensitive rule is selected based on its support of complete hiding and on having the minimum SRW of the two methods.
Steps of the algorithm:
• calculate Cconf(R_i) for both methods, W_ISL and W_DSR,
• select the hiding method for each sensitive rule R_i based on support of complete hiding and minimum SRW(R_i),
• calculate TFRW(V_i), SRW(V_i), and NSRW(V_i) for all victim transactions of the R_i rules hidden by the W_ISL method,
• calculate TFRW(V_i), SRW(V_i), and NSRW(V_i) for all victim transactions of the R_i rules hidden by the W_DSR method,
• for each sensitive rule R_i in R_sen, select a number of transactions equal to the hiding minimum changes required for its hiding method. This selection is based on ordering all suitable transactions by NSRW(V_i) ascending, SRW(V_i) descending, and TFRW ascending,
• get the final transaction change set by grouping the common transactions for all sensitive rules,
• update the database with the final transaction change set to get the sanitized database D′.

Experimental Setup
The proposed approaches use Oracle Database 11g and Procedural Language/Structured Query Language (PL/SQL) 11.0.2.0, and run on an Intel i5 660 CPU with four processors at 3.33 GHz and 4 GB of main memory. We performed extensive experiments on a real dataset. The experimental results are based on the following measures:
• hiding failure: the amount of sensitive rules that fail to be hidden (the lower the better),
• knowledge distortion: the sum of the two measures of lost non-sensitive rules and ghost rules (the lower the better),
• data distortion (data loss): the amount of transaction changes needed to obtain the sanitized database (the lower the better).

Used Dataset
We examined the proposed algorithms using the Mushroom dataset prepared by Roberto Bayardo, which was chosen to represent a real database. The characteristics of the Mushroom dataset and the parameter settings [11] are shown in Tab. 1.

Conclusions
The novel improved algorithms show better results compared to EMO-AddItem, Algo1.a, WSDA, and SIF-IDF. The results of the weighted algorithms W_ISL, W_DSL, and W_DSR show that only W_ISL does not support complete hiding, because the hiding results of an algorithm depend on the hiding method and its implementation in the hiding algorithm, the sensitive rules, and the dataset. We therefore proposed the integrated algorithms to achieve complete hiding. The integrated algorithms W_C_ISL_DSR_C and W_C_ISL_DSR_S achieve complete hiding for the W_ISL algorithm and enhance both the KD and DD measures. The W_C_DSL_DSR_C algorithm enhances the KD and DD measures for W_DSL.
The grouping of common victim transactions together with the SRW, TFRW selection method achieved lower data distortion with the W_ISL and W_DSR algorithms. Changing the selection method enhances either the knowledge distortion (KD) or the data distortion (DD) measure, so the selection method can be chosen according to the problem requirements. These algorithms need only a single scan of the database to hide the sensitive rules, so the victim transaction weights are applied to all victim transactions, which makes the hiding more effective. Integrating these algorithms into the database structure adds a new capability to generate the sanitized database at run time when required.

Mohamed Refaat ABDELLAH received a Bachelor degree in engineering from the Military Technical College (MTC), Cairo, Egypt, in 1994 and a Master's degree in engineering from the Computer Engineering Department, Cairo University, Egypt, in 2011. He is currently a Ph.D. student in the Computer Engineering Department at MTC. His research interests are in data mining, database security and data hiding techniques.
Hesham Aboelsoud MOHAMED received Bachelor and Master's degrees in computer engineering from the MTC, Cairo, Egypt, in 1993 and 2000, respectively. He also received a Ph.D. degree in Systems and Biomedical Engineering from the Military Technical College in 2006. He is currently a faculty member in the Department of Computer Engineering, MTC. His research interests are in digital image processing, computer system security, database security and data hiding techniques.
Khaled Shafee BADRAN received Bachelor and Master's degrees in computer engineering from the MTC, Cairo, Egypt, in 1995 and 2000, respectively. He also received a Ph.D. degree in Electrical and Computer Engineering from Sheffield University, UK, in 2009. He is currently a faculty member in the Department of Computer Engineering, MTC. His research interests are in data mining, semantic web and database security.
Mohamed Badr SENOUSY received a Bachelor degree in engineering from the MTC, Cairo, Egypt, in 1973. He also received Master's and Ph.D. degrees from George Washington University, Washington DC, USA, in 1982 and 1985, respectively. He is currently a faculty member in the Department of Computer and Information System, Sadat Academy, Cairo, Egypt. His research interests are operating systems, software engineering and computer security.
Tab. 1: Characteristics of Mushroom dataset and parameter setting.