Frequent Itemset Mining of User’s Multi-attribute Under Local Differential Privacy

Frequent itemset mining is an essential problem in data mining and plays a key role in many data mining applications. However, users’ personal privacy can be leaked in the mining process. In recent years, applying local differential privacy protection models to frequent itemset mining has become a relatively reliable and secure protection method. Local differential privacy means that users first perturb their original data and then send the perturbed data to the aggregator, which prevents the aggregator from learning the users’ private information. We propose a novel framework that implements frequent itemset mining under local differential privacy and is applicable to users with multiple attributes. The framework uses bitmap encoding to convert each user’s original data into a binary string, chooses the best perturbation algorithm for each user attribute, and uses the frequent pattern tree (FP-tree) algorithm to mine frequent itemsets. Finally, we incorporate the threshold random response (TRR) algorithm into the framework, compare it with existing algorithms, and demonstrate that the TRR algorithm achieves higher accuracy in mining frequent itemsets.


Introduction
Mining frequent itemsets is useful in many settings. For example, from the purchase records of people buying cold medicine, one can learn whether they also buy antipyretics or cough medicines. By collecting data from multiple users, the degree of association among these drugs can be obtained, so a doctor can select the appropriate drug purchase ratio based on these records. However, users do not want to disclose sensitive personal information during the purchase process. In this study, we aim to mine frequent itemset information while ensuring that the user's personal information is not revealed. There are many ways to protect privacy, such as data scrambling, data encryption, and data anonymization. Existing anonymity-based privacy protection models require some background knowledge and special attack assumptions. Moreover, they cannot quantify the strength of privacy protection, so they have significant limitations in actual applications. To address the shortcomings of the anonymity-based privacy protection models, Dwork [Dwork (2006)] proposed a differential privacy model that randomly perturbs the published data. In a statistical sense, regardless of the background knowledge that the attacker possesses, identifying whether a record exists in the original data table is impossible. In this model, the user sends the original data to a trusted third-party aggregator; the aggregator then perturbs the user data and publishes it. However, when the aggregator is untrustworthy or is attacked by a malicious third party, the user's personal privacy information can be leaked. To overcome this shortcoming, Kasiviswanathan et al. [Kasiviswanathan, Lee and Nissim (2011)] proposed local differential privacy, which moves the perturbation step forward to the user side. Each user first perturbs the data locally and then sends it to the third-party aggregator. In this process the aggregator can be untrusted because it only analyzes the perturbed data and publishes the result.
Randomized aggregatable privacy-preserving ordinal response (RAPPOR) is the most convincing application of the local differential privacy model, and Erlingsson et al. [Erlingsson, Pihur and Korolova (2014)] have deployed RAPPOR in Google Chrome. In addition to Google, Thakurta et al. [Thakurta, Vyrros and Vaishampayan (2017)] applied for a patent on protecting users' privacy in 2017, and the technique was deployed in iOS. Other electronics companies such as Samsung have also proposed similar privacy protection systems. Companies are beginning to value the protection of user privacy data. Currently, most papers focus on privacy protection for a single frequent item, where each user has only one attribute. However, this situation does not hold in practical applications. For example, the user's personal information may include not only the name or age, but also the education and income, and these attributes may be related to each other. For such complex user data, we propose a novel framework to implement frequent itemset mining under local differential privacy.
Our contributions are as follows: 1) For users with multiple attributes, we propose a complete framework to mine frequent itemsets under local differential privacy. 2) We consider that different attributes have different ranges of possible values, such as gender and income. For this case, we choose different perturbation algorithms to achieve the best data availability. 3) Finally, the aggregator uses the FP-tree algorithm to mine frequent itemsets, and we compare the results with those of existing methods.
The rest of our paper is organized as follows. Section 2 introduces related work. Section 3 describes the problem and defines local differential privacy. Section 4 introduces two mechanisms that satisfy local differential privacy. Section 5 proposes the TRR algorithm for users with multiple attributes. Section 6 describes how the aggregator uses the FP-tree algorithm to mine frequent itemsets and compares the TRR algorithm with the existing methods. Section 7 concludes the paper.

Related work
In recent years, with the explosive growth of data and the rapid development of information technology, various industries have accumulated large amounts of data through various channels, and they are beginning to value the protection of user privacy data. Aiming at the security and privacy issues in cloud computing, for instance, Min et al. [Min, Yang, Wang et al. (2019)] proposed a homomorphic encryption algorithm, which utilizes the characteristics of multi-nodes and matrix multiplication for parallel encryption. As for the problem of privacy leakage in the smart grid, He et al. [He, Zeng, Xie et al. (2017)] proposed a random linear network coding scheme to protect user privacy and effectively prevent traffic analysis. Furthermore, for the leakage of location privacy information, Gu et al. [Gu, Yang and Yin (2018)] proposed a multi-level query tree structure to publish location data in a database, and added an exponential mechanism to the query results. In addition, for the privacy leakage of personalized recommendation service systems, Yin et al. [Yin, Shi, Sun et al. (2019)] proposed an efficient privacy-preserving collaborative filtering algorithm based on differential privacy protection and a time factor. To ensure privacy in data mining, Wong et al. [Wong, Li and Fu (2006)] proposed a traditional method based on k-anonymity and Li [Li (2007)] proposed its extended models. These methods require certain assumptions, and it is difficult to protect privacy when the assumptions are violated. The weakness of k-anonymity and its extended models is that there is no strict definition of the attack model, and the knowledge of the attacker cannot be quantitatively defined. To pursue strict privacy analysis, Dwork [Dwork (2006)] proposed a strong privacy protection model called central differential privacy. It is independent of the background knowledge of the attacker and has proved to be very useful.
For centralized datasets, Wong et al. [Wong, Cheung and Hung (2007)] proposed using a 1-to-n encryption method to change original itemsets to protect data privacy when outsourcing frequent itemset mining. Qiu et al. [Qiu, Li and Wu (2006)] proposed an algorithm that transforms business information into very long binary vectors and a series of random mapping functions based on bloom filters. Tai et al. [Tai, Yu and Chen (2010)] proposed a k-support anonymity-based frequent itemset mining algorithm. All these methods sacrifice the precision of the mining result. Because these traditional approaches are based on heuristics, a solid privacy guarantee is missing. Therefore, researchers began to investigate frequent itemset mining with differential privacy. Bhaskar et al. [Bhaskar, Laxman and Smith (2010)] presented two mining algorithms, which are representative of frequent itemset mining with differential privacy. Cheng et al. [Cheng, Su and Xu (2015)] applied differential privacy protection to the Apriori mining algorithm. Xiong et al. [Xiong, Chen and Huang (2018)] applied differential privacy protection to the FP-tree mining algorithm, which can reduce the number of database traversals. These are frequent itemset mining protection methods based on central differential privacy. Many previous studies used local differential privacy to solve the heavy hitter problem rather than frequent itemset mining. Heavy hitter identification considers only the frequency of occurrence of individual items and does not consider the relationship between frequent items. Examples include papers [Bassily and Smith (2015); Bassily, Nissim and Stemmer (2017); Bun, Nelson and Stemmer (2018)], which solve the heavy hitter problem under local differential privacy and assume only a single attribute per user. Wang et al. [Wang and Li (2018)] considered that there may be differences in the thresholds of attributes, but they still only considered a single attribute and heavy hitters.
Herein, we study local differential privacy protection in frequent itemset mining and consider multiple user attributes and the associations among their values. We use two important local differential privacy mechanisms, the RAPPOR [Erlingsson, Pihur and Korolova (2014)] mechanism and the randomized response (RR) [Kairouz, Oh and Viswanath (2014)] mechanism. Zhang et al. [Zhang, Huang and Fang (2017)] proposed the Multiple Randomized Response (MRR) algorithm, which applies personalized differential privacy to mine frequent itemsets. In MRR, the RR algorithm is selected when an attribute requires a low degree of protection, and the RAPPOR algorithm is selected when an attribute requires a high degree of protection. Although the MRR algorithm handles multiple attribute values, it does not consider that the thresholds of different attributes may differ greatly. Its use of the RR and RAPPOR algorithms is also flawed, because its theoretical basis is derived from data distribution estimation, whereas the choice between the RR and RAPPOR algorithms in frequent itemset mining is closely related to the attribute threshold. Our proposed TRR algorithm addresses these shortcomings. Finally, to reduce the number of database traversals, the aggregator uses an FP-tree to mine frequent itemsets.

Problem definition
Let $I = \{i_1, i_2, \ldots, i_n\}$ be the set of all items in the transaction database, a transaction $T$ be a set of some items ($T \subseteq I$), and a database $D = \{T_1, T_2, \ldots, T_m\}$ be the set of transactions. Each $P$ where $P \subseteq I$ is called an "itemset", and $P$ is also called a $k$-itemset, where $|P| = k$. Transaction $T$ contains an itemset $P$ if and only if $P \subseteq T$; the support of $P$, which is denoted as $\mathrm{support}(P)$, is defined as the percentage of transactions in $D$ containing $P$. Let min-support be the user-defined minimum support threshold. $P$ is a frequent itemset if and only if $\text{min-support} \le \mathrm{support}(P)$. Given database $D$ and the min-support threshold, the frequent itemset mining task is defined as "discovering all frequent itemsets with their supports."
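As a minimal illustration of these definitions, the following Python sketch computes supports and enumerates frequent itemsets by brute force. The function names and the toy database are hypothetical and chosen only for illustration; brute-force enumeration is practical only for very small item sets.

```python
from itertools import combinations

def support(itemset, database):
    """Fraction of transactions in `database` that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in database) / len(database)

def frequent_itemsets(database, min_support, max_size=3):
    """Brute-force enumeration; only meant for tiny illustrative databases."""
    items = sorted(set().union(*database))
    result = {}
    for k in range(1, max_size + 1):
        for candidate in combinations(items, k):
            s = support(candidate, database)
            if s >= min_support:
                result[frozenset(candidate)] = s
    return result

# Hypothetical toy database of four transactions.
D = [frozenset(t) for t in (
    {"cold", "cough"},
    {"cold", "cough", "fever"},
    {"cold", "fever"},
    {"cough"},
)]
freq = frequent_itemsets(D, min_support=0.5)
```

With min-support set to 0.5, the pair {cold, cough} is frequent because it appears in two of the four transactions, whereas {cough, fever} appears in only one and is discarded.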

Local differential privacy
Because the third-party aggregator may be untrustworthy, we use local differential privacy to perturb the data on the user side, and the user can also choose different perturbation methods according to the degree of protection required for the original data. The aggregator is only responsible for collecting and analyzing data and publishing the overall model.
Definition 1 (Local Differential Privacy [Kasiviswanathan, Lee and Nissim (2011)]). For any inputs $v_1$ and $v_2$, let $y$ be an output obtained by an algorithm $M$. If the following inequality is satisfied for every $y$, we say that algorithm $M$ satisfies $\varepsilon$-local differential privacy:
$$\Pr[M(v_1) = y] \le e^{\varepsilon} \cdot \Pr[M(v_2) = y] \qquad (1)$$
Here, $\Pr[\cdot]$ denotes the output probability of the algorithm and $\varepsilon$ represents the privacy budget, which is inversely related to the degree of privacy protection. Specifically, a smaller $\varepsilon$ means higher privacy protection of the user data, and a larger $\varepsilon$ means lower privacy protection.
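The inequality in Definition 1 can be checked numerically for any mechanism with a finite input and output domain. The sketch below is illustrative only: it assumes the mechanism is given as a table `channel[v][y]` of output probabilities (a representation chosen here, not taken from the paper) and uses binary randomized response as the example mechanism.

```python
import math

def satisfies_ldp(channel, epsilon, tol=1e-12):
    """Check Pr[M(v1)=y] <= e^eps * Pr[M(v2)=y] for all inputs v1, v2 and outputs y.

    `channel[v][y]` is the probability that input v is reported as output y.
    """
    bound = math.exp(epsilon)
    n_outputs = len(channel[0])
    for y in range(n_outputs):
        column = [row[y] for row in channel]
        # The worst case for output y is the largest probability against the smallest.
        if max(column) > bound * min(column) + tol:
            return False
    return True

def binary_rr_channel(epsilon):
    """Binary randomized response: report the truth with probability e^eps / (1 + e^eps)."""
    p = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return [[p, 1.0 - p], [1.0 - p, p]]

eps = 1.0
channel = binary_rr_channel(eps)
```

A mechanism calibrated for budget 1.0 passes the check for any budget of at least 1.0 and fails it for a smaller budget such as 0.5, which matches the statement that a smaller budget means stronger protection.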

Sequence combination
Local differential privacy inherits the sequence combination (sequential composition) property of central differential privacy. When a single user satisfies local differential privacy and different users may select different perturbation algorithms, multiple users can still satisfy local differential privacy, as follows: Given a dataset $D$ and $n$ privacy algorithms $\{M_1, M_2, \ldots, M_n\}$, where each $M_i$ ($1 \le i \le n$) satisfies $\varepsilon_i$-local differential privacy, the sequence combination of the $M_i(D)$ satisfies $\left(\sum_{i=1}^{n} \varepsilon_i\right)$-differential privacy.

Random disturbance mechanism

k-RR random perturbation algorithm
The binary randomized response (RR) perturbation algorithm was proposed earlier and applies mainly to attributes with two values. For example, suppose we want to count the number of people with AIDS, and users respond to whether they have AIDS. If a user has AIDS, the probability of answering "yes" is $\frac{e^{\varepsilon}}{1 + e^{\varepsilon}}$, and the probability of answering "no" is $\frac{1}{1 + e^{\varepsilon}}$. If a user does not have AIDS, the probability of answering "no" is $\frac{e^{\varepsilon}}{1 + e^{\varepsilon}}$, and the probability of answering "yes" is $\frac{1}{1 + e^{\varepsilon}}$. However, binary RR can only be applied to a relatively simple user attribute. If the user attribute is more complex than a binary attribute, binary RR is no longer sufficient. The $k$-RR perturbation algorithm was proposed to improve on binary RR; when $k = 2$, binary RR is a special case of $k$-RR. We use $M_{k\text{-RR}}$ to represent the random perturbation algorithm. The specific perturbation probability of $k$-RR is as follows:
$$\Pr[M_{k\text{-RR}}(v) = y] = \begin{cases} p = \dfrac{e^{\varepsilon}}{k - 1 + e^{\varepsilon}}, & y = v \\[2ex] q = \dfrac{1}{k - 1 + e^{\varepsilon}}, & y \ne v \end{cases}$$
In general, we need to map the user's data to a binary string and perturb it with probability $p$ or $q$. In the above formula, $v$ and $y$ are the input and output, respectively; $\varepsilon$ represents the privacy budget; and $k$ is the number of possible values of the user data.
Theorem 1: The algorithm $k$-RR satisfies $\varepsilon$-local differential privacy.
Proof: We use $M_{k\text{-RR}}$ to denote the $k$-RR perturbation algorithm. To preserve data availability, we set the probability $p$ of not perturbing to be greater than the probability $q$ of perturbing. For any inputs $v_1$, $v_2$ and any output $y$, the following inequality is satisfied:
$$\frac{\Pr[M_{k\text{-RR}}(v_1) = y]}{\Pr[M_{k\text{-RR}}(v_2) = y]} \le \frac{p}{q} = \frac{e^{\varepsilon} / (k - 1 + e^{\varepsilon})}{1 / (k - 1 + e^{\varepsilon})} = e^{\varepsilon}$$
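A minimal Python sketch of $k$-RR is given below for illustration. It assumes that the $k$ possible values are indexed $0, \ldots, k-1$; the function names are hypothetical and the fixed seed is used only so that the run is repeatable.

```python
import math
import random

def krr_probabilities(k, epsilon):
    """Return (p, q): probability of keeping the true value, and of reporting each other value."""
    denom = math.exp(epsilon) + k - 1
    return math.exp(epsilon) / denom, 1.0 / denom

def krr_perturb(value, k, epsilon, rng=random):
    """Perturb `value` in {0, ..., k-1} with k-RR."""
    p, _ = krr_probabilities(k, epsilon)
    if rng.random() < p:
        return value
    # Otherwise pick uniformly among the k-1 other values.
    other = rng.randrange(k - 1)
    return other if other < value else other + 1

rng = random.Random(0)  # fixed seed so the run is repeatable
reports = [krr_perturb(2, k=4, epsilon=1.0, rng=rng) for _ in range(10000)]
```

Because $p + (k-1)q = 1$ and $p/q = e^{\varepsilon}$, the true value remains the most likely report while every other value is reported with equal probability.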

k-RAPPOR random perturbation algorithm
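RAPPOR was proposed by Erlingsson et al. [Erlingsson, Pihur and Korolova (2014)] and has been successfully applied in Google Chrome. It is one of the few perturbation algorithms deployed in a real environment. RAPPOR is roughly divided into three stages. The complete RAPPOR first hashes each value with a bloom filter and then performs two random perturbations, namely, permanent and temporary perturbations. The $k$-RAPPOR we introduce is a relatively simple case, called one-time RAPPOR, which uses only the permanent perturbation. We use $M_{k\text{-RAPPOR}}$ to represent the perturbation algorithm. The probability of not perturbing a bit of a binary string is $\frac{e^{\varepsilon/2}}{1 + e^{\varepsilon/2}}$, and the probability of perturbing it is $\frac{1}{1 + e^{\varepsilon/2}}$. The specific probability formula is as follows:
$$\Pr[z[i] = 1] = \begin{cases} \dfrac{e^{\varepsilon/2}}{1 + e^{\varepsilon/2}}, & v[i] = 1 \\[2ex] \dfrac{1}{1 + e^{\varepsilon/2}}, & v[i] = 0 \end{cases}$$
Here, $v$ and $z$ are the user's input and output, respectively, and $z[i]$ is the perturbed $i$-th bit.
Theorem 2: The algorithm $k$-RAPPOR satisfies $\varepsilon$-local differential privacy.
Proof: We use $v_1$ and $v_2$ to represent the inputs, and the output is $z$. Because the bits are perturbed independently and the binary strings of $v_1$ and $v_2$ differ in at most two bits, we obtain:
$$\frac{\Pr[z \mid v_1]}{\Pr[z \mid v_2]} = \prod_{i=1}^{k} \frac{\Pr[z[i] \mid v_1[i]]}{\Pr[z[i] \mid v_2[i]]} \le \left( \frac{e^{\varepsilon/2} / (1 + e^{\varepsilon/2})}{1 / (1 + e^{\varepsilon/2})} \right)^{2} = e^{\varepsilon}$$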
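The one-time perturbation can be sketched in Python as follows. This is an illustration under stated assumptions rather than the deployed RAPPOR: the bloom filter and temporary perturbation stages are omitted, each bit is flipped independently with the standard one-time RAPPOR probability $1 / (1 + e^{\varepsilon/2})$, and the helper names are hypothetical.

```python
import math
import random

def one_hot(value, k):
    """Encode a value in {0, ..., k-1} as a list of k bits with a single 1."""
    return [1 if i == value else 0 for i in range(k)]

def rappor_perturb(bits, epsilon, rng=random):
    """Flip each bit independently with probability 1 / (1 + e^(eps/2))."""
    flip = 1.0 / (1.0 + math.exp(epsilon / 2.0))
    return [b ^ 1 if rng.random() < flip else b for b in bits]

rng = random.Random(1)  # fixed seed so the run is repeatable
v = one_hot(2, k=5)
z = rappor_perturb(v, epsilon=2.0, rng=rng)
```

Unlike $k$-RR, the output may contain several 1 bits or none at all, which is why the aggregator has to correct the per-bit counts statistically.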

TRR disturbance algorithm
The TRR algorithm selects the k-RR when the threshold is small, and selects the k-RAPPOR algorithm when the threshold is large. Next, we use some theory to explain its correctness.

Pure local differential privacy protocol
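Local differential privacy protocols differ mainly in how an input is mapped to an output value. Wang et al. [Wang and Li (2018)] proposed the pure local differential privacy protocol, which associates each output value with a set of input values. Suppose there are two different probabilities, $p^*$ and $q^*$. The non-perturbing probability $p^*$ should be greater than the perturbation probability $q^*$, and for all inputs $v_1 \ne v_2$ the following conditions should be satisfied:
$$\Pr[M(v_1) \in \{y \mid v_1 \in \mathrm{Add}(y)\}] = p^*, \qquad \Pr[M(v_2) \in \{y \mid v_1 \in \mathrm{Add}(y)\}] = q^*$$
Here, $M$ is the perturbation algorithm, $v_1$ and $v_2$ are the user's inputs, and $y$ is the output. $\mathrm{Add}$ is a function that maps each possible output value $y$ to a set of input values. If $y_j$ represents the perturbed data sent to the aggregator by user $j$ and $n$ is the number of users, the aggregator estimates the count of each bit $i$ as follows:
$$\hat{c}(i) = \frac{\sum_{j} \mathbb{1}_{\mathrm{Add}(y_j)}(i) - n q^*}{p^* - q^*}$$
In the above formula, $\hat{c}(i)$ indicates the estimated number of times bit $i$ is set. It should be noted that $\mathbb{1}$ is a flag function, which is calculated as follows:
$$\mathbb{1}_{\mathrm{Add}(y_j)}(i) = \begin{cases} 1, & i \in \mathrm{Add}(y_j) \\ 0, & \text{otherwise} \end{cases}$$
Here, $y_j$ is the user's binary string and $i$ is one of its bits. After encoding, a given attribute value of the user is coded either to 1 or to 0.
Theorem 3: To verify the correctness of our estimate for each bit, we calculate its variance, where $f_i$ is the true frequency of bit $i$:
$$\mathrm{Var}[\hat{c}(i)] = \frac{n q^* (1 - q^*)}{(p^* - q^*)^2} + \frac{n f_i (1 - p^* - q^*)}{p^* - q^*}$$
Proof. Each of the $n f_i$ users holding value $i$ supports it with probability $p^*$, and each of the remaining $n (1 - f_i)$ users supports it with probability $q^*$. Therefore,
$$\mathrm{Var}[\hat{c}(i)] = \frac{n f_i p^* (1 - p^*) + n (1 - f_i) q^* (1 - q^*)}{(p^* - q^*)^2} = \frac{n q^* (1 - q^*)}{(p^* - q^*)^2} + \frac{n f_i (1 - p^* - q^*)}{p^* - q^*}$$
In most applications, however, most values occur rarely and the result is determined by the few frequent values. Avoiding a large number of false positives therefore requires a low estimation variance for the infrequent values. In the above formula, when the frequency $f_i$ is small, the variance is mainly determined by the first term, and then an approximate variance $\mathrm{Var}^*$ is obtained:
$$\mathrm{Var}^*[\hat{c}(i)] = \frac{n q^* (1 - q^*)}{(p^* - q^*)^2}$$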
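The exact and approximate variances of a pure protocol can be compared numerically. The following Python sketch is only illustrative: it uses the variance expression from the pure protocol analysis of Wang et al., in which $f$ denotes the true frequency of the value, and the function names and example parameters are assumptions made here.

```python
def exact_variance(n, f, p_star, q_star):
    """Variance of the count estimate for a value whose true frequency is f."""
    d = p_star - q_star
    return n * q_star * (1 - q_star) / d**2 + n * f * (1 - p_star - q_star) / d

def approx_variance(n, p_star, q_star):
    """Approximation used when f is small: only the first term is kept."""
    return n * q_star * (1 - q_star) / (p_star - q_star) ** 2

# Hypothetical parameters: 10000 users, a value held by 1% of them.
n, f, p_star, q_star = 10000, 0.01, 0.5, 0.1
exact = exact_variance(n, f, p_star, q_star)
approx = approx_variance(n, p_star, q_star)
```

For these parameters the second term contributes less than 2% of the exact variance, which illustrates why the approximation is adequate for infrequent values.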

k-RR and k-RAPPOR variance
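The Add function of the $k$-RR perturbation algorithm is $\mathrm{Add}_{k\text{-RR}}(y) = \{y\}$, which satisfies the pure local differential privacy protocol. Furthermore, $p^* = \frac{e^{\varepsilon}}{k - 1 + e^{\varepsilon}}$ and $q^* = \frac{1}{k - 1 + e^{\varepsilon}}$ can be obtained, and then the $k$-RR approximate variance is obtained by combining Eqs. (7) and (11):
$$\mathrm{Var}^*[\hat{c}_{k\text{-RR}}(i)] = n \cdot \frac{k - 2 + e^{\varepsilon}}{(e^{\varepsilon} - 1)^2}$$
We can see that as the threshold $k$ increases, the $k$-RR variance also increases; that is, the accuracy of the $k$-RR result decreases as the threshold $k$ increases. We, therefore, use the $k$-RR algorithm when the threshold $k$ is small. Similarly, the Add function of $k$-RAPPOR is $\mathrm{Add}_{k\text{-RAPPOR}}(y) = \{i \mid y[i] = 1\}$, which satisfies the pure local differential privacy protocol. We can obtain $p^* = \frac{e^{\varepsilon/2}}{1 + e^{\varepsilon/2}}$ and $q^* = \frac{1}{1 + e^{\varepsilon/2}}$, and then combine Eqs. (7) and (11) to obtain the approximate variance:
$$\mathrm{Var}^*[\hat{c}_{k\text{-RAPPOR}}(i)] = n \cdot \frac{e^{\varepsilon/2}}{(e^{\varepsilon/2} - 1)^2}$$
We can see that the approximate variance of the $k$-RAPPOR algorithm is independent of the threshold $k$. We therefore prefer the $k$-RAPPOR algorithm when the threshold $k$ is large.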
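The dependence of the two approximate variances on $k$ can be illustrated with a short Python sketch. It assumes the standard $k$-RR probabilities and the symmetric one-time RAPPOR probabilities with exponent $\varepsilon/2$; the function names are hypothetical and the printed numbers depend on these assumptions.

```python
import math

def krr_approx_variance(n, k, epsilon):
    """n * (e^eps + k - 2) / (e^eps - 1)^2, which grows linearly in k."""
    e = math.exp(epsilon)
    return n * (e + k - 2) / (e - 1) ** 2

def rappor_approx_variance(n, epsilon):
    """n * e^(eps/2) / (e^(eps/2) - 1)^2, which does not depend on k."""
    h = math.exp(epsilon / 2.0)
    return n * h / (h - 1) ** 2

n, eps = 10000, 1.0
for k in (2, 8, 32, 128):
    print(k, round(krr_approx_variance(n, k, eps)), round(rappor_approx_variance(n, eps)))
```

For a small domain the $k$-RR variance is below the $k$-RAPPOR variance, and for a large domain the order is reversed, which is the behaviour that motivates choosing the mechanism per attribute.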

Encoding
The situation we have to consider is more complicated. Each user may have multiple attribute values, and the possible values of each attribute are different. How should we choose the encoding method and the perturbation algorithm for this complicated situation?
We use bitmap encoding. First, consider a universal set $U$ with cardinality $n$. We can represent each subset of $U$ by a bitmap of size $n$. Each element of $U$ is assigned to one of the bits in the bitmap. If an element is a member of a subset $A$ ($A \subseteq U$), then its corresponding bit is 1; otherwise it is 0. Consider the following example: let there be a universal set $U = \{a_3, a_2, a_1, a_0\}$, and subsets $A = \{a_3, a_2\}$ and $B = \{a_3, a_0\}$. With two bitmaps of size four, in which each $a_i$ ($0 \le i \le 3$) is assigned to the $i$-th bit, these subsets are represented as $A = 1100$ and $B = 1001$. With this representation of sets, some common set operators can be implemented faster using bitwise operators. For example, to calculate the intersection (union) of two given sets, we can use the bitwise operator AND (OR) on their corresponding bitmaps. Bitwise operators are implemented efficiently in CPUs and, for bitmaps that fit in a machine word, are typically performed in one CPU cycle. For the user's multiple attributes, we assume that each user has $h$ attributes and that the number of possible values of each attribute is represented by $k$. Then, $k_1$ is the number of possible values of the first attribute, and $k_h$ is the number of possible values of the $h$-th attribute. We map each attribute to a binary string. For example, the first attribute may have $k_1$ values, which are mapped to $k_1$ bits. If the user has a given value of this attribute, the corresponding bit is mapped to 1; otherwise it is mapped to 0. We thus obtain a binary string of length $d = k_1 + k_2 + \cdots + k_h$.
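The encoding can be sketched in Python as follows. The sketch assumes that attribute values are given as 0-based indices and that each attribute takes exactly one value; `encode_user` and `to_bitmap` are hypothetical names used only for this illustration.

```python
def encode_user(values, domain_sizes):
    """Concatenate one one-hot segment per attribute into a single bit list.

    values[i] is the 0-based index of the user's value for attribute i,
    domain_sizes[i] is k_i, the number of possible values of attribute i.
    """
    bits = []
    for value, k in zip(values, domain_sizes):
        if not 0 <= value < k:
            raise ValueError("attribute value out of range")
        bits.extend(1 if j == value else 0 for j in range(k))
    return bits

def to_bitmap(subset, universe):
    """Pack a subset of `universe` into an int; universe[i] maps to bit i."""
    return sum(1 << i for i, a in enumerate(universe) if a in subset)

universe = ["a0", "a1", "a2", "a3"]
A = to_bitmap({"a3", "a2"}, universe)   # 0b1100
B = to_bitmap({"a3", "a0"}, universe)   # 0b1001
intersection = A & B                    # only a3 is shared
union = A | B

# Three attributes with 2, 3 and 4 possible values give a string of length 9.
user_bits = encode_user([1, 0, 2], [2, 3, 4])
```

Storing a set as an integer makes the intersection and union single bitwise operations, and the concatenated per-attribute segments give the length $k_1 + k_2 + \cdots + k_h$ described above.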

TRR algorithm
In the TRR algorithm, we select $k$-RR when the attribute threshold $k$ is small and select $k$-RAPPOR when the attribute threshold $k$ is large. Following [Wang and Li (2018)], $k < 3e^{\varepsilon} + 2$ is a small threshold and $k > 3e^{\varepsilon} + 2$ is a large threshold. The TRR algorithm is suitable when the user data are more complicated, in particular when the user has multiple attributes and the thresholds of the attributes differ significantly. The TRR algorithm is therefore more practical and has a wider range of applications. The specific process of the algorithm is as follows:
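A minimal Python sketch of the selection rule described above is given here for illustration; it is not the authors' exact listing. It assumes that attribute values are 0-based indices, that every attribute is perturbed with the same budget $\varepsilon$, and that one-time RAPPOR flips each bit with probability $1 / (1 + e^{\varepsilon/2})$; all function names are hypothetical.

```python
import math
import random

def choose_mechanism(k, epsilon):
    """Selection rule from the text: k-RR below the threshold 3e^eps + 2, k-RAPPOR otherwise."""
    return "k-RR" if k < 3 * math.exp(epsilon) + 2 else "k-RAPPOR"

def trr_perturb(values, domain_sizes, epsilon, rng=random):
    """Perturb each attribute with the mechanism chosen for its domain size.

    Returns one perturbed bit segment per attribute, tagged with the mechanism used.
    """
    report = []
    for value, k in zip(values, domain_sizes):
        mechanism = choose_mechanism(k, epsilon)
        if mechanism == "k-RR":
            # Keep the true value with probability p, else pick another value uniformly.
            p = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
            if rng.random() >= p:
                other = rng.randrange(k - 1)
                value = other if other < value else other + 1
            segment = [1 if j == value else 0 for j in range(k)]
        else:
            # Assumed one-time RAPPOR: flip every bit independently.
            flip = 1.0 / (1.0 + math.exp(epsilon / 2.0))
            segment = [(1 if j == value else 0) ^ (rng.random() < flip) for j in range(k)]
        report.append((mechanism, segment))
    return report

# Hypothetical user: a 2-valued attribute and a 50-valued attribute.
report = trr_perturb([1, 40], [2, 50], epsilon=1.0, rng=random.Random(0))
```

With $\varepsilon = 1$ the threshold is about 10.15, so the 2-valued attribute is perturbed with $k$-RR and the 50-valued attribute with $k$-RAPPOR. By sequence combination, perturbing $h$ attributes with budget $\varepsilon$ each consumes a total budget of $h\varepsilon$ in this sketch.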

Decoding
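Next, we introduce unbiased estimation, which is needed because the perturbation algorithm distorts the original data. If unbiased estimation is not performed, the $i$-th bits of the binary strings of all $n$ users are directly added to obtain:
$$c(i) = \sum_{j=1}^{n} y_j[i]$$
This method introduces errors, so we estimate the count unbiasedly:
$$\hat{c}(i) = \frac{c(i) - n q^*}{p^* - q^*}$$
Theorem 4: The estimated number $\hat{c}(i)$ of occurrences of a certain bit is unbiased, that is, $E[\hat{c}(i)] = n f_i$, where $f_i$ is the true frequency of bit $i$.
Proof. Each of the $n f_i$ users whose $i$-th bit is 1 reports a 1 with probability $p^*$, and each of the remaining $n (1 - f_i)$ users reports a 1 with probability $q^*$, so
$$E[\hat{c}(i)] = \frac{E[c(i)] - n q^*}{p^* - q^*} = \frac{n f_i p^* + n (1 - f_i) q^* - n q^*}{p^* - q^*} = n f_i$$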
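The correction can be illustrated with a small simulation. The Python sketch below is an illustration under assumptions: reports are lists of bits, the per-bit probabilities $p^*$ and $q^*$ are those of symmetric one-time RAPPOR, and the simulated population and its value distribution are hypothetical.

```python
import math
import random

def unbiased_counts(reports, p_star, q_star):
    """Estimate, for every bit position, how many users truly hold that value."""
    n = len(reports)
    raw = [sum(bits) for bits in zip(*reports)]  # plain per-bit sums
    return [(c - n * q_star) / (p_star - q_star) for c in raw]

# Hypothetical simulation: 20000 users, one attribute with k = 4 values,
# perturbed bit by bit with flip probability 1 / (1 + e^(eps/2)).
rng = random.Random(42)
epsilon, k, n = 2.0, 4, 20000
p_star = math.exp(epsilon / 2) / (1 + math.exp(epsilon / 2))
q_star = 1 - p_star
true_values = [rng.choices(range(k), weights=[0.5, 0.3, 0.15, 0.05])[0] for _ in range(n)]
reports = [[(1 if j == v else 0) ^ (rng.random() < q_star) for j in range(k)]
           for v in true_values]
estimates = unbiased_counts(reports, p_star, q_star)
true_counts = [true_values.count(j) for j in range(k)]
```

The raw per-bit sums overestimate rare values because flipped 0 bits add spurious 1s; after subtracting $n q^*$ and rescaling, the estimates fluctuate around the true counts.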

FP-tree frequent itemset mining
After the server receives the users' binary strings, it constructs an FP-tree for frequent itemset mining. First, building an FP-tree requires creating an item header table, so we scan the database for the first time and obtain the estimated count of every 1-itemset. Then, we delete the items whose support is lower than min-support, put the frequent 1-itemsets into the header table, and sort them in descending order of support. Next, we scan the database a second time, delete the infrequent 1-itemsets from each transaction, and sort the remaining items in descending order of support. After acquiring the item header table and the sorted datasets, we can start building the FP-tree. The FP-tree has no data at the beginning. When building an FP-tree, we insert the sorted transactions one by one. The item that ranks first becomes the ancestor node, and the next one becomes its descendant node. If two transactions share an ancestor, the count of the shared ancestor node is incremented by one. After the insertion, if a new node appears, the corresponding entry of the item header table is linked to the new node through the node list. The creation of the FP-tree is complete when all the data has been inserted. Next, we mine frequent itemsets starting from the item at the bottom of the item header table. For each item in the item header table, we need to find its conditional pattern base in the FP-tree. The conditional pattern base is the FP-subtree corresponding to the leaf node that we want to mine. We set the count of each node in the FP-subtree to the count of the leaf nodes and delete the nodes whose count is lower than the support. From this conditional pattern base, we can recursively mine frequent itemsets. In the experiment, we set min-support = 0.5 and obtain frequent 2-itemsets and frequent 3-itemsets. These settings are not fixed.
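The construction and recursive mining steps described above can be sketched compactly in Python. This is a simplified illustration rather than the implementation used in the experiments: it works on plain transactions with integer counts instead of perturbed and estimated counts, it stores the header table as a dictionary of node lists, and the names `Node`, `build_tree` and `fp_growth` are hypothetical.

```python
from collections import defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}

def build_tree(transactions, min_count):
    """transactions: list of (items, count). Returns (header table, item counts)."""
    freq = defaultdict(int)
    for items, count in transactions:          # first scan: count single items
        for item in set(items):
            freq[item] += count
    keep = {i for i, c in freq.items() if c >= min_count}
    root, header = Node(None, None), defaultdict(list)
    for items, count in transactions:          # second scan: insert sorted transactions
        # Drop infrequent items and sort by descending support (ties broken by name).
        ordered = sorted((i for i in set(items) if i in keep), key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += count
    return header, freq

def fp_growth(transactions, min_count, suffix=frozenset()):
    """Yield (itemset, count) for every itemset whose count is at least min_count."""
    header, freq = build_tree(transactions, min_count)
    for item, nodes in header.items():
        itemset = suffix | {item}
        yield itemset, freq[item]
        # Conditional pattern base: prefix paths of every node carrying `item`.
        base = []
        for node in nodes:
            path, parent = [], node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                base.append((path, node.count))
        yield from fp_growth(base, min_count, itemset)

# Hypothetical toy database; min-support 0.5 of 4 transactions is a count of 2.
D = [["cold", "cough"], ["cold", "cough", "fever"], ["cold", "fever"], ["cough"]]
result = dict(fp_growth([(t, 1) for t in D], min_count=2))
```

Each recursive call rebuilds a smaller tree from a conditional pattern base, so the database itself is scanned only twice, which is the reason for preferring the FP-tree over repeated candidate generation.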

Algorithm experiment comparison
Our experimental metric is the F-score, which is a commonly used measure in data mining; it is the harmonic mean of precision and recall:
$$F\text{-score} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$
Here, precision denotes the accuracy rate and recall is the recall rate. The higher the F-score, the more effective the experimental method.
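For illustration, the F-score between a set of mined itemsets and the true frequent itemsets can be computed as in the following Python sketch. The function name and the example itemsets are hypothetical, and returning 0 when no itemset is correct is a convention chosen here.

```python
def f_score(mined, truth):
    """Harmonic mean of precision and recall between two collections of itemsets."""
    mined, truth = set(mined), set(truth)
    hits = len(mined & truth)
    if hits == 0:
        return 0.0
    precision = hits / len(mined)
    recall = hits / len(truth)
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: 4 true frequent itemsets, 3 mined, 2 of them correct.
truth = [frozenset("ab"), frozenset("ac"), frozenset("bc"), frozenset("bd")]
mined = [frozenset("ab"), frozenset("ac"), frozenset("cd")]
score = f_score(mined, truth)
```

In this example precision is 2/3 and recall is 1/2, so the F-score lies between the two and is pulled toward the smaller value.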
Additionally, to measure the error with respect to the actual support of itemsets in the dataset, we calculated the relative error (RE) of the support of released itemsets:
$$RE = \operatorname{median}_{X \in C} \frac{|sup'(X) - sup(X)|}{sup(X)}$$
Here, $C$ is the set of all frequent itemsets generated by a private algorithm, $sup(X)$ is the actual support of itemset $X$, and $sup'(X)$ is its noisy support. The smaller the RE, the smaller the error and the higher the utility of the algorithm.
Our experiments were implemented in Python 3.7. The experimental datasets were generated from a normal distribution and from an exponential distribution, and were then used to verify the TRR algorithm. If the dataset obeys a uniform distribution, our algorithm is not applicable, because mining frequent itemsets from such data does not make much sense. We conducted two sets of experiments under the normal distribution to mine frequent 2-itemsets and frequent 3-itemsets. Fig. 1 shows the impact of four different local differential privacy algorithms on frequent itemset mining results under different privacy budgets. The abscissa is the privacy budget, which ranges from 1.0 to 6.0 in steps of 0.5. As shown in Fig. 1, as the privacy budget increases, the overall F-scores of the four algorithms show an upward trend. Further, the F-score of the TRR algorithm is larger than that of the other three algorithms, which indicates that the TRR algorithm has a higher accuracy for mining frequent 2-itemsets. As seen in Fig. 2, similar to Fig. 1, the F-score of the TRR algorithm in mining frequent 3-itemsets is larger than that of the other three algorithms. This also shows that the TRR algorithm performs better than the RR, RAPPOR and MRR algorithms in mining frequent itemsets. Figs. 3 and 4 show the MRR, TRR, RR, and RAPPOR algorithms mining frequent 2-itemsets and frequent 3-itemsets under an exponential distribution. As shown in Fig. 3, as the privacy budget increases, the F-scores of the four algorithms gradually increase. The F-score curves of the MRR and RR algorithms almost coincide, mainly because the MRR algorithm selects the RR algorithm when the privacy budget is low. 
Moreover, as the privacy budget gradually increases from 4.0 to 6.0, the F-scores of the RAPPOR and TRR algorithms do not change significantly. This is because as the privacy budget increases, the degree of perturbation of the original data approaches 0, and the dataset produced by the perturbation becomes closer to the real dataset. The TRR algorithm is superior to the RR, RAPPOR and MRR algorithms because it has the lowest probability of disturbance under the same privacy budget, resulting in a higher F-score. Similarly, Fig. 4 shows the TRR algorithm and the other three algorithms mining frequent 3-itemsets. The F-score obtained by the TRR algorithm is larger than that of the other three algorithms, indicating that the TRR algorithm is also suitable for an exponential distribution. The F-score only measures the precision and recall of the mining results; it does not evaluate the experimental error. To analyze the error of the experimental results, we used the relative error measure. For each element in the frequent itemset, we find its initial support in the original data and then find its perturbed support in the perturbed data. Subsequently, we compute the relative difference between these two support values for every itemset and take the median of these values as the relative error. Fig. 5 shows that in the process of mining frequent 2-itemsets, as the privacy budget becomes larger, the relative errors of the four algorithms generally show a downward trend. Further, the TRR algorithm has higher accuracy in data mining than the MRR, RR and RAPPOR algorithms. The relative error of the MRR algorithm is the largest, and it varies significantly and is unstable when the privacy budget is 7.0-7.5. This is mainly because the disturbance probability is more sensitive to these privacy budgets and because of the uncertainty of random disturbances. Similarly, Fig. 6 shows similar characteristics to Fig. 5. 
With the increase in the privacy budget, the relative error of the TRR algorithm is smaller than the relative errors of the other three algorithms, which also shows that the TRR algorithm has better utility than the MRR, RR and RAPPOR algorithms in mining frequent itemsets. Fig. 7 shows that the relative error of the TRR algorithm is smaller than that of the MRR, RR and RAPPOR algorithms, and that the relative error of the MRR algorithm is high. The relative error of the TRR algorithm is smaller because the TRR disturbance probability is lower than that of the other three algorithms under the same privacy budget. Fig. 8 shows frequent 3-itemset mining, demonstrating that the relative error of the TRR algorithm is still smaller than that of the RR and RAPPOR algorithms. In summary, the F-score of the TRR algorithm is the largest, and its relative error is the smallest. When the user has multiple attributes, the choice between the RR and RAPPOR algorithms is determined by the attribute threshold. The disadvantage of the MRR algorithm is that data distribution estimation is not suitable for frequent itemset mining. In short, if the user has multiple attributes, the TRR algorithm is the better choice.
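The relative error measure used in this section can be sketched in Python as follows. The sketch assumes that supports are stored in dictionaries keyed by itemset; skipping itemsets whose true support is zero or unknown is a choice made for this illustration, and the example values are hypothetical.

```python
from statistics import median

def relative_error(true_support, noisy_support):
    """Median of |sup'(X) - sup(X)| / sup(X) over the itemsets in `noisy_support`.

    Itemsets whose true support is zero or unknown are skipped to avoid
    division by zero (a choice made for this sketch).
    """
    errors = [abs(s_noisy - true_support[x]) / true_support[x]
              for x, s_noisy in noisy_support.items()
              if true_support.get(x, 0) > 0]
    return median(errors) if errors else float("nan")

# Hypothetical supports of three released itemsets.
true_support = {"AB": 0.5, "AC": 0.4, "BC": 0.2}
noisy_support = {"AB": 0.55, "AC": 0.3, "BC": 0.2}
re_value = relative_error(true_support, noisy_support)
```

Using the median rather than the mean keeps a single badly estimated itemset from dominating the reported error.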

Conclusions
We propose a complete implementation framework to mine frequent itemsets under local differential privacy protection. In the general process, the original data are first bitmap encoded, and the encoded data are then perturbed under local differential privacy; the aggregator performs unbiased estimation on the perturbed data and then uses the FP-tree to mine frequent itemsets. We propose a new TRR algorithm, mainly to handle the complex data types that arise when users have multiple attributes whose thresholds can vary greatly. The TRR algorithm satisfies the requirements of complex data and ensures that the user's personal information does not leak. Finally, the FP-tree algorithm is used to mine frequent itemsets, and the TRR, RR, RAPPOR and MRR algorithms are compared experimentally. We found that the TRR algorithm performs better than the existing algorithms.