Local differential privacy for unbalanced multivariate nominal attributes

Data with unbalanced multivariate nominal attributes collected from a large number of users provide a wealth of knowledge for our society. However, they also pose an unprecedented privacy threat to participants. Local differential privacy, a variant of differential privacy, is proposed to eliminate the privacy concern by aggregating only randomized values from each user, with the provision of plausible deniability. However, traditional local differential privacy algorithms usually assign the same privacy budget to attributes with different dimensions, leading to large data utility loss and high communication costs. To obtain highly accurate results while satisfying local differential privacy, the aggregator needs a reasonable privacy budget allocation scheme. In this paper, the Lagrange multiplier (LM) algorithm is used to transform the privacy budget allocation problem into minimizing an unconstrained convex function. The solution of the resulting nonlinear equation, obtained by the Cardano formula (CF) and Newton-Raphson (NS) methods, is used as the optimal privacy budget allocation scheme. We then improve two popular local differential privacy mechanisms by taking advantage of the proposed privacy budget allocation techniques. Extensive simulations on two different data sets with multivariate nominal attributes demonstrate that the scheme proposed in this paper can significantly reduce the estimation error while satisfying local differential privacy.

and genetic information [2] can be followed closely to better diagnose and monitor patient health status.
However, in practical crowdsourcing systems, high-dimensional heterogeneous data cannot be utilized effectively, for two main reasons. (1) Non-local privacy guarantee. Differential privacy [3,4], one of the currently effective privacy protection mechanisms, randomizes query outputs by adding noise to sensitive data. Many existing works [5][6][7][8] focus on centralized data sets under the assumption of a trusted third-party data collector: raw data are concentrated in a data center, which then publishes statistics satisfying differential privacy. However, even if third-party data collectors claim that they will neither steal nor disclose confidential user information, user privacy is still not guaranteed, and a truly trusted third-party collection platform is hard to find in practice, which significantly limits the use of centralized differential privacy technologies. Users therefore prefer to ensure data security on the client side, processing and protecting their confidential information themselves (i.e., local differential privacy [9,10]). (2) High-dimensional disaster. In crowdsourcing systems, high-dimensional heterogeneous data are ubiquitous. As data dimensions and the dimensional differences between attributes grow, many existing local differential privacy mechanisms, such as RAPPOR [11] and [12,13], become nearly unusable if applied straightforwardly to multiple attributes with unbalanced dimensions. Their fatal drawbacks are non-optimized privacy budget allocation schemes and high computational complexity, which lead to large data utility loss and high latency. Attributes of different dimensions need different privacy budgets, and finding the best allocation scheme is the key to improving data utility.
In addition to privacy vulnerability and data utility, collecting a large amount of data from distributed user groups means that the efficiency of data processing is low, especially in the application of the Internet of Things (IoT). Thus, it is important to provide an efficient privacy-preserving method with high-dimensional heterogeneous data. Furthermore, considering that the privacy concern level required by users for different data is inconsistent, it is also important to find the optimal privacy mechanism under high and low privacy regimes.
In addressing the above issues, many existing methods have proved effective from different perspectives. One line of work provides users with a local privacy guarantee so that their data are never exposed, such as [11][12][13]; however, these methods incur heavy communication overhead, and data utility drops sharply on high-dimensional heterogeneous data. The other line privately releases high-dimensional data [14][15][16], mainly by reducing the dimensionality of the data before private release; these methods not only have high computational complexity but also suffer low data utility because of their unreasonable privacy budget allocation schemes.
In this paper, we aim to design an efficient and effective privacy budget allocation scheme for high-dimensional heterogeneous data under the local privacy guarantee. Our main contributions are as follows:
• We propose an optimal privacy budget allocation scheme for high-dimensional heterogeneous data. We use the Lagrange multiplier (LM) algorithm to transform the privacy budget allocation problem into minimizing an unconstrained convex function, and then employ the Cardano formula (CF) and Newton-Raphson (NS) methods to iteratively compute the optimal solution.
• To meet the local privacy guarantee and the differing privacy-concern levels of different data, we apply the optimal privacy budget allocation scheme to the BRR and MRR, obtaining the OBRR and OMRR, which are optimal for high-dimensional heterogeneous data in the high and low privacy regimes, respectively.
• Finally, simulation experiments show that the two improved algorithms, OBRR and OMRR, significantly reduce the estimation error while satisfying local differential privacy, with lower time and communication complexities.

Related work
This paper focuses on the frequency statistics problem of high-dimensional heterogeneous data with local differential privacy, where each user sends multiple variable values voted from candidate attributes. The candidate attributes generally have different dimensions. Without loss of generality, we assume candidate attributes A = {a_1, a_2, ..., a_l}, where each attribute a_i has a specific dimension k_i, and we let d = k_1 + k_2 + ... + k_l. Each user needs to report a fixed set of l variable values. Unlike single-valued frequency statistics, in multivariate scenarios we must consider not only the locality of users' privacy but also the partitioning of the privacy budget; an unreasonable privacy budget allocation scheme causes a sharp drop in sanitized data utility.

Local privacy guarantee
Although differential privacy defends against difference and inference attacks on aggregate queries, individuals' data may still suffer privacy leakage before aggregation. Given this flaw, the notion of local privacy has been proposed to provide a local privacy guarantee to distributed users [9,10]. Recent local-privacy work aims to learn particular aggregate features from distributed users with some public knowledge. Groat et al. [12] proposed the technique of negative surveys, based on randomized response techniques, to recover the true distributions from noisy participant data. Similarly, Bassily et al. [17] proposed the S-Hist algorithm.
To reduce the transmission cost, they use randomized response to perturb the original data and then randomly select one bit to send to the data collector. However, when the dimension is high, data sparsity leads to much utility loss, and their high computational complexities lead to high latency. Many single-valued frequency statistical mechanisms satisfying local differential privacy have been proposed. Erlingsson et al. [11] proposed RAPPOR to estimate the frequencies of different strings in a candidate set. Their subsequent work, RAPPOR-unknown [18], learns the correlations between dimensions via an EM-based algorithm.
Intuitively, single-valued frequency statistics can be applied repeatedly to each variable in high-dimensional cases. However, when the dimension is high, data utility decreases dramatically and computational complexity increases exponentially. For the RAPPOR method, the length of the Bloom filters over the multi-attribute domain grows with the joint domain size k_1 × k_2 × ... × k_l, and the EM algorithm has an exponentially higher complexity. Therefore, using a single-valued frequency publishing method in the high-dimensional case cannot optimize data utility or communication cost. In addition, many improved local differential privacy algorithms suit single-valued frequency statistics, such as O-RAPPOR [19], PCE [20], k-RR [21], and k-Subset [22]; when addressing high-dimensional frequency statistics, they all have irreparable deficiencies in data utility, communication cost, or computational complexity.

High dimension
Currently, for the issue of high-dimensional data publishing, many methods have proved effective from different perspectives. For example, Cai et al. [23] studied the trade-off between statistical accuracy and privacy in mean estimation and linear regression with high-dimensional data, mainly by improving the setting strategies of parameters such as the minimax lower bound and iterative threshold to ensure statistical accuracy while satisfying differential privacy. However, this approach does not satisfy the locality of users' privacy, and the authors did not discuss how to allocate the privacy budget effectively. Li et al. [24] put forward a dichotomy of the privacy budget by publishing differential privacy histograms in groups. When it comes to high-dimensional heterogeneous data, there is no theoretical basis for their division: since the dimensions of attributes differ, allocating the same privacy budget inevitably degrades data utility. Similarly, the method in [25] improves the accuracy of published data by post-processing the output to restore the consistency of the counts specified in the structure. However, this method cannot solve the utility decline caused by the sparsity of high-dimensional data before aggregation. Other methods, such as [26,27], use the matrix mechanism to publish the database with minimal query noise; however, the optimization cost of this approach is very high, and the assumption that the query distribution is known in advance is not reasonable.
Another solution to mitigate the high dimension issue is to group the correlated records into clusters and then allocate the privacy budget to each low-dimensional cluster. However, in the existing schemes [14,28,29], the original data set is explicitly accessed twice to understand the correlation between properties and to generate the distribution of the cluster. The biggest problem with these methods is that the two accesses are computed separately and that there is no consistent privacy guarantee. That is, two different privacy budgets are allocated separately, but it is not clear how to allocate the privacy budget to achieve a sufficient privacy guarantee and utility maximization.
Moreover, although unbalanced data with multivariate nominal attributes can be reduced into several low-dimensional clusters, the sparsity caused by the combinations in each cluster still exists and may result in lower utility. In contrast to the totally centralized setting in [14], Su et al. [30] proposed a distributed multiparty setting to publish a new data set from multiple data curators. However, their multiparty computation protects only the privacy between data servers; there is no guarantee of local personal privacy within a data server. In addition, Zhang et al. [31] proposed a self-adaptive regression-based multivariate data compression scheme, using a correlation matrix to compress different data streams from the same node to reduce communication costs. However, this method does not address how to compress a single high-dimensional data stream effectively.
To solve the shortcomings of the above methods, which meet neither privacy locality nor the demands of high-dimensional data, some effective methods have been proposed. For example, Ren et al. proposed LoPub [15,16], which combines RAPPOR with a probabilistic graphical model. They first transform each attribute value into a random bit string using a Bloom filter [32] and then send it to the central server. Subsequently, similar to the high-dimensional data publishing method based on centralized differential privacy in [14], the data collector computes frequency statistics over the collected data and then constructs a Markov network. The joint probability distribution of the attributes is factorized over maximal cliques to reduce the dimensionality of the data. Finally, a data set is resynthesized from the joint probability distribution for release. However, the biggest disadvantage of this method is that it does not consider the allocation of the privacy budget before the high-dimensional heterogeneous data are aggregated at the server. Moreover, if the attributes are mutually independent, they propose using the EM algorithm to estimate the multivariate distribution, which increases the computational complexity exponentially.
To overcome the shortcomings of low data utility, non-local privacy, and high computational complexity within those schemes, we propose a novel privacy budget allocation scheme to publish unbalanced multivariate nominal attribute data while guaranteeing local privacy. Many similar optimization theories and methods have been proposed [33][34][35], but different objective functions lead to different solutions. In this paper, we turn the privacy budget allocation problem into solving univariate cubic equations. The experimental results show that our method greatly improves the low query accuracy caused by defective privacy budget allocation.

System model
The demonstrative aggregation model is depicted in Fig. 1, where several users and a central aggregator are interconnected, constituting a crowdsourcing system. First, the aggregator publishes the unbalanced multivariate aggregation query A = {a_1, a_2, ..., a_l} to each participant, along with global parameters, including the optimal privacy budget allocation scheme ǫ = {ǫ_1, ..., ǫ_l} for the attributes and other mechanism-specific parameters, such as a flag for the high or low privacy regime (different regimes require different privacy mechanisms, i.e., OBRR or OMRR). In our mechanism, both the secret data v and the sanitized data v′ are expressed as bit maps; specifically, if a participant's secret value equals the j-th element V_j in the data domain V, then the secret data v_i ∈ {0, 1}^|V| is a bit map of length |V| with the j-th bit set to 1 and all other bits set to 0. After receiving the sanitized data list {v′_1, v′_2, ..., v′_n}, the aggregator attempts to decode an estimation over the domain V. Based on the estimated results from the sanitized data set, the aggregator tries to provide users with better network services. Throughout this locally differentially private release process, no one but the participant knows the secret information being released.

Problem statement
Given a collection of data records with l attributes from different users, the dimensions of the attributes differ. Our goal is to help the aggregator design a reasonable privacy budget allocation scheme to improve the utility of the released data under different privacy regimes. Formally, the unbalanced multivariate nominal attributes are A = {a_1, a_2, ..., a_l}, and each attribute a_i has a specific set of categories a_i = {a_i1, a_i2, ..., a_ik_i}, where k_i is the number of categories of the i-th attribute, that is, |a_i| = k_i, i = 1, 2, ..., l. We assume that if i ≠ j, then k_i ≠ k_j. Each user holds one value per attribute. Let n be the total number of users and d = k_1 + k_2 + ... + k_l be the length of the bit maps. Each user's record v_i is first translated into a bit map, which is then perturbed and sent to the aggregator. The true frequency histogram is denoted H = {h_1, ..., h_d}, and the estimated frequency histogram decoded from the sanitized data is H″ = {h″_1, ..., h″_d}. With the above notation, our problem can be formulated as follows: given a fixed total privacy budget (located in the high or low regime), find the optimal privacy budget allocation scheme {ǫ_1, ..., ǫ_l} that minimizes the error of the estimated histogram H″ with respect to the true histogram H. Moreover, different privacy regimes require different privacy mechanisms, so the allocation scheme must be flexible enough to apply to different local privacy mechanisms. The notation employed in this paper is listed in Table 1.
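The optimization stated in words above can be written compactly; the following is a sketch in LaTeX using the symbols defined in the text, with the total budget written generically as ǫ (later sections instantiate the constraint as ǫ/2 for the BRR and ǫ for the MRR):

```latex
\min_{\epsilon_1,\dots,\epsilon_l}\;
\mathbb{E}\!\left[\sum_{j}\bigl(h''_j - h_j\bigr)^{2}\right]
\quad\text{s.t.}\quad
\sum_{i=1}^{l}\epsilon_i = \epsilon,
\qquad \epsilon_i \ge 0 .
```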

Local differential privacy
The protection model under local differential privacy (LDP) fully considers the possibility of data collectors stealing or revealing user privacy during data collection. In this model, each user first randomizes the data and then sends the sanitized data to the data collector, which computes statistics over the collected data to obtain valid analysis results. Local differential privacy [5] is a rigorous privacy notion in the local setting that provides a stronger privacy guarantee than centralized differential privacy. Its formal definition is as follows. Definition 1. Given n users, where each user corresponds to a record, a randomized algorithm F satisfies ǫ-local differential privacy if for any two records t, t′ ∈ D and for all M ⊆ Range(F),

Pr[F(t) ∈ M] ≤ exp(ǫ) · Pr[F(t′) ∈ M],

where ǫ denotes the privacy budget and D represents the domain of the private data.
For local differential privacy technology, each user can independently randomize individual data, that is, the privacy process is transferred from the data collector to a single client so that no trusted third-party intervention is required. This also eliminates privacy attacks that may be caused by untrusted third-party data collectors.
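As a concrete check of Definition 1, the sketch below (our illustration, not part of the paper) verifies that binary randomized response with truth probability p = e^ǫ/(1 + e^ǫ) satisfies ǫ-LDP: the worst-case ratio of output probabilities over any two inputs is bounded by e^ǫ.

```python
import math

def rr_output_probs(bit, eps):
    """Output distribution of binary randomized response on one bit:
    report the truth with probability e^eps / (1 + e^eps), lie otherwise."""
    p = math.exp(eps) / (1.0 + math.exp(eps))
    return {bit: p, 1 - bit: 1.0 - p}

def worst_case_ratio(eps):
    """Max over outputs m and input pairs (t, t') of Pr[F(t)=m] / Pr[F(t')=m]."""
    ratio = 0.0
    for t in (0, 1):
        for t2 in (0, 1):
            pt, pt2 = rr_output_probs(t, eps), rr_output_probs(t2, eps)
            for m in (0, 1):
                ratio = max(ratio, pt[m] / pt2[m])
    return ratio

# the worst-case ratio equals e^eps, so the epsilon-LDP bound holds with equality
assert worst_case_ratio(1.0) <= math.exp(1.0) + 1e-12
```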

Binary randomized response
The binary randomized response (BRR) [11] is a technique that requires each user to send a sanitized bit to the aggregator, where the perturbation is based on a randomized response (RR). Each participant is asked to flip a biased coin with probability p in secret and tell the truth if it comes up heads but tell a lie otherwise (if the coin comes up tails).
To solve the perturbation problem of multiple unbalanced categorical data, the binary randomized response first initializes a length-d bit vector. The key problem is then how to determine the value of p so that the sanitized data released by each user satisfy differential privacy. To do so, we analyze the sensitivity of releasing a length-d bit vector. Since each user possesses exactly l items, there are l ones in each bit vector, so two such bit vectors can differ in at most 2l bits, meaning that the sensitivity is 2l. To meet the requirements of differential privacy, the probability p follows the method applied by RAPPOR [11]:

p = exp(ǫ/(2l)) / (1 + exp(ǫ/(2l))).

The BRR thus allocates the same privacy budget ǫ/(2l) to every attribute, regardless of whether the attributes have the same number of categories. If the number of categories varies widely between attributes, for example, between the user's browsing site and the user's gender, the same privacy budget will likely produce a large estimation deviation.
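A minimal sketch of BRR perturbation and unbiased decoding, assuming each user holds a length-d bit map with one 1 per attribute block and each attribute i is flipped with its own keep probability derived from a per-attribute budget ǫ_i (per-attribute budgets are this paper's extension; plain BRR sets every ǫ_i to ǫ/(2l)). The helper names are ours, not the paper's.

```python
import math
import random

def brr_perturb(bits, eps_per_attr, attr_slices):
    """Perturb a length-d bit map; bits of attribute i keep their value
    with probability p_i = e^{eps_i} / (1 + e^{eps_i})."""
    out = list(bits)
    for (start, end), eps_i in zip(attr_slices, eps_per_attr):
        p_i = math.exp(eps_i) / (1.0 + math.exp(eps_i))
        for j in range(start, end):
            if random.random() > p_i:        # flip with probability 1 - p_i
                out[j] = 1 - out[j]
    return out

def brr_estimate(count, n, eps_i):
    """Unbiased frequency estimate for one bit position: invert the
    randomized-response channel with keep probability p_i."""
    p_i = math.exp(eps_i) / (1.0 + math.exp(eps_i))
    return (count - n * (1.0 - p_i)) / (2.0 * p_i - 1.0)
```

With l attributes and sensitivity 2l, setting every `eps_i` to `eps / (2 * l)` recovers the uniform allocation the paper criticizes.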

Multivariate randomized response
The multivariate randomized response (MRR) mechanism [36] is a locally differentially private mechanism whose noisy output alphabet Y equals the original input domain X. Specifically, each user possesses a set v_i = {v_i1, ..., v_il} of items; after being perturbed by the MRR, the sanitized output becomes v′_i = {v′_i1, ..., v′_il}. The user u_i then publishes the sanitized set v′_i to the aggregator. The conditional probabilities are given by:

Pr[v′_ij = v_ij] = exp(ǫ_m) / (exp(ǫ_m) + k_j − 1),
Pr[v′_ij = y] = 1 / (exp(ǫ_m) + k_j − 1) for each y ≠ v_ij.

To satisfy the requirements of differential privacy, we analyze the sensitivity of releasing a length-l vector in a manner similar to the above. Two such vectors can differ in at most l positions, meaning that the sensitivity is l. Thus, when ǫ_m = ǫ/l, the MRR mechanism satisfies the differential privacy requirements. The MRR allocates the same privacy budget ǫ/l to every attribute, so the same unreasonable budget allocation problem also appears in the MRR mechanism.
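The per-attribute MRR perturbation (k-ary randomized response) can be sketched as follows: attribute j with k_j categories keeps the true value with probability e^{ǫ_m}/(e^{ǫ_m} + k_j − 1) and otherwise outputs a uniformly random other category. This mirrors the conditional probabilities above; the function names and the per-attribute budget list are our illustration.

```python
import math
import random

def mrr_perturb_value(value, k, eps_m):
    """k-ary randomized response: keep the true category (an int in [0, k))
    with probability p = e^{eps_m} / (e^{eps_m} + k - 1), else pick one of
    the other k - 1 categories uniformly at random."""
    p = math.exp(eps_m) / (math.exp(eps_m) + k - 1)
    if random.random() < p:
        return value
    other = random.randrange(k - 1)          # uniform among remaining categories
    return other if other < value else other + 1

def mrr_perturb_record(values, ks, eps_per_attr):
    """Perturb one user's record of l attribute values independently."""
    return [mrr_perturb_value(v, k, e) for v, k, e in zip(values, ks, eps_per_attr)]
```

Setting every entry of `eps_per_attr` to `eps / l` recovers the uniform MRR allocation discussed above.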
The BRR mechanism incurs O(d) communication cost per user, while the MRR incurs O(l). The number of attributes l is usually far smaller than the total number of items d, that is, l ≪ d, so in terms of communication cost, the MRR is superior to the BRR. In the work of Kairouz et al. [36], the BRR and MRR are called staircase mechanisms; the BRR has been proved optimal in the high privacy regime and the MRR in the low privacy regime [19]. However, their unreasonable privacy budget allocation schemes are fatal problems. In the next section, we show how to obtain the optimal allocation scheme over multiple unbalanced categorical data and then apply it to the BRR and MRR, yielding the optimal mechanisms in the high and low regimes, respectively.

Optimal budget allocation for the BRR
The main goal of the aggregator is to estimate the frequencies of the items without disclosing the privacy of the users. Therefore, we adopt the square error (SE) as the metric to evaluate the estimation. Without loss of generality, we assume there are l attributes a_1, a_2, ..., a_l, the number of items of each attribute a_i is k_i (|a_i| = k_i), and the total number of items is d = k_1 + k_2 + ... + k_l. We allocate budgets {ǫ_1, ǫ_2, ..., ǫ_l} to the attributes {a_1, a_2, ..., a_l}, respectively, with ǫ_1 + ǫ_2 + ... + ǫ_l = ǫ/2. Each user u_i publishes a length-d bit vector. The SE sums the variances of the estimated counts, where each H′_ij follows a Bernoulli distribution with variance np_i(1 − p_i). Our goal is to minimize this SE subject to the budget constraint. To solve this constrained optimization problem, we employ the LM method to fold the constraint into the objective:

L(ǫ, λ) = SE + λ(ǫ_1 + ǫ_2 + ... + ǫ_l − ǫ/2),

where ǫ_i ≥ 0, i = 1, ..., l. The task now is to find the minimum of L(ǫ, λ). Since the second-order partial derivative ∂²L(ǫ, λ)/∂ǫ_i² > 0, L(ǫ, λ) is strictly convex in each ǫ_i, so a minimum exists. Setting ∂L/∂ǫ_i = 0 and substituting x_i = exp(ǫ_i) yields, for each attribute, a univariate cubic equation ax³ + bx² + cx + d = 0. Letting

p = c/a − b²/(3a²) and q = d/a + 2b³/(27a³) − bc/(3a²),

the cubic reduces to the depressed form y³ + py + q = 0 with x = y − b/(3a). By the CF method, its roots are

y_1 = u + v, y_2 = ωu + ω²v, y_3 = ω²u + ωv,

where u = ∛(−q/2 + √(q²/4 + p³/27)), v = ∛(−q/2 − √(q²/4 + p³/27)), and ω = (−1 + √3 i)/2. The roots of the original cubic follow from x = y − b/(3a), and we take only the real root x_i1 as our final solution. The l solutions x_1, x_2, ..., x_l are substituted into the constraint x_1 x_2 ... x_l = exp(ǫ/2), which yields a higher-order equation in λ. We employ the existing Newton-Raphson (NS) method to solve this high-degree equation in one unknown. The NS method first chooses an initial approximate value λ_0; at each iteration, the next value is given by

λ_{k+1} = λ_k − f(λ_k)/f′(λ_k).

The NS method produces a sequence {λ_1, λ_2, ...} that converges to the true root of f(λ). After obtaining the asymptotic answer λ*, we can obtain the values of {x_1, x_2, ..., x_l}.
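The Cardano step above can be sketched generically; the cubic coefficients come from the paper's first-order conditions (not fully preserved in this text), so the function below takes arbitrary a, b, c, d and returns a real root via the depressed-cubic substitution. It is our illustration, not the paper's implementation.

```python
import cmath

def cardano_real_root(a, b, c, d):
    """Real root of a*x^3 + b*x^2 + c*x + d = 0 via x = y - b/(3a), with
    p = c/a - b^2/(3a^2) and q = d/a + 2b^3/(27a^3) - b*c/(3a^2)."""
    p = c / a - b * b / (3 * a * a)
    q = d / a + 2 * b ** 3 / (27 * a ** 3) - b * c / (3 * a * a)
    disc = cmath.sqrt((q / 2) ** 2 + (p / 3) ** 3)
    u = (-q / 2 + disc) ** (1 / 3)           # principal complex cube root
    v = -p / (3 * u) if u != 0 else 0.0      # uses the identity u*v = -p/3
    omega = complex(-0.5, 3 ** 0.5 / 2)      # (-1 + sqrt(3) i) / 2
    roots = [u + v, omega * u + omega ** 2 * v, omega ** 2 * u + omega * v]
    y = min(roots, key=lambda r: abs(r.imag))  # pick the (numerically) real root
    return (y - b / (3 * a)).real
```

Applying x_i = exp(ǫ_i) then recovers each attribute's budget as ǫ_i = log x_i.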
The privacy budget for each attribute is then obtained as ǫ_i = log x_i, i = 1, ..., l. Analyzing the optimal answer {ǫ_1, ǫ_2, ..., ǫ_l}, we can draw the following conclusions. Theorem 1. For multiple unbalanced categorical data, the optimal privacy budget value ǫ_i of the BRR is positively correlated with the number of items k_i. Specifically, if k_1 = k_2 = ... = k_l, the allocation scheme ǫ_1 = ǫ_2 = ... = ǫ_l = ǫ/(2l) is optimal.
Theorem 2. For any given numbers of items {k_1, k_2, ..., k_l}, there exists exactly one optimal budget allocation scheme ǫ* = {ǫ_1, ǫ_2, ..., ǫ_l} s.t. ǫ_1 + ǫ_2 + ... + ǫ_l = ǫ/2, and its estimation error admits an upper bound. Applying the optimal privacy budget allocation scheme to the BRR yields the OBRR mechanism for the high privacy regime, which greatly improves the original mechanism. The encoder algorithm of the OBRR is shown in Algorithm 1.

Optimal budget allocation for the MRR
In this section, we also employ the SE as a metric to evaluate the estimation. We assume that the parameters used in this section are the same as in the previous definition.
The SE is again a sum of squared deviations of the estimated counts, where H_ij represents the j-th item of the i-th attribute. One could use prior knowledge of H as a substitute; here, we assume only that it is a uniform histogram, so that H_ij = n/k_i. Our goal is thus to minimize the resulting SE subject to the budget constraint. We again employ the LM method to fold the constraint into the objective and let x_i = exp(ǫ_i), obtaining the Lagrangian L(x, λ), where x_i > 1, i = 1, 2, ..., l. Since the second-order partial derivative ∂²L(x, λ)/∂x_i² > 0 for i = 1, 2, ..., l, L(x, λ) is strictly convex in each x_i, so a minimum solution exists. The optimal solution is obtained by solving the equations ∂L(x, λ)/∂x_i = 0, i = 1, 2, ..., l. We use the same CF and NS methods introduced in the last section to solve the roots of these equations. Finally, we obtain the optimal allocation scheme {ǫ_1, ǫ_2, ..., ǫ_l}, with ǫ_1 + ǫ_2 + ... + ǫ_l = ǫ. To further analyze the properties of the optimal budget, we can draw the following conclusion. Theorem 3. For multiple unbalanced categorical data, the optimal privacy budget value ǫ_i of the MRR is positively correlated with the number of items k_i. Specifically, if k_1 = k_2 = ... = k_l, the allocation scheme ǫ_1 = ǫ_2 = ... = ǫ_l = ǫ/l is optimal.
Theorem 4. For any given numbers of items {k_1, k_2, ..., k_l}, there exists exactly one optimal budget allocation scheme ǫ* = {ǫ_1, ǫ_2, ..., ǫ_l} s.t. ǫ_1 + ǫ_2 + ... + ǫ_l = ǫ, and its estimation error admits an upper bound. Applying the optimal privacy budget allocation scheme to the MRR yields the OMRR mechanism for the low privacy regime, which improves the original mechanism significantly. The encoder algorithm of the OMRR is shown in Algorithm 2.

Convergence
When using the NS method to calculate the roots of the equation f(λ) = x_1 x_2 ... x_l − exp(ǫ/2) = 0, the biggest problem lies in the selection of the initial iteration value. If the initial value is far from the true solution, the NS method has difficulty converging. To mitigate the NS method's over-reliance on the initial value, we add a search for the best initial value to the iteration process, dividing the iteration into two phases. We first check whether |f(λ_k)| falls within a reasonable interval [a, b], starting from the given initial value λ_0; if it does not, we advance by a fixed step size, λ_{k+1} = λ_k + δ, and recheck until a suitable initial value λ′_0 is found. Starting from this best initial value λ′_0, the NS method is then used to refine the iteration. The global threshold is set to ξ = 0.01; when the iteration error |f(λ_k)| ≤ ξ, the iteration terminates. To show the relationship between the overall number of iterations and the iteration error, we perform experiments on two data sets, which are detailed in the "Simulation" section. The comparison results are shown in Fig. 2; to facilitate comparison, the error is normalized to [0, 1].
The experiments show that the number of iterations depends strongly on the selection of the initial value λ_0 and of the step size δ. After the improvement, all optimization equations stably converge to the real root λ*. Based on the obtained approximate solution, the optimal privacy allocation schemes can be computed; they are illustrated in Table 2.
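The two-phase procedure described above can be sketched as follows, assuming f and its derivative are available as callables. The interval [a, b], step δ, and threshold ξ = 0.01 follow the text; the function names, loop caps, and failure handling are our illustrative assumptions.

```python
def find_initial_value(f, lam0, a, b, delta, max_steps=10_000):
    """Phase 1: walk from lam0 in steps of delta until |f(lam)| lands
    inside the acceptance interval [a, b]."""
    lam = lam0
    for _ in range(max_steps):
        if a <= abs(f(lam)) <= b:
            return lam
        lam += delta
    raise RuntimeError("no suitable initial value found")

def newton_raphson(f, fprime, lam0, xi=0.01, max_iter=1000):
    """Phase 2: standard NS iteration lam <- lam - f(lam)/f'(lam),
    stopping once the iteration error |f(lam)| <= xi."""
    lam = lam0
    for _ in range(max_iter):
        if abs(f(lam)) <= xi:
            return lam
        lam -= f(lam) / fprime(lam)
    return lam
```

For example, `newton_raphson(lambda x: x * x - 2, lambda x: 2 * x, 1.0)` converges to an approximation of √2 with residual below ξ.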

Analysis
To prove the effectiveness of the optimal methods, we carry out an experimental analysis of the theoretical error. Without loss of generality, we assume that there are 5 attributes with different numbers of categories; specifically, we let {k_1, k_2, ..., k_5} = {5, 6, 150, 200, 250}. These numbers are chosen randomly, but such imbalances do exist in reality; for example, the number of sexes is 2, while the number of websites visited by users may be in the thousands. We assume 1000 participants and conduct experiments on the BRR, MRR, OBRR, and OMRR. The experimental results are shown in Table 3, where we use log_10(NSE) as the reference point. In this experiment, the OBRR performs best and the MRR worst. The reason is the excessive number of items: the conditional probability of the BRR satisfies p = exp(ǫ_i)/(1 + exp(ǫ_i)), whereas that of the MRR is p_m = exp(ǫ_m)/(exp(ǫ_m) + k − 1); thus, if k is large, the probability p_m becomes small, which incurs bad performance. Compared with the BRR and MRR, the OBRR reduces the estimated square error by 40%, and the OMRR by approximately 73%.

Error bounds and computational complexities
The OBRR is optimal in the high privacy regime and the OMRR in the low privacy regime when addressing multivariate unbalanced nominal attributes. Thus, the histograms estimated by these mechanisms are no less favorable than those estimated by the BRR [11,37] and MRR [38]. For each participant, both the OBRR mechanism proposed in Algorithm 1 and the OMRR proposed in Algorithm 2 have a computational complexity of O(d), where d is the length of the bit maps. For the aggregator, finding the optimal budget allocation scheme {ǫ_1, ǫ_2, ..., ǫ_l} requires approximately O(log(l)F(l)) computational complexity, where F(l) is the cost of calculating f(x)/f′(x) with l-digit precision, and estimating the histogram from the observed sanitized data requires O(nd + n) time, where n is the number of participants. Except for optimizing the budget allocation scheme, the OBRR and OMRR mechanisms have only linear time complexity in d or n, and the optimal privacy allocation scheme can be computed offline, that is, before aggregating users' sanitized information. In short, the OBRR and OMRR mechanisms have only linear complexity with respect to the domain size |D| or the number of participants n for both the participants and the aggregator; hence, they are highly efficient for aggregating multiple unbalanced categorical data.

Optimal binary randomized response mechanism
In this section, we conduct an experiment to compare the performances of the BRR and OBRR mechanisms. We assume that each participant's secret value is drawn from a histogram H that is uniformly and randomly generated during each aggregation. The dimension of the data set is [n, d]. The data set generation guarantees the following criteria: each participant can vote for only l tickets, that is, each row of the data set matrix sums to l, and the total number of tickets over all participants is l·n. The data set generation algorithm is given in Algorithm 3. All experiments in this paper are run on a notebook with Windows 8.1, an i7-4710MQ 2.50 GHz CPU, and 8.0 GB of RAM; the coding platform is MATLAB R2015b. Without loss of generality, we assume 5 attributes with different numbers of categories and select two data sets in total, with the attribute category counts chosen randomly to demonstrate the effect of optimal budget allocation on unbalanced data. The black lines denote the log_10(NSE) of the BRR mechanism. The BRR ignores the number of categories of the attributes, treats all attributes as equal, and encodes each attribute with the same privacy budget ǫ/(2l). Our OBRR method takes the nature of all attributes into account and finds a more reasonable privacy budget allocation scheme, allocating more budget to attributes with more items, and then encodes each attribute using the method proposed in Algorithm 1. When (k_1, k_2, ..., k_l) = (5, 6, 150, 200, 250), Fig. 3a, b show the estimated errors for 1000 and 10,000 users, respectively; when (k_1, k_2, ..., k_l) = (2, 4, 6, 7, 100), Fig. 3c, d show the estimated errors for 1000 and 10,000 users, respectively.
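The data set generation described above (one vote per attribute, so l ones per row) can be sketched as follows. This is our reading of the criteria stated for Algorithm 3, with hypothetical helper names, written in Python rather than the paper's MATLAB.

```python
import random

def generate_dataset(n, ks):
    """n participants; attributes with category counts ks = [k_1, ..., k_l].
    Each row is a length-d bit map (d = sum(ks)) with exactly one 1 per
    attribute block, so every row sums to l and the total is l * n."""
    d = sum(ks)
    rows = []
    for _ in range(n):
        row = [0] * d
        offset = 0
        for k in ks:
            row[offset + random.randrange(k)] = 1  # one uniform vote per attribute
            offset += k
        rows.append(row)
    return rows
```

For the first data set above, `generate_dataset(1000, [5, 6, 150, 200, 250])` yields 1000 rows of length 611, each summing to 5.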
Due to the randomness of the perturbation, we run three experiments for each case and plot the average of the results. The error bars in the figures are calculated from the standard deviation.
As can be seen from Fig. 3, the proposed optimal privacy budget allocation scheme substantially reduces the estimation error. According to Fig. 3a, b, the OBRR mechanism reduces the estimated square error by 41.6% and 40.2%, respectively, compared with the BRR. According to Fig. 3c, d, the OBRR mechanism reduces the estimated square error by 33.2% and 36.4%, respectively, compared with the BRR. The experimental results show that the magnitude of the error reduction is independent of the number of participants n but is related to the numbers of attribute categories (k_1, k_2, ..., k_l). In fact, the larger the dimensional difference between attributes is, the greater the advantage of the privacy budget allocation scheme proposed in this paper.

Optimal multivariate randomized response mechanism
We first revisit the implementation principle of the MRR, which was introduced in the "Preliminaries" section. The MRR treats all attributes as equal and allocates the same privacy budget ǫ/l to each attribute. If the attribute dimensions differ greatly, assigning the same privacy budget to every attribute is bound to yield inaccurate estimates. Taking these drawbacks of the MRR into account, we allocate the privacy budget more reasonably, assigning more budget to attributes with more items.
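As a concrete illustration, a k-ary randomized response for one attribute with k categories and budget ǫ_i reports the true category with probability e^{ǫ_i}/(e^{ǫ_i} + k − 1) and any other category with probability 1/(e^{ǫ_i} + k − 1). The sketch below is this generic k-ary randomized response, not the paper's exact MRR/OMRR encoder; under the MRR's uniform split every attribute gets ǫ_i = ǫ/l, while the OMRR would substitute the optimized budgets:

```python
import math
import random

def k_rr(value, k, eps, rng=random):
    """k-ary randomized response satisfying eps-local differential
    privacy: keep the true category with probability
    p = e^eps / (e^eps + k - 1), otherwise report one of the
    remaining k - 1 categories uniformly at random."""
    p = math.exp(eps) / (math.exp(eps) + k - 1)
    if rng.random() < p:
        return value
    other = rng.randrange(k - 1)            # pick among the k-1 other categories
    return other if other < value else other + 1

# MRR-style uniform split over l attributes: each gets eps / l.
l, eps = 5, 2.0
ks = [2, 4, 6, 7, 100]                      # second data set from the text
record = [1, 3, 0, 6, 42]                   # one participant's true categories
noisy = [k_rr(v, k, eps / l) for v, k in zip(ks and record, ks)]
```

With a fixed total budget ǫ, attributes with large k (such as the 100-category one) keep very little signal under the uniform split, which is exactly the inefficiency the optimized allocation targets.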
To demonstrate the effectiveness of the OMRR, we experiment on the same data sets created above. The compared results are presented in Fig. 4; the black lines denote the MRR, and the red lines indicate the OMRR. The number of participants assumed in Fig. 4a, c is 1000, while in Fig. 4b, d it is 10,000. As the budget increases, the estimated error gradually declines. In Fig. 4a, b, the OMRR reduces the estimated square error by 72.8% and 72.0%, respectively, compared with the MRR. In Fig. 4c, d, the OMRR reduces the estimated square error by 73.0% and 73.7%, respectively, compared with the MRR.

Fig. 3 The relationship between the estimated histogram error measured by log10(NSE) and the privacy budget ǫ

To the best of our knowledge, there is currently no other research on privacy budget allocation schemes for unbalanced multivariate nominal attributes; the purpose of the comparison with the BRR and MRR in the "Simulation" section is to prove the effectiveness of our method. Our approach is also highly scalable: in any local differential privacy pipeline, whenever privacy budgets must be allocated across categorically unbalanced data, our privacy budget allocation scheme can be applied.

Conclusion
Traditional local differential privacy techniques typically assign the same privacy budget to unbalanced multivariate nominal attributes, leading to large data utility loss and high communication costs. To solve this problem, we propose an optimal privacy budget allocation scheme for high-dimensional heterogeneous data based on the Lagrange multiplier algorithm, the Cardano formula and the Newton-Raphson method. In addition, to meet the local privacy guarantee and the differing privacy requirements of different data, we use the proposed optimal privacy budget allocation scheme to improve the BRR and MRR, calling the resulting mechanisms the OBRR and OMRR, respectively. The OBRR and OMRR are optimal in the high and low privacy regimes, respectively, for high-dimensional heterogeneous data. To prove the effectiveness of our improved local differential privacy mechanisms, we carry out simulation experiments on two different data sets with unbalanced multivariate nominal attributes. The simulation results demonstrate that the proposed mechanisms achieve a considerable improvement, reducing the estimated square error by 53.2% on average compared to the BRR and MRR.

Fig. 4 The relationship between the estimated histogram error measured by log10(NSE) and the privacy budget ǫ