Improving the utility of locally differentially private protocols for longitudinal and multidimensional frequency estimates

This paper investigates the problem of collecting multidimensional data throughout time (i.e., longitudinal studies) for the fundamental task of frequency estimation under Local Differential Privacy (LDP) guarantees. Contrary to frequency estimation of a single attribute, the multidimensional aspect demands particular attention to the privacy budget. Besides, when collecting user statistics longitudinally, privacy progressively degrades. Indeed, the "multiple" settings in combination (i.e., many attributes and several collections throughout time) impose several challenges, for which this paper proposes the first solution for frequency estimates under LDP. To tackle these issues, we extend the analysis of three state-of-the-art LDP protocols (Generalized Randomized Response -- GRR, Optimized Unary Encoding -- OUE, and Symmetric Unary Encoding -- SUE) for both longitudinal and multidimensional data collections. While the known literature uses OUE and SUE for two rounds of sanitization (a.k.a. memoization), i.e., L-OUE and L-SUE, respectively, we analytically and experimentally show that starting with OUE and then with SUE provides higher data utility (i.e., L-OSUE). Also, for attributes with small domain sizes, we propose Longitudinal GRR (L-GRR), which provides higher utility than the other protocols based on unary encoding. Last, we also propose a new solution named Adaptive LDP for LOngitudinal and Multidimensional FREquency Estimates (ALLOMFREE), which randomly samples a single attribute to be sent with the whole privacy budget and adaptively selects the optimal protocol, i.e., either L-GRR or L-OSUE. As shown in the results, ALLOMFREE consistently and considerably outperforms the state-of-the-art L-SUE and L-OUE protocols in the quality of the frequency estimates.


Background
In recent years, Differential Privacy (DP) [1,2] has been increasingly accepted as the current standard for data privacy [3,4,5,6]. In the centralized model of DP, a trusted curator has access to the entire raw data of users (e.g., the Census Bureau [7,8]). By "trusted", we mean that curators do not misuse or leak private information of individuals. However, this assumption does not always hold in real life, e.g., data breaches are all too common [9].

Note: Final version accepted in the journal Digital Communications and Networks (soon to be updated with DOI). Corresponding author (email: heber.hwang-arcolezi@inria.fr).
To preserve privacy at the user-side, an alternative approach, namely, Local Differential Privacy (LDP), was initially formalized in [10]. With LDP, rather than trusting a data curator to hold the raw data and sanitize it to answer queries, each user applies a DP mechanism to their data before transmitting it to the data collector server. The local DP model allows collecting data in unprecedented ways and, therefore, it has been widely adopted by industry (e.g., Google Chrome browser [11], Microsoft Windows 10 operating system [12], Apple iOS and macOS [13]).

Motivation and problem statement
When collecting data in practice, one is often interested in multiple attributes of a population, i.e., multidimensional data. For instance, in crowd-sourcing applications, the server may collect both demographic information (e.g., gender, nationality) and user habits in order to develop personalized solutions for specific groups. In addition, one generally aims to collect data from the same users throughout time (i.e., longitudinal studies), which is essential in many situations [11,12]. For example, knowing whether two medical acts recorded at different times were performed on the same patient or on two different patients changes their interpretation: a treatment in the first case, two isolated acts in the second.
So, in this paper, we focus on the problem of private frequency (or histogram) estimation of multiple attributes throughout time with LDP. Frequency estimation is a primary objective of LDP, in which the data collector (a.k.a. the aggregator) decodes all the privatized data of the users and then estimates the number of users for each possible value. More formally, we assume there are $d$ attributes $A = \{A_1, A_2, ..., A_d\}$, where each attribute $A_j$ has a discrete domain with $k_j = |A_j|$ values. Each user $u_i$, for $i \in \{1, 2, ..., n\}$, has a tuple $v^{(i)} = (v^{(i)}_1, v^{(i)}_2, ..., v^{(i)}_d)$, where $v^{(i)}_j$ represents the value of attribute $A_j$ in record $v^{(i)}$. Thus, for each attribute $A_j$ at time $t \in [1, \tau]$, the aggregator's goal is to estimate a $k_j$-bin histogram, including the frequency of all values in $A_j$.

Summary of contributions
In this paper, we extend the analysis of three state-of-the-art LDP protocols, namely, Generalized Randomized Response (GRR) [18], Optimized Unary Encoding (OUE) [14], and Symmetric Unary Encoding (SUE) [11] for both longitudinal and multidimensional frequency estimates. On the one hand, for all three protocols, we theoretically prove that randomly sampling a single attribute per user improves data utility, which is an extension of common results in the LDP literature [36,24,37,29,38].
On the other hand, in the literature, both SUE and OUE protocols have been extended (and also applied [39,40]) to longitudinal studies based on the concept of memoization [11,12], i.e., L-SUE and L-OUE, respectively. However, we numerically and experimentally show that combining both protocols provides higher data utility, i.e., starting with OUE and then with SUE (L-OSUE) optimizes data utility better than using SUE or OUE twice. In addition, we also extend GRR for longitudinal studies (i.e., L-GRR), which provides higher data utility than the other protocols based on unary encoding for attributes with a small domain size.
Lastly, in a multidimensional setting having different domain sizes for each attribute, a dynamic selection of longitudinal LDP protocols is preferred. Therefore, we propose a new solution named Adaptive LDP for LOngitudinal and Multidimensional FREquency Estimates (ALLOMFREE), which combines all the aforementioned results. More specifically, ALLOMFREE randomly samples a single attribute to be sent with the whole privacy budget and adaptively selects the optimal protocol, i.e., either L-GRR or L-OSUE. To validate our proposal, we conduct an extensive set of experiments on four real-world open datasets. Under the same privacy guarantee, results show that ALLOMFREE consistently and considerably outperforms the state-of-the-art L-SUE and L-OUE protocols in the quality of the frequency estimates.
The remainder of this paper is organized as follows. In Section 2, we review the privacy notion in consideration, i.e., LDP and the protocols. In Section 3, we extend the analysis of GRR, OUE, and SUE to multidimensional data collections. In Section 4 we present the memoization-based framework for longitudinal data collections, the extension and analysis of longitudinal GRR and the longitudinal UE-based protocols and the numerical evaluation of their performance, and we present our ALLOMFREE solution. In Section 5, we present experimental results and discuss our results. In Section 6 we review the related work. Lastly, in Section 7, we present the concluding remarks and future directions.

Theoretical background
In this section, we briefly present the concept of privacy considered in this work, that is, LDP, and the LDP protocols we will apply in this paper.

LDP
Local differential privacy, initially formalized in [10], protects an individual's privacy during the data collection process. A formal definition of LDP is given as follows:

Definition 1 ($\epsilon$-Local Differential Privacy). A randomized algorithm $\mathcal{A}$ satisfies $\epsilon$-LDP if, for any pair of input values $v_1, v_2 \in Domain(\mathcal{A})$ and any possible output $y$ of $\mathcal{A}$:

$$\Pr[\mathcal{A}(v_1) = y] \leq e^{\epsilon} \cdot \Pr[\mathcal{A}(v_2) = y].$$

Similar to the centralized model of DP, LDP also enjoys several important properties, e.g., immunity to post-processing ($F(\mathcal{A})$ is $\epsilon$-LDP for any function $F$) and composability [3]. That is, combining the results from $d$ locally differentially private protocols also satisfies LDP. If these protocols are applied separately to disjoint subsets of the dataset, the combination satisfies $\epsilon$-LDP with $\epsilon = \max(\epsilon_1, ..., \epsilon_d)$ (parallel composition). On the other hand, if these protocols are sequentially applied to the same dataset, the combination satisfies $\epsilon$-LDP with $\epsilon = \sum_{i=1}^{d} \epsilon_i$ (sequential composition).

LDP protocols
Randomized Response (RR), a surveying technique proposed by Warner [41], has been the building block for many LDP protocols. Let $A_j = \{v_1, v_2, ..., v_{k_j}\}$ be a set of $k_j = |A_j|$ values of a given attribute and let $\epsilon$ be the privacy budget. We review three state-of-the-art LDP mechanisms for single-frequency estimation (a.k.a. frequency oracles) that will be used in this paper.

GRR
The k-Ary RR [18] mechanism extends RR to the case of $k_j \geq 2$ and is also referred to as direct encoding [14] or Generalized RR (GRR) [42,43,29]. Throughout this paper, we use the term GRR for this LDP protocol. Given a value $v \in A_j$, GRR($v$) outputs the true value with probability $p$, and any other value $v' \in A_j$ such that $v' \neq v$ with probability $1 - p$. More formally, the perturbation function is defined as:

$$\Pr[GRR(v) = v'] = \begin{cases} p = \frac{e^{\epsilon}}{e^{\epsilon} + k_j - 1}, & \text{if } v' = v \\ q = \frac{1}{e^{\epsilon} + k_j - 1}, & \text{otherwise.} \end{cases}$$

This satisfies $\epsilon$-LDP since $\frac{p}{q} = e^{\epsilon}$. On expectation, the number of times that a value $v_i$ is reported, $N_i$, for $i \in [1, k_j]$, is given by:

$$E[N_i] = n f(v_i) p + n (1 - f(v_i)) q,$$

in which $N_i$ is the number of times the value $v_i$ has been reported, $f(v_i)$ is the real frequency of value $v_i$, and $n$ is the total number of users. This immediately provides the normalized estimation $\hat{f}(v_i)$ that each value $v_i$ occurs as [18,14,11]:

$$\hat{f}(v_i) = \frac{N_i - nq}{n(p - q)}. \quad (1)$$

In [14], the authors prove that $\hat{f}(v_i)$ in Eq. (1) is an unbiased estimation of the true frequency $f(v_i)$, and the variance of this estimation is

$$Var[\hat{f}(v_i)] = \frac{q(1 - q)}{n(p - q)^2} + \frac{f(v_i)(1 - p - q)}{n(p - q)}.$$

In the case of small $f(v_i) \sim 0$, this variance is dominated by the first term, which gives the approximate variance as [14]:

$$Var^*[\hat{f}(v_i)] = \frac{q(1 - q)}{n(p - q)^2}. \quad (2)$$

Since the estimation in Eq. (1) is unbiased, its variance $Var[\hat{f}(v_i)]$ is equal to the Mean Squared Error (MSE), which is commonly used as an accuracy metric (e.g., cf. [43,35]) and also adopted in this paper.
Replacing $p = \frac{e^{\epsilon}}{e^{\epsilon} + k_j - 1}$ and $q = \frac{1}{e^{\epsilon} + k_j - 1}$ into Eq. (2), the GRR variance is calculated as:

$$Var^*[\hat{f}_{GRR}(v_i)] = \frac{e^{\epsilon} + k_j - 2}{n(e^{\epsilon} - 1)^2}.$$
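As an illustration of the GRR protocol described above, the following is a minimal Python sketch, not the authors' implementation. The function names (`grr_params`, `grr_client`, `grr_estimate`, `grr_variance`) and the encoding of values as integers in {0, ..., k-1} are assumptions made for this example.

```python
import numpy as np

def grr_params(eps, k):
    """Return (p, q) for GRR with privacy budget eps and domain size k."""
    p = np.exp(eps) / (np.exp(eps) + k - 1)
    q = (1 - p) / (k - 1)  # equals 1 / (e^eps + k - 1)
    return p, q

def grr_client(v, k, eps, rng):
    """Report the true value v with probability p, else a uniform other value."""
    p, _ = grr_params(eps, k)
    if rng.random() < p:
        return v
    other = rng.integers(k - 1)  # index among the k-1 values different from v
    return other if other < v else other + 1

def grr_estimate(reports, k, p, q):
    """Unbiased estimator of Eq. (1): (N_i - n q) / (n (p - q))."""
    n = len(reports)
    counts = np.bincount(reports, minlength=k)
    return (counts - n * q) / (n * (p - q))

def grr_variance(eps, k, n):
    """Approximate GRR variance: (e^eps + k - 2) / (n (e^eps - 1)^2)."""
    return (np.exp(eps) + k - 2) / (n * (np.exp(eps) - 1) ** 2)
```

Because p + (k-1)q = 1, the estimated frequencies always sum to one, although individual estimates may be negative for rare values.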

Unary encoding-based
Protocols based on Unary Encoding (UE) consist of transforming a value $v$ into a binary representation of it. So, first, for a given value $v$, $B = UE(v)$, where $B = [0, 0, ..., 1, 0, ..., 0]$ is a $k_j$-bit array in which only the $v$-th position is set to one. Next, the bits $i$, for $i \in [1, k_j]$, from $B$ are flipped, depending on parameters $p$ and $q$, to generate a sanitized vector $B'$, in which:

$$\Pr[B'_i = 1] = \begin{cases} p, & \text{if } B_i = 1 \\ q, & \text{if } B_i = 0. \end{cases}$$

The proof that the UE-based protocols satisfy $\epsilon$-LDP for $\epsilon = \ln\left(\frac{p(1 - q)}{(1 - p)q}\right)$ is known in the literature and can be found in [11,14].
In [14], the authors presented two ways of selecting probabilities $p$ and $q$, which determine the protocol variance. One well-known UE-based protocol is the basic one-time RAPPOR [11], referred to as Symmetric UE (SUE), which selects $p = \frac{e^{\epsilon/2}}{e^{\epsilon/2} + 1}$ and $q = \frac{1}{e^{\epsilon/2} + 1}$, where $p + q = 1$ (symmetric). The estimated frequency $\hat{f}(v_i)$ that a value $v_i$ occurs for $i \in [1, k_j]$ is also calculated using Eq. (1). Replacing $p = \frac{e^{\epsilon/2}}{e^{\epsilon/2} + 1}$ and $q = \frac{1}{e^{\epsilon/2} + 1}$ into Eq. (2), the SUE variance is calculated as [11]:

$$Var^*[\hat{f}_{SUE}(v_i)] = \frac{e^{\epsilon/2}}{n(e^{\epsilon/2} - 1)^2}. \quad (5)$$

Moreover, rather than selecting $p$ and $q$ to be symmetric, Wang et al. [14] proposed Optimized UE (OUE), which selects parameters $p = \frac{1}{2}$ and $q = \frac{1}{e^{\epsilon} + 1}$ that minimize the variance of UE-based protocols while still satisfying $\epsilon$-LDP. Similarly, the estimation method used in Eq. (1) equally applies to OUE. Replacing $p = \frac{1}{2}$ and $q = \frac{1}{e^{\epsilon} + 1}$ into Eq. (2), the OUE variance is calculated as [14]:

$$Var^*[\hat{f}_{OUE}(v_i)] = \frac{4e^{\epsilon}}{n(e^{\epsilon} - 1)^2}. \quad (6)$$
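The UE-based protocols can be sketched in a few lines of Python. This is an illustrative sketch under assumptions: the helper names (`ue_params`, `ue_client`, `ue_estimate`, `approx_variance`) and the integer encoding of values are hypothetical and chosen only for this example.

```python
import numpy as np

def ue_params(eps, optimized=False):
    """Return (p, q): SUE parameters by default, OUE parameters if optimized."""
    if optimized:
        return 0.5, 1 / (np.exp(eps) + 1)
    p = np.exp(eps / 2) / (np.exp(eps / 2) + 1)
    return p, 1 - p

def ue_client(v, k, p, q, rng):
    """One-hot encode v, then report each bit as 1 w.p. p (if 1) or q (if 0)."""
    bits = np.zeros(k, dtype=int)
    bits[v] = 1
    prob_one = np.where(bits == 1, p, q)
    return (rng.random(k) < prob_one).astype(int)

def ue_estimate(reports, p, q):
    """Eq. (1) applied per position: N_i is the number of 1s at position i."""
    n = reports.shape[0]
    return (reports.sum(axis=0) - n * q) / (n * (p - q))

def approx_variance(p, q, n):
    """Approximate variance of Eq. (2): q (1 - q) / (n (p - q)^2)."""
    return q * (1 - q) / (n * (p - q) ** 2)
```

Plugging the OUE parameters into `approx_variance` gives a value no larger than with the SUE parameters for the same budget, which matches the comparison reported in [14].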

Multidimensional frequency estimates with LDP
In the literature, a few works for collecting multidimensional data with LDP are based on random sampling (i.e., dividing users into groups) [32,33,34,35,14,38]. This technique reduces both dimensionality and communication costs, and it will also be the focus of this paper. Let $d \geq 2$ be the total number of attributes, $k = [k_1, k_2, ..., k_d]$ be the domain size of each attribute, $n$ be the number of users, and $\epsilon$ be the privacy budget. An intuitive solution (Spl) is to split the privacy budget, i.e., assigning $\epsilon/d$ to each attribute. The other solution (Smp) is based on uniformly sampling (without replacement) only $r$ attribute(s) out of the $d$ possible ones, i.e., assigning $\epsilon/r$ per attribute. Notice that both solutions satisfy $\epsilon$-LDP according to the sequential composition theorem [3].
For the first case, Spl, the variances ($\sigma^2_1$) of GRR, SUE, and OUE are, respectively:

$$\sigma^2_{1,GRR} = \frac{e^{\epsilon/d} + k_j - 2}{n(e^{\epsilon/d} - 1)^2}, \quad \sigma^2_{1,SUE} = \frac{e^{\epsilon/2d}}{n(e^{\epsilon/2d} - 1)^2}, \quad \sigma^2_{1,OUE} = \frac{4e^{\epsilon/d}}{n(e^{\epsilon/d} - 1)^2}. \quad (7)$$

For the second case, Smp, the number of users per attribute is reduced to $nr/d$. Thus, the variances ($\sigma^2_2$) of GRR, SUE, and OUE are, respectively:

$$\sigma^2_{2,GRR} = \frac{d(e^{\epsilon/r} + k_j - 2)}{nr(e^{\epsilon/r} - 1)^2}, \quad \sigma^2_{2,SUE} = \frac{d\,e^{\epsilon/2r}}{nr(e^{\epsilon/2r} - 1)^2}, \quad \sigma^2_{2,OUE} = \frac{4d\,e^{\epsilon/r}}{nr(e^{\epsilon/r} - 1)^2}. \quad (8)$$

Notice that if $r = d$ in Eq. (8), one achieves Eq. (7). Practically, the objective is reduced to finding the $r$ that minimizes $\sigma^2_2$ for each protocol. To find the optimal $r$ for each protocol, we first multiply each $\sigma^2_2$ in Eq. (8) by $\epsilon$. Without losing generality, minimizing $\sigma^2_{2,GRR}$, $\sigma^2_{2,SUE}$, and $\sigma^2_{2,OUE}$ is equivalent to minimizing $\frac{e^{\epsilon/r}}{r(e^{\epsilon/r} - 1)^2}$, $\frac{e^{\epsilon/2r}}{r(e^{\epsilon/2r} - 1)^2}$, and $\frac{e^{\epsilon/r}}{r(e^{\epsilon/r} - 1)^2}$, respectively. Hence, letting $x = r/\epsilon$ be the independent variable, $\sigma^2_{2,GRR}$ and $\sigma^2_{2,OUE}$ can be rewritten as $y_1 = \frac{1}{x} \cdot \frac{e^{1/x}}{(e^{1/x} - 1)^2}$, and $\sigma^2_{2,SUE}$ can be rewritten as $y_2 = \frac{1}{x} \cdot \frac{e^{1/2x}}{(e^{1/2x} - 1)^2}$, as functions over $x$. It is not hard to prove that both $y_1$ and $y_2$ are increasing functions w.r.t. $x$. Therefore, the variance is minimized by the smallest admissible value, i.e., the optimal number of attributes per user is $r = 1$ for all three protocols. We highlight that this is a common result in the LDP literature obtained for different protocols and contexts [32,33,35,14,24,37,36,44].
Therefore, in this paper, we adopt the multidimensional setting Smp with $r = 1$. In this setting, users tell the data collector which attribute is sampled, and its perturbed value ensures $\epsilon$-LDP by applying either GRR or UE-based protocols; the data collector would not receive any information about the remaining $d - 1$ attributes.
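The comparison between splitting (Spl) and sampling (Smp) can be checked numerically. The sketch below is an illustrative aid, with hypothetical function names, that evaluates the Smp variances for a chosen r, where r = d corresponds to Spl; it assumes the closed-form single-attribute variances of GRR, SUE, and OUE given earlier.

```python
import numpy as np

def var_grr(eps, k, n):
    """Approximate GRR variance for budget eps, domain size k, n users."""
    return (np.exp(eps) + k - 2) / (n * (np.exp(eps) - 1) ** 2)

def var_sue(eps, n):
    """Approximate SUE variance."""
    return np.exp(eps / 2) / (n * (np.exp(eps / 2) - 1) ** 2)

def var_oue(eps, n):
    """Approximate OUE variance."""
    return 4 * np.exp(eps) / (n * (np.exp(eps) - 1) ** 2)

def smp_variances(eps, k, n, d, r):
    """Variances when each user reports r of d attributes with eps / r each.

    Only n * r / d users report each attribute; r = d reproduces Spl.
    """
    n_eff = n * r / d
    return {
        "GRR": var_grr(eps / r, k, n_eff),
        "SUE": var_sue(eps / r, n_eff),
        "OUE": var_oue(eps / r, n_eff),
    }

if __name__ == "__main__":
    for r in range(1, 6):
        print(r, smp_variances(eps=1.0, k=4, n=10_000, d=5, r=r))
```

For the example parameters used here, the variance of every protocol grows with r, so r = 1 gives the smallest values, in line with the analytical result.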

Longitudinal frequency estimates with LDP
In this section, we first present the memoization-based framework for longitudinal data collections. Next, we present the analysis of longitudinal GRR and longitudinal UE-based protocols. Lastly, we numerically evaluate the extended longitudinal protocols and propose our ALLOMFREE solution.

Memoization-based data collection with LDP
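In the literature, many studies focus on how to collect and analyze categorical data longitudinally based on memoization [11,12,36]. The key idea behind memoization is using two sanitization processes. The first round (RR$_1$) replaces the real value $B$ with a sanitized one $B'$ with a higher epsilon ($\epsilon_\infty$). Whenever one intends to report $B$, $B'$ shall be reused to produce other sanitized versions $B''$ with lower epsilon values. Notice that the second sanitization (RR$_2$) is a must to avoid "averaging attacks", in which adversaries can reconstruct the true value from multiple sanitized versions of it. This technique allows achieving privacy over time with an upper bound of $\epsilon_\infty$-LDP.

Let $A_j = \{v_1, v_2, ..., v_{k_j}\}$ be a set of $k_j = |A_j|$ values of a given attribute and let $\epsilon$ be the privacy budget. In this paper, for both RR$_1$ and RR$_2$ steps, we will apply either GRR, SUE, or OUE. The unbiased estimator in Eq. (1) for the frequency of a value $v_i$, $i \in [1, k_j]$, is extended to:

$$\hat{f}_L(v_i) = \frac{N_i - n q_1 (p_2 - q_2) - n q_2}{n (p_1 - q_1)(p_2 - q_2)}, \quad (9)$$

in which $N_i$ is the number of times the value $v_i$ has been reported, $n$ is the total number of users, $p_1$ and $q_1$ are the parameters used by an LDP protocol for RR$_1$, and $p_2$ and $q_2$ are the parameters used by an LDP protocol for RR$_2$. Eq. (9) is the result of using the unbiased estimator of Eq. (1) with two rounds of sanitization.

Proof. Let us focus on a value $v_i$. After both rounds, a user holding $v_i$ reports it with probability $p_s = p_1 p_2 + (1 - p_1) q_2$, and a user holding another value reports $v_i$ with probability $q_s = q_1 p_2 + (1 - q_1) q_2$. Thus, $E[N_i] = n f(v_i) p_s + n (1 - f(v_i)) q_s$. Since $p_s - q_s = (p_1 - q_1)(p_2 - q_2)$ and $q_s = q_1 (p_2 - q_2) + q_2$, we obtain $E[\hat{f}_L(v_i)] = f(v_i)$.

The variance of the estimation in Eq. (9) is:

$$Var[\hat{f}_L(v_i)] = \frac{\gamma (1 - \gamma)}{n (p_1 - q_1)^2 (p_2 - q_2)^2}, \quad \text{where } \gamma = f(v_i)(p_1 - q_1)(p_2 - q_2) + q_1 (p_2 - q_2) + q_2. \quad (10)$$

Proof. Thanks to Eq. (9), we have $Var[\hat{f}_L(v_i)] = \frac{Var[N_i]}{n^2 (p_1 - q_1)^2 (p_2 - q_2)^2}$. Since all the users are independent, $Var[N_i] = n \, Var[X]$, where $X$ indicates whether a user reports $v_i$ and follows a Bernoulli distribution with parameter $\gamma$. We thus have $Var[X] = \gamma - \gamma^2 = \gamma(1 - \gamma)$ and, finally, Eq. (10).

In this work, we will use the approximate variance, in which $f(v_i) = 0$ in Eq. (10), which gives:

$$Var^*[\hat{f}_L(v_i)] = \frac{(p_2 q_1 + q_2 (1 - q_1))(1 - p_2 q_1 - q_2 (1 - q_1))}{n (p_1 - q_1)^2 (p_2 - q_2)^2}. \quad (11)$$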
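The two-round estimator and its variance can be written compactly in Python. The sketch below is a hedged illustration: it assumes that a value is reported with probability p1·p2 + (1 − p1)·q2 when it is the true value and q1·p2 + (1 − q1)·q2 otherwise, and the function names are hypothetical.

```python
import numpy as np

def longitudinal_estimate(counts, n, p1, q1, p2, q2):
    """Unbiased frequency estimate after two rounds of sanitization (Eq. (9))."""
    counts = np.asarray(counts, dtype=float)
    return (counts - n * q1 * (p2 - q2) - n * q2) / (n * (p1 - q1) * (p2 - q2))

def longitudinal_variance(n, p1, q1, p2, q2, f=0.0):
    """Variance of the estimate; f = 0 gives the approximate variance (Eq. (11))."""
    # gamma is the probability that a random user reports the value of interest
    gamma = f * (p1 - q1) * (p2 - q2) + q1 * (p2 - q2) + q2
    return gamma * (1 - gamma) / (n * (p1 - q1) ** 2 * (p2 - q2) ** 2)
```

As a sanity check, setting p2 = 1 and q2 = 0 (no second round) reduces the variance to the single-round expression q1(1 − q1) / (n (p1 − q1)^2).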

Longitudinal GRR (L-GRR): definition and $\epsilon$-LDP study
Let $V = \{v_1, v_2, ..., v_{k_j}\}$ be a set of $k_j$ values of a given attribute and let $v_i$ be the real value. We now describe an extension of GRR for longitudinal studies; we refer to this protocol as L-GRR for the rest of this paper. First, Encode($v_i$) = $v_i$ (direct encoding). Next, there are two rounds of sanitization, RR$_1$ and RR$_2$, applying GRR, as described in the following equations.

RR$_1$[GRR]: Memoize a value $B'$ such that

$$\Pr[B' = v] = \begin{cases} p_1, & \text{if } v = v_i \\ q_1, & \text{otherwise,} \end{cases}$$

in which $p_1$ and $q_1$ control the level of longitudinal $\epsilon_\infty$-LDP. The value $B'$ shall be reused as the basis for all future reports on the real value $v_i$.

RR$_2$[GRR]: Generate a report $B''$ such that

$$\Pr[B'' = v] = \begin{cases} p_2, & \text{if } v = B' \\ q_2, & \text{otherwise,} \end{cases}$$

in which $B''$ is the report to be sent to the server.

Visually, Fig. 1 illustrates the probability tree of the L-GRR protocol. In the first round of sanitization, RR$_1$, our proposed L-GRR applies GRR with $p_1 = \frac{e^{\epsilon_\infty}}{e^{\epsilon_\infty} + k_j - 1}$ and $q_1 = \frac{1 - p_1}{k_j - 1}$ (cf. Fig. 1), where $k_j = |A_j|$. As discussed in subsection 2.2.1, this permanent memoization satisfies $\epsilon_\infty$-LDP since $\frac{p_1}{q_1} = e^{\epsilon_\infty}$, which is the upper bound. On the other hand, with a single collection of data, the attacker's knowledge of $v_i$ comes only from $B''$, which is generated using two randomization steps with GRR. This provides a higher level of privacy protection [11]. From Fig. 1, we can obtain the conditional probabilities $p_s$ and $q_s$ of a single report $B''$ given the real value. With the second round of sanitization, RR$_2$[GRR], our proposed L-GRR protocol satisfies $\epsilon_1$-LDP since $\frac{p_s}{q_s} = e^{\epsilon_1}$. Notice that $\epsilon_1$ corresponds to a single report (lower bound) and its extension to infinity reports is limited by $\epsilon_\infty$ (upper bound) since RR$_2$[GRR] uses as input the output of RR$_1$[GRR]. More specifically, the calculus of $\epsilon_1$ is:

$$\epsilon_1 = \ln\left(\frac{p_1 p_2 + q_1 q_2}{p_1 q_2 + q_1 p_2}\right), \quad (12)$$

in which $p_1 = \frac{e^{\epsilon_\infty}}{e^{\epsilon_\infty} + k_j - 1}$, $q_1 = \frac{1 - p_1}{k_j - 1}$, and both $p_2$ and $q_2$ are selectable according to $\epsilon_\infty$, $\epsilon_1$, and $k_j$, calculated as:

$$p_2 = \frac{e^{\epsilon_\infty + \epsilon_1} - 1}{-k_j e^{\epsilon_1} + (k_j - 1) e^{\epsilon_\infty} + e^{\epsilon_1} + e^{\epsilon_1 + \epsilon_\infty} - 1}, \quad q_2 = \frac{1 - p_2}{k_j - 1}. \quad (13)$$

The estimated frequency $\hat{f}_L(v_i)$ that a value $v_i$ occurs for $i \in [1, k_j]$ is calculated using Eq. (9). Lastly, one can calculate the L-GRR approximate variance by replacing the resulting $p_1$, $q_1$, $p_2$, $q_2$ parameters into Eq. (11).
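The parameter selection for L-GRR can be sketched as follows. This is an illustrative sketch under assumptions: the closed form for p2 is obtained here by solving Eq. (12) for p2 with q2 = (1 − p2)/(k − 1), and the function names are hypothetical. It requires eps_1 < eps_inf.

```python
import numpy as np

def lgrr_params(eps_inf, eps_1, k):
    """Return (p1, q1, p2, q2) for L-GRR with bounds eps_inf (upper), eps_1 (lower)."""
    a, b = np.exp(eps_inf), np.exp(eps_1)
    p1 = a / (a + k - 1)
    q1 = (1 - p1) / (k - 1)
    # Solution of Eq. (12) for p2, assuming q2 = (1 - p2) / (k - 1)
    p2 = (a * b - 1) / (-k * b + (k - 1) * a + b + a * b - 1)
    q2 = (1 - p2) / (k - 1)
    return p1, q1, p2, q2

def lgrr_eps1(p1, q1, p2, q2):
    """Single-report privacy level of Eq. (12)."""
    return np.log((p1 * p2 + q1 * q2) / (p1 * q2 + q1 * p2))
```

A quick round trip, `lgrr_eps1(*lgrr_params(2.0, 1.0, 4))`, should recover the requested eps_1 up to floating-point error.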

Longitudinal UE (L-UE): definition and $\epsilon$-LDP study
We now describe the UE-based protocol for longitudinal studies. We refer to this protocol as L-UE for the rest of this paper. Let $V = \{v_1, v_2, ..., v_{k_j}\}$ be a set of $k_j$ values of a given attribute and let $v_i$ be the real value. First, Encode($v_i$) = $B$ (unary encoding), where $B = [0, 0, ..., 1, 0, ..., 0]$ is a $k_j$-bit array in which only the $v_i$-th position is set to one. Next, there are two rounds of sanitization, RR$_1$ and RR$_2$, which apply the UE-based protocols, described as follows.

RR$_1$[UE]: For each bit $i \in [1, k_j]$ in $B$, memoize a value $B'_i$ such that

$$\Pr[B'_i = 1] = \begin{cases} p_1, & \text{if } B_i = 1 \\ q_1, & \text{if } B_i = 0, \end{cases}$$

in which $p_1$ and $q_1$ control the level of longitudinal $\epsilon_\infty$-LDP. The value $B'$ shall be reused as the basis for all future reports on the real value $v_i$.

RR$_2$[UE]: For each bit $i \in [1, k_j]$ in $B'$, generate a report $B''_i$ such that

$$\Pr[B''_i = 1] = \begin{cases} p_2, & \text{if } B'_i = 1 \\ q_2, & \text{if } B'_i = 0, \end{cases}$$

in which $B''$ is the report to be sent to the server.
Visually, Fig. 2 illustrates the probability tree of the L-UE protocol. One natural question emerges: how to select the parameters $\{p_1, q_1, p_2, q_2\}$ in order to optimize the utility of this L-UE protocol?
One can see RR$_1$[UE] as a permanent sanitization and RR$_2$[UE] as a 'small' perturbation to avoid averaging attacks and keep privacy over time.
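To make the role of the two rounds concrete, the following sketch shows a hypothetical user-side class for L-UE with memoization. The class name, the dictionary-based cache, and the integer encoding of values are assumptions made for illustration; the parameters p1, q1, p2, q2 are taken as given.

```python
import numpy as np

class MemoizingUEClient:
    """Hypothetical L-UE client: RR1 is run once per value, RR2 at every report."""

    def __init__(self, k, p1, q1, p2, q2, seed=None):
        self.k = k
        self.p1, self.q1, self.p2, self.q2 = p1, q1, p2, q2
        self.rng = np.random.default_rng(seed)
        self._memo = {}  # maps a real value to its permanent sanitized vector B'

    def _flip(self, bits, p, q):
        # Each bit is reported as 1 with probability p if it is 1, else q
        prob_one = np.where(bits == 1, p, q)
        return (self.rng.random(self.k) < prob_one).astype(int)

    def report(self, v):
        if v not in self._memo:
            encoded = np.zeros(self.k, dtype=int)
            encoded[v] = 1  # unary encoding B
            self._memo[v] = self._flip(encoded, self.p1, self.q1)  # RR1, memoized
        return self._flip(self._memo[v], self.p2, self.q2)  # RR2, fresh each time
```

Since every report is a fresh perturbation of the same memoized vector, averaging many reports reveals at most B', not the real value B.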
Based on SUE and OUE, we are then left with four options: two popular solutions that strictly use only OUE or SUE parameters in both sanitization steps (L-OUE and L-SUE, respectively), and two solutions that combine them, i.e., starting with OUE and then with SUE (L-OSUE) or starting with SUE and then with OUE (L-SOUE). Here, L-SUE is the well-known Basic-RAPPOR protocol [11], L-OUE is the state-of-the-art OUE protocol [14] with memoization, and both L-OSUE and L-SOUE are proposed in this paper.
As presented in [14], the OUE variance in Eq. (6) is smaller than the SUE variance in Eq. (5) and, therefore, the former can provide higher utility than the latter for RR$_1$. On the other hand, we argue that OUE might be too strict for RR$_2$ since the parameter $p_2 = 1/2$ is constant. Thus, we hypothesize that option III (i.e., L-OSUE) is the most suitable one. Without losing generality, the following analyses are done only for L-OSUE, and they can be easily extended to any of the other combinations.
In the first round of sanitization, RR$_1$, our solution L-OSUE applies OUE with $p_1 = \frac{1}{2}$ and $q_1 = \frac{1}{e^{\epsilon_\infty} + 1}$ (underlined in the middle of Fig. 2). As discussed in Section 2.2.2, this permanent memoization satisfies $\epsilon_\infty$-LDP since $\frac{p_1 (1 - q_1)}{(1 - p_1) q_1} = e^{\epsilon_\infty}$, which is the upper bound.
Following the same development as for L-GRR, with a single collection of data, the attacker's knowledge of $B = UE(v)$ comes only from $B''$, which is generated using two randomization steps with OUE and SUE, respectively. This provides a higher level of privacy protection [11]. From Fig. 2, we can obtain the following conditional probabilities according to each bit $i \in [1, k_j]$:

$$p_s = \Pr[B''_i = 1 \mid B_i = 1] = p_1 p_2 + (1 - p_1) q_2, \quad q_s = \Pr[B''_i = 1 \mid B_i = 0] = q_1 p_2 + (1 - q_1) q_2.$$

With the second round of sanitization, RR$_2$[SUE], our proposed L-OSUE protocol satisfies $\epsilon_1$-LDP since $\frac{p_s (1 - q_s)}{(1 - p_s) q_s} = e^{\epsilon_1}$. Notice that $\epsilon_1$ corresponds to a single report (lower bound) and its extension to infinity reports is limited by $\epsilon_\infty$ (upper bound) since RR$_2$[SUE] uses as input the output of RR$_1$[OUE]. More specifically, the calculus of $\epsilon_1$ for L-OSUE (or L-UE protocols in general) is:

$$\epsilon_1 = \ln\left(\frac{(p_1 p_2 + (1 - p_1) q_2)(1 - q_1 p_2 - (1 - q_1) q_2)}{(1 - p_1 p_2 - (1 - p_1) q_2)(q_1 p_2 + (1 - q_1) q_2)}\right), \quad (14)$$

in which, for L-OSUE, we have $p_1 = \frac{1}{2}$, $q_1 = \frac{1}{e^{\epsilon_\infty} + 1}$, and both $p_2$ and $q_2$ are symmetric ($p_2 + q_2 = 1$) and selectable according to $\epsilon_\infty$ and $\epsilon_1$, calculated as:

$$p_2 = \frac{1 - e^{\epsilon_1 + \epsilon_\infty}}{e^{\epsilon_1} - e^{\epsilon_\infty} - e^{\epsilon_1 + \epsilon_\infty} + 1}, \quad q_2 = 1 - p_2. \quad (15)$$

Similarly, the estimated frequency $\hat{f}_L(v_i)$ that a value $v_i$ occurs for $i \in [1, k_j]$ is calculated using Eq. (9). Lastly, one can calculate the L-OSUE (or L-UE protocols in general) approximate variance by replacing the resulting $p_1$, $q_1$, $p_2$, $q_2$ parameters into Eq. (11).
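The L-OSUE parameter selection can be sketched in Python as follows. This is an illustrative sketch with hypothetical function names; the closed form for p2 is the solution of the single-report privacy condition under p1 = 1/2, q1 = 1/(e^eps_inf + 1), and p2 + q2 = 1, and it requires eps_1 < eps_inf.

```python
import numpy as np

def losue_params(eps_inf, eps_1):
    """Return (p1, q1, p2, q2) for L-OSUE: OUE in RR1, symmetric (SUE-like) RR2."""
    a, b = np.exp(eps_inf), np.exp(eps_1)
    p1, q1 = 0.5, 1 / (a + 1)
    p2 = (1 - a * b) / (b - a - a * b + 1)
    q2 = 1 - p2
    return p1, q1, p2, q2

def lue_eps1(p1, q1, p2, q2):
    """Single-report privacy level of an L-UE protocol."""
    ps = p1 * p2 + (1 - p1) * q2  # Pr[B''_i = 1 | B_i = 1]
    qs = q1 * p2 + (1 - q1) * q2  # Pr[B''_i = 1 | B_i = 0]
    return np.log(ps * (1 - qs) / ((1 - ps) * qs))

def lue_approx_variance(n, p1, q1, p2, q2):
    """Approximate variance of the two-round estimator with f(v_i) = 0."""
    g = q1 * (p2 - q2) + q2
    return g * (1 - g) / (n * (p1 - q1) ** 2 * (p2 - q2) ** 2)
```

For instance, with eps_inf = 2 and eps_1 = 1, the resulting approximate variance coincides with the one-round OUE variance at eps = 1, which is the "lagged" behaviour discussed in the numerical evaluation.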

Numerical evaluation of L-GRR and L-UE protocols
In this subsection, we numerically evaluate the approximate variance of all developed longitudinal protocols, namely, L-GRR and the four UE-based options, namely, L-OUE, L-SUE, L-OSUE, and L-SOUE. As aforementioned, once both $\epsilon_\infty$ and $\epsilon_1$ privacy guarantees are defined, one can obtain parameters $p_1$ and $q_1$ depending on $\epsilon_\infty$, and parameters $p_2$ and $q_2$ depending on both $\epsilon_\infty$ and $\epsilon_1$ (and the domain size $k_j$ for L-GRR), as given in Eq. (13) for L-GRR and in Eq. (15) for L-OSUE.
Next, once the parameters $\{p_1, q_1, p_2, q_2\}$ are computed, one can calculate the approximate variance with Eq. (11) for each protocol. In other words, following our proposal, one has to set both the upper ($\epsilon_\infty$) and lower ($\epsilon_1$) bounds of the privacy guarantees. For example, letting $\epsilon_\infty = 2$, one might want the first $\epsilon_1$-LDP report to have high privacy, such as $\epsilon_1 = 0.1$. For values of $\epsilon_1$ higher than $0.6\epsilon_\infty$, neither L-OUE nor L-SOUE can satisfy some values of $\epsilon_1$ because of the constant $p_2 = 1/2$ in RR$_2$. However, it is not desirable to have higher values of $\epsilon_1$ and, thus, we do not consider values above $0.6\epsilon_\infty$ in our analysis. Besides, Table 2 exhibits the numerical values for the non-longitudinal GRR, OUE, and SUE protocols, which allow evaluating how utility degrades with a second step of sanitization.
From Table 1, one can notice that L-GRR presents the smallest variance values for binary attributes (i.e., when $k_j = 2$). On the other hand, L-GRR is also the most sensitive to changes in the privacy parameters $\epsilon_\infty$ and $\epsilon_1$ when $k_j$ is large, showing a much higher variance than the non-longitudinal GRR in Table 2. Similar to the non-longitudinal GRR, this increase in the variance is due to the number of values $k_j$, which decreases the probability $p$ of reporting the true value. Two rounds of sanitization further deteriorate the accuracy of the L-GRR protocol, whose variance reaches extremely high values, e.g., see L-GRR($k_j = 2^{10}$). Interestingly, when $k_j = 2$ in Table 1, the variance of L-GRR with $\epsilon_1 = 0.5\epsilon_\infty$ is a lagged version of the variance values given by the non-longitudinal GRR in Table 2. This effect is also observed for both the L-SUE (cf. SUE in Table 2) and L-OSUE (cf. OUE in Table 2) protocols, which use symmetric probabilities on RR$_2$ (i.e., $p_2 + q_2 = 1$). We highlight these values in bold font. However, for L-GRR, this is not true for other values of $k_j$, the further analysis of which is beyond the scope of this paper.
On the other hand, the L-UE protocols avoid having a variance that depends on $k_j$ by encoding the value into the unary representation, which results in a constant variance regardless of the domain size of the attribute. To complement the results of Table 1, Fig. 3 illustrates the numerical values of the approximate variance for the L-UE protocols with $\epsilon_1 = \{0.3\epsilon_\infty, 0.6\epsilon_\infty\}$. With the four options I-IV analyzed, in high privacy regimes, L-OSUE and L-SUE have similar performance, with the proposed L-OSUE always being slightly better. In lower privacy regimes, our proposed protocols L-SOUE and L-OSUE have similar performance, and both outperform the L-OUE and L-SUE protocols. As shown in our experiments, the L-OUE protocol has the worst performance among the four options analyzed, with the exception of high values of $\epsilon_\infty$ (see the plot on the bottom of Fig. 3), when its performance is superior or similar to that of L-SUE. Indeed, for L-OUE, selecting $p_2 = 1/2$ for the second sanitization step is too strict, which results in higher variance values. Therefore, by comparing the approximate variances, the best option for L-UE protocols, in terms of utility, is to start with OUE and then with SUE as we propose in this paper, i.e., L-OSUE.

Table 1 (column layout): Privacy Guarantees; L-GRR ($k_j = 2$, $k_j = 32$, $k_j = 2^{10}$); L-UE (L-OSUE, L-SUE, L-SOUE, L-OUE).

The ALLOMFREE algorithm
Let $A = \{A_1, A_2, ..., A_d\}$ be a set of $d$ attributes with domain sizes $k = [k_1, k_2, ..., k_d]$, $\mathcal{A}$ = {L-GRR, L-OSUE} be a set of optimal longitudinal LDP protocols, and $\epsilon_\infty$ and $\epsilon_1$ be the longitudinal and single-report privacy guarantees, respectively. Each user $u_i$, for $i \in \{1, 2, ..., n\}$, has a tuple $v^{(i)} = (v^{(i)}_1, v^{(i)}_2, ..., v^{(i)}_d)$, i.e., a private value per attribute. From now on, we will simply omit the index notation $v^{(i)}$ and use $v$ in the analysis as we focus on one arbitrary user $u_i$ here. For each attribute $j \in [1, d]$ (we slightly abuse the notation and use $j$ for $A_j$) at time $t \in [1, \tau]$, the aggregator aims to estimate the frequencies of each value $v \in A_j$.

Client-Side. In a multidimensional setting with different domain sizes for each attribute, a dynamic selection of longitudinal LDP protocols is preferred. As mentioned in Section 3, we propose that each user randomly sample $r = Uniform(\{1, 2, ..., d\})$ to select a single attribute $A_r$. Given $k_r$ (the domain size), $\epsilon_\infty$, and $\epsilon_1$, the user selects the protocol in $\mathcal{A}$ with the smaller approximate variance according to Eq. (11). Therefore, the first round of sanitization ensures a permanent memoization $B'$ that is always used for the second round of sanitization to generate $B''$ each time $t \in [1, \tau]$ the user will report the real value $B$. We call our solution Adaptive LDP for LOngitudinal and Multidimensional FREquency Estimates (ALLOMFREE), which is summarized in Algorithm 1 as a pseudocode.
The intuition of ALLOMFREE is as follows. By requiring each user to submit only one attribute with the whole privacy budget, it reduces both the variance incurred and the communication cost. Also, since we develop the calculus of the approximate variance in Eq. (11) for the proposed longitudinal protocols (L-GRR and L-OSUE), ALLOMFREE can adaptively select the protocol with the smaller variance value to optimize the data utility. Therefore, ALLOMFREE utilizes optimal solutions for both multidimensional and longitudinal data collection settings developed in Sections 3 and 4 of this paper, respectively.

Server-Side. On the server-side, for each attribute $j \in [1, d]$ at time $t \in [1, \tau]$, the estimated frequency $\hat{f}_L(v_i)$ that a value $v_i$ occurs for $i \in [1, k_j]$ is calculated using Eq. (9).

Privacy analysis. On the one hand, according to the analysis in subsections 4.2 and 4.3, Alg. 1 satisfies $\epsilon$-LDP with upper $\epsilon_\infty$ (infinity reports) and lower $\epsilon_1$ (a single report) bounds as it uses either L-GRR or L-OSUE to sanitize a single attribute per user. Notice that, to ensure the users' privacy over time and to avoid the sequential composition theorem [3], each user must always report the same unique attribute $A_r$. In addition, the privacy of a user decreases gracefully according to the number of LDP reports $t \leq \tau$ that an adversary has gained access to, which is calculated as in [45,36].

Limitations. Similar to other sampling-based methods for collecting multidimensional data under LDP [34,32,33,35], our ALLOMFREE algorithm also entails a sampling error, which is due to observing a sample instead of the entire population. In addition, concerning the privacy guarantees, the memoization step of ALLOMFREE is certainly effective for longitudinal privacy in the cases where the client's true data does not vary (static), varies very slowly, or varies in an uncorrelated manner [11].
In many application scenarios, gender, age range, nationality, and other demographic data are generally static or hardly ever vary. On the other hand, for dynamic attributes such as the location or the time spent in the application, this is not the case. For such attributes, a new memoized value would be generated for each different value, thus accumulating the privacy budget $\epsilon_\infty$ by the sequential composition theorem [3].
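The adaptive selection performed on the client side can be sketched as below. This is a simplified illustration, not the authors' Algorithm 1: the function names are hypothetical, the parameter formulas are the closed forms obtained by solving the single-report privacy conditions of L-GRR and L-OSUE, and the variance is compared per user (n = 1) since n scales both protocols equally.

```python
import numpy as np

def approx_variance(p1, q1, p2, q2, n=1):
    """Approximate variance of the two-round estimator with f(v_i) = 0."""
    g = q1 * (p2 - q2) + q2
    return g * (1 - g) / (n * (p1 - q1) ** 2 * (p2 - q2) ** 2)

def lgrr_params(eps_inf, eps_1, k):
    a, b = np.exp(eps_inf), np.exp(eps_1)
    p1 = a / (a + k - 1)
    p2 = (a * b - 1) / (-k * b + (k - 1) * a + b + a * b - 1)
    return p1, (1 - p1) / (k - 1), p2, (1 - p2) / (k - 1)

def losue_params(eps_inf, eps_1):
    a, b = np.exp(eps_inf), np.exp(eps_1)
    p2 = (1 - a * b) / (b - a - a * b + 1)
    return 0.5, 1 / (a + 1), p2, 1 - p2

def select_protocol(k, eps_inf, eps_1):
    """Pick the protocol with the smaller approximate variance for domain size k."""
    var_grr = approx_variance(*lgrr_params(eps_inf, eps_1, k))
    var_osue = approx_variance(*losue_params(eps_inf, eps_1))
    return "L-GRR" if var_grr <= var_osue else "L-OSUE"

def allomfree_setup(domain_sizes, eps_inf, eps_1, rng):
    """Sample one attribute uniformly (kept for all reports) and choose its protocol."""
    j = int(rng.integers(len(domain_sizes)))
    return j, select_protocol(domain_sizes[j], eps_inf, eps_1)
```

With eps_inf = 2 and eps_1 = 1, this sketch selects L-GRR for a binary attribute and L-OSUE for an attribute with 1024 values, consistent with the trend described in the numerical evaluation.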

Experimental results
In this section, we present the setup of our experiments and the results with real-world data.

Setup of experiments
The main goal of our experiments is to evaluate the proposed longitudinal LDP protocols on multidimensional frequency estimates a single time, i.e., satisfying $\epsilon_1$-LDP (as in [11,40,39], for example).

Environment. All algorithms are implemented in Python 3.8.8 with NumPy 1.19.5 and Numba 0.53.1 libraries. The codes we develop and use for all experiments are available in a GitHub repository. In all experiments, we report average results over 100 runs as LDP algorithms are randomized.

Methods evaluated. We consider for evaluation the following solutions and protocols:
• Solution Smp (cf. Section 3), which randomly samples a single attribute to be sent with the whole privacy budget. We will experiment with the state-of-the-art protocols, namely, L-SUE and L-OUE, and with our extended protocols L-OSUE and L-SOUE;
• Our ALLOMFREE solution (cf. Alg. 1), which also randomly samples a single attribute to be sent with the whole privacy budget but adaptively selects the optimal protocol, i.e., either L-GRR or L-OSUE.

Experimental evaluation and metrics. We vary the longitudinal privacy parameter in the range $\epsilon_\infty = [0.5, 1, ..., 3.5, 4]$ with $\epsilon_1 = [0.3\epsilon_\infty, 0.6\epsilon_\infty]$ to compare our experimental results with the numerical ones from subsection 4.4. Notice that this range of privacy guarantees is commonly used in the literature for multidimensional data (e.g., in [33] the range is $\epsilon = [0.5, ..., 4]$ and in [35] the range is $\epsilon = [0.1, ..., 10]$). To evaluate our results, we use the MSE metric averaged over the number of attributes $d$ in a single data collection $\tau = 1$, i.e., with $\epsilon_1$-LDP. Thus, for each attribute $j$, we compute for each value $v_i \in A_j$ the estimated frequency $\hat{f}(v_i)$ and the real one $f(v_i)$ and calculate their differences. More precisely,

$$MSE_{avg} = \frac{1}{d} \sum_{j \in [1, d]} \frac{1}{|A_j|} \sum_{v_i \in A_j} \left(f(v_i) - \hat{f}(v_i)\right)^2.$$

Datasets. For ease of reproducibility, we conduct our experiments on four multidimensional open datasets.
Algorithm 1 User-side algorithm of ALLOMFREE.

Results
Our experiments were conducted on four real-world datasets with varied parameters for n, d, and k, which allowed evaluating our solutions more practically. As one can notice in the results, for all datasets, ALLOMFREE consistently and considerably outperforms the state-of-the-art protocols, namely, L-SUE (a.k.a. Basic-RAPPOR) [11] and L-OUE (that uses OUE [14] twice). Indeed, the difference between the performances of ALLOMFREE and the other longitudinal LDP protocols increases proportionally according to the privacy guarantees, i.e., for high ∞ and 1 values, the gap is bigger. This is first because in all datasets there are attribute(s) with a small domain size (e.g., k j = 2 or k j = 3), in which L-GRR can provide smaller variance values than the L-UE protocols (cf. subsection 4.4). Secondly, by adequately selecting the probabilities p 1 , q 1 , p 2 , q 2 for the L-UE protocol (i.e., L-OSUE) also optimizes data utility. Thus, since there is a way to measure the approximate variance of the extended protocols (i.e., Eq. (11)), given the sampled attribute, ALLOMFREE adaptively selects one of the optimized protocol (i.e., L-GRR or L-OSUE) whose smaller variance improves the data utility.
In addition, among the L-UE protocols applied individually, the experimental results with multidimensional data approximate the numerical results with a single attribute from subsection 4.4. For instance, the proposed L-OSUE provides similar or better performance than L-SUE while always outperforming L-OUE. Besides, L-SOUE also always outperforms L-OUE, achieving performance similar to that of L-OSUE and L-SUE in low privacy regimes (i.e., high ε values). As already shown in subsection 4.4, even though OUE has better utility than SUE for a one-time collection [14], applying OUE twice does not provide higher utility.
To complement the results of Figs. 4-7, Table 3 (ε_1 = 0.3 ε_∞) and Table 4 (ε_1 = 0.6 ε_∞) report, for all datasets and ε_∞ guarantees, the utility metrics U_L-SUE and U_L-OUE, which represent the accuracy gain of ALLOMFREE over the state-of-the-art L-SUE and L-OUE protocols, respectively. From Tables 3 and 4, one can notice that ALLOMFREE considerably improves the quality of the frequency estimates in comparison with the state-of-the-art L-SUE and L-OUE protocols. On average, ALLOMFREE improves the results of L-SUE by at least 10% (MS-FIMU dataset, Table 3) and at most 30.38% (Nursery dataset, Table 4) for the privacy guarantees ε_∞ and ε_1 analyzed. Similarly, on average, ALLOMFREE improves the results of L-OUE by at least 19.32% (MS-FIMU dataset, Table 3) and at most 54.96% (Nursery dataset, Table 4). The highest accuracy gain was about 71%, achieved with the Nursery dataset when ε_∞ = 4 in Table 4 in comparison with the L-OUE protocol. Finally, as one can note, with higher values of ε_1, ALLOMFREE provides much higher utility than the other protocols.

Table 3: Accuracy gain of ALLOMFREE over the state-of-the-art L-SUE and L-OUE protocols for all datasets with ε_1 = 0.3 ε_∞, measured with the U_L-SUE and U_L-OUE metrics expressed in %.
Table 4: Accuracy gain of ALLOMFREE over the state-of-the-art L-SUE and L-OUE protocols for all datasets with ε_1 = 0.6 ε_∞, measured with the U_L-SUE and U_L-OUE metrics expressed in %.
However, most studies on collecting multidimensional data with LDP have mainly focused on numerical data [49] (e.g., [32,33,34,35]) or on other complex tasks with categorical data (e.g., marginal estimation [27,28,29,30,31], analytical/range queries [24,23,25,26]). Our ALLOMFREE solution is based on the multidimensional Smp solution, which randomly samples only a single attribute per user, minimizing the variance of the estimation and the communication cost. A recent study [50] proposes the Random Sampling plus Fake Data (RS+FD) solution for multidimensional data, in which the user samples a single attribute but also generates fake data for all non-sampled attributes. The RS+FD solution creates uncertainty in the view of the aggregator while achieving data utility similar to that of the Smp solution. An interesting direction would be to extend ALLOMFREE to add fake data for the non-sampled attributes too.
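To make the Smp idea concrete, the following minimal sketch shows a single (non-longitudinal) collection in which each user samples one attribute uniformly at random and reports it with GRR using the whole budget ε > 0. It is an illustration under assumptions, not the ALLOMFREE protocol: the function names are hypothetical, memoization is omitted, and the server normalizes by the observed number n_j of users who sampled attribute j rather than by the expected n/d.

```python
import math
import random
from collections import Counter


def grr_probabilities(k, eps):
    """Return (p, q) of GRR for a domain of size k >= 2 and eps > 0."""
    p = math.exp(eps) / (math.exp(eps) + k - 1)
    q = (1.0 - p) / (k - 1)
    return p, q


def smp_client(record, domains, eps, rng):
    """Sample one attribute index j and report (j, GRR-perturbed value)."""
    j = rng.randrange(len(domains))
    p, _ = grr_probabilities(len(domains[j]), eps)
    if rng.random() < p:
        return j, record[j]  # keep the true value with probability p
    others = [v for v in domains[j] if v != record[j]]
    return j, rng.choice(others)  # otherwise report another value uniformly


def smp_estimate(reports, domains, eps):
    """Unbiased GRR frequency estimates, one dictionary per attribute."""
    estimates = []
    for j, domain in enumerate(domains):
        p, q = grr_probabilities(len(domain), eps)
        counts = Counter(v for a, v in reports if a == j)
        n_j = sum(counts.values())  # users who sampled attribute j
        estimates.append({
            v: ((counts[v] / n_j - q) / (p - q) if n_j else 0.0)
            for v in domain
        })
    return estimates
```

A typical use is to call `smp_client` once per user with a shared list of attribute domains and then pass all reports to `smp_estimate`. Each user sends only one perturbed value, which is what keeps the communication cost low.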
Besides, most academic literature on frequency estimation focuses on a single data collection. To address longitudinal data collections, the authors of [11,12] proposed LDP protocols based on two rounds of sanitization, i.e., memoization, which was also adopted in this paper. In the literature, some studies [39,40] applied L-SUE (a.k.a. Basic-RAPPOR [11]) and L-OUE (i.e., OUE [14] with memoization) for longitudinal frequency estimates. However, rather than strictly using only SUE or OUE, we show analytically and experimentally that starting with OUE and then applying SUE (i.e., L-OSUE) provides higher data utility. The privacy guarantees of chaining two LDP protocols have been further studied in [45,36], which results in Eq. (16). Indeed, combining "multiple" settings (i.e., many attributes and several collections throughout time) imposes several challenges, for which this paper proposes the first solution under LDP, named ALLOMFREE.
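The two rounds of sanitization behind memoization can be sketched as follows, using GRR in both rounds for simplicity. This is a hedged illustration only: the class name `MemoizedGRRClient` is hypothetical, and the keep-probabilities `p1` and `p2` are taken as inputs because the expressions that derive them from ε_∞ and ε_1 are not reproduced here.

```python
import random


class MemoizedGRRClient:
    """Two-round GRR with memoization (illustrative sketch)."""

    def __init__(self, domain, p1, p2, seed=None):
        self.domain = list(domain)
        self.p1 = p1  # first round: applied once per distinct true value
        self.p2 = p2  # second round: applied to every report
        self.rng = random.Random(seed)
        self._memo = {}  # true value -> permanently memoized first-round output

    def _grr(self, value, p):
        # Keep the value with probability p, else pick another one uniformly.
        if self.rng.random() < p:
            return value
        return self.rng.choice([v for v in self.domain if v != value])

    def report(self, true_value):
        # Round 1: sanitize once and reuse the result in all later collections.
        if true_value not in self._memo:
            self._memo[true_value] = self._grr(true_value, self.p1)
        # Round 2: fresh randomization of the memoized value at each collection.
        return self._grr(self._memo[true_value], self.p2)
```

Because the first-round output is reused whenever the true value is unchanged, repeated collections do not reveal new randomizations of the true value itself; only the second round is redrawn.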

Conclusion
This paper investigates the problem of collecting multidimensional data throughout time for the fundamental task of frequency estimation under LDP guarantees. We extend and analyze three state-of-the-art LDP protocols, namely, GRR [18], OUE [14], and SUE [11], and propose an optimized solution, namely, ALLOMFREE, which randomly samples one attribute per user and adaptively selects the protocol with the lower variance (i.e., L-GRR or L-OSUE) in order to improve data utility. Through experimental validations, we demonstrate the advantages of ALLOMFREE over the state-of-the-art protocols L-SUE [11] and L-OUE [14] by using four real-world datasets, with the average gain of accuracy ranging from 10% up to 55% for the analyzed range of ε_∞ and ε_1 privacy guarantees. For future work, we suggest and intend to improve the frequency estimates through post-processing techniques [56,43] and to design LDP protocols for longitudinal and multidimensional studies considering both numerical and categorical data.