
HTF: Homogeneous Tree Framework for Differentially Private Release of Large Geospatial Datasets with Self-tuning Structure Height

Published: 20 November 2023


Abstract

Mobile apps that use location data are pervasive, spanning domains such as transportation, urban planning, and healthcare. Important use cases for location data rely on statistical queries, e.g., identifying hotspots where users work and travel. Such queries can be answered efficiently by building histograms. However, precise histograms can expose sensitive details about individual users. Differential privacy (DP) is a mature and widely adopted protection model, but most approaches for DP-compliant histograms work in a data-independent fashion, leading to poor accuracy. The few proposed data-dependent techniques attempt to adjust histogram partitions based on dataset characteristics, but they do not perform well due to the addition of noise required to achieve DP. In addition, they use ad hoc criteria to decide the depth of the partitioning. We identify density homogeneity as a main factor driving the accuracy of DP-compliant histograms, and we build a data structure that splits the space such that data density is homogeneous within each resulting partition. We propose a self-tuning approach to decide the depth of the partitioning structure that optimizes the use of privacy budget. Furthermore, we provide an optimization that scales the proposed split approach to large datasets while maintaining accuracy. We show through extensive experiments on large-scale real-world data that the proposed approach achieves superior accuracy compared to existing approaches.


1 INTRODUCTION

Statistical analysis of location data, typically collected by mobile apps, helps researchers and practitioners understand patterns within the data, which in turn can be used in various domains such as transportation, urban planning, and public health. At the same time, significant privacy concerns arise when locations are directly accessed. Sensitive details about individuals, such as political or religious affiliations, alternative lifestyle habits, and so on, can be derived from users’ whereabouts. Therefore, it is essential to account for user privacy and protect location data.

Differential privacy (DP) [9] is a powerful protection model that allows answering aggregate queries (e.g., count, sum) while hiding the presence of any specific individual within the data. In other words, the query results do not permit an adversary to infer with significant probability whether a certain individual’s record is present in the dataset or not. DP achieves protection by injecting random noise into the query results according to well-established rules. It is a semantic model adopted both by government entities (e.g., the U.S. Census Bureau) and by major industry players. In contrast with syntactic approaches, such as k-anonymity or \(\ell\)-diversity [20], DP provides formal security guarantees that achieve protection even against adversaries with access to background knowledge.

In the location domain, existing DP-based approaches build a spatial index structure, and perturb index node counts using random noise. Subsequently, queries are answered based on the noisy node counts. Building DP-compliant index structures has several benefits: first, querying indexes is a natural approach for most existing spatial processing techniques; second, using an index helps quantify and limit the amount of disclosure, which becomes infeasible if one allows arbitrary queries on top of the exact data; third, query efficiency is improved. Due to large amounts of background knowledge data available to adversaries (e.g., public maps, satellite imagery), information leakage may occur both from query answers, as well as from the properties of the indexing structure itself. To deal with the structure leakage, initial approaches used data-independent index structures, such as quad-trees, binary space partitioning trees, or uniform grids (UG). No structural leakage occurred, and the protection techniques focused on improving the signal-to-noise ratio in query answers. However, such techniques perform poorly when the data distribution is skewed.

Recently, data-dependent approaches emerged, such as adaptive grids (AG), or kd-tree-based approaches [16, 23]. AG overcomes the rigidity of UG by providing a two-level grid, where the first level has fixed granularity, and the second uses a granularity derived from the coarse results obtained from the first level. While it achieves improvement, it is still a rather blunt tool to account for high variability in dataset density, which is quite typical of real-life datasets. More sophisticated approaches tried to build kd-trees or R-trees in a DP-compliant way, but to do so they used DP mechanisms such as the exponential mechanism (EM) (discussed in Section 2), which are difficult to tune and may introduce significant errors. In fact, the work in Reference [23] shows that data-dependent structures based on EM fail to outperform AG.

Our proposed Homogeneous Tree Framework (HTF) addresses the problem of DP-compliant location protection using a data-dependent approach that focuses on building index structures with homogeneous intra-node density. Our key observation is that density homogeneity is the main factor influencing the signal-to-noise ratio for DP-compliant spatial queries (we discuss this aspect in detail in Section 3). Rather than using complex mechanisms like EM, which have high sensitivity, we derive theoretical results that directly link index structure construction with intra-node data density based on the lower-sensitivity Laplace mechanism (introduced in Section 2). This novel approach allows us to build effective index structures capable of delivering low query error without excessive consumption of privacy budget. HTF is custom-tailored to capture areas of homogeneous density in the dataset, which leads to significant accuracy gains. Our specific contributions are:

  • We identify data homogeneity as the main factor influencing query accuracy in DP-compliant spatial data structures;

  • We propose a custom technique for homogeneity-driven DP-compliant space partitioning based on the Laplace mechanism, and we perform an in-depth analysis of its sensitivity;

  • We derive effective DP budget allocation strategies to balance the noise added during the building of the structure with that used for releasing query answers;

  • We propose a set of heuristics to automatically tune data structure parameters based on data properties, with the objective of minimizing overall error in query answering;

  • We devise a stop condition strategy that allows self-tuning of the partitioning structure height to optimize the utilization of the privacy budget;

  • We provide an extension of HTF designed to preserve accuracy of queries in the case of very large input datasets;

  • We perform an extensive empirical evaluation showing that HTF outperforms existing state-of-the-art on real and synthetic datasets under a broad range of privacy requirements.

The rest of the article is organized as follows: Section 2 presents background information and introduces the problem definition. Section 3 provides an overview of the proposed framework, followed by technical details in Section 4. Section 5 derives optimal adaptive stop conditions, and Section 6 extends HTF to large-scale datasets. We evaluate our approach empirically against the state-of-the-art in Section 7. Finally, we survey related work in Section 8 and conclude the article.


2 BACKGROUND AND DEFINITIONS

Private publication of location histograms follows the two-party system model shown in Figure 1. The data owner/curator first builds an exact histogram with the distribution of locations on the map. Non-trusted users/analysts are interested in learning the population distribution over different areas, and perform statistical queries. The goal of the curator is to publish the location histogram without the privacy of any individual being compromised. To this end, the exact histogram undergoes a sanitizing process according to DP to generate a private histogram. In our proposed method, a tree-based algorithm is applied for protection, and the tree’s nodes, representing a private histogram of the map, are released to the public. Analysts/researchers ask unlimited count queries that are answered from the private histogram. Furthermore, they may download the whole private histogram, and the protection method remains strong enough to protect the identity of individuals in the database. Table 1 summarizes our notations.

Fig. 1. System model for private location histograms.

2.1 Differential Privacy

Consider two databases \(\mathcal {D}\) and \(\mathcal {D}^{\prime }\) that differ in a single record t, i.e., \(\mathcal {D}^{\prime }=\mathcal {D}\cup \lbrace t\rbrace\) or \(\mathcal {D}^{\prime }=\mathcal {D}\backslash \lbrace t\rbrace\). \(\mathcal {D}\) and \(\mathcal {D}^{\prime }\) are commonly referred to as neighboring or sibling datasets.

Definition 1 (\(\epsilon\)-Differential Privacy [8]). A randomized mechanism \(\mathcal {A}\) provides \(\epsilon\)-DP if for any pair of neighboring datasets \(\mathcal {D}\) and \(\mathcal {D}^{\prime }\), and any \(S\in Range(\mathcal {A})\), (1) \(\begin{equation} \dfrac{Pr(\mathcal {A}(\mathcal {D})=S)}{Pr(\mathcal {A}(\mathcal {D}^{\prime })=S)} \le e^\epsilon . \end{equation}\)

Parameter \(\epsilon\) is referred to as privacy budget. \(\epsilon\)-DP requires that the output S obtained by executing mechanism \(\mathcal {A}\) does not significantly change by adding or removing one record in the database. Thus, an adversary is not able to infer with significant probability whether an individual’s record was included or not in the database.

There are two common methods to achieve differential privacy: the Laplace mechanism (LM) and the exponential mechanism (EM). Both approaches are closely related to the concept of sensitivity, which captures the maximal difference achieved in the output by adding or removing a single record from the database.

Definition 2 (\(L_1\)-Sensitivity [9]). Given sibling datasets \(\mathcal {D}\), \(\mathcal {D}^{\prime }\), the \(L_1\)-sensitivity of a set \(f = \lbrace f_1, \ldots , f_m\rbrace\) of real-valued functions is \(\begin{equation*} \Delta f = \max _{\forall \mathcal {D},\mathcal {D}^{\prime }}\sum _{i=1}^m |f_i(\mathcal {D})- f_i(\mathcal {D}^{\prime })|. \end{equation*}\)

2.1.1 Laplace Mechanism (LM).

LM adds to the output of a query function f noise drawn from the Laplace distribution with scale b, where b depends on two factors: sensitivity and privacy budget, (2) \(\begin{equation} \text{Lap}(x|b) = \dfrac{1}{2b}e^{-|x|/b},\; \text{where} \; b=\dfrac{\Delta {f}}{\epsilon }. \end{equation}\)

To simplify notation, we denote Laplace noise by \(\text{Lap}(\dfrac{\Delta {f}}{\epsilon })\), as it only depends on the sensitivity and budget.
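As a concrete illustration (our sketch, not code from the article), the snippet below releases a noisy count via LM; a count query has \(L_1\)-sensitivity 1, since adding or removing one record changes the count by at most 1. The numeric values are hypothetical.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Release true_value + Lap(sensitivity / epsilon), satisfying epsilon-DP."""
    rng = rng or np.random.default_rng()
    scale = sensitivity / epsilon          # b = Delta f / epsilon
    return true_value + rng.laplace(loc=0.0, scale=scale)

# A count query has L1-sensitivity 1, so the noise scale is 1 / epsilon.
noisy_count = laplace_mechanism(true_value=1280, sensitivity=1.0, epsilon=0.1)
```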

2.1.2 Exponential Mechanism (EM).

EM achieves \(\epsilon\)-DP when the output of a computation is not numerical. With EM, the output is drawn from a probability distribution chosen based on a utility function. Consider the generalized problem where for an input x, an output s is chosen from the output space S, i.e., \(s\in S\). The utility function u takes as input x and a candidate output s, and returns a real value measuring the quality of s as a solution for input x. EM aims to determine \(\max _{s\in S}\lbrace u(x,s) \rbrace\) in a differentially private way, by sampling outputs with probability proportional to \(\exp (\epsilon u(x,s)/(2\Delta u))\). In general, a single record may have a significant impact on the utility, hence the required noise may grow large, leading to poor accuracy.

An important property of DP is composability [10], which helps quantify the amount of privacy attained when multiple functions are evaluated on the data. Sequential composition specifies that running in succession multiple mechanisms that satisfy DP with privacy budgets \(\epsilon _1, \epsilon _2, \ldots ,\epsilon _n\) results in \(\epsilon\)-differential privacy with \(\epsilon = \sum _{i=1}^n \epsilon _i\). Conversely, when mechanisms are applied on disjoint data partitions, parallel composition states that the budget consumption is only \(\max _{i=1}^n \epsilon _i\).
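A quick illustration of how the two rules account for budget (our sketch, with hypothetical budget values):

```python
# Sequential composition: mechanisms run on the same data, budgets add up.
eps_sequential = sum([0.1, 0.2, 0.3])   # total consumption: 0.6

# Parallel composition: mechanisms run on disjoint partitions of the data,
# so only the maximum per-partition budget is consumed.
eps_parallel = max([0.1, 0.2, 0.3])     # total consumption: 0.3
```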

2.2 Problem Formulation

Consider a two-dimensional location dataset D discretized to an arbitrarily fine \(N\times M\) grid. Each point is represented by its corresponding rectangular cell in the grid. We study the problem of releasing DP-compliant histograms to answer count queries as accurately as possible. Cell counts are modeled via an \(N\times M\) frequency matrix, in which the entry in the \(i{\text{th}}\) row and \(j{\text{th}}\) column represents the number of records located inside cell \((i,j)\) of the grid.

A DP histogram is generated based on a non-overlapping partitioning of the frequency matrix by applying methods to preserve \(\epsilon\)-DP. The DP histogram consists of the boundary of partitions and their noisy counts, where each partition consists of a group of cells.

Let us denote the total count of a partition with q cells by c and its noisy count by \(\overline{c}\). There are two sources of error in answering a query. The first is referred to as noise error, which is due to the Laplace noise added to the partition count. The second is referred to as uniformity error, and arises when a query partially overlaps a partition: an assumption of uniformity is made within the partition, and the answer per cell is calculated as \(\overline{c}/q\).

For example, consider the \(3\times 3\) grid shown in Figure 2(a), where each count represents the number of data points in the corresponding cell. The cells are grouped in four partitions \(C_1\), \(C_2\), \(C_3\), and \(C_4\), each entailing \(0,\, 12,\, 4,\) and 2 data points, respectively. Independent noise with the same magnitude is added to each partition’s count denoted by \(n_1\), \(n_2\), \(n_3\), and \(n_4\), and released to the public as a DP histogram. The result of the query shown by the dashed rectangle can be calculated as \((12+n_2)/4 + (2+n_4)/2\).
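The computation in this example can be sketched as follows (our illustration; the seed and budget are hypothetical, and the cell counts used in the query follow Figure 2(a), where C2 spans 4 cells and C4 spans 2):

```python
import numpy as np

rng = np.random.default_rng(7)
epsilon = 0.5                                  # hypothetical release budget

true_counts = {"C1": 0, "C2": 12, "C3": 4, "C4": 2}
noisy = {name: c + rng.laplace(scale=1.0 / epsilon)
         for name, c in true_counts.items()}

# The dashed query covers 1 of C2's 4 cells and 1 of C4's 2 cells; under
# the uniformity assumption each cell holds noisy_count / q points:
answer = noisy["C2"] / 4 + noisy["C4"] / 2     # (12 + n2)/4 + (2 + n4)/2
```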

Fig. 2. Example of HTF partitioning. The dashed rectangles show the query.

Problem 1. Generate a DP histogram of dataset D such that the expected mean relative error (MRE) is minimized, where for a query q with true count c and noisy count \(\overline{c}\), the relative error is calculated as (3) \(\begin{equation} RE(q) = \dfrac{|c- \overline{c}|}{c}\times 100. \end{equation}\)

In the past, several approaches have been developed for Problem 1. Still, current solutions have poor accuracy, which limits their practicality. Some methods tend to perform better when applied to specific datasets (e.g., uniform) and quite poorly when applied to others. Limitations of existing work have been thoroughly evaluated in Reference [14], and we review them in Section 8.


3 HOMOGENEOUS-TREE FRAMEWORK

Our proposed approach relies on two key observations to reduce the noise error and uniformity error. To address noise error, one needs to carefully calibrate the sensitivity of each operation performed, to reduce the magnitude of required noise. We achieve this objective by carefully controlling the depth of the indexing structure. To control the impact of uniformity error, we guide our structure-construction algorithm such that each resulting partition (i.e., internal node or leaf node) has a homogeneous data distribution within its boundaries.

Homogeneity ensures that uniformity error is minimized, since a query that does not perfectly align with the boundaries of an internal/leaf node is answered by scaling the count within that node in proportion with the overlap between the query and the node. None of the existing works on DP-compliant data structures has directly addressed homogeneity. Furthermore, conventional spatial indexing structures (designed for non-private data access) are typically designed to optimize criteria other than homogeneity (e.g., reduce node area or perimeter, control the data count balance across nodes). As a result, existing approaches that use such structures underperform when DP-compliant noise is present.

We propose the Homogeneous Tree Framework (HTF), which builds a customized spatial index structure specifically designed for DP-compliant releases. We directly address aspects such as the selection of structure height, a homogeneity-driven node split strategy, and careful tuning of the privacy budget for each structure level. Our proposed data structure shares similarities with kd-trees, due to the specific requirements of DP, namely, (1) nodes should not overlap, since that would result in increased sensitivity, and (2) the leaf set should cover the entire data domain, such that an adversary cannot exclude specific areas by inspecting node boundaries. However, as shown in previous work, using kd-trees directly for DP releases leads to poor accuracy [7, 14].

Similar to kd-trees, HTF divides a node into two sub-regions across a split dimension, which is alternated through successive levels of the tree. The root node covers the whole dataspace. Figure 2(b) provides an example of a non-private simplified version of the proposed HTF construction applied on a \(3\times 3\) grid (frequency matrix). HTF consists of three steps:

(A) Space partitioning aims to find an enhanced partitioning of the map such that the accuracy of the private histogram is maximized. HTF performs heuristic partitioning based on a homogeneity metric we define. At every split, we choose the coordinate that results in the highest value of the homogeneity metric. For example, in the running example (Figure 2(b)) node \(B_1\) is split into \(C_1\) and \(C_2\), which are homogeneous partitions. However, the metric evaluation is not straightforward in the private case, as metric computation for each candidate split point consumes privacy budget. We use the Laplace mechanism to determine an advantageous split point without consuming large amounts of budget. As part of HTF, a search mechanism is used to select plausible candidates for evaluation and find a near-optimal split position. The total privacy budget allocated for the private partitioning is denoted by \(\epsilon _{\text{prt}}\).

(B) Data sanitization starts by traversing the tree generated in the partitioning step. At each node, a certain amount of budget is used to perturb the node count using the Laplace mechanism. Based on the sanitized count, HTF evaluates the stop condition (i.e., whether to follow the downstream path from that node or release it as is), which is an important aspect in building private data structures. The private evaluation of stop conditions enables HTF to avoid over- or under-partitioning of the space, and preserves good accuracy. Revisiting the example in Figure 2(b), suppose that we do not want to further partition the space when the number of data points in a node is less than 7. Once HTF reaches node \(B_2\), the actual node count (6) is noise-perturbed; if the resulting noisy count falls below 7, the tree is pruned at \(B_2\) and no further partitioning takes place. Finally, the tree’s leaf set (i.e., the sanitized count of each leaf node) is released to the public. The total budget used for data sanitization is denoted by \(\epsilon _{\text{data}}\).

(C) Height estimation is another important HTF step. Tree height is an important factor in improving accuracy, as it influences the budget allocated at each index level. HTF dedicates a relatively small amount of budget (\(\epsilon _{\text{height}}\)) to determine an appropriate height.

The total budget consumption of HTF (\(\epsilon _{\text{tot}}\)) is the sum of budgets used in each of the three steps: (4) \(\begin{equation} \epsilon _{\text{tot}} = \epsilon _{\text{prt}} + \epsilon _{\text{data}}+ \epsilon _{\text{height}}. \end{equation}\)

The DP composition rules in the case of HTF apply as follows:

  • Sequential composition: The budgets used for node splits along every tree path add up, and their sum must not exceed the total budget available for partitioning.

  • Parallel composition: The budget allocated for partitioning a node does not compound with that of other nodes in the same level, since nodes at the same level have non-overlapping extents.


4 TECHNICAL APPROACH

Section 4.1 introduces the split objective function used in HTF and provides its sensitivity analysis. Section 4.2 focuses on HTF index structure construction. Section 4.3 presents the data perturbation algorithm used to protect leaf node counts.

4.1 Homogeneity-based Partitioning

Previous approaches that used kd-tree variations for DP-compliant indexes preserved the original split heuristics of the kd-tree, namely, node splits were performed on either median or average values of the enclosed data points. To preserve DP, the split positions were computed using the exponential mechanism (Section 2), which computes a merit function for each candidate split. However, such an approach results in poor query accuracy [14].

We propose homogeneity as the key factor for guiding splits in the HTF index structure. This decision is based on the observation that if all data points are uniformly distributed within a node, then the uniformity error that results when intersecting that node with the query range is minimized. At each index node split, we aim to obtain two new nodes with a high degree of intra-node density homogeneity. Of course, since the decision is data-dependent, the split point must be computed in a DP-compliant fashion.

For a given node of the tree, suppose that the corresponding partition covers \(U\times V\) cells of the \(N\times M\) grid (i.e., frequency matrix), in which the count of data points located in its \(i{\text{th}}\) row and \(j{\text{th}}\) column is denoted by \(c_{ij}\). Without loss of generality, we discuss the partitioning method w.r.t. the horizontal axis (i.e., rows). The aim is to find an index k that groups rows 1 to k into one node and rows \(k+1\) to U into another, such that homogeneity is maximized within each of the resulting nodes (we also refer to resulting nodes as clusters). We emphasize that the input grid abstraction is used to obtain a finite set of candidate split points. This is different than alternate approaches that use grids to obtain DP-compliant releases. Furthermore, the frequency matrix can be arbitrarily fine-grained, so discretization does not impose a significant constraint.

The proposed split objective function is formally defined as (5) \(\begin{align} o_{k} = \sum _{i=1}^k \sum _{j=1}^V |c_{ij} - \mu _1| + \sum _{i=k+1}^U \sum _{j=1}^V |c_{ij} - \mu _2|, \end{align}\) where (6) \(\begin{equation} \mu _1 = \dfrac{\sum _{i=1}^k \sum _{j=1}^V c_{ij} }{k\times V},\qquad \mu _2 = \dfrac{\sum _{i=k+1}^U \sum _{j=1}^V c_{ij}}{(U-k)\times V}. \end{equation}\)

The optimal index \(k^*\) minimizes the objective function, (7) \(\begin{equation} k^* = \arg \min _{k}\; o_{k}. \end{equation}\)

Consider the example in Figure 2(b) and the partitioning conducted for node \(B_1\). There exist three possible ways to split the rows of the frequency matrix: (i) separate the top row of cells, resulting in clusters {[0,0]} and {[3,3],[3,3]} and yielding an objective value of zero in Equation (5); (ii) separate the bottom row of cells, resulting in clusters {[0,0],[3,3]} and {[3,3]} and yielding an objective value of 6; or (iii) perform no division, yielding an objective value of 8. Therefore, the proposed algorithm will select the first option (\(k^* = 2\)), generating two nodes \(C_1\) and \(C_2\).
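A non-private sketch of Equation (5) for row splits is shown below (our illustration; this sketch indexes rows top-down with 0-based slicing, so the zero-objective split corresponds to k = 1 here, whereas the running example labels the same split \(k^* = 2\) under its own row ordering):

```python
import numpy as np

def split_objective(F, k):
    """o_k of Equation (5): total L1 deviation from the mean density in the
    two row clusters F[:k] and F[k:] of a U x V frequency matrix F."""
    total = 0.0
    for cluster in (F[:k], F[k:]):
        if cluster.size > 0:
            total += np.abs(cluster - cluster.mean()).sum()
    return total

# Node B1 of Figure 2(b): rows [0,0], [3,3], [3,3].
F = np.array([[0, 0], [3, 3], [3, 3]], dtype=float)
print({k: split_objective(F, k) for k in (1, 2)})   # {1: 0.0, 2: 6.0}
```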

Note that the value of \(k^*\) is not private, since individual location data were used in the process of calculating the optimal index. Hence, a DP mechanism is required to preserve privacy. Thus, we need to assess the sensitivity of \(k^*\), which represents the maximum change in the split coordinate that can occur when adding or removing a single data point. The sensitivity calculation is not trivial, since a single data point can cause the optimal split to shift to a new position far from the non-private value. Another challenge is that the exponential mechanism, commonly used in literature to select candidates from a set based on a cost function, tends to have high sensitivity, resulting in low accuracy.

4.1.1 Baseline Split Point Selection.

We propose a DP-compliant homogeneity-driven split point selection technique based on the Laplace mechanism. As before, consider the \(U\times V\) frequency matrix of a given node and a horizontal-dimension split. Denote by \(o_k\) the objective function for split coordinate k among the U candidates. There are U possible outputs \(\mathcal {O} = (o_{1},o_{2}, \ldots ,o_{U})\), one for each split candidate. In a non-private setting, the index corresponding to the minimum \(o_{i}\) value would be chosen as the optimal division. To preserve DP, given that the partitioning budget per computation is \(\epsilon _{\text{prt}}^{\prime \prime }\), we add independent Laplace noise to each \(o_{i}\), and then select the optimal value among all noisy outputs, (8) \(\begin{equation} \overline{\mathcal {O}} = (\overline{o}_{1},\overline{o}_{2}, \ldots ,\overline{o}_{U}) = \mathcal {O} + {\bf Lap}(2/\epsilon _{\text{prt}}^{\prime \prime }), \end{equation}\) where \({\bf Lap}(2/\epsilon _{\text{prt}}^{\prime \prime })\) denotes a tuple of U independent samples of Laplace noise. Note that since the grid is fixed, enumerating split candidates as cell coordinates is data-independent, hence does not incur disclosure risk. The Laplace noise added to each \(o_i\) is calibrated according to a sensitivity of 2, as proved in Theorem 1:
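The baseline selection can be sketched as follows (our illustration, reusing split_objective and numpy from the snippet above; for simplicity, only proper splits 1..U-1 are enumerated):

```python
def baseline_private_split(F, eps_round, rng=None):
    """Perturb each candidate's objective with Lap(2 / eps_round)
    (sensitivity 2, Theorem 1) and return the noisy argmin."""
    rng = rng or np.random.default_rng()
    candidates = range(1, F.shape[0])      # all proper row split positions
    noisy = [split_objective(F, k) + rng.laplace(scale=2.0 / eps_round)
             for k in candidates]
    return candidates[int(np.argmin(noisy))]
```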

Theorem 1 (Sensitivity of Partitioning). The sensitivity of the cost function \(o_k\), for any given horizontal or vertical index k, is 2.

Proof.

In the calculation of the objective function o for a given index k, adding or removing an individual data point affects only one cell and the corresponding cluster. The objective function for split point k can be written as (9) \(\begin{align} o_k = \sum _{i=1}^k \sum _{j=1}^V |c_{ij} - \mu _1| + \sum _{i=k+1}^U \sum _{j=1}^V |c_{ij} - \mu _2|. \end{align}\) The modified objective value following the addition of a single record to an arbitrary cell \(c_{xy}\) can be represented as (10) \(\begin{align} o_k^{\prime } = \sum _{i=1}^k \sum _{j=1}^V |c_{ij}^{\prime } - \mu _1^{\prime }| + \sum _{i=k+1}^U \sum _{j=1}^V |c_{ij}^{\prime } - \mu _2^{\prime }|. \end{align}\)

Without loss of generality, assume that the additional record is located in the first cluster, which results in \(\mu _1^{\prime } = \mu _1 + 1/kV\), \(\mu _2^{\prime } = \mu _2\), and \(c_{ij}^{\prime } = c_{ij}\) for all i and j except \(c_{xy}^{\prime } = c_{xy} + 1\). Therefore, the sensitivity of the objective function is bounded by 2: (11) \(\begin{align} \Delta {o_k} = |o_k - o_k^{\prime }| \le \dfrac{2(kV-1)}{kV} \le 2. \end{align}\) Equation (11) is derived using the reverse triangle inequality: (12) \(\begin{equation} \Big {|} |c_{ij} - \mu _1 - \dfrac{1}{kV}|- |c_{ij} - \mu _1| \Big {|} \le \dfrac{1}{kV} \quad \forall \lbrace ij \mid i\in \lbrace 1, \ldots ,k\rbrace \wedge j\in \lbrace 1, \ldots ,V\rbrace ,\; ij\ne xy \rbrace \end{equation}\) and (13) \(\begin{equation} \Big {|} | c_{xy} +1- \mu _1 - \dfrac{1}{kV}|- |c_{xy} - \mu _1| \Big {|}\le \dfrac{kV-1}{kV}. \end{equation}\)

Similarly, the sensitivity upper bound corresponding to an individual record’s removal can be shown to be 2.□

We refer to the above approach as the baseline approach. One challenge with the baseline is that the calculation of noise is performed separately for each candidate split point, and since the computation depends on all data points within the parent node, the budget consumption adds up according to sequential composition. This means that the calculation of each individual split candidate in \(\overline{o_i}\) may receive only \(1/U\) of the budget available for that level.

For large values of U, the privacy budget per computation becomes too small, decreasing accuracy. This leads to an interesting trade-off between the number of split point candidates evaluated and the accuracy of the entire release. On the one hand, increasing the number of candidates leads to a higher likelihood of including the optimal split coordinate in the set \(\overline{\mathcal {O}}\); on the other hand, there will be more noise added to each candidate’s objective function output, leading to the selection of a sub-optimal candidate. Next, we propose an optimization that finds a good compromise between number of candidates and privacy budget per candidate.

4.1.2 Optimized Split Point Selection.

We propose an optimization that aims to minimize the number of split point candidate evaluations required, searching for a local minimum rather than the global one. Algorithm 1 outlines the approach for a single split step along the y-axis (i.e., a row split). Inputs to Algorithm 1 include (i) the frequency matrix \(F_{U\times V}\) of the parent node, (ii) the total budget allocated for partitioning per level of the tree, \(\epsilon _{\text{prt}}^{\prime }\), and (iii) a variable T, which bounds the number of search iterations and hence the number of objective function computations, a key factor determining the extent of the search and the budget per operation. The proposed approach is essentially a search tree that determines the candidate split minimizing the objective function’s output. The search starts from a wide range of candidates and narrows down each interval until reaching a local minimum, similar to a binary search.

Let \(\lbrace l,\ldots ,r\rbrace\) represent the index range where the search is conducted, initially set to the first and last possible split index of the input frequency matrix. At every iteration of the main “for” loop, the search interval is divided into four equal-length sub-intervals, delimited by three inner points and the two boundary points. The inner points are referred to as split indices. The objective function is calculated for each of these candidates and perturbed using Laplace noise to satisfy DP. The split corresponding to the minimum noisy value is chosen as the center of the next search interval, and its immediate “before” and “after” split positions are assigned as the updated search boundaries l and r. Hence, every iteration performs two new computations of the objective function (the center candidate is inherited from the previous iteration), except the first, which performs three. The total number of private evaluations over T iterations thus sums to \(2T+1\), each perturbed with a privacy budget of \(\epsilon _{\text{prt}}^{\prime \prime } = \epsilon _{\text{prt}}^{\prime }/(2T+1)\).
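A sketch of this narrowing search is shown below (our illustration of the idea behind Algorithm 1, reusing split_objective from Section 4.1; the integer rounding and clamping are our assumptions):

```python
import numpy as np

def optimized_private_split(F, eps_level, T, rng=None):
    """Narrow the candidate range as in a binary search, spending
    eps_level / (2T + 1) of budget per private objective evaluation."""
    rng = rng or np.random.default_rng()
    eps_round = eps_level / (2 * T + 1)
    cache = {}                                 # inherited midpoints are reused

    def noisy_obj(k):
        if k not in cache:
            cache[k] = split_objective(F, k) + rng.laplace(scale=2.0 / eps_round)
        return cache[k]

    lo, hi = 1, F.shape[0] - 1                 # first and last split index
    for _ in range(T):
        q = (hi - lo) / 4                      # three inner probe points
        probes = sorted({min(max(int(lo + i * q), lo), hi) for i in (1, 2, 3)})
        best = min(probes, key=noisy_obj)
        i = probes.index(best)
        lo = probes[i - 1] if i > 0 else lo    # neighbors become new bounds
        hi = probes[i + 1] if i < len(probes) - 1 else hi
    return min(cache, key=cache.get)
```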

4.2 HTF Index Structure Construction

Our proposed HTF index structure is built in accordance to the split point selection algorithm introduced in Section 4.1. The HTF construction pseudocode is presented in Algorithm 2. Each node stores the rectangular spatial extent of the node (node.region), its children (node.left and node.right), real data count (node.count), noisy count (node.ncount), and the node’s height in the tree.

The root of the tree represents the entire data domain (the \(N\times M\) frequency matrix) and its height is denoted by h. Deciding the height of the tree is a challenging task: a large height will result in a smaller amount of privacy budget per level, whereas a small one does not provide sufficient granularity at the leaf level, decreasing query precision. We estimate an advantageous height value using a small amount of budget (\(\epsilon _{\text{height}}\)) to perturb the total number of data records based on the Laplace mechanism, (14) \(\begin{equation} \overline{|D|} = |D| + Lap(1/\epsilon _{\text{height}}). \end{equation}\) Next, we set the height to (15) \(\begin{equation} h = \log _2\left(\dfrac{\overline{|D|}\, \epsilon _{\text{tot}}}{10}\right). \end{equation}\)

The formula is motivated by the work in Reference [23]. The authors show that when data are uniformly distributed in space, using a grid of granularity \(\sqrt {\overline{|D|}\epsilon _{\text{tot}}/c_0}\times \sqrt {\overline{|D|}\epsilon _{\text{tot}}/c_0}\) improves the mean relative error, where the value of the constant \(c_0\) is set to 10 experimentally. We emphasize that this approach does not imply that the number of leaves in the tree is \(\overline{|D|}\epsilon _{\text{tot}}/c_0\); this quantity is merely used as an estimator of the tree’s height. The estimation is formally characterized in Reference [14] and referred to as the scale-epsilon exchangeability property. The intuition is that the error due to decreasing the amount of budget used for the estimation is offset by having a larger number of data points in the entire dataset.

The last input to the algorithm is the budget allocated per level of the partitioning tree. We use uniform budget allocation across levels, denoted as \(\epsilon _{\text{prt}}^{\prime }= \epsilon _{\text{prt}}/h\).
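The height estimate of Equations (14)-(15) and the per-level allocation can be sketched together as follows (our illustration; the rounding, the clamp to at least 1, and the budget values are our assumptions):

```python
import numpy as np

def estimate_height(total_count, eps_tot, eps_height, rng=None):
    """Equations (14)-(15): perturb |D| with Lap(1/eps_height), then set
    h = log2(noisy_total * eps_tot / 10)."""
    rng = rng or np.random.default_rng()
    noisy_total = total_count + rng.laplace(scale=1.0 / eps_height)
    h = np.log2(max(noisy_total, 2.0) * eps_tot / 10.0)
    return max(int(round(h)), 1)

h = estimate_height(total_count=3_500_000, eps_tot=0.3, eps_height=1e-4)
eps_prt_level = 0.1 / h        # uniform allocation: eps_prt' = eps_prt / h
```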

Starting from the root node, the proposed algorithm recursively creates two child nodes and decreases the height by one. This is done by splitting the node’s underlying area into two sub-regions at the split position returned by Algorithm 1. The division is done in the y dimension if the current height is an even number, and in the x dimension otherwise. The algorithm continues until reaching the minimum height of zero, or a point where no further splitting is possible.

4.3 Leaf Node Count Perturbation

Once the HTF structure is completed, the final step of our algorithm is to release DP-compliant counts for index nodes, so that answers to queries can be reconstructed from the noisy counts. The total partitioning budget adds up to \(\epsilon _{\text{height}}+ \epsilon _{\text{prt}}\), where \(\epsilon _{\text{height}}\) was used to estimate the tree height and \(\epsilon _{\text{prt}}\) budget to generate the private partitioning tree. The data perturbation step uses the remaining \(\epsilon _{\text{data}}\) amount of budget and releases node counts according to the Laplace mechanism.

One can choose various strategies to release index node counts. At one extreme, one can simply release a noisy count for each index node; in this case, the budget must be shared across nodes on the same path (sequential composition), and can be re-used across different paths (parallel composition). This approach has the advantage of simplicity, and may do well when queries exhibit large variance in size—it is well-understood that when perturbing large counts, the relative error is much lower, since the Laplace noise magnitude only depends on sensitivity, and not the actual count.

However, in practice, queries tend to be more localized, and one may want to allocate more budget to the lower levels of the structure, where the actual counts are smaller, thus decreasing relative error. In fact, as another extreme, one can concentrate the entire \(\epsilon _{\text{data}}\) on the leaf level. However, doing so can also decrease accuracy, since some leaf nodes have very small real counts.

Our approach takes a middle ground, where the available \(\epsilon _{\text{data}}\) is spent to (i) determine which nodes to publish and (ii) ensure sufficient budget remains for the noisy counts. Specifically, we publish only leaf nodes, but these are not the same leaves returned by the structure construction algorithm. Instead, we perform an additional pruning step, which uses the noisy counts of internal nodes to determine a stop condition, i.e., the level at which a node count is likely to be small enough that a further recursion along that path is not helpful to obtain good accuracy. Effectively, we perform pruning of the tree using a small fraction of the data budget, and then split the remaining budget among the non-pruned nodes along a path. This helps decrease the effective height of the tree across each path, and hence the resulting budget per level increases.

Next, we present in detail our approach that contributes two main ideas: (i) how to determine smart stop (or pruning) conditions based on noisy internal node counts, and (ii) how to allocate perturbation budget across shortened paths.

The proposed technique is summarized in Algorithm 3: it takes as inputs the root node of the tree generated in the data partitioning step; the remaining budget allocated for the perturbation of data (\(\epsilon _{\text{data}}\)); a tracker of accumulated budget (\(\epsilon _{\text{accu}}\)); a stop condition predicate denoted by cond; and the nominal tree height h as computed in Section 4.2. Similar to prior work [7], we use a geometric progression budget allocation strategy, but we enhance it to avoid wasting budget on unnecessarily long paths. The intuition behind this strategy is to assign more budget to the nodes located in the lower levels of the tree, since their actual counts are lower, and hence larger added noise impacts the relative error disproportionately high. Conversely, at the higher levels of the tree, where actual counts are much higher, the effect of the noise is negligible.

Equation (16) formulates this goal as a convex optimization problem, (16) \(\begin{align} \underset{\epsilon _0...\epsilon _h}{\min }\;\;\;& \sum _{i=0}^h 2^{h-i}/\epsilon _i^2, \end{align}\) where (18) \(\begin{align} & \sum _{i=0}^h \epsilon _i=\epsilon , \;\; \epsilon _i\gt 0 \; \;\forall i=0...h, \end{align}\) and i denotes a node’s height (\(i=0\) for leaf nodes, so the level at height i contains \(2^{h-i}\) nodes). Writing the Karush-Kuhn-Tucker (KKT) [4] conditions, the optimal allocation of budget can be calculated as (19) \(\begin{align} L(\epsilon _0, \ldots ,\epsilon _h,\lambda) &= \sum _{i=0}^h 2^{h-i}/\epsilon _i^2 + \lambda \left(\sum _{i=0}^h \epsilon _i- \epsilon \right) \end{align}\) (20) \(\begin{align} &\Rightarrow \dfrac{\partial L}{\partial \epsilon _i} = - \dfrac{2^{h-i+1}}{\epsilon _i^3}+ \lambda =0 \end{align}\) (21) \(\begin{align} &\Rightarrow \epsilon _i = \dfrac{2^{(h-i+1)/3}}{\lambda ^{1/3}}, \end{align}\) and substituting the \(\epsilon _i\)’s into the constraint of the problem, the optimal budget at height i is derived as (22) \(\begin{equation} \epsilon _i = \dfrac{2^{(h-i)/3}\,\epsilon \, (2^{1/3}-1)}{2^{(h+1)/3}-1}. \end{equation}\)
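The closed form in Equation (22) can be evaluated directly; the sketch below (our illustration) indexes levels by node height i, so eps[0] is the leaf level and receives the largest share:

```python
import numpy as np

def geometric_budgets(eps_data, h):
    """Equation (22): eps_i = 2^((h-i)/3) * eps * (2^(1/3) - 1) / (2^((h+1)/3) - 1),
    where i is a node's height (i = 0 for leaves)."""
    i = np.arange(h + 1)
    eps = (2.0 ** ((h - i) / 3)) * eps_data * (2 ** (1 / 3) - 1) / (2 ** ((h + 1) / 3) - 1)
    assert np.isclose(eps.sum(), eps_data)   # budgets along a root-leaf path sum up
    return eps

budgets = geometric_budgets(eps_data=0.25, h=10)   # budgets[0]: leaf level
```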

The algorithm starts the traversal from the partitioning tree’s root and recursively visits the descendant nodes. Once a new node is visited, the first step is to use the node’s height to determine the allocated budget (\(\epsilon _{\text{data}}^{\prime }\)) based on geometric progression. Recall that nodes on the same level follow parallel composition of the budget, as their underlying areas in space do not overlap. Additionally, the algorithm keeps track of the amount of budget used so far on the tree, optimizing the budget in later stages. Next, the computed value of \(\epsilon _{\text{data}}^{\prime }\) is used to perturb node.count by adding Laplace noise, resulting in the noisy count node.ncount.

The stop condition we use takes into account the noisy count in the current internal node (i.e., count threshold); and the spatial extent of the internal node threshold (i.e., extent threshold). If none of the thresholds is met for the current node, then the algorithm recursively visits the node’s children; otherwise, the algorithm prunes the tree considering that the current node should be a leaf node. In the latter case, the algorithm subtracts the accumulated budget used so far on that path from the root, and uses the entire remaining budget available to perturb the count. This significantly improves the utility, as geometric allocation tends to save most of the budget for the lower levels of the tree. Revisiting the example in Figure 2(b), suppose that the stop condition is to prune when the underlying area consists of less than four cells. During the data perturbation process, the node \(B_2\) is turned into a leaf node due to its low number of cells. At this point, the node’s children are removed, and its noisy count is determined based on the remaining budget available on the lower levels of the tree.

The computational complexity of the HTF algorithm is \(O((2T+1)\times 2^h)\). Each internal node of the tree leads to a split that requires \(O(2T+1)\) computations to search for the split index, and there are \(\sum _{i=1}^h 2^{h-i} = 2^h-1\) internal nodes.


5 OPTIMAL ADAPTIVE STOP CONDITIONS

One essential component that determines the accuracy of DP-compliant releases is the strategy that decides when to stop partitioning, i.e., the stop condition. Most existing studies take an ad hoc approach to stop conditions, which often leads to poor accuracy. In most cases, the decision is as simple as comparing the noisy count of a candidate leaf node with a fixed threshold. This section takes a principled approach to determining when to stop partitioning. Our strategy is based on a formal analysis of how pruning can enhance accuracy. Differentially private stop conditions have several important roles: (i) avoiding over-partitioning in low-density areas, (ii) preventing under-partitioning in areas with a high concentration of population, and (iii) significantly reducing the occurrence of negative noisy counts in the published histogram. Findings from the 2020 U.S. census show that post-processing has more adverse impacts on data utility than the additive Laplace noise used for sanitization [13]. Post-processing deals with aspects such as the removal of negative population counts or unreasonably large counts for a household. Such problems can be mitigated by using appropriate stop conditions. We focus on this aspect and on the utility of the Laplace mechanism to formulate optimization problems and derive insights on pruning thresholds.

Suppose that a given privacy-preserving partitioning algorithm (e.g., HTF) has divided the space such that the leaf set consists of M non-overlapping partitions with the population counts denoted by \(p_1,\, p_2, \ldots ,\,p_M\) (note that, the leaf nodes may reside at different levels of the tree; nevertheless, they are disjoint, and they cover the entire data domain). Moreover, let \(Z_1,\, Z_2, \ldots ,Z_M\) be independent and identically distributed Laplace random variables used for the sanitization of the leaf counts (i.e., random variables representing the noise drawn to protect each individual count according to DP). We use Lemma 1 to characterize the utility of the Laplace mechanism.

Lemma 1. The likelihood of utility loss for a random variable \(Z\sim Lap(b)\), for every \(t\gt 0\), is given by (23) \(\begin{equation} Pr(|Z|\gt tb) = e^{-t}. \end{equation}\)

Proof.

The Laplace density with mean zero is (24) \(\begin{align} \text{Lap}(z|b) = \dfrac{1}{2b}e^{-|z|/b}. \end{align}\) Looking at the cumulative distribution function, for \(z\gt 0\), (25) \(\begin{align} Pr(|Z|\le z) = \int _{-z}^{z} \dfrac{1}{2b} e^{-|u|/b} \,du = \dfrac{1}{2b}\left(\int _{0}^{z} e^{-u/b} \,du + \int _{-z}^{0} e^{u/b} \,du\right) = 1- e^{-z/b}, \end{align}\) which results in (28) \(\begin{equation} Pr(|Z|\gt tb) = 1- Pr(|Z|\le tb) = e^{-t}. \end{equation}\) Equation (23) follows by substituting \(z=tb\).□

Intuitively, the lemma above provides a means to measure the likelihood of the noise remaining below a certain threshold. Armed with this knowledge, the likelihood of the noise not exceeding the population count of the \(j{\text{th}}\) partition can be determined by setting \(t=p_j/b\) in Equation (23): (29) \(\begin{equation} Pr(|Z_j|\le p_j) = 1- e^{-p_j/b}. \end{equation}\) We formulate the following optimization problem to maximize utility: \(\begin{equation*} \begin{aligned}& \text{maximize} & & \sum _{j=1}^{M}\, Pr(|Z_j|\lt p_j) = M - \sum _{j=1}^{M} e^{-p_j/b}, \\ & \text{subject to} & & \sum _{j=1}^{M}\, p_j = N,\, p_j\ge 0.\\ \end{aligned} \end{equation*}\)

The above problem can be formulated in two instances. In its direct formulation, it provides the optimal values for pruning thresholds given the privacy budgets in the various tree levels where the leaf nodes reside. In its dual formulation, it allows one to derive the optimal privacy budget values for each leaf level, given a set of count thresholds. HTF, which takes a top-down approach, benefits from the direct problem, as the algorithm starts by estimating the tree’s height and, consequently, deriving the optimal privacy budget for each level of the tree. Thus, prior knowledge of the privacy budget can be used to set stop count thresholds on nodes at different heights of the HTF partitioning structure. The dual formulation is not directly applicable to our setting, since it is difficult in practice to know the leaf counts without first building the structure. We do include it, however, since it is of theoretical interest, especially for future work that may consider bottom-up index structure construction.

Consider the estimated number of partitions to be M, and let \(Z_1,\, Z_2, \ldots ,Z_M\) represent additive Laplace noise random variables such that \(Z_j \sim Lap(1/\epsilon _j)\) for all \(j=1...M\). The optimization problem can be modified to the following equation to derive optimal thresholds \(p_1,\, p_2, \ldots ,\, p_M\): \(\begin{equation*} \begin{aligned}& \underset{p_1, \ldots ,p_M}{\text{maximize}} & & M - \sum _{j=1}^{M} e^{-p_j\epsilon _j }, \\ & \text{subject to} & & \sum _{j=1}^{M}\, p_j = N,\, p_j\ge 0.\\ \end{aligned} \end{equation*}\)

The convex problem can be solved by writing it in standard form and formulating the Lagrangian as (30) \(\begin{equation} L =\sum _{j=1}^{M} e^{-p_j\epsilon _j } - M + \lambda \left(\sum _{j=1}^{M} p_j - N\right) +\sum _{j=1}^{M} \lambda _j p_j. \end{equation}\) Taking derivatives and using the KKT conditions (at an interior solution with all \(p_j\gt 0\), the multipliers \(\lambda _j\) vanish), the optimal solution can be derived as (31) \(\begin{equation} p_j = \dfrac{\ln \epsilon _j }{\epsilon _j}+ \dfrac{ N - \sum _{k=1}^{M} \ln \epsilon _k/\epsilon _k }{\epsilon _j\left(\sum _{k=1}^{M} 1/\epsilon _k\right)}. \end{equation}\) Algorithm 4 includes in line 6 the optimal stop condition in the equation above for each level of the HTF tree.
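The direct solution can be evaluated numerically; the sketch below (our illustration, with hypothetical leaf budgets) implements Equation (31) and checks that the thresholds sum to N:

```python
import numpy as np

def optimal_thresholds(eps_leaves, N):
    """Equation (31): stop-count thresholds p_j for leaf budgets eps_j,
    subject to sum(p_j) = N."""
    e = np.asarray(eps_leaves, dtype=float)
    shift = (N - np.sum(np.log(e) / e)) / np.sum(1.0 / e)
    return np.log(e) / e + shift / e

p = optimal_thresholds([0.05, 0.10, 0.20], N=10_000)
assert np.isclose(p.sum(), 10_000)   # thresholds partition the population
```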

For the dual problem, the optimal values of the budgets given the population thresholds are derived from \(\begin{equation*} \begin{aligned}& \underset{\epsilon _1, \ldots ,\epsilon _M}{\text{maximize}} & & M - \sum _{j=1}^{M} e^{-p_j\epsilon _j }, \\ & \text{subject to} & & \sum _{j=1}^{M}\, \epsilon _j = \epsilon ,\, \epsilon _j \gt 0,\\ \end{aligned} \end{equation*}\) whose solution is (32) \(\begin{equation} \epsilon _j = \dfrac{\ln p_j }{p_j}+ \dfrac{ \epsilon - \sum _{k=1}^{M} \ln p_k/p_k }{p_j\left(\sum _{k=1}^{M} 1/p_k\right)}. \end{equation}\)


6 SUPPORTING LARGE-SCALE SPATIAL DATASETS

Existing methods for DP-compliant spatial dataset partitioning, HTF included, are significantly influenced in terms of both accuracy and execution time by the cardinality of the input datasets. With larger input data, the decisions that are made at each structure level, e.g., how to choose the split position, what fan-out to use, and so on, become more complex and more costly in terms of privacy budget as the data scale increases. Most existing algorithms are evaluated on datasets such as Geolife [35] and TDrive [33], with the population in the order of a million data points. When using significantly larger input datasets, performance deteriorates. In this section, we propose an approach for the publication of DP-compliant histograms for large-scale spatial datasets, such as entire countries, that can entail in the order of a hundred million user locations. As a running example, we focus on the publication of a private histogram for the United States.

An overview of the proposed architecture is shown in Figure 3. Our proposed design is a hierarchical tree-based approach in which the entire USA map and the total count of users are represented by the root of the tree. Moving to lower levels, the spatial areas become more compact. The second level of the tree is dedicated to states, currently comprising 50 regions according to USA census data and statistics. Note that the fanout is determined by the nature of the administrative partitioning of the territory. For example, moving from level 1 to level 2, the spatial area is divided into 50 non-overlapping partitions.

Fig. 3. Hierarchical architecture for the publication of a DP-compliant histogram of the USA.

The states’ descendants are located in the third level and represent the counties associated with each state. As an example, the state of California entails 58 counties modeled as its child nodes. It is essential to consider that the counties of a state do not always fully cover the state. Hence, it is recommended to include an additional node, referred to as unincorporated, that represents the unassigned locations. Such areas are usually either unpopulated or have a low population. Similarly, the fourth level represents the cities located in each county and can be accompanied by an “unincorporated” node modeling locations excluded from city boundaries.

We set the population threshold below which the traditional algorithms can be applied to 1 million people. The nodes in levels 1 to 3 are expected to have populations greater than the threshold, as they represent large spatial areas; we refer to these as static nodes. The nodes in level 4, which are anticipated to have populations below the threshold, are called dynamic nodes. Dynamic nodes are the ones to which HTF and other traditional algorithms are applicable, whereas static nodes are usually too populated for this purpose. It is important to note that a static node may turn into a dynamic node if its population is detected to be less than the threshold during the sanitization process, as explained next. For example, suppose a county’s population is below the stop count threshold. In that case, the algorithm directly applies HTF on that node instead of further partitioning it into cities.

Budget Allocation. The budget allocation procedure differs for static and dynamic nodes. Once a dynamic node is reached, given that its allocated budget is known, the HTF algorithm can be executed to partition and sanitize the data. A naive approach for budget allocation is to follow approaches such as QuadTree or kd-tree to reach the desired threshold of the population. There are two critical problems with such an approach: (i) it requires a large number of spatial splits before reaching the threshold, increasing the tree’s height, and (ii) analysts are usually interested in the population of predefined geographic regions such as counties and cities, and not arbitrarily generated geographic regions. Both these problems are addressed in the proposed architecture. A critical challenge with budget allocation for predefined spatial regions is that different levels have varying fanouts on the tree. To this end, we generalize the optimization formulation and solve it by applying KKT conditions.

Denote the height of the architecture by \(h_{\text{arch}}\); for the hierarchy proposed in Figure 3, this number is 3. Moreover, denote the number of nodes within the ith level of the tree by \(n_i\) for \(i=0...h_{\text{arch}}\). For the map of the USA, the numbers are set to \(n_0 = 1,\, n_1 = 50,\, n_2\) = 3,006, \(n_3\) = 19,495, representing the count of nodes for the country, states, counties, and cities. The optimization problem can then be formulated as (33) \(\begin{align} \underset{\epsilon _0...\epsilon _{h_{\text{arch}}}}{\min }\;\;\;& \sum _{i=0}^{h_{\text{arch}}} n_i/\epsilon _i^2, \end{align}\) (34) \(\begin{align} & \sum _{i=0}^{h_{\text{arch}}} \epsilon _i=\epsilon _{\text{data}}, \;\; \epsilon _i\gt 0 \; \;\forall i=0...h_{\text{arch}}. \end{align}\) The Lagrangian of the optimization problem is derived as (35) \(\begin{align} L(\epsilon _0, \ldots ,\epsilon _{h_{\text{arch}}},\lambda) &= \sum _{i=0}^{h_{\text{arch}}} n_i/\epsilon _i^2+ \lambda \left(\sum _{i=0}^{h_{\text{arch}}} \epsilon _i- \epsilon _{\text{data}}\right) \end{align}\) (36) \(\begin{align} &\Rightarrow \dfrac{\partial L}{\partial \epsilon _i} = - \dfrac{2n_i}{\epsilon _i^3}+ \lambda =0 \end{align}\) (37) \(\begin{align} &\Rightarrow \epsilon _i = \dfrac{(2n_i)^{1/3}}{\lambda ^{1/3}}. \end{align}\) By substituting into the constraint, the optimal solution can be derived as (38) \(\begin{equation} \epsilon _i = \dfrac{\epsilon _{\text{data}}\times n_i^{1/3}}{\sum _{i=0}^{h_{\text{arch}}} n_i^{1/3}}. \end{equation}\) In the above equation, \(\epsilon _i\) represents the data perturbation budget used in the ith level.
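A sketch of this allocation with the node counts listed above (our illustration; the total budget value is hypothetical):

```python
import numpy as np

def hierarchy_budgets(eps_data, level_counts):
    """Equation (38): eps_i proportional to n_i^(1/3), summing to eps_data."""
    n = np.asarray(level_counts, dtype=float)
    w = n ** (1.0 / 3.0)
    return eps_data * w / w.sum()

# Node counts for the USA hierarchy: country, states, counties, cities.
eps_levels = hierarchy_budgets(eps_data=0.5, level_counts=[1, 50, 3006, 19495])
```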

Structure Traversal. Once the tree height and the budget allocated to each level are determined, the traversal process starts from the root node and recursively visits the descendants. For static nodes, the geographic area is fixed, and no budget is spent on partitioning. Upon arriving at a new node, the node’s count is sanitized by adding Laplace noise with sensitivity one, scaled according to the allocated budget. Once a node’s count is sanitized, the threshold is evaluated on the sanitized count, as was the case for private stop conditions in HTF. If the count is less than the threshold, then the static node turns into a dynamic one, and the HTF algorithm is applied to the node; otherwise, the algorithm continues to visit the node’s children. Once a dynamic node is reached, the allocated privacy budget is used as input, and the HTF algorithm is applied for sanitization, resulting in a private histogram of the whole map.

We evaluate empirically the performance impact of the proposed scaling approach in Section 7.6.


7 EXPERIMENTAL EVALUATION

7.1 Experimental Setup

We evaluate HTF on both real and synthetic datasets:

Real-world Datasets. We use the location measurements of cell phones provided by Veraset [29] throughout the USA for performance evaluation. Primarily, we focus on data collected in Los Angeles, covering a \(70 \times 70\) km\(^2\) area centered at latitude 34.05223 and longitude \(-118.24368\). The selected data generate a frequency matrix of 3.5 million data points during the period of January 1–7, 2020. Additionally, we use data collected in six cities: New York, San Francisco, Seattle, Denver, Detroit, and Phoenix; detailed statistics are provided in Table 2.

Symbol | Description
\(\epsilon _{\text{tot}}\) | Total privacy budget
\(\epsilon _{\text{height}}, \epsilon _{\text{data}}\) | Height estimation and data perturbation budgets
\(\epsilon _{\text{prt}},\epsilon _{\text{prt}}^{\prime },\epsilon _{\text{prt}}^{\prime \prime }\) | Partitioning budget: total, per level, per round
\(o_k\) | Objective function output for index k
\(c_{ij}\) | Number of data points in row i and column j
h | Tree height

Table 1. Summary of Notations

Dataset | Lat | Lon | Variance | Entropy
New York | 40.7306 | -73.9352 | 9.99 | 17.04
San Francisco | 37.7739 | -122.4312 | 36.66 | 15.43
Seattle | 47.6080 | -122.3351 | 16.65 | 16.41
Denver | 39.7420 | -104.9915 | 19.80 | 16.14
Detroit | 42.3314 | -83.0457 | 13.08 | 16.68
Phoenix | 33.4483 | -112.0740 | 14.41 | 16.51

Table 2. Statistics of Datasets Used in Our Experiments

Synthetic Datasets. We generate locations according to a Gaussian distribution as follows: a cluster center, denoted by \((x_c,y_c)\), is selected uniformly at random. Next, coordinates for each data point x and y are drawn from a Gaussian distribution with the mean of \(x_c\) and \(y_c\), respectively. We model three sparsity levels by using three standard deviation (\(\sigma\)) settings for Gaussian variables: low (\(\sigma = 20\)), medium (\(\sigma = 50\)), and high (\(\sigma = 100\)) sparsity.

We discretize the space to a 1,024 \(\times\) 1,024 frequency matrix. We use as performance metric the MRE for range queries. Similar to prior work [14, 23, 30, 34], we consider a smoothing factor of 20 for the relative error, to deal with cases when the true count for a query is zero (i.e., relative error is not defined). Each experimental run consists of 2,000 random rectangular queries with center selected uniformly at random. We vary the size of queries to a region covering \(\lbrace 2\%, 6\%, 10\%\rbrace\) of the dataspace.
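The error metric with smoothing can be sketched as follows (our illustration of the stated convention):

```python
import numpy as np

def mean_relative_error(true_counts, noisy_counts, smoothing=20):
    """MRE over a query workload; denominators are floored at the smoothing
    factor so queries with true count 0 stay well-defined."""
    t = np.asarray(true_counts, dtype=float)
    n = np.asarray(noisy_counts, dtype=float)
    return float(np.mean(np.abs(t - n) / np.maximum(t, smoothing)) * 100)
```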

7.2 HTF Versus Data-Dependent Algorithms

Data-dependent algorithms aim to exploit the users' distribution to provide an enhanced partitioning of the map. The state-of-the-art data-dependent approach for DP-compliant location publication is the kd-tree technique from Reference [7]. The kd-tree algorithm builds the partitioning tree by splitting on median values, which are determined using the exponential mechanism. We have also included the smoothing post-processing step from References [15, 23], which resolves inconsistencies within the structure (e.g., enforcing that children's counts sum to the count of their ancestor).

Figure 4 compares the HTF algorithm with kd-tree approaches: (i) geometric budget allocation with smoothing and post-processing, labelled KdTree (geo); (ii) uniform budget allocation with smoothing and post-processing, labelled KdTree (uniform); (iii) HTF with the partitioning budget per level set to \(\epsilon _{\text{prt}}^{\prime } = 5E-4\); and (iv) HTF with \(\epsilon _{\text{prt}}^{\prime } = 1E-3\). Recall that \(\epsilon _{\text{prt}}^{\prime }\) denotes the partitioning budget per level; therefore, given the tree's height h, the remaining budget for perturbation is \(\epsilon _{\text{data}} = \epsilon _{\text{tot}} - \epsilon _{\text{prt}}^{\prime }\times h - \epsilon _{\text{height}}\). In the experiments, \(\epsilon _{\text{height}}\) is set to \(1E-4\), HTF's T value is set to 3, and the stop-condition thresholds are set to 5 cells and 100 data points. For the kd-tree algorithm, \(15\%\) of the total budget is allocated to partitioning.
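As a quick sanity check on the budget accounting (a minimal sketch; the height h = 16 is illustrative, since HTF self-tunes it):

```python
eps_tot    = 0.3     # total privacy budget
eps_height = 1e-4    # budget spent estimating the tree height
eps_prt_lv = 1e-3    # partitioning budget per tree level (eps'_prt)
h          = 16      # tree height (illustrative; HTF self-tunes it)

# Sequential composition: whatever is not spent on height estimation and
# partitioning remains for perturbing the counts.
eps_data = eps_tot - eps_prt_lv * h - eps_height
print(eps_data)      # -> ~0.2839
```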


Fig. 4. Comparison with data-dependent algorithms, Los Angeles dataset.

In Figures 4(a), 4(b), and 4(c), MRE performance is compared for different values of \(\epsilon _{\text{tot}}\) over a workload of uniformly located queries with random shape and size. HTF clearly outperforms the kd-tree for all height settings. The kd-tree's MRE follows a parabolic shape commonly observed in tree-based algorithms: MRE reaches its best value at a particular height, and further partitioning of the space increases the error, due to excessive partitioning in low-density areas. HTF, on the other hand, applies stop conditions to avoid the adverse effects of over-partitioning, and estimates the optimal height beforehand. Figures 4(d), 4(e), and 4(f) show the results when varying query size (for square-shaped queries). HTF outperforms the kd-tree algorithm significantly, with an average improvement of \(89\%\) for \(\epsilon _{\text{prt}}^{\prime } = 1E-3\).

7.3 HTF Versus Grid-based Algorithms

Grid-based approaches are mostly data-independent, partitioning the space using regular grids. The UG approach uses a single-layer fixed-size grid, whereas its successor, the AG method, uses two layers of regular grids: the first layer is similar to UG, and the second uses a small amount of data-dependent information (i.e., noisy query results on the first layer) to tune the granularity of the second-layer grid.
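For reference, a UG-style baseline is easy to sketch; the granularity rule in the trailing comment reflects our reading of Reference [23] and should be treated as an assumption:

```python
import numpy as np

def uniform_grid(points: np.ndarray, eps: float, m: int,
                 seed: int = 0) -> np.ndarray:
    """UG-style baseline: bucket (N, 2) point coordinates into an m x m
    grid and add independent Laplace(1/eps) noise to each cell
    (count queries have sensitivity 1)."""
    rng = np.random.default_rng(seed)
    fm, _, _ = np.histogram2d(points[:, 0], points[:, 1], bins=m)
    return fm + rng.laplace(scale=1.0 / eps, size=fm.shape)

# Granularity is data-independent; Reference [23] suggests roughly
# m = sqrt(N * eps / c) for a small constant c, so m grows with both
# the population and the budget.
```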

Figure 5 presents the comparison of HTF with AG and UG. For HTF, we consider several stop-condition thresholds. HTF consistently outperforms grid-based approaches, especially when the total privacy budget is lower (i.e., more stringent privacy requirements). For example, the improvement of HTF with \(\epsilon _{\text{prt}}^{\prime } = 1E-3\) over AG is \(28\%\), \(70\%\), and \(63\%\) for total privacy budgets of 0.1, 0.3, and 0.5, respectively. The impact of the stop-count condition on HTF depends on the underlying distribution of data points. For the Los Angeles dataset, MRE is relatively large for small values of \(\epsilon _{\text{tot}}\). Performance improves, reaching near-optimal values around a stop count of 50, and ultimately worsens as the stop count grows larger. This matches our expectation: small stop counts result in over-partitioning, whereas a stop count that is too large prevents the partitioning tree from reaching the ideal height, yielding high MRE values.


Fig. 5. Comparison to grid-based algorithms, Los Angeles dataset.

Figures 5(d), 5(e), and 5(f) show the results obtained when varying the query size (2, 6, and \(10\%\) of the data domain). HTF outperforms both AG and UG. Note that all three algorithms are adaptive and change their partitioning according to the number of data points; when the privacy budget is small, the resulting structures may cause fluctuations in accuracy. However, as the privacy budget grows, the algorithms reach their maximum partitioning limit, and increasing the budget always results in lower MRE.

7.4 HTF Versus Data-Independent Algorithms

The most prominent DP-compliant data-independent technique [7] uses QuadTrees, recursively partitioning the space into four equal-size quadrants. Two budget allocation techniques used in Reference [7] are geometric budget allocation (geo) and uniform budget allocation. For a fair comparison, we have also included the smoothing post-processing step from References [23, 31].

Figure 6 presents the comparison results. Figures 6(a), 6(b), and 6(c) are generated using queries of random shape and size, and several height settings. Note that the fanout of the QuadTree is double that of HTF, so a height of 2k in the figures corresponds to a QuadTree of height k. The error of the QuadTree approach is large for small heights, improves to its optimal value, and then rises significantly as the height increases further, due to over-partitioning. As with kd-trees, no systematic method exists for determining the optimal QuadTree height, whereas the HTF height-selection heuristic yields heights 15, 16, and 17 for privacy budgets of 0.1, 0.3, and 0.5, respectively. HTF outperforms the QuadTree for all settings of \(\epsilon _{\text{tot}}\); for example, the improvement in Figure 6(a) at tree height 16 reaches \(35\%\). Figures 6(d), 6(e), and 6(f) show the accuracy of HTF and QuadTree for randomly placed square queries of varying size. HTF outperforms in all cases.
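A sketch of the geometric allocation, as we recall it from Reference [7] (an assumption, not a verbatim restatement): per-level budgets grow geometrically toward the leaves with ratio fanout\(^{1/3}\), normalized so that, by sequential composition, the levels sum to the total budget:

```python
def geometric_budgets(eps_tot: float, height: int, fanout: int) -> list:
    """Per-level budgets that grow geometrically toward the leaves with
    ratio fanout**(1/3), normalized so the levels sum to eps_tot."""
    r = fanout ** (1.0 / 3.0)
    weights = [r ** i for i in range(height + 1)]   # level 0 is the root
    total = sum(weights)
    return [eps_tot * w / total for w in weights]

print(geometric_budgets(0.3, 5, 4))  # e.g., QuadTree (geo) with height 5
```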


Fig. 6. Comparison to data-independent algorithms, Los Angeles dataset.

7.5 Additional Benchmarks

To further validate HTF's performance, we run experiments on the Los Angeles dataset as well as six synthetic datasets generated using Gaussian distributions. Figure 7 presents the comparison of all algorithms over a randomly generated query workload with a privacy budget of \(\epsilon _{\text{tot}}=0.1\). Three additional algorithms are used as benchmarks: (i) the Singular algorithm, which achieves differential privacy by adding independent Laplace noise to each entry of the frequency matrix; (ii) the Uniform algorithm, in which Laplace noise is added to the total count of the grid, assuming data are uniformly distributed within it; and (iii) the Privlet algorithm [30], based on wavelet transformations.


Fig. 7. Mixed workloads, \(\epsilon _{\text{tot}}=0.1\), all datasets.

Figure 7 shows that HTF consistently outperforms existing approaches. For the denser datasets (\(\sigma=20\)), the gain compared to approaches designed for uniform data (e.g., UG, AG, QuadTrees) is smaller. As data sparsity grows, the accuracy gap between HTF and the benchmarks widens. HTF performs best, in relative terms, under more stringent privacy requirements (i.e., lower privacy budgets).

Finally, in Figure 8, we present the execution time of all considered algorithms on the LA dataset, including the response time for 200 queries of random shape and size. As the figure shows, HTF not only provides higher utility for the private dataset but is also the fastest, which significantly helps its application to large-scale datasets.


Fig. 8. Computational time overhead, \(\epsilon _{\text{tot}}=0.1\).

7.6 Performance Evaluation of Scaling Approach

We evaluate the approach introduced in Section 6 for scaling DP-compliant hierarchical structure construction to large datasets. The impact of the proposed scaling approach in conjunction with HTF is compared with two popular benchmarks on six datasets corresponding to cities throughout the USA (Table 2):

  • Benchmark 1: Geometric budget allocation with fanout two [32]. The map is divided using a binary tree until reaching the 1M population threshold. Given the US population of 328M and the threshold of 1M, there are \(\lceil \log _2 (\dfrac{328\times 10^6}{1\times 10^6 })\rceil = 9\) levels before reaching the threshold of dynamic nodes. Therefore, \(h_{\text{arch}} = 9\), and the geometric progression is applied with basis 2, a special case of the problem formulated in Equation (33) where \(n_i = 2^i\) for all \(i=0...h_{\text{arch}}\).

  • Benchmark 2: Geometric budget allocation with fanout four, previously used in QuadTrees [7]. The total number of levels is \(\lceil \log _4 (\dfrac{328\times 10^6}{1\times 10^6 })\rceil = 5\), leading to \(h_{\text{arch}} = 5\). This corresponds to solving the problem in Equation (33) with division basis four, resulting in \(n_i = 4^i\) for all \(i=0...h_{\text{arch}}\) (the short sketch below verifies both level counts).
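The level counts above are easy to verify (a minimal check):

```python
import math

def static_levels(population: int, threshold: int, fanout: int) -> int:
    # Levels of the static hierarchy before nodes drop below the threshold.
    return math.ceil(math.log(population / threshold, fanout))

print(static_levels(328_000_000, 1_000_000, 2))  # -> 9 (Benchmark 1)
print(static_levels(328_000_000, 1_000_000, 4))  # -> 5 (Benchmark 2)
```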

Dataset details are provided in Table 2. For each city, 1M data points are sampled from the GPS logs of the Veraset data. Central coordinates of the cities are given in the table, and the area is selected as \(\pm 0.6\) degrees around the center to generate a 1,024 \(\times\) 1,024 frequency matrix. The parameters selected for HTF follow Section 7.2, with \(\epsilon _{\text{prt}}^{\prime } = 1E-3\).

The results are presented in Figure 9. The proposed architecture is labeled Hierarchical in the figure, and comparisons are provided for total budgets of 0.5 and 1 allocated to sanitize the whole map of the USA. For each city, the budget is calculated based on the proposed hierarchical scheme and the two benchmarks. The utility improvement of the published data is consistent across all six cities. Two key trends are observed: (1) increasing the budget reduces MRE, and (2) increasing the fanout tends to preserve more budget for the lower levels of the tree and therefore results in better accuracy.


Fig. 9. Evaluation of proposed scaling approach compared with benchmarks; queries of random shape and size.

7.7 Parameter Analysis

The performance of HTF depends on several internal and external parameters that can impact the utility of the published private histogram. In this subsection, we vary parameters one at a time to evaluate their influence on the algorithm’s performance. The default parameters include the Veraset LA dataset with 3.5M samples, total privacy budget \(\epsilon _{\text{tot}}=0.1\), height estimation budget \(\epsilon _{\text{height}}= 0.001\), per level partitioning budget \(\epsilon _{\text{prt}}^{\prime } = 1E-3\), stop count threshold 10, and \(T=3\).

Privacy Budget. Recall that the total privacy budget is the sum of the sanitization budget (\(\epsilon _{\text{data}}\)), the partitioning budget (\(\epsilon _{\text{prt}}\)), and the height estimation budget (\(\epsilon _{\text{height}}\)). The height estimation budget is negligible compared to the other two, and the preceding experiments already showed that increasing the total privacy budget clearly increases utility, demonstrating the privacy-utility trade-off. In Figure 10(a), we focus on how the budget should be distributed between sanitization and partitioning. The x-axis shows the partitioning budget per tree level; multiplying it by the height h gives the total budget used for partitioning. Utility follows a parabolic trend relative to this budget: if the partitioning budget is too low, the algorithm fails to capture homogeneity, adversely affecting utility; if it grows so high that the budget is split almost equally between sanitization and partitioning, the small sanitization budget dominates the error, significantly deteriorating utility.


Fig. 10. Analysis of HTF parameters.

Per-Level Computation T. The variable T denotes the number of times the homogeneity objective function is evaluated for each node split in the HTF tree. A higher value of T yields more split-point candidates, increasing the chance of finding one that improves homogeneity. The drawback is that less budget remains per evaluation, making each assessment of the objective function noisier (\(\epsilon _{\text{prt}}^{\prime \prime } = \epsilon _{\text{prt}}^{\prime }/(2T+1)\)). Figure 10(b) examines the impact of T for low, medium, and high privacy regimes. When the privacy budget is large, variations in T have less impact, as the sanitization budget dominates partitioning. The parameter becomes more significant when the privacy budget is limited; in that scenario, the parabolic impact on utility is clearly visible in the graph. For small values of T, there are too few evaluations to identify a highly homogeneous split; when T grows too large, the per-round budget becomes too small for accurate evaluation of the objective function, lowering the utility of the published private histogram.
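The trade-off is visible directly in the per-round budget (a one-line computation based on the split stated above):

```python
def per_round_budget(eps_prt_level: float, T: int) -> float:
    # Splitting one level's partitioning budget across the T candidate
    # evaluations (plus overhead) as in the text: eps'' = eps' / (2T + 1).
    return eps_prt_level / (2 * T + 1)

for T in (1, 3, 5, 10):
    print(T, per_round_budget(1e-3, T))  # larger T -> noisier evaluations
```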

Height. In Figure 10(c), we remove the height estimation mechanism that normally runs before generating the HTF tree, and instead run the algorithm with fixed heights. Despite fluctuations due to the Laplace mechanism, a clear trend emerges: significant utility loss occurs when the tree is short. Performance improves as the height grows, until around height 15, after which it stabilizes due to the stop conditions. The optimal value approximately coincides with the height estimate proposed for HTF (with \(|\overline{D}|\approx |D|\)):

(39) \(h = \log _2{\left(\dfrac{|D| \, \epsilon _{\text{tot}}}{10}\right)} = \log _2{\left(\dfrac{3.5\times 10^6 \times 0.1}{10}\right)} \approx 15.\)
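The estimate is easily reproduced (a minimal check):

```python
import math

# Self-tuned height from Equation (39), with |D| = 3.5M and eps_tot = 0.1:
h = math.log2(3.5e6 * 0.1 / 10)
print(h)  # ~15.1, matching the plateau observed in Figure 10(c)
```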

Stop Conditions. Despite their simplicity, private stop conditions can critically improve or degrade utility. The discussion around post-processing in the 2020 U.S. Census highlights the importance of handling negative counts. Unlike the Bureau's approach of suppressing all negative numbers, our method applies stop conditions in a private way and additionally supports more flexible constraints such as density and homogeneity. In Figure 10(d), we examine utility as a function of the stop condition applied to tree nodes. The experiments suggest that large stop-condition values prevent HTF from partitioning the space effectively by pruning the tree too early. Based on our observations, selecting stop conditions below the total noise variance is an effective rule of thumb; in this case, given the total privacy budget \(\epsilon _{\text{tot}} = 0.1\), the noise variance is \(2/\epsilon _{\text{tot}}^2 = 200\). Note that the empirical errors confirm our theoretical analysis of leaf-node thresholds in Section 5: the minimum error, obtained for a stop count of 10, is very close to the error at the theoretically indicated stop count of 62.

Scale. Figure 10(e) evaluates the scale-epsilon exchangeability property for HTF. This principle dictates that increasing the scale (number of samples) has a similar effect to increasing the privacy budget. Instead of using all 3.5 million data points in the Los Angeles dataset, we sample uniformly at random, generating six datasets with 0.5, 1, 1.5, 2, 2.5, and 3 million samples. The clear trend of error reduction as scale increases confirms that the algorithm follows scale-epsilon exchangeability, since the same behavior is expected when increasing the privacy budget.


8 RELATED WORK

Among different approaches and metrics to achieve privacy, such as cryptography [26, 27], k-anonymity [28], and l-diversity [21], DP [10] has proven to be a highly viable option for statistical databases, including population histograms. The comprehensive benchmark analysis in Reference [14] showed that the dimensionality and scale of the data directly determine the accuracy of an algorithm. For one-dimensional data, both data-dependent and data-independent methods perform well. The hierarchical method in Reference [15] uses a strategy consisting of hierarchically structured range queries arranged as a tree. Similar methods (e.g., Reference [34]) differ in how they determine the tree's branching factor and allocate budget to each of its levels. Data-dependent techniques, on the other hand, exploit correlation in real-world datasets to boost the accuracy of histograms. They first compress the data without loss: for example, EFPA [1] applies the Discrete Fourier Transform, whereas DAWA [17] uses dynamic programming to compute the least-cost partitioning. The compressed data are then sanitized, for example, directly with LPM [1] or with a greedy algorithm that tunes the privacy budget to a sample query set given in advance [17]. Privlet [30] compresses data through a wavelet transformation such that the noise incurred by a range query scales proportionally to the logarithm of its length.

In the 2D scenario, the main focus is on spatial datasets that exhibit sparse and skewed data distributions, where only data-dependent approaches tend to be competitive. General-purpose mechanisms such as the matrix mechanism of Li and Miklau [18, 19] and its workload-aware counterpart DAWA [17] operate over a discrete 1D domain, and may be extended to spatial data by applying a Hilbert transform to the 2D data representation [14]. However, approaches specialized for answering spatial range queries, such as UG [23], AG [23], QuadTree [7], and kd-tree [32], outperform general-purpose mechanisms [14]. Xiao et al. [32] present the earliest attempt at a DP-compliant spatial decomposition algorithm based on the kd-tree. It first imposes a uniform grid over the data domain, and then constructs a private kd-tree over the cells of the grid. While the simplicity of the approach is appealing, the split criterion is based solely on the median, leading to high sensitivity and to split partitions with low intra-node homogeneity.

Recent work focuses on high-dimensional data, where the key idea is to reduce the impact of the higher dimensionality. The most accurate algorithm in this class is High-Dimensional Matrix Mechanism (HDMM) [22], which represents queries and data as vectors and uses sophisticated optimization and inference techniques to answer them. DPCube [31] searches for dense sub-cubes to release privately. Some of the privacy budget is used to obtain noisy counts over a regular partitioning, which is then refined to a standard kd-tree. Fresh noisy counts for the partitions are obtained with the remaining budget, and a final inference step resolves inconsistencies between the two sets of counts.

In contrast to DP-compliant aggregate statistics published by a data curator, significant work has also been devoted to preventing the data curators themselves (such as a location-based service provider) from inferring a mobile user’s location in the online setting [2, 11, 24]. Spatial k-anonymity (SKA) [11, 12] generalizes the specific position of the querying user to a region that encloses at least k users. Geo-indistinguishability [2] extends the DP definition to the Euclidean space and obfuscates user check-ins to protect the exact location coordinates. Synthesizing privacy-preserving location trajectories is explored in Reference [3]. Finally, reporting high-order statistics of mobility data in the context of DP has been pursued in References [5, 6, 34], where the focus is on releasing trajectories using noisy counts of prefixes or n-grams in a trajectory.


9 CONCLUSIONS AND FUTURE WORK

We proposed a novel approach to privacy-preserving release of location histograms with differential privacy guarantees. Our technique capitalizes on the key observation that density homogeneity within partitions of the constructed index structure reduces the error introduced by DP noise. We devised low-sensitivity strategies for finding split coordinates in a DP-compliant way, and implemented effective privacy budget allocation strategies across different stages of the data sanitization process. In future work, we plan to extend our approach to trajectory data, which are more challenging due to their high-dimensionality. We also plan to combine our data sanitization techniques with machine learning approaches, to further boost the accuracy of DP-compliant location queries.

Footnotes

  1. This submission is an extended version of the work in Reference [25]; additional contributions include a technique for automatic tuning of the structure height to optimize privacy budget utilization (Section 5) and a method that enhances query accuracy for large datasets (Section 6). We also provide additional experiments in Sections 7.6 and 7.7.

  2. Veraset is a data-as-a-service company that provides anonymized population movement data collected through location measurement signals of cell phones across the USA.

REFERENCES

  [1] Gergely Acs, Claude Castelluccia, and Rui Chen. 2012. Differentially private histogram publishing through lossy compression. In Proceedings of the IEEE 12th International Conference on Data Mining. IEEE, 1–10.
  [2] Miguel E. Andrés, Nicolás E. Bordenabe, Konstantinos Chatzikokolakis, and Catuscia Palamidessi. 2013. Geo-indistinguishability: Differential privacy for location-based systems. In Proceedings of the ACM Conference on Computer and Communications Security (CCS'13).
  [3] Vincent Bindschaedler and Reza Shokri. 2016. Synthesizing plausible privacy-preserving location traces. In Proceedings of the IEEE Symposium on Security and Privacy (SP'16). IEEE, 546–563.
  [4] Stephen Boyd and Lieven Vandenberghe. 2004. Convex Optimization. Cambridge University Press, 243–244.
  [5] Rui Chen, Gergely Acs, and Claude Castelluccia. 2012. Differentially private sequential data publication via variable-length n-grams. In Proceedings of the ACM Conference on Computer and Communications Security (CCS'12). 638–649.
  [6] Rui Chen, Benjamin C. M. Fung, Bipin C. Desai, and Nériah M. Sossou. 2012. Differentially private transit data publication: A case study on the Montreal transportation system. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 213–221.
  [7] Graham Cormode, Cecilia Procopiuc, Divesh Srivastava, Entong Shen, and Ting Yu. 2012. Differentially private spatial decompositions. In Proceedings of the IEEE 28th International Conference on Data Engineering. IEEE, 20–31.
  [8] Cynthia Dwork. 2008. Differential privacy: A survey of results. In Proceedings of the International Conference on Theory and Applications of Models of Computation. Springer, 1–19.
  [9] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. Calibrating noise to sensitivity in private data analysis. In Proceedings of the Theory of Cryptography Conference. Springer, 265–284.
  [10] Cynthia Dwork and Aaron Roth. 2014. The algorithmic foundations of differential privacy. Found. Trends Theoret. Comput. Sci. 9, 3–4 (2014), 211–407.
  [11] Gabriel Ghinita, Keliang Zhao, Dimitris Papadias, and Panos Kalnis. 2010. A reciprocal framework for spatial k-anonymity. Info. Syst. 35, 3 (2010), 299–314.
  [12] Marco Gruteser and Dirk Grunwald. 2003. Anonymous usage of location-based services through spatial and temporal cloaking. In Proceedings of the 1st International Conference on Mobile Systems, Applications and Services. ACM, 31–42.
  [13] Michael B. Hawes. 2020. Implementing differential privacy: Seven lessons from the 2020 United States Census. https://hdsr.mitpress.mit.edu/pub/dgg03vo6/release/4.
  [14] Michael Hay, Ashwin Machanavajjhala, Gerome Miklau, Yan Chen, and Dan Zhang. 2016. Principled evaluation of differentially private algorithms using DPBench. In Proceedings of the International Conference on Management of Data. 139–154.
  [15] Michael Hay, Vibhor Rastogi, Gerome Miklau, and Dan Suciu. 2010. Boosting the accuracy of differentially private histograms through consistency. Proc. VLDB Endow. 3, 1–2 (Sept. 2010), 1021–1032.
  [16] Ali Inan, Murat Kantarcioglu, Gabriel Ghinita, and Elisa Bertino. 2010. Private record matching using differential privacy. In Proceedings of the 13th International Conference on Extending Database Technology. 123–134.
  [17] Chao Li, Michael Hay, Gerome Miklau, and Yue Wang. 2014. A data- and workload-aware algorithm for range queries under differential privacy. Proc. VLDB Endow. 7, 5 (2014), 341–352.
  [18] Chao Li and Gerome Miklau. 2012. An adaptive mechanism for accurate query answering under differential privacy. Proc. VLDB Endow. 5, 6 (2012), 514–525.
  [19] Chao Li and Gerome Miklau. 2013. Optimal error of query sets under the differentially private matrix mechanism. In Proceedings of the 16th International Conference on Database Theory. 272–283.
  [20] Ashwin Machanavajjhala, Johannes Gehrke, Daniel Kifer, and Muthuramakrishnan Venkitasubramaniam. 2006. l-diversity: Privacy beyond k-anonymity. In Proceedings of the 22nd International Conference on Data Engineering (ICDE'06).
  [21] Ashwin Machanavajjhala, Daniel Kifer, Johannes Gehrke, and Muthuramakrishnan Venkitasubramaniam. 2007. l-diversity: Privacy beyond k-anonymity. ACM Trans. Knowl. Discov. Data 1, 1 (2007), 3–es.
  [22] Ryan McKenna, Gerome Miklau, Michael Hay, and Ashwin Machanavajjhala. 2018. Optimizing error of high-dimensional statistical queries under differential privacy. Proc. VLDB Endow. 11, 10 (2018), 1206–1219.
  [23] Wahbeh Qardaji, Weining Yang, and Ninghui Li. 2013. Differentially private grids for geospatial data. In Proceedings of the IEEE 29th International Conference on Data Engineering (ICDE'13). IEEE, 757–768.
  [24] Daniele Quercia, Ilias Leontiadis, Liam McNamara, Cecilia Mascolo, and Jon Crowcroft. 2011. SpotME if you can: Randomized responses for location obfuscation on mobile phones. In Proceedings of the 31st International Conference on Distributed Computing Systems. IEEE, 363–372.
  [25] Sina Shaham, Gabriel Ghinita, Ritesh Ahuja, John Krumm, and Cyrus Shahabi. 2021. HTF: Homogeneous tree framework for differentially private release of location data. In Proceedings of the 29th International Conference on Advances in Geographic Information Systems (SIGSPATIAL'21). ACM, New York, NY, 184–194.
  [26] Sina Shaham, Gabriel Ghinita, and Cyrus Shahabi. 2020. Enhancing the performance of spatial queries on encrypted data through graph embedding. In Proceedings of the IFIP Annual Conference on Data and Applications Security and Privacy. Springer, 289–309.
  [27] Sina Shaham, Gabriel Ghinita, and Cyrus Shahabi. 2021. An efficient and secure location-based alert protocol using searchable encryption and Huffman codes. arXiv:2105.00618.
  [28] Latanya Sweeney. 2002. k-anonymity: A model for protecting privacy. Int. J. Uncert., Fuzz. Knowl.-based Syst. 10, 5 (2002), 557–570.
  [29] Veraset. 2021. Veraset Movement Data for the USA, the largest, deepest and broadest available movement dataset (anonymized GPS signals). Retrieved from https://datarade.ai/data-products/veraset-movement-data-for-the-usa-the-largest-deepest-and-broadest-available-movement-dataset-veraset.
  [30] Xiaokui Xiao, Guozhang Wang, and Johannes Gehrke. 2010. Differential privacy via wavelet transforms. IEEE Trans. Knowl. Data Eng. 23, 8 (2010), 1200–1214.
  [31] Yonghui Xiao, Li Xiong, Liyue Fan, Slawomir Goryczka, and Haoran Li. 2014. DPCube: Differentially private histogram release through multidimensional partitioning. Transactions on Data Privacy 7, 3 (2014), 195–222.
  [32] Yonghui Xiao, Li Xiong, and Chun Yuan. 2010. Differentially private data release through multidimensional partitioning. In Proceedings of the Workshop on Secure Data Management. Springer, 150–168.
  [33] Jing Yuan, Yu Zheng, Chengyang Zhang, Wenlei Xie, Xing Xie, Guangzhong Sun, and Yan Huang. 2010. T-drive: Driving directions based on taxi trajectories. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems. 99–108.
  [34] Jun Zhang, Xiaokui Xiao, and Xing Xie. 2016. PrivTree: A differentially private algorithm for hierarchical decompositions. In Proceedings of the International Conference on Management of Data. 155–170.
  [35] Yu Zheng, Lizhu Zhang, Xing Xie, and Wei-Ying Ma. 2009. Mining interesting locations and travel sequences from GPS trajectories. In Proceedings of the 18th International Conference on World Wide Web. ACM, 791–800.
