Evaluating the Risk of Disclosure and Utility in a Synthetic Dataset

Abstract: The advancement of information technology has improved the delivery of financial services through the introduction of Financial Technology (FinTech). To enhance customer satisfaction, FinTech companies leverage artificial intelligence (AI) to collect fine-grained data about individuals, which enables them to provide more intelligent and customized services. However, although such services promise to make customers' lives easier, they also raise major security and privacy concerns for their users. Differential privacy (DP) is a common privacy-preserving data publishing technique that has been proven to ensure a high level of privacy preservation. However, an important concern arises from the trade-off between the data utility and the risk of data disclosure (RoD), which has not been well investigated. In this paper, to address this challenge, we propose data-dependent approaches for evaluating whether sufficient privacy is guaranteed in a differentially private data release. At the same time, taking into account the utility of the differentially private synthetic dataset, we present a data-dependent algorithm that, through a curve fitting technique, measures the error that the injection of random noise imposes on the statistical results of the original dataset. Moreover, we also propose a method that ensures that a proper privacy budget, i.e., ε, is chosen so as to maintain the trade-off between privacy and utility. Our comprehensive experimental analysis demonstrates both the efficiency and the estimation accuracy of the proposed algorithms.


Introduction
The concept of Financial Technology (FinTech) has evolved as a result of integrating innovative technologies, e.g., AI and big data, Blockchain, and mobile payment technologies, into financial services in order to provide better financial services [1]. Investment in the FinTech industry is trending upward; by September 2020, global investment in FinTech had reached $25.6 billion, as reported by KPMG [2]. However, the security and privacy of users' data are among the main concerns. Recently, the United States and Europe launched new privacy regulations, such as the California Consumer Privacy Act (CCPA) and the General Data Protection Regulation (GDPR), to strictly control the manner in which personal data are used, stored, exchanged, and even deleted by data collectors (e.g., corporations). Attempts to assist law enforcement have given rise to a strong demand for the development of privacy-preserving data release (PPDR) algorithms, together with the quantitative assessment of privacy risk. Given an original dataset (the dataset to be released), PPDR aims to convert the original dataset into a sanitized dataset (or a private dataset) such that the privacy leakage from using the sanitized dataset is controllable, and then to publish the sanitized dataset. In the past, these demands could be satisfied by conventional approaches such as k-anonymity, l-diversity, and t-closeness. However, these approaches have shortcomings in terms of their syntactic privacy definition and the difficulty of distinguishing between quasi-identifiers and sensitive attributes (the so-called QI fallacy), and are therefore no longer candidates for PPDR. In contrast, differential privacy (DP) [10] can be viewed as a de facto privacy notion, and many differentially private data release (DPDR) algorithms [11] have been proposed and even used in practice. Note that DPDR can be considered a special type of PPDR with DP as a necessary privatization technique.
Although it promises to maintain a balance between privacy and data utility, the privacy guarantee of DPDR is, in fact, only slightly more explainable. Therefore, in the case of DPDR, it is difficult to choose an appropriate configuration for the inherent privacy parameter, the privacy budget ε. More specifically, DP uses an independent parameter ε that controls the magnitude of the injected noise, yet the selection of ε such that the data utility is maximized remains problematic. On the other hand, although the value of ε affects the magnitude of the noise, it has no direct relevance to the risks of data disclosure, such as the probability of re-identifying a particular individual. In other words, the choice of ε such that privacy is meaningfully protected still needs to be investigated. In practice, the regulatory institutions in the FinTech industry give no clear recommendation on a preferred anonymization technique that could address the challenge of preserving privacy while providing openness. This might be due to the lack of clarity about the privacy guarantee versus the utility in the existing DPDR algorithms. Thus, a strong demand exists to develop novel measures of the risk of disclosure (RoD) and utility for DPDR.

Related Work
In this section we present a brief review of studies on the differentially private data release and the risk of disclosure.

Differentially Private Data Release (DPDR)
Privacy-preserving data release (PPDR) methods have been introduced to address the challenge of the trade-off between the privacy and the utility of a released dataset. Several anonymization and privacy-enhancing techniques have been proposed in this regard. The most popular technique is k-anonymity [12], which uses generalization and suppression to hide each record among at least k − 1 other similar records. Because k-anonymity is vulnerable to sensitive attribute disclosure when attackers have background knowledge, l-diversity [13] and t-closeness [14] were proposed to further diversify the record values. However, all of these techniques have been shown to be theoretically and empirically insufficient for privacy protection.
Differential privacy (DP) [10] is another mainstream privacy-preserving technique, which aims to generate an obfuscated dataset in which the addition or removal of a single record does not significantly affect the result of an analysis performed on that dataset. Since the introduction of DP, several DPDR algorithms have been proposed. Here, we place particular emphasis on the synthetic dataset approach in DPDR. Namely, the data owner generates and publishes a synthetic dataset that is statistically similar to the original dataset (i.e., the dataset to be released). It should be noted that, since 2020, the U.S. Census Bureau has started to release census data by using the synthetic dataset approach [15]. DPDR can be categorized into two types: parametric and non-parametric. The former relies on the hypothesis that each record in the original dataset is sampled from a hidden data distribution. In this sense, parametric DPDR identifies the data distribution, injects noise into the data distribution, and repeatedly samples the noisy data distribution. The dataset constructed in this manner is, in fact, not tied to the original dataset record by record, even though the two share a similar data distribution, and can therefore protect individual privacy. The latter converts the original dataset into a contingency table, where i.i.d. noise is added to each cell. The noisy contingency table is then converted to the corresponding dataset representation. This dataset can be released without privacy concerns because each record can claim plausible deniability.
Two examples in the category of parametric DPDR are PrivBayes [16] and JTree [17]. In particular, PrivBayes creates a differentially private but high-dimensional synthetic dataset D by generating a low-dimensional Bayesian network N. PrivBayes is composed of three steps: 1) Network learning: a k-degree Bayesian network N is constructed over the attributes in the original high-dimensional dataset O using an (ε/2)-DP algorithm; here k refers to a small value dependent on the affordable memory size and computational load. 2) Distribution learning: an (ε/2)-DP algorithm is used to generate a set of conditional distributions, such that each attribute-parent (AP) pair in N has a noisy conditional distribution. 3) Data synthesis: N and the d noisy conditional distributions are used to obtain an approximation of the input distribution, and then tuples are sampled from the approximate distribution to generate a synthetic dataset D.
JTree [17] proposes to use a Markov random field, together with a sampling technique, to model the joint distribution of the input data. JTree consists of four steps: 1) Generate the dependency graph: The goal of this step is to calculate the pairwise correlation between attributes through the sparse vector technique (SVT), leading to a dependency graph. 2) Generate attribute clusters: Once the pairwise correlation between attributes is computed, the junction tree algorithm is used to generate a set of cliques from the above dependency graph. These attribute cliques will be used to derive noisy marginals with the minimum error. 3) Generate noisy marginals: A differentially private marginal table is generated for each attribute cluster. After that, consistency constraints are applied to each differentially private marginal table, as a post-processing step, to enhance the data utility. 4) Produce a synthetic dataset: From these differentially private marginal tables, a synthetic dataset can be generated efficiently while satisfying differential privacy. Other methods in the category of parametric DPDR include DP-GAN [18][19][20], GANObfuscator [21], and PATE-GAN [22].
On the other hand, Priview [23] and DPSynthesizer [24] are two representative examples in the category of non-parametric DPDR. Priview and DPSynthesizer are similar in that they first generate different marginal contingency tables. The main difference between parametric and non-parametric DPDR lies in the fact that the former assumes a hidden data distribution, whereas the latter processes the corresponding contingency table directly. Specifically, noise is applied to each cell of the contingency tables to derive the noisy tables. The noisy marginal contingency tables are then combined to reconstruct the potentially high-dimensional dataset, followed by a sophisticated post-processing step for further utility enhancement. Other methods in the category of non-parametric DPDR include DPCube [25], DPCopula [26], and DPWavelet [27].
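The non-parametric pipeline described above (build a contingency table, add i.i.d. noise to each cell, convert back to records) can be illustrated with a minimal Python sketch. This is not the design of Priview or DPSynthesizer: it uses a single full contingency table rather than marginal tables, and the function names, the Laplace noise, the sensitivity of 1 (neighboring datasets differing by the addition or removal of one record), and the clip-and-round post-processing are all assumptions made for illustration.

```python
import itertools
import random
from collections import Counter


def laplace_noise(scale, rng):
    # A zero-mean Laplace variable is the difference of two i.i.d.
    # exponential variables with rate 1/scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_contingency_release(records, domains, epsilon, rng):
    """Hypothetical helper: release a synthetic dataset from a noisy table.

    records: list of tuples over categorical attributes
    domains: one tuple of admissible values per attribute
    """
    counts = Counter(records)
    # Assumption: adding or removing one record changes one cell by 1,
    # so the L1 sensitivity of the table is 1.
    scale = 1.0 / epsilon
    synthetic = []
    for cell in itertools.product(*domains):
        noisy = counts.get(cell, 0) + laplace_noise(scale, rng)
        # Post-processing: clip negative counts and round to an integer.
        synthetic.extend([cell] * max(0, round(noisy)))
    return synthetic


if __name__ == "__main__":
    rng = random.Random(7)
    original = [(0, 1), (0, 1), (1, 2), (1, 0)]
    domains = [(0, 1), (0, 1, 2)]
    print(noisy_contingency_release(original, domains, epsilon=1.0, rng=rng))
```

Enumerating every cell of the full table is only feasible for a few low-cardinality attributes, which is one reason the methods above work with marginal tables instead.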

Risk of Disclosure (RoD) Metrics
Not much attention has been paid to developing RoD metrics, although DP has its own theoretical foundation for privacy. The research gap arises because the privacy of DP is hidden in its definition, according to which the query results differ only negligibly between neighboring datasets. However, in a real-world setting, the user prefers to know whether (s)he will be re-identified. Moreover, the user wants to know what kind of information is protected, what the corresponding privacy level is, and what the potentially negative impact of the perturbation on statistical analysis tasks is. On the other hand, although many DPDR algorithms have emerged, the lack of a clear and understandable definition of RoD makes the critical choice of ε difficult and hinders the practical deployment of DPDR systems. Thus, we are eager to develop an RoD metric that answers these questions quantitatively.
To make the privacy notion easy for a layperson to understand, Lee and Clifton [28] took the first step toward a friendly definition of the RoD. They adopt the cryptographic game approach to define the RoD. Specifically, given a dataset O with m records, the trusted server randomly determines, by tossing a coin, whether a record r ∈ O will be deleted. Let D be the resulting dataset after the deletion of the chosen record. The attacker's objective is to determine whether the record r exists. Here, the attacker is assumed to be equipped with arbitrary knowledge of the datasets O and D. In this sense, Lee and Clifton formulated the attacker as a Bayesian attacker, which means that the attacker aims to maximize the probability of guessing correctly by using both the prior and posterior knowledge of O, D, and r.
Hsu et al. [29], on the other hand, propose to choose the parameter ε from an economic perspective. The idea behind their work is based on the observation that a normal user has a financial incentive to contribute sensitive information (if the third party or even the attacker provides financial compensation). Their economics-inspired solution [29] can calculate a proper ε by striking a balance between the accuracy of the released data and ε. In addition, Naldi et al. [30] calculated the parameter ε according to estimation theory. More specifically, Naldi et al. put their emphasis only on the counting query (as the counting query leads to the minimal sensitivity) and define their RoD accordingly. Their solution is similar to the solution in this paper. Nonetheless, the restriction to the counting query limits its practicality. Tsou et al. [31] also presented an RoD, which is defined by restricting the confidence level of the magnitude of the Laplace noise. Their approach starts from the observation that the data owner who wants to evaluate the RoD is in possession of only an original dataset O and a candidate differentially private synthetic dataset D. Based on another observation, namely that the magnitude of Laplace noise can be bounded with high probability, the value range of the Laplace noise can then be estimated with a certain confidence level. As a result, the estimated value range of the Laplace noise can be used to determine the level of distortion of the record values, and this also implies the privacy level.

Problem Statement
The assessment of the RoD and data utility of the DP synthetic dataset presents the following two difficulties.
• RoD requires good explainability for layman users in terms of a privacy guarantee, while simultaneously allowing quantitative interpretation so that the privacy effect of different DPDR algorithms can be compared. In particular, although the privacy budget ε in DP has a mathematical explanation, it is difficult for layman users to comprehend the meaning behind the definition. Moreover, the privacy budget is inconsistent with the requirements in current privacy regulations such as GDPR and CCPA, because the relation between ε and legal terms such as "single-out" remains to be investigated.
• Usually, it is necessary to generate a synthetic dataset and then estimate the corresponding privacy level by calculating the RoD. Nonetheless, this process requires an uncertain number of iterative steps until the pre-defined privacy level is reached, leading to inefficiency in synthetic dataset generation. Thus, the aim is to develop a solution that can efficiently estimate the privacy level of the DP synthetic dataset.
Though the methods in Section 1.1 can be used to determine ε (or to quantify the privacy level), all of them have inherent limitations, as mentioned previously. In particular, two of these studies [28,29] apply only in the case of interactive DP, where the data analyst keeps interacting with the server and receives query results from the server. Nevertheless, as interactive DP suffers from the privacy budget depletion problem, the DPDR in this paper only considers non-interactive DP (see Section 2.1), which results in the publication of a differentially private dataset that allows an arbitrary number of queries. Thus, the studies [28,29] are not applicable to the assessment of RoD here. Moreover, though the method in [30] is somewhat related to non-interactive DP, its application is limited to counting queries. The work of Sarathy et al. [32] has a similar limitation, since it applies only to numerical data. Lastly, the method of Tsou et al. [31] works only when the synthetic dataset is synthesized with the injection of Laplacian noise. Unfortunately, this is not always the case, because Laplacian noise leads to a synthetic dataset with poor utility, and therefore the data owner might choose alternatives. The designs of PrivBayes, JTree, Priview, and DPSynthesizer (in Section 1.1.1) similarly indicate that more sophisticated designs are used for DPDR, further limiting the applicability of this work [31].

Contribution
Our work makes the following two contributions:
• We define a notion for evaluating the risk of disclosure (RoD) particularly for the DP synthetic dataset. The state-of-the-art DPDR algorithms decouple the original dataset from the synthetic dataset, which makes it difficult to evaluate the RoD. However, we strive to quantify the RoD without making unrealistic assumptions.
• We propose a framework for efficiently determining the privacy budget ε in DP, using the curve fitting approach, taking into consideration the desired trade-off between the data utility and privacy.

Differential Privacy (DP)
The scenario in our consideration is a trusted data owner with a statistical database. The database stores a sensitive dataset. The database constructs and publishes a differentially private synthetic dataset for the public. In this sense, DP has been a de facto standard for protecting not only the privacy in the interactive (statistical database) query framework but also the (noninteractive) data release framework (see below).
There are two different kinds of DP scenarios, interactive and non-interactive. In the former, a dedicated and trusted server (the data owner) sits between the sensitive dataset and the data analyst, who issues queries to the server. The server is responsible for answering the queries issued by the data analyst. However, to avoid information leakage from query results, the server perturbs each query result before forwarding it. Obviously, the interactive setting is cumbersome because, in reality, the data owner needs to set up a dedicated server. On the contrary, in the latter, the server (data owner) simply releases a privatized dataset to the public after sanitizing the dataset. During the whole process, no further interaction with anyone is needed. The synthetic dataset approach is representative of non-interactive DP. Throughout the paper, we consider the non-interactive setting of DP (i.e., DPDR) unless stated otherwise.
Let ε be a positive real number and M be a randomized algorithm that takes a dataset as input. We say that M is ε-DP if, for all neighboring datasets D_1 and D_2 that differ in at most one record (e.g., the data of one person), and all subsets S of the image of M,

$$\Pr[M(D_1) \in S] \le e^{\varepsilon} \cdot \Pr[M(D_2) \in S],$$

where the parameter ε can be adjusted according to the tradeoff between utility and privacy; a higher ε implies lower privacy. Therefore, ε is also called the privacy budget, because the number of query responses is positively proportional to the privacy loss. From the above definition, we also know that DP provides a cryptographic privacy guarantee (from the indistinguishability point of view) that the presence or absence of a specific record will not affect the output of the algorithm significantly. From the attacker's point of view, (s)he cannot tell whether a specific record exists given access to the algorithm output. DP can be achieved by injecting zero-mean Laplace noise [33]. Specifically, noise sampled from a zero-mean Laplace distribution is added to perturb the query answer. Then, the data analyst only receives the noisy query answer. With its two parameters governing the mean and the variance, the Laplace distribution is determined jointly by ε and the global sensitivity of the query function q,

$$\Delta q = \max_{D_1, D_2} \lVert q(D_1) - q(D_2) \rVert_1 ;$$

that is, for any query q, the mechanism

$$M(D) = q(D) + \mathrm{Lap}(\Delta q / \varepsilon)$$

is ε-DP, where Lap(Δq/ε) is a random variable that follows a zero-mean Laplace distribution with scale Δq/ε.
As stated above, ε determines the tradeoff between privacy and data utility. DP features the sequential composition theorem, which states that when the dataset is queried k times and each noisy response satisfies ε-DP, the k queries together achieve kε-DP. In addition, DP is closed under post-processing, which means that any kind of data-independent processing of a noisy (ε-DP) answer does not compromise its privacy guarantee.
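As a minimal sketch of the Laplace mechanism and sequential composition discussed in this section, the following Python fragment perturbs a query answer with zero-mean Laplace noise whose scale is the global sensitivity divided by ε, and adds up the budgets of several releases. The function names are hypothetical, and the sensitivity is supplied by the caller as an assumption rather than derived from the query.

```python
import random


def laplace_noise(scale, rng):
    # A zero-mean Laplace variable with the given scale is the difference
    # of two i.i.d. exponential variables with rate 1/scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(true_answer, sensitivity, epsilon, rng):
    """Return the query answer perturbed with Lap(sensitivity / epsilon)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_answer + laplace_noise(sensitivity / epsilon, rng)


def composed_budget(epsilons):
    # Sequential composition: releases that are eps_1-DP, ..., eps_k-DP
    # are (eps_1 + ... + eps_k)-DP when taken together.
    return sum(epsilons)


if __name__ == "__main__":
    rng = random.Random(42)
    true_count = 120  # a counting query has global sensitivity 1
    for eps in (0.1, 1.0, 10.0):
        noisy = laplace_mechanism(true_count, sensitivity=1.0, epsilon=eps, rng=rng)
        print(f"epsilon={eps}: noisy count = {noisy:.2f}")
    print("total budget for three 0.5-DP queries:", composed_budget([0.5] * 3))
```

A smaller ε yields a larger noise scale, which is the privacy-utility tension that the rest of the paper tries to quantify.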

Voronoi Diagram
In our proposed algorithm in Section 3.2.3, we take advantage of the Voronoi diagram, a mathematical concept that refers to partitioning a plane into adjacent regions, called Voronoi cells, to cover a finite set of points [34]. The definition of a Voronoi diagram is as follows [34,35]: Let X be a set of n points (called sites or generators) in the plane. The Voronoi region/cell associated with a point x ∈ X is the set of all points in the plane that are closer to x than to any other point in X (i.e., the points whose nearest neighbor in X is x). In other words, for two distinct points x, y ∈ X, let the dominance of x over y be the set of points that are at least as close to x as to y; the region associated with x then consists of all the points in the plane lying in all of the dominances of x over the remaining points y, i.e.,

$$\mathrm{region}(x) = \bigcap_{y \in X \setminus \{x\}} \mathrm{dom}(x, y), \qquad \mathrm{dom}(x, y) = \{ p \in \mathbb{R}^2 : l_2(p, x) \le l_2(p, y) \},$$

where l_2 is the Euclidean distance. Owing to the specific geometrical structure of Voronoi diagrams and their simplicity in visual perception, they have been used in several research areas, such as file searching, scheduling, and clustering [34]. Recently, the Voronoi diagram has also been used for preserving location privacy in various research studies [36][37][38]. In [36], the authors propose a privacy-preserving model for mobile crowd computing that hides users in a cloaked area based on the Voronoi diagram. That work takes advantage of the Voronoi diagram to provide k-anonymity for users in each Voronoi cell. In another study, Bi et al. [38] combine local differential privacy and the Voronoi diagram in order to preserve privacy in edge computing.
Compared to the state of the art, in this paper we adopt the Voronoi diagram in a completely different manner, namely for evaluating the RoD in a differentially private synthetic dataset.
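The definition of a Voronoi cell given in this section is equivalent to a nearest-site rule: a point belongs to the cell of the site closest to it. The short Python sketch below assigns points to cells in this way without constructing the cell boundaries. The function names are hypothetical, and breaking ties in favor of the lowest site index is an assumption made only to keep the assignment deterministic.

```python
import math


def nearest_site(point, sites):
    # Index of the site whose Voronoi cell contains the point, using the
    # Euclidean distance; ties go to the lowest index (an assumption).
    return min(range(len(sites)), key=lambda i: math.dist(point, sites[i]))


def voronoi_partition(points, sites):
    """Group points by the Voronoi cell (nearest site) they fall into."""
    cells = {i: [] for i in range(len(sites))}
    for p in points:
        cells[nearest_site(p, sites)].append(p)
    return cells


if __name__ == "__main__":
    sites = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
    points = [(1.0, 1.0), (3.0, 0.5), (0.5, 3.5)]
    for index, members in voronoi_partition(points, sites).items():
        print(sites[index], "->", members)
```

Because only distances are used, the same rule carries over unchanged to records with more than two attributes.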

Proposed Approach for Evaluating Risk of Disclosure
In the following, we consider the setting of an original dataset O and the corresponding differentially private synthetic dataset D, both of size m × n and with numeric attributes A_1, A_2, ..., A_n. Each record in O represents personal information; a concrete example of O is a medical dataset, where each record corresponds to the diagnosis of a patient. We do not assume a particular DPDR algorithm unless stated otherwise. The DPDR algorithms in Section 1.1.1 are all available for consideration. Our goal is to develop friendly notions of both RoD and utility, ensuring that these notions can easily quantify the RoD and utility of D, given access to O and the satisfaction of ε-DP. First, we discuss a simple metric for RoD in Section 3.1. In Section 3.2, we present four privacy notions. After that, we claim that the combined notion would be the best by justifying its self-explainability. Thereafter, in Section 3.3, we present our solution for quickly evaluating the utility of the synthetic dataset, given a specific privacy level. Finally, in Section 3.4, we present a unified framework for calculating the maximal privacy budget ε by jointly considering utility and RoD.

Straightforward Metric for RoD
Irrespective of the approach that was followed to generate the synthetic dataset, each record in D could be "fake"; i.e., the existence or non-existence of a record in the synthetic dataset does not indicate the existence status of this record in the original dataset. In theory, even an attacker with abundant background knowledge cannot infer the individual information in O. Specifically, the privacy foundation lies in the fact that D is no longer a transformed version of O, and one cannot link O and D. Nonetheless, in reality, layman users are still concerned about the probability of the attacker re-identifying an individual from the synthetic dataset. We term this phenomenon the scapegoat effect. In particular, the scapegoat effect states that, despite the fact that the information about an individual x in O will almost certainly not appear in D, the attacker will (falsely) identify a record x̂ in D as x, because x̂ could be sufficiently similar to x and the attacker has only partial knowledge of x. We stress the importance of the scapegoat effect because it is similar to the skewness attack on l-diversity [14]. In other words, an innocent individual might be accused of a crime if they are misclassified as either being or not being in the dataset.
Considering that most people may be concerned about the probability of an individual being re-identified by an attacker, given a synthetic dataset, a straightforward method for assessing the privacy level of the synthetic dataset would be to calculate the hitting rate, which is defined as the ratio of the number of overlapping records to the total number m of records in both datasets. Despite its conceptual simplicity, the use of the hitting rate incurs two problems.
• First, because of the curse of dimensionality, the density of the data points in a high-dimensional space is usually low. This could easily result in a very low hitting rate and an overestimated level of privacy guarantee.
• Second, such an assessment metric leads to a trivial algorithm for a zero hitting rate. For example, a synthetic dataset could be constructed by applying a tiny (and non-DP) amount of noise to each record in the original dataset. Owing to the noise, the records in the original and synthetic datasets do not overlap, leading to a zero hitting rate and an overestimated level of privacy.
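The second problem can be made concrete with a small Python sketch. It computes the hitting rate as the fraction of synthetic records that coincide exactly with some original record, which is one reading of the definition above, and then shows that a tiny, non-DP perturbation already drives the rate to zero. The function name and the toy records are hypothetical.

```python
def hitting_rate(original, synthetic):
    """Fraction of synthetic records that exactly match an original record."""
    original_set = set(original)
    hits = sum(1 for record in synthetic if record in original_set)
    return hits / len(synthetic)


if __name__ == "__main__":
    original = [(35.0, 52000.0), (41.0, 61000.0), (29.0, 48000.0)]

    # Releasing the original records verbatim gives the worst possible rate.
    print(hitting_rate(original, list(original)))

    # A negligible shift (no DP guarantee at all) removes every exact overlap,
    # so the metric reports "perfect" privacy although nothing is protected.
    shifted = [(age + 1e-6, income + 1e-6) for age, income in original]
    print(hitting_rate(original, shifted))
```

This is why the following sections replace exact overlap with distance-based notions.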

Our Proposed Methods for RoD
In this section, we develop friendly privacy notions (or, equivalently, friendly RoD estimates) from a distance point of view. More specifically, as we know from the literature, in the DPDR scenario the original dataset O is already decoupled from the synthetic dataset D. A consequence is that there is no meaningful connection between O and D, which also implies that it is difficult to define an appropriate RoD. However, previous work [28][29][30][31] sought different ways to create a linkage between O and D, as such a linkage is the most straightforward way to aid human understanding. Unfortunately, a privacy notion based on this linkage inevitably incurs a security flaw, especially because such a linkage does not exist.
In the following, taking the scapegoat effect into consideration, a distance-based framework, (y, p)-coverage, is first proposed in Section 3.2.1 to minimize the scapegoat effect and to reconcile the privacy level with the decoupling of O and D. The idea behind (y, p)-coverage is inspired by the observation that, even without knowledge of the linkage between O and D, the only strategy left for the attacker is still to seek the closest record in O as the candidate original record for a record in D. However, (y, p)-coverage is not suitable for measuring RoD because it has two parameters and does not have a total order (described in more detail at the end of Section 3.2.1). Subsequently, we propose RoD metrics to measure the risk of re-identification.

(y, p)-Coverage
Here, we propose the notion of (y, p)-coverage as a framework for evaluating RoD. The idea behind (y, p)-coverage is that the attacker would exhaustively search for a candidate matched record in O, given access to D. In particular, given D_i as the ith record of D (the ith row of D), owing to the lack of prior knowledge, the attacker's search range would be the neighborhood of a specific record D_i. In this sense, the corresponding privacy metric can be formulated as a minimum weight bipartite matching problem in graph theory. Also from graph theory, we know that the Hungarian algorithm solves the minimum weight bipartite matching problem, so the same algorithm can be used to evaluate the RoD, given the observation that O and D are of the same size. In particular, Algorithm 1 is the pseudocode of the above idea, whose aim is to test whether the distance-based (y, p)-coverage is fulfilled. Given O and D, we construct a complete weighted bipartite graph

$$G = (V, E), \qquad V = O \cup D, \qquad E = \{ (O_i, D_j) : 1 \le i, j \le m \}.$$

In the graph G, each edge has an edge weight that indicates the dissimilarity between the respective records; hence, the edge weights are defined as

$$w_{ij} = \lVert O_i - D_j \rVert.$$

The graph G has 2m vertices, each of which corresponds to a record from O or D. Thus, each vertex can also be seen as an n-dimensional row vector. Under such a construction, G is complete bipartite, as all the vertices from O are connected to all those from D, and no edge exists between two vertices that both belong to O (or both to D). We also note that the notation ‖·‖ denotes a norm. While many norms can be used, an implicit assumption in this paper is that we always use the Euclidean distance, $\lVert O_i - D_j \rVert = \sqrt{\chi_1^2 + \cdots + \chi_n^2}$, where O_i − D_j = (χ_1, ..., χ_n) for certain χ_1, ..., χ_n. However, other norms can also be used as alternatives.
Given a matching M, let its incidence vector be x, where x_ij = 1 if (i, j) ∈ M and 0 otherwise. The perfect matching of minimum weight is then the solution of

$$\min \sum_{i,j} c_{ij} x_{ij} \quad \text{subject to} \quad \sum_{j} x_{ij} = 1 \ \ \forall i, \qquad \sum_{i} x_{ij} = 1 \ \ \forall j, \qquad x_{ij} \in \{0, 1\},$$

where c_ij = w_ij. Once the Hungarian algorithm has been performed, we can derive the perfect matching and the corresponding edge weights ω, where ω is an m-dimensional vector and the ith entry of ω denotes the edge weight of the minimum weight bipartite matching for D_i. Given a user-defined weight y, we then calculate the number of weights less than or equal to y as a count

$$\zeta = \sum_{i=1}^{m} I_{\omega_i \le y},$$

where I_{ω_i ≤ y} denotes an indicator function with I_{ω_i ≤ y} = 1 in the case of ω_i ≤ y, and I_{ω_i ≤ y} = 0 otherwise. Subsequently, we calculate the empirical fraction p' = ζ/m. Given the probability p specified by the user, D is said to fulfill (y, p)-coverage if p' ≤ p. Despite its conceptual simplicity and theoretical foundation, (y, p)-coverage cannot be used for assessing RoD because it has two parameters, y and p, and does not have a total order. Note that the purpose of developing the RoD metric is to enable layman users to conveniently choose the privacy budget ε in DP. However, when the notion of (y, p)-coverage is used, it may be difficult to determine which of, for example, (3, 0.5)-coverage and (4, 0.4)-coverage provides better privacy. Hence, the following sections define additional privacy notions with a total order to enable an RoD comparison.
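The (y, p)-coverage test described above can be sketched in a few lines of Python. To keep the example self-contained, the minimum weight perfect matching is found by brute-force enumeration of all permutations, which stands in for the Hungarian algorithm and is only feasible for very small m; in practice an assignment solver would be substituted. This is not a reproduction of Algorithm 1, and the function names and toy data are hypothetical.

```python
import itertools
import math


def matching_weights(O, D):
    """Edge weights omega of a minimum weight perfect matching between O and D.

    Brute force over all permutations (a stand-in for the Hungarian
    algorithm); omega[j] is the weight of the edge matched to D[j].
    """
    m = len(O)
    w = [[math.dist(O[i], D[j]) for j in range(m)] for i in range(m)]
    best = min(
        itertools.permutations(range(m)),
        key=lambda perm: sum(w[i][perm[i]] for i in range(m)),
    )
    omega = [0.0] * m
    for i, j in enumerate(best):
        omega[j] = w[i][j]
    return omega


def fulfills_coverage(O, D, y, p):
    # zeta counts matched pairs whose distance is at most y; the dataset
    # fulfills (y, p)-coverage if the fraction zeta / m does not exceed p.
    omega = matching_weights(O, D)
    zeta = sum(1 for weight in omega if weight <= y)
    return zeta / len(omega) <= p


if __name__ == "__main__":
    O = [(0.0,), (10.0,), (20.0,)]
    D = [(1.0,), (14.0,), (20.5,)]
    print(matching_weights(O, D))
    print(fulfills_coverage(O, D, y=2.0, p=0.7))
    print(fulfills_coverage(O, D, y=2.0, p=0.5))
```

In this toy example two of the three matched pairs lie within distance 2, so the dataset passes for p = 0.7 but not for p = 0.5.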

y-Privacy
An underlying assumption of (y, p)-coverage is that the attacker searches for a bijective matching between O and D by using the Hungarian algorithm. However, this assumption may fail in reality. To fit the real-world setting, one can relax this assumption so that the attacker instead performs an exhaustive search to find a matching between O and D. In this process, two records in D may happen to be matched against the same record in O. In (y, p)-coverage, we keep only one of these matchings and discard the other, which does not make sense. Therefore, in this section, we propose y-privacy to overcome the above limitation. While (y, p)-coverage can be seen as an average-case analysis, y-privacy is characterized by its focus on the worst-case analysis.
Algorithm 2 is proposed to achieve y-privacy; more specifically, it is used for verifying whether a given dataset satisfies y-privacy. In what follows, we consider the case of n = 1 with integral values in O to ease the presentation. We will relax this assumption later. However, our implementation is slightly different from the above description. In Algorithm 2, we turn our focus to finding the mapping, instead of the matching, between O and D with the minimum incurred noise. First, we find the minimum value y_i for each record D_i in D such that [D_i ± y_i] contains one original record in O. This ensures that an original record is within the attacker's search range. Then, for each y_i in y', we calculate the quantity in Eq. (8). The above equation indicates that, because y can be seen as covering all of the possibilities, when the attacker sees a record, it needs to be verified whether this was a brute-force guess. Consequently, a lower y implies a lower level of privacy. Thus, Eq. (8) can also be interpreted as follows: given a record D_i ∈ D, the probability that the attacker correctly guesses an original record in O within the range [D_i − y, D_i + y] is always at most 1/(2y + 1). One can also select the median of y' as y to strike a balance between privacy and utility. In comparison, this choice of y goes back to the average-case analysis, and choosing the minimum of y' as y has a flavor similar to that of (y, p)-coverage. Under the framework of (y, p)-coverage, y-privacy considers a general attack strategy and provides a worst-case guarantee. Compared to (y, p)-coverage, y-privacy has a total order, and it is easy to see that y-privacy is better than y'-privacy when y ≥ y' in terms of the privacy guarantee.
Intuitively, the above argument that y-privacy is better than y′-privacy when y ≥ y′ also indicates that each record in a synthetic dataset satisfying y-privacy will be at least y away from the closest record in the original dataset, in contrast to a synthetic dataset that satisfies only y′-privacy. However, y-privacy is still not practical when a dense dataset (consisting of a huge number of records) is considered. Specifically, the common weakness of y-privacy and (y, p)-coverage is that, when the records in a dataset are very dense, the parameter y in both notions must be very small. This can be understood from the fact that the records are close to each other both before and after the noise injection. As a result, the parameter y becomes less meaningful as a privacy level.
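The verification step of y-privacy in the one-dimensional integral case can be sketched as follows. This is a minimal illustration rather than Algorithm 2 itself: the function names and the toy values are assumptions, and the check simply takes the smallest $y_i$ as the worst case.

```python
from bisect import bisect_left
from statistics import median


def min_cover_radii(original, synthetic):
    """For each synthetic value D_i, return the smallest y_i such that
    [D_i - y_i, D_i + y_i] contains at least one original value.
    Assumes `original` is non-empty and all values are integers."""
    srt = sorted(original)
    radii = []
    for d in synthetic:
        pos = bisect_left(srt, d)
        candidates = []
        if pos < len(srt):
            candidates.append(srt[pos] - d)      # nearest original value >= d
        if pos > 0:
            candidates.append(d - srt[pos - 1])  # nearest original value < d
        radii.append(min(candidates))
    return radii


def guess_probability(y):
    """Bound 1/(2y + 1): a uniform guess over the 2y + 1 integers in range."""
    return 1.0 / (2 * y + 1)


def satisfies_y_privacy(original, synthetic, y):
    """Worst-case check: every synthetic record is at least y away from O."""
    return min(min_cover_radii(original, synthetic)) >= y


# Hypothetical toy data (not taken from the paper).
O = [10, 20, 35, 50]
D = [13, 18, 41, 54]
radii = min_cover_radii(O, D)      # [3, 2, 6, 4]
worst_case_y = min(radii)          # 2, the worst-case choice
balanced_y = median(radii)         # 3.5, the median-based choice
```

Here the dataset satisfies 2-privacy but not 3-privacy, and the corresponding guess probability bound is `guess_probability(2)`, i.e., 1/5.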

Voronoi Privacy
The notion of y-privacy can also be generalized to consider its extreme condition. In other words, because y-privacy considers a y-radius ball centered at each data point and considers the number of data points in O covered by this y-radius ball, we can follow this perspective and consider the y-radius balls centered at all data points in O. The rationale behind the above consideration is to determine the arrangement of the dataset with the optimal y-privacy. Expanding the radius of all y-radius balls ultimately leads to a Voronoi diagram [39]. As explained in Section 2.2, this diagram is a partition of a multi-dimensional space into regions close to each of a given set of data points. Note that a Voronoi diagram can be efficiently constructed for two-dimensional data points, but for high-dimensional data points this would only be practical by using approximation algorithms [40,41]. The Voronoi diagram is characterized by the fact that, for each data point, there exists a corresponding region consisting of all points of the multi-dimensional space that are closer to that data point than to any other. In other words, all positions within a Voronoi cell are more inclined to be classified as a particular data point.
From the perspective of RoD, we then have an interpretation that, in terms of y-privacy, each record in D cannot be located at these positions within the Voronoi cell; otherwise, an attacker who finds such a record in D is more inclined to link it to a particular record in O. The above argument forms the theoretical foundation of Voronoi privacy. Algorithm 3 shows the calculation of Voronoi privacy, given access to O and D. In particular, the rationale behind Voronoi privacy is to derive the optimal privatized dataset $\hat{D}$ (in terms of privacy) first, and then calculate the distance between D and $\hat{D}$ as an RoD metric. A larger distance implies a higher level of dissimilarity between D and $\hat{D}$ and therefore a lower risk of data disclosure. In this sense, Algorithm 3 first constructs an empty dataset $\hat{D}$. The subsequent procedures gradually add records to $\hat{D}$, making it optimal in terms of privacy. Then, Algorithm 3 constructs a Voronoi diagram from O because the data owner would like to know the corresponding optimal privatized dataset. As mentioned previously, approximation algorithms [40,41] might be candidates for this task. Once the data owner has a Voronoi diagram from O, the collection of data points on the Voronoi diagram constitutes the optimal privatized dataset. Thus, an infinite number of optimal privatized datasets are available. Here, we aim to find the optimal privatized dataset with the optimal data utility. Considering that more perturbation on a record in O implies lower data utility, for each data point in O, the closest point on the Voronoi edges is identified. The data owner collects all these data points as $\hat{D}$. Thereafter, the data owner calculates the distance between D and $\hat{D}$ as an RoD metric. We particularly mention that different choices of the Distance function are possible in the implementation, depending on the domain of the dataset. In general, the $\ell_2$ distance (Euclidean distance) can be used, whereas the earth mover's distance (EMD) could also be used if the data owner is interested in quantitatively measuring the data utility in terms of machine-learning applications.
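The construction of the optimal privatized dataset and the distance computation can be sketched as below. This is an illustration under stated assumptions, not Algorithm 3 itself. It relies on the geometric fact that the point of a Voronoi cell boundary closest to its site is the midpoint between the site and its nearest neighbor, so no explicit Voronoi diagram is built. The function names, the toy points, and the aggregation used as the Distance function (mean $\ell_2$ distance from each synthetic record to its closest record in the optimal dataset) are all assumptions.

```python
import math


def nearest_voronoi_boundary_points(original):
    """For each site in O, return the closest point on the boundary of its
    Voronoi cell, i.e., the midpoint between the site and its nearest
    neighbour in O. Assumes at least two distinct points in `original`."""
    d_hat = []
    for i, p in enumerate(original):
        q = min((o for j, o in enumerate(original) if j != i),
                key=lambda o: math.dist(p, o))
        d_hat.append(tuple((a + b) / 2 for a, b in zip(p, q)))
    return d_hat


def voronoi_privacy(original, synthetic):
    """RoD metric sketch: mean l2 distance from each synthetic record to the
    closest record of D_hat (this aggregation is an assumption)."""
    d_hat = nearest_voronoi_boundary_points(original)
    total = sum(min(math.dist(s, h) for h in d_hat) for s in synthetic)
    return total / len(synthetic)


# Hypothetical two-dimensional toy data (not taken from the paper).
O = [(0, 0), (4, 0), (0, 6)]
D = [(2, 0), (0, 4)]
D_hat = nearest_voronoi_boundary_points(O)
distance = voronoi_privacy(O, D)
```

The brute-force nearest-neighbor search is quadratic in the number of records; for large datasets a spatial index would be the natural replacement.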

p-Privacy
Based on the observation that the dependency among attributes of the dataset can be a characteristic of the privacy, Algorithm 4 defines a novel privacy metric, called p-privacy. This is due to the fact that, in reality, the attacker will not perform a purely random guess; instead, the attacker would make educated guesses according to the distribution of O. The difficulty for the attacker is that (s)he does not possess O. However, because D and O have similar distributions according to the DPDR, the attacker can still make educated guesses by considering only the distribution of D. Inspired by this observation, we know that further reducing the futile combinations in the general case of n ≥ 2 would be necessary. Thus, by computing the correlation among attributes (similar to JTree), our first step is to construct the dependency graph G. This step is different from the exhaustive search in y-privacy and (y, p)-coverage. After deriving the dependency graph, we consider each linked part as a clique and obtain a clique set C. For each clique $C_i$ in C, we calculate $D_{C_i}$, the records of D with values only for the attributes in $C_i$. Let U be the set of all $D_{C_i}$. Then, we produce a candidate table F with $\prod_{i=1}^{|U|} |D_{C_i}|$ combinations by merging the $D_{C_i}$ in U. The candidate table F can be seen as the records that are more likely to be the records in O. Subsequently, after a comparison between F and O, one can find the count ξ of records of F that belong to O, and then obtain the attack probability p,

$$p = \frac{\xi}{|F|}. \qquad (9)$$

In the table below, we assume an exemplar original dataset O and synthetic dataset D. Based on this assumption, we show how p-privacy works to serve as a friendly privacy notion.
Original dataset O      Synthetic dataset D
A1 A2 A3 A4 A5          A1 A2 A3 A4 A5
 3  5  1  2  3           4  5  1  2  3
 8  1  2  3  4           8  7  2  3  4
 2  2  3  4  5           3

Candidate table F
A1 A2 A3 A4 A5
 3  1  1  2  3
 3  1  2  3  4
 3  5  1  2  3
 3  5  2  3  4
 3  7  1  2  3
 3  7  2  3  4
 4  1  1  2  3
 4  1  2  3  4
 4  5  1  2  3
 4  5  2  3  4
 4  7  1  2  3
 4  7  2  3  4
 8  1  1  2  3
 8  1  2  3  4
 8  5  1  2  3
 8  5  2  3  4
 8  7  1  2  3
 8  7  2  3  4

In Table F, both (3, 5
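The candidate-table construction behind p-privacy can be sketched as follows. This is an illustration, not Algorithm 4 itself: the clique set is assumed to be given (the dependency-graph step is not shown) and to partition the attribute positions, the attack probability is assumed to be the fraction of candidate records that appear in O, and the function names and the three-attribute toy data are hypothetical.

```python
from itertools import product


def project(records, attrs):
    """Distinct sub-records restricted to the attribute positions in attrs."""
    return sorted({tuple(r[a] for a in attrs) for r in records})


def candidate_table(synthetic, cliques):
    """Merge the per-clique projections of D into the candidate table F."""
    n_attrs = len(synthetic[0])
    parts = [project(synthetic, c) for c in cliques]
    table = []
    for combo in product(*parts):
        row = [None] * n_attrs
        for attrs, values in zip(cliques, combo):
            for a, v in zip(attrs, values):
                row[a] = v
        table.append(tuple(row))
    return table


def attack_probability(original, synthetic, cliques):
    """Return (xi, |F|, p), where xi counts the records of F found in O."""
    f = candidate_table(synthetic, cliques)
    known = set(original)
    xi = sum(1 for row in f if row in known)
    return xi, len(f), xi / len(f)


# Hypothetical toy data: attribute 0 is independent, attributes 1 and 2 are
# treated as one clique.
O = [(1, 5, 6), (2, 7, 8), (3, 5, 6)]
D = [(1, 7, 8), (2, 5, 6)]
cliques = [(0,), (1, 2)]
xi, size, p = attack_probability(O, D, cliques)
```

In this toy case F has 2 × 2 = 4 candidate records, two of which belong to O, so the sketch yields p = 0.5.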

Data-Driven Approach for Determining ε
Although how to determine a proper privacy level (notion) is presented in Sections 3.2.1 to 3.2.4, we are still eager to develop a method for choosing a proper ε in DP for a given privacy level. In other words, in the previous sections, we only have a friendly explanation of the privacy but still need a concrete method to determine ε. In the following, based on the curve fitting technique, we propose algorithms to obtain satisfactory values for ε.
Baseline Approach for Determining ε. Inspired by JTree [42], Algorithm 5 adopts a strategy similar to that of JTree to determine the ε that satisfies the user's risk and utility requirements. If O is uniformly distributed, only a small amount of noise is needed to reach the desired privacy level, because each record has low sensitivity. However, if the data distribution of the original dataset is not uniform, additional noise is needed to protect the sensitive records, as stated in Section 2.1.
More specifically, the data owner can determine the sensitive records and the variance of the Laplacian noise, given the marginal tables in JTree. Thus, we construct the corresponding dependency graph and calculate the marginal tables. Then, once we have a user-defined probability p that represents the preferable utility and privacy levels, the value of ε can be derived from Eq. (10). The above procedures are similar to those in JTree, except that some operations are omitted. Moreover, as count queries, whose output changes by at most one when a single record changes, are the fundamental operation, the global sensitivity Δ(f) is 1. So, the 95% confidence interval (µ + 2σ) of Lap(1/ε) is used in our consideration as the maximum value that represents p. This choice enables us to determine a satisfactory ε via Algorithm 5. Unfortunately, Algorithm 5 has certain problems. For example, because different kinds of DP noise injection mechanisms could be used, one cannot expect the ε retrieved from Algorithm 5 to be suitable for all of them. On the other hand, as the utility is closely related to ε, the choice of ε is also critical in improving the data utility. This makes it necessary to develop a more accurate method for estimating ε.
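For reference, the quantity µ + 2σ of the Laplacian noise can be written out explicitly from the standard moments of the Laplace distribution (mean 0 and variance $2b^2$ for scale b). The short derivation below only evaluates this quantity for scale 1/ε; how Algorithm 5 maps it to p is not reproduced here.

```latex
X \sim \mathrm{Lap}(b), \quad b = \frac{\Delta(f)}{\epsilon} = \frac{1}{\epsilon}
\;\Longrightarrow\;
\mu = 0, \quad
\sigma = \sqrt{2}\, b = \frac{\sqrt{2}}{\epsilon}, \quad
\mu + 2\sigma = \frac{2\sqrt{2}}{\epsilon} \approx \frac{2.83}{\epsilon}.
```

For instance, ε = 1 gives a bound of about 2.83 on the noise magnitude, whereas ε = 0.1 gives about 28.3, which illustrates how quickly the noise grows as the budget shrinks.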
Data-Driven Approach for Determining ε. The simplest idea to determine ε is an iterative algorithm; i.e., we generate a synthetic dataset with a chosen ε and check whether the utility is what we want. This is a theoretically feasible solution but is very inefficient, especially because the number of iterations is uncertain. Therefore, curve fitting, a data-driven approach, is proposed to learn the correlation between the privacy level and ε. The curve bridges privacy and utility, so once we derive the curve, ε can be calculated instantly, given the desired level of utility. The remaining question is how to derive the curve. The corresponding privacy levels can be obtained after generating a large number of differentially private synthetic datasets with different ε values. Thereafter, the curve can be learned on the basis of the obtained privacy levels. However, although curve fitting yields the best-fitted coefficients, we still need to determine the type of curve. Initially, we choose exponential and polynomial curves. After that, we also choose reciprocal curves as an alternative.
One can see from Fig. 1 that the reciprocal curve of degree 2 results in the best fit. The predictions in Tab. 1 are quite close to the real risk distances.
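Fitting a reciprocal curve to measured privacy levels can be sketched with ordinary least squares, since the curve is linear in its coefficients once 1/ε is used as the variable. The code below is a minimal pure-Python sketch under stated assumptions: the function names are hypothetical, the measurements are generated from made-up coefficients rather than taken from Fig. 1 or Tab. 1, and the normal equations are solved directly, which is adequate only for small, well-conditioned problems.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_reciprocal(eps_values, levels, degree=2):
    """Least-squares coefficients [c_degree, ..., c_1, c_0] of the model
    level ~ c_degree / eps**degree + ... + c_1 / eps + c_0."""
    rows = [[e ** -k for k in range(degree, -1, -1)] for e in eps_values]
    cols = degree + 1
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(cols)]
           for i in range(cols)]
    atb = [sum(r[i] * v for r, v in zip(rows, levels)) for i in range(cols)]
    return solve_linear(ata, atb)


def predict(coeffs, eps):
    """Evaluate the fitted reciprocal curve at a given eps."""
    degree = len(coeffs) - 1
    return sum(c * eps ** -(degree - i) for i, c in enumerate(coeffs))


# Hypothetical measurements generated from a = 0.02, b = 0.5, c = 1.0.
eps = [0.1, 0.2, 0.5, 1, 2, 5]
levels = [0.02 / e ** 2 + 0.5 / e + 1.0 for e in eps]
coeffs = fit_reciprocal(eps, levels, degree=2)
```

Passing `degree=1` fits the simpler curve a/ε + b with the same routine, which makes it easy to compare the two degrees on the same measurements.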

Baseline Method for Evaluating Utility
As mentioned in the previous sections, even though the synthetic dataset has already achieved the required privacy level, the data utility is usually sacrificed. So, the objective of the data owner is to maximize the data utility subject to the required privacy. Deriving an explicit formulation for privacy and utility is complex; instead, we resort to a data-driven approach. A simple idea for deriving the utility is to iteratively generate different synthetic datasets and then determine the corresponding utility. This operation is very time-consuming. In the worst case, one needs an infinite amount of time to try an infinite number of combinations of parameters. As a result, an algorithm capable of efficiently learning the utility of a given synthetic dataset is desired.
Statistics such as the mean, standard deviation, and variance are among the most popular statistics used by data analysts and can therefore serve as metrics for evaluating data utility. In what follows, for fair comparison, we also use these metrics to estimate the error of the results obtained from the synthetic dataset D. The error of variance introduced by the synthetic dataset D, denoted $\epsilon_{\mathrm{var}}$, is formulated in Eq. (11). To evaluate the variance error of the entire dataset, we sum up the errors for each record, as in Eq. (12). Obviously, a synthetic dataset with a smaller estimation error also leads to better data utility. The analysis of the other statistical measures is consistent with the one derived from the above formulas. We used these statistical measures for two reasons. First, they are popular and simple. Second, we can also learn an approximate distribution from these measurements. Moreover, huge errors in these simple statistics would definitely lead to catastrophic utility loss for other, more complex statistical measures.
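A simple way to compute such statistic errors is sketched below. Because the exact forms of Eqs. (11) and (12) are not reproduced here, the sketch makes its own assumptions: the error of a statistic is taken as the absolute difference between its value on O and on D for each attribute, the per-attribute errors are summed, and population (not sample) variance and standard deviation are used. The function names and toy records are hypothetical.

```python
from statistics import mean, pstdev, pvariance


def column(records, j):
    """Values of attribute j across all records."""
    return [r[j] for r in records]


def statistic_errors(original, synthetic):
    """Per-statistic error, summed over attributes:
    sum_j |stat(O_j) - stat(D_j)| (an assumed error definition)."""
    stats = {"mean": mean, "std": pstdev, "var": pvariance}
    n_attrs = len(original[0])
    return {
        name: sum(abs(fn(column(original, j)) - fn(column(synthetic, j)))
                  for j in range(n_attrs))
        for name, fn in stats.items()
    }


# Hypothetical toy data with two attributes.
O = [(1, 10), (2, 20), (3, 30)]
D = [(1, 12), (2, 20), (4, 31)]
errors = statistic_errors(O, D)
```

For the toy data above, the mean error is 1/3 + 1 = 4/3 and the variance error is 8/9 + 6 = 62/9; a smaller value indicates that D preserves the statistic better.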

Data-Driven Method for Evaluating Utility
The most straightforward way to calculate the estimation error from the synthetic dataset, and the one we tried first, is to evaluate Eqs. (11) and (12) over the differentially private synthetic datasets with ε ∈ {0.01, 0.1, 1, 10}. For example, Fig. 2, where the input dataset is a 5 × 1e6 health dataset with five attributes (HEIGHT, WEIGHT, BMI, DBP, and SBP), shows the errors incurred by different settings of ε.
Iterating the above process of choosing an ε, generating a synthetic dataset, and then calculating the utility would be highly inefficient. This is due to the fact that the data owner might want to further improve the utility of the current version of the synthetic dataset. Thus, the data owner will iterate the above process again and again until a satisfactory version appears. As a whole, we decided to generate synthetic datasets only for pre-determined values of ε, and then estimate their errors. After that, we use this information to fit a curve bridging the privacy and utility. In particular, we propose using curve fitting, a data-driven approach, as a surrogate method to learn the correlation between ε and utility measures such as the error of the mean, standard deviation, and variance. Once we have such a curve from curve fitting, we can calculate ε very quickly, given the desired level of utility, or vice versa.
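Once a reciprocal curve has been fitted, recovering ε from a desired error level amounts to solving a quadratic in 1/ε. The sketch below assumes a curve of the form a/ε² + b/ε + c with already-fitted coefficients (setting a = 0 gives the degree-1 case); the function name and the coefficient values are hypothetical.

```python
import math


def epsilon_for_target(a, b, c, target):
    """Return all eps > 0 with a/eps**2 + b/eps + c == target, sorted.
    Substituting u = 1/eps gives a*u**2 + b*u + (c - target) = 0."""
    if a == 0:
        if b == 0:
            return []
        roots = [(target - c) / b]
    else:
        disc = b * b - 4 * a * (c - target)
        if disc < 0:
            return []          # the curve never reaches the target
        sq = math.sqrt(disc)
        roots = [(-b + sq) / (2 * a), (-b - sq) / (2 * a)]
    # Only positive u correspond to a valid privacy budget eps = 1/u.
    return sorted({1.0 / u for u in roots if u > 0})


# Hypothetical coefficients; the target 2.08 is the curve value at eps = 0.5.
budgets = epsilon_for_target(0.02, 0.5, 1.0, 2.08)
```

The reverse direction, predicting the error for a given ε, is a direct evaluation of the curve, which is what makes the fitted curve usable in both directions.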
Specifically, in the case of $\epsilon_{\mathrm{var}}$, the values of $\epsilon_{\mathrm{var}}$ can be obtained after generating a large number of synthetic datasets with different values of ε. Thereafter, the curve can be learned using the obtained values of $\epsilon_{\mathrm{var}}$.
However, although curve fitting can be used to learn the coefficients that best fit the chosen curve, one factor that we have the freedom to choose is the type of curve. Initially, the two intuitive choices are exponential and polynomial curves. A more counterintuitive one would be reciprocal curves. We, however, found that the reciprocal curve of the following form:

$$\hat{\epsilon}_{\mathrm{var}} = \frac{a}{\epsilon} + b, \qquad (13)$$

where $\hat{\epsilon}_{\mathrm{var}}$ denotes the estimator of $\epsilon_{\mathrm{var}}$, leads to the best fit in almost all cases. Here, for completeness, we also present in Fig. 3 the exponential and polynomial curves that correspond to the errors of the other statistical measures, in addition to the reciprocal curves. The reciprocal curve fits almost perfectly, as shown in the figure. In our experiments, after averaging all the coefficients from the formulas, we conclude that the estimate of $\epsilon_{\mathrm{var}}$ will be

$$\hat{\epsilon}_{\mathrm{var}} = \frac{1}{5\epsilon}. \qquad (14)$$
In fact, Eq. (14) has room to be improved so as to offer better prediction results. In our consideration, we aim to obtain the errors in the cases of ε ∈ {0.01, 0.05, 0.1, 0.5, 1, 5, 10}. Nevertheless, only three errors are calculated, for the cases of ε ∈ {0.01, 0.5, 10}. Afterwards, we learn the curve based on these three errors. The results are shown in Fig. 4, where the real statistics and the predicted statistics in Tab. 2 match almost perfectly. Despite the seemingly satisfactory results in Fig. 4, once we scrutinize Tab. 2, we find negative predicted values for ε ∈ {5, 10}, which is not a pleasing result. With the curve type fixed to the reciprocal curve, we found that the main reason for the degraded predictability of the fitted curve is the insufficient degree of the reciprocal (or other) curve. Consequently, when we slightly increase the degree of the reciprocal curve, we obtain

$$\hat{\epsilon}_{\mathrm{var}} = \frac{a}{\epsilon^2} + \frac{b}{\epsilon} + c, \quad a, b, c \in \mathbb{R}. \qquad (15)$$

Comparing the results in Tab. 3 and Fig. 5, one can see immediately that the errors predicted by the curve newly learned from the data match the real error values.
Figure 5: Difference between fitted reciprocal curves of degrees 1 and 2

Jointly Evaluating RoD and Utility
The data utility results when varying d in Voronoi privacy, y in y-privacy, and p in p-privacy are provided in Figs. 6-8, respectively. Obviously, as the RoD increases, the data utility is not maintained. This is because additional perturbation is added to the original dataset and therefore the synthetic dataset is generated from a data distribution with more noise. The aforementioned data-driven approaches for evaluating the utility and the RoD can be easily extended to a data-driven approach for determining the privacy budget ε with a joint consideration of the utility and RoD. In essence, from Sections 3.2 and 3.3 we learn different (reciprocal) curves: one for the privacy level and another for the utility. Although the curves for the privacy and utility seem unrelated at first glance, they are in fact correlated with each other if we consider the DP definition and the way DP is achieved. In this sense, multidimensional curve fitting is an alternative for a more complex setting; i.e., it is a candidate to be applied to the curves learned in Sections 3.2 and 3.3 so as to learn a higher-dimensional curve for the privacy level and utility. Since the resulting higher-dimensional curve has ε as a parameter, after a simple calculation with the other parameters fixed, the privacy budget ε can be determined with a joint consideration of the utility and RoD.
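As a simpler stand-in for the multidimensional curve fitting mentioned above, a joint choice of ε can be illustrated by scanning candidate budgets against the two separately fitted curves. This is explicitly not the higher-dimensional fit; it is a sketch that assumes both curves have the reciprocal form a/ε² + b/ε + c, that privacy is expressed as a minimum risk distance, and that utility is expressed as a maximum tolerated error. All names, coefficients, and thresholds are hypothetical.

```python
def reciprocal(coeffs, eps):
    """Evaluate a/eps**2 + b/eps + c for coeffs = (a, b, c)."""
    a, b, c = coeffs
    return a / eps ** 2 + b / eps + c


def feasible_budgets(risk_coeffs, error_coeffs, min_distance, max_error,
                     candidates):
    """Candidate eps values whose predicted risk distance is large enough
    and whose predicted utility error is small enough."""
    return [e for e in candidates
            if reciprocal(risk_coeffs, e) >= min_distance
            and reciprocal(error_coeffs, e) <= max_error]


# Hypothetical fitted curves: distance = 2/eps + 1, error = 0.2/eps.
risk_curve = (0.0, 2.0, 1.0)
error_curve = (0.0, 0.2, 0.0)
candidates = [0.01, 0.05, 0.1, 0.5, 1, 5, 10]
choices = feasible_budgets(risk_curve, error_curve,
                           min_distance=3.0, max_error=1.0,
                           candidates=candidates)
```

With these made-up curves, the privacy requirement rules out budgets above 1 and the utility requirement rules out budgets below 0.2, leaving 0.5 and 1 as the feasible candidates.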

Conclusion
In this paper, we proposed a number of friendly privacy notions to measure the RoD. We also developed a curve fitting-based approach to determine the privacy budget ε in a data-driven manner with the joint consideration of the RoD and utility. This approach enables novice users to grasp the idea behind the level of privacy protection and the data utility. As a result, these users would be able to determine an appropriate privacy budget for DPDR, depending on the amount of privacy risk they are prepared to tolerate and the desired utility.