Information Extraction Under Privacy Constraints

A privacy-constrained information extraction problem is considered where for a pair of correlated discrete random variables $(X,Y)$ governed by a given joint distribution, an agent observes $Y$ and wants to convey to a potentially public user as much information about $Y$ as possible without compromising the amount of information revealed about $X$. To this end, the so-called {\em rate-privacy function} is introduced to quantify the maximal amount of information (measured in terms of mutual information) that can be extracted from $Y$ under a privacy constraint between $X$ and the extracted information, where privacy is measured using either mutual information or maximal correlation. Properties of the rate-privacy function are analyzed and information-theoretic and estimation-theoretic interpretations of it are presented for both the mutual information and maximal correlation privacy measures. It is also shown that the rate-privacy function admits a closed-form expression for a large family of joint distributions of $(X,Y)$. Finally, the rate-privacy function under the mutual information privacy measure is considered for the case where $(X,Y)$ has a joint probability density function by studying the problem where the extracted information is a uniform quantization of $Y$ corrupted by additive Gaussian noise. The asymptotic behavior of the rate-privacy function is studied as the quantization resolution grows without bound and it is observed that not all of the properties of the rate-privacy function carry over from the discrete to the continuous case.


I. INTRODUCTION
With the emergence of user-customized services, there is an increasing desire to balance the need to share data against the need to protect sensitive and private information. (Parts of the results in this paper were presented at the 52nd Allerton Conference on Communications, Control and Computing [5] and the 14th Canadian Workshop on Information Theory [7].) For example, individuals who join a social network are asked to provide information about themselves which might compromise their privacy. However, they agree to do so, to some extent, in order to benefit from customized services such as recommendations and personalized searches. As another example, a participatory technology for estimating road traffic requires each individual to provide her start and destination points as well as the travel time. However, most participating individuals prefer to provide somewhat distorted or false information to protect their privacy. Furthermore, suppose a software company wants to gather statistical information on how people use its software. Since many users might have used the software to handle personal or sensitive information (for example, a browser for anonymous web surfing or financial management software), they may not want to share their data with the company. On the other hand, the company cannot legally collect the raw data either, so it needs to entice its users. In all these situations, a tradeoff between the utility gained and the privacy lost is required, and the question is how to achieve this tradeoff. For example, how can a company collect high-quality aggregate information about users while strongly guaranteeing to its users that it is not storing user-specific information?
To deal with such privacy considerations, Warner [49] proposed the randomized response model in which each individual user randomizes her own data using a local randomizer (i.e., a noisy channel) before sharing the data with an untrusted data collector to be aggregated. As opposed to conditional security, see e.g. [9], [18], [42], the randomized response model assumes that the adversary can have unlimited computational power and thus it provides unconditional privacy. This model, in which the control of private data remains in the users' hands, has been extensively studied since Warner. As a special case of the randomized response model, Duchi et al. [19], inspired by the well-known privacy guarantee called differential privacy introduced by Dwork et al. [20]-[22], introduced local differential privacy (LDP). Given a random variable X ∈ X, another random variable Z ∈ Z is said to be the ε-LDP version of X if there exists a channel Q : X → Z such that Q(B|x)/Q(B|x') ≤ exp(ε) for all measurable B ⊂ Z and all x, x' ∈ X. The channel Q is then called the ε-LDP mechanism. Using Jensen's inequality, it is straightforward to see that any ε-LDP mechanism leaks at most ε nats of private information, i.e., the mutual information between X and Z satisfies I(X;Z) ≤ ε.
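As a quick sanity check on this bound, one can implement Warner-style binary randomized response and verify both the likelihood-ratio condition and the leakage bound; the input distribution and ε below are illustrative assumptions, and mutual information is computed in nats so it is directly comparable with ε:

```python
import numpy as np

def randomized_response(eps):
    """Binary eps-LDP channel: report the true bit w.p. e^eps/(1+e^eps)."""
    p = np.exp(eps) / (1.0 + np.exp(eps))
    return np.array([[p, 1 - p],
                     [1 - p, p]])        # rows indexed by x, columns by z

def mutual_info_nats(px, Q):
    """I(X;Z) in nats for input distribution px and channel Q[x, z]."""
    pz = px @ Q
    return float(sum(px[i] * Q[i, j] * np.log(Q[i, j] / pz[j])
                     for i in range(len(px)) for j in range(len(pz))
                     if px[i] > 0 and Q[i, j] > 0))

eps = 1.0
Q = randomized_response(eps)
# the likelihood ratio Q(z|x)/Q(z|x') never exceeds e^eps ...
assert Q.max() / Q.min() <= np.exp(eps) + 1e-12
# ... and, as Jensen's inequality predicts, the leakage is at most eps nats
assert mutual_info_nats(np.array([0.5, 0.5]), Q) <= eps
```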
There have been numerous studies on the tradeoff between privacy and utility for different examples of randomized response models with different choices of utility and privacy measures. For instance, Duchi et al. [19] studied the optimal ε-LDP mechanism M : X → Z which minimizes the risk of estimating a parameter θ related to P_X. Kairouz et al. [27] studied an optimal ε-LDP mechanism in the sense of mutual information, where an individual would like to release an ε-LDP version Z of X that preserves as much information about X as possible. Calmon et al. [12] proposed a novel privacy measure (which includes maximal correlation and chi-square correlation) between X and Z and studied the optimal privacy mechanism (according to their privacy measure) which minimizes the probability of correct estimation, Pr(X̂(Z) = X), for any estimator X̂ : Z → X.
In all above examples of randomized response models, given a private source, denoted by X, the mechanism generates Z which can be publicly displayed without breaching the desired privacy level.
However, in a more realistic model of privacy, we can assume that for any given private data X, nature generates Y via a fixed channel P_{Y|X}. Now we aim to release a public display Z of Y such that the amount of information in Y is preserved as much as possible while Z satisfies a privacy constraint with respect to X. Consider two communicating agents Alice and Bob. Alice collects all her measurements from an observation into a random variable Y and ultimately wants to reveal this information to Bob in order to receive a payoff. However, she is worried about her private data, represented by X, which is correlated with Y. For instance, X might represent her precise location and Y a measurement of the traffic load of a route she has taken. She wants to reveal these measurements to an online road monitoring system to receive some utility. However, she does not want to reveal too much information about her exact location. In such situations, the utility is measured with respect to Y and privacy is measured with respect to X. The question raised in this situation then concerns the maximum payoff Alice can get from Bob (by revealing Z to him) without compromising her privacy. Hence, it is of interest to characterize such competing objectives in the form of a quantitative tradeoff. Such a characterization provides a controllable balance between utility and privacy.
This model of privacy first appeared in Yamamoto's work [51], in which the rate-distortion-equivocation function is defined as the tradeoff between a distortion-based utility and privacy. Recently, Sankar et al. [44], using the quantize-and-bin scheme [47], generalized Yamamoto's model to study privacy in databases from an information-theoretic point of view. Calmon and Fawaz [10] and Monedero et al. [38] also independently used distortion and mutual information for utility and privacy, respectively, to define a privacy-distortion function which resembles the classical rate-distortion function. More recently, Makhdoumi et al. [34] proposed to use mutual information for both the utility and privacy measures and defined the privacy funnel as the corresponding privacy-utility tradeoff,

t_R(X;Y) := inf_{P_{Z|Y} : X −− Y −− Z, I(Y;Z) ≥ R} I(X;Z),   (1)

where X −− Y −− Z denotes that X, Y and Z form a Markov chain in this order. Leveraging well-known algorithms for the information bottleneck problem [48], they provided a locally optimal greedy algorithm to evaluate t_R(X;Y). Asoodeh et al. [5] independently defined the rate-privacy function, g_ε(X;Y), as the maximum achievable I(Y;Z) such that Z satisfies I(X;Z) ≤ ε, which is a dual representation of the privacy funnel (1), and showed that for discrete X and Y, g_0(X;Y) > 0 if and only if X is weakly independent of Y (cf. Definition 2). Recently, Calmon et al. [11] proved an equivalent result for t_R(X;Y) using a different approach. They also obtained lower and upper bounds for t_R(X;Y) which can be easily translated to bounds for g_ε(X;Y) (cf. Lemma 1). In this paper, we develop further properties of g_ε(X;Y) and also determine necessary and sufficient conditions on P_XY, satisfying some symmetry conditions, for g_ε(X;Y) to achieve its upper and lower bounds.
The problem treated in this paper can also be contrasted with the better-studied concept of secrecy following the pioneering work of Wyner [50]. While in secrecy problems the aim is to keep information secret only from wiretappers, in privacy problems the aim is to keep the private information (not necessarily all the information) secret from everyone including the intended receiver.

A. Our Model and Main Contributions
Using mutual information as the measure of both utility and privacy, we formulate the corresponding privacy-utility tradeoff for discrete random variables X and Y via the rate-privacy function, g_ε(X;Y), in which the mutual information between Y and the displayed data (i.e., the mechanism's output) Z is maximized over all channels P_{Z|Y} such that the mutual information between Z and X is no larger than a given ε. We also formulate a similar rate-privacy function ĝ_ε(X;Y) where the privacy is measured in terms of the squared maximal correlation, ρ²_m, between X and Z. In studying g_ε(X;Y) and ĝ_ε(X;Y), any channel Q : Y → Z that satisfies I(X;Z) ≤ ε or ρ²_m(X;Z) ≤ ε, respectively, preserves the desired level of privacy and is hence called a privacy filter. Interpreting I(Y;Z) as the number of bits that a privacy filter can reveal about Y without compromising privacy, we present the rate-privacy function as a formulation of the problem of maximal privacy-constrained information extraction from Y.
We remark that using maximal correlation as a privacy measure is by no means new as it appears in other works, see e.g., [33], [30] and [12] for different utility functions. We do not put any likelihood constraints on the privacy filters as opposed to the definition of LDP. In fact, the optimal privacy filters that we obtain in this work induce channels P Z|X that do not satisfy the LDP property.
The quantity g ε (X; Y ) is related to a notion of the reverse strong data processing inequality as follows. Given a joint distribution P XY , the strong data processing coefficient was introduced in [1] and [4], as the smallest s(X; Y ) ≤ 1 such that I(X; Z) ≤ s(X; Y )I(Y ; Z) for all P Z|Y satisfying the Markov condition X − − Y − − Z. In the rate-privacy function, we instead seek an upper bound on the maximum achievable rate at which Y can display information, I(Y ; Z), while meeting the privacy constraint I(X; Z) ≤ ε. The connection between the rate-privacy function and the strong data processing inequality is further studied in [11] to mirror all the results of [4] in the context of privacy.
The contributions of this work are as follows:
• We study lower and upper bounds of g_ε(X;Y). The lower bound, in particular, establishes a multiplicative bound on I(Y;Z) for any optimal privacy filter. Specifically, we show that for a given (X,Y) and ε > 0 there exists a channel Q : Y → Z such that I(X;Z) ≤ ε and I(Y;Z) ≥ λ(X;Y) ε, where λ(X;Y) ≥ 1 is a constant depending on the joint distribution P_XY. We then give conditions on P_XY such that the upper and lower bounds are tight. For example, we show that the lower bound is achieved when Y is binary and the channel from Y to X is symmetric. We show that this corresponds to the fact that both Y = 0 and Y = 1 induce distributions P_{X|Y}(·|0) and P_{X|Y}(·|1) which are equidistant from P_X in the sense of Kullback-Leibler divergence. We then show that the upper bound is achieved when Y is an erased version of X, or equivalently, P_{Y|X} is an erasure channel.
• We propose an information-theoretic setting in which g_ε(X;Y) appears as a natural upper bound for the achievable rate in the so-called "dependence dilution" coding problem. Specifically, we examine the joint-encoder version of an amplification-masking tradeoff, a setting recently introduced by Courtade [14], and we show that the dual of g_ε(X;Y) upper bounds the masking rate. We also present an estimation-theoretic motivation for the privacy measure ρ²_m(X;Z) ≤ ε. In fact, by imposing ρ²_m(X;Z) ≤ ε, we require that an adversary who observes Z cannot efficiently estimate f(X), for any function f. This is reminiscent of semantic security [25] in the cryptography community. An encryption mechanism is said to be semantically secure if the adversary's advantage for correctly guessing any function of the private data given an observation of the mechanism's output (i.e., the ciphertext) is required to be negligible. This, in fact, justifies the use of maximal correlation as a measure of privacy. The use of mutual information as a privacy measure can also be justified using Fano's inequality: I(X;Z) ≤ ε can be shown to imply that the error probability Pr(X̂(Z) ≠ X) is bounded away from zero for any estimator X̂, and hence the probability of the adversary correctly guessing X is upper-bounded.
• We also study the rate of increase g'_0(X;Y) of g_ε(X;Y) at ε = 0 and show that this rate can characterize the behavior of g_ε(X;Y) for any ε ≥ 0 provided that g_0(X;Y) = 0. This again has connections with the results of [4]. Letting Γ(R) := sup{I(X;Z) : X −− Y −− Z, I(Y;Z) ≤ R}, it is shown in [4] that s(X;Y) = sup_{R>0} Γ(R)/R, and hence the rate of increase of Γ(R) at R = 0 characterizes the strong data processing coefficient. Note that here we have Γ(0) = 0.
• Finally, we generalize the rate-privacy function to the continuous case where X and Y are both continuous and show that some of the properties of g_ε(X;Y) in the discrete case do not carry over to the continuous case. In particular, we assume that the privacy filter belongs to a family of additive noise channels followed by an M-level uniform scalar quantizer and give asymptotic bounds as M → ∞ for the rate-privacy function.

B. Organization
The rest of the paper is organized as follows. In Section 2, we define and study the rate-privacy function for discrete random variables under two different privacy measures. In Section 3, we provide interpretations of the rate-privacy function in terms of quantities from information and estimation theory. Having obtained lower and upper bounds of the rate-privacy function, in Section 4 we determine the conditions on P_XY under which these bounds are tight. The rate-privacy function is then generalized and studied in Section 5 for continuous random variables.
II. UTILITY-PRIVACY MEASURES: DEFINITIONS AND PROPERTIES

Consider two random variables X and Y, defined over finite alphabets X and Y, respectively, with a fixed joint distribution P_XY. Let X represent the private data and let Y be the observable data, correlated with X and generated by the channel P_{Y|X} predefined by nature, which we call the observation channel.
Suppose there exists a channel P Z|Y such that Z, the displayed data made available to public users, has limited dependence with X. Such a channel is called the privacy filter. This setup is shown in Fig. 1.
The objective is then to find a privacy filter which gives rise to the highest dependence between Y and Z. To make this goal precise, one needs to specify a measure for both utility (dependence between Y and Z) and also privacy (dependence between X and Z).

Fig. 1. Private data X is passed through the fixed observation channel P_{Y|X} to produce Y, which is passed through the privacy filter P_{Z|Y} to produce the displayed data Z.

A. Mutual Information as Privacy Measure
Adopting mutual information as a measure of both privacy and utility, we are interested in characterizing the following quantity, which we call the rate-privacy function:

g_ε(X;Y) := sup_{P_{Z|Y} ∈ D_ε(P)} I(Y;Z),   (3)

where (X,Y) has fixed distribution P_XY = P and

D_ε(P) := {P_{Z|Y} : X −− Y −− Z, I(X;Z) ≤ ε}

(here X −− Y −− Z means that X, Y and Z form a Markov chain in this order). Equivalently, we call g_ε(X;Y) the privacy-constrained information extraction function, as Z can be thought of as the information extracted from Y under the privacy constraint I(X;Z) ≤ ε.
Note that since I(Y;Z) is a convex function of P_{Z|Y} and the constraint set D_ε(P) is convex, [41, Theorem 32.2] implies that we can restrict D_ε(P) in (3) to {P_{Z|Y} : X −− Y −− Z, I(X;Z) = ε} whenever ε ≤ I(X;Y). Note also that since X and Y are finite, the map P_{Z|Y} ↦ I(Y;Z) is continuous and D_ε(P) is compact, so the supremum in (3) is indeed a maximum. In this case, using the Support Lemma [17], one can readily show that it suffices to consider Z supported on an alphabet Z with cardinality |Z| ≤ |Y| + 1. Note further that by the Markov condition X −− Y −− Z, we have I(X;Z) ≤ I(X;Y), and hence for ε ≥ I(X;Y) the privacy constraint is inactive; setting Z = Y then yields g_ε(X;Y) = H(Y). We can therefore restrict attention to 0 ≤ ε < I(X;Y). As mentioned earlier, a dual representation of g_ε(X;Y), the so-called privacy funnel, is introduced in [34] and [11] and defined in (1) as the least information leakage about X such that the communication rate is greater than a positive constant. Given ε_1 < ε_2 and a joint distribution P = P_X × P_{Y|X}, we have D_{ε_1}(P) ⊂ D_{ε_2}(P) and hence g_{ε_1}(X;Y) ≤ g_{ε_2}(X;Y), i.e., ε ↦ g_ε(X;Y) is non-decreasing. Using a similar technique as in [45, Lemma 1], Calmon et al. [11] showed that the mapping ε ↦ g_ε(X;Y)/ε is non-increasing for ε > 0. This observation leads to a lower bound for the rate-privacy function g_ε(X;Y), as described in the following lemma.

Lemma 1 ([11]). For a given joint distribution P defined over X × Y, the mapping ε ↦ g_ε(X;Y)/ε is non-increasing on ε ∈ (0, ∞) and g_ε(X;Y) lies between two straight lines:

ε H(Y)/I(X;Y) ≤ g_ε(X;Y) ≤ H(Y|X) + ε,   (4)

for ε ∈ (0, I(X;Y)). Using a simple calculation, the lower bound in (4) can be shown to be achieved by the privacy filter depicted in Fig. 2, an erasure channel with erasure probability δ(ε) = 1 − ε/I(X;Y). In light of Lemma 1, the possible range of the map ε ↦ g_ε(X;Y) is as depicted in Fig. 3. We next show that ε ↦ g_ε(X;Y) is concave and continuous.
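This erasure filter is easy to verify numerically. The following sketch (the joint distribution P_XY and the target ε are illustrative assumptions) confirms that erasing Y with probability δ = 1 − ε/I(X;Y) yields I(X;Z) = ε and I(Y;Z) = ε H(Y)/I(X;Y), matching the lower bound in (4):

```python
import numpy as np

def mi(pab):
    """Mutual information (in nats) of a joint distribution matrix."""
    pa, pb = pab.sum(axis=1), pab.sum(axis=0)
    mask = pab > 0
    return float((pab[mask] * np.log(pab[mask] / np.outer(pa, pb)[mask])).sum())

pxy = np.array([[0.3, 0.1],              # assumed joint distribution of (X, Y)
                [0.1, 0.5]])
Ixy = mi(pxy)
H_y = float(-(pxy.sum(axis=0) * np.log(pxy.sum(axis=0))).sum())

eps = 0.5 * Ixy                          # target leakage, eps < I(X;Y)
delta = 1 - eps / Ixy                    # erasure probability of the filter

# Z equals Y with probability 1 - delta and the erasure symbol e otherwise
pxz = np.hstack([(1 - delta) * pxy, delta * pxy.sum(axis=1, keepdims=True)])
pyz = np.hstack([(1 - delta) * np.diag(pxy.sum(axis=0)),
                 delta * pxy.sum(axis=0)[:, None]])

assert abs(mi(pxz) - eps) < 1e-9              # I(X;Z) = (1-delta) I(X;Y) = eps
assert abs(mi(pyz) - eps * H_y / Ixy) < 1e-9  # I(Y;Z) = eps H(Y)/I(X;Y)
```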
Lemma 2. For any given pair of random variables (X, Y ) over X × Y, the mapping ε → g ε (X; Y ) is concave for ε ≥ 0.
Proof. It suffices to show that for any 0 ≤ ε_1 < ε_2 < ε_3 ≤ I(X;Y), we have

g_{ε_2}(X;Y) ≥ λ g_{ε_3}(X;Y) + (1 − λ) g_{ε_1}(X;Y),   (7)

where λ := (ε_2 − ε_1)/(ε_3 − ε_1). Let P_{Z_1|Y} : Y → Z_1 and P_{Z_3|Y} : Y → Z_3 be two optimal privacy filters in D_{ε_1}(P) and D_{ε_3}(P) with disjoint output alphabets Z_1 and Z_3, respectively.
We introduce an auxiliary binary random variable U ∼ Bernoulli(λ), independent of (X,Y), and define the following random privacy filter P_{Z_λ|Y}: we pick P_{Z_3|Y} if U = 1 and P_{Z_1|Y} if U = 0, and let Z_λ be the output of this random channel, which takes values in Z_1 ∪ Z_3. Since the output alphabets are disjoint, Z_λ determines U, and hence

I(X;Z_λ) = I(X;Z_λ,U) = λ I(X;Z_3) + (1 − λ) I(X;Z_1) ≤ λ ε_3 + (1 − λ) ε_1 = ε_2,

which implies that P_{Z_λ|Y} ∈ D_{ε_2}(P). On the other hand, we have

g_{ε_2}(X;Y) ≥ I(Y;Z_λ) = λ I(Y;Z_3) + (1 − λ) I(Y;Z_1) = λ g_{ε_3}(X;Y) + (1 − λ) g_{ε_1}(X;Y),

which, according to (7), completes the proof.

Corollary 1. For any given pair of random variables (X,Y) over X × Y, the mapping ε ↦ g_ε(X;Y) is continuous for ε ≥ 0.
Proof. Concavity directly implies that the mapping ε ↦ g_ε(X;Y) is continuous on (0, ∞) (see for example [43, Theorem 3.2]). Continuity at zero follows from the continuity of mutual information.
Remark 2. Using the concavity of the map ε ↦ g_ε(X;Y), we can provide an alternative proof for the lower bound in (4). Note that the point (I(X;Y), H(Y)) is always on the curve of g_ε(X;Y), and hence by concavity the straight line ε ↦ ε H(Y)/I(X;Y) always lies below the chord connecting (0, g_0(X;Y)) to (I(X;Y), H(Y)), which in turn lies below the curve; hence g_ε(X;Y) ≥ ε H(Y)/I(X;Y). In fact, this chord yields a better lower bound for g_ε(X;Y) on ε ∈ [0, I(X;Y)]:

g_ε(X;Y) ≥ g_0(X;Y) + ε (H(Y) − g_0(X;Y))/I(X;Y),

which reduces to the lower bound in (4) only if g_0(X;Y) = 0.

B. Maximal Correlation as Privacy Measure
By adopting mutual information as the privacy measure between the private and the displayed data, we ensure that only a limited number of bits of private information is revealed in the process of transferring Y. In order to have an estimation-theoretic guarantee of privacy, we propose alternatively to measure privacy using a measure of correlation, the so-called maximal correlation [26], [39]; other measures of dependence include the information measure [32], mutual information and f-divergence [16].

Definition 1 ([39]). Given random variables X and Y, the maximal correlation ρ_m(X;Y) is defined as

ρ_m(X;Y) := sup_{(f,g) ∈ S} E[f(X)g(Y)],

where S is the collection of pairs of real-valued random variables f(X) and g(Y) such that Ef(X) = Eg(Y) = 0 and Ef²(X) = Eg²(Y) = 1. If S is empty (which happens precisely when at least one of X and Y is constant almost surely), then ρ_m(X;Y) is defined to be 0. Rényi [39] derived an equivalent characterization of maximal correlation:

ρ²_m(X;Y) = sup_{f : Ef(X)=0, Ef²(X)=1} E[(E[f(X)|Y])²].   (9)

Measuring privacy in terms of maximal correlation, we propose

ĝ_ε(X;Y) := sup_{P_{Z|Y} ∈ D̂_ε(P)} I(Y;Z)

as the corresponding rate-privacy tradeoff, where

D̂_ε(P) := {P_{Z|Y} : X −− Y −− Z, ρ²_m(X;Z) ≤ ε}.

Again, we equivalently call ĝ_ε(X;Y) the privacy-constrained information extraction function, where here the privacy is guaranteed by ρ²_m(X;Z) ≤ ε. Setting ε = 0 corresponds to the case where X and Z are required to be statistically independent, i.e., absolutely no information leakage about the private source X is allowed; this is called perfect privacy. Since the independence of X and Z is equivalent to I(X;Z) = ρ_m(X;Z) = 0, we have ĝ_0(X;Y) = g_0(X;Y). We also have ĝ_ε(X;Y) ≤ g_{ε'}(X;Y), where ε' := log(kε + 1) and k := |X| − 1.
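For finite alphabets, maximal correlation can be computed via singular values: a well-known consequence of Rényi's characterization is that ρ_m(X;Y) equals the second-largest singular value of the matrix B with entries P_XY(x,y)/√(P_X(x)P_Y(y)), the largest being always 1. A minimal sketch (the toy distributions are assumptions for illustration):

```python
import numpy as np

def maximal_correlation(pxy):
    """rho_m(X;Y) = second-largest singular value of
    B[x,y] = P_XY(x,y)/sqrt(P_X(x) P_Y(y)) for finite alphabets."""
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    B = pxy / np.sqrt(np.outer(px, py))
    return float(np.linalg.svd(B, compute_uv=False)[1])

# independent X, Y: rho_m = 0
indep = np.outer([0.4, 0.6], [0.3, 0.7])
assert maximal_correlation(indep) < 1e-8

# Y = X (binary, uniform): rho_m = 1
assert abs(maximal_correlation(np.diag([0.5, 0.5])) - 1.0) < 1e-8

# doubly symmetric binary source with crossover 0.1: rho_m = 1 - 2(0.1) = 0.8
dsbs = np.array([[0.45, 0.05],
                 [0.05, 0.45]])
assert abs(maximal_correlation(dsbs) - 0.8) < 1e-9
```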
The following lemma is a counterpart of Lemma 1 for ĝ_ε(X;Y).

Lemma 3. For a given joint distribution P defined over X × Y, the mapping ε ↦ ĝ_ε(X;Y)/ε is non-increasing for ε > 0.

Proof. Like Lemma 1, the proof is similar to the proof of [45, Lemma 1]. We, however, give a brief proof for the sake of completeness. For a given channel P_{Z|Y} ∈ D̂_ε(P) and δ ∈ [0, 1], we can define a new channel P_{Z_δ|Y} with an additional erasure symbol e, which outputs Z with probability δ and e with probability 1 − δ. It is easy to check that I(Y;Z_δ) = δ I(Y;Z) and ρ²_m(X;Z_δ) = δ ρ²_m(X;Z). Therefore, for ε' ≤ ε, taking δ = ε'/ε yields ĝ_{ε'}(X;Y) ≥ (ε'/ε) ĝ_ε(X;Y), which proves the claim.

Similar to the lower bound for g_ε(X;Y) obtained from Lemma 1, we can obtain a lower bound for ĝ_ε(X;Y) using Lemma 3. Before we get to the lower bound, we need a data processing lemma for maximal correlation. The following lemma proves a version of the strong data processing inequality for maximal correlation, from which the typical data processing inequality, ρ_m(X;Z) ≤ min{ρ_m(X;Y), ρ_m(Y;Z)}, follows.

Lemma 4. For random variables X, Y and Z satisfying the Markov condition X −− Y −− Z, we have

ρ_m(X;Z) ≤ ρ_m(X;Y) ρ_m(Y;Z).

Proof. For arbitrary zero-mean and unit-variance measurable functions f ∈ L²(X) and g ∈ L²(Z),

E[f(X)g(Z)] = E[E[f(X)|Y] E[g(Z)|Y]] ≤ √(E[(E[f(X)|Y])²]) √(E[(E[g(Z)|Y])²]) ≤ ρ_m(X;Y) ρ_m(Y;Z),

where the first inequality follows from the Cauchy-Schwarz inequality and the second from (9). Thus we obtain ρ_m(X;Z) ≤ ρ_m(X;Y) ρ_m(Y;Z). This bound is tight for the special case X −− Y −− X', where P_{X'|Y} is the backward channel associated with P_{Y|X}. In the following, we shall show that in this case ρ_m(X;X') = ρ²_m(X;Y). To this end, first note that the lemma implies ρ_m(X;Y) ρ_m(Y;X') ≥ ρ_m(X;X'). Since P_XY = P_{X'Y}, it follows that ρ_m(X;Y) = ρ_m(Y;X') and hence ρ²_m(X;Y) ≥ ρ_m(X;X'). On the other hand, for f achieving the supremum in (9) we have

E[f(X)f(X')] = E[(E[f(X)|Y])²] = ρ²_m(X;Y),

which together with the definition of maximal correlation implies that ρ_m(X;X') ≥ ρ²_m(X;Y). Thus, ρ_m(X;X') = ρ²_m(X;Y), which completes the proof. Now a lower bound of ĝ_ε(X;Y) can be readily obtained.
Corollary 2. For a given joint distribution P_XY defined over X × Y and any ε ≥ 0, we have

ĝ_ε(X;Y) ≥ min{1, ε/ρ²_m(X;Y)} H(Y).

Proof. By Lemma 4, we know that for any Markov chain X −− Y −− Z we have ρ²_m(X;Z) ≤ ρ²_m(X;Y) ρ²_m(Y;Z) ≤ ρ²_m(X;Y); hence for ε ≥ ρ²_m(X;Y) the privacy constraint is inactive and ĝ_ε(X;Y) = H(Y), obtained with Z = Y. The bound for ε < ρ²_m(X;Y) then follows from Lemma 3, since ĝ_ε(X;Y)/ε ≥ ĝ_{ρ²_m(X;Y)}(X;Y)/ρ²_m(X;Y) = H(Y)/ρ²_m(X;Y).
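Lemma 4 can be sanity-checked numerically using the singular-value characterization of maximal correlation for finite alphabets; the random joint distribution and filter below are illustrative assumptions:

```python
import numpy as np

def rho_m(pab):
    """Second-largest singular value of P(a,b)/sqrt(P(a)P(b))."""
    pa, pb = pab.sum(axis=1), pab.sum(axis=0)
    s = np.linalg.svd(pab / np.sqrt(np.outer(pa, pb)), compute_uv=False)
    return float(s[1])

rng = np.random.default_rng(0)
pxy = rng.random((3, 3)); pxy /= pxy.sum()                 # joint of (X, Y)
W = rng.random((3, 4)); W /= W.sum(axis=1, keepdims=True)  # filter P_{Z|Y}

pyz = pxy.sum(axis=0)[:, None] * W   # P(y, z) = P(y) W(z|y)
pxz = pxy @ W                        # P(x, z), using the chain X -- Y -- Z

# strong data processing inequality for maximal correlation (Lemma 4)
assert rho_m(pxz) <= rho_m(pxy) * rho_m(pyz) + 1e-12
# the usual data processing inequality follows since rho_m <= 1
assert rho_m(pxz) <= min(rho_m(pxy), rho_m(pyz)) + 1e-12
```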
A loose upper bound of ĝ_ε(X;Y) can be obtained using an argument similar to the one used for g_ε(X;Y):

ĝ_ε(X;Y) ≤(a) g_{ε'}(X;Y) ≤ H(Y|X) + ε',   (11)

where ε' := log(kε + 1), k := |X| − 1, and (a) comes from [30, Proposition 1]. We can, therefore, conclude from (11) and Corollary 2 that

min{1, ε/ρ²_m(X;Y)} H(Y) ≤ ĝ_ε(X;Y) ≤ H(Y|X) + log(kε + 1).

Similar to Lemma 2, the following lemma shows that ĝ_ε(X;Y) is a concave function of ε.
Lemma 5. For any given pair of random variables (X,Y) with distribution P over X × Y, the mapping ε ↦ ĝ_ε(X;Y) is concave for ε ≥ 0.

Proof. The proof is similar to that of Lemma 2, except that here, for two optimal filters P_{Z_1|Y} : Y → Z_1 and P_{Z_3|Y} : Y → Z_3 in D̂_{ε_1}(P) and D̂_{ε_3}(P), respectively, and the random channel P_{Z_λ|Y} : Y → Z_1 ∪ Z_3 constructed via a coin flip with probability λ as before, we need to show that ρ²_m(X;Z_λ) ≤ ε_2. Let f : X → R with E[f(X)] = 0 and E[f²(X)] = 1, and let U be a binary random variable as in the proof of Lemma 2.
We then have

E[(E[f(X)|Z_λ])²] = E[(E[f(X)|Z_λ, U])²] = λ E[(E[f(X)|Z_3])²] + (1 − λ) E[(E[f(X)|Z_1])²],   (13)

where the first equality comes from the fact that Z_λ determines U and U is independent of X. We can then conclude from (13) and the alternative characterization of maximal correlation (9) that

ρ²_m(X;Z_λ) ≤ λ ρ²_m(X;Z_3) + (1 − λ) ρ²_m(X;Z_1) ≤ λ ε_3 + (1 − λ) ε_1 = ε_2,

from which we can conclude that P_{Z_λ|Y} ∈ D̂_{ε_2}(P).

C. Non-Trivial Filters For Perfect Privacy
As will become clear later, requiring that g_0(X;Y) = 0 is a useful assumption in the analysis of g_ε(X;Y). Thus, it is interesting to find a necessary and sufficient condition on the joint distribution P_XY for g_0(X;Y) > 0. The relevant notion is the following.

Definition 2 ([8]). The random variable X is said to be weakly independent of Y if the rows of the transition matrix P_{X|Y}, i.e., the set of vectors {P_{X|Y}(·|y), y ∈ Y}, are linearly dependent.
The following lemma provides a necessary and sufficient condition for g 0 (X; Y ) > 0.
Lemma 6. For a given (X,Y) with joint distribution P_XY = P_Y × P_{X|Y}, g_0(X;Y) > 0 if and only if X is weakly independent of Y.

Proof. ⇒ direction: Assuming g_0(X;Y) > 0 implies that there exists a random variable Z over an alphabet Z such that the Markov condition X −− Y −− Z is satisfied and Z ⊥⊥ X while I(Y;Z) > 0. Hence, for any z ∈ Z,

P_X(·) = Σ_{y∈Y} P_{X|Y}(·|y) P_{Y|Z}(y|z),

and hence for any z_1, z_2 ∈ Z,

Σ_{y∈Y} P_{X|Y}(·|y) [P_{Y|Z}(y|z_1) − P_{Y|Z}(y|z_2)] = 0.

Since Y is not independent of Z, there exist z_1, z_2 and y such that P_{Y|Z}(y|z_1) ≠ P_{Y|Z}(y|z_2), and hence the above shows that the set of vectors {P_{X|Y}(·|y), y ∈ Y} is linearly dependent.
⇐ direction: Berger and Yeung [8, Appendix II], in a completely different context, showed that if X is weakly independent of Y, one can always construct a binary random variable Z correlated with Y which satisfies X −− Y −− Z and X ⊥⊥ Z, and hence g_0(X;Y) > 0.
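The linear-dependence condition of Definition 2 amounts to a simple rank test on the transition matrix; the matrices below are illustrative assumptions:

```python
import numpy as np

def weakly_independent(T):
    """T[y, x] = P_{X|Y}(x|y). X is weakly independent of Y iff the rows
    of T are linearly dependent, i.e., rank(T) < |Y| (Definition 2)."""
    return int(np.linalg.matrix_rank(T)) < T.shape[0]

# |Y| = 3 > |X| = 2: three rows in a 1-dimensional simplex are always
# linearly dependent, so g_0(X;Y) > 0 by Lemma 6
T1 = np.array([[0.9, 0.1],
               [0.5, 0.5],
               [0.2, 0.8]])
assert weakly_independent(T1)

# binary Y genuinely correlated with X: full rank, hence X is not weakly
# independent of Y and g_0(X;Y) = 0 (Corollary 3)
T2 = np.array([[0.9, 0.1],
               [0.3, 0.7]])
assert not weakly_independent(T2)
```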
Remark 3. Lemma 6 first appeared in [5]. However, Calmon et al. [11] studied (1), the dual version of g_ε(X;Y), and showed an equivalent result for t_R(X;Y). Note that the vectors {P_{X|Y}(·|y), y ∈ Y} are |Y| vectors lying in the (|X| − 1)-dimensional probability simplex, and hence they are always linearly dependent when |Y| > |X|. Hence, Lemma 6 implies that g_0(X;Y) > 0 if Y has a strictly larger alphabet than X.
In light of the above remark, in the common case |Y| = |X|, one might have g_0(X;Y) = 0, which corresponds to the most conservative scenario, as no privacy leakage implies no broadcasting of observable data. In such cases, the rate of increase of g_ε(X;Y) at ε = 0, that is, g'_0(X;Y) := (d/dε) g_ε(X;Y)|_{ε=0}, which corresponds to the initial efficiency of privacy-constrained information extraction, proves to be very important in characterizing the behavior of g_ε(X;Y) for all ε ≥ 0. This is because, by the concavity of ε ↦ g_ε(X;Y), the slope of g_ε(X;Y) is maximized at ε = 0, and so

g_ε(X;Y) ≤ ε g'_0(X;Y)

for all ε ≥ 0 when g_0(X;Y) = 0. In the sequel, we always assume that X is not weakly independent of Y, or equivalently g_0(X;Y) = 0; for example, in light of Lemma 6 and Remark 4, we can assume that |Y| ≤ |X|. It is easy to show that X is weakly independent of a binary Y if and only if X and Y are independent (see e.g., [8, Remark 2]). The following corollary, therefore, immediately follows from Lemma 6.
Corollary 3. Let Y be a non-degenerate binary random variable correlated with X. Then g_0(X;Y) = 0.

III. OPERATIONAL INTERPRETATIONS OF THE RATE-PRIVACY FUNCTION
In this section, we provide a scenario in which g_ε(X;Y) appears as a boundary point of an achievable rate region, thus giving an information-theoretic operational interpretation for g_ε(X;Y). We then proceed to present an estimation-theoretic motivation for ĝ_ε(X;Y).

A. Dependence Dilution
Inspired by the problems of information amplification [29] and state masking [35], Courtade [14] proposed the information-masking tradeoff problem as follows. The tuple (R_u, R_v, ∆_A, ∆_M) is said to be achievable if, for two given separated sources U ∈ U and V ∈ V and any ε > 0, there exist indices K and J of rates R_u and R_v generated from U^n and V^n, respectively, such that the receiver in possession of (K, J) can recover at most n∆_M bits about U^n and at least n∆_A bits about V^n. The achievable region is characterized in [14]. Here, we look at a similar problem but for a joint encoder. In fact, we want to examine the achievable rate of an encoder observing both X^n and Y^n which masks X^n and amplifies Y^n at the same time, at rates ∆_M and ∆_A, respectively.
We define a (2^{nR}, n) dependence dilution code by an encoder

f_n : X^n × Y^n → {1, 2, ..., 2^{nR}}

and a list decoder

g_n : {1, 2, ..., 2^{nR}} → 2^{Y^n}.

A triple (R, ∆_A, ∆_M) ∈ R³_+ is said to be achievable if, for any δ > 0, there exists a (2^{nR}, n) dependence dilution code that for sufficiently large n satisfies the utility constraint

Pr(Y^n ∉ g_n(J)) < δ,   (14)

with a fixed list size

|g_n(J)| = 2^{n(H(Y) − ∆_A)},   (15)

where J := f_n(X^n, Y^n) is the encoder's output, and satisfies the privacy constraint

(1/n) I(X^n; J) ≤ ∆_M + δ.   (16)

Intuitively speaking, upon receiving J, the decoder is required to construct a list g_n(J) ⊂ Y^n of fixed size which contains likely candidates of the actual sequence Y^n. Without any observation, the decoder can only construct a list of size 2^{nH(Y)} which contains Y^n with probability close to one. However, after J is observed and the list g_n(J) is formed, the decoder's list size can be reduced to 2^{n(H(Y) − ∆_A)}, thus reducing the uncertainty about Y^n by 0 ≤ n∆_A ≤ nH(Y). This observation led Kim et al. [29] to show that the utility constraint (14) is equivalent to the amplification requirement

(1/n) I(Y^n; J) ≥ ∆_A − δ,   (17)

which lower bounds the amount of information J carries about Y^n. The following theorem gives an outer bound for the achievable dependence dilution region.
Theorem 1. Any achievable triple (R, ∆_A, ∆_M) in the dependence dilution setting satisfies

R ≥ ∆_A,
∆_A ≤ I(Y;U),
∆_M ≥ ∆_A − I(Y;U) + I(X;U),

for some auxiliary random variable U ∈ U with a finite alphabet and jointly distributed with X and Y.
Before we prove this theorem, we need two preliminary lemmas. The first lemma is an extension of Fano's inequality for list decoders and the second one makes use of a single-letterization technique to express I(X^n;J) − I(Y^n;J) in a single-letter form in the sense of Csiszár and Körner [17].

Lemma 7 ([29]). Given a pair of random variables (U,V) defined over U × V for finite V and arbitrary U, any list decoder g : U → 2^V with list size |g(U)| and error probability p_e := Pr(V ∉ g(U)) satisfies

H(V|U) ≤ 1 + p_e log |V| + (1 − p_e) log |g(U)|.

This lemma, applied to J and Y^n in place of U and V, respectively, implies that for any list decoder with the property (14), we have

H(Y^n|J) ≤ log |g_n(J)| + n ε_n,   (18)

where ε_n := 1/n + (log |Y| − (1/n) log |g_n(J)|) p_e, and hence ε_n → 0 as n → ∞.
Lemma 8. Let (X^n, Y^n) be n i.i.d. copies of a pair of random variables (X,Y). Then for a random variable J jointly distributed with (X^n, Y^n), we have

I(X^n; J) − I(Y^n; J) = Σ_{i=1}^n [I(X_i; U_i) − I(Y_i; U_i)],

where U_i := (J, X^{i−1}, Y^n_{i+1}).

Proof. Using the chain rule for mutual information, we can express I(X^n; J) as

I(X^n; J) = Σ_{i=1}^n I(X_i; J | X^{i−1}) = Σ_{i=1}^n I(X_i; J, X^{i−1}) = Σ_{i=1}^n [I(X_i; J, X^{i−1}, Y^n_{i+1}) − I(X_i; Y^n_{i+1} | J, X^{i−1})],   (19)

where we used the fact that X_i is independent of X^{i−1}. Similarly, we can expand I(Y^n; J) as

I(Y^n; J) = Σ_{i=1}^n I(Y_i; J, Y^n_{i+1}) = Σ_{i=1}^n [I(Y_i; J, X^{i−1}, Y^n_{i+1}) − I(Y_i; X^{i−1} | J, Y^n_{i+1})].   (20)

Subtracting (20) from (19), we get

I(X^n; J) − I(Y^n; J) =(a) Σ_{i=1}^n [I(X_i; J, X^{i−1}, Y^n_{i+1}) − I(Y_i; J, X^{i−1}, Y^n_{i+1})] = Σ_{i=1}^n [I(X_i; U_i) − I(Y_i; U_i)],

where (a) follows from the Csiszár sum identity [28], Σ_i I(X_i; Y^n_{i+1} | J, X^{i−1}) = Σ_i I(Y_i; X^{i−1} | J, Y^n_{i+1}).
Proof of Theorem 1. The rate R can be bounded as

nR ≥ H(J) ≥ I(Y^n; J) = nH(Y) − H(Y^n|J) ≥(a) nH(Y) − log |g_n(J)| − nε_n =(b) n∆_A − nε_n,   (22)

where (a) follows from Fano's inequality (18) with ε_n → 0 as n → ∞ and (b) is due to (15). We can also upper bound ∆_A as

n∆_A =(a) nH(Y) − log |g_n(J)| ≤(b) I(Y^n; J) + nε_n = Σ_{i=1}^n I(Y_i; J | Y^n_{i+1}) + nε_n ≤ Σ_{i=1}^n I(Y_i; U_i) + nε_n,   (23)

where (a) follows from (15), (b) follows from (18), and in the last inequality the auxiliary random variable U_i := (J, X^{i−1}, Y^n_{i+1}) is introduced, using I(Y_i; J | Y^n_{i+1}) = I(Y_i; J, Y^n_{i+1}) ≤ I(Y_i; U_i). We shall now lower bound I(X^n; J):

n(∆_M + δ) ≥ I(X^n; J) =(a) I(Y^n; J) + Σ_{i=1}^n [I(X_i; U_i) − I(Y_i; U_i)] ≥(b) n∆_A − nε_n + Σ_{i=1}^n [I(X_i; U_i) − I(Y_i; U_i)],   (24)

where (a) follows from Lemma 8 and (b) is due to Fano's inequality and (15) (or equivalently from (17)).
Combining (22), (23) and (24), we can write

R ≥ ∆_A − ε'_n,
∆_A ≤ I(Y_Q; U_Q, Q) + ε'_n,
∆_M ≥ ∆_A + I(X_Q; U_Q, Q) − I(Y_Q; U_Q, Q) − ε'_n,

where ε'_n := ε_n + δ and Q is a random variable distributed uniformly over {1, 2, ..., n}, independent of (X^n, Y^n), and hence I(X_Q; Q) = I(Y_Q; Q) = 0. The results follow by denoting U := (U_Q, Q) and noting that Y_Q and X_Q have the same distributions as Y and X, respectively.
If the encoder does not have direct access to the private source X^n, then we can define the encoder mapping as f_n : Y^n → {1, 2, ..., 2^{nR}}. The following corollary is an immediate consequence of Theorem 1.
Corollary 4. If the encoder observes only Y^n, then any achievable triple (R, ∆_A, ∆_M) satisfies R ≥ ∆_A, ∆_A ≤ I(Y;U) and ∆_M ≥ ∆_A − I(Y;U) + I(X;U) for some joint distribution P_XYU = P_XY P_{U|Y}, where the auxiliary random variable U ∈ U satisfies |U| ≤ |Y| + 1.
Remark 5. If source Y is required to be amplified (according to (17)) at maximum rate, that is, ∆_A = I(Y;U) for an auxiliary random variable U which satisfies X −− Y −− U, then by Corollary 4 the best privacy performance one can expect from the dependence dilution setting is

∆*_M = min_{P_{U|Y} : I(Y;U) ≥ ∆_A} I(X;U),

which is equal to the dual of g_ε(X;Y) evaluated at ∆_A, t_{∆_A}(X;Y), as defined in (1).
The dependence dilution problem is closely related to the discriminatory lossy source coding problem studied in [47]. In that problem, an encoder f observes (X n , Y n ) and describes the source to a decoder g such that g recovers Y n within distortion level D while I(f (X n , Y n ); X n ) ≤ n∆ M . If distortion is measured under the Hamming measure, then the distortion constraint and the amplification constraint are closely related via Fano's inequality. Moreover, the dependence dilution problem reduces to a secure lossless source coding problem (a list decoder of fixed size 1) by setting ∆ A = H(Y ), which was recently studied in [6].

B. MMSE Estimation of Functions of Private Information
In this section, we provide a justification for the privacy guarantee ρ 2 m (X; Z) ≤ ε. To this end, we recall the definition of minimum mean-squared error (MMSE) estimation.
Definition 3. Given random variables U and V , mmse(U |V ) is defined as the minimum mean-squared error of an estimate g(V ) of U based on V , that is,

mmse(U |V ) := inf g E[(U − g(V )) 2 ] = E[var(U |V )],

where var(U |V ) denotes the conditional variance of U given V .

Given a non-degenerate measurable function f : X → R, consider the following constraint on the estimation of f (X) from Z. This guarantees that no adversary knowing Z can efficiently estimate f (X). First consider the case where f is the identity function, i.e., f (x) = x. In this case, a direct calculation shows that

where (a) follows from (26) and (b) is due to the definition of maximal correlation. Having imposed ρ 2 m (X; Z) ≤ ε, we can therefore conclude that the MMSE of estimating X given Z satisfies

which shows that ρ 2 m (X; Z) ≤ ε implies (27) for f (x) = x. However, in the following we show that the constraint ρ 2 m (X; Z) ≤ ε is in fact equivalent to (27) for any non-degenerate measurable f : X → R.
and the Poincaré constant for P U V is defined as

The privacy constraint (27) can then be viewed as

Theorem 2 ([37]). For any joint distribution P U V , we have

In light of Theorem 2 and (29), the privacy constraint (27) is equivalent to ρ 2 m (X; Z) ≤ ε, that is, for any non-degenerate measurable function f : X → R.
Hence, ĝ ε (X; Y ) characterizes the maximum information extraction from Y such that no (non-trivial) function of X can be efficiently estimated, in the MMSE sense of (27), given the extracted information.
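The chain of inequalities above can be checked numerically. The sketch below computes mmse(X|Z) and the maximal correlation ρ m (X; Z) for a small hypothetical joint pmf, using Witsenhausen's characterization of ρ m as the second-largest singular value of the matrix with entries P(x, z)/√(P(x)P(z)); the particular joint distribution and the numeric values assigned to X are illustrative assumptions.

```python
import numpy as np

# Hypothetical joint pmf P_{X,Z} on a 3x3 alphabet (rows: x, cols: z).
P = np.array([[0.10, 0.05, 0.05],
              [0.05, 0.20, 0.10],
              [0.05, 0.10, 0.30]])
x_vals = np.array([0.0, 1.0, 2.0])  # numeric values taken by X (illustrative)

p_x = P.sum(axis=1)   # marginal of X
p_z = P.sum(axis=0)   # marginal of Z

# Maximal correlation rho_m(X;Z): second-largest singular value of the
# matrix B with entries P(x,z)/sqrt(P(x)P(z)) (Witsenhausen's characterization;
# the largest singular value is always 1).
B = P / np.sqrt(np.outer(p_x, p_z))
rho_m = np.linalg.svd(B, compute_uv=False)[1]

# mmse(X|Z) = E[var(X|Z)] = var(X) - var(E[X|Z]), computed from the joint pmf.
var_x = np.sum(p_x * x_vals**2) - np.sum(p_x * x_vals)**2
cond_mean = (P * x_vals[:, None]).sum(axis=0) / p_z   # E[X|Z=z] for each z
var_cond_mean = np.sum(p_z * cond_mean**2) - np.sum(p_z * cond_mean)**2
mmse = var_x - var_cond_mean

# The bound used above: mmse(X|Z) >= (1 - rho_m^2) var(X).
assert mmse >= (1 - rho_m**2) * var_x - 1e-12
```

The inequality holds because var(E[X|Z]) ≤ ρ m 2 var(X) for the estimator g(Z) = E[X|Z], which is exactly step (b) in the displayed calculation.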
IV. OBSERVATION CHANNELS FOR MINIMAL AND MAXIMAL g ε (X; Y )

In this section, we characterize the observation channels which achieve the lower or upper bound on the rate-privacy function in (4). We first derive general conditions for achieving the lower bound and then present a large family of observation channels P Y |X which achieve it. We also give a family of P Y |X which attain the upper bound on g ε (X; Y ).
A. Conditions for Minimal g ε (X; Y )

Assuming that g 0 (X; Y ) = 0, we seek a set of conditions on P XY under which g ε (X; Y ) is linear in ε, or equivalently, g ε (X; Y ) = ε H(Y )/I(X; Y ). To do so, we examine the slope of g ε (X; Y ) at zero. Recall that by concavity of g ε (X; Y ), the slope at zero satisfies g′ 0 (X; Y ) ≥ H(Y )/I(X; Y ). We strengthen this bound in the following lemmas. For this, we need to recall the notion of Kullback-Leibler divergence. Given two probability distributions P and Q supported over a finite alphabet U,

Lemma 9. For a given joint distribution P XY = P Y × P X|Y , if g 0 (X; Y ) = 0, then for any ε ≥ 0
Proof. The proof is given in Appendix A.
Remark 6. Note that if for a given joint distribution P XY there exists y 0 ∈ Y such that D(P X|Y (·|y 0 )||P X (·)) = 0, then P X|Y (·|y 0 ) = P X (·). Consider the binary random variable Z ∈ {1, e} constructed according to P Z|Y (1|y 0 ) = 1 and P Z|Y (e|y) = 1 for y ∈ Y\{y 0 }. We can then claim that Z is independent of X, because P X|Z (x|1) = P X|Y (x|y 0 ) = P X (x) and

Clearly, Z and Y are not independent, and hence g 0 (X; Y ) > 0. This shows that under the assumption g 0 (X; Y ) = 0, the right-hand side of the inequality in Lemma 9 cannot be infinite.
In order to prove the main result, we need the following simple lemma.

Lemma 10.

for all y ∈ Y.
Proof. It is clear that

where the inequality follows from the fact that for any three sequences of positive numbers

Now we are ready to state the main result of this subsection.
Theorem 3. For a given (X, Y ) with joint distribution P XY = P Y × P X|Y , if g 0 (X; Y ) = 0 and .
Proof. Note that g 0 (X; Y ) = 0 together with linearity of ε → g ε (X; Y ) is equivalent to g ε (X; Y ) = ε H(Y )/I(X; Y ). It is therefore immediate from Lemmas 9 and 10 that

where (a) follows from the fact that g ε (X; Y ) = ε H(Y )/I(X; Y ), and (b) and (c) are due to Lemmas 10 and 9, respectively. The inequality in (31) shows that

According to Lemma 10, (32) implies that the ratio −log P Y (y) / D(P X|Y (·|y)||P X (·)) does not depend on y ∈ Y, and hence the result follows.
This theorem implies that if there exist y 1 , y 2 ∈ Y at which −log P Y (y) / D(P X|Y (·|y)||P X (·)) takes two different values, then ε → g ε (X; Y ) cannot achieve the lower bound in (4), or equivalently

This therefore gives a necessary condition for the lower bound to be achievable. The following corollary simplifies this necessary condition.
Corollary 5. For a given joint distribution P XY = P Y × P X|Y , if g 0 (X; Y ) = 0 and ε → g ε (X; Y ) is linear, then the following are equivalent:

Proof. (i) ⇒ (ii): From Theorem 3, we have for all y ∈ Y

Letting D := D(P X|Y (·|y)||P X (·)), which does not depend on y, we have Σ y P Y (y)D = I(X; Y ) and hence D = I(X; Y ), which together with (33) implies that H(Y ) = −log P Y (y) for all y ∈ Y; hence Y is uniformly distributed.

(ii) ⇒ (i):
When Y is uniformly distributed, we have from (33) that I(X; Y ) = D(P X|Y (·|y)||P X (·)), which implies that D(P X|Y (·|y)||P X (·)) is constant over y ∈ Y.

Example 1. Suppose P Y |X is a binary symmetric channel (BSC) with crossover probability 0 < α < 1 and P X = Bernoulli(0.5). In this case, P X|Y is also a BSC, with input distribution P Y = Bernoulli(0.5).
Example 2. Now suppose P X|Y is a binary asymmetric channel with P X|Y (·|0) = Bernoulli(α 0 ) and P X|Y (·|1) = Bernoulli(α 1 ) for some 0 < α 0 , α 1 < 1, and input distribution P Y = Bernoulli(p), 0 < p ≤ 0.5. It is easy to see that if α 0 + α 1 = 1 then D(P X|Y (·|y)||P X (·)) does not depend on y, and hence we can conclude from Corollary 5 (noting that g 0 (X; Y ) = 0) that for any p < 0.5, g ε (X; Y ) is not linear, and hence for 0 < ε < I(X; Y )

In Theorem 3, we showed that when g ε (X; Y ) achieves its lower bound, given in (4), the slope of the mapping ε → g ε (X; Y ) at zero equals −log P Y (y) / D(P X|Y (·|y)||P X (·)) for any y ∈ Y. We will show in the next section that the reverse direction also holds, at least for a large family of binary-input symmetric-output channels, for instance when P Y |X is a BSC, thus showing that in this case,
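The necessary condition of Theorem 3 is easy to test numerically. The sketch below computes the ratio −log P Y (y) / D(P X|Y (·|y)||P X (·)) for both examples above: a symmetric reverse channel with uniform Y (constant ratio) and the same channel with a non-uniform Y (ratios differ, so the lower bound cannot be achieved); the crossover probability and input bias are illustrative values.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D(p||q) in bits."""
    return float(np.sum(p * np.log2(p / q)))

def slope_ratios(p_y, rows):
    """-log P_Y(y) / D(P_{X|Y}(.|y) || P_X) for each y; rows[y] = P_{X|Y}(.|y)."""
    p_x = p_y @ rows                      # marginal of X
    return [-np.log2(py) / kl(row, p_x) for py, row in zip(p_y, rows)]

alpha = 0.2
rows = np.array([[1 - alpha, alpha], [alpha, 1 - alpha]])  # alpha_0 + alpha_1 = 1

# Uniform Y: both ratios coincide, consistent with Theorem 3 / Corollary 5.
r_unif = slope_ratios(np.array([0.5, 0.5]), rows)
assert abs(r_unif[0] - r_unif[1]) < 1e-12

# Y ~ Bernoulli(0.3): the ratios differ, so g_eps(X;Y) cannot be linear.
r_asym = slope_ratios(np.array([0.7, 0.3]), rows)
assert abs(r_asym[0] - r_asym[1]) > 1e-3
```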

B. Special Observation Channels
In this section, we apply the results of the previous subsection to different joint distributions P XY . In the first family of channels from X to Y , we look at the case where Y is binary and the reverse channel P X|Y has a symmetry in a particular sense to be specified below. One particular member of this family is the case where P X|Y is a BSC. As a family of observation channels achieving the upper bound on g ε (X; Y ) stated in (4), we consider the class of erasure channels from X to Y , i.e., Y is an erased version of X.

1) Observation Channels With Symmetric Reverse:
The first example of P XY that we consider for binary Y is the so-called binary-input symmetric-output (BISO) channel P X|Y ; see for example [24], [46].
Suppose Y = {0, 1} and X = {0, ±1, ±2, . . . , ±k}, and for any x ∈ X we have P X|Y (x|1) = P X|Y (−x|0). This clearly implies that p 0 := P X|Y (0|0) = P X|Y (0|1). Note that with this definition of symmetry, we can always assume that the output alphabet X = {±1, ±2, . . . , ±k} has an even number of elements, because we can split the output X = 0 into two outputs, X = 0 + and X = 0 − , with P X|Y (0 − |0) = P X|Y (0 + |0) = p 0 /2 and P X|Y (0 − |1) = P X|Y (0 + |1) = p 0 /2. The new channel is clearly equivalent to the original one; see [46] for more details. This family of channels can also be characterized using the definition of quasi-symmetric channels [3, Definition 4.17]: a channel W is BISO if (after making |X| even) the transition matrix P X|Y can be partitioned along its columns into binary-input binary-output sub-arrays in which the rows are permutations of each other and the column sums are equal. Clearly, binary symmetric channels and binary erasure channels are both BISO. The following lemma gives an upper bound on g ε (X; Y ) when P X|Y belongs to this family of channels.
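The symmetry condition P X|Y (x|1) = P X|Y (−x|0) is easy to check mechanically. The following sketch does so for a hypothetical channel on the output alphabet {±1, ±2}; the particular transition probabilities are illustrative.

```python
def is_biso(row0, row1, xs):
    """Check the BISO symmetry P_{X|Y}(x|1) = P_{X|Y}(-x|0).
    row0, row1 are the conditional pmfs P_{X|Y}(.|0) and P_{X|Y}(.|1),
    both indexed by the output alphabet xs."""
    idx = {x: i for i, x in enumerate(xs)}
    return all(abs(row1[idx[x]] - row0[idx[-x]]) < 1e-12 for x in xs)

xs = [-2, -1, 1, 2]
row0 = [0.4, 0.3, 0.2, 0.1]          # P_{X|Y}(.|0)
row1 = [0.1, 0.2, 0.3, 0.4]          # P_{X|Y}(.|1): the mirror image of row0

assert is_biso(row0, row1, xs)       # the mirrored pair satisfies the symmetry
assert not is_biso(row0, row0, xs)   # row0 paired with itself does not
```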
Lemma 11. If the channel P X|Y is BISO, then for ε ∈ [0, I(X; Y )],

where C(P X|Y ) denotes the capacity of P X|Y .
Proof. The lower bound has already appeared in (4). To prove the upper bound, note that by the Markov relation X − − Y − − Z, we have for any x ∈ X and z ∈ Z

Now let Z 0 := {z : P Y |Z (0|z) ≤ P Y |Z (1|z)}, and let Z 1 be defined similarly. Then (34) allows us to write, for z ∈ Z 0 ,

where the expression involved is the inverse of the binary entropy function, and for z ∈ Z 1 ,

Letting the corresponding expressions denote the right-hand sides of (35) and (36), respectively, we can hence write

where H(X unif ) denotes the entropy of X when Y is uniformly distributed. Here, (a) is due to (35) and

Hence, we obtain
where the equality follows from the fact that for a BISO channel (and, in general, for any quasi-symmetric channel) the uniform input distribution is capacity-achieving [3, Lemma 4.18]. Since g ε (X; Y ) is attained when I(X; Z) = ε, the conclusion follows immediately.
This lemma then shows that the larger the gap between C(P X|Y ) and I(X; Y ), the more g ε (X; Y ) can deviate from its lower bound. When Y ∼ Bernoulli(0.5), we have C(P X|Y ) = I(X; Y ) and H(Y ) = 1, and hence Lemma 11 implies that

and hence we have proved the following corollary.
Corollary 6. If the channel P X|Y is BISO and Y ∼ Bernoulli(0.5), then for any ε ≥ 0

This corollary enables us to prove the reverse direction of Theorem 3 for the family of BISO channels.
Theorem 4. If P X|Y is a BISO channel, then the following statements are equivalent: The initial efficiency of privacy-constrained information extraction is , ∀y ∈ Y.

(ii)⇒ (i).
Let Y ∼ Bernoulli(p) for 0 < p < 1 and, as before, X = {±1, ±2, . . . , ±k}, so that P X|Y is determined by a 2 × 2k matrix. We then have

and

−log P Y (1) / D(P X|Y (·|1)||P X (·)) = −log(p) / D(P X|Y (·|1)||P X (·)). (38)

The hypothesis implies that (37) equals (38), that is,

It is shown in Appendix B that (39) holds if and only if p = 0.5. We can now invoke Corollary 6 to conclude.

This theorem shows that for any BISO channel P X|Y with uniform input, the optimal privacy filter is an erasure channel, depicted in Fig. 2. Note that if P X|Y is a BSC with uniform input P Y = Bernoulli(0.5), then P Y |X is also a BSC with uniform input P X = Bernoulli(0.5). The following corollary specializes Corollary 6 to this case.
Corollary 7. For the joint distribution P X P Y |X = Bernoulli(0.5) × BSC(α), the binary erasure channel with erasure probability δ(ε, α) = 1 − ε/I(X; Y ) (shown in Fig. 4), for 0 ≤ ε ≤ I(X; Y ), is the optimal privacy filter in (3). In other words, for ε ≥ 0

Moreover, for a given 0 < α < 1/2, P X = Bernoulli(0.5) is the only distribution for which ε → g ε (X; Y ) is linear. That is, for P X P Y |X = Bernoulli(p) × BSC(α), 0 < p < 0.5, we have

Proof. As mentioned earlier, since P X = Bernoulli(0.5) and P Y |X is BSC(α), it follows that P X|Y is also a BSC with uniform input, and hence from Corollary 6 we have g ε (X; Y ) = ε/I(X; Y ). As g ε (X; Y ) in this case achieves the lower bound given in Lemma 1, we conclude from Fig. 2 that BEC(δ(ε, α)), where δ(ε, α) = 1 − ε/I(X; Y ), is an optimal privacy filter. The fact that P X = Bernoulli(0.5) is the only input distribution for which ε → g ε (X; Y ) is linear follows from the proof of Theorem 4. In particular, we saw there that a necessary and sufficient condition for g ε (X; Y ) to be linear is that the ratio −log P Y (y) / D(P X|Y (·|y)||P X (·)) be constant over y ∈ Y, which, as shown before, is equivalent to Y ∼ Bernoulli(0.5). For the binary symmetric channel, this is in turn equivalent to X ∼ Bernoulli(0.5). The optimal privacy filter for BSC(α) and uniform X is shown in Fig. 4. In fact, this corollary immediately implies that the general lower bound in (4) is tight for the binary symmetric channel with uniform X.
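A quick numerical check of Corollary 7: for an erasure filter Z = BEC(δ)(Y ), one has I(Y ; Z) = (1 − δ)H(Y ) and, by the Markov chain X − − Y − − Z, I(X; Z) = (1 − δ)I(X; Y ). With δ = δ(ε, α) the privacy constraint is met with equality and the utility matches g ε (X; Y ) = ε/I(X; Y ); the crossover probability and threshold below are illustrative values.

```python
import numpy as np

def h2(p):
    """Binary entropy in bits."""
    return float(-p*np.log2(p) - (1-p)*np.log2(1-p))

alpha = 0.11                     # hypothetical BSC crossover probability
I_xy = 1 - h2(alpha)             # I(X;Y) for uniform X through BSC(alpha)
eps = 0.5 * I_xy                 # a privacy threshold in [0, I(X;Y)]

delta = 1 - eps / I_xy           # erasure probability of the candidate filter

# For Z = BEC(delta)(Y): erasure event independent of (X, Y), so
# I(Y;Z) = (1 - delta) H(Y) and I(X;Z) = (1 - delta) I(X;Y).
I_yz = (1 - delta) * 1.0         # H(Y) = 1 since Y is uniform
I_xz = (1 - delta) * I_xy

assert abs(I_xz - eps) < 1e-12            # privacy constraint met with equality
assert abs(I_yz - eps / I_xy) < 1e-12     # matches g_eps(X;Y) = eps / I(X;Y)
```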
2) Erasure Observation Channel:

Combining (8) and Lemma 1, we have for ε ≤ I(X; Y )

In the following, we show that the above upper and lower bounds coincide when P Y |X is an erasure channel, i.e., P Y |X (x|x) = 1 − δ and P Y |X (e|x) = δ for all x ∈ X and 0 ≤ δ ≤ 1. For any x ∈ X, we have

which implies Z ⊥⊥ X and thus I(X; Z) = 0. On the other hand, P Z (z) = ((1 − δ)/m) 1{z ≠ e} + δ 1{z = e}, and therefore we have

It then follows from Lemma 1 that g 0 (X; Y ) = H(Y |X), which completes the proof.
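One simple perfectly private filter makes the claim g 0 (X; Y ) ≥ H(Y |X) concrete: let Z = 1{Y = e} reveal only the erasure event. Since the erasure event is independent of X, I(X; Z) = 0, while I(Y ; Z) = H(Z) = h(δ) = H(Y |X). The sketch below verifies this for binary X and a hypothetical δ; Z = 1{Y = e} is an illustrative construction and need not coincide with the filter used in the proof above.

```python
import numpy as np

def mi(P):
    """Mutual information (bits) of a joint pmf given as a matrix."""
    px = P.sum(axis=1, keepdims=True)
    pz = P.sum(axis=0, keepdims=True)
    nz = P > 0
    return float(np.sum(P[nz] * np.log2(P[nz] / (px @ pz)[nz])))

delta = 0.3   # hypothetical erasure probability
# X uniform on {0,1}; Y in {0,1,e} is X erased with probability delta.
# Filter: Z = 1{Y = e}, a deterministic function of Y alone.
P_xz = np.array([[0.5*(1-delta), 0.5*delta],       # rows: x, cols: z in {0,1}
                 [0.5*(1-delta), 0.5*delta]])
P_yz = np.array([[0.5*(1-delta), 0.0],             # rows: y in {0,1,e}
                 [0.5*(1-delta), 0.0],
                 [0.0,           delta]])

h2 = lambda p: float(-p*np.log2(p) - (1-p)*np.log2(1-p))
assert abs(mi(P_xz)) < 1e-12                 # perfect privacy: I(X;Z) = 0
assert abs(mi(P_yz) - h2(delta)) < 1e-12     # I(Y;Z) = h(delta) = H(Y|X)
```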
Example 3. In light of this lemma, we can conclude that if P Y |X = BEC(δ), then the optimal privacy filter is a combination of an identity channel and a BSC(α(ε, δ)), as shown in Fig. 5, where 0 ≤ α(ε, δ) ≤ 1/2 is the unique solution of

Note that it is easy to check that

Therefore, in order for this channel to be a valid privacy filter, the crossover probability α(ε, δ) must be chosen such that I(X; Z) = ε. We also note, for fixed 0 < δ < 1, an extremal property of the BEC and BSC channels, which is similar to other existing extremal properties of the BEC and the BSC; see, e.g., [46] and [24]. For X ∼ Bernoulli(0.5), we have for any channel P Y |X

where g ε (BSC(α)) is the rate-privacy function corresponding to P XY = Bernoulli(0.5) × BSC(α) and α is determined by H(X|Y ). Similarly, if X ∼ Bernoulli(p), we have for any channel P Y |X with H(Y |X) ≤ 1

where g ε (BEC(δ)) is the rate-privacy function corresponding to P XY = Bernoulli(p) × BEC(δ) and

V. RATE-PRIVACY FUNCTION FOR CONTINUOUS RANDOM VARIABLES
In this section we extend the rate-privacy function g ε (X; Y ) to the continuous case. Specifically, we assume that the private and observable data are continuous random variables and that the filter is composed of two stages: first Gaussian noise is added, and then the resulting random variable is quantized using an M -bit accuracy uniform scalar quantizer, for some positive integer M . Such filters are of practical interest as they can be easily implemented. This section is divided into two subsections: in the first we discuss general properties of the rate-privacy function, and in the second we study the Gaussian case in more detail. Some observations on ĝ ε (X; Y ) for continuous X and Y are also given.

A. General properties of the rate-privacy function
Throughout this section we assume that the random vector (X, Y ) is absolutely continuous with respect to the Lebesgue measure on R 2 . Additionally, we assume that its joint density f X,Y satisfies the following.
(a) There exist constants C 1 > 0, p > 1 and a bounded function C 2 : R → R such that

and also

(c) the differential entropy of (X, Y ) satisfies h(X, Y ) > −∞,

where ⌊a⌋ denotes the largest integer less than or equal to a.
Note that assumptions (b) and (c) together imply that h(X, Y ), h(X) and h(Y ) are all finite. We also assume that X and Y are not independent, since otherwise the characterization of g ε (X; Y ) becomes trivial: the displayed data Z can be taken equal to the observable data Y .
We are interested in filters of the form Q M (Y + γN ), where γ ≥ 0, N ∼ N (0, 1) is a standard normal random variable independent of X and Y , and for any positive integer M , Q M denotes the M -bit accuracy uniform scalar quantizer, i.e., for all x ∈ R,

We define, for any M ∈ N,

and similarly

The next theorem shows that the previous definitions are closely related.

Proof. See Appendix C.
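The two-stage filter is straightforward to implement. The sketch below assumes the standard form Q M (x) = ⌊2 M x⌋ 2 −M for the M -bit accuracy uniform scalar quantizer (the displayed definition above is elided, so this form is an assumption, and the text's exact convention may differ); the noise level and resolution are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def Q(x, M):
    # Assumed form of the M-bit accuracy uniform scalar quantizer:
    # Q_M(x) = floor(2^M x) / 2^M; the text's exact definition may differ.
    return np.floor(x * 2.0**M) / 2.0**M

# Two-stage privacy filter Z = Q_M(Y + gamma*N): add Gaussian noise, then quantize.
gamma, M = 0.5, 8
Y = rng.normal(size=10_000)                      # stand-in observable data
noisy = Y + gamma * rng.normal(size=Y.shape)
Z = Q(noisy, M)

# Each point maps to the left endpoint of its quantization cell of width 2^-M.
assert np.all(noisy - Z >= -1e-9) and np.all(noisy - Z < 2.0**-M + 1e-9)
```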
In the limit of large M , g ε (X; Y ) approximates g ε,M (X; Y ). This becomes relevant when g ε (X; Y ) is easier to compute than g ε,M (X; Y ), as demonstrated in the following subsection. The next theorem summarizes some general properties of g ε (X; Y ).
As opposed to the discrete case, in the continuous case g ε (X; Y ) is no longer bounded. In the following section we show that ε → g ε (X; Y ) can be convex, in contrast to the discrete case where it is always concave.
It is clear from Theorem 6 that ĝ 0 (X; Y ) = g 0 (X; Y ) = 0 and ĝ ρ 2 (X;Y ) (X; Y ) = ∞. However, although we showed that g ε (X; Y ) is indeed the asymptotic approximation of g ε,M (X; Y ) for M large enough, it is not clear that the same statement holds for ĝ ε (X; Y ) and ĝ ε,M (X; Y ).

B. Gaussian Information
The rate-privacy function for Gaussian Y has an interesting interpretation from an estimation theoretic point of view. Given the private and observable data (X, Y ), suppose an agent is required to estimate Y based on the output of the privacy filter. We wish to know the effect of imposing a privacy constraint on the estimation performance.
The following lemma shows that g ε (X; Y ) bounds the best performance of the predictability of Y given the output of the privacy filter. The proof provided for this lemma does not use the Gaussianity of the noise process, so it holds for any noise process.
Lemma 13. For any given private data X and Gaussian observable data Y , we have for any ε ≥ 0

inf {γ≥0 : I(X;Z γ )≤ε} mmse(Y |Z γ ) ≥ var(Y ) 2 −2g ε (X;Y ) .

Proof. It is a well-known fact from rate-distortion theory that for a Gaussian Y and its reconstruction Ŷ , I(Y ; Ŷ ) ≥ (1/2) log(var(Y )/E[(Y − Ŷ ) 2 ]), and hence by setting Ŷ = E[Y |Z γ ], where Z γ is the output of a privacy filter, and noting that I(Y ; Ŷ ) ≤ I(Y ; Z γ ) ≤ g ε (X; Y ), we obtain

from which the result follows immediately.
According to Lemma 13, the quantity λ ε (X) := 2 −2g ε (X;Y ) bounds the difficulty of estimating the Gaussian Y from an additive perturbation Z γ under the privacy constraint I(X; Z γ ) ≤ ε.
Note that 0 < λ ε (X) ≤ 1, and therefore, provided that the privacy threshold is non-trivial (i.e., ε < I(X; Y )), the mean-squared error of estimating Y from the privacy filter output is bounded away from zero; the bound, however, decays exponentially with g ε (X; Y ).
To finish this section, assume that X and Y are jointly Gaussian with correlation coefficient ρ. The value of g ε (X; Y ) can be easily obtained in closed form as demonstrated in the following theorem.
Theorem 7. Let (X, Y ) be jointly Gaussian random variables with correlation coefficient ρ. For any ε ∈ [0, I(X; Y )) we have

g ε (X; Y ) = (1/2) log( ρ 2 / (2 −2ε − (1 − ρ 2 )) ).

Proof. One can always write Y = aX + N 1 , where a 2 = ρ 2 var(Y )/var(X) and N 1 is a zero-mean Gaussian random variable with variance σ 2 = (1 − ρ 2 )var(Y ) which is independent of X. On the other hand, we have Z γ = Y + γN , where N is a standard Gaussian random variable independent of (X, Y ), and hence Z γ = aX + N 1 + γN . In order for this additive channel to be a privacy filter, it must satisfy I(X; Z γ ) ≤ ε, and hence

Since γ → I(Y ; Z γ ) is strictly decreasing (cf. Appendix C), we obtain

According to (46), we conclude that the optimal privacy filter for jointly Gaussian (X, Y ) is an additive Gaussian channel whose signal-to-noise ratio is chosen to meet the privacy constraint with equality, which shows that if perfect privacy is required, then the displayed data is independent of the observable data Y , i.e., g 0 (X; Y ) = 0.
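The proof can be checked numerically. Writing v = var(Y ), one has I(X; Z γ ) = ½ log((v + γ 2 )/((1 − ρ 2 )v + γ 2 )); solving I(X; Z γ ) = ε for γ 2 and substituting into I(Y ; Z γ ) yields the closed form stated above (which is reconstructed here from the proof and should be checked against the original statement). The values of ρ, v and ε below are illustrative.

```python
import numpy as np

def I_xz(gamma2, rho, v=1.0):
    """I(X; Y + gamma*N) in bits for jointly Gaussian (X,Y) with var(Y)=v."""
    return 0.5*np.log2((v + gamma2) / ((1 - rho**2)*v + gamma2))

def I_yz(gamma2, v=1.0):
    """I(Y; Y + gamma*N) in bits for Gaussian Y with var(Y)=v."""
    return 0.5*np.log2((v + gamma2) / gamma2)

rho, v = 0.8, 1.0
I_XY = -0.5*np.log2(1 - rho**2)      # I(X;Y) for correlation rho
eps = 0.4 * I_XY                     # a privacy threshold inside (0, I(X;Y))

# Solve I(X; Z_gamma) = eps for gamma^2, as in the proof:
gamma2 = v * (1 - 2**(2*eps)*(1 - rho**2)) / (2**(2*eps) - 1)
assert abs(I_xz(gamma2, rho, v) - eps) < 1e-9

# The resulting utility matches the closed form
# g_eps(X;Y) = (1/2) log( rho^2 / (2^{-2 eps} - (1 - rho^2)) ).
g_eps = 0.5*np.log2(rho**2 / (2**(-2*eps) - (1 - rho**2)))
assert abs(I_yz(gamma2, v) - g_eps) < 1e-9
```

Note that g eps → 0 as eps → 0 and g eps → ∞ as eps → I(X; Y ), matching g 0 (X; Y ) = 0 and the unboundedness discussed above.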
Remark 7. We could instead assume that the privacy filter adds non-Gaussian noise to the observable data and define the rate-privacy function accordingly. To this end, let

g f ε (X; Y ) := sup {γ≥0 : I(X;Z f γ )≤ε} I(Y ; Z f γ ),

where Z f γ = Y + γM f and M f is a noise process having a stable distribution with density f , independent of (X, Y ). In this case, we can use a technique similar to Oohama [36] to lower bound g f ε (X; Y ) for jointly Gaussian (X, Y ). Since X and Y are jointly Gaussian, we can write X = aY + bN , where a 2 = ρ 2 var(X)/var(Y ), b 2 = (1 − ρ 2 )var(X), and N is a standard Gaussian random variable independent of Y . Applying the conditional entropy power inequality (cf. [28, p. 22]) to a random variable Z independent of N , we obtain

and hence

Letting Z = Z f γ and taking the infimum of both sides of the above inequality over all γ such that I(X; Z f γ ) ≤ ε, we obtain

which shows that, for Gaussian (X, Y ), Gaussian noise is the worst stable additive noise in the sense of privacy-constrained information extraction.
Theorem 8. Let (X, Y ) be jointly Gaussian random variables with correlation coefficient ρ. For any

Proof. Since for the correlation coefficient between Y and Z γ we have, for any γ ≥ 0,

we can conclude that

Since ρ 2 m (X; Z) = ρ 2 (X; Z) for jointly Gaussian vectors (see, e.g., [39]), the privacy constraint ρ 2 m (X; Z) ≤ ε implies that

and hence

By monotonicity of the map γ → I(Y ; Z γ ), we have

Theorems 7 and 8 show that, unlike the discrete case (cf. Lemmas 2 and 5), ε → g ε (X; Y ) and ε → ĝ ε (X; Y ) are convex.

VI. CONCLUSIONS
In this paper, we studied the problem of determining the maximal amount of information that one can extract by observing a random variable Y , which is correlated with another random variable X that represents sensitive or private data, while ensuring that the extracted data Z meets a privacy constraint with respect to X. Specifically, given two correlated discrete random variables X and Y , we introduced the rate-privacy function as the maximization of I(Y ; Z) over all stochastic "privacy filters" P Z|Y such that pm(X; Z) ≤ ε, where pm(·; ·) is a privacy measure and ε ≥ 0 is a given privacy threshold. We considered two possible privacy measures, pm(X; Z) = I(X; Z) and pm(X; Z) = ρ 2 m (X; Z), where ρ m denotes maximal correlation, resulting in the rate-privacy functions g ε (X; Y ) and ĝ ε (X; Y ), respectively. We analyzed these two functions, noting that each lies between easily evaluated upper and lower bounds, and derived their monotonicity and concavity properties. We next provided an information-theoretic interpretation for g ε (X; Y ) and an estimation-theoretic characterization for ĝ ε (X; Y ). In particular, we demonstrated that the dual function of g ε (X; Y ) is a corner point of an outer bound on the achievable region of the dependence dilution coding problem. We also showed that ĝ ε (X; Y ) constitutes the largest amount of information that can be extracted from Y such that no meaningful MMSE estimation of any function of X can be realized by just observing the extracted information Z. We then examined conditions on P XY under which the lower bound on g ε (X; Y ) is tight, hence determining the exact value of g ε (X; Y ). We also showed that for any given Y , if the observation channel P Y |X is an erasure channel, then g ε (X; Y ) attains its upper bound. Finally, we extended the rate-privacy function to continuous (X, Y ), studying privacy filters formed by adding Gaussian noise followed by uniform quantization, and examined the asymptotic behavior of the resulting rate-privacy function as the quantization resolution grows without bound.

APPENDIX A
PROOF OF LEMMA 9

We clearly have P Z (k) = δP Y (k) and P Z (e) = 1 − δ, and hence

and also,

It therefore follows that for k ∈ {1, 2, . . .
, n} and

We then write

and hence,

Using the first-order approximation of mutual information around δ = 0, we can write

Similarly, we can write

where Ψ(x) := x log x, which yields

From the above, we obtain

Clearly from (49), in order for the filter P Z|Y specified in (47) and (48) to belong to D ε (P XY ), we must have

and hence from (50), we have

This immediately implies that

where we have used the assumption g 0 (X; Y ) = 0 in the first equality.

APPENDIX B COMPLETION OF PROOF OF THEOREM 4
To prove that equality (39) has the unique solution p = 1/2, we first establish the following lemma.
APPENDIX C

PROOF OF THEOREMS 5 AND 6
The proof of Theorem 6 does not depend on that of Theorem 5, so there is no harm in proving the former first. The following version of the data processing inequality will be required.
Lemma 15. Let X and Y be absolutely continuous random variables such that X, Y and (X, Y ) have finite differential entropies. If V is an absolutely continuous random variable independent of (X, Y ), then I(X; Y + V ) ≤ I(X; Y ), with equality if and only if X and Y are independent.
Proof. Since X − − Y − − (Y + V ), the data processing inequality implies that I(X; Y + V ) ≤ I(X; Y ). It therefore suffices to show that this inequality is tight if and only if X and Y are independent.
It is known that the data processing inequality is tight if and only if X − − (Y + V ) − − Y . This is equivalent to saying that for any measurable set A ⊂ R and P Y +V -almost all z, Pr(X ∈ A|Y + V = z, Y = y) = Pr(X ∈ A|Y + V = z). On the other hand, due to the independence of V and (X, Y ), we have Pr(X ∈ A|Y + V = z, Y = y) = Pr(X ∈ A|Y = y). Hence, equality holds if and only if

which implies that X and Y must be independent.
Lemma 16. In the notation of Section V-A, the function γ → I(Y ; Z γ ) is strictly decreasing and continuous. Additionally, it satisfies

with equality if and only if Y is Gaussian. In particular, I(Y ; Z γ ) → 0 as γ → ∞.
Proof. Recall that, by assumption (b), var(Y ) is finite, and the finiteness of h(Y ) holds by assumption. The corresponding statement for Y + γN follows from a routine application of the entropy power inequality [15, Theorem 17.7.3] and the fact that var(Y + γN ) = var(Y ) + γ 2 < ∞; for (Y, Y + γN ) the same conclusion follows from the chain rule for differential entropy. The data processing inequality, as stated in Lemma 15, implies I(Y ; Z γ+δ ) ≤ I(Y ; Y + γN ) = I(Y ; Z γ ) for any δ > 0.
Clearly Y and Y + γN are not independent; therefore the inequality is strict, and thus γ → I(Y ; Z γ ) is strictly decreasing.
Continuity will be studied for γ = 0 and γ > 0 separately. Recall that h(γN ) = (1/2) log(2πeγ 2 ); in particular, lim

Since the channel from Y to Z γ is an additive Gaussian noise channel, we have I(Y ; Z γ ) ≤ (1/2) log(1 + var(Y )/γ 2 ), with equality if and only if Y is Gaussian. The claimed limit as γ → ∞ is then clear.

Proof. The strictly decreasing behavior of γ → I(X; Z γ ) is proved as in the previous lemma.
To prove continuity, fix γ ≥ 0 and let (γ n ) n≥1 be any sequence of positive numbers converging to γ. First suppose that γ > 0. Observe that for all n ≥ 1,

As shown in Lemma 16, h(Y + γ n N ) → h(Y + γN ) as n → ∞. It is therefore enough to show that h(Y + γ n N |X) → h(Y + γN |X) as n → ∞. Note that by de Bruijn's identity, we have

as n → ∞.
To prove continuity at γ = 0, first note that Linder and Zamir [31, p. 2028] showed that h(Y + γ n N |X = x) → h(Y |X = x) as n → ∞; then, as before, the dominated convergence theorem gives h(Y + γ n N |X) → h(Y |X). Similarly, [31] implies that h(Y + γ n N ) → h(Y ). This concludes the proof of the continuity of γ → I(X; Z γ ).
Furthermore, by the data processing inequality and the previous lemma,

and hence we conclude that lim γ→∞ I(X; Z γ ) = 0.
Proof of Theorem 6. The nonnegativity of g ε (X; Y ) follows directly from definition.
In order to prove Theorem 5, we first recall the following theorem of Rényi [40].

provided that the integral on the right-hand side exists.
We will need the following consequence of the previous theorem. Since the function x → −x log(x) is increasing on [0, 1/e], there exists K > 0 such that for |k| > K,

−p k,γ n log(p k,γ n ) ≤ A k p log(A −1 k p).
Since Σ |k|>K A k p log(A −1 k p) < ∞, for any ε′ > 0 there exists K such that Σ |k|>K A k p log(A −1 k p) < ε′.
To prove continuity at γ 0 = 0, observe that equation (57) holds in this case as well; the rest is analogous to the case γ 0 > 0.

where γ ε is as defined in the proof of Theorem 6. This then implies that

where the inequality follows from Markovity, and γ M,min := inf γ∈Γ M γ. By equation (58), γ ε ∈ Γ M +1 ⊂ Γ M and in particular γ M,min ≤ γ M +1,min ≤ γ ε . Thus, {γ M,min } is an increasing sequence in M which is bounded from above and, hence, has a limit. Let γ min = lim M →∞ γ M,min . Clearly γ min ≤ γ ε .
By the previous lemma, γ → I(X; Z M γ ) is continuous, so Γ M is closed for all M ∈ N. Thus γ M,min = min γ∈Γ M γ, and in particular γ M,min ∈ Γ M . By the inclusion Γ M +1 ⊂ Γ M , we