Perturbation of convex risk minimization and its application in differential private learning algorithms

Convex risk minimization is a commonly used setting in learning theory. In this paper, we first give a perturbation analysis for such algorithms and then apply the result to differential private learning algorithms. Our analysis requires the objective functions to be strongly convex, which extends our previous analysis to non-differentiable loss functions when constructing differential private algorithms. Finally, an error analysis is provided to guide the selection of the parameters.


Introduction
In learning theory, convex optimization is a powerful tool for analysis and algorithm design, and it is used in particular for empirical risk minimization (ERM) (Vapnik []). When run on a sensitive data set, such algorithms may leak private information. This has motivated the notion of differential privacy (Dwork et al. []). For the sample space $Z$, denote the Hamming distance between two sample sets $z_1, z_2 \in Z^m$ as $d(z_1, z_2) = \#\{i = 1, \ldots, m : z_{1,i} \neq z_{2,i}\}$, so that $d(z_1, z_2) = 1$ means there is only one element that is different. Then $\epsilon$-differential privacy is defined as follows.
Definition 1 A random algorithm $A : Z^m \to H$ is $\epsilon$-differential private if for every two data sets $z_1, z_2$ satisfying $d(z_1, z_2) = 1$, and every set $O \in \mathrm{Range}(A(z_1)) \cap \mathrm{Range}(A(z_2))$, we have
$$\Pr\{A(z_1) \in O\} \le e^{\epsilon} \cdot \Pr\{A(z_2) \in O\}.$$
Throughout the paper, we assume $\epsilon < 1$ for meaningful privacy guarantees. The relaxation $(\epsilon, \delta)$-differential privacy is also interesting and has been studied in some recent literature. However, it is outside our scope, and we focus on $\epsilon$-differential privacy throughout the paper. Extensions of our results to $(\epsilon, \delta)$-differential privacy or concentrated differential privacy [] may be studied in future work.
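To make the definition concrete, the following minimal sketch checks the inequality exactly for randomized response on binary data. Randomized response is a standard textbook mechanism and is not the algorithm studied in this paper. The function names `rr_prob` and `max_privacy_ratio` are hypothetical, and the sketch assumes each bit is reported truthfully with probability $e^{\epsilon}/(1+e^{\epsilon})$. Because the output space is finite, bounding the ratio for every single output also bounds it for every set $O$.

```python
import itertools
import math


def rr_prob(data, output, eps):
    """Probability that randomized response maps `data` to `output`."""
    keep = math.exp(eps) / (1.0 + math.exp(eps))  # chance of reporting a bit truthfully
    prob = 1.0
    for d, o in zip(data, output):
        prob *= keep if d == o else 1.0 - keep
    return prob


def max_privacy_ratio(m, eps):
    """Largest ratio Pr{A(z1) = out} / Pr{A(z2) = out} over all data sets
    z1, z2 at Hamming distance 1 and all outputs `out`."""
    worst = 0.0
    for z1 in itertools.product([0, 1], repeat=m):
        for i in range(m):
            z2 = list(z1)
            z2[i] = 1 - z2[i]  # change exactly one element
            for out in itertools.product([0, 1], repeat=m):
                ratio = rr_prob(z1, out, eps) / rr_prob(tuple(z2), out, eps)
                worst = max(worst, ratio)
    return worst


eps = 0.5
worst_ratio = max_privacy_ratio(3, eps)
# The definition requires worst_ratio <= exp(eps); small slack for rounding.
satisfies_dp = worst_ratio <= math.exp(eps) + 1e-9
```

In this toy case the worst ratio is attained when the output agrees with one data set and disagrees with the other at the changed position, so the bound $e^{\epsilon}$ is tight.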
A mechanism usually obtains differential privacy by adding a perturbation term to an otherwise deterministic output (Dwork et al. []), i.e., the so-called Laplace mechanism. McSherry and Talwar [] proposed the exponential mechanism, which chooses an output based on its utility function. The two mechanisms are related, and both depend on some kind of sensitivity of the original deterministic output. We refer the reader to Dwork []. On the other hand, the sensitivity in a differential private algorithm, which can be regarded as the perturbation of the ERM algorithm, or its stability, has been studied by Bousquet and Elisseeff [] and Shalev-Shwartz et al. [] in the classical learning theory setting. More recently, the relationship between stability and differential privacy was revealed in Wang et al. [].
The main contribution of this paper is a different perturbation analysis for ERM algorithms, whose only conditions are a convex loss function and a strongly convex regularization term. Thus the output perturbation mechanism remains directly valid for SVM and other cases with non-differentiable losses. In addition, we conduct an error analysis, from which we find a choice of the parameter that balances privacy and generalization ability.

Perturbation analysis for ERM algorithms
In this section we consider general regularized ERM algorithms. Let $X$ be a compact metric space and the output space $Y \subset \mathbb{R}$, where $|y| \le M$ for some $M > 0$. The sample $z = \{z_1, \ldots, z_m\}$, with $z_i = (x_i, y_i)$, is drawn according to a distribution $\rho$ on the sample space $Z := X \times Y$. Furthermore, we assume there is a marginal distribution $\rho_X$ on $X$ and a conditional distribution $\rho(y|x)$ on $Y$ given $x$.
First we introduce the notation used in the following statements and analysis. Let the loss function $L(f(x), y)$ be positive and convex with respect to the first variable. Denote
$$E(f) = \int_Z L(f(x), y)\, d\rho, \qquad E_z(f) = \frac{1}{m} \sum_{i=1}^{m} L(f(x_i), y_i).$$
Without loss of generality, we set $\bar z = \{z_1, z_2, \ldots, z_{m-1}, \bar z_m\}$, which replaces the last element of $z$, and $z^- = \{z_1, z_2, \ldots, z_{m-1}\}$ as the sample set obtained by deleting the last element of $z$. The notations $E_{\bar z}(f)$ and $E_{z^-}(f)$ are then defined similarly.
Denote $(H_K, \|\cdot\|_K)$ as the reproducing kernel Hilbert space (RKHS) on $X$, i.e., $H_K := \mathrm{span}\{K(x, \cdot), x \in X\}$, where $K : X \times X \to \mathbb{R}$ is a Mercer kernel. Let $K_x(y) = K(x, y)$ for any $x, y \in X$, and $\kappa = \sup_{x, y \in X} \sqrt{K(x, y)}$. Then the reproducing property tells us that $f(x) = \langle f, K_x \rangle_K$. Now a typical regularized ERM algorithm can be stated as
$$f_z = \arg\min_{f \in H_K} \{ E_z(f) + \lambda \Omega(f) \}.$$
Here $\lambda > 0$ is the regularization parameter and $\Omega(f)$ is a $\gamma$-strongly ($\gamma > 0$) convex function with respect to the $K$ norm, i.e., for any $f_1, f_2 \in H_K$ and $t \in [0, 1]$,
$$\Omega(t f_1 + (1 - t) f_2) \le t\, \Omega(f_1) + (1 - t)\, \Omega(f_2) - \frac{\gamma}{2}\, t (1 - t)\, \|f_1 - f_2\|_K^2.$$
This definition of strong convexity is taken from Sridharan [], where the authors derived a kind of uniform convergence under the strong convexity assumption. It has been widely used in the subsequent literature such as [, , , ], etc. Defining $f_{\bar z}$ and $f_{z^-}$ in the same way for the samples $\bar z$ and $z^-$, we have the following result.
Theorem 1 Let $f_z$ and $f_{\bar z}$ be defined as above. Assume $\Omega$ is $\gamma$-strongly convex and $L$ is convex w.r.t. the first argument [...]
Proof We will prove the result in three steps.
(1) For any $S \in Z^m$ and $f_S$ from (), [...] It is obvious from the definition above that [...]
(2) The minima of the two objective functions are close, i.e., [...] From the notation above, we have [...] A similar analysis for $f_{\bar z}$ can be given as follows: [...], and since the two lower bounds above are the same, we have [...] We can deduce that [...]
(3) Now we can prove our main result. Since $\Omega$ is $\gamma$-strongly convex, and $L(f(x), y)$ is convex w.r.t. the first argument, which leads to the convexity of $E_z(f)$, for any $0 < t < 1$ it follows that [...] Therefore, [...] which proves our result.
Now let us make a brief remark about this result. In our theorem, only convexity of the loss function and $\gamma$-strong convexity of $\Omega$ are assumed. The assumption $\lambda \Omega(f_S) \le B$ is trivial for algorithms such as general SVM or coefficient regularization [], since $E_S(f_S) + \lambda \Omega(f_S)$ is the minimum value. The advantage of this result is that most learning algorithms satisfy this condition, including in particular the hinge loss for SVM and the pinball loss for quantile regression.
Perturbation, or stability, analysis has already been performed in [, ]. There the authors proposed quite a few stability definitions, which are mainly used for classical generalization analysis. References [, ] also studied differential private learning algorithms with different kernels and Lipschitz losses, with a squared-norm regularization term. A result similar to theirs, in our notation, is as follows.
Theorem 2 Let $f_z$, $f_{\bar z}$, $f_{z^-}$ be defined as above. Assume $|L(t_1, y) - L(t_2, y)| \le C_L |t_1 - t_2|$ for any $t_1, t_2, y$ and some $C_L > 0$. Then we have [...]
Proof From the convexity of the loss function and the regularization term, we have, for any $f \in H_K$ and $0 < t < 1$, [...] This leads to [...], i.e., [...] Taking the limit in $t$, we have [...] Similarly, we also have [...] By adding the two inequalities we have [...] From the fact that [...], the theorem is proved.
Although the condition of the latter result is stronger than that of the first, we still apply it in the analysis below, since its bound is sharper and most loss functions satisfy the Lipschitz condition above.
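The perturbation effect can be observed numerically in a toy setting. The sketch below is an illustration under stated assumptions and is not the paper's experiment: it uses a one-dimensional linear model $f(x) = wx$ with $|x| \le 1$ (so $\kappa = 1$), the non-differentiable absolute loss $L(t, y) = |t - y|$ (so $C_L = 1$), and $\Omega(w) = w^2$. For this special case, a standard strong-convexity argument gives $|w_z - w_{\bar z}| \le C_L \kappa / (\lambda m)$; this constant is derived for the toy case only and is not taken from the theorem above. The helper names `objective`, `minimize_convex_1d` and `erm` are hypothetical.

```python
import random


def objective(w, sample, lam):
    """Regularized empirical risk with absolute loss and Omega(w) = w**2."""
    risk = sum(abs(w * x - y) for x, y in sample) / len(sample)
    return risk + lam * w * w


def minimize_convex_1d(func, lo, hi, iters=200):
    """Ternary search for the minimizer of a convex function on [lo, hi]."""
    for _ in range(iters):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if func(a) <= func(b):
            hi = b  # the minimizer lies in [lo, b]
        else:
            lo = a  # the minimizer lies in [a, hi]
    return 0.5 * (lo + hi)


def erm(sample, lam, M=1.0):
    """Minimizer of the regularized objective for targets with |y| <= M."""
    # lam * w**2 <= objective(0) <= M, so the minimizer has |w| <= sqrt(M / lam).
    radius = (M / lam) ** 0.5
    return minimize_convex_1d(lambda w: objective(w, sample, lam), -radius, radius)


random.seed(0)
m, lam, kappa, C_L = 50, 0.1, 1.0, 1.0
z = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(m)]
# z_bar replaces only the last element of z.
z_bar = z[:-1] + [(random.uniform(-1, 1), random.uniform(-1, 1))]

gap = abs(erm(z, lam) - erm(z_bar, lam))  # observed perturbation of the output
bound = C_L * kappa / (lam * m)           # toy-case bound, decays like 1/(lam*m)
```

Comparing `gap` with `bound` for several values of `m` and `lam` shows the $1/(\lambda m)$ scaling that drives the choice of the noise level in the next section.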

Differential private learning algorithms
In this section, we describe general differential private learning algorithms based on an output perturbation method. Perturbation ERM algorithms give a random output by adding a random perturbation term to the above deterministic output. That is,
$$f_{z,A} = f_z + b,$$
where $f_z$ is derived from (). To determine the distribution of $b$, we first recall the sensitivity, introduced in Dwork [], in our setting.
Definition 2 We denote by $\Delta f$ the maximum infinity norm of the difference between the outputs when changing one sample point in $z$. Let $z$ and $\bar z$ be defined as in the previous section, and $f_z$ and $f_{\bar z}$ be derived from () accordingly; we can see that
$$\Delta f = \sup_{z, \bar z} \|f_z - f_{\bar z}\|_\infty.$$
Then a result similar to [] is the following.
Lemma 1 Assume $\Delta f$ is bounded by $B > 0$, and $b$ has a density function proportional to $\exp\{-\frac{\epsilon |b|}{B}\}$. Then algorithm () provides $\epsilon$-differential privacy.
Proof For every possible output function $r$, and $z$, $\bar z$ differing in the last element, [...] So by the triangle inequality, [...] Then the lemma is proved by a union bound.
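The mechanism in the lemma can be sketched for a scalar output, for instance the value of the learned function at one fixed point. This is a minimal illustration under assumptions: the numbers `v_z`, `v_zbar` and the bound `B` are hypothetical, and the helper names are not from the paper. The noise density is proportional to $\exp\{-\epsilon |b| / B\}$, i.e., a Laplace density with scale $B / \epsilon$, and the code checks on a grid that the log-ratio of the two output densities never exceeds $\epsilon$, which is the triangle-inequality step of the proof.

```python
import math
import random


def laplace_log_density(b, scale):
    """Log of the Laplace density exp(-|b| / scale) / (2 * scale)."""
    return -abs(b) / scale - math.log(2.0 * scale)


def sample_laplace(scale, rng):
    """Draw Laplace noise as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def perturb(value, sensitivity_bound, eps, rng):
    """Output perturbation: add noise with density proportional to exp(-eps*|b|/B)."""
    return value + sample_laplace(sensitivity_bound / eps, rng)


def worst_log_ratio(v1, v2, sensitivity_bound, eps, grid):
    """Largest |log p(r - v1) - log p(r - v2)| over candidate outputs r."""
    scale = sensitivity_bound / eps
    return max(
        abs(laplace_log_density(r - v1, scale) - laplace_log_density(r - v2, scale))
        for r in grid
    )


eps, B = 0.5, 0.2            # privacy level and assumed sensitivity bound
v_z, v_zbar = 0.3, 0.5       # hypothetical outputs on neighbouring samples, |v_z - v_zbar| <= B
grid = [-5.0 + 0.01 * k for k in range(1001)]

worst = worst_log_ratio(v_z, v_zbar, B, eps, grid)  # should not exceed eps
noisy_output = perturb(v_z, B, eps, random.Random(0))
```

A smaller `eps` or a larger `B` increases the noise scale `B / eps`, which is the privacy versus accuracy trade-off examined in the error analysis.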
Combining this with the result in the previous section, we can choose the noise term b as follows.

Proposition  Assume the conditions in Theorem  hold, and b takes value in (-∞, +∞), we choose the density of b to be
, where α = κ  C L λγ m , then the algorithm () provides -differential privacy.
Proof Since from the previous section we have for any z andz differing in the last sample point. Then from the reproducing property, The proposition is proved by substitute B = κ  C L λγ m in the last lemma.

Error analysis
In this section, we conduct the error analysis for the general differential private ERM algorithm (). We denote
$$f_\rho = \arg\min_{f} \int_Z L(f(x), y)\, d\rho$$
as our goal function. In the rest of this section, we always assume the Lipschitz continuity condition for the loss function, i.e., $|L(t_1, y) - L(t_2, y)| \le C_L |t_1 - t_2|$ for any $t_1, t_2, y$ and some $C_L > 0$. Now let us introduce our error decomposition, [...] where $f_\lambda$ is a function in $H_K$ to be determined and [...] Here $R_1$ and $R_2$ involve the function $f_{z,A}$ from the random algorithm (), so we call them random errors. $S$ and $D(\lambda)$ are similar to the classical ones in the learning theory literature and are called the sample error and the approximation error. In the following we study these errors in turn.

Concentration inequality and error bounds for random errors
To bound the first random error, we need a concentration inequality. Dwork [] [...]
Theorem 3 If an algorithm $A$ provides $\epsilon$-differential privacy, and outputs a positive function $g_{z,A}$: [...] where the expectation is taken over the sample and the output of the random algorithm. Then [...]
Proof Denote the sample sets $w_j = \{z_1, z_2, \ldots, z_{j-1}, \bar z_j, z_{j+1}, \ldots, z_m\}$ for $j \in \{1, 2, \ldots, m\}$. We observe that [...] Then [...] On the other hand, [...] This leads to [...] These verify our results.
Since $y$ is bounded by $M > 0$ throughout the paper, it is reasonable to assume that $E_z(0) = \frac{1}{m} \sum_{i=1}^{m} L(0, y_i) \le B_0$ for some $B_0 > 0$ depending only on $M$. We then apply this concentration inequality to the random error $R_1$.

Remark 
Proposition 2 Let $f_{z,A}$ be obtained from algorithm (). Assume $E_z(0) \le B_0$ for some constant $B_0 > 0$. We have [...]
Proof Let $g_{z,A}(z) = L(f_{z,A}(x), y)$, which is always positive. Note that [...] and we have [...] By applying the concentration inequality to the given $g_{z,A}$ we can prove the result with constant $\tilde B = (B_0 + \lambda \Omega(0))$.
For the random error R  , we have the following estimation.
Proposition 3 For the function $f_{z,A}$ obtained from algorithm (), we have [...]
Proof Note that [...] Therefore, [...] This verifies our bound.

Error estimate for the other error terms
For the sample error and the approximation error, we choose $f_\lambda$ to be some function in $H_K$ close to $f_\rho$, which satisfies $|L(f_\lambda(x), y)| \le B_\rho$ for some $B_\rho > 0$. Explicit expressions of $f_\lambda$ and $B_\rho$ will be presented in the next section for different algorithms. To bound the sample error, we recall the Hoeffding inequality [].
Lemma  Let ξ be a random variable on a probability space Z satisfying |ξ (z) -Eξ | ≤ for some >  for almost all z ∈ Z. Denote σ  = σ  (ξ ), then, for any t > , Now we have the following proposition.
Let us turn to the approximation error $D(\lambda)$. It is difficult to give an upper bound for the abstract approximation error, so we use the natural assumption on $D(\lambda)$, which is
$$D(\lambda) \le c_\beta \lambda^\beta$$
for some $0 < \beta < 1$ and $c_\beta > 0$. This assumption is trivial in concrete algorithms; see [-], etc.

Total error bound
Now we can deduce our total error by combining all the error bounds above.
Here we present a general convergence result for the general differential private ERM learning algorithms. In this theorem, we provide a choice for the parameters and λ, under some conditions above, which leads to a learning rate m -β/(β+) with fixed B and γ . However, in an explicit algorithm B and γ may depend on λ and the learning rate will vary accordingly. We cannot go further without a specific description of the algorithms, which will be studied in the next section.

Applications
In this section, we apply our results to several frequently used learning algorithms. First of all, let us take a look at the assumptions regarding $f_\rho$. Denote the integral operator $L_K$ as $L_K f(t) = \int_X f(x) K(x, t)\, d\rho_X(x)$. It is well known [] that $\|L_K\| \le \kappa^2$. Then $f_\rho \in L_K^r(L^2_{\rho_X})$ for some $r > 0$ is often used in the learning theory literature. When $r = 1/2$, it is the same as $f_\rho \in H_K$ []. It is natural to consider $L(\pi(f(x)), y) \le L(f(x), y)$ for any function $f$ and $(x, y) \in Z$, which means $\pi(f(x))$ is closer than $f(x)$ to $y$ in some sense, as $|y| \le M$. Here
$$\pi(f(x)) = \begin{cases} M, & f(x) > M, \\ f(x), & |f(x)| \le M, \\ -M, & f(x) < -M. \end{cases}$$
Then $\int_Z L(\pi(f_\rho(x)), y)\, d\rho \le \int_Z L(f_\rho(x), y)\, d\rho$, i.e., $|f_\rho(x)| \le M$ always holds. So without loss of generality, we also assume $\|f_\rho\|_\infty \le M$.