Differentially Private Stochastic Gradient Descent with Low-Noise

Modern machine learning algorithms aim to extract fine-grained information from data to provide accurate predictions, which often conflicts with the goal of privacy protection. This paper addresses the practical and theoretical importance of developing privacy-preserving machine learning algorithms that ensure good performance while preserving privacy. In this paper, we focus on the privacy and utility (measured by excess risk bounds) performance of differentially private stochastic gradient descent (SGD) algorithms in the setting of stochastic convex optimization. Specifically, we examine pointwise learning in the low-noise setting, for which we derive sharper excess risk bounds for the differentially private SGD algorithm. In the pairwise learning setting, we propose a simple differentially private SGD algorithm based on gradient perturbation. Furthermore, we develop novel utility bounds for the proposed algorithm, proving that it achieves optimal excess risk rates even for non-smooth losses. Notably, we establish fast learning rates for privacy-preserving pairwise learning under the low-noise condition, which is the first result of its kind.


Introduction
Stochastic gradient descent (SGD) iteratively updates model parameters using the gradient information over a small batch of random examples, which reduces the computation cost and makes it amenable to solving large-scale problems. Due to its low computational overhead and easy implementation, it has become the workhorse algorithm for training many machine learning models [9,12,19,26,31,32,34,37,38,47,48,51].
On another important front, sharing the gradient information of machine learning models poses a significant risk of privacy leakage, because gradients often embed knowledge about the training data. For instance, [53] provides paradigms for breaching privacy and reconstructing training examples from publicly shared gradients, and [39] shows that the membership of a data record can be inferred from a binary classifier trained on gradients. As SGD is widely deployed in machine learning models, it is of pivotal importance to develop privacy-preserving SGD algorithms to mitigate the risk of privacy leakage from gradients.
In this paper, we are concerned with differentially private SGD (DP-SGD) algorithms in a setting of stochastic convex optimization (SCO) for both pointwise and pairwise learning problems. Differential privacy (DP) [13] is a de facto concept for designing private algorithms, which defines a rigorous attack model independent of background knowledge and gives a quantitative representation of the degree of privacy leakage. There is a considerable amount of work [2,3,4,5,15,41,43,44,46,48] on analyzing the utility guarantee (i.e., statistical generalization performance) of DP-SGD algorithms. In particular, [2,5,15,43,48] have shown that private SGD algorithms can achieve the optimal excess population risk rate $O\big(\frac{1}{\sqrt{n}} + \frac{\sqrt{d\log(1/\delta)}}{n\epsilon}\big)$ for solving convex problems in different settings, where $n$ is the sample size, $d$ is the dimensionality, and $(\epsilon, \delta)$ are privacy parameters. One natural question then arises: can DP-SGD algorithms achieve faster utility rates beyond $O\big(\frac{1}{\sqrt{n}} + \frac{\sqrt{d\log(1/\delta)}}{n\epsilon}\big)$?
We provide an affirmative answer to the above question under a low-noise condition. In particular, we conduct a comprehensive study of DP-SGD for both pointwise and pairwise learning as well as both smooth and non-smooth losses, which yields faster utility bounds in terms of the excess population risk. Our main contributions are listed as follows:
• For pointwise learning problems, we first show that the DP-SGD algorithm with gradient perturbation can achieve the rate $O\big(\frac{1}{\sqrt{n}} + \frac{\sqrt{d\log(1/\delta)}}{n\epsilon}\big)$ for both strongly smooth and $\alpha$-Hölder smooth losses, which matches the results in the recent work [43]. Under a low-noise condition, we remove the term $O\big(\frac{1}{\sqrt{n}}\big)$ and obtain sharper excess risk bounds of order $O\big(\frac{\sqrt{d\log(1/\delta)}}{n\epsilon}\big)$ for strongly smooth losses and $O\big(n^{-\frac{1+\alpha}{2}} + \frac{\sqrt{d\log(1/\delta)}}{n\epsilon}\big)$ for $\alpha$-Hölder smooth losses.
• We propose a simple differentially private SGD algorithm for pairwise learning with utility guarantees. Specifically, for strongly smooth losses, our algorithm only requires gradient complexity $O(n)$ to achieve the excess risk rate $O\big(\frac{1}{\sqrt{n}} + \frac{\sqrt{d\log(1/\delta)}}{n\epsilon}\big)$, while [46] and [48] require $O(n^3\log(1/\delta))$ and $O(n\log(1/\delta))$, respectively. We also show that this rate can be achieved even if the loss is non-smooth. Further, for both strongly smooth and non-smooth pairwise losses, we establish faster excess risk bounds under a low-noise condition. To the best of our knowledge, this is the first utility analysis which provides excess risk bounds better than $O\big(\frac{1}{\sqrt{n}} + \frac{\sqrt{d\log(1/\delta)}}{n\epsilon}\big)$ for privacy-preserving pairwise learning.
Tables 1 and 2 summarize the excess risk bounds, assumptions on losses, and gradient complexity of our methods in comparison with related work.

Table 1: Comparison of different $(\epsilon, \delta)$-DP algorithms for pointwise learning. We report the assumptions on losses, the gradient complexity and the utility bound for DP-SGD algorithms. Here, $\alpha$-Hölder denotes $\alpha$-Hölder smooth losses.
Organization of the Paper. The rest of the paper is organized as follows. In Section 2, we present the formulations of pointwise and pairwise learning together with basic concepts of differential privacy. In Section 3, we introduce the DP-SGD algorithms in the settings of pointwise and pairwise learning and present their privacy and utility guarantees. The main proofs are given in Section 4. We conclude the paper in Section 5.

Table 2: Comparison of different $(\epsilon, \delta)$-DP algorithms for pairwise learning (columns: work, method, smoothness assumption, low-noise condition, gradient complexity, and utility bound). We report the results for three types of methods, i.e., gradient descent with output perturbation (Output GD), localized gradient descent (Localized GD), and SGD with gradient perturbation (Gradient SGD). All methods assume the loss is Lipschitz continuous.

Learning Setting and Preliminaries
Let $\rho$ be a probability measure defined on $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, where $\mathcal{X} \subset \mathbb{R}^d$ is an input space and $\mathcal{Y} \subset \mathbb{R}$ is an output space. In the standard framework of statistical learning theory [7,42], one considers the problem of learning from a training dataset $S = \{z_i\}_{i=1}^n$, where each $z_i$ is drawn independently from $\rho$. In the subsequent subsections, we describe the settings of pointwise and pairwise learning and the definition of differential privacy, and illustrate the goal of utility analysis.

Pointwise and Pairwise Learning
In the task of pointwise learning such as classification and regression, we aim to learn a model $w \in \mathcal{W} \subset \mathbb{R}^d$ from the training data $S$ and measure the quality of $w$ using a pointwise loss function $f(w; z)$ on a single datum $z = (x, y)$. The population risk for pointwise learning is given by $F(w) = \mathbb{E}_{z\sim\rho}[f(w; z)]$. The corresponding empirical risk minimization (ERM) problem based on the training dataset $S$ is defined by
$$\min_{w\in\mathcal{W}} F_S(w) := \frac{1}{n}\sum_{i=1}^n f(w; z_i). \qquad (1)$$
In contrast to pointwise learning, the performance of a model $w$ for pairwise learning is measured on a pair of examples $(z, z')$ by a loss function $f(w; z, z')$ [45,48,27,28]. Many machine learning problems can be formulated as learning with pairwise loss functions, including AUC maximization [11,16,35,49,52], metric learning [6,8,23], the minimum error entropy principle [21] and ranking [1,10]. We use $\bar{F}(w)$ to denote the population risk, i.e., $\bar{F}(w) = \mathbb{E}_{z,z'\sim\rho}[f(w; z, z')]$. Let $w^* = \arg\min_{w\in\mathcal{W}} \bar{F}(w)$ be the best model, and let $[n] := \{1, \ldots, n\}$. The ERM problem on the training data $S$ is given by
$$\min_{w\in\mathcal{W}} \bar{F}_S(w) := \frac{1}{n(n-1)}\sum_{i,j\in[n]:\, i\neq j} f(w; z_i, z_j). \qquad (2)$$
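To make the distinction concrete, the following minimal Python sketch computes the two empirical risks (1) and (2) for a generic loss; the logistic-style losses at the end are illustrative placeholders of our own choosing, not losses analyzed in this paper.

```python
import numpy as np

def pointwise_empirical_risk(w, Z, loss):
    """F_S(w) = (1/n) * sum_i f(w; z_i), cf. (1); Z is a list of examples z = (x, y)."""
    return float(np.mean([loss(w, z) for z in Z]))

def pairwise_empirical_risk(w, Z, loss):
    """bar F_S(w) = (1/(n(n-1))) * sum_{i != j} f(w; z_i, z_j), cf. (2)."""
    n = len(Z)
    total = sum(loss(w, Z[i], Z[j]) for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))

# Illustrative placeholder losses (assumptions for this sketch only):
logistic = lambda w, z: np.log1p(np.exp(-z[1] * np.dot(w, z[0])))        # pointwise
auc_pair = lambda w, z, z2: np.log1p(                                     # pairwise
    np.exp(-(z[1] - z2[1]) * np.dot(w, z[0] - z2[0])))
```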

Definition and Property of Differential Privacy
As a privacy-preserving technology with a rigorous mathematical guarantee, DP has been widely used in several areas [17,18,30,48]. Its definition is stated formally as follows.
Definition 1 (Differential Privacy (DP) [13]). We say a randomized algorithm $\mathcal{A}$ satisfies $(\epsilon, \delta)$-DP if, for any two neighboring datasets $S$ and $S'$ differing by one data point and any event $E$ in the output space of $\mathcal{A}$, there holds
$$\mathbb{P}[\mathcal{A}(S) \in E] \le e^{\epsilon}\,\mathbb{P}[\mathcal{A}(S') \in E] + \delta.$$
In particular, we say $\mathcal{A}$ satisfies $\epsilon$-DP if $\delta = 0$.
To show that a randomized algorithm satisfies DP, we need the following concept of $\ell_2$-sensitivity. Let $\|\cdot\|_2$ denote the Euclidean norm.
Definition 2. The $\ell_2$-sensitivity of a function (mechanism) $\mathcal{M}: \mathcal{Z}^n \to \mathcal{W}$ is defined as $\Delta = \sup_{S, S'} \|\mathcal{M}(S) - \mathcal{M}(S')\|_2$, where $S$ and $S'$ are neighboring datasets differing by one data point.
A basic mechanism to achieve $(\epsilon, \delta)$-DP is the Gaussian mechanism, stated as follows.

Lemma 1 ([14]). Given a function $\mathcal{M}: \mathcal{Z}^n \to \mathcal{W}$ with $\ell_2$-sensitivity $\Delta$ and a dataset $S \subset \mathcal{Z}^n$, assume $\epsilon \in (0, 1)$. The following Gaussian mechanism yields $(\epsilon, \delta)$-DP:
$$\mathcal{G}(S) = \mathcal{M}(S) + b, \quad b \sim \mathcal{N}(0, \sigma^2 I_d) \ \text{ with } \ \sigma \ge \frac{\sqrt{2\log(1.25/\delta)}\,\Delta}{\epsilon},$$
where $I_d$ is the identity matrix in $\mathbb{R}^{d\times d}$.
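As an illustration, here is a minimal sketch of the Gaussian mechanism of Lemma 1. The calibration $\sigma = \sqrt{2\log(1.25/\delta)}\,\Delta/\epsilon$ follows the standard statement in [14]; the mean query in the usage example is our own choice, picked because its sensitivity is easy to compute.

```python
import numpy as np

def gaussian_mechanism(value, sensitivity, eps, delta, rng=np.random.default_rng()):
    """Release `value` (a d-dimensional vector) with (eps, delta)-DP, eps in (0, 1).

    Noise scale follows the standard calibration sigma = sqrt(2 log(1.25/delta)) * Delta / eps.
    """
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) * sensitivity / eps
    return value + rng.normal(0.0, sigma, size=np.shape(value))

# Example: privately release the mean of n vectors with coordinates in [0, 1].
# Replacing one point changes the mean by at most sqrt(d)/n in the l2 norm.
X = np.random.rand(1000, 5)
n, d = X.shape
private_mean = gaussian_mechanism(X.mean(axis=0), np.sqrt(d) / n, eps=0.5, delta=1e-5)
```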
We are interested in DP-SGD with strongly smooth and $\alpha$-Hölder smooth losses, respectively.

Definition 3. We say a function $w \mapsto f(w)$ is $L$-smooth (strongly smooth) with parameter $L > 0$ if for any $w, w' \in \mathcal{W}$, there holds
$$\|\partial f(w) - \partial f(w')\|_2 \le L\|w - w'\|_2,$$
where $\partial f(\cdot)$ denotes a (sub)gradient of $f$. We say a function $w \mapsto f(w)$ is $\alpha$-Hölder smooth with $\alpha \in [0, 1)$ and parameter $L$ if for any $w, w' \in \mathcal{W}$, there holds
$$\|\partial f(w) - \partial f(w')\|_2 \le L\|w - w'\|_2^{\alpha}.$$
The parameter $\alpha \in [0, 1)$ characterizes the smoothness of the function $f$. Specifically, if $\alpha = 0$, then $\partial f$ is bounded and $f$ is Lipschitz continuous as considered in Definition 4 below. This definition covers many non-smooth loss functions, including the hinge loss $\max\{0, 1 - y w^\top x\}^q$ for the $q$-norm soft margin SVM and the $q$-norm loss $|y - w^\top x|^q$ in regression, with $q \in [1, 2]$.
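For instance, one can verify the Hölder exponent of the $q$-norm hinge loss directly. Assuming $y \in \{-1, 1\}$, $\|x\|_2 \le 1$ and $q \in (1, 2]$, the gradient of $f(w) = \max\{0, 1 - y w^\top x\}^q$ is $\partial f(w) = -q\,y\,x \max\{0, 1 - y w^\top x\}^{q-1}$, and using $|\max\{0,a\}^{q-1} - \max\{0,b\}^{q-1}| \le |a - b|^{q-1}$ for $q - 1 \in (0, 1]$,
$$\|\partial f(w) - \partial f(w')\|_2 \le q\,\|x\|_2\,\big|(w' - w)^\top x\big|^{q-1} \le q\,\|x\|_2^{q}\,\|w - w'\|_2^{\,q-1},$$
so this loss is $\alpha$-Hölder smooth with $\alpha = q - 1$ and $L \le q$ under these assumptions.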

Target of Utility Analysis
We move on to describing the target of utility analysis of a randomized algorithm $\mathcal{A}$ used to solve the ERM problems (1) or (2). For simplicity, we elaborate on this by taking pointwise learning as an example; the same procedure applies to the case of pairwise learning.
To this end, let $\mathcal{A}(S)$ denote the output of $\mathcal{A}$ based on the training dataset $S$ for pointwise learning. The utility of the output of a randomized algorithm is measured by the excess population risk
$$F(\mathcal{A}(S)) - F(w^*),$$
where $w^* = \arg\min_{w\in\mathcal{W}} F(w)$ is the model with the best prediction performance over $\mathcal{W}$. To examine the excess population risk, we use the following error decomposition:
$$\mathbb{E}_{S,\mathcal{A}}[F(\mathcal{A}(S))] - F(w^*) = \mathbb{E}_{S,\mathcal{A}}\big[F(\mathcal{A}(S)) - F_S(\mathcal{A}(S))\big] + \mathbb{E}_{S,\mathcal{A}}\big[F_S(\mathcal{A}(S)) - F_S(w^*)\big],$$
where $\mathbb{E}_{S,\mathcal{A}}[\cdot]$ denotes the expectation w.r.t. both the randomness of $S$ and the internal randomness of $\mathcal{A}$, and we used $\mathbb{E}_S[F_S(w^*)] = F(w^*)$. The first term $\mathbb{E}_{S,\mathcal{A}}[F(\mathcal{A}(S)) - F_S(\mathcal{A}(S))]$ is called the generalization error, which measures the discrepancy between the population risk and the empirical one; it can be handled by stability analysis [2,7,20,25,29]. The second term is called the optimization error; we will use tools from optimization theory to control it.

Algorithm 1 DP-SGD for pointwise learning
1: Inputs: Data $S = \{z_i \in \mathcal{Z}: i = 1, \ldots, n\}$, loss function $f(w; z)$ with Lipschitz parameter $G$, the convex set $\mathcal{W} \subseteq \mathbb{R}^d$, step sizes $\{\eta_t\}$, privacy parameters $\epsilon, \delta$, and constant $\beta$.
2: Set: $w_1 = 0$
3: for $t = 1$ to $T$ do
4:    Sample $i_t$ from the uniform distribution over $[n]$, draw $b_t \sim \mathcal{N}(0, \sigma^2 I_d)$, and update $w_{t+1} = \mathrm{Proj}_{\mathcal{W}}\big(w_t - \eta_t(\partial f(w_t; z_{i_t}) + b_t)\big)$
5: end for
6: Output: $w_{\mathrm{priv}} = \frac{1}{T}\sum_{t=1}^T w_t$
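A minimal Python sketch of Algorithm 1 is given below. It follows the update rule described above (uniform sampling, Gaussian gradient perturbation, projection, and averaging of the iterates); the noise scale `sigma` is left as an input, since its exact calibration is fixed by the privacy analysis of Theorem 2, and the Euclidean-ball domain is an assumption of this sketch.

```python
import numpy as np

def dp_sgd_pointwise(grad_f, Z, d, T, eta, sigma, radius,
                     rng=np.random.default_rng(0)):
    """Sketch of Algorithm 1: projected SGD with Gaussian gradient perturbation.

    grad_f(w, z): (sub)gradient of f(w; z); Z: list of training examples;
    sigma: noise std, to be calibrated by the privacy analysis (Theorem 2);
    radius: radius of the Euclidean ball taken here as the domain W.
    """
    w = np.zeros(d)                                   # w_1 = 0
    running_sum = np.zeros(d)
    for _ in range(T):
        z = Z[rng.integers(len(Z))]                   # sample i_t uniformly from [n]
        g = grad_f(w, z) + rng.normal(0.0, sigma, d)  # noisy gradient: grad + b_t
        w = w - eta * g                               # gradient step
        norm = np.linalg.norm(w)
        if norm > radius:                             # Euclidean projection onto W
            w *= radius / norm
        running_sum += w
    return running_sum / T                            # w_priv: average of the iterates
```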
Throughout the paper, we assume the loss function f is convex and Lipschitz continuous with respect to (w.r.t.) the first argument.

Main Results
We present our main results in this section. First, we propose the differentially private SGD algorithm for pointwise learning, and systematically study the privacy and utility guarantees of the proposed algorithm. Then, we turn to pairwise learning problems. We present a simple differentially private SGD algorithm for pairwise learning and provide its privacy and utility guarantees.

DP-SGD for Pointwise Learning
In this subsection, we are interested in differentially private SGD for pointwise learning. To achieve $(\epsilon, \delta)$-differential privacy, we resort to the gradient perturbation mechanism, i.e., adding Gaussian noise to the stochastic gradient. The detailed algorithm is described in Algorithm 1. In particular, in each iteration $t$, the algorithm randomly selects a sample $z_{i_t}$ according to the uniform distribution over $[n]$, and then updates the model parameter $w_{t+1}$ based on the noisy gradient $\partial f(w_t; z_{i_t}) + b_t$ with $b_t \sim \mathcal{N}(0, \sigma^2 I_d)$. After $T$ iterations, Algorithm 1 outputs the private average model $w_{\mathrm{priv}} = \frac{1}{T}\sum_{t=1}^T w_t$, whose privacy guarantee is established in the following theorem.

Theorem 2 (Privacy guarantee). Suppose that the loss function $f$ is convex and $G$-Lipschitz. Then Algorithm 1 with some $\beta \in (0, 1)$ satisfies $(\epsilon, \delta)$-DP.

Remark 1. In Algorithm 1, the variance $\sigma^2$ of the Gaussian noise $b_t$ depends on a constant $\beta \in (0, 1)$, which should satisfy the conditions $\sigma^2 \ge 2.68G^2$ and $\lambda - 1 \le \frac{\sigma^2}{6G^2}\log\big(\frac{n}{\lambda(1 + \sigma^2/(4G^2))}\big)$ with $\lambda = \frac{\log(1/\delta)}{(1-\beta)\epsilon} + 1$. [43] studied DP-SGD with gradient perturbation for $\alpha$-Hölder smooth losses and gave a sufficient condition for the existence of such a $\beta$ under a specific parameter setting: they proved that if $n > 18$, $T = n$ and $\delta = 1/n^2$, then there exists at least one $\beta \in (0, 1)$ such that DP-SGD satisfies $(\epsilon, \delta)$-DP whenever $\epsilon$ exceeds an explicit threshold depending only on $n$ (see [43] for the precise condition). Indeed, our algorithm can be seen as a special case of their algorithm with $\alpha = 0$. Hence, we can also show the existence of $\beta$ under the same setting.
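To make the calibration concrete, the sketch below computes a noise variance of the form $\sigma^2 = \frac{14 G^2 T \lambda}{n^2 \beta\epsilon}$ with $\lambda = \frac{\log(1/\delta)}{(1-\beta)\epsilon} + 1$, which is the value consistent with the constants appearing in the proof of Theorem 2 below; treat it as an illustrative reading of that proof rather than a verified implementation, and note that the validity conditions of Remark 1 must still be checked.

```python
import numpy as np

def calibrate_sigma2(G, T, n, eps, delta, beta):
    """Hypothetical noise calibration for Algorithm 1, read off from the proof of
    Theorem 2: each step is (lambda, 14 G^2 lambda/(n^2 sigma^2))-RDP by Lemma 8,
    T-fold adaptive composition is matched to (lambda, beta*eps)-RDP, and Lemma 10
    converts the result to (eps, delta)-DP."""
    lam = np.log(1.0 / delta) / ((1.0 - beta) * eps) + 1.0
    sigma2 = 14.0 * G**2 * T * lam / (n**2 * beta * eps)
    # Validity conditions from Remark 1; both must hold for Lemma 8 to apply.
    cond1 = sigma2 >= 2.68 * G**2
    cond2 = lam - 1.0 <= sigma2 / (6.0 * G**2) * np.log(
        n / (lam * (1.0 + sigma2 / (4.0 * G**2))))
    return sigma2, bool(cond1 and cond2)
```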
The following theorem provides the utility guarantee for strongly smooth losses.
Theorem 3 (Utility guarantee for smooth losses). Suppose $f$ is nonnegative, convex, $G$-Lipschitz and $L$-smooth. Let $w_{\mathrm{priv}}$ be the output of Algorithm 1 with $T$ iterations. Then the following statements hold true.
(a) If $\eta_t = c/\max\{\sqrt{n}, \sqrt{d\log(1/\delta)}/\epsilon\}$ for some constant $c > 0$ and $T \asymp n$, then
$$\mathbb{E}_{S,\mathcal{A}}[F(w_{\mathrm{priv}})] - F(w^*) = O\Big(\frac{1}{\sqrt{n}} + \frac{\sqrt{d\log(1/\delta)}}{n\epsilon}\Big).$$
(b) In the low-noise case $F(w^*) = 0$, with the same choice of $\eta_t$ and $T \asymp n$, there holds
$$\mathbb{E}_{S,\mathcal{A}}[F(w_{\mathrm{priv}})] = O\Big(\frac{\sqrt{d\log(1/\delta)}}{n\epsilon}\Big).$$

Remark 2.
[43] established the optimal rate for the DP-SGD algorithm and improved the gradient complexity to $O(n)$ when the loss is strongly smooth and the parameter space is bounded. Our bound (part (a) of Theorem 3) achieves the optimal rate with gradient complexity $O(n)$ when the loss is strongly smooth and Lipschitz continuous. Compared with [43], we need the additional Lipschitz continuity assumption. However, this assumption can be removed if we assume the parameter domain is bounded in our setting: the smoothness of $f$ implies that the gradient norm can be controlled in terms of the diameter $R$ of the parameter domain, since $\|\partial f(w; z)\|_2 \le \|\partial f(0; z)\|_2 + LR$ for any $w \in \mathcal{W}$. Hence, our result achieves the optimal rate under the same assumptions as [43]. In the optimistic low-noise case with $F(w^*) = 0$, part (b) shows that the term $O(1/\sqrt{n})$ disappears and the rate improves to $O\big(\frac{\sqrt{d\log(1/\delta)}}{n\epsilon}\big)$.

Now, we turn to the more general case, i.e., the loss function is $\alpha$-Hölder smooth with $\alpha \in [0, 1)$. The following theorem presents the excess population risk bound for $\alpha$-Hölder smooth losses.
Theorem 4 (Utility guarantee for non-smooth losses). Suppose $f$ is nonnegative, convex, $G$-Lipschitz and $\alpha$-Hölder smooth with parameter $L$ and $\alpha \in [0, 1)$. Let $w_{\mathrm{priv}}$ be the output of Algorithm 1 with $T$ iterations. Then the following statements hold true.
(a) There is a constant step size $\eta_t = \eta$ such that, choosing $T \asymp n$ if $\alpha \in [1/2, 1)$ and $T \asymp n^{\frac{2-\alpha}{1+\alpha}}$ if $\alpha \in [0, 1/2)$,
$$\mathbb{E}_{S,\mathcal{A}}[F(w_{\mathrm{priv}})] - F(w^*) = O\Big(\frac{1}{\sqrt{n}} + \frac{\sqrt{d\log(1/\delta)}}{n\epsilon}\Big).$$
(b) In the low-noise case $F(w^*) = 0$, there are choices of $\eta$ and $T$ such that
$$\mathbb{E}_{S,\mathcal{A}}[F(w_{\mathrm{priv}})] = O\Big(n^{-\frac{1+\alpha}{2}} + \frac{\sqrt{d\log(1/\delta)}}{n\epsilon}\Big).$$

DP-SGD for Pairwise Learning
In this subsection, we first present the differentially private SGD algorithm for pairwise learning, and then establish its privacy and utility guarantees. The proposed algorithm is described in Algorithm 2. In particular, in iteration $t$, the algorithm draws a pair $(i_t, j_t)$ from the uniform distribution over all pairs $\{(i, j) \in [n] \times [n] : i \neq j\}$, and then updates the model based on the noisy gradient $\partial f(w_t; z_{i_t}, z_{j_t}) + b_t$ with $b_t \sim \mathcal{N}(0, \sigma^2 I_d)$; the remaining steps are the same as in Algorithm 1.
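A sketch of Algorithm 2 in Python is shown below; it differs from the pointwise sketch only in the sampling step (a uniformly random pair drawn without replacement) and in the pairwise gradient, matching the description above, with the noise scale again left to the privacy analysis.

```python
import numpy as np

def dp_sgd_pairwise(grad_f, Z, d, T, eta, sigma, radius,
                    rng=np.random.default_rng(0)):
    """Sketch of Algorithm 2: DP-SGD with gradient perturbation for pairwise learning.

    grad_f(w, z, z2): (sub)gradient of the pairwise loss f(w; z, z'); the noise
    std sigma is to be calibrated by the privacy analysis (Theorem 5).
    """
    w = np.zeros(d)
    running_sum = np.zeros(d)
    for _ in range(T):
        i, j = rng.choice(len(Z), size=2, replace=False)   # uniform pair, i != j
        g = grad_f(w, Z[i], Z[j]) + rng.normal(0.0, sigma, d)
        w = w - eta * g
        norm = np.linalg.norm(w)
        if norm > radius:                                  # projection onto W
            w *= radius / norm
        running_sum += w
    return running_sum / T                                 # private average model
```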
The following theorem establishes the privacy guarantee for Algorithm 2.

Theorem 5 (Privacy guarantee). Suppose that the loss function $f$ is convex and $G$-Lipschitz. Then Algorithm 2 with some $\beta \in (0, 1)$ satisfies $(\epsilon, \delta)$-DP.

By combining the stability results and the optimization error bounds (Lemmas 19 and 20 below), we establish the following utility guarantees for Algorithm 2 for strongly smooth and non-smooth losses, respectively.
Theorem 6 (Utility guarantee for smooth losses). Suppose $f$ is nonnegative, convex, $G$-Lipschitz and $L$-smooth. Let $\{w_t\}$ be produced by Algorithm 2 with $T$ iterations. Then the following statements hold true.
(a) If $\eta_t = c/\max\{\sqrt{n}, \sqrt{d\log(1/\delta)}/\epsilon\}$ for some constant $c > 0$ and $T \asymp n$, then
$$\mathbb{E}_{S,\mathcal{A}}[\bar{F}(w_{\mathrm{priv}})] - \bar{F}(w^*) = O\Big(\frac{1}{\sqrt{n}} + \frac{\sqrt{d\log(1/\delta)}}{n\epsilon}\Big).$$
(b) In the low-noise case $\bar{F}(w^*) = 0$, a faster rate of order $O\big(\frac{\sqrt{d\log(1/\delta)}}{n\epsilon}\big)$ holds with suitable $\eta_t$ and $T \asymp n$.
Remark 4. We now compare our results with related work on pairwise learning. Under the strongly smooth and Lipschitz continuity assumptions, [22] proposed a gradient descent algorithm with output perturbation to achieve DP and provided an excess population risk bound (see Table 2 for its order and a comparison with our results).

The following theorem establishes the utility bounds for Algorithm 2 when the loss is non-smooth.
Theorem 7 (Utility guarantee for non-smooth losses). Suppose $f$ is nonnegative, convex, $G$-Lipschitz and $\alpha$-Hölder smooth with parameter $L$ and $\alpha \in [0, 1)$. Let $\{w_t\}$ be produced by Algorithm 2 with $T$ iterations. Then the following statements hold true.
(a) There is a constant step size $\eta_t = \eta$ such that, choosing $T \asymp n$ if $\alpha \in [1/2, 1)$ and $T \asymp n^{\frac{2-\alpha}{1+\alpha}}$ if $\alpha \in [0, 1/2)$,
$$\mathbb{E}_{S,\mathcal{A}}[\bar{F}(w_{\mathrm{priv}})] - \bar{F}(w^*) = O\Big(\frac{1}{\sqrt{n}} + \frac{\sqrt{d\log(1/\delta)}}{n\epsilon}\Big).$$
(b) In the low-noise case $\bar{F}(w^*) = 0$, faster excess population risk bounds hold with suitable choices of $\eta$ and $T$.
Remark 5. Part (a) of the above theorem shows that the optimal rate $O\big(\frac{1}{\sqrt{n}} + \frac{\sqrt{d\log(1/\delta)}}{n\epsilon}\big)$ can be achieved with gradient complexity $T \asymp n$ if $\alpha \ge 1/2$. For the case $\alpha < 1/2$, the same rate can still be achieved with the larger gradient complexity $O\big(n^{\frac{2-\alpha}{1+\alpha}}\big)$. For non-smooth losses (i.e., $\alpha = 0$), [48] established the optimal excess population risk rate for a localized DP-SGD algorithm with gradient complexity $O(n^2\log(1/\delta))$ for Lipschitz continuous losses. Under the same assumptions, part (a) with $\alpha = 0$ implies that the optimal rate can be achieved with gradient complexity $O(n^2)$. Our result thus reduces the computational cost by a factor of $O(\log(1/\delta))$ in this case. Part (b) establishes the first excess population risk bounds better than $O\big(\frac{1}{\sqrt{n}} + \frac{\sqrt{d\log(1/\delta)}}{n\epsilon}\big)$ in the low-noise case for privacy-preserving pairwise learning.

Proofs of Main Results
Before presenting the detailed proofs, we first introduce some definitions and useful lemmas. To establish a tighter privacy analysis of DP-SGD, we introduce Rényi differential privacy (RDP), which provides tighter composition and amplification results for iterative algorithms.
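We recall the standard definition from [36]: a randomized algorithm $\mathcal{A}$ satisfies $(\lambda, \epsilon)$-RDP with order $\lambda > 1$ if, for all neighboring datasets $S, S'$,
$$D_{\lambda}\big(\mathcal{A}(S)\,\|\,\mathcal{A}(S')\big) := \frac{1}{\lambda - 1}\,\log\,\mathbb{E}_{\theta\sim\mathcal{A}(S')}\Big[\Big(\frac{p_{\mathcal{A}(S)}(\theta)}{p_{\mathcal{A}(S')}(\theta)}\Big)^{\lambda}\Big] \le \epsilon,$$
where $p_{\mathcal{A}(S)}$ denotes the density of the output distribution of $\mathcal{A}(S)$.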
The following lemma shows the privacy amplification of RDP by uniform subsampling, which is fundamental for establishing privacy guarantees of noisy SGD algorithms.

Lemma 8 ([33]). Consider a function $\mathcal{M}: \mathcal{Z}^n \to \mathcal{W}$ with $\ell_2$-sensitivity $\Delta$, and a dataset $S \subset \mathcal{Z}^n$. The Gaussian mechanism $\mathcal{G}(S) = \mathcal{M}(S) + b$ with $b \sim \mathcal{N}(0, \sigma^2 I_d)$, applied to a subset of samples drawn uniformly without replacement with subsampling rate $p$, satisfies $\big(\lambda, \frac{3.5 p^2 \Delta^2 \lambda}{\sigma^2}\big)$-RDP, provided that $\sigma^2 \ge 0.67\Delta^2$ and $\lambda - 1 \le \frac{2\sigma^2}{3\Delta^2}\log\Big(\frac{1}{\lambda p (1 + \sigma^2/\Delta^2)}\Big)$.

We say a sequence of mechanisms $(\mathcal{A}_1, \ldots, \mathcal{A}_k)$ is chosen adaptively if $\mathcal{A}_i$ can be chosen based on the outputs of the previous mechanisms $\mathcal{A}_1(S), \ldots, \mathcal{A}_{i-1}(S)$ for any $i \in [k]$.
Lemma 9 (Adaptive Composition of RDP [36]). If a mechanism $\mathcal{A}$ consists of a sequence of adaptive mechanisms $(\mathcal{A}_1, \ldots, \mathcal{A}_k)$ such that $\mathcal{A}_i$ is $(\lambda, \epsilon_i)$-RDP for any $i \in [k]$, then $\mathcal{A}$ is $\big(\lambda, \sum_{i=1}^k \epsilon_i\big)$-RDP.

The relationship between RDP and $(\epsilon, \delta)$-DP is given as follows.

Lemma 10 ([36]). If a mechanism $\mathcal{A}$ is $(\lambda, \epsilon)$-RDP, then $\mathcal{A}$ is $\big(\epsilon + \frac{\log(1/\delta)}{\lambda - 1}, \delta\big)$-DP for any $\delta \in (0, 1)$.
A fundamental property of DP, the post-processing property, is introduced as follows. It implies that a differentially private output can be arbitrarily transformed by data-independent functions without weakening its privacy guarantee.

Lemma 11 (Post-processing [13]). Let $\mathcal{A}$ be an $(\epsilon, \delta)$-DP (resp. $(\lambda, \epsilon)$-RDP) mechanism and let $g$ be an arbitrary (possibly randomized) data-independent mapping. Then $g \circ \mathcal{A}$ is $(\epsilon, \delta)$-DP (resp. $(\lambda, \epsilon)$-RDP).
Our analysis requires a self-bounding property [40,50] for strongly smooth and $\alpha$-Hölder smooth losses, which means that gradient norms can be controlled by function values.

Lemma 12 ([40,50]). (a) If $f$ is nonnegative and $L$-smooth, then $\|\partial f(w)\|_2^2 \le 2L f(w)$. (b) If $f$ is nonnegative, convex and $\alpha$-Hölder smooth with parameter $L$ and $\alpha \in (0, 1)$, then there is a constant $c_{\alpha} > 0$ depending only on $\alpha$ and $L$ such that $\|\partial f(w)\|_2 \le c_{\alpha}\, f^{\frac{\alpha}{1+\alpha}}(w)$.
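Part (a) admits a one-line derivation worth recording, since it is used repeatedly below: for nonnegative $L$-smooth $f$, evaluate the smoothness upper bound at the point $w - \partial f(w)/L$,
$$0 \le f\Big(w - \frac{\partial f(w)}{L}\Big) \le f(w) - \Big\langle \partial f(w), \frac{\partial f(w)}{L}\Big\rangle + \frac{L}{2}\cdot\frac{\|\partial f(w)\|_2^2}{L^2} = f(w) - \frac{\|\partial f(w)\|_2^2}{2L},$$
which rearranges to $\|\partial f(w)\|_2^2 \le 2L f(w)$.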
We will use the following notion of on-average argument stability [29] to study the generalization error. Let $S = \{z_i\}_{i=1}^n$ and $\widetilde{S} = \{\tilde{z}_i\}_{i=1}^n$ be drawn independently from $\rho$, and for each $i \in [n]$ let $S^{(i)} = \{z_1, \ldots, z_{i-1}, \tilde{z}_i, z_{i+1}, \ldots, z_n\}$ be the dataset obtained by replacing the $i$-th element of $S$. The on-average argument stability of a randomized algorithm $\mathcal{A}$ is measured by $\frac{1}{n}\sum_{i=1}^n \mathbb{E}_{S,\widetilde{S},\mathcal{A}}\big[\|\mathcal{A}(S) - \mathcal{A}(S^{(i)})\|_2^2\big]$.

Proofs for Pointwise Learning
We first give the proof of the privacy guarantee for Algorithm 1. Specifically, according to the Lipschitz continuity of $f$, we show that the $\ell_2$-sensitivity of $\mathcal{M}_t = \partial f(w_t; z_{i_t})$ is $2G$. Then, by Lemma 8 and the post-processing property, we know that $w_{t+1}$ is $\big(\frac{\log(1/\delta)}{(1-\beta)\epsilon} + 1, \frac{\beta\epsilon}{T}\big)$-RDP for any $t = 1, \ldots, T$. Further, we use the adaptive composition theorem (Lemma 9) and the connection between RDP and DP (Lemma 10) to show that $w_{\mathrm{priv}}$ satisfies $(\epsilon, \delta)$-DP. The detailed proof is as follows.
Proof of Theorem 2. For each iteration $t$, let $\mathcal{A}_t = \mathcal{M}_t + b_t$, where $\mathcal{M}_t = \partial f(w_t; z_{i_t})$. For any $w_t \in \mathcal{W}$ and any $z_{i_t}, z'_{i_t} \in \mathcal{Z}$, the Lipschitz continuity of $f$ implies $\|\partial f(w_t; z_{i_t}) - \partial f(w_t; z'_{i_t})\|_2 \le 2G$. From the definition of sensitivity (see Definition 2), the $\ell_2$-sensitivity of $\mathcal{M}_t$ is bounded by $2G$. According to Lemma 8 with $p = 1/n$, we know $\mathcal{A}_t$ is $\big(\lambda, \frac{14 G^2\lambda}{n^2\sigma^2}\big)$-RDP as long as $\sigma^2 \ge 2.68G^2$ and $\lambda - 1 \le \frac{\sigma^2}{6G^2}\log\big(\frac{n}{\lambda(1+\sigma^2/(4G^2))}\big)$ hold. Now, let $\lambda = \frac{\log(1/\delta)}{(1-\beta)\epsilon} + 1$; with the choice of $\sigma^2$ in Algorithm 1, $\mathcal{A}_t$ is $\big(\lambda, \frac{\beta\epsilon}{T}\big)$-RDP for any $t = 1, \ldots, T$, and so is $w_{t+1}$ by the post-processing property (Lemma 11). According to the adaptive composition theorem of RDP (Lemma 9), Algorithm 1 is $\big(\frac{\log(1/\delta)}{(1-\beta)\epsilon} + 1, \beta\epsilon\big)$-RDP. Finally, the relationship between RDP and DP (Lemma 10) implies that Algorithm 1 is $(\epsilon, \delta)$-DP, provided the above conditions on $\sigma^2$ and $\lambda$ hold. The proof is completed.
To study the utility guarantee of Algorithm 1, we need to estimate the generalization error $\mathbb{E}_{S,\mathcal{A}}[F(w_{\mathrm{priv}}) - F_S(w_{\mathrm{priv}})]$ and the optimization error $\mathbb{E}_{S,\mathcal{A}}[F_S(w_{\mathrm{priv}}) - F_S(w^*)]$, respectively. We will use on-average argument stability to study the generalization error, which measures the sensitivity of the output model of an algorithm. The relationship between the generalization error and on-average argument stability is established in the following lemma [29].
Lemma 13 ([29]). Suppose $f$ is nonnegative and convex.
(a) If $f$ is $L$-smooth, then for any $\gamma > 0$,
$$\mathbb{E}_{S,\mathcal{A}}\big[F(\mathcal{A}(S)) - F_S(\mathcal{A}(S))\big] \le \frac{L}{\gamma}\,\mathbb{E}_{S,\mathcal{A}}\big[F_S(\mathcal{A}(S))\big] + \frac{L+\gamma}{2n}\sum_{i=1}^n \mathbb{E}_{S,\widetilde{S},\mathcal{A}}\big[\|\mathcal{A}(S^{(i)}) - \mathcal{A}(S)\|_2^2\big].$$
(b) If $f$ is $\alpha$-Hölder smooth with parameter $L$ and $\alpha \in [0, 1)$, an analogous bound holds with $L$ replaced by the self-bounding constant $c_{\alpha}$ of Lemma 12; we refer to [29] for the precise statement.

Since the noise added to the gradient in each iteration is the same for two neighboring datasets, the noise addition does not affect the stability analysis. Therefore, the on-average argument stability of non-private SGD equals that of private SGD, and we can directly use the following stability bounds from [29] for Algorithm 1, for both strongly smooth and non-smooth losses.
Lemma 14 ([29]). Consider (the non-private counterpart of) Algorithm 1 with step sizes $\{\eta_t\}$.
(a) If $f$ is $L$-smooth and $\eta_t \le 2/L$ for all $t \in [T]$, then the on-average argument stability of the iterates is controlled in terms of $\sum_{t=1}^T \eta_t^2$ and the empirical risks $\mathbb{E}[F_S(w_t)]$.
(b) If $f$ is $\alpha$-Hölder smooth with parameter $L$ and $\alpha \in [0, 1)$, an analogous bound holds with constants depending on $c_{\alpha}$.
We refer to [29] for the precise statements.

The following theorem presents generalization bounds of DP-SGD for both smooth and non-smooth losses, which follow directly from Lemma 13 and Lemma 14.
Theorem 15 (Generalization bounds). Suppose $f$ is nonnegative and convex. Let $\mathcal{W} = \mathbb{R}^d$ and let $\mathcal{A}$ be Algorithm 1 with $T$ iterations. Let $\gamma > 0$.
(a) If $f$ is $L$-smooth and $\eta_t \le 2/L$ for all $t \in [T]$, then the generalization error $\mathbb{E}_{S,\mathcal{A}}[F(w_{\mathrm{priv}}) - F_S(w_{\mathrm{priv}})]$ satisfies the bound obtained by inserting the stability estimate of Lemma 14(a) into Lemma 13(a).
(b) If $f$ is $\alpha$-Hölder smooth with parameter $L$ and $\alpha \in [0, 1)$, the analogous bound follows from Lemma 14(b) and Lemma 13(b).

In the following theorem, we use techniques from optimization theory to control the optimization error in expectation.
Recall $w^* = \arg\min_{w\in\mathcal{W}} F(w)$.

Theorem 16 (Optimization error). Suppose $f$ is nonnegative and convex. Let $\{w_t\}$ be produced by Algorithm 1, and assume the step size $\eta_t$ is nonincreasing.
(a) If $f$ is $L$-smooth and $\eta_t \le 2/L$ for all $t \in [T]$, then
$$\mathbb{E}\big[F_S(w_{\mathrm{priv}}) - F_S(w^*)\big] = O\Big(\frac{\|w^*\|_2^2 + \big(\sigma^2 d + F(w^*)\big)\sum_{t=1}^T \eta_t^2}{\sum_{t=1}^T \eta_t}\Big).$$
(b) If $f$ is $\alpha$-Hölder smooth with parameter $L$ and $\alpha \in [0, 1)$, an analogous bound holds with constants depending on $c_{\alpha}$.

Proof. Note that the projection operator $\mathrm{Proj}_{\mathcal{W}}$ is non-expansive. Then, for any $\alpha \in [0, 1]$, expanding $\|w_{j+1} - w^*\|_2^2$ yields a one-step progress bound, where in the second inequality we use $(a + b)^2 \le (1 + p)a^2 + (1 + 1/p)b^2$ with $p = 1/2$, and the last inequality is due to the self-bounding property (Lemma 12) and the convexity of $f$. Rearranging this inequality, taking a summation over $j$ and noting $w_1 = 0$, and then taking expectations w.r.t. $\mathcal{A}$ (here $w_j$ is independent of $i_j$, and $b_j$ is a Gaussian vector with mean $0$ and covariance $\sigma^2 I_d$ independent of $w^* - w_j$, so the cross terms vanish), we arrive at the bound (6) on $\sum_{j=1}^t \eta_j \mathbb{E}[F_S(w_j) - F_S(w^*)]$. To control the right-hand side of (7), we estimate the weighted gradient norms: by Young's inequality $ab \le p^{-1}|a|^p + q^{-1}|b|^q$ with $a, b \in \mathbb{R}$ and $p^{-1} + q^{-1} = 1$, these terms can be bounded for any $t \in [T]$; putting the result back into (6), rearranging, and multiplying both sides by $\eta_t$ (using the assumption $\eta_t \ge \eta_{t+1}$ for all $t \in [T-1]$) gives a recursive inequality. Taking a summation over $j$ and noting $w_1 = 0$, the concavity of $x \mapsto x^{\frac{2\alpha}{1+\alpha}}$ and Jensen's inequality allow us to plug the resulting estimate back into (7). Since $w_j$ is independent of $b_j$ and $i_j$, Eq. (8) lets us combine the two inequalities; multiplying both sides by $\eta_{t+1}$ followed by a summation, part (a) then follows, and plugging the resulting estimate into (10) proves part (b).

Proof of Theorem 3. Let $\eta_t = \eta \le \min\{2/L, 1\}$ and assume $T \ge n$. Note $w_{\mathrm{priv}} = \frac{1}{T}\sum_{t=1}^T w_t$; then, according to Jensen's inequality, the excess risk of $w_{\mathrm{priv}}$ is controlled by the bound (11) combining Theorems 15 and 16. Recall that $\sigma^2 d = \frac{14 G^2 T d\lambda}{n^2\beta\epsilon} = O\big(\frac{T d\log(1/\delta)}{n^2\epsilon^2}\big)$.
(a) If we set $T \asymp n$ and $\gamma = \sqrt{n}$, then Eq. (11) yields the stated bound. Further, letting $\eta_t = c/\max\{\sqrt{n}, \sqrt{d\log(1/\delta)}/\epsilon\} \le \min\{2/L, 1\}$ for some constant $c > 0$, there holds
$$\mathbb{E}_{S,\mathcal{A}}[F(w_{\mathrm{priv}})] - F(w^*) = O\Big(\frac{1}{\sqrt{n}} + \frac{\sqrt{d\log(1/\delta)}}{n\epsilon}\Big),$$
where we assume $d\log(1/\delta) = O(n\epsilon)$ (otherwise the bound does not converge).
(b) Consider the low-noise case, i.e., $F(w^*) = 0$. Let $\gamma = 1$ and $T \asymp n$; then the $O(1/\sqrt{n})$ term vanishes and $\mathbb{E}_{S,\mathcal{A}}[F(w_{\mathrm{priv}})] = O\big(\frac{\sqrt{d\log(1/\delta)}}{n\epsilon}\big)$. The proof of the theorem is completed.
Finally, we provide the proof of the utility guarantee for Algorithm 1 when the loss is non-smooth.
Proof of Theorem 4. Note $\mathbb{E}_S[F_S(w^*)] = F(w^*)$ and $w_{\mathrm{priv}} = \frac{1}{T}\sum_{t=1}^T w_t$. By Jensen's inequality, the excess risk is controlled by the decomposition (12). We first estimate the term involving the exponent $\frac{2\alpha}{1+\alpha}$. Combining the two relevant inequalities yields a recursive bound on $\delta_{t+1}$, and solving this inequality for $\delta_{t+1}$ gives an explicit estimate. Assuming $T \ge n$, the definition of $\delta_{t+1}$ then provides the corresponding bound. If we set $\eta_t = \eta$, dividing both sides by $\eta$ gives the analogous estimate for the averaged iterates. Now, plugging these two inequalities back into (13), and using part (b) of Theorem 16 with $\eta_t = \eta$, we can plug (14) and (15) back into (12); note we assume $\eta T \ge 1$. Combining the resulting bound with Eq. (16) and choosing $T \asymp n$, for any $\alpha \in [1/2, 1)$ there holds
$$\mathbb{E}_{S,\mathcal{A}}[F(w_{\mathrm{priv}})] - F(w^*) = O\Big(\frac{1}{\sqrt{n}} + \frac{\sqrt{d\log(1/\delta)}}{n\epsilon}\Big).$$
For the case $\alpha \in [0, 1/2)$, let $\gamma = \sqrt{n}$ and choose $\eta = c\min\{n^{-a_1}, n^{-a_2}\epsilon/\sqrt{d\log(1/\delta)}\}$ for suitable exponents $a_1, a_2$ depending on $\alpha$. Similar to the discussion of part (a), this choice of $\eta$, together with $T \asymp n^{\frac{2-\alpha}{1+\alpha}}$ and Eq. (16), implies the same rate, where the last step uses $\alpha < 1/2$. The proof of part (a) is completed.

Proofs for Pairwise Learning
We now turn to the analysis of DP-SGD for pairwise learning (i.e., Algorithm 2) and provide the proofs of Theorems 6 and 7.
We start with the proof of Theorem 5. Specifically, we first prove that each iteration $t$ of the algorithm satisfies RDP by applying Lemma 8 with subsampling rate $p = 2/n$. Then, according to Lemma 9 and Lemma 10, we show that the proposed algorithm satisfies $(\epsilon, \delta)$-DP. The detailed proof is as follows.
Proof of Theorem 5. For each $t \in [T]$, consider the mechanism $\mathcal{A}_t = \mathcal{M}_t + b_t$, where $\mathcal{M}_t = \partial f(w_t; z_{i_t}, z_{j_t})$. As before, the Lipschitz continuity of $f$ shows that the $\ell_2$-sensitivity of $\mathcal{M}_t$ is $2G$. Note that $z_{i_t}$ and $z_{j_t}$ are drawn uniformly without replacement from the training set $S$. Then, according to Lemma 8 with $p = 2/n$, we know $\mathcal{A}_t$ satisfies $\big(\lambda, \frac{56 G^2\lambda}{n^2\sigma^2}\big)$-RDP as long as $\sigma^2 \ge 2.68G^2$ and the condition on $\lambda$ in Lemma 8 hold. Now, let $\lambda = \frac{\log(1/\delta)}{(1-\beta)\epsilon} + 1$; with the choice of $\sigma^2$ in Algorithm 2, $\mathcal{A}_t$ satisfies $\big(\lambda, \frac{\beta\epsilon}{T}\big)$-RDP. According to Lemma 11 and Lemma 9, Algorithm 2 is $\big(\frac{\log(1/\delta)}{(1-\beta)\epsilon} + 1, \beta\epsilon\big)$-RDP. Finally, Lemma 10 implies Algorithm 2 is $(\epsilon, \delta)$-DP, provided the above conditions hold. The proof is completed.
To establish the generalization analysis of Algorithm 2, we first introduce the connection between stability and the generalization error in the following lemma; its proof parallels the pointwise case, with the replace-one dataset $S^{(i)}$ replaced by the replace-two dataset $S^{(i,j)}$.

By the Cauchy–Schwarz inequality and the self-bounding property (Lemma 12), we can bound the relevant inner products. Plugging this inequality back into Eq. (17), using that $x \mapsto x^{\frac{2\alpha}{1+\alpha}}$ is concave together with Jensen's inequality, and noting that $z_i, z_j$ are independent of $\mathcal{A}(S^{(i,j)})$, we can combine the two resulting inequalities. The proof of part (b) is completed.
Our stability analysis for $\alpha$-Hölder smooth losses requires the following lemma, which shows the approximately non-expansive behavior of the gradient mapping $w \mapsto w - \eta\,\partial f(w; z, z')$.

Lemma 18 ([29]). Assume that for all $z, z' \in \mathcal{Z}$, the map $w \mapsto f(w; z, z')$ is convex and $w \mapsto \partial f(w; z, z')$ is $\alpha$-Hölder continuous with parameter $L$ and $\alpha \in [0, 1)$. Then for all $w, w'$, any $\eta > 0$ and any $p > 0$,
$$\big\|w - \eta\,\partial f(w; z, z') - w' + \eta\,\partial f(w'; z, z')\big\|_2^2 \le (1 + p)\|w - w'\|_2^2 + c_{\alpha, L}\,(1 + 1/p)\,\eta^{\frac{2}{1-\alpha}},$$
where $c_{\alpha, L}$ is a constant depending only on $\alpha$ and $L$.

As discussed in Section 4.1, adding noise to the gradient does not affect the stability results. Hence, we only need to address the on-average stability bounds of non-private SGD for pairwise learning.

Lemma 19. Let $\{w_t\}$ be produced by Algorithm 2.
(a) If $f$ is $L$-smooth and $\eta_t \le 2/L$ for all $t \in [T]$, then the on-average argument stability of $\{w_t\}$ admits a bound analogous to the pointwise case.

Proof of part (a). Since the gradient update is non-expansive for smooth convex losses, taking an expectation over both sides of the one-step bound and using the symmetry between $z_i$ and $z'_i$, it follows (where in the last equality we use that the $n(n-1)$ pairs contribute symmetrically) that, by Jensen's inequality and $w_1 = w'_1$, the resulting inequality can be applied recursively. Finally, setting $p = \frac{n}{2t}$ and using $(1 + 1/t)^t \le e$ completes the proof.
To prove Theorem 6, we introduce the following lemma on the optimization error. As discussed in [28], the optimization error analysis of DP-SGD for pairwise learning (Algorithm 2) is the same as that for pointwise learning (Algorithm 1); here, $\alpha = 1$ corresponds to the strongly smooth case by the definition of $\alpha$-Hölder smoothness. Now we are ready to prove the utility guarantees of Algorithm 2 for the strongly smooth and non-smooth cases. We first present the proof for the strongly smooth case (i.e., Theorem 6).
The rest of the proof is similar to that of Theorem 4 and is omitted for simplicity.

Conclusion
In this paper, we are concerned with differentially private SGD algorithms in the setting of stochastic convex optimization under a low-noise condition. We systematically studied DP-SGD with gradient perturbation for both pointwise and pairwise learning problems and established their privacy as well as utility guarantees. In particular, for pointwise learning, we provided sharper excess population risk bounds of order $O\big(\frac{\sqrt{d\log(1/\delta)}}{n\epsilon}\big)$ and $O\big(n^{-\frac{1+\alpha}{2}} + \frac{\sqrt{d\log(1/\delta)}}{n\epsilon}\big)$ for strongly smooth and $\alpha$-Hölder smooth losses, respectively. For pairwise learning, we proposed a simple DP-SGD algorithm with utility guarantees. Specifically, we proved that our algorithm can achieve the optimal excess risk rate $O\big(\frac{1}{\sqrt{n}} + \frac{\sqrt{d\log(1/\delta)}}{n\epsilon}\big)$ even if the loss is non-smooth. We further established faster excess risk bounds for both strongly smooth and $\alpha$-Hölder smooth losses under a low-noise condition, which gives the first utility analysis for privacy-preserving pairwise learning with excess risk rates tighter than $O\big(\frac{1}{\sqrt{n}} + \frac{\sqrt{d\log(1/\delta)}}{n\epsilon}\big)$. Whether one can derive privacy and utility guarantees for private SGD with Markov sampling remains a challenging open question.