Distributed Empirical Risk Minimization With Differential Privacy

This work studies the distributed empirical risk minimization (ERM) problem under the differential privacy (DP) constraint. Standard distributed algorithms typically achieve DP by perturbing all local subgradients with noise, leading to significantly degraded utility. To tackle this issue, we develop a class of private distributed dual averaging (DDA) algorithms that activate only a fraction of nodes at each iteration to perform optimization. This subsampling procedure provably amplifies the DP guarantee, thereby achieving an equivalent level of DP with reduced noise. We prove that the proposed algorithms have utility loss comparable to centralized private algorithms for both general and strongly convex problems. Moreover, in the absence of noise, our algorithms attain the optimal O(1/t) convergence rate for non-smooth stochastic optimization. Finally, experimental results on two benchmark datasets are given to verify the effectiveness of the proposed algorithms.


Introduction
Consider a group of $n$ nodes, where each node $i$ has a local dataset $D_i = \{\xi_i^{(1)}, \dots, \xi_i^{(q)}\}$ that contains a finite number $q$ of data samples. The nodes are connected via a communication network. They aim to collaboratively solve the empirical risk minimization (ERM) problem, where machine learning models are trained by minimizing the average empirical prediction loss over known data samples. Formally, the optimization problem is given by
$$\min_{x} \; F(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x) + h(x), \qquad (1)$$

This work was supported by the Swedish Research Council, the Knut and Alice Wallenberg Foundation, and the Swedish Foundation for Strategic Research. A preliminary version of this work has been reported in the 9th IFAC Workshop on Networked Systems [19].
where $f_i(x) = \frac{1}{q}\sum_{j=1}^{q} l_i(x, \xi_i^{(j)})$ represents the empirical risk on node $i$, $l_i(x, \xi)$ is the loss of the model $x$ over the data instance $\xi$, and $h(x)$ is the regularization term shared across the nodes. This setup has been commonly considered in machine learning [18], where $h(x)$ is used to promote sparsity or model the constraints.
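To make the setup concrete, the following minimal sketch evaluates $F$ for the hinge-loss/$\ell_1$ instance used later in Section 5; the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def local_risk(x, A_i, b_i):
    """Empirical risk f_i(x) on node i's dataset D_i = (A_i, b_i):
    the average hinge loss over its q local samples."""
    margins = 1.0 - b_i * (A_i @ x)               # one margin per sample
    return np.mean(np.maximum(0.0, margins))

def global_objective(x, local_data, lam=5e-4):
    """F(x) = (1/n) * sum_i f_i(x) + h(x), with h(x) = lam * ||x||_1."""
    n = len(local_data)
    avg_risk = sum(local_risk(x, A, b) for A, b in local_data) / n
    return avg_risk + lam * np.linalg.norm(x, 1)
```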
As the loss and its gradient in ERM are characterized by data samples, potential privacy issues arise when the datasets are sensitive [2]. In particular, when $l_i$ is the hinge loss, the solution to Problem (1), i.e., the support vector machine (SVM), in its dual form typically discloses data points [2]. Advanced attacks such as input data reconstruction [43] and attribute inference [22] can extract private information from the gradients. To defend against privacy attacks, differential privacy (DP) has become prevalent in cryptography and machine learning [1,8], due to its precise notion and computational simplicity. Informally, DP requires the outcome of an algorithm to remain stable under any possible change to an individual in the database, and therefore protects individuals from attacks that try to steal the information particular to them. The DP constraint induces a tradeoff between privacy and utility in learning algorithms [2,3,16,29].
In this work, we are interested in solving Problem (1) while providing a rigorous DP guarantee for each data sample in $D := \cup_{i=1}^{n} D_i$.

Related work
For problems without regularization, i.e., $h \equiv 0$, the authors in [14] developed a differentially private distributed gradient descent (DGD) algorithm by perturbing the local output with Laplace noise. Notably, the learning rate is designed to be linearly decaying such that the sensitivity of the algorithm also decreases linearly.
Then, one can decompose the prescribed DP parameter $\varepsilon$ into a sequence $\{\varepsilon_t\}_{t\ge1}$ such that $\sum_{\tau=1}^{\infty} \varepsilon_\tau = \varepsilon$, and the operation at each time instant $t$ can be made $\varepsilon_t$-DP. However, such a choice of learning rate slows down the convergence dramatically and results in a utility loss of order $O(m/\varepsilon^2)$, where $m$ denotes the dimension of the decision variable. Under the more reasonable learning rate $\Theta(1/\sqrt{t})$, the utility loss can be improved to $O(\sqrt[4]{mn^2}/\varepsilon)$ [12], where $n$ denotes the number of nodes. Along this line of research, the authors in [11,34,42] extended the algorithm to time-varying objective functions, and the authors in [6] advanced the convergence rate to linear based on an additional gradient-tracking scheme. The authors in [30] developed a distributed algorithm with DP for stochastic aggregative games. The differentially private distributed optimization problem with coupled equality constraints has been studied in [4]. In these works, however, $\varepsilon$-DP is proved only for each iteration, leading to a cumulative privacy loss of $t\varepsilon$ after $t$ iterations. To attenuate the noise effect while ensuring DP, the authors in [28] constructed topology-aware noise, with which each node perturbs the messages to its neighbors (including itself) with different perturbations whose weighted sum is 0.
For federated learning (FL) with heterogeneous data, the authors in [13] developed a personalized linear model training algorithm with DP. In [24], general models were considered. In particular, the subsampling of users and local data was explicitly exploited to amplify the DP guarantee and improve the utility.
To tackle regularized learning problems, the alternating direction method of multipliers (ADMM) has been used to design distributed algorithms with DP [37,40,41]. However, an explicit tradeoff analysis between privacy and utility was missing. The authors in [32] investigated the privacy guarantee produced not only by random noise injection but also by mixup [38], i.e., a random convex combination of inputs. Approximate DP and advanced composition [15] were used to keep track of the cumulative privacy loss. The privacy-utility tradeoffs of linearized ADMM and DGD were captured by the bound $O(m/(\sqrt{n}\varepsilon))$.
To summarize, existing private distributed optimization algorithms applied to Problem (1) typically require each node to make a gradient query to its local dataset at each time instant. Since the sizes of the local datasets are considerably smaller than that of the overall dataset, local gradient queries have larger sensitivity parameters than in centralized settings. Therefore, private distributed optimization paradigms in the literature typically employ a larger magnitude of noise to secure the same level of DP, and suffer from relatively low utility.
Recently, an asynchronous DGD method with DP was developed in [35], which achieved a lower utility loss. The algorithm assumes that each local mini-batch is a subset of data instances uniformly sampled from the overall dataset without replacement, which appears restrictive in distributed settings.

Contribution
We develop a class of differentially private distributed dual averaging (DDA) algorithms for solving Problem (1). At each iteration, a fraction of the nodes is activated uniformly at random to perform a local stochastic subgradient query and a local update with the perturbed subgradient. Such a subsampling procedure provably amplifies the DP guarantee and therefore helps achieve the same level of DP with weaker noise. To ensure a user-defined level of DP, we provide sufficient conditions on the noise variance in Theorem 1, which admits a smaller bound on the variance than existing results.
The properties of the proposed algorithms in terms of convergence and the privacy-utility tradeoff are analyzed. First, a non-asymptotic convergence analysis is conducted for dual averaging with inexact oracles under general choices of hyperparameters, and the results are summarized in Theorem 2. This result illustrates how the lack of global information and the DP noise in private DDA quantitatively affect the convergence, which lays the foundation for the subsequent analysis. The privacy-utility tradeoff of the proposed algorithm is examined in Corollaries 3 and 4. In particular, when the objective function is non-smooth and strongly convex, the utility loss is characterized by $O(m\iota^2/(q^2\varepsilon^2))$, where $m$, $\iota$, $q$, $\varepsilon$ denote the variable dimension, node sampling ratio, number of samples per node, and DP parameter, respectively. Table 1 compares our results with some of the most relevant works; there, $\iota \propto 1/n$ denotes the node sampling ratio and $\varepsilon$ the DP parameter in the utility loss. The work in [35] considered nonconvex problems, and its results are adapted to convex problems for comparison in Table 1.

Finally, we verify the effectiveness of the proposed algorithms via distributed SVM on two open-source datasets. Several comparison results are also presented to support our theoretical findings.

Outline
The rest of the paper is organized as follows. Section 2 introduces some preliminaries. We present our algorithms and their theoretical properties in Section 3, whose proofs are postponed to Section 4. Some experimental results are given in Section 5. Section 6 concludes the paper.

Basic setup
We consider the distributed ERM in (1), in which $h$ is a closed convex function with non-empty domain $\operatorname{dom}(h)$. Examples of $h(x)$ include $\ell_1$-regularization, i.e., $h(x) = \lambda\|x\|_1$ with $\lambda > 0$, and the indicator function of a closed convex set. The regularization term $h$ and the loss functions $l_i$ for all $i = 1, \dots, n$ satisfy the following assumptions.

Assumption 1 i) $h$ is $\mu$-strongly convex on $\operatorname{dom}(h)$ for some $\mu \ge 0$; ii) each $l_i(\cdot, \xi_i)$ is convex on $\operatorname{dom}(h)$.
When $q = 1$, Problem (1) reduces to a deterministic distributed optimization problem. In Problem (1), the information exchange only occurs between connected nodes. Similar to existing research [7,23], we use a doubly stochastic matrix $W \in [0,1]^{n \times n}$ to encode the network topology and the weights of the connected links. In particular, its $(i,j)$-th entry, $w_{ij}$, denotes the weight used by node $i$ when weighting the message from node $j$. When $w_{ij} = 0$, nodes $i$ and $j$ are disconnected.
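As a concrete instance (used again in Section 5), a doubly stochastic $W$ can be built from an undirected graph via Metropolis weights [33]; a minimal sketch, assuming a 0/1 adjacency matrix without self-loops:

```python
import numpy as np

def metropolis_weights(adj):
    """Doubly stochastic gossip matrix from an undirected adjacency matrix:
    w_ij = 1 / (1 + max(deg_i, deg_j)) for each edge; the diagonal absorbs
    the remaining mass so every row (and, by symmetry, column) sums to one."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adj[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()
    return W
```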

Conventional DDA
The DDA algorithm originally proposed in [7] can be applied to solve Problem (1). In particular, let $d(\cdot) \ge 0$ be a strongly convex function with modulus 1 on $\operatorname{dom}(h)$. Each node, starting with $z_i^{(1)} = 0$, iteratively generates $\{x_i^{(t)}\}_{t\ge1}$ and $\{z_i^{(t)}\}_{t\ge1}$ via
$$x_i^{(t)} = \operatorname*{arg\,min}_{x \in \operatorname{dom}(h)} \Big\{ \langle z_i^{(t)}, x \rangle + t\,h(x) + \gamma_t d(x) \Big\}, \qquad (2)$$
$$z_i^{(t+1)} = \sum_{j=1}^{n} w_{ij} z_j^{(t)} + \hat g_i^{(t)}, \qquad (3)$$
where $\{\gamma_t\}_{t\ge1}$ is a non-decreasing sequence of parameters, $w_{ij}$ is the $(i,j)$-th entry of matrix $W$, $\hat g_j^{(t)} \in \partial l_j(x_j^{(t)}, \xi_j^{(t)})$ denotes the stochastic subgradient of the local loss at $x_j^{(t)}$ with $\xi_j^{(t)}$ uniformly sampled from $D_j$, and $\partial l_j(x_j^{(t)}, \xi_j^{(t)})$ represents the corresponding subdifferential. Throughout the process, each node only passes $z_i$ to its immediate neighbors and updates $x_i$ according to (2). Existing DDA algorithms, when applied to solve Problem (1), converge as $O(1/\sqrt{t})$ [5,7].
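A minimal sketch of one round of (2)-(3), assuming $d(x) = \|x\|^2/2$ and $h \equiv 0$ so that the minimization in (2) has the closed form $x = -z/\gamma_t$; `subgrad` stands in for the stochastic subgradient oracle and all names are illustrative.

```python
import numpy as np

def dda_step(Z, X, W, subgrad, gamma_t):
    """One round of conventional DDA: gossip the dual variables and add
    fresh stochastic subgradients (3), then map back to the primal via (2)
    with h = 0 and d(x) = ||x||^2 / 2, for which x = -z / gamma_t.
    Z, X: (n, m) stacked dual/primal iterates; W: (n, n) gossip matrix."""
    n = Z.shape[0]
    G = np.stack([subgrad(X[i], i) for i in range(n)])  # \hat g_i^{(t)}
    Z_next = W @ Z + G                                  # dual update (3)
    X_next = -Z_next / gamma_t                          # primal update (2)
    return Z_next, X_next
```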

Threat model and DP
In a distributed optimization algorithm, messages bearing information about the local training data are exchanged among the nodes, which leads to privacy risk.
In this work, we consider the following two types of attackers.
• Honest-but-curious nodes are assumed to follow the algorithm to perform communication and computation. However, they may record the intermediate results to infer sensitive information about the other nodes.
• External eavesdroppers stealthily listen to the private communications between the nodes.
By collecting the confidential messages, the attackers are able to infer private information about the users [43]. To defend against them, we employ tools from DP. Indeed, DP has been recognized as the gold standard in quantifying individual privacy preservation for randomized algorithms. It refers to the property of a randomized algorithm that the presence or absence of an individual in a dataset cannot be distinguished based on the output of the algorithm. Formally, we introduce the following definition of DP for distributed optimization algorithms [41].
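For reference, the standard $(\varepsilon, \delta)$-DP condition [8], which Definition 1 specializes to distributed optimization algorithms, reads:

```latex
% (\varepsilon,\delta)-DP: for all adjacent datasets D, D' (differing in a
% single sample) and every measurable set S of outputs,
\Pr[\mathcal{A}(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[\mathcal{A}(D') \in S] + \delta.
```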

Differentially private DDA algorithm
In this section, we develop the differentially private DDA algorithm, followed by its privacy-preserving and convergence properties.

Node subsampling in distributed optimization
As explained in Section 1, parallelized local gradient queries in distributed optimization necessitate stronger noise to achieve DP and therefore deteriorate utility. To circumvent this problem, we only activate a random fraction of the nodes at each time instant to perform averaging and local optimization. This allows us to amplify the privacy guarantee of the algorithm and thereby achieve the same level of DP with noise weaker than in existing works.
Definition 2 For every $t \ge 1$, $n\iota$ nodes (assumed to be an integer) are sampled uniformly at random, with some $\iota \in (0, 1]$.

The sampling procedure gives rise to a time-varying stochastic communication network. Slightly adjusting the notation in Section 2.1, we let $W^{(t)} \in [0,1]^{n\times n}$ be a random gossip matrix at time $t$, whose $(i,j)$-th entry, $w_{ij}^{(t)}$, denotes the weight of the link $(i,j)$ at time $t$. Denote by $N^{(t)}$ and $N_i^{(t)} := \{j \,|\, j \ne i, w_{ij}^{(t)} > 0\}$ the set of activated nodes and the set of $i$'s neighbors at time $t$, respectively. It is worthwhile to point out that $W^{(t)}$ and $\iota$ are dependent; that is, $w_{ij}^{(t)} > 0$ only if both $i, j \in N^{(t)}$, and $w_{ij}^{(t)} = 0$ otherwise.
For the gossip matrix $W^{(t)}$, we assume the following standard condition [20].

Assumption 2 For every $t \ge 1$: i) $W^{(t)}$ is doubly stochastic; ii) $W^{(t)}$ is independent of the random events that occur up to time $t - 1$; and iii) there exists a constant $\beta \in (0, 1)$ such that
$$\rho\left( \mathbb{E}\left[ W^{(t)\top} W^{(t)} \right] - \frac{\mathbf{1}\mathbf{1}^\top}{n} \right) \le \beta,$$
where $\rho(\cdot)$ denotes the spectral radius and the expectation $\mathbb{E}[\cdot]$ is taken with respect to the distribution of $W^{(t)}$ at time $t$.
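A quick Monte Carlo sanity check of Assumption 2, assuming a simple activation rule in which the $n\iota$ sampled nodes average uniformly among themselves while inactive nodes hold their values; the estimate of $\beta$ below is illustrative only.

```python
import numpy as np

def sampled_gossip(n, iota, rng):
    """Draw one W^(t) per Definition 2: activate n*iota nodes uniformly at
    random; active nodes average uniformly among themselves and inactive
    nodes keep their own value. W is doubly stochastic by construction."""
    k = max(2, int(n * iota))
    active = rng.choice(n, size=k, replace=False)
    W = np.eye(n)
    W[np.ix_(active, active)] = 1.0 / k
    return W

rng = np.random.default_rng(0)
n, iota = 20, 0.2
Ws = [sampled_gossip(n, iota, rng) for _ in range(2000)]
M = np.mean([W.T @ W for W in Ws], axis=0)          # estimate of E[W^T W]
beta_hat = np.max(np.abs(np.linalg.eigvalsh(M - np.ones((n, n)) / n)))
print(f"estimated beta: {beta_hat:.3f}")            # should be < 1
```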

Private DDA with stochastic subgradient perturbation
Next, we introduce a differentially private DDA algorithm presented as Algorithm 1.
The update for the local dual variable $z_i$ is given by
$$z_i^{(t+1)} = \eta_i^{(t)} \left( \sum_{j=1}^{n} w_{ij}^{(t)} z_j^{(t)} + a_t \big( \hat g_i^{(t)} + \nu_i^{(t)} \big) \right) + \big( 1 - \eta_i^{(t)} \big) z_i^{(t)}, \qquad (5)$$
where
$$\eta_i^{(t)} = \begin{cases} 1, & \text{if } i \text{ is active at time } t, \\ 0, & \text{otherwise}, \end{cases}$$
$\nu_i^{(t)}$ is Gaussian noise with variance $\sigma^2$ per coordinate, and $\{a_t > 0\}_{t\ge1}$ is a sequence of non-decreasing parameters. The non-decreasing property of $\{a_t\}_{t\ge1}$ is motivated by the fact that, when the objective exhibits some desirable properties, e.g., strong convexity, assigning heavier weights to fresher subgradients can speed up convergence [21,27]. In the special case where $a_t = 1$, $\eta_i^{(t)} = 1$ and $\sigma = 0$, (5) reduces to the conventional update in (3).
For a general regularization $h(x)$, the update in (6) requires the knowledge of $\iota$. This requirement is necessary for technical reasons. More precisely, due to node sampling, the term $\langle z_i^{(t+1)}, x \rangle$ in (6) serves as a linear approximation of $\iota f_i(x)/n$ rather than $f_i(x)/n$ as in standard DDA [7]. Thus, $h(x)$ is also scaled by $\iota$ in (6) in order to solve the original problem in (1). In the special case where $h(x)$ is the indicator function of a convex set, the knowledge of $\iota$ is not needed since $\iota h(x) \equiv h(x)$.
The overall procedure is summarized in Algorithm 1. Each node $i = 1, \dots, n$ initializes $z_i^{(1)} = 0$ in Step 1. At each time instant $t$, only active nodes $i \in N^{(t)}$ update $z_i^{(t+1)}$ and $x_i^{(t+1)}$ by following Steps 3-7. In particular, each active node computes and then perturbs the local stochastic subgradient in Steps 3 and 4, respectively, followed by the information exchange with neighboring nodes in Step 5. Then, $z_i^{(t+1)}$ and $x_i^{(t+1)}$ are updated in Steps 6 and 7. Inactive nodes at time $t$ simply set $z_i^{(t+1)} = z_i^{(t)}$ and $x_i^{(t+1)} = x_i^{(t)}$.

Remark 1 There are two common approaches to achieving DP for optimization methods. The first type perturbs the output of a non-private algorithm [39], and the second type perturbs the subgradient [2,29]. The former involves recursively estimating the (time-varying) sensitivity of the updates, which makes the propagation of DP noise and its effect on convergence difficult to quantify [31]. In this work, we adopt the latter approach in Algorithm 1, where we introduce Gaussian noise to perturb the stochastic subgradient $\hat g_i$. By leveraging the time-invariant sensitivity of the gradient query, we can effectively conduct both privacy and utility analyses in the presence of non-smooth regularization. It is worth noting that, in this scenario, the step-size scheduling rule allows for control over the utility.
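A minimal sketch of one round of Algorithm 1, assuming $d(x) = \|x\|^2/2$ and writing the primal step (6) as a generic `prox` map; `subgrad`, `prox`, and the bookkeeping are illustrative stand-ins for Steps 3-7 rather than the paper's exact pseudocode.

```python
import numpy as np

def private_dda_step(Z, X, W_t, active, subgrad, a_t, prox, sigma, rng):
    """One round of Algorithm 1 (sketch). Active nodes compute and perturb
    a stochastic subgradient (Steps 3-4), gossip dual variables (Step 5),
    and update z and x (Steps 6-7, cf. (5)-(6)); inactive nodes hold their
    state. Z, X: (n, m) stacked iterates; W_t: gossip matrix at time t."""
    Z_next, X_next = Z.copy(), X.copy()
    for i in active:
        noise = rng.normal(0.0, sigma, size=Z.shape[1])
        g = subgrad(X[i], i) + noise          # perturbed subgradient
        Z_next[i] = W_t[i] @ Z + a_t * g      # dual update, cf. (5)
        X_next[i] = prox(Z_next[i])           # primal update, cf. (6)
    return Z_next, X_next
```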

Privacy analysis
To establish the privacy-preserving property of Algorithm 1, we make the following assumption.
Assumption 3 For every $i = 1, \dots, n$ and every $\xi \in D_i$, the loss $l_i(\cdot, \xi)$ is $L$-Lipschitz on $\operatorname{dom}(h)$.

By Assumption 3, we readily have that each $f_i(\cdot)$ is $L$-Lipschitz. Next, we state the privacy guarantee for Algorithm 1. The proof can be found in Section 4.1.
Theorem 1 Suppose Assumption 3 is satisfied, and a random fraction of the nodes with ratio $\iota \in (0, 1]$ is active at each time instant. Given parameters $q$, $\varepsilon \in (0, 1]$, and

Remark 2 A few remarks on the results in Theorem 1 are in order: i) It can be verified from the proof of Theorem 1 that Algorithm 1 is $(\varepsilon, \delta)$-DP after $t \le T$ iterations with

ii) Theorem 1 emphasizes that, to achieve a prescribed privacy budget during $T$ iterations, the noise variance $\sigma^2$ is related to the DP parameters $(\varepsilon, \delta)$, the Lipschitz constant $L$ of the loss, the number $q$ of samples per local dataset, and the iteration number $T$. Notably, the lower bound on the variance is weighted by $\iota^2 \le 1$, meaning that the same level of DP can be achieved with reduced noise.
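Since the display stating the exact condition on $\sigma$ is elided above, the following sketch only illustrates the qualitative dependence noted in Remark 2 (Gaussian mechanism with sensitivity $2L$, amplification by the sampling ratio $\iota/q$, advanced composition over $T$ rounds); the constants are illustrative and not the paper's bound.

```python
import math

def illustrative_sigma(L, q, iota, eps, delta, delta0, T):
    """Illustrative noise calibration (NOT Theorem 1's exact bound):
    split eps over T steps via advanced composition, undo the subsampling
    amplification factor 2*iota/q, then apply the Gaussian mechanism with
    sensitivity 2L. Note sigma grows with iota and shrinks with q."""
    eps_t = q * eps / (2.0 * iota * math.sqrt(2.0 * T * math.log(1.0 / delta)))
    return 2.0 * L * math.sqrt(2.0 * math.log(1.25 / delta0)) / eps_t

print(illustrative_sigma(L=1.0, q=500, iota=0.2, eps=0.8,
                         delta=1e-4, delta0=0.01, T=1000))
```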

Privacy-utility tradeoff
Next, we perform a non-asymptotic analysis of Algorithm 1, followed by an explicit privacy-utility tradeoff.
Motivated by [7], we define an auxiliary sequence of variables in (8):

where $\{z_i^{(t)} : i = 1, \dots, n\}_{t\ge1}$ are generated by Algorithm 1. The convergence property of $y^{(t)}$ is summarized in Theorem 2, whose proof is provided in Section 4.2.
Theorem 2 Suppose Assumptions 1, 2, and 3 are satisfied. For all $t \ge 1$, we have

where $\sigma$ is defined in Theorem 1, $M$ is a positive constant, and the expectation is over the randomness of the algorithm.
From the error bound in (9), we observe that the last two terms are contributed by the noise. How the noise affects the error bound is determined in part by the hyperparameters of the algorithm. We first investigate the choices of $a_t$ that lead to optimal convergence rates for Algorithm 1 with $\sigma = 0$; the results for strongly convex and general convex functions are presented in Corollaries 1 and 2, whose proofs are given in Appendices C and D, respectively.
Remark 3 Corollary 1 indicates that the non-private version of Algorithm 1, i.e., $\sigma = 0$, attains the optimal convergence rate $O(1/t)$ when Problem (1) is strongly convex. Compared to the algorithm in [36], where the authors focused on constrained problems, the proposed algorithm handles general non-smooth regularizers. Furthermore, the results can be extended to the case where each $f_i(x)$ but not $h(x)$ is strongly convex by following a similar idea to [20].
Corollary 2 Suppose Assumptions 2 and 3 are satisfied. In addition, Assumption 1 holds with $\mu = 0$, i.e., $h(x)$ is general convex. If $\sigma = 0$, then for all $t \ge 1$ and $i = 1, \dots, n$, we have

where $\tilde y^{(t)} = t^{-1} \sum_{\tau=1}^{t} y^{(\tau)}$ and $M$ is the positive constant given in Theorem 2. In addition, for all $t \ge 1$ and $i = 1, \dots, n$, we have

where

Under the same hyperparameters, we study the privacy-utility tradeoff of Algorithm 1 with $\sigma > 0$ for strongly convex and general convex functions in Corollaries 3 and 4, whose proofs are presented in Appendices E and F, respectively.
Corollary 3 Suppose Assumptions 2 and 3 are satisfied.
Corollary 4 Suppose Assumptions 2 and 3 are satisfied. In addition, Assumption 1 holds with $\mu = 0$, i.e., $h(x)$ is general convex. If

and $\iota \le 1 - \beta$, then the following holds for $T$ and $i = 1, \dots, n$,

where $\delta_0$ is defined in Theorem 1.
Corollaries 3 and 4 highlight that the sampling procedure lowers the utility loss for both strongly convex and general convex problems. In particular, the utility loss in the strongly convex case becomes $\iota^2 \approx 1/n^2$ times smaller than that without sampling. For general convex problems, the utility loss is $\sqrt{\iota}$ times smaller. They also suggest that more iterations are needed to achieve a lower utility loss.

Proofs of main results
This section presents the proofs of Theorems 1 and 2.

Proof of Theorem 1
We start by introducing some useful properties of DP [8,9,15]. Recall from Algorithm 1 that, at each iteration, $n\iota$ nodes are sampled from the $n$ nodes at random, and each activated node randomly selects a data sample from its $q$ instances to compute a stochastic subgradient. Although such subsampling is not uniform, i.e., the subsets of $n\iota$ data samples are not necessarily chosen with equal probability, it still helps amplify the privacy [9, Lemma 10].
for any $\delta' \in (0, 1]$.

Lemma 4 (Post-Processing) Let $A$ be a randomized algorithm that is $(\varepsilon, \delta)$-DP. Then, for an arbitrary mapping $p$ from the set of possible outputs of $A$ to an arbitrary set, $p \circ A$ is $(\varepsilon, \delta)$-DP.

We are now in a position to prove Theorem 1.
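For convenience, the advanced composition bound of [15] invoked as Lemma 3 reads, in its standard form:

```latex
% Advanced composition [15]: the T-fold composition of (\varepsilon_0,\delta_0)-DP
% mechanisms is (\varepsilon', T\delta_0 + \delta')-DP for any \delta' \in (0,1], with
\varepsilon' \;=\; \sqrt{2T\ln(1/\delta')}\,\varepsilon_0
              \;+\; T\varepsilon_0\!\left(e^{\varepsilon_0}-1\right).
```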

DP at each time t:
We begin by noting that the subgradient perturbation procedure at time $t$, denoted by $M_t$, is a Gaussian mechanism whose sensitivity, by Assumption 3, is $\Delta \le 2L$. Based on Lemma 1, $M_t$ is $(\varepsilon_t, \delta_0)$-DP. Due to the conditions on $\sigma$ and $T$, we obtain (16). Denote by $A_t$ the composition of $M_t$ and the subsampling procedure. Upon using Lemma 2 and (16), we obtain that $A_t$ is $(\tilde\varepsilon_t, \iota\delta_0/q)$-DP with $\tilde\varepsilon_t = 2\iota\varepsilon_t/q$. In addition, because of (16), we get $\tilde\varepsilon_t = 2\iota\varepsilon_t/q \le 2\varepsilon_t \le 0.9$.

DP after T iterations: Consider the composition of $A_1, \dots, A_\tau, \dots, A_T$, denoted by $A$. Based on the advanced composition rule for DP in Lemma 3, we obtain

where we fix $\delta'$, use $\sum_{t=1}^{T} \tilde\varepsilon_t^2 \le 1$, and use (17) to get the second inequality. By setting $\delta' = \delta$, we have that $A$ is $(\varepsilon, \delta)$-DP.
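The amplification step above follows the standard subsampling bound: sampling participants with probability $p = \iota/q$ turns an $\varepsilon_t$-DP step into a $\log(1 + p(e^{\varepsilon_t} - 1))$-DP one, which is at most $2p\varepsilon_t$ whenever $\varepsilon_t \le 0.9$. A quick numerical check (values illustrative):

```python
import math

def amplified_eps(eps_t, p):
    """Privacy amplification by subsampling with sampling probability p."""
    return math.log(1.0 + p * (math.exp(eps_t) - 1.0))

p = 0.2 / 50                         # iota/q with iota = 0.2, q = 50
for eps_t in (0.1, 0.45, 0.9):
    amp = amplified_eps(eps_t, p)
    assert amp <= 2.0 * p * eps_t    # the linear bound used in the proof
    print(f"eps_t={eps_t:.2f}: amplified to {amp:.5f} <= {2 * p * eps_t:.5f}")
```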
DP after post-processing: The intermediate results $\{z^{(\tau)}\}_{\tau=1}^{T}$ are computed based on the output of $A$, i.e., the perturbed subgradients. By the post-processing property of DP in Lemma 4, Algorithm 1 also satisfies the $(\varepsilon, \delta)$-DP specified in Definition 1.

Proof of Theorem 2
Before proving Theorem 2, we present two useful lemmas, whose proofs are given in Appendices A and B.
Lemma 5 For the sequence $\{x_i^{(t)} : i = 1, \dots, n\}_{t\ge1}$ generated by Algorithm 1 and the auxiliary sequence $\{y^{(t)}\}_{t\ge1}$ defined in (8), one has that for all $t \ge 1$ and $i = 1, \dots, n$,

and

Lemma 6 For all $t \ge 1$, we have

Now we are ready to prove Theorem 2. Upon using

and the $L$-Lipschitz continuity of each $f_j$, we have

Further using the convexity of $f_j$, $j = 1, \dots, n$, we have

Therefore,

where we use $F = f + h$ in the first inequality, and use (21) and (22) in the last inequality. Due to uniform node sampling with probability $\iota$, we have

where $\mathbb{E}_\tau$ denotes the expectation conditioned on $\{x_i^{(\tau)}, i = 1, \dots, n\}$. Therefore, by taking this conditional expectation of (23) and using the law of total expectation, we obtain

since $\nu_j^{(\tau)}$ are independent of $y^{(\tau)}$ and $\mathbb{E}[\hat g_j^{(\tau)}]$ is a subgradient of $f_j$ at $x_j^{(\tau)}$. Therefore, we obtain from Lemma 6 that

Furthermore, we have

where we remove the conditioning based on the law of total expectation and use the fact that $\nu_i^{(\tau)}$ is independent of $\hat g_i^{(\tau)}$. By plugging the above into (25) and using Lemma 5, we arrive at (9) as desired.

Experiments
In this section, we present experimental results of the proposed algorithms.

Setup
We use the benchmark datasets epsilon [26] and rcv1 [17] in the experiments. Some information about the datasets is given in Table 2. We randomly assign the data samples evenly among the $n = 20$ working nodes. The working nodes aim to solve the regularized SVM problem in (26), where $\{\xi_i^{(j)}\}_{j=1}^{q} := D_i$ are the data samples private to node $i$. In the experiments, we consider two choices of the regularizer, i.e., $h(x) = \phi\|x\|_1$ and $h(x) = \mu\|x\|_2^2$, where $\phi > 0$ and $\mu > 0$ will be specified later.
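Since the display of (26) is elided above, the sketch below assumes the standard hinge-loss form of the regularized SVM; it implements the per-sample subgradient queried in Step 3 of Algorithm 1 (names illustrative).

```python
import numpy as np

def hinge_subgradient(x, a, b):
    """Subgradient of l(x; (a, b)) = max(0, 1 - b * <a, x>) for one sample
    (a, b) with label b in {-1, +1}; this is the local query in Algorithm 1."""
    return -b * a if b * (a @ x) < 1.0 else np.zeros_like(x)
```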
Throughout the experiments, we consider a complete graph with $n = 20$ nodes as the supergraph. Based on it, we consider two edge sampling strategies; that is, 1 or 2 edges are sampled uniformly at random from the set of all edges at each time instant. The corresponding gossip matrices are created with Metropolis weights [33]. Some common parameters used in the two sets of experiments are introduced in the following. For the DP parameters, we consider $\varepsilon \in \{0.2, 0.4, 0.6, 0.8, 1\}$ and $\delta_0 = 0.01$. The random noises in these two cases are generated accordingly based on Theorem 1. The convergence performance of the algorithm is captured by the suboptimality, i.e., $F(n^{-1}\sum_{i=1}^{n} x_i^{(t)}) - F(x^\star)$, versus the number of iterations, where the ground truth $x^\star$ is obtained by the optimizer SGDClassifier from scikit-learn [25].
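A sketch of how the suboptimality curve might be computed from the stored iterates, assuming $F$ as in (26) and a precomputed ground truth $x^\star$ (e.g., from SGDClassifier); names illustrative.

```python
import numpy as np

def suboptimality_curve(iterates, F, x_star):
    """Return F(n^{-1} sum_i x_i^{(t)}) - F(x*) for every stored round t.
    iterates: list of (n, m) arrays holding the n local models per round."""
    F_star = F(x_star)
    return [F(X.mean(axis=0)) - F_star for X in iterates]
```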
Results for $\ell_2$-regularized SVM

We set $\varepsilon = 0.8$ and compare the convergence performance of [19] and Algorithm 1 under different choices of $\iota \in \{0.1, 0.2\}$. Fig. 1 shows that Algorithm 1 with both choices of $\iota$ outperforms [19] in terms of convergence speed and model accuracy. Furthermore, the use of a larger $\iota$ in Algorithm 1 leads to a higher utility loss, which verifies Corollary 3. We observe that selecting a larger number of sampled nodes at each step leads to improved network connectivity as well as increased noise. The findings from Fig. 1 indicate that, in this specific example, the impact of increased noise on convergence performance may outweigh the benefits of enhanced connectivity.
Next, we examine the performance of [19] and Algorithm 1 under a set of DP parameters. The results in Fig. 2 illustrate that increasing the value of $\varepsilon$, i.e., relaxing the privacy requirement, results in a decreased utility loss across all the methods. This is because a smaller value of $\varepsilon$ corresponds to a more stringent DP constraint, necessitating stronger noise to perturb the subgradient. In addition, the performance gap between Algorithm 1 and [19] is more significant for smaller $\varepsilon$, i.e., a tighter DP requirement.

Results for $\ell_1$-regularized SVM
Set $h(x) = \phi\|x\|_1$ with $\phi = 0.0005$. In this case, the problem in (26) is convex with a non-smooth regularization term. According to Corollary 2, we set $\gamma_t = 0.01\sqrt{t}$ and $a_t = 1$ in the experiment.
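Under these choices ($a_t = 1$, hence $A_t = \sum_{\tau \le t} a_\tau = t$) and assuming $d(x) = \|x\|^2/2$, the primal step (6) with $h(x) = \phi\|x\|_1$ reduces to soft-thresholding; a sketch, where the $\iota A_t$ scaling of the regularizer is our reading of (6):

```python
import numpy as np

def l1_primal_step(z, t, phi=0.0005, iota=0.2):
    """Closed form of (6) for h = phi*||.||_1, d = ||.||^2/2, a_t = 1 and
    gamma_t = 0.01*sqrt(t): soft-threshold -z/gamma_t at iota*t*phi/gamma_t.
    (The iota*t factor reflects the scaled regularizer iota*A_t*h; assumption.)"""
    gamma_t = 0.01 * np.sqrt(t)
    v = -z / gamma_t
    thr = iota * t * phi / gamma_t
    return np.sign(v) * np.maximum(np.abs(v) - thr, 0.0)
```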
First, we set $\varepsilon = 0.4$ and compare Algorithm 1 under different subsampling ratios. The findings depicted in Fig. 3 illustrate a similar trend: as the subsampling ratio $\iota$ decreases, the utility loss diminishes correspondingly. Additionally, we present the results for Algorithm 1 with various DP parameters in Fig. 4. Notably, for both selected subsampling ratios, we observe a degradation in utility as the DP parameter $\varepsilon$ decreases. In summary, the experimental results reveal the effectiveness of the proposed algorithms and validate our theoretical findings.

Conclusion
In this work, we presented a class of differentially private DDA algorithms for solving ERM over networks. The proposed algorithms achieve DP by i) randomly activating a fraction of the nodes at each time instant and ii) perturbing the stochastic subgradients over individual data samples within the activated nodes. We proved that our algorithms substantially improve over existing ones in terms of utility loss.
There are numerous promising directions for future work. First, an intriguing avenue to explore is the heterogeneous case, where nodes exhibit substantial variations in dataset size and/or Lipschitz constants. Second, it is worthwhile to investigate the high-probability convergence of the proposed algorithms.

A Proof of Lemma 5
Notation: To facilitate the presentation, we introduce the following notation. Define $\mathbf{W}^{(t)} = W^{(t)} \otimes I$. Given a real-valued random vector $x$, we let

Accordingly, for a square random matrix $W$, we denote

and

PROOF. This proof consists of three parts. First, we prove

where $\bar z^{(t)} = n^{-1}\sum_{i=1}^{n} z_i^{(t)}$. Second, we prove

Finally, we conclude the proof using these two inequalities.
Part ii) When $t = 1$, since $z_i^{(1)} = 0$ for all $i$, we have $\bar z^{(1)} = 0$ and therefore (A.4) is satisfied. Next, we consider the case with $t \ge 2$.
This, together with the standard bound between the $\ell_1$- and $\ell_2$-norms, completes the proof.
Note that $y^{(1)} = \nabla\Psi_1^*(0)$, $A_0 = 0$, and $\gamma_0 = 0$ by definition, implying that $\Psi_0^*(0) = 0$. The claim then follows by further considering Step 1 of Algorithm 1, i.e., the initialization $z_i^{(1)} = 0$.
Then, we investigate the convergence rate of the non-private (noiseless) version of DDA for both strongly convex and general convex objective functions under two sets of hyperparameters in Corollaries 1 and 2, respectively. We remark that Corollary 1 advances the best known convergence rate of DDA for non-smooth stochastic optimization, i.e., $O(1/\sqrt{t})$, to $O(1/t)$.