Online confidence interval estimation for federated heterogeneous optimization

Graphical abstract: An online confidence interval estimation method, separated plug-in, based on rescaled federated averaging.


Introduction
Federated learning is a privacy-preserving machine learning framework that allows multiple clients to collaboratively train a global model without transferring local data. For estimation and prediction problems, statistical inference is an indispensable tool for quantifying uncertainty [1][2][3]. However, computation constraints, memory restrictions, and communication budgets render traditional statistical estimation and inference methods ineffective in federated settings [4]. In addition, variation across the local datasets and heterogeneity in the number of local iterations make it difficult to perform confidence interval estimation in federated settings.
Suppose that there are N clients and a central server. Each client is labeled with a unique number in [N]. The feature space is a finite-dimensional Euclidean space. In regression, the response space is the real line; for the classification problem, the label space is a finite set of classes. The i-th client has a local dataset consisting of independent and identically distributed (IID) samples from some unknown local data distribution over the product of the feature and response spaces. The federated learning system aims to optimize a sum of risk functions with access only to local stochastic gradient updates.
The federated optimization problem is to minimize the global risk function F, a weighted sum of the local risk functions F_i at the clients. Here each F_i is the expectation of a client-specified loss function under the local data distribution. The weight w_i of the i-th client is nonnegative, and the weights sum to one. In this research, θ is the model parameter and the parameter space is a finite-dimensional Euclidean space. The minimizer of F is denoted by θ* and the minimizer of F_i by θ_i*. In the non-IID problem, the data distributions may vary across clients; therefore, θ_i* typically does not coincide with θ*. A typical method to solve Eq. (1) is federated averaging (FedAvg) [5]. To reduce communication budgets, only a subset of clients is selected in each communication round of FedAvg. The selected clients perform multiple local updates before their local models are aggregated into the global model. FedAvg is widely applied in many federated learning applications [6][7][8][9][10][11]. Federated learning has also been widely studied on non-IID data. Zhao et al. [12] showed that the accuracy of federated learning decreases significantly, by up to 55% for neural networks trained on highly skewed non-IID data, where each client trains on only a single class of data. Several approaches, such as CSFedAvg [13] and FL+HC [14], can improve the accuracy of FedAvg on non-IID data. Other works study the convergence of FedAvg on non-IID data; for instance, Li et al. [15] developed convergence guarantees for FedAvg estimates on non-IID data under certain regularity conditions.
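As a concrete illustration of why θ* and the θ_i* differ on non-IID data, consider squared-error local risks, for which the minimizer of the weighted global risk is simply the weighted average of the local minimizers. The sketch below is illustrative; the client count, dimension, and weights are assumed values, not the paper's setup.

```python
import numpy as np

# Hypothetical sketch: with local risks F_i(theta) = E||theta - theta_i*||^2,
# the global risk F = sum_i w_i F_i is minimized at the weighted average of
# the local minimizers, so theta* generally differs from each theta_i*.
rng = np.random.default_rng(0)

N, p = 5, 3                              # number of clients, parameter dimension
theta_local = rng.normal(size=(N, p))    # local minimizers theta_i* (non-IID: they differ)
n_i = rng.integers(50, 200, size=N)      # local sample sizes
w = n_i / n_i.sum()                      # client weights w_i >= 0, sum w_i = 1

theta_star = w @ theta_local             # minimizer of the global quadratic risk

# theta* coincides with theta_i* only when all local minimizers agree.
assert not np.allclose(theta_star, theta_local[0])
```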
A batch of recent works [16][17][18][19] analyzing the convergence of federated optimization algorithms assumes that the number of local SGD iterations is identical across clients. However, this assumption is unrealistic for real-world datasets: clients usually differ, and both the sizes of their datasets and their computation speeds typically vary. To make full use of the local data, we perform SGD at each client on the data arriving within a given time interval. With streaming data, the number of SGD iterations is proportional to the number of arrived samples, so the number of local iterations typically differs across clients. When the batch size is the same across clients, the iteration count is proportional to the local sample size. In the seminal paper [5], McMahan et al. proposed federated averaging, in which each client performs E epochs of local updates; the number of local iterations at client i is then proportional to E n_i / B, where B is the mini-batch size and n_i is the number of samples at the i-th client. In this case, the number of local iterations can vary widely across clients. In short, clients update local parameters within a given time interval using the data that arrived in that period, which makes the iteration count proportional to the local sample size. Wang et al. [20] first analyzed FedAvg on non-IID data when the number of local SGD iterations is non-identical across clients. They showed that heterogeneity in the number of local iterations results in objective inconsistency; that is, the FedAvg estimate does not converge to θ*. Methods proposed for non-IID data, such as FedProx [21], SCAFFOLD [22], and VRLSGD [23], can reduce the inconsistency to some extent, but they require additional memory and slow down convergence. Wang et al. [20] therefore proposed the federated normalized averaging (FedNova) algorithm, which eliminates the inconsistency while preserving fast convergence.
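The epoch-based rule above makes the heterogeneity concrete: with E epochs and mini-batch size B, the per-round iteration count grows with the local sample size n_i. The rounding convention below is an assumption for illustration.

```python
import math

# Sketch of the local-iteration count in FedAvg-style training: with E local
# epochs, mini-batch size B, and n_i local samples, client i performs about
# tau_i = E * ceil(n_i / B) SGD steps per round, so tau_i varies with n_i.
def local_iterations(n_i, epochs=2, batch_size=10):
    return epochs * math.ceil(n_i / batch_size)

sizes = [100, 37, 520, 8]                     # heterogeneous local dataset sizes
taus = [local_iterations(n) for n in sizes]
print(taus)                                    # → [20, 8, 104, 2]
```

Even modest differences in dataset size produce iteration counts spanning two orders of magnitude, which is the regime the paper studies.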
In the big data era, many classical optimization methods for statistical problems, such as gradient descent, require great memory storage and computation power. Hence, online optimization methods, such as stochastic gradient descent (SGD), are of interest for statistical problems. From a statistical viewpoint, it is essential to perform confidence interval estimation in federated learning and online learning. However, only a few papers have considered confidence interval estimation problems in online learning, and statistical estimation problems are even more rarely studied in federated learning. In the online fashion, Ruppert [24] and Polyak and Juditsky [25] proved that the averaged SGD path is asymptotically normal with unknown asymptotic covariance. To conduct online confidence interval estimation, many studies try to estimate this unknown asymptotic covariance. Zhu et al. [26] introduced a fully online overlapping estimate of the asymptotic covariance using only the iterates from SGD, together with a non-overlapping variant. Fang et al. [27] introduced an online bootstrap procedure for estimating confidence intervals. In federated learning, Li et al. [28] showed how to perform statistical inference via Local SGD [16]. They proposed two online confidence interval estimation methods under federated settings. However, they assume that the number of local iterations is identical across clients; this assumption violates the imbalanced nature of federated learning. More importantly, Local SGD requires full client participation. If all clients participate in aggregation, the server must wait for slow clients to upload their parameters, which is time expensive; these slow clients are regarded as stragglers. Li et al. [15] explained that the full-participation requirement in Local SGD suffers from a serious "straggler effect". Meanwhile, Wang et al. [20] point out that the Local SGD or FedAvg estimate does not converge to θ* if the number of local iterations is non-identical. Hence, it is inappropriate to estimate the confidence interval of θ* via Local SGD or FedAvg when the number of local iterations is non-identical on non-IID data in federated learning.
Our research aims to give a confidence interval of θ* in an online fashion when the number of local iterations is non-identical. We denote the global parameter at the t-th round and its running average over the T rounds of training, where T is the maximum number of rounds. Since it is impossible to use FedAvg, we perform confidence interval estimation via rescaled FedAvg, which is a special case of FedNova. Wang et al. [20] gave a convergence guarantee for FedNova with a constant learning rate, but they did not analyze its statistical properties. In our research, we give a non-asymptotic convergence rate of the rescaled FedAvg estimate and prove that it converges to θ*. Moreover, we derive the asymptotic distribution of the averaged estimate. Furthermore, we propose the separated plug-in method to estimate the confidence interval of θ* in an online fashion. In summary, this work makes the following contributions: • First, under certain regularity conditions, we prove that the rescaled FedAvg estimate is a consistent estimate of θ*, and we give a non-asymptotic convergence rate for the estimate.
• Second, we prove that the averaged rescaled FedAvg estimate is asymptotically normal under some regularity conditions. Our research shows that its asymptotic covariance is inversely proportional to the client participation rate C. • Third, we propose the separated plug-in method to construct a confidence interval of θ* on non-IID data when the number of local iterations is non-identical. Additionally, we experimentally demonstrate the effectiveness of the method.
The remainder of this paper is organized as follows. Section 2 begins with some general definitions and notations used throughout the paper and introduces the rescaled FedAvg algorithm. In Section 3, we first state some assumptions essential to our theoretical proofs; we then analyze the statistical properties of the rescaled FedAvg estimator and propose an online confidence interval estimation method. In Section 4, we investigate the empirical performance of the proposed method by numerical simulation. Section 5 gives the conclusions of this research.

Problem formulation
Throughout this paper, we use standard notation, including the usual symbols for almost sure convergence. We consider the problem of performing confidence interval estimation for the federated heterogeneous optimization problem, where "heterogeneous" means that the number of local iterations differs across clients. Federated heterogeneous optimization aims to solve this problem on non-IID data when the number of local iterations is non-identical. In this research, we suppose that the samples arrive one by one in an online fashion at each client, the same as in Refs. [27, 29].
In the t-th communication round, suppose that samples arrive sequentially at client i within a given wall-clock time interval. As each sample arrives, the i-th client performs one stochastic gradient update of its local model parameter; the learning rate is fixed within the round. By convention, each client starts the round from the current global parameter. FedAvg simply aggregates the local parameter updates by averaging at the end of each round, where each client's local parameter update is the difference between its final local parameter and the global parameter at the start of the round. As mentioned before, only a subset of clients update their local models in a round. The set of selected clients in the t-th round is sampled without replacement, and the constant C is the fraction of updated clients, so the number of selected clients is CN. For instance, in the linear model the global risk function is the weighted expected squared error over input/output pairs. Suppose that the true local parameter of client i generates its responses through the linear model with zero-mean random noise.
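One communication round of plain FedAvg with streaming samples can be sketched as follows, under a linear model with squared loss. The client-sampling scheme, step size, and iteration-count ranges are illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np

# Minimal sketch of one FedAvg round with streaming data under a linear model.
rng = np.random.default_rng(1)
N, p, C = 10, 3, 0.4                      # clients, dimension, participation rate
theta_global = np.zeros(p)
eta = 0.05                                # learning rate for this round

selected = rng.choice(N, size=int(C * N), replace=False)
deltas = []
for i in selected:
    theta = theta_global.copy()           # each client starts from the global parameter
    tau_i = int(rng.integers(5, 50))      # heterogeneous number of local steps
    for _ in range(tau_i):                # one SGD step per arriving sample
        x = rng.normal(size=p)
        y = x @ np.ones(p) + rng.normal()
        grad = (x @ theta - y) * x        # gradient of the squared loss
        theta -= eta * grad
    deltas.append(theta - theta_global)   # local parameter update Delta_i

theta_global = theta_global + np.mean(deltas, axis=0)   # plain FedAvg averaging
```

Because each Δ_i accumulates τ_i gradient steps, clients with larger τ_i pull the average harder, which is the root of the objective inconsistency discussed next.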
Consequently, the minimizer of the global risk is a weighted average of the local parameters. Wang et al. [20] showed that in this heterogeneous setting the global FedAvg estimate converges to the minimizer of a mismatched surrogate objective, which coincides with θ* only when the numbers of local iterations are identical. Hence, there is inconsistency in estimating θ* using FedAvg. It is not reasonable to construct a confidence interval from this estimate because it is neither unbiased nor consistent.
It is not effective to perform confidence interval estimation via the federated averaging algorithm, due to the inconsistency caused by non-IID data and heterogeneity in the number of local iterations. To overcome this, we must perform confidence interval estimation via a new federated algorithm.
To eliminate the inconsistency, Algorithm 1 rescales the local parameter updates and updates the global model parameter by averaging the rescaled local parameter updates, where the rescaling involves the average of the local iteration counts. We call this algorithm rescaled federated averaging (rescaled FedAvg). It is a special case of FedNova [20], which updates local parameters by stochastic gradient descent. Although the convergence of FedNova has been guaranteed, its statistical properties remain unexplored. Li et al. [28] allow the E_i's to grow with the communication rounds; in their work, growing E_i's converge faster than fixed E_i's in terms of communication rounds, and if the number of samples is the same, increasing the E_i's reduces the number of communication rounds. However, doing so also enlarges the heterogeneity in the number of local iterations and slows down convergence. Thus, we set the numbers of local steps E_i to be identical across different rounds. In the next section, we give a non-asymptotic convergence rate of the rescaled FedAvg algorithm. In addition, we prove the asymptotic normality of the averaged rescaled FedAvg estimate.
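The aggregation step can be sketched as follows. We read the rescaling as dividing each client's update by its own iteration count τ_i and multiplying by the average count; this normalization, and all constants below, are our illustrative assumptions rather than the paper's exact formula.

```python
import numpy as np

# Hedged sketch of a rescaled-FedAvg aggregation step: each local update
# Delta_i is rescaled by tau_bar / tau_i before averaging, so clients that
# ran more local steps do not dominate the global update.
def rescaled_aggregate(theta_global, deltas, taus):
    taus = np.asarray(taus, dtype=float)
    tau_bar = taus.mean()                              # average iteration count
    scaled = [(tau_bar / t) * d for d, t in zip(deltas, taus)]
    return theta_global + np.mean(scaled, axis=0)

theta = np.zeros(2)
deltas = [np.array([1.0, 0.0]), np.array([0.0, 4.0])]
taus = [1, 4]                              # heterogeneous local iteration counts
theta_new = rescaled_aggregate(theta, deltas, taus)
print(theta_new)                           # → [1.25 1.25]
```

After rescaling, the two clients contribute equally per local step, which is exactly the inconsistency fix the paragraph describes.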
Algorithm 1 (sketch). In each round: sample a subset of clients; after each sample arrives, the selected clients update their local parameters; each selected client communicates its update to the central server; the server then updates the global parameter and the averaged rescaled FedAvg estimator. Therefore, we construct confidence intervals online based on rescaled FedAvg in this study.

Theoretical results
In this section, we first introduce some assumptions that are essential to our theoretical proofs. Second, we prove that the rescaled FedAvg estimate is consistent. Then, we propose an online confidence interval estimation method called separated plug-in. This method leverages the asymptotic normality of the averaged rescaled FedAvg estimate. Assumption 1 requires that, for each client i, the loss function is continuously differentiable and a moment bound on the local data holds. Assumption 1 is also assumed in Li et al. [15]. Assumption 2 is standard and is widely assumed in many papers [30, 31].
Assumption 2 requires that the risk functions F_i are strongly convex and that the loss function is average smooth, i.e., its stochastic gradients are Lipschitz in mean square, where the sample is an independent realization from the local data distribution.
By Jensen's inequality, the local risks F_i are all smooth. The functions F_i are strongly convex and smooth by Assumption 2. The global risk F is also smooth and strongly convex because it is a linear combination of the F_i.
The next assumption considers the Lipschitz continuity of the Hessian in a neighborhood of θ*.
Assumption 3. Assume that the Hessian matrix of F exists and is Lipschitz continuous in a neighborhood of θ*, with bounded constants. Define the gradient noise at the i-th client as the difference between the stochastic gradient and the true local gradient, where the sample is an independent realization from the local data distribution. Note that the gradient noise has zero mean at every client. The covariance of the gradient noise of client i at θ is denoted by S_i(θ), and the Hessian matrix of F is denoted by H.
Next, we assume that the difference between the gradient-noise covariance at θ and at θ* is bounded by a quadratic polynomial of the distance between θ and θ*. This ensures the continuity of the gradient-noise covariance at θ*, with the polynomial coefficient controlling the growth speed of the covariance. The boundedness of the moments of the gradient noise was first assumed by Li et al. [28].
Assumption 4. Assume that there exists some constant such that the covariance-difference bound above holds for every θ. Moreover, suppose that there exists a constant such that a higher-order moment of the gradient noise is finite.

Assumption 5. There exists some constant such that, for all clients, the second-order derivatives of the local loss functions are Lipschitz continuous in a neighborhood of θ*, where the sample is an independent realization from the local data distribution.

Statistical properties
When the learning rates satisfy certain conditions, Li et al. [28] established a non-asymptotic convergence rate for Local SGD. Our results give a convergence rate for rescaled FedAvg under polynomially decaying learning rates. Furthermore, our results show how differences in the number of local iterations influence the convergence.
In SGD and Local SGD algorithms and their variants, decreasing learning rates are critical. Li et al. [15] showed that FedAvg with a fixed learning rate does not converge to θ*; to reach the minimum, decay of the learning rates is essential in FedAvg. The convergence of Local SGD to the optimum is guaranteed with a fixed learning rate when all local minimizers coincide with θ* [32]. However, the FedAvg estimate does not converge to θ* otherwise. Hence, in our analysis the learning rates are decreasing and satisfy some regularity conditions. We give the following theorems under decreasing learning rates.
Theorem 1. Let the learning rates decay polynomially in the round index, with a suitable exponent. Under Assumptions 1-4, there exists a constant such that the stated non-asymptotic convergence bound holds for the rescaled FedAvg estimate, with constants depending on the problem parameters.
Theorem 1 in Wang et al. [20] showed that the FedNova estimate converges with a fixed learning rate determined by T, and established the corresponding convergence rate. Nevertheless, this cannot ensure the consistency of the estimate. In comparison, our Theorem 1 shows that the rescaled FedAvg estimate converges to θ* with decaying learning rates, together with an explicit convergence rate. Moreover, our theorem assumes convexity, while their result applies to nonconvex cases.
Theorem 1 implies that the convergence of the estimate is related to the participation rate C and the heterogeneity of the local iteration counts. When C is large, convergence is fast: intuitively, a large C indicates that more clients participate in aggregation, so the estimate converges faster with more information. From Theorem 1, the stated quantity is a good measure of the heterogeneity in the number of local iterations when the average iteration count is fixed. With a smaller measure, the estimate converges faster; when there is no heterogeneity in the number of local iterations, the estimate converges the fastest.
In a recent work, Toulis and Airoldi [33] proposed implicit SGD procedures and analyzed the asymptotic distribution of averaged implicit SGD iterates. Similarly, Li et al. [28] analyzed the asymptotic distribution of the averaged Local SGD iterates on non-IID data and performed confidence interval estimation using the asymptotic distribution. We formulate the asymptotic normality of the averaged rescaled FedAvg estimate in the following theorem. Theorem 2. If the learning rates are the same as those in Theorem 1 and Assumptions 1-4 hold, the averaged rescaled FedAvg estimate is asymptotically normal, with an asymptotic covariance inversely proportional to the client participation rate C. Theorem 2 reveals that the averaged estimate is asymptotically normal with an asymptotic covariance depending on C, the local gradient-noise covariances, and the Hessian matrix H. As shown in Li et al. [28], the effect of data heterogeneity does not appear in the asymptotic distribution. However, this theorem shows that the heterogeneity in the number of local iterations does appear in the asymptotic distribution.
In the federated learning system, the central server only has access to the global iterates. To construct a valid confidence interval, we leverage the asymptotic distribution of the averaged estimator under the federated learning constraints.

Separated plug-in method
To perform confidence interval estimation, a good and explicit estimate of the asymptotic covariance is necessary. There are several methods [26, 27, 34, 35] for estimating the asymptotic covariance in the SGD statistical inference problem. However, these methods cannot be applied directly in federated settings.
From Theorem 2, the asymptotic covariance is determined by the second-order derivative H and the covariance of the stochastic gradient error. Note that the gradient-error covariance differs from the covariance of the stochastic gradient itself. Hence, we separately estimate H and the gradient-noise covariance S to construct an estimate of the asymptotic covariance, instead of estimating it directly. More precisely, we separately estimate each S_i by some estimate and combine these into an estimate of S.
We assume that the local samples are IID from the local data distribution. In the t-th round, the input/output pairs arrive at each client sequentially, which mimics real-world data streams such as mobile phone data and IoT device data. In addition, we assume that samples from different clients' distributions are mutually independent.
Since the averaged estimate converges to θ* in probability, an intuitive way to estimate H is to use the sample estimate computed over the set of clients participating in the t-th aggregation step, where C is the fraction of participating clients. Due to partial client participation, the second-order derivatives of all clients are not always accessible; for this reason, we only have access to the derivatives of the clients selected in the current round. In fact, the probability of each client being selected in a round is C, and the expected number of times a client is chosen during training is CT.
The estimation of S is similar to the estimation of H, but it poses some additional problems. First, we rewrite S as a combination of the local covariances S_i. Due to the features of federated learning, we cannot directly estimate S; an estimate of S can instead be obtained by combining estimates of the S_i. As noted above, partial participation makes direct estimation of S difficult. More importantly, it is infeasible to compute the expectations of the local loss functions exactly. We therefore estimate each S_i by a sample average over independent realizations from the local data distribution. An advantage of the resulting estimate (4) is that it can be updated online, which is the main purpose of rewriting S.
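The two ingredients can be maintained recursively as samples arrive, in the spirit of the separated plug-in method. The sketch below uses a linear-regression model at a single client; the model, the outer-product update forms, and all constants are assumptions for illustration, not the paper's exact estimators.

```python
import numpy as np

# Illustrative online (recursive) updates for the plug-in ingredients:
# a running Hessian estimate H_hat and a running gradient-covariance
# estimate S_hat, refreshed as each new sample arrives.
rng = np.random.default_rng(2)
p = 3
theta = np.zeros(p)                        # current global iterate (stand-in)

H_hat = np.zeros((p, p)); n_H = 0          # running Hessian estimate
S_hat = np.zeros((p, p)); n_S = 0          # running gradient-covariance estimate

for _ in range(2000):                      # stream of samples at one client
    x = rng.normal(size=p)
    y = x @ np.ones(p) + rng.normal()
    g = (x @ theta - y) * x                # stochastic gradient at theta
    H = np.outer(x, x)                     # per-sample Hessian of the squared loss
    n_H += 1; H_hat += (H - H_hat) / n_H   # recursive mean update
    n_S += 1; S_hat += (np.outer(g, g) - S_hat) / n_S
```

For squared loss with standard normal features, H_hat should approach the identity matrix; the same one-pass update pattern gives the per-client estimates S_i that are then combined across clients.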
Since S is a linear combination of the S_i, it is natural to estimate S by the corresponding linear combination of the local estimates, where each S_i is estimated as in Eq. (4).
From the above discussion, we have estimates of H and S, respectively. Furthermore, both can be updated recursively in the spirit of SGD. Next, we prove the consistency of the two estimates under one additional assumption beyond Assumptions 1-4. Assumption 5 assumes that the second-order derivatives of the local risk functions are Lipschitz continuous in a neighborhood of θ*. This assumption is critical in the following theorem.
Theorem 3. Under Assumptions 1-5, the stated error bounds hold for all clients. Hence, the Hessian estimate converges to H in probability and the covariance estimate converges to S in probability.
Although we cannot give an exact confidence interval of θ*, we can form an asymptotic confidence interval by Theorem 3, applied coordinatewise. From Theorem 3, we directly derive Corollary 1. Based on Corollary 1, we propose a new online confidence interval estimation method. Since we estimate the pieces of the asymptotic covariance separately, we call this method separated plug-in. Details of the algorithm are given in Algorithm 2.
Corollary 1. Under the same assumptions as Theorem 3, the coordinatewise confidence interval is asymptotically valid, where the radius involves the corresponding quantile of the standard normal distribution and the j-th diagonal entry of the estimated covariance.
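Putting the pieces together, a coordinatewise interval can be formed from the plug-in estimates. The sandwich form H^{-1} S H^{-1} scaled by the participation rate C, and the 1/T scaling of the variance, are our hedged reading of Theorem 2 and Corollary 1; treat the exact scaling as an assumption.

```python
import numpy as np
from statistics import NormalDist

# Hedged sketch of forming per-coordinate confidence intervals from plug-in
# estimates: Sigma_hat = H_hat^{-1} S_hat H_hat^{-1} / C, and a
# normal-quantile interval around the averaged iterate.
def plugin_ci(theta_bar, H_hat, S_hat, T, C, level=0.95):
    z = NormalDist().inv_cdf(0.5 + level / 2)      # z_{1 - q/2}
    H_inv = np.linalg.inv(H_hat)
    Sigma = H_inv @ S_hat @ H_inv / C              # sandwich covariance over C
    radius = z * np.sqrt(np.diag(Sigma) / T)
    return theta_bar - radius, theta_bar + radius

theta_bar = np.array([1.0, -0.5])
H_hat = np.eye(2); S_hat = np.eye(2)
lo, hi = plugin_ci(theta_bar, H_hat, S_hat, T=400, C=0.5, level=0.95)
print(np.round(hi - lo, 3))                        # → [0.277 0.277]
```

Note how halving C doubles Sigma and widens the interval by a factor of sqrt(2), matching the claim that the asymptotic covariance is inversely proportional to the participation rate.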
When the clients are homogeneous in both data distribution and iteration count, there is no need to estimate the S_i separately. In this case, the asymptotic covariance reduces to the sandwich form involving the common covariance of the gradient noise, which is the same as that in Li et al. [28].

Other methods
In addition to the separated plug-in, we extend the random scaling method, which was proposed by Lee et al. [36] and extended to the federated setting by Li et al. [28]. We theoretically prove that the random scaling method is also effective in our setting. In the previous subsection, we proposed the separated plug-in method and gave the statistical properties of the proposed estimates. In fact, the asymptotic normality of the averaged estimate can be extended to a more general form; the following theorem is a generalization of Theorem 2.

Theorem 4. Under the same assumptions as Theorem 1, the partial-sum process of the rescaled FedAvg iterates weakly converges to a scaled Brownian motion as T → ∞, where the limit involves a multidimensional standard Brownian motion, the same covariance as in Theorem 2, and the client participation rate.

Based on Theorem 4, we can then construct a confidence interval using an online procedure, the same as the corresponding algorithm in Li et al. [28].
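The random-scaling matrix can be computed from the trajectory of running averages. The recipe below follows the general form in Lee et al. (a weighted sum of outer products of deviations of partial averages from the final average); the federated adaptation, e.g. any scaling by the participation rate, is omitted here and the trajectory is a synthetic stand-in.

```python
import numpy as np

# Illustrative random-scaling matrix from the running averages bar_theta_s:
# V_T = T^{-2} * sum_s s^2 (bar_theta_s - bar_theta_T)(bar_theta_s - bar_theta_T)^T.
def random_scaling(theta_bars):
    T = len(theta_bars)
    bar_T = theta_bars[-1]
    V = np.zeros((theta_bars.shape[1], theta_bars.shape[1]))
    for s, bar_s in enumerate(theta_bars, start=1):
        d = bar_s - bar_T
        V += (s * s) * np.outer(d, d)
    return V / (T * T)

rng = np.random.default_rng(3)
iterates = np.cumsum(rng.normal(size=(500, 2)), axis=0)   # synthetic iterate path
theta_bars = iterates / np.arange(1, 501)[:, None]        # running averages
V = random_scaling(theta_bars)
```

Because V is built only from the iterates themselves, the resulting studentized statistic needs no covariance estimate, at the cost of nonstandard critical values.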

Numerical studies
The numerical studies are divided into three parts. In the first part, we investigate how the heterogeneity in the number of local iterations influences convergence on simulated data. In the second part, we compare the separated plug-in method with the random scaling method to show the effectiveness of the proposed method on simulated data. In the last part, we show how to use the proposed methods on two real datasets.

Effect of the heterogeneity in the number of local iterations
The first simulation experiment shows how the heterogeneity in the number of local iterations influences convergence in linear regression. The feature and response spaces are as in the linear model above. For simplicity, we do not consider intercepts, so the parameter space is a finite-dimensional Euclidean space.
We use the decaying learning-rate schedule in this experiment. At client i, the true model parameter is generated from a normal distribution. Each predictor is generated from a multivariate normal distribution, and the response is generated according to the linear model with additive noise. The average number of local iterations is fixed; instead, we vary the degree of heterogeneity in the number of local iterations across three settings ("Balance", "Small", "Large"), quantified by our heterogeneity measure. In the "Balance" case, the number of local steps is identical for all clients. In the "Small" case, the numbers of local iterations are IID from a discrete uniform distribution with a narrow range. In the "Large" case, they are IID from a discrete uniform distribution with a wide range, giving the largest expected value of the measure. The results are in Fig. 1.
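The three regimes can be generated as follows, holding the average number of local steps fixed. The uniform-distribution endpoints are illustrative assumptions, since the exact ranges are not recoverable from this text.

```python
import numpy as np

# Sketch of the three heterogeneity regimes for the local iteration counts,
# with the average number of local steps held (approximately) fixed at 50.
rng = np.random.default_rng(4)
N, tau_mean = 20, 50

tau_balance = np.full(N, tau_mean)            # "Balance": identical counts
tau_small = rng.integers(40, 61, size=N)      # "Small": mild spread around 50
tau_large = rng.integers(10, 91, size=N)      # "Large": wide spread around 50

for name, tau in [("Balance", tau_balance), ("Small", tau_small), ("Large", tau_large)]:
    print(name, float(tau.std()))             # dispersion grows with heterogeneity
```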
From Fig. 1, the estimate converges the fastest in the "Balance" case. In the "Large" case, the heterogeneity measure is the largest of the three and the estimate converges the most slowly. This indicates that heterogeneity in the number of local iterations slows down convergence. The empirical results coincide with the theoretical results in Theorem 1.

Separated plug-in and random scaling
In the second simulation experiment, we show the effectiveness of the proposed separated plug-in method. The nominal coverage probability of the confidence intervals is 95% in both the linear and logistic regression models. The learning rates, fine-tuned in advance, decay with the round in both models. The predictors are generated from a multivariate normal distribution with p = 5, and the responses are generated according to the local models below. We assume that samples from different clients are independent and that samples from the same client are IID. Details of the data-generating and parameter-generating approaches are as follows: • In the linear regression, the response at each client is generated according to the linear model with IID additive noise. The true local parameters of the two client groups are both generated from a normal distribution. In this case, the minimizer θ* is the average of the two group parameters.
• In the logistic regression, the response is 1 with the probability given by the logistic model and 0 otherwise. The true local model parameters are also generated from a normal distribution: the first group of clients shares one parameter and the rest share the other. To calculate the empirical coverage rate, we must compute the minimizer θ* precisely. We use stochastic gradient descent to iteratively estimate θ* in a centralized setting on a dataset that mixes data from the two groups in equal halves.
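The two-group logistic design above can be sketched as follows. The dimension, sample sizes, and parameter draws are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of the two-group non-IID design for logistic regression:
# two groups of clients share different true parameters, and the global
# target is defined on an even mixture of their data.
rng = np.random.default_rng(5)
p, n = 5, 200
theta_a = rng.normal(size=p)              # parameter for the first client group
theta_b = rng.normal(size=p)              # parameter for the second group

def sample_client(theta, n):
    X = rng.normal(size=(n, p))
    prob = 1.0 / (1.0 + np.exp(-X @ theta))
    y = rng.binomial(1, prob)             # y = 1 with probability sigmoid(x' theta)
    return X, y

Xa, ya = sample_client(theta_a, n)
Xb, yb = sample_client(theta_b, n)
X = np.vstack([Xa, Xb]); y = np.concatenate([ya, yb])   # even mixture defining theta*
```

Running centralized SGD for logistic loss on the mixed (X, y) then approximates the global minimizer θ*, as the text describes.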
In both linear and logistic regression, we generate the numbers of local iterations from a discrete uniform distribution. Naturally, the parameter space is the Euclidean space of the predictors, since we do not include an intercept in the linear regression and omit the constant term in the logistic regression.

In both cases, the coverage rate and the average radius of the confidence intervals are the two main criteria used to evaluate the effectiveness of the methods. Both are computed by averaging over 1000 replications.
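These two summaries can be computed from the Monte-Carlo output as follows; the simulated intervals below are illustrative stand-ins for the replication results.

```python
import numpy as np

# Sketch of the evaluation criteria: average over replications the indicator
# that the interval covers theta* (coverage rate) and the interval half-width
# (average radius).
def summarize(lowers, uppers, theta_star):
    covered = (lowers <= theta_star) & (theta_star <= uppers)
    coverage = covered.mean()                 # empirical coverage rate
    radius = ((uppers - lowers) / 2).mean()   # average radius (half-width)
    return coverage, radius

rng = np.random.default_rng(6)
reps = 1000
centers = rng.normal(0.0, 0.1, size=reps)     # replication-wise point estimates
r = 0.196                                     # roughly 1.96 times the 0.1 std
cov, rad = summarize(centers - r, centers + r, theta_star=0.0)
```

With these stand-in numbers, the coverage should land near the nominal 95%, mirroring how the tables in this section are produced.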
Linear regression: Tables 1 and 2 show the empirical performance of the two methods with different participation rates and different maximum numbers of rounds T. Both random scaling and separated plug-in perform well in the linear regression model across participation rates.
From the two tables, the average radius of the separated plug-in method is smaller than that of random scaling. Furthermore, the standard deviation of the plug-in confidence interval radius is much smaller than that of random scaling. This is because the plug-in method takes advantage of both first- and second-derivative information, while random scaling only uses first-order derivatives. From Table 1, the average radius decreases as T increases. Similarly, the average radius is smaller with more clients participating; the ratio of average radii at different participation rates matches the theoretical prediction. The two methods have similar average coverage rates under linear regression.

Logistic regression: In Tables 3 and 4, the plug-in method again has a smaller average radius and standard deviation than random scaling. The two methods also have similar coverage rates, both close to the nominal 95% under the logistic regression model. In the logistic regression model, the average radius of the confidence interval likewise decreases as T and the participation rate increase.
The above experiments show that the two methods are efficient and applicable to this problem. The separated plug-in method constructs a smaller confidence interval, and the standard deviation of its radius is smaller than that of random scaling. In addition, the two methods have similar empirical coverage rates. Hence, the separated plug-in method is preferable to random scaling considering both the empirical coverage rate and the average radius.

Real data applications
In this section, we apply our proposed methods to conduct confidence interval estimation in linear regression for the power consumption of the Tetuan city dataset①, and in logistic regression for the skin segmentation dataset②.

The power consumption of the Tetuan city dataset [37] records the power consumption of three distribution networks of Tetouan, a city in northern Morocco. This dataset consists of 52,416 samples. We fit a federated linear model to investigate how the variables "temperature", "humidity", "wind speed", "general diffuse flows", and "diffuse flows" influence the response variable "Zone 1 Power Consumption". To simulate non-IIDness and heterogeneity in the number of local iterations, we allocate the dataset to 10 clients according to the variable "DateTime".

The Skin Segmentation dataset is constructed over the B, G, R color space. The skin and non-skin samples are generated using skin textures from face images of people of diverse ages, genders, and races. This dataset has 245,057 samples, each labeled "skin" or "non-skin": 50,859 are "skin" samples and 194,198 are "non-skin" samples. We fit a logistic model to observe the relationship between the skin indicator and the three predictors B, G, and R. The dataset is also partitioned into 10 parts, each associated with one client, with the data size varying across clients. Corresponding to the related hours, the number of local iterations differs across clients in the linear regression, and the learning rates are the same as before. For the logistic regression, we follow the same setting as in the previous simulated-data experiment, again with heterogeneous numbers of local iterations across clients. The participation rate is chosen for both cases to obtain better performance according to the previous simulated-data experiment. The results of our real data analysis are in Tables 5 and 6.

From Table 5, we see that the power consumption in Zone 1 is greatly influenced by the "temperature". From Table 6, we conclude that the variable B is positively related to the response, while the other two variables, G and R, are negatively related to the response.

Conclusions
This study shows how to perform online confidence interval estimation for federated heterogeneous optimization problems. We first proposed rescaled FedAvg to estimate θ*. The research gives a non-asymptotic convergence rate of the estimate; this result also reveals that heterogeneity in the number of local iterations slows down convergence. Furthermore, we proved that the averaged rescaled FedAvg estimate is asymptotically normal with unknown covariance.
Based on this normality, we proposed the separated plug-in method to estimate the asymptotic covariance. The separated plug-in method estimates the covariances of the local gradients separately and combines these estimates into an estimate of the asymptotic covariance matrix. Additionally, we proved a functional CLT and applied it to extend the random scaling method to the federated heterogeneous setting. Finally, the simulations showed that heterogeneity in the number of local iterations slows down convergence, and investigated the empirical performance of the two methods via Monte-Carlo experiments. The simulation results show that the plug-in interval has a smaller radius than the random scaling interval and is more stable. From the experiments, the average length of the confidence interval decreases when more aggregations are performed; moreover, the average length and its variance decay when more clients participate in the aggregation step in each round.

Fig. 1 .
Fig. 1. Impact of the heterogeneity in the number of local iterations, based on Monte-Carlo replications. The x-axis is the number of rounds. "Balance" means an identical number of local iterations for all clients and rounds. "Small" means a small degree of heterogeneity, where the local iteration counts are IID from a discrete uniform distribution with a narrow range. "Large" represents a large degree of heterogeneity, where the counts are IID from a discrete uniform distribution with a wide range.
Participation rate , number of clients , learning rates .

Table 1 .
Separated plug-in method in linear regression based on 1000 replications; standard deviations are in brackets.

Table 2 .
Random scaling method in linear regression based on 1000 replications; standard deviations are in brackets.

Table 3 .
Separated plug-in method in logistic regression based on 1000 replications; standard deviations are in brackets.

Table 4 .
Random scaling method in logistic regression based on 1000 replications; standard deviations are in brackets.