FedBoosting: Federated Learning with Gradient Protected Boosting for Text Recognition

Typical machine learning approaches require centralized data for model training, which may not be possible where restrictions on data sharing are in place due to, for instance, privacy and gradient protection. The recently proposed Federated Learning (FL) framework allows learning a shared model collaboratively without data being centralized or shared among data owners. However, we show in this paper that the generalization ability of the joint model is poor on Non-Independent and Non-Identically Distributed (Non-IID) data, particularly when the Federated Averaging (FedAvg) strategy is used, due to the weight divergence phenomenon. Hence, we propose a novel boosting algorithm for FL to address both the generalization and gradient leakage issues, as well as to achieve faster convergence in gradient-based optimization. In addition, a secure gradient sharing protocol using Homomorphic Encryption (HE) and Differential Privacy (DP) is introduced to defend against gradient leakage attacks and to avoid pairwise encryption, which does not scale. We demonstrate that the proposed Federated Boosting (FedBoosting) method achieves noticeable improvements in both prediction accuracy and run-time efficiency in a visual text recognition task on public benchmarks.


INTRODUCTION
Personal data protection and privacy-preservation issues have particularly attracted researchers' attention [1], [2], [3], [4], [5], [6], [7]. Typical machine learning approaches that require centralized data for model training may not be feasible where restrictions on data sharing are in place. Therefore, decentralized training approaches are more attractive, since they offer desirable benefits in privacy preservation and data security. Federated Learning (FL) [8], [9] was proposed to address such concerns by allowing individual data providers to collaboratively train a shared global model without aggregating the data centrally.
McMahan et al. [9] presented a practical decentralized training method for deep networks based on averaging aggregation. Experimental studies were carried out on various datasets and architectures, demonstrating the robustness of FL on unbalanced and Non-IID data. Frequent model updates generally lead to higher prediction performance, but the communication cost increases sharply, especially for large datasets [9], [10], [11], [12], [13]. Konečnỳ et al. [11] focused on the efficiency issue and proposed two weight updating methods, namely the structured update and sketched update approaches based on Federated Averaging (FedAvg), to reduce the up-link cost of transmitting gradients from local machines to the centralized server.
Prediction performance and data privacy are two major challenges in FL research. On one hand, the accuracy of FL drops significantly on Non-Independent and Non-Identically Distributed (Non-IID) data [14]. Zhao et al. [14] showed that the weight divergence can be measured quantitatively by the Earth Mover's Distance (EMD) between the class distribution on each local machine and the global population distribution. Hence, they proposed sharing a small subset of data among all edge devices to improve model generalization on Non-IID data. However, such a strategy is infeasible when restrictions on data sharing are in place, as it usually leads to privacy breaches. Li et al. [15] studied the convergence properties of FedAvg and concluded that a trade-off exists between its communication efficiency and convergence rate. They argued that the model converges slowly on heterogeneous datasets. Based on our empirical study in this paper, we confirm that given Non-IID datasets, training needs far more iterations to reach an optimal solution and often fails to converge, especially when the local models are trained on large-scale datasets with a small batch size or the global model is aggregated after a large number of epochs. On the other hand, model gradients are generally considered safe to share in FL systems for model aggregation. However, several studies have shown that it is feasible to recover training data information from model gradients. For example, Fredrikson et al. [16] and Melis et al. [17] reported methods that can identify whether a sample with certain properties is in the training batch. Hitaj et al. [18] proposed a Generative Adversarial Network (GAN) model as an adversarial client to estimate the distribution of the data from the outputs of other clients without knowing their training data. Zhu et al. [19] and Zhao et al. [20] demonstrated that data recovery can be formulated as a gradient regression problem, assuming the gradient from a targeted client is available, which is a largely valid assumption in most FL systems. Furthermore, the Generative Regression Neural Network (GRNN) proposed by Ren et al. [21] consists of two generative branches, one based on a GAN for generating fake training data and the other based on a fully-connected layer for generating the corresponding labels. The training data is revealed by regressing the true gradient against the fake gradient generated from the fake data and label.
In this paper, we propose the Federated Boosting (FedBoosting) method to address the weight divergence and gradient leakage issues in the general FL framework. Instead of treating individual local models equally when the global model is aggregated, we consider the data diversity of local clients in terms of convergence status and generalization ability. To address the potential risk of data leakage via shared gradients, a Differential Privacy (DP) based linear aggregation method is proposed using Homomorphic Encryption (HE) [22] to encrypt the gradients, which provides two layers of protection. The proposed encryption scheme leads to only a negligible increase in computational cost.
The proposed method is evaluated on a text recognition task on public benchmarks, as well as a binary classification task on two synthetic datasets, demonstrating its superiority in terms of convergence speed, prediction accuracy and security. The performance reduction due to encryption is also evaluated. Our contributions are four-fold:
• We propose a novel aggregation strategy, namely FedBoosting, for FL to address the weight divergence and gradient leakage issues. We empirically demonstrate that FedBoosting converges significantly faster than FedAvg while the communication cost is identical to traditional approaches. In particular, when the local models are trained with a small batch size and the global model is aggregated after a large number of epochs, our approach still converges to a reasonable optimum whereas FedAvg often fails in such cases.
• We introduce a dual-layer protection scheme using HE and DP to encrypt the gradients flowing between server and clients, which protects data privacy from gradient leakage attacks.
• We show the feasibility of our method on two synthetic datasets by evaluating the decision boundaries visually. Furthermore, we demonstrate its superior performance in a visual text recognition task on multiple large-scale Non-IID datasets compared to the centralized approach and FedAvg. The experimental results confirm that our approach outperforms FedAvg in terms of convergence speed and prediction accuracy, suggesting that the FedBoosting strategy can be integrated with other Deep Learning (DL) models in privacy-preserving scenarios.
• Our implementation of FedBoosting is publicly available (https://github.com/Rand2AI/FedBoosting) to ensure reproducibility. It can also be run in a distributed multiple Graphics Processing Units (GPUs) setup.
The rest of the paper is organized as follows: Section 2 presents related work on encryption methods, collaborative learning and gradient leakage. The proposed FedBoosting method and the associated encryption scheme are described in Section 3 and evaluated on a text recognition task and a binary classification task. The details of the experiments and discussions of the results, as well as a performance comparison, are provided in Section 4, followed by the conclusions in Section 5.

RELATED WORK
FL for privacy-preserving machine learning was proposed for training a model across multiple decentralized edge devices or clients holding local data samples [8], [9], [11], [23], [24]. More specifically, the FL framework keeps the raw data with its owners and trains the model locally at each client node, exchanging and aggregating model gradients instead of data. Compared to Secure Multi-Party Computation (MPC) [25], [26], which ensures a high level of security at the price of expensive cryptographic operations, FL loosens the security requirements, enabling more efficient implementation and lower running costs. Since there is no explicit data exchange, FL requires neither adding noise to the data as in DP [27], [28], [29], [30], [31], nor encrypting data into ciphertext to fit homomorphic operations as in HE [1], [32], [33], [34]. Gradient aggregation from local models is one of the core research problems in FL. McMahan et al. [9] introduced the FedAvg method for training deep neural networks over multiple parties, where the global model takes the average of the gradients from the local models, i.e. $\omega = \frac{1}{N}\sum_{i=1}^{N}\omega_i$, where $\omega$ and $\omega_i$ are the gradients of the global model and the $i$-th local model, and $N$ is the total number of clients. The method was evaluated on the MNIST benchmark and demonstrated its feasibility on the classic image classification task using Convolutional Neural Networks (CNNs) as the learning model. Although the experimental results show that FedAvg is suitable for both Independent and Identically Distributed (IID) and Non-IID data, training local models on large-scale Non-IID data remains a statistical challenge in FL. In this paper, our experimental results also support this argument: the prediction accuracy and convergence rate of FedAvg drop significantly on large-scale Non-IID data.
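As a concrete illustration, the uniform averaging in FedAvg reduces to a single mean over the client updates. The following minimal NumPy sketch (the client values are illustrative) shows the operation:

```python
import numpy as np

def fedavg(client_grads):
    """Uniform FedAvg aggregation: the global gradient is the mean of
    the client gradients, i.e. omega = (1/N) * sum_i omega_i."""
    return np.mean(np.stack(client_grads), axis=0)

# Two toy clients with three-parameter models.
w1 = np.array([1.0, 2.0, 3.0])
w2 = np.array([3.0, 4.0, 5.0])
print(fedavg([w1, w2]))  # [2. 3. 4.]
```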
FL is designed for privacy-protected training, as the data is kept and processed locally. However, as highlighted in multiple studies, e.g. [18], [19], [21], FL suffers from the so-called gradient leakage problem: private training data can be recovered from the publicly shared gradients with a significantly high success rate. Hitaj et al. [18] proposed a training data recovery approach for FL systems using a GAN. It aims to generate samples similar to the training data of a specific class rather than recover the original training data directly. First, the global FL model is trained as usual for several iterations to reach a relatively high accuracy. They assume the malicious participant can obtain one of the client models and use it as a discriminator. An image generator is then updated based on the output of the discriminator given a targeted image class. Finally, the well-trained generator can produce image samples that are similar to the training data of the specified class. Zhu et al. [19] formulated the data recovery task as a gradient regression problem, where the pixel values of the input image are treated as random variables optimized via back-propagation while the shared model parameters are fixed. The objective function measures the Euclidean distance between the gradient shared in FL and the gradient produced by the random image input, which is minimized during training. They hypothesized that, once the optimization converges, the optimized input closely resembles the original training image stored only on the local client. The experimental results on public benchmark datasets prove the hypothesis valid, which in turn indicates that gradient sharing can lead to private data leakage. Our previous work GRNN [21] further improves the success rate of the leakage attack by using a generative model for data recovery, particularly when a large batch size is used in training.
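The severity of gradient leakage can be seen even without any optimization: for a single fully-connected layer with bias under a squared-error loss, the input can be recovered from the shared gradients in closed form. The sketch below (toy dimensions and random values, not the setup of any cited attack) illustrates this well-known property:

```python
import numpy as np

rng = np.random.default_rng(0)
W, b = rng.normal(size=(4, 3)), rng.normal(size=4)   # shared model parameters
x, y = rng.normal(size=3), rng.normal(size=4)        # private training sample

# Gradients of the squared-error loss L = 0.5 * ||W @ x + b - y||^2
err = W @ x + b - y
g_W = np.outer(err, x)   # dL/dW, shared with the server
g_b = err                # dL/db, shared with the server

# Attack: every row of g_W equals err[i] * x, so one division recovers x.
x_recovered = g_W[0] / g_b[0]
print(np.allclose(x_recovered, x))  # True
```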

FedBoosting Framework
FedAvg [9] produces a new model by averaging the gradients from local clients. However, on Non-IID data, the weights of the local models may converge in different directions due to the inconsistency of data distributions. Therefore, the simple averaging scheme performs poorly, especially when strong bias and extreme outliers exist [14], [15], [35]. We therefore propose a boosting scheme, namely FedBoosting, that adaptively merges local models according to their generalization performance on the different local validation datasets. Meanwhile, in order to preserve data privacy, direct information exchange among the decentralized clients is prohibited. Hence, instead of exchanging data between clients, encrypted local models are exchanged via the centralized server and validated on each client independently. More details are shown in Figure 1.
In contrast to FedAvg, the proposed FedBoosting takes the fitness and generalization performance of each client model into account and adaptively merges the global model using different weights for the client models. To achieve this, three pieces of information are generated by each client: the local gradients $G_r^i$, the training loss $T_r^i$ and the validation loss $V_r^{i,i}$, where $G_r^i$ and $T_r^i$ are the local gradients and training loss of the $i$-th local model in training round $r$, and $V_r^{i,i}$ is the validation loss of the $i$-th local model on the $i$-th local validation dataset in round $r$. The local gradients $G_r^i$ are then distributed to all the other clients via the centralized server, so that the cross-validation losses $V_r^{i,j}$, where $i \neq j$, can be obtained on each client. Training and validation losses are the two measurements used to evaluate the predictive performance of the models. It is a valid assumption that a model with a relatively large training loss exhibits poor convergence and poor generalization ability; however, it also suggests that the model gradient contains substantial information for training. Similarly, a low training loss does not guarantee good generalization (the model may over-fit) and carries less training information. Hence, we take the validation loss into consideration. These two losses jointly determine the aggregation weight with which a local model contributes to the global model, as shown in Eqn. (2). On the server, all the validation losses of the $i$-th model are summed, denoted $V_r^i$, representing the $i$-th model's generalization ability. To account for convergence, a softmax layer is applied to $T_r$, and its outputs together with $V_r^i$ are used to calculate the aggregation weight $p_r^i$.
In the current round of aggregation, the new global gradients $G_r$ are computed by merging all the local gradients $G_r^i$ with respect to their weights, i.e. $G_r = \sum_{i=1}^{N} p_r^i G_r^i$, where $T_r^i$ and $p_r^i$ are the training loss and mixture coefficient of the $i$-th local model.
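Since Eqns. (2)-(4) are not reproduced in this excerpt, the following NumPy sketch shows one plausible instantiation of the weighting described above — a softmax over the training losses combined with the summed cross-validation losses — and the weighted merge of the gradients. The exact formulas in the paper may differ:

```python
import numpy as np

def aggregation_weights(train_losses, val_losses):
    """Illustrative mixture coefficients p_r^i from training losses T_r^i
    and the cross-validation loss matrix V_r^{i,j} (model i on client j)."""
    T = np.asarray(train_losses, dtype=float)
    V = np.asarray(val_losses, dtype=float).sum(axis=1)  # summed loss V_r^i
    s = np.exp(T - T.max())
    s /= s.sum()                 # softmax over training losses
    fitness = s / V              # penalise models that validate poorly
    return fitness / fitness.sum()

T = [0.8, 0.5]                        # client training losses (toy values)
V = [[0.4, 0.6], [0.9, 1.1]]          # cross-validation losses
p = aggregation_weights(T, V)
grads = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
G = sum(w * g for w, g in zip(p, grads))  # G_r = sum_i p_r^i * G_r^i
```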
In addition, the proposed FedBoosting scheme is resilient to some malicious attacks, such as data poisoning. For instance, when a malicious client injects poisoned data into its training set and the contaminated local model would otherwise be aggregated with the same weight as the clean models, our method mitigates the attack: the validation scores of the toxic model on the other clients will be significantly worse, which in turn leads to a significantly lower aggregation weight.

HE Aggregation with Quantized Gradient
HE ensures that computation can be carried out on encrypted data, i.e. $Enc(A) \bullet Enc(B) = Enc(A * B)$, where "$\bullet$" denotes an operation on encrypted data and "$*$" the corresponding operation on plain data. Since FedBoosting computes the global model from the local gradients, HE is used in our method to ensure that the aggregation and the gradient exchange among clients and server are secure. In FedBoosting, local models are trained on each client and the local gradients are then transmitted to the server, where all local gradients are integrated to build the global gradients in every round of aggregation. To preserve the gradient information, FedBoosting utilizes the Paillier HE scheme [22]. The overall procedure is summarized in Algorithm 1.

Algorithm 1 FedBoosting with HE and DP
Server:
1: build model and initialize weights ω_0
2: for each round r = 1, 2, ..., R do
3:   for each client i ∈ C_N do
4:     collect g*_r^i and T_r^i from client i via Train(ω_{r-1})
5:   end for
6:   for each client i ∈ C_N do
7:     generate Ĝ*_r^i via Eqn. (5)
8:     collect validation losses from all clients j via Evaluate(j, Ĝ*_r^i)
9:   end for
10:  generate p_r^i via Eqn. (2 & 3)
11:  generate G*_r via Eqn. (4) and send it to all clients for Decrypt(G*_r)
12: end for
Client i — a. Train(ω_{r-1}):
1: if r == 1 then (client C_1) generate key pair and send it to the other clients; else wait for the key pair from C_1
2: if r == 1 then load TrnD_i and ValD_i; else decrypt G*_{r-1} to G_{r-1} with the secret key
3: for each epoch e = 1, 2, ..., E do
4:   for each batch b = 1, 2, ..., B do
5:     ω_r ← ω_{r-1} − η∇l(TrnD_{i,b}, ω_{r-1})
6:   end for
7: end for
8: G_r^i = ω_r − ω_{r-1}
9: g_r^i = (G_r^i × 10^32)/P, encrypted to g*_r^i with the public key
10: T_r^i ← f(TrnD_i | ω_r)
11: return g*_r^i, T_r^i to the server
b. Evaluate(j, Ĝ*_r^i): decrypt Ĝ*_r^i to Ĝ_r^i with the secret key and return the validation loss on ValD_j to the server
c. Decrypt(G*_r): decrypt G*_r to G_r with the secret key; ω_r = ω_{r-1} + G_r; return ω_r

Once the training starts, a pair of HE keys is shared among the clients: the public key is used to encrypt the gradients and the secret key for decryption. After a round of local training, the local gradients are computed as $G_r^i = \omega_r^i - \omega_{r-1}$, where $\omega_{r-1}$ is the global weight from the last round and $\omega_r^i$ is the weight after training in the current round.
It is infeasible to encrypt $G_r^i$ and transmit it to the server directly, as Paillier can only process integer values. To address this issue, we convert $G_r^i$ into a scaled integer form by multiplying it by $10^{32}$. As the weighting scheme on the server side would break the integer-only constraint of the homomorphic computation, to ensure the correctness of the aggregation we divide the scaled $G_r^i$ into $P$ pieces and round to integers according to $g_r^i = G_r^i / P$. The precision loss is negligible, as only the last few digits are dropped. For example, in the case of $P = 10$ the loss is only in the 32nd digit, and for $P = 100$ in the 31st and 32nd digits. Finally, $g_r^i$ is encrypted using Paillier and the ciphertext $g_r^{*i}$ is transmitted to the server. Correspondingly, the aggregation weight $p_r^i$ is converted to an integer by multiplying it by $P$ followed by rounding. In FedBoosting, the aggregation weights are computed with respect to Eqn. (2 & 3), and the final encrypted global gradients $G_r^*$ are computed by merging all the gradients from the clients (see Eqn. (4)). $G_r^*$ is then transmitted back to each client, where it is decrypted to $G_r$ and the global weights are updated by $\omega_r = \omega_{r-1} + G_r$. The proposed secure aggregation approach using HE with quantized gradients is generalizable and can also be applied to FedAvg.
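The fixed-point conversion can be sketched in a few lines of plain Python. Real Paillier ciphertext arithmetic is omitted here, since the point is only the scaling, the piecewise rounding, and the magnitude of the precision loss; the gradient value and weight below are illustrative:

```python
SCALE = 10 ** 32   # fixed-point scale so Paillier only sees integers
P = 100            # number of pieces; server-side weights are scaled by P too

def quantize(g, scale=SCALE, pieces=P):
    """g_r^i = round(G_r^i * scale / P): an integer safe to encrypt."""
    return round(g * scale / pieces)

def apply_weight(q, p, scale=SCALE, pieces=P):
    """The server multiplies the (here plaintext) value by the integer
    weight round(p * P); the client rescales after decryption."""
    return q * round(p * pieces) / scale

g, p = 0.123456789, 0.9
q = quantize(g)
approx = apply_weight(q, p)
print(abs(approx - g * p) < 1e-15)  # only the low-order digits are lost
```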

DP Fusion for Local Model Protection
The local gradients exchanged between client and server are protected by HE as described above. However, in the proposed FedBoosting mechanism, local models are shared with all the other clients for cross-validation, and all clients hold the same key pair. FedAvg shows that a uniformly combined global model is capable of performing similarly to any local model. Therefore, to protect gradient privacy among clients, inspired by DP, we propose to perturb each individual local model by a linear combination of the HE-encrypted local models, in which the target model takes the dominant proportion, i.e. the highest weight. Only the perturbed local models are shared among clients for cross-validation. Empirically, each reconstructed model performs similarly to the corresponding local model. Once the server receives all the encrypted local gradient pieces $g_r^{*i}$, $\forall i = 1, 2, ..., N$, it randomly generates $N$ sets of private fusing weights in which the corresponding local model always takes the largest proportion. The server then computes $N$ reconstructed local models under HE according to the $N$ sets of weights (see Eqn. (5)).
where $\hat{G}_r^{*i}$ is the $i$-th dual-encrypted whole gradient in round $r$. Finally, the server distributes the reconstructed local models to all clients for cross-validation. As HE is used on the server for the linear combination, the models are strictly protected during the exchange between server and clients. The accuracy might drop due to the precision loss of quantized HE and the use of reconstructed local models for cross-validation. However, as shown in Figure 5, our experimental results indicate no significant loss (within 0.5%) in testing accuracy.
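The server-side fusion can be sketched as follows, with plain NumPy arrays standing in for Paillier ciphertexts (a linear combination with known coefficients is exactly the operation Paillier supports homomorphically). The dominant proportion of 0.9 and the gradient values are illustrative:

```python
import numpy as np

def fusion_weights(n, dominant=0.9, rng=None):
    """N private weight vectors; vector i gives model i the largest share."""
    rng = rng or np.random.default_rng()
    W = np.empty((n, n))
    for i in range(n):
        rest = rng.random(n - 1)
        rest = (1.0 - dominant) * rest / rest.sum()  # split the remainder randomly
        W[i] = np.insert(rest, i, dominant)          # own model keeps `dominant`
    return W

# Plaintext stand-ins for the encrypted local gradients g_r^{*i}.
grads = np.array([[1.0, 1.0], [5.0, 5.0], [9.0, 9.0]])
W = fusion_weights(3, rng=np.random.default_rng(0))
perturbed = W @ grads   # row i: the perturbed model sent out for cross-validation
```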

Decision Boundary Comparison using Synthetic Dataset
We first conducted evaluations on two synthetic datasets to compare the decision boundaries of FedAvg and FedBoosting. The task is a binary classification problem with 2D features, chosen so that the decision boundary can be visualized. We assume the data follows 2D Gaussian distributions, and the two datasets were randomly sampled with different means and standard deviations to simulate the Non-IID scenario. Each dataset contains 40,000 samples, split into a training set and a testing set at a ratio of 9:1; each was used for training on one client, and the global models were aggregated using FedAvg and the proposed FedBoosting. Figure 2 (d), (e) and (f) show the two training datasets and the combined testing dataset respectively. A simple neural network was adopted, containing two fully-connected layers followed by a Sigmoid activation layer and a Softmax layer respectively. The first fully-connected layer has 8 hidden nodes and the second has 2. The optimizer is Adam with a learning rate of 0.003. All models trained by FedBoosting outperform those trained by FedAvg with a batch size of 8 and 1 epoch. Figure 2 (a), (b) and (c) present the decision boundaries of the global models trained using FedAvg, FedBoosting and a centralized training scheme respectively. It can be seen that the proposed FedBoosting forms a significantly smoother decision boundary than the FedAvg approach. In addition, the decision boundary of our method is much closer to that of the model trained with the centralized scheme, where both datasets are used together. This study shows that our method is, in principle, more generalizable.
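For reference, the synthetic Non-IID clients can be generated along the following lines; the means and standard deviations below are illustrative placeholders, not the exact values used for Figure 2:

```python
import numpy as np

rng = np.random.default_rng(42)

def make_client(centers, std, n=40000):
    """One client's binary dataset: two 2D Gaussian blobs, split 9:1."""
    half = n // 2
    X = np.vstack([rng.normal(centers[0], std, size=(half, 2)),
                   rng.normal(centers[1], std, size=(half, 2))])
    y = np.concatenate([np.zeros(half), np.ones(half)]).astype(int)
    idx = rng.permutation(n)
    split = int(0.9 * n)
    return (X[idx[:split]], y[idx[:split]]), (X[idx[split:]], y[idx[split:]])

# Two clients with shifted class centres and different spreads (Non-IID).
train1, test1 = make_client(centers=[(-2.0, 0.0), (2.0, 0.0)], std=1.0)
train2, test2 = make_client(centers=[(-1.0, 3.0), (3.0, 3.0)], std=1.5)
```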

Evaluation on Text Recognition Task
We adopt the Convolutional Recurrent Neural Network (CRNN) [36] as the local neural network model for the text recognition task. CRNN uses VGGNet [37] as the backbone network for feature extraction, where the fully-connected layers are removed and the feature maps are unrolled along the horizontal axis. To model the sequential representation, a multi-layer Bidirectional Long-Short Term Memory (BiLSTM) network [38] is placed on top of the convolutional layers; it takes the unrolled visual features as input and models the long-term dependencies within the sequence in both directions. The outputs of the BiLSTM are fed into a Softmax layer, and each element of the unrolled sequence is projected to a probability distribution over the possible characters. The character with the highest Softmax score is taken as the intermediate prediction, and a Connectionist Temporal Classification (CTC) [39] decoder merges the intermediate predictions to produce the final output text. For more details of the CRNN model, the reader can refer to the original publication [36].
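The CTC merging step described above can be illustrated with a minimal greedy decoder, which collapses repeated labels and removes blanks (the label indices below are arbitrary; in CRNN the per-frame indices come from the Softmax argmax):

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Collapse consecutive repeated labels, then drop blank labels."""
    decoded, prev = [], None
    for label in frame_labels:
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded

# e.g. a per-frame argmax sequence -> merged output labels
print(ctc_greedy_decode([0, 1, 1, 0, 2, 2, 2, 0, 1]))  # [1, 2, 1]
```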

Experimental Setting
The proposed model is trained on two large-scale synthetic datasets, Synthetic Text 90k and Synthetic Text, without fine-tuning on other datasets. The models are tested on four other standard datasets to evaluate their general recognition performance. For all experiments, prediction performance is measured using word-level accuracy.
Synthetic Text 90k [40] (Synth90K) is one of the two training datasets used in all experiments in this paper. The dataset contains about 7.2 million images with their corresponding ground-truth words. For FedBoosting, we split the images into a training set of 6.5 million images and a validation set of 0.7 million images.
Synthetic Text [40] (SynthText) is the second training dataset we used. The dataset has about 85K natural images containing many synthetic texts. We cropped all the text via the labeled bounding boxes to build a new dataset of 5.3 million text images, split into a training set of 4.8 million images and a validation set of 0.5 million images for FedBoosting.
IIIT 5K-Words [41] (IIIT5K) is collected from the internet and contains 3000 cropped word images in its testing set, each with a ground-truth word.
Street View Text [42] (SVT) is collected from Google Street View and consists of 647 word images in its testing set. Many images are severely corrupted by noise and blur, or have very low resolution.
SCUT-FORU [43] (SCUT) consists of 813 training images and 349 testing images. Background and illumination vary greatly across the dataset.
ICDAR 2015 [44] (IC15) contains 2077 cropped images with relatively low resolutions and multi-oriented texts. For a fair comparison, we discard images containing non-alphanumeric characters, which results in 1811 images.
Figure 3 shows some examples from Synth90K and SynthText, where a large variation in backgrounds and texts can be observed between the two datasets; we can therefore treat the two datasets as Non-IID. All training and validation images are scaled to 100×32 in order to fit the model with mini-batches and to accelerate training. The testing images in Tables 1 and 3 are scaled proportionally to a height of 32 pixels, while in Figures 4, 5 and 6 the testing images are processed in the same way as the training and validation images. Testing images whose labels are shorter than 3 or longer than 25 characters are dropped due to the limitation of CTC. We deploy the Synth90K and SynthText datasets on two separate clients. On the local training nodes, AdaDelta is used for back-propagation optimization with an initial learning rate of 0.05. For HE, the key size is set to 128 bits; the whole gradient is split into P = 100 pieces and p = 0.9. Our method is implemented using Keras and TensorFlow, and the source code is publicly available to ensure reproducibility; it can be run in a distributed multi-GPU setup. Table 1 shows the comparison results on the testing datasets under different training hyper-parameters, including batch size and number of epochs. The results in the first row (CRNN*) and the second row (CRNN) are produced by the original CRNN model without the FL framework, where CRNN* corresponds to the accuracies reported by its authors in [36] and CRNN to the results reproduced by our implementation.

Result and Discussion
It can be observed that the FedAvg models with larger batch sizes and fewer epochs perform better. In other words, the models perform better when model aggregation occurs more frequently, which however increases the communication cost. In Table 1, the model with batch size 256 and 1 epoch even produces no result, due to model divergence after a few rounds of aggregation; the potential reason is the extreme difference in the parameters learned on each local machine (see Figure 4). The FedBoosting models (Figure 4, second row) do not have such an issue. We can therefore conclude that the proposed boosting strategy overcomes the model collapse issue of FedAvg to a great extent.
Table 1: Recognition accuracies (%) on four testing datasets. "90K" and "ST" stand for the Synth90K and SynthText datasets respectively. The results in the first row (CRNN*) and the second row (CRNN) are produced by the original CRNN model without the FL framework, where CRNN* corresponds to the accuracies reported in [36] and CRNN to the results reproduced using our implementation.
Table 3 provides the comparison results on three testing datasets (IIIT5K, SVT and IC15) with different FL gradient merging methods and encryption modes, using a batch size of 800 and 1 epoch. The FedAvg results illustrate that, although HE causes a slight precision loss when dividing the whole gradient into many pieces, the accuracy on the testing datasets is almost unaffected: the accuracy even rises on IC15 from 72.37% to 73.00%, while on the other two testing datasets the losses are 0.27% and 1.6% respectively. Likewise, the FedBoosting models show only a slight, tolerable accuracy reduction when HE alone is used, and adding DP on top of HE has almost no effect on accuracy.
Since DP is only used to perturb the local gradients exchanged between clients for cross-validation on all clients' validation datasets, it has little impact on the generation of the global gradients. However, there is a drop in testing accuracy between the plain FedBoosting and the encrypted FedBoosting models, e.g. the accuracy is reduced from 87.62% to about 85% on IIIT5K and by approximately 3% on IC15; we regard this as normal fluctuation in training DL models. Although all three testing accuracies drop to different degrees, the curves in Figure 5 show the accuracy trends under the different encryption modes, and the differences on most testing datasets are rather small. Note that all samples in the testing parts of Figure 5 are resized to 100×32, unlike in Table 3 where the samples are scaled proportionally to a height of 32 pixels.

Performance Comparison
We further attribute the divergence issue to the quality of the datasets, see Figure 3. That is to say, the local models trained on different private datasets inevitably have different generalization abilities. In our experiments, naively aggregating the global model by averaging all the weights of the local models can degrade generalization, especially when the number of local update iterations is large (i.e. small batch size or large epoch number). The proposed FedBoosting therefore assigns each local model a fairer weight, instead of a uniform average, by trading off its training and validation performance. Following this idea, FedBoosting first considers each model's validation losses on every client's validation dataset, and then incorporates the training losses to compute the weights of the local models. The reason we also consider training losses is that a local model trained on a high-quality dataset usually fits well, yet may perform poorly on a low-quality validation dataset; it would be unfair to conclude that this model generalizes poorly based only on its validation performance across datasets of varying quality. Conversely, a model trained on a low-quality dataset may perform very well on a high-quality validation dataset, but we do not want such a local model to dominate the global model. To this end, we first sum the validation losses as a reference representing a local model's generalization ability, and then take the training losses into account to rectify this reference and obtain the final weight for each local model.
In our experiments, the weights are about 55% for the local model trained on the Synth90K dataset and 45% for that trained on SynthText, which is reasonable because the accuracy results in Table 1 show that CRNN models trained on Synth90K consistently outperform those trained on SynthText. If we discarded the training losses, the weight of the local model trained on Synth90K would instead be smaller than that of the model trained on SynthText.
To validate this design choice in FedBoosting, a performance comparison is given here. It is commonly accepted that generalization ability is a good metric for judging a model's performance; however, considering generalization ability alone is not sufficient for FedBoosting. Equally, deploying our method using only the training losses and discarding the validation losses is not viable either, as it may lead to an extremely unfair situation in which the local model trained on Synth90K takes a weight of up to about 80% in the global model. The following discussion therefore focuses on the role of the training losses in FedBoosting. We trained a global model with batch size 256 and 1 epoch under the FedBoosting strategy without the training losses. As described above, the purpose of the training losses is to rectify the weights of the local models. From Figure 6 (a), we can see that the global model trained without the training losses exhibits delayed convergence, only converging around round 24, while in the other experiments the models all converge quickly and properly under the supervision of the training losses. Moreover, the performance of the global model with the training losses is consistently better than that without. As a supplement, Figure 6 (b) visualizes the global testing accuracy of two models with batch size 800 and 1 epoch, one using the training losses and the other not. Both models converge normally in this case, but the model using the training losses outperforms throughout. From the above, we conclude that using the training losses to supervise the local weights is essential in our scenario. To clarify, all testing images during training are resized to 100×32, which differs from the individual testing experiments where images are resized to W×32, with W scaled proportionally to the height but at least 100 pixels; this is why the accuracies in Figure 6 are lower than those in Table 1.
Please refer to our code for more details.

CONCLUSION
In this paper, we proposed FedBoosting, a boosting scheme for the FL framework, to overcome the limitations of FedAvg on Non-IID datasets. To defend against gradient leakage attacks, a gradient sharing protocol using HE and DP was introduced. A comprehensive comparison study was carried out on a synthetic dataset and public text recognition benchmarks, showing superior performance over the traditional FedAvg scheme on Non-IID data. Our implementation is publicly available to ensure reproducibility, and it can be run in a distributed multi-GPU setup. Theoretical study of model convergence in multi-party computation, privacy leakage from gradients, and more efficient gradient quantization methods are three directions worthy of further investigation.