LoAdaBoost: Loss-based AdaBoost federated machine learning with reduced computational complexity on IID and non-IID intensive care data

Intensive care data are valuable for improving health care, policy making and many other purposes. Vast amounts of such data are stored in different locations, on many different devices and in different data silos. Sharing data among different sources is a big challenge for regulatory, operational and security reasons. One potential solution is federated machine learning, a method that sends machine learning algorithms simultaneously to all data sources, trains models at each source and aggregates the learned models. This strategy allows valuable data to be used without moving them. One challenge in applying federated machine learning is that data from diverse sources may follow different distributions. To tackle this problem, we proposed an adaptive boosting method named LoAdaBoost that increases the efficiency of federated machine learning. Using intensive care unit data from hospitals, we investigated the performance of learning in IID and non-IID data distribution scenarios, and showed that the proposed LoAdaBoost method achieved higher predictive accuracy with lower computational complexity than the baseline method.


Introduction
Health data from intensive care units can be used by medical practitioners to provide health care and by researchers to build machine learning models to improve clinical services and make health predictions. But such data is mostly stored in a distributed manner on mobile devices or in different hospitals because of its large volume and high privacy requirements, implying that traditional learning approaches on centralized data may not be viable. Therefore, federated learning, which avoids data collection and central storage, becomes necessary, and significant progress has been made to date. In 2005, Rehak et al. [1] established CORDRA, a framework that provided standards for an interoperable repository infrastructure where data repositories were clustered into community federations and their data were retrieved by a global federation using the metadata of each community federation. In 2011, Barcelos et al. [2] created an agent-based federated catalog of learning objects (AgCAT system) to facilitate access to distributed educational resources. Although little machine learning was involved in these two models, their practice of distributed data management and retrieval served as a reference for the development of federated learning algorithms.
In 2012, Balcan et al. [3] implemented probably approximately correct (PAC) learning in a federated manner and reported the upper and lower bounds on the amount of communication required to obtain desirable learning outcomes. In 2013, Richtárik et al. [4] proposed a distributed coordinate descent method named HYbriD for solving loss minimization problems with big data. Their work provided the bounds of communication rounds needed for convergence and presented experimental results with the LASSO algorithm on 3TB data. In 2014, Fercoq et al. [5] designed an efficient distributed randomized coordinate descent method for minimizing regularized non-strongly convex loss functions and demonstrated that their method was extendable to a LASSO optimization problem with 50 billion variables. In 2015, Konečný et al. [6] introduced a federated optimization algorithm suitable for training on massively distributed, non-independent and identically distributed (non-IID) and unbalanced datasets.
In 2016, McMahan et al. [7] developed the FederatedAveraging (FedAvg) algorithm that fitted a global model with the training data left locally on distributed devices (known as clients). The method started by initializing the weights of a neural network model at a central server, then distributed the weights to clients for training local models, and stopped after a certain number of iterations (also known as global rounds). In one global round, data held on each client would be split into several batches according to the predefined batch size; each batch was passed as a whole to train the local model; and an epoch would be completed once every batch had been used for learning. Typically, a client was trained for multiple epochs and sent the weights after local training to the server, which would compute the average of the weights from all clients and distribute it back to them. Experimental results showed that FedAvg performed satisfactorily on both IID and non-IID data and was robust to various datasets.
More recently, Konečný et al. [8] modified the global model update of FedAvg in two ways, namely structured updates and sketched updates. The former meant that each client would send its weights in a pre-specified form of a low-rank or sparse matrix, whereas the latter meant that the weights would be approximated or encoded in a compressed form before being sent to the server. Both approaches aimed at reducing the uplink communication costs, and experiments indicated that the reduction could reach two orders of magnitude. In addition, Bonawitz et al. [9] designed the Secure Aggregation protocol to protect the privacy of each client's model gradient in federated learning, without sacrificing the communication efficiency. Later, Smith et al. [10] devised a systems-aware optimization method named MOCHA that considered simultaneously the issues of high communication cost, stragglers, and fault tolerance in multitask learning. Zhao et al. [11] addressed the non-IID data challenges in federated learning and presented an improved version of FedAvg with a data-sharing strategy whereby the test accuracy could be enhanced significantly with only a small portion of globally shared data among clients. The strategy required the server to prepare a small holdout dataset G (sampled from an IID distribution) and globally share a random portion α of G with all clients. The size of G was defined as $\beta = \frac{\text{number of examples in } G}{\text{total number of examples in all clients}} \times 100\%$. There existed two trade-offs: first, between test accuracy and α; and second, between test accuracy and β. A rule of thumb was that the larger α or β was, the higher the test accuracy achieved. It is worth mentioning that since G was a separate dataset from the clients' data, sharing it would not be a privacy breach. Since no specific name was given to this method in Zhao et al.'s paper [11], we referred to it as "FedAvg with data-sharing" in our study. Bagdasaryan et al. [12] designed a novel model-poisoning technique that used model replacement to backdoor federated learning. Liu et al. used a federated transfer learning strategy to balance global and local learning [13][14][15][16].
Most of the previously published federated learning methods focused on optimization of a single issue such as test accuracy, privacy, security or communication efficiency; yet none of them considered the computation load on the clients. This study took into account three issues in federated learning, namely, the local client-side computation complexity, the communication cost, and the test accuracy. We developed an algorithm named Loss-based Adaptive Boosting FederatedAveraging (LoAdaBoost FedAvg), where the local models with a high cross-entropy loss were further optimized before model averaging on the server. To evaluate the predictive performance of our method, we extracted the data of critical care patients' drug usage and mortality from the Medical Information Mart for Intensive Care (MIMIC-III) database [17] and the eICU Collaborative Research Database [18]. The data were partitioned into IID and non-IID distributions. In the IID scenario LoAdaBoost FedAvg was compared with FedAvg by McMahan et al. [7], while in the non-IID scenario our method was complemented by the data-sharing concept before being compared with FedAvg with data-sharing by Zhao et al. [11]. Our primary contributions include the application of federated learning to health data and the development of the straightforward LoAdaBoost FedAvg algorithm, which performed better than the state-of-the-art FedAvg approach.

FedAvg: The baseline in IID scenario
Developed by McMahan et al. [7], the FedAvg algorithm trained neural network models via local stochastic gradient descent (SGD) on each client and then averaged the weights of the client models on a server to produce a global model. This local-training-and-global-average process was carried out iteratively as follows. At the tth iteration, a random fraction C of the clients was selected for computation: the server first sent the average weights from the previous iteration (denoted $w^{t-1}_{\text{average}}$) to the selected clients (except for the first iteration, where the clients started their models from the same random weight initialization); each client independently learnt a neural network model initialized with $w^{t-1}_{\text{average}}$ on its local data, divided into minibatches of size B, for E epochs, and then reported the learned weights (denoted $w^{t}_{k}$, where k was the client index) to the server for averaging (see Fig 1). The global model was updated with the average weights at each iteration. FedAvg was utilized as the baseline method in the IID scenario, where both the training and test data were independent and identically distributed.
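The round structure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the names `average_weights` and `fedavg_round`, the flat-list representation of weights, and the callback standing in for E epochs of local training are all assumptions made here. The plain unweighted mean matches the description in the text; with clients of unequal size, a size-weighted mean would be the natural variant.

```python
import random
from typing import Callable, List

Weights = List[float]  # flattened model weights (illustrative representation)


def average_weights(client_weights: List[Weights]) -> Weights:
    """Element-wise mean of the weight vectors reported by the clients."""
    n = len(client_weights)
    return [sum(column) / n for column in zip(*client_weights)]


def fedavg_round(
    global_weights: Weights,
    clients: List[Callable[[Weights], Weights]],
    fraction_c: float,
    rng: random.Random,
) -> Weights:
    """One global round: sample clients, train locally, average on the server."""
    m = max(int(fraction_c * len(clients)), 1)  # at least one client per round
    selected = rng.sample(clients, m)
    # Each (hypothetical) client callback starts from a copy of the current
    # global weights and returns its weights after local training.
    reports = [local_update(list(global_weights)) for local_update in selected]
    return average_weights(reports)


def make_toy_client(shift: float) -> Callable[[Weights], Weights]:
    """Toy stand-in for local training: shifts every weight by a constant."""
    return lambda w: [x + shift for x in w]
```

With three toy clients that shift the weights by 1, 2 and 3 and C = 100%, one round moves every global weight by the mean shift of 2.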

FedAvg with data-sharing: The baseline in non-IID scenario
As demonstrated in the literature [7], FedAvg exhibited satisfactory performance with IID data, but its accuracy could drop substantially when trained on non-IID data. According to Zhao et al. [11], this was because, with non-IID sampling, the stochastic gradient could no longer be regarded as an unbiased estimate of the full gradient. To address the challenge, they proposed an improved version of FedAvg: a data-sharing strategy complemented FedAvg by globally sharing a small subset of training data between all the clients (see Fig 2). Stored on the server, the shared data was a dataset distinct from the clients' data and was assigned to clients when FedAvg was initialized. Thus, this strategy improved FedAvg with no harm to privacy and little addition to the communication cost. The strategy had two parameters: α, the random fraction of the globally shared data distributed to each client, and β, the ratio of the globally shared data size to the total client data size. Raising these two parameters could lead to better predictive accuracy but would also make federated learning less decentralized, reflecting a trade-off between non-IID accuracy and centralization. In addition, it is worth mentioning that Zhao et al. also introduced an alternative initialization for their data-sharing strategy: the server could train a warm-up model on the globally shared data and then distribute the model's weights to the clients, rather than assigning them the same random initial weights. In this work, we kept the original initialization method to leave all computation on the clients. FedAvg with data-sharing was used as the baseline method in the non-IID scenario, where both the training and test data came from non-IID distributions.
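The roles of α and β can be made concrete with a short sketch. It is an illustration under assumptions, not the procedure of Zhao et al.: the function names are invented here, and the choice that every client draws its own random subset of G (rather than all clients receiving the same subset) is an assumption, since the text only states that a random fraction α is distributed to each client.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def share_holdout(
    client_data: List[List[T]],
    holdout: Sequence[T],
    alpha: float,
    rng: random.Random,
) -> List[List[T]]:
    """Append a random alpha-fraction of the holdout set G to each client."""
    n_shared = round(alpha * len(holdout))
    pool = list(holdout)
    augmented = []
    for local in client_data:
        # Assumption: each client draws its own subset, without replacement.
        augmented.append(list(local) + rng.sample(pool, n_shared))
    return augmented


def beta_percent(holdout_size: int, total_client_examples: int) -> float:
    """Size of G relative to all client data, in percent."""
    return 100.0 * holdout_size / total_client_examples
```

For example, a holdout set of 270 examples against 27,000 client examples gives β = 1%, and α = 10% then adds 27 shared examples to each client of 300.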

LoAdaBoost FedAvg
We devised a variant of FedAvg named LoAdaBoost FedAvg that was based on cross-entropy loss to adaptively boost the training process on those clients appearing to be weak learners. Since in our study the data labels were either 0 (survival) or 1 (expired), binary cross-entropy loss was adopted as the error measure of model-fitting and calculated as

$$L = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i \log f(x_i) + (1 - y_i)\log\bigl(1 - f(x_i)\bigr) \right] \qquad (1)$$

where N was the total number of examples, $x_i$ was the input drug feature vector, $y_i$ was the binary mortality label, and f was the federated learning model. The objective of each client model under FedAvg and LoAdaBoost learning was to minimize Eq 1, which measured goodness-of-fit: the lower the loss was, the better a model was fitted. Our method utilized the median cross-entropy loss $L^{t-1}_{\text{median}}$ of clients that participated in the previous global round t − 1 as a criterion for boosting Client k. Retraining for more epochs would be incurred if, after training for E/2 epochs at the current global round t, Client k's cross-entropy loss $L^{t,0}_{k}$ was above $L^{t-1}_{\text{median}}$. The reason for using the median loss rather than the average was that the latter was less robust to outliers, namely significantly underfitted or overfitted client models. Communication between clients and the server under LoAdaBoost is demonstrated in Fig 3. Not only the model weights but also the cross-entropy losses were communicated between the clients and the server. At the tth iteration, the server delivered the average weights $w^{t-1}_{\text{average}}$ and the median loss $L^{t-1}_{\text{median}}$ obtained at the (t − 1)th iteration to each client; then, each client learnt a neural network model in a loss-based adaptive boosting manner, and reported the learnt weights $w^{t,r}_{k}$ and the cross-entropy loss $L^{t,r}_{k}$ to the server. The global model was parametrized by the average of $w^{t,r}_{k}$. Algorithm 1 shows how LoAdaBoost worked in detail. The server started a neural network model by randomly initializing the weight $w_0$, which was then distributed to each client.
The initial value of the median training loss ($L^{0}_{\text{median}}$) of client models was set to 1.0, and the number of clients participating in federated learning (m) was determined by the product of the client percentage C and the total client count K. At least one client model would be trained in each global round. At the tth round, Client k was initialized with the average weight from the (t − 1)th round, $w^{t-1}_{\text{average}}$, and trained on the local data for E/2 epochs to obtain weight $w^{t,0}_{k}$ and loss $L^{t,0}_{k}$ before retraining. For odd E, E/2 would be rounded up to the nearest integer. If $L^{t,0}_{k}$ was not greater than the median loss from the previous round, $L^{t-1}_{\text{median}}$, computation on Client k would be finished, with $w^{t,0}_{k}$ and $L^{t,0}_{k}$ sent to the server. Otherwise, the client would be retrained for another E/2 epochs. The new loss was then denoted $L^{t,1}_{k}$, where the superscript 1 indicated the first retraining round. If $L^{t,1}_{k}$ was still above $L^{t-1}_{\text{median}}$, Client k would be retrained for E/2 − 1 more epochs. This process was repeated for retraining rounds r = 1, 2, 3, . . ., each round for max(E/2 − r + 1, 1) epochs, and stopped once the retrained loss $L^{t,r}_{k}$ dropped below $L^{t-1}_{\text{median}}$ or the total number of epochs (including initial training and retraining) reached 3E/2. Lastly, $L^{t,0}_{k}$ and the final $w^{t,r}_{k}$ were sent to the server.

Algorithm 1 LoAdaBoost FedAvg. The K clients are indexed by k, C is the fraction of clients that perform computation at each global round, and E is the number of local epochs.
. . .
15: for each retrain round r = 1, 2, . . . do
16:     train $f_k$ for max(E/2 − r + 1, 1) epochs to obtain $w^{t,r}_{k}$ and $L^{t,r}_{k}$
17:     if $L^{t,r}_{k} \le L^{t-1}_{\text{median}}$ or total training epochs ≥ 3E/2 then
18:         return $w^{t,r}_{k}$

Depending on its cross-entropy loss, each client would be trained for at least E/2 epochs and at most 3E/2 epochs.
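The client-side schedule described above can be sketched as follows. This is a hedged illustration, not the authors' code: `train_epochs` is a hypothetical callback that trains the local model for a given number of extra epochs and returns the current cross-entropy loss, and two details are assumptions made here because the text does not settle them: the cap of 3E/2 is rounded down for odd E, and the last retraining round is truncated so that the cap is never exceeded.

```python
import math
from typing import Callable, Tuple


def loadaboost_client_update(
    train_epochs: Callable[[int], float],
    median_loss_prev: float,
    epochs_e: int,
) -> Tuple[float, int]:
    """Run the loss-based retraining schedule on one client.

    Returns the final loss and the total number of epochs run.
    """
    half = math.ceil(epochs_e / 2)   # E/2, rounded up for odd E
    budget = (3 * epochs_e) // 2     # assumed hard cap of 3E/2 epochs
    loss = train_epochs(half)        # initial training
    total = half
    r = 1
    while loss > median_loss_prev and total < budget:
        # Retraining round r: a decaying number of epochs, at least one,
        # truncated (an assumption) so that the cap is not exceeded.
        n = min(max(half - r + 1, 1), budget - total)
        loss = train_epochs(n)
        total += n
        r += 1
    return loss, total
```

With E = 10, a client whose loss never falls to the previous median runs 5 + 5 + 4 + 1 = 15 epochs under these assumptions, while a client that is already below the median after the initial pass stops at 5 epochs.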
We set the maximum training epochs to 3E/2 to control the computational complexity of LoAdaBoost, aiming to prevent it from running more average epochs than FedAvg. The median cross-entropy loss of clients from the (t − 1)th global round, $L^{t-1}_{\text{median}}$, was used as the criterion for retraining clients at the tth round. In the worst-case scenario, no improvement of training loss was made on each client after the initial E/2 epochs, and about half of the clients were retrained for the full E additional epochs. Thus, the expected number of epochs per client per global round would be at most E.
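The bound can be checked with a short calculation. It takes the worst case at face value: roughly half of the participating clients fall at or below the previous median and stop after the initial E/2 epochs, while the other half run the full 3E/2 epochs (the even split is the approximation stated in the text, not an exact property).

```latex
\[
\mathbb{E}[\text{epochs per client per round}]
\;\approx\; \frac{1}{2}\cdot\frac{E}{2} \;+\; \frac{1}{2}\cdot\frac{3E}{2}
\;=\; \frac{E}{4} + \frac{3E}{4}
\;=\; E .
\]
```

Any client that drops below the median before exhausting its budget lowers this figure, which is why E acts as an upper bound on the expectation rather than a typical value.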
LoAdaBoost was adaptive in the sense that the performance of a poorly-fitted client model after the first E/2 epochs was boosted via continuous retraining for a decaying number of epochs. The quality of training was determined by comparing the model's loss $L^{t,r}_{k}$ with the median loss $L^{t-1}_{\text{median}}$. In this way, our method was able to ensure that the losses of most (if not all) client models would be lower than the median loss at the prior iteration, thereby making the learning process more effective. In addition, because at one iteration only a few of the client models were expected to be trained for the full 3E/2 epochs, the average number of epochs run on each client would be less than E, meaning a smaller local computational load under our method than under FedAvg. Furthermore, since both $L^{t-1}_{\text{median}}$ and $L^{t,r}_{k}$ were single values transferred together with $w^{t,r}_{k}$ between the server and Client k, little additional communication cost would be incurred by our method.
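A server-side counterpart can be sketched to show how small the extra payload is: each client report carries one additional scalar, and the server derives one scalar to send back. The function name `aggregate` and the tuple layout of a report are assumptions for illustration only.

```python
from statistics import mean, median
from typing import List, Tuple

Report = Tuple[List[float], float]  # (flattened weights, cross-entropy loss)


def aggregate(reports: List[Report]) -> Tuple[List[float], float]:
    """Average the reported weights and take the median of the reported losses."""
    weights = [w for w, _ in reports]
    losses = [loss for _, loss in reports]
    avg_weights = [mean(column) for column in zip(*weights)]
    return avg_weights, median(losses)
```

The choice of the median matters when one client is badly fitted: for reported losses of 0.4, 0.5 and 9.0 the median stays at 0.5, whereas the mean would be pulled up to about 3.3 and would let almost every client pass the retraining test.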
Similar to other stochastic optimization-based machine learning methods [11][19][20][21], an important assumption for our approach to work satisfactorily was that the stochastic gradient on the clients' local data was an unbiased estimate of the full gradient on the population data. This held true for IID data but broke down for non-IID data. In the latter case, an optimized client model with low losses did not necessarily generalize well to the population, implying that reducing the losses by adding more epochs on the clients was less likely to enhance the global model's performance. This non-IID problem could be alleviated by combining LoAdaBoost FedAvg with the data-sharing strategy, because the local data became less non-IID when integrated with even a small portion of IID data.

The MIMIC-III database
The performance evaluation concerned the MIMIC-III database [17], which contains health information for critical care patients at a large tertiary care hospital in the US. Included in MIMIC-III are 26 tables of data ranging from patients' admissions to laboratory measurements, diagnostic codes, imaging reports, hospital length of stay and more. We processed three of these tables, namely ADMISSIONS, PATIENTS and PRESCRIPTIONS, to obtain two new tables as follows:
• ADMISSIONS and PATIENTS were inner-joined on SUBJECT_ID to form the PERSONAL_INFORMATION table, which recorded AGE_GROUP, GENDER and the survival status (MORTALITY) of all patients.
• Each patient's usage of DRUGS during the first 48 hours of stay (that is, STARTDATE − ENDDATE = two days) at the hospital was extracted from PRESCRIPTIONS to give the SUBJECT_DRUG_TABLE table.
Further joining these two tables on SUBJECT_ID gave a dataset of 30,760 examples, from which we randomly selected 30,000 examples to form the evaluation dataset where DRUGS were the predictors and MORTALITY was the response variable. The summary of this dataset was provided in Table 1.
The drug feature contained 2814 different drugs prescribed to the patients. Table 2 shows the first six drugs D5W (that is, 5% dextrose in water), Heparin Sodium, Nitro-glycerine, Docusate Sodium, Insulin and Atropine Sulphate. If a drug was prescribed to a patient (identified by SUBJECT_ID), the corresponding cell in the table would be marked 1, and 0 otherwise. For instance, Patient 9 was given D5W and Insulin while none of the first six drugs were offered to Patient 10.
The evaluation dataset was shuffled and split into a training set of 27,000 examples and a holdout set of 3,000 examples for implementing the data-sharing strategy. As in the literature [7], the training set was partitioned over 90 clients in two ways: IID, in which the data was randomly divided into 90 clients, each consisting of 300 examples; and non-IID, in which the data was first sorted according to AGE_GROUP and GENDER and then split into 90 equal-sized clients. Using the skewed non-IID data, we were able to assess the robustness of our model in scenarios where the IID assumption cannot be made, which is more realistic in the healthcare industry.
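The two partitioning schemes can be sketched as below. The record layout (a dict with AGE_GROUP and GENDER keys) and the function names are assumptions for illustration; only the column names and the shuffle-versus-sort distinction come from the text.

```python
import random
from typing import Dict, List

Record = Dict[str, object]  # one patient example (illustrative schema)


def split_equal(records: List[Record], n_clients: int) -> List[List[Record]]:
    """Cut a list into n_clients consecutive, equal-sized chunks."""
    size = len(records) // n_clients
    return [records[i * size:(i + 1) * size] for i in range(n_clients)]


def partition_iid(
    records: List[Record], n_clients: int, rng: random.Random
) -> List[List[Record]]:
    """IID: shuffle first, so every client sees a random mix of patients."""
    shuffled = list(records)
    rng.shuffle(shuffled)
    return split_equal(shuffled, n_clients)


def partition_non_iid(records: List[Record], n_clients: int) -> List[List[Record]]:
    """Non-IID: sort by age group and gender, so clients hold skewed slices."""
    ordered = sorted(records, key=lambda r: (r["AGE_GROUP"], r["GENDER"]))
    return split_equal(ordered, n_clients)
```

Applied to 27,000 records and 90 clients, both functions yield clients of 300 examples; in the sorted variant a typical client contains patients from a single age group and gender, which is what makes the distribution skewed.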

Parameter sets
The neural network trained on each client consisted of three hidden layers with 20, 10 and 5 units, respectively, using rectified linear unit (ReLU) activation functions. There were 56,571 parameters in total. The stochastic optimizer chosen in this study was Adaptive Moment Estimation (Adam), which requires less memory and is more computationally efficient according to empirical results [22]. We used the default parameter set for Adam in the Keras framework: the learning rate η = 0.001 and the exponential decay rates for the moment estimates β1 = 0.9 and β2 = 0.999. In addition, while setting the minibatch size B to 30, we experimented with the number of epochs E = 5, 10 and 15 and the fraction of clients C = 10%, 20%, 50% and 100% (same as in the work of McMahan et al. [7]). As for the parameters of the data-sharing strategy, we experimented with various combinations of α (10%, 20% and 30%) and β (1%, 2% and 3%). For instance, α = 10% and β = 1% meant that the shared holdout set held 270 examples (1% of the 27,000 client examples) and each client received 27 random examples from it, that is, only 0.1% of the total non-IID data. Small α and β were chosen to implement the data-sharing strategy because we only sought to demonstrate that data-sharing could narrow the performance gap between learning on IID and non-IID data. Large values were unnecessary for this purpose, though both α and β could be increased to further enhance the performance, at the expense of decentralization [11].
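The reported parameter count can be reproduced with a few lines of arithmetic. The sketch assumes a fully connected network whose input is the 2814 binary drug indicators described earlier and whose output is a single unit for the mortality label; the single output unit is an assumption, although it is the choice that matches the stated total.

```python
from typing import Sequence


def dense_param_count(layer_sizes: Sequence[int]) -> int:
    """Weights plus biases of a fully connected network with the given layer widths."""
    return sum(
        n_in * n_out + n_out  # weight matrix plus one bias per output unit
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# 2814 drug inputs, hidden layers of 20, 10 and 5 units, one output unit (assumed).
print(dense_param_count([2814, 20, 10, 5, 1]))  # 56571
```

Almost all of the parameters (56,300 of 56,571) sit in the first layer, so the model size is driven by the number of drug features rather than by the depth of the network.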

Evaluation metrics
Evaluation metrics were twofold. First, the area under the ROC curve (AUC) was used to assess the predictive performance of a federated learning model. Here, ROC stands for the receiver operating characteristic curve, a plot of the true positive rate (TPR) against the false positive rate (FPR) at various thresholds. For a given threshold, TPR was the ratio of the number of mortalities correctly predicted by the global model to the total number of mortalities in the test dataset, while FPR was calculated as 1 − specificity, where specificity was the ratio of the number of correctly predicted survivals to the total number of survivals. In our study, 10-fold cross validation was performed to reduce the level of randomness. In IID evaluation, we partitioned the MIMIC-III data of 27,000 examples into 90 clients (each holding 300 examples) and further randomly split the clients into 10 folds (each containing 9 clients). In non-IID evaluation, the data was sorted by patients' age and gender before partitioning.

PLOS ONE
AdaBoost federated machine learning on medical data

Then, each fold was regarded as the test data in turn and the remaining nine folds were used to train FedAvg and LoAdaBoost. Predictions for every fold were recorded and compared against the true labels, and the ROC AUC at convergence was calculated. This process was repeated five times, resulting in a set of five cross-validation AUC values. FedAvg and LoAdaBoost were compared in terms of the average and standard deviation of these values. Second, we defined average epochs of clients as the expected number of epochs to run on a single client in a complete federated learning process and used the metric to measure the computational complexity of federated learning algorithms.
where T was the total number of global rounds taken by an algorithm to converge and m was the number of clients participating in computation at each global round. Under FedAvg, average epochs would be a constant value of E times the number of global rounds, while under our adaptive method it would vary because each client was expected to run for a different number of epochs. In the experiments, we set a maximum number of global rounds, then carried out 10-fold cross validation five times with different random seeds, and finally calculated cross-validation AUCs and average epochs.
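One way to write the average-epochs metric that is consistent with this description is given below. It is a reconstruction under assumptions, not a formula recovered from the original: the symbol $E^{t}_{k}$, standing for the number of epochs actually run by the kth participating client at global round t, is introduced here for illustration.

```latex
\[
\text{average epochs} \;=\; \frac{1}{m}\sum_{t=1}^{T}\sum_{k=1}^{m} E^{t}_{k}
\]
```

Under FedAvg every client runs exactly E epochs in every round, so $E^{t}_{k} = E$ and the expression reduces to $E \cdot T$, the constant value mentioned in the text. Under LoAdaBoost, $E^{t}_{k}$ ranges between E/2 and 3E/2 depending on the client's loss.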

Results
LoAdaBoost was evaluated against the baseline FedAvg algorithm in the IID scenario and against FedAvg with data-sharing in the non-IID scenario. We adopted the data-sharing strategy on non-IID data because there was a performance gap between the two scenarios, as depicted in the corresponding figure. Following [7], each curve in the figure was made monotonically increasing by taking the highest test-set AUC achieved over all previous global rounds. It is apparent that FedAvg on IID data consistently exhibited a higher test AUC than on non-IID data for all values of E. Throughout the evaluation, 10-fold cross-validation with five repetitions was carried out to obtain an accurate estimate of predictive performance: 27,000 examples of the MIMIC-III data were divided into 90 equally-sized clients, which were further randomly split into 10 folds, each containing nine clients. In cross validation, each fold was regarded as the test set in turn and the other nine folds were used to train models. The remaining 3,000 examples were utilized as the holdout set to implement the data-sharing strategy in the non-IID scenario.

Evaluation in IID scenario
Given the same E, our method seemed to converge slightly slower (lagging a couple of global rounds) but nonetheless to a higher test AUC than FedAvg. We speculate that the reason for this lagged convergence is as follows. In the first few global rounds, where each client model was underfitted, learning with FedAvg would be more efficient because each client was trained for the full five epochs. After a few global rounds, some client models would start to overfit and impose an adverse effect on the predictive performance of the averaged model on the server, so the learning speed of FedAvg would be lowered. On the other hand, our method would be less affected by individual overfitted client models, because the loss-based adaptive boosting mechanism would enable underfitted models to be trained for more than five epochs and overfitted ones for fewer. Finally, when all clients became overfitted, FedAvg and our method would cease to learn, though the convergence AUC for the latter would be higher.
In addition, both algorithms converged faster with a larger value of E. With E equal to 5, they began to converge at the 15th global round; with E equal to 10, they had already converged at the 10th round; and with E equal to 15, at the 5th round FedAvg had already converged while our method began to converge to a higher point. To make the superiority of our method more credible, 10-fold cross validation was carried out with different combinations of C and E, and was repeated five times under each experimental setting. The Wilcoxon signed-rank test was performed on the AUC sets for FedAvg and our method. Average cross-validation AUC (with standard deviation), average epochs, and p-values for the statistical test are shown in Table 3.
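For readers unfamiliar with the test, the sketch below computes an exact Wilcoxon signed-rank p-value for small paired samples by enumerating all sign patterns. It is an illustration under assumptions: the one-sided alternative is a choice made here (the text does not state which variant or software was used), the function names are invented, and the input numbers in the usage note are made up rather than taken from this study.

```python
from itertools import product
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def wilcoxon_exact_greater(x: Sequence[float], y: Sequence[float]) -> float:
    """Exact one-sided p-value for the alternative that x tends to exceed y."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences dropped
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Under the null hypothesis every sign pattern is equally likely,
    # so all 2^n patterns are enumerated (feasible only for small n).
    patterns = list(product((0, 1), repeat=len(ranks)))
    hits = sum(
        1
        for signs in patterns
        if sum(r for r, s in zip(ranks, signs) if s) >= w_plus - 1e-12
    )
    return hits / len(patterns)
```

For five made-up pairs in which the first score always exceeds the second, such as (5, 4), (7, 5), (9, 6), (11, 7) and (13, 8), the function returns 1/32 = 0.03125, which is the smallest one-sided exact p-value attainable with five pairs.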
For all combinations of Cs and Es, our method exhibited less computational complexity (that is, fewer average epochs) than FedAvg. With C = 10%, 20% and 50%, our method consistently achieved higher cross validation AUCs than FedAvg (p = 0.03); with C = 100%, the latter's AUC was marginally higher (0.7888 versus 0.7887, and p = 0.78). However, implementing C of 100% might not be beneficial in practice, because involving all clients in federated learning was computationally costly and would not necessarily lead to the best predictive performance (0.7905 for FedAvg with C = 20% and 0.7940 for LoAdaBoost with C = 10%).

Evaluation in non-IID scenario
The data distribution became non-IID after sorting the examples by age and gender. FedAvg with data-sharing [11] was the state-of-the-art method that narrowed the performance gap between IID and non-IID data. The data-sharing strategy implemented on FedAvg could effectively counter the adverse effect of non-IID data distributions. To facilitate a fair comparison, we adopted the strategy and evaluated LoAdaBoost with data-sharing against Zhao et al.'s method. As in the IID scenario, we prepared data for cross validation by partitioning the non-IID examples into 90 clients, each holding 300 examples, and randomly divided the clients into 10 folds, each containing nine clients. Fig 6 compares the predictive performance (test AUC versus global rounds) of FedAvg and LoAdaBoost with the distribution fraction α = 10%, 20% and 30%, respectively. The globally shared data size β, client fraction C and epoch count E were set to 1%, 10% and 5, respectively. For all values of α, both methods started to converge by the 10th global round; given the same α, our method achieved a higher test AUC than FedAvg.
Unlike IID evaluation where our method converged slower than FedAvg, here both methods had roughly the same convergence speed. We speculate the reason to be that learning on each client model with non-IID data became more difficult than with IID data, and so training for constantly five epochs across all client models was no longer advantageous.
As in IID evaluation, 10-fold cross validation was performed five times. We fixed C to 10% and E to 5 while varying α from 10% to 30% and β from 1% to 3%. As shown in Table 4, both methods' AUCs at convergence increased with a larger value of α or β (that is, when more data was shared with each client). More importantly, our method always achieved a higher AUC with fewer average epochs. With α = 20% and β = 1% (that is, each client received only 54 additional examples, 0.2% of the total data), both methods obtained higher cross-validation AUCs than those in the IID scenario (0.7954 versus 0.7842 for FedAvg with data-sharing and 0.8016 versus 0.7916 for LoAdaBoost with data-sharing). Furthermore, it is worth mentioning the trade-off between the size of shared data and predictive accuracy: the more data was distributed across the clients, the higher the AUCs obtained, and vice versa.
Moreover, we further investigated the effect of increasing the client percentage on predictive performance by fixing α = 10%, β = 1% and E = 5 and varying C. The 10-fold cross validation results are displayed in Table 5. Our method obtained higher cross-validation AUCs than FedAvg with data-sharing with C = 10%, 20%, 50% and 100%, and in all cases each client model under LoAdaBoost with data-sharing was expected to run fewer epochs per global round than under FedAvg with data-sharing.

Evaluation on eICU data
To demonstrate the robustness of our method, we included in the experiments another critical care dataset from the eICU Collaborative Research Database [18]. The eICU data was in nature [...] Table 6. As with MIMIC-III, DRUGS prescribed to patients during the first 48 hours of stay were used to predict MORTALITY of patients. In addition, another 90 randomly chosen examples were prepared as the holdout set (that is, β = 1%) for implementing the data-sharing strategy. For IID evaluation, we shuffled the 9,000 examples and then partitioned them into 30 clients, each containing 300 examples. The clients were randomly divided into 10 equally-sized folds. Nine folds were regarded as the training set and the remaining fold was used as the test set. Throughout the evaluations, C and E were set to 10% and 5, respectively. In the non-IID scenario with the data-sharing strategy, α was set to 10%. Fig 7 shows the evaluation results of a single run of cross validation. Federated learning outcomes on eICU were different from those on MIMIC-III data. Learning became more difficult, as both the baseline and our method took 50 or more global rounds to converge. In addition, as displayed in the figure, AUCs with non-IID data were close to 0.65 but dropped to roughly 0.6 when data-sharing was adopted, while AUCs with IID data were notably lower for both methods. Therefore, learning on non-IID data seemed easier than on IID data, which resonated with the evaluation results of language modeling on the Shakespeare dataset in McMahan et al.'s work [7]. What was consistent with the evaluation on MIMIC-III data was that LoAdaBoost converged to higher AUCs with fewer average epochs than FedAvg, whether the scenario was IID, non-IID or non-IID with data-sharing. This finding was confirmed by the results of 10-fold cross validation with five repetitions (see Table 7).

Discussion
Distributed health data in large quantity and of high privacy can be harnessed by federated learning, where both data and computation are kept on the clients. In this study, we proposed LoAdaBoost FedAvg, which adaptively boosted the performance of individual clients according to cross-entropy loss. Under the federated learning scheme, the data held on each client was random in the IID scenario and came from different distributions in the non-IID scenario; and the randomly chosen clients participating in each round of learning would also be different. Therefore, if the number of epochs E was fixed as in the case of FedAvg, there would very likely be certain underfitted or overfitted clients at each global round, which would adversely affect model averaging at the server. On the other hand, our method first trained each client for very few epochs, then defined the goodness-of-fit of each client by comparing its cross-entropy loss with the median loss from the previous round, and finally achieved performance boosting by further training poorly-fitted clients for more epochs, well-fitted ones for fewer, and over-fitted ones for none. In this manner, all clients would be expected to be more appropriately trained than those of FedAvg. Experimental results with IID data and non-IID data showed that LoAdaBoost FedAvg converged to slightly higher AUCs and consumed fewer average epochs of clients than FedAvg. Our approach can also be extended to learning tasks in other fields, such as image classification and speech recognition, wherever the data is distributed.
As a final point, federated learning with IID data does not always outperform that with non-IID data. Evaluation on the eICU data is such an example; and another one is the language modeling task on the Shakespeare dataset [7] where learning on the non-IID distribution reached the target test-set AUC nearly six times faster than on IID. In cases like this, the data-sharing strategy becomes unnecessary. Moreover, according to Zhao et al. [11], weight divergence would occur in neural network models trained on clients holding data from different distributions, and was positively correlated with the degree of data skewness. The predictive accuracy of FedAvg could be reduced by up to 55% due to high weight divergence. When non-IID data is severely skewed, LoAdaBoost may also lose its competitive advantage. This is because the weights of clients' models can all diverge from the well-tuned weight that could have been obtained in centralized learning [11], and the measure of median client-training loss may no longer be an effective indicator of the overall training quality of federated learning.
In the continuation of our study, we will investigate what kind of medical datasets may result in superior modeling performance with non-IID distribution and why this occurs. Furthermore, we will try to improve the LoAdaBoost FedAvg algorithm to make learning on such datasets even easier.