Personalized Federated Learning for In-Hospital Mortality Prediction of Multi-Center ICU

Federated learning (FL), a paradigm for applying machine learning (ML) to private distributed data, provides a novel and promising scheme to promote ML across multiple independent healthcare institutions. However, the non-IID and unbalanced nature of the data distribution can degrade its performance and may even cause institutions to lose the motivation to participate in training. This paper explored the problem through an in-hospital mortality prediction task on an actual multi-center ICU electronic health record database that preserves the original non-IID and unbalanced data distribution. It first analyzed the reasons for the performance degradation of baseline FL under this data scenario, and then proposed a personalized FL (PFL) approach named POLA to tackle the problem. POLA is a personalized one-shot, two-step FL method capable of generating high-performance personalized models for each independent participant. POLA was compared with two other PFL methods in experiments, and the results indicate that it not only effectively improves the prediction performance of FL but also significantly reduces the number of communication rounds. Moreover, its generality and extensibility give it the potential to be extended to other similar cross-silo FL application scenarios.


I. INTRODUCTION
With the promotion of electronic health record (EHR) systems, a huge amount of EHR data has emerged [1]. EHR datasets, which contain exhaustive information such as patient diagnosis and treatment, underpin the application of machine learning (ML) in digital health. Moreover, their rich resources and valuable implicit information have also made ML one of the hottest technologies in their secondary analysis [2]. Nevertheless, due to the privacy and sensitivity of EHR, the application of traditional ML, which requires centralizing or releasing these data, poses not only legal, ethical, and regulatory challenges, but also technical ones [3]. Though there are some corresponding solutions to get around these restrictions, such as removing some key information to anonymize the patient data or adding privacy-preserving algorithms in the transmission process to prevent data leakage [4], the above problem has not been fundamentally solved because they still involve data migration.
The associate editor coordinating the review of this manuscript and approving it for publication was Antonio J. R. Neves.
Federated learning (FL) [5], [6], which emerged as a paradigm to address the concerns of applying ML to private distributed data sources, brings promising prospects for further promoting ML in the digital healthcare field [7]. It is a distributed ML setting that can effectively assist multiple independent clients, such as mobile phones, IoT devices, and organizations, in conducting isolated data usage and ML modeling in accordance with user privacy protection, data security, and government regulations [8]. For healthcare, FL can implement ML in independent institutions without sharing any raw EHR data, which enables common and valuable information contained in the isolated data silos to be shared while protecting patient privacy and sensitive information. In typical EHR applications, FL can help to find clinically similar patients across institutions to support medical research and applications [9], develop a general decentralized framework for predicting hospitalization caused by cardiac events [10], and predict ICU mortality and length of stay [11], including under COVID-19 [12].
However, while FL has been proven to be effective and feasible on EHR data from independent institutions, its performance can be degraded by the non-independently and identically distributed (non-IID) and unbalanced nature of these EHR data silos. Specifically, the non-IID property can significantly reduce model effectiveness in FL, for example through a loss of prediction accuracy in the clients' local ML models [13], [14], [15], and this effect can be magnified by data imbalance [6]. Furthermore, the skewness of non-IID datasets (their divergence from IID data) also has a significant impact on FL performance. It can even be claimed that whether the validity of FL on non-IID data can be guaranteed depends on the extent to which the data distribution skews toward non-IID [16], because once this skewness reaches a certain degree, the performance of the FL model is affected, resulting in an accuracy loss that increases with the skewness [17], [18]. Overall, for the application of FL in a multi-institution EHR scenario, a non-IID and unbalanced data distribution, especially one with high skewness, can reduce model performance to the point where locally and independently trained models outperform the FL-trained model, which removes the main incentive for these healthcare organizations to participate in FL and can even make FL meaningless.
To address the challenge outlined above, numerous FL optimization techniques have emerged, which have been summarized and divided into global optimization and local adaptation by D. Ting et al [19]. The local adaptation methods are specially proposed to deal with the statistical challenges in FL, which enables each participant to obtain a personalized model rather than accept a shared unified model. At present, personalized federated learning (PFL) incorporating an early straightforward ''FL training + local adaptation'' scheme and various subsequent techniques [20] has become a popular research branch [21]. As several personalized FL studies [20], [22] suggest, FL can recover from performance degradation by personalizing individuals' local models with their specific data when confronted with heterogeneous data environments such as non-IID and unbalanced distributions.
Consistent with the premise of PFL techniques, we argued that it is no longer applicable to generate a unified functional model for all FL participants in the non-IID and unbalanced data environment. Consequently, to cope with the challenge we propose a Personalized One-shot Local Adaptation (POLA) FL method after modifying the optimization problem of the standard FL. The proposed method aims to improve the performance of in-hospital mortality prediction in an actual multiple independent ICU center environment. Moreover, in order to further verify the effectiveness of the proposed method, we naturally divide the distributed ICU datasets in two different ways to generate ICU centers with different non-IID data skewness while preserving the actual data distribution. Experiments demonstrate that POLA can effectively enhance the model's mortality prediction performance in this data environment, as well as significantly reduce the number of communication rounds of FL training.
The main contributions of this work are as follows:
1) We underpinned our research problem by conducting experiments on baseline FL in the data context of this study.
2) We transformed the original global optimization problem of standard FL into a problem optimized for each individual, and then proposed a PFL method called POLA to generate highly personalized models for independent ICU centers.
3) We experimentally compared POLA with baseline FL and two other PFL methods to demonstrate that it not only improves the model performance but also effectively reduces the communication overhead of FL.
The rest of this paper is organized as follows. Section II introduces the preliminary knowledge related to baseline FL, personalized FL, federated knowledge distillation, and AutoML. Section III presents the detailed designs of our proposed personalized FL scheme. The experimental evaluation and analysis are presented in Section IV. Finally, the work is discussed and concluded in Sections V and VI, respectively.

A. BASELINE FEDERATED LEARNING
The prototype and baseline of FL is a distributed ML algorithm based on mini-batch Stochastic Gradient Descent (SGD) named FederatedAveraging (FedAvg) [6]. Early optimization strategies for distributed ML generally involve iterative averaging of local models via adapting SGD in the local training process for optimization [23]. FedAvg is an adaptation of this kind of strategy under data privacy concerns. It is an orchestration pattern of distributed clients coordinated by a central server, where the clients both collect data and perform major computation tasks, and the central server coordinates the training process by integrating updated information exchanged with the clients [6].
The optimization objective of FedAvg can be defined as the global minimization problem below.
$$\min_{\omega} \sum_{i=1}^{N} p_i f_i(\omega) \tag{1}$$
where $N$ is the number of clients participating in the FL training, $\omega$ is a vector that contains the global model parameters, and $f_i(\omega)$ is the objective function of the $i$-th client, which is determined by an arbitrary specific ML model and optimization algorithm. The optimization problem can thus be interpreted as finding the optimal $\omega$ that minimizes the weighted average loss of the models trained on all clients. $p_i$ specifies the relative impact of the $i$-th client and satisfies $1 > p_i \ge 0$ and $\sum_{i=1}^{N} p_i = 1$. It generally takes one of two settings.
To illustrate this method, its specific learning process and pseudo code are presented in Algorithm 1 [6]. The central server first establishes and initializes a global shared model and then sends it to randomly selected clients. The selected clients independently and in parallel run the SGD optimizer, with a pre-set number of local iterations and mini-batch size, on the received global model using their own data, and then return the updated model or model parameters to the server. After receiving the information returned by all participating clients, the server updates the global model by performing a weighted average of these parameters according to the data proportion of each client. The clients then perform local training again after receiving the updated global model and return their updated local model parameters to the server. These steps are repeated until a preset number of communication rounds is reached.
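The server-side aggregation step described above can be sketched as follows. This is a minimal illustration rather than the implementation used in this study: the function name `fedavg_aggregate`, the flat-list representation of model parameters, and the symbol n_i for the local sample count of client i are assumptions introduced here.

```python
from typing import List, Sequence


def fedavg_aggregate(client_params: Sequence[Sequence[float]],
                     client_sizes: Sequence[int]) -> List[float]:
    """Weighted average of client parameter vectors.

    Client i contributes with weight p_i = n_i / n, where n_i is its local
    sample count (assumed notation) and n is the total over all clients.
    """
    if not client_params or len(client_params) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    dim = len(client_params[0])
    aggregated = [0.0] * dim
    for params, size in zip(client_params, client_sizes):
        weight = size / total  # p_i; the weights sum to 1
        for j in range(dim):
            aggregated[j] += weight * params[j]
    return aggregated


# Two clients holding 100 and 300 samples: the larger client dominates.
new_global = fedavg_aggregate([[1.0, 0.0], [5.0, 4.0]], [100, 300])
print(new_global)  # [4.0, 3.0]
```

With equal client sizes the function reduces to a plain average, which corresponds to giving every client the same relative impact.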

Algorithm 1 FedAvg [6]
Inputs:
  - local training data on each client
  - unified global model
Outputs:
  - unified global model with updated parameters
Initialize:
  - total communication rounds R
  - local training iterations E
  - local mini-batch data size B
  - learning rate η
  - parameters ω of the global model
for each communication round r from 1 to R:
  Server update:
    Randomly select N = C × K clients, where C ∈ (0, 1] and K is the total number of clients
    Send ω_r to all selected clients
    After all selected clients have sent back their updated ω_r^i:
      Update ω_{r+1} ← Σ_{i=1}^{N} (n_i / n) ω_r^i, where n_i is the data size of client i and n = Σ_i n_i
  Client update:
    for each client i from 1 to N:
      Initialize the local model parameters ω_r^i ← ω_r
      Split the local training data into batches of size B
      for each iteration e from 1 to E:
        for each batch d:
          ω_r^i ← ω_r^i − η ∇f_i(ω_r^i; d)

B. PERSONALIZED FEDERATED LEARNING
The initial intention of FL is to generate a globally unified model that performs effectively across the majority of participating clients. Since this idea has been proven to be limited in dealing with non-IID and unbalanced data [16], [17], personalized federated learning (PFL) has emerged as a remedy. As mentioned by Kulkarni et al. [20] and Tan et al. [21], the performance deterioration caused by heterogeneous data in FL can be addressed by personalized solutions.
Recently, research on PFL has grown rapidly. Numerous PFL strategies have been developed to address the failure of the unified global model to generalize well when FL faces data heterogeneity [21]. Since this study involves local adaptation to personalize FL, we briefly summarize the related methods as follows:
1) Model fine-tuning. In highly heterogeneous data, performance gains can be achieved by simply fine-tuning all or part of the parameters of the global model obtained from FL training with private data locally on the client [18], [24].
2) Local loss regularization. The client-drift problem caused by data heterogeneity is alleviated by adding a regularization loss to the local training process to obtain better-performing personalized models [25], [26].
3) Meta-learning. Its representative mechanism in FL is first to learn a parameterized model (or meta-learner) through the FL training process with algorithms such as MAML and Reptile; a specific personalized model for each client can then be trained quickly under the guidance of the meta-learner [27], [28].
4) Multi-task learning. It aims to learn various models for multiple related tasks simultaneously, which is consistent with the mechanism of local adaptation for FL [29], [30].
5) Transfer learning. It enables knowledge sharing among related domains to improve a learner's performance. In the FL setting under a heterogeneous data scenario, it helps the client models complete the local adaptation so as to obtain personalized models [31], [22].
6) Knowledge distillation (KD). It can be associated with FL to distill knowledge, such as classification scores [32] and logit vectors [33], from the global model to guide the local client models in learning their personalized models.
Although all these PFL methods can improve the performance of FL on non-IID data problems, the ways in which they further personalize ML models are different. For example, model fine-tuning, meta-learning, multi-task learning, and transfer learning all personalize the parameters of the global model learned in FL. Local loss regularization personalizes the loss function of individual models in the FL learning process. Knowledge distillation can simultaneously personalize the structure and parameters of individual models as well as hyperparameters. This work aims to make the models as personalized as possible to gain performance enhancement as much as possible for FL. Therefore, the KD technique that has the most potential for model personalization is employed. In the next section, its related applications in FL are reviewed.

C. FEDERATED KNOWLEDGE DISTILLATION
KD is a student-teacher learning strategy with weak model correlation that was proposed and popularized by Hinton et al. [34]. It is extensively implemented in two major domains: model compression and knowledge transfer [35]. For model compression, KD can be used to learn a lightweight model with decent performance from the trained cumbersome model to meet the needs of real-time or edge applications. As to knowledge transfer, KD refers to a student-teacher learning structure in which the models that provide and learn knowledge are regarded as teacher and student, respectively. It enables students to learn from a larger pre-trained teacher model or an ensemble of teacher models. Consequently, KD is also regarded as an effective method that is frequently employed to transfer information from one network to another in ML.
Based on this knowledge transfer feature, KD has been applied to FL, and their combination is called federated knowledge distillation (FKD). In general FKD schemes, the global shared model is regarded as a teacher that guides the independent clients in training their local models [20]. Unlike the standard FL method, which directly exchanges models or parameters between clients and server, FKD allows distilled model knowledge to be exchanged as information. Thus, the communication cost during FL training can be significantly reduced, especially for deep ML models. However, since the distilled knowledge generally cannot contain as much information as the model parameters, FKD methods are usually accompanied by a decline in model accuracy.
A typical FKD method is federated distillation (FD) [33], which only exchanges the prediction logit vectors between server and clients to make the communication overhead model-independent. Compared with baseline FL, it significantly reduces the training communication overhead but greatly decreases the model accuracy. Subsequently, a hybrid FD method (HFD) [36] was proposed as an enhancement of FD by adding an average covariate vector to the corresponding logit vectors. However, even though the model accuracy of HFD is improved compared with FD at constant communication cost, it is still lower than that of baseline FL. Although these approaches can reduce the communication cost in FL, the sacrifice of model accuracy is not worth the gain, especially on non-IID data, because participating individuals may not obtain any model performance gain from FL.
To alleviate this problem, some studies introduced public datasets in FKD. For instance, FedMD [32] pre-downloads a public dataset on each client to distill a classification score as exchanged knowledge. Another similar method is MHAT [37]. Each of its clients also holds a public dataset to generate the exchanged information. By introducing public datasets, both methods can reduce the communication cost while maintaining or improving the model accuracy. However, appending public datasets to FL is not recommended because it violates the original FL intention of not sharing raw data [21]. In addition, if all the clients need to download the public dataset frequently, there will be a sizable additional communication burden [15].
In addition to being utilized in FL to decrease communication overhead, KD can also be used to learn heterogeneous models for independent clients to deepen their personalization. This takes advantage of KD's weak model correlation, which means that the teacher and student models are not required to share the same structure or set of hyperparameters. In FL, this means that the local model of each independent individual can be regarded as a student model that is independent of the teacher model and can be highly personalized according to the distribution characteristics of its local data. Li Hu et al. [37] applied this strategy by generating heterogeneous models for clients while reducing communication overhead to compensate for the accuracy loss in FKD.

D. AutoML
This study utilized a heuristic algorithm involving automated machine learning (AutoML) in the optimization of personalized models, which may be confused with existing comparable studies. Thus, to show the difference between our proposed method and the existing ''FL + AutoML'' approaches, we conducted a retrospective analysis of related studies as follows.
AutoML is a combination of automation and machine learning (ML), booming in both academic and industrial fields in recent years. Its emergence has handed over the ML processes that require massive human interventions and efforts to the machine itself, such as algorithm and model selection, further realizing the real 'machine learning' [38].
Recently, as FL has developed, more and more researchers have found that AutoML can be combined with FL to address the problem that a pre-defined unified model is not suitable for non-IID data distributions. Currently, the most popular use of AutoML in FL is neural architecture search (NAS), which is typically used for personalized design and optimization of clients' local models. For example, to save communication resources and accommodate edge devices in FL, Hangyu Zhu et al. [39] proposed an evolutionary real-time federated NAS approach that not only optimizes the performance of deep neural networks (DNNs), but also reduces the local payload of independent clients. In addition, a method named FedNAS [40] and a general framework named MGF-NAS [41] have been developed for similar purposes to automate the model selection process in FL.
By reviewing the existing federated AutoML research, it can be found that almost all of them focus on the NAS of DNN models, especially convolutional neural networks (CNNs). Because the structure of the DNN model has a great impact on the communication overhead and the performance of FL, its automatic design and optimization can bring the most considerable benefits. But since our study does not involve DNN and is not limited to NAS, we do not compare it with existing federated NAS methods.

III. PROPOSED METHOD
A. PROBLEM DEFINITION
As can be observed from (1), standard FL optimizes the parameters of a unified global model. However, the experimental analysis in Subsection IV-D shows that this optimization objective is no longer applicable in the data environment of this study. Therefore, we modified the optimization problem and expressed it in terms of α and θ, which respectively represent the structure and parameters of the local client model, and ω, which is consistent with (1) and represents the parameters of the global model. This definition demonstrates how we changed the optimization problem of FL from determining the parameters of a unified global model into finding the optimal unique model structure and parameter set for each independent individual in FL. Furthermore, the specific parameters θ_i of each participant are related to the parameters ω of the global model, which means that the optimization problem of this work is not separated from the original FL setting, and its purpose is to rebalance the global generalization experience with the local data knowledge to produce the optimal personalized models.
VOLUME 11, 2023

B. OVERALL FRAMEWORK
The proposed scheme is a two-step, one-shot PFL method, an overview of which is illustrated in Fig. 1. Two-step here refers to FL training and local adaptation, where FL training obtains a shared model with adequate global generalization experience, and local adaptation is a subsequent step that generates high-performance personalized models for independent individuals.
One-shot means that the local adaptation only needs to be performed once for each individual in the entire training process. This one-shot adaptation process is KD-based student-teacher learning, which regards the selected shared model as the teacher and treats the locally and independently personalized models as students. It enables the independent ICU centers to design their own personalized student models in parallel and then makes these student models learn from both the teacher model and their own datasets to improve performance by rebalancing global experience and local data knowledge.
Furthermore, to give the student models the most suitable personalized design for optimizing their performance, the adaptation step also includes an optimization process for the personalized model. However, this process is usually time-consuming and labor-intensive. To simplify and automate it, a classical heuristic technique, the Genetic Algorithm (GA), is introduced. GA is a classical and effective evolutionary algorithm that searches for the optimal solution through selection, crossover, and mutation. In this study, it simultaneously provides a wide search space and optimal solutions for the hyperparameters and model structures that need to be designed automatically. The details of the proposed method are described in the next subsection.
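The selection, crossover, and mutation loop mentioned above can be sketched as follows, using the inverse of a validation error as the fitness, as in the detailed description of the method. This is only an illustration under stated assumptions: the three genes, the search ranges in `LAYERS`, `NODES`, and `RATES`, the elitist selection rule, and the stand-in `toy_error` function are all hypothetical and are not taken from this study, where the error would come from training and validating a student model.

```python
import random
from typing import Callable, Tuple

Chromosome = Tuple[int, int, float]  # (hidden layers, nodes per layer, learning rate)

# Hypothetical search space for illustration only.
LAYERS = (1, 2, 3)
NODES = (50, 100, 200)
RATES = (0.1, 0.01, 0.001)
SPACES = (LAYERS, NODES, RATES)


def random_chromosome(rng: random.Random) -> Chromosome:
    return tuple(rng.choice(space) for space in SPACES)


def crossover(a: Chromosome, b: Chromosome, rng: random.Random) -> Chromosome:
    # Uniform crossover: each gene is inherited from one of the two parents.
    return tuple(rng.choice(pair) for pair in zip(a, b))


def mutate(c: Chromosome, rng: random.Random, rate: float = 0.2) -> Chromosome:
    # Each gene is resampled from its search space with probability `rate`.
    return tuple(rng.choice(space) if rng.random() < rate else gene
                 for gene, space in zip(c, SPACES))


def evolve(validation_error: Callable[[Chromosome], float],
           generations: int = 10, population_size: int = 8,
           seed: int = 0) -> Chromosome:
    rng = random.Random(seed)
    population = [random_chromosome(rng) for _ in range(population_size)]

    def fitness(c: Chromosome) -> float:
        return 1.0 / (validation_error(c) + 1e-12)  # inverse validation error

    for _ in range(generations):
        # Selection: the fitter half survives unchanged as parents.
        population.sort(key=fitness, reverse=True)
        parents = population[: population_size // 2]
        children = []
        while len(parents) + len(children) < population_size:
            mother, father = rng.sample(parents, 2)
            children.append(mutate(crossover(mother, father, rng), rng))
        population = parents + children
    return max(population, key=fitness)


def toy_error(c: Chromosome) -> float:
    # Stand-in for "train the student model and return its validation error".
    layers, nodes, rate = c
    return abs(layers - 2) + abs(nodes - 100) / 100 + abs(rate - 0.01)


best = evolve(toy_error)
print(best, toy_error(best))
```

Because the parents are carried over unchanged, the best solution found so far is never lost between generations.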

C. DETAILED DESCRIPTION
Algorithm 2 presents the specific implementation of the proposed method POLA. As described above, Step 1 completes the baseline FL training to obtain the teacher model required for the subsequent local adaptation. The teacher model is critical to the outcome of the local adaptation. However, the validation experiment in Section IV shows that baseline FL is no longer able to ensure the performance of its global model in the multi-center ICU data environment. If we directly take the global model obtained when training is completed as the teacher model, POLA's effectiveness cannot be guaranteed.
Therefore, in order to obtain a teacher model with stable performance and sufficient generalization knowledge, we adjust the baseline FL, the details of which are shown in Algorithm 3. We first divide the local training dataset of each center into training and validation data, and then use them for FL training and the global model's validation, respectively. Next, once the global shared model has learned enough generalization experience, that is, after a preset threshold number of training rounds $R_w$, the average validation error of all participants is calculated in each round to decide whether the current global model can be selected as the teacher model. Finally, when the entire FL training is over, the global model with the minimum validation error is selected.
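The teacher selection rule described above can be sketched as follows. This is a simplified illustration, not Algorithm 3 itself: the function name `select_teacher_round`, the nested-list layout of the validation errors, and the choice to start checking exactly at round R_w (rather than after it) are assumptions made here.

```python
from typing import Sequence, Tuple


def select_teacher_round(val_errors_per_round: Sequence[Sequence[float]],
                         warmup_rounds: int) -> Tuple[int, float]:
    """Return (round number, mean validation error) of the selected teacher.

    val_errors_per_round[r - 1][i] is the validation error of the round-r
    global model on client i. Rounds before `warmup_rounds` (R_w in the
    text) are skipped, because the global model is assumed not to have
    learned enough generalization experience yet.
    """
    best_round, best_error = -1, float("inf")
    for r, errors in enumerate(val_errors_per_round, start=1):
        if r < warmup_rounds:
            continue
        mean_error = sum(errors) / len(errors)  # average over participants
        if mean_error < best_error:
            best_round, best_error = r, mean_error
    if best_round < 0:
        raise ValueError("no round at or after the warm-up threshold")
    return best_round, best_error


# Four rounds, two clients; round 1 is ignored because R_w = 2.
errors = [[0.10, 0.10], [0.40, 0.50], [0.30, 0.20], [0.35, 0.45]]
print(select_teacher_round(errors, warmup_rounds=2))
```

In this toy input the lowest average error after the warm-up threshold occurs in round 3, so the global model of that round, not the final one, would serve as the teacher.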
Step 2 is performed in parallel on each independent ICU center, which mainly contains two procedures. Procedure 1 is to coordinate the entire local adaptation process by the GA algorithm, which can automatically provide personalized solutions and evolve to produce the optimal one for each participant. Procedure 2 is to build and train the personalized model according to the solutions provided in the Procedure 1, and then return the results to evaluate.
In Procedure 1, the personalization solution, which involves structure design and hyperparameter selection, is treated as the chromosome of the GA, and the inverse of the model validation error returned by Procedure 2 is treated as its fitness function. The GA then performs selection, crossover, and mutation to evolve suitable personalized solutions according to the fitness. When the evolutionary process is completed, the best solution becomes the final personalized setting of the student model. In Procedure 2, each ICU center first independently builds its own personalized models according to the solutions provided by Procedure 1. The structured personalized models are then initialized based on the teacher model produced in Step 1. To make up for the drawback that the student model usually cannot outperform the teacher model in a KD scheme, and to speed up the training process, the initialization is layerwise. The input and first hidden layers of the model are initialized directly with the corresponding parameters of the teacher model, and the remaining layers are initialized randomly. This suits our purpose because the base layers of a neural network model tend to contain more general knowledge. Next, the initialized models learn from both the teacher model and the local dataset, treating the general experience of the teacher model as the soft target and the specific knowledge in the local raw data as the hard target. To make the local personalized model learn as much knowledge as possible from the teacher model, we utilize two different methods to distill the outputs and the features of the teacher model, respectively. The output distillation is a classical class probability distillation method [34], which tries to minimize the difference between the classification probability distributions of the teacher and the student.
After estimating the classification probability of a neural network via a SoftMax function as in (3) (where $z_n$ represents the $n$-th category output among $M$ objectives and $T$ is the temperature factor used to control the weights of each soft target), if we express the last layer's prediction outputs of the teacher model and the student model as logit vectors $z_t$ and $z_s$, respectively, then their divergence loss can be represented as $l_{s1}$ in (4).
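For reference, under the standard temperature-scaled SoftMax of [34], the two expressions referred to above can be written as follows. This is a sketch built from the symbol definitions in the text: the shorthand $p(z, T)$ for the vector of softened probabilities is introduced here, $L_R$ denotes the divergence measure between the two distributions, and the exact form in the original may differ.

```latex
p_n = \frac{\exp(z_n / T)}{\sum_{m=1}^{M} \exp(z_m / T)} \tag{3}

l_{s1} = L_R\big(p(z_t, T),\; p(z_s, T)\big) \tag{4}
```

A larger $T$ flattens the distribution, so that the relative probabilities of the non-dominant classes carry more weight in the soft target.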

Algorithm 3 Adjusted FedAvg Algorithm
Generally, $L_R$ is the Kullback-Leibler (KL) divergence loss, but it can also be set to the Cross Entropy or MSE loss depending on the actual situation. This study specifically uses the MSE loss function. The feature distillation approach transfers knowledge from teacher to student by minimizing the divergence between the joint density probability estimations [42]. It first expresses the feature spaces of the teacher model and the student model as two conditional probability distributions $p_{n|m} \in [0, 1]$ and $q_{n|m} \in [0, 1]$, and then uses the KL loss to calculate the difference between them, which yields the training loss function $l_{s2}$.
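The feature distillation idea described above can be sketched as follows. The text does not state which kernel is used to estimate the conditional probabilities, so a cosine-similarity kernel shifted to [0, 1] is assumed here; the function names and the small smoothing constants are likewise hypothetical. Because only pairwise similarities within a batch are compared, the teacher and student feature vectors may have different dimensions.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_kernel(a: Vector, b: Vector) -> float:
    """Cosine similarity shifted to [0, 1] (assumed kernel choice)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 0.5 * (dot / (norm + 1e-12) + 1.0)


def conditional_probs(features: Sequence[Vector]) -> List[List[float]]:
    """probs[m][n] estimates p_{n|m}: the similarity of sample n to sample m,
    normalised over all n != m so that each row sums to 1."""
    size = len(features)
    probs = []
    for m in range(size):
        sims = [0.0 if n == m else cosine_kernel(features[n], features[m])
                for n in range(size)]
        total = sum(sims) + 1e-12
        probs.append([s / total for s in sims])
    return probs


def feature_distillation_loss(teacher_feats: Sequence[Vector],
                              student_feats: Sequence[Vector],
                              eps: float = 1e-7) -> float:
    """KL divergence between teacher (p) and student (q) conditionals."""
    p = conditional_probs(teacher_feats)
    q = conditional_probs(student_feats)
    loss = 0.0
    for m in range(len(p)):
        for n in range(len(p)):
            if n != m:
                loss += p[m][n] * math.log((p[m][n] + eps) / (q[m][n] + eps))
    return loss


teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
student = [[1.0, 0.0, 0.0], [1.0, 0.1, 0.0], [0.0, 0.0, 1.0]]
print(feature_distillation_loss(teacher, student))
```

The loss is zero when the student reproduces the teacher's pairwise similarity structure and grows as the two structures diverge.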
As for the hard target, it is generally learned with the Cross Entropy loss function. Since this research is a binary classification task, a binary cross entropy loss function, denoted by $l_h$, is adopted. Finally, the training loss function $L$ of the student model combines the soft-target losses and $l_h$ through a scaling factor $\beta \in [0, 1]$ that balances the local specific knowledge and the global general knowledge. Its value therefore has a crucial effect on the performance of the personalized model: when it is large, the personalized student model learns more from the teacher model; when it is small, it learns more from the local data.
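How the pieces of the student loss fit together can be sketched as follows. The exact way the soft and hard terms are combined is not reproduced here, so this sketch assumes a convex combination L = β · l_s + (1 − β) · l_h, where l_s stands for the total soft-target loss; that form, and all function names, are assumptions consistent with the described role of β rather than the definitive formula of this study. The output distillation uses the MSE between temperature-softened probabilities, as stated in the text.

```python
import math
from typing import List, Sequence


def softmax_t(logits: Sequence[float], temperature: float) -> List[float]:
    """Temperature-scaled SoftMax; subtracting the max keeps exp() stable."""
    top = max(logits)
    exps = [math.exp((z - top) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def output_distillation_mse(teacher_logits: Sequence[float],
                            student_logits: Sequence[float],
                            temperature: float) -> float:
    """MSE between the softened teacher and student class probabilities."""
    p_t = softmax_t(teacher_logits, temperature)
    p_s = softmax_t(student_logits, temperature)
    return sum((a - b) ** 2 for a, b in zip(p_t, p_s)) / len(p_t)


def binary_cross_entropy(y_true: float, y_prob: float,
                         eps: float = 1e-7) -> float:
    """Hard-target loss for one sample of a binary classification task."""
    y_prob = min(max(y_prob, eps), 1.0 - eps)  # avoid log(0)
    return -(y_true * math.log(y_prob) + (1.0 - y_true) * math.log(1.0 - y_prob))


def student_loss(soft_loss: float, hard_loss: float, beta: float) -> float:
    """Assumed combination: beta weights the teacher (soft) knowledge."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return beta * soft_loss + (1.0 - beta) * hard_loss


soft = output_distillation_mse([2.0, -1.0], [0.5, 0.2], temperature=3.0)
hard = binary_cross_entropy(1.0, 0.7)
print(student_loss(soft, hard, beta=0.5))
```

Under this assumed form, β = 1 makes the student follow the teacher only, and β = 0 reduces training to ordinary supervised learning on the local data.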

IV. EXPERIMENTS AND ANALYSIS
A. DATA PREPROCESSING
The proposed scheme was developed in a multi-center ICU scenario which is based on an actual and freely available EHR database named eICU Collaborative Research Database, version 2.0 (eICU-CRD v2.0) [43]. This database is generated by teleICU, an actual project of Philips Healthcare, and collated by the Laboratory for Computational Physiology (LCP) at MIT. It comprises de-identified health data from over 200,000 admissions of more than 139 thousand unique ICU patients involving 335 units at 208 hospitals across the United States between 2014 and 2015 [44]. The eICU-CRD not only retains the natural characteristics of independently distributed data silos but also has abundant data resources that can properly support actual cross-silo FL application research.
Since the database is an unprocessed raw EHR, in order to obtain good research results, this study mainly refers to relevant benchmark research work [45] to do the variable selecting and preprocessing, which includes the following key steps.

1) SELECTING THE COHORT
This step is to filter the raw data based on criteria such as age range, number of records, and invalid key information, which results in 30,680 unique patients covering 1,164,966 records.

2) SELECTING THE VARIABLES
As shown in Table 1, this mortality prediction task selects 19 feature variables that reflect hospitalization status as inputs and 1 variable that indicates survival status as an output within a fixed time window of 48 hours for each patient.

3) VARIABLES PREPROCESSING
This process includes categorical variable encoding by one-hot encoding (OHE), numerical variable normalization, and input matrix padding. Finally, an input matrix of size 200 × 442 is obtained for each unique patient.
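The three preprocessing operations can be sketched as follows. The text does not specify the normalization scheme or the padding value, so min-max scaling, zero padding, and truncation of overly long sequences are assumptions made here, and the example variables (a two-category field and one numerical field with a 40 to 140 range) are purely hypothetical.

```python
from typing import List, Sequence

MAX_STEPS = 200  # number of rows in the per-patient input matrix


def one_hot(value: str, categories: Sequence[str]) -> List[float]:
    """Encode a categorical value as a 0/1 vector over known categories."""
    return [1.0 if value == c else 0.0 for c in categories]


def min_max(value: float, low: float, high: float) -> float:
    """Scale a numerical value to [0, 1] (assumed normalization scheme)."""
    if high <= low:
        return 0.0
    return min(max((value - low) / (high - low), 0.0), 1.0)


def pad_rows(rows: Sequence[Sequence[float]], max_steps: int,
             width: int) -> List[List[float]]:
    """Truncate or zero-pad a patient's record rows to a fixed length."""
    fixed = [list(r) for r in rows[:max_steps]]
    fixed.extend([0.0] * width for _ in range(max_steps - len(fixed)))
    return fixed


# One record row: a categorical field followed by a numerical field.
row = one_hot("F", ["F", "M"]) + [min_max(80.0, 40.0, 140.0)]
matrix = pad_rows([row, row], MAX_STEPS, len(row))
print(len(matrix), len(matrix[0]))  # 200 3
```

With all selected variables encoded, the row width would be 442 instead of the 3 columns of this toy example.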

B. DATA DISTRIBUTION
The research problem of this study involves not only the non-IID data but also its skewness. Thus, the data distribution involving how data is non-IID and how data skews to non-IID is crucial. Currently, the generation task of non-IID data is done artificially in most FL-related studies [6], [16], [37], which generally assign data evenly to each client based on different category labels and regulate the skewness to non-IID by the variance of data categories contained in the independent clients.
However, due to the lack of practical application support, this artificial way of generating data distributions not only fails to account for how real-world data distribution bias affects FL, but also ignores the unbalanced nature of real-world distributed datasets. Furthermore, the applicability of research results depends on how realistically the experimental dataset simulates the distribution that will occur in practice. Therefore, as mentioned by M. J. Sheller et al. [46], if feasible, an actual distribution that preserves the natural characteristics of the data is the best option for FL.
This study generates non-IID data in a natural way, which completely preserves the original distribution characteristics of eICU-CRD to simulate independent ICU centers with non-IID and unbalanced data. According to different non-IID skewness requirements, we naturally generate ICU centers according to hospital and ICU unit type, respectively. Together with the IID data distribution for comparison, this study finally includes the following three data distribution division ways:

1) IID AND EVEN DATA DISTRIBUTION
All datasets from the participating ICU centers are pooled, shuffled, and then evenly partitioned into the required number of centers.

2) NON-IID AND UNBALANCED DATA DISTRIBUTION BASED ON HOSPITALS
The 208 hospitals in eICU-CRD, which have varying numbers of patient cases, are naturally treated as independent ICU centers. Since most of them have only a small number of patient admission records, we set a threshold of 600 records to filter out those that cannot participate in FL training. Ultimately, 12 hospital-based ICU centers with a total of 9660 unique patient records are produced.

3) NON-IID AND UNBALANCED DATA DISTRIBUTION BASED ON ICU UNIT TYPES
In eICU-CRD, patients with different types of disease are admitted to the corresponding ICU units, which results in greater variation in the related feature variables among different unit types. As shown in Table 2, all 335 ICU units, with a total of 30,680 unique patient records, are classified into 8 different types. Accordingly, we performed a second data generation according to ICU unit type to increase the non-IID skewness of the data distribution. The database can thus be divided to simulate 8 independent unit-type-based ICU centers.
In addition to generating the ICU centers that participate in FL in a natural way, we also retained the original number of patients in each center. Fig. 2 shows the unbalanced distribution of patient numbers under the two non-IID data divisions after cohort selection.
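The three division schemes above can be sketched as follows. This is an illustrative sketch only: the record fields `hospital_id` and `unit_type` are hypothetical names, and treating the threshold as "at least `min_records` records" is an assumption, since the text does not state whether the bound is inclusive.

```python
import random
from collections import defaultdict
from typing import Dict, List


def iid_even_split(patients: List[dict], n_centers: int, seed: int = 0) -> List[List[dict]]:
    """Pool, shuffle, and deal records out so that center sizes differ by at most one."""
    shuffled = list(patients)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_centers] for i in range(n_centers)]


def natural_split(patients: List[dict], key: str, min_records: int = 0) -> Dict[str, List[dict]]:
    """Group records by a natural attribute and drop groups below the threshold.

    Group sizes are left untouched, which preserves the unbalanced distribution.
    """
    groups: Dict[str, List[dict]] = defaultdict(list)
    for patient in patients:
        groups[patient[key]].append(patient)
    return {k: v for k, v in groups.items() if len(v) >= min_records}


# Toy records (hypothetical field names, not the eICU-CRD schema).
toy = [{"hospital_id": "A", "unit_type": "MICU"} for _ in range(5)]
toy += [{"hospital_id": "B", "unit_type": "SICU"} for _ in range(2)]

iid_centers = iid_even_split(toy, n_centers=3)
hospital_centers = natural_split(toy, "hospital_id", min_records=3)
unit_centers = natural_split(toy, "unit_type")
```

With the study's settings, the hospital-based call would use `min_records=600`, and the unit-type-based call would apply no threshold.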

C. EXPERIMENTAL SETTINGS
The proposed method is implemented in Python, and all experiments are conducted on a computer with a 3.00 GHz Intel i7-9700 CPU, 16 GB of memory, and an NVIDIA GeForce RTX 2060 GPU with 6 GB of memory.

1) MACHINE LEARNING MODEL
The ML model employed in this work is a multilayer perceptron (MLP), which has both a unified and a personalized design. The unified design is applied in FedAvg with a fixed structure and preset hyperparameters. Specifically, it has two hidden layers with 100 nodes each and uses the rectified linear unit (ReLU) as the activation function. The optimizer is SGD with a momentum of 0.9, and the loss function is binary cross-entropy.
As to the personalized design, the structure and hyperparameters of the MLP model are not fixed, and their specific values are determined by the evolutionary process. We restrict the structure of the model to two or three hidden layers and empirically provide the search space for the layer sizes and several influential hyperparameters, the detailed settings of which are shown in Table 3. As a result, the chromosome in the evolutionary solution is composed of four hyperparameters and three variables corresponding to the model structure, all of which are real-encoded. In addition, we set the maximum number of training epochs of the personalized model to 20 and added an early stopping mechanism to prevent the model from overfitting. That is, training is terminated early when the validation error stops decreasing. Apart from these, the personalized model settings are the same as those of the unified design.
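The early stopping rule described above can be sketched as a small training loop. The `patience` parameter (the number of consecutive non-improving epochs that is tolerated) and the two callbacks are hypothetical; the text specifies only the 20-epoch cap and that training stops once the validation error no longer decreases.

```python
from typing import Callable


def train_with_early_stopping(
    train_one_epoch: Callable[[], None],
    validation_error: Callable[[], float],
    max_epochs: int = 20,
    patience: int = 2,
) -> int:
    """Train for up to max_epochs and return the number of epochs actually run.

    Training stops once the validation error has failed to improve on its
    best value for `patience` consecutive epochs (patience is an assumption).
    """
    best = float("inf")
    stale = 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        err = validation_error()
        if err < best:
            best = err
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return max_epochs


# Toy usage: a scripted sequence of validation errors stands in for a real model.
errors = iter([0.50, 0.40, 0.45, 0.46, 0.30])
epochs_run = train_with_early_stopping(lambda: None, lambda: next(errors))
```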

2) HYPERPARAMETER SETTINGS
In the first step, we set the proportion of clients participating in training C to 1.0, the mini-batch size B to 50, the training iterations E to 5, the learning rate η to 0.01, the threshold training rounds R_w to 5, and the total number of communication rounds R to 100 (the training stop criterion). In the second step, we set the distillation temperature T to 10 and the partition ratio of the training data D_t to the verification data D_v to 4:1.
As mentioned earlier, the scaling factor β has an important effect on the performance of the proposed method. We suggest its value should depend on the specific data distribution. When a well-performing teacher model is produced in slightly non-IID skewed data, the personalized models should learn more from the teacher model, but when the performance of the teacher model is degraded by highly non-IID data, the personalized models should be more biased towards the local datasets. Accordingly, we set its value in hospital-based and unit-type-based non-IID data to 0.6 and 0.4, respectively.
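The trade-off controlled by β can be illustrated with a minimal sketch of a temperature-scaled distillation loss for a single binary prediction. The exact objective of the proposed method is defined earlier in the paper and is not reproduced here; the form below is a standard knowledge distillation loss, and both the convention that β weights the teacher term and the T² scaling of that term are assumptions chosen to match the behavior described above.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def bce(p: float, target: float, eps: float = 1e-12) -> float:
    """Binary cross-entropy between a predicted probability and a (soft) target."""
    p = min(max(p, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def distillation_loss(
    student_logit: float,
    teacher_logit: float,
    label: int,
    beta: float,
    temperature: float = 10.0,
) -> float:
    """Blend a teacher-matching term with a local-data term.

    beta close to 1 makes the student follow the teacher; beta close to 0
    makes it follow the local label (assumed convention).
    """
    soft_student = sigmoid(student_logit / temperature)
    soft_teacher = sigmoid(teacher_logit / temperature)
    teacher_term = temperature ** 2 * bce(soft_student, soft_teacher)
    local_term = bce(sigmoid(student_logit), float(label))
    return beta * teacher_term + (1.0 - beta) * local_term


# The two settings reported in the text: 0.6 (hospital-based), 0.4 (unit-type-based).
loss_hospital = distillation_loss(1.5, 2.0, 1, beta=0.6)
loss_unit_type = distillation_loss(1.5, 2.0, 1, beta=0.4)
```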
As to the evolutionary process, the population size and the number of generations are set to 20 and 5, respectively; these values depend on the search space and are also limited by the experimental conditions. The crossover and mutation probabilities are empirically set to 0.9 and 0.1, respectively.
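A minimal sketch of an evolutionary search with these settings is given below. Only the population size, the number of generations, the two probabilities, and the seven-gene real encoding come from the text. The operators (binary tournament selection, arithmetic crossover, per-gene uniform reset mutation, and one elite per generation) and the toy fitness function are assumptions; in the study, fitness would be the validation performance of the personalized model decoded from the chromosome, with genes mapped to the ranges in Table 3.

```python
import random
from typing import Callable, List

N_GENES = 7  # four hyperparameters + three structure variables, each in [0, 1]


def evolve(
    fitness: Callable[[List[float]], float],
    pop_size: int = 20,
    generations: int = 5,
    p_cross: float = 0.9,
    p_mut: float = 0.1,
    seed: int = 0,
) -> List[float]:
    """Return the best real-encoded chromosome found (higher fitness is better)."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(N_GENES)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        children = [best[:]]  # keep one elite so the best fitness never drops
        while len(children) < pop_size:
            # Binary tournament selection of two parents.
            a = max(rng.sample(pop, 2), key=fitness)
            b = max(rng.sample(pop, 2), key=fitness)
            child = a[:]
            if rng.random() < p_cross:
                alpha = rng.random()
                child = [alpha * x + (1.0 - alpha) * y for x, y in zip(a, b)]
            # Per-gene mutation: redraw the gene uniformly with probability p_mut.
            child = [rng.random() if rng.random() < p_mut else g for g in child]
            children.append(child)
        pop = children
        best = max(pop, key=fitness)
    return best


def toy_fitness(chromosome: List[float]) -> float:
    """Stand-in objective (not from the paper): prefer genes close to 0.5."""
    return -sum((g - 0.5) ** 2 for g in chromosome)


best_chromosome = evolve(toy_fitness)
```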

3) EVALUATION METRICS
In accordance with [45], this work employs the area under the receiver operating characteristic curve (AUROC) to measure the mortality prediction results, because the extreme imbalance of patient survival status makes simple percentage accuracy meaningless. AUROC evaluates the performance of a prediction model well when the data classes are unbalanced and provides a basis for selecting the best prediction results. Furthermore, in order to truly reflect the performance of FL under non-IID and unbalanced data distributions, all experimental results are reported as the average of the prediction results of the independent individual models.
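As an illustration of this metric, the sketch below computes AUROC as the probability that a randomly chosen positive case is scored higher than a randomly chosen negative case (ties count one half), and then averages it over centers. Using an unweighted mean across centers is an assumption; the text states only that the individual models' results are averaged.

```python
from typing import List, Sequence, Tuple


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise (Mann-Whitney) AUROC for binary labels in {0, 1}."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_auroc(per_center: List[Tuple[Sequence[int], Sequence[float]]]) -> float:
    """Unweighted average of per-center AUROC values (assumed averaging scheme)."""
    values = [auroc(y, s) for y, s in per_center]
    return sum(values) / len(values)


# Toy usage with made-up labels and scores for two centers.
center_a = ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
center_b = ([0, 1, 0, 1], [0.2, 0.9, 0.3, 0.7])
average = mean_auroc([center_a, center_b])
```

The pairwise form is quadratic in the number of cases and is meant for clarity rather than speed.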

D. RESULTS ANALYSIS

1) THE IMPACT OF DATA DISTRIBUTION ON BASELINE FL ALGORITHM
In this subsection, we verified the impact of different data distributions that are described in Subsection B on the performance of baseline FL. Fig. 3 shows the mortality prediction results of FedAvg when the data distributions are IID and non-IID, as well as that of locally independent training. The locally independent trained model was treated as a benchmark to evaluate whether the FL-trained model achieved a performance gain for its participants.
The observations show that different distributions of the same data have a great impact on the performance of FL. Compared with the even IID distribution, the natural hospital-based and unit-type-based distributions both significantly degrade the performance of the baseline FL, and as the difference in data distribution increases, the degradation becomes more obvious and even causes FL to fail to converge. Furthermore, it can be observed in Fig. 3(a) that the performance of FL-trained models under the non-IID data distribution is clearly better than that of the locally independently trained models. This indicates that when the data distribution is not highly skewed toward non-IID and unbalanced, the baseline FL can effectively improve model performance. Nevertheless, Fig. 3(b) shows that, with a substantial increase in the non-IID and unbalanced characteristics of the data distribution, the baseline FL not only becomes unable to converge but can also hardly bring performance gains to the ML models.
These observations confirm the research problem of this work. That is, the prediction performance of FL can be degraded by the non-IID and unbalanced nature of the data, and the higher the skewness of the non-IID and unbalanced data, the more significant the degradation. In severe cases, locally independently trained client models can even outperform the FL-trained models, which makes FL meaningless. We argue that the cause of this problem might be that, in a heterogeneous data environment, it is ineffective to obtain a single unified model for all participants from FL training. In this data context, global collaboration that does not consider the unique characteristics of individual participants usually cannot bring performance gains to most of them. Therefore, locally adapting and personalizing the FL-trained unified model on each independent client should be a good way to tackle this problem.

2) COMPARISON EXPERIMENT
In this section, we compare the proposed method with the baseline FL and two other PFL methods to show that the proposed method works. The first PFL method used for comparison is a simple "base + personalization layers" local fine-tuning (FT) method [24], which is called FT-FedAvg in this paper. After receiving the FL global shared model, each independent individual freezes the base layers of the model and then updates the high-layer parameters with its local data for several epochs, gaining personalization while maintaining the generalization knowledge in the base-layer parameters. Specifically, we use this method to fine-tune the global shared model of FedAvg for two epochs.
The other PFL method used for comparison is pFedMe [47], which personalizes FL by regularizing clients' loss functions with Moreau envelopes. Its objective is also to balance personalization and generalization on each client to gain performance. To be fair, we selected the optimal parameter combination for pFedMe according to the data characteristics of this study. We first set several relevant parameters according to the requirements of the original pFedMe: personal learning rate η = 0.001, computation complexity K = 5, additional model parameter β = 2, and λ = 20. Then we set the local training epochs to 50 and 80 for the hospital-based and unit-type-based data distributions, respectively, according to the amount of data in this study. This is because we found through experimental observation that the local training epochs of pFedMe should be increased appropriately as the client's local data amount grows, so as to ensure the convergence speed. In addition, other FL hyperparameters, such as the number of communication rounds, the number of participating clients, and the training data batch size, are consistent with those of the proposed method.
Fig. 4 shows the average prediction AUROC of the baseline FL and the three PFL methods over 100 communication rounds. It should be noted that POLA is by design a one-shot local adaptation method. That is, a unique teacher model is found within the preset communication rounds, and adaptation is then performed once to generate the local personalized models. Here, in order to better demonstrate its performance, we adapt every teacher model selected within the 100 rounds, together with the global model of the first round, to generate personalized models for all participating ICU centers, and plot the resulting curve.
For example, in a 100-round training, the global models selected for subsequent adaptation under the unit-type-based data distribution are in the 1st, 5th, 6th, 8th, 9th, 12th, and 13th rounds, respectively.
Furthermore, in order to present the performance of these methods in more detail, Table 4 shows the prediction results in the 5th and 100th communication rounds. The experiment was independently performed with different random seeds five times, and their average AUROCs with 95% confidence intervals are shown.
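For reference, a mean with a 95% confidence interval over a small number of repeated runs can be computed as sketched below. The paper does not state how its intervals were obtained; the Student's t interval used here is one common choice and is an assumption, and the AUROC values in the usage example are made-up toy numbers, not results from Table 4.

```python
import math
import statistics
from typing import Sequence, Tuple

# Two-sided 95% critical values of Student's t, keyed by sample size n (df = n - 1).
T_CRIT_95 = {2: 12.706, 3: 4.303, 4: 3.182, 5: 2.776}


def mean_ci95(values: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, half-width) of a 95% t-based confidence interval."""
    n = len(values)
    if n not in T_CRIT_95:
        raise ValueError("this sketch only tabulates n between 2 and 5")
    mean = statistics.mean(values)
    half_width = T_CRIT_95[n] * statistics.stdev(values) / math.sqrt(n)
    return mean, half_width


# Toy usage: five made-up AUROC values standing in for five seeded runs.
toy_runs = [0.80, 0.82, 0.81, 0.79, 0.83]
mean, half_width = mean_ci95(toy_runs)
```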
Besides, in order to observe the experimental results from the perspective of independent participants, the mortality prediction results of each ICU center's local model after 100 full rounds of training are shown in Fig. 5.
We can see from the overall findings shown in Fig. 4 and Table 4 that our proposed scheme, POLA, outperforms the other two PFL methods in both prediction performance and overall convergence rate under non-IID data distributions with different skewness. From the individual perspective presented in Fig. 5, POLA also achieves acceptable performance, but its effectiveness varies with the distribution of the data. Compared with the best-performing comparison method, pFedMe, POLA brings performance gains to all unit-type-based ICU center models, but to only 58.33% (7 of 12) of the hospital-based ICU center models. This indicates that POLA is more effective when each independent dataset is sufficiently large and the non-IID skewness of the overall data is high.
FT-FedAvg, a simple adjustment of the baseline FL, relies heavily on the global shared model, which makes its performance highly unstable. Overall, it achieves a certain performance gain over FedAvg, but from the standpoint of individual benefits, this improvement has no practical significance. As for pFedMe, it can effectively overcome the obstacle of non-IID and unbalanced data and obtain stable and superior performance once the entire FL training is completed, but it converges too slowly. As shown in Table 4, pFedMe probably needs at least 30 further training rounds to close its performance gap with POLA at the 5th communication round, which requires a significant amount of computational and communication resources.
In conclusion, the results of the comparative experiments show that POLA not only effectively overcomes non-IID and unbalanced data barriers of different skewness to generate personalized models with superior performance for each independent ICU center, but also significantly reduces the number of communication rounds in the FL training process, thus saving computational and communication overhead.

V. DISCUSSIONS
This section discusses several properties of the proposed method. The first is compatibility. The experimental results show that the effectiveness of the proposed method largely depends on the teacher model, which is why a well-performing teacher model must be selected during the FL training process. Intuitively, the better the teacher model, the better the generated personalized models. However, experimental observation shows that this is not the case. We speculate that this is because POLA requires the teacher model not only to contain generalization knowledge but also to fit the parameter update direction of all student models. Therefore, although the proposed method appears to be a general "FL training + local adaptation" method, it is not compatible with arbitrary FL approaches.
Another property is extensibility, which involves two aspects: a) Application scenario extension. Although POLA is proposed specifically for predicting the mortality of inpatients in multi-center ICUs, it can also be applied to similar cross-silo scenarios, for example, biomedical tasks such as disease incidence rate forecasting or medical image recognition, and financial tasks such as multi-party borrowing detection. b) ML model extension. Although this study only employs the MLP model, the method can also be extended to other NN models, especially DNN models. Because model structure and hyperparameters have a greater impact on the performance of DNNs, personalizing them can lead to higher performance gains. For example, if a lightweight DNN model is trained in FL and then tuned into more complex personalized models, considerable performance gains and communication overhead savings could be achieved.

VI. CONCLUSION
This study aims to enable FL to generate highly personalized ML models for each participant in order to tackle the degradation of predictive performance in an actual multi-center ICU scenario. It keeps the natural and complete non-IID and unbalanced data distribution of the independent ICU centers, which makes it more relevant to practical healthcare applications. We first studied the characteristics of the baseline FL in this data scenario to analyze the reason for its performance degradation. Then, we proposed POLA, a one-shot and two-step personalized scheme that allows the performance of FL to recover from non-IID and unbalanced data. POLA rebalances global experience and local data knowledge by making a one-shot adaptation that produces a personalized local model for each independent ICU center. We experimentally demonstrate that it can not only improve the performance of FL by generating superior-performing and highly personalized models but also significantly reduce the number of training communication rounds for FL.