A dynamic global backbone updating for communication-efficient personalised federated learning

ABSTRACT Federated learning (FL) is an emerging distributed machine learning technique. However, when dealing with heterogeneous data, a shared global model cannot generalise to all devices' local data. Furthermore, the FL training process requires frequent parameter communication, which conflicts with the limited bandwidth and unstable connections of participating devices. These two issues significantly affect FL's effectiveness and efficiency. In this paper, an enhanced communication-efficient personalised FL technique, FedGB, is proposed. Unlike existing approaches, FedGB holds that exchanging only the common information in the training results of different devices can improve local personalised training more effectively. FedGB dynamically selects backbone structures in the local models to represent the dynamically determined backbone information (common features) in the global model for aggregation. Exchanging only common features between different nodes reduces the impact of heterogeneous data to a certain extent, and the dynamic adaptive sub-model selection avoids the impact of manually setting the sub-model scale. FedGB can thus reduce communication overheads while maintaining inference accuracy. Results obtained in a variety of experimental settings show that FedGB can effectively improve both communication efficiency and inference accuracy.


Introduction
Edge, mobile and IoT computing (Wu et al., 2022; Xin et al., 2022; Yan et al., 2022) are gaining traction as critical steps toward achieving ubiquitous Artificial Intelligence (AI). Federated learning (FL) (Bonawitz et al., 2019; Kairouz et al., 2021; McMahan et al., 2017; Zhang et al., 2021), an emerging distributed machine learning approach with privacy protection, can fully exploit their application potential (Imteaj et al., 2021; Nguyen et al., 2021; Zhang et al., 2022) and is being actively studied. FL has been used in a number of applications (Alazab et al., 2021; Boopalan et al., 2022; Hard et al., 2018; Leroy et al., 2019; Ramu et al., 2022). FL participants collaborate to learn a common global model. Each device (or node) trains the model with local data and then sends the training results to the server for aggregation. When local data are heterogeneous, personalising the local models rather than forcing a single shared model results in higher personalised training accuracy. Furthermore, because training data from different nodes are heterogeneous, identifying only the common features between local models for interaction can reduce communication overheads.
As a result, in this paper, we propose FedGB, a novel communication-efficient personalised FL method. As shown in Figure 1, each device personalises local model training with heterogeneous data while only uploading and downloading aggregated shared common information (the backbone structure on each device) to improve the performance of personalised local models. Communication overheads are reduced by exchanging only shared common information, increasing communication efficiency. Our method's contributions are summarised as follows:
• We investigate the difference in parameter distribution between the local and global models. The results show that a locally preferred pruned sub-model structure is insufficient to take advantage of personalised FL's model aggregation.
• We identify the global backbone in the global model as the fusion of common information from heterogeneous data when personalising the local models, and then apply this as a criterion to select the sub-model structures from the local models for aggregation. This allows data information from other devices to be used efficiently via the FL process.
This paper is organised as follows. Related works are summarised in Section 2. The motivation of FedGB is presented in Section 3. Section 4 details FedGB. Sections 5 and 6 present the experimental results and discussion. Section 7 concludes this paper.

Federated learning
Federated learning (FL), an emerging machine learning technology, is proposed to solve the problem of how to cooperatively train a high-performance model without data sharing. Massive amounts of data are collected due to the widespread deployment of IoT and edge devices. There are numerous restrictions on data transmission and sharing in terms of user privacy and data security. FL can alleviate some of the concerns about traditional machine learning. The participating devices train the machine learning and deep learning models locally in FL, and a shared global model is obtained by aggregating the intermediate training results of each device. As a result, the goal of FL is (McMahan et al., 2017):

$$\min_{\theta} F(\theta) = \sum_{k=1}^{K} P_k f_k(\theta),$$

where $K$ is the number of devices; $P_k \geq 0$ defines the relative impact of each local device, satisfying $\sum_k P_k = 1$; and $f_k$ is the objective function of the $k$th local device, which can be defined as:

$$f_k(\theta) = \frac{1}{s_k} \sum_{i=1}^{s_k} l_k(\theta; z_i),$$

where $s_k$ is the number of locally available samples, $P_k = 1/K$ or $P_k = s_k/s$, and the total number of samples is $s = \sum_k s_k$. When each node uploads its own training results to the server for aggregation, the FedAvg (McMahan et al., 2017) method is applied:

$$\omega_{t+1} = \sum_{k=1}^{K} \frac{s_k}{s}\, \omega_{t+1}^{k}.$$

In this parameter aggregation strategy, $s_k$ is the number of training samples in the $k$th device, $s$ is the total number of training samples of all devices, and $\omega_{t+1}^{k}$ is the training result of the $k$th device in round $t+1$.
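To make the aggregation rule concrete, the following minimal sketch (an illustration added here, not the authors' released code) performs FedAvg-style weighted averaging of locally trained parameter vectors, with weights proportional to local sample counts.

```python
import numpy as np

def fedavg_aggregate(local_weights, sample_counts):
    """FedAvg: weighted average of local training results.

    local_weights : list of 1-D numpy arrays, one per device (flattened parameters)
    sample_counts : list of ints, s_k for each device
    """
    s = float(sum(sample_counts))                  # total samples s = sum_k s_k
    agg = np.zeros_like(local_weights[0])
    for w_k, s_k in zip(local_weights, sample_counts):
        agg += (s_k / s) * w_k                     # omega_{t+1} = sum_k (s_k/s) * omega^k_{t+1}
    return agg

# Toy usage: three devices with different data volumes.
rng = np.random.default_rng(0)
weights = [rng.normal(size=8) for _ in range(3)]
print(fedavg_aggregate(weights, sample_counts=[600, 1000, 400]))
```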

Communication-efficient federated learning
The frequent exchange of parameters between local devices and the server conflicts with the deployed devices' limited communication capabilities and unstable network connections, resulting in a significant reduction in FL efficiency. Considering communication overheads, several communication-efficient FL methods have been proposed. For sparsification training, the authors of DGC (Lin et al., 2017) argue that the distributed training process contains many redundant parameters on each device; to achieve efficient communication, only important parameters are chosen for transmission. Unlike DGC, eSGD (Tao & Li, 2018) holds that the parameters that can reduce the model's loss value are more significant and should be communicated. Gaia (Hsieh et al., 2017) evaluates the significance of a local update from one data center based on the magnitude of the update relative to the current parameter value. For communication reduction, high-precision parameters are quantised into low-precision values using quantization methods such as Reisizadeh et al. (2020). The authors of Amiri et al. (2020) create a lossy FL algorithm in which the global model is compressed after parameter aggregation, reducing the communication's resource requirements. UVeQFed (Shlezinger et al., 2020), based on the concept of universal quantization, enables FL to achieve better performance with limited bitwidth. Pruning-based methods build on sparsification while taking the structural features of local models into account. PruneFL (Jiang et al., 2022) is a two-stage method for increasing communication efficiency that includes initial pruning at a chosen device and additional pruning as part of the FL process. Another study investigates the impact of network pruning on FL performance before deriving closed-form solutions for the optimal pruning ratio and spectrum allocation. The central server in FL-PQSU performs $\ell_1$-norm based one-shot channel pruning for FL optimization.
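For intuition on the sparsification family of methods discussed above, the sketch below (our own illustration, not code from any of the cited systems) keeps only the top-k largest-magnitude updates and accumulates the untransmitted remainder locally, in the spirit of DGC-style gradient sparsification.

```python
import numpy as np

def topk_sparsify(update, residual, k):
    """Transmit only the k largest-magnitude entries of (update + residual).

    The untransmitted remainder is carried over as a local residual,
    following a simplified reading of DGC-style sparsified communication.
    """
    full = update + residual
    idx = np.argsort(np.abs(full))[-k:]        # indices of the k largest magnitudes
    sparse = np.zeros_like(full)
    sparse[idx] = full[idx]                    # values actually communicated
    new_residual = full - sparse               # kept locally for the next round
    return sparse, new_residual

# Toy usage: send 2 of 6 update entries, accumulate the rest.
u = np.array([0.02, -0.5, 0.1, 0.9, -0.05, 0.3])
sparse, res = topk_sparsify(u, np.zeros_like(u), k=2)
print(sparse, res)
```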

Communication-efficient personalised federated learning
Because the data collected by FL participants is statistically heterogeneous, i.e. Non-IID, it is difficult for a shared global model to generalise to different types of data. In addition, achieving high local inference accuracy with such a collaboratively trained model is difficult in practical deployments. As a result, an increasing number of approaches focus on personalised local training while leveraging FL features to improve personalised training performance. FedMD (Li & Wang, 2019) uses the concept of knowledge distillation to provide personalised FL: a shared dataset is first created, and the shared knowledge from this dataset then aids personalised training on each device. MOCHA (Smith et al., 2017) is a personalised FL approach based on multi-task learning; the authors use a primal-dual formulation to allow a shared model to perform multiple tasks across multiple devices. FedGroup (Duan et al., 2020) is a clustering strategy based on device similarity, in which K-means++ is used to group devices with similar optimization directions.
These approaches are solely concerned with improving the performance of personalised models while ignoring FL communication overheads. As a result, communication-efficient personalised FL methods have been proposed. LG-FedAvg learns compact local representations on each device; because it operates only on local representations, the global model can be smaller and the number of parameters communicated is reduced. LotteryFL learns a subnetwork on each device using the Lottery Ticket Hypothesis, and only these local lottery networks communicate with the server. FedSkel (Luo et al., 2021) employs filter pruning to identify the essential components of each local model, referred to as skeleton networks; it trains and communicates only the parameters of these skeleton networks to reduce communication overheads. FedPrune enables each device to train a converged model locally to obtain critical parameters and substructure, which guide the pruning of the network participating in FL to reduce communication overheads. Vahidian et al. (2021) find a small subnetwork for each client by using hybrid pruning (a combination of structured and unstructured pruning) as well as purely unstructured pruning to achieve communication-efficient personalization. The pruning-based approach has gained popularity because it can take local structure into account to ensure the accuracy of personalised training while naturally reducing communication overheads.
However, the pruning-based method still has some issues that must be resolved. When only the communication efficiency of FL is considered, as in the preceding approaches, the server prunes the global model and then deploys the pruned model on various devices for training. Personalised FL frequently performs model pruning locally when dealing with heterogeneous data and communication bottlenecks. Nevertheless, finding and interacting with locally preferred partial model structures does not guarantee that they will benefit more from model parameter aggregation. When the randomness of model feature encoding interacts with heterogeneous data, it produces local models with large differences in feature extraction, i.e. parameter differences. Using model pruning in this case will result in a sub-model structure with a stronger preference for local data. As a result, model aggregation benefits personalised models only marginally. Therefore, in our method, we let different devices interact with their common information rather than their unique features. A comparison of our method with related works, mainly based on model pruning, is shown in Table 1.

Motivation
In this section, some preliminary experiments are carried out to demonstrate the limitations of interacting locally preferred partial models in personalised FL. Based on an analysis of the preliminary experimental results, the motivation for FedGB is then presented.

Parameter updates deviation
In FL, the devices involved collect data from a variety of scenarios and user habits, and these data are naturally heterogeneous. When a unified model is trained with different data on different devices, the local training results have significantly different preferences: the parameters are trained to improve the performance of the local model on local data. As a result, the magnitudes of updates after training differ across devices. A comparison of update magnitudes between different local devices is shown in Figure 2(a). The preliminary FL experiments involve four devices collaboratively learning a VGG-16 model on the CIFAR-10 dataset, conducted under the Non-IID #2 setting, where #C indicates that each device has C image classes assigned at random. We employ a Non-IID setting similar to that of McMahan et al. (2017); more information on the Non-IID settings will be provided in the experimental section. Figure 4 depicts the data held by the four devices, denoted Device 1 through Device 4. Furthermore, two devices are chosen at random and the magnitudes of their updates in the first layer of the network structure are compared.
Under the Non-IID #2 setting, node 1 in Figure 2(a) has twice the density of update magnitudes in [0.000, 0.002] as node 2. Diverse training data thus yields significantly different training outcomes for the same model. Such results reflect the differing training preferences caused by heterogeneous data. When pruning is used to compress the model, the important parameters are preserved, and at this point these critical parameters highlight the unique characteristics of the heterogeneous local data.

Weight parameter matching deviation
Following the illustration of the differences in local update magnitudes, the preliminary experimental results in Figure 2(b) represent the bias of the weight parameters between the local model and the global model after parameter aggregation. The experimental settings are identical to those used in the previous preliminary experiment. The weight bias is defined here as the difference between the normalised weight parameters. The difference between each node $k$'s parameter $\omega_i \in W$ and the global model is normalised by the global value:

$$b_i = \frac{|\omega_{i,k} - \bar{\omega}_i|}{|\bar{\omega}_i|},$$

where $\omega_{i,k}$ is the value of local parameter $\omega_i$ in node $k$, and $\bar{\omega}_i$ is the global value of parameter $\omega_i$. The greater the difference in weight parameters between the local and global models, the larger $b_i$. Two nodes are chosen at random in Figure 2(b) and their weight parameters are compared to the global model. Under the Non-IID #2 experimental setting, the maximum parameter bias between local model 1 and the global model ranges from 10% to 100%, with more than 60% of the parameters having a bias of less than 10%. In contrast, the maximum parameter bias between local model 2 and the global model is greater than 100%, and more than 90% of the parameters have a bias greater than 30%.
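As a concrete reading of this bias measure, the short sketch below (our illustration; the variable names and the synthetic data are our own) computes the normalised per-parameter bias between a local model and the global model and reports the fraction of parameters under a 10% bias.

```python
import numpy as np

def weight_bias(local_w, global_w, eps=1e-12):
    """Normalised bias b_i = |w_{i,k} - w̄_i| / |w̄_i| for every parameter i."""
    return np.abs(local_w - global_w) / (np.abs(global_w) + eps)

rng = np.random.default_rng(1)
global_w = rng.normal(size=1000)
local_w = global_w + rng.normal(scale=0.05, size=1000)   # a mildly diverged local model
b = weight_bias(local_w, global_w)
print(f"share of parameters with bias < 10%: {np.mean(b < 0.10):.2%}")
```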
The results of the aforementioned experiments show that after parameter aggregation, the global model is closer to the model parameters of node 1. This implies that the training results from node 1 have a greater influence on the global model, and that the global model is better suited to the data from node 1. However, when such a global model is deployed on node 2, extracting the data features of node 2 becomes difficult, resulting in a loss of accuracy in local inference.
Furthermore, if their pruned model is used for interaction at this point, the aggregated global model will still prefer the data of some nodes due to the unique characteristics of the important parameters retained. As a result, sending aggregated parameters to each node has an impact on local personalised training. According to this, when performing communication-efficient personalised learning, we should consider both locally and globally to ensure local personalised training performance while reaping the benefits of FL parameter aggregation. Therefore, unlike the pruning approach, we concentrate on selecting the more common features on all nodes for aggregation, effectively avoiding the global model's bias toward data on specific nodes.
Our analysis and preliminary experimental results show that selecting communication parameters based only on local properties, rather than on the cooperative relationship between different local models, makes it difficult to achieve the optimization goal of guaranteeing model performance while improving communication efficiency.

Identifying the global backbone information
Because of the heterogeneous data distribution and limited sample amount on each device, using FL to improve local personalised model performance is critical. Since each node is trained using local data, the training results have a strong local bias. As a result, when faced with personalised model training, simple parameter aggregation will cause the global model to overfit to data from specific devices, making the aggregated model difficult to apply to local data and degrading performance. As previously stated, when performing personalised model training, we only want to use the shared common information to improve the performance of local model aggregation. To accomplish this, we use the global model's common information as the intermediary and criterion to determine the common information between different devices. In this section, we first determine the most common features of the global model.
Because the functions and scopes of each layer in a neural network differ, we look for common features in each layer separately. The model's filter is chosen as the basic unit of feature representation. As a result, we use a straightforward and efficient method. The common feature information in each layer is determined by locating the filter with the highest correlation with all other filters. To begin, we compare the correlation of each layer's filters.
For a convolutional layer with filters of shape $K \times K \times M \times N$, $K$ is the kernel size, $M$ is the number of input channels, and $N$ is the number of output channels. One filter can be thought of as a collection of $K \times K$ independent nodes. We group the filter tensors into $K \times K$ sets of nodes for ease of calculation and compute the pair-wise node correlations accordingly. Finally, we compute the filter correlation by averaging the node correlations over the given filter's $K \times K$ nodes:

$$\mathrm{corr}(F_m^l, F_n^l) = \frac{1}{K^2} \sum_{i=1}^{K} \sum_{j=1}^{K} \mathrm{corr}(\omega_{i,j,m}, \omega_{i,j,n}),$$

where $F_m^l$ and $F_n^l$ are the $m$th and $n$th filters in the $l$th layer, and $\omega_{i,j,m}$ is the vector $[W_{i,j,m,1}, W_{i,j,m,2}, \ldots, W_{i,j,m,N}]$ of weights $W$ of the $l$th layer. Further, when calculating $\mathrm{corr}(\omega_{i,j,m}, \omega_{i,j,n})$, the Pearson correlation (Benesty et al., 2009) is applied:

$$\mathrm{corr}(\omega_{i,j,m}, \omega_{i,j,n}) = \frac{E\big[(\omega_{i,j,m} - \mu_{\omega_{i,j,m}})(\omega_{i,j,n} - \mu_{\omega_{i,j,n}})\big]}{\sigma_{\omega_{i,j,m}}\, \sigma_{\omega_{i,j,n}}},$$

where $\mu_{\omega_{i,j,m}}$ is the mean of $\omega_{i,j,m}$, $\sigma_{\omega_{i,j,m}}$ is the standard deviation of $\omega_{i,j,m}$, and $E$ is the expectation.
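The following sketch illustrates the computation just described (our own toy code; the tensor layout and the choice of which axis indexes filters follow the equation's literal indexing, which is an assumption about the extracted text). It averages Pearson correlations over the K × K kernel positions to obtain pair-wise filter correlations.

```python
import numpy as np

def filter_correlations(W):
    """Pair-wise filter correlations for a conv layer.

    W : weight tensor of shape (K, K, M, N) -- kernel x kernel x in x out.
    For each kernel position (i, j), the weight vectors of two filters over
    the remaining channel axis are compared with Pearson correlation, then
    the K*K node correlations are averaged (our reading of the paper's Eq.).
    """
    K, _, M, N = W.shape
    corr = np.zeros((M, M))
    for m in range(M):
        for n in range(M):
            acc = 0.0
            for i in range(K):
                for j in range(K):
                    x, y = W[i, j, m, :], W[i, j, n, :]   # vectors over the last axis
                    acc += np.corrcoef(x, y)[0, 1]        # Pearson correlation
            corr[m, n] = acc / (K * K)
    return corr

rng = np.random.default_rng(2)
W = rng.normal(size=(3, 3, 4, 8))   # toy 3x3 layer, 4 in-channels, 8 out-channels
print(filter_correlations(W).round(2))
```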
The correlations between each pair of filters have now been determined. Because the goal of this step is to find the most common information in each layer, we must transform these pair-wise correlations into the relationship of a specific filter with all other filters in the same layer. We define an importance coefficient for each filter to indicate whether it contains the common information we need. To determine the importance coefficient of each filter, max-normalization over its top-$k$ highest correlations is used:

$$I_m^l = \frac{\sum_{n \in \mathcal{T}_k(m)} \mathrm{corr}(F_m^l, F_n^l)}{\max_{m'} \sum_{n \in \mathcal{T}_k(m')} \mathrm{corr}(F_{m'}^l, F_n^l)},$$

where $\mathcal{T}_k(m)$ is the index set of filter $F_m^l$'s top-$k$ highest correlations. After obtaining the normalised importance coefficients for all filters in each layer, we choose the filter with the highest value as the global backbone information for that layer; its features are also the common features of the layer. The global backbone information of each layer is then sent to each node, along with the aggregated parameters, as shown in Figure 3, for subsequent local training, parameter selection, and communication.
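A compact sketch of this selection step follows (our illustration; the exact normalisation is our assumption based on the description above). It scores each filter by its top-k correlations with other filters and picks the highest-scoring filter as the layer's global backbone information.

```python
import numpy as np

def global_backbone_index(corr, k):
    """Pick the filter whose top-k correlations with other filters are largest.

    corr : (M, M) pair-wise filter correlation matrix (see previous sketch).
    Returns the index of the layer's global-backbone filter and the
    max-normalised importance coefficients (our assumed normalisation).
    """
    M = corr.shape[0]
    scores = np.empty(M)
    for m in range(M):
        others = np.delete(corr[m], m)            # correlations with all other filters
        scores[m] = np.sort(others)[-k:].sum()    # sum of the top-k correlations
    importance = scores / scores.max()            # max-normalisation across the layer
    return int(np.argmax(importance)), importance

rng = np.random.default_rng(3)
A = rng.normal(size=(8, 8))
corr = np.corrcoef(A)                             # toy symmetric correlation matrix
idx, imp = global_backbone_index(corr, k=3)
print("backbone filter:", idx, "importance:", imp.round(2))
```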

Dynamic determination of the backbone structures
In the previous section, the server determines the global backbone information. This will be used in this section to dynamically determine which local parameters will be chosen for communication during each round of training, and then to build the corresponding communication-efficient personalised FL framework.
Because our goal is for each node to interact with other nodes to improve the performance of local personalised training, the global backbone information is used as the criterion for parameter selection. Furthermore, due to the heterogeneous local data and the random nature of feature encoding in the models on each node, even the same features across local models will be represented at different coordinates. Therefore, before parameter selection, after obtaining the global backbone information, each filter in the same layer of the local network structure is compared to it. The distance between each filter and the global backbone information can be used to determine their similarity:

$$S_m^l = -\,\big\lVert F_m^l - B^l \big\rVert_2,$$

where $S_m^l$ represents the similarity between the global backbone information $B^l$ and the $m$th filter in the $l$th layer. If a filter is close to the global backbone information, it is assumed that the features extracted by that filter are more similar to the extracted common feature of the global model. As a result, because they contain the most common information, the most similar filters can be chosen for communication. These filters form the backbone structure of each local model. Because of the heterogeneous nature of the local data, the number of filters that can build the backbone structure varies across the local models on each device. Then, as illustrated in Figure 3, we dynamically determine the width of the backbone structures in each local model in order to obtain as much shared information as possible and thus improve the performance of the personalised models.
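The sketch below illustrates this comparison step (our own toy code; taking similarity as negative Euclidean distance is our assumption, matching the distance-based description above). It ranks a layer's local filters by closeness to the received global backbone filter.

```python
import numpy as np

def backbone_candidates(filters, backbone):
    """Rank a layer's local filters by similarity to the global backbone filter.

    filters  : (M, D) array, one flattened filter per row.
    backbone : (D,) flattened global-backbone filter received from the server.
    Similarity is taken as negative Euclidean distance (our assumption),
    so the most similar filters come first in the returned order.
    """
    sim = -np.linalg.norm(filters - backbone, axis=1)
    order = np.argsort(sim)[::-1]          # most similar first
    return order, sim

rng = np.random.default_rng(4)
filters = rng.normal(size=(8, 27))         # 8 local filters of a toy 3x3x3 layer
backbone = filters[2] + rng.normal(scale=0.1, size=27)   # backbone close to filter 2
order, sim = backbone_candidates(filters, backbone)
print("similarity order:", order)          # filter 2 should rank first
```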
The use of shared structures is intended to improve the performance of the local personalised model. However, when determining the width of the backbone structures, the parameters of these backbone structures must not be detrimental to the accuracy of the local model. Furthermore, because the similarity of each filter to the global backbone information is available, the filters for sharing are chosen in the order of the similarity sequence. As a result, we can construct the following optimization objective for each device:

$$\min_{F_k} \; L(F_k) + \lambda R(F_k) \quad \mathrm{s.t.} \quad C(F_k) \leq \tau T, \tag{8}$$

where the loss function $L(\cdot)$ maintains model performance, $F_k$ is the filter candidate set of device $k$, $R(\cdot)$ is a sparsity term used to implement filter selection based on the similarity sequence, $C(\cdot)$ returns the number of selected filters, $T$ represents the total number of filters in the local model, and $\tau$ bounds the selected proportion. Such an optimization function can aid in selecting the backbone structure for aggregation in each node. The Alternating Direction Method of Multipliers (ADMM) is used to solve this optimization problem (Zhang et al., 2018). With ADMM, the objective function can be decomposed into easily solved sub-problems. First, the constraint can be embedded in Equation (8) as an indicator term $g(\cdot)$, simplifying computation:

$$\min_{F_k} \; L(F_k) + \lambda R(F_k) + g(F_k), \tag{9}$$

where $g(F_k) = 0$ if $C(F_k) \leq \tau T$ and $g(F_k) = +\infty$ otherwise. To implement ADMM, an auxiliary variable $Z_k$ replaces the variable $F_k$ in the sparsity term $R(\cdot)$ and the selection count $C(\cdot)$:

$$\min_{F_k, Z_k} \; L(F_k) + \lambda R(Z_k) + g(Z_k) \quad \mathrm{s.t.} \quad F_k = Z_k.$$

The augmented Lagrange function of Equation (9) is then:

$$L_\rho(F_k, Z_k, \pi_k) = L(F_k) + \lambda R(Z_k) + g(Z_k) + \pi_k^{\top}(F_k - Z_k) + \frac{\rho}{2}\lVert F_k - Z_k \rVert_2^2,$$

where $\pi$ is the Lagrange multiplier and $\rho > 0$. Further, defining $\pi_k = \rho u_k$, we have:

$$L_\rho(F_k, Z_k, u_k) = L(F_k) + \lambda R(Z_k) + g(Z_k) + \frac{\rho}{2}\lVert F_k - Z_k + u_k \rVert_2^2 - \frac{\rho}{2}\lVert u_k \rVert_2^2.$$

Therefore, based on the augmented Lagrange function, ADMM can be executed iteratively with the sub-problems:

$$F_k^{t+1} = \arg\min_{F_k} \; L(F_k) + \frac{\rho}{2}\lVert F_k - Z_k^{t} + u_k^{t} \rVert_2^2,$$
$$Z_k^{t+1} = \arg\min_{Z_k} \; \lambda R(Z_k) + g(Z_k) + \frac{\rho}{2}\lVert F_k^{t+1} - Z_k + u_k^{t} \rVert_2^2,$$
$$u_k^{t+1} = u_k^{t} + F_k^{t+1} - Z_k^{t+1}.$$

As shown in Figure 3, by solving the above optimization objectives, the global backbone information of each layer in the global model helps us obtain the backbone structure in each local model for parameter aggregation and information sharing. When confronted with heterogeneous data, the overlap of data decreases, and so does the overlap of features; once local model performance is guaranteed, a large number of unique features are retained. As a result, communication efficiency can be improved by uploading only a small portion of the model parameters, carrying the common information, from the local device to the server. The aggregated parameters are then sent back to each node. Because our goal is to use the shared information to improve the performance of the local model, the aggregated parameters are checked to see whether they help: the received parameters are validated on the local data alongside the locally retained parameters. If the new model parameters result in a lower loss value, the received aggregated parameters are kept for the next round of local training. Otherwise, the original values of the communicated parameters are trained alongside the retained parameters in the next round.
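To make the ADMM decomposition concrete, here is a minimal self-contained sketch (our toy illustration: a hypothetical quadratic surrogate stands in for the local training loss, and the Z-update is a hard projection that keeps only the filters most similar to the global backbone).

```python
import numpy as np

def admm_backbone_select(sim, target, w_star, rho=1.0, iters=50):
    """Toy ADMM loop for similarity-ordered filter selection.

    sim    : (M,) similarity of each filter to the global backbone.
    target : number of filters allowed in the backbone structure.
    w_star : (M,) toy 'ideal' per-filter scores; L(F) = 0.5*||F - w_star||^2
             stands in for the local training loss (assumption for this demo).
    """
    M = len(sim)
    F = np.zeros(M)
    Z = np.zeros(M)
    u = np.zeros(M)
    keep = np.argsort(sim)[::-1][:target]    # filters chosen along the similarity sequence
    for _ in range(iters):
        # F-update: argmin 0.5||F - w_star||^2 + (rho/2)||F - Z + u||^2 (closed form)
        F = (w_star + rho * (Z - u)) / (1.0 + rho)
        # Z-update: project onto the constraint set (only `target` backbone filters survive)
        Z = np.zeros(M)
        Z[keep] = (F + u)[keep]
        # Dual update
        u = u + F - Z
    return Z != 0                            # boolean backbone mask

rng = np.random.default_rng(5)
mask = admm_backbone_select(sim=rng.random(10), target=4, w_star=rng.normal(size=10))
print("selected backbone filters:", np.flatnonzero(mask))
```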
FedGB looks for common features shared by different nodes under heterogeneous data for aggregation, rather than interacting with the unique structures in each node for personalised training. As a result, more personalised FL training performance gains with fewer communication overheads can be obtained. In the following section, the effectiveness of our proposed FedGB will be tested in a variety of experimental settings.

Datasets and learning models
FashionMNIST is a dataset of fashion product images from Zalando containing 70000 images of size 1 × 28 × 28. The dataset has 10 classes, with 6000 training examples and 1000 test examples per class. FEMNIST (Federated Extended MNIST) comprises 805k samples of 62 classes, including digits and characters; each image has a size of 1 × 28 × 28. CIFAR-10 includes 60000 images of size 3 × 32 × 32 in 10 classes; the training set contains 50000 images and the test set 10000 images. CIFAR-100 contains 100 different classes; each image is an RGB image of size 32 × 32, with 50000 training images and 10000 test images. These four datasets are common for image classification tasks; here, AlexNet (Krizhevsky et al., 2012) is chosen for FL on the FashionMNIST and FEMNIST datasets, and VGG-16 (Simonyan & Zisserman, 2014) for the CIFAR-10 and CIFAR-100 datasets.

Non-IID data partition
The Non-IID settings are similar to those of McMahan et al. (2017), and the data is partitioned over the participating devices to simulate Non-IID and unbalanced datasets. Because each image has a digital label, we divide all data into different shards by sorting their labels. For example, the FashionMNIST and CIFAR-10 datasets both have 10 classes, so they are divided into ten shards, denoted by $Labels = \{0, 1, 2, \ldots, 9\}$. Each device can choose from 2, 4, 6, or 8 classes. Then, for each selected class, a random factor $f_{class} \in [0, 1]$ is assigned to construct the selection coefficient $s_{class} = f_{class} / \sum_{class=1}^{10} f_{class}$, followed by a random number selected from $[1000, 1600]$, which together with $s_{class}$ decides how many images each selected class will have. The Non-IID #2 setting of CIFAR-10 is illustrated in Figure 4. Because the FEMNIST dataset contains 62 classes, each device can select 10, 20, 30, or 40 classes. For the CIFAR-100 dataset with 100 classes, each device can select 20, 40, 60, or 80 classes.
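For the flavour of this partition, the following sketch generates a per-device class and count assignment (our own code under stated assumptions: how the random draw from [1000, 1600] combines with the coefficient, and normalising only over the selected classes, are our readings of the unspecified details).

```python
import numpy as np

def noniid_partition(num_devices, num_classes, classes_per_device, seed=0):
    """Assign each device a random subset of classes and per-class image counts.

    Mirrors the paper's description: each selected class gets a random factor
    f in [0, 1], normalised into a selection coefficient s, which scales a
    random base count drawn from [1000, 1600] (our reading of the procedure).
    """
    rng = np.random.default_rng(seed)
    plan = {}
    for d in range(num_devices):
        chosen = rng.choice(num_classes, size=classes_per_device, replace=False)
        f = rng.random(classes_per_device)          # random factor per selected class
        s = f / f.sum()                             # selection coefficients
        base = rng.integers(1000, 1601, size=classes_per_device)
        counts = np.maximum(1, (s * base).astype(int))
        plan[d] = dict(zip(chosen.tolist(), counts.tolist()))
    return plan

# Non-IID #2-style toy plan: 4 devices, 10 classes, 2 classes per device.
print(noniid_partition(num_devices=4, num_classes=10, classes_per_device=2))
```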

Baseline methods
Multiple state-of-the-art (SOTA) methods are chosen for comparison in order to demonstrate that our proposed method can achieve better training performance while reducing communication overheads.
• FedAvg (McMahan et al., 2017): the classical FL baseline, in which all local training results are aggregated into a shared global model.
• Top-K: each device selects only its most significant parameters, based on local training results, for communication.
• LG-FedAvg: learns compact local representations on each device and communicates a reduced number of parameters.
• FedSkel: improves communication efficiency by updating only the model's essential parts, namely the dynamically personalised pruned skeleton network.
• FedPrune: each client trains a converged model locally to obtain critical parameters and substructure that guide the pruning of the network participating in FL.

System implementation
FedGB is run on a server equipped with a Tesla P100 16GB GPU, and in each round of training we simulate ten participating devices. Each device in the system performs E = 1 local training epoch. When calculating the importance coefficient for each filter, k is set to 10% of the number of input channels in each layer. Ten experiments are run for each method in each experimental setting, with the initial values and the random seed of the network structure reselected in each experiment, and the results of the 10 experiments are averaged for presentation.

Evaluation metrics
We use two metrics to evaluate training performance. Local inference accuracy: we evaluate the inference accuracy on each device's test data and report the average accuracy over all devices. Communication overhead: we measure the transmitted data size of uploading communication during the training process.

General performance analysis
We compare the average local inference accuracy of the various methods under different experimental settings. Both IID and Non-IID settings are used. Furthermore, there are four experimental scenarios in the Non-IID setting, in which each node can select a different number of image classes to create its own dataset; the method for determining the number of images in the Non-IID setting was described in the previous section. The local inference performance is compared in Tables 2, 3, 4, and 5. The average local inference accuracy on the IID setting is compared first. Tables 2 to 5 show that, with the exception of the Top-K method, the methods achieve comparable inference accuracy. The Top-K method's obvious decrease in accuracy is due to the fact that it chooses only the most variable parameters for communication based on local training results. Furthermore, because the network structure is randomly encoded at different nodes, there is no guarantee that parameters representing the same features at different nodes are chosen for parameter aggregation, which affects accuracy. FedSkel and FedPrune also use the pruning-based method, but they use model pruning to find the most important part of the network structure or the converged sub-models. Because the training data are the same, the feature representation of the structure obtained through pruning is also the same, influencing the training results less.
LG-FedAvg applies personalised fine-tuning on top of the global model, and when the global model's performance is unaffected, so are the personalised training results. FedGB, in contrast to these methods, ensures the accuracy of local training before using shared information to improve personalised training results. As a result, communication overhead is reduced while local model accuracy is maintained. In this manner, FedGB always achieves the highest accuracy among the SOTA methods. The sections that follow focus on the experimental results obtained in Non-IID settings. Because the various Non-IID settings make FL more complicated than the IID setting, we concentrate here on the performance differences of the methods under the same Non-IID setting. Based on the results in Tables 2 to 5, FedGB clearly achieves greater accuracy improvements than LG-FedAvg, FedSkel, and FedPrune. FedGB achieves 6.43%, 2.14%, and 2.80% accuracy improvements over LG-FedAvg, FedSkel, and FedPrune at their best accuracy under different Non-IID settings and communication ratios on the CIFAR-10 dataset. On the CIFAR-100 dataset, the accuracy improvement reaches up to 5.08%, 3.38%, and 1.98%. The maximum accuracy improvements over the best performance of the SOTA baselines on FashionMNIST are 5.56%, 4.85%, and 4.25%, respectively. On the FEMNIST dataset, the best accuracy improvements are 5.54%, 3.78%, and 3.20%.
FedAvg's accuracy drops dramatically in the Non-IID setting. Due to the local preference of training results, it is difficult to obtain a shared global model with competitive inference accuracy on multiple nodes at the same time. These findings emphasise the importance of personalised training with heterogeneous data in FL. Furthermore, because a global model is obtained first in LG-FedAvg and then fine-tuned in personalization, when the accuracy of the global model is compromised, the results of personalised fine-tuning are compromised as well. As a result, the personalization accuracy of LG-FedAvg is lower than that of the best performance of FedSkel, FedPrune, and FedGB.
Furthermore, in FedSkel and FedPrune, uploading more parameters does not result in improved local accuracy. Under different Non-IID settings, for example, accuracy with a 50% communication ratio is generally better than accuracy with a 75% communication ratio, while accuracy improves significantly when the pruning size is increased from 25% to 50%, since at 25% a large number of parameters are discarded. This is because when the local model is compressed on a small scale, i.e. the communication ratio is lower, the more important coarse-grained features that guarantee the model's accuracy are retained, and these coarse-grained features have a certain degree of similarity across datasets. When only a small portion of the structure is pruned, the local model communicates local specific features with other local models; these features are more strongly correlated with the local dataset, so the effects of heterogeneous data are retained when the models are aggregated. Consequently, as the communication ratio increases from 50% to 75%, the accuracy decreases. It is therefore difficult to determine an appropriate model pruning scale that achieves higher accuracy by aggregating these pruned parts of the model structures, rather than the parts with common representations, when performing personalised training. FedGB instead selects common features shared by different local models, ensuring local training results while utilising commonly shared information, resulting in higher accuracy.

Computation overheads
Aside from comparing inference accuracy, we are also interested in the computation overheads of the different methods. We are more concerned with the overall training efficiency of the various methods across the whole FL training process than with the computation overhead in each training round. We therefore record, in addition to the relationship between training epochs and convergence rate, the average total running time of the various methods in various settings when they achieve the target accuracy. For this part of the experiment, we also run the FL on a Raspberry Pi 4B to highlight the difference in time overhead between the different methods. As shown in Figures 5 and 6, the convergence rate and overall efficiency of our method are always the highest under the various settings. In our method, each training round includes the determination of additional global backbone information as well as the determination of local backbone structures for aggregation. These computational processes increase the complexity of our algorithm, requiring more computational time. This type of computation, however, takes only a small fraction of the computation time compared to the overhead of model training. At the same time, our method is more efficient in overall training due to its higher convergence rate. Furthermore, we focus on the overall efficiency of the various methods under Non-IID settings. FedGB clearly has higher overall efficiency, as shown in Figure 5(a), being up to 1.92× faster in total running time than the SOTA personalised methods on the CIFAR-10 dataset with the Non-IID #2 setting.

Non-IID level impact
In this section, we look at how the Non-IID level affects FL performance. Tables 2 to 5 show that the performance of all baseline methods is related to the Non-IID level. While FedGB's accuracy does not increase as data heterogeneity decreases, it fluctuates within a narrow range. As the Non-IID degree of the local data decreases, the data on different nodes become more overlapping, resulting in a decrease in the local preference of the training results on different nodes. As a result, the global model obtained is more adaptable to data from different nodes, and the local test accuracy improves. As the degree of data imbalance decreases, FedAvg's accuracy improves, and LG-FedAvg, which uses the global model, obtains better personalised accuracy as the global model's performance improves. Similarly, as data overlap grows, so does the overlap between the important parameters obtained through model pruning, improving FedSkel's and FedPrune's accuracy across all communication ratio settings. Our method, on the other hand, is less affected by heterogeneous data because it focuses on shared features among different local models and only uses parameter aggregation in FL to improve performance. As a result, FedGB shows only minor differences in training results across the various Non-IID settings. The variation of LG-FedAvg between optimal and worst accuracy is 4.45%, 3.16%, 2.13%, and 1.61%, respectively, on the different datasets. FedSkel's maximum variation in accuracy with different communication ratios is 6.32%, 2.49%, 2.05%, and 3.79%, respectively. FedPrune has 3.83%, 2.64%, 2.77%, and 2.59% variations, whereas FedGB has only 0.86%, 0.93%, 1.61%, and 1.01% variations.

Communication overheads
One of the main contributions of this paper is to reduce communication overheads in order to improve the efficiency of personalised FL communication. The communication efficiency improvements will be presented after the proposed method's local inference accuracy has been verified. The communication ratio for Top-K, FedSkel, and FedPrune is set to 0.5 in the experiments. Figures 7 and 8 show the accumulative updating bytes of all participants on all datasets under IID and Non-IID settings.
FedGB has the fewest total communication bytes, as shown in Figures 7 and 8, and also the fewest communication bytes per round when the number of communication rounds is the same. The first priority in FedGB is still to ensure the accuracy of the local model while using shared information to improve personalised model performance. Backbone structures with common features, rather than local unique structures, are exchanged in each round of the training process. When interacting with unique structures, it is necessary to ensure that such structures can be effectively trained locally to achieve competitive accuracy, so it is difficult to compress local structures to smaller sizes while maintaining lower communication overheads. FedGB thus has lower communication overheads than FedSkel and FedPrune, while also using parameter interactions to aid local personalised training. The reduction in uploading communication reaches up to 2.80× on the CIFAR-10 dataset, 4.50× on the CIFAR-100 dataset, 3.01× on the FashionMNIST dataset, and 2.95× on the FEMNIST dataset. As a result, FedGB achieves the best trade-off between accuracy and communication overheads, especially under the Non-IID setting.

Scalability
Another important factor influencing training accuracy is the number of devices participating in FL. Previous FL experiments involved only ten devices; here, we add settings with 20, 40, and 100 participants. The results are depicted in Figure 9. As can be seen, in the Non-IID setting the accuracy increases as the number of devices increases, while under the IID setting the accuracy increases only slightly. This is because having more participants allows for more data overlap at the same level of heterogeneity, allowing for more commonly shared information and thus improved accuracy. However, because communication is a performance bottleneck in FL, more participating devices mean more communicated parameters, causing a greater conflict with the limited communication bandwidth. When the number of participants increased from 10 to 100, the amount of communication increased nearly 10 times, while the accuracy gains in the Non-IID setting are less than 3%, and even lower in the IID setting. The above results demonstrate the scalability of our approach: FedGB can achieve better accuracy with more devices, but increasing the number of participants is not an efficient approach when the trade-off between communication overheads and accuracy improvements is considered.

Conclusion and future work
FedGB, a novel communication-efficient personalised FL method, addresses two critical issues in FL: local inference accuracy and training process communication overheads. When faced with heterogeneous data, previous approaches first determine local preferred structures before interacting with their parameters to obtain highly accurate personalised models with low communication overhead. FedGB differs from them in that it seeks structures with common information of training results across nodes and interacts with their parameters to improve communication efficiency and the accuracy of local personalised models. FedGB can achieve up to 6.43% accuracy improvement and 4.50× communication overhead reduction when compared to SOTA methods under various experimental settings. Furthermore, in industrial applications such as unmanned driving and intelligent product inspection tasks in large-scale production lines, the devices will have to contend with the limited and dynamic nature of the collected data, as well as the dynamic and unstable connection states caused by device mobility and wide distribution. We can train and deploy intelligent algorithms with higher accuracy while incurring less communication overhead using our communication-efficient personalised FL method. In these industrial scenarios, this enables efficient and effective interaction between different devices, allowing collaboration tasks to be completed efficiently and effectively.
In addition, we used the global model's backbone information in this paper to select backbone structures at different devices and then aggregated these sub-model structures at the server, but we only used the traditional FL aggregation method based on parameter coordinates. In the future, we will focus on the interpretability and feature-representation aspects of the network structure, investigate new methods of aggregating local models, and attempt to investigate the nature of the impact of heterogeneous data on FL performance to aid in improving it.

Disclosure statement
No potential conflict of interest was reported by the author(s).