Asynchronous Federated Learning on Heterogeneous Devices: A Survey

Federated learning (FL) is a kind of distributed machine learning framework, where the global model is generated on the centralized aggregation server based on the parameters of local models, addressing concerns about privacy leakage caused by the collection of local training data. With the growing computational and communication capacities of edge and IoT devices, applying FL on heterogeneous devices to train machine learning models is becoming a prevailing trend. Nonetheless, the synchronous aggregation strategy in the classic FL paradigm, particularly on heterogeneous devices, encounters limitations in resource utilization due to the need to wait for slow devices before aggregation in each training round. Furthermore, the uneven distribution of data across devices (i.e. data heterogeneity) in real-world scenarios adversely impacts the accuracy of the global model. Consequently, many asynchronous FL (AFL) approaches have been introduced across various application contexts to enhance efficiency, performance, privacy, and security. This survey comprehensively analyzes and summarizes existing AFL variations using a novel classification scheme, including device heterogeneity, data heterogeneity, privacy, and security on heterogeneous devices, as well as applications on heterogeneous devices. Finally, this survey reveals rising challenges and presents potentially promising research directions in this under-investigated domain.


Introduction
In recent years, the rapid expansion of computing power and the fast evolution of communication infrastructures have driven the prosperity of machine learning (ML), which is one of the key driving forces of various modern technologies [1,2]. However, training ML models requires a large amount of high-quality data, which is a critical demand for model trainers in real-world scenarios [3]. Attention to privacy-preserving data sharing continues to grow, with new legislation and regulations making data gathering even more difficult [4,5,6]. In addition, industries are reluctant to share their local data due to peer competition, privacy concerns, and other potential concerns [7,8,9]. All these factors jointly cause the problem of isolated data islands. As a result, collecting data from various reliable sources is either nearly impossible or unaffordable [10,11].
Federated learning (FL) is a novel paradigm that enables multiple parties to collaborate on ML model training without requiring direct access to local training data. First proposed by Google in 2016, FL appears to be a promising ML solution that meets the data privacy and communication efficiency requirements [12,13]. The design goal of FL is to ensure personal data privacy and develop satisfying ML models across multiple participants or computing nodes under the premise of legal compliance [14,15]. As a result, FL is adopted in many papers with a central server collecting the parameters of local models (hereafter abbreviated as "local models") from nodes before global model aggregation in each training round [16].
With the widespread deployment of the 5G network and the rapid development of hardware, heterogeneous devices (including edge and IoT devices) gain greater communication and computation capacity for a wider range of applications [17]. Compared with classic ML approaches, FL offers various advantages for edge applications [14,18]: (1) Higher local data privacy due to the gradient-based aggregation of the global model; (2) Lower latency in network transmission as the training data does not need to be transferred to the cloud; (3) Higher model quality due to the help of learned features from other devices. As a result, FL enables the collaborative training of ML models on heterogeneous devices, as demonstrated in many research papers.
When applying classic FL to resource-constrained devices, various shortcomings emerge [19]: (1) Unreliability of heterogeneous devices. The aggregation server waits for the updated local gradients uploaded by selected heterogeneous devices, which may go offline unexpectedly due to their unreliability. (2) Low round efficiency. The faster devices have to wait for those stale local models uploaded from straggler devices in each training round due to the disparity in resources of devices (device heterogeneity) and the disparity in training data on devices (data heterogeneity). (3) Low resource utilization. Due to inefficient node selection algorithms, multiple competent devices are rarely selected. (4) Security and privacy concerns. There are several attacks that can compromise the security of classic FL, including poisoning and backdoor attacks. The privacy concern comes from the possible data leakage during the training process.
To overcome the above-identified challenges, asynchronous federated learning (AFL) is put forward, in which the central server conducts global model aggregation as soon as it collects a local model. This survey systematically reviews 50 papers related to AFL and 7 existing survey papers related to FL, published from 2019 to 2022. More specifically, we categorize the existing papers aiming at mitigating device heterogeneity into six classes: node selection, weighted aggregation, gradient compression, semi-asynchronous aggregation, cluster FL, and model split. In addition, the data heterogeneity issues faced by AFL are categorized into two classes: non-independent and identically distributed (non-IID) data and vertically distributed data. After that, the privacy and security issues faced by AFL on heterogeneous devices are further discussed. Furthermore, the popular applications of AFL on heterogeneous devices are demonstrated. Finally, several potentially promising research directions are pointed out and discussed.
Although some survey papers related to FL have been published, none of them investigates, classifies, or summarizes AFL in detail. As a result, the key contributions of this paper lie in classifying, summarizing, and analyzing AFL thoroughly. A comparison of this survey paper with other related survey papers is summarized in Table 1.
The remainder of this paper is organized as follows. Section 2 briefly introduces the preliminary knowledge that the survey requires. The challenges that AFL schemes attempt to address, including device heterogeneity, data heterogeneity, as well as privacy and security on heterogeneous devices, are discussed and analyzed in Section 3, Section 4, and Section 5, respectively. After that, Section 6 illustrates the applications of AFL on heterogeneous devices. Based on the above analysis and discussion, Section 7 provides potentially promising research directions, followed by a summary and conclusion in Section 8.

Table 1. Comparison of this survey with related surveys.
Surveys | Topics | Limitations
[16] | FL | Multi-level classification of FL without a detailed classification of AFL.
[14] | FL on edge | Treats AFL as a promising solution without comparing different AFL schemes.
[17] | FL on IoT | Only 7 papers related to AFL are gathered, with an investigation of convergence. No detailed classification of AFL.
[18] | FL on IoT | Explains the concept of AFL without comparing different AFL schemes.
[15] | FL privacy | Focuses on privacy preservation in FL; no discussion related to AFL.
[20] | FL security | AFL not mentioned.
[21] | FL privacy | Focuses on privacy preservation in FL on IoT; no detailed classification of AFL.
This survey | AFL | Classifies and analyzes the challenges faced by AFL and summarizes the application scenarios of AFL.

Background Knowledge
This section illustrates the background knowledge needed for better comprehension of AFL from three perspectives: federated learning, blockchain, and differential privacy.

Federated Learning
The concept of FL was first introduced in [22]. In FL, local models are trained on distributed nodes, and the global model is generated on an aggregation server by averaging the local models. In each training round (iteration), nodes taking part in the training share their local models instead of local training data. FL is viewed as an ML framework that breaks down the barrier of data silos [23], owing to its privacy-preserving treatment of the local training data on each node. The major steps of classic FL are as follows.
1. Initialization: After the training task is determined according to the application scenario, the aggregation server prepares an initial global model $w_G^0$ and training settings, such as the learning rate, batch size, and the number of iterations to train. After node selection, the aggregation server broadcasts $w_G^0$ to an appropriate number of nodes (denoted as K).
2. Local Model Training: Let t stand for the current iteration number. Based on the global model $w_G^t$, each node trains its local model and obtains the local model $w_k^t$, where k ∈ [1, K]. The local models $w_k^t$ are subsequently sent back to the aggregation server.
3. Global Model Aggregation: The server aggregates the local models by averaging them and generates a new global model $w_G^{t+1}$. After that, $w_G^{t+1}$ is sent back to the nodes for the next iteration of training.
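The aggregation rule of step 3 is not reproduced in the text above. Since this section describes the global model as being generated by averaging the local models, a minimal form, assuming an unweighted mean over the K participating nodes, would be:

```latex
w_G^{t+1} = \frac{1}{K} \sum_{k=1}^{K} w_k^{t}
```

Practical schemes often weight each local model by its share of the training data instead; that weighting is an assumption beyond what is stated here.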
Generally, datasets across nodes in FL are expected to be independent and identically distributed (IID), which means that the distribution of the samples across nodes is the same, with training samples independently selected for each training round. In practice, however, the training data samples collected by nodes usually do not follow the same distribution; such data is referred to as non-independent and identically distributed (non-IID), and it poses challenges for both classic FL and AFL. Taking the smart hospital scenario as an example, IID data assumes that disease cases at different hospitals are similar. Non-IID data, on the other hand, assumes that disease cases at different hospitals vary, which is closer to reality. With non-IID data, updated gradients learned in one hospital may be unhelpful in predicting the situation of patients in another hospital.
In several application scenarios, the datasets kept at different parties include diverse feature sets but the same entities. For instance, the investment information and the deposit information of the same user belong to the financial company and the bank, respectively. Identifying the credit risk of an investor is challenging for a financial company because it lacks the user's salary and deposit information. Datasets across nodes with the same entities but different feature sets are defined as vertically partitioned (VP) [24] datasets. Vertical FL is the scheme that trains models across VP datasets [25], as shown in Fig. 1. With vertical FL, the updated gradients from banks allow financial institutions to evaluate the risk of an investment without knowing the user's sensitive data.
IoT and edge devices have increased computation and communication capabilities as a result of the broad deployment of 5G networks and the adoption of high-performance hardware [14]. In order to reduce the communication cost, an increasing number of researchers have deployed ML tasks on heterogeneous devices to enable smart interactions with humans [26]. However, differences in computation and communication capabilities are inevitable among heterogeneous devices. Moreover, the disparity in data size between nodes results in significant variance in the time cost of training on each node. Stale local models are usually trained from stale global models on straggler nodes.
AFL is proposed to mitigate the influence of stale nodes and improve the efficiency of FL. In AFL, the global model aggregation happens whenever a new local model is received by the aggregation server. The comparison of asynchronous and synchronous FL architectures on heterogeneous devices is shown in Fig. 2. The major steps of AFL are as follows.
1. Initialization: Similar to classic FL, the aggregation server broadcasts the initial global model $w_G^0$ to all K nodes.
2. Local Model Training: Nodes train their local models based on the most recent global model obtained from the aggregation server. Nodes finish training local models $\{w_1^t, w_2^{t+1}, w_3^{t+2}, \ldots, w_k^{t+k}\}$ in sequence due to the disparity of heterogeneous devices in computing capability. Then, the local models are independently sent back to the aggregation server.
3. Global Model Aggregation: The server aggregates the newly collected local model with the latest global model. After that, $w_G^{t+k}$ is available to nodes for the subsequent training process.
The current iteration number t in AFL increments by 1 when a device finishes its training and uploads the updated local model to the aggregation server. The immediate model aggregation in AFL reduces the waiting time and thereby improves efficiency [27].
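The aggregation formula of step 3 in the AFL procedure is not reproduced above. The following is a minimal sketch, assuming the server mixes each arriving local model into the global model with a fixed mixing weight `alpha`. Both the function name and `alpha` are hypothetical; individual AFL schemes use different rules.

```python
from typing import List


def async_aggregate(global_model: List[float],
                    local_model: List[float],
                    alpha: float = 0.5) -> List[float]:
    """Mix one newly arrived local model into the current global model.

    alpha is an assumed mixing weight in [0, 1]; it is not taken from the survey.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [(1.0 - alpha) * g + alpha * w
            for g, w in zip(global_model, local_model)]


# The iteration counter t advances by one for every upload the server receives.
global_model = [0.0, 0.0]
t = 0
for local_model in ([1.0, 2.0], [3.0, 2.0]):  # uploads arrive one at a time
    global_model = async_aggregate(global_model, local_model, alpha=0.5)
    t += 1
```

In this sketch the server never waits for a full set of uploads, which is the source of the reduced waiting time described above.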

Blockchain
Blockchain, the backbone of Bitcoin [28], is a distributed ledger technology (DLT) that keeps transactional data consistent and immutable across nodes [29]. In the blockchain, nodes are responsible for maintaining the shared ledger and running a globally unified program known as the smart contract. The smart contract is self-verifiable and tamper-proof, ensuring the security and trustworthiness of the shared ledger. All nodes execute and validate the received transactions through the smart contract. After reaching an agreement according to the consensus algorithm, all nodes organize the transactional data into blocks and append the blocks to the shared ledger. Proof of Work (PoW) [28], Proof of Stake (PoS) [30], and Practical Byzantine Fault Tolerance (PBFT) [31] are the three most widely used consensus algorithms at the moment. Generally, a consensus algorithm with a higher security level has lower efficiency.
Usually, the blockchain is treated as a distributed database to save the models generated during the training. Besides, some researchers utilize the reputation system in the blockchain to motivate nodes to contribute local models. Blockchain can bring various benefits to AFL. First, the immutable shared ledger prevents nodes from misbehaving, for example by uploading plagiarized updated gradients [23]. Second, the consensus algorithm enables unfamiliar devices to trust each other, as the aggregation process is decentralized and cannot be manipulated [32]. Third, the smart contract verifies both the local models and the authentication of nodes, preventing malicious nodes from uploading malicious gradients [23]. Fourth, the decentralized aggregation strategy in the blockchain protects the aggregation server from DDoS attacks and reduces the risk of single-point failure [33]. However, scalability and efficiency will be compromised to a certain extent when adopting blockchain in AFL.

Differential Privacy
Differential privacy is a promising privacy-preserving technique that has experienced a tremendous boom for over ten years. Derived from the differential attack, differential privacy is a data-hiding approach that hides a single piece of sensitive data in a specific dataset [34,35]. Differential privacy aims to make every piece of data indistinguishable while maintaining certain statistical features for data analysis [36].
Diverse differential privacy mechanisms have been developed while each of them plays an important role in its corresponding application scenario. The most popular ones are Laplace mechanism [35], exponential mechanism [37], and Gaussian mechanism [38]. Through adding controllable randomized noise, differential privacy is able to return sanitized and privacy-preserving responses to data requesters [34]. The data sanitizing procedure in differential privacy degrades data utility [7]. Therefore, a parameter named privacy budget is introduced to measure the trade-off between privacy protection and data utility.
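As an illustration of how controllable noise and the privacy budget interact, the following is a minimal sketch of the Laplace mechanism for a numeric query. The function names, the `sensitivity` argument, and the example values are illustrative assumptions; the noise scale `sensitivity / epsilon` is the standard calibration for this mechanism.

```python
import random
from typing import List


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw zero-mean Laplace noise as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def laplace_mechanism(true_value: float, sensitivity: float,
                      epsilon: float, rng: random.Random) -> float:
    """Return a sanitized answer; a smaller privacy budget epsilon adds more noise."""
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    return true_value + laplace_noise(sensitivity / epsilon, rng)


def mean_abs_error(samples: List[float], truth: float) -> float:
    """Average distance between the sanitized answers and the true value."""
    return sum(abs(s - truth) for s in samples) / len(samples)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
strict = [laplace_mechanism(100.0, 1.0, 0.1, rng) for _ in range(2000)]
loose = [laplace_mechanism(100.0, 1.0, 10.0, rng) for _ in range(2000)]
```

Comparing `mean_abs_error(strict, 100.0)` with `mean_abs_error(loose, 100.0)` shows the trade-off: the smaller budget gives stronger protection but a much less accurate answer.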
To meet the demand for flexible privacy protection in the real world, personalized privacy protection models have been proposed. In this series of variants, an index (e.g. social distance in social networks) is introduced to identify the personalized level of privacy protection [39]. By fine-tuning personalized parameters, the data utility can be improved further [40].
Although efficient and scalable, differential privacy faces the problem of low data utility, especially when the injected noise is randomized and not well-controlled [41]. Since there are usually dozens or even thousands of local devices participating in the training process in FL, this issue is mitigated by setting the mean value of the Laplace distribution to zero [42,43], so that the noise contributions tend to cancel out when many local updates are aggregated. As a result, differential privacy has great potential to be applied in FL and even AFL.

Device Heterogeneity
The main challenge of AFL is to maximize resource utilization on heterogeneous devices to improve efficiency. Another challenge is the stale local models caused by device heterogeneity, which are usually harmful to the global model. The current work is summarized from several aspects, including node selection, weighted aggregation, gradient compression, semi-asynchronous FL, cluster FL, and model split. The related work is summarized and compared in Table 2.

Node Selection
Several node selection algorithms are proposed to improve the efficiency of AFL on heterogeneous devices. The common node selection strategy for classic FL allows all nodes to contribute equally to the global model. However, in AFL, it is preferable to select nodes that are more robust and powerful while preventing the global model from overfitting.
For instance, in [44], the authors present a heuristic greedy node selection strategy that iteratively selects heterogeneous IoT nodes to participate in global learning aggregation based on their local computing and communication resources. Experiments are conducted on both IID and non-IID datasets to verify the effectiveness of their approach. Apart from that, considering the large number of edge devices involved, in [47], the authors limit the number of devices training at the same time in the AFL network. A limited-size cache with a weighted averaging mechanism is introduced on the server to reduce the impact of model staleness. Experiment results support the improved convergence speed and model accuracy.
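A greedy, resource-aware selection of the kind described for [44] can be sketched as follows. The scoring rule (the product of compute and bandwidth figures), the field names, and the cap on the number of selected devices are hypothetical and are not taken from the cited papers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    compute: float    # assumed relative compute capacity
    bandwidth: float  # assumed relative uplink capacity


def select_nodes(candidates: List[Node], max_selected: int) -> List[Node]:
    """Greedily pick the highest-scoring nodes until the cap is reached."""
    remaining = list(candidates)
    selected: List[Node] = []
    while remaining and len(selected) < max_selected:
        best = max(remaining, key=lambda n: n.compute * n.bandwidth)
        selected.append(best)
        remaining.remove(best)
    return selected


nodes = [Node("phone", 2.0, 1.0), Node("gateway", 4.0, 3.0), Node("sensor", 0.5, 0.5)]
chosen = select_nodes(nodes, max_selected=2)
```

A purely greedy rule like this always favors the same strong devices, which is the overfitting risk noted above; practical schemes would need an additional fairness or staleness term.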
In order to select nodes more reasonably, in [48], a prioritized node-selecting function is designed according to the computing power and accuracy change of local models on each node. Other unselected nodes continue the iterations locally at the same time. As a result of the node-selecting function, the experiment results show a higher accuracy growth rate with a faster convergence speed. Another similar idea is to assign a trust score to each node based on its activities [49]. ML tasks with resource requirements and a minimum trust score are published in the FL network. Any candidates who do not meet the task requirement are filtered out before the training round begins. Clients who complete tasks will be rewarded, while those who do not will have their trust value decreased. In [19], a node with a lower probability of crashing is more likely to be selected in an iteration. The straggling nodes whose models are too stale are tagged as deprecated and forced to synchronize with the server. The tolerable nodes are those training on acceptably stale models, which work asynchronously with the server. After updated gradients from a fraction of nodes are received, the central server ends a round of training. As a result, the waste of computation resources is minimized, and communication expenses are kept at a relatively low level.
In [52], the authors undertake experiments on both IID and non-IID datasets across varied computation resources and training data distributions to evaluate the performance of different device scheduling and update aggregation strategies when a fraction of IoT devices are allowed to upload local models. Specifically, the device scheduling policies include random scheduling, significance-based scheduling, and frequency-based scheduling; the update aggregation policies include equal weight aggregation and age-aware aggregation. The simulation results demonstrate that the random scheduling policy outperforms the others when training on non-IID datasets. Besides, an appropriate age-aware aggregation policy performs better.

Weighted Aggregation
Several weighted aggregation algorithms have been proposed in numerous studies to lessen the influence of straggler devices and increase learning efficiency. In classic FL, the target of the weighted aggregation is to allow the global model to be more reliant on local models with higher accuracy. However, in AFL, the goal is to mitigate the effects of stale local models generated based on an outdated global model, which does not exist in classic FL. By incorporating a staleness parameter, weighted aggregation reduces the weight of stale local models and increases the weight of the most current local models during the aggregation procedure.
In [53], a mixing hyperparameter is introduced based on staleness to balance the convergence rate and variance reduction. The experiments conducted on CIFAR-10 and WikiText-2 validate both fast convergence and staleness tolerance. In [55], a temporally weighted aggregation strategy is proposed, which increases the weight of recently updated local models when aggregating on shallow and deep layers. Experiment results on CNN and LSTM neural networks show that the global model accuracy and convergence are improved. Another time-based weighted aggregation algorithm is proposed in [57]. The weight assigned to the updated gradients decreases as the staleness value increases. Similarly, in [47], a staleness-based weighted aggregation algorithm with cache is proposed. In [58], a decay coefficient is proposed with similar effects, balancing the previous and current models. With the dynamic learning step size, the nodes with more data or poor communication status are compensated. Experiments across three real-world datasets are conducted with results showing that their scheme converges fast and enables higher model accuracy.
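The staleness-based weighting described above can be sketched as follows. The polynomial decay function and the parameter names (`alpha`, `a`) are illustrative assumptions rather than the exact rule of any cited scheme. Staleness is taken here as the number of global iterations elapsed since the node downloaded its base model.

```python
from typing import List


def staleness_weight(alpha: float, staleness: int, a: float = 0.5) -> float:
    """Shrink the base mixing weight alpha as staleness grows (assumed polynomial decay)."""
    if staleness < 0:
        raise ValueError("staleness must be non-negative")
    return alpha * (1.0 + staleness) ** (-a)


def aggregate(global_model: List[float], local_model: List[float],
              current_round: int, base_round: int,
              alpha: float = 0.6) -> List[float]:
    """Mix a local model into the global model with a staleness-discounted weight."""
    weight = staleness_weight(alpha, current_round - base_round)
    return [(1.0 - weight) * g + weight * w
            for g, w in zip(global_model, local_model)]


# A fresh update (staleness 0) moves the global model further than a stale one (staleness 9).
fresh = aggregate([0.0], [1.0], current_round=10, base_round=10)
stale = aggregate([0.0], [1.0], current_round=10, base_round=1)
```

Other decay shapes (for example exponential or hinge-style functions) fit the same interface; the choice affects how quickly late updates are discounted.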
Besides, a dual-weighted gradient updating strategy is proposed in [62], which takes into account the size of the dataset as well as the similarity between the local and global gradients. The updated gradients submitted by edge devices are aggregated after the dual-weight correction. The experiment results reveal that the model accuracy remains high even after gradient compression.
Another idea is to aggregate branches in a model with weights. In [63], the global model is split into branches with the aggregation procedure transformed into a branch-weighted merging process. The aggregation weight is dynamically adjusted depending on the training accuracy of all nodes to prevent the global model from overfitting to nodes that upload gradients frequently. To evaluate the effectiveness of the proposed scheme, a prototype is implemented on heterogeneous devices based on two industrial cases: (1) Fault diagnosis of motor bearings and (2) Fault diagnosis of the gearbox. The experiment results demonstrate that their scheme converges faster, achieves higher accuracy, and consumes less energy than the classic CNN model.

Gradient Compression
Since gradient compression is a general strategy to improve the efficiency of FL, it is usually adopted in AFL to further reduce communication costs. In AFL, gradient compression faces the new challenges of a resource-constrained edge/IoT computing environment and more frequent aggregation operations. Specifically, the disparity of computing power among nodes is greater than in classic FL. Besides, AFL has a larger server-side computational demand than standard FL due to the more frequent aggregation and compression operations. In order to address these challenges, several effective gradient compression algorithms for AFL are presented.
In [66], two sub-modules are presented for self-adaptive threshold gradient compression: (1) self-adaptive threshold computation and (2) gradient communication compression. The former is in charge of computing the threshold based on recent parameter changes, while the latter is in charge of compressing redundant gradient communications based on the threshold. The accuracy of the generated models after gradient compression is verified by training an MLP model on the MNIST dataset. Besides, the proposed scheme allows nodes to join or quit freely, which is suitable for highly mobile edge computing scenarios. Another similar paper is published in [62] by the same authors.
From the aspect of vertical FL, in [67], the authors present a double-end sparse compression algorithm based on the Top-K AllReduce sparse compression technique [82]. Specifically, the compression process happens on both the server and local sides to reduce the transmission cost. Experiment results demonstrate that information exchange during the training process is reduced by 86.90%, revealing that their scheme is suitable for edge computing scenarios with low-bandwidth or metered networks. Furthermore, the training data is protected against gradient leakage attacks [83].
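Top-K sparsification, the building block named above, keeps only the K largest-magnitude gradient entries and transmits them as index-value pairs. The following is a minimal sketch on a flat gradient list; the function names are illustrative, and the double-end and AllReduce aspects of [67] are not modeled.

```python
from typing import Dict, List


def top_k_sparsify(gradient: List[float], k: int) -> Dict[int, float]:
    """Keep the k entries with the largest absolute value, keyed by position."""
    if k < 0:
        raise ValueError("k must be non-negative")
    ranked = sorted(range(len(gradient)),
                    key=lambda i: abs(gradient[i]), reverse=True)
    return {i: gradient[i] for i in ranked[:k]}


def densify(sparse: Dict[int, float], length: int) -> List[float]:
    """Rebuild a dense gradient on the receiving side; dropped entries become zero."""
    return [sparse.get(i, 0.0) for i in range(length)]


grad = [0.01, -0.9, 0.2, 0.0, 0.5]
sparse = top_k_sparsify(grad, k=2)     # only 2 of 5 entries are transmitted
restored = densify(sparse, len(grad))
```

Here only two of the five entries would be sent, at the cost of discarding the small-magnitude components.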
Another approach to improving communication efficiency is to design a new communication protocol that schedules model upload and download more efficiently. For example, in [70], three transmission scheduling algorithms that account for stragglers are proposed to improve the efficiency of AFL in wireless networks, where statistical information regarding uncertainty is known, unknown, or limited. The experiment results show superior accuracy, convergence speed, and robustness.

Semi-Asynchronous FL
In asynchronous FL schemes, stale local models uploaded by stragglers decrease the accuracy of the global model to a certain extent. To alleviate the effects of the straggler devices, semi-asynchronous FL schemes are proposed. Generally, semi-asynchronous FL is a scheme that combines classic FL and AFL, in which the aggregation server caches local models that arrive first and aggregates them after a specific period of time. Depending on the extent of staleness, the local models that arrive later are either allocated to the following training round or abandoned. Semi-asynchronous FL has a lower aggregation frequency than AFL but greater than classic FL. Similar to classic FL, a training round is defined as the process between two adjacent global aggregations.
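The buffering behaviour described above can be sketched as follows. The class name, the staleness limit, and the choice to average the buffered models into the new global model are assumptions made for illustration; the cited schemes differ in how they treat late arrivals.

```python
from typing import List


class SemiAsyncServer:
    """Caches uploads during a time window and aggregates them when the window closes."""

    def __init__(self, global_model: List[float], max_staleness: int = 1):
        self.global_model = global_model
        self.round = 0
        self.max_staleness = max_staleness  # assumed tolerance, in rounds
        self.buffer: List[List[float]] = []

    def receive(self, local_model: List[float], base_round: int) -> bool:
        """Cache an upload unless it is too stale; return whether it was kept."""
        if self.round - base_round > self.max_staleness:
            return False  # abandoned
        self.buffer.append(local_model)
        return True

    def close_round(self) -> None:
        """Called when the time window ends: average the cached models."""
        if self.buffer:
            n = len(self.buffer)
            self.global_model = [sum(col) / n for col in zip(*self.buffer)]
            self.buffer = []
        self.round += 1


server = SemiAsyncServer([0.0, 0.0], max_staleness=1)
server.receive([1.0, 1.0], base_round=0)
server.receive([3.0, 5.0], base_round=0)
server.close_round()
after_first = list(server.global_model)
late_kept = server.receive([4.0, 3.0], base_round=0)  # slightly stale: joins the next round
server.close_round()
too_late = server.receive([9.0, 9.0], base_round=0)   # too stale: abandoned
```

An upload that misses one window is simply carried into the following round, while one that exceeds the staleness limit is dropped.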
In [48], a priority function is introduced to accurately select nodes with large amounts of data or high computation power. Meanwhile, local models on unselected nodes will be cached for a specific number of iterations before being submitted to the aggregation server. Besides, a restriction on the local training round number is set to prevent specific nodes from being unselected for a lengthy period of time, which would lead the global model to overfit certain nodes. The effectiveness of the scheme is evidenced by experiments conducted on IID and non-IID datasets.
In contrast, a cache-based lag-tolerant mechanism on the aggregation server is introduced in [19] to mitigate the impacts of stragglers, crashes, and model staleness. In their scheme, all nodes are classified into three categories: up-to-date, deprecated, and tolerable. Only the up-to-date and deprecated nodes are forced to synchronize with the server, while the tolerable nodes work asynchronously. The nodes will be labeled picked, undrafted, or crashed after training. Specifically, local models from undrafted nodes are not aggregated in this round but retained in the cache for aggregation with local models in the next round. As a result, the trade-off between faster convergence and lower communication overhead is properly addressed, which is verified by experiments. Similarly, a private buffer on the aggregation server holding a certain number of model updates is designed in [26], with convergence guaranteed by theoretical analysis. To evaluate the scalability and efficiency of their scheme under various staleness distributions, the authors train an LSTM classifier on text and image classification tasks. The results reveal that their approach is more resistant to diverse distributions and converges faster than classic synchronous and asynchronous FL schemes. Another scheme that adopts a secure buffer on the server is proposed in [74], where a secure aggregation protocol is designed to prevent the server from learning any information about the local updates.
From the time aspect, the authors in [73] aggregate local models at a specific time interval determined by the slowest node. This allows more exact control over the training nodes, especially in edge computing networks with non-IID data distribution. The authors then compare classic synchronous, asynchronous, and semi-synchronous schemes across heterogeneous devices in experiments. The results show that their approach is faster and more accurate than other schemes.
Moreover, to mitigate the overload when a huge number of local model updates occur in a short period of time, in [57], a synchronous aggregation strategy is adopted after caching the first several local models received within a given time window. The experiment results reveal that, compared with the classic FL scheme, the time window enables their scheme to support many more nodes.

Cluster FL
Clustered FL is an approach to increasing training efficiency by grouping together devices with similar performance, functionalities, or datasets. Inner-group updates, inter-group updates, or both could benefit from the asynchronous update strategy.
For instance, one idea is to group nodes into tiers based on their response latency [75]. Faster tiers are responsible for faster convergence, while slower tiers aid in improving model accuracy. Furthermore, a polyline-encoding-based compression algorithm is adopted in their scheme to improve communication efficiency. Experiments are conducted across multiple datasets and models, confirming that their scheme has a low communication cost and high prediction accuracy. In [76], a grouping metric is proposed, where the gradient direction and the latency of model update are taken into account. The local update latency is composed of computation latency and communication latency. Experiments conducted on four imbalanced non-IID datasets assess the improvement in test accuracy.
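Grouping nodes into tiers by response latency, as in the tier-based idea above, can be sketched as follows. The equal-size split of the latency-sorted node list and all names are assumptions for illustration, not the procedure of [75].

```python
from typing import Dict, List


def tier_by_latency(latencies: Dict[str, float], num_tiers: int) -> List[List[str]]:
    """Sort nodes by measured latency and cut the list into roughly equal tiers."""
    if num_tiers <= 0:
        raise ValueError("num_tiers must be positive")
    ordered = sorted(latencies, key=lambda name: latencies[name])
    if not ordered:
        return []
    size = -(-len(ordered) // num_tiers)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


# Hypothetical response latencies in seconds.
latencies = {"a": 0.8, "b": 0.1, "c": 2.5, "d": 0.3, "e": 1.2}
tiers = tier_by_latency(latencies, num_tiers=2)  # fastest tier first
```

Each tier can then be updated on its own schedule, so that slow devices no longer hold back the fast ones.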
In [78], a cascade training scheme with bottom and top subnetworks is proposed to fully exploit all horizontally partitioned labels. Specifically, the bottom subnetworks are responsible for extracting embedding vectors from features, while the top subnetworks are for prediction. The nodes in FL are classified into three types, including active party, passive party, and collaborator. Each active party is connected to other passive parties so that it is able to gather embedding vectors and return gradients to them. The collaborator is connected to all active parties in order to aggregate the returned gradients. The experiment results reveal that their scheme effectively addresses the straggler problem with minimum performance loss.
In [79], nodes are grouped based on data distributions and physical locations to reduce global model loss and communication delay. The authors designed a control algorithm that reduces communication costs while examining the convergence of the proposed scheme in IID settings. The superior accuracy and efficiency of their scheme are supported by the experiment results.
Apart from grouping appropriately, another solution is to adaptively modify the aggregation frequency among groups to minimize the loss of FL [80]. In an environment with limited resources, a dynamic trade-off between computation and communication cost is formulated as a Markov decision process (MDP) and optimized by deep reinforcement learning (DRL). Numerical results validate the accuracy, convergence, and energy-saving features of their proposed scheme.

Model Split
After splitting the deep neural network model, each node is responsible for a part of the model rather than the entire model. The model split strategy reduces the number of parameters that need to be transmitted at a particular time, resulting in improved communication efficiency. Intuitively, the computation cost for each node is also reduced, since fewer parameters are trained in each training round. However, when the model split strategy is adopted in the AFL architecture, nodes do not wait for other nodes; they fully utilize their computing resources to train the model for the next round, allowing the global model to converge more quickly. As a result, the computation cost per unit time does not decrease, but the convergence speed increases.
In [55], a layerwise asynchronous model updating strategy is proposed, in which shallow layer parameters are updated more frequently than deep layer parameters. When aggregating, the most recently updated local models have the highest weight with the help of timestamps. The experiment results support the improved communication cost and model accuracy of the proposed scheme. In [81], a similar idea is achieved by using cache and communication capabilities on UAVs and terrestrial base stations. The parameters in shallow layers are updated more frequently than those in deep layers. To predict the content caching placement, the proposed scheme employs a two-stage AFL algorithm. The efficiency of the proposed scheme is validated by experiments conducted on real-world datasets and numerical analysis.
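The layerwise idea can be made concrete with a minimal Python sketch. It is not the implementation of [55] or [81]: the two-group split of layer names, the `deep_every` period, and the exponential recency weighting with rate `decay` are all hypothetical choices made for illustration, and the surveyed schemes may use different schedules and weighting functions.

```python
import math
from typing import Dict, List


def layers_to_upload(round_idx: int, shallow: List[str], deep: List[str],
                     deep_every: int = 5) -> List[str]:
    """Shallow layers are sent every round; deep layers only every `deep_every` rounds."""
    if deep_every < 1:
        raise ValueError("deep_every must be >= 1")
    selected = list(shallow)
    if round_idx % deep_every == 0:
        selected += deep
    return selected


def recency_weights(timestamps: Dict[str, float], now: float,
                    decay: float = 0.5) -> Dict[str, float]:
    """Normalized aggregation weights that shrink exponentially with update age."""
    raw = {node: math.exp(-decay * (now - t)) for node, t in timestamps.items()}
    total = sum(raw.values())
    return {node: value / total for node, value in raw.items()}
```

With this schedule, a node uploads only the shallow group in most rounds, and the server gives the most recently timestamped local model the largest weight.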
Another approach is to divide the global model into branches according to the sample category [63]. The splitting process involves acquiring a branch from the entire model. The aggregation process is performed on branches with different weights dynamically adjusted by the aggregation server. Allowing nodes to select parts of the model according to local data distribution and update asynchronously reduces calculation and communication costs, enhancing FL efficiency.
Compared with the node selection strategy, the model split strategy reduces the computing resource requirements of nodes, updates different layers of the global model more flexibly, and mitigates the bias of the global model. However, it is difficult to extend to other models, as customized splitting and aggregating algorithms are necessary for each model on each dataset.

Data Heterogeneity
In practice, the data across nodes is usually non-independent and identically distributed (non-iid). Besides, the amount of data distributed on each node is always imbalanced. Therefore, the frequently-uploaded models on certain nodes are likely to cause the global model to diverge and overfit specific data.

Non-Independent and Identically Distributed Data
The non-iid data among nodes in AFL lead to a biased global model. There are mainly four research directions to address the challenges of non-iid data, including a constraint term for aggregation, clustered FL, a distributed validation strategy, and mathematically optimizing parameters.
In [75], a constraint term is presented to keep local updates close to the global model. Nodes with similar updating frequencies are grouped into the same tier through synchronous and asynchronous training strategies to prevent local models from diverging. The effectiveness of the scheme is supported by mathematical analysis and experiment results. Similar non-iid dataset settings are also applied in [76]. The authors propose a spectral clustering approach, in which nodes are grouped according to an affinity matrix derived from model update latency and direction. Experiments show that their scheme improves test accuracy and convergence speed and successfully addresses the effects of straggler devices. In [84], an AFL scheme is adopted to aid image-based geolocation when the data distribution is unbalanced and non-iid in a real-world scenario. To validate the robustness of the proposed AFL scheme in practice, the authors undertake experiments in which nodes join in the middle of the training process and train at varying speeds. Similarly, in [58], the authors adopt a wait-free computation and communication AFL scheme with a decay coefficient. By simulating varying network delays on heterogeneous edge devices on datasets with a non-iid setting, the authors demonstrate the robustness of their scheme. Experiments conducted on various real-world datasets reveal the competitive prediction accuracy and the rapid convergence speed of the proposed scheme. In terms of test accuracy, the authors in [85] propose a distributed validation scheme that evaluates model performance across nodes. A small percentage (5%) of local training data samples is reserved on each node to evaluate models from other nodes. As a result, a better generalized global model is obtained. Under both synchronous and asynchronous communication protocols, models trained on heterogeneous data and in heterogeneous compute environments demonstrate the superior performance of the proposed scheme.
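A constraint term of this kind is often realized as a proximal penalty on the distance between the local and global parameters. The sketch below illustrates that common form only; it is not claimed to be the exact formulation of [75], and the penalty strength `mu`, the learning rate `lr`, and the function name are hypothetical.

```python
from typing import List


def proximal_sgd_step(w: List[float], grad: List[float], w_global: List[float],
                      lr: float = 0.1, mu: float = 0.01) -> List[float]:
    """One SGD step on  loss(w) + (mu / 2) * ||w - w_global||^2.

    `grad` is the gradient of the plain local loss at `w`; the penalty adds
    mu * (w - w_global), which pulls the local model back toward the global one.
    """
    return [wi - lr * (gi + mu * (wi - wgi))
            for wi, gi, wgi in zip(w, grad, w_global)]
```

A larger `mu` keeps fast nodes from drifting far from the global model between uploads, at the price of slower adaptation to their local data.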
In [86], an AFL scheme is proposed, in which the number of local epochs is adjusted according to the evaluation of staleness to improve convergence speed. Experiments are conducted on both balanced and imbalanced cases with different proportions of stale nodes. The results show that their scheme is robust and converges fast. Another similar work [87] focuses on how staleness and data imbalance affect AFL by performing various levels of experiments. The results reveal that AFL works effectively on balanced data distribution when the server update frequency is unequal. Considering the effects of smooth strongly convex and smooth nonconvex functions when the data distribution is non-iid, the authors in [88] investigate the convergence theoretically and conduct several experiments. The results show that AFL has the same convergence rate as traditional FL while lowering communication requirements. By implementing the AFL scheme and conducting experiments on six Raspberry Pi 3B+ devices, the authors in [89] investigate the impact of heterogeneous devices. The results of experiments conducted on the MNIST dataset with non-iid data distribution reveal that AFL outperforms classic FL, especially when computing resources and input data sizes are disparate.
In [90], a training strategy with pre-determined initial weight parameters is proposed to mitigate the global model divergence. By using the Taylor expansion formula, higher-precision gradients are achieved in their AFL scheme, which is validated by experiments conducted across many real-world datasets. Another mathematical solution to the non-iid data problem is choosing optimal hyper-parameters for the novel two-stage training strategy in AFL [91]. Clustered FL is also a strategy to alleviate the effects of divergent data distributions by grouping training nodes. In [92], the geometric properties of the FL loss surface are used to group nodes into clusters. The quality of the clusters is guaranteed by mathematical analysis and validated by experiments. In [79], the data distribution on nodes in a group is optimized to be closer to the global data distribution.

Vertical Distributed Data
Vertical FL, unlike horizontal FL, deals with data in which different nodes hold distinct subsets of features, as explained in Section 2.1. Since the generation of the global model is dependent on the concatenation of local models, the update of local models needs to be collaborative. Such an imbalanced distribution of features and the increased model dependency pose challenges to AFL.
In [93], apart from the flexible FL algorithm that allows random client participation, the authors utilize a local embedding model for each client to convert raw input to compact features, reducing the communication parameters in AFL. The feasibility and effectiveness of the proposed scheme are confirmed by rigorous convergence analysis and numerical experiments on multiple datasets.
The authors in [24] propose an asynchronous federated stochastic gradient descent (AFSGD-VP) algorithm with two variance reduction variants: stochastic variance reduced gradient (SVRG) and SAGA [94]. The convergence rate of AFSGD-VP is derived for the case in which the objective function is strongly convex. Besides, security and complexity analyses of the proposed algorithm are provided. Experiment results on several vertically distributed datasets verify the theoretical analysis and demonstrate the efficiency of the proposed scheme.
A vertical AFL scheme with a backward updating mechanism and a bilevel asynchronous parallel architecture is proposed in [95]. Specifically, the backward updating mechanism enables all parties to update the model in a secure manner. The bilevel asynchronous parallel architecture improves the efficiency of the backward updating process. As the name implies, the parallel architecture is divided into two levels: the inter-party parallel between active parties and the intra-party parallel within each party. Both levels of the update are performed asynchronously to improve efficiency and scalability. The authors demonstrate the feasibility and security of the proposed strategy through theoretical and security analysis. Experiments on real financial datasets are conducted, whose results demonstrate efficiency, scalability, and losslessness.
In [67], the authors propose a vertical AFL scheme with gradient prediction and a double-end sparse compression algorithm. In particular, gradient prediction uses a second-order Taylor expansion to keep the updates of participants timely, reducing training time while retaining a sufficient degree of accuracy. The double-end sparse compression algorithm reduces the amount of data exchanged across the network during the training process. Experiment results obtained by training models on two public datasets reveal the superior efficiency of the scheme without degrading the accuracy and convergence speed.

Privacy and Security on Heterogeneous Devices
Although FL is introduced to protect the privacy of the local training data, new attacks on FL raise privacy concerns, such as the membership inference attack [96], property inference attack [97], model inversion attack [98], and deep leakage from gradients attack [83]. Several other attacks, like poisoning attacks or backdoor attacks, are also harmful to the global model and are the main security challenges for FL. There exist some solutions to the privacy and security problems in FL on heterogeneous devices, including the use of differential privacy and blockchain. Since AFL is a variant of FL, it is vulnerable to these attacks when training models on heterogeneous devices. However, the existing privacy and security solutions are too computing-intensive to be applied to heterogeneous devices and time-sensitive applications in AFL. Many papers take these challenges into consideration, introducing flexible differential-privacy models or highly efficient blockchain models to AFL. The papers with privacy and security protection are summarized and compared in Table 3.

Privacy on Heterogeneous Devices
Differential privacy is a promising approach that has been utilized in various AFL schemes on heterogeneous devices to protect local models, whose exposure may leak local training data.
In [93], a flexible FL scheme with differential privacy is proposed to avoid disclosing local training data. Each node employs Gaussian differential privacy to achieve a better trade-off between data privacy and data utility. In [99], the authors propose an AFL scheme adopting local differential privacy for secure resource sharing in vehicular networks. In particular, a tree-based gradient descent model is adopted on nodes to achieve high global model accuracy in a short amount of time. To protect the privacy of local models, a distributed local model updating approach with Gaussian noise is introduced to nodes in the regression tree. By offering rewards, nodes are encouraged to provide good models, thereby accelerating the convergence process. The experiments carried out on three real-world datasets demonstrate the high accuracy and efficiency of the scheme. Similarly, in [100], the convergence of AFL under differential privacy is analyzed. Based on the analysis, a multi-stage adjustable algorithm, referred to as MAPA, is proposed to optimize the trade-off between model utility and privacy by dynamically changing the noise size and the learning rate. Experiments are conducted on edge servers and a cloud server with three different ML models, including logistic regression (LR), support vector machine (SVM), and convolutional neural network (CNN). The results reveal that MAPA achieves high model utility and accuracy at the same time. Furthermore, in [101], differential privacy is introduced into AFL by adding Gaussian noise. The authors begin AFL training with a high learning rate and gradually reduce it to achieve optimum accuracy. The theoretical analysis and simulation results show that their scheme reduces the network communication cost on heterogeneous devices.
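The Gaussian-noise step shared by these schemes can be sketched as clipping followed by noise addition. This is a generic illustration of the Gaussian mechanism, not the procedure of any cited paper; `clip_norm`, `noise_multiplier`, and the function name are hypothetical, and calibrating the noise to a concrete privacy budget is deliberately left out.

```python
import math
import random
from typing import List


def clip_and_add_noise(update: List[float], clip_norm: float,
                       noise_multiplier: float, rng: random.Random) -> List[float]:
    """Clip a local update to L2 norm `clip_norm`, then add Gaussian noise.

    The noise standard deviation is noise_multiplier * clip_norm, so the noise
    scales with the largest influence a single clipped update can have.
    """
    norm = math.sqrt(sum(u * u for u in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    sigma = noise_multiplier * clip_norm
    return [u * scale + rng.gauss(0.0, sigma) for u in update]


# Usage: each node perturbs its update locally before uploading it.
noisy = clip_and_add_noise([3.0, 4.0], clip_norm=1.0,
                           noise_multiplier=1.1, rng=random.Random(0))
```

Raising `noise_multiplier` strengthens privacy but lowers the utility of the uploaded model, which is the trade-off that schemes such as [100] tune during training.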

Security on Heterogeneous Devices
When training ML models on heterogeneous devices, the blockchain is treated as a secure distributed database for securely storing or transmitting local models in AFL. To improve privacy protection and promote trust among heterogeneous devices, in [102], blockchain and a digital twin edge network are integrated to store all local gradient updates in AFL. Specifically, the blockchain is adopted to track the aggregation progress by maintaining a global iteration index in AFL. A lightweight DPoS-based verification mechanism is developed, where stakes are earned based on the computing contribution to the global model. The mechanism is accomplished through the verification algorithm, which verifies the quality of the models against the historical model. In addition, a reinforcement learning-based algorithm is designed for efficient user scheduling and bandwidth allocation. A series of experiments are conducted to evaluate the performance of the scheme in terms of learning accuracy and resource cost. Another similar idea proposed by the same authors in [103] is to integrate blockchain, FL, and an asynchronous model update scheme in digital twin edge networks. The objective of lowering communication costs includes two parts: reducing transmission data size and optimizing communication resource allocation. Finally, the communication resource allocation approach is implemented by using deep neural networks. Numerical results reveal that this scheme improves communication efficiency and reduces the cost of resources. The authors of [104] present an AFL scheme coupled with the Directed Acyclic Graph (DAG) blockchain for the Internet of Vehicles (IoV). The participating nodes are selected by Deep Reinforcement Learning (DRL) to improve the learning efficiency. Besides, a two-stage verification mechanism is developed, which comprises periodic validation of blockchain transactions and validation of local model quality. Experimental results show the excellent learning accuracy and rapid convergence speed of the scheme.
In order to mitigate the risk of single-point failure caused by the centralized aggregation server, a blockchain-based AFL scheme with a staleness coefficient is proposed in [27]. Specifically, the staleness coefficient reduces the contribution of lagging devices to the global model by comparing the version of the global model with that of the stale local model. The Proof-of-Work (PoW) consensus algorithm is adopted, where the miners are responsible for generating candidate blocks that include trained models. The block generation rate is positively correlated with the forking frequency. The experiments carried out on a variety of IoT devices demonstrate high accuracy on both horizontal and vertical FL frameworks. Another similar idea with a staleness coefficient is proposed in [107]. Instead of using PoW, a committee-based consensus algorithm is adopted to improve efficiency further. The convergence speed and model accuracy are both validated by experiments on heterogeneous devices.
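A staleness coefficient of this kind can be sketched as a weight that decays with the version gap between the global model and the local model. The polynomial decay below is one commonly used form and is an assumption here, not necessarily the function used in [27] or [107]; the mixing rate `alpha`, the exponent `a`, and the function names are hypothetical.

```python
from typing import List


def staleness_coefficient(global_version: int, local_version: int,
                          a: float = 0.5) -> float:
    """Return a factor in (0, 1] that shrinks as the local model falls behind."""
    staleness = global_version - local_version
    if staleness < 0:
        raise ValueError("local model cannot be newer than the global model")
    return (1.0 + staleness) ** (-a)


def async_merge(w_global: List[float], w_local: List[float],
                global_version: int, local_version: int,
                alpha: float = 0.6, a: float = 0.5) -> List[float]:
    """Mix one local model into the global model, discounted by its staleness."""
    alpha_t = alpha * staleness_coefficient(global_version, local_version, a)
    return [(1.0 - alpha_t) * g + alpha_t * l for g, l in zip(w_global, w_local)]
```

A local model trained on the current global version is mixed in with the full rate `alpha`, while a model that is three versions behind contributes only half as much when `a = 0.5`.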
From another aspect, the reputation of nodes is an important factor to be considered to improve the stability and security of AFL. The built-in reputation and reward systems of blockchain can be adopted for this purpose. In [105], a blockchain-based AFL scheme is proposed, where an entropy weight method determines the participant rank by the proportion of local models trained on nodes. The metrics are all maintained in the blockchain, including the training time, training sample size, local update correlation, and global update cheating times. The resource cost and training efficiency are well balanced by optimizing local training delays and the block generation rate. The experiment results show the superiority of the scheme in terms of efficiency and preventing poisoning attacks.
To mitigate the risk of single-point failure and attacks by malicious nodes, the authors in [106] propose a two-layer blockchain-driven FL framework composed of multiple Raft-based shard networks (layer-1) and a DAG-based main chain (layer-2). Layer-1 is a small group for information exchange, while layer-2 is responsible for storing and sharing models trained by layer-1 asynchronously. Furthermore, to avoid the impact of stale models, a virtual pruning procedure with a specific waiting time is presented. Models not approved by other models for a long time or with low accuracy will be pruned from the DAG blockchain. The experiment results show that this scheme is resilient against malicious nodes while maintaining acceptable convergence rates.
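The entropy weight method mentioned above is a standard multi-criteria weighting technique, and a generic version can be sketched as follows. This is not the exact ranking procedure of [105]: the sketch assumes every metric is positive and larger-is-better (a cost-type metric such as cheating times would have to be inverted first), and the function names are hypothetical.

```python
import math
from typing import List


def entropy_weights(metrics: List[List[float]]) -> List[float]:
    """Weight each metric (column) by how much it differs across nodes (rows).

    Assumes at least two nodes and positive, larger-is-better metric values.
    """
    n, m = len(metrics), len(metrics[0])
    if n < 2:
        raise ValueError("need at least two nodes")
    diversity = []
    for j in range(m):
        column = [row[j] for row in metrics]
        total = sum(column)
        probs = [x / total for x in column]
        # Normalized Shannon entropy in [0, 1]; 1 means the metric is uniform.
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(n)
        diversity.append(max(0.0, 1.0 - entropy))
    spread = sum(diversity)
    if spread == 0.0:
        return [1.0 / m] * m  # every metric is uniform: fall back to equal weights
    return [d / spread for d in diversity]


def rank_nodes(metrics: List[List[float]]) -> List[int]:
    """Return node indices sorted from highest to lowest weighted score."""
    weights = entropy_weights(metrics)
    totals = [sum(row[j] for row in metrics) for j in range(len(weights))]
    scores = [sum(w * row[j] / totals[j] for j, w in enumerate(weights))
              for row in metrics]
    return sorted(range(len(metrics)), key=lambda i: scores[i], reverse=True)
```

A metric on which all nodes report the same value receives zero weight, so the ranking is driven by the metrics that actually distinguish participants.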

Applications on Heterogeneous Devices
AFL is adopted in various application scenarios to provide an efficient and flexible training process on heterogeneous devices while maintaining the privacy of local training data. The application scenarios of AFL and related work are summarized and compared in Table 4.
Smart transportation is a suitable scenario for AFL, since its efficient utilization of computing resources bridges the gap between training delay and time-sensitive requirements to a certain extent. For instance, in [104], AFL is introduced to enhance the reliability and efficiency of data sharing among vehicles. The scheme is evaluated by experiments conducted in a vehicular network covered by one MBS and 10 RSUs. The results reveal that the DAG blockchain architecture in the scheme ensures both performance and security. Similarly, in [99,102], AFL is adopted in urban vehicular networks to allocate resources more efficiently and securely. The experiment results verify the effectiveness of their schemes in terms of distributed data sharing and resource caching in urban vehicular networks. In [108], a real-time end-to-end AFL scheme is applied in IoV and focuses on steering wheel angle prediction for autonomous driving. To conduct angle prediction, the authors utilize a two-stream deep CNN model with two separate neural branches that consume spatial and temporal information, respectively. To consume real-time streaming data, a sliding training window is introduced to reduce computation and communication latency. The experiments are carried out on real-world datasets, with the results showing that their scheme improves model prediction accuracy while reducing computation and communication latency. AFL is also adopted in cameras to monitor, predict, and adjust traffic by controlling signal lights in the smart transportation scenario [66,62]. By adjusting the hyper-parameter in the scheme, an optimal balance between the model accuracy and convergence speed is achieved.
In [109], AFL is adopted in unmanned aerial vehicle (UAV) networks. In order to improve the convergence speed and model accuracy, an actor-critic-based AFL scheme is proposed, including equipment selection, drone placement, resource management, local training, and global aggregation. Specifically, to prevent low-quality devices from compromising learning efficiency and model accuracy, a device selection strategy is proposed, in which nodes with high processing capability, communication capability, and model accuracy are selected. The selection problem is modeled as a Markov Decision Process and optimized through reinforcement learning. The scheme is evaluated by experiments, whose results show a higher learning accuracy and lower time cost. Similarly, in [81], an intelligent content caching system in UAV networks based on AFL is proposed to extend the service coverage and reduce the communication delay of the 6G network. In the scheme, UAVs collaborate to forecast where content caching should be placed by taking real-time traffic distribution into account. In [49], AFL is applied to mobile robots that collect real-time data and perform training in a distributed and resource-constrained environment to reduce communication costs. Experiments conducted on 12 mobile robots with limited resources demonstrate that the performance of the model is guaranteed by selecting competent and reliable mobile robots.
Fault diagnosis is another application scenario for AFL. For instance, in [110], AFL is utilized to identify the local modes in real time. To completely track actual system changes in real time and increase the diagnostic rate of the nodes, each node turns private data into local models using an Extended Kalman Filter before transmitting. A sequential filter approach based on the Sequential Kalman Filter is adopted to perform the asynchronous aggregation of uploaded local models. Experiments conducted on fault datasets collected in the real world demonstrate high accuracy compared with benchmarks. Similarly, in order to improve model accuracy and convergence speed in anomaly detection while preserving privacy, the authors in [90] train the denoising autoencoder model based on labeled benign samples in AFL. The asynchronous update strategy improves the accuracy and stability of the model by reducing the impact of stragglers. In [111], the authors adopt AFL in the sensitive code review field to address privacy concerns. A prototype is developed and tested on leaks gathered from the code-sharing platform GitHub. When compared with local and centralized training, the proposed scheme improves model accuracy while preserving the privacy of the local training data. Another AFL-based fault diagnosis scheme proposed in [63] allows nodes to adaptively select branches of the model for further training according to their local datasets. Their scheme creates an effective diagnostic model for detecting potential defects while reducing resource requirements and communication overhead. Experiments conducted across heterogeneous devices verify the feasibility of their scheme.
The AFL scheme is also applied to IIoT environments for real-time analysis and decision-making. For example, in [80], the authors break down the barriers between data islands in IIoT with the help of AFL. In their scheme, the effect of straggler devices is mitigated by adaptively adjusting the aggregation frequency. The experiment results validate the feasibility and efficiency of their scheme. Similarly, in [103], AFL is utilized to preserve data privacy and improve the quality of services in IIoT. Besides, by adopting digital-twin technology, the real-time interaction requirements of Industry 4.0 are fulfilled. The communication cost is optimized, as evidenced by the experiment results.
In [112], an AFL scheme is designed to detect and handle the data distribution changes (concept drift) across edge devices. Specifically, the proposed scheme improves the predictive performance of the worst 20% of devices while also maintaining the best test performance for the top 20% of devices.
In [84], the authors apply AFL to the image-based geolocation service for end-to-end localization. AFL improves the accuracy of prediction of the position and orientation of the camera while preserving the privacy of user local training data and mitigating the effects of straggler devices. Experiments conducted on the CNN model across several datasets validate the feasibility.

Research Challenges and Future Directions
AFL is a trending research topic, and recent works have revealed research challenges that need to be solved from the aspects of device heterogeneity, data heterogeneity, privacy and security on heterogeneous devices, as well as applications on heterogeneous devices. To deal with these challenges, potentially promising research directions are identified and illustrated in this section.

Device Heterogeneity

Optimization towards a balance between performance improvement and time consumption
As summarized in Section 3, for AFL, the existing performance improvement strategies on heterogeneous devices, such as node selection, weighted aggregation, and clustered FL, are effective in various ways. Some of the works even adopt several strategies at the same time to improve the efficiency of AFL [55,19,63]. However, utilizing too many strategies in AFL results in a decline in efficiency to a certain extent. For instance, if selecting a range of nodes and then compressing the gradients on resource-limited devices takes longer than uploading local gradients, it is preferable to skip one of them. So far, there has been no comprehensive analysis of the balance between multiple performance improvement strategies and time consumption, which is a potential research direction. To derive the optimized trade-off, it is possible to establish a dynamic game model based on a Markov Decision Process, which can adapt to various scenarios based on the constraints. Moreover, other lightweight convex optimization methods can be considered, such as quadratic minimization with convex quadratic constraints, semidefinite programming, and convex quadratic minimization with linear constraints.

Optimization towards a generalized AFL solution
Usually, different performance improvement strategies have different application scenarios. For example, when the disparity in computing capabilities between heterogeneous devices is extremely high, semi-asynchronous FL with suitable weighted aggregation strategies could be an optimal solution. If the dataset distribution is IID across nodes, the local models from fast nodes should be selected and compressed, while those from straggler devices should be discarded. The local models deserve higher weight if they bring a positive effect to the global model. Therefore, designing a generalized and flexible optimization framework for AFL for diverse application scenarios is a viable research field. Ideally, such a framework would integrate existing and future techniques with minimal modification.

Optimization towards dynamic resource allocation for AFL
Intuitively, AFL requires more communication resources than classic FL because it performs more global model aggregation operations. Therefore, it is worth considering dynamic resource allocation algorithms, covering transmit power, computation frequency for model training, and model selection strategy, to maximize the long-term time average (LTA) training data size under an LTA energy consumption constraint. Specifically, a possible solution is to first define the Lyapunov drift by converting the LTA energy consumption to a queue stability constraint. Then, a Lyapunov drift-plus-penalty ratio function can be constructed to decouple the original stochastic problem into multiple deterministic optimizations along the timeline. The construction is capable of dealing with uneven durations of communication rounds. To make the one-shot deterministic optimization problem of combinatorial fractional form tractable, the fractional problem is reformulated into a subtractive-form one by the Dinkelbach method, which leads to the asymptotically optimal solution in an iterative way. By doing so, there is a potential for both higher learning accuracy and faster convergence with limited time and energy consumption.
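The subtractive-form reformulation can be illustrated with a minimal Python sketch of the Dinkelbach iteration. It maximizes a ratio f(x)/g(x) over a small finite set of candidate decisions; the candidate set and the functions `f` and `g` in the usage example are hypothetical stand-ins for a per-round utility and a positive cost, not the actual drift-plus-penalty ratio described above.

```python
from typing import Callable, Iterable, Tuple, TypeVar

T = TypeVar("T")


def dinkelbach(candidates: Iterable[T], f: Callable[[T], float],
               g: Callable[[T], float], tol: float = 1e-9,
               max_iter: int = 100) -> Tuple[T, float]:
    """Maximize f(x) / g(x) over a finite candidate set, assuming g(x) > 0."""
    pool = list(candidates)
    best = pool[0]
    lam = f(best) / g(best)  # start from the ratio of any feasible point
    for _ in range(max_iter):
        # Subtractive-form subproblem: maximize f(x) - lam * g(x).
        best = max(pool, key=lambda x: f(x) - lam * g(x))
        gap = f(best) - lam * g(best)
        if gap < tol:  # lam already equals the best achievable ratio
            break
        lam = f(best) / g(best)
    return best, lam


# Usage with toy functions: utility 1 + x against cost 1 + x^2.
grid = [i / 2 for i in range(11)]
x_star, ratio = dinkelbach(grid, lambda x: 1 + x, lambda x: 1 + x * x)
```

Each iteration solves a problem without a fraction and then updates the ratio estimate `lam`, which is non-decreasing across iterations and stops changing once the subproblem value reaches zero.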

Data Heterogeneity

Optimization towards heterogeneous data distribution
Since the data distribution across nodes is usually non-IID in the real world [20], it is meaningful to obtain a generalized model while maintaining the accuracy on each local dataset in AFL. There are several solutions for non-IID data challenges in classic FL, such as localized independent training [113], personalized local model training [23], and cluster training [114]. However, it is hard to transplant these solutions into AFL, since the global model tends to converge toward nodes with higher model upload frequency (i.e., fast nodes), which results in a biased global model. Such a biased global model decreases the effects of localized independent training and personalized local model training to a certain extent. Although cluster training has been utilized in several AFL schemes, it is hard to arrange a general cluster strategy for all application scenarios. For instance, a location-based cluster strategy is not ideal for traffic prediction among smart vehicles with non-IID datasets due to the randomness of vehicle movement. Cluster training based on data distribution similarity is a potential research topic [79], but it requires the development of an appropriate similarity evaluation algorithm. Besides, asynchronous personalized local model training based on transfer learning or meta-learning is potentially an effective and accurate solution.

Optimization towards heterogeneous data size
Dataset sizes among nodes are usually unequal since each node gathers its own local data independently in most AFL application scenarios. Even if all nodes have identical computing resources, the imbalanced datasets across nodes lead to varying update frequencies of local models. A weighted aggregation strategy based on local dataset size, as in [62], is a possible solution, but verifying the validity of the dataset size reported by each node is a further security problem. The smart contract in the blockchain offers self-verifying and self-executing capabilities [29], alleviating data fraud in AFL to some extent at the cost of low efficiency.
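Weighting by local dataset size can be sketched in a few lines of Python. This is the generic size-proportional average rather than the specific rule of [62]; the function name is hypothetical, and the sketch simply trusts the reported sizes, which is exactly the verification gap noted above.

```python
from typing import List


def size_weighted_average(models: List[List[float]],
                          sizes: List[int]) -> List[float]:
    """Average local models with weights proportional to their dataset sizes."""
    if len(models) != len(sizes) or not models:
        raise ValueError("need one dataset size per model")
    total = sum(sizes)
    dim = len(models[0])
    return [sum(n * model[k] for model, n in zip(models, sizes)) / total
            for k in range(dim)]
```

For example, a node reporting three times as many samples as another pulls the aggregate three times as strongly, so an inflated size report directly biases the global model.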

Optimization towards vertical data distribution
Vertical data distribution is prevalent in economic scenarios, where each node possesses different feature sets of the dataset. In a heterogeneous computing environment, the lack of local models uploaded from some straggler nodes causes the global model to be biased and unable to predict certain features, rather than merely less accurate as in horizontal FL. Therefore, the lagging local models cannot be ignored in vertical AFL. So far, relatively few studies have been conducted in the vertical AFL area [78,93,24,95,67] compared with horizontal AFL. Moreover, none of these works analyzes the effects of extreme stragglers caused by limited computing or communication resources. A possible research direction is semi-asynchronous FL. With a server-side cache, it is possible to store stale local models and increase their effectiveness while keeping other local models up to date. Besides, another potential research direction is model split, which splits the global model according to the feature distribution on nodes and transforms the problem into clustered horizontal AFL, as in [78]. However, without knowing the local dataset on each node, it is hard to identify the distribution of features among nodes from the local models.

Privacy and Security on Heterogeneous Devices

Optimization towards privacy protection using differential privacy
Differential privacy protects AFL against a variety of privacy attacks, including background knowledge, collusion, and inference attacks. However, because it lowers the utility of local models, differential privacy also causes global model accuracy to decline. Therefore, the trade-off between privacy and utility is hard to achieve in AFL. There are several strategies for optimizing the trade-off in classic differential privacy, such as Static Bayesian Games [115], Markov Decision Processes [7], and Generative Adversarial Nets [116]. However, in AFL, the publishing process for local models is dynamic and distributed, which is hard to balance through a trusted third party. Local differential privacy (LDP) is an approach in which users randomly perturb their inputs without the need for a trusted party, and it is treated as a solution for the privacy issues in FL [42]. Nevertheless, the trade-off between the privacy of local models and the utility of the global model remains hard to achieve. In AFL especially, it necessitates asynchronous macro control of LDP in a distributed manner. The smart contract in the blockchain is a viable approach for managing LDP in a distributed manner. However, an asynchronous consensus algorithm needs to be designed to balance the privacy of local models and the utility of the global model.

Optimization towards security enhancement using blockchain
Blockchain can be utilized to address security challenges in AFL, such as the single point of failure, Byzantine attacks, and poisoning attacks. At the same time, blockchain reduces the efficiency of AFL to some extent due to its low communication efficiency and high consumption of computing resources [23]. Therefore, the trade-off between security and efficiency in blockchain-based AFL is also challenging. To improve the scalability of blockchain, several improved consensus algorithms have been designed to replace PoW, such as Proof-of-Stake (PoS) [29], Proof-of-Reputation (PoR) [117], PBFT, and RAFT [118]. In general, however, the higher the performance of a consensus algorithm, the lower its security level. For instance, compared with PBFT, RAFT is not resistant to Byzantine attacks but offers higher data throughput. A promising solution is to develop a consensus algorithm that is both efficient and secure. For example, Algorand [119] is a Byzantine-tolerant consensus algorithm with excellent scalability that maintains a high level of security. A committee is randomly selected in each iteration to verify and ensure the security of the transactions. During training, the committee members verify the local models in each iteration. However, it is difficult to select committees asynchronously without compromising security. In this situation, it is possible to separate the consensus process from the training process with a tailor-made blockchain structure that records the training models periodically.
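The per-iteration committee idea can be sketched as follows. This is a deliberately simplified stand-in: it ranks nodes by a hash of a shared round seed, whereas Algorand itself relies on cryptographic sortition with verifiable random functions. The function names, the seed format, and the two-thirds acceptance threshold are assumptions made for illustration.

```python
import hashlib
from typing import Dict, List


def select_committee(node_ids: List[str], round_seed: str, size: int) -> List[str]:
    """Rank nodes by SHA-256(seed:node_id) and keep the `size` lowest digests."""
    ranked = sorted(
        node_ids,
        key=lambda nid: hashlib.sha256(f"{round_seed}:{nid}".encode()).hexdigest(),
    )
    return ranked[:size]


def committee_accepts(votes: Dict[str, bool]) -> bool:
    """Accept a local model when more than two thirds of the votes are yes."""
    yes = sum(1 for v in votes.values() if v)
    return 3 * yes > 2 * len(votes)
```

Every node that knows the seed can recompute the same committee, so no coordinator is needed. The difficulty noted above remains: in an asynchronous setting, nodes may disagree on which round, and therefore which seed, is current.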
7.3.3. Optimization towards security enhancement using lightweight distributed cryptography
Another research direction is to apply lightweight distributed cryptography to AFL to enhance security. Traditional cryptographic approaches, such as public-key encryption [120], homomorphic encryption [121], and attribute-based encryption [122], have several limitations in this setting. Public-key encryption and homomorphic encryption are resource-intensive and unsuitable for the resource-limited devices in AFL. Attribute-based encryption is not flexible enough because it requires a trusted third-party authority. A possible research area is to design a flexible and efficient attribute-based encryption algorithm for AFL with dynamic attribute adjustment that allows participants to join or leave freely.

Applications on Heterogeneous Devices

7.4.1. Expanding the range of real-world applications
As summarized in Section 6, AFL currently has only a few real-world application scenarios, including IoV, fault diagnosis, and IIoT. Compared with synchronous FL, AFL is more efficient and better suited to time-sensitive scenarios with limited computing resources. Therefore, a possible applied research area is to deploy dedicated AFL systems in a wider range of real-world scenarios. For example, in a smart hospital, ML models trained by AFL on electronic healthcare records (EHR) could predict the condition of patients. In a smart grid, AFL could train ML models on heterogeneous devices to anticipate energy consumption in different areas and enable smart power dispatch. In a smart farm, IoT devices could monitor, diagnose, and predict plant growth with the help of ML models trained by AFL.

Real-world evaluation testbeds
As summarized in Table 2, most AFL experiments are conducted in simulation, without demonstrating the feasibility of AFL in the real world. More experiments should be conducted on IoT or edge devices to evaluate the efficiency, security, and privacy of AFL schemes. Thus, a promising research direction is to develop scalable and flexible testbeds that are deployed on heterogeneous devices and accessible through a standardized interface. Developing such testbeds involves issues of architecture design, inclusiveness of heterogeneous devices, and structure-wise fine-tuning of efficiency and performance, among others.

Conclusion
AFL has been attracting increasing attention due to its multiple advantageous features. To mitigate the drawbacks of existing works, this survey primarily studies three fundamental challenges in AFL: device heterogeneity, data heterogeneity, and privacy and security issues on heterogeneous devices. Based on an in-depth exploration of state-of-the-art research, it also summarizes the application scenarios of AFL that could increase its impact and adoption on heterogeneous devices. It is pleasing to observe that the number of novel AFL schemes grows by the month. Even so, the proposed classification is believed to be comprehensive enough that new schemes can be appended and categorized accordingly. This survey offers clear insights into the landscape of AFL from a new perspective. It is intended to help the community by identifying potentially promising directions and simplifying future designs, including but not limited to motivating coherent combinations of techniques uncovered by the proposed categorization and analysis.