The Internet of Federated Things (IoFT)

The Internet of Things (IoT) is on the verge of a major paradigm shift. In the IoT system of the future, IoFT, the “cloud” will be substituted by the “crowd”, where model training is brought to the edge, allowing IoT devices to collaboratively extract knowledge and build smart analytics/models while keeping their personal data stored locally. This paradigm shift was set into motion by the tremendous increase in computational power on IoT devices and by recent advances in decentralized and privacy-preserving model training, coined federated learning (FL). This article provides a vision for IoFT and a systematic overview of current efforts towards realizing this vision. Specifically, we first introduce the defining characteristics of IoFT and discuss FL data-driven approaches, opportunities, and challenges that allow decentralized inference within three dimensions: (i) a global model that maximizes utility across all IoT devices, (ii) a personalized model that borrows strength across all devices yet retains its own model, and (iii) a meta-learning model that quickly adapts to new devices or learning tasks. We end by describing the vision and challenges of IoFT in reshaping different industries through the lens of domain experts. Those industries include manufacturing, transportation, energy, healthcare, quality & reliability, business, and computing.

At the early stages of the COVID-19 pandemic, companies that mass produce personal protective equipment (PPE) required long ramp-up times to fulfill the urgent demand [79,77]. The ramp-up took longer than expected as supply chains across the globe were critically disrupted, with entire countries in lockdown and essential workers succumbing to the virus [52]. Realizing this, many citizens and small businesses tried to bridge the supply gap using readily available and low-cost 3D printers [58,53]. This attempt at so-called massively distributed manufacturing [219] helped fill PPE production gaps to some extent [58,53]. However, it also revealed critical impediments to realizing massively distributed manufacturing in terms of standardizing production requirements, guaranteeing quality and reliability, and attaining high production efficiencies that can rival those of mass production [219]. For example, a large percentage of parts printed by citizens did not meet the quality requirements [112,276]. Even when following standard 3D printing guidelines, several prints failed [275] while others experienced recurrent defects due to the use of models or methods that did not account for the specific environment in which the 3D printer was operating [256]. On the other hand, citizens who succeeded struggled to effectively broadcast their improved models or methods to other users to help improve quality across the network of manufacturers [66]. Now imagine an alternative scenario based on a cyber-physical operating system for massively distributed manufacturing. All 3D printers are IoT-enabled through Wi-Fi, smart sensors, or AI chips (many 3D printers nowadays have such capabilities [15,220]). The printers collaboratively learn a model for 3D printing PPE accurately, with the help of a central server guiding the production to the desired quality level.
To preserve privacy and intellectual property, raw data from each 3D printer is never shared with the central server; instead, only the focused updates needed to learn the model are transferred. This model, despite having a global state, is personalized to form a local model that accounts for individual-level external factors affecting each 3D printer. In this alternative reality, responders can 3D print PPE at the desired quality level with little or no defects. In addition, with their personalized 3D printing models, the responders are able to push 3D printers at faster speeds to shorten printing time while maintaining quality [201,70,220]. Accordingly, the PPE supply gap is successfully filled until mass production ramps up.
In this future, not only manufacturing benefits. Take smart phones as an example. Instead of personal data being uploaded to the cloud to learn, say, an image classification model, the "cloud" is replaced by the "crowd": each smart phone stores the necessary data, calculates an update to the global model based on its local data, and sends only that update to the central authority. This decouples the ability to learn the model from storing data in the cloud by bringing training to the device as well, so that a model can be learned across thousands of millions of smart phones in geographically dispersed locations.
Let us now switch paradigms and replace smart devices with "smart" institutes. Different medical institutions can join efforts and collaboratively learn diagnostic models without directly sharing their electronic health records, as required by the Health Insurance Portability and Accountability Act (HIPAA). Diagnostic models can then leverage largely diverse datasets and promote fairness through a decentralized learning framework that mitigates the privacy risks and costs associated with centralized modeling. Learning can be done across institutes and individuals at multiple scales, and in areas where this has not been possible or allowed before.
The future described above is not far off. It has already been set into action as the immediate, yet bold, next step for the Internet of Things (IoT). It is the culmination of Industry 4.0: a culmination of advances in interdisciplinary fields over the past two decades, ranging from data science, machine learning, operations research, optimization, data acquisition technologies, and physics-guided modeling to privacy, amongst many others.
In this article, we term this future of IoT the Internet of Federated Things (IoFT). The term "federated" refers to some level of internal autonomy of IoT devices and is inspired by the explosive interest during the past two years in Federated Learning (FL): an approach that allows decentralized and privacy-preserving training of models [194]. With the help of FL, the decentralized paradigm of IoFT allows devices to collaboratively extract knowledge and build smart analytics/models while keeping their personal data stored locally. This paradigm shift not only reduces privacy concerns but also sets forth many intrinsic advantages, including cost efficiency, diversity, and reduced computation, amongst many others to be detailed in the following sections.

B. PURPOSE AND UNIQUENESS
This paper is a joint effort of researchers across a wide variety of expertise with the purpose of addressing the three questions below:
1) What are the key defining characteristics of IoFT?
2) What are the key recent advances and possible data-driven methods in IoFT that allow learning in one of the three dimensions stated below? And what modeling, optimization, and statistical challenges do they face?
• A global model that maximizes utility across all devices. The global model aims at capturing the commonalities and intrinsic relatedness across data from all devices to improve prediction and learning accuracy.
• A personalized model that adapts the global model to the data and external conditions of each device. This embodies the principle of multi-task learning [224], where each device retains its own model while borrowing strength across all IoFT devices.
• A meta-learning model that learns a global model which can quickly adapt to a new task with only a small amount of training samples and learning steps. This embodies the principle of "learning to learn fast" [272], where the goal of the global model is not to perform well on all tasks in expectation, but rather to find a good initialization that can directly adapt to a specific task.
3) How will IoFT shape different industries, and what domain-specific challenges must it overcome to become standard practice? Through the lens of domain experts, we shed light on the following sectors: manufacturing, transportation, energy, healthcare, quality & reliability, business, and computing.
Besides defining the key characteristics of IoFT, our paper's focus is twofold. The first is data-driven modeling, where we categorize FL approaches in IoFT into learning a global, personalized, and meta-learning model and then highlight recent advances, possible alternative models, and statistical/optimization challenges. We should note that our intent is not to survey all possible models, but rather to shed light on key recent advances and potential alternatives. The second focus is a vision of IoFT's potential use cases, modeling approaches, and obstacles within the aforementioned domains. Our overarching goal is to encourage researchers across different industries to explore the transformation from IoT to IoFT so that the critical societal impacts brought by this emerging technology can be fully realized.
We note here that many excellent surveys on FL have been recently released. Most notably, Lim et al. [165] address FL challenges in mobile edge networks with a focus on communication cost, privacy, and security; Niknam et al. [213] discuss FL applications in wireless communications, especially under 5G networks; Li et al. [159] provide a thorough overview of implementation challenges in FL; Yang et al. [309] categorize different architectures for FL; Rahman et al. [239] discuss the evolution of deployment architectures with an in-depth discussion of privacy and security, while Aledhari et al. [4] highlight the protocols and platforms needed for such architectures; and Kairouz et al. [121] study open problems in FL and recent initiatives while providing a remarkable survey of privacy-preserving mechanisms. Along this line, Lyu et al. [178] highlight threats and major attacks in FL. While our focus is on data-driven modeling for IoFT and how the aforementioned fields will be affected by the shift from IoT to IoFT, the surveys above serve as excellent complementary work for a bird's eye view of FL and hence IoFT.
The remainder of this paper is organized as follows. Sec. II highlights the past, present, and future features of IoT-enabled systems leading to IoFT. Secs. III-V provide data-driven modeling approaches for learning a global, personalized, and meta-learning model, respectively. Sec. VI addresses central statistical and optimization progress on FL along with future possibilities and challenges. Sec. VII provides a vision for IoFT within manufacturing, transportation, energy, healthcare, quality & reliability, business, and computing. Throughout this paper, IoFT is used to indicate the IoT system of the future. Also, edge device, local device, node, user, and client are used interchangeably to denote the end user, based on the problem context.

C. IoFT WEBSITE AND CENTRAL DIRECTORY
While exploring data-driven modeling approaches to FL in IoFT, it became clear that real-life datasets (in engineering, health sciences, etc.) are pressingly needed to fully explore the disruptive potential of IoFT. While a few datasets already exist, most are based on artificial examples, and the few non-artificial datasets are mostly focused on mobile applications. However, for IoFT to become the norm in different industries, real-life datasets with the defining features of the underlying system are needed to unveil the potential challenges and opportunities faced within different domains. Towards this end, this paper features a supplementary website (https://ioft-data.engin.umich.edu/) managed by the University of Michigan. The website will serve as a central directory for FL-based datasets and will feature brief descriptions of each dataset, categorized by its respective field, with a link to the repository (research lab website, GitHub account, papers, etc.) where the data is contained. Our hope is to encourage researchers to develop real-life datasets for IoFT and to help with the outreach and visibility of their datasets and corresponding papers.
IoT-enabled systems, including the examples described in Sec. I-A, possess three defining characteristics: tangible physical components that comprise the system, connectivity among components that enables data acquisition and sharing, and data analytics and decision-making capabilities that transform a merely "connected" system into a "smart and connected" system. These defining features of IoT-enabled systems [235,192,37] are shown in Fig. 1. IoT has brought broad disruptive societal impacts, particularly on economic competitiveness, quality of life, public health, and essential infrastructure [180]. Companies around the globe have invested heavily in IoT, including Google's Cloud IoT [90], Samsung's Active wearable device [258], Amazon's Web Services solutions [10], Rockwell's Connected Enterprise [252], and Welbilt's smart home appliances, to name a few. The value at stake is more than 15 trillion dollars, a number expected to triple in the next decade [8].
The essential feature of an IoT system is that data from multiple similar units, and from multiple components within the system, are collected during operation, often in real time. Since we have observations from a potentially very large number of similar units, we can compare their operations, share the information, and extract knowledge to enable accurate prediction and control. One can argue that such a notion of IoT dates back to long before the Industrial Revolution, to the time when artisans producing crafts in geographically close locations used to gather to share knowledge and perfect/standardize the quality of their crafted product [274]. A lot has changed since then.
The Industrial Revolution brought rapid advances in connectivity, automation, data science, and cloud-based systems, among many others [198,149]. This ushered in the present-day era of Industry 4.0, characterized by IoT-enabled systems [8]. The average sensor price dropped to $0.48 in 2018, connectivity and wide-area communication became readily available with an expected 36.13 billion connected IoT devices [150], distributed computing allowed handling larger datasets than was previously thought possible, and cloud-based solutions for data storage and processing became widely available for commercial use (e.g., Amazon's AWS [9] or Microsoft's Azure [11]). In this present era, a typical IoT-enabled system structure is shown in Fig. 2. Take, for example, GM's OnStar® teleservice system, first piloted in 2010 [221,82]. Cars enrolled in this service have their data, in the form of condition monitoring signals, uploaded to the cloud at regular intervals. The cloud then acts as a back office or data center that processes the data in order to keep drivers informed about the health of their vehicle. In the cloud, GM fits models that can do process monitoring and prediction of maintenance needs, amongst others. The data is also used to cross-validate the behavior of their learned models for continuous improvement. When the need arises, service alerts are sent to drivers via phone notifications.
Much like other IoT giants such as Google, Amazon, and Facebook, GM has long adopted this centralized approach towards IoT: (i) gigantic amounts of data are uploaded and stored in the cloud; (ii) models (such as predictive maintenance, diagnostics, and text prediction) are trained in these data centers; and (iii) the models are then deployed to the edge devices.
Here one should note that distributed learning is often implemented in centralized systems to alleviate the huge computational burden via parallelization. In such systems, the clients are computing nodes within this centralized framework. Nodes can then access any part of the dataset, as data partitions can be continuously adjusted. In contrast, and as described in the following sections, in IoFT the data lies at the edge and is not centrally stored. As a result, data partitions are fixed and can be neither shuffled nor randomized.

B. IoT: THE FUTURE
With the tremendous increase in computational power on edge devices, IoT is on its way to moving from the cloud/data center to the edge device, hence the aforementioned notion of substituting the "cloud" by the "crowd". In this IoT system of the future (IoFT), devices collaboratively extract knowledge from each other and achieve the "smart" component of IoT, often with the orchestration of a central server, while keeping their personal data stored locally. This paradigm shift is based on one simple yet powerful idea: instead of learning models on the cloud, clients execute small computations locally and then share only the minimum information needed to learn the model. As a result, IoFT decouples the ability to do analytics from storing data in the cloud by bringing training to the device as well. The underlying premise is that IoFT devices have computational (e.g., AI chips) and communication (e.g., Wi-Fi) capabilities. Let us start with a simple example: assume the central orchestrator in Fig. 3 wants to learn the mean ȳ of a feature y over all clients. To calculate ȳ, client i only needs to share its own mean ȳ_i and not its entire feature vector y_i; ȳ_i is a sufficient statistic for learning ȳ. In reality, models are often more complicated and require multiple communications between the central orchestrator and the clients. For instance, and without loss of generality, assume that IoFT devices cooperate to learn a deep learning model by borrowing strength from each other, rather than using their own knowledge in isolation. In the decentralized realm of IoFT, model learning is often administered by a central orchestrator and follows the cycle shown in Fig. 3: (i) the orchestrator (i.e., the central server) selects a set of IoFT devices meeting certain eligibility requirements and broadcasts the model to the selected clients. This model contains the neural network (NN) architecture, the current weights, and a training program.
(ii) The IoFT devices perform local computations by executing the program on their local data, and each client reports its focused update to the orchestrator. Here the program could be running stochastic gradient descent (SGD) on local data, and the focused update would be the updated weights or a gradient. It is worth noting that at this stage the client might choose to encrypt the focused update or add noise to it for enhanced privacy. (iii) The central orchestrator collects the focused updates from the clients and aggregates them to update the global model. (iv) This procedure is then iterated over several rounds until a stopping criterion, such as validation accuracy, is met. Through this process the global model can account for knowledge from the IoFT clients, and each client can indirectly make use of the knowledge of other clients. Finally, the learned global model goes through a testing phase, such as quality/A-B testing on held-out devices, and then a staged rollout to a gradually increasing number of devices.
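As a concrete illustration of steps (i)-(iv), the sketch below runs FedAvg-style training on a toy linear model. The data, client setup, and hyperparameters are hypothetical assumptions for illustration, and full-batch gradient steps stand in for local SGD; it is a minimal sketch, not a production FL system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each of 10 clients holds a private dataset (X, y)
# generated from a shared linear model plus noise.
w_true = np.array([2.0, -1.0])
clients = []
for _ in range(10):
    X = rng.normal(size=(50, 2))
    y = X @ w_true + 0.1 * rng.normal(size=50)
    clients.append((X, y))

def local_update(w, X, y, lr=0.05, epochs=5):
    """Run a few local gradient steps (standing in for local SGD)."""
    w = w.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)   # mean-squared-error gradient
        w -= lr * grad
    return w

# FedAvg-style rounds: (i) broadcast, (ii) local training,
# (iii) aggregation weighted by local dataset size, (iv) repeat.
w_global = np.zeros(2)
for _ in range(20):
    updates, sizes = [], []
    for X, y in clients:
        updates.append(local_update(w_global, X, y))  # focused update only
        sizes.append(len(y))
    w_global = np.average(updates, axis=0, weights=sizes)

print(w_global)  # recovers approximately w_true = [2.0, -1.0]
```

Note that only the locally updated weights (the focused updates) ever reach the server; the raw arrays X and y never do, mirroring the sufficient-statistic example above.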
This decentralized paradigm shift sets forth many intrinsic advantages, including:
• Privacy: By bringing training to the device itself, users no longer have to share their valuable information; instead, it is kept local and never shared. Additionally, unlike centralized systems where each device has an identifier, in such a framework devices cannot be indexed by their focused updates.
• Autonomy: IoFT devices have the ability to remain under independent control and to opt out of the collaborative training process at any time. Yet with the enhanced privacy in IoFT, clients will be more inclined to collaborate and build better models.
• Computation: As the number of IoT devices skyrockets, the computational and storage needs accumulated from these devices (say smart phones) are far beyond what any data center or cloud computing system can handle [326]. Instead, by exploiting compute and storage capacity at the edge, massive parallelization (up to 10^10 devices) becomes a reality [268,110].
• Cost: Focused updates embody the principle of data minimization and contain the minimum information needed for a specific learning task. As a result, less information is transmitted to the orchestrator, which reduces communication cost and efficiently utilizes network bandwidth. Also, compute power at the edge device is now utilized, hence the storage and computational needs of the orchestrator are minimal. This is in contrast to distributed systems, where massive utilization and synchronization of GPUs and CPUs is needed.
• Fast Alerts and Decisions: In IoFT, upon deployment of the final model to clients, real-time decisions or service alerts are made locally at the edge. In contrast, cloud-based systems incur a lag in deployment, as decisions made in the cloud need to be transmitted to the clients (as shown in Fig. 2).
• Minimal Infrastructure: With the increase in compute power of IoT devices and the wide availability of AI chips [300], minimal hardware is required to achieve the transition to IoFT.
• Diversity and Fairness: IoFT allows integrating information across uniquely diverse datasets, some of which could not be shared previously (recall the medical institutes in Sec. I-A). This diversity, and the ability to learn across geographically dispersed locations, promotes fairness by combining data across boundaries [26,35].
Having recently realized its disruptive potential relative to traditional IoT, industries are eagerly trying to use FL in their operating systems and production to reap its numerous advantages. However, these efforts are still in their infancy, awaiting broad implementation. Google pioneered some of the first FL applications, utilizing it in their mobile keyboard "Gboard" [97,39,312,240] and Android messaging system [89] to improve next-word prediction while preserving privacy. Additionally, they introduced a decentralized framework to update Android models on their Pixel phones [193]. In this framework, each Android phone updates its model parameters locally and sends the updated parameters to the Android cloud, which trains its central model from the aggregated parameters. Big Tech giants have since started to catch up and utilize FL in their systems. Most notably, Apple adopted FL in their QuickType keyboard, Siri, and privacy protection protocols [20,6], as did Microsoft for their devices' telemetry data [63]. Further, FL has seen some application in optimizing mobile edge computing and communication [290,165], computational offloading [290], and reliable network communication [257].
Most current FL applications are found within the technology industry and are specifically tailored for mobile applications, with few exceptions. However, in the envisioned IoT system of the future (IoFT), FL is expected to infiltrate all industries that stand to benefit from knowledge sharing, data analytics, and decision making. In fact, the presence of FL in the technology industry has set in motion a timid yet insuppressible momentum for FL applications in other industries. For instance, in the healthcare field, FL is lately being used as a medium of collaboration between hospitals to share patients' electronic records and other medical data [26,130,107,284]. In Sec. VII, we will present a deeper vision of how FL will shape the future of various industries, including manufacturing, transportation, energy, healthcare, quality & reliability, business, and computing.

1) Challenges
Needless to say, IoFT as an emerging technology poses significant intellectual challenges. Interdisciplinary skills across diverse fields are needed to bring the great promise of IoFT into reality. Below we highlight some of these challenges and shed light on their uniqueness in comparison to centralized IoT systems. This is by no means an exhaustive list, as IoFT challenges vary widely across different application sectors, as highlighted in Sec. VII.
• Statistical Heterogeneity: IoFT devices often have local datasets that differ in both size and distribution. Recent papers have shown an unfortunately wide gap in the global model's performance across different devices due to their heterogeneity in distribution [329,289] and size [71]. For instance, IoFT devices may have: (i) unique outputs, labels, or features only observed within certain IoFT devices; (ii) similar outputs but with dissimilar features (i.e., feature distribution skew) or vice versa. This statistical heterogeneity is a direct consequence of IoFT's ability to reach out to a large number of devices operating under different external factors and subject to geographic, cultural, and socio-economic differences.
In contrast, traditional IoT systems offer a key, yet often subtle, fundamental advantage: the ability to handle non-independent and identically distributed (non-i.i.d.) data by shuffling/randomizing the raw data collected in the cloud prior to learning, be it through distributed computing or learning on a single machine. This is not a luxury that IoFT possesses; rather, it is a price to pay for enhanced privacy.
• Personalization and Negative Transfer: In the IoFT process described in Sec. II-B, all clients collaborate to learn a global model: "one model that fits all". This integrative analysis of multiple clients implicitly assumes that the local datasets share some commonalities. However, with heterogeneity, negative transfer of knowledge may occur, which leads to decreased performance relative to learning tasks separately [139,155]. One possible solution is personalized modeling, where global models are adapted for local clients (refer to Sec. IV for data-driven personalization approaches). Indeed, personalization may be the fundamental tool to overcome the heterogeneity barrier intrinsic to IoFT. Yet developing validation techniques to identify negative transfer and minimize it is a critical problem in FL.
• Communication Efficiency and Resource Management: Communication can be a critical bottleneck for IoFT, especially with a large number of participants. Unlike cloud data centers, end devices in IoFT often have limited communication bandwidth with unstable and slow connections [137]. As a result, IoFT devices are often unreliable and can drop out due to battery or connectivity loss. Besides that, the devices themselves are heterogeneous in their computational capabilities and memory budgets. Therefore, resource management in IoFT is of critical importance.
Methods such as compressed communication [279,134], client selection [306], and optimal trade-offs between convergence rates, accuracy, energy consumption, latency, and communications [210,250] are of high future relevance. Another possible approach is incentive design to encourage reliable clients to participate in the training process and minimize dropout rates [122].
• Privacy: Privacy remains one of the key challenges and motivators behind IoFT. IoFT systems are prone to poisoning attacks on both edge devices and the central server. Targeted data perturbations [13,48,173] to specific labels/instances, or corrupting a large number of devices (i.e., fake devices), can immensely reduce accuracy, specifically in deep learning, which is prone to adversarial attacks [33]. Further, a malicious server might be able to reconstruct raw data even from a focused update. As a result, secure computation, aggregation, and communication are needed in IoFT [17,23], as is adversarial data modeling to ensure robustness against corrupted data in case breaches are inevitable [183].
• Bias and Fairness: IoFT systems can raise bias and fairness concerns. For example, sampling reliable phones with larger bandwidth (i.e., more expensive phones) can lead to models mostly representative of people of a certain socio-economic status. Further, it is often important to build models that are competitive across different groups or attributes. This becomes a bigger challenge if such sensitive attributes are not shared. Therefore, fair FL is an important challenge to tackle within IoFT [158].
• Other Statistical and Optimization Challenges: We also refer readers to Sec. VI for both statistical and optimization challenges/opportunities and to Sec. VII for domain-specific challenges in different sectors. We note here that Secs. III, IV, and V shed light on data-driven modeling approaches (global, personalized, and meta-learning) aimed at tackling some of the challenges above.
However, we exclude (i) privacy and communication efficiency, since there are excellent surveys focused mainly on these challenges (refer to Sec. I-B), and (ii) resource management and fairness, since the literature in those areas is still scarce.
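To make the statistical-heterogeneity challenge above concrete, the sketch below simulates a non-i.i.d. label-skew partition using a Dirichlet split, a device commonly used in FL experiments to emulate heterogeneous clients. The dataset, the number of clients, and the concentration parameter are illustrative assumptions, not values prescribed by any particular benchmark.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 1000 samples with 5 class labels.
labels = rng.integers(0, 5, size=1000)
n_clients, n_classes = 4, 5

# Dirichlet label-skew split: a small concentration parameter alpha
# gives each client a very uneven label mix (strong heterogeneity).
alpha = 0.3
client_indices = [[] for _ in range(n_clients)]
for c in range(n_classes):
    idx = np.where(labels == c)[0]
    rng.shuffle(idx)
    # Sample per-client proportions for this class and split accordingly.
    props = rng.dirichlet(alpha * np.ones(n_clients))
    cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
    for k, part in enumerate(np.split(idx, cuts)):
        client_indices[k].extend(part)

# Each client now sees a skewed label distribution.
for k, idx in enumerate(client_indices):
    counts = np.bincount(labels[idx], minlength=n_classes)
    print(f"client {k}: label counts {counts}")
```

Larger values of alpha approach an i.i.d. split, so the same snippet can sweep from the heterogeneous IoFT regime to the shuffled-cloud regime that traditional IoT enjoys.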

2) FL structures
The underlying structure and overall architecture of FL are tailored to fit certain applications and overcome specific challenges. Such elasticity of FL is at the heart of IoFT's applicability to various industries. Accordingly, it is essential to understand the application for which a specific FL structure is designed. In fact, many of the currently available FL architectures are influenced by the data composition. For instance, in the situation where multiple clients collaborate to learn a global model with the orchestration of a central server (as seen in Fig. 3), it is implicitly assumed that local datasets share a common feature space but have different sample spaces. Such a data composition is technically referred to as horizontally partitioned data [309]. A typical FL system architecture for horizontally partitioned data (also known as horizontal FL (HFL)) would exploit the availability of the common feature space. Notably, horizontally partitioned data are very common across different applications, making HFL the common practice in FL systems [97,39,309,312,240]. However, not all datasets share a common feature space, which naturally poses the need for a different FL architecture. Vertically partitioned data, which refers to datasets sharing different feature spaces but a similar sample space, is another familiar theme in different applications.
In fact, such datasets mostly appear in scenarios that involve joint collaboration between large enterprises. Consider, as an example, two different health institutes, each owning different health records yet sharing the same patients. Suppose you wish to build a predictive model for a patient's health using a complete portfolio of medical records from both healthcare institutes. Unlike HFL, where each client trains a local model using their own data, here training a local model requires data owned by other clients, since each client holds a disjoint subset of the data. Accordingly, a typical FL system architecture for vertically partitioned data (also known as vertical FL (VFL) [309]) is designed to introduce secure communication channels between clients to share the needed training data, while preserving privacy and preventing data leakage from one provider to another. For this, the VFL architecture involves a trusted neutral party to orchestrate the federation. The orchestrator aligns and aggregates data from participants to allow for collaborative model building using the joint data; see Fig. 4. Nonetheless, VFL remains less explored than HFL, and most currently developed structures can only handle two participants [216,98,310]. More challenging scenarios can occur in situations where clients have datasets that share only partial overlap in the feature and sample spaces. FL in these cases can leverage transfer learning techniques to allow for collaborative model training [225,309]. The general FL architecture in such a scenario is similar to that of VFL seen in Fig. 4, though the communication patterns can differ. The structures described above are designed to handle challenges arising from dataset partitioning. However, different challenges require new structures. One notable commonality of the above structures is the use of a central orchestrator that coordinates the FL.
The caveat, however, is that a central orchestrator is a single point of failure and can lead to a communication bottleneck with a large number of clients [163]. Accordingly, fully decentralized solutions can be explored to nullify the dependency on a central orchestrator. In fully decentralized architectures, communication with the central server is replaced by peer-to-peer communication, as seen in Fig. 5. In this setting, no central location receives model updates/data or maintains a global model over all clients; instead, clients communicate with each other to reach the desired solutions. Notably, peer-to-peer networks are better able to scale to situations with a large number of clients, thanks to their fully decentralized mechanism [127]; the current success of blockchains is a clear demonstration of this. Further, they offer additional security guarantees, as it is difficult to observe the system's full state [18]. However, such architectures raise performance concerns. In peer-to-peer networks, some clients could be malicious and potentially corrupt the network (e.g., violate data privacy); others could be unreliable and thus disrupt the communication channels. Consequently, a level of trust in a central authority within a peer-to-peer architecture can be of benefit in regulating the network's protocols.
The FL structures discussed here are by no means comprehensive, and several others exist in the literature (see [309,121,239,159]). The common denominator, however, is that FL structures arise from the challenges of applying FL to different scenarios. As IoFT is poised to infiltrate more and more fields, new challenges will dictate new FL architectures.

III. LEARNING A GLOBAL MODEL
Hereon, we discuss data-driven approaches for FL within IoFT. As mentioned earlier, we classify model building in FL into three categories: (i) a global model, (ii) a personalized model, and (iii) a meta-learning model. We then provide an overview of data-driven models, open challenges, and possible alternatives within these three categories.
As will become clear shortly, current FL techniques mostly focus on predictive modeling using deep learning and first-order optimization techniques, specifically SGD. This is understandable, as the immense data collected within IoFT often necessitates such approaches. Yet, as we discuss in our statistical perspective in Sec. VI, amongst others, exploring FL beyond deep networks is important for its wide-scale implementation. Topics such as graphical modeling, system control, random effect modeling, calibration, game theory and optimization under conflicting objectives (as discussed in Sec. VII-E), design of experiments (discussed in Sec. VII-F), and reinforcement learning, amongst others, are yet to be fully explored in the IoFT realm.

A. A GENERAL FRAMEWORK OF FL
As highlighted in Fig. 3, IoFT allows multiple clients to collaborate and learn a shared model while keeping their personal data stored locally. This shared model is referred to as the global model and aims to maximize utility across all devices. One can view the global model as "one model that fits all," where the goal is for the global model to yield better performance, in expectation, across all clients relative to each client learning a separate model using its own data.
We start by constructing the objective function of a global model. Assume there are $N$ clients (or local IoFT devices) and each client $i$ has $n_i$ observations. The general objective of training a global model is to minimize the weighted average of the objectives of all clients:

$$\min_{w} F(w) := \sum_{i=1}^{N} p_i f_i(w), \quad (1)$$

where $p_i = \frac{n_i}{n}$ is the weight coefficient, $n = \sum_{i=1}^{N} n_i$, and $f_i(w)$ is a risk function on client $i$. The risk function for client $i$ can be expressed as

$$f_i(w) = \mathbb{E}_{(x_i, y_i) \sim \mathcal{D}_i}\left[\ell\left(w; (x_i, y_i)\right)\right],$$

where $\mathcal{D}_i$ indicates the data distribution of the $i$-th client's data $(x_i, y_i)$, and $\ell(\cdot, \cdot)$ is a loss function. The risk function is usually approximated by the empirical risk

$$f_i(w) \approx \frac{1}{n_i} \sum_{j=1}^{n_i} \ell\left(w; (x_{i,j}, y_{i,j})\right).$$

Therefore, learning a global model in FL aims at minimizing the weighted average of risks over all clients.
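As a toy numerical check, the weighted global objective above can be sketched directly; the per-client risks and dataset sizes below are hypothetical:

```python
def global_risk(client_risks, client_sizes):
    """Weighted global objective F(w) = sum_i p_i * f_i(w), with p_i = n_i / n."""
    n = sum(client_sizes)
    return sum((n_i / n) * f_i
               for f_i, n_i in zip(client_risks, client_sizes))

# Hypothetical per-client empirical risks f_i(w) and dataset sizes n_i.
risks, sizes = [0.8, 0.2, 0.5], [100, 300, 100]
print(round(global_risk(risks, sizes), 3))  # 0.38
```

Note how the client with the most data (size 300) dominates the weighted average.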
However, unlike centralized training, client $i$ can only evaluate its own risk function $f_i$ in the FL setting. Client training and server model updating are thus decoupled. In each communication round, the server selects a subset of clients and broadcasts the global model information to the subset. Each client holds a local dataset and updates the model using its own data. Afterwards, clients send their updated models back to the central orchestrator/server. The orchestrator then aggregates and revises the global model based on input from clients. The process repeats for several communication rounds until the performance is satisfactory or some other exit condition is met. Algorithm 1 is a general "computation then aggregation" [327] framework of FL. Note that we use $[N]$ to denote the set $\{1, 2, \ldots, N\}$ and the superscript $t$ to represent the $t$-th communication round between the central server and selected clients.
Algorithm 1: Framework for Learning a Global Model
1: Input: Client datasets $\{\mathcal{D}_i\}_{i=1}^{N}$, $T$, initialization for $w$
2: for $t = 1, 2, \cdots, T$ do
3:   Server selects a subset of clients $S \subset [N]$ and broadcasts the global model $w^t$, or a part of it, to clients in $S$.
4:   for each $i \in S$ do
5:     Clients update model parameters $w_i^t = \text{client\_update}(w^t, \mathcal{D}_i)$.
6:     Clients send updated parameters $w_i^t$ to the server.
7:   end for
8:   Server updates $w$ by $w^{t+1} = \text{server\_update}(\{w_i^t\})$.
9: end for

One of the simplest FL algorithms is FedSGD [333,191], a distributed version of SGD. FedSGD was initially used for distributed computing in a centralized data setting. FedSGD partitions the data across multiple computing nodes. Each node then calculates the gradient based on its local data and updates its set of weights. The updated weights are then averaged across all nodes. As a data-parallelization approach, FedSGD utilizes the computational power of several devices instead of one. This approach accelerates vanilla SGD and has been widely used due to the growing size of datasets collected nowadays. Furthermore, since FedSGD only performs one step of (stochastic) gradient descent on a local node, averaging the updated weights is equivalent to averaging the gradients:

$$w^{t+1} = \sum_{i \in S} p_i \left(w^t - \eta \nabla f_i(w^t)\right) = w^t - \eta \sum_{i \in S} p_i \nabla f_i(w^t).$$

Despite being a viable option, traditional distributed optimization algorithms are often unsuitable for FL settings due to the large communication cost and the presence of heterogeneity. FedSGD transmits the entire gradient vector from one machine to the other after each local optimization iterate. This issue is not critical in centralized distributed training, where computation nodes are usually connected by large-bandwidth infrastructure. However, in IoFT, data lives on the edge device and not on a computing node. Communication with the central orchestrator at each gradient calculation is not feasible and may suffer immensely when the edge devices have limited communication bandwidth with unstable or slow connections.
To remedy this challenge, the seminal work of McMahan et al. [194] proposed a simple solution: FedAvg. The fundamental idea is that clients run multiple updates of the model parameters before passing the updated weights to the central orchestrator. Specifically, in FedAvg, clients update local models by running multiple steps of SGD on their local objective $\min_{w_i^t} f_i(w_i^t)$. During each communication round, the server simply calculates the weighted average of the individual clients' models: $w^{t+1} = \sum_{i \in S} \frac{n_i}{n} w_i^t$. An illustration contrasting FedAvg and FedSGD is shown in Fig. 6. Indeed, FedAvg has seen wide empirical success within FL due to its communication efficiency and strong predictive performance on several datasets. However, a major observed challenge was that the performance of FedAvg degrades significantly [194] when data across clients is heterogeneous, i.e., non-i.i.d. data. Here one should note that, besides efficient communication, FedAvg still often outperforms FedSGD in the presence of heterogeneity.
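A minimal sketch of FedAvg on toy quadratic client objectives $f_i(w) = \frac{1}{2}(w - c_i)^2$ (the client centers, sizes, and learning rate below are hypothetical):

```python
def client_update(w, ctr, lr=0.1, local_steps=5):
    """Multiple local gradient steps on a toy quadratic f_i(w) = 0.5*(w - ctr)^2."""
    for _ in range(local_steps):
        w = w - lr * (w - ctr)
    return w

def fedavg_round(w_global, centers, sizes):
    """Server step: weighted average of the locally updated models."""
    n = sum(sizes)
    local_models = [client_update(w_global, ctr) for ctr in centers]
    return sum((n_i / n) * w_i for w_i, n_i in zip(local_models, sizes))

w = 0.0
for _ in range(50):
    w = fedavg_round(w, centers=[1.0, 3.0], sizes=[50, 150])
print(round(w, 3))  # 2.5, the weighted optimum 0.25*1.0 + 0.75*3.0
```

Running the same loop with `local_steps=1` recovers FedSGD's one-gradient-step behavior per round.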

FIGURE 6. An illustration of FedAvg and FedSGD (panels: "Update of FedSGD" and "Update of FedAvg"). Grey arrows represent gradients evaluated on the local client. Bold red arrows represent a global model update on the central server in one communication round. In FedSGD, each client performs one step of gradient descent and sends the update to the server, while FedAvg allows each client to perform multiple steps of stochastic gradient descent before averaging.

B. TACKLING HETEROGENEITY
As previously discussed, an intrinsic property of IoFT is that the data distribution across clients is often imbalanced and heterogeneous. Unlike in centralized systems, such data cannot be randomized or shuffled prior to inference, as it resides on the edge. For example, wearable devices collect data on users' health conditions such as heartbeats and blood pressure. Due to the many differences across users, the amount of data collected can vary significantly, and the statistical patterns of these data are not alike, often exhibiting unique or conflicting trends. This heterogeneity degrades the performance of FedAvg. The reason is that minimizing the local empirical loss is sometimes fundamentally inconsistent with minimizing the global empirical loss when data are non-i.i.d.
That is, the minimizer of each local risk generally differs from the global minimizer:

$$w_i^* := \arg\min_{w} f_i(w) \neq \arg\min_{w} \sum_{i=1}^{N} p_i f_i(w) =: w^*,$$

where the superscript $*$ indicates an optimal parameter. This phenomenon is known as client-drift [124].
One method to allay data heterogeneity in FL is regularization. In the literature, regularization is a popular method for reducing the complexity of models, thereby attaining better generalization [84]. Regularization places penalties on a set of parameters in the objective function to encourage the model to converge to a desired minimum. Researchers in FL have proposed several notable algorithms using regularization techniques to train global models that are robust to non-i.i.d. data. Perhaps the most popular is FedProx [156], which adds a quadratic regularizer (a proximal term) to the client objective:

$$\min_{w_i} f_i(w_i) + \frac{\mu}{2} \left\| w_i - w^t \right\|^2.$$

The proximal term in FedProx limits the impact of client-drift by penalizing local updates that move too far from the global model in each communication round. It was also shown that such a penalty allows each device to perform a different number of local updates, which is especially useful when IoFT devices vary in reliability and communication/computation power. Experimental results show that first-order regularization on the client objective can partially alleviate data heterogeneity across clients while reducing communication cost, due to the often faster convergence and the ability of reliable clients to run more updates than others.
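The effect of the proximal term can be seen in a small sketch; the client objective, learning rate, and $\mu$ value below are hypothetical:

```python
def fedprox_client_update(w_global, grad_f, mu, lr=0.05, steps=10):
    """Local SGD on f_i(w) + (mu/2)*||w - w_global||^2 (FedProx-style objective)."""
    w = w_global
    for _ in range(steps):
        w = w - lr * (grad_f(w) + mu * (w - w_global))
    return w

# Hypothetical client whose local optimum (w = 5.0) sits far from w_global = 0.
grad = lambda w: w - 5.0                            # gradient of 0.5*(w - 5)^2
w_prox = fedprox_client_update(0.0, grad, mu=1.0)
w_free = fedprox_client_update(0.0, grad, mu=0.0)   # plain local SGD
print(w_prox < w_free)  # True: the proximal term holds the update near w_global
```

With `mu=0` the client drifts freely toward its own optimum; a larger `mu` anchors the update near the broadcast global model.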
Besides FedProx, [265,136,327,2] develop frameworks that tackle heterogeneity through either first- or second-order regularization. Among this literature, DANE [265] was proposed for distributed optimization yet is readily amenable to FL settings. DANE [265] proposes a local objective similar to mirror descent:

$$\min_{w} \left\langle \nabla F(w^t), w \right\rangle + \frac{1}{\eta} D_i(w, w^t), \quad (2)$$

where $\eta$ is a stepsize parameter and $D_i$ is the Bregman regularized divergence defined as:

$$D_i(w, w^t) = f_i(w) - f_i(w^t) - \left\langle \nabla f_i(w^t), w - w^t \right\rangle + \frac{\mu}{2}\left\| w - w^t \right\|^2.$$

Interestingly, if the local loss function is quadratic, optimizing (2) can approximate performing Newton updates on the function.
Along this line, federated SVRG [136] applies stochastic variance-reduced gradient descent to approximately solve (2), while FedDyn [2] slightly modifies the first-order regularizer in DANE by setting $\eta$ to zero, which yields the following local objective:

$$\min_{w} f_i(w) - \left\langle \nabla f_i(w_i^{t-1}), w \right\rangle + \frac{\mu}{2}\left\| w - w^t \right\|^2.$$

The added first-order and second-order regularizers make the limiting points of FedDyn the stationary points of the global empirical loss. Intuitively, FedDyn tries to alleviate the inconsistency (i.e., the drift effect) between the local objective functions and the global objective function. FedDyn is proved to converge to fixed points without a bounded dissimilarity condition. In practice, FedDyn is shown to achieve similar test accuracy with fewer communication rounds compared with FedAvg and FedProx. Similarly, FedSplit [229] is a class of algorithms aimed at solving the fixed-point issue. Since each client performs multiple steps of updates before the models are averaged, the stationary point of FedAvg might differ from the optimal solution of the original problem [229]. To solve this problem, FedSplit stores a local variable $z_i$ for each client; in communication round $t$, client $i$ receives $w^t$ and updates $z_i$ through a local proximal step before the server averages the local variables. On the theoretical side, the authors prove the linear convergence of FedSplit. As is also observed for FedDyn, the use of local variables seems necessary for convergence to fixed points in FL.
Besides the above models that evolved from DANE, [124] introduce the Stochastic Controlled Averaging (SCAFFOLD) algorithm to correct the drift effect. At communication round $t$, SCAFFOLD maintains a control variate $c_i^t$ that estimates the update direction of each client and uses these values to correct the client-drift. The overall objective is the same as objective (1), yet the local update is modified by $c_i^t$. Specifically,

$$w_i \leftarrow w_i - \eta \left( \nabla f_i(w_i) - c_i^t + c^t \right), \quad (3)$$

where $c^t = \frac{1}{N} \sum_i c_i^t$ is the aggregated control variate and is updated in each communication round. The detailed SCAFFOLD algorithm can be found in Algorithm 2. Since SCAFFOLD relieves the client-drift effect, [124] theoretically proved that it requires significantly fewer communication rounds and converges faster than FedAvg in heterogeneous settings.
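The drift-correcting effect of the control variates can be sketched on two toy quadratic clients; here the control variates are set (by construction, for illustration) to each client's gradient at the global optimum, the idealized values they estimate:

```python
def scaffold_local_update(w, grad_f, c_i, c, lr=0.1, steps=5):
    """Local steps with the control-variate-corrected direction g_i - c_i + c."""
    for _ in range(steps):
        w = w - lr * (grad_f(w) - c_i + c)
    return w

# Two hypothetical quadratic clients f_i(w) = 0.5*(w - ctr)^2.
centers = [1.0, 3.0]
w_star = sum(centers) / len(centers)          # optimum of the averaged loss
c_is = [w_star - ctr for ctr in centers]      # grad f_i evaluated at w_star
c = sum(c_is) / len(c_is)                     # aggregated control variate
updated = [
    scaffold_local_update(0.0, lambda w, ctr=ctr: w - ctr, ci, c)
    for ctr, ci in zip(centers, c_is)
]
print(abs(updated[0] - updated[1]) < 1e-9)  # True: corrected updates coincide
```

Without the correction, the two clients would drift toward 1.0 and 3.0 respectively; with it, both local trajectories move toward the shared optimum.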
Algorithm 2: SCAFFOLD
1: Input: Client datasets $\{\mathcal{D}_i\}_{i=1}^{N}$, $T$, initialization for $w$ and $\{c_i\}$
2: for $t = 1, 2, \cdots, T$ do
3:   Server selects a subset of clients $S \subset [N]$ and broadcasts the global model $w^t$ and $c^t$, or parts of them, to clients in $S$.
4:   for each $i \in S$ do
5:     Clients update model parameters $w_i^t$ using (3), obtaining $w_i^{t+1}$ and an updated control variate $c_i^+$.
6:     Clients send the updated parameters $w_i^{t+1}$ and $\Delta c_i = c_i^+ - c_i^t$ to the server.
7:   end for
8:   Server updates $w$, and updates $c$ by $c^{t+1} = c^t + \frac{1}{N} \sum_i \Delta c_i$.
9: end for

Federated primal-dual (FedPD) [327] reformulates the optimization problem in (1) as a constrained optimization problem. Similar to FedDyn, FedPD uses first-order and second-order regularizers to alleviate the discrepancy between the global model and the local models. Algorithmically, FedPD first randomly initializes $w_{0,i}^0$, $\lambda_i$, and $w_i$ for all clients. At round $t$, each client updates $w_i^{t+1}$ using an oracle tailored to optimize its local augmented Lagrangian $L_i$, and then updates its dual variable $\lambda_i^{t+1}$. After the local updates, the server draws a number from $\{0, 1\}$ with probabilities $1-p$ and $p$, respectively. If the drawn number is $0$, communication is skipped and clients continue locally; if it is $1$, the server aggregates the local models into $w_0^{t+1}$ and sets $w_{0,i}^{t+1} = w_0^{t+1}$ for all $i$. Notably, it was shown that FedDyn is equivalent to FedPD on an algorithmic level.
Meanwhile, researchers have pointed out that the permutation of neurons may cause declined performance in the aggregation step of FL, specifically when data is heterogeneous. This permutation problem is due to the fact that different neural networks related by a weight permutation can represent the same function. For example, for a simple NN $y(x) = W_2 \sigma(W_1 x)$, where $\sigma$ is an activation function, $x \in \mathbb{R}^{d_{in}}$ is the input data, and $W_1 \in \mathbb{R}^{L \times d_{in}}$ and $W_2 \in \mathbb{R}^{d_{out} \times L}$ are weight matrices, we can apply an $L \times L$ permutation matrix $\Pi$ to $W_1$ and $W_2$ and the function remains the same: $y(x) = W_2 \Pi^T \sigma(\Pi W_1 x)$. However, training on different clients may be attracted to networks with different permutation matrices $\Pi$. To cope with this, [322] propose a neuron matching algorithm called Probabilistic Federated Neural Matching (PFNM). However, [285] argue that PFNM gives only marginal improvement for neural networks with complicated structures such as convolutional neural networks (CNNs) or long short-term memory (LSTM) networks. They remedy this by proposing the FL with Matched Averaging (FedMA) algorithm. FedMA first undoes the permutation by using the Hungarian algorithm to estimate the $\Pi$ of each client. Then, the server matches neurons to their most probable counterparts in the global model and averages the permuted model. FedMA is reported to achieve strong performance on CIFAR-10 and Shakespeare. Additionally, the performance of FedMA improves monotonically as the number of local epochs increases, while that of FedAvg and FedProx drops after a threshold of local epochs due to the discrepancy between local models. This trend allows clients to train for more epochs with FedMA.
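The permutation invariance $y(x) = W_2 \sigma(W_1 x) = W_2 \Pi^T \sigma(\Pi W_1 x)$ can be verified numerically on random weights (the dimensions and seed below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, L, d_out = 4, 6, 3
W1 = rng.normal(size=(L, d_in))
W2 = rng.normal(size=(d_out, L))
x = rng.normal(size=d_in)
relu = lambda z: np.maximum(z, 0.0)

P = np.eye(L)[rng.permutation(L)]            # random L x L permutation matrix
y_original = W2 @ relu(W1 @ x)
y_permuted = (W2 @ P.T) @ relu((P @ W1) @ x)
print(np.allclose(y_original, y_permuted))   # True: same function, permuted weights
```

This is exactly why naively averaging the weight matrices of two such equivalent networks can destroy the learned function, motivating neuron matching before aggregation.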
Besides the frequentist approaches above, there has also been a recent push toward improving global modeling through a Bayesian framework. The intuition is simple: rather than betting our results on one hypothesis ($\hat{w}$) obtained by optimizing the empirical loss, one may average over a set of possible $w$, or integrate over all $w$ weighted by their posterior probability $P(w|\mathcal{D})$. This is the underlying philosophy of marginalization compared to optimization, whereby in the frequentist optimization approach predictions are obtained by substituting the posterior with $P(w|\mathcal{D}) = \delta(w = \hat{w})$, where $\delta$ is an indicator function. Indeed, this notion of Bayesian ensembling has seen empirical success in Bayesian deep learning [181,113].
One such approach is BayesFed [267]. BayesFed is a simple plug-in for any FL algorithm that aims to learn $K$ modes, or $K$ plausible solutions. To do so, BayesFed follows a random permutation sampling scheme where, at each communication round, clients receive one of the $K$ modes, and aggregation then happens for each mode separately (using FedAvg or other FL approaches). This approach is shown to be equivalent to a variational inference scheme [24,324] for estimating a Gaussian mixture variational distribution $q(w)$ whose centers are randomly initialized at the beginning. Using a neural tangent kernel argument, the authors show that all $K$ modes converge to samples from the same limiting Gaussian process in sufficiently overparameterized regimes, where each mode can behave like a model trained by centralized training.
Another recent work taking insights from Bayesian inference is FedBE [38]. It performs statistical inference on the client-trained models and uses knowledge distillation (KD) to update the global model. Intuitively, the goal of KD is to sample high-quality base models $w$ from a global distribution $p(w)$. More specifically, after receiving $\{w_i\}_{i=1,\ldots,m}$ from clients, the server fits them with a Gaussian or Dirichlet distribution and then samples from the estimated distribution to form an ensemble. In server_update, the global model is trained to mimic the ensemble teacher by minimizing the discrepancy, measured by a divergence $\mathrm{Div}$ (the cross-entropy loss), between the prediction of the global model and the averaged ensemble outputs, evaluated on an additional unlabeled dataset held on the server. The authors empirically show that the ensembling and knowledge distillation turn out to be more robust to non-i.i.d. data than averaging via FedAvg, yet it is not clear whether the performance gain comes from ensembling or from knowledge distillation.

C. EFFICIENT & EFFECTIVE OPTIMIZATION
Several studies attempt to improve FedAvg by adopting efficient optimization algorithms in the FL realm. They show theoretically or empirically that the improved algorithms can converge faster and accelerate global model training. In general, acceleration can be achieved by improving either the server aggregation step (server_update) or the client updates (client_update). FedAdam and FedYogi [249] bring the well-known Adam [131] and Yogi [248] algorithms to FL by augmenting the server_update function with adaptive stepsizes. More specifically, FedAdam and FedYogi use a second-order moment estimate $v^t$ to adaptively adjust the stepsize, with $v^{-1}$ initialized at the beginning. Upon receiving $w_i^t$ from the clients, the server calculates $\Delta w_i^t = w_i^t - w^t$ and averages them into $\Delta^t$. The update rule for both FedAdam and FedYogi is

$$w^{t+1} = w^t + \eta \frac{\Delta^t}{\sqrt{v^t} + \tau},$$

where $\tau$ is a small constant for numerical stability and the two algorithms differ in how $v^t$ is updated from $\Delta^t$. The theoretical convergence rates of FedAdam and FedYogi are faster than those of FedAvg. Considering the success of adaptive stepsize methods in several important fields, including language models [283] and GANs [313,264], amongst others, we believe their use in FL is promising. Indeed, both algorithms demonstrate experimental benefits on image and language datasets. A related algorithm in this vein is federated averaging with server momentum (FedAvgM) [172], which uses server momentum in the server_update step.
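A minimal sketch of an adaptive server step in this spirit; the momentum term, hyperparameter values, and dummy delta below are illustrative assumptions, not the exact FedAdam recipe:

```python
import numpy as np

def adaptive_server_update(w, delta, m, v, lr=0.1, b1=0.9, b2=0.99, tau=1e-3):
    """One server-side adaptive step on the averaged client delta (a sketch)."""
    m = b1 * m + (1 - b1) * delta          # server momentum on the delta
    v = b2 * v + (1 - b2) * delta**2       # Adam-style second moment
    # (A Yogi-style variant would use a sign-based update for v instead.)
    w = w + lr * m / (np.sqrt(v) + tau)
    return w, m, v

w, m, v = np.zeros(2), np.zeros(2), np.zeros(2)
w, m, v = adaptive_server_update(w, delta=np.ones(2), m=m, v=v)
print(w)
```

Coordinates with persistently large deltas accumulate a large `v` and get effectively smaller steps, which is the stabilizing effect adaptive methods bring to noisy client aggregates.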
Besides modifying server_update, a multitude of algorithms redesign the client_update function. For instance, there have been attempts to expedite local training by incorporating acceleration techniques from optimization.
FedAc [318] is a federated version of accelerated SGD. Instead of updating a single variable $w_i$ as FedAvg does, FedAc updates three sequences iteratively on the client side. The server averages $w_i$ and $(w_i)^{ag}$ from the sampled clients and broadcasts the averaged $w^{t+1}$ and $w^{t+1}_{ag}$, which clients take as the initialization of $w_i$ and $(w_i)^{ag}$ in the next communication round. The algorithm then proceeds until convergence. [318] theoretically prove that FedAc achieves acceleration while requiring fewer synchronization rounds than parallel SGD. The idea of LoAdaBoost [108] is to adaptively determine the training epochs of clients by monitoring the training loss on each client and adjusting the training schedule accordingly. More specifically, after one communication round, clients send their training losses, in addition to the updated weights, to the server. The server estimates the median of the training losses, $L_{median} := \mathrm{median}(\{F_i\}_i)$. In the next round, all clients first train for $\frac{E}{2}$ epochs, where $E$ is the average budget of epochs. If the training loss is lower than $L_{median}$, the local training is deemed to have reached its goal in this round, and the updated weights are directly sent back to the server. If the training loss on client $i$ is higher than $L_{median}$, then the model underfits client $i$. As a result, LoAdaBoost trains the model on client $i$ for extra epochs until the local training loss drops below $L_{median}$ or the total number of epochs exceeds $\frac{3}{2}E$, whichever comes first.
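The LoAdaBoost epoch schedule can be sketched as a simple loop; the exponentially decaying loss curve below is a hypothetical stand-in for an actual training run:

```python
def loadaboost_epochs(loss_after, median_loss, E):
    """Epochs a client trains in one round: E/2 baseline, extended while the
    loss stays above the median, capped at 3E/2 (a sketch of the schedule)."""
    epochs = E // 2
    while loss_after(epochs) > median_loss and epochs < 3 * E // 2:
        epochs += 1
    return epochs

loss_after = lambda e: 0.8 ** e    # hypothetical decaying training-loss curve
print(loadaboost_epochs(loss_after, median_loss=0.3, E=4))  # under-fitting client: 6
print(loadaboost_epochs(loss_after, median_loss=0.7, E=4))  # well-fit client: 2
```

Well-fit clients stop at the E/2 baseline, while the lagging client keeps training until it hits either the median loss or the 3E/2 cap.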

D. SAMPLING CLIENTS
Due to the often sheer size and unreliability of edge devices participating in IoFT, not all clients can participate in training in one communication round, as shown in Algorithm 1. Therefore, choosing the appropriate subset $S$ at each communication round between the orchestrator and clients is of utmost importance in FL. Here we shed light on some existing schemes and other possible alternatives. Basically, there are two genres of sampling probability design: sampling uniformly and sampling according to the local dataset size. For example, FedAvg samples clients uniformly with probability $q_i = \frac{1}{N}$ and averages client models with weights proportional to their local dataset sizes $\frac{n_i}{n}$. FedProx samples clients with probability $q_i = \frac{n_i}{n}$ and averages client models with equal weights. The sampling probabilities and weights are chosen to make the client updates unbiased estimates of the global update, i.e., unbiased estimates of $\sum_{i=1}^{N} p_i \nabla f_i(w)$, where $p_i$, as discussed before, is $\frac{n_i}{n}$. Uniform sampling might be inefficient since the orchestrator can sample edge devices with small datasets too often. Dataset-size-based sampling addresses this issue but may raise fairness concerns, as some clients are rarely sampled and trained.
To form better sampling schemes and accelerate training, adaptive sampling techniques have also been proposed. These FL algorithms update the sampling probability $q_i^t$ after each communication round using historical statistics [49,303,147,47]. A general intuition is to more often sample clients on which the model fits poorly. The rationale is that when a model incurs a high training loss or large gradient norms on client $i$, client $i$ is not performing well under the current model and should be trained for more epochs. There is a range of choices for measuring the performance of a model on a client. Among them, power-of-choice [49] samples clients with the highest loss. Oort [147] uses a similar strategy: it first calculates the utility of client $i$ as a function of the loss, then applies exploration-exploitation algorithms [27] to sample clients with higher utility. On the other hand, a set of works calculates the sampling probabilities adaptively using the gradient norms of the clients [205,125,171]. For instance, Adambs [171] calculates the gradient norms of sampled batches on clients and uses a variant of an exploration-exploitation algorithm to update the sampling probabilities so that batches with larger gradient norms are assigned higher probability.
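A minimal sketch of loss-driven sampling probabilities (a simple loss-proportional rule, not the exact Oort utility or its exploration term; the losses are hypothetical):

```python
import numpy as np

def sampling_probs(losses, alpha=1.0):
    """Assign higher sampling probability to clients with higher loss."""
    u = np.asarray(losses, dtype=float) ** alpha
    return u / u.sum()

p = sampling_probs([0.1, 0.4, 0.5])
print(p)  # client 2, with the highest loss, is sampled most often
```

Raising `alpha` sharpens the preference for high-loss clients; `alpha=0` recovers uniform sampling.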
Client selection usually improves model performance and speeds up the training process; for example, Oort [147] can achieve a 1.1× to 6.4× speedup in time-to-accuracy compared with vanilla FedProx or FedYogi. Due to the prevailing statistical and system heterogeneity among clients, we believe client sampling techniques will be of great significance when practitioners deploy FL frameworks.

IV. LEARNING A PERSONALIZED MODEL
As highlighted in previous sections, heterogeneity is a fundamental challenge for IoFT. Denote the data distribution of edge device $i$ as $P_i(x, y)$, where $x$ and $y$ are the features and outputs (assumed to be labels for illustration), respectively. By conditional probability, $P_i(x, y) = P_i(y|x)P_i(x)$. Therefore, differences in the data distributions $P_i(x, y)$ across clients can be explained by differences in $P_i(y|x)$ and in $P_i(x)$. The first is usually referred to as concept shift and the second as covariate shift [277,195]. Take the sequence prediction task on mobile phones as an example: for different users, the word following "I live in ..." should be different [141]. This example corresponds to a concept shift: $x$ is part of a sentence, and $y$ is the sequence to predict. In this situation, $P_i(y|x)$ should be customized for different clients even if $x$ is the same. Another example is autonomous vehicles that make decisions based on sensory data, sometimes in the form of images. Presumably, images from frigid and tropical zones show different scenery, but the decision space should be the same. This auto-driving example illustrates covariate shift: the distribution of the input sensory data $P_i(x)$ differs across vehicles. Besides this, IoFT devices may have unique outputs/labels observed only within specific users.
One direct solution to the challenges above is personalization. Instead of using one global model for all edge devices, personalized FL fits tailor-made models for each IoFT device while leveraging information across all devices. In the rest of this section, we discuss popular personalization approaches, which can be divided into two branches: fully personalized and semi-personalized. In fully personalized algorithms, each edge device can potentially learn a customized model, while in semi-personalized algorithms, models are tailor-made only for a group of clients. In Sec. V, we will further discuss personalization from a meta-learning perspective.

A. FULLY PERSONALIZED
To accommodate client-specific trends while leveraging global information, [96] propose the following general objective for personalized FL:

$$\min_{w, \{\beta_i\}_{i=1}^{N}} \sum_{i=1}^{N} p_i f_i(w, \beta_i), \quad (4)$$

where $w$ is a shared parameter and $\{\beta_i\}_{i=1}^{N}$ is a set of unique parameters, one per client. [288,164] use different layers of a neural network to represent $w$ and $\beta_i$. It is common wisdom in multi-task learning that, in deep neural networks, the base layers process the input and compute a feature representation, while the top layers learn task-dependent weights based on that representation. The task can be, for instance, classification or regression. FedPer [288] fits global base layers and personalizes the top layers. As an example, a fully connected multi-layer neural network can be expressed as $y = W_n \sigma(W_{n-1} \sigma(\ldots \sigma(W_1 x)))$, where $\sigma$ is the activation function and the $W_i$'s are the layer weights. In this example, FedPer can take $W_1$ to $W_B$ as the base layers $w$ and $W_{B+1}$ to $W_n$ as the personalized layers $\beta_i$ in (4). In one communication round, client $i$ uses SGD to update $w$ and $\beta_i$ simultaneously. However, different from FedAvg, only $w$ is transmitted to the server, where it is then aggregated. FedPer is found to perform better than FedAvg on image classification tasks on CIFAR-10 and CIFAR-100. On these datasets, the authors show that personalizing the last one or two residual blocks of ResNet-34 yields the best testing performance. Similarly, LG-FedAvg [164] takes the top layers as the global weights $w$ and the base layers as the personalized weights $\beta_i$. The intuition is to learn customized representation layers for different clients and to train a global model that operates on the local representations. Additionally, by carefully designing the representation-learning loss, the generated local representations can confound protected attributes such as gender, race, etc.
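A toy sketch of this partial weight sharing: only the shared "base" weights are aggregated and broadcast, while the personalized "head" weights never leave the client (the two-client setup and all numbers are hypothetical):

```python
import numpy as np

# Toy FedPer-style split: "base" weights are shared and aggregated on the
# server; "head" weights stay personal.
clients = [
    {"base": np.array([1.0, 2.0]), "head": np.array([0.5])},
    {"base": np.array([3.0, 4.0]), "head": np.array([-0.5])},
]

shared_base = np.mean([c["base"] for c in clients], axis=0)  # server aggregation
for c in clients:
    c["base"] = shared_base.copy()    # broadcast the shared layers only
print(shared_base, clients[0]["head"], clients[1]["head"])
```

After the round, both clients hold the same base representation but keep their distinct heads, which is the essence of the split in (4).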
Different from the multi-task-learning-style splitting of global and local layers, some works [270,160,65] treat the neural network holistically and learn personalized $\beta_i$'s through specially designed objectives and optimization algorithms. We call the former partial weight sharing and the latter weight regularizing. In general, the $\beta_i$'s and $w$ live in the same parameter space, so it is possible to add or subtract these weight vectors. Weight regularizing methods thus usually allow weights to differ from each other, instead of forcing some coordinates of the weight vectors to be exactly the same.
A straightforward weight regularizing method is train-then-adapt. As the name suggests, this approach trains the global model on all clients and then adapts it to individual devices. The simplest way to adapt is fine-tuning [317,7], which is also widely employed in computer vision and natural language processing [238]. After being trained with FedAvg for several cycles, $w$ is supposed to have learned desirable feature representations of the entire dataset. Clients are then allowed to make small adjustments to the global model to better fit their local datasets: each client starts from $\beta_i = w$ and performs a few steps of SGD to minimize the local loss function $\min_{\beta_i} f_i(\beta_i)$. In word prediction and image classification tasks, fine-tuning is shown to generalize better than fully local training.
As a more formal formulation of adaptation, multi-task FL (Ditto) [160] introduces the following objective:

$$\min_{\beta_i} f_i(\beta_i) + \frac{\lambda}{2} \left\| \beta_i - w^* \right\|^2,$$

where $w^*$ is the optimal solution to (1). Intuitively, this objective allows client models to deviate from the globally optimal model but penalizes large deviations. To solve the problem, Ditto [160] optimizes $w$ and $\beta_i$ alternately. The update of $w$ is similar to FedAvg and is decoupled from the update of $\beta_i$. Once client $i$ has a relatively accurate solution for $w^*$, it can update $\beta_i$ by gradient descent:

$$\beta_i \leftarrow \beta_i - \eta \left( \nabla f_i(\beta_i) + \lambda (\beta_i - w^*) \right).$$

Ditto shows better performance on image datasets and also achieves better fairness and robustness in training.
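The pull between the local optimum and the global solution can be seen in a small sketch; the quadratic client objective and all hyperparameter values below are hypothetical:

```python
def ditto_update(beta, grad_f, w_star, lam, lr=0.1, steps=50):
    """Personalized update penalized toward the global solution w* (Ditto-style)."""
    for _ in range(steps):
        beta = beta - lr * (grad_f(beta) + lam * (beta - w_star))
    return beta

# Toy quadratic client with local optimum 4.0, global solution w* = 0.0.
grad = lambda b: b - 4.0
beta = ditto_update(0.0, grad, w_star=0.0, lam=1.0)
print(round(beta, 3))  # 2.0: settles between the local optimum and w*
```

With `lam=0` the client would converge to its own optimum 4.0; larger `lam` pulls the personalized model ever closer to `w_star`.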
To integrate the updates of $w$ and $\beta_i$, [65] propose Moreau envelope FL (pFedMe) for personalization. pFedMe formulates a bilevel optimization problem

$$\min_{w} \frac{1}{N} \sum_{i=1}^{N} F_i(w), \quad \text{where } F_i(w) = \min_{\beta_i} f_i(\beta_i) + \frac{\mu}{2} \left\| \beta_i - w \right\|^2.$$

The algorithm gets its name because $F_i(w)$ is the Moreau envelope of $f_i(w)$. In the inner-level optimization, the personalized weights $\beta_i$ minimize the local loss function in the vicinity of the reference point $w$, and in the outer-level minimization, $w$ is optimized to produce a better reference point. This objective is closely related to model-agnostic meta-learning (MAML), which simply approximates the inner optimization by a single gradient step from $w$. The optimal personalized solution of pFedMe satisfies $\beta_i = w - \frac{1}{\mu} \nabla f_i(\beta_i)$, i.e., a proximal step from $w$. Sec. V will cover more details on meta-learning algorithms. To optimize the objective, pFedMe proceeds as follows. In one communication round, the selected clients receive the global weights $w$ and find an approximate solution $\hat{\beta}_i$ to the inner optimization problem $\min_{\beta_i} f_i(\beta_i) + \frac{\mu}{2}\|w - \beta_i\|^2$ via gradient descent. Client $i$ then computes $\Delta w_i = \mu \eta (\hat{\beta}_i - w)$ and sends $\Delta w_i$ to the server; $\mu$ is set to 15∼30 in the experiments. The server then averages the received $\Delta w_i$ to renew the global model as $w \leftarrow w + \gamma \sum_i \Delta w_i$, where $\gamma$ is another hyperparameter whose value is 1∼4 in the experiments. The convergence rate of pFedMe is proved to be faster than that of FedAvg. Local loopless GD (L2GD) [95] instead penalizes the deviation of each local model $\beta_i$ from the average of all local models and optimizes the resulting objective stochastically. More specifically, in round $t$, a coin with heads probability $1-p$ is tossed. If it lands heads, each client performs one step of gradient descent on its local training loss, $\beta_i^t \leftarrow \beta_i^t - \frac{\eta}{1-p} \nabla f_i(\beta_i^t)$. Otherwise, if the coin lands tails and the previous round's coin was heads, the clients send their updated models $\beta_i^t$ back to the server.
The server update consists of two steps: first, the server averages the $\beta_i^t$ to obtain $\bar{\beta}^t = \frac{1}{N} \sum_{i=1}^{N} \beta_i^t$; second, the server calculates the initialization of client $i$'s weights for the $(t+1)$-th communication round as

$$\beta_i^{t+1} = \left(1 - \frac{\alpha \mu}{N p}\right) \beta_i^t + \frac{\alpha \mu}{N p} \bar{\beta}^t$$

and sends it to the corresponding client. The expected number of communications in $s$ coin tosses is $p(1-p)s$.
Most of the above works assume the regularization term to be a quadratic function of the distance between the local weights and some global model weights. MOCHA [270] instead defines the regularization function as $\mu_1 \mathrm{Tr}(\beta \Omega \beta^T) + \mu_2 \|\beta\|_F^2$, where $\beta$ is a $d \times N$ matrix whose $i$-th column is the model weight of client $i$, $\Omega$ is an $N \times N$ matrix, and $\|\cdot\|_F$ represents the Frobenius norm. The first term models the similarity among clients: negative off-diagonal entries of $\Omega$ encourage the alignment of two clients' local weights. The second term is the regular $L_2$ penalty. In linear models, the objective combines the per-client losses on the data points $(x_i, y_i)$ with this regularizer. MOCHA employs a primal-dual algorithm and updates $\beta$ and $\Omega$ alternately. MOCHA is shown to outperform baseline algorithms on datasets including GLEAM, Human Activity Recognition, and Vehicle Sensors. A limitation of MOCHA is its use of linear models, which requires handcrafted features; this bottleneck limits the representation power of MOCHA.

B. SEMI-PERSONALIZED
A possible alternative to global or fully personalized modeling is semi-personalized modeling. Semi-personalized FL fits a stylized model for each group of clients, striking a balance between N individualized models and one model that fits all. This is highlighted in Fig. 8. Usually these algorithms cluster clients into K ≪ N groups and assume the data distribution of clients inside one group is homogeneous. [188] proposed an intuitive client clustering method. The number of clusters K < N is predetermined, and each cluster is represented by a cluster parameter in {γ_1, ..., γ_K}. The objective is

min_{γ_1,...,γ_K} Σ_{i=1}^N min_{j∈[K]} f_i(γ_j).

The inner-level minimization over the cluster index j assigns client i to the cluster with the lowest training loss, and the outer-level minimization over the γ_j's optimizes the cluster model weights. The authors propose the HypCluster algorithm to optimize this objective. The K cluster weights are initialized by running SGD on some randomly sampled clients. In each communication round, all cluster weights are broadcast to all clients, and each client chooses the one with the lowest loss. Afterward, clients train the corresponding model by SGD on their own local datasets. On EMNIST, HypCluster with K = 2 achieves higher test accuracy than FedAvg and agnostic FL [200].
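A minimal sketch of one HypCluster round, assuming toy quadratic client losses (the helper names are hypothetical, and full-batch gradient steps stand in for local SGD):

```python
import numpy as np

def hypcluster_round(cluster_ws, client_losses, client_grads, lr=0.1):
    """One HypCluster round: each client picks the cluster model with the
    lowest local loss, then runs a gradient step on that cluster's model."""
    assignments = []
    new_ws = [w.copy() for w in cluster_ws]
    for loss, grad in zip(client_losses, client_grads):
        j = int(np.argmin([loss(w) for w in cluster_ws]))   # inner min over clusters
        assignments.append(j)
        new_ws[j] -= lr * grad(new_ws[j])                   # local step on chosen model
    return new_ws, assignments

# four clients forming two latent groups, centered near +2 and -2
centers = [2.0, -2.0, 2.1, -1.9]
losses = [lambda w, c=c: 0.5 * (w[0] - c) ** 2 for c in centers]
grads  = [lambda w, c=c: np.array([w[0] - c]) for c in centers]
ws = [np.array([1.0]), np.array([-1.0])]
for _ in range(50):
    ws, assign = hypcluster_round(ws, losses, grads)
```

After a few rounds the two cluster models drift toward the two latent group centers, and the assignment stabilizes.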
Clustered federated learning (CFL) [262] clusters clients dynamically. In conventional FL, the optimal solutions of different edge devices may not coincide, so the global optimum may not correspond to the local optima: when the averaged gradient norm is small but the gradient norm on some clients remains large, the global model is not suitable for those clients. It is therefore advantageous to divide clients into different clusters. More specifically, CFL measures the similarity between two clients in the following manner: suppose in communication round t the updates of clients i and j are ∆w_i^t and ∆w_j^t, respectively; their cosine similarity is defined as

sim(i, j) = ⟨∆w_i^t, ∆w_j^t⟩ / (‖∆w_i^t‖ ‖∆w_j^t‖).

In each communication round, the server calculates the cosine similarity within and across clusters. If the clients in one cluster are homogeneous, their mutual similarity should be large compared with the similarity across clusters, and CFL simply performs FedAvg for that cluster. If otherwise the clients inside a cluster are heterogeneous, their mutual similarity is low, and CFL divides them into two subclusters. CFL then repeats the procedure on the two subclusters. As the algorithm proceeds, clients are automatically divided into different subclusters. CFL terminates when the gradient norms on all clients are small and no further sub-dividing is needed. Experimentally, CFL can boost performance when training convolutional neural networks on CIFAR-10 and LSTMs on AG-News.
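The cosine-similarity test at the heart of CFL can be illustrated with a simplified bipartition sketch (a greedy stand-in for the split used in [262]; all names are hypothetical):

```python
import numpy as np

def cosine_sim(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def bipartition(updates):
    """Greedy bipartition in the spirit of CFL: seed two subclusters with the
    least-similar pair of client updates, then assign each remaining client
    to the seed with the higher cosine similarity."""
    n = len(updates)
    pairs = [(cosine_sim(updates[i], updates[j]), i, j)
             for i in range(n) for j in range(i + 1, n)]
    _, a, b = min(pairs)                 # most dissimilar pair becomes the seeds
    c1, c2 = [a], [b]
    for k in range(n):
        if k in (a, b):
            continue
        (c1 if cosine_sim(updates[k], updates[a]) >=
               cosine_sim(updates[k], updates[b]) else c2).append(k)
    return sorted(c1), sorted(c2)

# two groups of clients with roughly opposite update directions
ups = [np.array([1.0, 0.1]), np.array([0.9, -0.1]),
       np.array([-1.0, 0.2]), np.array([-0.8, 0.0])]
c1, c2 = bipartition(ups)
```

Clients whose updates point in similar directions end up in the same subcluster, which is the signal CFL uses to decide whether further splitting is warranted.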

C. OPEN ISSUES
Despite the progress, personalized FL still faces many interesting open issues. We list some open problems that might be directions for future research.
• Firstly, it is important to investigate when a personalized model is needed. Intuitively, when there is large heterogeneity among clients, fitting personalized models should outperform a single global one. Moreover, as pointed out in [139], when data is extremely heterogeneous, negative transfer of knowledge may occur; [164] conducted an analysis on additive models that matched this intuition. Yet literature on approaches that characterize the level of data heterogeneity, and accordingly decide how many personalized models should be built, is largely missing. In a similar fashion, a key question remains: how should clients be clustered in a federated setting? In FL, clients usually send back a set of weights or gradients; are these summary statistics sufficient to recover key information from all clients? If not, what sufficient statistics can achieve such a goal, and do they guarantee privacy?
• Secondly, personalization may come at the price of privacy. [331] shows that input images can be reconstructed from unperturbed gradient signals alone, thus opening the possibility of gradient attacks, though precautions such as adding noise or quantization can somewhat reduce the risk. The privacy issue is more acute for personalized models, since the individualized part of these models contains unique information about the client that malicious intruders, or even servers, could collect and exploit. Differential privacy is a useful tool for studying privacy issues under gradient attacks [295]. However, experiments on MNIST [295] also show a trade-off between privacy and performance. Though some works aim to protect privacy while achieving good performance [106], privacy remains an important factor to consider in the future development of personalized models.
• Finally, Bayesian approaches for personalization are still needed. Since the client datasets are private, the local features are hidden, and only with a probabilistic perspective can true personalization occur. Much like how random effects are the key concept behind accounting for unit-to-unit heterogeneity, Bayesian methods can characterize the statistical patterns of clients for deep personalized FL.

V. META-LEARNING
Meta-learning, or learning-to-learn, is the science of observing how different learning algorithms perform on a wide range of tasks and then learning new tasks more efficiently based on prior experience [282,46,102,203,212,246]. Instead of training a new model from scratch, meta-learning aims to use historical information to improve its future learning performance. This paradigm opens a new opportunity to resolve many challenges in IoFT such as scalability, fast adaptability, and improved generalization. With proper prior knowledge, a well-trained model can rapidly generalize to new tasks with few samples. This few-shot property becomes especially crucial in IoFT, where each device only has a small amount of data and does not have access to a centralized repository of all datasets. Different from many aforementioned learning models, the primary goal of meta-learning is not to achieve good performance on all tasks in expectation. Instead, it integrates knowledge across different tasks and aims at providing a good initialization of parameters that can be readily adapted to a new task. Therefore, meta-learning can be viewed as an approach to enable fast personalization and fine-tuning.

A. FREQUENTIST PERSPECTIVE
From a frequentist perspective, the seminal work of Finn et al. [80] proposed one of the first model-agnostic meta-learning (MAML) algorithms. Given N different tasks T_1, ..., T_N, MAML optimizes a meta-loss function of the form

min_w (1/N) Σ_{i=1}^N f_{T_i}(w − η∇f_{T_i}(w)).

The meta-loss functions can be traditional loss functions such as the cross-entropy loss or mean-squared error, but they can also be defined in other fashions (see Bayesian MAML in Sec. V-B). In this objective function, w can be viewed as an initial parameter that is used to perform one gradient update, w′ = w − η∇f_{T_i}(w), and MAML minimizes the loss f_{T_i}(w′) given the initial parameter w. Therefore, the goal of MAML is to find an optimal initial parameter w such that one gradient step on a new task incurs maximally effective behavior on that task. This idea is similar to fine-tuning, in which several gradient steps are performed starting from a pre-trained parameter w.
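For intuition, the one-step MAML meta-gradient can be computed exactly on toy quadratic tasks, where the Hessian is available in closed form. A hedged NumPy sketch (not the implementation of [80]; all names are illustrative):

```python
import numpy as np

def maml_meta_grad(w, grad, hess, eta=0.1):
    """Exact MAML meta-gradient for one task with a one-step inner update:
    d/dw f(w - eta * grad f(w)) = (I - eta * H(w)) @ grad f(w')."""
    w_prime = w - eta * grad(w)                     # inner adaptation step
    return (np.eye(len(w)) - eta * hess(w)) @ grad(w_prime)

# toy tasks: f_T(w) = 0.5 * ||w - c_T||^2, so grad f = w - c_T and H = I
tasks = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
w = np.array([5.0, 5.0])
for _ in range(100):
    g = np.mean([maml_meta_grad(w, lambda x, c=c: x - c,
                                lambda x: np.eye(2)) for c in tasks], axis=0)
    w -= 0.5 * g                                     # outer (meta) update
```

On these symmetric quadratic tasks the meta-initialization converges to the task-center average, the point from which one gradient step is equally effective for every task.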
The idea of meta-learning can be naturally extended to FL, where each device is treated as a task and the goal is to learn an initialization for fast personalization on the edge devices. Indeed, researchers have recently introduced the notion of meta-learning into FL to enable personalization and few-shot learning [159,195]. Along this line, [36], [166] and [76] bring MAML to FL. They reformulate the objective function of FL as

min_w (1/N) Σ_{i=1}^N f_i(w − η∇f_i(w)),          (5)

where, more generally, the inner update may consist of t local steps of SGD at the edge device. This formulation allows each user to exploit w as an initial point and update it with respect to its local data (e.g., running t steps of gradient descent). When t = 1, the gradient of each user's objective can be computed as

∇F_i(w) = (I − η∇²f_i(w)) ∇f_i(w − η∇f_i(w)).

Based on this formulation, [76] propose the Per-FedAvg algorithm to efficiently optimize (5) and demonstrate the advantages of Per-FedAvg on image classification tasks. Interestingly, [120] interpret FedAvg as a linear combination of FedSGD and MAML when the second-order term η∇²f_i(w) is ignored. Specifically, assume each client runs T steps of local updates; then the gradient of FedAvg can (up to second-order terms) be written as

g_FedAvg = Σ_{t=0}^{T−1} g_MAML(t),

where g_MAML(0) = g_FedSGD is the gradient of FedSGD and g_MAML(t) is the gradient of MAML with t steps of local updates. Though the optimization of this linear combination is not strictly equivalent to FedAvg (due to the second-order term), this interpretation sheds light on the intrinsic connection between FL and meta-learning. Based on this observation, [120] slightly modify the Per-FedAvg algorithm as follows: it first runs FedAvg (or another conventional FL algorithm) at the early stage of training and then switches to a personalized FL algorithm such as MAML or Reptile [211]. Through many empirical results, they argue that this combined strategy ensures fast and stable convergence compared with the MAML algorithm.
This paper also delivers an important message: no single FL algorithm is a panacea for all problems.
Besides the success stories of MAML, there are also many variants of meta-learning-based FL algorithms. For instance, MetaSGD [162,36] specifies a coordinate-wise learning rate η for the MAML objective function and treats both w and η as parameters to optimize. Inspired by fair resource allocation problems in wireless networks, [158] propose the q-FFL algorithm for fairness. They slightly modify the loss function by raising each user's loss to the power q + 1:

min_w Σ_{i=1}^N (1/(q+1)) f_i(w)^{q+1}.

The intuition is that q tunes the amount of fairness: the algorithm incurs a larger loss for users with poor performance. Therefore, q-FFL can encourage uniform accuracy across all users, which is equivalent to designing personalized models to achieve uniform performance. This property can be further extended to the meta-learning setting, where each device is viewed as a task and the goal is to learn a model initialization that can readily adapt to new tasks using limited training samples. [60] consider an adaptive personalized FL algorithm in which each user optimizes a hybrid model

β_i = α_i v_i + (1 − α_i) w*,

where v_i is the local weight parameter of user i and w* is the optimal global parameter. The tuning parameter α_i balances the importance of the local and global models. When users' data are i.i.d., they argue that the global model generalizes better and suggest choosing a smaller α_i. On the other hand, when data are heterogeneous, they choose a larger α_i to encourage personalization.
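The hybrid model of [60] reduces to a simple convex combination at prediction time. A toy sketch with hypothetical names, for a linear model:

```python
import numpy as np

def personalized_predict(x, v_i, w_global, alpha_i):
    """Hybrid personalized prediction: mix the local weights v_i with the
    global model w_global; a larger alpha_i means more personalization."""
    beta_i = alpha_i * v_i + (1 - alpha_i) * w_global
    return x @ beta_i

x = np.array([1.0, 2.0])
v_i = np.array([2.0, 0.0])      # local model of client i
w_g = np.array([0.0, 2.0])      # global model
y_hat = personalized_predict(x, v_i, w_g, alpha_i=0.5)   # -> 3.0 (halfway blend)
```

Sweeping α_i from 0 to 1 moves the prediction continuously from the purely global to the purely local model, which is exactly the trade-off the heterogeneity level should govern.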
[271] consider a multi-task primal-dual learning framework and treat each client as a task. This approach can avoid over-fitting at the central server and provides improved generalization for local users. However, their framework only applies to convex objective functions, since strong duality is not guaranteed for general non-convex functions. Besides that, a few recent works have studied FL from an online learning perspective. For instance, [128,153] use the Average Regret-Upper-Bound Analysis (ARUBA) framework to improve the performance of FedAvg and differentially private algorithms; this is a route worthy of further investigation. We envision that meta-learning will have a great impact on IoFT, and it is promising to tailor various existing meta-learning algorithms to FL. For instance, adapting personalized learning rates for each device based on local sharpness is one potential research direction [321]. Furthermore, exploring and contrasting the theoretical properties of the aforementioned algorithms [94], especially when objectives are non-convex or even non-differentiable [76], is an important issue to address. Derived rates and statistical errors may help guide practitioners in choosing among the vast array of methods given their data properties and features.

B. BAYESIAN PERSPECTIVE
Besides the frequentist perspective on meta-learning, there have been a few recent efforts to explore Bayesian meta-learning. Below we discuss some of these advances.
Bayesian meta-learning simply defines w as a random variable and takes a Bayesian route to estimate its posterior. This posterior then serves as a "prior" for a new task. The approach allows uncertainty quantification for predictions on both the central server and local clients [291]. Studies that formulate meta-learning from a Bayesian perspective include [72,14,143,144,93,315,81,91,245,280,253,228,208,335]. Most notably, [280,227] apply deep kernel methods to learn complex task distributions in a few-shot learning setting. [144,304] introduce deep parameter generators that capture a wide range of parameter distributions; however, this method is not amenable to first-order stochastic optimization methods and may therefore scale poorly. Notably, [315] propose a Bayesian counterpart of MAML (BMAML) for fast adaptation and uncertainty quantification. Specifically, at communication round t, the algorithm applies T steps of local Stein variational gradient descent (SVGD) [170] to obtain task-wise updated parameters for each task τ sampled from a mini-batch T_t. As described in [170], SVGD maintains M instances of model parameters, called particles, which can be viewed as samples from the posterior distribution. Mathematically, SVGD updates each particle w_m with a first-order step

w_m ← w_m + ε φ(w_m), where φ(w) = (1/M) Σ_{m′=1}^M [ k(w_{m′}, w) ∇_{w_{m′}} log p(w_{m′}) + ∇_{w_{m′}} k(w_{m′}, w) ],

k(·,·) is a positive-definite kernel, and p is the target posterior. There are many possible choices of meta-loss; [315] argue that the following "chaser" meta-loss can prevent model over-fitting:

L_BMAML(w) = Σ_{τ∈T_t} d_s( {w_{τ,m}^T}_{m∈[M]}, {w_{τ,m}^{T+s}}_{m∈[M]} ),

where d_s is a dissimilarity measure between two sets of particles, the particles {w_{τ,m}^T} are obtained by T SVGD steps on the training data of task τ, and {w_{τ,m}^{T+s}} by s further SVGD steps on D_τ^{trn+val}, a batch of training and validation data from task τ. The particles {w_{τ,m}^{T+s}} are used to approximate the true but unknown posterior distribution of task τ. The intuition of this meta-loss formulation is very clear: instead of minimizing an empirical loss, which might cause over-fitting, it finds the initialization whose adapted particles best approximate the true posterior distribution.
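The SVGD particle update used inside BMAML can be sketched on a toy target whose score function is known in closed form. This NumPy illustration (hypothetical names; a fixed-bandwidth RBF kernel rather than the median heuristic of [170]) shows particles initialized far from a standard-normal target being transported toward it:

```python
import numpy as np

def rbf_kernel(W, h=1.0):
    """RBF kernel matrix and its per-particle gradient for particles W (M x d)."""
    diffs = W[:, None, :] - W[None, :, :]            # (M, M, d): w_i - w_j
    sq = np.sum(diffs ** 2, axis=-1)
    K = np.exp(-sq / (2 * h))
    grad_K = -K[:, :, None] * diffs / h              # grad_{w_i} k(w_i, w_j)
    return K, grad_K

def svgd_step(W, score, eps=0.1, h=1.0):
    """One SVGD step: w_m += eps * (1/M) sum_m' [k * score(w_m') + grad k],
    i.e. an attractive (score) term plus a repulsive (kernel-gradient) term."""
    M = len(W)
    K, grad_K = rbf_kernel(W, h)
    S = np.array([score(w) for w in W])              # grad log p at each particle
    phi = (K @ S + grad_K.sum(axis=0)) / M
    return W + eps * phi

# target posterior: standard normal, whose score is grad log p(w) = -w
rng = np.random.default_rng(1)
W = rng.normal(loc=5.0, size=(20, 1))
for _ in range(300):
    W = svgd_step(W, lambda w: -w)
```

The repulsive kernel term is what keeps the M particles spread out, so that together they approximate the posterior rather than collapsing onto its mode.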
Other approximation schemes can also be found in the work of [14,143], where Gaussian or Laplace approximations are employed to learn the task posterior. Similar to BMAML, [81,335] reframe MAML from a variational inference perspective.
There is also an interesting trend that formulates meta-learning from a stochastic process perspective [86,129,179,175]. Most notably, [87] introduce neural processes (NPs), which combine advantages of neural networks and Gaussian processes (GPs). Given a point (x, y)_i, the algorithm first computes a representation r_i = h((x, y)_i), where h is a neural network. It then defines a latent distribution z ∼ N(µ_z(r), Iσ_z(r)), where r = a({r_i}) = (1/n) Σ_i r_i and a is called the aggregator. Finally, a conditional decoder g, learned through variational inference, takes samples of z and testing inputs x* to make predictions y*. Fig. 9 illustrates the inference procedure of NPs. One key feature of NPs is that they encode the input data into a single order-invariant global representation. This representation captures the global uncertainty, which allows sampling at a global level. Similar to GPs, NPs define a distribution over functions and are capable of capturing predictive uncertainty; however, NPs also inherit computational challenges from GPs. Overall, the few-shot learning property of meta-learning opens many possibilities for personalized FL. We hope this review helps inspire continued exploration of federated meta-learning algorithms.
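The NP dataflow (encode, aggregate, sample, decode) can be sketched with untrained random-weight networks. This illustrates only the order-invariant architecture, not a trained model; all names and the toy latent parameterization are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(sizes):
    """Random-weight MLP (illustration only; in an NP these would be trained)."""
    Ws = [rng.normal(size=(a, b)) / np.sqrt(a) for a, b in zip(sizes, sizes[1:])]
    def f(x):
        for W in Ws[:-1]:
            x = np.tanh(x @ W)
        return x @ Ws[-1]
    return f

h = mlp([2, 16, 8])        # encoder: (x, y) -> r_i
g = mlp([8 + 1, 16, 1])    # decoder: (z, x*) -> y*

def np_predict(context_x, context_y, x_star):
    """NP-style forward pass: encode context points, aggregate by the mean,
    sample a global latent z, and decode a prediction at x_star."""
    R = h(np.column_stack([context_x, context_y]))   # per-point representations r_i
    r = R.mean(axis=0)                               # order-invariant aggregation
    mu_z, sigma_z = r, np.abs(r) + 0.1               # toy latent parameters from r
    z = mu_z + sigma_z * rng.normal(size=r.shape)    # sample the global latent
    return g(np.concatenate([z, [x_star]]))

y_star = np_predict(np.array([0.0, 1.0]), np.array([0.5, 1.5]), x_star=0.5)
```

Because the aggregator is a mean, permuting the context points leaves r, and hence the predictive distribution, unchanged, which is the order-invariance property highlighted above.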

VI. STATISTICAL AND OPTIMIZATION PERSPECTIVE
A. STATISTICAL PERSPECTIVE
In this subsection we discuss the challenges and open directions for IoFT from a statistical perspective. In particular, a number of statistical challenges such as heterogeneity, dependence and sample bias under privacy and communication constraints need to be addressed in the FL setting. Part of the challenge is developing a suitable modeling framework for IoFT that allows the assessment and validation of methods addressing the aforementioned challenges.
Much of the prior work on FL has focused on deep learning algorithms due to their predictive power. If we are interested in questions associated with statistical estimation and inference, it makes sense to extend FL to incorporate models that are interpretable in addition to having good predictive capabilities. Algorithms that come to mind include kernel methods (see e.g. [12,135,242,243]) and Gaussian processes (see e.g. [92,138,244,320]), among other approaches. Like deep neural networks, all of these approaches exhibit non-linearities and function complexity whilst being more amenable to statistical inference.
One of the places where statistics can make a significant contribution to IoFT is in weakening the i.i.d. assumption that is used in most FL algorithms. The i.i.d. assumption is violated in two ways: (1) statistical heterogeneity and (2) dependence across nodes/users.

Dependence: Statistical dependence in IoFT is a common challenge since nodes may be dependent (due to geographic spatial dependence, common features, etc.), while many current FL algorithms rely on independence assumptions. In correlated settings, even classic SGD leads to a biased estimator of the gradients [46]. The same challenges exist in FL, with the additional challenges posed by communication and privacy constraints. Yet here lies a significant opportunity: if learned effectively, a dependence structure can be exploited to improve prediction, update sampling schemes, and better allocate resources. More specifically, statistical approaches for dealing with dependence typically involve learning a suitable dependence structure (e.g., a graphical model) amongst the nodes of interest and exploiting that structure (e.g., [133,230,244]). Following this line of thinking, there are two challenges: (1) develop an FL approach to learn a suitable dependence structure amongst the nodes; and (2) exploit the learned dependence structure for improved inference and prediction within FL. While there are numerous techniques to address both (1) and (2) in the standard centralized setting (e.g., [167,226,247]), FL presents additional challenges due to the communication and privacy constraints.

Statistical heterogeneity: In the IoFT setting, statistical heterogeneity is a common issue as individual devices potentially collect different amounts and types of data. For example, mobile users may have different languages and settings, and even the number of data points may vary across devices. Hence the i.i.d. paradigm is violated, and techniques that adapt to this statistical heterogeneity are needed.
In particular, rather than learning a single global model, it is beneficial to develop more device-specific models using personalized or meta-learning approaches as highlighted in Secs. IV and V. A couple of challenging statistical issues arise in terms of how we model the heterogeneity and develop suitable algorithms that tradeoff a global model with device-specific features.

1) Open Directions
Following the discussion above, there are three significant open statistical challenges in IoFT. Firstly, the general question of developing a suitable statistical modeling, hypothesis testing, and validation framework: much of the current work on IoFT and FL has focused on algorithm development, and thus far there has been little on developing statistical models. Secondly, a central question for addressing both dependence and statistical heterogeneity is learning a suitable network model in a way that satisfies privacy and communication constraints. Finally, using these modeling and network learning approaches to develop and assess prediction/estimation procedures.

Modeling, validation and hypothesis testing (beyond deep networks): To this point, there has been little work in FL on suitable statistical modeling and hypothesis testing procedures. Whilst there may be a general reluctance to impose statistical models, since they are never truly correct and there are advantages to being model-agnostic, imposing statistical models in the IoFT setting provides a number of potential advantages. First and foremost, modeling provides a way to develop and assess a hypothesis testing approach.
We begin by proposing a potential model that incorporates statistical heterogeneity and dependence. Let the joint distribution of data pairs at client j be P^(j)_{x,y}, and assume several models for the relationship among {P^(j)_{x,y}}_{j=1}^N. For this type of model, we assume x^(j) ∼ P_x for all clients, while the conditional distribution of y^(j) given x^(j) differs across clients 1 ≤ j ≤ N. One possible form for this conditional distribution is

y^(j) = f(x^(j); w^(j)) + ε^(j),

where different clients share the same function f (e.g., a deep neural network, kernel, Gaussian process, etc.) but have different parameters w^(j) ∈ R^p, which allows for personalization. This represents N semi-parametric models [21], and a key question is what structure to impose on the w^(j)'s to incorporate the degree of statistical heterogeneity and dependence. Assumptions such as graphical models, low-rank models, sparse models, and many others may be incorporated (e.g., [100,29]). Graphical models naturally lend themselves to network learning, while low-rank models naturally lend themselves to clustering of nodes.

Network learning: Learning networks (especially through the graphical modeling framework) is a problem that has received significant attention and research in the statistics and machine learning literature [167,247,319]. In the FL setting, when there is dependence amongst the nodes, learning a graphical model/network structure potentially provides significant gains in overall performance. However, there remains the open challenge of adapting and implementing these graphical modeling algorithms, which learn pairwise sufficient statistics, to respect communication and differential privacy constraints.
At the heart of the challenge is that if our goal is to learn the network structure amongst nodes, second-order statistics are required in the computation. From a privacy and communication perspective, this requires communication between all pairs of nodes in order to compute these second-order sufficient statistics. One open challenge is to carry ideas from differential privacy over to network learning problems. A key idea is to incorporate privacy-preserving techniques (such as sketching or randomization [68,185,69,241]) that improve privacy whilst still learning the sufficient statistics.
Other dependence models that might be considered are joint Gaussian process models, where dependence across nodes is again encoded by a covariance matrix. However, to improve privacy and communication considerations, we may assume that the covariance matrix/kernel is parameterized by a small set of parameters (e.g., the Matérn or rational quadratic kernels). Prior work has developed approaches for parameter estimation of Gaussian processes using SGD [139], and a natural question is whether such approaches extend to the FL setting.

Validating dependence models: The final challenge associated with network learning for IoFT is to address the question of whether the learned network improves predictive power, and if so, which approaches are best. For example, would network learning improve predictive performance compared to clustering nodes and doing personalization? Since the ultimate goal is predictive power (although the network in itself may contain useful information), this presents a natural validation metric for network learning methods. Due to the modeling framework, predictive power can be validated theoretically using refined bounds that incorporate dependence, through simulation studies, and on real data examples.

B. OPTIMIZATION PERSPECTIVE
Learning a global model collaboratively has been extensively studied in the distributed optimization literature. In such settings, the goal is to minimize a global objective function, expressed as the sum of the agents' objective functions, in a distributed manner. As in existing data-driven systems, the orchestration server plays the role of distributing the computations across agents. In IoFT settings, however, local agents learn the model without recourse to data sharing. This commonly adopted "computation then aggregation" (CTA) protocol poses significant theoretical and computational hurdles for which numerical methods and techniques used in traditional optimization and machine learning fail. In this section, we discuss FL problems from an optimization perspective. More specifically, we provide a brief overview of recent algorithmic methods proposed for FL and discuss potential optimization techniques still to be explored in FL settings.
In recent years, many researchers have proposed FL versions of existing optimization algorithms. One popular example of such methods is FedSGD, a distributed version of SGD. In particular, this method applies a local SGD step on a sample of clients and passes the local gradients to the server for aggregation at every iteration. While utilizing parallel computation yields efficient training for large datasets, such methods incur high communication costs since they require passing each client's gradient vector to the server at every iteration.
To remedy these challenges, several algorithms tailored to FL settings were proposed. One popular instance is FedAvg, which reduces the amount of communication by applying multiple local SGD steps in each communication round. Despite being widely used in practice, several recent results have shown degraded performance in the presence of heterogeneity [329]. In fact, the heterogeneity present in practical FL settings renders existing machine learning algorithms, which assume i.i.d. data samples, nonviable. Several algorithms have recently been proposed to mitigate the issue of heterogeneity. A class of these algorithms adds a local regularization term to the clients' objectives; popular examples include FedProx [156], FedDyn [2], FedDANE [157], SCAFFOLD [124], and FedPD [327].
FedProx tackles heterogeneity by adding a quadratic proximal term that penalizes local solutions far from the global model, hence suppressing the variance among the solutions communicated back to the server. In addition to the proximal term, FedDyn adds a gradient correction term that aligns the stationary solutions of the local and global models. Aiming at the latter feature, [327] proposed a primal-dual approach that penalizes the variation between local and global parameters. By adopting a similar linear regularization term, [124] also proposed an algorithm, denoted SCAFFOLD, that uses control variates to correct the client drift. While existing methods tackle heterogeneity by adding regularization, it is worth exploring algorithms that instead add adaptive ball constraints when minimizing local objectives. Such methods can control the variability of local parameters and can align the stationary solutions of the local and global models when the radius of the ball constraints converges to zero in the limit.
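The FedProx-style local subproblem can be sketched as follows — a toy NumPy illustration (hypothetical names) showing how the proximal term pulls the local solution back toward the global model:

```python
import numpy as np

def fedprox_local(w_global, local_grad, mu=0.1, lr=0.05, steps=100):
    """FedProx local solver sketch: approximately minimize
    f_i(w) + (mu/2)||w - w_global||^2 by gradient descent, so the local
    solution cannot drift far from the global model."""
    w = w_global.copy()
    for _ in range(steps):
        w -= lr * (local_grad(w) + mu * (w - w_global))
    return w

# toy client whose local optimum (at [3, -3]) differs from the global model
w_global = np.zeros(2)
local_opt = np.array([3.0, -3.0])
w_local = fedprox_local(w_global, lambda w: w - local_opt, mu=1.0)
```

With this quadratic toy loss the proximal solution lands exactly between the global model and the client's own optimum, with µ controlling how far the client is allowed to drift.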
Another class of FL methods uses adaptive step-sizes for the client and server optimizer steps. This approach is motivated by the practical benefits repeatedly observed when training machine learning models with adaptive optimization methods. Examples of FL versions of such algorithms are FedAdam and FedYogi [249]. We refer the reader to Section 3.2 for a broader and deeper discussion of the algorithms presented above. The convergence guarantees and complexity rates of many of these algorithms were established under a variety of assumptions; see [121,124,327]. These rates and assumptions allow practitioners to choose the algorithm that best fits their application.

1) Open Directions
Popular algorithms proposed for solving optimization problems that arise in machine learning models can be classified into three categories:
• First-order optimization methods that use local first-order information to iteratively find a desirable solution; popular examples include SGD, GD, primal-dual methods, and ADAM.
• Second-order optimization methods that utilize second-order information; popular examples include the quasi-Newton method and its variants.
• Zeroth-order methods that utilize a heuristic, derivative-free approach for updating the sequence of iterates.
Almost all existing FL algorithms belong to the class of first-order methods. Designing second-order and zeroth-order methods tailored to FL settings remains an interesting area to explore. The former class of algorithms can potentially outperform first-order algorithms when the global objective is highly non-linear or ill-conditioned, since such methods aim at effectively using curvature information for a faster convergence rate. To overcome the computational burden of computing the Hessian of the objective, quasi-Newton methods estimate the curvature using first-order information, and stochastic versions of these approaches have been proposed for machine learning settings [31]. Motivated by their superior performance on ill-conditioned objectives, a natural research direction is to investigate FL variants of such algorithms.
Zeroth-order algorithms are another class of optimization methods, useful in problems where only noisy evaluations of the objective function are accessible [54,251]. Several recently emerging machine learning applications [255,44] have brought zeroth-order algorithms significant attention. Studying FL variants of zeroth-order algorithms is a potential research direction that can pave the path for interesting FL applications.
Another interesting direction is exploiting non-convex min-max optimization [200] for enhancing robustness against model misspecification and for ensuring fairness across clients [158]. Such fair FL methods aim at non-discriminatory federated learning with respect to some protected groups, and several approaches formulate this learning problem as a non-convex min-max optimization problem [200] for which a wide variety of algorithms have been proposed in non-FL settings. Most commonly, stochastic gradient descent-ascent (SGDA), which applies an ascent step followed by a descent step at every iteration, is used in practice. Applying SGDA and its variations is undesirable in FL since it requires communication at each iteration. A natural potential research direction is to investigate FL variants of such methods. One potential approach when the maximization problem is (strongly) concave is to use duality theory to jointly minimize the model parameters and the dual variables. Further research along this line may also help guard FL algorithms against significant drifts on some clients' data.
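For reference, plain gradient descent-ascent on a toy strongly-convex-strongly-concave saddle looks as follows; in an FL variant, these descent/ascent gradients would be aggregated across clients only every few rounds rather than at each step (all names are hypothetical):

```python
def gda(grad_x, grad_y, x, y, lr=0.05, steps=2000):
    """Gradient descent-ascent: an ascent step on the max player y,
    followed by a descent step on the min player x, at every iteration."""
    for _ in range(steps):
        y = y + lr * grad_y(x, y)      # ascent on y
        x = x - lr * grad_x(x, y)      # descent on x
    return x, y

# toy saddle: f(x, y) = 0.5*x**2 + 2*x*y - 0.5*y**2, saddle point at (0, 0)
gx = lambda x, y: x + 2 * y            # df/dx
gy = lambda x, y: 2 * x - y            # df/dy
x, y = gda(gx, gy, x=3.0, y=-2.0)
```

The strong convexity in x and strong concavity in y are what make this iteration contract; on purely bilinear games, plain (S)GDA can cycle, which is one reason FL variants need care.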

VII. APPLICATIONS
In the previous sections we discussed the defining features of IoFT and data-driven modeling approaches for decentralized inference. Yet IoFT both shapes and is shaped by the applications it encompasses. This boils down to a crucial question: how will IoFT shape different industries, and what domain-specific challenges must it overcome to become standard practice? Through the lens of domain experts, we shed light on the following sectors: manufacturing, energy, transportation, quality & reliability, computing, healthcare, and business. Note: to keep notation simple, this section slightly abuses earlier notation.

A. MANUFACTURING
The fourth industrial revolution (Industry 4.0), which is undergirded by smart technologies like IoT, is expected to create up to $3.7 trillion in value by 2025 [16,151]. In the United States alone, 86 percent of manufacturers believe that smart factories built on Industry 4.0 will be the main driver of competition by 2025. Furthermore, 83 percent believe that smart factories will transform the way products are made [296]. However, only five percent of US manufacturers surveyed in a recent study reported full conversion of at least one factory to "smart" status, with another 30 percent reporting they are currently implementing initiatives related to smart factories [142]. This means that nearly two out of three (65 percent) manufacturers surveyed report no progress on initiatives that they overwhelmingly point to as their main driver of near-term competitiveness in five years [142].
Distrust is listed as one of the dominant factors inhibiting the spread of Industry 4.0 [204]. The current paradigm of IoT, where data is agglomerated in a central server, does not foster trust. Instead, it breeds concerns about privacy and security [28]. Also, there are several time-sensitive applications that could be advanced by Industry 4.0 but are inhibited by the current IoT paradigm. For example, through cloud computing, manufacturing machines could benefit from advanced control algorithms that significantly improve their performance [220]. However, with an IoT infrastructure reliant on exchange of data with a centralized server, internet latency becomes a significant challenge [218]. Moving large amounts of data to and from a central server also demands high internet bandwidth. Another major challenge of the current IoT landscape is that it is poised to benefit large enterprises at the expense of small and medium-sized enterprises. Given the concerns around privacy and security, companies are inclined to use private rather than public cloud infrastructures [101]. However, smaller companies are unlikely to have the capital to set up and maintain their own private cloud infrastructure. Even if they could set one up, they are unlikely to generate sufficient data volumes for meaningful big data analytics.
IoFT could help overcome the aforementioned challenges and create many new opportunities in present-day manufacturing. For example, it could enable vertical integration of IoT across a manufacturing ecosystem, which is key to capturing value from Industry 4.0 [204]. The ability for entities to keep their data private while collaborating on a shared model could allow original equipment manufacturers (OEMs), for instance, to integrate their data analytics with those of their suppliers to help improve quality across their entire supply chain. This benefits the OEMs as well as their suppliers. Similarly, the ability to develop a shared model without compromising privacy could help level the playing field between large and small enterprises. Small companies that cannot afford private cloud infrastructure can benefit from public cloud infrastructures without sharing their data. Moreover, even if they do not have large enough datasets for analytics, they can benefit from the data of other entities through a shared model. Data analytics for time-sensitive applications can also be run at the edge [202], closer to the device or machine, to reduce latency while still benefiting from a shared cloud-based model across several machines [182]. The same benefit extends to bandwidth-intensive applications. IoFT will also be a key enabler of futuristic paradigms like massively distributed manufacturing (MDM), which was briefly described in Sec. 1. MDM involves the manufacture of products by a large, diverse, and geographically-dispersed but coordinated network of individuals and organizations with agility and flexibility, but at near-mass-production quality, productivity, and cost effectiveness [219]. A cyber-physical operating system (CPOS), which intelligently, efficiently, and securely coordinates large networks of cloud-connected, autonomous, and geographically-dispersed manufacturing resources, will be needed to support MDM.
The importance of operating systems to support present-day distributed manufacturing has been highlighted in recent works, together with ideas on how to realize them [57,85]. However, in the context of MDM, some distinguishing features of CPOS are that it will optimally allocate manufacturing jobs to the resources connected to it, and leverage distributed and democratized delivery systems, like Uber/Lyft and drones, for logistics. It will apply machine learning to the data gathered from sensors to help assure and improve quality, and to optimize operations. Furthermore, it will leverage the ingenuity of humans via crowdsourcing of ideas to improve manufacturing operations across networks of manufacturers. It will leverage cybersecurity measures to protect intellectual property and privacy of participants. CPOS will thus allow the collaboration of large, autonomous, heterogeneous, and geographically dispersed networks of manufacturers to rapidly respond to production demands and disruptions with agility and flexibility, while ensuring high quality, productivity, and cost effectiveness [219].
IoFT will allow CPOS to maintain the autonomy, privacy, and security of all the participants in MDM while enabling them to develop shared models that improve quality (and other performance metrics) across the entire system. MDM, enabled by CPOS and IoFT, promises to improve the responsiveness and resilience of manufacturing to urgent production demands (e.g., in emergencies like pandemics); promote mass customization and cost-effective low-volume production; gainfully employ many ordinary citizens in manufacturing (e.g., through the gig economy); and reduce the environmental footprint of manufacturing by producing items closer to their points of use.
In a nutshell, Industry 4.0 is poised to transform the manufacturing industry, but it needs a transition from traditional IoT to IoFT for its promise to fully materialize. IoFT will help alleviate issues around privacy, security, cost, data paucity, communication latency, and bandwidth that are slowing down the adoption of IoT solutions in the manufacturing industry. It will also help catalyze new paradigms of manufacturing, for example, massively distributed manufacturing. To facilitate the transition of the manufacturing industry from traditional IoT to IoFT, the challenges discussed in Sec. II have to be addressed in the context of manufacturing.

B. ENERGY
Modern society increasingly depends on complicated electric power systems. The US end use of electricity reached 3.99 trillion kilowatt hours (kWh) in 2019, and the total demand is expected to increase in the next decades [75,73,186]. Rapid developments in energy infrastructure and technology provide numerous opportunities for implementing new energy applications and services to meet demand, as depicted in Fig. 11. In particular, the market share of variable renewable energy sources, such as wind and solar, which provide local and distributed energy, grew to 19% in 2019 [73,186]. It is expected that the electricity generation from renewables will double over the next 30 years [74]. Advanced communication capabilities, smart meter installations, mobile internet, and other smart technologies are enabling grid-responsive demand response and management services, such as shedding, shifting, and modulating load in peak and off-peak periods, while minimizing occupant discomfort [207,83,116,223,152]. Additionally, increased use of battery storage technology and the growing penetration of electric vehicles will also change electricity supply and usage patterns [174,286].

FIGURE 11. Recent developments in energy infrastructure and technology
Facing this massive transformation, IoT has the potential to provide system-wide, integrated approaches to managing modern power systems. The IoT infrastructure's ability to capture and analyze data-intensive systems like the energy system can play a key role in managing renewables, demand response programs, electric vehicles, and other elements. Data collection and the use of intelligent algorithms can monitor and control the energy supply chain, including production, delivery, and consumption, so that suitable, cost-effective decisions can be made to balance supply and demand with minimal disruption to system operations. From the perspective of energy supply, since energy generation is an asset-intensive industry, data analytics can improve the efficiency of power production [103,330]. On the demand side, buildings equipped with smart monitoring and communication devices can analyze end users' energy consumption, identify their needs, and transform consumers into prosumers, adjusting their demand in response to system conditions [103,330,206].
While IoT can empower data-intensive analytics by providing an integrated platform to collectively gather and process data, applying data science methods to analyze complex energy systems in a centralized IoT platform is not effective due to many intractable complexities, which makes the IoFT paradigm more suitable.
First, massive data generation from sensors, actuators, and other devices in energy systems requires big data analysis. For example, condition monitoring sensors, such as the vibration sensors in wind turbine gearboxes, produce dozens or hundreds of high-dimensional observations, and smart meters in residential, commercial, and industrial buildings produce massive amounts of end-use data at high frequency. In the standard cloud-based IoT framework, where centralized cloud/data centers collect and process data, the energy consumption for big data processing is substantial, possibly negating the benefits of IoT for the energy industry.
Second, transmitting all the energy data to the central cloud can cause communication latency. Considering that electric power demand needs to be satisfied in real time, such latency poses a serious risk to energy system operations. In power grid operations, ancillary services allow the grid operator to maintain balance between supply and demand at all times [334]. Ancillary services range in duration, ramping requirements, and magnitude [254,325]. These services become more important as renewable energy sources rapidly replace fossil fuel generation. Wind and solar, the fastest-growing renewables, are characterized by significant variability, often with limited predictability [30,5]. Smart and grid-interactive buildings can provide such grid flexibility by adjusting their end-use patterns [22,196]. Storage is also an attractive option for providing ancillary services. Communication latency causes ineffective coordination of these ancillary service resources and negatively affects grid reliability. As such, fast local updates for enhanced predictions of both electricity supply and demand are important for successful ancillary service implementation.
Lastly, the modern power system is characterized by distributed energy due to the growing penetration of renewables, storage, and demand response. In these distributed systems, individual stakeholders, such as utilities and consumers, can perform their own decision-making [32,104,263]. For example, utilities that manage their own renewable facilities may not want to share information with others in an attempt to maximize their profit. Those who participate in demand response programs may want to adjust their end-use demand upon grid request without revealing their energy use patterns to others due to privacy concerns. The decentralized IoFT framework provides the right platform for such decentralized and/or distributed decision-making.
IoFT research in the energy industry is in its infancy. Saputra et al. [260] proposed a communication model for an EV network with a charging station provider as the centralized node that collects information from individual charging stations. Instead of sharing local data, the charging stations shared locally trained models with the charging station provider.
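At a high level, this share-models-not-data pattern can be sketched as follows. This is a minimal illustration under our own assumptions (synthetic demand data, a linear model standing in for whatever model class the cited work actually uses): each charging station fits a model on its private data, and the provider aggregates only the fitted weights via a sample-size-weighted average, in the style of federated averaging.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-station data: features (e.g., time-of-day encodings) ->
# charging demand. Raw data never leaves the station; only weights do.
def local_fit(X, y):
    # ridge-regularized least squares, trained entirely on-device
    return np.linalg.solve(X.T @ X + 1e-3 * np.eye(X.shape[1]), X.T @ y)

true_w = np.array([2.0, -0.5, 1.0])        # shared demand trend (synthetic)
stations = []
for _ in range(8):
    n = int(rng.integers(30, 80))          # stations hold different data volumes
    X = rng.normal(size=(n, 3))
    y = X @ true_w + 0.1 * rng.normal(size=n)
    stations.append((X, y))

# Provider: sample-size-weighted average of the locally fitted weights
ns = np.array([len(y) for _, y in stations])
W = np.array([local_fit(X, y) for X, y in stations])
w_global = (ns[:, None] * W).sum(axis=0) / ns.sum()

print("aggregated weights:", np.round(w_global, 2))
```

The weighting by sample count means data-rich stations influence the shared model more, while every station keeps its demand records private.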
While IoFT can remedy the limitations of centralized IoT by building individual models locally at each end node, there are several challenging issues. First, end nodes often have limited computing power to train data science/machine learning models. As discussed above, sensors, actuators, and smart meters produce massive data. Inefficient computation at end nodes can delay predictive decision-making, fault diagnosis/condition monitoring, and change point detection, among other tasks [30,50,307]. New data science methods are needed to optimally guide the model learning process toward computational efficiency, with theoretical and practical implications in the IoFT paradigm.
Next, unlike traditional power supply with fuel-based generators and end users who passively consume energy, modern power systems consist of highly heterogeneous units with distinct supply/demand characteristics. On the demand side, technologies including smart devices, different demand management programs, mobile internet, and electric vehicles affect end-use patterns 24/7. On the supply side, energy units are becoming more diverse and heterogeneous. Unlike traditional fuel-based sources, each renewable facility has distinctive power generation characteristics (e.g., the facility layout and turbine type of each wind farm [316]). While the personalized learning discussed in Sec. IV can address this heterogeneity to some extent, managing a large number of heterogeneous supply/demand units with distinct energy characteristics is challenging. Hence, personalized prediction needs to be translated into effective collaborative management.
Finally, energy consumption is significantly affected by ambient environmental and other localized conditions [214]. Peak demand predictions vary 1.5-2% for every 0.5°C difference in predicted temperature [19,215]. Electricity needs vary due to many spatially localized characteristics: for example, densely populated areas experience urban heat island effects in summer that increase electricity demand compared to suburban and rural areas [114,187], and EV charging stations exhibit different charging patterns depending on localized characteristics. Renewable generation is also directly influenced by local weather and geographical conditions [115]. Expanding the use of IoFT for environmental modeling will require modeling spatially and temporally correlated environmental conditions while incorporating local heterogeneous characteristics.
In summary, the modern power system faces technical challenges including computational scalability, efficiency, heterogeneity, localized characteristics, and distributed management. IoFT has the potential to address these challenges, and the successful development and implementation of IoFT will make the energy system (and its end users) "smarter" about efficiency, flexibility, and economic competitiveness.

C. TRANSPORTATION
The prevalence of smart personal devices and the emergence of connected vehicle technology provide a plethora of opportunities for vehicles, travelers, and the transportation infrastructure to be in constant communication. This connectivity offers the promise of a safer and more sustainable transportation system with enhanced levels of mobility and accessibility. Connectivity allows subsystems that were previously modeled and optimized separately to now be modeled as a single system, thereby capturing the interactions between them. This comprehensive modeling approach allows for increasing system throughput, which results in many benefits for travelers (e.g., lower prices, less congestion, more reliable travel times, lower levels of greenhouse gas emissions) as well as the transportation system (less pressure on the infrastructure). Take the example of traffic signal control systems. Traffic signal controllers are traditionally optimized locally, either per intersection or for a set of intersections within an arterial. This optimization is based on local information: as vehicles approach an intersection, they activate loop detectors deployed in the pavement, sending a signal to roadside controllers. The controller then optimizes the traffic signal so as to minimize total delay. In an arterial setting, the optimization of the controllers at downstream intersections could be further informed by the state of the upstream intersections. Connected vehicle (CV) technology provides two unique opportunities for traffic signal control: (1) controllers can be optimized proactively before vehicles arrive at intersections using the messages received from connected vehicles; and (2) arterial-level optimization can be advanced to network-level optimization by customizing basic safety messages (BSMs) transmitted by vehicles to include route-choice information, or by estimating this information from standard BSMs [294,199,3].
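The first opportunity, proactive control from CV messages, can be sketched with a deliberately simplified example. All numbers and the delay model below are illustrative assumptions (queue discharge and lost time are ignored): given arrival times reported by connected vehicles before they reach the stop bar, the controller picks the green split of a fixed cycle that minimizes predicted total delay.

```python
import numpy as np

rng = np.random.default_rng(4)
cycle = 60.0                                    # fixed cycle length (s), assumed
ns = np.sort(rng.uniform(0, cycle, 18))         # CV-reported arrivals, north-south
ew = np.sort(rng.uniform(0, cycle, 7))          # CV-reported arrivals, east-west

def total_delay(arrivals, g_start, g_end):
    """Delay until the approach's green window opens (queueing ignored)."""
    served = np.where(arrivals <= g_end,
                      np.maximum(g_start, arrivals),   # wait for green to start
                      g_start + cycle)                 # missed green: next cycle
    return float(np.sum(served - arrivals))

# Proactive split: north-south gets green on [0, g], east-west on [g, cycle].
# Choose g to minimize predicted total delay *before* vehicles arrive.
splits = range(5, 56)
best_g = min(splits, key=lambda g: total_delay(ns, 0, g) + total_delay(ew, g, cycle))
print("chosen north-south green (s):", best_g)
```

Because the heavier north-south approach reports more arrivals, the optimizer allocates it a longer green than a fixed even split would, which is precisely the advantage of acting on messages received ahead of arrival.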

FIGURE 12. Network-level intersection control
Despite the benefits that connectivity can offer, the existing methodologies are generally not scalable enough to allow for centralized decision making in connected systems. This lack of scalability creates a critical bottleneck in leveraging connectivity in transportation applications, especially since the state of the environment in transportation systems changes dynamically. As such, decentralized methodologies that can approximate solutions for centralized operations are of high interest. Take the example of a shared mobility system, such as a ridesourcing, ridesharing, or micromobility service. Although the principle of sharing resources has been used in transportation systems for several decades (e.g., transit or carpooling), it was the advent of smart and connected personal devices that led to the unprecedented growth in shared mobility systems. Consider the well-known ridesourcing company, Uber. From an operational point of view, Uber can be considered a fleet operator. However, traditional optimization-based fleet dispatching schemes are not scalable for Uber as it scales up its operation to entire cities, states, and countries. Uber can fulfill ride requests in densely-populated urban regions using myopic solution methodologies (e.g., dispatching the closest vehicle to a request's pick-up location) with short wait times, providing a high level of service. However, as the demand level diminishes in suburban areas, using myopic matching schemes leads to degradation of the level of service, prompting Uber to not offer its services when demand falls below a threshold. The need for a decentralized solution methodology for solving centralized matching problems in shared mobility systems has been acknowledged in the literature. The proposed solutions typically fall under one of two categories: decomposition methods [190,328,132,56] or partitioning and clustering approaches [278,231].
Both families of solutions attempt to solve a large-scale optimization problem by means of solving smaller sub-problems, typically by adopting an iterative procedure that asymptotically approaches the optimal solution. Despite successfully striking a balance between solution quality and time, these methods face practical challenges that limit their applicability, including the lack of a guarantee of finding a feasible solution within a specified period of time and the high set-up cost of the problem in a fast-evolving dynamic environment. These challenges can be addressed by continuously updating a global model in federated learning, where the computational burden is divided among local devices.
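The trade-off between one large centralized matching problem and partitioned sub-problems can be illustrated on a toy instance. This is a sketch under our own assumptions (synthetic locations; a crude sort-and-split rule stands in for the clustering approaches cited above): the partitioned solution solves several small assignment problems instead of one large one, at the cost of a total matching distance that can only be as good as the centralized optimum.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(2)

# Hypothetical fleet-matching instance: match drivers to ride requests so that
# the total pick-up distance is minimized. Locations are synthetic.
n = 200
drivers = rng.uniform(0, 10, size=(n, 2))
requests = rng.uniform(0, 10, size=(n, 2))

def match_cost(d_idx, r_idx):
    """Optimally match the given drivers/requests; return total distance."""
    cost = np.linalg.norm(drivers[d_idx][:, None] - requests[r_idx][None], axis=2)
    rows, cols = linear_sum_assignment(cost)     # Hungarian algorithm
    return cost[rows, cols].sum()

# Centralized: one large assignment problem over the whole service area
central = match_cost(np.arange(n), np.arange(n))

# Partitioned: sort both sides by x-coordinate, split into 4 equal blocks,
# and solve each much smaller sub-problem independently.
k = 4
d_order = np.argsort(drivers[:, 0])
r_order = np.argsort(requests[:, 0])
partitioned = sum(match_cost(d_chunk, r_chunk)
                  for d_chunk, r_chunk in zip(np.array_split(d_order, k),
                                              np.array_split(r_order, k)))

print(f"centralized: {central:.1f}, partitioned: {partitioned:.1f}")
```

Since the union of the sub-problem matchings is itself a feasible global matching, the partitioned cost is always at least the centralized optimum; the Hungarian algorithm's cubic scaling is what makes the smaller sub-problems attractive in practice.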
Due to the high computational complexity of optimization-based approaches, there has been a recent surge in interest to leverage deep learning in transportation applications (e.g., [209,154]). The benefit of deep learning models is that once trained, their evaluation typically takes a fraction of a second, rendering them effective for real-time applications. However, training high-performing models requires immense amounts of data. FL can be especially effective with deep learning models for two reasons. First, it enables training a global model by incorporating several local (possibly heterogeneous) datasets, thereby enabling training generalizable deep learning models.
Second, FL theory provides answers to critical questions such as: what is the best approach to handling heterogeneity in local datasets, and under what circumstances would multiple personalized FL models outperform a single global model? Answers to these questions would facilitate using high-performing deep learning models to make operational decisions in dynamic systems. For example, adopting FL could allow Uber to learn a global matching policy that can be customized for regional operations with minimal additional training to capture local idiosyncrasies. Such regional models are likely to outperform myopic algorithms that use only spatio-temporally local data.
The application of FL in transportation systems is not limited to connected intersections and shared mobility systems. Other existing applications that rely on the spatiotemporal gathering of peers, such as vehicle platooning [148] and P2P wireless power transfer [1], can be enabled by FL solutions. To effectively operate such systems, centralized but fast decision making is necessary. In such applications, FL can bridge the inherent trade-off between solution accuracy and the computational complexity of finding a solution. Additionally, it is anticipated that connected and autonomous vehicle (CAV) technology will give rise to new applications that leverage connectivity, and therefore by nature require fast centralized decision making.
In addition to improving system throughput, FL can be used in transportation systems for privacy preservation purposes. In the age of autonomous vehicles, training models that can predict the motion of different traffic agents, e.g., vehicles, pedestrians, cyclists, etc., is of utmost importance. Typically, roadside sensors, such as cameras, can be used to obtain historical trajectories based on which trajectory prediction models can be trained. However, transmitting camera recordings or other identifiable data to a central server may create privacy concerns. FL can address these concerns as it allows for processing the data on the edge device and only sending the gradients needed for updating the global model to the server. Similarly, other models that rely on sensitive traveler data, such as mode choice, destination choice, and route choice models, can be effectively trained using FL.
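This gradient-sharing pattern can be sketched minimally as follows (synthetic data; a linear model stands in for a trajectory-prediction network): each roadside unit computes the gradient of its private loss locally and transmits only that gradient, which the server averages to update the global model, in the style of federated SGD.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical setup: each roadside unit holds private trajectory features X
# and targets y. Only per-round gradients leave the device, never X or y.
true_w = np.array([0.8, -0.3])                 # shared motion trend (synthetic)
clients = []
for _ in range(6):
    X = rng.normal(size=(40, 2))
    y = X @ true_w + 0.05 * rng.normal(size=40)
    clients.append((X, y))

w = np.zeros(2)                                # global model at the server
lr = 0.1
for rnd in range(100):
    grads = []
    for X, y in clients:                       # computed locally; data stays put
        grads.append(X.T @ (X @ w - y) / len(y))
    w -= lr * np.mean(grads, axis=0)           # server averages the gradients

print("learned weights:", np.round(w, 2))
```

Gradients can still leak information in adversarial settings, which is why this pattern is often combined with secure aggregation or differential privacy, but it removes the need to ship raw camera-derived trajectories to the server.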

D. HEALTHCARE
Healthcare stands to benefit greatly from IoFT because several unique contextual factors suggest that the status quo has failed when it comes to deploying machine learning (ML): (i) many existing models fail to generalize; (ii) legal and ethical implications limit the appetite to share data; (iii) the vendors who administer the electronic health records (EHRs) that contain patients' data have an outsized influence on model deployment; (iv) national network-based research efforts have started to adopt federated methods. In this section, we describe how these factors impact healthcare differently from other sectors and illustrate areas where IoFT is likely to thrive in this domain.
ML models are commonly used in early warning systems, in diagnostic systems for radiology and pathology, and in the interpretation of medical device output such as in electrocardiography. While the medical literature contains numerous examples of apparently high-performing models, many of these studies suffer from poor generalization, which can be demonstrated either through an independent examination of their methods (by applying the PROBAST tool [299]) or through external validation of the findings. This was particularly evident early in the COVID-19 pandemic, where nearly all studied models in a large systematic review were considered to be poorly generalizable. Although the pandemic has affected millions of people in the U.S. alone, most health systems did not have a sufficient number of patients, or adequate diversity, to ensure model generalizability. The lack of generalizability is not merely theoretical; it has also been systematically demonstrated. In a recent study examining over a thousand cardiovascular clinical prediction models, 81% of validation studies found worse performance than was reported originally [297].
Generalizability improves when models are developed using pooled data from multiple health systems. Under the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule, healthcare data may be shared for the purposes of research if identifiers have been removed, or under certain circumstances if patients have authorized the use of their data for research after approval by an institutional review board [217]. In most instances, data sharing between health centers also requires legal agreements known as "data use agreements" or "business associates agreements". Even with such agreements in place, sharing of data between health systems may go against the expectations of the general public [233]. Thus, pooling data from multiple health systems, while enabling better ML models, may potentially damage public trust. Indeed, when a 2019 partnership between Ascension and Google was reported on by the Wall Street Journal, it resulted in public outcry [55].
The difficulty faced by health centers in combining data with other health systems has led to a vacuum that has been largely filled by EHR vendors. Indeed, two of the most widely used healthcare ML models in the U.S. include the Epic Deterioration Index (owned by Epic Systems in Verona, Wisconsin) and the APACHE-IV scores (owned by Cerner Corporation, Kansas City, Missouri) [269,332]. Both models were developed using data from the EHR systems of multiple hospitals. Because EHR vendors have direct access to patient data (on behalf of their clients), they are well-positioned not only to combine data for analysis (with permission) but also to deploy the resulting models within their EHRs. This siloing of data, while in the best interests of patients, has led to major challenges in the development of high-quality, non-proprietary, freely available models. The healthcare informatics community has responded to this challenge but there is still a long road ahead. In 2009, the development of the Shared Health Research Information Network (SHRINE) enabled federated querying of clinical data repositories [293]. Conceptually, federated querying of multi-hospital data allows researchers to identify optimal sites for ML model development based on rapid multi-system sample size determinations [292]. A federated querying system, known as the 4CE Consortium, was rapidly deployed to support COVID-19 research [25]. More recently, federated ML has been applied to the development of multi-hospital models in healthcare [261,266,126]. Federated learning allows health systems to share access to a model library and deployment engine without directly sharing data, as depicted in Fig. 13.
IoFT will be important in healthcare because it enables the creation of high-quality ML models without the privacy risks. However, IoFT will have to contend with Food and Drug Administration (FDA) regulations that treat ML algorithms as a type of software-as-medical-device (SaMD). Initial applications of IoFT in health will likely be limited to class I and II medical devices, which include ML models that are used primarily to inform care decisions. Whereas class III medical devices (e.g., cardiac pacemakers) require extensive review and premarket approval by the FDA, class I-II devices only require premarket notification demonstrating equivalence with a legally marketed device. As a result, IoFT will likely be applied in supporting early warning systems in hospitals, in automating order entry, and in smart scheduling of patient visits. In each of these scenarios, global patterns exist but differ locally, to the extent that the combination of global and local models will likely achieve superior results without sacrificing patient privacy.

E. BUSINESS
Capturing and maintaining relationships between businesses raises many challenges for FL, and new paradigms are needed to meet them. To get an insight into these challenges, consider a business (the principal) that has the following decision to make: shall it build a facility to supply a particular item, or shall it 'outsource' production to another business (the agent)? In the former option, all the risks are carried by the principal, while in the latter they are shared with the agent. In today's economy, dominated by fast technological developments, outsourcing is often preferred. Despite the advantage of risk sharing, outsourcing comes at a cost: the agent has more information about its operations and can use this information in its favor and at the expense of the principal, which is generally termed 'moral hazard' in the economics literature [145,273].
The example above highlights several key aspects of this relationship. First: businesses are often intrinsically federated and are reluctant to share their proprietary business secrets. This federation forces decentralization of decisions, and thus all the advantages of IoFT, such as localized computing, data privatization, security, and information privacy, can be realized, along with the advantages of risk sharing. Second: a key challenge is that in business applications, agents may have different and often competing objectives. This poses a new challenge for FL. Indeed, the models in Secs. III-V are focused on maximizing a common objective. Even if a formulation that somehow includes the agents' conflicting objectives exists, its successful implementation is contingent on truthful reporting by the agents. Otherwise, the principal has to continuously monitor the agents' operations. Depending on the situation, monitoring may be impossible or too expensive, and thus may nullify the advantage of decentralized operations. Third: the principal can make arrangements with many independent agents, and each of these may have arrangements with other independent businesses supplying its services. This creates an organizational structure of relationships, forming a hierarchy of agents (i.e., a rooted tree with agents at various distances (levels) from the root). Here, agents at intermediate levels (excluding the root agent and the agents at the leaves) play two roles: as a principal to their subordinate nodes and as an agent to the higher-level node. This adds an additional layer of complication for the use of FL in such business setups.
Though businesses are organized in a federated system and are ideally suited to FL, the full potential can only be realized if the above-stated problems are effectively solved. There are some recent encouraging developments toward this end; below we highlight one possible solution. In the single-agent case, [259] proposed a mechanism that mitigates moral hazard and effectively decentralizes decision making. In a continuous dynamic setup, at epoch t the principal observes a noisy signal x(t) about the 'effort' of the agent and pays compensation c(t) to the agent based on this signal. It is shown that when the signal noise is generated by a Brownian motion (a Gaussian process), there exists a 'control' variable y(t) that can drive the noise of the signal to a level at which the principal can make a good decision about the compensation for the agent. In a more realistic setting, the principal may create such relationships with several agents, each with private information, data, and objectives. There may also be interactions between the decisions of agents, i.e., decisions by an agent may affect the outcomes of other agents' actions, and agents may have objectives that conflict with each other and with the principal's, making the moral hazard problem harder to mitigate. This case has been studied by [177], who integrated the notion of Nash equilibrium into the model. Thus, if all the agents' decisions form a Nash equilibrium, no agent can gain by falsifying information when all the other agents do not. This mitigates the moral hazard, and the decisions arrived at can be implemented. Fig. 14 shows a representation of decentralized decision making with two agents.
A brief overview of the solution methodology for a dynamic optimization problem in continuous time over a finite or an infinite horizon is as follows. The optimization problem faced by the principal is to find a policy that maximizes its expected discounted profit over the horizon, i.e., for the infinite horizon case,

maximize E[ ∫₀^∞ e^{−r_P t} f_P dt ],

where r_P, f_P, and α_P are, respectively, the discount factor, profit function, and data of the principal, subject to the individual rationality constraint that each agent's expected discounted profit over the horizon exceeds some predetermined minimum amount. When all randomness in the formulation is driven by Brownian motions, it can be shown that, under the optimal policy, the expected discounted profits of the principal and the agents are martingales. Using Bellman's principle, this formulation is decomposed into federated optimization problems that are independently solved by each agent, while the principal solves a constrained Hamilton-Jacobi-Bellman (HJB) optimal control problem. The HJB equation solves for the continuation value (the 'value' function of dynamic programming) of the principal as a function of the state variables, which include the continuation values of each agent. It is obtained by applying Itô's formula to this function and using the fact that the expected profit of the principal is a martingale when an optimal policy is adopted. In the case of a hierarchical system, agents at the intermediate levels independently solve both an optimization problem and a constrained HJB problem, thus achieving a key goal of IoFT.
In conclusion, the decomposition mechanism described above between the principal and possibly several agents facilitates the use of federated learning, in which each participant can effectively use the data collected from its operations and determine more accurately its profits, given the compensation it receives from the principal. The method described above is but one possible solution; data-driven reinforcement learning is another viable option. Through this example, our hope is to encourage researchers to further explore decentralized decision making within IoFT.

F. QUALITY ENGINEERING
IoT as an enabling technology for real time data sharing has stimulated a new paradigm in quality engineering, which widely expands quality control from the design and manufacturing stages to the whole product life cycle. For example, the GE Prognostic Health Management Plus (PHM+) system uses engines' on-board sensors to collect operating data during flight. These data are communicated through a secured network and analyzed by the central server to provide proactive maintenance services for customers. Similarly, the automotive industry uses vehicles' on-board sensors to monitor real time driving performance, allowing for early warnings of potential problems. Additionally, manufacturers can deploy integrated vehicle-based safety systems (IVBSS) [78] to improve customers' driving experience and safety. Nowadays, quality assurance throughout the product life cycle is highly demanded for customer satisfaction, which is especially imperative for expensive or safety-sensitive products. Further, these in-field operational data can provide quick feedback for continuous quality improvement. Online monitoring and fault diagnosis, which uses either on-board sensors on products or in-situ process sensing signals during manufacturing, is one of the most important research issues in quality engineering. Currently, most collected process sensing signals or quality inspection data are multistream waveform, image, or video signals sampled at very high frequencies [301,237,287]. These require huge bandwidth and time to transmit the original data between local devices and the central orchestrator. More importantly, quality control requires fast decisions and real time detection of anomalies; therefore, decision making ought to happen on the edge and not on a central system. The shift towards IoFT will allow tackling both of these challenges.
In IoFT, only summary statistics and low-dimensional information are transmitted to the central server (or perhaps to a peer in a peer-to-peer network); moreover, models reside on the edge and can be deployed immediately. The IoFT platform therefore offers distinctive advantages for quality control: a reduced communication load and real time decisions.
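As a rough illustration of this communication pattern, the sketch below (function names, thresholds, and data are purely hypothetical) has an edge device reduce a high-frequency waveform to a few summary statistics, make the monitoring decision locally, and ship only the small summary upstream:

```python
import math
import random

def summarize(signal):
    """Reduce a high-frequency waveform to a few summary statistics."""
    n = len(signal)
    mean = sum(signal) / n
    var = sum((x - mean) ** 2 for x in signal) / (n - 1)
    return {"n": n, "mean": mean, "var": var}

def edge_decision(signal, baseline_mean, baseline_sd, k=3.0):
    """Local k-sigma check on the signal mean -- the decision stays on the edge."""
    s = summarize(signal)
    z = (s["mean"] - baseline_mean) / (baseline_sd / math.sqrt(s["n"]))
    s["alarm"] = abs(z) > k
    return s  # only this small dict is sent upstream, never the raw waveform

random.seed(0)
in_control = [random.gauss(0.0, 1.0) for _ in range(1000)]
shifted = [random.gauss(0.5, 1.0) for _ in range(1000)]
print(edge_decision(shifted, 0.0, 1.0))  # alarm should be True for the shifted run
```

Whatever the actual monitoring statistic, the point is that the raw 1000-sample waveform never leaves the device; a handful of numbers does.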
That being said, there are some unique research issues to be addressed to advance quality control methodologies under the IoFT platform. Below we iterate over a few.

Insufficient data: Quality control models, such as anomaly detection or fault diagnosis models [169,168], are poised to greatly benefit from IoFT. In statistical process control (SPC), many edge devices or clients may lack sufficient data to build a normal operating baseline for detecting abnormal changes. For instance, (i) clients may not have observed the full set of possible anomalies, as shown in Fig. 15; (ii) new products/processes possess few data, as do small-scale clients (such as the 3D-printing citizens in Sec. I-A) and low-volume manufacturing of rare and expensive products (e.g., planes). IoFT as an emerging technology offers a medium to borrow strength across different clients for better SPC models while preserving copyrights and privacy. For instance, through meta-learning within IoFT, clients may directly adapt to new products/processes.

Design of experiments (DOE): DOE has a rich history in quality and is instrumental to its success [302,45,34,123,305]. In the realm of IoFT, a key question is how DOE can be achieved. For instance, for expensive experiments or computer models, DOE often uses a sequential strategy to find the next design points that best help in estimating the response surface [109] or providing statistical inference. In IoFT, such an expensive computer experiment may reside on the central server, or perhaps each client has its own computer model. To this end, how can sequential designs learn the central model, given that clients may be of different fidelities [111]? How can computer models borrow strength from each other for better calibration [234] while preserving privacy? Such questions will be of key importance in IoFT.

Continual learning: IoFT allows knowledge to be readily shared.
As a result, quality control models (e.g., anomaly detection) may be continuously updated to register new defects or to improve detection and diagnosis accuracy for old ones [59].

Human feedback and expert knowledge: Upon the detection of an anomaly, most operators perform a post-inspection of the diagnosis results (i.e., false positive or false negative). Improving models upon such expert feedback will be important in IoFT. Indeed, much like data in IoFT, human knowledge is often decentralized, with different entities having expert knowledge on different elements of a system. Therefore, modeling approaches that combine expert knowledge, human feedback, and data-driven models are needed. Such models have evolved recently under the notion of expert- or physics-guided data-driven modeling; however, they are yet to be explored under the IoFT paradigm.

Quality control: As described in Sec. VII-A, quality control will immensely benefit from a shared library of knowledge, be it a library of in-control behaviours, common anomalies, root causes, etc. Many companies are reluctant to collaborate in building such a library due to privacy constraints. IoFT may bring this end-goal to fruition.
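To make the insufficient-data issue concrete, below is a minimal sketch (all numbers hypothetical) of how a server could pool per-client baseline statistics into shared control limits that a data-poor client can borrow, using the standard pooled-moment formulas so that no raw data leaves any client:

```python
def pool_baselines(stats):
    """Combine per-client (n, mean, var) summaries into a global
    in-control baseline; only summary statistics are exchanged."""
    n_tot = sum(s["n"] for s in stats)
    mean = sum(s["n"] * s["mean"] for s in stats) / n_tot
    # pooled variance = within-client part + between-client part
    var = sum((s["n"] - 1) * s["var"] + s["n"] * (s["mean"] - mean) ** 2
              for s in stats) / (n_tot - 1)
    return mean, var

clients = [
    {"n": 500, "mean": 10.02, "var": 0.98},
    {"n": 300, "mean": 9.97,  "var": 1.05},
    {"n": 20,  "mean": 10.40, "var": 1.50},  # data-poor client
]
mu, var = pool_baselines(clients)
ucl, lcl = mu + 3 * var ** 0.5, mu - 3 * var ** 0.5  # shared 3-sigma limits
print(round(mu, 3), round(lcl, 2), round(ucl, 2))
```

The 20-sample client alone could not estimate reliable control limits; with the pooled baseline it monitors against limits backed by 820 samples.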
In conclusion, quality control is set to greatly benefit from IoFT, yet many challenges still need to be tackled to realize its potential.

G. COMPUTING
With the goal of gaining insights without exposing raw data, large technology companies such as Google, Apple, and Firefox have started to deploy FL for computer vision and natural language processing tasks across user devices [62,311,40,99]; others, including NVIDIA, apply FL to create medical imaging AI [161]; smart cities perform in-situ image training and testing on AI cameras to avoid expensive data migration [105,119,176]; and video streaming solutions use FL to interpret and react to network conditions [308]. However, we believe that these applications of FL are only scratching the surface, given that the applications of ML in computing are far broader; many of them can be deployed more widely and improved by leveraging FL. In the following, we present an incomplete overview of the many existing and potential applications of FL in computing. The common theme across many of them is enabling information sharing between multiple administrative domains without having to share raw private data.
Databases: Indexes play a critical role in speeding up query processing in database management systems (DBMS). In recent years, learned indexes have been gaining popularity, whereby an ML model replaces traditional index structures such as the B-Tree, Hash-Table, and Bitmap. These learned indexes can be classified into two broad categories: static, read-only indexes [140] focus on read-heavy workloads, while updatable indexes [64] can handle lookups as well as the inserts and deletes common in write-heavy workloads. Nonetheless, all of these works focus on applying ML within a single administrative domain, which restricts the usefulness of learned indexes to workloads that have already been observed within the domain and leaves them with potentially weaker performance on previously unseen workloads. Applying FL in this context will enable collaborative training among multiple competing domains without sharing raw data. Indexes are only one of the many research challenges in the database literature; ML and FL have possible applications in, among others, transaction processing, lock management, query planning/optimization, and cardinality estimation.
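As a toy illustration of the learned-index idea (a sketch of the general technique, not any particular system's design), a least-squares line can map a key to an approximate position in a sorted array, with the model's maximum training error bounding the local search that corrects the prediction:

```python
import bisect

class LearnedIndex:
    """Toy learned index: a least-squares line predicts a key's position
    in a sorted array; a bounded local search corrects the error."""
    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        xs, ys = self.keys, list(range(n))
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        varx = sum((x - mx) ** 2 for x in xs)
        self.slope = cov / varx
        self.icpt = my - self.slope * mx
        # maximum prediction error on the keys defines the correction window
        self.err = max(abs(self._predict(x) - y) for x, y in zip(xs, ys))

    def _predict(self, key):
        p = int(self.slope * key + self.icpt)
        return min(max(p, 0), len(self.keys) - 1)

    def lookup(self, key):
        p = self._predict(key)
        lo, hi = max(p - self.err, 0), min(p + self.err + 1, len(self.keys))
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < len(self.keys) and self.keys[i] == key else None

idx = LearnedIndex(range(0, 1000, 3))  # keys 0, 3, 6, ...
print(idx.lookup(300))  # position of key 300
print(idx.lookup(301))  # key not present -> None
```

In an FL variant, multiple domains would collaboratively fit the model parameters (slope, intercept, error bound) over their combined key distributions without exchanging the keys themselves.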
Networking: Networks are inherently distributed, and networking protocols are no exception. As a result, FL is a natural fit for many networking problems where ML can be applied, and has already been applied in limited scope (e.g., because not all data can be copied to a centralized location). Over the past decade, many networking problems have relied on ML techniques: for example, to infer datacenter topology [51], to determine hyperparameters for congestion control algorithms [298], for Internet-scale congestion control using deep reinforcement learning [117], for leveraging single and multiple paths in an adaptive manner [67,88], and for routing [281]. These efforts primarily relied on a single trust domain (e.g., a datacenter, an Internet AS, etc.) where everything is controlled by a single entity and within which data can be shared. FL can expand the scope of many of these algorithms to a broader scale via privacy-preserving learning that may incentivize multiple domains to collaborate: learning a global model and then personalizing it to their own needs.
Cloud Computing: To cope with the increasing number of Internet users as well as IoT and edge devices, large organizations leverage tens to hundreds of datacenters and edge sites. Collecting data related to end-user sessions, monitoring logs, and performance counters, and thereafter analyzing these data and personalizing services, can significantly improve the overall user experience. Traditional approaches to ML require moving all these data to a centralized cloud datacenter, which is often impossible due to bandwidth constraints and data privacy regulations. Federated computation, including FL between multiple sites and clouds, is the natural choice in this context to address both concerns [146,236].
Video Analytics: Cameras deployed for traffic control and surveillance continuously record and analyze large volumes of video using video analytics [105,118,323], which has been made possible by recent advances in computer vision. A key challenge in this context is training large models, typically in datacenters, before they are deployed in the wild. Traditional centralized training is both expensive and narrow; the latter follows from the fact that the models are trained on relatively small datasets. With the advent of smart cameras at the edge, i.e., cameras with on-board or nearby computing capabilities, we can leverage FL to train models on much bigger training datasets, which can significantly improve model accuracy and keep models continuously updated.
Video Streaming: Videos constitute the bulk of Internet traffic today, and live video streaming is a major contributor to this category. Client-side video players typically employ a variety of adaptive bitrate (ABR) algorithms to optimize users' quality of experience. Recent advances in ABR algorithms range from using reinforcement learning to generate context-specific ABR algorithms [189] to, more recently, demonstrating that generating many such algorithms does not necessarily perform better than using FL to generate one model that works in conjunction with classic video streaming techniques [308]. A key research direction here is leveraging federated reinforcement learning to combine the best of both approaches and find a balance between global and personalized ABR algorithms.
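The FL step implied here can be sketched as server-side weighted averaging of client model parameters, FedAvg style; the three clients and the four-parameter ABR model below are purely illustrative:

```python
def fed_avg(client_weights, client_sizes):
    """One FedAvg round: average client model parameters, weighting each
    client by its number of local training samples."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [sum(w[i] * s for w, s in zip(client_weights, client_sizes)) / total
            for i in range(n_params)]

# three hypothetical video clients, each holding a 4-parameter ABR model
w_global = fed_avg(
    [[0.1, 0.2, 0.3, 0.4], [0.3, 0.2, 0.1, 0.0], [0.2, 0.2, 0.2, 0.2]],
    [100, 100, 200],
)
print(w_global)
```

Personalization would then start each client's local model from `w_global` and fine-tune on its own network conditions, as discussed in Sec. IV.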

H. RELIABILITY
Reliability engineering is concerned with the failure behavior of a system under stated conditions. A failure can be catastrophic, meaning a complete, sudden, and often unexpected breakdown of the system, leading to significant or even total loss of system performance. It can also be a degradation-induced soft failure (e.g., the capacity drop of a lithium-ion battery). There are several ways to evaluate the reliability of a product, though evaluation based on reliability data is generally the most common. Reliability data usually come in the form of lifetime data or degradation data. However, in these datasets, failure data are scarce, given that most products are highly reliable and do not fail often. FL, with its privacy-protecting protocols, provides a unique opportunity to overcome the challenge of scarce data in reliability engineering.
Reliability data have long been classified as sensitive information by companies. With millions of products released into the marketplace by manufacturers, reliability data are both extensive and comprehensive; yet their sensitivity hinders their usability. In such a scenario, FL provides a unique opportunity to enable knowledge sharing from available datasets without compromising privacy. For instance, existing reliability databases (e.g., of product lifetimes) can be replaced by a summary (or prior) distribution of the model parameters for each product/component. Further, there exist scenarios where a product category has only a few leading manufacturers (e.g., the smartphone industry). Here, the designs from different manufacturers are distinct, implying a certain degree of heterogeneity [314]. For a smartphone manufacturer, reliability information from previous generations of its own line of smartphone products can be more useful than information from other manufacturers. Such a setting poses another unique challenge for FL. In the following, we discuss the potential applications of FL in three settings: (i) among a group of manufacturers producing a similar product, (ii) within a manufacturer, and (iii) within a reliability organization.
We first start with FL among a group of manufacturers. Consider reliability testing for evaluating a product's reliability, where the reliability data are lifetime data subject to right censoring. Generally, the Weibull distribution, with reliability function R(t) = exp{−(t/α)^β} (scale α and shape β), and the lognormal distribution (location µ and scale σ) are two of the most commonly used distributions for describing a product's lifetime data [197] (moving forward we focus on the Weibull distribution, though similar logic applies to other distributions). The Weibull shape parameter β is commonly believed to depend on the product type (i.e., the failure mode due to the material used, e.g., corrosion of semiconductor material) or the failure mode due to customer usage (e.g., a user breaking their cellphone). This parameter can be regarded as insensitive information about the product's reliability. On the other hand, the Weibull scale parameter α (also known as the characteristic life) usually depends on the reliability investment made by the manufacturer [197]. If a manufacturer uses only local data to evaluate product reliability, then both parameters of the lifetime distribution have to be estimated, and the uncertainties in both will affect the precision of the final reliability evaluation. It is therefore reasonable to advocate sharing information on β, so as to decrease the uncertainty in β, which eventually helps all manufacturers achieve a more accurate evaluation of product reliability. Additionally, since information on α is not shared, a manufacturer cannot infer the product reliability of other manufacturers.
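One concrete payoff of sharing β: each manufacturer's local estimation collapses to a one-parameter problem with a closed-form MLE. The sketch below (the life-test data are hypothetical) computes the Weibull scale α under right censoring when β is treated as known:

```python
def weibull_scale_mle(times, failed, beta):
    """MLE of the Weibull scale alpha when the shape beta is known/shared,
    with right censoring: alpha = (sum_i t_i^beta / r)^(1/beta),
    where the sum runs over all units and r is the number of failures."""
    r = sum(failed)
    if r == 0:
        raise ValueError("no failures observed")
    return (sum(t ** beta for t in times) / r) ** (1.0 / beta)

# hypothetical local life test: 5 failures, 3 units right-censored at 700 h,
# with a shared shape beta = 2.0 broadcast by the central server
times  = [120.0, 340.0, 410.0, 500.0, 620.0, 700.0, 700.0, 700.0]
failed = [1, 1, 1, 1, 1, 0, 0, 0]
alpha_hat = weibull_scale_mle(times, failed, beta=2.0)
print(round(alpha_hat, 1))
```

With β fixed, all estimation uncertainty concentrates in α, and the manufacturer's characteristic-life estimate never needs to leave its premises.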
Operationally, we can use a Bayesian approach. Let us take the Weibull distribution for demonstration and provide a rough sketch of the parameter updating process in an IoFT system. First, for large sample sizes, the posterior distribution of log β can be well approximated by a normal distribution (log β is used to ensure the positivity of β). When a manufacturer has recently conducted a life test and requests an update, or when the central server randomly chooses a manufacturer and mandates an update, the manufacturer first receives a broadcast of the current posterior distribution of log β. The manufacturer then uses this posterior distribution of log β, together with its local posterior distribution of α (possibly obtained from previous product testing), as a prior distribution for the newly collected reliability data. A routine Bayesian update gives the new posteriors of α and log β. The manufacturer then computes the mean and variance of the new posterior of log β and returns the updated posterior distribution to the central server. Finally, the central server can check the discrepancy between the broadcasted and updated distributions to safeguard against data corruption during transmission or malicious attacks. If acceleration is used in the life test, the parameters of the acceleration model (such as the activation energy in the Arrhenius model) contain no sensitive reliability information; thus, such parameters can also be federated together with the Weibull shape. The same idea extends to accelerated degradation testing, where FL can be applied to the shape parameter of the mean degradation paths and to the acceleration parameters. Fig. 16 provides a schematic view of the discussed protocol. Next, we explore the application of FL within a manufacturer.
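Before moving on, the local update step of the protocol above might look like the following rough sketch: the posterior of log β is approximated on a grid using the normal prior broadcast by the server and a censored Weibull likelihood, and only the updated mean and variance are returned. For simplicity the sketch treats α as fixed locally rather than carrying its full posterior, and all prior values and data are hypothetical:

```python
import math

def update_log_beta(prior_mean, prior_var, times, failed, alpha):
    """Approximate the posterior of log(beta) on a grid: normal prior from
    the server, local (possibly right-censored) Weibull lifetime data."""
    sd = prior_var ** 0.5
    grid = [prior_mean + sd * (-4.0 + 8.0 * i / 400) for i in range(401)]
    logpost = []
    for lb in grid:
        beta = math.exp(lb)
        lp = -0.5 * (lb - prior_mean) ** 2 / prior_var  # normal prior on log(beta)
        for t, f in zip(times, failed):
            z = (t / alpha) ** beta
            # failure (f=1): full Weibull log-density; censored (f=0): log-survival = -z
            lp += f * (math.log(beta) - math.log(alpha)
                       + (beta - 1.0) * math.log(t / alpha)) - z
        logpost.append(lp)
    m = max(logpost)
    w = [math.exp(v - m) for v in logpost]
    tot = sum(w)
    mean = sum(g * wi for g, wi in zip(grid, w)) / tot
    var = sum((g - mean) ** 2 * wi for g, wi in zip(grid, w)) / tot
    return mean, var  # only these two moments go back to the server

mean, var = update_log_beta(math.log(2.0), 0.25,
                            times=[120.0, 340.0, 500.0, 700.0],
                            failed=[1, 1, 1, 0], alpha=600.0)
print(mean, var)
```

The server would then re-broadcast the refined normal approximation of log β, shrinking every manufacturer's shape uncertainty without any lifetime data changing hands.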
The underlying idea here is that when a certain product is sold to customers, the collection of user data for early prediction of product failure must comply with privacy terms, and thus is restricted. Given the computational and communication capabilities of the product, FL provides a unique advantage in the presence of privacy constraints. Consider, as an example, lithium-ion batteries, which are widely used in electric vehicles. It is well known that the usage pattern has a significant impact on the state of charge (short term) and the remaining useful life (long term) of lithium batteries [232]. However, it is almost impossible to associate the usage pattern with these two important performance characteristics because of the difficulty of replicating the heterogeneity in users' behavior. FL provides an opportunity to train an accurate model for each characteristic. To do so, we need a global statistical model that relates the customer usage pattern, the charge-discharge pattern, and the ambient environment to the performance characteristic. The global statistical model can be a random-effects model that allows for heterogeneity among customers. The approaches introduced in Secs. III and IV can then be used to learn a global or a personalized model. Because users are in different locations, and thus different ambient environments, there are obvious covariate shifts among users; the methods reviewed in Sec. IV can be readily adopted to solve these problems.
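One way to read this global-then-personalized workflow: start from the global model and fine-tune it locally on the user's own data, so the user-specific (random) effect is captured without the raw usage data leaving the vehicle. The linear-model sketch below is minimal and synthetic; the "degradation rate" numbers are made up:

```python
import random

def personalize(global_w, xs, ys, lr=0.01, epochs=200):
    """Local fine-tuning sketch: start from the global linear model and take
    SGD steps on the user's own (usage -> degradation) data."""
    w, b = global_w
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b

random.seed(1)
# hypothetical user whose battery degrades faster than the global average
xs = [random.uniform(0.0, 2.0) for _ in range(50)]          # usage intensity
ys = [1.5 * x + 0.2 + random.gauss(0.0, 0.05) for x in xs]  # observed degradation
w, b = personalize((1.0, 0.0), xs, ys)  # global model: slope 1.0, intercept 0.0
print(round(w, 2), round(b, 2))
```

The fine-tuned slope drifts from the global value of 1.0 towards this user's faster degradation rate, which is exactly the per-customer heterogeneity a random-effects formulation captures.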
Third, we discuss FL implementations for reliability organizations. The majority of reliability organizations collect field failure data on a large variety of components from various sources. The ultimate goal is to estimate the reliability of any new system based on component reliabilities estimated from the collected databases. Some large databases can be found in OREDA [222], Mahar et al. [184], and Denson et al. [61]. Since there are millions of components, the data reported in these databases are aggregated in such a way that only a few summary statistics are provided for each component. This aggregation is based on the assumption of an exponential distribution, which makes fitting a Weibull distribution extremely difficult [41,42,43]. The FL protocol provides a better way to build such a database: instead of recording these summary statistics, we can first agree upon a distribution for each component and then maintain a posterior distribution for its parameters. For example, the inverse Gaussian and Birnbaum-Saunders distributions are commonly used for mechanical components, and the Weibull is the most popular distribution in reliability. A conjugate distribution for the parameters, or a normal distribution for the transformed parameters (to ensure positivity), can be adopted for ease of use. Every time a partner of the database has new data, the database (which serves as the central server) broadcasts the current posterior of the parameters for the component; the partner uses this as a prior and updates the posterior with its local data. This rough idea can be materialized within the framework discussed in Sec. III.
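For components kept under the exponential assumption, the broadcast-update-return loop reduces to a one-line conjugate update: a Gamma(a, b) prior on the failure rate combined with (number of failures, cumulative exposure time). All numbers below are hypothetical:

```python
def gamma_exp_update(a, b, failures, total_time):
    """Conjugate update for an exponential failure rate: Gamma(a, b) prior
    on the rate; data = (failure count, cumulative exposure time)."""
    return a + failures, b + total_time

# database broadcasts its current Gamma posterior for a component's rate
a, b = 2.0, 1000.0  # hypothetical prior: mean rate 2/1000 per hour
# a partner observed 3 failures over 5000 component-hours of local exposure
a, b = gamma_exp_update(a, b, failures=3, total_time=5000.0)
print(a, b, a / b)  # updated shape, rate parameter, posterior mean failure rate
```

The partner returns only (a, b); the database never sees the partner's individual failure records, yet every member benefits from the sharpened posterior.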
In a nutshell, the reliability of a manufactured product is usually shrouded in privacy concerns. Implementing FL within a manufacturer is thus promising for solving the issues of data transmission and user privacy. FL across manufacturers, on the other hand, is much more difficult. Nevertheless, with a proper design of the information-sharing mechanisms, FL can tremendously help manufacturers increase the accuracy of reliability estimation and prediction without sacrificing confidentiality.