The Cost of Training Machine Learning Models Over Distributed Data Sources

Federated learning is one of the most appealing alternatives to the standard centralized learning paradigm, allowing a heterogeneous set of devices to train a machine learning model without sharing their raw data. However, it requires a central server to coordinate the learning process, which introduces potential scalability and security issues. Server-less federated learning approaches, such as gossip federated learning and blockchain-enabled federated learning, have been proposed in the literature to mitigate these issues. In this work, we provide a complete overview of these three techniques and compare them according to an integral set of performance indicators, including model accuracy, time complexity, communication overhead, convergence time, and energy consumption. An extensive simulation campaign supports a quantitative analysis considering both feedforward and convolutional neural network models. Results show that gossip federated learning and the standard federated solution reach a similar level of accuracy, and that their energy consumption is influenced by the machine learning model adopted, the software library, and the hardware used. Differently, blockchain-enabled federated learning represents a viable solution for implementing decentralized learning with a higher level of security, at the cost of extra energy usage and data sharing. Finally, we identify open issues in the two decentralized federated learning implementations and provide insights on potential extensions and possible research directions in this new research field.


B. CONTRIBUTIONS
Although the literature offers papers comparing classical centralized learning in a data center with CFL [23], [24], a comparison among the different federated learning approaches (centralized versus decentralized) is still missing. In this work, we aim to fill this gap and, thus, focus on two of the most popular and widely adopted approaches for decentralizing FL: GFL and BFL. In particular, we provide a comprehensive analysis of both methods and compare them to traditional FL, i.e., CFL. Note that we combine standard performance indicators for ML models, i.e., accuracy, with indicators that quantify the efficiency of these algorithms, i.e., time complexity, communication overhead, convergence time, and energy consumption. With our comparison under fair conditions, we aim to provide the research community with a complete overview of the three approaches, so that the best model can be chosen according to the specific use case.
The contributions of this paper may be summarized as follows:
• We overview the traditional FL setting and delve into two approaches for decentralizing it. They are selected because they are among the most popular in the literature and diverge into two completely different solutions, based on gossip communication and blockchain technology, respectively.
• We provide a thorough analysis to derive the running time complexity, the communication overhead, and the convergence time of each overviewed mechanism for FL, including CFL, BFL, and GFL.
• We provide an energy model to measure the energy consumption of each solution, based on the associated communication and computation overheads.
• We assess the performance of each method (CFL, GFL, and BFL) through extensive simulations on widely used TensorFlow libraries [25].
• We delve into the open aspects of decentralized FL, providing insights on potential extensions, considerations, and software implementations for GFL and BFL.
The remainder of the paper is structured as follows: Section II reviews the related work. Section III describes the three studied algorithms (CFL, BFL, and GFL). Section IV analyzes their time complexity and communication cost, and introduces the communication model and the convergence time. Section V provides the energy model used in this paper. Then, Section VI compares the three mechanisms through simulation results. In Section VII, we propose solutions to address common BFL and GFL weaknesses. Section VIII discusses open issues of GFL and BFL and future research directions. Finally, Section IX concludes the paper with final remarks.

II. RELATED WORK
Distributing and decentralizing ML operations at the edge has been embraced as an appealing solution for addressing the issues of centralization (connectivity, privacy, and security) [26]. With FL, different devices collaborate to train an ML model by sharing local updates obtained from local and private data. The traditional FL algorithm (FedAvg), referred to as CFL in this paper, is introduced in [27]. In [11], the authors propose techniques to improve its communication efficiency. Nevertheless, CFL still requires a central server responsible for client orchestration and model aggregation. The star topology is a weak aspect of CFL: the central entity represents a single point of failure, may limit the number of devices that can participate in the training process, increases the communication cost of the learning process, and raises privacy concerns [15], [28].
An extension of CFL is proposed in [29], where a hierarchical aggregation scheme is adopted, i.e., a subset of edge devices aggregates the local updates shared by their neighbors. Further optimization for wireless networks is considered in [30], where wireless resource allocation and client selection are jointly optimized to minimize the CFL loss.
To address these challenges, decentralized federated learning has been proposed in [18]. The authors present a fully decentralized model, in which each device can communicate only with its one-hop neighbors. They also provide a theoretical upper bound on the mean square error. Ormándi et al. [19] introduce gossip FL, a generic approach for peer-to-peer (P2P) learning on fully distributed data, i.e., every device has only a single input sample to process at each round.
The same algorithm has been tested in [20] under real-world conditions, i.e., devices have multiple input samples available (rather than only one point, as originally stated in [19]), a restricted communication topology, and heterogeneous communication and computation capabilities. Removing the central server brings new challenges and opportunities. On the one hand, GFL addresses scalability properly and removes the single-point-of-failure problem of CFL. On the other hand, the lack of coordination in GFL may lead to high temporal variability and ML model inconsistencies (e.g., nodes may have different versions of the model stored in their local cache). In GFL, the participating nodes interact with each other in a distributed manner. This, for instance, hinders consensus on the ML model at the beginning of each FL round, which becomes harder as the number of participants increases. Moreover, network topologies with sparsely connected clients may further degrade GFL performance [20].
Another prominent solution to decentralize FL is blockchain-enabled FL [31], [32], [33]. A blockchain system allows clients to submit and retrieve model updates without the central server. Additionally, the usage of blockchain guarantees security, trust, privacy, and traceability; however, it introduces delays due to the distributed ledger technology. An analysis of the end-to-end latency and of the effects of blockchain parameters on the training procedure of BFL is proposed in [22].
In the literature, there exist some comparisons across FL techniques. The authors of [34] compare GFL and CFL with a logistic regression model in terms of convergence time, proportion of misclassified examples in the test set (0-1 error), and used communication resources. When nodes have a random subset of the learning samples, GFL performance is comparable with CFL; instead, CFL converges faster when a node has only labels from one class. Another comparison is proposed in [35], where FL algorithms that require a central server, e.g., FedAvg and Federated Stochastic Variance Reduced Gradient, are analyzed. Results show that FedAvg achieves the highest accuracy among the FL algorithms regardless of how data are partitioned. In addition, the comparison between FedAvg and the standard centralized algorithm shows that they are equivalent when independent and identically distributed (IID) datasets are used.
In [21], the authors compare GFL with the standard centralized, data-center-based architecture in terms of accuracy and energy consumption for two radio access network use cases. To achieve this goal, they use the Machine Learning Emissions Calculator [36] and Green Algorithms [37]. In [23], the authors compare centralized data-center-based learning and CFL in terms of carbon footprint using different datasets. The assessment is done by sampling the CPU and GPU power consumption. In [24], the authors propose a framework to evaluate the energy consumption and the carbon footprint of distributed ML models with a focus on industrial Internet of Things applications. The paper identifies specific requirements on the communication network, dataset, and model size to guarantee the energy efficiency of CFL over centralized learning approaches, i.e., bounds on the local dataset or model size. Differently from our work, the authors do not consider blockchain-enabled FL and evaluate the algorithm performance in scenarios with a small number of devices (i.e., 100). In addition, we empirically measure the energy consumption of the devices based on the real load of the computations performed during the training phase, and we provide a communication model to estimate overhead and convergence time. Finally, we introduce an analysis of the computational complexity of the three federated algorithms under study.
To sum up, in this paper, we endeavor to bridge the existing gap in the literature by providing a thorough comparison including performance analysis and cost of the different federated approaches listed above, i.e., CFL, BFL, and GFL. Differently from the other works in the literature, we combine standard metrics, i.e., accuracy, with indicators of the efficiency of these algorithms, i.e., computational complexity, communication overhead, convergence time and energy consumption. Our final aim is to contribute to the development of Green AI [38].

III. FEDERATED LEARNING IMPLEMENTATIONS
Let us consider a set of $N$ clients (or devices) $\mathcal{N} = \{1, \dots, N\}$ with their datasets $D_1, \dots, D_N$. Each local dataset $D_i$, $\forall i \in \mathcal{N}$, contains pairs $(x_i, y_i)$, where $x_i$ is the feature vector and $y_i$ its true label. The goal of a federated setting is to train a global model (e.g., a set of weights $w$) that minimizes the weighted global loss function:
$$F(w) = \sum_{i=1}^{N} \frac{|D_i|}{|D|} F_i(w), \quad (1)$$
where $F_i$ represents the local loss experienced by client $i$ and $|D| = \sum_{i=1}^{N} |D_i|$. In this scenario, devices do not share raw local data with other devices. Instead, they exchange model parameter updates, computed during several iterations by training the global model on local data. In this paper, we study three different implementations to solve the federated problem stated above, namely: CFL, BFL, and GFL. The investigated solutions are depicted in Fig. 1 and introduced in what follows. Though several variants are available in the literature, the three algorithms described next are baseline representations of the studied approaches and well suited to our purposes.

A. CENTRALIZED FEDERATED LEARNING (CFL)
At the beginning of a round $t$, a random subset of $m$ devices $S_t \subseteq \mathcal{N}$ is selected, and the server sends the current global model to the parties. Each client performs $E$ epochs on the local dataset with a mini-batch size of $B$, updates its local model $w^{t+1}_k$, and sends it to the server. The server aggregates the received local updates and generates the new global model by computing the weighted average of the local updates, as follows:
$$w^{t+1} = \sum_{k \in S_t} \frac{|D_k|}{|D|} w^{t+1}_k, \quad (2)$$
where $|D| = \sum_{k \in S_t} |D_k|$. The process is repeated until the model reaches convergence, e.g., until the loss function does not improve significantly across subsequent epochs or a given number of training rounds has been executed. In this work, we consider the FedAvg algorithm [27] as the merging method to generate global model updates. Algorithm 1 describes the CFL with FedAvg mechanism. The procedure MAIN is executed by the server, which coordinates the whole training process. Each client executes the procedure CLIENTUPDATE and applies the stochastic gradient descent (SGD) algorithm on its local dataset with a learning rate $\eta$.
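To make the round structure concrete, the following minimal Python sketch mirrors Algorithm 1 with NumPy weight vectors. The helper names (client_update, fedavg_round, grad_fn) are ours, not the paper's reference implementation, and grad_fn stands in for the model's gradient computation.

```python
# Minimal sketch of one CFL round with FedAvg, assuming NumPy weight vectors.
import numpy as np

def client_update(w, data, grad_fn, epochs=5, batch_size=10, lr=0.1):
    """CLIENTUPDATE: E local epochs of mini-batch SGD from the global model w."""
    w = w.copy()
    for _ in range(epochs):
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            w -= lr * grad_fn(w, batch)   # w <- w - eta * grad(loss)(w, b)
    return w

def fedavg_round(w_global, clients, grad_fn):
    """Server side: collect local updates and average them weighted by |D_k|."""
    updates, sizes = [], []
    for data in clients:                  # the selected subset S_t
        updates.append(client_update(w_global, data, grad_fn))
        sizes.append(len(data))
    total = sum(sizes)
    # w^{t+1} = sum_k (|D_k| / |D|) * w_k^{t+1}, as in (2)
    return sum((n / total) * w for n, w in zip(sizes, updates))
```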

B. BLOCKCHAIN-ENABLED FEDERATED LEARNING (BFL)
BFL is based on distributed ledger technology, which collects data in the form of transactions and organizes them into blocks. Indeed, a blockchain is a sequence of blocks chained one after another through advanced cryptographic techniques. Each block contains the hash value of the previous one, leading to a tamper-proof sequence and providing properties that are essential to building trust in decentralized settings, such as transparency and immutability. In a blockchain, a set of participant nodes (miners) apply certain mining protocols and consensus mechanisms to append new blocks to the blockchain and agree on its status. This procedure allows devices to write concurrently on a distributed database and guarantees that any malicious change to the data would not be accepted by the majority, so that the data in a blockchain is secured.
When a blockchain is applied to a federated setting, the process goes as follows [22]:
1) Each device submits its local model updates in the form of transactions to the blockchain peer-to-peer (P2P) network of miners.
2) The transactions are shared and verified by the miners.
3) Miners execute certain tasks to decide which node updates the chain. One of the most popular mining mechanisms, and the one studied in this paper, is Proof-of-Work (PoW) [39], whereby miners spend their computational power to solve computation-intensive mathematical puzzles (the aggregate computational power determines the block generation rate, denoted by $\lambda$).
4) As a result of the concurrent mining operation, a new block is created and propagated throughout the P2P blockchain network every $BI$ seconds (on average). The block size $S_B$ is selected such that it can include a maximum of $m$ transactions, each one representing a local model submitted by a client.
5) Clients download the latest block from their associated miner (as in [13], [31]), which allows performing on-device global model aggregation and local training.
An important consequence of the blockchain's decentralized consensus is forking. A fork occurs when two or more miners generate a valid block simultaneously (i.e., before the winning block is fully propagated). Forks can be seen as a waste of resources, as they may lead to extra computation and delay overhead [40].
In this work, we consider the version of BFL reported in Algorithm 2 [22], which entails the participation of multi-access edge computing (MEC) servers and edge devices. Each client downloads the updates $w^t_1, \dots, w^t_m \in b^t$ contained in the latest block, computes the new global model $w^t$, and trains it on its local dataset with the CLIENTUPDATE procedure described in Section III-A. The parameters of the new updated model $w^{t+1}_k$ are then submitted with the method SUBMITLOCALUPDATE, where $S_{tr}$ is the transaction size. Once all the local updates are uploaded to the blockchain, a new block $b^{t+1}$ is mined with MINEBLOCK, where the block generation rate, $\lambda = 1/BI$, is derived from the total computational power of the blockchain nodes. Finally, the new block is shared across all the blockchain nodes with the procedure PROPAGATEBLOCK, whose duration depends on the size of block $b^{t+1}$ (fixed to $S_B$). The process is repeated until convergence.
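For illustration, the toy Python sketch below walks through steps 1-5 above with a SHA-256 proof of work. It is a simplified stand-in for the BlockSim model used later in the paper, and all names (mine_block, bfl_round, local_train) are ours; local updates are assumed JSON-serializable for hashing purposes.

```python
# Toy BFL round: submit updates as transactions, mine with PoW, append block.
import hashlib
import json

def mine_block(transactions, prev_hash, l):
    """Brute-force a nonce until the block hash falls below 2^(256-l)."""
    target = 2 ** (256 - l)
    nonce = 0
    while True:
        payload = json.dumps({"tx": transactions, "prev": prev_hash,
                              "nonce": nonce}, sort_keys=True).encode()
        digest = hashlib.sha256(payload).hexdigest()
        if int(digest, 16) < target:          # valid proof of work found
            return {"tx": transactions, "prev": prev_hash,
                    "nonce": nonce, "hash": digest}
        nonce += 1                            # ~2^l attempts on average

def bfl_round(chain, clients, local_train, l=16):
    latest = chain[-1]
    # step 1: each client aggregates the block's updates, trains, and submits
    updates = [local_train(latest["tx"], data) for data in clients]
    # steps 2-4: miners verify transactions and race to mine the next block
    block = mine_block(updates, latest["hash"], l)
    chain.append(block)     # step 5: block propagated, then downloaded
    return chain
```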

C. GOSSIP FEDERATED LEARNING (GFL)
GFL is an asynchronous protocol that trains a global model over decentralized data using a gossip communication algorithm [19], [20].
We consider the general design proposed in [20] and [21]. Overall, the participating clients start from a common initialization. The global model is then trained sequentially on local data, following a given path (e.g., a random walk) over the visited clients. Algorithm 3 describes the GFL procedure, which works as follows: 1) The $N$ clients agree on the ML model to use and store it in their local cache, i.e., lastModel$_i$, $\forall i \in \mathcal{N}$. 2) At each round $t$, $m$ clients are randomly selected and ordered in a sequence $S_t = [k_1, \dots, k_m]$.
3) The ML model visits sequentially the clients in $S_t$. Let $w^t_{k_{i-1}}$ be the model received by $k_i \in S_t$. 4) As reported in Fig. 2, first the procedure MERGE combines $w^t_{k_{i-1}}$ and lastModel$_{k_i}$, i.e., the model from the previous round in which the client was selected, by averaging them: $w \leftarrow (w + w')/2$. Then, the procedure CLIENTUPDATE described in Section III-A is used to train the merged model on the local dataset.
5) The local cache of $k_i$ is updated with the received model $w^t_{k_{i-1}}$, i.e., lastModel$_{k_i} \leftarrow w^t_{k_{i-1}}$, and the model trained on the local dataset, $w^t_{k_i}$, is shared with the next node in the sequence $S_t$. 6) If $k_i$ is the last client of the sequence, the model $w^t_{k_i}$ is sent to the first node of the next FL round, i.e., $k_1 \in S_{t+1}$.
A round is completed when the model has visited all the clients in the sequence. The algorithm stops when convergence is reached (e.g., after a given number of rounds). As an example, consider a scenario with $N$ clients. After the local cache initialization, $m$ clients are randomly selected and ordered in a sequence $S_0 = [k_1, \dots, k_m]$. The first client $k_1 \in S_0$ trains the model stored in its local cache on its local dataset with the procedure CLIENTUPDATE and shares the model $w^0_{k_1}$ with the following node in the sequence, i.e., $k_2 \in S_0$. When $k_2$ receives $w^0_{k_1}$, it first applies the MERGE procedure to combine it with the model stored in lastModel$_{k_2}$ and then trains the result on its local dataset with the procedure CLIENTUPDATE. After the local training, the model $w^0_{k_2}$ is shared with the next node in the sequence, and $w^0_{k_1}$ is stored in the local cache lastModel$_{k_2}$. The same process is repeated at each visited client until the end of the sequence. The last client $k_m \in S_0$, after executing the procedures MERGE and CLIENTUPDATE, sends $w^0_{k_m}$ to the first node of the sequence of the next round, $S_1$. These operations are executed iteratively until convergence is reached, e.g., until a predefined number of rounds has been executed.
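The following Python sketch condenses one GFL round as just described; train_fn stands in for CLIENTUPDATE, the dictionary last_model plays the role of the per-client caches, and all names are illustrative rather than the paper's implementation.

```python
# Sketch of one GFL round over a random sequence of clients.
import random
import numpy as np

def merge(w, w_prev):
    return (w + w_prev) / 2.0              # MERGE: element-wise average

def gfl_round(w_in, clients, last_model, train_fn):
    """clients: list of (client_id, dataset); last_model: per-client cache,
    initialized by the caller with the commonly agreed model."""
    sequence = random.sample(clients, k=len(clients))   # S_t (here m = N)
    w = w_in
    for cid, data in sequence:
        merged = merge(w, last_model[cid])  # combine with the cached model
        last_model[cid] = w                 # cache the received model
        w = train_fn(merged, data)          # CLIENTUPDATE on local data
    return w                                # handed to k_1 of the next round
```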

IV. COMPUTATIONAL AND COMMUNICATION COSTS
In this section, we introduce the mathematical statements for the calculation of the time complexity of the three federated algorithms discussed in Section III. We also elaborate on the data overhead due to the communication of the different model updates during the several rounds of the process for each implementation. Finally, we derive the equations for the calculation of the time to reach the convergence of the three analyzed federated approaches. The results proposed hereafter are derived using the following assumptions:

1) Scalar operations (sums and products) cost O(1).
2) The time complexity of matrix multiplication is linear in the matrix sizes, i.e., for $A \in \mathbb{R}^{i \times j}$ and $B \in \mathbb{R}^{j \times k}$, the cost of the product is $O(i \cdot j \cdot k)$.
3) For a single input pair $(x_i, y_i)$, the time complexity required to compute $\nabla\ell$ is linear in the number of model weights, $O(|w|)$.
4) During the mining process with PoW, a miner computes the nonce of a block by brute force until finding a hash value lower than or equal to a certain threshold [41], referred to as the mining difficulty. Assume that the hash value has $b$ bits and that a valid solution must be smaller than $2^{b-l}$ (with $l$ determined by the mining difficulty); if the miner samples nonce values at random, the probability of a valid value is $2^{-l}$. Hence, $2^l$ sampling operations are required on average for mining a block, and the time complexity is $O(2^l)$.
5) The set of nodes that have a local copy of the blockchain is $\mathcal{N}_B = \{1, \dots, N_B\}$; without loss of generality, we assume $\mathcal{N} \cap \mathcal{N}_B = \emptyset$.
6) We assume that convergence of the FL training procedure is reached after $R$ rounds.
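As a quick numeric sanity check of Assumption 4 (ours, not from the paper), the snippet below simulates nonce sampling with success probability $2^{-l}$ and confirms that the average number of trials is close to $2^l$, i.e., geometric with mean $2^l$.

```python
# Empirical check: expected PoW trials ~ 2^l when each nonce succeeds w.p. 2^-l.
import random

def pow_trials(l, runs=2000):
    p = 2.0 ** -l
    total = 0
    for _ in range(runs):
        trials = 1
        while random.random() >= p:   # keep sampling nonces until one is valid
            trials += 1
        total += trials
    return total / runs

print(pow_trials(l=10))   # ~1024 = 2^10 on average, hence the O(2^l) mining cost
```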
Theorem 1: The time complexity of CFL is
$$O\left(R\,m\,E\,|D_{max}|\,|w|\right), \quad (3)$$
where $D_{max} = \max_{k \in \mathcal{N}} |D_k|$. Its communication overhead is given by
$$2\,R\,m\,S_w. \quad (4)$$
Theorem 2: The time complexity of BFL is
$$O\left(R\left(m\,E\,|D_{max}|\,|w| + m^2|w| + 2^l + m\,|w|\,N_B\right)\right), \quad (5)$$
where $N_B$ is the number of nodes that have a local copy of the blockchain and $l$ is related to the PoW difficulty (see Assumption 4). Its communication overhead is
$$R\left(m\,(S_B + S_{tr}) + N_B\,S_B\right). \quad (6)$$
Theorem 3: The time complexity of GFL is
$$O\left(R\,m\,E\,|D_{max}|\,|w|\right), \quad (7)$$
and its communication overhead is
$$R\,m\,S_w. \quad (8)$$
The three algorithms have a time complexity that depends on the dataset size $|D_{max}|$. Additionally, the time complexity of BFL (5) is also a function of the blockchain parameters $N_B$ and $l$. In particular, the dominant term in (5) is $R\,2^l$. Hence, the time complexity of BFL is exponential in the PoW difficulty $l$, while for CFL and GFL it is polynomial in $R\,m\,E\,|D_{max}|\,|w|$. Table 1 summarizes the different results obtained for time complexity and communication overhead.
Lastly, we conclude our analysis by computing the total execution time of each algorithm until convergence (convergence time) as a function of the delay introduced by the communication rounds and the computational operations. To do that, we characterize the links whereby the different types of nodes exchange information (e.g., local model updates, blocks), keeping the topology introduced in Fig. 1 in mind. We distinguish two types of connections: cloud (solid arrows) and edge (dotted and dashed arrows). Cloud connections (assumed to be wired) are used by miners in the blockchain; instead, edge connections (assumed to be wireless) are used by edge nodes to upload/download models. Given its popularity and ease of deployment, we adopt IEEE 802.11ax links for the edge connections [42]. Since edge devices are often energy-constrained, we consider different values of transmission power for the edge connections. The central server and blockchain nodes use a transmission power of $P^c_{TX}$, whereas edge devices use $P^e_{TX}$, with $P^e_{TX} \leq P^c_{TX}$.
The wired connection has a capacity $C_{P2P}$. Additionally, we identify three main types of operations during the federated learning process: local model training, model parameter exchange, and blockchain data sharing. Based on this, we can compute the convergence time of CFL, BFL, and GFL as follows:
$$T^{CFL} = T_{train} + T^c_{Tx} + T^e_{Tx}, \quad (9)$$
$$T^{BFL} = T_{train} + T^e_{Tx} + T_{BC}, \quad (10)$$
$$T^{GFL} = T_{train} + T^e_{Tx}, \quad (11)$$
where $T_{train}$ is the total amount of time spent training the ML model locally, $T^{c/e}_{Tx}$ is the total transmission time of the central server/blockchain nodes (c) or the edge devices (e), computed according to the model detailed in Appendix B, and $T_{BC}$ is the total delay introduced by the blockchain, described in steps 2-4 of the process in Section III-B.

V. ENERGY FOOTPRINT
In this section, we define the models used to characterize the energy consumption resulting from the FL operations. Driven by (9), (10), and (11), the total amount of energy consumed in each scenario is:
$$E^{CFL} = E_{train} + E^c_{Tx} + E^e_{Tx}, \quad (12)$$
$$E^{BFL} = E_{train} + E^e_{Tx} + E_{BC}, \quad (13)$$
$$E^{GFL} = E_{train} + E^e_{Tx}, \quad (14)$$
where $E_{train}$ is the energy consumed by all the nodes during the local training, and $E^{c/e}_{Tx}$ is the energy required to transmit the model via IEEE 802.11ax wireless links during the whole algorithm execution, from either a central server/blockchain node (c) or an edge device (e). $E_{train}$ is calculated as:
$$E_{train} = \sum_{r=1}^{R} \sum_{i \in S_r} P^r_{CPU_i}\, \Delta^r_i, \quad (15)$$
where $P^r_{CPU_i}$ is the average power consumed by the CPU and DRAM during round $r$, and $\Delta^r_i$ is the duration of the operation for the $i$th client. The model proposed in (15) is general enough to capture the characteristics of the three algorithm implementations. Moreover, it opens the possibility of using energy measurement libraries, which represent a good trade-off between the complexity of using real sensors and the high abstraction level of mathematical models.
As described in Section IV, we may have two types of communication links: cloud and edge. Considering that cloud links are wired, we assume that their energy consumption is negligible. Instead, we compute the energy consumption of the edge connections according to the following equation:
$$E^{c/e}_{Tx} = T^{c/e}_{Tx}\, P^{c/e}_{Tx}, \quad (16)$$
where $T^{c/e}_{Tx}$ and $P^{c/e}_{Tx}$ are the transmission time and power of a central server/miner (c) or an edge device (e). The additional term $E_{BC}$ for BFL is associated with the mining operations of the PoW. We measure this energy based on the model proposed in [43], according to the following equation:
$$E_{BC} = \frac{P_h\, N_{chain}}{\lambda}, \quad (17)$$
where $P_h$ is the total hashing power of the network, $\lambda$ is the block generation rate, and $N_{chain}$ is the number of blocks on the main chain.
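Putting (12)-(17) together, a minimal accounting helper could look as follows. The function and argument names are ours, the numeric inputs are placeholders rather than measured values, and the E_BC term follows our reading of the model in [43].

```python
# Hedged sketch of the energy accounting in (12)-(17).
def total_energy(p_cpu_rounds, t_tx_c=0.0, p_tx_c=0.0,
                 t_tx_e=0.0, p_tx_e=0.0,
                 p_hash=0.0, block_rate=None, n_chain=0):
    # E_train (15): sum over rounds/clients of CPU+DRAM power times duration
    e_train = sum(p * t for p, t in p_cpu_rounds)
    # E_Tx (16): T_Tx * P_Tx for server/miner (c) and edge (e) links
    e_tx = t_tx_c * p_tx_c + t_tx_e * p_tx_e
    # E_BC (17): P_h * N_chain / lambda, mining energy (BFL only)
    e_bc = p_hash * n_chain / block_rate if block_rate else 0.0
    return e_train + e_tx + e_bc

# Example: CFL-like run, 2 rounds of (power [W], duration [s]) samples.
print(total_energy([(19.0, 30.0), (19.0, 31.5)], t_tx_e=12.0, p_tx_e=0.008))
```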

VI. PERFORMANCE EVALUATION
In this section, we first describe the experimental settings adopted to compare the three federated approaches and then, we present numerical results.

A. SIMULATION SETUP
We use the Extended MNIST (EMNIST) dataset available in the Tensorflow Federated (TFF) library [44]. Fig. 3 shows the available samples of the first four clients, for both versions of the dataset, i.e., EMNIST and EMNISTp.
To correctly classify these samples we choose two models proposed in [27]. The first one is a feed-forward neural network (FFNN) with an input layer of 784 neurons, two hidden layers of 200 neurons activated with the rectified linear unit (ReLU) function, and an output layer of 10 neurons with Softmax activation. The number of trainable parameters of the FFNN ($|w|$) is 199 210. Assuming that each parameter requires 4 bytes, i.e., the size of a float32 variable, the total amount of space required ($S_w$) is 796.84 KB. The second one is a convolutional neural network (CNN) with the following structure:
1) A 5×5 convolutional layer with 32 channels and ReLU activation.
2) A 2×2 max pooling layer.
3) A 5×5 convolutional layer with 64 channels and ReLU activation.
4) A 2×2 max pooling layer.
5) A fully connected layer with 512 units and ReLU activation.
6) An output layer with 10 neurons and Softmax activation.
In total, the number of trainable parameters of the CNN ($|w|$) is 582 026 and its size in memory ($S_w$) is 2.33 MB. We opted for an FFNN to reproduce a realistic scenario where edge devices might not have enough computational power to train more sophisticated deep learning models, such as NNs based on convolutional architectures. Despite its simplicity, the selected FFNN model accurately classifies the digits of the EMNIST dataset, as shown next. We then also use a CNN to evaluate the algorithms' performance with a more complex model (details in Section VI-C).
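For reference, the two models can be reproduced in Keras as sketched below. We assume 'valid' convolution padding, which is not stated in the text but is the setting that matches the reported parameter counts of 199,210 and 582,026.

```python
# Keras re-creation of the FFNN and CNN described above (assumed details noted).
import tensorflow as tf

def build_ffnn():
    return tf.keras.Sequential([
        tf.keras.Input(shape=(784,)),
        tf.keras.layers.Dense(200, activation="relu"),
        tf.keras.layers.Dense(200, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

def build_cnn():
    return tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28, 1)),
        tf.keras.layers.Conv2D(32, 5, activation="relu"),   # valid padding
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Conv2D(64, 5, activation="relu"),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

build_ffnn().summary()   # Total params: 199,210
build_cnn().summary()    # Total params: 582,026
```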
The three FL algorithms are implemented with the TensorFlow (TF) [25], TensorFlow Federated (TFF) [45], and Keras [46] libraries. We have extended the Bitcoin model provided by BlockSim [47] to simulate the blockchain behavior. BlockSim is an event-based simulator that characterizes the operations carried out to store data in a blockchain, from the submission of transactions to mining blocks and reaching consensus in a decentralized manner. Accordingly, BlockSim allows simulating the delays added by the blockchain in a BFL application, i.e., the $T_{BC}$ parameter defined in Section IV.
We create a validation set by choosing a subset of 200 clients from the test set. Following the TFF documentation [48], the training accuracy is computed at the beginning of each round, while the validation accuracy is calculated at the end. For this reason, the validation accuracy may be higher than the training accuracy. At the end of each simulation, we evaluate the performance on the test set.
As for $E_{train}$ and $T_{train}$, we have used Carbontracker [49], a Python library that periodically samples the hardware energy consumption and measures the execution time. Moreover, $P^c_{Tx}$ is set to 20 dBm and $P^e_{Tx}$ to 9 dBm. Table 2 reports all the other parameters used in our simulations. We note that we have used the same FL parameters for a fair comparison, with the number of rounds $R$ of CFL and GFL equal to the main chain length ($N_{chain}$) in BFL. In this way, we guarantee that the number of global rounds of each learning algorithm is the same. We run the experiments with the following hardware configurations: an Intel i5-6600 with 8 GB of RAM (HW1) and two Intel Xeon 6230 with 188 GB of RAM (HW2).

Table 3 reports the accuracy of each algorithm implementation with the FFNN model on the two considered datasets executed on HW1. CFL and BFL achieve the best accuracy (both close to 0.9), whereas GFL presents lower values. This result validates the claim that, under setups similar to our simulations (i.e., each block contains $m$ local updates organized in transactions), the central parameter server of CFL can be replaced by a properly dimensioned blockchain network without compromising the learning accuracy. On the other hand, GFL achieves a validation accuracy of around 0.5 after 200 rounds. We justify this behavior by noticing that, before the CLIENTUPDATE procedure, the model received from the previous node in the sequence is merged with that of the previous round. In the earliest training rounds, there is a high probability that the merging procedure works with a fresh model that has never been trained, hence disrupting the knowledge from the previous clients. We analyze this phenomenon more in depth in Section VII-A.

Table 3 also details the convergence time of each algorithm, the fraction of energy consumed in the computations (as a percentage of the total energy consumed), the total amount of energy needed, and the communication overhead (i.e., the data to be shared during the rounds of the algorithms). Computational energy is, generally speaking, the energy spent performing the computations needed by the specific algorithm implementation. For CFL and GFL, it accounts for the energy spent in the local training at the clients, $E_{train}$. For BFL, the computational energy includes both the local training at the clients and the energy spent in the mining process in the blockchain network, i.e., $E_{train}$ and $E_{BC}$, respectively.
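As a pointer to how such measurements are collected, the snippet below wraps Carbontracker's documented epoch-based API around a generic FL round; run_fl_round is a placeholder of ours for one CFL/GFL/BFL training round.

```python
# Minimal usage sketch of Carbontracker [49] to obtain E_train and T_train.
from carbontracker.tracker import CarbonTracker

def run_fl_round(r):
    pass   # placeholder: one CFL/GFL/BFL training round goes here

R = 200                        # FL rounds, tracked as Carbontracker "epochs"
tracker = CarbonTracker(epochs=R)
for r in range(R):
    tracker.epoch_start()
    run_fl_round(r)
    tracker.epoch_end()        # samples energy and duration for this round
tracker.stop()                 # reports the aggregate consumption
```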

B. RESULT ANALYSIS
The fastest and most energy-efficient algorithm is GFL: it saves 18% of the training time, 18% of the energy, and 51% of the data to be shared with respect to the CFL solution. However, GFL's main drawback is the poor accuracy achieved, as stated above. BFL is the slowest and most energy-hungry federated implementation, mainly due to the overhead introduced by the blockchain network to secure data in a decentralized way. Additionally, note that computation is the most energy-consuming task for CFL, BFL, and GFL. For BFL, the mining process drains 1125 Wh, i.e., 98% of the total energy. We highlight that our comparison may be unfair in this respect, since neither CFL nor GFL includes any security mechanism. However, we believe it is worth including BFL in our analysis of distributed versus centralized federated learning, since our results show that the secure and decentralized method introduced by the blockchain network, despite increasing the algorithm costs, does not jeopardize its accuracy compared to its centralized counterpart, CFL. Finally, GFL is the implementation that requires the lowest communication overhead. More precisely, in this case, we need to include an extra cost to share the global model across the nodes at the end of the last round (not considered in the table), which is approximately 0.16 GB (the cost of one extra round).

Table 4 and Table 5 report the performance of the CNN model on the EMNIST and EMNISTp datasets executed on the two different platforms, HW1 and HW2, respectively. Similarly to the previous FFNN case, BFL is the slowest and most energy-demanding algorithm. In this case, however, GFL reaches a higher validation accuracy on EMNIST, i.e., 0.8, but is still not able to match the performance of the other two algorithms. Moreover, using the CNN, GFL is the fastest algorithm and saves up to 16% of the execution time with respect to CFL on HW1. Hence, model selection plays a key role in the algorithm performance and may facilitate the training process, as in the case of GFL. Finally, it is confirmed that the communication overhead of GFL is the lowest.

However, we report some inconsistencies in the energy measurements. On HW1, differently from the FFNN case, CFL is the most energy efficient on EMNIST and saves 14% of the energy with respect to GFL. On EMNISTp, instead, the situation is reversed, since GFL saves 15% of the energy. Moreover, when using HW2 (Table 5), CFL turns out to be the most energy efficient for both EMNIST and EMNISTp. Such inconsistencies are mainly due to the fact that the average computational power consumption of the CFL implementation is higher than that of GFL (around 103 W and 93 W on EMNIST, respectively), while GFL requires a longer training time ($T_{train}$). In the FFNN implementation, instead, the average computational power consumption is also higher for CFL (around 19 W for CFL and 13 W for GFL), but GFL requires a lower training time. The reason lies mainly in the software implementations.¹ In fact, CFL and BFL are based on TFF, which executes the training process for all the participating clients in parallel, whereas GFL is based on the standard TF libraries and the training process is executed sequentially, one client after the other.

C. MODEL AND IMPLEMENTATION DEPENDENCIES
In view of the above, we state that the hardware and software implementations play a key role in the energy assessment. Therefore, it is essential that future research directions focus on: i) the joint optimization of federated algorithms and their software implementations, ii) the definition of standard libraries for the three categories of algorithms studied in this paper, and iii) the design of effective and open test platforms for experiment comparison.
¹ https://github.com/eliaguerra/Federated_comparison_cttc

D. MAPPING BETWEEN FL IMPLEMENTATION AND SERVICES
In light of the results achieved by our analysis, we map the most suitable FL implementation to a given application scenario. Among the several services listed in [12], [13], in Table 6 we focus on some emerging examples, such as Healthcare and Mobile Traffic Prediction for cross-silo scenarios, and Urban Traffic Forecasting, Connected Vehicles, and Next-word Prediction for cross-device settings. In the Healthcare scenario, accurate predictions that guarantee the confidentiality of users' data are required [50]. For these reasons, we suggest the usage of BFL. In the context of mobile networks, multiple base stations may be interested in training an ML model for mobile traffic prediction without sharing raw data, thereby saving network resources. In this case, accuracy and latency are the most important KPIs [51], so our suggestion is to use CFL. Nevertheless, BFL may also be an appealing solution when multiple operators cooperate on the mobile traffic forecasting task. In urban traffic forecasting, a network of sensors is in charge of creating a model to predict vehicular traffic flow. Considering the problem description and the limited amount of energy available at each sensor, the two most important KPIs are latency and energy [52]; hence, the suggested approach is GFL (or GFL-NM, introduced in Section VII-A). For connected vehicles, latency and accuracy are the two most important KPIs [2]. Considering that the connection with a central server may be unavailable, we suggest the usage of GFL (or GFL-NM) to take advantage of nearby vehicles for sharing model updates. In next-word prediction for mobile keyboards, minimizing energy consumption is of paramount importance to guarantee a flawless user experience, so the most important KPI is energy [53]; the suggested approach is CFL.

VII. PROPOSED IMPROVEMENTS

A. GFL ACCURACY
As described in Section VI-B, GFL does not achieve the same accuracy level as CFL and BFL. We identify two possible reasons for this: 1) The number of rounds is not sufficient for convergence: the number of visited nodes might not be enough to achieve an acceptable accuracy.
2) The merge step negatively impacts the overall performance of the learning algorithm: the model received in the previous round and stored in the local cache slows down the learning process.
To verify the first hypothesis, we execute the GFL algorithm with the FFNN model on HW1, changing the number of rounds (R = {200, 400, 800}) and varying the number of local computations (E = {5, 10}). Table 7 shows the results obtained. Considering the EMNIST dataset, the best results are achieved with R = 800 and E = {5, 10}, i.e., a higher test accuracy of 0.66, which is still lower than that of CFL and BFL. Moreover, the model overfits with R = 800 rounds; hence, when increasing the number of rounds, a regularization method would be needed. On the EMNISTp dataset, the accuracy is even lower for each combination of the hyperparameters tested. Increasing the number of rounds, R, and the local epochs, E, raises the GFL validation accuracy to 0.94, but it quadruples the convergence time and requires 5 times the energy of the baseline configuration (R = 200 and E = 5) on the EMNIST dataset.

To verify the second hypothesis, we run the GFL algorithm without the merge step (GFL-NM). The pseudocode of this algorithm is the same as Algorithm 3, except that the old Line 9 is replaced with the new command $w^t_{k_i} \leftarrow w^t_{k_{i-1}}$. Thus, in GFL-NM, given a sequence of clients $S_t$, the model is trained incrementally on the client datasets. GFL-NM achieves a training accuracy of ∼1.0 (0.94), a validation accuracy of 0.94 (0.78), and a test accuracy of 0.93 (0.78) on the EMNIST (EMNISTp) dataset (see Fig. 4(b)), higher than both CFL and BFL. These results suggest that the MERGE step compromises the training performance. In fact, at the beginning of the learning process, there is a high probability that the model visits a node that has never been visited before and whose lastModel still stores initialization values. In this case, the received model is merged with a model that has never been trained, as shown in Algorithm 3, which negatively impacts the resulting merged weights. Figure 4 compares the learning curves of GFL and GFL-NM.
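In code, GFL-NM amounts to dropping the MERGE call from the round sketch of Section III-C; the helper names below are ours, with train_fn again standing in for CLIENTUPDATE.

```python
# GFL-NM: the received model is trained directly, with no merge step,
# so learning proceeds incrementally along the sequence of clients.
def gfl_nm_round(w_in, sequence, train_fn):
    w = w_in
    for _, data in sequence:     # visit the clients of S_t in order
        w = train_fn(w, data)    # w^t_{k_i} <- trained w^t_{k_{i-1}}
    return w
```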
In conclusion, we have shown that both 1) and 2) influence the achieved accuracy. Moreover, GFL-NM solves the accuracy problem of standard GFL and achieves the best performance with respect to all the metrics.

B. DELAY ANALYSIS OF BFL
Blockchain technology, while enabling a reliable and secure FL operation, entails very high overheads in terms of time and energy for the sake of keeping decentralization. The performance of a blockchain, typically measured in transactions per second (TPS), together with the granted degree of security, strongly depends on the nature of the blockchain (e.g., degree of visibility, type of consensus, mining protocol), its configurable parameters (e.g., block interval, difficulty), and the size of the P2P network maintaining it. Furthermore, as discussed in Section VIII-D, the necessary energy to maintain a blockchain is correlated to its performance in TPS and security, thus leading to the well-known performance, security, and energy trilemma.
To showcase the effect of using different types of blockchain networks, Fig. 5 shows the total delay the blockchain adds to the FL operation when generating up to 200 blocks under different blockchain configurations. Notice that, in the proposed setting, each block is equivalent to one FL round. In particular, we vary the total number of miners ($N_m$ = {1, 10, 100}) and the block interval (BI = {5, 15, 600} s), which affect the time required to achieve consensus.
First, a higher number of miners leads to a higher fork probability, given that more nodes need to agree on the same status of the ledger. Note that with $N_m$ = 1 the fork probability, $p_f(N_m)$, is 0, since there is only one miner. By contrast, a higher block interval mitigates the effect of forks, since the probability that two miners mine a block simultaneously is lower [54].
As shown in Fig. 5, the blockchain delay increases with the block interval (BI), which indicates the average time for mining a block. Notice that, in a PoW-based blockchain, the block interval is fixed by tuning the mining difficulty according to the total computational power of the miners. As for the impact of $N_m$, its effects on the delay are more noticeable for low BI values. In particular, a higher fork probability is observed as $N_m$ increases, thus incurring additional delays to the FL application operating on top of the blockchain.
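To give a rough intuition of these trends (this is not the BlockSim model used for Fig. 5), the toy calculation below assumes exponentially distributed mining times and a fixed propagation delay; the closed-form fork probability and the one-extra-interval fork penalty are our simplifications, for illustration only.

```python
# Toy model: blockchain delay vs. block interval (BI) and number of miners.
import math

def blockchain_delay(n_blocks, bi, n_miners, t_prop=2.0):
    # Forks occur when another miner finds a block within the propagation
    # window; with exponential mining, p_f ~ 1 - exp(-rate_others * t_prop).
    rate_others = (n_miners - 1) / (n_miners * bi)
    p_fork = 1.0 - math.exp(-rate_others * t_prop) if n_miners > 1 else 0.0
    expected_blocks = n_blocks / (1.0 - p_fork)   # forked blocks are wasted
    return expected_blocks * (bi + t_prop)        # seconds to n_blocks rounds

for bi in (5, 15, 600):
    print(bi, [round(blockchain_delay(200, bi, nm)) for nm in (1, 10, 100)])
```

Consistently with the discussion above, the printed delays grow with BI, while the miner count only matters noticeably at low BI values, where the propagation window is a larger fraction of the block interval.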

VIII. OPEN ISSUES AND FUTURE RESEARCH DIRECTIONS A. GFL
In our opinion, and encouraged by our results, the investigation of new methods for merging the model updates from the distributed sources, so as to achieve faster convergence and higher accuracy, is an interesting and open research line. To the best of our knowledge, there are still very few works in the literature that go in this direction. In [21], the authors implement an incremental version of GFL with a single round on the edge devices using the entire local dataset (hence, without requiring any merge step). Similarly, [55] proposes an iterative continual learning algorithm, where a model is trained incrementally on the local datasets without applying any merge operation.

B. BFL
To optimize the performance of a blockchain, a widely adopted approach consists of finding the best block generation rate [31], which is controlled by tuning the mining difficulty. Other approaches consider optimizing the block size [56], which better fits scenarios where the intensity of transaction arrivals depends on the nature of the application running on top of the blockchain (e.g., FL updates provided by clients).
Regarding the communication cost of BFL, it can be reduced by leveraging the computational capacity of blockchain miners to speed up the FL operation. In particular, instead of including individual local models in a block, each block can carry a global model, aggregated by the miner responsible for building the block. This approach has been widely adopted in the literature (see, e.g., [57]) and would lead to a reduced time complexity and communication cost (see Appendix A-B).
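A sketch of this miner-side variant is given below (the names are ours): the miner folds the m submitted updates into one global model before mining, so the block payload shrinks from m|w| to |w|.

```python
# Miner-side aggregation: the block carries one global model, not m locals.
def build_block(transactions, sizes, prev_hash, mine_fn):
    """transactions: list of weight vectors; sizes: the matching |D_k| values;
    mine_fn: any PoW routine, e.g., mine_block from Section III-B's sketch."""
    total = sum(sizes)
    w_global = sum((n / total) * w for n, w in zip(sizes, transactions))
    return mine_fn([w_global], prev_hash)   # payload shrinks m-fold
```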
Finally, another important open aspect regarding blockchain-enabled FL lies in the implications of decentralization on the learning procedure. In this paper, we have assumed that the blockchain is perfectly shared and accessed by FL devices to carry out training, thus acting as a central orchestrating server. However, the decentralized data sharing in blockchain naturally leads to model inconsistencies, provided that different FL devices can use the information from different blocks to compute local model updates.

C. ROLE OF MEC IN DECENTRALIZED FL
Due to its low-latency and computation-acceleration capabilities, the MEC paradigm is of particular relevance for uplifting the performance of FL applications. The interplay between FL and MEC has been largely studied from the data offloading perspective, but when it comes to decentralized FL approaches, MEC can play different roles. First, FL devices in GFL can leverage MEC to offload model training by sharing their local datasets with a nearby MEC server. As shown in [58], ML model training offloading is an appealing solution for mitigating the straggler-node problem in FL. This is particularly useful in settings with computationally constrained devices, such as in the IoT. Nevertheless, offloading a dataset to an edge server entails a trade-off between the data transfer time and the local computation savings. For that, it is critical to properly optimize the amount of data to be offloaded with respect to the total training delay. In addition, the MEC approach can be useful for speeding up the training procedure thanks to a proper client selection scheduling [59].
When it comes to BFL, MEC can be leveraged not only for ML model training, but also to support the computation and storage requirements of the blockchain, which are typically unfeasible for small devices. In the literature, MEC servers have been considered for performing either model training or mining tasks (or both) [13]. In this work, and closer in spirit to what was proposed in [32], we have assumed that the blockchain operations are handled by MEC servers.

D. DISCUSSION ON SECURITY ASPECTS
The decentralization of FL through either GFL or BFL is important for mitigating single-point-of-failure issues and boosting scalability and democratization. As seen in Section VI-B, GFL provides a high degree of decentralization thanks to the P2P interactions between FL clients. However, its performance is significantly lower than that of its centralized counterpart. As for BFL, it provides outstanding accuracy values, but at a high cost in terms of energy consumption and communication. The fact is that, to ensure security, robustness, and reliability, a blockchain typically requires performing computation-intensive validations and exhaustive information replication. For instance, to ensure transparency, the blockchain network needs to trace all the events.
Unlike CFL and GFL, the BFL setting treasures security properties by design, which makes it robust against external threats. Thanks to blockchain's tamper-proof and traceability properties, attacks on the training operation such as poisoning [60] or Sybil attacks [61] can be mitigated [62], [63], [64]. Of course, in BFL, the cost of security in terms of energy consumption and communication overheads strongly depends on the FL application scenario and its requirements. In particular, different types of blockchains can better suit different FL applications. Blockchain types are mainly categorized as public (permissionless), private, and consortium (permissioned) blockchains. The consortium setting suits well the cooperative enterprise setting, where participants can be trusted to a certain extent. The most popular consortium blockchain platform is Hyperledger Fabric, whose consensus method (RAFT) does not require all the nodes in the blockchain to participate in the validation of data, thus speeding up performance. Depending on the blockchain setting, the required computation energy and communication overheads may vary to sustain the desired level of security. Throughout this paper, we have considered the most decentralized type of blockchain, i.e., public blockchains, but the other settings may be more appropriate for specific cases. The analysis of the cost of other blockchain types is left as future work; we refer the interested reader to the survey in [65].

IX. CONCLUSION
Decentralized server-less federated learning is an appealing solution to overcome CFL limitations. However, finding the best approach for each scenario is non-trivial due to the lack of comprehensive comparisons. In this work, we have proposed a complete overview of these techniques and evaluated them through several key performance indicators: accuracy, computational complexity, communication overhead, convergence time, and energy consumption. To do so, we have provided a comprehensive theoretical analysis and a physical implementation of these algorithms.
An extensive simulation campaign supports our analysis. The numerical results show that GFL is the algorithm requiring the least communication overhead to reach convergence. CFL and GFL behave similarly in terms of energy consumption and accuracy, with slight differences depending on the DL model adopted and the hardware used. BFL represents a viable solution for implementing decentralized learning with high accuracy and a high level of security, at the cost of extra energy usage and data sharing.
Moreover, we have discussed some open issues and future research directions for the two decentralized federated methods, such as the poor accuracy achieved by GFL and the blockchain overhead in BFL. Regarding GFL, we have argued that the main drawback lies in the method used to merge model updates across the algorithm steps. We have demonstrated that, with an incremental approach, the modified version of GFL is able to outperform CFL and BFL in terms of accuracy. As for BFL, we have indicated that possible optimizations go in the direction of finding the best block generation rate and block size. Moreover, we have reasoned on the possibility of reducing the time complexity by including in each block the global model aggregated by the same miner building the block. In addition, we have pointed out the importance of further studies on the implications of the model inconsistencies that arise because the blockchain cannot be perfectly shared and accessed by (all) the FL devices.
Finally, we have argued on the key role played by the libraries used for the implementation and their influence on the energy consumption on different hardware platforms. We call for the definition of standard libraries and open test platforms to be used for research purposes.

APPENDIX A
PROOFS

A. PROOF OF THEOREM 1
Let us consider the procedure CLIENTUPDATE, whose time complexity is $E\left(|D_{max}||w| + 2\frac{|D_{max}|}{B}|w|\right)$. In fact, a single client $k$ performs the training phase on its local dataset $D_k$ over $E$ local epochs and updates the model parameters. The first operation has a time complexity of $|D_k||w|$; the update, which requires a product and a sum, is executed $\frac{|D_k|}{B}$ times, for a cost of $2\frac{|D_k|}{B}|w|$. Each client performs $E$ local epochs, so the total cost is:
$$E\left(|D_k||w| + 2\frac{|D_k|}{B}|w|\right). \quad (18)$$
To obtain an upper bound that does not depend on $k$, we can use $|D_{max}|$ as an upper bound of $|D_k|$:
$$E\left(|D_{max}||w| + 2\frac{|D_{max}|}{B}|w|\right). \quad (19)$$
We can divide the MAIN procedure in Algorithm 1 into two blocks. The first, up to Line 10, has a cost upper bounded by
$$m\left(2|w| + E\left(|D_{max}||w| + 2\frac{|D_{max}|}{B}|w|\right)\right). \quad (20)$$
In parallel, every client downloads the global model, executes CLIENTUPDATE, and sends the updated parameters back to the server. The download and upload operations have a time complexity proportional to $|w|$. Considering that the same procedure is repeated by $m$ clients, the upper bound in (20) easily follows. The second block starts from Line 10, where the server aggregates the local updates and computes the new global model. The number of arithmetic operations performed is:
$$2|w|m. \quad (21)$$
Combining (20) and (21), and considering the total number of rounds $R$ required to reach convergence, the total cost of CFL is given by:
$$R\left(m\left(2|w| + E|D_{max}||w| + 2E\frac{|D_{max}|}{B}|w|\right) + 2|w|m\right). \quad (22)$$
The first addend in (22) is the dominant term in the asymptotic analysis, which completes the proof of (3).
When it comes to the communication overhead of CFL, the result easily follows by considering that, in each round, each client downloads and uploads the model parameters.

B. PROOF OF THEOREM 2
In each round of the algorithm, every client in $S_t$ has to download the latest block from the closest edge server (miner) and compute the current global model from it. These operations, as described before, have a cost of $|w|m$ and $2|w|m$, respectively. Then, after running the CLIENTUPDATE procedure in Algorithm 2, clients submit the new model weights with a cost of $|w|$. These steps are carried out by each node in $S_t$ (in total, $m$ nodes), so the total cost is:
$$m\left(3|w|m + E|D_{max}||w| + 2E\frac{|D_{max}|}{B}|w| + |w|\right). \quad (23)$$
When all the local updates have been computed, it is necessary to create a block, reach consensus through the mining operation, and propagate the block across all the blockchain nodes. The cost of these operations is given by:
$$2^l + m|w|N_B. \quad (24)$$
Combining (23) and (24), we obtain the total time complexity of the algorithm:
$$R\left(3|w|m^2 + E|D_{max}||w|m + 2E\frac{|D_{max}|}{B}|w|m + |w|m + 2^l + m|w|N_B\right).$$
The dominant addends are reported in (5). The communication overhead of BFL can be easily derived from the algorithm description.
In this analysis, we considered the less efficient implementation, whereby each client has to compute the new global model from the updates in the latest block. To improve this, we can move the instruction in Line 8 outside the for loop and execute it before the MINEBLOCK procedure. In this way, the new block has size $|w|$, since it contains only the parameters of the new global model. Following the same analysis described before, the computational complexity is:
$$R\left(E|D_{max}||w|m + 2E\frac{|D_{max}|}{B}|w|m + 4|w|m + 2^l + |w|N_B\right), \quad (25)$$
and the communication overhead is:
$$R\left(m\left(S_w + S_{tr}\right) + N_B\,S_w\right). \quad (26)$$

C. PROOF OF THEOREM 3
Let $k_i$ be a client in the sequence $[k_1, \dots, k_m]$. Following the steps of Algorithm 3, three main operations are performed: 1) MERGE, 2) CLIENTUPDATE, and 3) sending the model parameters to the next client in the sequence. The first one is the average of two sets of model parameters, so its cost is $2|w|$.
The cost of the second operation has already been computed in (19), and the cost of parameter sharing is $|w|$. By summing these contributions over the $m$ clients we obtain:
$$m\left(3|w| + E|D_{max}||w| + 2E\frac{|D_{max}|}{B}|w|\right). \quad (28)$$
This process is repeated for $R$ rounds, so the time complexity is:
$$R\,m\left(E|D_{max}||w| + 2E\frac{|D_{max}|}{B}|w| + 3|w|\right), \quad (29)$$
where the first addend is the dominant one. Given that each client shares its local model only with the following node in the sequence, the communication overhead is given by (8).

APPENDIX B
EDGE CONNECTION MODEL
To compute the total duration of transmitting the model weights, we assume IEEE 802.11ax channel access procedures [42], which also include the overheads of the distributed coordination function (DCF) operation. In particular, the total transmission duration is defined as:
$$T_{Tx} = R\,m\left(T_{RTS} + T_{SIFS} + T_{CTS} + T_{DATA} + T_{SIFS} + T_{ACK} + T_{DIFS} + T_e\right), \quad (30)$$
where $T_{RTS}$ is the duration of the ready-to-send (RTS) control frame, $T_{SIFS}$ is the short interframe space (SIFS) duration, $T_{CTS}$ is the duration of the clear-to-send (CTS) control frame, $T_{DATA}$ is the duration of the data payload, $T_{ACK}$ is the duration of the acknowledgement (ACK) frame, $T_{DIFS}$ is the DCF interframe space (DIFS) duration, $T_e$ is the duration of an empty slot, $R$ is the number of FL rounds, and $m$ is the number of participating clients.
The duration of each type of IEEE 802.11ax control frame, i.e., RTS, CTS, and ACK, is computed as:
$$T_{RTS/CTS/ACK} = T_{PHY} + \left\lceil \frac{L_{SF} + L_{RTS/CTS/ACK}}{L_s} \right\rceil \sigma_{leg}, \quad (31)$$
where $T_{PHY}$ is the duration of the PHY preamble, $L_{SF}$ is the length of the service field (SF), $L_{RTS/CTS/ACK}$ is the length of the control frame, $L_s$ is the length of an orthogonal frequency-division multiplexing (OFDM) symbol, and $\sigma_{leg}$ is the duration of a legacy OFDM symbol.
As for the duration of the data payload, it is computed as:
$$T_{DATA} = T_{PHY} + T_{HE\text{-}SU} + \left\lceil \frac{L_{SF} + L_{MAC} + L_{DATA}}{L_s} \right\rceil \sigma, \quad (32)$$
where $T_{HE\text{-}SU}$ is the duration of the high-efficiency (HE) single-user field, $L_{MAC}$ is the length of the MAC header, $L_{DATA}$ is the length of a single data packet (in our case, it matches the model size, $S_w$), and $\sigma$ is the duration of an OFDM symbol. The number of bits per OFDM symbol, and thus the effective data rate, varies with the employed modulation and coding scheme (MCS), which depends on the transmission power used.
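As a worked example of (30)-(32), the Python sketch below assembles the per-packet airtime and the total transmission time. The PHY/MAC constants are typical IEEE 802.11ax values inserted as placeholders for the ones listed in Table 2, and the bits-per-symbol figure is an assumed MCS setting.

```python
# Numeric sketch of the edge-link timing model in (30)-(32); constants assumed.
import math

T_PHY, T_HE_SU = 20e-6, 100e-6        # preamble durations [s] (placeholders)
SIFS, DIFS, T_e = 16e-6, 34e-6, 9e-6  # interframe spaces and empty slot [s]
L_SF, L_MAC = 16, 320                 # service field / MAC header [bits]
L_RTS, L_CTS, L_ACK = 160, 112, 112   # control frame lengths [bits]
sigma_leg, sigma = 4e-6, 16e-6        # legacy / HE OFDM symbol durations [s]
L_s = 980                             # bits per OFDM symbol (depends on MCS)

def t_ctrl(l_frame):
    """Eq. (31): duration of an RTS/CTS/ACK control frame."""
    return T_PHY + math.ceil((L_SF + l_frame) / L_s) * sigma_leg

def t_data(l_data_bits):
    """Eq. (32): duration of the data payload (the full model, S_w)."""
    return (T_PHY + T_HE_SU +
            math.ceil((L_SF + L_MAC + l_data_bits) / L_s) * sigma)

def t_tx(rounds, m, model_bytes):
    """Eq. (30): R*m complete RTS/CTS/DATA/ACK exchanges."""
    per_pkt = (t_ctrl(L_RTS) + SIFS + t_ctrl(L_CTS) + t_data(8 * model_bytes)
               + SIFS + t_ctrl(L_ACK) + DIFS + T_e)
    return rounds * m * per_pkt

print(t_tx(200, 10, 796_840))   # e.g., the FFNN model of S_w = 796.84 KB
```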