Evaluating Federated Learning for Intrusion Detection in Internet of Things: Review and Challenges

The application of Machine Learning (ML) techniques to the well-known intrusion detection systems (IDS) is key to coping with increasingly sophisticated cybersecurity attacks through an effective and efficient detection process. In the context of the Internet of Things (IoT), most ML-enabled IDS are centralized: IoT devices share their data with data centers for further analysis. To mitigate the privacy concerns associated with centralized approaches, in recent years the use of Federated Learning (FL) has attracted significant interest in different sectors, including healthcare and transport systems. However, the development of FL-enabled IDS for IoT is in its infancy, and still requires research efforts from various areas in order to identify the main challenges for deployment in real-world scenarios. In this direction, our work evaluates an FL-enabled IDS approach based on a multiclass classifier, considering different data distributions for the detection of different attacks in an IoT scenario. In particular, we use three different settings that are obtained by partitioning the recent ToN_IoT dataset according to IoT devices' IP address and types of attack. Furthermore, we evaluate the impact of different aggregation functions in each setting by using the recent IBMFL framework as FL implementation. Additionally, we identify a set of challenges and future directions based on the existing literature and the analysis of our evaluation results.

• Partitioning of the recent ToN IoT dataset to evaluate the impact of data distribution in a multi-class classifier for detecting specific types of attacks.
• Quantitative analysis of the impact of non-iid data considering different aggregation methods and training rounds by using the recent IBMFL implementation.
• Definition of the main challenges and future trends to be considered in the coming future for the development of FL-enabled IDS for IoT scenarios.

Introduction
Nowadays, the constant development and deployment of Internet of Things (IoT) technologies is increasing the attack surface of physical devices that could be potentially exploited by malicious entities [1]. Well-known attacks, such as the Mirai botnet and recent variants [2], demonstrate the need to strengthen IoT devices' security in order to protect large-scale IoT-enabled systems. Due to the development of such increasingly sophisticated attacks, in recent years the use of machine learning (ML) techniques has been widely considered for the detection and mitigation of these attacks in IoT scenarios. Indeed, the application of ML techniques has been proposed in recent works to improve the detection capabilities of the well-known intrusion detection systems (IDS) through the application of diverse techniques (e.g., neural networks) to infer potential attacks based on the analysis of network traffic [3]. Despite the advantages provided by the application of ML techniques to enhance IDS approaches (e.g., in terms of attack detection accuracy), a main limitation of current approaches is that they are based on a centralized training process in which a single entity receives the network traffic data from different devices to train a certain ML model. Therefore, such component has access to the whole network traffic derived from the communication of the different devices participating in the training process and also devices' local data. This aspect could lead to privacy issues, which could be exacerbated in IoT scenarios due to the amount and sensitivity of the information exchanged through certain devices, such as wearable or eHealth systems [4]; therefore, decentralised solutions to manage data are of great importance [5].
To address the limitations of traditional centralized ML approaches, Federated Learning (FL) was proposed in 2016 [6] as a collaborative learning approach in which end devices (also known as clients or parties) do not share their data, but only partial updates of a global model that are aggregated by a central entity (also known as the aggregator or coordinator). The use of FL is therefore intended to improve users' privacy, since the data of their devices is never shared with other entities. In general, an FL scenario is characterized by a large number of client devices with a variable amount and distribution of data. Indeed, real-life scenarios are usually based on non-independent and identically distributed (non-iid) data [7]. For example, in the case of an IDS deployed on a certain network, some target devices could have traffic associated with several kinds of attacks (e.g., DoS or port scanning), while other devices could only have traffic related to their intended operation. The development of FL-enabled IDS approaches in the context of IoT scenarios has attracted increasing interest in recent years [8,9,10]. However, most of the proposed approaches are based on unrealistic data distributions among the parties, inappropriate datasets and settings (e.g., [11]), or they use binary classification approaches, in which traffic data is only classified as attack or benign [12]. Consequently, it is hard for cybersecurity practitioners to identify the most challenging aspects of applying FL to enhance IDS approaches in IoT.
To fill this gap, our work provides a comprehensive evaluation of the use of FL for IDS in IoT by considering the impact of non-iid data. In particular, we evaluate the behavior of FL by considering different data distributions, training rounds and aggregation methods. For this purpose, we use the ToN IoT dataset [13,14], which has recently been proposed for IoT and Industrial IoT scenarios considering sensor data manipulation attacks, in addition to several network attacks. We propose three scenarios based on different partitions and processing of the ToN IoT dataset: in the first setting, network flows are split according to their destination IP address; the second scenario is balanced according to the types of attacks among the clients; then, a hybrid approach is considered as the third setting, in which we find a compromise between the balance of attack types and the destination IP address by means of the Shannon entropy [15]. These three configurations are publicly available at [16]. Then, we evaluate such scenarios by using the FedAvg [17] and Fed+ [18] aggregation methods through the IBM framework for Federated Learning, IBMFL [19]. Based on our evaluation results, and the analysis of the existing literature, we describe some of the main challenges for the development of FL-based IDS approaches to be deployed in IoT scenarios. Therefore, our work can be used as a reference for future research activities on the use of FL in this context. In summary, our contributions are:
• Identification of the main aspects for the evaluation of FL-enabled IDS for IoT, and analysis of existing proposals according to such aspects.
• Partitioning of the recent ToN IoT dataset to create different data distributions among clients to evaluate its impact on the overall system accuracy.
• Quantitative analysis of the impact of non-iid data considering different aggregation methods and training rounds by using the recent IBMFL implementation.
• Usage of multi-class classification for differentiating specific types of attacks in the output.
• Definition of the main challenges and future trends to be considered in the future years for the development of FL-enabled IDS for IoT scenarios.
The structure of the paper is organized as follows. Section 2 provides an overview of FL and the main advantages derived from its application to IDS. In Section 3, we describe and classify the existing research proposals on FL-enabled IDSs for IoT. Section 4 describes our methodology, including the aspects of the dataset partitioning, classification techniques and aggregation methods. Then, evaluation results are presented in Section 5. Based on such results and the analysis of existing literature, Section 6 highlights the main challenges for the development of FL-enabled IDS for IoT. Finally, Section 7 concludes the paper with an outlook of potential future directions to be considered.

FL-enabled IDS for IoT scenarios
Intrusion detection systems (IDS) have traditionally been considered as key components to protect ICT systems by identifying potential security attacks/threats derived from traffic monitoring and analysis. Although there are several classifications [3] [20], IDS approaches are usually categorized as signature and anomaly based systems. The former is based on pre-established network patterns and, consequently, it cannot be used to detect a new attack; the latter uses specific features of network traffic, so that a certain deviation on such network behavior can be considered as a potential attack. In recent years, the application of ML techniques to IDS has attracted a strong interest [21,22] considering different approaches such as neural networks [23], [24] or clustering techniques [25]. In the context of IoT, recent efforts have been proposed by considering specific IoT devices and technologies [3]. Indeed, the use of Deep Learning (DL) techniques has been recently evaluated through different types of neural networks for the detection of different attacks in such scenarios [26,27,28]. Despite these efforts, most of the proposed IDS approaches for IoT are based on centralized approaches in which devices send their local data to data centers in the cloud or servers with considerable computing capabilities to be analyzed through ML/DL techniques [11]. Such scenario raises significant issues that need to be considered [29]. First, the disclosure of IoT devices' local data could represent a privacy concern for end users, since an attacker could even infer users' daily habits by analyzing the traffic of their devices (e.g., wearables). This aspect could also pose an issue for a specific company where IoT devices share their network traffic with third parties. 
Second, given the dynamism of typical IoT environments, the time required to detect a potential attack could become a key aspect (or a limitation if the computing time is considerable) to prevent its spread in a certain network. In the case of using typical cloud data centers, the latency derived from the communication of a large quantity of data with data centers could be unaffordable or it could decrease the effectiveness of the IDS deployment. Although recent approaches propose the use of fog/edge computing [30] to balance the computing resources in the IDS implementation, this solution still raises privacy concerns as devices' data is shared with external entities (i.e., fog/edge nodes). Third, many IoT scenarios are comprised of resource-constrained devices communicating through wireless technologies with limited bandwidth and throughput. Therefore, the constant communication of devices' network data could represent a high overhead for IoT networks with a high number of connected devices.
To address these issues, there is a need for decentralized approaches with on-device learning in which devices themselves could perform local processing on their own network traffic data. As described by [29], a distributed or self-learning approach is a potential solution in which devices perform local training without interacting with each other. However, in this approach, devices are not able to improve their learning capacity based on the learning process of the other devices in the network. As an alternative, Federated Learning (FL) was proposed in 2016 [6] as a collaborative learning approach in which devices still interact with each other through a centralized entity without the need for sharing their data. Figure 1 shows an overview of the centralized, distributed, and federated learning approaches.
In a typical FL scenario, end devices do not share their data. Instead, they contribute updates to the global model based on local calculations on their own data. These nodes are typically called clients or parties, and the entity responsible for aggregating such local updates is called coordinator or aggregator. The training process is divided into a set of rounds, in which clients interact with the coordinator to update the global model until a certain number of rounds is performed or a certain accuracy is achieved. In particular, the main steps of each training round comprise [6] [31]:
1. The coordinator selects a subset of clients. For this purpose, different aspects can be considered; for example, in an IoT scenario, devices' computation/communication resources can be used to select the most suitable clients to participate in the training round [32].
2. The coordinator sends the parameters/weights of the global model to the selected clients.
3. The different clients update the global model's parameters/weights through a training process by using Stochastic Gradient Descent (SGD) with their local data. In the case of an IDS system, the training is intended to be performed by using the local network traffic of each client.
4. Then, the clients send their updated model's parameters/weights back to the coordinator. Depending on the aggregation algorithm being used, the coordinator aggregates all the parameters/weights to build a new global model, which will be used in the next training round. Although FedAvg is the most widely used aggregation algorithm [6], there is a plethora of alternative algorithms that can be considered for this process, such as FedProx [33] or the recent Fed+ [18], which is used in our evaluation.
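The four steps above can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions, not the IBMFL implementation: the function names `local_update` and `training_round`, the one-parameter least-squares model standing in for the local classifier, the learning rate, and the client data are all hypothetical. Aggregation is done as a data-size-weighted average in the spirit of FedAvg.

```python
import random

def local_update(weights, data, lr=0.1):
    """Step 3 (toy): one epoch of SGD on a 1-parameter model y ~ w * x."""
    w = weights[0]
    for x, y in data:
        grad = 2 * (w * x - y) * x  # gradient of the squared error (w*x - y)^2
        w -= lr * grad
    return [w]

def training_round(global_weights, clients, fraction, rng):
    # Step 1: the coordinator selects a subset of clients.
    n_selected = max(1, int(fraction * len(clients)))
    selected = rng.sample(sorted(clients), n_selected)
    # Steps 2-3: each selected client receives the global weights
    # and updates them using only its local data.
    updates = {cid: local_update(list(global_weights), clients[cid])
               for cid in selected}
    # Step 4: the coordinator aggregates the updates, weighting each
    # client by the size of its local dataset.
    total = sum(len(clients[cid]) for cid in selected)
    return [
        sum(len(clients[cid]) * updates[cid][i] for cid in selected) / total
        for i in range(len(global_weights))
    ]

# Hypothetical local datasets; every sample satisfies y = 2 * x.
clients = {
    "a": [(1.0, 2.0), (2.0, 4.0)],
    "b": [(1.0, 2.0)],
    "c": [(0.5, 1.0), (1.5, 3.0), (1.0, 2.0)],
}
rng = random.Random(0)
weights = [0.0]
for _ in range(50):
    weights = training_round(weights, clients, fraction=1.0, rng=rng)
```

Note that the raw samples never leave `clients`; only the updated weights are passed back to the aggregation step, which is the property FL relies on.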
The application of FL in IoT scenarios has attracted a huge interest in recent years due to its benefits compared to traditional centralized learning approaches. However, there are still significant challenges to be considered, such as communication and computing requirements or potential security and privacy attacks [34,35]. In the context of IoT, the FL application for IDS is still in its infancy, and existing proposals are often based on unrealistic settings and data distributions. These efforts are described in the next section.

Related Work
As previously mentioned, the use of FL has attracted a significant interest in recent years due to its characteristics and strengths, which can be exploited in different IoT scenarios [36]. In this context, recent works have proposed the application of FL to improve IDS. To classify these works, we have considered various aspects, such as: analyzed attacks, training datasets, ML/DL algorithms to detect such attacks, aggregation methods, and implementation frameworks. An overview of these proposals is shown in Table  1.
Based on our analysis, we note that some of the proposed works use their own generated or simulated dataset. For example, [37] integrated an FL approach with fog computing, where fog nodes collaborated for detecting DDoS attacks. For this purpose, the authors use Gated Recurrent Units (GRUs) [48] as the ML technique, and FedAvg as the aggregation algorithm. Also based on GRU, [8] proposes the creation of communication profiles associated with IoT devices that are used to detect potential attacks. In this case, the dataset is generated from real devices and the use of traffic associated with the Mirai botnet [49]. As these works are not based on publicly available datasets, it is difficult to assess the suitability of their proposed approach. Furthermore, in the case of [37], the authors do not provide performance details, such as the different numbers of participating clients and training rounds.
While other FL-enabled IDS approaches have been proposed for IoT scenarios, they are not based on datasets with traffic associated with such devices. In this direction, [11] considers decision trees, Support Vector Machines (SVM), Random Forest and MultiLayer Perceptron (MLP) in a federated environment in which the aggregation process is enabled through the use of blockchain. The proposed approach is based on intermediate nodes to perform local training using IoT devices' data, as well as the KDDCup99 dataset [50]. Moreover, [9] uses the NSL-KDD dataset [51] and MLP as the ML model for an FL-enabled IDS system. The approach is based on the concept of mimic learning in which a student model is trained with a public dataset, which is labelled with a master model trained with sensitive data. Also based on the NSL-KDD dataset, [29] uses neural networks to propose an FL-enabled IDS considering three scenarios according to different data distributions regarding attack types. The use of neural networks is also proposed by [38], which integrates a differential privacy approach [52]. For this purpose, the authors consider a scenario with non-iid data using the CSE-CIC-IDS2018 dataset [53]. Moreover, [39] employs Binarized Neural Networks (BNNs) [54] in edge devices to reduce the overhead of traditional neural networks. The proposal is based on the datasets CICIDS2017 [53] and ISCX Botnet 2014 [55], as well as the aggregation algorithm signSGD [56] in order to reduce the overhead during the communication of model updates.
Besides previous works, recent efforts consider IoT-specific datasets to develop FL-enabled IDS in these scenarios. In particular, [40] proposes the use of deep belief networks [57] to be deployed in IoT gateways to detect potential attacks on a certain IoT subnet. Then, the different models are aggregated through FL. The proposed approach uses several datasets, such as the N-BaIoT [58] dataset, which includes IoT devices' traffic. However, the authors do not provide information on the implementation being used or evaluation details considering aspects such as data distribution, number of clients or training rounds. This dataset is also used by [12], which proposes a binary classification approach based on supervised learning (using MLP) and unsupervised learning (using autoencoders). Additionally, the proposed approach uses different aggregation methods based on [59], which are compared considering different types of attack. In this case, it should be noted that the authors created a balanced dataset with the same number of samples and proportion of classes for all devices. This distribution could be compared with our balanced scenario described in Section 4.2. Moreover, the Bot-IoT dataset [60] is used by [10], which proposes multiclass classification based on neural networks together with Principal Component Analysis (PCA) in an edge-based network architecture with IoT gateways. The proposal distributes the dataset among four clients according to attackers' IP address; however, details on the implementation being used and the data distribution in the different parties are not described. Additionally, other works on the use of FL for IDS in IoT are based on specific datasets for industrial environments. In this direction, [42] integrates Convolutional Neural Networks (CNN) and GRUs for the detection of different attacks using the dataset described in [43]. Furthermore, [46] also uses GRU with a dataset based on the well-known Modbus protocol [43].
Our literature analysis demonstrates that the development of FL-enabled IDS approaches for IoT is still in its infancy. On the one hand, while most of the previous works are intended to be considered in such scenarios, they are not based on datasets with IoT devices' network traffic. On the other hand, we note that a significant amount of the previous works do not provide information about the implementation being used, or details related to the evaluation process, such as number of clients or training rounds. Furthermore, most of the works do not describe the data distribution among the different clients, or they consider scenarios where clients' data are associated to a portion of the dataset that includes the same number of samples for each attack being considered. However, as discussed in previous works [7], the performance of FL can be reduced in the case of scenarios with non-iid and highly skewed data. While these aspects have not been evaluated in the context of FL-enabled IDS, our work provides an exhaustive evaluation under different data distributions using the recently proposed ToN IoT [61] dataset, which includes several IoT-related attacks. To cope with the impact of non-iid data, we compare the performance of the typical FedAvg algorithm with a recent approach called Fed+ [18]. To the best of our knowledge, this is the first approach evaluating the impact of non-iid data on the development of FL-enabled IDS for IoT.

Methodology
Before describing our evaluation results for the proposed FL-enabled IDS for IoT considering non-iid data, in this section we explain the main processes and assets used for this purpose. They include the dataset selection, data distribution among several FL clients, as well as the classifier technique and aggregation functions being considered.

Dataset selection
For the development of our FL-enabled IDS proposal for IoT, a key aspect is the selection of an appropriate dataset. As described in the previous section, recent approaches are based on obsolete and generic network traffic datasets, which do not consider IoT-specific protocols and attacks. Furthermore, as described by [12], most of the datasets for IDS were not conceived to be used in an FL environment, as they cannot be properly distributed among different clients. Therefore, our analysis is focused on IoT datasets for IDS that can be divided by IP address or device [12], namely Bot-IoT [60], N-BaIoT [58], MedBIoT [62], IoTID20 [63] and ToN IoT [61]. In the case of ToN IoT, we consider the CIC-ToN-IoT dataset [64], which is generated from the original pcap files of ToN IoT. An overview of these datasets is shown in Table 2, in which they are compared according to several aspects, such as number of features and samples, attacks, the use of labelled data, or their testbed.
A common aspect of the different datasets is that they are based on realistic testbeds, as well as labelled data considering different types of attack. Bot-IoT is the only analyzed dataset that provides training and testing sets. Furthermore, this dataset and IoTID20 identify a set of best features to be considered. However, we note that most of the datasets suffer from a significant imbalance between benign and attack traffic that can negatively affect the ML/DL approach, so that oversampling/undersampling could be required. In this direction, we note that the ToN IoT dataset provides the best ratio between benign and attack traffic. Furthermore, this dataset considers a broader diversity of attack types compared to the other datasets being analyzed. For example, N-BaIoT and MedBIoT focus on particular attacks that are launched by IoT devices composing a botnet. However, they do not consider other attacks, such as DDoS/DoS or MITM, that should be considered in IoT environments.
Moreover, while the different datasets are based on realistic testbeds, ToN IoT is built using an IoT/IIoT testbed composed by edge/fog nodes and cloud components to simulate an IoT/IIoT production environment. Furthermore, ToN IoT is the only dataset that considers data from sensor readings and telemetry data, which can be used to detect additional attacks (beyond the network level) in such environments. Although ToN IoT has been used in recent works (e.g., [26]), to the best of our knowledge, this is the first effort to consider ToN IoT in a FL setting.

ToN IoT partitioning
To create the three proposed scenarios based on different data distributions, we use the CIC-ToN-IoT dataset [64], which was generated through the CICFlowMeter tool [65] from the original pcap files of the ToN-IoT dataset, as previously described. This tool was used to extract 83 features, which were reduced by removing those with a non-numeric value (e.g., flow ID). Then, we separate the samples of the whole dataset according to the destination IP address, and select the 10 IP addresses with the most samples. Those observations constitute our dataset. The resulting dataset contains 4,404,084 samples, which represent 82.29% of the original CIC-ToN-IoT.
From this dataset, we create three scenarios to evaluate the impact of different data distributions on the performance of our multiclass classifier to detect attacks. The datasets of such scenarios are available at [16]. Specifically, we use Shannon entropy [15] to measure the imbalance of the different local datasets of each FL client. In particular, given a dataset of length $n$, and $k$ classes of size $c_i$, the balance between the classes is given by the formula:

$$\text{Balance} = \frac{H}{\log k} = -\frac{1}{\log k} \sum_{i=1}^{k} \frac{c_i}{n} \log \frac{c_i}{n}$$

where the function is equal to 0 if all classes are 0 except one, and is equal to 1 if all $c_i = n/k$. Furthermore, it should be noted that we consider that each FL client is represented by a single IP address. In this context, $n$ is the number of network flows, $k$ is the number of the attack classes and $c_i$ is their size.
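As an illustration of this balance measure, the following minimal Python sketch computes the normalised Shannon entropy from a list of class sizes. The function name `class_balance` and the example counts are hypothetical; empty classes are treated as contributing zero, following the usual convention for entropy.

```python
import math

def class_balance(class_sizes):
    """Normalised Shannon entropy of class sizes.

    Returns 0 when a single class holds all samples and 1 when
    every class has exactly n / k samples.
    """
    sizes = list(class_sizes)
    n = sum(sizes)
    k = len(sizes)
    if n == 0 or k < 2:
        return 0.0
    # Classes with zero samples contribute nothing to the entropy.
    entropy = -sum((c / n) * math.log(c / n) for c in sizes if c > 0)
    return entropy / math.log(k)

single_class = class_balance([100, 0, 0, 0])    # only one class present
uniform = class_balance([25, 25, 25, 25])        # perfectly balanced
skewed = class_balance([998, 2])                 # almost a single class
```

A party holding almost exclusively benign traffic, such as the `skewed` example, obtains a value close to 0, which is how highly unbalanced parties can be identified.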

Basic scenario
In this scenario, each FL client's dataset is based on the network traffic of the corresponding IoT device. As described in Table 3, in this case the distribution of classes and samples among the different nodes is highly unbalanced. Indeed, party 7 only has benign traffic samples, while parties 1 and 3 only have 2 samples of XSS attack. Consequently, these parties have the lowest Shannon entropy value. This scenario represents a typical situation in a certain IoT network in which specific devices can be victims of several attacks while other devices perform their intended operation and they are not subject to attacks. However, as described in Section 5, the straightforward application of FL in this scenario could result in poor performance and convergence issues.

Balanced scenario
In this case, we select a portion of our dataset, which is distributed among the 10 parties, so that each party has the same number of samples of each class. Therefore, as shown in Table 3, all the parties have the same Shannon entropy value. As will be described in Section 5, such balanced scenario presents better performance; however, in this case, each FL client could have samples of other nodes, so that it can result in privacy issues depending on the scenario being considered. It should be noted that such scenario can be compared with similar settings in previous works, such as [12], which uses a version of the N-BaIoT dataset where the number of samples and the class proportions are the same for all devices.

Mixed scenario
The mixed scenario is generated to achieve a tradeoff between the two previous settings: each party maintains its own samples, but they are locally balanced. In particular, we first select the parties with a Shannon entropy value higher than a certain threshold (0.2), that is, parties 0, 2, 4 and 5. Since the classes of these parties are still not well balanced after this initial filtering step, we use a simple instance selection mechanism that removes samples from the predominant classes until the Shannon entropy falls within a target range of values. With this range set between 0.66 and 0.71, we obtain a dataset that represents a compromise between the basic scenario, where no balancing was used, and the balanced scenario, where we artificially distributed the dataset among the 10 parties.
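One possible reading of this instance selection mechanism is sketched below. It is an assumption rather than the exact procedure used to build the published partitions: the function `undersample_to_range`, the 1% removal step and the class counts are hypothetical, and the sketch operates on class counts only, leaving open which individual samples are dropped.

```python
import math

def class_balance(class_sizes):
    """Normalised Shannon entropy of class sizes (0 = one class, 1 = uniform)."""
    sizes = list(class_sizes)
    n, k = sum(sizes), len(sizes)
    if n == 0 or k < 2:
        return 0.0
    return -sum((c / n) * math.log(c / n) for c in sizes if c > 0) / math.log(k)

def undersample_to_range(counts, low=0.66, step=0.01):
    """Trim the predominant class until the balance reaches `low`."""
    counts = dict(counts)
    while class_balance(counts.values()) < low:
        major = max(counts, key=counts.get)
        if counts[major] <= 1:
            break  # nothing left to remove; target cannot be reached
        # Remove a small fraction (at least one sample) of the largest class.
        counts[major] -= max(1, int(step * counts[major]))
    return counts

# Hypothetical class counts for one party (balance above the 0.2 threshold).
party = {"benign": 80000, "dos": 5000, "scanning": 3000, "xss": 1000}
balanced_party = undersample_to_range(party)
final_balance = class_balance(balanced_party.values())
```

Because each step removes only a small fraction of the largest class, the balance rises slowly and the loop stops just above the lower bound; a caller would still check that the result does not exceed the upper bound of 0.71.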

Multiclass classification
Considering the already described scenarios, we use a multiclass probabilistic classification model to classify the instances into benign or a specific type of attack. For this purpose, we apply multinomial logistic regression [66], also called softmax regression, due to its easy implementation and training efficiency. Its model coefficients can also be interpreted as indicators of feature importance. Multinomial logistic regression is a simple extension of binary logistic regression [67] that allows for more than two categories of the dependent or outcome variable which do not present an order. As with most classifiers, the input variables need to be independent for the correct use of the algorithm. Given the input $x$, the objective is to know the probability of $y$ (the label) in each potential class, $p(y = c \mid x)$. The softmax function takes a vector $z$ of $k$ arbitrary values and maps them to a probability distribution as follows:

$$\operatorname{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}}, \quad 1 \le i \le k$$

In our case, the input to the softmax will be the dot product between a weight vector $w$ and the input vector $x$ plus a bias for each of the $k$ classes:

$$p(y = c \mid x) = \frac{e^{w_c \cdot x + b_c}}{\sum_{j=1}^{k} e^{w_j \cdot x + b_j}}$$
The loss function for multinomial logistic regression generalizes the loss function for binary logistic regression and is known as the cross-entropy loss or log loss.
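For illustration, the class probabilities and the cross-entropy loss of a single sample can be computed as in the following sketch. The helper names (`softmax`, `predict_proba`, `cross_entropy`) and the numeric weights are hypothetical; the maximum score is subtracted before exponentiation purely for numerical stability and does not change the result.

```python
import math

def softmax(z):
    """Map k arbitrary scores to a probability distribution."""
    m = max(z)  # shifting by the maximum avoids overflow in exp
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(x, W, b):
    """p(y = c | x) for every class c, with one weight vector and bias per class."""
    scores = [sum(w_i * x_i for w_i, x_i in zip(w_c, x)) + b_c
              for w_c, b_c in zip(W, b)]
    return softmax(scores)

def cross_entropy(probs, true_class):
    """Log loss of one sample: negative log-probability of the true class."""
    return -math.log(probs[true_class])

# Hypothetical model with 3 classes and 2 features.
W = [[0.5, -1.0], [0.0, 0.3], [-0.2, 0.8]]
b = [0.1, 0.0, -0.1]
probs = predict_proba([1.0, 2.0], W, b)
loss = cross_entropy(probs, true_class=2)
```

The loss is small when the model assigns a high probability to the true class and grows without bound as that probability approaches zero.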
It should be noted that unlike previous works based on binary classifiers (e.g., [12]), we consider the detection of a specific attack as a key factor to dynamically deploy the most effective countermeasures to mitigate such threat. Furthermore, while other classifiers could be employed (and it represents part of our future work), our evaluation results are focused on the impact of different data distributions and non-iid data in the classifier performance.

Aggregation functions
As described in Section 2, the local updates generated by each client in FL are combined through an aggregation function in each training round. The most basic aggregation function is represented by FedAvg [6], which generates the global model based on the average of the weights generated by the FL clients. In particular, let $W = (w_i)$ be the weights of the general model and $W^k = (w_i^k)$ the weights of the party $k$, then:

$$w_i = \sum_{k} \frac{d_k}{D} \, w_i^k$$

where $D$ and $d_k$ are the total data size and the data size of each party, respectively.
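As a worked example of this weighted average, the sketch below aggregates the weights of two hypothetical parties holding 100 and 300 samples. The function name `fedavg` and the numbers are assumptions chosen for illustration.

```python
def fedavg(party_weights, party_sizes):
    """Average each weight across parties, weighted by local data size."""
    total = sum(party_sizes)            # D: total number of samples
    n_params = len(party_weights[0])
    return [
        sum(d_k * w_k[i] for w_k, d_k in zip(party_weights, party_sizes)) / total
        for i in range(n_params)
    ]

# Party 0 holds 100 samples, party 1 holds 300 samples.
global_w = fedavg([[1.0, 0.0], [3.0, 4.0]], [100, 300])
# First weight:  (100 * 1.0 + 300 * 3.0) / 400 = 2.5
# Second weight: (100 * 0.0 + 300 * 4.0) / 400 = 3.0
```

The party with more data pulls the global model towards its own weights, which is one reason why skewed local datasets can dominate the aggregated model.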
However, as described in recent works [68,17,7], the performance of FedAvg may be degraded in scenarios with non-iid and highly skewed data. While recent works propose alternative aggregation functions considering convergence and privacy aspects [34], in this work we consider a recent approach called Fed+ [18], which unifies several functions to cope with scenarios composed of heterogeneous data distributions. For this purpose, Fed+ relaxes the requirement of forcing all parties to converge on a single model (as in the case of FedAvg). In particular, let the main objective in FedAvg over $K$ parties be:

$$\min_{x} \; \frac{1}{K} \sum_{i=1}^{K} f_i(x)$$

where $f_i$ is the local loss function of the party $i$. In the case of Fed+, each party $i$ keeps its own model $x_i$ and the main objective is:

$$\min_{x_1, \dots, x_K} \; \frac{1}{K} \sum_{i=1}^{K} \left[ f_i(x_i) + \mu \, B\big(x_i, C(x)\big) \right]$$

where $B(\cdot, \cdot)$ is a distance function, $C$ is an aggregate function that computes a central point of $x = (x_1, \dots, x_K)$, and $\mu$ is a constant weighting the distance term.
It should be noted that this work represents the first effort to use Fed+ to evaluate its impact in the context of FL-enabled IDS for IoT. As will be described in Section 5, the use of this approach mitigates the convergence issues of FedAvg, especially in settings with non-iid and skewed data.

Evaluation results
Based on the different aspects of the proposed methodology, in this section we describe our evaluation results. For this purpose, we consider the following metrics:
• Accuracy: Precision, recall, F1-score, and FPR metrics are calculated for each scenario described in Section 4.2. In the case of multiclass classification, such metrics can be calculated by using micro, macro, and weighted averaging. Micro-averaging calculates the metrics using the total number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), independently of the number of classes. Macro-averaging calculates each metric for each class independently, and then takes the unweighted average of all the classes' values. Weighted averaging follows a similar approach to macro-averaging, but the average is weighted by the size of each class. As some of our scenarios are based on imbalanced datasets (see Section 4.2), we use weighted averaging for our evaluation.
Moreover, we train the model across 300 rounds for each scenario by considering one epoch for each training round. The number of epochs is a hyperparameter that defines the number of times that the learning algorithm will work through the entire training dataset in each specific client. One epoch means that each sample in the training dataset has updated the internal model parameters only once. Furthermore, the logistic regression algorithm is implemented by using the scikit-learn SGDClassifier (Stochastic Gradient Descent). In particular, we choose a logarithmic loss function to obtain logistic regression, and the L2 norm as penalty in order to shrink the model parameters towards the zero vector. Before training, the data is normalized. Furthermore, a ratio of 80-20 was defined between training and testing sets.
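The difference between the three averaging strategies can be seen in a small sketch for precision. The function names and the toy labels are hypothetical, and the code only illustrates the definitions above; it is not the evaluation code used to produce our results.

```python
def per_class_counts(y_true, y_pred):
    """Count TP, FP, FN and support (number of true samples) per class."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = {c: {"tp": 0, "fp": 0, "fn": 0, "support": 0} for c in labels}
    for t, p in zip(y_true, y_pred):
        counts[t]["support"] += 1
        if t == p:
            counts[t]["tp"] += 1
        else:
            counts[t]["fn"] += 1
            counts[p]["fp"] += 1
    return counts

def precision(y_true, y_pred, average):
    counts = per_class_counts(y_true, y_pred)
    per_class = {
        c: (v["tp"] / (v["tp"] + v["fp"]) if v["tp"] + v["fp"] else 0.0)
        for c, v in counts.items()
    }
    if average == "micro":      # pool the counts of all classes
        tp = sum(v["tp"] for v in counts.values())
        fp = sum(v["fp"] for v in counts.values())
        return tp / (tp + fp)
    if average == "macro":      # unweighted mean over classes
        return sum(per_class.values()) / len(per_class)
    if average == "weighted":   # mean weighted by class size
        return sum(per_class[c] * counts[c]["support"] for c in counts) / len(y_true)
    raise ValueError(f"unknown average: {average}")

# Toy imbalanced example: 8 benign flows and 2 DoS flows, one DoS flow missed.
y_true = ["benign"] * 8 + ["dos"] * 2
y_pred = ["benign"] * 8 + ["benign", "dos"]
micro = precision(y_true, y_pred, "micro")
macro = precision(y_true, y_pred, "macro")
weighted = precision(y_true, y_pred, "weighted")
```

In this example the per-class precision is 8/9 for benign and 1 for DoS, so the macro average treats both classes equally while the weighted average is dominated by the larger benign class.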
For our evaluation, we consider FedAvg and Fed+ as aggregation functions in our FL-enabled IDS approach. Furthermore, we also measure the accuracy of each client in a distributed scenario, where each party trains the model using its own data independently from the other parties (see Section 2). It should be noted that we do not consider a centralised setting (in which devices send their data for training a model) because in that case all the classes would be represented in the dataset. Therefore, it would be unfair to compare such a setting with a distributed/federated scenario in which clients only have traffic associated with their IP address, and only some of the classes are represented in their partial datasets. Nevertheless, for the sake of completeness, we measure the accuracy of the centralised setting and obtain a value of 0.724 using multinomial logistic regression. This value is close to 0.77, which represents the highest accuracy value obtained in the work describing the ToN IoT dataset [61].
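As a reference for the aggregation step, the following is a minimal sketch of FedAvg in its commonly used form, where the global parameters are the average of the parties' parameters weighted by their number of local samples. The function name `fedavg` and the toy values are assumptions; whether the framework used in the evaluation weights parties in exactly this way is not stated here, and Fed+ is not sketched because its update rule is not detailed in this section.

```python
import numpy as np


def fedavg(client_weights, client_sizes):
    """Average client parameter vectors, weighted by local sample counts."""
    stacked = np.stack(client_weights)             # (n_clients, n_params)
    sizes = np.asarray(client_sizes, dtype=float)  # (n_clients,)
    return np.average(stacked, axis=0, weights=sizes)


# Two hypothetical parties: the second one holds three times more samples.
w_party_0 = np.array([1.0, 2.0])
w_party_1 = np.array([3.0, 4.0])
global_weights = fedavg([w_party_0, w_party_1], client_sizes=[1, 3])
```

Because the average is weighted by sample count, parties with large and skewed local datasets dominate the global model, which is one intuition for the behavior of FedAvg under non-iid data.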

Basic scenario
As described in Section 4.2, in this scenario each party holds the data corresponding to the traffic associated with a single IP address. This scenario is characterized by a non-iid and highly skewed data distribution. This aspect is reflected in Figure 2 and Figure 3, which show the accuracy evolution of each client using FedAvg and Fed+, respectively. As shown, the accuracy value of each party remains stable throughout the training rounds. While the accuracy value seems high for parties 0, 2, 3, 4, 5, 6, and 8, this may be related to the heavily imbalanced dataset, for which accuracy may not be a reliable indicator because of the predominance of the larger class (e.g., the legitimate traffic in this case). Indeed, if one class represents the vast majority of the dataset, the classification process will provide a high accuracy even if only that class is actually learned, and applying such a model to a more balanced dataset may result in lower accuracy. It should be noted that, according to Figure 2, the accuracy of parties 3, 4, 7, and 9 decreases after around 200 training rounds. This could be related to the use of FedAvg as aggregation function, which could present convergence issues, as described by recent works [17]. Table 4 shows the accuracy of each party in the distributed and the federated scenario (using Fed+). It should be noted that parties with a low entropy (see Section 4.2) provide a higher accuracy in the distributed setting than in the federated scenario. This can be explained by the fact that parties with fewer classes and a less balanced dataset classify the samples of their predominant classes better. In a federated environment, the weights of those parties with few classes are negatively influenced by the weights of other parties with more classes, because the latter detect different and additional types of attacks.
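The entropy mentioned above can be computed from the class labels of each local dataset. The sketch below assumes the standard Shannon entropy over class proportions with a base-2 logarithm; the exact formulation used in Section 4.2 may differ, and the label names are illustrative only.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy (in bits) of the class distribution of a dataset."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# A party that only sees benign traffic has zero entropy.
single_class_party = ["normal"] * 100

# A party with four equally represented classes reaches log2(4) = 2 bits.
balanced_party = ["normal", "ddos", "scanning", "injection"] * 25

# A skewed party lies in between.
skewed_party = ["normal"] * 95 + ["ddos"] * 5

h_single = shannon_entropy(single_class_party)
h_balanced = shannon_entropy(balanced_party)
h_skewed = shannon_entropy(skewed_party)
```

A low value indicates that one class dominates the local dataset, which is the situation in which accuracy alone tends to be misleading.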
As shown in Figure 4, the other metrics (beyond accuracy), calculated with Fed+, remain stable through the rounds, following a similar trend to the accuracy. The parties with a high FPR have poor results in terms of the other metrics. The values of recall, F1-score and precision of these parties are similar to their accuracy values, except for party 2 and party 8, which provide 0 for precision and recall (and consequently for F1-score), and 1 for FPR. This situation can arise in scenarios with unbalanced datasets (as in this case), where a high accuracy is obtained (due to a high TN ratio) but recall and precision remain low (because of a low TP ratio). The previous results demonstrate that the direct application of FL to scenarios with non-iid and highly skewed data could lead to undesirable results. Therefore, there is a need to consider a suitable client/instance selection process to make the dataset more balanced among the different clients in terms of number of classes and samples. The evaluation results for the balanced and mixed scenarios demonstrate the importance of such a process, and are described below.
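The way a high accuracy can coexist with zero precision and recall on an imbalanced dataset can be shown with a small binary sketch. The class proportions and the always-benign classifier are assumptions chosen for illustration; the example simplifies the multiclass setting to two classes and does not attempt to reproduce the values reported for any party.

```python
def binary_metrics(y_true, y_pred, positive="attack"):
    """Accuracy, precision and recall with `positive` as the attack class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Undefined ratios (no positive predictions) are reported as 0.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# 95% benign traffic, 5% attacks (assumed proportions).
y_true = ["benign"] * 95 + ["attack"] * 5

# A model that has only learned the majority class.
y_pred = ["benign"] * 100

metrics = binary_metrics(y_true, y_pred)
```

Here the accuracy is 0.95 although no attack is detected, which is why precision, recall and F1-score are reported alongside accuracy.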

Balanced scenario
In this scenario, the data is equally distributed among parties according to the description provided in Section 4.2.2. Figure 5 and Figure 6 show the evolution of the parties' accuracy using the FedAvg and Fed+ algorithms, respectively. In the case of FedAvg, parties with a high accuracy see this value decrease throughout the rounds. For parties with a low accuracy, the evolution is similar to the Fed+ case. As shown in Figure 6, there is a clear increase in the accuracy of all parties, which then remains stable (between around 0.8 and 1) after about 50 rounds. The evolution of the FPR, F1-score, recall and precision metrics in the case of Fed+ is shown in Figure 7. In particular, the values of recall, F1-score and precision increase throughout the rounds with a similar trend to the accuracy. Moreover, the FPR value decreases throughout the rounds until it converges to a lower value. Compared with the results for the basic scenario, these metrics have values close to the accuracy and follow a similar trend.
According to the obtained results, this scenario shows a better evolution of the different metrics across the parties than the basic scenario. In particular, in the case of Fed+, all the parties improve these metrics throughout the initial 50 rounds, after which their values remain stable. However, in the case of FedAvg, these values drop for some of the parties. Therefore, although this scenario was artificially balanced so that the parties have samples of all the different attacks, the use of FedAvg could still lead to convergence issues. This could be due to the fact that, even with a more balanced dataset among the different parties, the number of samples of each attack in every party still remains unbalanced.

Mixed scenario
The data distribution for this scenario is described in Section 4.2.3. Figure 8 shows the accuracy evolution for each party when FedAvg is used. According to it, there is a clear decrease in the accuracy of party 2 until about round 200, and this trend is also observed for party 0 after a significant increase in the very first rounds. In the case of party 4 and party 5, the accuracy value remains stable. The decrease in accuracy is due to the imbalance of the scenario, in which parties 0, 4 and 5 only have a subset of attack types. Figure 9 shows the accuracy evolution of the different parties with Fed+, in which accuracy values grow for a certain number of rounds (about 50), after which they remain stable. In the case of party 2, accuracy oscillates more because this party has samples of all the different classes in its local dataset. [...] to parties 2 and 4. It should be noted that these results are similar to the balanced scenario. As in the previous case, this means that accuracy results are consistent with the values obtained for the other metrics. Based on the obtained results, this scenario represents a trade-off between the previous two scenarios, obtaining results similar to the balanced setting, where samples are shared among the different parties. Furthermore, the previous results demonstrate the need for considering additional aggregation functions (beyond FedAvg) in order to deal with scenarios characterized by non-iid and skewed data among the parties, which are common in real-world deployments.

Comparison between basic, balanced, mixed, and distributed scenarios
After analyzing the different evaluation metrics, Figure 11 shows a comparison of the average accuracy of the parties for each federated scenario and a distributed setting, considering FedAvg and Fed+. According to these results, Fed+ provides higher accuracy for all the federated scenarios being considered. This demonstrates that it is better able to handle scenarios where parties do not have balanced datasets.
For the basic scenario, the graphs are similar for FedAvg and Fed+. However, it should be noted that, in the case of Fed+, accuracy remains constant at about 0.8725, which is close to the 0.8718 of the distributed method, whereas it drops slowly from 0.8725 when using FedAvg. In the balanced scenario, accuracy starts at 0.8569 and rapidly grows to 0.9039 (where it remains stable throughout the rounds) for Fed+. When FedAvg is used, accuracy grows from 0.8349 to 0.88 after 50 rounds, but then gradually drops to 0.87. Fed+ reaches an accuracy similar to the 0.9065 of the distributed setting, since all parties have the same amount of data and number of classes. However, FedAvg does not reach the accuracy of the distributed case. The main reason is that, while the parties' datasets are balanced among each other, each local dataset is unbalanced in relation to the number of samples for each class.
In the case of the mixed scenario, accuracy (when Fed+ is used) goes from 0.8498 to 0.8876 after 50 rounds, and it remains stable until it finishes at 0.8869. Indeed, after about 40 rounds, Fed+ overtakes the accuracy of the distributed case (0.877). However, the behavior of FedAvg is worse than the distributed case. In particular, accuracy goes from 0.8157 to 0.8698 after 10 rounds, but then decreases slowly to 0.8423. Therefore, in this scenario, Fed+ clearly improves on the behavior of FedAvg.
Based on the previous evaluation, Fed+ provides better results than FedAvg, which could introduce convergence issues in certain situations. Indeed, Fed+ provides better results for the mixed scenario than FedAvg does for the balanced setting. Based on the results for the different scenarios, it should be noted that the impact of different data distributions is clearer in the case of Fed+ and the distributed setting, where the best results are obtained for the balanced scenario, while the basic scenario provides the lowest accuracy value. However, in the case of FedAvg the basic and balanced scenarios provide similar accuracy results, while the mixed scenario presents lower accuracy values. In any case, as already mentioned, the use of Fed+ has a clear impact on the results obtained for the different scenarios, improving the values of the evaluation metrics with respect to those obtained when FedAvg is employed.

Challenges and research directions
Based on the evaluation results provided in the previous section and the analysis of the literature on the use of FL [69,34], below we describe some of the main challenges and future research directions to be considered for the development of FL-enabled IDS in the scope of IoT scenarios. As described by [69], it should be noted that many of the challenges associated with the use of FL in such context will require multidisciplinary approaches, including the application of privacy techniques, cryptography, distributed optimization, or information theory.

Deploying FL on IoT devices
While our work focuses on the impact of different data distributions on FL by using a simulated testbed, a significant set of challenges is derived from the deployment of an FL framework on real IoT devices. Indeed, as described by [70], the computational requirements of well-known ML approaches might not be satisfied by constrained IoT devices in terms of memory, computing power and energy consumption. This aspect can be aggravated in the case of applying DL techniques, which in general require more computing resources than ML. To address such limitations, a current trend is the use of intermediate nodes at the network edge, so that the end devices send their data to these nodes acting as FL clients [71,72]. For example, [73] uses intermediate entities (called RSPs) in charge of performing the local training in an FL setting. A similar approach is also proposed by [74], which uses an edge computing architecture to determine the aggregation frequency of the global model. However, it should be noted that sharing network traffic with these intermediate nodes to identify potential attacks can still pose privacy concerns. Other approaches consist of reducing the data that needs to be sent by segmenting and representing it [75], as well as by exploring feature selection [76,77]. Therefore, more efforts are needed to analyze the practical limitations of FL approaches in IoT scenarios, as well as the security and privacy implications derived from the use of edge computing architectures. In this context, a potential research direction is associated with the application of TinyML frameworks (e.g., TensorFlow Lite [78]) in FL scenarios, as recently described by [79].

Limitations of existing IDS-IoT datasets for FL
As described in Section 3, some of the existing FL-enabled IDS proposals for IoT are based on general network datasets, which do not consider IoT technologies and devices. Even though some datasets have recently been proposed for IoT scenarios, as described by [12], some of them cannot be applied in an FL environment, since they do not provide data associated with different IP addresses or devices, in particular the destination IP addresses that can be identified as the parties of the FL environment in IDS. Furthermore, as described in Section 4, most IDS datasets for IoT present a significant imbalance between benign and attack traffic, as well as a limited set of attacks being considered. Moreover, we note that ToN IoT is the only dataset that considers possible security threats related to telemetry data and sensor readings, unlike other datasets that only deal with network attacks. However, as described by [80], the development of IDS datasets for IoT still needs to consider a broader scope of IoT technologies (including well-known protocols like CoAP [81]), as well as additional aspects (e.g., energy consumption) that can serve to identify potential attacks. Therefore, more effort is needed in the development of IDS datasets for IoT that can be divided among parties in order to be deployed in an FL setting.
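The divisibility requirement discussed above can be made concrete with a short sketch that splits flow records into FL parties by destination IP address. The field names `dst_ip` and `type`, the helper `partition_by_destination`, and the sample records are assumptions for illustration; a dataset that lacks a per-device or per-address field cannot be partitioned in this way.

```python
from collections import Counter, defaultdict


def partition_by_destination(records, ip_field="dst_ip"):
    """Group flow records into one local dataset per destination IP."""
    parties = defaultdict(list)
    for record in records:
        parties[record[ip_field]].append(record)
    return dict(parties)


# Hypothetical flow records; real datasets carry many more features.
records = [
    {"dst_ip": "192.168.1.10", "type": "normal"},
    {"dst_ip": "192.168.1.10", "type": "ddos"},
    {"dst_ip": "192.168.1.20", "type": "normal"},
    {"dst_ip": "192.168.1.20", "type": "normal"},
    {"dst_ip": "192.168.1.10", "type": "scanning"},
]

parties = partition_by_destination(records)

# Class distribution seen by each party, useful to inspect the skew.
class_counts = {ip: Counter(r["type"] for r in rows)
                for ip, rows in parties.items()}
```

Inspecting `class_counts` per party gives a quick view of how non-iid the resulting partition is before any training takes place.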

Aggregator as bottleneck
Even though FL is based on a collaborative training approach, the coordinator entity may become a bottleneck from a performance and privacy perspective, as well as a single point of failure. To address this issue, a current trend is the application of blockchain technology [82], which represents a distributed and immutable ledger shared by several nodes. The use of blockchain can increase the level of trust in an FL environment, where the centralized coordinator is replaced by a set of nodes with distributed functionality, which is carried out through smart contracts. Indeed, blockchain has been proposed in recent works to make model updates accountable and avoid potentially malicious updates [83]. In the context of an IDS approach for IoT, [73] uses intermediate nodes acting as blockchain clients to store the model parameters updated by the end devices in order to avoid potential manipulation. Despite these efforts, we note that most current approaches do not provide comprehensive evaluations considering training frequency and scenarios with a large number of devices, which may be required for IDS approaches. Furthermore, as described by [69], the use of permissionless blockchains (e.g., Ethereum [84]) can raise privacy concerns, which must be addressed by proper encryption or differential privacy techniques, as described in Section 6.8.

Communication requirements
The need for a significant communication bandwidth to exchange global model updates represents a well-known issue associated with the use of FL [70]. This problem can be exacerbated in IoT scenarios where end devices acting as FL clients need to communicate their model updates through constrained networks and devices, which can degrade the network or IoT performance [85]. In general, there are two main factors that impose strong communication requirements between FL clients and the coordinator. The first is related to the amount of data associated with the gradient exchange [86], which is required between clients and the coordinator for the learning process. This is generally addressed by gradient compression techniques, such as quantization and sparsification, as described by [87]. The second is related to the number of training rounds required for the model to converge, which can vary depending on the scenario, dataset, data distribution, or the ML algorithm being considered. For example, based on our evaluation results, the different metrics remain stable after 50 rounds in the balanced and mixed scenarios, although this may be different under other evaluation conditions. While a common trend to reduce the training rounds is to perform several local training iterations before updating the global model [88], the execution of such local training iterations may have a significant impact on FL clients, especially in the case of resource-constrained devices (see Section 6.1).
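The two compression families mentioned above can be sketched in a few lines. This is only an illustration of top-k sparsification and uniform 8-bit quantization under simple assumptions; the function names and the toy update are hypothetical, and the specific schemes surveyed in [87] may differ (for example, by accumulating the discarded residual across rounds).

```python
import numpy as np


def top_k_sparsify(update, k):
    """Keep the k entries with largest magnitude; return indices and values."""
    flat = update.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    return idx, flat[idx]


def densify(idx, values, shape):
    """Rebuild a dense update, filling the dropped entries with zeros."""
    out = np.zeros(int(np.prod(shape)))
    out[idx] = values
    return out.reshape(shape)


def quantize_uint8(update):
    """Map values linearly onto 256 levels between the min and the max."""
    lo, hi = float(update.min()), float(update.max())
    scale = (hi - lo) / 255 or 1.0  # avoid a zero scale for constant input
    q = np.round((update - lo) / scale).astype(np.uint8)
    return q, lo, scale


def dequantize(q, lo, scale):
    """Approximate reconstruction of the original values."""
    return q.astype(np.float64) * scale + lo


update = np.array([0.1, -2.0, 0.05, 3.0])

# Sparsification: only 2 of the 4 values (plus their indices) are sent.
idx, values = top_k_sparsify(update, k=2)
sparse_update = densify(idx, values, update.shape)

# Quantization: each value is sent as one byte instead of eight.
q, lo, scale = quantize_uint8(update)
restored = dequantize(q, lo, scale)
```

Both techniques trade a bounded approximation error for a smaller payload per round, which is the relevant trade-off for constrained IoT networks.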

Client selection
As described in Section 2, in each training round, the coordinator can select a subset of devices to participate as FL clients in the training process. For this purpose, different aspects such as device status, battery level, computing/communication capacity, or the accuracy of the ML technique could be considered [89,90]. Indeed, the client selection process can have an impact on the obtained accuracy and, therefore, on the detection of potential security attacks in the scope of an IDS approach. In our case, according to the results described in Section 5, we found that even a static client selection process can help to obtain a better performance of the ML algorithm. However, more sophisticated client selection strategies must consider the dynamic aspects of an IoT environment in each training round. For example, some devices may not be available in a certain round due to mobility issues or loss of connectivity [71]. Furthermore, due to device heterogeneity, while some devices could perform the local training in a few milliseconds, others could require a longer period to update the model (e.g., due to resource constraints), which could slow down the overall federated training [32]. In the context of IDS, this could lead to a longer delay in detecting a certain attack, which could have severe consequences on the overall cybersecurity of the network. An additional aspect is related to the need to provide incentives to devices, in order to foster their participation in the training process [31]. Otherwise, some devices may not want to use their limited resources for this purpose. While some recent works address this issue in IoT scenarios [91], more efforts are required in real IoT environments to evaluate its impact on the learning process.
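A per-round selection step that takes device status and battery level into account could be sketched as follows. The `Device` fields, the battery threshold, and the random sampling among eligible devices are all assumptions for illustration and do not correspond to a specific strategy from the cited works.

```python
import random
from dataclasses import dataclass


@dataclass
class Device:
    device_id: str
    online: bool    # connectivity status in the current round
    battery: float  # remaining battery as a fraction in [0, 1]


def select_clients(devices, n, min_battery=0.3, seed=None):
    """Pick up to n devices that are reachable and have enough battery."""
    eligible = [d for d in devices if d.online and d.battery >= min_battery]
    rng = random.Random(seed)
    return rng.sample(eligible, min(n, len(eligible)))


devices = [
    Device("cam-1", online=True, battery=0.9),
    Device("cam-2", online=False, battery=0.8),  # lost connectivity
    Device("lock-1", online=True, battery=0.1),  # battery too low
    Device("hub-1", online=True, battery=0.6),
    Device("bulb-1", online=True, battery=0.5),
]

selected = select_clients(devices, n=2, seed=0)
```

Calling the function again in each round with refreshed device state captures the dynamic aspect discussed above, since the eligible set changes as devices move, disconnect, or drain their batteries.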

Dynamic IoT devices' behavior throughout their lifecycle
Another aspect is the need to consider the changing behavior of IoT devices throughout their lifecycle. For example, a software update process for a certain device can change its behavior [92], so that a new learning process is required in order to reflect the new behavior as benign traffic in the context of an IDS. However, such a change could also be related to a potential attack affecting this device. Therefore, there is a need to integrate network management approaches to detect whether behavioral changes in a certain device are produced intentionally or are due to a malicious action. Furthermore, the behavioral changes of a single device could affect the behavior of other interacting devices. In the case of an FL scenario, this could require new training rounds that might have a significant impact, especially in settings with constrained devices and networks. However, this aspect is not addressed by existing FL-enabled IDS approaches, which are based on existing datasets that do not reflect potential behavioral changes of IoT devices throughout their lifecycle.

Security attacks
As in the case of centralized approaches, FL is also susceptible to several attacks that can affect the learning process. Indeed, as described in recent works [31], some of the major security threats in FL are represented by data poisoning and model update poisoning attacks. The former is related to the attacker's ability to add false training data or modify the existing dataset of a certain client, for example, by modifying the labels (label-flipping). The latter focuses on changing the global model instead of the local training dataset. The realization of such attacks could cause false alarms in an IDS approach due to misclassification of benign/malicious traffic [93]. To address such concerns, a recent work evaluates the behavior of different aggregation functions against several security attacks in an FL-enabled IDS approach [12]. Indeed, the application of certain aggregation approaches could help to make an FL setting more robust against potential attacks. In this direction, as part of our future work, we will evaluate how Fed+ behaves in the context of different data poisoning and model update poisoning attacks with different data distributions. Other complementary approaches to be considered are based on network management approaches to ensure that only devices behaving as intended can participate in the training process [35]. These proposals still require lightweight cryptographic mechanisms to be considered in real IoT environments. Additionally, trust and reputation mechanisms can also be used in order to prevent malicious nodes from injecting false data into the training phase, even when using suitable cryptographic approaches [94].
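The observation that the aggregation function influences robustness can be illustrated by comparing a plain mean with a coordinate-wise median when one update is manipulated. The coordinate-wise median is used here only as a well-known example of a robust aggregator; it is not claimed to be the function evaluated in [12] or in this work, and the update values are invented for the sketch.

```python
import numpy as np


def mean_aggregate(updates):
    """Plain average of the client updates (no robustness)."""
    return np.mean(np.stack(updates), axis=0)


def median_aggregate(updates):
    """Coordinate-wise median, which bounds the effect of a few outliers."""
    return np.median(np.stack(updates), axis=0)


# Three honest updates that roughly agree with each other.
honest = [
    np.array([1.0, 1.0]),
    np.array([1.1, 0.9]),
    np.array([0.9, 1.1]),
]

# One manipulated update trying to drag the global model away.
poisoned = np.array([100.0, -100.0])

updates = honest + [poisoned]
mean_result = mean_aggregate(updates)
median_result = median_aggregate(updates)
```

In this toy case the mean is pulled far from the honest updates while the median stays close to them, which gives an intuition for why the choice of aggregation function matters under model update poisoning.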

Privacy concerns
While FL was mainly proposed to mitigate the privacy concerns associated with centralized learning approaches, it can still leak information from clients' training data. Indeed, as described by [31], a malicious server could infer information from model updates, as well as alter them in order to fool the global model. This can be exacerbated in the context of IDS approaches for IoT, where device network traffic data can reveal everyday user habits. Therefore, the application of privacy-preserving techniques for FL has attracted significant interest recently [34], including the use of differential privacy (DP) approaches [52], secure multiparty computation (SMC) [95] and homomorphic encryption [96]. However, these techniques often come at a cost in terms of accuracy and efficiency [97,34], which can negatively affect the attack detection capabilities of IDS approaches. Indeed, a recent work evaluates the application of DP for an FL-enabled IDS considering non-iid data [38]. Although other recent efforts have been proposed for IoT scenarios [85], more studies are required to find a trade-off between privacy requirements and the performance and accuracy requirements of effective IDS approaches.
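One common way of applying DP to model updates is to clip each update to a maximum L2 norm and add Gaussian noise before sending it. The sketch below assumes this clip-and-noise pattern; the parameter names `clip_norm` and `noise_multiplier` and their values are hypothetical, the privacy budget is not computed, and the mechanisms used in the cited works may differ.

```python
import numpy as np


def clip_update(update, clip_norm):
    """Scale the update down so that its L2 norm is at most clip_norm."""
    norm = np.linalg.norm(update)
    if norm == 0.0:
        return update
    return update * min(1.0, clip_norm / norm)


def privatize_update(update, clip_norm, noise_multiplier, rng):
    """Clip the update and add Gaussian noise scaled to the clipping bound."""
    clipped = clip_update(update, clip_norm)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=update.shape)
    return clipped + noise


rng = np.random.default_rng(0)
update = np.array([3.0, 4.0])  # L2 norm of 5

clipped = clip_update(update, clip_norm=1.0)
noisy = privatize_update(update, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
```

A larger noise multiplier gives stronger privacy but perturbs the update more, which is the accuracy cost referred to above.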

Conclusions
The application of FL techniques has attracted significant interest in recent years due to their advantages over traditional centralized learning approaches. In this work, we provided an overview of the current research efforts for the application of FL toward the development of IDS approaches for IoT scenarios. Unlike previous works, we considered several settings with different data distributions. Our evaluation demonstrates the impact of non-iid and highly skewed data distributions on the FL performance, which directly affects the effectiveness of the security attack detection. We demonstrate that an instance selection process based on the Shannon entropy of each local dataset can improve the overall accuracy, obtaining results similar to those of a scenario where the dataset is balanced among the parties. Toward this end, we evaluated the use of the FedAvg and Fed+ aggregation functions using the recently proposed ToN IoT dataset. Furthermore, based on our evaluation and the analysis of existing literature, we described the main challenges to be considered in the coming years for the deployment of FL-enabled IDS in IoT. As future work, we will address some of these challenges by deploying an FL-enabled IDS approach in real IoT scenarios to assess its feasibility in environments with constrained devices and networks. Furthermore, we will analyze the potential application of personalized FL, where each node uses the most appropriate learning model, in order to improve the overall accuracy for attack detection in IoT scenarios.