Node Selection Algorithm for Federated Learning Based on Deep Reinforcement Learning for Edge Computing in IoT

Abstract: The Internet of Things (IoT) and edge computing have developed rapidly in recent years, bringing new challenges in privacy and security. Personal privacy and data leakage have become major concerns in IoT edge computing environments. Federated learning has been proposed to address these privacy issues, but the heterogeneity of devices in IoT edge computing environments poses a significant challenge to its implementation. To overcome this challenge, this paper proposes a novel node selection strategy based on deep reinforcement learning to optimize federated learning in heterogeneous-device IoT environments. Additionally, a metric model for IoT devices is proposed to evaluate the performance of different devices. The experimental results demonstrate that the proposed method can improve training accuracy by 30% in a heterogeneous-device IoT environment.


Introduction
With the continuous development of the Internet of Things (IoT) and edge computing, privacy issues in IoT edge computing have become increasingly prominent [1], chief among them personal privacy and data leakage. The many sensors and devices in IoT continuously collect and transmit various types of data, including personal identification information, geographic location, and health status [2]. If these data are obtained by malicious parties, they pose significant security threats and privacy risks. Another issue is data security. Data in IoT are usually scattered among devices, cloud servers, edge nodes, and sensors, and the networks and devices that transmit and store them face various threats: attackers may compromise the network, data may be stolen from data centers, and edge devices may be eavesdropped on or tampered with. In addition, because data formats and standards are inconsistent across devices and systems, data cannot be effectively shared and utilized, which creates data silos. Data stored and processed in isolation become fragmented, which prevents complete analysis, limits the application and effectiveness of IoT, and makes data management inefficient. As data volumes and transmission grow, IoT edge computing systems must handle more and more sensitive data, including personal and business-confidential data. Privacy and data silos are therefore not only challenges for IoT edge computing but also important obstacles to the development of IoT technology.
In order to solve these problems, federated learning has become an important solution.
Federated learning is a distributed machine learning approach that allows multiple devices or data sources to learn collaboratively without exposing raw data [3,4]. This approach not only reduces the cost of data transmission and storage but also better protects privacy and data security. By training models without sharing data, federated learning protects the privacy of participating users and improves both privacy protection and training effectiveness in IoT edge computing [5]. However, in the IoT edge computing environment, federated learning faces many challenges, the most significant of which are heterogeneous devices and malicious nodes. Heterogeneous devices are participating devices with different computing capabilities, bandwidth, and data, which leads to training imbalance and instability [6,7]. Heterogeneity also makes the solution space of the node selection problem in federated learning high-dimensional, and heuristic algorithms facing such complex problems tend to become trapped in local optima and fail to find the global optimum. In federated learning, each device trains only on its own local data, so the computing power and data quality of the device directly affect the effectiveness of federated learning [8,9]. Network bandwidth between devices also affects training speed and effectiveness. Each participant, as a node, trains on local data and then uploads the trained model parameters to the server for global model updates [10]. Because participants are diverse and uncertain, malicious nodes may seriously degrade the training effectiveness of federated learning [11,12].
Malicious nodes may engage in a variety of behaviors, such as transmitting false model parameters or intentionally destroying model parameters. For example, some participants may transmit incomplete or tampered data, or maliciously modify the training model to achieve their private interests or destroy the global model. These malicious behaviors may result in a decrease in the accuracy of the global model or complete collapse, seriously affecting the effectiveness and application value of federated learning. All of these issues need to be properly addressed in federated learning to enable effective model training on edge devices and ensure user privacy and security.
In summary, there are the following issues with applying federated learning in edge computing:

1. The node selection strategy in federated learning is not targeted enough, and there are few selection mechanisms specifically designed for IoT environments. Most selection mechanisms are based on random selection.
2. There are many heterogeneous devices in IoT edge computing, with different computing power, bandwidth, and data, which leads to training imbalance and instability.
3. There are some malicious devices in IoT edge computing that upload outdated or incorrect local models for various reasons, which negatively impacts the convergence of the global model.
To address the problems associated with applying federated learning in edge computing networks mentioned above, this manuscript proposes the following solutions:

1. This manuscript proposes using deep reinforcement learning methods instead of traditional heuristic methods to select terminal devices to improve the accuracy of selection.
2. This manuscript proposes measuring the resource properties of IoT devices to determine their likelihood of participating in federated learning and improve the algorithm's applicability in IoT environments.
3. To address the issue of devices uploading outdated or incorrect local models, this manuscript proposes a node credibility measurement scheme to eliminate the impact of malicious nodes on federated learning in edge computing networks.

Federated Learning
In addition to privacy and security issues, the uneven distribution of data, communication resources, and computing resources leads to inefficient model training. To further optimize the iterative model update process and improve the efficiency of federated learning, researchers have studied these problems in a variety of scenarios [13][14][15]. Because federated learning training requires many iterations to update the model parameters, it incurs a large communication overhead, and some researchers have worked on optimizing the communication process. One research direction is compressing the data exchanged in model updates. For example, Sattler et al. [16] proposed the sparse ternary compression framework. Existing federated learning compression methods either compress only the upstream communication from the client to the server (leaving the downstream communication uncompressed) or perform well only under ideal conditions (e.g., independent and identically distributed data). Sparse ternary compression extends the existing top-k gradient sparsification technique with a novel mechanism that enables downstream compression, together with ternarization and optimal Golomb encoding of the weight updates, so that federated learning communication is optimized, especially in learning environments with limited bandwidth. Another research direction is to design new mechanisms and learning algorithms. Mills et al. [17] proposed a multi-task federated learning system that improves the accuracy of user models by using distributed Adam optimization and introducing non-federated batch normalization patch layers, and that only needs to upload a certain proportion of user data for model integration each time the model is updated. Guo et al. [18] proposed a novel transceiver design and learning algorithm for analog gradient aggregation (AGA), which significantly reduces the multi-access delay. Wu et al. [19] proposed a framework for automatically selecting the most representative data from unlabeled input streams, so that large datasets need not be accumulated on storage-limited edge devices. The framework uses a data replacement strategy based on contrast scores, which measure the representation quality of each data item without using labels: data with low scores have not yet been effectively learned by the model and remain in the buffer for further learning, while data with high scores are discarded.

Federated Learning Based on Edge Computing
As an extension of cloud computing, edge computing deploys computing resources in the edge network near the user side [20][21][22][23]. Terminal equipment can perform data analysis, storage, and computation directly at the edge node, meeting service requirements of low delay, short communication distance, and high reliability. Because federated learning involves long-running distributed interaction with terminal devices, its performance can be effectively improved if edge computing is used to train tasks or merge models in advance. However, unlike cloud computing centers, edge nodes generally have limited computing and communication resources. In a large-scale federated learning network, a large number of terminals communicate and compute through edge computing, which is prone to communication bottlenecks and uneven resources, so that the slowest participants determine the overall delay. Therefore, it is necessary to optimize resource scheduling for federated learning based on edge computing. Based on the traditional federated learning framework, Shi et al. [24] proposed a joint device scheduling and resource allocation strategy. It jointly considers communication and computing resources in terms of the number of training rounds and the number of devices scheduled in each round, and designs a greedy device scheduling algorithm to maximize the model accuracy under a time constraint. Liu et al. [25] considered splitting the model in the edge-computing-based federated learning scenario: part of the model is kept for local training and the rest is offloaded to edge nodes for training, which reduces the training load of end users but increases the communication overhead. Zhang et al. [26] proposed a federated learning-based service function chain mapping algorithm to solve the resource allocation problem of air-space integrated networks and effectively improve resource utilization.
In addition, some researchers have innovated on and optimized the federated learning framework itself. Luo et al. [27] introduced a novel hierarchical federated edge learning framework, in which some model aggregations are migrated from the cloud to the edge server, and further optimized the joint allocation of computing and communication resources and the edge association of devices under this framework. Hosseinalipour et al. [28] proposed a multi-layer federated learning framework for heterogeneous networks, which takes into account the heterogeneity of the network structure, device computing capacity, and data distribution, and realizes efficient federated learning by offloading learning tasks and allocating communication and computing resources accordingly. Xue et al. [29] implemented a clinical decision support system based on federated learning in edge computing networks. A double deep Q network was deployed at the edge node, and a stable and orderly clinical treatment strategy was obtained. Considering link, delay, and energy constraints, Lyapunov optimization was used to improve the convergence of the system.

System Implementation
The process of the federated learning node selection mechanism based on deep reinforcement learning designed in this manuscript is shown in Figure 1 below. The physical network environment composed of IoT devices and the policy network constitutes the entire deep reinforcement learning system. When a federated learning request arrives, the policy network, acting as an intelligent agent, extracts a specific feature matrix from the physical network as input based on the current state of the IoT devices. The training is conducted in an environment built by the physical resource state, and this process is considered the environment sending a state to the agent. The intelligent agent infers the federated learning node selection decision based on the training, which is considered an action applied to the environment. The environment provides the agent with a reward signal based on the execution effectiveness of the action. The agent continually optimizes the action by interacting with the environment to accumulate the maximum reward signal.

Feature Extraction
Training environment and methods have a great impact on training effectiveness. In order to train the agent in an environment closer to the real network, this paper extracts the following four device features as the device attributes used by deep reinforcement learning: computing power, communication capability, data quality, and device contribution. Because IoT devices must be low-power and small, their computing power is usually limited, which makes computing power an important measure of whether an IoT device is suitable for participating in federated learning. This paper measures the computing power of an IoT device by that of its processor, usually expressed in FLOPS (floating-point operations per second). FLOPS is the number of floating-point operations that a device can complete per unit time and is an important indicator of computer performance. Generally, the FLOPS of IoT devices can be calculated using the following formula:

$$FLOPS_i = CPU\_Fre_i \times CPU\_Core_i \times CPU\_FPU_i$$

where $FLOPS_i$ represents the FLOPS value of IoT device $i$, $CPU\_Fre_i$ represents the CPU frequency of the device, $CPU\_Core_i$ represents the number of CPU cores of the device, and $CPU\_FPU_i$ represents the number of FPUs of the device.
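As a worked illustration of the computing-power metric, the following sketch multiplies the three quantities named above. It assumes that the metric is their plain product and that the FPU count is per core; the function name `peak_flops` and the sample device values are hypothetical and are not taken from the experiments.

```python
def peak_flops(cpu_freq_hz: float, cpu_cores: int, cpu_fpus: int) -> float:
    """Theoretical peak FLOPS, ASSUMING frequency x cores x FPUs per core."""
    if cpu_freq_hz < 0 or cpu_cores < 0 or cpu_fpus < 0:
        raise ValueError("inputs must be non-negative")
    return cpu_freq_hz * cpu_cores * cpu_fpus


# Hypothetical devices: (CPU frequency in Hz, number of cores, FPUs per core)
devices = {
    "sensor_node": (240e6, 2, 1),
    "gateway": (1.5e9, 4, 2),
}

# Map each device name to its peak FLOPS value.
flops = {name: peak_flops(*spec) for name, spec in devices.items()}
```

Under these assumptions the gateway scores 25 times higher than the sensor node, which is the kind of gap the node selection strategy is meant to take into account.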

Communication Model
The communication resources of IoT devices refer to the network resources required for devices to communicate, including bandwidth, network delay, network stability, etc. The adequacy of communication resources directly affects the communication quality and stability of the equipment. In IoT, different devices have different communication resources. For example, infrastructure devices usually have strong communication resources, which can support high-speed and stable data transmission, while some edge devices may have relatively limited communication resources, which need to be scheduled and optimized according to their specific usage scenarios and needs. For the communication resources of IoT devices, adequate evaluation and management are required to ensure the communication quality and stability of the devices. At the same time, it is also necessary to consider the allocation and utilization of communication resources during device design and deployment to meet the communication needs of the device.
In IoT, communication between devices can use different wireless technologies, such as Bluetooth, Zigbee, and Wi-Fi, all of which are radio frequency-based. In this context, the communication capability of an IoT device is assessed from its bandwidth, channel, modulation method, and signal-to-noise ratio. This section computes the data transmission efficiency $DTE_i$ of IoT device $i$ from the channel capacity $CC_i$ of the device, its modulation efficiency $MRE_i$, its signal-to-noise ratio $SNR_i$, and its bandwidth $BW_i$.
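The quantities in the communication model are related by the well-known Shannon-Hartley bound, $C = BW \log_2(1 + SNR)$. The sketch below computes that bound and then, purely as an illustrative assumption, scales it by a modulation efficiency between 0 and 1 to obtain a transmission-efficiency score. The multiplicative combination and all function names are hypothetical and should not be read as the formula used in this paper.

```python
import math


def db_to_linear(snr_db: float) -> float:
    """Convert a signal-to-noise ratio from decibels to a linear ratio."""
    return 10 ** (snr_db / 10)


def channel_capacity(bandwidth_hz: float, snr_linear: float) -> float:
    """Shannon-Hartley capacity in bit/s for a linear (not dB) SNR."""
    if bandwidth_hz < 0 or snr_linear < 0:
        raise ValueError("bandwidth and SNR must be non-negative")
    return bandwidth_hz * math.log2(1.0 + snr_linear)


def transmission_efficiency(bandwidth_hz: float, snr_db: float,
                            modulation_eff: float) -> float:
    """ASSUMPTION: efficiency = modulation efficiency (0..1) x capacity."""
    if not 0.0 <= modulation_eff <= 1.0:
        raise ValueError("modulation efficiency must lie in [0, 1]")
    capacity = channel_capacity(bandwidth_hz, db_to_linear(snr_db))
    return modulation_eff * capacity
```

For example, `transmission_efficiency(20e6, 20.0, 0.8)` scores a 20 MHz channel at 20 dB SNR; a narrower or noisier link receives a proportionally lower score.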

Data Quality Model
In IoT applications, data quality is crucial because it affects subsequent data analysis and applications. The model used to evaluate the quality of data generated by IoT devices is known as the IoT device data quality evaluation model. The total data quality management (TDQM) model can help determine whether the data generated by IoT devices are reliable, accurate, consistent, complete, and usable, thus improving the accuracy and credibility of data analysis [30]. For edge IoT devices, this paper chooses a TDQM model based on data accuracy and data integrity, which evaluates the quality of the data held by edge IoT devices through the source and availability of the data. The calculation uses two quantities: $Data_i$ represents the data metric of device $i$, and $TDQM_i$ represents the local data quality of the device under the TDQM model.

Equipment Contribution
In federated learning, each device participating in training needs to upload its locally trained model parameters so that the server can integrate them into a global model. However, some devices may be unwilling or unable to upload the correct or latest version of their local models for various reasons, such as network issues, computational resource limitations, or privacy protection. This may have a negative impact on the performance of the global model. Therefore, it is necessary to measure the contribution of each device in order to identify and exclude unreliable or low-contributing devices, thereby improving the quality and convergence speed of the global model. We evaluate the contribution of each device by the improvement that its local model parameters bring to the global model parameters. The contribution of IoT device $i$ is defined using the following quantities: $V_{i,k}$ represents the contribution value of device $i$ to the global model in the $k$-th round of training, $w_{i,k}$ represents the local model parameters of device $i$ after the $k$-th round of training, $w_k$ represents the global model parameters after the $k$-th round of training, and $\sigma_k$ represents the sum of weights of all devices after the $k$-th round of training. For terminal device $i$, after the above properties are extracted, they are combined into a feature vector of the four attributes, $(FLOPS_i, DTE_i, Data_i, V_{i,k})$. The feature vectors of all devices are then combined into a feature matrix with one row per device and four feature columns. Whenever a new round of federated learning is required, the policy network extracts this feature matrix from the terminal devices as input, providing the intelligent agent with a training environment. The feature matrix is continuously updated as the resources of the terminal devices are consumed.
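The construction of the feature matrix can be sketched as follows. The sketch assumes that each device is described by a dictionary with the hypothetical keys `flops`, `dte`, `data`, and `contribution`, and it adds a per-column min-max scaling step as one possible normalization; neither the key names nor the choice of scaling comes from this paper.

```python
def build_feature_matrix(devices):
    """One row per device: [FLOPS, DTE, data quality, contribution]."""
    return [[d["flops"], d["dte"], d["data"], d["contribution"]]
            for d in devices]


def min_max_normalize(matrix):
    """Scale every column to [0, 1]; a constant column maps to 0.0."""
    if not matrix:
        return []
    columns = list(zip(*matrix))
    lows = [min(col) for col in columns]
    spans = [max(col) - min(col) for col in columns]
    return [
        [(x - lo) / span if span > 0 else 0.0
         for x, lo, span in zip(row, lows, spans)]
        for row in matrix
    ]


# Hypothetical device attributes, for illustration only.
devices = [
    {"flops": 1e9, "dte": 1e6, "data": 0.9, "contribution": 0.2},
    {"flops": 3e9, "dte": 3e6, "data": 0.5, "contribution": 0.2},
    {"flops": 2e9, "dte": 2e6, "data": 0.7, "contribution": 0.2},
]
features = min_max_normalize(build_feature_matrix(devices))
```

Scaling the columns keeps a large-valued attribute such as FLOPS from dominating attributes that already lie between 0 and 1.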

Policy Network
The policy network in deep reinforcement learning outputs a policy that selects the next action based on the current state. In the proposed method, devices are selected when the probability output by the policy network exceeds a specific value, rather than by choosing a fixed number of devices to participate in the training. As shown in Figure 2, the policy network designed in this paper includes four layers: extraction layer, convolutional layer, probability layer, and output layer.
• Extraction layer: The extraction layer, also known as the input layer, is primarily responsible for converting the raw input data into a format that the deep neural network can process, usually by standardization, normalization, and similar methods. Here, the extraction layer extracts the feature matrix from all terminal devices based on their current states and uses it as the input to the policy network. The feature matrix is then passed to the next layer of the policy network.
• Convolutional layer: The convolutional layer is a commonly used layer structure in deep learning. It applies convolution kernels to the input data in order to extract features. Here, the convolutional layer performs convolution operations on the input matrix according to the following equation:

$$y_{i,j} = \sum_{m} \sum_{n} I_{i+m,\,j+n} K_{m,n}$$

where $y_{i,j}$ denotes an element of the output matrix, $I$ denotes the input matrix, and $K$ denotes the convolution kernel. Each term of the double sum is the element $I_{i+m,j+n}$ of the input matrix multiplied by the element $K_{m,n}$ of the convolution kernel matrix. The ReLU activation function is then used to connect the fully connected layers as follows:

$$\mathrm{ReLU}(x) = \max(0, x)$$

The generated vectors are passed to the probability layer in order to generate the probability of each node.
• Probability layer: The probability layer applies the softmax function to the feature vector to generate the probability of each terminal device. The softmax function maps the elements of a K-dimensional vector to a K-dimensional probability distribution, where each element is a probability value. Specifically, for a federated learning network consisting of $n$ terminal devices, the probability layer outputs an $n$-dimensional probability distribution, where each element represents the probability of selecting a terminal device. The probability of device $i$ participating in federated learning is given by the following formula:

$$P_i = \frac{e^{v_i}}{\sum_{j=1}^{n} e^{v_j}}$$

where the denominator is the sum of the exponentials of all elements and the numerator is the exponential of $v_i$. In this way, $P_i$ is the probability corresponding to the dimension where $v_i$ is located, and the sum of all $P_i$ is equal to 1.
• Output layer: The output layer outputs the IoT devices and their probabilities of participating in federated learning.
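The layers described above can be sketched in plain Python as a single forward pass. This is a simplified illustration, not the network used in the experiments: it assumes a single 1 x 4 kernel that spans the four feature columns so that each device receives one score, and it omits the fully connected layers and the learned weights.

```python
import math


def conv2d_valid(inp, kernel):
    """y[i][j] = sum_m sum_n inp[i+m][j+n] * kernel[m][n] (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(inp) - kh + 1, len(inp[0]) - kw + 1
    return [
        [sum(inp[i + m][j + n] * kernel[m][n]
             for m in range(kh) for n in range(kw))
         for j in range(out_w)]
        for i in range(out_h)
    ]


def relu(values):
    """Element-wise max(0, x)."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Numerically stable softmax; the outputs are positive and sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def device_probabilities(feature_matrix, kernel):
    """Score each device row with a 1 x 4 kernel, then apply ReLU and softmax."""
    scores = [row[0] for row in conv2d_valid(feature_matrix, kernel)]
    return softmax(relu(scores))
```

With a kernel such as `[[1.0, 2.0, 0.0, 0.0]]`, a device whose second feature is large receives a higher selection probability than one whose first feature is equally large, because the second column carries twice the weight.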

Model Training
In our study, we employ deep reinforcement learning to implement the node selection strategy shown in Figure 2. We first randomly initialize the parameters of the policy network and train it for several epochs. For the node selection task in each training iteration, we extract the feature matrix from the set of federated learning nodes as the input of the policy network. The policy network outputs the set of available nodes and the selection probability of each node according to the node feature vectors. The selection probability of a node represents how likely selecting it for federated learning is to produce better results. In the training phase, we do not select a fixed number of nodes but instead select the devices whose probability exceeds a threshold we set. The selected nodes participate in the federated learning training process and jointly learn the global model. Because the strategy does not limit the number of nodes selected each time, it can adapt to federated learning scenarios of different scales and select an appropriate number of nodes according to demand.
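Threshold-based selection can be sketched as below. The threshold value is not stated in the text, so the default used here, the uniform share 1/n, is an assumption: because softmax probabilities sum to 1 over all n devices, a fixed absolute threshold such as 0.5 could select at most one device in a large network. The function name is hypothetical.

```python
def select_devices(probabilities, threshold=None):
    """Return the indices of devices whose probability exceeds the threshold.

    ASSUMPTION: when no threshold is given, use the uniform share 1/n, so a
    device is selected if it is rated above an "average" device.
    """
    n = len(probabilities)
    if n == 0:
        return []
    if threshold is None:
        threshold = 1.0 / n
    return [i for i, p in enumerate(probabilities) if p > threshold]
```

The number of selected devices therefore varies from round to round with the shape of the probability distribution, which matches the flexible selection size described above.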
In deep reinforcement learning, unlike supervised learning, there is no label information in the training data to guide the training process [31,32]. Instead, the learning agent relies on reward signals to evaluate the effectiveness of its actions. The magnitude of the reward signal indicates the quality of the agent's decision: larger reward signals indicate good decisions, while smaller or even negative reward signals indicate poor decisions that need to be adjusted [33]. The choice of reward is crucial to the training process and the formation of the final policy. In the federated learning node selection problem based on deep reinforcement learning, the effect of each round of federated learning is used as the reward signal [34]. This indicator reflects well the contribution of all devices to the global model aggregation under the current selection scheme [35,36]. Therefore, after each round of federated learning, the agent calculates the reward signal from the aggregation effect of the global model and updates the parameters of the policy network to improve its performance. Through continuous iterative training, the policy network gradually learns the optimal node selection strategy and can provide an efficient and reliable node selection scheme for federated learning. In practical implementations, because no real label information exists for node selection, we introduce hand-crafted labels to approximate the agent's decision. Suppose the $i$-th and $(i+2)$-th nodes are chosen; the hand-crafted label is then a vector $y$ that is 1 at the $i$-th and $(i+2)$-th positions and 0 everywhere else. By computing the cross-entropy loss between the output of the policy network and the hand-crafted label, we measure the deviation of the output from expectations and use this loss to guide the training process.
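The hand-crafted label and the cross-entropy loss can be illustrated with a minimal sketch. Positions are zero-based in the code, the function names are hypothetical, and the small constant `eps` is an added safeguard against taking the logarithm of zero rather than part of the method described above.

```python
import math


def multi_hot(n, selected):
    """Label vector of length n: 1.0 at the selected positions, 0.0 elsewhere."""
    label = [0.0] * n
    for index in selected:
        label[index] = 1.0
    return label


def cross_entropy(probabilities, label, eps=1e-12):
    """-sum_i y_i * log(p_i); eps guards against log(0)."""
    return -sum(y * math.log(max(p, eps))
                for p, y in zip(probabilities, label))


# Example: nodes i = 1 and i + 2 = 3 are selected out of five.
label = multi_hot(5, [1, 3])
loss = cross_entropy([0.1, 0.4, 0.1, 0.3, 0.1], label)
```

The loss shrinks as the policy network assigns more probability mass to the selected positions and grows when that mass goes to unselected devices.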
In this manuscript, backpropagation is used to calculate the parameter gradient of the policy network. First, the loss function is calculated using cross entropy based on the training samples and the output of the policy network. Then, the backpropagation algorithm is used to calculate the gradient of the loss function with respect to the parameters of the policy network. The backpropagation algorithm uses the chain rule to calculate the gradient of each parameter, starting from the output layer, calculating the partial derivative of each neuron, and then calculating the gradient of each parameter layer by layer. Finally, the gradient descent optimization algorithm is used to update the parameters of the policy network. The gradient calculated by the backpropagation algorithm indicates which direction the parameters should be adjusted, while the gradient descent optimization algorithm tells us how much to adjust. In this way, the parameters of the policy network can be continuously adjusted to improve the accuracy and performance of the policy network.
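The update step can be sketched on a deliberately small case. The sketch treats the inputs of the softmax (the logits) as if they were the trainable parameters, which is a simplification of backpropagating through the whole policy network. For a softmax output $p$ and label $y$, the gradient of the cross-entropy loss with respect to the logits is $p \sum_i y_i - y$; the function names and the learning rate are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, label, eps=1e-12):
    """-sum_i y_i * log(p_i); eps guards against log(0)."""
    return -sum(y * math.log(max(p, eps))
                for p, y in zip(probabilities, label))


def loss_gradient(logits, label):
    """Gradient of the cross-entropy w.r.t. the logits: p * sum(y) - y."""
    probs = softmax(logits)
    mass = sum(label)
    return [p * mass - y for p, y in zip(probs, label)]


def sgd_step(logits, label, learning_rate=0.1):
    """One gradient-descent update: move against the gradient."""
    grad = loss_gradient(logits, label)
    return [v - learning_rate * g for v, g in zip(logits, grad)]
```

The gradient supplies the direction of the adjustment, and the learning rate scales its size; repeating `sgd_step` moves probability mass toward the positions marked 1 in the label.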

Experimental Environment
To verify the reliability of the node selection-optimized federated learning scheme (FL-IoTEL) proposed in this manuscript for IoT edge computing, we designed simulation experiments. The experiments simulate an IoT edge computing network in which 110 devices are involved in node selection: 10 edge nodes and 100 IoT devices. Each edge node is connected to multiple IoT devices and is responsible for aggregating the local models of its IoT devices into a global model. The experiments use the MNIST and CIFAR-10 datasets. CIFAR-10 contains 50,000 training images and 10,000 test images [37]; its images are slightly larger than those of MNIST and are in color. MNIST has 10,000 more training images than CIFAR-10 [38]. Each IoT device is assigned different hardware metrics (such as computing power and communication capability) and different data.

Simulation Results and Analysis
In the edge computing environment of the Internet of Things, federated learning needs to consider the characteristics of many heterogeneous devices, including but not limited to differences in hardware performance, network communication capabilities, data volume, and data quality. These differences can lead to variations in the quality and quantity of data provided by each device, as well as affecting the training speed and effectiveness of the devices. For example, some devices may have faster processors and higher memory capacities, enabling them to train and analyze more quickly, while other devices may be limited by lower hardware performance and take longer to complete the same task. Additionally, data may differ in quality and quantity depending on their source, with some data having better quality and more samples, while other data may have more noise or lack diversity. Therefore, federated learning needs to consider the heterogeneity of devices and data and use appropriate federated algorithms and node selection strategies to effectively utilize these heterogeneous resources, improve the training effectiveness and inference speed of models, and protect the privacy of devices. This section designs experiments to verify the performance of the node selection strategy for federated learning based on deep reinforcement learning under the edge computing of the Internet of Things, and compares the algorithm designed in this chapter with the traditional FedProx algorithm.
First, this manuscript compares the performance of the final global model when the data satisfy the independent and identically distributed (IID) assumption. Figure 3 shows the accuracy of the model when the local data on the IoT devices are IID and the local datasets have the same size. When the models of the two algorithms finally converge, there is little difference in their accuracy on the test set. This is because the data distribution on the edge nodes is consistent, so randomly selecting nodes for learning has nearly the same effect as purposefully selecting nodes on the server. However, because the proposed scheme considers the heterogeneity of resources as well as that of node data, it is slightly better than the traditional FedProx algorithm at final convergence. Figure 3 also shows that both schemes perform well on the MNIST dataset but poorly on the CIFAR-10 dataset. This is because CIFAR-10 images contain more information than MNIST images and are therefore harder to learn. This also indicates that the dataset has a significant impact on the learned model and that, in scenarios with node selection, small-scale algorithms tend to achieve the best results.
Experiments were then conducted for devices with Non-IID data. Figure 4 shows the training accuracy when the data on IoT devices are Non-IID. In this case, the proposed algorithm achieves significantly higher accuracy than the traditional FedProx algorithm. This is because, when the data are IID or the differences between the data of each device are small, there is not much difference between traditional random device selection and the federated learning node selection strategy based on deep reinforcement learning. However, when the data differences are large, the proposed algorithm performs well because it takes the device data into account.
Furthermore, this manuscript simulated the performance in different Non-IID scenarios, using the variance of the local data distribution to represent the degree of data heterogeneity. The larger the variance, the greater the heterogeneity of local data on terminal devices. As can be seen in Figure 5, when the variance is small, all devices perform almost identically, and the results of random node selection are consistent with those of algorithm-based node selection. As the variance increases, the superiority of the proposed algorithm becomes apparent, because in this case selecting nodes that are more useful for the global model is more reasonable than selecting nodes at random. In addition, the number of users also affects the accuracy on the test set: the more nodes participate in the learning process, the better the learned model performs, although the improvement over a smaller number of nodes is slight.
Finally, experiments were conducted to analyze the performance of the algorithm under device heterogeneity; the results are shown in Figures 6 and 7. When there is device heterogeneity, the experimental results are similar to those for data heterogeneity. As shown in Figure 6, when the performance of the devices is similar, there is not much difference in training accuracy between the two algorithms. However, when the devices differ significantly in performance and some devices perform poorly, the proposed algorithm has a more obvious advantage over the traditional FedProx algorithm. This is because, under significant device heterogeneity, some devices may not be able to complete the training task, which lowers the accuracy of the global model. The proposed node selection strategy is more effective here because it selects better-performing devices to participate in training, thereby ensuring the efficiency and accuracy of federated learning.

Conclusions
This manuscript optimized the federated learning technology in edge computing for the Internet of Things. A node selection strategy based on deep reinforcement learning was proposed to select IoT nodes to participate in the federated learning training, ensuring efficient participation of heterogeneous IoT devices and improving the privacy protection ability of edge computing. The experimental results showed that the proposed method in this manuscript can improve the training accuracy by 30% in the heterogeneous device IoT environment. This manuscript provides a new perspective to solve the privacy protection problem in edge computing for the Internet of Things and proposes a node selection strategy based on deep reinforcement learning to optimize the federated learning technology. This strategy can ensure the efficiency of heterogeneous device participation in training and improve the accuracy of the model under the premise of privacy protection. The research results of this chapter can provide new ideas and methods for privacy protection in edge computing for the Internet of Things and are expected to be more widely used in practical applications.