Reputation-Aware Multi-Agent DRL for Secure Hierarchical Federated Learning in IoT

To protect device data privacy, Federated Learning (FL) is a distributed machine learning framework in which devices exchange local model parameters with a centralized server without revealing their actual data. The Hierarchical Federated Learning (HFL) framework was introduced to improve FL communication efficiency: devices are clustered and seek model consensus with the support of edge servers (e.g., base stations). Devices in a cluster submit their local model updates to their assigned local edge server for aggregation at each iteration. The edge servers transmit the aggregated models to a centralized server to establish a global consensus. However, as in FL, adversaries may threaten the security and privacy of HFL. Client devices within a cluster may deliberately provide unreliable local model updates through poisoning attacks, or poor-quality updates due to inconsistent communication channels, increased device mobility, or inadequate device resources. To address these challenges, this paper investigates the client selection problem in the HFL framework to eliminate the impact of unreliable clients while maximizing the global model accuracy of HFL. Each FL edge server is equipped with a Deep Reinforcement Learning (DRL)-based reputation model to optimally measure the reliability and trustworthiness of the FL workers within its cluster. A Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm is utilized to enhance the accuracy and stability of the HFL global model given the workers' dynamic behaviors in the HFL environment. The experimental results indicate that our proposed MADDPG scheme improves the accuracy and stability of HFL compared with a conventional reputation model and a single-agent DDPG-based reputation model.

Model aggregation in FL optimizes the usage of limited data acquired through sparse communications and increases the quality of the trained models by maximizing the use of all accessible data. FL applications have seen widespread adoption across many different sectors, from IoT and telecommunications to military and healthcare.
In large-scale networks, where many devices participating in FL are distributed across a wide area, communication latency can be substantial due to restricted bandwidth and weak communication channels. Massive communication resources would be required to support such an enormous number of FL devices. Given that FL is expected to be deployed at scale, [4] proposed the Hierarchical FL (HFL) framework as an advancement of FL. HFL combines the benefits of centralized learning with FL, reducing communication overhead in an IoT network while simultaneously improving training efficiency and accuracy [5]. HFL maintains FL's data privacy property while allowing distributed training on an enormous number of devices (i.e., clients or FL workers).
In HFL, the devices are clustered (e.g., according to their locations) and seek model consensus with the support of an edge server or cluster head (e.g., a base station). Devices in a cluster submit their local model updates to their associated local edge server for aggregation at each iteration. The edge server returns the average of all received updates to the devices to update their local models (similar to traditional FL). After some aggregation iterations, all edge servers transmit the aggregated models to a centralized server to compute the global model and establish a global consensus. The global server then calculates the model average (i.e., using FedAvg [2]) and multi-casts it back to the edge servers. Upon receiving the model update from the centralized server, the edge servers send it to the devices in their clusters. After some global aggregation iterations, all the devices share a common parameter vector of the global model.
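To make the two-level aggregation concrete, the following is a minimal Python sketch of one HFL global round. It assumes hypothetical worker objects exposing a `train(model) -> (update, n_samples)` method; it illustrates the protocol described above, not the paper's actual implementation.

```python
import numpy as np

def weighted_average(updates, sizes):
    """FedAvg: average model vectors weighted by local dataset sizes."""
    return np.average(np.stack(updates), axis=0, weights=np.asarray(sizes, float))

def hfl_global_round(clusters, global_model, edge_iters):
    """One HFL global round. `clusters` is a list of worker lists (one per
    edge server); each worker exposes a hypothetical train() method."""
    edge_models, edge_sizes = [], []
    for workers in clusters:                       # one cluster per edge server
        edge_model = global_model
        for _ in range(edge_iters):                # edge aggregation iterations
            results = [w.train(edge_model) for w in workers]
            updates, sizes = zip(*results)
            edge_model = weighted_average(updates, sizes)   # edge-level FedAvg
        edge_models.append(edge_model)
        edge_sizes.append(sum(sizes))              # N_e: cluster data size
    return weighted_average(edge_models, edge_sizes)        # cloud-level FedAvg
```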
As adversarial attacks against FL systems are continuously becoming significant research areas in FL security and privacy, we study the impact of poisoning attacks on the HFL framework in this work. Despite its benefits in terms of communication efficiency, HFL can be exposed to the same attacks targeting its security and privacy. For example, in a poisoning attack, an attacker transmits inaccurate local model parameters derived from fraudulent training data to deceive the training process of HFL. Moreover, honest workers within a cluster may unintentionally provide unreliable local model updates due to unstable network conditions, device energy constraints, or improperly sensed data due to device mobility among different clusters. In contrast to reliable workers with valid training data, unreliable or malicious workers might prevent the global model from converging or make the convergence process significantly slower [6], [7]. In order to have a more reliable and secure HFL for IoT networks, it is crucial to design an effective and adaptable detection technique for unreliable workers.
Existing works utilize reputation-based methods where a reputation threshold is used to determine FL workers' reliability [8], [9]: a worker is regarded as unreliable or malicious if its calculated reputation score falls below a predetermined threshold. Due to several challenges in the HFL environment, it is difficult to adapt existing worker selection strategies to HFL. The decision of selecting FL workers has to account for the uncertainties in the HFL network conditions. For example, the behavior of clients moving across different clusters can change over time due to many factors, such as weak connectivity or energy constraints. Without dynamic monitoring tools, it is challenging to monitor workers' behaviors within a cluster in real time and mitigate the potential adverse impacts of malicious workers.
Reinforcement Learning (RL) is an area of machine learning that involves learning in an environment and designing policies that map states to actions so as to optimize a reward function [10]. Deep Reinforcement Learning (DRL) is an evolution of RL that employs Deep Neural Networks (DNNs) in the decision-making procedure [11]. DRL has demonstrated considerable performance advantages over non-learning methods in terms of training accuracy and latency when applied to non-convex optimization problems. In addition, DRL has been used in FL environments for energy management optimization [12], resource allocation and channel cost optimization [13], and reducing energy consumption [14].
In our preliminary work [15], we proposed a reputation management mechanism based on DRL to address security concerns in the FL environment. Specifically, we applied the Deep Deterministic Policy Gradient (DDPG) algorithm [16] to the unreliable worker detection problem in the FL environment. As an extension of [15], in this work we propose a distributed and collaborative scheme based on multi-agent DRL to improve the security and scalability of the HFL environment. Enabling a centralized and secure FL scheme at the edge server can be infeasible in large-scale networks considering the delay requirements of the wireless transmission between an FL worker and an edge server. In addition, the number of FL training iterations required to reach convergence grows rapidly as the number of FL workers increases [15]. Therefore, in this work, each edge server is equipped with a DRL agent that observes information about its environment and makes decisions that support HFL. We believe this work is the first to consider the security of HFL and its optimization using DRL for large-scale IoT networks. The main contributions of this paper are as follows:
• We model the worker selection problem as an extended multi-agent Markov Decision Process (MDP) and formulate an optimization problem for each edge server individually to jointly manage reliable worker selection for HFL.
• We adopt a reputation-based assessment mechanism to analyze the behaviors of workers in HFL and identify unreliable or malicious workers using existing attack detection techniques.
• We implement a multi-agent scheme based on the DDPG (MADDPG) algorithm to address the formulated problem. Training the MADDPG model allows each edge server to make optimal selection decisions for reliable workers in real time.
This paper is organized as follows. A review of related research is provided in Section II. The system model is explained in Section III. Our proposed multi-agent DRL-based reputation model is described in Section IV. The performance of our proposed model is analyzed and evaluated in Section V. Finally, our conclusion is provided in Section VI.

II. RELATED WORK
Recently, several research works have addressed FL security using various mechanisms. The works in [17], [18], and [19] propose different statistical methods that analyze FL workers' local model updates to detect potential malicious workers. In [20], [21], [22], and [23], the authors propose learning-based mechanisms (e.g., Reinforcement Learning) to select reliable FL workers. The authors in [8] and [24] propose reputation- and blockchain-based solutions to evaluate local model updates. In [19] and [25], the authors employ cryptographic primitives to verify the integrity of locally encrypted models. In our previous work [15], we addressed reliable client selection in FL by applying a reputation-based DRL algorithm. However, most of the proposed mechanisms focus on the conventional FL framework with a limited number of devices and lack performance analysis of large-scale HFL systems in the presence of unreliable clients. Therefore, in this work, we aim to apply an optimization and security scheme in the context of HFL with unreliable clients.

III. SYSTEM MODEL
To tackle the HFL security problems, in this work, a multiagent DRL-based reputation model is designed to select optimal, reliable workers in HFL. The illustration for our proposed system model is shown in Fig. 1. At the start, the edge servers download the HFL global model from the centralized server, which is then cascaded to the assigned workers. The workers in each cluster train the received model using their locally collected training data. Then, each worker uploads the generated local model updates to the associated edge server to be aggregated. Each edge server verifies the local model updates using an attack detection technique and evaluates each worker's reputation within its cluster. Afterward, each edge server chooses the workers with reputation scores larger than a pre-defined threshold to update the edge model. To determine the appropriate and optimal reputation threshold at the edge server level, the DRL algorithm is employed to maximize the accuracy of the HFL global model. The sections that follow describe in detail our proposed multi-agent DRL system model for HFL.

A. HIERARCHICAL FEDERATED LEARNING MODEL
We consider an HFL system that consists of a centralized global server, a set of E HFL edge servers, and a set of K HFL workers. Each worker i stores a local dataset $n_i$; the dataset at an edge server e is denoted as $N_e = \sum_{i=1}^{K_e} n_i$, and the dataset at the centralized global server is denoted as $N = \sum_{e=1}^{E} N_e$. Each set of workers is assigned to a particular edge server. The HFL process starts after each worker computes its local model updates or learning parameters $w_i$ on its local dataset $n_i$ using FedSGD.
In the HFL setting, workers' local model updates $w_k$ are aggregated periodically by computing their weighted average according to the sizes of their local datasets. The workers then use the averaged model until the subsequent edge aggregation iteration. The second stage of the HFL process synchronizes the local model updates $w_k$ across all workers associated with a specific edge server at every local gradient step. Therefore, the parameters of each edge server at edge aggregation iteration $t$ are determined as follows:

$$w_e^t = \frac{\sum_{k=1}^{K_e} n_k \, w_k^t}{N_e},$$

where $K_e$ is the number of workers associated with the edge server $e$. At every $t_1 \times t_2$ steps, the edge models are synchronized across all edge servers, where $t_1$ is the aggregation iteration at the centralized global server and $t_2$ is the aggregation iteration at the edge server. Thus, the edge model updates are averaged across all the edge servers as follows:

$$w^t = \frac{\sum_{e=1}^{E} N_e \, w_e^t}{N}.$$

B. ADVERSARY MODEL IN HFL
In this work, the clients are selected to contribute to the HFL task according to the reliability of their contributed local model updates, which is determined based on workers' behaviors during the HFL process. Malicious workers can launch a targeted data poisoning attack by modifying individual features of the worker's actual training dataset $n_k$ and embedding backdoor features into the local model update $w_t^k$. The output will be classified according to the attacker's objective whenever the input includes the modified training sample [18]. The attacker within a cluster then uploads its poisoned local updates $w_t^{k*}$ to its associated edge server for aggregation, and the edge model is updated as follows:

$$w_{t+1}^e = \frac{\sum_{i=1}^{K_e} (n_i + m_i) \, w_{t+1}^i}{N_e},$$

where $m_i$ is the number of poisoned samples injected by worker $i$ and $N_e = \sum_{i=1}^{K_e} (n_i + m_i)$ is the sum of all $K_e$ workers' data at edge server $e$. The adversary aims to misclassify one class as another without influencing the output probabilities of the other classes. Because the server has no information about such an attack, it is considered difficult to detect.
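As an illustration of this attack surface, the following sketch implements a simpler label-flipping variant of targeted poisoning (the paper describes feature-embedding backdoors; the function name and the source/target class pair here are illustrative assumptions):

```python
import numpy as np

def poison_dataset(x, y, src=7, dst=1, m_frac=0.1, rng=np.random):
    """Targeted label-flipping sketch: relabel a fraction `m_frac` of
    class `src` samples as `dst`, so the local update w_t^{k*} pushes the
    model towards the attacker's objective (src classified as dst)."""
    x, y = x.copy(), y.copy()
    idx = np.flatnonzero(y == src)                       # candidate samples
    flip = rng.choice(idx, size=int(m_frac * len(idx)), replace=False)
    y[flip] = dst                                        # poisoned labels
    return x, y
```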

C. REPUTATION MANAGEMENT MODEL IN HFL
Each edge server computes reputation scores based on the behaviors of its workers during the edge model update step, such as contributing poisoned or inaccurate local model updates. Low-reputation workers are assumed to be malicious; hence, their local model updates are excluded from the aggregated edge models. Specifically, at each edge aggregation iteration t, each edge server communicates with its associated workers $C \subseteq K$ and evaluates their reputation according to their contributed local model updates. After each edge aggregation iteration t, the associated edge server updates the reputation score $\rho_{e,k}^t$ of each worker k using a Multi-weight Subjective Logic Model (Multi-SLM) [26]. Namely, each edge server uses Multi-SLM to determine the reputation opinion of its associated workers based on previous interaction events. Specifically, after each edge aggregation iteration, each server keeps track of the number of positive and negative interaction events of every worker in its cluster.
The positive and negative interactions are determined by identifying adversarial updates using the FoolsGold attack detection scheme [17]. With the deployed detection scheme, each edge server can perform a preliminary security check to detect unreliable or malicious workers within its cluster. After identifying a candidate malicious worker, the worker's corresponding local updates are removed during the edge aggregation step. Each edge server then calculates the reputation opinion of the worker, represented as a vector consisting of a belief degree $b_{e,k}^t$, a distrust degree $d_{e,k}^t$, and an uncertainty degree $u_{e,k}^t$. The computation of these values is discussed in [26].
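The following is a deliberately simplified sketch of the similarity-based screening idea behind this detection step; FoolsGold itself [17] additionally rescales similarities per feature and applies pardoning and a logit transform, which are omitted here:

```python
import numpy as np

def foolsgold_style_scores(grads):
    """Sybil/poisoning workers tend to submit unusually similar updates,
    so clients whose pairwise cosine similarity to any other client is
    high are down-weighted. `grads` is a (num_clients, dim) array of
    flattened local model updates."""
    g = grads / (np.linalg.norm(grads, axis=1, keepdims=True) + 1e-12)
    cs = g @ g.T                                 # pairwise cosine similarity
    np.fill_diagonal(cs, -1.0)                   # ignore self-similarity
    max_sim = cs.max(axis=1)                     # most similar peer per client
    return np.clip(1.0 - max_sim, 0.0, 1.0)      # suspicious -> weight near 0
```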
The final reputation score $\rho_{e,k}^t$ of worker k in an edge aggregation iteration t is the expected belief of the subjective opinion, computed as

$$\rho_{e,k}^t = b_{e,k}^t + a \, u_{e,k}^t,$$

where $a \in [0,1]$ is the base rate that controls how much of the uncertainty counts towards the score [26]. This value reflects the edge server e's expected belief that the worker k consistently produces reliable and accurate local model updates. The edge server then selects the updates of the workers whose reputation scores exceed the reputation threshold $\rho^*$ to update the edge model. All reputation value computations by each edge server are assumed to be securely stored using the centralized global server.
In HFL, each edge server e has its own set of interacted workers k in the network, and the reputation opinions of every edge server for the FL workers are stored as an individual vector. During the FL worker selection phase in HFL, the edge server e accumulates recommended opinions on the FL worker k from neighboring edge servers that have interacted with that worker. When the edge server e receives local model updates from worker k, it initially determines the worker's reliability using its local reputation storage. The edge server e then initiates a reputation inquiry on worker k to its adjacent edge servers x and waits for responses within a set time interval. When an adjacent edge server x receives a reputation query, it examines its local reputation storage for a subjective opinion of the worker k. If a stored opinion exists, the adjacent edge server x responds with its subjective opinion of the FL worker k.
In the reputation computation phase under the HFL framework, each edge server e evaluates the FL worker's k reputation with the opinions accumulated from other edge servers in the reputation inquiry phase. After the reputation computation, the edge server e decides whether to consider the model updates from FL worker k based on its reputation. We assume all cooperating edge servers are trustworthy and provide honest responses during the reputation inquiry phase.
We use a freshness fading function to combine the reputation opinions from the latest $t$ iterations into a single reputation score; the fading function gives recent interaction events more weight than older ones. It is defined as $\vartheta_t = z^{T-t}$, where $T$ is the current iteration and $z \in (0,1)$. The faded reputation opinion of the edge server e about worker k is then computed as [8]:

$$b_{e,k} = \frac{\sum_{t=1}^{T} \vartheta_t \, b_{e,k}^t}{\sum_{t=1}^{T} \vartheta_t}, \quad d_{e,k} = \frac{\sum_{t=1}^{T} \vartheta_t \, d_{e,k}^t}{\sum_{t=1}^{T} \vartheta_t}, \quad u_{e,k} = \frac{\sum_{t=1}^{T} \vartheta_t \, u_{e,k}^t}{\sum_{t=1}^{T} \vartheta_t}.$$

The more workers two edge servers e and x have interacted with in common, the more reliable their indirect reputation opinions become. Every edge server's reputation opinions for the workers are represented by an individual vector, and the similarity of reputation opinions among edge servers can be estimated by comparing these vectors with the cosine function. We define a similarity factor between edge servers e and x to assess the reliability of indirect reputation opinions as follows [8]:

$$Sim(e, x) = \frac{\sum_{k \in C} (D_{ek} - \bar{D}_e)(D_{xk} - \bar{D}_x)}{\sqrt{\sum_{k \in C} (D_{ek} - \bar{D}_e)^2} \, \sqrt{\sum_{k \in C} (D_{xk} - \bar{D}_x)^2}},$$

where Z and Y are the sets of workers who have interacted with the edge servers e and x, respectively, and $C = Z \cap Y$ is the common set of workers between edge servers e and x. $\bar{D}_e$ and $\bar{D}_x$ denote the average reputation opinions of their directly interacting workers in C, respectively, while $D_{ek}$ and $D_{xk}$ denote the reputation opinions of a worker k held by edge servers e and x, respectively. A greater similarity value indicates that the reputation opinions of edge server x are more reliable. Thus, the weight of indirect reputation opinions from edge server x at edge server e is defined as $\omega_{ex} = \psi_{ex} \times Sim(e, x)$, where $0 \le \psi_{ex} \le 1$ is a predetermined weight parameter for the recommended opinions from edge server x at edge server e.
All of the recommenders' indirect reputation opinions of the worker k are weighted and combined to obtain the overall recommended reputation opinion $\Upsilon_{e:k}^{rec} = (b_{e:k}^{rec}, d_{e:k}^{rec}, u_{e:k}^{rec})$, whose components are obtained as follows:

$$b_{e:k}^{rec} = \sum_{x \in R} \omega_{ex} \, b_{x:k}, \quad d_{e:k}^{rec} = \sum_{x \in R} \omega_{ex} \, d_{x:k}, \quad u_{e:k}^{rec} = \sum_{x \in R} \omega_{ex} \, u_{x:k},$$

where R is the set of recommenders that worker k has interacted with. Therefore, the indirect reputation opinions of the various recommenders are consolidated into a cumulative recommended reputation opinion $\Upsilon_{e:k}^{rec}$ according to each opinion's weight.
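The following is a minimal sketch of the fading, fusion, and scoring steps above; normalizing the weighted sums by the total weight is our assumption (the exact forms are given in [8] and [26]):

```python
import numpy as np

def faded_opinion(history, z=0.9):
    """Combine per-iteration opinions (b, d, u) with freshness fading
    theta_t = z**(T - t), so recent interaction events weigh more."""
    T = len(history)
    theta = np.array([z ** (T - t) for t in range(1, T + 1)])
    ops = np.asarray(history, dtype=float)            # shape (T, 3)
    return (theta[:, None] * ops).sum(axis=0) / theta.sum()

def recommended_opinion(opinions, weights):
    """Weighted combination of recommenders' (b, d, u) opinions of a
    worker, normalized by the total weight (assumed normalization)."""
    w = np.asarray(weights, dtype=float)
    ops = np.asarray(opinions, dtype=float)
    return (w[:, None] * ops).sum(axis=0) / (w.sum() + 1e-12)

def reputation_score(opinion, base_rate=0.5):
    """Expected belief rho = b + a * u of a subjective-logic opinion."""
    b, d, u = opinion
    return b + base_rate * u
```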
To prevent potential manipulation by other edge servers, the edge server e takes into account both the direct reputation opinion it computed locally and the recommended reputation opinions when consolidating the final reputation value for the worker k, denoted as $\Upsilon_{e:k}^{final}$, a weighted combination of the direct opinion and the cumulative recommended opinion $\Upsilon_{e:k}^{rec}$.

IV. DRL-BASED REPUTATION MODEL FOR HFL
In this work, we propose dynamically optimizing the reputation threshold to mitigate the adverse impacts of malicious workers in the HFL environment. Deciding on the optimal reputation threshold is a challenging task for the edge server. The reputation scores of workers and the reliability of their local model updates can be influenced by the uncertainty of their behavior in the HFL environment, such as randomly switching from reliable and trustworthy to malicious over time according to a hidden attack pattern. Other factors affecting worker behavior in HFL include unstable network connectivity, excessive device mobility across clusters, and device energy limitations [27]. Thus, to further improve the performance of HFL, a multi-agent DRL-based reputation mechanism is proposed in this work.

A. MULTI-AGENT DDPG-BASED REPUTATION MODEL
Enabling a centralized and secure FL scheme at the edge server can be infeasible in a large-scale network considering the delay requirements of the wireless transmission between an edge server and an FL worker. In addition, the number of FL training iterations required to reach convergence grows rapidly with the increase in FL workers [15]. Therefore, we propose to design a distributed and collaborative scheme based on a multi-agent DRL for a more scalable and secure HFL.
To solve the MDP for the E edge servers (agents), the Multi-Agent DDPG (MADDPG) algorithm combines the DDPG algorithm with a collaborative multi-agent learning framework. DDPG addresses the optimal reputation threshold selection problem to detect and remove unreliable workers and enhance the accuracy and stability of the HFL global model. Since workers' behaviors are unstable in an HFL environment, their reputation scores may fluctuate among infinitely many possible values. As the number of workers grows, each edge server's decision problem becomes more complex and higher dimensional. Because the action space of each agent is continuous, decomposing it into discrete action vectors is difficult, particularly for large action vectors. Hence, each agent employs the DDPG algorithm to address its corresponding MDP. In contrast to Q-Learning and Deep Q-Networks, DDPG has proven able to handle continuous action spaces, such as the dynamic reputation thresholds [15], [31]. The DDPG algorithm adopted by each agent is detailed in Section IV-A.1. We assume that only partial observations are available to each edge server and that each edge server's decision is unknown to the other edge servers.
We transform the formulated problem into a partially observable Markov game for E agents, defined by a set of states S, a set of observations $X = \{X_1, \ldots, X_E\}$, and a set of actions $A = \{A_1, \ldots, A_E\}$, with $e \in \{1, 2, \ldots, E\}$. The state of agent e is defined using the reputation score $\rho_{e,k}$ and the local loss value $l(w_{e,k})$ of each worker k as follows:

$$s_{e,k}^t = \left\{ \rho_{e,k}, \; l(w_{e,k}) \right\}, \quad k \in \{1, 2, \ldots, K\}.$$
Each edge agent's observation at time step t is part of the current state $s_e^t \in S$; $X_e$ is the observation space and $A_e$ the action space of agent e. For each state $s \in S$, each agent e uses its policy $\pi_e$ to select an action from $A_e$ based on its observation of s. Each agent's objective is to learn an optimal policy $\pi_e^*$ that maps observations to actions so as to maximize the expected long-term cumulative reward $r_e$ over all edge aggregation iterations $t = 1, 2, \ldots, T$:

$$\pi_e^* = \arg\max_{\pi_e} \; \mathbb{E}\left[ \sum_{t=1}^{T} \gamma^{t} \, r_e^t \right], \quad \text{s.t. } l(w^T) \le \varepsilon,$$

where T represents the last iteration of the edge learning task, $\gamma$ is the discount factor, and $\varepsilon$ is the edge model's target loss. We define the components of the multi-agent DDPG model as follows:
• Environment: the HFL system with Multi-SLM, as explained in Section III-C.
• State space: the state $s_{e,k}^t$ of each agent is continuous and consists of the reputation score $\rho_{e,k}$ and the local loss value $l(w_{e,k})$, i.e., $s_{e,k}^t = \{\rho_{e,k}, l(w_{e,k})\}$, $k \in \{1, \ldots, K\}$; the state space of each edge server e at edge aggregation iteration t is the collection of these per-worker states.
• Action space: each edge server e chooses an action $a_e^t \in A_e$ at each edge aggregation iteration t, which represents the dynamic reputation threshold $\rho^* \in [0, 1]$ and covers all potential changes in reputation value.
• Reward: the reward $r_e^t$ of each agent at edge training iteration t is a function of the current state $s_e^t$ and the selected action $a_e^t$. Each agent e updates its policy $\pi_e$ according to the achieved reward until it reaches the optimal policy $\pi_e^*$, which directly determines the optimal reputation threshold decision. The F1-score of the global model at the edge aggregation is used as the reward function to guide each agent towards enhanced HFL performance:

$$r_e^t = \frac{2\,tp}{2\,tp + fp + fn},$$

where tp, fp, and fn are the true positives, false positives, and false negatives of the centralized global model.
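For concreteness, a minimal sketch of this reward computation follows; computing the counts per class of interest is an illustrative choice (a multi-class model could instead macro-average per-class F1-scores):

```python
import numpy as np

def confusion_counts(y_true, y_pred, positive):
    """tp/fp/fn of the global model for one class of interest."""
    tp = int(np.sum((y_pred == positive) & (y_true == positive)))
    fp = int(np.sum((y_pred == positive) & (y_true != positive)))
    fn = int(np.sum((y_pred != positive) & (y_true == positive)))
    return tp, fp, fn

def f1_reward(tp, fp, fn):
    """Agent reward r_e^t: F1-score of the centralized global model."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```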

1) DDPG ALGORITHM
The DDPG algorithm is based on an actor-critic learning framework. The actor network of each agent e generates an action $a_e^t$ according to the current policy $\mu(s|\theta^\mu)$ after observing the environment state $s_e^t$. The critic network evaluates the actor's policy by computing a Q-value for each state $s_e^t$ and action $a_e^t$. The actor network parameters $\theta^\mu$ are updated using the chain rule as follows:

$$\nabla_{\theta^\mu} J \approx \frac{1}{N} \sum_i \nabla_a Q(s^e, a^e|\theta^Q)\big|_{s=s_i, a=\mu(s_i)} \, \nabla_{\theta^\mu} \mu(s|\theta^\mu)\big|_{s_i},$$

where $\mu$ is the policy of the actor network and $Q(s^e, a^e|\theta^Q)$ is the Q-value function. The critic network parameters $\theta^Q$ are updated by minimizing the loss:

$$L(\theta^Q) = \frac{1}{N} \sum_i \left( y_i - Q(s_i^e, a_i^e|\theta^Q) \right)^2,$$

where $y_i$ is the target value obtained by

$$y_i = r(s_t^e, a_t^e) + \gamma \, Q'\!\left(s_{t+1}^e, \mu'(s_{t+1}^e|\theta^{\mu'}) \,\middle|\, \theta^{Q'}\right).$$

$\theta^{\mu'}$ and $\theta^{Q'}$ are the parameters of the target actor and target critic networks, respectively, and are updated with a soft update factor $\tau$ to improve the algorithm's learning stability and convergence. The target networks are updated as follows:

$$\theta^{\mu'} \leftarrow \tau \theta^\mu + (1-\tau)\theta^{\mu'}, \qquad \theta^{Q'} \leftarrow \tau \theta^Q + (1-\tau)\theta^{Q'}.$$
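A minimal TensorFlow sketch of one DDPG update step follows, assuming Keras actor/critic models and a mini-batch of transitions; it mirrors the critic loss, chain-rule actor update, and soft target update above, without the paper's exact hyperparameters:

```python
import tensorflow as tf

def soft_update(target_vars, source_vars, tau=0.005):
    """Polyak-average the online network weights into the target network."""
    for t, s in zip(target_vars, source_vars):
        t.assign(tau * s + (1.0 - tau) * t)

def ddpg_step(batch, actor, critic, target_actor, target_critic,
              actor_opt, critic_opt, gamma=0.99):
    s, a, r, s2 = batch
    # Critic update: regress Q(s, a) towards the bootstrapped target y_i.
    y = r + gamma * target_critic([s2, target_actor(s2)])
    with tf.GradientTape() as tape:
        q = critic([s, a])
        critic_loss = tf.reduce_mean(tf.square(y - q))
    critic_opt.apply_gradients(
        zip(tape.gradient(critic_loss, critic.trainable_variables),
            critic.trainable_variables))
    # Actor update: ascend the critic's Q-value via the chain rule.
    with tf.GradientTape() as tape:
        actor_loss = -tf.reduce_mean(critic([s, actor(s)]))
    actor_opt.apply_gradients(
        zip(tape.gradient(actor_loss, actor.trainable_variables),
            actor.trainable_variables))
    return critic_loss, actor_loss
```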

2) MADDPG-BASED RELIABLE WORKER SELECTION ALGORITHM
The proposed MADDPG-based scheme is summarized in Algorithm 1. In step 1, the actor network parameters $\theta^\mu$ and the critic network parameters $\theta^Q$ are initialized. At each global model training iteration t = 1 to T, each worker k sends its local model update $w_t^k$ and local model loss $l(w_t^k)$ to its associated edge server e. In step 5, each server verifies the received local model updates using the attack detection method to detect potential poisoning attacks [17]. Upon validating the local model updates $w_t^k$, in step 6, each edge server calculates a reputation score $\rho_{e,k}^t$ for each worker k within its cluster. This information is passed to the main actor network as an input state $s_e^t$ to obtain a deterministic action, i.e., the optimized reputation threshold $\rho_{e,t}^*$. At each MADDPG training step n, each agent e observes the environment state $s_e^t$ and selects an action $a_e^t$ with Ornstein-Uhlenbeck (OU) exploration noise $\mathcal{N}$ [28] in step 10, and observes the reward $r_e^t$ and next state $s_{t+1}^e$ in step 11. The transition tuple $(s_e^t, a_e^t, r_e^t, s_{t+1}^e)$ is saved in the experience replay memory D in step 12, from which a mini-batch of transitions is sampled randomly for each agent in step 13. The critic network weights $\theta^Q$ and the actor network weights $\theta^\mu$ are updated in steps 14 and 15. After that, the target actor network weights $\theta^{\mu'}$ and target critic network weights $\theta^{Q'}$ are updated for each agent in step 17. The steps above are repeated until the appropriate reputation threshold $\rho_{e,t}^*$ is found. All the agents cooperate in the MADDPG algorithm to find the optimal reputation threshold $\rho^*$.
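To tie the pieces together, the following is a minimal sketch of one edge aggregation round of this scheme; the per-agent `act`, `learn`, `ou_noise`, and `soft_update_targets` methods and the edge server accessors are hypothetical names for the steps described above, not the paper's API:

```python
import numpy as np

def maddpg_threshold_round(agents, edge_servers, replay, batch_size=64):
    """One edge aggregation round of reputation-threshold selection,
    following Algorithm 1 with per-edge DDPG agents."""
    for agent, edge in zip(agents, edge_servers):
        state = edge.observe()                        # (rho_k, loss_k) per worker
        noise = agent.ou_noise()                      # OU exploration noise
        threshold = float(np.clip(agent.act(state) + noise, 0.0, 1.0))
        selected = [k for k in edge.workers if edge.reputation[k] >= threshold]
        edge.aggregate(selected)                      # edge-level FedAvg over selected
        reward, next_state = edge.f1_reward(), edge.observe()
        replay.add((state, threshold, reward, next_state))
        if len(replay) >= batch_size:
            agent.learn(replay.sample(batch_size))    # DDPG update (Section IV-A.1)
            agent.soft_update_targets()               # tau-weighted target update
```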

V. PERFORMANCE EVALUATION

A. SIMULATION SETTINGS
This subsection discusses the simulation settings used in our proposed MADDPG-based reputation mechanism. We use a server with two Nvidia GeForce RTX 3080 GPUs running Ubuntu 20.04.1 LTS. The software environment is TensorFlow with Python 3 and the Keras ML libraries. We conduct the experiments on the widely used MNIST dataset [29], which contains 70,000 gray-scale images of handwritten digits (0-9).
We consider the HFL environment settings detailed in [4]. We test our proposed model in a setting with 10 edge servers, 300 clients, and a centralized server. We assume the clients are assigned to each edge server according to their location in each edge aggregation iteration. The distribution of the local training samples is highly unbalanced across the workers, such that the class labels are unevenly and randomly distributed. For each worker, 80% of the samples are used for training and 20% for testing. The percentage of malicious workers launching poisoning attacks is set to 40% of the total number of clients. Since unreliable workers can act maliciously following an invisible attack pattern, the malicious workers are randomly re-assigned in each edge aggregation iteration in our simulation.

Algorithm 1 MADDPG-Based Reliable Worker Selection Algorithm for HFL
1: Initialize the parameters of the actor and critic networks $\theta^\mu$, $\theta^Q$ for each agent $e \in E$
2: for t = 1 to T do
3:   for e = 1 to E do
4:     Each edge server e receives local model updates $w_t^k$ and local model loss $l(w_t^k)$ from each associated worker k
5:     Verify the received local model updates $w_t^k$
6:     Calculate reputation scores $\rho_{e,k}^t$ based on $w_t^k$
7:     Initialize random noise $\mathcal{N}$ for action exploration
8:     Observe current state $s_e^t$
9:     while n < N do
10:      Select an action $\rho_{e,t}^* = a_e^t = \mu(s|\theta^\mu) + \mathcal{N}$
11:      Execute action $a_e^t$ and obtain reward $r_e^t$ and new state $s_{t+1}^e$
12:      Store transition $(s_e^t, a_e^t, r_e^t, s_{t+1}^e)$ in replay memory D
13:      Sample a random mini-batch of transitions from D
14:      Update critic network weights $\theta^Q$ by minimizing $L(\theta^Q)$
15:      Update actor network weights $\theta^\mu$ using the chain rule
16:    end while
17:    Update target network weights $\theta^{\mu'}$ and $\theta^{Q'}$ for each agent e
18:    Obtain cooperative action $a_t = \{a_e^t\}, \forall e \in E$
19:  end for
20: end for

The proportion of poisoned or modified training samples indicates the poisoning attack strength $M_{frac}$, which ranges from 0.1 to 0.9 of the entire dataset of each malicious worker. In our simulation, $M_{frac}$ is set to 0.1 so that the edge server perceives the corresponding updates as plausible and viable.
We assume the reputation score is initially set to 0.5 for all workers and is updated according to the number of positive and negative interactions between the edge server and the worker during the edge aggregation iterations. In each edge aggregation iteration, workers are chosen based on their reputation scores to train the global model locally using a local learning rate of 0.02 and a mini-batch size of 10. The FL task is a multi-layer perceptron model with a final softmax output layer [2]. For the MADDPG model, a DDPG agent is deployed at each edge server so that optimal decisions are made collaboratively. Each DDPG agent has 4 neural networks (NNs): the actor and critic networks select actions that influence the environment's state and provide feedback on the actions taken, while the target actor and critic networks stabilize the agent's training process. The actor network of each agent has 3 units in its input layer and one hidden layer. The critic network has 2 hidden layers with numbers of units equal to the numbers of states and actions. The actor network's final output layer is a Sigmoid layer to limit the action space to [0, 1]. The actor and critic output layer weights are initialized from a uniform distribution $[-3 \times 10^{-3}, 3 \times 10^{-3}]$. The simulation parameters used are defined in Table 1 [15].
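A minimal Keras sketch of these actor and critic architectures follows; the hidden-layer widths are illustrative assumptions, since the paper fixes only the input sizes, layer counts, output activations, and initialization range:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_actor(state_dim=3, hidden=64):
    """Actor: maps the observed state to a reputation threshold in [0, 1]."""
    inp = layers.Input(shape=(state_dim,))
    x = layers.Dense(hidden, activation="relu")(inp)       # one hidden layer
    init = tf.keras.initializers.RandomUniform(-3e-3, 3e-3)
    out = layers.Dense(1, activation="sigmoid",
                       kernel_initializer=init)(x)         # action in [0, 1]
    return tf.keras.Model(inp, out)

def build_critic(state_dim=3, action_dim=1, hidden=64):
    """Critic: Q(s, a) with two hidden layers, as described above."""
    s_in = layers.Input(shape=(state_dim,))
    a_in = layers.Input(shape=(action_dim,))
    x = layers.Concatenate()([s_in, a_in])
    x = layers.Dense(hidden, activation="relu")(x)
    x = layers.Dense(hidden, activation="relu")(x)
    init = tf.keras.initializers.RandomUniform(-3e-3, 3e-3)
    q = layers.Dense(1, kernel_initializer=init)(x)
    return tf.keras.Model([s_in, a_in], q)
```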

B. PERFORMANCE RESULTS
This subsection presents the simulation results of our proposed MADDPG-based reputation mechanism. The performance of our proposed scheme is compared with the HFL static reputation threshold scheme and HFL with the FedAvg scheme [4], which signifies the absence of a defensive mechanism. For the static scheme, the reputation threshold $\rho_{e,t}^*$ is fixed to 0.5 in our simulation, since specifying an optimal reputation threshold is challenging for the edge servers given that attackers' behaviors change over time; moreover, such a fixed threshold may discard model updates of reliable workers. The reputation threshold values for the MADDPG-based reputation scheme are selected dynamically based on the agents' learning strategies. The performance of the HFL global model with the three methods is demonstrated in Fig. 2. Here, for all three schemes, the local training iterations of each worker are set to 20. We find that training the HFL workers on the received edge model updates for more iterations before the edge aggregation step improves the convergence and overall performance of the HFL global model. In Fig. 2(a), it can be observed that the global model accuracy increases for all three schemes as the global iterations increase. From the initial global model iterations, our proposed MADDPG-based reputation model considerably improves the global model's accuracy and exhibits more consistent and stable performance compared with the other schemes. This is because the MADDPG agents use the actor-critic method to dynamically select the appropriate reputation thresholds considering how the workers behaved in the HFL environment, which implies that our proposed model is more capable than the other two schemes of filtering out updates from malicious workers. In the static reputation threshold scheme, by contrast, workers are selected based on a pre-determined reputation threshold value; as a result, the possibility of selecting malicious or unreliable workers may increase, compromising the security and degrading the performance of the HFL global model. In HFL with the FedAvg scheme, the accuracy is more stochastic than in the other schemes because all contributed local model updates, including those of potentially unreliable or malicious workers, are incorporated into the global model. The average accuracy values obtained are 0.91, 0.92, and 0.96 for HFL with FedAvg, HFL with the static reputation scheme, and HFL with the MADDPG-based reputation scheme, respectively. The HFL global model loss for the three schemes is demonstrated in Fig. 2(b). We observe that the global model loss decreases for all three schemes as the number of iterations increases. Our proposed MADDPG-based scheme shows the best convergence performance, converging in less time than the two other schemes, which require a few more iterations to reach convergence.
Fig. 2(c) shows the cumulative rewards earned with each scheme, where the reward is defined as the F1-score of the HFL global model (14). The average HFL global model accuracy and loss obtained are shown in Table 2.
Furthermore, we investigate the impact of increasing the poisoning attack strength $M_{frac}$ on the performance of HFL with our proposed MADDPG-based scheme in Fig. 3. We observe that increasing the attack strength increases the loss of the HFL global model. For example, when $M_{frac}$ is 0.1, the average loss obtained is 0.13, and when $M_{frac}$ is 0.9, the average loss increases to 0.18. This is expected, since poisoned updates have a negative impact on the HFL global model's performance. To further analyze the performance of our proposed MADDPG-based scheme, we compare the performance of each independent edge DDPG agent in an HFL environment (HFL-MADDPG) with a single edge DDPG agent deployed in a conventional (non-HFL) FL environment (FL-SADDPG) [15]. For this scenario, the global iterations are set to 50 for both schemes. For FL-SADDPG, the number of workers is reduced to 10, with 40% being malicious; this allows FL-SADDPG to converge faster, since a significantly larger number of workers would reduce its convergence speed and overall performance. For HFL-MADDPG, the number of edge servers is set to 3, where each edge server is assigned 10 workers; hence, the total number of workers in the HFL scheme is 30. Fig. 4 shows the performance of each edge DDPG agent in HFL (HFL-MADDPG) and the performance of a single edge DDPG agent in the FL setting (FL-SADDPG). We observe that the performance of HFL has slightly decreased compared to the scenario in Fig. 2, due to the reduced number of workers per edge server, which decreases the overall performance of the HFL model. As expected, the performance of each edge server (i.e., E1, E2, E3) is lower than that of the global model in HFL-MADDPG, which aggregates the updates of all DDPG agents. As shown in Fig. 4(a), (b), and (c), the performance of each edge server in HFL-MADDPG and of FL-SADDPG is stochastic in the initial global model iterations, and after a few iterations both models converge. The performance of each edge server in HFL-MADDPG is higher than that of FL-SADDPG. The reason is that in HFL, the underlying reputation model and the reputation threshold selection decision are based on the consensus of all edge servers. Moreover, the edge model in HFL-MADDPG is trained on a larger dataset than FL-SADDPG, because common workers within each edge cluster can be selected several times by different edge servers, as described in Section III-C; this contributes to the increase in the overall HFL model performance. In FL-SADDPG, by contrast, there are fewer workers with a relatively smaller total dataset than in HFL-MADDPG, and the reputation computation and reputation threshold selection are based on the evaluation of a single server. In addition, the global model loss for both schemes is shown in Fig. 4(b); we find that HFL-MADDPG converges faster than FL-SADDPG, which requires more time to reach convergence. Table 3 shows the average accuracy and average loss of the HFL global model obtained with each scheme.

VI. CONCLUSION
This paper proposes a MADDPG-based reliable worker selection model to achieve improved global model accuracy and stability for HFL. The proposed MADDPG-based model is equipped with a reputation mechanism to evaluate workers' behaviors in the HFL environment, and it dynamically selects the optimal reputation threshold to maximize HFL accuracy. Simulation results show that our proposed model achieves superior performance in terms of HFL accuracy and convergence compared to the conventional HFL model and static reputation-based methods. For future research, we will consider optimal edge server selection in the HFL environment and will conduct further experiments on denser networks to verify the efficiency of our proposed MADDPG-based model for HFL in larger-scale networks. Moreover, we will study the impact of attacks and defenses in various HFL scenarios, such as different training data sizes and energy consumption levels, and will explore the effect of more complex adversarial behaviors on HFL security.