Multi-Agent Reinforcement Learning-Based Pilot Assignment for Cell-Free Massive MIMO Systems

Cell-free massive multiple-input multiple-output (CF-mMIMO) has been considered as one of the potential technologies for beyond-5G and 6G to meet the demand for higher data capacity and uniform service rate for user equipment. However, reusing the same pilot signals by several users, owing to limited pilot resources, can result in the so-called pilot contamination problem, which can prevent CF-mMIMO from unlocking its full performance potential. It is challenging to employ classical pilot assignment (PA) methods to serve many users simultaneously with low complexity; therefore, a scalable and distributed PA scheme is required. In this paper, we utilize a learning-based approach to handle the pilot contamination problem by formulating PA as a multi-agent static game, developing a two-level hierarchical learning algorithm to mitigate the effects of pilot contamination, and presenting an efficient yet scalable PA strategy. We first model a PA problem as a static multi-agent game with P teams (agents), in which each team is represented by a specific pilot. We then define a multi-agent structure that can automatically determine the most appropriate PA policy in a distributed manner. The numerical results demonstrate that the proposed PA algorithm outperforms previous suboptimal algorithms in terms of the per-user spectral efficiency (SE). In particular, the proposed approach can increase the average SE and 95%-likely SE by approximately 2.2% and 3.3%, respectively, compared to the best state-of-the-art solution.


I. INTRODUCTION
As service demands in wireless communications increase, each new generation of cellular networks must determine new ways to multiplex more devices and enhance the spectral efficiency (SE) per device. Network densification is a common method for increasing the SE in wireless networks [1]. This can be achieved either by adding more base stations (BSs) or by increasing the number of antennas on each BS so that numerous user equipments (UEs) are served within the same time and frequency resource block, which is known as massive multiple-input multiple-output (mMIMO) [2]. Each of these techniques has certain drawbacks: placing a large number of BSs increases inter-cell interference and consequently degrades the service quality of the UEs, whereas with mMIMO, UEs positioned at the cell edge experience substantial propagation loss because of their long distance from the BS. Cell-free mMIMO (CF-mMIMO) was recently suggested as a solution to the deficiencies of the aforementioned technologies, combining macrodiversity and multi-user interference reduction to provide a uniform user experience [3]. In CF-mMIMO systems, a massive number of distributed access points (APs) simultaneously serve a relatively small number of UEs without imposing cell borders, which improves both SE and energy efficiency [4]. The implementation of the MIMO approach requires precise channel state information (CSI) at the transmitter (or receiver). CSI acquisition in CF-mMIMO systems is typically performed via scalable uplink pilot transmissions. However, because of the limited coherence time, only a limited number of orthogonal pilots are available, which is often less than the total number of UEs.
Consequently, we are forced to reuse the same pilots for multiple UEs, introducing undesirable effects known as pilot contamination. As such, the fading channel cannot be accurately estimated because of the co-pilot interference that occurs between UEs. It is worth mentioning that under the assumptions of suboptimal linear precoders, pilot contamination becomes the only capacity-limiting factor for both cellular and CF-mMIMO networks [3], [5]. Consequently, a proper pilot assignment (PA) is vital for minimizing the effects of pilot contamination, which is the primary focus of this research.

A. RELATED WORKS
The optimal PA problem for CF-mMIMO systems is inherently NP-hard. Consequently, the computational complexity required to obtain an optimal solution increases exponentially with the number of UEs. Therefore, the vast majority of published studies have concentrated on heuristic-based solutions. These algorithms can be classified into centralized and distributed schemes. Random PA (RPA) is a technique presented in [3] that lets each UE randomly choose a pilot sequence. The complexity of this strategy is minimal, and it can be implemented in a distributed manner. However, RPA has the worst performance because two nearby UEs may select the same pilot signal, which can result in significant pilot contamination. Additionally, a greedy PA (GPA) approach was suggested in [3] to iteratively update the minimum rate of all UEs. Unfortunately, this approach can only enhance the performance of the weakest UE and not that of the entire system. Using an iterative application of the K-means clustering technique, the authors of [6] proposed a structured PA that seeks to maximize the minimum distance between co-pilot UEs. In [7], the MIMO network was modeled by artificially imposing topological structures on the UE-AP connectivity and considering a partially connected interference network. Thus, the PA problem can be considered a topological interference management problem with multiple groupcast messages. Subsequently, depending on whether the channel connectivity pattern is known a priori, the topological PA problem is formulated in two ways. Another approach that can be employed to address the PA problem is a graph-theoretic framework. In [8], a conflict graph was developed by forming an edge between UEs that are dominant interferers with one another. Subsequently, a greedy approach was employed to solve the resulting graph coloring problem.
The authors of [9] mapped the PA problem to the max K-cut problem and solved it using a heuristic algorithm. In [10], the PA problem was formulated as a graph matching problem, and the Hungarian algorithm was applied to solve it. A novel approach based on tabu search was proposed in [11] to address the PA problem and maximize the sum-user SE. The majority of these studies are based on the same premise: UEs that are geographically far enough apart can use the same pilot. This principle also motivates the PA scheme proposed in this work, along with the additional objective that it should be distributed and scalable while providing competitive performance in terms of the SE for the average UE. The following section provides a summary of our contributions.

B. MOTIVATION AND CONTRIBUTIONS
Recently, learning-based approaches have demonstrated considerable promise in various applications [12], specifically for addressing a range of resource-allocation problems in wireless communication networks [13], [14], [15], [16]. The authors in [13] proposed a deep supervised learning approach to reduce pilot contamination for a multi-user mMIMO system. In [14] and [15], deep reinforcement learning (DRL) was employed for pilot design to alleviate the pilot contamination problem in cellular mMIMO systems. The authors of [16] developed a deep supervised learning-based PA algorithm for CF-mMIMO systems with massive access. In our previous work [17], we used a single-agent DRL to solve the power allocation problem in CF-mMIMO systems. To the best of our knowledge, this study is the first attempt to use a multi-agent DRL approach to solve a PA problem in CF-mMIMO. The main contributions of this study are summarized as follows:
1. We develop an optimization problem to maximize the uplink sum-user SE by considering the per-user power constraints, channel estimation error, multi-antenna APs, AP selection, and pilot contamination effects.
2. An appropriately designed PA algorithm improves system performance by alleviating the pilot contamination effect. We address the PA problem in an uplink CF-mMIMO system by modeling it as a diverse clustering problem.
3. We formulate the PA problem as a multi-agent static game. We then employ hierarchical multi-agent DRL to propose an efficient yet distributed PA scheme that mitigates the effects of pilot contamination in CF-mMIMO.
4. The computational complexity and convergence of the proposed approach are analyzed.
The rest of the paper is organized as follows. In Section II, the system model for CF-mMIMO is introduced, and the PA problem is formulated as an optimization problem. Preliminaries on reinforcement learning are reviewed in Section III. The proposed DRL-based PA approach and its experimental results are presented and discussed in Sections IV and V, respectively. Finally, conclusions are drawn, and future work is discussed in Section VI.

II. SYSTEM MODEL
As exemplified in Fig. 1, we consider a typical CF-mMIMO system with $L$ APs, each equipped with $N$ antennas, and $K$ single-antenna UEs, all geographically distributed over a specific region. The total number of service antennas is $M = LN$. It is also assumed that all APs are connected to the central unit via error-free fronthaul links. We employ a UE-centric technique to handle the scalability problem, in which each UE is served by a subset of APs. Throughout this paper, $\mathcal{M}_k$ denotes the subset of APs serving UE $k$, and $\mathcal{D}_l$ denotes the subset of UEs served by AP $l$ [18]. We also consider a conventional time-division duplex (TDD) protocol for downlink and uplink data transmissions. Let $\tau_c$ be the length of the coherence block in samples, which is divided into three parts, $\tau_c = \tau_u + \tau_p + \tau_d$, where $\tau_u$, $\tau_p$, and $\tau_d$ samples are used for uplink data transmission, uplink pilot training, and downlink data transmission, respectively [2]. In this paper, we focus on uplink training and uplink data transmission and assume that $\tau_d = 0$. The channel between UE $k$ and AP $l$ is modeled as

$$\mathbf{g}_{kl} = \sqrt{\beta_{kl}}\,\mathbf{h}_{kl},$$

where $\beta_{kl}$ indicates the large-scale fading coefficient (capturing both path loss and shadowing), and $\mathbf{h}_{kl} \in \mathbb{C}^N$ has independent and identically distributed (i.i.d.) $\mathcal{CN}(0,1)$ entries representing the small-scale fading. In our model, it is presumed that the system has access to all deterministic information, including the large-scale fading coefficients and the geographic locations of the APs.

A. UPLINK PILOT TRAINING AND CHANNEL ESTIMATION
During the channel estimation phase, each UE transmits a pilot from a set of $P$ mutually orthogonal pilot signals $\{\boldsymbol{\varphi}_1, \boldsymbol{\varphi}_2, \ldots, \boldsymbol{\varphi}_P\}$, each of length $\tau_p$ samples, where $P \le \tau_p$. Since the pilot resources are limited, the pilot set must be reused throughout the network. As a result, several UEs share the same pilot; we refer to these UEs as co-pilot (pilot-sharing) UEs. We denote the index of the pilot assigned to UE $k$ by $p_k \in \{1, 2, \ldots, P\}$ and the set of co-pilot UEs, including UE $k$ itself, by $\mathcal{C}_k$. After correlating the received pilot signal at AP $l$ with pilot $\boldsymbol{\varphi}_{p_k}$, we obtain

$$\check{\mathbf{y}}_{p_k l} = \sqrt{\tau_p \rho_{p_k}} \sum_{i \in \mathcal{C}_k} \mathbf{g}_{il} + \mathbf{W}_{p,l}\,\boldsymbol{\varphi}_{p_k}^{*},$$

where $\rho_{p_k}$ denotes the normalized transmit signal-to-noise ratio of the pilot symbol for UE $k$, and $\mathbf{W}_{p,l} \in \mathbb{C}^{N \times \tau_p}$ denotes the noise at the $l$-th AP, whose elements are i.i.d. $\mathcal{CN}(0,1)$. The minimum mean-squared-error (MMSE) estimate of the channel between the $k$-th UE and the $l$-th AP is given by

$$\hat{\mathbf{g}}_{kl} = \sqrt{\tau_p \rho_{p_k}}\,\alpha_{kl}\,\check{\mathbf{y}}_{p_k l}, \qquad \alpha_{kl} = \frac{\beta_{kl}}{\tau_p \rho_{p_k} \sum_{i \in \mathcal{C}_k} \beta_{il} + 1}.$$

It is worth mentioning that the estimation error $\tilde{\mathbf{g}}_{kl} = \mathbf{g}_{kl} - \hat{\mathbf{g}}_{kl}$ is uncorrelated with $\hat{\mathbf{g}}_{kl}$ and is distributed as $\tilde{\mathbf{g}}_{kl} \sim \mathcal{CN}\left(\mathbf{0}, (\beta_{kl} - \gamma_{kl})\mathbf{I}_N\right)$, where $\gamma_{kl} = \tau_p \rho_{p_k} \beta_{kl} \alpha_{kl}$.
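Under pilot reuse, the estimation statistics depend only on the large-scale coefficients and the pilot map. A minimal NumPy sketch under that assumption (taking the standard MMSE statistics $\alpha_{kl} = \beta_{kl}/(\tau_p\rho\sum_{i\in\mathcal{C}_k}\beta_{il}+1)$ and $\gamma_{kl} = \tau_p\rho\,\beta_{kl}\alpha_{kl}$; all function and variable names are illustrative, not from the paper):

```python
import numpy as np

def channel_estimate_stats(beta, pilot, tau_p, rho_p):
    """Per-(UE, AP) MMSE channel-estimation statistics under pilot reuse.

    beta  : (K, L) large-scale fading coefficients beta_{kl}
    pilot : (K,) pilot index p_k in {0, ..., P-1}
    Returns (alpha, gamma), both (K, L), where
      alpha_{kl} = beta_{kl} / (tau_p * rho_p * sum_{i in C_k} beta_{il} + 1)
      gamma_{kl} = tau_p * rho_p * beta_{kl} * alpha_{kl}
    """
    K, L = beta.shape
    alpha = np.empty_like(beta)
    for k in range(K):
        copilot = (pilot == pilot[k])            # C_k, includes UE k itself
        denom = tau_p * rho_p * beta[copilot].sum(axis=0) + 1.0
        alpha[k] = beta[k] / denom
    gamma = tau_p * rho_p * beta * alpha
    return alpha, gamma
```

The more co-pilot interference a UE suffers at an AP, the larger the denominator and the smaller its estimation quality $\gamma_{kl}$, which is exactly the degradation the PA scheme tries to avoid.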

B. UPLINK DATA TRANSMISSION
During uplink data transmission, AP $l$ physically receives the signal from all UEs:

$$\mathbf{y}_l = \sum_{k=1}^{K} \mathbf{g}_{kl}\, s_k + \mathbf{n}_l,$$

where $s_k$, with $\mathbb{E}\{|s_k|^2\} = p_k$, is the symbol transmitted by UE $k$ with power $p_k$, and $\mathbf{n}_l \in \mathbb{C}^N$ denotes the noise at AP $l$, whose elements are i.i.d. $\mathcal{CN}(0,1)$. However, to estimate $s_k$, only the APs in $\mathcal{M}_k$ are used, and the estimate $\hat{s}_k$ is given by

$$\hat{s}_k = \sum_{l \in \mathcal{M}_k} \mathbf{a}_{kl}^{H}\, \mathbf{y}_l,$$

where $\mathbf{a}_{kl} \in \mathbb{C}^N$ is the combining vector selected by AP $l$ for UE $k$. Here, we consider the maximum ratio combining (MRC) method with $\mathbf{a}_{kl} = \hat{\mathbf{g}}_{kl}$. By substituting the received signal and $\mathbf{g}_{kl} = \hat{\mathbf{g}}_{kl} + \tilde{\mathbf{g}}_{kl}$ into the expression for $\hat{s}_k$, we obtain

$$\hat{s}_k = \mathrm{DS}_k\, s_k + \sum_{k' \neq k} \mathrm{IUI}_{kk'}\, s_{k'} + \mathrm{TEE}_k + \mathrm{TN}_k,$$

where $\mathrm{DS}_k$, $\mathrm{IUI}_{kk'}$, $\mathrm{TEE}_k$, and $\mathrm{TN}_k$ are the desired signal, inter-user interference, total estimation error, and total noise, respectively. The achievable SE for UE $k$ is obtained by utilizing the so-called use-and-then-forget bound, because the CPU is unaware of the channel estimates [19]:

$$\mathrm{SE}_k = \frac{\tau_u}{\tau_c}\log_2\left(1 + \mathrm{SINR}_k\right),$$

where, for MRC, the effective SINR takes the form

$$\mathrm{SINR}_k = \frac{p_k \left( N \sum_{l \in \mathcal{M}_k} \gamma_{kl} \right)^2}{D_k},$$

with

$$D_k = N^2 \sum_{i \in \mathcal{C}_k \setminus \{k\}} p_i \left( \sum_{l \in \mathcal{M}_k} \gamma_{kl} \frac{\beta_{il}}{\beta_{kl}} \right)^2 + N \sum_{i=1}^{K} p_i \sum_{l \in \mathcal{M}_k} \gamma_{kl}\, \beta_{il} + N \sum_{l \in \mathcal{M}_k} \gamma_{kl}.$$

In the above equation, the first term denotes the interference caused by co-pilot UEs, the second term corresponds to multi-user interference, and the last term contains the noise power.
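For concreteness, the per-UE SE under this bound can be evaluated directly from the large-scale statistics. The sketch below assumes the standard MR-combining SINR form with the three $D_k$ terms described above (co-pilot, multi-user, and noise); function and variable names are illustrative:

```python
import numpy as np

def uatf_se(beta, gamma, pilot, p, serve, N, tau_u, tau_c):
    """Uplink SE per UE with MRC under the use-and-then-forget bound.

    beta, gamma : (K, L) large-scale and estimation-quality coefficients
    pilot       : (K,) pilot indices; p : (K,) uplink transmit powers
    serve       : (K, L) boolean mask, serve[k, l] = True iff l is in M_k
    """
    K, L = beta.shape
    se = np.empty(K)
    for k in range(K):
        m = serve[k]                                   # APs in M_k
        ds = p[k] * (N * gamma[k, m].sum()) ** 2       # desired signal power
        copilot = (pilot == pilot[k])
        copilot[k] = False                             # C_k without k itself
        coh = sum(p[i] * (N * (gamma[k, m] * beta[i, m] / beta[k, m]).sum()) ** 2
                  for i in np.where(copilot)[0])       # coherent co-pilot term
        mui = N * sum(p[i] * (gamma[k, m] * beta[i, m]).sum() for i in range(K))
        noise = N * gamma[k, m].sum()
        se[k] = (tau_u / tau_c) * np.log2(1.0 + ds / (coh + mui + noise))
    return se
```

Only $\beta$, $\gamma$, the pilot map, and powers are needed, so candidate pilot assignments can be scored without drawing any channel realizations.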

C. ACCESS POINT SELECTION
In the proposed approach, unlike [18], which first selects the appropriate APs for each UE through a competitive mechanism and then implements a PA approach, we first apply the PA algorithm and then perform AP clustering. Our main idea is based on the fact that all APs physically receive the signal of all UEs, and therefore, a proper PA scheme should be performed before AP clustering. The proposed PA scheme is described in detail in Section IV. Here, we cover the AP selection algorithm, assuming the pilot is already assigned to each UE. The following assumption was considered when developing the AP selection algorithm: Assumption: Each AP serves at most $\tau_p$ UEs (one UE per pilot) to prevent severe pilot contamination.
Our proposed AP selection method consists of the following steps: Step 1) UE $k$ selects the AP with the strongest large-scale fading (LSF) coefficient as its master AP. Two co-pilot UEs cannot choose the same AP as their master AP; in this case, the later UE selects the AP with the second-strongest LSF coefficient as its master AP. This step ensures that each UE is served by at least one AP.
Step 2) Each AP decides to serve at most $P$ UEs: for each pilot, it selects the UE with the strongest LSF coefficient.
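The two steps above can be sketched as follows. This is a simplified interpretation (names are illustrative): Step-1 conflicts are resolved by letting each UE fall through to its next-strongest AP whose slot for that pilot is still free, and an AP keeps both its Step-1 master UEs and the Step-2 per-pilot winners:

```python
import numpy as np

def select_aps(beta, pilot):
    """Two-step AP selection once pilots are assigned (sketch of Steps 1-2).

    beta  : (K, L) large-scale fading coefficients
    pilot : (K,) assigned pilot index per UE
    Returns serve : (K, L) boolean, serve[k, l] = True iff AP l serves UE k.
    Assumes fewer co-pilot UEs than APs, so every UE finds a master AP.
    """
    K, L = beta.shape
    serve = np.zeros((K, L), dtype=bool)
    # Step 1: master-AP choice, strongest LSF first; a co-pilot conflict
    # pushes the later UE to its next-strongest AP.
    taken = {}                                   # (AP, pilot) -> UE
    for k in np.argsort(-beta.max(axis=1)):      # stronger UEs choose first
        for l in np.argsort(-beta[k]):
            if (l, pilot[k]) not in taken:
                taken[(l, pilot[k])] = k
                serve[k, l] = True
                break
    # Step 2: every AP additionally serves, per pilot, the UE with the
    # strongest LSF coefficient (at most one UE per pilot is added).
    for l in range(L):
        for p in set(pilot):
            cands = [k for k in range(K) if pilot[k] == p]
            best = max(cands, key=lambda k: beta[k, l])
            serve[best, l] = True
    return serve
```

Because masters are kept, an AP may occasionally carry two UEs on one pilot; stricter variants could re-run Step 2 excluding masters.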

D. UPLINK DATA POWER CONTROL
Before delving into the proposed PA scheme, it is pertinent to adopt a scalable, low-complexity power control. Here, we utilize the fractional power (FrP) control algorithm, which sets the transmit power of UE $k$ as

$$p_k = p_{\max} \left( \frac{\min_{i} \sum_{l \in \mathcal{M}_i} \beta_{il}}{\sum_{l \in \mathcal{M}_k} \beta_{kl}} \right)^{\vartheta},$$

where $p_{\max}$ is the maximum uplink transmit power, and $\vartheta \in [0, 1]$ controls how the range of power coefficients is compressed: $\vartheta = 0$ corresponds to full-power transmission, while larger values of $\vartheta$ push towards channel inversion and thus better fairness.
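A one-line sketch of such a fractional policy (assuming the min-normalized form given above; $p_{\max}$ and the names are illustrative):

```python
import numpy as np

def fractional_power(beta, serve, p_max, theta):
    """Fractional power control sketch: with theta = 0 every UE transmits at
    p_max; as theta -> 1, UEs with weaker aggregate channels keep full power
    while stronger UEs back off (channel inversion, better fairness)."""
    K = beta.shape[0]
    agg = np.array([beta[k, serve[k]].sum() for k in range(K)])  # sum over M_k
    return p_max * (agg.min() / agg) ** theta
```

All quantities involved are large-scale statistics, so the powers can be computed once per large-scale coherence interval in a distributed fashion.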

E. PROBLEM FORMULATION
The objective of our PA algorithm is to maximize the sum SE by properly selecting the set of co-pilot UEs subject to certain constraints owing to the limited availability of pilots.
In the linear-programming literature, this type of problem is often formulated via the column generation method, in which each possible group of co-pilot UEs is regarded as a column. Let $\mathcal{A}$ represent the collection of all possible co-pilot UE sets, excluding the empty set and the singletons. As a result, $\mathcal{A}$ has cardinality $2^K - K - 1$, which equals the number of columns. We define the matrix $\mathbf{A}$, in which each column represents a set in $\mathcal{A}$, and denote the column corresponding to the $j$-th co-pilot UE set by $\mathbf{x}_j$. We define the cost of $\mathbf{x}_j$ as

$$c(\mathbf{x}_j) = \begin{cases} \sum_{k \in \mathbf{x}_j} \mathrm{SE}_k, & \text{if } \mathrm{SINR}_k \ge \Gamma_{\min} \ \forall k \in \mathbf{x}_j, \\ -\infty, & \text{otherwise}, \end{cases} \tag{12}$$

so that a set of co-pilot UEs that fails to fulfill the minimum SINR threshold $\Gamma_{\min}$, even for a single UE, is never chosen. The optimization problem is formulated as

$$\mathcal{P}1: \ \max_{\boldsymbol{\lambda}} \ \sum_{j=1}^{|\mathcal{A}|} c(\mathbf{x}_j)\, \lambda_j \quad \text{s.t.} \quad \mathbf{A}\boldsymbol{\lambda} = \mathbf{1}_K, \quad \sum_{j=1}^{|\mathcal{A}|} \lambda_j = P, \quad \lambda_j \in \{0, 1\},$$

where $\boldsymbol{\lambda} = [\lambda_1, \ldots, \lambda_{|\mathcal{A}|}]^T$. The first constraint (13a) guarantees that each UE is assigned exactly one pilot, and the second constraint (13b) guarantees that exactly $P$ pilots are used in the PA scheme. Problem $\mathcal{P}1$ is NP-hard, and its feasible solution space has size $\binom{|\mathcal{A}|}{P}$. For example, the feasible solution space for a moderately small system comprising 20 UEs and 10 pilots is approximately $4.4 \times 10^{53}$. Owing to this complexity, it is reasonable to explore suboptimal solutions that can be implemented effectively in a network. In Section IV, we describe a multi-agent-based PA scheme that uses only the UEs' locations to cluster the co-pilot UEs in a way that mitigates pilot-contamination-induced interference and thereby indirectly enhances the sum SE. This approach is scalable and can be implemented in a distributed manner as the network size increases.
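The size of this search space can be sanity-checked numerically; the helper below reproduces the $4.4 \times 10^{53}$ figure for $K = 20$, $P = 10$, treating the space as an unconstrained choice of $P$ columns (the helper name is illustrative):

```python
from math import comb

def pa_search_space(K, P):
    """Raw column-selection space of P1: choose P co-pilot sets out of
    |A| = 2**K - K - 1 candidates (empty set and the K singletons excluded)."""
    num_columns = 2 ** K - K - 1
    return comb(num_columns, P)

print(f"{pa_search_space(20, 10):.1e}")  # roughly 4.4e+53
```

Even this loose count makes exhaustive search hopeless beyond toy sizes, which motivates the learning-based scheme of Section IV.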

III. PRELIMINARIES ON DEEP REINFORCEMENT LEARNING MODEL
Reinforcement learning is a type of sequential decision making in which the goal is to learn a policy in a given environment whose dynamics are unknown. Reinforcement learning requires an interactive environment in which an agent can choose from a variety of actions that affect the environment. DRL, which effectively handles complex and high-dimensional environments, was created by fusing the reinforcement learning structure with a deep neural network (DNN) as a function approximator [20]. In this section, we review the mathematical background and preliminaries of both single-agent and multi-agent DRL.

A. SINGLE-AGENT DRL
The Markov decision process (MDP) formulation provides a mathematical foundation for modeling a single-agent DRL environment and is defined by the tuple $(\mathcal{S}, \mathcal{U}, \mathcal{P}, r, \eta)$. In the MDP model, $\mathcal{S}$ is the set of possible states $s_t$, $\mathcal{U}$ is the set of possible actions $u_t$, and $t$ is the decision time point. The transition probability of performing action $u_t$ in state $s_t$ and arriving in state $s_{t+1}$ is given by $\mathcal{P} : \mathcal{S} \times \mathcal{U} \times \mathcal{S} \rightarrow [0, 1]$. The expected immediate reward for taking action $u_t$ and transitioning from state $s_t$ to state $s_{t+1}$ is denoted by $r(s_t, u_t)$, and $\eta \in [0, 1)$ is the discount factor [21]. Different DRL systems have varying specifications regarding how data are gathered and how performance is measured. The dynamic relationship between the agent and the environment is depicted in Fig. 2. After carrying out the action requested by the agent, the environment delivers the corresponding next state and reward to the agent at the end of each iteration. The fundamental objective of the RL agent is to find an optimal policy $\pi^*(s)$ that selects actions based on the present state so as to maximize the discounted reward over time. This can be obtained by solving the Bellman optimality equation

$$Q^*(s, u) = \mathbb{E}\left[ r(s, u) + \eta \max_{u'} Q^*(s', u') \right].$$

Q-learning, one of the most influential developments in reinforcement learning, is a specific implementation of the Bellman equation based on a Q-table, updated as

$$Q(s_t, u_t) \leftarrow Q(s_t, u_t) + \mu \left[ r_t + \eta \max_{u'} Q(s_{t+1}, u') - Q(s_t, u_t) \right], \tag{15}$$

where $\mu$ is the learning rate. The Q-table stores the estimated maximum future reward for each action in each state; the action yielding the greatest expected reward is then selected. The optimal state-action value function can be obtained by repeatedly applying the update in (15). However, maintaining a Q-table becomes intractable in many cases. The deep Q-network (DQN) extends the Q-learning concept to alleviate this problem: it utilizes a DNN, instead of a Q-table, to approximate the nonlinear Q-values for each state-action pair [22].
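As a toy illustration of the tabular update in (15) (all numbers illustrative): a single-state, single-action problem with reward $1$ and $\eta = 0.9$ has the fixed point $Q = r/(1-\eta) = 10$, and the rule converges to it:

```python
from collections import defaultdict

def q_update(Q, s, u, r, s_next, actions, eta=0.9, mu=0.1):
    """One tabular Q-learning step:
    Q(s,u) <- Q(s,u) + mu * (r + eta * max_u' Q(s',u') - Q(s,u))."""
    target = r + eta * max(Q[(s_next, a)] for a in actions)
    Q[(s, u)] += mu * (target - Q[(s, u)])

Q = defaultdict(float)
for _ in range(2000):   # repeatedly visit the only transition
    q_update(Q, s=0, u=0, r=1.0, s_next=0, actions=[0])
```

Each step shrinks the error by the factor $1 - \mu(1-\eta)$, so $Q[(0,0)]$ approaches $10$ geometrically.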
The optimal action policy is obtained by evaluating every action that the agent can perform in the current state:

$$\pi^*(s) = \arg\max_{u \in \mathcal{U}} Q(s, u; \boldsymbol{\theta}).$$

The DQN agent aims to maximize the return by determining the optimal weight vector $\boldsymbol{\theta}$. Because the transition probabilities are unknown, the DQN agent uses an epsilon-greedy algorithm to balance exploitation and exploration. Each experience is stored in a first-in-first-out replay buffer, which is always accessible. Finally, the DQN agent updates $\boldsymbol{\theta}$ to minimize the loss function by sampling a mini-batch of experiences from the replay buffer and applying a suitable optimizer (such as Adam or SGD). For further information, refer to [23], which provides a detailed examination of DRL, various types of neural networks, DRL architectures, and their real-world applications.

B. MULTI-AGENT DRL (MADRL)
In practical situations, multi-agent systems have inspired the development of distributed solutions that are likely to be less expensive and more effective than centralized single-agent alternatives. Sequential decision-making problems involving several agents are addressed by MADRL, in which all agents jointly influence the dynamics of the system. In particular, the reward that an agent receives now depends on the actions of all other agents rather than just its own. Consequently, a specific agent should consider the policies of the other agents to maximize its long-term reward. The stochastic game (SG) generalizes the MDP to the multi-agent scenario and is defined by the tuple $(\mathcal{S}, \mathcal{U}_1, \ldots, \mathcal{U}_n, r_1, \ldots, r_n, \mathcal{P})$ [24], where $n$ is the number of agents; $\mathcal{U}_i$, $i = 1, \ldots, n$, are the finite action sets of the agents, yielding the joint action set $\mathcal{U} = \mathcal{U}_1 \times \mathcal{U}_2 \times \cdots \times \mathcal{U}_n$; $r_i$, $i = 1, \ldots, n$, are the agents' reward functions; and $\mathcal{P} : \mathcal{S} \times \mathcal{U} \times \mathcal{S} \rightarrow [0, 1]$ is the state transition probability function. The Q-function of each agent depends on the joint action and joint policy. Furthermore, in a fully cooperative SG, all agents share the same reward function. An SG with no state signal is known as a static (stateless) game. A static game is defined by a tuple $(\mathcal{U}_1, \ldots, \mathcal{U}_n, r_1, \ldots, r_n)$ in which all agents make decisions simultaneously, without knowledge of the policies chosen by the other agents.

1) MULTI-AGENT DQN (MADQN)
In an SG, the joint optimal policy is known as the Nash equilibrium (NE) and is defined as $\pi^* = (\pi_1^*, \pi_2^*, \ldots, \pi_n^*)$. The NE is a combination of all agents' policies in which each policy is the optimal response to the policies of the other agents: no agent can gain an advantage by unilaterally modifying its policy. Consequently, the task of each learning agent is to find an NE for any given condition of the environment. Each agent in MADQN comprises a primary (online) network, a target network, and a replay memory. In the training phase, the learnable parameters of the DNN are updated to improve the accuracy of the Q-function approximation in accordance with the system transition history. In learning step $t$, each agent feeds the current state into its DQN, which outputs Q-values for each action. Agents in DQN employ the same Q-values for both action selection and action assessment, leading to a Q-value overestimation problem. To address this problem and enhance the learning efficiency of the agents, the following two enhanced versions of DQN are presented.

2) MULTI-AGENT DOUBLE DQN (MADDQN)
The DDQN overcomes the overestimation problem by decoupling action selection from action assessment in the target computation. In particular, two networks, DQN1 and DQN2, are utilized: DQN1 selects the action, and DQN2 assesses the Q-value of the selected action.
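The decoupled target can be written in a few lines (a sketch; array and argument names are illustrative):

```python
import numpy as np

def double_dqn_target(r, q_online_next, q_target_next, eta=0.99, done=False):
    """Double-DQN target: the online network picks the next action,
    the target network evaluates it, decoupling selection from assessment."""
    if done:
        return r
    a_star = int(np.argmax(q_online_next))   # selection (DQN1 / online)
    return r + eta * q_target_next[a_star]   # assessment (DQN2 / target)
```

Contrast this with the vanilla DQN target `r + eta * q_target_next.max()`, where the same network both selects and evaluates, which biases the target upward.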

3) MULTI-AGENT DUELING DOUBLE DQN (MAD3QN)
The standard DQN method computes the value of each action in a given state. However, in certain states, different policies can lead to the same value function, which can hinder learning an optimal response to a specific condition. To alleviate this problem, the dueling DDQN has been proposed. The dueling DDQN is an improved variant of the DQN in which the Q-network comprises two streams: the state value function $V(s)$ and the advantage function $A(s, u)$, where the advantage function assesses the relative significance of one action with respect to the other actions in a given state. The output of the dueling network is derived by merging the two streams through an aggregation module into a single Q-function, which accelerates convergence and enhances efficiency.
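The usual mean-subtracted aggregation of the two streams can be sketched as (names illustrative):

```python
import numpy as np

def dueling_aggregate(V, A):
    """Dueling aggregation Q(s,a) = V(s) + A(s,a) - mean_a' A(s,a'),
    which keeps the V/A decomposition identifiable: by construction the
    advantages are zero-mean, so V carries the state value."""
    A = np.asarray(A, dtype=float)
    return V + A - A.mean()
```

Subtracting the mean advantage prevents the network from shifting value arbitrarily between the $V$ and $A$ streams.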

IV. PROPOSED MADRL-BASED PILOT ASSIGNMENT SCHEME

A. MODELING THE PROBLEM AS A STOCHASTIC GAME
The interactions of the agents are modeled by an SG, in which the environment changes in response to the players' actions. Here, we show that PA in CF-mMIMO systems can be modeled as an SG. The detailed definitions are provided below.

1. Set of agents: In a proper PA, each UE attempts to choose an appropriate pilot sequence from the $P$ orthogonal pilots. We define a new strategy in which each pilot, acting as an agent, interacts with the wireless communication environment and selects the UEs that can be assigned to it.

2. State space: All agents collectively probe the environment by inspecting various states. Specifically, the locations and SINRs of the co-pilot UEs are considered to be the current state.

3. Action space: We define an action as reassigning a new pilot to one of the UEs of an agent. The set of discrete pilot indices is taken as the action space. Each agent is allowed to select only one UE in each time slot and to reassign the new pilot index to the selected UE. All agents share the same action space.

4. Reward function: The reward function evaluates an agent's actions as either favorable or unfavorable. Consequently, it must be constructed to correspond to the objective function described in (12). The reward of each agent at learning step $t$ is accordingly defined as the sum of the cluster costs $c(\mathbf{x}_j)$ in (12) achieved by the pilot assignment at step $t$.

B. PROPOSED PA APPROACH
In the proposed method, we provide a new perspective on the PA problem. An effective PA policy requires only the locations of the UEs and no additional signaling overhead to mitigate the effects of pilot contamination, because the distance between UEs has the most significant impact on co-pilot interference. Here, we model PA as a diverse clustering problem. To this end, a static multi-agent game is defined with $P$ teams (agents), in which each team is represented by a specific pilot. Each team serves $K/P$ UEs and is supposed to select those UEs for which the least pilot contamination occurs. It is worth mentioning that the main difference between our approach and previous ones is that, instead of assigning pilots to UEs, we cluster the UEs into teams specified by the pilots. As shown in Fig. 3, the proposed approach is implemented as a two-level hierarchical MADRL. Its steps are as follows.
1. Define $P$ agents ($P$ teams), each represented by a specific pilot.
2. As the initialization phase, randomly assign $K/P$ UEs to each pilot, so that each team has at most $K/P$ co-pilot UEs. This is the starting point and the input of the proposed algorithm, which then attempts to maximize the final objective function by switching UEs between different teams.
3. In each agent, the UE causing the worst pilot contamination is selected by a pre-trained DNN and expelled. Simply put, at the low level the agents learn, through independent learning, to select the best UE candidate. This step is executed in a distributed manner, with the agents running independently of each other. Its output is an exclusion list containing all the UEs that have been expelled from their teams. For this step, we design one DNN and train it in an unsupervised manner to select the UE that has the least sum distance to its co-pilot teammates. No ground-truth outputs are required for training, which makes this step more flexible for practical implementation. The proposed DNN for worst-UE selection is illustrated in Fig. 4. The network comprises four consecutively interconnected layers that select one UE from the $K/P$ UEs. The input of the network is the UEs' locations, so the input layer has $2K/P$ neurons, and the output layer has $K/P$ neurons forming a one-hot vector that determines the index of the expelled UE. The number of hidden layers and of nodes per layer is a hyperparameter; in our proposed structure, we consider two hidden layers with $K/2$ and $K/4$ nodes, respectively. Finally, the following loss function is used for unsupervised training to select the UE that causes the most pilot contamination:

$$\mathcal{L} = \sum_{k \in \mathcal{C} \setminus \{\mathrm{UE}^*\}} d(k, \mathrm{UE}^*), \qquad d(k, \mathrm{UE}^*) = \left\| \mathbf{F}_k - \mathbf{F}_{\mathrm{UE}^*} \right\|_2, \tag{19}$$

where $d(k, \mathrm{UE}^*)$ computes the distance between UE $k$ and the selected UE$^*$, and $\mathbf{F}$ denotes the geographical coordinates of the UEs.
Because the distance between UEs has the greatest influence on co-pilot interference, we consider only (19) as the loss function. More sophisticated loss functions would also need to account for shadowing and fast-fading effects, which will be addressed in future work.
4. Implement cooperative agents with centralized training and decentralized execution to learn how to reassign the UEs in the exclusion list to different teams so that the final objective function improves. During the learning phase, the cost function constructed in (12) is made available to each agent as a reward. Each agent then adjusts its actions to remain close to an optimal policy by updating its deep Q-network. In this step, each agent learns to connect its expelled UE to the best possible team (i.e., to assign a new pilot to the expelled UE). During the execution phase, each agent receives local observations of the environment (the co-pilot UEs' locations and the UEs' SINRs) and then chooses an action (reassigning a new pilot to the expelled UE) based on its trained DQN. The NN used in this step is similar to that of the previous step; the only difference is the output layer, which has $P$ neurons forming a one-hot vector that determines the new pilot for the UE.
5. Repeat Steps 3-4 until a stopping condition is met, that is, until the difference between two consecutive objective values falls below a threshold or a pre-specified number of iterations is reached.
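Step 3's selection criterion can be sketched without any learning machinery: for a (soft) one-hot output over a team's UEs, the loss in (19) is the expected sum distance to the selected UE, so minimising it over the unit one-hot vectors picks the most central, i.e. worst-contaminating, teammate. A NumPy sketch under that assumption (names illustrative):

```python
import numpy as np

def worst_ue_loss(onehot, F):
    """Loss (19) for a (soft) one-hot selection over one team.
    F : (n, 2) geographical coordinates of the team's UEs.
    Returns the expected sum distance from the selected UE to its teammates."""
    F = np.asarray(F, dtype=float)
    dists = np.linalg.norm(F[:, None, :] - F[None, :, :], axis=-1)  # pairwise
    return float(np.dot(onehot, dists.sum(axis=1)))

def worst_ue(F):
    """Exhaustive minimiser of (19): the UE with the least sum distance."""
    n = len(F)
    return min(range(n), key=lambda i: worst_ue_loss(np.eye(n)[i], F))
```

The DNN in Fig. 4 amortizes this exhaustive search: once trained, one forward pass replaces the $K/P$ loss evaluations per team.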

C. COMPLEXITY ANALYSIS
This section presents a computational complexity analysis of the proposed PA scheme. The computational complexity of the MADRL-based PA is determined by the number of floating-point operations (FLOPs) of the neural networks, which is dominated by the matrix multiplication in each layer [25]: a fully connected network with layer widths $n_1, \ldots, n_m$ requires approximately $2\sum_{i=1}^{m-1} n_i n_{i+1}$ FLOPs per forward pass. With the layer widths of Section IV, each DNN used in Step 3 (widths $2K/P$, $K/2$, $K/4$, $K/P$) and each DQN used in Step 4 (widths $2K/P$, $K/2$, $K/4$, $P$) therefore requires $\mathcal{O}(K^2)$ FLOPs. During inference, $P$ DNNs in Step 3 and $P$ DQNs in Step 4 must be executed to determine the action. Consequently, the total complexity of the proposed approach is of polynomial order $\mathcal{O}(PK^2)$.
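The FLOP count is easy to reproduce from the layer widths. The widths below follow the Section IV description for $K = 50$, $P = 10$; the $2\sum_i n_i n_{i+1}$ rule ignores biases and activations:

```python
def dense_flops(widths):
    """Approximate FLOPs of a fully connected forward pass: one multiply
    and one add per weight, i.e. 2 * sum_i n_i * n_{i+1} (biases ignored)."""
    return 2 * sum(a * b for a, b in zip(widths, widths[1:]))

K, P = 50, 10
step3 = dense_flops([2 * K // P, K // 2, K // 4, K // P])  # worst-UE DNN
step4 = dense_flops([2 * K // P, K // 2, K // 4, P])       # reassignment DQN
total = P * (step3 + step4)   # P copies of each run per iteration
```

The dominant $K/2 \times K/4$ layer makes each network $\mathcal{O}(K^2)$, and running $P$ copies of each gives the stated $\mathcal{O}(PK^2)$ per iteration.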

V. SIMULATION RESULTS

A. SIMULATION SETUP
A typical CF-mMIMO system with $L = 100$ APs, each equipped with $N = 2$ antennas, and $K = 50$ UEs is considered, where all APs and UEs are independently and uniformly distributed over a 1 km × 1 km simulation area. We utilize the wrap-around technique to prevent boundary effects at the edges and to imitate network behavior over an infinite area. The large-scale propagation parameters, i.e., the path loss and shadow fading coefficients, are generated using the 3GPP Urban Microcell model [3]. We also consider that each coherence block has $\tau_c = 200$ samples, of which $\tau_p$ samples are used for uplink pilots and the remainder for uplink data. The other simulation parameters, summarized in Table 1, are the same as those used in [4]. We also employ the FrP power control in (11) to further improve the system performance.

B. BENCHMARK ALGORITHMS
In this section, we evaluate the performance of our proposed PA scheme. We compared the performance of the proposed approach with that of the benchmark solutions listed below. 1. RPA: Each UE is randomly assigned one pilot from P orthogonal pilots. 2. GPA: This approach considers a simple greedy algorithm that iteratively refines the PA. GAP is explained in detail in [3]. 3. Repulsive clustering based PA (RCBPA) [26]: In this approach, a repulsive clustering-based PA scheme is employed to mitigate the effects of pilot contamination.

4. No-pilot contamination (NPC): In this approach, we assume no limitation on pilot resources, so each UE uses an orthogonal pilot. Note that in this approach the number of pilots equals the number of UEs, i.e., τ_p = K.
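The RPA baseline above can be sketched in a few lines; this is a toy illustration of the random assignment and the co-pilot collisions it produces when τ_p < K, not code from the paper.

```python
import numpy as np

# RPA baseline: each UE independently draws one of tau_p orthogonal
# pilots, so nearby UEs may collide on the same pilot.
rng = np.random.default_rng(1)
K, tau_p = 50, 10
pilots = rng.integers(0, tau_p, size=K)   # pilot index assigned to each UE

# Group co-pilot UEs: UEs sharing a pilot contaminate each other's
# channel estimates.
copilot = {p: np.flatnonzero(pilots == p) for p in range(tau_p)}
```

By the pigeonhole principle, at least one pilot is shared by two or more UEs whenever K > τ_p, which is exactly the contamination the proposed scheme (and GPA/RCBPA) tries to manage by separating co-pilot UEs geographically.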

C. SIMULATION RESULTS
We implemented our MADRL approach in the TensorFlow 2 framework and ran our simulations on a PC with a Core(TM) i7 CPU @ 4 GHz and 32 GB of RAM. We first compare the three agent architectures (MADQN, MADDQN, MAD3QN) in terms of the total reward. Note that we implement the MAD3QN and MADDQN algorithms for PA with the same DNN architecture as that used in MADQN. The first observation is that all three agents eventually converge to the same reward, which confirms that the final performance of the proposed PA algorithm is unchanged across agents; the only difference is the speed of convergence. More specifically, MAD3QN and MADDQN converge quickly, reaching favorable results after 95 and 145 iterations, respectively. The MADQN approach learns the right policy to gain positive rewards after 174 episodes and achieves complete convergence after 400 episodes.

Fig. 6 illustrates the cumulative distribution function (CDF) of the SE per UE. We compare the proposed scheme with the four benchmarks described above. The first observation is that the proposed MADRL-based PA performs better than the other approaches. As shown in Fig. 6, the proposed scheme improves the median per-user uplink SE by 40%, 29%, and 16% compared to the RPA, GPA, and RCBPA algorithms, respectively. Compared with NPC, the proposed method performs better by a small margin: although there is no pilot contamination in the NPC approach, assigning τ_p = K samples to pilots degrades the SE by a factor of 1 − τ_p/τ_c.

The 95%-likely uplink SE obtained by varying the number of pilots for the different PA schemes is depicted in Fig. 7, which shows the impact of the number of pilots on the 95%-likely SE. In the NPC scheme, the number of pilots is constant and equal to τ_p = 50, which we use in the figure for comparison.
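For readers unfamiliar with the statistics reported above, the following sketch shows how the median and 95%-likely SE are read off an empirical per-UE SE distribution; the samples here are synthetic placeholders, not the paper's simulation output.

```python
import numpy as np

# The 95%-likely SE is the 5th percentile of the per-UE SE distribution
# (the SE achieved by at least 95% of UEs); the median is the 50th.
rng = np.random.default_rng(2)
se_samples = rng.gamma(shape=4.0, scale=0.5, size=10_000)  # placeholder per-UE SE values

likely_95 = np.percentile(se_samples, 5)    # 95%-likely SE (CDF crosses 0.05)
median_se = np.percentile(se_samples, 50)   # median SE (CDF crosses 0.5)
```

Improving the 95%-likely SE thus corresponds to lifting the lower tail of the CDF in Fig. 6, i.e., helping the worst-served UEs.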
For τ_p = 1 and τ_p = 50, the 95%-likely uplink SE is the same for all PA schemes. For the RPA scheme, although increasing the number of pilots improves the 95%-likely uplink SE, there is always a possibility that two UEs that are close to each other are assigned the same pilot and thus cause strong mutual interference. The same trend is observed for GPA, which performs slightly better owing to its greedy search. For RCBPA and our scheme, adding more pilots enhances the 95%-likely uplink SE only up to a certain point, beyond which it begins to decrease. Specifically, Fig. 7 shows that the maximum 95%-likely SE of CF-mMIMO with the MADRL-based PA scheme is about 1.45 bits/s/Hz, whereas for the RCBPA scheme it is approximately 1.33 bits/s/Hz. For both algorithms (the proposed approach and RCBPA), the maximum 95%-likely SE is achieved at τ_p = 16. This figure also demonstrates the importance of determining the optimum number of pilots, which will be the subject of our future research.

The average sum SE of a specific CF-mMIMO system (L = 100, N = 2, and τ_p = 10) with different PA schemes and three different numbers of UEs (K) is compared in Fig. 8. As evidenced by the figure, the average sum SE with the proposed MADRL-based PA scheme is about 1.2%, 2.4%, and 6.1% greater than with the RCBPA scheme; 2.4%, 4.6%, and 8.7% greater than with the GPA scheme; and 5.7%, 7.4%, and 12% greater than with the RPA scheme for K = 20, K = 50, and K = 100, respectively. Comparing the setups with various numbers of UEs, we note that the performance improvement of the proposed approach grows with UE density. This is because the proposed approach focuses on reducing the interference between UEs, which is much more prevalent under massive access.
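The existence of an interior optimum for τ_p can be illustrated with a toy model; the saturating SINR gain below is an assumption made for illustration, not the paper's model, but it captures the tension between contamination relief and pilot overhead.

```python
import numpy as np

# Toy trade-off: more pilots reduce contamination (modeled here by an
# assumed saturating SINR gain) but shrink the data portion of the
# coherence block through the prelog factor (1 - tau_p / tau_c).
tau_c, K = 200, 50
tau_p = np.arange(1, K + 1)
sinr_gain = 1.0 - np.exp(-tau_p / 8.0)       # hypothetical contamination relief
se = (1 - tau_p / tau_c) * np.log2(1 + 10 * sinr_gain)

best = int(tau_p[np.argmax(se)])             # SE-maximizing pilot count
```

Under any such model, the maximizer lies strictly between τ_p = 1 (heavy contamination) and τ_p = K (heavy overhead), mirroring the peak near τ_p = 16 observed in Fig. 7.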
Finally, Table 2 presents the results of our PA approach in comparison with state-of-the-art approaches in terms of the average SE and 95%-likely SE for CF-mMIMO with an MRC receiver with L = 100, N = 2, K = 50, and P = τ_p = 10. The proposed scheme outperforms all previous approaches. In particular, compared to the best existing alternative, our PA approach improves the average SE by 2.2% and the 95%-likely SE by 3.3%.

VI. CONCLUSION
In this study, we proposed a MADRL-based PA scheme for CF-mMIMO systems with the objective of maximizing the uplink sum SE. For this purpose, we formulated the PA problem as a multi-agent static game and developed a two-level hierarchical MADRL algorithm that mitigates the effects of pilot contamination by assigning the same pilot to UEs that are geographically far apart. At the lower level, agents learn to select the worst UE among the co-pilot UEs in an unsupervised manner, whereas at the higher level they learn to reassign the expelled UEs to new pilots via centralized training and decentralized execution. In addition, the complexity and convergence of the proposed scheme were analyzed. The superiority of MADRL-based PA over prior algorithms, such as RPA, GPA, and RCBPA, was validated by simulation results: compared to the best existing alternative, the proposed MADRL-based PA approach achieved approximately 2.2% and 3.3% improvements in the average SE and the 95%-likely SE, respectively. In future work, we will extend our MADRL-based approach by jointly optimizing the PA, power control, and subcarrier allocation to further enhance CF-mMIMO system performance.