Beamforming Optimization for IRS-assisted mmWave V2I Communication Systems via Reinforcement Learning

Intelligent reflecting surface (IRS), which can provide an additional propagation path where only a non-line-of-sight (NLOS) link exists, is a promising technology for beyond fifth-generation (B5G) mobile communication systems. In this paper, we jointly optimize the base station (BS) and IRS beamforming to enhance network performance in an mmWave vehicle-to-infrastructure (V2I) communication system. The joint optimization of the beamforming matrices of the BS and the IRSs is challenging, however, because the problem is non-convex and time-varying. To tackle these issues, we propose a novel reinforcement learning algorithm based on the deep deterministic policy gradient (DDPG) method. Simulation results corroborate that the proposed algorithm converges both with and without IRS, and that the IRS-assisted case improves network performance by roughly 5% to 100%, depending on the environment, e.g., the number of vehicles and the deployment. Simulation results also show that, with IRS assistance, up to 10% higher network throughput can be achieved in the dense V2I network scenario compared with the sparse case.


I. INTRODUCTION
Witnessing the exponential growth in the number of connected machines, unprecedented requirements are expected across wireless communications [1]-[3]. Examples of connected machines include not only new form factors, such as augmented reality (AR), virtual reality (VR), and hologram devices, but also autonomous mobile systems, such as unmanned aerial vehicles (UAVs) and autonomous vehicles. Each requires a different category of service: enhanced mobile broadband (eMBB), ultra-reliable low latency communications (URLLC), or massive machine-type communications (mMTC), as standardized in the fifth generation (5G) [1].
However, the existing spectrum resources (e.g., the sub-6 GHz band) may not be enough to satisfy all the services and requirements of beyond 5G (B5G) due to their scarcity. In this regard, standardization work and academic research toward B5G or the sixth generation (6G) are already actively underway [4]. In particular, many studies have investigated the potential of higher frequency bands such as the millimeter wave (mmWave) [5], [6] and Terahertz (THz) bands [7]-[9]. High-frequency band communication poses other challenges: severe path loss and extra signal attenuation. In particular, a signal in such a high frequency band is attenuated by atmospheric conditions, e.g., water vapor and oxygen [10]. Moreover, due to its high directivity, the signal is severely attenuated in a non-line-of-sight (NLOS) environment [10]. Nevertheless, these losses can be overcome by deploying additional base stations (BSs) or by utilizing massive multiple-input multiple-output (MIMO), at the expense of implementation complexity and hardware cost.
The growth of automation technologies is another driver of mobile networks advancing toward 6G. To satisfy the requirements of enhanced throughput and reduced latency, the high-frequency bands, i.e., the mmWave or THz band, are also considered for vehicle-to-infrastructure (V2I) and vehicle-to-everything (V2X) communications. As pointed out in [37] and [38], vehicular communication using mmWave depends on LOS and focused-reflected paths, not on scattering and diffracting paths. Several studies have analyzed vehicular communications in the mmWave band to tackle these issues and have introduced the associated challenges [39]-[42].
IRS can also be leveraged for mmWave-V2X networks, particularly in urban environments, where securing LOS channel conditions is difficult and coverage is limited. Compared with deploying additional conventional BSs, IRSs can guarantee coverage and LOS channels in the mobile mmWave-V2X scenario at a lower cost; it is therefore of great importance to optimize the beamforming architecture to provide wider coverage. However, only a few studies have addressed the beamforming design of IRS-assisted networks in mobile networks [43], [44].

B. CONTRIBUTIONS AND PAPER ORGANIZATION
In this paper, we maximize the network throughput by jointly optimizing the beamforming matrix of the BS and the reflecting matrices of the IRSs in an mmWave V2I network with IRS-assisted communication. To do so, a deep reinforcement learning (DRL) method is proposed that takes into account the non-stationary nature of the V2I network environment [45]. The main contributions of this work are summarized as follows:
• We jointly optimize the BS beamforming matrix and the IRS reflecting matrices to maximize the throughput of the IRS-assisted mmWave V2I network.
• We propose a novel DRL algorithm based on the deep deterministic policy gradient (DDPG) method [46] to address the non-convex and time-varying optimization while considering the mmWave V2I network channel and environment.
• Simulation results demonstrate that the proposed DDPG-based algorithm converges and that the IRS helps improve mmWave V2I network performance. Moreover, comparison results are presented over the number of vehicles and the network density.
The remainder of this paper is organized as follows. In Section II, the IRS-assisted mmWave V2I communication system model is presented. In Section III, the rate maximization problem is formulated. In Section IV, the proposed DRL-based beamforming optimization is described. In Section V, simulation results are provided, followed by concluding remarks in Section VI.
Notation: Boldface capital letters and lower-case letters denote matrices and column vectors, respectively. Capital calligraphic letters denote finite discrete sets, and | · | denotes the cardinality of a set when applied to a finite discrete set, or the absolute value when applied to a complex number. For example, |M| = M is the number of BS antennas. (·)^H, (·)^T and (·)^{-1} denote the Hermitian transpose, the transpose, and the inverse of a matrix or vector, respectively. tr(·) and diag(·) denote the trace and the diagonal matrix built from the elements of a vector, respectively. C^{A×B} and R^{A×B} denote the spaces of A×B complex-valued and real-valued matrices, respectively. E[·] denotes statistical expectation. Re(·) and Im(·) denote the real and imaginary parts of a complex number, respectively.

II. SYSTEM MODEL
As illustrated in Fig. 1, we consider a multi-IRS-assisted mmWave V2I network scenario. In particular, to reflect a practical environment, we consider an urban case with the UMi Street Canyon model [39], in which a base station (BS) and multiple IRSs exist along the street. Table 1 describes the symbols and notation used throughout this paper.

A. CHANNEL MODEL
In this paper, we assume that the V2I network operates in the mmWave band. In particular, throughout this paper, we consider the Saleh-Valenzuela (SV) channel model [5] with slow fading, which is a conventional channel model for the mmWave MIMO case. In addition, the Doppler effect [39] is considered to account for the characteristics of the mobile V2I network. This channel model can be expressed by the baseband equivalent channel matrix

H(t) = H_LOS(t) + H_NLOS(t), (1)

where H_NLOS(t) and H_LOS(t) are given in (2) and (3), respectively. In (2) and (3), C is the number of clusters, L_i is the number of paths of the i-th cluster, β_{i,j} is the path loss, φ^R_{i,j} and φ^T_{i,j} denote the azimuth angles of arrival and departure, and θ^R_{i,j} and θ^T_{i,j} are the elevation angles of arrival and departure, respectively, all for the j-th ray of the i-th cluster. With a slight abuse of notation, (·)^T and (·)^R on a scalar indicate the value at the transmitter and the receiver, respectively. The array response vector of a uniform linear array (ULA) with N elements and half-wavelength spacing can be expressed as

a(φ) = (1/√N) [1, e^{jπ sin φ}, . . . , e^{jπ(N-1) sin φ}]^T. (4)

Unlike a_T and a_R, the array response vector of the IRS, a_I(φ^I_{i,j}, θ^I_{i,j}), is based on a uniform planar array (UPA) rather than a ULA, and can be expressed as

a_I(φ^I, θ^I) = (1/√R) [. . . , e^{jπ(x sin φ^I sin θ^I + y cos θ^I)}, . . .]^T, x, y ∈ {0, . . . , √R - 1}, (5)

where φ^I and θ^I denote the azimuth and elevation angles of the j-th ray of the i-th cluster at the IRS elements, respectively. Note that the UPA vector accounts for both the azimuth and the elevation angle. The LOS parameters in (3) are defined analogously to the NLOS parameters in (2) and are indexed by the subscript 0. In (3), η[n] ∼ U(0, 2π) denotes a random phase that changes with the environment, and I_L(d_0) is the LOS probability function at the transceiver distance d_0. As in [12], we assume that the channel state can be perfectly estimated using channel estimation techniques developed for various mmWave communication systems.
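To make the channel construction concrete, the following minimal numpy sketch generates the ULA and UPA array response vectors and a toy SV-type NLOS channel. It assumes half-wavelength element spacing and a simplified normalization; the function names and the complex-Gaussian path gains are illustrative and not the paper's exact parameterization.

```python
import numpy as np

def ula_response(n_ant, azimuth):
    # ULA steering vector in (4), half-wavelength spacing assumed.
    k = np.arange(n_ant)
    return np.exp(1j * np.pi * k * np.sin(azimuth)) / np.sqrt(n_ant)

def upa_response(n_elems, azimuth, elevation):
    # UPA steering vector in (5) for a sqrt(R) x sqrt(R) IRS panel,
    # written with the usual Kronecker structure.
    n_side = int(np.sqrt(n_elems))
    idx = np.arange(n_side)
    a_x = np.exp(1j * np.pi * idx * np.sin(azimuth) * np.sin(elevation))
    a_y = np.exp(1j * np.pi * idx * np.cos(elevation))
    return np.kron(a_x, a_y) / np.sqrt(n_elems)

def sv_channel(n_rx, n_tx, n_clusters=4, n_rays=5, seed=0):
    # Toy narrowband SV-type NLOS channel (illustrative normalization).
    rng = np.random.default_rng(seed)
    H = np.zeros((n_rx, n_tx), dtype=complex)
    for _ in range(n_clusters * n_rays):
        beta = (rng.normal() + 1j * rng.normal()) / np.sqrt(2)  # path gain
        phi_r, phi_t = rng.uniform(-np.pi / 2, np.pi / 2, size=2)
        H += beta * np.outer(ula_response(n_rx, phi_r),
                             ula_response(n_tx, phi_t).conj())
    return H * np.sqrt(n_rx * n_tx / (n_clusters * n_rays))
```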

B. V2I NETWORK SCENARIO
Consider a BS equipped with M antennas communicating with K vehicles, each equipped with a single antenna (M ≥ K). In addition, I IRSs assist the network between the BS and the vehicles to enhance communication performance. We assume that all IRSs are equipped with R passive reflective elements. For mathematical convenience, throughout this paper, we denote the sets of BS antennas, IRSs, IRS elements, and vehicles as M = {1, 2, . . . , M}, I = {1, 2, . . . , I}, R = {1, 2, . . . , R}, and K = {1, 2, . . . , K}, respectively.
We also denote H_i ∈ C^{R×M} as the channel matrix between the BS and the i-th IRS, g_{i,k} ∈ C^{R×1} as the channel vector from the i-th IRS to the k-th vehicle, and h_{d,k} ∈ C^{M×1} as the direct channel vector from the BS to the k-th vehicle.
By following the discrete-time state-space model [47], [48], we consider time discretized into slots of length δ. The position of vehicle k at time n can then be expressed as

q_k[n] = q_k[n-1] + δ v_k[n-1], n = 1, 2, . . . , (6)

where q_k[0] is the initial position of vehicle k at time n = 0, and each vehicle is assumed to have a different initial position. The position constraint can be written as

q_min ≤ q_k[n] ≤ q_max (elementwise), (7)

where q_min = [q_min,x, q_min,y]^T and q_max = [q_max,x, q_max,y]^T are the minimum and maximum coordinates in the 2-D Cartesian plane, respectively.
Similarly to (6) and (7), the velocity of each vehicle evolves in discrete time under corresponding initial-value and range constraints, where v_k[0] is the initial velocity of the k-th vehicle.
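As a small illustration of the discrete-time model above, the following sketch advances all vehicle positions by one slot; clipping to [q_min, q_max] is one simple way to enforce the constraint (7) and is an assumption, not necessarily how the paper's simulator handles the boundary.

```python
import numpy as np

DELTA = 0.1  # slot length δ in seconds (value used in Section V)

def step_positions(q, v, q_min, q_max):
    # q, v: K x 2 arrays of positions and velocities in the 2-D plane.
    # Discrete-time update q_k[n] = q_k[n-1] + δ v_k[n-1] as in (6);
    # constraint (7) is then enforced by clipping (an assumption).
    return np.clip(q + DELTA * v, q_min, q_max)
```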
In our V2I network model, there are two types of communication links: i) the direct link from the BS to a vehicle, and ii) the reflected link from the BS via the IRSs to a vehicle, as depicted in Fig. 1. For ease of analysis, the following conditions are assumed: the signals of both links arrive at the receiver synchronously, there is no reflection between IRSs, and a central controller coordinates the BS and the IRSs for synchronization and beamforming.

C. NETWORK MODEL
Let w_k ∈ C^{M×1} denote the transmit beamforming vector at the BS for vehicle k, and let W = [w_1, . . . , w_K] ∈ C^{M×K} be the corresponding beamforming matrix. The transmit signal x at the BS can then be written as

x = Σ_{k∈K} w_k s_k,

where s_k ∼ CN(0, 1) is the transmitted symbol for vehicle k. In addition, the BS has a maximum transmission power, which imposes the power constraint

E[||x||²] = tr(WW^H) ≤ P_t,

where P_t is the total transmit power at the BS. The r-th reflecting element of the i-th IRS can be expressed as

α_{r,i} = β_{r,i} e^{jθ_{r,i}},

where β_{r,i} ∈ [0, 1] and θ_{r,i} ∈ [0, 2π) are the amplitude and the phase of the r-th element of the i-th IRS, respectively. The i-th IRS reflecting matrix is the diagonal matrix

Φ_i = diag(α_{1,i}, . . . , α_{R,i}) ∈ C^{R×R}.

We assume ideal IRS reflecting elements on all the IRSs that, like a mirror, do not attenuate the power of the signal. This assumption corresponds to |α_{r,i}| = 1 for all r and i; in other words, we set β_{r,i} = 1 for all r and i in the remainder of the paper. We also assume that the IRS reflecting elements are arranged in a square: when the total number of IRS elements is R, the numbers of horizontal and vertical elements are both √R. In our configuration, there are two network models, i.e., single-hop and multi-hop. In the single-hop case, the BS transmits directly to the vehicles (BS-vehicles), which represents the conventional V2I network model. In the multi-hop case, the IRSs assist the V2I network (BS-IRS-vehicles and BS-vehicles); that is, there is not only a direct link from the BS but also a multi-hop link via the IRSs.
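The unit-modulus assumption makes each reflecting matrix fully determined by its phase vector. A minimal sketch, assuming β_{r,i} = 1 as above:

```python
import numpy as np

def irs_matrix(theta):
    # Φ_i = diag(α_{1,i}, ..., α_{R,i}) with ideal elements: β_{r,i} = 1,
    # so α_{r,i} = exp(jθ_{r,i}) and |α_{r,i}| = 1 for every element.
    return np.diag(np.exp(1j * np.asarray(theta)))

# Example: a 16-element IRS (4 x 4 square panel) with random phases.
Phi = irs_matrix(np.random.default_rng(0).uniform(0.0, 2 * np.pi, 16))
```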

1) Single-Hop
In the single-hop case, the received signal at vehicle k can be expressed as

y_k = h_{d,k}^H w_k s_k + Σ_{j∈K, j≠k} h_{d,k}^H w_j s_j + n_k,

where h_{d,k} ∈ C^{M×1} is the baseband equivalent channel from the BS to vehicle k, s_k denotes the transmit symbol for vehicle k, and n_k ∼ CN(0, σ_k²) is the independent and identically distributed (i.i.d.) Gaussian noise at vehicle k. The SINR of the single-hop link at vehicle k is then given by

γ^s_k = |h_{d,k}^H w_k|² / (Σ_{j≠k} |h_{d,k}^H w_j|² + σ_k²),

and the network throughput of the single-hop network model is given as

C^s = Σ_{k∈K} log₂(1 + γ^s_k).

2) Multi-Hop
A single BS and multiple IRSs are connected to multiple vehicles in the multi-hop network model. We consider that the received signal y_k at vehicle k is the sum of all signals from the IRSs and the BS, which can be expressed as

y_k = (h_{d,k}^H + Σ_{i∈I} g_{i,k}^H Φ_i H_i) Σ_{j∈K} w_j s_j + n_k,

and, as in the single-hop case, the SINR of the multi-hop link at vehicle k is given by

γ^m_k = |(h_{d,k}^H + Σ_{i∈I} g_{i,k}^H Φ_i H_i) w_k|² / (Σ_{j≠k} |(h_{d,k}^H + Σ_{i∈I} g_{i,k}^H Φ_i H_i) w_j|² + σ_k²). (15)

Thus, the network throughput of the multi-hop network model is given as

C = Σ_{k∈K} log₂(1 + γ^m_k). (16)
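The following numpy sketch evaluates the multi-hop SINR (15) and throughput (16) for given channels and beamformers; the array shapes and names are illustrative assumptions.

```python
import numpy as np

def multi_hop_throughput(h_d, G, Phi, H_irs, W, sigma2):
    # h_d: K x M direct channels; G: I x R x K IRS-to-vehicle channels;
    # Phi: I x R x R reflecting matrices; H_irs: I x R x M BS-to-IRS
    # channels; W: M x K beamforming matrix; sigma2: noise power.
    K = h_d.shape[0]
    rate = 0.0
    for k in range(K):
        # Effective channel: direct link plus all reflected links.
        h_eff = h_d[k].conj()
        for i in range(G.shape[0]):
            h_eff = h_eff + G[i, :, k].conj() @ Phi[i] @ H_irs[i]
        p = np.abs(h_eff @ W) ** 2               # received power per stream
        sinr = p[k] / (p.sum() - p[k] + sigma2)  # SINR γ_k^m in (15)
        rate += np.log2(1.0 + sinr)              # throughput term of (16)
    return rate
```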

III. NETWORK THROUGHPUT MAXIMIZATION FOR IRS-ASSISTED MMWAVE V2I NETWORKS
This section addresses the network throughput maximization problem for the multi-IRS-assisted mmWave V2I network by optimizing the beamforming matrices. In this paper, we define the overall performance of the system for one time slot as the network throughput. We also define the average value of network throughput over the entire time as average network throughput (ANT).

A. PROBLEM FORMULATION
Throughout this paper, we aim to jointly optimize the BS transmit beamforming matrix W and the IRS reflecting beamforming matrices Φ_i to maximize the network throughput. The resulting problem, (P1), maximizes the network throughput under constraints reflecting the characteristics of the IRSs and the physical conditions of the vehicles:

(P1): maximize_{W, {Φ_i}} C
subject to tr(WW^H) ≤ P_t, (17a)
|α_{r,i}| = 1, ∀r ∈ R, ∀i ∈ I, (17b)
0 ≤ θ_{r,i} < 2π, ∀r ∈ R, ∀i ∈ I, (17c)

where P_t denotes the transmission power budget of the BS. In (P1), (17a) is the power constraint at the BS, while (17b) and (17c) capture the characteristics of the IRS reflecting beamforming matrices: each IRS reflecting element reflects the incident signal without power loss.
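A candidate solution must respect the power budget and unit-modulus sets of (P1). One common way to map an unconstrained candidate onto the feasible set, sketched below, is to rescale W and keep only the phase of each IRS coefficient; this is an illustrative mechanism, not necessarily the paper's exact one.

```python
import numpy as np

def project_feasible(W_raw, theta_raw, p_t):
    # Scale W so that tr(WW^H) <= P_t, satisfying (17a).
    power = np.trace(W_raw @ W_raw.conj().T).real
    W = W_raw * np.sqrt(p_t / max(power, 1e-12))
    # Keep only the phase so |α_{r,i}| = 1, satisfying (17b)-(17c).
    alpha = np.exp(1j * theta_raw)
    return W, alpha
```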
However, since problem (P1) is non-convex in both its objective and its constraints, it is challenging to solve with general convex optimization techniques. Although there are methods that provide sub-optimal solutions to non-convex problems, such as the successive convex approximation (SCA) method [49], [50], they remain difficult to apply to this system model because the optimization problem involves multiple entities with stochastic channels. Furthermore, the number of variables in this system is too large for a conventional exhaustive search. Thus, we propose a joint beamforming method via DRL, which is elaborated in the following section.

IV. BEAMFORMING OPTIMIZATION FOR IRS-ASSISTED V2I NETWORKS VIA DEEP REINFORCEMENT LEARNING
In this section, we introduce the DRL-based beamforming optimization method. First, the optimization problem (P1) is cast as an MDP model. Then, our proposed algorithm, based on DDPG [46], is introduced.

A. MARKOV DECISION PROCESS MODELING
We cast the problem (P1) into the environment, state, action, and reward of an MDP model.

1) Environment
Our environment consists of the proposed communication system, in which an agent interacts with the environment to find the optimal actions and policies that maximize the cumulative reward. The environment includes all information related to the network, such as the BS, the vehicles, and the IRSs. Specifically, the transmission power of the BS, the characteristics of the IRS elements, the state of the vehicles, and the channel information are included in the environment. At each time step n, the agent observes a state s[n] from the state space S and accordingly takes an action a[n] from the action space A based on a policy π(s, a), which is a mapping from the state space to the action space. Let us define the cardinalities of the state space and action space as |S| and |A|, respectively. After performing the action, the current state s[n] of the environment changes to the next state s[n + 1], and the agent receives the current reward r[n].
We point out that the V2I network environment has a periodic pattern: the movement of vehicles in this scenario follows a similar pattern over a certain period. Thus, we set one episode of the environment as n = 0, · · · , T - 1, with initial time slot n = 0 and final time slot n = T - 1.

2) State
In this system, the agent obtains the state by observing the environment. We aim to optimize the BS beamforming matrix and the IRS reflecting matrices to maximize the network throughput of the network scenario; the state is therefore built from the environment observations described above.

3) Action
In our problem (P1), the BS transmit beamforming matrix W and the IRS reflecting beamforming matrices Φ_i are jointly optimized to maximize the total throughput of the system. Accordingly, the action space of the system includes those matrices, and the action at time step n is

a[n] = {W[n], Φ_1[n], . . . , Φ_I[n]}.

Note that the beamforming vectors and reflecting elements take continuous rather than discrete values; accordingly, the action is determined in a continuous action space.
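Since a DRL agent emits a flat vector of continuous values in [-1, 1], it must be mapped back to W and the IRS phases. The encoding below (real and imaginary parts of W followed by scaled phases) is one plausible layout; the exact ordering used by the authors is an assumption here.

```python
import numpy as np

def unpack_action(a, M, K, I, R):
    # Split a flat action a in [-1, 1]^(2MK + IR) into W and IRS phases.
    w_part = a[:2 * M * K]
    W = (w_part[:M * K] + 1j * w_part[M * K:]).reshape(M, K)  # Re/Im of W
    theta = np.pi * (a[2 * M * K:] + 1.0)                     # [-1,1] -> [0, 2π]
    return W, theta.reshape(I, R)
```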

4) Reward
The aim of the optimization problem (P1) is to maximize the total network throughput of the IRS-assisted V2I network. We set the reward function to the network throughput of the multi-hop network in (16). Therefore, for time slot n, the instantaneous reward is

r[n] = C[n],

where C[n] denotes the sum rate of the system at time step n.
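Putting state, action, and reward together, a generic interaction loop for one episode might look as follows; env, actor, and noise are placeholders for the components defined in this section, with an assumed step interface, and the returned time average corresponds to the ANT metric used later.

```python
import numpy as np

def run_episode(env, actor, noise, T):
    # One episode of length T: observe s[n], act a[n] = μ(s[n]) + N[n],
    # and receive the throughput reward r[n] = C[n].
    s = env.reset()
    total_reward = 0.0
    for n in range(T):
        a = np.clip(actor(s) + noise.sample(), -1.0, 1.0)  # keep a in [-1, 1]
        s, r = env.step(a)  # next state and reward (assumed interface)
        total_reward += r
    return total_reward / T  # time average, i.e., the ANT
```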

B. BEAMFORMING OPTIMIZATION VIA DRL
Under the designed MDP model, we employ a DDPG-based DRL algorithm for beamforming optimization. Before describing the considered DDPG algorithm, we first introduce the deep Q-network (DQN), which is the basis of our algorithm. For convenience of expression, the state s[n], action a[n], and reward r[n] are shortened to s_n, a_n and r_n, respectively.

1) Deep Q-Network
Deep Q-network (DQN) is one of the most widely used reinforcement learning algorithms and is a model-free, value-based, off-policy method. It learns the optimal policy that maximizes the cumulative future reward. The cumulative reward R_n at time n is expressed as

R_n = Σ_{k=0}^{∞} γ^k r_{n+k},

where γ, 0 ≤ γ ≤ 1, is the discount rate, which distinguishes between present and future rewards by assigning a higher weight to the present reward.
We define the expected sum of future rewards when action a is performed in state s under a policy π as the action-value function Q(s, a), also called the Q-value. The agent needs to find the optimal Q-value to maximize the reward, and the optimal action-value function is defined as

Q*(s, a) = max_π E[R_n | s_n = s, a_n = a, π].

This optimal Q-value yields the best attainable reward when action a is taken in state s. The role of the policy π is to map states to actions when evaluating the Q-value.
Model-free DRL trains the policy by using the Bellman equation [51] to find the optimal Q-value, which can be expressed as

Q*(s, a) = E_{s′}[r + γ max_{a′} Q*(s′, a′) | s, a]. (24)

In practice, however, the optimal Q-value cannot be found directly due to the lack of information, so a function that converges to the optimal Q-value is obtained by repeatedly updating the Q-value through the policy π:

Q_{n+1}(s, a) = E_{s′}[r + γ max_{a′} Q_n(s′, a′) | s, a],

and Q^π_n converges to Q* as n goes to infinity. This iterative process, called value iteration, enables (24) to be solved for the optimal Q-value. However, iterative training and learning become challenging as the dimensions of the state and action grow, since they cause severe complexity in calculating the Q-values and storing the data. To solve this issue, the authors of [52] proposed finding an approximate Q-value through a DNN instead of the Q-table built from deterministic Q-values; the resulting algorithm is called a deep Q-network (DQN). Here, the loss function L_i is given as

L_i(θ_i) = E[(y_i - Q(s, a; θ_i))²],

where y_i = E[r + γ max_{a′} Q(s′, a′; θ_{i-1}) | s, a] and θ denotes the DNN parameters used to find the optimal policy by the stochastic gradient descent (SGD) method. Therefore, we can calculate the optimal weights θ through SGD on the loss function as

θ_{i+1} = θ_i - lr ∇_{θ_i} L_i(θ_i). (27)

However, the trajectory data used in (27) and the value function are temporally correlated while learning the policy, which degrades performance. In particular, SGD assumes that the samples are independent and identically distributed, so it does not perform well with correlated samples. The DQN method uses two tricks to address this issue: i) experience is not used for learning immediately but is stored in a replay buffer; and ii) once enough data has accumulated, mini-batches are randomly drawn from the buffer for learning. This idea, called experience replay, makes the samples approximately independent.
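A minimal replay buffer realizing the experience replay trick is sketched below; the capacity and sampling strategy are illustrative choices.

```python
import random
from collections import deque

class ReplayBuffer:
    # Fixed-size buffer: transitions are stored first and later sampled
    # uniformly at random, which breaks the temporal correlation
    # between consecutive samples discussed above.
    def __init__(self, capacity=100_000):
        self.buf = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.buf.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(self.buf, batch_size)
```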

2) Deep Deterministic Policy Gradient
The DQN method, however, presents some hurdles for our application. The amount of computation rapidly increases with the number of actions or states due to the burden of Q-value calculation. Besides, it can only deal with discrete actions, not continuous ones. In value-based methods such as DQN, values must be discretized to handle a continuous action, but this has limitations: discretizing the action space makes it grow exponentially, rendering learning almost impossible, and since the optimal action may be removed during discretization, it is difficult to find the desired action.
Most algorithms used to solve this problem are based on the policy gradient (PG), especially the actor-critic method. In particular, compared to stochastic policy gradient methods, the deterministic policy gradient (DPG) method learns a deterministic action rather than a probability distribution over the action space, so the amount of computation is small and convergence is fast [53].
In this paper, we consider the deep deterministic policy gradient (DDPG) that combines the advantages of DQN and DPG for our system model. DDPG was introduced in [46] by improving the conventional DPG, a model-free, off-policy, actor-critic learning method. We make some modification of the original DDPG by using the experiance replay such as in DQN. Unlike many actor-critic methods that are on-policy methods, DDPG is also applicable because it is an off-policy method. There are two more problems, one of which is the problem of updating the actor network and the critic network using the gradient obtained from the time difference (TD) error.
The main critic network Q(s, a|θ Q ) and the main actor network µ(s|θ µ ) can be expressed as In addition, time delayed copy of the critic and the actor network are defined as Q ′ (s, a|θ Q ′ ) and µ ′ (s|θ µ ′ ), respectively. Those networks are also called target critic network and target actor network, respectively. When selecting the action for the next time step through an actor network µ ′ (s|θ µ ′ ) in DDPG, a random action is selected for exploration. In the paper that first proposed DDPG [46], Ornsten-Uhlenbeck (OU) noise derived from OU process N is used. Random action is selected by adding this noise to the output value of the network. The formula for selecting a random action in DDPG in this way can be written as The time difference target y i and the loss function L to be used in the critic network can be written respectively as The gradient of objective function can be calculated as where J is the objective function, in the form of a discounted cumulative reward, which is given as For the next learning step, the parameters of the main actor network θ u are updated as θ µ ← θ µ − lr µ ∇ θ µ . Finally, the target critic and actor network are updated through a soft update target parameter τ , which controls the learning frequency of the target networks. This parameter update process is summarized as The following subsection introduces our proposed reinforcement learning algorithm for beamforming optimization based on the DDPG algorithm with some modifications. Fig. 2 shows the training process of our proposed algorithm for IRS-assisted mmWave V2I communication systems. As described in Section II, we aim to jointly optimize the BS transmit beamforming matrix W and the IRSs reflecting matrices Φ i . The real and imaginary elements, Re(w m,k ) and Im(w m,k ), of W are continuous in the range [-1, 1], respectively. Similarly, the amplitude and the phase of IRS elements β r,i and θ r,i of Φ i are also continuous in the range [0, 1] and [0, 2π], respectively. Also, all the channel matrices have continuous complex-values. Note that our MDP model consists of continuous values for both states and actions.

3) Proposed DDPG-based Algorithm
In DRL, a non-linear activation function is used to prevent the gradient from vanishing or exploding when training a neural network. Since most non-linear activation functions have a very limited range, they cannot handle the wide range of values in our system model, such as the elements of a channel matrix. Therefore, to mitigate the gradient vanishing or exploding problem, ReLU6 is used as the activation function, together with batch normalization [54].
First, the activation function ReLU6 is a modified form of the widely used ReLU [55]. The ReLU function efficiently mitigates the gradient vanishing or exploding problem, and ReLU6 has the additional advantage of fast learning when the features are sparse, as in our system. Batch normalization normalizes the mean and variance of the input values of each layer in the neural network so that their distribution does not drift [54].
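As a quick illustration, ReLU6 simply clips activations to [0, 6], i.e., relu6(x) = min(max(x, 0), 6), and batch normalization can be inserted before the activation; placing it before the activation is one common ordering, and the layer width is only illustrative here.

```python
import tensorflow as tf

# ReLU6 clips activations to [0, 6].
x = tf.constant([-2.0, 3.0, 9.0])
print(tf.nn.relu6(x))  # tf.Tensor([0. 3. 6.], shape=(3,), dtype=float32)

# A hidden block combining batch normalization with ReLU6.
hidden = tf.keras.Sequential([
    tf.keras.layers.Dense(400),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation(tf.nn.relu6),
])
```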

V. NUMERICAL EVALUATIONS
This section presents the numerical results of the proposed joint beamforming optimization. Here, we compare and evaluate the proposed scheme under various network scenarios by changing the number of BS antennas, vehicles, and IRSs.

1) Environment
In this section, we consider a special case of Section II in which all the vehicles move at a constant velocity in the x direction. The initial position of the vehicles is q_min,x = 0 in the x-coordinate. The discrete position of the vehicles in this environment can therefore be expressed as

q_k,x[n] = q_k,x[0] + n δ v_0, 0 ≤ q_k,x[n] ≤ q_max,x,

where q_max,x is the maximum x-coordinate that a vehicle can reach, and the velocity of every vehicle is assumed to be the constant

v_k[n] = v_0, ∀n, ∀k.

In our simulations, we assume that an episode ends after a fixed period T, regardless of the locations of the vehicles. Although q_max,x can be any value, we set q_max,x = q_K,x[T - 1], the largest position reached by any vehicle. The considered environment configuration is illustrated in Fig. 3. In this environment, we set T = 50 with a time-step size of 0.1 [sec] as an example. Unless otherwise stated, the parameters related to the BS, IRS, and vehicles are as summarized in Table 2. Throughout this section, ANT is regarded as the main performance metric.

2) DRL Network
The network structure of the proposed DDPG-based algorithm is shown in Table 3. For the online and target actor networks, the entire action is produced at the output, so the number of output nodes equals the size of the action space A. Since the online and target critic networks receive both the state and the action as inputs to determine the Q-value, their number of input nodes equals the sum of the sizes of the state space S and the action space A.
We employ TensorFlow 2, with some modifications to handle complex values, to implement the proposed algorithm. In the networks considered for the simulations, there are two fully connected hidden layers with 400 and 300 nodes, respectively. The activation function of the hidden layers is ReLU6, which mitigates the gradient vanishing problem that occurs during DRL training [55]. The output layer of the critic network only has to produce the Q-value, so a linear activation function is used there. Unlike the critic network, the actor network outputs the BS beamforming matrix and the IRS reflecting elements, so the range of the network output must lie in [-1, 1] according to the constraints of the optimization problem; the tanh function, the most common activation in this situation, is therefore used in the output layer of the actor network [56], [57]. In the learning process, the parameters specified in Table 2 are used unless otherwise stated. The overall procedure is summarized in Algorithm 1.

Algorithm 1 Proposed DDPG-based beamforming optimization
Input: Learning parameters E, T, γ, τ, B, N_b, μ_a, μ_c, τ_a and τ_c
1: Randomly initialize the critic network Q(s, a|θ^Q) and the actor network μ(s|θ^μ) with weights θ^Q and θ^μ;
2: Initialize the target critic network Q′ and the target actor network μ′ with weights θ^{Q′} ← θ^Q and θ^{μ′} ← θ^μ;
3: Initialize the replay buffer R;
4: for episode = 1 to E do
5: Initialize the environment and a random OU noise process N for action exploration;
6: Receive the initial observation state s_0;
7: for l = 0 to T - 1 do
8: Execute the beamforming design based on the state s_l and the policy μ: a_l = μ(s_l|θ^μ) + N_l;
9: Perform the action a_l and record the reward r_l and the next state s′;
10: Store the transition (s_l, a_l, r_l, s′) in R;
11: end for
12: Sample a random mini-batch of N_b transitions from R;
13: Set the TD target y_i = r_i + γ Q′(s_{i+1}, μ′(s_{i+1}|θ^{μ′})|θ^{Q′});
14: Minimize the loss function L = (1/N_b) Σ_i (y_i - Q(s_i, a_i|θ^Q))² to update the critic network;
15: Update the online critic network weights θ^Q;
16: Update the online actor network by the sampled stochastic policy gradient ascent: ∇_{θ^μ} J ≈ (1/N_b) Σ_i ∇_a Q(s, a)|_{s=s_i, a=μ(s_i)} ∇_{θ^μ} μ(s)|_{s=s_i};
17: Update θ^μ ← θ^μ + lr_μ ∇_{θ^μ} J;
18: Soft update the target critic network and the target actor network: θ^{Q′} ← τ_c θ^Q + (1 - τ_c) θ^{Q′}; θ^{μ′} ← τ_a θ^μ + (1 - τ_a) θ^{μ′};
19: end for
20: return the learned policy μ
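A sketch of the actor and critic networks of Table 3 and of the soft update in step 18, written with TensorFlow 2 Keras; the layer sizes follow the text (400 and 300 nodes, ReLU6 hidden activations, tanh and linear outputs), while the function names and the optimizer-free soft-update helper are illustrative.

```python
import tensorflow as tf

def make_actor(state_dim, action_dim):
    # Two fully connected hidden layers with ReLU6; tanh keeps every
    # output in [-1, 1], as the action constraints require.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(400, activation=tf.nn.relu6,
                              input_shape=(state_dim,)),
        tf.keras.layers.Dense(300, activation=tf.nn.relu6),
        tf.keras.layers.Dense(action_dim, activation="tanh"),
    ])

def make_critic(state_dim, action_dim):
    # The critic takes (state, action) and outputs a scalar Q-value
    # through a linear output layer.
    s_in = tf.keras.Input(shape=(state_dim,))
    a_in = tf.keras.Input(shape=(action_dim,))
    h = tf.keras.layers.Concatenate()([s_in, a_in])
    h = tf.keras.layers.Dense(400, activation=tf.nn.relu6)(h)
    h = tf.keras.layers.Dense(300, activation=tf.nn.relu6)(h)
    q = tf.keras.layers.Dense(1, activation="linear")(h)
    return tf.keras.Model([s_in, a_in], q)

def soft_update(target, online, tau):
    # θ' <- τ θ + (1 - τ) θ' for every network variable (step 18).
    for t, o in zip(target.variables, online.variables):
        t.assign(tau * o + (1.0 - tau) * t)
```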

B. CONVERGENCE
First, we present the convergence behavior of the proposed DDPG-based algorithm in Fig. 4. In particular, Fig. 4(a) shows the convergence curve over iterations. Note that the mean and the variance of each result are drawn in bold and shaded form, respectively, over three independent simulations. Three cases are considered: 1) No-IRS, 2) Single-IRS, and 3) Multi-IRS; in this section, the Multi-IRS case has two IRSs. Fig. 4(b) shows the result for the last 5000 episodes of Fig. 4(a) to compare the convergence and variance of each case. Multi-IRS obtains the best performance on average, although Single-IRS and Multi-IRS achieve comparable ANT, and the variance of Multi-IRS is relatively small compared to Single-IRS and No-IRS. Fig. 4(c) shows the average over the last 5000 episodes of Fig. 4(a) to explicitly compare the converged policy of each simulation. The achievable throughput after convergence is 14.7 for Multi-IRS, 14.25 for Single-IRS, and 9.75 for No-IRS. Multi-IRS converges in around 5500 episodes, while Single-IRS and No-IRS take around 4000 and 2000 episodes, respectively. These results suggest that Multi-IRS achieves the best throughput performance at the expense of training complexity.

C. AVERAGE NETWORK THROUGHPUT
We compare the ANT performance of our proposed scheme, denoted as w/ IRS-DRL, with two baseline schemes, denoted as w/o IRS-DRL and Random.

1) w/ IRS-DRL
When the BS and the IRSs are beamformed jointly using the proposed DDPG-based DRL, the BS beamforming matrix and the IRS reflecting matrices are selected to maximize the ANT.

2) w/o IRS-DRL
This scheme uses the same learning method as w/ IRS-DRL, i.e., DDPG-based DRL, but considers the situation in which there is no IRS in the environment. That is, it maximizes the ANT by selecting only the BS beamforming matrix.

3) Random
In the same environment as w/ IRS-DRL, this scheme selects the BS beamforming matrix and the IRS reflecting matrices at random. Because the matrices are chosen randomly, the performance of this scheme serves as a practical lower bound. The channels are generated by Matlab simulation, and the simulation results are averaged over 10,000 runs. Fig. 5 and Table 4 show the convergence curves and the converged values of each algorithm, respectively. Figs. 5(a) and 5(b) show the baseline convergence curves for M = 4, K = 2 and M = 4, K = 4, respectively. First, Random selects the elements of the BS beamforming matrix and the IRS reflecting matrices completely at random, so its performance is very poor in both cases. In Fig. 5(a), comparing the ANT of our proposed scheme w/ IRS-DRL with w/o IRS-DRL shows that w/ IRS-DRL provides about 46.55% higher throughput. Similarly, in Fig. 5(b), the ANT of w/ IRS-DRL is about 52.41% higher. This means that using an IRS is an effective way to improve communication performance in our simulation environment.

D. IMPACT OF THE NUMBER OF VEHICLES
We analyze and compare the effects of the IRS and of the number of vehicles on the ANT. For ease of comparison, when an IRS is used, we consider only one IRS. In addition, the number of BS antennas is fixed to M = 8 and the number of IRS reflecting elements to R = 16 as an example. Fig. 6 shows the ANT according to the number of vehicles. In both the Single-IRS and No-IRS cases, the ANT increases as the number of vehicles increases. In particular, the ANT of Single-IRS improves significantly compared to No-IRS as the number of vehicles grows. These trends are related to the channel rank [58]: the rank of the MIMO channel is sufficient when the number of vehicles is small relative to the number of BS antennas, but when the difference between the number of BS antennas and the number of vehicles is small, the gain through the channel cannot be fully obtained because the channel rank is low. Remarkably, the IRS improves the overall channel condition by providing additional channel rank and reducing the correlation between different channels, as in the case of K = 8. For the Sparse network scenario, the network throughput improves by about 7.53% with Single-IRS and about 7.86% with Multi-IRS compared with No-IRS. For the Dense case, the network throughput improves by about 9.41% with Single-IRS and about 14.56% with Multi-IRS compared with No-IRS. These results show that an IRS can enhance the network throughput and that multiple IRSs improve it further. It is worth noting that, in a dense network environment, interference power may in general significantly degrade network performance. Nevertheless, in IRS-assisted communications, the reflective elements mitigate the interference power well, improving the overall network performance even in the dense network environment.

VI. CONCLUSION
This paper investigated a system in which the BS and the IRSs perform beamforming jointly in an mmWave V2I communication network. We proposed a novel DDPG-based DRL algorithm that optimizes the BS beamforming matrix and the IRS reflecting matrices to maximize network performance. Simulation results showed that the IRS can improve network performance in the mmWave V2I communication network in dense as well as sparse network environments. Extending the optimization and considering beam tracking could be interesting directions for future research. It is also worth investigating flexible MADRL frameworks that adapt quickly to new environments with meta and split learning [59].