A New Deep Q-Network Design for QoS Multicast Routing in Cognitive Radio MANETs

In this paper, we propose a new deep Q-network (DQN) design for a quality-of-service (QoS) multicast routing (DQMR) protocol that establishes efficient QoS multicast (EQM) trees in cognitive radio mobile ad hoc networks (CR-MANETs). An EQM tree is a shortest-path multicast tree with minimum end-to-end (E2E) cost (a combination of queue size ratio and link stability) subject to QoS constraints on the queue size ratio, link stability, number of hops, and number of time slots, while avoiding the licensed channels of primary users. In particular, we formulate an NP-complete optimization problem whose feasible solution is an EQM tree. To address this problem, we design a new DQN model and a new game-based model that form EQM trees in real time using offline training, rather than the online training used in previous work. The DQMR protocol is designed to achieve high stability, low routing delay, low control overhead, and a high packet delivery ratio (PDR). As a further contribution, we derive exact closed-form expressions for the E2E queuing delay of a multicast routing tree under the random waypoint and reference point group mobility models and compare them with the simulated routing delay. Simulation results show that the DQMR protocol outperforms the multicast ad hoc on-demand distance vector routing protocol in terms of routing delay, control overhead, and PDR.


I. INTRODUCTION
Cognitive radio (CR) technology has been deployed in mobile ad hoc networks (MANETs), allowing mobile devices to cognitively establish dynamic topologies without necessarily relying on any fixed infrastructure [3]. The benefits of CR come from enabling unlicensed mobile nodes to operate opportunistically in the licensed spectrum bands, thus improving spectrum utilization in cognitive radio mobile ad hoc networks (CR-MANETs) [4]. Multicast routing protocols in CR-MANETs have mainly relied on flooding to find the best route to destinations across the whole network, which often consumes considerable resources such as control overhead, spectrum, delay, and energy [5], [6]. Due to the dynamic nature of MANET environments, routing optimization problems with QoS constraints are generally non-deterministic polynomial-time (NP) complete [7], [8].
Reinforcement learning (RL) is an area of machine learning that enables an agent to learn in an interactive environment by trial and error, using feedback from its own actions and experiences to maximize its reward and minimize its penalty. Owing to its versatility, RL can solve a wide range of problems, from computer vision, speech recognition, robotics, and self-driving cars to wireless communications [9]. Moreover, the RL technique is suitable for routing problems in distributed networks such
These unsolved issues motivate us to design a new deep Q-network design for QoS multicast routing that leverages a deep Q-network (DQN) and game theory (GT), followed by a mathematical analysis of the E2E queuing delay (EQD). For ease of presentation, Table 1 summarizes the main abbreviations used in this paper.

C. MAIN CONTRIBUTIONS
In this paper, we mainly study QoS routing problems in the network layer, using information obtained from the physical layer and the data link layer through cross-layer design. The contributions of the paper can be summarized as follows:
• We propose a new deep Q-network (DQN) design for a quality-of-service (QoS) multicast routing (DQMR) protocol that establishes efficient QoS multicast (EQM) trees in cognitive radio mobile ad hoc networks (CR-MANETs). An EQM tree is a shortest-path multicast tree with minimum end-to-end (E2E) cost (a combination of queue size ratio and link stability) subject to QoS constraints on the queue size ratio, link stability, number of hops, and number of time slots, while avoiding the licensed channels of primary users.
• Firstly, we formulate an NP-complete optimization problem whose feasible solution is an EQM tree. Since this problem is too complicated to solve directly, it is divided into two sub-problems: the minimum E2E cost of multicast tree (MEC) problem and the channel-time slot allocation for multicast tree (CTA) problem.
• Secondly, we design a new DQN model, called the DQN-MEC model, to address the MEC problem. This model is trained offline to predict optimal link values ($Q^*$-values), which allows the DQMR protocol to establish minimum E2E cost multicast trees in real time.
• Thirdly, we propose a game-based model, called the GT-CTA model, to solve the CTA problem. This model allows the DQMR protocol to obtain minimum E2E cost multicast trees with the minimum number of time slots for a given number of channels, while preventing interference links and avoiding the regions of multiple primary users. Moreover, the GT-CTA model is proven mathematically to be a convergent potential game.
• Fourthly, the DQMR protocol is built on the DQN-MEC and GT-CTA models to establish EQM trees with high stability, low routing delay, low overhead, and high PDR.
• Fifthly, since the routing delay depends on many factors, such as the different kinds of delay, the mobility model, and the network topology, it cannot be analyzed exactly. Thus, we derive exact closed-form expressions for the E2E queuing delay of a multicast routing tree (EQD-MRT) under the random waypoint mobility (RWP) and reference point group mobility (RPGM) models. These expressions approximate the simulated routing delay and follow the same pattern, which supports the correctness of the developed analysis.
• Finally, the simulation results show that the DQMR protocol outperforms the multicast ad hoc on-demand distance vector (MAODV)-based routing protocol [18] in terms of routing delay, control overhead, and PDR.
The rest of the paper is arranged as follows. Section II introduces the system model and the basic concept of the DQMR protocol. Section III formulates QoS multicast routing as an optimization problem. Section IV develops the DQN-MEC model. Section V proposes the GT-CTA model. Section VI proposes the DQMR protocol. Section VII provides a theoretical analysis of the EQD-MRT. Section VIII presents the performance evaluation. Finally, Section IX concludes the paper.

II. SYSTEM MODEL
We consider a CR-MANET consisting of multiple primary users (PUs) and secondary users (SUs), as shown in Figs. 1 and 2. Each SU can opportunistically access licensed channels that are not occupied by PUs [4]. In two-dimensional space, the SUs move according to the RWP and RPGM models [19]-[21], while the PUs follow the RWP model. We assume that each node is aware of its own location through the global positioning system (GPS) and of the locations of the destinations in the multicast group [21], [22]. Moreover, each node has a fixed radio range and exchanges control packets over control channels that do not affect the licensed channels of PUs [23].

A. BASIC CONCEPT OF THE PROPOSED DQMR PROTOCOL
In this paper, we use the same multicast group management techniques as the MAODV protocol, e.g., join group and leave group, to maintain the multicast tree. The basic concept of the DQMR protocol at a node, as shown in Figs. 1 and 2, can be presented as follows:
Overview:
• Each SU (node) uses the cross-layer design in Fig. 2(a) to get parameters from the physical, data link, and network layers, such as the node's position, speed, direction, channel, queue, hop count, IP addresses of the source and destination, affected regions of PUs, and multicast tree information. In the routing process, these parameters are used by the DQN-MEC model in Fig. 2(b) and the GT-CTA model in Fig. 2(c) to obtain EQM trees. In particular, the DQN-MEC model predicts $Q^*$-values to establish minimum E2E cost multicast trees in real time, and the GT-CTA model selects optimal channel-time slot strategies (Nash equilibrium points) for the minimum E2E cost multicast trees.

VOLUME xxxx, 2021
Multicast tree discovery:
• If a source (src) needs to establish a multicast tree to the multicast group D, it requests the information of its neighbors. For every destination $dst_i \in D$, the src uses the DQN-MEC model to calculate the link values $Q^*_i(src, w)$ for all w in the set of the src's neighbors and selects the best neighbor $w^*_i$, i.e., the one with the highest value $Q^*_i(src, w^*_i)$. Then, the src generates a route request (RREQ) packet and broadcasts it to the set of best neighbors $\{w^*_i\}$.
• If a node $w \in \{w^*_i\}$ receives an RREQ, it records the sender as the previous node in the route table. Node w then calculates its own set of best neighbors and re-broadcasts the RREQ packet in the same way as the src.
• If a destination (dst) receives an RREQ packet, it records the sender as the previous node in the route table and unicasts a route reply (RREP) packet to the previous node.
• If a node receives an RREP packet, it appends the sender to the set of next hops (NH) in the route table. Next, the node forwards the RREP to its previous node by unicast. This process is repeated until the source has received the RREPs from all destinations, after which the channel-time slot allocation process starts.
Channel-time slot allocation process:
• Each multicast tree member (TM) of the EQM tree applies the GT-CTA model to obtain an optimal channel-time slot strategy. The data transmission process then starts.
Data transmission process:
• The src and the TMs of the EQM tree send data to the multicast group members based on their next hops (NH) and channel-time slot strategies. If the EQM tree is broken, the maintenance process is activated, and the DQN-MEC and GT-CTA models are used to locally find alternative routes to the multicast group members.
A CR-MANET is modeled as a directed graph G = (V, L), where V is the set of SUs and L is the set of directed links among nodes. A link (v, w) indicates that v is the sender, w is the receiver, and each of the two nodes is within the other's range. The set of destinations is identified with the set of the destinations' positions, which is denoted as D.

B. QUEUING DELAY MODEL
We assume that a node is a server and that the control packet traffic for routing increases in proportion to the number of links between an intermediate mobile node and its neighbors. Thus, control packet arrivals can be modeled by a Poisson process, and the service time is exponentially distributed. Hence, we employ an M/M/1 queuing system at each node to evaluate and analyze the delay caused by intermediate nodes in the routing process. The arrival rate and service rate are denoted by λ and µ, respectively. Based on the Markov chain for the M/M/1 system and Little's theorem [24], each node in the network has a queuing delay model with the following preliminary results: the average time a packet spends in the system is $T = 1/(\mu - \lambda)$, which includes the queuing delay plus the service time; the average time a packet spends in the queue is $W = T - 1/\mu$; the average number of packets in the system is $N = \lambda T$; and the average number of packets in the queue is $N_Q = \lambda W$.
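The four M/M/1 quantities above can be checked numerically. The following is a minimal sketch in Python; the function and field names are illustrative and not part of the protocol, and the stability requirement λ < µ is the usual M/M/1 assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MM1Metrics:
    time_in_system: float     # T = 1 / (mu - lambda)
    time_in_queue: float      # W = T - 1 / mu
    packets_in_system: float  # N = lambda * T
    packets_in_queue: float   # N_Q = lambda * W


def mm1_metrics(arrival_rate: float, service_rate: float) -> MM1Metrics:
    """Return the steady-state M/M/1 averages for arrival rate lambda and service rate mu."""
    if arrival_rate <= 0 or service_rate <= 0:
        raise ValueError("rates must be positive")
    if arrival_rate >= service_rate:
        raise ValueError("the queue is stable only if lambda < mu")
    t = 1.0 / (service_rate - arrival_rate)
    w = t - 1.0 / service_rate
    return MM1Metrics(t, w, arrival_rate * t, arrival_rate * w)


# Example: lambda = 8 packets/s, mu = 10 packets/s.
example = mm1_metrics(arrival_rate=8.0, service_rate=10.0)
```

For λ = 8 and µ = 10, this gives T = 0.5 s, W = 0.4 s, N = 4 packets, and N_Q = 3.2 packets.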
We define the queue size ratio as a part of the cost function in Eq. (3), which supports the DQN-MEC model in selecting optimal links with low queuing delay for the routing process. The queue size ratio of a link (v, w) is given by Eq. (1), where Qz(·) is the queue size of a node and $Qz_{max}$ denotes the maximum queue size of a node.

C. LINK STABILITY
We use the link stability ratio in [25] as a part of the cost function in Eq. (3), which supports the DQN-MEC model in selecting optimal links with high stability. The distances between v and w at times $t_i$ and $t_{i+1}$ are denoted by $D_{t_i}(v, w)$ and $D_{t_{i+1}}(v, w)$, respectively. The link stability ratio $LS_{\Delta t}(l)$ of a link l = (v, w) over the time interval $\Delta t = t_{i+1} - t_i$ is given by Eq. (2), where $\Delta D(l) = D_{t_{i+1}}(l) - D_{t_i}(l)$ and $v_{max}$ is the maximum speed of nodes. Note that the smaller $LS_{\Delta t}(l)$ is, the higher the stability of the link l.

D. COST FUNCTION
We design the cost function of a link l = (v, w) as a combination of the queue size ratio in Eq. (1) and the link stability in Eq. (2), which supports the DQN-MEC model in selecting links with high stability and low queuing delay. The cost function is thus used to reduce routing delay and obtain a highly stable EQM tree in the routing process. It is defined in Eq. (3), where ∆t is a period of time and the weights satisfy $\alpha_1 + \alpha_2 = 1$. For a source src and a destination dst ∈ D, we consider a route $P(src, dst) = \{src = n_0 \to n_1 \to \dots \to n_{m-1} \to n_m = dst\}$ of a multicast tree T, where $(n_i, n_{i+1}) \in T$, $\forall i = 0, \dots, m-1$. The number of hops of route P is denoted as #hops(P) = m, and the E2E cost of the route P is given by Eq. (4).
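To make the role of the weights concrete, the sketch below shows one way a link cost and a route cost could be computed in Python. It rests on two assumptions that are not confirmed by the text here: that the link cost is the weighted sum α1·Qr + α2·LS, and that the E2E cost of a route is the sum of its link costs. All function names are hypothetical.

```python
from typing import Sequence, Tuple

# Each link is described by (queue size ratio Qr, link stability ratio LS).
LinkMetrics = Tuple[float, float]


def link_cost(qr: float, ls: float, alpha1: float = 0.5, alpha2: float = 0.5) -> float:
    """Assumed form of the link cost: alpha1 * Qr + alpha2 * LS, with alpha1 + alpha2 = 1."""
    if abs(alpha1 + alpha2 - 1.0) > 1e-9:
        raise ValueError("the weights must satisfy alpha1 + alpha2 = 1")
    return alpha1 * qr + alpha2 * ls


def route_cost(links: Sequence[LinkMetrics], alpha1: float = 0.5, alpha2: float = 0.5) -> float:
    """Assumed form of the E2E route cost: the sum of the link costs along the route."""
    return sum(link_cost(qr, ls, alpha1, alpha2) for qr, ls in links)


def hop_count(links: Sequence[LinkMetrics]) -> int:
    """#hops(P) = m, the number of links on the route."""
    return len(links)


# A two-hop route: a lightly loaded but less stable link, then a loaded but stable one.
example_route = [(0.2, 0.4), (0.6, 0.2)]
example_cost = route_cost(example_route)
```

Under these assumptions, a lower cost corresponds to shorter queues and more stable links, so raising α1 favors lightly loaded links and raising α2 favors stable ones.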

E. CHANNEL MODEL
We present the channel model used by the GT-CTA model, which supports the DQMR protocol in establishing EQM trees. We assume that there is a set of L licensed channels $C = \{ch_1, \dots, ch_L\}$.
In a time slot t, each node v uses either a channel $ctx^t_v$ to transmit messages or a channel $crx^t_v$ to receive messages. If node w is a receiver of node v, the transmission channel of node v must be the same as the receiving channel of node w. The set of receivers of node v in time slot t is denoted as $RCV^t_v$. The set of nodes transmitting on a channel $ch_c$ in time slot t is denoted as $TN^t_c$, and the set of nodes transmitting in time slot t is $TN^t = TN^t_1 \cup \dots \cup TN^t_L$.

1) The Channel-Time Slot Condition for Preventing Interference
In a time slot t, a set of multicast links $ML^t_v = \{(v, w) : \forall w \in RCV^t_v\}$ satisfies the channel-time slot condition for preventing interference if and only if conditions (5)-(7) hold. Condition (5) implies that the receiving channels $crx^t_w$ of all nodes $w \in RCV^t_v$ are the same as the transmission channel $ctx^t_v$ of node v. Condition (6) means that only node v can transmit to the nodes $w \in RCV^t_v$ on channel $ctx^t_v$ in time slot t (a node cannot receive from more than one transmitter at the same time). Condition (7) indicates that when node v transmits to $RCV^t_v$ on channel $ctx^t_v$, no node $w \in RCV^t_v$ transmits on any channel (a node cannot receive and transmit at the same time).
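The three conditions can be read as a simple per-slot check. The sketch below is one possible Python rendering of the verbal descriptions of (5)-(7). The data layout (dictionaries keyed by node) and the use of a neighbor map to decide which transmitters can interfere at a receiver are assumptions made for illustration.

```python
from typing import Dict, Hashable, Optional, Set

Node = Hashable
Channel = Hashable


def satisfies_interference_conditions(
    v: Node,
    receivers: Set[Node],
    tx_channel: Dict[Node, Optional[Channel]],
    rx_channel: Dict[Node, Optional[Channel]],
    neighbors: Dict[Node, Set[Node]],
) -> bool:
    """Check the verbal forms of conditions (5)-(7) for sender v in one time slot."""
    ch = tx_channel.get(v)
    if ch is None:
        return False  # v is not transmitting in this slot
    for w in receivers:
        # (5): every receiver listens on v's transmission channel.
        if rx_channel.get(w) != ch:
            return False
        # (6): no other node in range of w transmits on the same channel.
        for u in neighbors.get(w, set()):
            if u != v and tx_channel.get(u) == ch:
                return False
        # (7): a receiver does not transmit on any channel in this slot.
        if tx_channel.get(w) is not None:
            return False
    return True


# Node 1 multicasts to nodes 2 and 3 on "ch1"; node 4 is a neighbor of node 2.
tx = {1: "ch1", 2: None, 3: None, 4: None}
rx = {2: "ch1", 3: "ch1"}
nb = {2: {1, 4}, 3: {1}}
ok = satisfies_interference_conditions(1, {2, 3}, tx, rx, nb)

# If node 4 also transmits on "ch1", node 2 hears two transmitters and (6) fails.
tx_conflict = dict(tx)
tx_conflict[4] = "ch1"
conflict = satisfies_interference_conditions(1, {2, 3}, tx_conflict, rx, nb)
```

In this toy slot, `ok` is `True` and `conflict` is `False`; moving node 4 to a different channel would restore the condition.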

III. PROBLEM FORMULATION
To support the proposed DQMR protocol in establishing EQM trees in the routing process, we formulate an optimization problem whose feasible solution is an EQM tree. We consider a tree T as a set of routes from a source to multiple destinations, as in Eq. (8), where src is the source, $dst_i$ is a destination belonging to the multicast group D, and M is the number of destinations. The E2E cost of the tree T can be represented as $cost(T) = (cost(P_1), \dots, cost(P_M))$. A tree $T^*$ is a minimum E2E cost tree if every route $P^*_i \in T^*$ has minimum $cost(P^*_i)$, i.e., $T^*$ minimizes cost(T).
We define a set of time slots as $TS = \{ts_1, \dots, ts_M\}$. Each node v has a channel-time slot strategy. The number of time slots of a route P is defined as $TS(CT_P) = \max_{v \in P} \{ts^{tx}_v\}$, and the number of time slots of the multicast tree T is defined accordingly. The problem (P) is then formulated in terms of the queue size ratio (Qr) and the link-stability ratio (LS) of each route P. The problem (P) is NP-complete and, to the best of our knowledge, has not been solved before. To address this problem, we divide it into two sub-problems: the minimum E2E cost of multicast tree (MEC) problem and the channel-time slot allocation (CTA) for multicast tree problem.
The MEC problem is to find a shortest-path multicast tree such that each route from the source to a destination of the multicast tree has minimum E2E cost subject to the QoS constraints. The CTA problem is to find an optimal channel-time slot strategy of a tree T with the minimum number of time slots, while preventing interference links and avoiding the affected regions of multiple PUs.

IV. PROPOSED DQN MODEL FOR THE MEC PROBLEM: DQN-MEC MODEL
The DQN-MEC model with offline training in Fig. 2 is designed to predict the optimal $Q^*$-values, which are used to select the best neighbors towards the respective destinations in the routing process. This neighbor selection process supports the DQMR protocol in establishing EQM trees. For every destination $dst_i \in D$, we need to find a route $P^*_i(src, dst_i)$ which is a solution of the MEC problem. Hence, we first propose a DQN-MEC model for the MEC problem in the scenario of one source and one destination. The obtained DQN-MEC model is then extended to the general scenario with one source and multiple destinations.
The DQN-MEC model is trained offline once in a realistic simulation environment on a computer to obtain a DNN model. Each node is equipped with a program that reads the resulting DNN model to predict the $Q^*$-values for the routing process in real time. When the network environment changes in network size or number of nodes, the model is retrained and each node updates to the new DNN model. The proposed DQN-MEC model is formulated as a model-free RL model, which includes a Q-learning model and experience replay as follows:

A. Q-LEARNING MODEL
The Q-learning model is designed to make the DQN applicable to the DQMR protocol.
• Agent: We consider a node holding a packet, i.e., a (packet, node) pair, as an agent which wants to find a route from a source to the destination. In particular, the packet starts at the source and finds a route to a destination which is an optimal solution of the MEC problem.
• State: The agent has a set of states S, which is taken to be the set of nodes V. At a certain time, if the agent is at node v ∈ V, its state is denoted as $s_v$.
• Action: At a certain time, the agent at state $s_v$ has a set of neighbors $NB_v$, which is taken to be the set of actions $A_v$ of the agent, i.e., the agent can move to any neighbor in $NB_v$. We denote a node $w \in A_v$ as an action $a_w$ of the agent at state $s_v$.
• Environment: At a certain time, the agent at state $s_v$ has an environment which includes the position, speed, and direction information of all of node v's neighbors.
• Reward function: At state $s_v$, if the agent selects an action $a_w \in A_v$, the reward of the link $l = (s_v, a_w)$ is defined in (14), where $Wgt_{hop} \in (0, 1)$ denotes the weight of one hop (a connected link between two nodes), $\alpha_c$ and $\alpha_h$ are weights in (0, 1) such that $\alpha_c + \alpha_h = 1$, and the QoS conditions are given in (15a)-(15c). The conditions (15a)-(15c) imply the QoS constraints (10b)-(10c) of the MEC problem. The cost of the route (12) and the number of hops in constraint (10d) can only be known after the route is established. Thus, the metrics cost and #hops are included in the reward function to guarantee that a minimum cost route is found and that a long route is not formed. The objective function (12) and the constraint (10d) are used to formulate the reward (14), where the values of $\alpha_c$ and $\alpha_h$ are adjusted to obtain the best reward value in the training process. In particular, when the src obtains the best route to the destination, i.e., the DQN-MEC model has converged and there exists a best neighbor $w^* = \pi^*(s_{src})$, and the number of hops is greater than the constraint $hop_{th}$, the weights of the reward function are adjusted by $\alpha_c = \alpha_c - \varepsilon$ and $\alpha_h = \alpha_h + \varepsilon$, and the DQN-MEC model is run again until the best route satisfies the number of hops constraint or the time limit is exceeded.
• Quality function (Q-function): At the state $s_v$, the agent takes an action $a_w \in A_v$ and updates the Q-function as in (16), where α and γ are the learning rate and the discount factor, respectively. We set $\max_{a \in A_{dst}} Q(s_{dst}, a) = 0$ in (16) to guarantee that the Q-value updating process stops at the destination.
• Policy: When the Q-values converge to the $Q^*$-values, a policy is a function $\pi^*$ that takes the state $s_v$ as input and returns the action to be taken by the agent. The policy $\pi^*$ can be expressed as $\pi^*(s_v) = w^* = \arg\max_{a_w \in A_v} Q^*(s_v, a_w)$. The policy is applied in the DQMR protocol to select the best neighbors in the multicast tree discovery process.
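The Q-function update and the greedy policy can be illustrated with a small tabular sketch in Python. This is not the DQN-MEC model itself, which uses a DNN: the update below is the standard tabular Q-learning rule, assumed here to match the form of (16), with the future value forced to zero at the destination as stated above. Names, reward values, and default parameters are illustrative.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, Iterable, Tuple

Node = Hashable
QTable = DefaultDict[Tuple[Node, Node], float]


def q_update(q: QTable, v: Node, w: Node, reward: float,
             next_actions: Iterable[Node], dst: Node,
             alpha: float = 0.1, gamma: float = 0.9) -> float:
    """Standard Q-learning step for moving from node v to neighbor w.

    The maximum future value is set to 0 when w is the destination,
    so that updates stop propagating beyond it.
    """
    if w == dst:
        future = 0.0
    else:
        future = max((q[(w, a)] for a in next_actions), default=0.0)
    q[(v, w)] += alpha * (reward + gamma * future - q[(v, w)])
    return q[(v, w)]


def best_neighbor(q: QTable, v: Node, neighbors: Iterable[Node]) -> Node:
    """Greedy policy: the neighbor w with the highest Q(v, w)."""
    return max(neighbors, key=lambda w: q[(v, w)])


# Toy chain s -> a -> dst, with an alternative neighbor b of s that is never rewarded.
q: QTable = defaultdict(float)
first = q_update(q, "a", "dst", reward=1.0, next_actions=[], dst="dst", alpha=0.5)
second = q_update(q, "s", "a", reward=0.0, next_actions=["dst"], dst="dst", alpha=0.5)
choice = best_neighbor(q, "s", ["a", "b"])
```

With α = 0.5 and γ = 0.9, the two updates give Q(a, dst) = 0.5 and Q(s, a) = 0.225, so the greedy policy at s selects a. For multiple destinations, one such table (or one DNN output) per destination would yield the set of best neighbors $\{w^*_i\}$.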

B. EXPERIENCE REPLAY
Unlike regular Q-learning, which takes considerable time to retrain when the network is complex and its topology changes frequently, the deep Q-network uses experience replay to learn the $Q^*$-values. In particular, experience replay is a replay memory technique that stores the agent's experiences at each time step. The experience replay of the DQN-MEC model can be described as follows: To fix the number of neighbor nodes for the training process, we assign a maximum value to the information value of node w, for all w ∈ NB v . Thus, the total number of variables for the input of DNN is if node w ∈ NB v , and we assign a maximum value to $Q^*(v, w)$. Thus, the number of variables for the output is 50.
• An environment provides 50 samples corresponding to the number of current states; thus, the obtained dataset has 200 × 50 = 10,000 samples.

V. PROPOSED GAME-BASED MODEL FOR THE CTA PROBLEM: GT-CTA MODEL
The GT-CTA model assists the DQMR protocol in obtaining EQM trees with the minimum number of time slots for a given number of channels, while preventing interference links and avoiding the regions of multiple primary users. For a multicast tree T, the GT-CTA model is formulated as a static best-response potential game [26] as follows:
• Player: Each node of the tree T is considered as a player.
• Environment: At a certain time, a player v has an environment which includes the channel-time slot schedule of node v's neighbors and the affected regions of PUs.
• Strategy: The set of strategies of node v is denoted as $S_v$. At the initial time, each node is assigned the strategy $s_\infty = (-\infty, 0, -\infty, 0)$. For a strategy $s_v \in S_v$, we denote $s_{-v}$ as the strategies of all players except player v and $S_{-v}$ as the set of all $s_{-v}$.
• Strategy selection (SS) rules: The game operates in epochs. In an epoch, each node v observes the environment to calculate a set of strategies $S_v$. If the parent w of node v has not yet selected a strategy, i.e., $s_w = s_\infty$, the set of strategies $S_v$ is set to ∅. Otherwise, a strategy $s_v$ must satisfy the SS rules, where $AN_{Child_v} = NB_{Child_v} \setminus Child_v$ is the set of neighbors of node v's children, excluding the children themselves. These rules imply that each node v has priority in choosing its strategy, which may conflict with those of its children. Its child nodes then update their own strategies to eliminate these conflicts with the parent node.
• Payoff: The payoff of a node v for taking a strategy $s_v \in S_v$ is defined in (17).
• Best response: The best response of node v is given in (18).
• Potential function: The potential function of the game is defined in (19), where $CT_T$ is a set of channel-time slot strategies and $s_T = CT_T$ is a strategy of tree T.
Theorem 1. The proposed game is a best-response potential game, i.e., condition (20) holds. Moreover, the best response of the game converges to a Nash equilibrium point within at most $1 + M \times (N_{hop} - 1)$ iterations, where $N_{hop}$ is the maximum number of hops of the multicast tree and M is the number of destinations of the multicast group. This theorem indicates that the game has a Nash equilibrium point and that it converges to one within a finite number of iterations. Moreover, the potential function (19) is equivalent to the objective function (13) of the CTA problem at the optimum; thus, the Nash equilibrium point reached by the best response of the game also lies in the feasible set of the CTA problem.
Proof. The proof of the theorem is divided into two parts.
The first part: Based on the strategy selection rules, if a node v chooses a strategy $s_v$, the strategy $s_v$ satisfies the rule SS-(iv), i.e., $s_v$ does not conflict with the strategies of any neighbor except for the children of node v. We have:
• Case 1: The strategy $s_v$ does not conflict with the strategies of node v's children. In this case, the condition (20) is satisfied, since the opposite would contradict the assumption.
• Case 2: The strategy $s_v$ of node v conflicts with the strategy of a child w of node v. This means that the function Φ always takes the value −∞. Thus, the condition (20) is satisfied.
The second part: In the multicast tree, only the source, which transmits data to the next nodes at the first hop, needs a single time slot to transmit data by multicast. From the second hop of the multicast tree onward, the maximum number of links that can interfere with each other is $N_{hop}$; thus, the maximum number of time slots needed to transmit data without interference is M. Hence, the maximum number of time slots needed to transmit data from a source to the multicast group is $1 + M \times (N_{hop} - 1)$.
Players obtain better strategies after each iteration, i.e., the number of time slots of the multicast tree decreases after each iteration. Thus, the best response of the game converges to a Nash equilibrium point within at most $1 + M \times (N_{hop} - 1)$ iterations.
Finally, the algorithm of the GT-CTA model at a node v can be presented as follows:
Step 1. Node v requests the strategies $s_w$ of all neighbors $w \in NB_v$.
Step 2. Node v calculates the set of available strategies $S_v$ based on the strategy selection rules. Next, node v chooses a best response $a^*_v = \pi_v(a_{-v})$ in (18) as its current strategy.
Step 3. Steps 1 and 2 are repeated until node v cannot find a better strategy, i.e., the sum of the payoffs converges.
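Steps 1 to 3 follow the usual pattern of best-response dynamics, which can be sketched generically in Python. The sketch does not reproduce the payoff (17) or the SS rules: `candidates` and `payoff` are placeholders supplied by the caller, and the toy game at the end is purely illustrative. The iteration cap uses the bound $1 + M \times (N_{hop} - 1)$ from Theorem 1.

```python
from typing import Callable, Dict, Hashable, Iterable

Node = Hashable
Strategy = Hashable
Profile = Dict[Node, Strategy]


def iteration_bound(num_destinations: int, max_hops: int) -> int:
    """Bound 1 + M * (N_hop - 1) on the number of iterations (Theorem 1)."""
    if num_destinations < 1 or max_hops < 1:
        raise ValueError("M and N_hop must be positive")
    return 1 + num_destinations * (max_hops - 1)


def best_response_dynamics(
    profile: Profile,
    candidates: Callable[[Node, Profile], Iterable[Strategy]],
    payoff: Callable[[Node, Strategy, Profile], float],
    max_iters: int,
) -> Profile:
    """Let each node switch to a strictly better response until no node changes."""
    profile = dict(profile)
    for _ in range(max_iters):
        changed = False
        for v in profile:
            options = list(candidates(v, profile))
            if not options:
                continue  # no available strategy in this epoch
            best = max(options, key=lambda s: payoff(v, s, profile))
            if payoff(v, best, profile) > payoff(v, profile[v], profile):
                profile[v] = best
                changed = True
        if not changed:
            break
    return profile


# Toy game (hypothetical): two nodes pick a slot in {1, 2}; sharing a slot is
# heavily penalized, and otherwise an earlier slot is preferred.
def toy_candidates(v: Node, profile: Profile) -> Iterable[Strategy]:
    return [1, 2]


def toy_payoff(v: Node, s: Strategy, profile: Profile) -> float:
    others = [profile[u] for u in profile if u != v]
    return -10.0 if s in others else -float(s)


result = best_response_dynamics({"a": 2, "b": 2}, toy_candidates, toy_payoff,
                                iteration_bound(2, 2))
```

In the toy game, the dynamics settle on `{"a": 1, "b": 2}`, from which neither node can improve unilaterally. For M = 5 destinations and $N_{hop} = 4$, the bound evaluates to 16 iterations.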

VI. THE PROPOSED DQN-BASED QOS MULTICAST ROUTING PROTOCOL: DQMR PROTOCOL
In this section, we present the DQMR protocol, which uses the DQN-MEC and GT-CTA models to establish EQM trees, i.e., shortest-path multicast trees with minimum E2E cost subject to QoS constraints, while preventing interference links and avoiding the regions of primary users. Moreover, the DQMR protocol is designed for high stability, low routing delay, low control overhead, and high packet delivery ratio (PDR). In practical MANETs, mobile nodes can move according to different mobility models, as shown in Fig. 3. In particular, nodes 1 to 11 move according to the RWP model, while the other nodes move according to the RPGM model in different groups: nodes 12, 13, and 14 in the first group, nodes 15 to 18 in the second group, and nodes 19 to 22 in the third group. Thus, the DQMR protocol is tailored to work well under both mobility models. In the given CR-MANET with a source node (src) and the multicast group D, the DQMR protocol, as shown in Figs. 3 and 4, can be presented as follows:
Initialization:
• Each node in the given CR-MANET initializes the variables of its routing table as follows: the set of last visit nodes $LV_{rt} = \emptyset$ and the route cost $RC_{rt} = +\infty$.
• Step 1. If a node needs to establish the tree to the multicast group D, the node becomes a source node (src); go to Step 2. Otherwise, go to Step 3.
Multicast Tree Discovery Process (Fig. 4):
Sending RREQ Process:
• Step 2. The src requests the information of its neighbors, including position, speed, direction, queue size, and the channels of PUs. For each destination $dst_i \in D$, the src predicts the values $Q^*_i(src, w)$ for all $w \in NB_{src}$ by using the DQN-MEC model and selects the best neighbor $w^*_i$ with the highest value $Q^*_i(src, w^*_i)$; for example, in Fig. 3, the best neighbors of the src are nodes 3, $dst_2$, 15, and 16, corresponding to destinations $dst_1$, ($dst_2$, $dst_3$), $dst_4$, and $dst_5$. The src updates the set of last visit nodes $LV_{src} = LV_{rt} \cup \{src\}$, the set of next visit nodes $NV_{src} = \{w^*_i, \forall dst_i \in D\} \setminus LV_{src}$, the list of costs from the src to all next visit nodes $CL_{src} = \{cost(src, w^*_i), \forall w^*_i \in NV_{src}\}$, and the route cost $RC_{src} = 0$. Next, the src generates a route request (RREQ) packet including $LV_{rreq} = LV_{src}$, $NV_{rreq} = NV_{src}$, $CL_{rreq} = CL_{src}$, and $RC_{rreq} = RC_{src}$, and broadcasts the RREQ to its neighbors. Go to Step 4. The RREQ packet contains the following fields: packet_type, hop_count, rreq_id, multicast_IP_address, multicast_seq_number, source_IP_address, source_seq_number, last_visit, next_visit, link_cost, route_cost.
Receiving RREQ Process:
• Step 3. If the node receives an RREQ, go to Step 3.1. Otherwise, the process is ended.
-- Step 3.1. The RREQ is dropped if at least one of the following cases is satisfied:
* The node is not in the list $NV_{rreq}$ of the RREQ.
* The new cost $RC_{rreq} + cost(w, node)$ is greater than or equal to the route cost $RC_{rt}$ in the route table, where cost(w, node) can be found in $CL_{rreq}$.
For example, in Fig. 3, nodes 9 and 10 drop an RREQ from the src. If the RREQ is dropped, the process is ended. Otherwise, go to Step 3.2.
-- Step 3.2. The node records the sender's ID as the previous node. Go back to Step 2.
• Step 4. If the node is the dst, go to Step 5. Otherwise, go to Step 6.
• Step 5. If the dst receives an RREQ packet, it generates an RREP packet and sends it to the previous node by unicast transmission; go to Step 9. Otherwise, the process is ended. The RREP packet contains the following fields: packet_type, hop_count, multicast_IP_address, multicast_seq_number, source_IP_address.
• Step 6. If the node receives an RREP packet, it appends the sender to the set of next hops (NH) in the route table and goes to Step 7. Otherwise, the process is ended.
• Step 7. If the node is the src, go to Step 8. Otherwise, the node unicasts the RREP packet to the previous node; go to Step 8.
Channel-time slot allocation process (Fig. 4):
• Step 8. Each node of the obtained EQM tree, as shown in Fig. 3, applies the GT-CTA model to obtain an optimal channel-time slot schedule such that the EQM tree has the minimum number of time slots for the given number of channels, while preventing interference links and avoiding the affected regions of multiple PUs. For example, in Fig. 3, the EQM tree uses 3 time slots ($t_1$, $t_2$, and $t_3$) and channels $c_1$, $c_2$, and $c_3$ to prevent interference links and avoid the affected regions of PUs. Go to Step 9.
Data Transmission Process (Fig. 4):
• Step 9. The source and the mobile nodes of the obtained EQM tree multicast data to the multicast group members based on their next hops (NH) and the optimal channel-time slot strategy. In particular, the source generates data packets according to a Poisson process. Next, the source multicasts the data packets to the next hops by using the channel-time slot strategy. If a node of the EQM tree receives a data packet, it forwards the data packet towards the multicast group in the same way as the source.
Multicast tree maintenance process:
• Step 10. During the routing and data transmission processes, if one of the established links from a node to its next hops is broken, the node builds alternative routes locally by the same approach as the src in the multicast routing process. In particular, if the node cannot connect with at least one of its next hops, it requests the information of its neighbors and calculates $LV_{rreq}$, $NV_{rreq}$, and $CL_{rreq}$. Next, the node generates and broadcasts an RREQ packet to its neighbors. If a node w receives an RREQ from the node and knows routes to the multicast group, it replies with an RREP to the node to establish alternative routes. If node w receives an RREQ from the node and does not know routes to the multicast group, it continues to find alternative routes to the multicast group by the same approach as the node. Thus, the maintenance process is local and only establishes some alternative links to repair the broken EQM tree.
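The RREQ filtering in Step 3.1 can be sketched as follows in Python. Two points are assumptions rather than statements from the protocol description: an RREQ is discarded when it does not improve on the stored route cost (the natural reading given that $RC_{rt}$ is initialized to +∞), and an accepted RREQ overwrites the stored route cost. The class and field names are hypothetical and only mirror the `next_visit`, `link_cost`, and `route_cost` fields of the packet.

```python
import math
from dataclasses import dataclass
from typing import Dict, Hashable, Optional, Set

Node = Hashable


@dataclass
class RouteEntry:
    route_cost: float = math.inf          # RC_rt, initialized to +infinity
    previous_node: Optional[Node] = None  # sender of the accepted RREQ


@dataclass
class Rreq:
    sender: Node
    next_visit: Set[Node]         # NV_rreq
    link_cost: Dict[Node, float]  # CL_rreq: cost(sender, w) for each next visit node w
    route_cost: float             # RC_rreq: cost accumulated up to the sender


def accept_rreq(node: Node, rreq: Rreq, entry: RouteEntry) -> bool:
    """Return True and update the route entry if the RREQ passes the Step 3.1 checks."""
    if node not in rreq.next_visit:
        return False  # the node was not selected as a best neighbor
    new_cost = rreq.route_cost + rreq.link_cost[node]
    if new_cost >= entry.route_cost:
        return False  # assumed rule: no improvement over the stored route
    entry.route_cost = new_cost        # assumed bookkeeping on acceptance
    entry.previous_node = rreq.sender  # Step 3.2: record the previous node
    return True


# Node 3 is a selected next visit node; node 9 is not.
entry3 = RouteEntry()
first = accept_rreq(3, Rreq("src", {3}, {3: 0.4}, 0.0), entry3)
worse = accept_rreq(3, Rreq("x", {3}, {3: 0.3}, 0.2), entry3)
not_selected = accept_rreq(9, Rreq("src", {3}, {3: 0.4}, 0.0), RouteEntry())
```

Here node 3 accepts the first RREQ (cost 0.4), rejects the later, costlier one (cost 0.5), and node 9 drops the RREQ because it is not in the next visit list.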

VII. E2E QUEUING DELAY ANALYSIS
In this section, we present an E2E queuing delay analysis for the established multicast routing trees and compare it with the E2E queuing delay and routing delay obtained in simulation.

A. E2E QUEUING DELAY ANALYSIS 1 IN RANDOM WAYPOINT MOBILITY MODEL
We present the analysis of the EQD-MRT under the RWP model. As shown in Fig. 3, nodes 1 to 11 move according to the RWP model, which can be described as follows: each node begins by pausing for a number of seconds. Next, the node selects a random direction (angle) in $(0, 2\pi)$ and a random speed in $(0, v_{max})$ and moves for a number of seconds. Then, the node pauses again for a number of seconds before selecting another random direction and speed. This process is repeated throughout the simulation. We assume that the network includes N mobile nodes which are deployed in a square $A = [0, 1]^2$ with area $S(A) = 1\,\mathrm{km}^2$, and that nodes move according to the RWP model with maximum speed $v_{max}$. We have
• The average distance between two nodes [27] is calculated as the expected distance between two independent points chosen uniformly at random in A, which is $L_A = 0.521405$.
• The average number of nodes in a region $B \subset A$ is
• The average direction deviation between any two nodes can be calculated as the expected distance between two independent points chosen uniformly at random in $[0, 2\pi]$, which is $\alpha = 2\pi/3$.
• Let $v_v$ and $v_w$ be the speeds of nodes v and w, respectively. The distance deviation (DD) between v and w in a time interval $\Delta t = t_{i+1} - t_i$ can be calculated as where $D_{t_i}(v, w)$ and $D_{t_{i+1}}(v, w)$ are the distances between node v and node w at times $t_i$ and $t_{i+1}$, respectively.
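The two constants quoted above, $L_A = 0.521405$ and $\alpha = 2\pi/3$, can be checked numerically. The following Python sketch is a plain Monte Carlo estimate and is not part of the analysis itself; the function names, the sample count, and the seed are illustrative choices.

```python
import math
import random


def mean_distance_unit_square(samples, rng):
    """Estimate the expected distance between two uniform points in [0, 1]^2."""
    total = 0.0
    for _ in range(samples):
        x1, y1 = rng.random(), rng.random()
        x2, y2 = rng.random(), rng.random()
        total += math.hypot(x1 - x2, y1 - y2)
    return total / samples


def mean_direction_deviation(samples, rng):
    """Estimate the expected |a - b| for two uniform angles in [0, 2*pi]."""
    total = 0.0
    for _ in range(samples):
        a = rng.uniform(0.0, 2.0 * math.pi)
        b = rng.uniform(0.0, 2.0 * math.pi)
        total += abs(a - b)
    return total / samples


rng = random.Random(0)  # fixed seed so the run is repeatable
l_a = mean_distance_unit_square(200_000, rng)
alpha = mean_direction_deviation(200_000, rng)
print(f"L_A   estimate: {l_a:.4f}  (quoted: 0.521405)")
print(f"alpha estimate: {alpha:.4f}  (2*pi/3 = {2 * math.pi / 3:.4f})")
```

With this many samples both estimates should agree with the quoted values to about two decimal places; the remaining difference is sampling noise.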
Lemma 1. The average distance deviation (DD) between two nodes in a time interval $\Delta t$ can be expressed as where $A = 0.75$, $B = 0.7821075$, $C = 0.271863$, $D = 0.521405$, $X = v_{max}\Delta t$, and $v_{max}$ is the maximum speed of each node.
Proof. Considering two nodes v and w with is the position of node v at time $t_i$. Without loss of generality, we can assume that x

where $X = (v_{max}, \Delta t)$, $v_{max}$ is the maximum speed of each node, R is the transmission range of each node, and $Nb_v$ is the average number of node v's neighbors, which is given by (26).
Proof. Lemma 2 can be easily proved based on Lemma 1.

Lemma 3. The average number of packets in a node is given by (27), where $\lambda$ is the arrival rate of the queuing delay model, N is the average number of packets in the system, $v_{max}$ is the maximum speed of each node, $\Delta t$ is the maximum lifetime of each packet, and $Nb_v$ is the average number of neighbors of a node, which is given by (26).
Proof. Equation (27) can be explained as follows:
• The first term on the right-hand side of (27) represents the average number of packets in the system of the queuing delay model.
• The second term is the average number of packets that cannot be sent to receiver nodes which have moved out of the transmission range of node v, i.e., these packets remain in the queue until their lifetime expires.
Thus, the lemma is proven.
Consider a tree $T = \{P_1 = P(src, dst_1), \ldots, P_M = P(src, dst_M)\}$ as in (8), where src is the source, $dst_i$ is a destination belonging to the multicast group D, and M is the number of destinations. The E2E queuing delay of the tree T can be represented as in (28), where $q\_delay(n_i)$ is the queuing delay of node $n_i$.

Theorem 2.
We assume that the maximum lifetime of a packet is $\Delta t$. When a new routing packet arrives at a node at a certain time, the average time this packet spends in the node is given by (29), where $\mu$ is the service rate of the queuing delay model, $v_{max}$ is the maximum speed of each node, and $N_{rwp}$ is the average number of packets in a node, which is given by (27). As a consequence, the E2E queuing delay of a tree T can be calculated as where $n_{hop}$ is the average number of hops of the routes of the tree T.
Proof. Theorem 2 can be proved by using the results of Lemmas 1, 2, 3 and (28).
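To make the structure of Theorem 2 concrete (a per-node average time multiplied by the average number of hops), the following Python sketch uses a textbook M/M/1 queue as a stand-in. This is an assumption made purely for illustration: it omits the mobility-dependent term of (27), so it is not the expression derived in this paper, and all function names and parameter values are hypothetical.

```python
def mm1_mean_packets(arrival_rate, service_rate):
    """Mean number of packets in a stable M/M/1 system: rho / (1 - rho)."""
    if not 0 < arrival_rate < service_rate:
        raise ValueError("an M/M/1 queue is stable only if 0 < lambda < mu")
    rho = arrival_rate / service_rate
    return rho / (1.0 - rho)


def mm1_mean_time_in_node(arrival_rate, service_rate):
    """Mean time a packet spends in the node, by Little's law W = N / lambda."""
    return mm1_mean_packets(arrival_rate, service_rate) / arrival_rate


def tree_queuing_delay(arrival_rate, service_rate, mean_hops):
    """Illustrative E2E queuing delay: per-node time times the mean hop count."""
    return mean_hops * mm1_mean_time_in_node(arrival_rate, service_rate)


# Hypothetical values: lambda = 2 packets/s, mu = 5 packets/s, 3 hops on average.
delay = tree_queuing_delay(arrival_rate=2.0, service_rate=5.0, mean_hops=3.0)
print(f"illustrative E2E queuing delay: {delay:.3f} s")
```

In the analysis above, the average number of packets in a node additionally counts packets that wait for receivers that have moved out of range, so the per-node time in (29) would be expected to exceed this mobility-free baseline.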

B. E2E QUEUING DELAY ANALYSIS 2 IN REFERENCE POINT GROUP MOBILITY MODEL
We present the analysis of the EQD-MRT under the RPGM model. As shown in Fig. 3, nodes 12 to 22 are divided into three groups and move according to the RPGM model [19], which satisfies the following characteristics:
• The network is divided into multiple adjacent regions. Each region is occupied by only a single group (in-place mobility model).
• Each group has a group leader node and multiple members.
• Each group leader moves according to the RWP model in a fixed region. Each member deviates from the group leader by some degree.
Corollary 1. Assume that the network includes N nodes in K groups which are deployed in a square A, and that each node has a fixed radio range R. The average number of nodes moving out of node v's transmission range (the number of node v's broken links) in a time interval $\Delta t$ is given by (31), where $X = (v_{max}, \Delta t)$, $v_{max}$ is the maximum speed of each node, and $Nb^{os}_v$ is the average number of outside neighbors of node v, which is calculated by (32).
Proof. Given a node v in a group G, we can consider the region of group G as a disc $D_G$ with center $v_0$ and radius $R_G = \sqrt{S(A)/(K\pi)}$, while the transmission region of node v is a disc $D_v$ with center v and radius $R_v = R$. The region $D_v \setminus (D_G \cap D_v)$ contains the nodes which are called outside neighbors of node v. The average number of outside neighbors of node v can be expressed as follows: The average distance between node v and the center $v_0$ of $D_G$ (the distance between the centers of $D_G$ and $D_v$) is $d = 2R_G/3$. Since node v is in $D_G$, the value $R_G + R_v$ is always greater than or equal to d, i.e., $R_G + R_v \geq d$. We have the following cases:
• If the region $D_v$ is a subset of the region $D_G$, i.e., $R_G - R_v > d$,
• If the region $D_G$ is a subset of the region $D_v$, i.e., $R_v - R_G > d$,
where
Moreover, node v's outside neighbors can be considered as neighbors that move according to the RWP model relative to node v. Thus, $Nb^{os}_v$ can be considered as the average number of node v's neighbors which move according to the RWP model relative to node v. Hence, based on (25), the corollary is concluded.
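The two geometric quantities used in the proof, the group radius $R_G$ and the average distance $d = 2R_G/3$ from a node to the group center, can be checked with a short Python sketch. It assumes $S(A) = 1\,\mathrm{km}^2$ as in Section VII-A and K = 4 groups as one of the settings in Section VIII-A; the function names, the sample count, and the seed are illustrative.

```python
import math
import random


def group_radius(area, groups):
    """Radius of a disc whose area equals the per-group share area / groups."""
    return math.sqrt(area / (groups * math.pi))


def mean_distance_to_center(radius, samples, rng):
    """Estimate the mean distance from a uniform point in a disc to its center.

    Points are drawn by rejection sampling from the bounding square.
    """
    total = 0.0
    accepted = 0
    while accepted < samples:
        x = rng.uniform(-radius, radius)
        y = rng.uniform(-radius, radius)
        dist = math.hypot(x, y)
        if dist <= radius:  # keep only points that fall inside the disc
            total += dist
            accepted += 1
    return total / samples


rng = random.Random(1)  # fixed seed so the run is repeatable
r_g = group_radius(area=1.0, groups=4)  # in km
d_est = mean_distance_to_center(r_g, 100_000, rng)
print(f"R_G = {r_g:.4f} km")
print(f"d estimate = {d_est:.4f} km, 2*R_G/3 = {2 * r_g / 3:.4f} km")
```

Calling `group_radius(1.0, 9)` gives the radius for the 9-group setting; a larger K shrinks $R_G$ and therefore d.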

Corollary 2. Using the assumptions as in Lemma 3 and Theorem 2, we have
• The average number of packets in a node is calculated the same as in (27), i.e.,

VIII. PERFORMANCE EVALUATION

A. ENVIRONMENTS FOR PERFORMANCE EVALUATION
In this section, we present the environments and parameters for the performance evaluation, as shown in Table 2. The DQMR protocol is implemented under the RWP model in VII-A and the RPGM model in VII-B. In the RWP model, we set the pausing time to 3 seconds and the moving time to 5 seconds. In the RPGM model, we set the number of groups to 4 or 9.

B. PERFORMANCE METRICS
To evaluate the performance of the DQMR protocol, the following metrics are considered:
• The routing delay is defined as the average time to establish a multicast tree per session.
• The control overhead is defined as the average number of control packets needed to establish a multicast tree per session per node.
• The PDR is defined as the average ratio of the number of data packets delivered to the multicast group to the number of data packets supposed to be delivered to the destinations per session.
• The E2E queuing delay is defined as the average analytical E2E queuing delay of multicast routing trees in (28) per session.
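The first three metrics can be computed from per-session counters as in the following Python sketch. The record layout `SessionStats`, the function `summarize`, and the sample numbers are hypothetical and do not come from the simulation setup in Table 2; the PDR is taken here as the mean of the per-session delivery ratios, which is one reading of the definition above.

```python
from dataclasses import dataclass


@dataclass
class SessionStats:
    """Counters collected for one multicast session (hypothetical layout)."""
    setup_time_s: float      # time needed to establish the multicast tree
    control_packets: int     # control packets sent while establishing the tree
    delivered: int           # data packets delivered to the multicast group
    expected: int            # data packets supposed to be delivered


def summarize(sessions, num_nodes):
    """Return (routing delay, control overhead, PDR) averaged over sessions."""
    n = len(sessions)
    routing_delay = sum(s.setup_time_s for s in sessions) / n
    control_overhead = sum(s.control_packets for s in sessions) / (n * num_nodes)
    pdr = sum(s.delivered / s.expected for s in sessions) / n
    return routing_delay, control_overhead, pdr


# Two made-up sessions in a 20-node network.
sessions = [
    SessionStats(setup_time_s=0.2, control_packets=100, delivered=90, expected=100),
    SessionStats(setup_time_s=0.4, control_packets=140, delivered=80, expected=100),
]
routing_delay, control_overhead, pdr = summarize(sessions, num_nodes=20)
print(f"routing delay: {routing_delay:.2f} s")
print(f"control overhead: {control_overhead:.1f} packets/session/node")
print(f"PDR: {pdr:.2%}")
```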

C. THE CONVERGENCE PERFORMANCE OF THE DQN-MEC MODEL AND THE GT-CTA MODEL
The convergence performance of the DQN-MEC model is shown in Fig. 5(a). The DQN-MEC model converges quickly after 1,000 epochs, which shows that it can attain the $Q^*$-values for the routing process during training. Moreover, Fig. 5

D. NUMERICAL RESULTS FOR THE RWP MODEL
We present the numerical results of the DQMR protocol under the RWP model, obtained by simulation. In Fig. 6, we show the routing delay as a function of node speed for the RWP model. As can be observed, the routing delay of the DQMR protocol is lower than that of the MAODV-based one at most node speeds. The reason is that, instead of flooding the RREQ packets as in the MAODV-based protocol, the DQMR protocol only multicasts RREQs to the predicted best neighbors based on the DQN-MEC model, thus reducing the routing delay. In addition, the DQN-MEC and GT-CTA models help the DQMR protocol to obtain EQM trees with high stability and high reliability, which also reduces the number of re-routing processes and the routing delay.
Fig. 7 presents the control overhead as a function of node speed for the RWP model. As can be observed, the control overhead increases gradually with the maximum node speed, and the control overhead of the DQMR protocol is lower than that of the MAODV-based one. The reason is that the DQMR protocol multicasts RREQs only to the predicted best neighbors instead of using conventional flooding. Moreover, the DQMR protocol can form EQM trees with high stability and high reliability based on the DQN-MEC and GT-CTA models. Hence, the control overhead of the DQMR protocol can be effectively reduced.
Fig. 8 shows the PDR of the protocols with 3 PUs as a function of node speed for the RWP model. As can be observed, at the maximum speed of 80 km/h, the DQMR protocol achieves a PDR of about 91%, while the MAODV-based protocol achieves only around 84%. The reason is that the DQMR protocol provides EQM trees with high stability and optimal channel-time slot strategies that help data reach the destination faster and more reliably than with the MAODV-based protocol.
In Fig. 9, we show the scalability of the DQMR protocol by plotting the PDR as a function of multicast group size (number of destinations) for the RWP model.
As can be observed, the PDR is almost constant and is not affected by the number of destinations. The reason is that the DQMR protocol employs the DQN-MEC and GT-CTA models to create the underlying tree-based structure, which improves the stability and scalability of the DQMR protocol under different multicast group sizes.
In Fig. 10, we plot the number of time slots allocated for packet transmission as a function of the number of PUs for the RWP model. When the number of PUs increases, the system requires more time slots for data packet transmission to avoid interfering with the licensed channels of the PUs. It is observed that the protocols without the GT-CTA model consume more time slots for packet transmission than the ones with the GT-CTA model. The reason is that the GT-CTA model helps the DQMR protocol to form EQM trees with the minimum number of time slots.

E. NUMERICAL RESULTS FOR THE RPGM MODEL
We present the numerical results of the DQMR protocol under the RPGM model, obtained by simulation. In Fig. 11, we show the routing delay as a function of node speed for the RPGM model. As can be observed, the routing delay of the DQMR protocol is lower than that of the MAODV-based one at most node speeds. The reason is that, based on the DQN-MEC and GT-CTA models, the DQMR protocol, which only multicasts RREQs to the predicted best neighbors, can obtain EQM trees with high stability and reliability. Thus, it can reduce the number of re-routing processes and the routing delay. Besides, based on the simulation parameters in Table 2, the EQD-MRT can be calculated by Corollary 1 to show that the EQD-MRT and routing delay of the RPGM model with 9 groups are smaller than those of its counterpart with 4 groups.
Fig. 12 presents the control overhead as a function of node speed for the RPGM model. It can be observed that the control overhead of the DQMR protocol is lower than that of the MAODV-based one. With the deployment of the DQN-MEC and GT-CTA models, the DQMR protocol multicasts RREQs only to the predicted best neighbors and establishes EQM trees with high stability and high reliability. Moreover, based on Eq. (31), the average number of a node's broken links in the RPGM model with 9 groups is smaller than in its counterpart with 4 groups. This leads to a smaller control overhead for the RPGM model with 9 groups compared to its counterpart with 4 groups.
Fig. 13 shows the PDR of the protocols with 3 PUs as a function of node speed for the RPGM model. At the maximum speed of 80 km/h with the RPGM (9 groups) mobility model, the DQMR protocol achieves a PDR of about 95%, while the MAODV-based one achieves only about 87%. The DQMR protocol can establish highly stable EQM trees with optimal channel-time slot strategies that help data packets reach the destination faster and more reliably than with the MAODV-based protocol.
Furthermore, the PDR of all protocols under the RPGM model with 9 groups is also higher than that with 4 groups, because the number of broken links per node is smaller when a larger number of groups is deployed, as in (31).
In Fig. 14, we show the scalability of the DQMR protocol by plotting the PDR as a function of multicast group size for the RPGM model. As can be observed, the PDR is almost constant and is not affected by the number of destinations. The reason is that the DQMR protocol applies the DQN-MEC and GT-CTA models to obtain EQM trees that help it achieve stability and scalability under different multicast group sizes.
In Fig. 15, we consider the number of time slots allocated for packet transmission as a function of the number of PUs for the RPGM model. As the number of PUs increases, the system requires more time slots for packet transmission to avoid interfering with the licensed channels of the PUs. It is shown that the protocols without the GT-CTA model consume more time slots for data transmission than the ones with the GT-CTA model. This shows the benefit of the game theory approach designed in Section V, which helps to improve the resource utilization of the DQMR protocol.

F. ANALYSIS RESULTS OF DELAY: EQD-MRT
We present the delay analysis results for the E2E queuing delay of a multicast routing tree (EQD-MRT) and compare them with the simulation results. Since the routing delay depends on many factors, such as different kinds of delay, the mobility model, and the network topology, it cannot be analyzed accurately. Thus, we analyze the EQD-MRT instead of the routing delay. The analysis gives an approximation that follows the same pattern as the simulation result of the routing delay, which confirms the correctness of the developed analysis. The analytical result of the EQD-MRT also has the same pattern as the simulation result of the routing delay, so it can well estimate the tendency and behavior of the EQD-MRT in terms of node speed.
The small gap between the analytical results and the simulation results in Figs. 16 and 17 is due to the fact that the analysis is based on the average time a packet spends in a node, as shown in (29) and (38). On the other hand, the cost in (3) includes the queue size ratio parameter, and the simulation results rely on the DQMR protocol to find EQM trees with high stability and high reliability. Thus, the simulated routing delay is smaller than the analytical EQD-MRT.

IX. CONCLUSIONS
In this paper, we proposed a DQMR protocol assisted by game-based channel-time slot allocation to establish EQM trees in CR-MANETs. Particularly, the DQMR protocol used the DQN-MEC model to establish shortest-path multicast trees with minimum E2E cost subject to QoS constraints. Besides, the DQMR protocol also applied the GT-CTA model to the obtained tree to minimize the number of time slots, prevent interference links and avoid the regions of primary users. Moreover, the DQMR protocol was also guaranteed to have high stability, low routing delay, low control overhead and high PDR. Furthermore, exact closed-form expressions for the EQD-MRT were derived under the RWP and RPGM models for comparison with the routing delay in simulation. The evaluation results showed that the DQMR protocol outperformed the MAODV-based one in terms of control overhead, PDR, and routing delay, making it an efficient protocol for CR-MANETs. In future work, we will propose a multicast routing protocol with deep reinforcement learning and different mobility models to address the multiple-source problem, which promises to provide an ultra-reliable and low-latency routing protocol in highly dynamic environments for 5G and future CR-MANETs.