Reinforcement Learning-Based Routing Protocol for Underwater Wireless Sensor Networks: A Comparative Survey

Underwater wireless sensor networks (UWSNs) have emerged as a promising networking technology owing to their various underwater applications. Many applications require sensed data to be routed to a centralized location. However, routing in underwater environments presents several infrastructure-related challenges, including high energy consumption, narrow bandwidths, and longer propagation delays than in other sensor networks. Efficient routing protocols play a vital role in this regard. Recently, reinforcement learning (RL)-based routing algorithms have been investigated by researchers seeking to exploit the trial-and-error learning procedure of RL. RL algorithms are capable of operating in underwater environments without prior knowledge of the infrastructure. This paper discusses all RL-based routing protocols proposed for UWSNs. Their advantages, disadvantages, and suitable application areas are also mentioned. The protocols are compared in terms of their key ideas, RL designs, optimization criteria, and performance-evaluation techniques. Moreover, research challenges and outstanding research issues are highlighted, to indicate future research directions.


I. INTRODUCTION
Underwater wireless sensor networks (UWSNs) represent an emerging field in wireless communication, owing to their significant advantages in various underwater applications. A typical UWSN consists of several self-configurable sensor nodes anchored to the ocean floor; these are interconnected by automatically adaptive wireless links featuring one or more underwater gateways [1]. These sensor nodes are used to perform various tasks, including pollution monitoring, offshore oil-drill monitoring, disaster prevention, and geological event monitoring [2]. Moreover, different types of data (e.g., temperature, pressure, and chemical compositions for water-based-disaster warning, underwater military communications, and surveillance systems) can also be collected using UWSNs. However, these networks are considered a more challenging wireless communication medium than wired or wireless terrestrial ones. The marine environment exhibits several distinctive features that differ from those of the atmospheric environment in which traditional communication is performed. A simple UWSN architecture featuring sensor nodes, sink nodes, and a base station is shown in Figure 1. The sensor nodes transmit data to the sink nodes using other sensor nodes as relays, according to different parameters. Then, the sink nodes send these data to the base station on the ocean surface. Sensor nodes are deployed at different depths (with respect to the surface) and at different distances from each other underwater. Some nodes are anchored to the ocean floor, whilst others float in the water at various depths. The node density may vary according to the necessity and application of nodes in different locations. (The associate editor coordinating the review of this manuscript and approving it for publication was Hongwei Du.)
Four types of underwater communications for UWSNs are commonly employed in different research works: radio frequency, acoustic, optical, and magnetic induction. The cost of the coils used in magnetic induction is relatively low, which makes it a strong candidate for large-scale deployment in UWSNs. Moreover, this mode is not affected by multipath propagation or fading and is robust against acoustic noise [15]. However, the performance of magnetic induction systems in UWSNs is still being researched, especially with regard to the characterization of broadband and complex underwater magnetic induction channels in shallow and lossy water [16]. Practical applications in both shallow and deep water show that fully connected multi-coil networks can be implemented using bandwidths of the order of tens of kHz for small and large coverage areas [17]. The routing protocol in any type of wireless sensor network (WSN) plays a major role when designing schemes to transmit data from the source to the destination nodes. However, routing in UWSNs is of particular importance. The major challenges include limited bandwidth capacity, multipath fading, propagation delay, high bit-error rates, and temporary loss of connectivity. Designing efficient routing protocols in UWSNs is crucial for the quick and secure transmission of collected data to the sink node on the ocean surface. Numerous UWSN routing algorithms have been reported in the literature. These protocols are proposed to improve efficiency with respect to end-to-end delay, node mobility, network throughput, and energy consumption.
Reinforcement learning (RL) [18] is a subfield of machine learning (ML) that utilizes an agent to take decisions in an unknown environment. The agent in RL algorithms follows a policy based upon immediate rewards for actions taken.
Along with other ML techniques, RL has been widely used to design routing protocols for different WSNs [19], [20]. RL algorithms can be employed to improve routing performance in UWSNs, owing to the constraints and limitations of the underwater environment. Different parameters of UWSN routing (e.g., energy efficiency, latency, network lifetime, link quality, and packet delivery ratio) can be optimized by implementing an RL algorithm. Because RL algorithms learn through experience, they have the potential to improve the routing process under various objectives.
Considering the advantages of RL, many researchers have proposed RL-based routing protocols for UWSNs. However, more research is required to successfully integrate the RL concept into the UWSN routing mechanism. In this regard, a comprehensive review paper presenting the existing RL-based routing protocols can help researchers seeking to design an RL-based UWSN routing protocol. In addition to filling the research gap in the literature, a survey on RL-based routing in UWSNs is required to encourage researchers to increase their focus on intelligent UWSN routing protocols.
Numerous survey works in the literature have compared different proposed routing protocols for UWSNs [21]-[24]. They divided and categorized the existing routing protocols according to different objectives. However, none of the existing surveys focused solely on RL-based routing protocols for UWSNs, despite numerous studies on this topic. To fill this gap in the literature and provide a direction for future research, it is necessary to aggregate these disparate works. Thus, a comparative study is necessary for RL-based UWSN routing. The main contributions of this study are as follows:
• A brief overview of RL is presented, to provide a fundamental understanding of the technique.
• The existing RL-based UWSN routing protocols are investigated and summarized along with their advantages, disadvantages, and suitable applications in UWSN environments.
• A comparative study of all the reviewed protocols is presented. In this regard, the key ideas of all protocols are compared in a tabular format. Then, a comparison of the applied RL techniques is provided.
• The optimization parameters adopted in all protocols are compared. The performance evaluation techniques are also compared in terms of the simulation environment, techniques, and performance comparisons of all the reviewed schemes.
• The key research challenges are highlighted, along with promising research directions toward making RL-based UWSN routing protocols more efficient.
The remainder of this paper is organized as follows. Section II describes existing related surveys in the literature, to highlight the necessity of the present survey. An overview of the RL technique is provided in Section III. All existing RL-based routing protocols for UWSNs are discussed in Section IV. In Section V, comparisons of the reviewed protocols are discussed according to their key ideas, optimization criteria, RL features, and performance measurement techniques. Challenges and open research issues are discussed in Section VI. Finally, Section VII concludes the paper.

II. EXISTING SURVEYS
This section describes the existing surveys regarding routing protocols for UWSNs and RL-based routing protocols for other WSN environments. The limitations of the existing works and the contributions of our study are also discussed. The existing surveys relating to UWSN and RL-based routing protocols are compared in Table 2.
Several surveys have been performed regarding UWSN routing protocols, focusing on different issues (e.g., energy-efficient routing, node mobility, delay-tolerant routing, and network-lifetime-aware routing). In [24], routing issues in UWSNs were discussed in terms of delivery ratio, end-to-end delay, energy efficiency, delay-tolerant applications, mobility, and reliable routing. All routing protocols proposed thus far were also described. Cho et al. studied routing protocols considering the delay/disruption-tolerance characteristics of UWSNs [25]. In this regard, they categorized the routing protocols into scheduled, opportunistic, and predicted contact schemes.
Han et al. classified UWSN routing protocols into sender-based and receiver-based protocols [21]. The protocols were then compared in terms of energy efficiency, latency, load balancing, dynamic robustness, communication overhead, and time complexity. In [26], UWSN routing protocols for acoustic communication were studied. The protocols were categorized using cross-layer and non-cross-layer design methods. An intelligent algorithm for UWSN routing was also discussed. However, none of the RL-based routing protocols for UWSNs were mentioned. In contrast, the authors in [27] discussed several RL-based routing protocols whilst studying the routing and medium access control (MAC) protocols for UWSNs. Their main aim was to quantitatively compare the existing MAC and routing protocols in terms of energy efficiency and reliability.
In [28], the routing protocols were studied by considering node mobility in UWSNs. In this regard, the protocols were classified into vector-based, cluster-based, autonomous underwater vehicle (AUV)-based, depth-based, and path-based routing protocols. Both qualitative and quantitative comparisons between existing protocols were performed. Khalid et al. discussed the routing issues in UWSNs [23] whilst classifying the protocols into localization-based and localization-free routing protocols.
All protocols were described and compared in terms of the employed technique, as well as other important performance metrics. The authors in [29] conducted a simulation-based survey on UWSN routing protocols. Four routing protocols, namely hop-by-hop dynamic address-based routing, depth-based routing, energy-aware opportunistic routing, and energy-efficient depth-based routing, were implemented. The performances were compared in terms of the numbers of sent packets, alive nodes, and dead nodes.
Considering acoustic communication in UWSNs, the routing protocols were reviewed in [22]. All protocols were categorized into localization-based and localization-free routing protocols. Moreover, each protocol was summarized with mention of its strengths and weaknesses. A survey on different aspects of UWSNs was provided in [30]. The requirements of UWSNs (e.g., longevity, accessibility, complexity, security, and environmental sustainability) were highlighted. Moreover, the routing protocols were discussed alongside other issues in the UWSN. The authors in [31] discussed energy-efficient routing protocols for UWSNs. The protocols were categorized into depth-based, cluster-based, cooperation-reliability-based, RL-based, and bio-inspired routing protocols. However, only three RL-based UWSN routing protocols were mentioned. Unlike other surveys, the authors in [1] discussed UWSNs, focusing on both acoustic and magneto-inductive communication. They discussed the characteristics and application properties of each communication channel when designing UWSN routing protocols.
Considering the advantages of RL algorithms in routing protocol design, RL-based routing has been extensively studied in the literature. In [32], RL-based routing protocols for different types of communication networks were reviewed. The network areas considered were wired networks, wireless networks, wireless mesh networks, cooperative communication wireless networks, optical networks, ad-hoc networks, WSNs, vehicular ad hoc networks (VANETs), delay-tolerant networks (DTNs), social DTNs, flying ad hoc networks, cognitive radio networks, named-data networking, peer-to-peer networks, and software-defined networks. Several related surveys were also conducted for mobile ad hoc networks (MANETs) [33], cognitive radio ad-hoc networks (CRAHNs) [35], and VANETs [34]. A comprehensive survey on RL-based routing protocols in MANETs is provided along with future research directions in [33]. The authors in [34] extensively surveyed RL-based routing protocols for VANETs, by discussing their working process, advantages, limitations, and suitable application areas. Furthermore, the protocols were compared according to their main features, characteristics, evaluation methods, optimization criteria, and RL implementation. In [35], the RL-based efficient spectrum-aware routing for CRAHN was extensively discussed. Moreover, a multi-objective spectrum-aware routing protocol using RL was proposed to increase the probability of successful transmission with a minimum number of hops.
However, from the above-mentioned surveys in the literature, it is clear that no survey solely discusses RL-based routing protocols for UWSNs. The suitability of RL algorithms for solving optimization problems related to UWSN routing necessitates a survey that discusses all the studies in the literature. This will provide future researchers with an idea of the work already conducted, as well as potential research challenges and directions.

III. REINFORCEMENT LEARNING OVERVIEW
This section provides a brief overview of RL, by discussing the designs and classification of RL algorithms. RL is a sub-branch of ML. In RL, an agent learns by interacting with the environment and selects actions based upon that learning. The learning process is similar to learning in the real world. The concept of RL seems straightforward, because it reflects the real world; however, implementing an RL algorithm can be a complex and challenging task. Such algorithms manage learning through interactions and feedback mechanisms; that is, learning to solve a problem using a trial-and-error approach.

A. MODELING OF RL ALGORITHM
The agent observes the state of the environment during each decision step, and it selects actions randomly or by following a policy. Next, it receives an immediate reward based upon the selected action and goes on to the next state. The reward function is designed to provide feedback to the learning algorithm, reflecting the primary objective of the task. The principal idea of RL is illustrated in Figure 2. There, the agent observes state s_t from the environment. In that particular state, the agent chooses action a_t by exploration or exploitation. According to the taken action, the agent receives a reward r_t and goes to the next state. To solve a problem with the help of RL, the problem should be designed as a Markov decision process (MDP) [36]. Therefore, the MDP can be regarded as the theoretical basis of RL. The mathematical framework of an MDP consists of a tuple <S, A, P, R>, where S is a finite set of environment states, A is a set of actions available to the agent, P is the transition probability from the current state to the next state via a particular action, and R is the reward received after transitioning to the next state with the taken action. The transition probability can be written as

P_a(s, s') = Pr(s_{t+1} = s' | s_t = s, a_t = a), (1)

where P_a is the probability of transitioning from state s at time t to state s' at time t + 1 by taking action a. After the transition from s to s', the agent receives an immediate reward, denoted by R_a(s, s'). The reward represents an evaluation of the quality of an action in a particular state. The goal of an RL agent is to identify a policy π that maximizes the cumulative rewards; typically, this is the expected discounted sum of rewards. The policy is a function that maps a given state to the probability of selecting each possible action from that state. Thus, following a policy π, the probability of taking action a in state s at time t can be denoted by π(a|s).
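The <S, A, P, R> tuple above can be made concrete with a small sketch; the states, actions, probabilities, and rewards below are purely illustrative (sensor nodes as states, forwarding choices as actions) and do not come from any reviewed protocol:

```python
import random

# A toy MDP <S, A, P, R>: nodes are states, "forward to node X" are actions.
S = ["n0", "n1", "sink"]
A = {"n0": ["n1", "sink"], "n1": ["sink"], "sink": []}

# P[(s, a)] maps each possible next state s' to P_a(s, s'); each row sums to 1.
P = {
    ("n0", "n1"):   {"n1": 0.9, "n0": 0.1},    # a forward may fail, staying in n0
    ("n0", "sink"): {"sink": 0.6, "n0": 0.4},  # the direct link is less reliable
    ("n1", "sink"): {"sink": 0.95, "n1": 0.05},
}

# R[(s, s')] is the immediate reward R_a(s, s') for the realized transition.
R = {("n0", "n1"): -1.0, ("n0", "n0"): -2.0, ("n0", "sink"): 10.0,
     ("n1", "sink"): 10.0, ("n1", "n1"): -2.0}

def step(s, a):
    """Sample s' from P_a(s, .), as in Equation (1), and return (s', reward)."""
    dist = P[(s, a)]
    s_next = random.choices(list(dist), weights=list(dist.values()))[0]
    return s_next, R[(s, s_next)]
```

An RL agent interacts only through `step`; it never reads `P` or `R` directly, which is exactly the model-free setting discussed below.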
The function that estimates how desirable it is for an agent to be in a given state, or how desirable it is to select a particular action in a given state, is called a value function. The value function can be a function of state or of state-action pairs.
The state value function V_π(s) determines the value of a state for an agent following policy π. The value of a state s is the expected sum of discounted rewards obtained when starting from state s at time t and following policy π. The value function can be written as

V_π(s) = E_π[R | s_t = s], (2)

where R is a random variable defined as the sum of the future discounted rewards. It can be written as

R = Σ_{k=0}^{∞} γ^k r_{t+k}, (3)

where r_t is the reward at time t, and γ is a discount factor such that 0 ≤ γ ≤ 1. The value of γ determines the importance of future rewards in the current state. Future rewards are discounted to place more emphasis on the immediate reward. The policy that optimizes the expected cumulative reward is referred to as the optimal policy and is denoted as π*. An RL algorithm converges when it identifies the optimal policy from all available policies for a given state [18].

RL algorithms can be initially classified into model-based RL [37] and model-free RL [38]. Model-based RL algorithms construct an internal model describing the transitions and immediate outcomes according to experience. Then, the optimal policy for selecting an action is chosen using the learned model. However, model-free RL algorithms do not incorporate any learned models; learning is performed by either approximating value functions or following a policy through experiences. Therefore, RL algorithms can be designed using policy iteration or value iteration [34]. Examples of policy iteration-based RL include Monte Carlo [39] and temporal difference methods [40].
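The sum of future discounted rewards described above is straightforward to compute for a finite reward sequence; a minimal illustration with a hypothetical sequence:

```python
def discounted_return(rewards, gamma):
    """Sum of future discounted rewards: gamma^0*r_t + gamma^1*r_{t+1} + ...
    Its expectation under a policy pi is the state value V_pi(s)."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# With gamma = 0.5, the sequence [1, 1, 1] is worth 1 + 0.5 + 0.25 = 1.75;
# smaller gamma values place more emphasis on the immediate reward.
```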
In value-based RL algorithms, the agent attempts to maximize the value function. As mentioned earlier, the value functions in RL algorithms are of two types: state-value and action-value. The value function given in Equation 2 is the state-value function, which estimates the expected cumulative reward when starting in state s and following policy π thereafter. The action-value function, denoted as Q_π(s, a), determines the expected reward when action a is taken in a given state s following policy π. It can be defined as

Q_π(s, a) = E_π[R | s_t = s, a_t = a], (4)

where R is the discounted return defined above. The action-value function Q_π is conventionally called the Q-function, and the output from this function for any given state-action pair is called the Q-value.
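A standard model-free way to estimate Q-values is the tabular Q-learning update, which most of the protocols reviewed later build upon. This is a generic sketch, not any particular protocol's rule; the values of alpha and gamma are arbitrary:

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, next_actions, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max((Q[(s_next, a2)] for a2 in next_actions), default=0.0)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]

Q = defaultdict(float)                         # unseen pairs start at 0
q_update(Q, "n0", "n1", -1.0, "n1", ["sink"])  # Q("n0","n1") becomes -0.5
```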
However, RL algorithms suffer from an exploration-exploitation trade-off when taking an action. On the one hand, the algorithms should not stick to the actions with high rewards, because they might thereby become trapped in a local optimum; on the other hand, repeatedly taking different actions from a single state is also inefficient. Different methods have been proposed to solve this problem, including random action [41], greedy strategy [42], epsilon-greedy policy [43], upper confidence bound [44], explore-first [45], and Softmax action [46].
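Of these, the epsilon-greedy policy is among the most widely used; a minimal generic sketch (not any specific protocol's policy):

```python
import random

def epsilon_greedy(Q, state, actions, eps=0.1):
    """With probability eps, explore by choosing a uniform random action;
    otherwise exploit the action with the highest current Q-value."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```

Annealing eps from a high value toward zero is a common way to explore early and exploit once the Q-values have converged.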

B. ADVANTAGES OF USING RL IN UWSN ROUTING PROTOCOLS
In recent years, RL has been applied to design protocols in different wireless sensor networks; the UWSN is one of them. The advantages of using RL for designing UWSN routing protocols are as follows:
• Routing optimization: RL algorithms can solve optimization problems in different distributed systems. Routing problem optimization can be regarded as a decision-making task. Therefore, RL can represent a practical approach for solving routing problems. In particular, solving routing problems with RL can be effective because of the reduced overheads for control packets, memory, and computation.
• Environment observability: In a UWSN, full information and knowledge of the network are unavailable. The RL algorithm can be effectively applied in such scenarios, because RL learns from the environment.
• Adapting to dynamic topology: RL-based routing learns the network topology whilst relaying packets. Hence, it can adapt to the dynamic network topology during the routing process. Moreover, RL algorithms learn iteratively, which helps reduce communication and computation overheads.

IV. RL-BASED ROUTING PROTOCOLS FOR UWSN
In this section, RL-based UWSN routing algorithms are discussed with respect to their working procedures. The advantages and disadvantages of each protocol are discussed, and suitable application areas based on the proposed schemes are highlighted. These routing protocols are categorized based on their UWSN communication medium: acoustic, optical, hybrid (acoustic-optical), and magnetic-induction. Figure 3 shows the taxonomy of the investigated routing protocols. The majority of protocols consider acoustic communication in UWSNs.
A. Q-LEARNING-BASED ENERGY-EFFICIENT LIFETIME-AWARE ADAPTIVE ROUTING (QELAR)
Hu et al. proposed QELAR [47], a distributed UWSN routing protocol that applies Q-learning to balance the workload between sensor nodes and thereby increase network lifetime and reduce network overhead. QELAR is one of the oldest UWSN routing protocols among those reviewed in this survey. In QELAR, when a node receives or overhears a packet, it extracts information from the packet header, including the residual energy, average group energy, previous-hop node, and V-value. The V-value of the node is calculated using the Q-learning algorithm. Once the Q-values of the state-action pairs in a state s_n have been calculated for all available actions, another value function (the V-value) is calculated. The V-value of a state s_n is denoted by V(s_n) and contains the maximum Q-value achievable by any action in that state; therefore, the V-value is updated according to V(s) = max_a Q(s, a). The state of the algorithm is the node that currently holds the packet, and an action is the forwarding of the packet by a node. The reward function is designed using a cost function of the sender node's residual energy and the energy distribution among the group nodes. When choosing an action from a state, the Q-values of all actions from that state are calculated first. Then, the action with the maximum Q-value is chosen, and the V-value of the state is updated. One important feature of QELAR is the acknowledgment (ACK)-receiving mechanism, which confirms packet transmission. The sender node does not remove the packet from the buffer immediately after sending; rather, it waits until the next forwarder forwards the packet to the next-hop node. Thus, retransmission is triggered if the sender does not overhear the forwarder's transmission. The transmission failure and retransmission mechanisms are shown in Figure 4.
As the figure shows, node A waits until node B forwards the packet to node C. Upon transmission failure, node A retransmits the packet; when node B forwards the packet to node C, node A overhears this transmission and treats it as an ACK. However, the number of retransmissions is limited by a predefined value, max_trans.
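QELAR's V-value update and its overhearing-based retransmission loop can be sketched as follows; the `transmit` callback and the numeric limit are illustrative assumptions, not the paper's implementation:

```python
MAX_TRANS = 3   # plays the role of QELAR's max_trans retransmission limit

def update_v(q_values):
    """QELAR's V-value of a state: the maximum Q-value over the available
    forwarding actions, i.e., V(s) = max_a Q(s, a)."""
    return max(q_values.values())

def send_until_overheard(transmit, packet):
    """Sketch of the implicit-ACK loop: keep the packet buffered and
    retransmit until the forwarder's relay transmission is overheard, up to
    MAX_TRANS attempts. `transmit` is a hypothetical callback that returns
    True when the relay is overheard."""
    for attempt in range(1, MAX_TRANS + 1):
        if transmit(packet):
            return attempt    # overheard: treated as an ACK, packet leaves buffer
    return None               # forwarding abandoned after MAX_TRANS attempts
```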
• Advantages: As one of the earliest Q-learning-based UWSN routing protocols, QELAR formulates routing decisions as an RL problem in a form that later work could build upon. Its method of confirming transmission after sending a packet reduces the communication overhead and the number of packet drops.
• Disadvantages: The reward function is designed considering only the residual energy, whilst neglecting other important selection parameters such as the distance or depth of the neighbor nodes. A node may have more residual energy but lie at a greater distance; hence, energy consumption will increase when transmitting over longer distances.
• Application: QELAR is designed for a UWSN environment in which the source node is fixed; meanwhile, the other sensor nodes are dynamic. QELAR is unsuitable for UWSN routing across networks in which all nodes are dynamic, because the source node can be any one of the nodes.

B. MULTI-LEVEL ROUTING FOR ACOUSTIC-OPTICAL HYBRID UWSN (MURAO)
Extending their work in [47], the authors of [48] proposed another routing protocol named MURAO, designed for an acoustic-optical hybrid UWSN environment. A multilevel Q-learning method suitable for a multilevel UWSN was applied. The multilevel distributed Q-learning approach accelerates convergence. In this type of approach, the state space is divided into different groups, where each group contains one agent. This agent supervises all other lower-level agents whilst logically residing at the higher level. Thus, the number of states becomes smaller, which accelerates termination. A clustering method is applied in MURAO, in which clusters are updated based on changes in network topology. The routing process consists of several concurrent lower-layer routings among cluster members and one upper-layer routing among the cluster heads (CHs). Several gateway nodes exist in the network; these connect two clusters, because the nodes that receive broadcast messages from multiple CHs become members of multiple clusters and eventually function as gateway nodes. The inter-cluster routing process in the upper layer is realized via both the acoustic and optical channels, whereas the intra-cluster routing in the lower layer is performed using only the optical one. The CHs in the upper layer assign gateway nodes to the clusters by applying Q-learning. The gateway nodes represent the destination nodes for each intra-cluster routing assigned by the CHs, which is the terminal state in the Q-learning approach. The intra-cluster routing process is similar to that in QELAR. The Q-values and V-values are updated after each action. Routing is initiated in one of the gateway nodes and terminates when it reaches the designated gateway node. Three types of information exchange occur in the network: (1) between cluster members, (2) between CHs, and (3) between cluster members and CHs.
• Advantages: Applying the multilevel Q-learning algorithm to multiple layers of the UWSN accelerates the algorithm convergence. The number of states is reduced by applying the algorithm to different clusters; this also helps the algorithm to reach the terminal state faster. Hybrid communication exploits both acoustic and optical channels.
• Disadvantages: Applying Q-learning separately for each cluster can complicate the network. Although it reduces the number of states for each cluster, the computational costs may increase. Moreover, in a dynamic UWSN, the clustering changes according to node mobility, so the routing will also change.
• Application: MURAO is more suitable for a static UWSN environment. In such scenarios, clustering will occur only once, and routing will be more efficient.

C. Q-LEARNING BASED DELAY-AWARE ROUTING (QDAR)
Jin et al. proposed the QDAR routing algorithm with the objective of extending the network lifetime of UWSNs [49]. Q-learning was used because it can determine the globally optimal next hop rather than a greedy one. The routing decision is taken with regard to the propagation delay and residual energy. A multi-agent Q-learning technique is employed by considering each packet in the network as an agent. Because the sink node possesses information regarding the nodes, it performs a virtual experiment, utilizing the algorithm to determine a routing path by sending a virtual packet through the virtual topology. The overall routing mechanism is divided into five phases: data ready, routing decision, interest, packet forwarding, and acknowledgment. A flowchart of the routing mechanism is presented in Figure 5. Three assumptions are made: the depth information of each node is held by that node and can be embedded in the packets; nodes implement Source_initiates_Query; and the records of successful or failed communications are saved in the sink node. The source node sends a DATA_READY packet to both request communication and collect information in a reactive manner; hence, the source node must send a packet to the sink node. Only neighbor nodes whose depth is smaller than that of the previous node forward the packet. In the routing decision phase, the QDAR algorithm is applied to select the routing path. Through Q-learning, the next-hop node is selected from the neighboring nodes to optimize the residual energy and propagation delay. After the sink node makes the routing decision, it creates an INTEREST packet in the interest phase; this is sent back to the source node as an acknowledgment.
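A reward that trades off QDAR's two criteria, residual energy and propagation delay, can be sketched as follows; the normalization and the weights are assumptions for illustration, not QDAR's published cost function:

```python
def qdar_style_reward(residual_energy, initial_energy, prop_delay,
                      max_delay, w_energy=0.5, w_delay=0.5):
    """Illustrative reward: higher residual energy and lower propagation
    delay both increase the reward. Terms are normalized to [0, 1] under
    the stated (assumed) maxima."""
    energy_term = residual_energy / initial_energy        # in [0, 1]
    delay_term = 1.0 - min(prop_delay / max_delay, 1.0)   # in [0, 1]
    return w_energy * energy_term + w_delay * delay_term
```

Adjusting `w_energy` and `w_delay` shifts the protocol's emphasis between network lifetime and end-to-end delay.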
• Advantages: The algorithm balances the trade-off between network lifetime and end-to-end delay in an adaptive and distributive manner. The virtual topology concept used in this protocol reduces the cost of failed transmissions.
• Disadvantages: The packets are considered as the agent in the network, and the state is the node that holds the packet. Under an increasing number of nodes and packets in the routing path, the number of states also increases. This will increase the state space, and the algorithm may fail to converge.
• Application: This protocol is suitable for both static and dynamic underwater environments. Therefore, it can be applied to UWSNs in which the dynamic topology changes. It can function in an adaptive and distributive manner.

D. HARVESTING-AWARE DATA ROUTING (HyDRO)
Basagni et al. proposed a routing protocol (referred to as HyDRO) for an energy-harvesting UWSN, assuming all nodes to be capable of energy harvesting [50]. This protocol considers both the residual and the harvested energy in its optimization. The sender node acts as the RL agent.
The action aims to select the forwarding node and thereby the route to the sink. The algorithm considers residual energy, foreseeable harvestable energy, and link quality when choosing the route. All of these optimization criteria are considered when a sender node must select a relay node for forwarding a packet to the sink node. The reward function is designed with a penalty for packet dropping, to reduce the packet drop ratio. The sender node i always possesses information regarding the neighbor node j to be selected as a relay node. Flooding for route acquisition occurs only during the initialization period and upon returning from an all-off or temporarily malfunctioning state. The all-off state of a node occurs when the node runs out of energy. The nodes proactively update their neighbor lists according to the signals received at a given time. When node i does not receive the signal from node j at time t, it temporarily removes j from its active neighbor list. Node j notifies its neighbors just before running out of battery, by setting a field in its header. Upon transmission failure, the sender retransmits the packet for a given period of time, after which the packet is dropped.
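HyDRO's three selection criteria and its packet-drop penalty can be sketched as a reward function; the weights, the penalty magnitude, and the [0, 1] normalization are assumptions of this sketch, not the paper's exact formulation:

```python
DROP_PENALTY = -10.0   # assumed magnitude for the packet-drop penalty

def hydro_style_reward(residual, harvestable, link_quality, dropped,
                       w_res=0.4, w_harv=0.3, w_link=0.3):
    """Illustrative relay reward over HyDRO's criteria: residual energy,
    foreseeable harvestable energy, and link quality (all assumed to be
    normalized to [0, 1]). A dropped packet overrides everything with a
    large negative reward, discouraging unreliable relays."""
    if dropped:
        return DROP_PENALTY
    return w_res * residual + w_harv * harvestable + w_link * link_quality
```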
• Advantages: The penalty given for packet dropping ensures that the nodes select more reliable relays and route the packets through shorter routes.
• Disadvantages: Network lifetime is not considered in the performance comparison for this protocol; however, this is an important metric for evaluating a routing protocol.
• Application: The protocol is only applicable to an underwater environment in which energy harvesting is possible. It exhibits performance degradation without this harvested energy; thus, it cannot be applied to UWSNs without energy harvesting.
In [51], the author proposed a distributed routing protocol for UWSNs by integrating a game-theoretic approach with Q-learning (QGDR). The sensor nodes are assumed to be individual agents; they try to maximize their profit by making a cooperative routing decision that is acceptable to all other agents. The nodes learn the policy to select the optimal strategy according to the RL algorithm. The routing problem is designed as a multiplayer routing game model that extends the MDP problem. A new game model, referred to as the routing game, is developed following the assumptions for UWSNs. First, the UWSN topology is configured with the help of the configuration algorithm proposed in this study. The topology is formed using a payoff history array U[.] and a path-cost value PC, which determines the cost of the link in the source-to-sink-node route. A virtual topology is structured and can be dynamically reconfigured by changing these two values.
• Advantages: This protocol can adapt to dynamic topology changes, which is practical for UWSN scenarios. The sensor nodes can adjust their learning parameters according to these changes and make routing decisions dynamically.
• Disadvantages: This protocol does not consider network lifetime, which is a necessary parameter. For cases of route failure, no retransmission scheme is mentioned, only a penalty. This may increase the initial packet drop rate when the agent learns.
• Application: This protocol is applicable to UWSN environments involving node mobility, because it can function under dynamic changes in the environment.
In security and military applications, where a dynamic environment must be handled, this protocol can be applied to deliver information.
Karim et al. proposed QL-EEBDG in [52], by considering the void hole problem for routing in a UWSN. This problem occurs when a selected next-hop node does not have any neighbor node or does not lie within range of the sink node, which leads to increased packet dropping and energy consumption. To mitigate the void hole problem, only nodes that have a next-hop node are selected as forwarding nodes. Each node functions as a Q-learning agent, where the sender node is the source agent and the neighbor node is the receiver agent. A control packet is generated by each node and sent to its neighbor nodes. The neighbor nodes send back acknowledgment packets, from which the sender node identifies its neighboring nodes. Then, the Q-values of all neighbor nodes are calculated; the node with the highest Q-value represents the shortest distance towards the sink. Based on the distances of the nodes, three types of rewards are computed: reward_sink, for choosing the sink as the next node; reward_pos, for choosing a neighbor node; and reward_neg, for choosing neither a sink nor a neighbor node. The node with the maximum Q-value is selected as the next-hop node; if more than one node has the same Q-value, the one with the higher residual energy is selected. A circular network topology is created using a static sink node and sender nodes. The simulation also uses a mobile sink (MS) that moves clockwise. When a node must send data, it determines whether the MS is within the shortest transmission range; if so, the sender node sends the data to the MS, and otherwise it utilizes the Q-learning-based method to choose the next-hop node.
• Advantages: In this protocol, only nodes that have either a neighbor node or the sink within one-hop distance are selected as forwarding candidates. Therefore, even if a node with no further one-hop node is nearer to the sink than the source node, it will not be selected as a neighbor. This procedure helps to reduce the void hole problem, leading to fewer packet drops and lower energy consumption.
• Disadvantages: No retransmission strategy or penalty is applied in the Q-learning algorithm upon route failure. This may lead to an increased packet-drop rate. Moreover, the agent (sensor node) does not consider the end-to-end delay when choosing the next-hop node.
• Application: This routing protocol can be applied to a dynamic UWSN environment because the neighbor nodes can be selected dynamically. If the UWSN exhibits node mobility, this algorithm can be utilized to select the next-hop node.
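The three QL-EEBDG reward types and the residual-energy tie-break described above can be sketched as follows. The numeric reward values and the data structures are illustrative assumptions; only the three-way reward classification and the tie-break rule come from the text:

```python
REWARD_SINK, REWARD_POS, REWARD_NEG = 1.0, 0.5, -1.0  # illustrative values

def qleebdg_reward(node, sink_id, neighbors_of):
    """Assign QL-EEBDG's three reward types to a candidate next hop."""
    if node == sink_id:
        return REWARD_SINK        # reward_sink: the sink itself is chosen
    if neighbors_of.get(node):
        return REWARD_POS         # reward_pos: node has a next-hop neighbor
    return REWARD_NEG             # reward_neg: void node, to be avoided

def select_next_hop(q_values, residual_energy):
    """Pick the max-Q node; break ties with the higher residual energy."""
    return max(q_values, key=lambda n: (q_values[n], residual_energy[n]))
```

The negative reward for void nodes is what steers learning away from the void hole problem, while the tie-break keeps energy balanced among equally good candidates.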

G. Q-LEARNING BASED LOCALIZATION-FREE ROUTING (QLFR)
Zhou et al. [53] proposed the routing algorithm QLFR for UWSNs; their objective was to extend the network lifetime and minimize the end-to-end delay. When a node has to send a packet, it checks the Q-values of all neighboring nodes and places these nodes in a priority list sorted in decreasing order of Q-values. The nodes with a smaller depth have a higher priority. The priority list is added to the data packet sent to the neighboring nodes. The nodes in the list hold the packet, following a holding-time mechanism provided by Q-learning; the other nodes drop the packet. The holding-time mechanism is shown in Figure 6. Here, when node s wants to send a data packet, the three neighboring nodes p, q, and r will receive it. The depth of r is greater than that of the sender; therefore, it drops the data packet. For the remaining two nodes, if p has a higher Q-value than q, it is selected as the next-hop node. The Q-value is calculated according to two cost functions: a depth-based cost and an energy-based cost; the reward is designed considering these two parameters. When node s_i sends a packet to s_j following the action a_j, the reward is computed from c_e(·), the energy-based reward, and c_d(·), the depth-based reward; both lie within the range [0, 1]. Furthermore, a packet-delivery-ratio-based multipath-suppression mechanism was proposed to maintain the priority-list length; the packet delivery ratio is calculated to control the length of the priority list and reduce unnecessary transmissions.
• Advantages: The holding-time mechanism causes the nodes with a greater depth to drop the packet; this in turn reduces redundant retransmissions in the network. Moreover, the overall holding time of the packet is reduced because the node with the highest Q-value transmits without holding.
• Disadvantages: The nodes not included in the priority list drop the packet. This increases the packet drop ratio in the network; in particular, when the node density increases, more nodes are amongst the neighboring nodes of the sender node but have lower depths; hence, more nodes will drop the packet.
• Application: This protocol is suitable for UWSN environments with underwater monitoring applications and few nodes. In such applications, the source node is anchored to the ocean floor and transmits data to the nodes above it, toward the sink; thus, data are only passed from deeper nodes to shallower nodes in the underwater environment.
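QLFR's priority-list construction can be sketched as below, assuming the depth convention stated above (packets travel from deeper nodes toward shallower ones, so neighbors deeper than the sender are excluded). The dictionary layout of the neighbor table is an illustrative assumption:

```python
def qlfr_priority_list(sender_depth, neighbors):
    """Build QLFR's forwarding priority list.

    neighbors maps node_id -> (depth, q_value); this structure is
    illustrative. Neighbors deeper than the sender are excluded
    (they drop the packet); the rest are sorted by decreasing
    Q-value, so the head of the list forwards without holding.
    """
    eligible = [(n, q) for n, (d, q) in neighbors.items() if d < sender_depth]
    return [n for n, _ in sorted(eligible, key=lambda item: -item[1])]
```

In QLFR the list length is then capped by the multipath-suppression mechanism; that step is omitted here for brevity.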

H. CHANNEL-AWARE RL-BASED MULTI-PATH ADAPTIVE ROUTING (CARMA)
Valerio et al. proposed the routing protocol CARMA in [54], to select the set of relay nodes in a UWSN. Their main objective was to simultaneously optimize the route-long energy cost and maintain the network lifetime and packet delivery ratio. The size and composition of the relay set are determined dynamically at each transmission attempt. When a node sends a packet toward the sink node, it transmits the packet to all nodes in its list of relay nodes; all required information is added to the packet header. An implicit acknowledgment mechanism is used, whereby the sender overhears the retransmission of the packet. The sender node waits for this acknowledgment for a specified period; if no acknowledgment is received, the packet is retransmitted thereafter, and the transmission is considered successful when an acknowledgment is received. During the entire procedure, RL is employed to select the list of relay nodes when forwarding packets. The RL agent selects this list according to the local channel quality and the energy consumption across the entire route. Initially, the nodes have no knowledge of the environment; with experience, they learn and update their knowledge. When a node sends a packet to the sink node, it uses the Q-value to obtain the optimal route. The algorithm chooses the optimal action according to the value of the action, which is the cost required to transmit the packet from the sender to the sink node. Furthermore, an increasing number of retransmissions degrades network performance owing to increased network traffic. Therefore, the maximum number of retransmissions is determined dynamically by utilizing the well-known pure-ALOHA closed-form expression S = G·e^(−2G), where G is the average number of transmission attempts in a time interval equal to that required to transmit one packet.
• Advantages: The size and composition of the relay set at each transmission attempt are determined dynamically, which increases the packet delivery ratio and ensures a lower energy cost. Another useful feature of this protocol is that it facilitates packet forwarding by broadcasting a packet when no neighbor node is known to the sender.
• Disadvantages: CARMA considers only a static UWSN environment, which may not be suitable for all types of UWSN scenarios in which nodes exhibit mobility. Moreover, it selects a set of relay nodes (instead of a single one-hop relay) at a time, which may degrade network performance under higher network traffic.
• Application: This protocol is suitable for a static UWSN environment (i.e., where nodes are deployed in stationary positions and when only the data from that position are delivered to the sink node). This routing protocol is suitable for monitoring temperature and other environmental attributes.
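The pure-ALOHA relation used by CARMA to bound retransmissions is easy to check numerically; the function below simply encodes S = G·e^(−2G), whose throughput peaks at G = 0.5 with S = 1/(2e) ≈ 0.184. As traffic G grows past this point, throughput falls, which is why CARMA caps the retransmission count dynamically:

```python
import math

def aloha_throughput(g):
    """Pure-ALOHA throughput S = G * e^(-2G), with G the average
    number of transmission attempts per packet-transmission time."""
    return g * math.exp(-2 * g)
```

For example, aloha_throughput(0.5) is larger than aloha_throughput(1.0), reflecting the throughput collapse under heavier offered load.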

I. RL-BASED CONGESTION-AVOIDED ROUTING (RCAR)
In [55], Jin et al. investigated the congestion-control problem in UWSN routing and proposed the RCAR protocol to minimize energy consumption and end-to-end delay. The protocol comprises four stages: initialization, virtual-pipe creation, virtual routing, and packet forwarding. In the initialization stage, all nodes exchange their location information and residual energy, collected from the physical layer, with their one-hop neighbors. A neighbor table is generated in each node using the one-hop neighbor information. Then, a dynamic virtual routing pipe is generated by the node that holds the packet to be forwarded. The radius of the pipe is based upon the average residual energy of the neighboring nodes; it is computed from E_ini and R, the initial energy and transmission range of the nodes, respectively; R_i^ini, the initial radius of the pipe; and E_i, the average residual energy of the next-hop nodes. This virtual pipe helps to reduce unnecessary initial exploration detours. Then, virtual routing is performed using the RL-based algorithm to select the next forwarding node. After the node forwards the packet, packet forwarding is considered completed if the selected node is available; if unavailable, the algorithm is reapplied to choose the next-hop node, and the information regarding the link is updated. A handshake mechanism based on S-FAMA is utilized in the MAC layer to update the node information in the initially generated neighbor table. During this period, DATA and ACK packets are used to exchange information. In RCAR, three additional pieces of information are included in these packets: the residual energy, the current buffer state, and the V value. These determine the value of a node for selection as a next-hop node. The structures of the DATA and ACK packets are shown in Figure 7, with the additional information highlighted. When a node has to send a packet again, the Q-value is calculated using the updated information.
• Advantages: Unlike other RL-based UWSN routing protocols, RCAR prevents congestion in the network. The handshake-based method for updating information in the MAC layer helps to reduce energy consumption, because nodes need not broadcast periodically to update their information. It also mitigates collisions between nodes during broadcasting.
• Disadvantages: The state-space contains information regarding the one-hop neighbors of each node. Under an increase in the number of nodes in the environment, the number of one-hop neighbors also increases. In this case, the state size becomes large, and the algorithm takes more time to converge.
• Application: This protocol can be applied to dynamic underwater acoustic communication-based networks, because it can adapt to dynamic topologies. It is suitable for UWSNs, where the number of sensor nodes is moderate.
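RCAR's handshake-based neighbor-table update (piggybacking residual energy, buffer state, and the V value on DATA/ACK packets instead of periodic broadcasts) can be sketched as below. The field names and table layout are illustrative assumptions, not the packet format from [55]:

```python
from dataclasses import dataclass

@dataclass
class NeighborEntry:
    """Per-neighbor state carried on RCAR's DATA/ACK handshake."""
    residual_energy: float
    buffer_level: float   # current buffer occupancy (congestion signal)
    v_value: float        # value of the node as a next-hop candidate

def update_neighbor_table(table, node_id, energy, buffer_level, v_value):
    """Refresh a neighbor's entry from a received DATA or ACK packet,
    so no periodic broadcast is needed to keep the table current."""
    table[node_id] = NeighborEntry(energy, buffer_level, v_value)
    return table
```

Because the update rides on packets that are exchanged anyway, nodes avoid the energy cost and collisions of dedicated broadcast rounds, which is the advantage noted above.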

J. Q-LEARNING BASED ENERGY-DELAY ROUTING (QL-EDR)
Wang et al. proposed a clustering-based routing protocol, QL-EDR, in [56], employing Q-learning to select the next-hop node in a hierarchical UWSN. Communication between the nodes was considered via magnetic induction.
The main objective of this protocol is to extend the network lifetime by minimizing energy consumption and end-to-end delay. The framework is divided into three parts: data collection, data processing, and decision management. In the first phase, cluster heads (CHs) are selected by forming several clusters in each layer of the three-layered UWSN. Cluster members send their sensed data to the CHs for transmission to the base station. In the second phase, features are extracted from the data sent by the sensor nodes. The third phase employs the Q-learning algorithm to select the next one-hop node according to the residual energy and distance of the nodes. Two parameters are used to obtain a single-hop bonus, from which the reward function and Q-values are designed: D_hop, the distance-based bonus, and E_hop, the residual-energy-based bonus. They are computed from D_t, the shortest-distance-based path; d_t, the distance between two nodes; E_t, the maximum energy-based path; and e_t, the residual energy of the nodes. A regulatory factor β is used to emphasize either the residual energy or the transmission delay according to the state, to prolong the network lifetime.
• Advantages: The clustering of sensor nodes helps to minimize the overall end-to-end delay and energy consumption in the network, because not all sensor nodes need to act as relay nodes; only the CHs perform data transmission.
• Disadvantages: The multi-hop path is selected only after the CH receives all the data in the cluster. If a single node (rather than all nodes in a cluster) must send data, it must wait until all nodes have sent their data to the CH. This increases the latency for a single sensor node, even though it decreases the overall end-to-end network delay. Moreover, even if a cluster member is just one hop away from the sink, it cannot transmit data directly, because it must send its data to the CH. In addition, CH selection is not optimized.
• Application: QL-EDR is not suitable for emergency applications in UWSNs because data are sent by CHs. However, it is applicable in environments where sensor nodes only perform monitoring tasks after a specific timestamp. All nodes send data to the CH at that time, and the CH transmits it to the sink node and eventually the base station.
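One plausible way to combine QL-EDR's two bonuses with the regulatory factor β is a convex combination, sketched below. The combination form and the default β are assumptions for illustration; the survey text only states that β shifts emphasis between the energy-based and distance-based terms:

```python
def qledr_bonus(d_hop, e_hop, beta=0.5):
    """Combine the distance-based bonus D_hop and the residual-energy
    bonus E_hop. beta emphasizes energy (beta -> 1) or distance/delay
    (beta -> 0); the convex-combination form is an assumed sketch."""
    return beta * e_hop + (1 - beta) * d_hop
```

With beta = 1.0 the bonus reduces to the energy term alone, which is the limit the protocol would use when prolonging network lifetime dominates the objective.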

K. DISTRIBUTED MULTI-AGENT RL ROUTING (DMARL)
Li et al. proposed DMARL in [57] as a routing protocol for UWSNs, by considering an optical communication medium. They designed the UWSN as a distributed multi-agent system that supports information interaction between adjacent nodes. Subsequently, a multi-agent RL algorithm is applied in the routing process, to prolong the network lifetime and adapt to the dynamic topology of the UWSN. The implementation of DMARL is performed in three stages: preliminary stage, route discovery, and route forwarding. The preliminary stage involves the sensor node deployment and routing-table-related parameter initialization. Then, in the route discovery stage, each node determines its one-hop neighbors by periodically broadcasting hello packets. The Q-table is updated according to the neighbor node information. Q-learning is applied in the route-forwarding stage. Each node operates as an agent and maintains a Q-table. The state is regarded as a node with a data packet at a particular time t.
The action of the agent is to select the next-hop node. The reward mechanism is designed based on local and global rewards. The local reward function is designed considering the residual energy and the link quality between sensor nodes; these are obtained from an ACK packet after data transmission. The local reward can be defined as r_local = W_E · E + W_L · L_Q when an ACK is received, and r_local = K_non-ACK otherwise, where E is the normalized residual energy of the receiver node j, L_Q is the normalized link quality, and W_E and W_L represent the weights of the residual energy of node j and the link quality, respectively. K_non-ACK is a negative reward applied when no ACK is received (i.e., when the routing quality is poor). The global reward provides feedback regarding changes in the environment; it depends on the transmission direction, that is, upon whether the current node is closer to or farther from the sink node than the previous node. If it is closer, a positive reward is given; otherwise, a negative reward is given. One important aspect of DMARL is that two optimization strategies are utilized to accelerate the convergence of the RL algorithm: position-based Q-value initialization and learning-rate variation. The first strategy initializes the Q-value according to the initial distances to the sink from a node and from one of its neighbor nodes. In the second strategy, the learning rate is adjusted according to link stability, to reflect changes in the neighbor set.
• Advantages: Q-value initialization helps to decrease the learning time of the RL algorithm. Moreover, adjusting the learning rate according to the dynamic environment accelerates the algorithm's convergence. Integrating both techniques reduces the number of training steps and accelerates convergence, which subsequently saves energy in UWSNs.
• Disadvantages: DMARL is specifically designed for underwater optical communication. In addition, as mentioned in the paper, DMARL is unsuitable for a UWSN environment in which more than 14 neighboring nodes (on average) are present. Despite its good performance in dynamic environments, this node-density constraint limits its applicability to specific UWSN environments.
• Application: DMARL is designed considering node mobility in a UWSN. Therefore, this protocol can be applied to any dynamic UWSN environment featuring a limited number of nodes.
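DMARL's local reward and its position-based Q-value initialization can be sketched as follows. The weighted-sum form with the K_non-ACK penalty follows the definition above; the specific weights, the penalty value, and the binary initialization rule are illustrative assumptions:

```python
def dmarl_local_reward(ack_received, energy, link_quality,
                       w_e=0.5, w_l=0.5, k_non_ack=-1.0):
    """DMARL-style local reward: weighted normalized residual energy E
    and link quality L_Q on success; the fixed negative K_non_ACK when
    no ACK arrives. Weights and penalty value are illustrative."""
    if not ack_received:
        return k_non_ack
    return w_e * energy + w_l * link_quality

def position_based_q_init(dist_node_to_sink, dist_neighbor_to_sink):
    """Assumed sketch of position-based Q-value initialization:
    neighbors initially closer to the sink than the current node
    start with a higher Q-value, shortening the exploration phase."""
    return 1.0 if dist_neighbor_to_sink < dist_node_to_sink else 0.0
```

Seeding Q-values from geometry is what lets the agents skip much of the blind exploration that a zero-initialized Q-table would require.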

L. ENERGY-EFFICIENT DEPTH-BASED OPPORTUNISTIC ROUTING WITH Q-LEARNING (EDORQ)
Lu et al. proposed the routing protocol EDORQ [58] for UWSNs; they sought to ensure energy-saving and reliable data transmission from sensor nodes to sinks. The overall routing process of EDORQ consists of two stages: first, candidate-set selection is performed based on void detection; second, candidate-set coordination is performed via Q-learning. The first stage aims at choosing candidate nodes from the neighboring nodes to forward the packet to the sink. The depth and void-flag information of the nodes are used as candidate-selection metrics. The candidate-set selection comprises two modes: a greedy mode and a void-recovery mode. In the greedy mode, the nodes closer to the sink than the sender node (according to depth) are selected as candidate nodes. A void can arise when a forwarder node is selected (because of its smaller depth) but has no further forwarder node closer to the sink. Void recovery is triggered when a node returns the packet upon detecting a void and does not receive information that the packet has been successfully forwarded. In such a case, the node for which a void is detected selects a forwarder node with a greater depth. The candidate-set coordination stage utilizes the Q-learning algorithm along with a holding-time mechanism to select the forwarder node. The reward function in the Q-learning algorithm is designed with residual-energy-related values, depth-related values, and void-detection factors. Using this reward function, the Q-values of all nodes in the candidate set are calculated. Subsequently, a holding time is assigned to all candidate nodes, such that a node with a higher Q-value has a lower holding time and therefore a higher priority for sending the packet. The mechanism is illustrated in Figure 8. Nodes n_2 and n_3 are in the candidate forwarder set of sender node n_1, and the holding time is calculated for both according to their Q-values.
In the figure, node n_2 forwards the packet when its timer expires and, upon overhearing the transmission, node n_3 discards the packet. The holding time of node n_j is a decreasing function of Q(s_i, a_j), the Q-value of node n_j in state s_i, scaled by T_max, the predefined maximum holding time calculated from the maximum communication range of a node and the propagation speed. This holding time prevents the overhead entailed by packet forwarding from all candidate nodes. When a candidate node with lower priority overhears the same packet transmission from a higher-priority node within its holding time, it drops the packet. Thus, the optimal candidate node is selected for packet forwarding.
• Advantages: The reward design ensures that, upon selecting a node with higher residual energy, smaller depth, and greater void-detection factor, the agent receives a higher reward. The candidate set coordination does not require any additional ACK packet transmission, owing to the holding time mechanism. Moreover, this protocol ensures reliable packet transmission because each node holds the packet until the end of its holding time before dropping it.
• Disadvantages: The holding time mechanism increases the delay because each node must wait until the end of its holding time before the packet is forwarded. This increases the end-to-end delay for routing in the network.
• Application: EDORQ is suitable for dynamic topologies, because it is an on-demand routing protocol that can be adjusted according to node mobility. However, this protocol may be unsuitable for time-critical applications, because each forwarder node must wait until the end of the holding time, which may cause a delay.
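A minimal sketch of EDORQ's holding-time mechanism is given below. The text only requires that a higher Q-value yield a shorter holding time bounded by T_max; the linear form and the normalization by the maximum Q-value are assumptions for illustration:

```python
def edorq_holding_time(q_value, q_max, t_max):
    """Holding time decreasing in the candidate's Q-value, scaled by
    the predefined maximum T_max. The linear form is an assumed
    sketch; the paper only requires higher Q -> shorter holding time."""
    if q_max <= 0:
        return t_max
    return t_max * (1.0 - min(q_value / q_max, 1.0))
```

Under this rule the best candidate's timer fires first, it forwards the packet, and lower-priority candidates that overhear the transmission cancel their timers and drop their copies.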
M. RL-BASED OPPORTUNISTIC ROUTING (RLOR)
In an opportunistic routing procedure, packet routing is performed via the cooperation of multiple nodes receiving the packet, rather than a single relay node. When a node n_i must send a data packet, the candidate forwarding set of the node is determined first. This set of candidate nodes is selected according to the depth, energy, and number of neighbor nodes. Then, the sender node sends its location information to its neighbors in the candidate set, and the next-hop node is selected from the candidate nodes using an RL algorithm. In this RL algorithm, the state space contains information about the current node as well as its candidate node set, and the action is to select the next-hop node from the candidate node set. The reward for taking action a_j (selecting the next-hop node n_j from node n_i) is computed from dep, the depth difference between nodes n_i and n_j; p(d, l), the probability of successful packet transmission; G_above(n_j), the number of neighbor nodes above n_j; and E(n_j), the energy of n_j. Upon receiving a packet-transmission request from the sender node, all candidate nodes calculate their Q-values using the discounted reward function. After receiving the Q-values of the candidate nodes, the sender node selects the next-hop node with the largest Q-value.
In the case of a routing void problem (i.e., when the selected node has no neighbor nodes to which the packet can be transmitted), a method called the recovery mode is applied. When a void is detected, the void node enters the recovery mode and selects a downward forwarder node using the RL algorithm; the forwarder node is selected based on smaller depth differences and higher energies. In addition, opportunistic routing is used alongside an adaptive dynamic timer mechanism. A waiting time is set for every candidate node (according to the communication delay) to choose the forwarder node with higher priority. The node whose waiting time elapses first forwards the packet, and the other candidate nodes drop it.
• Advantages: The adaptive dynamic timer mechanism ensures successful packet transmission. In addition, the waiting time of the candidate nodes leads to the selection of the best relay node. The end-to-end delay is also reduced by considering the communication delay as a parameter for setting the waiting time.
• Disadvantages: In RLOR, the state space contains the current node and set of candidate forwarder nodes. In a dynamic UWSN, the candidate set varies during the routing process at different times. Therefore, the state space is larger, and the RL algorithm may be slow to reach convergence.
• Application: RLOR exhibits better performance in the case of a dense network, because the risk of routing void problems is smaller. Moreover, this protocol can only be applied to UWSNs operating via acoustic communication.
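An RLOR-style reward over the four factors named above, plus the sender's max-Q selection, can be sketched as follows. The weighted-sum form and the weights are assumptions; the original definition combines dep, p(d, l), G_above(n_j), and E(n_j), but its exact functional form is not reproduced in the survey text:

```python
def rlor_reward(dep, p_success, g_above, energy,
                weights=(0.3, 0.3, 0.2, 0.2)):
    """Illustrative RLOR-style reward over the four named factors:
    depth difference, delivery probability, number of neighbors above
    the candidate (assumed normalized to [0, 1] by the caller), and
    residual energy. The weighted-sum form is an assumption."""
    w1, w2, w3, w4 = weights
    return w1 * dep + w2 * p_success + w3 * g_above + w4 * energy

def pick_forwarder(candidates):
    """candidates maps node_id -> Q-value reported back to the sender,
    which then selects the node with the largest Q-value."""
    return max(candidates, key=candidates.get)
```

The reward grows with the link's delivery probability, so candidates with unreliable channels are gradually filtered out of the forwarding set.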

V. COMPARATIVE STUDY AMONG ROUTING PROTOCOLS
In this section, a comparative study of all the investigated routing protocols is presented from different perspectives.
The key ideas of all routing protocols are listed in Table 3.
The specific features of each routing protocol used to select forwarding nodes or routing paths differ, despite all being based upon RL. The main ideas of the protocols are compared and discussed in Section V-A.

A. KEY IDEAS OF RL-BASED ROUTING PROTOCOLS
The novelty of QELAR lies in the design of its reward function, which contains two cost functions related to residual energy: one relates to the residual energy of the node holding the packet, and the other relates to the energy distribution amongst the direct neighbors of that node; the average residual energy of the neighboring nodes is thereby computed. The key idea of MURAO is to physically divide the network into several clusters and logically partition these clusters into two layers. Multi-level RL is applied to both layers, and the upper layer contains the cluster heads that function as agents for their associated clusters. In QDAR, the sink node implements Q-learning to determine the routing path; meanwhile, each packet functions as an agent. Using all information from the UWSN nodes, the sink nodes create a virtual topology by sending a virtual packet. By contrast, the HyDRO protocol selects the route that maximizes the residual energy; hence, when choosing the neighboring node toward the sink node, the main objective is to select the nodes with maximum residual energy. The reward is designed such that the relay node selected for sending packets always has the maximum residual energy.
QGDR assumes that each sensor node is a player in a multiplayer routing game. Each node learns the policy of choosing the best relay node to send packets to the sink node, by utilizing Q-learning. QL-EEBDG aims to mitigate the void hole problem encountered in UWSNs by selecting a node with at least one neighbor node or the sink within their one-hop distance. Then, the node with the shortest distance is selected as the next-hop node.
In QLFR, a new holding-time mechanism is designed using RL, to schedule packet forwarding according to node priorities. This mechanism helps reduce redundant transmissions between multiple forwarding nodes. Unlike other RL-based routing protocols, CARMA chooses a set of relay nodes to forward packets toward the sink node, using the RL algorithm. However, most protocols choose the next-hop neighbor only. The channel condition and route-long energy are considered when determining the list of relay nodes.
RCAR is performed by each node that holds a packet to be forwarded. The node creates a dynamic virtual routing pipe using the residual energy of the neighboring nodes, and it performs virtual routing to select the next forwarding node. A clustering approach is used in QL-EDR to collect data from the sensor nodes. After all data in the cluster are collected by the CH, Q-learning is adopted to select the next hop. The residual energy and transmission delay are used as indicators for selecting the routing path.

B. COMPARISON OF THE RL APPLICATIONS
RL has been applied to different protocols to solve different problems. In all investigated protocols, the designs of the RL algorithm differed. The state, action, and reward functions were constructed with different objectives. The reviewed protocols are compared in terms of RL applications in Table 4.  In most protocols, the RL algorithm is designed in a distributed manner, where each node acts as the RL agent. The sender node observes the states (essentially the Q-value of neighboring nodes) and chooses either the next-hop node or the routing path. Some RL designs are also centralized such that the sink node functions as the agent and chooses the routing path. Here, the RL designs in the reviewed protocols are described individually.
As shown in Table 4, in all the reviewed protocols (except for MURAO, QDAR, and QL-EDR), the sensor nodes function as the agent of the RL algorithm. In general, the source node or sender node becomes the agent and performs data forwarding by following a policy or according to the maximum Q-value. In MURAO, routing is performed in a hierarchical manner in which the CHs select the routing path in the corresponding clusters and perform packet routing. However, QDAR assumes that each packet is an agent of the Q-learning algorithm. The agent's policy is the routing path that directs the packet (agent) to take proper actions. In QL-EDR, the base station observes and makes routing decisions as the Q-learning agent.
The agent's state is the observation factor of the RL agent, from which the agent decides the action. The information available in the state is crucial for the agent learning process. In addition, the state space should not become large. In QELAR, MURAO, QDAR, DMARL, and EDORQ, the state of the algorithm relates to the node that holds the packet. Therefore, at any time t, the state in the RL algorithm is the ID of the node where the packet resides at that time. The routing action is selected according to the ID of that node. Retaining the nodes holding the packet as the state is beneficial, because the next state will be the next-hop node. Therefore, consecutive states form the routing path for the routing protocol.
The state spaces in HyDRO and CARMA are similar. They contain sets indicating the number of times node i has transmitted packet p unsuccessfully, as well as the packet transmission or packet drop. Depending on the status of the packet (i.e., whether it is received or dropped by the neighbor nodes), the packet status is established in the state of the node. The state is designed according to single-packet forwarding. In QGDR, the goal of RL is to identify the optimal routing policy in which each node is considered as a player in a multiplayer MDP problem. The state consists of a payoff history array and path cost value. The reason for designing such a state-space is to transmit the packet to the sink with the maximum payoff.
The state space in QL-EEBDG includes the control packets generated by the source node, which is referred to as the source agent. These control packets are sent by the source node to all sensor nodes within its range. The receiver node, as the receiver agent, sends back an ACK packet, which is used to select the neighbor node. In QLFR, the residual energy and depth of the node comprise the state of each node; with these two types of information, the optimal forwarding node can be selected. The state of a node in RCAR is designed with information regarding its one-hop neighbors and the link conditions between them. The Q-value of each node is calculated from this information, and the highest Q-value indicates the optimal link condition; the next-hop node is thus selected according to the optimal link quality.
The action of all reviewed RL-based routing protocols for UWSNs is the selection of one or more relay nodes for packet forwarding. CARMA acts to select the set of relay nodes; meanwhile, all other protocols act to select the next-hop neighbor.
The reward function in each of the reviewed RL-based routing protocols reflects the main objective of the protocol. For example, QELAR, MURAO, HyDRO, QL-EDR, DMARL, and EDORQ consider the residual energy of the neighbor nodes when designing the reward function, such that a node with higher residual energy provides a higher reward and is therefore more likely to be selected as the forwarding node. However, residual energy alone cannot determine the likelihood of a node being the optimal next-hop node. Therefore, other parameters (e.g., energy distribution among neighbor nodes, link quality, distance or depth of node, and delay) have been added alongside the residual energy. The remaining protocols employ reward functions that omit residual energy but similarly reflect their respective objectives.
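To make this concrete, a reward of the kind described above can be sketched as a weighted combination of residual energy, link quality, and node depth. The weights and the specific terms below are illustrative assumptions, not the reward function of any particular reviewed protocol:

```python
def reward(residual_energy: float, initial_energy: float,
           link_quality: float, depth: float, max_depth: float,
           w_energy: float = 0.5, w_link: float = 0.3,
           w_depth: float = 0.2) -> float:
    """Hypothetical composite reward: higher residual energy, better link
    quality, and shallower depth (closer to the surface sink) all raise
    the reward earned by selecting this neighbor as the next hop."""
    energy_term = residual_energy / initial_energy   # normalized to [0, 1]
    depth_term = 1.0 - depth / max_depth             # shallower -> larger
    return w_energy * energy_term + w_link * link_quality + w_depth * depth_term
```

A neighbor at half energy, link quality 0.8, and 200 m depth in a 1000 m deployment would yield 0.5*0.5 + 0.3*0.8 + 0.2*0.8 = 0.65 under these assumed weights.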

C. COMPARISON OF THE OPTIMIZATION PARAMETERS
The reviewed RL-based UWSN routing protocols were designed to optimize the performance from different perspectives. Given that the optimization criteria have trade-offs, a routing protocol should attempt to maximize the outcome of the expected performance metrics whilst also minimizing the negative impacts upon other performance metrics. In Table 5, the optimization parameters and evaluation metrics of the reviewed routing protocols are presented. In the table, 'O' indicates that the specific parameter is considered in the specific protocol, and 'X' indicates that the parameter is not taken into account for that protocol.
The residual energy of a sensor node is the remaining energy of the node [60]. This is an important optimization parameter because it determines how long a node can participate in packet forwarding. The average residual energy can be computed as R_avg = (1/n) Σ_{i=1}^{n} R_i, where R_i denotes the residual energy of the i-th sensor node, and n denotes the total number of sensor nodes. When designing a routing protocol, the next-hop node must be selected by considering its residual energy. All the reviewed protocols (excluding CARMA) optimize the residual energy when making routing decisions. Residual energy can be saved by reducing the energy consumption of the sensor nodes during sensing and data transmission. The network lifetime also depends upon the energy consumption of the sensor nodes. Therefore, to be efficient, one of the most crucial features of a routing protocol is minimizing energy consumption and thereby extending the network lifetime. Considering and evaluating the network lifetime is essential when designing a routing protocol. The network lifetime can be estimated as the data-transmission duration (or number of rounds) during which all nodes remain alive, i.e., until the first sensor node in the network dies [60]. Of all the reviewed protocols, HyDRO, QGDR, RCAR, and EDORQ do not explicitly consider the network lifetime when designing routing protocols.
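The averaging over all sensor nodes described above is a one-line computation; the sketch below is a direct transcription of the formula, not code from any reviewed protocol:

```python
def average_residual_energy(residual_energies: list) -> float:
    """R_avg = (1/n) * sum of R_i over all n sensor nodes,
    where residual_energies holds the R_i values."""
    return sum(residual_energies) / len(residual_energies)
```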
End-to-end delay describes the average time required for a packet to traverse from the source node to the destination one; this includes the transmission delay, holding time, processing time, propagation delay, and receiving time [61]. While choosing a routing path, it is important to choose the path that will minimize the time required to deliver the packet to the sink node. In a UWSN, the sink node is the destination node. Therefore, to transmit the packet to the sink node faster, the routing protocol must consider the end-to-end delay. However, QL-EEBDG and DMARL have not considered the data-transmission delay from the source node to the sink node when designing their protocols.
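Given the delay components listed above, the end-to-end delay of a path is simply their sum accumulated over every hop. The dictionary keys below are assumed names for those components:

```python
def end_to_end_delay(per_hop_delays: list) -> float:
    """Sum the delay components accrued at every hop of the routing path.
    Each element is a dict with the five components named in the text."""
    components = ("transmission", "holding", "processing",
                  "propagation", "receiving")
    return sum(hop[c] for hop in per_hop_delays for c in components)
```

Averaging this quantity over all delivered packets gives the metric typically reported in the performance evaluations.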
The link quality is an important parameter when selecting the subsequent forwarding node, to ensure that a more reliable link is chosen from amongst the candidate nodes. Owing to the highly error-prone nature of underwater wireless links, data transmission over poor-quality links leads to packet losses, which may necessitate retransmission. Because data retransmission increases energy consumption and delay, it is necessary to select a reliable, good-quality link to reduce the likelihood of packet losses [62]. However, the majority of the reviewed protocols do not consider the link quality when designing routing protocols, as can be seen in Table 5.
Of all the reviewed routing protocols, only HyDRO considered energy-harvesting-enabled UWSNs. Nodes deployed at different depths harvest energy from the environment to support their operations. In such networks, the nodes at the seafloor harvest energy through turbines, whereas harvesting in nodes closer to the sea surface is performed using solar panels attached to floating devices cabled to the nodes. An energy-harvesting-aware protocol can effectively bypass the energy constraints of sensor nodes in UWSNs.
The packet drop ratio is the ratio between the number of packets dropped by the sensor nodes and the total number of packets sent by the source nodes during a data-transmission round [63]. It can be calculated as (P_s − P_r)/P_s, where P_r denotes the packets received, and P_s denotes the packets sent during any specified round. Considering the number of packet drops in the routing protocol design ensures proper selection of the next forwarder, to realize successful packet delivery to the sink. HyDRO and CARMA considered the number of packet drops in the state of the agent. Moreover, QGDR, QL-EEBDG, and RLOR also consider this parameter in their routing protocols. The constant node mobility in the UWSN environment leads to continuous topology changes [24]. Therefore, it is important to consider dynamic topology changes when designing a routing protocol, to reflect real-world UWSNs. However, HyDRO, QLFR, CARMA, QL-EDR, and RLOR do not consider this parameter. The performance evaluation of a protocol cannot ensure accurate results if dynamic topology is not considered. Nevertheless, node mobility has also been neglected when evaluating certain reviewed protocols, such as QDAR, HyDRO, QGDR, QL-EEBDG, CARMA, and RCAR.
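The drop-ratio definition above translates directly into code; the sketch takes the per-round counts of sent and received packets as inputs:

```python
def packet_drop_ratio(packets_sent: int, packets_received: int) -> float:
    """Drop ratio = (P_s - P_r) / P_s for one data-transmission round,
    where P_s packets were sent and P_r were received."""
    return (packets_sent - packets_received) / packets_sent
```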
During the routing process, a packet may have to hop through multiple nodes from the source node to the sink node. Reducing the number of hops can reduce the delay and energy consumption of the overall network. In this regard, choosing the routing path such that the number of hops is lower can improve efficiency. Among the reviewed protocols, hop count is considered in only two: QGDR and QL-EEBDG.
The distance between two nodes is an important parameter for choosing the next-hop node in terrestrial networks. In contrast, in UWSNs, the depth of the sensor nodes also plays a crucial role. Because the sink is placed on the surface, the routing path is oriented toward shorter depths. In this regard, the routing process in QDAR, HyDRO, QLFR, QL-EDR, EDORQ, and RLOR is designed considering the depths of the sensor nodes.
The packet delivery ratio is defined as the ratio of the number of packets successfully received by the sink node to the total number of packets sent until the end of a transmission round. When evaluating a routing protocol, it is necessary to measure its performance in terms of the packet delivery ratio; a higher ratio reflects a more efficient routing protocol. The majority of the reviewed protocols considered this parameter, as shown in Table 5. During the routing of a packet, congestion can arise because of the flooding performed for route discovery or route acquisition. Congestion can degrade network performance if left uncontrolled. However, among the reviewed routing protocols, only RCAR is designed to prevent congestion.

D. COMPARISON OF NETWORK DESIGN AND PERFORMANCE EVALUATION TECHNIQUES
The UWSN designs adopted in the reviewed routing protocols differ. The communication channel, deployment, topology, number of nodes, and other parameters are considered in various ways. Moreover, different researchers have used different evaluation techniques; these techniques are summarized in this subsection and presented in Table 6.
In the table, 'NG' indicates that no option is specified in the paper for the routing protocol. The deployment in the table indicates the dimensions of the topology considered for the simulation. Several protocols considered the 3D deployment of UWSNs, whereas others considered 2D deployment. Other geometric shapes (e.g., circular shapes in QL-EEBDG and rectangular shapes in CARMA) have been considered for designing protocols.
The simulation area refers to the width and height of the network scenario in the 2D network case, and the width, height, and depth in the 3D network case. As shown in Table 6, a wide variety of simulation areas have been considered for the performance evaluation of different routing protocols. The simulation area and number of sensor nodes are related: a large number of nodes in a small simulation area indicates a dense network, whereas a small number of nodes in a large simulation area represents a sparse one [64]. Both are possible in a UWSN environment, depending on the applications considered in that particular environment. For example, routing protocols such as RLOR offer superior performance in dense networks because they are designed to address the routing void problem. The number of nodes may also be varied depending on the water depth, energy consumption, and cluster-forming technique. A routing protocol suited to one environment may not be suitable for another, and the number of nodes may need to be varied accordingly.
The performance evaluations for the reviewed routing protocols were conducted using different simulators, each with their own advantages and disadvantages. However, the simulator used for most of the reviewed protocols was not mentioned in the studies. Among the mentioned simulators, the SUNSET simulator provides a realistic representation of the UWSN environment; it supports various channel models and provides a detailed representation of the communication component and node energy [65]. Network Simulator version 2 (NS-2) has been used for QELAR and EDORQ with an aquatic environment simulation package called Aqua-Sim [66].
The number of sink nodes can be one or more in UWSN applications. Both single and multiple sinks have been considered for performance evaluations of the reviewed UWSN routing protocols. For the multiple-sink scenario, the destination can vary; meanwhile, for a single sink, the destination remains the same. Therefore, RL-based routing with multiple sinks may become more difficult than with a single sink, because the terminal state can vary.
The data packet size has an impact on the performance of multihop communication in UWSNs [67]. In the reviewed routing protocols, the data packet sizes varied from 50 bytes to 1 MB. The simulation time represents another important factor for an RL-based routing protocol, because the learning of the agent improves over time. The protocol may not be effective if the simulation terminates before the agent converges. For the performance evaluations of the reviewed protocols, the simulation time was mentioned in several cases, such as QDAR, RCAR, DMARL, EDORQ, and RLOR; it represents the total time required for one round of data transmission. By contrast, HyDRO was evaluated over a six-day simulation, and QGDR was simulated 100 times.
In addition to simulation for performance evaluation, the efficiencies of the schemes were validated by comparing them against other well-defined and widely accepted protocols. All the reviewed routing protocols were likewise compared with other existing protocols. Since QELAR is an early RL-based UWSN routing protocol, it has been used as a comparison baseline for QDAR, HyDRO, CARMA, RCAR, and EDORQ.

VI. CHALLENGES AND OPEN RESEARCH ISSUES
The challenges and open research issues for RL-based UWSN routing protocols are highlighted in this section. Although the proposed protocols have shown significant performance improvements in routing, they can be further improved. Many challenges remain to be solved before these routing protocols can be implemented in real UWSN environments. These challenges, as well as future research issues, are discussed here.

A. NODE MOBILITY
In UWSNs, underwater nodes can be static (i.e., anchored to the ocean floor) or dynamic (i.e., floating with changing mobility). Most RL-based routing protocols consider only a scenario comprising static nodes, neglecting node mobility. In a real underwater environment, node mobility arises because of water pressure and water currents [28]. This mobility changes the topology of the UWSN [68]. Moreover, AUVs have been used to collect data from underwater sensor nodes in different research works [69]-[71]. In these cases, the AUVs can be considered as sink nodes with mobility. This scenario has not been considered in any of the reviewed schemes.
RL can be used to adapt to the mobility of both the sensor and sink nodes. Because RL algorithms can explore and learn in a network without knowing its full architecture, this feature may be suitable for scenarios with node mobility in UWSNs. The Q-table of each node can be updated with the changed location information of the neighboring nodes and the sink nodes at every timestamp. Therefore, if a topology change occurs in the interval between two packet forwardings, the nodes can update their neighbor information.
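One way such an update could work is sketched below: when new neighbor positions become available, the node prunes Q-table entries for neighbors that drifted out of communication range and seeds entries for newly reachable ones. This is a hypothetical mechanism, not taken from any reviewed protocol:

```python
import math

def refresh_neighbors(q_table: dict, node_pos: tuple,
                      neighbor_positions: dict, comm_range: float,
                      default_q: float = 0.0) -> None:
    """Update a node's Q-table after a topology change: drop neighbors
    that moved out of acoustic range, add newly reachable ones with a
    default Q-value (hypothetical sketch)."""
    in_range = {
        nid for nid, pos in neighbor_positions.items()
        if math.dist(node_pos, pos) <= comm_range
    }
    for nid in list(q_table):          # remove stale entries
        if nid not in in_range:
            del q_table[nid]
    for nid in in_range:               # seed entries for new neighbors
        q_table.setdefault(nid, default_q)
```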

B. CONVERGENCE OF RL
In the RL algorithm, an agent must experience all possible states and actions to obtain the optimal result. Therefore, the agent must traverse the entire Q-table. When the number of states and actions increases, the Q-table becomes large; thus, the agent requires a longer time to converge. In some cases, it may not converge at all; that is, it may become stuck in a local optimum without reaching the global optimum.
When designing routing protocols, the convergence time must be considered. The number of states and actions should be minimized, to obtain an optimal result faster. Integrating other techniques, such as fuzzy logic [72], can help limit the number of states. If the states are continuous values, they need to be converted into discrete values; otherwise, the number of states may become infinite, and the algorithm may not converge.
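The discretization step mentioned above can be sketched as simple uniform binning: a continuous state variable (e.g., residual energy) is clamped to its valid range and mapped to one of a fixed number of levels, keeping the Q-table finite. The bin layout is an illustrative choice:

```python
def discretize(value: float, low: float, high: float, n_bins: int) -> int:
    """Map a continuous state variable to one of n_bins discrete levels
    so the Q-table stays finite."""
    value = min(max(value, low), high)                 # clamp to valid range
    bin_index = int((value - low) / (high - low) * n_bins)
    return min(bin_index, n_bins - 1)                  # top edge -> last bin
```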

C. Q-TABLE INITIALIZATION
Q-table initialization significantly determines the learning speed of the RL algorithm. In most cases, Q-values are initialized to zero and updated only when the corresponding state-action pair is visited in the network. This process may slow the algorithm convergence. To speed up the convergence of the RL algorithm, Q-table initialization can be performed using several learned values from the same environment. In addition, to update the Q-table in the case of a large number of states and actions, virtual Q-value updating can be performed. This accelerates convergence, because the agent need not visit all states and actions.
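The contrast between zero initialization and seeding from previously learned values can be sketched as follows; the function signature and data layout are assumptions for illustration:

```python
def init_q_table(states, actions, prior=None, default=0.0):
    """Initialize Q-values to a default (typically zero), or seed them
    from values learned in a previous run of the same environment, which
    can speed up convergence (hypothetical sketch)."""
    prior = prior or {}
    return {
        (s, a): prior.get((s, a), default)
        for s in states for a in actions
    }
```

State-action pairs covered by the prior start from informed estimates, while the rest fall back to the default and are learned from scratch.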

D. Q-VALUE UPDATING
At the outset of the Q-learning algorithm, the agent takes actions and learns with no prior knowledge of the environment. This is one of the major drawbacks of RL and is computationally expensive in certain cases. Taking the wrong action may lead to drastic changes in the environment.
Initializing Q-values by incorporating prior knowledge can improve the Q-learning algorithm [59] and reduce the convergence time. However, no precise rules are available for appropriately choosing the Q-value. Different researchers have adopted different techniques for Q-table initialization. Updating the Q-values by applying certain tricks can also lead to faster convergence. One such trick is to update two Q-values: one for an action and another for the corresponding opposite action [73]. This helps to increase the learning speed of the agent.
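One possible reading of the two-Q-value trick in a routing context is that a single packet transmission from a node to its next hop updates both the forward entry and the entry for the opposite (reverse) hop, so experience propagates in both directions. The sketch below is an assumed interpretation, not the exact scheme of [73]:

```python
def q_update(q: dict, state, action, reward: float, next_max: float,
             alpha: float = 0.1, gamma: float = 0.9) -> None:
    """Standard one-step Q-learning update."""
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * next_max - old)

def double_update(q: dict, node, next_hop, reward: float,
                  next_max_fwd: float, next_max_bwd: float) -> None:
    """Hypothetical 'opposite action' trick: one transmission updates both
    the forward hop (node -> next_hop) and the reverse hop
    (next_hop -> node), increasing the agent's learning speed."""
    q_update(q, node, next_hop, reward, next_max_fwd)
    q_update(q, next_hop, node, reward, next_max_bwd)
```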

E. APPLYING A VARIETY OF RL ALGORITHMS
Although RL has been used to design various routing protocols for UWSNs, the reviewed protocols reveal that only Q-learning has been utilized for this purpose. RL includes a variety of algorithms, each having its own advantages and disadvantages. Q-learning is the most popular RL algorithm for solving routing problems in WSNs; however, other RL algorithms, such as SARSA [74], actor-critic learning [75], and deep Q-learning [76], can also be applied. These algorithms may achieve results superior to those of Q-learning, as they have previously been applied for routing in other wireless network scenarios.

F. SECURITY
Security issues are of major concern not only in UWSNs but in any type of wireless network. When collecting sensitive data, security problems represent a consistent threat to the UWSN. None of the afore-discussed routing protocols considered secure data transmission. Designing trust-based routing protocols is necessary, because UWSNs are widely used for military purposes, where data confidentiality is essential. Malicious attacks, unauthorized access, and data leakages should be considered when designing routing protocols, to make them applicable and effective for real-world UWSN tasks.

G. NODE DENSITY
Several routing protocols reviewed in this paper used sparse UWSNs, whilst others used dense UWSNs. In many cases, it has been observed that the performance of the algorithm deteriorates under an increasing number of nodes. In a real-world UWSN scenario, the node density can be high. Therefore, such routing protocols are not effective. This issue requires further research, to ensure that the performance of routing protocols is robust against node density variation.

VII. CONCLUSION
Routing for UWSNs is one of the most crucial issues in underwater applications. In RL, the efficiency of a system increases with experience and time. This capability of RL algorithms has been widely exploited in different wireless networking scenarios. RL has also been shown to significantly improve the design of routing protocols for UWSNs. In this article, we present an extensive survey of RL-based underwater routing protocols. The methods are discussed, and their advantages, disadvantages, and suitable application environments are presented. The reviewed protocols are further compared in terms of their key ideas, RL mechanisms, optimization parameters, and evaluation techniques. The applications of RL are also separately compared for all protocols. For future researchers, the research gaps and areas requiring critical improvement are emphasized as open research issues. The analysis, discussion, comparison, and future research directions highlighted in this investigation will provide UWSN researchers with an in-depth overview of existing routing protocols.