
An underwater sensor network (UWSN) is a wireless network that is deployed in oceans, seas, and rivers for real-time exploration of environmental conditions. The network is used to measure temperature, pressure, water pollution, oxygen level, volcanic activity, floods, and water streams. Although radio frequency (RF) is widely utilized in wireless networks, it is incompatible with the UWSN environment; therefore, other communication mechanisms have been employed to manage the underwater wireless communication among sensors, such as acoustic channels, optical waves, or magnetic induction (MI). Unlike terrestrial wireless sensor networks, UWSNs are dynamic, and sensors move according to water activity. Therefore, the network topology changes rapidly. One of the most critical challenges in UWSNs is how to collect and route the sensed data from the distributed sensors to the sink node. Unfortunately, the direct application of efficient and well-established terrestrial routing protocols is not possible in UWSNs. In this work, a balanced routing protocol based on machine learning for underwater sensor networks (BRP-ML) is proposed that considers the UWSN environmental characteristics, such as power limitations and latency, while considering the void area issue. It is based on reinforcement learning (Q-learning), which aims to reduce the network latency and energy consumption of UWSNs. The communication technique in the proposed protocol is based on the MI technique, which has many advantages, such as steady and predictable channel response and low signal propagation delay. The simulation findings validated that BRP-ML reduced latency by 18% and increased energy efficiency by 16% compared to QELAR.

INDEX TERMS Underwater sensor network, routing protocol, reinforcement learning, network lifetime.


I. INTRODUCTION
Underwater sensor networks (UWSNs) have recently attracted industry and research community attention due to their broad application areas, such as resource discovery, disaster avoidance, auxiliary navigation, and military purposes. Underwater acoustics, however, is not new. It was first studied in the early 1800s, and the first practical application came in the early 1900s [1], when it was used on ships to increase navigational safety and receive bell signals. Then, between the 1920s and 1930s, researchers started understanding the basic concepts of underwater sound. By the Second World War, the United States had equipped its ships with communication systems to determine seabed depth and detect objects miles away [2].
Consequently, the need for UWSNs has emerged. A UWSN is a wireless network that utilizes a set of sensors and autonomous underwater vehicles (AUVs) to sense and collect data underwater. These underwater sensors deliver the data to a surface sensor (sink node) directly or indirectly. Then, the data are transmitted from the sink nodes to an offshore monitoring center (base station) for analysis. UWSNs have become a popular research field because of their numerous applications and their significance to the community: the increasing demands for environmental marine surveillance, exploration of marine resources, scientific marine research, and marine protection call for practical underwater data acquisition techniques.

II. LITERATURE REVIEW
Recently, UWSNs have attracted the attention of researchers due to the widespread demand for understanding the underwater environment. In [11], a Q-learning-based routing protocol (QELAR) was suggested to extend the lifetime of acoustic UWSNs. In this protocol, each node exchanges its metadata with neighboring nodes. When a packet is sent, the node attaches the metadata to the packet. Other nodes can then overhear the traffic and extract the metadata; a node that is not the eligible forwarder drops the packet. If the receiver node is the eligible forwarder, it calculates the Q-value and selects the next forwarder. The QELAR reward function depends on the residual energy only, which means that it will always choose nodes with the highest residual energy regardless of the delay cost. Therefore, if the number of nodes increases, longer paths with a greater number of hops will be created, which causes more delay and energy consumption. This protocol also provides a mechanism to detect transmission failure. The authors of [12] proposed a machine learning-based routing algorithm (QDAR) to extend the lifetime of UWSNs while considering the delivery latency. Two types of packet structures are used in QDAR: the data-ready packet and the interest packet. The data-ready packet is transmitted from the source node to the sink node and contains the node's necessary information. The interest packet is transmitted from the sink node to the source node to determine the routing path. The algorithm consists of five phases. The first phase is the data-ready phase, where the node's data are collected to plan the routing path and the data-ready packet is sent: the source node sends a broadcast to the sink node requesting a routing path. The second phase is the routing decision phase, where the QDAR algorithm determines the routing path. The third phase is the interest phase, where the sink node sends the interest packet to the source node following the output path of the QDAR algorithm. The fourth phase is the package forwarding phase, where data are sent from the source node to the sink node according to the set path. The final phase is the acknowledgment phase, where path reliability is checked. If the source node does not receive an acknowledgment (ACK) message, it returns to the first phase. Otherwise, it loops between the fourth and fifth phases. Once the path is determined, the following packets coming from the same node follow the same path until the path fails. In the simulation, the latency was reduced compared with other routing protocols, but the network lifetime was reduced to achieve this latency reduction. This behavior was caused by a mechanism that allowed the source node to use the same route repeatedly until failure, which caused node exhaustion and thus a reduction in the network lifetime.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2021.3126107, IEEE Access
In [13], the study aimed to increase the network lifetime and decrease the transmission delay by using MI. Reinforcement learning (Q-learning) is applied to create multihop paths according to transmission time and energy consumption. Only one node at a time can send data to its cluster head. Two algorithms are used to apply the protocol. The first proposed algorithm goes through an initialization phase to initialize some variables. Then, it enters a loop where it computes an H-table for each node that determines the next possible hops. If the node has accessible hops, it computes the Q-table values and the reward based on the distance-based path (Dhop) and energy-based path (Ehop). Finally, it updates the Q-table and the Q-value for the next state. The second proposed algorithm is used to find an optimal path. The suggested algorithm outperformed the other algorithms in terms of energy consumption, throughput, and lifetime performance. In contrast, the transmission delay was not very low. A higher throughput was expected since the data rate was higher with MI communication. The results of the simulations were compared with acoustic-based protocols, which tend to have low propagation speed and high latency. Their transmission delay was therefore higher, as expected. The authors of [14] considered an underwater optical network that suffers from a low delivery ratio and high energy consumption. They suggested using a multiagent reinforcement learning protocol (MARL) to allow more information exchange among nodes. It was assumed that a single-agent approach would focus only on its own state, causing the residual energy to be unevenly distributed in the network. Furthermore, since the network used optical communication, the data link was more vulnerable than acoustic links. Thus, the link quality was considered while designing this protocol.
The protocol works as follows: first, it initializes the routing table and sends a broadcast packet periodically every communication routing step to determine the states of the neighboring nodes and update the routing table. Then, it calculates the reward function based on the link quality of the possible next hop and all nodes' residual energy. After that, it updates the Q-value and V-value and then tries to choose the next hop with the highest V-value. In the simulation, MARL used a small number of relay nodes, which reduced the convergence time. Although the proposed protocol offered less broadcast time and stronger adaptability to water dynamics, it chose a fast route that caused more energy consumption. [15] focused on extending the network lifetime and balancing the residual energy of nodes while solving the void node problem. In this protocol, each node is considered an agent. Moreover, the node's behavior must be optimized according to the reward function. The suggested reward function attempts to increase the network lifetime and balance the consumption of energy. To solve the void node problem, this study used a mechanism called the adjacent node technique (ADN) to choose a trained and optimal node that is near the source node. Each node that faces the void problem should select another optimal path that maximizes and satisfies its reward function. After using Q-learning, the performance is enhanced in terms of the energy tax. The energy consumption has been reduced even with a large network radius. However, the network lifetime has decreased when using the ADN mechanism. In [16], the authors proposed Q-learning combined with a deep neural network (DQELR) to consider multiple metrics, such as network lifetime, node mobility, globally optimal paths, energy consumption, and end-to-end latency, in underwater acoustic sensor networks. It adopted two kinds of neural network training on routing decisions: off-policy routing and on-policy routing. 
The off-policy (offline) training is executed before the network is deployed underwater, where the network topology and node state are known and saved into an experience pool. Then, after the network is deployed underwater, on-policy (online) training starts. The experience pool of the off-policy training is used in the on-policy training to help make better decisions. Each node must store neighbor information and its Q-values with every neighbor, depth, residual energy, and the parameter variation (w) of the neural network. DQELR uses an asynchronous strategy to update w, meaning that when a new loss value is found, w is not updated instantly. The old value of w is stored and accumulated with the loss gradient and updated after a particular time, which achieves the neural network's requirement for uncorrelated inputs. The multilayer perceptron model used in the suggested protocol consists of an input layer, an output layer, and three hidden layers. The hyperbolic tangent (tanh) function is implemented as the activation function. In the simulation, DQELR achieved a high energy efficiency and network lifetime. Its end-to-end latency, however, was higher than that of other protocols. Additionally, its packet delivery ratio was higher than those of other protocols for a low packet generation rate (λ). When λ is increased to 0.05, the packet delivery ratio of DQELR decreases compared to those of other protocols. In the proposed protocol, each node can send packets, which causes more frequent packet collisions. Consequently, it improves the network efficiency only for low values of λ. In [17], a Q-learning-aided ant colony routing protocol (QLACO) was proposed, which differs from the other works because it used ant colony optimization (ACO) with reinforcement learning. The path is selected based on the reward function and the critical ants. The architecture of the network is composed of several surface sinks, many sensor nodes, and several AUVs.
The AUV travels and gathers the data from the sensors. Routes are discovered by artificial ants, and then the Q-table is updated. There are two main concepts introduced in this phase: forward ants (FANTs) and backward ants (BANTs). Each node periodically maintains a Q-table by sending FANTs and broadcast messages. The next hop of packets chosen by FANTs should satisfy the highest Q-value. FANTs collect destination node information before reaching the node location. As a result, the sink has an overview of the network to determine the optimal path. Finally, FANTs change to BANTs when reaching the destination node and return to the source node. The data of all nodes located on the return path are collected to calculate the Q-value by BANTs. Then, the reward function is calculated based on residual energy, time delay, and transmission delay. QLACO was compared with two approaches based on measuring time delay, delivery ratio, and energy consumption in the simulation. The overall performance of QLACO was better than that of the QELAR and the depth-based protocol (DBR). However, the QLACO did not mention how the AUVs fit in the algorithm and how the tasks are distributed among them.
Although the discussed algorithms achieve good performance, there remains room for improvement. To overcome their limitations and find an appropriate balance between energy consumption and delay, this study proposes an efficient machine learning-based routing protocol. This study considers the void region issue while enhancing the performance of 3D UWSNs by using MI underwater communications, a promising new technology. The suggested protocol uses two machine learning algorithms: an unsupervised machine learning algorithm is applied for clustering, and reinforcement learning is applied for routing. The protocol is partitioned into four phases, which makes it adaptive and flexible. The performance analysis is conducted through simulations and experiments. The simulation results show that BRP-ML can achieve high delivery rates, shorter delays, and a longer network lifetime.

III. BACKGROUND
The UWSN architecture consists of sensor nodes (underwater or at the water surface), sink nodes, and, where present, AUVs. There can be one or more sink nodes. If the network has multiple sinks, then sensing nodes have alternative paths along which they can send data packets. The sensor node architecture contains an energy management unit and power supply, a CPU, a communication module that implements the chosen communication method, a sensing module, data storage for the sensed data, and a depth control component, which is a measuring system [18].

A. COMMUNICATION METHOD
MI is based on Faraday's law of induction and uses two wire coils to exchange data [19]. The modulated sinusoidal current in the transmitting coil (TC) generates a time-varying magnetic field in space, which in turn induces a sinusoidal current in the receiving coil (RC). The data are subsequently recovered by demodulating the induced current. There is no need to equip the receiver with a power source in MI communications. To transmit the data, the magnetic field must be altered according to the waveforms that carry the data. The coils' radiation resistance is far less than that of the electric dipole. In other words, only a small amount of energy is radiated across the channel. Therefore, in MI waves, multipath fading is not a concern. In addition, the magnetic permeability underwater is the same as that above water and is not affected by water quality, which often changes with the region, time, and depth. Thus, the MI channel behavior is more predictable and steadier than that of the previously mentioned techniques [20].
Because MI waves travel much faster than acoustic waves, they significantly reduce the underwater communication delay and provide timely data transmission. A shorter delay improves the design and deployment of underwater communication protocols such as localization, routing, and medium access control (MAC). In addition, the synchronization of the physical layer between wireless devices is more reliable and easier to achieve due to the stable channel response and the small delay of the MI waves. Moreover, MI waves are neither visible nor audible [21]. Consequently, MI facilitates energy-saving and secure communication between wireless devices that can serve military and civil objectives and other applications. Therefore, MI can be utilized for many underwater applications, such as long-term underwater monitoring, military purposes, disaster detection, and gas or oil leakage detection. Table I illustrates a summary comparing optical, acoustic, electromagnetic (EM), and MI technologies based on several characteristics.
To accomplish energy-efficient underwater communication, an accurate MI channel model must be built. The energy consumption model contains two main parts: transmitting and receiving power. Since MI communication is used, the MI transmitter is modeled as the primary coil and the MI receiver as the secondary coil of a transformer. The primary coil works at a low frequency and aims to increase the field strength and enhance the magnetic moment. The power used in the primary coil loop is equivalent to the transmitting power, and the power used in the load impedance is equivalent to the receiving power [22]. The transmitting power Pt for a single hop is the real part of the complex number from (1). The notations used in the equations of this section are defined in Table II.
The receiving power for a single hop depends mainly on the data processing power plus the receive power, where the energy consumption Ereceive per receive operation is expressed as in [23]. The delay for each packet in a single hop is likewise given in [23]. Since our method follows a multihop path, the distance between the source node n and the destination node (sink node) is the sum of the distances of the hops in between:

$$D_{ns} = \sum_{i=1}^{k} D_{nm}(i)$$

Then, the total delay is the sum of the multiple single-hop delays T(Dns) and the queuing delay [13].

B. INTELLIGENT ROUTING PROTOCOLS
Routing in UWSNs is a challenging issue. There are special methods of routing applied only in terrestrial wireless networks. These routing methods are based on reinforcement learning, ACO, fuzzy logic, genetic algorithms, or neural networks [24].
• Reinforcement learning-based method: This is the most popular method applied for UWSNs, as discussed in the previous section. Reinforcement learning is easy to implement and adapts to topology variations, which is the main reason why it is the most common method used for distributed problems.
• ACO-based method: The ACO algorithm imitates an ant's behavior, where each ant leaves a trace after it walks that makes the coming ants follow the most visited path [17]. This method is a common routing solution for wireless sensor networks. However, balancing convergence and avoiding prematurity must be considered when applying this method.
• Fuzzy logic-based method: In traditional logic, an element value can be represented by 1 or 0, where 1 is true and 0 is false. In fuzzy logic, an element can be partially true or false by a certain value between 0 and 1 [25]. This method can be applied to routing optimization and achieve multiple criteria simultaneously. However, it can produce nonoptimal solutions and cannot easily adapt to topology variations.
• Genetic algorithm-based method: The genetic algorithm is an approach used to solve constrained and unconstrained optimization problems using the principle of natural selection [26]. This method uses a fitness function to obtain a value that represents the solution efficiency. It can work with multiple objective optimization problems, but it is computationally expensive and requires many resources.
• Neural network-based method: A neural network is a set of algorithms that attempts to detect a relationship between data based on a procedure that mimics how the human brain works. Neural networks consist of three types of layers: input, hidden, and output layers. Flexibility and scalability are the main features of neural networks when applied to UWSNs. In addition, they can work with multicriteria objectives. However, the approach is known to be computationally expensive and time-consuming [16].

C. REINFORCEMENT LEARNING
Machine learning, a subfield of artificial intelligence, builds systems that learn from data and produce a model that predicts outcomes over time [27]. Reinforcement learning is a machine learning approach in which a model is trained to take actions and make choices. The difference between reinforcement learning algorithms and classic dynamic programming is that reinforcement learning targets large Markov decision processes (MDPs) and does not assume knowledge of the exact mathematical model of the MDP. Figure 1 demonstrates the classic reinforcement learning framework.

FIGURE 1. Reinforcement Learning Framework [28]
a) Reward Rt: This is a numerical value that is received by the agent for taking action to move from one state to another in the environment [29].
b) Policy π: This defines the agent's behavior at a given time.
c) Discount factor (rate) γ: This factor is used in the future cumulative reward (return) equation to define the return for infinite series [30].
d) Agent: The agent's goal is to obtain a policy π that maps states to actions optimizing any long-run reinforcement measure. At a given time t, the agent observes the environment, and using some policy, it decides how to take action [31]. Then, it selects an action from a set of available actions. After that, the environment replies by a reward or a penalty and moves to a new state St+1. The agent tries to take actions that achieve the highest total reward (value or return). As a function of history, the agent can select any action and modify the following policy. It repeats until it reaches the optimal policy.
e) Bellman equation: The Bellman equation is useful for reinforcement learning to obtain the value of the current state by knowing the value of the next state. This information assists in finding the optimal Q-function q* and therefore finding π*. In this equation, Q(s, a) is the expected return starting from state s by selecting action a following π, Rt is the total expected reward for following π in state s selecting action a, and max_{a′} Q(s′, a′) is the maximum expected discounted return for any s′ selecting action a′.
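For reference, the quantities defined in item e) fit the standard textbook form of the Bellman optimality relation for the Q-function. The block below is a sketch in that standard notation, assuming the usual discounted formulation; it is not necessarily the exact expression typeset by the authors.

```latex
Q(s, a) = R_t + \gamma \max_{a'} Q(s', a')
```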

D. Q-LEARNING
Q-learning is an off-policy learner: it learns the value of the optimal policy independently of the actions the agent actually takes. The process of repeatedly updating the Q-values for every state-action pair with the Bellman equation, until they converge to the optimal Q-function, is known as value iteration. The Q-values are stored in the Q-table. To choose the right action, a balance must be struck between exploration and exploitation. The epsilon-greedy strategy is used to find that balance. When ϵ = 1, the action taken depends only on exploration, and ϵ decreases as new episodes come. ϵ is usually updated when an episode finishes. At each time step, a random number between 0 and 1 is generated to determine whether the agent chooses exploitation or exploration [32]. An update rule for the Q-values is derived from the Bellman equation. After several iterations, the Q-function converges to the optimal Q-function (Q*) [33].
In the update rule, κ is the learning rate. It is a value between 0 and 1 that defines the extent to which the new Q-value overrides the old Q-value. The agent adapts quickly to the new Q-value if the learning rate is high. The process of the Q-learning algorithm is iterative. It begins by creating the Q-table. The learning rate, discount rate, number of episodes, number of steps in each episode, and epsilon must be initialized to specific values. Then, an action is chosen using the epsilon-greedy strategy described above. After that, the reward is calculated, and the Q-table is updated using the Q-function. The algorithm repeats these steps, except for creating the Q-table, until the episode finishes.
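The iterative procedure described above can be illustrated with a minimal sketch of tabular Q-learning. It assumes the standard update Q(s, a) ← (1 − κ)·Q(s, a) + κ·(r + γ·max Q(s′, a′)), which may differ in detail from the authors' exact rule. The five-state chain environment, its rewards, and the names `epsilon_greedy`, `q_update`, and `train_chain` are hypothetical and chosen only for illustration.

```python
import random


def epsilon_greedy(q_row, epsilon, rng):
    """Explore with probability epsilon, otherwise exploit the best-known action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return q_row.index(max(q_row))


def q_update(q_table, s, a, reward, s_next, kappa, gamma):
    """Standard tabular update: blend the old Q-value with the Bellman target."""
    target = reward + gamma * max(q_table[s_next])
    q_table[s][a] = (1 - kappa) * q_table[s][a] + kappa * target


def train_chain(n_states=5, episodes=300, max_steps=50,
                kappa=0.5, gamma=0.9, seed=0):
    """Toy chain (assumption): action 1 moves right, action 0 moves left,
    and reaching the last state ends the episode with reward 1."""
    rng = random.Random(seed)
    q_table = [[0.0, 0.0] for _ in range(n_states)]
    epsilon = 1.0  # start with pure exploration
    for _ in range(episodes):
        s = 0
        for _ in range(max_steps):
            a = epsilon_greedy(q_table[s], epsilon, rng)
            s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
            reward = 1.0 if s_next == n_states - 1 else -0.01
            q_update(q_table, s, a, reward, s_next, kappa, gamma)
            s = s_next
            if s == n_states - 1:
                break
        epsilon = max(0.05, epsilon * 0.98)  # decay once per finished episode
    return q_table


q = train_chain()
# Greedy action for every non-terminal state after training.
greedy_policy = [row.index(max(row)) for row in q[:-1]]
```

In this toy setting, the greedy policy read from the trained Q-table moves right in every non-terminal state, which is the shortest route to the goal.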

IV. METHODOLOGY
One of the most critical challenges in UWSNs is how to collect and route the sensed data from the distributed sensors to the sink node. The proposed protocol provides efficient routing that considers UWSN environmental characteristics, such as power limitations and latency. This is achieved with a machine learning method that applies reinforcement learning-based (Q-learning) routing to reduce the network latency and energy consumption of UWSNs. The BRP-ML method aims to route the sensed data from the source node to the sink node using a route that consumes less energy and incurs less delay, which is the route that has the maximum Q-value. The design of the proposed routing protocol consists of four phases: the initialization phase, discovery phase, clustering phase, and data forwarding phase.
i. Initialization phase: In this phase, the nodes are deployed in their configured locations, and the necessary initializations are set, such as the routing tables, Q-tables, node locations, and initial node energy.
ii. Discovery phase: In this phase, each node sends broadcast messages to other nodes to update their tables. This information exchange is useful for the clustering phase and the data forwarding phase. Each node that receives a broadcast message calculates the distance between the sending node and itself.
iii. Clustering phase: In the clustering phase, the optimal number of clusters is chosen, and then the nodes are clustered using a suggested clustering algorithm. Cluster head and edge node selection are performed in this phase. Again, a broadcast message must be used to inform the other nodes about the clustering results. Although this could be considered communication overhead, it is necessary for dynamic topologies. This information exchange could decrease unnecessary data forwarding and balance the energy consumption among nodes.
iv. Data forwarding phase: After the clustering process, each node knows its cluster ID and its type, whether it is a normal node, cluster head, or edge node. When a normal node has sensed data, it transmits the sensed data to the cluster head, and then the cluster head transmits them to one of the other cluster heads to receive a reward. The reward is calculated using the equations discussed in section IV, and the Q-values in the Q-table are updated. Based on the updates, the node chooses a route using a policy. The void area mechanism (VAM) is applied in this phase to address the void area problem. When the cluster head has collected all data from its cluster's nodes, it can transmit them to another cluster head, edge node, or sink node. It chooses the node that has the highest Q-value. The edge node reduces the load on the cluster head by giving it a closer node to transmit to, which decreases the total energy consumption at the cluster head. Furthermore, the edge node can transmit the data to other cluster heads, edge nodes, or the sink node. Figure 2 shows the protocol overview.
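The forwarding rule of the data forwarding phase can be summarized in a short sketch. It assumes that a node's candidates and their Q-values are already available from the discovery and clustering phases; the function name `choose_next_hop`, the string labels for node types, and the dictionary layout are hypothetical and only illustrate the decision, not the authors' implementation.

```python
def choose_next_hop(node_type, cluster_head_id, candidate_q):
    """Return the ID of the node that should receive the packet next.

    node_type: "normal", "cluster_head", or "edge" (hypothetical labels).
    cluster_head_id: ID of the cluster head of the sending node's cluster.
    candidate_q: dict mapping candidate IDs (other cluster heads, edge nodes,
                 or the sink) to their current Q-values.
    """
    # A normal node always hands its sensed data to its own cluster head.
    if node_type == "normal":
        return cluster_head_id
    # Cluster heads and edge nodes pick the candidate with the highest Q-value.
    if not candidate_q:
        return None  # no reachable candidate, for example next to a void area
    return max(candidate_q, key=candidate_q.get)


# Example: a cluster head choosing among another cluster head, an edge node,
# and the sink.
next_hop = choose_next_hop("cluster_head", "CH1",
                           {"CH2": 0.4, "E1": 0.7, "sink": 0.6})
```

Returning `None` when no candidate exists leaves room for a recovery step such as the void area mechanism.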

A. VOID AREA MECHANISM (VAM)
The void area is an area without nodes or an area in which the nodes have drained their batteries. When a node next to a void area has a packet to send, there are no neighboring nodes in the direction of the sink that can receive it. To solve this problem, a mechanism named VAM is suggested. Assume node A has a packet to transmit, and there are no neighboring nodes between the sink and node A, although some other routes through other nodes could handle the data. Using VAM, node A transmits the data to the node with the highest Q-value, starts a timer, and waits for an ACK message from the chosen node. Assume that node A chooses node C. If node A receives an ACK message from node C, no action is needed, since this means that the path to the sink node can go through node C and there is no intermediate void area. If the timer expires and no ACK message has returned to node A, the Q-value of node C is reduced, and an alternative path is chosen to deliver the packet. Because nodes with low Q-values have a low probability of being selected, the routes that run into void areas are excluded by this mechanism. Figure 3 is an illustration of the discussed case.

B. PACKET STRUCTURES
Three types of packets are used, intended to facilitate the information exchange among nodes and properly achieve both the clustering and routing processes. They are broadcast packets, data packets, and ACK packets. The broadcast packet contains the packet ID, node type (cluster head, edge node, normal node), timestamp, source node ID, cluster ID, residual energy, and max Q-value. The broadcast packet is used to exchange information among nodes and keep that information up to date. The broadcast is among all live nodes, and it is performed periodically. Each time the clustering process is executed, broadcasts must be exchanged. The data packet includes the ID, type, timestamp, previous node ID, previous node max Q-value, previous node residual energy, destination node ID, and data.
The previous node information in data packets is the information of the previous hop chosen. The unicast data packet is used to transfer the sensed data to the sink node using a multihop path. Having different types of packets benefits the routing protocol. To accomplish the clustering process, only broadcast packets are used. Then, for the routing process, both types of packets are used. Finally, the ACK packets are used to ensure that the data packets reach the destination. The ACK packets can be sent from any node in the network. Each time a node transmits a data packet, it waits for an ACK packet from the destination node to guarantee that the data packet is received. The average round trip time is calculated and used as an ACK expiry time. Table III summarizes the details of the packet structures used.
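The packet fields listed above and the ACK-timeout rule of the void area mechanism can be sketched together as follows. The field names mirror the listed contents, but their types are assumptions. The ACK packet is omitted because its fields are not listed in the text. The `send` callback stands in for transmitting a packet and waiting for the ACK until the expiry time, and the `penalty` value is hypothetical, since the amount by which the Q-value is reduced is not stated.

```python
from dataclasses import dataclass


@dataclass
class BroadcastPacket:
    packet_id: int
    node_type: str          # "cluster_head", "edge", or "normal"
    timestamp: float
    source_id: str
    cluster_id: int
    residual_energy: float
    max_q: float


@dataclass
class DataPacket:
    packet_id: int
    packet_type: str
    timestamp: float
    prev_node_id: str       # previous hop chosen
    prev_max_q: float
    prev_residual_energy: float
    dest_id: str
    data: bytes


def forward_with_vam(q_values, send, penalty=0.5):
    """Try candidates in descending Q-value order until one returns an ACK.

    q_values: dict mapping node ID to Q-value; updated in place.
    send: callable(node_id) -> bool, True if an ACK arrived before the timer
          expired (stand-in for the real transmission and timer).
    penalty: assumed amount subtracted from the Q-value after a timeout.
    """
    tried = set()
    while len(tried) < len(q_values):
        candidate = max((n for n in q_values if n not in tried),
                        key=q_values.get)
        tried.add(candidate)
        if send(candidate):
            return candidate            # path through this node is usable
        q_values[candidate] -= penalty  # timeout: make this node less likely
    return None                         # every candidate timed out


# Example: node C has the highest Q-value but sits next to a void area,
# so only node B acknowledges.
q_values = {"B": 0.4, "C": 0.9}
chosen = forward_with_vam(q_values, send=lambda node_id: node_id == "B")
```

After the call, the packet has been delivered through node B and the Q-value of node C has been lowered, so later packets are less likely to be sent toward the void area.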

C. CLUSTERING PROCESS
The clustering process is the method used to divide any data points or population into groups such that the points in the same group have similar traits. In UWSNs, the communication interference is known to be high, and by using clustering, the interference can be reduced. Furthermore, the energy consumption is more balanced, which extends the network lifetime. The network is divided into groups or clusters, where each cluster has a cluster head. The sensors communicate only with their cluster head. Then, the cluster head transfers the aggregated data to the sink node either by a multihop path or a single hop. This process enhances the performance of UWSNs [34].
K-means is a partitioning-based algorithm. It splits the nodes based on centroids where the similarity among clusters depends on the node closeness to the cluster centroid [35]. It is a common algorithm because it is easy to implement and has low computational complexity and low memory consumption [36].
K-means++ is an enhanced version of the K-means algorithm that addresses the poor initialization problem [37]. It is a simple algorithm that can provide more accurate results than the standard K-means.
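To make the initialization idea concrete, the following sketch shows the standard K-means++ seeding rule for points in 3D, in which each new centroid is drawn with probability proportional to the squared distance to the nearest centroid chosen so far. It illustrates the generic algorithm only; the function names and the optional `rng` argument are assumptions, and the sketch does not claim to reproduce the modified version used in BRP-ML.

```python
import math
import random


def kmeanspp_seeds(nodes, k, rng=None):
    """Pick k initial centroids from a list of (x, y, z) tuples."""
    rng = rng or random.Random()
    centroids = [rng.choice(nodes)]  # first centroid: uniformly at random
    while len(centroids) < k:
        # Squared distance from every node to its nearest chosen centroid.
        weights = [min(math.dist(p, c) for c in centroids) ** 2 for p in nodes]
        if sum(weights) == 0:
            break  # every node coincides with a centroid; nothing left to pick
        # Nodes far from all current centroids are more likely to be chosen;
        # already chosen nodes have weight zero and cannot be drawn again.
        centroids.append(rng.choices(nodes, weights=weights, k=1)[0])
    return centroids
```

For example, `kmeanspp_seeds([(0, 0, 0), (0, 0, 1), (100, 100, 50), (100, 100, 51)], 2)` returns two distinct nodes, and the second one is very likely to come from the group that does not contain the first.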
BRP-ML uses an adaptive clustering technique to adapt to network changes. In the network, the sensor nodes are clustered, where each cluster has a cluster head and an edge node. The cluster head is responsible for aggregating the data packets from the other nodes in the cluster and transmitting them to the sink using a multihop path. The edge node assists the cluster head in aggregating and transmitting the data. By using clustering, most nodes transmit the data using short hops, which consumes less energy and extends the network lifetime. Clustering in UWSNs is usually performed in three steps: define the number of clusters (k), perform the clustering process, and assign the cluster head and edge node. The clustering process is the procedure of grouping the nodes into k clusters. This process can be repeated when the number of alive nodes decreases. As mentioned, K-means++, modified to suit the nature of the underwater environment, is applied to form the clusters. In the underwater three-dimensional coordinate system, each node has a location vector (x, y, z). Assume that node A has coordinates (x1, y1, z1) and node B has coordinates (x2, y2, z2). Then, the distance between two nodes A and B in a 3D system is
$$d(A, B) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2}$$
i. Prerequisites and Assumptions
1. The shapes of the clusters are not important to our algorithm, which means that it does not depend on a particular cluster feature.
2. Since MI is used as a communication method in this work, the received magnetic field strength (RMFS)-based localization technique is applied. It utilizes a tridirectional antenna to increase the estimation accuracy [38]. Therefore, each node can obtain its three-dimensional coordinates.
3. The location of each node is configured before the deployment, and the nodes can adjust their locations following the configuration.
4. Sensor nodes are deployed randomly in a 3D coordinate system.
5. All sensor nodes have the same initial limited energy, except the sink node, which has an unlimited power supply.
6. Three types of nodes are used: normal nodes that sense and transmit data to a cluster head using a single hop; a cluster head, which is responsible for aggregating the data from normal nodes and sending data to the sink node using a multihop path; and the edge node, which is used as a cluster head assistant.
7. Each cluster has one cluster head, one edge node, and multiple normal nodes.
ii. Choosing the Optimal Number of Clusters (k)
Neither the K-means nor the K-means++ algorithm specifies the number of clusters (k), which should be assigned prior to the clustering process. It is important to choose the optimal number of clusters since it affects the network lifetime. Having many clusters could increase the communication overhead, and having few clusters causes the formation of large clusters, which increases the power consumption. Therefore, BRP-ML uses a combination of two methods to obtain an accurate optimal value of k. The first method is the silhouette method, which calculates a silhouette coefficient that indicates how well the clustering algorithm performs.
$$s = \frac{d_2 - d_1}{\max(d_1, d_2)}$$
where d2 is the mean distance to the nearest cluster, and d1 is the mean intracluster distance for each data point. After selecting random values of k, the silhouette coefficient is calculated for each of them, and the k with the highest coefficient is the optimal value [39]. Although this method gives a clear numeric outcome, its time complexity is very high if it is calculated for each possible k, and with high-dimensional problems, it takes more time to converge.
The other method is the elbow method, which calculates the within-cluster sum of squared errors (WSS) for each k. WSS is the mean of the squared distance between each data point and the closest cluster center. WSS can be visualized as a curve, and the optimal value of k is the elbow of the curve [40]. This method is faster than the silhouette method, but the elbow can sometimes be ambiguous. The number of clusters can vary from 2 to (n-1)/2, where n is the number of data points.
The elbow method is used first as a decision rule. If the value of k is obvious, the rest of the clustering algorithm continues to execute. If the value of k is vague, the silhouette method is applied as a validation method only for the doubtful values of k. Note that the most doubtful value is the largest value at which distortion declines. This step ensures that k is the optimal value while reducing the time complexity and avoiding too few clusters, which reduces energy consumption. Algorithm 1 shows how the optimal value of k is obtained.
Algorithm 2 shows the steps of the clustering process, which applies K-means++ with modifications for 3D environments. The algorithm requires the value of k, which is the output of Algorithm 1. The first centroid is selected randomly. The subsequent centroids are selected with probability
$$P = \frac{D(x')^2}{\sum_{x \in X} D(x)^2}$$
where D(x) is the distance between node x and the previously chosen centroid c, and D(x') is the distance between the previously chosen centroid c and the new centroid ci+1. After that, the previous steps are repeated until k centroids are reached. Finally, the algorithm follows the standard K-means steps that assign each data point to the nearest centroid.
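The two scores used earlier in this subsection to choose k can be sketched as below. This is an illustration under stated assumptions, not Algorithm 1 itself: the function names are hypothetical, WSS is computed here as a plain sum of squared distances, singleton clusters are given a silhouette of 0 by convention, and at least two clusters with non-coincident points are assumed.

```python
import math
from collections import defaultdict


def wss(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))


def silhouette(points, labels):
    """Mean silhouette coefficient (d2 - d1) / max(d1, d2) over all points."""
    clusters = defaultdict(list)
    for p, l in zip(points, labels):
        clusters[l].append(p)
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # convention for a cluster with one member
            continue
        # d1: mean distance to the other members of the point's own cluster.
        d1 = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # d2: smallest mean distance to the members of any other cluster.
        d2 = min(sum(math.dist(p, q) for q in other) / len(other)
                 for m, other in clusters.items() if m != l)
        scores.append((d2 - d1) / max(d1, d2))
    return sum(scores) / len(scores)
```

In the decision rule described above, `wss` would be evaluated for every candidate k to locate the elbow, and `silhouette` would be evaluated only for the doubtful candidates, keeping the one with the highest score.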
Algorithm 2 Clustering Process
Input: All nodes X = {x1, ..., xn}, where n is the number of nodes; value of k.
Output: k clusters
1. A random centroid c1 is selected, where c1 ∈ X.
2. For j = 1 to j = k-1
3.   For i = 1 to i = n
4.     Distance between xi and previously chosen centroid ci is calculated using (6)
5.   A new centroid ci+1 = x' ∈ X = {x1, ..., xn} is selected with probability P
6. Repeat
7.   For i = 1 to i = n
8.     For j = 1 to j = k
9.       Calculate the distance between xi and cj
10.    Assign xi to the nearest c.
11.  Update the cluster center to the average location.
12. Until convergence
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2021.3126107, IEEE Access
iii. Cluster Head and Edge Node Assignment
In BRP-ML, there must be a node that is responsible for aggregating the data from the cluster nodes and transmitting them to the base station by a multihop route. Two factors are considered when assigning cluster heads. First, the cluster head must be the node that has the maximum residual energy because it performs an energy-consuming task. Second, it must be the closest cluster member to all nodes within the cluster. The edge node is assigned after the clustering process. Normally, in each cluster, there is one cluster head and multiple normal nodes. The cluster head collects the data from its cluster members and forwards the data to the sink by multiple hops through other cluster heads. However, in our approach, an edge node is used to reduce the load on the cluster heads. Therefore, each cluster includes the following node types:
• Normal nodes: Sense data and transmit them to the cluster head.
• Cluster head: Aggregates the data from normal nodes and transmits them to the sink node through other cluster heads or edge nodes.
• Edge node: Receives the data from cluster heads and transmits them to the sink node using a multihop path through edge nodes or other cluster heads.
In the routing phase, the Q-learning algorithm can forward the data packets through either cluster heads or edge nodes. There are two factors for selecting an edge node. First, the edge node must be the node with the maximum residual energy after the cluster head within the cluster. Second, it must be the nearest node among the cluster nodes to the sink. After the clusters are formed, cluster heads and edge nodes are assigned. The nodes send broadcast messages indicating the type of node, status, and cluster number.
Usually, the cluster head consumes more energy than normal nodes, which shortens the cluster head lifetime. In some cluster head assignment approaches, a node remains the cluster head, and no other node is assigned, until its energy drains. In contrast, our proposal is based on a reselection approach, where cluster heads change periodically. This means that when the residual energy of a cluster head becomes less than a certain threshold, another node is selected as the cluster head automatically using the same factors described previously. The new cluster head broadcasts a packet notifying other nodes of the change. Applying this process prolongs the network lifetime. The following rule shows the threshold value ETh, where Eave is the average residual energy of within-cluster nodes: ETh = Eave.
Algorithm 3 shows the steps of selecting a cluster head. The edge node selection process is the same as the cluster head selection process except that the edge node selection rule is different, noting that the complexity of both algorithms is O(n).
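The selection factors and the reselection threshold described above can be sketched as follows. Since Algorithms 3 and 4 are not reproduced in this text, the way the two factors are combined is an assumption: residual energy is compared first, and distance is used only to break ties. The `Node` class and all function names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Node:
    node_id: int
    pos: Tuple[float, float, float]
    energy: float  # residual energy


def select_cluster_head(cluster):
    """Highest residual energy; ties go to the node closest to the other members."""
    def mean_dist(n):
        others = [m for m in cluster if m is not n]
        return sum(math.dist(n.pos, m.pos) for m in others) / max(len(others), 1)
    return max(cluster, key=lambda n: (n.energy, -mean_dist(n)))


def select_edge_node(cluster, head, sink_pos):
    """Highest residual energy after the head; ties go to the node nearest the sink."""
    candidates = [n for n in cluster if n is not head]
    return max(candidates, key=lambda n: (n.energy, -math.dist(n.pos, sink_pos)))


def needs_reselection(cluster, head):
    """True when the head's energy falls below ETh = Eave (cluster average)."""
    e_ave = sum(n.energy for n in cluster) / len(cluster)
    return head.energy < e_ave
```

A weighted score that mixes energy and distance would be an equally plausible reading of the two factors; the lexicographic ordering is used here only because it keeps the sketch short.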

D. POLICY AND REWARD FUNCTION
In this section, the reward and Q-function are described. Tables IV and V contain the notations used in the following equations; these include the action-utility function of node i and Q*(S), the action with the highest Q-value.
In BRP-ML, we assume that node i attempts to forward a packet to node j, where the state of that agent is si using action ai. The transmission reward in the case of successful transmission, Rsuccess, is given by (8), where 0 < ω < 1, 0 < β1 < 1, and 0 < α < 1. Our reward function focuses on two main parts: the energy cost and the delay cost. The constant cost ω is added because node communication occupies channel bandwidth. Additionally, a delay sensitivity factor is added to balance energy and delay when determining the transmission route. If the delay sensitivity factor is set to one, the selected path considers only the delay. Therefore, the sensitivity factor is the weight given to the delay cost in the equation.
The energy cost function uses the initial node energy and the residual node energy. For each node, the higher the residual energy, the lower the cost it is assigned. Therefore, a node with higher residual energy is more likely to be selected. The energy cost function is given by (9).
The transmission delay between nodes i and j (tdelay) is used to calculate the delay cost: if the delay between nodes i and j is high, the delay cost is high. The delay cost function is shown in (10).
In the case of transmission failure, the delay and the energy costs are doubled. Considering that the delay cost is the time consumed due to communication failure plus the transmission delay between nodes i and j, the reward function for a failed transmission is defined in (11).
Since Q-learning is based on a Markov decision process (MDP), the probability of transition from one state to another must be considered to calculate the direct reward function. Therefore, the successful transition probability and the failed transition probability are used to compute the direct reward function, which is defined in (12). The probability of failed transmission, Pfail, is the number of lost packets divided by the number of transmissions, as shown in (13). Furthermore, because the failed and successful transmission probabilities sum to one, the probability of successful transmission is 1 - Pfail. Thus, combining (12) and (13), the direct reward can be rewritten as (14). After combining (8), (11), and (14), the action utility function is expressed as (15).
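The structure of the reward described in this subsection can be illustrated with the sketch below. The exact expressions of (8) to (14) are not reproduced in this text, so every functional form here is an assumption chosen only to match the verbal description: the energy cost falls as residual energy rises, the delay cost rises with tdelay, both costs are doubled on failure, and the direct reward weights the two outcomes by their probabilities. The parameter `t_max` and the default ω = 0.1 are hypothetical; the defaults for α, β1, and β2 are the values recommended later in the paper.

```python
def energy_cost(e_residual, e_initial):
    # Assumed form: 0 for a full battery, 1 for an empty one.
    return 1.0 - e_residual / e_initial


def delay_cost(t_delay, t_max):
    # Assumed form: delay normalized by a hypothetical maximum, capped at 1.
    return min(t_delay / t_max, 1.0)


def reward_success(c_energy, c_delay, omega=0.1, beta1=0.6, alpha=0.7):
    # Assumed form: constant cost plus weighted energy and delay costs.
    return -omega - beta1 * c_energy - alpha * c_delay


def reward_failure(c_energy, c_delay, omega=0.1, beta2=0.6, alpha=0.7):
    # On failure, the energy and delay costs are doubled.
    return -omega - beta2 * (2 * c_energy) - alpha * (2 * c_delay)


def failure_probability(lost_packets, transmissions):
    # Lost packets divided by the number of transmissions.
    return lost_packets / transmissions if transmissions else 0.0


def direct_reward(p_fail, r_success, r_fail):
    # Outcomes weighted by their probabilities, with P_success = 1 - P_fail.
    return (1.0 - p_fail) * r_success + p_fail * r_fail
```

With these assumed forms, a link that loses packets more often receives a lower direct reward even when its energy and delay costs are unchanged, which is the behavior the text attributes to the failure term.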

E. ROUTING PROCESS
By this phase, the nodes are clustered, and each cluster has a cluster head and an edge node. Data packets are routed to the sink node only through cluster heads or edge nodes. Choosing the next hop depends on the result of the reward function and the Q-table. Whenever a normal node has a packet to transmit, it transmits that packet directly to the cluster head. After that, the cluster head aggregates the sensed data from its cluster nodes. The data forwarding algorithm is where the Q-learning process happens. First, it initializes the Q-table in which the rewards are stored. If the node type is normal, the node simply transmits the packet to its cluster head. Otherwise, if the node is a cluster head or an edge node, it makes a list of possible hops, which are the other cluster heads or edge nodes within the communication distance. After that, it calculates the reward using the direct reward function (14) and the Q-value using the Q-function (15). Then, a timer starts, which is used for the VAM. Using the results of the Q-function, the candidate with the highest Q-value is chosen as the next hop. If the timer expires and no ACK packet is received by the sending node, the reward of the chosen hop is lowered, and another hop is selected. Finally, the Q-table is updated. Algorithm 5 shows the steps of the fourth phase, which is the data forwarding.
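The forwarding loop described above can be sketched as follows. This is not Algorithm 5: the Q-table layout, the callback names, the discount factor `gamma`, the fixed `penalty` applied after an ACK timeout, and the generic update "reward plus discounted best Q-value of the next hop" are all assumptions standing in for the paper's Q-function (15).

```python
def choose_next_hop(q_table, node, candidates):
    """Return the candidate with the highest Q-value (unseen pairs count as 0)."""
    return max(candidates, key=lambda hop: q_table.get((node, hop), 0.0))


def forward_packet(q_table, node, candidates, send_and_wait_ack,
                   reward, best_q_of, gamma=0.5, penalty=1.0):
    """Try hops in descending Q-value order until one acknowledges the packet.

    send_and_wait_ack(node, hop) -> True if an ACK arrives before the timer expires
    reward(node, hop)            -> direct reward for this transmission
    best_q_of(hop)               -> highest Q-value advertised by the hop
    """
    remaining = list(candidates)
    while remaining:
        hop = choose_next_hop(q_table, node, remaining)
        if send_and_wait_ack(node, hop):
            # Assumed generic update; stands in for the paper's Q-function.
            q_table[(node, hop)] = reward(node, hop) + gamma * best_q_of(hop)
            return hop
        # VAM: no ACK before the timer expired, so make this hop less attractive.
        q_table[(node, hop)] = q_table.get((node, hop), 0.0) - penalty
        remaining.remove(hop)
    return None  # every candidate timed out


q = {("CH1", "CH2"): -0.2, ("CH1", "E3"): -0.5}
acks = {"CH2": False, "E3": True}  # CH2 borders a void area in this toy case
chosen = forward_packet(q, "CH1", ["CH2", "E3"],
                        lambda n, h: acks[h],
                        lambda n, h: -0.3,
                        lambda h: -0.4)
```

In this toy run, CH2 is tried first because it has the higher Q-value, it fails to acknowledge, its Q-value is lowered, and the packet is then delivered through E3.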

V. SIMULATION AND RESULTS
In this section, simulations are conducted to evaluate the performance of BRP-ML using Sublime Text 3.

A. Simulation Metrics
The measured metrics are as follows:
1. Energy efficiency: $\text{Energy efficiency} = \dfrac{\text{Output energy}}{\text{Input energy}}$
2. Average delay: average time required for data to reach the sink node.
3. Delivery rate: $\text{Delivery rate} = \dfrac{\text{Number of delivered packets}}{\text{Total number of transmitted packets}}$
4. Network lifetime: total routing time until the first node expires.
5. Alive node percent: percent of alive nodes among all nodes.
6. Execution time: time it takes to run the algorithm.
The results of the simulations are compared with the QL-EDR [13] and QELAR [11] routing protocols.
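As a small illustration of how these metrics could be computed from simulation logs, consider the sketch below. The function names and the input formats (lists of per-packet delays, per-node residual energies, and per-node depletion times) are assumptions and are not taken from the paper's simulator.

```python
def energy_efficiency(output_energy, input_energy):
    return output_energy / input_energy


def average_delay(delays):
    """Mean time for delivered packets to reach the sink."""
    return sum(delays) / len(delays)


def delivery_rate(delivered, transmitted):
    return delivered / transmitted


def network_lifetime(depletion_times):
    """Time at which the first node expires."""
    return min(depletion_times)


def alive_node_percent(residual_energies):
    """Percent of nodes whose residual energy is still above zero."""
    alive = sum(1 for e in residual_energies if e > 0)
    return 100.0 * alive / len(residual_energies)
```

For instance, `delivery_rate(45, 50)` gives 0.9, and `alive_node_percent([3.0, 0.0, 1.5, 0.0])` gives 50.0.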

B. Simulation Cases
Four simulation cases are executed to test the path changes, energy consumption, and delay for different network sizes in terms of the number of nodes. The nodes are distributed uniformly in a 250 m × 250 m × 80 m area. The values of α, β1, and β2 are those that achieve the best result in the tests reported in subsection D. Each simulation case is run three times. In the first and second runs, the starting locations of the nodes are the same; the second run serves to compare the chosen path, delay, and consumed energy with those of the first run. The third run takes place when half the nodes have depleted their energy. The tracked packets are generated from the same node in all runs. In the first simulation case (test-One), a network is deployed in a 3D environment with 120 nodes and one surface sink node, where the nodes are labeled with IDs ranging from 0 to 120. The optimal number of clusters for this case using the proposed algorithms is four; therefore, four clusters are formed in this test.
In test-One, node number 70 is selected randomly to track the generated packets. In the first and second runs, the same path is chosen by the BRP-ML algorithm, and the path to the sink node is four hops long. In the third run, the path changes, and the number of hops decreases to three. Consequently, the delay decreases by 5%, and the consumed energy increases by 20%. The reason is that the reduced number of hops causes the delay to decrease and the energy to increase.
In the second simulation case (test-Two), a network is deployed in a 3D environment with 150 nodes and one surface sink node, where the nodes are labeled with IDs ranging from 0 to 150. The number of clusters formed in this test is three. Node number 70 is again selected randomly by the BRP-ML algorithm as the starting node. Similar to test-One, the same path is chosen in the first and second runs, and the path to the sink node is four hops long. In the third run, the number of hops is also four, but the nodes forming the path have changed. The delay and energy have increased by 8% and 18%, respectively. This is because the number of hops is the same, but the route covers a longer distance, which causes both the delay and the energy to increase.
In test-Three, the network has 170 nodes with the same starting node, number 70. The results of the first and second runs are similar to those of test-Two because the network does not change significantly. In the third run, the path is different from that of test-Two because the previously chosen hops have depleted their energy, which causes the algorithm to change the path. Because there are fewer hops than in the first and second runs, the delay has decreased by 6%, and the energy consumption has increased by 17%. The energy has increased because the hops cover longer distances.
Finally, in the fourth test, 200 nodes are deployed in the network. In the first and second runs, the delay is lower than in all previous tests because the nodes are denser and there are more candidate hops to choose from. Additionally, the number of hops is only three. The delay and energy consumption exhibit an inverse relationship, which means that the energy consumption increases as the delay is reduced. That is why, in the third run, the algorithm attempts to balance the delay and energy by choosing a path with more hops. The delay has increased by 9%, and the energy consumption has decreased by 10%.
Figure 5 demonstrates the effect of network size on energy. The energy consumption is higher for networks with a smaller number of nodes. The energy consumption decreases for networks of 100-170 nodes and then increases again for networks of 200 nodes, noting that all networks have the same area size. Although the energy consumption increases in a 200-node network, the delay is reduced, which achieves the targeted balance.
Figure 6 shows the effect of network size on the delay. The delay wavers for networks with a small number of nodes, and it is almost the same for networks with 100-200 nodes. The delay is higher for small networks because the network area size is the same for all of them, and there are gaps between the nodes that cause long delays.
In conclusion, the outcome performance does not depend on one feature. Both metrics work together, and the algorithm balances them. When two cases have the same number of hops, the delay and energy can still have different values because the route distance is not the same. Figure 7 illustrates this conclusion. Assume that node number 70 is the source node, node number 45 is the cluster head, and all nodes' batteries are full. There are two paths available to reach the sink, using either node 63 or node 28.
The total distance of the red path is 30 m, and the total distance of the blue path is 25 m. The algorithm chooses the blue path since it costs less than the red path. After a while in the simulation, the battery levels change. According to the algorithm, the red path may then be chosen if the energy of node 63 is higher than that of node 28, to avoid draining the battery of node 28. In both cases, the number of hops is three, but the delay of the blue path is lower than that of the red path. The BRP-ML algorithm balances the delay and energy consumption.
Table VI summarizes the simulation parameters used in the comparisons of the BRP-ML protocol, QELAR, and QL-EDR. The underwater network model is built based on known network models, including the principle of node data transmission and the connectivity characteristics of the network. The same model was followed in [11] and [13].
Figure 8 shows a comparison between BRP-ML, QELAR, and QL-EDR in terms of delivery rate with different packet generation rates (λ). Packets are generated per node according to a Poisson process, so the time between consecutive packet generations follows an exponential distribution. The delivery rate decreases as the packet generation rate increases because there are more packets to transmit, which depletes the nodes' energy faster. Therefore, the number of packets that reach the sink is reduced. BRP-ML improves the delivery rate by approximately 25% compared with QELAR and 6% compared with QL-EDR. BRP-ML uses the void area mechanism, which increases the delivery rate.
Figure 9 demonstrates a comparison between BRP-ML, QL-EDR, and QELAR in terms of average delay with different packet generation rates. The average delay increases as the packet generation rate increases. This is because when the number of sent packets increases, the number of transmission failures increases, and there are more packets to retransmit, which increases the average delay.
QELAR has a higher average delay than BRP-ML by 18%, whereas QL-EDR has a 13% higher average delay than BRP-ML. This is because the reward function of BRP-ML considers both the delay and the energy. The difference is not significant because the retransmitted packets take a longer path, which takes more time and increases the delay.
Figure 10 shows a comparison between BRP-ML, QL-EDR, and QELAR in terms of energy efficiency with different packet generation rates. The energy efficiency decreases as the packet generation rate increases. The reason is that the more packets there are to transmit, the more energy is consumed by the nodes, which reduces the network lifetime. BRP-ML outperforms QELAR and QL-EDR by 16% and 9%, respectively. The design of our reward function and approach balances the energy consumption better than QELAR.
Figure 11 demonstrates the relationship between the normalized network lifetime and packet generation rate. The network lifetime decreases as the packet generation rate increases. The reason is that an increase in packet transmission results in the consumption of more energy, which reduces the network lifetime. The QELAR algorithm achieves the lowest network lifetime, whereas BRP-ML achieves the highest network lifetime among the compared algorithms.

5. Alive Node Percent
Figure 12 shows the alive node percent. The alive node percent decreases as the packet generation rate increases. This is because more packets are transmitted, which drains the nodes' batteries. BRP-ML achieves a better result than the other two algorithms. It has a 14% enhancement compared with QELAR and a 5% enhancement compared with QL-EDR.

6. Execution Time
To evaluate the energy consumption, the time required for the execution of the algorithm must also be considered. The cost of computing energy Erun can be investigated using (16) [14].
$$E_{run} = P \times t_{run} \qquad (16)$$
where trun is the execution time, and P is the hardware power. The computing energy must be lower than the transmission energy to achieve a longer network life. One of the important factors that affects the learning process is the number of nodes. Because exploitation and exploration consume long computational times, networks with various node densities are tested using the same hardware to assess the efficiency of BRP-ML. Figure 13 shows the relation between the number of nodes and the algorithm's average-case execution time. The algorithms are run 25 times to obtain the average-case execution time. The number of nodes ranges between 130 and 230. When the number of nodes is 130, the fastest approach is QELAR, followed by BRP-ML, whose execution time is similar to that of QL-EDR. When the number of nodes reaches 230, the QL-EDR execution time increases by 23% compared to the case when the number of nodes is 170. In BRP-ML, the execution time increases by 22%, while that of QELAR increases by 30%. The more time it takes to execute the learning algorithm, the more energy it consumes. Therefore, execution time is a critical factor in designing routing protocols for scalable networks.

D. Comparisons of Different Parameter Values
Different coefficients are tested to investigate their influence on the energy efficiency and delivery rate. Figure 14 shows the energy efficiency with different β coefficients. These correspond to the reward function coefficients β1 = β2, with values varying between 0.1 and 1, while the values of α vary between 0.2 and 0.8 with a step of 0.2. When β1 and β2 are set to high values, the route selection decision is driven by the energy consumption, whereas the decision favors the delay when α is assigned high values. The energy efficiency increases as α increases. This is because the node's remaining energy affects the routing with a higher value of ω in the reward function. Thus, nodes that have more energy are more likely to be selected to forward the packet.
Figure 15 illustrates the delivery rate for different coefficients. The delivery rate decreases as α decreases. This is because relying on minimizing the delay when choosing the path causes an uneven distribution of the energy. When β1 and β2 are set to high values and α = 0.8, the delivery rate drops by 25% compared to low values of β1 and β2. Therefore, to balance the energy efficiency and delivery rate, the choice of α should consider both the delay and the node residual energy. To achieve a high delivery rate and energy efficiency, the recommended values are α = 0.7 and β1 = β2 = 0.6, which account for both energy consumption and delay.

VI. CONCLUSION AND FUTURE WORK
In this paper, we have presented a machine learning algorithm to address some of the UWSN limitations. We focused on extending the network lifetime by decreasing the delivery delay while balancing energy consumption. The BRP-ML routing protocol was proposed, which is a reinforcement learning (Q-learning) algorithm used to route the sensed data to the surface sink node for further data analysis. The protocol operates in four phases to ensure that each phase works effectively and remains flexible for future modifications. BRP-ML uses a clustered network that helps adapt to network changes and reduces communication interference. For the clustering phase, K-means++ is used to divide and group the nodes. To validate the clustering process, the silhouette score was used as an internal validation method to measure the algorithm performance. BRP-ML considers the void area issue, which is a common problem in UWSNs. We presented a VAM to address void regions and increase the delivery rates. We demonstrated that BRP-ML could effectively enhance network performance. Simulation results showed that it could balance the energy consumption and delivery delay while considering the void area. The results showed that BRP-ML increased the delivery rate by up to 25% and decreased the average delay by up to 18% while improving energy efficiency by up to 16% compared to the QELAR and QL-EDR algorithms.
For future work, we will consider deploying multiple sink nodes to study the effect on the packet delivery rates and delay. Recently, some researchers have used AUVs for underwater wireless charging. We will consider deploying such AUVs to charge the nodes before they deplete their energy, to prolong the network lifetime and enhance the performance. Additionally, a machine learning algorithm will be designed for the AUV to choose the route it takes to charge nodes. The normal node will make a forwarding decision in which it decides whether to forward a packet to the cluster head or the edge node to save time and energy.