Deep Reinforcement Learning Challenges and Opportunities for Urban Water Systems

The efficient and sustainable supply and transport of water is a key component of any functioning civilisation, making the role of urban water systems (UWS) inherently crucial to the wellbeing of their customers. However, managing water is not a simple task. Whether it is ageing infrastructure, transient flows, air cavities or low pressures, water can be lost as a result of the many issues that face UWSs. The complexity of these networks grows with rising urbanisation and climate change, leaving water companies and regulatory bodies in need of new solutions. It therefore comes as no surprise that many researchers are invested in innovating within the water industry to ensure that the future of our water is safe. Deep reinforcement learning (DRL) has the potential to tackle complexities that used to be very challenging, as it relies on deep neural networks for function approximation and representation. This technology has conquered many fields thanks to its impressive results and can effectively revolutionise UWS. In this article, we explain the background of DRL and the milestones of the field using a novel taxonomy of DRL algorithms. This is followed by a novel review of DRL applications in UWS, focusing on water distribution networks and stormwater systems. The review concludes with critical insights on how DRL can benefit different aspects of urban water systems.


Introduction
Water scarcity is a reality experienced by 2.3 billion people globally who live in water-stressed countries, yet water demand is set to increase by 40% by 2030 (Endo et al., 2017). Our water preservation practices are not sustainable and will diminish the availability of clean water. In response to the rising challenges of water distribution in the UK, regulatory bodies such as Ofwat and the Public Accounts Committee have been pushing water companies to reimagine the water sector by 2050 (Mace, 2020). Main themes of the sector-wide strategy include 'Deliver resilient infrastructure systems' and 'achieving net-zero carbon', both of which will rely on developing better water management within UWS (U.K.W.I.R., 2020). The preservation of the world's most important resource increases in complexity as we consider the outdated infrastructure forced to keep up with rising customer demand. Tackling such high-dimensional scenarios will require more research and extensive efforts from both industry and academia to rectify the mishandling of water distribution networks.
In this paper we explore a specific subfield of machine learning that has captivated the research community and IT companies such as OpenAI (Berner et al., 2019) and Google (Silver et al., 2016): Deep Reinforcement Learning (DRL). DRL is an emerging field of dynamic computing that has risen through the use of deep neural networks to advance reinforcement learning (Mnih et al., 2015a). Its successes rely on its applicability to real-world scenarios that require learning from experience, and its failures arise from challenges in instability and environment definition. The appeal of finding low-dimensional features that accurately represent high-dimensional real-world problems, combined with experience-driven autonomous learning, makes DRL a true advancement in AI. As this field grows, researchers have developed numerous deep reinforcement learning algorithms that employ computational methods such as bootstrapping, backups, replay memory and function approximation to overcome emerging issues and improve results (Li, 2017). Together with its numerous neural network architectures, deep reinforcement learning has quickly grown into an unclassified jungle of artificial intelligence advancements.
Navigating the field of DRL requires a solid knowledge of its predecessor, reinforcement learning, and the major advancements led by the introduction of neural networks, which is covered in section two. After reviewing the wider field of research, this paper focuses on a novel review of the application of DRL in urban water systems, which includes challenges and opportunities of applying DRL in UWS followed by case studies in water distribution and stormwater management in section three. This in-depth review of the current research in UWS leads to an extensive discussion regarding the future of deep reinforcement learning in UWS in section four. This will hopefully unveil unexplored avenues of research to promote the use of DRL in water. A list of abbreviations used is available in

Deep Reinforcement Learning Background
The field of machine learning (ML) has been a trending topic for researchers from diverse backgrounds such as virologists, biologists, engineers, psychiatrists and more (Libbrecht and Noble, 2015; Nichols, Herbert Chan and Baker, 2019) due to its ability to analyse real-world problems using algorithms that take a more dynamic perspective and improve with experience (Shinde and Shah, 2018). Machine learning began as researchers hoped to establish a novel area where instrumentation could achieve innate learning and demonstrate more 'intelligent' behaviour. From the first ML algorithm in 1951, named the 'response learning algorithm', until the current day, artificial intelligence has only been empowered by this new field (Shinde and Shah, 2018). Some of the major achievements in ML were the creation of the Linear Classifier, Naive Bayes, Bayesian Network, Support Vector Machine (SVM), k-Nearest Neighbour (k-NN) and Artificial Neural Network (ANN) algorithms (Shinde and Shah, 2018). ANNs were then adapted further to introduce deep layers, hence the introduction of deep learning.
ML has successfully developed the world of artificial intelligence into a true hope for near-human intelligence. Machine learning methods are often split into supervised learning, used for classification and regression (Shinde and Shah, 2018; Nichols, Herbert Chan and Baker, 2019), and unsupervised learning, used for clustering and feature engineering (Libbrecht and Noble, 2015). Where supervised learning depends on our prior knowledge and labelled examples to form an understanding of the model, unsupervised learning aims to learn some hidden structure using feature extraction from the unlabelled dataset. Whilst both forms of learning have greatly advanced their respective fields and widened the scope of artificial intelligence, they fall victim to the curse of time. Overlooking the effect of time can have grave consequences when applying ML models to sensitive and stochastic applications, which is often the case with engineering problems such as urban water management. Hence the need for a learning approach that incorporates the hidden dimension of time: reinforcement learning. Figure 2-1 highlights the place of RL as a subfield of machine learning. RL's ability to consider the effects of time through semi-supervised learning was the first expression of artificial foresight in machine learning and its closest form to human intelligence.

Figure 2-1 The subfields of machine learning
In its infancy, reinforcement learning (RL) was an exciting concept that promised an introduction to responsive and continuously learning AI systems. A behaviourist mathematical approach to experience-driven learning was finally attainable through RL (Sutton and Barto, 2018). This entails reward-driven learning from interaction with an unmapped environment rather than hard computing or supervised learning, where it is often difficult to obtain examples of desirable behaviour. Despite the initial successes of RL (Tesau and Tesau, 1995; Singh et al., 2002; Kohl and Stone, 2004), it could not escape the 'curse of dimensionality' when applied to real-life problems. RL was limited by issues of memory, computational and sample complexity (Strehl et al., 2006).
The recent surge of deep learning and deep neural networks has spearheaded advances in function approximation and representation learning, giving hope of unlocking the true potential of RL by overcoming its issues of scalability; hence the rise of the field of DRL. This technology gained the interest of companies such as Google and Tesla during their race for driverless vehicles (Kool, Van Hoof and Welling, 2018; Nazari et al., 2018). It has lent its abilities to the fields of robotics (Levine et al., 2016; Nguyen and La, 2019; Zhao, Queralta and Westerlund, 2020), gaming (Mnih et al., 2015a; Silver et al., 2016) and many more sectors (Li, 2017). As deep reinforcement learning gained popularity and developed further, the field of reinforcement learning was quickly populated with novel algorithms. The field of RL has quickly transformed into a forest of methods, architectures and concepts that are difficult to navigate because of its non-modularity. Defining the scopes of RL (and DRL) will help researchers understand the trade-offs involved in algorithm design. Similar work surveying offline reinforcement learning methods with a taxonomy can be found in (Prudencio, Maximo and Colombini, 2022). To highlight the diversity in RL and DRL, we have gathered and classified a novel taxonomy of the algorithms (Figure 2-2). This classification tree can serve as a map for new researchers interested in the field of DRL. It classifies the algorithms along model-free vs model-based, on-policy vs off-policy, value-based vs policy-based, and gradient-based vs gradient-free labels. Dotted lines are used to label families of DRL methods such as dynamic programming, Monte Carlo, temporal difference and distributional RL algorithms. In addition, fundamental RL algorithms are written in green, RL methods in blue and DRL algorithms in black. The classification tree aims to introduce a variety of DRL algorithms and methods that might be useful for application in urban water systems.

The Components of DRL
To fully comprehend the aspects and range of methods available in DRL, it is crucial to delve into the formalism that makes up the RL paradigm. Reinforcement learning frames its problems as Markov Decision Processes (MDPs), a description commonly used in the field of computing to depict real-world processes. MDP formalism is based on evaluating the probability of transitions between different states in a process and is often denoted by the five-tuple (S, A, P, R, γ) standing for states (S), actions (A), transition probabilities/dynamics (P), reward (R) and discount factor (γ) (Puterman, 1990; Desharnais et al., 2004). This helps evaluate the sequential interactions between actuators (agents) and their environment, which influence both the state of the agent (state, S) and the relevant state of the environment (observation). The agent is then fed the observation data and a reward signal (reward, R) that serves as an assessment of the new state that this action has led to. The aim of the agent is to find the optimal policy (π) that will maximise the expected reward, which is achieved by learning the probability of state transitions attached to a state-action pair. A visual description of this process can be found in Figure 2-3. The deep neural network is an addition only found in DRL methods, whilst RL methods tend to use a tabular data frame. The components of RL and DRL can therefore be redefined to suit most real-world applications in an organic and straightforward manner.
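The agent-environment loop of an MDP can be sketched in a few lines of Python. The two-state tank problem below, with its states, actions and rewards, is purely hypothetical and invented here for illustration:

```python
import random

# A minimal sketch of the MDP interaction loop (S, A, P, R, gamma) on a
# hypothetical two-state tank problem; all states, actions and rewards
# are invented for illustration only.
P = {  # dynamics P: (state, action) -> list of (probability, next_state, reward)
    ("low", "pump"):  [(1.0, "high",  1.0)],
    ("low", "idle"):  [(1.0, "low",  -1.0)],
    ("high", "pump"): [(1.0, "high", -0.5)],
    ("high", "idle"): [(1.0, "high",  0.5)],
}
gamma = 0.9  # discount factor
rng = random.Random(0)

def step(state, action):
    """Sample one environment transition from the MDP dynamics P."""
    outcomes = P[(state, action)]
    probs = [p for p, _, _ in outcomes]
    i = rng.choices(range(len(outcomes)), weights=probs)[0]
    _, next_state, reward = outcomes[i]
    return next_state, reward

# One agent-environment interaction: act, observe next state and reward signal.
policy = {"low": "pump", "high": "idle"}  # a fixed illustrative policy pi
state = "low"
state, reward = step(state, policy[state])
```

An RL agent would repeat this loop, using the observed rewards to improve its policy over time.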

Reward and Return
The reward (r) is the crucial signal that tells the agent whether its action was beneficial or harmful. The cumulative reward over a trajectory (τ) is named the return (R(τ)) and it can be a finite-horizon undiscounted return (Eq. 2-1) or an infinite-horizon discounted return (Eq. 2-2). The finite return is the sum of rewards over a fixed number of steps, whilst the infinite return, as the name suggests, is the sum of all rewards ever received:

$R(\tau) = \sum_{t=0}^{T} r_t$ (Eq. 2-1)

$R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t$ (Eq. 2-2)

The infinite return must include the discount factor γ ∈ (0,1), used to control how much weight is placed on the agent's foresight. This helps the infinite sum converge to a finite value. This return is usually modified and incorporated into a value function for value-based RL methods or an objective function for policy-based RL methods. Both approaches have their advantages and disadvantages; for example, policy-based methods are generally less sample efficient than value-based algorithms but can learn stochastic policies and converge faster than their alternative (Lapan, 2019). We discuss this further in the classifiers section below.
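The two kinds of return can be illustrated with a short sketch; the reward sequence and discount factor below are made up for illustration:

```python
# Finite-horizon undiscounted return: a plain sum over T steps.
# Discounted return: sum of gamma^t * r_t, truncated here since any
# recorded trajectory is finite in practice.
def undiscounted_return(rewards):
    return sum(rewards)

def discounted_return(rewards, gamma):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 1.0]                  # illustrative reward sequence
R_finite = undiscounted_return(rewards)          # 1 + 1 + 1 + 1 = 4.0
R_discounted = discounted_return(rewards, 0.9)   # 1 + 0.9 + 0.81 + 0.729
```

Note how discounting shrinks the contribution of distant rewards, which is what lets the infinite sum converge when γ < 1.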

Value Based
Value functions are used in almost every RL algorithm. They are a fundamental concept in RL which calculates the expected infinite-horizon return to evaluate how beneficial individual states or state-action pairs are. Value functions that solely evaluate the current state without the action are often denoted by the symbol V(s) and named state value functions (Eq. 2-3). Alternatively, state-action value functions are called quality functions, and they provide more insight into the trajectory of the agent given its current state-action pair (Eq. 2-4). The Q-value is denoted by the symbol Q(s,a).

$V(s) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \mid S_t = s\right]$ (Eq. 2-3)

$Q(s,a) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \mid S_t = s, A_t = a\right]$ (Eq. 2-4)

Where $\mathbb{E}[\cdot]$ is the expected discounted infinite-horizon return, s is the state sampled from S_t, a is the action sampled from A_t and t is any time step.
An important property of RL is foresight, which enables agents to weigh the future consequences of their actions using the expected return; hence it is rare to find value functions operating without the incorporation of the Bellman equations (Bellman, 1952). Bellman equations are self-consistency equations, integral to dynamic programming and MDPs, that follow the concept that the value of any starting point is the reward you expect from being at the starting point plus the value of the next point (Bellman, 1952; Puterman, 1990). Because the actions taken by an agent depend on the policy that it follows, value functions are often described in relation to a policy. On-policy value functions estimate the expected returns as the agent follows the behavioural policy (π). On-policy value functions can either evaluate a state (state-value function) or a state-action pair (state-action value function or quality function). The on-policy state-value function is denoted by V^π(s) and evaluates the expected return as the agent acts under the behavioural policy (π), starting in state (s) and followed by the state (s′). The Bellman equation decomposes the value function into the sum of the current reward and the future discounted values:

$V^{\pi}(s) = \mathbb{E}_{a \sim \pi}\left[ r(s,a) + \gamma V^{\pi}(s') \right]$
Similarly, the Bellman equation for the Q-value, denoted by Q^π(s,a), is formally defined as the expected return as the agent acts under the behavioural policy (π), starting with the state-action pair (s,a) and followed by the next state-action pair (s′,a′):

$Q^{\pi}(s,a) = \mathbb{E}\left[ r(s,a) + \gamma \, \mathbb{E}_{a' \sim \pi}\left[ Q^{\pi}(s',a') \right] \right]$
When attempting to find the optimal policy and action for an RL problem, off-policy value functions are used to remove the restrictions of the behavioural policy and allow the agent to explore the value function following the optimal policy. This leads to the off-policy state value function and the off-policy state-action value function, also called the optimal value functions (V*(s) and Q*(s,a)). The main difference between the on-policy and optimal Bellman equations is that the optimal equations use the maximally rewarding action, as shown below (Eq. 2-5, Eq. 2-6):

$V^{*}(s) = \max_{a} \mathbb{E}\left[ r(s,a) + \gamma V^{*}(s') \right]$ (Eq. 2-5)

$Q^{*}(s,a) = \mathbb{E}\left[ r(s,a) + \gamma \max_{a'} Q^{*}(s',a') \right]$ (Eq. 2-6)

The optimal action of an RL problem can be extracted by finding the argument that maximises the off-policy state-action value Bellman equation (the optimal Q-function). In instances where there are multiple optimal actions, algorithms often select an action at random (Achiam, 2020). Another method to evaluate the value of an action is the advantage function (A(s,a)). This compares how beneficial an action is relative to the average value of all actions by subtracting the state value from the state-action value under policy (π) (Eq. 2-7):

$A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s)$ (Eq. 2-7)
The use of the advantage function is intuitive, as it evaluates the performance of actions relative to an average: it is simpler to compare the consequence of an action with respect to another. Learning the advantage, rather than the quality or state function, has been a recent trend in DRL algorithms (Schulman et al., 2015; Wang et al., 2015; Gu et al., 2016; Mnih et al., 2016). For more details on the basics of value functions, we recommend the following introductory books, papers and articles (Arulkumaran et al., 2017; Li, 2017; Sutton and Barto, 2018; Achiam, 2020).
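The optimal Bellman backup and the advantage function can be illustrated with a minimal tabular value-iteration sketch; the two-state, two-action tank problem and all its numbers are invented for illustration:

```python
# Tabular value iteration: repeatedly apply the optimal Bellman backup
# V(s) <- max_a E[r + gamma * V(s')] on a hypothetical two-state problem.
P = {  # (state, action) -> list of (probability, next_state, reward)
    ("low", "pump"):  [(1.0, "high",  1.0)],
    ("low", "idle"):  [(1.0, "low",  -1.0)],
    ("high", "pump"): [(1.0, "high", -0.5)],
    ("high", "idle"): [(1.0, "high",  0.5)],
}
states, actions, gamma = ["low", "high"], ["pump", "idle"], 0.9

def q_value(V, s, a):
    """One-step Bellman expectation for a state-action pair."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])

V = {s: 0.0 for s in states}
for _ in range(200):  # repeat the backup until (approximately) converged
    V = {s: max(q_value(V, s, a) for a in actions) for s in states}

# Advantage A(s,a) = Q(s,a) - V(s): negative here, since idling in "low"
# is worse than the greedy choice of pumping.
advantage_idle = q_value(V, "low", "idle") - V["low"]
```

At the fixed point, V satisfies Eq. 2-5 exactly, and the sign of the advantage immediately ranks the actions in each state.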
Policy Driven
Other than value-based algorithms, there are policy-driven techniques to solve the reinforcement learning problem and reach an optimal policy. Whilst value-based methods use a learnt value function to reach an implicit policy, policy-based methods do not use a value function but directly learn a policy. The value function approach often works well, but it is important to be aware of its limitations. The value function approach to policy optimisation focuses mostly on deterministic policies, which are rare in the real world since optimal policies are often stochastic. Value-based methods are also highly sensitive: a minor change in the expected value of an action might cause the algorithm to accept or reject it. This has been identified as a key fault that inhibits the convergence of value-based methods such as Q-learning, SARSA and dynamic programming methods (Baird, 1995; Gordon, 1995; Bertsekas and Tsitsiklis, 1996). Policy-driven methods bypass these limitations, leading to better convergence properties and the ability to learn stochastic policies, hence more effective algorithms for higher-dimensional and continuous action spaces (Sutton et al., 2000). However, these methods can habitually converge to local minima and are more computationally demanding, with higher variance.
Direct policy search methods fine-tune a vector of parameters (θ) to select the best action to take for the policy π(a|s,θ). The policy π_θ is updated to find the maximum expected return. These methods can employ either gradient-free or gradient-based optimisation. Gradient-free algorithms often use the concepts of evolution strategies (Gomez and Schmidhuber, 2005; Koutník et al., 2013; Salimans et al., 2017) or the cross-entropy function (Kalashnikov et al., 2018). Gradient-free optimisation methods can perform well in low-dimensional spaces and update non-differentiable policies but, despite some successes in applying them to neural networks, gradient-based training remains the favoured method for DRL algorithms. Gradient-based training methods are more sample efficient when dealing with high-parameter policies (Arulkumaran et al., 2017).
Gradient-based policy methods, also called policy gradient methods, optimise a selected objective function (J(π_θ)) which can be defined by the average-reward formulation or the start-state formulation (Sutton et al., 2000). Policy function approximation is challenging since gradients cannot be passed through samples of a stochastic function, hence the use of a gradient estimator: the theory behind the REINFORCE algorithm (Williams, 1988, 1992; Sutton et al., 2000). The objective function (J) of the parameterised policy (π_θ) is the expected average return (R) under trajectory (τ). The trajectory is defined by the parameterised policy.
The aim is to optimise the policy through gradient ascent by numerically defining the gradient of policy performance (∇θJ(π_θ)), also called the policy gradient. A full derivation of the policy gradient can be found in (Achiam, 2020); however, the policy gradient can be redefined as (Eq. 2-8):
$\nabla_{\theta} J(\pi_{\theta}) = \mathbb{E}_{\tau \sim \pi_{\theta}}\left[ \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \, R(\tau) \right]$ (Eq. 2-8)

Where the policy gradient is the expected sum of returns (R(τ)) multiplied by the gradient of the log of the parameterised policy (∇θ log π_θ(a_t|s_t)) for timesteps (t) in an episode of length (T). This is the simplest policy gradient; there are different variations of the policy gradient definition, such as the Expected Grad-Log-Prob lemma (Schulman et al., 2015; Achiam, 2020).
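A toy REINFORCE sketch on a two-armed bandit shows this estimator in action. The softmax parameterisation, reward scheme and learning rate are assumptions made for illustration, not a production implementation:

```python
import math
import random

# REINFORCE sketch: a softmax policy over two actions with parameters
# theta, updated by grad log pi(a|theta) * R. The bandit rewards are
# hypothetical: action 1 pays 1.0, action 0 pays nothing.
rng = random.Random(0)
theta = [0.0, 0.0]
alpha = 0.1  # learning rate

def softmax(th):
    z = [math.exp(x) for x in th]
    s = sum(z)
    return [x / s for x in z]

for _ in range(2000):
    probs = softmax(theta)
    a = rng.choices([0, 1], weights=probs)[0]  # sample an action from pi
    R = 1.0 if a == 1 else 0.0                 # illustrative bandit reward
    # For a softmax policy, d/d theta_k of log pi(a) = 1[k == a] - pi(k);
    # ascend the estimated policy gradient.
    for k in range(2):
        theta[k] += alpha * ((1.0 if k == a else 0.0) - probs[k]) * R
```

After training, the policy concentrates almost all its probability on the rewarded action, which is exactly what gradient ascent on Eq. 2-8 should achieve.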
Policy-based and value-based RL coincide in the actor-critic algorithms (A2C, A3C, AC, DDPG, SAC), where the actor performs an action using policy-based RL and the critic evaluates the resulting reward using a value function. The critic influences the actor through the temporal difference error (TD error) to improve the algorithm's performance.

Other DRL Algorithm Terminology
To fully comprehend DRL algorithms, it is necessary to explain the parlance and methods that form those algorithms. One way to describe DRL algorithms is by whether the agent is provided with a state transition function (model-based) or has to learn solely from experience through trial and error (model-free). Agents that have access to a model benefit from sample efficiency and display a heightened ability of foresight but can often underperform when applied to real-world applications due to discrepancies between the model used for training and the ground-truth model. Model-free methods can be implemented and easily tuned for real-world applications (Li, 2017). Algorithms can also be trained on sequentially generated data (online mode) or on a pre-set training batch (offline mode).
A commonly used label for RL is whether it is on-policy or off-policy. On-policy methods evaluate or improve the behavioural policy of the current action-value pair of the current policy (e.g. SARSA), whilst off-policy methods explore the best-value policy without necessarily following the current behavioural policy; they are also called optimal methods (e.g. Q-learning) (Arulkumaran et al., 2017; Li, 2017). The value functions used to achieve this were highlighted previously.
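The on-policy/off-policy distinction is visible directly in the update targets; the Q-table and transition below are made up for illustration:

```python
# On-policy vs off-policy bootstrapping for one transition (s, a, r, s').
# SARSA (on-policy) bootstraps from the action the behaviour policy
# actually takes next; Q-learning (off-policy) bootstraps from the
# greedy maximum regardless of what is actually taken.
gamma, alpha = 0.9, 0.5
Q = {("s", "go"): 0.0,
     ("s2", "left"): 1.0, ("s2", "right"): 4.0}  # illustrative Q-values

r, s2 = 1.0, "s2"
a_next = "left"  # suppose the epsilon-greedy behaviour policy explores here

sarsa_target  = r + gamma * Q[(s2, a_next)]                          # follows the behaviour
qlearn_target = r + gamma * max(Q[(s2, "left")], Q[(s2, "right")])   # follows the greedy max

Q_sarsa  = Q[("s", "go")] + alpha * (sarsa_target  - Q[("s", "go")])
Q_qlearn = Q[("s", "go")] + alpha * (qlearn_target - Q[("s", "go")])
```

Because Q-learning always bootstraps from the greedy action, its estimates track the optimal policy even while the agent behaves exploratorily, which is precisely what makes it off-policy.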

Notable DRL Algorithms
Many successes have stemmed from scaling RL using deep neural networks through function approximation. Deep neural networks can be used to approximate the optimal policy (π*) or the optimal value functions (Q*, V*, A*). In this section, we discuss the current trends and notable deep reinforcement learning algorithms that have progressed the field. This will help contextualise the current state of the research field and expose avenues for future work.
The timeline and milestones that led to the creation of DRL are well illustrated in (Nguyen, Nguyen and Nahavandi, 2020, fig. 1), showing how trial-and-error learning, TD learning and deep neural networks came together to incentivise the first deep reinforcement learning algorithm: the deep Q-network (DQN). DQN was first introduced by Mnih et al., who used convolutional neural networks (CNN) to extract features from images across a series of 49 games (Mnih et al., 2015a). It was then used to tackle MuJoCo physics problems (Duan et al., 2016) and three-dimensional maze problems (Beattie et al., 2016). Following the success of DQN, researchers have built on the existing DQN architecture to improve its performance, creating new algorithms such as Double DQN (DDQN) and Duelling DQN (D-DQN). Double DQN minimises the effect of noise on DQN by avoiding the overestimation of Q-values (Van Hasselt, Guez and Silver, 2016), whilst the duelling network architecture combines two streams of data (the value stream and the advantage stream) to produce a more accurate Q-function (Wang et al., 2015).
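The overestimation fix in Double DQN amounts to decoupling action selection from action evaluation. The following sketch uses invented Q-values (not a full implementation) to show how the two targets differ:

```python
# Target computation in DQN vs Double DQN for one transition (r, s').
# Vanilla DQN selects AND evaluates the next action with the target
# network, which inflates values under noise; Double DQN selects with
# the online network and evaluates with the target network.
gamma, r = 0.99, 1.0
q_online = {"a0": 2.0, "a1": 3.5}   # online network estimates of Q(s', .)
q_target = {"a0": 4.0, "a1": 3.0}   # target network estimates of Q(s', .)

dqn_target = r + gamma * max(q_target.values())          # evaluates its own max
best_online = max(q_online, key=q_online.get)            # online network picks "a1"
double_dqn_target = r + gamma * q_target[best_online]    # target network evaluates it
```

When the two networks disagree about the best action, as above, the Double DQN target is smaller, which is how the decoupling curbs overestimation.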
Another milestone was the introduction of actor-critic (AC) algorithms, which combine value functions and policy gradients to forego the trade-off between variance reduction in policy methods and bias introduction from value functions (Konda and Tsitsiklis, 1999; Schulman et al., 2015). The DRL research community quickly directed its efforts towards improving AC methods. Schulman et al. improved the actor using generalised advantage estimation (GAE) to produce better variance-reduction baselines (Schulman et al., 2015). The critic was also improved separately using a target network in (Mnih et al., 2015b). Deterministic policy gradients (DPG) in actor-critic algorithms were first introduced in (Silver et al., 2014). DPGs allow the use of policy gradients in deterministic policies, when they were initially exclusive to stochastic policies. This lowers the computational load, as DPGs only integrate over the state space and can therefore tackle large action spaces using less sampling. Stochastic Value Gradients (SVG) are another method to apply standard gradients to stochastic policies by 'reparameterising' (Kingma and Welling, 2013; Rezende, Mohamed and Wierstra, 2014). This trend was first introduced in (Heess et al., 2015) and created a flexible method capable of being used with and without value function critics and models (Arulkumaran et al., 2017). SVG and DPG provide algorithmic means of improving learning efficiency in DRL.
Along the same lines of learning efficiency, Google's DeepMind released the Asynchronous Advantage Actor-Critic algorithm (A3C) (Mnih et al., 2016). This advancement uses an advantage function in an actor-critic architecture, training parallel agents asynchronously, yielding high accuracy and applicability in both continuous and discrete action spaces (Zhu et al., 2016; Lapan, 2019), hence creating a trend for asynchronous and parallel learning.

Current DRL Trends
The field of DRL is growing exponentially as researchers ground their understanding of reinforcement learning in human psychology. Using methods that parallel our natural learning has helped develop DRL further, leading to fields such as inverse reinforcement learning (IRL). Moreover, there is growing effort to improve algorithms by modelling the reward as a distribution of values, similar to our brain's reward system (Dabney et al., 2020). Multi-agent deep reinforcement learning (MADRL) models the real-world nature of multiple agents interacting with the same environment and reward probability. In this section of the review, we focus on current trends in the field of deep reinforcement learning. We explain the recent advancements and highlight notable work and challenges that are being addressed.

Hierarchical Reinforcement Learning
As the field of DRL grows, researchers have learnt how to include biases in the algorithm's learning experience. Hierarchical reinforcement learning (HRL) is a field of DRL dedicated to introducing inductive biases by factorising the final policy into several levels through state or temporal abstractions. This approach allows algorithms to tackle higher- and lower-level goals simultaneously by allowing top-level policies to focus on the main goal and sub-policies to focus on fine control (Tessler et al., 2017; Vezhnevets et al., 2017). This is how HRL attempts to achieve compositionality: achieving new representations through the combination of primitives (Hutsebaut-Buysse, Mets and Latré, 2022). The challenges faced in HRL stem from the selection of sub-behaviours or policies and from how to efficiently learn state abstractions.

Inverse Reinforcement Learning
As humans, we can often learn from others' mistakes and successes. Similarly, researchers have developed methods to bootstrap the learning process using trajectories from other controllers. This is known as imitation learning (also known as behavioural cloning). The success of behavioural cloning led to an early autonomous car using ALVINN in (Pomerleau, 1989). The main challenge with imitation learning is its susceptibility to uncertainties. Imitation learning's inability to adapt can lead the agent down a destructive trajectory, hence why it is paired with reinforcement learning. Using RL, the policy can be fine-tuned whilst imitation learning guides the general learning, leading to faster convergence and better stability. Introducing behavioural imitation to DRL gives rise to the field of inverse reinforcement learning (IRL). IRL applies behavioural cloning by relying on provided trajectories of the desired solution to approximate the reward function (Ng and Russell, 2000). Intuitively, the motivations behind using IRL usually include learning behaviour from experts, assisting humans and learning about systems (Adams, Cody and Beling, 2022). Applications of IRL are mostly concerned with teaching robots to imitate experts (Adams, Cody and Beling, 2022). Notable work and algorithms in this field include (Ziebart and Fox, 2010; Finn, Levine and Abbeel, 2016; Ho and Ermon, 2016; Levine and Van De Panne, 2018; Paine et al., 2018; Peng et al., 2018).

Distributional Reinforcement Learning
Distributional RL grounds itself in our natural brain reward system (Dabney et al., 2020). Like our natural dopamine system, distributional RL represents returns as a probability distribution over values learned from interacting with the environment. This parallel between distributional RL and our brains opens up opportunities for collaboration between AI and neuroscience (Lowet et al., 2020). This method of value distribution has shown its usefulness in improving learning speed and stability. The original distributional reinforcement learning algorithm is the categorical DQN (C51) (Bellemare, Dabney and Munos, 2017), where, using value distributions, the authors surpassed most gains on the Atari 2600 environment, beating the benchmark DQN and DDQN. Other algorithms include quantile regression DQN (QR-DQN), which uses quantile regression to minimise the Wasserstein metric and improve greatly on the earlier C51 in the Atari 2600 (Dabney et al., 2017). Implicit quantile networks (IQN) and the fully parameterised quantile function (FQF) are the latest algorithms in distributional RL, and they build further on the foundations of QR-DQN (Dabney et al., 2018; Yang et al., 2019).

Multi Agent Reinforcement Learning
With the rising complexity of real-world systems, deep reinforcement learning algorithms often play catch-up to be able to process and scale their models. Most of the methods devised for DRL algorithms aim to simplify complex environments and feature extraction. On the other hand, multi-agent DRL introduces complexity in its algorithms by introducing several agents that simultaneously interact with the environment. This represents having multiple employees working as a team to carry out a desired goal (or policy) on the same system. The complexity of these algorithms brings forth multiple challenges that are currently the focus of the research community, with the promise to solve more complex environments and real-world problems. There have been different approaches to tackling MADRL, including sending signals to the agents, having bidirectional channels between the agents, and an all-to-all channel (Arulkumaran et al., 2017). Major challenges in the field stem from non-stationarity, partial observability, complexity in training schemes, application in continuous action spaces and transfer learning (Nguyen, Nguyen and Nahavandi, 2020). Previous reviews and surveys include (Nguyen, Nguyen and Nahavandi, 2020), which provides a review of MADRL challenges, solutions, applications and perspectives; (Buşoniu, Babuška and De Schutter, 2008), which evaluates stability and provides a taxonomy of MADRL algorithms; (Bloembergen et al., 2015), which surveys dynamical models devised for multi-agent systems; and (Hernandez-Leal, Kartal and Taylor, 2019), which bridges the gap between DRL and MADRL, including benchmarks for MADRL. Other notable reviews include (Da Silva, Taylor and Costa, 2018; Hernandez-Leal, Kartal and Taylor, 2018).

Urban Water Systems (UWS)
Urban water systems are a collection of complex infrastructure and processes that supply, treat, transport, and manage water and wastewater within urban environments. These systems are crucial for managing the supply of clean drinking water as well as treating wastewater and controlling stormwater. Hence, they are paramount for the sustainability and well-being of cities. Effective management of UWS through sustainable practice aims to ensure a resilient supply of clean water despite climate change and seasonality. It should also minimise water loss through leakage and energy consumption through inefficient water supply and distribution. The key processes in UWS can be split into four major systems: raw water treatment plants, water distribution networks, wastewater treatment plants, and stormwater systems (Loubet et al., 2014; Etikala, Madhav and Somagouni, 2022). Some of the processes involved in each function are displayed below in Figure 3-1.

Figure 3-1 Urban Water Systems
Urban areas often obtain their water from several resources such as rivers, lakes, groundwater and desalination plants, which are managed by raw water treatment plants. Raw water goes through several treatment processes to remove impurities and contaminants. The main treatment methods used in raw water treatment plants include screening through mesh filters or screens, coagulation, flocculation, sedimentation, filtration, disinfection, corrosion control, pH adjustment, fluoridation, and quality monitoring (Benjamin, 2014; Jiang, 2015; Teodosiu et al., 2018; Lipps, Braun-Howland and Baxter, 2022).
Once treated, clean water is distributed from the plants to the customers through a network of pipes, valves, pumps and reservoirs. This process requires advanced pressure and asset management to minimise leakage and contamination. Due to varying elevations, demand and climate change, the distribution of water grows in complexity, and leakage has become a pervasive phenomenon in water distribution networks (Xu et al., 2014; Barton et al., 2019).
Similar to raw water treatment, wastewater treatment plants are concerned with treating wastewater collected through a sewer pipeline network. Treatment includes a variety of physical and chemical processes. Physical methods of screening, grit removal, sedimentation, and filtration remove heavy and large contaminants. Water is then treated biologically in the secondary treatment stage, using microorganisms to break down organic matter (Hussain et al., 2021). Coagulants and flocculants help remove fine particles and dissolved contaminants during the tertiary advanced chemical treatment. A final disinfection step may use chemicals such as chlorine or UV light to remove harmful pathogens (Kentish and Stevens, 2001; Crini and Lichtfouse, 2019).
During detrimental events such as floods and storms, stormwater management controls the impact on the environment and infrastructure (Ahiablame and Shakya, 2016; Aryal et al., 2016; Jefferson et al., 2017). Stormwater management deals with several high-level objectives such as flood control, water quality monitoring, erosion/sediment control, and groundwater recharge (Jotte, Raspati and Azrague, 2017).
3.1. Challenges and Opportunities in Urban Water Systems
UWS include a wide range of processes that are riddled with unique dependencies and impacting factors. However, the preservation and use of water is a holistic process that incorporates the wider ecosystem, climate, and wildlife as much as human use. Understandably, UWS share challenges that stem from external factors, and opportunities to adopt deep reinforcement learning techniques. In this section, common challenges that plague UWS processes are discussed, along with how DRL can provide innovative solutions. This is followed by challenges that researchers might encounter when applying DRL algorithms to UWS.
High trends of urbanisation globally increase the stress and demand on UWS, with 60% of the world's population expected to live in urban areas by 2030 (UN-Water, 2012). This rise in demand causes heavier loads and more uncertainty throughout all processes in UWS due to increased supply and network expansions (Sharma et al., 2010). Navigating these uncertainties can be challenging for meta-heuristic decision-making algorithms (Maier et al., 2014) in comparison to DRL algorithms, which learn from experience and are able to act in real time (Fu et al., 2022). DRL thus provides a method for managing uncertainty that outperforms traditional decision-making algorithms and can adapt to the rise in urbanisation.
Another challenge that plagues UWS is the energy consumption and carbon emissions associated with operating water systems (Nair et al., 2014; Xu et al., 2014). It has been estimated that 1-18% of all energy consumed in urban areas is due to UWS (Olsson, 2012), which in turn produces substantial carbon emissions. The negative effects of high energy consumption extend beyond the financial, as it promotes climate change and global warming. The circular relationship between carbon emissions, water scarcity and energy consumption is displayed in the water-energy-greenhouse nexus (Nair et al., 2014, fig. 1). DRL has a proven record of improving energy management within water systems (Hernández-Del-olmo et al., 2016; Hernández-del-Olmo et al., 2018) and system efficiency (Kılkış et al., 2023).
UWS often deal with a heterogeneously aging infrastructure that adds to the complexity of asset health management. Aging pipes, pumps, valves, and other system components can lead to high non-revenue water and affect the system's overall resilience. It is therefore essential to provide decision-making algorithms that can handle high-level dependencies and complexities. A challenge that manifests with such decision-making algorithms is the high computational cost associated with this complexity, which is why deploying DRL agents can benefit UWS: they rely on function approximators to lower the computational load (Sutton and Barto, 2018). Furthermore, asset management for UWS operations can be achieved by leveraging DRL for optimal design, strategic planning and predictive maintenance (Fu et al., 2022). Despite its clear advantages, this area of research requires more experimentation and real-world validation.
In most pipeline infrastructure, it is necessary to quantify leakage and asset health. Managing leakage effectively is an ongoing battle that affects UWS, especially water distribution systems. The use of DRL for leakage management is an unrealised opportunity, but one that has been recommended by reviews and surveys (Mosetlhe et al., 2020; Fu et al., 2022). A tabular Q-learning method for leakage reduction using pressure management in water distribution networks was tested in (Negm, Ma and Aggidis, 2023b); whilst the results were positive, it was clear that using DRL would enhance the approach further and overcome the curse of dimensionality.
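The tabular approach makes the dimensionality problem easy to see: every discretised (state, action) pair gets its own table entry, so the table grows combinatorially with the state resolution. The sketch below is a generic tabular Q-learning update under assumed state/action encodings, not the implementation from the cited study.

```python
import collections

# Generic tabular Q-learning update (an illustrative sketch, not the code
# from Negm, Ma and Aggidis, 2023b). The table keys every discretised
# (state, action) pair, so its size explodes with the state resolution --
# the curse of dimensionality that DRL's function approximation avoids.
Q = collections.defaultdict(float)

def q_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    """One Q-learning step: move Q(s, a) towards r + gamma * max_b Q(s', b)."""
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```

A DRL agent replaces the table `Q` with a neural network, which is what lets the same update idea scale to the continuous pressures and flows of a real network.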

Challenges of DRL in UWS
Building DRL algorithms is a science. In this section we build on the challenges and trade-offs underlined in the previous sections that are inherent in algorithm design. It is crucial to note that the field of RL research, much like its algorithms, has been expanded by experience followed by theory. In essence, some challenges were identified before being completely understood, such as the deadly triad issue (Sutton and Barto, 2018).
In DRL algorithm design, most researchers will make use of some form of function approximation, bootstrapping or off-policy learning. Function approximation uses examples to generalise an entire function; it therefore addresses the scalability and generalisation issues that riddle tabular algorithms and is the main driver behind the success of deep neural networks in reinforcement learning. Bootstrapping, used in DP and TD methods, improves an algorithm's data efficiency and hence reduces computational load. Finally, off-policy methods free the agent from the target policy to explore optimality. Separately, each of these methods helps RL researchers reach their desired benefits and design better optimisation algorithms; combined, however, the same methods induce instability and divergence, known as the deadly triad issue (Tsitsiklis and Van Roy, 1997; Sutton and Barto, 2018). This instability can be detrimental when controlling an urban water management system and could result in undesirable states. Issues arising from instability often spill into suboptimal policy development, which leads to low-performing algorithms. In addition, this could lead to weak dependencies between the observation data and the action space, forming unresponsive algorithms. In UWS, this would echo as low-performing water systems, affecting their resilience and ability to handle change. Further implications depend on the system being managed; in water distribution, for example, this could mean supply interruptions or pressure limit violations. Ensuring stability and resilience should be a primary goal of DRL design.
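The three ingredients can be made concrete in a single update rule. The sketch below (illustrative names, not from any reviewed paper) combines linear function approximation, a bootstrapped TD target, and an off-policy greedy maximum; it is exactly this combination that can diverge.

```python
import numpy as np

def semi_gradient_q_update(w, phi_s, a, r, phi_s_next, alpha=0.1, gamma=0.99):
    """One off-policy semi-gradient TD(0) update for Q(s, a) = w[a] @ phi(s).

    Function approximation: Q is a linear function of features phi.
    Bootstrapping: the target uses the current estimate at the next state.
    Off-policy: the target takes the greedy max, not the behaviour action.
    """
    target = r + gamma * max(w[b] @ phi_s_next for b in range(len(w)))
    td_error = target - w[a] @ phi_s
    w = w.copy()
    w[a] = w[a] + alpha * td_error * phi_s
    return w
```

With a deep network in place of the linear weights `w`, this is essentially the DQN update; stabilisers such as target networks and experience replay exist precisely to tame the divergence this combination invites.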
Another common challenge is the 'credit assignment problem'. This refers to the phenomenon of incorrectly evaluating the credit of an action because its consequences are unclear or only manifest later (Arulkumaran et al., 2017). These long-term dependencies are necessary to allow the agent to better comprehend the value of its actions. Hence, value functions have been modified to incorporate the estimated subsequent rewards, which are discounted to signify the dwindling nature of consequence (Eq. 2-5 & 2-6). UWS applications tend to be connected through both short-term and long-term dependencies, so it is important to include these consequences in the DRL algorithm's learning strategy. UWSs are complex and interconnected systems, and the consequences of specific actions may not be immediately apparent: unforeseeable impacts on water quality, pipeline integrity, or energy consumption may manifest over time. In addition, UWS are often dynamic, with changing environments that further emphasise the effect of the credit assignment problem when attempting to navigate their evolving nature.
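The discounting described above reduces to a simple recursion over the reward sequence. This generic sketch mirrors the standard discounted return rather than the document's Eq. 2-5/2-6 notation.

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...

    Computed backwards, so each reward is discounted by its delay, giving
    distant consequences a smaller, but non-zero, share of the credit.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

With gamma near 1 the agent weighs slow-moving effects such as pipe degradation heavily; with gamma near 0 it becomes myopic and only values immediate rewards.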
Finally, there is the exploration versus exploitation dilemma. This problem riddles most RL (and DRL) algorithms, as agents tend to behave in a reward-greedy manner. Since the agent's observations depend on its actions, and its actions depend on the reward generated, RL agents can find themselves in a loop around a local optimum rather than finding the global optimum (exploitation). Ultimately, the only way to solve this is to introduce randomness to the agent's behaviour, allowing the agent to receive new observations that may lead it to the global optimum (exploration). This trade-off in agent behaviour has been navigated in many ways; the simplest is the ε-greedy exploration policy, where the agent acts randomly with probability ε ∈ [0,1]. The value of ε decreases over time, leading the agent to a more exploitative nature as it learns. For continuous control, more complex methods have been used to introduce randomness over time while preserving momentum (Lillicrap et al., 2016; Arulkumaran et al., 2017). Other methods to tackle the exploration-exploitation dilemma include Osband et al.'s bootstrapped DQN using experience replay memory (Osband et al., 2016), Usunier et al.'s exploration in policy space (Usunier et al., 2017) and upper confidence bounds (UCB) (Lai and Robbins, 1985; Arulkumaran et al., 2017; Pathak et al., 2017). Managing the exploration-exploitation trade-off should be bespoke to each UWS application to ensure that agents do not converge on sub-optimal policies. If not managed properly, the dilemma could manifest in UWSs as operational inefficiencies. This is particularly critical in regions where water resources are scarce and efficient use is imperative.
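The ε-greedy scheme with a decaying ε can be sketched in a few lines; the geometric decay schedule and the floor value are illustrative assumptions.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Act randomly with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=q_values.__getitem__)   # exploit

def decayed(epsilon, rate=0.995, floor=0.01):
    """Geometric decay towards a small floor so some exploration remains."""
    return max(floor, epsilon * rate)
```

Keeping a small floor rather than decaying to zero is a common safeguard: a deployed agent in a changing system should never become purely exploitative.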
These challenges are inherent in most RL problems, and navigating them is a skill necessary to develop an effective DRL algorithm. The application of DRL in UWS carries specific limitations, such as its reliance on clean data. Data-driven optimisation tends to be insightful, but it requires sensor data across the entire network. UWSs vary in their data availability and data quality, which could limit the usability of DRL algorithms. This study is therefore best applied to UWSs that have established a coherent data pipeline and are looking to expand their facilities. Consequently, it is important to build accurate models and data pipelines that can be used to build the DRL agents. Well-developed DRL models also tend to be quite sensitive to erroneous observation data, which could falsely trigger harmful actions by, for example, pressure valves. The DRL input data must be cleaned and tested for accuracy to ensure that it represents the current state of the system. Furthermore, the application of DRL requires reliability evaluations before deployment on UWSs; it is necessary to ensure that the optimisation algorithm will not endanger the customers or the water system. For example, in WDNs, agents need to ensure that water supply remains uninterrupted without affecting asset life or risking future bursts. These concerns were covered by (Tian, Liao, Zhi, et al., 2022), where the authors devised a 'voting' method to improve reliability. Most UWSs are subject to daily and seasonal changes that will undoubtedly influence the performance of DRL models. While DRL algorithms have been proven to deal with randomness in the observation data, seasonal changes might require re-training of the models and further policy development. This could be achieved through a continuous integration/deployment (CI/CD) pipeline for the DRL models, which automates the deployment of newer, more suitable models.
Limitations also include the difficulty of designing a reward function that incorporates multiple objectives. Most UWS control tasks require the optimisation of multiple, mutually influencing objectives, which is why all relevant objectives should be included in the reward formulation to ensure that the agents are trained with a complete picture of the desired behaviour. Complex model design is not limited to the selection of the reward function but extends to DRL's sensitivity to hyperparameters and neural network architecture. The design of DRL algorithms involves many decisions, including various options for neural network architectures, optimisers, activation functions, pre-training techniques, and hyperparameters. The complexity of these design choices requires careful consideration and experimentation. Furthermore, generalisation of DRL models is limited, since the policy developed for one network may not necessarily work for another; it is therefore important to develop a separate model for each network. On the other hand, transfer learning between the neural networks remains an option that could help train models across different networks.
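As a concrete illustration, a multi-objective reward for a hypothetical WDN control task might fold pressure compliance, energy use and tank-level deviation into one scalar. The objectives, bounds and weights below are illustrative assumptions, not taken from any of the reviewed studies.

```python
def reward(pressures, energy_kwh, tank_deviation,
           p_min=20.0, p_max=80.0,
           w_pressure=1.0, w_energy=0.1, w_tank=0.5):
    """Scalarised multi-objective reward: reward compliance, penalise cost.

    All weights and bounds are hypothetical; in practice they would be
    tuned to reflect the operator's priorities between objectives.
    """
    # Fraction of nodes whose pressure sits within the acceptable band.
    in_band = sum(p_min <= p <= p_max for p in pressures) / len(pressures)
    return w_pressure * in_band - w_energy * energy_kwh - w_tank * abs(tank_deviation)
```

The choice of weights is itself a design decision with real consequences: an agent trained with too small an energy weight will happily over-pump, which is one reason reward design in UWS demands the experimentation noted above.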
The risks associated with DRL issues stem from unreliable, sub-optimal control. This could appear as concerns with water quality: unanticipated consequences, such as changes in flow patterns or variations in water treatment processes, may lead to water quality issues that pose risks to public health. Other issues could arise from adjustments in water flow and pressure affecting the integrity of the pipeline infrastructure. Over time, actions that seem reasonable in the short term may contribute to pipeline degradation or leaks; the challenge lies in identifying the causal relationships between management decisions and the gradual deterioration of the infrastructure. UWSs often require energy for pumping, treatment, and distribution processes, and management decisions that impact system dynamics can influence energy consumption. Unforeseen consequences may lead to suboptimal energy use or inefficiencies in the system, affecting both operational costs and environmental sustainability. Further implications are bespoke to the application of DRL and would appear with testing.

DRL Research in UWS
In essence, there are many parameters to consider when selecting a DRL algorithm, and careful selection of the correct DRL components is required. Depending on the optimisation objective, the agent's nature (pump, valve, etc.) and requirements (nodal pressures, head measurements, pump speed, etc.) will vary. In a critical review of deep learning in the water industry, Fu et al. mentioned the applicability of DRL in water distribution networks (WDN) and urban wastewater systems (Fu et al., 2022). In (Croll et al., 2023), the applications of reinforcement learning techniques in wastewater treatment were reviewed, with a few studies utilising DRL methods. Otherwise, there are no published reviews of DRL algorithms in UWS research. The literature on the application of DRL in UWS is limited, with most research relating to stormwater systems and water distribution networks, and a few publications on wastewater systems. This reveals a massive gap in the research field and an exciting journey for researchers in UWS at the cusp of realisation. In this section we review the available literature on deep reinforcement learning in urban water systems.

DRL in Water Distribution
In (Hajgató, Paál and Gyires-Tóth, 2020), the authors use a Duelling Deep Q Network (D-DQN) to find the optimal pump speeds for hydraulic efficiency under randomly generated demands. The algorithm minimises the inflow and outflow of tanks whilst keeping heads within an acceptable range at all nodes. The reward is calculated by evaluating consumer satisfaction, as the number of problematic nodes divided by the number of all nodes; the efficiency of the pumps, as the product of the standalone pump efficiencies divided by the product of theoretical peak efficiencies; and the feed ratio, by comparing the ratio of the pumps' supply to the tanks against the reservoirs' supply. When compared to a test set of Nelder-Mead, Differential Evolution (DE), Particle Swarm Optimisation (PSO), Fixed-Step Size Random Search (FSSRS) and One-shot Random Trial, the agent performed at a comparable level to the differential evolution algorithm and much better than the rest of the test set. All the algorithms were tested on a small (Anytown) and a large (D-town) WDN model. When using the one-shot random trial as a reference sub-optimal policy, the agent reaches a better solution and moves off-policy to outperform the DE algorithm. This technique relies entirely on live measurement data and can predict the best action in real time, making it the most suitable controller for real-life application.
Hu et al. conducted a thorough experiment in which they optimised the scheduling of fixed-speed pumps to minimise the electric cost of the pumps and tank level variations whilst adhering to sensible hydraulic constraints, using Proximal Policy Optimisation (PPO) and Exploration-enhanced Proximal Policy Optimisation (E-PPO) (Hu et al., 2023). Both DRL algorithms are policy-driven methods that set out to find the best policy to achieve the highest rewards. They conducted three experiments introducing increasing levels of uncertainty to the consumer demand patterns, using 0.3, 0.6 and 0.9 multipliers respectively, on the Net3 test network model. The results were compared with metaheuristics including genetic algorithms (GA), PSO and DE. GA converged after 100 epochs and its results were considered the optimal solutions (Hu et al., 2023). It was followed in performance by E-PPO, then PPO, DE and PSO. The exploration-enhanced policy saves approximately 6.10% of the energy cost with respect to PPO. Unlike the metaheuristic methods, which require re-running before each scheduling case, the DRL methods (PPO, E-PPO) can simply call their trained models to act in a fraction of a second (0.4 s) (Hu et al., 2023).
(Xu et al., 2021) tackles the pump scheduling optimisation problem in WDNs by combining knowledge learning and deep reinforcement learning in a knowledge-assisted proximal policy optimisation (KA-PPO) algorithm. KA-PPO evaluates the state using historical nodal pressure data and a reward function. Pressure management objectives were set to maintain junction heads within a specific range, minimise water age, and increase pump efficiency. The proposed algorithm was tested on the benchmark Anytown network to manage the performance of two pumps in the pump station. The results show that the algorithm performs favourably in comparison to the Nelder-Mead method and the D-DQN algorithm used in (Hajgató, Paál and Gyires-Tóth, 2020; Xu et al., 2021). Future work could improve the reward formulation by including energy prices. The problem setup could also be modified to consider a continuous action space and a long-period accumulated return, and the use of emulators and parallel computing could minimise the training time.
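The reward components described for (Hajgató, Paál and Gyires-Tóth, 2020) can be sketched as simple ratios. The function names and any combination into a scalar reward are assumptions for illustration; the paper's exact formulation may differ.

```python
import math

def problematic_fraction(n_problematic, n_nodes):
    """Consumer satisfaction term, as described: problematic nodes over
    all nodes (so a lower value indicates a better-served network).
    Hypothetical helper name, not from the paper."""
    return n_problematic / n_nodes

def pump_efficiency_term(efficiencies, peak_efficiencies):
    """Efficiency term, as described: product of standalone pump
    efficiencies over product of theoretical peak efficiencies."""
    return math.prod(efficiencies) / math.prod(peak_efficiencies)
```

Expressing the terms as dimensionless ratios in [0, 1] is what lets heterogeneous objectives (node satisfaction, pump efficiency, feed ratio) be compared and combined into one reward signal.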
In (Hasan et al., 2019), the authors offer four novel contributions to the fields of dynamic multi-objective deep reinforcement learning and water quality resilience applications. Based on the deep-sea treasure (DST) test bed, the authors develop a new test bed to fit the RL setting, creating the first test bed accommodating dynamic multi-objective DRL (DMODRL). They also devise a new method for multi-objective optimisation using DRL and present the first deployment of objective relation mapping (ORM) to construct the governing policy (Hasan et al., 2019). The final contribution is an expert system to evaluate water quality resilience (WQR) in Sao Paulo, Brazil. The proposed parity-Q deep Q network (PQDQN) algorithm was tested in the two DST environments and the WQR model. In all three test beds, the PQDQN algorithm outperformed the state-of-the-art multi-policy DRL algorithms: multi-policy DQN (MP-DQN), multi-objective Monte Carlo tree search (MO-MCTS) and multi-Pareto Q-learning (MPQ). In all three test beds, the performance of the algorithms was assessed using the evaluation metrics generational distance (GD), inverted generational distance (IGD) and hypervolume (HV) (Hasan et al., 2019). PQDQN managed priorities best using the ORM, aiding its impressive performance against the other multi-policy algorithms (MP-DQN, MO-MCTS, MPQ) (Hasan et al., 2019). This work could benefit from experimenting with multi-agent DRL and integrating real-world scenarios into the WQR model. Parallel computing and GPU processors could also reduce training time, and hyperparameter optimisation may improve the performance of the PQDQN algorithm further.
Taking a broader look at water systems, (Fan, Zhang and Yu, 2022) tackles asset management of water distribution networks post-earthquake. The problem setup involves four models that assess the damage incurred by the earthquake, recover the water distribution network (WDN) using the optimisation algorithms, measure the WDN hydraulic performance using the performance degree (PDW) at each timestep, and quantify the overall WDN resilience using the system resilience index (SRI). The chronological and iterative process between these models is clearly displayed in (Fan, Zhang and Yu, 2022, fig. 2). A graph convolutional network (GCN) was deployed as the function approximator for a DQN algorithm, creating GCN-DQN. This selection was a great step towards better representation of water distribution networks, since the graphical nature of the data calls for a matching deep neural network architecture. Other strategies used for comparison included two greedy search algorithms (static importance-based and dynamic importance-based), a genetic algorithm (GA) and a diameter-based prioritisation method. All five strategies were tested under three identical earthquake scenarios with different magnitudes. In all three scenarios the GCN-DRL model outperforms the other strategies by following repair sequences that lead to higher SRI scores (Fan, Zhang and Yu, 2022). The importance-based methods came second and third, whilst the diameter-based prioritisation came last. To minimise the training computation time, the authors used transfer learning, initialising the GCN weights for a new scenario with the GCN weights from a previous damage scenario. This reduced the computational load significantly and proved the scalability of the GCN-DRL model across all scenarios. Accommodating more sophisticated assumptions could improve the GCN-DQN model's reliability and the problem setup. Applying this work to different test networks could further prove its generality and encourage more development of asset management through deep reinforcement learning.

DRL in Stormwater Systems
Mullapudi et al. provide a first look at the application of deep reinforcement learning for real-time control in stormwater systems (Mullapudi et al., 2020). The authors test a simple DQN algorithm on the urban watershed in Ann Arbor as a benchmark test network. The problem setup involved agents taking actions to control valve status, with water levels and outflows as states, and an assumption of uniform rainfall and negligible base flow (Mullapudi et al., 2020). The authors set out to test the stability of DRL algorithms in controlling storm water management models (SWMM) by controlling a single basin and then multiple basins. Their research highlighted DRL algorithms' known sensitivity to reward formulation and deep neural network architecture. Even though the agent could have benefitted from a longer learning phase, DRL proved useful in managing the single-basin SWMM scenario. Due to the increase in state and action space, controlling multiple basins was more challenging. The agent behaved favourably in comparison to uncontrolled SWMMs in both scenarios but was outperformed by the equal-filling algorithm. The authors remain determined that RL-based controllers need to be explored further and applied to SWMM in the hope of reaching a stable real-time controller. The results provided in this paper could be used as a starting point to compare more capable DRL algorithms such as A3C and advanced variations of DQN. A more systematic method for reward formulation and neural network hyperparameter optimisation would also greatly improve the scalability and stability of the model.
A common issue with real-time control using DRL is concern over the reliability and uncertainty of its fluctuating actions in high-risk real-world cases. Tian et al.'s paper tackles this issue through a novel methodology called 'voting' (Tian, Liao, Zhi, et al., 2022). Voting compares actions from five different DRL algorithms to select the safest and most rewardable action, minimising the risk associated with DRL control. If none of the DRL agents provides a viable action, a backup user-defined rule-based action is executed. The methodology is used to minimise combined sewer overflow (CSO) and flooding in an urban drainage system. The DRL algorithms used in this study are DQN, DDQN, PPO1, PPO2 and A2C. Voting uses a novel independent security system to evaluate whether the actions meet the user-defined safety requirements. All five DRL algorithms and the voting algorithm are compared to a GA that was used as an upper-bound performance reference, by subjecting them to eight scenarios under different rainfall patterns. The results show that voting avoids harmful actions to minimise risk, improving the reliability of the real-time control. Figure 16 highlights that voting often draws its actions from PPO1 and never needed to use the backup action in any of the eight scenarios (Tian, Liao, Zhi, et al., 2022, fig. 16). All DRL algorithms performed well in this sequential problem and are therefore suitable candidates for CSO and flooding mitigation. Concerns over long training times and computational loads can be mitigated with parallel computing and an emulator for the stormwater model. The DRL algorithms could benefit from hyperparameter optimisation to improve the results further. Future work could also attempt to deploy the voting algorithm on a SCADA or online monitoring system to uncover uncertainties in real-world applications.
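The voting logic can be sketched as a filter-then-select step over the agents' proposals. The function signatures here are illustrative assumptions; the paper's independent security system and action evaluation are more elaborate.

```python
def vote(agents, state, is_safe, predicted_reward, fallback):
    """Pick the safest, highest-reward action proposed by any agent.

    Each agent proposes one action for the current state; unsafe proposals
    are discarded, and if none survives, a user-defined rule-based backup
    action is executed instead. All callables are hypothetical stand-ins.
    """
    proposals = [agent(state) for agent in agents]
    safe = [a for a in proposals if is_safe(state, a)]
    if not safe:
        return fallback(state)          # rule-based backup action
    return max(safe, key=lambda a: predicted_reward(state, a))
```

The appeal of this design is that safety screening is decoupled from the learned policies: even if every agent misbehaves, the system degrades to the rule-based controller rather than executing a harmful action.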
It is worth mentioning that the authors published a separate paper in which they developed an emulator for the stormwater model to relieve the high computational load associated with training the DRL agents (Tian, Liao, Zhang, et al., 2022). This emulator decreased the training time by 9 hours and 57 minutes, improving data efficiency when compared to the regular RL-stormwater model approach.
Like the previous article, (Bowes et al., 2021) leverages the power of DRL for flood mitigation. In this experiment, the authors developed a DDPG algorithm to create control policies that mitigate flood risk in the coastal city of Norfolk, Virginia. The DRL agent manages to balance flooding throughout the system and follow the control objectives of maintaining target pond levels and mitigating floods by controlling valves in the stormwater management model. The performance of DDPG was compared to a rule-based control strategy, model predictive control and a passive system. In summary, the DDPG algorithm achieved a 32% reduction in flooding in comparison to the passive system and a 19% reduction with respect to rule-based control. The model predictive control strategy deployed an online genetic algorithm optimisation, as in (Sadler et al., 2020), to produce results similar to the DDPG algorithm's (a 3% reduction in flooding compared to DDPG). The model predictive control was too computationally expensive to run on the complete dataset, whilst RL provided an 88x speed-up in the creation of the control policy (Bowes et al., 2021). This research highlights the power of DRL in real-time control of stormwater systems and its ability to produce impressive results with a lower computational load. Further research should aim to recreate these results on real-world systems through RL controllers. Combining the different real-time control methods as decision support tools should be investigated to enhance stormwater systems.

DRL in Wastewater Treatment
Wastewater treatment initially experimented with RL methods to manage the oxidation-reduction potential and pH levels of wastewater using Model-Free Linear Control (MFLC-MSA) (Syafiie et al., 2011), reduce the cost of N-ammonia removal using tabular Q-learning (Hernández-Del-olmo et al., 2016), improve the energy and environmental efficiency of N-ammonia removal using policy iteration (Hernández-del-Olmo et al., 2018), and optimise hydraulic retention through aerobic and anaerobic processes for biological phosphorus removal using Q-learning (Pang et al., 2019). In addition, actor-critic RL methods have been utilised for pH adjustment of electroplating industry wastewater in a continuous action space (Alves Goulart and Dutra Pereira, 2020). This RL method was mimicked in (Yang et al., 2022), where the authors utilise an actor-critic RL method to track the desired dissolved oxygen set points in a wastewater treatment plant (WWTP). A more detailed review of RL applications in WWTPs can be found in (Croll et al., 2023). Following the successes of DRL algorithms and their growing popularity, more research has deployed DRL methods to solve issues in WWTPs.
The only use of a value-based DRL algorithm in wastewater treatment is presented in (Nam et al., 2020). The article carries out an experiment involving both RL (Q-learning, SARSA) and DRL (DQN, deep-SARSA) algorithms to reduce the aeration energy consumption without decreasing the effluent quality index. These factors were estimated using the activated sludge model soluble product (ASM-SMP), named benchmark simulation model 1 (BSM1), developed by (Alex et al., 2018). The DQN model largely outperformed the other methods, developing a trajectory that simultaneously improves the economic benefits by 36.53% and the environmental efficiency by 0.23%. The tabular RL methods failed to handle the complexity and caused decreases in energy savings and environmental efficiency. Recommended further work includes experimentation with multi-agent systems to control environmental and economic benefits whilst minimising risks from membrane fouling (Nam et al., 2020). The authors did not discuss hyperparameter optimisation, which could further improve their current results. In addition, the use of policy gradient methods could provide insight into the performance difference between policy gradient and value-driven DRL.
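The essential difference between the two value-based families compared in that study (Q-learning/DQN versus SARSA/deep-SARSA) is a single line: the bootstrap target. The sketch below is generic, not the paper's code.

```python
def q_learning_target(r, q_next, gamma=0.99):
    """Off-policy target: bootstrap off the greedy next action."""
    return r + gamma * max(q_next)

def sarsa_target(r, q_next, a_next, gamma=0.99):
    """On-policy target: bootstrap off the action actually taken next."""
    return r + gamma * q_next[a_next]
```

Because Q-learning always evaluates the greedy action, it can learn the optimal policy while behaving exploratively, which is one plausible reason the DQN variant coped better with the aeration control task than its on-policy counterpart.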
In (Panjapornpon et al., 2022), the authors leverage the hybrid properties of multiple DDPG agents as an actor-critic method. This study focuses on developing a MADRL approach for pH control and tank level control by simultaneously managing the flow rates of the influent stream and the neutralisation stream in a continuous stirred tank reactor (Panjapornpon et al., 2022). The authors use grid search for hyperparameter tuning across three performance indexes. The DDPG uses a gated recurrent unit and rectified linear units for the actor and critic networks, as shown in (Panjapornpon et al., 2022, figs 6 & 7). The multi-agent DDPG algorithm performed favourably in comparison to the proportional-integral controller, with better performance indexes and fewer oscillations (Panjapornpon et al., 2022). This paper highlights the benefits of using DRL to optimise control performance. Deploying the RL controllers using programmable logic controllers on real WWTPs could provide real-world validation.
MADRL is utilised in (Chen et al., 2021) to control dissolved oxygen set points and chemical dosage in a WWTP. The authors use a multi-agent DDPG algorithm to lower environmental impacts, cost and energy consumption using a life-cycle-driven reward function. The life cycle assessment driven strategy outperformed cost-oriented and effluent quality optimisation in reducing environmental impacts. The multi-agent DDPG produced good results; however, the study lacks comparisons with other optimisation algorithms, which should be investigated in future work. MADRL should enable better navigation of highly complex environments, so validating this novel algorithm against field data would be valuable.
A statistical-learning-based PPO algorithm is used in (Filipe et al., 2019) to develop a predictive control strategy that minimises energy consumption in a wastewater pumping station. The model-free method decreases electrical consumption by 16.7% and tank level violations by 97% in comparison to the current operating conditions of the pumping station at the Fábrica da Água de Alcântara WWTP in Portugal. The authors also examine whether wastewater intake rate forecasts improve the PPO algorithm's results. Indeed, the forecasts help: cumulative energy consumption drops from 459-469 MWh to 340-348 MWh (Filipe et al., 2019). Bayesian optimisation was utilised to tune the forecasting hyperparameters. It is important to compare these results to other model predictive control methods used in WWTP pumping stations and to other optimisation approaches, to situate the DRL algorithm's performance against known benchmarks. It would also be beneficial to recreate the results using WWTP benchmark models and to validate them in real-world applications.
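The two competing objectives above, energy consumption and tank level violations, are typically combined into a single scalar reward for the PPO agent. The sketch below is an illustrative shaping, not the paper's actual reward; the limits and weights are made-up values, with violations penalised much harder than energy use.

```python
# Illustrative reward shaping for a pumping-station agent (weights and
# tank limits are hypothetical, not the values used in the paper):
# penalise energy use, and penalise tank-level violations far more.
def pump_reward(energy_kwh, tank_level_m, low=1.0, high=4.0,
                w_energy=1.0, w_violation=10.0):
    # How far the level strays outside the allowed [low, high] band:
    violation = max(0.0, low - tank_level_m) + max(0.0, tank_level_m - high)
    return -w_energy * energy_kwh - w_violation * violation

r_ok = pump_reward(2.0, 2.5)   # in-band level: only the energy cost
r_bad = pump_reward(2.0, 0.5)  # level 0.5 m below the band: extra penalty
```

The ratio of the two weights encodes the operator's priorities; the 97% drop in violations reported above suggests the violation term dominated in practice.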

DRL in Raw Water Treatment
The authors have not found many papers relating to the application of DRL to the supply and treatment of raw water. A related paper discusses the use of DRL as a smart planning agent for off-grid camp water infrastructure (Makropoulos and Bouziotas, 2023), although this is not an urban water system. DQN, PPO and multi-armed bandits were tested using an urban water optioneering tool (UWOT). The DRL agents are tasked with selecting from an array of supply technologies with associated costs, given a set of demand patterns for potable and non-potable water, to explore deployment conditions in the off-grid system. This paper's ability to train and test DRL agents in strategic planning paves the way for similar strategic planning opportunities in UWS.
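The multi-armed-bandit component mentioned above can be sketched briefly. The option names and rewards below are illustrative stand-ins, not taken from the UWOT study: an epsilon-greedy rule trades off exploring supply options against exploiting the best-known one, with an incremental mean-reward update.

```python
import random

# Hedged sketch of an epsilon-greedy multi-armed bandit over hypothetical
# supply options (names and rewards are illustrative).
def choose(values, eps, rng):
    if rng.random() < eps:
        return rng.randrange(len(values))                    # explore
    return max(range(len(values)), key=values.__getitem__)   # exploit

def update(values, counts, arm, reward):
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]      # running mean

options = ["rainwater harvesting", "greywater reuse", "mains top-up"]
values, counts = [0.0] * 3, [0] * 3
update(values, counts, 1, 1.0)  # observe a good outcome for option 1
arm = choose(values, eps=0.0, rng=random.Random(0))
```

Bandits suit this optioneering setting because each decision is close to stateless; DQN and PPO become relevant once deployment decisions affect future system state.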
The only raw water supply application can be found in (Li et al., 2023), where the researchers apply the proximal policy optimisation (PPO) algorithm to lower suspended sediment concentration (SSC) and energy consumption, tested on data from a Yellow River pumping station in China. The DRL environment combines a hydraulic model with an SSC predictive model built on a multilayer perceptron. The PPO algorithm is trained on the predicted SSC (predictive control) and on real-world SSC data (perfect predictive control). Both strategies are compared to a manual strategy developed by experienced operators. The SSC predictive model was not fully accurate, as it deviates from the training and validation sets. In both the predictive and perfect predictive control settings, the DRL algorithm outperforms the manual strategy, producing a smoother sediment profile, decreasing energy consumption by 8.33%, and reducing average sand volume per unit water withdrawal by 37.01% and 40.575% respectively (Li et al., 2023). Furthermore, the authors investigate the effects of reservoir water outflows and initial reservoir water volumes, finding a strong dependence on the initial reservoir water volume. This paper could benefit from comparing the DRL algorithm to heuristic optimisation algorithms such as variants of the genetic algorithm (GA) or differential evolution (DE). The researchers should also attempt to optimise the reward function by experimenting with different weights, and apply some form of hyperparameter optimisation to increase the accuracy of the SSC predictive model.

Future Work
As repeatedly displayed throughout this review, the field of deep reinforcement learning is growing rapidly and expanding across various real-world applications, the most recent of which is the water industry. This field of application is relatively new and brimming with possibilities for real-time control. Extending this technology to the operational management of water systems is an area of untapped potential with many avenues to explore. DRL provides a method to continuously train a model to react and adjust to the environment it is placed in. This ability to learn from experience rather than labelled data makes DRL a strong tool for the continual optimisation of any unfamiliar network, and hence for generalising it to water networks across the country. Researchers are therefore encouraged to experiment with simple DRL algorithms in different aspects of water distribution networks, stormwater systems, water treatment and sanitation, and wastewater management, including strategic planning and asset management. The link between leakage and greenhouse gas emissions has been repeatedly mentioned in water management literature (Negm, Ma and Aggidis, 2023a) due to its relevance to the research community. It will be interesting to extend DRL algorithms in water applications to minimise carbon emissions.
As this is the first review paper dedicated to deep reinforcement learning in UWS, the collation of this evolving field should continue, acting as a beacon for new researchers. More review papers will help define the community's direction, evaluate recent findings and reveal possible novelties. Nevertheless, it is essential that researchers interested in this field spend considerable effort understanding the fundamentals of DRL. This will help clear misconceptions about the applicability of the field, highlight new advancements, and hopefully steer academics away from repeating mistakes. More research articles formalising methods of DRL application would serve as a bridge for aspiring researchers. Whilst researchers currently focus on testing DRL on models and software case studies, it is necessary to validate DRL controllers in real-world case studies. Finally, focusing on the application of DRL in graph-based distribution systems, such as electrical distribution networks, will provide a clearer perspective on possible overlaps and trends that could benefit water distribution.
To fuel further research, the research community should focus its efforts on benchmarking scalable DRL environments for testing. Early efforts to benchmark environments can save upcoming researchers the need to repeatedly contextualise the optimisation problem within the scope of DRL. These environments should communicate effectively with the most popular hydraulic simulators (e.g., EPANET and SWMM) through wrappers such as PYSWMM (McDonnell et al., 2020) and EPYNET (Vitens, 2017). They should also expose interfaces compatible with established DRL libraries such as Stable Baselines, PyTorch and TensorFlow. As this is an engineering application, researchers should aim to develop models that prioritise reliability and scalability. Demonstrations of these algorithms acting on live data and ground-truth models in real time should be the objective from an engineering perspective.
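A benchmark environment of the kind proposed above usually follows the Gymnasium reset/step convention so that any standard DRL library can drive it. The skeleton below is a hypothetical illustration: a real implementation would advance a hydraulic simulator (e.g. via PYSWMM) inside `step`, whereas here toy tank dynamics and made-up numbers stand in for the simulator.

```python
# Hypothetical skeleton of a benchmark environment following the
# Gymnasium reset/step convention. All dynamics and numbers are
# stand-ins for calls into a real hydraulic simulator such as pyswmm.
class PumpStationEnv:
    def __init__(self, horizon=24):
        self.horizon = horizon
        self.t = 0
        self.level = 2.0  # tank level (m), stand-in initial condition

    def reset(self):
        self.t, self.level = 0, 2.0
        return self._obs(), {}

    def step(self, action):
        # action in [0, 1]: pump utilisation; inflow is a fixed stand-in.
        inflow = 0.3
        self.level = max(0.0, self.level + inflow - 0.5 * action)
        self.t += 1
        reward = -action - (10.0 if self.level > 4.0 else 0.0)
        terminated = self.t >= self.horizon
        return self._obs(), reward, terminated, False, {}

    def _obs(self):
        return (self.level, float(self.t))

env = PumpStationEnv(horizon=2)
obs, _ = env.reset()
obs, reward, terminated, truncated, _ = env.step(1.0)
```

Packaging simulator access behind this interface is exactly what would let Stable Baselines agents train on EPANET or SWMM models without bespoke glue code for every study.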

Conclusions
In this new age of digitalisation, it is necessary that our physical systems do not lag too far behind; hence the need to constantly explore new avenues to incorporate and test state-of-the-art algorithms. After introducing the proposed field of DRL in the water industry, the field was contextualised within artificial intelligence and machine learning. The main advantages and properties of reinforcement learning were highlighted to explain the appeal of the technology. This was followed by a gradual explanation of the formalism and mechanisms behind reinforcement learning and deep reinforcement learning, supported with mathematical derivations. Different computing fields were explained to highlight the origins of methods commonly used in DRL. Furthermore, the milestones, trends and challenges of deep reinforcement learning were discussed to develop a better understanding of the current research area. The main research articles that have adapted deep reinforcement learning methods to solve problems in urban water systems were reviewed thoroughly and summarised in Table 1. Finally, future work and recommendations were included to provide a clear view of the application of DRL in UWSs. The conclusions of this review can be summarised as follows:
• Deep reinforcement learning improves on reinforcement learning by using deep neural networks for function approximation. This has improved scalability and resulted in many successes across simulated and real applications.
• Current DRL trends tackle high-dimensional complexity by mimicking human psychology and natural hierarchical structures.
• The field of deep reinforcement learning can benefit from better classification to help new researchers navigate it.
• The application of DRL in UWS is still developing, yet it shows great promise to improve our current practices with water. Early efforts to benchmark DRL test beds and environments will aid the growth of this topic.
This paper aims to spark discussion and action on future applications that harness the power of deep reinforcement learning's experience-based real-time learning in UWS. Water is earth's most valuable resource, hence the need to continuously improve our water practices.

Table 1 - Summary of reviewed articles