1 Introduction

Reliable water distribution systems (WDS) are paramount for functioning societies. Around the globe, such systems deteriorate due to budget cuts, lack of maintenance, and increasing urbanization [1]. Different kinds of maintenance approaches, such as recurring schedules (planned) or run-to-failure (corrective), are implemented to keep these assets from failing. However, such approaches cannot deliver effective solutions at minimal cost while maintaining the desired level of service. Replacing an asset according to a predefined plan results in the loss of part of its useful functioning life, whereas replacement after failure can cause large consequential damage and unavailability of services [2].

The recent success of deep neural networks as high-capacity function approximators, such as deep Q-networks, has stimulated enormous progress in solving sequential decision-making problems. In particular, the integration of deep learning and reinforcement learning as deep reinforcement learning (DRL) has been applied to various applications, including board games [3], video games [4], robotic control [5], and optimal routing [6]. DRL harnesses powerful general-purpose representations to learn and characterize feedback in a long-term horizon [7]. Yet, beyond games and standard optimization problems such as knapsack or traveling salesman, application-oriented studies of DRL for practical, real-world problems remain scarce.

Fig. 1 Illustration of deep reinforcement learning framework adapted from [8]

This paper develops and implements a DRL solution to automatically devise an optimal rehabilitation policy for WDS under economic and performance requirements. Note that the policy in reinforcement learning is the agent’s way of learning and behaving at a given time [7]. In this application setting, the learned policy is referred to as the rehabilitation/maintenance plan. We study the problem of rehabilitation planning via online and offline DRL approaches, illustrated in Fig. 1. In the online learning paradigm, the agent actively interacts with the (simulated) environment to collect experiences and learn an optimal policy. The offline learning paradigm seeks to learn optimal policies from a logged dataset. The need for active interactions and the inability to (re)use large, diverse datasets remain among the main limitations to the wide applicability of the reinforcement learning framework in practical applications [8]. In this paper, we further study whether the optimal policy found by online DRL can be improved by reusing the logged dataset in the offline DRL setting. Given the deterioration profiles (states) and subsequent maintenance (actions) on assets, offline DRL offers a useful paradigm to explore and learn rehabilitation policies from the logged dataset.

In the following, we explain the online learning setting followed by the offline DRL approach. Figure 1a shows the schematic illustration of the online DRL setup. We model agent and environment interactions as a Markov decision process (MDP) [9]. Note that we do not explicitly create such an MDP within our framework but merely use it to gain insights into the problem. The agent observes the (simulated) environment state \(s_t\) at time t and performs an action \(a_t\). The agent receives a reward \(r_t\) as a feedback signal, and the environment moves to the next state \(s_{t+1}\). The MDP formulation assumes that the state transition follows the Markovian property, meaning that future states depend only on the current state \(s_t\) and action \(a_t\), irrespective of the agent’s previous states and actions. The agent’s interactions with the environment, based on a policy \(\pi _k\), are collected in a buffer denoted as \({\mathcal {D}}\). The buffer \({\mathcal {D}}\) consists of samples from multiple policies \(\pi _1, \pi _2, \ldots , \pi _k\), which are further used to update a new policy \(\pi _{k+1}\). We choose a deep Q-network as the online learning algorithm [4].
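For illustration, this interaction loop can be sketched in a few lines of Python. The environment `env` (Gym-style) and the `agent` interface below are hypothetical placeholders, not our actual implementation.

```python
# Minimal sketch of the online interaction loop of Fig. 1a.
# `env` follows the Gym API; `agent` exposes hypothetical select_action/update methods.
buffer = []                                    # experience buffer D
for episode in range(1000):
    state = env.reset()                        # initial pipe state s_0
    done = False
    while not done:
        action = agent.select_action(state)                  # action from current policy pi_k
        next_state, reward, done, _ = env.step(action)        # environment transition and reward
        buffer.append((state, action, reward, next_state))    # store the transition in D
        agent.update(buffer)                   # improve the policy towards pi_{k+1}
        state = next_state
```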

In an offline setting (see Fig. 1b), the agent does not iteratively collect more data to update the policy. Instead, an offline DRL agent utilizes a static dataset \({\mathcal {D}}\) consisting of tuples of state, action, reward, and next state. The static dataset is generated using a behavior policy, depicted as \(\pi _{\beta }\), which is based on either a random policy, an expert policy, or a partially trained online policy. In this paper, we use a dataset accumulated during the training of an online DQN agent based on a near-expert policy. Previous studies [10, 11] also used the dataset collected by an online agent for learning in an offline DRL setup. We use the conservative Q-learning (CQL) algorithm proposed by Kumar et al. [12] to learn an optimal policy from the static dataset.

In both online and offline DRL, the agent seeks to maximize the expected cumulative reward in a definite time horizon following a sequence of actions. Our work offers the following key results and contributions:

  • We establish a novel solution for optimal rehabilitation planning of water distribution systems under economic and performance requirements. We design the elements for a DRL framework, including state definition, discrete actions, and reward function.

  • We are the first to introduce offline reinforcement learning to solve the practical problem of rehabilitation planning of water pipes using a static dataset only. We adapt the conservative Q-learning algorithm [12] to use logged interactions from the near-expert policy learned by the online DRL agent.

  • We show that the online DRL setup can learn a cost-effective intervention policy compared to the traditional preventive, corrective, and greedy schedules. Additionally, the offline DRL-based approach can further improve the learned policy using the logged dataset.

The rest of the paper is structured as follows: Sect. 2 provides an overview of related work. Section 3 introduces the problem setting followed by an explanation of solution approaches in Sect. 4. Section 5 introduces the case study, presents the experimental setup, outlines the results, and provides a general reflection, followed by the limitations of the proposed approach. Section 6 highlights the concluding remarks and provides a future outlook.

2 Related work

2.1 Rehabilitation planning of water pipes

Optimal scheduling for rehabilitation of WDS has been actively studied since 1979 with the pioneering work of Shamir and Howard [13]. The methods reported in the literature employ various optimization techniques and mathematical models, such as genetic algorithms [14, 15], dynamic programming [16], integer linear programming [17], multi-criteria decision analysis [18], and budget allocation approaches [19, 20], to support decision-makers in planning the repair and replacement of water pipes. Advanced techniques such as artificial neural networks (ANN) [21], cluster analysis [22], and graph theory [23] have also been adopted in a few studies to identify critical components of WDS and to predict the number of pipe failures along with influencing factors. Similarly, in the realm of sequential decision-making, Markov decision processes (MDP) are used to model the deterioration of water networks [24].

Despite the vast research interest, existing planning methods are site-specific and do not include comprehensive criteria (such as economic, reliability, availability, and social impact) for large-scale pipe networks [25, 26]. Besides, traditional approaches such as integer programming and evolutionary methods do not scale to a large number of pipes with continuously changing physical states, and their (computational) complexity grows exponentially with the problem size [27].

2.2 Deep reinforcement learning applications

With the success of the Deep Q network for playing Atari video games [4], the field of deep reinforcement learning (DRL) has gained enormous traction. DRL has outperformed human experts on several board [3], card [28] and video games [29]. It has also been applied to solve complex robotic movements [5], vision control [30], and routing tasks [6]. Nevertheless, many opportunities (and challenges) of DRL in solving problems in diverse domains such as asset management, health, manufacturing, and transportation have remained under-explored.

A few notable studies address sequential decision-making problems from diverse application domains using DRL. Zheng et al. [31] conducted a comprehensive study that learns dynamic tax policies through economic simulations to balance equality and productivity in socio-economic settings. Zheng et al. [32] solved a chemical production scheduling process under uncertainty using an actor-critic policy gradient algorithm. Cals et al. [33] introduced an approach to solving order batching and sequencing problems in a warehouse using the proximal policy optimization algorithm. Similarly, using a multi-agent framework, Wang and Sun [34] address the sequencing problem to avoid bus bunching. DRL-based methods are also being employed for the efficient energy management of buildings [35, 36].

Wei et al. [37] suggest planning interventions at the asset level in a deterministic environment setting. Lei et al. [38] introduce a life-cycle maintenance planning approach for network-wide bridges. Huang et al. [39] propose a DRL approach for preventive maintenance of production lines. Khorasgani et al. [40] provide an offline reinforcement learning approach for maintenance decision-making. Our work distinguishes itself from previous studies as we investigate the DRL application for the maintenance planning of water pipes under stochastic settings. We aim to minimize the average intervention cost and the failure probability while considering the chance of unplanned failures, whereas previous studies mainly focus on improving assets’ performance. Our paper also studies the offline RL paradigm to further improve the policy learned by the online variant using the logged dataset.

3 Problem formulation

The objective is to learn an optimal rehabilitation policy for continuously deteriorating water pipes. A policy is optimal if it incurs minimal average cost and a reduced failure probability over a definite planning horizon. We achieve this by finding optimal intervention moments such that maintenance is performed before failure, but neither too early, which wastes functional life, nor too late, which causes system unavailability and additional costs. In the following, we formulate maintenance planning as an MDP.

States The state space represents the (physical) characteristics of a pipe. It consists of a pipe’s age, material, failure rate, and failure probability, denoted as \(s_{t} = \langle {\text{age}},\,{\text{mat}},\lambda ,pf\rangle\). The failure rate for each material is obtained from [41], whereas the probability of failure is computed using Eq. 4. Besides material, the failure rate also depends on the length of the pipe, as longer pipes are likely to experience more failures than shorter pipes [42]. Therefore, the failure rate of each pipe is multiplied by its length to obtain its length-adjusted failure rate.

Actions The agent’s objective is to find an optimal action for the given state at each timestep. The action \(a_t \in \{0, 1, 2\}\) represents the discrete choices available to the agent at each timestep for a pipe, where \(a_t = 0\) denotes the do nothing action, \(a_t = 1\) the maintain action, and \(a_t = 2\) the replace action.

Reward function The DRL-based agent seeks to maximize the expected cumulative reward over a definite time horizon following a sequence of actions [7]. Accordingly, we design the reward function to represent our user-specific objective. We construct the reward function with negative values (costs and penalties) so that maximizing the reward minimizes the overall intervention cost and failure probability, as shown in the following equation.

$$\begin{aligned} R(s_t, a_t, s_{t+1}) = {\text{MC}}_t - pf_t \end{aligned}$$
(1)

where \({\text{MC}} _t\) represents the maintenance intervention cost, and \(pf_t\) is the failure probability, corresponding to the safety risk for the asset at each timestep t. The cost is computed as follows:

$$\begin{aligned} {\text{MC}}_t ={\left\{ \begin{array}{ll} 0 & \text {if } a_t = 0 \text { and } pf_t \le 0.9 \\ -1 & \text {if } a_t = 0 \text { and } pf_t > 0.9 \\ -0.5 & \text {if } a_t = 1 \text { and } pf_t > 0.5 \\ -1 & \text {if } a_t = 1 \text { and } pf_t \le 0.5 \\ -0.8 & \text {if } a_t = 2 \text { and } pf_t > 0.5 \\ -1 & \text {if } a_t = 2 \text { and } pf_t \le 0.5 \\ \end{array}\right. } \end{aligned}$$
(2)

We introduce a penalty of \(-1\) to discourage unnecessary maintenance actions. Similarly, a penalty is given if the system is near failure, yet the agent chooses the action of do nothing. Note that the cost values in the reward function are representative and do not relate to the material and length of the pipe.
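For illustration, Eqs. 1 and 2 can be written as a small Python function. The sketch below mirrors the reward logic as stated above; it is illustrative rather than our exact implementation.

```python
def intervention_cost(action, pf):
    """Maintenance cost MC_t of Eq. 2 (action 0: do nothing, 1: maintain, 2: replace)."""
    if action == 0:
        return -1.0 if pf > 0.9 else 0.0   # penalty for ignoring a near-failing pipe
    if action == 1:
        return -0.5 if pf > 0.5 else -1.0  # maintain: cheap if justified, penalized if premature
    if action == 2:
        return -0.8 if pf > 0.5 else -1.0  # replace: costlier, penalized if premature
    raise ValueError("unknown action")


def reward(action, pf):
    """Reward of Eq. 1: intervention cost minus failure probability."""
    return intervention_cost(action, pf) - pf
```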

Dynamics (Time to transition) The environment simulates the physical characteristics of water pipes. At any timestep, the agent receives a representation of the environment in the form of a state \(s_t \in {\mathcal {S}}\), where \(s_t=\langle {\text{age}}, {\text{mat}}, \lambda , pf \rangle\). The agent responds with an action \(a_t \in {\mathcal {A}}\), receives a reward \(r_t \in {\mathcal {R}}\), and moves to the next state \(s_{t+1} \in {\mathcal {S}}\). In a finite MDP, the next state \(s_{t+1}\) and reward \(r_{t+1}\) have a discrete probability distribution that depends only on the current \(s_t\) and \(a_t\)  [7]. The dynamics function (also referred to as the state transition probability function) is defined as:

$$\begin{aligned} p (s^{\prime}, r | s,a) = {\mathbb {P}}(s_{t+1} = s^{\prime}, r_{t+1} = r | s_t = s, a_t = a) \end{aligned}$$
(3)

In other words, the next state and reward are only dependent on the current state and action irrespective of all the previously visited states, thus respecting the Markov property.

We estimate the failure probability of water pipes using the exponential (Poisson) distribution [43] represented as:

$$\begin{aligned} pf_t = 1 - e^{- \lambda \times \, {\text{age}}_{t}} \end{aligned}$$
(4)

where \(pf_t\) is the probability of failure at time t, \(\lambda\) is the failure rate, and \({\text{age}}_{t}\) is the current age of the pipe.

In each iteration, the \({\text{age}}_{t}\) of the pipe is updated depending on the agent’s chosen action as follows:

$$\begin{aligned} {\text{age}}_{t+1} ={\left\{ \begin{array}{ll} {\text{age}}_t + 1 & \text {if } a_t = 0 \\ {\text{age}}_t - {\text{U}}(j, k) & \text {if } a_t = 1 \\ 1 & \text {if } a_t = 2 \\ \end{array}\right. } \end{aligned}$$
(5)

The age variable \({\text{age}}_{t+1}\) is

  • incremented by one (year) in case of the do nothing action, reflecting the increase in pf.

  • reduced by a certain factor sampled from a uniform distribution between j and k as a result of the maintain action. This is to simulate the variable improvement in the pipe’s physical state, which depends on several factors such as its material, length, intervention type, and intervention quality [38].

  • set to one to depict the good-as-new condition of the pipe resulting from the replace action.

Besides gradual degradation, an asset can experience sudden failure due to changes in its surroundings and environmental impacts. Therefore, in each iteration, we simulate a 5% random chance of sudden failure, which forces a replacement action.
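For illustration, the deterioration dynamics of Eqs. 4 and 5, including the 5% chance of sudden failure, can be sketched as a small transition function. The bounds `j` and `k` of the uniform improvement, the lower age bound of one year, and the handling of sudden failure below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def failure_probability(rate, age):
    """Eq. 4: pf_t = 1 - exp(-lambda * age_t)."""
    return 1.0 - np.exp(-rate * age)

def next_age(age, action, j=5.0, k=15.0):
    """Eq. 5: age update for do nothing (0), maintain (1), and replace (2)."""
    if rng.random() < 0.05:          # 5% chance of sudden failure in each iteration,
        action = 2                   # which forces a replacement (assumed handling)
    if action == 0:
        return age + 1               # one more year of deterioration
    if action == 1:
        return max(1.0, age - rng.uniform(j, k))  # variable improvement from maintenance
    return 1.0                       # good-as-new after replacement
```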

4 Reinforcement learning approaches

This section briefly explains the deep Q-network (DQN) and conservative Q-learning (CQL) algorithms used to solve the maintenance planning problem. CQL is the offline DRL counterpart to DQN in the online DRL setting.

4.1 Deep Q-network for online DRL

The agent’s interactions with the environment generate a sequence of trajectories, which are denoted as:

$$\begin{aligned} s_0, a_0, r_1, s_1, a_1, r_2, s_2, a_2, r_3, \dots \end{aligned}$$
(6)

The goal of the agent is to find an optimal policy that maximizes the expected accumulated return \(G_t\), which is the discounted sum of rewards, \(G_t = r_t + \gamma r_{t+1}+ \gamma ^2 r_{t+2} + \cdots\), where \(r_t\) is the reward received at time t and \(\gamma \in [0,1]\) is a discount factor. The optimal Q-function (action-value function) provides the maximum action value in every state and is determined by the Bellman optimality equation [7]:

$$\begin{aligned} Q^* (s,a) = {\mathbb {E}}_{s^{\prime}}[r + \gamma ~ {\text{max}} _{a^{\prime}} Q^* (s^{\prime}, a^{\prime})] \end{aligned}$$
(7)

where r is the immediate reward and \(a^{\prime}\) is the action in the next state \(s^{\prime}\) that yields the maximum value. \({\mathbb {E}}\) denotes the expected value given that the agent follows the policy \(\pi\). The optimal policy \(\pi\) is derived from Eq. 7 by iterative updates such that, starting in state s, the agent takes the highest-return action a and follows \(\pi\) for all future steps.

Q-learning falls short for complex systems beyond standard grid examples. The seminal work of [4] proposed utilizing deep neural networks (DQN) as function approximators for estimating Q-functions over high-dimensional state spaces. For a long time, utilizing neural networks for reinforcement learning remained an open research question due to the instability of network training and convergence caused by correlation among observations [44]. DQN introduced two stabilizing training techniques. First, it implements experience replay, which randomly samples observations and thus breaks these correlations. Second, the target value \([r + \gamma ~ {\text{max}}_{a^{\prime}} Q(s^{\prime}, a^{\prime})]\) is only periodically updated against the action values \(Q(s,a)\).

The value function, approximated by a neural network, is denoted as \(Q(s,a;\theta _i)\), where \(\theta _i\) are the weights of the Q-network at the ith iteration. The Q-value is updated using the following loss function [4]:

$$\begin{aligned} L_i(\theta _i) = {\mathbb {E}}_{(s,a,r,s^{\prime}) \sim U(D)} \Big [ \big ( r + \gamma ~{\text{max}}_{a^{\prime}} Q(s^{\prime}, a^{\prime}; \theta _i^-) - Q (s,a;\theta _i)\big )^2 \Big ] \end{aligned}$$
(8)

where U(D) represents the uniform distribution over the transition tuples (\(s, a, r, s^{\prime}\)) drawn from the experience replay to apply the Q-value updates. Here \(\gamma\) is the discount factor, \(\theta _i\) are the weights of the neural network used to determine \(Q(s,a;\theta _i)\), and \(\theta ^-_i\) are the weights used to compute the target value. As noted earlier, the \(\theta ^-_i\) weights are updated only periodically.
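The loss in Eq. 8 corresponds to a few lines of PyTorch. The sketch below is illustrative; it assumes `q_net` and `target_net` are identical multilayer perceptrons mapping a state to the three action values, and that a minibatch of transitions has already been sampled from the replay buffer. Terminal-state masking is omitted because episodes end only at the fixed horizon.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Squared TD error of Eq. 8 for a minibatch (s, a, r, s') from the replay buffer."""
    s, a, r, s_next = batch                                       # tensors; `a` holds integer action indices
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)   # Q(s, a; theta_i)
    with torch.no_grad():                                         # target weights theta_i^- are held fixed
        target = r + gamma * target_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, target)
```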

Table 1 Case study data of water pipes network used for DRL agent training

4.2 Conservative Q-learning for offline RL

Offline DRL aims to learn an optimal policy \(\pi\) that returns the maximum discounted future reward by learning only from a static dataset of transitions \((s, a, r, s^{\prime})\). The static dataset is generated using a behavior policy \(\pi _\beta\) (as shown in Fig. 1b), which is based on a random initial policy, a near-expert online policy, or their combination [8].

Standard Q-learning methods based on temporal difference suffer from distributional shift as the learned policy \(\pi _k\) differs from the policy \(\pi _\beta\) used to collect the data [45]. Specifically, when evaluating the next state \(s^{\prime}\) to obtain the action with the highest return, we may query Q-values for \((s^{\prime}, a^{\prime})\) pairs that are not supported by the dataset. Typical off-policy algorithms tend to overestimate the Q-values for such unseen actions, resulting in the problem of out-of-distribution actions. Distributional shift, along with sampling and function approximation errors, leads to overestimation of the value function in the offline DRL setting. Kumar et al. [12] proposed an effective algorithm called conservative Q-learning (CQL) to address the problem of Q-value overestimation for unseen actions. CQL suggests being conservative (i.e., assigning lower values) in estimating the Q-values and learning a lower bound on their true values. The lower-bounded Q-values ensure that the learned policy avoids executing unseen actions or overestimating their values.

CQL estimates the value function using the given dataset \({\mathcal {D}}\). To avoid overestimation of Q-values, a penalty is introduced to minimize the expected Q-value under a particular state-action distribution denoted as \(\mu (s,a)\). The authors [12] prove that this minimization term, along with the Bellman error objective, achieves a lower bound on \(Q^\pi\) for all states in \({\mathcal {D}}\) and actions \(a\in {\mathcal {A}}\). An additional Q-value maximization term under the data distribution \(\pi _\beta (a|s)\) is introduced for a tighter lower bound. A trade-off factor \(\alpha \ge 0\) accounts for the lower bounds on the true and expected Q-function under sampling error and function approximation. With the availability of a larger dataset \({\mathcal {D}}\), the lower bounds can be achieved with a smaller \(\alpha\) value. The mathematical formulation for conservative policy evaluation is given below:

$$\begin{aligned} {\hat{Q}}^{\pi }_{CQL} \leftarrow \mathop {{\mathrm{arg\,min}}}\limits _Q \; \alpha \Big ( {\mathbb {E}}_{s \sim {\mathcal {D}}, a\sim \mu (a|s)} [Q(s,a)] - {\mathbb {E}}_{s \sim {\mathcal {D}}, a \sim {{\hat{\pi }}}_\beta (a|s)} [Q(s,a)] \Big ) + \frac{1}{2}\, {\mathbb {E}}_{(s, a, s^{\prime}) \sim {\mathcal {D}}}\Big [\big (Q(s,a) - \hat{{\mathcal {B}}}^{\pi } Q\big )^2\Big ], \qquad \hat{{\mathcal {B}}}^{\pi } Q = r + \gamma \, {\mathbb {E}}_{\pi } [Q(s^{\prime},a^{\prime})] \end{aligned}$$
(9)

where \({\mathcal {D}} = \{(s, a, r, s^{\prime})\}\) is a dataset of tuples from trajectories collected using the behavior policy \(\pi _\beta (a|s)\). Since \({\mathcal {D}}\) does not contain all state-action transitions, the policy evaluation uses an empirical Bellman operator, denoted as \(\hat{{\mathcal {B}}}^{\pi }\), and \({{\hat{\pi }}}_\beta (a|s)\) denotes the empirical behavior policy. The Q-function \({\hat{Q}}^{\pi }_{CQL}\) learned this way can be used for policy optimization in a typical temporal difference learning procedure.
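In the discrete-action case used here, a common instantiation of Eq. 9 chooses \(\mu\) adversarially with an entropy regularizer (the CQL(\({\mathcal {H}}\)) variant of [12]), in which the penalty reduces to a log-sum-exp over the action values minus the Q-value of the dataset action. The PyTorch-style sketch below illustrates this variant with a Q-learning (max) backup; it reuses the hypothetical `q_net`, `target_net`, and minibatch of the previous sketch and is not our exact implementation.

```python
import torch
import torch.nn.functional as F

def discrete_cql_loss(q_net, target_net, batch, alpha=1.0, gamma=0.99):
    """Bellman error of Eq. 9 plus the conservative penalty in its discrete CQL(H) form."""
    s, a, r, s_next = batch
    q_all = q_net(s)                                          # Q(s, .) for all three actions
    q_sa = q_all.gather(1, a.long().unsqueeze(1)).squeeze(1)  # Q(s, a) for dataset actions
    # Conservative term: push all Q-values down, push Q-values of dataset actions up.
    conservative = (torch.logsumexp(q_all, dim=1) - q_sa).mean()
    with torch.no_grad():                                     # empirical Bellman target
        target = r + gamma * target_net(s_next).max(dim=1).values
    bellman = 0.5 * F.mse_loss(q_sa, target)
    return alpha * conservative + bellman
```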

5 Case study and results

We use the data of 16 pipes from the WDS, given in Table 1. Each pipe has an age, material, length, failure rate, and failure probability computed using Eq. 4. The failure rate is determined by the pipe material. The length information is used to scale the failure rate of each pipe, as longer pipes are likely to experience more failures than shorter pipes [42]. During training, we randomly sample a single pipe’s details in each episode. This ensures that the agent is exposed to diverse initial states. Once the agent is trained, we evaluate its performance on all pipes and report the average cost and failure probabilities.

5.1 Experiment setup for DQN

We develop a simulation environment of water pipes using the standard OpenAI Gym library. We train an agent to converge to an optimal rehabilitation policy over a 100-year horizon, where each timestep is a year. An episode finishes when the timestep reaches the 100th year. We evaluate the quality of actions in a single trajectory (episode) by computing the discounted sum of rewards \(G_t\). We use the standard DQN implementation from the Stable Baselines library [46].
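A minimal training sketch is given below. It assumes the Stable-Baselines3-style API and a hypothetical `WaterPipeEnv` Gym environment implementing the MDP of Sect. 3; the hyperparameter values shown are placeholders rather than the tuned values of Table 2.

```python
from stable_baselines3 import DQN

# WaterPipeEnv is a hypothetical gym.Env implementing the states, actions,
# reward, and dynamics of Sect. 3 with a 100-step (year) horizon; pipe_data
# holds the Table 1 records. Both names are illustrative placeholders.
env = WaterPipeEnv(pipe_data)

model = DQN(
    "MlpPolicy", env,
    learning_rate=1e-4,      # placeholder; see Table 2 for the tuned values
    buffer_size=100_000,
    gamma=0.99,
    verbose=0,
)
model.learn(total_timesteps=100 * 1000)  # 1000 episodes of 100 yearly timesteps
```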

We performed hyperparameter tuning to find suitable parameters for training the agent. We assess the impact of a single parameter on performance by altering its value while the remaining parameters remain fixed. This approach enables us to study the impact of each parameter on the performance of the agent in isolation. Table 2 shows the base parameter configuration, the tested parameter ranges, and the chosen value for the DQN agent. We study the significance of the network architecture, activation function, buffer size, discount factor, minimum value for epsilon, and learning rate. Figure 2 shows the rolling mean and standard deviation of the return obtained by the agent over 1000 training episodes under the base parameter settings noted in Table 2. To study the impact of a larger buffer size, the agent is trained longer, for 10,000 episodes. The performance remains comparable for most parameter choices; exceptions are the Tanh activation function and extremely small learning rates, which negatively impact performance. The larger buffer size also does not result in improved performance. We noted that a few agents, e.g., the one trained with the (100, 50, 25) network configuration, performed well in the training environment but poorly in the test environment due to the parameter changes resulting from hyperparameter tuning. Therefore, the best-performing parameters are not always chosen for training the agent.

Table 2 Hyperparameter values for training online DQN agent
Fig. 2 Significance of various parameters on the performance of the online agent: each point is a rolling mean, and the shaded area is the rolling standard deviation of the return with a sliding window of 20 episodes

Fig. 3 The DQN agent performance during training with the \(\epsilon\)-greedy policy. The graph reports the mean and standard deviation of the cumulative reward (return) with a sliding window of 20 episodes

The training performance of the DQN agent with the chosen parameters is shown in Fig. 3. Note that the cost represents the intervention cost and the penalty for a high failure probability incurred over a horizon of 100 timesteps for a single pipe. As can be seen, the performance improves after 600 episodes and stabilizes after around 750 episodes.

Fig. 4 Significance of various parameters on the performance of the offline agent: each point is a rolling mean, and the shaded area is the rolling standard deviation of the return with a sliding window of 20 episodes

Table 3 Hyperparameter values for training offline CQL agent
Fig. 5 The training performance of the offline CQL agent based on the DQN replay buffer. The graph reports the mean and standard deviation of the return with a sliding window of 20 epochs

Fig. 6 The learning performance of the offline CQL agent with different source datasets collected using a random, a near-expert, and the expert policy

5.2 Experiment setup for CQL (Offline)

Given that each pipe has three discrete actions, we implement the discrete variant of CQL, built on top of DQN, using the offline reinforcement learning library [47]. The offline learning agent uses a static dataset \({\mathcal {D}}\) to optimize the objective function without additional interactions with the environment.

We use a dataset accumulated during the training of an online DQN agent, as done in previous works [10, 11]. The dataset collected during training is based on the near-expert policy. It consists of 1000 episodes, each with 100 transitions. A single transition is a tuple of state, action, reward, and next state, denoted as \({(s_t^i, a_t^i, r_t^i, s_{t+1}^i)}\). The dataset is collected once and is not altered during training and inference. In the offline setting, the episode concept is the same as in the online DRL variant: it consists of all the data collected through the agent’s interactions with the environment until a terminating condition is met. Recall that in our case, the agent interacts with the environment for 100 timesteps, where each timestep represents a year. Similar to supervised learning, we use 80% of the dataset for training, and the remaining 20% is used for evaluation after each epoch. We train the offline learning agent for 200 epochs to compute the optimal rehabilitation policy for each pipe. The test dataset is used to evaluate the learning progress of the CQL agent by computing the return.
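A minimal offline training sketch is given below. It assumes the v1-style API of the d3rlpy library [47] and that the logged DQN transitions have been stacked into flat arrays; the array names and hyperparameter values are illustrative placeholders rather than those of Table 3.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from d3rlpy.dataset import MDPDataset
from d3rlpy.algos import DiscreteCQL

# observations, actions, rewards, terminals are arrays built from the 1000 logged
# episodes of 100 transitions each (names and shapes are illustrative assumptions).
dataset = MDPDataset(
    observations=observations.astype(np.float32),  # (N, 4): age, material, lambda, pf
    actions=actions,                                # (N,): 0, 1, or 2
    rewards=rewards,                                # (N,)
    terminals=terminals,                            # (N,): 1 at the end of each 100-step episode
)
train_episodes, test_episodes = train_test_split(dataset.episodes, test_size=0.2)

cql = DiscreteCQL(learning_rate=1e-4, gamma=0.99)   # placeholder hyperparameters; see Table 3
cql.fit(train_episodes, n_epochs=200)               # test_episodes are used to track the return
```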

We study the impact of various network architectures, activation functions, discount factor values, Q-functions, and learning rates. The base parameters, ranges, and chosen values are noted in Table 3. Figure 4 provides the rolling mean and standard deviation of the return obtained by the agent over 200 training epochs under the given parameter settings for a single pipe. We notice a higher standard deviation for almost all chosen parameter values than in the online setup. We found that most parameter choices perform comparably after 200 epochs of training, except for the Tanh activation function.

Figure 5 shows the training curve of the CQL agent. The objective of the offline agent is to develop an understanding of the MDP underlying \({\mathcal {D}}\) and construct an optimal policy entirely from the static dataset. Since the dataset is based on a near-optimal policy, the learning agent converges within relatively few training epochs. The agent is trained using the best parameter configuration provided in Table 3.

Fig. 7 Comparison of online and offline DRL agents with baselines. We compare the average intervention cost and failure probability obtained for all 16 pipes using the trained DRL agents and baseline policies; lower cost and failure probability are preferred. The DRL-based methods outperform the other alternatives; within the DRL agents, the CQL agent performs slightly better than DQN

Fig. 8 Comparison of unit cost for failure probability reduction. The DRL-based method results in the highest reduction of failure probability per unit cost

We also compare the CQL learning performance with different source datasets: the random policy dataset is collected through the agent’s random interactions; the near-expert policy dataset is collected during DQN training, in which the agent actively explores and exploits the learned information; and the expert policy dataset is collected using the trained DQN agent. The training curves are shown in Fig. 6. We found that offline learning using the expert policy dataset results in poor performance. This is because the expert policy dataset covers only a narrow range of actions, limiting the agent’s ability to experience various actions and the resulting rewards. Learning from the near-expert policy dataset shows good performance within fewer training epochs. It is also noted that the random policy dataset can achieve results comparable to the near-expert dataset if trained for longer. This result is aligned with the findings reported in [12, 48].

5.3 Results and analysis

We evaluate both the online and offline reinforcement learning paradigms for the case of maintenance planning. We report the average intervention cost and failure probability obtained for all 16 pipes using the trained DRL agents and baseline policies. To enable comparison with the baselines, the intervention cost does not include the penalty for delaying the maintenance action as defined in the reward function. Both trained agents prefer the replace action over the maintain action. This could be because the replace action brings an immediate reduction in failure probability compared to the maintain action. The CQL agent suggests the replace action 89 times, approximately 5.56 actions per pipe. The DQN agent takes the replace action 122 times, a mean of 7.6 per pipe. Since the do nothing action incurs no cost, this action is predominant in both rehabilitation plans, which is also aligned with real planning situations.

We establish baselines with time-based preventive, corrective, and greedy approaches to evaluate and compare the usefulness of employing the DRL framework for rehabilitation planning. The time-based preventive planning approach is based on a recurring schedule, whereas in the corrective approach, the replace action is mainly executed after failure. The corrective approach results in higher costs due to the impact on the network and users; however, in our comparison, we only include the cost of replacement. The greedy approach takes locally optimal choices based on heuristics. With the time-based preventive approach, we develop a rehabilitation plan for all pipes over 100 years, where the maintain action is performed every five years (referred to as Maintain-5) or every ten years (referred to as Maintain-10). In the corrective approach, the replace action is performed when a pipe has \(pf_i \ge 0.95\). In the greedy approach, maintain, the cheapest intervention option, is chosen as soon as a pipe reaches \(pf_i \ge 0.80\). For a fair comparison with the DRL-based approach, we also consider a random chance of sudden failure leading to a replacement action. Additionally, the penalty cost is deducted from the DRL methods, and only the intervention costs are reported to enable comparison with the baselines.
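For reference, the baseline schedules can be expressed as simple decision rules. The sketch below is a schematic reading of the baselines described above, using the thresholds stated in the text; it is illustrative rather than our exact implementation.

```python
def preventive_action(year, interval=5):
    """Time-based preventive policy: maintain every `interval` years (Maintain-5 or Maintain-10)."""
    return 1 if year % interval == 0 else 0

def corrective_action(pf):
    """Corrective policy: replace once the pipe is (nearly) failed."""
    return 2 if pf >= 0.95 else 0

def greedy_action(pf):
    """Greedy policy: take the cheapest intervention (maintain) as soon as pf reaches 0.80."""
    return 1 if pf >= 0.80 else 0
```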

Figure 7 shows the average costs and failure probabilities obtained by developing rehabilitation policies using the different methods. We report the average intervention cost and failure probability over all 16 pipes. The DQN approach results in the lowest failure probability. The highest failure probabilities are noted with the time-based preventive schedules, which are standard in practice. A possible justification for these higher failure probabilities is that the maintain action does not bring a fixed improvement to the condition of an asset, unlike a replacement; instead, the improvement depends on the quality of the maintenance action (see Eq. 5). Besides reducing the failure probability, a suitable rehabilitation plan must also incur minimal cost. The CQL method finds the most cost-effective plan, followed by DRL-DQN and Maintain-10. The greedy method provides the most expensive plan, with a failure probability of 0.84.

Without any intervention, the failure probability of an asset will reach its limit, i.e., 1. We also compare the unit cost each method needs to reduce the failure probability in Fig. 8. The result shows that the CQL method performs slightly better in reducing the failure probability, followed by the DQN method. The better performance of CQL can be attributed to its use of the near-expert policy obtained by the DQN agent. Both DRL-based methods outperform the preventive, corrective, and greedy maintenance approaches. These comparisons demonstrate the effectiveness of DRL-based maintenance planning for the water pipe network.

5.4 Discussion and limitations

We propose a novel DRL-based solution for the rehabilitation planning of water pipes. The proposed approach accommodates multiple pipes with distinct characteristics such as length, material, and failure rate. We show that the DRL-based methods, both the online and offline agents, develop an optimal rehabilitation plan that ensures a sufficiently reduced failure probability at minimal cost. Depending on the requirements of the infrastructure managers, the reward function can be modified to ensure that no asset experiences a failure probability above a defined threshold.

In this paper, we investigate, for the first time, the offline reinforcement learning paradigm to solve the rehabilitation planning problem. Standard pipe maintenance datasets are an ideal candidate for evaluating the capabilities of offline reinforcement learning. This is because a typical maintenance dataset can be framed as a Markov decision process, where, given a state (condition) of an asset, an action (intervention) is applied. Based on the chosen action, the state of the asset either improves or remains the same. This setup requires a reward function, which can be formulated based on the improvement in the condition state of an asset. To illustrate offline reinforcement learning, we used a dataset generated by online interactions with the simulated environment based on the near-expert DQN policy. Future work aims to use larger datasets of water pipes based on different policies to further explore the capabilities of offline reinforcement learning compared to online DRL.

Our case study uses a limited number of attributes related to each pipe: age, material, length, failure rate, and failure probability. However, several additional attributes can be added depending on the requirements of the decision-maker. The setup can be extended to include network zones, the number of previous failures, traffic load, the impact of unavailability, uncertainty arising from sudden failures, and so on. Additionally, our approach assumes the assets are independent of each other. The proposed rehabilitation plan can be further improved by accounting for the co-located assets where maintenance is performed for network segments instead of a single pipe.

6 Conclusion and future work

We present a successful application of deep reinforcement learning (DRL) with an online deep Q-network (DQN) and an offline conservative Q-learning (CQL) method for the management of a water pipe network. The trained DRL agents outperform classical planning approaches, namely preventive, corrective, and greedy strategies, without the need for explicit expert knowledge and detailed heuristic rules. Besides an optimal policy, the DRL framework yields transparency in rehabilitation planning due to the distinct definitions of states, actions, and reward function.

Future work will extend this solution to include a comprehensive deterioration model, intervention costs, and the impact of interventions and failures on availability and surrounding households. The goal is to devise a rehabilitation policy for underground utilities, including water and sewer pipes, that clusters intervention moments. DRL is a promising framework for modeling sequential decision-making problems; however, it remains under-explored in infrastructure asset management due to the complexity of modeling environments for multi-component structures. The offline DRL approach provides a favorable solution for such problem settings.