An investigation of belief-free DRL and MCTS for inspection and maintenance planning

We propose a novel Deep Reinforcement Learning (DRL) architecture for sequential decision processes under uncertainty, as encountered in inspection and maintenance (I&M) planning. Unlike other DRL algorithms for I&M planning, the proposed +RQN architecture dispenses with computing the belief state and instead handles erroneous observations directly. We apply the algorithm to a basic I&M planning problem for a one-component system subject to deterioration. In addition, we investigate the performance of Monte Carlo tree search (MCTS) for the I&M problem and compare it to the +RQN. The comparison includes a statistical analysis of the two methods' resulting policies, as well as their visualization in the belief space.


Introduction
Reliable civil infrastructure, such as power, water and gas distribution systems or transportation networks, is essential for society. Large efforts are therefore spent on properly maintaining these systems. However, at present such maintenance is based mainly on simple legacy rules, such as fixed inspection intervals, combined with expert judgement. There is significant potential for optimal inspection and maintenance (I&M) planning that makes best use of the information at hand to ensure safe and reliable infrastructure while being sustainable and cost-efficient [1][2][3].
I&M planning is a sequential decision making problem under uncertainty. One challenge in deriving optimal I&M decisions is the presence of large epistemic and aleatoric uncertainties associated with the system properties, load, representation model, and measurements [4][5][6][7]. Another major challenge is the exponential increase in possible I&M strategies with the number of components and the considered time horizon [4,8]. Standard practice for dealing with these challenges is the use of established decision heuristics, e.g., safety factors during design, predetermined scheduled inspections, and threshold- or failure-based replacement of components [9][10][11]. The parameters of these heuristics can then be optimized to find good I&M strategies [4,8]. However, heuristics can be suboptimal and finding good heuristics is challenging.
Another approach to embedding uncertainty into the inherently sequential nature of inspection and maintenance problems is to integrate probabilistic models into decision process models [12][13][14]. Under certain conditions, these sequential decision problems under uncertainty can be modeled as Partially Observable Markov Decision Processes (POMDPs), which provide an efficient framework for optimal decision making and can additionally account for measurement errors [15][16][17]. The POMDP is in general intractable [18]. Many approaches for solving the POMDP use the belief state representation, which incorporates the entire information, i.e., actions and observations up to the current point [15,19,20,21,22]. However, these methods require an explicit probabilistic model of the environment to calculate the transition probabilities between states as well as the belief states, and such a model is not always available. In addition, they are typically not computationally efficient beyond small state and action spaces [19]. This hinders their application to I&M planning of infrastructure systems, where the investigated systems usually consist of a larger number of components.
Both neural networks (NNs) and MCTS have been heavily researched in the field of computer games, which provide a safe (i.e., no real-life consequences) and controllable environment with a variety of complex problems to solve (2D, 3D, single-agent, multi-agent, etc.) and a practically unlimited supply of useful data that can be generated much faster than real time [38]. The success of these methods in this application has motivated researchers to apply them to I&M planning (e.g., [16,20,39,40]). However, the specific characteristics of this problem, e.g., sparse rewards due to the low probability of failure, can pose a challenge to DRL methods, the efficiency of which remains to be systematically assessed.
The literature on solving POMDPs with DRL in the context of I&M is fairly limited. Most studies have focused on fully observable MDPs, for instance coupling Bayesian particle filters and a DQN for real-time maintenance policies [41], employing a DDQN for preventive maintenance of a serial production line [42], coupling a pre-trained NN for reward estimation with a DDQN for maintenance of multi-component systems [43], and adopting a DDQN for rail renewal and maintenance planning [44]. Concerning POMDPs, Andriotis and Papakonstantinou [20] developed the Deep Centralized Multi-agent Actor Critic (DCMAC) architecture for multi-component systems operating in high-dimensional spaces, with extended applications for roadway network maintenance [39]. The corresponding decentralized version (DDMAC), where each agent has a separate policy network [16], has been applied to life cycle bridge assessment [40] and 9-out-of-10 systems [45]. However, both DCMAC and DDMAC take the belief state of the system as an input, which is in general computationally expensive to obtain for a system with many components and arbitrary state evolution processes. Thus, newer studies (e.g., [46,47]) have shifted the focus to observation-based DRL. However, a problem setting with continuous states and continuous erroneous observations has not yet been considered.
In a similarly limited manner, MCTS has been applied to maintenance planning problems modeled as MDPs. Examples include finding stochastic schedules in active distribution networks [48], MCTS in combination with genetic algorithms for condition-based maintenance [49], and MCTS combined with NNs for wind turbine maintenance [50]. To the best of our knowledge, MCTS has not been applied to POMDPs in the context of I&M.
The purpose of this paper is twofold. Firstly, we propose a DRL architecture for POMDPs and I&M planning, which does not require the computation of the belief state. The proposed NN combines the features of the Action-specific Deep Recurrent Q-Network [25] and the dueling architecture [51]. The resulting +RQN architecture is able to deal directly with erroneous observations over the whole life cycle of the system.
Secondly, we investigate the performance of MCTS when applied to I&M planning. In this context, we perform a systematic comparison of the proposed +RQN and MCTS. The investigated problem is a one-component system subject to deterioration and is formulated as a POMDP, for which an exact solution is available because of linear Gaussian assumptions for the model dynamics. Component deterioration models are often used for investigations in infrastructure I&M planning (e.g., [21,52,53]) and are applied for I&M planning in practice (e.g., [54,55]). The analysis includes a comparison of performance, i.e., the achieved optimized expected life cycle costs (LCC) and the computation time. It is carried out for different measurement errors. We also review the information carried by two ways of comparing the resulting policies of the two methods: a statistical analysis and a visualization in the belief space. The solutions from both methods are compared to the exact POMDP solution.
The structure of the paper is as follows. The Basic maintenance problem section introduces the investigated problem as well as sequential decision making, along with the key definitions and metrics needed for the employed RL methods. The Neural networks section explains the workings of the NN architecture used herein, and the MCTS section illustrates how the MCTS method has been adapted for solving the proposed problem. The Metrics for comparison section is dedicated to the metrics we employ to compare the NN and MCTS solutions, and the Computation time, Performance, and Policy comparison sections contain the respective results. The Discussion section discusses the obtained solutions and policies, and gives insight into the advantages and disadvantages of the two approaches.

Investigated system
For the numerical investigations in this paper, we study a one-component system subject to deterioration, taken from [56]. It is modeled with two random variables (RVs): $D$, representing the deterioration state, and $K$, representing the deterioration rate. The subscript $t$ indicates timesteps, where $t = 0, 1, 2, ..., T_{end}$, with finite time horizon $T_{end}$. The generic deterioration model is given in Eq. (1), where $D_0$ and $K_0$ are normally distributed and independent. Equation (1) shows that the deterioration process is modeled as a Markov process through state space augmentation. The deterioration $D_t$ is observable with Gaussian measurement noise through the measurement random variable $O_t$, i.e., $O_t \sim N(D_t, \sigma_E)$.
Four actions $a_0$ to $a_3$ are available for counteracting the deterioration and ultimately the failure of the structure. The action $A_t$ is taken after observation $O_t$ and affects $D_{t+1}$ and/or $K_{t+1}$; the effects of the actions on the system are detailed in Appendix 1: Table 2. The structure fails when the deterioration exceeds the critical deterioration $d_{cr}$. In the failed state, an annual failure cost is incurred until the system is either repaired or replaced (there is no automatic reset of the system to the initial state). In addition, each action $a_i$ has a specific cost $c_{a_i}$ incurred at time $t$. Figure 1 depicts the generic influence diagram of the corresponding POMDP. This case study is set up such that linearity, and hence also the normality of any set of RVs, is conserved (see Appendix 1: Table 2). As a result, the belief state and all transitions of the belief-MDP can be computed analytically.
Moreover, in our case, the covariance matrix does not depend on the observations and the actions taken, and can hence be pre-computed for all timesteps. Thus, the actions and observations only influence the prior and posterior means of $D_t$ and $K_t$, respectively (see Appendix 1).
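That the covariance can be pre-computed is a general property of linear Gaussian (Kalman) filtering. The following minimal sketch illustrates it under assumptions that are not taken from this paper: the dynamics $D_{t+1} = D_t + K_t$, $K_{t+1} = K_t$ without any action effect, the initial mean and covariance, and the reading of `sigma_e` as a standard deviation are all hypothetical placeholders.

```python
# Sketch: Kalman recursion for a two-variable linear Gaussian model.
# ASSUMPTION (not from the paper): D_{t+1} = D_t + K_t, K_{t+1} = K_t, no action effect.
# The state is x = (D, K); only D is observed, with noise standard deviation sigma_e.

def predict(mean, cov):
    """Prior at t+1 for x' = F x with F = [[1, 1], [0, 1]] (cov' = F cov F^T)."""
    (md, mk), ((p00, p01), (p10, p11)) = mean, cov
    mean_p = (md + mk, mk)
    cov_p = ((p00 + p01 + p10 + p11, p01 + p11),
             (p10 + p11, p11))
    return mean_p, cov_p

def update(mean, cov, obs, sigma_e):
    """Posterior after observing obs = D + noise, noise ~ N(0, sigma_e**2)."""
    (md, mk), ((p00, p01), (p10, p11)) = mean, cov
    s = p00 + sigma_e ** 2             # innovation variance
    g0, g1 = p00 / s, p10 / s          # Kalman gain for D and K
    resid = obs - md
    mean_u = (md + g0 * resid, mk + g1 * resid)
    cov_u = ((p00 - g0 * p00, p01 - g0 * p01),
             (p10 - g1 * p00, p11 - g1 * p01))
    return mean_u, cov_u

def filter_run(observations, sigma_e,
               mean0=(0.0, 1.0), cov0=((1.0, 0.0), (0.0, 0.25))):
    """Return posterior means and covariances for a sequence of observations."""
    mean, cov = mean0, cov0
    means, covs = [], []
    for obs in observations:
        mean, cov = predict(mean, cov)
        mean, cov = update(mean, cov, obs, sigma_e)
        means.append(mean)
        covs.append(cov)
    return means, covs
```

In this sketch the observation enters only through `resid`, which affects the means but never the covariance entries, so two different observation sequences yield identical covariance sequences.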
The model assumption allows for the system to regenerate if $K_t$ is negative. However, 1) we set up the numerical values so that this effect is limited, 2) it is a useful assumption for obtaining a reference solution, and 3) the solution methods introduced hereafter do not require it.

Sequential decision making
At every timestep, the operator has to decide which action to choose based on the history of observations and actions; hence they try to solve a sequential decision making problem. Specifically, as the deterioration state $D_t$ is only observable through erroneous measurements $O_t$, and the deterioration rate $K_t$ is not observable at all, the investigated setup falls under the category of a Partially Observable Markov Decision Process (POMDP) [15]. One can transform a POMDP into a belief MDP by replacing the states with the belief (vector) as the variable of interest, and then employ conventional methods for solving MDPs, such as value iteration (VI) or policy iteration [57]. We utilize this belief state representation to obtain a reference solution for the numerical investigations (see POMDP reference solution and Results sections). However, the focus of this paper is specifically on reinforcement learning (RL) techniques that can directly deal with observation-action sequences and hence do not need the belief state representation.
The goal is to find a sequence of actions that minimizes the expected life cycle cost (LCC), which is defined as the sum of discounted expected action and failure costs (Eq. 2). In the standard literature, the two costs associated with action and failure are summarized in a single cost $C(s, a)$, which is the immediate cost resulting from executing action $a$ in state $s$ of the system. Hence, we will adopt this notation in the following.
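As a small illustration of this definition, the discounted cost of a single sampled trajectory can be computed as below; the expected LCC is then the average of this quantity over many trajectories. The function name, the discount factor `gamma`, and the convention that the cost at timestep $t$ is weighted by `gamma**t` are assumptions of this sketch, not taken from the paper.

```python
def discounted_lcc(action_costs, failure_costs, gamma):
    """Discounted cost of one trajectory: sum_t gamma**t * (c_a[t] + c_f[t]).

    ASSUMPTION: costs are indexed from t = 0 and discounted by gamma**t.
    """
    if len(action_costs) != len(failure_costs):
        raise ValueError("cost sequences must have equal length")
    return sum(gamma ** t * (ca + cf)
               for t, (ca, cf) in enumerate(zip(action_costs, failure_costs)))
```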
The decision-making rule, which determines the action to take as a function of the available information, is called the policy $\pi$. In general, the policy is time- and history-dependent [15,58]. There exists a mapping from the current observation-action history $h_t = (o_{1:t}, a_{1:t-1})$ to the time-agnostic belief over the set of system states $b(s_t) = p(s_t | o_{1:t}, a_{1:t-1})$, where $b(s)$ represents the probability of the system being in state $s$ when the agent's belief state is $b$ [59]. Hence, the policy as well as other functions can be expressed in terms of either the history or the belief. Accordingly, the ideal policy $\pi^*$ determines the ideal action to take to reach the set goal. For finite-horizon problems (as in our case study), $\pi^*$ is generally time-dependent. In our case, the set of ideal policies $\{\pi^*_t, t = 1, 2, ..., T_{end} - 1\}$ is the one that minimizes the LCC. To find an expression for $\pi^*_t$, we substitute the global LCC measure (Eq. 2) with recursively defined value functions.
A state value function assigns a value to a particular (belief) state at a specific point in time. We denote with $V^\pi_t(b)$ the sum of expected discounted costs when following policy $\pi$ starting from belief $b$ at time $t$ [60]. The optimal value function $V^*_t(b)$ is then defined as in [59] (Eq. 4), where $b^a_o$ is the belief that results from $b$ after executing action $a$ and observing $o$, and can be obtained from the POMDP model and Bayesian updating (e.g., demonstrated in [57]). Note that $P(o|b, a)$ can be expressed as a function of the belief transition probability $P(b^a_o|b, a)$, and the sum over $o$ can be transformed into a sum over $b$ (see POMDP reference solution section).
One can also define an action-value function $Q^\pi_t(b, a)$, which denotes the value of action $a$ at belief state $b$ under policy $\pi$ at time $t$ and continuing optimally for the remaining timesteps until the end of the system lifetime [57]. The optimal value function $V^*$ can be expressed as a minimization over the action-value function $Q$, and the optimal action-value function $Q^*$ satisfies the Bellman equation (Eq. 5) [15,57,61]. Lastly, the advantage function $A^\pi_t(b, a)$ is a measure of the relative importance of each action (Eq. 6) [51], where the advantage of the optimal action $a^*$ is 0 [51]. The optimal policy at every timestep can be easily extracted by performing a greedy selection over the optimal Q-value [57], which is also the value- and advantage-minimizing action from Eqs. (4) and (6), respectively.
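For orientation, the quantities described in this section can be written as follows. This is a hedged reconstruction in standard finite-horizon POMDP notation for cost minimization, not a verbatim copy of the paper's equations; the discount factor $\gamma$ and the belief-based cost $C(b, a) = \sum_s b(s) C(s, a)$ are assumptions of this sketch.

```latex
\begin{align}
V^*_t(b) &= \min_{a} \Big[ C(b,a) + \gamma \sum_{o} P(o \mid b, a)\, V^*_{t+1}(b^a_o) \Big], \\
Q^*_t(b,a) &= C(b,a) + \gamma \sum_{o} P(o \mid b, a)\, \min_{a'} Q^*_{t+1}(b^a_o, a'), \\
A^\pi_t(b,a) &= Q^\pi_t(b,a) - V^\pi_t(b), \qquad A^*_t(b,a^*) = 0, \\
\pi^*_t(b) &= \arg\min_{a} Q^*_t(b,a).
\end{align}
```

With these definitions, $V^*_t(b) = \min_a Q^*_t(b, a)$, so the greedy action minimizes both the Q-value and the advantage.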
The solution methods presented in the Neural networks and MCTS sections have the goal of approximating $V$, $Q$, or $A$, from which the optimal policy can be extracted.

POMDP reference solution
To evaluate the performance of approximate solutions, we also provide a reference solution for the POMDP model of the Investigated system section. It is computed with standard value iteration applied to a discretized belief MDP. The belief, in our case, is a vector comprising the posterior mean values of $D$ and $K$ from Eqs. (27) and (28) (Eq. 9). The adapted version of Eq. (4) for discretized beliefs is then given by Eq. (10) [56,59] (cf. Eqs. (4) and (5)).
Due to the linear Gaussian transition dynamics of this case study, the transition probabilities $P(b'|b, a)$ can be calculated analytically. The discretization of the belief and the computation of the probability tables are done according to [62]. Equation (10) is solved by backward induction for each discrete belief state. The resulting LCC is verified by Monte Carlo simulation (MCS). The discretization scheme is chosen such that 1) the value function of Eq. (4) is estimated with a small error (compared to MCS on the continuous belief space) and 2) the resulting policy is quasi-optimal (it performs better than every other solution found).
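Backward induction over a discretized belief MDP can be sketched as follows. This is a generic illustration, not the authors' implementation: the data layout (`cost[b][a]`, `trans[a][b][b2]`), the zero terminal value, and the discount factor `gamma` are assumptions, and the probability tables would in practice come from the discretization scheme of [62].

```python
def backward_induction(cost, trans, horizon, gamma=1.0):
    """Finite-horizon value iteration on a discretized belief MDP (cost minimization).

    cost[b][a]      : immediate expected cost of action a in belief cell b
    trans[a][b][b2] : P(b2 | b, a), transition probability between belief cells
    Returns (values, policies), each indexed as [t][b] for t = 0..horizon-1.
    """
    n_b, n_a = len(cost), len(cost[0])
    value_next = [0.0] * n_b                 # ASSUMPTION: zero value after the last step
    values, policies = [], []
    for _ in range(horizon):                 # iterate from the last timestep backwards
        value_t, policy_t = [], []
        for b in range(n_b):
            q = [cost[b][a]
                 + gamma * sum(p * v for p, v in zip(trans[a][b], value_next))
                 for a in range(n_a)]
            best = min(range(n_a), key=q.__getitem__)
            value_t.append(q[best])
            policy_t.append(best)
        values.append(value_t)
        policies.append(policy_t)
        value_next = value_t
    values.reverse()                         # put t = 0 first
    policies.reverse()
    return values, policies
```

The cost of one sweep grows with the number of belief cells squared times the number of actions, which is why this approach is only practical for small belief spaces such as the two-dimensional one considered here.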
Note that in the general case, obtaining a reference solution with dynamic programming, e.g., via value iteration, is not feasible due to the super-exponential growth in the value function complexity [63].Hence, the problem investigated in this work represents a special case.

Architecture
Our aim is an NN approach that is able to handle imperfect observations without the need for computing the belief. The NN needs to account for the time dependence of the value function for the finite horizon problem. This can be achieved by a network architecture that is able to handle sequential data, i.e., the observation-action history. For that, we adopt the basic structure of the action-specific deep recurrent Q-network [23].
The final NN architecture proposed in this work is depicted in Fig. 2; we name it the Action-specific Deep Dueling Recurrent Q-network (+RQN). At each timestep $t$, the two inputs of the network are the one-hot encoded action [64] taken at $t - 1$ and the scalar observation obtained at $t$. The outputs of the network are the estimated Q-values for each action at $t$. The inputs are fed through two fully connected (FC) layers for feature extraction. The core of the network is formed by the Long Short-term Memory (LSTM) layer, which can resolve short as well as long-term dependencies through the hidden and cell states, respectively [65]. Depending on the observation-action history, these states take different values. Hence, the LSTM layer can be interpreted as a high-dimensional embedding of the history or a high-dimensional approximator of the belief state. The LSTM output is then fed through another FC layer for further feature extraction. To estimate the Q-values, the value and the advantage functions are first estimated separately and combined using a modified version of Eq. (6), which is discussed in the section below. Wang et al. [51] report that this configuration has superior performance compared to standard DQNs.

Q-values, loss, cost and weight updates
Instead of directly using Eq. (6), Wang et al. [51] propose to introduce the mean over the advantages as a correction term, which improves the stability of the optimization of the network parameters. Let $\theta^j_t$ denote the parameters of all layers prior to the value-advantage split, $\upsilon^j$ the parameters of the value stream, and $\alpha^j$ the parameters of the advantage stream. The superscript $j = 1, 2, ..., N_e$ refers to the weights at a certain iteration/epoch and hence highlights iterative convergence towards a set of weights that best approximates the true Q-value. Herein, an epoch consists of passing through the whole life cycle of a batch of sample trajectories, after which the weights are updated and the next epoch starts. Since $\theta$ also includes the hidden and cell states of the LSTM layer, it depends on the observation-action history, which is denoted with the subscript $t$. By contrast, $\alpha$ and $\upsilon$ stay constant for the whole life cycle (epoch). The modified approximation for the Q-values is then given by Eq. (11) [51], where $Q(o_t, a_{t-1}, a; \theta^j_{t-1}, \alpha^j, \upsilon^j)$ is the Q-value estimate for the action $a$ at time $t$ and epoch $j$ after observing $o_t$ and given the previous action $a_{t-1}$, the weights $\theta^j_{t-1}$ (which embed $o_{1:t-1}$ and $a_{1:t-2}$ through the hidden and cell states), $\alpha^j$ and $\upsilon^j$. Accordingly, the value estimate $V$ does not use the weights of the separate advantage stream $\alpha^j$, and vice versa.
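The mean-corrected combination of Wang et al. [51] amounts to $Q = V + (A - \bar{A})$, where $\bar{A}$ is the mean advantage over all actions. A minimal sketch of this aggregation step alone is given below; the function name and the plain-list interface are illustrative choices, and the network streams that would produce `value` and `advantages` are not modeled.

```python
def dueling_q(value, advantages):
    """Combine a scalar value estimate and per-action advantage estimates into
    Q-values by subtracting the mean advantage (dueling aggregation)."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + (adv - mean_adv) for adv in advantages]
```

A consequence of this form is that the mean of the resulting Q-values equals the value estimate, which removes the ambiguity of adding a constant to the value stream and subtracting it from the advantage stream.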
To evaluate the performance of the network, i.e., the accuracy of the predicted Q-values, we need a target value for each pair of sample observation and action $(o^{(i)}_t, a^{(i)}_{t-1})$, which are passed as inputs to the network. To obtain a target value, we use the fact that the optimal Q-values follow the Bellman equation (Eq. 5). Therefore, we define the NN output $y^{(i),j}_{NN,t}$ and the target value $y^{(i),j}_{Tar,t}$ for a sample $(i)$ at a specific point in time $t$ and epoch $j$ as in [25] (Eqs. 12 and 13), where $a_{sel.}$ denotes the selected action under the behaviour policy ($\epsilon$-greedy, see Appendix 2: Optimized NN parameters section) at $j$, and $c^{(i)}_t$ is the total cost sample at $t$, which includes the cost of a potential failure at $t$ and the cost of the latest selected action at $t - 1$ under the behaviour policy at $j$. Moreover, the Q-value evaluated with the parameters $\theta^{j,-}_t, \alpha^{j,-}, \upsilon^{j,-}$ denotes the Q-value estimate at $t + 1$ and epoch $j$, after an action has been taken under the behaviour policy at $t$ and $j$ which, upon interaction with the environment, resulted in observation $o^{(i)}_{t+1}$. The "−" indicates that the parameters $\theta^{j,-}_t, \alpha^{j,-}, \upsilon^{j,-}$ belong to a separate target network [25]. More details on the sampling procedure and the target network are provided in the Training procedure section.
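In a cost-minimization setting, a Bellman-type target of this kind is commonly formed as the sampled cost plus the discounted minimum of the target network's Q-values at the next step. The sketch below shows only that arithmetic; the function name, the discount factor `gamma`, the exact time indexing of the cost, and the handling of the last timestep via `terminal` are assumptions and may differ from the paper's Eqs. (12) and (13).

```python
def td_target(cost, q_next_target, gamma, terminal=False):
    """Bellman target for cost minimization.

    cost          : sampled cost c_t for this transition
    q_next_target : Q-value estimates of the target network at the next step, one per action
    terminal      : ASSUMPTION, no bootstrapping at the end of the life cycle
    """
    if terminal:
        return cost
    return cost + gamma * min(q_next_target)
```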
For training, we use the mean-squared error (MSE) loss function (Eq. 14) [66]. For stochastic gradient descent, a batch of $N_b$ samples is passed through the network to speed up training [67]. On this basis, the MSE cost function accumulated over the whole life cycle is evaluated as in Eq. (15). The NN weights are updated based on this cost function. The simplest gradient-based update scheme is given by Eq. (16) [66,68], where $\eta$ is the learning rate. The weights $\alpha^{j+1}$ and $\theta^{j+1}$ are computed accordingly. Alternative update schemes such as RMSProp or Adam are available (e.g., [68]).

Fig. 2 The +RQN architecture. The "Env." above the arrow denotes an interaction with the environment. The gray layers represent FC layers and the orange layer represents the LSTM layer with its hidden and cell states $h, c$. A dedicated symbol marks the concatenation operation. The number of circles inside some layers depicts the fixed number of nodes, and the relative sizes of the layers qualitatively show the number of nodes; adapted and merged from [25,51]
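One plausible way to write the loss, the accumulated cost, and the simplest update is sketched below. This is a hedged reconstruction from the surrounding description, not a verbatim copy of Eqs. (14) to (16); the symbols $L$ and $J$ and the summation range over $t$ are assumptions.

```latex
\begin{align}
L^{(i),j}_t &= \big( y^{(i),j}_{Tar,t} - y^{(i),j}_{NN,t} \big)^2, \\
J^j &= \sum_{t=1}^{T_{end}-1} \frac{1}{N_b} \sum_{i=1}^{N_b} L^{(i),j}_t, \\
\upsilon^{j+1} &= \upsilon^j - \eta \, \nabla_{\upsilon} J^j .
\end{align}
```

The same gradient step applies to $\alpha$ and $\theta$ with the corresponding partial gradients of $J^j$.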

Training procedure
Each epoch $j$ is composed of a data collection and a training phase. The data collection part consists of simulating a batch of $N_b$ trajectories with the current network with weights $\theta^j, \alpha^j, \upsilon^j$. We start by drawing initial samples $d^{(i)}_0$ and $k^{(i)}_0$ from their initial distributions and check for resulting failure costs $c^{(i)}_{f,0}$. The actions $a^{(i)}_0$ are fixed to $a_0$ (with action costs $c^{(i)}_{a,0} = 0$), as observation-based action selection starts at $t = 1$. Then, $d^{(i)}_0, k^{(i)}_0, a^{(i)}_0$ are passed to the environment, which returns $d^{(i)}_1, k^{(i)}_1$ and $c^{(i)}_{f,1}$ according to the dynamics in Appendix 1: Table 2. For $t = 1, ..., T_{end} - 1$, observations $o^{(i)}_t$ are generated from $N(d^{(i)}_t, \sigma_E)$ and passed with $a^{(i)}_{t-1}$ to the network, which outputs the Q-values. The behaviour policy at epoch $j$ selects the next action according to the $\epsilon$-greedy scheme, where a random action is selected with probability $\epsilon$ (for exploration) and the action with minimal Q-value is selected with probability $1 - \epsilon$ (for exploitation). The chosen action $a^{(i)}_t$ together with $d^{(i)}_t$ and $k^{(i)}_t$ is passed to the environment, which simulates the system for one timestep and returns $d^{(i)}_{t+1}, k^{(i)}_{t+1}$ and $c^{(i)}_{f,t+1}$. This alternating interaction between network and environment continues until the end of the system lifetime is reached. The sampled observations, actions and costs are then stored for the training phase.
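The $\epsilon$-greedy selection over Q-values that represent costs can be sketched as follows. The function name and the optional `rng` argument are illustrative; the only behaviour taken from the text is that a uniformly random action is chosen with probability $\epsilon$ and the action with minimal Q-value otherwise.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index: uniformly random with probability epsilon (exploration),
    otherwise the index of the minimal Q-value (exploitation, cost minimization)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return min(range(len(q_values)), key=q_values.__getitem__)
```

Passing a seeded `random.Random` instance as `rng` makes the sampled trajectories reproducible.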
Once a batch of sample trajectories has been collected, the training phase starts. Herein, the batch is again fed through the network sequentially, and the cost is accumulated over the whole life cycle. For the computation of the individual MSE loss terms, a target network is defined such that the values of the target network weights are clones of the original network weights: $\theta^{j,-} = \theta^j$, $\alpha^{j,-} = \alpha^j$, $\upsilon^{j,-} = \upsilon^j$. At each time $t$, $o^{(i)}_t$ and $a^{(i)}_{t-1}$ are the inputs of the network; $o^{(i)}_{t+1}$ and $a^{(i)}_t$ are the inputs of the target network. The target NN outputs are greedily selected over the respective Q-values according to Eqs. (12) and (13), as opposed to the $\epsilon$-greedy behaviour policy used for trajectory sampling; hence, this is off-policy learning [69]. The batch cost at $t$ is computed with a batch-averaged version of Eq. (14) and added to the total cumulative cost. This process continues until the end of the life cycle is reached and the LCC MSE cost has been computed according to Eq. (15). Then, the LSTM is unrolled, the loss is backpropagated through time [65] and the weights are adjusted according to the chosen update scheme (e.g., Eq. (16)). After updating, the learning procedure continues with the next epoch until the weights have converged. The weights of the target network are updated periodically every $p$ epochs to ensure stable optimization [30].
The hyperparameter tuning procedure, either by grid search or by some heuristics, is outlined in Appendix 2.

Functionality
Monte Carlo tree search (MCTS) arises from the combination of tree search and Monte Carlo sampling [70]. Classically, games have been modeled with game trees, where the root is the starting position, leaves are possible ending positions, and each edge represents a possible move [71]. To select the best action at a given node (position), one needs to know its consequences. Small games can be solved by constructing the full game tree and using backward induction [72]. However, for more complex games (e.g., chess, Go), this is practically impossible. Hence, one needs an estimator of the preference for each resulting position. Defining the value of each node as an expected outcome given random play opened the door for the use of Monte Carlo, which specifies node values as random variables and characterizes game trees as probabilistic [73]. In [36], MCTS was extended to partially observable environments.
The MCTS algorithm consists of four main steps: selection, expansion, rollout, and backpropagation. In the selection step, the algorithm traverses the tree from the root to a leaf node using a selection policy (see UCT for action selection section). In the expansion step, the algorithm adds a child node to the selected leaf node. In the rollout step, the algorithm performs a simulation from the newly added node until the end of the lifetime by choosing uniformly random actions, i.e., $p(a_i) = 1/4$. In the backpropagation step, the algorithm updates the statistics of all nodes along the path from the selected node to the root node based on the simulation outcome [74]. The Q-value of an action $a$ for a given observation-action history $h$ at time $t$ is the updated statistic at an action node and is computed as the sample mean (Eq. 17) $Q_t(h, a) = \frac{1}{N(h, a)} \sum_{i=1}^{N(h, a)} q^{(i)}_t(h, a)$, where $N(h, a)$ is the total number of samples used for the estimation, or the current visitation counter of the respective action node, and $q^{(i)}_t(h, a)$ are the individual (backpropagated) results at time $t$.
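In an implementation, the sample mean at an action node is usually maintained incrementally during backpropagation rather than by storing all results. A minimal sketch follows; the class and attribute names are hypothetical and not taken from the authors' code.

```python
class ActionNode:
    """Statistics kept at one action node of the search tree (illustrative names)."""

    def __init__(self):
        self.visits = 0    # N(h, a): number of backpropagated results
        self.q = 0.0       # running mean of the backpropagated results

    def backpropagate(self, result):
        """Incremental mean update; after n calls, q equals the plain average
        of the n results passed in so far."""
        self.visits += 1
        self.q += (result - self.q) / self.visits
```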

UCT for action selection
To balance exploitation and exploration [58], we implement the Upper Confidence Bound for Trees (UCT) algorithm. The UCT selects the next action $A_t$ by minimizing the estimated Q-value of each action (exploitation) minus an exploration term (Eq. 18) [75], where $c$ is an adjustable constant that enables a tradeoff between exploration and exploitation, and $N(h)$ is the visitation counter of the parent node such that $N(h) = \sum_a N(h, a)$. A pseudocode for the implementation of MCTS for POMDPs is given in [36].
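A minimal sketch of UCT selection for cost minimization is given below. It assumes the standard exploration term $c \sqrt{\ln N(h) / N(h, a)}$, which may differ in detail from the paper's Eq. (18), and it assumes that unvisited actions are tried first; the function name and list-based interface are illustrative.

```python
import math

def uct_select(q, n_action, c):
    """Return argmin_a [ Q(h, a) - c * sqrt(ln N(h) / N(h, a)) ].

    q        : estimated Q-values (costs), one per action
    n_action : visit counts N(h, a), one per action
    c        : exploration constant
    ASSUMPTION: any action with zero visits is selected before the formula is applied.
    """
    for a, n in enumerate(n_action):
        if n == 0:
            return a
    n_parent = sum(n_action)                       # N(h) = sum_a N(h, a)
    scores = [q_a - c * math.sqrt(math.log(n_parent) / n_a)
              for q_a, n_a in zip(q, n_action)]
    return min(range(len(scores)), key=scores.__getitem__)
```

With `c = 0` the rule reduces to a purely greedy choice, while larger values of `c` favour rarely visited actions even if their current cost estimate is higher.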
Each simulation starts by sampling an initial state from the current belief state, which is, for our case study, described in Eq. (19). Silver and Veness [36] propose a sample-based approximation of the belief state for the general case when the belief state is not analytically available. We have not implemented this in the case study; hence, one should keep in mind that an MCTS without the belief is likely to perform worse.
The tuning of the MCTS parameters is outlined in Appendix 3.

Metrics for comparison
We employ several metrics to assess the performance of the NN and the MCTS approaches and to compare the results to the POMDP reference solution.
Firstly, we evaluate the computation time needed by the methods, including training and testing times.
Secondly, their performance is compared through the expected value and the standard deviation of the LCC for the identified policies. Thereby, the LCC is approximated with Monte Carlo (MC) samples for both methods. The optimal solution curve obtained by evaluating the POMDP with VI (see POMDP reference solution section) serves as a reference. We additionally provide the performance of a benchmark policy that consists of choosing action $a_1$ in every timestep, irrespective of the observation.
Thirdly, we investigate the policies obtained from each method. The analysis comprises a statistical representation of the actions taken at each timestep to reveal potential tendencies, as well as a depiction in the belief space for policy extraction.

Computation time
All computations are performed on a Fujitsu Celsius R970 PC comprising an NVIDIA GP104GL (Quadro P4000) GPU with 8118 MB of memory and an Intel Xeon Silver 4114 CPU at 2.20 GHz (10 cores, 20 logical processors). To accelerate the computation, training and testing of the NNs are conducted on the GPU, whereas MCTS is implemented with CPU parallelization.
With these specifications, the process of training and testing a single NN took 45 seconds (25 seconds of training and 20 seconds of testing $10^6$ sample trajectories). In training, we consider different hyperparameter configurations following Appendix 2: Optimized NN parameters section, which leads to a total training time of approx. 150 min.
By contrast, with MCTS there is no distinct training phase. Nevertheless, it is necessary to find good MCTS parameters, as described in Appendix 3: Tunable MCTS parameters section. This is a time-consuming process, because testing is expensive with MCTS. With the chosen parameter setting, generating 1000 trajectories for testing takes 20 minutes. For this reason, NN training is ultimately significantly cheaper and more straightforward.
Once the NN is trained or the MCTS setting is fixed, evaluating the policy for a single decision is efficient. For the NN, the computational time is negligible; for MCTS, it is on the order of seconds.

Performance
Figure 3 shows the mean LCC achieved by the +RQN, MCTS, VI, and the basic benchmark as a function of the observation error. All curves have a characteristic shape, which consists of two saturation regions, $\sigma_E < 0.5$ (essentially corresponding to perfect observations) and $\sigma_E > 10^3$ (uninformative observations), and a smooth transition in between. Both the NN and MCTS methods perform worse than the optimal solution. However, the NN consistently outperforms the MCTS method, which performs especially poorly under high observation errors.
Figure 4 shows the standard deviation of the resulting LCC as a function of the observation error. The standard deviation increases with increasing observation error, which is to be expected. The NN generally leads to a slightly higher LCC standard deviation than the VI reference solution, although with some exceptions. By contrast, the MCTS results in a low LCC standard deviation for small $\sigma_E$ and in a very large one for large $\sigma_E$.

Policy comparison
Figure 5 depicts the identified strategy profiles for the +RQN, MCTS, and VI in a statistical sense for the selected cases of $\sigma_E = \{0.5, 50\}$.
The reference VI method utilizes mainly $a_1$ in the first half of the system lifetime and employs $a_2$ in the second half. More maintenance is performed when the observation error is larger; for $\sigma_E = 50$, action $a_1$ is implemented early on in all cases, i.e., independent of the observation. Action $a_3$ is avoided, presumably due to its large cost.
The actions selected by the NN, as shown in panels (c) and (d), differ significantly from those of the reference solution. Note that the policies obtained with the NN vary substantially among repeated training runs, even if they lead to similar LCC. The results in Fig. 5 correspond to a single trained NN for each observation error; with other trained NN instances, different proportions of $a_0$, $a_1$, $a_2$ are observed. In all trained NNs, we observe that for $\sigma_E < 200$, the NN employs solely $a_2$ for failure prevention; $a_1$ is involved only for higher observation errors.
By contrast, MCTS has a similar strategy profile over all observation errors: about 30% of the trajectories use $a_1$ at every timestep. The only difference observed for larger observation errors is the increased use of $a_2$ in the second half of the system lifetime. Interestingly, for small $\sigma_E$, the statistic of the selected actions with MCTS is closer to the reference solution than that of the NN, even though the expected LCC achieved with the NN is smaller than the one achieved with MCTS.
To investigate and compare the resulting policies, we illustrate how the strategies manifest in the belief space. Figure 6 depicts the policies resulting from VI for the reference case of $\sigma_E = 50$ and $t = \{1, 10, 18, 20\}$. The occasional islands in otherwise continuous action bands in panels (a) and (b) result from the sampling-based estimation of the belief transition probabilities outlined in [62].
For comparison, we show the output of one run of the MCTS method (one for each b and t) in Fig. 7. The policies are similar to the VI policies in the choice of a_2 and a_3, i.e., the regions close to or beyond failure are primarily occupied with strips of a_2 and a_3. The extent of variation is determined by the magnitude of the measurement error as well as the remaining time until the end of the life cycle, e.g., almost no variation for σ_E << 1 and t > 10, and high variation with no apparent structure for σ_E > 100 ∀t. By contrast, the region far away from failure almost always shows high variability, and it seems that the choice between actions a_0 and a_1 is taken more or less randomly (except for very low σ_E at t = 20). The already mentioned variation tendencies for a_2 and a_3 also hold for a_0 and a_1. The large variance of MCTS (which could be reduced with increasing computational cost, see Appendix 3: MCTS parameter optimization technique section) leads to suboptimal policies.

Fig. 6 Evolution of the VI optimal actions represented on a 60 × 20 belief grid for t = 1 (a), t = 10 (b), t = 18 (c), and t = 20 (d) for an observation error of σ_E = 50, where the cell mid-points are chosen as representatives for each cell region, respectively
For the NN, mapping all belief states to the optimal actions is not straightforward, as it takes observations and not beliefs as an input. However, the belief state can be tracked over time for sample trajectories, as shown in Fig. 8.
Once the trajectories in the belief space are available (Fig. 8), we can select a specific timestep and plot the actions taken by the NN. This results in a point cloud in the belief space, which is shown in Fig. 9.

Fig. 7 Evolution of the MCTS recommended actions, same set-up as in Fig. 6

actually has an easily tractable belief, which facilitates the evaluation of the algorithms in our investigation).
In terms of computation time, the NNs vastly outperform MCTS. This can be partly attributed to the implementation: PyTorch tensor operations on the GPU for the NNs are much faster than a standard Python list implementation on multiple CPUs for MCTS. The other part can be attributed to the nature of the methods: passing a state-action pair through the NN and retrieving the next action via the Q-values is much faster than performing a tree search for the next action at every timestep.
The results of our numerical investigation show that both the NN architecture and the MCTS approach perform suboptimally compared to the reference solution found by value iteration. It is possible to improve the performance of both approaches: in the case of the NN, by additional training and hyperparameter tuning, and in the case of MCTS, by employing a larger number of samples. However, our results reflect an honest assessment of the capabilities of these methods.
The NN's solution depends strongly on the local minimum found during training. This explains the non-smooth standard deviation curve and the large differences of resulting statistical strategy profiles in different training runs, as reflected in Fig. 4. Generally, the NN's strategy profile changes considerably for σ_E > 100 as the NN approaches the solution for the case of uninformative observations. As evidenced by Fig. 9, the NN's policy is stochastic in the belief space, although it is deterministic in the observation space. As the optimal policy is deterministic in the belief space (as given by the VI solution shown in Fig. 6), one can observe that the trained NN is not yet able to implicitly capture the underlying belief space, which is one reason for its suboptimality. Thus, if the belief can be computed, it should be used as an input to the NN, as this will strongly enhance its performance (see, e.g., [46]) and facilitate interpretability.
MCTS provides suboptimal but still decent results for σ_E ≤ 50, where it trades LCC for lower variance. This can also be seen in Fig. 5, where the NN employs only a_2, leading to an overall lower mean cost but higher variance due to the acceptance of occasional failures. For higher observation errors, the MCTS performance decreases significantly, showing a limited ability to handle uninformative observations. This is exemplified by the only slightly changing strategy profiles with increasing σ_E in Fig. 5. Interestingly, Figs. 6 and 7 show that the general MCTS solution is similar to the optimal VI solution. However, the inherent stochasticity of the method results in a stochastic policy in the belief space. This property is most apparent at the beginning of the life cycle, where the long-term effects of some actions are difficult to estimate.
A disadvantage of the MCTS approach is that it has no memory; thus, each sample trajectory has to be computed independently and expensively. By contrast, NNs, once trained, contain all the information in the weights, and the evaluation can be performed swiftly. In addition, we speculate that this memorylessness of MCTS leads to worse performance compared to the NNs, which can learn the degradation behaviour through observed trajectories.
Overall, and perhaps as expected, the neural networks are the preferred choice. However, there are numerous opportunities for further enhancements of both solution approaches.
The performance investigation of the NN could be extended, for example, by studying its dependence on the network size, its generalization capabilities (e.g., increased lifetime, different distributions), or by using the belief as an input instead of the observations for comparison. Moreover, the NN architecture can be extended by incorporating a double deep Q-network (DDQN) or by replacing the LSTM architecture with transformers (see, e.g., [46,47]).
The MCTS method could be extended by, e.g., using erroneous observations instead of exact beliefs [36] for performance comparison or by switching to continuous state MCTS to dispense with discretization. NN and MCTS can also be combined by adding a planning step to the NN-based solution.

Conclusion
In this work, we propose the +RQN architecture for POMDPs and I&M planning, which requires merely the erroneous observations and the previous action taken as an input. The resulting neural networks are computationally fast and achieve good performance for measurement errors spanning several orders of magnitude through policy adaptation. However, NNs, in general, inherently suffer from interpretation difficulties. Even for small problems, the trained model consists of thousands of weights. Interpreting the results or gaining insight into the underlying physical properties of the system is non-trivial. This characteristic is evident in policy extraction, which is challenging to conduct in the belief space, as beliefs cannot be imposed but only tracked along the NN's trajectories.
By contrast, computing many histories with the MCTS method is computationally much slower. In addition, it is inherently based on constructing a tree that grows exponentially with increasing depth, which requires large amounts of memory. The results of MCTS are comparable to those of the NNs for small to medium observation errors. However, for high observation errors, the MCTS method fails to adapt its policy and achieves significantly worse results compared to the NNs and VI. The key advantage of the MCTS method lies in the evaluation of its policies. Any belief combination can be specified as a starting point, which greatly facilitates the interpretation of the results.

Model data
The specific parameters for the model used in this work are outlined in Appendix 1: Table 1.

Effect of actions
The (belief) state of the system is influenced by the four available actions, whose effects are detailed in Appendix 1: Table 2.
Table 2 Mathematical description of the effects of action a_i on the individual D and K states as well as their corresponding beliefs µ_D and µ_K. At a_3, the replacement is conducted by drawing new samples D_0 and K_0 (from the distribution in Eq. (20)) and not by reusing the samples from the current simulation

Action   Effect (state level)   Effect (belief level)

Action a_3 consists of sampling new values for the deterioration state and deterioration rate from the following multivariate normal distribution:

where we denote with "′" and "′′" the prior and posterior distributions, respectively. The corresponding analytical terms are detailed in the following.

Transition probabilities - belief level
The covariance of D_t and K_t is fully known (it does not depend on O_t). The means of the distributions are fully observed at each timestep (see Eqs. (27) and (28)). The belief B_t at time t is composed of the two posterior means,

B_t = [µ′′_{D,t}, µ′′_{K,t}]   (21)

Fixed NN parameters
The number of hidden layers loosely follows the architecture from [25]. It is possible that the network achieves better performance, or equal performance with shorter training time, with other configurations. The output dimensions of O, a, A, V, and Q are fixed by our problem formulation, i.e., we have one observation variable and four available one-hot encoded actions. The output dimensions of all other layers (the three FC layers and the LSTM layer) can be freely chosen. The dimensions of all customizable layers have been selected heuristically. The fully connected layers are numbered according to the order in which they appear from left to right, i.e., there are two FC1 and FC2 layers. The sizes of FC2 (and FC1) have been chosen to have the same dimension so as not to impose an ad hoc ranking of importance before they enter the LSTM layer. The exact values for the number of nodes in each layer and all other parameters set heuristically are given in Appendix 2: Table 3. The total number of parameters of our NN architecture for the specific values given in Appendix 2: Table 3 is 57,195. We train the networks for at most 500 epochs. However, early stopping is also implemented, i.e., training is interrupted if the training loss does not further decrease over an extended period [68].
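The early-stopping rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name, the `patience` and `min_delta` parameters, and their default values are hypothetical, since the text only states that training stops when the training loss no longer decreases over an extended period.

```python
class EarlyStopping:
    """Signal a stop when the loss has not improved for `patience` epochs."""

    def __init__(self, patience=20, min_delta=0.0):
        self.patience = patience    # hypothetical: epochs to wait without improvement
        self.min_delta = min_delta  # hypothetical: smallest decrease that counts
        self.best = float("inf")
        self.stale_epochs = 0

    def step(self, loss):
        """Record the loss of one epoch; return True if training should stop."""
        if loss < self.best - self.min_delta:
            self.best = loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience


def train(losses, max_epochs=500, patience=20):
    """Consume per-epoch training losses; return the number of epochs run."""
    stopper = EarlyStopping(patience=patience)
    epochs_run = 0
    # zip caps the loop at max_epochs even if more losses are supplied
    for epoch, loss in zip(range(max_epochs), losses):
        epochs_run = epoch + 1
        if stopper.step(loss):
            break
    return epochs_run
```

Here `losses` stands in for the per-epoch training loss that a real training loop would compute; the cap of 500 epochs mirrors the limit stated in the text.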

Optimized NN parameters
Our chosen parameters to optimize are given in the following list.
1. Weight decay parameter (L2 regularization) [68]
2. Maximum ε value (coupled with a decrease)
3. Learning rate step size
4. Learning rate multiplication factor

Including weight decay, the loss function gets an additional term: the regularization function R(W) multiplied by a scaling parameter, where W is a matrix containing all network weights and R is the sum of the squared network weights (L2). The scaling parameter determines the relative importance of the regularization term compared to the MSE loss, and we search for its optimal value.

We implement our behaviour policy, i.e., the policy with which we select the next action when generating a batch of trajectories, as a decreasing ε-greedy method, which starts with the maximum value of ε to fuel exploration and decreases to 0 for exploitation of the final policy. However, one can also choose a different minimum ε value (e.g., 0.1 in [25]) to always force some exploration. Our update scheme implements a simple linear reduction. The starting value of ε is optimized.
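The two ingredients above, the L2-regularized loss and the linearly decreasing ε-greedy behaviour policy, can be sketched as below. The function names, the exact form of the linear schedule (decrease per epoch, reaching 0 in the last epoch), and the convention that larger Q-values are preferred are assumptions made for illustration; the authors' update equation is not reproduced here.

```python
import random


def l2_regularized_loss(mse_loss, weights, weight_decay):
    """MSE loss plus weight_decay times the sum of squared weights."""
    penalty = sum(w * w for w in weights)
    return mse_loss + weight_decay * penalty


def linear_epsilon(epoch, n_epochs, eps_max):
    """Assumed schedule: eps_max at epoch 0, linearly down to 0 at the last epoch."""
    if n_epochs <= 1:
        return 0.0
    frac = min(epoch, n_epochs - 1) / (n_epochs - 1)
    return eps_max * (1.0 - frac)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Random action with probability epsilon, otherwise the greedy action.

    Assumption: the greedy action is the one with the largest Q-value.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With four actions, `epsilon_greedy([0.1, 0.9, 0.3, 0.2], epsilon=0.0)` always returns index 1, which corresponds to pure exploitation at the end of the schedule.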
We also implement a learning rate scheduler, where the learning rate starts at a high value and periodically decreases, which helps both generalization and optimization [79].We implement a simple step decay schedule that reduces the learning rate by a constant factor η every constant number of epochs m [80].Hence, we search for the optimal values of m and η.
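As a small illustration of the step decay schedule, the learning rate after a given number of epochs can be written in closed form. The function and argument names are hypothetical; `step_size` plays the role of m and `factor` the role of η.

```python
def step_decay_lr(initial_lr, epoch, step_size, factor):
    """Learning rate at `epoch`: multiplied by `factor` once every `step_size` epochs."""
    return initial_lr * factor ** (epoch // step_size)


# Example: start at 0.01 and halve the learning rate every 100 epochs.
schedule = [step_decay_lr(0.01, e, step_size=100, factor=0.5) for e in (0, 99, 100, 250)]
```

PyTorch's `torch.optim.lr_scheduler.StepLR` (arguments `step_size` and `gamma`) applies the same rule to an optimizer; whether the authors used that class is not stated in the text.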
There are many more common practices for training NNs, e.g., weight initialization, batch normalization, and dropout. For most of these, we follow the default settings of PyTorch; these will not be further explained here.

NN optimization technique
Several search techniques can be employed to find good NN hyperparameters. The most common one is manual search, which is simple and effective for finding reasonable estimates (e.g., initial learning rate), but becomes unstructured and ineffective when the search space of the parameters to tune grows. Therefore, we use grid search, where we define a set of points for each of our desired hyperparameters and iterate over all possible combinations [68]. During the procedure, we …

… error. Generally, it is assumed that the optimal number of observation buckets depends on the observation error with regard to minimizing the LCC.
The first analysis is conducted on the time dependence of N_T and N_R, where we impose a threshold on the computation time needed to traverse a whole life cycle with the MCTS method to stay in a computationally feasible domain. It is assumed that the computation time is independent of the observation error and is only minimally affected by the choice of N_ob, which is why they are fixed.
The result of the analysis is a set of possible combinations of the two parameters that satisfy our imposed computation threshold. To settle for a single combination, the influence of N_T and N_R on the LCC is taken into account. We assume that the resulting curves hold qualitatively for any N_ob and σ_E, which is why they are fixed again.
Secondly, once N_T and N_R have been fixed based on the time constraints and LCC minimization, we search for the optimal number of observation buckets given a set of observation errors of interest.
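The two-stage procedure can be sketched as follows. This is only an illustration under stated assumptions: `runtime_fn` and `lcc_fn` are hypothetical callables standing in for timing a full MCTS life-cycle traversal and for estimating the expected LCC, and all argument names are invented for the example.

```python
from itertools import product


def select_mcts_parameters(n_t_grid, n_r_grid, n_ob_grid, sigma_e_values,
                           runtime_fn, lcc_fn, max_runtime,
                           n_ob_fixed, sigma_e_fixed):
    """Return the chosen (N_T, N_R) pair and the best N_ob per observation error."""
    # Stage 1a: keep the (N_T, N_R) pairs that respect the computation-time threshold.
    feasible = [(n_t, n_r) for n_t, n_r in product(n_t_grid, n_r_grid)
                if runtime_fn(n_t, n_r) <= max_runtime]
    if not feasible:
        raise ValueError("no (N_T, N_R) pair satisfies the runtime threshold")

    # Stage 1b: among the feasible pairs, take the one with the lowest LCC,
    # evaluated at a fixed N_ob and a fixed observation error.
    n_t, n_r = min(feasible,
                   key=lambda p: lcc_fn(p[0], p[1], n_ob_fixed, sigma_e_fixed))

    # Stage 2: for each observation error of interest, pick the number of
    # observation buckets that minimizes the LCC.
    best_n_ob = {s: min(n_ob_grid, key=lambda n_ob: lcc_fn(n_t, n_r, n_ob, s))
                 for s in sigma_e_values}
    return (n_t, n_r), best_n_ob
```

In practice, each call to `lcc_fn` would itself average over many simulated life cycles, so the grids have to stay small for the search to remain affordable.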

Fig. 1 Complete influence diagram of the model, especially depicting the starting and end operations, where T_end = 21. The first action is taken at t = t_1

Fig. 2 Snapshot of our NN architecture depicting the information flow through the network from t → t + 1. A is the action taken at the previous timestep in one-hot encoded form, e.g., a_0 = [1 0 0 0]^T, O is the current observation, and V, A & Q are the value, advantages, and action-values, respectively. The "Env." above the arrow denotes an interaction with the environment. The gray layers represent FC layers and the orange layer represents the LSTM layer with its hidden state and cell states h, c. The represents the concatenation operation. The number of circles inside some layers depicts the fixed number of nodes, and the relative sizes of the layers qualitatively show the number of nodes; adapted and merged from [25, 51]

Table 1 Summary of model and cost parameters

Transition probabilities - state level
At every timestep t ≥ 1, after observing O_t, the updated distribution of D_t and K_t is a binormal distribution with means µ′′_{D,t}, µ′′_{K,t}, standard deviations σ′′_{D,t}, σ′′_{K,t}, and correlation coefficient ρ′′_t.

Posterior covariance matrix of D_t and K_t
For the covariance matrix, the transition from Σ′′_{t−1} to Σ′′_t does not depend on O_t or A_t and hence is deterministic:

Posterior mean values of D_t and K_t
In contrast to the covariance matrix, the posterior mean values of D_t and K_t depend on the value of the observation O_t.
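The contrast between the observation-independent covariance and the observation-dependent means can be illustrated with standard bivariate normal conditioning. The sketch below covers only the observation update and assumes, hypothetically, that O_t equals D_t plus zero-mean Gaussian noise with standard deviation σ_E > 0; it does not reproduce the model's transition between timesteps or the effect of actions, and all names are illustrative.

```python
import math


def condition_on_observation(mu_d, mu_k, sd_d, sd_k, rho, obs, sigma_e):
    """Update a binormal prior on (D, K) with a noisy observation of D.

    Assumed observation model: obs = D + noise, noise ~ N(0, sigma_e**2).
    Returns ((mean_D, mean_K), (sd_D, sd_K, rho)) of the posterior.
    """
    var_d, var_k = sd_d ** 2, sd_k ** 2
    cov = rho * sd_d * sd_k
    s = var_d + sigma_e ** 2            # variance of the predicted observation
    gain_d, gain_k = var_d / s, cov / s
    innovation = obs - mu_d             # the only place where obs enters

    # The posterior means shift with the observation.
    post_mu_d = mu_d + gain_d * innovation
    post_mu_k = mu_k + gain_k * innovation

    # The posterior covariance terms do not involve obs at all.
    post_var_d = var_d - gain_d * var_d
    post_var_k = var_k - gain_k * cov
    post_cov = cov - gain_d * cov
    post_sd_d, post_sd_k = math.sqrt(post_var_d), math.sqrt(post_var_k)
    post_rho = post_cov / (post_sd_d * post_sd_k)
    return (post_mu_d, post_mu_k), (post_sd_d, post_sd_k, post_rho)
```

Calling the function twice with different values of `obs` but otherwise identical arguments yields different posterior means and identical standard deviations and correlation, which is why the covariance can be precomputed for all timesteps.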

Table 3 Summary of new heuristically chosen network and optimizer parameters