Scalable Policies for the Dynamic Traveling Multi-Maintainer Problem with Alerts

Downtime of industrial assets such as wind turbines and medical imaging devices is costly. To avoid such downtime costs, companies seek to initiate maintenance just before failure, which is challenging because: (i) asset failures are notoriously difficult to predict, even in the presence of real-time monitoring devices that signal degradation; and (ii) limited resources are available to serve a network of geographically dispersed assets. In this work, we study the dynamic traveling multi-maintainer problem with alerts ($K$-DTMPA) under perfect condition information, with the objective of devising scalable solution approaches to maintain large networks with $K$ maintenance engineers. Since such large-scale $K$-DTMPA instances are computationally intractable, we propose an iterative deep reinforcement learning (DRL) algorithm optimizing long-term discounted maintenance costs. The efficiency of the DRL approach is vastly improved by a reformulation of the action space (which relies on the Markov structure of the underlying problem) and by choosing a suitable initial solution. The initial solution is created by extending existing heuristics with a dispatching mechanism. These extensions further serve as compelling benchmarks for tailored instances. We demonstrate through extensive numerical experiments that DRL can solve single maintainer instances up to optimality, regardless of the chosen initial solution. Experiments with hospital networks containing up to $35$ assets show that the proposed DRL algorithm is scalable. Lastly, the trained policies are shown either to be robust against network modifications, such as removing an asset or an engineer, or to yield a suitable initial solution for the DRL approach on the modified network.


Introduction
Industrial assets such as medical imaging equipment, wind turbines and wafer steppers are expected to operate on demand. To ensure maximum availability, assets nowadays are regularly inspected to evaluate their degradation, i.e., condition-based maintenance (CBM). Equipping assets with sensory equipment for real-time degradation monitoring can be used to devise efficient CBM policies.
For example, an array of sensors is connected to a magnetic resonance imaging device that tracks medical procedures, but the gathered data can also be used to signal various stages of degradation and emit alerts, which in turn may be used to dispatch a maintenance engineer. Decision-makers often rely on heuristics or human intuition that decouple dispatching from other operational decisions, for instance, the dispatching and relocation decisions of emergency service providers after an incident has occurred. Devising maintenance policies and devising dispatching policies are, each in their own right, notoriously difficult problems that have been studied individually.
The combination of maintenance and dispatching decisions, although understudied, is extremely important. As an essential step in the direction of the application, we contribute to the literature with the Dynamic Traveling Multi-Maintainer Problem with Alerts (K-DTMPA). In the K-DTMPA, K engineers travel in a network of assets where the degradation of each asset is modeled through a finite number of degradation states. The first degradation state captures that the asset is as-good-as-new. Subsequently, the severity of degradation increases with the degradation state until the asset reaches the failed state and becomes unavailable. After each state transition, an alert is issued immediately to a central decision-maker who is responsible for scheduling maintenance and dispatching the maintenance engineers. We propose heuristic dispatching solutions based on classical ranking heuristics, but the optimal solution likely cannot be represented by a simple set of rules. Indeed, the K-DTMPA model can be naturally formulated as a Markov decision process (MDP), which is computationally intractable for realistic size problems. It is well known that such realistic sequential decision-making problems suffer from the curse of dimensionality. This curse of dimensionality can be tackled using approximate dynamic programming / deep reinforcement learning (DRL) (Boute et al., 2022; Powell, 2019) via a combination of machine learning and simulation. DRL has achieved state-of-the-art performance, for instance, in Atari video games (Mnih et al., 2013) and chess (Silver et al., 2018). However, while numerical experiments for sequential decision-making problems arising in operations management have yielded encouraging results, the ability to solve industrial-scale instances is restricted due to the long training times (Gijsbrechts et al., 2022). Indeed, DRL has been shown to produce near-optimal policies for Dynamic Traveling Maintainer Problem with Alerts (DTMPA) instances (da Costa et al., 2023), but training times exceed 12 hours even for networks containing 6 assets maintained by a single engineer. The K-DTMPA instances studied here are more challenging because they involve multiple engineers that need to be dispatched in a coordinated fashion to up to 35 assets, and overcoming this challenge is a crucial step towards industrial-scale solutions.
We adopt a form of approximate policy iteration (API), an iterative algorithm in the DRL domain, and demonstrate that heuristic solutions may be leveraged as a starting point for the training algorithm, and that doing so may vastly reduce training times. More specifically, we extend the ranking heuristics for the DTMPA framework proposed by da Costa et al. (2023) to the K-DTMPA framework by equipping them with a state-of-the-art dispatching heuristic. We demonstrate that API can improve such heuristics and can learn a repositioning strategy for unassigned maintenance engineers aimed at anticipating future alerts and failures. Additionally, to significantly reduce the action space complexity, we propose a suitable reformulation of the associated MDP in which actions for engineers are selected sequentially in a fixed order. Using an engineer-centric feature representation for this MDP reformulation further improves DRL's efficiency.
The K-DTMPA represents a rich class of problems, and as a consequence, it is challenging to design general-purpose traditional heuristic algorithms that can be used to benchmark our learned policies: such heuristics would have to incorporate jointly the geographical layout, the observed degradation, the costs for asset unavailability, maintenance and travel, and the spatial and temporal information regarding the engineers. To circumvent this, we devise two suitable subclasses of K-DTMPA instances that allow for the construction of strong benchmarks, namely the single maintainer and the dispatching & repositioning (D&R) instances. We show that API can train policies that outperform the benchmark for such instances. Moreover, our algorithm produces state-of-the-art policies for more complex instances.
The primary contributions of the paper are as follows:
• The proposed K-DTMPA model jointly optimizes maintenance and dispatching decisions, problems that are linked in practice but are studied separately in prior literature.
• To reduce the action and state space complexity, we propose an MDP reformulation in which actions for the engineers are selected sequentially using an engineer-centric feature representation, and show that this yields more cost-effective K-DTMPA policies.
• We propose a generic approach that leverages classical heuristic policies to improve the training of neural network policies, and we demonstrate its effectiveness for K-DTMPA instances.
The main insights gathered from the numerical experiments are:
• API can solve single maintainer instances up to optimality within a few iterations, regardless of the choice of the initial solution.
• Sophisticated dispatching heuristics are superior initial solutions when solving multi-maintainer instances compared to trivial policies such as the random policy.
• The trained policies are either robust against removing an asset or an engineer, or yield suitable initial solutions for optimizing the modified instances.
By providing scalable solution approaches to make data-driven decisions for industrial-scale problems, we attempt to bridge the gap between academia and industry.
The remainder of the paper is structured as follows. Section 2 provides an overview of the related literature. Section 3 formalizes in detail the K-DTMPA framework. In Sections 4 and 5, we detail the heuristic solutions and the deep reinforcement learning algorithm, respectively. In Section 6, the setup of the numerical experiments on Dutch hospital networks is presented. Section 7 discusses the numerical results and the corresponding managerial insights. We conclude in Section 8 and discuss operational constraints limiting application in industry.

Literature review
In this section, we first discuss relevant literature in the streams of maintenance optimization and traveling maintainer problems. Subsequently, we provide an overview of the application of DRL in dynamic dispatching problems integrating operational decisions with a focus on scalable DRL for decision-making.

Maintenance optimization and traveling maintainer problems
According to Keizer et al. (2017) and De Jonge and Scarf (2020), maintenance models can be single-asset or multi-asset. Multi-asset models generalize single-asset models by considering joint maintenance policies for assets with any of the following dependencies: economic, structural, stochastic or resource dependency (Keizer et al., 2017). The degradation of assets is often modeled using a stochastic process that takes values in a discrete finite state-space, e.g., a Markov chain. Scheduled inspections can be improved by leveraging information acquired via sensors. These sensors sometimes measure asset degradation directly, for instance in the form of alerts (De Jonge et al., 2016; Akcay, 2022).
Abdul-Malak and Kharoufeh (2018) study the problem of optimally replacing multiple stochastically degrading systems using condition-based maintenance in a shared environment. However, in the multi-asset scenario, the geographical layout often constitutes a complex dependency, prompting the traveling maintainer problem (TMP). The goal of the traditional TMP is to find a route that visits each asset such that the sum of the times needed to reach each asset is minimized. The TMP is a mean-flow variant of the traveling salesman problem (TSP) and is thus NP-complete (Afrati et al., 1986). The computational complexity further increases when assigning a hard deadline to each asset, e.g., a bound on the response time. The TMP objective of minimizing the sum of functions of response times is studied by Camci (2014). Real-time CBM prognostics are incorporated in a TSP by scheduling maintenance using forecasted failure information, which is then generalized to also include travel times (Camci, 2015). The dynamic TMP (DTMP) considers jobs that appear uniformly in a region according to a Poisson process (Bertsimas and Van Ryzin, 1989). These jobs must be completed by a single maintainer with the objective of minimizing the average response time. Drent et al. (2020) model a DTMP variant as a sequential decision-making problem and provide heuristic solution approaches leveraging real-time condition information to the dispatching and repositioning subproblems based on the minimum weighted bipartite matching problem and the maximum expected covering location problem, respectively. Pechina et al. (2019) propose and evaluate a range of heuristic solution approaches to the dispatching and relocation subproblems inspired by the domain of emergency response networks to serve a network of identical geographically distributed assets. In emergency response dispatching problems, it is commonly believed that the closest idle ambulance rule is near-optimal; however, significant cost reductions can be achieved by dispatching policies that account for coverage (Jagtenberg et al., 2017) and relocation of ambulances (Van Buuren et al., 2018).
Condition-based maintenance optimization typically optimizes the timing of maintenance, taking into account risks, costs and dependencies (De Jonge et al., 2016), which is a formidable problem in itself. In such models, the dispatching of resources based on their spatio-temporal availability is rarely modeled in detail, and resources are typically abstracted away. To the best of our knowledge, dispatching and coordinating resources in response to unforeseen alerts has only been studied outside of the (condition-based) maintenance context, e.g., in ambulance dispatching (Van Buuren et al., 2018). Our newly proposed K-DTMPA model jointly considers resource dispatching and tactical postponement of maintenance, which advances the applicability of condition-based maintenance models in a practical context, while in a sense merging two streams of literature.

The application of deep reinforcement learning in dynamic dispatching problems
Recent advances in machine learning have led to a variety of applications in various fields. For example, applications in the field of dynamic dispatching include ambulance dispatching, ATM servicing and mining logistics. Relevant challenges of the application of DRL include: (i) multi-agent systems may have a variable number of agents, (ii) variable objectives require costly retraining, (iii) the curse of dimensionality, (iv) the stochastic environment can be non-stationary and (v) the explainability of the trained agent (Khorasgani et al., 2020). Zhang et al. (2020) tackle the Open-Pit Operational Planning problem by training a neural network that is shared amongst all the agents (trucks), i.e., the network receives each agent's observation and outputs actions for each agent independently. Holler et al. (2019) also apply DRL from a system-centric perspective to solve the multi-driver vehicle D&R problem. Schmid (2012) solves the dynamic ambulance D&R problem using approximate dynamic programming on a Vienna case study. Ji et al. (2019)
DRL has the potential to deliver good policies for any operations management problem that possesses a natural MDP formulation. MDP instances considered in DRL studies are often restricted to stylized models, which is in contrast with the complexity of practical problems arising in operations management (Boute et al., 2022). Approaches to make DRL more scalable include aggregating states (Refaei Afshar et al., 2020) or modifying action selection, e.g., by decoupling action selection (Feng et al., 2021) or by using continuous action representations (Vanvuchelen et al., 2022). De Moor et al. (2022) and Boute et al. (2022) argue that incorporating domain knowledge embedded in well-performing heuristic policies into the training algorithm improves DRL's efficiency, e.g., through reward shaping. Reward shaping incentivizes the DRL agent to select actions similar to those selected by the heuristic policy.
Like De Moor et al. (2022), we propose an approach that leverages domain knowledge to improve DRL's efficiency. Structurally, our approach differs from the reward shaping approach adopted by De Moor et al. (2022) as follows: reward shaping alters the MDP formulation to reward actions that coincide with the actions selected by a teacher heuristic, and involves tunable parameters that control the amount of deviation that is allowed. Our approach uses the heuristic as a starting point for further improvements, without the need for any parameters.

The dynamic traveling multi-maintainer problem with alerts
The K-DTMPA is a discrete-time model in which a central decision-maker is responsible for selecting the actions of the $K$ maintenance engineers, denoted by $\mathcal{K} = \{1, \dots, K\}$. To prevent and resolve failures, the assets require maintenance regularly. The engineers maintain a set of assets (machines), denoted by $\mathcal{M} = \{1, \dots, M\}$, each positioned at a unique location in the network. Each asset $m \in \mathcal{M}$ is subject to degradation which occurs randomly over time. The degradation state is collected in real-time via sensors. After an increase in degradation, an alert is issued that informs the decision-maker about the state of the assets. In each time period, the decision-maker selects an action $u_k$ for the $k$-th maintenance engineer, for each $k \in \mathcal{K}$. The action space per engineer consists of actions to travel to another location, to idle or continue the ongoing activity, or to start maintenance at the current location.
The objective is to minimize the total expected discounted cost over an infinite horizon. Maintenance is referred to as preventive maintenance (PM) when carried out before a failure occurs, as opposed to corrective maintenance (CM), which is carried out after a failure has occurred; both restore the degradation state of the asset to as-good-as-new. Typically, CM comes at a higher cost compared to PM. Until maintenance is completed, the machine is down, and the decision-maker incurs downtime costs during this time. The work covered in this paper extends to any degradation model; however, for numerical tractability we have opted to implement the framework proposed by Derman (1963). In the numerical section, we assume that, without interference by an engineer, the next state is a "worse" state, viz. $x_m(t+1) \geq x_m(t)$, and that the transition time $T_m^{x_m}$ between degradation states $x_m \in \mathcal{N}_m \setminus \{x_m^f\}$ and $x_m + 1$ is random but positive and integer, where $\mathcal{N}_m$ denotes the set of degradation states of machine $m$ and $x_m^f$ its failed state. Specifically, the random variable $T_m^{x_m}$ follows a Geometric distribution with success parameter $p_m^{x_m} \in (0, 1]$.
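As a minimal sketch of the degradation model assumed in the numerical section, the following Python snippet advances every machine by one period. Because a Geometric transition time is memoryless, a machine in state $x_m$ moves to $x_m + 1$ in any given period with probability $p_m^{x_m}$. The function name `degradation_step`, the integer state encoding and the nested layout of `p` are illustrative assumptions, not part of the model description.

```python
import random


def degradation_step(states, failed_states, p, rng):
    """Advance every machine by one period without engineer interference.

    states[m]        -- current degradation state x_m (integers, larger is worse)
    failed_states[m] -- failed state x_m^f of machine m
    p[m][x]          -- success parameter of the Geometric transition time out
                        of state x of machine m (hypothetical data layout)
    """
    next_states = []
    for m, x in enumerate(states):
        # Memorylessness: in each period the machine moves to the next state
        # with probability p[m][x]; otherwise it stays where it is.
        if x < failed_states[m] and rng.random() < p[m][x]:
            next_states.append(x + 1)
        else:
            next_states.append(x)
    return next_states


# Illustrative parameters only: two machines with states 1 (new) to 3 (failed).
rng = random.Random(42)
p = [{1: 0.2, 2: 0.5}, {1: 0.1, 2: 0.3}]
states = [1, 1]
for _ in range(10):
    states = degradation_step(states, [3, 3], p, rng)
print(states)
```

The failed state is absorbing in this sketch; restoring a machine to as-good-as-new is left to the maintenance actions.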
The remainder of this section formalizes the sequential decision-making K-DTMPA model framework.
We detail the states, actions, transitions, costs and the optimization objective hereinafter.

States, actions and transitions
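The state $h \in \mathcal{H}$ contains the degradation state of all machines in the network and $K$ blocks capturing the status of the engineers. Each block consists of three elements that describe the engineer's location, current activity and availability. A state can thus be represented as a vector $h = (x_1, \dots, x_M, \ell_1, \iota_1, \delta_1, \dots, \ell_K, \iota_K, \delta_K)$, where $x_m \in \mathcal{N}_m$ is the degradation state of machine $m$; $\ell_k$ is the location of the $k$-th engineer; $\iota_k$ indicates whether the engineer is carrying out maintenance (otherwise, the engineer is either traveling or idling); and $\delta_k \in \Delta \subset \mathbb{N}_0$ counts the remaining time units that the engineer is occupied (e.g., $\delta_k = 0$ specifies that the engineer is available). Under the assumption of Geometric transition times between subsequent degradation states, the elapsed times can be excluded from the state description $h$, because the Geometric distribution is memoryless.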
At every time instance, for each maintenance engineer, given $h \in \mathcal{H}$, the decision-maker must choose one of the following options: (i) start traveling to location $m \in \mathcal{M} \setminus \{\ell_k\}$, say $u_m$; (ii) start maintenance at the present location, say $v$; or (iii) continue the ongoing activity or remain idle, say $u_{\ell_k}$, with $\ell_k$ denoting the location of the engineer for which an action is selected, say the $k$-th engineer, $k \in \mathcal{K}$. In particular, when $\delta_k > 0$, the $k$-th engineer is unavailable and therefore action $u_{\ell_k}$ must be chosen, while if $\delta_k = 0$, then the action can be chosen from the set $\{u_m\}_{m \in \mathcal{M}} \cup \{v\}$, where maintenance action $v$ is only available if the machine at its location is not already being maintained by another engineer. Thus, the state-dependent action set for the $k$-th maintenance engineer becomes
$$\mathcal{U}_k(h) = \begin{cases} \{u_{\ell_k}\} & \text{if } \delta_k > 0, \\ \{u_m\}_{m \in \mathcal{M}} & \text{if } \delta_k = 0 \text{ and the machine at } \ell_k \text{ is being maintained by another engineer}, \\ \{u_m\}_{m \in \mathcal{M}} \cup \{v\} & \text{otherwise}. \end{cases}$$
The state-dependent action set $\mathcal{U}(h)$ is the Cartesian product of the $K$ individual state-dependent action sets excluding those actions that result in two or more engineers simultaneously maintaining a machine, i.e.,
$$\mathcal{U}(h) = \{ (a_1, \dots, a_K) \in \mathcal{U}_1(h) \times \dots \times \mathcal{U}_K(h) : a_j = a_k = v \text{ implies } \ell_j \neq \ell_k \text{ for all } j \neq k \}.$$
The state transition $h \to h'$ is decomposed in two stages: the first stage is determined by the deterministic consequences of the chosen actions, say $a = (a_1, \dots, a_K) \in \mathcal{U}(h)$, and the second stage is determined by the random evolution of the degradation processes. More specifically, $h \to h^a \to h'$, where $h^a$ denotes the intermediate state after the first stage. The order of handling the actions $a_1, \dots, a_K$ is irrelevant; therefore, we only detail how the action of the $k$-th maintenance engineer is processed.
In the case that $a_k = u_{\ell_k}$, the engineer remains at the current location $\ell_k$ and continues the ongoing action or idles. The remaining unavailability $\delta_k$ is increased by one if the engineer is idle.
In the case that $a_k = u_m$ with $m \neq \ell_k$, the engineer starts to travel to location $m$. The remaining unavailability $\delta_k$ is increased with the travel time $\theta_{\ell_k m}$ and the location $\ell_k$ is updated accordingly. The third and fourth possibility, when $a_k = v$ with $x_{\ell_k} < x_{\ell_k}^f$ and $a_k = v$ with $x_{\ell_k} = x_{\ell_k}^f$, respectively, represent initiating PM or CM at the current location, depending on the status $x_{\ell_k}$ of the machine at the location of the engineer. The remaining unavailability $\delta_k$ is increased with the duration of maintenance. Following the modeling assumptions, the degradation state $x_m$ of machine $m$ is set to $x_m^f$ to indicate that the machine is unavailable during maintenance.
In the second stage, determining $h'$ is separated in two steps: first, we update the state of every machine according to the random evolution of the degradation process or completion of maintenance. Subsequently, we update the remaining state variables. In more detail, the state is modified as follows:
1. One of the following three cases determines the evolution of machine $m \in \mathcal{M}$. In the first case, after completion of PM or CM, the state of the machine is updated to as-good-as-new. Else, in the second case, with probability $\mathbb{P}(T_m^{x_m^a} = 1) = p_m^{x_m^a}$, the machine transitions to the subsequent degradation state $x_m^a + 1$. Otherwise, in the third case, which occurs with probability $1 - p_m^{x_m^a}$, the machine degradation state remains the same.

2. The evolution of each engineer $k \in \mathcal{K}$: the remaining unavailability $\delta'_k$ of the $k$-th engineer is decreased by one when continuing an ongoing activity. The indicator $\iota'_k$ resets when the $k$-th engineer completes an activity.
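The state-dependent action sets described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the string `"stay"` stands in for $u_{\ell_k}$, the tuple `("travel", m)` for $u_m$ and `"maintain"` for $v$; the function names and the representation of the state by the lists `locs` and `deltas` plus a set `under_maintenance` are hypothetical.

```python
from itertools import product

STAY, MAINTAIN = "stay", "maintain"  # stand-ins for u_{l_k} and v


def engineer_actions(k, locs, deltas, under_maintenance, n_machines):
    """Action set of engineer k, given locations and remaining busy times."""
    if deltas[k] > 0:
        # An occupied engineer can only continue the ongoing activity.
        return [STAY]
    actions = [STAY] + [("travel", m) for m in range(n_machines) if m != locs[k]]
    if locs[k] not in under_maintenance:
        # Maintenance is offered only if nobody is already maintaining here.
        actions.append(MAINTAIN)
    return actions


def joint_actions(locs, deltas, under_maintenance, n_machines):
    """Cartesian product of the per-engineer sets, minus joint actions in
    which two or more engineers start maintaining the same machine."""
    per_engineer = [
        engineer_actions(k, locs, deltas, under_maintenance, n_machines)
        for k in range(len(locs))
    ]
    feasible = []
    for joint in product(*per_engineer):
        sites = [locs[k] for k, a in enumerate(joint) if a == MAINTAIN]
        if len(sites) == len(set(sites)):
            feasible.append(joint)
    return feasible


# Two available engineers at the same location in a two-machine network.
print(len(joint_actions([0, 0], [0, 0], set(), 2)))
```

Enumerating the joint set in this way makes the exponential growth in the number of engineers tangible: each additional available engineer multiplies the number of candidate joint actions by up to $M + 1$.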

Cost structure and objective
The cost structure of the K-DTMPA includes costs for travel, maintenance and asset unavailability.
A small cost $c^T \in \mathbb{R}_+$ is paid for each unit of time that an engineer travels, independent of the maintenance engineer, the origin and the destination. Initiating PM or CM on machine $m \in \mathcal{M}$ incurs a maintenance cost. The failed state $x_m^f$ models the unavailability of asset $m$, i.e., when the asset has failed or is under repair. The cost of downtime is $c_m^{DT} \in \mathbb{R}_+$ per time unit, regardless of the source of disruption.
Thus, when taking action $a \in \mathcal{U}(h)$ in state $h$, the incurred cost $C(h, a)$ is the sum of the travel, maintenance and downtime costs incurred in that period. The objective is to devise a policy $\pi$ that minimizes the total expected discounted cost. A policy $\pi = (\pi_1, \pi_2, \dots, \pi_t, \dots)$ is defined as a sequence of decision rules, where the decision rule $\pi_t$ is a probability distribution over the action space $\mathcal{U}(h)$ at time $t$, given the state $h \in \mathcal{H}$. We denote with $\pi_t^k$ the induced probability distribution over the action set $\mathcal{U}_k(h)$ for the $k$-th engineer. Let $\gamma \in [0, 1)$ be the discount factor and $J(\pi)$ be the total expected discounted cost. Thus, the objective is to determine the optimal policy $\pi^*$ satisfying
$$\pi^* \in \arg\min_{\pi} J(\pi), \qquad J(\pi) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^t \, C(h(t), a(t)) \right],$$
where $(h(t), a(t))$ denotes the tuple of the state and the respective action given the policy $\pi_t$ at time $t$, $t \geq 0$, and $C(\cdot)$ denotes the associated cost (maintenance, travel and downtime).

Heuristic policies and benchmarks
In this section, we discuss the aspects that characterize a good policy for a K-DTMPA instance.
Subsequently, we detail the dispatching heuristic with which we equip existing ranking heuristics.
We argue that the resulting class of heuristics contains compelling benchmarks for suitable subclasses of K-DTMPA instances. The K-DTMPA framework encompasses a wide range of dynamic traveling maintainer problem instances; however, suitable parameter choices will reduce the resulting problem to well-studied problems. For instance, by considering a single maintainer, the problem reduces to the DTMPA under full state information studied by da Costa et al. (2023). Setting the number of states $|\mathcal{N}_m| \equiv 2$ yields a problem similar to a dynamic vehicle routing problem. Thus, we can compare our DRL algorithm to heuristics designed for such special cases.

Aspects of a good policy
What constitutes a good policy is the ability to jointly consider: the network layout, the spatial and temporal information of all engineers, the uncertainty in the evolution of each machine's condition, and the cost structures. Such a policy must account for the fact that faster degrading machines typically require maintenance more regularly and that CM actions, which are expensive compared to PM, must be avoided. In addition, the downtime costs require engineers to move proactively to anticipate future events in the network.
On that account, the aspects that construct a good policy are envisioned to be the following: (i) efficient dispatching of the engineers to the various locations; (ii) assessing the risk of delaying preventive maintenance; and (iii) tactical repositioning of any remaining available engineers. See Figure 2 for the visualization and further elaboration on these three policy aspects.

Dispatching heuristic policy
The existing greedy and reactive ranking heuristic solutions introduced by da Costa et al. (2023, Section 5.1) are not immediately applicable to the K-DTMPA setting since they do not include a dispatching mechanism and since they operate at an information disadvantage. In the setting of da Costa et al. (2023), these heuristics only observe transitions to the first degradation state $x_m^h + 1$ and the failed state $x_m^f$, while in the K-DTMPA setting all degradation state transitions are observed. Therefore, it is paramount to extend the greedy and reactive ranking heuristics to the multi-maintainer setting in line with the policy aspects discussed in Section 4.1. The reactive heuristic maintains only failed machines whereas the greedy heuristic maintains both alerted and failed machines. To determine which machines to maintain, we require a ranking between assets. In this work, we consider a state-dependent threshold ranking leveraging the available degradation information. A second optimization step accounts for costs and travel times when dispatching the available maintenance engineers to the ranked assets. The remaining engineers are assigned the action to remain idle or continue with the current activity, i.e., no repositioning step is performed.
We denote the resulting dispatching heuristic policy by π D .
State-dependent threshold ranking. Given the alert information, the machines are ranked on their observed degradation level. Let $t \in \mathbb{N}_0$ be the current time and $h \in \mathcal{H}$ be the state at time $t$. Machine $m$ is added to the ranking when $x_m(t) \geq s_m(h)$, where $s_m(h) \in \mathcal{N}_m$ is the state-dependent degradation threshold corresponding to machine $m$. Note that setting $s_m(h) \equiv |\mathcal{N}_m|$ yields a reactive policy. If there are more ranked assets than the $K'$ available engineers, we iteratively reduce the ranking as follows: an asset that is farthest from one of the available engineers is chosen randomly and it is removed from the ranking.
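The threshold ranking and its reduction step can be sketched in Python as follows. The sketch rests on one interpretive assumption that should be read with care: "farthest from one of the available engineers" is taken to mean the asset whose travel time to its nearest available engineer is largest, with ties broken at random. The function name `rank_assets`, the fixed (state-independent) thresholds and the travel-time matrix are hypothetical.

```python
import random


def rank_assets(x, thresholds, engineer_locs, travel_time, rng):
    """Return the machines selected for maintenance.

    x[m]              -- observed degradation state of machine m
    thresholds[m]     -- threshold s_m (kept fixed here for simplicity)
    engineer_locs     -- locations of the K' available engineers
    travel_time[i][j] -- travel time from location i to location j
    """
    if not engineer_locs:
        return []
    ranked = [m for m in range(len(x)) if x[m] >= thresholds[m]]
    while len(ranked) > len(engineer_locs):
        # Assumed reading: distance of an asset = travel time from its
        # nearest available engineer; drop one of the farthest assets.
        nearest = {
            m: min(travel_time[loc][m] for loc in engineer_locs) for m in ranked
        }
        worst = max(nearest.values())
        candidates = [m for m in ranked if nearest[m] == worst]
        ranked.remove(rng.choice(candidates))
    return ranked


travel = [[0, 1, 2, 5], [1, 0, 1, 4], [2, 1, 0, 3], [5, 4, 3, 0]]
print(rank_assets([3, 1, 2, 3], [2, 2, 2, 2], [0, 1], travel, random.Random(0)))
```

Setting every threshold to the failed state in this sketch reproduces the reactive variant, in which only failed machines enter the ranking.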
Dispatching step. The ranked assets must now be assigned to the available engineers. The formulation of the assignment problem in its general form is as follows: the problem instance has a number of engineers and a number of maintenance jobs. Any engineer can be assigned to perform any job, incurring some cost that may vary depending on the engineer-job assignment. It is required to perform as many jobs as possible by assigning at most one engineer to each job and at most one job to each engineer in such a way that the total cost of the assignment is minimized. By construction, the assignment problem contains at most $K'$ jobs. When there are fewer jobs than available engineers, the so-called unbalanced assignment problem can be reformulated as a balanced assignment problem by adding dummy jobs (i.e., jobs with a cost of 0 for each available engineer).
The pairwise travel time between ranked assets and available engineers is the natural choice to construct the cost matrix in the case of identical assets (in terms of cost parameters and distributional degradation characteristics). The Hungarian method solves the constructed assignment problem in $O(M^3)$ time (Schrijver, 2003, Chapter 17.2). The engineers are dispatched according to the solution to the constructed assignment problem.
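The dispatching step can be illustrated with a small Python sketch that pads the ranking with dummy jobs of cost 0 and then finds a minimum-cost assignment based on travel times. Note that, to stay dependency-free, the sketch enumerates all permutations instead of running the Hungarian method; this brute-force stand-in is only sensible for a handful of engineers, and a polynomial-time solver would replace it in practice. The function name `dispatch` and the data layout are hypothetical.

```python
from itertools import permutations


def dispatch(engineer_locs, ranked_assets, travel_time):
    """Assign available engineers to ranked assets at minimum total travel time.

    Returns (assignment, cost), where assignment maps an engineer index to an
    asset, or to None when the engineer receives a dummy job.
    """
    n = len(engineer_locs)
    if len(ranked_assets) > n:
        raise ValueError("ranking must contain at most one asset per engineer")
    # Balance the problem: dummy jobs (None) cost 0 for every engineer.
    jobs = list(ranked_assets) + [None] * (n - len(ranked_assets))
    best_cost, best_perm = None, None
    # Brute-force enumeration, used here in place of the Hungarian method.
    for perm in permutations(range(n)):
        cost = sum(
            0 if jobs[j] is None else travel_time[engineer_locs[k]][jobs[j]]
            for k, j in enumerate(perm)
        )
        if best_cost is None or cost < best_cost:
            best_cost, best_perm = cost, perm
    return {k: jobs[j] for k, j in enumerate(best_perm)}, best_cost


travel = [[0, 1, 2, 5], [1, 0, 1, 4], [2, 1, 0, 3], [5, 4, 3, 0]]
assignment, cost = dispatch([0, 3, 1], [2, 3], travel)
print(assignment, cost)
```

Engineers mapped to `None` are the ones that, in the heuristic described above, remain idle or continue their current activity.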

Approximate policy iteration for K-DTMPA
Da Costa et al. (2023) have shown that DRL, specifically, n-step quantile regression double Q-Learning (nQR-DDQN), produces near-optimal policies for DTMPA instances. However, training times exceed 12 hours even for networks containing up to 6 assets. To learn policies for K-DTMPA instances, we adopt a form of approximate policy iteration (API), which enables us to distribute the sample collection over multiple compute nodes (cf. Silver et al. (2018)), contributing substantially to scalability. In particular, we adopt deep controlled learning (DCL) (Temizöz et al., 2023), which combines variance reduction and optimized allocation of roll-outs for greater efficiency and trains neural networks using cross-entropy (instead of Euclidean) loss functions. DCL has been shown to outperform other DRL algorithms such as proximal policy optimization or asynchronous advantage actor-critic on inventory problems (Temizöz et al., 2023).
To successfully apply API/DCL to K-DTMPA instances, we combine several novel ideas that shall be discussed in depth in this section. First, we provide an overview of the application of API to K-DTMPA instances (for details, we refer to Temizöz et al. (2023)) and then discuss the feature representation of state information, training the neural network classifier and suitable initial solutions in the forthcoming sections.
Recall from Section 3.1 that the action space $\mathcal{U}(h)$ grows exponentially in the number of engineers $K$. To vastly reduce the action space complexity, we train a neural network to select the actions for the engineers sequentially in a fixed order. Due to symmetry, any ordering of the engineers can be adopted. To enable cooperation, after each action selection, the input is updated with the consequences of the action (see Section 3.1) before selecting an action for the next engineer.
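Let $\mathcal{I} = \{1, \dots, |\mathcal{I}|\}$ be some index set. Starting from some initial policy $\pi_0$, we interact with the environment to collect a data set $\mathcal{D} = \{(h_i^{a^{k-1}}, a_i^k)\}_{i \in \mathcal{I}}$ of state-action pairs for which the action has been obtained using a form of simulation optimization. Here, $h^{a^{k-1}}$ denotes the state $h$ updated with the consequences of the actions of the first $k - 1$ engineers. More specifically, starting from state $h_i$, given that the first $k - 1$ maintenance engineers select actions $a_i^1, \dots, a_i^{k-1}$, we select the action $a_i^k$ for the $k$-th engineer in state $h_i^{a^{k-1}}$ that minimizes the action-value function $q_{\pi_0}(h_i^{a^{k-1}}, \cdot)$. For an action $\tilde{a} \in \mathcal{U}_k(h^{a^{k-1}})$, the action-value function is defined as follows:
$$q_{\pi_0}(h^{a^{k-1}}, \tilde{a}) = \mathbb{E}_{\pi_0}\left[ \sum_{t=0}^{\infty} \gamma^t \, C(h(t), a(t)) \,\middle|\, h(0) = h,\ a^1(0), \dots, a^{k-1}(0),\ a^k(0) = \tilde{a} \right].$$
In other words, $q_{\pi_0}(h^{a^{k-1}}, \tilde{a})$ is the total expected discounted cost when selecting action $\tilde{a}$ for state $h^{a^{k-1}}$ and following the policy $\pi_0$ in the remainder of the roll-out, i.e., the actions of the remaining engineers $j > k$ at time $0$ and the actions of all engineers at times $t \geq 1$ are drawn from $\pi_0$. To estimate $q_{\pi_0}(h^{a^{k-1}}, \tilde{a})$, we generate $r \in \{r_{\min}, \dots, r_{\max}\}$ independent roll-out simulations of length $T \sim \mathrm{Geo}(1 - \gamma)$ and compute the undiscounted trajectory costs $Q^j_{\pi_0}(h^{a^{k-1}}, \tilde{a})$ for $j = 1, \dots, r$. The resulting unbiased estimator (Haviv and Puterman, 1992),
$$\hat{q}_{\pi_0}(h^{a^{k-1}}, \tilde{a}) = \frac{1}{r} \sum_{j=1}^{r} Q^j_{\pi_0}(h^{a^{k-1}}, \tilde{a}),$$
satisfies $\mathbb{E}[\hat{q}_{\pi_0}(h^{a^{k-1}}, \tilde{a})] = q_{\pi_0}(h^{a^{k-1}}, \tilde{a})$. Subsequently, the improved action $a_i^k$ for state $h_i^{a^{k-1}}$ is
$$\pi^+(h_i^{a^{k-1}}) = \arg\min_{\tilde{a} \in \mathcal{U}_k(h_i^{a^{k-1}})} \hat{q}_{\pi_0}(h_i^{a^{k-1}}, \tilde{a}).$$
We refer to $\pi^+(h^{a^{k-1}})$ as the simulation-based policy for state $h^{a^{k-1}}$. See Figure 3 for a visualization.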
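The roll-out estimator relies on the fact that an undiscounted sum over a horizon $T \sim \mathrm{Geo}(1 - \gamma)$ has the same expectation as the infinite discounted sum, since $\mathbb{P}(T > t) = \gamma^t$. The Python sketch below checks this on a deliberately simple stand-in: a deterministic per-period cost sequence replaces the K-DTMPA simulator, and the function names are hypothetical.

```python
import random


def geometric_horizon(gamma, rng):
    """Sample T with P(T = n) = (1 - gamma) * gamma**(n - 1), n = 1, 2, ..."""
    t = 1
    while rng.random() < gamma:
        t += 1
    return t


def rollout_estimate(step_cost, gamma, n_rollouts, rng):
    """Average undiscounted cost over roll-outs with Geometric length.

    step_cost(t) returns the cost incurred in period t; in the actual
    algorithm this would come from simulating the initial policy.
    """
    total = 0.0
    for _ in range(n_rollouts):
        horizon = geometric_horizon(gamma, rng)
        total += sum(step_cost(t) for t in range(horizon))
    return total / n_rollouts


gamma = 0.9
# A constant cost of 1 per period has discounted value 1 / (1 - gamma).
estimate = rollout_estimate(lambda t: 1.0, gamma, 20000, random.Random(1))
print(estimate, 1.0 / (1.0 - gamma))
```

The random horizon removes the need to truncate long trajectories at an arbitrary cut-off, at the price of additional variance, which is one reason a variable number of roll-outs $r$ per action can be useful.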
Ideally, we include those states that are visited most frequently under this simulation-based policy.
Thus, starting from some initial state $h_0 \in \mathcal{H}$, states for $\mathcal{D}$ are those that are encountered when selecting in each decision epoch, for each engineer, the randomized simulation-based policy: the policy that chooses a random action with probability $\epsilon \in [0, 1]$ and, with probability $1 - \epsilon$, follows the simulation-based policy $\pi^+$.
Subsequently, we train a neural network classifier on $\mathcal{D}$ (see Section 5.2 for more details), which induces a hopefully improved policy. API can be transformed into an iterative scheme by collecting new data using the improved neural network policy. Like exact policy iteration, API can potentially improve any heuristic policy and find good solutions in a handful of iterations.
Figure 3: For the $k$-th engineer, the simulation-based policy selects the action $\tilde{a} \in \{u_m\}_{m=1}^{M} \cup \{v\}$ that minimizes the average undiscounted trajectory cost $\hat{q}_{\pi_0}(h^{a^{k-1}}, \tilde{a})$. Unbiased estimates of the action-value function are computed from $r$ independent roll-out simulations whose length follows a Geometric distribution with parameter $1 - \gamma$.
In summary, the API algorithm consists of the following three steps:
1. Choose a suitable initial solution $\pi_0$.
2. Construct the data set $D$ using $\pi_0$.
3. Train a neural network classifier on the constructed data set $D$.
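The three steps above, repeated to obtain the iterative scheme, can be sketched as a short loop. This is only a skeleton under stated assumptions: `collect_dataset` and `train_classifier` are hypothetical callables standing in for the simulation-based data collection and the supervised learning step, respectively.

```python
def approximate_policy_iteration(initial_policy, collect_dataset, train_classifier, n_iterations):
    """Return the successive policies [pi_0, pi_theta_1, ..., pi_theta_n].

    collect_dataset(policy) -> list of (state, improved_action) pairs   (step 2)
    train_classifier(dataset) -> new policy, a callable state -> action (step 3)
    """
    policies = [initial_policy]  # step 1: the chosen initial solution
    for _ in range(n_iterations):
        dataset = collect_dataset(policies[-1])
        policies.append(train_classifier(dataset))
    return policies
```

Keeping every generation, rather than only the last one, makes it possible to select the best-found policy afterwards, since a newly trained classifier is not guaranteed to improve on its predecessor.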
For step three above, the neural network can be interpreted as a parameterized function from $\mathbb{R}^m$ to $\mathbb{R}^n$ for some $m, n \in \mathbb{N}$. Let such a (generic) function be denoted by $N_\theta(\cdot)$, where $\theta$ denotes the function parameters. The input to the neural network is the feature representation $f(h) \in \mathbb{R}^m$ of a state $h \in H$. The output $N_\theta(\cdot) \in \mathbb{R}^n$ of the neural network, where $n = M + 1$, is transformed into a probability distribution over the action space, and the action $\tilde{a}$ that is assigned the highest probability $N_\theta(\cdot)_{\tilde{a}}$ is chosen, i.e., the neural network induces a policy. Given the actions $a^1, \ldots, a^{k-1}$ for the first $k-1$ engineers, the action $a^k$ for the $k$-th engineer in state $h$ is thus determined from the decision rule $a^k \in \arg\max_{\tilde{a} \in U_k(h^{a^{k-1}})} N_\theta(f(h^{a^{k-1}}))_{\tilde{a}}$, where by convention $h^{a^0} = h$. We denote by $\pi_\theta$ the neural network policy that selects in every decision epoch, for each maintenance engineer $k$, the action prescribed by this decision rule.
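The sequential, per-engineer decision rule can be sketched as follows. This is an illustrative sketch under stated assumptions: `network`, `features`, `feasible_actions` and `apply_action` are hypothetical callables, actions are represented by integer indices into the network output, and the softmax is one common way (assumed here) to turn raw outputs into a probability distribution.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw network outputs."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_joint_action(state, n_engineers, features, network, feasible_actions, apply_action):
    """Select one action per engineer, conditioning on the actions already chosen."""
    actions = []
    partial_state = state  # convention: before any action is chosen, the state is h itself
    for k in range(n_engineers):
        probs = softmax(network(features(partial_state, k)))
        allowed = feasible_actions(partial_state, k)  # indices of feasible actions
        a_k = max(allowed, key=lambda a: probs[a])    # most probable feasible action
        actions.append(a_k)
        partial_state = apply_action(partial_state, k, a_k)
    return actions
```

Selecting the actions one engineer at a time keeps the network output at $M + 1$ entries, instead of one entry per joint action of all $K$ engineers.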

Feature representation
We propose a handcrafted feature representation to make the state information suitable for input into the neural network. Although a much more compact representation of the state $h$ is possible, we propose a feature design based on conveying the state information per location in a form that is tailored to the engineer for which we are currently selecting an action. We have found that this engineer-centric feature design is crucial to efficiently learn cooperative dispatching mechanisms.
Specifically, the state is transformed as follows: $f_1(h) = \big( (x_m, n^{\mathrm{av}}_m, n^{\mathrm{ua}}_m, t^{\nu}_m, t^{\Theta_1}_m, t^{\Theta_2}_m, \xi_m)_{m \in \mathcal{M}}, \sum_{m \in \mathcal{M}} n^{\mathrm{av}}_m \big)$, i.e., the feature vector $f_1(h)$ contains an information block $(x_m, n^{\mathrm{av}}_m, n^{\mathrm{ua}}_m, t^{\nu}_m, t^{\Theta_1}_m, t^{\Theta_2}_m, \xi_m)$ that can be computed from $h$ for each $m \in \mathcal{M}$, and one additional feature. Here, $x_m$ is the observed degradation level of machine $m$, and $n^{\mathrm{av}}_m$ and $n^{\mathrm{ua}}_m$ denote the number of available and unavailable maintenance engineers at location $m$, respectively. The entry $t^{\nu}_m$ captures the remaining time to the completion of a maintenance job at location $m$, whilst $t^{\Theta_1}_m$ and $t^{\Theta_2}_m$ indicate the remaining travel time until the first and second arrival of an engineer at location $m$, respectively. By convention, the default value of $t^{\nu}_m$, $t^{\Theta_1}_m$ and $t^{\Theta_2}_m$ is 0. The last block entry $\xi_m$ indicates whether the maintenance engineer for which we are currently selecting an action is present at location $m$. Lastly, the total number of available engineers is added as an additional feature. All in all, the dimension of the feature vector equals $7M + 1$ and is independent of the number of maintenance engineers $K$. In Section 7.1, we compare the quality of the trained neural networks using the proposed feature representation against (i) a similar feature representation albeit without the last feature (say $f_2(h)$), i.e., $f_1(h) = \big(f_2(h), \sum_{m \in \mathcal{M}} n^{\mathrm{av}}_m\big)$, and (ii) the most compact state representation (say $f_3(h)$).

Fitting the parameters $\theta$ is an iterative, gradient-based process: in each step, the gradient of the training loss $L(\theta)$ with respect to $\theta$ is estimated, and subsequently $\theta$ is updated by taking a step in the opposite direction. We terminate when the loss on the test set, defined analogously to the loss for the training set, no longer decreases.
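The engineer-centric feature design can be illustrated with a short sketch. The container `LocationInfo` and its field names are hypothetical, and no feature scaling is applied; the sketch only shows how the seven-entry block per location and the final aggregate feature yield a vector of length $7M + 1$.

```python
from dataclasses import dataclass


@dataclass
class LocationInfo:
    degradation: int           # x_m, observed degradation level
    n_available: int           # number of available engineers at m
    n_unavailable: int         # number of unavailable engineers at m
    t_maintenance: int = 0     # remaining maintenance time at m (default 0)
    t_first_arrival: int = 0   # remaining travel time until the first arrival (default 0)
    t_second_arrival: int = 0  # remaining travel time until the second arrival (default 0)


def feature_vector(locations, engineer_location):
    """Build the features for the engineer currently selecting an action."""
    features = []
    for m, loc in enumerate(locations):
        xi = 1 if m == engineer_location else 0  # is the deciding engineer at m?
        features.extend([
            loc.degradation, loc.n_available, loc.n_unavailable,
            loc.t_maintenance, loc.t_first_arrival, loc.t_second_arrival, xi,
        ])
    # Additional feature: total number of available engineers in the network.
    features.append(sum(loc.n_available for loc in locations))
    return features
```

Only the indicator entry changes between engineers, so the same network can be queried once per engineer without its input size depending on $K$.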

Initial solutions
Our experiments reveal that choosing a suitable policy $\pi_0$ to initiate API/DCL may significantly reduce computation times. In particular, we have found that the following four properties play a key role.
Exploration. Sufficiently many states must be encountered to yield a rich data set $D$, dismissing for instance the idle policy, viz. the policy that always selects the action to idle for every engineer.
Cooperation. Desirable cooperative behavior is typically hard to learn or improve, limiting the use of the random policy to single-maintainer instances.
Computational complexity. The ability to generate large data sets in a reasonable time is of paramount importance for solving large-scale $K$-DTMPA instances.
Non-self-correcting. Policies that revert deviations are typically not suitable. For example, suppose we adopt a network decomposition approach as $\pi_0$. Such an approach assigns engineers to predetermined clusters of machines (more details are provided in Section 6). Under such a policy, dispatching an engineer outside its cluster will typically be followed by a correcting action that returns the engineer to its cluster, which inhibits the effective learning of cooperative behavior.
The dispatching heuristic policies developed in Section 4.2 satisfy all four properties. The dispatching heuristics encounter all the relevant states while maintaining the network. Moreover, the cooperative behavior of the engineers is optimized in a myopic fashion in polynomial time. The heuristics are also non-self-correcting since they implement no special asset-maintainer constraints.

Numerical experiments
To assess the performance and scalability of the algorithm proposed in Section 5, we construct several asset networks with the number of machines $M$ ranging from 4 to 35. Machine degradation times, i.e., times to go from one degradation level to the next, are geometrically distributed. Under these circumstances, the $K$-DTMPA is a large-scale, computationally intractable MDP. We assess the performance of DCL on a selection of $K$-DTMPA instances for which compelling benchmarks are available: single maintainer instances and dispatching & repositioning (D&R) instances. Indeed, the heuristic approaches developed in Section 4.2 are specifically suitable for the latter instances. The preventive maintenance instances serve as more complex cases to learn new valuable insights using DCL.
Single maintainer instances. In the situation that $K = 1$, the 1-DTMPA reduces to a DTMPA under full state information, cf. da Costa et al. (2023, Section 3.2). For small instances containing up to $M = 4$ machines, the optimal policy is available as a benchmark.
Dispatching & repositioning instances. D&R instances are $K$-DTMPA instances on networks where all assets are identical, both in terms of cost structure and degradation dynamics. Moreover, all assets are assumed to have only two states: healthy and failed. As such, the element of preventive maintenance is eliminated and the sole objective becomes to minimize the unavailability of the assets. The dispatching heuristic developed in Section 4.2 has exactly this objective in mind and can be improved by selecting additional repositioning actions. Therefore, the dispatching heuristic will serve both as the benchmark and as the initial policy for DCL.
Preventive maintenance instances. We modify the D&R instances by including an additional state, so that in total we have three states: healthy, degraded and failed. The machine will transition to the degraded state on average at 75% of the machine's life expectancy. This complicates the objective as the policy now needs to jointly consider the cost structures and the network layout, including the position and availability of the engineers. For such instances, no strong benchmark exists in prior work. As such, we propose a traditional heuristic approach by means of network decomposition: the $K$-DTMPA instance is decomposed into $K$ disjoint 1-DTMPA instances, each of which can be optimized individually using DCL, which is known to perform well for 1-DTMPA instances. We shall refer to this policy as $\pi^{\mathrm{DEC}}_{D}$. To compose the $i$-th generation policy $\pi^{\mathrm{DEC}}_{\theta_i}$, per cluster, we select the best-found neural network policy so far. An example of a decomposition of a $K$-DTMPA instance is given in Section 6.2.
A detailed setup of the experiments follows in the remainder of this section.

Cost structure
We introduce the three cost structures C1, C2 and C3, which are presented in Table 1. To discourage repositioning tasks that yield negligible gain, we introduce a small travel cost which is paid each time unit an engineer is traveling. Each cost structure represents a distinct, realistic relationship between preventive and corrective costs that induces distinctive optimal policies favoring more or less frequent maintenance actions. For example, when $c_{\mathrm{CM}}/c_{\mathrm{PM}}$ is large, i.e., when CM costs greatly surpass PM costs, we expect that preventive maintenance policies outperform reactive policies. In all experiments, we consider a discount factor $\gamma = 0.99$.
Table 1: Cost structures considered in the numerical experiments.

Hospital networks
We construct six $K$-DTMPA instances, each having a different combination of network size, cost structure and machine degradation matrices. Besides the networks introduced in da Costa et al. (2023), we construct two additional geographical layouts with real-life asset network characteristics. To this end, these latter layouts are based on the Dutch hospital network, with the zip codes of the hospitals converted to GPS coordinates using a data set maintained by the Dutch Key Register Addresses and Buildings (Basisregistratie Adressen en Gebouwen). Using these GPS coordinates, we compute the travel time between locations using the public OpenStreetMap application programming interface. The obtained travel times are converted and rounded up to multiples of 15 minutes. We assume repairs take 4 time units, i.e., 1 hour, regardless of the machine, the engineer or the maintenance type.
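The time discretization described above can be sketched in a few lines. The function and constant names are hypothetical, and the input is assumed to be a travel time in minutes; the sketch only illustrates rounding up to multiples of 15 minutes, so that one time unit corresponds to a quarter of an hour and a repair of 1 hour takes 4 time units.

```python
import math

QUARTER_MINUTES = 15   # length of one time unit in minutes
REPAIR_TIME_UNITS = 4  # repairs take 4 time units, i.e. 1 hour


def to_time_units(travel_minutes):
    """Round a travel time in minutes up to a whole number of 15-minute units."""
    return math.ceil(travel_minutes / QUARTER_MINUTES)


def discretize_matrix(minutes_matrix):
    """Convert a travel-time matrix in minutes to a matrix in time units."""
    return [[to_time_units(t) for t in row] for row in minutes_matrix]
```

Rounding up rather than to the nearest multiple means the discretized model never underestimates a travel time.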
Generally, we consider two types of degradation matrices: one type with only two states, used to create the D&R instances, and one type with three states, to include the aspect of preventive maintenance, which increases the problem complexity significantly. We briefly discuss specifics regarding the instances.

Academic hospitals
The first two cases contain the 8 academic hospitals in the Netherlands and will serve as complex yet relatively understandable and manageable examples to study the behavior of the learned policies.
The Dutch academic hospitals are located in Amsterdam (2x), Groningen, Leiden, Maastricht, Nijmegen, Rotterdam and Utrecht, and are serviced by $K = 3$ maintenance engineers; see Figure 4 for a visualization of the geographical layout and the corresponding travel time matrix. For the D&R $K$-DTMPA instance, we adopt the degradation matrix Q1 together with cost structure C1, referred to as M8K3-Q1C1. Under these dynamics, few failures occur and the objective is to minimize machine unavailability. Figure 4 shows a network decomposition of M8K3-Q1C1 into 1-DTMPA instances. Each engineer induces a 1-DTMPA instance on the locations within their assigned cluster, e.g., the first engineer maintains the locations in Amsterdam (2x) and Leiden (and only those). For the preventive maintenance $K$-DTMPA instance M8K3-Q2C3, we adopt the degradation matrix Q2 together with cost structure C3. This models the situation where the alert is issued on average at 75% of the machine's life expectancy and is thus an accurate indicator of failure. The goal now is to perform preventive maintenance while keeping all the machines operational.

City hospitals
We extend the academic hospital network by including a geographically dispersed subset of 35 city hospitals. The network is serviced by $K = 5$ maintenance engineers. This network will serve to study the behavior of the learned policies for industrial-scale $K$-DTMPA instances; see Figure 5 for a visualization of the geographical layout and the asset degradation matrices. For the D&R $K$-DTMPA instance, we adopt the degradation matrix Q3 together with cost structure C1, referred to as M35K5-Q3C1. Under these dynamics, few failures occur and the objective is to minimize machine unavailability whilst maximizing coverage. For the preventive maintenance $K$-DTMPA instance M35K5-Q4C3, we adopt the degradation matrix Q4 together with cost structure C3. Note that the alert is again issued (on average) at 75% of the machine's life expectancy.

Numerical results
This section contains the experimental results for the previously introduced $K$-DTMPA instances. All values are obtained using $10^6$ repetitions, if applicable. The reported half-widths correspond to

Impact of feature design on trained neural network policies
First, we investigate the effectiveness of the proposed feature representation $f_1(h)$, cf. Section 5.1. Recall that $f_2(h)$ is obtained by removing the last entry of $f_1(h)$, i.e., $f_1(h) = \big(f_2(h), \sum_{m \in \mathcal{M}} n^{\mathrm{av}}_m\big)$, and that $f_3(h)$ denotes the most compact state representation. Given a data set $D$ consisting of 500,000 state-action pairs corresponding to the reactive heuristic $\pi_D(s_m(h) \equiv x^f_m)$, for each feature representation, we train 5 neural network policies and report relevant statistics in Table 2.
The choice of the feature design proves crucial: trained neural network policies using feature design $f_3(h)$ (the most compact state representation) barely beat the benchmark set by the reactive heuristic, if at all. Using the proposed feature design $f_1(h)$ results in less training variability and consistently produces policies that perform significantly better. This also holds for the feature representation $f_2(h)$; therefore, dropping the last feature does not significantly affect performance.
Additional experiments for $f_1(h)$ where we also varied the data set $D$ produced similar results. In the forthcoming sections, in all experiments, we use the feature representation $f_1(h)$.

Table 2 highlights the variability of the trained neural network when varying the feature design $f(h)$. In all cases, the initial policy is the reactive dispatching heuristic $\pi_D(s_m(h) \equiv x^f_m)$. We train 5 neural network policies per choice of $f(h)$ and report the performance of the best and worst policy, as well as the average and coefficient of variation (CV) of the acquired performance estimates. The benchmark satisfies $J(\pi_D(s_m(h) \equiv x^f_m)) = 27.612 \pm 0.065$.

Single maintainer instances
To demonstrate that DCL can produce the optimal policy, we focus on the DTMPA instances M4-Q2Q3 and M6-Q2Q3Q4 introduced by da Costa et al. (2023, Section 6.3), under cost structure C2 listed in Table 1. Note that $c_T = 0$, i.e., there is no cost for travel. Moreover, repair times and travel times are assumed to be 1. The corresponding Q-matrices differ from the Q-matrices introduced in Section 6.2, and can be found in Appendix A. The shorthand notation must be interpreted as follows: first, the number of machines and engineers is listed, followed by a sequence of degradation matrices which are assumed to be distributed evenly over the machines. For example, the 1-DTMPA instance M4K1-Q2Q3C2 contains four machines: two with matrix Q2 and two with matrix Q3, all sharing cost structure C2.
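The shorthand notation can be made concrete with a small parser. This is a hypothetical helper, not part of the original work: it assumes the name pattern M..K..-Q..(Q..)C.. used for the instances in this section, and it assumes that the matrices are assigned to consecutive blocks of machines, which the text does not specify.

```python
import re


def parse_instance(name):
    """Parse an instance name such as 'M4K1-Q2Q3C2' into its components."""
    match = re.fullmatch(r"M(\d+)K(\d+)-((?:Q\d+)+)C(\d+)", name)
    if match is None:
        raise ValueError(f"unrecognized instance name: {name!r}")
    n_machines, n_engineers = int(match[1]), int(match[2])
    matrices = re.findall(r"Q\d+", match[3])
    if n_machines % len(matrices) != 0:
        raise ValueError("matrices cannot be distributed evenly over the machines")
    per_matrix = n_machines // len(matrices)
    # Assumption: machines are grouped in consecutive blocks per matrix.
    assignment = [q for q in matrices for _ in range(per_matrix)]
    return {
        "machines": n_machines,
        "engineers": n_engineers,
        "matrices_per_machine": assignment,
        "cost_structure": f"C{match[4]}",
    }
```

For instance, `parse_instance("M4K1-Q2Q3C2")` describes four machines and one engineer, with two machines using Q2 and two using Q3 under cost structure C2, matching the example in the text.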
M4K1-Q2Q3C2. For small instances, the optimal policy $\pi^*$ can be obtained via exact policy iteration. We perform two policy improvement steps using DCL on three dispatching heuristics $\pi_D$, with maintenance thresholds $s_m(h)$ ranging from 3 to 5, and on both the random policy $\pi_R$ and the idle policy $\pi_I$. DCL consistently produces a near-optimal policy after only two iterations, regardless of the initial solution. The best-found neural network policy places the optimality gap (computed as the relative increase over $J(\pi^*)$) at only 0.45%. The improvement over the nQR-DDQN policy proposed by da Costa et al. (2023) is 8.34%. (Note that DCL operates at an information advantage compared to nQR-DDQN, since the latter only observes transitions to the first degradation state $x^h + 1$ and the failed state $x^f$, while the former observes all degradation state transitions.) Observe from Table 3 that the best performing dispatching heuristic ($\pi_D$ with $s_m(h) \equiv 3$) is not necessarily the best choice of initial policy for a single one-step improvement; the neural network policy trained using the reactive dispatching heuristic (with $s_m(h) \equiv x^f_m \equiv 5$) performs at least 1.58% better than any of the other first-generation neural network policies. The idle policy is a particularly poor choice since it produces a rather homogeneous data set and only learns to start maintenance at the current location in the first iteration.
M6K1-Q2Q3Q4C2. For this instance, computing the optimal policy is intractable and thus the best available solution in the literature is the neural network policy trained by nQR-DDQN. We perform three policy improvement steps using DCL on three dispatching heuristics $\pi_D$ with varying maintenance thresholds and the random policy $\pi_R$. Similarly, the best neural network policy yields a 12.34% advantage over the $\pi_{\text{nQR-DDQN}}$ policy (which may be partially due to informational advantage) and the reactive dispatching heuristic consistently produces the best policy after the first iteration. The results on the M4K1-Q2Q3C2 instance, together with observations from Table 4, indicate that the neural network policy improvements have ended, which could be taken as weak evidence that the best performing neural network policy is near-optimal.

Dispatching & repositioning instances
To illustrate how DCL improves upon an existing solution, we now turn our attention to the $K$-DTMPA instances M8K3-Q1C1 and M35K5-Q3C1. For both instances, the benchmark is set by the reactive dispatching heuristic $\pi_D(s_m(h) \equiv x^f_m)$. We note that for D&R instances, this benchmark is expected to be rather strong and difficult to beat.
M8K3-Q1C1. In this 3-DTMPA instance, the engineers are initially placed in the cities Amsterdam, Maastricht and Rotterdam. We perform three policy improvement steps using DCL on the reactive dispatching heuristic $\pi_D(s_m(h) \equiv x^f_m)$, the random policy $\pi_R$ and the decomposition heuristic $\pi^{\mathrm{DEC}}_{D}$, for which we show the results in Table 5. The best neural network policy $\pi_{\theta_1}$ improves upon the benchmark by 1.74%. A second iteration yields a further 0.67% performance improvement and a third step accomplishes an additional 1.07% cost reduction, and thus a total performance improvement of 3.48% over the benchmark. Three steps of DCL improving $\pi_R$ are not sufficient to outperform the benchmark, meaning that $\pi_R$ is not a suitable initial solution in a cooperative setting. Moreover, decomposing the network into clusters and solving the induced 1-DTMPA instances individually does not bring us close to the benchmark either.
We next briefly illustrate how the best neural network policy improves upon the benchmark in terms of the behavioral aspects introduced in Section 4.1. In Figure 6a, we observe that the DRL agent moves an available engineer from Rotterdam to the centrally located Utrecht, which is likely a better initial placement of the engineer. The DRL agent's dispatching strategy is similar to the reactive heuristic and handles the tricky cases correctly as well; see for instance the dispatching problem in Figure 6b. When there are events at remote locations in the network, the DRL agent has learned to proactively move an engineer to achieve better coverage. Figure 6c shows such a tactical repositioning of an engineer from Nijmegen to Amsterdam.
M35K5-Q3C1. In the first large-scale instance, the initial placement of the additional two engineers is Arnhem and Groningen. We perform two policy improvement steps using DCL on the reactive dispatching heuristic $\pi_D(s_m(h) \equiv x^f_m)$ and the decomposition heuristic $\pi^{\mathrm{DEC}}_{D}$, for which we show the results in Table 6. The best-found neural network policy $\pi_{\theta_2}$ improves upon the benchmark by 5.35%. Besides a better strategic initial positioning, the neural network policy learns to travel using intermediate locations, see Figure 7. The benefit of such behavior is that it gives the engineer an extra decision epoch, e.g., to divert to another location or even catch a failure at one of the intermediate locations. Moreover, after two iterations of DCL, sharing resources over the network is shown to yield a 6.37% cost improvement.

Preventive maintenance instances
We show that DCL can also handle the more complex $K$-DTMPA instances M8K3-Q2C3 and M35K5-Q4C3. For these instances, no strong benchmark exists and therefore, we consider both the greedy and the reactive dispatching heuristics $\pi_D$, as well as the decomposition heuristic $\pi^{\mathrm{DEC}}_{D}$.
M8K3-Q2C3. We perform three steps of DCL on the various heuristics, for which we show the results in Table 7. The best neural network policy $\pi_{\theta_3}$ improves upon the best heuristic by 6.44% and shows that at least a 5.82% cost reduction can be achieved by sharing resources over the network. We briefly illustrate how the third generation neural network policy $\pi_{\theta_3}$ improves upon the benchmark in terms of the behavioral aspects introduced in Section 4.1. In Figure 8a, we see that the trained agent proactively moves available engineers to alerted locations, but postpones maintenance on them. When there are many alerts in the network, the trained agent chooses to initiate preventive maintenance, see Figure 8b. When there are events at remote locations in the network, as in Figure 8c, the DRL agent proactively dispatches an engineer.
M35K5-Q4C3. We perform a single iteration of DCL on the reactive dispatching heuristic $\pi_D(s_m(h) \equiv x^f_m)$ due to the relatively large cost of additional steps (see Appendix D). We compare against three policy improvement steps on the network decomposition heuristic $\pi^{\mathrm{DEC}}_{D}$, for which we show the results in Table 8. The neural network policy $\pi_{\theta_1}$ improves upon the initial policy by 6.08% and shows that a 1.17% cost reduction can be achieved by sharing resources over the network. In Figure 9a, we see that the trained agent proactively moves available engineers to alerted locations while ensuring a large coverage of the network. The trained agent, however, chooses to prioritize corrective maintenance, see Figure 9b.

Robustness of policies to changes in the model
In some cases, a neural network policy trained to optimize a $K$-DTMPA instance also yields a policy for a compatible $K$-DTMPA instance, but since the neural network was trained for a specific instance, this is somewhat detrimental to performance. We briefly investigate this next.
A trained neural network policy is also a policy for a compatible instance when the instance preserves the dimension of the feature vector $f(h)$, viz. when the number of machines remains the same or is reduced. (The case of removing machines can be tackled by replacing them with dummy machines that never emit alerts.) We investigate the effect of removing one hospital/machine (at location 2) and hiring/firing a single maintenance engineer (at location 2 and 1, respectively) for the cases M8K3-Q1C1, M35K5-Q3C1 and M8K3-Q2C3, for which we show the results in Table 9. The case where only one machine is removed preserves the most similarity in the encountered features, the main difference being that one asset never emits alerts. In a network with identical assets, the trained neural network policy is thus robust against removing a single asset. Note that robustness against completely removing a machine is also an indication that performance would be robust against slight modification of machine degradation models. When removing a maintenance engineer, the trained neural network policy produced a good solution 2 out of 3 times. However, performance decreases significantly when adding a maintenance engineer, although the neural network policy still provides a suitable initial solution. An alternative approach to tackle such instances could be to construct a policy as follows: use the trained neural network to dispatch $K$ engineers and employ a classical heuristic for the additional engineer. This, however, is outside the scope of this work.

Table 9 entries (row and column layout not recoverable): 24.117 ± 0.057, 23.908 ± 0.055, 23.390 ± 0.053, 21.833 ± 0.061, 63.869 ± 0.141, 61.085 ± 0.133, 35.601 ± 0.086, 98.050 ± 0.417, 28.550 ± 0.069, 29.584 ± 0.110, 72.570 ± 0.163, 69.442 ± 0.153.

Conclusion and discussion
In this work, we study the dynamic traveling multi-maintainer problem with alerts ($K$-DTMPA) for a network of modern industrial assets with stochastic failure times maintained by $K$ maintenance engineers. We extended the existing 1-DTMPA framework under the assumption of perfect information proposed by da Costa et al. (2023). To demonstrate the effectiveness of our approach, we extended existing ranking heuristics to the multi-maintainer setting. More specifically, we equip ranking heuristics that rank alerts based on their observed degradation levels with a state-of-the-art dispatching algorithm. Moreover, we propose an additional benchmark heuristic through decomposition: we decompose the network into $K$ disjoint 1-DTMPA instances using a handcrafted network clustering and solve each of the induced subproblems individually using DRL.
The results for small instances show that we can get close to the performance of optimal policies within a few iterations of the algorithm, regardless of the choice of the initial solution. When the problem complexity increases, the proposed DRL method yields effective policies that directly improve upon the benchmark. This significantly reduces the number of required iterations, thereby saving costs. Moreover, by comparing with a traditional solution by decomposition, we show that it is cost-effective to share resources over the network.
Future research directions include expanding the proposed framework to include logistical and asset-maintainer constraints, possibly as a learning objective. The assumption of geometrically distributed degradation transition times remains and underlies the tractability of some of the instances; this assumption can be relaxed within the $K$-DTMPA framework, which we leave for future work. The bottleneck of the adopted DRL approach to solving large-scale $K$-DTMPA instances is the vast number of samples required, which may be reduced using more sophisticated neural network architectures or training algorithms. The DRL approach can be extended to also optimize $K$-DTMPA instances for other performance metrics besides the discounted cost criterion, e.g., the average cost criterion.

Supplementary material
Appendix A. Degradation matrices
The following degradation matrices are adopted from da Costa et al.

provide an effective dynamic ambulance redeployment algorithm implementing a neural network trained to score the waiting locations. Da Costa et al. (2023) integrate maintenance and dispatching decisions in a holistic DTMPA framework, including uncertainty in the acquired information in the form of three information levels. They propose a wide range of heuristic solution approaches and a DRL algorithm to optimize long-term discounted costs.

Figure 1: (Figure best viewed in color.) Visualization of the $K$-DTMPA model for an asset network of $M = 8$ machines serviced by $K = 3$ maintenance engineers. Blue dots on top of machine nodes indicate that the machine is healthy, orange dots that it is alerted and red dots that it is down. The engineers are colored cyan, green and purple and are located at Amsterdam, Maastricht and Utrecht, respectively. At discrete decision epochs, engineers can either: (i) idle/continue, (ii) travel to another location or (iii) start maintenance at the current location.

with a minor abuse of notation. Here, $x_m \in N_m$ represents the degradation state of asset $m \in \mathcal{M}$; $\ell_k \in \mathcal{M}$, $k \in K$, denotes the location of the $k$-th engineer; $\iota_k \in I = \{0, 1\}$ indicates whether this engineer is currently carrying out maintenance

Figure 2: Visualization and description of the three envisioned policy aspects. (a) Failure: When assets fail, available engineers must be dispatched efficiently. (b) Alert: When an alert is issued, the decision-maker must conduct a risk urgency assessment to decide whether to dispatch an engineer. (c) Repositioning: Idle engineers are proactively repositioned to be closer to future alerts and failures.

Figure 3: Visualization of the simulation-based policy $\pi^+$. In state $h^{a^{k-1}}$, the policy prescribes to follow the action $\tilde{a} \in U_k(h^{a^{k-1}})$ (recall that $U_k(h^{a^{k-1}}) \subseteq \{u_m\}_{m=1}^{M} \cup \{v\}$) that minimizes the average undiscounted trajectory cost $\hat{q}_{\pi_0}(h^{a^{k-1}}, \tilde{a})$. Unbiased estimates of the action-value function are computed from $r$ independent roll-out simulations whose length follows a geometric distribution with parameter $1 - \gamma$.

API relies on supervised learning to train neural networks. In our context, supervised learning finds a relation between the actions $\tilde{a} \in U_k(h^{a^{k-1}})$ as taken by the simulation-based policy $\pi^+$ and the feature representation $f(h^{a^{k-1}})$ of the state $h^{a^{k-1}} \in H$. Specifically, we employ a multilayer perceptron consisting of $L \in \mathbb{N}$ layers. In each layer $l \in \{1, \ldots, L\}$, an affine transformation of the input is combined with a nonlinear activation function. For our experiments, we adopt a standard supervised learning algorithm, cf. Temizöz et al. (2023, Appendix A). For learning the parameters $\theta$, we split the data set $D$ into a training set and a test set. The training loss $L(\theta)$ measures the "distance" between $\pi^+$ and the neural network policy $\pi_\theta$ for the states in the training set.

The hospital case is appropriate since hospital equipment includes medical imaging and image-guided therapy systems. Such systems are associated with high costs, and manufacturers of such systems increasingly seek to avoid unplanned downtime via remote monitoring. The corresponding travel time matrices are constructed using the four-digit zip codes found in the 2021 hospital inventory data set published by the Dutch National Institute for Public Health and the Environment (RIVM). The partial zip codes are converted to GPS coordinates using the 4PP data set maintained by the Dutch Key Register Addresses and Buildings (Basisregistratie Adressen en Gebouwen).

Figure 4: (Figure best viewed in color.) The Dutch academic hospitals with the corresponding travel time matrix $\Theta$ in quarters. The engineers are colored cyan, green and purple and are located in Amsterdam, Maastricht and Rotterdam, respectively. For the decomposition heuristic, appropriate clusters are constructed using K-means clustering; locations within the respective clusters of engineers are colored accordingly.

Figure 5: The subset of Dutch city hospitals with the corresponding degradation matrices Q3 and Q4. The color-coding is as in Figure 4; the additional engineers are colored brown and yellow and are located in Arnhem and Groningen, respectively. For the decomposition heuristic, appropriate clusters are constructed using K-means clustering; locations within the respective clusters of engineers are colored accordingly.

Figure 6: (Figure best viewed in color.) Policy aspects of the DCL improved reactive dispatching heuristic for the dispatching & repositioning instance M8K3-Q1C1. The color-coding is as in Figure 5. The labels pass, move, pm and cm correspond to wait, move to another location and preventive/corrective maintenance actions. Blue dots on top of machine nodes indicate that the machine is healthy, orange dots that it is alerted and red dots that it is down.

Figure 7: Policy aspects of the DCL improved reactive dispatching heuristic for the dispatching & repositioning instance M35K5-Q3C1. The color-coding is as in Figure 6. (a) Tactical repositioning from Eindhoven to Tilburg. (b) Tactical repositioning from Tilburg to Breda.

Figure 8: Policy aspects of the DCL improved greedy dispatching heuristic for the preventive maintenance instance M8K3-Q2C3. The color-coding is as in Figure 6. (a) The trained agent proactively dispatches engineers to alerted locations. (b) The trained agent performs preventive maintenance when there are many events in the network. (c) The trained agent tactically repositions engineers when there are distant events in the network.

Figure 9: Policy aspects of the DCL improved reactive dispatching heuristic for the preventive maintenance instance M35K5-Q4C3. The color-coding is as in Figure 6. (a) The trained agent proactively moves engineers to alerted locations, ensuring a large coverage of the network. (b) The trained agent prioritizes moving to and starting maintenance on assets in the failed state.
Compared with da Costa et al. (2023), our experiments also include cases with an underlying geographical nature, which are more challenging than the unit-distance cases considered there. In the $K$-DTMPA framework, independent degradation processes are observed in real-time by a central decision-maker. The decision-maker has access to perfect degradation information to decide on joint cost-effective dispatching, maintenance and repositioning actions for all available engineers. To solve the problem, we adopt a deep reinforcement learning (DRL) approach based on approximate policy iteration, more specifically, deep controlled learning (DCL). To successfully apply DCL to $K$-DTMPA instances, we propose several new ideas: actions for the engineers are selected sequentially and the feature design is tailored to each individual engineer. Moreover, we use policies tailored to the problem to kickstart DCL. This enables us to use expert knowledge without the requirement to penalize deviations from the expert-crafted policy (like De Moor et al. (2022)).

Deep reinforcement learning hyperparameters
We list the choice of hyperparameters for the training algorithm per $K$-DTMPA instance. All neural networks utilize the rectified linear unit (ReLU) activation function.

Table 2: One-step policy improvement results for the dispatching & repositioning instance M8K3-Q1C1, highlighting the variability of the trained neural network when varying the feature design $f(h)$.

Table 5: One-step policy improvement results for the dispatching & repositioning instance M8K3-Q1C1. Note that the results for the reactive dispatching heuristic are obtained using a data set $D$ that is smaller than the data set used for the experiments in Table 2.

Table 7: One-step policy improvement results for the preventive maintenance instance M8K3-Q2C3.

Table 8: One-step policy improvement results for the preventive maintenance instance M35K5-Q4C3.

Table C.1: Approximate policy iteration hyperparameters for all single maintainer instances, i.e., M4K1-Q2Q3C2, M6K1-Q2Q3Q4C2, and the 1-DTMPA instances induced by the clusters when training the decomposition heuristic.
Table C.2: Approximate policy iteration hyperparameters for the dispatching & repositioning instance M8K3-Q1C1.
Table C.3: Approximate policy iteration hyperparameters for the preventive maintenance instance M8K3-Q2C3.
Table C.4: Approximate policy iteration hyperparameters for the dispatching & repositioning instance M35K5-Q3C1. Note that the data for the second generation is collected using fewer roll-outs.
Table C.5: Approximate policy iteration hyperparameters for the preventive maintenance instance M35K5-Q4C3.