Online reinforcement learning for condition-based group maintenance using factored Markov decision processes

We investigate a condition-based group maintenance problem for multi-component systems


Introduction
Maintenance serves as an essential measure to improve system reliability, sustain system operations, and reduce operating costs in various industries, including energy generation, manufacturing, and transportation, among others. From the modelling perspective, maintenance models can be classified into two categories: maintenance models for single-unit systems and those for multi-component systems (de Jonge & Scarf, 2020). In the reliability and maintenance field, numerous studies have been devoted to single-unit systems, while maintenance problems for multi-component systems are generally much more complex, owing to the presence of multiple components as well as various dependencies among them. In particular, dependencies among components are typically categorized into three types in the literature (Olde Keizer et al., 2017a): structural dependence (i.e., maintenance of a certain component requires the dismantling or maintenance of other components because of the system's physical structure), stochastic dependence (i.e., the degradation or failure process of a certain component affects those of other components), and economic dependence (i.e., combining maintenance activities for multiple components incurs a cost different from that of maintaining the components separately). Meanwhile, advances in sensing and monitoring technologies have driven a shift in the maintenance paradigm towards condition-based maintenance (CBM). Within the framework of CBM, optimal maintenance decisions are determined based on observed health conditions, obtained through either discrete inspection or continuous monitoring, of the systems and/or their components (Ahmad & Kamaruddin, 2012). However, developing a condition-based group maintenance policy based on component health conditions is rather complicated given the dependencies among the components.
Markov decision process (MDP), a well-known stochastic control process, has been widely used to model CBM problems where a system is represented by a set of states that evolve randomly (Gámiz et al., 2023; Liu et al., 2021). MDP is an effective and flexible modelling tool for single-unit systems in the sense that it is able to evaluate and optimize maintenance policies over either a finite or an infinite horizon. However, for a multi-component system, any system state is a combination of the states of all the components. This leads to the so-called curse of dimensionality (i.e., the exponential explosion of the number of system states). Even for systems with a moderate number of components (say, 15 to 20), traditional MDPs suffer from notorious computational complexity and cannot be directly applied. To alleviate this issue, the factored MDP (FMDP) model was developed to represent large MDPs with factored structures (Talebi et al., 2021). In particular, an FMDP separates the transitions and costs into counterparts defined on small sets of elements in the state vector, which can reduce the computational complexity of determining an optimal policy. FMDP is a promising approach to solving the group maintenance problem for multi-component systems, since the degradation of a component might depend only on a small cluster of ''neighbouring'' components. In this case, the maintenance cost can also be decomposed as the sum of the costs related to individual or small sets of components (Zhou et al., 2018).
In practice, some large-scale, multi-component systems can be decomposed into a number of locally interactive components and thus modelled by an FMDP. A typical example is the maintenance of railway tracks. A railway network consists of thousands of tracks, among which the tracks in a specific area are interdependent in the sense that they operate in a common environment and under similar traffic loads. In this sense, the tracks are subject to location-based stochastic dependence; that is, neighbouring tracks are expected to have a higher level of dependence than distant ones (Brown et al., 2022). When conducting maintenance activities on railway tracks, decision makers have to consider the effect of such dependencies so as to improve maintenance efficiency. Another example is the maintenance of machines in a production line. In modern manufacturing scenarios, multiple machines in a production line work collectively to manufacture a product. Degradation of a certain machine reduces the quality of the parts produced, which might further affect the degradation of downstream machines. For example, the degradation process of cutting tools is affected by defective materials/parts from the previous production cell.
In reality, such an effect usually diminishes with the ''distance'' between machines. From the modelling perspective, this dependence relationship among components of a complex system can be modelled by a dynamic Bayesian network or its variants (Guestrin et al., 2003).
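To make the factored representation concrete, the following sketch (in Python; the component count, parent sets, and degradation rates are illustrative assumptions, not values from the paper) evaluates a joint transition probability as a product of local factors over parent sets, as a dynamic Bayesian network would encode it:

```python
import itertools

def joint_transition_prob(s_next, s, parents, local_probs):
    # Factored transition: the probability of the joint next state is the
    # product of per-component local factors, each conditioned only on the
    # current states of that component's parent set.
    p = 1.0
    for i, pa in enumerate(parents):
        key = tuple(s[j] for j in pa)
        p *= local_probs[i][key][s_next[i]]
    return p

# Toy instance: 3 components with 2 health states (0 = degraded, 1 = good);
# the parents of each component are itself and its left neighbour.
parents = [(0,), (0, 1), (1, 2)]

def make_factor(pa_size):
    # Hypothetical local factor: a component is more likely to end up
    # degraded when any of its parents is already degraded.
    table = {}
    for key in itertools.product(range(2), repeat=pa_size):
        p_degraded = 0.7 if 0 in key else 0.3
        table[key] = {0: p_degraded, 1: 1.0 - p_degraded}
    return table

local_probs = [make_factor(len(pa)) for pa in parents]

s = (1, 1, 0)
total = sum(joint_transition_prob(s_next, s, parents, local_probs)
            for s_next in itertools.product(range(2), repeat=3))
print(round(total, 10))  # → 1.0 (the factored probabilities sum to one)
```

Because each factor is a proper conditional distribution, the product over components is again a distribution over the joint next state, while storing only small local tables.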
A practical issue to be considered when applying FMDPs to the group maintenance problem is that the transition probabilities are generally unknown a priori. An implicit assumption adopted by most of the literature is that model parameters are fully known to the decision maker; this assumption is, unfortunately, not realistic in most practical scenarios, as the parameters often need to be estimated from historical data or suggested by domain experts. In this work, focusing on group maintenance for multi-component systems subject to economic and stochastic dependencies, we tackle this issue by developing an online reinforcement learning algorithm to simultaneously learn the transition probabilities and determine an optimal maintenance policy. To the best of our knowledge, this is the first work that presents an online reinforcement learning algorithm for CBM problems. In particular, the algorithm is devised by extending the existing factored Model-Based Interval Estimation (fMBIE) approach to an online learning scenario. The performance of this algorithm is investigated both theoretically and numerically.
We summarize the contributions of this work in the following four aspects:
• We formulate a condition-based group maintenance problem for multi-component systems with stochastic and economic dependencies as an FMDP.
• We develop a modified factored value iteration algorithm to improve computational efficiency in calculating an optimal policy.
• We develop a novel online learning algorithm to simultaneously learn the parameters and the dependence relationship and optimize the maintenance policy.
• We theoretically prove a bound on the errors of the proposed online learning approach against the nominal model.
The remainder of the paper is organized as follows. Section 2 reviews relevant literature on CBM for multi-component systems and reinforcement learning for FMDPs. Section 3 formally describes the condition-based group maintenance problem and formulates it as an FMDP. For the nominal model with known parameters, Section 4 presents a modified factored value iteration algorithm to compute an optimal maintenance policy. Further considering the scenario in which model parameters are unknown a priori, Section 5 presents an online learning algorithm to learn the model parameters and evaluates the performance of this algorithm. Section 6 conducts numerical experiments to validate the developed approaches. Finally, Section 7 concludes the paper and suggests future research topics. All proofs can be found in Appendix A.

Related literature
We discuss two major research streams that are related to our work: (i) CBM for multi-component systems and (ii) reinforcement learning for FMDPs.

CBM for multi-component systems
Though most CBM-related studies focus on single-unit systems (see, e.g., Chen et al., 2015; Deep et al., 2023; Drent et al., 2023; Elwany et al., 2011; Khaleghei & Kim, 2021), CBM problems for multi-component systems have attracted increasing attention in the literature (see, e.g., Olde Keizer et al., 2016; Tian & Liao, 2011; Zhu & Xiang, 2021). One can refer to Olde Keizer et al. (2017a) for a comprehensive overview of CBM for multi-component systems, in which CBM policies are classified in terms of dependence type.
We confine our attention to MDP-based CBM studies for multi-component systems. Sun et al. (2017) develop a CBM model for multi-component systems with identical and independent components. Liu et al. (2021) investigate a CBM problem for two-component systems with heterogeneous and dependent components. They formulate the problem as a finite-horizon MDP and characterize the optimal preventive maintenance curve in terms of the components' degradation levels. Barlow et al. (2021) propose a performance-centred approach to optimizing maintenance of complex systems with multiple components and adopt a reinforcement learning algorithm to solve the complex optimization problem. Hoffman et al. (2022) develop an online improvement approach for CBM with Monte Carlo tree search. They develop a two-stage approach that first optimizes the static CBM policy and then uses Monte Carlo tree search to further improve the static policy. Olde Keizer et al. (2017b) develop a joint CBM and inventory model to reduce maintenance and inventory losses by optimizing maintenance and spare-ordering decisions based on the components' conditions. Wang and Zhu (2021) propose a joint CBM and inventory model that determines the optimal decisions based on the number of components in each degradation state instead of the condition of each individual component. Zheng et al. (2023) jointly optimize CBM and spare provisioning for a $k$-out-of-$n$ system, where system degradation states are revealed upon periodic inspections that trigger the opportunities to replace components and order spare parts. When using MDPs to model CBM for multi-component systems, a common yet burdensome issue is the computational complexity, since the number of system states increases exponentially with the number of components.
FMDP is an effective modelling approach to compactly representing the stochastic nature of degrading systems, which is, to some extent, helpful for resolving the curse of dimensionality. However, studies on FMDP-based maintenance optimization are rather limited. Zhou et al. (2016) formulate the maintenance problem of a multi-component system as an FMDP and develop an improved approximate linear programming algorithm to solve it. Zhou et al. (2018) further investigate maintenance optimization of a series production system and develop a multi-agent FMDP model in which different agents cooperate to select maintenance actions. Kıvanç et al. (2022) employ a factored partially observable MDP model to investigate the maintenance problem of a regenerative air heater system subject to stochastic and economic dependencies. Nevertheless, the aforementioned research implicitly assumes that model parameters and dependence structures are known to the decision maker in advance. This assumption is, unfortunately, not realistic, as the parameters and structures usually need to be estimated. To address this issue, this work aims to develop an online learning algorithm for multi-component systems modelled by an FMDP to jointly learn the parameters and the dependence structure while determining optimal maintenance actions.

Reinforcement learning for FMDPs
In the literature, several attempts have been made at reinforcement learning for FMDPs. Kearns and Koller (1999) present an efficient and near-optimal algorithm for reinforcement learning in an FMDP framework, where the structure is modelled by a dynamic Bayesian network. Sallans and Hinton (2004) propose a novel method to approximate the value function and select actions for MDPs with large state and action spaces. The approach enables determining actions in large factored action spaces via Markov chain Monte Carlo sampling. Degris et al. (2006) develop a general framework that integrates FMDP-based incremental planning algorithms with supervised learning techniques to build structured representations of the reinforcement learning problem. Strehl (2007) extends the work of Kearns and Koller (1999) by employing the Interval Estimation approach for exploration, which outperforms traditional algorithms on most domains. Strehl et al. (2007) propose an efficient reinforcement learning algorithm to learn the unknown dynamic Bayesian network structure of an FMDP. Mahadevan and Maggioni (2007) develop a novel spectral framework to jointly learn the representations and the optimal policy for MDPs and FMDPs. Szita and Lörincz (2009) develop a factored optimistic algorithm to attain polynomial-time reinforcement learning in FMDPs. The work emphasizes the importance of initialization and proves that suitable initialization can lead to convergence and a polynomial number of steps for near-optimal decisions. Osband and Van Roy (2014) report that it is possible to achieve regret that scales polynomially in the number of parameters encoding an FMDP. In addition, the work presents two algorithms that satisfy near-optimal regret bounds in this setting. Tian et al. (2020) investigate minimax optimal reinforcement learning for episodic FMDPs. By assuming that the factorization is known beforehand, they propose two model-based algorithms that attain minimax optimal regret guarantees. Xu and Tewari (2020) develop oracle-efficient algorithms that achieve tighter regret bounds for non-episodic FMDPs. Deng et al. (2022) design the first polynomial-time algorithm for reinforcement learning in FMDPs that only requires a linear value function with a suitable local basis with respect to the factorization, permitting efficient variable elimination.
Though significant progress has been made on reinforcement learning for FMDPs in a general context, no effort has so far been devoted to the specific condition-based group maintenance problem in which the interactions among components are represented by location-based stochastic dependence. Different from most existing studies, which focus on offline reinforcement learning algorithms, we develop a novel and more efficient online learning algorithm tailored to the condition-based group maintenance problem that simultaneously learns the transition probabilities and the dependence relationship while optimizing the maintenance policy.

The condition-based group maintenance problem
In this section, we formally describe the condition-based group maintenance problem for a multi-component system in Section 3.1 and then formulate it as an FMDP in Section 3.2.

Problem description
We consider the maintenance problem for a multi-component system consisting of $n$ non-identical components, where the health condition of each component deteriorates during operations. In reality, such health conditions vary across systems; examples include crack sizes of railway tracks and charging/discharging rates of batteries. Moreover, the degradation process of a specific component is affected only by its neighbouring components instead of all components; that is, there is a special type of stochastic dependence among the components (Olde Keizer et al., 2017a). The degradation process of the whole system, without interventions, is assumed to follow a Markov chain. To improve system reliability and sustain system operations, periodic inspections are carried out to reveal the system and component states. At each inspection, maintenance actions might be implemented, depending on the observed component states. For each component, we only consider two actions, ''do nothing'' and ''replacement'', while the case involving imperfect maintenance actions of different depths is left for future research. The ''replacement'' action restores a component to an as-good-as-new state. The time required to complete a replacement action is assumed to be negligible compared with the inspection interval, which is a common assumption in maintenance studies (see, e.g., Drent et al., 2023; Liu et al., 2021). We consider that maintenance on a group of components can reduce costs compared with separate maintenance on individual components, corresponding to the so-called economic dependence (Zhao et al., 2022). However, due to the limited availability of maintenance crews, only a subset of components can be maintained at one time, reflecting the resource dependence among the components (Olde Keizer et al., 2017a).
In this problem, the transition of a component's state between any two successive inspections can be caused by maintenance at the former inspection and/or natural degradation during the interval. At inspection, the component state is determined immediately by the action (replacement or do nothing) applied. Specifically, if the replacement action is taken, the component is instantly restored to an as-good-as-new state, given the negligible replacement duration; otherwise, the component state remains unchanged. Within the interval between two successive inspections, each component gradually deteriorates to a worse state during operations. Because of the location-based stochastic dependence, the degradation process of each component is affected only by its neighbouring components.
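The two-stage transition described above can be sketched as follows; the four-level health scale, the neighbour-dependent drop probabilities, and the rate constants are hypothetical placeholders introduced for illustration:

```python
import random

NEW = 3  # hypothetical as-good-as-new health level (higher = healthier)

def maintenance_step(s, a):
    # Immediate effect of the action at an inspection epoch: a_i = 1
    # (replacement) restores component i to as-good-as-new, while
    # a_i = 0 (do nothing) leaves its state unchanged.
    return tuple(NEW if ai == 1 else si for si, ai in zip(s, a))

def degradation_step(s_bar, rng):
    # Natural degradation over one inspection interval: each component may
    # drop one health level, with a probability that grows when its left
    # neighbour is degraded (a stand-in for location-based dependence).
    out = []
    for i, si in enumerate(s_bar):
        neighbour = s_bar[i - 1] if i > 0 else si
        p_drop = 0.15 + 0.05 * (NEW - neighbour)  # illustrative rates
        out.append(si - 1 if si > 0 and rng.random() < p_drop else si)
    return tuple(out)

rng = random.Random(42)
s, a = (1, 3, 2), (1, 0, 0)
s_bar = maintenance_step(s, a)
print(s_bar)  # → (3, 3, 2): only component 0 is replaced
print(degradation_step(s_bar, rng))
```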
Our objective is to determine an optimal maintenance policy so that the total discounted long-run cost is minimized. In doing so, we first study the optimal maintenance policy in a scenario where the degradation parameters (i.e., the transition probabilities of the Markov chain and the parameters associated with the stochastic dependence) are known in advance, and then extend our attention to a more typical online maintenance scenario where the parameters are unknown a priori.

Model formulation
Based on the previous description, we now formulate the condition-based group maintenance problem as an FMDP. As mentioned earlier, an FMDP separates the transitions and costs into counterparts defined on small sets of elements in the state vector. As a result, an FMDP can reduce the computational complexity of determining an optimal maintenance policy for multi-component systems.
Let  ≜ {1, … , } be the set of all components.Degradation of the components is described by a controlled Markov process (, ,  ).The term ''controlled'' means that the degradation paths of the components can be influenced by the decision maker through maintenance actions.Specifically,  ≜ {1, … , }  is the set of all possible state vectors of all components.For a state vector  = ( 1 , … ,   ) ∈ ,   is the state of an individual component  ∈ ; a higher value of state indicates a healthier condition. ⊂ {0, 1}  is the action set with  = ( 1 , … ,   ) being a generic element thereof.In particular,   represents the action taken on component  ∈ , with   = 1 and   = 0 representing ''replacement'' and ''do nothing'', respectively.We impose a restriction upon  to reflect the limited availability of maintenance resources, particularly the crews executing maintenance actions.At each inspection, due to the limited maintenance crews, only a portion of components can be maintained; specifically, for all  = ( 1 , … ,   ) ∈ , we impose ∑ ∈   ≤ , where  is a pre-specified constant satisfying  ≪ .It is worth noting that in extreme cases  can be as big as , especially when the system scale is not large.Nonetheless, our approach well accommodates the case of  = .
Furthermore, $P$ is the controlled transition matrix of the states, where $P(s' \mid s, a)$ is the transition probability from state $s \in \mathcal{S}$ to state $s' \in \mathcal{S}$ under action $a \in \mathcal{A}$. As discussed earlier, the transition of a component's state between two successive inspections can be attributed to maintenance at inspection and/or natural degradation during the inspection interval. Specifically, suppose that action $a$ is taken at inspection. Then, the state of each component $i \in \mathcal{N}$ makes an immediate transition according to $\{P^M_i(\bar{s}_i \mid s_i, a_i)\}$, where $P^M_i(\bar{s}_i \mid s_i, a_i)$ characterizes the transition probability, for component $i$, from state $s_i$ to state $\bar{s}_i$ under action $a_i$. In particular, when the ''do nothing'' action is taken (i.e., $a_i = 0$), the component state remains unchanged, i.e., $P^M_i(\bar{s}_i \mid s_i, 0) = \mathbb{1}\{\bar{s}_i = s_i\}$; when the ''replacement'' action is taken (i.e., $a_i = 1$), the component is restored to the as-good-as-new state, i.e., $P^M_i(\bar{s}_i \mid s_i, 1) = \mathbb{1}\{\bar{s}_i = m\}$. Upon the next inspection, the system transitions from state $\bar{s}$ right after the previous inspection and maintenance, if any, to a new state $s'$ according to $P^D(s' \mid \bar{s})$, which reflects the transition of the system state under natural degradation during the inspection interval. We consider that degradation of each component is affected only by a small set (relative to $\mathcal{N}$) of its neighbouring components, which can be represented in a rigorous way as follows: $P^D(s' \mid \bar{s}) = \prod_{i \in \mathcal{N}} P^D_i(s'_i \mid \bar{s}_{\Gamma_i})$, where $\Gamma_i \subseteq \mathcal{N}$ is the set of components whose states affect the degradation of component $i$, with $k_i \triangleq |\Gamma_i|$, and $P^D_i(\cdot \mid \bar{s}_{\Gamma_i})$ is the local state transition probability of a single component $i$ given the states of $\Gamma_i$. Here and thereafter, $\bar{s}_{\Gamma_i}$ denotes the subvector of $\bar{s}$ restricted to $\Gamma_i$. Following the paradigm of dynamic Bayesian networks, we refer to each $\Gamma_i$ as the parent set of the $i$th component.
Combining $P^M_i$ and $P^D_i$, $\forall i \in \mathcal{N}$, the transition function $P$ can be represented by $P(s' \mid s, a) = \prod_{i \in \mathcal{N}} P^D_i(s'_i \mid \bar{s}_{\Gamma_i})$, where $\bar{s}_i = m$ if $a_i = 1$ and $\bar{s}_i = s_i$ otherwise. Applying an action also incurs a maintenance cost. For each component $i \in \mathcal{N}$, a marginal cost $c_i(s_i, a_i)$ is incurred, depending on the component's state and the action taken. However, when multiple components are maintained simultaneously, there might be significant economic dependence between adjacent components. To be specific, if two components are adjacent (say, $i$ and $i+1$), then the cost of maintaining them together should be less than that of maintaining them individually. To characterize such an effect, we define a penalty cost on top of the marginal maintenance costs of the individual components, and the total cost is defined as the combination of both. In particular, we say that components $i_1 \in \mathcal{N}$ and $i_2 \in \mathcal{N}$ are adjacent if $|i_1 - i_2| = 1$. For any nonempty subsets $\mathcal{N}_1 \subset \mathcal{N}$ and $\mathcal{N}_2 \subset \mathcal{N}$, we say that $\mathcal{N}_1$ and $\mathcal{N}_2$ are attached if $\mathcal{N}_1 \neq \mathcal{N}_2$ and there exist components $i_1 \in \mathcal{N}_1$ and $i_2 \in \mathcal{N}_2$ such that $i_1$ and $i_2$ are adjacent; otherwise, we say that $\mathcal{N}_1$ and $\mathcal{N}_2$ are detached. For any $a \in \mathcal{A}$, let $\{i \in \mathcal{N} : a_i = 1\}$ be the set of components that need active maintenance under action $a$; further suppose that $\{i \in \mathcal{N} : a_i = 1\}$ is the union of exactly $\eta(a)$ mutually detached sets of components, none of which can be further separated into multiple detached sets. We can then define a cost function $g : \mathcal{A} \to \mathbb{R}_+$ as $g(a) = c_g \cdot \eta(a)$, where $c_g$ is an instance-free constant. We let $c(s, a) \triangleq \sum_{i \in \mathcal{N}} c_i(s_i, a_i) + g(a)$ denote the total cost of applying action $a$ when the state vector is $s$. This formulation implies that maintenance crews can maintain detached sets within the capacity constraint $K$; however, a penalty cost $g(a)$ is incurred to reflect the additional effort needed to maintain detached sets that are distanced.
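The cost structure can be sketched as follows; the marginal cost function `mc` and the penalty constant are hypothetical placeholders introduced for illustration:

```python
def num_detached_groups(a):
    # eta(a): the number of maximal runs of adjacent components selected
    # for maintenance; each additional detached group incurs a penalty.
    groups, prev = 0, 0
    for ai in a:
        if ai == 1 and prev == 0:
            groups += 1
        prev = ai
    return groups

def total_cost(s, a, marginal_cost, penalty=5.0):
    # c(s, a): per-component marginal costs plus the detached-group penalty
    return (sum(marginal_cost(si, ai) for si, ai in zip(s, a))
            + penalty * num_detached_groups(a))

# Hypothetical marginal cost: replacing a more degraded component costs more.
mc = lambda si, ai: ai * (10.0 - si)
s = (2, 2, 2, 2, 2)
a_grouped = (1, 1, 0, 0, 0)     # one contiguous group
a_scattered = (1, 0, 1, 0, 0)   # two detached groups
print(total_cost(s, a_grouped, mc))    # → 21.0
print(total_cost(s, a_scattered, mc))  # → 26.0
```

The two candidate actions replace the same number of components, yet scattering the replacements costs more because it creates an extra detached group, which is exactly how the penalty encodes economic dependence between adjacent components.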
Denote  ∶  →  as a policy and  as the set of all such policies.At each inspection epoch  = 1, 2, …, the decision maker observes a state vector   and takes an action   = (  ) that incurs cost (  ,   ).In the context of sequential decision making, the long-term performance of a policy  given an initial state  ∈  is measured by the following value function: where 0 <  < 1 is the discount factor.The objective is to find an optimal policy  * that minimizes the value function, namely, The optimal policy can be evaluated through the classical value iteration approach (Puterman, 2014).In particular, the optimal value function under  * , denoted by  * for brevity, can be evaluated by (2) In addition, the optimal policy can be represented in a compact way using the state-action value function  * ∶  ×  of  * defined as and  * () = arg min ∈  * (, ) for any  ∈ .
Though the optimal policy can be numerically calculated by Eq. (2), the dynamic programme suffers from high computational complexity due to the curse of dimensionality and is thus considered computationally intractable even for moderate-scale problems. Moreover, in many real-world maintenance problems there can be hundreds of components, and the exponentially large state space prohibits a feasible computation process. In what follows, we take advantage of crucial structural properties of the problem to design an effective algorithm that significantly reduces the computational difficulty.

Modified factored value iteration algorithm
The factorization properties of an FMDP, if well exploited, can significantly reduce the computational complexity of determining an optimal policy. Szita and Lörincz (2008) propose the factored value iteration (FVI) approach by combining the factorization properties with classical value iteration (Guestrin et al., 2003). FVI has been proven to be an efficient method for FMDPs. In this section, we first modify the original FVI approach to construct an efficient planning algorithm that can obtain an approximately optimal policy. We then prove the efficiency of the constructed algorithm from the perspectives of both computational complexity and approximation bias.
The FVI approach first approximates the real value function by a linear combination of a set of basis functions $h_j : \mathcal{S} \to \mathbb{R}$, $j = 1, \dots, J$, where each $h_j$ is specified such that it relies only on a small set of elements in $s$. That is, for all $j = 1, \dots, J$, there exists $Z_j \subset \{1, \dots, n\}$ such that $h_j(s)$ relies only on $s_{Z_j}$, $\forall s \in \mathcal{S}$. For notational convenience, we use $h_j(s_{Z_j})$ to characterize this dependence. Define $H$ as an $|\mathcal{S}| \times J$ matrix with entries $H_{s,j} = h_j(s_{Z_j})$ for all $s \in \mathcal{S}$ and $j = 1, \dots, J$. The objective of FVI is to determine a weight vector $w \in \mathbb{R}^J$ such that $Hw$ is close to $V^*$ under some metric. For any $a \in \mathcal{A}$, let $c_a$ be an $|\mathcal{S}|$-dimensional cost vector with entries $c_a(s) = c(s, a)$ and $P_a$ be an $|\mathcal{S}| \times |\mathcal{S}|$ transition matrix with entries $P_a(s, s') = P(s' \mid s, a)$. The optimal weight $w^*$ can be obtained by solving the following fixed-point problem: $w^* = G \min_{a \in \mathcal{A}} \left\{ c_a + \gamma P_a H w^* \right\}$, (3) where $G$ is a linear operator from the space of value functions to the linear space $\mathcal{H}$ spanned by $h_1, \dots, h_J$, represented by a matrix in $\mathbb{R}^{J \times |\mathcal{S}|}$ that satisfies the following non-expansion property: $\|H G v\|_\infty \le \|v\|_\infty$ for all $v$. Then, a sampling technique is used to cope with the computational complexity caused by the scales of the matrices in Eq. (3), which are $O(|\mathcal{S}|)$ and can still be prohibitively large.
We sample a subset of state vectors $\hat{\mathcal{S}}$ from $\mathcal{S}$ and confine the calculation to $\hat{\mathcal{S}}$, to formulate an approximation of $w^*$. It is apparent that the sample size $|\hat{\mathcal{S}}|$ influences the efficiency of the approximation. Though there is no universal approach to specifying the sample size across different problems, there are some routines to follow in practice. An important one is that the sample size should be sufficiently large while remaining polynomial in $n$, so that approximation accuracy and computational efficiency are well balanced. We denote by $\hat{G}$, $\hat{c}_a$, $\hat{H}$, and $\hat{B}_a$ the sub-matrices of $G$, $c_a$, $H$, and $P_a H$, respectively, with rows (columns, for $G$) corresponding to $\hat{\mathcal{S}}$. An approximation of $w^*$ (i.e., $\hat{w}^*$) can be evaluated by $\hat{w}^* = \hat{G} \min_{a \in \mathcal{A}} \left\{ \hat{c}_a + \gamma \hat{B}_a \hat{w}^* \right\}$. (4) Since the scales of $\hat{H}$ and $\hat{B}_a$ are both polynomial in $n$, Eq. (4) becomes computationally tractable. The procedure for producing an approximately optimal policy is summarized in Algorithm 1.
We now derive a bound on the approximation error, in terms of a bound on the difference between the value function of the optimal policy (i.e., $V^*$) and that of the approximation (i.e., $H\hat{w}^*$). To this end, we make Assumption 1 on $G$. Under this assumption, we shall show in Theorem 1 that the bias can be bounded using only a sampled set $\hat{\mathcal{S}}$ with a carefully determined size that is polynomial in $n$.
Assumption 1. $G$ can be separated as $G = \sum_{j=1}^{J} G_j$, where each $G_j$ is a $J \times |\mathcal{S}|$ matrix with scope $Z_j$; namely, each row of $G_j$, considered as a function on $\mathcal{S}$, relies only on $s_{Z_j}$.
We denote by $\pi_0$ the greedy policy induced by $\hat{w}^*$, and by $c_{\pi_0}$ and $P_{\pi_0}$ the cost function and transition matrix induced by $\pi_0$. Further define the matrix $B_{\pi_0} \triangleq P_{\pi_0} H$, in analogy with $B_a \triangleq P_a H$. Following this definition, Eq. (3) can be rewritten as $w^* = G\left(c_{\pi_0} + \gamma B_{\pi_0} w^*\right)$. Moreover, we can separate the matrix $B_{\pi_0}$ as $B_{\pi_0} = \sum_{j=1}^{J} B_{\pi_0, j}$, (5) where, for any $j = 1, \dots, J$, $B_{\pi_0, j}$ is the product of $P_{\pi_0}$ with an $|\mathcal{S}| \times J$ matrix that keeps the $j$th column of $H$ and sets all other entries to 0. It is easy to verify that each $B_{\pi_0, j}$ is a local-scope matrix with scope $\cup_{i \in Z_j} \Gamma_i$. Under Assumption 1 and using Eq. (5), the following theorem provides a performance guarantee for Algorithm 1.
Theorem 1. For any $0 < \delta < 1$ and $\epsilon > 0$, there exists a sample-size threshold such that, whenever the size of $\hat{\mathcal{S}}$ exceeds it, with probability at least $1 - \delta$ we have $\| V^{\pi_0} - V^* \|_\infty \le \epsilon$.
Theorem 1 provides a lower bound on the sample size $|\hat{\mathcal{S}}|$ that guarantees that Algorithm 1 approximates the true optimal value function $V^*$ (and the action-value function $Q^*$) well enough. It is worth noting that the lower bound is not necessarily polynomial in $n$, because it still relies on $\max_a \|B_a\|_\infty$ and $\max_j \|B_{\pi_0, j}\|_\infty$, whose matrices have scale $|\mathcal{S}| \times J$. This issue can be addressed by carefully choosing the matrix $H$. Nevertheless, Algorithm 1 is based on classical techniques for high-dimensional MDPs, and specifying an appropriate $H$ can be flexible yet challenging in different problems. One can follow some general routines for choosing $H$. For example, one may choose $H$ such that the value of each $h_j$ relies only on a very limited number of elements of the state vector. In our simulation study, we define each $h_j$ as a categorical function on a single component. Existing research has shown that such basis functions deliver high computational efficiency and low approximation error. In Section 6, we shall show that this choice of $H$ ensures that $\max_a \|B_a\|_\infty$ and $\max_j \|B_{\pi_0, j}\|_\infty$ are delicately bounded, so that $|\hat{\mathcal{S}}|$ is small enough and Algorithm 1 is computationally tractable when $\max_{i \in \mathcal{N}} k_i$ and $\max_{j=1,\dots,J} |Z_j|$ are relatively small. In the subsequent sections, we always adopt a projection matrix $G$ that satisfies the restrictions discussed above.

An online learning perspective
The modified FVI algorithm developed above relies on an implicit assumption that the state transitions of the system are fully known to the decision maker, which is usually not the case in real applications. In this section, we tackle the condition-based group maintenance problem of interest from an online learning perspective in which the exact transitions are unknown a priori. In this setting, the decision maker determines an optimal maintenance policy upon the arrival of new inspection data, while simultaneously learning the true model parameters from historical observations. This leads to the so-called exploration-exploitation tradeoff in reinforcement learning (Xu et al., 2021).
In the online group maintenance problem, we need to learn both the transition probabilities of the components and the structure $\{\Gamma_i\}_{i \in \mathcal{N}}$ of the transitions. We thus introduce an additional assumption on the structure. Specifically, we assume that a uniform upper bound $\bar{k}$ on $\{k_i\}_{i \in \mathcal{N}}$, instead of their true values, is known; that is, we have $\bar{k} \ge \max_{i \in \mathcal{N}} k_i$. This implies that the decision maker has crude knowledge of the maximum number of neighbouring components that can affect the degradation of any specific component, which is fairly reasonable in practical scenarios. Such an upper bound can be established by expert judgement or estimated from historical data, if available.
Developing learning algorithms for FMDPs under an unknown transition structure has been an active research topic. A recent and significant advance was made by Rosenberg and Mansour (2021), who propose a novel approach to finding the exact positions of parent sets with fixed and known sizes. Our problem setting differs from that of Rosenberg and Mansour (2021) in two aspects. First, in our setting the positions of parent sets are known but the sizes $\{k_i\}_{i \in \mathcal{N}}$ are unknown. Second, our model uses the discounted total cost as the objective function, while Rosenberg and Mansour (2021) focus on the average cost. Nevertheless, their approach provides the foundation upon which modifications can be made to solve our maintenance problem. On the other hand, value iteration-type learning algorithms have been proven to be efficient for discounted MDPs by Strehl (2007), albeit with a known transition structure assumed therein. In particular, Strehl (2007) develops the fMBIE method, which can effectively address the exploration-exploitation tradeoff for model-based reinforcement learning. In this work, we develop an online algorithm to approximate the optimal maintenance policy $\pi^*$ by combining the online approach to learning the transition structure (see Rosenberg & Mansour, 2021) and the fMBIE approach for discounted MDPs (see Strehl, 2007).
To this end, we first introduce a performance metric for online learning algorithms. The objective of an online algorithm is to gradually approximate some optimal policy; therefore, the number of samples (time steps) the algorithm needs to generate a policy sufficiently close to the optimal one is crucial to its performance. In particular, an efficient online algorithm is expected to generate policies whose value functions are -close to that of the optimal policy with high probability, within a time at most polynomial in 1∕ and some other parameters of the underlying model. An online algorithm satisfying this property is called an efficient Probably Approximately Correct (PAC) algorithm. A formal definition of efficient PAC algorithms for FMDPs can be given based on the sample complexity defined below. We assume here that {ℎ  }  =1 and {  }  =1 are specified beforehand.
Definition 1 (Sample Complexity). For any  > 0, the sample complexity of an online algorithm for FMDP (, ,  , , , ) is the number of time steps at which the sequence of policies generated by the algorithm, denoted by {  } ∞ =1 , fails to satisfy    (  ) <   * (  ) + . An online algorithm for FMDP (, ,  , , , ) is called an efficient PAC learning algorithm if, for any  > 0 and 0 <  < 1, the per-step computational complexity and the sample complexity can be bounded by some polynomial in the relevant parameters (1∕, 1∕, 1∕(1−), ,  2+1 , ||) with probability at least 1 − . It should be highlighted that the sample complexity is required to be polynomial in  2+1 , instead of the || =   required for efficient PAC algorithms for general MDPs. This is because an FMDP has a factored structure, so that an efficient online algorithm can collect sufficient samples from each factor to approximate the true model well enough.
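As a minimal, purely illustrative sketch of Definition 1 (the policy values and the tolerance below are hypothetical), the sample complexity for a tolerance  can be computed from the values of the policies an online algorithm generates: since we minimize cost, a policy is not -optimal at any step where its value reaches or exceeds the optimal value plus .

```python
# Hypothetical illustration of sample complexity (Definition 1).
# policy_values[t] is the (cost-based) value of the policy generated at
# step t, and optimal_value is the value of the optimal policy.

def sample_complexity(policy_values, optimal_value, eps):
    """Count the steps at which the generated policy is not eps-optimal."""
    return sum(1 for v in policy_values if v >= optimal_value + eps)

# A hypothetical trajectory of policy values approaching the optimum 10.0.
values = [14.0, 12.5, 11.0, 10.4, 10.2, 10.05, 10.01]
print(sample_complexity(values, 10.0, 0.5))  # prints 3
```

An efficient PAC algorithm is one for which this count stays polynomial in the model parameters with high probability.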

An online PAC learning algorithm
We are now in a position to construct an online learning algorithm to support condition-based group maintenance decision making; this algorithm is proven to be an efficient PAC algorithm for our FMDP model. At each inspection, the proposed algorithm proceeds in two steps. In the first step, the algorithm leverages the value iteration method to update the state-action value functions and generates a policy to recommend an action (replacement or do nothing). In the second step, the algorithm utilizes previous samples to estimate the transitions as well as the exact values of {  } ∈ .
To proceed, we list below the necessary quantities observed or calculated by the algorithm up to epoch  ≥ 1 during execution, for all  ∈ ,   ,   , , and , initialized using historical data. Note that our algorithm follows a standard online reinforcement learning pattern, in which the historical data contain the sequence of system states and the corresponding actions taken in preceding epochs. Such data are available in most practical maintenance scenarios. After a sufficient number of samples, as indicated in Theorem 1, has been collected, the algorithm considers the model fully learned and stops learning. However, if no historical data are available, then domain knowledge is needed to estimate the transition probabilities.
Then, the algorithm calculates the estimated overall transition probabilities, denoted by P , and the state-action value function Q , where   ∶  ×  → R is an exploration bonus that balances the exploitation-exploration tradeoff. We note here that   (⋅, ⋅) should be chosen such that it is factored and Q can be effectively solved by Algorithm 1. Next, the algorithm acts greedily on the current state vector   ; that is,   = arg min ∈ Q (  , ).
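To make this step concrete, the sketch below runs value iteration with a count-based exploration bonus on a tiny flat (non-factored) MDP. It mirrors the spirit of the fMBIE-style update (Strehl, 2007), but the problem data, the bonus form beta/sqrt(n), and all numbers are hypothetical stand-ins for the factored quantities used by our algorithm.

```python
import numpy as np

def optimistic_q(cost, P_hat, counts, gamma=0.8, beta=1.0, iters=200):
    """Value iteration with a count-based exploration bonus.
    cost: (S, A) immediate costs; P_hat: (S, A, S) estimated transitions;
    counts: (S, A) visit counts driving the bonus beta / sqrt(n)."""
    S, A = cost.shape
    Q = np.zeros((S, A))
    bonus = beta / np.sqrt(np.maximum(counts, 1))
    for _ in range(iters):
        V = Q.min(axis=1)                      # greedy (min-cost) value
        Q = cost - bonus + gamma * P_hat @ V   # Bellman update with optimism
    return Q

# Tiny hypothetical 2-state, 2-action example.
cost = np.array([[1.0, 0.5], [2.0, 1.5]])
P_hat = np.full((2, 2, 2), 0.5)          # uniform estimated transitions
counts = np.array([[10, 1], [5, 5]])     # action (0, 1) is rarely visited
Q = optimistic_q(cost, P_hat, counts)
action = int(Q[0].argmin())              # greedy action in state 0
```

Because we minimize cost, optimism subtracts the bonus: a less-visited action looks cheaper and is therefore tried, which is exactly the exploration effect described above.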
In the second step of epoch  ≥ 1, the algorithm utilizes historical observations to judge whether    <   holds with high probability. Specifically, the algorithm checks, in each epoch  ≥ 1 and for every  ∈ , whether there exists    <  ≤  satisfying the following condition: where ‖⋅‖ 1 is the  1 -norm, and the exact values of   and   are specified in Theorem 2. If condition (10) holds for some  ∈ , letting    <  ≤  be the largest value that makes (10) valid, then the

algorithm updates   +1 ←  +1 and proceeds to the next epoch  + 1. The procedures are summarized in Algorithm 2; ties are broken arbitrarily throughout.

Sample complexity
We now derive an upper bound on the sample complexity of Algorithm 2 that is polynomial in the relevant model parameters; thus, the proposed algorithm is an efficient PAC algorithm. The main theoretical result on the sample complexity is presented below.
Theorem 2. Given  > 0 and 0 <  < 1, if the exploration bonus   is chosen as ) , and   and   are chosen as ] , and then the sample complexity of Algorithm 2 can be bounded by ) , with probability at least 1 − .
Here we provide a sketch of the proof (the detailed proof can be found in Appendix B). We first construct in Lemma 1 a ''good event'' on which the estimation biases of P   and    shrink with high probability as the sample size increases. Then, restricted to this event, we derive a bound on the estimation bias of the total transition   . This bound can be directly translated into a bound on the estimation bias of the state-action value function. Finally, we integrate all these results and follow existing techniques to complete the proof.
We argue that Algorithm 2 follows the traditional paradigm of PAC learning. The coefficients   and   are pre-determined thresholds based on known model parameters. During the execution of Algorithm 2, for each single component, only the first   samples under the same action taken at inspection and the first   samples in the degradation interval are observed and utilized. After that, Algorithm 2 considers the model fully learned and neglects any new data. The exploration bonus   balances exploiting the ''optimal'' decision based on the latest data (exploitation) against probing less-explored decisions that may turn out better (exploration).

Numerical studies
In this section, we present numerical studies to demonstrate the effectiveness of the proposed algorithms. For this purpose, we consider a hypothetical system with multiple non-identical components. The components are periodically inspected, and the degradation level of each component is discretized into three states for illustration. Specifically, state 3 represents the perfect state, in which no maintenance action is needed; state 1 is the failure state, which triggers corrective maintenance; and when a component is in state 2, the decision maker must decide whether to implement preventive maintenance.
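For concreteness, the per-component decision logic implied by this three-state convention can be sketched as follows (the action names are ours and purely illustrative):

```python
# Three-state convention of the numerical study:
# state 3 = perfect (no action), state 1 = failed (corrective replacement
# is mandatory), state 2 = degraded (preventive replacement is optional).

PERFECT, DEGRADED, FAILED = 3, 2, 1

def feasible_actions(state):
    """Actions available for a single component at inspection."""
    if state == FAILED:
        return ["replace"]                # corrective maintenance is forced
    if state == DEGRADED:
        return ["do_nothing", "replace"]  # the preventive decision to optimize
    return ["do_nothing"]                 # perfect state: nothing to do

print(feasible_actions(2))  # prints ['do_nothing', 'replace']
```

The only non-trivial decision is thus in state 2, which is where the maintenance policy matters.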
To implement our approach, the first step is to specify the matrix  of basis functions in Algorithm 1. A number of studies have examined different basis functions for such problems, and the set of basis functions given below has been proven to be both simple and efficient (Guestrin et al., 2003; Osband & Van Roy, 2014; Xu & Tewari, 2020). Specifically, a set of  (i.e.,  = ) basis functions is defined, and each basis function ℎ  relies only on   . Inspired by the discussion in Szita and Lörincz (2008), the following operator  is compatible with the modified FVI approach and easy to calculate in the concerned scenario: where  + is the Moore-Penrose inverse of . We claim that the operator  defined above satisfies the requirements stated in the following proposition, and thus can be used as the projection operator in our approach.
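As a hedged sketch of such an operator (notation hypothetical), a least-squares projection onto the span of the basis columns can be built from the Moore-Penrose pseudoinverse as G = H H⁺. Note that a plain least-squares projection is only guaranteed non-expansive in the Euclidean norm; the operator in (11) is the variant that additionally satisfies the sup-norm non-expansion requirement of Proposition 1.

```python
import numpy as np

# Hypothetical basis matrix H: 3 states, 2 basis functions (columns).
H = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])

# Least-squares projection onto span(H) via the Moore-Penrose pseudoinverse.
G = H @ np.linalg.pinv(H)

v = np.array([1.0, 5.0, 2.0])   # an arbitrary value function
v_proj = G @ v                  # its best L2 approximation in span(H)

# The projection is idempotent: projecting twice changes nothing,
# and vectors already in span(H) are left untouched.
assert np.allclose(G @ v_proj, v_proj)
assert np.allclose(G @ H[:, 0], H[:, 0])
```

In the modified FVI, each Bellman backup is followed by such a projection so that iterates stay representable by the chosen basis.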
Proposition 1. The operator  defined in (11) satisfies the non-expansion property. Moreover, there exist  1 , … ,   such that  = ∑  =1   and ‖   ‖ ∞ can be bounded by ‖   ‖ ∞ < 1∕(3 + 2).

The next step is to determine the sample size | |. This requires us to evaluate the values of  1 and  2 in Theorem 1. For   0 in Theorem 1, we have   0 = ∑  =1   0 ,  , where   0 ,  is the product of   0 with an || ×  matrix that keeps the th column of  while setting all other entries to 0. As the row sums of   0 equal 1 and the elements of  are either 0 or 1, the row sums of   0 ,  are bounded by 1; that is, ‖  0 ,  ‖ ∞ ≤ 1. We note that all quantities needed to determine the values of  1 and  2 are either known or upper bounded. Specifically, the values of , , ,   , and the   (  ,   )'s are known, and the ‖  0 ,  ‖ ∞ 's and ‖  ‖ ∞ 's are bounded. Therefore, by substituting all these quantities into  1 and  2 in Theorem 1, we obtain the required expressions. We let  0 ≜ max ,   ,     (  ,   ). Since  * is the projection of the optimal value function, according to Theorem 1 and the trivial fact that ‖‖ ∞ ≤ , the sample size | | required by Algorithm 1 can be calculated accordingly. In what follows, we first demonstrate the preciseness of the modified FVI approach in Section 6.1, and then examine the superiority of Algorithm 2 through comparison studies in Section 6.2.

Modified FVI
We first implement Algorithm 1, presented in Section 4, and illustrate its performance. Although our algorithm is designed for large-scale maintenance problems, we choose a moderate problem size for illustrative purposes. This facilitates comparing the value function of the approximate policy generated by Algorithm 1 with that of the true optimal policy, where the latter can only be efficiently evaluated for a small or moderate problem size due to the computational complexity. First, we run Algorithm 1 on problems under different levels of , namely  ∈ {4, 5, 6, 7, 8, 9, 10}. We arrange the components in a circle so that no component is positioned at a boundary. Because usually only a limited number of neighbouring components are correlated in practical maintenance scenarios, without loss of generality, we fix  1 = ⋯ =   = 2 under each value of .
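The ring layout used here can be sketched as follows (0-based indexing, illustrative only):

```python
# Components arranged on a ring: every component has exactly two
# neighbours, one on each side, and none sits at a boundary.

def neighbours(i, n):
    """Indices of the two ring neighbours of component i among n components."""
    return ((i - 1) % n, (i + 1) % n)

print(neighbours(0, 6))  # prints (5, 1): the ring wraps around
```

This is why fixing the scope size at 2 for every component is well defined in this experiment: each component's neighbourhood has the same shape.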
Tables 1 and 2 present the transition probabilities, which are specified according to the following considerations. First, the degradation process of each component leads to worse health conditions. Therefore, when   = 3, the possible values for   after degradation are {3, 2, 1}, whereas when   = 2, the possible values for   after degradation are only {2, 1}. Second, practical evidence shows that a component is more likely to degrade to a state adjacent to its current state; see, for example, the degradation process of railway tracks (Sadeghi & Askarinejad, 2010). Third, when a component is found failed upon inspection, an instant corrective maintenance is executed to restore it to state 3.

Since there are at most 10 components in this numerical example, we do not use the sampling technique. The other parameters are arbitrarily set as  = 0.8,  = 3, and  = 3. In particular, 0.8 is a commonly used value for the discount factor  in the learning literature,  = 3 imposes a moderate penalty on maintaining detached components, and  = 3 indicates that at most 3 components are allowed to be maintained at each inspection. Under each value of , we evaluate the true optimal policy using the traditional value iteration method and the approximate optimal policy through Algorithm 1. The  ∞ -difference between the optimal value function  * and the approximate optimal value function  * is calculated and normalized as ‖ * −  * ‖ ∞ ∕‖ * ‖ ∞ . Meanwhile, the  ∞ -difference between  * and the true value function   0 of the greedy policy  0 is also calculated and normalized as ‖ * −   0 ‖ ∞ ∕‖ * ‖ ∞ . Since the latter is the bias in total return incurred by using  0 in practice, it also reflects the preciseness of the approximate optimal policy. The results are presented in Fig. 1(a). We can see that the bias for the approximate value function  * stays below 12%, which is acceptable in many practical scenarios. Meanwhile, the bias grows sublinearly with the number of components , implying that the performance of the modified FVI is robust to the problem scale. In addition, Fig. 1(a) shows that the bias for the true value function of  0 is smaller than that for  * , indicating that, when used in practice, the bias in value function induced by  0 is even smaller than estimated.
The superiority of the modified FVI lies in its low computational complexity compared with classical iterative algorithms. We thus compare in Fig. 1(b) the running times of the modified FVI and the original value iteration algorithm under different values of . A significant advantage of the modified FVI over its counterpart can be observed under each value of . Moreover, the running time of the modified FVI grows polynomially as  increases, whereas that of the original value iteration grows exponentially. This implies that the modified FVI remains feasible when  is large, whereas the original value iteration algorithm may fail due to its high computational cost.

Simulation study of Algorithm 2
We now examine the performance, in particular the sample complexity, of the online learning algorithm through simulation experiments. As discussed previously, one of the novelties in developing Algorithm 2 is that we incorporate an online method to evaluate the dependencies among components (i.e., adaptively detecting the values of {  } ∈ ). Hence, we compare the performance of Algorithm 2 with that of the common practice that assumes the values of {  } ∈ are known in advance.
In the simulation, we use the same model setting as before; that is, each component has 3 states and the cost function is identical. The same basis functions as specified in Section 6.1 are used here. We set the other parameters as follows:  = 0.8 and  = 5. To generate the transitions, we define two additional functions p1 ∶ {1, 2, 3} → [0, 1] and p2 ∶ {1, 2, 3} × {1, 2, 3} → [0, 1], and set the degradation transition accordingly. Here we focus on the power of Algorithm 2 for moderate- or large-scale systems. We first study the influence of economic dependence on the optimal solution. For this purpose, we fix  = 30 and examine the number of detached sets in the optimal solution produced by Algorithm 2. Recall that () is the number of detached sets of components selected for maintenance under action , and   is the action taken by Algorithm 2 at epoch  ≥ 1. In particular, we are interested in the average value ( ∑  =1 (  )∕ ) under different values of , where (⋅) returns the nearest integer to any input real number. We choose 14 different levels of , that is,  ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 30, 50}. The time horizon is fixed at 10 5 to ensure that the output policy of Algorithm 2 is stable. The results are presented in Table 3.
We can observe from Table 3 that the value of  has a significant influence on the number of detached sets of components. More specifically, when  becomes larger, the economic dependence among components becomes stronger, and the benefit of maintaining adjacent components together becomes higher. This implies that the optimal maintenance policy prefers to maintain neighbouring components in each single epoch. This is consistent with our motivation for introducing the penalty coefficient  in the cost function.
Next, we conduct a comparison study to demonstrate the superiority of the proposed algorithm. To this end, we fix  = 3 as in Section 6.1 and vary the value of  from 20 to 30, namely  ∈ {20, 21, … , 30}. It should be noted that very few existing approaches can solve problems at such a scale in the learning environment, especially when the parameters {  } ∈ are unknown (Dann et al., 2017; Rosenberg & Mansour, 2021; Strehl, 2007). A commonly adopted routine is to set the values of {  } ∈ in advance and use these fixed values thereafter. Following this routine, we choose two maintenance strategies that both use fixed values of {  } ∈ for performance comparison; they are suboptimal but make sense in practice. In particular, the first strategy ignores the dependence among components and assumes that each component evolves independently, or equivalently, fixes  1 = ⋯ =   = 0 throughout the whole maintenance process. The second strategy considers only the dependence between neighbouring components, namely,  1 = ⋯ =   = 1 is fixed all the time. The true values of all   's are set to 2. The two strategies can easily be realized by excluding the learning process for each   in Algorithm 2 and fixing   at 0 and 1, respectively. In this way, we compare our algorithm, which considers the stochastic dependence and learns it, with strategies that do not, in order to show the effectiveness of our method.
An important note is that the theoretical guarantee in Theorem 2 holds under the premise that the true optimal policy can be computed each time. However, the true optimal policy, as discussed in Section 4, is not computationally tractable in many real cases. As a result, we first illustrate the sample complexity of Algorithm 2 with the modified FVI approximation method. Because calculating the true optimal policy is no longer feasible, we simply record the total number of samples Algorithm 2 uses to update the model parameters as its sample complexity. In Fig. 2(a), we illustrate the sample complexity of Algorithm 2 under different values of . We can see that the sample complexity of Algorithm 2 increases polynomially in . This differs from the sample complexity of a learning algorithm for general MDPs, which grows exponentially in . The result verifies the feasibility of Algorithm 2 for solving large-scale maintenance problems. In Fig. 2(b), we compare the value functions of the three maintenance strategies. For a problem with over 20 components, the true value function of any policy is no longer tractable, so we use the approximate value functions of the three policies instead in the experiment. The results in Fig. 2(b) show that the proposed algorithm outperforms the two strategies without learning processes.

Concluding remarks
In this work, we study a condition-based group maintenance problem for multi-component systems subject to multiple types of dependencies among components. The problem is modelled as an FMDP, taking advantage of a specific location-based stochastic dependence among components. We first examine this problem from a traditional perspective in which the model parameters are assumed to be fully known in advance. To reduce the computational burden, we develop a modified FVI algorithm to efficiently approximate the optimal maintenance policy and provide an upper bound on its approximation error. Subsequently, we turn to an online learning environment in which the model parameters are unknown a priori, and develop an online reinforcement learning algorithm that simultaneously learns the model parameters and determines an optimal maintenance policy. The algorithm is capable of learning the transition probabilities and the system structure (indicating the stochastic dependence among components) from previous observations. Moreover, it outperforms existing approaches in that it can generate computationally tractable and approximately optimal maintenance policies even at a large problem scale. A key point here is that we properly incorporate the dependence properties into the design of our learning algorithm. By doing so, our algorithm effectively mitigates the computational complexities and burdens associated with online maintenance problems, while retaining good performance.
We believe that our model and algorithms are not restricted to the specific setting considered in this work. First, the location-based stochastic dependence assumption can be relaxed beyond neighbouring components, as long as the dependence structure can be gradually learned from previous observations. Second, the modified FVI with the sampling technique in our framework can be replaced by any effective approximation method for solving FMDPs. Third, from a more conceptual perspective, the scheme of the proposed algorithm, which learns the dependence structure among components while evaluating optimal maintenance policies, can be modified and extended to broader maintenance problems for large-scale systems with structural features among components.
However, this paper presents several limitations that deserve further research efforts. First, an implicit assumption adopted in this work is perfect inspection; that is, an inspection reveals each component's actual state without error. In reality, an inspection may be imperfect owing to measurement errors or sensor deterioration. Generalizing our modelling framework to involve imperfect inspections and developing appropriate algorithms to solve the associated maintenance problems remains an open question. Second, we assume that at inspection, only two actions (replacement or do nothing) can be taken for each component. Considering imperfect maintenance actions of different depths is an interesting research topic. Finally, conducting a real-world case study to calibrate our FMDP model with real data (collected from, e.g., energy-generation, transportation, or manufacturing systems) and to compare the performance of our algorithms with relevant benchmarks would be highly valuable.

Appendix A. Proof of Theorem 1

(A.1) The rest of the proof is to bound ‖ * − ̂ * ‖ ∞ . Based on the discussion in the proof of Theorem 1 in Szita and Lörincz (2008), we take the -dimensional vector whose th component is   (  ,  0 ()  ) and whose other components are 0; then, for ∑  =1   0 ,  , according to Szita and Lörincz (2008, Lemma 5), we have, for any  0 > 0, the first bound with probability at least 1 − ∕2, and the second bound with probability at least 1 − ∕2. By combining (A.2), (A.3), and (A.4), we obtain the claimed inequality with probability at least 1 − . This completes the proof. □

Appendix B. Proof of Theorem 2
The proof follows the outline sketched in Section 5.2. In the following lemma, we construct an event on which the estimated transitions are close to the true transitions, and prove that this event holds with high probability.
Lemma 1. We define ℰ to be the event that, for all  ≥ 1 during the execution of Algorithm 2 and all  ∈ , the stated bound holds with probability at least 1 − ∕{2 [  2+1 ( − 1) + 2 ]   }. Using the union bound, we combine all state-action pairs (  ,   )'s and (  ())'s, all possible samples 1, … ,   for the state-action pairs (  ,   )'s, and all possible samples 1, … ,   for the (  ())'s, to conclude the result. This concludes the proof. □ The corollary below follows directly from Lemma 1, namely, that for all   ≤  ≤ , condition (10) does not hold.
Corollary 1. Restricted to event ℰ , for all  ≥ 1 during the execution of Algorithm 2, the relation    ≤   holds.
Proof. According to Lemma 1, for all  ≥   , we have the stated inequality, where the last step holds because  ≤ , which implies   ((  ())) ≤   ((  (   ))). Thus, the update of    only happens when    <   , by which we conclude the proof. □ In the next lemma, we show that, restricted to event ℰ , the  1 -divergence between the estimated transition P that Algorithm 2 uses each time and the true transition  can be bounded.
Lemma 2. Restricted to event ℰ , the following relation holds for all  ≥ 1,  ∈  and  ∈ .
Proof. For notational convenience, for all  ∈ , we define the relevant quantities. According to Strehl (2007, Corollary 1), for all   ′ ,   ′ , we have the first bound, and similarly for all (  (   )), where the last inequality follows from Corollary 1. Meanwhile, for all   ′ ,   ′ ,   ′ , we have the analogous bound.

Proof of Theorem 2. The main target of the proof is to show that the three conditions in Strehl et al. (2006, Proposition 1) are satisfied by our algorithm, and then to use the results therein to construct an upper bound on the sample complexity of our algorithm. We first define a set of ''known'' state-action pairs at each time  ≥ 1, and we also define a ''known'' action-value function. According to Strehl et al. (2006, Proposition 1), to obtain the bound on the sample complexity in Theorem 2, we need to prove that for all  > 0, 0 <  < 1,  ≥ 1, and all (, ) ∈  × , the following conditions hold with probability at least 1 − ∕2: (1) min ∈ Q (, ) ≤ min ∈  * (, ) + ∕4; (2) | min ∈ Q (, ) − min ∈    (, )| ≤ ∕4; (3) the number of time steps at which some (  ,   ) ∉   is observed can be bounded by  ⋅  ⋅   +  2+1 ⋅  ⋅   . In the rest of the proof, we verify these three conditions one by one. Note that we always restrict our discussion to event ℰ .
We consider the value iteration equation for solving   below:  Therefore, the number of epochs when some (  ,   ) ∉   is observed can be bounded by max{ ⋅  ⋅   ,  2+1 ⋅  ⋅   }.
We finally claim that all three conditions presented at the beginning of the proof are satisfied, and thus the proof is complete. □

Fig. 1. The approximation error and running time of the modified FVI.

Table 3. Average number of detached sets of components under different values of .