A Heuristic Policy for Maintaining Multiple Multi-State Systems

This work is concerned with the optimal allocation of limited maintenance resources among a collection of competing multi-state systems, where the dynamics of each multi-state system are modelled by a Markov chain. Determining the optimal dynamic maintenance policy is prohibitively difficult, and hence we propose a heuristic dynamic maintenance policy in which maintenance resources are allocated to the systems with the highest importance. The importance measure is well justified by the idea of a subsidy, yet it is expensive to compute. Hence, we further propose two modifications of the importance measure, resulting in two modified heuristic policies. The performance of the two modified heuristics is evaluated in a systematic computational study, which demonstrates their strong performance.


Introduction
A partially observable Markov decision process (POMDP) is a generalization of a Markov decision process. A POMDP models a decision process in which the system's dynamics are governed by a Markov process, but the decision maker cannot directly observe the system's state. For a finite-state Markov decision process, the optimal policy can be expressed in a simple tabular form. When state uncertainty is introduced, the optimal policy for a POMDP is defined over a continuum of states. It is established in Madani et al. (1999) that optimal planning without full observability is prohibitively difficult both in theory and in practice, and that many natural questions in this domain are undecidable. Consequently, approximate methods are required even for small-size problems. Efficient approximate methods include policy iteration (Hansen, 1998), point-based value iteration (Pineau et al., 2003), and approximate linear programming (Hauskrecht and Kveton, 2004). The current work investigates an even more difficult problem: optimally maintaining a collection of multi-state systems with limited maintenance resources, where the dynamics of each multi-state system are modelled by a Markov chain. That is, instead of one POMDP, the problem involves multiple independent POMDPs, and the state of one POMDP affects the action taken on another. Determining the optimal dynamic maintenance policy for multiple competing POMDPs is impractical, and hence we develop a heuristic policy: at each decision epoch, we measure the importance of each system, and only the systems with the largest importance measures receive their optimal actions. Importance measures have been widely used as decision-aiding indicators in various domains.
For example, in risk analyses, importance measures are used in risk-informed decision-making (Tyrväinen, 2013); in reliability engineering, importance measures are used to prioritize components in a system for reliability improvement (Borgonovo et al., 2016). Recently, importance measures have been applied to maintenance optimization. Liu et al. (2014) proposed a maintenance strategy in which the component yielding the largest expected net revenue is selected for maintenance whenever the system reliability falls below a threshold. To reduce system downtime, Wu et al. (2016) proposed a maintenance strategy in which, when a component in a system has failed and is under repair, a number of the other components are selected for preventive maintenance; the authors developed an importance measure for the selection of components for preventive maintenance. Dui et al. (2017) pointed out that the preventive maintenance time of a selected component may be longer than the maintenance time of the failed component, and that, for the same reliability improvement of the system, different components may incur different preventive maintenance costs; the authors developed an importance measure taking into account the time and cost of preventive maintenance. With the objective of maximizing the throughput of a production system over a time interval, Ahmed and Liu (2019) developed two types of importance measures for prioritizing the critical components in the maintenance schedule. In the framework of condition-based maintenance, Do and Bérenguer (2020) developed an importance measure based on the conditional reliability of the system; that is, components are ranked according to their ability to improve the system's conditional reliability over a time interval.
Existing works on importance-measure-based maintenance all focus on ranking components. By contrast, this work is devoted to ranking systems. Within the POMDP framework, a multi-state system is treated as important (having a large importance measure) if the cost of not optimally maintaining the system is high. The importance measure defined in this work has an economic interpretation as a subsidy (for a positive importance measure) or a tax (for a negative importance measure); see Whittle (1988). Our sequential resource allocation and stochastic scheduling framework is very general and can be applied to solve, e.g., the dynamic multichannel access problem (Liu and Zhao, 2010), multi-UAV dynamic routing (Ny et al., 2008), and sequential selection of online ads (Yuan and Wang, 2012).
The remainder of the paper is organized as follows. In Section 2, we formulate the problem, define the importance measure, and point out its drawbacks. In Section 3, we introduce the two modified importance measures; we prove that both measures are well defined and further give two interpretations of the second measure. In Section 4, the performance of the proposed heuristics is studied in computational experiments. Section 5 concludes.

Problem Formulation
POMDPs provide a rich framework for planning under both state-transition uncertainty and observation uncertainty. A standard discrete-time POMDP can be defined by a tuple $(S, A, Z, p^a_{ss'}, f^a_s(z), R^a_s, \theta)$:
• $S$ is a finite set of states;
• $A$ is a finite set of actions;
• $Z$ is an observation space;
• $p^a_{ss'}$ is the probability of transitioning to state $s'$ after taking action $a$, given that the current state is $s$ ($s, s' \in S$ and $a \in A$);
• $f^a_s(z)$ is the probability of observing $z$ after taking action $a$, given that the current state is $s$ ($z \in Z$, $s \in S$ and $a \in A$);
• $R^a_s$ is the finite immediate reward for taking action $a$ in state $s$ ($s \in S$ and $a \in A$);
• $\theta \in (0, 1)$ is a discount factor.
For an action $a$ that returns no observation, it is equivalent to say that action $a$ always returns the same observation, denoted by "null", with $f^a_s(z = \text{null}) = 1$ for any state $s$. Ellis et al. (1995) provided an application example of the POMDP for a one-lane, two-girder highway bridge.
The condition of the bridge is characterized by five states, i.e., $S = \{1, 2, 3, 4, 5\}$. The available actions are $A$ = {doing nothing, visual inspection, nondestructive ultrasonic evaluation, cleaning and repainting corroded surfaces, repainting and strengthening deteriorated girders, extensive structural repair}. A visual inspection yields one of three possible outcomes: good, fair, and poor. The ultrasonic technique measures web and flange thickness loss in the girders, and the indicated results {state 1, state 2, state 3, state 4, state 5} are error-corrupted. Therefore, the observation space $Z$ is a discrete set of eight observations. If, for example, the underlying state is $s = 1$ and the action taken is $a$ = visual inspection, then $f^a_s(z = \text{good}) = 0.2$ and $f^a_s(z = \text{fair}) = 0.8$; if the underlying state is $s = 2$ and the action taken is $a$ = nondestructive ultrasonic evaluation, then $f^a_s(z = \text{state 1}) = 0.05$, $f^a_s(z = \text{state 2}) = 0.9$, and $f^a_s(z = \text{state 3}) = 0.05$. State transitions satisfy the Markov property; for example, given $s = 1$ at time $t$ and $a$ = doing nothing, the probability $p^a_{ss'}$ of $s' = 2$ at time $t + 1$ is 0.13, independent of all states and actions before time $t$.
Within the POMDP framework, the information on the system's true state is incomplete and encapsulated by a probability vector, called the belief state. A belief state at epoch $t$ is a vector $\mathbf b_t = (b^t_s : s \in S)$, where $b^t_s$ is the probability of the system being in state $s$ at epoch $t$. We have $b^t_s \geq 0$ and $\sum_{s \in S} b^t_s = 1$, and therefore the belief state space is a unit simplex, denoted by $\Delta$. It is well known that $\mathbf b_t$ summarizes all the information necessary for making decisions at epoch $t$ (Sondik, 1978); that is, to make a decision at epoch $t$, we only need to know the belief state $\mathbf b_t$, instead of all the historical actions and observations. The Markovian decision-making process is as follows. At time 0, the decision maker's belief state $\mathbf b_0$ characterizes the prior knowledge regarding the condition of the system before the beginning of the sequential decision making. At time point $t$ ($t = 1, 2, \ldots$), the decision maker collects an observation $z_t$. Combining the information at time $t - 1$ (i.e., $\mathbf b_{t-1}$ and $a_{t-1}$) with the new information (i.e., $z_t$), the decision maker updates his belief regarding the system's current state $s_t$. According to the newly updated belief state $\mathbf b_t$, the decision maker then determines the action $a_t$. Likewise, at epoch $t + 1$, the decision maker collects a new observation $z_{t+1}$, updates the belief state $\mathbf b_{t+1}$ from $(\mathbf b_t, a_t, z_{t+1})$, and finally determines the action $a_{t+1}$.
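As a concrete sketch of this update, the snippet below implements Bayes' rule for one step: the belief is pushed through the Markov chain of the action taken and then reweighted by the observation likelihoods. The transition matrix, likelihoods, and numbers are hypothetical, not taken from the bridge example.

```python
import numpy as np

def belief_update(b, P, F_z):
    """Bayes update tau(b, a, z): P is the transition matrix of the action
    taken, F_z the vector of observation likelihoods for each next state."""
    unnorm = F_z * (P.T @ b)       # f(z | s') * sum_s p_{s s'} b_s
    return unnorm / unnorm.sum()   # normalise so the entries sum to one

# Hypothetical three-state degradation model and observation likelihoods.
b = np.array([0.6, 0.3, 0.1])
P = np.array([[0.8, 0.2, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
F_z = np.array([0.7, 0.2, 0.1])    # likelihood of the observed z per state
b_next = belief_update(b, P, F_z)
```

Because the observation strongly favours the first state here, the posterior concentrates on it.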
The rule for determining the action $a_t$ for the belief state $\mathbf b_t$ is called a policy. More formally, a policy $\pi$ is a mapping from the belief state space to the action set ($\pi : \Delta \to A$), and the optimal policy $\pi^*$ maximizes the value function (the expected discounted reward) for any given belief state:
$$V^{\pi^*}(\mathbf b_t) = \max_{a_t \in A}\Big\{ \langle R^{a_t}, \mathbf b_t\rangle + \theta\,\mathbb E_{z_{t+1}}\big[ V^{\pi^*}(\mathbf b_{t+1}) \big] \Big\},$$
where $\langle\cdot,\cdot\rangle$ is the inner product, $R^a = (R^a_s : s \in S)$, and $\mathbf b_{t+1}$ is calculated from $(\mathbf b_t, a_t, z_{t+1})$ using Bayes' rule:
$$b^{t+1}_{s'} = \frac{ f^{a_t}_{s'}(z_{t+1}) \sum_{s \in S} p^{a_t}_{ss'} b^t_s }{ \sum_{s' \in S} f^{a_t}_{s'}(z_{t+1}) \sum_{s \in S} p^{a_t}_{ss'} b^t_s }.$$
In the following, we write $\boldsymbol\tau(\mathbf b, a, z)$ for the updated belief state when $\mathbf b_t = \mathbf b$, $a_t = a$ and $z_{t+1} = z$. The optimal policy $\pi^*$ is deterministic, stationary and Markovian (Blackwell, 1965). The optimal policy is defined over a continuum of states, yet does not have an analytic expression. Hence, different methods have been developed for approximating the optimal policy; see Hauskrecht (2000), de Farias and Roy (2003) and Shani et al. (2013).
The current work is focused on the problem of optimally allocating limited effort (such as time, spares, maintenance personnel, etc.) among a collection of competing projects, where the dynamics of each project are modelled by an independent Markov chain; for example, a collection of multi-state systems competing for a limited number of spare parts. For illustrative purposes, we here consider the problem of maintaining a collection of $M$ ($> 1$) multi-state systems with only $\kappa$ ($< M$) repairmen. Consequently, at each decision epoch, if there are more than $\kappa$ systems whose optimal actions are not "doing nothing", we need to decide which $\kappa$ systems will receive their optimal actions; the remaining $M - \kappa$ systems will all receive the do-nothing action. The optimal planning for a collection of competing POMDPs is prohibitively difficult due to the inherent complexity of the POMDP model; in fact, Papadimitriou and Tsitsiklis (1999) proved that such problems are PSPACE-hard. This motivates us to develop a heuristic policy: at each decision epoch, we measure the importance of each system, and only the $\kappa$ systems with the largest importance measures receive their optimal actions. Hereafter, we label the do-nothing action by the number 0; that is, $a_t = 0$ means that the action taken at time $t$ is "doing nothing".
The importance measure defined in this work is inspired by the idea of a subsidy for "doing nothing". We explain the idea through one POMDP/multi-state system. Assume that the decision maker is given a subsidy whenever the action taken on the system is "doing nothing". Suppose, for example, that the optimal action for the belief state $\mathbf b_t$ is "replacing a component". If the decision maker instead takes the do-nothing action, he is given a positive subsidy to offset the loss caused by not taking the optimal action for the belief state $\mathbf b_t$. Apparently, the decision maker is willing to trade "replacing a component" for "doing nothing" only when the subsidy is large enough to cover the loss. In other words, the minimal subsidy required by the decision maker reflects the importance of the optimal action for the belief state $\mathbf b_t$, and hence can be adopted as the importance measure of the system at time $t$.
We now formally define the importance measure. After including the subsidy $w$ for the do-nothing action, let $V(\mathbf b_t; w)$ denote the new maximal expected discounted reward (EDR) for belief state $\mathbf b_t$:
$$V(\mathbf b_t; w) = \max_{a_t \in A}\Big\{ \langle R^{a_t}, \mathbf b_t\rangle + w\,\delta(a_t = 0) + \theta\,\mathbb E_{z_{t+1}}\big[ V(\mathbf b_{t+1}; w)\big] \Big\}, \qquad (3)$$
where $\delta(\cdot)$ is the indicator function. Equation (3) implies that the subsidy can be incorporated into the reward structure, and the tuple $(S, A, Z, p^a_{ss'}, f^a_s(z), R^a_s + w\delta(a = 0), \theta)$ is still a POMDP with a deterministic and stationary optimal policy. The optimal action for belief state $\mathbf b$ under subsidy $w$ is denoted by $a(\mathbf b; w)$. We call the set of belief states $P(w) = \{\mathbf b \in \Delta : a(\mathbf b; w) = 0\}$ the inactive set. In other words, under subsidy $w$, if the belief state $\mathbf b_t \in P(w)$, then the optimal action $a(\mathbf b_t; w)$ is "doing nothing".
Intuitively, if the optimal action for a belief state $\mathbf b$ is "doing nothing" when the subsidy is $w$, then the optimal action for $\mathbf b$ should remain "doing nothing" for any larger subsidy. Hence, we would expect that, if $a(\mathbf b; w_1) = 0$, then $a(\mathbf b; w_2) = 0$ for any $w_2 > w_1$; or, equivalently, if $\mathbf b \in P(w_1)$ and $w_2 > w_1$, then $\mathbf b \in P(w_2)$. Unfortunately, this is not always the case (Whittle, 1988): for an arbitrary POMDP, the inactive set $P(w)$ need not be monotone in $w$. In other words, the subsidy as an importance measure is not well defined for all POMDPs. The POMDPs whose inactive sets can only increase with the subsidy are called indexable.

Definition 1. A POMDP $(S, A, Z, p^a_{ss'}, f^a_s(z), R^a_s, \theta)$ is called indexable if the inactive set $P(w)$ increases from the empty set ∅ to the whole belief state space $\Delta$ as the subsidy $w$ increases from $-\infty$ to $+\infty$.
Definition 2. If a POMDP (i.e., a multi-state system) is indexable, and its belief state at time $t$ is $\mathbf b_t$, then its importance measure at time $t$, denoted by $I(\mathbf b_t)$, is the infimum subsidy $w$ such that $\mathbf b_t \in P(w)$.

Given that indexability does not always hold, specific structural conditions are needed to guarantee it. In Appendix A, we study a particular POMDP (with only two actions) for which indexability always holds.
After defining the importance measure, we now come back to the problem of optimally allocating limited effort among $M$ multi-state systems. Note that the $M$ multi-state systems need not be identical; each multi-state system can be modelled by a different Markov chain. Suppose that all $M$ multi-state systems are indexable. At each decision epoch, if the number of positive importance measures is larger than $\kappa$, then only the $\kappa$ multi-state systems with the largest importance measures receive their optimal actions. If the number of positive importance measures is smaller than $\kappa$, then only the multi-state systems with positive importance measures receive their optimal actions.
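The allocation rule just described can be sketched in a few lines; the function below is an illustration of ours, not the paper's code.

```python
def allocate(measures, kappa):
    """Indices of systems that receive their optimal actions: at most kappa
    systems, chosen among those with positive importance measures."""
    candidates = [i for i, w in enumerate(measures) if w > 0]
    candidates.sort(key=lambda i: measures[i], reverse=True)
    return set(candidates[:kappa])

# Five systems, two repairmen: the two largest positive measures win.
chosen = allocate([0.4, -0.1, 1.2, 0.0, 0.7], kappa=2)
```

If fewer than `kappa` measures are positive, only those systems are maintained, matching the rule in the text.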
Although the importance measure defined above is well justified by the notion of a subsidy, it has two drawbacks: (1) it is only defined for indexable POMDPs; (2) it is computationally expensive, since according to Equation (3) we have to try many candidate subsidy values for a belief state, and each trial requires running value iteration until convergence. We therefore introduce below two modified importance measures, both of which are defined for every POMDP and are computationally cheap.

Approximate Measure
The computational burden of the importance measure is mainly due to the difficulty of evaluating the value function $V^{\pi^*}(\mathbf b)$. We hence propose to approximate the value function to the second order. The infimum subsidy calculated from the approximate value function then serves as an importance measure, called the approximate measure.
Recall that, given a policy $\pi$, the EDR for the POMDP $(S, A, Z, p^a_{ss'}, f^a_s(z), R^a_s, \theta)$ is
$$V^{\pi}(\mathbf b) = \mathbb E\Big[ \sum_{t=0}^{\infty} \theta^t \langle R^{a_t}, \mathbf b_t\rangle \,\Big|\, \mathbf b_0 = \mathbf b,\ \pi \Big].$$
The well-known myopic policy approximates the EDR to the first order:
$$V_1(\mathbf b) = \max_{a \in A} \langle R^a, \mathbf b\rangle,$$
where $\langle\cdot,\cdot\rangle$ is the inner product and $R^a = (R^a_s : s \in S)$ is a vector of rewards. We here propose a second-order approximation:
$$V_2(\mathbf b) = \max_{a \in A}\Big\{ \langle R^a, \mathbf b\rangle + \theta\,\mathbb E_{z}\big[ V_1(\boldsymbol\tau(\mathbf b, a, z)) \big] \Big\}.$$
Then the optimal value function $V^{\pi^*}(\cdot)$ is approximated by $V_2(\cdot)$. For the POMDP $(S, A, Z, p^a_{ss'}, f^a_s(z), R^a_s + w\delta(a = 0), \theta)$, the corresponding optimal value function is approximated by $V_2(\cdot\,; w)$, obtained by replacing $R^a_s$ with $R^a_s + w\delta(a = 0)$ above. The optimal action determined by the second-order approximation is denoted by $a_2(\mathbf b; w)$, the corresponding inactive set is $P_2(w) = \{\mathbf b \in \Delta : a_2(\mathbf b; w) = 0\}$, and the approximate measure of a belief state is the infimum subsidy $w$ such that the belief state belongs to $P_2(w)$. The following proposition states that the approximate measure is well defined for every POMDP.
Proposition 1. For any POMDP, the inactive set P 2 (w) increases from the empty set ∅ to the whole belief state space ∆ as the subsidy w increases from −∞ to +∞.
Proof. The proof is given in Appendix B.
Then the heuristic policy for the $M$ competing multi-state systems operates as follows. At each decision epoch, if the number of positive approximate measures is larger than $\kappa$, then only the $\kappa$ systems with the largest approximate measures receive their optimal actions. If the number of positive approximate measures is smaller than $\kappa$, then only the systems with positive approximate measures receive their optimal actions. Although the values of the approximate measures differ from the values of the importance measures, it is the ordering of the measures that determines the policy, and we expect that the ordering of the importance measures is preserved most of the time under our approximation.
We can further approximate the optimal value function $V^{\pi^*}(\cdot)$ to the third order:
$$V_3(\mathbf b) = \max_{a \in A}\Big\{ \langle R^a, \mathbf b\rangle + \theta\,\mathbb E_{z}\big[ V_2(\boldsymbol\tau(\mathbf b, a, z)) \big] \Big\},$$
and define an importance measure from the third-order approximation in a similar manner, which we call the third-order measure. One may argue that the heuristic policy under the third-order measure is superior to the approximate-measure policy, as the third-order approximation is closer to the optimal value function. However, as with the importance measure, the third-order measure is not well defined for every POMDP, and the computational complexity of the approximate measure is much lower than that of the third-order measure. Moreover, the numerical study in Section 4 will reveal that the approximate-measure policy outperforms the third-order-measure policy.
To calculate the approximate measure, we need to numerically try different values of $w$. For a large enough subsidy $\hat w$ such that $0 = \arg\max_{a \in A} \langle R^a + \hat w\,\delta(a = 0)\mathbf 1, \mathbf b\rangle$ for any $\mathbf b$, the optimal action at any decision epoch is always $a = 0$. Hence, we only need to search the interval $(0, \hat w)$ for the minimal subsidy value such that the optimal action for $\mathbf b$ is $a = 0$. If the observation space $Z$ is discrete, the approximate measure can be quickly determined. Otherwise, if the observation space is continuous, we can apply numerical integration on a grid of points $\{z_1, z_2, z_3, \ldots\}$ over the observation space $Z$; specifically, under subsidy $w$, the expectation over $z$ in the second-order approximation is replaced by a weighted sum over the grid points.
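As an illustration of this search, the sketch below bisects on the subsidy for a POMDP with a discrete observation space, using the second-order approximation to decide whether "doing nothing" is optimal under a candidate subsidy. The model (rewards, transitions, uninformative observations) is hypothetical, and the code is ours, not the paper's (which is in R).

```python
import numpy as np

def q2(b, a, w, P, F, R, theta):
    """Second-order approximate value of taking action a in belief state b
    when the do-nothing action (a = 0) is subsidised by w."""
    val = R[a] @ b + w * (a == 0)
    for z in range(F[a].shape[1]):            # discrete observation space
        unnorm = F[a][:, z] * (P[a].T @ b)    # joint prob. of next state and z
        pz = unnorm.sum()
        if pz > 0:
            b_next = unnorm / pz              # Bayes-updated belief
            val += theta * pz * max(R[aa] @ b_next + w * (aa == 0)
                                    for aa in range(len(R)))
    return val

def approximate_measure(b, P, F, R, theta, w_hi=100.0, tol=1e-6):
    """Bisect for the smallest subsidy making action 0 optimal under V_2."""
    lo, hi = 0.0, w_hi
    while hi - lo > tol:
        w = (lo + hi) / 2.0
        best = max(range(len(R)), key=lambda a: q2(b, a, w, P, F, R, theta))
        if best == 0:
            hi = w     # action 0 already optimal: try a smaller subsidy
        else:
            lo = w
    return hi

# Tiny illustrative model: state 0 is "bad", state 1 is "good"; action 0 is
# "doing nothing", action 1 is "repair"; a single uninformative observation.
R = [np.array([-2.0, 1.0]), np.array([0.5, 0.5])]
P = [np.array([[1.0, 0.0], [0.3, 0.7]]),     # doing nothing: degradation
     np.array([[0.0, 1.0], [0.0, 1.0]])]     # repair: restore to "good"
F = [np.ones((2, 1)), np.ones((2, 1))]
m = approximate_measure(np.array([0.8, 0.2]), P, F, R, theta=0.9)
```

Bisection is valid here because, by Proposition 1, the inactive set $P_2(w)$ is monotone in the subsidy.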

Rate Measure
The rate measure for belief state $\mathbf b$, denoted by $I(\mathbf b)$, is the minimal subsidy $w$ such that
$$\langle R^0, \mathbf b\rangle + w + \theta\,\mathbb E_z\big[ V^{\pi^*}(\boldsymbol\tau(\mathbf b, 0, z)) \big] \geq \max_{a \in A}\Big\{ \langle R^a, \mathbf b\rangle + \theta\,\mathbb E_z\big[ V^{\pi^*}(\boldsymbol\tau(\mathbf b, a, z)) \big] \Big\}.$$
It can be interpreted as a one-off subsidy as follows. Recall that the optimal action for $\mathbf b$ should be $\arg\max_{a \in A}\{\langle R^a, \mathbf b\rangle + \theta\,\mathbb E_z[V^{\pi^*}(\boldsymbol\tau(\mathbf b, a, z))]\}$. However, due to the competing multi-state systems, we may have to take action $a = 0$ instead. We assume that this is a one-time restriction and that we can still act optimally afterwards according to the optimal policy $\pi^*$. Under this assumption, the loss for taking action $a = 0$ (at time $t$ only) is
$$V^{\pi^*}(\mathbf b) - \langle R^0, \mathbf b\rangle - \theta\,\mathbb E_z\big[ V^{\pi^*}(\boldsymbol\tau(\mathbf b, 0, z)) \big].$$
If we subsidize action $a = 0$ by the amount $I(\mathbf b)$, then the optimal action for belief state $\mathbf b$ will be $a = 0$.
Therefore, we have
$$I(\mathbf b) = V^{\pi^*}(\mathbf b) - \langle R^0, \mathbf b\rangle - \theta\,\mathbb E_z\big[ V^{\pi^*}(\boldsymbol\tau(\mathbf b, 0, z)) \big].$$
We can utilize this equation to calculate the rate measure, which requires very little effort.
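Given an (approximate) optimal value function, the rate measure follows directly from this one-off-loss formula; the sketch below assumes, purely for illustration, that $V^{\pi^*}$ is represented by a small set of α-vectors with hypothetical values.

```python
import numpy as np

def value(b, alphas):
    """Piecewise-linear value function represented by alpha-vectors."""
    return max(alpha @ b for alpha in alphas)

def rate_measure(b, P0, F0, R0, theta, alphas):
    """One-off loss of taking action 0 now and acting optimally afterwards."""
    future = 0.0
    for z in range(F0.shape[1]):              # discrete observation space
        unnorm = F0[:, z] * (P0.T @ b)        # joint prob. of next state and z
        pz = unnorm.sum()
        if pz > 0:
            future += pz * value(unnorm / pz, alphas)
    return value(b, alphas) - (R0 @ b + theta * future)

# Hypothetical two-state example with an uninformative observation.
alphas = [np.array([0.0, 2.0]), np.array([1.0, 1.0])]   # stand-in for V^pi*
P0 = np.array([[1.0, 0.0], [0.5, 0.5]])                 # do-nothing dynamics
F0 = np.ones((2, 1))
R0 = np.array([-1.0, 1.0])
I_b = rate_measure(np.array([0.5, 0.5]), P0, F0, R0, theta=0.9, alphas=alphas)
```

Unlike the approximate measure, no search over candidate subsidies is needed: one evaluation of the value function suffices.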
A POMDP under the rate measure is apparently indexable: the inactive set is $\{\mathbf b \in \Delta : w \geq I(\mathbf b)\}$, so if $w_1 \geq I(\mathbf b)$, then $w_2 \geq I(\mathbf b)$ for any $w_2 > w_1$; that is, as the subsidy increases, the inactive set cannot decrease.
We here give another interpretation of $I(\mathbf b)$ utilizing the approximate linear programming technique (de Farias and Roy, 2003; Hauskrecht and Kveton, 2004). Consider the problem
$$\min_{V(\cdot)} \int_{\Delta} c(\mathbf b)\, V(\mathbf b)\, d\mathbf b \quad \text{s.t.} \quad V(\mathbf b) \geq \langle R^a, \mathbf b\rangle + \theta\,\mathbb E_z\big[ V(\boldsymbol\tau(\mathbf b, a, z)) \big], \quad \forall\, a \in A,\ \mathbf b \in \Delta. \tag{P1}$$
Here, $c(\cdot)$ is an arbitrary positively valued function. It is clear that, for any positive function $c(\cdot)$, $V^{\pi^*}(\cdot)$ is the unique solution to problem (P1). The approximate linear programming method approximates the value function $V(\cdot)$ by a set of basis functions in order to make the problem linear. With the aim of computing a coefficient vector $\boldsymbol\beta = (\beta_1, \ldots, \beta_k)$ such that $V^{\pi^*}(\cdot)$ is closely approximated by $\langle\boldsymbol\beta, \boldsymbol\upsilon(\cdot)\rangle$ for given basis functions $\boldsymbol\upsilon(\cdot) = (\upsilon_1(\cdot), \ldots, \upsilon_k(\cdot))$, we pose the following optimization problem:
$$\min_{\boldsymbol\beta} \sum_{\mathbf b \in B} c(\mathbf b)\, \langle\boldsymbol\beta, \boldsymbol\upsilon(\mathbf b)\rangle \quad \text{s.t.} \quad \langle\boldsymbol\beta, \boldsymbol\upsilon(\mathbf b)\rangle \geq \langle R^a, \mathbf b\rangle + \theta\,\mathbb E_z\big[ \langle\boldsymbol\beta, \boldsymbol\upsilon(\boldsymbol\tau(\mathbf b, a, z))\rangle \big], \quad \forall\, a \in A,\ \mathbf b \in B, \tag{P2}$$
where we approximate the belief state space by a finite set, $B$, of randomly sampled belief states.
The corresponding Lagrange dual problem is
$$\max_{\lambda \geq 0} \sum_{\mathbf b \in B}\sum_{a \in A} \lambda_{\mathbf b, a}\, \langle R^a, \mathbf b\rangle \quad \text{s.t.} \quad \sum_{\mathbf b \in B}\sum_{a \in A} \lambda_{\mathbf b, a}\, \bar{\boldsymbol\upsilon}(\mathbf b, a) = \sum_{\mathbf b \in B} c(\mathbf b)\, \boldsymbol\upsilon(\mathbf b), \tag{P3}$$
where $\bar{\boldsymbol\upsilon}(\mathbf b, a) = \boldsymbol\upsilon(\mathbf b) - \theta\,\mathbb E_z[\boldsymbol\upsilon(\boldsymbol\tau(\mathbf b, a, z))]$. Let $\boldsymbol\beta^*$ and $\{\lambda^*_{\mathbf b, a} : a \in A, \mathbf b \in B\}$ denote the optimal primal and dual solutions. We note the following.
• The objective function of the dual problem (P3) indicates that $\lambda^*_{\mathbf b, a}$ can be interpreted as the expected discounted time that action $a$ is taken for belief state $\mathbf b$ under the optimal policy.
By complementary slackness, we have $\lambda^*_{\mathbf b, a} = 0$ for any non-optimal action $a$, whose constraint is slack: $\langle\boldsymbol\beta^*, \bar{\boldsymbol\upsilon}(\mathbf b, a)\rangle > \langle R^a, \mathbf b\rangle$. In other words, the optimal action for a belief point $\mathbf b$ is simply $\arg\min_{a \in A}\{\langle\boldsymbol\beta^*, \bar{\boldsymbol\upsilon}(\mathbf b, a)\rangle - \langle R^a, \mathbf b\rangle\}$.
• The Lagrange dual function indicates that $\langle\boldsymbol\beta^*, \bar{\boldsymbol\upsilon}(\mathbf b, a)\rangle - \langle R^a, \mathbf b\rangle$ is the rate at which the objective changes with $\lambda_{\mathbf b, a}$, the expected discounted time that action $a$ is taken for $\mathbf b$. Therefore, we can define a rate measure as $\langle\boldsymbol\beta^*, \bar{\boldsymbol\upsilon}(\mathbf b, 0)\rangle - \langle R^0, \mathbf b\rangle$. Since $\langle\boldsymbol\beta^*, \boldsymbol\upsilon(\cdot)\rangle$ approximates $V^{\pi^*}(\cdot)$, this quantity is exactly the one-off subsidy $I(\mathbf b)$, hence the name.

Numerical Study
In this section, we numerically evaluate the performance of the approximate-measure policy and the rate-measure policy. We first compare the two heuristic policies with a random policy, and then compare the approximate-measure policy with the third-order measure policy and the myopic policy.
Suppose we have M identical systems (e.g., M wind turbines in a wind farm) and κ repairmen.
Correspondingly, the reward structure is specified identically for each system. By simulating the maintenance decision process 1000 times, we approximate the total EDR by the average of the 1000 total discounted rewards.
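The evaluation protocol can be sketched as follows; the two-state chain and its rewards are stand-ins of ours, since the paper's wind-turbine model is not fully specified in this excerpt.

```python
import numpy as np

def simulate_once(rng, theta=0.9, horizon=90):
    """One replication: run a (stand-in) two-state Markov chain to the
    horizon and accumulate the discounted rewards."""
    total, state = 0.0, 0
    for t in range(horizon + 1):
        reward = 1.0 if state == 0 else -1.0          # placeholder rewards
        total += (theta ** t) * reward
        probs = [0.9, 0.1] if state == 0 else [0.2, 0.8]
        state = rng.choice(2, p=probs)
    return total

def estimate_edr(n_reps=1000, seed=0):
    """Average the discounted rewards over n_reps independent replications."""
    rng = np.random.default_rng(seed)
    return float(np.mean([simulate_once(rng) for _ in range(n_reps)]))
```

Fixing the seed makes the comparison between policies reproducible, as in the repeated-simulation design of the study.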

Evaluating the Two Heuristic Policies
Generally, the relative suboptimality gap is employed as the performance measure:
$$\frac{V^*(\mathbf b^0_{1:M}) - V_i(\mathbf b^0_{1:M})}{V^*(\mathbf b^0_{1:M})},$$
where $V^*(\mathbf b^0_{1:M})$ is the total EDR under the optimal policy, and $V_i(\mathbf b^0_{1:M})$ is the total EDR under a heuristic policy. However, evaluating the optimal policy is PSPACE-hard.
Hence, instead of the optimal policy, we compare with a random policy that randomly selects $\kappa$ out of all the systems that need to be maintained. Let $V(\mathbf b^0_{1:M})$ be the total EDR under the random policy.
Set $M$ to be 10, and let $\kappa$ in turn take a value from {2, 4, 6, 8}. Randomly generate one set of initial belief states $(\mathbf b^0(1), \ldots, \mathbf b^0(M))$. To calculate the original optimal action for any given belief state, the optimal value function for each system is approximated by a set of 10000 α-vectors (Hauskrecht, 2000). In Figures 1-3, the red solid curve corresponds to the approximate-measure policy, the blue dashed curve corresponds to the rate-measure policy, and the black dot-dash curve corresponds to the random policy. Figure 1 plots the total discounted reward $\sum_{m=1}^{M}\sum_{t=0}^{90} \theta^t R^{a^m_t}_{s^m_t}$ for each of the 1000 repeats, and Table 1 gives the mean value of the 1000 total discounted rewards. As stated in Section 3.2, the performance of the rate-measure policy depends on the ratio $\kappa/M$: the larger the ratio, the better. Figure 1 and Table 1 show that, when the ratio $\kappa/M$ is larger than 0.5, the rate-measure policy and the approximate-measure policy have the same performance. Hence, when the ratio is larger than 0.5, we can use only the rate measure, as calculating the rate measure is faster than calculating the approximate measure. When the ratio is smaller than 0.5, the approximate-measure policy outperforms the rate-measure policy. In each case, the random policy performs the worst, with the 1000 total discounted rewards having a low mean value and a large variance.
To further examine the influence of the ratio $\kappa/M$, we now fix $\kappa$ at 12 and let $M$ in turn take a value from {15, 20, 30, 60}, making the ratio $\kappa/M$ take the values {0.8, 0.6, 0.4, 0.2}. With the randomly generated initial belief states $(\mathbf b^0(1), \ldots, \mathbf b^0(M))$ fixed, we simulate a Markovian maintenance decision process until time 90 and then calculate the total discounted reward. The procedure is repeated 1000 times to obtain 1000 total discounted rewards. Table 2 lists the total EDRs. It is clear from Figure 2 and Table 2 that the approximate-measure policy outperforms the others when $\kappa/M < 0.5$; the large gap between the total discounted rewards of the random policy and the approximate-measure policy verifies the efficiency of the approximate-measure policy.
The comparisons in each panel of Figures 1 and 2, together with Figure 3, indicate that:
• when $\kappa/M$ is smaller than 0.5, the approximate-measure policy has the best performance;
• when $\kappa/M$ is larger than 0.5, the approximate-measure policy and the rate-measure policy have the same performance, but calculating the rate measure is faster than calculating the approximate measure;
• the large gap between $V_i(\mathbf b^0_{1:M})$ and $V(\mathbf b^0_{1:M})$ verifies the exceptional competence of the approximate measure.
To decide which importance measure to apply for a particular problem, one can calculate both the approximate measure and the rate measure for the first few decision epochs. If the two measures produce very similar total rewards, then it is safe to use only the rate measure for the following decision epochs. Note that, for either measure, the $M$ importance measures for the $M$ systems can be calculated in parallel.

Comparing with the Third-Order Approximation
To further reveal the competence of the approximate measure, we here compare the approximate-measure policy with the myopic policy and the third-order-measure policy. The total EDRs are listed in Table 3. Instead of one single set of starting belief states, Figure 5 further plots the results for multiple randomly generated sets of starting belief states. From Table 3, it is clear that the approximate-measure policy frequently gives a higher total EDR than the third-order-measure policy. In particular, when the ratio $\kappa/M$ is small, the approximate-measure policy always outmatches the third-order-measure policy in terms of the total EDR. Therefore, we claim that the second-order approximation is superior to the third-order approximation, especially as its computation is much less demanding. The large gap between the total discounted rewards of the myopic policy and the approximate-measure policy when $\kappa/M = 0.2$ further confirms the dominance of the second-order approximation.
We then fix $M$ at 10, and let $\kappa$ in turn take a value from {2, 4, 6, 8}; the corresponding results lead to the same conclusion, with the advantage of the second-order approximation coming at a lower computational cost. The myopic policy, though better than the random policy, still produces a much lower total EDR when $\kappa/M = 0.2$. Overall, the approximate-measure policy shows exceptional performance, and when the ratio $\kappa/M$ is large, the rate-measure policy is also outstanding; moreover, calculating the rate measure is faster than calculating the approximate measure. Hence, the approximate measure and the rate measure can be applied in different settings. To decide which importance measure to use, one can calculate both importance measures for the first few decision epochs. If the two measures produce very similar total rewards, then one can switch to the rate measure for the following decision epochs. R codes for the above numerical study are available on request.

Conclusion and Further Research
This work proposed importance-measure-based heuristic policies for maintaining multiple competing multi-state systems with limited maintenance resources. As future work, it is necessary to provide provable performance bounds or establish asymptotic optimality of the proposed heuristics. Moreover, we found that if the actions can be ordered in a certain way, then the ranking of the approximate importance measures is often the same as the ranking of the optimal actions; in other words, the rank of the optimal action indicates the importance of the multi-state system at the decision epoch. Further study is needed to examine under which conditions such a relationship holds.

Appendix A A Two-Action Maintenance Problem
We here study a two-action maintenance problem: available maintenance actions are either "doing nothing" or "replacement". Arrange the states w.r.t. the level of degradation: the first state represents the worst machine condition, while the last state represents the pristine condition.
In the context of machine maintenance, if the do-nothing action is taken, then the condition of the machine degrades. Hence, the transition matrix for the do-nothing action, denoted by $P^0 = (p^0_{ss'})$, is a lower triangular matrix, and its main diagonal entries are smaller than 1 except the first entry. For a belief state $\mathbf b$, if we take the non-optimal action $a = 0$, then at the following epoch, action $a = 0$ will still be non-optimal: if the machine is in need of replacement but we do nothing, then at the following epoch the machine is more deteriorated, and hence replacement becomes more urgent.
The action "replacement" (labelled by the number 1) restores the machine condition to brand new. Hence, the transition matrix for the action "replacement", denoted by $P^1 = (p^1_{ss'})$, has the structure that the last column is the vector $\mathbf 1$ while all the other entries are 0. Then it is readily proved that
$$\boldsymbol\tau(\mathbf b, 1, z) = \frac{F^1(z)\, (P^1)^{\mathsf T} \mathbf b}{\mathbf 1^{\mathsf T} F^1(z)\, (P^1)^{\mathsf T} \mathbf b} = (0, \ldots, 0, 1),$$
where $F^a(z) = \mathrm{diag}(f^a_s(z) : s \in S)$ is a diagonal matrix, $\mathbf 1 = (1, \ldots, 1)$ is the column vector of 1's, and $\mathsf T$ is the transpose operator. That is, after the "replacement" action, our belief state changes to $(0, \ldots, 0, 1)$; we know that the machine is now in the pristine state. We write $\mathbf e$ as a notational shorthand for $(0, \ldots, 0, 1)$.
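The structure of $P^0$ and $P^1$ can be sketched as follows; the numerical entries of $P^0$ are hypothetical, only its lower-triangular shape and the form of $P^1$ come from the text. With a null observation, the belief after replacement collapses to $\mathbf e$.

```python
import numpy as np

# State 1 (index 0) is the worst condition, the last state is pristine.
# "Doing nothing" can only degrade the machine, so P0 is lower triangular;
# "replacement" sends every state to the pristine one.
n = 4
P0 = np.array([[1.0, 0.0, 0.0, 0.0],     # the worst state is absorbing
               [0.3, 0.7, 0.0, 0.0],
               [0.1, 0.3, 0.6, 0.0],
               [0.0, 0.2, 0.3, 0.5]])
P1 = np.zeros((n, n))
P1[:, -1] = 1.0                          # replacement restores pristine state

b = np.array([0.1, 0.2, 0.3, 0.4])
b_after_repl = P1.T @ b                  # collapses to e = (0, ..., 0, 1)
```

Note that every main diagonal entry of `P0` except the first is strictly smaller than 1, matching the degradation structure described above.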
For the POMDP $(S, A, Z, p^a_{ss'}, f^a_s(z), R^a_s + w\delta(a = 0), \theta)$, define the (stationary) stopping time $t_w := \min\{t \geq 1 : \text{the action taken at time } t \text{ is replacement}\}$.
Define two vectors of rewards: $R^0 = (R^0_s : s \in S)$ and $R^1 = (R^1_s : s \in S)$. Denote $\tilde R^0 = R^0 + w\mathbf 1$ and $\tilde R^1 = R^1$. Let $\pi^*_w$ be the optimal policy for the POMDP $(S, A, Z, p^a_{ss'}, f^a_s(z), R^a_s + w\delta(a = 0), \theta)$. If we take the do-nothing action for $\mathbf b_0$ and follow the optimal policy afterwards, then the EDR is
$$\mathbb E\Big[ \sum_{t=0}^{t_w - 1} \theta^t \langle \tilde R^0, \mathbf b_t\rangle + \theta^{t_w} \langle \tilde R^1, \mathbf b_{t_w}\rangle + \theta^{t_w + 1} V^{\pi^*_w}(\mathbf e) \Big].$$
If we take the replacement action for $\mathbf b_0$ and follow the optimal policy afterwards, then the EDR is $\langle \tilde R^1, \mathbf b_0\rangle + \theta V^{\pi^*_w}(\mathbf e)$. Hence, action $a = 0$ is optimal for $\mathbf b_0$ if and only if
$$\mathbb E\Big[ \sum_{t=0}^{t_w - 1} \theta^t \langle \tilde R^0, \mathbf b_t\rangle + \theta^{t_w} \langle \tilde R^1, \mathbf b_{t_w}\rangle + \theta^{t_w + 1} V^{\pi^*_w}(\mathbf e) \Big] \geq \langle \tilde R^1, \mathbf b_0\rangle + \theta V^{\pi^*_w}(\mathbf e).$$
Since the subsidy enters the do-nothing side through the additional nonnegative term $w\,\mathbb E[\sum_{t=0}^{t_w - 1} \theta^t]$, which is increasing in $w$, whereas the replacement side collects the subsidy only after the first replacement, increasing $w$ favours the do-nothing action. Therefore, the inactive set increases with the subsidy $w$.
Remark 1. For any action $a \in A$, define the action region $D^a_\pi = \{\mathbf b : \pi(\mathbf b) = a\}$. It is easily seen that the set of belief states where it is optimal to take action 1 is convex (and therefore connected). For any belief states $\mathbf b_1, \mathbf b_2 \in D^1_{\pi^*_w}$ and any $\rho \in [0, 1]$, we have
$$V^{\pi^*_w}(\rho \mathbf b_1 + (1 - \rho)\mathbf b_2) \leq \rho V^{\pi^*_w}(\mathbf b_1) + (1 - \rho) V^{\pi^*_w}(\mathbf b_2) = \rho\big(\langle R^1, \mathbf b_1\rangle + \theta V^{\pi^*_w}(\mathbf e)\big) + (1 - \rho)\big(\langle R^1, \mathbf b_2\rangle + \theta V^{\pi^*_w}(\mathbf e)\big) = \langle R^1, \rho\mathbf b_1 + (1 - \rho)\mathbf b_2\rangle + \theta V^{\pi^*_w}(\mathbf e) \leq V^{\pi^*_w}(\rho\mathbf b_1 + (1 - \rho)\mathbf b_2),$$
where the first inequality uses the fact that $V^{\pi^*_w}(\cdot)$ is a convex function, and the last inequality holds because taking action 1 is one feasible choice. Thus all the inequalities above are equalities, and $\rho\mathbf b_1 + (1 - \rho)\mathbf b_2 \in D^1_{\pi^*_w}$. The region $D^0_{\pi^*_w}$, however, can be disconnected. Under suitable conditions, the optimal policy $\pi^*_w$ can be characterized by a single curve, which partitions the belief state space $\Delta$ into two connected regions $D^0_{\pi^*_w}$ and $D^1_{\pi^*_w}$ (Krishnamurthy, 2016, Chapter 12). Then the importance measure for a belief state $\mathbf b$ is the value of $w$ that makes the switching curve pass through $\mathbf b$. The curve can be estimated via simulation-based stochastic approximation algorithms.

Appendix B Proof of Proposition 1
Given $\mathbf b_t = \mathbf b$ and $a_t = a$, the observation space $Z$ can be divided into $|A|$ different sets $\{Z^{\tilde a}_{\mathbf b, a} : \tilde a \in A\}$ such that
$$\max_{a_{t+1} \in A} \langle R^{a_{t+1}}, \boldsymbol\tau(\mathbf b, a, z)\rangle = \langle R^{\tilde a}, \boldsymbol\tau(\mathbf b, a, z)\rangle, \quad \text{for any } z \in Z^{\tilde a}_{\mathbf b, a}.$$
Then we have
$$V_2(\mathbf b) = \max_{a \in A} \Big\langle R^a + \theta P^a \sum_{\tilde a \in A} F^a(Z^{\tilde a}_{\mathbf b, a}) R^{\tilde a},\ \mathbf b \Big\rangle,$$
where $F^a(Z^{\tilde a}_{\mathbf b, a})$ is a diagonal matrix with the main diagonal entries $\{\int_{Z^{\tilde a}_{\mathbf b, a}} f^a_s(z)\, dz : s \in S\}$. Let the optimal action be denoted by $\ddot a$:
$$\ddot a = \arg\max_{a \in A} \Big\langle R^a + \theta P^a \sum_{\tilde a \in A} F^a(Z^{\tilde a}_{\mathbf b, a}) R^{\tilde a},\ \mathbf b \Big\rangle.$$
Now we subsidize action $\ddot a$ by the amount $w$. Then the observation space $Z$ will be divided into $|A|$ new sets $\{Z^{w, \tilde a}_{\mathbf b, a} : \tilde a \in A\}$ such that
$$\max_{a_{t+1} \in A} \big\{ w\delta(a_{t+1} = \ddot a) + \langle R^{a_{t+1}}, \boldsymbol\tau(\mathbf b, a, z)\rangle \big\} = \begin{cases} w + \langle R^{\ddot a}, \boldsymbol\tau(\mathbf b, a, z)\rangle, & \forall z \in Z^{w, \ddot a}_{\mathbf b, a}; \\ \langle R^{\tilde a}, \boldsymbol\tau(\mathbf b, a, z)\rangle, & \forall z \in Z^{w, \tilde a}_{\mathbf b, a} \text{ and } \tilde a \neq \ddot a. \end{cases}$$
The second-order approximate function $V_2(\mathbf b; w)$ can then be written as
$$V_2(\mathbf b; w) = \max_{a \in A} \Big\{ w\delta(a = \ddot a) + \Big\langle R^a + \theta P^a \sum_{\tilde a \in A} F^a(Z^{w, \tilde a}_{\mathbf b, a}) R^{\tilde a} + w\theta P^a F^a(Z^{w, \ddot a}_{\mathbf b, a}) \mathbf 1,\ \mathbf b \Big\rangle \Big\}.$$
If an action $a \in A \setminus \{\ddot a\}$ were optimal, its value would collect the subsidy only through the term $w\theta\langle P^a F^a(Z^{w, \ddot a}_{\mathbf b, a}) \mathbf 1, \mathbf b\rangle$. On one hand, we have
$$w\theta\big\langle P^a F^a(Z^{w, \ddot a}_{\mathbf b, a}) \mathbf 1, \mathbf b\big\rangle \leq w\theta\big\langle P^a F^a(Z) \mathbf 1, \mathbf b\big\rangle = w\theta < w.$$
On the other hand, the value of action $\ddot a$ collects the full subsidy $w$ through the term $w\delta(a = \ddot a)$, in addition to the nonnegative term $w\theta\langle P^{\ddot a} F^{\ddot a}(Z^{w, \ddot a}_{\mathbf b, \ddot a}) \mathbf 1, \mathbf b\rangle$, while $\ddot a$ already maximizes the unsubsidized value. Therefore, we claim that
$$\arg\max_{a \in A} \Big\{ w\delta(a = \ddot a) + \Big\langle R^a + \theta P^a \sum_{\tilde a \in A} F^a(Z^{w, \tilde a}_{\mathbf b, a}) R^{\tilde a} + w\theta P^a F^a(Z^{w, \ddot a}_{\mathbf b, a}) \mathbf 1,\ \mathbf b \Big\rangle \Big\} = \ddot a,$$
and hence the inactive set $P_2(w)$ increases with the subsidy $w$.