Absorbing Markov Decision Processes

In this paper, we study discrete-time absorbing Markov Decision Processes (MDP) with measurable state space and Borel action space with a given initial distribution. For such models, solutions to the characteristic equation that are not occupation measures may exist. Several necessary and sufficient conditions are provided to guarantee that any solution to the characteristic equation is an occupation measure. Under the so-called continuity-compactness conditions, it is shown that the set of occupation measures is compact in the weak-strong topology if and only if the model is uniformly absorbing. Finally, it is shown that the occupation measures are characterized by the characteristic equation and an additional condition. Several examples are provided to illustrate our results.


Introduction
In this work, we consider a discrete-time absorbing Markov Decision Process (MDP) with measurable state space X and Borel action space A with a given initial distribution denoted by η. We consider general measurable state spaces to cover models that are studied in game theory; see for example [15] and [7] for an analysis of absorbing non-zero sum Markov games. An absorbing MDP is a special type of MDP where some measurable subset ∆ of the state space is considered as absorbing and leads to termination of the process. This means that once the system enters ∆, it remains there indefinitely, without any further transitions and with a null reward or cost. The underlying assumption is that for any policy, the expected time to enter ∆ is finite. For more details on this type of MDP, we refer to the papers and books [2,9,10,18] and the references therein.
The objective of this paper is to study the properties of the solution set of the characteristic equation and of the set of occupation measures. The notion of occupation measure is particularly important and plays a central role in solving constrained Markov decision processes; see for example the references [2,6,13,14,17]. It describes the expected amount of time spent by the state and action processes in any measurable subset of X × A under any policy. Any occupation measure µ satisfies the so-called characteristic equation (see equation (3) for a precise statement), which relates the marginal $\mu^X$ of the measure µ on X, the initial distribution η of the MDP, and a kernel R depending on the transition kernel of the MDP and on the absorbing set ∆. We will show that although the model is absorbing, there may exist solutions to the characteristic equation that are not occupation measures (see Example 3.4). These are called phantom solutions to the characteristic equation (or simply phantom measures). In order to avoid this pathological phenomenon, we will provide in Theorem 3.6 several necessary and sufficient conditions to ensure that there are no phantom measures, thereby guaranteeing that any solution to the characteristic equation is an occupation measure. By strengthening our hypotheses, that is, assuming the so-called continuity-compactness conditions introduced by Schäl in [19] (see Condition (S) below), we will show in our second main result (see Theorem 4.9) that the set of occupation measures is compact in the weak-strong topology (see [4]) if and only if the model is uniformly absorbing. Finally, it will be shown in our last main result (see Theorem 4.5) that under Condition (S) the occupation measures are characterized by the characteristic equation and an additional condition. This is the first notable difference with the discounted model, where the occupation measures are essentially characterized by a single equation of type (1).
Notation and terminology. On a measurable space $(\Omega, F)$ we will consider the set of finite signed measures $M(\Omega)$, the set of finite nonnegative measures $M^+(\Omega)$, and the set of probability measures $P(\Omega)$. For a set $\Gamma \in F$, we denote by $I_\Gamma : \Omega \to \{0, 1\}$ the indicator function of the set $\Gamma$, that is, $I_\Gamma(\omega) = 1$ if and only if $\omega \in \Gamma$. For $\omega \in \Omega$, we write $\delta_{\{\omega\}}$ for the Dirac probability measure at $\omega$ defined on $(\Omega, F)$ by $\delta_{\{\omega\}}(B) = I_B(\omega)$ for any $B \in F$. If $\mu \in M(\Omega)$ and $\Gamma \in F$, we denote by $\mu_\Gamma$ the measure on $(\Omega, F)$ defined by $\mu_\Gamma(B) = \mu(\Gamma \cap B)$ for $B \in F$. The trace σ-algebra of a set $\Gamma \subseteq \Omega$ is denoted by $F_\Gamma$. On $P(\Omega)$, the s-topology is the coarsest topology that makes $\mu \mapsto \mu(D)$ continuous for every $D \in F$.
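Let $(\Omega, F)$ and $(\widetilde\Omega, \widetilde F)$ be two measurable spaces. A kernel on $\widetilde\Omega$ given $\Omega$ is a mapping $Q : \widetilde F \times \Omega \to \mathbb{R}^+$ such that $\omega \mapsto Q(B|\omega)$ is measurable on $(\Omega, F)$ for every $B \in \widetilde F$, and $B \mapsto Q(B|\omega)$ is in $M^+(\widetilde\Omega)$ for every $\omega \in \Omega$. If $Q(\widetilde\Omega|\omega) = 1$ for all $\omega \in \Omega$ then we say that $Q$ is a stochastic kernel. We write $I_\Gamma$ for the kernel on $\Omega$ given $\Omega$ defined by $I_\Gamma(B|\omega) = I_\Gamma(\omega)\delta_{\{\omega\}}(B)$ for $\omega \in \Omega$ and $B \in F$.
Let $Q$ be a stochastic kernel on $\widetilde\Omega$ given $\Omega$. For a bounded measurable function $f : \widetilde\Omega \to \mathbb{R}$, we will denote by $Qf : \Omega \to \mathbb{R}$ the measurable function $Qf(\omega) = \int_{\widetilde\Omega} f(\widetilde\omega)\, Q(d\widetilde\omega|\omega)$ for $\omega \in \Omega$. For a measure $\mu \in M^+(\Omega)$, we denote by $\mu Q$ the finite measure on $(\widetilde\Omega, \widetilde F)$ given by $\mu Q(B) = \int_\Omega Q(B|\omega)\, \mu(d\omega)$ for $B \in \widetilde F$. The product of the σ-algebras $F$ and $\widetilde F$ is denoted by $F \otimes \widetilde F$ and consists of the σ-algebra generated by the measurable rectangles, that is, the sets of the form $\Gamma \times \widetilde\Gamma$ for $\Gamma \in F$ and $\widetilde\Gamma \in \widetilde F$. We denote by $\mu \otimes Q$ the unique probability measure (or finite measure) on the product space $(\Omega \times \widetilde\Omega, F \otimes \widetilde F)$ satisfying $(\mu \otimes Q)(\Gamma \times \widetilde\Gamma) = \int_\Gamma Q(\widetilde\Gamma|\omega)\, \mu(d\omega)$ for $\Gamma \in F$ and $\widetilde\Gamma \in \widetilde F$; see [16] for a proof of existence and uniqueness of such a measure.
Let $(\widehat\Omega, \widehat F)$ be a third measurable space and $R$ a stochastic kernel on $\widehat\Omega$ given $\widetilde\Omega$. Then we will denote by $QR$ the stochastic kernel on $\widehat\Omega$ given $\Omega$ given by $QR(\Gamma|\omega) = \int_{\widetilde\Omega} R(\Gamma|\widetilde\omega)\, Q(d\widetilde\omega|\omega)$ for $\Gamma \in \widehat F$ and $\omega \in \Omega$. Given $\mu \in M(\Omega \times \widetilde\Omega)$, the marginal measures are $\mu^\Omega \in M(\Omega)$ and $\mu^{\widetilde\Omega} \in M(\widetilde\Omega)$ defined by $\mu^\Omega(\cdot) = \mu(\cdot \times \widetilde\Omega)$ and $\mu^{\widetilde\Omega}(\cdot) = \mu(\Omega \times \cdot)$. If $\pi$ is a kernel on $\Omega \times \widetilde\Omega$ given $\widehat\Omega$, the marginal kernels are $\pi^\Omega$ and $\pi^{\widetilde\Omega}$, respectively defined by $\pi^\Omega(\cdot|\widehat\omega) = \pi(\cdot \times \widetilde\Omega|\widehat\omega)$ and $\pi^{\widetilde\Omega}(\cdot|\widehat\omega) = \pi(\Omega \times \cdot|\widehat\omega)$ for $\widehat\omega \in \widehat\Omega$.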
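On finite spaces the kernel notation above reduces to matrix algebra, which can help when checking identities by hand. The following is a minimal sketch under that finite-space assumption; the function names (`measure_times_kernel`, `kernel_times_function`, `compose`, `integrate`) and the numerical data are hypothetical and chosen only for illustration. A kernel is stored as a row-stochastic matrix, a measure as a list of weights, and a function as a list of values.

```python
from fractions import Fraction as F

def measure_times_kernel(mu, Q):
    """(mu Q)(j) = sum_i mu(i) Q(j | i): push a measure through a kernel."""
    return [sum(mu[i] * Q[i][j] for i in range(len(mu))) for j in range(len(Q[0]))]

def kernel_times_function(Q, f):
    """(Q f)(i) = sum_j f(j) Q(j | i): integrate a function against a kernel."""
    return [sum(Q[i][j] * f[j] for j in range(len(f))) for i in range(len(Q))]

def compose(Q, R):
    """(Q R)(k | i) = sum_j R(k | j) Q(j | i): composition of two kernels."""
    return [[sum(Q[i][j] * R[j][k] for j in range(len(R)))
             for k in range(len(R[0]))] for i in range(len(Q))]

def integrate(mu, f):
    """Integral of f with respect to the measure mu."""
    return sum(m * v for m, v in zip(mu, f))

# Illustrative data: Q maps a 2-point space to a 3-point space, R maps back to 2 points.
mu = [F(1, 2), F(1, 2)]
Q = [[F(1, 3), F(2, 3), F(0)],
     [F(0), F(1, 4), F(3, 4)]]
R = [[F(1), F(0)],
     [F(1, 2), F(1, 2)],
     [F(0), F(1)]]
f = [F(1), F(2), F(4)]

lhs = integrate(measure_times_kernel(mu, Q), f)   # integral of f against mu Q
rhs = integrate(mu, kernel_times_function(Q, f))  # integral of Qf against mu
```

Both `lhs` and `rhs` compute the same double sum in a different order, which is the finite version of the identity $(\mu Q)(f) = \mu(Qf)$; likewise `measure_times_kernel(mu, compose(Q, R))` agrees with pushing `mu` through `Q` and then through `R`.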
We say that $f : \Omega \times S \to S'$, where $S'$ is a metric space, is a Carathéodory function if $f(\cdot, s)$ is measurable on $\Omega$ for every $s \in S$ and $f(\omega, \cdot)$ is continuous on $S$ for every $\omega \in \Omega$. The family of the so-defined Carathéodory functions is denoted by $\mathrm{Car}(\Omega \times S, S')$. The family of Carathéodory functions which, in addition, are bounded is denoted by $\mathrm{Car}_b(\Omega \times S, S')$. When the metric space $S$ is separable, any $f \in \mathrm{Car}(\Omega \times S, S')$ is a jointly measurable function on $(\Omega \times S, F \otimes B(S))$; see [1, Lemma 4.51].
If $S$ is a Polish space (a complete and separable metric space), on $M(\Omega \times S)$ we will consider the ws-topology (weak-strong topology), which is the coarsest topology for which the mappings $\mu \mapsto \int_{\Omega \times S} f \, d\mu$ for $f \in \mathrm{Car}_b(\Omega \times S, \mathbb{R})$ are continuous. There are other equivalent definitions of this topology as discussed, for instance, in [11, Section 3.3].
The next disintegration lemma will be useful in the sequel (see Theorem 1 in [20]).
Lemma 1.1 (Disintegration lemma) Let $(\Omega, F)$ be a measurable space and let $S$ be a Polish space. Let $\varphi : \Omega \twoheadrightarrow S$ be a weakly measurable correspondence with nonempty closed values, and let $K$ be the graph of the correspondence. For every $\mu \in M^+(\Omega \times S)$ such that $\mu(K^c) = 0$ there exists a stochastic kernel $Q$ on $S$ given $\Omega$ such that $\mu = \mu^\Omega \otimes Q$ (2) and such that $Q(\varphi(\omega)|\omega) = 1$ for each $\omega \in \Omega$. Moreover, $Q$ is unique $\mu^\Omega$-almost surely, meaning that if $Q$ and $Q'$ are two stochastic kernels that satisfy (2) then for all $\omega$ in a set of full $\mu^\Omega$-measure, the probability measures $Q(\cdot|\omega)$ and $Q'(\cdot|\omega)$ coincide.
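In the finite case the disintegration lemma amounts to dividing each row of a joint measure by its marginal. The sketch below illustrates this under the assumption that both $\Omega$ and $S$ are finite; the function name `disintegrate`, the variable `feasible` (standing in for the correspondence $\varphi$) and the numbers are hypothetical. On points of zero marginal mass the kernel is not determined, so the sketch picks an arbitrary feasible point there, which reflects the almost-sure uniqueness in the lemma.

```python
from fractions import Fraction as F

def disintegrate(mu, feasible):
    """Split a finite joint measure mu[w][s] into its marginal and a stochastic kernel.

    feasible[w] is the set of points s allowed at w (the values of the correspondence);
    mu is assumed to put no mass outside the graph of this correspondence.
    """
    marginal = [sum(row) for row in mu]
    kernel = []
    for w, row in enumerate(mu):
        if marginal[w] > 0:
            # Conditional distribution of s given w.
            kernel.append([m / marginal[w] for m in row])
        else:
            # Marginal-null point: any probability on feasible[w] works; take a Dirac mass.
            s0 = min(feasible[w])
            kernel.append([F(1) if s == s0 else F(0) for s in range(len(row))])
    return marginal, kernel

# Illustrative joint measure on a 3 x 3 grid, supported on the graph of `feasible`.
mu = [[F(1, 4), F(1, 2), F(0)],
      [F(0), F(0), F(0)],
      [F(0), F(1, 8), F(1, 8)]]
feasible = [{0, 1}, {2}, {1, 2}]

marginal, Q = disintegrate(mu, feasible)
# Rebuild the joint measure as marginal ⊗ Q.
rebuilt = [[marginal[w] * Q[w][s] for s in range(3)] for w in range(3)]
```

Here `rebuilt` coincides with `mu`, and every row of `Q` is a probability vector concentrated on the corresponding feasible set.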

The absorbing control model
The main goal of this section is to introduce the parameters defining the model, with a brief presentation of the construction of the controlled process. We also describe the notions of (uniformly) absorbing MDP and give the definition of an occupation measure, providing a first elementary property (see Lemma 2.4).

The control model.
We consider a stationary Markov controlled process (X, A, {A(x) : x ∈ X}, Q, η) consisting of:
• A measurable state space X endowed with a σ-algebra X.
• A Borel space A, representing the action space.
• A family of nonempty measurable sets A(x) ⊆ A for x ∈ X. The set A(x) gives the available actions in state x. Let K = {(x, a) ∈ X × A : a ∈ A(x)} be the family of feasible state-action pairs. We assume that K is a measurable subset of X × A endowed with the σ-algebra X ⊗ B(A).
• A stochastic kernel Q on X given X × A, which stands for the transition probability function.
• An initial distribution given by η ∈ P(X).
Additionally, we assume that a measurable set ∆ ∈ X is given. As we shall see later in Definition 2.2, ∆ will play the role of the absorbing set. The so-defined control model is denoted by M(η, ∆), where we make explicit the dependence on the initial distribution and the absorbing set.
The space of admissible histories of the controlled process up to time n ∈ N is denoted by H n .It is defined recursively by all endowed with their corresponding product σ-algebras.A control policy π is a sequence {π n } n≥0 of stochastic kernels on A given H n , denoted by π n (da|h n ), such that The set of all policies is denoted by Π.
Let us denote by M the set of stochastic kernels ϕ on A given In such a case, we will write ϕ instead of π to emphasize that the corresponding stationary randomized policy π is generated by ϕ.We denote by Π s the set of all stationary randomized policies.
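A policy $\pi = \{\pi_n\}_{n \in \mathbb{N}} \in \Pi$ is called a Markovian randomized policy if there exists a sequence $\{\varphi_n\}_{n \in \mathbb{N}}$ in $M$ such that $\pi_n(\cdot|h_n) = \varphi_n(\cdot|x_n)$ for every $n \in \mathbb{N}$ and $h_n = (x_0, a_0, \ldots, x_n) \in H_n$. Let $\Pi_m$ be the set of all Markovian randomized policies. We have $\Pi_s \subseteq \Pi_m \subseteq \Pi$.
The canonical space of all possible sample paths of the state-action process is $\Omega = (X \times A)^\infty$ endowed with the product σ-algebra $F$. The coordinate projection functions from $\Omega$ to the state space $X$, the action space $A$, and $H_n$ for $n \ge 0$ are respectively denoted by $X_n$, $A_n$, and $H_n$. We will refer to $\{X_n\}_{n \in \mathbb{N}}$ as the state process and to $\{A_n\}_{n \in \mathbb{N}}$ as the action process. It is a well known result that for every policy $\pi \in \Pi$ and any initial probability measure $\lambda$ on $(X, X)$ there exists a unique probability measure $P_{\lambda,\pi}$ on $(\Omega, F)$ such that $P_{\lambda,\pi}(K^\infty) = 1$ and such that for every $n \in \mathbb{N}$, $\Gamma \in X$, and $\Lambda \in B(A)$ we have $P_{\lambda,\pi}\{X_0 \in \Gamma\} = \lambda(\Gamma)$, $P_{\lambda,\pi}\{X_{n+1} \in \Gamma \mid H_n, A_n\} = Q(\Gamma|X_n, A_n)$, and $P_{\lambda,\pi}\{A_n \in \Lambda \mid H_n\} = \pi_n(\Lambda|H_n)$ with $P_{\lambda,\pi}$-probability one. The expectation with respect to $P_{\lambda,\pi}$ is denoted by $E_{\lambda,\pi}$.
Definition 2.1 The hitting time $T_\Delta$ of the set $\Delta$ is given by $T_\Delta : \Omega \to \mathbb{N} \cup \{\infty\}$ defined as $T_\Delta(\omega) = \min\{n \ge 0 : X_n(\omega) \in \Delta\}$, where the min over the empty set is defined as $+\infty$.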
Next we propose the definition of an absorbing control model.Definition 2.2 Given an initial distribution λ ∈ P(X) and ∆ ∈ X, we say that the control model M(λ, ∆) is absorbing if the conditions (a) and (b) below are satisfied, and we say that it is uniformly absorbing if, additionally, condition (c) holds.(c).The following limit holds: We define now the occupation measures of an absorbing control model M(λ, ∆).
We note that the occupation measure µ λ,π takes into account the state-action process up to time T ∆ and it does not count the time spent in ∆: indeed, µ λ,π (∆ × A) = 0. We will consider the following sets of occupation measures Lemma 2.4 For an absorbing control model M(λ, ∆), the set O λ is bounded.
Proof. Note that the total mass of the occupation measure $\mu_{\lambda,\pi}$ is $\mu_{\lambda,\pi}(X \times A) = E_{\lambda,\pi}[T_\Delta]$, which is finite as a consequence of item (b) in Definition 2.2. We also have $\sup_{\pi \in \Pi} E_{\lambda,\pi}[T_\Delta] < \infty$ according to [8, Sections 4.4 and 5.5] for the special case of a Borel state space, or Proposition 2.4(i) in [7] for the general case of a measurable state space. This shows that the set $O_\lambda$ is bounded. ✷
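To make the link between the occupation measure, its total mass and the hitting time concrete, the following sketch computes the occupation measure of a stationary policy on a small finite model. Everything here is an illustrative assumption rather than part of the paper: the three-state model, the function names `occupation_measure` and `characteristic_rhs`, and the use of the form $\mu^X = (\eta + \mu Q) I_{\Delta^c}$ of the characteristic equation that appears later in the proof of Proposition 3.3. The model is acyclic, so the sum over time steps is finite and exact; for models with cycles the loop would only give a truncation.

```python
from fractions import Fraction as F

def occupation_measure(eta, Q, sigma, absorbing, max_steps=1000):
    """Occupation measure mu[x][a] of a stationary policy on a finite model.

    eta[x]      : initial distribution
    Q[x][a][y]  : transition probability to y from the pair (x, a)
    sigma[x][a] : probability of choosing action a in state x
    absorbing   : set of states playing the role of Delta
    """
    n, n_actions = len(eta), len(sigma[0])
    # Mass that has not yet entered Delta at the current time step.
    alive = [F(0) if x in absorbing else eta[x] for x in range(n)]
    mu_X = [F(0)] * n
    for _ in range(max_steps):
        if not any(alive):
            break
        mu_X = [m + a for m, a in zip(mu_X, alive)]
        nxt = [F(0)] * n
        for x in range(n):
            for a in range(n_actions):
                weight = alive[x] * sigma[x][a]
                for y in range(n):
                    nxt[y] += weight * Q[x][a][y]
        # Time spent inside Delta is not counted.
        alive = [F(0) if y in absorbing else nxt[y] for y in range(n)]
    return [[mu_X[x] * sigma[x][a] for a in range(n_actions)] for x in range(n)]

def characteristic_rhs(mu, eta, Q, absorbing):
    """Evaluate (eta + mu Q) restricted to the complement of Delta."""
    n = len(eta)
    rhs = []
    for y in range(n):
        flow = sum(mu[x][a] * Q[x][a][y] for x in range(n) for a in range(len(mu[x])))
        rhs.append(F(0) if y in absorbing else eta[y] + flow)
    return rhs

# Hypothetical model: states 0, 1, 2 with Delta = {2}, two actions, start in state 0.
eta = [F(1), F(0), F(0)]
Q = [[[F(0), F(1), F(0)], [F(0), F(1, 2), F(1, 2)]],
     [[F(0), F(0), F(1)], [F(0), F(0), F(1)]],
     [[F(0), F(0), F(1)], [F(0), F(0), F(1)]]]
sigma = [[F(1, 2), F(1, 2)], [F(1), F(0)], [F(1), F(0)]]
delta = {2}

mu = occupation_measure(eta, Q, sigma, delta)
mu_X = [sum(row) for row in mu]   # X-marginal of the occupation measure
total_mass = sum(mu_X)            # equals the expected hitting time of Delta
```

In this toy model the hitting time equals 1 with probability 1/4 and 2 with probability 3/4, so its expectation is 7/4, which is the value of `total_mass`; the marginal `mu_X` also coincides with `characteristic_rhs(mu, eta, Q, delta)`.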

The characteristic equations
In this section, we define the notion of phantom measure. We show that for absorbing MDPs, phantom measures may exist. In order to avoid this pathological phenomenon, we establish our first main result (see Theorem 3.6), which provides several necessary and sufficient conditions to ensure that there are no phantom measures.
We denote by C η the family of all solutions of the characteristic equations.
Lemma 3.2 Suppose that the control model M(η, ∆) is absorbing.Given any π ∈ Π and σ ∈ M, we have Proof.For notational convenience, let us introduce the functions h t and h on X taking values in [0, 1] as h t (x) = P x,σ {T ∆ > t} and h(x) = P x,σ {T ∆ = ∞} respectively.Note that h t vanishes on ∆ and also that h t is a decreasing sequence of measurable functions which converges pointwise to h(x) = P x,σ {T ∆ = ∞}.To prove the result, we proceed by contradiction and suppose that hdµ X η,π > 0. By definition of the occupation measure µ η,π and since h vanishes on ∆, this implies that there exist some s ≥ 0 with E η,π [h(X s )] > 0 and, in particular, since Define the strategy γ ∈ Π as follows: γ k = π k for 0 ≤ k ≤ s − 1 and γ k = σ for every k ≥ s (in case that s = 0, this definition reduces to γ = σ).Consequently, the distribution of X s is the same under P η,π and P η,γ , and we have Recalling (4), this implies that the series in contradiction with E η,γ [T ∆ ] being finite.This gives the result.✷ Proposition 3.3 Suppose that the control model M(η, ∆) is absorbing.
(i).Given any π ∈ Π, its occupation measure µ η,π satisfies the characteristic equations.Therefore, we have the inclusion Proof.(i).We have already shown that µ η,π is a finite measure on X × A. To prove the stated result, note that for any B ∈ X we have Observe now that for each t ≥ 1 the conditional probability within brackets vanishes on the set {T ∆ ≤ t − 1}, and so which can be equivalently written precisely as µ X = (η + µQ)I ∆ c .By construction of the stateaction process, it is clear that µ(K c ) = 0.This shows that µ η,π indeed verifies the characteristic equations.
(ii).We will show that for any π ∈ Π there is some σ ∈ M with µ η,π = µ η,σ .Since µ η,π is in M + (X × A) with µ η,π (K c ) = 0, the measure µ η,π can be disintegrated as It follows that the characteristic equation can be written Iterating this equation we obtain that for any t ≥ 0 We have, for any x ∈ X and t ≥ 0, and P x,σ {T ∆ > t} ↓ P x,σ {T ∆ = ∞} as t → ∞ for any x ∈ X. Applying Lemma 3.2, we obtain that lim t→∞ µ X η,π (Q σ I ∆ c ) t (X) = 0. Taking the limit as t → ∞ in (6) we obtain Once we have shown that the X-marginals of the occupation measures of µ η,π and µ η,σ coincide, we conclude the result from (5). ✷ The result in Proposition 3.3 does not exclude the possibility that there are measures in C η but not in O η .Such measures are called phantom measures.In other words, a phantom measure µ ∈ C η satisfies the characteristic equations but it is not the occupation measure of any policy in Π.The following simple example illustrates the existence of phantom measures.
. Therefore, this implies that any measure ν K for K > 0 defined by ν K (0) = 0, ν K (1) = 1 and ν K (n) = 1/2β n−2 for n ≥ 2 and ν K (−1) = ν K (−2) = K is a phantom measure.It will be shown in Theorem 4.5 that the occupation measures are characterized by (3) and an additional condition.This is the first notable difference with the discounted model, where the occupation measures are essentially characterized by a single equation of type (3), that is, µ(K c ) = 0 and µ X = η + αµQ (where α is the discount factor) excluding the existence of phantom measures.Definition 3.5 Let M(η, ∆) be an absorbing control model.Given a measure ϑ ∈ M + (X × A) we say that ϑ is invariant for the kernel QI ∆ c when ϑ(K c ) = 0 and ϑ X = ϑQI ∆ c .
Let us make some comments on this definition.Note that we are not excluding that ϑ is the null measure.We call such a measure invariant because, by disintegration, there exists σ ∈ M satisfying ϑ X = ϑ X Q σ I ∆ c which can be written as for measurable B ⊆ ∆ c , and so ϑ X is an invariant measure for the substochastic kernel Q σ on ∆ c .Theorem 3.6 Let M(η, ∆) be absorbing.

(i).
A measure is in C η if and only if it can be decomposed as the sum of an occupation measure µ η,σ in O η and an invariant measure ν ⊗ σ for QI ∆ c with σ ∈ M and ν ∈ M + (X).
(ii).The following statements are equivalent.
(a).The unique invariant measure for QI ∆ c is the null measure on X × A.
Proof.(i).Suppose that µ ∈ M + (X × A) is a solution of the characteristic equations.Proceeding as in the proof of Proposition 3.3, we derive the existence of σ ∈ M such that µ = µ X ⊗ σ and for any t ≥ 1.It follows that by taking the limit as we get that ν = νQ σ I ∆ c .The measure ϑ = ν ⊗ σ satisfies the conditions in the statement of this proposition: µ = µ η,σ + ϑ.
Conversely, it is straightforward to check that the sum of an invariant measure and an occupation measure (which satisfies the characteristic equations) satisfies itself the characteristic equations and, hence, belongs to C η .In fact, the sum of any measure in C η and an invariant measure lies in C η .
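The verification can be written out in one display. This is only a sketch of the omitted computation; it uses the characteristic equation in the form $\mu^X = (\eta + \mu Q) I_{\Delta^c}$ with $\mu(K^c) = 0$, the invariance property $\vartheta^X = \vartheta Q I_{\Delta^c}$ with $\vartheta(K^c) = 0$ from Definition 3.5, and linearity of the marginal and of $\mu \mapsto \mu Q$.

```latex
\begin{align*}
(\mu + \vartheta)^X
  &= \mu^X + \vartheta^X \\
  &= (\eta + \mu Q)\, I_{\Delta^c} + \vartheta Q\, I_{\Delta^c} \\
  &= \bigl(\eta + (\mu + \vartheta) Q\bigr)\, I_{\Delta^c},
\qquad
(\mu + \vartheta)(K^c) = \mu(K^c) + \vartheta(K^c) = 0 .
\end{align*}
```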
(ii). The implication (a) ⇒ (b) follows directly from item (i), while (b) ⇒ (c) is derived from Lemma 2.4. We prove (c) ⇒ (a) by contradiction and, hence, we suppose that there exists a non-null invariant measure ϑ. For any K > 0 we have that Kϑ is also an invariant measure, and so for any $\mu \in C_\eta$ and every K > 0 we have $\mu + K\vartheta \in C_\eta$, which is not compatible with $C_\eta$ being bounded. ✷
Our next result gives some insight on the behavior of phantom measures.
Proof.By disintegration of ϑ there exists σ ∈ M with ϑ X = ϑ X Q σ I ∆ c .Iterating this equation we obtain ϑ X = ϑ X Q t σ I ∆ c for every t ≥ 1 and taking the limit as t → ∞ yields and so the set B = {x ∈ X : P x,σ {T ∆ = ∞} < 1} satisfies ϑ X (B) = 0.By the Lebesgue decomposition theorem there exist two finite measures ϑ Therefore, Applying Lemma 3.2, we obtain that P x,σ {T ∆ = ∞} vanishes µ X η,π -a.s. and so Observe that ϑ X 2 (B) = 0 and so This shows that the measure ϑ X 1 is null and, therefore, ϑ X = ϑ X 2 , showing the result.✷ This result establishes, loosely speaking, that the state process {X t } under any policy π ∈ Π never visits the support of a non-null invariant measure.
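The phenomenon can be reproduced on a three-state toy model, which is a hypothetical illustration and not the example of the paper. There is a single action, so measures on X × A are identified with measures on X. State 0 plays the role of ∆, state 1 is the initial state and jumps to ∆, and state 2 loops on itself but cannot be reached from state 1. The helper names (`off_delta`, `push`, `solves_characteristic_equation`, `is_invariant`) are assumptions made for this sketch, and the characteristic equation is taken in the form $\mu^X = (\eta + \mu Q) I_{\Delta^c}$.

```python
from fractions import Fraction as F

ETA = [F(0), F(1), F(0)]   # initial distribution: start in state 1
P = [                      # P[x][y]: one-step transition under the single action
    [F(1), F(0), F(0)],    # state 0 (= Delta) is absorbing
    [F(1), F(0), F(0)],    # state 1 jumps straight into Delta
    [F(0), F(0), F(1)],    # state 2 loops forever and is unreachable from state 1
]
DELTA = {0}

def off_delta(measure):
    """Restrict a measure on the states to the complement of Delta."""
    return [F(0) if x in DELTA else m for x, m in enumerate(measure)]

def push(measure):
    """Apply the transition matrix, then restrict to the complement of Delta."""
    n = len(measure)
    return off_delta([sum(measure[x] * P[x][y] for x in range(n)) for y in range(n)])

def solves_characteristic_equation(mu):
    """Check mu = (eta + mu P) restricted to the complement of Delta."""
    return mu == [e + f for e, f in zip(off_delta(ETA), push(mu))]

def is_invariant(theta):
    """Check theta = theta P restricted to the complement of Delta."""
    return theta == push(theta)

occupation = [F(0), F(1), F(0)]   # one unit of time spent in state 1 before absorption
theta = [F(0), F(0), F(5)]        # mass 5 sitting on the unreachable loop
phantom = [o + t for o, t in zip(occupation, theta)]
```

In this sketch `occupation` and `phantom` both solve the characteristic equation, `theta` is a non-null invariant measure, and `phantom` has total mass 6 whereas the expected hitting time of ∆ is 1, so `phantom` cannot be an occupation measure. The invariant part is carried by state 2, which the state process started from `ETA` never visits, in line with the remark above.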

Compactness of occupation measures
In this section, we introduce an additional hypothesis: we assume the so-called continuity-compactness conditions (see Condition (S) below). In this context, we show in our second main result (see Theorem 4.9) that the set of occupation measures is compact in the weak-strong topology if and only if the model is uniformly absorbing. Finally, we show in our last main result (see Theorem 4.5) that the occupation measures are characterized by the characteristic equation and an additional condition. This distinguishes the absorbing model from the discounted model, where occupation measures are essentially characterized by a single characteristic equation.

Condition (S)
(S 1 ) The action set A is compact and the correspondence from X to A given by x → A(x) is weakly measurable with nonempty compact values.
(S 2 ) For any x ∈ X and Γ ∈ X, the mapping a → Q(Γ|x, •) is continuous on A. Lemma 4.1 Suppose that the Conditions (S 1 )-(S 2 ) are satisfied.There exists ξ * ∈ M such that Proof.The multifunction from X to A defined by x → A(x) is weakly measurable.By (S 1 ) and Corollary 18.15 in [1], we obtain the existence of a sequence {ξ n } n∈N of measurable selectors for the multifunction x → A(x) satisfying Define now ξ * ∈ Π s by means of To prove the result, fix arbitrary (x, a) ∈ K and B ∈ X such that Q ξ * (B|x) = 0.This implies that Q(B|x, ξ * k (x)) = 0 for all k ∈ N. Using (S 2 ) and ( 9) we obtain that Q(B|x, a) = 0. ✷ Based on this result, we define λ β ∈ P(X) as where β is some fixed parameter with 0 < β < 1.
For every π ∈ Π we have Proof: By Proposition 3.3(ii), there exists σ ∈ M such that We are going to show that for every k ≥ 1 and x ∈ X we have The proof is by induction.
For the case k = 1 suppose that Q ξ * (B|x) = 0.By Lemma 4.1 this implies that Q(B|x, a) = 0 for all a ∈ A(x) and Q σ (B|x) = 0 follows.Assuming the result true for some k and, for k + 1, suppose that Q k+1 ξ * (B|x) = 0. Note that By the induction hypothesis, we have that Q k σ (B|y) = 0 for all y ∈ C with Q σ (C|x) = 1.This shows that, indeed, Q k+1 σ (B|x) = 0.As a direct consequence we have that and the stated result follows from (11).✷ Proposition 4.3 Under Conditions (S 1 )-(S 2 ), if the control model M(η, ∆) is absorbing then M(λ β , ∆) is also absorbing.
Proof.Consider arbitrary π ∈ Π.We have the following equalities Observe that using the policy π for the initial distribution ηQ k ξ * is equivalent to using the policy γ k ∈ Π given by γ k j (da|x 0 , a 0 , . . ., x j ) = ξ * (da|x j ) for 0 ≤ j < k and γ k j (da|x 0 , a 0 , . . ., x j ) = π j−k (da|x k , a k , . . ., x j ) for j ≥ k for the initial distribution η just by make a shift of k time units.Therefore, It follows that The model M(η, ∆) being absorbing, we have that and so E λ β ,π [T ∆ ] ≤ c, which shows that M(λ β , ∆) is absorbing as well.✷ Proposition 4.4 Suppose that M(η, ∆) is absorbing and that the Conditions (S 1 )-(S 2 ) are satisfied.Let Γ be an arbitrary subset of Π s and let {h π } π∈Γ be a family of non-negative functions in L 1 (X, X, λ β ) which are uniformly λ β -integrable.Under these conditions, Proof.Consider a fixed arbitrary ǫ > 0. By the uniform integrability hypothesis, there exists c ǫ > 0 such that sup π∈Γ {x∈X:hπ(x)>cǫ} Therefore, for any π ∈ Γ and t ≥ 1 By Proposition 4.3 we have that the above supremum is finite.Hence, for t sufficiently large we obtain that sup and the result follows.✷ Our next result gives a characterization of O η based on the probability measure λ β .
Theorem 4.5 Let M(η, ∆) be an absorbing model satisfying the conditions Conversely, let us assume that µ ∈ C η is such that µ X ≪ λ β .Using Theorem 3.6(i), we can find σ ∈ M and an invariant measure ϑ ∈ M + (X × A) for the kernel QI ∆ c such that This implies that ϑ X ≪ λ β and so for every t ≥ 1 Applying Proposition 4.4 to the set Γ = {σ} and the function dϑ X /dλ β , we can take the limit as t → ∞ in the last expression to obtain that ϑ X (X) = 0.This shows that, indeed, µ ∈ O η .✷ Lemma 4.6 Assume that the Conditions (S 1 )-(S 2 ) hold.Let {ν π } π∈Π s be a relatively s-compact subset of M + (X).The set By hypothesis, Λ X is relatively s-compact and Λ A is tight since A is a compact metric space.Therefore, Λ is relatively sequentially ws-compact by using Theorem 2.5 in [4].So, there exists a sequence {π n } in Π s such that We have that sup π∈Π s t>k P η,πµ {T ∆ > t} as k → ∞ since M(η, ∆) is uniformly absorbing by hypothesis.Moreover, the leftmost term in (15) converges to zero as k → ∞ by using Lemma 4.7 showing that the limit ( 14) holds.This shows that O η is relatively compact for the ws-topology.
(b) ⇒ (a).Since O η is relatively compact for the ws-topology, it follows from [4, Theorem 5.2] that the set of X-marginal measures of O η = O s η is relatively s-compact.Recalling that µ X η,π ≪ λ β for any π ∈ Π s , using Lemma 4.2, Proposition 2.2 in [4], and Corollary 2.7 in [12], we get that the family {h π } π∈Π s of density functions h π = dµ X η,π /dλ β is uniformly λ β -integrable.Now, observe that for π ∈ Π s , and by using Proposition 4.4 we can conclude that the rightmost term in the previous equation converges to zero uniformly in π ∈ Π s as t → ∞.This establishes that M(η, ∆) is indeed uniformly absorbing.
(b) ⇔ (c).This is obvious using Lemma 4. • Q (i, j + 1) | (i, j), s = 1 for i ∈ N and 0 ≤ j < 2 i − 1; • Q x | (i, 2 i − 1), s = 1 for i ∈ N; • Q(x | x, s) = 1.In [9,Example 3.13], it has been shown that this model is absorbing to ∆ = {x} with initial distribution η = δ (0,0) but not uniformly absorbing.Our objective is to illustrate by means of this example that the set of occupation measures O η is not compact for the ws-topology, as can be derived from Theorem 4.9.
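Let us consider the sequence $\{\gamma_t\}_{t \in \mathbb{N}}$ of nonrandomized stationary policies in $\Pi_s$ defined as follows. The policy $\gamma_t(da|x)$ takes the action c for all states $(i, 0)$ with $0 \le i < t$ and takes the action s for all states $(i, 0)$ with $i \ge t$. If $O_\eta$ were relatively compact, then for any decreasing sequence of sets $\Gamma_n \downarrow \emptyset$ we would have $\lim_{n \to \infty} \sup_{\mu \in O_\eta} \mu^X(\Gamma_n) = 0$ (16). Let us consider $\Gamma_n = \{(i, j) \in X : i \ge n \text{ and } 1 \le j \le 2^i - 1\}$, which indeed satisfies $\Gamma_n \downarrow \emptyset$. Observe that $P_{\eta,\gamma_n}\{X_{n+j} = (n, j)\} = 1/2^n$ for $n \ge 1$ and $j \in \{0, \ldots, 2^n - 1\}$. Therefore, $\mu^X_{\eta,\gamma_n}(\Gamma_n) \ge \sum_{j=1}^{2^n - 1} 1/2^n = 1 - 1/2^n$, which does not converge to zero as $n \to \infty$, in contradiction with (16). This shows that, for this control model which is not uniformly absorbing, the set of occupation measures $O_\eta$ is not compact.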
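The arithmetic behind this conclusion can be checked directly. The sketch below relies only on the visit probabilities stated above, namely that under $\gamma_n$ each state $(n, j)$ is visited at time $n + j$ with probability $1/2^n$; the function name `mass_on_gamma_n` is hypothetical. It sums these probabilities over the states $(n, j)$ with $1 \le j \le 2^n - 1$, which all belong to $\Gamma_n$, and therefore gives a lower bound on the mass that the occupation measure of $\gamma_n$ puts on $\Gamma_n$.

```python
from fractions import Fraction

def mass_on_gamma_n(n):
    """Lower bound on the occupation-measure mass of Gamma_n under the policy gamma_n.

    Each state (n, j) with 1 <= j <= 2**n - 1 lies in Gamma_n and is visited
    with probability 1 / 2**n, so the bound is the sum of these probabilities.
    """
    visit_prob = Fraction(1, 2 ** n)
    return sum(visit_prob for _ in range(1, 2 ** n))

# The bound equals 1 - 1/2**n, which increases to 1 instead of vanishing.
bounds = [mass_on_gamma_n(n) for n in range(1, 6)]
```

The list `bounds` is increasing and approaches 1, so the supremum over policies of the mass of $\Gamma_n$ cannot tend to zero even though $\Gamma_n$ decreases to the empty set.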