Off-Policy Confidence Interval Estimation with Confounded Markov Decision Process

This paper is concerned with constructing a confidence interval for a target policy's value offline, based on pre-collected observational data in infinite horizon settings. Most existing works assume that no unmeasured variables exist that confound the observed actions. This assumption, however, is likely to be violated in real applications such as healthcare and technological industries. In this paper, we show that with some auxiliary variables that mediate the effect of actions on the system dynamics, the target policy's value is identifiable in a confounded Markov decision process. Based on this result, we develop an efficient off-policy value estimator that is robust to potential model misspecification and provide rigorous uncertainty quantification. Our method is justified by theoretical results, simulation studies, and a real dataset obtained from a ride-sharing company. A Python implementation of the proposed procedure is available at https://github.com/Mamba413/cope.


Introduction
We consider reinforcement learning (RL) where the goal is to learn an optimal policy that maximizes the (discounted) cumulative rewards the decision maker receives (Sutton and Barto; 2018). A (stationary) policy is a time-homogeneous decision rule that determines an action based on a set of observed state variables. Off-policy evaluation (OPE) aims to evaluate the impact of a given policy (called the target policy) using observational data generated by a potentially different policy (called the behavior policy). OPE is an important problem in settings where it is expensive or unethical to directly run an experiment that implements the target policy. This includes applications in precision medicine (Murphy; 2003; Zhang et al.; 2012, 2013). This paper is concerned with OPE under infinite horizon settings where the number of decision points is not necessarily fixed and is allowed to diverge to infinity. We remark that most works in the statistics literature focus on learning and evaluating treatment decision rules for precision medicine with only a few treatment stages (see Tsiatis et al.; 2019; Kosorok and Laber; 2019, for an overview). These methods are not directly applicable to many other sequential decision making problems in reinforcement learning with infinite horizons (see e.g., Sutton and Barto; 2018), such as autonomous driving, robotics and mobile health (mHealth). Recently, there has been growing interest in policy learning and evaluation in mHealth applications (Ertefaie; 2014; Luckett et al.; 2020; Shi, Zhang, Lu and Song; 2022; Hu et al.; 2020; Qi and Liao; 2020; Xu et al.; 2020; Liao et al.; 2020). In the computer science literature, existing works for OPE in infinite horizons can be roughly divided into three categories. The first type of method directly derives the value estimates by learning the system transition matrix or the Q-function under the target policy (Le et al.; 2019; Feng et al.; 2020; Hao et al.). The second type of method is built upon importance sampling (IS), which re-weights the observed rewards with the density ratio of the target and behavior policies (Thomas et al.; 2015; Liu et al.; 2018; Nachum et al.; 2019; Dai et al.; 2020). The last type of method combines the first two for more robust and efficient value evaluation. References include Jiang and Li (2016); Uehara et al.
(2020); Kallus and Uehara (2019). In particular, Kallus and Uehara (2019) develops a double reinforcement learning (DRL) estimator that achieves the semiparametric efficiency limits for OPE. Informally speaking, a semiparametric efficiency bound can be viewed as the nonparametric extension of the Cramer-Rao lower bound in parametric models (Bickel et al.; 1993). It lower bounds the asymptotic variance among all regular estimators (Van der Vaart; 2000). However, all the above cited works rely on the sequential ignorability or the sequential randomization assumption (see e.g., Robins; 2004, for a detailed definition). This assumption essentially precludes the existence of unmeasured variables that confound the action-reward or action-next-state associations. However, it is likely to be violated in applications such as healthcare and technological industries. We consider the following example to elaborate.
Our work is motivated by the problem of evaluating the effects of customer recommendation programs in large-scale ride-hailing platforms such as Uber, Lyft and Didi.
These companies form a typical two-sided market that enables efficient interactions between passengers and drivers (Rysman), and have substantially transformed the transportation landscape (Jin et al.; 2018).
Suppose a customer launches a ride-hailing application on their smart phone. When they enter their destination, the platform decides whether to recommend that they join a program. This corresponds to the action. Different programs apply different coupons to discount this ride. The purpose of such a recommendation is to (i) increase the chance that the customer orders this particular ride, and reduce the local drivers' vacancy periods; and (ii) increase the chance that the customer uses the app more frequently in the future. We remark that (i) and (ii) correspond to the short-term and long-term benefits for the company, respectively.
We would like to evaluate the cumulative effect of a given customer recommendation program given an observational dataset collected from the ride-hailing company. In addition to a point estimate on a target policy's value, many applications would benefit from having a confidence interval (CI) that quantifies the uncertainty of the value estimates. For instance, it allows us to infer whether the difference between two policies' values is statistically significant. This motivates us to study the off-policy confidence interval estimation problem.
Confounding is a serious issue in data generated from these applications. This is because the behavior policy involves not only an estimated automated policy that maximizes the company's long term rewards, but human interventions as well. For example, when there is severe weather such as thunderstorms, or large events such as sports games and concerts in a certain area, there will be many more passengers than drivers in the local area. In that case, human interventions are needed to discourage passengers from requesting call orders in this area. However, live events and extreme weather are not recorded, leading to a confounded dataset.
More recently, in the causal inference literature, a few methods have been proposed to deal with unmeasured confounders for treatment effect evaluation. Tchetgen Tchetgen et al. (2020) proposed a proximal g-computation algorithm in single-stage and two-stage studies. Shi, Miao, Nelson and Tchetgen Tchetgen (2020) proposed to learn the average treatment effect (ATE) with double-negative control adjustment. These methods are not directly applicable to the infinite horizon setting, which is the focus of our paper. In the RL literature, a few works considered reinforcement learning with confounded datasets. Among those available, Wang et al. (2020) considered learning an optimal policy in an episodic confounded MDP setting. Namkoong et al. (2020) and Kallus and Zhou (2020) proposed partial identification bounds on the target policy's value under a single-decision confounding assumption and a memoryless unobserved confounding assumption, respectively. An optimal balancing algorithm has also been introduced for OPE in a confounded MDP without requiring mediators to exist. Tennenholtz et al.
(2020) adopted the POMDP model to formulate the confounded OPE problem and developed value estimators in tabular settings using the idea of proxy variables. More recently, a few works have extended their method to more general settings (Nair and Jiang; Shi, Uehara, Huang and Jiang; 2022). However, none of the aforementioned methods considers constructing confidence intervals for the target policy's value in infinite horizons.
In this paper, we model the observational data by a confounded Markov decision process (CMDP, Zhang and Bareinboim;2016). See Section 2.1 for a detailed description of the model. To handle unmeasured confounders, we make use of some intermediate variables (mediators) that mediate the effect of actions on the system dynamics. These mediators are required to be conditionally independent of the unmeasured confounders given the actions.
We remark that these auxiliary variables exist in several applications.
For instance, in the ride-hailing example, the mediator corresponds to the final discount applied to each ride. It is worth mentioning that the final discount might differ from the discount included in the program, as it depends on other promotion strategies the platform applies to the ride, but it is conditionally independent of other unmeasured variables that confound the action. In addition, the action will affect the immediate reward and future state variables only through the mediator (see our real data section for a detailed definition of the immediate reward). Consequently, the mediator satisfies the desired condition. Predictive policing is another example. Consider the Crime Incidents dataset (Elzayn et al.; 2019). The action is whether a district is labeled as dangerous or not and the outcome is the total number of discovered crime incidents. Given the action, the police allocation (mediator) is determined by the currently available policing resources and is thus conditionally independent of the confounder. In addition, in medicine, the treatment (action) and the patient's outcome might be confounded by the patient's attitude towards different treatments. For example, some patients might prefer conservative treatments, whereas others strictly stick to the doctor's advice. However, given the treatment, the dosage that patients receive is determined by their age, weight and clinical conditions, and is thus conditionally independent of the confounder.
To the best of our knowledge, this is the first paper that systematically studies off-policy confidence interval estimation under infinite horizon settings with unmeasured confounders.
Most prior work either requires the no unmeasured confounders assumption, or focuses on point estimation. More importantly, our proposal addresses an important practical question in ride-sharing companies, allowing them to evaluate different customer recommendation programs more accurately in the presence of unmeasured confounders. Our proposal involves two key components. We first show that in the presence of mediators, the target policy's value can be represented using the probability distribution that generates the observational data. This result generalizes the front-door adjustment formula (see e.g., Pearl) used to infer the average treatment effect in single-stage decision making. Based on this result, we next apply the semiparametric theory (see e.g., Tsiatis; 2007) to derive the efficiency limits for OPE under CMDP with mediators, and outline a robust and efficient value estimator that achieves this efficiency bound, together with its associated CI.
The rest of the paper is organized as follows. Section 2 lays out the basic model notation and data generating process. Section 3 discusses the identifiability of the policy value and constructs an efficient and robust interval estimator. Section 4 presents the asymptotic properties of the proposed estimator together with its inferential results. Section 5 presents two simulation studies that evaluate the performance of our proposed estimator and compare it with state-of-the-art methods using synthetic data only. In Section 6, the proposed estimator is applied to analyze real data collected from a world-leading ride-hailing company. All proofs are given in the supplementary material.

Data Generating Process
We consider observational data generated from a confounded Markov decision process.
Specifically, at a given time t, let (S t , A t , R t ) denote the observed state-action-reward triplet.
A standard MDP without confounding is depicted in Figure 1a. We assume both the state and action spaces are discrete, and the immediate rewards are uniformly bounded. The discrete state-space assumption is imposed only to simplify the theoretical analysis; our proposal is equally applicable to continuous state spaces. Let U t denote the set of unmeasured variables at time t that confound either the A t -R t or A t -S t+1 associations, as shown in Figure 1b. Such a data generating process excludes the existence of confounders that are influenced by past actions, leading to "memoryless unmeasured confounding" (Kallus and Zhou; 2020). It yields the following Markov assumption: Assumption 1. U t and other observed variables at time t are conditionally independent of {U j } j<t and past observed variables up to time t − 1 given S t .
To deal with unmeasured confounders, we assume there exist some observed intermediate variables M t that mediate the effect of A t on R t and S t+1 at time t, as shown in Figure 1c.
See Assumption 2 below. This assumption is similar to the front-door adjustment criterion (Pearl) in single-stage decision making and is considered by Wang et al. (2020) as well for multi-stage decision making.
For any two nodes X and Y , a backdoor path from X to Y is a path that would remain if we were to remove any arrows pointing out of X. We revisit Figure 1c to elaborate on Assumption 2. Specifically, Assumption 2(a) requires that A t has no direct effect on S t+1 other than through M t , i.e., the corresponding pathway is missing. Without Assumption 2(a), we can only identify the natural indirect treatment effect (Fulcher et al.; 2020) and the policy value is not identifiable. Under Assumptions 2(b) and 2(c), U t will not directly affect M t . Assumption 2(b) essentially requires that there are no unmeasured variables that confound the A t -M t association. Similarly, Assumption 2(c) requires that there are no unmeasured variables that confound the M t -S t+1 association.
We next detail the data generating process. At time t, we observe the state vector S t and the environment randomly selects some unmeasured confounder U t ∼ p u (•|S t ).
Then the agent takes the action A t ∼ p a (•|S t , U t ) and the mediator M t is generated from p m (•|A t , S t ), which is not confounded by U t according to Assumption 2. Finally, the agent receives a reward R t ∼ p r (•|M t , A t , S t , U t ) and the environment transits into the next state S t+1 . We refer to such a stochastic process as the confounded MDP with mediators, or CMDPWM for short.
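To make the sampling order concrete, the following Python sketch simulates a short CMDPWM trajectory. The specific conditional probabilities below are illustrative placeholders rather than the distributions used in our experiments; note that the confounder U t is drawn and used to generate the data but is never recorded.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cmdpwm_step(s):
    """One CMDPWM transition, following the sampling order of Section 2:
    U_t ~ p_u(.|S_t), A_t ~ p_a(.|S_t, U_t), M_t ~ p_m(.|A_t, S_t),
    then R_t and S_{t+1} given (M_t, A_t, S_t, U_t).
    All conditional probabilities are hypothetical placeholders."""
    u = rng.binomial(1, 0.3 + 0.4 * s)                                   # unmeasured confounder
    a = rng.binomial(1, sigmoid(0.2 * s + 1.0 * u))                      # confounded behavior action
    m = rng.binomial(1, sigmoid(0.1 * s + 0.8 * (a - 0.5)))              # mediator, not confounded by U_t
    r = rng.binomial(1, sigmoid(0.5 * m + 0.3 * a + 0.5 * u - 0.2 * s))  # reward
    s_next = rng.binomial(1, sigmoid(0.4 * m + 0.3 * u))                 # next state (no direct A_t effect)
    return a, m, r, s_next

s, trajectory = 0, []
for t in range(5):
    a, m, r, s_next = cmdpwm_step(s)
    trajectory.append((s, a, m, r, s_next))   # U_t is not stored in the observed data
    s = s_next
print(trajectory)
```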

Problem Formulation
The data consist of N trajectories, summarized as {(S i,t , A i,t , M i,t , R i,t ) : 0 ≤ t < T i , 1 ≤ i ≤ N }, where T i corresponds to the termination time of the ith trajectory. We assume these trajectories are independent and identically distributed. Let π denote a given stationary policy that maps the state space to a probability mass function on the action space A. Following π, at each time t, the decision maker will set A t = a with probability π(a|S t ) for any a ∈ A. Unlike the behavior policy p a , the probability mass function π does not depend on the unmeasured confounders. For a given discount factor 0 ≤ γ < 1, we define the corresponding (state) value function as
V π (s) = Σ t≥0 γ t E π (R t |S 0 = s),     (1)
where the expectation E π is defined by assuming the system follows the policy π. Based on the observed data, our objective is to learn the aggregated value η π = E{V π (S 0 )}, where the expectation is taken with respect to the initial state distribution, and to construct its associated confidence interval.
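When a simulator of the environment is available (as when we approximate ground-truth values via Monte Carlo in Sections 5 and 6), η π can be approximated by rolling out the target policy while intervening on the action. The dynamics below are illustrative placeholders; the sketch only shows how the discounted return and its average over the initial state distribution are formed.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
gamma = 0.9

def target_policy(s):
    # A toy stochastic target policy: choose action 1 with probability sigmoid(0.3*s).
    return rng.binomial(1, sigmoid(0.3 * s))

def discounted_return_under_do(s0, horizon=200):
    """Roll out the environment under do(A_t = pi(S_t)): the action is set by the
    target policy regardless of the confounder U_t (placeholder dynamics)."""
    s, value = s0, 0.0
    for t in range(horizon):
        u = rng.binomial(1, 0.3 + 0.4 * s)
        a = target_policy(s)                                      # intervention ignores U_t
        m = rng.binomial(1, sigmoid(0.1 * s + 0.8 * (a - 0.5)))
        r = rng.binomial(1, sigmoid(0.5 * m + 0.5 * u - 0.2 * s))
        value += (gamma ** t) * r
        s = rng.binomial(1, sigmoid(0.4 * m + 0.3 * u))
    return value

# eta^pi = E{V^pi(S_0)}: average the discounted returns over draws from the initial distribution.
returns = [discounted_return_under_do(rng.binomial(1, 0.5)) for _ in range(2000)]
print("Monte Carlo approximation of eta^pi:", np.mean(returns))
```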
We remark that we adopt a discounted reward formulation to investigate the policy evaluation problem. This formulation allows us to take customers' frequency of using the app into consideration in our application (see Section 6 for details). Meanwhile, our proposal can be easily extended to the average reward setting (see Appendix A.4).

Off-Policy Confidence Interval Estimation
We first discuss the challenge of OPE in the presence of unmeasured confounders. We next show that η π can be represented as a function of the observed dataset. This result implies that η π is identifiable and forms the basis of our proposal. We then outline two potential estimators for η π . Each estimator suffers from some limitations and requires some parts of the model to be correctly specified. This motivates our procedure that combines both estimators for more robust and efficient off-policy evaluation, based upon which a Wald-type CI is derived. Finally, we detail our method.

The Challenge with Unmeasured Confounders
In this section, we discuss the challenge of OPE with unmeasured confounders. To simplify the presentation, we assume π is a deterministic policy such that π(•|s) is a degenerate distribution for any s throughout this section and Section 3.2. For any such policy, we use π(s) to denote the action that the agent selects after observing the state vector s. To begin with, we introduce the do-operator do to represent a (hard) intervention (see e.g., Pearl). It amounts to lifting A t from the influence of the old functional mechanism A t ∼ p a (•|S t , U t ) and placing it under the influence of a new mechanism that sets the value A t while keeping all other mechanisms unperturbed. For instance, the notation do(A t = π(S t )) means that the action A t is set to the value π(S t ) irrespective of the value of U t . In other words, whatever relationship exists between U t and A t , that relationship is no longer in effect when we perform the intervention. Adopting the do-operator, the expectation E π in (1) can be represented as
E π (R t |S 0 = s) = E{R t |do(A j = π(S j )), ∀0 ≤ j ≤ t, S 0 = s}.     (2)
In the presence of unmeasured confounders, the major challenge lies in that η π is defined based on the intervention distribution under the do-operator and cannot be easily approximated via the distribution of the observed data. To elaborate, we remark that the expectation in (2) is generally not equal to E{R t |A j = π(S j ), ∀0 ≤ j ≤ t, S 0 = s}. This is because the distribution under do(A t = π(S t )) is different from that given the observation A t = π(S t ). The latter corresponds to the conditional distribution generated by the causal diagram in Figure 1 given A t = π(S t ), whereas the former is the distribution generated by a slightly different graph, with the pathway U t → A t removed.
As an illustration, we apply DRL and the proposed method to a toy example detailed in Section 5.1. The data are generated according to a CMDPWM model. As we have commented, DRL is proposed by assuming no unmeasured confounders exist. As such, it can be seen from the left panel of Figure 2 that the DRL estimator has a non-diminishing bias under this example, due to the presence of unmeasured confounders. As shown in the right panel of Figure 2, the mean squared error (MSE) of DRL does not decay to zero as the number of trajectories increases to infinity.
In the next section, we address the above mentioned challenge by making use of the auxiliary variables M t in the observed data. It can be seen from Figure 2 that the proposed estimator is consistent: both its bias and MSE decay to zero as the number of trajectories diverges to infinity. Finally, we remark that in addition to the use of the do-operator, one can adopt the potential outcome framework to formulate the policy evaluation problem (see e.g., Fulcher et al.; 2020). We omit the details to save space.

Identification of η π
In this section, we first show that η π is identifiable based on the observed data. The main idea is to iteratively apply the Markov property and the front-door adjustment formula to represent the intervention distribution under the do-operator via the observed data distribution.
Recall that p a is the conditional distribution of A t |(S t , U t ). We use p * a to denote the corresponding conditional distribution A t |S t , marginalized over U t . Similarly, we use p * s,r to denote the conditional distribution (S t+1 , R t )|(M t , A t , S t ). We summarize the results in the following theorem.
Theorem 1. Let τ t denote the data history {(s j , a j , m j , r j )} 0≤j≤t up to time t and ν denote the initial state distribution. Under Assumptions 1 and 2, η π is equal to
Σ t≥0 γ t Σ τ t , s t+1 r t ν(s 0 ) Π t j=0 { p * s,r (s j+1 , r j |m j , a j , s j ) p m (m j |π(s j ), s j ) p * a (a j |s j ) }.
Note that none of the distributions p * s,r , p m and p * a involves the unmeasured confounders. As such, these distribution functions can be consistently estimated based on the observational data. Consequently, Theorem 1 implies that η π can be rewritten using the observed data distribution. Assumption 1 ensures the process satisfies the Markov property. Together with Assumption 2, it allows us to iteratively apply the front-door adjustment formula to replace the intervention distribution with the observed data distribution. See the proof of Theorem 1 in Appendix C.2 for details. We next outline two potential estimators for η π .

Direct Estimator
The first estimator is the direct estimator, which estimates the Q-function based on the observed data and directly uses it to derive the value estimate. In our setting, we define the Q-function Q π (m, a, s) = E{R t + γV π (S t+1 )|M t = m, A t = a, S t = s}. We make a few remarks. First, our definition of the Q-function is slightly different from that in the existing RL literature, defined by E{R t + γV π (S t+1 )|A t = a, S t = s}, as it involves mediators.
Second, similar to Theorem 1, we can show V π is identifiable from the observed data. It follows that Q π is identifiable as well.
To motivate the first estimator, we notice that η π can be rewritten as E{V π (S 0 )} with S 0 ∼ ν. Applying the front-door adjustment formula to V π , we obtain that
η π = E S 0 ∼ν { Σ a∈A π(a|S 0 ) Σ m p m (m|a, S 0 ) Σ a′∈A p * a (a′|S 0 ) Q π (m, a′, S 0 ) }.     (3)
This motivates us to learn p m , p * a , Q π and ν from the observed data and to construct the value estimate by plugging in these estimators. We refer to this estimator as the direct estimator, since the procedure shares similar spirits with the direct method in the RL literature.
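The following Python sketch illustrates the resulting plug-in calculation for discrete action and mediator spaces, assuming the nuisance estimates p̂ m , p̂ * a and Q̂ π have already been obtained; the tabular functions in the usage example are made up for illustration.

```python
import numpy as np

def direct_estimator(S0, pi, p_m_hat, p_a_hat, Q_hat, actions, mediators):
    """Plug-in direct estimator based on the front-door representation:
    V(s) = sum_a pi(a|s) sum_m p_m(m|a,s) sum_a' p_a*(a'|s) Q(m,a',s),
    averaged over the observed initial states S0.

    pi, p_m_hat and p_a_hat are callables returning conditional probabilities;
    Q_hat(m, a, s) returns the estimated Q-function. All are assumed given."""
    values = []
    for s in S0:
        v = 0.0
        for a in actions:
            for m in mediators:
                for a2 in actions:
                    v += pi(a, s) * p_m_hat(m, a, s) * p_a_hat(a2, s) * Q_hat(m, a2, s)
        values.append(v)
    return float(np.mean(values))

# Example with made-up tabular nuisance estimates on binary spaces.
pi_fn = lambda a, s: 0.5
pm_fn = lambda m, a, s: 0.7 if m == a else 0.3
pa_fn = lambda a, s: 0.5
Q_fn = lambda m, a, s: 1.0 * m + 0.5 * s
print(direct_estimator(S0=[0, 1, 1, 0], pi=pi_fn, p_m_hat=pm_fn,
                       p_a_hat=pa_fn, Q_hat=Q_fn, actions=[0, 1], mediators=[0, 1]))
```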

Importance Sampling Estimator
The second estimator is the importance sampling (IS) estimator. This is motivated by the work of Liu et al. (2018), which develops a marginal IS estimator that breaks the curse of horizon, assuming no unmeasured confounders exist. Compared to the standard IS estimator (Zhang et al.; 2013), whose variance grows exponentially fast with respect to the number of decision points, the marginal IS estimator takes the stationary property of the state transitions into consideration and effectively breaks the curse of high variance in sequential decision making. Specifically, let ω π (•) be the marginal density ratio
ω π (s) = (1 − γ) Σ t≥0 γ t p π t (s) / p ∞ (s),
where p π t (s) denotes the probability of S t = s by assuming the system follows π, and p ∞ denotes the limiting distribution of the stochastic process {S t } t≥0 . Similar to Theorem 1, we can show that for any t > 1, p π t is identifiable. So is ω π . A key observation is that, when the stochastic process {S t } t≥0 is stationary, it follows from the change of measure theorem that η π = (1 − γ) −1 E[ω π (S t ) E{R t |do(A t = π(S t )), S t }], yielding the marginal IS estimator. To replace the intervention distribution with the observed data distribution, we apply the importance sampling method again and re-weight each reward by another probability ratio
ρ(a, m, s) = Σ a′∈A π(a′|s) p m (m|a′, s) / p m (m|a, s).     (4)
Such an importance sampling trick has been used by Fulcher et al. (2020) to handle unmeasured confounders in single-stage decision making. This yields the following value estimate,
η̂ IS = (1 − γ) −1 (Σ i T i ) −1 Σ i Σ T i −1 t=0 ω̂(S i,t ) ρ̂(A i,t , M i,t , S i,t ) R i,t ,
where ω̂ and p̂ m (which enters through ρ̂) denote some estimators for ω π and p m .
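A minimal Python sketch of the marginal IS estimate is given below. It assumes pre-computed estimators ω̂ and p̂ m (the latter entering through the weight ρ̂), and the toy inputs at the end are purely illustrative.

```python
import numpy as np

def ratio_rho(a, m, s, pi, p_m_hat, actions):
    """Front-door importance weight rho(a, m, s) as in (4):
    sum_{a'} pi(a'|s) p_m(m|a',s) / p_m(m|a,s)."""
    num = sum(pi(a2, s) * p_m_hat(m, a2, s) for a2 in actions)
    return num / p_m_hat(m, a, s)

def marginal_is_estimator(transitions, omega_hat, pi, p_m_hat, actions, gamma):
    """Marginal IS value estimate: average of omega(S) * rho(A, M, S) * R over all
    observed transitions, rescaled by 1 / (1 - gamma). transitions is a list of
    (s, a, m, r) tuples; omega_hat and p_m_hat are assumed to be pre-estimated."""
    weighted = [omega_hat(s) * ratio_rho(a, m, s, pi, p_m_hat, actions) * r
                for (s, a, m, r) in transitions]
    return float(np.mean(weighted)) / (1.0 - gamma)

# Toy usage with hypothetical nuisance estimates on binary spaces.
pi_fn = lambda a, s: 0.5
pm_fn = lambda m, a, s: 0.7 if m == a else 0.3
omega_fn = lambda s: 1.0          # stationary behavior data => ratio close to one
data = [(0, 1, 1, 1.0), (1, 0, 0, 0.0), (1, 1, 1, 1.0)]
print(marginal_is_estimator(data, omega_fn, pi_fn, pm_fn, actions=[0, 1], gamma=0.9))
```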
To conclude this section, we discuss the limitations of the two estimators. First, each estimator requires some parts of the model to be correctly specified. Specifically, the direct estimator requires consistent estimates for Q π , p m and p * a , and IS requires correct specification of ω π and p m . Second, generally speaking, the direct estimator suffers from a large bias due to potential model misspecification whereas the IS estimator suffers from a large variance due to inverse probability weighting. To address both limitations simultaneously, we develop a robust and efficient OPE procedure by carefully combining the two estimating strategies used in Sections 3.3 and 3.4. Meanwhile, the resulting estimator requires weaker assumptions to achieve consistency. We present the main idea in the next section.

Our Proposal
We begin with some notations. Let O be a shorthand for a data tuple (S, A, M, R, S′).
The key to our estimator is the estimating function ψ 0 + ψ 1 (O) + ψ 2 (O) + ψ 3 (O), with the terms detailed below. Recall that ψ 0 depends on Q π , p m and p * a . The purpose of adding the three augmentation terms is to offer additional protection against potential model misspecification of these nuisance functions. As such, the proposed estimator achieves the desired robustness property. See Figure 2 for an illustration. A pseudocode summarizing the proposed algorithm is given in Algorithm 1.
We next present the three augmentation terms. The first term ψ 1 (O) involves the probability ratio ρ defined in (4). The last term in its curly bracket corresponds to the temporal difference error under the CMDPWM model, whose conditional mean given (M, A, S) equals zero. As such, ψ 1 (O) has zero mean when Q π , p m and p * a are correctly specified.
For the second augmentation term ψ 2 (O), when p m is correctly specified, the last term in the curly bracket can be represented as the residual Q π (M, a, S) − E{Q π (M, a, S)|A, S}. As such, ψ 2 (O) has zero mean when p m is correctly specified.
Similarly, for the third augmentation term ψ 3 (O), the last term in the curly bracket can be represented as the residual Q π (m, A, S) − E{Q π (m, A, S)|S}. When p * a is correctly specified, we have Eψ 3 (O) = 0. Based on the estimating function, the proposed estimator in (5) averages the estimated version of ψ 0 + ψ 1 (O) + ψ 2 (O) + ψ 3 (O) over all observed data tuples, with the nuisance functions replaced by their estimators. Compared to the standard DRL estimator, the proposed estimator involves additional computations due to the inclusion of the mediator distribution function in the latter two augmentation terms. When there are no unmeasured confounders, the proposed estimator shares similar spirits with DRL.
To construct such an estimator, we need to learn Q π , ω π , p * a , p m and the initial state distribution ν. Note that estimating p * a or p m is essentially a regression problem. These functions can be conveniently estimated via existing supervised learning algorithms. We estimate ν via the empirical distribution of {S i,0 } 1≤i≤N . As for Q π and ω π , we discuss the corresponding estimating procedure later in Section 3.6.
Next, we discuss the relationship between the proposed estimator in (5) and the two estimators outlined in Sections 3.3 and 3.4. Suppose p m is correctly specified. Let M 1 denote the set of models {Q π , p * a }, and M 2 denote the model ω π . First, when the models in M 1 are correctly specified, the three augmentation terms have zero mean, as we have discussed earlier. By the weak law of large numbers, (5) is asymptotically equivalent to the direct estimator and is thus consistent. Second, when the models in M 2 are correctly specified, we have Eψ 2 (O) = 0. In addition, using similar arguments in Part 3 of the proof of Theorem 2 in the appendix, together with the definition of ψ 1 , we can show that (5) is asymptotically unbiased for the IS estimator. It is thus consistent. The above discussion informally justifies the robustness property of (5). We rigorously prove this claim in Theorem 2.

Algorithm 1 Proposed procedure for confounded off-policy confidence interval estimation.
Input: the observed data, the target policy π, the discount factor γ, and the significance level 0 < α < 1.
1: Compute the estimators for p * a and p m via supervised learning algorithms. Estimate ν via the empirical initial state distribution.
2: Compute the Q-function and marginal density ratio estimators according to Section 3.6.
3: Plug the aforementioned estimated nuisance functions into (5) to construct the value estimator η̂.
4: Construct the Wald-type CI according to (6).
Finally, we observe that the proposed estimator can be written as N −1 Σ N i=1 η̂ i , where η̂ i denotes the estimating function based on the ith trajectory only. Since the trajectories are independent, the proposed estimator is asymptotically normal, as shown in Theorem 3. This yields the following Wald-type CI for off-policy interval estimation,
[η̂ − z α/2 σ̂ η N −1/2 , η̂ + z α/2 σ̂ η N −1/2 ],     (6)
where z α denotes the upper αth quantile of a standard normal distribution and σ̂ 2 η denotes the sampling variance estimator of {η̂ i } i .
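The construction of the Wald-type interval from the per-trajectory estimating functions {η̂ i } i is summarized in the short sketch below; the simulated inputs are only for illustration.

```python
import numpy as np
from scipy import stats

def wald_ci(eta_i, alpha=0.05):
    """Wald-type CI: eta_hat +/- z_{alpha/2} * sigma_hat / sqrt(N),
    where eta_i contains the per-trajectory estimating functions."""
    eta_i = np.asarray(eta_i, dtype=float)
    n = len(eta_i)
    eta_hat = eta_i.mean()
    se = eta_i.std(ddof=1) / np.sqrt(n)       # sampling variance estimator of {eta_i}
    z = stats.norm.ppf(1.0 - alpha / 2.0)
    return eta_hat, (eta_hat - z * se, eta_hat + z * se)

# Example with simulated per-trajectory values.
rng = np.random.default_rng(2)
print(wald_ci(rng.normal(loc=5.0, scale=1.0, size=200)))
```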

Learning Q π and ω π
The estimating procedure for Q π is motivated by the following Bellman equation,
Q π (m, a, s) = E{ R t + γ Σ a′∈A π(a′|S t+1 ) Σ m′ p m (m′|a′, S t+1 ) Σ a″∈A p * a (a″|S t+1 ) Q π (m′, a″, S t+1 ) | M t = m, A t = a, S t = s }.
Similar to the standard Bellman equation under settings without unmeasured confounders, it decomposes the Q-function into two parts: the immediate reward plus the discounted future state-action value. To prove this identity, notice that, similar to (3), we can show that
V π (s) = Σ a∈A π(a|s) Σ m p m (m|a, s) Σ a′∈A p * a (a′|s) Q π (m, a′, s)
based on the front-door adjustment formula. This, together with our definition of the Q-function, yields the above Bellman equation.
Let p̂ m and p̂ * a denote consistent estimators for p m and p * a , based on the observed data. Given the Bellman equation, multiple methods can be applied to estimate Q π . We employ the fitted Q-evaluation method (Le et al.; 2019) in our setup and propose to iteratively compute
Q̂ (ℓ+1) = argmin Q∈Q Σ i,t { R i,t + γ Σ a′ π(a′|S i,t+1 ) Σ m′ p̂ m (m′|a′, S i,t+1 ) Σ a″ p̂ * a (a″|S i,t+1 ) Q̂ (ℓ) (m′, a″, S i,t+1 ) − Q(M i,t , A i,t , S i,t ) } 2 ,
for some function class Q and ℓ = 0, 1, · · · , until convergence. Similar to Fan et al. (2020), we can show that the resulting Q-estimator is consistent when Q is a class of universal function approximators such as neural networks.
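Below is a minimal sketch of this fitted Q-evaluation step with a linear model Q(m, a, s) = φ(m, a, s)⊤θ, discrete action and mediator spaces, and a ridge penalty as in Appendix B. The feature map, data and nuisance estimates in the example are hypothetical, and the pseudo-response follows the Bellman equation displayed above.

```python
import numpy as np

def fitted_q_evaluation(data, phi, pi, p_m_hat, p_a_hat, actions, mediators,
                        gamma=0.9, n_iter=200, ridge=1e-3, tol=1e-4):
    """Fitted Q-evaluation for the mediated Q-function with linear features.

    data: list of (s, a, m, r, s_next) transition tuples.
    phi(m, a, s): feature vector; pi, p_m_hat, p_a_hat: conditional probabilities.
    At each iteration the pseudo-response is R + gamma * V(S'), where V(S') is
    computed from the current Q-estimate via the front-door formula."""
    d = len(phi(mediators[0], actions[0], data[0][0]))
    theta = np.zeros(d)
    X = np.array([phi(m, a, s) for (s, a, m, r, s2) in data])

    def v_hat(s):
        return sum(pi(a, s) * p_m_hat(m, a, s) * p_a_hat(a2, s) * (phi(m, a2, s) @ theta)
                   for a in actions for m in mediators for a2 in actions)

    for _ in range(n_iter):
        y = np.array([r + gamma * v_hat(s2) for (s, a, m, r, s2) in data])
        theta_new = np.linalg.solve(X.T @ X + ridge * np.eye(d), X.T @ y)   # ridge-penalized LS
        if np.max(np.abs(theta_new - theta)) < tol:
            theta = theta_new
            break
        theta = theta_new
    return lambda m, a, s: phi(m, a, s) @ theta

# Hypothetical usage with a scalar state and hand-crafted features.
phi = lambda m, a, s: np.array([1.0, s, float(m), float(a), m * s])
pi_fn = lambda a, s: 0.5
pm_fn = lambda m, a, s: 0.6 if m == a else 0.4
pa_fn = lambda a, s: 0.5
data = [(0.2, 1, 1, 1.0, 0.5), (0.5, 0, 0, 0.0, 0.1), (0.1, 1, 0, 0.5, 0.4), (0.4, 0, 1, 1.0, 0.2)]
Q_hat = fitted_q_evaluation(data, phi, pi_fn, pm_fn, pa_fn, actions=[0, 1], mediators=[0, 1])
print(Q_hat(1, 1, 0.3))
```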
We next consider ω π . Similar to the work of Liu et al. (2018), we can show that when the process {S t } t≥0 is stationary, ω π satisfies the equation L(ω π , f ) = 0 for any discriminator function f in our setup, where L(ω, f ) is given by
L(ω, f ) = E{ γ ω(S) ρ(A, M, S) f (S′) − ω(S) f (S) } + (1 − γ) E s∼ν f (s).
As such, ω π can be learned by solving the following mini-max problem,
ω̂ = argmin ω∈Ω max f ∈F L 2 (ω, f ),     (7)
for some function classes Ω and F. The expectation in (7) can be approximated by the sample average. The quantities p m and ν in (7) can be substituted with their estimators. As pointed out by one of the referees, the minimax optimization is often not stable. To address this issue, we restrict attention to linear or kernel function classes to simplify the calculation. See Appendix B for details.

Statistical Guarantees
We prove the robustness and efficiency of our estimator as well as the validity of our CI in this section. Without loss of generality, we assume T i = T for any i. To derive the asymptotic theories, we require the number of trajectories N to diverge to infinity. The termination time T can either be bounded, or diverge with N . The assumption on N is imposed to ensure that the initial state distribution ν can be well-approximated by the empirical distribution of the initial states. Assumption 2 is mild as the function classes are user-specified. VC type classes contain a wide variety of function classes, including neural networks and regression trees. The VC index controls the model complexity. It generally increases with the number of parameters in the model. We allow the VC index to diverge with the sample size to reduce the bias of the estimator due to model misspecification. To save space, we present the detailed definition of L 2 -norm convergence in Appendix C.1. Theorem 2 formally establishes the robustness property. Notice that the proposed estimator equals the direct estimator outlined in Section 3.3 when ω̂ = 0 and equals the IS estimator in Section 3.4 when Q̂ = 0. As a byproduct, we obtain the following corollary. To achieve efficiency, we need an additional assumption (Assumption 3) requiring the nuisance function estimators to converge to their oracle values at suitable rates. We make a few remarks. First, we show in the proof of Theorem 3 that the proposed estimator is asymptotically normal and satisfies √N (η̂ − η π ) →d N (0, σ 2 T ) for some asymptotic variance σ 2 T .

Corollary 1. (i) Suppose the conditions in
The asymptotic variance estimator for σ 2 T can be constructed via the sampling-variance formula. Consequently, a two-sided Wald-type confidence interval (CI) can be derived for η π . Second, the asymptotic variance σ 2 T decays with T . Specifically, it can be decomposed into σ 2 0 + T −1 σ 2 * for some σ 0 , σ * . See Appendix C.4 for the explicit forms of these quantities. The first term σ 2 0 accounts for the variation of the initial state distribution in the plug-in estimator. The second term T −1 σ 2 * is the variance of the augmentation terms and decays to zero as T → ∞. Third, Kallus and Uehara (2019) derives the efficiency bound for OPE in infinite horizon settings where no unmeasured confounders exist and the initial state distribution under the target policy is known. Our proof for Theorem 3 differs from theirs in that we allow the initial state distribution to be unknown and allow unmeasured confounders to exist. Fourth, in Assumption 3, we require all the nuisance function estimators to converge to their oracle values and thus exclude the case with model misspecification. When the model is misspecified, the semiparametric efficiency bound cannot be achieved. Finally, in our proposal, we use the same dataset twice to estimate the nuisance functions and construct the final value estimator. We do not utilize cross-fitting. Because of that, we impose certain metric entropy conditions in Assumption 2 to establish the robustness and efficiency of the proposed value estimator. To remove Assumption 2, we can couple our procedure with sample-splitting and cross-fitting (see e.g., Chernozhukov et al.; 2018; Kallus and Uehara; 2019). However, in our setup, we find that the proposed estimator without cross-fitting has better finite sample properties.
Theorem 4 (Validity). Suppose the conditions in Theorem 3 hold. Then the coverage probability of the proposed CI approaches the nominal level as N diverges to infinity.
We remark that Theorems 3 and 4 are concerned with the asymptotic distribution of the value estimator under a single target policy. In Appendix A.3 of the supplementary article, we establish the joint asymptotic distribution of the proposed value estimators under multiple target policies, introduce the proposed CI for the value difference between two target policies and prove its validity.

Simulation studies
In this section, we evaluate the finite sample performance of the proposed estimator using two simulation studies. The first toy example aims to illustrate the robustness properties of our estimator to unmeasured confounding and model misspecification. In the second simulation study, we demonstrate that our method is superior to state-of-the-art policy evaluation methods.

A toy example
We first describe the detailed setting for the toy example. We fix T = 100, and the initial state is sampled from a Bernoulli distribution with support {0, 1}. The mediator is drawn from a Bernoulli distribution with binary support. We set p m (1|A t , S t ) = sigmoid(0.1S t − 0.9(A t − 0.5)), which does not depend on U t ; Assumption 1 is thus satisfied. The reward R t and the next state S t+1 are Bernoulli random variables with support {0, 10} and {0, 1}, respectively. We are interested in evaluating a random policy that outputs 0 with probability 1 − sigmoid(0.3S t ), and outputs −1 or 1 each with probability 0.5 sigmoid(0.3S t ) after observing S t . Under this toy example, we are able to derive Q π , p m , p * a and η π theoretically, and we calculate the true value of ω π via the Monte Carlo method.
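For concreteness, the sketch below generates observational trajectories in the format used throughout Section 5. The mediator model and the target policy match the specification above; the confounder, behavior policy, reward and transition models are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
T = 100

def target_policy(s):
    """Random target policy of Section 5.1 (used only for evaluation): output 0 w.p.
    1 - sigmoid(0.3*s), and -1 or 1 each w.p. 0.5 * sigmoid(0.3*s)."""
    p1 = sigmoid(0.3 * s)
    return rng.choice([-1, 0, 1], p=[0.5 * p1, 1.0 - p1, 0.5 * p1])

def generate_trajectory():
    """One observational trajectory of length T. The mediator model matches the paper:
    p_m(1|A,S) = sigmoid(0.1*S - 0.9*(A - 0.5)). The confounder, behavior policy,
    reward and transition models below are hypothetical placeholders."""
    s = rng.binomial(1, 0.5)                                      # placeholder initial distribution
    traj = []
    for t in range(T):
        u = rng.binomial(1, 0.5)                                  # placeholder p_u(.|S_t)
        a = rng.choice([-1, 0, 1], p=[0.3, 0.4, 0.3]) if u == 0 else \
            rng.choice([-1, 0, 1], p=[0.2, 0.3, 0.5])             # placeholder p_a(.|S_t, U_t)
        m = rng.binomial(1, sigmoid(0.1 * s - 0.9 * (a - 0.5)))   # stated mediator model
        r = 10 * rng.binomial(1, sigmoid(0.5 * m + 0.5 * u - 0.2 * s))  # placeholder, support {0, 10}
        s_next = rng.binomial(1, sigmoid(0.4 * m + 0.3 * u))      # placeholder transition
        traj.append((s, a, m, r, s_next))
        s = s_next
    return traj

data = [generate_trajectory() for _ in range(20)]   # N = 20 trajectories
print(len(data), len(data[0]))
```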
Recall that M 1 denotes the set of models {Q π , p * a } and M 2 = {ω π }. We evaluate the performance of the proposed estimator under the following scenarios: (i) all the models p m , M 1 , M 2 are correct; (ii) p m and M 1 are correct, M 2 is misspecified; (iii) p m and M 2 are correct, M 1 is misspecified. Specifically, to misspecify Q π , we inject a Gaussian noise with unit variance into the true Q π . To misspecify p * a , we multiply p * a by a variable sampled from a uniform distribution on [0.75, 1]. To misspecify ω π , we increase the value of ω π (0) by 0.5 and reduce the value of ω π (1) by 0.5.
As shown in Figure 2, our proposed estimator is robust to unmeasured confounding and model misspecification.

Comparison with state-of-the-art methods
We compare the proposed method with state-of-the-art methods in the existing reinforcement learning literature. The simulated data are generated as follows. The initial state is sampled from a standard normal distribution with dimension d S = 1 or 3. The distributions of the unmeasured confounders are the same as those in the toy example. Let 1 t be a length-t vector of ones, and C t = 1 ⊤ d S S t , the sum of the state elements. The action is binary-valued and is generated according to the behaviour policy p a (1|S t , U t ) = sigmoid(0.1C t + 0.9U t ). The mediator is drawn from a Bernoulli distribution with binary support. We set p m (1|A t , S t ) = sigmoid(0.1C t + 0.9(A t − 0.5)), which does not depend on U t ; Assumption 1 is thus satisfied. The reward R t is sampled from a normal distribution with conditional mean 0.5I(U t = 1)(M t + C t ) − 0.1C t and standard deviation 0.1. The future state S t+1 is sampled from a multivariate normal distribution with mean 0.5I(U t = 1)(M t 1 d S + S t ) − 0.1S t and covariance matrix 0.25I d S . The target policy selects action 1 with probability sigmoid(0.3C t ).
We compare the proposed estimator with three types of baseline methods. All these methods are developed by assuming no unmeasured confounders. The first one is the direct estimator, computed based on an estimated Q-function, η̂ REG = N −1 Σ N i=1 Q̂(S i,0 , π(S i,0 )) (denoted by REG). In our implementation, we compute Q̂ via the fitted Q-evaluation algorithm. The second one is the marginal importance sampling (MIS) estimator (Liu et al.; 2018). To implement this method, the marginal sampling ratio is estimated by assuming no unmeasured confounders exist, and thus differs from the proposed estimator for ω π . The third one is the DRL estimator that combines the first two estimators for value evaluation.
None of these methods uses the mediator. For fair comparison, we also include the mediator in the state to construct the value estimates. Denote the resulting three estimators by REG-M, MIS-M and DRL-M, respectively. We further estimate their variances based on the sampling variance formula (see Appendix B for details) and construct the associated confidence interval. The proposed estimator is denoted by COPE, short for confounded off-policy interval estimation.
The linear basis function models for p * a , p m , Q π and ω π employ randomly generated Fourier features based on the Python RBFSampler function. We find that the performance of the value estimator is not overly sensitive to the number of basis functions (see Appendix B for details). Let η π be the ground truth and η̂ π be a given OPE estimator; we define logBias as log 10 (|E η̂ π − η π |) and logMSE as log 10 {E(η̂ π − η π ) 2 }, where the expectation E(·) is approximated by Monte Carlo simulations. We report these metrics, as well as the empirical coverage probabilities of all the confidence intervals for the target policy's value, in Figures 3 and 7 (see Appendix B in the supplementary article). We also calculate the standard deviations of these metrics over 400 replications and report them in Appendix B.
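As an illustration of this choice, the snippet below builds random Fourier features with scikit-learn's RBFSampler and fits a linear (logistic) model for one nuisance function, p̂ * a ; the data and hyperparameters are synthetic and illustrative.

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
d_S = 3
S = rng.normal(size=(1000, d_S))                                    # observed states
A = rng.binomial(1, 1 / (1 + np.exp(-0.1 * S.sum(axis=1))))         # observed binary actions

# Random Fourier features approximating an RBF kernel map of the state.
featurizer = RBFSampler(gamma=1.0, n_components=6 * d_S, random_state=0)
Phi = featurizer.fit_transform(S)

# Linear (logistic) model for p_a^*(a | s) built on top of the random features.
model = LogisticRegression(max_iter=1000).fit(Phi, A)

def p_a_hat(a, s):
    """Estimated behavior action probability p_a^*(a|s), marginalized over U."""
    prob1 = model.predict_proba(featurizer.transform(s.reshape(1, -1)))[0, 1]
    return prob1 if a == 1 else 1.0 - prob1

print(p_a_hat(1, S[0]))
```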
It can be seen that COPE achieves the least bias and MSE among all methods. In addition, its MSE decays with N and T in general and the empirical coverage rate of our CI is close to the nominal level. We also notice that the squared bias of our estimator is much smaller than its MSE. This demonstrates the consistency of our method and is in line with our theoretical findings. In contrast, other baseline estimators are severely biased, since they cannot handle the unmeasured confounders.

Real Data Application
In this section, we apply our method to a real dataset from a world-leading ride-hailing company. We focus on a particular recommendation program applied to customers in regions where there are more taxi drivers than the call orders. As we have commented, in the short term, this helps balance the taxi supply and passenger demand across different areas of the city. In the long term, this increases the frequency that the customer uses the app to request the trip.
The dataset consists of all the call orders in a given city from September 16th to September 22nd. The features available to us consist of each order's time, origin, destination and a supply-demand equilibrium metric that characterizes the degree to which supply meets demand. For each call order, the customer might receive a coupon for 20% off.
This yields a binary action. The mediator is the actual discount applied to the order. As we have commented, the mediator is calculated by the platform using the action and other promotion strategies and differs from the action, but is conditionally independent of those unmeasured variables that confound the action. Assumption 2(b) is thus satisfied. The reward is zero if the customer does not request the ride at the end, and one minus the actual discount times the price of the order otherwise. We present the empirical quantiles of the reward and state in Table 4.
By definition, the reward depends on the action only through its effect on the mediator.
In addition, the customer will observe the final discount applied to their ride on the application, but is not aware of which promotion strategy yields the discount. As such, it is reasonable to assume that these promotion strategies will affect their behaviors through the final discount only. Consequently, the reward and future state are conditionally independent of the action and other promotion strategies given the mediator. Assumptions 2(a) and 2(c) thus hold.
We first fit an MDP model to this dataset, and use the estimated MDP to generate synthetic data that mimic the real dataset. Specifically, the distribution of the initial state is approximated by a multivariate normal distribution. The state transition S t+1 |A t , S t , M t is modelled by a multivariate normal distribution N (µ(S t , A t , M t ), σ 2 ), where the conditional mean function µ is estimated using regularized linear basis function models. Similarly, we estimate the reward function E(R t |S t , A t , M t ) using a regularized linear basis function model as well. All the tuning parameters are selected by five-fold cross-validation. Based on the fitted MDP, synthetic datasets can be generated to evaluate different OPE methods.
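The snippet below sketches this model-fitting step for the reward component, using random Fourier features with a ridge penalty chosen by five-fold cross-validation; all inputs are synthetic stand-ins for the proprietary data, and the conditional mean of the next state can be fitted coordinate-wise in the same way.

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(5)
n, d_S = 500, 3
S = rng.normal(size=(n, d_S))                    # state features (e.g. supply-demand metric)
A = rng.binomial(1, 0.5, size=n)                 # binary coupon recommendation
M = rng.uniform(0.0, 0.2, size=n)                # mediator: actual discount applied
price = rng.uniform(5.0, 30.0, size=n)
ordered = rng.binomial(1, 0.7, size=n)
R = ordered * (1.0 - M) * price                  # synthetic reward: (1 - discount) * price if ordered

X = np.column_stack([S, A, M])

# Regularized linear basis function model for E(R_t | S_t, A_t, M_t):
# random Fourier features followed by ridge regression with a CV-selected penalty.
reward_model = make_pipeline(
    RBFSampler(gamma=0.5, n_components=50, random_state=0),
    RidgeCV(alphas=np.logspace(-3, 1, 10), cv=5),
)
reward_model.fit(X, R)
print(reward_model.predict(X[:3]))
```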
We are interested in evaluating two recommendation policies. One of them is a random policy (denoted by π 1 ), under which each customer has a 0.5 probability of receiving the 20 percent discount. The other policy (denoted by π 2 ) relies on the imbalance measure between supply and demand. Specifically, for regions with substantially more vacant drivers than requests, we randomly choose 70% of the customers to receive the discount. For the remaining regions, customers have a 30% chance of obtaining the discount. We expect the second policy to yield larger values, as it has better immediate rewards and encourages customers to request rides more often.
To take customers' frequency of using the app into consideration, we use a slightly different definition of the value function, as in Xu et al. (2018), and adjust the proposed method and other baselines accordingly to reflect this change. In addition, the observed data consist of another sequence of variables {T i,t }, corresponding to the time at which the ith customer launches the app and enters the destination. We initialize T i,0 to zero for all i. The target policy's value is defined by discounting each reward according to the elapsed time, so that the discount applied to the reward at time t is γ T i,t rather than γ t . To reflect this change, the proposed estimator takes the same form as before, where ψ 0 is the same as in the direct estimator with the Q-estimator Q̂ replaced by Q̃ detailed below, and, for any j, ψ j is modified with Q̂ replaced by Q̃ and ω̂ replaced by ω̃. Specifically, Q̃ is computed by solving a slightly different Bellman equation, and ω̃ is computed by solving a version of (7) with γ replaced by γ T i,t+1 −T i,t . The DRL estimator can be similarly modified to adapt to this change.
We apply REG-M, MIS-M, DRL-M and the proposed method COPE to evaluate the value difference η π 2 − η π 1 . The ground truth is approximated via Monte Carlo based on the fitted MDP model and equals 0.17. This is consistent with our expectation that the second policy yields a larger value. We evaluate the estimation accuracy by logBias and logMSE as in the simulation section, as well as the coverage probability of the confidence interval. The discount factor γ is set to 0.99, as we are interested in the long-term treatment effects. First, COPE has the best estimation accuracy among the four methods. Second, the coverage probability of the proposed CI is close to the nominal level. In contrast, the baseline methods fail to achieve the nominal coverage when N or T is large. These results are consistent with our simulation findings.
We next apply our method and DRL to the real dataset to evaluate the value difference η π 2 − η π 1 . The proposed method yields a value difference of 0.63, with an associated 95% confidence interval of [0.03, 1.23]. As such, the second policy is significantly better than the first one. The result is consistent with our expectation. On the contrary, DRL yields a value difference of -0.96, with an associated confidence interval of [-2.07, 0.14]. According to DRL, the random policy is much better. This is because DRL cannot handle unmeasured confounders, leading to a biased estimator. Combining this with our theoretical and simulation results, we have more confidence in the findings based on our proposed CI.

Discussion
In this section, we discuss several extensions. First, our current proposal relies on the "memoryless unmeasured confounding" assumption to simplify the derivation. In Appendix A.1 of the supplementary article, we discuss several possible relaxations of this assumption.
Second, we assume the mediators are discrete to simplify the presentation. In Appendix A.2, we extend the proposed method to settings with continuous mediators. Third, we adopt a discounted reward formulation to investigate the policy evaluation problem. We extend our proposal to the average reward setting in Appendix A.4.

Zhang, J. and Bareinboim, E. (2016). Markov decision processes with unobserved confounders: A causal approach, Technical Report R-23, Purdue AI Lab.

A.1 Memoryless Unmeasured Confounding
Our proposal relies on a "memoryless unmeasured confounding assumption" that requires the unmeasured confounders U t to be conditionally independent of the past observations and latent confounders given S t , at each time t. This assumption might be violated in some applications. We discuss several possible relaxations of this assumption in this section.
One way to relax this assumption is to allow U t to depend on past observations. Specifically, we impose the following "high-order memoryless unmeasured confounding assumption": for some integer K ≥ 1, we assume U t is conditionally independent of the earlier observations and latent confounders given the measurements collected over the past K time points. Let S * t denote the new state vector obtained by concatenating these measurements. Then U t satisfies the memoryless unmeasured confounding assumption given S * t . As such, our proposal is equally applicable to the transformed dataset.
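A minimal sketch of this concatenation step is given below; which measurements enter the augmented state (here the past K tuples of state, action, mediator and reward) is an illustrative choice.

```python
import numpy as np

def concatenate_states(states, actions, mediators, rewards, K):
    """Build the augmented state S*_t by stacking the measurements collected over
    the past K time points (here: states, actions, mediators and rewards).
    Returns one augmented vector per time point t >= K - 1."""
    obs = np.column_stack([states, actions, mediators, rewards])
    augmented = [obs[t - K + 1 : t + 1].ravel() for t in range(K - 1, len(states))]
    return np.array(augmented)

# Example: a single trajectory with scalar states and K = 2.
s = np.array([0.1, 0.4, 0.2, 0.7]); a = np.array([1, 0, 1, 1])
m = np.array([0, 1, 1, 0]);          r = np.array([1.0, 0.0, 0.5, 1.0])
print(concatenate_states(s, a, m, r, K=2).shape)   # (3, 8)
```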
Another way to relax this assumption is to allow U t to depend on the past latent factors.
In that case, the state vectors no longer satisfy the Markov assumption and the data follow a partially observable MDP (POMDP) model. Under partial observability, Theorem 1 is not applicable for value identification. However, under Assumption 2, the front-door adjustment formula can still be applied to handle unmeasured confounders. It would be practically interesting to further investigate the direct, IS and DR estimators under the POMDP model, but this is beyond the scope of the current paper. We leave it for future research.
To summarize, we have discussed two potential relaxations of the memoryless unmeasured confounding assumption, one with a "high-order memoryless unmeasured confounding" assumption and the other without any further assumptions. These assumptions are not directly testable. However, notice that under the (high-order) memoryless unmeasured confounding assumption, the observed data process {(S j , A j , M j , R j ) : j ≥ 0} forms a (highorder) MDP. Without these assumptions, this process forms a POMDP. In practice, we can apply the sequential testing procedure developed by Shi, Wan, Song, Lu and Leng (2020) to the observed data for model selection. Specifically, if the MDP assumption is satisfied, then we can apply the proposed procedure in the main paper for value evaluation. If the data satisfies a Kth-order MDP assumption, then the Kth-order memoryless unmeasured confounding assumption is likely to hold. In that case, we can first construct the new state vector by concatenating measurements over the past K time points and then apply our proposal. Suppose we concatenate measurements over sufficiently many decision points and the test still rejects the MDP assumption. Then the "high-order memoryless unmeasured confounding" assumption is likely to be violated and we may apply the estimators designed for POMDPs for policy evaluation.

A.2 Continuous Mediators
As we have commented in the main paper, the proposed method is equally applicable to settings with continuous mediators. To elaborate, first notice that the Q-function and the marginal density ratio are well-defined regardless of whether the mediator is continuous or not. In addition, the estimating procedure discussed in Section 3.6 remains consistent with continuous mediators. Second, the proposed estimator involves the conditional probability mass function of the mediator p m (m|a, s) = P(M = m|A = a, S = s). In settings with continuous mediators, we replace it with the corresponding conditional probability density function of the mediator and replace the sums over the mediator with integrals. The resulting estimating function takes the same form as before, with these replacements.

A.3 Multiple Target Policies
We first establish the joint asymptotic distribution of the proposed value estimators under multiple target policies {π k } K k=1 . We next introduce the proposed confidence interval for the value difference between two target policies and prove its validity.
Let η̂ π denote the proposed estimator in (5) with the target policy given by π. Similarly, let Q̂ π and ω̂ π denote the corresponding estimated Q-function and marginal density ratio, for any π. We have the following results.
Theorem 5. Suppose the conditions in Theorem 2 are satisfied. Suppose Q̂ π k , ω̂ π k , p̂ * a , p̂ m converge in L 2 -norm to their oracle values at a rate of N −κ * for some κ * > 1/4 and any 1 ≤ k ≤ K. Then the value estimators {η̂ π k } K k=1 are jointly asymptotically normal with covariance matrix given by that of the corresponding estimating functions, where ψ π j is equal to ψ j defined in (3.5) with the target policy given by π j .
The proof of Theorem 5 is very similar to that of Theorem 3 and is thus omitted to save space.
Theorem 6. Suppose the conditions in Theorem 5 are satisfied and the asymptotic variance in (9) is bounded away from zero. Then the coverage probability of the proposed CI in (10) approaches the nominal level as N diverges to infinity.
Finally, we remark that in the statement of Theorem 6, the asymptotic variance of the value difference estimator is required to be bounded away from zero. This essentially requires the two target policies π 1 and π 2 not to be too close to each other. In the extreme case where π 1 = π 2 , the asymptotic variance equals zero and the value difference estimator has a degenerate distribution. To address this concern, we could redefine the variance estimator by setting σ̂ 2 η ← max(σ̂ 2 η , δ 2 ) for some δ > 0, as in Luedtke and van der Laan (2017). This guarantees that the variance estimator is strictly positive, and the resulting CI would be valid as long as δ and T −1/6 satisfy an appropriate rate condition (see Theorem 3.1 of 2019).

A.4 Average Reward
We extend our proposal to the average reward setting in this section. Under this formulation, the target value can be represented as η π = lim T →∞ T −1 Σ T −1 t=0 E π (R t ). Similar to the discounted reward setting, the proposed estimating function takes the form ψ 0 + Σ j ψ j (O), where ψ 0 denotes the direct estimator whose form will be specified later and {ψ j } j are the augmentation terms.
Here Q π (m, a, s) denotes the relative value function Σ +∞ t=0 E π (R t − η π |M 0 = m, A 0 = a, S 0 = s), ω π denotes the density ratio of the stationary state distribution under π to that under the behavior policy, and the definitions of ρ, p m , p * a , ψ 2 (O) and ψ 3 (O) are consistent with those in the main paper.
We next discuss the estimating procedures for (Q π , ψ 0 ) and ω π . Given the estimators for p m and p * a , we can adopt the general coupled estimation framework to jointly learn Q π and ψ 0 from the observed data (see Section 4.2 in Liao et al.; 2020).
As for ω π , we note that it satisfies an estimating equation analogous to L(ω π , f ) = 0 in the discounted setting, for any testing function f . The resulting estimator can thus be computed by solving a minimax problem, as detailed in Appendix B.

A.5 Joint Effect
To conclude this section, we remark that in this paper, our interest lies in analyzing the effect of one individual customer recommendation program while controlling for all other promotion programs. In our formulation, this particular recommendation program serves as the action, whereas other promotion programs affect the mediator. Suppose one is interested in the joint effect of all promotion programs. In that case, we may treat the mediator (e.g., the final discount) as the action for policy evaluation.

B More on the Numerical Experiments
We first discuss how we estimate ω π . One way to simplify the calculation is to set F to a unit ball of a reproducing kernel Hilbert space (RKHS) based on some symmetric positive definite kernel κ. This yields a closed-form expression for the inner maximization. Specifically, based on the observed data, we aim to identify ω̂ that minimizes the resulting objective function. Stochastic gradient descent algorithms can be applied to update the parameters in ω̂ π ; see e.g., Algorithm 2 of Shi et al. (2021).

Table 1: MSEs (on the log 10 scale) of the DRL and COPE value estimators for different numbers of trajectories N.
N                                             20      40      80      160     320
DRL                                          -1.26   -1.46   -1.49   -1.58   -1.71
COPE with linear function approximation      -1.57   -1.85   -2.14   -2.48   -2.78
COPE with nonlinear function approximation   -1.63   -1.84   -2.14   -2.46   -2.78
In practice, we recommend to use the Gaussian radial basis function kernel. The bandwidth parameter in the kernel can be selected via the median heuristic (see e.g., Garreau et al.;2017). In high-dimensional settings, we can set Ω to a class of deep neural networks (DNN). The DNN structure can be adaptively determined via cross-validation. We also conduct a simulation study and find that the method is convergent in the optimization.
For instance, Figure 5 depicts the change of the loss function during the training process. It can be seen that the objective function converges as the number of training epochs reaches 1000. In this example, we use a multilayer perceptron (MLP) neural network with a single hidden layer and five hidden nodes. The initial parameter values are randomly assigned and the learning rate is fixed to 0.01. In addition, it can be seen from Table 1 that the MSEs of the resulting value estimators are smaller than or comparable to those presented in Section 5.2 based on linear function approximation. In our implementation, we set both Ω and F to linear function classes, i.e., ω π and f can be characterized by a linear combination of d ω -dimensional random Fourier features ξ following the Random Kitchen Sinks (RKS) algorithm (Rahimi and Recht; 2008). Without loss of generality, suppose ω π (s) = ξ(s) ⊤ β * for some β * ∈ R d ω . Due to L(ω π , f ) = 0 for every f in the linear span of ξ, β * solves a system of d ω linear equations. Consequently, we can estimate β * by solving the corresponding empirical linear system, with the expectations replaced by sample averages and ν replaced by the empirical initial state distribution. In our simulation studies, the number of features d ω is set to 6d S , where d S corresponds to the dimension of the state. In addition, we further conduct a sensitivity analysis to investigate the empirical performance of the proposed estimator with different numbers of basis functions. Figure 6 reports biases, mean square errors and 95% CIs' coverage rates. It can be seen that the performance of the proposed procedure is not overly sensitive to the number of basis functions.
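The sketch below implements this linear-function-class estimator, assuming the estimating equation takes the form L(ω, f ) = E{γ ω(S)ρ(A, M, S)f (S′) − ω(S)f (S)} + (1 − γ)E ν f displayed in Section 3.6, and that the probability ratio ρ̂ has already been estimated; the feature map and data are illustrative, and a small ridge term is added for numerical stability.

```python
import numpy as np

def estimate_omega_linear(transitions, initial_states, xi, rho_hat, gamma):
    """Estimate omega(s) = xi(s)^T beta from the empirical moment conditions
    E[{xi_k(S) - gamma * rho * xi_k(S')} * omega(S)] = (1 - gamma) * E_nu[xi_k(S_0)],
    taking each random feature xi_k as a test function f.

    transitions: list of (s, a, m, s_next) tuples from the observed data;
    initial_states: observed initial states approximating nu;
    rho_hat(a, m, s): estimated probability ratio in (4). All are assumed given."""
    d = len(xi(initial_states[0]))
    A_mat = np.zeros((d, d))
    for (s, a, m, s_next) in transitions:
        w = rho_hat(a, m, s)
        A_mat += np.outer(xi(s) - gamma * w * xi(s_next), xi(s))
    A_mat /= len(transitions)
    b = (1.0 - gamma) * np.mean([xi(s0) for s0 in initial_states], axis=0)
    beta = np.linalg.solve(A_mat + 1e-6 * np.eye(d), b)   # small ridge for stability
    return lambda s: xi(s) @ beta

# Toy usage with a tiny feature map and hypothetical inputs.
xi = lambda s: np.array([1.0, s, s ** 2])
rho_hat = lambda a, m, s: 1.0
data = [(0.1, 1, 0, 0.3), (0.3, 0, 1, 0.2), (0.2, 1, 1, 0.1)]
omega_hat = estimate_omega_linear(data, initial_states=[0.1, 0.2], xi=xi, rho_hat=rho_hat, gamma=0.9)
print(omega_hat(0.25))
```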
Next, we discuss how we estimate Q π . In our implementation, we use linear basis functions with random Fourier features to model the Q-function. It suffices to solve a least-squares regression during each Q-iteration. The number of basis functions is set to 5(d S + 2).
The initial parameter is randomly generated from a multivariate normal distribution with zero mean and identity covariance matrix. To prevent overfitting, we add a ridge penalty function to shrink the least-square estimator. This guarantees that the estimated Q-function will not diverge. The regularization parameter is set to 10 −3 . We stop the Q-iteration when the difference between the estimated regression coefficients from one iteration to another is less than 10 −4 .
Next, we similarly use linear basis functions with random Fourier features to parametrize p * a and p m . The numbers of basis functions are set to d S and d S + 1, respectively. Next, we discuss how we estimate the variance of all baseline estimators. The variance of η̂ REG is estimated by the sampling variance estimator of {Q̂(S i,0 , π(S i,0 ))} i . As for the MIS estimator, we first compute the value estimator based on the data from each individual trajectory, denoted by η̂ MIS i .
We estimate the variance of the MIS estimator by the sampling variance estimator of $\{\widehat{\eta}^{\mathrm{MIS}}_i\}_i$. The variance estimator of DRL can be derived similarly to that of the proposed estimator. Given these variance estimators, we use Wald-type confidence intervals to infer the target policy's value.
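For illustration, a minimal sketch of the Wald-type interval construction from trajectory-level values; the assumption that the variance is estimated by the sampling variance over the $N$ trajectories (and hence scaled by $1/N$) mirrors the baseline variance estimators described above but is a simplification.

import numpy as np
from scipy import stats

def wald_confidence_interval(trajectory_values, alpha=0.05):
    # trajectory_values: one value per trajectory, e.g. Q-hat(S_{i,0}, pi(S_{i,0}))
    # for the REG baseline, or the per-trajectory MIS value estimates.
    values = np.asarray(trajectory_values, dtype=float)
    n = values.size
    point_estimate = values.mean()
    std_error = np.sqrt(np.var(values, ddof=1) / n)   # sampling variance scaled by 1/N
    z = stats.norm.ppf(1.0 - alpha / 2.0)             # 1.96 for a 95% interval
    return point_estimate - z * std_error, point_estimate + z * std_error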
All the experiments are conducted on a 64-bit Windows platform with Intel(R) Core(TM) i9-9940X CPU @ 3.30GHz.
Finally, when the state is one-dimensional, we report the biases and MSEs of the various OPE estimators, together with the coverage rates of their associated confidence intervals, in Figure 7. Tables 2 and 3 report the standard deviations of the biases and MSEs of the various OPE estimators under the settings described in Section 5.2. Quantiles of the state and reward are reported in Table 4.

Table 4: Quantiles of the reward and state (including the supply-demand equilibrium metric and road distance) in the real data application. Out of privacy concerns, the presented values are scaled.

C Technical Definitions and Proofs
We begin with some notation. In the proofs of Theorems 2 and 3, we write $\psi_0$, $\psi_1(O)$, $\psi_2(O)$ and $\psi_3(O)$ as $\psi_0(\widehat{Q}, \widehat{p}_m, \widehat{p}^*_a)$, $\psi_1(O; \widehat{Q}, \widehat{p}_m, \widehat{p}^*_a, \widehat{\omega})$, $\psi_2(O; \widehat{Q}, \widehat{p}_m, \widehat{p}^*_a, \widehat{\omega})$ and $\psi_3(O; \widehat{Q}, \widehat{p}_m, \widehat{p}^*_a, \widehat{\omega})$ to highlight their dependence on the nuisance function estimators $\widehat{Q}$, $\widehat{p}_m$, $\widehat{p}^*_a$ and $\widehat{\omega}$. We use $\psi^*_0$, $\psi^*_1$, $\psi^*_2$ and $\psi^*_3$ to denote the corresponding oracle versions, with the estimators replaced by their oracle values. Our proofs apply to either a continuous state space or a discrete state space whose number of elements diverges with the sample size. The action and mediator spaces are assumed to be discrete with finitely many elements.
The rest of this section is organized as follows. We first give a detailed definition of $L_2$-norm convergence. We then present the proofs of our main theorems. Theorem 4 can be proven using arguments similar to those used in the proofs of Theorems 2 and 3; hence, we omit its proof for brevity.

C.1 Definition of $L_2$-norm Convergence
We require all nuisance function estimators to converge in $L_2$-norm. Specifically, a sequence of random variables $\{X_N\}_{N \ge 0}$ is said to converge in $L_2$-norm to $X$ if and only if $E|X_N - X|^2 \to 0$ as $N \to \infty$. Adopting this definition, a Q-estimator $\widehat{Q}$ is said to converge in $L_2$-norm to $Q^\pi$ at a rate of $N^{-\alpha}$ if $\{E|\widehat{Q}(m, a, S) - Q^\pi(m, a, S)|^2\}^{1/2} = O_p(N^{-\alpha})$ for any $m$ and $a$. The Q-estimator is said to converge to its oracle value if the right-hand side is $o(1)$. Similarly, a density ratio estimator $\widehat{\omega}$ is said to converge in $L_2$-norm to $\omega^\pi$ at a rate of $N^{-\alpha}$ if $\{E|\widehat{\omega}(S) - \omega^\pi(S)|^2\}^{1/2} = O_p(N^{-\alpha})$; $\widehat{\omega}$ is said to converge to its oracle value if the right-hand side is $o(1)$.
The estimator $\widehat{p}_m$ is said to converge to $p_m$ at a rate of $N^{-\alpha}$ if $\{E|\widehat{p}_m(m \mid a, S) - p_m(m \mid a, S)|^2\}^{1/2} = O_p(N^{-\alpha})$ for any $m$ and $a$; $\widehat{p}_m$ is said to converge to its oracle value if the RHS is $o(1)$.
Finally, the estimator $\widehat{p}_a$ is said to converge to $p^*_a$ at a rate of $N^{-\alpha}$ if $\{E|\widehat{p}_a(a \mid S) - p^*_a(a \mid S)|^2\}^{1/2} = O_p(N^{-\alpha})$ for any $a$; $\widehat{p}_a$ is said to converge to its oracle value if the RHS is $o(1)$.

C.2 Proof of Theorem 1
The main idea of the proof is to iteratively apply the front-door adjustment formula to replace the intervention distributions under the do-operator with functionals of the observed data distribution.
We first observe that, under Assumption 1, the process $\{(S_t, A_t, M_t, R_t)\}_{t \ge 0}$ satisfies the Markov property. This is due to the facts that (i) the distribution of $(A_t, M_t, R_t)$ is independent of the past observed data history conditional on $S_t$; and (ii) the distribution of $S_{t+1}$ is independent of the observed data history up to time $t - 1$ conditional on $(S_t, A_t, M_t)$.
Similarly, according to the data generating mechanism, we can show that the process generated by assuming the system follows the target policy $\pi$ satisfies the Markov property as well. As such, it follows from (2) that it suffices to evaluate $E^\pi(R_t \mid S_0 = s)$ for each $t$. By the front-door adjustment formula, the inner conditional expectation $E\{R_t \mid do(A_t = \pi(S_t)), S_t\}$ can be represented as
$$\sum_{s_{t+1}, r_t, m_t, a_t} r_t\, p^*_{s,r}(s_{t+1}, r_t \mid m_t, a_t, S_t)\, p_m(m_t \mid \pi(S_t), S_t)\, p^*_a(a_t \mid S_t).$$
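For reference, the display above instantiates the standard front-door adjustment formula, applied conditionally on the current state; in generic form (a sketch in the paper's notation, with the inner expectation left unexpanded),
$$E\{R_t \mid do(A_t = a), S_t\} = \sum_{m_t} p_m(m_t \mid a, S_t) \sum_{a_t} p^*_a(a_t \mid S_t)\, E(R_t \mid M_t = m_t, A_t = a_t, S_t),$$
and expanding the inner conditional expectation as a sum over $(s_{t+1}, r_t)$ against $p^*_{s,r}$ recovers the expression displayed above.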
Using the Markov property again, the conditional distribution of $S_t$ given $do(A_j = \pi(S_j))$ for all $0 \le j < t$ and $S_0 = s$ can be written in a similar form, and applying the front-door adjustment formula again, the conditional probability in the second line can be rewritten in terms of the observed data distribution. Repeating the above procedure for each time step, summing $E^\pi(R_t \mid S_0 = s_0)$ from $t = 0$ to $+\infty$, and integrating over $s_0$ with respect to the initial state distribution $\nu$, we obtain the desired identification formula for $\eta^\pi$. The proof is thus completed.

C.3 Proof of Theorem 2
We break the proof into three parts. We first observe that $\widehat{p}_m$ is always consistent under the given conditions. As we have discussed in the main text, this implies that the second augmentation term $(NT)^{-1}\sum_{i,t}\psi_2(O_{i,t}; \widehat{Q}, \widehat{p}_m, \widehat{p}^*_a, \widehat{\omega})$ will converge to zero. We rigorously prove this claim in Part 1.
In Part 2, we focus on the case where $\widehat{Q}$, $\widehat{p}_m$ and $\widehat{p}^*_a$ are consistent and show that $(NT)^{-1}\sum_{i,t}\psi_j(O_{i,t}; \widehat{Q}, \widehat{p}_m, \widehat{p}^*_a, \widehat{\omega})$ converges to zero for $j = 1, 3$, and that $\psi_0(\widehat{Q}, \widehat{p}_m, \widehat{p}^*_a)$ is consistent for $\eta^\pi$. Together with Part 1, this yields the consistency of our estimator.
In Part 3, we consider the case where $\widehat{\omega}$ and $\widehat{p}_m$ are consistent. We focus on showing that a certain moment condition holds for any $Q \in \mathcal{Q}$ and $h_a \in \mathcal{H}_a$, for a given data tuple $O = (S, A, M, R, S')$. As discussed in the main text, this further implies that the estimating function has the same expectation as the IS estimator with correctly specified $\omega^\pi$ and $p_m$, and is thus unbiased for $\eta^\pi$. Applying arguments similar to those used in Parts 1 and 2, we can show that the proposed estimator is consistent. The proof is thus completed.
We next detail the proof for each part.
It suffices to show that both $\eta_1$ and $\eta_2$ converge to zero in probability.
Consider $\eta_1$ first. Under the assumptions that $\Omega$, $\mathcal{Q}$, $\mathcal{H}_m$, $\mathcal{H}_a$ are bounded function classes and $p^*_a(A_{i,t} \mid S_{i,t})$ is almost surely bounded away from zero, $|\eta_1|$ can be upper bounded by the expression in (11), where $O(1)$ denotes some positive constant. We aim to show that (11) converges to zero in probability. By Markov's inequality, it suffices to show (12). Since $\widehat{p}_m$ depends on the observed data as well, the left-hand side (LHS) of (12) is in general not an average of independent terms, so the law of large numbers cannot be applied directly. To prove (12), for any sufficiently small constant $\varepsilon > 0$, we define a class of functions $\mathcal{H}_m(\varepsilon)$ that contains all mass functions $p$ satisfying a total-variation constraint. Under the assumption that $\widehat{p}_m$ converges to $p_m$ in total variation norm, it follows from Markov's inequality that $\widehat{p}_m \in \mathcal{H}_m(\varepsilon)$ with probability approaching one (wpa1) for sufficiently large $N$.
As such, the LHS of (12) can be upper bounded by (13) wpa1. We apply empirical process theory (Van Der Vaart and Wellner; 1996) to bound (13). Specifically, we further decompose (13) as in (14). By the definition of $\mathcal{H}_m(\varepsilon)$ and the Cauchy-Schwarz inequality, the second line of (14) is upper bounded by $\varepsilon$ and can be made arbitrarily small by letting $\varepsilon \to 0$. It suffices to show $\eta_3 = o(1)$.
We apply the maximal inequality specialized to VC-type classes to bound $\eta_3$, and apply the maximal inequality again to bound (15). Specifically, using arguments similar to those used to bound the first line of (14), we can show that (15) is $o_p(1)$. In addition, using arguments similar to those used to show $\eta_1 = o_p(1)$ in Part 1, we can show that the remaining terms are $o_p(1)$ as well. This yields that $(NT)^{-1}\sum_{i,t}\psi_j(O_{i,t}; \widehat{Q}, \widehat{p}_m, \widehat{p}^*_a, \widehat{\omega}) = o_p(1)$ for $j = 1, 3$.

It remains to show that $\psi_0(\widehat{Q}, \widehat{p}_m, \widehat{p}^*_a)$ is consistent for $\eta^\pi$. Since the action and mediator spaces have finitely many elements, it suffices to show (16) for any $a$, $a'$ and $m$, where $\nu$ denotes the initial state distribution. Under the assumption that the process $\{S_t\}_{t \ge 0}$ is stationary, we have $\nu = p_\infty$. Applying arguments similar to those used in Part 1, we can show that the difference
$$E_{s\sim\widehat{\nu}}\,\widehat{Q}(m, a, s)\,\widehat{p}_m(m, a', s)\,\pi(a' \mid s)\,\widehat{p}^*_a(a \mid s) \;-\; E_{s\sim\widehat{\nu}}\,Q^\pi(m, a, s)\,p_m(m, a', s)\,\pi(a' \mid s)\,p^*_a(a \mid s)$$
converges in probability to zero.
Recall that $\nu$ denotes the initial state distribution and $\widehat{\nu}$ its empirical counterpart. Applying the weak law of large numbers again, we have, as $N \to \infty$, that
$$E_{s\sim\widehat{\nu}}\,Q^\pi(m, a, s)\,p_m(m, a', s)\,\pi(a' \mid s)\,p^*_a(a \mid s) \xrightarrow{P} E_{s\sim\nu}\,Q^\pi(m, a, s)\,p_m(m, a', s)\,\pi(a' \mid s)\,p^*_a(a \mid s).$$
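In the display above, the expectation under $\widehat{\nu}$ is simply a sample average over the $N$ observed initial states; in symbols (a standard step, assuming $\widehat{\nu}$ places mass $1/N$ on each $S_{i,0}$ and the initial states are independent draws from $\nu$),
$$E_{s\sim\widehat{\nu}}\, g(s) = \frac{1}{N}\sum_{i=1}^{N} g(S_{i,0}) \xrightarrow{P} E_{s\sim\nu}\, g(s), \qquad g(s) = Q^\pi(m, a, s)\, p_m(m, a', s)\, \pi(a' \mid s)\, p^*_a(a \mid s),$$
by the weak law of large numbers.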
This yields (16). The proof for Part 2 is hence completed.
It follows from the definition of $\psi_3$ that (17) holds. By the definition of $\omega^\pi(S)$, both lines in (17) are equal to
$$\sum_{t \ge 1}\gamma^t \sum_{s, a, a', m} p^\pi_t(s)\, Q^\pi(m, a, s)\, p_m(m, a', s)\, \pi(a' \mid s)\, h_a(a \mid s).$$
The proof is hence completed.

C.4 Proof of Theorem 3
The proof is divided into two parts. In Part 1, we show that the proposed estimator is asymptotically equivalent to the oracle estimator $\widehat{\eta}^* = \psi^*_0 + (NT)^{-1}\sum_{j=1}^{3}\sum_{i,t}\psi^*_j(O_{i,t})$, with the difference bounded by $o_p(N^{-1/2})$. In Part 2, we show that the oracle estimator is asymptotically normal and satisfies $\sqrt{N}(\widehat{\eta}^* - \eta^\pi) \xrightarrow{d} \mathcal{N}(0, \sigma^2_T)$, where $\sigma^2_T$ is the semiparametric efficiency bound. By Slutsky's theorem, the proposed estimator is then asymptotically normal and achieves the semiparametric efficiency bound as well. This completes the proof.
Before detailing each part of the proof, we make some remarks. First, it follows from Slutsky's theorem that the proposed estimator is asymptotically normal and satisfies $\sqrt{N}(\widehat{\eta} - \eta^\pi) \xrightarrow{d} \mathcal{N}(0, \sigma^2_T)$ as well. Second, a key observation is that, for any $j = 1, 2, 3$, $\sum_t \psi^*_j(O_{i,t})$ corresponds to a sum of a martingale difference sequence with respect to the filtration formed by the $\sigma$-algebras generated by $\{O_{i,l}\}_{0 \le l \le t-1}$. Under the stationarity assumption, $\sigma^2_T$ is the sum of two terms: the first term, $\mathrm{Var}_{m,a,a',S_0}\bigl\{Q^\pi(m, a, S_0)\, p_m(m, a', S_0)\, \pi(a' \mid S_0)\, p^*_a(a \mid S_0)\bigr\}$, is due to the variation of the initial state distribution in the plug-in estimator $\psi^*_0$; the second term is due to the variation of the three augmentation terms. Consequently, $\sigma^2_T$ decreases with $T$. As $T \to \infty$, $\sigma^2_T$ converges to the first term.
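To see why the augmentation contribution shrinks with $T$, one can use the standard variance identity for stationary martingale difference sequences; the following is a generic sketch of this step rather than the paper's exact expression. For a stationary martingale difference sequence $\{D_t\}$,
$$\mathrm{Var}\Bigl(\frac{1}{T}\sum_{t=0}^{T-1} D_t\Bigr) = \frac{1}{T^2}\sum_{t=0}^{T-1} E(D_t^2) = \frac{1}{T}\, E(D_0^2),$$
since the cross terms $E(D_t D_s)$ with $s < t$ vanish. Applied to $D_t = \sum_{j=1}^{3}\psi^*_j(O_{i,t})$, this indicates that the variance contribution of the augmentation terms is of order $1/T$, which is consistent with $\sigma^2_T$ decreasing in $T$ and converging to the first term above.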

Part 1.
We provide a sketch of the proof for brevity. We decompose the difference between the proposed value estimator and the oracle estimator as $\eta^{(1)}(\widehat{Q}, \widehat{p}^*_a) + \eta^{(2)}(\widehat{Q}, \widehat{p}^*_a)$. Consider $\eta^{(1)}(\widehat{Q}, \widehat{p}^*_a)$ first. As discussed in Part 3 of the proof of Theorem 2, for any $Q \in \mathcal{Q}$ and $h_a \in \mathcal{H}_a$, we have $E\,\eta^{(1)}(Q, h_a) = 0$. Using arguments similar to those in the proof of Theorem 2, we can apply empirical process theory to show that $\eta^{(1)}(\widehat{Q}, \widehat{p}^*_a) = o_p(N^{-1/2})$, under the condition that $\widehat{Q}$ and $\widehat{p}^*_a$ converge in $L_2$-norm to their oracle values. Next, consider $\eta^{(2)}(\widehat{Q}, \widehat{p}^*_a)$. Similarly to $\eta^{(1)}$, we can show that $\eta^{(2)}(Q^\pi, \widehat{p}^*_a) = o_p(N^{-1/2})$. It remains to show that $\eta^{(2)}(\widehat{Q}, \widehat{p}^*_a) - \eta^{(2)}(Q^\pi, \widehat{p}^*_a) = o_p(N^{-1/2})$. This difference can be decomposed into a sum of two lines. To save space, we focus on proving that the first line is $o_p(N^{-1/2})$; using similar arguments, one can show that the second line is $o_p(N^{-1/2})$ as well. This completes the proof of this part.
With some calculations, we can show that the absolute value of the first line can be further bounded by $\eta_4 + \eta_5$. Applying the Cauchy-Schwarz inequality, this can be further upper bounded by the expression in (18). Using arguments similar to those used to bound (13), we can show that (18) is $o_p(N^{-1/2})$ under the condition that $\widehat{p}_m$ and $\widehat{p}^*_a$ converge in $L_2$-norm to their oracle values at a rate of $N^{-\kappa^*}$ for some $\kappa^* > 1/4$.
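The role of the requirement $\kappa^* > 1/4$ is the usual doubly robust rate calculation; a generic sketch of the step, in the notation above, is
$$\bigl\|\widehat{p}_m - p_m\bigr\|_2\,\bigl\|\widehat{p}^*_a - p^*_a\bigr\|_2 = O_p(N^{-\kappa^*})\cdot O_p(N^{-\kappa^*}) = O_p(N^{-2\kappa^*}) = o_p(N^{-1/2}) \quad \text{whenever } \kappa^* > 1/4,$$
so the product bias induced by estimating both nuisance functions is negligible at the $\sqrt{N}$ scale.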
This quantity can be represented as in (21); see Section C.5 for a detailed proof.
We can further decompose (21) into three parts and express it as $I_1 + I_2 + I_3$, where $\bar{O}_T$ denotes the observed data history $\{O_0, O_1, \cdots, O_{T-1}\}$ up to time $T$ and $S(\cdot)$ denotes the score function, i.e., the gradient of the log-likelihood function with respect to $\theta$, evaluated at $\theta = \theta_0$. See Section C.6 for a detailed proof.
As such, $\nabla_\theta\, \eta^{\pi}_{\theta_0}$ takes the stated form, which completes the proof.
C.6.1 Derivation of $I_1$

In the following, we focus on proving (22).
First, we note a basic identity for the score function, from which the stated expansion follows. Using the fact that the expectation of a score function is zero, this expansion can be further simplified. Finally, applying the change-of-measure theorem, $I_3$ can be represented in the desired form.