Statistically Efficient Advantage Learning for Offline Reinforcement Learning in Infinite Horizons

We consider reinforcement learning (RL) methods in offline domains without additional online data collection, such as mobile health applications. Most existing policy optimization algorithms in the computer science literature are developed in online settings where data are easy to collect or simulate. Their generalizations to mobile health applications with a pre-collected offline dataset remain unknown. The aim of this paper is to develop a novel advantage learning framework in order to efficiently use pre-collected data for policy optimization. The proposed method takes an optimal Q-estimator computed by any existing state-of-the-art RL algorithm as input, and outputs a new policy whose value is guaranteed to converge at a faster rate than the policy derived from the initial Q-estimator. Extensive numerical experiments are conducted to back up our theoretical findings. A Python implementation of our proposed method is available at https://github.com/leyuanheart/SEAL.


Introduction
Reinforcement learning (RL, see Sutton and Barto, 2018, for an overview) is concerned with how intelligent agents learn and take actions in an unknown environment in order to maximize the cumulative reward they receive. It has arguably been one of the most vibrant research frontiers in machine learning over the last few years. According to Google Scholar, over 40K scientific articles were published in 2020 with the phrase "reinforcement learning". Over 100 papers on RL were accepted for presentation at ICML 2021, a premier conference in the machine learning area, accounting for more than 10% of all accepted papers. RL algorithms have been applied in a wide variety of real applications, including games, robotics (Kormushev et al., 2013), healthcare (Komorowski et al., 2018), bidding (Jin et al., 2018), ridesharing (Xu et al., 2018) and automated driving (de Haan et al., 2019), to name a few.
This paper is partly motivated by the need for statistical learning methodologies in offline RL domains such as mobile health (mHealth). mHealth technologies have recently emerged due to the widespread use of mobile phones, tablet computers and wearable devices. They play an important role in precision medicine as they offer a means to monitor a patient's health status and deliver interventions in real time. They also collect rich longitudinal data for optimal treatment decision making. One motivating example considered in this paper uses the OhioT1DM Dataset (Marling and Bunescu, 2018). It contains 8 weeks of data for 6 patients with type 1 diabetes, an autoimmune disease wherein the pancreas produces insufficient levels of insulin. For those patients, their continuous glucose monitoring blood glucose levels, insulin doses being injected, and self-reported times of meals and exercises are continually measured. Their outcomes have the potential to be improved by treatment policies tailored to the continually evolving health status of each patient (Luckett et al., 2020). Despite the popularity of developing various RL algorithms in the computer science literature, statistics, as a field, has only recently begun to engage with RL both in depth and in breadth. Most works in the statistics literature have focused on developing data-driven methodologies for precision medicine with only a few treatment stages (see e.g., Murphy, 2003; Robins, 2004; Chakraborty et al., 2010; Qian and Murphy, 2011; Zhang et al., 2013; Zhao et al., 2015; Wallace and Moodie, 2015; Song et al., 2015; Luedtke and van der Laan, 2016; Zhu et al., 2017; Wang et al., 2018; Qi et al., 2020; Nie et al., 2020). These methods require a large number of patients in the observed data to be consistent. They are not applicable to mHealth applications with only a few patients, which is the case in the OhioT1DM dataset.
Nor are they applicable to many other sequential decision making problems in infinite horizons where the number of decision stages is allowed to diverge to infinity, such as games or robotics. Recently, a few algorithms have been proposed in the statistics literature for policy optimization in mHealth applications (Ertefaie and Strawderman, 2018;Luckett et al., 2020;Hu et al., 2020;Liao et al., 2020;Zhou et al., 2021).
Among all existing methods in infinite horizons, Q-learning (Watkins and Dayan, 1992) is arguably one of the most popular model-free RL algorithms. It derives the optimal policy by learning an optimal Q-function, without explicitly modelling the system dynamics. Variants of Q-learning include gradient Q-learning (Maei et al., 2010;Ertefaie and Strawderman, 2018), fitted Q-iteration (Riedmiller, 2005), deep Q-network (DQN, Mnih et al., 2015), double DQN (Van Hasselt et al., 2016) and quantile DQN (Dabney et al., 2018), among others. All these Q-learning type algorithms are primarily motivated by the application of developing artificial intelligence in online video games, so their generalization to offline applications with a pre-collected dataset remains unknown.
Different from online settings (e.g., video games) where data are easy to collect or simulate, the number of observations in many offline applications (e.g., healthcare) is limited.
Take the OhioT1DM dataset as an example: only a few thousand observations are available. With such limited data, it is critical to develop RL algorithms that are statistically efficient. Instead of proposing a specific algorithm for policy optimization, our work undertakes the ambitious task of devising an "efficiency enhancement" method that is generally applicable to any Q-learning type algorithm to improve its statistical efficiency.
The input of our method is an optimal Q-estimator computed by existing state-of-the-art RL algorithms and the output is a new policy whose value converges at a faster rate than the policy derived based on the initial Q-estimator.
The proposed method is motivated by a line of research on developing A-learning type algorithms to learn an optimal dynamic treatment regime (DTR) to implement precision medicine (see e.g., Murphy, 2003; Robins, 2004; Lu et al., 2013). (Similar algorithms have been developed in the causal inference literature for heterogeneous treatment effect estimation; see e.g., Tian et al., 2014; Nie and Wager, 2017; Kennedy, 2020.) These methods directly model the difference between two conditional mean functions (known as the contrast function). They are semi-parametrically efficient and outperform Q-learning (see e.g., Chakraborty et al., 2010; Qian and Murphy, 2011) in cases where the Q-function is misspecified. In addition, A-learning has the so-called double robustness property, i.e., the estimated optimal DTR is consistent when either the model for the conditional mean function or the treatment assignment mechanism is correctly specified.
The contributions of our paper are summarized as follows. Methodologically, we propose a statistically efficient advantage learning procedure to estimate the optimal policy in offline infinite horizon settings. Our proposal integrates existing policy optimization and policy evaluation algorithms in RL. Specifically, we start with applying existing Q-learning type algorithms to compute an initial estimator for the optimal Q-function. Based on these Qestimators, we leverage ideas from the off-policy evaluation literature (OPE, see e.g. Jiang and Li, 2016;Thomas and Brunskill, 2016;Liu et al., 2018;Uehara, 2019, 2020;Shi et al., 2021) to construct pseudo outcomes that are asymptotically unbiased to the optimal contrast function (see Section 2.2 for the detailed definition). With these pseudo outcomes as the prediction target, we can directly apply existing state-of-the-art supervised learning algorithms to derive the optimal policy. The use of OPE effectively alleviates the bias of the estimated contrast function resulting from the potential model misspecification of the optimal Q-function, which in turn improves the statistical efficiency over Q-learning.
In that sense, our proposal shares similar spirits with the A-learning type methods to learn DTRs in finite horizons.
Theoretically, we show our estimated contrast function converges at a faster rate than the Q-function computed by existing state-of-the-art Q-learning type algorithms (Theorem 2). This in turn implies that our estimated policy achieves a larger value function (Theorem 3). All the error bounds derived in this paper converge to zero when either the number of trajectories N or the number of decision stages per trajectory T approaches infinity. This guarantees the consistency of our method when applied to a wide range of real-world problems, ranging from the OhioT1DM Dataset, which contains eight weeks of data for 6 patients, to the 2018 Intern Health Study with over 1000 subjects (see e.g., NeCamp et al., 2020). It is also applicable to data generated from online video games where both N and T are allowed to grow to infinity. (Q-learning for DTRs mentioned above is different from the Q-learning type algorithms in RL, due to different data structures and model setups: the former relies on a backward induction algorithm to identify the optimal DTR in finite horizon settings with only a few treatment stages, whereas Q-learning type algorithms in RL usually rely on a Markov assumption to derive the optimal policy in infinite horizons.)
Empirically, we show that our procedure outperforms existing learning algorithms using both synthetic datasets and a real dataset from a mobile health application. We remark that most papers in the existing literature use synthetic datasets to evaluate the performance of different RL algorithms. Results in our paper offer a useful evaluation tool for assessing these algorithms in real applications.
The rest of this article is organised as follows. In Section 2, we introduce some basic concepts in RL, describe the data generating process and formulate the problem. In Section 3, we demonstrate the advantage of A-learning over Q-learning by comparing their rates of convergence. The proposed algorithm is formally presented in Section 4. In Section 5, we study the statistical properties of our algorithm, proving that our estimated policy achieves a faster rate of convergence than existing Q-learning type algorithms. In Section 6, we investigate the finite sample performance of the proposed algorithm using Monte Carlo simulations. In Section 7, we use the OhioT1DM Dataset to further demonstrate the empirical advantage of the proposed algorithm over other baseline algorithms. Finally, we conclude our paper with a discussion section. Proofs of our major theorems are presented in Section A of the supplementary article.

Preliminaries
We first formulate the policy optimization problem in infinite horizon settings. We next briefly review Q-learning.

Problem Formulation
RL is concerned with solving sequential decision making problems in an unknown environment. The observed data can be summarized into a sequence of state-action-reward triplets over time. At each time t ≥ 0, the decision maker observes some features from the environment, summarized into a state vector S t ∈ S where the state space S is assumed to be a subset of R d . The decision maker then selects an action A t from the action space A. The environment responds by providing the decision maker with an immediate reward R t ∈ R and moving to the next state S t+1 . In this paper, we focus on the setting where A is discrete.
Extensions to the continuous action space are discussed in Sections 8.1 and 8.2. The state space S can be either continuous or discrete.
A policy defines the agent's way of behaving. A history-dependent policy π is a sequence of decision rules {π_t}_{t≥0}, where each π_t is a function that maps the observed data history to a probability distribution on the action space at time t. When these decision rules are time-homogeneous (i.e., π_1 = π_2 = · · · = π_t = · · · ) and depend on the past data history only through the current state vector, the resulting policy is referred to as a stationary policy. Following π, the discounted cumulative reward that the decision maker receives is referred to as the value function,

V^π(s) = Σ_{t≥0} γ^t E^π(R_t | S_0 = s),

where the expectation E^π is taken by assuming that actions are assigned according to π, and 0 ≤ γ < 1 is a discount factor that balances long-term and short-term rewards.
The objective of policy optimization is to identify an optimal policy π^opt that maximizes the value, i.e., π^opt = arg max_π E V^π(S_0).
We model the data generating process by a Markov decision process (MDP, Puterman, 1994). Specifically, we impose the following Markov assumption (MA) and conditional mean independence assumption (CMIA).
(MA) There exists some function q such that for any t ≥ 0 and any measurable subset S ⊆ S, we have

P(S_{t+1} ∈ S | A_t, S_t, A_{t−1}, S_{t−1}, . . . , A_0, S_0) = ∫_S q(s'; A_t, S_t) ds'.

(CMIA) There exists some reward function r such that for any t ≥ 0, we have

E(R_t | A_t, S_t, A_{t−1}, S_{t−1}, . . . , A_0, S_0) = r(A_t, S_t).

We make a few remarks. First, MA requires the future state to be conditionally independent of the past data history given the current state-action pair. The function q corresponds to the Markov transition density function that characterizes the state transitions. This assumption is testable from the observed data. Second, under MA, CMIA is automatically satisfied when R_t is a deterministic function of S_t, A_t and S_{t+1}. The latter assumption is commonly imposed in the literature (Ertefaie and Strawderman, 2018; Luckett et al., 2020). CMIA is weaker than this assumption.
Third, these two assumptions lay the foundations of existing state-of-the-art RL algorithms (e.g., DQN). Specifically, they guarantee the existence of an optimal stationary policy that is no worse than any history-dependent policy (see e.g., Puterman, 1994). This allows us to restrict our attention to the class of stationary policies. For any such policy π, we use π(·|s) to denote the probability mass function that the decision maker follows to select actions given that the environment is in state s.
Let {(S_{i,t}, A_{i,t}, R_{i,t})}_{0≤t≤T} be the data collected from the i-th trajectory, where T is the termination time. We assume these trajectories are independent copies of {(S_t, A_t, R_t)}_{t≥0}. Our objective is to learn π^opt based on this offline dataset.

Q-learning
For a given policy π, we define the state-action value function (better known as the Q-function) under π as

Q^π(a, s) = Σ_{t≥0} γ^t E^π(R_t | A_0 = a, S_0 = s).
It represents the average cumulative reward that the decision maker will receive if they select the action a initially and follow π afterwards. In addition, notice that

Q^π(a, s) = E^π(Σ_{t≥0} γ^t R_t | A_0 = a, S_0 = s)
          = E^π(R_0 | A_0 = a, S_0 = s) + γ E^π(Σ_{t≥0} γ^t R_{t+1} | A_0 = a, S_0 = s)
          = r(a, s) + γ E^π{Q^π(A_1, S_1) | A_0 = a, S_0 = s}
          = r(a, s) + γ ∫_{s'} Σ_{a'} π(a'|s') Q^π(a', s') q(s'; a, s) ds',   (1)

where the third equation follows from CMIA and the definition of Q^π, and the last equation follows from MA. The above equation is referred to as the Bellman equation for Q^π.
Define the optimal Q-function Q^opt as Q^opt(a, s) = max_π Q^π(a, s) for any state-action pair (a, s). Under MA and CMIA, it can be shown that π^opt satisfies

π^opt(a|s) = I{a = arg max_{a'} Q^opt(a', s)},  ∀a, s,   (2)

where I{·} denotes the indicator function. In addition, we have Q^opt = Q^{π^opt}. Similar to (1), one can show that Q^opt satisfies the following Bellman optimality equation,

Q^opt(a, s) = r(a, s) + γ ∫_{s'} max_{a'} Q^opt(a', s') q(s'; a, s) ds',   (3)

or equivalently,

Q^opt(a, s) = E{R_t + γ max_{a'} Q^opt(a', S_{t+1}) | A_t = a, S_t = s}.   (4)

Equations (2) and (4) form the basis for all Q-learning type algorithms. Specifically, these algorithms first estimate the optimal Q-function by solving (4) and then derive the estimated optimal policy based on (2). Take the fitted Q-iteration algorithm as an example. It iteratively updates the optimal Q-function using supervised learning. At each iteration, the input includes (A_t, S_t), which serves as the "predictors", and R_t + γ max_a Q̂(a, S_{t+1}), which serves as the "response", where Q̂ denotes the current estimate of the optimal Q-function.
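The fitted Q-iteration update just described can be sketched in a few lines. Below is a minimal tabular illustration in Python; the toy MDP, dataset, and function names are our own stand-ins, not the paper's:

```python
# A minimal tabular fitted Q-iteration sketch (illustrative only). Transitions are
# offline (s, a, r, s_next) tuples; each iteration regresses r + gamma * max_a' Q(a', s')
# on (s, a), which for a tabular model reduces to averaging targets within each (s, a) cell.
def fitted_q_iteration(transitions, actions, gamma, n_iter=100):
    Q = {}
    for _ in range(n_iter):
        targets = {}
        for s, a, r, s_next in transitions:
            v_next = max(Q.get((s_next, ap), 0.0) for ap in actions)
            targets.setdefault((s, a), []).append(r + gamma * v_next)
        Q = {sa: sum(ts) / len(ts) for sa, ts in targets.items()}
    return Q

# Toy deterministic MDP: taking action a moves the agent to state a; the reward
# equals the current state, so the optimal policy always chooses action 1.
data = [(s, a, float(s), a) for s in (0, 1) for a in (0, 1)]
Q = fitted_q_iteration(data, actions=(0, 1), gamma=0.5)
```

With γ = 0.5, the fixed point is Q(1, 1) = 2, Q(0, 1) = 1, Q(1, 0) = 1.5 and Q(0, 0) = 0.5, and the greedy policy derived from (2) always selects action 1.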
Finally, we introduce the contrast function. For a given π, define the contrast function associated with π as τ^π(a, s) = Q^π(a, s) − Q^π(a_0, s) for some a_0 ∈ A. In practice, the control arm a_0 could be set to the action that occurs most often in the data. This is because the baseline Q-function Q^π(a_0, s) needs to be accurately estimated in order to consistently estimate the contrast function. Hence, it is natural to consider the most frequently selected arm, which has the largest number of observations to learn the baseline Q-function. Let τ^opt(a, s) = τ^{π^opt}(a, s) be the optimal contrast function. Similar to (2), we obtain that

π^opt(a|s) = I{a = arg max_{a'} τ^opt(a', s)}

for any a and s. Consequently, to estimate the optimal policy, it suffices to estimate τ^opt.
This observation motivates the proposed advantage learning method.

Q-v.s. A-learning
This section is organised as follows. We first introduce the minimax-optimal statistical convergence rate in supervised learning, which serves as an evaluation metric to compare various supervised learning algorithms. We next demonstrate the advantage of A-learning over Q-learning by comparing the worst-case convergence rates of the estimated optimal contrast and Q-functions. Finally, we discuss the challenge of developing statistically efficient A-learning algorithms.

Minimax optimal statistical convergence rate
Consider a supervised learning setup where we are given i.i.d. random vectors {(X_i, Y_i) : 1 ≤ i ≤ n}. Our objective is to predict the value of the response Y from the value of the feature X ∈ S. The aim is to construct a predictor m̂ that best approximates the conditional mean function m(X) = E(Y|X). For any such predictor m̂, its prediction accuracy is measured by the root mean square error,

{E|m̂(X) − m(X)|²}^{1/2}.   (5)

Suppose m belongs to the class of p-smooth (also known as Hölder smooth with exponent p) functions. When p is an integer, this condition essentially requires m to have bounded derivatives up to the pth order. Formally speaking, for a J-tuple α = (α_1, . . . , α_J) of nonnegative integers and a given function h on S, let D^α denote the differential operator

D^α h(s) = ∂^{|α|} h(s) / (∂s_1^{α_1} · · · ∂s_J^{α_J}),  |α| = Σ_j α_j.

Here, s_j denotes the jth element of s. For any p > 0, let ⌊p⌋ denote the largest integer that is smaller than p. The class of p-smooth functions is defined as the set of functions h satisfying

max_{|α| ≤ ⌊p⌋} sup_s |D^α h(s)| + max_{|α| = ⌊p⌋} sup_{s_1 ≠ s_2} |D^α h(s_1) − D^α h(s_2)| / ||s_1 − s_2||_2^{p−⌊p⌋} ≤ c,

for some constant c > 0. When 0 < p ≤ 1, we have ⌊p⌋ = 0, and the condition is equivalent to requiring h to satisfy sup_{s_1, s_2} |h(s_1) − h(s_2)| / ||s_1 − s_2||_2^p ≤ c. The notion of p-smoothness is thus reduced to Hölder continuity. Stone (1982) showed that the optimal minimax rate of convergence for m is given by

n^{−p/(2p+d)},   (6)

where d denotes the dimension of S. In other words, for any data-dependent predictor m̂, there exists some p-smooth function m such that (5) decays at a rate of (6). This rate cannot be improved without imposing certain parametric model assumptions on m. Notice that the exponent in (6) increases with the smoothness parameter p. In other words, the smoother the underlying regression function, the faster the worst-case rate of convergence a supervised learner can achieve.
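To make the rate formula concrete, the following snippet (with hypothetical values of n, p and d of our own choosing) evaluates n^{−p/(2p+d)} and confirms that a smoother target yields a faster worst-case rate:

```python
# Numeric illustration of the minimax rate n^{-p/(2p+d)} from Stone (1982).
def minimax_rate(n, p, d):
    """Worst-case RMSE rate for estimating a p-smooth function on R^d from n samples."""
    return n ** (-p / (2 * p + d))

# A smoother target (larger p) yields a faster worst-case rate at the same n and d.
r_rough = minimax_rate(10_000, p=1, d=4)   # rougher regression function
r_smooth = minimax_rate(10_000, p=4, d=4)  # smoother regression function
```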
Finally, we remark that we focus on the class of Hölder smooth functions throughout this paper. Alternatively, one may consider the Sobolev space. Discussion of Sobolev and Hölder spaces can be found in Giné and Nickl (2021).

Modelling contrast or Q-function?
We assume the state space S is continuous and that both the transition density q(s'; a, ·) and the reward function r(a, ·) belong to the class of p-smooth functions on S for some p > 0. The p-smoothness assumption is likely to hold in many mobile health applications; we defer the related discussion to Section 8.3. Under this condition, the optimal Q-function is p-smooth as well (see Section 4, Fan et al., 2020). Fan et al. (2020) proved that the Q-function computed by DQN achieves a rate of (NT)^{−p/(2p+d)} up to some logarithmic factors.
As they commented, this rate achieves the minimax-optimal statistical convergence rate in (6) within the class of p-smooth functions and cannot be further improved.
Since the optimal contrast function corresponds to the difference between two optimal Q-functions, τ^opt is at least as smooth as Q^opt. On the other hand, there are cases where τ^opt is strictly "smoother" than Q^opt, leading to a possibly faster worst-case rate of convergence according to the minimax-optimal rate formula. We consider two examples to elaborate.
Example 1 (Independent Transitions). Consider the setting where the state transitions are independent, i.e., q(s'; a, s) = q(s') is independent of (a, s). Then Q^opt(a, s) = r(a, s) + C for some constant C > 0 that is independent of s and a. Suppose the reward function has the following decomposition r(a, s) = r*(a, s) + r_0(s), for some p-smooth baseline reward function r_0 and p*-smooth function r* with p* > p. It follows that Q^opt is only p-smooth, whereas τ^opt(a, s) = r*(a, s) − r*(a_0, s) is p*-smooth, since the rough baseline r_0 cancels in the contrast.

Example 2 (Dependent Transitions). Suppose q has the following decomposition q(s'; a, s) = q*(s'; a, s) + q_0(s'; s), where q*(s'; a, ·) has derivatives up to the p*-th order whereas q_0(s'; ·) has derivatives up to the p-th order with p < p*. By changing the order of integration and differentiation with respect to s, we can show that the second term on the right-hand-side (RHS) of (3) is p-smooth. Suppose r(a, ·) has derivatives of all orders. It follows from (3) that Q^opt is p-smooth.
Taking the difference of (3) at a and a_0, the q_0-component cancels, leaving τ^opt(a, s) = r(a, s) − r(a_0, s) + γ ∫_{s'} max_{a'} Q^opt(a', s') {q*(s'; a, s) − q*(s'; a_0, s)} ds'. Using similar arguments, we can show that the last term on the RHS is p*-smooth. This in turn implies that τ^opt is p*-smooth as well.
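The cancellation in Example 1 can also be checked numerically. In the sketch below, the particular choices of r*, r_0 and the constant C are our own illustrative stand-ins; the point is simply that the contrast depends only on the smooth component:

```python
import math

# Numeric check of Example 1: with independent transitions, Q(a, s) = r(a, s) + C,
# so the rough baseline r0 cancels in the contrast tau(a, s) = Q(a, s) - Q(a0, s).
def r_star(a, s):          # smooth, action-dependent component (illustrative choice)
    return a * math.sin(s)

def r0(s):                 # rough baseline, non-differentiable at 0 (illustrative choice)
    return abs(s)

def q(a, s, C=3.0):        # Q-function up to the constant C from Example 1
    return r_star(a, s) + r0(s) + C

def tau(a, s, a0=0):
    return q(a, s) - q(a0, s)

# tau depends only on the smooth component r_star, not on r0 or C.
val = tau(1, 0.7)
```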
To conclude this section, we remark that the minimax rate for the contrast function has been recently established in single-stage decision making (Kennedy et al., 2022). In infinite horizon settings with tabular models, several papers have investigated the minimax-optimality of the Q-learning estimator (see e.g., Wainwright, 2019; Li et al., 2020). In settings with continuous state space, a recent proposal of Chen and Qi (2022) derived a minimax lower bound for the Q-function estimator under a fixed target policy and found that the rate matches those for nonparametric regression (Stone, 1982). We expect that similar arguments can be applied to formally obtain the minimax lower bounds for the estimated optimal Q- or contrast function.

The challenge
So far we have shown that the worst-case convergence rate of the estimated optimal contrast function is faster than that of the estimated optimal Q-function. However, it remains challenging to devise an advantage learning algorithm that achieves such a rate of convergence.
To elaborate, let us revisit the Bellman optimality equation in (4). By the definition of the optimal contrast function, it follows that

τ^opt(a, s) = E{R_t + γ max_{a'} τ^opt(a', S_{t+1}) + γ Q^opt(a_0, S_{t+1}) | A_t = a, S_t = s} − Q^opt(a_0, s).   (8)

The presence of the nuisance function Q^opt(a_0, ·) in the above equation poses a serious challenge to efficient estimation of τ^opt. A simple solution is to apply Q-learning type algorithms to learn the nuisance function, plug this estimator into (8), and update τ^opt using, e.g., fitted Q-iteration. However, such an approach would yield a sub-optimal solution. This is because the estimation error of the initial Q-estimator would directly affect that of the estimated contrast function. As a result, the estimated contrast would have the same convergence rate as the Q-estimator.

Statistically efficient A-learning
We first present the motivation of our algorithm. We next formally introduce our proposal.

A thought experiment
To illustrate the idea, in this section, let us consider a simplified model where the discount factor γ = 0 and the transitions are independent (see Example 1). In that case, we are interested in learning an optimal myopic policy that maximizes the short-term reward on average, which is essentially a single-stage decision making problem. By definition, the Q-function Q^π and the contrast τ^π are independent of the policy π; we thus write them as Q and τ. Equation (8) can be rewritten as

τ(a, s) = E(R_t | A_t = a, S_t = s) − Q(a_0, s),   (9)

where Q(a_0, s) = E(R_t | A_t = a_0, S_t = s).
A-learning algorithms developed in the statistics literature can be employed to learn the contrast function in this setting. They are motivated by the following identity,

τ(a, s) = E[ {I(A_t = a)/Pr(A_t = a|S_t)}{R_t − Q(a, S_t)} − {I(A_t = a_0)/Pr(A_t = a_0|S_t)}{R_t − Q(a_0, S_t)} + Q(a, S_t) − Q(a_0, S_t) | S_t = s ].   (10)

Unlike Equation (9), the above identity is doubly robust. It holds when either the propensity score Pr(A_t = ·|S_t) or the Q-function is correctly specified. This motivates the following two-step procedure. In the first step, we estimate the propensity score and the Q-function from the observed data. In the second step, we plug these estimates into (10) to estimate the contrast function. Such a two-step method guarantees that the estimated contrast is robust to potential model misspecification of the Q-function.
When linear sieves are used to approximate τ, i.e., τ(a, s) = φ(a, s)^T β_0 for some basis function φ, an estimating equation for β_0 can be constructed based on (10). A Dantzig selector-type regularization can be applied when the number of basis functions is large. To employ more flexible machine learning methods, we can consider the following least-square objective function,

Σ_i Σ_t Σ_{a∈A} {ψ(S_{i,t}, A_{i,t}, R_{i,t}, a) − τ(a, S_{i,t})}².

Here, ψ(S_{i,t}, A_{i,t}, R_{i,t}, a) serves as a pseudo outcome for τ(a, S_{i,t}). It is derived based on augmented inverse probability weighting (AIPW, see e.g., Bang and Robins, 2005). One can similarly show that E{ψ(S_{i,t}, A_{i,t}, R_{i,t}, a)|S_{i,t}} is unbiased to τ(a, S_{i,t}) when either the propensity score or the Q-function is correctly specified. A by-product of the doubly-robustness property is that, when both nuisance functions are estimated from the data, the bias of the pseudo outcome converges at a faster rate than these estimated nuisance functions. This in turn allows the resulting estimated contrast to converge at a faster rate than the Q-function. See Section 5 for details.
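The double robustness of an AIPW-style pseudo outcome can be illustrated with a small simulation. The sketch below pairs a deliberately misspecified Q-model with the correct propensity score; all model choices and names here are hypothetical, not the paper's:

```python
import random

random.seed(0)

# Sketch of an AIPW pseudo outcome in the single-stage setting (gamma = 0).
# psi is unbiased for tau(a, s) when the propensity is correct, even if q_hat is wrong.
def psi(s, a_obs, r_obs, a, a0, propensity, q_hat):
    w_a = float(a_obs == a) / propensity(a, s)
    w_a0 = float(a_obs == a0) / propensity(a0, s)
    return (w_a * (r_obs - q_hat(a, s)) - w_a0 * (r_obs - q_hat(a0, s))
            + q_hat(a, s) - q_hat(a0, s))

prop = lambda a, s: 0.5                 # known, correct propensity (two actions)
q_wrong = lambda a, s: 10.0 * a         # deliberately misspecified Q-model
true_r = lambda a, s: a + s             # true reward mean, so tau(1, s) = 1 for all s

n = 200_000
est = 0.0
for _ in range(n):
    s = random.random()
    a_obs = random.randint(0, 1)
    r_obs = true_r(a_obs, s) + random.gauss(0, 0.1)
    est += psi(s, a_obs, r_obs, a=1, a0=0, propensity=prop, q_hat=q_wrong) / n
# est should be close to the true average contrast E tau(1, S) = 1 despite q_wrong.
```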
Although the above solution is developed in single-stage decision making, the same principle can be applied to general sequential decision making problems in infinite horizons, as we detail in the next section.

The complete algorithm
Our proposal involves two key components. First, we apply existing off-policy evaluation methods to construct pseudo outcomes for the optimal contrast function. This effectively reduces the bias of the initial Q-estimators: as we show in Theorem 1, the bias of our pseudo outcomes decays at a much faster rate than that of the initial Q-estimators. It in turn ensures that the estimated contrast is robust to model misspecification of the Q-function, improving its rate of convergence. Second, we learn τ^opt by directly minimizing the least square loss between the pseudo outcomes and the estimated contrast. This allows us to borrow the strength of supervised learning to improve the statistical efficiency of RL. We call this method SEAL, short for statistically efficient advantage learning.
Our proposal consists of five steps, including data splitting, policy optimization, estimation of the density ratio, construction of pseudo outcomes, and supervised learning. We next discuss each step in detail.

Step 1. Data splitting
We randomly divide the indices of all trajectories {1, . . . , N} into L subsets ∪_{ℓ=1}^L I_ℓ of equal size, for some fixed integer L > 0. Let I_ℓ^c be the complement of I_ℓ. Data splitting allows us to use one part of the data (I_ℓ^c) to train RL models and the remaining part (I_ℓ) to construct the pseudo outcomes; aggregating the estimates over different ℓ recovers full efficiency. This allows the bias of the constructed pseudo outcomes to decay to zero under minimal conditions on the estimated RL models. We remark that data splitting has been commonly used in the statistics and machine learning literature (see e.g., Chernozhukov et al., 2018; Romano and DiCiccio, 2019; Kallus and Uehara, 2019).
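Step 1 amounts to a standard fold split over trajectory indices. A minimal sketch (function and variable names are ours):

```python
import random

# Split trajectory indices {0, ..., N-1} into L equal-sized random folds, as in Step 1.
def split_folds(n_trajectories, n_folds, seed=0):
    idx = list(range(n_trajectories))
    random.Random(seed).shuffle(idx)
    return [sorted(idx[k::n_folds]) for k in range(n_folds)]

folds = split_folds(n_trajectories=12, n_folds=3)
# I_l^c for each fold: the trajectories used to train the RL models for that fold.
complement = [sorted(set(range(12)) - set(f)) for f in folds]
```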

Step 2. Policy optimization
For ℓ = 1, . . . , L, we apply existing state-of-the-art Q-learning type algorithms to the data subset I_ℓ^c to compute an initial Q-estimator Q^(ℓ) of Q^opt. Several algorithms can be applied here, as we elaborate below.
Example 3 (DQN). The deep Q-network algorithm is a Q-learning type method that uses a neural network Q-function approximator and several tricks to mitigate instability. It was developed in online settings and shown to yield superior performance to previously known methods for playing Atari 2600 games. To handle offline data, at each time step, we sample a minibatch of transitions {(S_{i,t}, A_{i,t}, R_{i,t}, S_{i,t+1})}_{(i,t)∈M} and update the parameter θ of the Q-network by the gradient of

Σ_{(i,t)∈M} {R_{i,t} + γ max_a Q_{θ*}(a, S_{i,t+1}) − Q_θ(A_{i,t}, S_{i,t})}²,   (11)

where Q_{θ*} is the target network whose parameter θ* is updated every T_target steps by setting θ* = θ. In Mnih et al. (2015), T_target is set to 10000. As T_target grows to infinity, performing T_target stochastic gradient steps is equivalent to solving

arg min_θ Σ_{i,t} {R_{i,t} + γ max_a Q_{θ*}(a, S_{i,t+1}) − Q_θ(A_{i,t}, S_{i,t})}².

In that sense, DQN shares similar spirits with the fitted Q-iteration algorithm (Fan et al., 2020).
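The structure of this update loop, gradient steps on θ against a frozen target θ*, synced every T_target steps, can be mimicked with a scalar "network" Q(s, a) = θ_a standing in for the real Q-network. The toy data, learning rate, and simplifications below are purely illustrative:

```python
# Sketch of the offline DQN-style loop structure: gradient steps on theta against a
# frozen target theta_star, with a periodic target sync. Here Q(s, a) = theta[a]
# ignores the state entirely; this is a stand-in for a real Q-network.
def dqn_loop(transitions, gamma, lr=0.1, t_target=5, n_steps=200):
    theta = {0: 0.0, 1: 0.0}
    theta_star = dict(theta)
    for step in range(n_steps):
        for s, a, r, s_next in transitions:
            target = r + gamma * max(theta_star.values())  # frozen target value
            theta[a] -= lr * 2 * (theta[a] - target)       # gradient of squared loss
        if (step + 1) % t_target == 0:
            theta_star = dict(theta)                       # periodic target sync
    return theta

# Two offline transitions: action 0 earns 0, action 1 earns 1; with gamma = 0.5 the
# fixed point is theta[1] = 2 and theta[0] = 1.
offline_data = [(0, 0, 0.0, 0), (0, 1, 1.0, 0)]
theta = dqn_loop(offline_data, gamma=0.5)
```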
Example 4 (Double DQN). The double DQN algorithm is very similar to DQN. It is developed to alleviate the overestimation bias of the learned Q-function. DQN is likely to overestimate the Q-function under certain conditions, due to the bias resulting from the maximization step max_a Q_{θ*}(a, S_{i,t+1}) in (11). See e.g., Sutton and Barto (2018) for a detailed explanation of the maximization bias. To reduce this bias, double DQN replaces the target R_{i,t} + γ max_a Q_{θ*}(a, S_{i,t+1}) with

R_{i,t} + γ Q_{θ*}(arg max_a Q_θ(a, S_{i,t+1}), S_{i,t+1}).

In other words, it decomposes the maximization operation into action selection and state-action value evaluation, using the Q-network for action selection and the target network for state-action value evaluation.

Given the Q-estimator Q^(ℓ), we denote the derived optimal policy (see Equation 2) by π^(ℓ), for ℓ = 1, · · · , L.
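The decomposition into action selection and evaluation is easy to see on a single transition. In the sketch below, hypothetical Q-tables stand in for the Q-network and the target network:

```python
# Comparing the DQN and double-DQN targets on one transition, with toy Q-tables
# standing in for the Q-network (q_net) and the target network (q_target).
def dqn_target(r, s_next, gamma, q_target, actions):
    return r + gamma * max(q_target[(a, s_next)] for a in actions)

def double_dqn_target(r, s_next, gamma, q_net, q_target, actions):
    a_star = max(actions, key=lambda a: q_net[(a, s_next)])  # select with Q-network
    return r + gamma * q_target[(a_star, s_next)]            # evaluate with target net

actions = (0, 1)
q_net = {(0, "s1"): 1.0, (1, "s1"): 2.0}
q_target = {(0, "s1"): 5.0, (1, "s1"): 0.5}
t_dqn = dqn_target(1.0, "s1", 0.9, q_target, actions)
t_double = double_dqn_target(1.0, "s1", 0.9, q_net, q_target, actions)
```

Because double DQN evaluates the Q-network's argmax action with the target network rather than taking the target network's own maximum, its target here is smaller, illustrating how the maximization bias is dampened.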

Step 3. Estimation of the density ratio

The purpose of this step is to learn a density ratio estimator based on each data subset. These estimators are further employed in the subsequent step to construct the pseudo outcomes for the optimal contrast function.
We first define the density ratio. For a given policy π, let p_t^π(·, ·|a, s) denote the conditional probability density function of (A_t, S_t) given the initial state-action pair (a, s), assuming that the decision maker follows π at times 1, 2, · · · , t. We define the γ-discounted average visitation density as follows,

d^π(a', s'; a, s) = (1 − γ) Σ_{t≥0} γ^t p_t^π(a', s'|a, s).

Let p_∞(·, ·) denote the density function of the limiting distribution of the stochastic process {(A_t, S_t)}_{t≥0}. We define the density ratio as

ω^π(a', s'; a, s) = d^π(a', s'; a, s) / p_∞(a', s'),

for any s, a, s', a'. Such a density ratio plays an important role in breaking the curse of horizon in off-policy evaluation.
In this step, we learn the density ratio ω^{π^(ℓ)} based on each data subset I_ℓ^c, for ℓ = 1, . . . , L, where π^(ℓ) is the initial optimal policy computed in Step 2. Several methods can be used here, e.g., Liu et al. (2018); Uehara et al. (2019); Kallus and Uehara (2019). In our implementation, we adopt the proposal in Liu et al. (2018) to construct a mini-max loss function (see Equation 52) to estimate ω^{π^(ℓ)}. We use ω^(ℓ) to denote the corresponding estimator. Additional details are given in Appendix B to save space.
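On a small finite-state example, the discounted visitation distribution and the resulting ratio can be computed in closed form, since d^π = (1 − γ) e_{s_0}(I − γP)^{-1} for a transition matrix P under the policy. The two-state chain below is our own toy choice, not an example from the paper:

```python
# Closed-form gamma-discounted visitation distribution on a toy 2-state chain
# under a fixed policy: d_gamma = (1 - gamma) * e_{s0} (I - gamma * P)^{-1}.
def discounted_visitation(P, s0, gamma):
    # Solve the 2x2 system directly; for larger chains use a linear solver.
    a, b = 1 - gamma * P[0][0], -gamma * P[0][1]
    c, d = -gamma * P[1][0], 1 - gamma * P[1][1]
    det = a * d - b * c
    inv_row = ([d / det, -b / det], [-c / det, a / det])[s0]
    return [(1 - gamma) * x for x in inv_row]

P = [[0.9, 0.1], [0.2, 0.8]]        # transition matrix induced by the policy
d = discounted_visitation(P, s0=0, gamma=0.9)
p_inf = [2 / 3, 1 / 3]              # stationary distribution of P (solves p P = p)
omega = [d[k] / p_inf[k] for k in range(2)]   # density ratio at each state
```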

Step 4. Construction of Pseudo Outcomes
For ℓ = 1, · · · , L, consider a pair of indices (i, t) with i ∈ I_ℓ and 0 ≤ t < T. In this step, we focus on constructing a pseudo outcome Q_{i,t,a} for Q^opt(a, S_{i,t}) for any a ∈ A, based on the Q- and density ratio estimators computed in Steps 2 and 3. The corresponding pseudo outcome for the optimal contrast is then obtained by taking differences of these pseudo outcomes, as detailed at the end of this step. To motivate our method, notice that by the Bellman equation, it suffices to construct pseudo outcomes for r(a, S_{i,t}) and E{V^{π^opt}(S_{i,t+1})|A_{i,t} = a, S_{i,t}}.
Pseudo outcomes for r(a, S_{i,t}) can be derived based on augmented inverse propensity-score weighting, as in Section 4.1,

r_{i,t,a} = {I(A_{i,t} = a)/Pr(A_{i,t} = a|S_{i,t})}{R_{i,t} − r^(ℓ)(a, S_{i,t})} + r^(ℓ)(a, S_{i,t}),

where r^(ℓ) denotes some estimator of the reward function r computed using the data subset I_ℓ^c. As we have commented, the use of AIPW ensures the unbiasedness of the pseudo outcome, regardless of whether r^(ℓ) is consistent to r or not.
It remains to construct pseudo outcomes for the second component, ν^(ℓ)(S_{i,t}, a) = E{V^(ℓ)(S_{i,t+1}) | A_{i,t} = a, S_{i,t}}, where V^(ℓ)(s) = max_a Q^(ℓ)(a, s) denotes the value computed using the estimated optimal policy π^(ℓ).

Suppose for now that the Markov transition density function q is known. Then ν^(ℓ)(S_{i,t}, a) can be estimated using existing policy evaluation methods. Here, we consider the double reinforcement learning method proposed by Kallus and Uehara (2019),

∫_{s'} V^(ℓ)(s') q(s'; a, S_{i,t}) ds' + η_{i,t,a},   (13)

where η_{i,t,a} is an augmentation term, defined as

η_{i,t,a} = (NT)^{−1} Σ_{j,k} ω^(ℓ)(A_{j,k}, S_{j,k}; a, S_{i,t}) {R_{j,k} + γ V^(ℓ)(S_{j,k+1}) − Q^(ℓ)(A_{j,k}, S_{j,k})},

in which the term in braces is the temporal-difference residual constructed based on the initial Q-estimator. When the initial Q-estimator is consistent, it follows from the Bellman optimality equation that the mean of η_{i,t,a} is asymptotically zero.
The purpose of adding η_{i,t,a} in (13) is to offer additional protection against potential model misspecification of the initial Q-estimator. Specifically, it ensures that (13) is unbiased to ν^(ℓ)(S_{i,t}, a) when either Q^(ℓ) or ω^(ℓ) is consistent (see e.g., Kallus and Uehara, 2019). In addition, when the estimated ratio is consistent, it allows the bias of (13) to decay to zero at a rate faster than that of Q^(ℓ). See Theorem 1 for a formal statement.
However, the pseudo outcome outlined in (13) suffers from two major limitations. First, the transition density q is in general unknown in practice. Second, the calculation of η_{i,t,a} requires O(NT) flops, which is computationally intensive on large datasets.
To address the first limitation, we again use augmented inverse probability weighting and replace the first term in (13) by an AIPW-type estimator. Similar to r̃_{i,t,a}, one can easily verify that ν̃_{i,t,a} is unbiased for ν(a, S_{i,t}) regardless of whether ν̂^{(ℓ)} is consistent or not. To address the second limitation, we randomly sample a subset M_{i,t} of the observation indices and approximate η_{i,t,a} by η̃_{i,t,a}, constructed based on the observations in M_{i,t} only, where |·| denotes the cardinality of a set. When |M_{i,t}| is much smaller than NT, this largely facilitates the computation.
Combining both parts, we obtain the pseudo outcome Q̃_{i,t,a}; the pseudo outcome for the optimal contrast is then given by the difference τ̃_{i,t,a} = Q̃_{i,t,a} − Q̃_{i,t,a_0}. We again make some remarks. First, we employ cross-fitting to construct τ̃_{i,t,a}: the nuisance estimators used for observations in I_ℓ are computed on the complementary data subset I_ℓ^c. This helps avoid the overfitting that can easily result from the estimation of the Q-function and density ratio. Second, to simplify the presentation, we assume the propensity score is known. In practice, it can be estimated from the observed data, and our theoretical results remain the same when the estimated propensity score satisfies a certain rate of convergence.
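The cross-fitting scheme can be sketched as follows: trajectories are split into L disjoint folds, nuisance functions are fitted on each complement I_ℓ^c, and pseudo outcomes are formed on I_ℓ. The fold count and the placeholder comments are illustrative.

```python
import numpy as np

def cross_fit_folds(n_trajectories, L, seed=0):
    """Randomly partition trajectory indices into L disjoint folds I_1, ..., I_L."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_trajectories)
    return [np.sort(perm[l::L]) for l in range(L)]

n, L = 20, 4
folds = cross_fit_folds(n, L)
for I_l in folds:
    I_l_c = np.setdiff1d(np.arange(n), I_l)  # complement I_l^c
    # The nuisances (Q-function, density ratio, reward model) would be fitted
    # on I_l_c, and the pseudo outcomes constructed on I_l (placeholders here).
```

Because each pseudo outcome only uses nuisance estimators trained on independent trajectories, the construction avoids the own-observation bias that plain plug-in estimation would incur.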

Supervised learning
In the final step, we model the contrast function τ^opt by some class of models τ ∈ T and estimate the model parameters by minimizing the objective function in (14). The corresponding estimated optimal policy assigns, for any a and s, the action according to I{a = arg max_{a*} τ̂(a*, s)}.
Solving (14) is equivalent to solving (15) for each a ≠ a_0. Many methods are available to solve (15), as it is essentially a nonparametric regression problem. In our implementation, we set T_a to a class of deep neural networks (DNNs), so as to capture the complex dependence between the reward and the state-action pair. The input of the network is a d-dimensional vector, corresponding to the state (coloured in blue in Figure 1). The hidden units (coloured in green) are grouped in a sequence of L_a layers. Each unit in a hidden layer is a nonlinear transformation of a linear combination of the nodes from the previous layer. We use W_a to denote the total number of parameters. These parameters are updated by the Adam algorithm (Kingma and Ba, 2015).
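As a minimal sketch of this regression step (not the architecture or tuning used in our experiments), the code below fits a one-hidden-layer ReLU network to synthetic pseudo outcomes via the Adam algorithm; the data-generating process, network width, and learning rate are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, H = 3, 500, 32
S = rng.normal(size=(n, d))                                    # states
tau_pseudo = np.sin(S[:, 0]) + rng.normal(scale=0.1, size=n)   # pseudo outcomes

# One-hidden-layer ReLU network, trained by Adam.
W1 = rng.normal(scale=0.3, size=(d, H)); b1 = np.zeros(H)
W2 = rng.normal(scale=0.3, size=H);      b2 = np.zeros(1)
params = [W1, b1, W2, b2]
m = [np.zeros_like(p) for p in params]
v = [np.zeros_like(p) for p in params]

def predict(S):
    h = np.maximum(S @ W1 + b1, 0.0)     # ReLU hidden layer
    return h, h @ W2 + b2

def mse():
    return np.mean((predict(S)[1] - tau_pseudo) ** 2)

loss0 = mse()
lr, beta1, beta2, eps = 1e-2, 0.9, 0.999, 1e-8
for step in range(1, 501):
    h, pred = predict(S)
    g_out = 2 * (pred - tau_pseudo) / n                   # dL/dpred
    gW2 = h.T @ g_out; gb2 = np.array([g_out.sum()])
    g_h = np.outer(g_out, W2) * (h > 0)                   # backprop through ReLU
    gW1 = S.T @ g_h; gb1 = g_h.sum(axis=0)
    for p, g, mi, vi in zip(params, [gW1, gb1, gW2, gb2], m, v):
        mi[...] = beta1 * mi + (1 - beta1) * g
        vi[...] = beta2 * vi + (1 - beta2) * g ** 2
        p[...] = p - lr * (mi / (1 - beta1 ** step)) / (np.sqrt(vi / (1 - beta2 ** step)) + eps)
```

In the actual implementation, the depth L_a and size W_a are chosen as discussed in the theory section, and training uses standard deep learning tooling rather than this hand-rolled loop.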

Theoretical findings
We first summarize our theoretical findings. In Theorem 1, we provide a finite sample bias analysis of the pseudo outcome, proving that its bias decays at a faster rate than that of the initial Q-estimator. In Theorem 2, we show our estimator for the optimal contrast achieves a faster rate of convergence than the Q-estimator. In Theorem 3, we show the resulting optimal policy achieves a larger value than those computed by Q-learning type algorithms. Finally, we discuss a potential limitation of the proposed method. All the error bounds derived in this paper converge to zero when either N or T diverges to infinity. As commented in the introduction, this ensures our method is valid when applied to a wide range of real problems.

Finite sample bias analysis
In this section, we focus on deriving an error bound on the bias E(Q̃_{i,t,a}|S_{i,t}) − Q^opt(a, S_{i,t}) as a function of the total number of observations NT. We introduce the following conditions.
(A1) The state space S is compact. There exists some constant α > 0 such that

λ({s ∈ S : 0 < max_{a∈A} Q^opt(a, s) − max_{a′ ∈ A∖arg max_a Q^opt(a,s)} Q^opt(a′, s) ≤ ε}) = O(ε^α),   (16)

where λ denotes the Lebesgue measure and the big-O term in (16) is uniform in 0 < ε < δ for some sufficiently small δ > 0. By convention, max_{a′ ∈ A∖arg max_a Q^opt(a,s)} Q^opt(a′, s) = −∞ if the set A ∖ arg max_a Q^opt(a, s) is empty.
(A2) Q opt (a, ·) is p-smooth and τ opt (a, ·) is p * -smooth for some p * < p and any a.
(A3) There exists some constant 0 < c_0 ≤ 1 such that the following hold for any a and ℓ, with probability approaching 1.

(A4) The process {(S_t, A_t, R_t)}_{t≥0} is stationary and exponentially β-mixing (see e.g., Bradley, 2005, for detailed definitions).
(A5) The probability density function p ∞ is uniformly bounded away from zero.
In (A1), we require the state space to be continuous. When it is discrete, we can replace the Lebesgue measure with the counting measure; our theory is equally applicable. We refer to the quantity max_a Q^opt(a, s) − max_{a′ ∈ A∖arg max_a Q^opt(a,s)} Q^opt(a′, s) as the "margin" of the optimal Q-function. It measures the difference in value between π^opt and the policy that assigns the best suboptimal treatment(s) at the first decision point and follows π^opt subsequently. Such a margin-type condition is commonly used to bound the excess misclassification error (Tsybakov, 2004; Audibert and Tsybakov, 2007) and the regret of estimated optimal treatment regimes (Qian and Murphy, 2011; Luedtke and van der Laan, 2016). Here, we impose Condition (A1) to bound the difference between Q^{π̂^{(ℓ)}}(S_{i,t}, a) and Q^opt(S_{i,t}, a).

This condition is mild. To elaborate, we consider a simple scenario where A = {0, 1} and a_0 = 0. It follows that the margin equals |τ^opt(1, s)| if τ^opt(1, s) is nonzero, and +∞ otherwise. (16) is thus equivalent to the following:

λ({s ∈ S : 0 < |τ^opt(1, s)| ≤ ε}) = O(ε^α).   (17)

The above condition can be satisfied in a wide range of settings. We consider three examples to illustrate.
Example 6. Suppose τ opt (1, s) = 0 for any s. Then (17) is automatically satisfied. In this example, the two actions have the same effects. Any policy would achieve the same value.
Example 7. Suppose inf s |τ opt (1, s)| > 0. Then (17) is automatically satisfied for any sufficiently small ε > 0. When the optimal contrast function is continuous, it requires τ opt (1, s) to be always positive or negative as a function of s. As such, the optimal policy is nondynamic and will assign the same action at each time.
Example 8. Consider the case where the state is one-dimensional and τ^opt(1, s) = max(s, 0). Both actions are then optimal whenever the state is nonpositive, and λ({s : 0 < |τ^opt(1, s)| ≤ ε}) = ε; (17) is thus satisfied with α = 1.
In (A2), we assume the optimal contrast function is strictly "smoother" than the optimal Q-function. As we have discussed, this assumption holds under several cases. See Examples 1 and 2 in Section 3.2 for details.
In the first part of (A3), we assume the squared prediction loss of the estimated Q-function achieves a rate of (NT)^{−2p/(2p+d)}. As we have commented, this condition automatically holds when a deep Q-network is used to fit the initial Q-estimator. The second part of (A3) is mild, as the constant c_0 could be arbitrarily small; for instance, when some correctly specified parametric model is used to estimate the density ratio, the resulting estimator typically converges at a parametric rate, and the second part of (A3) holds. These assumptions enable us to derive the following theorem.
Theorem 1. Suppose (A1)–(A5) hold. Then there exists some constant c̄ > p/(2p + d) such that for any a ∈ A, the conditional bias E(Q̃_{i,t,a}|S_{i,t}) − Q^opt(a, S_{i,t}) is of the order (NT)^{−c̄} on average.

Theorem 1 states that the conditional bias of Q̃_{i,t,a} decays at a rate of (NT)^{−c̄} on average. In comparison, under (A3), the squared prediction loss of the initial Q-estimator decays at a rate of (NT)^{−2p/(2p+d)}. Suppose the squared bias and variance of Q̂^{(ℓ)} are of the same order. Then we expect E{Q̂^{(ℓ)}(a, S_{i,t})|S_{i,t}} to approach Q^opt(a, S_{i,t}) at a rate of (NT)^{−p/(2p+d)}. Since c̄ > p/(2p + d), the biases of our pseudo outcomes are much smaller than those of the initial Q-estimator.

Efficiency enhancement
In this section, we establish the convergence rates of the estimated contrast function and the derived optimal policy. Without loss of generality, we assume the state space S = [0, 1] d .
We write a_n ≍ b_n for two sequences {a_n}_n and {b_n}_n if there exists some universal constant c ≥ 1 such that c^{−1} a_n ≤ b_n ≤ c a_n for all n.
Theorem 2. Assume the conditions in Theorem 1 hold. Then there exists a DNN class {T_a}_a with L_a ≍ log(NT) and W_a ≍ (NT)^C log(NT) for some C > d/(2p + d) and any a ≠ a_0 such that, with probability approaching 1, E|τ̂_a(S_0) − τ^opt(a, S_0)|² = O{(NT)^{−c_1}} for some c_1 > 2p/(2p + d), where the expectation is taken with respect to the stationary state distribution.
Theorem 2 formally shows that our estimated contrast function converges at a faster rate than the Q-function computed by Q-learning type estimators, leading to the desired efficiency enhancement property. To illustrate why the estimated contrast converges faster, suppose we had access to some unbiased pseudo outcome for τ^opt(a, S_{i,t}). Then under (A2), the estimated contrast function would converge at the minimax optimal rate of (NT)^{−p*/(2p*+d)}, which is much faster than that of the Q-estimator. In practice, we do not have access to unbiased pseudo outcomes; the rate thus depends on the bias of the pseudo outcome Q̃_{i,t,a} − Q̃_{i,t,a_0}. Nonetheless, the efficiency enhancement property holds as long as this bias decays faster than the convergence rate of the Q-estimator, and the latter assertion is confirmed in Theorem 1.
We next show that this in turn leads to an improvement in the value. More specifically, for any policy π, define the integrated value function V(π) = ∫_S V^π(s)ν_0(s)ds, where ν_0 denotes the density function of S_0. Let π̂_{τ̂} denote the derived policy based on the estimated contrast function τ̂.
Let Q̂ denote a Q-learning type estimator whose squared prediction loss satisfies the first part of (A3), and let π̂_{Q̂} be the derived policy based on Q̂. Similar to Theorem 3, we can show that V(π^opt) − EV(π̂_{Q̂}) converges at a rate of (NT)^{−α_0 p/(2p+d)}. Based on the fact that c_1 > 2p/(2p + d), it is clear that the value of our estimated policy converges to the optimal value at a faster rate than those of policies computed by Q-learning type algorithms.
The convergence rates in Theorems 2 and 3 rely crucially on the exponent α in the margin condition (A1) and on the convergence rate of the estimated density ratio in (A3), i.e., (NT)^{−c_0}. The following corollary shows that under certain conditions on α and c_0, the exponent c_1 in both theorems achieves its maximum value of 2p*/(2p* + d).
Corollary 1. Suppose the conditions in Theorems 2 and 3 hold, and suppose α and c_0 are sufficiently large. Then with a proper choice of the DNN class {T_a}_a, we have for any a ≠ a_0, with probability approaching 1, that E|τ̂_a(S_0) − τ^opt(a, S_0)|² = O{(NT)^{−2p*/(2p*+d)}}, and that V(π^opt) − EV(π̂_{τ̂}) = O{(NT)^{−α_0 p*/(2p*+d)}}.
Notice that we do not require the optimal policy to be unique in order to establish the regret bound of the estimated optimal policy. This is because our proposal is value-based and derives the optimal policy using the estimated advantage function. The advantage function is well-defined even when the optimal policy is not unique, and the regret bound decays to zero as long as the estimated advantage function is consistent. To elaborate, let us revisit Example 8. By definition, when the state is nonpositive, both actions are optimal. The uniqueness assumption is thus violated. Nonetheless, the regret is zero, since choosing either action is optimal.
Finally, we remark that although the proposed contrast function estimator converges at a faster rate than Q-learning type estimators, these rates are asymptotic. A potential limitation of the proposed method is that our estimated contrast function might have a larger variance than Q-learning type estimators in finite samples, due to the use of importance sampling in constructing the pseudo outcomes. This reflects a bias-variance trade-off.

Simulations
We evaluate the performance of our method using two synthetic datasets generated from the OpenAI Gym environments LunarLander-v2 and Qbert-ram-v0. The baseline methods include deep Q-learning type algorithms as well as offline RL algorithms such as BEAR (Kumar et al., 2019). We compare them with the proposed procedure based on QR-DQN, which yields the best performance among (d)-(f).

LunarLander-v2
We conduct experiments in an OpenAI Gym environment, LunarLander-v2; a detailed description of this environment can be found on the OpenAI Gym website. To generate the data, we train a QR-DQN agent for 500K time steps with learning rate 0.0005. The estimated policy after 500K time steps is near optimal and solves the environment (e.g., it achieves a score of 200 on average); the state-of-the-art optimal average reward is over 250 (see https://medium.datadriveninvestor.com/training-the-lunar-lander-agent-with-deep-qlearning-and-its-variants-2f7ba63e822c). We then terminate the training process, store all the trajectories encountered during the online training process, and use them as the offline data. The behavior policy thus corresponds to the mixture of policies used during online training. We repeat the entire data generating process and the training and evaluation procedures 10 times with different random seeds. We also vary the number of training steps for the initial Q-estimator and apply the proposed method to each of the estimated Q-functions.
For a fair comparison, we use the same number of training steps (i.e., 20K, 30K, 40K or 50K) to train the baseline policy. Reported in Figure 2 are the values of the estimated policies computed by (a)-(i), as well as the associated 95% confidence intervals, under different numbers of training steps.
We summarize our findings as follows: (1) The proposed procedure achieves a larger value compared to the baseline methods in most cases; (2) Our improvement is significant in many cases, as suggested by the error bars; (3) All the methods improve as the number of training steps increases.

Qbert-ram-v0
We next conduct experiments in another environment, Qbert-ram-v0. The best 100-episode average reward for Qbert-ram-v0 is 586.00 ± 12.16. We similarly train a QR-DQN agent to generate 1373 trajectories. Each trajectory lasts 364 time steps on average.
The average return per trajectory equals 278. We similarly compare our procedures (d)-(f) with the baseline methods (a)-(c) and (g)-(i). Results are depicted in Figure 3. Overall, the findings are very similar to those in Section 6.1. We notice that the performance of some deep Q-learning methods drops as the number of training steps increases, and fails to improve even after a few more iterations. We discuss this in detail in Section 8.4 to save space.
Finally, it can be very computationally expensive to implement deep RL algorithms in LunarLander-v2 and Qbert-ram-v0. For instance, in our implementation, it took a few hours to run one simulation. As such, our simulation results are aggregated over 10 runs only. We also remark that, beginning with DQN, 5 or fewer runs are common in the existing RL literature, as it is often computationally prohibitive to evaluate more runs.

The OhioT1DM dataset
In this section, we use the OhioT1DM dataset (Marling and Bunescu, 2018) to illustrate the usefulness of our new method in mobile health applications. The data contain continuous measurements for six patients with type 1 diabetes over eight weeks. The objective is to learn an optimal policy that maps patients' time-varying covariates into the amount of insulin injected at each time point, so as to maximize patients' health status.
In our experiment, we divide each day of follow-up into one-hour intervals, and a treatment decision is made every hour. We consider three important time-varying state variables: the average blood glucose level G_t during the one-hour interval (t−1, t], the carbohydrate estimate C_t for the meals during (t−1, t], and Ex_t, which measures exercise intensity during (t−1, t]. At time t, we define the action A_t by discretizing the amount of insulin In_t injected. The reward R_t is chosen according to the Index of Glycemic Control (Rodbard, 2009), which is a deterministic function of G_{t+1}. Detailed definitions of A_t and R_t are given in Appendix B. A low reward is received if the patient's average blood glucose level falls outside the range [80, 140]. Let X_t = (G_t, C_t, Ex_t). We define the state S_t by concatenating measurements over the last four decision points, i.e., S_t = (X_{t−3}^⊤, A_{t−3}, · · · , X_t^⊤)^⊤. This ensures the Markov assumption is satisfied. The number of decision points for each patient in the OhioT1DM dataset ranges from 1119 to 1288. Transitions across different days are treated as independent trajectories. This yields 279 trajectories in total.
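The sliding-window state construction described above can be sketched as follows; the window length of four follows the text, while the array shapes and toy data are hypothetical.

```python
import numpy as np

def build_states(X, A, k=4):
    """Concatenate the last k decision points (X_{t-k+1}, A_{t-k+1}, ..., X_t)
    into a single state vector S_t, as described in the text."""
    T = X.shape[0]
    states = []
    for t in range(k - 1, T):
        parts = []
        for j in range(t - k + 1, t):
            parts.append(X[j]); parts.append([A[j]])  # lagged covariates and action
        parts.append(X[t])                             # current covariates
        states.append(np.concatenate(parts))
    return np.stack(states)

T, d = 10, 3                                     # toy horizon and covariate dimension
X = np.arange(T * d, dtype=float).reshape(T, d)  # rows play the role of X_t = (G_t, C_t, Ex_t)
A = np.zeros(T)                                  # toy action sequence
S = build_states(X, A)                           # 4 covariate blocks + 3 lagged actions per state
```

Concatenating a short history in this way is a standard device to make a partially observed process approximately Markovian.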
We use cross-validation to evaluate the performance of different algorithms. Specifically, we apply each of the methods (a)-(i) to the training dataset to learn an optimal policy. Then we apply the fitted-Q evaluation (FQE, Le et al., 2019) algorithm to the testing dataset to evaluate the values of these policies. FQE is very similar to FQI: it iteratively updates the state-action value using supervised learning algorithms. See Algorithm 1 in Appendix B for details. In our implementation, we set the function approximator F to a class of DNNs and apply deep learning to update the value. The estimated values are then aggregated over different training/testing combinations. Finally, we repeat this procedure 10 times with different random seeds to further aggregate the values. Results are reported in Figure 4.
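To make the FQE iteration concrete, here is a minimal tabular sketch in which the function class F is a lookup table rather than a DNN; the two-state MDP, target policy, and discount factor are all hypothetical.

```python
import numpy as np

# Hypothetical two-state, two-action MDP with known dynamics.
gamma = 0.9
P = np.zeros((2, 2, 2))                    # P[s, a, s']
P[0, 0] = [0.8, 0.2]; P[0, 1] = [0.2, 0.8]
P[1, 0] = [0.6, 0.4]; P[1, 1] = [0.3, 0.7]
r = np.array([[1.0, 0.0], [0.5, 2.0]])     # r[s, a]
pi = np.array([0, 1])                      # deterministic target policy pi(s)

# FQE iteration: Q(s, a) <- r(s, a) + gamma * E[Q(s', pi(s'))],
# where the regression step is exact in the tabular case.
Q = np.zeros((2, 2))
for _ in range(500):
    V = Q[np.arange(2), pi]                # V^pi(s') = Q(s', pi(s'))
    Q = r + gamma * P @ V                  # (P @ V)[s, a] = sum_{s'} P[s, a, s'] V[s']
```

Algorithm 1 in Appendix B replaces this exact tabular update with a supervised regression over the function class F fitted on the observed transitions.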
It can be seen that the proposed method performs significantly better than the other baseline methods in most cases.

Discussion
This section is organized as follows. In Sections 8.1 and 8.2, we discuss extensions of our proposal to the continuous action space. In Section 8.3, we discuss the p-smoothness assumption. Finally, in Section 8.4, we discuss some offline RL algorithms and the pessimistic principle.

Action discretization
One possible approach to handle continuous action space is to first discretize the action space and then apply the proposed method for policy learning. Suppose the action space is one-dimensional, as in personalized dose finding and dynamic pricing. Then we can extend the jump interval-learning method  to sequential decision making for adaptively discretizing the action space. The baseline action can be similarly determined after we obtain the discretization.
Specifically, the jump interval-learning method was originally designed in contextual bandit settings for deriving interval-valued policies. The main idea is to partition the action space into a set of disjoint intervals such that, within each interval, the Q-function is a constant function of the action. In this section, we outline an extension of this method to sequential decision making. Without loss of generality, assume the action space A = [0, 1].
To illustrate the method, we assume the transition function is a piecewise function of the action, i.e., there exists a set of disjoint intervals D that cover A and satisfy q(s′; a, s) = Σ_{I∈D} I(a ∈ I) q_I(s′; s) for any s, s′ and some set of transition functions {q_I}. Under such an assumption, any Q-function is a piecewise function of the action. To identify D, we can couple fitted-Q iteration with jump-interval learning and iteratively solve a penalized optimization problem via dynamic programming (Friedrich et al., 2008), for k = 1, · · · , K and some tuning parameter γ > 0, where |D̂| denotes the number of intervals in D̂. The final output D̂ yields a set of action intervals, based on which we can define a categorical action variable A*_{i,t} whose value depends on the interval that the original action A_{i,t} belongs to. Finally, we apply the proposed method to the transformed dataset.
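The dynamic program behind this kind of jump detection can be sketched generically as follows: a penalized least-squares segmentation over a discretized action grid (in the spirit of Friedrich et al., 2008), with a hypothetical piecewise-constant signal standing in for the Q-function; this is a sketch of the technique, not the exact objective above.

```python
import numpy as np

def segment(y, gamma):
    """Partition indices 0..n-1 into intervals minimizing
    (within-interval squared error) + gamma * (number of intervals),
    via the classic O(n^2) dynamic program."""
    n = len(y)
    c1 = np.concatenate([[0.0], np.cumsum(y)])
    c2 = np.concatenate([[0.0], np.cumsum(y ** 2)])

    def cost(i, j):   # squared error of a constant fit on y[i:j]
        s, s2, m = c1[j] - c1[i], c2[j] - c2[i], j - i
        return s2 - s * s / m

    F = np.full(n + 1, np.inf); F[0] = 0.0
    back = np.zeros(n + 1, dtype=int)
    for j in range(1, n + 1):
        for i in range(j):
            val = F[i] + cost(i, j) + gamma
            if val < F[j]:
                F[j], back[j] = val, i
    cuts, j = [], n
    while j > 0:
        cuts.append((int(back[j]), int(j))); j = back[j]
    return cuts[::-1]

# Hypothetical Q-values on a discretized action grid, with one jump at index 50.
y = np.concatenate([np.zeros(50), np.ones(50)])
y += 0.01 * np.random.default_rng(3).normal(size=100)
intervals = segment(y, gamma=0.5)   # recovers the two intervals split at 50
```

The penalty γ|D̂| plays the same role as in the objective above: it trades goodness of fit against the number of action intervals.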

Kernel-based method
Another potential approach is to adopt kernel-based methods that leverage treatment proximity to allow continuous actions. In that case, we can apply kernel density estimation to learn the marginal probability density function of the action and set the baseline action to the one that maximizes the estimated density function.
Specifically, kernel-based methods are commonly used for policy learning and evaluation with continuous action spaces (see e.g., Chen et al., 2016; Kallus and Zhou, 2018; Colangelo and Lee, 2020). Notice that the proposed method requires weighting each observation by the importance sampling ratio I(A_{i,t} = a)/Pr(A_{i,t} = a|S_{i,t}) to construct the pseudo outcome. When the actions are continuous, the indicator function I(A_{i,t} = a) equals zero almost surely for any i and t, so naively applying the importance sampling ratio would yield a biased estimator. To address this concern, we can employ kernel-based smoothing, replacing the indicator function with a kernel function K{(A_{i,t} − a)/h} for some bandwidth h > 0.
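As a minimal sketch of the baseline-action selection described in this subsection, the code below forms a Gaussian kernel density estimate of the marginal action density and takes its maximizer over a grid; the action sample, bandwidth, and grid are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
actions = np.concatenate([rng.normal(0.5, 0.05, size=800),   # most mass near a = 0.5
                          rng.uniform(0.0, 1.0, size=200)])
h = 0.05                                # bandwidth (an assumption, not a tuned value)
grid = np.linspace(0.0, 1.0, 201)

# Gaussian kernel density estimate of the marginal action density.
density = np.exp(-0.5 * ((grid[:, None] - actions[None, :]) / h) ** 2)
density = density.mean(axis=1) / (h * np.sqrt(2.0 * np.pi))
a0 = grid[np.argmax(density)]           # baseline action: mode of the estimated density
```

Choosing the baseline action as the mode of the behavior action density keeps the kernel-smoothed importance weights well behaved, since the baseline is well supported by the offline data.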

The p-smoothness assumption
We discuss the p-smoothness assumption in this section. First, as commented in the main text, the p-smoothness assumption is likely to hold in mobile health studies. In these applications, the state corresponds to some time-varying variables that measure the health status of a given subject. It is expected that the system dynamics would vary smoothly as a function of these variables. In the literature, a number of papers also use a smooth transition function to simulate the environment in healthcare applications (Ertefaie, 2014; Luckett et al., 2020; Li et al., 2022). In addition, the reward is usually a deterministic function of the current state-action pair and the future state; see e.g., the definition of the reward in our application in Appendix B. As such, as long as the transition function is p-smooth, so is the reward function.
Second, the p-smoothness assumption is not likely to hold in many OpenAI gym environments where the state variables are all discrete or the state transition is deterministic.
Nonetheless, we remark that, as can be seen from our simulation results, the proposed method works better than, or comparably to, existing Q-learning algorithms.

Offline RL and the pessimistic principle
As pointed out by one of the referees, it can be seen from Figure 3 that the performance of some deep Q-learning methods drops as the number of training steps increases, and fails to improve even after a few more iterations. We suspect this is because some state-action pairs are less explored in the offline dataset, leading to the violation of the positivity assumption (A5). In that case, increasing the number of training steps might overfit the data and worsen the performance of the resulting policy. Similar phenomena have been observed in the existing RL literature (see e.g., Kumar et al., 2019).
Many offline RL algorithms such as BCQ, REM and BEAR adopt the pessimistic principle (see e.g., Levine et al., 2020; Jin et al., 2021; Xie et al., 2021) and learn a "conservative" Q-function by penalizing the values evaluated at "out-of-distribution" state-action pairs. These algorithms are shown to outperform the standard Q-learning algorithm in offline domains when (A5) is violated to some extent. We suspect that reducing the number of training steps amounts to penalizing the Q-function to some extent to avoid overfitting; as such, increasing the number of training steps would make things worse. It would be interesting to investigate how to couple the pessimistic principle with our proposal to further improve the estimated policy. However, this is beyond the scope of the current paper, and we leave it for future research.

A Proofs
We prove Theorems 1 and 2 in this section. Theorem 3 can be proven in a similar manner as Theorem 4 of . We omit its proof for brevity. Throughout this section, we use c and C to denote generic constants whose values are allowed to vary from place to place. For any two positive sequences {a_t}_{t≥1} and {b_t}_{t≥1}, we write a_t ≲ b_t if there exists some constant c > 0 such that a_t ≤ c·b_t for any t. The notation a_t ≲ 1 means a_t = O(1).

A.1 Proof of Theorem 1
Let {β(q)}_{q≥0} denote the β-mixing coefficients (see e.g., Bradley, 2005, for detailed definitions) of the process {(S_t, A_t, R_t)}_{t≥0}. Under the given conditions, β(q) → 0 as q → ∞. Let b(a|S_t) denote the propensity score Pr(A_t = a|S_t) for any a ∈ A. Let V̂^{(ℓ)}(s) denote the estimated optimal value function Σ_{a∈A} π̂^{(ℓ)}(a|s) Q̂^{(ℓ)}(a, s) for any s ∈ S.
By definition, the conditional mean of Q̃_{i,t,a} is the same as that of the quantity defined above. In the following, we break the proof into three steps. In the first step, we show (19). Note that η̃_{i,t,a} can be decomposed into the sum of η_{i,t,a,1} and η_{i,t,a,2}, defined above, respectively.
In the second step, we show that where κ 1 = p/(2p + d) + c 0 /2 and satisfies p/(2p + d) < κ 1 < 1/2 under the given conditions on c 0 , and O(1) denotes some positive constant that is independent of i, t, a, . Specifically, we will decompose η i,t,a,1 into three terms and investigate the conditional mean of each of these terms in detail.
Next, we give detailed proofs for each step.

A.1.1 Step 1

With some calculations, we obtain (19). The proof of Step 1 is thus completed.

A.1.2 Step 2

Notice that for any
By the Bellman equation, we have E(ε_{i,t}|S_{i,t}) = 0, and hence, in view of (24) and (25), to complete the proof of this step it suffices to show (26). Under the stationarity assumption in (A4) and by the Cauchy–Schwarz inequality, we obtain the corresponding bounds. It follows from the Cauchy–Schwarz inequality and the stationarity condition that, since p_∞ is uniformly bounded away from zero, the relevant term is bounded for some constant c > 0. Theorem 4 of  showed that the value under an estimated optimal policy computed by Q-learning type estimators converges at a faster rate than the estimated Q-function under (A1). Similar to (28), an analogous bound holds for some constant C > 0. Combining these together with (27), and under the given conditions in (A3), we obtain (26). This completes the proof of this step.

A.1.3 Step 3
For a given integer q > 1, we decompose η̃_{i,t,a,2} into η_{i,t,a,6} + η_{i,t,a,7}. For any t′ such that |t′ − t| > q, we have |t′ + 1 − t| ≥ q. Consequently, the β-mixing coefficient between the two σ-algebras σ(S_{i,t}) and σ(S_{i,t′}, A_{i,t′}, R_{i,t′}, S_{i,t′+1}) is upper bounded by β(q), under the given conditions. By Berbee's lemma, there exists an identical copy, and η_{i,t,a,7} can be decomposed accordingly. By (29) and the boundedness assumption, the conditional expectations of the last two terms are bounded by cβ(q), for some constant c > 0. As β is exponentially decreasing, by setting q to be proportional to log T, the last two terms decay at a rate of T^{−κ} for any κ > 0.
Using similar arguments in Step 2 of the proof, the conditional expectation of the first term is asymptotically equivalent to E[ s {V π ( ) (s ) − V ( ) (s )}q(s ; a, S i,t )ds |S i,t ] and can be similarly bounded.
In addition, by setting q to be proportional to log T, we have η_{i,t,a,6} = O(T^{−1} log T). The proof is hence completed.
A.2 Proof of Theorem 2

We begin by specifying the DNN class T_a. We consider using ReLU as the activation function. Let θ denote the vector of parameters involved in the DNN. We use W to denote the total number of parameters in the DNN, and L to denote the number of layers.
Without loss of generality, we assume sup_{a,s} |τ^opt(a, s)| ≤ M for some constant M > 0. For any sufficiently small ϵ > 0, there exists a DNN function class (for some constant C̄ > 0) such that some τ*_a belonging to T_a approximates τ^opt(a, ·) with uniform approximation error upper bounded by ϵ; see e.g., Lemma 7 of Farrell et al. (2021). We set ϵ = (NT)^{−κ_0}. It suffices to show that E|τ̂_a(S_0) − τ*_a(S_0)|² converges at a rate of O{(NT)^{−κ_0}}.
We aim to upper bound the probability for some positive integer k 0 . The detailed requirements of k 0 and κ 0 will be specified later.
Suppose we can show that the probability (30) is upper bounded by O(N^{−1}T^{−1}). The second term on the right-hand side (RHS) of (31) can then be upper bounded by k_0²(NT)^{−2κ_0}. Under the p-smoothness assumption, the optimal Q-function is continuous over the state space and hence uniformly bounded; so is the optimal contrast function. The first term is O(N^{−1}T^{−1}), given that T_a is a bounded function class. This yields the desired rate of convergence, completing the proof.
It remains to show that the probability (30) can be upper bounded by O(N −1 T −1 ).

Notice that, by Bonferroni's inequality, (30) is upper bounded by a sum of probabilities over k ≥ k_0.
Consider an integer k ≥ k 0 . Let A k denote the event, We focus on evaluating the following stochastic error term, under the event defined in A k .
Since the number of folds L and the action space is finite, it suffices to bound The above bound can be upper bounded by the sum of the following three terms, where τ i,t,a is a version of τ i,t,a with η i,t,a replaced by η i,t,a .
In addition, for any (i 1 , t 1 ) and (i 2 , t 2 ), we can similarly show that the absolute value of the covariance for any sufficiently small constant ε > 0.
We next consider (36). Note that for any i, t, the set of variables { τ i,t,a − τ i,t,a } i,t are mean-zero and independent conditional on the observed data. We aim to apply the empirical process theory (van der Vaart and Wellner, 1996) to bound (36). Under the event defined in A k , we have τ a ∈ T a,k where As such, (36) can be upper bounded by We first consider bounding the expectation E sup τa∈T a,k Z(τ, a). Toward that end, we apply the maximal inequality developed in Corollary 5.1 of Chernozhukov et al. (2014). Under the given conditions, the set of variables { τ i,t,a } i,t,a , {τ i,t,a } i,t,a are uniformly bounded. Let M * > 0 be the upper bound. As such, the absolute value of each summand in (36)  See Lemmas 4-6 in Farrell et al. (2021). It follows that E sup Under the given conditions, T a,k is a bounded function class. It follows from Theorem 2 of Adamczak et al. (2008) that for any u > 0, we have with probability 1 − exp(−u) that This together with (41) yields with probability at least 1 − exp(−u). Set u = 2 log(kN T ), (36) is upper bounded by with probability at least 1 − k −2 (N T ) −2 .
Next, we consider (37). We note that (37) can be bounded by the sum of the following two terms, where Consider (43) first. Set T * to be an (nT ) −1 -net of T a,k with respect to the distance a)| 2 for any f 1 and f 2 . The number of elements in T * is upper bounded by It follows from Cauchy-Schwarz inequality that (43) is upper bounded by For each τ a ∈ T * , (nT ) −1 i∈I T −1 t=0 {Q i,t,a − E(Q i,t,a |S i,t )}{τ a (S i,t ) − τ * a (S i,t )} corresponds to a sum of martingale difference sequence. It follows from the Bernstein's inequality (see e.g., Theorem A, Fan et al., 2015) that with probability at least 1 − exp(u) that for each τ a ∈ T * , under the event defined in A k . By Bonferroni's inequality, we obtain with probability at least 1 − exp{O(1)(N T ) κ 0 d/p * log 4 (N T ) − u} that the first term in (45)  Combining this together with (45) yields that (43) with probability at least 1 − k −2 (N T ) −2 .
Next let us consider (44). Following the discussion below Lemma 4.1 of Dedecker and Louhichi (2002), we can construct a sequence of random variables {O 0 i,t } i,t such that O 0 i,t has the same distribution function as O i,t and that with probability at least 1 − nT β(q)/q, where η 0 i,t,a corresponds to a version of η i,t,a with {O i,t } 0≤t≤T,i∈I replaced by {O 0 i,t } 0≤t≤T,i ∈I . In addition, the sequences {U 0 i,2t } i,t and {U 0 j,2t+1 } i,t are i.i.d. where U 0 i,t = (O 0 i,t(q+1) , O 0 i,t(q+1)+1 , · · · , O 0 i,t(q+1)+q ). By setting q to be proportional to log(N T ), the probability 1 − nT β(q)/q can be lower bounded by 1−(N T ) −1 . Without loss of generality, suppose (n/L)T = 2mq for some integer m > 0. By definition, I 0 = E(η 0 i,t,a − η 0 i,t,a |S 0 i,t ) can be decomposed into three variables I 0 1 , I 0 2 and I 0 3 , where each of the first two corresponds to a sum of m i.i.d. random variables with zero mean given S i,t and {O j,t } j∈I c ,0≤t<T , and |I 0 3 | is a remainder term upper bounded by O(1)(nT ) −1 log(nT ). In addition, using similar arguments in Step 3 of the proof of Theorem 1, we can show with probability at least 1 − (nT ) −C that the bias max i,t |E(η i,t,a |S i,t ) − E(η 0 i,t,a |S 0 i,t )| can be upper bounded by O(1)(nT ) −1 as well. Consequently, we obtain with probability 1−(N T ) −1 that max i,t |E(η i,t,a |S i,t )−E(η 0 i,t,a |S 0 i,t )| ≤ O(1)(nT ) −1/2 log(nT ). Let A 0 denote this event. On the set A 0 , using the Cauchy-Schwarz inequality, (44) can be upper bounded by O(1) N T ε log(N T ) + ε 2 nT i∈I T −1 t=0 | τ a (S i,t ) − τ * a (S i,t )| 2 , for any sufficiently small ε > 0.
To summarize, on the set A 0 ∩ A * ∩ A k , the stochastic error term (34) can be upper bounded by with probability at least 1 − O(N −2 T −2 k −2 ).
By definition, we have Decomposing τ i,t,a − τ a (S i,t ) into the sum of τ i,t,a −τ * a (S i,t ) and τ * a (S i,t )− τ a (S i,t ), the stochastic error term shall be smaller than the squared bias term Set ε in (48) to 1/8. On the set A 0 ∩ A * ∩ A k , we obtain with probability at least 1 − O{(N T k) −1 } that ≤ +k(N T ) −κ 0 log 1/2 k + (N T ) −1/2 log k .
Using similar arguments in proving (48) with probability at least 1 − O{(N T k) −2 }, on the set A 0 ∩ A * ∩ A k . In particular, note that the first term is at least k 2 (N T ) −2κ 0 /8 under A k . By setting κ 0 = min{κ, p * /(2p * + d)}, for sufficiently large k 0 , the above expression is strictly larger than the right-hand-side of (50). This violates the result in (49). Consequently, on the set A 0 , the event A c k holds with probability at least 1 − O{(N T k) −2 }. By Bonferroni's inequality, on the set A 0 ∩ A * , the event ∩ k≥k 0 A c k holds with probability at least 1 − k≥k 0 O{(N T k) −2 } = 1 − O(N −2 T −2 ).
Since both A 0 and A * hold with probability at least 1 − O(N −1 T −1 ), (30) occurs with probability at most O(N −1 T −1 ). The proof is thus completed.

B Some implementation details
We first present the fitted-Q evaluation algorithm below in Algorithm 1.
Finally, we present the detailed definitions of A_t and R_t in the real data application.

Algorithm 2: Estimation of the density ratio.
Input: the data {(S_{i,t}, A_{i,t}, S_{i,t+1}) : i ∈ I_ℓ^c, 0 ≤ t < T_i}, and an estimated optimal policy π̂^{(ℓ)}.
Initialize: the density ratio ω = ω_β as a neural network parameterized by β.