A study of first-passage time minimization via Q-learning in heated gridworlds

Optimization of first-passage times is required in applications ranging from nanobot navigation to market trading. In such settings, one often encounters unevenly distributed noise levels across the environment. We extensively study how a learning agent fares in 1- and 2-dimensional heated gridworlds with an uneven temperature distribution. The results show certain bias effects in agents trained via simple tabular Q-learning, SARSA, Expected SARSA and Double Q-learning. While a high learning rate prevents exploration of regions with higher temperature, a low enough rate increases the presence of agents in such regions. The discovered peculiarities and biases of temporal-difference-based reinforcement learning methods should be taken into account in real-world physical applications and agent design.


I. INTRODUCTION
Machine learning methods have become established and widely used for solving many hard problems such as image or speech recognition [1-3], games like Atari [4] and Go [5], or vehicle routing problems [6,7]. The latter few applications demonstrate, in particular, the emerging success of reinforcement learning (RL) approaches. Along with other machine learning approaches [8], the RL techniques find ever more applications in physics, especially for optimization of motion in complex physics environments [9-11]. However, tuning of RL agents is a non-trivial task, and unexpected effects, such as biases [12,13], may occur in their deployment.
One area where RL is a major candidate for the development of autonomous navigation is active matter research [10]. Active particles or agents are objects with an ability to control some of their dynamics and, thus, are a natural sandbox for RL algorithms. A lion's share of the relevant work in active matter deals with small scales, where thermal fluctuations along with Brownian motion and turbulence play a crucial role [14]. They have to be taken into account while learning and optimizing control strategies. RL was also applied to the navigation of microswimmers in such highly stochastic environments as complex and turbulent flows in [15-19]. Actor-critic RL significantly outperformed a trivial policy of finding the fastest path from A to B for an agent with a constant slip speed in a 2D turbulent environment [15]. RL agents in a 3D stationary Arnold-Beltrami-Childress helical flow learned to target specific high-vorticity regions [17]. Among tabular RL methods, Q-learning is perhaps one of the most convenient. It was used, e.g., to control self-thermophoretic active particles in a solution with a real-time microscopy system [19]. The Q table corresponded to the discretized position of the microswimmer, thus staging the gridworld geometry in the experiment. Muiños-Landin et al. [19] noted that the noise due to Brownian motion substantially affects both the learning process and the actions within the learned behavior.
Navigation and prediction of motion in highly stochastic or turbulent environments is a necessity not only for nanobots [10,19,20], but even for large macroworld objects such as marine vehicles in ocean currents [21]. The macroscopic movement optimization in turbulent media with RL was performed with gliders in turbulent air flows [22,23]. The results clearly show that the efficiency of control decreases with an increasing speed of the glider, which is equivalent to increased fluctuations. Still, the learned soaring strategy was effective even in the case of strong fluctuations. Other relevant studies include Q-learning for the optimization of collective motion in stochastic environments with small UAVs learning how to flock ("Q-flocking") [24], deep RL for coordinated collective energy-saving swimming [18], and navigation with obstacle avoidance in a system with thermal fluctuations by using deep double reinforcement learning [25]. The huge impact of stochastic dynamics, however, is not an exclusive speciality of physical systems. It underlies the modern economy through stock price fluctuations on financial markets, where RL is expanding its presence as a trading algorithm [26,27].
arXiv:2110.02129v1 [math.OC] 5 Oct 2021

Randomness in RL. Generally, the theory of RL and Markov decision processes, as well as other control strategies, employs noise as a part of the problem setting. Concerning applications, one can identify several directions in the literature on the impact of stochasticity:
• Noisy reward signal. Many problems such as games of chance have this noise as a feature [12,28-30]. Alternatively, it comes from an imperfect observation process [31,32]. In human-guided learning, for instance, it arises from mistakes and incoherent answers of human teachers [33].
• Learning in stochastic dynamics. The transition between states is affected by a random force and the environment dynamics is assumed to be stochastic [41,42].
Depending on the problem, different adjustments to a learning process or algorithm have been suggested. For instance, Double Q-learning [12] is a modification of Q-learning [43] with a double estimator that counters maximization bias and demonstrates superiority in tasks with a noisy reward signal. However, there are no examples with stochastic transition dynamics in the original paper. The impact of a stochastic transition function was discussed in [42], where G-learning was proposed and tested against both Q- and Double Q-learning. Advantage updating, presented in [41], was compared with Q-learning in the linear quadratic regulator problem in the presence of noise in the state transition function.
However, to the best of our knowledge, a careful comparison of different algorithms with respect to learning in the presence of stochastic state dynamics has not yet been carried out. Thus, it is unclear what kind of adjustments could be useful in such settings.
First-passage problem. Over the past couple of decades it became clear that many search and optimization problems in physics, biology and finance can be formulated within a first-passage framework [44,45]. The first-passage time (FPT) τ is the first moment when an agent/process with a coordinate/value X_t reaches a boundary x, i.e. τ = inf{t > 0 : X_t ≥ x}. This boundary could be, for instance, a threshold to sell or buy at a stock market, a location of reward in space, or a site which, once reached, triggers a biological/chemical process in a living cell. Alternatively, one could reformulate this approach in terms of survival times and probabilities. The FPT optimization is often done by analytical or numerical minimization within a model's assumptions, as in many of the ecological [46] and biological [47] problems or in risk assessment [48]. First-passage times have been found to be connected to the relative advantage of states in Markov decision processes (MDP) [49] and have proven to be useful for characterization of reachability of states [50]. The passage time itself could be used as a reward function for an algorithm to minimize. It is interesting that in more traditional thermal bath settings the minimization of FPT could produce non-trivial results, such as the complex shapes of potentials needed for minimization observed in theory and experiment [51,52].
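As a toy illustration of this definition (our own minimal example, not from the paper), the MFPT of a biased discrete random walk can be estimated by direct simulation:

```python
import random

def first_passage_time(x0, threshold, p_up=0.5, max_t=10_000):
    """Sample tau = inf{t > 0 : X_t >= threshold} for a +/-1 random walk."""
    x, t = x0, 0
    while t < max_t:
        t += 1
        x += 1 if random.random() < p_up else -1
        if x >= threshold:
            return t
    return None  # censored: the walk did not cross within max_t steps

random.seed(0)
samples = [first_passage_time(0, 5, p_up=0.6) for _ in range(2000)]
hits = [s for s in samples if s is not None]
mfpt = sum(hits) / len(hits)  # Monte Carlo estimate of the MFPT
```

With a positive drift (p_up > 1/2) the estimate concentrates near threshold/(2·p_up − 1); censored runs should be treated separately when the FPT distribution is heavy-tailed.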
In this paper we find that uneven noise distributions can trigger biases of RL algorithms and as such have to be paid attention to. We introduce a new type of gridworld model with state-dependent noise affecting actions, which we call heated gridworlds, and perform an extensive study of agent learning in them. We find that the state-dependency of the noise triggers convergence of agents to suboptimal solutions, around which the respective policies stay for practically long learning times. This happens with such common RL algorithms as tabular Q-learning, SARSA, Expected SARSA and Double Q-learning. The observed phenomena should be taken into account in the design and deployment of agents in physical applications that follow the formalism of a heated gridworld.
Notation. Capital letters will denote random variables, unless specified otherwise. Lowercase letters will denote their definite values. For instance, if R_t is the reward at time t as a random variable, then r_t is a value that it assumed.

II. BACKGROUND
The general scheme of RL consists of an agent and an environment. The agent interacts with the environment in a cycle by performing actions and receiving rewards [13,53]. The RL problem is usually described within the framework of Markov Decision Processes (MDP). At a time frame t, the agent perceives the environment through a random state S_t ∈ S, then selects an action A_t ∈ A distributed with a probability distribution π, called the policy, and gets some R ⊆ R-valued random reward R_t. The sets S, A, R are assumed to be finite. The environment transitions into the next state, which adopts a value s_{t+1} with probability P_S(s_{t+1}|s_t, a_t) = P[S_{t+1} = s_{t+1} | S_t = s_t, A_t = a_t], where s_t, a_t are the current state and action values, respectively; a_t is sampled from the probability distribution π. The policy is assumed to be Markovian, π(a_t) = π(a_t|s_t). The total expected discounted reward under a policy π starting at a state s is denoted as

V^π(s) = E_π[ Σ_{t=0}^∞ γ^t R_t | S_0 = s ],

where γ ∈ (0, 1) is a discounting factor. Another handy formalism is that of action-value functions. If an agent takes an action a in a state s and follows π thereafter, one can define

Q^π(s, a) = E_π[ Σ_{t=0}^∞ γ^t R_t | S_0 = s, A_0 = a ].

The agent's goal is to find the optimal policy π* that maximizes the discounted sum of received rewards, Q*(s, a) = max_π Q^π(s, a).

a. First-passage time minimization and MDP. If the task is to find the fastest way to a target, one can tie the reward signal to the time needed to reach the target. The value function can then be made proportional to the mean first-passage time (MFPT). Hence, the optimization of the policy is equivalent to minimization of the MFPT.
Technically speaking, one can write the following dynamical system:

S̃_{t+1} = S̃_t + π(S̃_t) + W_t,
S_{t+1} = S̃_{t+1} if S̃_{t+1} ∈ S†, and S_{t+1} = s* otherwise,

where S† represents the gridworld and s* is an artificial formal state value which indicates the "end of the game" (the episode is considered finished once the environment state enters this value). W_t represents the temperature effect as a random disturbance with a discrete probability distribution W(θ) having a finite support as a bounded subset of Z^n and with parameters θ which are random. The random state S̃_t and the policy π take values from a bounded subset of Z^n; π is a Markov policy. That is, at each time frame t, the agent takes some finite number of steps in each direction on the n-dimensional grid. Then its actions are perturbed by temperature effects W_t that shift the agent randomly by a finite number of steps. In general, the respective probability distribution depends on the current state. When the agent crosses the boundary and thus leaves the gridworld S†, we formally "fix" the state at the abstract value s*.
An equivalent description of the above gridworld reads:

S_{t+1} = S_t + π(S_t) + W_t if S_t ∈ S†, and S_{t+1} = s* otherwise.

Formally, S and S̃ are states of two independent dynamical systems that can each be considered on their own. The dynamical system corresponding to S can be understood as "absorbing", whereas the one with S̃ as "free". Essentially, the trajectories of the two systems coincide up until the first passage beyond S†.
We can formulate the reward function for the considered first-passage problem as follows:

R_t = −I[S_{t+1} ∈ S†] + r_target · I[S_t ∈ S†, S_{t+1} ∉ S†],

where I is the indicator function and r_target ∈ R is any number used purely for scoring: while the agent is still inside S†, it gets minus one point; when it crosses the boundary, it receives r_target points, which may conveniently be set to a number large compared to the grid size. The goal is to score as many points as possible, i.e. to cross the boundary as fast as possible. Let R_tot be the random variable of total reward and let J be the objective function, defined as the expected total reward,

J = E[R_tot].

The next subsection describes the particular algorithms tested for training boundary-crossing agents in this work, whereas a more detailed theoretical analysis of the gridworld is provided in Appendix B, where it is shown that the respective objective is well-defined for any admissible policy.

b. Temporal-difference algorithms. When the transition probability function P_S is known and can be analytically expressed, a solution to the RL problem can be obtained from the Bellman equation [54] or the Hamilton-Jacobi-Bellman equation. Often this is not the case, and agents have to learn through interaction, building an estimate of the value function on-line. Temporal-difference TD(0) algorithms (here we use Q-learning, SARSA, Expected SARSA and Double Q-learning) employ this idea, renewing the estimate Q(s, a) after every time frame. For all listed algorithms, Q(s, a) is stored in a Q table. The update rules of the Q table for all used methods are given in Algorithms 1, 2, 3 and 4 (see Appendix A).
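In code, this scoring scheme can be sketched as follows (a minimal sketch; the helper name and signature are ours, not the authors'):

```python
def first_passage_reward(inside_before, inside_after, r_target=100.0, r_tick=-1.0):
    """First-passage scoring: r_tick per step inside the grid, r_target on crossing.

    `inside_before`/`inside_after` flag whether the state is in the gridworld
    before and after the transition; the formal terminal state s* yields 0.
    """
    if inside_before and not inside_after:
        return r_target   # the boundary is crossed: the episode effectively ends
    if inside_after:
        return r_tick     # still searching inside the grid
    return 0.0            # already in the formal state s*
```

Maximizing the resulting total reward is then equivalent to minimizing the number of ticks, i.e. the first-passage time.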
c. Choice of learning rate. The update rule of Q-learning is governed by a learning rate α and a discount rate γ as follows:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_t + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ],

where r_t + γ max_a Q(s_{t+1}, a) is the update target; in other words, let us define

U_t = r_t + γ max_a Q(s_{t+1}, a).

There is one more tuning parameter, the exploration-exploitation parameter ε, which is not used in the update rule directly but affects the exploratory behaviour and is required for a proper convergence of the algorithm (see the algorithm and the ε-greedy policy description in Appendix A).
As was shown in [55], Q-learning converges to Q*(s, a) and the optimal policy if each state-action pair is visited infinitely many times (ε-greedy policy) and the learning rate satisfies the conditions

Σ_n α_n = ∞,  Σ_n α_n² < ∞.

However, in practice both constant [6,12,22,41] and scheduled [25] learning rates are used as well. The value Q(s, a) for some fixed (s, a) pair is renewed in a cycle [13]:

Q_{n+1}(s, a) = (1 − α) Q_n(s, a) + α U_n.

With a constant rate, the weight of a single update in the total sum decreases exponentially with the number of updates n. The higher α is, the sooner an agent overwrites its previous experience. We expect that with stochastic dynamics there could be a restriction on α under which convergence to certain policies is possible.
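The exponential forgetting under a constant α can be verified numerically. In the sketch below (function names are ours), repeated application of Q ← Q + α(U − Q) reproduces the weight α(1 − α)^(n−k) of the k-th update target:

```python
def q_after_updates(q0, targets, alpha):
    """Apply the TD update Q <- Q + alpha * (u - Q) for each target u in turn."""
    q = q0
    for u in targets:
        q += alpha * (u - q)
    return q

def weight_of_update(k, n, alpha):
    """Weight of the k-th target (1-indexed) after n constant-rate updates."""
    return alpha * (1.0 - alpha) ** (n - k)

# After n updates, Q_n = (1-alpha)^n * Q_0 + sum_k weight_of_update(k, n, alpha) * u_k,
# so early targets are forgotten exponentially fast.
q = q_after_updates(0.0, [1.0] * 10, alpha=0.5)
total_weight = sum(weight_of_update(k, 10, 0.5) for k in range(1, 11))
```

Since Q_0 = 0 and all targets equal 1 here, the final value coincides with the total weight of the targets.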

III. HEATED GRIDWORLDS

[Figure 1: a sketch of a 2D heated gridworld with START and END tiles and an obstacle. Legend: the heat level of a state adds +0, +1, +2 or +3 random step(s) to the agent's action.]

We introduce a new type of gridworld, which we call heated gridworlds, in order to test and compare the operation of algorithms in the case of uneven, possibly state-dependent, noise. The noise could be caused, for instance, by thermal fluctuations. As an example, we sketch a 2-dimensional heated gridworld in Fig. 1. It is based on the common gridworld setup with 4 actions (left, right, up, down). Every played time frame t is penalized with r_tick, and when the goal position is reached the reward is r_target. Thus, r_target is used rather for convenience, as discussed in Section II. Attempts to cross the boundaries or obstacles lead to void moves (reflective boundary). In our particular setting, an agent starts in the bottom left corner and aims to learn the fastest way to reach the upper right corner. For the states s shown as blue squares in Fig. 1a, the action proceeds according to the selected policy π. In the heated states (the magenta, orange and yellow squares in Fig. 1a) the temperature (noise) affects the outcomes: random offsets described by W_t are added to the action selected from the policy π for the state s. The actual effect of W_t depends on the dimension of the gridworld and on a parameter T, called the temperature of the state.
The motion in the studied heated gridworld follows a procedure that reads, for every time frame t:
1. Compute an action a_t using π(s_t); set actions = [a_t].
2. For the position s_t ∈ Z^2, take the corresponding temperature T_t, sample T_t values w_ti and append them to the action list to yield the effective action list [a_t, w_t1, ..., w_tT_t].
3. Update the state by sequentially applying the moves from the effective action list. The ordering of actions in this list matters due to possible interactions with obstacles and boundaries.
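The three steps above can be sketched in Python as follows (the grid size, unit-step heat kicks and reflective-border handling are our illustrative assumptions, not the authors' code):

```python
import random

ACTIONS = {"left": (-1, 0), "right": (1, 0), "up": (0, 1), "down": (0, -1)}

def heated_step(state, action, temperature, size=10, blocked=frozenset()):
    """One time frame: the policy move plus `temperature` random unit kicks."""
    moves = [ACTIONS[action]]
    moves += [random.choice(list(ACTIONS.values())) for _ in range(temperature)]
    x, y = state
    for dx, dy in moves:  # ordering matters near obstacles and borders
        nx, ny = x + dx, y + dy
        if 0 <= nx < size and 0 <= ny < size and (nx, ny) not in blocked:
            x, y = nx, ny
        # otherwise the move is void (reflective boundary / obstacle)
    return (x, y)
```

In an unheated state (temperature 0) the dynamics is deterministic; each unit of temperature appends one random offset, mimicking the legend of Fig. 1.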
An example of final position scattering is given in Fig. 2. For all numerical experiments shown below, the tuning parameters were set as ε = 0.1, γ = 0.9.

IV. 2D SIMULATIONS
In this section, we consider an agent on the 10 × 10 grid shown in Fig. 1(b), with r_tick = −1 and r_target = 100. The heated region is placed in the lower right quarter of the grid. The temperature T of this region is constant throughout a learning episode and is varied between the episodes from T = 0 (symmetric setup) to T = 3. The symmetric L-shaped obstacle leaves only two possible ways to reach the end tile from the start: either through the deterministic part of the medium or through the heated region. The learning rate α is varied in the range [0.07, 0.09, 0.1, 0.2, 0.3, 0.4, ..., 0.9]. The quantity we aim to optimize is the mean first-passage time.
We mark agents as failed if their time scores are higher than a 500-timestep cutoff, which is the case for trajectories closed in a loop for a long time. Regardless of T, this setup always has one path option 18 steps long with MFPT = 18 and zero standard deviation. In the absence of temperature fluctuations (T = 0), Q-learning converges to this optimum readily in 20K time steps for α > 0.09. Once we introduce the noise, T > 0, the number of failed agents changes (see Fig. 3). In the following we do not count the failed agents and only consider what happens with the successful ones.
The successful agents basically choose between two routes, the deterministic one and the one through the heated region, depending on their learning rate (Fig. 4). For α ∼ 0.09 one observes nearly 100% convergence to the heated route, while for α > 0.4 the majority selects the deterministic path after 300K played iterations. For higher α values, the transit to the deterministic route occurs in a shorter time. Scheduling of α that decreases its value with learning time thus leads to a population of agents staying in the heated area. Importantly, these changes occur similarly for all tested algorithms, as the path density plots in Fig. 6 show. Only SARSA stands out, being notably unstable. Remarkably, our findings do not show any advantage of the double estimator in this task (see the comparison in Appendix C). In Fig. 5 one can clearly see that the presence of thermal noise increases the mean first-passage time for all α, even for α values at which agents seem to operate well in heated areas. Higher learning rates produce a worse score with a bigger deviation after short learning (Fig. 7, top row). However, after learning for longer, they achieve a much better performance (Fig. 7, bottom row) due to switching to another route.
In our simulations, the observed effect does not depend on the heated region location (Fig. 5), the presence of obstacles, or the value of the discounting parameter γ. When several heated regions with different temperatures T are placed in the gridworld, agents are divided between them proportionally to the region temperature (Appendix C). The particular geometry with the L-shaped wall allows us to demonstrate the effect quantitatively. Our explanation of the effect is that the environmental noise boosts the exploratory behavior of an agent in some parts of the state space; therefore, the policy tends to converge to regions with high temperature.
We found that the transition to the deterministic route when T = 3 and α = 0.1 happens in 5M frames, or 250K played episodes. Setting ε to 1.0 during the whole course of learning forms a similar policy in 50K iterations, which is equivalent to 100 played episodes (see Appendix C).

V. 1D SIMULATIONS

A. Interval with absorbing boundaries
The 2D case shown in the previous section provided a fairly illustrative picture. However, the 1D case is easier to analyze and comprehend. We mainly studied a 1D gridworld where the agent starts in the center x_0 of an interval consisting of 41 states. Only two actions are available to the agent, namely, moves to the left and to the right. The reward comprises −1 for each time frame until the boundary is crossed, and r_target is set to zero. The whole right-hand side of the gridworld, x > x_0, has the temperature T. The 1D case has no interactions with reflecting borders and, despite its simplicity, still produces a bias, as will be seen below.
We introduce notations for the policies shown in Fig. 8. By π_L (π_R) we denote the policy that admits a left (right) step starting from the middle. Analogously, to indicate a policy bias of two steps, we use a notation like π_RR, which means that from the two states to the left of the middle the most common learned policy is to step right. Further policies are denoted following this pattern. In the simulations of this section, the learning rate α = 0.1 was fixed. Overall, the impact of α in the 1D case is exactly the same as described in the previous section: high values α ∼ 0.5 prevent algorithms from operating in stochastic media. It should be noted that the notions of "high" and "low" α are relative and depend on the average fluctuation scale in an environment. For instance, an interval with 5 states and T = 1 has move-length fluctuations which are 30% of the optimal path length. The low rate is then 0.01 and the high one is 0.1.
Policies of interest were tested by 10^6 Monte Carlo (MC) runs. The best policy found in this way is denoted π*, whereas π_Q represents the most common policy of 10^4 Q-learning agents. It was calculated by taking the most common action among the population of learning agents for every cell of the interval.
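A Monte Carlo evaluation of such fixed policies can be sketched as follows (a simplified model in which each heated step adds `temp` symmetric unit kicks; the names and details are our assumptions, not the paper's code):

```python
import random

def mfpt_1d(policy, temp=1, half=20, runs=2000, max_t=10_000):
    """Monte Carlo MFPT on the interval [-half, half] with a heated right side.

    `policy(x)` returns -1 (left) or +1 (right); in heated states x > 0 each of
    `temp` thermal kicks shifts the agent by +/-1 with equal probability.
    """
    total, finished = 0, 0
    for _ in range(runs):
        x, t = 0, 0
        while abs(x) <= half and t < max_t:
            t += 1
            x += policy(x)
            if x > 0:  # the right-hand side of the interval is heated
                for _ in range(temp):
                    x += random.choice((-1, 1))
        if abs(x) > half:
            total += t
            finished += 1
    return total / finished

pi_L = lambda x: -1  # always step left: the deterministic side
pi_R = lambda x: 1   # always step right: through the heated region
```

Comparing mfpt_1d(pi_L, temp=T) with mfpt_1d(pi_R, temp=T) shows how the thermal kicks change the relative quality of the two directions.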
The difference between the best found policy π* and the real agents' performance was observed at ∼1.5-2% on average for all considered temperature levels, as Figure 9 shows. In contrast with the 2D problem, there is little improvement in agents' scores as learning time passes. Table I demonstrates that when the action "right" is optimal in position x_0, agents perform it in x_0 and x_0 − 1, i.e. instead of π_R they follow π_RR, and instead of π_RR they follow π_RRR, with the exception of weak noises (T = 1, 2). In order to strengthen the statement that the considered algorithms are prone to biases, we construct the following example. We add a drift pushing the agent away from the end of the interval in the heated part (see Fig. 10). Its purpose is to gradually make the π_R policy less profitable. The drift in our numerical experiment occurs only in the last right quarter of the interval (i.e. 25 per cent of the length). Its action on the agent is defined by a probability to make an extra move left.

[Figure 11 caption: Top: ε = 0.1, the biased policy is hard to overcome for both Q-learning and Double Q-learning. Bottom: exponential decay of ε enables Q-learning to achieve the optimal strategy, but has a moderate effect with Double Q-learning.]
In the absence of thermal fluctuations (T = 0), agents easily detect the drift for a shift probability of ∼0.1 and select the π_L policy. Turning on even a small T = 1 effectively hides this non-optimality from the majority of the population.
An increase in the fluctuation scale makes it possible to hide a more intensive drift, as Table II shows. At T = 1 only one agent out of ten is insensitive to a drift probability of 0.1, which yields an approximately 9% worse mean score. A similar 9% decrease at T = 3 is given by a drift value of 0.35, but this time it stays unnoticed by nearly 40% of the agents.
The numbers above are obtained after 50K time frames of Q-learning. The top plot in Fig. 11 shows how the percentage of non-optimal right actions in the center x_0 changes for Q-learning and Double Q-learning through the learning process. As one can see, there is no steady progress towards the optimal policy for either algorithm even after playing 300K time frames. The first-passage properties of the medium affect the policy selection: poorly trained agents are able to reach the target faster in the first stages of learning due to noise kicks, and then they tend to stick to these policies.
As in the 2D simulations, proper scheduling of ε for Q-learning (exponential decay starting from ε_0 = 1.0 in this case) significantly improves convergence to the optimal MFPT value (Fig. 11, bottom plot). Surprisingly, the scheduling has a very moderate effect when applied with Double Q-learning, which is conventionally seen as a superior alternative to plain Q-learning for stochastic environments.

VI. CONCLUSIONS
The application of machine learning algorithms to real physical systems has to be tested and vetted by using toy models in order to understand and anticipate possible biases. In this paper we find that four well-established tabular reinforcement learning algorithms show bias in terms of producing suboptimal solutions for the problem of fastest boundary crossing in gridworlds with state-dependent noise. We call this type of gridworld a heated gridworld. The state-dependent noise affecting the operation of the algorithms can occur for different physical reasons, from uneven temperature distributions or concentration variations in the case of atmospheric pollution to long-lived current patterns in the water or the atmosphere. For 1D and 2D heated gridworlds we see a pronounced bias for Q-learning, SARSA, Expected SARSA and Double Q-learning: a high learning rate prevents the algorithm from operating in stochastic media, while for small enough α agents tend to go through noisy regions, even when these policies are suboptimal.
We clearly see that the methods developed in recent years to tackle the case of noisy rewards (e.g. Double Q-learning) do not necessarily offer the same benefits for learning in the case of stochastic transition dynamics. Our work is a sandbox example which could be useful for those who look into new applications of temporal-difference algorithms, primarily in physics, navigation and trading, or who generally study the capabilities of these optimisation tools. We expect that similar effects could be found in other environments with unevenly distributed noise.

Appendix A: Algorithms

Algorithm 4: Double Q-learning [12]
  Initialize Q^A(s, a), Q^B(s, a), s
  repeat
    Choose a based on Q^A(s, ·) and Q^B(s, ·), observe r and s'
    Choose randomly either UPDATE(A) or UPDATE(B)
    if UPDATE(A) then
      Define a* = argmax_a Q^A(s', a)
      Q^A(s, a) ← Q^A(s, a) + α (r + γ Q^B(s', a*) − Q^A(s, a))
    else
      Define b* = argmax_a Q^B(s', a)
      Q^B(s, a) ← Q^B(s, a) + α (r + γ Q^A(s', b*) − Q^B(s, a))
    s ← s'
  until the end of learning
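For reference, a dictionary-backed Python rendering of the Double Q-learning update (a sketch under our own naming, not the authors' implementation):

```python
import random
from collections import defaultdict

def double_q_update(QA, QB, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One Double Q-learning step: selection and evaluation use different tables."""
    if random.random() < 0.5:  # UPDATE(A)
        a_star = max(actions, key=lambda u: QA[(s_next, u)])
        QA[(s, a)] += alpha * (r + gamma * QB[(s_next, a_star)] - QA[(s, a)])
    else:                      # UPDATE(B)
        b_star = max(actions, key=lambda u: QB[(s_next, u)])
        QB[(s, a)] += alpha * (r + gamma * QA[(s_next, b_star)] - QB[(s, a)])

QA, QB = defaultdict(float), defaultdict(float)  # Q tables default to zero
```

Decoupling the argmax from the value estimate is what removes the maximization bias for noisy rewards [12]; as the conclusions above note, this does not automatically help with noisy transitions.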
Appendix B: Theoretical analysis
In this appendix we provide some theoretical analysis of the case considered in Section V. We show bounds on survival probabilities and finally show the well-posedness of the studied gridworld boundary-crossing problem. The following notation will be used: 0,N means the set {0, 1, ..., N}. We fix an ambient probability space (Ω, Σ, P) and consider (cf. (3)) the 1D dynamics

S̃_{t+1} = S̃_t + π(S̃_t) + W_t,  W_t ∼ B(w, 1/2), (B1)

where B(a, b) denotes the binomial distribution with parameters a, b, and the policy comprises the choices to go either left or right (we allow a zero "noise-induced" move here only for convenience). Let S̄ denote [−ζ, ζ] ∩ Z and let S denote S̄ ∪ {s*}. Note that the state-space equation for S can be rewritten as (cf. (4))

S_{t+1} = S_t + π(S_t) + W_t if S_t ∈ S̄, and S_{t+1} = s* otherwise. (B2)

Therefore, S and S̃ represent mutually "independent" dynamical systems that can be considered on their own. From this point onward, the dynamical systems corresponding to S and S̃ will be referred to as "the absorbing system" and "the free system" accordingly. There is, however, a certain connection between the trajectories of the two systems. Since both systems by definition have identical noise and control, the two of them also share the same probability space. Another notable connection is that the trajectories of the two systems coincide up until the first passage beyond ζ or −ζ. Both systems are evidently Markov decision processes with stationary control, and since the absorbing system has a finite number of states, one can construct its probability transition matrix under the assumption that π is fixed. Obviously, the absorbing system only has 2ζ + 2 states, 2ζ + 1 of which are −ζ, 1 − ζ, ..., ζ and the remaining one is s*. To construct the probability transition matrix G we first need to order the states. Let the order be (−ζ, ..., ζ, *), and thus let

G^t = (g^t_{−ζ}, ..., g^t_ζ, g^t_*),

where g^t_s is the column of G^t corresponding to the initial state s. Now, let Ḡ^t = (ḡ^t_{−ζ}, ..., ḡ^t_ζ) be the top left square submatrix of G^t of size 2ζ + 1.
We have:

Lemma B.1. For all t ∈ N, the top left square submatrix of G^t of size 2ζ + 1 equals Ḡ^t, and the last column of G^t is g^t_* = (0, 0, ..., 0, 1)^T.

Proof. Since it is impossible to escape the absorbing state s*, the identity ∀t ∈ N : g^t_* = (0 0 ... 0 1)^T holds. The lemma can then be proven by simple mathematical induction on t (induction base t = 1, induction step t → t + 1).

Let P^0_s be the initial probability distribution vector under the assumption that S_0 = s. Structurally, the latter is a binary vector with a single entry equal to 1 at the position that corresponds to the chosen initial state. Therefore, G^t P^0_s = g^t_s. Evidently, g^t_s is the distribution vector at step t under the assumption that s was the initial state.
Similarly, let us define P̄^0_s as P^0_s without the last element. The analogy is clear: Ḡ^t P̄^0_s = ḡ^t_s. Let H_t ∈ Σ be the event that at step t the absorbing system is still not in the state s*. Obviously, P[H_t] is the survival probability.
Lemma B.2. If S_0 = s, then P[H_t] = ‖ḡ^t_s‖₁.

Proof. ḡ^t_s is a subvector of g^t_s. The only element of g^t_s that ḡ^t_s does not include is the one that corresponds to the absorbing state s*. Therefore, ḡ^t_s comprises the probabilities of being in the states −ζ, ..., ζ at step t accordingly. Since ‖ḡ^t_s‖₁ is the sum of the absolute values of the elements of ḡ^t_s, evidently ‖ḡ^t_s‖₁ is the probability of being in any state other than s* by step t or, equivalently, the survival probability.

We have the following upper bound on the survival probability (Lemma B.3). Without loss of generality, let us assume that s_0 ≥ 0.
Proof. Denote by K the event that the noise W_i takes its maximal value w at each of the first τ(s) steps, where τ(s) := ⌈(ζ − s)/(w − 1)⌉ + 1 for s ∈ [0, ζ]. The probability of each event within this intersection is no less than C(w, w)(1/2)^w (1/2)^0 = (1/2)^w, and since the events are independent, it holds that P[K] ≥ (1/2)^{w τ(s)}. We denote the event that the state of the environment is outside S̄ at t as E_t := {ω ∈ Ω : |S̃_t[ω]| > ζ}. According to (B1), this event obviously implies that the agent has not survived by t; therefore, E_t ⊆ Ω \ H_t. For each elementary outcome from K, the following trivially follows from the definitions of K and (B1): whatever the policy does, the free state gains at least w − 1 per step and hence exceeds ζ by step τ(s). In other words, K ⊆ E_{τ(s)}, so the probability of not surviving τ(s) steps is at least (1/2)^{w τ(s)} or, equivalently, P[H_{τ(s)}] ≤ 1 − (1/2)^{w τ(s)}. Let τ (with no argument) denote τ(0). It can be easily observed from the definition of τ(s) that τ ≥ τ(s) for all s ∈ [0, ζ]. This has two important implications. Firstly, since H̄_t obviously implies H̄_{t+1} (once perished, the agent stays perished), the bound P[H_τ] ≤ 1 − (1/2)^{wτ} holds for any non-negative initial state. Secondly, iterating the argument over consecutive blocks of τ steps and using the Markov property yields P[H_{kτ}] ≤ (1 − (1/2)^{wτ})^k for all k ∈ N. Combining these observations, the stated bound holds under the assumption that the initial state s_0 is non-negative. The same can be proven for non-positive values of s_0 in a similar fashion.
We can furthermore bound the survival probability as follows:

Lemma B.4. There are C > 0 and σ ∈ (0, 1) such that the survival probability satisfies P[H_t] ≤ Cσ^t.

Proof. By Lemma B.2 and Lemma B.3 we have P[H_{kτ}] ≤ (1 − (1/2)^{wτ})^k. Observe that any t can be written as t = kτ + m with 0 ≤ m < τ, and since H_t implies H_{kτ}, it follows that P[H_t] ≤ (1 − (1/2)^{wτ})^{⌊t/τ⌋} ≤ Cσ^t with σ := (1 − (1/2)^{wτ})^{1/τ} and a suitable C > 0. Earlier, we already established that Ḡ^t P̄^0_s = ḡ^t_s; therefore, by Lemma B.2, the bound holds for every initial state, and quantifying over s_0 can be dropped.
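The identification of the survival probability with the ℓ1 norm of the substochastic matrix power applied to the initial distribution (Lemma B.2) is easy to check numerically; below is a sketch for a simple absorbing ±1 random walk (our own special case and naming):

```python
def survival_probabilities(n_states, p_right, t_max, start):
    """P[H_t] for a +/-1 walk on states 0..n_states-1, absorbed on leaving."""
    # G_bar: substochastic transition matrix restricted to the surviving states,
    # column-stochastic convention: g_{t+1} = G_bar @ g_t
    G = [[0.0] * n_states for _ in range(n_states)]
    for j in range(n_states):
        if j + 1 < n_states:
            G[j + 1][j] = p_right
        if j - 1 >= 0:
            G[j - 1][j] = 1.0 - p_right
    g = [0.0] * n_states
    g[start] = 1.0  # the initial distribution, a single unit entry
    out = []
    for _ in range(t_max):
        g = [sum(G[i][j] * g[j] for j in range(n_states)) for i in range(n_states)]
        out.append(sum(g))  # ||g_t||_1 = survival probability P[H_t]
    return out

probs = survival_probabilities(5, 0.5, 30, start=2)
```

The sequence is non-increasing and, in line with Lemma B.4, decays geometrically once the walk has had time to reach the boundary.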
Now, let us formulate the reward function for the considered boundary-crossing problem as follows (cf. Section II):

r(s, a, w) = −I[s ∈ S̄, |s + a + w| ≤ ζ] + r_target · I[s ∈ S̄, |s + a + w| > ζ],

where I is the indicator function (the formal state s* yields zero reward). Let R_tot be the random variable of total reward and let J be the expected total reward:

J[π(·)] = E[R_tot].

The following theorem states that the objective is well-posed.

Theorem B.5. The objective J[π(·)] exists and is finite for any admissible π(·).
Proof. Let D_t denote the event that the agent perished exactly at step t, and let D_∞ denote the event that the agent never perishes, D_∞ := ∩_{t∈N} H_t. Evidently, D_t implies that the total reward is equal to exactly r_target − t or, equivalently,

∀t ∈ N ∀ω ∈ D_t : Σ_{t'=0}^∞ r(S_{t'}(ω), π(S_{t'}(ω)), W_{t'}(ω)) = r_target − t. (B14)
Since ∀t ∈ N : D_∞ ⊂ H_t and P[H_t] ≤ Cσ^t, it is true that P[D_∞] = 0. In other words, D_∞ is an impossible event.
The events D_t, t ∈ N, thus partition Ω up to the null event D_∞, so that, by (B14),

J[π(·)] = Σ_{t=0}^∞ (r_target − t) P[D_t].

The right-hand side of this relation converges absolutely, since P[D_t] ≤ P[H_{t−1}] ≤ Cσ^{t−1} and Σ_t (|r_target| + t) σ^t < ∞. Finally, J[π(·)] is therefore well-defined and finite, which proves the claim.